Thursday, August 17, 2017

Scraping Data behind a CAPTCHA - Advanced Web Scraping

Is it possible to scrape data that is protected behind a CAPTCHA with Octoparse? This is one of the questions our users ask most frequently. Yes, Octoparse is able to scrape data behind a CAPTCHA, as long as you run your scraping tasks on your local machine. Octoparse Cloud Service does not yet provide a CAPTCHA-solving service, but our developers are working very hard on it. So we'll see. ;)
In this tutorial, I will introduce solutions to the two most common situations in which a CAPTCHA appears:
· scraping data behind a login
· accessing a website too frequently in a short time

Here is the general rule for resolving a CAPTCHA in Octoparse:
· allow the webpage to load the image (see the screenshot below)
· set a waiting time before execution
· enter the text CAPTCHA or drag the slider
 




1) Scrape data behind a login
Some websites require login credentials before you can browse them. A CAPTCHA often has to be passed while the provided credentials are being verified.
I'll take http://agent.octoparse.com/login as an example to show you how to scrape a website with a CAPTCHA at login.
Step 1.
· Go to the webpage.
· Click the User Name box and choose “Enter Text value”.
· Enter the credentials.
· Click “Save”.
 

Enter the password the same way.

Step 2. Manually enter the CAPTCHA in the built-in browser.
Since the CAPTCHA changes every time the webpage reloads, there is no need to add a separate step for entering the CAPTCHA to the workflow at this point. Just enter the CAPTCHA manually in the built-in browser.
(Note: slider CAPTCHAs are handled the same way, by dragging the slider manually.)

Step 3. Set a proper waiting timeout before execution.
Click the "Sign in" button and choose "Click an item" to log in to the website.
 

To make sure we have enough time to enter the CAPTCHA manually when local extraction starts, we need to set a longer timeout before Octoparse executes "Click item".
Click "Click item" in the workflow ➜ Check "Waiting before execution" under the advanced options ➜ Set a proper execution timeout ➜ Click "Save".
 

Step 4. Extract the data you want.
I won't go through the details here, as they are covered in all of our other tutorials.
  
Step 5. Manually enter the CAPTCHA when starting data extraction on your local machine.
As you can see, a built-in browser opens during extraction. Wait until the CAPTCHA appears, then enter the CAPTCHA.
 
Once you have solved the CAPTCHA, Octoparse will do the rest and get the data for you.
Note:
We strongly recommend loading and storing the cookie by checking the option Use specified Cookie under Cache Settings. By doing this, you won't need to enter the CAPTCHA most of the time when scraping behind a login.
(Follow the tutorial here to learn how to store cookies in Octoparse.)
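For the curious, here is a minimal sketch of the same idea in plain Python with the requests library (my own illustration of the concept, not how Octoparse works internally): log in once, save the session cookies, and reuse them on later runs so the login CAPTCHA doesn't have to be solved every time.

```python
# Illustrative sketch only; Octoparse does this for you via Cache Settings.
import pickle
import requests

session = requests.Session()
# ... perform the login once here, solving the CAPTCHA manually ...

# Store the cookie jar after a successful login.
with open("cookies.pkl", "wb") as f:
    pickle.dump(session.cookies, f)

# On later runs, load the saved cookies instead of logging in again.
session = requests.Session()
with open("cookies.pkl", "rb") as f:
    session.cookies.update(pickle.load(f))

page = session.get("http://agent.octoparse.com/")  # already authenticated
print(page.status_code)
```

The same trade-off applies as in Octoparse: cookies expire eventually, so you will occasionally have to log in (and solve the CAPTCHA) again.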




2) Access a website too frequently in a short time

Passing a CAPTCHA when simply accessing a website works the same way. Octoparse mimics human behaviour with its point-and-click interface, so you just enter the CAPTCHA manually, exactly as you would in a normal browser, and set a proper waiting timeout on the following step to leave yourself time to do so. That is all you need to do when building the crawler in the workflow.
Then, during data extraction on your local machine, wait until the CAPTCHA appears and solve it manually before the operation times out.

Now you know how to scrape data behind a CAPTCHA in Octoparse. I know it's probably less than ideal, but it is a workable short-term way of handling CAPTCHAs. Hope you enjoy data hunting with Octoparse!

Check out the tutorials below to learn more about how to crawl data from a website:

The Octoparse Team

Monday, August 14, 2017

Scrape Amazon Product Data with ASIN/UPC

If you sell online, wouldn't you want to know how your products are priced compared with those on popular ecommerce sites such as Amazon or eBay? If your products are not selling, is it a pricing problem, a product description problem, a problem with the product pictures, or are the products simply not good enough? This is exactly why people are turning to Amazon for product mining.
In this article I will show you how easily you can retrieve the product data you need from Amazon using the web scraping tool Octoparse. Let's get straight to the point.

Step 1: Get prepared
Gather the list of ASINs/UPCs for the products you need (to search with on Amazon).
ASIN, short for Amazon Standard Identification Number, is a unique block of 10 letters and/or numbers that identifies an item on Amazon. Each item listed on Amazon has a unique ASIN. If you happen to know the ASINs for the products you need to search for, great! If not, try UPCs.
UPC, the Universal Product Code, is a 12-digit bar code used extensively on retail packaging in the United States. Find out what the UPCs are for your products and make a list of them.
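If you want to sanity-check your list before scraping, here is a small, purely optional Python sketch (a helper of my own, not part of Octoparse): ASINs are 10 alphanumeric characters, and UPC-A codes are 12 digits whose final digit is a checksum.

```python
import re

def is_asin(code: str) -> bool:
    # ASINs are 10 uppercase letters and/or digits.
    return bool(re.fullmatch(r"[A-Z0-9]{10}", code))

def is_upc(code: str) -> bool:
    # UPC-A codes are exactly 12 digits.
    if not re.fullmatch(r"\d{12}", code):
        return False
    digits = [int(c) for c in code]
    # UPC-A checksum: 3x the digits in odd positions (1st, 3rd, ...)
    # plus the digits in even positions must be a multiple of 10.
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2]) + digits[11]
    return total % 10 == 0

print(is_asin("B00X4WHP5E"))   # True (a sample ASIN-shaped code)
print(is_upc("036000291452"))  # True (a commonly cited valid UPC)
```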

Step 2: Capture Data
Launch the application (download it at www.octoparse.com) and start a new task.
Click "Quick Start" ➜ Click "New Task (Advanced Mode)" – I prefer the Advanced Mode because it offers a lot more flexibility compared to the Wizard Mode. And I never find it too "advanced" to learn ➜ Fill in the basic information like Task Name, Category etc. After everything’s done, click "Next" to proceed to the next step. 

Drag a "Go to webpage" action into the workflow pane ➜ Enter the page URL: www.amazon.com ➜ Click "Save" ➜Amazon's webpage gets loaded in the built-in browser.

Click into the search box on the webpage ➜ when Octoparse prompts you for the next action, pick "Enter text value".

To search repeatedly with a list of UPC codes, we'll need to use the loop action. Drag a "Loop" action into the workflow pane ➜ Select "Text List" ➜ Copy and paste the list of UPC codes into the text box ➜ Click "OK" ➜ Save ➜ Notice how the list of UPCs gets added to "Loop Items".
 

Now, to tie the steps in the workflow together, drag the "Input Text" action inside the "Loop" action ➜ Under Advanced Options for the "Text Input" action, check "Use text in Loop Item to fill in the text box" ➜ Save ➜ Octoparse is now configured to use the text values added to the loop to search on Amazon's website.
 

Since the workflow has been re-arranged, click through it from the first action onward to make sure the defined actions work as desired.
In my case everything works properly, so when I get to the "Enter Text" step, the first UPC code from the list gets synced to the text box automatically.

 Next, click on the search button ➜ When prompted, select "Click an item" ➜ A "Click" action gets added to the loop automatically.

Once we are on the product detail page, capture whatever product information you need, just like in any other extraction task. Here, I will need the product title, rating, number of reviews, product category and price.
Click on the title of the product ➜ Select "Extract Text" when prompted ➜ Notice how the product title gets added to the data pane next to the workflow designer ➜ Capture all other data fields similarly.

Edit the field names so they correspond to the different data fields extracted.


So, is everything good to go now? Not so fast! There's still a bit of double-checking to do: click through the workflow from the first action to the last to make sure everything works properly. Soon we find that the "Click" action for the Search button doesn't actually work.

Most of the time, when something fails to work as desired, it is because the XPath auto-generated by Octoparse fails to locate the item correctly. To solve this, we'll need to modify the XPath manually. There are tons of XPath tutorials in Octoparse's tutorial section (http://www.octoparse.com/features-tutorial?category=XPATH), so we'll skip the detailed steps and modify the XPath directly.
Select the "Click an item" action, making sure it's the one for the Search button ➜ Click "Customized" under Advanced Options ➜ Click "Define ways to locate an item" ➜ Copy and paste the correct XPath ( .//*[@id='nav-search']/form/div[2]/div/input ) ➜ Click "OK".
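If you'd like to verify an XPath outside Octoparse before pasting it in, a quick sketch with Python's requests and lxml libraries (my own tooling choice, nothing Octoparse requires) can confirm that it matches exactly one element. Keep in mind that Amazon may serve different markup to scripts, so treat this only as a rough check.

```python
import requests
from lxml import html

# Fetch the page and parse it into an element tree.
page = requests.get("https://www.amazon.com",
                    headers={"User-Agent": "Mozilla/5.0"})
tree = html.fromstring(page.content)

# The corrected XPath for the search button from the step above.
matches = tree.xpath(".//*[@id='nav-search']/form/div[2]/div/input")
print(len(matches))  # expect exactly 1 if the XPath locates the button
```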

Now re-run the workflow and pick a UPC code other than the first one to test whether the extraction steps locate the individual data fields correctly.
  
When we get to the "Extract data" step, we notice that price is missing from the extracted data output.
 

So once again we'll need to change the XPath to the correct one (.//*[contains(text(), 'Price:')]/following-sibling::td/span[1]).

Finally, we are ready to run the extraction task. We can run it on the local machine, using local bandwidth, IP, memory etc. Alternatively, we can choose "Cloud Extraction" to run it in the cloud (though only for paid users). If you are a heavy user, I strongly recommend Cloud Extraction: it is such a relief to leave everything in the cloud and come back to a complete data set without having to worry about network interruptions or the computer freezing.
Here's what I got. The extracted data looks pretty neat.

So that's everything I have for this tutorial. Here's the otd file I used for the demonstration. I hope you have enjoyed reading it. The Octoparse team is very supportive, so if you need additional help, definitely reach out to them.
Thank you and stay tuned for more Amazon scraping tips! Cheers!

Related tutorials:

Thursday, August 10, 2017

Scrape Data from a List of URLs by Creating a Simple Scraper

Web scraping can be done by writing a web crawler in Python. Before coding a Python-based crawler, you need to look into the page source and get to know the structure of the target website. And of course you need to learn Python. It is much easier if you already know how to code, but for a tech noob it is very difficult to learn everything from scratch. That's why we created our app Octoparse: to help people who know little to nothing about coding easily scrape any web data.
In this tutorial we will learn how to create the simplest and easiest web scraper to scrape a list of URLs, without any coding at all. This method is best suited to beginners. (We will assume Octoparse is already installed on your computer. If that's not the case, download it here.)
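For comparison, here is roughly what the coding route from the paragraph above looks like, using the requests and BeautifulSoup libraries with placeholder URLs and a placeholder selector (my own illustrative sketch; the rest of this tutorial achieves the same loop with zero code):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URLs; substitute your own list (all pages should share a layout).
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    title = soup.select_one("h1")  # pick whatever element you want to extract
    print(url, title.get_text(strip=True) if title else "no match")
```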

This tutorial will walk you through these steps:
1. Create a "Loop Item" in the workflow
2. Add a list of URLs into the created "Loop Item"
3. Click to extract data points from one webpage
4. Run the scraper
5. Export the extracted data

1. Create a “Loop Item” in the workflow
After setting up basic information for your task, drag a “Loop Item” and drop it into the workflow designer.




 2. Add a list of URLs into the created “Loop Item”

After creating a "Loop Item" in the workflow designer, add a list of URLs to the "Loop Item" to create a pattern for navigating each webpage.
      · Select “List of URLs” Loop mode under advanced options of “Loop Item” step
      · Copy and paste all the URLs into “List of URLs” text box
      · Click “OK” and then save the configuration
Note:
     1) All the URLs should share a similar layout.
     2) Add no more than 20,000 URLs.
     3) You will need to manually copy and paste the URLs into the "List of URLs" text box.
     4) After you enter all the URLs, a "Go To Webpage" action will be created automatically in the "Loop Item".





 3. Click to extract data points from one webpage
When the webpage has completely loaded, click the data points on the webpage to extract the data you need.
      · Click the data you need and select "Extract Data" (an "Extract Data" action will be created automatically)
 



4. Run the scraper set up
The scraper is now created. Run the task with either “Local Extraction” or “Cloud Extraction”.
In this tutorial we run the scraper with “Cloud Extraction”.
      ·  Click “Next” and then Click “Cloud Extraction” to run the scraper on the cloud platform


Note:
     1) You can close the app or even your computer while running the scraper with "Cloud Extraction".
         Just sit back and relax, then come back for the data. No need to worry about Internet interruptions or hardware limitations.
     2) You can also run the scraper with “Local Extraction” (on your local machine).




5. Export data extracted

      · Click “View Data” to check the data extracted
      · Choose “Export Current Page” or “Export All” to export data extracted
Note:
       1) Octoparse supports exporting data as Excel (2003), Excel (2007) or CSV, or having the data delivered to your database.
       2) You can also create Octoparse APIs and access data. See more at: http://www.octoparse.com/tutorial/api


Now we've learned how to scrape data from a list of URLs by creating a simple scraper, without any coding at all! Very easy, right? Try it for yourself!

The extracted demo data looks like the sample below:
(I've also attached the demo task and the demo data exported to Excel. Find them here.)


Now check out similar case studies:
     · URLs - Advanced Mode 

Thursday, August 3, 2017

Octoparse vs. Import.io comparison: which is best for web scraping?

Web scraping software, also known as a data extraction tool, is software that collects data from websites. It's usually not easy to pick a web scraping tool, as there are so many available now (refer to Top 30 Free Web Scraping Software to learn more). That's why I decided to put the web scraping tool Octoparse head to head with Import.io to see how the two tools compare. Here is everything you need to know when deciding which web scraping tool suits you better.

Feature Comparison
Here is a general comparison between Octoparse and Import.io features:
Feature | Octoparse | Import.io
Environment | Desktop app for Windows (available for Mac with a virtual machine) | Web-based application; supports Chrome, Firefox, Safari
Selecting elements | Point-and-click, XPath | Point-and-click, XPath
Pagination | Clicking pagination links, or manually entering the XPath (for websites without "Next page" links) | Entering a list of pages
Scraper logic | Variables, loops, conditionals | Selecting and extracting only
Drop-downs, tabs, hovering, pop-ups | Yes | No
Infinitely scrolling pages | Yes | No
Entering text into search boxes | Yes | No
CAPTCHA | Yes, on the local machine | No
Signing in to accounts | Yes | Yes
JavaScript | Yes | Yes
Transforming data | RegEx, JavaScript expressions | Regular expressions
Speed | Fast parallel execution | Fast parallel execution
Hosting | Octoparse's cloud servers (with a cloud subscription) or your local machine (basic version) | Import.io's cloud servers
IP rotation | Included in paid plans, or manual IP proxies (free version) | Yes
Scheduled runs | With a premium Octoparse account | With a premium Import.io plan
Data export | CSV, Excel, TXT, databases | CSV, JSON, API, Google Sheets
Smart Mode | Yes | No
Cloud service | Yes | Yes
Up-to-date data | Yes | Yes
Image and file extraction | No; can only extract image or file URLs | Yes
Coding required | No | No
Support | Free professional support, tutorials, community support | Community support, or professional support for paid users; customer success training

So what can both web scrapers do for you?
Both interfaces are built on the point-and-click principle, which makes it easy to extract data without coding. Both scrapers can deal with JavaScript and AJAX pages and can scrape behind a login. Like a bot, they can follow links into deeper web pages by clicking items and extracting the data on those pages. They can also export data in CSV format and transform data through manually written regular expressions or XPath.
Both provide cloud services, which offer API options, IP rotation and the ability to schedule extractors to run in real time. With that, it is easy to get up-to-date data regularly without having to keep your computer on.

What can Octoparse do for you?
The biggest difference between Octoparse and its web scraping alternatives is that Octoparse can get data from interactive websites. It fully mimics human behaviour when browsing a website.
You can instruct Octoparse to scrape data from very complex and dynamic sites, because it can:
  • Sign in to accounts to scrape behind a login
  • Select choices from dropdown menus (single and multiple), tabs and pop-up windows
  • Enter keywords and search with a search bar
  • Go to a new page simply by clicking on the "next" button
  • Get data from infinitely scrolling pages
  • Accept CAPTCHA input on the local machine
  • Offer a visual workflow that makes the scraper's logic (variables, loops and conditionals) easy to understand and easy to change through the point-and-click interface
  • Handle simple websites in Smart Mode just by being given the target URL
  • Extract inner and outer HTML and attributes, and customize the values for further extraction
  • Provide built-in RegEx and XPath tools to modify regular expressions or XPath, which means you don't need to know how to write them yourself (see the screenshots below)
And more! Except for the first item, these are all things that Import.io cannot handle.

 
Octoparse RegEx Tool

 
Octoparse XPath tool

Here is a full list of Octoparse's scraping features:
Automatic IP Rotation
API
Loops, variables and conditional logic
Extract text, HTML and attributes
Scheduled Runs
Cloud servers to store data
Extract file and image URLs
Search through forms and inputs
Get data from drop-downs, tabs, pop-ups and hovers
Databases integration
Pagination and navigation
Scrape content from infinitely scrolling pages
RegEx and XPath Tool
Get data from tables and maps
Content that loads with AJAX and JavaScript

The downside of using Octoparse as an alternative to Import.io is that you need to install the application on your own computer, and because the software is written in .NET, it only supports Windows. A virtual machine is needed if you want to run Octoparse on a Mac. It is also annoying that if the Internet is unstable and the scraper stops unexpectedly, you have to rerun the crawler from scratch. Another downside is that it may take longer to learn Octoparse, as it is easy to make mistakes if you don't understand the logic of the workflow. But luckily, there are plenty of tutorials and great support if you get stuck!
Besides, Octoparse cannot extract images and files directly; you need to extract their URLs and download them with other applications. Its API functionality is also quite limited.
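As one workaround for the images, the exported URL column can be fed to any download tool you like. Here is a minimal Python sketch (the file name export.csv and the column name image_url are hypothetical placeholders for whatever your export actually contains):

```python
import csv
import os
import requests

# Read the exported CSV and download each image URL it contains.
with open("export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row["image_url"]  # hypothetical column name; adjust to your export
        name = os.path.basename(url.split("?")[0]) or "unnamed.jpg"
        with open(name, "wb") as out:
            out.write(requests.get(url).content)
```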

What can Import.io do for you?
First of all, Import.io is a cloud-based platform, which means you don't need to run the scraper on your own machine, and the data can be kept in the cloud. You can therefore access your data from any computer connected to the Internet. You also don't need to worry about maintaining or scaling the scraping process.

Unlike Octoparse's advanced mode, Import.io tries to guess what you want from the page and builds an extractor for you in just a few seconds. Other features include:
  • Connect one data source with another to produce new, valuable, real-time data sets
  • Integrate with Google Sheet and Tableau
  • Able to extract images and files
  • API integration

Here is a full list of Import.io's scraping features:
Automatic IP Rotation
Cloud servers to store data
Content that loads with AJAX and JavaScript
Extract files and images
Scheduled Runs
XPath and Regular Expressions Selectors
Pagination
Get data from tables and maps
API, Tableau and Google Sheets integration

The downside of using Import.io is that it cannot handle as wide a range of websites as Octoparse. As mentioned above, it cannot deal with websites with dropdown menus, pop-up windows or CAPTCHAs. It is also unable to scrape infinitely scrolling pages, which are quite common nowadays. And there is no scraper logic, such as conditionals, for precisely locating the web pages or items to extract.
Pagination is not easy either, as you need to enter a list of pages. As for transforming data with regular expressions and XPath, there are no built-in tools; you have to write the expressions yourself, which means you need to master XPath and regular expressions if you want to get more out of Import.io.
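To give a concrete sense of what that means, here is the kind of regular expression you would write by hand to transform scraped text, such as pulling a numeric price out of a raw string (an illustration of my own, not code either tool asks for):

```python
import re

raw = "Price: $29.99 & FREE Shipping"           # sample scraped text
match = re.search(r"\$(\d+(?:\.\d{2})?)", raw)  # digits after a dollar sign
price = float(match.group(1)) if match else None
print(price)  # 29.99
```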

Cost Comparison
There is no doubt that Octoparse has an overwhelming advantage here: it provides a free version with powerful functions! To summarize:
Plan | Octoparse Basic | Octoparse Standard | Octoparse Professional | Import.io Essential | Import.io Professional | Import.io Enterprise
Monthly plan ($) | Free | 89 | 189 | 299 | - | -
Yearly plan ($) | Free | 900 | 1896 | - | 1999 | 4999

See the screenshots below for more details.
 
Octoparse Pricing

 
Import.io Pricing 

Octoparse's plans are limited by:
  • the number of crawlers
  • the number of crawlers you can run concurrently on your machine
  • the speed at which you can collect data (different cloud servers)
Every version, including the free one, allows unlimited pages per crawler and can be installed on an unlimited number of computers.
(Note: when you enter URLs into a URL list, it should contain LESS THAN 20,000 URLs. Every version has this limit, which Octoparse imposes to keep a single crawler run manageable for the CPU. You can always copy the crawler to extract the remaining URLs.)

Import.io’s plans are limited by: 
  • the number of queries per month or year
  • the expiry date of the queries
  • limited functions such as image and file download, API, and up-to-date reporting
  • support
It's sad to find that Import.io doesn't provide a free version anymore.

Most people build two crawlers per website in Octoparse: one extracts the URLs of the individual web pages, and the other uses a URL list to bulk-extract the data from those URLs. This is highly recommended when using the cloud service (see Splitting Tasks to Speed Up Cloud Extraction to learn more).
Import.io, on the other hand, counts an extractor as one query and does not provide URL lists for bulk extraction. Therefore, you either need to crawl through those separate web pages in one extractor (which usually means missing data), or upgrade your plan to get more queries.
For both Octoparse and Import.io, you have to subscribe to a premium plan to get the scheduling feature: the ability to collect data from a website continuously on a schedule (real-time, daily, weekly, monthly).
If you don't want to learn how to use a tool and just want your data on demand, both Octoparse and Import.io provide data services that extract the data for you. Just contact the sales teams of the two companies and they will scrape the data from the website you want, delivering it in CSV/Excel or via an API.

Conclusion
It is not difficult to start a project with either Octoparse or Import.io, and both deal well with static and dynamic websites. XPath and regular expressions are needed if you want to explore further, though both claim that no programming knowledge is required. And both have their limits.

I will also put together some examples to further show how these two scrapers work. And if there is anything wrong with the information above, just contact me here.