Friday, September 1, 2017

Web Scraping Service vs. Automatic Web Scraper: Which is the best option for web scraping?


What is web scraping?
Web scraping, also known as web extraction or web crawling, refers to the process of obtaining unstructured information from websites and turning it into structured, clean data in formats such as xls, csv, or txt, or populating the captured data directly into a database. Common uses of web scraping include lead generation, data collection for academic research, price monitoring on competitors' websites, product catalogue scraping, and many more. People turn to web scraping for all kinds of good reasons, and it is easy to get confused about which is the best path to take. In this article, I will walk through the pros and cons of both web scraping services and automatic web scrapers.
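To make the definition concrete, here is a minimal Python sketch of the idea - unstructured page in, structured csv out. The URL and the .product/.name/.price selectors are hypothetical placeholders, not a real site:

```python
# A minimal sketch of web scraping using the requests and BeautifulSoup
# libraries. The target URL and element selectors are placeholders.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # placeholder target page
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select(".product"):  # hypothetical item selector
    name = item.select_one(".name")
    price = item.select_one(".price")
    rows.append([
        name.get_text(strip=True) if name else "",
        price.get_text(strip=True) if price else "",
    ])

# Turn the unstructured page into a clean, structured csv file.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)
```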

What are some web scraping options?
When it comes to web scraping, there are two major kinds of providers in the market: scraping tool providers and scraping service providers. Tool providers offer the many so-called web scrapers or web extractors; examples include Import.io, Octoparse, Scrapy, and others. Some of these products are easier for non-technical users to handle, such as Octoparse and Import.io; others, such as Scrapy and Content Grabber, require more of a programming background. Providers running on a service model are commonly known as DaaS, short for Data as a Service. These companies do all the scraping work themselves and deliver the data in any format and at any frequency you like; they can even provide weekly or monthly data feeds via API if needed. A few well-known ones are Scrapinghub, Datahen, and Data Hero. There are also companies that offer both a scraping tool and a scraping service at the same time, such as Mozenda and Octoparse. Just because they offer a self-customizable scraper doesn't mean their scraping service is any less proficient than that of service-only providers. In fact, data services provided by crawler companies can be a lot more cost-efficient and much friendlier to one-time scrapes, because they have the edge of owning a customizable scraping tool, so only minimal manual intervention is required.
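To give a feel for the more programming-oriented end of the spectrum, here is roughly what a bare-bones Scrapy spider looks like. The start URL and selectors are placeholders, not a real project:

```python
# A bare-bones Scrapy spider, to illustrate why tools like Scrapy assume
# some programming background. Start URL and selectors are placeholders.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/catalogue"]  # placeholder

    def parse(self, response):
        # Yield one structured record per product element on the page.
        for product in response.css(".product"):
            yield {
                "name": product.css(".name::text").extract_first(),
                "price": product.css(".price::text").extract_first(),
            }
        # Follow the pagination link, if the page has one.
        next_page = response.css("a.next::attr(href)").extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as products.py, a spider like this would run with `scrapy runspider products.py -o products.csv` and return a structured file - simple enough once you know Python, but not a point-and-click experience.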

So what is the essential difference between using a DIY web scraper and seeking help from a web scraping company? There are many differences, but the most critical factors are:
  1. Cost
  2. Willingness to learn
  3. Deadline
  4. Complexity of the scraping project

If you are a student on a tight budget looking to scrape some public data to support your thesis research, a scraping tool will be the best way to go; if you are an enterprise looking to outsource a brand-monitoring project on a tight schedule, a data scraping service will provide what you need. These are only two obvious examples of how different groups of people will find one product or service more advantageous than another, but they should give you a general feel for how to approach the question: go through your specific demands, budget, schedule, project complexity, and so on.

Comparing web scraping alternatives: 

|  | Web Scraper SaaS Service | Professional Data Service (DaaS) | Data Service Provided by a Crawler Company |
| --- | --- | --- | --- |
| Pricing | $60 ~ $200 per month | $350 ~ $2,500 per project, plus a $60 ~ $500 monthly maintenance fee if applicable | $100 ~ $2,500 per project, plus a $60 ~ $300 monthly maintenance fee if applicable |
| Turnaround | Depends on your own effort | 3 ~ 10 business days | 1 ~ 10 business days |
| Format of data delivery | Most support export to xls, csv, html, txt, JSON, xml | Most support csv, html, JSON, xml | Most support csv, html, JSON, xml |
| Database / API delivery | Depends on the specific product | Yes | Yes |
| Complex websites (JavaScript, AJAX, etc.) | Depends on the specific tool | Supported most of the time | Supported most of the time |
| Mass-scale scraping | Good volume at low cost, if the scraper can get what you need | Scalable, but cost increases as volume goes up | Scalable, but cost increases as volume goes up |
| Customized requests | Self-help | Highly flexible | Highly flexible most of the time |
| One-time request friendly | Yes, pay as you go | Mostly no | Yes |
| Customer support | Busy support, though some are really helpful; depends on the product | Pretty responsive most of the time | High-priority support |
Are you ready to scrape?
Just like everything else, there are pros and cons to either a web scraping service or a data scraping tool. Which is the better option will largely depend on the specifics of the project, the data application, and the project budget. Do go through your requirements thoroughly and carry out the necessary research on the products and services available in the market - all of this is essential to finding the web scraping solution best tailored to your scraping needs.

That's all I have for now. Feel free to drop a message here if you have any specific questions about any web scraper or service. Cheers!

Thursday, August 17, 2017

Scraping Data behind a CAPTCHA - Advanced Web Scraping

Is it possible to scrape data that is protected behind a CAPTCHA with Octoparse? This is one of the most frequently asked questions from our users. Yes, Octoparse is able to scrape data behind a CAPTCHA when running your scraping tasks on a local machine. Although Octoparse Cloud Service does not yet provide CAPTCHA-solving, our developers are working very hard on it. So we'll see. ;)
In this tutorial, I will introduce solutions to the two most common situations in which a CAPTCHA appears:
1) Scraping data behind a login
2) Accessing a website too frequently in a short time
Here is the general rule for resolving a CAPTCHA issue in Octoparse (for coders, a rough Selenium sketch of the same pattern follows the list):
· Allow the webpage to load the CAPTCHA image
· Set a waiting time before execution
· Enter the text CAPTCHA, or drag the slider, manually
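This is not how Octoparse works internally - just a rough Python/Selenium sketch of the same manual-CAPTCHA pattern: load the page, give a human time to type the CAPTCHA, then continue. The URL and field names are hypothetical placeholders:

```python
# Rough Selenium sketch of the manual-CAPTCHA pattern described above.
# The login URL and element names are placeholders, not a real site.
import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder login page

# Fill in the credentials (field names are placeholders).
driver.find_element_by_name("username").send_keys("my_user")
driver.find_element_by_name("password").send_keys("my_pass")

# "Waiting before execution": pause long enough to read the CAPTCHA
# image in the opened browser window and type the answer by hand.
print("Please solve the CAPTCHA in the browser window...")
time.sleep(60)  # a generous manual-entry window, like Octoparse's timeout

# Only after the wait does the scraper click the sign-in button.
driver.find_element_by_css_selector("button[type='submit']").click()
```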
 




1) Scraping data behind a login
Some websites require login credentials before browsing. A CAPTCHA often has to be passed when the provided credentials are verified.
I'll take http://agent.octoparse.com/login as an example to show you how to scrape a website with a CAPTCHA at login.
Step 1.
· Go to the webpage.
· Click the User Name box and choose “Enter Text value”.
· Enter the credentials.
· Click “Save”.
 

Enter the password the same way.

Step 2. Manually enter the CAPTCHA in the built-in browser.
As the CAPTCHA changes when the webpage reloads, you don't need to add a separate step for entering the CAPTCHA to the workflow at this point. Just enter the CAPTCHA manually in the built-in browser.
(Note: drag a slider the same way.)

Step 3. Set proper waiting timeout before execution.
Click the "Sign in" button and choose "Click an item" to log in to the website.
 

To make sure we have enough time to enter the CAPTCHA manually when starting local extraction, we need to set a longer timeout before Octoparse executes "Click item".
So just click "Click item" in the workflow ➜ check "Waiting before execution" under Advanced Options ➜ set a proper execution timeout ➜ click "Save".
 

Step 4. Extract the data you want.
I won't go into the details here, as our other tutorials have covered that.
  
Step 5. Manually enter the CAPTCHA when starting data extraction on local machine.
As you can see, there's a built-in browser during the extraction process. You need to wait until the CAPTCHA appears and then enter the CAPTCHA.
 
Then you have solved the CAPTCHA problem, and Octoparse will do the rest to get the data for you.
Note:
We strongly recommend loading and storing the cookie by checking the option "Use specified Cookie" under Cache Settings. By doing this, you won't need to enter the CAPTCHA most of the time when scraping behind a login.
(Follow the tutorial here to know how to store cookies in Octoparse.)
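Here is a rough illustration (using Python's requests library, not Octoparse itself) of why a stored cookie helps: once the session cookie from a successful, CAPTCHA-solved login is saved, later runs can reuse it and skip the login form entirely. The file path and URL are hypothetical placeholders:

```python
# Reusing stored cookies so later runs skip the login (and its CAPTCHA).
# File path and URL are placeholders.
import json

import requests

COOKIE_FILE = "session_cookies.json"
session = requests.Session()

# Load cookies captured from an earlier, manually CAPTCHA-solved login.
with open(COOKIE_FILE, encoding="utf-8") as f:
    session.cookies.update(json.load(f))

# With a valid session cookie, protected pages load without the login
# page (and therefore without its CAPTCHA) getting in the way.
resp = session.get("https://example.com/members/data")  # placeholder URL
print(resp.status_code)

# After any fresh login, save the cookies again for the next run.
with open(COOKIE_FILE, "w", encoding="utf-8") as f:
    json.dump(session.cookies.get_dict(), f)
```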




2) Accessing a website too frequently in a short time

Passing a CAPTCHA in this situation works the same way. Octoparse mimics human behaviour with its point-and-click interface, so you simply enter the CAPTCHA manually, just as you would in a normal browser, and set a proper waiting timeout at the following step so you have time to do so. This is what you do when building the crawler in the workflow.
Then, during data extraction on your local machine, wait until the CAPTCHA appears and pass it manually before the operation times out.

Now you know how to scrape data behind a CAPTCHA in Octoparse. I know it's probably less than ideal, but it can be a way of solving CAPTCHAs in the short term. Hope you enjoy data hunting with Octoparse!


The Octoparse Team

Monday, August 14, 2017

Scrape Amazon Product Data with ASIN/UPC

So if you sell online, wouldn't you want to know how your products are priced compared with those on popular e-commerce sites such as Amazon or eBay? If your products are not selling, could it be a pricing problem, a product description problem, the product pictures, or are the products simply not good enough? This is exactly why people are turning to Amazon for product mining.
In this article, I will show you how to easily retrieve the product data you need from Amazon using the web scraping tool Octoparse. Let's get straight to the point.

Step 1: Get prepared
Gather the list of ASINs/UPCs for the products you need (to search with on Amazon).
ASIN, short for Amazon Standard Identification Number, is a unique block of 10 letters and/or numbers that identifies an item on Amazon. Each item listed on Amazon has a unique ASIN. If you happen to know the ASINs for the products you need to search for, great! If not, try UPC.
UPC, the Universal Product Code, is a 12-digit bar code used extensively for retail packaging in the United States. Find out what the UPCs are for your products and make a list of them.
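If you want to sanity-check your list in code before feeding it to the scraper, both formats are easy to verify. The ASIN rule (exactly 10 letters/digits) and the standard UPC-A check-digit formula are public; the helper names below are illustrative, the sample ASIN is made up, and 036000291452 is a commonly cited example UPC:

```python
# Small helpers to sanity-check an ASIN/UPC list before scraping.
import re


def is_valid_asin(code: str) -> bool:
    """ASINs are exactly 10 letters and/or digits."""
    return re.fullmatch(r"[A-Z0-9]{10}", code.upper()) is not None


def is_valid_upc(code: str) -> bool:
    """Validate a 12-digit UPC-A code using its check-digit formula."""
    if not re.fullmatch(r"\d{12}", code):
        return False
    digits = [int(c) for c in code]
    # Odd positions (1st, 3rd, ...) weigh 3; even positions weigh 1.
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2])
    check = (10 - total % 10) % 10
    return check == digits[11]


print(is_valid_upc("036000291452"))  # True - a commonly cited example
print(is_valid_asin("B00EXAMPLE"))   # True in format; a made-up ASIN
```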

Step 2: Capture Data
Launch the application (download it at www.octoparse.com) and start a new task.
Click "Quick Start" ➜ Click "New Task (Advanced Mode)" – I prefer the Advanced Mode because it offers a lot more flexibility compared to the Wizard Mode. And I never find it too "advanced" to learn ➜ Fill in the basic information like Task Name, Category etc. After everything’s done, click "Next" to proceed to the next step. 

Drag a "Go to webpage" action into the workflow pane ➜ Enter the page URL: www.amazon.com ➜ Click "Save" ➜Amazon's webpage gets loaded in the built-in browser.

Click into the search box on the webpage ➜ when Octoparse prompts you for the next action, pick "Enter text value".

To search repeatedly with a list of UPC codes, we'll need to use a loop action. Drag a "Loop" action into the workflow pane ➜ select "Text List" ➜ copy and paste the list of UPC codes into the text box ➜ click "OK" ➜ Save ➜ notice how the list of UPCs gets added to "Loop Items".
 

Now, to tie the steps in the workflow together, drag the "Input Text" action inside the "Loop" action ➜ under Advanced Options for the "Input Text" action, check "Use text in Loop Item to fill in the text box" ➜ Save ➜ Octoparse is now configured to use the text values added to the loop to search on Amazon's website.
 

Since the workflow has been re-arranged, click through the workflow from the first action onward to make sure the defined actions work as desired.
In my case, everything works properly, so when I get to the "Enter Text" step, the first UPC code from the list is filled into the text box automatically.

Next, click on the search button ➜ when prompted, select "Click an item" ➜ a "Click" action gets added to the loop automatically.

Once we are on the product detail page, capture whatever product information you need, just like in any other extraction task. Here, I need the product title, rating, number of reviews, product category, and price.
Click on the title of the product ➜ Select "Extract Text" when prompted ➜ Notice how the product title gets added to the data pane next to the workflow designer ➜ Capture all other data fields similarly.

Edit the field names directly so that they correspond to the different data fields extracted.


So, is everything good to go now? Not so fast! There's still a bit of double-checking to do - click through the workflow from the first action to the last to make sure everything works properly. Soon we find that the "Click" action for the Search button doesn't actually work.

Most of the time, when something fails to work as desired, it is because the XPath auto-generated by Octoparse fails to locate the item correctly. To solve this, we'll need to modify the XPath manually. There are plenty of XPath tutorials in Octoparse's tutorial section (http://www.octoparse.com/features-tutorial?category=XPATH), so we'll skip the detailed steps and hop straight over to modifying the XPath.
Select the "Click an item" action - make sure it's the one for the Search button ➜ click "Customized" under Advanced Options ➜ click "Define ways to locate an item" ➜ copy and paste the correct XPath (.//*[@id='nav-search']/form/div[2]/div/input) ➜ click "OK".

Now, re-run the workflow and select a UPC code other than the first one to test whether the extraction steps locate the individual data fields correctly.
  
As we get to the "Extract data" step, we'll notice that price is missing from the extracted data outputs.
 

So again, we’ll need to modify the XPath to the correct one (.//*[contains(text(), 'Price:')]/following-sibling::td/span[1]).

Finally, we are ready to run the extraction task. We can run it on the local machine, using local bandwidth, IP, memory, etc. Alternatively, we can choose "Cloud Extraction" to run it in the cloud (though only for paid users). If you are a heavy user, I strongly recommend Cloud Extraction: it is such a relief to leave everything in the cloud and come back for the complete data set, without having to worry about network interruptions or your computer freezing.
Here's what I got. The extracted data looks pretty neat.

So, that's everything I have for this tutorial. Here's the otd file I used for the demonstration. I hope you have enjoyed reading it. The Octoparse team is very supportive most of the time, so if you need additional help, definitely reach out to them.
Thank you and stay tuned for more Amazon scraping tips! Cheers!


Thursday, August 10, 2017

Scrape Data from a List of URLs by Creating a Simple Scraper

Web scraping can be done by creating a web crawler in Python. Before coding a Python-based crawler, you need to look into the page source and get to know the structure of the target website. And of course, you need to learn Python. It is much easier if you already know how to code, but for a tech noob it is very difficult to learn everything from scratch. That's why we created our app, Octoparse: to help people who know little to nothing about coding easily scrape web data.
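For comparison, here is roughly what the coding route looks like - a minimal, hypothetical sketch of a Python crawler over a list of URLs, using the requests and BeautifulSoup libraries. The URLs are placeholders, and it extracts only page titles; a real crawler would first inspect each page's structure to pick the right elements:

```python
# A minimal Python crawler over a list of URLs. URLs are placeholders.
import csv

import requests
from bs4 import BeautifulSoup

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title"])
    for url in urls:
        page = requests.get(url, timeout=30)
        soup = BeautifulSoup(page.text, "html.parser")
        # Extract just the page title as a stand-in for real fields.
        title = soup.title.get_text(strip=True) if soup.title else ""
        writer.writerow([url, title])
```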
In this tutorial, we will learn how to create the simplest and easiest web scraper to scrape a list of URLs, without any coding at all. This method is best suited to beginners. (We will assume that Octoparse is already installed on your computer; if that's not the case, download it here.)

This tutorial will walk you through these steps:
     1) Create a "Loop Item" in the workflow
     2) Add a list of URLs into the created "Loop Item"
     3) Click to extract data points from one webpage
     4) Run the scraper
     5) Export the extracted data

1. Create a “Loop Item” in the workflow
After setting up basic information for your task, drag a “Loop Item” and drop it into the workflow designer.




 2. Add a list of URLs into the created “Loop Item”

After creating the "Loop Item" in the workflow designer, add a list of URLs to the "Loop Item" to create a pattern for navigating each webpage.
      · Select the "List of URLs" loop mode under the advanced options of the "Loop Item" step
      · Copy and paste all the URLs into “List of URLs” text box
      · Click “OK” and then save the configuration
Note:
     1) All the URLs should share a similar layout
     2) Add no more than 20,000 URLs
     3) You will need to manually copy and paste the URLs into the "List of URLs" text box
     4) After you have entered all the URLs, a "Go To Webpage" action will be created automatically in the "Loop Item"





 3. Click to extract data points from one webpage
When the webpage is completely loaded, click the data points on the webpage to extract the data you need.
      · Click the data you need and select "Extract Data" (an "Extract Data" action will be created automatically)
 



4. Run the scraper
The scraper is now created. Run the task with either “Local Extraction” or “Cloud Extraction”.
In this tutorial we run the scraper with “Cloud Extraction”.
      · Click "Next", then click "Cloud Extraction" to run the scraper on the cloud platform


Note:
     1) You can close the app, or even your computer, while running the scraper with "Cloud Extraction".
         Just sit back and relax, then come back for the data. No need to worry about Internet interruptions or hardware limitations.
     2) You can also run the scraper with “Local Extraction” (on your local machine).




5. Export data extracted

      · Click "View Data" to check the extracted data
      · Choose "Export Current Page" or "Export All" to export the extracted data
Note:
       1) Octoparse supports exporting data as Excel (2003), Excel (2007), or CSV, or having the data delivered to your database.
       2) You can also create Octoparse APIs to access the data. See more at: http://www.octoparse.com/tutorial/api


Now we've learned how to scrape data from a list of URLs by creating a simple scraper, without any coding at all! Very easy, right? Try it for yourself.

Demo data extracted like below: 
(I have also attached the demo task and the demo data exported to Excel. Find them here.)


Now check out similar case studies:
     · URLs - Advanced Mode