Thursday, April 20, 2017

Top 5 Web Scraping Tools Review

Web scraping (also known as web crawling or web data extraction) is the extraction of data from websites. Users normally have two options for crawling websites. They can build their own crawlers by writing code or by calling public APIs. Alternatively, they can use automated web scraping software, in which a bot or web crawler carries out the process. The data extracted from web pages can be exported in various formats or loaded into different types of databases for further analysis.
Many web scraping tools are available online. In this post, I would like to share some popular, well-regarded automated scrapers and run through their main features.
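To make the "build our own crawler by coding" option concrete, here is a minimal sketch using only the Python standard library. It parses a hard-coded, hypothetical HTML snippet (`SAMPLE_HTML`) instead of downloading a real page, and the class and function names are illustrative assumptions, not part of any tool reviewed here.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real crawler would download this HTML first.
SAMPLE_HTML = """
<ul>
  <li><a href="/item/1">First item</a></li>
  <li><a href="/item/2">Second item</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Export the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["href", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_links(SAMPLE_HTML)
csv_text = to_csv(rows)
```

Even this small example has to deal with parsing, state tracking and export by hand, which is the work the automated tools below aim to take off the user's plate.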


Visual Web Ripper is an automated web scraping tool that supports a variety of features. It works well on certain difficult-to-scrape websites by using advanced techniques such as running scripts, which requires programming skills.
This scraping tool has a user-friendly interactive interface that helps users grasp the basic operating process quickly. Its features include:
Extract varied data formats
Visual Web Ripper can cope with difficult block layouts, especially web elements that are displayed on the page without a direct HTML association.
AJAX
Visual Web Ripper can extract AJAX-supplied data.
Login Required
Users can scrape websites that require a login first.
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB, plus customized C# or VB script file output (if additionally programmed)
IP proxy servers
Proxy to hide the IP address
Debugger
Visual Web Ripper has a debugger that helps users build reliable agents and resolve issues effectively.
Even though it provides so many functions, Visual Web Ripper does not yet offer a cloud-based service. Users can only install and run the application on a local machine, which may limit the scale or efficiency of scraping when the demand for data is high.
[Pricing]
Visual Web Ripper charges from $349 to $2090 depending on the number of user seats, and maintenance lasts for 6 months. Specifically, users who purchase a single seat ($349) can install and use the application on only one computer; running it on other devices costs double or more. If this pricing structure suits you, Visual Web Ripper could be one of your options.
                                     


Octoparse is a full-featured, no-coding desktop web scraper with many outstanding characteristics compared with the others.
It provides useful, easy-to-use built-in tools for extracting data from tough or aggressive websites that are difficult to scrape.
Its UI is user-friendly and logically designed, so users won't have much trouble locating any function. Additionally, Octoparse visualizes the extraction process with a workflow designer, which helps users stay on top of the scraping process for any task. Octoparse supports:
Ad Blocking
Ad blocking optimizes a task by reducing loading time and the number of HTTP requests.
AJAX Setting
Octoparse can extract AJAX-supplied data and set a timeout.
XPath setting
Users can modify the XPath to locate web elements more precisely using the XPath setting provided by Octoparse.
Regex Setting
Users can normalize the extracted data using Octoparse's built-in Regex tool, which generates a matching regular expression automatically.
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB
IP proxy servers
Proxy to hide the IP address
Cloud Service
Octoparse provides a cloud-based service that makes data extraction 4 to 10 times faster than Local Extraction. When users run Cloud Extraction, 4 to 10 cloud servers are assigned to work on their extraction tasks. This frees users from lengthy maintenance and certain hardware requirements.
API Access
Users can create their own API that returns data formatted as XML strings.
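To illustrate what the XPath and Regex settings in the list above are for, here is a minimal sketch in plain Python. It is not Octoparse's own tooling: the page fragment, class names and price pattern are hypothetical, and the standard library's `xml.etree.ElementTree` supports only a limited XPath subset and needs well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (ElementTree cannot parse messy HTML).
PAGE = """
<html><body>
  <div class="product"><span class="name">Laptop</span><span class="price">Price: $1,299.00</span></div>
  <div class="product"><span class="name">Mouse</span><span class="price">Price: $19.50</span></div>
</body></html>
"""

# Regex that captures the numeric part of a dollar amount such as "$1,299.00".
PRICE_RE = re.compile(r"\$([\d,]+(?:\.\d+)?)")


def extract_prices(page):
    """Locate elements with XPath, then normalize the price text with a regex."""
    root = ET.fromstring(page.strip())
    results = []
    # XPath: every <div class="product"> anywhere below the root.
    for product in root.findall(".//div[@class='product']"):
        name = product.find("span[@class='name']").text
        raw = product.find("span[@class='price']").text
        match = PRICE_RE.search(raw)
        price = float(match.group(1).replace(",", "")) if match else None
        results.append((name, price))
    return results


prices = extract_prices(PAGE)
```

The XPath expression decides which elements are picked up, and the regular expression turns the raw text ("Price: $1,299.00") into a clean value. A visual scraper lets the user adjust those two steps through its settings instead of writing them by hand.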
[Pricing]
Octoparse is free to use if you don't choose the Cloud Service. Its unlimited page scraping compares well with the other scrapers on the market. However, if you want to use its Cloud Service for more sophisticated scraping, it offers two paid editions: Standard Edition and Professional Edition.
Both editions provide the full set of scraping features.
                                         
Standard Edition: $75 per month when billed annually, or $89 per month when billed monthly.
    Standard Edition offers all featured functions.
    Number of tasks in the Task Group: 100
    Cloud Servers: 6
Professional Edition: $158 per month when billed annually, or $189 per month when billed monthly.
    Professional Edition offers all featured functions.
    Number of tasks in the Task Group: 200
    Cloud Servers: 14
To conclude, Octoparse is a feature-rich scraping tool with reasonable subscription pricing, and it is worth a try.
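As a quick check on the two billing options, the following sketch compares a full year of each Octoparse edition at the prices quoted above. The function and variable names are illustrative only.

```python
# (price per month billed annually, price per month billed monthly), in USD,
# taken from the figures quoted in this post.
PLANS = {"Standard": (75, 89), "Professional": (158, 189)}


def yearly_costs(annual_rate, monthly_rate):
    """Return (cost billed annually, cost billed monthly, saving) for 12 months."""
    billed_annually = annual_rate * 12
    billed_monthly = monthly_rate * 12
    return billed_annually, billed_monthly, billed_monthly - billed_annually


summary = {name: yearly_costs(*rates) for name, rates in PLANS.items()}
```

On these figures, a year of Standard Edition costs $900 billed annually against $1,068 billed monthly, and Professional Edition costs $1,896 against $2,268.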


Mozenda is a cloud-based web scraping service. It provides many useful utility features for data extraction, and users can upload extracted data to cloud storage. Its features include:
Extract varied data formats
Mozenda can extract many types of data formats, although data with an irregular layout is harder to handle.
Regex Setting
Users can normalize the extracted data using the Regex Editor within Mozenda. However, the Regex Editor is not easy to use, and you may need to learn more about how to write regular expressions.
Data Export formats
It supports various types of export formats.
AJAX Setting
Mozenda can extract AJAX-supplied data and set a timeout.
[Pricing]
Mozenda users pay for Page Credits, which count the individual requests made to a website to load a web page. Each subscription plan includes a fixed number of pages in the monthly price, and pages beyond that limit are charged additionally. Cloud storage varies by edition. Two editions are offered for Mozenda:
                                                     


Import.io is a web-based platform for extracting data from websites without writing any code. Users build their extractors by pointing and clicking, and Import.io then automatically extracts data from web pages into a structured dataset. Its features include:
Authentication
Extract data from behind a login/password
Cloud Service
Use the SaaS platform to store data that is extracted.
Parallelized data acquisitions are distributed automatically by scalable cloud architecture
API Access
Integration with Google Sheets, Excel, Tableau and many others.
[Pricing]
Import.io charges subscribers based on the number of extraction queries per month, so users should estimate how many queries they need before subscribing. (A single query equals a single page URL.)
Import.io offers three paid editions:
                                      
Essential Edition: $199 per month when billed annually, or $299 per month when billed monthly.
Essential Edition offers all featured functions.
Essential Edition offers up to 10,000 queries per month.

Professional Edition: $349 per month when billed annually, or $499 per month when billed monthly.
Professional Edition offers all featured functions.
Professional Edition offers up to 50,000 queries per month.

Enterprise Edition: $699 per month when billed annually, or $999 per month when billed monthly.
Enterprise Edition offers all featured functions.
Enterprise Edition offers up to 400,000 queries per month.
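Since Import.io bills by query volume, a rough estimate like the one below can help when choosing an edition. This is only a sketch built on the prices and quotas quoted above; the function names are hypothetical and it ignores any overage or custom pricing, which this post does not cover.

```python
# (edition, USD per month billed annually, USD per month billed monthly,
#  queries included per month), as quoted in this post.
PLANS = [
    ("Essential", 199, 299, 10_000),
    ("Professional", 349, 499, 50_000),
    ("Enterprise", 699, 999, 400_000),
]


def cost_per_thousand(price, queries):
    """Price of 1,000 queries if the whole monthly quota is used."""
    return price * 1000 / queries


def cheapest_plan(queries_needed, annual_billing=True):
    """Return the cheapest edition whose quota covers the need, or None."""
    candidates = [plan for plan in PLANS if plan[3] >= queries_needed]
    if not candidates:
        return None
    price_index = 1 if annual_billing else 2
    return min(candidates, key=lambda plan: plan[price_index])[0]


choice = cheapest_plan(30_000)
```

For example, a project that needs about 30,000 page URLs per month does not fit the Essential quota, so the sketch picks the Professional Edition. The per-query cost also drops sharply as the quota grows, provided the quota is actually used.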

Content Grabber is one of the most feature-rich web scraping tools. It is more suitable for people with advanced programming skills, since it offers powerful script editing and debugging interfaces. Users can write regular expressions in C# or VB.NET instead of generating the matching expression with a built-in Regex tool, as in Octoparse. The features of Content Grabber include:
Debugger
Content Grabber has a debugger that helps users build reliable agents and resolve issues effectively.
Visual Studio 2013 Integration
Content Grabber can integrate with Visual Studio 2013 for the most powerful script editing, debugging and unit testing features.

Custom Display Templates
Custom HTML display templates allow you to remove promotional messages and add your own designs to the screens, effectively allowing you to white-label your self-contained agent.
Programming Interface
The Content Grabber API can be used to add web automation capabilities to your own desktop and web applications. The web API requires access to the Content Grabber Windows service, which is part of the Content Grabber software and must be installed on the web server or on a server accessible to the web server.
[Pricing]
Content Grabber offers two purchasing methods:
                                   
Buy License: Buying any Content Grabber license outright gives you a perpetual license.
For license buyers, three editions are available:
Server Edition:
This basic edition provides only a limited Agent Editor. The total cost is $449.
Professional Edition:
It provides the full-featured Agent Editor. However, the API is not available. The price is $995.
Premium Edition:
This advanced edition provides all Content Grabber features. However, it is also priced higher, at $2495.

Monthly Subscription: Users who sign up for a monthly subscription are charged upfront each month for the edition they choose.
Subscribers can choose from the same three editions:
Server Edition:
This basic edition provides only a limited Agent Editor. The total cost is $69 per month.
Professional Edition:
It provides the full-featured Agent Editor. However, the API is not available. The price is $149 per month.
Premium Edition:
This advanced edition provides all Content Grabber features. However, it is also priced higher, at $299 per month.
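When weighing a perpetual license against a monthly subscription, a simple break-even calculation can help. The sketch below uses only the Content Grabber prices quoted above and assumes no other costs (such as maintenance or upgrades, which this post does not cover); the names are illustrative.

```python
# Prices in USD as quoted in this post.
LICENSE_PRICES = {"Server": 449, "Professional": 995, "Premium": 2495}
SUBSCRIPTION_PRICES = {"Server": 69, "Professional": 149, "Premium": 299}


def break_even_months(license_price, monthly_price):
    """First month in which cumulative subscription fees reach the license price."""
    return -(-license_price // monthly_price)  # integer ceiling division


break_even = {
    edition: break_even_months(LICENSE_PRICES[edition], SUBSCRIPTION_PRICES[edition])
    for edition in LICENSE_PRICES
}
```

On these figures, the subscription overtakes the license price in month 7 for the Server and Professional editions and in month 9 for the Premium Edition, so a subscription mainly makes sense for shorter projects.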

Conclusion

In this post, five automated web scraping tools have been evaluated from various perspectives. Most of them can satisfy users' basic scraping needs. Some, like Octoparse and Content Grabber, provide more advanced functionality to help users extract matching results from tough websites using built-in Regex and XPath tools and proxy servers. In particular, users without programming skills are advised not to run custom scripts (Visual Web Ripper, Content Grabber, etc.). Which scraper to choose depends entirely on your individual requirements, so make sure you have an overall understanding of a scraper's features before you subscribe to it. Lastly, check out the feature comparison chart below if you are seriously considering subscribing to a data extraction service provider. Happy data hunting!
                                    