Thursday, April 20, 2017

Top 5 Web Scraping Tools Review

Web scraping (also known as web crawling or web data extraction) is the practice of extracting data from websites. Normally, there are two options for crawling websites: we can build our own crawlers by coding, possibly against public APIs, or we can use automated web scraping software, where an automated process is carried out by a bot or web crawler. The extracted data can be exported in various formats or into different types of databases for further analysis.
There are many web scraping tools you may find online. In this post, I would like to share some popular automated scrapers that people think well of, and give a run-through of their respective features.
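For readers weighing the build-your-own option mentioned above, here is a minimal sketch of a hand-coded extractor using only Python's standard library. The page content and field names are invented for illustration; a real crawler would fetch live pages (e.g. with urllib) rather than parse a hardcoded string.

```python
import csv
import io
from html.parser import HTMLParser

# Sample page standing in for a fetched response (a real crawler
# would download this over HTTP instead).
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget A</span><span class="price">$9.99</span></div>
  <div class="product"><span class="name">Widget B</span><span class="price">$14.50</span></div>
</body></html>
"""

class ProductParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> / <span class="price">."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "name" in self._current and "price" in self._current:
                self.rows.append((self._current["name"], self._current["price"]))
                self._current = {}

parser = ProductParser()
parser.feed(SAMPLE_HTML)

# Export the extracted records as CSV, one of the formats most tools support.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "price"])
writer.writerows(parser.rows)
print(buf.getvalue())
```

This is the kind of parsing and export logic the automated tools below package behind a point-and-click interface.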


Visual Web Ripper is an automated web scraping tool that supports a variety of features. It handles certain tough, difficult-to-scrape websites with advanced techniques, such as running scripts, which requires programming skills.
This scraping tool has a user-friendly interactive interface that helps users grasp the basic operational process quickly. Its featured characteristics include:
Extract varied data formats
Visual Web Ripper is able to cope with difficult block layouts, especially web elements displayed on the page without a direct HTML association.
AJAX
Visual Web Ripper is able to extract AJAX-supplied data.
Login Required
Users can scrape websites that require login first.
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB, Customized C# or VB script file output (if additionally programmed)
IP proxy servers
Proxy to hide IP-address
Even though it provides many functionalities, it does not yet offer a cloud-based service. That means users can only install and run the application on a local machine, which may limit the scale and efficiency of scraping when data demands grow.
Debugger
Visual Web Ripper has a debugger that helps users build reliable agents by making issues easier to diagnose and resolve.
[Pricing]
Visual Web Ripper charges from $349 to $2,090 based on the number of subscribed user seats, with six months of maintenance included. Specifically, users who purchase a single seat ($349) can only install and use the application on a single computer; running it on other devices costs double or more. If this pricing structure is no problem for you, Visual Web Ripper could be one of your options.
                                     


Octoparse is a full-featured, non-coding desktop web scraper with many outstanding characteristics compared with other tools.
It provides users with useful, easy-to-use built-in tools to extract data from tough or aggressive websites that are difficult to scrape.
Its UI is user-friendly and designed in a logical way, so users won't have trouble locating any function. Additionally, Octoparse visualizes the extraction process with a workflow designer that helps users stay on top of the scraping process for any task. Octoparse supports:
Ad Blocking
Ad Blocking optimizes tasks by reducing page loading time and the number of HTTP requests.
AJAX Setting
Octoparse is able to extract AJAX-supplied data and set timeouts.
XPath setting
Users can modify the XPath to locate web elements more precisely using the XPath setting provided by Octoparse.
Regex Setting
Users can normalize the extracted data output using Octoparse's built-in Regex tool, which generates a matching regular expression automatically.
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB
IP proxy servers
Proxy to hide IP-address
Cloud Service
Octoparse provides a cloud-based service that speeds up data extraction, running 4 to 10 times faster than Local Extraction. With Cloud Extraction, 4 to 10 cloud servers are assigned to work on the extraction tasks. This frees users from long-term maintenance and certain hardware requirements.
API Access
Users can create their own API that will return data formatted as XML strings. 
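The XPath and Regex settings above map onto two standard techniques. The sketch below, using only Python's standard library and an invented page snippet, shows how an XPath-style selector locates elements and how a regular expression normalizes the captured text; it illustrates the general idea, not Octoparse's internals.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a scraped page.
page = """
<div id="mainResults">
  <ul>
    <li><span>Price:  $ 1,299.00 </span></li>
    <li><span>Price: $  89.50</span></li>
  </ul>
</div>
"""

root = ET.fromstring(page)

# XPath-style selection: grab every <span> under the result list.
spans = root.findall(".//ul/li/span")

# Regex normalization: pull a clean numeric price out of messy text.
price_re = re.compile(r"\$\s*([\d,]+\.\d{2})")
prices = [price_re.search(s.text).group(1).replace(",", "") for s in spans]
print(prices)
```

The XPath narrows *where* to look; the regex cleans up *what* was found — the same division of labor as Octoparse's two settings.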
[Pricing]
Octoparse is free to use if you don't need the Cloud Service, and its unlimited page scraping is excellent compared with the other scrapers on the market. However, if you want to use its Cloud Service for more sophisticated scraping, there are two paid editions: Standard Edition and Professional Edition. Both editions provide the full set of scraping features:
                                         
Standard Edition: $75 per month when billed annually, or $89 per month when billed monthly.
    Standard Edition offers all featured functions.
    Number of tasks in the Task Group: 100
    Cloud Servers: 6
Professional Edition: $158 per month when billed annually, or $189 per month when billed monthly.
    Professional Edition offers all featured functions.
    Number of tasks in the Task Group: 200
    Cloud Servers: 14
To conclude, Octoparse is a feature-rich scraping tool with reasonable subscription pricing; it is worth a try.


Mozenda is a cloud-based web scraping service that provides many useful utility features for data extraction. Users are allowed to upload extracted data to cloud storage. Its featured services include:
Extract varied data formats
Mozenda is able to extract many types of data formats, though it is not that easy with data in irregular layouts.
Regex Setting
Users can normalize the extracted results using the Regex Editor within Mozenda. However, the Regex Editor is not easy to handle; you may need to learn more about how to write regular expressions first.
Data Export formats
It supports various types of export formats and transformations.
AJAX Setting
Mozenda is able to extract AJAX-supplied data and set timeouts.
[Pricing]
Mozenda users pay for Page Credits, where one credit is one individual request to load a web page. Each subscription plan includes a fixed number of pages in the monthly price; pages beyond that limit are charged additionally. Cloud storage allowances vary by edition. Two editions are offered for Mozenda:
                                                     


Import.io is a web-based platform for extracting data from websites without writing any code. Users build extractors by simple point-and-click, and Import.io automatically extracts the data from web pages into a structured dataset. It offers a number of characteristic features:
Authentication
Extract data from behind a login/password
Cloud Service
Use the SaaS platform to store extracted data.
Parallelized data acquisitions are distributed automatically by a scalable cloud architecture.
API Access
Integration with Google Sheets, Excel, Tableau and many others.
[Pricing]
Import.io charges subscribers based on the number of extraction queries per month, so users should estimate their query volume before subscribing. (A single query equals a single page URL.)
There are three Paid Editions offered by Import.io:
                                      
Essential Edition: $199 per month when billed annually, or $299 per month when billed monthly.
Essential Edition offers all featured functions.
Essential Edition offers up to 10,000 queries per month.

Professional Edition: $349 per month when billed annually, or $499 per month when billed monthly.
Professional Edition offers all featured functions.
Professional Edition offers up to 50,000 queries per month.

Enterprise Edition: $699 per month when billed annually, or $999 per month when billed monthly.
Enterprise Edition offers all featured functions.
Enterprise Edition offers up to 400,000 queries per month.
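Since billing is per query, the effective cost differs sharply across tiers. A quick back-of-the-envelope comparison using the annual-billing prices and query caps listed above (assuming you use the full cap each month):

```python
# Effective cost per 1,000 queries at each Import.io tier
# (annual-billing monthly prices and query caps quoted above).
tiers = {
    "Essential":    (199, 10_000),
    "Professional": (349, 50_000),
    "Enterprise":   (699, 400_000),
}

for name, (monthly_price, query_cap) in tiers.items():
    per_thousand = monthly_price / query_cap * 1000
    print(f"{name}: ${per_thousand:.2f} per 1,000 queries")
```

Roughly $19.90, $6.98, and $1.75 per 1,000 queries respectively — higher tiers are far cheaper per query, but only if you actually need the volume.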

Content Grabber is one of the most feature-rich web scraping tools. It is better suited to people with advanced programming skills, since it offers powerful script editing and debugging interfaces for those who need them. Users can write regular expressions in C# or VB.NET instead of generating the matching expression with a built-in Regex tool, as Octoparse does. The features covered by Content Grabber include:
Debugger
Content Grabber has a debugger that helps users build reliable agents by making issues easier to diagnose and resolve.
Visual Studio 2013 Integration
Content Grabber can integrate with Visual Studio 2013 for the most powerful script editing, debugging and unit testing features.

Custom Display Templates

Custom HTML display templates allow you to remove promotional messages and add your own designs to the screens, effectively allowing you to white-label your self-contained agent.

Programming Interface

The Content Grabber API can be used to add web automation capabilities to your own desktop and web applications. The web API requires access to the Content Grabber Windows service, which is part of the Content Grabber software and must be installed on the web server or on a server accessible to it.
[Pricing]
Content Grabber offers two purchasing methods:
                                   
Buy License: Buying any Content Grabber license outright gives you a perpetual license.
For license buyers, three editions are available:
Server Edition:
This basic edition only provides a limited Agent Editor. The total cost is $449.
Professional Edition:
It provides the full-featured Agent Editor; however, the API is not available. The price is $995.
Premium Edition:
This advanced edition provides all of Content Grabber's features, but at a higher price of $2,495.

Monthly Subscription: Users who sign up to a monthly subscription will be charged upfront each month for the edition they choose.
For subscribers, there are also the same three editions for users to buy:
Server Edition:
This basic edition only provides a limited Agent Editor. The cost is $69 per month.
Professional Edition:
It provides the full-featured Agent Editor; however, the API is not available. The price is $149 per month.
Premium Edition:
This advanced edition provides all of Content Grabber's features, at a higher price of $299 per month.

Conclusion

In this post, five automated web scraping tools have been evaluated from various perspectives. Most of these scrapers can satisfy users' basic scraping needs. Some of them, like Octoparse and Content Grabber, provide more advanced functionality to help users extract matching results from tough websites using built-in Regex and XPath tools and proxy servers. Note that users without programming skills are advised against relying on custom scripts (Visual Web Ripper, Content Grabber, etc.). In the end, which scraper to choose depends entirely on your individual requirements, so make sure you have an overall understanding of a scraper's features before you subscribe to it. Lastly, check out the feature comparison chart below if you are seriously considering subscribing to a data extraction service provider. Happy data hunting!
                                    
- See more at: Octoparse Blog


Thursday, March 2, 2017

Octoparse Cloud Service

Octoparse has always dedicated itself to providing users with a better experience and more professional service. Notably, the Octoparse Cloud-based Service has added more features so that users can crawl or scrape data at increasingly high speed and large scale. We are proud to say that Octoparse Cloud Service provides a high-quality service for people with demanding crawling needs, and we'd like to share more with you about it.

What is Octoparse Cloud Service?
Defined as a DaaS model, Octoparse Cloud Service manages the infrastructure and platforms that run the applications. Octoparse cloud servers install and operate the application software in the cloud, and users access the software from cloud clients. This frees users from long-term maintenance and certain hardware requirements.

How does the Cloud Service work?
Thanks to distributed and parallel cloud computing, Octoparse offers a multi-threaded processing mechanism. Put another way, the Cloud Service differs from Local Extraction in its scalability: tasks achieve a higher crawling speed by being cloned onto at least 6 virtual machines running simultaneously. Users can extract data on a 24/7 basis after uploading their configured tasks to the Cloud Platform. After extraction completes, the extracted data is returned to the client.
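The parallel speed-up can be pictured as splitting one task's URL list across several workers. The sketch below simulates it with threads and a dummy fetch function; the URLs and worker count are illustrative, not Octoparse's actual mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for downloading and extracting one page; a real worker
    # would perform the HTTP request and parse the response here.
    return f"data from {url}"

urls = [f"https://example.com/page/{i}" for i in range(1, 13)]

# Clone the work onto 6 "servers" (threads), mirroring the minimum
# of 6 cloud servers described above; results keep input order.
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(fetch, urls))

print(len(results), "pages extracted")
```

With the work divided six ways, each worker handles only a sixth of the pages, which is where the claimed speed-up over a single local machine comes from.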

Why should you choose the Octoparse Cloud-based Service?

IP Rotation
Multiple cloud servers provide users with IP rotation, automatically cycling IP addresses so that aggressive target websites cannot trace the crawler, thus preventing users from getting blocked.

Extraction Speed-Up
Cloud Extraction can be much faster than Octoparse Local Extraction. Normally, Octoparse can scrape data 6 to 14 times faster than Local Extraction, which means at least 6 cloud servers are scraping data for you. Moreover, users can adjust the number of cloud servers to match higher demands on the Cloud Service.

Scheduling Tasks
Note that task scheduling is only available with Cloud Extraction. After tasks are configured, they run on the Cloud Platform at the scheduled time. This feature lets users schedule scraping tasks, with per-minute precision, against target websites whose information updates frequently.

API
The API provided by Octoparse enables it to connect with any system, depending on users' export needs. Octoparse can export data in various formats, such as Excel, CSV, HTML, TXT, and databases (MySQL, SQL Server, and Oracle).


When do you need our Cloud Service?

1. Large volumes of data need to be scraped within a short period of time.
2. Target websites update their real-time data frequently.
3. Scraped data needs to be exported automatically.


Start your Cloud Extraction now!   

Manually activate 'Cloud Extraction' in the fourth step, 'Done', after configuration.


Alternatively, users can activate their tasks within 'In Queue Tasks' and click 'Cloud Extraction' to start Cloud Extraction.


Schedule Cloud Extraction Settings
Users can schedule cloud settings after they finish configuring their tasks.


To schedule Cloud Extraction settings, users should apply a valid Select Date and Availability time period to the task, based on their requirements.


As in the example figure below, enter the configuration name, then click the 'Save the Configuration' and 'Save' buttons. Users can then apply the saved configuration to any other task to reuse the schedule settings.
 

Users can click 'OK' to activate Scheduled Cloud Extraction for their tasks; otherwise, just click the × button and save the configuration.
 

After activating scheduled tasks, users are directed to the 'Cloud: Waiting' task menu, where they can check the 'Next Execution Time' of the scheduled tasks.


A reminder about Scheduled Cloud Extraction is displayed in the fourth step, as below.


Users can stop their tasks within 'Task Status'; a confirmation pop-up is shown when users stop a task.



Cloud Extraction Speed-Up: Splitting Tasks

Tasks running on the Cloud Platform need to be split up if users want to speed up the extraction process; otherwise there is no difference between Cloud Extraction and Local Extraction. However, whether a task can be split depends on its loop mode.
In Octoparse, only the Fixed list, List of URLs, and Text list loop modes allow tasks to be split up in Cloud Extraction, while Single element and Variable list do not. Take the Fixed list example below.


The task above generates a Variable list by default, which occupies 1 cloud server on the Cloud Platform and prevents the task from being split. To improve this, users can switch the Variable list to a Fixed list so that the task can be split up on the Cloud Platform.
Here, users can modify the XPath of the Variable list and switch to a Fixed list. In the example below, users edit the XPath //DIV[@id='mainResults']/UL[1]/LI and append an array sequence number to it, like //DIV[@id='mainResults']/UL[1]/LI[i] (i = 1, 2, 3, ..., n). After editing the XPath, users add the modified numbered XPaths into the Fixed list one by one, then click 'OK' and 'Save'. As in the example figure below, we can see the first item in the loop list after copying //DIV[@id='mainResults']/UL[1]/LI[1] into the Fixed list and clicking 'OK' and 'Save'.
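Adding the numbered XPaths one by one is tedious, and generating them programmatically makes the pattern clear. A small sketch using the selector from the example above (the list length n is assumed for illustration):

```python
# Generate the numbered XPaths for a Fixed list from the Variable-list
# selector used in the example above (XPath indices are 1-based).
base_xpath = "//DIV[@id='mainResults']/UL[1]/LI"
n = 5  # number of items in the list (assumed for illustration)

fixed_list = [f"{base_xpath}[{i}]" for i in range(1, n + 1)]
for xp in fixed_list:
    print(xp)
```

Each numbered XPath pins down exactly one list item, which is what allows the Cloud Platform to hand different items to different servers.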


In the same way, we add the XPaths with sequential array numbers one by one, then click 'OK' and 'Save'.


After adding the modified XPaths, the Loop Items are displayed in the list as below.


The scheduled or running task will be split up and cloned onto multiple cloud servers after users modify its XPath this way, speeding up its Cloud Extraction; otherwise there is little point in running the task on cloud servers. Users can also choose to skip Cloud Extraction by clicking the option below.


Users can adjust the maximum number of tasks running in parallel. Specifically, the Octoparse Professional Edition sets a threshold of 14 threads working simultaneously. Threads are assigned to tasks randomly; if users set a threshold of 10 parallel threads, then at most 10 tasks can be active in the cloud servers at once. However, a task may occupy more than one thread if it has been split up. For example, 8 of the 10 tasks might occupy all of the threads, leaving 2 idle tasks waiting for free cloud servers. Additionally, users can set priorities so that the task with the highest priority is executed first. Note that a split task that occupied cloud servers before priorities were set will keep waiting until the tasks assigned a higher priority have completed their Cloud Extraction.
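The scheduling behavior described above (a fixed thread cap, split tasks occupying several threads, higher-priority tasks served first) can be sketched with a priority queue. This is an illustration of the idea with made-up task names, not Octoparse's actual scheduler:

```python
import heapq

THREADS = 10  # user-set cap on threads running in parallel

# (priority, task name, threads the task needs); lower number = higher priority.
# A split task needs one thread per cloud server it was cloned onto.
tasks = [
    (1, "split task A", 4),
    (2, "split task B", 4),
    (3, "task C", 1),
    (4, "task D", 1),
    (5, "task E", 1),
]
heapq.heapify(tasks)

running, waiting = [], []
free = THREADS

# Greedily start tasks in priority order while threads remain;
# a task that doesn't fit waits for servers to free up.
while tasks:
    prio, name, need = heapq.heappop(tasks)
    if need <= free:
        running.append(name)
        free -= need
    else:
        waiting.append(name)

print("running:", running)
print("waiting:", waiting)
```

Here the two split tasks consume 8 of the 10 threads, so the lowest-priority single-thread task ends up waiting — the same crowding-out effect the paragraph above describes.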
 



Author: The Octoparse Team
- See more at: Octoparse Tutorial

Web Crawler Service

Web data crawling, or scraping, has become increasingly popular in the last few years. The scraped data can be used for various analyses, even predictions. By analyzing the data, people can gain insight into an industry and get ahead of competitors. From this, we can see how useful and necessary it is to get high-quality data quickly and at large scale. This growing demand for data has driven the fast growth of web crawler services.
Web crawler services are easy to find via a Google search. More exactly, a web crawler service is a kind of customized paid service: each time you'd like to crawl a website or a data set, you pay the service provider and receive the crawled data you want. One thing to note: be careful with the service provider you choose, and express your data requirements as clearly and precisely as possible. Below I propose some web crawler services I have used or researched, for your reference. Evaluating such services is hard, since they continuously evolve to serve customers better; the best way to decide is to map your requirements against what is on offer and rank the options yourself.

DataHen is known as a professional web crawler service provider. It offers well-rounded, patient service covering all levels of data crawling or scraping requirements, from individuals to startups and enterprises. With DataHen you don't need to buy or learn scraping software. They can even fill in forms on sites that require authentication. The UI is straightforward, as can be seen below: you only need to fill out the required information, and they will deliver the data you need.

 


grepsr is a powerful crawler service platform that serves many kinds of data crawling needs. To communicate better with users, grepsr provides a clear, all-inclusive requirements-gathering interface, shown below. grepsr also offers three paid plans, from Starter to Enterprise; users can choose a plan based on their crawling needs.


Octoparse is best described as a web scraping tool, even though it also offers a customized data crawler service. The Octoparse Web Crawler Service is powerful as well: tasks can be scheduled to run on the Cloud Platform, with at least 6 cloud servers working simultaneously. It also supports IP rotation, which prevents getting blocked by certain websites. Plus, the Octoparse API allows users to connect their systems to the scraped data in real time: users can either import the Octoparse data into their own database or use the API to request access to their account's data. Octoparse also provides a free edition, which can meet basic scraping or crawling needs; anyone can use it after registering an account. The only requirement is learning to configure basic scraping rules to crawl the data you need, and the configuration skills are easy to grasp. The UI is clear and straightforward, as can be seen in the figure below. Their support service is also professional; users with any questions can contact them directly and get feedback and solutions quickly.


Scrapinghub is known as a web crawler tool that also provides paid crawling services. It can satisfy basic scraping or crawling needs. It also has a proxy rotator (Crawlera), which lets the crawling process bypass bot countermeasures so that large sites can be crawled faster. Plus, its cloud-based web crawling platform lets users easily deploy crawlers and scale them on demand without needing to worry about servers, monitoring, backups, or cron jobs. It helps developers turn over two billion web pages per month into valuable data.




Author: The Octoparse Team
- See more at: Octoparse Blog