Thursday, April 20, 2017

Top 5 Web Scraping Tools Review

Web scraping (also known as web crawling or web data extraction) means extracting data from websites. Normally, users have two options for crawling websites: we can build our own crawlers by coding, or use public APIs. Alternatively, web scraping can be done with automated web scraping software, which runs the process through a bot or web crawler. The extracted data can be exported in various formats or into different types of databases for further analysis.
There are many web scraping tools to be found online. In this post, I would like to share some popular automated scrapers that people think well of, and run through their respective featured services.
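If you take the first route and build your own crawler by coding, the core loop is quite small. Below is a minimal sketch in Python, assuming the third-party requests and beautifulsoup4 packages; the URL and CSS selectors are placeholders, not a real site:

    # Minimal hand-rolled scraper: fetch a page, parse it, export CSV.
    # Assumes `pip install requests beautifulsoup4`; URL/selectors are placeholders.
    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/products"
    resp = requests.get(url, headers={"User-Agent": "my-crawler/1.0"}, timeout=10)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for item in soup.select("div.product"):          # hypothetical selector
        name = item.select_one("h2")
        price = item.select_one("span.price")
        rows.append({"name": name.get_text(strip=True) if name else "",
                     "price": price.get_text(strip=True) if price else ""})

    # Export the results as CSV, one of the formats discussed above.
    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)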


Visual Web Ripper is an automated web scraping tool that supports a variety of features. It works well on certain tough, difficult-to-scrape websites thanks to advanced techniques such as running scripts, which requires programming skills.
This scraping tool has a user-friendly interactive interface that helps users quickly grasp the basic operations. Its featured characteristics include:
Extract varied data formats
Visual Web Ripper is able to cope with difficult block layouts, especially web elements displayed on the page without a direct HTML association.
AJAX
Visual Web Ripper is able to extract AJAX-supplied data.
Login Required
Users can scrape websites that require logging in first.
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB; customized C# or VB script file output (with additional programming)
IP proxy servers
Proxy servers can be used to hide the scraper's IP address.
Even though it provides so many functions, it does not yet offer a cloud-based service. That means users can only install the application on a local machine and run it locally, which may limit the scale and efficiency of scraping when demand is high.
Debugger
Visual Web Ripper has a debugger that helps users build reliable agents and resolve issues effectively.
[Pricing]
Visual Web Ripper charges from $349 to $2,090 based on the number of user seats, with six months of maintenance included. Specifically, users who purchase a single seat ($349) can only install and use the application on a single computer; running it on other devices costs double or more. If this pricing structure poses no problem for you, Visual Web Ripper could be on your list of options.


Octoparse is a full-featured, non-coding desktop web scraper with many outstanding characteristics compared with its competitors.
It provides users with some useful, easy-to-use built-in tools to extract data from tough or aggressive websites that are difficult to scrape.
Its UI is very user-friendly and designed in a logical way, so users have little trouble locating any function. Additionally, Octoparse visualizes the extraction process with a workflow designer that helps users stay on top of the scraping process for any task. Octoparse supports:
Ad Blocking
Ad blocking optimizes tasks by reducing the loading time and the number of HTTP requests.
AJAX Setting
Octoparse is able to extract AJAX-supplied data and set a timeout.
XPath setting
Users can modify the XPath to locate web elements more precisely using the XPath setting provided by Octoparse (see the sketch after this feature list).
Regex Setting
Users can normalize the extracted data output using Octoparse's built-in RegEx tool, which can generate a matching regular expression automatically (also illustrated in the sketch below).
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB
IP proxy servers
Proxy servers can be used to hide the scraper's IP address.
Cloud Service
Octoparse provides a cloud-based service that speeds up data extraction, running 4 to 10 times faster than local extraction. Once users start a Cloud Extraction, 4 to 10 cloud servers are assigned to the extraction task, freeing users from long maintenance sessions and certain hardware requirements.
API Access
Users can create their own API that will return data formatted as XML strings. 
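As promised above, here is a rough sketch of what the XPath and RegEx settings do under the hood. It is not Octoparse code, just an illustration in Python with the lxml package; the HTML snippet and expressions are invented for the example:

    # Illustration only: XPath locates elements, a regex normalizes the text.
    # Assumes `pip install lxml`; the HTML below is made up.
    import re
    from lxml import html

    page = html.fromstring("""
        <div class="item"><span class="price">Price: $1,299.00</span></div>
        <div class="item"><span class="price">Price: $899.50</span></div>
    """)

    # XPath pinpoints the exact elements, like the XPath setting.
    raw = page.xpath('//div[@class="item"]/span[@class="price"]/text()')

    # A regular expression cleans the output, like the RegEx tool.
    prices = [re.search(r"[\d,]+\.\d{2}", t).group().replace(",", "") for t in raw]
    print(prices)  # ['1299.00', '899.50']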
[Pricing]
Octoparse is free to use if you don't need the Cloud Service, and unlimited page scraping is excellent compared with the other scrapers on the market. However, if you want its Cloud Service for more sophisticated scraping, it offers two paid editions: Standard Edition and Professional Edition.
Both editions provide the full set of featured scraping services:
Standard Edition: $75 per month when billed annually, or $89 per month when billed monthly.
    Standard Edition offers all featured functions.
    Number of tasks in the Task Group: 100
    Cloud Servers: 6
Professional Edition: $158 per month when billed annually, or $189 per month when billed monthly.
    Professional Edition offers all featured functions.
    Number of tasks in the Task Group: 200
    Cloud Servers: 14
To conclude, Octoparse is a feature-rich scraping software with reasonable subscription pricing; it is worth a try.


Mozenda is a cloud-based web scraping service. It provides many useful utility features for data extraction and allows users to upload the extracted data to cloud storage. Its featured services include:
Extract varied data formats
Mozenda is able to extract many types of data formats, though it is not that easy with data in irregular layouts.
Regex Setting
Users can normalize the extracted results using the Regex Editor within Mozenda. The editor is not easy to master, however; you may need to learn more about writing regular expressions first (a small example follows this list).
Data Export formats
It supports various types of export formats and transformations.
AJAX Setting
Mozenda is able to extract AJAX-supplied data and set a timeout.
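To give a feel for the kind of regular expressions a Regex Editor expects, here is a tiny standalone Python example; the input string and patterns are purely illustrative:

    # Illustration only: collapse messy whitespace, then normalize a date.
    import re

    raw = "Posted:   March  5,   2017  "
    text = re.sub(r"\s+", " ", raw).strip()            # "Posted: March 5, 2017"

    m = re.search(r"(\w+) (\d{1,2}), (\d{4})", text)
    if m:
        month, day, year = m.groups()
        months = ["January", "February", "March", "April", "May", "June", "July",
                  "August", "September", "October", "November", "December"]
        print(f"{year}-{months.index(month) + 1:02d}-{int(day):02d}")  # 2017-03-05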
[Pricing]
Mozenda users pay for Page Credits: one Page Credit equals one request to a website to load a single web page. Each subscription plan comes with a fixed number of pages included in the monthly price, so pages beyond that limit are charged additionally, and cloud storage allowances vary by edition. Two editions are offered for Mozenda.


Import.io is a web-based platform for extracting data from websites without writing any code. Users build extractors by simple point-and-click, and Import.io automatically extracts the data from web pages into a structured dataset. It serves users with a number of characteristic features:
Authentication
Extract data from behind a login/password
Cloud Service
Use the SaaS platform to store the extracted data.
Parallelized data acquisition is distributed automatically across a scalable cloud architecture.
API Access
Integration with Google Sheets, Excel, Tableau and many others.
[Pricing]
Import.io charges subscribers based on the number of extraction queries per month, so users had better estimate their query count before subscribing. (One query equals one page URL.)
There are three Paid Editions offered by Import.io:
Essential Edition: $199 per month when billed annually, or $299 per month when billed monthly.
Essential Edition offers all featured functions.
Essential Edition offers up to 10,000 queries per month.
    
Professional Edition: $349 per month when billed annually, or $499 per month when billed monthly.
Professional Edition offers all featured functions.
Professional Edition offers up to 50,000 queries per month.

Enterprise Edition: $699 per month when billed annually, or $999 per month when billed monthly.
Enterprise Edition offers all featured functions.
Enterprise Edition offers up to 400,000 queries per month.

Content Grabber is one of the most feature-rich web scraping tools. It is better suited to people with advanced programming skills, since it offers powerful script-editing and debugging interfaces for those who need them. Users can write regular expressions in C# or VB.NET directly instead of generating the matching expression with a built-in RegEx tool, as Octoparse does. The features covered by Content Grabber include:
Debugger
Content Grabber has a debugger that helps users build reliable agents and resolve issues effectively.
Visual Studio 2013 Integration
Content Grabber can integrate with Visual Studio 2013 for the most powerful script editing, debugging and unit testing features.

Custom Display Templates

Custom HTML display templates allow you to remove the default promotional messages and add your own designs to the screens - effectively allowing you to white-label your self-contained agent.

Programming Interface

The Content Grabber API can be used to add web automation capabilities to your own desktop and web applications. The web API requires access to the Content Grabber Windows service, which is part of the Content Grabber software and must be installed on the web server or on a server accessible to it.
[Pricing]
Content Grabber offers two purchasing methods:
Buy License: Buying any Content Grabber license outright gives you a perpetual license.
For license users, three editions are available:
Server Edition:
This basic edition provides only a limited Agent Editor. The total cost is $449.
Professional Edition:
It serves users with the full-featured Agent Editor; however, the API is not available. The price is $995.
Premium Edition:
This advanced edition provides all the featured services within Content Grabber; accordingly, it charges a higher price of $2,495.

Monthly Subscription: Users who sign up to a monthly subscription will be charged upfront each month for the edition they choose.
For subscribers, the same three editions are available:
Server Edition:
This basic edition provides only a limited Agent Editor. The cost is $69 per month.
Professional Edition:
It serves users with the full-featured Agent Editor; however, the API is not available. The price is $149 per month.
Premium Edition:
This advanced edition provides all the featured services within Content Grabber; accordingly, it charges a higher price of $299 per month.

Conclusion

In this post, five automated web scrapers have been evaluated from various perspectives. Most of these scrapers can satisfy users' basic scraping needs, and some, like Octoparse and Content Grabber, provide more advanced functionality to help users extract matching results from tough websites using built-in RegEx and XPath tools and proxy servers. Users without programming skills, however, are advised against tools that depend on custom scripts (Visual Web Ripper, Content Grabber, etc.). In the end, which scraper to choose depends entirely on your individual requirements, so make sure you have an overall understanding of a scraper's features before you subscribe to it. Lastly, check the feature comparison chart if you are seriously considering subscribing to a data extraction service provider. Happy data hunting!
- See more at: Octoparse Blog


Sunday, February 26, 2017

Facebook Data Mining


Mining data from Facebook has been quite popular and useful in the past few years. The crawled or scraped data is valuable and constructive for commercial, scientific, and many other fields of prediction and analysis, especially once it is processed further through steps such as data cleansing and machine learning. Without a doubt, data mining, which serves as the base tier of the whole data process, is of paramount importance.
Since data enthusiasts express such intense interest in Facebook's data, Facebook provides a website for developers to access it. The site offers many simple, easy-to-grasp methods with detailed guidelines for users to learn from and to access its resources.
The Facebook API, known as the Graph API, is a REST (Representational State Transfer) interface built on the architecture of the web. This means Facebook exposes its functions as remote calls: HTTP methods such as GET and POST send messages, and the REST service echoes responses back.
Take the Facebook page of Coca-Cola Corp. as an example. If users want to retrieve the remarks posted on its wall, they simply need to request https://graph.facebook.com/cocacola/feed, and the system returns the results as JSON. JSON (JavaScript Object Notation) is a data exchange format that is easy for users to handle and easy for machines to parse and generate. The data fields include the message ID, detailed info, author, author ID, and other kinds of info. Not only the wall but all other Facebook objects can be retrieved through the same URL structure, https://graph.facebook.com/OBJECT_ID/CONNECTION_TYPE; an invalid request returns an error response like this:
    {
       "error": {
          "message": "Unknown path components: /CONNECTION_TYPE",
          "type": "OAuthException",
          "code": 2500,
          "fbtrace_id": "AU3Q0qQUX1/"
       }
    }
Here, note that we can only access the data when the objects are public; if the objects are private, we must provide an access token.
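As a concrete illustration, the same feed request can be issued from Python with the requests package. Note that current versions of the Graph API require an access token even for public objects; the token below is a placeholder you must supply yourself:

    # Fetch a public page's feed from the Graph API; `requests` required.
    # ACCESS_TOKEN is a placeholder obtained from developers.facebook.com.
    import requests

    ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
    url = "https://graph.facebook.com/cocacola/feed"

    resp = requests.get(url, params={"access_token": ACCESS_TOKEN}, timeout=10)
    data = resp.json()

    if "error" in data:
        print("Graph API error:", data["error"]["message"])
    else:
        for post in data.get("data", []):
            print(post.get("id"), "-", post.get("message", "")[:60])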
R users should be happy to hear that there is an R package known as the Rfacebook package. It provides an interface to the Facebook API: for mining Facebook with R, Rfacebook supplies functions that let R retrieve information about posts, comments, likes, groups that mention specific keywords, and much more, with commands such as searchPages() for searching pages. Apart from R, many people are used to Python, and here are some tips for reference. First, check out the documentation on Facebook's Graph API at https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, do read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp). Once you grasp the concepts, start using the API. For Python, there are several alternatives:
  • facebook/python-sdk https://github.com/facebook/python-sdk 
  • pyFaceGraph https://github.com/iplatform/pyFaceGraph/
  • It is also fairly simple to write your own HTTP client that uses the Graph API.
Users are advised to check out these Python libraries, try the examples in their documentation, and see whether they already do what you need. Compared with R, Python can simplify the data-processing procedure by saving time on code management, output, and note files, while R is better for optimizing graph visualization, since users can visualize their friends on Facebook.
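For instance, with the facebook-sdk client from the list above, fetching a page and its feed looks roughly like this (a sketch based on that library's documented GraphAPI class; the token is again a placeholder):

    # Sketch using facebook-sdk (`pip install facebook-sdk`).
    import facebook

    graph = facebook.GraphAPI(access_token="YOUR_ACCESS_TOKEN")

    page = graph.get_object("cocacola")               # basic page info
    feed = graph.get_connections("cocacola", "feed")  # posts on the wall

    print(page.get("name"))
    for post in feed.get("data", []):
        print(post.get("id"), post.get("message", "")[:60])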
There are also data extraction tools that let people without any programming skills scrape or crawl data from Facebook, such as Octoparse and Visual Scraper.
Octoparse:                                                                          
Octoparse is a powerful web scraper that can scrape both static and dynamic websites with AJAX, JavaScript, cookies, etc. First, download the client and then start your scraping tasks. This software requires no programming skills, though you should learn the rules it provides to help users extract data. Plus, it offers a cloud service and proxy-server settings to prevent IP blocking and accelerate the extraction process.

To learn more, please visit http://www.octoparse.com/

Visual Scraper:
Visual Scraper is another great free web scraper with a simple point-and-click interface that can be used to collect data from the web. You can get real-time data from several web pages and export the extracted data as CSV, XML, JSON or SQL files. The freeware, available for Windows, lets a single user scrape data from up to 50,000 web pages. Besides the SaaS, Visual Scraper offers web scraping services such as data delivery and building software extractors.

 If you want to know more, please visit http://www.visualscraper.com/pricing


Author: The Octoparse Team
- See more at: Octoparse Tutorial


Monday, January 23, 2017

Speed up Cloud Extraction (2)

(from http://www.octoparse.com/tutorial/scraping-hotel-reviews-from-tripadvisorcom/)

In Speed up Cloud Extraction (1), you learned how to speed up Cloud Extraction by telling the program to split one task into multiple subtasks. When you use the “Fix list”, “List of URLs” or “Text list” loop mode, the task is automatically split into multiple subtasks on the cloud platform.

In this tutorial, you will learn how to speed up Cloud Extraction in Octoparse by optimizing pagination.

When you configure pagination by clicking the “Next” button, a “Click to paginate” action is auto-generated in the workflow. Since you clicked a single element, “Single element” loop mode becomes the default, and this mode cannot be split up on the cloud platform. Assume a task is meant to extract URLs from a list page:

Let’s say that opening a page takes 5s, extracting data takes 2s, and clicking “Next Page” takes 3s. The extraction process on the cloud platform will look like this:
(Note: Octoparse’s cloud servers extract data simultaneously.)

In this case, Cloud Extraction would be very slow, since pagination always takes 3s for each data field scraped. To optimize pagination, you will need to use a splittable loop mode on pagination: “List of URLs” or “Text list”.

1. Query string pagination - Use “List of URLs” loop mode

Query-string pagination is a simple URL with a query-string parameter: “page=1, page=2, page=3...”.
For query-string pagination, use “List of URLs” loop mode by pasting in the URLs, instead of creating a “Click to paginate” action by clicking the “Next” button (a sketch for generating the URL list follows the steps below).

Step 1. Create a list of URLs and copy them.

Step 2. Drop a “Loop Item” into the Workflow designer.

Step 3. Select “List of URLs” loop mode and paste the URLs in the text box. Then click “OK” & “Save”.
Then continue configuring the task for Cloud Extraction.
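If the site follows the query-string pattern above, the URL list itself can be generated in a couple of lines rather than typed by hand; a small Python sketch (the base URL is a placeholder):

    # Generate the page URLs to paste into the "List of URLs" loop.
    base = "http://www.example.com/products?page={}"
    urls = [base.format(n) for n in range(1, 101)]  # pages 1-100
    print("\n".join(urls))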

2. Jumping to a specific page - Use “Text list” loop mode

When a website allows visitors to enter a page number and jump to that specific page, use “Text list” loop mode to enter the page numbers (they can be generated as in the sketch below).

Step 1. Create page numbers and copy them.

Step 2. Drop a “Loop Item” into the Workflow designer.

Step 3. Select “Text list” loop mode and paste the text in the text box. Then click “OK” & “Save”.
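As referenced above, the page numbers for the text list can likewise be generated instead of typed out:

    # Generate page numbers 1-100 to paste into the "Text list" loop.
    print("\n".join(str(n) for n in range(1, 101)))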

Once you optimize the pagination, the Cloud Extraction process takes 3s less for each data field scraped.

(After)

(Before)
Once we know how to optimize pagination by switching to a different loop mode, we can make Cloud Extraction a lot faster.


Author: The Octoparse Team


Thursday, January 19, 2017

Web Scraping | Use Proxy Servers for Web Scraping


Web scrapers, or spiders, are becoming more and more popular in data science. This automated technique can retrieve loads of customized data from the web or from databases. The major issue, however, is that too many page requests in too short a period from a single IP address are easily traced by a website, and the scraper gets blocked. To limit the chances of getting blocked, we should avoid scraping a website from a single IP address; normally, we use proxy servers so that requests routed through the crawling server come from a pool of distinct proxy IP addresses.
When it comes to proxy servers, reliability should always come first. There are roughly a thousand places to buy proxies, and some unreliable ones churn so fast that they get themselves blocked. Other approaches outsource the IP rotation entirely (think proxies as a service), but these usually come at a higher cost, since you pay both for purchasing the proxy and for re-implementing it each time you buy a new one. More often than not, reliability does come at a cost: “free” tends to be very unreliable, “cheap” somewhat unreliable, and “more expensive” usually comes at a premium. This is why the cloud-based data extraction concept has been proposed recently.
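To make the idea concrete, here is a minimal sketch of rotating requests through a proxy pool with Python's requests package; the proxy addresses are placeholders for ones you would buy or rent:

    # Rotate every request through a different proxy so no single IP
    # hits the site too often. The proxy addresses are placeholders.
    import itertools
    import requests

    PROXIES = [
        "http://user:pass@10.0.0.1:8080",
        "http://user:pass@10.0.0.2:8080",
        "http://user:pass@10.0.0.3:8080",
    ]
    pool = itertools.cycle(PROXIES)

    def fetch(url):
        proxy = next(pool)  # a different proxy for each request
        try:
            return requests.get(url, proxies={"http": proxy, "https": proxy},
                                timeout=10)
        except requests.RequestException:
            return None  # unreliable proxies fail often; skip and move on

    for page in range(1, 6):
        resp = fetch("http://www.example.com/list?page={}".format(page))
        if resp is not None:
            print(page, resp.status_code)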
Cloud-based web scraping is a true cloud service: it can run from any OS and any browser, we don't have to host anything ourselves, and everything is done in the cloud. All the page views, data formatting and transformation are handled on someone else's servers, while the web-proxy requirements can still be managed by ourselves. On the cloud side, the machines are independent; they can be accessed and run, without any installation, from any PC with Internet access around the world. Such a service manages our data on serious back-end hardware; more specifically, we can use its anonymous-proxy feature, which rotates large pools of IP addresses to keep the target website from blocking us. In practice, a more succinct and efficient approach is to use a data scraper tool with a cloud-based service, such as Octoparse or Import.io: these tools can schedule and run your task at any time in the cloud, with many machines working simultaneously, and they also provide a fast way to configure proxy servers manually as needed.
Some popular scraper tools in the market include Octoparse, Import.io, Webhose.io, Screen Scraper.
Octoparse is a powerful, free data scraper that can scrape almost all websites. Its cloud-based extraction provides a rich pool of rotating proxy IP addresses for web scraping, which limits the chances of getting blocked and saves much of the time otherwise spent on manual configuration. It offers precise instructions and clear guidelines for the scraping steps, and basically requires no coding skills. If you want to deepen and strengthen your crawling and scraping, it also offers a public API. Besides, its support team is efficient and available.
Import.io is also an easy-to-use data scraper, with a succinct, effective user interface and simple navigation. It likewise requires little coding skill. Import.io possesses many powerful features as well, such as a cloud-based service that takes care of scheduled tasks and levels up mining ability through rotating IP addresses. However, Import.io has a hard time navigating combinations of JavaScript/POST.
Webhose.io is a browser-based data scraping tool that uses various crawling techniques to crawl large amounts of data from multiple channels. It may not perform as well as the previously introduced tools when it comes to cloud service, meaning that handling IP rotation or proxy configuration during scraping might be somewhat complex. Both free and paid plans are provided as you need.
Screen Scraper is pretty neat and can wrestle with difficult tasks, including precise localization, navigation and data extraction; however, it requires basic programming/tokenization skills if you want it to perform at its utmost. That implies configuring the settings and setting parameters manually most of the time: the pro is that you can customize a distinct mining process of your own, while the cons are that it is a bit time-consuming and complex, and also rather expensive.



Author: The Octoparse Team



Download Octoparse Today


For more information about Octoparse, please click here.
Sign up today!


