Thursday, April 20, 2017

Top 5 Web Scraping Tools Review

Web scraping (also known as web crawling or web data extraction) is meant for extracting data from websites. Normally, there are two options for crawling websites: we can build our own crawlers by coding, or use public APIs where they are offered. Alternatively, web scraping can be done with automated web scraping software, which runs an automated process implemented by a bot or web crawler. The data extracted from web pages can be exported in various formats or into different types of databases for further analysis.
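As a rough illustration of the first option, a minimal hand-rolled crawler in Python might look like the sketch below; the URL and CSS selectors are placeholders rather than a real target site.

```python
# A minimal hand-rolled crawler: fetch a page, pull out structured fields,
# and export them for further analysis. URL and selectors are placeholders.
import csv

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select(".product"):  # placeholder CSS selector
    name = item.select_one(".name")
    price = item.select_one(".price")
    rows.append({
        "name": name.get_text(strip=True) if name else "",
        "price": price.get_text(strip=True) if price else "",
    })

# Export the extracted data as CSV (one of the formats mentioned above).
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```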
There are many web scraping tools to be found online. In this post, I would like to share some popular automated scrapers that people think well of, and walk through their respective featured services.


Visual Web Ripper is an automated web scraping tool that supports a variety of features. It works well on certain tough, difficult-to-scrape websites thanks to advanced techniques such as running scripts, which requires users to have programming skills.
This scraping tool has a user-friendly interactive interface that helps users grasp the basic operational process quickly. Its featured capabilities include:
Extract varied data formats
Visual Web Ripper is able to cope with difficult block layouts, especially web elements displayed on the web page without a direct HTML association.
AJAX
Visual Web Ripper is able to extract AJAX-supplied data.
Login Required
Users can scrape websites that require logging in first.
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB, Customized C# or VB script file output (if additionally programmed)
IP proxy servers
Proxy to hide IP-address
Even though it provides so many functionalities, it does not yet provide a cloud-based service. That means users can only install this application on a local machine and run it locally, which may limit the scale and efficiency of scraping when the demand for data is high.
Debugger
Visual Web Ripper has a debugger that helps users build reliable agents and resolve issues effectively.
[Pricing]
Visual Web Ripper charges users from $349 to $2,090 based on the number of subscribed user seats, and maintenance lasts for 6 months. Specifically, users who purchase a single seat ($349) can only install and use the application on a single computer; they will have to pay double or more to run it on other devices. If you have no problem with this kind of pricing structure, Visual Web Ripper could be one of your options.


Octoparse is a full-featured, non-coding desktop web scraper with many outstanding characteristics compared with other tools.
It provides users with some useful, easy-to-use built-in tools to extract data from tough or aggressive websites that are difficult to scrape.
Its UI is very user-friendly and designed in a rather logical way, so users won't have much trouble locating any function. Additionally, Octoparse visualizes the extraction process with a workflow designer that helps users stay on top of the scraping process for any task. Octoparse supports:
Ad Blocking
Ad Blocking optimizes tasks by reducing loading time and the number of HTTP requests.
AJAX Setting
Octoparse is able to extract AJAX-supplied data and set a timeout.
XPath setting
Users can modify the XPath to locate web elements more precisely using the XPath setting provided by Octoparse.
Regex Setting
Users can normalize the extracted data output using Octoparse's built-in Regex tool, which generates a matching regular expression automatically (a short sketch of what the XPath and Regex settings do follows this feature list).
Data Export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB
IP proxy servers
Proxy to hide IP-address
Cloud Service
Octoparse provides a cloud-based service that speeds up data extraction - 4 to 10 times faster than Local Extraction. When users run Cloud Extraction, 4 to 10 cloud servers are assigned to their extraction tasks. This frees users from long maintenance times and certain hardware requirements.
API Access
Users can create their own API that returns data formatted as XML strings.
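To make the XPath and Regex settings above more concrete, here is a minimal Python sketch of the kind of locating and normalizing those settings automate behind Octoparse's point-and-click interface; the HTML snippet, XPath expression and regular expression are invented for illustration.

```python
# Illustration of locating an element with XPath and normalizing its text
# with a regular expression; HTML, XPath and pattern are made up.
import re

from lxml import html

page = html.fromstring("""
<div class="listing">
  <span class="price">Price: $1,299.00 USD</span>
</div>
""")

# Locate the element precisely with an XPath expression.
raw = page.xpath("//div[@class='listing']/span[@class='price']/text()")[0]

# Normalize the raw text: keep only the numeric amount.
match = re.search(r"\$([\d,]+\.\d{2})", raw)
price = float(match.group(1).replace(",", "")) if match else None
print(price)  # 1299.0
```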
[Pricing]
Octoparse is free to use if you don't need the Cloud Service, and unlimited page scraping in the free tier is excellent compared with the other scrapers on the market. However, if you want to use its Cloud Service for more sophisticated scraping, it offers two paid editions: Standard Edition and Professional Edition.
Both editions provide the full set of featured scraping services.
Standard Edition: $75 per month when billed annually, or $89 per month when billed monthly.
    Standard Edition offers all featured functions.
    Number of tasks in the Task Group: 100
    Cloud Servers: 6
Professional Edition: $158 per month when billed annually, or $189 per month when billed monthly.
    Professional Edition offers all featured functions.
    Number of tasks in the Task Group: 200
    Cloud Servers: 14
To conclude, Octoparse is a feature-rich scraping tool with reasonable subscription pricing, and it is worth a try.


Mozenda is a cloud-based web scraping service. It provides many useful utility features for data extraction and allows users to upload the extracted data to cloud storage. Its featured services include:
Extract varied data formats
Mozenda is able to extract many types of data formats, though it is not that easy to handle data with an irregular layout.
Regex Setting
Users can normalize the extracted results using the Regex Editor within Mozenda. However, this Regex Editor is not that easy to handle, and you may need to learn more about how to write regular expressions.
Data Export formats
It supports various types of export transformations.
AJAX Setting
Mozenda is able to extract AJAX-supplied data and set a timeout.
[Pricing]
Mozenda users pay for Page Credits; a Page Credit is a single request to a website to load a web page. Each subscription plan comes with a fixed number of pages included in the monthly plan price, which means web pages beyond that limit are charged additionally. Cloud storage also varies between editions. Two editions are offered for Mozenda.


Import.io is a web-based platform for extracting data from websites without writing any code. Users build their extractors by simple point & click, and Import.io automatically extracts data from web pages into a structured dataset. It serves users with a number of characteristic features:
Authentication
Extract data from behind a login/password
Cloud Service
Use the SaaS platform to store the extracted data.
Parallelized data acquisition is distributed automatically by a scalable cloud architecture.
API Access
Integration with Google Sheets, Excel, Tableau and many others.
[Pricing]
Import.io charges subscribers based on the number of extraction queries per month, so users had better reckon up the number of queries they will need before subscribing. (A single query equals a single page URL.)
There are three Paid Editions offered by Import.io:
                                      
Essential Edition: $199 per month when billed annually, or $299 per month when billed month-to-month.
Essential Edition offers all featured functions.
Essential Edition offers users up to 10,000 queries per month.

Professional Edition: $349 per month when billed annually, or $499 per month when billed monthly.
Professional Edition offers all featured functions.
Professional Edition offers users up to 50,000 queries per month.

Enterprise Edition: $699 per month when billed annually, or $999 per month when billed monthly.
Enterprise Edition offers all featured functions.
Enterprise Edition offers users up to 400,000 queries per month.

Content Grabber is one of the most feature-rich web scraping tools. It is better suited to people with advanced programming skills, since it offers many powerful script editing and debugging interfaces for those who need them. Users can write regular expressions in C# or VB.NET instead of generating the matching expression with a built-in Regex tool as in Octoparse. The features covered by Content Grabber include:
Debugger
Content Grabber has a debugger that helps users build reliable agents and resolve issues effectively.
Visual Studio 2013 Integration
Content Grabber can integrate with Visual Studio 2013 for the most powerful script editing, debugging and unit testing features.

Custom Display Templates

Custom HTML display templates allow you to remove the default promotional messages and add your own designs to the screens - effectively allowing you to white-label your self-contained agents.

Programming Interface

The Content Grabber API can be used to add web automation capabilities to your own desktop and web applications. The web API requires access to the Content Grabber Windows service, which is part of the Content Grabber software and must be installed on the web server or on a server accessible to the web server.
[Pricing]
Content Grabber offers two purchasing methods:
                                   
Buy License: Buying any Content Grabber license outright gives you a perpetual license.
For license buyers, three editions are available:
Server Edition:
This basic edition only provides users with a limited Agent Editor. The total cost is $449.
Professional Edition:
It serves users with a full-featured Agent Editor; however, the API is not available. The price is $995.
Premium Edition:
This advanced edition provides all featured services within Content Grabber, at a correspondingly higher price of $2,495.

Monthly Subscription: Users who sign up for a monthly subscription are charged upfront each month for the edition they choose.
For subscribers, the same three editions are available:
Server Edition:
This basic edition only provides users with a limited Agent Editor. The cost is $69 per month.
Professional Edition:
It serves users with a full-featured Agent Editor; however, the API is not available. The price is $149 per month.
Premium Edition:
This advanced edition provides all featured services within Content Grabber, at a correspondingly higher price of $299 per month.

Conclusion

In this post, five automated web scraping tools have been evaluated from various perspectives. Most of these scrapers can satisfy users' basic scraping needs. Some of them, like Octoparse and Content Grabber, provide more advanced functionality to help users extract matching results from tough websites with their built-in Regex and XPath tools and proxy servers. Note that users without any programming skills are not advised to run custom scripts (as in Visual Web Ripper, Content Grabber, etc.). In any case, which scraper to choose depends entirely on your individual requirements, so make sure you have an overall understanding of a scraper's features before you subscribe to it. Lastly, check out the feature comparison chart below if you are giving serious thought to subscribing to a data extraction service provider. Happy data hunting!
(Feature comparison chart)
- See more at: Octoparse Blog


Tuesday, February 28, 2017

Website Crawler & Sentiment Analysis

        

To start with Sentiment Analysis, the first thing that comes to mind is where and how we can crawl oceans of data for our analysis. Normally, crawling the web, and social media in particular, is one reasonable way to get access to public opinion data. So in this post I want to share how I crawled a website with a web crawler and processed the data for Sentiment Analysis, in order to develop an application that ranks universities based on users' opinions crawled from a social media website - Twitter.
To crawl Twitter data, there are several methods we can adopt - build a web crawler on our own by programming, or choose an automated web crawler like Octoparse, Import.io, etc. We could also use the public APIs provided by certain websites to get access to their data sets.
First, as is well known, Twitter provides public APIs for developers to read and write tweets conveniently; the REST API identifies Twitter applications and users using OAuth. Knowing this, we can use the Twitter REST APIs to get the most recent and most popular tweets. I imported Twitter4j to crawl Twitter data through the REST API. Twitter data can be crawled for a specific time range, location, or other data fields, and the crawled data is returned in JSON format. Note that app developers need to register a Twitter application account to get authorized access to the Twitter API: the application makes a request to the POST oauth2/token endpoint to exchange its credentials for an access token, which gives it authenticated access to the REST API.
This mechanism allows us to pull user information from the data resource. We can then use the search function to crawl structured tweets related to university topics.
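For illustration, here is a minimal sketch of that flow in Python with the requests library (the original project used Twitter4j in Java); the consumer key and secret are placeholders for a registered Twitter application.

```python
# Sketch of Twitter's app-only OAuth2 flow plus the v1.1 search API using
# Python requests. CONSUMER_KEY / CONSUMER_SECRET are placeholders for the
# credentials of a registered Twitter application.
import requests

CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"

# Exchange the application credentials for a bearer token (POST oauth2/token).
token_resp = requests.post(
    "https://api.twitter.com/oauth2/token",
    auth=(CONSUMER_KEY, CONSUMER_SECRET),
    data={"grant_type": "client_credentials"},
)
bearer_token = token_resp.json()["access_token"]

# Search recent tweets about a university; results come back as JSON.
search_resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    headers={"Authorization": "Bearer " + bearer_token},
    params={"q": "Stanford University", "count": 100, "result_type": "recent"},
)
for tweet in search_resp.json().get("statuses", []):
    print(tweet["user"]["screen_name"], tweet["text"][:80])
```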
Then I generated the query set for crawling tweets. I collected the USNews 2016 university ranking data, which includes 244 universities and their rankings, and formatted the data fields I needed for crawling tweets as JSON.
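The figure showing the original query set is not reproduced here; purely as a sketch, with invented field names, building such a query set might look like this.

```python
# Hypothetical sketch: turn a university ranking list into a JSON query set.
# Field names are invented; the original query-set format is not shown here.
import json

usnews_2016 = [
    {"rank": 1, "name": "Princeton University"},
    {"rank": 2, "name": "Harvard University"},
    # ... 244 universities in total
]

query_set = [
    {"university": u["name"], "usnews_rank": u["rank"], "query": u["name"]}
    for u in usnews_2016
]

with open("query_set.json", "w", encoding="utf-8") as f:
    json.dump(query_set, f, indent=2)
```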

                                                      

                                                      
                                             

In total, I extracted 462,413 tweets, and for most universities the number of crawled tweets was below 2,000.
                        
So far, some may feel that the whole process of crawling Twitter is already troublesome. This method requires good programming skills and knowledge of regular expressions to crawl websites for structured, formatted data, which may be tough for someone without coding skills. Here, for your reference, I'd like to propose some automated web crawler tools which can help you crawl websites without any coding, such as Octoparse, Import.io and Mozenda. Based on my experience, I can share one crawler tool I have used - Octoparse. It is quite user-friendly and easy to use. You can download its desktop crawler software from http://www.octoparse.com/. This locally installed web crawler tool can automatically crawl data, like tweets, from target sites. Its UI is user-friendly, and you just need to learn how to configure the crawling settings of your tasks, which you can do by reading or watching the tutorials. The crawled data can be exported to various structured formats as needed, like Excel, CSV, HTML, TXT, and databases (MySQL, SQL Server, and Oracle). Plus, it provides IP rotation, which rotates IP addresses automatically so crawling leaves no trace for the target sites. This mechanism plays an important role, since it can prevent us from getting blocked by aggressive sites that don't allow people to crawl their data.

                

Back to the university ranking in my application. The ranking step parses tweets crawled from Twitter and ranks related tweets according to their relevance to a specific university. I want to keep only the most relevant tweets (top K) for Sentiment Analysis, which avoids trivial tweets that would make the results inaccurate. There are actually many ranking methods, such as ranking by TF-IDF similarity, text summarization, spatial and temporal factors, or machine-learned ranking; even Twitter itself provides a method based on time or popularity. However, we need a more advanced method that can filter out most spam and trivial tweets. To measure the trust and popularity of a tweet, I use the following features: retweet count, followers count, and friends count. I assume that a trustworthy tweet is posted by a trustworthy user, that a trustworthy user has enough friends and followers, and that a popular tweet has a high retweet count. Based on these assumptions, I built a model combining trust and popularity (a TP score) for each tweet and rank the tweets by TP score. It is worth noting that news reports usually have a high retweet count, and that kind of signal is useless for our Sentiment Analysis, so I assigned a relatively lower weight to this portion when computing the TP score, and designed a formula accordingly. Twitter's own method already considers the presence of query words and the recency of tweets, so the tweets we crawled have already been filtered by query words and posting time; all we need to consider is the retweet counts, followers counts, and friends counts.
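The formula itself appeared in a figure that is not reproduced here. Purely to illustrate the general shape described above - three count features combined with a smaller weight on retweets - a sketch might look like this; the log scaling and the weights are my assumptions, not the actual formula.

```python
# Illustrative only: not the original TP-score formula. It combines the three
# features named in the text, with a deliberately smaller weight on retweets.
import math

def tp_score(retweet_count, followers_count, friends_count,
             w_retweet=0.2, w_followers=0.4, w_friends=0.4):
    # Log scaling keeps a few viral news tweets from dominating the ranking.
    return (w_retweet * math.log1p(retweet_count)
            + w_followers * math.log1p(followers_count)
            + w_friends * math.log1p(friends_count))

# Rank crawled tweets by TP score, highest first, and keep the top K.
tweets = [
    {"text": "example tweet A", "retweet_count": 12,
     "followers_count": 800, "friends_count": 300},
    {"text": "example tweet B", "retweet_count": 5000,
     "followers_count": 90, "friends_count": 40},
]
top_k = sorted(
    tweets,
    key=lambda t: tp_score(t["retweet_count"], t["followers_count"], t["friends_count"]),
    reverse=True,
)[:100]
```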
                                        
I make my own university ranking according to public reputation, which is represented by the sentiment score. However, public reputation is only one of the factors that should be considered when evaluating a university, so I also want to present an overall ranking that combines the commercial ranking with our own ranking. There are three main types of tweet texts.

                                              

Sentiment Score: the sentiment score was calculated for public reputation. The positive rate of each university was used as the sentiment score for the public reputation ranking, defined as

Positive rate = (sum of pᵢ over all tweets) / (4 × n)

where n is the total number of tweets for each university and pᵢ is the positive polarity of tweet i (4 for a positive tweet). Note that the negative polarity was not considered since it is equal to zero.
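A minimal sketch of this computation, under the reconstruction above (polarity 4 for a positive tweet, 0 for a negative one):

```python
# Positive rate as reconstructed above: sum of polarities divided by 4n.
# With polarity 4 for positive and 0 for negative tweets, this is simply
# the share of positive tweets for a university.
def positive_rate(polarities):
    n = len(polarities)
    return sum(polarities) / (4 * n) if n else 0.0

print(positive_rate([4, 4, 0, 4, 0]))  # 0.6
```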
Having completed the Sentiment Analysis, I will proceed to build a classifier for Sentiment Analysis using a machine learning algorithm. I will discuss that algorithm, the Maximum Entropy classifier, further in the next post.
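As a preview, a Maximum Entropy classifier is equivalent to (multinomial) logistic regression, so a minimal sketch with scikit-learn could look like this; the tiny training set is invented for illustration.

```python
# Minimal sketch of a maximum-entropy (logistic regression) sentiment
# classifier with scikit-learn; the toy training data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Great campus and amazing professors",
    "Proud to study here",
    "Terrible housing and overpriced tuition",
    "Worst administration ever",
]
train_labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

print(clf.predict(["The library here is amazing"]))  # likely ['positive']
```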


Author: The Octoparse Team
- See more at: Octoparse Blog


Monday, February 27, 2017

Web scraping | Introduction to Octoparse XPath Tool

You can use Octoparse to scrape websites, but sometimes the output has missing data or the task does not work properly. A new XPath expression can often solve these problems and make the task work. Thus, it's necessary to master the Octoparse XPath Tool when scraping data from websites. With just a little effort, you can greatly improve your productivity.

Before reading the article, you can learn basic HTML & XPath knowledge in these documents.


In this tutorial you'll get to know how to use the Octoparse XPath Tool, with some syntax examples, when configuring your scraping task in Octoparse.

Location

There are two ways to open the Octoparse XPath Tool:

Method 1. All the actions except “Go To Web Page” in the Octoparse Workflow have the “Define ways to locate an item” option under the “Customize Field” setting. Select the “Define ways to locate an item” option and click on the Try XPath Tool link (lower left corner).
 
The Customized Field button in Extract Data
 
The Customized Field button in Enter Text
 
The Customized Field button in Click Item


 


Method 2. The third icon appears in the upper left corner of the Octoparse interface.
 

Main Interface
 

The main interface of Octoparse XPath Tool can be divided into 4 parts, as follows: 
1. The Browser part.
Enter the target URL in the built-in browser and click on the Go button. The content of the web page will be displayed here.
2. The Source Code part. View the source code of the web page.
3. The XPath Setting part. Check the options and fill in some parameters, then hit the Generate button to generate an XPath expression.
4. The XPath Result part. After the XPath is generated, click on the Match button to see if the current XPath finds elements on the webpage.

Note: The structure and hierarchy of the source code shown in the Octoparse XPath Tool is not clear. It's strongly recommended that you use the Firefox extension FirePath to check the web page source code. Check out the tutorial HERE to use FirePath.

XPath Setting

The XPath expression is generated automatically after you check an option and fill in some text in the XPath Setting part.
The XPath Setting part

Item Tag Name:
The blue text in the source code, such as SPAN, A, HR, BR, describes the tag name in the Firefox browser. You can check Item Tag Name and fill in the tag name of the elements you want to find.
 
The blue text represents the tag name.

Item Position:
The default value is 1, which represents the first item among its siblings.

Item ID, Item Name, Item Style Class:
Generally, a tag element has some attributes inside, such as an id attribute, name attribute or class attribute. You can check the options you need and fill in some text to better locate the elements. The id, name and class attributes are the most common ones; for other attributes, you can edit the generated XPath yourself by replacing the attribute name and related parameters copied from the original source code in the Firefox browser.
 
Pic: Check the Item ID option with some parameter and generate the XPath.
 
Pic: Copy the generated XPath and paste it into FirePath.

Pic: Replace data-action with id and replace 'sx-card-deck' with 'rot-B00D2PNANY'.

Item Text:
The black text in the source code describes all the text information in the Firefox browser. You have to make sure all the text is included (even blank spaces, punctuation, and full-width or half-width characters) when filling in the parameter. You can just copy the text inside the angle brackets from the original source code.
 

Item Text Contains:
Octoparse will generate an XPath that contains the specified text and find all the elements containing that text.

Item Text Start With:
Similarly, Octoparse will generate an XPath based on the beginning of the text and find all the elements whose text begins with it.

After you check an option and fill in the parameter, click on the Generate button and an XPath will be created and displayed in the XPath Result part. (A small lxml sketch after the list below illustrates the kinds of XPath expressions these options correspond to.)

To the left of the Generate button are four buttons used to find elements near the information you really want. Generally you won't use these buttons; they are usually used after an XPath has been created.

Child: selects the child node.
Parent: selects the parent node.
Previous: selects the “preceding-sibling::node()”.
Next: selects the “following-sibling::node()”.
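To make the options above more concrete, here is a small Python/lxml sketch; the HTML fragment is invented, and each XPath expression mirrors one of the settings or buttons described in this tutorial.

```python
# Small lxml demo of the kinds of XPath expressions the settings above
# generate; the HTML fragment is invented for illustration.
from lxml import html

doc = html.fromstring("""
<div id="reviews" class="review-list">
  <span class="title">Great product</span>
  <span class="title">Really great value</span>
  <a href="/next">Next page</a>
</div>
""")

print(doc.xpath("//div[@id='reviews']"))                 # Item ID
print(doc.xpath("//span[@class='title']"))               # Item Style Class
print(doc.xpath("//span[1]"))                            # Item Position (first sibling)
print(doc.xpath("//span[text()='Great product']"))       # Item Text (exact match)
print(doc.xpath("//span[contains(text(),'great')]"))     # Item Text Contains
print(doc.xpath("//span[starts-with(text(),'Great')]"))  # Item Text Start With
print(doc.xpath("//span[1]/parent::node()"))             # Parent button
print(doc.xpath("//span[1]/following-sibling::node()"))  # Next button
```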
 

Conclusion

There are many ways to find information on a web page; that is, you can write different XPath expressions to locate the web elements, or ask an expert for help.
- See more at: Octoparse Tutorial
