Tuesday, February 28, 2017

Free Online Web Crawler Tool


With the growing demand for data, more and more people have begun crawling the web to get access to oceans of data. Web crawling therefore plays an increasingly important role in helping people fetch the data they need. There are three common ways to crawl web data: using the public APIs provided by the target websites, programming and building a crawler on your own, or using an automated web crawler tool. Based on my own experience, I will mainly discuss several free online web crawler tools in the following sections for the reference of web crawler beginners.
Before introducing these online web crawler tools, let's first ask what a web crawler is meant for. A web crawler tool is designed to scrape or crawl data from websites; we can also call it a web harvesting or data extraction tool. It automates the crawling process at high speed and can harvest data on a large scale. Users are not required to know how to code; they only need to learn the configuration rules of the particular crawler tool.

Online web crawlers are especially useful when you want to gather information and put it into a usable form. The URL list can be stored in a spreadsheet and expanded into a dataset over time on the cloud platform, which means the scraped data can be merged into an existing database through the online web service.

Below I'd like to propose several free online web crawlers for your reference. These are only suggestions: anyone choosing a web crawler tool should first learn about its detailed functionality and then select the one that fits their requirements.

Import.io

Import.io now provides an online web scraping service. Data storage and the related processing are all based on its cloud platform. To activate the service, the user needs to add a web browser extension. The user interface of Import.io is easy to handle: users click and select the data fields to crawl the data they need. For more detailed instructions, users can visit the official website for tutorials and assistance. Import.io can also customize a dataset for pages with no data in the existing IO library by accessing its cloud-based library of APIs.
Its cloud service provides data storage and related data-processing controls on the cloud platform, so the scraped data can be added to existing databases, libraries, and so on.


Scraper Wiki

Scraper Wiki caps its free online accounts at a fixed maximum number of datasets. The good news for all users is that the free service is just as polished as the paid one, and the company has also committed to providing journalists with premium accounts at no cost. Their free online web scraper recently added support for PDF tables, although this PDF feature does not work well yet: cutting and pasting from it is still awkward in practice. Scraper Wiki has also added more advanced options, releasing editions of its application in other programming languages, such as Python, Ruby, and PHP, for better flexibility across operating system platforms.


Dexi.io

The CloudScrape service in Dexi.io is meant for regular web users and is committed to providing high-quality cloud-based scraping. It offers IP proxies and a built-in CAPTCHA-solving feature, which help users scrape most websites. Users can learn CloudScrape simply by pointing and clicking, even as beginners or amateurs. Cloud hosting makes it possible to store all the scraped data in the cloud, and an API lets users monitor and remotely manage their web robots. Its CAPTCHA-solving option sets CloudScrape apart from services like Import.io or Kimono. The service also provides a wide variety of data integrations, so extracted data can automatically be uploaded through (S)FTP or into your Google Drive, Dropbox, Box, or AWS, completing the data integration seamlessly.
Apart from these free online web crawler tools, there are other reliable web crawlers that provide an online service but may charge for it.

 

Octoparse

Octoparse is known as a Windows desktop web crawler application, but it provides reliable online crawling services as well. For its cloud-based service, Octoparse offers at least 6 cloud servers that run users' tasks concurrently, and it supports cloud data storage and other advanced cloud options. Its UI is very user-friendly, and there are plenty of tutorials on its website that teach users how to configure tasks and build crawlers on their own.

 




Author: The Octoparse Team

- See more at: Octoparse Blog

Price Scraping


Scraping data from websites is nothing new. In the commercial field, large amounts of scraped data can be used for business analysis. As is well known, we can scrape details such as price, stock, and rating across various data fields to monitor how items change, and this scraped data can further help analysts and sellers evaluate potential value and make better-informed decisions.
However, there are websites that we can't simply scrape from. More precisely, even when these sites provide APIs, there are still data fields that we cannot scrape or are not authorized to access. For example, Amazon provides a Product Advertising API, but the API itself does not expose all the information displayed on a product page, such as the price. In this case, the only way to scrape more data, say the price field, is to build our own scraper by programming or to use an automated scraper tool.
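As a rough illustration of the build-your-own route, here is a minimal Python sketch using requests and BeautifulSoup; the product URL and the ".price" selector are placeholders for illustration, since every site names its price element differently.

    # A minimal do-it-yourself price scraper (requests + BeautifulSoup).
    # PRODUCT_URL and the ".price" selector are placeholders; real product pages
    # need their own selectors and may require JavaScript rendering as well.
    import requests
    from bs4 import BeautifulSoup

    PRODUCT_URL = "https://www.example.com/product/123"  # placeholder URL

    resp = requests.get(PRODUCT_URL, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")

    price_tag = soup.select_one(".price")  # placeholder CSS selector
    if price_tag is not None:
        print("Price:", price_tag.get_text(strip=True))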
Sometimes, even if we know how to scrape data by programming in Ruby or Python, we may still fail to scrape the data for various reasons. In most cases, we get blocked by certain websites because our repeated scraping requests within a very short period of time look suspicious to the target sites. If so, we may need to use IP proxies so that our requests leave through rotating IPs and cannot be traced back by the target sites.
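As a minimal sketch of what such IP rotation might look like in Python with the requests library; the proxy addresses below are placeholders, and in practice you would plug in a real proxy pool or proxy service.

    # Rotate requests through a small pool of proxies with polite delays so that
    # repeated fetches are less likely to be flagged by the target site.
    # The proxy addresses are placeholders (TEST-NET range), not real servers.
    import random
    import time
    import requests

    PROXIES = [
        "http://203.0.113.10:8080",
        "http://203.0.113.11:8080",
        "http://203.0.113.12:8080",
    ]

    def fetch(url):
        proxy = random.choice(PROXIES)          # pick a different exit IP each time
        time.sleep(random.uniform(2, 5))        # pause between requests
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)

    # Usage: html = fetch("https://www.example.com/product/123").text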
The solutions described above require familiarity with coding and some more advanced technical knowledge; otherwise they can be tough or even impossible tasks to complete. To make scraping accessible to most people, I'd like to list several scraper tools that can help you scrape commercial data, including price, stock, and reviews, in a structured way, efficiently and quickly.

Octoparse

I once used this scraper tool to scrape many websites, such as Facebook, eBay, and Priceline, for data including prices, reviews, and comments. This scraper truly suits scraping a wide variety of data from most websites. Users don't need to know how to program; they only need to learn how to configure their tasks. The configuration is easy to grasp and the UI is very user-friendly, as you can see in the figure below. In the Workflow Designer pane you point and drag the functional visual blocks; the tool then simulates human browsing behavior and scrapes the structured data you need. With this scraper, you can also enable proxy IPs simply by setting certain Advanced Options, which is efficient and fast. After completing the configuration, you can scrape the data you need, including prices and reviews.


The extraction of hundreds of records or more can be completed within seconds. You can scrape any data type you want, and the results are returned as data tables like the figure below, which includes scraped prices and customer ratings. Note that there are two editions of the Octoparse scraping service: the Free Edition and the Paid Edition. Both cover basic scraping needs, meaning users can scrape data and export it in various formats, such as CSV, Excel, HTML, TXT, and databases (MySQL, SQL Server, and Oracle). If you want to scrape data at a much faster speed, you can upgrade your free account to a paid plan, where the Cloud Service is available: at least 4 cloud servers in the Octoparse Cloud Service will work on your task simultaneously.

 

Additionally, Octoparse offers a scraping/crawling service, which means you can describe your scraping needs and requirements and pay them to scrape the data for you.

Import.io

Import.io is also known as a web crawler that covers all different levels of crawling needs. It offers a Magic tool that can convert a site into a table without any training sessions, and it suggests that users download its desktop app if more complicated websites need to be crawled. Once you've built your API, it offers a number of simple integration options such as Google Sheets, Plot.ly, and Excel, as well as GET and POST requests. It also provides proxy servers so that you can avoid being detected by certain target websites and scrape as much data as you need. The tool is not hard to use at all: the UI of Import.io is quite friendly, and you can follow the official tutorials to learn how to configure your own scraping tasks. When you consider that all this comes with a free-for-life price tag and an awesome support team, Import.io is a clear first port of call for those on the hunt for structured data. It also offers a paid enterprise-level option for companies looking for larger-scale or more complex data extraction.


ScrapeBox

SEO experts, online marketers, and even spammers should be very familiar with ScrapeBox and its user-friendly UI. Users can easily harvest data from a website to grab emails, check page rank, verify working proxies, and handle RSS submission. By using thousands of rotating proxies, you can sneak a look at a competitor's site keywords, do research on .gov sites, harvest data, and post comments without getting blocked or detected.



Author: The Octoparse Team
- See more at: Octoparse Blog

Website Crawler & Sentiment Analysis


To start with sentiment analysis, the first question is where and how we can crawl oceans of data for our analysis. Normally, crawling social media websites is a reasonable way to get access to public opinion data. In this post, I'd like to share how I crawled a website with a web crawler and then processed the data for sentiment analysis, in order to build an application that ranks universities based on users' opinions crawled from a social media site, Twitter.
To crawl Twitter data, there are several methods we can adopt: build a web crawler on our own by programming, choose an automated web crawler such as Octoparse or Import.io, or use the public APIs that certain websites provide to access their data sets.
First, as is well known, Twitter provides public APIs so that developers can read and write tweets conveniently, and the REST API identifies Twitter applications and users using OAuth. Knowing this, we can use the Twitter REST APIs to get the most recent and popular tweets; in my case, Twitter4J was imported to crawl Twitter data through the REST API. Twitter data can be crawled by specific time range, location, or other data fields, and the crawled data is returned in JSON format. Note that app developers need to register a Twitter application account in order to get authorized access to the Twitter API. The application makes a request to the POST oauth2/token endpoint to exchange its credentials for an access token, which grants authenticated access to the REST API.
This mechanism allows us to pull user information from the data source. We can then use the search function to crawl structured tweets related to university topics.
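For readers who prefer Python to Twitter4J, the same application-only (OAuth2) flow can be sketched with the requests library; the consumer key and secret below are placeholders, and the query is just an example university.

    # Application-only OAuth2 access to the Twitter REST API (v1.1), sketched in Python.
    # CONSUMER_KEY / CONSUMER_SECRET are placeholders for your app credentials.
    import requests

    CONSUMER_KEY = "YOUR_CONSUMER_KEY"
    CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"

    # Exchange the app credentials for a bearer token (POST oauth2/token).
    token_resp = requests.post(
        "https://api.twitter.com/oauth2/token",
        auth=(CONSUMER_KEY, CONSUMER_SECRET),
        data={"grant_type": "client_credentials"},
    )
    bearer_token = token_resp.json()["access_token"]

    # Search recent tweets mentioning a university; results come back as JSON.
    search_resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        headers={"Authorization": "Bearer " + bearer_token},
        params={"q": "Stanford University", "count": 100, "lang": "en"},
    )
    for tweet in search_resp.json().get("statuses", []):
        print(tweet["user"]["screen_name"], tweet["retweet_count"], tweet["text"][:80])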
Then I generated the query set used to crawl tweets, shown in the figure below. I collected the US News 2016 university ranking data, which includes 244 universities and their rankings, and customized the data fields I needed for crawling tweets into JSON format.

In total, I extracted 462,413 tweets, and for most universities the number of tweets crawled was less than 2,000.
So far, some readers may feel that the whole process of crawling Twitter is already troublesome. This method requires good programming skills and knowledge of regular expressions in order to crawl websites for structured, well-formatted data, which may be tough for someone without any coding background. For your reference, there are automated web crawler tools that can help you crawl websites without writing any code, such as Octoparse, Import.io, and Mozenda. Based on my own experience, I can share one crawler tool I have used: Octoparse. It is quite user-friendly and easy to use; you can download the desktop crawler software from http://www.octoparse.com/. This locally installed web crawler tool can automatically crawl data, such as tweets, from target sites. Its UI is user-friendly, as shown below, and you just need to learn how to configure the crawling settings of your tasks by reading or watching the tutorials. The crawled data can be exported to various structured formats as you need, such as Excel, CSV, HTML, TXT, and databases (MySQL, SQL Server, and Oracle). In addition, it provides IP rotation, which rotates IP addresses automatically so that requests are not traced by the target sites. This mechanism plays an important role because it keeps us from getting blocked by aggressive sites that don't allow any crawling.


Back to the university ranking in my application. The ranking component parses the tweets crawled from Twitter and ranks them according to their relevance to a specific university. I want to filter the top-K highly related tweets for sentiment analysis, so that trivial tweets do not make the results inaccurate. There are many ranking methods, such as ranking by TF-IDF similarity, text summarization, spatial and temporal factors, or machine-learned ranking; Twitter itself provides ranking by time or popularity. However, we need a more advanced method that can filter out most spam and trivial tweets. To measure the trust and popularity of a tweet, I use the following features: retweet count, follower count, and friend count. I assume that a trustworthy tweet is posted by a trustworthy user, that a trustworthy user has enough friends and followers, and that a popular tweet has a high retweet count. Based on these assumptions, I built a model that combines trust and popularity into a TP score for each tweet and ranked the tweets by this score. It is worth noting that news reports usually have high retweet counts, and that kind of score is useless for our sentiment analysis, so I assigned a relatively lower weight to this portion when computing the TP score. I designed a formula along these lines (shown below). Because the Twitter search already considers the presence of query words and the recency of tweets, the tweets we crawled have been filtered by query words and posting time; all we still need to consider are the retweet counts, follower counts, and friend counts.
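Since the exact formula lived in the original figure, here is only a minimal sketch of how such a TP score could be computed; the weights and the log damping are assumptions for illustration, with the retweet portion deliberately weighted lower as described above.

    # A sketch of a trust-and-popularity (TP) score for ranking tweets.
    # The weights are illustrative assumptions, not the original formula;
    # retweets get a lower weight because highly retweeted news reports
    # contribute little to sentiment analysis. log1p damps very large counts.
    import math

    def tp_score(retweet_count, followers_count, friends_count,
                 w_retweet=0.2, w_followers=0.4, w_friends=0.4):
        popularity = w_retweet * math.log1p(retweet_count)
        trust = (w_followers * math.log1p(followers_count)
                 + w_friends * math.log1p(friends_count))
        return popularity + trust

    # Keep the top-K tweets by TP score for the sentiment analysis step.
    def top_k_tweets(tweets, k=100):
        return sorted(
            tweets,
            key=lambda t: tp_score(t["retweet_count"], t["followers_count"], t["friends_count"]),
            reverse=True,
        )[:k]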
I build my own university ranking according to public reputation, which is represented by the sentiment score. However, public reputation is only one of the factors that should be considered when evaluating a university, so I also present an overall ranking that combines a commercial ranking with our ranking. There are three main types of tweet texts, as shown below.


Sentiment Score: the sentiment score was calculated to represent public reputation. The positive rate of each university was used as its sentiment score for the public reputation ranking. The formula below defines the positive rate; note that the negative polarity is not considered, since it equals zero.
Positive Rate = (number of tweets with positive polarity) / n

where n is the total number of tweets for each university and a tweet counts as positive when its polarity value equals 4.
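In code, the calculation is a simple ratio; the snippet below assumes each tweet carries a polarity value where 4 means positive, following the convention above.

    # Positive rate = (number of tweets with polarity 4) / total number of tweets.
    def positive_rate(polarities):
        n = len(polarities)
        if n == 0:
            return 0.0
        positives = sum(1 for p in polarities if p == 4)
        return positives / n

    # e.g. positive_rate([4, 0, 4, 4, 0]) returns 0.6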
After completing the sentiment analysis, I will proceed to build a classifier for sentiment analysis using a machine learning algorithm, the Maximum Entropy classifier, which I will discuss further in the next post.


Author: The Octoparse Team
- See more at: Octoparse Blog


Monday, February 27, 2017

Web scraping | Introduction to Octoparse XPath Tool

You can use Octoparse to scrape websites, but sometimes the output has missing data or the task does not work properly. A new XPath expression can often solve the problem and make the task work, so it's worth mastering the Octoparse XPath Tool when scraping data from websites. With just a little effort, you can greatly improve your productivity.

Before reading the article, you can learn basic HTML & XPath knowledge in these documents.


In this tutorial you'll learn how to use the Octoparse XPath Tool, with some syntax examples, when configuring your scraping tasks in Octoparse.

Location

There are two ways to open the Octoparse XPath Tool:

Method 1. Every action except “Go To Web Page” in the Octoparse Workflow has a “Define ways to locate an item” option under the “Customize Field” setting.
 
The Customized Field button in Extract Data
 
The Customized Field button in Enter Text
 
The Customized Field button in Click Item

Select the “Define ways to locate an item” option and click on the Try XPath Tool link (lower left corner).

 


Method 2. Click the third icon in the upper left corner of the Octoparse interface.
 

Main Interface
 

The main interface of the Octoparse XPath Tool can be divided into 4 parts, as follows:
1. The Browser part. Enter the target URL in the built-in browser and click on the Go button; the content of the web page will be displayed here.
2. The Source Code part. View the source code of the web page.
3. The XPath Setting part. Check the options and fill in some parameters, then hit the Generate button to generate an XPath expression.
4. The XPath Result part. After the XPath is generated, click on the Match button to see whether the current XPath finds elements on the web page.

Note: The structure and hierarchy of the source code shown in the Octoparse XPath Tool is not very clear, so it's strongly recommended that you use the Firefox extension FirePath to check the web page source code. Check out the tutorial HERE to learn how to use FirePath.

XPath Setting

The XPath expression is generated automatically after you check the option and fill in some text in the XPath Setting part. 
The XPath Setting part

Item Tag Name:
The blue text in the source code, such as SPAN, A, HR, or BR, shows the tag name in the Firefox browser. You can check Item Tag Name and fill in the tag name of the elements you want to find.
 
The blue text represents the tag name.

Item Position:
The default value is 1, which represents the first item among its siblings.

Item ID, Item Name, Item Style Class:
Generally, a tag element has attributes such as an id, name, or class attribute. You can check the options you need and fill in some text to better locate the elements. The id, name, and class attributes are the most common ones; for other attributes, you can edit the generated XPath yourself by replacing the attribute name and its value with those copied from the original source code in the Firefox browser.
 
Pic: Check the Item ID option with some parameter and generate the XPath.
 
Pic: Copy the generated XPath and paste it into FirePath.

Pic: Replace data-action with id and replace 'sx-card-deck' with 'rot-B00D2PNANY'.
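To make the kind of edit shown in the screenshots more concrete, here is a small illustration evaluated with Python's lxml; the HTML snippet and the attribute values are made up to mirror the example above, not taken from a real page.

    # Attribute-based XPath expressions, evaluated with lxml for illustration only.
    from lxml import html

    page = html.fromstring("""
    <div data-action="sx-card-deck">
      <span id="rot-B00D2PNANY" class="price">$19.99</span>
    </div>
    """)

    # XPath generated from the data-action attribute ...
    print(page.xpath("//div[@data-action='sx-card-deck']"))
    # ... and the same elements located by id or class attribute instead.
    print(page.xpath("//span[@id='rot-B00D2PNANY']/text()"))
    print(page.xpath("//span[@class='price']/text()"))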

Item Text:
The black text in the source code shows the text information in the Firefox browser. You have to make sure all the text is included when filling in the parameter, even blank spaces, punctuation, and full-width versus half-width characters. You can simply copy the text inside the angle brackets from the original source code.
 

Item Text Contains:
Octoparse will generate an XPath that matches the specified text fragment and finds all the elements that contain that text.

Item Text Start With:
Similarly, Octoparse will generate an XPath based on the beginning of the text and find all the elements whose text begins with it.

After you check an option and fill in the parameter, click on the Generate button and an XPath will be created and displayed in the XPath Result part.

To the left of the Generate button are four buttons used to find elements near the information you actually want. Generally you won't need these buttons; they are usually used after an XPath has been created (see the sketch after the list below).

Child: selects the child node.
Parent: selects the parent node.
Previous: selects the “preceding-sibling::node()”.
Next: selects the “following-sibling::node()”.
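The text options and the four navigation buttons map onto standard XPath functions and axes. Below is a small, made-up example checked with Python's lxml, purely to show what the generated expressions look like; the HTML is invented for illustration.

    # Text-based XPath predicates and the Child/Parent/Previous/Next axes, via lxml.
    from lxml import html

    page = html.fromstring("""
    <ul>
      <li>Price: $25.00</li>
      <li>In stock</li>
      <li>Free shipping</li>
    </ul>
    """)

    print(page.xpath("//li[text()='In stock']"))             # Item Text: exact match
    print(page.xpath("//li[contains(text(), 'Price')]"))     # Item Text Contains
    print(page.xpath("//li[starts-with(text(), 'Free')]"))   # Item Text Start With

    # The four navigation buttons correspond to these axes:
    print(page.xpath("//li[1]/parent::node()"))              # Parent
    print(page.xpath("//ul/child::node()"))                  # Child
    print(page.xpath("//li[2]/preceding-sibling::node()"))   # Previous
    print(page.xpath("//li[2]/following-sibling::node()"))   # Next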
 

Conclusion

There are many ways to find information on a web page; that is, you can write different XPath expressions to locate the same web elements, or ask an expert for help.
- See more at: Octoparse Tutorial


Sunday, February 26, 2017

Facebook Data Mining


    
Mining data from Facebook has become quite popular and useful in the past few years. The crawled or scraped data is valuable for commercial, scientific, and many other fields of prediction and analysis, especially when it is processed further through steps such as data cleansing and machine learning. Without a doubt, data mining, which serves as the foundation of the whole data process, is of paramount importance.
Because data enthusiasts have shown such intense interest in Facebook data, Facebook provides a developer site that allows developers to access its data. The site offers many simple, easy-to-grasp methods with detailed guidelines for users to learn and access its resources.
As for the Facebook API, known as the Graph API, it is a REST (Representational State Transfer) interface based on the web architecture, which means Facebook exposes its functions over HTTP: clients send requests with methods such as GET and POST, and the REST service echoes back the results.
Take the Facebook page of the Coca-Cola Corporation as an example: if users want to retrieve the posts on its wall, all they need to do is request
https://graph.facebook.com/cocacola/feed, and the system will return the results as JSON. JSON (JavaScript Object Notation) is a data exchange format that is easy for people to read and write, and easy for machines to parse and generate. The returned data fields include the message ID, the detailed content, the author, the author ID, and other information. Not only the wall but all other Facebook objects can be retrieved through the same URL structure. A request with an invalid path or missing access token, however, returns an error object like this:
    {
      "error": {
        "message": "Unknown path components: /CONNECTION_TYPE",
        "type": "OAuthException",
        "code": 2500,
        "fbtrace_id": "AU3Q0qQUX1/"
      }
    }
Here we should note that we can only access the data when the objects are public; otherwise we have to provide an access token for objects that are defined as private.
Users should be happy to hear that there is an R package, Rfacebook, which provides an interface to the Facebook API. For mining Facebook with R, the Rfacebook package provides functions that let R call Facebook's API to get information about posts, comments, likes, groups that mention specific keywords, and much more; we can then use specific commands to search pages. Apart from R, plenty of people are used to Python, so here are some tips for reference as well. First of all, check out the documentation on Facebook's Graph API at https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, do read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp). Once you grasp the concepts, start using the API. For Python, there are several alternatives:
  • facebook/python-sdk https://github.com/facebook/python-sdk 
  • pyFaceGraph https://github.com/iplatform/pyFaceGraph/
  • It is also semi-trivial to write a simple HTTP client that uses the Graph API directly (see the sketch below). Check out these Python libraries, try the examples from their documentation, and see whether they already do what you need. Compared with R, Python can simplify the data-processing procedure by saving time on code management, output, and note files, while R is better for graph visualization, since it lets you visualize your friends on Facebook.
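As a concrete version of that last point, here is a minimal HTTP client for the Graph API written with requests; the access token is a placeholder, and depending on the current API version even public objects may require a valid token.

    # A minimal Graph API client using requests. ACCESS_TOKEN is a placeholder.
    import requests

    ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"

    resp = requests.get(
        "https://graph.facebook.com/cocacola/feed",
        params={"access_token": ACCESS_TOKEN},
    )
    feed = resp.json()  # JSON with message id, author, text, timestamps, etc.

    for post in feed.get("data", []):
        print(post.get("id"), post.get("message", "")[:80])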
There are also data extraction tools for people without any programming skills to scrape or crawl data from Facebook, such as Octoparse and Visual Scraper.
Octoparse:                                                                          
Octoparse is a powerful web scraper that can scrape both static and dynamic websites, including pages that use AJAX, JavaScript, cookies, and so on. First, you need to download the client and then start building your scraping tasks. You don't need any programming skills for this software, but you should learn the rules it uses to help users extract data. It also provides a cloud service and proxy server settings to prevent IP blocks and accelerate extraction.

  
To learn more, please visit http://www.octoparse.com/

Visual Scraper:
Visual Scraper is another great free web scraper with a simple point-and-click interface that can be used to collect data from the web. You can get real-time data from several web pages and export the extracted data as CSV, XML, JSON, or SQL files. The freeware, which is available for Windows, lets a single user scrape data from up to 50,000 web pages. Besides the SaaS, Visual Scraper offers web scraping services such as data delivery and building custom software extractors.

   
 If you want to know more, please visit http://www.visualscraper.com/pricing


Author: The Octoparse Team
- See more at: Octoparse Tutorial
