Thursday, December 8, 2016

No.1 Automated Web Scraping Software for Windows

Octoparse is a free client-side web scraping application for Windows that turns websites into structured tables of data without coding. Automatically extract web data from sites within minutes. It’s easy and free!

Octoparse simulates web browsing behavior such as opening a web page, logging into an account, entering text, and pointing and clicking web elements. The tool lets you capture data simply by clicking the information you want in the built-in browser, and export it in any format you like. Don’t waste your time copying and pasting; download Octoparse for Windows today and start collecting web data.

Why Octoparse Is Your Best Choice:

Point and Click Interface
Select the data to be scraped with mouse clicks; no need to code. Use XPath and regular expressions to collect the data with accuracy.

All Sorts of Data Loading
Scrape data from pages that use loading techniques such as AJAX or JavaScript. A fully fledged built-in browser loads data from different sources.

Cloud Service
Scrape data anonymously with Octoparse. Proxies and an API are supported, and automatic IP rotation keeps your IP from being blocked.

Note: The demo version allows a maximum of 10 projects and cannot save them to the cloud. Requires the Microsoft .NET Framework 3.5.

(from No.1 Automated Web Scraping Software for Windows - Octoparse)


Monday, December 5, 2016

How to Get Data from the Web

Enterprises of every size generate large amounts of web data all the time, but how to collect and process that data remains a constant problem. The significance of big data technologies lies not in the ability to gather data on a large scale, but in the intelligence to process it and extract valuable information from such a large volume for further analysis. And the premise of big data technologies is that we can obtain a large volume of valuable data in the first place.

Data analysis and data mining are not focused on the data itself, but on how to solve actual business problems with it. We can get valuable information from the collected data by performing data mining and data analysis, but only if the data we collect is of high quality.

How do you get data from the web, or, more specifically, the exact data you want from a webpage?
As a big fan of data mining, I’d like to share my experience of getting data from the web.

1. Web crawler

A web crawler (also known as a web scraper, web spider, or web robot) is an automated program or script used to browse the internet and collect data from web pages.

The most common way to retrieve web data is with a crawler. Once you know how to write one in a programming language such as R or Python, you can crawl almost any data you see on a web page, and an automated crawler can collect large quantities of data from websites in a short time. For example, I have collected 100,000 social media records, 2,000,000 lottery records, 1,000,000 travel records, 150,000 hotel records, and 400,000 URLs from a single website. Once I have the data, I use regular expressions (a simple but powerful tool, and my favorite) to extract the exact strings I need. If you are new to regular expressions, there are many free online resources and regular expression testers to learn from. The more you practice, the more familiar you become; practice makes perfect. I use regex101 a lot.
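To make this concrete, here is a minimal sketch of that crawl-then-extract workflow in Python. The URL is a placeholder rather than a real data source, and the phone-number pattern is just one example of what you might match:

```python
import re
import urllib.request

# Fetch one page of HTML (the URL is a placeholder, not a real data source).
url = "https://example.com/listings"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

# Use regular expressions to pull out every hyperlink,
# then every US-style phone number.
links = re.findall(r'href="(https?://[^"]+)"', html)
phones = re.findall(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}", html)

print(len(links), "links and", len(phones), "phone numbers found")
```

A real crawler adds a queue of URLs to visit, polite delays between requests, and error handling, but the fetch-then-match loop above is the core of it.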

What if you don’t have any coding knowledge but want to get data from the web?
Fear not! A lot of web scraping tools for non-technical people are available online.

If you’re a little curious about web crawling and have a love of learning, I suggest you try web scraping software such as Octoparse, import.io, or WebHarvy. It may take some time to learn, but once mastered it’s hard to find anything better.

If you’re not technical and have no time to learn web scraping software, consider outsourcing or other ways of getting the web data you need. Many companies provide professional web data scraping services that deliver the data directly, or you can hire a web scraping specialist to get the data for you.

2. Websites that provide public datasets

A number of public datasets are freely available online, and you can easily download or buy them. Below are some commonly used websites for retrieving public data.


3. Share resources with your friends

It’s wonderful to make friends who are good at web crawling and share your experiences with them; you can easily find them in web crawling forums and blogs. Many people love to crawl data from the web but don’t know how to make better use of the data they collect. By sharing web crawling skills and expertise, you can learn from each other.

Author: The Octoparse Team
- See more at: Octoparse Blog


Thursday, December 1, 2016

How to schedule a crawler/scraping task in Octoparse every hour?
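Octoparse’s own scheduling is configured inside the client (see the FAQ linked below). If you drive a hand-written crawler instead, the same hourly cadence takes only a few lines of Python. This is a minimal sketch; crawl_once is a hypothetical stand-in for your own extraction code:

```python
import time
from datetime import datetime

def crawl_once():
    # Hypothetical stand-in for your actual extraction code.
    print("crawl started at", datetime.now())

# Repeat the crawl once an hour.
while True:
    crawl_once()
    time.sleep(60 * 60)  # sleep one hour between runs
```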



- See more at: Octoparse FAQ


Monday, November 28, 2016

Web Scraping Service - Octoparse Cloud Extraction Works Better

(from Web Scraping Service - Octoparse Cloud Extraction Works Better)

(Download the extraction task of this tutorial HERE just in case you need it.)

As we all know, Octoparse’s cloud servers enable you to extract data from websites much more quickly and stably when the volume of data you want to scrape is huge. Today I will show you an example of how Octoparse Cloud Extraction can be more efficient than Local Extraction.

In this tutorial, I will use Yelp as an example to show you how to scrape Yelp data and compare the efficiency of our two extraction methods, Cloud Extraction and Local Extraction.

(Note: Cloud Extraction is not available in the Free Edition. For more information about the different editions, click HERE.)



Part 1. Configure a rule for your task

Let’s make a rule for our scraping task first.
Step 1.
Choose "Advanced Mode" ➜ Click "Start"➜ Complete basic information. ➜ Click "Next".

Step 2.
Enter the target Yelp URL in the built-in browser ➜ Click the "Go" icon to open the webpage.

Step 3.
After the web page has loaded completely, scroll to the bottom of the page. Right-click the pagination link (the "Next" button on this page) and select "Loop click in the element" to create the pagination loop.
(Note:
--- If you want to extract information from every page of the search results, you need to add a page navigation action.
--- If the content of the web page is completely displayed while the URL is still loading, you can either stop the loading by clicking the red cross icon ‘×’ near the URL or wait until the page finishes loading.)


Step 4.
Move your cursor over a section with a similar layout, where you want to extract data.
Click the first highlighted link ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first highlighted link is now in the list. ➜ Click "Continue to edit the list".
Click the second highlighted link ➜ Click "Add current item to the list" again. Now we have all the links with a similar layout.
Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements from each page.


Step 5. 
After the page has loaded completely, drag the second "Loop Item" into the first "Cycle Pages" box, placing it right above the "Click to paginate" action in the Workflow Designer, so that we can grab all the item elements from multiple pages.
We can check whether the rule works well by clicking through the actions from the beginning of the rule:
"Go To Web Page" ➜ "Cycle Pages" box ➜ "Loop Item" box (all items are extracted) ➜ "Click Item"
 

The rule seems to be working well so far, so let’s continue to extract the data.
If the page URL keeps loading while the content of the web page is already completely displayed (at least the data you want to scrape has been shown), you can either stop the loading by clicking the red cross icon ‘×’ near the URL or wait until the page finishes loading.

Step 6.
The data we will extract from the detail page includes the website URL, phone number, address, type, company name, and the current page URL.
The third item on the first page has all the data we want (the first two items lack the website URL on their detail pages).
So click the "Loop Item" ("Extract Data") and select the third item from the Loop Item list, then click the "Click Item" action in the workflow.
After the page has loaded completely, we can begin extracting the data we want.
Click the website URL on the web page ➜ Select "Extract text" and rename the field. The other contents can be extracted in the same way. Don’t forget to click Save after all the data fields are extracted and configured.

 

Step 7.
Check whether the rule works well by clicking through the actions from the beginning of the rule:
"Go To Web Page" ➜ "Cycle Pages" box ➜ "Loop Item" box (all items are extracted) ➜ "Click Item" ➜ "Extract Data" ➜ "Click to Paginate"

Note:
1. Don’t forget to set an AJAX timeout for the "Click Item" and "Click to Paginate" actions.
2. If a page takes a long time to load after you click "Click Item" or "Click to Paginate", you can set a longer waiting time in the next action before it executes. For example, if the web page takes 10 seconds to open in the "Click to Paginate" action, set a longer waiting time in the following "Click Item" action before it executes (see the sketch below).
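In a hand-written Selenium crawler, the counterpart of that waiting time is an explicit wait, which blocks until the AJAX-loaded element appears rather than sleeping for a fixed period. A minimal sketch (the "#results" selector is hypothetical):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.yelp.com/search?find_desc=restaurants")

# Wait up to 10 seconds for the AJAX-loaded results to appear
# before clicking anything ("#results" is a hypothetical selector).
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#results"))
)
driver.quit()
```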



Part 2. Run the task with both Local Extraction and Cloud Extraction

Let’s run the task with both Local Extraction and Cloud Extraction.
After you click one of the two options, go to the left panel and find the task under its category in My Task, then right-click the task and choose the other extraction option.
After almost 3 hours we had all the data from both extraction types.
The data we get from Local Extraction:


The data we get from Cloud Extraction:


The speed and efficiency of Local Extraction are greatly influenced by your network connection and the performance of your computer, whereas our cloud servers do nothing but extract data from web pages. That is why Octoparse Cloud Extraction sometimes works better than Local Extraction.
Need large amounts of data from a site? Our sales team is happy to help!

Author: The Octoparse Team
- See more at: Octoparse Tutorial
