Web Scraping Service - Octoparse Cloud Extraction Works Better
(Download the extraction task of this tutorial HERE just in case you need it.)
As we all know, Octoparse's cloud servers enable you to extract data from websites much faster and more reliably when the data volume you want to scrape is huge. Today I will show you an example where Octoparse Cloud Extraction is more efficient than Local Extraction.
In this tutorial, I will use Yelp as an example to show you how to scrape Yelp data and compare the efficiency of our two extraction methods - Cloud Extraction and Local Extraction.
(Note: Cloud Extraction is not available for the Free Edition. For more information about the different editions, you can click HERE.)
Part 1. Configure a rule for your task
Let’s make a rule for our scraping task first.
Choose "Advanced Mode" ➜ Click "Start"➜ Complete basic information. ➜ Click "Next".
Enter the target Yelp URL in the built-in browser. ➜ Click the "Go" icon to open the web page.
After the web page has loaded completely, scroll to the bottom of the page. Right-click the pagination link (the "Next" button on this page) and select "Loop click in the element" to create the Pagination Loop.
--- If you want to extract information from every page of the search results, you need to add a page navigation action.
--- If the content of the web page is completely displayed while the URL is still loading, you can either stop the loading by clicking the red cross icon ‘×’ near the URL, or wait until the page finishes loading.
Move your cursor over a section with a similar layout from which you want to extract data.
Click the first highlighted link ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first highlighted link is now added to the list. ➜ Click "Continue to edit the list".
Click the second highlighted link ➜ Click "Add current item to the list" again. Now we have all the links with a similar layout.
Then click "Finish Creating List" ➜ Click "loop" so that Octoparse processes the list and extracts the elements on each page.
After the page has loaded completely, drag the second "Loop Item" into the first "Cycle Pages" box and place it right above the "Click to paginate" action in the Workflow Designer, so that we can grab the items from every page.
And we can check if the rule is working well by clicking the actions from the beginning of the rule.
"Go To Web Page" ➜ "Cycle Pages" box ➜ "Loop Item" box (all items are extracted) ➜ "Click Item"
The rule seems to be working well so far, so let's continue to extract the data.
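The nested workflow above can be sketched as plain code. Here is a minimal, hypothetical Python sketch of the same logic - an outer pagination loop containing an inner loop over item links - using a stubbed fetch_page function and invented URLs in place of live Yelp pages:

```python
# Sketch of the workflow Octoparse builds above: an outer pagination
# loop ("Cycle Pages") containing an inner "Loop Item" over the links
# on each page. fetch_page() is a stub standing in for a real request.
def fetch_page(url):
    # Hypothetical stub: pretend each search page lists two item links
    # and points at the next page (None on the last page).
    pages = {
        "search?page=1": (["item-a", "item-b"], "search?page=2"),
        "search?page=2": (["item-c", "item-d"], None),
    }
    return pages[url]

def run_workflow(start_url):
    collected = []
    url = start_url
    while url:                      # "Cycle Pages" box
        item_links, next_url = fetch_page(url)
        for link in item_links:     # "Loop Item" box
            collected.append(link)  # "Click Item" would open each link here
        url = next_url              # "Click to Paginate"
    return collected

print(run_workflow("search?page=1"))  # → ['item-a', 'item-b', 'item-c', 'item-d']
```

In the real task, Octoparse performs the equivalent of fetch_page and the detail-page clicks for you; the point is only that the "Loop Item" box runs to completion on each page before "Click to Paginate" advances.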
If the page URL keeps loading while the content of the web page is completely displayed (at least the data you want to scrape has been shown), you can either stop the loading by clicking the red cross icon ‘×’ near the URL, or wait until the page finishes loading.
The data we will extract on the detail page includes the website URL, phone number, address, type, company name, and the current page URL.
Because the third item on the first page has all the data we want (the first two items lack the website URL on their detail pages), we click the "Loop Item (Extract Data)" action and select the third item from the Loop Item list. Then click "Click Item" in the workflow.
After the page has been loaded completely, we will begin to extract the data we want.
Click the website URL on the web page ➜ Select "Extract text" and rename the field. The other contents can be extracted in the same way. Don't forget to click Save after all the data fields are extracted and configured.
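As a rough illustration of what those "Extract text" fields capture, here is a hedged Python sketch that pulls the same six fields out of a static HTML snippet. The markup, class names, and regular expressions are invented for illustration - they are not Yelp's real page structure:

```python
import re

# Invented detail-page snippet standing in for a real Yelp business page.
DETAIL_HTML = """
<div class="biz">
  <h1>Acme Plumbing</h1>
  <a class="website">acmeplumbing.example.com</a>
  <span class="phone">(415) 555-0100</span>
  <address>123 Main St, San Francisco, CA</address>
  <span class="category">Plumbers</span>
</div>
"""

def extract_fields(html, page_url):
    # Each pattern plays the role of one "Extract text" field in the rule;
    # the class names above are hypothetical, not Yelp's real markup.
    def grab(pattern):
        return re.search(pattern, html, re.S).group(1).strip()
    return {
        "company_name": grab(r"<h1>(.*?)</h1>"),
        "website_url": grab(r'class="website">(.*?)</a>'),
        "phone": grab(r'class="phone">(.*?)</span>'),
        "address": grab(r"<address>(.*?)</address>"),
        "type": grab(r'class="category">(.*?)</span>'),
        "page_url": page_url,  # the "current page URL" field
    }

print(extract_fields(DETAIL_HTML, "https://www.yelp.com/biz/acme"))
```

Octoparse generates these extraction rules for you when you click elements in the built-in browser; the sketch only shows the shape of the record each detail page yields.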
Check if the rule is working well by clicking the actions from the beginning of the rule.
"Go To Web Page" ➜ "Cycle Pages" box ➜ "Loop Item" box (all items are extracted) ➜ "Click Item" ➜ "Extract Data" ➜ "Click to Paginate"
1. Don't forget to set the AJAX timeout for the "Click Item" and "Click to Paginate" actions.
2. If a page takes a long time to load after "Click Item" or "Click to Paginate", you can set a longer waiting time on the following action before it executes. For example, if it takes 10 seconds to open the web page in the "Click to Paginate" action, you can set a longer waiting time on the next action - "Click Item" - before it executes.
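The waiting-time advice in note 2 amounts to polling until the page is ready or a timeout expires. Here is a minimal Python sketch of that pattern (wait_for and page_is_loaded are hypothetical helpers, not part of Octoparse):

```python
import time

def wait_for(condition, timeout=10.0, poll=0.2):
    # Keep checking `condition` until it returns True or `timeout`
    # seconds have passed - the same idea as giving the next action
    # a longer waiting time before it executes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return False

# e.g. wait up to 10 s for a (hypothetical) page_is_loaded() check
print(wait_for(lambda: True))  # → True
```

If the condition never becomes true, wait_for gives up after the timeout and returns False, which is exactly when you would want to lengthen the AJAX timeout in the rule.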
Part 2. Run the task with both Local Extraction and Cloud Extraction
Let’s run the task with both Local Extraction and Cloud Extraction.
After you click one of these two options, you can go to the left panel and find the task under a category in My Task. Right-click the task and choose the other extraction option.
After almost 3 hours, we had all the data from both extraction types.
The data we get from Local Extraction:
The data we get from Cloud Extraction:
The speed and efficiency of Local Extraction are greatly influenced by your network connection and your computer's performance. Our cloud servers, on the other hand, are dedicated solely to extracting data from web pages, so Octoparse Cloud Extraction can work better than Local Extraction thanks to the outstanding performance of our cloud servers.
Need large amounts of data from a site? Our sales team is happy to help!
Author: The Octoparse Team
- See more at: Octoparse Tutorial
Tags: Big data, Business, Octoparse, real-time data, Web scraping