Monday, January 2, 2017

Scraping Restaurant Information from Yell.com

Octoparse enables you to scrape the search results from Yell.com. After you enter the items you want to search for in a certain region, clicking the “Search” button redirects you to the search results page.
 

In this tutorial we will scrape data about all restaurants in London from yell.com with Octoparse.
Then we will use the URL of the search results page in Octoparse. The website URL we will use is
The data fields in this tutorial include the name of the restaurant, its address, its telephone number, and its star-rating score.

You can download the task (the .otd file) directly to begin collecting the data, or you can follow the steps below to build a scraping task for Yell.com. (Download my extraction task for this tutorial HERE in case you need it.)

Step 1. Set up basic information.

Click “Quick Start” ➜ Choose “New Task (Advanced Mode)” ➜ Complete the basic information ➜ Click “Next”.
 

Step 2. Enter the target URL in the built-in browser ➜ Click the “Go” icon to open the webpage.

 

Step 3. Click the “Next” pagination link ➜ Choose “Loop click in the element” to turn the page.


(Note:
1. If you want to extract information from every page of the search results, you need to add a page navigation action.
2. You can right-click the “Next” pagination link to avoid triggering the link.
3. You can click the “Expand the selection area” button until “Loop click in the element” appears.)
 

Step 4. Move your cursor over the sections with a similar layout, from which you want to extract data.

Click the first section on the web page ➜ Click “Create a list of items” (sections with a similar layout) ➜ Click “Add current item to the list”.

The first section is now added to the list ➜ Click “Continue to edit the list”.

Click the second section ➜ Click “Add current item to the list” again. Now we have all the sections with a similar layout ➜ Click “Finish Creating List” ➜ Click “loop” to process the list and extract the elements we want from each page.
 

Step 5. Extract the search results.

If the content you want to extract is formatted as a clickable hyperlink, you can right-click the content in the built-in browser to avoid following the link.

Extract the name of the first restaurant: right-click the name ➜ Select “Extract text”. The address and telephone number can be extracted in the same way.

When extracting the star-rating score, you need to expand the selection area to select all of the star icons. Since the “Extract text” option returns nothing here, we select the “Extract outer html” option, which includes the rating score of the restaurant.

All the selected content will appear in Data Fields. Click a “Field Name” to rename it. Then click “Save”.
 

Step 6. Re-format the data field.

As you can see, the data field “Star-rating” is not extracted correctly. In this case we can re-format the data field “Star-rating” to extract exactly the information we want.

Choose the field you want to re-format ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data”.
From the outer html of the data field, we can see that the star-rating score starts with ‘title="’ and ends with ‘out of’.

So we click “Add step” ➜ Select “Match with Regular Expression” ➜ Click “Try RegEx Tool”.
In the RegEx Tool window, we tick “Start with” and enter ‘title="’, then tick “End with” and enter ‘out of’. Click “Generate” ➜ Click “Match” ➜ The matching result is 4.3 ➜ Click “Apply” ➜ Click “OK” ➜ Click “Done”.
The value of the “Star-rating” data field now becomes 4.3. Then click “Save”.
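For readers who want to see what the “Start with” / “End with” match does outside of Octoparse, here is a minimal Python sketch. The sample outer html string is a hypothetical stand-in (the real markup on Yell.com may differ), and the regular expression is our own approximation, not necessarily the one the RegEx Tool generates.

```python
import re

# Hypothetical outer html for a star-rating element; the real markup may differ.
outer_html = '<span class="starRating" title="4.3 out of 5 stars"></span>'

def extract_rating(html):
    """Return the text between 'title="' and 'out of', or None if absent."""
    match = re.search(r'title="(.*?)out of', html)
    return match.group(1).strip() if match else None

print(extract_rating(outer_html))
```

The non-greedy `(.*?)` stops at the first “out of”, and `strip()` removes the trailing space before it.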

 

Step 7. Drag the second “Loop Item” box before the “Click to paginate” action of the “Cycle Pages” box in the Workflow Designer so that we can grab all the elements of sections from multiple pages.
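The ordering in Step 7 matters because the items on the current page must be extracted before the workflow clicks through to the next page. A rough Python sketch of that nesting is shown below; the `pages` list is a hypothetical in-memory stand-in for the paginated search results, not real data.

```python
# Hypothetical stand-in for paginated search results (placeholder names only).
pages = [
    [{"name": "Restaurant A"}, {"name": "Restaurant B"}],
    [{"name": "Restaurant C"}],
]

def scrape_all(pages):
    """Collect item names page by page, extracting before paginating."""
    rows = []
    page_index = 0
    while page_index < len(pages):        # outer loop: "Cycle Pages"
        for item in pages[page_index]:    # inner loop: "Loop Item" runs first
            rows.append(item["name"])
        page_index += 1                   # "Click to paginate" comes last
    return rows

print(scrape_all(pages))
```

If the pagination click came before the item loop, the first page would be skipped.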

 

Step 8. Click “Save” to save your configuration. Then click “Next” ➜ Click “Next” ➜ Click “Local Extraction” to run the task on your computer. Octoparse will automatically extract all the data selected.

 

Step 9. The extracted data will be shown in the “Data Extracted” pane. Click the “Export” button to export the results to an Excel file, a database or another format, and save the file to your computer.
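To give an idea of what an exported file might contain, the sketch below writes the four data fields from this tutorial to CSV text in Python. The column names and the sample row are placeholders we made up for illustration; they are not Octoparse's actual export format or real data from Yell.com.

```python
import csv
import io

# Assumed column names based on the data fields in this tutorial.
FIELDS = ["Name", "Address", "Telephone", "Star-rating"]

def rows_to_csv(rows):
    """Serialize a list of row dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder row, not real data.
sample_rows = [
    {
        "Name": "Example Restaurant",
        "Address": "1 Example Street, London",
        "Telephone": "000 0000 0000",
        "Star-rating": "4.3",
    }
]

print(rows_to_csv(sample_rows))
```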
 


Note:
Yell.com may detect rapid automated requests as malicious and stop your extraction.

In this case, we can set a longer timeout for each action except the ‘Go To Web Page’ action. (We set 3 seconds for each action in the workflow.)
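The idea behind the timeout can be sketched in Python as a pause before every action except the first page load. The action names mirror the workflow boxes, but the function and its callables are hypothetical and only illustrate the pacing, not how Octoparse works internally.

```python
import time

def run_workflow(actions, wait_seconds=3.0, sleep=time.sleep):
    """Run (name, callable) pairs, pausing before each one except 'Go To Web Page'."""
    results = []
    for name, action in actions:
        if name != "Go To Web Page":
            sleep(wait_seconds)  # slow down so requests look less like a burst
        results.append(action())
    return results
```

Passing a different `sleep` function (for example one that only records the requested waits) makes the pacing easy to check without actually waiting.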

Author: The Octoparse Team

- See more at: Octoparse Tutorial
