Monday, January 2, 2017

Scraping Yelp Reviews

Octoparse enables you to scrape reviews from yelp.com.    

In this tutorial, we will use Octoparse to scrape all reviews of car audio businesses in Brooklyn, NY, United States from yelp.com.
The data fields include the company name, phone number, address, car audio type, customer name, and the customer's review of the business.
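
As a rough illustration of what one extracted row could look like, the sketch below models the data fields as a Python dataclass. The class name, the field names, and the sample values are all placeholders chosen for this example; Octoparse lets you name the fields yourself, so your own export may differ.

```python
from dataclasses import dataclass, asdict


@dataclass
class ReviewRow:
    """One row of extracted data (field names are hypothetical)."""
    company_name: str
    phone_number: str
    address: str
    car_audio_type: str
    customer_name: str
    review: str


# Placeholder values only; this is not real Yelp data.
row = ReviewRow(
    company_name="Example Car Audio",
    phone_number="(555) 010-0000",
    address="1 Example St, Brooklyn, NY",
    car_audio_type="Car Stereo Installation",
    customer_name="Customer A",
    review="Placeholder review text.",
)

print(asdict(row))
```

Each review produces one such row, so the company details repeat for every review of the same business.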

You can download the task (the .otd file) directly and begin collecting data, or you can follow the steps below to build a scraping task for Yelp reviews yourself. (Download my extraction task for this tutorial HERE in case you need it.)

Step 1. Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click "Next".
 

Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.

(URL of the example: https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY ) 
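
The example URL is an ordinary search URL with two query parameters, find_desc and find_loc. If you want to target a different search term or city, a minimal sketch like the one below rebuilds the URL with Python's standard library. The function name build_search_url is made up for this example, and the sketch assumes Yelp keeps using the same two parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(description: str, location: str) -> str:
    """Build a Yelp search URL from a search term and a location."""
    # urlencode escapes spaces as "+" and commas as "%2C".
    query = urlencode({"find_desc": description, "find_loc": location})
    return f"https://www.yelp.com/search?{query}"


url = build_search_url("car audio", "Brooklyn, NY")
print(url)

# Decoding the query string recovers the original values.
params = parse_qs(urlsplit(url).query)
print(params)
```

Paste the resulting URL into the built-in browser in Step 2 in place of the example URL.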
 

Step 3. Right-click the "Next" pagination link ➜ Choose "Loop click in the element" to turn the page.
 

(Note:
1. To extract information from every page of the search results, you need to add a page navigation action.
2. Right-click the "Next" pagination link rather than left-clicking it, so that the link is not triggered.
3. If "Loop click in the element" is not shown, click the "Expand the selection area" button until it appears.)
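
Conceptually, the page navigation action keeps following the "Next" link until there is no next page. The sketch below shows that idea in Python using a small in-memory dictionary instead of a real website. The PAGES data and the cycle_pages function are hypothetical and only illustrate the loop; they are not how Octoparse is implemented.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical stand-in for paginated search results: each page lists its
# items and names the page its "Next" link points to (None on the last page).
PAGES: Dict[str, dict] = {
    "page-1": {"items": ["item 1", "item 2"], "next": "page-2"},
    "page-2": {"items": ["item 3", "item 4"], "next": "page-3"},
    "page-3": {"items": ["item 5"], "next": None},
}


def cycle_pages(start: str) -> Iterator[List[str]]:
    """Yield the items of each page, following "Next" until it runs out."""
    current: Optional[str] = start
    while current is not None:
        page = PAGES[current]
        yield page["items"]
        current = page["next"]


all_items = [item for items in cycle_pages("page-1") for item in items]
print(all_items)
```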

Step 4. Move your cursor over one of the sections that share a similar layout, from which you want to extract data.

Click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".

The first item has now been added to the list. ➜ Click "Continue to edit the list".

Click the second item ➜ Click "Add current item to the list" again. Now all the links with a similar layout are in the list. ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements from each page.


Note:
We can tick the “Scroll Down” and “Page Acceleration” options to load the web page completely. Then click “Save”.
 

Step 5. Extract the information about the car audio business.

If all the content of the web page has loaded but the browser still shows the URL as loading, click the multiplication sign (×) to stop the loading.
 
Extract the phone number of the business: click the phone number ➜ Select "Extract text". Other content can be extracted in the same way. Add the current page URL as a data field as well.
All the extracted content will appear in Data Fields. ➜ Click a "Field Name" to rename it. Then click “Save”.
 

Step 6. Extract the reviews of the car audio business.

Step 6-1. Right-click the "Next" pagination link ➜ Choose "Loop click in the element" to turn the page.
 

Step 6-2. Move your cursor over one of the sections that share a similar layout, from which you want to extract data.

Click the first section ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".

The first section has now been added to the list. ➜ Click "Continue to edit the list".

Click the second section ➜ Click "Add current item to the list" again. Now all the sections with a similar layout are in the list. ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements from each page.

Step 6-3. Extract the reviews.

Click the customer name ➜ Select "Extract text". Reviews can be extracted in the same way.
All the extracted content will appear in Data Fields. ➜ Click a "Field Name" to rename it. Then click "Save".
 


Step 7. Drag the second "Loop Item" box before the "Click to paginate" action of the second “Cycle Pages” box so that we can grab all the reviews of a car audio business from multiple pages.

Step 8. Drag the second "Loop Item" box before the "Click to paginate" action of the first “Cycle Pages” box so that we can grab the reviews of all the car audio businesses from multiple pages.
 

Step 9. Check the workflow.

Now we need to check the workflow by clicking each action in order, starting from the beginning of the workflow.
Go to the webpage ➜ The first Cycle Pages box ➜ The first Loop Item box ➜ Click Item ➜ Extract Data ➜ The second Cycle Pages box ➜ The second Loop Item box ➜ Extract Data ➜ Extract Data ➜ Click to Paginate ➜ Click to Paginate.
While checking the workflow, set a longer timeout for some of the actions (but not the 'Go To Web Page' action), and set an AJAX timeout for the two “Click to Paginate” actions.
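
It can help to read the finished workflow as a set of nested loops: the outer pair walks the search result pages and the businesses on each page, and the inner pair walks the review pages and the reviews of one business. The Python sketch below mirrors that nesting with made-up in-memory data. SEARCH_PAGES, walk_workflow, and the sample names are all hypothetical; the sketch shows the control flow only and does not touch Yelp or Octoparse.

```python
from typing import Iterator, List, Tuple

# Hypothetical stand-in for the site: search pages -> businesses -> review pages.
SEARCH_PAGES: List[List[dict]] = [
    [
        {"name": "Shop A", "review_pages": [["review 1", "review 2"], ["review 3"]]},
        {"name": "Shop B", "review_pages": [["review 4"]]},
    ],
    [
        {"name": "Shop C", "review_pages": [["review 5"], ["review 6"]]},
    ],
]


def walk_workflow(search_pages: List[List[dict]]) -> Iterator[Tuple[str, str]]:
    """Yield (business name, review) pairs in the order the workflow visits them."""
    for search_page in search_pages:                      # first Cycle Pages
        for business in search_page:                      # first Loop Item, Click Item
            name = business["name"]                       # extract business details
            for review_page in business["review_pages"]:  # second Cycle Pages
                for review in review_page:                # second Loop Item
                    yield name, review                    # extract the review


rows = list(walk_workflow(SEARCH_PAGES))
for name, review in rows:
    print(name, review)
```

If a check of the real workflow returns reviews for only the first business or only the first review page, the corresponding loop is probably nested in the wrong place.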
 

Step 10. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" again ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the selected data.
 

Step 11. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
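
Octoparse handles the export for you, but it may be useful to see what a flat export of review rows amounts to. The sketch below writes a few placeholder rows to CSV text with Python's csv module and reads them back. The column names and row values are invented for this example, and rows_to_csv is a hypothetical helper, not part of Octoparse.

```python
import csv
import io
from typing import Dict, List

# Placeholder rows; the column names are assumptions, not Octoparse defaults.
sample_rows: List[Dict[str, str]] = [
    {"company_name": "Example Shop", "customer_name": "Customer 1",
     "review": "Placeholder review text."},
    {"company_name": "Example Shop", "customer_name": "Customer 2",
     "review": "Another placeholder, with a comma."},
]


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(sample_rows)
print(csv_text)

# Reading the text back gives the same rows; the csv module quotes the
# review that contains a comma so it stays in one column.
round_trip = [dict(r) for r in csv.DictReader(io.StringIO(csv_text))]
```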
 


Author: The Octoparse Team

- See more at: Octoparse Tutorial
