Monday, January 2, 2017

Scraping Hotel Reviews from Tripadvisor.com

In this tutorial, we will use Octoparse to scrape the phone numbers and customer reviews of all the hotels in London from TripAdvisor.com.
The data fields include the hotel name, the number of reviews, the address, the ranking, the phone number, the customer name, and the customer's review of the hotel.

You can download the task (the .otd file) directly and begin collecting the data, or you can follow the steps below to build a scraping task for TripAdvisor hotel reviews. (Download my extraction task for this tutorial HERE in case you need it.)

Step 1. Set up basic information.

Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
 

Step 2. Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.
(URL of the example: https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html )
 

Step 3. Click the "Next" pagination link. ➜ Choose "Loop click in the element" to turn the page.

(Note:
1. If you want to extract information from every page of search results, you need to add a page navigation action.
2. You can right-click the "Next" pagination link to avoid triggering the link.
3. If "Loop click in the element" does not appear, click the "Expand the selection area" button until it does.)

Step 4. Move your cursor over the sections with a similar layout, from which you want to extract data.

Click the first highlighted link ➜ Click "Create a list of items" (sections with a similar layout) ➜ Click "Add current item to the list".

The first highlighted link has now been added to the list. ➜ Click "Continue to edit the list".

Click the second highlighted link ➜ Click "Add current item to the list" again. Now we have all the links with a similar layout. ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.

Step 5. Extract the hotel information.

Extract the hotel name. ➜ Click the name ➜ Select "Extract text". The other fields can be extracted in the same way.
All the selected content will appear in Data Fields. ➜ Click the "Field Name" to modify it. Then click "Save".

Note:
Sometimes the first item does not include all the content you want to extract. In this case, pick one of the items in the loop that contains all the information you need.

When you click the "Loop Item" box, you will find that nothing has been extracted in the loop yet. You can click the "Go To Web Page" button to step through the rule.


Step 6. Correct the data field.

As you can see, the data extracted for the "Phonenumber" field is incorrect.

Modify the XPath expression for "Phonenumber" because its location has changed.
The correct XPath is //div[@class='ui_icon phone fl icnLink']/following-sibling::div

Click the field you want to change. ➜ Click "Customize Field". ➜ Click "Define ways to locate an item". ➜ Paste the correct XPath into the "Matching XPath" text box, replacing the existing one ➜ Click "OK".
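To see what this XPath selects, here is a minimal Python sketch that finds the div following the phone icon. The HTML snippet and the phone number are made up for illustration, and the helper name is hypothetical. Python's built-in ElementTree does not support the following-sibling axis, so the sketch emulates it by walking each parent's children and taking the first matching sibling; Octoparse evaluates the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the icon's class attribute comes from the tutorial.
SAMPLE = """
<div class="contact">
  <div class="ui_icon phone fl icnLink"></div>
  <div>+44 20 7123 4567</div>
</div>
"""

def following_sibling_text(root, icon_class):
    """Emulate //div[@class=icon_class]/following-sibling::div (first match)."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "div" and child.get("class") == icon_class:
                # Look only at siblings that come after the icon element.
                for sibling in children[index + 1:]:
                    if sibling.tag == "div":
                        return (sibling.text or "").strip()
    return None

root = ET.fromstring(SAMPLE)
phone = following_sibling_text(root, "ui_icon phone fl icnLink")
print(phone)
```

The point of the expression is that it does not target the phone number directly: it anchors on the icon, whose class is stable, and then steps sideways to the neighbouring div.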

Step 6-2. Re-format it.

As you can see, the "Phonenumber" data field still contains more than the phone number. In this case, we can re-format the field to keep only the information we want.

Choose the field you want to re-format ➜ Click the "Customize Field" button ➜ Choose "Re-format extracted data".
There are two steps to perform. The first removes the spaces within the phone number; the second extracts only the phone number.
Step one:
Click "Add step" ➜ Select "Replace with Regular Expression" ➜ Enter '\s+' in the Regular Expression box ➜ Click "Calculate" ➜ Click "OK".
Step two:
Click "Add step" ➜ Select "Match with Regular Expression" ➜ Click "Try RegEx Tool".
In the RegEx Tool window, tick "Start with" and enter "-->" ➜ Click "Generate" ➜ Click "Match" ➜ Click "Apply" ➜ Click "Calculate" ➜ Click "OK" ➜ Click "Done".
The value of the "Phonenumber" data field is now extracted correctly. Then click "Save".
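The two re-format steps can be sketched in Python as follows. This is only an illustration: the raw sample value is an assumption (the tutorial does not show the raw text), the function name is hypothetical, and the regular expression that the RegEx Tool generates for "Start with" may differ from the one used here.

```python
import re
from typing import Optional

def reformat_phone(raw: str) -> Optional[str]:
    """Apply the two re-format steps to a raw field value."""
    # Step one: replace every run of whitespace with nothing.
    compact = re.sub(r"\s+", "", raw)
    # Step two: keep only the text that follows the "-->" marker.
    match = re.search(r"-->(.+)", compact)
    return match.group(1) if match else None

# Hypothetical raw value; the real one depends on the page markup.
sample = "<!-- phone --> +44 20 7123 4567"
cleaned = reformat_phone(sample)
print(cleaned)
```

The order matters: whitespace is removed first so that the second step captures the number as one unbroken string.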
 


Step 7. Extract Hotel Reviews.

Step 7-1. Right click on the "Next" pagination link ➜ Choose "Loop click in the element" to turn the page.

Step 7-2. Move your cursor over the sections with a similar layout, from which you want to extract data.

Click the first section ➜ Click "Create a list of items" (sections with a similar layout) ➜ Click "Add current item to the list".

The first section has now been added to the list. ➜ Click "Continue to edit the list".

Click the second section ➜ Click "Add current item to the list" again. Now we have all the sections with a similar layout. ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.
 

Step 7-3. Extract the reviews.

Click the customer name ➜ Select "Extract text". The reviews can be extracted in the same way.
All the selected content will appear in Data Fields. ➜ Click the "Field Name" to modify it. Then click "Save".



Step 8. Drag the second "Loop Item" box before the "Click to paginate" action of the second "Cycle Pages" box so that we can grab all the reviews about the hotel from multiple pages.

 


Step 9. Drag the second "Loop Item" box before the "Click to paginate" action of the first “Cycle Pages” box so that we can grab all the reviews about all the hotels from multiple pages.
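The nesting that Steps 8 and 9 aim for can be pictured as four loops, one inside the other. The sketch below is a plain-Python analogy, not Octoparse code: the SITE data, hotel names, and reviews are invented, and the scrape function is hypothetical. Each loop extracts from the current page before moving to the next one, which is why the "Loop Item" boxes sit before the "Click to paginate" actions.

```python
from typing import Dict, Iterator

# Invented stand-in for the site: result pages -> hotels -> review pages -> reviews.
SITE = [
    [  # search-result page 1
        {"hotel": "Hotel A",
         "review_pages": [[("Ann", "Great stay")], [("Bob", "Noisy room")]]},
    ],
    [  # search-result page 2
        {"hotel": "Hotel B",
         "review_pages": [[("Cy", "Good value")]]},
    ],
]

def scrape(site) -> Iterator[Dict[str, str]]:
    for result_page in site:                           # first "Cycle Pages"
        for hotel in result_page:                      # first "Loop Item" (hotels)
            for review_page in hotel["review_pages"]:  # second "Cycle Pages"
                for customer, review in review_page:   # second "Loop Item" (reviews)
                    yield {"hotel": hotel["hotel"],
                           "customer": customer,
                           "review": review}

rows = list(scrape(SITE))
for row in rows:
    print(row)
```

Each output row repeats the hotel name next to one review, which is the same shape as the rows Octoparse produces when hotel fields and review fields are extracted in one task.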



Step 10. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
 


Step 11. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.


Note:
1. After completing the workflow, you can check it by clicking the actions one by one from the beginning of the workflow.
2. You will find that some data fields have missing values in the output. In this case, you need to figure out why Octoparse could not extract the values for those fields. Click this article to find out the reasons for missing values when using Local Extraction.
The original XPath for some data fields may fail to select the elements correctly, which results in missing values for those fields. In this case, you can modify the XPath expressions for those fields. Here I replace all the SPAN tags with * for all the data fields. Click "Save" to save the configuration. You can follow this tutorial to modify XPath expressions in Octoparse.
Knowing how to edit XPath expressions can help you solve many problems when scraping data from websites. The Octoparse tutorials and FAQs on XPath can help you pick it up quickly.
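The SPAN-to-* substitution mentioned in the note can be sketched as a small text transformation. The XPath below is a made-up example, not one taken from the actual task, and the function name is hypothetical. In Octoparse you make this change by hand in the "Matching XPath" box.

```python
import re

def relax_span_steps(xpath: str) -> str:
    """Replace every SPAN element step with the * wildcard."""
    # Match SPAN only where it is a whole step: after "/" and before "[", "/" or the end.
    return re.sub(r"(?<=/)SPAN(?=\[|/|$)", "*", xpath, flags=re.IGNORECASE)

# Hypothetical XPath of the kind Octoparse generates.
original = "//DIV[@id='REVIEWS']/DIV[1]/SPAN[2]/SPAN"
relaxed = relax_span_steps(original)
print(relaxed)
```

The wildcard matches an element of any tag name at that position, so the expression still works on pages where the value is wrapped in a different tag than SPAN.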



Author: The Octoparse Team

- See more at: Octoparse Tutorial
