Sunday, February 26, 2017

Web Scraping Hotel Information from Google Maps


Octoparse enables you to scrape the search results from Google Maps.

In this web scraping tutorial, we will scrape the Google Maps search results for hotels and restaurants in the US.

The website URL we will use is https://www.google.com/maps/search/+hotel/@35.9018652,-104.7657719,5z/data=!3m1!4b1.
The data fields include the restaurant/hotel name, website, address, phone number and the overall star rating.
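
For readers wondering what the parts of the example URL mean, here is a minimal Python sketch that splits it into the search keyword and the map viewport. It assumes the path follows the shape seen in the example (`/maps/search/<keyword>/@<latitude>,<longitude>,<zoom>z`); this shape is inferred from the URL itself, not from a documented API, and the trailing `data=` segment is left untouched. The function name `parse_maps_search_url` is made up for this illustration.

```python
import re
from urllib.parse import urlparse, unquote_plus

# Assumed URL shape: /maps/search/<keyword>/@<latitude>,<longitude>,<zoom>z/...
SEARCH_PATTERN = re.compile(
    r"/maps/search/([^/]+)/@(-?[\d.]+),(-?[\d.]+),(\d+(?:\.\d+)?)z"
)

def parse_maps_search_url(url):
    """Return the keyword, latitude, longitude and zoom found in the URL path."""
    path = urlparse(url).path
    match = SEARCH_PATTERN.search(path)
    if match is None:
        raise ValueError("URL does not look like a Google Maps search URL")
    keyword, latitude, longitude, zoom = match.groups()
    return {
        # "+" stands for a space in the keyword segment, so decode and trim it.
        "keyword": unquote_plus(keyword).strip(),
        "latitude": float(latitude),
        "longitude": float(longitude),
        "zoom": float(zoom),
    }

EXAMPLE_URL = (
    "https://www.google.com/maps/search/+hotel/"
    "@35.9018652,-104.7657719,5z/data=!3m1!4b1"
)
info = parse_maps_search_url(EXAMPLE_URL)
print(info)
```

In this reading, the example URL searches for "hotel" with the map centered near latitude 35.9, longitude -104.77 at zoom level 5, which is wide enough to show most of the US.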

You can download the task (the .otd file) directly and start collecting the data, or you can follow the steps below to build a scraping task that extracts hotel and restaurant information from Google Maps. (Download my extraction task for this tutorial HERE in case you need it.)

Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click "Next".
 

Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.

(URL of the example: https://www.google.com/maps/search/+hotel/@35.9018652,-104.7657719,5z/data=!3m1!4b1)
 
Note: Keep your cursor off the map area when you use the mouse wheel, so that you don't scroll the map by accident.

Step 3. Create a loop for keywords
Wait until the page has loaded, then drop a “Loop” action into the Workflow.
Select a loop mode ➜ Choose “Text list” ➜ Enter the keywords (restaurant and hotel) ➜ Click “OK” ➜ Click “Save”.

Click the search bar ➜ Select "Enter text value", and an "Enter Text" action will be created ➜ Drag the "Enter Text" action into the Loop ➜ Check the option "Loop Text - Use the text in the Loop Item to fill in the text box" ➜ Click “Save”.

Click the Search icon ➜ Select "Click an item", and a "Click item" action will be created.

Navigate to the "Click Item" action ➜ Tick the "AJAX Load" checkbox ➜ Set an AJAX timeout of 5 seconds (or longer) ➜ Click "Save".
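
The keyword loop in Step 3 can be pictured outside Octoparse as a plain loop over a text list. The Python sketch below builds one search URL per keyword instead of typing into the search bar. It assumes that Google Maps accepts the same URL shape as the example in this tutorial for other keywords; that is an assumption, and `build_search_url` and `VIEWPORT` are names invented for this illustration.

```python
from urllib.parse import quote_plus

# The same text list that is entered in the Loop action.
KEYWORDS = ["restaurant", "hotel"]

# Viewport copied from the example URL of this tutorial.
VIEWPORT = "@35.9018652,-104.7657719,5z"

def build_search_url(keyword, viewport=VIEWPORT):
    """Build a search URL for one keyword (assumed URL shape, see above)."""
    return f"https://www.google.com/maps/search/{quote_plus(keyword)}/{viewport}"

# One iteration per keyword, like the "Text list" loop mode.
for keyword in KEYWORDS:
    print(build_search_url(keyword))
```

Octoparse itself fills the search box and clicks the Search icon for each keyword; the sketch only illustrates that the Loop runs the enclosed actions once per entry in the list.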
 


Step 4. Extract data from multiple search result pages

Click on the "Next page" icon ➜ Select "Click an item", and a "Click item" action will be created ➜ Select the “Customize Field” button ➜ Choose “Define ways to locate an item” ➜ Copy the XPath expression ➜ Click "Cancel" ➜ Click "Cancel".

Drag a "Loop" item into the workflow ➜ Choose a "Loop Mode" under "Advanced Options" ➜ Select the "Single Element" option ➜ Paste the XPath into the "Single Element" text box ➜ Click "Save".

Then drop the "Click item" action into the "Loop item" box we've just created ➜ Check the "AJAX Load" option under Advanced Options ➜ Set an AJAX timeout of 5 seconds (or longer) ➜ Click "Save".

Check the "Use Loop" option ➜ Click "Save".
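
The pagination loop built in Step 4 boils down to "process the page, click Next, wait for the AJAX content, repeat until there is no Next button". The Python sketch below shows that control flow with stand-in data. It does not talk to Google Maps or Octoparse: `paginate`, `fake_next` and `FAKE_PAGES` are invented for this illustration, and `wait_seconds` plays the role of the AJAX timeout.

```python
import time

def paginate(first_page, get_next, wait_seconds=5.0, max_pages=50):
    """Yield pages one at a time, following get_next until it returns None."""
    page = first_page
    for _ in range(max_pages):  # hard cap so a broken "Next" cannot loop forever
        yield page
        next_page = get_next(page)  # stands in for clicking "Next page"
        if next_page is None:
            break
        time.sleep(wait_seconds)  # stands in for the AJAX timeout
        page = next_page

# Stand-in data: three fake result pages.
FAKE_PAGES = [["Hotel A", "Hotel B"], ["Hotel C"], ["Hotel D", "Hotel E"]]

def fake_next(page):
    """Return the page after `page`, or None on the last page."""
    index = FAKE_PAGES.index(page)
    return FAKE_PAGES[index + 1] if index + 1 < len(FAKE_PAGES) else None

names = [
    name
    for page in paginate(FAKE_PAGES[0], fake_next, wait_seconds=0)
    for name in page
]
print(names)
```

If the wait is too short, the next iteration may read the previous page again before the new results have loaded, which is one way duplicate rows can appear.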

Step 5. Extract the items from the search results page.

For some websites, we need to right-click the items when creating the extraction list, so that the click does not trigger the item's hyperlink.

Right-click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first item has now been added to the list. ➜ Click "Continue to edit the list".
Click the second item ➜ Click "Add current item to the list" again (now all the results with a similar layout are selected) ➜ Click "Finish Creating List" ➜ Click "Loop" to process the list and extract the detailed information for each item.

Once the loop for processing the list has been created, we can replace its "Extract Data" action with a "Click Item" action.
Right-click the "Extract Data" action inside the Loop ➜ Choose "Delete" ➜ Drag a "Click Item" action into this Loop ➜ Click “Save”. Google Maps will then open the detail page of the restaurant.

Navigate to the "Click Item" action ➜ Tick the "AJAX Load" checkbox ➜ Set an AJAX timeout of 5 or 10 seconds ➜ Click "Save".

Step 6. Extract detailed information from these items.

Click the website of the restaurant ➜ Select "Extract text". The other fields can be extracted in the same way.
All the extracted content will be listed in Data Fields. ➜ Click a "Field Name" to rename it, then click "Save".
 

After viewing the results of Local Extraction, you may find that some of the extracted data has ended up in the wrong field. If so, you can modify the XPath for those data fields.
Click the data field ➜ Select the “Customize Field” button ➜ Choose “Define ways to locate an item” ➜ Modify the XPath expression ➜ Click "OK" ➜ Click "Save".
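
To see why an XPath may need modifying, consider the small Python sketch below. It compares a position-based XPath with an attribute-based one on two made-up detail panels. The markup, the `data-field` attribute and the function names are hypothetical and do not reflect the real Google Maps page. The sketch uses the XPath subset in Python's standard `xml.etree.ElementTree`, which is more limited than the XPath engine of a browser-based tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. This is NOT the real Google Maps DOM.
FULL_PANEL = """
<div class="panel">
  <h1>Example Hotel</h1>
  <div data-field="address">1 Example Street</div>
  <div data-field="website">example.com</div>
  <div data-field="phone">555-0100</div>
</div>
"""

# The same kind of panel for a place that lists no address.
PANEL_WITHOUT_ADDRESS = """
<div class="panel">
  <h1>Example Diner</h1>
  <div data-field="website">example.org</div>
  <div data-field="phone">555-0101</div>
</div>
"""

def website_by_position(panel):
    """Position-based XPath: take the second child div, whatever it holds."""
    return panel.find("./div[2]").text

def website_by_attribute(panel):
    """Attribute-based XPath: take the div that is marked as the website."""
    return panel.find("./div[@data-field='website']").text

full = ET.fromstring(FULL_PANEL)
partial = ET.fromstring(PANEL_WITHOUT_ADDRESS)

print(website_by_position(full), website_by_attribute(full))
print(website_by_position(partial), website_by_attribute(partial))
```

On the second panel, the position-based expression returns the phone number in the website field, which is the kind of misplaced data described above. An expression anchored to something specific to the field is usually more robust than one that depends on position.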
 

Click on the "Back to results" button ➜ Select "Click an item", and a "Click item" action will be created ➜ Click "Save".

Navigate to the "Click Item" action ➜ Check the "AJAX Load" option under Advanced Options ➜ Set an AJAX timeout of 5 seconds ➜ Click "Save".

Step 7. Check the workflow.

Now check the workflow by clicking through the actions from the beginning of the workflow and adjusting their order where needed.
  
Go to the webpage ➜ The first Loop Item box ➜ Enter Text ➜ Click Item ➜ The second Loop Item box ➜ Adjust the order of the third Loop Item box inside the second Loop Item box ➜ The third Loop Item box ➜ Click Item ➜ Extract Data ➜ Click Item ➜ Click Item.
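
The finished workflow is three nested loops: keywords, then result pages, then the items on each page. The Python sketch below mirrors that nesting with invented stand-in data, so that the order of the actions is easier to check. `FAKE_RESULTS` and `run_workflow` are hypothetical names, and nothing here calls Octoparse or Google Maps.

```python
# Stand-in search results: keyword -> list of pages -> list of items (all invented).
FAKE_RESULTS = {
    "restaurant": [
        [{"name": "Diner A", "phone": "555-0101"}],
        [{"name": "Diner B", "phone": "555-0102"}],
    ],
    "hotel": [
        [{"name": "Hotel A", "phone": "555-0103"}],
    ],
}

def run_workflow(results):
    """Walk the stand-in data in the same order as the workflow actions."""
    rows = []  # what "Extract Data" collects
    log = []   # the sequence of actions, for checking the order
    for keyword, pages in results.items():   # first Loop Item: text list
        log.append(f"enter text: {keyword}")
        log.append("click search")
        for items in pages:                  # second Loop Item: pagination
            for item in items:               # third Loop Item: result list
                log.append(f"click item: {item['name']}")
                rows.append({"keyword": keyword, **item})  # Extract Data
                log.append("click back to results")
            log.append("click next page")    # last Click Item of the page loop
    return rows, log

rows, log = run_workflow(FAKE_RESULTS)
for row in rows:
    print(row)
```

The item loop sits inside the pagination loop and runs before the "Next page" click, so every item on a page is visited before the workflow moves on.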
 

Step 8. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the selected data.
 

Step 9. All the extracted data will be shown in the "Data Extracted" pane. If the output contains too many duplicates, you can lengthen the AJAX timeout of the "Click Item" action for pagination (see Step 4).
Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
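
If some duplicates remain after export, they can also be removed afterwards. The Python sketch below drops rows that share the same name and address and writes the rest as CSV. The sample rows are invented, the column names are only a guess at how the fields might be named, and `drop_duplicates` and `to_csv` are helper names made up for this illustration.

```python
import csv
import io

def drop_duplicates(rows, key_fields=("name", "address")):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        # Compare case-insensitively and ignore surrounding spaces.
        key = tuple(row.get(field, "").strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def to_csv(rows, fieldnames):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented sample rows; the second one repeats the first.
SAMPLE_ROWS = [
    {"name": "Example Hotel", "address": "1 Example Street", "rating": "4.5"},
    {"name": "example hotel ", "address": "1 Example Street", "rating": "4.5"},
    {"name": "Example Diner", "address": "2 Example Street", "rating": "4.0"},
]

unique_rows = drop_duplicates(SAMPLE_ROWS)
csv_text = to_csv(unique_rows, fieldnames=["name", "address", "rating"])
print(csv_text)
```

Matching on name and address together is a design choice: two branches of the same chain share a name but not an address, so both are kept.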



Author: The Octoparse Team
- See more at: Octoparse Tutorial
