Monday, January 2, 2017

Scrape Data from YellowPages.com

Octoparse enables you to scrape yellowpages.com (www.yp.com). You can capture names, addresses, cities, phone numbers, websites, etc. for a given profession in a given region listed on yellowpages.com.
  
In this tutorial we will scrape all anesthesiologists in New York, NY, United States from yellowpages.com with Octoparse.
The data fields include Name, Location, City, ZipCode, PhoneNumber, Today’s working hours, Tomorrow’s working hours, Website, Business hours from Monday to Friday, Business hours on weekends, Payment method, Neighborhoods, AKA, Other links and Categories.

You can download the task (the .otd file) directly and begin collecting the data, or you can follow the steps below to build a scraping task that extracts anesthesiologist information from yellowpages.com.
(Download my extraction task for this tutorial HERE in case you need it.)

Step 1. Set up basic information.
Click “Quick Start” ➜ Choose “New Task (Advanced Mode)” ➜ Complete basic information.

Step 2. Enter the target URL in the built-in browser. ➜ Click the “Go” icon to open the webpage.
(URL of the example: http://www.yellowpages.com/search?search_terms=Anesthesiologists&geo_location_terms=New+York ) 
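
The example URL is just a base address plus two query parameters, so the same pattern can be reused for other professions or cities. Here is a minimal Python sketch of that idea. The function name build_search_url is my own, and I have only confirmed the parameter names against the example URL above, so treat other values as untested.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in Step 2.
BASE_URL = "http://www.yellowpages.com/search"


def build_search_url(search_terms: str, location: str) -> str:
    """Build a search URL using the two query parameters seen in the example.

    build_search_url is a hypothetical helper, not part of Octoparse.
    """
    query = urlencode({
        "search_terms": search_terms,
        "geo_location_terms": location,
    })
    return f"{BASE_URL}?{query}"


# Reproduces the example URL from this tutorial (spaces are encoded as "+").
print(build_search_url("Anesthesiologists", "New York"))
```

Paste the resulting URL into the built-in browser exactly as you would the example one.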

Step 3. Click on the “Next” pagination link. ➜ Choose “Loop click in the element” to turn the page.

(Note:
If you want to extract information from every page of the search results, you need to add a page navigation action.
You can right click the “Next” pagination link to avoid triggering the link.
If “Loop click in the element” does not appear, click the “Expand the selection area” button until it does.)

Step 4. Move your cursor over the sections with similar layout, where you want to extract data.
Click the first highlighted link ➜ Click “Create a list of items” (sections with similar layout) ➜ Click “Add current item to the list”.
The first highlighted link has now been added to the list. ➜ Click “Continue to edit the list”.
Click the second highlighted link ➜ Click “Add current item to the list” again. Now we have all the links with similar layout. ➜ Click “Finish Creating List” ➜ Click “Loop” to process the list and extract the elements from each page.

Step 5. Extract the anesthesiologist information.
The first item in the loop does not always include all the content you want to extract, so pick an item that contains every field you need. Here we choose the item “Nyu Anesthesia Assoc”.
When you click the “Loop Item” box, you will find that nothing has been extracted in the loop yet. Click the “Go To Web Page” button to step through the workflow.

Extract the title: click the title ➜ select “Extract text”. Other content can be extracted in the same way.

Step 6. All the selected content will appear in Data Fields. ➜ Click a “Field Name” to rename it. Then click “Save”.

Step 7. Drag the second “Loop Item” before the “Click to paginate” action in the Workflow Designer so that we can grab the elements of every section across multiple pages.
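
If the ordering in Step 7 feels abstract, it may help to picture the workflow as two nested loops: the item loop runs over the current page first, and only then does the pagination click happen. The Python sketch below is a conceptual model only. It is not how Octoparse is implemented, and PAGES and scrape_all_pages are made-up names with placeholder data.

```python
from typing import Dict, List

# Hypothetical stand-in for the site: each inner list is one page of results.
PAGES: List[List[Dict[str, str]]] = [
    [{"Name": "Listing A"}, {"Name": "Listing B"}],
    [{"Name": "Listing C"}],
]


def scrape_all_pages(pages: List[List[Dict[str, str]]]) -> List[str]:
    """Collect the Name field of every item, page by page."""
    rows: List[str] = []
    if not pages:
        return rows
    page_index = 0
    while True:
        # "Loop Item": extract every section on the current page first.
        for item in pages[page_index]:
            rows.append(item["Name"])
        # "Click to paginate": move on only if there is a next page.
        if page_index + 1 >= len(pages):
            break
        page_index += 1
    return rows


print(scrape_all_pages(PAGES))
```

If the pagination click ran before the item loop, the items on the first page would be skipped, which is why the order in the Workflow Designer matters.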

Step 8. Click “Save” to save your configuration. Then click “Next” ➜ Click “Next” ➜ Click “Local Extraction” to run the task on your computer. Octoparse will automatically extract all the selected data.

Step 9. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
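
Once the results are exported, a quick sanity check is to count how many rows have an empty value in each field. The sketch below assumes you exported to CSV and that the column headers match the field names you set in Step 6. The sample rows are placeholders I made up, not real listings, and count_missing is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_missing(csv_text: str) -> Counter:
    """Count empty cells per column in CSV text with a header row."""
    missing: Counter = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field, value in row.items():
            if field is None:
                # Extra cells beyond the header row; ignore them here.
                continue
            if value is None or not value.strip():
                missing[field] += 1
    return missing


# Placeholder data standing in for an exported file.
SAMPLE = (
    "Name,PhoneNumber,website\n"
    "Listing A,(000) 000-0000,\n"
    "Listing B,,\n"
)

for field, count in count_missing(SAMPLE).items():
    print(f"{field}: {count} missing")
```

For a real export, read the file contents with open(path, newline="", encoding="utf-8").read() and pass the text to count_missing.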


Note:
YellowPages.com uses multiple XPath templates for the content of its detail pages.
As a result, you may find missing values in some data fields when using Cloud Extraction, because the XPath for those fields does not match every template.
In this case you can modify the XPath expressions for the affected data fields so that they are extracted reliably.
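
To illustrate why one XPath may not be enough, here is a small Python sketch that tries several candidate paths in order and keeps the first one that matches. Everything in it is an assumption for illustration: the two page snippets and the paths in PHONE_XPATHS are invented, not taken from the real site, and Python's xml.etree.ElementTree only supports a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical candidate paths, one per page template.
PHONE_XPATHS = [".//div[@class='phone']", ".//p[@class='phone']"]


def extract_first(root: ET.Element, xpaths: List[str]) -> Optional[str]:
    """Return the text of the first path that matches a non-empty node."""
    for xpath in xpaths:
        node = root.find(xpath)
        if node is not None and node.text and node.text.strip():
            return node.text.strip()
    return None  # No template matched: this is the "missing value" case.


# Two invented detail pages that store the same field differently.
PAGE_A = "<section><div class='phone'>(000) 000-0001</div></section>"
PAGE_B = "<section><p class='phone'>(000) 000-0002</p></section>"

for page in (PAGE_A, PAGE_B):
    print(extract_first(ET.fromstring(page), PHONE_XPATHS))
```

A single path tuned to the first template would return nothing for the second page, which is the kind of gap you would see as a missing value in the extracted data.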

Author: The Octoparse Team
- See more at: Octoparse Tutorial
