Thursday, December 29, 2016

Scraping Online Dictionary - Merriam-Webster.com

Octoparse lets you scrape an online dictionary into an organized list from a list of words that you supply. Using a Loop Item in "Text list" mode, it enters each word in turn and extracts the definition and examples for it.

In this tutorial, I will show you how to scrape the definitions of a few words from merriam-webster.com.
The website URL we will use is www.merriam-webster.com.
The data fields are the word, its characteristic (such as its part of speech), its definition, and an example.

You can download the task (the .otd file) directly and begin collecting the data, or you can follow the steps below to build a scraping task that extracts word definitions.
(Download my extraction task for this tutorial HERE in case you need it.)

Step 1. Set up basic information.
Click “Quick Start” ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click “Next”.

Step 2. Enter the target URL in the built-in browser. ➜ Click the “Go” icon to open the webpage.
(URL of the example: https://www.merriam-webster.com)

Note: If the page still shows as loading after its content has fully loaded, you can click the multiplication sign (×) to stop the loading.

Step 3. Create a loop for entering texts.
Drag a "Loop Item" into the Workflow Designer and then choose "Text list" as the "Loop mode".
Enter the word or list of words you want to look up in the "Text list" box and click "Save".
The list of words will then appear in the “Loop Item” box.

Next, click the search bar in the built-in browser, where the text will be entered, and choose the “Enter text value” option.

Drag the "Enter Text" box into the "Loop Item" box in the Workflow Designer. Then tick "Use the text in Loop Item to fill in the text box" and click "Save". The program will now enter the words one by one.
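
For readers who want to picture what the "Text list" loop does, here is a minimal Python sketch of the same idea outside Octoparse: one lookup per word, in list order. It is only an illustration, not how Octoparse works internally. The entry URL pattern in `BASE_URL`, the function name `build_entry_urls`, and every word other than "socialism" are assumptions made for this example.

```python
from urllib.parse import quote

# Assumed URL pattern for a word's entry page; it is not taken from this tutorial.
BASE_URL = "https://www.merriam-webster.com/dictionary/"


def build_entry_urls(words):
    """Return one entry URL per non-empty word, keeping the input order."""
    urls = []
    for word in words:
        cleaned = word.strip()
        if not cleaned:
            continue  # skip blank lines in the text list
        # Percent-encode spaces and other characters that are unsafe in a URL path.
        urls.append(BASE_URL + quote(cleaned))
    return urls


# Illustrative text list, one word or phrase per item.
text_list = ["socialism", "democracy", "ad hoc"]
for url in build_entry_urls(text_list):
    print(url)
```

The loop body runs once per list item, which mirrors how the "Enter Text" action inside the "Loop Item" is repeated for each word.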

Step 4. Get the search results.
Click the “Search” button on the website ➜ Choose “Click an item”.

Step 5. Extract the words’ definitions.
Now we are on the search result page for the first word, “socialism”.
Extract the word: click the word ➜ Select “Extract text”. The other fields can be extracted in the same way.
All the extracted content will be listed in Data Fields. ➜ Click a “Field Name” to rename it. Then click “Save”.

Step 6. Check the rule/workflow.
Now check the workflow by clicking each action in turn, starting from the beginning of the rule/workflow.
Go to the webpage ➜ Loop Item ➜ Enter Text ➜ Click Item ➜ Extract Data.

Step 7. Click “Save” to save your configuration. Then click “Next” ➜ Click “Next” ➜ Click “Local Extraction” to run the task on your computer. Octoparse will automatically extract all the selected data.

Step 8. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
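
If you later want to post-process the results yourself, the sketch below shows roughly what a CSV version of the exported rows could look like, using only Python's standard library. The field names follow the data fields listed at the start of this tutorial. The helper name `rows_to_csv` and the placeholder values are assumptions for illustration, and Octoparse's own export may be laid out differently.

```python
import csv
import io

# Column order mirrors the data fields named in this tutorial.
FIELDS = ["word", "characteristic", "definition", "example"]


def rows_to_csv(rows):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({name: row.get(name, "") for name in FIELDS})
    return buffer.getvalue()


# Placeholder values only; they are not real scraped data.
sample_rows = [
    {
        "word": "socialism",
        "characteristic": "noun",
        "definition": "(placeholder definition)",
        "example": "(placeholder example)",
    }
]
print(rows_to_csv(sample_rows))
```

Empty cells in a file like this are a quick way to spot fields whose extraction rule did not match anything.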
 
Note:
Correct use of XPath is the key to extracting data in Octoparse. If you find any missing values, go back to your rule, go through it from the beginning, and modify the XPath expressions for the affected data fields. Check out this article to modify the XPath expressions of a data field.
A little knowledge of XPath can help you solve a lot of problems when using Octoparse. The tutorials or FAQs below could help you pick up XPath quickly.
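
To make the XPath idea concrete, here is a small Python sketch that selects fields from a made-up snippet and shows how a non-matching expression produces a missing value. The markup, the class names (`hword`, `pos`, `definition`, `example`), and the helper `first_text` are hypothetical and are not the real structure of merriam-webster.com. Note also that Python's `xml.etree.ElementTree` supports only a subset of XPath, so the expressions Octoparse accepts may be richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; the real page markup will differ.
SNIPPET = """
<div class="entry">
  <h1 class="hword">socialism</h1>
  <span class="pos">noun</span>
  <p class="definition">(placeholder definition)</p>
</div>
""".strip()

root = ET.fromstring(SNIPPET)


def first_text(node, xpath):
    """Return the text of the first match, or "" when nothing matches."""
    match = node.find(xpath)
    if match is None or match.text is None:
        return ""
    return match.text.strip()


record = {
    "word": first_text(root, ".//h1[@class='hword']"),
    "characteristic": first_text(root, ".//span[@class='pos']"),
    "definition": first_text(root, ".//p[@class='definition']"),
    # No element matches this expression, so the field comes back empty.
    "example": first_text(root, ".//p[@class='example']"),
}
print(record)
```

The empty "example" field is the kind of missing value described above: the expression is valid, but it does not point at anything on the page, so it needs to be adjusted.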

You can use Cloud Extraction to speed up the extraction: our cloud servers will collect the data for you within a short time. Go to the pricing page for more information about our subscription plans and extraction services: http://www.octoparse.com/pricing



Author: The Octoparse Team

- See more at: Octoparse Tutorial
