Scraping Articles from Reuters.com
Octoparse enables you to scrape articles from reuters.com. Getting real-time data with Octoparse takes two parts: making a scraping task, and scheduling that task on Octoparse's cloud platform.
In this web scraping tutorial, we will use Octoparse to scrape U.S. market articles from reuters.com and collect the content of each article: the title, the body text, the published date/time, the author and the article URL.
The website URL we will use is http://www.reuters.com/news/archive/marketsNews?date=today.
The data fields include the article title, the body text, the published date/time, the author and the article URL.
You can download the task directly (the OTD file) and begin collecting the data, or you can follow the steps below to make a scraping task that scrapes U.S. market articles from reuters.com. (Download my extraction task for this tutorial HERE in case you need it.)
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.
(URL of the example: http://www.reuters.com/news/archive/marketsNews?date=today)
Step 3. Right click on the “Next” pagination link. ➜ Choose “Loop click in the element” to turn the page.
(Note:
If you want to extract information from every page of results, you need to add a page navigation action.
Right click the “Next” pagination link rather than left clicking it, so that the link is not triggered.
If the “Loop click in the element” option does not appear, click the “Expand the selection area” button until it does.)
Step 4. Move your cursor over the articles, which share a similar layout, to pick the ones whose content you want to extract.
Right click the first article ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first article has now been added to the list. ➜ Click "Continue to edit the list".
Right click the second article ➜ Click "Add current item to the list" again (now we have all the articles with a similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the content of the articles.
Step 5. Extract the content of the article.
The first item in the loop will not always include all the content you want to extract. In that case, pick an item in the loop that contains all the content you need. Here we choose the article “Offshore yuan set for biggest weekly gain as China bears down on speculators”.
When you click the “Loop Item” box, you will find that nothing has been extracted in the loop yet. Click the “Go To Web Page” button to step through the workflow.
Click the title of the article ➜ Select "Extract text". The other fields can be extracted in the same way.
All the extracted content will be listed under Data Fields. ➜ Click a "Field Name" to rename it. Then click "Save".
Step 6. Re-format the data field.
As you can see, the data field “BodyText” is not being extracted correctly. In this case, we can pull the exact text out of the outer HTML code by using regular expressions.
Step 6-1. Get the outer HTML code for the data field.
Choose the field you want to reformat ➜ Select the “Customize Field” button ➜ Choose “Define data extracted” ➜ Choose "Extract outer HTML, including the page source, text with format and images" under the "Extract data from page content" option. ➜ Click "OK" ➜ Click "Save". Then you will see the outer HTML code has been extracted.
Step 6-2. Use regular expressions to re-format the data field.
Choose the field you want to reformat ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data”.
There are two steps we need to perform. One is to match the paragraph elements that hold the body text; the other is to remove the HTML tags that remain, so that only the text is left.
Step one:
Click “Add step” ➜ Select “Match with Regular Expression” ➜ Enter ‘<p(.+?)</p>’ in the Regular Expression box ➜ Select “Match all” ➜ Click “OK”.
Step two:
Click “Add step” ➜ Select “Replace with Regular Expression” ➜ Enter ‘<(.+?)>’ in the Regular Expression box ➜ Click “Calculate” ➜ Click “OK” ➜ Click “Done”.
The body text of the article is now extracted correctly. Click "Save".
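For readers who want to see what the two re-format steps do to the outer HTML, here is a minimal Python sketch that applies the same two regular expressions outside Octoparse. It assumes that "Match all" keeps each whole match and that the replace step substitutes an empty string; the function name and the sample HTML are made up for illustration and are not taken from reuters.com.

```python
import re


def extract_body_text(outer_html: str) -> str:
    """Apply the tutorial's two regular expressions to a block of outer HTML."""
    # Step one ("Match with Regular Expression", "Match all"):
    # keep every <p ...>...</p> fragment and discard everything else.
    paragraphs = [
        match.group(0)
        for match in re.finditer(r"<p(.+?)</p>", outer_html, flags=re.DOTALL)
    ]
    # Step two ("Replace with Regular Expression"):
    # remove every remaining tag (assumed here to be replaced with nothing).
    cleaned = [re.sub(r"<(.+?)>", "", fragment).strip() for fragment in paragraphs]
    return "\n".join(text for text in cleaned if text)


# Hypothetical outer HTML, only to show the effect of the two steps.
sample = (
    '<div id="x"><span>Ad</span><p>First paragraph.</p>'
    '<p class="b">Second <a href="#">linked</a> paragraph.</p></div>'
)
print(extract_body_text(sample))
```

Note that the pattern '<p(.+?)</p>' would also match other tags whose names begin with "p", so the result should be checked against a real article before relying on it.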
Step 7. Drag the “Loop Item” box before the “Click to paginate” action of the “Cycle Pages” box in the Workflow, so that the articles on each page are extracted before Octoparse moves on to the next page.
Step 8. Check the workflow.
Now we need to check the workflow by clicking actions from the beginning of the workflow.
Go to the webpage (Set a timeout of 60 seconds) ➜ The Cycle Pages box ➜ The Loop Item box ➜ Click Item ➜ Extract Data ➜ Click to paginate.
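The order of actions checked above can be pictured as a nested loop: an outer loop over pages and an inner loop over the articles on each page, with pagination as the last action in the outer loop. The following Python sketch only mimics that order on in-memory data; it is not how Octoparse is implemented, and the function name, field names and sample pages are hypothetical.

```python
from typing import Dict, List

Article = Dict[str, str]


def run_workflow(pages: List[List[Article]]) -> List[Article]:
    """Walk through pages in the same order as the workflow boxes."""
    rows: List[Article] = []
    if not pages:
        return rows
    page_index = 0                              # Go to the webpage
    while True:                                 # Cycle Pages box
        for article in pages[page_index]:       # Loop Item box
            # Click Item, then Extract Data
            rows.append({"Title": article["title"], "BodyText": article["body"]})
        if page_index + 1 >= len(pages):        # no "Next" link left
            break
        page_index += 1                         # Click to paginate
    return rows


# Two hypothetical result pages, used only to show the traversal order.
demo_pages = [
    [{"title": "A", "body": "a"}, {"title": "B", "body": "b"}],
    [{"title": "C", "body": "c"}],
]
print([row["Title"] for row in run_workflow(demo_pages)])
```

If pagination ran before the inner loop, the articles on the first page would be skipped, which is why the “Loop Item” box has to sit before “Click to paginate”.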
Step 9. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 10. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
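As a rough picture of what an exported table looks like, the sketch below writes a few rows to CSV text with Python's standard csv module. Octoparse performs the export through its own interface; the column names used here (Title, BodyText, PublishedTime, Author, URL) and the sample row are assumptions chosen to mirror the data fields of this tutorial.

```python
import csv
import io
from typing import Dict, List

# Assumed column names, mirroring the tutorial's data fields.
FIELDS = ["Title", "BodyText", "PublishedTime", "Author", "URL"]


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize extracted rows to CSV text, one column per data field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)  # missing fields are written as empty cells
    return buffer.getvalue()


# Hypothetical row; a real export would contain the scraped values.
print(rows_to_csv([{"Title": "A", "Author": "B"}]))
```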
Part 2. Schedule a task and run it on Octoparse's cloud platform.
Once you have made the scraping task by following the steps above, you can schedule it to run on Octoparse's cloud platform.
Step 1. Find the task you've just made ➜ Double click the task to open it ➜ Keep clicking "Next" until you reach the "Done" step ➜ Select the option “Schedule Cloud Extraction Settings” to begin the scheduling process.
Step 2. Set the parameters.
In the “Schedule Cloud Extraction Settings” dialog box, you can set the Periods of Availability for the extraction and the Run Mode, which controls how often the periodic task runs to collect data.
· Periods of Availability - The period during which data is extracted, set by the Start date and End date.
· Run Mode - Once, Weekly, Monthly, Real Time
Set a suitable time interval for collecting the articles and click "Start" to schedule your task.
After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue and you can check the status of the task.
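To make the relation between the two settings concrete, the sketch below lists the run times that a fixed interval would produce inside a start/end window, for example a weekly run within the Periods of Availability. This is only an illustration of the idea, not Octoparse's scheduler; the function name and the dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def planned_runs(start: datetime, end: datetime, interval: timedelta) -> List[datetime]:
    """List run times from start to end (inclusive), spaced by interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    runs: List[datetime] = []
    current = start
    while current <= end:
        runs.append(current)
        current += interval
    return runs


# Hypothetical window: a weekly run over three weeks.
for run in planned_runs(datetime(2017, 1, 2), datetime(2017, 1, 23), timedelta(weeks=1)):
    print(run.isoformat())
```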
Author: The Octoparse Team
- See more at: Octoparse Tutorial