Scraping Stock Information from CNN Money
Octoparse enables you to scrape stock information from financial websites. Getting real-time data in Octoparse takes two parts: making a scraping task, and scheduling the task to run in the Octoparse cloud.
In this web scraping tutorial we will use Octoparse to scrape stock data for finance companies from the CNN Money website.
The website URL we will use is http://money.cnn.com/data/sectors/finance/?sector=4800&industry=Industries%20within%20Finance&page=1.
The data fields include company name, market capitalization, price-earnings ratio, price, change, ratio of change and YTD change.
You can directly download the task (the .otd file) to begin collecting the data, or you can follow the steps below to make a scraping task that scrapes stock information from CNN Money. (Download my extraction task for this tutorial HERE just in case you need it.)
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.
(URL of the example: http://money.cnn.com/data/sectors/finance/?sector=4800&industry=Industries%20within%20Finance&page=1)
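The example URL carries its settings in the query string (sector, industry and page). As a minimal sketch using only the Python standard library, the query can be split into its parts and the page value swapped. The helper name page_url is hypothetical, and whether the site actually honors a changed page value is an assumption; the tutorial itself moves between pages by clicking the "Next page" button.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

EXAMPLE_URL = (
    "http://money.cnn.com/data/sectors/finance/"
    "?sector=4800&industry=Industries%20within%20Finance&page=1"
)


def page_url(url: str, page: int) -> str:
    """Return the same URL with its `page` query parameter replaced.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)      # values come back as lists of strings
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


# Decode the query string of the tutorial's URL.
params = parse_qs(urlsplit(EXAMPLE_URL).query)
for name, values in params.items():
    print(name, "=", values[0])

print(page_url(EXAMPLE_URL, 2))
```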
Step 3. Extract the values from the drop-down menu.
We will extract all the finance companies from different industries, and thus we need to create a loop for the drop-down menu.
Click the combobox under the "Finance Companies" tag ➜ Click "Loop switch combobox" (now Octoparse can loop through the values in the drop-down menu) ➜ Click "Save".
Step 4. Click the "Next page" button to extract more information from multiple web pages.
We need to extract stock information for all the companies. Since clicking the "Next page" button directly does not create a loop automatically, we will create the loop manually.
Drag a "Loop" item inside the loop for the drop-down menu, under the "Switch Dropdown" action ➜ Choose a "Loop Mode" under "Advanced Options" ➜ Select the "Single Element" option.
Enter the XPath expression that locates the "Next page" button into the "Single Element" text box ➜ Click "Save". The XPath expression is //SPAN[@id='fwd']
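To see what an expression of this shape selects, here is a minimal sketch that runs a similar path against a small hand-written snippet. The markup below is hypothetical and is not copied from CNN Money. Python's xml.etree.ElementTree supports only a subset of XPath and matches tag names case-sensitively, so the sketch writes span in lowercase to match the sample markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup, for illustration only.
SAMPLE = """
<div id="pager">
  <span id="back">Previous</span>
  <span id="fwd">Next</span>
</div>
"""

root = ET.fromstring(SAMPLE)

# Find the first <span> anywhere below the root whose id attribute is "fwd".
next_button = root.find(".//span[@id='fwd']")

print(next_button.tag, next_button.attrib["id"], next_button.text)
```

Only the element with id "fwd" is selected, which is why a single-element loop on this expression keeps targeting the same button on every page.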
Then drop a "Click item" action into the "Loop item" box ➜ Select "Click items in loop item" ➜ Select "Open the link in new tab" ➜ Click "Save". Now Octoparse will automatically click on the "Next page" button to reveal more information.
Step 5. Move your cursor over the section of the table from which you want to extract the companies' stock data.
Click the first company ➜ Click the "Expand the selected area" button to select the whole row ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first company has now been added to the list ➜ Click "Continue to edit the list".
Click the second company ➜ Click the "Expand the selected area" button to select the whole row ➜ Click "Add current item to the list" again (now the list contains all the companies with a similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the detailed information for each company.
Step 6. Extract the stock information from the table.
Click the company name ➜ Select "Extract text". The other columns can be extracted in the same way.
All the extracted content will be listed under Data Fields ➜ Click a "Field Name" to rename it. Then click "Save".
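Conceptually, this step turns each table row into one record with named fields. The sketch below shows that idea on a hand-written table using the Python standard library. The field names, their column order and the sample values are all assumptions made for illustration; they are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Assumed field names and column order, based on the data fields listed
# at the top of this tutorial.
FIELDS = [
    "company", "market_cap", "pe_ratio", "price",
    "change", "change_pct", "ytd_change",
]

# Hypothetical rows; the values are made up.
SAMPLE_TABLE = """
<table>
  <tr><td>Example Bank</td><td>10.0B</td><td>12.5</td><td>50.00</td><td>+0.50</td><td>+1.01%</td><td>+5.00%</td></tr>
  <tr><td>Sample Trust</td><td>2.5B</td><td>9.8</td><td>20.00</td><td>-0.20</td><td>-0.99%</td><td>-3.00%</td></tr>
</table>
"""


def extract_rows(table_markup):
    """Turn every complete <tr> into a dict keyed by FIELDS."""
    table = ET.fromstring(table_markup)
    rows = []
    for tr in table.iter("tr"):
        cells = [(td.text or "").strip() for td in tr.findall("td")]
        if len(cells) == len(FIELDS):   # skip header or malformed rows
            rows.append(dict(zip(FIELDS, cells)))
    return rows


records = extract_rows(SAMPLE_TABLE)
for record in records:
    print(record["company"], record["price"], record["ytd_change"])
```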
Step 7. Drag the third "Loop Item" box in front of the "Click Item" action of the second "Loop Item" box in the Workflow, so that the rows are extracted on every page before Octoparse clicks through to the next one.
Step 8. Check the workflow.
Now we need to check the workflow by clicking actions from the beginning of the workflow.
Go to the webpage ➜ The first Loop Item box for drop-down menu ➜ Switch Dropdown (Uncheck the "Load the page with AJAX" option) ➜ The second Loop Item box ➜ The third Loop Item box ➜ Extract Data ➜ Click Item.
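The checked workflow is three nested loops: one over the drop-down values, one over the pages, and one over the rows of each page, with the "Next page" click happening after the rows have been extracted. A rough sketch of that control flow follows. FAKE_SITE is a hypothetical in-memory stand-in for the website, and this is not how Octoparse is implemented internally.

```python
# Hypothetical stand-in for the site: industry -> list of pages -> company rows.
FAKE_SITE = {
    "Banks": [["Bank A", "Bank B"], ["Bank C"]],
    "Insurance": [["Insurer A"]],
}


def run_workflow(site):
    """Mirror the workflow order: switch drop-down, extract rows, click next."""
    extracted = []
    for industry, pages in site.items():          # loop 1: drop-down values
        page_index = 0
        while True:                               # loop 2: "Next page" button
            for company in pages[page_index]:     # loop 3: rows on this page
                extracted.append((industry, page_index + 1, company))
            if page_index + 1 >= len(pages):      # no further page to click
                break
            page_index += 1                       # the "Click Item" action
    return extracted


for row in run_workflow(FAKE_SITE):
    print(row)
```

Placing the row loop before the click is what Step 7 arranges: if the click came first, the rows of the first page would be skipped.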
Step 9. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 10. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format and save the file to your computer.
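If you later want to process the records yourself instead of relying on the built-in export, the same kind of output can be written with Python's csv module. This is a minimal sketch; the function name records_to_csv, the field names and the sample values are hypothetical.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records for illustration.
sample_records = [
    {"company": "Example Bank", "price": "50.00"},
    {"company": "Sample Trust", "price": "20.00"},
]

csv_text = records_to_csv(sample_records, ["company", "price"])
print(csv_text)
```

To save the result to disk, write csv_text to a file opened with newline="" so that the line endings produced by the csv module are kept as they are.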
Part 2. Schedule a task and run it on Octoparse's cloud platform.
Once you have built the scraping task by following the steps above, you can schedule it to run in the Octoparse cloud.
Step 1. Find the task you've just made ➜ Double-click the task to open it ➜ Keep clicking "Next" until you reach the "Done" step ➜ Select the option "Schedule Cloud Extraction Settings" to begin the scheduling process.
Step 2. Set the parameters.
In the "Schedule Cloud Extraction Settings" dialog box, you can select the Periods of Availability for the extraction and the Run Mode, which determines how often the task runs to collect data.
· Periods of Availability - The data extraction period, set by a Start date and an End date.
· Run Mode - Once, Weekly, Monthly, Real Time
Set a suitable time interval for collecting the stock data and click "Start" to schedule the task.
After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue and you can check the status of the task.
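To picture how a period of availability combines with a weekly run mode, the sketch below lists the dates on which a weekly task would fall between a start date and an end date. It only illustrates the idea; the function name weekly_run_dates is hypothetical, and Octoparse's actual scheduling rules are not described in this tutorial.

```python
from datetime import date, timedelta


def weekly_run_dates(start, end, weekday):
    """List every date from start to end (inclusive) that falls on `weekday`.

    weekday follows date.weekday(): 0 is Monday, 6 is Sunday.
    """
    offset = (weekday - start.weekday()) % 7      # days until the first run
    current = start + timedelta(days=offset)
    runs = []
    while current <= end:
        runs.append(current)
        current += timedelta(days=7)
    return runs


# Example window: every Friday in January 2024.
for run in weekly_run_dates(date(2024, 1, 1), date(2024, 1, 31), weekday=4):
    print(run.isoformat())
```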
Author: The Octoparse Team
- See more at: Octoparse Tutorial
Tags: Octoparse, tutorial, Web scraping