Speed up Cloud Extraction (1)
(from http://www.octoparse.com/tutorial/speed-up-cloud-extraction-1/?category=CLOUDEXTRACTION)
In this tutorial, you will learn how to speed up Cloud Extraction by telling the program to split one task into multiple subtasks.
When you create a "Loop Item" to extract the items in a list, the default "Loop mode" is "Variable list", and the auto-generated XPath selects all the items in the list. However, with "Variable list" or "Single element", the task cannot be split into multiple subtasks on the cloud platform. (See the picture below.)
To speed up Cloud Extraction, switch the "Loop mode" to "Fixed list", "List of URLs" or "Text list". The cloud servers are then each assigned subtasks to scrape, so all the data is collected in a shorter time.
Let's look at an example of a "Variable list" auto-generated by Octoparse.
Website URL:
(Click here to download the demo task in this tutorial.)
To build a pattern for the data we'd like to scrape, we create a "Loop Item" by adding 2 items, and Octoparse auto-generates the XPath for "Variable list". (See the screenshot below.)
The XPath expression is:
//DIV[@id='tile-container']/UL[1]/LI/DIV[1]/A[2]/H3[1]/DIV[1]
Please follow the steps below to learn how to edit the original XPath and generate XPath expressions for "Fixed list". We recommend modifying the original XPath in Firebug and Firepath so that you can check whether the XPath is correct.
Step 1.
Open the website URL in the Firefox browser and turn on Firebug.
(Click HERE to learn how to download Firebug and Firepath.)
Step 2.
Copy the XPath (//DIV[@id='tile-container']/UL[1]/LI/DIV[1]/A[2]/H3[1]/DIV[1]) into Firepath and press the "Enter" key. We can see that all the items are selected by the XPath expression.
Step 3.
To select the second item of the table by its position among its siblings, we need to use [] with a number after a tag as an index.
For example,
//a[1] # first <a>
//a[last()] # last <a>
//ol/li[2] # second <li>
//ol/li[position()=2] # same as above
//ol/li[position()>1] # every <li> except the first, like :not(:first-child)
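The effect of a positional index can be checked outside the browser as well. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a subset of XPath (numeric indexes and last(), but not position()>1). The markup is a made-up stand-in for the real page, not its actual source.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real item list.
markup = """
<div id="tile-container">
  <ul>
    <li><h3>Item A</h3></li>
    <li><h3>Item B</h3></li>
    <li><h3>Item C</h3></li>
  </ul>
</div>
"""

root = ET.fromstring(markup)  # root is the <div> element

# No index after li: every item is selected (the "Variable list" behaviour).
all_items = [h3.text for h3 in root.findall("./ul/li/h3")]

# li[2]: only the second <li> among its siblings.
second_item = [h3.text for h3 in root.findall("./ul/li[2]/h3")]

# li[last()]: only the last <li> among its siblings.
last_item = [h3.text for h3 in root.findall("./ul/li[last()]/h3")]

print(all_items)
print(second_item)
print(last_item)
```

Without an index the path matches all three items; with an index it matches exactly one, which is what a "Fixed list" entry needs.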
In the original XPath, there is no index number in "[]" after the <LI> tag. This means that the XPath expression selects all <LI> tags, i.e., all the items of the table on this web page. We can therefore add [] with a number to generate XPath expressions that each select one specific item.
Here, insert [2] after the <LI> tag to select the second item. If we change the number to 4, we select the fourth item.
Note: You can learn more about XPath from these tutorials here.
Step 4.
Copy the XPath expression and paste it into Notepad, then change the index number on each line to generate the XPath expressions that select each item on this web page.
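Instead of editing each line by hand in Notepad, the list can also be generated with a short script. This is a sketch only: the function name fixed_list_xpaths and its parameters are hypothetical, and the count of 40 is taken from the number of items mentioned for this page.

```python
# XPath auto-generated by Octoparse for the "Variable list" (from this tutorial).
ORIGINAL_XPATH = "//DIV[@id='tile-container']/UL[1]/LI/DIV[1]/A[2]/H3[1]/DIV[1]"


def fixed_list_xpaths(xpath, step="LI", count=40):
    """Return one XPath per item by adding [1]..[count] after the given step.

    Hypothetical helper: it assumes the un-indexed step (e.g. /LI/) occurs
    exactly once in the expression.
    """
    marker = f"/{step}/"
    if xpath.count(marker) != 1:
        raise ValueError(f"expected exactly one un-indexed {marker!r} step")
    return [xpath.replace(marker, f"/{step}[{i}]/") for i in range(1, count + 1)]


xpaths = fixed_list_xpaths(ORIGINAL_XPATH)

# One expression per line, ready to paste into the "Fixed list" text box.
print("\n".join(xpaths))
```

Each printed line differs only in the index after LI, so the output can be pasted directly in Step 5.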
Step 5.
Copy all the XPath expressions from the Notepad file and paste them into Octoparse.
Select "Loop Mode" under "Advanced Options" ➜ Select the "Fixed list" option ➜ Paste the list of XPath expressions into the text box ➜ Click "Save". You will see the total number of items selected.
Note: In the GIF file, only 7 items are selected. You can add more XPath expressions to select all 40 items.
Step 6.
It's done! You can now run the task in the cloud to collect data faster.
Provided that the web page has uniform source code in a standard format, the two loop modes, "Fixed list" and "Variable list", can be converted into each other by modifying the XPath expression. If there is a "Click Item" action inside the "Loop Item" box, we strongly recommend switching the "Loop mode" from "Variable list" to "Fixed list" so that Octoparse can split your scraping task and speed up the extraction.
If there is only an "Extract Data" action inside the "Loop Item" box, we recommend checking out this tutorial to learn more about how to speed up Cloud Extraction.
Author: The Octoparse Team