Monday, January 23, 2017

Speed up Cloud Extraction (2)

(from http://www.octoparse.com/tutorial/scraping-hotel-reviews-from-tripadvisorcom/)

In Speed up Cloud Extraction (1), you learned how to speed up Cloud Extraction by telling the program to split one task into multiple subtasks. When you use the “Fixed list”, “List of URLs” or “Text list” loop mode, the task is automatically split into multiple subtasks on the cloud platform.

In this tutorial, you will learn how to speed up Cloud Extraction in Octoparse by optimizing pagination.

When you configure pagination by clicking the “Next” button, a “Click to paginate” action is automatically generated in the Workflow. Because you clicked a single element, the loop defaults to “Single element” mode, which cannot be split on the cloud platform. Assume that a task is meant to extract URLs from a list page.

Let’s say that opening a page takes 5s, extracting data takes 2s, and clicking “Next Page” takes 3s. On the cloud platform, the extraction then runs page by page: open the page, extract, click “Next Page”, extract again, and so on.
 
(Note: Octoparse’s cloud servers extract data simultaneously.)

In this case, Cloud Extraction is very slow, because pagination adds 3s for every page scraped and a “Single element” loop cannot be distributed across servers. To optimize pagination, use a splittable loop mode for pagination (“List of URLs” or “Text list”).
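To make the timing argument concrete, here is a rough back-of-the-envelope model in Python. It uses the 5s / 2s / 3s figures from the example above. The number of cloud servers and the assumption that each split subtask opens its page directly by URL are illustrative guesses, not documented Octoparse behavior, and the model ignores any scheduling overhead.

```python
import math

OPEN_S = 5     # opening a page (figure from the example above)
EXTRACT_S = 2  # extracting data from one page
CLICK_S = 3    # clicking "Next Page"


def click_pagination_seconds(pages: int) -> int:
    """One server walks every page in order: open once, then extract and click."""
    if pages <= 0:
        return 0
    return OPEN_S + pages * EXTRACT_S + (pages - 1) * CLICK_S


def split_loop_seconds(pages: int, servers: int) -> int:
    """Assumed model: pages are shared out evenly and each one is opened by URL."""
    if pages <= 0:
        return 0
    pages_per_server = math.ceil(pages / servers)
    return pages_per_server * (OPEN_S + EXTRACT_S)


if __name__ == "__main__":
    # servers=5 is a hypothetical value chosen only for illustration.
    for pages in (10, 100):
        print(pages, click_pagination_seconds(pages), split_loop_seconds(pages, servers=5))
```

Under these assumptions, 100 pages take 502s with click pagination on one server and 140s when split across five servers.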

1. Query string pagination - Use “List of URLs” loop mode

With query string pagination, the page is selected by a query string parameter in the URL: “page=1”, “page=2”, “page=3”...
For this kind of pagination, use the “List of URLs” loop mode and enter the page URLs directly, instead of creating a “Click to paginate” action by clicking the “Next” button.

Step 1. Create a list of URLs and copy them.

Step 2. Drop a “Loop Item” into the Workflow designer.

Step 3. Select the “List of URLs” loop mode and paste the URLs into the text box. Then click “OK” and “Save”.
Then continue to configure the task for Cloud Extraction.
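The list of URLs for Step 1 does not have to be typed by hand. The following is a minimal Python sketch that generates it, assuming the site uses a query parameter named “page”. The helper name build_page_urls and the example.com address are hypothetical and only serve as an illustration.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url: str, first: int, last: int, param: str = "page") -> List[str]:
    """Return one URL per page number, setting `param` and keeping other parameters."""
    parts = urlsplit(base_url)
    # Keep the existing query parameters, but drop any old value of `param`.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    urls = []
    for number in range(first, last + 1):
        query = urlencode(kept + [(param, str(number))])
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment)))
    return urls


if __name__ == "__main__":
    # Hypothetical list page; replace it with the real URL of the site being scraped.
    for url in build_page_urls("https://example.com/list?category=hotels", 1, 5):
        print(url)
```

The printed lines can be copied and pasted into the “List of URLs” text box in Step 3.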

2. Jumping to a specific page - Use “Text list” loop mode

When the website allows visitors to enter a page number and jump to that page, use the “Text list” loop mode to enter the page numbers.

Step 1. Create page numbers and copy them.

Step 2. Drop a “Loop Item” into the Workflow designer.

Step 3. Select the “Text list” loop mode and paste the page numbers into the text box. Then click “OK” and “Save”.
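The page numbers for Step 1 can also be produced with a short script. This sketch assumes that the text box expects one entry per line. The function name page_number_lines is hypothetical.

```python
def page_number_lines(first: int, last: int) -> str:
    """Return the page numbers from first to last, one per line."""
    return "\n".join(str(number) for number in range(first, last + 1))


if __name__ == "__main__":
    # Prints 1 through 20, ready to copy and paste.
    print(page_number_lines(1, 20))
```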

Once pagination is optimized, the Cloud Extraction process no longer spends 3s clicking “Next Page” for each page.

(Figures: the extraction process before and after optimizing pagination)

Once we know how to optimize pagination by switching to a different loop mode, we can make Cloud Extraction a lot faster.


Author: The Octoparse Team
