Web Services: Reasons and Solutions - Cloud Extraction Is Slower Than Local Extraction

Imagine that one day you open your web scraping software and the screen neatly displays all the data you want.

Octoparse cloud servers have collected all the data you want from any website for you. You're full of joy.

We love to see you smile.

We are dedicated to providing the best web scraping software and service for you.

So we have created some tutorials to help solve the problems you may run into when using Cloud Extraction.

Below are several problems that our paying users have encountered.

1. I get data from Local Extraction but none from Cloud Extraction

2. Cloud Extraction is slower than Local Extraction

3. Data is missing in Cloud Extraction

This tutorial covers the second problem: how to make Cloud Extraction work normally and run faster than Local Extraction.

Reasons and Solutions

The principle behind Cloud Extraction is that the task you run in the cloud is split into many sub-tasks, which are assigned to different cloud servers. These cloud servers run the sub-tasks and save the data they collect in our cloud database. With that in mind, the possible reasons for the problem are:

Reason 1. The task is not split.

If the task is split into sub-tasks in the cloud, then 4 cloud servers will be allocated to these sub-tasks (for the Standard subscription plan). Running the task in the cloud therefore speeds up the extraction, and in this case it performs better than Local Extraction.

Conversely, if the task is not split, only one cloud server will be allocated to it, and Cloud Extraction may then be slower than Local Extraction.
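To make the difference concrete, here is a rough back-of-the-envelope sketch. It assumes pages are spread evenly across servers and that every page takes the same time. The workload figures (1,000 pages, 3 seconds per page) and the function name are hypothetical, not Octoparse measurements.

```python
import math

def estimated_minutes(pages, seconds_per_page, servers):
    """Rough wall-clock estimate, assuming pages are spread evenly across servers."""
    pages_per_server = math.ceil(pages / servers)
    return pages_per_server * seconds_per_page / 60

# Hypothetical workload: 1,000 pages at 3 seconds per page.
split_on_cloud = estimated_minutes(1000, 3, servers=4)    # task split across 4 servers
unsplit_on_cloud = estimated_minutes(1000, 3, servers=1)  # task not split: one server

print(f"split across 4 servers: {split_on_cloud} min")
print(f"not split (1 server):   {unsplit_on_cloud} min")
```

Under these assumptions the unsplit task takes about four times as long, so a fast local machine can easily outrun a single cloud server.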

Solution:

You can try to split your task. Check the rule of the task and see whether you can re-configure it. For example, if the page URLs are identical except for the page number, you can split the task by using Loop (URL list) to extract the data:

http://app.vccedge.com/fund/angel_investments?&page=1

http://app.vccedge.com/fund/angel_investments?&page=2

http://app.vccedge.com/fund/angel_investments?&page=3

...

http://app.vccedge.com/fund/angel_investments?&page=10

Check out this tutorial to create a task using Loop (URL List).
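If you need to prepare such a list of URLs before pasting it into Loop (URL List), a small script can generate it. This is only a sketch: the helper name build_url_list is made up for illustration, and the URL pattern is simply the example shown above.

```python
# URL pattern taken from the example above; only the page number changes.
BASE_URL = "http://app.vccedge.com/fund/angel_investments?&page={page}"

def build_url_list(first_page, last_page):
    """Return one URL per page, from first_page to last_page inclusive."""
    return [BASE_URL.format(page=n) for n in range(first_page, last_page + 1)]

urls = build_url_list(1, 10)
print("\n".join(urls))  # one URL per line, ready to paste
```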

Reason 2. Your machine performs better than one cloud server.

Your machine may be more powerful than a single cloud server, and your internet connection may be much faster as well. So if the task cannot be split, it is better to run the scraping task on your machine using Local Extraction.

Reason 3. Too many tasks are running in the cloud at the same time. The Professional subscription plan provides 10 cloud servers for running your tasks in the cloud, while the Standard subscription plan provides 4.

If you start several tasks in the cloud at the same time, say task 1, task 2 and task 3 in that order, Octoparse will first split task 1 into sub-tasks and allocate cloud servers to them to scrape the data. It then handles task 2 and task 3 in the same way, in order.

Generally, task 1 is executed first. For example, say you have 10 cloud servers and task 1 uses 4 of them. The remaining 6 cloud servers will then be used by task 2. In this case, task 3 waits in the execution queue at 0% progress until cloud servers become free.
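The queuing behaviour in this example can be modelled with a short simulation. This is a simplified sketch, not Octoparse's actual scheduler: it assumes servers are handed out strictly in start order, and the number of servers task 3 would like (5) is a made-up value.

```python
def allocate(total_servers, tasks):
    """Hand out servers in start order.

    tasks: list of (name, servers_wanted) tuples, in the order they were started.
    Returns (running, waiting): a dict of servers granted per running task,
    and a list of task names still waiting in the queue.
    """
    free = total_servers
    running = {}
    waiting = []
    for name, wanted in tasks:
        if free > 0 and not waiting:
            granted = min(wanted, free)
            running[name] = granted
            free -= granted
        else:
            # No servers left: the task stays queued at 0% progress.
            waiting.append(name)
    return running, waiting

# 10 servers (Professional plan); task 1 uses 4, task 2 uses 6 (as in the text),
# task 3's demand of 5 is hypothetical.
running, waiting = allocate(10, [("task 1", 4), ("task 2", 6), ("task 3", 5)])
print("running:", running)
print("waiting:", waiting)
```

With these numbers, task 1 and task 2 take up all 10 servers and task 3 is left waiting, which matches the situation described above.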

Solution A: Don’t start too many tasks in the cloud concurrently.

Octoparse lets you control how many tasks run in parallel. You can set the maximum number of parallel tasks in Cloud Extraction so that some tasks are executed with priority.

If you want to retrieve the data from one task within a short time, set this value to 1 so that all your cloud servers work on that task. If you have N tasks to run, you can set the value to M (M ≤ N) so that only M tasks run in parallel, which speeds up each running task.

Solution B: Add more cloud servers for your tasks by contacting us at support@octoparse.com.

Solution C: If you start many tasks using Scheduled Cloud Extraction, estimate the time each task needs and stagger the scheduled start times of your tasks accordingly.
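One simple way to work out staggered start times is to start each task only after the previous one is expected to finish, plus a small safety buffer. The sketch below assumes you already have a rough duration estimate for each task. The durations, the 5-minute buffer and the start date are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

def stagger(first_start, durations_minutes, buffer_minutes=5):
    """Return one start time per task, each after the previous task's estimated end."""
    starts = []
    current = first_start
    for minutes in durations_minutes:
        starts.append(current)
        current += timedelta(minutes=minutes + buffer_minutes)
    return starts

# Hypothetical estimates: task 1 takes 30 min, task 2 takes 45 min, task 3 takes 20 min.
starts = stagger(datetime(2024, 1, 1, 9, 0), [30, 45, 20])
for number, start in enumerate(starts, start=1):
    print(f"task {number}: {start:%H:%M}")
```

You would then enter the resulting times as the scheduled start times of your tasks.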

Reason 4. The rule you configured for the task affects the extraction speed, for both Local Extraction and Cloud Extraction.

A task that works well in Local Extraction will not necessarily work well in the cloud. Your machine and internet connection may be better than a cloud server's, so, for example, your computer may take less time to open a given web page.

Solution:

So before running the task in the cloud, you may need to set a longer timeout (usually 5 seconds) for each step except 'Go To Web Page'.