Sunday, January 22, 2017

Speed up Cloud Extraction (1)

(from http://www.octoparse.com/tutorial/speed-up-cloud-extraction-1/?category=CLOUDEXTRACTION)

In this tutorial, you will learn how to speed up Cloud Extraction by telling the program to split one task into multiple sub-tasks.

When you create a "Loop Item" to extract the items in a list, the default "Loop mode" is "Variable list", and the auto-generated XPath extracts all the items in the list. But with "Variable list" and "Single element", the task cannot be split into multiple sub-tasks on the cloud platform. (See the picture below.)
To speed up Cloud Extraction, you need to switch the "Loop mode" to "Fixed list", "List of URLs", or "Text list". Your cloud servers can then be assigned to the sub-tasks and collect all the data within a shorter time.


Let's walk through an example of a "Variable list" auto-generated by Octoparse.
Website URL:
(Click here to download the demo task in this tutorial.)


To build a pattern for the data we'd like to scrape, we create a "Loop Item" by adding 2 items, and Octoparse auto-generates the XPath for the "Variable list". (See the screenshot below.)
The XPath expression is
//DIV[@id='tile-container']/UL[1]/LI/DIV[1]/A[2]/H3[1]/DIV[1].




Please follow the steps below to learn how to edit the original XPath and generate XPath expressions for a "Fixed list". We recommend modifying the original XPath in Firebug & FirePath so that you can check whether your XPath is correct.

Step 1.
Open the website URL in the Firefox browser and turn on Firebug.
(Click HERE to learn how to download Firebug and Firepath.)

Step 2.
Paste the XPath
(//DIV[@id='tile-container']/UL[1]/LI/DIV[1]/A[2]/H3[1]/DIV[1]) into FirePath and press the "Enter" key. We can see that all the items are selected by the XPath expression.

Step 3.
To select the second item/element of the table according to its position among its siblings, we need to use [] with a number for indexing after a tag.

For example:

//a[1]                  # first <a>
//a[last()]             # last <a>
//ol/li[2]              # second <li>
//ol/li[position()=2]   # same as above
//ol/li[position()>1]   # :not(:first-child)

In the original XPath, there is no index number inside the "[]" after the LI tag. This means the XPath expression selects all LI elements, i.e., all the items of the table on this web page. In this case, we can use [] with a number to generate XPath expressions that select one specific item.

Here, insert [2] after the LI tag to select the second item. If we change the number to 4, we select the fourth item.
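
If you'd like to verify the indexing outside the browser, here is a minimal Python sketch using lxml on a simplified, made-up fragment (lxml's HTML parser lowercases tag names, so the path is written in lowercase):

from lxml import html

# A made-up fragment mimicking the page's list structure.
doc = html.fromstring("""
<div id="tile-container">
  <ul>
    <li><span>Item 1</span></li>
    <li><span>Item 2</span></li>
    <li><span>Item 3</span></li>
  </ul>
</div>
""")

# No index after li: the XPath selects every item in the list.
print(doc.xpath("//div[@id='tile-container']/ul[1]/li/span/text()"))
# ['Item 1', 'Item 2', 'Item 3']

# [2] after li: only the second item is selected.
print(doc.xpath("//div[@id='tile-container']/ul[1]/li[2]/span/text()"))
# ['Item 2']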

Note: You can learn more about XPath from the tutorials here.

Step 4.
Copy the XPath expression and paste it into Notepad; by varying the index number you can generate the XPath expressions needed to select each item on this web page, as in the sketch below.
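If editing the index numbers by hand gets tedious, a short Python sketch can print the whole fixed list for you (the count of 40 comes from this example's page):

# Print one indexed XPath per item; paste the output into Octoparse's
# "Fixed list" text box (Step 5 below).
base = "//DIV[@id='tile-container']/UL[1]/LI[{}]/DIV[1]/A[2]/H3[1]/DIV[1]"
for i in range(1, 41):   # the demo page lists 40 items
    print(base.format(i))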


Step 5.
Copy all the XPath expressions from the Notepad file and paste them into Octoparse.
Select a "Loop Mode" under "Advanced Options" ➜ Select the "Fixed list" option ➜ Paste the list of XPath expressions into the text box ➜ Click "Save". You will see the total number of items selected.


Note: In the GIF file, only 7 items are selected. You can add more XPath expressions to select all 40 items.

Step 6.
It's done! You can now run the task in the cloud to collect data at a faster speed.

Provided that the web page has uniform source code in a standard format, the two loop modes, "Fixed list" and "Variable list", can be converted into each other by modifying the XPath expression. If there is a "Click Item" action inside the "Loop Item" box, we strongly recommend switching the "Loop mode" from "Variable list" to "Fixed list" so that Octoparse can split your scraping task and speed up the extraction.


If there is only an "Extract Data" action inside the "Loop Item" box, we recommend checking out this tutorial to learn more about how to speed up Cloud Extraction.





Author: The Octoparse Team
- See more at: http://www.octoparse.com/tutorial/speed-up-cloud-extraction-1/?category=CLOUDEXTRACTION


Tuesday, December 27, 2016

Reasons and Solutions - Missing Data in Cloud Extraction

We all want a neat Excel spreadsheet of the scraped data before going on to further analysis.
With Octoparse, you can fetch the data you want from websites and have it ready for use. Our cloud services enable you to fetch large amounts of data by running your scraping task with Cloud Extraction. The premise is that you know how to deal with all the circumstances that can arise when using Cloud Extraction to scrape sites.
We have summarized several problems encountered by our paying users and made several tutorials about the reasons for these Cloud Extraction problems and their solutions.

3. There Are Missing Data in Cloud Extraction
This tutorial will talk about how to solve the third problem - How should I deal with missing data in Cloud Extraction?

Before seeking solutions to these problems, let's first review the concept of Cloud Extraction.

Octoparse Cloud Extraction

Octoparse Cloud Extraction refers to the process of retrieving data on a large scale through many cloud servers, based on distributed computing, 24/7/365.
After downloading the client, you can create a new task, configure a workflow/rule for it, and run the task with Cloud Extraction by putting it in the cloud. Then you can turn off your machine and let Octoparse do the rest. Cloud Extraction enables you to greatly speed up extraction, start tasks automatically at scheduled times, and avoid having your IP address banned by websites.

Reasons and Solutions

Reason 1. If data are missing when you use Local Extraction to run the scraping task, then data will definitely be missing in Cloud Extraction as well.

When data are missing from a task run with Local Extraction, you can:
1. First, check whether the web element exists on the web page. If it exists, there are two possible situations to consider:
Situation 1. The data content is not loaded before Octoparse executes the "Extract Data" action.
Solution: Set a longer timeout for each step except the 'Go To Web Page' action, or wait until certain elements appear on the web page.

Situation 2. The XPath for the loop box didn't select all the items listed on the web page.

Solution: Modify the XPath for the loop box.

2. The source code of the web page has changed since you configured the rule, so the rule no longer matches the web element.
Solution: You can set the backup position option if the element you want appears in only two possible locations on the website, or manually modify the XPath of the web element.
Check out this FAQ to learn how to set the option.
  
3. Part of the web page is loaded asynchronously, so Octoparse may execute the "Extract Data" action before the web element has loaded and appeared on the page.
Solution: Set a longer timeout for each step except the 'Go To Web Page' action.
Note: Make sure you set the AJAX timeout option.

Reason 2. The website you want to scrape implements anti-crawling techniques such as requiring login, using CAPTCHAs, or blocking IP addresses.
Solution A: You can manually change your IP address by assigning other IP addresses to the task in the Extraction step. Check out this FAQ.

Solution B: Some websites automatically unblock your IP address after a while, so you can retry the task later.

Reason 3. The data field does not exist.
Solution:
If all the data fields you want to extract are missing, Octoparse will delete the data record (the whole row). In this case, we strongly recommend adding a fixed data field such as the current time, the current page URL, or a fixed data value.
- See more at: Octoparse Tutorial


Reasons and Solutions - Cloud Extraction Is Slower Than Local Extraction

Imagine that one day you open your web scraping software and the screen displays all the data you want, neatly.
Octoparse's cloud servers have fetched all the data you want from the websites for you. You're full of joy.
We love to see you smile.
We are dedicated to providing the best web scraping software and service for you.
So we have created some tutorials to solve the problems you may have when using Cloud Extraction.

We have summarized several problems encountered by our paying users.

2. Cloud Extraction is slower than Local Extraction
3. There are missing data in Cloud Extraction

This tutorial will talk about how to solve the second problem - How can I make Cloud Extraction work normally and run faster than Local Extraction?

Reasons and Solutions 

The principle behind Cloud Extraction is that the task you run in the cloud is split into many sub-tasks, and these sub-tasks are assigned to many different cloud servers. The cloud servers run the sub-tasks and send the collected data to our cloud database, where all of it is saved. So the reasons for the problem may be:

Reason 1. The task is not split.
If the task is split into sub-tasks in the cloud, then 4 cloud servers will be allocated to those sub-tasks (on the Standard subscription plan). Executing the task in the cloud then speeds up the extraction and performs better than Local Extraction.
Conversely, if the task is not split, only one cloud server is allocated to it, and Cloud Extraction will be slower than Local Extraction.

Solution:
Try to split your task. Check the rule of the task and see whether you can re-configure it. For example, you can split the task by using Loop (URL list) to extract the data if the page URLs are identical except for the page number; a sketch for generating such a list follows the example URLs below.
http://app.vccedge.com/fund/angel_investments?&page=1
http://app.vccedge.com/fund/angel_investments?&page=2
http://app.vccedge.com/fund/angel_investments?&page=3
...
...
http://app.vccedge.com/fund/angel_investments?&page=10
Check out this tutorial to create a task using Loop (URL List).
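
If you prefer not to type the URLs by hand, a minimal Python sketch (using the pattern above) can print the whole list to paste into the Loop (URL list) text box:

# Print the ten paginated URLs for the Loop (URL list) text box.
base = "http://app.vccedge.com/fund/angel_investments?&page={}"
for page in range(1, 11):
    print(base.format(page))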
Reason 2. Your machine performs better than one cloud server.
Your machine may be faster than one of our cloud servers, and your internet connection may be much better as well. So if the task cannot be split, it may be better to run the scraping task on your own machine with Local Extraction.

Reason 3. The Professional subscription plan provides 10 cloud servers for running your tasks in the cloud, while the Standard subscription plan provides 4.
If you start many tasks in the cloud concurrently, say task 1, task 2, and task 3 in order, Octoparse will first split task 1 into sub-tasks and allocate cloud servers to them; then it deals with task 2 and task 3 in the same way.
Generally, task 1 is executed first. For example, if you have 10 cloud servers and task 1 uses 4 of them, the remaining 6 can be used by task 2. Task 3 then waits in the execution queue at 0% progress until servers become free.

Solution A: Don't start too many tasks in the cloud concurrently.
Octoparse lets you control the number of tasks run in parallel. You can set the maximum number of parallel tasks for Cloud Extraction so that some tasks are executed preferentially.

If you want to retrieve the data from one task within a short time, you can devote all your cloud servers to it by setting the number to 1 here; if you want N tasks to run in parallel, you can set the number to M (M ≤ N) to speed things up.
Solution B: Add more cloud servers for your tasks by contacting us via support@octoparse.com.
Solution C: If you start many tasks using Scheduled Cloud Extraction, you can estimate the time needed for a task and stagger the scheduled time of your tasks.

Reason 4. The rule you configured for the task affects the speed of extraction, both for Local Extraction and Cloud Extraction.
Even if a task works really well with Local Extraction, that doesn't mean it will work well in the cloud. Your machine and internet environment are better than a single cloud server's; for example, it takes your computer less time to open a given web page.

Solution:
Before putting the task into the cloud, you may need to set a longer timeout (usually 5 seconds) for each step except 'Go To Web Page'.
- See more at: Octoparse Tutorial


Reasons and Solutions - Getting Data from Local Extraction but None from Cloud Extraction

We all want a neat Excel spreadsheet of the scraped data before going on to further analysis.
With Octoparse, you can fetch the data you want from websites and have it ready for use. Our cloud services enable you to fetch large amounts of data by running your scraping task with Cloud Extraction. The premise is that you know how to deal with all the circumstances that can arise when using Cloud Extraction to scrape sites.
We have summarized several problems encountered by our paying users.
1. I get data from Local Extraction but none from Cloud Extraction
2. Cloud Extraction is slower than Local Extraction
3. There are missing data in Cloud Extraction


In this tutorial, we would like to dig into the first problem - Why do I get no data records with Cloud Extraction when the scraping task works well with Local Extraction?


Reasons and Solutions

Reason 1. The web environment in the cloud is different from that on your machine, so the original XPath cannot select the HTML elements correctly and therefore misses some data.
Situation 1. The website's source code has changed greatly.
Some websites automatically deliver regionally relevant content based on the visitor's geographical location, inferred from the IP address. The IP addresses of our cloud servers are randomly assigned by our cloud hosting provider, so the structure of the pages displayed on your own machine can differ from what our cloud servers see. This unavoidably leads to some missing data during Cloud Extraction.
Solution A: You can save the cookie in the "Cache Settings" option to ensure the web page opened is the one you want to scrape. Try out this tutorial to set the cookie.
Solution B: If Solution A does not work for the website you want to scrape, you can try the option "clear cache before opening the web page" to open the initial web page, and add some actions to your rule to select web elements such as the region.

Situation 2. The website’s source code has been changed just a little bit.
Solution:
Generally, we need to modify the XPath of the web elements for which Cloud Extraction could not get data. For example, you can change an absolute XPath into a relative XPath. The tip is to use an XPath that directly selects the web elements you want:
Absolute XPath: html/body/div[5]/div[3]/div[2]/div[1]
Relative XPath: .//*[@id='tab-2']/div
The tutorials or FAQs below could help you pick up XPath quickly.
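
To make the difference concrete, here is a minimal lxml sketch on made-up markup: the relative, id-anchored XPath keeps working after a page redesign shifts the layout, while the absolute one breaks.

from lxml import html

# Original layout: the target div sits at a fixed nesting depth.
page_v1 = html.fromstring(
    "<html><body><div></div><div><div id='tab-2'><div>data</div></div></div></body></html>")

# Redesigned layout: one extra wrapper div shifts every absolute position.
page_v2 = html.fromstring(
    "<html><body><div><div></div><div><div id='tab-2'><div>data</div></div></div></div></body></html>")

absolute = "/html/body/div[2]/div/div"      # depends on the exact nesting
relative = "//*[@id='tab-2']/div"           # anchored to a stable id

print(page_v1.xpath(absolute + "/text()"))  # ['data']
print(page_v2.xpath(absolute + "/text()"))  # [] -- broken by the redesign
print(page_v1.xpath(relative + "/text()"))  # ['data']
print(page_v2.xpath(relative + "/text()"))  # ['data'] -- still works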

Reason 2. The cookie you saved has been disabled.
For websites that require login, we need to connect remotely to our cloud hosting to detect whether the cookie is working, or you can check the rule in Octoparse and see whether the cookie is still valid.
Situation 1. Some cookies allow only one web browser (or one IP address) to stay logged in. This means the website's cookies saved in browser A are disabled when you log into the site in browser B. If you insist on logging into the website in browser B, you will be logged out in browser A.
Solution:
When configuring the rule, tick the option not to split the task in Cloud Extraction.

Situation 2. The website's cookies were saved when you configured the rule for the task, and they work well for extracting data with Local Extraction. But when you use Cloud Extraction, our cloud servers cannot move forward without those cookies, so you get no data.
Solution:
For websites that require login information, you can add some actions to enter the login information in your rule.
Check out this tutorial to add the login action to your rule. How to scrape a website that requires login first.
If your Cloud Extraction is still not smooth after adding actions to enter login information, please contact our Support Team via support@octoparse.com.
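
As an aside, the role the login cookie plays can be sketched in a few lines of Python with the requests library; the URLs and form fields here are hypothetical:

import requests

# Hypothetical login-protected site. The login response sets a session
# cookie, which requests.Session() stores and re-sends automatically.
session = requests.Session()
session.post("https://example.com/login",                # hypothetical URL
             data={"user": "me", "password": "secret"})  # hypothetical fields

# The session carries the cookie, so the data page is reachable.
print(session.get("https://example.com/members/data").status_code)

# A fresh request without that cookie is what a cloud server sees:
# the site bounces it to the login page instead of serving the data.
print(requests.get("https://example.com/members/data").status_code)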

Reason 3. The rule you configured for the task is not optimized for Cloud Extraction.
Since your machine and internet environment are better than a cloud server's, it takes your computer less time to handle a complicated web page. When you put the task in the cloud, you need to take the performance of a cloud server into account and adjust the rule of the task.

Solution:
Before putting the task into the cloud, you may need to set a longer timeout (usually 5 seconds) for each step except 'Go To Web Page', or wait until certain elements appear on the web page. This screenshot shows how to set the timeout of the 'Click Item' action.

Reason 4. IP address blocking
Some websites, such as those of the Alibaba group, implement anti-scraping mechanisms, so our IP addresses may be blocked when the website detects scraping behavior. In this case, please contact our Support Team via support@octoparse.com and we will check it for you.

Solution A:
You can assign IP addresses to your task manually in Octoparse. (Only for Local Extraction)
Check out this tutorial to manually assign the IP addresses.
Solution B:
You can observe the anti-scraping mechanisms used by the website you scrape and retry for the data after a while. Or you can set a longer wait time for each step (except 'Go To Web Page') to act more like a human.
- See more at: Octoparse Tutorial
