Tuesday, December 27, 2016

Reasons and Solutions - Getting Data from Local Extraction but None from Cloud Extraction

We all want a neat Excel spreadsheet of the scraped data before moving on to further analysis.
With Octoparse, you can fetch the data you want from websites and have it ready for your use. Our cloud services enable you to fetch large amounts of data by running your scraping task with Cloud Extraction. The premise is that you know how to handle the situations that can come up when you use Cloud Extraction to scrape sites.
Below we summarize several problems frequently encountered by our paying users:
1. I get data from Local Extraction but none from Cloud Extraction
2. Cloud Extraction is slower than Local Extraction
3. Some data is missing in Cloud Extraction


In this tutorial, we would like to dig into the first problem: why do I get no data records with Cloud Extraction when the scraping task works well with Local Extraction?


Reasons and Solutions

Reason 1. The web environment in the cloud is different from that on your machine, so the original XPath cannot select the HTML elements correctly and some data is missed.
Situation 1. The website’s source code has been greatly changed.
Some websites automatically deliver regionally relevant content to each visitor based on the geographical location inferred from the visitor's IP address, and the IP addresses of our cloud servers are randomly assigned by our cloud hosting provider. As a result, the structure of the web pages displayed on your own machine may differ from that seen by our cloud servers, so some data will unavoidably be missing when you use Cloud Extraction.
Solution A: You can save the cookie with the "Cache Settings" option to ensure the web page opened in the cloud is the one you want to scrape. Try out this tutorial to set the cookie.
Solution B: If Solution A does not work for the website you want to scrape, you can try the "clear cache before opening the web page" option to open the initial web page, and add some actions to your rule to select the relevant web element, such as the region.

Situation 2. The website's source code has been changed only slightly.
Solution:
Generally, you need to modify the XPath of the web elements that fail to return data with Cloud Extraction. For example, you can change an absolute XPath to a relative one. The tip is to use an XPath that directly selects the web elements you want, as in the comparison below:
Absolute XPath: html/body/div[5]/div[3]/div[2]/div[1]
Relative XPath: .//*[@id='tab-2']/div
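
To see why the relative form is more robust, here is a minimal sketch in Python with lxml (not Octoparse itself; the HTML snippet and the id 'tab-2' are illustrative):

# Why a relative XPath keyed to a stable attribute survives layout
# changes better than an absolute one.
from lxml import html

page = html.fromstring("""
<html><body>
  <div></div>
  <div id="tab-2"><div>price: $19.99</div></div>
</body></html>
""")

# Absolute XPath: breaks as soon as a div is added or removed
# earlier in the document, because the positional indexes shift.
absolute = page.xpath("/html/body/div[2]/div")

# Relative XPath: keeps matching as long as the id attribute survives.
relative = page.xpath(".//*[@id='tab-2']/div")

print(absolute[0].text)  # price: $19.99
print(relative[0].text)  # price: $19.99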
The tutorials and FAQs below can help you pick up XPath quickly.

Reason 2. The cookie you saved has been disabled.
For websites that require login, we need to connect remotely to our cloud hosting to detect whether the cookie is working, or you can check the rule in Octoparse to see whether the cookie is still valid.
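
If you want a quick manual check outside Octoparse, a minimal sketch with Python requests can suggest whether a saved cookie still authenticates (the URL and cookie name are placeholders):

import requests

# Hypothetical session cookie saved when the rule was configured.
cookies = {"session_id": "value-saved-when-the-rule-was-made"}
resp = requests.get("https://example.com/account", cookies=cookies,
                    allow_redirects=False)

# Many sites redirect expired sessions to the login page, so a 3xx
# status (or a login form in the body) suggests the cookie has been
# disabled and needs to be refreshed in the rule.
if resp.status_code in (301, 302) or "login" in resp.text.lower():
    print("Cookie looks expired; re-save it in the rule.")
else:
    print("Cookie still appears valid.")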
Situation 1. Some cookies allow only one web browser (or one IP address) to log in. This means the website's cookies saved in web browser A are disabled when you log into the site in web browser B. If you insist on logging into the website in browser B, you will be logged out in browser A.
Solution:
Tick the option not to split the task for Cloud Extraction when configuring the rule, so the whole task runs on a single cloud server and the cookie is not invalidated by logins from multiple IP addresses.

Situation 2. The website's cookies were saved when you configured the rule for the task, and extracting data with Local Extraction works well. But when you use Cloud Extraction, our cloud servers cannot move forward without those cookies, so you get no data.
Solution:
For websites that require login information, you can add actions to your rule that enter the login information.
Check out this tutorial to add login actions to your rule: How to scrape a website that requires login first.
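
For intuition, here is a rough sketch (in Python, not Octoparse) of what those login actions replicate: submit the credentials, keep the resulting session cookie, then request the protected page. The URLs and form field names are placeholders:

import requests

with requests.Session() as s:
    # Hypothetical login form; the session keeps the resulting cookie.
    s.post("https://example.com/login",
           data={"username": "user", "password": "pass"})
    # Subsequent requests are sent with the fresh session cookie.
    page = s.get("https://example.com/protected-data")
    print(page.status_code)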
If Cloud Extraction still does not run smoothly after you add actions to enter the login information, please contact our Support Team via support@octoparse.com.

Reason 3. The rule you configured for the task is not optimized for Cloud Extraction.
Your machine and internet environment are generally better than a cloud server's, so it takes your computer less time to deal with a complicated web page. When you put the task in the cloud, you need to consider the performance of a cloud server and adjust the rule of the task accordingly.

Solution:
Before putting the task into the cloud, you may need to set a longer timeout for each step except 'Go To Web Page' (usually 5 seconds), or make the step wait until certain elements appear on the web page. The screenshot shows how to set the timeout of the 'Click Item' action.
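
The same idea expressed in Selenium terms, as a hedged sketch (the URL and XPath are placeholders): rather than relying on a fixed pause, wait up to 5 seconds for the target element to appear before clicking, so a slower cloud server simply uses more of the allowance instead of failing.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Block for at most 5 seconds until the element is clickable.
item = WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.XPATH, ".//*[@id='tab-2']/div"))
)
item.click()
driver.quit()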

Reason 4. IP address blocking
Some websites, such as those of the Alibaba group, implement anti-scraping mechanisms, so our IP addresses may be blocked when the website detects the scraping behavior. In this case, please contact our Support Team via support@octoparse.com and we will check it for you.

Solution A:
You can manually assign IP addresses for your task in Octoparse (Local Extraction only).
Check out this tutorial to manually assign the IP addresses.
Solution B:
You can observe the anti-scraping mechanisms used by the website you scraped and retry the extraction after a while. Or you can set a longer waiting time for each step (except 'Go To Web Page') so the task acts more like a human visitor, as sketched below.
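
As a minimal sketch of "acting more like a human" (the URL list is a placeholder): insert a randomized pause between steps instead of firing requests at machine speed.

import random
import time

import requests

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    resp = requests.get(url)
    print(url, resp.status_code)
    # Sleep 3-8 seconds so the request pattern is less regular.
    time.sleep(random.uniform(3, 8))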
- See more at: Octoparse Tutorial
