We all want a neat Excel spreadsheet of scraped data before moving on to further analysis.
With Octoparse, you can fetch the data you want from websites and have it ready for use. Our cloud services let you collect large amounts of data by running your scraping task with Cloud Extraction. The premise is that you know how to handle the various situations that can arise when using Cloud Extraction to scrape websites.
We have summarized several problems commonly encountered by our paying users and created a series of tutorials covering the reasons for, and solutions to, these Cloud Extraction problems.
3. There Are Missing Data in Cloud Extraction
This tutorial covers the third problem in more detail - what should I do about missing data in Cloud Extraction?
Before looking for solutions to these problems, let's first review the concept of Cloud Extraction.
Octoparse Cloud Extraction
Octoparse Cloud Extraction refers to the process of retrieving data on a large scale through many cloud servers, based on distributed computing, 24/7/365.
After downloading the client, you can open a new task, configure a workflow/rule for it, and run the task with Cloud Extraction by sending it to the cloud. Then you can turn off your machine and let Octoparse do the rest. Cloud Extraction greatly speeds up extraction, starts tasks automatically at your scheduled time, and helps keep your IP address from being banned by websites.
Reasons and Solutions
Reason 1. If data are missing when you run the scraping task with Local Extraction, the same data will also be missing in Cloud Extraction.
When data are missing from a task run with Local Extraction, you can:
1. First, check whether the web element exists on the web page. If it does, there are two possible situations to consider:
Situation 1. The data content is not loaded before Octoparse executes the "Extract Data" action.
Solution: Set a longer timeout for each step except the 'Go To Web Page' action, or have the step wait until a specific element appears on the web page.
Situation 2. The XPath for the loop box did not select all the items listed on the web page.
Solution: Modify the XPath for the loop box so that it matches every item.
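As a rough illustration of how a too-narrow loop XPath misses items, here is a sketch using Python's standard library. The page snippet and class names are hypothetical, and stdlib XPath support is limited; Octoparse's own XPath engine would also accept forms such as //li[contains(@class, "item")].

```python
import xml.etree.ElementTree as ET

# Hypothetical result list: the first item carries an extra "featured" class.
page = ET.fromstring("""
<ul id="results">
  <li class="item featured"><span>Item A</span></li>
  <li class="item"><span>Item B</span></li>
  <li class="item"><span>Item C</span></li>
</ul>""")

# Too narrow: an exact class match only picks up the "featured" entry,
# so two of the three rows go missing from the output.
narrow = page.findall('.//li[@class="item featured"]')

# Broadened to cover every item in the loop.
broad = page.findall('.//li')

print(len(narrow), len(broad))  # -> 1 3
```

The same mismatch happens in Octoparse when the auto-generated loop XPath was built from one item whose markup differs slightly from its siblings.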
2. The source code of the web page has changed since you configured the rule, so the web element is no longer where the rule expects it.
Solution: If the element you want appears in only two possible locations on the website, you can set the backup position option. Otherwise, you can manually modify the XPath of the web element.
Check out this FAQ to learn how to set the option.
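The backup position option boils down to "try the primary location first, then the second known location". A minimal sketch of that idea, with hypothetical class names and page markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical page where the price can sit in one of two locations:
# a regular price span or a sale-price span.
page = ET.fromstring("""
<div>
  <span class="sale-price">19.99</span>
</div>""")

def extract_with_backup(page, primary_xpath, backup_xpath):
    """Mimic the backup-position idea: try the primary XPath first,
    then fall back to the second known location."""
    hits = page.findall(primary_xpath) or page.findall(backup_xpath)
    return hits[0].text if hits else None

price = extract_with_backup(
    page,
    './/span[@class="price"]',       # primary location: not on this page
    './/span[@class="sale-price"]',  # backup location: matches
)
print(price)  # -> 19.99
```

In Octoparse you set this in the GUI rather than in code; the sketch only shows why a second location keeps the field from coming back empty.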
3. Part of the web page is loaded asynchronously, so Octoparse may execute the "Extract Data" action before the web element has loaded and appeared on the page.
Solution: Set a longer timeout for each step except the 'Go To Web Page' action.
Note: Make sure you set the AJAX timeout option.
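Inside Octoparse, the AJAX timeout handles this waiting for you. Conceptually, it is a polling loop: keep checking for the element until it appears or the timeout expires. A generic sketch of that pattern (the find_element callable and its timing are hypothetical):

```python
import time

def wait_for_element(find_element, timeout=10.0, interval=0.5):
    """Poll until find_element() returns something (the element has
    loaded) or the timeout expires; return the element or None."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        element = find_element()
        if element is not None:
            return element
        time.sleep(interval)
    return None

# Simulate content that only "loads" about one second after the page opens.
loaded_at = time.monotonic() + 1.0
slow_lookup = lambda: "data" if time.monotonic() >= loaded_at else None

print(wait_for_element(slow_lookup, timeout=5.0))  # -> data
```

A timeout that is shorter than the content's load time returns None, which is exactly the "missing data" symptom described above.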
Reason 2. The website you want to scrape implements anti-crawling techniques such as requiring login, using CAPTCHAs, or blocking IP addresses.
Solution A: You can manually change your IP address by assigning other IP addresses to the task in the Extraction step. Check out this FAQ.
Solution B: Some websites will automatically unblock your IP address after a while, so you can try running the task again later.
Reason 3. The data field does not exist.
If all the data fields you want to extract are missing, Octoparse will delete the data record (the whole row). In this case, it is strongly recommended to add a fixed data field such as the current time, the current page URL, or a fixed data value.
- See more at: Octoparse Tutorial
Tags: Big data, Cloud service, Octoparse, web crawler, Web scraping