WEBサービス: 6 Tips to Use the Web Scraping Tool Octoparse

We recently received feedback from our users, and some of them have had trouble moving forward with Octoparse because of issues that crop up occasionally. In this post I share my own experience with Octoparse, in the hope that it will help you keep moving and handle more difficult and complex websites.

Manually Check the Rule in the Workflow Designer

Octoparse doesn’t signal a traceable error while you are configuring a rule, so you usually have no idea what went wrong when data goes missing, an item can’t be clicked, or a page fails to open. To avoid such errors, or to find out whether the rule you configured works, manually check the rule in the Workflow Designer before running the task. Stepping through it shows you which steps don’t work in the built-in browser and the data fields, so once you find something wrong you can modify the rule accordingly. Check the tutorial below to learn how to do that.

Check The Extraction Rule When Errors Occur

Set Proper Timeout and Scroll Times

Sometimes, even though you configured the right rule and got the data when manually checking it in the Workflow Designer, data records go missing when you run the extraction. The easiest fix is to set a longer AJAX timeout under the “Go to page”, “Click item” and “Click to paginate” actions. You can also set a waiting time before execution under the different actions in the Workflow Designer, to make sure the data you want has loaded.

Some content is not displayed until you scroll down, so you may miss data if you forget to set the scroll times. Choose how to scroll down and set a suitable number of scrolls; this also affects the results you get.

Before taking the steps above, remember that each step should run only after the page is fully loaded; otherwise the rule will still not work even after you change it.

Also, we don’t recommend selecting “Open the link in new tab” and “Load the page with AJAX” together unless Octoparse still fails to open some websites, such as LinkedIn.
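To see why a longer timeout helps, here is a minimal Python sketch of the general idea of waiting for content that loads asynchronously. It is not Octoparse's implementation; the function `wait_for_data` and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_for_data(load, timeout=10.0, interval=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll `load()` until it returns non-empty data or `timeout` seconds pass.

    `load` stands in for "read the data field from the page" (hypothetical).
    Returns the data, or None if nothing appeared before the deadline.
    """
    deadline = clock() + timeout
    while True:
        data = load()
        if data:
            return data
        if clock() >= deadline:
            return None  # timeout too short: the record would be missed
        sleep(interval)


# Simulated page: the data only appears on the third poll.
responses = iter([[], [], ["row 1", "row 2"]])
print(wait_for_data(lambda: next(responses), timeout=5, interval=0.01))
```

With a timeout that expires before the third poll, the same call would return None, which is what a missing record looks like in an extraction run.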

Manually Modify the XPath

Correct XPath is the key to extracting data in Octoparse. Fixing pagination, missing data and irregular value fields usually involves changing the XPath, so I strongly suggest you learn some. Even a little knowledge of XPath can help you solve many problems in Octoparse. The tutorials and FAQs below can help you pick it up quickly.

How to use Firebug and Firepath?

Getting started with XPath 1

Getting Started With XPath 2

Modify XPath Manually in Octoparse
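As a rough illustration of the kind of XPath expressions involved, the sketch below runs three of them against a made-up page using Python's standard library. The markup, class names and URLs are assumptions for the example only. Note that `xml.etree.ElementTree` supports just a subset of XPath, whereas a browser-based tool works with full XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (hypothetical structure).
PAGE = """
<html><body>
  <ul id="results">
    <li class="item"><a href="/item/1">First</a><span class="price">10</span></li>
    <li class="item"><a href="/item/2">Second</a></li>
  </ul>
  <a class="next" href="/page/2">Next</a>
</body></html>
"""

root = ET.fromstring(PAGE)

# Item links: every <a> directly inside an <li> whose class is "item".
links = [a.get("href") for a in root.findall(".//li[@class='item']/a")]

# A field that is missing on some items: query relative to each item,
# so a missing price yields None instead of shifting later values up.
prices = [li.findtext("span[@class='price']")
          for li in root.findall(".//li[@class='item']")]

# Pagination: locate the "next" link by its class attribute.
next_page = root.find(".//a[@class='next']").get("href")

print(links, prices, next_page)
```

Querying each field relative to its item, rather than with one absolute path for the whole page, is what keeps values aligned with the right record when a field is absent.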

Split the Task

Sometimes you can’t get all the data records you want even though you are sure the rule is configured correctly. Issues occur occasionally, especially in the “Click item” or “Click to paginate” steps, because of the amount of data or the complexity of the website itself. And if you don’t use cloud extraction with a paid version, you have to restart whenever the Internet connection drops or the computer goes to sleep. That takes a long time, and extracting the same data records again and again is tedious. My approach is to split the task into two. For example, if I want to extract the detail page of each item, I don’t use “Click item” directly, even though that is how such a rule is usually configured.

Instead, I first extract the URLs of the items that “Click item” would open, and then extract the data with a second task that loops over that list of URLs.

This way less data goes missing and extraction is faster. With fewer steps, it is also easier to find out what the problem is. And if the extraction stops unexpectedly, you can export the extracted data, check where it stopped, and restart from that point instead of starting from zero. The tutorial below shows how to use the URL list.

URLs - Advanced Mode
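The two-task idea can be sketched in plain Python as follows. This is only an outline of the workflow, not how Octoparse runs it; `collect_urls`, `extract_details` and the `fetch` callback are hypothetical names, and the sample data is invented.

```python
def collect_urls(listing_rows):
    """Task 1: keep only the detail-page URL of each listed item."""
    return [row["url"] for row in listing_rows]


def extract_details(urls, fetch, done=()):
    """Task 2: loop over the URL list, skipping URLs already extracted.

    `fetch(url)` stands in for the detail-page extraction and returns a dict.
    `done` holds the URLs found in a previously exported, partial result.
    """
    finished = set(done)
    results = []
    for url in urls:
        if url in finished:
            continue  # resume: do not extract the same record again
        results.append({"url": url, **fetch(url)})
    return results


# Invented example: three items, the first already extracted in an earlier run.
listing = [{"url": "/item/1"}, {"url": "/item/2"}, {"url": "/item/3"}]
urls = collect_urls(listing)
rows = extract_details(urls, lambda u: {"title": "Title of " + u}, done=["/item/1"])
print(rows)
```

Because the second task depends only on the URL list, an interrupted run can be resumed by passing in the URLs that already appear in the exported data.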

Set Cache Settings

Sometimes the built-in browser doesn’t open the URL you entered under the “Go to page” action. This may be because you have opened other websites many times and cached data has accumulated. Choose to clear the cache before opening the web page and you should be able to open the website you want.

Another use of these settings is extracting from websites that require login. After logging in, you can choose “Use specified Cookie” to record your account information, so that you don’t need to go through the login steps again and again. This also protects your personal information.

Use the RegEx Tool

Sometimes it takes a while to pick out the information you want because it is surrounded by a lot of noise. Or the information sits in HTML attributes, which you can’t extract directly. To extract exactly what you want, you can use the RegEx Tool. The tutorial below shows how to use the RegEx Tool in Octoparse.

Scrape Emails from Facebook Pages
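For a feel of what such patterns do, here is a small Python sketch that pulls email addresses out of noisy text and reads a value from an HTML attribute. The sample markup and the `data-id` attribute are invented, and the email pattern is a deliberately simple approximation rather than a full address validator. The exact syntax accepted by the RegEx Tool may differ.

```python
import re

# Invented sample: the wanted values are buried in markup and filler text.
HTML = ('<a class="contact" href="mailto:sales@example.com" data-id="42">'
        'Contact us</a> or write to support@example.org today.')

# Simple email pattern: local part, "@", domain, dot, top-level domain.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
emails = EMAIL.findall(HTML)

# Value stored in an attribute: capture what sits between the quotes.
data_id = re.search(r'data-id="([^"]*)"', HTML).group(1)

print(emails, data_id)
```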

The tips above should help you move forward with Octoparse. We are also working hard to improve performance and provide more efficient, intelligent solutions.

Author: The Octoparse Team

- See more at: Octoparse Tutorial

Labels: Big data, Octoparse, web crawler, Web scraping

Sunday, December 18, 2016
