Monday, February 27, 2017

Web scraping | Introduction to Octoparse XPath Tool

You can already use Octoparse to scrape websites, but sometimes the output has missing data or the task does not work properly. A new XPath expression can often solve these problems and make the task work, so it is worth mastering the Octoparse XPath Tool when scraping data from websites. With a little effort, you can greatly improve your productivity.

Before reading this article, you may want to learn the basics of HTML and XPath from these documents.


In this tutorial you'll learn how to use the Octoparse XPath Tool, with some syntax examples, when configuring a scraping task in Octoparse.

Location

There are two ways to open the Octoparse XPath Tool:

Method 1. Every action in the Octoparse Workflow except “Go To Web Page” has a “Define ways to locate an item” option under the “Customize Field” setting. Select the “Define ways to locate an item” option and click the Try XPath Tool link (lower left corner).
 
The Customized Field button in Extract Data
 
The Customized Field button in Enter Text
 
The Customized Field button in Click Item


 


Method 2. Click the third icon in the upper left corner of the Octoparse interface.
 

Main Interface
 

The main interface of the Octoparse XPath Tool is divided into 4 parts:
1. The Browser part. Enter the target URL in the built-in browser and click the Go button. The web page is displayed here.
2. The Source Code part. View the source code of the web page.
3. The XPath Setting part. Check the options and fill in the parameters, then click the Generate button to generate an XPath expression.
4. The XPath Result part. After the XPath is generated, click the Match button to see whether the current XPath finds elements on the web page.

Note: The Octoparse XPath Tool does not show the structure and hierarchy of the source code clearly. We strongly recommend using the Firefox extension Firepath to inspect the web page source code. Check out the tutorial HERE to use Firepath.

XPath Setting

Octoparse builds the XPath expression for you from the options you check and the text you fill in in the XPath Setting part.
The XPath Setting part

Item Tag Name:
In the source code shown in the Firefox browser, the blue text (such as SPAN, A, HR, BR) is the tag name. Check Item Tag Name and fill in the tag name of the elements you want to find.
 
The blue text represents the tag name.

Item Position:
The default value is 1, which represents the first item among its siblings.

Item ID, Item Name, Item Style Class:
A tag element usually carries attributes such as id, name, or class. Check the options you need and fill in their values to locate the elements more precisely. The id, name, and class attributes are the most common ones. For any other attribute, you can edit the generated XPath yourself, replacing the attribute name and value with the ones copied from the original source code in the Firefox browser.
 
Pic: Check the Item ID option, fill in a parameter, and generate the XPath.
 
Pic: Copy the generated XPath and paste it into Firepath.

Pic: Replace data-action with id and replace ‘sx-card-deck’ with ‘rot-B00D2PNANY’.
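
To make the tag name, position, and attribute options more concrete, here is a minimal sketch in Python. It does not use Octoparse at all: it runs the same kinds of XPath expressions with the standard-library ElementTree module, which supports only a subset of XPath and needs well-formed markup, so a real web page may require a different parser. The page fragment, ids, and class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the ids, names, and classes are invented for this example.
page = """
<div>
  <ul id="results">
    <li class="item">First</li>
    <li class="item featured">Second</li>
    <li class="item">Third</li>
  </ul>
  <input name="q" />
</div>
"""
root = ET.fromstring(page)

# Item Tag Name + Item Position: the first <li> among its siblings.
first = root.find(".//li[1]")

# Item Tag Name + Item ID.
results = root.find(".//ul[@id='results']")

# Item Tag Name + Item Name.
query_box = root.find(".//input[@name='q']")

# Item Tag Name + Item Style Class: the value must equal the whole class string,
# so "item featured" is not matched by 'item'.
plain_items = root.findall(".//li[@class='item']")

print(first.text)
print(results.tag, query_box.get("name"))
print([li.text for li in plain_items])
```

Because an attribute test such as @class='item' compares the full attribute value, an element with several classes is skipped unless you copy the complete string from the source code.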

Item Text:
In the source code shown in the Firefox browser, the black text is the text content. When you fill in this parameter, the text must match exactly, including blank spaces, punctuation, and full-width versus half-width characters. The simplest way is to copy the text between the tags straight from the original source code.
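
The following sketch shows why every character matters in an exact text match. It uses Python's standard-library ElementTree (Python 3.7 or later for the [.='...'] predicate) rather than Octoparse itself, and the button labels are invented for the example.

```python
import xml.etree.ElementTree as ET

# Three hypothetical labels that look almost the same to a human reader.
root = ET.fromstring(
    "<div>"
    "<span>Add to Cart</span>"
    "<span>Add to Cart </span>"  # trailing blank space
    "<span>Add to cart</span>"   # lowercase "c"
    "</div>"
)

# [.='...'] compares the element's complete text, character by character.
exact = root.findall(".//span[.='Add to Cart']")
with_space = root.findall(".//span[.='Add to Cart ']")

print(len(exact), len(with_space))
```

Each expression finds exactly one of the three spans, which is why copying the text from the source code is safer than retyping it.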
 

Item Text Contains:
Octoparse generates an XPath that matches every element whose text contains the text you specify.

Item Text Start With:
Similarly, Octoparse generates an XPath that matches every element whose text begins with the text you specify.
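
In XPath 1.0 these two options correspond to the contains() and starts-with() functions. Python's standard-library ElementTree does not support those functions, so the sketch below emulates the same matching with plain string checks. The helper names text_contains and text_starts_with and the sample list are assumptions made for this example, not part of Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical product details.
root = ET.fromstring(
    "<ul>"
    "<li>Price: $19.99</li>"
    "<li>List Price: $24.99</li>"
    "<li>In stock</li>"
    "</ul>"
)

def text_contains(parent, tag, fragment):
    """Roughly //tag[contains(text(), 'fragment')]: text includes the fragment."""
    return [e for e in parent.iter(tag) if fragment in (e.text or "")]

def text_starts_with(parent, tag, prefix):
    """Roughly //tag[starts-with(text(), 'prefix')]: text begins with the prefix."""
    return [e for e in parent.iter(tag) if (e.text or "").startswith(prefix)]

contains_hits = [e.text for e in text_contains(root, "li", "Price")]
starts_hits = [e.text for e in text_starts_with(root, "li", "Price")]

print(contains_hits)
print(starts_hits)
```

The "contains" check matches both price lines, while the "starts with" check matches only the line that begins with "Price", so the second option is the stricter of the two.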

After you check the options and fill in the parameters, click the Generate button. The XPath is created and displayed in the XPath Result part.
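
As a rough mental model of what the Generate button does, the sketch below assembles an XPath string from the same kinds of options. The function build_xpath is hypothetical, and the exact form and predicate order of the expressions Octoparse produces may differ. Only the id value is taken from the picture caption above.

```python
def build_xpath(tag="*", position=None, item_id=None, name=None,
                style_class=None, text=None, contains=None, starts_with=None):
    """Assemble an XPath 1.0 expression from optional conditions (hypothetical helper).

    Values are inserted as-is, so a value containing a single quote is not handled.
    """
    predicates = []
    if item_id is not None:
        predicates.append(f"@id='{item_id}'")
    if name is not None:
        predicates.append(f"@name='{name}'")
    if style_class is not None:
        predicates.append(f"@class='{style_class}'")
    if text is not None:
        predicates.append(f"text()='{text}'")
    if contains is not None:
        predicates.append(f"contains(text(), '{contains}')")
    if starts_with is not None:
        predicates.append(f"starts-with(text(), '{starts_with}')")

    xpath = f"//{tag}"
    if predicates:
        xpath += "[" + " and ".join(predicates) + "]"
    if position is not None:
        # Applied last: the position counts only elements that passed the other tests.
        xpath += f"[{position}]"
    return xpath

by_id = build_xpath("DIV", item_id="rot-B00D2PNANY")
first_next_link = build_xpath("A", contains="Next", position=1)

print(by_id)
print(first_next_link)
```

Writing the expression by hand in this way is also how you would target attributes the tool has no checkbox for, such as data-action.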

To the left of the Generate button are four buttons for finding elements near the information you actually want. You will rarely need them, and they are normally used after an XPath has been created.

Child: selects the child node.
Parent: selects the parent node.
Previous: selects the “preceding-sibling::node()”.
Next: selects the “following-sibling::node()”.
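
The four directions can be sketched with Python's standard-library ElementTree. It has no sibling axes, so this example reaches the siblings through the parent's list of children, and it covers element nodes only, whereas node() in XPath also includes text nodes. The table row is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical table row; the goal is to move around the "price" cell.
root = ET.fromstring(
    "<table><tr id='row'>"
    "<td>Name</td><td id='price'>19.99</td><td>In stock</td>"
    "</tr></table>"
)

# ElementTree elements keep no link to their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

cell = root.find(".//td[@id='price']")

row = parent_of[cell]              # Parent: the enclosing <tr>
children = list(row)               # Child: the child elements of that <tr>

index = children.index(cell)
previous = children[:index]        # Previous: element siblings before the cell
following = children[index + 1:]   # Next: element siblings after the cell

print(row.get("id"))
print([e.text for e in previous], [e.text for e in following])
```

Moving to a neighbouring element like this is useful when the value you want has no distinguishing attribute of its own but sits next to an element that does.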
 

Conclusion

There are many ways to find information on a webpage: you can write different XPath expressions to locate the same web elements, or ask an expert for help.
- See more at: Octoparse Tutorial
