Monday, December 12, 2016

How to avoid collecting only the first item of a web page in Octoparse?

Q: Why does Octoparse only collect the first item from each page?


Description: 

I have been testing your software to try and data mine some info.
The website is https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY
The problem is it will only collect the first item from each page.


A: 
In this case, check the "Loop Item" that is used to extract all the items from the page, and in particular the XPath set for that "Loop Item". If the XPath matches only one element, the loop will run only once per page.
Please follow these steps to check your rule.
1. Open the task
2. In the "Design Workflow" step, you will see the rule in the Workflow Designer. Click each step/box one by one from the beginning to go through the rule. Make sure the steps of the rule are in the correct order.
3. When you click the "Loop Item" box, check whether the XPath selects all the items on the page.
If it does not, modify the XPath using the Octoparse XPath tool or another tool such as FirePath.
Check out these tutorials to learn how to edit XPath.
4. Replace the original XPath with the corrected one.
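
To illustrate what steps 3 and 4 are checking for, here is a minimal sketch in Python using only the standard library. It is not Octoparse's XPath tool, and the page markup, class names, and links below are made up for illustration (the real Yelp page is structured differently). It shows a common cause of the problem: an XPath with a positional index such as [1] matches only the first item, while the same XPath without the index matches every item.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing page; real pages have different markup.
PAGE = """
<html><body>
  <ul class="results">
    <li class="result"><a href="/biz/shop-a">Shop A</a></li>
    <li class="result"><a href="/biz/shop-b">Shop B</a></li>
    <li class="result"><a href="/biz/shop-c">Shop C</a></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Too specific: the positional index [1] pins the XPath to the first item.
narrow_xpath = ".//ul[@class='results']/li[1]/a"
# General: without the index, the XPath matches every item in the list.
broad_xpath = ".//ul[@class='results']/li/a"


def matched_links(xpath):
    """Return the href of every link element matched by the XPath."""
    return [a.get("href") for a in root.findall(xpath)]


narrow = matched_links(narrow_xpath)
broad = matched_links(broad_xpath)

print(len(narrow), narrow)
print(len(broad), broad)
```

If the XPath shown for your "Loop Item" ends in an index or otherwise points at a single element, generalizing it in this way is usually the correction to apply in step 4.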

Only after the "Loop Item" correctly contains all the links to the detail pages can you move on to the next step and collect data from each of them.

Check out the tutorial to check your scraping task: Check The Extraction Rule When Errors Occur

-See more at: Octoparse FAQ
