Thursday, August 10, 2017

Scrape Data from a List of URLs by Creating a Simple Scraper

Web scraping can be done by writing a web crawler in Python. Before coding a Python-based crawler, you need to look at the page source and understand the structure of the target website. And of course you need to learn Python. This is much easier if you already know how to code, but for a beginner it is very difficult to learn everything from scratch. That is why we created our app, Octoparse: to help people who know little or nothing about coding scrape web data easily.
In this tutorial we will learn how to create a very simple web scraper that scrapes a list of URLs, without any coding at all. This method is best suited to beginners. (We will assume that Octoparse is already installed on your computer. If that’s not the case, download it here.)

This tutorial will walk you through these steps:



1. Create a “Loop Item” in the workflow
After setting up basic information for your task, drag a “Loop Item” and drop it into the workflow designer.




2. Add a list of URLs into the created “Loop Item”

After creating a “Loop Item” in the workflow designer, add a list of URLs to the “Loop Item” to create a pattern for navigating to each webpage.
      · Select the “List of URLs” loop mode under the advanced options of the “Loop Item” step
      · Copy and paste all the URLs into the “List of URLs” text box
      · Click “OK” and then save the configuration
Note:
     1) All the URLs should share a similar page layout.
     2) Add no more than 20,000 URLs.
     3) You will need to manually copy and paste the URLs into the “List of URLs” text box.
     4) After you enter all the URLs, a “Go To Webpage” action will be created automatically in the “Loop Item”.
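
For readers who keep their URLs in a text file, it can help to clean the list before pasting it into the “List of URLs” text box. The snippet below is only a rough sketch of that idea in Python and is not part of Octoparse: the function name prepare_url_list is made up for this example, and the only value taken from this tutorial is the 20,000 URL limit from the note above.

```python
from urllib.parse import urlparse

MAX_URLS = 20_000  # limit taken from the note above


def prepare_url_list(raw_text, max_urls=MAX_URLS):
    """Return cleaned, de-duplicated http(s) URLs, one per input line.

    Hypothetical helper for illustration; it does not talk to Octoparse.
    """
    seen = set()
    urls = []
    for line in raw_text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # skip lines that are not absolute web URLs
        seen.add(url)
        urls.append(url)
    if len(urls) > max_urls:
        raise ValueError(f"{len(urls)} URLs exceeds the limit of {max_urls}")
    return urls


if __name__ == "__main__":
    pasted = "http://example.com/a\n\n http://example.com/b \nhttp://example.com/a\n"
    # Print one URL per line, ready to paste into the text box.
    print("\n".join(prepare_url_list(pasted)))
```

The order of the URLs is preserved, so the cleaned list can be pasted as it is.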





3. Click to extract data points from one webpage
When the webpage has loaded completely, click the data points on the page that you want to extract.
      · Click the data you need and select “Extract Data” (an “Extract Data” action will be created automatically)
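
To give a feel for what this click replaces in the Python approach mentioned at the start, here is a minimal sketch that pulls the text of one kind of tag out of an HTML string using only the standard library. It is an illustration under assumptions, not how Octoparse works internally: the class TagTextExtractor, the function extract_tag_text, and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._depth = 0   # > 0 while we are inside the target tag
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1
            if self._depth == 1:
                self.texts.append("")  # start a new data point

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.texts[-1] += data


def extract_tag_text(html, tag):
    """Return the stripped text of each `tag` element in `html`."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


if __name__ == "__main__":
    sample = "<html><body><h1> Product A </h1><p>In stock</p></body></html>"
    print(extract_tag_text(sample, "h1"))
```

A hand-written crawler needs this kind of code for every data point, which is the part the point-and-click step takes care of.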
 



4. Run the scraper
The scraper is now set up. Run the task with either “Local Extraction” or “Cloud Extraction”.
In this tutorial we run the scraper with “Cloud Extraction”.
      · Click “Next” and then click “Cloud Extraction” to run the scraper on the cloud platform


Note:
     1) You can close the app or shut down your computer while the scraper runs with “Cloud Extraction”.
         Just sit back and relax, then come back for the data. There is no need to worry about Internet interruptions or hardware limitations.
     2) You can also run the scraper with “Local Extraction” (on your local machine).




5. Export the extracted data

      · Click “View Data” to check the extracted data
      · Choose “Export Current Page” or “Export All” to export the extracted data
Note:
       1) Octoparse supports exporting data to Excel (2003), Excel (2007), or CSV, or delivering the data to your database.
       2) You can also use the Octoparse API to access your data. See more at: http://www.octoparse.com/tutorial/api
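
For comparison with the Python approach mentioned at the start, writing scraped rows to CSV by hand could look roughly like the sketch below. It uses only the standard csv module. The function name rows_to_csv, the column names, and the sample row are invented for illustration and do not come from Octoparse or its export format.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row.

    Hypothetical helper; the column names are chosen by the caller.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    demo_rows = [{"url": "http://example.com/a", "title": "Product A"}]
    # newline="" lets the csv module control the line endings in the file.
    with open("export.csv", "w", newline="", encoding="utf-8") as handle:
        handle.write(rows_to_csv(demo_rows, ["url", "title"]))
```

Values that contain commas or quotes are escaped by the csv module, so the file opens cleanly in Excel or any other spreadsheet tool.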


Now we've learned how to scrape data from a list of URLs by creating a simple scraper without any coding! Very easy, right? Try it for yourself.

The extracted demo data looks like this:
(I have also attached the demo task and its Excel export. Find them here.)


Now check out similar case studies:
     · URLs - Advanced Mode 
