Sunday, January 22, 2017

Web Scraping - Scraping Websites That Require Login with Octoparse

(from http://www.octoparse.com/tutorial/web-scraping-scraping-websites-that-required-login-with-octoparse/)

You often need to log in to a website with a username and password before you can scrape data from it or perform further actions. Websites such as Facebook, Twitter, and LinkedIn require users to log in to their accounts before they can visit the site or view more content.

In this web scraping tutorial, we show you how to scrape a site that requires login with Octoparse. We use Facebook, Twitter, and LinkedIn as examples.

Before scraping a website that requires login with Octoparse, create a task for each website and set up its basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click "Next".


Facebook.  (Download the scraping task HERE)

Step 1. Enter the URL of the Facebook website in the built-in browser ➜ Click the "Go" icon to open the web page ➜ Click "Save".
(URL of the example: https://www.facebook.com/?stype=lo&jlou=AfdpcqxUre_1gbgZ5SOb0-KvZp9Ex5BwenJg2fO4Dz2MHw0jKnROZkAAbC_TaFcGAe6kiA2X2fcQuFmf5dSeBgviyGdb47hVYm0a_0SfogqCQw&smuh=54539&lh=Ac_oyCjRcPZfNeXe)

Step 2. Enter authorization information such as username and password.
Click the input field for "Email or Phone" on the web page ➜ Choose "Enter text value" ➜ Enter your email or phone number in the textbox for "Enter text" under "Customize Current Action" ➜ Click "Save".

Click the input field for "Password" on the web page ➜ Choose "Enter text value" ➜ Enter your password in the textbox for "Enter text" under "Customize Current Action" ➜ Click "Save". The password then appears in the field on the web page.

Step 3. Click the Login button.
Click the "Login" button ➜ Choose "Click an item" ➜ Click "Save".

Step 4. Enter the URL of the Facebook page you want to scrape data from.
Drag a "Go To Web Page" action into the workflow and enter the target URL in the "Page URL" textbox. Then click "Save".
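The order of the steps matters: the "Go To Web Page" action for the target page has to come after the login click, otherwise the page is requested without a logged-in session. The following is a minimal sketch that models the Facebook workflow as plain Python data and checks that ordering. It is only an illustration, not an Octoparse API or file format: the `Action` class, the `login_precedes_target` helper, the shortened start URL, the credentials, and the target URL are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str        # label borrowed from the tutorial, e.g. "Go To Web Page"
    target: str      # URL or on-page element label
    value: str = ""  # text to type, if any


# Placeholder values: replace with your own credentials and target page.
facebook_workflow = [
    Action("Go To Web Page", "https://www.facebook.com/"),
    Action("Enter text value", "Email or Phone", "you@example.com"),
    Action("Enter text value", "Password", "your-password"),
    Action("Click Item", "Login"),
    Action("Go To Web Page", "https://www.facebook.com/your-target-page"),
]


def login_precedes_target(workflow):
    """Return True if the first "Click Item" comes before the last "Go To Web Page"."""
    kinds = [action.kind for action in workflow]
    if "Click Item" not in kinds or "Go To Web Page" not in kinds:
        return False
    last_navigation = len(kinds) - 1 - kinds[::-1].index("Go To Web Page")
    return kinds.index("Click Item") < last_navigation


print(login_precedes_target(facebook_workflow))
```

Moving the final "Go To Web Page" action ahead of the login click makes the check return False, which mirrors what would go wrong in the real workflow.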

Twitter.  (Download the scraping task HERE)
Step 1. Enter the URL of the Twitter website in the built-in browser ➜ Click the "Go" icon to open the web page ➜ Click "Save".
(URL of the example: https://twitter.com/)

Step 2. Click the "Log in" button ➜ Choose "Click an item". A "Click Item" action is created in the workflow.
Because clicking the "Log in" button loads content with AJAX, we need to set an AJAX timeout for the "Click Item" action.
Tick the "AJAX Load" checkbox under "Advanced Options" ➜ Set an AJAX timeout of 2 seconds ➜ Click "Save".
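The general idea behind a timeout of this kind can be sketched in a few lines of Python: after the click, the tool cannot rely on a full page reload, so the wait for the new content is bounded by a fixed limit. This is not Octoparse's implementation, and the exact waiting behavior of the "AJAX Load" option may differ. The `wait_for` helper and the simulated `form_visible` condition are hypothetical names introduced for illustration.

```python
import time


def wait_for(condition, timeout=2.0, interval=0.1):
    """Poll `condition` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated page state: the login form "appears" 0.3 seconds after the click.
clicked_at = time.monotonic()


def form_visible():
    return time.monotonic() - clicked_at >= 0.3


print(wait_for(form_visible, timeout=2.0))
```

A timeout that is too short moves on before the content has loaded, while a very long one slows every run, so a couple of seconds is a reasonable starting point to tune from.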

Step 3. Enter authorization information such as username and password.
Click the input field for "Phone, email or username" on the web page ➜ Choose "Enter text value" ➜ Enter your email, phone number or username in the textbox for "Enter text" under "Customize Current Action" ➜ Click "Save".
 
Click the input field for "Password" on the web page ➜ Choose "Enter text value" ➜ Enter your password in the textbox for "Enter text" under "Customize Current Action" ➜ Click "Save". The password then appears in the field on the web page.
Note: If you want to uncheck the "Remember me" option, click the option and choose "Click an item" to uncheck it.

Step 4. Click the Login button.
Click the "Login" button ➜ Choose "Click an item" ➜ Click "Save".

Step 5. Enter the URL of the Twitter page you want to scrape data from.
Drag a "Go To Web Page" action into the workflow and enter the target URL in the "Page URL" textbox. Then click "Save".

LinkedIn.  (Download the scraping task HERE)
Step 1. Enter the URL of the LinkedIn website in the built-in browser ➜ Click the "Go" icon to open the web page ➜ Click "Save".
(URL of the example: https://www.linkedin.com/uas/login?goback=&trk=hb_signin)

Step 2. Enter authorization information such as username and password.
Click the input field for "Email address" on the web page ➜ Choose "Enter text value" ➜ Enter your email address in the textbox for "Enter text" under "Customize Current Action" ➜ Click "Save".

Click the input field for "Password" on the web page ➜ Choose "Enter text value" ➜ Enter your password in the textbox for "Enter text" under "Customize Current Action" ➜ Click "Save". The password then appears in the field on the web page.
 
Step 3. Click the Login button.
Click the "Login" button ➜ Choose "Click an item" ➜ Click "Save".

Step 4. Enter the URL of the LinkedIn page you want to scrape data from.
Drag a "Go To Web Page" action into the workflow and enter the target URL in the "Page URL" textbox. Then click "Save".
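The three walkthroughs share one shape and differ only in the label of the username field and in whether a "Log in" button must be clicked first to reveal the form. The sketch below collects those differences in a small table and generates a step outline for each site. The field labels and the 2-second AJAX timeout come from the steps above; the `SITES` dictionary layout and the `outline` function are hypothetical and only serve as a summary.

```python
# Per-site differences taken from the walkthroughs; the layout is illustrative.
SITES = {
    "Facebook": {"user_field": "Email or Phone", "open_form_click": False},
    "Twitter": {"user_field": "Phone, email or username", "open_form_click": True},
    "LinkedIn": {"user_field": "Email address", "open_form_click": False},
}


def outline(site):
    """Return the numbered steps for one site as a list of short descriptions."""
    config = SITES[site]
    user_field = config["user_field"]
    steps = ["Open the " + site + " start page in the built-in browser"]
    if config["open_form_click"]:
        # Only Twitter needs this extra click, with "AJAX Load" ticked.
        steps.append('Click "Log in" (AJAX Load, 2-second timeout)')
    steps.append('Enter text in "' + user_field + '" and "Password"')
    steps.append('Click the "Login" button')
    steps.append('Add "Go To Web Page" with the target URL')
    return steps


for name in SITES:
    print(name)
    for number, step in enumerate(outline(name), start=1):
        print("  Step " + str(number) + ". " + step)
```

Twitter ends up with five steps and the other two sites with four, matching the step counts in the tutorial.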
 


Author: The Octoparse Team
