Thursday, August 17, 2017

Scraping Data behind a CAPTCHA - Advanced Web Scraping

Is it possible to scrape data that is protected behind a CAPTCHA with Octoparse? This is one of the questions our users ask most often. Yes, Octoparse can scrape data behind a CAPTCHA when you run your scraping tasks on a local machine. Octoparse Cloud Service does not provide a CAPTCHA-solving service yet, but our developers are working hard on it. So we'll see. ;)
In this tutorial, I will introduce solutions for the two most common situations in which a CAPTCHA appears: logging in to a website, and accessing a website too frequently in a short time.



Here is the general rule for resolving a CAPTCHA issue in Octoparse:
· Allow the webpage to load the image (see the screenshot below).
· Set a waiting time before execution.
· Enter the text CAPTCHA or drag the slider.
 




1) Scrape data behind a login
Some websites require login credentials before you can browse them, and you often need to pass a CAPTCHA while those credentials are verified.
I'll take http://agent.octoparse.com/login as an example to show how to scrape a website that presents a CAPTCHA at login.
Step 1.
· Go to the webpage.
· Click the User Name box and choose “Enter Text value”.
· Enter the credentials.
· Click “Save”.
 

Enter the password the same way.

Step 2. Manually enter the CAPTCHA in the built-in browser.
Because the CAPTCHA changes every time the webpage reloads, you don't need to add a workflow step for entering it at this point. Just enter the CAPTCHA manually in the built-in browser.
(Note: drag a slider the same way.)

Step 3. Set a proper waiting timeout before execution.
Click the “Sign in” button and choose “Click an item” to log in to the website.
 

To make sure there is enough time to enter the CAPTCHA manually when local extraction starts, set a longer timeout before Octoparse executes “Click item”.
Click “Click item” in the workflow. ➜ Check “Waiting before execution” under advanced options. ➜ Set a proper execution timeout. ➜ Click “Save”.
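To make the idea of a wait before execution concrete, here is a minimal Python sketch. It is not Octoparse code: the `Step` class, its `wait_before` field, and `run_workflow` are hypothetical names used only to illustrate how a delay placed in front of the click step leaves a window for typing the CAPTCHA.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One workflow action plus the delay to apply before running it."""
    name: str
    action: Callable[[], None]
    # Seconds to pause first; a hypothetical stand-in for "Waiting before execution".
    wait_before: float = 0.0


def run_workflow(steps: List[Step],
                 sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run the steps in order, pausing before each step that requests a wait."""
    executed = []
    for step in steps:
        if step.wait_before > 0:
            sleep(step.wait_before)  # window for a human to type the CAPTCHA
        step.action()
        executed.append(step.name)
    return executed
```

In this sketch, a step such as `Step("click sign in", click, wait_before=30)` would pause for 30 seconds before clicking, which is the role the execution timeout plays in the workflow above.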
 

Step 4. Extract the data you want.
I won't go into the details here, as our other tutorials already cover them.
  
Step 5. Manually enter the CAPTCHA when starting data extraction on a local machine.
As you can see, a built-in browser is shown during extraction. Wait until the CAPTCHA appears, then enter it.
 
Once you have solved the CAPTCHA, Octoparse does the rest and gets the data for you.
Note:
We strongly recommend loading and storing the cookie by checking the option “Use specified Cookie” under Cache Settings. With this enabled, you usually don't need to enter the CAPTCHA when scraping behind a login.
(Follow the tutorial here to learn how to store cookies in Octoparse.)
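The reason a stored cookie helps is that a still-valid session skips the login page, and with it the CAPTCHA. The Python sketch below illustrates that general idea only; it does not show how Octoparse stores cookies. The function names, the JSON file layout, and the `max_age` parameter are all assumptions made for the example.

```python
import json
import time
from pathlib import Path
from typing import Dict, Optional


def save_cookies(path: Path, cookies: Dict[str, str],
                 now: Optional[float] = None) -> None:
    """Write cookies to disk together with the time they were captured."""
    saved_at = time.time() if now is None else now
    payload = {"saved_at": saved_at, "cookies": cookies}
    path.write_text(json.dumps(payload), encoding="utf-8")


def load_cookies(path: Path, max_age: float,
                 now: Optional[float] = None) -> Optional[Dict[str, str]]:
    """Return the stored cookies, or None if the file is missing or too old."""
    if not path.exists():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    current = time.time() if now is None else now
    if current - payload["saved_at"] > max_age:
        return None  # stale session: a fresh login (and CAPTCHA) is needed
    return payload["cookies"]
```

When `load_cookies` returns `None`, the scraper would fall back to the manual login described in the steps above and then call `save_cookies` again.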




2) Access a website too frequently in a short time

Passing a CAPTCHA that appears while you access a website works the same way. Octoparse mimics human behaviour through its point-and-click interface, so when building the crawler in the Workflow you just enter the CAPTCHA manually, as you would in a normal browser, and set a waiting timeout on the following step that is long enough for you to enter it.
Then, during extraction on a local machine, wait until the CAPTCHA appears and pass it manually before the operation times out.
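The "pass the CAPTCHA before the operation times out" behaviour can be sketched as a polling loop with a deadline. This is an illustrative Python sketch, not Octoparse's implementation: `wait_for_manual_captcha` and the `captcha_present` callback are hypothetical, and how you detect the CAPTCHA on a real page is left to the caller.

```python
import time
from typing import Callable


def wait_for_manual_captcha(
    captcha_present: Callable[[], bool],
    timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the CAPTCHA is gone; return False if the timeout expires first."""
    deadline = clock() + timeout
    while captcha_present():
        if clock() >= deadline:
            return False  # timed out before the human finished
        sleep(poll_interval)
    return True
```

A return value of `True` means extraction can continue. `False` corresponds to the timed-out case, which is why the timeout should be generous enough for a person to solve the CAPTCHA.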

Now you know how to scrape data behind a CAPTCHA in Octoparse. I know it's probably less than ideal, but it can serve as a short-term way of dealing with CAPTCHAs. We hope you enjoy data hunting with Octoparse!


The Octoparse Team
