Thursday, December 29, 2016

Scraping Online Dictionary - Merriam-Webster.com

Octoparse lets you scrape an online dictionary into an organized list by entering a list of words. It is easy to use: with the "Text list" loop mode, it can fetch the definition and examples of every word you enter.

In this tutorial, I will show you how to scrape the definitions of some words from merriam-webster.com.
The website URL we will use is www.merriam-webster.com.
The data fields include the word, its attributes (such as its part of speech), its definition, and an example.

You can directly download the task (the .otd file) and begin collecting the data, or you can follow the steps below to build a scraping task that extracts word definitions.
(Download my extraction task for this tutorial HERE just in case you need it.)

Step 1. Set up basic information.
Click “Quick Start” ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click “Next”.

Step 2. Enter the target URL in the built-in browser ➜ Click the “Go” icon to open the webpage.
(URL of the example: https://www.merriam-webster.com)

Note: If the URL keeps loading after the content of the website has fully loaded, you can click the cross sign (×) to stop it from loading.

Step 3. Create a loop for entering texts.
Drag a "Loop Item" into the Workflow Designer and then choose "Text list" in the "Loop mode".
Enter the text or a list of text you want to scrape in the "Text list" box and Click "Save".
You can see the list of text will be shown on the “Loop Item” box.

Next, click the search bar where the text will be entered in the built-in browser, and choose the “Enter text value” option.

Drag the "Enter Text" box into the "Loop Item" box under Workflow Designer. And then tick "Use the text in Loop Item to fill in the text box". Click "Save". So you could see that the program will enter the text one by one. 

Step 4. Get the search results.
Click the “Search” button on the website ➜ Choose “Click an item”.

Step 5. Extract the words’ definitions.
Now we are on the search result page of the first word, “socialism”.
Extract the word ➜ Click the word ➜ Select “Extract text”. Other content can be extracted in the same way.
All the selected content will appear under Data Fields ➜ Click a “Field Name” to modify it, then click “Save”.
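For comparison, here is a hedged sketch of the same “Extract text” step in Python with lxml. The XPath expressions are illustrative guesses; inspect the page source for the real ones:

```python
import requests
from lxml import html

# A sketch of extracting the headword and definition text. The class name
# 'dtText' is an assumption about merriam-webster.com's markup.
url = "https://www.merriam-webster.com/dictionary/socialism"
tree = html.fromstring(requests.get(url, timeout=10).content)

word = tree.xpath("string(//h1)").strip()
definitions = tree.xpath("//span[@class='dtText']//text()")
print(word)
print(" ".join(t.strip() for t in definitions))
```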

Step 6. Check the rule/workflow.
Now we need to check the workflow by clicking through the actions from the beginning of the rule/workflow:
Go to the webpage ➜ Loop Item ➜ Enter text ➜ Click Item ➜ Extract Data.

Step 7. Click “Save” to save your configuration, then click “Next” ➜ “Next” ➜ “Local Extraction” to run the task on your computer. Octoparse will automatically extract all the selected data.

Step 8. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
 
Note:
Correct use of XPath is the key to extracting data in Octoparse. If you find any missing values, go back to your rule, go through it from the beginning, and modify the XPath expressions for the data fields. Check out this article on modifying the XPath expression of a data field.
Knowing a little about XPath can help you solve a lot of problems when using Octoparse. The tutorials and FAQs below can help you pick up XPath quickly.

You can use Cloud Extraction to speed up the extraction; our cloud servers will collect the data for you within a short time. Go to the pricing page for more information about our subscription plans and extraction services: http://www.octoparse.com/pricing



Author: The Octoparse Team



Web Scraping | Scrape Booking Reviews

 
(picture from www.luxurybackpacker.com)

Collecting online customer reviews, including star ratings, comments, likes, dislikes, images, videos, share channels, and so on, can help an online retailer understand whether a product sells well and is popular among customers, and adjust its marketing strategies accordingly. There are many web scraping tools available online to meet your needs for scraping data from websites.
In this article we will cover the key points of scraping customer reviews of the hotels in Tbilisi from booking.com with Octoparse. We won’t provide step-by-step instructions for building the scraping task; if you want to learn how to build such a task, or want other types of customer reviews from booking.com, we offer extraction services to suit your needs. Please contact us via support@octoparse.com.

We’ve made the scraping task, and you can directly download the .otd file (What is an .otd file?) to begin collecting the hotel reviews from booking.com. (Download my extraction task for this article HERE just in case you need it.)
The .otd file can only be opened in Octoparse, so download Octoparse before downloading the scraping task.

Please click HERE to open the website URL we used.
The data fields include the hotel name, hotel address, star rating, customer name, and the comments posted by the customer.
The scraping task we built in Octoparse looks like this.
We will go to the detail page of each hotel and get the reviews under the “Read all trusted reviews” tab.


Since the actual number of reviews is sometimes larger than what is shown on the detail page, we need to get the reviews from all the countries displayed. Therefore, we click the plus button to display all the countries the reviewers are located in.

In Octoparse, we create a list of items to loop through all the countries. The XPath generated for the loop selects extra elements from the web page, so we need to modify it so that the expression selects only the right elements.
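As a rough illustration of what “the loop XPath selects extra elements” means, here is a sketch with invented markup and class names; the real booking.com source will differ:

```python
from lxml import html

# A generic XPath over-selects; constraining it to the right container fixes it.
doc = html.fromstring("""
<div><label>Sort by</label></div>
<div class="country-filter"><label>Georgia</label><label>Germany</label></div>
""")

too_broad = doc.xpath("//div//label")                          # includes "Sort by"
narrowed = doc.xpath("//div[@class='country-filter']//label")  # only the countries
print([label.text for label in narrowed])
```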

When you create a list of items in Octoparse by clicking the elements, all the matching elements are selected. However, booking.com selects the first country in the pop-up window by default, so the first country ends up unselected when you create the loop over these countries.
In this case, we need to select the first country by clicking its checkbox, and Octoparse will generate a “Click Item” action in the rule.

All the customer reviews about the hotel will then be extracted country by country.
Since there are anonymous customer accounts and reviews, the extraction output will contain duplicate data records. You can export only the valid data.
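If you would rather clean the duplicates after export, a pandas sketch like the one below works; the file name and column names are assumptions, not what Octoparse actually outputs:

```python
import pandas as pd

# Drop duplicate review rows from the exported spreadsheet. The file name and
# the column names ("hotel_name", "customer", "comment") are placeholders.
df = pd.read_excel("booking_reviews.xlsx")
df = df.drop_duplicates(subset=["hotel_name", "customer", "comment"])
df.to_excel("booking_reviews_clean.xlsx", index=False)
```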



Author: The Octoparse Team


10 Essential Tutorials That Every Octoparse Newbie Should Know

Octoparse offers a convenient way to scrape data from websites. Although little programming knowledge is required, some users still say they have no idea how to use Octoparse. This post therefore aims to help our lovely new users settle into Octoparse smoothly.

Below you will find links to 10 of the most helpful tutorials for taking your first steps in Octoparse. These guides will not only help you scrape different kinds of website structures, they will also show you tips that make Octoparse easier to work with and help you move forward faster.

  1. What is Octoparse?
Let’s start at the very beginning. The tutorial Octoparse Introduction introduces the elements that make Octoparse so useful. It will help you understand its functions and familiarize yourself with the workflow behind creating a crawler in Octoparse. Follow the link and watch the video tutorial through each section and menu to discover the specific actions available for configuring a task in Octoparse.

  2. How to scrape websites with pagination?
Pagination is commonly used to divide records into pages when a website holds a huge number of records, so you will often have to handle pagination during data extraction. The tutorial Scrape Data from Websites with Pagination (Query Strings) shows you exactly how to add actions in the Workflow Designer to get information from paginated websites; a short sketch of the idea follows this list.

  3. How to scrape websites requiring login?
Some websites require you to log in before browsing their content, and the same applies when scraping them. The tutorial How to Scrape A Website That Requires Login First? shows you how to scrape websites that require authentication.


  4. How to scrape websites by automatically searching keywords?
Most websites provide search bars so you can pinpoint the information you want, and you will often want to gather results for several different keywords. The tutorial How to Scrape Data by Searching Multiple Keywords on A Website? shows you how to do that.

  5. How to get information from drop-down menus?
Drop-down menus are common on websites, where the content is dynamically linked to what you choose in the drop-down list. The tutorial Scrape Web Data from A Drop-Down Menu shows that values from drop-down menus can also be extracted in Octoparse.

  6. How to get data in seconds?
Octoparse Smart Mode lets users get data in seconds by lowering the barrier to entry for anyone who needs data. The tutorial Octoparse Smart Mode -- Get Data in Seconds shows you how to extract data without having to configure an extraction rule in Octoparse. But before you move on, remember that Smart Mode currently only works for websites that present information in lists.

  7. How to extract data from certain URLs?
Sometimes you may just want to scrape information from several specific URLs. Octoparse lets you do that much as you would in a search engine. All you need to do is follow the tutorial Extract Data from A List of URLs with Similar Web Content Layouts to learn how to get the data you want from URL lists. It’s absolutely simple!

  8. How to extract information from detail web pages?
Most websites list records on one page and display the related information on another page, namely the detail page. For example, on Amazon you have to click a product’s link to reach its detail page. To get this information precisely, you can follow the tutorial List & Detail Web Page - Advanced Mode.

  9. Why and how to manually modify the XPath?
XPath is required in Octoparse in certain cases; one of them is the solution to missing data in Octoparse. The tutorial Modify XPath Manually in Octoparse guides you through extracting data using XPath.

  10. How to be precise with your configuration rule?
The visual workflow interface lets you fully configure your own rule in Octoparse. When you want to make sure your rule stays accurate, all you need to do is manually check the rule in the workflow interface. Follow the tutorial Check The Extraction Rule When Errors Occur to make sure you’ll never have issues running the task.
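As promised in item 2, here is a minimal sketch of pagination via query strings. It reuses the URL pattern that appears later in this blog and is an illustration, not Octoparse's mechanism:

```python
import requests

# Fetch pages 1-10 by incrementing the "page" query-string parameter.
base = "http://app.vccedge.com/fund/angel_investments?&page={}"

for page in range(1, 11):
    response = requests.get(base.format(page), timeout=10)
    print(page, response.status_code, len(response.text))
```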

Ready for more? Visit the Octoparse Tutorial page for useful how-to guides on all things Octoparse and learn how to get the most out of it.

Author: The Octoparse Team


Tuesday, December 27, 2016

Reasons and Solutions - Missing Data in Cloud Extraction

We all want a neat Excel spreadsheet of the scraped data before going into further analysis.
With Octoparse, you can fetch the data you want from websites and have it ready for your use. Our cloud services let you fetch large amounts of data by running your scraping task with Cloud Extraction. The premise is that you know how to deal with the various situations that come up when you use Cloud Extraction to scrape sites.
We have summarized several problems encountered by our paying users and made several tutorials about the reasons for, and solutions to, these Cloud Extraction problems.

3. There Are Missing Data in Cloud Extraction
This tutorial covers the third problem - what should I do about missing data in Cloud Extraction?

Before seeking solutions to these problems, let’s first review the concept of Cloud Extraction.

Octoparse Cloud Extraction

Octoparse Cloud Extraction refers to the process of retrieving data on a large scale through many cloud servers, based on distributed computing, 24/7/365.
After downloading the client, you can create a new task, configure a workflow/rule for it, and run it with Cloud Extraction by sending it to the cloud. You can then turn off your machine and let Octoparse do the rest. Cloud Extraction greatly speeds up the extraction, starts tasks automatically at your scheduled time, and helps prevent your IP address from being banned by websites.

Reasons and Solutions

Reason 1. If data are missing when you run the scraping task with Local Extraction, data will certainly be missing in Cloud Extraction as well.

When data are missing in a Local Extraction run, work through the following checks:
1. First, check whether the web element exists on the web page. If it does, there are two possible situations to consider:
Situation 1. The data content is not loaded before Octoparse executes the "Extract Data" action.
Solution: Set a longer timeout for each step except the ‘Go To Web Page' action, or wait until certain elements appear on the web page.

Situation 2. The XPath for the loop box didn't select all the items listed on the web page.

Solution: Modify the XPath for the loop box.

2. The formatting of the web page's source code has changed since you made the rule.
Solution: You can set the backup position option if the element you want has only two possible locations on the website, or you can manually modify the XPath of the web element.
Check out this FAQ to learn how to set the option.
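The “backup position” idea, sketched outside Octoparse with invented XPaths and markup, is simply a fallback lookup:

```python
from lxml import etree

# Try the primary XPath first and fall back to a second known location.
# Both expressions and the markup are invented for illustration.
def extract_price(tree):
    for xpath in ("//span[@id='price']/text()",
                  "//div[@class='price-box']/b/text()"):
        hit = tree.xpath(xpath)
        if hit:
            return hit[0]
    return None

new_layout = etree.fromstring("<html><div class='price-box'><b>$9.99</b></div></html>")
print(extract_price(new_layout))  # the primary XPath misses, the backup matches
```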
  
3. Part of the web page is loaded asynchronously, so Octoparse may execute the "Extract Data" action before the web element has loaded and appeared on the page.
Solution: Set a longer timeout for each step except the ‘Go To Web Page' action.
Note: Make sure you set the AJAX timeout option.
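For readers who script their own crawlers, the “wait until the element appears” idea looks like this in Selenium; the URL and CSS selector are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for an AJAX-loaded element before extracting it;
# a TimeoutException is raised if it never appears.
driver = webdriver.Chrome()
driver.get("https://www.example.com")

element = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".result-item"))
)
print(element.text)
driver.quit()
```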

Reason 2. The website you want to scrape uses anti-crawling techniques such as requiring login information, CAPTCHAs, or IP blocking.
Solution A: You can manually change your IP address by assigning other IP addresses to the task in the Extraction step. Check out this FAQ.

Solution B: Some websites will automatically unblock your IP address after a while, so you can retry the task later.

Reason 3. The data field does not exist.
Solution:
If all the data fields you want to extract are missing, Octoparse will delete the data record (the whole row). In this case, it is strongly recommended to add a fixed data field such as the current time, the current page URL, or a fixed data value.


Reasons and Solutions - Cloud Extraction Is Slower Than Local Extraction

Imagine that one day you open your web scraping software and the screen displays all the data you want, neatly.
Octoparse's cloud servers have collected all the data you want from any website for you. You're full of joy.
We love to see you smile.
We are dedicated to providing the best web scraping software and service for you.
So we have created some tutorials to solve the problems you may run into when using Cloud Extraction.

We have summarized several problems encountered by our paying users:

1. I get data from Local Extraction but none from Cloud Extraction
2. Cloud Extraction is slower than Local Extraction
3. There are missing data in Cloud Extraction

This tutorial covers the second problem - how can you make Cloud Extraction work normally and faster than Local Extraction?

Reasons and Solutions 

The principle behind Cloud Extraction is that the task you send to the cloud is split into many sub-tasks, and these sub-tasks are assigned to many different cloud servers. The cloud servers run the sub-tasks and send the collected data to our cloud database, where all of it is stored. Given that, the reasons for the problem may be:

Reason 1. The task is not split.
If the task is split into sub-tasks in the cloud, then 4 cloud servers will be allocated to those sub-tasks (on the Standard subscription plan). Executing the task in the cloud then speeds up the extraction and outperforms local extraction.
Conversely, if the task is not split, only one cloud server is allocated to it, and Cloud Extraction will be slower than Local Extraction.

Solution:
Try to split your task. Check the rule of the task and see if you can re-configure it. For example, you can split the task by using a Loop (URL list) to extract the data if the page URLs are identical except for the page number:
http://app.vccedge.com/fund/angel_investments?&page=1
http://app.vccedge.com/fund/angel_investments?&page=2
http://app.vccedge.com/fund/angel_investments?&page=3
...
...
http://app.vccedge.com/fund/angel_investments?&page=10
Check out this tutorial on creating a task using Loop (URL List).
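To see why a split task runs faster, here is a hedged sketch in which each page URL becomes an independent sub-task handled by one of four workers, loosely mirroring the four cloud servers of the Standard plan:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Each URL is an independent sub-task; four threads stand in for four servers.
urls = [f"http://app.vccedge.com/fund/angel_investments?&page={n}"
        for n in range(1, 11)]

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in pool.map(fetch, urls):
        print(status, url)
```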
Reason 2. Your machine performs better than a single cloud server.
Your machine may well outperform one of our cloud servers, and your internet connection may be much faster too. So if the task cannot be split, it may be better to run the scraping task on your own machine using Local Extraction.

Reason 3. The Professional subscription plan provides 10 cloud servers for running your tasks in the cloud, while the Standard plan provides 4.
If you start several tasks in the cloud at once, say task 1, task 2, and task 3 in that order, Octoparse will first split task 1 into sub-tasks, allocate cloud servers to them, and let those servers scrape the data; it then handles task 2 and task 3 in the same way.
Generally, task 1 is executed first. For example, say you have 10 cloud servers and task 1 uses 4 of them; the remaining 6 will then be used by task 2. In this case, task 3 waits for free cloud servers in the execution queue at 0% progress.

Solution A: Don’t start too many tasks in the cloud concurrently.
Octoparse lets you control the number of tasks being run in parallel. You can set the maximum number of parallel tasks for Cloud Extraction so that some tasks are executed preferentially.

If you want to retrieve the data of one task within a short time, set the value to one here so that all your cloud servers work on that task; if you want N tasks to run in parallel, set a value M here (M ≤ N) to speed them up.
Solution B: Add more cloud servers for your tasks by contacting us via support@octoparse.com.
Solution C: If you start many tasks using Scheduled Cloud Extraction, estimate the time each task needs and stagger their scheduled times.

Reason 4. The rule you configured for the task affects the speed of extraction, both for Local Extraction and Cloud Extraction.
A task that works really well in Local Extraction will not necessarily work well in the cloud: your machine and internet environment are better than a single cloud server's, so, for example, it takes your computer less time to open a given web page.

Solution:
Before putting the task into the cloud, you may need to set a longer timeout for each step except ‘Go To Web Page' (usually 5 seconds).


Reasons and Solutions - Getting Data from Local Extraction but None from Cloud Extraction

We all want a neat Excel spreadsheet of the scraped data before going into further analysis.
With Octoparse, you can fetch the data you want from websites and have it ready for your use. Our cloud services let you fetch large amounts of data by running your scraping task with Cloud Extraction. The premise is that you know how to deal with the various situations that come up when you use Cloud Extraction to scrape sites.
We have summarized several problems encountered by our paying users:
1. I get data from Local Extraction but none from Cloud Extraction
2. Cloud Extraction is slower than Local Extraction
3. There are missing data in Cloud Extraction


In this tutorial, we dig into the first problem - why do I get no data records with Cloud Extraction when the scraping task works well with Local Extraction?


Reasons and Solutions

Reason 1. The web environment in the cloud differs from the one on your machine, so the original XPath cannot select the HTML elements correctly and therefore misses some data.
Situation 1. The website’s source code has changed greatly.
Some websites automatically deliver regionally relevant content to each visitor based on the geographical location of the visitor's IP address. The IP addresses of our cloud servers are randomly assigned by our cloud hosting provider, so the structure of the web pages displayed on your own machine can differ from that on our cloud servers, and some data will unavoidably be missing in Cloud Extraction.
Solution A: You can save the cookie in the “Cache Settings” option to ensure the web page opened is the one you want to scrape. Try this tutorial to set the cookie.
Solution B: If Solution A does not work for the website you want to scrape, you can try the “clear cache before opening the web page” option to open the initial web page, and add actions to your rule to select web elements such as the region.

Situation 2. The website’s source code has changed just a little.
Solution:
Generally we need to modify the XPath of the web elements that fail to return data with Cloud Extraction. For example, you can change an absolute XPath to a relative XPath. The tip is to use an XPath that selects the web elements you want as directly as possible.
Absolute XPath: html/body/div[5]/div[3]/div[2]/div[1]
Relative XPath: .//*[@id='tab-2']/div
The tutorials and FAQs below can help you pick up XPath quickly.
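To make the difference concrete, here is a small sketch with invented markup showing how the relative form survives a layout change while the positional form breaks:

```python
from lxml import etree

# Two versions of a page: in the second, a leading <div> was removed.
before = etree.fromstring(
    "<body><div>ad</div><div id='tab-2'><div>data</div></div></body>")
after = etree.fromstring(
    "<body><div id='tab-2'><div>data</div></div></body>")

absolute = "div[2]/div"              # position-based: breaks when the page shifts
relative = ".//*[@id='tab-2']/div"   # anchored on a stable id: still matches

for label, tree in (("before", before), ("after", after)):
    print(label,
          [d.text for d in tree.xpath(absolute)],
          [d.text for d in tree.xpath(relative)])
```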

Reason 2. The cookie you saved has been disabled.
For websites that require login, we need to connect remotely to our cloud hosting to check whether the cookie is still working, or you can check the rule in Octoparse and see whether the cookie is still valid.
Situation 1. Some websites allow only one web browser (or one IP address) per login. This means the cookies of the website saved in web browser A will be disabled when you log into the site in web browser B; if you insist on logging into the website in browser B, you will be logged out in browser A.
Solution:
Tick the option not to split the task in Cloud Extraction when configuring the rule.

Situation 2. The website’s cookies were saved when you configured the rule for the task, and extracting data with Local Extraction works well. But when you use Cloud Extraction to collect data, our cloud servers cannot move forward without the cookies, so you get no data.
Solution:
For websites that require login information, you can add actions that enter the login information to your rule.
Check out this tutorial on adding the login action to your rule: How to scrape a website that requires login first.
If your Cloud Extraction still doesn't run smoothly after adding the login actions, please contact our Support Team via support@octoparse.com.
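The same “log in from the crawler itself” idea, sketched with Python's requests library; the URLs and form field names are placeholders:

```python
import requests

# Post the login form once, then reuse the session, which carries the
# login cookie into every later request. All names here are placeholders.
session = requests.Session()
session.post("https://www.example.com/login",
             data={"username": "me@example.com", "password": "secret"},
             timeout=10)

page = session.get("https://www.example.com/members-only", timeout=10)
print(page.status_code)
```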

Reason 3. The rule you configured for the task is not optimized for Cloud Extraction.
Since your machine and internet environment are better than a single cloud server's, it takes your computer less time to handle a complicated web page. When you put the task in the cloud, you need to consider the performance of a cloud server and adjust the rule of the task accordingly.

Solution:
Before putting the task into the cloud, you may need to set a longer timeout for each step except ‘Go To Web Page' (usually 5 seconds), or wait until certain elements appear on the web page. This screenshot shows how to set the timeout of the 'Click Item' action.

Reason 4. IP address blocking.
Some websites, such as those of the Alibaba group, implement anti-scraping mechanisms, so it can happen that our IP addresses get blocked when the website detects the scraping behavior. In this case, please contact our Support Team via support@octoparse.com and we will check it for you.

Solution A:
You can assign IP addresses to your task manually in Octoparse (Local Extraction only).
Check out this tutorial on manually assigning IP addresses.
Solution B:
You can observe the anti-scraping mechanisms used by the website you scrape and fetch the data again after a while, or you can set a longer waiting time for each step (except ‘Go To Web Page') to act more like a human.
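Acting “more like a human” in your own scripts usually just means a randomized pause between requests, as in this placeholder sketch:

```python
import random
import time

import requests

# Pause 3-8 seconds between pages so the request rhythm looks human.
# The URLs are placeholders.
urls = [f"https://www.example.com/page/{n}" for n in range(1, 6)]

for url in urls:
    print(url, requests.get(url, timeout=10).status_code)
    time.sleep(random.uniform(3, 8))
```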
