Monday, December 12, 2016

Integrate Octoparse and any other database via API

Q: Is it possible to integrate Octoparse and any other database via Octoparse API? How?


A:

Currently, Octoparse only lets you export data to MySQL, SQL Server and Oracle for free.
If you need to send data to other database types, you can pull the data scraped by a task via the Octoparse API.
The API returns the data as JSON, an open format. Once you have the data, you can save it to any other database or system, such as a NoSQL database like MongoDB or a platform like Salesforce, by writing some integration code.
 
Note:
The Octoparse API is only available in paid versions.
You need to write the code that integrates Octoparse with any other database yourself.
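
As a rough illustration of what such integration code might look like, the sketch below takes a list of JSON records and writes them to a database. Python's built-in sqlite3 module stands in for whichever database you actually target; the function name save_records, the table name scraped_data and the sample payload are assumptions made up for this example, not part of Octoparse.

```python
import json
import sqlite3


def save_records(conn, records, table="scraped_data"):
    """Store each record as one JSON text row; return the number of rows written.

    The table name is a trusted constant here, not user input.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id INTEGER PRIMARY KEY, record TEXT NOT NULL)"
    )
    conn.executemany(
        f"INSERT INTO {table} (record) VALUES (?)",
        [(json.dumps(record, ensure_ascii=False),) for record in records],
    )
    conn.commit()
    return len(records)


# Hypothetical payload standing in for JSON pulled from a task via the API.
payload = '[{"City": "Plano", "Humidity": "34%"}, {"City": "Plano", "Humidity": "32%"}]'

conn = sqlite3.connect(":memory:")
written = save_records(conn, json.loads(payload))
stored = [
    json.loads(row[0])
    for row in conn.execute("SELECT record FROM scraped_data ORDER BY id")
]
```

Storing each record as a JSON text column avoids guessing a schema up front, since the fields depend on how the task was configured.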

-See more at Octoparse FAQ


Tuesday, November 22, 2016

Get real-time data scraped from a website via API

(from Get real-time data scraped from a website via API)

(picture from www.forbes.com)

Scraping web data in real time is of great importance for many companies.
Usually, the more up-to-date your information is, the more choices are available to you.
Scraping real-time websites can help support immediate decision making. For example, if a company sells clothes online, its website and customer service center need the most up-to-date inventory data to prevent orders for items that are out of stock. If an item has only 5 in stock and a customer tries to purchase 6, or if an order is canceled because the requested style, color or size is unavailable, the customer can be notified and choose a similar product, and the company can thus discover its best sellers online. But not all departments of the company need real-time data. Most companies can achieve their business goals by looking at long-term trends such as weekly or monthly business performance reports and annual comparisons. The Finance department, on the other hand, may need real-time data to analyze economic indicators or to make a budget vs. actual comparison.

(picture from www.cin7.com)

Another example is scraping stock data in real time from financial information sites such as Google Finance and Yahoo Finance. To make investing easier, you need real-time stock quotes, including today's stock price, earnings and estimates, and other investing data displayed by many online information providers. To get the latest stock data and value a company’s stock, you need to stay on top of these websites, keep an eye on the stock information and react immediately to sudden changes so that your investment performs as expected. The internet makes the process of scraping stock information easy, fast and free. It’s easy to scrape the stock data from these sites and reuse it for your own purposes.

(picture from blog.excel4apps.com)

Once the data is scraped, you want it in hand by seamlessly connecting the scraped data to your machine. An API (application programming interface) makes that happen by enabling an application to interact with another system, library or piece of software. An API allows you to control and manage the scraped data: you can request the crawled data and integrate it with your machines.
Imagine that you are ordering two salads at a McDonald's drive-thru window (the API): you will get the two salads (the data) at the exit once you’re done ordering. There is an electronic board where drivers choose the food they want, and you see the bill after you complete the order. Similarly, when you request data via a cloud-based API, you just make API calls whenever you want and immediately get the data stored in the cloud.

How can you automate this process of scraping website content in real time and get the information you requested?
Octoparse and its web scraping API would be your best choice.

Octoparse

This freeware allows you to collect web data in real time via the Octoparse web scraping API.
You can schedule a task in Octoparse to scrape real-time websites hourly, daily, weekly, monthly and so on, and connect the scraped data to your environment via the scraping API. With the Octoparse scraping API, you can directly access all the real-time data scraped from millions of websites on the Internet and reuse it for your own purposes.

Author: The Octoparse Team


Sunday, November 20, 2016

API | Octoparse, Free Web Scraping

(from API | Octoparse, Free Web Scraping)

Content 
Octoparse Data Export API

You must obtain an access token to use the Octoparse API. The access token is passed with each API request and is used
to authenticate your access to the Octoparse API. It provides secure access to the Octoparse API.

1. Overview 
You can export extracted data through the Octoparse Data API by using the following procedure. Before using the
Octoparse API, you need an Octoparse advanced account (Standard/Professional) and some data from at least one task
that has run in the cloud.

The basic flow to use the Octoparse API:
1. Get an access token by providing your user name and password.
2. Use the access token and a task ID to get the data from a specific extraction task in Octoparse.

 


2. Get an Access Token

You can obtain an access token by making an HTTP POST request with your username and password. 

HTTP Method: POST
POST Content Type: application/x-www-form-urlencoded
POST Example:
username={username}&password={password}&grant_type=password
The values of username and password should be URL-Encoded.
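
A minimal sketch of building that POST content in Python follows. urllib.parse.urlencode URL-encodes the values, so characters such as @ or & in a username or password cannot break the format. The function name and the sample credentials are made up for illustration, and the token URL is not shown because it is not given here.

```python
from urllib.parse import urlencode


def build_token_request_body(username, password):
    """Return the form-encoded POST content for a token request."""
    return urlencode(
        {"username": username, "password": password, "grant_type": "password"}
    )


# Hypothetical credentials containing characters that need encoding.
body = build_token_request_body("user@example.com", "p&ss word")
print(body)  # username=user%40example.com&password=p%26ss+word&grant_type=password
```

Send the resulting string as the request body with the content type application/x-www-form-urlencoded.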

The successful HTTP response for the token request contains the access token that you can use to access the Octoparse
API. The response is JSON-encoded and below is an example response. 

{
"access_token": "ABCD1234",
"token_type": "bearer",
"expires_in": 86399,
"refresh_token": "refresh_token"
}

The response includes the following output properties.

Property | Description
access_token | The access token that you can use to authenticate your access to the Octoparse API.
token_type | The format of the access token. Currently, Octoparse uses a bearer token.
expires_in | The number of seconds for which the access token is valid. The current default value is 86400 (24 hours).
refresh_token | A token that can be sent to the Octoparse API instead of an authorization code. When the access token expires, send a POST request to the Octoparse API using this token instead of an authorization code. A new access token will be returned. A new refresh token might be returned too.
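
The properties above can be turned into something a client can act on. The sketch below parses a token response and works out when the token expires, so that a caller can request a new one shortly before that moment. The function names and the 60-second safety margin are assumptions for illustration only.

```python
import json
import time


def parse_token_response(body, now=None):
    """Extract the tokens and compute the absolute expiry time in seconds."""
    now = time.time() if now is None else now
    data = json.loads(body)
    return {
        "access_token": data["access_token"],
        "refresh_token": data.get("refresh_token"),
        "expires_at": now + data["expires_in"],
    }


def is_expired(token, now=None, margin=60):
    """Treat the token as expired `margin` seconds early to allow for clock drift."""
    now = time.time() if now is None else now
    return now >= token["expires_at"] - margin


# The example response from the text, parsed at a fixed time of 0 for clarity.
sample = (
    '{"access_token": "ABCD1234", "token_type": "bearer", '
    '"expires_in": 86399, "refresh_token": "refresh_token"}'
)
token = parse_token_response(sample, now=0)
```

Passing `now` explicitly makes the helpers easy to check without waiting for a real token to expire.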

An access token identifies you when you make an Octoparse API call, and it must be added to the HTTP request header
to get the data from tasks via the API.

Name: Authorization
Value: bearer {access token}
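
As a small sketch, the header can be attached with Python's standard urllib.request module. The URL below is only a placeholder (the real endpoint is not given in this text), and the request object is built but not sent.

```python
import urllib.request


def authorized_request(url, access_token):
    """Build a GET request that carries the access token in the Authorization header."""
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"bearer {access_token}")
    return request


# Placeholder URL for illustration; substitute the real API address.
req = authorized_request("https://api.example.com/data", "ABCD1234")
print(req.get_header("Authorization"))  # bearer ABCD1234
```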

If the request for an access token fails, the response returns a JSON-formatted error. The error cases are explained
below.

Case 1. The content of the POST is not formatted correctly.

{
"error": "unsupported_grant_type"
}

Make sure that the POST content is formatted like this:

username={username}&password={password}&grant_type=password

Case 2. The user name or password in the POST is incorrect.

{
"error": "invalid_grant",
"error_description": "The user name or password is incorrect."
}
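
One way a client might surface these two cases is to raise an exception whenever the token response carries an "error" field. This is only a sketch; the exception class and function name are hypothetical.

```python
import json


class TokenRequestError(Exception):
    """Raised when the token response contains an "error" field."""


def check_token_response(body):
    """Return the parsed response, or raise if it describes a failure."""
    data = json.loads(body)
    if "error" in data:
        # Case 1 has no description, so fall back to a generic note.
        detail = data.get("error_description", "no description provided")
        raise TokenRequestError(f'{data["error"]}: {detail}')
    return data


try:
    check_token_response(
        '{"error": "invalid_grant", '
        '"error_description": "The user name or password is incorrect."}'
    )
except TokenRequestError as exc:
    message = str(exc)
```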

3. How to Get Data through API

3.1. Get all data of a task using paging

Octoparse supports paging, so you can retrieve one page of data records at a time using an HTTP GET request. This
API needs the parameters taskID, pageindex and pagesize, and the access token must be added to the HTTP header.

HTTP Method: GET
HTTP Header parameter 1:
Name: Authorization
Value: bearer {access token}
HTTP Header parameter 2:
Name: Accept 
Value: application/json 

Octoparse will page through the data of the current task based on the given pagesize (the maximum allowed page size
is 1,000) and return the page at the given pageindex; the number of data records returned depends on the page size you set.
For example, let’s say there are 1,000 data records in a task. If the pageindex is 1 and the pagesize is 2 (pagesize=2,
pageindex=1), the data will be divided into 500 pages with 2 data records per page and the Octoparse API will return the
first page of 2 data records.
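
The arithmetic in this example can be checked with a few lines of Python. The helper names are made up for illustration; pageindex is treated as starting at 1, as in the example above.

```python
import math


def page_count(total, pagesize):
    """Number of pages needed to cover `total` records."""
    if not 1 <= pagesize <= 1000:
        raise ValueError("pagesize must be between 1 and 1000")
    return math.ceil(total / pagesize)


def page_bounds(pageindex, pagesize, total):
    """Positions (first, last), counted from 1, of the records on a given page."""
    first = (pageindex - 1) * pagesize + 1
    last = min(pageindex * pagesize, total)
    return first, last


print(page_count(1000, 2))      # 500
print(page_bounds(1, 2, 1000))  # (1, 2)
```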
A successful HTTP request with a correct access token and taskID returns JSON-formatted data. Below is an example
response.

{
    "data": {
        "total": 1000,
        "currentTotal": 2,
        "dataList": [
            {
                "State": "Texas",
                "City": "Plano",
                "Date": "2013-1-1",
                "Humidity": "34%",
                "High Temperatures": "72.8F",
                "Wind": "NW 8km/h",
                "Low Temperatures": "24.8F"
            },
            {
                "State": "Texas",
                "City": "Plano",
                "Date": "2013-1-2",
                "Humidity": "32%",
                "High Temperatures": "76F",
                "Wind": "NNW 10km/h",
                "Low Temperatures": "25F"
            }
        ]
    },
    "error": "success"
}

The data returned includes the following fields.

Data field | Description
total | The total number of data records of the current task
currentTotal | The number of data records requested
dataList | The list of data fields
error | Prompt information
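
To collect every record of a task, a client can keep requesting pages until it has gathered `total` records. The sketch below assumes a caller-supplied function fetch_page(pageindex, pagesize) that performs the HTTP GET and returns the parsed JSON; that function, and the fake used here in place of a real request, are hypothetical.

```python
def fetch_all_pages(fetch_page, pagesize=1000):
    """Request page 1, 2, 3, ... until all records of the task are collected."""
    records = []
    pageindex = 1
    while True:
        response = fetch_page(pageindex, pagesize)
        if response.get("error") != "success":
            raise RuntimeError(f"API error: {response.get('error')}")
        batch = response["data"]["dataList"]
        records.extend(batch)
        # Stop on an empty page too, in case the total changes mid-run.
        if not batch or len(records) >= response["data"]["total"]:
            return records
        pageindex += 1


# Fake data and fetch function standing in for the real HTTP call.
ROWS = [{"Date": f"2013-1-{day}"} for day in range(1, 6)]


def fake_fetch_page(pageindex, pagesize):
    start = (pageindex - 1) * pagesize
    chunk = ROWS[start:start + pagesize]
    return {
        "data": {"total": len(ROWS), "currentTotal": len(chunk), "dataList": chunk},
        "error": "success",
    }


all_rows = fetch_all_pages(fake_fetch_page, pagesize=2)
```

Keeping the HTTP call behind a function argument lets the loop be exercised without network access or a real access token.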


3.2. Get Unexported Data from a Task

You can get all unexported data from a task in batches, using HTTP GET requests. The request needs the taskID and the
number of data records returned per batch (size), and the access token must be added to the HTTP header. The Octoparse
API returns the records that were collected first.

HTTP Method: GET
HTTP Header parameter 1:
Name: Authorization
Value: bearer {access token}
HTTP Header parameter 2:
Name: Accept
Value: application/json or application/xml

The interface returns unexported data (the amount depends on the parameter ‘size’) and then marks this data as
exported, so that all exported data is skipped the next time you make a request.

For example, let’s say there are 1,000 data records in a task. If the size is 2 (the number of data records returned
per batch) for the first request, the Octoparse API will return the 2 data records that were collected first. The
next request will return the 2 earliest-collected records among the remaining 998.

A successful HTTP request with a correct access token and taskID returns JSON-formatted data. Below is an
example response.

{
    "data": {
        "total": 1000,
        "currentTotal": 2,
        "dataList": [
            {
                "State": "Texas",
                "City": "Plano",
                "Date": "2013-1-1",
                "Humidity": "34%",
                "High Temperatures": "72.8F",
                "Wind": "NW 8km/h",
                "Low Temperatures": "24.8F"
            },
            {
                "State": "Texas",
                "City": "Plano",
                "Date": "2013-1-2",
                "Humidity": "32%",
                "High Temperatures": "76F",
                "Wind": "NNW 10km/h",
                "Low Temperatures": "25F"
            }
        ]
    },
    "error": "success"
}

The data returned includes the following fields.

Data field | Description
total | The total number of unexported data records of the current task
currentTotal | The number of data records requested
dataList | The list of data fields
error | Prompt information
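
Because each request hands back the next unexported batch, a client can simply repeat the request until a batch comes back empty. The sketch below assumes a caller-supplied fetch_batch(size) that performs the HTTP GET and returns the parsed JSON; it and the FakeTask class used in place of a real task are hypothetical.

```python
def drain_unexported(fetch_batch, handle, size=1000):
    """Fetch unexported batches until none remain; return how many records were handled."""
    handled = 0
    while True:
        response = fetch_batch(size)
        if response.get("error") != "success":
            raise RuntimeError(f"API error: {response.get('error')}")
        batch = response["data"]["dataList"]
        if not batch:
            return handled
        for record in batch:
            handle(record)  # e.g. write the record to your own database
        handled += len(batch)


class FakeTask:
    """Stand-in for a task: returned records are treated as exported and not returned again."""

    def __init__(self, rows):
        self.pending = list(rows)

    def fetch_batch(self, size):
        total = len(self.pending)
        batch, self.pending = self.pending[:size], self.pending[size:]
        return {
            "data": {"total": total, "currentTotal": len(batch), "dataList": batch},
            "error": "success",
        }


rows = [{"Date": f"2013-1-{day}"} for day in range(1, 6)]
task = FakeTask(rows)
seen = []
count = drain_unexported(task.fetch_batch, seen.append, size=2)
```

Since a batch is marked as exported once it has been returned, it is safer to store each batch durably before requesting the next one.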

Note:

If a parameter is incorrect when you request data from a task, the Octoparse API returns one of the following errors. The
error cases are explained below.

Case 1. Access token is invalid or has expired. Please use your username and password to obtain a new access token.

{
"error": "unauthorized",
"error_Description": "access_token invalid"
}

Case 2. The requested resource does not support HTTP method 'POST'. Please use GET method in this case.

{
"message": "Requested Resource Does Not Support HTTP Method 'POST'"
}

Case 3. The taskID is invalid or the task doesn’t belong to the user indicated by the access token. Please use the correct taskID.

{
"error": "taskid_error",
"error_Description": "TaskID is invalid or the task does not belong to you."
}

Case 4. The size exceeds the maximum allowed size. The size must be between 1 and 1000.

{
"error": "export_pagesize_error",
"error_Description": "Size range from 1 to 1000"
}

Case 5. The server is temporarily unavailable.

{
"error":"server_error",
"error_Description": "Server Error. Please try again later!"
}
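
A client might map these five cases to a next step along the following lines. This is only a sketch: the action names returned by classify_response are invented for illustration, while the error codes are the ones listed above.

```python
def classify_response(response):
    """Map a parsed API response to a suggested next action (action names are hypothetical)."""
    if "message" in response and "error" not in response:
        return "fix_request"      # Case 2: wrong HTTP method
    code = response.get("error")
    if code == "success":
        return "ok"
    if code == "unauthorized":
        return "get_new_token"    # Case 1: token invalid or expired
    if code == "server_error":
        return "retry_later"      # Case 5: server temporarily unavailable
    return "fix_request"          # Cases 3 and 4, or an unrecognized error


print(classify_response({"error": "unauthorized", "error_Description": "access_token invalid"}))
# get_new_token
print(classify_response({"error": "export_pagesize_error", "error_Description": "Size range from 1 to 1000"}))
# fix_request
```

Only the expired-token and server-error cases are worth retrying automatically; the others need a corrected request.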

4. Two Ways to Get a Task ID

4.1. Get a Task ID via API

You can get all the data from a task via its task ID.
Generally, users create a task group to extract large amounts of data and then create many tasks, categorized into
that group, to extract this data separately. In this case you can obtain the task group ID and all the
task IDs under the group via two APIs (one for the task group ID, the other for the task ID), then extract all the data from
the tasks in the group by writing code that works with the APIs.

4.1.1 Get a Task Group ID

First of all, you need to obtain the task group ID by using an HTTP GET request and adding the access token to the HTTP
header.

HTTP Method: GET
HTTP Header parameter:
Name: Authorization
Value: bearer {access token}

If the access token is valid, you will get a task group list in JSON text as follows.

{
    "data": [
        {
            "taskGroupId": 84,
            "taskGroupName": "Task Group ID 1"
        },
        {
            "taskGroupId": 527,
            "taskGroupName": "Task Group ID 2"
        }
    ],
    "error": "success"
}

The descriptions of data fields in the task group list are as follows:

Data field | Description
taskGroupId | The unique identifier for the task group
taskGroupName | The name for the task group


4.1.2 Get a Task ID from the task group

For a task group, all the tasks under the task group can be obtained by providing the task group ID.
You can get the list of all the tasks by using the HTTP GET request, adding the access token to the HTTP Header and
using the task group ID as the parameter.

HTTP Method: GET
HTTP Header parameter:
Name: Authorization
Value: bearer {access token}

If the access token is valid and the task group belongs to you, you will get a task list in JSON text as follows.

{
    "data": [
        {
            "taskId": "taskid1",
            "taskName": ""
        },
        {
            "taskId": "taskid2",
            "taskName": "Task 2"
        }
    ],
    "error": "success"
}

The descriptions of data fields in the task list are as follows:

Data field | Description
taskId | The unique identifier for the task
taskName | The name for the task
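
Putting the two calls together, a client could walk every task group and collect the task IDs inside it. The sketch below assumes two caller-supplied functions, fetch_groups() and fetch_tasks(task_group_id), that perform the HTTP GET requests and return the parsed JSON; both names, and the fake responses reusing the sample data above, are hypothetical.

```python
def list_all_tasks(fetch_groups, fetch_tasks):
    """Return one entry per task, labelled with the group it belongs to."""
    tasks = []
    for group in fetch_groups()["data"]:
        for task in fetch_tasks(group["taskGroupId"])["data"]:
            tasks.append(
                {
                    "taskGroupId": group["taskGroupId"],
                    "taskGroupName": group["taskGroupName"],
                    "taskId": task["taskId"],
                    "taskName": task["taskName"],
                }
            )
    return tasks


# Fake responses standing in for the two real HTTP calls.
def fake_fetch_groups():
    return {
        "data": [
            {"taskGroupId": 84, "taskGroupName": "Task Group ID 1"},
            {"taskGroupId": 527, "taskGroupName": "Task Group ID 2"},
        ],
        "error": "success",
    }


def fake_fetch_tasks(task_group_id):
    by_group = {
        84: [{"taskId": "taskid1", "taskName": ""}, {"taskId": "taskid2", "taskName": "Task 2"}],
        527: [],
    }
    return {"data": by_group[task_group_id], "error": "success"}


tasks = list_all_tasks(fake_fetch_groups, fake_fetch_tasks)
task_ids = [task["taskId"] for task in tasks]
```

Each collected task ID can then be passed to the data requests described in section 3.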

Note:

If the request is incorrect when getting task groups or tasks, the Octoparse API returns one of the following errors. The
error cases are explained below.

Case 1. Access token is invalid or has expired. Please use your username and password to obtain a new access token.

{
"error": "unauthorized",
"error_Description": "access_token invalid"
}

Case 2. The requested resource does not support HTTP method 'POST'. Please use GET method in this case.

{
"message": "Requested Resource Does Not Support HTTP Method 'POST'"
}

Case 3. The server is temporarily unavailable.

{
"error": "server_error",
"error_Description": "Server Error. Please try again later!"
}

4.2. Get a task ID via Octoparse client

This function is only available for the Standard and Professional plans.
After you log in to Octoparse, right-click a task and choose “Create an API”.
 

Then you will get the Task ID on the pop-up window.



5. Sample Code
GitHub: 
- See more at: http://www.octoparse.com/tutorial/api 
