Tuesday, November 29, 2016

A Must-Have Web Scraper for Data Comparison Software - Octoparse

(picture from www.theodysseyonline.com)

We live in a fast-paced, high-intensity environment, partly because we are constantly comparing different options and trying to choose the most suitable one. We compare prices online to find the best deal before making a purchase or taking out a loan, and it is worth taking the time to compare the schools or companies that will best advance your studies and career.

Can the data comparison process be made both faster and more efficient? Yes: use tools.

Learning to create and use tools was a great leap forward in human thinking. A person's potential business value depends largely on the information they command and on their ability to solve problems, learn from experience, and adapt efficiently to the digital age. The smart move is to find the right data extraction software to collect massive amounts of data from the internet before feeding it into data comparison and data analysis software.


Why Octoparse

Octoparse is a must-have web scraper for grabbing the data available online and feeding it into your data comparison software. Its Windows client has an easy-to-use interface, and you can build your own web crawler simply by pointing and clicking. It comes in a free version and paid versions to suit different data extraction needs. The free version is powerful enough for data comparison, while the paid versions let you extract large amounts of web data in real time.
                            

The most obvious use case for data comparison software is product price comparison. Since price is the most important factor influencing consumer behavior, it is crucial that companies compare data on almost all the influential factors so they can react promptly and maximize their profit.

Successful Use Cases

An e-commerce seller in Japan uses Octoparse to extract Amazon Mexico and eBay US market data for selling his Japanese products overseas, comparing the data from the two markets to decide which items have a profitable price difference.

An e-commerce seller in Germany uses Octoparse to monitor competing sellers in the US, running Octoparse Scheduled Cloud Extraction to grab information such as product title, image URL, price, ASIN, and shipping weight from those sellers for competitor analysis.

A user runs Octoparse at 9:00 a.m. every day to extract discount and promotion information from competitors' websites for competitive marketing analysis.




Author: The Octoparse Team


Download Octoparse Today


For more information about Octoparse, please click here.
Sign up today!


Author's Picks

- See more at: Octoparse Blog


Big Data Extraction Tools For Good Decision-making

(from Big Data Extraction Tools For Good Decision-making)


(picture from www.marketingtechblog.com)

Big Data Make Better Business Decisions

Big data refers to data sets that cannot be crawled, managed, processed, and parsed within a reasonable time by conventional software tools. Big data technology is used to handle such data: to crawl, manage, process, and parse huge volumes of raw data and quickly extract valuable information from a wide variety of data types. Big data technologies include massively parallel processing (MPP), data-mining grids, distributed file systems, distributed databases, cloud computing platforms, the internet, and scalable storage systems.

It sounds encouraging that raw big data can be translated into valuable information that improves business decisions. Because the volume of data is growing so rapidly, companies have ever more opportunities to gain deeper insights and find new ways to widen profit margins using the big data they extract.

Nowadays, more and more businesses rely on data-driven decision-making to improve operations once they adopt the right big data extraction tools and figure out the right ways to analyze the data with big data technologies.

Data Volume Affects Good Decision-making

Usually, companies have already collected the data they need, but in most cases they are simply not sure whether their analysis of that "big data" is reliable enough to support decision-making. After all, making a good decision is hard yet important, and even a good decision does not always lead to a good outcome.

Such problems often arise from the volume of big data extracted before the data can be fully exploited.

It's common for big companies to build an IT development department or hire big data crawler specialists to extract big data from the internet. However, not all companies can afford this; many small companies, such as startups, simply don't have big budgets for big data extraction.

Many programming languages and resources are available on the internet, so it is convenient for a developer to write a web crawler that extracts big data from the web. Non-technical users can instead use big data extraction tools to collect the web data they need. But many such tools seem to be designed for web scraping experts rather than for people with no coding experience, and few of them completely suit their needs, whether because of low cost-effectiveness or the tool's own shortcomings.
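For developers, a minimal crawler really can be just a few lines. The sketch below uses only the Python standard library to pull the links out of an HTML page; the sample page is embedded so the script runs offline, whereas in practice you would first download the page with urllib or requests.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for a downloaded product-listing page.
sample_html = """
<html><body>
  <a href="/item/1">Item 1</a>
  <a href="/item/2">Item 2</a>
  <a href="/next-page">Next</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)  # ['/item/1', '/item/2', '/next-page']
```

A real crawler would then fetch each collected link in turn, which is exactly the "following each link" behavior the tools below automate for you.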

Several Tools For Big Data Extraction

To fully meet the big data extraction needs of your business and support good decision-making, here are some big data extraction tools, aimed especially at tech newbies.

1. Octoparse

Octoparse is a modern visual big data extraction freeware for Windows. Both experienced and inexperienced users find it easy to bulk-extract unstructured or semi-structured information from websites and transform it into structured data. The Smart mode extracts data from web pages automatically within a very short time, and the point-and-click interface makes it even easier and faster for newbies to get data from the web. You can also retrieve real-time data through the Octoparse API. The cloud service is the best choice for big data extraction thanks to IP rotation and abundant cloud servers.

2. import.io

A platform that converts semi-structured data in web pages into structured data. It offers real-time data retrieval through JSON REST-based and streaming APIs, integrates with many programming languages and data manipulation tools, and provides a federation platform that lets more than 100 data sources be queried at the same time.

3. Mozenda

Mozenda offers web data extraction and data scraping tools that make it easy to scrape content from the internet. You can programmatically connect to your Mozenda account through the Mozenda Web Services REST API using standard HTTP requests.

4. Webharvy
  
WebHarvy lets you easily extract data from websites to your machine and can handle almost any website, with no programming or scripting knowledge needed. It can extract data from product listings, eCommerce websites, yellow pages, real estate listings, social networks, forums, and more. Just click the data you need; it is remarkably easy to use, and it can crawl multiple pages of listings, following each link.

5. UiPath

UiPath is a fully featured, extensible freeware for automating any web or desktop application. UiPath Studio Community is free for individual developers, small professional teams, and education and training purposes. UiPath lets organizations build software robots that automate manual, repetitive, rules-based tasks.


Author: The Octoparse Team

- See more at: Octoparse Blog


Monday, November 28, 2016

A Free LinkedIn Jobs Scraper! You Cannot MISS IT!!

Watch this video to learn how to extract data from LinkedIn!

Scheduled Data Extraction - Octoparse Cloud Web Scraping Service

(from Scheduled Data Extraction - Octoparse Cloud Web Scraping Service)

Octoparse Cloud Service provides options for scheduling data extraction for a task. You can set up a data extraction schedule to run your scraping tasks on the Octoparse cloud platform.

The following steps describe the process of scheduling a cloud scraping task, starting from the point when you have finished creating a task that is ready to scrape web data.
After you finish configuring your task, select the option “Schedule Cloud Extraction Settings” to begin the scheduling process.




1. Set the Parameters

In the “Schedule Cloud Extraction Settings” dialog box, you can select the Period of Availability for your task's extraction and the Run mode, which runs your periodic tasks to collect data at varying intervals.
 · Period of Availability - the data extraction period, set by choosing a start date and an end date.
 · Run Mode - Once, Weekly, Monthly, Real Time

There are four types of Run mode to set the schedule.
  · Run Mode - Once
To run the task at specific times on a selected day: select the day within the period of availability, and then select the times of day from the lists. (Octoparse will choose 00:00 by default if you skip this part.)

  · Run Mode - Weekly
To run the task at specific times on selected days each week: specify the days of the week, and then select the times of day from the lists. (Octoparse will choose 00:00 by default if you skip this part.)
For example, specify Monday, Thursday and Friday of each week, run at 9:00 and 16:00.

  · Run Mode - Monthly
To run the task at specific times on selected days each month: specify the days of the month, and then select the times of day from the lists. (Octoparse will choose 00:00 by default if you skip this part.)
For example, specify 11th, 16th, 17th, 18th and 23rd for each month, run at 9:00 and 16:00.

  · Run Mode - Real Time
To run the task at an interval from now on: Specify the interval to run the task.
For example, run the task once every 30 minutes from now on. Assuming the current time is 9:00 a.m., after you click the Start button your task will start running at 9:30 a.m. and will be executed every 30 minutes thereafter.
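All four run modes boil down to generating a list of run times. As a rough illustration (not Octoparse's actual implementation), the Real Time mode's behavior can be sketched with Python's datetime module:

```python
from datetime import datetime, timedelta

def real_time_runs(start, interval_minutes, count):
    """Return the first `count` run times for an interval schedule.

    The first run fires one interval after the schedule starts,
    matching the 9:00 -> 9:30 example above.
    """
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]

runs = real_time_runs(datetime(2016, 11, 28, 9, 0), 30, 3)
print([t.strftime("%H:%M") for t in runs])  # ['09:30', '10:00', '10:30']
```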

2. Manage your schedules

Before starting your schedule, you can save or cancel your settings.

Start your schedule
After you set the parameters for a schedule, you can
1. Click the Start button to start your schedule for the task.
2. Click the OK button and check the “Cloud: Waiting” list to see if the task is scheduled.


Stop your schedule
Find the task in the “My Task” list and open the task. Go to the last step - Done. You will see the status of your task in the “Schedule Cloud Extraction Settings” option.
1. Click this option to edit your schedule.
2. Click the Stop button.
3. Click OK to disable the schedule.



Edit an existing Schedule
To edit an existing schedule, you can find the task in the “My Task” list and open the task. Go to the last step - Done. You will see the status of your task in the “Schedule Cloud Extraction Settings” option. Click this option to reschedule your tasks.


Check the status of your scheduled tasks
1. Find the “Task Status” option on the left pane
2. Select the “Cloud: Waiting” option
You will see all the waiting tasks here.

- See more at: Octoparse Tutorial

Web Scraping Service - Octoparse Cloud Extraction Works Better

(from  Web Scraping Service - Octoparse Cloud Extraction Works Better)

(Download the extraction task of this tutorial HERE just in case you need it.)

As we all know, Octoparse's cloud servers enable you to extract data from websites much more quickly and stably when the volume of data you want to scrape is huge. Today I will show you an example where Octoparse Cloud Extraction is more efficient than Local Extraction.

In this tutorial, I will use Yelp as an example to show you how to scrape Yelp data and compare the efficiency of our two extraction methods - Cloud Extraction and Local Extraction.

(Note: Cloud Extraction is not available for the Free Edition. For more information about different editions, you could click HERE.)



Part 1. Configure a rule for your task

Let’s make a rule for our scraping task first.
Step 1.
Choose "Advanced Mode" ➜ Click "Start"➜ Complete basic information. ➜ Click "Next".

Step 2.
Enter the target URL of Yelp in the built-in browser. ➜ Click "Go" icon to open the webpage.

Step 3.
After the web page has completely loaded, scroll to the bottom of the page. Right-click the pagination link (the ‘Next’ button on this page) and select "Loop click in the element" to create the pagination loop.
(Note:
--- If you want to extract information from every page of the search results, you need to add a page navigation action.
--- If the content of the web page is completely displayed while the URL is still loading, you can stop the loading by clicking the red cross icon ‘×’ near the URL, or wait until the page finishes loading.)


Step 4.
Move your cursor over a section with a similar layout, where you would extract data.
Click the first highlighted link ➜ Click "Create a list of items" (sections with similar layout) ➜ "Add current item to the list".
The first highlighted link has now been added to the list. ➜ Click "Continue to edit the list".
Click the second highlighted link ➜ Click "Add current item to the list" again. Now we have all the links with a similar layout.
Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.


Step 5. 
After the page has been loaded completely, let’s drag the second "Loop Item" into the first "Cycle Pages" box, place it right above the "Click to paginate" action in the Workflow Designer so that we can grab all the elements of items from multiple pages.
And we can check if the rule is working well by clicking the actions from the beginning of the rule.
"Go To Web Page" ➜ "Cycle Pages" box ➜ "Loop Item" box (all items are extracted) ➜ "Click Item"
 

The rule seems to be working well so far. So let's continue to extract the data.
If the page URL keeps loading while the content of the web page is completely displayed (at least the data you want to scrape has been shown), you can stop the loading by clicking the red cross icon ‘×’ near the URL, or wait until the page finishes loading.

Step 6.
The data we will extract on the detail page includes the website URL, phone number, address, type, company name, and the current page URL.
The third item on the 1st page has all the data we want (the first two items lack the "website URL" on the detail page).
So we click the Loop Item (Extract Data), select the third item from the Loop Item list, and then click "Click Item" in the workflow.
After the page has loaded completely, we begin to extract the data we want.
Click the website URL on the web page ➜ Select "Extract text" and rename it. Other content can be extracted in the same way. Don't forget to click Save after all the data fields are extracted and configured.

 

Step 7.
Check if the rule is working well by clicking the actions from the beginning of the rule.
"Go To Web Page" ➜ "Cycle Pages" box ➜ "Loop Item" box (all items are extracted) ➜ "Click Item" ➜ "Extract Data" ➜ "Click to Paginate"

Note:
1. Don't forget to set the AJAX timeout for the "Click Item" and "Click to Paginate" actions.
2. If it takes a long time to load a page when you click "Click Item" or "Click to Paginate", you can set a longer waiting time in the next action before it executes. For example, if it takes 10 seconds to open the web page in the "Click to Paginate" action, you can set a longer waiting time in the next action - "Click Item" - before it executes.
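The waiting-time setting is essentially a timed poll: keep checking whether the page content is ready, and give up after a timeout. As a generic illustration of the idea (not Octoparse's internals), here is the same pattern in Python:

```python
import time

def wait_until(condition, timeout_s=10.0, poll_s=0.5):
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_s)
    return False

# Toy example: the "page" becomes ready on the third poll.
state = {"polls": 0}
def page_loaded():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until(page_loaded, timeout_s=5, poll_s=0.01))  # True
```

Setting a longer waiting time in Octoparse corresponds to raising the timeout here: a slow AJAX page simply needs a later deadline before the scraper gives up.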



Part 2. Run the task with both Local Extraction and Cloud Extraction

Let’s run the task with both Local Extraction and Cloud Extraction.
After you click one of these two options, you can go to the left panel and find the task from a category in My Task. Right click the task and choose the other extraction option.
After almost 3 hours, we got all the data from both extraction types.
The data we get from Local Extraction:


The data we get from Cloud Extraction:


The speed and efficiency of Local Extraction are greatly influenced by your network and the performance of your computer, while our cloud servers are dedicated solely to extracting data from web pages. That is why Octoparse Cloud Extraction sometimes works better than Local Extraction: the outstanding performance of our cloud servers.
Need large amounts of data from a site? Our sales team are happy to help!
Note: If the page URL keeps loading while the content of the web page is completely displayed (at least the data you want to scrape has been shown), you can stop the loading by clicking the red cross icon ‘×’ near the URL, or wait until the page finishes loading.

Author: The Octoparse Team
- See more at: Octoparse Tutorial


Tuesday, November 22, 2016

Get real-time data scraped from a website via API

(from Get real-time data scraped from a website via API)

(picture from www.forbes.com)

Scraping web data in real time is of paramount importance for most companies: the more up-to-date your information, the more choices you have.
Scraping websites in real time supports immediate decision-making. For example, if a company sells clothes online, its website and customer service center need the most up-to-date inventory data to prevent orders for items that are out of stock. If an item has only 5 units in stock and a customer tries to purchase 6, or if an order is canceled because the style, color, or size is unavailable, the customer can be notified and offered a similar product, and the company can also discover its online best sellers. Not every department needs real-time data, though; most companies can achieve their business goals by looking at long-term trends such as weekly or monthly performance reports and annual comparisons. The finance department, on the other hand, may need real-time data to analyze economic indicators or to make budget-versus-actual comparisons.
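The inventory scenario above is easy to express in code. A minimal sketch, with made-up product data for illustration:

```python
def check_order(stock, requested):
    """Decide whether an order can be fulfilled from current stock."""
    if requested <= stock:
        return "confirmed"
    return f"only {stock} in stock; please adjust or pick a similar item"

# Hypothetical snapshot from a real-time inventory feed.
inventory = {"blue-shirt-M": 5}

print(check_order(inventory["blue-shirt-M"], 6))
# only 5 in stock; please adjust or pick a similar item
print(check_order(inventory["blue-shirt-M"], 3))  # confirmed
```

The point is that this check is only as good as the freshness of `inventory`: with stale data, the "confirmed" branch can fire for items that are already sold out.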

(picture from www.cin7.com)

Another example worth noting is scraping stock data in real time from financial information sites such as Google Finance and Yahoo Finance. To make investing easier, you need real-time stock quotes: today's price, earnings and estimates, and other investing data displayed by many online information providers. To stay on top of a company's stock and value it correctly, you need to watch these sites, keep an eye on the stock information, and react immediately to sudden changes so that your investment performs to expectation. The internet makes the process of scraping stock information easy, fast, and free, and it is straightforward to scrape the stock data from these sites and make it available for reuse.
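As a rough sketch of what such scraping involves, the snippet below pulls a price out of an embedded sample page with a regular expression. The HTML structure, class name, and ticker are invented for illustration; real finance pages differ and may restrict scraping in their terms of service.

```python
import re

# Stand-in for HTML fetched from a finance site.
sample_page = '<span class="price" data-symbol="ACME">123.45</span>'

match = re.search(r'data-symbol="(\w+)">([\d.]+)</span>', sample_page)
if match:
    symbol, price = match.group(1), float(match.group(2))
    print(symbol, price)  # ACME 123.45
```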

(picture from blog.excel4apps.com)

Once the data is scraped, you want it in hand by seamlessly connecting it to your own systems. An API (application programming interface) makes that possible by enabling one application to interact with another system, library, or piece of software. An API lets you control and manage the scraped data: you can request the crawled data and integrate it with your own machines.
Imagine ordering two salads at a McDonald's drive-thru window (the API): you get the two salads (the data) at the exit after you finish ordering. There is an electronic board where drivers choose the food they want, and you see the bill after ordering. Similarly, when you request data via a cloud-based API, you just make API calls and immediately get the data stored in the cloud.

How to automate this process of scraping website content in real-time and get the information as you requested?
Octoparse and its web scraping API would be your best choice.

Octoparse

This freeware allows you to collect web data in real time via the Octoparse web scraping API.
You can schedule a task in Octoparse to scrape websites in real time hourly, daily, weekly, monthly, and so on, and pull the scraped data into your environment via the scraping API. With the Octoparse scraping API, you can directly access all the real-time data scraped from millions of websites and reuse it for your own purposes.
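What "connecting via an API" looks like in practice is just building an HTTP request and parsing the JSON that comes back. The sketch below is generic: the endpoint URL, token, and response fields are invented for illustration (consult the Octoparse API documentation for the real ones), and the response is an embedded sample so the script runs offline.

```python
import json
from urllib.parse import urlencode

def build_request_url(base, task_id, token):
    """Compose a hypothetical 'fetch scraped data' request URL."""
    query = urlencode({"taskId": task_id, "token": token})
    return f"{base}/data?{query}"

url = build_request_url("https://api.example.com", "task-42", "SECRET")
print(url)  # https://api.example.com/data?taskId=task-42&token=SECRET

# Stand-in for the JSON body an API like this might return.
sample_response = '{"rows": [{"title": "Item 1", "price": "9.99"}]}'
rows = json.loads(sample_response)["rows"]
print(rows[0]["title"], rows[0]["price"])  # Item 1 9.99
```

In a live integration you would send `url` with urllib or requests on whatever schedule your task uses, then load each response's rows into your own database or spreadsheet.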

Author: The Octoparse Team


Monday, November 21, 2016

Buy Data for Your Business

(from Buy Data for Your Business)

We are living in the big data era, and many people believe data has become a key competitive weapon in the operational efficiency of business and investment. There is no doubt that many people need to buy data. But how? In this post I will discuss how to buy data for your business and the factors you should consider when buying it.

How to buy data for your business?
There are different methods to get the data you want.
1. Establishing your own platform
The first method to consider is establishing your own platform, which means hiring a professional team to build it. This is feasible if your company only needs small amounts of data, or if the data on the target websites is not complex. For big data, you would need to hire more people to build the platform and maintain the security of the system, which is costly for most companies, even big ones.
2. Buying the data directly
There are data providers in every industry from which you can get the specific data you want. Alternatively, you can turn to data service companies for help. Although they are not specialists in your particular field, they have experience collecting data online and have already gathered large amounts of it. For example, BlueKai collects PC and smartphone users' data to enhance ad marketing for its clients and has about 700 million actionable profiles. You can purchase data directly from this kind of company to optimize your business. It is much easier, and you don't need to hire well-paid engineers to build and maintain a system. However, this method suits companies that don't need real-time data all the time. Another problem is that anyone can get the same data as long as they have enough money to buy it, so the strategies you base on the same purchased data may not be as competitive as you expect.
3. Using data extraction tools
Thanks to the rapid development of technology, there are now quite a few data extraction tools available, and most are designed for non-technical people, which means extracting data is no longer the preserve of programmers. You can use tools like Octoparse, import.io, and Mozenda to get the data you want yourself. This approach is more personalized, since you can collect exactly the data you need. If you don't know how to use a data extraction tool, you can also request custom service; Octoparse, for example, provides a custom service that extracts the data for you or builds the configuration rule (crawler) in Octoparse so you can easily extract the data yourself. Which tool to choose depends on the tools' features, your budget, your data volume, and so on. However, such web scraping tools need constant maintenance and upgrading, and because some websites deploy anti-scraping techniques, you may not be able to get all the data you want this way. So this is not a perfect method either.

What should be considered before buying data?
I think the first factor to consider when buying data is whether you can get all the data you need without any additional data solutions, no matter whether you purchase the data from suppliers or get it yourself by building a platform or using a data extraction tool.
The second factor is cost, which will strongly influence which method you choose.
The third is the speed and quality of data delivery. How long will it take to get the data you want? How easy is it to add to or expand the data volume? Such questions are also worth considering when buying data.
The last is the length of data maintenance. How often will you get updated data, given that some data changes frequently? You need to make sure you have a sustainable data service for the long term.

How to choose the best method to buy data?
If you have a large enough budget, the best way is to hire experts to build a data platform. If not, you can buy the data from a data provider or data service company. If you don't need data constantly, you can simply use data extraction tools like Octoparse. And if you are not familiar with such tools or want better service, you can choose the custom service provided by data extraction companies.

Author: The Octoparse Team
