Monday, May 15, 2017

3 Typical Ways to Use Web Scraping Tools for Marketing Decisions

(from http://www.octoparse.com/blog/3-typical-ways-to-use-web-scraping-tools-for-marketing-decision/)

Web scraping, also known as web crawling, (web) data extraction, data mining, or screen scraping, is the process of collecting large amounts of data from the web and saving it to a file, database, or other storage. Let's dig deeper into web scraping.

It's estimated that the Internet has doubled in size every year since 2012. So what does this mean? Yes, there's a lot of data, and it could help you substantially if you know what to do with it. Now the question boils down to: how do you collect the data? Navigating from site to site, picking out the information you want, and copying and pasting it into another file is far too time-consuming and tedious, although it was the only choice for a long time before automatic web crawlers became available. Most people are not technical enough to build a crawler from scratch, nor do they have the budget to purchase the data, so using a web scraping tool (refer to Top 5 Web Scraping Tools Review for more information) like Octoparse is the best choice for anyone who wants to mine the web for more insights.

Any data visible on the web can be crawled, even on websites that require login, as long as you have the credentials. Here's how most web scraping tools work: they open the web page as a browser would, then automatically extract the selected data from hundreds or thousands of URLs, either all at once or in a scheduled sequence.
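As a rough illustration of that process, here is a minimal sketch in Python using the requests and BeautifulSoup libraries; the URLs, CSS selectors, and field names are hypothetical placeholders rather than any particular tool's internals:

import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical listing pages; a real tool would take your own URL list.
urls = [
    "https://example.com/listings?page=1",
    "https://example.com/listings?page=2",
]

rows = []
for url in urls:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Pick out the elements marked for extraction (selectors are made up).
    for item in soup.select("div.listing"):
        name = item.select_one("h2")
        phone = item.select_one("span.phone")
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "phone": phone.get_text(strip=True) if phone else "",
        })

# Save the structured results, like a scraping tool's CSV export.
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "phone"])
    writer.writeheader()
    writer.writerows(rows)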

You may ask why. What's the point, and what can you actually do with this scraped data? My answer: better decisions, strategies, and plans, grounded in the data you have.
So how can you use web scraping tools? Here are 3 typical ways they inform marketing decisions.

  • Lead Generation
There are many tactics you can try to generate leads, depending on the nature of your business: social media, communities like Quora, conferences, guest posting, paid ads, lead magnets, and so on. So how does web scraping help generate leads?

In essence, a lead is simply a set of contact details that fits a profile. If you have a new cloud medical SaaS for anesthesiologists, you need a list of anesthesiologists; if you have a new product you want to persuade real estate agents to use, you need their information.

A web scraper can automatically collect this information for you: name, location, city, zip code, phone number, website, and more.
You can then further qualify those leads by searching or filtering the scraped data by keywords or any other criteria to pinpoint your exact personas. So it's not just leads, it's qualified leads. That's a goldmine.
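As a small illustration of that qualifying step, here is a minimal sketch in Python; the leads.csv file, its columns, and the keywords are hypothetical:

import csv

# Hypothetical keywords and location defining the target persona.
keywords = {"anesthesiologist", "anesthesiology"}
target_city = "Boston"

# leads.csv is a hypothetical export with name, title, and city columns.
with open("leads.csv", newline="", encoding="utf-8") as f:
    leads = list(csv.DictReader(f))

qualified = [
    lead for lead in leads
    if lead["city"] == target_city
    and any(kw in lead["title"].lower() for kw in keywords)
]

print(f"{len(qualified)} qualified leads out of {len(leads)}")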

With these scraped contact details, you could build your customer base and keep a steady flow of prospects heading into your sales funnel.

All this information is available online if you know where to look. Two good resources are Yellowpages and Yelp; the Octoparse blog has tutorials on scraping data from both websites.

  • Market Research
Market research is part of the due diligence for business owners. A web scraper can extract the necessary data in structured formats from market research firms, directories, news sites, and industry blogs. With it, you can gather information about opportunities, organize an extensive list of your direct and indirect competition, map the potential customer base (based on your buyer personas) in a given area, and more.

For example, a real estate company could use scraped auction, sales, and pricing data to keep abreast of market trends and real-time competitive pricing structures.
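As a toy illustration, here is a minimal sketch of summarizing scraped pricing data by area with Python's standard library; the records are hypothetical stand-ins for rows exported from a scraping tool:

from collections import defaultdict
from statistics import mean

# Hypothetical records standing in for rows exported from a scraping tool.
listings = [
    {"neighborhood": "Downtown", "price": 420000},
    {"neighborhood": "Downtown", "price": 455000},
    {"neighborhood": "Riverside", "price": 310000},
]

by_area = defaultdict(list)
for listing in listings:
    by_area[listing["neighborhood"]].append(listing["price"])

# Average price per area: a first read on local market trends.
for area, prices in sorted(by_area.items()):
    print(f"{area}: avg ${mean(prices):,.0f} across {len(prices)} listings")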

  • Search Engine Optimization
If you have a website, no matter what it offers, whether it's a product or a service, whether it's something everyone could use or something designed for a small niche, if you want to promote it online and work with data, you need more traffic to grow your market.

There are different channels to get traffic: direct, organic, referral, social, and paid. For most websites, the bulk comes from organic search. There are several ways to boost your organic search traffic, but they all ultimately revolve around search engine optimization (SEO).

Let's take Octoparse as an example and see what you can do with web scraping for SEO management and analysis.

First, we can track page ranks over time by scraping search engine results pages for given keywords.
We know that Octoparse is a web scraping tool, and I want to know where Octoparse ranks for each targeted keyword containing "web scraping". So I enter the targeted keywords to extract the search results (refer to How to Scrape Data by Searching Multiple Keywords on A Website for more information) and export the data to Excel. After finding out where Octoparse ranks, I create a chart of the results.
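Once the ranking data is exported, a few lines of Python can pull out where a domain stands for each keyword. A minimal sketch, assuming a hypothetical serp_results.csv export with keyword, position, and url columns:

import csv

DOMAIN = "octoparse.com"  # the site whose rank we are tracking

ranks = {}
# serp_results.csv is a hypothetical export of scraped search results.
with open("serp_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if DOMAIN in row["url"]:
            keyword = row["keyword"]
            position = int(row["position"])
            # Keep the best (lowest) position seen for each keyword.
            ranks[keyword] = min(position, ranks.get(keyword, position))

for keyword, position in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(f"{keyword!r}: rank {position}")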
 

By finding out what ranks above Octoparse, I can give up on some keywords where virtually unbeatable websites hold the top spots.

Second, to rank higher for more exposure and clicks, I turn to the direct competition and look at what keywords and phrases they rank for and target. A thorough scrape and text analysis of their site content can give me insight into their titles, keywords, descriptions, and links.
I can then take action, producing high-quality articles to generate traffic from search engines.
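Here is a minimal sketch of pulling the SEO-relevant tags from a single page with requests and BeautifulSoup; the URL is a placeholder, and you should only scrape sites whose terms permit it:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; swap in a competitor page you are permitted to scrape.
response = requests.get("https://example.com/", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

title = soup.title.get_text(strip=True) if soup.title else ""
description_tag = soup.find("meta", attrs={"name": "description"})
description = description_tag.get("content", "") if description_tag else ""
links = [a["href"] for a in soup.find_all("a", href=True)]

print("Title:", title)
print("Description:", description)
print("Links found:", len(links))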

Third, rankings change all the time. I need to keep an eye on updated data so that I know whether I'm moving up, moving down, or holding steady. Most web scrapers provide a cloud service to get real-time data; for example, Octoparse Cloud Service lets users schedule crawlers to fetch updated data over time. You can refer to How To Get Organic Traffic From Search Engine To Your Blog for more traffic-generating tips.
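If you are not using a cloud scheduler, a bare-bones alternative is to re-run your own scrape on a timer. A minimal sketch, assuming a hypothetical run_rank_check() wrapper around a scraping routine like the one sketched earlier:

import time

CHECK_INTERVAL_SECONDS = 24 * 60 * 60  # once a day

def run_rank_check() -> None:
    # Hypothetical wrapper around a scraping routine such as the
    # rank-tracking sketch above.
    print("Scraping SERPs and recording ranks...")

while True:
    run_rank_check()
    time.sleep(CHECK_INTERVAL_SECONDS)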

There are many other ways to use web scraping tools, such as job hunting and recruiting, financial planning, and so on. I've mentioned only a small part, but hopefully it gives you some ideas about how to use scraped data. With such large amounts of data available online, you need a simple solution to collect and sift through it.

A scraping tool allows you to benefit from automatic web scraping without having to install anything or learn coding.

Author's Picks

Web Crawling | How to Build a Crawler to Extract Web Data

Wednesday, May 3, 2017

FREE Professional Plan for Case Review - Octoparse Web Crawler Software

Thanks for being one of our most loyal clients. We would like to hear how you use our product, Octoparse. Your review will contribute greatly to our continuous efforts to improve it.

We are offering a one-month FREE Professional Plan ($189) subscription to anyone who writes a case review for Octoparse. Here's how it works:

Step 1: Write a case review (about 350 words) about why you use Octoparse and how you use the data you extract with it, including at least 1 direct link to Octoparse.com and our Facebook page. Send your copy to support@octoparse.com before 5/31/2017.
Tips:  
  • Send the copy via email with the subject line: Octoparse Review Free Pro
  • Provide your username in the email
  • It is not necessary to write in English; write in any language you like

Step 2: Post your review on your blog. If you don't have one, that's OK: upvote Octoparse on Product Hunt and share your review on social media like Facebook or Twitter once it's been published.
Tips: 
  • Attach the link to your blog post or screenshots of your Product Hunt upvote in your email.

Step 3: Once we receive your email, we will upgrade your account to Professional Plan.

For any queries, feel free to contact us via support@octoparse.com. We look forward to hearing from you. Cheers!

Regards

The Octoparse Team

Thursday, April 20, 2017

Top 5 Web Scraping Tools Review

Web scraping (also known as web crawling or web data extraction) means extracting data from websites. Normally, there are two options for crawling websites: build your own crawler by coding or through public APIs, or use automated web scraping software, which runs the process with a bot or web crawler. The data extracted from web pages can be exported in various formats or into different types of databases for further analysis.
There are many web scraping tools to be found online. In this post, I would like to share some popular automated scrapers that people think well of, and run through their respective featured services.

1. Visual Web Ripper
Visual Web Ripper is an automated web scraping tool that supports a variety of features. It works well for certain tough, difficult-to-scrape websites thanks to advanced techniques such as running scripts, which requires programming skills.
The tool has a user-friendly interactive interface that helps users grasp the basic workflow quickly. Its featured characteristics include:
Extract varied data formats
Visual Web Ripper can cope with difficult block layouts, especially web elements displayed on the page without a direct HTML association.
AJAX
Visual Web Ripper can extract AJAX-supplied data.
Login required
Users can scrape websites that require logging in first.
Data export formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB; customized C# or VB script file output (if additionally programmed)
IP proxy servers
Proxy support to hide your IP address
Debugger
Visual Web Ripper has a debugger that helps users build reliable agents and resolve issues effectively.
Even with all these functionalities, it does not yet provide a cloud-based service: users can only install the application on a local machine and run it locally, which may limit the scale and efficiency of scraping as data demands grow.
[Pricing]
Visual Web Ripper charges from $349 to $2,090 based on the number of subscribed user seats, with 6 months of maintenance included. Specifically, users who purchase a single seat ($349) can only install and use the application on a single computer; running it on other devices costs double or more. If this pricing structure is not a problem for you, Visual Web Ripper could be on your list of options.

2. Octoparse
Octoparse is a full-featured, non-coding desktop web scraper with many outstanding characteristics compared with other tools.
It provides useful, easy-to-use built-in tools to extract data from tough or aggressive websites that are difficult to scrape.
Its UI is very user-friendly and logically designed, so users won't have much trouble locating any function. Additionally, Octoparse visualizes the extraction process with a workflow designer that helps users stay on top of the scraping process for any task. Octoparse supports:
Ad Blocking
Ad blocking optimizes tasks by reducing loading time and the number of HTTP requests.
AJAX Setting
Octoparse can extract AJAX-supplied data and set timeouts.
XPath Setting
Users can modify the XPath to locate web elements more precisely (see the sketch after this feature list).
Regex Setting
Users can normalize the extracted output with Octoparse's built-in Regex tool, which generates a matching regular expression automatically.
Data Export Formats
CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB
IP Proxy Servers
Proxy support to hide your IP address
Cloud Service
Octoparse provides a cloud-based service that speeds up extraction 4 to 10 times compared with Local Extraction: once users start Cloud Extraction, 4 to 10 cloud servers are assigned to their extraction tasks. This frees users from long maintenance cycles and specific hardware requirements.
API Access
Users can create their own API that returns data formatted as XML strings.
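For readers curious what XPath location and regex normalization look like outside a point-and-click tool, here is a minimal standalone sketch using Python's lxml and re modules; the HTML snippet and phone format are hypothetical, and this is not Octoparse's internal mechanism:

import re
from lxml import html

# Hypothetical HTML standing in for a scraped page.
page = html.fromstring("""
<div id="mainResults">
  <ul>
    <li>Acme Corp - tel 555-0123</li>
    <li>Globex - tel 555-0456</li>
  </ul>
</div>
""")

# XPath pinpoints the repeating elements, as an XPath setting would.
items = page.xpath("//div[@id='mainResults']/ul[1]/li")

# A regex normalizes the raw text into a clean field, as a Regex tool would.
phone_pattern = re.compile(r"\d{3}-\d{4}")
for item in items:
    text = item.text_content().strip()
    match = phone_pattern.search(text)
    print(text, "->", match.group() if match else "no phone")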
[Pricing]
Octoparse is free to use if you don't need the Cloud Service, and its unlimited page scraping is excellent compared with the other scrapers on the market. If you want its Cloud Service for more sophisticated scraping, it offers two paid editions, Standard and Professional, both of which provide the full set of featured scraping services.
Standard Edition: $75 per month when billed annually, or $89 per month when billed monthly.
    Standard Edition offers all featured functions.
    Number of tasks in the Task Group: 100
    Cloud Servers: 6
Professional Edition: $158 per month when billed annually, or $189 per month when billed monthly.
    Professional Edition offers all featured functions.
    Number of tasks in the Task Group: 200
    Cloud Servers: 14
To conclude, Octoparse is a feature-rich scraping tool with reasonable subscription pricing, and it is worth a try.

3. Mozenda
Mozenda is a cloud-based web scraping service that provides many useful utilities for data extraction and lets users upload extracted data to cloud storage. Its featured services include:
Extract varied data formats
Mozenda can extract many types of data formats, though it is not that easy to use on data with irregular layouts.
Regex Setting
Users can normalize extracted results with Mozenda's Regex Editor. The editor is not that easy to handle, however; you may need to learn more about writing regular expressions.
Data Export Formats
It supports various types of export transformation.
AJAX Setting
Mozenda can extract AJAX-supplied data and set timeouts.
[Pricing]
Mozenda users pay for Page Credits; a credit is one request to a website to load a web page. Each subscription plan includes a fixed number of pages in the monthly price, and pages beyond that limit are charged additionally. Cloud storage also varies by edition. Mozenda offers two subscription editions.

4. Import.io
Import.io is a web-based platform for extracting data from websites without writing any code. Users build their extractors by simple point-and-click, and Import.io then automatically extracts the data from web pages into a structured dataset. Its characteristic features include:
Authentication
Extract data from behind a login/password.
Cloud Service
Use the SaaS platform to store extracted data; parallelized data acquisitions are distributed automatically by the scalable cloud architecture.
API Access
Integration with Google Sheets, Excel, Tableau and many others.
[Pricing]
Import.io charges subscribers based on the number of extraction queries per month, so users should tally up how many queries they need before subscribing. (A single query equals a single page URL.)
Import.io offers three paid editions:
Essential Edition: $199 per month when billed annually, or $299 per month when billed monthly.
Essential Edition offers all featured functions and up to 10,000 queries per month.

Professional Edition: $349 per month when billed annually, or $499 per month when billed monthly.
Professional Edition offers all featured functions and up to 50,000 queries per month.

Enterprise Edition: $699 per month when billed annually, or $999 per month when billed monthly.
Enterprise Edition offers all featured functions and up to 400,000 queries per month.

5. Content Grabber

Content Grabber is one of the most feature-rich web scraping tools. It is better suited to people with advanced programming skills, since it offers powerful script editing and debugging interfaces for those who need them. Users can write regular expressions in C# or VB.NET instead of generating a matching expression with a built-in Regex tool as in Octoparse. Content Grabber's features include:
Debugger
Content Grabber has a debugger that helps users build reliable agents and resolve issues effectively.
Visual Studio 2013 Integration
Content Grabber integrates with Visual Studio 2013 for its powerful script editing, debugging and unit testing features.

Custom Display Templates

Custom HTML display templates allow you to remove promotional messages and add your own designs to the screens, effectively letting you white-label your self-contained agent.

Programming Interface

The Content Grabber API can be used to add web automation capabilities to your own desktop and web applications. The web API requires access to the Content Grabber Windows service, which is part of the Content Grabber software and must be installed on the web server or on a server accessible to it.
[Pricing]
Content Grabber offers two purchasing methods:
Buy License: Buying any Content Grabber license outright gives you a perpetual license. Three editions are available to license buyers:
Server Edition:
This basic edition provides only a limited Agent Editor. The total cost is $449.
Professional Edition:
It serves users with the full-featured Agent Editor, but the API is not available. The price is $995.
Premium Edition:
This advanced edition provides all of Content Grabber's featured services, at a correspondingly higher price of $2,495.

Monthly Subscription: Users who sign up for a monthly subscription are charged upfront each month for the edition they choose. The same three editions are available to subscribers:
Server Edition:
This basic edition provides only a limited Agent Editor, at $69 per month.
Professional Edition:
It serves users with the full-featured Agent Editor, but the API is not available. The price is $149 per month.
Premium Edition:
This advanced edition provides all of Content Grabber's featured services, at a correspondingly higher price of $299 per month.

Conclusion

In this post, five automated web scraping tools have been evaluated from various perspectives. Most of these scrapers can satisfy users' basic scraping needs. Some of them, like Octoparse and Content Grabber, provide more advanced functionality that helps users extract precise results from tough websites using built-in Regex and XPath tools and proxy servers. Users without programming skills, however, should avoid the tools that rely on custom scripts (Visual Web Ripper, Content Grabber, etc.). In the end, which scraper to choose depends entirely on your individual requirements, so make sure you have an overall understanding of a tool's features before you subscribe to it. Lastly, check out the feature comparison chart below if you are seriously considering a data extraction service provider. Happy data hunting!


Thursday, March 2, 2017

Octoparse Cloud Service

Octoparse has always dedicated itself to providing users with a better experience and more professional service. Notably, Octoparse Cloud Service has added more featured services so that users can crawl and scrape data at ever higher speed and scale. We are proud to say that Octoparse Cloud Service delivers high-quality service for people with heavier crawling demands, and we'd like to share more about it with you.

What is Octoparse Cloud Service?
Following a DaaS model, Octoparse Cloud Service manages the infrastructure and platforms that run the applications. Octoparse cloud servers install and operate the application software in the cloud, and cloud users access the software from cloud clients. This frees users from long-term maintenance and specific hardware requirements.

How does the Cloud Service work?
Thanks to distributed and parallel cloud computing, Octoparse offers a multi-threaded processing mechanism. Put another way, the Cloud Service differs from Local Extraction in its scalability: tasks achieve a higher crawling speed by being cloned onto at least 6 virtual machines that run simultaneously. Once users upload their configured tasks to the Cloud Platform, they can extract data on a 24/7 basis; when extraction completes, the extracted data is returned to the client.
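The idea behind this speed-up is ordinary parallelism: split the URL list and fetch the shards concurrently. Here is a conceptual sketch with Python's concurrent.futures, offered by analogy only and not as Octoparse's actual implementation (the URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder URLs; a real task's URL list would come from its loop items.
urls = [f"https://example.com/page/{i}" for i in range(1, 25)]

def fetch(url):
    # One worker handles one page, like one cloud server handling one shard.
    response = requests.get(url, timeout=10)
    return url, response.status_code

# Six workers, mirroring the "at least 6 virtual machines" mentioned above.
with ThreadPoolExecutor(max_workers=6) as pool:
    for url, status in pool.map(fetch, urls):
        print(status, url)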

Why should you choose the Octoparse Cloud Service?

IP Rotation
Multiple cloud servers provide IP rotation, so requests come from changing IPs that aggressive target websites cannot trace, preventing users from getting blocked.

Extraction Speed-Up
Cloud Extraction can be much faster than Octoparse Local Extraction. Normally it scrapes data 6 to 14 times as fast, which means at least 6 cloud servers are out there scraping data for you. Users with higher demands on the Cloud Service can also adjust the number of cloud servers.

Scheduling Tasks
Note for all users: Task Schedule is only available in Cloud Extraction. After a task is configured, it runs on the Cloud Platform at the scheduled time. This lets users schedule scraping tasks, with minute-level precision, against target websites whose information updates frequently.

API
The API provided by Octoparse lets it connect with any system, according to users' export needs. Octoparse can export data in various formats, such as Excel, CSV, HTML and TXT, and to databases (MySQL, SQL Server, and Oracle).


When do you need the Cloud Service?

1. Oceans of data need to be scraped within a short period of time.
2. Target websites update their real-time data frequently.
3. Scraped data needs to be exported automatically.


Start your Cloud Extraction now!

Manually activate 'Cloud Extraction' in the 4th step, 'Done', after configuring the task.


Alternatively, users can activate their tasks under 'In Queue Tasks' and click 'Cloud Extraction' to start Cloud Extraction.


Schedule Cloud Extraction Settings
Users can schedule cloud settings once they have finished configuring their tasks.


To schedule Cloud Extraction settings, users should apply a valid Select Date and Availability time period to the task, based on their requirements.


As in the example figure below, enter the configuration name, then click the 'Save the Configuration' and 'Save' buttons. Users can then apply the saved configuration to any other task that should reuse the schedule settings.
 

Users can click 'OK' to activate Scheduled Cloud Extraction for their tasks; otherwise, just click the × button and save the configuration.
 

After activating scheduled tasks, users are directed to the 'Cloud: Waiting' task menu, where they can check the 'Next Execution Time' of the scheduled tasks.


A reminder about Scheduled Cloud Extraction is displayed in the 4th step, as below.


Users can stop their tasks from the 'Task Status' panel, and a confirmation pop-up box appears when users attempt to stop a task.



Cloud Extraction Speed-Up: Splitting Tasks

Tasks running on the Cloud Platform need to be split up if users want to speed up extraction; otherwise there is no difference between Cloud Extraction and Local Extraction. Note, however, that whether a task can be split depends on its loop mode.
In Octoparse, only the Fixed list, List of URLs and Text list loop modes allow a task to be split for Cloud Extraction, while Single element and Variable list loops do not. Take the Fixed list example below.


The task above generates a Variable list by default, which occupies one cloud server on the Cloud Platform and disables task splitting. To improve this, users can switch the Variable list to a Fixed list so that the task can be split up on the Cloud Platform.
Here, users modify the XPath of the Variable list and switch to a Fixed list. In the example below, users edit the XPath //DIV[@id='mainResults']/UL[1]/LI and append an array index, producing //DIV[@id='mainResults']/UL[1]/LI[i] (i = 1, 2, 3, ..., n). After editing the XPath, users add the numbered XPaths to the Fixed list one by one, then click 'OK' and 'Save'. As in the example figure below, the first item appears in the loop list after we paste //DIV[@id='mainResults']/UL[1]/LI[1] into the Fixed list and click 'OK' and 'Save'.


In the same way, we add each XPath with its sequential index one by one, then click 'OK' and 'Save'.
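For a long list, generating the numbered XPaths programmatically saves typing. A minimal sketch in Python, using the XPath from this tutorial (the item count of 20 is an arbitrary assumption):

# The base XPath comes from this tutorial; the item count is arbitrary.
BASE_XPATH = "//DIV[@id='mainResults']/UL[1]/LI"
ITEM_COUNT = 20

# Print one numbered XPath per line, ready to paste into the Fixed list.
for i in range(1, ITEM_COUNT + 1):
    print(f"{BASE_XPATH}[{i}]")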


After adding the modified XPaths, the Loop Items are displayed in the list as below.


A scheduled or running task will be split up and cloned onto multiple cloud servers once its XPath is adjusted this way, speeding up its Cloud Extraction; otherwise there is little point in running the task on cloud servers. Users can also choose to skip Cloud Extraction by clicking the option below.


Users can adjust the maximum number of tasks running in parallel. Specifically, the Octoparse Professional Edition sets a threshold of 14 threads working simultaneously, and threads are assigned to tasks randomly. So if users set a threshold of 10 parallel threads, at most 10 tasks can be active in the cloud servers at once. However, a task may occupy more than one thread if it has been split up: for example, 8 of the 10 tasks might occupy all of the threads, leaving 2 idle tasks waiting for free cloud servers. Users can also set task priorities so that the task with the highest priority executes first. In particular, a split task that occupied cloud servers before priorities were set will keep waiting until the tasks assigned a higher priority have completed their Cloud Extraction.
 



Author: The Octoparse Team