2017年3月2日星期四

Octoparse Cloud Service

Octoparse has always dedicated itself to providing users with a better experience and more professional service. Notably, Octoparse Cloud Service keeps adding features so that users can crawl or scrape data at higher speed and larger scale. We are proud to say that Octoparse Cloud Service provides high-quality service for people with demanding crawling needs, and we'd like to share more about it with you.

What is Octoparse Cloud Service?
Defined as a DaaS model, Octoparse Cloud Service manages the infrastructure and platforms that run the applications. Octoparse Cloud Servers install and operate the application software in the cloud, and our cloud users access that software from cloud clients. This frees users from long-term maintenance work and specific hardware requirements.

How does Cloud Service work?
Thanks to distributed and parallel cloud computing, Octoparse offers a multi-threaded processing mechanism. Put another way, our Cloud Service differs from Local Extraction in its scalability: a user's task achieves a higher crawling speed by being cloned onto at least 6 virtual machines that run simultaneously. After uploading their configured tasks to the Cloud Platform, users can extract data on a 24/7 basis. Once extraction completes, the extracted data is returned to the client.
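To make the splitting idea concrete, here is a minimal Python sketch of how a URL list can be divided across several concurrent workers. It is purely illustrative and not Octoparse's actual implementation; the URLs and worker count are assumptions made up for the example.

```python
# Illustrative only: split a URL list across 6 concurrent workers,
# mirroring the "at least 6 virtual machines" described above.
from concurrent.futures import ThreadPoolExecutor

import requests

URLS = [f"https://example.com/page/{i}" for i in range(1, 61)]  # hypothetical targets
NUM_WORKERS = 6

def scrape(url: str) -> tuple[str, int]:
    """Fetch one page and return its URL and response size."""
    resp = requests.get(url, timeout=10)
    return url, len(resp.content)

# Each worker pulls the next URL from the shared pool, so up to 6 pages
# are being fetched at any moment.
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for url, size in pool.map(scrape, URLS):
        print(f"{url}: {size} bytes")
```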

Why should you choose Octoparse Cloud-based Service?

IP Rotation
Multiple cloud servers provide users with IP rotation, so requests go out from changing IP addresses and cannot easily be traced by aggressive target websites, which prevents our users from getting blocked.
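To illustrate the general idea (not Octoparse's internal mechanism), here is a minimal Python sketch of proxy-based IP rotation; the proxy addresses are placeholders.

```python
# Illustrative only: rotate outgoing requests through a pool of proxies
# so the target site sees a different IP on each request.
import itertools

import requests

PROXIES = [  # hypothetical proxy pool
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url: str) -> requests.Response:
    proxy = next(proxy_cycle)  # advance to the next IP on every call
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```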

Extraction Speed-Up
Cloud Extraction is considerably faster than Octoparse Local Extraction. Normally, Octoparse can scrape data 6 to 14 times faster than Local Extraction, which means at least 6 cloud servers are scraping data for you. Moreover, users can add cloud servers as their demands on the Cloud Service grow.

Scheduling Tasks
Note that task scheduling is only available with Cloud Extraction. After tasks are configured, they run on the Cloud Platform at the scheduled times. This feature lets users schedule their scraping tasks, with minute-level precision, for target websites whose information updates frequently.
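As a rough illustration of minute-level scheduling (Octoparse handles this for you in the cloud), the following Python sketch uses the third-party schedule package; the 30-minute interval and task body are assumptions.

```python
# Illustrative only: re-run a scraping job every 30 minutes with
# minute-level precision, using the `schedule` package (pip install schedule).
import time

import schedule

def run_task():
    print("Running scraping task...")  # stand-in for a real extraction run

schedule.every(30).minutes.do(run_task)

while True:
    schedule.run_pending()  # fire any job whose scheduled time has arrived
    time.sleep(1)
```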

API 
The API provided by Octoparse enables it to connect with any system, depending on users' export needs. That means Octoparse can export data in various formats, such as Excel, CSV, HTML, and TXT, or to a database (MySQL, SQL Server, or Oracle).
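For a sense of what such exports involve, here is a hedged sketch that writes the same scraped records to CSV and to a database; sqlite3 stands in for MySQL/SQL Server/Oracle so the example stays self-contained, and the records are invented.

```python
# Illustrative only: export the same scraped rows to CSV and to a database.
import csv
import sqlite3

rows = [  # hypothetical scraped records
    {"title": "Item A", "price": "19.99"},
    {"title": "Item B", "price": "24.50"},
]

# CSV export.
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Database export (sqlite3 as a stand-in for MySQL/SQL Server/Oracle).
conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, price TEXT)")
conn.executemany("INSERT INTO items VALUES (:title, :price)", rows)
conn.commit()
conn.close()
```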


When do you need our Cloud Service?

1. Oceans of data need to be scraped within a short period of time.
2. Target websites update their real-time data frequently.
3. Scraped data needs to be exported automatically.


Start your Cloud Extraction now!   

Manually activate ‘Cloud Extraction’ in the fourth step, ‘Done’, after configuration.


Alternatively, users can activate their tasks within ‘In Queue Tasks’ and click ‘Cloud Extraction’ to start their Cloud Extraction.


Schedule Cloud Extraction Settings
Users can set up the schedule after completing their task configuration.


To schedule Cloud Extraction, users should apply a valid Select Date and Availability time period to the task, based on their requirements.


As in the example figure below, enter the configuration name, then click the ‘Save the Configuration’ and ‘Save’ buttons. Users can then apply the saved configuration to any other task that should reuse the schedule settings.
 

Users can click “OK” to activate Scheduled Cloud Extraction for their tasks. Otherwise, just click the × button and save the configuration.
 

After activating scheduled tasks, users are directed to the ‘Cloud: Waiting’ task menu, where they can check the ‘Next Execution Time’ of the scheduled tasks.


A reminder about Scheduled Cloud Extraction will be displayed in the fourth step, as shown below.


Users can stop their tasks within ‘Task Status’, and a confirmation pop-up will appear when they attempt to stop a task.



Cloud Extraction Speed-Up: Splitting Tasks

Tasks running on the Cloud Platform need to be split up if users want to speed up the extraction process; otherwise there is no difference between Cloud Extraction and Local Extraction. Note, however, that whether a task can be split depends on its loop mode.
In Octoparse, only the Fixed list, List of URLs, and Text list loop modes allow a task to be split for Cloud Extraction, while Single element and Variable list do not. Take the Fixed list example below.


The task above generates a Variable list by default, which occupies one cloud server on the Cloud Platform and disables task splitting. To improve this, users can switch the Variable list to a Fixed list so the task can be split on the Cloud Platform.
Here, users can modify the XPath of the Variable list and switch to a Fixed list. In the example below, users edit the XPath //DIV[@id='mainResults']/UL[1]/LI and append an index to it, as in //DIV[@id='mainResults']/UL[1]/LI[i] (i = 1, 2, 3, ..., n). After editing the XPath, users add the indexed XPaths into the Fixed list one by one, then click ‘OK’ and ‘Save’. As in the example figure below, we can see the first item in the loop list after copying //DIV[@id='mainResults']/UL[1]/LI[1] into the Fixed list and clicking ‘OK’ and ‘Save’.
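To see why the indexed XPaths form a splittable fixed list, the short Python sketch below (using lxml, which normalizes HTML tags to lowercase) shows that the bare expression matches every list item at once, while each indexed expression pins down exactly one item; the sample HTML is invented.

```python
# Illustrative only: the bare XPath matches all items as one variable list,
# while LI[1], LI[2], ... each select a single, independently assignable item.
from lxml import html

doc = html.fromstring("""
<div id="mainResults"><ul>
  <li>result 1</li><li>result 2</li><li>result 3</li>
</ul></div>
""")

variable = doc.xpath("//div[@id='mainResults']/ul[1]/li")
print(len(variable))  # 3 -- one list, indivisible as a unit

for i in range(1, 4):
    item = doc.xpath(f"//div[@id='mainResults']/ul[1]/li[{i}]")
    print(i, item[0].text)  # each indexed XPath isolates one item
```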


In the same way, we add the remaining indexed XPaths one by one, then click ‘OK’ and ‘Save’.


After adding the modified XPaths, the Loop Items are displayed in the list as below.


After users change its XPath this way, the scheduled or running task will be split up and cloned onto multiple cloud servers, speeding up its Cloud Extraction; otherwise there is little point in running the task on cloud servers. Users can also choose to skip Cloud Extraction by clicking the option below.


Users can adjust the maximum number of tasks running in parallel. Specifically, the Octoparse Professional Edition sets a ceiling of 14 threads working simultaneously. Threads are assigned to tasks randomly, which means that if a user sets a ceiling of 10 parallel threads, at most 10 tasks can be activated and run on the cloud servers. However, a split task may well occupy more than one thread; for example, 8 of the 10 tasks might occupy all of the threads, leaving 2 idle tasks waiting for free cloud servers. In addition, users can set priorities for their tasks so that the task with the highest priority is executed first. Note in particular that a split task that already occupied cloud servers before priorities were set will keep waiting until the tasks assigned a higher priority have completed their Cloud Extraction.
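The sketch below models this allocation in Python as an assumption-laden illustration (it is not Octoparse's internal code): a fixed pool of 14 worker threads drains a priority queue, and a split task enqueues one entry per part, so it can hold several threads at once.

```python
# Illustrative only: 14 worker threads serve tasks in priority order;
# a split task contributes one queue entry per part.
import queue
import threading

THREADS = 14
work = queue.PriorityQueue()

def enqueue_task(priority: int, name: str, parts: int = 1) -> None:
    # Lower numbers run first; a split task may hold several threads at once.
    for i in range(parts):
        work.put((priority, f"{name}[{i + 1}/{parts}]"))

enqueue_task(1, "urgent-task", parts=6)  # split across 6 sub-tasks
enqueue_task(2, "daily-task")
enqueue_task(3, "backfill")

def worker() -> None:
    while True:
        try:
            priority, label = work.get_nowait()
        except queue.Empty:
            return
        print(f"running {label} at priority {priority}")

threads = [threading.Thread(target=worker) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```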
 



Author: The Octoparse Team
- See more at: Octoparse Tutorial

Web Crawler Service

Web data crawling, or scraping, has become increasingly popular in the last few years. The scraped data can be used for various analyses, even predictions. By analyzing the data, people can gain insight into an industry and take on competitors. From this we can see how useful and necessary it is to get high-quality data quickly and at scale. This growing demand for data has driven the fast growth of Web Crawler Services.
Web Crawler Services can be found easily if you search via Google. More exactly, they are a kind of customized paid service: each time you'd like to crawl a website or a data set, you pay the service provider and then receive the crawled data you want. One thing to note: be careful with the service provider you choose, and express your data requirements as clearly and precisely as possible. Below I present some Web Crawler Services I have used or learned about, for your reference. Evaluating these services is hard, since they continuously evolve to serve customers better; the best approach is to work out your own requirements, map them against what is on offer, and rank the options yourself.

DataHen is known as a professional Web Crawler Service provider. It offers well-rounded and patient service, covering data crawling and scraping requirements at every level, from individuals to startups and enterprises. With DataHen, you do not need to buy or learn any scraping software. They can even fill out forms on sites that require authentication. The UI is straightforward to understand, as can be seen below: you only need to fill out the required information, and they will deliver the data you need.

 


grepsr is a powerful crawler service platform that covers many kinds of user data crawling needs. To communicate better with users, grepsr provides a clear and all-inclusive requirements-gathering user interface, shown below. grepsr also offers three paid plans, from Starter to Enterprise; users can choose a plan based on their respective crawling needs.


Octoparse should be defined as a web scraping tool, even though it also offers a customized data crawler service. The Octoparse Web Crawler Service is powerful as well. Tasks can be scheduled to run on the Cloud Platform, which includes at least 6 cloud servers working simultaneously. It also supports IP rotation, which prevents getting blocked by certain websites. Plus, the Octoparse API allows users to connect their systems to their scraped data in real time: users can either import Octoparse data into their own database or use the API to access their account's data. Octoparse also provides a Free Edition, which can meet users' basic scraping or crawling needs; anyone can use it to scrape or crawl data after registering an account. The only catch is that you need to learn to configure basic scraping rules to crawl the data you need, but those configuration skills are easy to grasp. The UI is clear and straightforward to understand, as can be seen in the figure below. Their support is professional, too: users with any questions can contact them directly and get feedback and solutions quickly.


Scrapinghub is known as a web crawler tool that also provides a related paid crawling service. It can satisfy basic scraping or crawling needs. It also has a proxy rotator (Crawlera), which means the crawling process can bypass bot countermeasures and crawl large sites faster. Plus, its cloud-based web crawling platform lets you easily deploy crawlers and scale them on demand without needing to worry about servers, monitoring, backups, or cron jobs. It helps developers turn over two billion web pages per month into valuable data.




Author: The Octoparse Team
- See more at: Octoparse Blog

My Experience in Choosing Free Web Crawler Software

As the world drowns in data, crawling or scraping it is becoming more and more popular. Web data crawlers and scrapers, also known as extraction tools, should be no strangers to people with crawling needs. Most of these crawlers, scrapers, or extractors are web-based applications or can be installed on the local desktop, with a user-friendly UI.
I once tried crawling data on my own by programming in Ruby and Python to retrieve the structured data I needed. That can be really time-consuming, tedious, and inefficient. So I began trying data crawler tools, having learned that some scrapers and crawlers require no programming and can help users crawl data much faster and with high quality. Hundreds of web crawlers come up when you search "Data Crawler Software" via Google. Here, I just want to introduce several free web crawlers I have used, for your reference.

Octoparse

Octoparse is a powerful, visual, Windows-based free web data crawler. As the UI below shows, it is really easy for users to grasp this tool thanks to its simple and friendly interface. To use it, you first need to download the application to your local desktop. As the figure below shows, you can click and drag the blocks in the Workflow Designer pane to customize your own task. Octoparse provides two editions of its crawling service, a Free Edition and a Paid Edition; both can satisfy users' basic scraping or crawling needs. You can run your tasks locally and export data in various formats. Better still, if you upgrade from the Free Edition to any Paid Edition, you gain access to the Cloud-based service: upload your task and configuration to the Cloud Platform, where 6 or more servers run your tasks simultaneously at higher speed and larger scale. Plus, you can automate your data extraction without being traced using Octoparse's anonymous proxy feature, which rotates tons of IPs and prevents you from being blocked by certain websites. Octoparse also provides an API to connect your system to your scraped data in real time: you can either import Octoparse data into your own database or use the API to access your account's data. After you finish configuring a task, you can export data in the format you need, such as CSV, Excel, HTML, TXT, or a database (MySQL, SQL Server, or Oracle).


Import.io

Import.io is also a well-known web crawler, covering all different levels of crawling needs. It offers a Magic tool that can convert a site into a table without any training sessions. It suggests that users download its desktop app if more complicated websites need to be crawled. Once you've built your API, it offers a number of simple integration options such as Google Sheets, Plot.ly, and Excel, as well as GET and POST requests. When you consider that all this comes with a free-for-life price tag and an awesome support team, it is a clear first port of call for those on the hunt for structured data. They also offer a paid enterprise-level option for companies looking for larger-scale or more complex data extraction.


Mozenda

Mozenda is also a user-friendly web data crawler. It has a point-and-click UI that users without any coding skills can use. Mozenda also takes the hassle out of automating and publishing extracted data: tell Mozenda what data you want once, and then get it however frequently you need it. It also allows advanced programming via a REST API, through which users can connect directly to their Mozenda account. Finally, it provides a Cloud-based service and IP rotation as well.



ScrapeBox

SEO experts, online marketers, and even spammers should be very familiar with ScrapeBox and its very user-friendly UI. Users can easily harvest data from a website to grab emails, check page rank, verify working proxies, and handle RSS submission. By using thousands of rotating proxies, you can sneak a look at your competitors' site keywords, do research on .gov sites, harvest data, and comment without getting blocked or detected.


Google Web Scraper Plugin

Admittedly, those crawlers are powerful enough to meet complicated crawling or scraping needs. But if you just want to scrape data in a simple way, I suggest the Google Web Scraper Plugin. It is a browser-based web scraper that works like Firefox's Outwit Hub. You can download it as an extension and install it in your browser. Highlight the data fields you'd like to crawl, right-click, and choose "Scrape similar…". Anything similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs. The latest version still has some bugs in its spreadsheets. Even though it is easy to handle, note that it can't scrape images or crawl large amounts of data.




Author: The Octoparse Team
- See more at: Octoparse Blog