Octoparse has always been dedicated to providing users with a better experience and more professional service. Notably, the Octoparse Cloud Service has added more features so that users can crawl or scrape data faster and at a larger scale. We are proud to say that the Octoparse Cloud Service provides high-quality service for people with higher crawling demands, and we’d like to share more about it with you.
What is Octoparse Cloud Service?
Defined as a DaaS model, Octoparse Cloud Service manages the infrastructure and platforms that run the applications. Octoparse cloud servers install and operate the application software in the cloud, and cloud users access the software from cloud clients. This frees users from lengthy maintenance and certain hardware requirements.
How does Cloud Service Work?
Thanks to distributed and parallel cloud computing, Octoparse offers a multi-threaded processing mechanism. Put another way, the Cloud Service differs from Local Extraction in its scalability: users’ tasks can achieve a higher crawling speed by being cloned onto at least 6 virtual machines running simultaneously. After uploading their configured tasks to the Cloud Platform, users can extract data on a 24/7 basis. Once extraction completes, the extracted data is returned to the client.
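To make the idea of cloning a task onto several machines more concrete, here is a minimal sketch. It is not Octoparse’s implementation: the URL list, the `split_into_chunks` and `scrape_chunk` helpers, and the use of local threads in place of virtual machines are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_workers):
    """Distribute items round-robin into n_workers lists."""
    return [items[i::n_workers] for i in range(n_workers)]


def scrape_chunk(urls):
    # Placeholder for real extraction: only reports how many URLs this worker got.
    return len(urls)


# Hypothetical list of 20 pages to extract.
urls = [f"https://example.com/page/{i}" for i in range(1, 21)]

# Six workers stand in for the "at least 6 virtual machines" mentioned above.
chunks = split_into_chunks(urls, 6)

with ThreadPoolExecutor(max_workers=6) as pool:
    handled = list(pool.map(scrape_chunk, chunks))

print(handled)
```

Each worker receives its own slice of the list, so no page is extracted twice and the slices can be processed at the same time.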
Why should you choose Octoparse Cloud-based Service?
Multiple cloud servers provide users with IP rotation, so requests leave from changing IP addresses and are harder for aggressive target websites to trace. This helps prevent our users from getting blocked.
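The article does not describe how the rotation is performed, so the following is only a sketch of one common approach, round-robin rotation. The IP addresses (taken from a documentation-only range), the `assign_ips` helper, and the page paths are hypothetical.

```python
from itertools import cycle

# Hypothetical pool of outgoing IP addresses (documentation-range examples).
ip_pool = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]


def assign_ips(requests, ips):
    """Pair each request with the next IP in round-robin order."""
    rotation = cycle(ips)
    return [(req, next(rotation)) for req in requests]


pages = [f"/page/{i}" for i in range(1, 6)]
plan = assign_ips(pages, ip_pool)

for page, ip in plan:
    print(page, "via", ip)
```

With three addresses, the fourth request reuses the first address, so no single IP sends consecutive requests.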
Extraction Speed Up
Cloud Extraction can be considerably faster than Octoparse Local Extraction. Normally, Cloud Extraction scrapes data 6 to 14 times as fast as Local Extraction, which means at least 6 cloud servers are scraping data for you. In addition, users with higher demands on the Cloud Service can adjust the number of cloud servers.
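As a rough illustration of what a 6x to 14x speed-up means in practice, the sketch below converts a local run time into an estimated cloud run time. The `cloud_time_range` helper and the 84-minute figure are made up for the example, and real speed-ups depend on whether the task can be split.

```python
def cloud_time_range(local_minutes, min_speedup=6, max_speedup=14):
    """Return (best_case, worst_case) cloud run time in minutes."""
    return local_minutes / max_speedup, local_minutes / min_speedup


# Hypothetical task that takes 84 minutes with Local Extraction.
best, worst = cloud_time_range(84)
print(f"Estimated cloud run time: {best:.0f} to {worst:.0f} minutes")
```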
Please note that Task Schedule is only available in Cloud Extraction. Once tasks are configured, they run on the Cloud Platform at the scheduled time. This feature lets users schedule scraping tasks, with a precision of one minute, for target websites whose information updates frequently.
The API provided by Octoparse enables Octoparse to connect with any system based on users’ exporting needs. That means Octoparse can provide users with data in various export formats, such as Excel, CSV, HTML, TXT, and databases (MySQL, SQL Server, and Oracle).
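As an illustration of the CSV export format only, the sketch below serializes a few extracted rows with Python’s standard `csv` module. It does not call the Octoparse API; the `rows` data and the `rows_to_csv` helper are hypothetical.

```python
import csv
import io

# Hypothetical extracted records; real field names depend on the task.
rows = [
    {"title": "Item A", "price": "9.99"},
    {"title": "Item B", "price": "19.50"},
]


def rows_to_csv(records):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```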
When do you need our Cloud Service?
1. Oceans of data need to be scraped within a short period of time.
2. Target websites update their real-time data frequently.
3. Data scraped needs to be exported automatically.
Start your Cloud Extraction now!
Manually activate ‘Cloud Extraction’ in the 4th step ‘Done’ after configuration.
Alternatively, users can activate their tasks within ‘In Queue Tasks’ and click ‘Cloud Extraction’ to start their Cloud Extraction.
Schedule Cloud Extraction Settings
Users can schedule cloud settings after they finish configuring their tasks.
To schedule Cloud Extraction Settings, users should apply a valid Select Date and Availability time period to the task, based on their requirements.
As shown in the example figure below, enter the Configuration name, then click the ‘Save the Configuration’ and ‘Save’ buttons. Users can then apply the saved configuration to any other task that should reuse the schedule settings.
Users can click “OK” to activate Scheduled Cloud Extraction for their tasks. Otherwise, just click the × button and save the configuration.
After activating scheduled tasks, users are directed to the ‘Cloud: Waiting’ tasks menu, where they can check the ‘Next Execution Time’ of the scheduled tasks.
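How Octoparse computes the ‘Next Execution Time’ is not described here. As a sketch of the general idea, assume a task that repeats at a fixed interval from a start time; the `next_execution_time` helper, the dates, and the 30-minute interval are all hypothetical, and the Availability time period is not modeled.

```python
from datetime import datetime, timedelta


def next_execution_time(now, start, interval_minutes):
    """Return the first scheduled run at or after `now`,
    for runs repeating every `interval_minutes` from `start`."""
    if now <= start:
        return start
    interval = timedelta(minutes=interval_minutes)
    elapsed = now - start
    # Ceiling division on timedeltas: intervals needed to reach or pass `now`.
    runs_needed = -(-elapsed // interval)
    return start + runs_needed * interval


start = datetime(2024, 1, 1, 9, 0)
now = datetime(2024, 1, 1, 10, 7, 30)
upcoming = next_execution_time(now, start, 30)
print(upcoming.strftime("%Y-%m-%d %H:%M"))
```

The result always falls on a whole minute as long as the start time and the interval do, which matches the minute-level precision of the schedule.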
The reminder about Scheduled Cloud Extraction will be displayed in the 4th step as below.
Users can stop their tasks within ‘Task Status’. A confirmation pop-up box appears when users try to stop a task.
Cloud Extraction Speed Up: Splitting Tasks
Tasks running on the Cloud Platform need to be split up if users want to speed up the extraction process; otherwise there will be no difference between Cloud Extraction and Local Extraction. Note, however, that whether a task can be split up depends on its loop mode.
In Octoparse, only the Fixed list, List of URLs, and Text list loop modes can split up tasks in Cloud Extraction. The Single element and Variable list modes cannot. Take the Fixed list example below.
The task above generates a Variable list by default, which occupies 1 cloud server on the Cloud Platform and disables task splitting. To improve this, users can switch the Variable list to a Fixed list so that the task can be split up on the Cloud Platform.
Here, users can modify the XPath of the Variable list and switch to a Fixed list. In the example below, users edit the XPath //DIV[@id='mainResults']/UL/LI and append a sequence number to it, like //DIV[@id='mainResults']/UL/LI[i] (i = 1, 2, 3, ..., n). After editing the XPath, users add the numbered XPaths to the Fixed list one by one, then click ‘OK’ and ‘Save’. As the example figure below shows, the first item appears in the loop list after we copy //DIV[@id='mainResults']/UL/LI into the Fixed list and click ‘OK’ and ‘Save’.
In the same way, we can add the XPaths with sequential numbers one by one, then click ‘OK’ and ‘Save’.
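Typing the numbered XPaths by hand is error-prone, so they can be generated first and then pasted into the Fixed list. The sketch below shows one way to do this; the `numbered_xpaths` helper and the count of 3 are illustrative, and the real count n depends on how many items the page lists.

```python
def numbered_xpaths(base_xpath, count):
    """Append [1], [2], ..., [count] to a base XPath (XPath positions start at 1)."""
    return [f"{base_xpath}[{i}]" for i in range(1, count + 1)]


base = "//DIV[@id='mainResults']/UL/LI"
xpaths = numbered_xpaths(base, 3)

for xpath in xpaths:
    print(xpath)
```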
After adding the modified XPaths, the Loop Items are displayed in the list as below.
Once users have changed its XPath in this way, the scheduled or running task will be split up and cloned onto multiple cloud servers to speed up its Cloud Extraction; otherwise there is little point in running the task on Cloud Servers. Users can also choose to skip Cloud Extraction by clicking the option shown below.
Users can adjust the maximum number of tasks running in parallel. Specifically, Octoparse Professional Edition sets a limit of 14 threads working simultaneously. Threads are assigned to tasks randomly. This means that if users set a limit of 10 threads in parallel, at most 10 tasks can be active on the Cloud Servers at once. However, a task that has been split up may well occupy more than 1 thread. For example, 8 of the 10 tasks may occupy all of the threads, leaving the other 2 tasks idle and waiting for free Cloud Servers. In addition, users can set priorities so that the task with the highest priority is executed first. Note that a split task which occupied cloud servers before priorities were set will keep waiting until the tasks assigned a higher priority have completed their Cloud Extraction.
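The 8-of-10 example above can be reproduced with a small simulation. This is only a sketch: the `allocate_threads` helper and the number of threads each task needs are assumptions, and tasks are handled in priority and submission order to keep the output reproducible, whereas Octoparse assigns threads randomly.

```python
def allocate_threads(tasks, thread_limit):
    """tasks: list of (name, priority, threads_needed); higher priority goes first.
    Returns (running, waiting) lists of task names."""
    free = thread_limit
    running, waiting = [], []
    # sorted() is stable, so tasks with equal priority keep their submission order.
    for name, priority, needed in sorted(tasks, key=lambda t: -t[1]):
        if needed <= free:
            running.append(name)
            free -= needed
        else:
            waiting.append(name)
    return running, waiting


# Hypothetical workload: two split tasks need 2 threads each, eight need 1 each.
tasks = [("task1", 0, 2), ("task2", 0, 2)]
tasks += [(f"task{i}", 0, 1) for i in range(3, 11)]

running, waiting = allocate_threads(tasks, thread_limit=10)
print(len(running), "running;", "waiting:", waiting)
```

Giving one of the waiting tasks a higher priority value moves it to the front of the queue the next time threads are handed out.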
Author: The Octoparse Team