Thursday, March 2, 2017

Octoparse Cloud Service

Octoparse has always dedicated itself to providing users with a better experience and more professional service. Notably, the Octoparse Cloud-based Service has added more features so that users can crawl or scrape data at higher speed and larger scale. We are proud to say that the Octoparse Cloud Service delivers high-quality service for people with more demanding crawling needs, and we'd like to share more with you about it.

What is Octoparse Cloud Service?
Defined as a DaaS (Data as a Service) model, Octoparse Cloud Service manages the infrastructure and platform that run the application. Octoparse cloud servers install and operate the application software in the cloud, and our cloud users access the software from cloud clients. This frees users from long-term maintenance and specific hardware requirements.

How does the Cloud Service work?
Thanks to distributed and parallel computing in the cloud, Octoparse offers a multi-threaded processing mechanism. Put another way, our Cloud Service differs from Local Extraction in its scalability: a user's task can achieve a much higher crawling speed because it is cloned onto at least 6 virtual machines running simultaneously. After uploading their configured tasks to the Cloud Platform, users can extract data on a 24/7 basis, and the extracted data is returned to the client when the extraction completes.

Why should you choose the Octoparse Cloud-based Service?

IP Rotation
Multiple cloud servers provide IP rotation, so requests automatically come from changing IP addresses that are hard for aggressive target websites to trace, which prevents our users from getting blocked.

Extraction Speed-Up
Cloud Extraction is considerably faster than Octoparse Local Extraction. Normally, Octoparse can scrape data 6 to 14 times faster than Local Extraction, which means at least 6 cloud servers are scraping data for you. Moreover, users can add cloud servers as their demand on the Cloud Service grows.

Scheduling Tasks
Note that Task Scheduling is only available with Cloud Extraction. After configuration, tasks run on the Cloud Platform at the scheduled time. This feature lets users schedule scraping tasks, down to the minute, against target websites whose information updates frequently.

API
The API provided by Octoparse lets Octoparse connect with any system according to users' export needs. Octoparse can deliver data in various export formats, such as Excel, CSV, HTML, TXT, and databases (MySQL, SQL Server, and Oracle).
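As an illustration of the database export path, here is a minimal Python sketch (not part of Octoparse itself; the file name, table, and column names are hypothetical) that loads a CSV file exported from Octoparse into a MySQL table:

import csv
import mysql.connector  # pip install mysql-connector-python

# Hypothetical CSV exported from Octoparse with header columns: title, price, url
conn = mysql.connector.connect(
    host="localhost", user="scraper", password="secret", database="crawl_db"
)
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS products ("
    "title VARCHAR(255), price VARCHAR(32), url VARCHAR(512))"
)

with open("octoparse_export.csv", newline="", encoding="utf-8") as f:
    rows = [(r["title"], r["price"], r["url"]) for r in csv.DictReader(f)]

cur.executemany("INSERT INTO products (title, price, url) VALUES (%s, %s, %s)", rows)
conn.commit()
conn.close()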


When do you need our Cloud Service?

1. Large volumes of data need to be scraped within a short period of time.
2. Target websites update their real-time data frequently.
3. Scraped data needs to be exported automatically.


Start your Cloud Extraction now!   

Manually activate 'Cloud Extraction' in the 4th step, 'Done', after configuration.


Alternatively, users can activate their tasks under 'In Queue Tasks' and click 'Cloud Extraction' to start Cloud Extraction.


Schedule Cloud Extraction Settings
Users can configure schedule settings after they finish configuring their tasks.


To schedule Cloud Extraction settings, users should apply a valid Select Date and an Availability time period to the task, based on their requirements.


As in the example figure below, enter the Configuration name, then click the 'Save the Configuration' and 'Save' buttons. Users can then apply the saved configuration to any other task that should reuse the schedule settings.
 

Users can click 'OK' to activate Scheduled Cloud Extraction for their tasks; otherwise, just click the × button and save the configuration.
 

After activating scheduled tasks, users are directed to the 'Cloud: Waiting' task menu, where they can check the 'Next Execution Time' of the scheduled tasks.


The reminder about Scheduled Cloud Extraction is displayed in the 4th step, as shown below.


Users can stop their tasks from the 'Task Status' panel, and a confirmation pop-up will appear when they do.



Cloud Extraction Speed-Up: Splitting Tasks

Tasks running on the Cloud Platform need to be split up if users want to speed up the extraction process; otherwise there is no difference between Cloud Extraction and Local Extraction. However, it is noteworthy that whether a task can be split depends on its loop mode.
In Octoparse, only tasks using the Fixed list, List of URLs, and Text list loop modes can be split for Cloud Extraction, while Single element and Variable list loops cannot. Take the Fixed list example below.


The task above generates a Variable list by default, which occupies 1 cloud server on the Cloud Platform and prevents task splitting. To improve this, users can switch the Variable list to a Fixed list so that the task can be split on the Cloud Platform.
Here, users can modify the XPath of the Variable list and switch to a Fixed list. In the example below, users edit the XPath //DIV[@id='mainResults']/UL[1]/LI and append an index to it, like //DIV[@id='mainResults']/UL[1]/LI[i] (i = 1, 2, 3, ..., n). After editing the XPath, users add the numbered XPaths into the Fixed list one by one, then click 'OK' and 'Save'. As the example figure below shows, we can see the first item in the loop list after we copy //DIV[@id='mainResults']/UL[1]/LI[1] into the Fixed list and click 'OK' and 'Save'.
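If the list is long, typing the numbered XPaths by hand is tedious. A short Python sketch like the one below (purely illustrative, not part of Octoparse) can generate them so you can paste each line into the Fixed list; the item count is an assumption you would adjust to the real page:

# Generate numbered XPaths for the Fixed list from the base Variable-list XPath.
base_xpath = "//DIV[@id='mainResults']/UL[1]/LI"
item_count = 20  # assumption: adjust to the actual number of items on the page

for i in range(1, item_count + 1):
    print(f"{base_xpath}[{i}]")  # paste each printed XPath into the Fixed list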


In the same way, we can add the XPaths with sequential index numbers one by one, then click 'OK' and 'Save'.


After adding the modified XPaths, we get the Loop Items displayed in the list as below.


After the XPath changes, the scheduled or running task will be split up and cloned onto multiple cloud servers, speeding up its Cloud Extraction; otherwise it would be pointless to run the task on cloud servers. Users can also choose to skip Cloud Extraction by clicking the option shown below.


Users can adjust the maximum number of tasks running in parallel. Specifically, the Octoparse Professional Edition sets a threshold of 14 threads working simultaneously, and the threads are assigned to tasks randomly. If users set a threshold of 10 parallel threads, then at most 10 tasks can be activated and run on the cloud servers. However, some tasks may occupy more than one thread if they have been split up. For example, 8 of the 10 tasks might occupy all of the threads, leaving 2 idle tasks waiting for free cloud servers. In addition, users can set priorities for tasks so that the task with the highest priority is executed first. In particular, a split task that occupied cloud servers before priorities were set will keep waiting until the tasks assigned a higher priority have completed their Cloud Extraction.
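The allocation described above can be pictured with a small, purely conceptual Python sketch (this is not Octoparse's actual scheduler; the task names and thread counts are made up): tasks are admitted in order until the thread threshold is reached, and split tasks consume several threads each.

THREAD_LIMIT = 10  # user-configured maximum number of parallel threads

# (task name, threads it needs); a split task needs more than one thread
tasks = [("task_a", 2), ("task_b", 2), ("task_c", 1), ("task_d", 1),
         ("task_e", 1), ("task_f", 1), ("task_g", 1), ("task_h", 1),
         ("task_i", 1), ("task_j", 1)]

running, waiting, used = [], [], 0
for name, needed in tasks:
    if used + needed <= THREAD_LIMIT:
        running.append(name)
        used += needed
    else:
        waiting.append(name)  # stays idle until threads are freed

print("running:", running)  # here 8 tasks consume all 10 threads
print("waiting:", waiting)  # the remaining 2 tasks wait for free cloud servers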
 



Author: The Octoparse Team
- See more at: Octoparse Tutorial

Web Crawler Service

Web data crawling or scraping has become increasingly popular in the last few years. The scraped data can be used for various analyses, and even for predictions. By analyzing the data, people can gain insight into an industry and take on their competitors. This shows how useful and necessary it is to get high-quality data faster and at a larger scale. This higher demand for data has also driven the fast growth of web crawler services.
Web crawler services are easy to find if you search on Google. More precisely, they are a kind of customized paid service: every time you want to crawl a website or a particular data set, you pay the service provider and receive the crawled data you want. Be careful with the provider you choose, and express your data requirements as clearly and precisely as possible. Below I describe some web crawler services I have used or learned about, for your reference. Evaluating such services is hard, since they continuously evolve to serve customers better; the best approach is to list your requirements, map them against what is on offer, and rank the options yourself.

DataHen is known as a professional web crawler service provider. It offers well-rounded and patient service, covering data crawling or scraping requirements at all levels, from individuals to startups and enterprises. With DataHen, you do not need to buy or learn any scraping software. They can also fill out forms on sites that require authentication. The UI is straightforward, as can be seen below: you only need to fill out the required information, and they will deliver the data you need.

 


grepsr is a powerful crawler service platform that covers many kinds of user crawling needs. To communicate better with users, grepsr provides a clear and all-inclusive requirements-gathering interface, shown below. grepsr also offers three paid plans, from Starter to Enterprise, and users can choose a plan based on their crawling needs.


Octoparse is best described as a web scraping tool, even though it also offers a customized data crawler service. The Octoparse Web Crawler Service is powerful as well: tasks can be scheduled to run on the Cloud Platform, with at least 6 cloud servers working simultaneously. It also supports IP rotation, which prevents blocking by certain websites. In addition, the Octoparse API allows users to connect their systems to the scraped data in real time; users can either import Octoparse data into their own database or use the API to access their account's data. Octoparse also provides a Free Edition, which can meet basic scraping or crawling needs; anyone can use it after registering an account. The only requirement is learning to configure basic scraping rules for the data you need, and the configuration skills are easy to grasp. The UI is clear and straightforward, as shown in the figure below. Their support is also professional: users with any questions can contact them directly and get feedback and solutions quickly.


Scrapinghub is known as a web crawling platform that also provides paid crawling services. It can satisfy basic scraping or crawling needs. It also has a proxy rotator (Crawlera), which means the crawling process can bypass bot countermeasures so that large sites can be crawled faster. Its cloud-based web crawling platform lets you easily deploy crawlers and scale them on demand without worrying about servers, monitoring, backups, or cron jobs. It helps developers turn over two billion web pages per month into valuable data.
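For context, the crawlers deployed on a platform like Scrapinghub are typically Scrapy spiders. A minimal sketch (mine, not taken from Scrapinghub's documentation) against the scraping sandbox site quotes.toscrape.com looks like this; you can run it locally with scrapy runspider quotes_spider.py -o quotes.json before deploying:

import scrapy  # pip install scrapy


class QuotesSpider(scrapy.Spider):
    """Crawl quotes and authors, following pagination links."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)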




Author: The Octoparse Team
- See more at: Octoparse Blog

My Experience in Choosing a Free Web Crawler Software

As the world drowns in data, crawling or scraping data has become more and more popular. Web data crawlers or scrapers, also known as extraction tools, should no longer be strangers to people with crawling needs. Most of these crawlers, scrapers, or extractors are web-based applications or can be installed on the local desktop with a user-friendly UI.
I once tried crawling data on my own by programming in Ruby and Python to retrieve the structured data I needed. Sometimes that is really time-consuming, tedious, and inefficient. So I began trying data crawler tools, as I learned that there are scrapers and crawlers that require no programming and can help users crawl data much faster and with higher quality. Hundreds of web crawlers show up when you search "Data Crawler Software" on Google. Here I just want to introduce several free web crawler software packages I have used, for your reference.
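To give a sense of what that hand-rolled approach looks like, here is a minimal Python sketch using requests and BeautifulSoup; the target URL and CSS selectors are hypothetical. The tools below hide exactly this kind of code behind a point-and-click UI.

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

# Hypothetical target: one listing page with product names and prices.
url = "https://example.com/products?page=1"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("div.product"):  # hypothetical CSS selectors
    name = item.select_one("h2.title").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    rows.append((name, price))

print(rows)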

Octoparse

Octoparse is a powerful, visual, Windows-based free web data crawler. The UI, shown below, makes the tool really easy to grasp thanks to its simple and friendly interface. To use it, you first need to download the application to your local desktop. As the figure shows, you can click and drag blocks in the Workflow Designer pane to customize your own task. Octoparse provides two editions of its crawling service, the Free Edition and the Paid Edition, and both can satisfy users' basic scraping or crawling needs. You can run tasks locally and export data in various formats. Moreover, if you upgrade from the Free Edition to any Paid Edition, you gain the Cloud-based service: you upload your task and its configuration to the Cloud Platform, where 6 or more servers run your tasks simultaneously at higher speed and larger scale. You can also keep your extraction untraced by using Octoparse's anonymous proxy feature, which rotates through a large pool of IPs and prevents you from being blocked by certain websites. Octoparse also provides an API to connect your system to your scraped data in real time: you can either import the Octoparse data into your own database or use the API to access your account's data. After finishing the task configuration, you can export data in the formats you need, like CSV, Excel, HTML, TXT, and databases (MySQL, SQL Server, and Oracle).


Import.io

Import.io is also known as a web crawler software covering all different levels of crawling needs. It offers a Magic tool that can convert a site into a table without any training sessions, and it suggests downloading its desktop app if more complicated websites need to be crawled. Once you've built your API, they offer a number of simple integration options such as Google Sheets, Plot.ly, and Excel, as well as GET and POST requests. Considering that all this comes with a free-for-life price tag and an awesome support team, it is a clear first port of call for those on the hunt for structured data. They also offer a paid enterprise-level option for companies looking for larger-scale or more complex data extraction.


Mozenda

Mozenda is also a user-friendly web data crawler. It has a point-and-click UI that users without any coding skills can use. Mozenda also takes the hassle out of automating and publishing extracted data: tell Mozenda what data you want once, and then get it however frequently you need it. It also allows advanced programming through a REST API, which lets users connect directly to their Mozenda account. Finally, it provides a Cloud-based service and IP rotation as well.



ScrapeBox

SEO experts, online marketers, and even spammers should be very familiar with ScrapeBox and its user-friendly UI. Users can easily harvest data from a website to grab emails, check page rank, verify working proxies, and submit RSS feeds. By using thousands of rotating proxies, you can sneak a look at a competitor's site keywords, do research on .gov sites, harvest data, and comment without getting blocked or detected.


Google Web Scraper Plugin

Admittedly, the crawlers above are powerful enough for people with complicated crawling or scraping needs. But if you just want to scrape data in a simple way, I suggest the Google Web Scraper Plugin. It is a browser-based web scraper that works like Firefox's Outwit Hub. You download it as an extension and install it in your browser. Then you highlight the data fields you'd like to crawl, right-click, and choose "Scrape similar...". Anything similar to what you highlighted is rendered in a table ready for export, compatible with Google Docs. The latest version still has some bugs with spreadsheets. Even though it is easy to use, note that it cannot scrape images or crawl large amounts of data.




Author: The Octoparse Team
- See more at: Octoparse Blog

Tuesday, February 28, 2017

Free Online Web Crawler Tool

                                                                           

With the increasing demand for data, more and more people have begun crawling the web to get access to vast amounts of data. Web crawling therefore plays an increasingly important role in helping people fetch the data they need. To date, there are three common ways to crawl web data: using the public APIs provided by the target websites, programming and building a crawler on your own, or using an automated web crawler tool. Based on my experience, I will mainly discuss several free online web crawler tools in the following section, as a reference for web crawler beginners.
Before introducing the online web crawler tools, let's first clarify what a web crawler is for. A web crawler tool is designed to scrape or crawl data from websites; we can also call these web harvesting or extraction tools. They automate the crawling process at a faster speed and harvest data at a large scale. People who use them do not need coding skills; they just need to learn the configuration rules of each tool. Online web crawlers are especially useful when users want to gather information and put it into a usable form: the URL list can be stored in a spreadsheet and expanded into a dataset over time on the cloud platform, which means the scraped data can be merged into an existing database through the online web service. Below I list several free online web crawlers for your reference. These are just suggestions; anyone choosing a web crawler tool should first study its specific functionality and select one based on their own requirements.
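As a quick illustration of the first method mentioned above, using a public API, here is a minimal Python sketch that calls GitHub's public REST API (the repository is just an example); the site returns structured JSON, so no HTML parsing is needed:

import requests

# Method 1: use a site's documented public API instead of scraping its HTML.
resp = requests.get("https://api.github.com/repos/octocat/Hello-World", timeout=10)
resp.raise_for_status()
repo = resp.json()

print(repo["full_name"], "-", repo["stargazers_count"], "stars")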

Import.io

Import.io now provides an online web scraper service. Data storage and the related techniques are all cloud-based. To activate it, the user needs to add a browser extension. The Import.io user interface is easy to handle: users click and select the data fields to crawl the data they need. For more detailed instructions, users can visit the official website for tutorials and assistance. Import.io can also build a custom dataset for pages that are not in the existing IO library by accessing its cloud-based library of APIs.
Its cloud service provides data storage and related data-processing options on the cloud platform, and the results can be added to existing databases, libraries, and so on.


Scraper Wiki

Scraper Wiki caps free online accounts at a fixed maximum number of datasets. The good news for all users is that the free service is just as polished as the paid service, and they have committed to giving journalists premium accounts at no cost. Their free online web scraper now supports extracting tables from PDFs, although this does not work perfectly, since cutting and pasting from the results is difficult in practice. Scraper Wiki has also added more advanced options, such as editions of the application in different programming languages, including Python, Ruby, and PHP, for better flexibility across operating system platforms.


Dexi.io

The CloudScrape cloud scraping service in Dexi.io is meant for regular web users. It is committed to providing high-quality cloud scraping. It provides IP proxies and built-in CAPTCHA solving, which helps users scrape most websites. Users can learn CloudScrape easily by pointing and clicking, even as beginners or amateurs. Cloud hosting allows all scraped data to be stored in the cloud, and an API lets you monitor and remotely manage web robots. Its CAPTCHA-solving option sets CloudScrape apart from services like Import.io or Kimono. The service provides a wide variety of data integrations, so extracted data can be automatically uploaded over (S)FTP or into your Google Drive, Dropbox, Box, or AWS. The data integration happens seamlessly.
Apart from these free online web crawler tools, there are other reliable web crawler tools providing online service, though they may charge for it.

 

Octoparse

Octoparse is known as a Windows desktop web crawler application, and it provides a reliable online crawling service as well. For its cloud-based service, Octoparse offers at least 6 cloud servers that run users' tasks concurrently. It also supports cloud data storage and more advanced cloud options. Its UI is very user-friendly, and the website has many tutorials that teach users how to configure tasks and build crawlers on their own.

 




Author: The Octoparse Team

- See more at: Octoparse Blog

Price Scraping

                           

Scraping data from websites is nothing new. In the commercial field, large amounts of scraped data can be used for business analysis. As is well known, we can scrape details such as price, stock, and rating, covering various data fields, to monitor how items change over time. This scraped data can further help analysts and sellers evaluate potential value and make better decisions.
However, there are some websites we cannot simply scrape from. More precisely, even when these sites provide APIs, some data fields remain that we cannot scrape or are not authorized to access. For example, Amazon does provide a Product Advertising API, but the API itself does not give access to all of the information displayed on a product page, such as the price. In that case, the only way to scrape more data, say the price field, is to build our own scraper by programming or to use an automated scraper tool.
Sometimes, even when we know how to scrape data on our own by programming, for instance with Ruby or Python, we still cannot get the data in the end for various reasons. In most cases, we get blocked by certain websites because our repeated scraping requests within a short period of time look suspicious to the target site. If so, we may need to use IP proxies, which rotate IP addresses automatically so the requests cannot be traced back by the target site.
The solutions described above require familiarity with coding and more advanced technical knowledge; otherwise, the job can be tough or even impossible to complete. So, to make scraping accessible to most people, I'd like to list several scraper tools that can help you scrape commercial data, including prices, stock, and reviews, in a structured way, efficiently and quickly.
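For readers who do script their own scrapers, a minimal sketch of that proxy-rotation idea with Python's requests library might look like the following; the proxy addresses and the target URL are placeholders, and in practice you would plug in a real proxy pool:

import itertools
import time
import requests

# Placeholder proxy addresses; substitute a real proxy pool in practice.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    """Fetch a URL, switching to the next proxy on every request."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

for page in range(1, 4):
    resp = fetch(f"https://example.com/products?page={page}")  # hypothetical target
    print(page, resp.status_code)
    time.sleep(2)  # a polite delay also reduces the chance of being blocked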

Octoparse

I have used this scraper tool to scrape many websites, like Facebook, eBay, and Priceline, for data including prices, reviews, and comments. This tool truly suits scraping a wide variety of data on most websites. Users don't need to know how to program to scrape data with it, but they do need to learn to configure their tasks. The configuration is easy to grasp, and the UI is very user-friendly, as the figure below shows. There is a Workflow Designer pane where you point and drag functional visual blocks; the tool simulates human browsing behavior and scrapes the structured data you need. With this scraper, you can use proxy IPs simply by setting certain Advanced Options, which is very efficient and fast. Then you can scrape the data you need, including prices and reviews, after completing the configuration.


The extraction of hundreds of records or more can be completed within seconds. You can scrape any data type you want, and the results are returned as data frames like the figure below, which shows scraped prices and customer ratings. Note that there are two editions of the Octoparse scraping service, the Free Edition and the Paid Edition. Both cover basic scraping needs, meaning users can scrape data and export it in various formats, like CSV, Excel, HTML, TXT, and databases (MySQL, SQL Server, and Oracle). If you want to scrape data at a much faster speed, you can upgrade your free account to any paid account, which includes the Cloud Service; with Octoparse Cloud Service, at least 4 cloud servers will then work on your task simultaneously.

 

Additionally, Octoparse also offers a scraping or crawling service, which means you can describe your scraping needs and requirements and pay them to scrape the data you need.

Import.io

Import.io is also known as a web crawler covering all different levels of crawling needs. It offers a Magic tool that can convert a site into a table without any training sessions, and it suggests downloading its desktop app if more complicated websites need to be crawled. Once you've built your API, they offer a number of simple integration options such as Google Sheets, Plot.ly, and Excel, as well as GET and POST requests. It also provides proxy servers so that users can avoid being detected by target websites and scrape as much data as they need. The tool is not hard to use at all; the UI of Import.io is quite friendly, and you can refer to their official tutorials to learn how to configure your own scraping tasks. Considering that all this comes with a free-for-life price tag and an awesome support team, import.io is a clear first port of call for those on the hunt for structured data. They also offer a paid enterprise-level option for companies looking for larger-scale or more complex data extraction.


ScrapeBox

SEO experts, online marketers, and even spammers should be very familiar with ScrapeBox and its user-friendly UI. Users can easily harvest data from a website to grab emails, check page rank, verify working proxies, and submit RSS feeds. By using thousands of rotating proxies, you can sneak a look at a competitor's site keywords, do research on .gov sites, harvest data, and comment without getting blocked or detected.



Author: The Octoparse Team
- See more at: Octoparse Blog

Website Crawler & Sentiment Analysis

        

To start with Sentiment Analysis, the first question is where and how we can crawl enough data for our analysis. Normally, crawling social media on the web is a reasonable way to access public-opinion data. So in this post I want to share how I crawled a website with a web crawler and processed the data for Sentiment Analysis, to develop an application that ranks universities based on users' opinions crawled from a social media site, Twitter.
To crawl Twitter data, there are several methods we can adopt: build a web crawler on our own by programming, or choose an automated web crawler such as Octoparse, Import.io, etc. We could also use the public APIs provided by certain websites to access their data sets.
First, as is well known, Twitter provides public APIs for developers to read and write tweets conveniently, and the REST API identifies Twitter applications and users using OAuth. Knowing this, we can use the Twitter REST APIs to get the most recent and popular tweets; I imported Twitter4j to crawl Twitter data through the REST API. Tweets can be crawled for a specific time range, location, or other data fields, and the crawled data is returned in JSON format. Note that app developers need to register a Twitter application account to get authorized access to the Twitter API. Using its credentials, the application makes a request to the POST oauth2/token endpoint to exchange them for an access token, which gives authenticated access to the REST API.
This mechanism allows us to pull user information from the data source. We can then use the search endpoint to crawl structured tweets related to university topics.
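The post used Twitter4j (Java); an equivalent sketch in Python against the v1.1 endpoints current at the time of writing would look roughly like this, with the app credentials as placeholders and the query given only as an example:

import base64
import requests

# Placeholders: use your own application credentials from apps.twitter.com.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"

# Step 1: exchange app credentials for a bearer token (app-only OAuth2).
creds = base64.b64encode(f"{CONSUMER_KEY}:{CONSUMER_SECRET}".encode()).decode()
token_resp = requests.post(
    "https://api.twitter.com/oauth2/token",
    headers={"Authorization": f"Basic {creds}"},
    data={"grant_type": "client_credentials"},
)
bearer = token_resp.json()["access_token"]

# Step 2: search for recent tweets mentioning a university.
search_resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    headers={"Authorization": f"Bearer {bearer}"},
    params={"q": "Stanford University", "count": 100, "result_type": "recent"},
)
for tweet in search_resp.json().get("statuses", []):
    print(tweet["user"]["screen_name"], "|", tweet["text"][:80])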
Then I generated the query set to crawl tweets, shown in the figure below. I collected the 2016 USNews University Ranking data, which includes 244 universities and their rankings, and converted the data fields I needed for crawling tweets into JSON format.

                                                      

                                                      
                                             

In total, I extracted 462,413 tweets, and for most universities the number of crawled tweets was less than 2,000.
                        
So far, some readers may feel that the whole process of crawling Twitter is already troublesome. This method requires good programming skills and knowledge of regular expressions to crawl websites for structured, formatted data, which may be tough for someone without coding skills. For reference, there are automated web crawler tools that can help you crawl websites without any coding, like Octoparse, Import.io, and Mozenda. Based on my experience, I can share one crawler tool I have used: Octoparse. It is quite user-friendly and easy to use; you can download the desktop crawler from http://www.octoparse.com/. This locally installed web crawler tool can automatically crawl data, such as tweets, from target sites. Its UI is user-friendly, as shown below, and you just need to learn how to configure the crawling settings of your tasks by reading or watching their tutorials. The crawled data can be exported in various structured formats, like Excel, CSV, HTML, TXT, and databases (MySQL, SQL Server, and Oracle). It also provides IP rotation, which changes IPs automatically so that the target sites cannot trace the requests. This mechanism matters because it prevents us from being blocked by aggressive sites that do not allow crawling.

                

Back to the university ranking in my application. The ranking step parses the tweets crawled from Twitter and ranks them by their relevance to a specific university. I want to filter the most relevant tweets (top k) for Sentiment Analysis, which avoids trivial tweets that would make the results inaccurate. There are many possible ranking methods, such as ranking by TF-IDF similarity, text summarization, spatial and temporal factors, or machine-learned ranking; Twitter itself provides ranking based on time or popularity. However, we need a more advanced method that filters out most spam and trivial tweets. To measure the trust and popularity of a tweet, I use the following features: retweet count, follower count, and friend count. The assumption is that a trustworthy tweet is posted by a trustworthy user, that a trustworthy user has enough friends and followers, and that a popular tweet has a high retweet count. Based on this assumption, I built a model combining trust and popularity (a TP score) for each tweet and ranked the tweets by TP score. Note that news reports usually have high retweet counts, and that kind of score is not useful for our Sentiment Analysis, so I assigned a relatively low weight to that component when computing the TP score and designed a formula accordingly. Because Twitter's search already considers the presence of query words and the recency of tweets, the crawled tweets have been filtered by query words and posting time; all we need to consider are the retweet counts, follower counts, and friend counts.
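The exact formula and weights are not reproduced here; as a rough, purely illustrative sketch of the idea (the weights below are made up, not the ones I derived), a TP score that favors trusted users and damps runaway retweet counts could look like this:

import math

def tp_score(retweets, followers, friends,
             w_retweet=0.2, w_follower=0.4, w_friend=0.4):
    """Illustrative trust-and-popularity score (weights are arbitrary).

    Log-scaling damps very large counts, and the retweet term gets a
    smaller weight so widely retweeted news reports do not dominate.
    """
    popularity = w_retweet * math.log1p(retweets)
    trust = w_follower * math.log1p(followers) + w_friend * math.log1p(friends)
    return popularity + trust

# Rank a few hypothetical tweets by TP score, highest first.
tweets = [
    {"id": 1, "retweets": 500, "followers": 120, "friends": 80},    # viral news item
    {"id": 2, "retweets": 12, "followers": 3000, "friends": 900},   # trusted user
    {"id": 3, "retweets": 0, "followers": 15, "friends": 10},       # low-signal tweet
]
tweets.sort(key=lambda t: tp_score(t["retweets"], t["followers"], t["friends"]),
            reverse=True)
print([t["id"] for t in tweets])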
                                        
I make my own university ranking according to public reputation, which is represented by the sentiment score. However, public reputation is only one of the factors that should be considered when evaluating a university, so I also present an overall ranking that combines the commercial ranking with our ranking. There are three main types of tweet texts, shown below.

                                              

Sentiment Score: the sentiment score was calculated to measure public reputation, and the positive rate of each university was used as the sentiment score for the public reputation ranking. With the polarity of a positive tweet encoded as 4 and that of a negative tweet as 0 (so negative tweets contribute nothing to the sum), the positive rate is, roughly,

    Positive Rate = (sum of p_i over all tweets) / (4 * n)

where n is the total number of tweets for each university and p_i is the polarity of tweet i (4 for positive, 0 otherwise).
After completing the Sentiment Analysis, I proceeded to build a classifier for it using a machine learning algorithm, a Maximum Entropy classifier, which I will discuss further in the next post.


Author: The Octoparse Team
- See more at: Octoparse Blog
