WEBサービス

2018年2月7日星期三

27 Neutral Network Explained in Graphics

Neural network is an essential aspect of Machine Learning. It can be easily undertstood as a system of computer hardwares/softwares that works in a way inspired by mimic the human brain. Through massive trainings, such system learns from examples and generally without task specific programming just like a real human does. In this article, I’m going to briefly explain the 27 neutral networks one by one in an easy-to-understand language.

(Source: Fjodor van Veen from Asimov Institute)

1.Perceptron (P)

The simplest and oldest neural network we've known for a long time. It's a linear classifier that takes in input, combines the input then maps to the corresponding outputs via particular functions. Nothing really fancy about it.

2.Feed Forward

Feed Forward is another oldest neural network we know. It was originated from the 1950s. A feed forward system generally includes the followings :

All nodes are linked
No feedback loop to control the output
There’s a layer between input and output layer (the hidden layer)

In most cases, this type of network uses back-propagation methods for training.

3.Radial Basis Network

Radial Basis Network is actually a Feedforward with activation function, a radial basis function instead of a logic function.

So what is the difference between these two kinds of networks?
A logic function maps an arbitrary value in the range [0, ... 1] to answer the "yes or no" question. While this is applicable to a classification system it does not support continuous variables.

On the contrary, a radial basis function shows "how far we are from the target." This is perfect for function approximation and machine control (For example, it can be a sub for PID controller).
In short, an RBF is an FF with different activation functions and applications.

4.Deep Feed Forward

DFF has opened the Pandora's Box for Deep Learning in the early 90s. These are still Feed Forward Neural Networks, but with more hidden layers.

When training the traditional FF model, we only pass a small amount of error information to the next layer. With more layers, DFF is able to learn more about errors; however, it becomes impractical as the amount of training time required increases with more layers. Until the early 00s, we have developed a series of effective methods for training deep feedforward neural networks which have formed the core of modern machine learning systems today and enable the functionality of feedforward neural networks.

5.Recurrent Neural Network

RNNs introduce a different type of neurons: recurrent neurons.
The first network of this type is called the Jordan Network, where every hidden neuron receives its own output after a fixed delay (one or more iterations). Other than this, it is very similar to ordinary fuzzy neural networks.
Of course, there are other changes - such as passing state to the input node, variable delay, etc., but the main idea remains the same. This type of neural network is mainly used when the context is important - the past iterative results and the sample-generated decisions can impact the current ones. An example of the most common context is text analysis - a word can only be analyzed in the context of the preceding words or sentence.

6.Long/Short Term Memory(LSTM)

LSTM introduced a memory unit, a special unit that makes it possible to handle data along with variablee time intervals. Recurrent neural networks can process text by "remembering" the first ten words, and LSTM long and short memory networks can handle video frames by "remembering" what happened from the many frames before. LSTM networks are also widely used for text and voice recognition.

The storage unit is actually made up of components called gates, which are recursive and control how the information is "remembered" and "forgotten". The figure below illustrates the structure of LSTM:

The above (x) is the gate, they have their own weight, and sometimes have activation function. On each sample, they decide whether to pass the data, erase the memory, and so on.

You can find more comprehensive explaination of LSTM here:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Input Gate determines how much of the last sample is stored in memory; the output gate adjusts the amount of data transferred to the next level, and the Forget Gate controls the rate at which memory is stored.

7.Gated Recurrent Unit

GRU is a LSTM with different gates.
The lack of an output gate makes it easier to repeat the same output multiple times based on a specific input. It is now most often used in sound (music) and speech synthesis.

Though the actual combination is a bit different: all LSTM gates are grouped together into a so-called Update Gate and the Reset Gate is closely related to the input.

They consume less resources than LSTM but have almost the same effect.

8.Auto Encoder

Autoencoders are used for classification, clustering and feature compression.

When you train a feed forward (FF) neural network for classification, you mainly have to provide X examples in Y categories and expect one of the Y output cells to be activated. This is called "supervised learning."

On the other hand, auto-encoders can be trained without supervision. Their structure - when the number of hidden units is less than the number of input units (and the number of output units equals the number of input units), and automatic encoder is trained in a way that the output as close to the input as possible , the automatic encoder is forced to generalize the data and search for common features.

9.Variational Auto Encoder

Compared with an auto encoder, a VAE compresses the probability, not the feature.

In spite of this simple change, an auto encoder can only answer questions like, "How do we summarize the data?", while a VAE answers questions like "How strong is the connection between the two things? Should we divide the error between two things or are they completely independent? "

More in-depth explanation can be found here: https://github.com/kvfrans/variational-autoencoder

10.Denoising Auto Encoder

Although auto-encoders are cool, they sometimes fail to find the most proper features but rather adapts to the input data (in fact an example of over-fitting).

The Noise Reduction Automatic Encoder (DAE) adds some noise to the input unit - changing data by random bits, randomly shifting bits in the input, and so on. By doing this, a forced noise reduction auto-encoder reconstructs the output from a somewhat noisy input, making it more generic, forcing the selection of more common features.

11.Sparse Auto Encoder

Sparse Auto Encoder (SAE) is another form of auto encoding that sometimes pulls out some hidden packet samples from the data. The structure is same from AE, but the number of hidden cells is greater than the number of input or output cells.

12.Markov Chain

Markov Chain (MC) is an old chart concept. Each of its endpoints is assigned with a certain possibility. In the past, we have used it to construct a text structure like "dear" appears after "Hello" with a probability of 0.0053%, and "you" appears after "hello" with a probability of 0.03551%.

These Markov chains are not typical neural networks. They can be used as probability-based categories (like Bayesian filtering), for clustering (for some categories), and also as finite state machines.

13.Hopfield Network

The Hopfield Network (HN) is trained by a limited set of samples so that they react to known samples using the same sample.

Before training, each sample is used as an input sample, as a hidden sample during training and as an output sample after it has been used.
When HN tries to reconstruct the trained samples, they can be used to denoise the input value and repair the input. If half of the pictures or sequences are given for learning, they can feed back to the entire sample.

14.Boltzmann Machine

Boltzmann machine (BM) is very similar to HN, with some cells marked as inputs as well as hidden cells. When the hidden unit updates its status, the input unit becomes the output unit (In training, BM and HN update units one by one instead of in parallel).

This is the first network topology that successfully preserves the simulated annealing approach.
Multi-layered Porzman machines can be used for so-called deep belief networks (to be introduced shortly), and deep belief networks can be used for feature detection and extraction.

15.Restricted Boltzmann Machine

The restricted Borzmann machines (RBMs) are very similar to BMs in structure, but constrained RBMs are allowed to be trained back-propagating like FFs (the only difference is that the RBM will go through the input layer once before data is backpropagated).

16.Deep Belief Network

Deep Belief Network (DBN) is actually a number of Boltzmann machines (surrounded by VAE) togther. They can be linked together (when one neural network is training another), and data can be generated using patterns learned.

17.Deep Convolutional Network

Today, Deep Convolutional Network (DCN) is the superstar of artificial neural networks. It has convolutional units (or pools) and kernels, each for a different purpose.

Convolution kernels are actually used to process incoming data, and pooling layers are used to simplify them (in most cases using nonlinear equations such as max) to reduce unnecessary features.

They are usually used for image recognition, they run on a small part of the image (about 20x20 pixels). The input window slides one pixel by pixel along the image. The data then flows to the convolution layer, which forms a funnel (compression of the identified features). In terms of image recognition, the first layer identifies the gradient, the second layer identifies the line, and the third layer identifies the shape, and so on, up to the level of a particular object. DFF is usually attached to the end of the convolution layer for future data processing.

18.Deconvolutional Network

The deconvolution network (DN) is the inverted version of DCN. DN can generate the vector as (dog: 0, lizard: 0, horse: 0, cat: 1) after capturing the cat's picture. DNC can draw a cat after getting this vector.

19.Deep Convolutional Inverse Graphics Network

Deep Convolutional Inverse Graphics Network (DCIGN), looks like DCN and DN attached together, but not exactly so.

In fact, it is an auto encoder, DCN and DN are not as two separate networks, but rather as a space that carries the network input and output. Most of these neural networks can be used in image processing and can process images that they have not been trained on before. For it's level of abstraction, these networks can be used to remove something from a picture, redraw it, or replace a horse with a zebra like the famous CycleGAN.

20.Generative Adversarial Network

The Generative Adversarial Network (GAN) represents the dual networks family consisting of generators and differentiator. They are always hurting each other - the generator tries to generate some data, and the differentiator receives the sample data and tries to discern which are the samples and which ones are generated. As long as you can maintain the balance between the training of the two neural networks, in the constant evolution, this neural network can generate the actual image.

21.Liquid State Machine

Liquid State Machines (LSMs) are sparse neural networks whose activation functions are replaced (not all connected) by thresholds. Only when the threshold is reached, the cell accumulates the value information from the successive samples and the output freed, and again sets the internal copy to zero.
The idea comes from the human brain, these neural networks are widely used in computer vision, speech recognition systems, but has no major breakthrough.

22.Extreme Learning Machine

Extreme Learning Machines (ELM) reduce the complexity behind an FF network by creating a sparse, random connection of hidden layers. They require less computer power, and the actual efficiency depends very much on tasks and data.

23.Echo State Network

Echo status network (ESN) is a subdivision of a repeating network. Data passes through the input, and if multiple iterations are monitored, only the weight between hidden layers is updated after that.
To be honest, besides multiple theoretical benchmarks, I don’t know any practical use of this Network. Any comments is welcomed.

24.Deep Residual Network

Deep Residual Network(DRN) passes parts of input values to the next level. This feature makes it possible to reach many layers (up to 300 layers), but they are actually recurrent neural network(RNN) without a clear delay.

25.Kohonen Network

Kohonen network (KN) introduces the "cell distance" feature. For the most part used for classification, this network tries to adjust their cells to make the most probable response to a particular input. When some cells are updated, the cells closest to them are also updated.
Much like SVM, these networks are not always considered "real" neural networks.

26.Support Vector Machine

Support Vector Machines (SVMs) are used for binary categorical work and the result will be "yes" or "no" regardless of how many dimensions or inputs the network processes.

SVMs are not always known as neural networks.

27.Neural Turing Machine

Neural networks are like black boxes - we can train them, get results, enhance them, but most of the actual decision paths are not visible to us.

The Neurological Turing Machine (NTM) is trying to solve this problem - it is an FF after extracting memory cells. Some authors also say that it is an abstract version of LSTM.

Memory is content-based; this network can read memory based on the status quo, write memory, and also represents the Turing complete neural network.

I hope this summary will be helpful to anyone that's interested in learning about Neural Network. If you feel like anything needs to bee corrected or added, please contact us at support@octoparse.com.

Source: Octoparse

2018年1月23日星期二

Top 8 Technology Trends for 2018 You Must Know About

The industries are experiencing a new wave of technologies, from IoT, to artificial intelligence, which may greatly impact us and our surrounding world. This article concludes the main ideas from a white book named “8 in 2018: the top transformative technologies to watch this year” released by IHS Markit, a leading global information service company.

1. Artificial Intelligence(AI)

Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using the rules to reach approximate or definite conclusions), and self-correction. Particular applications of AI include expert systems, speech recognition and machine vision(definition source: http://searchcio.techtarget.com)
AI has already grown to a certain degree. Trend shows that several industries, including smartphones, healthcare, and automotive, tend to put more efforts on AI research and development. At present, AI has been focused on two practices: device and cloud based. Both approaches have pros and cons. Cloud AI is more competent to analyze data since it utilizes deep learning algorithms, but there are potential threads around privacy, latency, and stability. A stronger on-device AI can help offset those dangers to some degree; for instance, smartphone users who deploy the built-in AI of their phones are able to store data locally and thus safeguard their privacy.

2. Internet of Things(IoT)

The Internet of things (IoT) is the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity which enables these objects to connect and exchange data.
IHS Markit predicts, the global installed base of IoT devices will rise from 27 billion in 2017 to 73 billion in 2025. The enhanced connectivity options with edge computing and cloud analytics will lead to the acceleration of IoT growth in 2018 and IoT evolution of the four-stage, connecting, collecting, computing, and creating.
The enhancement of IoT connectivity, such as low-power wireless access(LPWA) and 5G, will drive this growth. Moreover, technologies relevant to IoT will become more mature. Machine video and ubiquitous video will provide consecutive data support for visual analytics; and artificial intelligence, the Cloud, and virtualization will help develop critical insights sourced from data at the so-called “edge” of computing networks. Applying techniques to data will drive monetization in the form of cost savings, greater efficiencies, and a transition from product-centric to service- centric business models.

3. Cloud & Virtualization

Cloud computing is an information technology (IT) paradigm that enables ubiquitous access to shared pools of configurable system resources and higher-level services that can be rapidly provisioned with minimal management effort, often over the Internet.
Cloud services will smoothen the road for technologically immature companies to utilize machine learning and artificial intelligence, transforming their usage and understanding of data radically.

4. Connectivity
Connectivity refers to the ability to link and communicate with other people, devices, computer systems, software, or the internet.
Along with the first emerge of 5G commercial deployment, connectivity must become a focus point in 2018. However, the path to full 5G deployment and adoption is complicated, thus, for mobile network operators, infrastructure providers, device manufacturers, and end-users, opportunities and challengers come along. 5G- the next generation of cellular technology after LTE- represents a dramatic expansion of traditional cellular technology use-cases beyond mobile voice and broadband, to include a multitude of IoT and mission-critical applications.

5. Ubiquitous Video
Ubiquitous Video refers to the ability to capture, create, consume, and distribute video content almost anywhere. The explosive growth of video services has been driven by multiple factors, including the high penetration of camera-enabled mobile phones and commoditization, which have enabled displays of various sizes and shapes to be placed in almost any location with a range of wired and wireless connectivity options.
Screens and cameras are widely applied to consumer- and enterprise- devices. Along with the increasingly growth of broadcast, fixed and mobile data network, which has thrown a bomb in video consumption, creation, distribution, and data traffic. More importantly, video content is increasingly expanding beyond entertainment into industrial applications for medical, education, security, and remote controls, as well as digital signage.

6. Computer vision

Computer vision is an interdisciplinary field that deals with how computers can be made for gaining high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do.
The increasing importance of computer vision is directly tied to the mega-trend of digitalization that has been playing out in the industrial, enterprise, and consumer segments over the past 20 years. The proliferation of image sensors, as well as improvements in image processing and analysis, have enabled a broad range of applications and use-cases including industrial robots, drone applications, intelligent transportation systems, high-quality surveillance, medical applications, and automotive safety.

7. Robots and drones

Robots and drones are autonomous or semi-autonomous machines that are capable of completing complex, often repetitive actions. Robots may be fixed or mobile but typically land-based, while drones are commonly viewed as aerial and include fixed-wing, rotor-based, airships, and balloons.
The global market for robots and drones will grow to 3.9 billion U.S. dollars in 2018. Moreover, robots and drones could possibly subvert the traditional business models that have been running in manufacturing for years, and brings a significant impact to fields such as logistics, material picking and handling, navigational autonomy, and delivery.

8. Blockchain
The Harvard Business Review describes it as "an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way". A blockchain, is a continuously growing list of records, that utilizes cryptography and timestamps to provide a permanent record of various types of transactions and interaction.
Blockchain enables decentralized transactions, and it is the basis of digital currency, such as bitcoin and ether. It is obvious that blockchain has penetrated in financial service industry as its early adoption of blockchain in payments-related solutions. In 2018, Blockchain-based services will continue to prevail beyond financial services as they have already been developed and deployed, which include: the use of blockchain to improve advertising measurement and combat adfraud; blockchain-based systems for distributing music royalty payments; solutions to better track and manage electronics supply chains.

Source: Octoparse

More related sources:

Understanding Big data, Data mining, and Machine Learning in 5 Minutes

15 Highest Paying Programming Languages in 2017

Top 30 Process Automation Tools for 2018

Why we need data service?

2018年1月4日星期四

Extract Reviews - Dealing with "Show More" Buttons

Product reviews are importance resources for both sellers and buyers. Sellers find about how their products are rated by users while buyers generally spend much time wading through pages of reviews in order to find out whether a product is worth buying.

Many Octoparse users are extracting reviews on daily basis. One of the most frequently asked question is how to deal with "load more" button when it is required to make visible of the full review content instead of the first few lines.

It is actually extremely easy to solve this problem in Octoparse: just make a loop to click those "load more" buttons one by one before extracting the reviews.

Let’s look at an example for Walmart (example URL):

Looking through the reviews on Walmart.com, you can easily spot the “Read More” button showing right below some of the reviews.

What we need to do is really to have the program click open all the "Read more" button all together, so we'll have the complete version of all the reviews. Then, we'll proceed with an extract action for all the reviews. Follow the steps below,

Drag a Loop Item into the workflow after opening the webpage in Octoparse
Choose "Single element" in Loop Mode
Enter the XPath of "Read More" button (//BUTTON[text()='Read more'])*
Click "Save"

*Notice the XPath used here only applies to this particular example. User should find out the suitable XPath to use for different webpages. The selected XPath must be capable of locating all the "Read More" buttons on the page (click here to learn more about XPath)

Drag a Click Item into the Loop Item
Tick "Click items in the loop"
Tick "Load the page with AJAX" and select a proper AJAX timeout
Click "Save"

Next, make a loop list of all the review sections.
Drag the review loop item out of first Loop item, then re-position it to right below the first loop

Click on "Extract Data" action, then click to extract any sub-elements (such as reviewer, review date, comment etc) from the first review section outlined in the built-in browser.
Rename the data field if needed

In this way, Octoparse will click all the "Read More" button before extracting the reviews to make sure all reviews contents are captured completely.

To learn more about scraping reviews, refer to these tutorials:

Scraping Yelp Reviews

Scraping Hotel Reviews from Tripadvisor.com

Amazon Scraping Case Study |Scrape Amazon Product Reviews and Ratings

Author: The Octoparse Team

2017年12月13日星期三

Big Data: 70 Amazing Free Data Sources You Should Know for 2017

Every great data visualization starts with good and clean data. Most of people believe that collecting big data would be a rough thing, but it’s simply not true. There are thousands of free data sets available online, ready to be analyzed and visualized by anyone. Here we’ve rounded up 70 free data sources for 2017 on government, crime, health, financial and economic data,marketing and social media, journalism and media, real estate, company directory and review, and more.

We hope you could enjoy this and save a lot time and energy searching blindly online.

Free Data Source: Government

Data.gov: It is the first stage and acts as a portal to all sorts of amazing information on everything from climate to crime freely by the US Government.
Data.gov.uk: There are datasets from all UK central departments and a number of other public sector and local authorities. It acts as a portal to all sorts of information on everything, including business and economy, crime and justice, defence, education, environment, government, health, society and transportation.
US. Census Bureau: The website is about the government-informed statistics on the lives of US citizens including population, economy, education, geography, and more.
The CIA World Factbook: Facts on every country in the world; focuses on history, government, population, economy, energy, geography, communications, transportation, military, and transnational issues of 267 countries.
Socrata: Socratais a mission-driven software company that is another interesting place to explore government-related data with some visualization tools built-in. Its data as a service has been adopted by more than 1200 government agencies for open data, performance management and data-driven government.
European Union Open Data Portal: It is the single point of access to a growing range of data from the institutions and other bodies of the European Union. The data boosts includes economic development within the EU and transparency within the EU institutions, including geographic, geopolitical and financial data, statistics, election results, legal acts, and data on crime, health, the environment, transport and scientific research. They could be reused in different databases and reports. And more, a variety of digital formats are available from the EU institutions and other EU bodies. The portal provides a standardised catalogue, a list of apps and web tools reusing these data, a SPARQL endpoint query editor and rest API access, and tips on how to make best use of the site.
Canada Open Datais a pilot project with many government and geospatial datasets. It could help you explore how the Government of Canada creates greater transparency, accountability, increases citizen engagement, and drives innovation and economic opportunities through open data, open information, and open dialogue.
Datacatalogs.org: It offers open government data from US, EU, Canada, CKAN, and more.
U.S. National Center for Education Statistics: The National Center for Education Statistics (NCES) is the primary federal entity for collecting and analyzing data related to education in the U.S. and other nations.
UK Data Service: The UK Data Service collection includes major UK government-sponsored surveys, cross-national surveys, longitudinal studies, UK census data, international aggregate, business data, and qualitative data.

Free Data Source: Crime

Uniform Crime Reporting: The UCR Program has been the starting place for law enforcement executives, students, researchers, members of the media, and the public seeking information on crime in the US.
FBI Crime Statistics: Statistical crime reports and publications detailing specific offenses and outlining trends to understand crime threats at both local and national levels.
Bureau of Justice Statistics: Information on anything related to U.S. justice system, including arrest-related deaths, census of jail inmates, national survey of DNA crime labs, surveys of law enforcement gang units, etc.
National Sex Offender Search: It is an unprecedented public safety resource that provides the public with access to sex offender data nationwide. It presents the most up-to-date information as provided by each Jurisdiction.

Free Data Source: Health

U.S. Food & Drug Administration: Here you will find a compressed data file of the Drugs@FDA database. Drugs@FDA, is updated daily, this data file is updated once per week, on Tuesday.
UNICEF: UNICEF gathers evidence on the situation of children and women around the world. The data sets include accurate, nationally representative data from household surveys and other sources.
World Health Organisation: statistics concerning nutrition, disease and health in more than 150 countries.
Healthdata.gov: 125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics.
NHS Health and Social Care Information Centre: Health data sets from the UK National Health Service. The organization produces more than 260 official and national statistical publications. This includes national comparative data for secondary uses, developed from the long-running Hospital Episode Statistics which can help local decision makers to improve the quality and efficiency of frontline care.

Free Data Source: Financial and Economic Data

World Bank Open Data: Education statistics about everything from finances to service delivery indicators around the world.
IMF Economic Data: An incredibly useful source of information that includes global financial stability reports, regional economic reports, international financial statistics, exchange rates, directions of trade, and more.
UN Comtrade Database: Free access to detailed global trade data with visualizations. UN Comtrade is a repository of official international trade statistics and relevant analytical tables. All data is accessible through API.
Global Financial Data: With data on over 60,000 companies covering 300 years, Global Financial Data offers a unique source to analyze the twists and turns of the global economy.
Google Finance: Real-time stock quotes and charts, financial news, currency conversions, or tracked portfolios.
Google Public Data Explorer: Google's Public Data Explorer provides public data and forecasts from a range of international organizations and academic institutions including the World Bank, OECD, Eurostat and the University of Denver. These can be displayed as line graphs, bar graphs, cross sectional plots or on maps.
U.S. Bureau of Economic Analysis: U.S. official macroeconomic and industry statistics, most notably reports about the gross domestic product (GDP) of the United States and its various units. They also provide information about personal income, corporate profits, and government spending in their National Income and Product Accounts (NIPAs).
Financial Data Finder at OSU: Plentiful links to anything related to finance, no matter how obscure, including World Development Indicators Online, World Bank Open Data, Global Financial Data, International Monetary Fund Statistical Databases, and EMIS Intelligence.
National Bureau of Economic Research: Macro data, industry data, productivity data, trade data, international finance, data, and more.
U.S. Securities and Exchange Commission: Quarterly datasets of extracted information from exhibits to corporate financial reports filed with the Commission.
Visualizing Economics: Data visualizations about the economy.
Financial Times: The Financial Times provides a broad range of information, news and services for the global business community.

Free Data Source: Marketing and Social Media

Amazon API: Browse Amazon Web Services’Public Data Sets by category for a huge wealth of information. Amazon API Gateway allows developers to securely connect mobile and web applications to APIs that run on Amazon Web(AWS) Lambda, Amazon EC2, or other publicly addressable web services that are hosted outside of AWS.
American Society of Travel Agents: ASTA is the world's largest association of travel professionals. It provides members information including travel agents and the companies whose products they sell such as tours, cruises, hotels, car rentals, etc.
Social Mention: Social Mention is a social media search and analysis platform that aggregates user-generated content from across the universe into a single stream of information.
Google Trends: Google Trends shows how often a particular search-term is entered relative to the total search-volume across various regions of the world in various languages.
Facebook API: Learn how to publish to and retrieve data from Facebook using the Graph API.
Twitter API: The Twitter Platform connects your website or application with the worldwide conversation happening on Twitter.
Instagram API: The Instagram API Platform can be used to build non-automated, authentic, high-quality apps and services.
Foursquare API: The Foursquare API gives you access to our world-class places database and the ability to interact with Foursquare users and merchants.
HubSpot: A large repository of marketing data. You could find the latest marketing stats and trends here. It also provides tools for social media marketing, content management, web analytics, landing pages and search engine optimization.
Moz: Insights on SEO that includes keyword research, link building, site audits, and page optimization insights in order to help companies to have a better view of the position they have on search engines and how to improve their ranking.
Content Marketing Institute: The latest news, studies, and research on content marketing.

Free Data Source: Journalism and Media

The New York Times Developer Network– Search Times articles from 1851 to today, retrieving headlines, abstracts and links to associated multimedia. You can also search book reviews, NYC event listings, movie reviews, top stories with images and more.
Associated Press API: The AP Content API allows you to search and download content using your own editorial tools, without having to visit AP portals. It provides access to images from AP-owned, member-owned and third-party, and videos produced by AP and selected third-party.
Google Books Ngram Viewer: It is an online search engine that charts frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008 in Google's text corpora.
Wikipedia Database: Wikipedia offers free copies of all available content to interested users.
FiveThirtyEight: It is a website that focuses on opinion poll analysis, politics, economics, and sports blogging. The data and code on Github is behind the stories and interactives at FiveThirtyEight.
Google Scholar: Google Scholar is a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats and disciplines. It includes most peer-reviewed online academic journals and books, conference papers, theses and dissertations, preprints, abstracts, technical reports, and other scholarly literature, including court opinions and patents.

Free Data Source: Real Estate

Castles: Castles are a successful, privately owned independent agency. Established in 1981, they offer a comprehensive service incorporating residential sales, letting and management, and surveys and valuations.
Realestate.com: RealEstate.com serves as the ultimate resource for first-time home buyers, offering easy-to-understand tools and expert advice at every stage in the process.
Gumtree: Gumtree is the first site for free classifieds ads in the UK. Buy and sell items, cars, properties, and find or offer jobs in your area is all available on the website.
James Hayward: It provides an innovative database approach to residential sales, lettings & management.
Lifull Home’s: Japan’s property website.
Immobiliare.it: Italy’s property website.
Subito: Italy’s property website.
Immoweb: Belgium's leading property website.

Free Data Source: Business Directory and Review

LinkedIn: LinkedIn is a business- and employment-oriented social networking service that operates via websites and mobile apps. It has 500 million members in 200 countries and you could find the business directory here.
OpenCorporates: OpenCorporates is the largest open database of companies and company data in the world, with in excess of 100 million companies in a similarly large number of jurisdictions. Our primary goal is to make information on companies more usable and more widely available for the public benefit, particularly to tackle the use of companies for criminal or anti-social purposes, for example corruption, money laundering and organised crime.
Yellowpages: The original source to find and connect with local plumbers, handymen, mechanics, attorneys, dentists, and more.
Craigslist: Craigslist is an American classified advertisements website with sections devoted to jobs, housing, personals, for sale, items wanted, services, community, gigs, résumés, and discussion forums.
GAF Master Elite Contractor: Founded in 1886, GAF has become North America’s largest manufacturer of commercial and residential roofing (Source: Fredonia Group study). Our success in growing the company to nearly $3 billion in sales has been a result of our relentless pursuit of quality, combined with industry-leading expertise and comprehensive roofing solutions. Jim Schnepper is the President of GAF, an operating subsidiary of Standard Industries. When you are looking to protect the things you treasure most, here are just some of the reasons why we believe you should choose GAF.
CertainTeed: You could find contractors, remodelers, installers or builders in the US or Canada on your residential or commercial project here.
Companies in California: All information about companies in California.
Manta: Manta is one of the largest online resources that deliver products, services and educational opportunities. The Manta directory boasts millions of unique visitors every month who search comprehensive database for individual businesses, industry segments and geographic-specific listings.
EU-Startups: Directory about startups in EU.
Kansas Bar Association: Directory for lawyers. The Kansas Bar Association (KBA) was founded in 1882 as a voluntary association for dedicated legal professionals and has more than 7,000 members, including lawyers, judges, law students, and paralegals.

Free Data Source: Other Portal Websites

Capterra: Directory about business software and reviews.
Monster: Data source for jobs and career opportunities.
Glassdoor: Directory about jobs and information about inside scoop on companies with employee reviews, personalized salary tools, and more.
The Good Garage Scheme: Directory about car service, MOT or car repair.
OSMOZ: Information about fragrance.
Octoparse: A free data extraction tool to collect all the web data mentioned above online.

Do you know some great data sources? Contact to let us know and help us share the data love.

More Related Sources:

Top 30 Big Data Tools for Data Analysis

Top 30 Free Web Scraping Software