Chapter 7. Machine Learning with Spark and Hadoop

We discussed the typical life cycle of a data science project in Chapter 1, Big Data Analytics at a 10,000-Foot View. This chapter takes a closer look at the machine learning techniques used in data science with Spark and Hadoop.

Data science is all about extracting deep meaning from data and creating data products. This requires methods, such as statistics and machine learning algorithms, as well as tools for data collection and data cleansing. Once the data is collected and cleansed, it is analyzed using exploratory analytics to find patterns and build models, with the aim of extracting deep meaning or creating a data product.

So, let's understand how these patterns and models are created. This chapter is divided into the following subtopics:

Introducing machine learning
Machine learning on Spark and Hadoop
Machine learning algorithms
Examples of machine learning algorithms
Building machine learning pipelines
Machine learning with H2O and Spark
Introducing Hivemall
Introducing Hivemall for Spark

Introducing machine learning

Machine learning is the science of making machines work without programming predefined rules. Let's go through a simple example of how a program is written with a regular approach and with a machine learning approach. Suppose you are developing a spam filter. With the regular approach, you need to identify all possible parameters at design time and hardcode them within the program as follows:

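import sys

# Phrases identified at design time and hardcoded into the program.
spam_words = ("No investment", "Why pay more?", "You are a winner!", "Free quote")

def process_lines(line):
    # Placeholder for whatever normal handling a non-spam line needs.
    pass

for line in sys.stdin:
    # Flag the line if it contains any of the hardcoded phrases.
    if any(phrase in line for phrase in spam_words):
        print("Spam Found")
    else:
        process_lines(line)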

In machine learning, computers learn from the data we provide and decide for themselves which words indicate spam. Machine learning is similar to human learning. Let's understand how humans learn.

Humans learn by doing a task over and over again, which is known as practice. Practice builds experience, and the more they practice, the better they get at the task. Humans are considered to have learned something when they can repeat a task with some expected level of accuracy. However, human learning is not scalable, because a person has to weigh a wide variety of factors for each task.

In machine learning, you typically provide training data that contains features, such as the words in an e-mail, together with output variables, such as spam or ham. Once this data is fed to a machine learning algorithm, such as classification or regression, the algorithm learns a model of the correlation between the features and the output variables. You can then predict whether an e-mail is spam or ham by providing input e-mails, called test data, to the model. You can refine the model by providing more and more training data to improve accuracy. You can see a spam detection example with machine learning in the next section.
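
To make the train-then-predict flow concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a naive Bayes classifier with add-one smoothing, which is one common classification technique and not necessarily the one used later in this chapter. The training e-mails, the labels, and the function names train and predict are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training data: (e-mail text, label). Not from the book.
TRAINING_DATA = [
    ("no investment needed you are a winner", "spam"),
    ("free quote why pay more", "spam"),
    ("you are a winner claim your free prize", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached project report", "ham"),
    ("lunch on monday with the project team", "ham"),
]


def train(examples):
    """Count words per label; these counts are the learned model."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training e-mails
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
print(predict(model, "you are a winner of a free quote"))  # prints: spam
print(predict(model, "project meeting on monday"))         # prints: ham
```

Unlike the hardcoded filter, nothing here lists spam phrases explicitly. The word counts are derived from the labeled examples, so adding more training e-mails changes the model without changing the code.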

The advantages of machine learning are as follows:

It can be more accurate than human learning because it is data-driven. The more data it learns from, the better the accuracy it can achieve.
Machine learning can be automated to predict outcomes or recommend products.
Machine learning algorithms can produce answers in milliseconds, which enables us to create real-time applications.
Machine learning algorithms are scalable and able to process all of the data.

The disadvantages of machine learning are as follows:

You need to acquire the right data (labeled data) and enrich it.
It is usually impossible to get 100% accuracy.
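
Accuracy here can be read as the fraction of test examples that a model labels correctly. The following is a minimal sketch of that calculation; the function name accuracy and the two label lists are hypothetical and are not results from any real model.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty test set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical true labels for five test e-mails and a model's guesses.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
print(accuracy(predicted, actual))  # prints: 0.8
```

In this made-up case the model mislabels one ham e-mail as spam, so four of five predictions are correct.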
