Chapter 7. Machine Learning with Spark and Hadoop

We discussed the typical life cycle of a data science project in Chapter 1, Big Data Analytics at a 10,000-Foot View. This chapter takes a closer look at the machine learning techniques used in data science with Spark and Hadoop.

Data science is all about extracting deep meaning from data and creating data products. This requires methods, such as statistics and machine learning algorithms, as well as tools for data collection and data cleansing. Once the data is collected and cleansed, it is analyzed using exploratory analytics to find patterns and build models, with the aim of extracting deep meaning or creating a data product.

So, let's understand how these patterns and models are created. This chapter is divided into the following subtopics:

  • Introducing machine learning
  • Machine learning on Spark and Hadoop
  • Machine learning algorithms
  • Examples of machine learning algorithms
  • Building machine learning pipelines
  • Machine learning with H2O and Spark
  • Introducing Hivemall
  • Introducing Hivemall for Spark

Introducing machine learning

Machine learning is the science of getting machines to perform tasks without explicitly programming predefined rules. Let's go through a simple example of how a program is written with a conventional approach and with a machine learning approach. For example, if you are developing a spam filter with the conventional approach, you need to identify all possible parameters at design time and hardcode them within the program as follows:

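import sys

# Spam phrases must be known and hardcoded at design time.
spam_words = ("No investment", "Why pay more?", "You are a winner!", "Free quote")

def process_lines(line):
    # Placeholder for the normal handling of a non-spam line.
    pass

for line in sys.stdin:
    # Flag the line if it contains any of the hardcoded spam phrases.
    if any(phrase in line for phrase in spam_words):
        print("Spam Found")
    else:
        process_lines(line)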

In the machine learning approach, the computer instead learns from the data we provide and decides for itself which words indicate spam. Machine learning is similar to human learning. Let's understand how humans learn.

Humans learn by doing a task over and over again, which is known as practice. Practice builds experience, and the more they practice, the better they get at the task. Humans are considered to have learned something when they can repeat a task with some expected level of accuracy. However, human learning does not scale, because a person has to consider a wide variety of things for each task.

In machine learning, you typically provide training data with features, such as the words in an e-mail, together with output variables, such as spam or ham. Once this data is fed to a machine learning algorithm, such as a classification or regression algorithm, the algorithm learns a model of the correlation between the features and the output variables. You can then predict whether an e-mail is spam or ham by providing input e-mails, called test data, to the model. You can refine the model by providing more and more training data to improve accuracy. You can see a spam detection example with machine learning in the next section.
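
The training and prediction steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Spark-based example the chapter builds later: the tiny dataset is invented, the classifier is a hand-written naive Bayes with Laplace smoothing, and the names TRAINING_DATA, tokenize, train, and predict are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy training data: (e-mail text, label) pairs.
TRAINING_DATA = [
    ("You are a winner! Claim your free quote", "spam"),
    ("No investment needed, why pay more?", "spam"),
    ("Free prize, you are a winner", "spam"),
    ("Meeting moved to Monday morning", "ham"),
    ("Please review the attached project report", "ham"),
    ("Lunch on Monday with the project team?", "ham"),
]


def tokenize(text):
    """Lowercase the text and extract its words, which serve as features."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label; these counts are the learned model."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def predict(model, text):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts = model
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_examples = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        # Start from the prior: the share of training e-mails with this label.
        score = math.log(label_counts[label] / total_examples)
        for word in tokenize(text):
            # Laplace smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_DATA)
print(predict(model, "Claim your free prize, winner"))
print(predict(model, "Project meeting on Monday"))
```

Unlike the hardcoded filter, nothing in this sketch names a spam phrase. The word counts come entirely from the labeled examples, so adding more training data changes the model without changing the program.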

The advantages of machine learning are as follows:

  • It can be more accurate than human learning because it is data-driven. The more data it learns from, the better the accuracy generally becomes.
  • Machine learning can be automated to predict outcomes or recommend products.
  • Machine learning algorithms can produce answers in milliseconds, which enables us to create real-time applications.
  • Machine learning algorithms are scalable and can process entire datasets.

The disadvantages of machine learning are as follows:

  • You need to acquire the right data (labeled data) and enrich it.
  • It is usually impossible to get 100% accuracy.