© Karthik Ramasubramanian and Abhishek Singh 2017

Karthik Ramasubramanian and Abhishek Singh, Machine Learning Using R, 10.1007/978-1-4842-2334-5_9

9. Scalable Machine Learning and Related Technologies

Karthik Ramasubramanian and Abhishek Singh

New Delhi, Delhi, India

A few years back, you would not have heard the word "scalable" in machine learning parlance. The reason was mainly the lack of infrastructure, data, and real-world applications. Machine learning was mostly talked about in academic research communities or in well-funded industry research labs. A prototype of any real-world application using machine learning was considered a big feat and a demonstration of breakthrough research. However, times have changed with the availability of powerful commodity hardware at reduced cost and the widespread adoption of big data technology. As a result, data has become easily accessible and software development is becoming more and more data savvy. Every single byte of data is being captured, even if its use is not clear in the near future.

As you witnessed in Chapter 6, machine learning algorithms have a lot of statistical and mathematical depth, but that alone is not sufficient to make them scalable. Such statistical techniques work well only on a small dataset that resides wholly on one machine. However, when the data grows big enough to challenge the storage capabilities of a single machine, the world of distributed computing and algorithmic complexity starts to take over. In this world, questions like the following start to emerge:

  • Does the algorithm run in linear or quadratic time?

  • Do we have a distributed (parallel) version of the algorithm?

  • Do we have enough machines with required storage and computing power?

If the answers to these questions are yes, you are ready to think big. The recent notion of building data products, which we emphasized in our PEBE machine learning process flow, originates from our ability to scale things to cater to the demands of ever-changing technology, data, and a growing number of users of the product. We are continuously learning from the incremental addition of new data.

In this concluding chapter of the book, we take you through an exciting journey of big data technologies like Apache Hadoop, Hive, Pig, and Spark, with a special focus on scalable machine learning using real-world examples. We present an introduction to these technologies.

9.1 Distributed Processing and Storage

Imagine a program that uses the most optimized algorithm with the best running time (time complexity) and is designed for efficient storage as well. The notion of best running time, however, depends on the application: for a company like Google, it is a few microseconds or even less (for its search program), whereas a company involved in DNA sequencing might be willing to wait a few days or weeks for the program to complete. Before the big data revolution started, parallel and distributed computing was solving the problem of execution time: the same program was ported to run on multiple machines (servers) at the same time. In other words, the program was divided into many subtasks that were assigned to multiple machines executing them at the same time. The paradigm shift big data brought to this way of distributed computing was to design a mechanism that efficiently divides the data as well as the program that processes it. The type of problems people thought about in the distributed computing era and in the big data generation has also seen quite a makeover. For example, the vertex graph coloring problem (finding a way to color the vertices of a graph so that no two adjacent vertices share the same color) is considered a challenging task in the distributed setting, even for simple graphs with few vertices. There is a lot of literature on designing such distributed programs, like the one described in the references at the end of the chapter.

On the other hand, when enormous volumes of data are involved, for example, sorting an array of a billion numbers, big data technologies have found their way into the solution. Our focus in this chapter is to highlight some of the technologies in this domain using real-world examples.

Although the evolution of distributed and parallel computing began many decades ago, its widespread use was made possible by two breakthrough works, which led to an entire generation of applications and further state-of-the-art technologies. The first breakthrough came from Google in 2003, with "The Google File System," followed by "MapReduce: Simplified Data Processing on Large Clusters" in 2004. The former provided a scalable distributed file system for large, distributed, data-intensive applications, and the latter described a programming model and an associated implementation for processing and generating large datasets. Together they provide an architecture for dividing and storing large amounts of data in smaller chunks across thousands of machines (nodes) and for taking the computation locally to the machines holding those chunks rather than running it on the entire data in one place.

The second breakthrough, which took this technology to the masses, came in 2006 with Apache Hadoop, a complete open source framework for distributed storage and processing. Hadoop successfully demonstrated that, by using large computer clusters built from commodity hardware, it's possible to achieve reduced computation time and automatically handle hardware failures.

9.1.1 Google File System (GFS)

GFS was designed keeping in mind the demands of data-intensive applications. It provides a scalable distributed file system (for storage) for large data.

In order to emphasize the need for such a file system, the following is an excerpt from the paper The Google File System:

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large datasets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

Handling hundreds of terabytes of data using thousands of disks across over a thousand machines speaks to the humongous scale such systems are designed to handle.

As shown in Figure 9-1, which was originally published in the paper "The Google File System," a GFS master stores the metadata about every data chunk stored on the GFS chunk servers.

Figure 9-1. A Google file system

The metadata contains the file and chunk namespaces (an abstract container holding unique names or identifiers), the file-to-chunk mappings, and the location of each chunk's replicas for fault tolerance. In the initial design, there was only a single master; however, most contemporary distributed architectures have much more complex settings, even around the master. The GFS client interacts with the master for metadata requests, and all data requests go to the chunk servers.

9.1.2 MapReduce

Distributed processing using MapReduce is at the core of how a task on a big dataset is divided according to the distributed storage. MapReduce was designed as a programming model for applying a certain logic, which could range from a sorting operation to running a machine learning algorithm, on a large volume of data.

In a nutshell, as the paper "MapReduce: Simplified Data Processing on Large Clusters" explains, users specify a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In simpler terms, you break the data into smaller chunks and write a map function that processes a <key, value> pair from each of the smaller chunks simultaneously on the different nodes. This in turn generates intermediate <key, value> pairs, which travel over the network to a central node to be merged by the logic defined in the reduce function. The combination of these two is called a MapReduce program. We will see a simple example of this in the Hadoop ecosystem section.

Figure 9-2, from the classic paper "MapReduce: Simplified Data Processing on Large Clusters," shows how the input file, split into smaller chunks, is placed on the workers (chunk servers) where the map program executes.

Figure 9-2. MapReduce execution flow

Once the map phase has completed its assigned task, it writes the data back to the local disk on the chunk servers, from where it is picked up by the reduce program to finally output the results. This entire process executes seamlessly even if there are hardware failures. We will explain MapReduce in greater detail in the next section.

9.1.3 Parallel Execution in R

In the CRAN vignette titled "Getting Started with doParallel and foreach," Steve Weston and Rich Calaway, the creators of the doParallel package, explain that doParallel is a "parallel backend" for the foreach package: it provides the mechanism needed to execute foreach loops in parallel, and foreach must be used in conjunction with a backend package such as doParallel in order to run code in parallel. foreach itself is an idiom that allows iterating over the elements of a collection without the use of an explicit loop counter.
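
Before applying this to random forests, here is a minimal, self-contained sketch of the foreach/%dopar% idiom (the two-worker cluster size is chosen only for illustration):

library(doParallel)

# Register a small parallel backend with two workers
cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration runs on one of the registered workers;
# .combine = c collects the per-iteration results into a single vector
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}
squares
 [1]  1  4  9 16

# Release the workers
stopCluster(cl)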

Before we go into some examples of MapReduce and discuss the Hadoop ecosystem, let's look at one way to run the random forest algorithm (explained in Chapter 6) using parallel execution on the multi-core CPU of a single machine. We will use the credit score dataset.

9.1.3.1 Setting the Cores

Using the doParallel library in R, we can set the number of CPU cores that we want the machine to use while running the model. There are algorithmic ways to decide (beyond the scope of this book) how many cores you should use if a dedicated machine for such processing is available. However, if it's your personal machine, don't overload the system by using too many cores. Keep in mind that assigning all the cores to this process could starve your other processes of resources. To be safer, we use c - 2 cores, where c is the number of cores available in your machine.

library(doParallel)

# Find out how many cores are available (if you don't already know)
c =detectCores()
c
 [1] 4
# Find out how many cores are currently being used
getDoParWorkers()
 [1] 1
# Create cluster with c-2 cores
cl <-makeCluster(c-2)


# Register cluster
registerDoParallel(cl)


# Find out how many cores are being used
getDoParWorkers()
 [1] 2

9.1.3.2 Problem Statement

The data used here is for building a model that can predict whether a customer will default on repaying a bank loan (a binary classifier) using random forest. For this demonstration, we simply compare the time it takes to build the model when executed serially versus in parallel.

 Problem : Identifying Risky Bank Loans

setwd("C:\Users\Karthik\Dropbox\Book Writing - Drafts\Chapter Drafts\Chapter 9 - Scalable Machine Learning and related technology\Datasets")
credit <-read.csv("credit.csv")
str(credit)
 'data.frame':    1000 obs. of  17 variables:
  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
# create a random sample for training and test data
# use set.seed to use the same random number sequence as the tutorial
set.seed(123)
train_sample <-sample(1000, 900)


str(train_sample)
  int [1:900] 288 788 409 881 937 46 525 887 548 453 ...
# split the data frames
credit_train <-credit[train_sample, ]
credit_test  <-credit[-train_sample, ]

9.1.3.3 Building the Model: Serial

Note the time it takes to execute the random forest model in a serial fashion on the training data created.

 Training a model on the data

library(randomForest)

#Sequential Execution
system.time(rf_credit_model <- randomForest(credit_train[-17],
                                            credit_train$default,
                                            ntree = 1000))
    user  system elapsed
     1.8     0.0     1.8

9.1.3.4 Building the Model: Parallel

In the parallel version of the code, instead of calling randomForest directly with the ntree = 1000 parameter (which means build 1,000 decision trees), we use the foreach function with %dopar% so we can split the 1,000-tree building job into four tasks, which are distributed across the registered workers. Each task builds 250 decision trees using the randomForest function, and the .combine = combine argument merges the resulting forests back into one.

#Parallel Execution
system.time(
  rf_credit_model_parallel <- foreach(nt = rep(250, 4),
                                      .combine = combine,
                                      .packages = 'randomForest') %dopar%
                                randomForest(credit_train[-17],
                                             credit_train$default,
                                             ntree = nt))
    user  system elapsed
    0.33    0.09    1.73

9.1.3.5 Stopping the Clusters

Stop all the clusters and resume the execution in a serial fashion.

#Shutting down cluster - when you're done, be sure to close the parallel backend using
stopCluster(cl)

Observe that the parallel execution reports roughly 80% less user CPU time for the calling process than the sequential run (the exact numbers will differ based on your system; the elapsed times here are close because the example is small). If a single machine using multiple cores can bring such an improvement, imagine the time and resources you'd save when using a large computing cluster.

Notes:

  • The “user time” is the CPU time charged for the execution of user instructions of the calling process.

  • The “system time” is the CPU time charged for execution by the system on behalf of the calling process.

In the next section, we go a little deeper into the Hadoop ecosystem and demonstrate the first “hello world” example using Hadoop and R.

9.2 The Hadoop Ecosystem

There are plenty of resources on Hadoop due to its popularity. Taking a broad view, the Hadoop framework consists of the following three modules (the technical details of the framework are beyond the scope of this book):

  • Hadoop Distributed File System: This is the storage part of Hadoop; the core where the data chunks really reside. Dividing data into smaller segments means you need a meticulous way of storing the references in the form of metadata and making them available to all the processes requiring it.

  • Hadoop YARN: Yet Another Resource Negotiator, this is also known as the data operating system. Starting with Hadoop 2.0, YARN has become the core engine driving the processes efficiently by a prudent resource management framework.

  • Hadoop MapReduce: MapReduce decides the execution logic of what needs to be done with the data. The logic should be designed in such a way that it can execute in parallel with smaller chunks of data residing in a distributed cluster of machines.

On top of this, there are many additional software packages specially designed to work with the Hadoop framework, namely Apache Pig, Hive, HBase, Mahout, Spark, Sqoop, Flume, Oozie, Storm, Solr, and more. All this software exists because of the paradigm shift Hadoop brought to the traditional scheme of relational and small-scale data. We will take a brief look at Apache Pig, Hive, HBase, and Spark in this chapter, as they are among the main pillars of the Hadoop ecosystem. Figure 9-3 shows these tools organized in the Hadoop ecosystem.

Figure 9-3. Hadoop components and tools

We first discuss MapReduce, which sits on top of the YARN layer of Hadoop and acts as the processing engine.

9.2.1 MapReduce

MapReduce is a programming model for designing parallel and distributed algorithms on a cluster of machines. At a broad level, it consists of two procedures: Map, which performs operations like filtering and sorting, processes key-value pairs and generates intermediate key-value pairs; Reduce merges all the intermediate values associated with the same key. If a problem can be expressed this way, then it's possible to use MapReduce to break the problem into smaller parts. Over the years, this model has been successfully used in many real-world problems. In order to understand this model, let's look at a simple example of word count.

9.2.1.1 MapReduce Example: Word Count

Imagine a news aggregator application trying to build an automatic topic generator for all the articles on the web. The first step in the topic generator algorithm is to build a bag of words with their frequencies or, in other words, to count the number of occurrences of each word in an article. Since there is an enormous number of articles on the web, building this topic generator definitely requires huge computational power.

Figure 9-4 shows the MapReduce execution flow: the article is split into many chunks, each processed by the Map function, which generates intermediate key-value pairs of a word and the value 1. Another process, called shuffle, moves the output of the map to the Reducer, where finally the values are added up for each keyword.

Figure 9-4. Word count example using MapReduce
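
Before running word count on Hadoop, it helps to mimic the three stages in plain R on a tiny character vector. This is only a conceptual sketch of the data flow (map, shuffle/group, reduce); it does not use Hadoop, and the two sample lines are assumed to resemble the sample.txt used later in this section:

# Two "lines" of input, as the mappers would receive them
lines <- c("Hi All Welcome to hadoop",
           "hadoop class integrating with R hadoop")

# Map: emit a (word, 1) pair for every word in every line
words <- unlist(strsplit(lines, "\\s+"))
mapped <- data.frame(key = words, value = 1)

# Shuffle/group: bring together all the values that share the same key
grouped <- split(mapped$value, mapped$key)

# Reduce: sum the values for each key
reduced <- sapply(grouped, sum)
reduced["hadoop"]
 hadoop
      3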

Notes:

  • The example needs a Linux/UNIX machine to run.

  • Appropriate system paths need to be defined by the administrators.

  • Here is the system information in which the code was executed.

    1. platform: i686-redhat-linux-gnu

    2. arch: i686

    3. os: Linux-gnu

    4. system: i686, Linux-gnu

    5. major: 3

    6. minor: 1.2

    7. year: 2014

    8. month: 10

    9. day: 31

    10. svn rev: 66913

    11. language: R

    12. version.string: R version 3.1.2 (2014-10-31)

  • The appropriate Hadoop version is required to run the code. This code runs on Hadoop version 2.2.0, build 1529768. Compatibility of this code with the latest version of Hadoop has not been checked.

You must set the environment variable with the location of the Hadoop bin folder and the Hadoop streaming JAR.

Sys.setenv(HADOOP_CMD="/usr/lib/hadoop-2.2.0/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")

Then you install and load the rmr2 and rhdfs libraries. Once they load successfully, you initialize HDFS so that you can read and write data from it.

library(rmr2)
library(rhdfs)


# Hadoop File Operations

#initialize File HDFS
hdfs.init()

Then you put some sample data into HDFS using the hdfs.put() function from the rhdfs library.

#Put File into HDFS                  
hdfs.put("/home/sample.txt","/hadoop_practice")
 [1] TRUE

Then you define the Map and Reduce functions. This code snippet defines the way the Map and Reduce functions scan the text file and tokenize it (a term generally given to splitting a sentence or document by a separator like a space) into key-value pairs for counting.

# Reads a bunch of lines at a time                    
#Map Phase
map <- function(k, lines) {
  # Split each line on whitespace and emit a (word, 1) pair for every word
  words.list <- strsplit(lines, '\\s+')
  words <- unlist(words.list)
  return(keyval(words, 1))
}


#Reduce Phase
reduce <- function(word, counts) {
  # Sum the counts for each word
  keyval(word, sum(counts))
}


#MapReduce Function
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
}

Then you run the wordcount. The wordcount function we defined is now ready to be executed on Hadoop. Before calling the function, ensure that you have set the base directory where the input file exists and where you want to put the output generated by the wordcount function:

# Read text files from the input folder on HDFS
# Save the result in the output folder on HDFS
# Submit the job


basedir <- '/hadoop_practice'
infile  <-file.path(basedir, 'sample.txt')
outfile <-file.path(basedir, 'output')
ret <-wordcount(infile, outfile)

Fetch the results. Once the execution of the wordcount function is complete, you can fetch the results back into R, convert them into a data frame, and sort them, as shown in this code snippet.

# Fetch results from HDFS
result <-from.dfs(outfile)
results.df <-as.data.frame(result, stringsAsFactors=F)
colnames(results.df) <-c('word', 'count')
tail(results.df,100)


          word count
 1           R     1
 2          Hi     1
 3          to     1
 4         All     1
 5        with     1
 6       class     1
 7      hadoop     3
 8     Welcome     1
 9 integrating     1


head(results.df[order(results.df$count, decreasing =TRUE),])

     word count
 7 hadoop     3
 1      R     1
 2     Hi     1
 3     to     1
 4    All     1
 5   with     1

Since the entire book is written in R, we have presented this word count example where R integrates with Hadoop via Hadoop streaming, which is wrapped by the rhdfs and rmr2 packages. For demonstration's sake, such an integration is fine, but in a real production system it might not be a robust solution. Other programming languages like Java, Scala, and Python offer robust production-level code and a tighter coupling with the Hadoop framework. In the coming sections, we will introduce the basics of Hive, Pig, and HBase, and conclude with a real-world example using Spark.

9.2.2 Hive

The most critical paradigm shift required to adopt a big data technology like Hadoop was the ability to read, write, and transform data the way one is used to doing in Relational Database Management Systems (RDBMS) using SQL (Structured Query Language). An RDBMS has a well-structured design of tables grouped into databases that follow a predefined schema. Querying any table is easy if you follow the SQL syntax, logic, and schema properly, and the databases are well managed in a data warehouse.

Now, in order to facilitate the same ease of querying for data stored in HDFS, there was a need for a data warehouse tool that is strongly coupled with HDFS and, at the same time, provides the same querying capabilities as a traditional RDBMS. Apache Hive was developed with this thought at the center of its design principles. Although the underlying storage is HDFS, the data can be structured with a well-defined schema. Among all the tools in the Hadoop ecosystem, Hive is the most used component across the industry. An advanced technical discussion of the Hive architecture and design is beyond the scope of this book; however, we present introductory material here so you can connect it with the larger scheme of things when it comes to big data.

There are many tools in the market that help with large-scale data processing from various sources in a company and put it into a common data platform (Hive is a must-have processing engine in such data platforms), which is then made available across the company to analysts, product managers, developers, operations analysts, and so on. Qubole Data Service is one platform offering such a processing service. It also provides a GUI for writing SQL queries that run on Hive.


9.2.2.1 Creating Tables

The create table query looks very similar to a traditional SQL query; however, what happens in the background is quite different in Hive. Upon successful execution of this query, a new directory for the table is created in HDFS under the default database of the Hive warehouse (see Figure 9-5).

Figure 9-5. The Hive create table command

Figure 9-6 shows the emp_info table in the folder structure /user/hive/warehouse/ of HDFS.

Figure 9-6. Hive table in HDFS

9.2.2.2 Describing Tables

Once the table is created, you can use the command describe formatted emp_info; to verify that the structure of the table matches the one we used during creation. Along with the column names, it also shows the data type of each column (see Figure 9-7).

Figure 9-7. The describe table command

9.2.2.3 Generating Data and Storing it in a Local File

The table is now ready to be loaded with some data. Figure 9-8 shows the generation of some dummy data, which is stored in the local directory in a file named emp_info.

Figure 9-8. Generate data and store in local file

9.2.2.4 Loading the Data into the Hive Table

Once we have the data in a local file, the command load data local inpath '/home/training/emp_info' into table emp_info; loads the data into the Hive table emp_info in HDFS (see Figure 9-9).

Figure 9-9. Load the data into a Hive table

Figure 9-10 shows the data in the HDFS file that we loaded from the local file system.

Figure 9-10. Data in the HDFS file

9.2.2.5 Selecting a Query

Figure 9-11 shows two varieties of the select query. The first is without a where clause and the second uses where dep = 'A'. Notice how the MapReduce framework built into Hadoop comes into play in the Hive query. This is exactly why we associate tools like Hive with the Hadoop ecosystem. The only difference, unlike the word count example, is that we don't have to explicitly define any map or reduce methods; Hive automatically does that for us.

Figure 9-11. Select query with and without a where clause

Apart from these basic commands, Hive supports data partitioning, table joins, multi-inserts, user-defined functions, and data export. This functionality is comprehensive enough for analytical databases to be migrated into Hive.

9.2.3 Apache Pig

Apache Pig is an analytical platform for large datasets. Pig programs, which are written in Pig Latin, are compiled by Pig's infrastructure layer to produce a sequence of MapReduce programs, thus achieving parallelism. Its strong coupling with Hadoop provides the storage structure of HDFS and process handling by YARN.

Let’s revisit our wordcount example from the MapReduce section and see how we write the same example as a series of Pig Latin commands. For detailed documentation on Pig setup and usage, refer to http://pig.apache.org/docs/r0.16.0/start.html .

9.2.3.1 Connecting to Pig

The command pig -x local connects to the local file system. Simply using the command pig in the terminal will connect to HDFS. For our word count example, we will stick with the local file system.

Figure 9-12. Connecting to Pig using local file system

9.2.3.2 Loading the Data

The command A1 = load '/home/training/wc.txt' as (line:chararray); will scan the file and store each line as a character array. The dump A1 command will output the following:

(Hi All Welcome to Hadoop )
(Hadoop class integrating with R Hadoop)
Figure 9-13. Load data into A1

9.2.3.3 Tokenizing Each Line

Next, tokenize each line into words (using Pig's TOKENIZE function, as shown in Figure 9-14) and store them as a bag of tokens. The dump A2 command will output the following:

({(Hi),(All),(Welcome),(to),(Hadoop)})
({(Hadoop),(class),(integrating),(with),(R),(Hadoop)})
Figure 9-14. Tokenize each line

9.2.3.4 Flattening the Tokens

The A3 = foreach A2 generate flatten(tokens) as words; command will further break each tokenized line into individual word tokens. The dump A3 command will output the following:

(Hi)
(All)
(Welcome)
(to)
(Hadoop)
(Hadoop)
(class)
(integrating)
(with)
(R)
(Hadoop)
Figure 9-15. Flattening the tokens

9.2.3.5 Grouping the Words

The command A4 = group A3 by words; will create a key-value pair of each word and a bag containing that word repeated as many times as it occurs in the tokenized list. The dump A4 command will output the following:

(R,{(R)})
(Hi,{(Hi)})
(to,{(to)})
(All,{(All)})
(with,{(with)})
(class,{(class)})
(Hadoop,{(Hadoop),(Hadoop),(Hadoop)})
(Welcome,{(Welcome)})
(integrating,{(integrating)})
Figure 9-16. Group words

9.2.3.6 Counting and Sorting

The following two commands will generate the key-value pair of a word and the number of its occurrence in the document and subsequently sort by count.

  • A5 = foreach A4 generate group,COUNT(A3);

  • A6 = order A5 by $1 desc;

The dump A6 command will output the following:

(Hadoop,3)
(R,1)
(Hi,1)
(to,1)
(All,1)
(with,1)
(class,1)
(Welcome,1)
(integrating,1)
Figure 9-17. Starting HBase

Using Pig, many such analytical workflows involving selection, filtering, joins, unions, sorting, grouping, and transformation can be created with ease on large datasets.

9.2.4 HBase

So far, we have been discussing representing data in a structured format of rows and columns with a predefined schema, which, once defined, is difficult to tweak for changing requirements. In other words, although Hive offers a distributed version of an RDBMS on large datasets, it still requires you to follow a fixed database schema and store the data in the warehouse based on it. However, with rapidly changing data, we need random, real-time reads and writes on large distributed data. In such a scenario, the database can't be relational anymore; it has to be what people in the big data world call NoSQL. HBase was modeled after Google's Bigtable, a distributed storage system for structured data on the Google File System (GFS).

Contrary to a traditional RDBMS, which stores every row of data with all its columns even if there are many null values, and which carries redundant data across tables due to normalization, HBase is a columnar store. This means that each row of data is stored by column family. For example, if you have an employee table with a column family called details, you could store columns like name, age, and qualification under the column family details. If a new column, say address, is needed, it can be added under details in real time without any schema change.
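
To make the column-family idea concrete for R users, here is a small, purely illustrative sketch that models one HBase row as nested named lists (this is not an HBase client; the row key e1 and the column names simply mirror the shell example that follows):

# One row, keyed by 'e1', holding two column families
employee <- list(
  e1 = list(
    details = list(name = "karthik", gender = "m"),
    salary  = list(sal = "20000")
  )
)

# Adding a new column under an existing family needs no schema change
employee$e1$details$address <- "New Delhi"

# Read a single cell, addressed by row key, column family, and column
employee$e1$details$name
 [1] "karthik"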

9.2.4.1 Starting HBase

Start the HBase using the shell script start-hbase.sh. Run the following three commands:

  1. cd /usr/lib/hbase/

  2. sudo bin/start-hbase.sh

  3. hbase shell

Figure 9-18. Create and put data

9.2.4.2 Creating the Table and Put Data

The following commands will create a table named employee with two column families called details and salary. Then they put data under the name and gender columns of the details column family and under the sal column of the salary column family.

  1. create 'employee','details','salary'

  2. put 'employee','e1','details:name','karthik'

  3. put 'employee','e1','details:gender','m'

  4. put 'employee','e1','salary:sal','20000'

Figure 9-19. Scan the data

9.2.4.3 Scanning the Data

Using the command scan 'employee', you can see how the data is stored in HBase. Each row corresponds to the column values under a column family.

A comprehensive reference guide on HBase could be found at http://hbase.apache.org/book.html#arch.overview .

9.2.5 Spark

Spark provides lightning-fast cluster computing (similar to distributed computing, with multiple nodes working together). Spark has an advanced Directed Acyclic Graph (DAG) based execution engine, which makes it up to 100 times faster than Hadoop MapReduce when working in RAM (memory) and up to 10 times faster on disk. Contrary to Hadoop MapReduce, which natively supports only Java, in Spark you can write applications using Java, Scala, Python, and R. If that were not sufficient, Spark also offers SQL, streaming, machine learning, and graph libraries that can be combined in any fashion to create an application pipeline. Apart from accessing data from HDFS, Spark can connect to HBase, Cassandra, S3, and many more sources.

In this chapter, we use SparkR, a lightweight front end for using Apache Spark from R. It's light but very rich in functionality. In a nutshell, SparkR provides the following functionality:

  • You can create SparkDataFrames from local R data frames or Hive tables.

  • SparkDataFrames support operations like selection, grouping, and aggregation, similar to those offered by the dplyr package in R (see the sketch after this list).

  • You can run SQL queries directly on Hive tables from R.

  • It provides a set of machine learning algorithms from Spark's MLlib library.
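
Here is a minimal sketch of the dplyr-like SparkDataFrame operations referred to above, assuming a SparkR session has already been started as shown later in this chapter (the built-in faithful dataset is used only for illustration):

library(SparkR)

# Turn a local R data frame into a distributed SparkDataFrame
df <- createDataFrame(faithful)

# Select a single column
head(select(df, df$eruptions))

# Group and aggregate: count the observations for each waiting time
waiting_counts <- summarize(groupBy(df, df$waiting),
                            count = n(df$waiting))
head(waiting_counts)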

This powerful offering is definitely taking the industry by storm. However, we will keep our focus on Spark's machine learning library, MLlib.

For interested readers, more details on Spark can be found at http://spark.apache.org/docs/latest/index.html .

9.3 Machine Learning in R with Spark

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

  • ML algorithms: Common learning algorithms such as classification, regression, clustering, and collaborative filtering

  • Featurization: Feature extraction, transformation, dimensionality reduction, and selection

  • Pipelines: Tools for constructing, evaluating, and tuning ML pipelines

  • Persistence: Saving and loading algorithms, models, and pipelines

  • Utilities: Linear algebra, statistics, data handling, etc.

Currently, SparkR supports the following machine learning algorithms :

  • Generalized Linear Model

  • Accelerated Failure Time (AFT) Survival Regression Model

  • Naive Bayes Model and KMeans Model

Under the hood, SparkR uses MLlib to train the model. The following code in R is taken from our earlier example of housing price predictions, but this is a scalable version of the model using SparkR.

Note (for Windows users) before running the code, follow these steps:

  1. Download pre-built for Hadoop 2.7 and later Spark release from http://spark.apache.org/downloads.html .

  2. Extract the files into the C:\Spark\spark-2.0.0-bin-hadoop2.7 folder (you can choose your own location).

  3. Create a symbolic link so that the SparkR library shipped inside C:\Spark\spark-2.0.0-bin-hadoop2.7 is visible in your R library folder (for example, under C:\Program Files\R\R-3.2.2\library), using the mklink /D command at the cmd prompt.

  4. Using RStudio or the R command line, test using library (SparkR).

Let’s go into the R code that follows and understand how SparkR helps build a scalable machine learning model with a Spark engine. Keep in mind that the code is executed in a standalone Spark cluster with only one node. The true potential of Spark could only be seen if the same code runs on a large enterprise cluster of computing nodes with Spark.

9.3.1 Setting the Environment Variable

The following commands let R know the locations where the Spark and Hadoop binaries are installed on your machine. Both of these are the same environment variables you would set in your system properties (on Windows machines).

#Set environment variable                
Sys.setenv(SPARK_HOME='C:/Spark/spark-2.0.0-bin-hadoop2.7',HADOOP_HOME='C:/Hadoop-2.3.0')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'),.libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')

9.3.2 Initializing the Spark Session

Once the environment variables are set, initialize the SparkR session with parameters like spark.driver.memory, spark.sql.warehouse.dir, and so on, as shown in the following code snippet. This initialization is required in order for the R environment to connect with Spark running in the local machine.

library(SparkR)
library(rJava)


#The entry point into SparkR is the SparkSession which connects your R program to a Spark cluster
sparkR.session(enableHiveSupport =FALSE, appName ="SparkR-ML",master ="local[*]", sparkConfig =list(spark.driver.memory ="1g",spark.sql.warehouse.dir="C:/Hadoop-2.3.0"))
 Launching java with spark-submit command C:/Spark/spark-2.0.0-bin-hadoop2.7/bin/spark-submit2.cmd   --driver-memory "1g" "sparkr-shell" C:\Users\Karthik\AppData\Local\Temp\Rtmpuoqh3M\backend_port1030727b704d
 Java ref type org.apache.spark.sql.SparkSession id 1

9.3.3 Loading the Data and Running the Pre-Processing

Load the housing data introduced in Chapter 6 and perform the same set of preprocessing steps as shown in the following code snippet:

library(data.table)

#Read the housing data
Data_House_Price <-fread("/Users/karthik/Dropbox/Book Writing - Drafts/Chapter Drafts/Chapter 7 - Machine Learning Model Evaluation/tosend/House Sale Price Dataset.csv",header=T, verbose =FALSE, showProgress =FALSE)


str(Data_House_Price)
 Classes 'data.table' and 'data.frame':   1300 obs. of  14 variables:
  $ HOUSE_ID        : chr  "0001" "0002" "0003" "0004" ...
  $ HousePrice      : int  163000 102000 265979 181900 252000 180000 115000 176000 192000 132500 ...
  $ StoreArea       : int  433 396 864 572 1043 440 336 486 430 264 ...
  $ BasementArea    : int  662 836 0 594 0 570 0 552 24 588 ...
  $ LawnArea        : int  9120 8877 11700 14585 10574 10335 21750 9900 3182 7758 ...
  $ StreetHouseFront: int  76 67 65 NA 85 78 100 NA 43 NA ...
  $ Location        : chr  "RK Puram" "Jama Masjid" "Burari" "RK Puram" ...
  $ ConnectivityType: chr  "Byway" "Byway" "Byway" "Byway" ...
  $ BuildingType    : chr  "IndividualHouse" "IndividualHouse" "IndividualHouse" "IndividualHouse" ...
  $ ConstructionYear: int  1958 1951 1880 1960 2005 1968 1960 1968 2004 1962 ...
  $ EstateType      : chr  "Other" "Other" "Other" "Other" ...
  $ SellingYear     : int  2008 2006 2009 2007 2009 2006 2009 2008 2010 2007 ...
  $ Rating          : int  6 4 7 6 8 5 5 7 8 5 ...
  $ SaleType        : chr  "NewHouse" "NewHouse" "NewHouse" "NewHouse" ...
  - attr(*, ".internal.selfref")=<externalptr>
#Pulling out relevant columns and assigning required fields in the dataset
Data_House_Price <-Data_House_Price[,.(HOUSE_ID,HousePrice,StoreArea,StreetHouseFront,BasementArea,LawnArea,Rating,SaleType)]


#Omit any missing value
Data_House_Price <-na.omit(Data_House_Price)


Data_House_Price$HOUSE_ID <-as.character(Data_House_Price$HOUSE_ID)

9.3.4 Creating SparkDataFrame

Now, create the training and testing SparkDataFrames by splitting the original dataset Data_House_Price into the first two-thirds and the remaining third, for training and testing, respectively. A SparkDataFrame is similar to a data frame in R, which stores tabular data in rows and columns, but Spark's implementation is much more efficient at handling network transfers and processing across thousands of computing nodes.

#Spark Data Frame - Train                  
gaussianDF_train <-createDataFrame(Data_House_Price[1:floor(nrow(Data_House_Price)*(2/3)),])


#Spark Data Frame - Test
gaussianDF_test <-createDataFrame(Data_House_Price[floor(nrow(Data_House_Price)*(2/3) +1):nrow(Data_House_Price),])


class(gaussianDF_train)
 [1] "SparkDataFrame"
 attr(,"package")
 [1] "SparkR"
class(gaussianDF_test)
 [1] "SparkDataFrame"
 attr(,"package")
 [1] "SparkR"

9.3.5 Building the ML Model

Essentially, this is the core of this chapter: the first machine learning model we build that is designed to scale to large datasets. spark.glm is a function in Spark's MLlib library with a scalable implementation of the Generalized Linear Model (GLM). Ideally, nothing changes as far as the syntax goes (except for the function name), but under the hood there could be a large army of nodes working together, automatically running distributed jobs and many other operations supported by Spark to achieve the final outcome.

# Fit a generalized linear model of family "gaussian" with spark.glm
gaussianGLM <- spark.glm(gaussianDF_train,
                         HousePrice ~ StoreArea + StreetHouseFront + BasementArea + LawnArea + Rating + SaleType,
                         family = "gaussian")


# Model summary
summary(gaussianGLM)


 Deviance Residuals:
 (Note: These are approximate quantiles with relative error <= 0.01)
     Min       1Q   Median       3Q      Max  
 -432276   -23923    -4236    16522   380300  


 Coefficients:
                        Estimate  Std. Error  t value   Pr(>|t|)  
 (Intercept)            -80034    32619       -2.4536   0.014387  
 StoreArea              58.172    9.8507      5.9054    5.4833e-09
 StreetHouseFront       136.98    80.828      1.6947    0.090578  
 BasementArea           23.623    3.7224      6.3461    3.9629e-10
 LawnArea               0.77459   0.19875     3.8973    0.0001066
 Rating                 35402     1519.4      23.3      0         
 SaleType_NewHouse      -12979    31904       -0.40681  0.68427   
 SaleType_FirstResale   10117     32497       0.31132   0.75565   
 SaleType_SecondResale  -24563    32480       -0.75626  0.44975   
 SaleType_ThirdResale   -22562    34847       -0.64748  0.51754   
 SaleType_FourthResale  -32205    36778       -0.87567  0.38151   


 (Dispersion parameter for gaussian family taken to be 2012650630)

     Null deviance: 4.9599e+12  on 711  degrees of freedom
 Residual deviance: 1.4109e+12  on 701  degrees of freedom
 AIC: 17286


 Number of Fisher Scoring iterations: 1

9.3.6 Predicting the Test Data

In the final step, you can now predict the house prices on the test dataset using the ML model built in the previous step. Refer to Chapter 6 to understand the evaluation criteria for this model.

#Prediction on the gaussianModel                
gaussianPredictions <-predict(gaussianGLM, gaussianDF_test)
names(gaussianPredictions) <-c('HOUSE_ID','HousePrice','StoreArea','StreetHouseFront','BasementArea','LawnArea','Rating','SaleType','ActualPrice','PredictedPrice')
gaussianPredictions$PredictedPrice <-round(gaussianPredictions$PredictedPrice,2.0)
showDF(gaussianPredictions[,9:10])
 +-----------+--------------+
 |ActualPrice|PredictedPrice|
 +-----------+--------------+
 |   139400.0|      128582.0|
 |   157000.0|      202101.0|
 |   178000.0|      164765.0|
 |   120000.0|       50425.0|
 |   130000.0|      155841.0|
 |   582933.0|      333450.0|
 |   309000.0|      255584.0|
 |   176000.0|      192695.0|
 |   125000.0|      132784.0|
 |   130000.0|      140085.0|
 |   169990.0|      183082.0|
 |   213000.0|      222965.0|
 |   144000.0|      122123.0|
 |   118500.0|      158940.0|
 |   138000.0|      116004.0|
 |   437154.0|      346572.0|
 |   230000.0|      261396.0|
 |    82000.0|       61949.0|
 |    85000.0|      119914.0|
 |   214900.0|      218930.0|
 +-----------+--------------+
 only showing top 20 rows

9.3.7 Stopping the SparkR Session

In the end, when the job is done, execute the following code to free all the resources being held for this process, like CPU and memory.

sparkR.stop()                                              

While this code is running, you can fire up http://localhost:4040/jobs/ in your browser and see the progress of your Spark jobs. For every job that is generated automatically upon the execution of this code, you could look at the DAG visualization and see how the Spark engine actually carries out the job.

To understand how this visualization shows what your application is actually doing on the Spark cluster, follow this blog post from Databricks:

https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

9.4 Machine Learning in R with H2O

As we end this machine learning journey, we want to introduce one more powerful platform for R users, called H2O. We have discussed some powerful machine learning techniques like deep learning, text analysis, and ensembles. Such techniques are often not feasible to execute on individual machines and need high-powered computing.

R is a popular language that is remarkably adaptable to different platforms, and it provides options for integrating with powerful high-performance computing environments. In previous chapters and sections, we showed some examples, like Microsoft Cognitive Services, Spark, and other Apache products. In this last section, we introduce H2O, an open source, high-performance cluster for big data analysis.

H2O is developed and maintained by H2O.ai, formerly 0xdata, a startup founded in 2011. H2O is marketed as "The Open Source In-Memory Prediction Engine for Big Data Science." It offers an impressive array of machine learning algorithms. The H2O R package provides functions for building GLM, GBM, K-means, Naive Bayes, Principal Components Analysis, random forest, and deep learning (multi-layer neural net) models.

H2O is a Java Virtual Machine that is optimized for doing “in-memory” processing of distributed, parallel machine learning algorithms on clusters. A “cluster” is a software construct that can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the latest documentation, the H2O software can be run on conventional operating systems like Microsoft Windows (7 or later), Mac OS X (10.9 or later), and Linux (Ubuntu 12.04; RHEL/CentOS 6 or later). It also runs on big data systems, particularly Apache Hadoop Distributed File System (HDFS), and is available on several popular virtual machines like Cloudera (5.1 or later), MapR (3.0 or later), and Hortonworks (HDP 2.1 or later). It also operates on cloud computing environments, for example using Amazon EC2, Google Compute Engine, and Microsoft Azure. The H2O Sparkling Water software is databricks-certified on Apache Spark.

For R, the H2O package is available on CRAN. Before you proceed to the demo of H2O, we recommend you follow these URLs, which have some well-documented materials:

9.4.1 Installation of Packages

Once you are done installing the prerequisites, the following code will fetch the latest release of the H2O package for R and install it on the local system.

Notes:

  • A good Internet connection is recommended before you try this code. All computations are performed (in highly optimized Java code) in the H2O cluster and initiated by REST calls from R.

  • It’s advisable not to experiment with this code on your local machine with large volumes of data (it’s safe to run the demos shown in the following code on your local machine).

# The following two commands remove any previously installed H2O packages for R.                  
if ("package:h2o" %in%search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in%rownames(installed.packages())) { remove.packages("h2o") }


# Next, we download, install and initialize the H2O package for R.
install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))


#Alternatively you can install the package h2o from CRAN as below
install.packages("h2o")

9.4.2 Initialization of H2O Clusters

Once the installation is done, you can fire up an instance of the cluster for computation by calling the h2o.init() function.

# Load the h2o library in R                  
library(h2o);
#Initiate a cluster in your machine
localH2O =h2o.init()
The above function will return an output saying Connection successful as shown below:
 Starting H2O JVM and connecting: .... Connection successful!


 R is connected to the H2O cluster:
     H2O cluster uptime:         4 seconds 188 milliseconds
     H2O cluster version:        3.10.0.6
     H2O cluster version age:    1 month and 9 days  
     H2O cluster name:           H2O_started_from_R_abhisheksingh_zve484
     H2O cluster total nodes:    1
     H2O cluster total memory:   0.89 GB
     H2O cluster total cores:    4
     H2O cluster allowed cores:  2
     H2O cluster healthy:        TRUE
     H2O Connection ip:          localhost
     H2O Connection port:        54321
     H2O Connection proxy:       NA
     R Version:                  R version 3.2.3 (2015-12-10)


 Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
        Shut down and restart H2O as shown below to use all your CPUs.
            > h2o.shutdown()
            > h2o.init(nthreads = -1)

Once you have initiated a cluster on your local machine, you are ready to run your computations on H2O's high-performance clusters. There are lots of other examples to get you started with Gradient Boosting Machines (GBM), Generalized Linear Models (GLM), ensemble trees, and many more.
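
As one illustrative sketch (not one of the book's demos), the following fits a small GBM on the same prostate dataset that ships with the h2o package; the choice of 50 trees and the use of the full data for training are assumptions made purely for demonstration:

library(h2o)
h2o.init()

# Load the prostate dataset bundled with the h2o package
prostate.hex <- h2o.uploadFile(path = system.file("extdata", "prostate.csv", package = "h2o"),
                               destination_frame = "prostate.hex")

# Treat the response as a factor so that GBM performs classification
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)

# Fit a gradient boosting machine on all predictors except the ID column
gbm_model <- h2o.gbm(x = setdiff(colnames(prostate.hex), c("ID", "CAPSULE")),
                     y = "CAPSULE",
                     training_frame = prostate.hex,
                     ntrees = 50)

# Inspect the training metrics (AUC, confusion matrix, and so on)
h2o.performance(gbm_model)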

9.4.3 Deep Learning Demo in R with H2O

The following code runs a built-in demo of deep learning using the demo function with the parameter h2o.deeplearning, which internally makes the REST API calls to the local H2O cluster. In brief, the code:

  1. Imports a built-in dataset named prostate.csv, parses it, and prints a summary. The data was collected by Dr. Donn Young at the Ohio State University Comprehensive Cancer Center for a study of patients with varying degrees of prostate cancer. The goal of this demo was to predict whether a tumor has penetrated the prostate capsule based on the variables measured at a baseline exam. The metadata is shown in Figure 9-20.

    Figure 9-20. Feature definition of prostate cancer dataset
  2. Then, it runs deep learning on the dataset to predict the tumor penetration of the prostate cancer.

This demo runs H2O on localhost:54321 .

9.4.3.1 Running the Demo

The demo function runs all at once and prints the entire output in one go. However, for a better understanding of what the function does, we have split the output and explained each part in detail.

# Run Deep learning demo                  
demo(h2o.deeplearning)

The demo runs.

9.4.3.2 Loading the Testing Data

Load the data from the local file system directory of R where the H2O package is installed. It might look like C:\Users\Karthik\Documents\R\win-library\3.2\h2o\extdata.

 > prostate.hex = h2o.uploadFile(path = system.file("extdata", "prostate.csv", package="h2o"), destination_frame = "prostate.hex")                                                                              

Summary output. The summary output should match the feature definition as per Figure 9-20.

 > summary(prostate.hex)
  ID               CAPSULE          AGE             RACE           
  Min.   :  1.00   Min.   :0.0000   Min.   :43.00   Min.   :0.000  
  1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   1st Qu.:1.000  
  Median :190.50   Median :0.0000   Median :67.00   Median :1.000  
  Mean   :190.50   Mean   :0.4026   Mean   :66.04   Mean   :1.087  
  3rd Qu.:285.25   3rd Qu.:1.0000   3rd Qu.:71.00   3rd Qu.:1.000  
  Max.   :380.00   Max.   :1.0000   Max.   :79.00   Max.   :2.000  
  DPROS           DCAPS           PSA               VOL            
  Min.   :1.000   Min.   :1.000   Min.   :  0.300   Min.   : 0.00  
  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:  4.900   1st Qu.: 0.00  
  Median :2.000   Median :1.000   Median :  8.664   Median :14.20  
  Mean   :2.271   Mean   :1.108   Mean   : 15.409   Mean   :15.81  
  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 17.063   3rd Qu.:26.40  
  Max.   :4.000   Max.   :2.000   Max.   :139.700   Max.   :97.60  
  GLEASON        
  Min.   :0.000  
  1st Qu.:6.000  
  Median :6.000  
  Mean   :6.384  
  3rd Qu.:7.000  
  Max.   :9.000  

Model building. The function h2o.deeplearning builds a deep learning model using CAPSULE as the response variable and the rest of the variables as predictors. Additional parameters are:

  • hidden, which specifies the hidden layer sizes

  • activation, which specifies the type of activation function; the demo uses the Tanh function

  • epochs, which tells the neural network how many times the dataset should be iterated (streamed)

 > # Set the CAPSULE column to be a factor column then build model.
 > prostate.hex$CAPSULE = as.factor(prostate.hex$CAPSULE)


 > model = h2o.deeplearning(x = setdiff(colnames(prostate.hex), c("ID","CAPSULE")), y = "CAPSULE", training_frame = prostate.hex, activation = "Tanh", hidden = c(10, 10, 10), epochs = 10000)

Print the output. The output of the five-layer deep neural network is printed by accessing the model_summary field from the model object created in the previous step.

 > print(model@model$model_summary)
 Status of Neuron Layers: predicting CAPSULE, 2-class classification, bernoulli distribution, CrossEntropy loss, 322 weights/biases, 8.4 KB, 3,800,000 training samples, mini-batch size 1


   layer units    type dropout       l1       l2 mean_rate rate_rms
 1     1     7   Input  0.00 %                                     
 2     2    10    Tanh  0.00 % 0.000000 0.000000  0.004538 0.009754
 3     3    10    Tanh  0.00 % 0.000000 0.000000  0.007007 0.011632
 4     4    10    Tanh  0.00 % 0.000000 0.000000  0.003262 0.005256
 5     5     2 Softmax         0.000000 0.000000  0.002906 0.000392


   momentum mean_weight weight_rms mean_bias bias_rms
 1                                                   
 2 0.000000   -0.118311   1.642809 -0.152061 1.519672
 3 0.000000    0.018304   1.594797 -0.470666 0.681625
 4 0.000000   -0.063209   1.924838 -0.545838 0.903191
 5 0.000000    0.495293   4.894484  0.012870 2.835105

Make the prediction. Since the dataset is small, we haven't split the data into training and testing datasets, but rather show the predictions on the same dataset used for training. However, when sufficient data is available, you are encouraged to run the prediction on a testing dataset to better understand the efficacy of the model.

 > # Make predictions with the trained model with training data.
 > predictions = predict(object = model, newdata = prostate.hex)


 > # Export predictions from H2O Cluster as R dataframe.
 > predictions.R = as.data.frame(predictions)


 > head(predictions.R)
   predict           p0           p1
 1       0 9.984036e-01 1.596373e-03
 2       0 9.999973e-01 2.683004e-06
 3       0 9.731078e-01 2.689217e-02
 4       0 9.496504e-01 5.034956e-02
 5       0 9.996701e-01 3.298716e-04
 6       1 4.167409e-07 9.999996e-01


 > tail(predictions.R)
     predict          p0           p1
 375       0 0.999999999 7.078566e-10
 376       0 0.986077940 1.392206e-02
 377       0 0.998982044 1.017956e-03
 378       1 0.008513801 9.914862e-01
 379       0 1.000000000 5.989944e-11
 380       0 1.000000000 2.681686e-14
 

Model evaluation. The model reports an AUC of 0.996, and the confusion matrix gives an overall training accuracy of (223 + 150)/380 ≈ 98%, which is exceptionally good. The other measures in the output, for example Mean Square Error (MSE) and the Gini index, were discussed in detail throughout Chapter 6.

 > # Check performance of classification model.
 > performance = h2o.performance(model = model)


 > print(performance)
 H2OBinomialMetrics: deeplearning
 ** Reported on training data. **
 ** Metrics reported on full training frame **


 MSE:  0.01764182
 RMSE:  0.1328225
 LogLoss:  0.0741766
 Mean Per-Class Error:  0.01861449
 AUC:  0.9958826
 Gini:  0.9917653


 Confusion Matrix for F1-optimal threshold:
          0   1    Error    Rate
 0      223   4 0.017621  =4/227
 1        3 150 0.019608  =3/153
 Totals 226 154 0.018421  =7/380


 Maximum Metrics: Maximum metrics at their respective thresholds
                         metric threshold    value idx
 1                       max f1  0.347034 0.977199 114
 2                       max f2  0.347034 0.979112 114
 3                 max f0point5  0.730649 0.983718 106
 4                 max accuracy  0.551164 0.981579 110
 5                max precision  1.000000 1.000000   0
 6                   max recall  0.007983 1.000000 152
 7              max specificity  1.000000 1.000000   0
 8             max absolute_mcc  0.347034 0.961761 114
 9   max min_per_class_accuracy  0.347034 0.980392 114
 10 max mean_per_class_accuracy  0.347034 0.981386 114

More demos in the H2O package. Running the following command will list all the available demos in H2O, which you can run and then observe how the model building process is followed for each specific ML algorithm.

demo(package = "h2o")
Demos in package ‘h2o’:


h2o.anomaly                  H2O anomaly using prostate cancer data
h2o.deeplearning             H2O deeplearning using prostate cancer data
h2o.gbm                      H2O generalized boosting machines using prostate cancer data
h2o.glm                      H2O GLM using prostate cancer data
h2o.glrm                     H2O GLRM using walking gait data
h2o.kmeans                   H2O K-means using prostate cancer data
h2o.naiveBayes               H2O naive Bayes using iris and Congressional voting data
h2o.prcomp                   H2O PCA using Australia coast data
h2o.randomForest             H2O random forest classification using iris data

9.5 Summary

In the days to come, as the cost of infrastructure goes down and data volumes increase, the need to scale up will become the first priority in the machine learning process flow. Every application built on machine learning has to start with scalable implementation in mind. Most traditional RDBMS systems will soon become obsolete as data starts to explode in size. The giants of the industry have already taken the first steps toward migrating to systems that support large scale and the agility to change with business needs. In the not so distant future, a greater emphasis on efficient algorithmic designs and on subjects like quantum computing will start to appear as the answers to growing data volumes are addressed by another wave of disruptive technology.

We have taken a comprehensive journey into the world of machine learning, drawing inspiration from the fast-growing methodologies and techniques of data science. Though the vast majority of the ML model building process flow exists and is explained with much elegance in the classic literature, we felt a need to stitch the ML model building process flow together with the modern thinking emerging from data science.

We have also simplified the statistics and mathematics wherever possible to make the study of ML more practical, and we give plenty of additional resources for further reading. Topics like sampling, regression models, and deep learning are so deep and diverse that each of them could fill a book of equal size. However, the practical applicability of such algorithms is made possible by the plethora of R packages available on CRAN.

Since R is the preferred programming language for beginners as well as advanced users building quick ML prototypes around real-world problems, we chose R to demonstrate all the examples in the book. If you want to pursue machine learning for your career or research work, a fine balance of skills in computer science, statistics, and domain knowledge will prove useful.

9.6 References

  1. Hadoop: The Definitive Guide, by Tom White.

  2. Big Data, Data Mining, and Machine Learning by Jared Dean.

  3. Cole, Richard; Vishkin, Uzi (1986), "Deterministic coin tossing with applications to optimal parallel list ranking," Information and Control, 70 (1): 32–53, doi:10.1016/S0019-9958(86)80023-7.

  4. Introduction to Algorithms (1st ed.), Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990), MIT Press, ISBN 0-262-03141-8.

  5. The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, Google.

  6. "MapReduce: Simplified Data Processing on Large Clusters," by Jeffrey Dean and Sanjay Ghemawat, Google.
