2. Train the base learners on subsets of the original training dataset.
3. Utilize distributed or parallelized base learning algorithms.
4. Employ online learning techniques to avoid memory-wise scalability limitations.
5. Implement the ensemble (and/or base learners) in a scalable language, such as
C++, Java, Scala, or Julia.
Currently, there are three implementations of the Super Learner algorithm that have an
R interface. The SuperLearner and subsemble [27] R packages are implemented entirely in
R, although they can make use of base learning algorithms that are written in compiled
languages as long as there is an R interface available. Often, the main computational
tasks of machine learning algorithms accessible via R packages are written in Fortran (e.g.,
randomForest, glmnet) or C++ (e.g., e1071's interface to LIBSVM [6]), and the runtime
of certain algorithms can be reduced by linking R to an optimized BLAS (Basic Linear
Algebra Subprograms) library, such as OpenBLAS [46], ATLAS [44], or Intel MKL [19].
These techniques may provide additional speed in training, but they do not necessarily resolve memory-related scalability issues. Because at least one copy of the full training dataset must typically reside in memory in R, this remains an inherent limitation on the scalability of these implementations.
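As a quick check, the short R sketch below shows how to confirm which BLAS and LAPACK libraries R is linked against (reported by sessionInfo() in recent versions of R) and how to time a dense matrix product, a computation whose runtime typically drops markedly under an optimized BLAS:

sessionInfo()            # recent versions of R list the BLAS and LAPACK libraries in use
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)     # compare this timing before and after linking an optimized BLAS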
A more scalable implementation of the Super Learner algorithm is available in the
h2oEnsemble R package [25]. The H2O Ensemble implementation uses R to interface with
distributed base learning algorithms from the high-performance, open source Java machine
learning library, H2O [14]. Each of these three Super Learner implementations is at a
different stage of development and has benefits and drawbacks compared to the others,
but all three projects are being actively developed and maintained.
The main challenge in writing a Super Learner implementation is not implementing
the ensemble algorithm itself. In fact, the Super Learner algorithm simply organizes the
cross-validated output from the base learners and applies the metalearning algorithm to this
derived dataset. Some thought must be given to the parallelization aspects of the algorithm,
but this is typically a straightforward exercise, given the computational independence of the
cross-validation and base learning steps. One of the main software engineering tasks in any
Super Learner implementation is creating a unified interface to a large collection of base
learning and metalearning algorithms. A Super Learner implementation must include a
novel or third-party machine learning algorithm interface that allows users to specify the
base learners in a common format. Ideally, the users of the software should be able to define
their own base learning functions that specify an algorithm and set of model parameters in
addition to any default algorithms that are provided within the software. The performance
of the Super Learner is determined by the combined performance of the base learners, so
having a rich library of machine learning algorithms accessible in the ensemble software is
important.
The metalearning methods can use the same interface as the base learners, which simplifies the implementation. The metalearner is just another algorithm, although it is common to form a nonnegative linear combination of the base algorithms using a method such as nonnegative least squares (NNLS). However, if the loss function of interest to the user is unrelated to the objective
functions associated with the base learning algorithms, then a linear combination of the
base learners that minimizes the user-specified loss function can be learned using a nonlinear
optimization library, such as NLopt. In classification problems, this is particularly relevant
in the case where the outcome variable in the training set is highly imbalanced. NLopt
provides a common interface to a number of different algorithms that can be used to solve
this problem. There are also methods that allow for constraints, such as nonnegativity
(α_l ≥ 0) and convexity (∑_{l=1}^{L} α_l = 1) of the weights. Using one of several nonlinear
optimization algorithms, such as L-BFGS-B, Nelder-Mead, or COBYLA, it is possible to
find a linear combination of the base learners that specifically minimizes the loss function of
interest.
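The sketch below illustrates this idea outside of any particular package; it assumes that Z is the n × L matrix of cross-validated base learner predictions (the level-one data), that y is a binary outcome vector, and that the cvAUC package is installed. The weights are constrained to be nonnegative through an absolute-value reparameterization and are renormalized afterward to form a convex combination that maximizes AUC, a loss of particular interest for imbalanced classification problems:

library(cvAUC)

# Negative AUC of a weighted combination of the cross-validated base learner predictions
negAUC <- function(par, Z, y) {
  alpha <- abs(par)                                 # enforce nonnegativity, alpha_l >= 0
  -cvAUC::AUC(predictions = as.vector(Z %*% alpha), labels = y)
}

L <- ncol(Z)
opt <- optim(par = rep(1 / L, L), fn = negAUC, Z = Z, y = y,
             method = "Nelder-Mead")                # derivative-free; AUC is not smooth
alpha <- abs(opt$par) / sum(abs(opt$par))           # rescale so the weights sum to one
ensemble_pred <- as.vector(Z %*% alpha)

Because AUC is invariant to rescaling of the predictions, the renormalization does not change the achieved objective value. The nloptr package provides an R interface to NLopt for users who prefer to impose the nonnegativity and sum-to-one constraints explicitly, for example with COBYLA.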
19.3.1 SuperLearner R Package
As is common for many statistical algorithms, the original implementation of the Super
Learner algorithm was written in R. The SuperLearner R package, first released in 2010, is
actively maintained with new features being added periodically. This package implements
the Super Learner algorithm and provides a unified interface to a diverse set of machine
learning algorithms that are available in the R language. The software is extensible in the
sense that the user can define custom base-learner function wrappers and specify them
as part of the ensemble; however, there are about 30 algorithm wrappers provided by the
package by default. The main advantage of an R implementation is direct access to the rich
collection of machine learning algorithms that already exist within the R ecosystem. The
main disadvantage of an R implementation is memory-related scalability.
Because the base learners are trained independently from each other, the training of the
constituent algorithms can be done in parallel. The embarrassingly parallel nature of the
cross-validation and base learning steps of the Super Learner algorithm can be exploited
in any language. If there are L base learners and V cross-validation folds, there are L × V
independent computational tasks involved in creating the level-one data. The SuperLearner
package provides functionality to parallelize the cross-validation step via multicore or SNOW
(Simple Network of Workstations) [40] clusters.
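For example, the sketch below fits a small ensemble in parallel using the package's multicore and SNOW interfaces; it assumes a binary outcome vector Y, a data frame of predictors X, and that the glmnet and randomForest packages are installed:

library(SuperLearner)
SL.library <- c("SL.glmnet", "SL.randomForest", "SL.glm")

# Multicore parallelization (forked R processes; not available on Windows)
fit_mc <- mcSuperLearner(Y = Y, X = X, family = binomial(),
                         SL.library = SL.library, cvControl = list(V = 10))

# SNOW-style socket cluster with four workers
library(parallel)
cl <- makeCluster(4, type = "PSOCK")
clusterSetRNGStream(cl, iseed = 1)
fit_snow <- snowSuperLearner(cluster = cl, Y = Y, X = X, family = binomial(),
                             SL.library = SL.library, cvControl = list(V = 10))
stopCluster(cl)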
The R language and its third-party libraries are not particularly well known for memory
efficiency, so depending on the specifications of the machine or cluster that is being used,
it is possible to run out of memory while attempting to train the ensemble on large
training sets. Because the SuperLearner package relies on third-party implementations of
the base learning algorithms, the scalability of Super Learner is tied to the scalability of
the base learner implementations used in the ensemble. Selecting a single model from a group of candidate algorithms based on cross-validated performance is computationally equivalent to generating the level-one data in the Super Learner algorithm.
If cross-validation is already being employed as a means of grid-search-based model selection
among a group of candidate learning algorithms, the addition of the metalearning step is
a computationally minimal burden. However, a Super Learner ensemble can result in a
significant boost in overall model performance over a single base learner model.
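The CV.SuperLearner function makes this comparison explicit by adding an outer layer of cross-validation; a minimal sketch, reusing Y, X, and SL.library from the example above, is as follows:

cv_fit <- CV.SuperLearner(Y = Y, X = X, family = binomial(),
                          SL.library = SL.library, cvControl = list(V = 10))
summary(cv_fit)   # cross-validated risk of the ensemble and of each base learner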
19.3.2 Subsemble R Package
The subsemble R package implements the Subsemble algorithm [36], a variant of Super
Learning, which ensembles base models trained on subsets of the original data. Specifically,
the disjoint union of the subsets is the full training set. As a special case, when the number of subsets is one, the package also implements the Super Learner algorithm.
The Subsemble algorithm can be used as a stand-alone ensemble algorithm or as the
base learning algorithm in the Super Learner algorithm. Empirically, it has been shown that
Subsemble can provide better prediction performance than fitting a single algorithm once
on the full available dataset [36], although this is not always the case.
An oracle result shows that Subsemble performs as well as the best possible combination
of the subset-specific fits. The Super Learner has more powerful asymptotic properties; it
performs as well as the best possible combination of the base learners trained on the full
dataset. However, when used as a stand-alone ensemble algorithm, Subsemble offers great
computational flexibility, in that the training task can be scaled to any size by changing the
number, or size, of the subsets. This allows the user to effectively flatten the training process
into a task that is compatible with available computational resources. If parallelization is
used effectively, all subset-specific fits can be trained at the same time, drastically increasing
the speed of the training process. Because the subsets are typically much smaller than the
original training set, this also reduces the memory requirements of each node in the cluster.
The computational flexibility and speed of the Subsemble algorithm offer a unique solution
to scaling ensemble learning to big data problems.
In the subsemble package, the J subsets can be created by the software at random, or
the subsets can be explicitly specified by the user. Given L base learning algorithms and J
subsets, a total of L × J subset-specific fits will be trained and included in the Subsemble
(by default). This construction allows each base learning algorithm to see each subset of
the training data, so in this sense, there is a similarity to ensembles trained on the full
data. To distinguish the variations on this theme, this type of ensemble construction is
referred to as a cross-product Subsemble. The subsemble package also implements what are
called divisor Subsembles, a structure that can be created if the number of unique base
learning algorithms is a divisor of the number of subsets. In this case, there are only J total
subset-specific fits that make up the ensemble, and each learner only sees approximately
n/J observations from the full training set (assuming that the subsets are of equal size). For
example, if L = 2 and J = 10, then each of the two base learning algorithms would be used
to train five subset-specific fits and would only see a total of 50% of the original training
observations. This type of Subsemble allows for quicker training, but will typically result in
less accurate models. Therefore, the cross-product method is the default Subsemble type in
the software.
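A minimal sketch of fitting a cross-product Subsemble with two base learners and three subsets is shown below; it assumes that x (a data frame of predictors), y (a binary outcome), and newx (test set predictors) are already defined, and the multiType option of learnControl selects between the cross-product and divisor constructions:

library(subsemble)

fit <- subsemble(x = x, y = y, newx = newx, family = binomial(),
                 learner = c("SL.glmnet", "SL.randomForest"),
                 metalearner = "SL.glm",
                 subsets = 3,                                   # J = 3 randomly created subsets
                 learnControl = list(multiType = "crossprod"))  # or "divisor"
pred <- fit$pred   # predictions on newx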
An algorithm called Supervised Regression Tree Subsemble or SRT Subsemble [35] is also
on the development road map for the subsemble package. SRT Subsemble is an extension of
the regular Subsemble algorithm, which provides a means of learning the optimal number
and constituency of the subsets. This method incurs an additional computational cost, but
can provide greater model performance for the Subsemble.
19.3.3 H2O Ensemble
The H2O Ensemble software contains an implementation of the Super Learner ensemble
algorithm that is built on the distributed, open source, Java-based machine learning platform
for big data, H2O. H2O Ensemble is currently implemented as a stand-alone R package called
h2oEnsemble that makes use of the h2o package, the R interface to the H2O platform.
There are a handful of powerful supervised machine learning algorithms supported by the
h2o package, all of which can be used as base learners for the ensemble. This includes a
high-performance method for deep learning, which allows the user to create ensembles of
deep neural nets or combine the power of deep neural nets with other algorithms, such as
Random Forest or Gradient Boosting Machines (GBMs) [12].
Because the H2O machine learning platform was designed with big data in mind, each
of the H2O base learning algorithms is scalable to very large training sets and enables
parallelism across multiple nodes and cores. The H2O platform comprises a distributed
in-memory parallel computing architecture and has the ability to seamlessly use datasets
stored in Hadoop Distributed File System (HDFS), Amazon’s S3 cloud storage, NoSQL,
and SQL databases in addition to CSV files stored locally or in distributed filesystems. The
H2O Ensemble project aims to match the scalability of the H2O algorithms, so although
the ensemble uses R as its main user interface, most of the computations are performed in
Java via H2O in a distributed, scalable fashion.
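For instance, a typical session launches the H2O cluster from R and imports the training and test sets directly into distributed H2O frames. The sketch below uses hypothetical HDFS paths; note that the exact arguments of h2o.importFile have varied across h2o releases (older releases also require the connection object):

library(h2o)
h2o.init(nthreads = -1)   # start (or connect to) an H2O cluster, using all available cores

# Import data from HDFS (or local disk, S3, etc.) into distributed H2O frames
data    <- h2o.importFile(path = "hdfs://namenode/data/train.csv")
newdata <- h2o.importFile(path = "hdfs://namenode/data/test.csv")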
There are several publicly available benchmarks of the H2O algorithms. Notably,
the H2O GLM implementation has been benchmarked on a training set of one billion
observations [13]. This benchmark training set is derived from the Airline Dataset [31],
which has been called the Iris dataset for big data. The one billion row training set is a
42 GB CSV file with 12 feature columns (9 numerical features, 3 categorical features with
cardinalities 30, 376, and 380) and a binary outcome. Using a 48-node cluster (8 cores on
each node, 15 GB of RAM, and a 1 Gb/s interconnect), the H2O GLM can be trained in
5.6 s. The H2O algorithm implementations aim to scale to datasets of any size, so that the entire training set, rather than a subset, can be used to train models.
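Such a model is trained with a single call to h2o.glm; the sketch below uses a hypothetical outcome column name, and the name of the data argument (data in older h2o releases, training_frame in newer ones) depends on the installed version:

y <- "IsDepDelayed"            # hypothetical binary outcome column
x <- setdiff(names(data), y)   # remaining columns serve as predictors
glm_fit <- h2o.glm(x = x, y = y, training_frame = data, family = "binomial")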
H2O Ensemble takes a different approach to scaling the Super Learner algorithm than
the subsemble or SuperLearner R packages. Because the subsemble and SuperLearner
ensembles rely on third-party R algorithm implementations that are typically single
threaded, the parallelism of these two implementations occurs in the cross-validation and
base learning steps. In the SuperLearner implementation, the ability to take advantage of
multiple cores is strictly limited by the number of cross-validation folds and number of base
learners. With subsemble, the scalability of the ensemble can be improved by increasing the
number of subsets used; however, this may lead to a decrease in model performance. Unlike
most third-party machine learning algorithms that are available in R, the H2O base learning
algorithms are implemented in a distributed fashion and can scale to all available cores in a
multicore or multinode cluster. In the current release of H2O Ensemble, the cross-validation
and base learning steps of the ensemble algorithm are performed in serial; however, each
serial training step is maximally parallelized across all available cores in a cluster. The H2O
Ensemble implementation could possibly be re-architected to parallelize the cross-validation
and base learning steps; however, it is unknown at this time how that may affect runtime
performance.
19.3.3.1 R Code Example
The following R code example demonstrates how to create an ensemble of a Random Forest
and two Deep Neural Nets using the h2oEnsemble R interface. The code below also illustrates the current method for defining custom base-learner functions. The h2oEnsemble
package comes with four base learner function wrappers; however, to create a base learner
with nondefault model parameters, the user can pass along nondefault function arguments
as shown. The user must also specify a metalearning algorithm, and in this example, a GLM
wrapper function is used.
library("SuperLearner") # For "SL.nnls" metalearner function
library("h2oEnsemble")
# Create custom base learner functions using non-default model params:
h2o_rf_1 <- function(..., family = "binomial",
                     ntree = 500,
                     depth = 50,
                     mtries = 6,
                     sample.rate = 0.8,
                     nbins = 50,
                     nfolds = 0) {
  h2o.randomForest.wrapper(..., family = family, ntree = ntree,
                           depth = depth, mtries = mtries, sample.rate = sample.rate,
                           nbins = nbins, nfolds = nfolds)
}

h2o_dl_1 <- function(..., family = "binomial",
                     nfolds = 0,
                     activation = "RectifierWithDropout",
                     hidden = c(200, 200),
                     epochs = 100,
                     l1 = 0,
                     l2 = 0) {
  h2o.deeplearning.wrapper(..., family = family, nfolds = nfolds,
                           activation = activation, hidden = hidden, epochs = epochs,
                           l1 = l1, l2 = l2)
}

h2o_dl_2 <- function(..., family = "binomial",
                     nfolds = 0,
                     activation = "Rectifier",
                     hidden = c(200, 200),
                     epochs = 100,
                     l1 = 0,
                     l2 = 1e-05) {
  h2o.deeplearning.wrapper(..., family = family, nfolds = nfolds,
                           activation = activation, hidden = hidden, epochs = epochs,
                           l1 = l1, l2 = l2)
}
The h2o.ensemble function follows the same interface conventions as the other algorithm functions in the h2o R package. This includes the x and y arguments, which are
the column names of the predictor variables and outcome variable, respectively. The data
object is a reference to the training dataset, which exists in Java memory. The family
argument is used to specify the type of prediction (i.e., classification or regression). The
predict.h2o.ensemble function uses the predict(object, newdata) interface that is
common to most machine learning software packages in R. After specifying the base learner
library and the metalearner, the ensemble can be trained and tested:
# Set up the ensemble
learner <- c("h2o_rf_1", "h2o_dl_1", "h2o_dl_2")
metalearner <- "SL.nnls"
# Train the ensemble using 2-fold CV to generate level-one data
# More CV folds will increase runtime, but should increase performance
fit <- h2o.ensemble(x = x, y = y, data = data, family = "binomial",
learner = learner, metalearner = metalearner,
cvControl = list(V = 2))
# Generate predictions on the test set
pred <- predict(fit, newdata)
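A typical next step is to evaluate the ensemble's performance on the test set, for example by computing AUC with the cvAUC package. The sketch below assumes a binary outcome and the prediction object structure used by recent versions of h2oEnsemble, in which the third column of the prediction frame contains P(Y = 1):

library(cvAUC)
predictions <- as.data.frame(pred$pred)[, 3]   # third column holds the predicted P(Y = 1)
labels <- as.data.frame(newdata[, y])[, 1]     # true outcome values from the test set
cvAUC::AUC(predictions = predictions, labels = labels)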
19.3.4 Performance Benchmarks
The H2O Ensemble was benchmarked on Amazon’s Elastic Compute Cloud (EC2) to
demonstrate the practical use of the Super Learner algorithm on big data. The instance type
used across all benchmarks is EC2's c3.8xlarge type, which has 32 virtual CPUs (vCPUs) and 60 GB of RAM.