2. Train the base learners on subsets of the original training dataset.
3. Utilize distributed or parallelized base learning algorithms.
4. Employ online learning techniques to avoid memory-wise scalability limitations.
5. Implement the ensemble (and/or base learners) in a scalable language, such as
C++, Java, Scala, or Julia.
Currently, there are three implementations of the Super Learner algorithm that have an
R interface. The SuperLearner and subsemble [27] R packages are implemented entirely in
R, although they can make use of base learning algorithms that are written in compiled
languages as long as there is an R interface available. Often, the main computational
tasks of machine learning algorithms accessible via R packages are written in Fortran (e.g.,
randomForest, glmnet) or C++ (e.g., e1071's interface to LIBSVM [6]), and the runtime
of certain algorithms can be reduced by linking R to an optimized BLAS (Basic Linear
Algebra Subprograms) library, such as OpenBLAS [46], ATLAS [44], or Intel MKL [19].
These techniques may provide additional speed in training, but they do not necessarily resolve memory-related scalability issues. Because at least one copy of the full training dataset must typically reside in memory in R, this remains an inherent limitation on the scalability of these implementations.
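As a quick check, the short R sketch below shows how to confirm which BLAS and LAPACK libraries R is linked against (reported by sessionInfo() in recent versions of R) and how to time a dense matrix product, a computation whose runtime typically drops markedly under an optimized BLAS:

sessionInfo()            # recent versions of R list the BLAS and LAPACK libraries in use
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)     # compare this timing before and after linking an optimized BLAS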
A more scalable implementation of the Super Learner algorithm is available in the
h2oEnsemble R package [25]. The H2O Ensemble implementation uses R to interface with
distributed base learning algorithms from the high-performance, open source Java machine
learning library, H2O [14]. Each of these three Super Learner implementations is at a
different stage of development and has benefits and drawbacks compared to the others,
but all three projects are being actively developed and maintained.
The main challenge in writing a Super Learner implementation is not implementing
the ensemble algorithm itself. In fact, the Super Learner algorithm simply organizes the
cross-validated output from the base learners and applies the metalearning algorithm to this
derived dataset. Some thought must be given to the parallelization aspects of the algorithm,
but this is typically a straightforward exercise, given the computational independence of the
cross-validation and base learning steps. One of the main software engineering tasks in any
Super Learner implementation is creating a unified interface to a large collection of base
learning and metalearning algorithms. A Super Learner implementation must include a
novel or third-party machine learning algorithm interface that allows users to specify the
base learners in a common format. Ideally, the users of the software should be able to define
their own base learning functions that specify an algorithm and set of model parameters in
addition to any default algorithms that are provided within the software. The performance
of the Super Learner is determined by the combined performance of the base learners, so
having a rich library of machine learning algorithms accessible in the ensemble software is
important.
The metalearning methods can use the same interface as the base learners, which simplifies the implementation. The metalearner is just another algorithm, although it is common to form a nonnegative linear combination of the base algorithms using a method such as nonnegative least squares (NNLS). However, if the loss function of interest to the user is unrelated to the objective
functions associated with the base learning algorithms, then a linear combination of the
base learners that minimizes the user-specified loss function can be learned using a nonlinear
optimization library, such as NLopt. In classification problems, this is particularly relevant
in the case where the outcome variable in the training set is highly imbalanced. NLopt
provides a common interface to a number of different algorithms that can be used to solve
this problem. There are also methods that allow for constraints, such as nonnegativity
(α_l ≥ 0) and convexity (∑_{l=1}^{L} α_l = 1) of the weights. Using one of several nonlinear
optimization algorithms, such as L-BFGS-B, Nelder-Mead, or COBYLA, it is possible to
find a linear combination of the base learners that specifically minimizes the loss function of
interest.
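The sketch below illustrates this idea outside of any particular package; it assumes that Z is the n × L matrix of cross-validated base learner predictions (the level-one data), that y is a binary outcome vector, and that the cvAUC package is installed. The weights are constrained to be nonnegative through an absolute-value reparameterization and are renormalized afterward to form a convex combination that maximizes AUC, a loss of particular interest for imbalanced classification problems:

library(cvAUC)

# Negative AUC of a weighted combination of the cross-validated base learner predictions
negAUC <- function(par, Z, y) {
  alpha <- abs(par)                                 # enforce nonnegativity, alpha_l >= 0
  -cvAUC::AUC(predictions = as.vector(Z %*% alpha), labels = y)
}

L <- ncol(Z)
opt <- optim(par = rep(1 / L, L), fn = negAUC, Z = Z, y = y,
             method = "Nelder-Mead")                # derivative-free; AUC is not smooth
alpha <- abs(opt$par) / sum(abs(opt$par))           # rescale so the weights sum to one
ensemble_pred <- as.vector(Z %*% alpha)

Because AUC is invariant to rescaling of the predictions, the renormalization does not change the achieved objective value. The nloptr package provides an R interface to NLopt for users who prefer to impose the nonnegativity and sum-to-one constraints explicitly, for example with COBYLA.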
19.3.1 SuperLearner R Package
As is common for many statistical algorithms, the original implementation of the Super
Learner algorithm was written in R. The SuperLearner R package, first released in 2010, is
actively maintained with new features being added periodically. This package implements
the Super Learner algorithm and provides a unified interface to a diverse set of machine
learning algorithms that are available in the R language. The software is extensible in the
sense that the user can define custom base-learner function wrappers and specify them
as part of the ensemble; however, there are about 30 algorithm wrappers provided by the
package by default. The main advantage of an R implementation is direct access to the rich
collection of machine learning algorithms that already exist within the R ecosystem. The
main disadvantage of an R implementation is memory-related scalability.
Because the base learners are trained independently from each other, the training of the
constituent algorithms can be done in parallel. The embarrassingly parallel nature of the
cross-validation and base learning steps of the Super Learner algorithm can be exploited
in any language. If there are L base learners and V cross-validation folds, there are L × V
independent computational tasks involved in creating the level-one data. The SuperLearner
package provides functionality to parallelize the cross-validation step via multicore or SNOW
(Simple Network of Workstations) [40] clusters.
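For example, the sketch below fits a small ensemble in parallel using the package's multicore and SNOW interfaces; it assumes a binary outcome vector Y, a data frame of predictors X, and that the glmnet and randomForest packages are installed:

library(SuperLearner)
SL.library <- c("SL.glmnet", "SL.randomForest", "SL.glm")

# Multicore parallelization (forked R processes; not available on Windows)
fit_mc <- mcSuperLearner(Y = Y, X = X, family = binomial(),
                         SL.library = SL.library, cvControl = list(V = 10))

# SNOW-style socket cluster with four workers
library(parallel)
cl <- makeCluster(4, type = "PSOCK")
clusterSetRNGStream(cl, iseed = 1)
fit_snow <- snowSuperLearner(cluster = cl, Y = Y, X = X, family = binomial(),
                             SL.library = SL.library, cvControl = list(V = 10))
stopCluster(cl)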
The R language and its third-party libraries are not particularly well known for memory
efficiency, so depending on the specifications of the machine or cluster that is being used,
it is possible to run out of memory while attempting to train the ensemble on large
training sets. Because the SuperLearner package relies on third-party implementations of
the base learning algorithms, the scalability of Super Learner is tied to the scalability of
the base learner implementations used in the ensemble. Selecting a single model from a group of candidate algorithms based on cross-validated performance is computationally equivalent to generating the level-one data in the Super Learner algorithm.
If cross-validation is already being employed as a means of grid-search-based model selection
among a group of candidate learning algorithms, the addition of the metalearning step is
a computationally minimal burden. However, a Super Learner ensemble can result in a
significant boost in overall model performance over a single base learner model.
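The CV.SuperLearner function makes this comparison explicit by adding an outer layer of cross-validation; a minimal sketch, reusing Y, X, and SL.library from the example above, is as follows:

cv_fit <- CV.SuperLearner(Y = Y, X = X, family = binomial(),
                          SL.library = SL.library, cvControl = list(V = 10))
summary(cv_fit)   # cross-validated risk of the ensemble and of each base learner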
19.3.2 Subsemble R Package
The subsemble R package implements the Subsemble algorithm [36], a variant of Super
Learning, which ensembles base models trained on subsets of the original data. Specifically,
the disjoint union of the subsets is the full training set. As a special case, when the number of subsets is one, the package also implements the Super Learner algorithm.
The Subsemble algorithm can be used as a stand-alone ensemble algorithm or as the
base learning algorithm in the Super Learner algorithm. Empirically, it has been shown that
Subsemble can provide better prediction performance than fitting a single algorithm once
on the full available dataset [36], although this is not always the case.
An oracle result shows that Subsemble performs as well as the best possible combination
of the subset-specific fits. The Super Learner has more powerful asymptotic properties; it
performs as well as the best possible combination of the base learners trained on the full
dataset. However, when used as a stand-alone ensemble algorithm, Subsemble offers great
computational flexibility, in that the training task can be scaled to any size by changing the
number, or size, of the subsets. This allows the user to effectively flatten the training process
into a task that is compatible with available computational resources. If parallelization is
used effectively, all subset-specific fits can be trained at the same time, drastically increasing
the speed of the training process. Because the subsets are typically much smaller than the
original training set, this also reduces the memory requirements of each node in the cluster.
The computational flexibility and speed of the Subsemble algorithm offer a unique solution
to scaling ensemble learning to big data problems.
In the subsemble package, the J subsets can be created by the software at random, or
the subsets can be explicitly specified by the user. Given L base learning algorithms and J
subsets, a total of L × J subset-specific fits will be trained and included in the Subsemble
(by default). This construction allows each base learning algorithm to see each subset of
the training data, so in this sense, there is a similarity to ensembles trained on the full
data. To distinguish the variations on this theme, this type of ensemble construction is
referred to as a cross-product Subsemble. The subsemble package also implements what are
called divisor Subsembles, a structure that can be created if the number of unique base
learning algorithms is a divisor of the number of subsets. In this case, there are only J total
subset-specific fits that make up the ensemble, and each learner only sees approximately
n/J observations from the full training set (assuming that the subsets are of equal size). For
example, if L = 2 and J = 10, then each of the two base learning algorithms would be used
to train five subset-specific fits and would only see a total of 50% of the original training
observations. This type of Subsemble allows for quicker training, but will typically result in
less accurate models. Therefore, the cross-product method is the default Subsemble type in
the software.
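A minimal sketch of fitting a cross-product Subsemble with two base learners and three subsets is shown below; it assumes that x (a data frame of predictors), y (a binary outcome), and newx (test set predictors) are already defined, and the multiType option of learnControl selects between the cross-product and divisor constructions:

library(subsemble)

fit <- subsemble(x = x, y = y, newx = newx, family = binomial(),
                 learner = c("SL.glmnet", "SL.randomForest"),
                 metalearner = "SL.glm",
                 subsets = 3,                                   # J = 3 randomly created subsets
                 learnControl = list(multiType = "crossprod"))  # or "divisor"
pred <- fit$pred   # predictions on newx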
An algorithm called Supervised Regression Tree Subsemble or SRT Subsemble [35] is also
on the development road map for the subsemble package. SRT Subsemble is an extension of
the regular Subsemble algorithm, which provides a means of learning the optimal number
and constituency of the subsets. This method incurs an additional computational cost, but
can provide greater model performance for the Subsemble.
19.3.3 H2O Ensemble
The H2O Ensemble software contains an implementation of the Super Learner ensemble
algorithm that is built on the distributed, open source, Java-based machine learning platform
for big data, H2O. H2O Ensemble is currently implemented as a stand-alone R package called
h2oEnsemble that makes use of the h2o package, the R interface to the H2O platform.
There are a handful of powerful supervised machine learning algorithms supported by the
h2o package, all of which can be used as base learners for the ensemble. This includes a
high-performance method for deep learning, which allows the user to create ensembles of
deep neural nets or combine the power of deep neural nets with other algorithms, such as
Random Forest or Gradient Boosting Machines (GBMs) [12].
Because the H2O machine learning platform was designed with big data in mind, each
of the H2O base learning algorithms is scalable to very large training sets and enables
parallelism across multiple nodes and cores. The H2O platform comprises a distributed
in-memory parallel computing architecture and has the ability to seamlessly use datasets
stored in Hadoop Distributed File System (HDFS), Amazon’s S3 cloud storage, NoSQL,
and SQL databases in addition to CSV files stored locally or in distributed filesystems. The
H2O Ensemble project aims to match the scalability of the H2O algorithms, so although
the ensemble uses R as its main user interface, most of the computations are performed in
Java via H2O in a distributed, scalable fashion.
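For instance, a typical session launches the H2O cluster from R and imports the training and test sets directly into distributed H2O frames. The sketch below uses hypothetical HDFS paths; note that the exact arguments of h2o.importFile have varied across h2o releases (older releases also require the connection object):

library(h2o)
h2o.init(nthreads = -1)   # start (or connect to) an H2O cluster, using all available cores

# Import data from HDFS (or local disk, S3, etc.) into distributed H2O frames
data    <- h2o.importFile(path = "hdfs://namenode/data/train.csv")
newdata <- h2o.importFile(path = "hdfs://namenode/data/test.csv")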
There are several publicly available benchmarks of the H2O algorithms. Notably,
the H2O GLM implementation has been benchmarked on a training set of one billion
observations [13]. This benchmark training set is derived from the Airline Dataset [31],
which has been called the Iris dataset for big data. The one billion row training set is a
42 GB CSV file with 12 feature columns (9 numerical features, 3 categorical features with
cardinalities 30, 376, and 380) and a binary outcome. Using a 48-node cluster (8 cores on
each node, 15 GB of RAM, and a 1 Gb/s interconnect), the H2O GLM can be trained in
5.6 s. The H2O algorithm implementations aim to scale to datasets of any size, so that the entire training set, rather than a subset, can be used to train models.
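Such a model is trained with a single call to h2o.glm; the sketch below uses a hypothetical outcome column name, and the name of the data argument (data in older h2o releases, training_frame in newer ones) depends on the installed version:

y <- "IsDepDelayed"            # hypothetical binary outcome column
x <- setdiff(names(data), y)   # remaining columns serve as predictors
glm_fit <- h2o.glm(x = x, y = y, training_frame = data, family = "binomial")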
H2O Ensemble takes a different approach to scaling the Super Learner algorithm than
the subsemble or SuperLearner R packages. Because the subsemble and SuperLearner
ensembles rely on third-party R algorithm implementations that are typically single
threaded, the parallelism of these two implementations occurs in the cross-validation and
base learning steps. In the SuperLearner implementation, the ability to take advantage of
multiple cores is strictly limited by the number of cross-validation folds and number of base
learners. With subsemble, the scalability of the ensemble can be improved by increasing the
number of subsets used; however, this may lead to a decrease in model performance. Unlike
most third-party machine learning algorithms that are available in R, the H2O base learning
algorithms are implemented in a distributed fashion and can scale to all available cores in a
multicore or multinode cluster. In the current release of H2O Ensemble, the cross-validation
and base learning steps of the ensemble algorithm are performed in serial; however, each
serial training step is maximally parallelized across all available cores in a cluster. The H2O
Ensemble implementation could possibly be re-architected to parallelize the cross-validation
and base learning steps; however, it is unknown at this time how that may affect runtime
performance.
19.3.3.1 R Code Example
The following R code example demonstrates how to create an ensemble of a Random Forest
and two Deep Neural Nets using the h2oEnsemble R interface. The code below also illustrates the current method for defining custom base-learner functions. The h2oEnsemble
package comes with four base learner function wrappers; however, to create a base learner
with nondefault model parameters, the user can pass along nondefault function arguments
as shown. The user must also specify a metalearning algorithm, and in this example, a GLM
wrapper function is used.
library("SuperLearner") # For "SL.nnls" metalearner function
library("h2oEnsemble")
# Create custom base learner functions using non-default model params:
h2o_rf_1 <- function(..., family = "binomial",
                     ntree = 500,
                     depth = 50,
                     mtries = 6,
                     sample.rate = 0.8,
                     nbins = 50,
                     nfolds = 0) {
  h2o.randomForest.wrapper(..., family = family, ntree = ntree,
                           depth = depth, mtries = mtries, sample.rate = sample.rate,
                           nbins = nbins, nfolds = nfolds)
}

h2o_dl_1 <- function(..., family = "binomial",
                     nfolds = 0,
                     activation = "RectifierWithDropout",
                     hidden = c(200, 200),
                     epochs = 100,
                     l1 = 0,
                     l2 = 0) {
  h2o.deeplearning.wrapper(..., family = family, nfolds = nfolds,
                           activation = activation, hidden = hidden, epochs = epochs,
                           l1 = l1, l2 = l2)
}

h2o_dl_2 <- function(..., family = "binomial",
                     nfolds = 0,
                     activation = "Rectifier",
                     hidden = c(200, 200),
                     epochs = 100,
                     l1 = 0,
                     l2 = 1e-05) {
  h2o.deeplearning.wrapper(..., family = family, nfolds = nfolds,
                           activation = activation, hidden = hidden, epochs = epochs,
                           l1 = l1, l2 = l2)
}
The h2o.ensemble function follows the same interface conventions as the other algorithm functions in the h2o R package. This includes the x and y arguments, which are
the column names of the predictor variables and outcome variable, respectively. The data
object is a reference to the training dataset, which exists in Java memory. The family
argument is used to specify the type of prediction (i.e., classification or regression). The
predict.h2o.ensemble function uses the predict(object, newdata) interface that is
common to most machine learning software packages in R. After specifying the base learner
library and the metalearner, the ensemble can be trained and tested:
# Set up the ensemble
learner <- c("h2o_rf_1", "h2o_dl_1", "h2o_dl_2")
metalearner <- "SL.nnls"
# Train the ensemble using 2-fold CV to generate level-one data
# More CV folds will increase runtime, but should increase performance
fit <- h2o.ensemble(x = x, y = y, data = data, family = "binomial",
learner = learner, metalearner = metalearner,
cvControl = list(V = 2))
# Generate predictions on the test set
pred <- predict(fit, newdata)
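A typical next step is to evaluate the ensemble's performance on the test set, for example by computing AUC with the cvAUC package. The sketch below assumes a binary outcome and the prediction object structure used by recent versions of h2oEnsemble, in which the third column of the prediction frame contains P(Y = 1):

library(cvAUC)
predictions <- as.data.frame(pred$pred)[, 3]   # third column holds the predicted P(Y = 1)
labels <- as.data.frame(newdata[, y])[, 1]     # true outcome values from the test set
cvAUC::AUC(predictions = predictions, labels = labels)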
19.3.4 Performance Benchmarks
The H2O Ensemble was benchmarked on Amazon’s Elastic Compute Cloud (EC2) to
demonstrate the practical use of the Super Learner algorithm on big data. The instance type
used across all benchmarks is EC2's c3.8xlarge type, which has 32 virtual CPUs (vCPUs) and 60 GB of RAM.