19
Scalable Super Learning
Erin LeDell
CONTENTS
19.1 Introduction .................................................................... 339
19.2 The Super Learner Algorithm ................................................. 340
19.2.1 Stacking ................................................................ 340
19.2.2 Base Learners .......................................................... 341
19.2.3 Metalearning Algorithm ............................................... 341
19.2.4 Oracle Properties ...................................................... 342
19.2.5 Comparison to Other Ensemble Learners ............................. 343
19.3 Super Learner Software ........................................................ 343
19.3.1 SuperLearner R Package ............................................... 345
19.3.2 Subsemble R Package .................................................. 345
19.3.3 H2O Ensemble ......................................................... 346
19.3.3.1 R Code Example ........................................... 347
19.3.4 Performance Benchmarks .............................................. 348
19.3.5 Computational Details ................................................ 349
19.4 Online Super Learning ......................................................... 350
19.4.1 Optimization in Online Learning ...................................... 350
19.4.2 The OSL Algorithm ................................................... 351
19.4.3 A Practical Online Super Learner ..................................... 352
19.5 Super Learner in Practice ...................................................... 352
19.6 Concluding Remarks ........................................................... 353
19.7 Glossary ........................................................................ 353
References ............................................................................. 355
19.1 Introduction
Super Learning is a generalized loss-based ensemble learning framework that was theoretically validated in [42]. This template for learning is applicable to, and currently used across, a wide class of problems, including problems involving biased sampling, missingness, and censoring. It can be used to estimate marginal densities, conditional densities, conditional hazards, conditional means, conditional medians, conditional quantiles, and conditional survival functions, among others [43]. Applications of Super Learning include the estimation of propensity scores, dose–response functions [21], and optimal dynamic treatment rules [29].
When used for standard prediction, the Super Learner algorithm is a supervised learning
algorithm equivalent to generalized stacking [4,45], an ensemble learning technique that
Handbook of Big Data
combines the output from a set of base learning algorithms via a second-level metalearning
algorithm. The Super Learner is built on the theory of cross-validation and has been proven
to represent an asymptotically optimal system for learning [42]. The framework allows for
a general class of prediction algorithms to be considered for the ensemble.
In this chapter, we discuss some of the history of stacking, review the basic theoretical
properties of Super Learning and provide a description of Online Super Learning. We discuss
several practical implementations of the Super Learner algorithm and highlight the various
ways in which the algorithm can scale to big data. In conclusion, we present examples of
real-world applications that utilize Super Learning.
19.2 The Super Learner Algorithm
The Super Learner algorithm is a generalization of the stacking algorithm introduced in
context of neural networks by Wolpert [45] in 1992 and adapted to regression problems by
Breiman [4] in 1996. The Super Learner name was introduced due to the theoretical oracle
property and its consequences as presented in [9,42].
The distinct characteristic of the Super Learner algorithm in the context of stacking is
its theoretical foundation, discussed in Section 19.2.4, which shows that the Super Learner
ensemble is asymptotically equivalent to the oracle. Under most conditions, the theory
also guarantees that the Super Learner ensemble will perform at least as well as the best
individual base learner. Therefore, if model performance is the primary objective in a
machine learning problem, the Super Learner with a rich candidate learner library can
be used to theoretically and practically guarantee better model performance than can be
achieved by a single algorithm alone.
19.2.1 Stacking
Stacking or stacked generalization is a procedure for ensemble learning where a second-level
learner, or a metalearner, is trained on the output (i.e., the cross-validated predictions,
described below) of a collection of base learners. The output from the base learners, also
called the level-one data, can be generated by using cross-validation. Accordingly, the
original training dataset is often referred to as the level-zero data. Although not backed
by theory, in the case of large training sets, it may be reasonable to construct the level-
one data using predictions from a single, independent test set. In the original stacking
literature, Wolpert proposed using leave-one-out cross-validation [45]; however, because of
computational costs and empirical evidence of superior performance, Breiman suggested
using 10-fold cross-validation to generate the level-one data [4]. The Super Learner theory
requires cross-validation (usually V -fold cross-validation, in practice) to generate the level-
one data.
The following describes in greater detail how to construct the level-one data. Assume
that the training set comprises n independent and identically distributed observations
{O_1, ..., O_n}, where O_i = (X_i, Y_i), X_i ∈ R^p is a vector of covariate or feature values, and Y_i ∈ R is the outcome. Consider an ensemble comprising a set of L base learning algorithms, {Ψ_1, ..., Ψ_L}, each of which is indexed by an algorithm class and a specific set of model parameters. Then, the process of constructing the level-one data will involve generating an n × L matrix, Z, of V-fold cross-validated predicted values as follows:
1. The original training set, X, is divided at random into V roughly equal pieces (validation folds), X^(1), ..., X^(V).
2. For each base learner in the ensemble, Ψ_l, V-fold cross-validation is used to
generate n cross-validated predicted values associated with the lth learner.
These n-dimensional vectors of cross-validated predicted values become the L
columns of Z.
The level-one dataset, Z, along with the original outcome, y ∈ R^n, is used to train the
metalearning algorithm. As a final task, each of the L base learners will be fit on the
full training set and these fits will be saved. The final ensemble fit comprises the L base
learner fits along with the metalearner fit. To generate a prediction for new data using the
ensemble, the algorithm first generates predicted values from each of the L base learner fits,
and then passes those predicted values as input to the metalearner fit, which returns the
final predicted value for the ensemble.
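The procedure above can be sketched in a few lines of base R. This is a minimal illustration of the level-one data construction, metalearning, and prediction steps, not the SuperLearner package's implementation; the two base learners (a linear model and a cubic polynomial fit) and the OLS metalearner are arbitrary choices for the sketch.

```r
set.seed(1)

# Simulated level-zero training data: n observations, one feature
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x, y = y)

# Base learners: each entry fits a model that supports predict().
# These two choices are arbitrary illustrations.
learners <- list(
  lin  = function(d) lm(y ~ x, data = d),
  poly = function(d) lm(y ~ poly(x, 3, raw = TRUE), data = d)
)
L <- length(learners)

# Step 1: assign each observation to one of V validation folds at random
V <- 5
fold <- sample(rep(1:V, length.out = n))

# Step 2: build the n x L matrix Z of cross-validated predicted values
Z <- matrix(NA_real_, n, L, dimnames = list(NULL, names(learners)))
for (v in 1:V) {
  for (l in 1:L) {
    fit <- learners[[l]](train[fold != v, ])             # train on V-1 folds
    Z[fold == v, l] <- predict(fit, train[fold == v, ])  # predict held-out fold
  }
}

# Metalearner: here, plain OLS trained on the level-one data (Z, y)
meta <- lm(y ~ ., data = data.frame(Z, y = y))

# Final task: refit each base learner on the full training set
full_fits <- lapply(learners, function(f) f(train))

# Prediction for new data: base-learner predictions feed the metalearner
predict_ensemble <- function(newdata) {
  Znew <- sapply(full_fits, predict, newdata = newdata)
  predict(meta, data.frame(Znew))
}

preds <- predict_ensemble(data.frame(x = c(0.1, 0.5, 0.9)))
```

Note that each column of Z is filled only with out-of-fold predictions, so the metalearner never sees a prediction made by a model that was trained on that observation.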
The historical definition of stacking does not specify restrictions on the type of algorithm
used as a metalearner; however, the metalearner is often a method that minimizes the
cross-validated risk of some loss function of interest. For example, ordinary least squares
(OLS) can be used to minimize the sum of squared residuals, in the case of a linear
model. The Super Learner algorithm can be thought of as the theoretically supported
generalization of stacking to any estimation problem, where the goal is to minimize the cross-
validated risk of some bounded loss function, including loss functions indexed by nuisance
parameters.
19.2.2 Base Learners
It is recommended that the base learner library include a diverse set of learners (e.g., Linear
Model, Support Vector Machine, Random Forest, Neural Net, etc.); however, the Super
Learner theory does not require any specific level of diversity among the set of the base
learners. The library can also include copies of the same algorithm, indexed by different
sets of model parameters. For example, the user could specify multiple Random Forests [5],
each with a different splitting criterion, tree depth or mtry value. Typically, in stacking-based
ensemble methods, the prediction functions, Ψ̂_1, ..., Ψ̂_L, are fit by training each of the base learning algorithms, Ψ_1, ..., Ψ_L, on the full training dataset and then combining these fits using a
metalearning algorithm, Φ. However, there are variants of Super Learning, such as the Sub-
semble algorithm [36], which learn the prediction functions on subsets of the training data.
The base learners can be any parametric or nonparametric supervised machine learning
algorithm. Stacking was originally presented by Wolpert, who used neural networks as
base learners. Breiman extended the stacking framework to regression problems under the
name stacked regressions and experimented with different base learners. For base learning
algorithms, he evaluated ensembles of decision trees (with different numbers of terminal
nodes), generalized linear models (GLMs) using subset variable regression (with a different
number of predictor variables), or ridge regression [16] (with different ridge parameters).
He also built ensembles by combining several subset variable regression models with ridge
regression models and found that the added diversity among the base models increased
performance. Both Wolpert and Breiman focused their work on using the same underlying
algorithm (i.e., neural nets, decision trees, or GLMs) with unique tuning parameters as the
set of base learners, although Breiman briefly suggested the idea of using heterogeneous
base learning algorithms, such as neural nets and nearest-neighbor.
19.2.3 Metalearning Algorithm
The metalearner, Φ, is used to find the optimal combination of the L base learners. The
Z matrix of cross-validated predicted values, as described in Section 19.2.1, is used as the
input for the metalearning algorithm, along with the original outcome from the level-zero
training data, y = (Y_1, ..., Y_n).
The metalearning algorithm is typically a method designed to minimize the cross-
validated risk of some loss function. For example, if your goal is to minimize mean squared
prediction error, you could use the least-squares algorithm to solve for α = (α_1, ..., α_L), the weight vector that minimizes the following:

  Σ_{i=1}^n ( Y_i − Σ_{l=1}^L α_l z_{li} )^2
Because the set of predictions from the various base learners may be highly correlated,
it is advisable to choose a metalearning method that performs well in the presence of
collinear predictors. Regularization via Ridge [17] or Lasso [39] regression is commonly used
to overcome the issue of collinearity among the predictor variables that make up the level-
one dataset. Empirically, Breiman found that using Ridge regression as the metalearner
often yielded a lower prediction error than using unregularized least-squares regression. Of
the regularization methods he considered, a linear combination achieved via nonnegative
least squares (NNLS) [23] gave the best results in terms of prediction error. The NNLS
algorithm minimizes the same objective function as the least-squares algorithm, but adds
the constraint that α_l ≥ 0, for l = 1, ..., L. Le Blanc and Tibshirani [24] also came to the
conclusion that nonnegativity constraints lead to the most accurate linear combinations of
the base learners.
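The NNLS idea can be sketched without a special-purpose solver by minimizing the least-squares objective above under the box constraint α_l ≥ 0, here via base R's optim with method = "L-BFGS-B". This is an illustration of the constrained objective, not the Lawson–Hanson NNLS algorithm used by dedicated implementations; the simulated level-one data are arbitrary.

```r
set.seed(1)

# Toy level-one data: n = 100 observations, L = 3 base-learner predictions
n <- 100; L <- 3
Z <- matrix(rnorm(n * L), n, L)
y <- Z %*% c(0.7, 0.3, 0) + rnorm(n, sd = 0.1)  # third learner is pure noise

# Least-squares objective in the weight vector alpha
rss <- function(alpha) sum((y - Z %*% alpha)^2)

# Minimize subject to alpha_l >= 0 using box-constrained quasi-Newton
fit <- optim(par = rep(1 / L, L), fn = rss,
             method = "L-BFGS-B", lower = rep(0, L))
alpha <- fit$par
```

If a convex combination is desired, the resulting weights can simply be normalized afterward (alpha / sum(alpha)), which is how several stacking implementations obtain convex weights from an NNLS fit.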
Breiman also discussed desirable theoretical properties that arise by enforcing the
additional constraint that Σ_l α_l = 1, where the ensemble is a convex combination of the base learners. However, in simulations, he showed that the prediction error is nearly the same whether or not Σ_l α_l = 1. The convex combination is not only empirically motivated, but
also supported by the theory. As mentioned in Section 19.2.4, the oracle results for the Super
Learner require a uniformly bounded loss function and restricting to the convex combination
implies that if each algorithm in the library is bounded, the convex combination will also
be bounded. In practice, truncation of the predicted values to the range of the outcome
variable in the training set is sufficient to allow for unbounded loss functions.
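The truncation just described amounts to clipping each predicted value to the observed range of the training outcome. A minimal sketch in base R (the function name is illustrative):

```r
# Clip predictions to [min(y_train), max(y_train)] so the loss stays bounded
truncate_preds <- function(pred, y_train) {
  pmin(pmax(pred, min(y_train)), max(y_train))
}

# Example: training outcomes span [0, 1], so -5 and 99 are clipped
out <- truncate_preds(c(-5, 0.2, 99), y_train = c(0, 1, 0.5))
# out is c(0, 0.2, 1)
```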
In the Super Learner algorithm, the metalearning method is specified as the minimizer of
the cross-validated risk of a loss function of interest, such as squared-error loss or rank loss
(1-AUC). If the loss function of interest is unique, unusual or complex, it may be difficult
to find an existing machine learning algorithm (i.e., metalearner) that directly or indirectly
minimizes this function. However, the optimal combination of the base learners can be
estimated using a nonlinear optimization algorithm such as those that are available in the
open source NLopt library [20]. This particular approach to metalearning provides a great
deal of flexibility to the Super Learner, in the sense that the ensemble can be trained to
optimize any complex objective. Historically, in stacking implementations, the metalearning
algorithm is often some sort of regularized linear model; however, a variety of parametric
and nonparametric methods can be used as a metalearner to combine the output from the
base fits.
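To make this flexibility concrete, the sketch below estimates metalearner weights that directly minimize rank loss (1 − AUC) using a general-purpose derivative-free optimizer; base R's optim with Nelder–Mead stands in here for an NLopt routine, and the simulated level-one data and function names are illustrative.

```r
set.seed(1)

# Toy level-one data for a binary outcome: two base-learner prediction columns
n <- 500
y <- rbinom(n, 1, 0.5)
Z <- cbind(y * 0.8 + rnorm(n),  # informative learner
           rnorm(n))            # pure-noise learner

# AUC via the rank-sum (Mann-Whitney) identity
auc <- function(y, p) {
  r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Rank loss (1 - AUC) of a weighted combination of the base-learner columns
rank_loss <- function(alpha) 1 - auc(y, Z %*% alpha)

# Derivative-free minimization; an NLopt routine could be substituted here
fit <- optim(par = c(0.5, 0.5), fn = rank_loss, method = "Nelder-Mead")
alpha <- fit$par
```

Because the objective depends on the weights only through the ranks of the combined predictions, any loss computable from predictions and outcomes can be plugged into the same scheme.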
19.2.4 Oracle Properties
The oracle selector is defined as the estimator, among all possible weighted combinations
of the base prediction functions, which minimizes risk under the true data-generating
distribution. The oracle result for the cross-validation selector among a set of candidate
learners was established in [41] for general bounded loss functions, in [10] for unbounded
loss functions under an exponential tail condition, and in [42] for its application to the
Super Learner. The oracle selector is considered to be optimal with respect to a particular
loss function, given the set of base learners; however, it depends on both the observed data
and the true distribution, P_0, and thus is unknown. If the true prediction function cannot be represented by a combination of the base learners in the library, then the optimal combination is the closest such combination that could be identified if the true data-generating mechanism were known. With a large enough training set, the Super Learner would theoretically converge to
the oracle selector. In the original stacking literature, Breiman observed that the ensemble
predictor almost always has lower prediction error than the single best base learner, although
a proof was not presented in his work [4].
If one of the base learners is a parametric model that happens to contain the true
prediction function, this base learner will achieve a parametric rate of convergence and thus
the Super Learner achieves an almost parametric rate of convergence, log(n)/n.
19.2.5 Comparison to Other Ensemble Learners
In the machine learning community, the term ensemble learning is often associated with
bagging [3] or boosting [11] techniques or particular algorithms such as Random Forest [5].
Stacking is similar to bagging due to the independent training of the base learners; however,
there are two notable distinctions. The first is that stacking uses a metalearning algorithm
to optimally combine the output from the base learners instead of simple averaging, as
in bagging. The second is that modern stacking is typically characterized by a diverse set
of strong base learners, whereas bagging is often associated with a single, often weak, base
learning algorithm. A popular example of bagging is the Random Forest algorithm that bags
classification or regression trees trained on subsets of the feature space. A case could be made
that bagging is a special case of stacking that uses the mean as the metalearning algorithm.
From a computational perspective, bagging, like stacking, is a particularly well-suited
ensemble learning method for big data because models can be trained independently on
different cores or machines within a cluster. However, because boosting utilizes an iterative,
hence sequential, training approach, it does not scale as easily to big data problems.
A Bayesian ensemble learning algorithm that is often compared to stacking is Bayesian
model averaging (BMA). A BMA model is a linear combination of the output from the
base learners in which the weights are the posterior probabilities of models. Both BMA and
Super Learner use cross-validation as part of the ensemble process. In the case where the
true prediction function is contained within the base learner library, BMA is never worse
than stacking and often is demonstrably better under reasonable conditions. However, if the
true prediction function is not well approximated by the base learner library, then stacking
will significantly outperform BMA [7].
19.3 Super Learner Software
This section serves as a broad overview of several implementations of the Super Learner
ensemble algorithm and its variants. The original Super Learner implementation is the
SuperLearner R package [34]; however, there are several ongoing projects that aim to create
more scalable implementations of the algorithm that are suitable for big data. We will discuss
a variety of approaches to scaling the Super Learner algorithm:
1. Perform the cross-validation and base learning steps in parallel as these are
computationally independent tasks.