19
Scalable Super Learning
Erin LeDell
CONTENTS
19.1 Introduction .................................................................... 339
19.2 The Super Learner Algorithm ................................................. 340
19.2.1 Stacking ................................................................ 340
19.2.2 Base Learners .......................................................... 341
19.2.3 Metalearning Algorithm ............................................... 341
19.2.4 Oracle Properties ...................................................... 342
19.2.5 Comparison to Other Ensemble Learners ............................. 343
19.3 Super Learner Software ........................................................ 343
19.3.1 SuperLearner R Package ............................................... 345
19.3.2 Subsemble R Package .................................................. 345
19.3.3 H2O Ensemble ......................................................... 346
19.3.3.1 R Code Example ........................................... 347
19.3.4 Performance Benchmarks .............................................. 348
19.3.5 Computational Details ................................................ 349
19.4 Online Super Learning ......................................................... 350
19.4.1 Optimization in Online Learning ...................................... 350
19.4.2 The OSL Algorithm ................................................... 351
19.4.3 A Practical Online Super Learner ..................................... 352
19.5 Super Learner in Practice ...................................................... 352
19.6 Concluding Remarks ........................................................... 353
19.7 Glossary ........................................................................ 353
References ............................................................................. 355
19.1 Introduction
Super Learning is a generalized loss-based ensemble learning framework that was theoretically validated in [42]. This template for learning is applicable to, and currently used across, a wide class of problems, including problems involving biased sampling, missingness, and censoring. It can be used to estimate marginal densities, conditional densities, conditional hazards, conditional means, conditional medians, conditional quantiles, and conditional survival functions, among others [43]. Applications of Super Learning include the estimation of propensity scores, dose–response functions [21], and optimal dynamic treatment rules [29].
When used for standard prediction, the Super Learner algorithm is a supervised learning
algorithm equivalent to generalized stacking [4,45], an ensemble learning technique that
Handbook of Big Data
combines the output from a set of base learning algorithms via a second-level metalearning
algorithm. The Super Learner is built on the theory of cross-validation and has been proven
to represent an asymptotically optimal system for learning [42]. The framework allows for
a general class of prediction algorithms to be considered for the ensemble.
In this chapter, we discuss some of the history of stacking, review the basic theoretical
properties of Super Learning and provide a description of Online Super Learning. We discuss
several practical implementations of the Super Learner algorithm and highlight the various
ways in which the algorithm can scale to big data. In conclusion, we present examples of
real-world applications that utilize Super Learning.
19.2 The Super Learner Algorithm
The Super Learner algorithm is a generalization of the stacking algorithm introduced in
context of neural networks by Wolpert [45] in 1992 and adapted to regression problems by
Breiman [4] in 1996. The Super Learner name was introduced due to the theoretical oracle
property and its consequences as presented in [9,42].
The distinct characteristic of the Super Learner algorithm in the context of stacking is
its theoretical foundation, discussed in Section 19.2.4, which shows that the Super Learner
ensemble is asymptotically equivalent to the oracle. Under most conditions, the theory
also guarantees that the Super Learner ensemble will perform at least as well as the best
individual base learner. Therefore, if model performance is the primary objective in a
machine learning problem, the Super Learner with a rich candidate learner library can
be used to theoretically and practically guarantee better model performance than can be
achieved by a single algorithm alone.
19.2.1 Stacking
Stacking or stacked generalization is a procedure for ensemble learning where a second-level
learner, or a metalearner, is trained on the output (i.e., the cross-validated predictions,
described below) of a collection of base learners. The output from the base learners, also
called the level-one data, can be generated by using cross-validation. Accordingly, the
original training dataset is often referred to as the level-zero data. Although not backed
by theory, in the case of large training sets, it may be reasonable to construct the level-
one data using predictions from a single, independent test set. In the original stacking
literature, Wolpert proposed using leave-one-out cross-validation [45]; however, because of
computational costs and empirical evidence of superior performance, Breiman suggested
using 10-fold cross-validation to generate the level-one data [4]. The Super Learner theory
requires cross-validation (usually V -fold cross-validation, in practice) to generate the level-
one data.
The following describes in greater detail how to construct the level-one data. Assume
that the training set comprises n independent and identically distributed observations
{O_1, ..., O_n}, where O_i = (X_i, Y_i), X_i ∈ R^p is a vector of covariate or feature values, and Y_i ∈ R is the outcome. Consider an ensemble comprising a set of L base learning algorithms, {Ψ_1, ..., Ψ_L}, each of which is indexed by an algorithm class and a specific set of model parameters. Then, the process of constructing the level-one data will involve generating an n × L matrix, Z, of V-fold cross-validated predicted values as follows:
1. The original training set, X, is divided at random into V roughly equal pieces (validation folds), X^(1), ..., X^(V).
2. For each base learner in the ensemble, Ψ_l, V-fold cross-validation is used to
generate n cross-validated predicted values associated with the lth learner.
These n-dimensional vectors of cross-validated predicted values become the L
columns of Z.
The level-one dataset, Z, along with the original outcome, y ∈ R^n, is used to train the
metalearning algorithm. As a final task, each of the L base learners will be fit on the
full training set and these fits will be saved. The final ensemble fit comprises the L base
learner fits along with the metalearner fit. To generate a prediction for new data using the
ensemble, the algorithm first generates predicted values from each of the L base learner fits,
and then passes those predicted values as input to the metalearner fit, which returns the
final predicted value for the ensemble.
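The procedure above can be sketched in a few lines of base R. This is a minimal illustration of the level-one data construction, metalearning, and prediction steps, not the SuperLearner package's implementation; the two base learners (a linear model and a cubic polynomial fit) and the OLS metalearner are arbitrary choices for the sketch.

```r
set.seed(1)

# Simulated level-zero training data: n observations, one feature
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x, y = y)

# Base learners: each entry fits a model that supports predict().
# These two choices are arbitrary illustrations.
learners <- list(
  lin  = function(d) lm(y ~ x, data = d),
  poly = function(d) lm(y ~ poly(x, 3, raw = TRUE), data = d)
)
L <- length(learners)

# Step 1: assign each observation to one of V validation folds at random
V <- 5
fold <- sample(rep(1:V, length.out = n))

# Step 2: build the n x L matrix Z of cross-validated predicted values
Z <- matrix(NA_real_, n, L, dimnames = list(NULL, names(learners)))
for (v in 1:V) {
  for (l in 1:L) {
    fit <- learners[[l]](train[fold != v, ])             # train on V-1 folds
    Z[fold == v, l] <- predict(fit, train[fold == v, ])  # predict held-out fold
  }
}

# Metalearner: here, plain OLS trained on the level-one data (Z, y)
meta <- lm(y ~ ., data = data.frame(Z, y = y))

# Final task: refit each base learner on the full training set
full_fits <- lapply(learners, function(f) f(train))

# Prediction for new data: base-learner predictions feed the metalearner
predict_ensemble <- function(newdata) {
  Znew <- sapply(full_fits, predict, newdata = newdata)
  predict(meta, data.frame(Znew))
}

preds <- predict_ensemble(data.frame(x = c(0.1, 0.5, 0.9)))
```

Note that each column of Z is filled only with out-of-fold predictions, so the metalearner never sees a prediction made by a model that was trained on that observation.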
The historical definition of stacking does not specify restrictions on the type of algorithm
used as a metalearner; however, the metalearner is often a method that minimizes the
cross-validated risk of some loss function of interest. For example, ordinary least squares
(OLS) can be used to minimize the sum of squared residuals, in the case of a linear
model. The Super Learner algorithm can be thought of as the theoretically supported
generalization of stacking to any estimation problem, where the goal is to minimize the cross-
validated risk of some bounded loss function, including loss functions indexed by nuisance
parameters.
19.2.2 Base Learners
It is recommended that the base learner library include a diverse set of learners (e.g., Linear
Model, Support Vector Machine, Random Forest, Neural Net, etc.); however, the Super
Learner theory does not require any specific level of diversity among the set of the base
learners. The library can also include copies of the same algorithm, indexed by different
sets of model parameters. For example, the user could specify multiple Random Forests [5],
each with a different splitting criterion, tree depth or mtry value. Typically, in stacking-based
ensemble methods, the prediction functions, Ψ̂_1, ..., Ψ̂_L, are fit by training each of the base learning algorithms, Ψ_1, ..., Ψ_L, on the full training dataset and then combining these fits using a
metalearning algorithm, Φ. However, there are variants of Super Learning, such as the Sub-
semble algorithm [36], which learn the prediction functions on subsets of the training data.
The base learners can be any parametric or nonparametric supervised machine learning
algorithm. Stacking was originally presented by Wolpert, who used neural networks as
base learners. Breiman extended the stacking framework to regression problems under the
name stacked regressions and experimented with different base learners. For base learning
algorithms, he evaluated ensembles of decision trees (with different numbers of terminal
nodes), generalized linear models (GLMs) using subset variable regression (with a different
number of predictor variables), or ridge regression [16] (with different ridge parameters).
He also built ensembles by combining several subset variable regression models with ridge
regression models and found that the added diversity among the base models increased
performance. Both Wolpert and Breiman focused their work on using the same underlying
algorithm (i.e., neural nets, decision trees, or GLMs) with unique tuning parameters as the
set of base learners, although Breiman briefly suggested the idea of using heterogeneous
base learning algorithms, such as neural nets and nearest-neighbor.
19.2.3 Metalearning Algorithm
The metalearner, Φ, is used to find the optimal combination of the L base learners. The
Z matrix of cross-validated predicted values, as described in Section 19.2.1, is used as the
input for the metalearning algorithm, along with the original outcome from the level-zero
training data, y = (Y_1, ..., Y_n).
The metalearning algorithm is typically a method designed to minimize the cross-
validated risk of some loss function. For example, if your goal is to minimize mean squared
prediction error, you could use the least-squares algorithm to solve for α = (α_1, ..., α_L), the weight vector that minimizes the following:

  Σ_{i=1}^n ( Y_i − Σ_{l=1}^L α_l z_{li} )^2
Because the set of predictions from the various base learners may be highly correlated,
it is advisable to choose a metalearning method that performs well in the presence of
collinear predictors. Regularization via Ridge [17] or Lasso [39] regression is commonly used
to overcome the issue of collinearity among the predictor variables that make up the level-
one dataset. Empirically, Breiman found that using Ridge regression as the metalearner
often yielded a lower prediction error than using unregularized least-squares regression. Of
the regularization methods he considered, a linear combination achieved via nonnegative
least squares (NNLS) [23] gave the best results in terms of prediction error. The NNLS
algorithm minimizes the same objective function as the least-squares algorithm, but adds
the constraint that α_l ≥ 0, for l = 1, ..., L. Le Blanc and Tibshirani [24] also came to the
conclusion that nonnegativity constraints lead to the most accurate linear combinations of
the base learners.
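The NNLS idea can be sketched without a special-purpose solver by minimizing the least-squares objective above under the box constraint α_l ≥ 0, here via base R's optim with method = "L-BFGS-B". This is an illustration of the constrained objective, not the Lawson–Hanson NNLS algorithm used by dedicated implementations; the simulated level-one data are arbitrary.

```r
set.seed(1)

# Toy level-one data: n = 100 observations, L = 3 base-learner predictions
n <- 100; L <- 3
Z <- matrix(rnorm(n * L), n, L)
y <- Z %*% c(0.7, 0.3, 0) + rnorm(n, sd = 0.1)  # third learner is pure noise

# Least-squares objective in the weight vector alpha
rss <- function(alpha) sum((y - Z %*% alpha)^2)

# Minimize subject to alpha_l >= 0 using box-constrained quasi-Newton
fit <- optim(par = rep(1 / L, L), fn = rss,
             method = "L-BFGS-B", lower = rep(0, L))
alpha <- fit$par
```

If a convex combination is desired, the resulting weights can simply be normalized afterward (alpha / sum(alpha)), which is how several stacking implementations obtain convex weights from an NNLS fit.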
Breiman also discussed desirable theoretical properties that arise by enforcing the
additional constraint that Σ_l α_l = 1, where the ensemble is a convex combination of the base learners. However, in simulations, he showed that the prediction error is nearly the same whether or not Σ_l α_l = 1. The convex combination is not only empirically motivated, but
also supported by the theory. As mentioned in Section 19.2.4, the oracle results for the Super
Learner require a uniformly bounded loss function and restricting to the convex combination
implies that if each algorithm in the library is bounded, the convex combination will also
be bounded. In practice, truncation of the predicted values to the range of the outcome
variable in the training set is sufficient to allow for unbounded loss functions.
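The truncation just described amounts to clipping each predicted value to the observed range of the training outcome. A minimal sketch in base R (the function name is illustrative):

```r
# Clip predictions to [min(y_train), max(y_train)] so the loss stays bounded
truncate_preds <- function(pred, y_train) {
  pmin(pmax(pred, min(y_train)), max(y_train))
}

# Example: training outcomes span [0, 1], so -5 and 99 are clipped
out <- truncate_preds(c(-5, 0.2, 99), y_train = c(0, 1, 0.5))
# out is c(0, 0.2, 1)
```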
In the Super Learner algorithm, the metalearning method is specified as the minimizer of
the cross-validated risk of a loss function of interest, such as squared-error loss or rank loss
(1-AUC). If the loss function of interest is unique, unusual or complex, it may be difficult
to find an existing machine learning algorithm (i.e., metalearner) that directly or indirectly
minimizes this function. However, the optimal combination of the base learners can be
estimated using a nonlinear optimization algorithm such as those that are available in the
open source NLopt library [20]. This particular approach to metalearning provides a great
deal of flexibility to the Super Learner, in the sense that the ensemble can be trained to
optimize any complex objective. Historically, in stacking implementations, the metalearning
algorithm is often some sort of regularized linear model; however, a variety of parametric
and nonparametric methods can be used as a metalearner to combine the output from the
base fits.
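To make this flexibility concrete, the sketch below estimates metalearner weights that directly minimize rank loss (1 − AUC) using a general-purpose derivative-free optimizer; base R's optim with Nelder–Mead stands in here for an NLopt routine, and the simulated level-one data and function names are illustrative.

```r
set.seed(1)

# Toy level-one data for a binary outcome: two base-learner prediction columns
n <- 500
y <- rbinom(n, 1, 0.5)
Z <- cbind(y * 0.8 + rnorm(n),  # informative learner
           rnorm(n))            # pure-noise learner

# AUC via the rank-sum (Mann-Whitney) identity
auc <- function(y, p) {
  r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Rank loss (1 - AUC) of a weighted combination of the base-learner columns
rank_loss <- function(alpha) 1 - auc(y, Z %*% alpha)

# Derivative-free minimization; an NLopt routine could be substituted here
fit <- optim(par = c(0.5, 0.5), fn = rank_loss, method = "Nelder-Mead")
alpha <- fit$par
```

Because the objective depends on the weights only through the ranks of the combined predictions, any loss computable from predictions and outcomes can be plugged into the same scheme.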
19.2.4 Oracle Properties
The oracle selector is defined as the estimator, among all possible weighted combinations
of the base prediction functions, which minimizes risk under the true data-generating
distribution. The oracle result for the cross-validation selector among a set of candidate
learners was established in [41] for general bounded loss functions, in [10] for unbounded
loss functions under an exponential tail condition, and in [42] for its application to the
Super Learner. The oracle selector is considered to be optimal with respect to a particular
loss function, given the set of base learners; however, it depends on both the observed data
and the true distribution, P_0, and thus is unknown. If the true prediction function cannot be represented by a combination of the base learners in the library, then the optimal combination is the closest such combination that could be identified if the true data-generating mechanism were known. With a large enough training set, the Super Learner would theoretically converge to
the oracle selector. In the original stacking literature, Breiman observed that the ensemble
predictor almost always has lower prediction error than the single best base learner, although
a proof was not presented in his work [4].
If one of the base learners is a parametric model that happens to contain the true
prediction function, this base learner will achieve a parametric rate of convergence and thus
the Super Learner achieves an almost parametric rate of convergence, log(n)/n.
19.2.5 Comparison to Other Ensemble Learners
In the machine learning community, the term ensemble learning is often associated with
bagging [3] or boosting [11] techniques or particular algorithms such as Random Forest [5].
Stacking is similar to bagging due to the independent training of the base learners; however,
there are two notable distinctions. The first is that stacking uses a metalearning algorithm
to optimally combine the output from the base learners instead of simple averaging, as
in bagging. The second is that modern stacking is typically characterized by a diverse set
of strong base learners, whereas bagging is often associated with a single, often weak, base
learning algorithm. A popular example of bagging is the Random Forest algorithm that bags
classification or regression trees trained on subsets of the feature space. A case could be made
that bagging is a special case of stacking that uses the mean as the metalearning algorithm.
From a computational perspective, bagging, like stacking, is a particularly well-suited
ensemble learning method for big data because models can be trained independently on
different cores or machines within a cluster. However, because boosting utilizes an iterative,
hence sequential, training approach, it does not scale as easily to big data problems.
A Bayesian ensemble learning algorithm that is often compared to stacking is Bayesian
model averaging (BMA). A BMA model is a linear combination of the output from the
base learners in which the weights are the posterior probabilities of models. Both BMA and
Super Learner use cross-validation as part of the ensemble process. In the case where the
true prediction function is contained within the base learner library, BMA is never worse
than stacking and often is demonstrably better under reasonable conditions. However, if the
true prediction function is not well approximated by the base learner library, then stacking
will significantly outperform BMA [7].
19.3 Super Learner Software
This section serves as a broad overview of several implementations of the Super Learner
ensemble algorithm and its variants. The original Super Learner implementation is the
SuperLearner R package [34]; however, there are several ongoing projects that aim to create
more scalable implementations of the algorithm that are suitable for big data. We will discuss
a variety of approaches to scaling the Super Learner algorithm:
1. Perform the cross-validation and base learning steps in parallel as these are
computationally independent tasks.