2. For each base learner in the ensemble, Ψ_l, V-fold cross-validation is used to
   generate n cross-validated predicted values associated with the lth learner.
   These n-dimensional vectors of cross-validated predicted values become the L
   columns of Z.
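As a concrete illustration of this step, the following is a minimal sketch in Python,
assuming base learners with the scikit-learn fit/predict interface; the helper name
build_level_one_data and the choice V = 5 are illustrative, not part of any standard API.

import numpy as np
from sklearn.model_selection import cross_val_predict

def build_level_one_data(base_learners, X, y, V=5):
    # Column l holds the n V-fold cross-validated predicted values of the
    # lth base learner, so Z has shape (n, L).
    columns = [cross_val_predict(learner, X, y, cv=V) for learner in base_learners]
    return np.column_stack(columns)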
The level-one dataset, Z, along with the original outcome, y ∈ R^n, is used to train the
metalearning algorithm. As a final task, each of the L base learners will be fit on the
full training set and these fits will be saved. The final ensemble fit comprises the L base
learner fits along with the metalearner fit. To generate a prediction for new data using the
ensemble, the algorithm first generates predicted values from each of the L base learner fits,
and then passes those predicted values as input to the metalearner fit, which returns the
final predicted value for the ensemble.
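Continuing the sketch above, the remaining steps might look as follows; fit_ensemble and
predict_ensemble are hypothetical helper names, and the metalearner can be any estimator
with a fit/predict interface.

def fit_ensemble(base_learners, metalearner, X, y, V=5):
    Z = build_level_one_data(base_learners, X, y, V)   # level-one data
    metalearner.fit(Z, y)                              # train the metalearner on (Z, y)
    for learner in base_learners:                      # refit each base learner on the
        learner.fit(X, y)                              # full training set and save the fit
    return base_learners, metalearner

def predict_ensemble(base_learners, metalearner, X_new):
    # Base learner predictions on the new data become the metalearner's input.
    Z_new = np.column_stack([learner.predict(X_new) for learner in base_learners])
    return metalearner.predict(Z_new)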
The historical definition of stacking does not specify restrictions on the type of algorithm
used as a metalearner; however, the metalearner is often a method that minimizes the
cross-validated risk of some loss function of interest. For example, in the case of a linear
model, ordinary least squares (OLS) can be used to minimize the sum of squared residuals.
The Super Learner algorithm can be thought of as the theoretically supported
generalization of stacking to any estimation problem, where the goal is to minimize the cross-
validated risk of some bounded loss function, including loss functions indexed by nuisance
parameters.
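For the OLS example above, a minimal sketch of such a metalearner, taking the level-one
matrix Z and outcome y from Section 19.2.1 as inputs, is:

import numpy as np

def ols_metalearner_weights(Z, y):
    # Least-squares weights alpha over the L base learners, minimizing the
    # sum of squared residuals ||y - Z alpha||^2 on the level-one data.
    alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return alpha

In practice, constrained variants of this combination (e.g., non-negative weights) are
common, but unconstrained OLS suffices to illustrate the idea.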
19.2.2 Base Learners
It is recommended that the base learner library include a diverse set of learners (e.g., Linear
Model, Support Vector Machine, Random Forest, Neural Net); however, the Super
Learner theory does not require any specific level of diversity among the set of the base
learners. The library can also include copies of the same algorithm, indexed by different
sets of model parameters. For example, the user could specify multiple Random Forests [5],
each with a different splitting criterion, tree depth, or mtry value. Typically, in stacking-based
ensemble methods, the prediction functions, Ψ̂_1, ..., Ψ̂_L, are fit by training each of the base
learning algorithms, Ψ_1, ..., Ψ_L, on the full training dataset and then combining these fits using a
metalearning algorithm, Φ. However, there are variants of Super Learning, such as the Sub-
semble algorithm [36], which learn the prediction functions on subsets of the training data.
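As an illustration of the library described above, one might specify several copies of the
same algorithm alongside other learner families; the particular scikit-learn estimators and
hyperparameter values below are arbitrary examples (max_features plays the role of mtry).

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

base_learners = [
    LinearRegression(),
    SVR(kernel="rbf"),
    # Three Random Forests indexed by different hyperparameter settings
    RandomForestRegressor(criterion="squared_error", max_depth=None, max_features=1.0),
    RandomForestRegressor(criterion="absolute_error", max_depth=10, max_features="sqrt"),
    RandomForestRegressor(criterion="squared_error", max_depth=5, max_features=0.3),
]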
The base learners can be any parametric or nonparametric supervised machine learning
algorithm. Stacking was originally presented by Wolpert, who used neural networks as
base learners. Breiman extended the stacking framework to regression problems under the
name stacked regressions and experimented with different base learners. For base learning
algorithms, he evaluated ensembles of decision trees (with different numbers of terminal
nodes), generalized linear models (GLMs) using subset variable regression (with different
numbers of predictor variables), and ridge regression [16] (with different ridge parameters).
He also built ensembles by combining several subset variable regression models with ridge
regression models and found that the added diversity among the base models increased
performance. Both Wolpert and Breiman focused their work on using the same underlying
algorithm (i.e., neural nets, decision trees, or GLMs) with unique tuning parameters as the
set of base learners, although Breiman briefly suggested the idea of using heterogeneous
base learning algorithms, such as neural nets and nearest-neighbor.
19.2.3 Metalearning Algorithm
The metalearner, Φ, is used to find the optimal combination of the L base learners. The
Z matrix of cross-validated predicted values, as described in Section 19.2.1, is used as the