18
Divide and Recombine: Subsemble, Exploiting
the Power of Cross-Validation
Stephanie Sapp and Erin LeDell
CONTENTS
18.1 Introduction .................................................................... 323
18.2 Subsemble Ensemble Learning for Big Data ................................... 325
18.2.1 The Subsemble Algorithm ............................................. 325
18.2.2 Oracle Result for Subsemble .......................................... 327
18.2.3 A Practical Subsemble Implementation ............................... 328
18.2.3.1 subsemble R Code Example ................................ 329
18.2.4 Performance Benchmarks .............................................. 330
18.2.4.1 Model Performance ......................................... 330
18.2.4.2 Computational Performance ............................... 331
18.3 Subsembles with Subset Supervision ........................................... 331
18.3.1 Supervised Subsembles ................................................ 332
18.3.2 The SRT Subsemble Algorithm ....................................... 333
18.3.2.1 Constructing and Selecting the Number of Subsets ....... 333
18.3.3 SRT Subsemble in Practice ............................................ 334
18.4 Concluding Remarks ........................................................... 335
18.5 Glossary ........................................................................ 335
References ............................................................................. 337
18.1 Introduction
As massive datasets become increasingly common, new scalable approaches to prediction are
needed. Given that memory and runtime constraints are common in practice, it is important
to develop practical machine learning methods that perform well on big datasets in a fixed
computational resource setting. Procedures using subsets from a training set are promising
tools for prediction with large-scale datasets [16]. Recent research has focused on developing
and evaluating the performance of various subset-based prediction procedures. Subsetting
procedures in machine learning construct subsets from the available training data, then
train an algorithm on each subset, and finally combine the results across the subsets to
form a final prediction. Prediction methods operating on subsets of the training data can
take advantage of modern computational resources, because machine learning on subsets
can be massively parallelized.
Bagging [1], or bootstrap aggregating, is a classic example of a subsampling prediction
procedure. Bagging involves drawing many bootstrap samples of a fixed size, fitting the
Handbook of Big Data
same underlying algorithm on each bootstrap sample, and obtaining the final prediction
by averaging the results across the fits. Bagging can lead to significant model performance
gains when used with weak or unstable algorithms such as classification or regression trees.
The bootstrap samples are drawn with replacement, so each bootstrap sample of size n
contains approximately 63.2% of the unique training examples, while the remainder of the
observations contained in the sample are duplicates. Therefore, in bagging, each model is fit
using only a subset of the original training observations. The drawback of taking a simple
average of the output from the subset fits is that the predictions from each of the fits
are weighted equally, regardless of the individual quality of each fit. The performance of a
bagged fit can be much better compared to that of a nonbagged algorithm, but a simple
average is not necessarily the optimal combination of a set of base learners.
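The bagging procedure just described can be sketched as follows. This is an illustration only, not the chapter's implementation; `stump`, a depth-one regression tree, is a hypothetical weak base learner chosen for the example.

```python
import numpy as np

def bag(fit, X, y, n_boot=25, rng=None):
    """Bootstrap aggregating: fit the same base learner on n_boot
    bootstrap samples and average their predictions with equal weight."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # Each bootstrap sample (drawn with replacement) contains ~63.2%
    # of the unique training observations.
    fits = [fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    def predict(X_new):
        # Simple average: every subset fit is weighted equally,
        # regardless of its individual quality.
        return np.mean([f(X_new) for f in fits], axis=0)
    return predict

def stump(X, y):
    """A hypothetical weak base learner: a depth-one regression tree
    splitting on the first feature at its median."""
    split = np.median(X[:, 0])
    left = y[X[:, 0] <= split].mean()
    right = y[X[:, 0] > split].mean()
    return lambda X_new: np.where(X_new[:, 0] <= split, left, right)
```

Averaging many such unstable stumps smooths their predictions, which is exactly where bagging tends to help.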
An average mixture (AVGM) procedure for fitting the parameter of a parametric model
has been studied by Zhang et al. [16]. AVGM partitions the full available dataset into disjoint
subsets, estimates the parameter within each subset, and finally combines the estimates by
simple averaging. Under certain conditions on the population risk, the AVGM can achieve
better efficiency than training a parametric model on the full data. A subsampled average
mixture (SAVGM) procedure, an extension of AVGM, is proposed in [16] and is shown to
provide substantial performance benefits over AVGM. As with AVGM, SAVGM partitions
the full data into subsets and estimates the parameter within each subset. However,
SAVGM also takes a single subsample from each partition, reestimates the parameter on the
subsample, and combines the two estimates into a so-called subsample-corrected estimate.
The final parameter estimate is obtained by simple averaging of the subsample-corrected
estimates from each partition. Both procedures have a theoretical backing; however, the
results rely on using parametric models.
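The partition-estimate-average structure of AVGM can be sketched as follows, assuming ordinary least squares as the within-subset parametric estimator (an assumed choice for illustration; the SAVGM subsample correction is omitted):

```python
import numpy as np

def ols(X, y):
    """Within-subset parametric estimator (an assumed choice): OLS."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def avgm(X, y, J, estimator=ols):
    """AVGM: partition the rows into J disjoint subsets, estimate the
    parameter within each subset, and average the J estimates."""
    parts = np.array_split(np.arange(len(y)), J)
    return np.mean([estimator(X[p], y[p]) for p in parts], axis=0)
```

Because each subset estimate can be computed independently, the J fits parallelize trivially; only the final averaging step touches all results.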
An ensemble method for classification with large-scale datasets, using subsets of
observations to train algorithms, and combining the classifiers linearly, was implemented
and discussed in the case study of [8] at Twitter, Inc.
While not a subset method, boosting, formulated in [4], is an example of an ensemble
method that differentiates between the quality of each fit in the ensemble. Boosting iterates
the process of training a weak learner on the full dataset, then reweighting observations, with
higher weights given to poorly classified observations from the previous iteration. However,
boosting is not a subset method, because all observations are iteratively reweighted, and
thus all observations are needed at each iteration. Boosting is also a sequential algorithm,
and hence cannot be parallelized.
Another nonsubset ensemble method that differentiates between the quality of each
fit is the Super Learner algorithm of [14], which generalizes and establishes the theory
for stacking procedures developed by Wolpert [15] and extended by Breiman [2]. Super
Learner learns the optimal weighted combination of a library of candidate base learner
algorithms by using cross-validation and a second-level metalearning algorithm.
Super Learner generalizes stacking by allowing for general loss functions and hence a broader
range of estimator combinations.
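A minimal stacking sketch in this spirit is shown below, using SciPy's `nnls` as the NNLS metalearner and two toy base learners; all of these are illustrative choices, not the Super Learner implementation itself.

```python
import numpy as np
from scipy.optimize import nnls

def ols_learner(X, y):
    """Toy base learner: OLS, returned as a prediction function."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda X_new: X_new @ b

def mean_learner(X, y):
    """Toy base learner: predict the training mean."""
    m = y.mean()
    return lambda X_new: np.full(len(X_new), m)

def stack(X, y, learners, V=5):
    """Stacking: V-fold cross-validated predictions form the level-one
    matrix; an NNLS metalearner learns nonnegative combination weights."""
    n = len(y)
    idx = np.arange(n)
    Z = np.empty((n, len(learners)))
    for fold in np.array_split(idx, V):
        train = np.setdiff1d(idx, fold)
        for l, fit in enumerate(learners):
            # Predict the held-out fold from a fit trained without it.
            Z[fold, l] = fit(X[train], y[train])(X[fold])
    w, _ = nnls(Z, y)                       # metalearner weights
    fits = [fit(X, y) for fit in learners]  # refit on the full data
    return lambda X_new: np.column_stack([f(X_new) for f in fits]) @ w
```

The key point is that the weights are learned from held-out predictions, so a base learner that merely memorizes the training data cannot dominate the combination.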
The Subsemble algorithm is a method proposed in [12], for combining results from fitting
the same underlying algorithm on different subsets of observations. Subsemble is a form of
supervised stacking [2,15] and is similar in nature to the Super Learner algorithm, with
the distinction that base learner fits are trained on subsets of the data instead of the full
training set. Subsemble can also accommodate multiple base learning algorithms, with each
algorithm being fit on each subset. The approach has many benefits and differs from other
ensemble methods in a variety of ways.
First, any type of underlying algorithm, parametric or nonparametric, can be used.
Instead of simply averaging subset-specific fits, Subsemble differentiates fit quality across
the subsets and learns a weighted combination of the subset-specific fits. To evaluate fit
quality and determine the weighted combination, Subsemble uses cross-validation, thereby
using independent data to train the base learners and to learn their weighted combination.
Finally, Subsemble has desirable statistical performance and can improve prediction quality
on both small and large datasets.
This chapter focuses on the statistical properties and performance of the Subsemble
algorithm. We present an oracle result for Subsemble, showing that Subsemble performs
as well as the best possible combination of the subset-specific fits. Empirically, it has been
shown that Subsemble performs well as a prediction procedure for moderate- and large-sized
datasets [12]. Subsemble can, and often does, provide better prediction performance than
fitting a single base algorithm on the full available dataset.
18.2 Subsemble Ensemble Learning for Big Data
Let $X \in \mathbb{R}^p$ denote a real-valued vector of covariates and let $Y \in \mathbb{R}$ represent a real-valued outcome with joint distribution $P_0(X, Y)$. Assume that a training set consists of $n$ independent and identically distributed observations $O_i = (X_i, Y_i) \sim P_0$. The goal is to learn a function $\hat{f}(X)$ for predicting the outcome $Y$ given the input $X$.
Assume that there is a set of $L$ machine learning algorithms, $\Psi^1, \ldots, \Psi^L$, where each is indexed by an algorithm class and a specific set of model parameters. These algorithms can be any class of supervised learning algorithms, such as a Random Forest, Support Vector Machine, or a linear model. The base learner library can also include copies of the same algorithm, specified by different sets of tuning parameters. Typically, in stacking-based [2,15] ensemble methods, functions $\hat{\Psi}^1, \ldots, \hat{\Psi}^L$ are learned by applying the base learning algorithms $\Psi^1, \ldots, \Psi^L$ to the full training dataset and then combining these fits using a metalearning algorithm, $\Phi$, trained on the cross-validated predicted values from the base learners. Historically, in stacking methods, the metalearning method is often chosen to be some sort of regularized linear model, such as nonnegative least squares (NNLS) [2]; however, a variety of parametric and nonparametric methods can be used to learn the optimal combination of the output from the base fits. In the Super Learner algorithm, the metalearning algorithm is specified as a method that minimizes the cross-validated risk of some particular loss function of interest, such as negative log-likelihood loss or squared-error loss.
18.2.1 The Subsemble Algorithm
Instead of using the entire dataset to obtain a single fit $\hat{\Psi}^l$ for each base learner, Subsemble applies algorithm $\Psi^l$ to multiple different subsets of the available observations. The subsets are created by partitioning the entire training set into $J$ disjoint subsets, commonly formed at random and of equal size. With $L$ unique base learners and $J$ subsets, the ensemble then comprises a total of $L \times J$ subset-specific fits, $\hat{\Psi}^l_j$. As in the Super Learner algorithm, Subsemble obtains the optimal combination of the fits by minimizing cross-validated risk.
In stacking algorithms, $V$-fold cross-validation is often used to generate what is called the level-one data. The level-one data is the input to the metalearning algorithm, as distinct from the level-zero data, the original training dataset. In the Super Learner algorithm, the level-one data consists of the $V$-fold cross-validated predicted values from each base learning algorithm. With $L$ base learners and a training set of $n$ observations, the level-one data form an $n \times L$ matrix that serves as the design matrix in the metalearning task.
In the Subsemble algorithm, a modified version of $V$-fold cross-validation is used to obtain the level-one data. Each of the $J$ subsets is partitioned further into $V$ folds, so that the $v$th validation fold spans all $J$ subsets. For each base learning algorithm $\Psi^l$, the $(j, v)$th iteration of the cross-validation process is defined as follows:

1. Train the $(j, v)$th subset-specific fit, $\hat{\Psi}^l_{j,v}$, by applying $\Psi^l$ to the observations that are in folds $\{1, \ldots, V\} \setminus v$, but restricted to subset $j$. The training set used here is a subset of the $j$th subset and contains approximately $n(V-1)/(JV)$ observations.
2. Using the subset-specific fit $\hat{\Psi}^l_{j,v}$, predicted values are generated for the entire $v$th validation fold, including those observations that are not in subset $j$. The size of the validation set for the $(j, v)$th iteration is $n/V$.

This unique version of cross-validation generates predicted values for all $n$ observations in the full training set, while only training on subsets of the data. A total of $L \times J$ learner-subset models are cross-validated, resulting in an $n \times (L \times J)$ matrix of level-one data that can be used to train the metalearning algorithm, $\Phi$. A diagram depicting the Subsemble algorithm using a single underlying base learning algorithm, $\Psi$, is shown in Figure 18.1.
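The modified cross-validation above can be sketched as follows. This is an illustrative Python sketch, not the chapter's `subsemble` R implementation; base learners are assumed to be functions that return prediction functions, and subset/fold labels are drawn at random.

```python
import numpy as np

def subsemble_level_one(X, y, learners, J=3, V=5, seed=0):
    """Subsemble's modified V-fold CV (a sketch): each observation gets
    a subset label and a fold label, so fold v spans all J subsets.
    The fit for (learner l, subset j, fold v) is trained on subset j
    minus fold v, then predicts the *entire* v-th fold. Returns the
    n x (L*J) level-one matrix."""
    rng = np.random.default_rng(seed)
    n = len(y)
    subset = rng.integers(0, J, size=n)  # random, roughly equal subsets
    fold = rng.integers(0, V, size=n)    # folds span all subsets
    L = len(learners)
    Z = np.empty((n, L * J))
    for v in range(V):
        val = fold == v
        for j in range(J):
            train = (subset == j) & ~val  # subset j, folds != v
            for l, fit in enumerate(learners):
                # Predict the whole fold v, including rows outside subset j.
                Z[val, l * J + j] = fit(X[train], y[train])(X[val])
    return Z
```

Every row of the returned matrix is an out-of-fold prediction, so the metalearner trained on it never sees a prediction made by a fit that was trained on that observation.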
More formally, we define $P_{n,v}$ as the empirical distribution of the observations not in the $v$th fold. For each observation $i$, define $P_{n,v(i)}$ to be the empirical distribution of the observations not in the fold containing observation $i$. The optimal combination is selected by applying the metalearning algorithm $\Phi$ to the following redefined set of $n$ observations: $(\tilde{X}_i, Y_i)$, where $\tilde{X}_i = \{\tilde{X}^l_i\}_{l=1}^{L}$ and $\tilde{X}^l_i = \{\hat{\Psi}^l_j(P_{n,v(i)})(X_i)\}_{j=1}^{J}$. That is, for each $i$, the level-one input vector $\tilde{X}_i$ consists of the $L \times J$ predicted values obtained by evaluating the $L \times J$ subset-specific estimators, trained on the data excluding the $v(i)$th fold, at $X_i$.
The cross-validation process is used only to generate the level-one data, so as a separate task, $L \times J$ final subset-specific fits are trained, using the entire subset $j$ as the training set for each $(l, j)$th fit. The final Subsemble fit comprises the $L \times J$ subset-specific fits $\hat{\Psi}^l_j$ and a metalearner fit, $\hat{\Phi}$. Pseudocode for the Subsemble algorithm is shown in Figure 18.2.
[Figure 18.1: the recoverable content shows each subset $j = 1, \ldots, J$ of size $n/J$ split into $V$ folds, the cross-validated subset fits $\hat{\psi}_{1v}, \hat{\psi}_{2v}, \ldots, \hat{\psi}_{Jv}$ entering the metalearning step $\arg\min_{\beta} \sum_i \{Y_i - (\beta_1 \hat{\psi}_{1v} + \beta_2 \hat{\psi}_{2v} + \cdots + \beta_J \hat{\psi}_{Jv})\}^2$, and the final combination $\beta_1 \hat{\psi}_1 + \beta_2 \hat{\psi}_2 + \cdots + \beta_J \hat{\psi}_J$.]

FIGURE 18.1
Diagram of the Subsemble procedure using a single base learner ψ and linear regression as
the metalearning algorithm.
Algorithm 18.1 Subsemble

Assume $n$ observations $(X_i, Y_i)$
Partition the $n$ observations into $J$ disjoint subsets
Base learning algorithms: $\Psi^1, \ldots, \Psi^L$
Metalearner algorithm: $\Phi$
Optimal combination: $\hat{\Phi}(\{\hat{\Psi}^l_1, \ldots, \hat{\Psi}^l_J\}_{l=1}^{L})$

for $j \leftarrow 1 : J$ do
    // Create subset-specific base learner fits
    for $l \leftarrow 1 : L$ do
        $\hat{\Psi}^l_j \leftarrow$ apply $\Psi^l$ to observations $i$ such that $i \in j$
    end for
    // Create V folds
    Randomly partition subset $j$ into $V$ folds
end for
for $v \leftarrow 1 : V$ do
    // CV fits
    for $j \leftarrow 1 : J$ and $l \leftarrow 1 : L$ do
        $\hat{\Psi}^l_{j,v} \leftarrow$ apply $\Psi^l$ to observations $i$ such that $i \in j$, $i \notin v$
    end for
    for $i : i \in v$ do
        // Predicted values
        $\tilde{X}_i \leftarrow \{\hat{\Psi}^l_{1,v}(X_i), \ldots, \hat{\Psi}^l_{J,v}(X_i)\}_{l=1}^{L}$
    end for
end for
$\hat{\Phi} \leftarrow$ apply $\Phi$ to training data $(Y_i, \tilde{X}_i)$, $i = 1, \ldots, n$
$\hat{\Phi}(\{\hat{\Psi}^l_1, \ldots, \hat{\Psi}^l_J\}_{l=1}^{L}) \leftarrow$ final prediction function

FIGURE 18.2
Pseudocode for the Subsemble algorithm.
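The steps of Algorithm 18.1 can be sketched end to end as follows. This is an illustration under stated assumptions, not the chapter's implementation: least squares stands in for the metalearner $\Phi$, `ols_learner` is a toy base learner, and subset/fold labels are drawn at random.

```python
import numpy as np

def ols_learner(X, y):
    """Toy base learner Psi: OLS, returned as a prediction function."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda X_new: X_new @ b

def fit_subsemble(X, y, learners, J=2, V=5, seed=0):
    """End-to-end sketch of Algorithm 18.1. Least squares stands in
    for the metalearner Phi (an assumed choice for illustration)."""
    rng = np.random.default_rng(seed)
    n, L = len(y), len(learners)
    subset = rng.integers(0, J, size=n)   # J disjoint subsets
    fold = rng.integers(0, V, size=n)     # V folds spanning the subsets
    # Level-one data from the modified V-fold cross-validation
    Z = np.empty((n, L * J))
    for v in range(V):
        val = fold == v
        for j in range(J):
            train = (subset == j) & ~val
            for l, fit in enumerate(learners):
                Z[val, l * J + j] = fit(X[train], y[train])(X[val])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]   # metalearner fit
    # Final subset-specific fits, each trained on an entire subset j
    final = [[fit(X[subset == j], y[subset == j]) for j in range(J)]
             for fit in learners]
    def predict(X_new):
        # Evaluate all L x J final fits, then apply the learned weights.
        cols = np.column_stack([final[l][j](X_new)
                                for l in range(L) for j in range(J)])
        return cols @ beta
    return predict
```

Note that the cross-validated fits are used only to learn `beta`; prediction uses the separate final fits trained on whole subsets, mirroring the last two lines of the pseudocode.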
18.2.2 Oracle Result for Subsemble
The following oracle result, proven in [12], gives a theoretical guarantee of Subsemble's performance and follows directly from the work of [14]. Theorem 18.1 has been extended from the original formulation to allow for $L$ base learners instead of a single base learner. The squared-error loss function is used as the example loss for the metalearning algorithm in Theorem 18.1.
Theorem 18.1  Assume that the metalearner algorithm $\hat{\Phi} = \hat{\Phi}_\beta$ is indexed by a finite-dimensional parameter $\beta \in B$. Let $B_n$ be a finite set of values in $B$, with the number of values growing at most at a polynomial rate in $n$. Assume that there exist a bounded set $\mathcal{Y} \subset \mathbb{R}$ and a Euclidean set $\mathcal{X}$ such that $P((Y, X) \in \mathcal{Y} \times \mathcal{X}) = 1$ and $P(\hat{\Psi}^l(P_n) \in \mathcal{Y}) = 1$ for $l = 1, \ldots, L$.

Define the cross-validation selector of $\beta$ as
$$\beta_n = \arg\min_{\beta \in B_n} \sum_{i=1}^{n} \left( Y_i - \hat{\Phi}_\beta(\tilde{X}_i) \right)^2$$