18
Divide and Recombine: Subsemble, Exploiting
the Power of Cross-Validation
Stephanie Sapp and Erin LeDell
CONTENTS
18.1 Introduction .................................................................... 323
18.2 Subsemble Ensemble Learning for Big Data ................................... 325
18.2.1 The Subsemble Algorithm ............................................. 325
18.2.2 Oracle Result for Subsemble .......................................... 327
18.2.3 A Practical Subsemble Implementation ............................... 328
18.2.3.1 subsemble R Code Example ................................ 329
18.2.4 Performance Benchmarks .............................................. 330
18.2.4.1 Model Performance ......................................... 330
18.2.4.2 Computational Performance ............................... 331
18.3 Subsembles with Subset Supervision ........................................... 331
18.3.1 Supervised Subsembles ................................................ 332
18.3.2 The SRT Subsemble Algorithm ....................................... 333
18.3.2.1 Constructing and Selecting the Number of Subsets ....... 333
18.3.3 SRT Subsemble in Practice ............................................ 334
18.4 Concluding Remarks ........................................................... 335
18.5 Glossary ........................................................................ 335
References ............................................................................. 337
18.1 Introduction
As massive datasets become increasingly common, new scalable approaches to prediction are
needed. Given that memory and runtime constraints are common in practice, it is important
to develop practical machine learning methods that perform well on big datasets in a fixed
computational resource setting. Procedures using subsets from a training set are promising
tools for prediction with large-scale datasets [16]. Recent research has focused on developing
and evaluating the performance of various subset-based prediction procedures. Subsetting
procedures in machine learning construct subsets from the available training data, then
train an algorithm on each subset, and finally combine the results across the subsets to
form a final prediction. Prediction methods operating on subsets of the training data can
take advantage of modern computational resources, because machine learning on subsets
can be massively parallelized.
Bagging [1], or bootstrap aggregating, is a classic example of a subsampling prediction
procedure. Bagging involves drawing many bootstrap samples of a fixed size, fitting the
Handbook of Big Data
same underlying algorithm on each bootstrap sample, and obtaining the final prediction
by averaging the results across the fits. Bagging can lead to significant model performance
gains when used with weak or unstable algorithms such as classification or regression trees.
The bootstrap samples are drawn with replacement, so each bootstrap sample of size n
contains approximately 63.2% of the unique training examples, while the remainder of the
observations contained in the sample are duplicates. Therefore, in bagging, each model is fit
using only a subset of the original training observations. The drawback of taking a simple
average of the output from the subset fits is that the predictions from each of the fits
are weighted equally, regardless of the individual quality of each fit. The performance of a
bagged fit can be much better compared to that of a nonbagged algorithm, but a simple
average is not necessarily the optimal combination of a set of base learners.
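The bagging procedure just described can be sketched as follows. This is an illustration only, not the chapter's implementation; `stump`, a depth-one regression tree, is a hypothetical weak base learner chosen for the example.

```python
import numpy as np

def bag(fit, X, y, n_boot=25, rng=None):
    """Bootstrap aggregating: fit the same base learner on n_boot
    bootstrap samples and average their predictions with equal weight."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # Each bootstrap sample (drawn with replacement) contains ~63.2%
    # of the unique training observations.
    fits = [fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    def predict(X_new):
        # Simple average: every subset fit is weighted equally,
        # regardless of its individual quality.
        return np.mean([f(X_new) for f in fits], axis=0)
    return predict

def stump(X, y):
    """A hypothetical weak base learner: a depth-one regression tree
    splitting on the first feature at its median."""
    split = np.median(X[:, 0])
    left = y[X[:, 0] <= split].mean()
    right = y[X[:, 0] > split].mean()
    return lambda X_new: np.where(X_new[:, 0] <= split, left, right)
```

Averaging many such unstable stumps smooths their predictions, which is exactly where bagging tends to help.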
An average mixture (AVGM) procedure for fitting the parameter of a parametric model
has been studied by Zhang et al. [16]. AVGM partitions the full available dataset into disjoint
subsets, estimates the parameter within each subset, and finally combines the estimates by
simple averaging. Under certain conditions on the population risk, the AVGM can achieve
better efficiency than training a parametric model on the full data. A subsampled average
mixture (SAVGM) procedure, an extension of AVGM, is proposed in [16] and is shown to
provide substantial performance benefits over AVGM. As with AVGM, SAVGM partitions
the full data into subsets and estimates the parameter within each subset. However,
SAVGM also takes a single subsample from each partition, reestimates the parameter on the
subsample, and combines the two estimates into a so-called subsample-corrected estimate.
The final parameter estimate is obtained by simple averaging of the subsample-corrected
estimates from each partition. Both procedures have a theoretical backing; however, the
results rely on using parametric models.
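The partition-estimate-average structure of AVGM can be sketched as follows, assuming ordinary least squares as the within-subset parametric estimator (an assumed choice for illustration; the SAVGM subsample correction is omitted):

```python
import numpy as np

def ols(X, y):
    """Within-subset parametric estimator (an assumed choice): OLS."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def avgm(X, y, J, estimator=ols):
    """AVGM: partition the rows into J disjoint subsets, estimate the
    parameter within each subset, and average the J estimates."""
    parts = np.array_split(np.arange(len(y)), J)
    return np.mean([estimator(X[p], y[p]) for p in parts], axis=0)
```

Because each subset estimate can be computed independently, the J fits parallelize trivially; only the final averaging step touches all results.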
An ensemble method for classification with large-scale datasets, using subsets of
observations to train algorithms, and combining the classifiers linearly, was implemented
and discussed in the case study of [8] at Twitter, Inc.
While not a subset method, boosting, formulated in [4], is an example of an ensemble
method that differentiates between the quality of each fit in the ensemble. Boosting iterates
the process of training a weak learner on the full dataset, then reweighting observations, with
higher weights given to poorly classified observations from the previous iteration. However,
boosting is not a subset method, because all observations are iteratively reweighted, and
thus all observations are needed at each iteration. Boosting is also a sequential algorithm,
and hence cannot be parallelized.
Another nonsubset ensemble method that differentiates between the quality of each
fit is the Super Learner algorithm of [14], which generalizes and establishes the theory
for stacking procedures developed by Wolpert [15] and extended by Breiman [2]. Super
Learner learns the optimal weighted combination of a library of candidate base learner
algorithms by using cross-validation and a second-level metalearning algorithm.
Super Learner generalizes stacking by allowing for general loss functions and hence a broader
range of estimator combinations.
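A minimal stacking sketch in this spirit is shown below, using SciPy's `nnls` as the NNLS metalearner and two toy base learners; all of these are illustrative choices, not the Super Learner implementation itself.

```python
import numpy as np
from scipy.optimize import nnls

def ols_learner(X, y):
    """Toy base learner: OLS, returned as a prediction function."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda X_new: X_new @ b

def mean_learner(X, y):
    """Toy base learner: predict the training mean."""
    m = y.mean()
    return lambda X_new: np.full(len(X_new), m)

def stack(X, y, learners, V=5):
    """Stacking: V-fold cross-validated predictions form the level-one
    matrix; an NNLS metalearner learns nonnegative combination weights."""
    n = len(y)
    idx = np.arange(n)
    Z = np.empty((n, len(learners)))
    for fold in np.array_split(idx, V):
        train = np.setdiff1d(idx, fold)
        for l, fit in enumerate(learners):
            # Predict the held-out fold from a fit trained without it.
            Z[fold, l] = fit(X[train], y[train])(X[fold])
    w, _ = nnls(Z, y)                       # metalearner weights
    fits = [fit(X, y) for fit in learners]  # refit on the full data
    return lambda X_new: np.column_stack([f(X_new) for f in fits]) @ w
```

The key point is that the weights are learned from held-out predictions, so a base learner that merely memorizes the training data cannot dominate the combination.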
The Subsemble algorithm is a method proposed in [12], for combining results from fitting
the same underlying algorithm on different subsets of observations. Subsemble is a form of
supervised stacking [2,15] and is similar in nature to the Super Learner algorithm, with
the distinction that base learner fits are trained on subsets of the data instead of the full
training set. Subsemble can also accommodate multiple base learning algorithms, with each
algorithm being fit on each subset. The approach has many benefits and differs from other
ensemble methods in a variety of ways.
First, any type of underlying algorithm, parametric or nonparametric, can be used.
Instead of simply averaging subset-specific fits, Subsemble differentiates fit quality across
the subsets and learns a weighted combination of the subset-specific fits. To evaluate fit
quality and determine the weighted combination, Subsemble uses cross-validation, thereby
using independent data to train the base learners and to learn their weighted combination.
Finally, Subsemble has desirable statistical performance and can improve prediction quality
on both small and large datasets.
This chapter focuses on the statistical properties and performance of the Subsemble
algorithm. We present an oracle result for Subsemble, showing that Subsemble performs
as well as the best possible combination of the subset-specific fits. Empirically, it has been
shown that Subsemble performs well as a prediction procedure for moderate- and large-sized
datasets [12]. Subsemble can, and often does, provide better prediction performance than
fitting a single base algorithm on the full available dataset.
18.2 Subsemble Ensemble Learning for Big Data
Let $X \in \mathbb{R}^p$ denote a real-valued vector of covariates and let $Y \in \mathbb{R}$ represent a real-valued outcome with joint distribution $P_0(X, Y)$. Assume that a training set consists of $n$ independent and identically distributed observations $O_i = (X_i, Y_i) \sim P_0$. The goal is to learn a function $\hat{f}(X)$ for predicting the outcome $Y$ given the input $X$.
Assume that there is a set of $L$ machine learning algorithms, $\Psi^1, \ldots, \Psi^L$, where each is indexed by an algorithm class and a specific set of model parameters. These algorithms can be any class of supervised learning algorithms, such as a Random Forest, Support Vector Machine, or a linear model. The base learner library can also include copies of the same algorithm, specified by different sets of tuning parameters. Typically, in stacking-based [2,15] ensemble methods, functions $\hat{\Psi}^1, \ldots, \hat{\Psi}^L$ are learned by applying the base learning algorithms $\Psi^1, \ldots, \Psi^L$ to the full training dataset and then combining these fits using a metalearning algorithm, $\Phi$, trained on the cross-validated predicted values from the base learners. Historically, in stacking methods, the metalearning method is often chosen to be some sort of regularized linear model, such as nonnegative least squares (NNLS) [2]; however, a variety of parametric and nonparametric methods can be used to learn the optimal combination of the output from the base fits. In the Super Learner algorithm, the metalearning algorithm is specified as a method that minimizes the cross-validated risk of some particular loss function of interest, such as negative log-likelihood loss or squared-error loss.
18.2.1 The Subsemble Algorithm
Instead of using the entire dataset to obtain a single fit $\hat{\Psi}^l$ for each base learner, Subsemble applies algorithm $\Psi^l$ to multiple different subsets of the available observations. The subsets are created by partitioning the entire training set into $J$ disjoint subsets, commonly formed at random and of equal size. With $L$ unique base learners and $J$ subsets, the ensemble then comprises a total of $L \times J$ subset-specific fits, $\hat{\Psi}^l_j$. As in the Super Learner algorithm, Subsemble obtains the optimal combination of the fits by minimizing cross-validated risk.
In stacking algorithms, $V$-fold cross-validation is often used to generate what is called the level-one data. The level-one data is the input to the metalearning algorithm, as distinct from the level-zero data, the original training dataset. In the Super Learner algorithm, the level-one data consists of the $V$-fold cross-validated predicted values from each base learning algorithm. With $L$ base learners and a training set of $n$ observations, the level-one data form an $n \times L$ matrix that serves as the design matrix in the metalearning task.
In the Subsemble algorithm, a modified version of $V$-fold cross-validation is used to obtain the level-one data. Each of the $J$ subsets is partitioned further into $V$ folds, so that the $v$th validation fold spans all $J$ subsets. For each base learning algorithm $\Psi^l$, the $(j, v)$th iteration of the cross-validation process is defined as follows:

1. Train the $(j, v)$th subset-specific fit, $\hat{\Psi}^l_{j,v}$, by applying $\Psi^l$ to the observations that are in folds $\{1, \ldots, V\} \setminus v$, but restricted to subset $j$. The training set used here is a subset of the $j$th subset and contains approximately $n(V-1)/(JV)$ observations.
2. Using the subset-specific fit $\hat{\Psi}^l_{j,v}$, predicted values are generated for the entire $v$th validation fold, including those observations that are not in subset $j$. The size of the validation set for the $(j, v)$th iteration is $n/V$.

This unique version of cross-validation generates predicted values for all $n$ observations in the full training set, while only training on subsets of the data. A total of $L \times J$ learner-subset models are cross-validated, resulting in an $n \times (L \times J)$ matrix of level-one data that can be used to train the metalearning algorithm, $\Phi$. A diagram depicting the Subsemble algorithm using a single underlying base learning algorithm, $\Psi$, is shown in Figure 18.1.
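The modified cross-validation above can be sketched as follows. This is an illustrative Python sketch, not the chapter's `subsemble` R implementation; base learners are assumed to be functions that return prediction functions, and subset/fold labels are drawn at random.

```python
import numpy as np

def subsemble_level_one(X, y, learners, J=3, V=5, seed=0):
    """Subsemble's modified V-fold CV (a sketch): each observation gets
    a subset label and a fold label, so fold v spans all J subsets.
    The fit for (learner l, subset j, fold v) is trained on subset j
    minus fold v, then predicts the *entire* v-th fold. Returns the
    n x (L*J) level-one matrix."""
    rng = np.random.default_rng(seed)
    n = len(y)
    subset = rng.integers(0, J, size=n)  # random, roughly equal subsets
    fold = rng.integers(0, V, size=n)    # folds span all subsets
    L = len(learners)
    Z = np.empty((n, L * J))
    for v in range(V):
        val = fold == v
        for j in range(J):
            train = (subset == j) & ~val  # subset j, folds != v
            for l, fit in enumerate(learners):
                # Predict the whole fold v, including rows outside subset j.
                Z[val, l * J + j] = fit(X[train], y[train])(X[val])
    return Z
```

Every row of the returned matrix is an out-of-fold prediction, so the metalearner trained on it never sees a prediction made by a fit that was trained on that observation.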
More formally, we define $P_{n,v}$ as the empirical distribution of the observations not in the $v$th fold. For each observation $i$, define $P_{n,v(i)}$ to be the empirical distribution of the observations not in the fold containing observation $i$. The optimal combination is selected by applying the metalearning algorithm $\Phi$ to the following redefined set of $n$ observations: $(\tilde{X}_i, Y_i)$, where $\tilde{X}_i = \{\tilde{X}^l_i\}_{l=1}^{L}$ and $\tilde{X}^l_i = \{\hat{\Psi}^l_j(P_{n,v(i)})(X_i)\}_{j=1}^{J}$. That is, for each $i$, the level-one input vector $\tilde{X}_i$ consists of the $L \times J$ predicted values obtained by evaluating the $L \times J$ subset-specific estimators, trained on the data excluding the $v(i)$th fold, at $X_i$.
The cross-validation process is used only to generate the level-one data, so as a separate task, $L \times J$ final subset-specific fits are trained, using the entire subset $j$ as the training set for each $(l, j)$th fit. The final Subsemble fit comprises the $L \times J$ subset-specific fits $\hat{\Psi}^l_j$ and a metalearner fit, $\hat{\Phi}$. Pseudocode for the Subsemble algorithm is shown in Figure 18.2.
[Figure 18.1: the recoverable content shows each subset $j = 1, \ldots, J$ of size $n/J$ split into $V$ folds, the cross-validated subset fits $\hat{\psi}_{1v}, \hat{\psi}_{2v}, \ldots, \hat{\psi}_{Jv}$ entering the metalearning step $\arg\min_{\beta} \sum_i \{Y_i - (\beta_1 \hat{\psi}_{1v} + \beta_2 \hat{\psi}_{2v} + \cdots + \beta_J \hat{\psi}_{Jv})\}^2$, and the final combination $\beta_1 \hat{\psi}_1 + \beta_2 \hat{\psi}_2 + \cdots + \beta_J \hat{\psi}_J$.]

FIGURE 18.1
Diagram of the Subsemble procedure using a single base learner ψ and linear regression as
the metalearning algorithm.
Algorithm 18.1 Subsemble

Assume $n$ observations $(X_i, Y_i)$
Partition the $n$ observations into $J$ disjoint subsets
Base learning algorithms: $\Psi^1, \ldots, \Psi^L$
Metalearner algorithm: $\Phi$
Optimal combination: $\hat{\Phi}(\{\hat{\Psi}^l_1, \ldots, \hat{\Psi}^l_J\}_{l=1}^{L})$

for $j \leftarrow 1 : J$ do
    // Create subset-specific base learner fits
    for $l \leftarrow 1 : L$ do
        $\hat{\Psi}^l_j \leftarrow$ apply $\Psi^l$ to observations $i$ such that $i \in j$
    end for
    // Create V folds
    Randomly partition subset $j$ into $V$ folds
end for
for $v \leftarrow 1 : V$ do
    // CV fits
    for $j \leftarrow 1 : J$ and $l \leftarrow 1 : L$ do
        $\hat{\Psi}^l_{j,v} \leftarrow$ apply $\Psi^l$ to observations $i$ such that $i \in j$, $i \notin v$
    end for
    for $i : i \in v$ do
        // Predicted values
        $\tilde{X}_i \leftarrow \{\hat{\Psi}^l_{1,v}(X_i), \ldots, \hat{\Psi}^l_{J,v}(X_i)\}_{l=1}^{L}$
    end for
end for
$\hat{\Phi} \leftarrow$ apply $\Phi$ to training data $(Y_i, \tilde{X}_i)$, $i = 1, \ldots, n$
$\hat{\Phi}(\{\hat{\Psi}^l_1, \ldots, \hat{\Psi}^l_J\}_{l=1}^{L}) \leftarrow$ final prediction function

FIGURE 18.2
Pseudocode for the Subsemble algorithm.
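The steps of Algorithm 18.1 can be sketched end to end as follows. This is an illustration under stated assumptions, not the chapter's implementation: least squares stands in for the metalearner $\Phi$, `ols_learner` is a toy base learner, and subset/fold labels are drawn at random.

```python
import numpy as np

def ols_learner(X, y):
    """Toy base learner Psi: OLS, returned as a prediction function."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda X_new: X_new @ b

def fit_subsemble(X, y, learners, J=2, V=5, seed=0):
    """End-to-end sketch of Algorithm 18.1. Least squares stands in
    for the metalearner Phi (an assumed choice for illustration)."""
    rng = np.random.default_rng(seed)
    n, L = len(y), len(learners)
    subset = rng.integers(0, J, size=n)   # J disjoint subsets
    fold = rng.integers(0, V, size=n)     # V folds spanning the subsets
    # Level-one data from the modified V-fold cross-validation
    Z = np.empty((n, L * J))
    for v in range(V):
        val = fold == v
        for j in range(J):
            train = (subset == j) & ~val
            for l, fit in enumerate(learners):
                Z[val, l * J + j] = fit(X[train], y[train])(X[val])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]   # metalearner fit
    # Final subset-specific fits, each trained on an entire subset j
    final = [[fit(X[subset == j], y[subset == j]) for j in range(J)]
             for fit in learners]
    def predict(X_new):
        # Evaluate all L x J final fits, then apply the learned weights.
        cols = np.column_stack([final[l][j](X_new)
                                for l in range(L) for j in range(J)])
        return cols @ beta
    return predict
```

Note that the cross-validated fits are used only to learn `beta`; prediction uses the separate final fits trained on whole subsets, mirroring the last two lines of the pseudocode.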
18.2.2 Oracle Result for Subsemble
The following oracle result, proven in [12], gives a theoretical guarantee of Subsemble's performance and follows directly from the work of [14]. Theorem 18.1 has been extended from the original formulation to allow for $L$ base learners instead of a single base learner. The squared-error loss function is used as the example loss for the metalearning algorithm in Theorem 18.1.
Theorem 18.1  Assume that the metalearner algorithm $\hat{\Phi} = \hat{\Phi}_\beta$ is indexed by a finite-dimensional parameter $\beta \in B$. Let $B_n$ be a finite set of values in $B$, with the number of values growing at most at a polynomial rate in $n$. Assume that there exist a bounded set $\mathcal{Y} \subset \mathbb{R}$ and a Euclidean set $\mathcal{X}$ such that $P((Y, X) \in \mathcal{Y} \times \mathcal{X}) = 1$ and $P(\hat{\Psi}^l(P_n) \in \mathcal{Y}) = 1$ for $l = 1, \ldots, L$.

Define the cross-validation selector of $\beta$ as
$$\beta_n = \arg\min_{\beta \in B_n} \sum_{i=1}^{n} \left( Y_i - \hat{\Phi}_\beta(\tilde{X}_i) \right)^2$$