Chapter 7. Ensemble Methods

In this chapter, we take a step back from learning new models and instead think about how several trained models can work together as an ensemble, in order to produce a single model that is more powerful than the individual models involved.

The first type of ensemble that we will study uses different samples of the same data set in order to train multiple versions of the same model. These models then vote on the correct answer for a new observation and an average or majority decision is made, depending on the type of problem. This process is known as bagging, which is short for bootstrap aggregation. Another approach to combine models is boosting. This essentially involves training a chain of models and assigning weights to observations that were incorrectly classified or fell far from their predicted value so that successive models are forced to prioritize them.

As methods, bagging and boosting are fairly general and have been applied with a number of different types of models. Decision trees, studied in Chapter 6, Tree-based Methods, are particularly suited to ensemble methods. So much so, that a particular type of tree-based ensemble model has its own name—the random forest. Random forests offer significant improvements over the single decision tree and are generally considered to be very powerful and flexible models, as we shall soon discover. In this chapter, we'll revisit some of the data sets we analyzed in previous chapters and see if we can improve performance by applying some of the principles we learn here.

Bagging

The focus of this chapter is on combining the results from different models in order to produce a single model that will outperform individual models on their own. Bagging is essentially an intuitive procedure for combining multiple models trained on the same data set, by using majority voting for classification models and average value for regression models. We'll present this procedure for the classification case, and later show how this is easily extended to handle regression models.

Bagging procedure for binary classification

Inputs:

  • data: The input data frame containing the input features and a column with the binary output label
  • M: An integer, representing the number of models that we want to train

Output:

  • models: A set of M trained binary classifier models

Method:

1. Create a random sample of size n, where n is the number of observations in the original data set, with replacement. This means that some of the observations from the original training set will be repeated and some will not be chosen at all. This process is known as bootstrapping, bootstrap sampling, or bootstrap resampling.

2. Train a classification model using this sampled data set. Typically, we opt not to use regularization or shrinkage methods designed to reduce overfitting in this step, because the aggregation performed at the end will smooth out any overfitting.

3. For each observation in the sampled data set, record the class assigned by the model.

4. Repeat this process M times in order to train M models.

5. For every observation in the original training data set, compute the predicted class via a majority vote across the different models. For example, suppose M = 61 and through bootstrap sampling a particular observation appears in the training data for 50 of the models. If 37 of these predict class 1 for this observation and 13 predict class -1, by majority vote the overall prediction will be class 1.

6. Compute the model's accuracy using the labels provided by the training set.

In a nutshell, all we are doing is training the same model on M different versions of the input training set (created through sampling with replacement) and aggregating the results.
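As a compact illustration of steps 1 through 6, here is a minimal sketch that applies the procedure to a small synthetic data set, using glm() as the base classifier; both the synthetic data and the choice of base learner are assumptions made purely for illustration and are not part of the procedure itself:

set.seed(1)
n <- 200
synthetic <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
synthetic$y <- as.numeric(synthetic$x1 + synthetic$x2 + rnorm(n) > 0)

M <- 11                                       # number of models (odd avoids ties)
votes <- matrix(NA, nrow = n, ncol = M)

for (m in 1 : M) {                            # step 4: repeat M times
    boot <- sample(n, n, replace = TRUE)      # step 1: bootstrap sample
    model <- glm(y ~ ., data = synthetic[boot, ],
                 family = binomial("logit"))  # step 2: train on the sample
    votes[boot, m] <- as.numeric(             # step 3: record the predicted class
        predict(model, synthetic[boot, ], type = "response") > 0.5)
}

bagged_class <- as.numeric(rowMeans(votes, na.rm = TRUE) > 0.5)  # step 5: majority vote
mean(bagged_class == synthetic$y, na.rm = TRUE)                  # step 6: training accuracy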

A legitimate question to ask is: "How many distinct observations do we get each time we sample with replacement?" On average, each bootstrap sample contains roughly 63 percent of the distinct observations in the original data set. To see where this comes from, note that because we are sampling with replacement, each individual draw misses a particular observation, x1, with probability 1 - 1/n, so the probability that x1 is never picked across all n draws is (1 - 1/n)^n. Summing this probability over the n observations and dividing by n shows that it is also the expected proportion of observations that are not selected from the training data set. For large n, (1 - 1/n)^n is well approximated by e^-1, which is roughly 37 percent, so the expected proportion of observations that are selected is around 63 percent. This figure is just an average, of course, and the approximation is more accurate for larger values of n.
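As a quick empirical check of this figure, the short snippet below (an illustrative aside, not part of the main analysis) estimates the average proportion of distinct observations in a bootstrap sample; the result should land close to 1 - e^-1, or about 0.632:

set.seed(1234)
n <- 1000
mean(replicate(200, length(unique(sample(n, n, replace = TRUE))) / n))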

Margins and out-of-bag observations

Let's imagine that for a particular observation, x1, 85 percent of our models predict the correct class and the remaining 15 percent predict the incorrect class. Let's also imagine that we have another observation, x2, for which the analogous percentages are 53 percent and 47 percent. Clearly, our intuition suggests that we should be more confident about the classification of the former observation than the latter. Put differently, the difference between the classification proportions, also known as the margin (similar to, but not to be confused with, the margin used for support vector machines), is a good indicator of the confidence of our classification.

The 70 percent margin of observation x1 is much larger than the 6 percent margin of observation x2 and thus, we believe more strongly in our ability to correctly classify the former observation. In general, what we are hoping for is a classifier that has a large margin for all the observations. We are less optimistic about the generalization abilities of a classifier that has a small margin for more than a handful of observations.
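To make the arithmetic concrete, the margin is simply the proportion of models voting for the correct class minus its complement. The lines below are a small illustrative snippet rather than part of the original analysis:

p_correct <- c(x1 = 0.85, x2 = 0.53)
margin <- p_correct - (1 - p_correct)   # 0.70 for x1 and 0.06 for x2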

One thing the reader may have noticed here is that in generating the set of predicted values for each model, we are using the same data on which the model was trained. If we look closely at step 3 of the procedure, we are classifying the same sampled data that we used in step 2 to train the model. Even though we eventually rely on an averaging process at the end to obtain the estimated accuracy of the bagged classifier on unseen data, we haven't actually used any unseen data along the way.

Remember that in step 1 we constructed a sample of the training data with which to train our model. From the original data set, we refer to the observations that were not chosen for a particular iteration of the procedure as the out-of-bag (OOB) observations. These observations are therefore not used in the training of the model at that iteration. Consequently, instead of relying on the observations used to train the models at every step, we can actually use the OOB observations to record the accuracy of a particular model.

In the end, we average over all the OOB accuracy rates to obtain an average accuracy. This average accuracy is far more likely to be a realistic and objective estimate of the performance of the bagged classifier on unseen data. For a particular observation, the assigned class is thus decided as the majority vote over all classifiers for which the observation was not picked in their corresponding training sample.

The samples generated from sampling the original data set with replacement, known as bootstrapped samples, are similar to drawing multiple samples from the same distribution. As we are trying to estimate the same target function using a number of different samples instead of just one, the averaging process reduces the variance of the result. To see this, consider trying to estimate the mean of a set of observations drawn from the same distribution and all mutually independent of each other. More formally, these are known as independent and identically distributed (iid) observations. If each observation has variance σ², then the variance of the mean of n such observations is σ²/n.

This shows that as the number of observations increases, the variance decreases. Bagging tries to achieve the same behavior for the function we are trying to model. We don't have truly independent training samples, and are instead forced to use bootstrapped samples, but this thought experiment should be enough to convince us that, in principle, bagging has the potential to reduce the variance of the model. At the same time, this averaging process is a form of smoothing over any localized bumps in the function that we are trying to estimate. Assuming that the target regression function or classification boundary that we are trying to estimate is actually smooth, then bagging may also reduce the bias of our model.
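The following short simulation (illustrative only) checks the σ²/n claim empirically: with σ² = 1 and n = 25, the variance of the sample mean should come out close to 0.04:

set.seed(42)
n <- 25
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = 1)))
var(sample_means)   # should be close to 1 / n = 0.04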

Predicting complex skill learning with bagging

Bagging and boosting are both very popular with the tree-based models that we studied in Chapter 6, Tree-based Methods, and several notable implementations apply these approaches to tree-building methodologies such as CART.

The ipred package, for example, contains an implementation to build a bagged predictor for trees built with rpart(). We can experiment with the bagging() function that this package provides. To do this, we specify the number of bagged trees to make using the nbagg parameter (default is 25) and indicate that we want to compute accuracy using the out-of-bag (OOB) samples by setting the coob parameter to TRUE.

We will do this for our complex skill learning data set from the previous chapter, using the same training data frame:

> baggedtree <- bagging(LeagueIndex ~ ., data = skillcraft_train,    
                        nbagg = 100, coob = T)
> baggedtree_predictions <- predict(baggedtree, skillcraft_test)
> (baggedtree_SSE <- compute_SSE(baggedtree_predictions, 
                                 skillcraft_test$LeagueIndex))
[1] 646.3555

As we can see, the SSE on the test set is less than the lowest SSE that we saw when tuning a single tree. Increasing the number of bagged iterations, however, does not seem to improve this performance substantially. We will revisit this data set again later.
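If we wanted to examine the effect of nbagg more systematically, a loop along the following lines would do it. This is only a sketch: it assumes the ipred package is available and that skillcraft_train, skillcraft_test, and compute_SSE() from Chapter 6 are already in the workspace, and the seed is an arbitrary choice for comparability across runs:

library(ipred)
nbagg_values <- c(25, 50, 100, 200, 400)
sapply(nbagg_values, function(b) {
    set.seed(987)    # illustrative seed so runs differ only in nbagg
    m <- bagging(LeagueIndex ~ ., data = skillcraft_train, nbagg = b)
    compute_SSE(predict(m, skillcraft_test), skillcraft_test$LeagueIndex)
})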

Predicting heart disease with bagging

The prototypical use case for bagging is the decision tree; however, it is important to remember that we can use this method with a variety of different models. In this section, we will show how we can build a bagged logistic regression classifier. We built a logistic regression classifier for the Statlog Heart data set in Chapter 3, Logistic Regression. Now, we will repeat that experiment but use bagging in order to see if we can improve our results. To begin with, we'll draw our samples with replacement and use these to train our models:

> M <- 11
> seeds <- 70000 : (70000 + M - 1)
> n <- nrow(heart_train)
> sample_vectors <- sapply(seeds, function(x) { set.seed(x); return(sample(n, n, replace = T)) })

In our code, the data frames heart_train and heart_test refer to the same data frames that we prepared in Chapter 3, Logistic Regression. We begin by deciding on the number of models that we will train and setting the appropriate value of M. Here, we have used an initial value of 11.

Note that it is a good idea to use an odd number of models with bagging, so that during the majority voting process there can never be a tie with binary classification. For reproducibility, we set a vector of seeds that we will use; this is simply a counter starting from an arbitrarily chosen seed value of 70,000. The sample_vectors variable in our code is a matrix whose columns contain the indexes of rows selected at random, with replacement, from the training data. Note that the rows are numbered 1 through 230 in the training data, making the sampling process easy to code.
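Before moving on, a couple of quick sanity checks on this matrix can be reassuring (these lines are illustrative and not part of the original analysis):

dim(sample_vectors)     # should be n rows (one index per draw) by M columns
range(sample_vectors)   # all indexes should lie between 1 and n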

Next, we'll define a function that creates a single logistic regression model given a sampling vector of indices to use with our training data frame:

 train_1glm <- function(sample_indices) { 
     data <- heart_train[sample_indices,]; 
     model <- glm(OUTPUT ~ ., data = data, family = binomial("logit")); 
     return(model)
 }

> models <- apply(sample_vectors, 2, train_1glm)

In the last line of the preceding code, we iterate through the columns of the sample_vectors matrix we produced earlier and supply them as an input to our logistic regression model training function, train_1glm(). The resulting models are then stored in our final list variable, models. This now contains 11 trained models.

As the first method of evaluating our models, we are going to use the data on which each individual model was trained. To that end, we'll construct the bags variable that is a list of these data frames, this time with unique indexes, as we don't want to use any duplicate rows from the bootstrap sampling process in the evaluation. We'll also add a new column called ID to these data frames that stores the original row names from the heart_train data frame. We'll see why we do this shortly.

 get_1bag <- function(sample_indices) {
     unique_sample <- unique(sample_indices); 
     df <- heart_train[unique_sample, ]; 
     df$ID <- unique_sample; 
     return(df)
 }
 
> bags <- apply(sample_vectors, 2, get_1bag)

We now have a list of models and a list of data frames that they were trained on, the latter without duplicate observations. From these two, we can create a list of predictions. For each training data frame, we will tack on a new column called PREDICTIONS {m}, where {m} will be the number of the model being used to make the predictions. Consequently, the first data frame in the bags list will have a predictions column called PREDICTIONS 1. The second data frame will have a predictions column called PREDICTIONS 2, the third will have one called PREDICTIONS 3, and so on.

The following call produces a new set of data frames as just described, but only keeping the PREDICTIONS{m} and ID columns, and these data frames are stored as a list in the variable training_predictions:

 glm_predictions <- function(model, data, model_index) {
     colname <- paste("PREDICTIONS", model_index);
     data[colname] <- as.numeric( 
                      predict(model, data, type = "response") > 0.5); 
     return(data[,c("ID", colname), drop = FALSE])
 }
 
> training_predictions <- 
     mapply(glm_predictions, models, bags, 1 : M, SIMPLIFY = F)

Next, we want to merge all of these data frames onto a single data frame, where the rows are the rows of the original data frame (and thus, correspond to the observations in the data set) and the columns are the predictions made by each model on the observations. Where a particular row (observation) was not selected by the sampling process to train a particular model, it will have an NA value in the column corresponding to the predictions that that model makes.

Just to be clear, recall that each model is making predictions only on the observations that were used to train it and so the number of predictions that each model makes is smaller than the total number of observations available in our starting data.

As we have stored the original row numbers of the heart_train data frame in the ID column of every data frame created in the previous step, we can merge using this column. We use the Reduce() function along with the merge() function in order to merge all the data frames in our training_predictions variable into one new data frame. Here is the code:

> train_pred_df <- Reduce(function(x, y) merge(x, y, by = "ID",  
                          all = T), training_predictions)

Let's have a look at the first few lines and columns of this aggregated data frame:

> head(train_pred_df[, 1:5])
  ID PREDICTIONS 1 PREDICTIONS 2 PREDICTIONS 3 PREDICTIONS 4
1  1             1            NA             1            NA
2  2             0            NA            NA             0
3  3            NA             0             0            NA
4  4            NA             1             1             1
5  5             0             0             0            NA
6  6             0             1             0             0

The first column is the ID row that was used to merge the data frame. The numbers in this column are the row numbers of the observations from the starting training data frame. The PREDICTIONS 1 column contains the predictions that the first model makes. We can see that this model had rows 1, 2, 5, and 6 as part of its training data. For the first row, the model predicts class 1 and for the other three rows, it predicts class 0. Rows 3 and 4 were not part of its training data and so there are two NA values. This reasoning can be used to understand the remaining columns, which correspond to the next three models trained.

With this data frame constructed, we can now produce our training data predictions for the whole bagged model using a majority vote across each row of the preceding data frame. Once we have these, we merely need to match the predictions with the labeled values of the corresponding rows of the original heart_train data frame and compute our accuracy:

> train_pred_vote <- apply(train_pred_df[,-1], 1, 
                function(x) as.numeric(mean(x, na.rm = TRUE) > 0.5))
> (training_accuracy <- mean(train_pred_vote == 
                heart_train$OUTPUT[as.numeric(train_pred_df$ID)]))
[1] 0.9173913

We now have our first accuracy measure for our bagged model—91.7 percent. This is analogous to measuring the accuracy on our training data. We will now repeat this process using the out-of-bag observations for each model to compute the out-of-bag accuracy.

There is one caveat here, however. In our data, the ECG column is a factor with three levels, one of which, level 1, is very rare. As a result of this, when we draw bootstrap samples from the original training data, we may encounter samples in which this factor level never appears. When that happens, the glm() function will think this factor only takes two levels, and the resulting model will be unable to make predictions when it encounters an observation with a value for the ECG factor that it has never seen before.
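We can get a sense of how often this problem occurs with a couple of illustrative checks; these lines assume the heart_train data frame and the sample_vectors matrix defined earlier:

table(heart_train$ECG)                  # shows how rare level 1 is
sum(apply(sample_vectors, 2, function(s)
    length(unique(heart_train[s, ]$ECG)) < 3))   # bootstrap samples missing an ECG level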

To handle this situation, we need to replace the level 1 value of this factor with an NA value for the out-of-bag observations, if the model they correspond to did not have at least one observation with an ECG factor level of 1 in its training data. Essentially, for simplicity, we will just not attempt to make a prediction for these problematic observations when they arise. With this in mind, we will define a function to compute the out-of-bag observations for a particular sample and then use this to find the out-of-bag observations for all our samples:

 get_1oo_bag <- function(sample_indices) {
     unique_sample <- setdiff(1 : n, unique(sample_indices)); 
     df <- heart_train[unique_sample,]; 
     df$ID <- unique_sample; 
     if (length(unique(heart_train[sample_indices,]$ECG)) < 3) 
         df[df$ECG == 1,"ECG"] = NA; 
     return(df)
 }
 
> oo_bags <- apply(sample_vectors, 2, get_1oo_bag)

Next, we will use our glm_predictions() function to compute predictions using our out-of-bag samples. The remainder of the process is identical to what we did earlier:

> oob_predictions <- mapply(glm_predictions, models, oo_bags, 
                            1 : M, SIMPLIFY = F)
> oob_pred_df <- Reduce(function(x, y) merge(x, y, by = "ID", 
                        all = T), oob_predictions)
> oob_pred_vote <- apply(oob_pred_df[,-1], 1, 
                   function(x) as.numeric(mean(x, na.rm = TRUE) > 0.5))

> (oob_accuracy <- mean(oob_pred_vote == 
             heart_train$OUTPUT[as.numeric(oob_pred_df$ID)], 
             na.rm = TRUE))
[1] 0.8515284

As expected, we see that our out-of-bag accuracy, which is a better measure of performance on unseen data, is lower than the training data accuracy. In the last line of the previous code sample, we excluded NA values when computing the out-of-bag accuracy. This is important because it is possible that a particular observation may appear in all the bootstrap samples and therefore never be available for an out-of-bag prediction.

Equally, our fix for the rare level of the ECG factor means that even if an observation is not selected by the sampling process, we may still not be able to make a prediction for it. The reader should verify that only one observation happens to produce an NA value because of the combination of the two phenomena just described.
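One illustrative way to carry out this check is to count the NA votes directly and, separately, the observations that never appear out-of-bag:

sum(is.na(oob_pred_vote))    # out-of-bag observations with no usable prediction
n - nrow(oob_pred_df)        # observations that appear in every bootstrap sample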

Finally, we'll repeat this process a third time using the heart_test data frame to obtain the test set accuracy:

 get_1test_bag <- function(sample_indices) {
     df <- heart_test; 
     df$ID <- row.names(df); 
     if (length(unique(heart_train[sample_indices,]$ECG)) < 3) 
         df[df$ECG == 1,"ECG"] = NA; 
     return(df)
 }
 
> test_bags <- apply(sample_vectors, 2, get_1test_bag)
> test_predictions <- mapply(glm_predictions, models, test_bags, 
                             1 : M, SIMPLIFY = F)
> test_pred_df <- Reduce(function(x, y) merge(x, y, by = "ID",
                         all = T), test_predictions)
> test_pred_vote <- apply(test_pred_df[,-1], 1, 
              function(x) as.numeric(mean(x, na.rm = TRUE) > 0.5))
> (test_accuracy <- mean(test_pred_vote == 
              heart_test[test_pred_df$ID,"OUTPUT"], na.rm = TRUE))
[1] 0.8

The accuracy on the test set seems lower than what we found without a bagged model. This is not necessarily bad news for us, since the test set is very small. In fact, the difference between the performance of this bagged model and the original model trained in Chapter 3, Logistic Regression, is 32/40 compared to 36/40, which is to say it is only worse by four observations in 40.

In a real-world situation, we generally want to have a much larger test set to estimate our unseen accuracy. In fact, because of this, we are more inclined to believe our out-of-bag accuracy measurement, which is done over a larger number of observations and averaged over many models.

Bagging is actually very useful for us in this scenario: because the test set is so small, the out-of-bag observations give us a better estimate of the model's accuracy on unseen data. As a final demonstration, we run the previous code a number of times with different values of M and store the results in a data frame:

> heart_bagger_df
     M Training Accuracy Out-of-bag Accuracy Test Accuracy
1   11         0.9173913           0.8515284         0.800
2   51         0.9130435           0.8521739         0.800
3  101         0.9173913           0.8478261         0.800
4  501         0.9086957           0.8521739         0.775
5 1001         0.9130435           0.8565217         0.775

This table shows us that the test accuracy fluctuates around 80 percent. This isn't that surprising given the small size of our test set of only 40 observations. For the training accuracy, we see that we are fluctuating around 91 percent. The OOB accuracy, which is far more stable as an accuracy measure, shows us that the expected performance of the model is around 85 percent. As the number of models increases, we don't see much of an improvement over 11 models, though for most real-world data sets we would usually see some improvement before performance tapers off.

Although our example focused exclusively on bagging for classification problems, the move to regression problems is relatively straightforward. Instead of using majority votes for a particular observation, we use the average value of the target function predicted by the individual models. Bagging is not always guaranteed to provide a performance improvement on a model. For starters, we should note that it makes sense to use bagging only when we have a nonlinear model. As the bagging process is performing an average (a linear operation) over the models generated, we will not see any improvements with linear regression, for example, because we aren't increasing the expressive power of our model. The next section talks about some other limitations of bagging.
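To make the regression case concrete, suppose we had gathered the individual models' numerical predictions into a matrix with one row per observation and one column per model; the bagged prediction is then just the row mean. The matrix below is a made-up illustration, not data from our analysis:

pred_matrix <- cbind(m1 = c(2.1, 3.0, 1.7, 4.2),
                     m2 = c(1.9, 3.4, 1.5, 4.0),
                     m3 = c(2.3, 2.9, 1.8, 4.4))
rowMeans(pred_matrix, na.rm = TRUE)   # bagged regression prediction for each row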

Note

For more information on bagging, consult the original paper of Leo Breiman titled Bagging Predictors, published in 1996 in the journal Machine Learning.

Limitations of bagging

So far, we've only explored the upside of using bagging, but in some cases it may turn out not to be a good idea. Bagging involves taking the average across predictions made by several models, which are trained on bootstrapped samples of the training data. This averaging process smooths the overall output, which may reduce bias when the target function is smooth. Unfortunately, if the target function is not smooth, we may actually introduce bias by using bagging.

Another way that bagging introduces bias is when one of the output classes is very rare. Under those circumstances, the majority voting system tends to be biased towards the more common class. Other problems may arise in relation to the sampling process itself. As we have already learned, when some categorical features include values that are rare, these may not appear at all in some of the bootstrap samples. When this happens, the models built for these samples will be unable to make a prediction when they encounter this new feature level in their test set.

High leverage points, which are far more influential than other points in determining the model's output function, can also be a problem. If a bootstrap sample is drawn that does not include one or more high leverage points, the resulting trained model will be quite different compared to when they are included. The behavior of the bagged model therefore depends on how often these particular observations are sampled, and hence on whether the models that saw them can dominate the majority vote. As a result, our ensemble model will have a high variance in the presence of high leverage points. For a given data set, we can often predict whether we are in this situation by looking for outliers and highly skewed features.

We must also remember that the different models we build are not truly independent of each other in the strict sense because they still use the same set of input features. The averaging process would have been more effective if the models were independent. Also, bagging does not help when the type of model that we are using predicts a functional form that is very far from the true form of the target function. When this happens, training multiple models of this type merely reproduces the systematic errors across the different models. Put differently, bagging works better when we have low bias and high variance models as the averaging process is primarily designed to reduce the variance.

Finally, and this applies to ensemble models in general, we tend to lose the explanatory power of our model. We saw an example of explanatory power in linear regression, where each model parameter (regression coefficient) corresponded to the amount of change in the output for a unit increase in the corresponding feature. Decision trees are another example of a model with high explanatory power. Bagging sacrifices this benefit because of the majority voting process, so we can no longer directly relate our inputs to the predicted output.
