Ensembles and model averaging

Another approach to regularization involves creating ensembles of models and combining them, for example by model averaging or some other algorithm for combining the individual models' results. As with many of the previous regularization methods, model averaging is a fairly simple concept. If you have different models that each generate a set of predictions, each model may make errors in its predictions, but they will not necessarily all make the same errors. Where one model predicts too high a value, another may predict one that is too low, so that, when the predictions are averaged, some of the errors cancel out, resulting in a more accurate prediction than would otherwise have been obtained.
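Before turning to the worked example, a quick simulated check (not part of the original example) makes the error-cancellation argument concrete: two unbiased predictors with independent errors are averaged, and the error variance of the average is roughly half that of either predictor alone.

## quick numerical illustration (supplementary, not from the original example):
## two unbiased predictors with independent errors; averaging them
## roughly halves the error variance
set.seed(1)
truth <- rnorm(10000)
pred1 <- truth + rnorm(10000, sd = 1)
pred2 <- truth + rnorm(10000, sd = 1)
c(Model1  = var(pred1 - truth),
  Model2  = var(pred2 - truth),
  Average = var((pred1 + pred2) / 2 - truth))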

To better understand model averaging, let's consider two extreme cases. In the first, suppose that the models being averaged are identical, or at least generate identical predictions (that is, predictions that are perfectly correlated). In that case, averaging yields no benefit, but it also does no harm. In the second, suppose that the models being averaged each perform equally well but their predictions are uncorrelated (or have very low correlations). Then the average will be more accurate than any individual model, because it combines the strengths of each while their independent errors partially cancel. The following code gives an example using simulated data. This small example uses only three models, but they illustrate the point:

## simulated data
set.seed(1234)
d <- data.frame(
  x = rnorm(400))
d$y <- with(d, rnorm(400, 2 + ifelse(x < 0, x + x^2, x + x^2.5), 1))
d.train <- d[1:200, ]
d.test <- d[201:400, ]

## three different models
m1 <- lm(y ~ x, data = d.train)
m2 <- lm(y ~ I(x^2), data = d.train)
m3 <- lm(y ~ pmax(x, 0) + pmin(x, 0), data = d.train)

## In sample R2
cbind(
  M1 = summary(m1)$r.squared,
  M2 = summary(m2)$r.squared,
  M3 = summary(m3)$r.squared)

       M1   M2   M3
[1,] 0.33 0.60 0.76

We can see that the predictive value of each model, at least in the training data, varies quite a bit. Evaluating the correlations among fitted values in the training data can also help to indicate how much overlap there is among the model predictions:

## correlations in the training data
cor(cbind(
  M1 = fitted(m1),
  M2 = fitted(m2),
  M3 = fitted(m3)))

     M1   M2   M3
M1 1.00 0.11 0.65
M2 0.11 1.00 0.78
M3 0.65 0.78 1.00

Next, we generate predicted values for the testing data along with the average of those predicted values, and then examine the correlations among the predictions and the observed outcome in the testing data:

## generate predictions and the average prediction
d.test$yhat1 <- predict(m1, newdata = d.test)
d.test$yhat2 <- predict(m2, newdata = d.test)
d.test$yhat3 <- predict(m3, newdata = d.test)
d.test$yhatavg <- rowMeans(d.test[, paste0("yhat", 1:3)])

## correlation in the testing data
cor(d.test)

             x    y  yhat1  yhat2 yhat3 yhatavg
x        1.000 0.44  1.000 -0.098  0.60    0.55
y        0.442 1.00  0.442  0.753  0.87    0.91
yhat1    1.000 0.44  1.000 -0.098  0.60    0.55
yhat2   -0.098 0.75 -0.098  1.000  0.69    0.76
yhat3    0.596 0.87  0.596  0.687  1.00    0.98
yhatavg  0.552 0.91  0.552  0.765  0.98    1.00

From the results, we can see that the average of the three models' predictions does indeed correlate more strongly with the outcome than any of the models individually. However, this only tends to hold when the models being averaged perform similarly well. Consider a pathological case where one model predicts the outcome perfectly and another is random noise completely uncorrelated with the outcome: averaging the two would certainly perform worse than just using the good model. In general, then, it is good to check that the models being averaged have similar performance, at least in the training data. A second lesson is that, given models with similar performance, lower correlations between their predictions are desirable, as this is what allows their errors to cancel and yields the best-performing average.
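To complement the correlations, one can also compare out-of-sample error directly. The following sketch (a supplementary check, not part of the original output) computes the test-set root mean squared error for each model and for the averaged predictions:

## out-of-sample RMSE for each model and for the average
with(d.test, c(
  M1  = sqrt(mean((y - yhat1)^2)),
  M2  = sqrt(mean((y - yhat2)^2)),
  M3  = sqrt(mean((y - yhat3)^2)),
  Avg = sqrt(mean((y - yhatavg)^2))))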

Ensemble methods are methods that employ this kind of model averaging. One common technique is bootstrap aggregating, or bagging, where the data is resampled with replacement to form many equally sized datasets, a model is trained on each, and the results are averaged. Because the data is sampled with replacement, some cases may show up multiple times, or not at all, in a given dataset. Because a separate model is trained on each dataset, a quirk of the data that is unique to just a few cases will tend to emerge in only some of the models; when the predictions are averaged across the models trained on the resampled datasets, such overfitting tends to be reduced. In some contexts (for example, decision trees), further steps may be taken to reduce the correlations among the different models. For example, random forests are ensembles of decision trees that use bootstrap aggregating but also randomly select a subset of features at each node split in order to reduce model-to-model correlations and thus improve the performance of the overall average.
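As a minimal, illustrative sketch of how bagging could be implemented by hand on the simulated data above (this code is not part of the original example; the choice of the piecewise-linear model and of 50 resamples is arbitrary), we can fit the same model to bootstrap resamples of the training data and average the predictions:

## hand-rolled bagging on the simulated data (illustrative only)
set.seed(42)
n.boot <- 50
boot.preds <- sapply(seq_len(n.boot), function(i) {
  ## resample the training data with replacement
  idx <- sample(nrow(d.train), replace = TRUE)
  ## refit the model on the resampled data
  m <- lm(y ~ pmax(x, 0) + pmin(x, 0), data = d.train[idx, ])
  ## predict on the testing data
  predict(m, newdata = d.test)
})
## average the predictions across the bootstrap models
d.test$yhatbag <- rowMeans(boot.preds)
cor(d.test$y, d.test$yhatbag)

For a simple linear model such as this, the bagged average differs little from a single fit; the benefit is larger for high-variance learners such as deep decision trees.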

Bagging and model averaging are not used as frequently in deep neural networks because the computational cost of training each model can be quite high, so repeating the process many times becomes prohibitively expensive in time and compute resources. However, the dropout process discussed in the next section serves a very similar function: by dropping specific neurons, many sub-models are effectively trained, and their results are then averaged. Nevertheless, it is still possible to use model averaging with deep neural networks, even if only on a handful of models rather than the hundreds that are common in random forests and some other approaches.
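As an illustration of averaging a handful of neural networks (not from the original text; it assumes the nnet package is available, and the size, decay, and maxit values are arbitrary choices), we can fit a few small networks to the same simulated data. Each fit starts from different random weights, so the predictions are not perfectly correlated and their average can be somewhat more stable than any single fit:

## averaging a handful of small neural networks (illustrative sketch)
library(nnet)
set.seed(1234)
nn.preds <- sapply(1:5, function(i) {
  ## each call starts from different random initial weights
  m <- nnet(y ~ x, data = d.train, size = 5, linout = TRUE,
            decay = 0.01, maxit = 500, trace = FALSE)
  as.vector(predict(m, newdata = d.test))
})
## average the predictions from the five networks
d.test$yhatnn <- rowMeans(nn.preds)
cor(d.test$y, d.test$yhatnn)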
