Use case – improving out-of-sample model performance using dropout

Dropout is a relatively novel approach to regularization that is particularly valuable for large and complex deep neural networks. For a much more detailed exploration of dropout in deep neural networks, see Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). The concept behind dropout is actually quite straightforward. During the training of the model, units (for example, inputs, hidden neurons, and so on) are probabilistically dropped along with all connections to and from them. Figure 3.3 shows an example of what might happen at each step of training for a model where hidden neurons and their connections are dropped with a probability of 1/3. The grayed out and dashed neurons and connections are the ones that were dropped. Importantly, neurons are not dropped for the entirety of training; each neuron is only dropped for a given step/update:

Figure 3.3
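
To make the idea concrete, the following is a minimal, purely illustrative sketch of one training update's dropout; the activation values are hypothetical, and in the code later in this section nn.train() handles dropout internally:

## toy illustration of one training update's dropout (hypothetical values)
set.seed(1)
h <- runif(6)                              ## hypothetical hidden-unit activations
drop_p <- 1/3                              ## probability of dropping each unit
mask <- rbinom(length(h), 1, 1 - drop_p)   ## 1 = keep, 0 = drop for this update
h * mask                                   ## dropped units contribute nothing this step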

One way to think about dropout is that it forces models to be more robust to perturbations. Although many neurons are included in the full model, during training they are not all simultaneously present, and so neurons must operate somewhat more independently than they would have to otherwise. It is also worth noting that inputs can be dropped as well as hidden neurons, but typically this is either not done or done to a much lesser extent.

Another way of viewing dropout is that, if you have a large model with N weights between neurons but half of those weights are dropped during any given training update, then although all N weights will be used at some point during training, the average number of weights in use is halved, effectively halving the model complexity. This reduction in complexity may help to prevent overfitting of the data. Because of this feature, if the proportion of units dropped is p, Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014) recommend scaling the target model complexity up by 1/(1 – p) in order to end up with a roughly equally complex model.
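
As a quick, back-of-the-envelope illustration of this rule of thumb (the numbers here are only illustrative):

## scaling the hidden layer to compensate for dropout (illustrative values)
target_neurons <- 40                 ## size you might use without dropout
drop_p <- 0.5                        ## proportion of hidden units dropped
target_neurons / (1 - drop_p)        ## train with about 80 hidden neurons instead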

Although neurons can be randomly dropped during training, at testing time it is computationally inconvenient to generate predictions from many networks, each with a different set of neurons dropped, and then average those predictions. Instead, it has been suggested (and this seems to perform well) that we use an approximate average based on scaling the weights of a single neural network by each weight's probability of being included (that is, 1 – p), although this scaling can also be chosen empirically rather than theoretically.
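
To make the approximation concrete, here is a minimal, purely illustrative sketch; the weight matrix and dropout proportion are hypothetical and are not taken from the models fit later in this section:

## approximate model averaging by weight scaling (hypothetical values)
set.seed(2)
W <- matrix(rnorm(12), nrow = 4)     ## hypothetical trained weights for one layer
drop_p <- 0.5                        ## proportion of units dropped during training
W_test <- W * (1 - drop_p)           ## scale weights by the inclusion probability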

In addition to working well, this approximate weight re-scaling is a fairly trivial calculation. Thus, the primary computational cost of dropout comes from the fact that a model with more neurons and weights must be used because so many (a commonly recommended value is around 50% for hidden neurons) are dropped during each training update.

Although dropout itself is fairly computationally cheap, training can be slower overall because dropout typically calls for a larger model, and larger models are more computationally demanding to train. To counteract this, a higher learning rate can be used so that fewer iterations are required. One potential downside of this approach is that, with fewer active neurons per update and a faster learning rate, some weights may become quite large. Fortunately, it is possible to use dropout along with other forms of regularization, such as an L1 or L2 penalty. Taken together, the result is a larger model that can quickly (thanks to the faster learning rate) explore a broader parameter space, but is regularized through dropout and a penalty to keep the weights in check.
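
Conceptually, a single update that combines a dropout mask with an L2 penalty might look like the following toy sketch; all names and values are hypothetical and are not tied to the internals of any particular package:

## toy single-layer update: dropout mask plus L2 weight decay (illustrative only)
set.seed(3)
W <- matrix(rnorm(8), nrow = 2)        ## hypothetical weights
grad <- matrix(rnorm(8), nrow = 2)     ## hypothetical gradient of the loss
lr <- 0.8                              ## (higher) learning rate
lambda <- 1e-4                         ## L2 penalty strength
mask <- rbinom(ncol(W), 1, 0.5)        ## which units are kept for this update
update <- grad + lambda * W            ## gradient plus the L2 penalty term
update <- sweep(update, 2, mask, `*`)  ## dropped units are not updated this step
W <- W - lr * update                   ## take the gradient step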

To show the use of dropout in a neural network, we will return to the Modified National Institute of Standards and Technology (MNIST) dataset we worked with previously (downloaded from Kaggle in Chapter 2, Training a Prediction Model). We will use the nn.train() function from the deepnet package, as it allows for dropout. As in the previous chapter, we will run the four models in parallel to reduce the run time. Specifically, we compare four models, two with and two without dropout regularization, each with either 40 or 80 hidden neurons. For dropout, we specify the proportion to drop separately for the hidden and visible (input) units. Based on the rule of thumb that about 50% of hidden units (and 80% of observed units) should be retained, we specify the dropout proportions as .5 and .2, respectively:

## Fit Models
## four models in parallel: 40 or 80 hidden neurons, without (i = 1, 2)
## and with (i = 3, 4) dropout regularization
nn.models <- foreach(i = 1:4, .combine = 'c') %dopar% {
  set.seed(1234)
  list(nn.train(
    x = as.matrix(digits.X),
    y = model.matrix(~ 0 + digits.y),
    hidden = c(40, 80, 40, 80)[i],
    activationfun = "tanh",
    learningrate = 0.8,
    momentum = 0.5,
    numepochs = 150,
    output = "softmax",
    hidden_dropout = c(0, 0, .5, .5)[i],    ## drop 50% of hidden units
    visible_dropout = c(0, 0, .2, .2)[i]))  ## drop 20% of input units
}
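
Note that this code assumes the data objects (digits.X, digits.y, and digits.train) and the parallel backend from the previous chapter are still available in your R session. If you are starting fresh, a minimal setup sketch might look like the following; the choice of doParallel as the backend is an assumption, and any foreach-compatible backend registered with four workers would work equally well:

## assumed setup for a fresh session (backend choice is illustrative)
library(deepnet)     ## nn.train(), nn.predict()
library(RSNNS)       ## encodeClassLabels()
library(caret)       ## confusionMatrix()
library(foreach)
library(doParallel)  ## one possible parallel backend for %dopar%

cl <- makeCluster(4)
clusterEvalQ(cl, library(deepnet))  ## workers need deepnet to call nn.train()
registerDoParallel(cl)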

Next, we loop through the models to obtain the predicted values and evaluate overall in-sample performance:

nn.yhat <- lapply(nn.models, function(obj) {
  ## convert the matrix of predicted probabilities into class labels (1 to 10)
  encodeClassLabels(nn.predict(obj, as.matrix(digits.X)))
})

perf.train <- do.call(cbind, lapply(nn.yhat, function(yhat) {
  ## shift labels back to the digits 0-9 and cross-tabulate against the truth
  caret::confusionMatrix(xtabs(~ I(yhat - 1) + digits.y))$overall
}))
colnames(perf.train) <- c("N40", "N80", "N40_Reg", "N80_Reg")

options(digits = 4)
perf.train

                  N40    N80 N40_Reg N80_Reg
Accuracy       0.9050 0.9546  0.9212  0.9396
Kappa          0.8944 0.9495  0.9124  0.9329
AccuracyLower  0.8965 0.9485  0.9134  0.9326
AccuracyUpper  0.9130 0.9602  0.9285  0.9460
AccuracyNull   0.1116 0.1116  0.1116  0.1116
AccuracyPValue 0.0000 0.0000  0.0000  0.0000
McnemarPValue     NaN    NaN     NaN     NaN

When evaluating the models on the in-sample training data, it seems that the 40-neuron model performs better with regularization than without it, but that the 80-neuron model performs better without regularization. Of course, the real test comes on the testing or holdout data:

## use the second 5,000 observations as holdout (testing) data
i2 <- 5001:10000
test.X <- digits.train[i2, -1]  ## all columns except the first (the label)
test.y <- digits.train[i2, 1]   ## the first column holds the digit labels

nn.yhat.test <- lapply(nn.models, function(obj) {
  encodeClassLabels(nn.predict(obj, as.matrix(test.X)))
})

perf.test <- do.call(cbind, lapply(nn.yhat.test, function(yhat) {
  caret::confusionMatrix(xtabs(~ I(yhat - 1) + test.y))$overall
}))
colnames(perf.test) <- c("N40", "N80", "N40_Reg", "N80_Reg")

perf.test
                  N40    N80 N40_Reg N80_Reg
Accuracy       0.8652 0.8684  0.8868  0.9014
Kappa          0.8502 0.8537  0.8742  0.8904
AccuracyLower  0.8554 0.8587  0.8777  0.8928
AccuracyUpper  0.8746 0.8777  0.8955  0.9095
AccuracyNull   0.1074 0.1074  0.1074  0.1074
AccuracyPValue 0.0000 0.0000  0.0000  0.0000
McnemarPValue     NaN    NaN     NaN     NaN

The testing data highlights quite well that, in the non-regularized models, the additional neurons do not meaningfully improve out-of-sample performance. In addition, the in-sample performance was overly optimistic (accuracy of 0.9546 in training versus 0.8684 in testing for the 80-neuron, non-regularized model). However, here we see the advantage of the regularized models for both the 40- and the 80-neuron models. Although both still perform worse on the testing data than they did on the training data, they perform better than the equivalent non-regularized models on the testing data. This difference is particularly important for the 80-neuron model: the non-regularized version drops 0.0862 in overall accuracy from training to testing data, whereas the regularized version drops only 0.0382, leaving the regularized 80-neuron model with the best overall performance.
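
A quick way to see these gaps directly is to subtract the testing accuracies from the training accuracies, using the performance matrices already created:

## drop in accuracy from training to testing data for each model
perf.train["Accuracy", ] - perf.test["Accuracy", ]
## the 80-neuron gap is about 0.086 without dropout versus 0.038 with it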

Although these numbers are by no means record-setting, they do show the value of using dropout, or regularization more generally, and how one might go about trying to tune the model and dropout parameters to improve the ultimate testing performance.
