Fine-tuning auto-encoder models

In the previous sections of this chapter, we have learned how to train and use auto-encoder models. This last section explores how to optimize and fine-tune an auto-encoder model, examining issues such as how to pick the number of hidden neurons or the number of layers.

Sometimes there are conceptual reasons to assume a particular structure for the data. When there are not, we can vary these parameters and pick the model that performs best. One dilemma that is exacerbated when trying several models and choosing the best one is that, even if the models are truly equivalent, one may outperform the others by chance in a given sample. To combat this, we can use techniques such as cross-validation to tune the parameter values using only the training data, and then validate only the final model on the holdout or testing data. Currently, H2O does not support cross-validation for auto-encoder models, but if we really wanted to use it, we could implement it manually. The createFolds() function from the caret package makes this easy:

## create 5 folds from the 20,000 rows of the training data
library(caret)
folds <- createFolds(1:20000, k = 5)
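
As a quick check, each element of folds is a vector of the row indices for one held-out fold, so the fold sizes should all be roughly equal, at around 4,000 rows each (output not shown):

## number of rows held out in each of the five folds
sapply(folds, length)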

Next, we create a list of the hyperparameter sets we want to try. We do this in the following code:

## create hyperparameter sets to try (referred to as M1 to M6 in the results)
hyperparams <- list(
  ## M1: 50 hidden neurons, no dropout
  list(
    hidden = c(50),
    input_dr = c(0),
    hidden_dr = c(0)),
  ## M2: 200 hidden neurons, 20% input dropout
  list(
    hidden = c(200),
    input_dr = c(.2),
    hidden_dr = c(0)),
  ## M3: 400 hidden neurons, 20% input dropout
  list(
    hidden = c(400),
    input_dr = c(.2),
    hidden_dr = c(0)),
  ## M4: 400 hidden neurons, 20% input dropout, 50% hidden dropout
  list(
    hidden = c(400),
    input_dr = c(.2),
    hidden_dr = c(.5)),
  ## M5: two layers (400, 200), 20% input dropout, 25% hidden dropout
  list(
    hidden = c(400, 200),
    input_dr = c(.2),
    hidden_dr = c(.25, .25)),
  ## M6: two layers (400, 200), 20% input dropout, 50%/25% hidden dropout
  list(
    hidden = c(400, 200),
    input_dr = c(.2),
    hidden_dr = c(.5, .25)))

Finally, we can loop over the hyperparameter sets and the five folds to train all of the models. This may take several minutes to complete, as we are training 6 x 5 = 30 models, some with hundreds of hidden neurons. Note that, to speed this up, we restarted the H2O cluster with 12 GB of memory and 5 cores; a sketch of such a restart is shown next, followed by the training loop.
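
A restart along these lines, assuming we do not mind clearing the cluster (any loaded H2OFrames, such as h2odigits.train, would need to be re-created afterwards), might look like this:

## shut down the running H2O cluster and restart it with more resources
h2o.shutdown(prompt = FALSE)
h2o.init(nthreads = 5, max_mem_size = "12G")
## note: after a restart, H2OFrames such as h2odigits.train must be imported again

With the cluster ready, the training loop itself is as follows: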

## outer loop over the hyperparameter sets, inner loop over the five folds
fm <- lapply(hyperparams, function(v) {
  lapply(folds, function(i) {
    ## train on all rows except fold i; validate on fold i
    h2o.deeplearning(
      x = xnames,
      training_frame = h2odigits.train[-i, ],
      validation_frame = h2odigits.train[i, ],
      activation = "Tanh",
      autoencoder = TRUE,
      hidden = v$hidden,
      epochs = 30,
      sparsity_beta = 0,
      input_dropout_ratio = v$input_dr,
      hidden_dropout_ratios = v$hidden_dr,
      l1 = 0,
      l2 = 0
    )
  })
})

Next, we loop through the results and extract the MSE on the validation data, which here is the fold held out from training in each cross-validation run:

fm.res <- lapply(fm, function(m) {
  sapply(m, h2o.mse, valid = TRUE)
})

We merge the results together into a data table to view and plot the performance across the folds of the cross-validation:

fm.res <- data.table(
  Model = rep(paste0("M", 1:6), each = 5),
  MSE = unlist(fm.res))

head(fm.res)
   Model         MSE
1:    M1 0.014619734
2:    M1 0.014655749
3:    M1 0.014651761
4:    M1 0.014310286
5:    M1 0.014303792
6:    M2 0.006781414

Finally, we can make boxplots of the results to see how spread out they are and whether any of the cross-validated runs were especially aberrant. The results are shown in Figure 4.9; the MSEs for each fold of the cross-validation are quite close, so the mean or median is a reasonable summary:

p.erate <- ggplot(fm.res, aes(Model, MSE)) +
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = "point", colour = "red") +
  theme_classic()
print(p.erate)

Figure 4.9: Boxplots of the cross-validated MSE for each model

If we calculate the mean MSE by model and order from smallest to largest, these are the results we get:

fm.res[, .(Mean_MSE = mean(MSE)), by = Model][order(Mean_MSE)]
   Model    Mean_MSE
1:    M4 0.006261764
2:    M3 0.006276417
3:    M2 0.006725956
4:    M5 0.007768764
5:    M6 0.007797575
6:    M1 0.014508264

It appears that the fourth set of hyperparameters provided the lowest cross-validated MSE. This was a fairly complex model, with 400 hidden neurons, but also a regularized one, with 20% of the input variables and 50% of the hidden neurons dropped at each iteration. It outperforms, albeit only slightly, the third set of hyperparameters, which used the same model complexity but no dropout on the hidden layer. The deep models with a second layer of 200 hidden neurons perform worse than the shallower, single-layer models, although not by much.

With the best model selected, we can retrain on all of the training data, validating on our actual testing data, using the fourth set of hyperparameters:

fm.final <- h2o.deeplearning(
  x = xnames,
  training_frame = h2odigits.train,
  validation_frame = h2odigits.test,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = hyperparams[[4]]$hidden,
  epochs = 30,
  sparsity_beta = 0,
  input_dropout_ratio = hyperparams[[4]]$input_dr,
  hidden_dropout_ratios = hyperparams[[4]]$hidden_dr,
  l1 = 0,
  l2 = 0
)

fm.final
Training Set Metrics: 
=====================

MSE: (Extract with `h2o.mse`) 0.005880221

H2OAutoEncoderMetrics: deeplearning
** Reported on validation data. **

Validation Set Metrics: 
=====================

MSE: (Extract with `h2o.mse`) 0.006072476

We can see that the MSE on our testing data, which was not used at all during training, is fairly close to, though slightly worse than, the MSE on the training data; in this case it is actually slightly lower than the MSE estimated from cross-validation. To the extent that we searched over a reasonable set of hyperparameters, this model is now optimized, validated, and ready for use.
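
For instance, a minimal sketch of putting the fitted auto-encoder to use might extract either the learned features or the per-row reconstruction error; h2o.deepfeatures() and h2o.anomaly() are the relevant h2o functions, and the object names features.test and error.test here are just illustrative:

## scores on the 400 hidden neurons (layer = 1, the first hidden layer) for the test data
features.test <- h2o.deepfeatures(fm.final, h2odigits.test, layer = 1)

## per-row reconstruction error; unusually large values may flag anomalous digits
error.test <- h2o.anomaly(fm.final, h2odigits.test)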

In practice, it is often difficult to balance the possibility of obtaining better performance from a different model or set of hyperparameters against the time it takes to train many different models. If the data is very large, it can be helpful to explore candidate models on a random subset of the data in order to speed up computation. The example datasets used in this book are quite small compared to those commonly used in deep learning, where there may be millions or hundreds of millions of cases and hundreds or thousands of variables or inputs. The approaches used here will scale to larger datasets, but will simply take more time. It is also worth noting that, although we have seen good performance from fairly simple models on these relatively small datasets, larger datasets may benefit more from complex models and provide sufficient data to support learning a very complex structure.
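
As a minimal sketch of that subsetting idea, h2o.splitFrame() can carve off a random fraction of an H2OFrame; here h2odigits.train simply stands in for a much larger dataset, and the 25% ratio and seed are arbitrary choices:

## randomly split off about 25% of the rows to use while exploring hyperparameters
splits <- h2o.splitFrame(h2odigits.train, ratios = 0.25, seed = 1234)
h2odigits.sub <- splits[[1]]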
