Picking hyperparameters

The parameters of a model typically refer to quantities such as the weights or bias/intercept terms. However, there are many other values that must be set at the outset and are not optimized or learned during model training. These are sometimes referred to as hyperparameters. Indeed, even the choice of model (for example, deep feedforward neural network, random forest, or support vector machine) can be seen as a hyperparameter.

Even if we assume that somehow we have decided that a deep feedforward neural network is the best modeling strategy, there are still many hyperparameters that must be set. These hyperparameters may be specified explicitly by the user or implicitly, by accepting the default values that the software provides.

The values chosen for the hyperparameters can have a dramatic impact on the accuracy and training speed of a model. Indeed, we have already experimented with some hyperparameters, such as the number of hidden neurons in a layer or the number of layers. However, other hyperparameters also impact performance and speed. For example, in the following code, we set up the R environment, load the Modified National Institute of Standards and Technology (MNIST) data (images of handwritten digits) we have worked with, and fit two prediction models that differ only in their learning rate:

source("checkpoint.R")
options(width = 70, digits = 2)

cl <- h2o.init(
  max_mem_size = "12G",
  nthreads = 4)

## data setup
digits.train <- read.csv("train.csv")
digits.train$label <- factor(digits.train$label, levels = 0:9)

h2odigits <- as.h2o(
  digits.train,
  destination_frame = "h2odigits")

i <- 1:32000
h2odigits.train <- h2odigits[i, ]

itest <- 32001:42000
h2odigits.test <- h2odigits[itest, ]
xnames <- colnames(h2odigits.train)[-1]



system.time(ex1 <- h2o.deeplearning(
  x = xnames,
  y = "label",
  training_frame = h2odigits.train,
  validation_frame = h2odigits.test,
  activation = "RectifierWithDropout",
  hidden = c(100),
  epochs = 10,
  adaptive_rate = FALSE,
  rate = .001,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(.2)
))


system.time(ex2 <- h2o.deeplearning(
  x = xnames,
  y = "label",
  training_frame = h2odigits.train,
  validation_frame = h2odigits.test,
  activation = "RectifierWithDropout",
  hidden = c(100),
  epochs = 10,
  adaptive_rate = FALSE,
  rate = .01,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(.2)
))

The first difference is that ex1 took 1.34 times as long to train as did ex2. Printing each model shows a fairly large performance difference, as well. To save space in the book, most of the output from typing ex1 and ex2 is omitted and only the test set metrics are shown:

ex1

Test Set Metrics: 
=====================
Metrics reported on full validation frame 

MSE: (Extract with `h2o.mse`) 0.03326067
R^2: (Extract with `h2o.r2`) 0.9960457
Logloss: (Extract with `h2o.logloss`) 0.2021435
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)

===============================================================
         X0   X1  X2   X3  X4  X5   X6   X7  X8  X9      Error 
0       984    0   1    0   0   3   13    2   6   2 0.02670623 
1         0 1119   5    2   1   1    1    5   5   0 0.01755926 
2         7    1 920    8   5   0    6    7   7   2 0.04465213 
3         3    5   5 1006   1  13    1    7   7   1 0.04099142 
4         0    7   3    0 896   2    5    2   4  13 0.03862661 
5         6    2   4   17   5 835    7    1  10   5 0.06390135 
6         5    2   1    0   6   8  966    1   2   0 0.02522704 
7         2    2   8    7   3   1    0 1027   0   8 0.02930057 
8         1   11   3    7   4  15    1    2 922   3 0.04850361 
9         5    3   1    7  18   6    2   20   2 932 0.06425703 
Totals 1013 1152 951 1054 939 884 1002 1074 965 966 0.03930000 

ex2

Test Set Metrics: 
=====================
Metrics reported on full validation frame 

MSE: (Extract with `h2o.mse`) 0.1264212
R^2: (Extract with `h2o.r2`) 0.9849702
Logloss: (Extract with `h2o.logloss`) 2.136629
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
==============================================================
        X0   X1  X2   X3  X4   X5  X6   X7  X8  X9      Error 
0      938    0   5   11   3   19  19    7   8   1 0.07220574 
1        0 1105   6    6   2    6   1    8   5   0 0.02985075 
2       18    7 757   54  20    9  47   36   5  10 0.21391485 
3        1    2  22  887  10   36   0   50  30  11 0.15443279 
4        1    7   0    1 854    7  13    8   5  36 0.08369099 
5       11    6   4   45  16  767   8    5  29   1 0.14013453 
6       13    5   5    1   6   63 887    5   6   0 0.10494450 
7        2    8   3    3   4    7   0 1024   0   7 0.03213611 
8        7   48  37   27   8   67  12   22 715  26 0.26212590 
9        7    3   3   12  47   22   1  158  11 732 0.26506024 
Totals 998 1191 842 1047 970 1003 988 1323 814 824 0.13340000 
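
Rather than printing the full model objects, the same test set metrics can be pulled out directly using the accessor functions named in the output above; a minimal sketch, using the fitted ex1 and ex2 models, follows:

## validation (test) metrics extracted directly from the fitted models
h2o.mse(ex1, valid = TRUE)
h2o.mse(ex2, valid = TRUE)

h2o.logloss(ex1, valid = TRUE)
h2o.logloss(ex2, valid = TRUE)

## per-class errors come from the confusion matrix
h2o.confusionMatrix(ex1, valid = TRUE)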

Although ex1 took longer to train, it performs substantially better on the test data than ex2 does. The higher learning rate trains faster but sacrifices performance, highlighting one of the trade-offs that must be weighed. However, because there are many hyperparameters, the decision about any one of them is not made in isolation from the rest. Regularization is one example. Relatively large or complex models, with many hidden neurons and possibly multiple layers, are often used because such choices tend to increase accuracy (at least on the training data), although they reduce speed. These complex models then often include some form of regularization, such as dropout, which tends to reduce accuracy (at least on the training data) but improve speed, since only a subset of the neurons is included in any given iteration.

One of the most important decisions has to do with the architecture of the model. For example, decisions must be made about how many layers there should be, how many hidden neurons should be in each layer, whether there should be any skip-layer connections or whether each layer should connect only to the next in sequence, and so on. Unfortunately, there are no simple rules for resolving many of these questions. Good choices may rely on knowledge of the subject domain, or prior analytical work may provide reasonable starting points.
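
To make this concrete with the interface used above, in h2o.deeplearning the number of layers and the neurons per layer are controlled by the hidden argument (one entry per hidden layer), with hidden_dropout_ratios supplying one dropout ratio per layer. The sketch below shows a two-layer variant; the specific sizes and ratios are illustrative rather than recommendations:

## illustrative two-layer architecture (sizes and dropout ratios are
## arbitrary examples, not tuned recommendations)
ex3 <- h2o.deeplearning(
  x = xnames,
  y = "label",
  training_frame = h2odigits.train,
  validation_frame = h2odigits.test,
  activation = "RectifierWithDropout",
  hidden = c(200, 100),               ## two hidden layers: 200 then 100 neurons
  epochs = 10,
  adaptive_rate = FALSE,
  rate = .001,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(.2, .2)   ## one dropout ratio per hidden layer
)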

In the absence of subject domain expertise or prior models, designing an effective architecture requires some trial and error. This trial and error can be a manual or an automated process. In theory, just as the parameters are optimized, the hyperparameters could also be optimized. In practice, however, this may not be computationally feasible, as it can require fitting many variations of the model, each of which demands substantial compute resources and time to complete.
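
As a rough illustration of what even a small automated search involves, the sketch below loops over a few arbitrary candidate learning rates, refits the same model for each, and collects the validation MSE; note that every candidate costs a full training run:

## small manual sweep over candidate learning rates (values are arbitrary)
rates <- c(.0005, .001, .005, .01)

valid.mse <- sapply(rates, function(r) {
  m <- h2o.deeplearning(
    x = xnames,
    y = "label",
    training_frame = h2odigits.train,
    validation_frame = h2odigits.test,
    activation = "RectifierWithDropout",
    hidden = c(100),
    epochs = 10,
    adaptive_rate = FALSE,
    rate = r,
    input_dropout_ratio = 0,
    hidden_dropout_ratios = c(.2))
  h2o.mse(m, valid = TRUE)
})

## validation MSE for each candidate learning rate
cbind(rate = rates, valid.mse = valid.mse)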

Understanding what each hyperparameter does can help to inform your decisions. For example, if you start with a model and its performance is worse than is acceptable, the hyperparameters should be changed to allow the model greater capacity and flexibility, such as adding more hidden neurons, additional layers of hidden neurons, more training epochs, and so on. If there is a large difference between the model's performance on the training data and on the testing data, this may suggest that the model is overfitting, in which case the hyperparameters may be tweaked to reduce capacity or add more regularization. In some cases, more data may be required to support fitting the more complex model needed to adequately predict the outcome. We will discuss some ways to refine model architecture (including more analytical approaches) in greater detail in Chapter 6, Tuning and Optimizing Models.
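
As a simple check along these lines, comparing the same metric on the training and validation data gives a rough indication: a large gap suggests overfitting, whereas poor performance on both suggests insufficient capacity. A minimal sketch using ex1 from above (H2O scores training metrics on a sample of the training data by default):

## rough overfitting check: training versus validation MSE for ex1
h2o.mse(ex1, train = TRUE)
h2o.mse(ex1, valid = TRUE)

## a large train/validation gap suggests overfitting; poor performance on
## both suggests the model may need more capacity (or more data)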
