Use case – training a deep neural network for automatic classification

For our use case, we use data from a subset of the Million Song Dataset, available from the University of California, Irvine (UCI) Machine Learning Repository (Lichman, 2013). There are 515,345 cases, with the first 463,715 used for training and the last 51,630 held out for testing. The first column of the dataset contains the year of release and the remaining columns are features describing the timbre of each song. The data can be downloaded from http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD. Our goal is to predict the year each song was released.

First we need to download the data and then unzip it, which we can do using the following code:

download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip", destfile = "YearPredictionMSD.txt.zip")
unzip("YearPredictionMSD.txt.zip")

Now we can read the data into R using fread() from the data.table package. The fread() function is preferable to read.csv() here because it can be orders of magnitude faster; even so, reading the data took about 30 seconds on a high-end desktop with a solid-state drive:

library(data.table)  # provides fread(); skip if already loaded

d <- fread("YearPredictionMSD.txt", sep = ",")

Next, we can take a quick look at the distribution of the outcome, the year of release. The following code creates the histogram shown in Figure 5.4:

library(ggplot2)  # skip if already loaded

p.hist <- ggplot(d[, .(V1)], aes(V1)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  xlab("Year of Release") 
print(p.hist)  

Figure 5.4: Histogram of the year of release

One possible concern is that the relatively extreme values may exert an undue influence on the model. We could reduce this by reflecting the distribution and taking the square root, or by excluding a small proportion of the most extreme cases, such as the bottom and top 0.5% (1% of the data in total). Checking the quantiles with the following code shows that trimming at these percentiles would retain the years 1957 to 2010:

quantile(d$V1, probs = c(.005, .995))
 0.5% 99.5% 
 1957  2010
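
As an aside, the reflect-and-square-root transformation mentioned above could be applied along the following lines. This is only a sketch: V1r is a hypothetical new column, and a model using it would predict V1r rather than V1.

## reflect the years so that recent (common) years become small values,
## then take the square root to pull in the long tail of early years
d[, V1r := sqrt(max(V1) + 1 - V1)]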

The following code trims the data and converts the training and testing datasets for H2O:

d.train <- d[1:463715][V1 >= 1957 & V1 <= 2010]

d.test <- d[463716:515345][V1 >= 1957 & V1 <= 2010]

h2omsd.train <- as.h2o(
  d.train,
  destination_frame = "h2omsdtrain")

h2omsd.test <- as.h2o(
  d.test,
  destination_frame = "h2omsdtest")
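
These conversions assume that an H2O cluster is already up and running. If it is not, one can be started along the following lines; the thread count and memory limit are illustrative and should be adjusted to your machine:

library(h2o)
cl <- h2o.init(
  nthreads = 10,         # match the 10-core cluster used for the timings here
  max_mem_size = "12G")  # illustrative memory limit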

To get started and provide some baseline performance levels, we can build a linear regression model:

summary(m0 <- lm(V1 ~ ., data = d.train))$r.squared
[1] 0.24

cor(
  d.test$V1,
  predict(m0, newdata = d.test))^2
[1] 0.23

Although not great, linear regression accounts for 24% of the variance in years in the training data and 23% in the testing data; these results provide a benchmark for us to beat with the feedforward neural network.

Our first network is shallow, with a single hidden layer, and is fairly small. This is a larger dataset than some of the previous ones we have worked with, but it is still small enough that it is easy to work with all of it. To make performance scoring occur on the full dataset, we pass the special value 0 to the score_training_samples and score_validation_samples arguments. On the 10-core H2O cluster setup, the model took 79 seconds to train, timed using the system.time() function:

m1 <- h2o.deeplearning(
  x = colnames(d)[-1],
  y = "V1",
  training_frame= h2omsd.train,
  validation_frame = h2omsd.test,
  activation = "RectifierWithDropout",
  hidden = c(50),
  epochs = 100,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(0),
  score_training_samples = 0,
  score_validation_samples = 0,
  diagnostics = TRUE,
  export_weights_and_biases = TRUE,
  variable_importances = TRUE
  )
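
The training times quoted in this chapter were recorded by wrapping each call in system.time(). The following is only a sketch of the pattern, with a trivial placeholder standing in for the h2o.deeplearning() call above:

run.time <- system.time({
  Sys.sleep(2)  # placeholder for m1 <- h2o.deeplearning(...)
})
run.time["elapsed"]  # wall-clock seconds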

The results from this simple model show a marked improvement over the linear regression model. The feedforward neural network, even though it had only a single hidden layer with 50 neurons, accounted for 32% of the variance in release year in the testing data, up from 23% using linear regression.
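
These values can also be pulled programmatically rather than read off the printed model summary; a minimal sketch, assuming m1 has finished training:

h2o.r2(m1, train = TRUE)  # R2 on the full training frame
h2o.r2(m1, valid = TRUE)  # R2 on the validation (testing) frame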

Because the model was small and had fewer hidden neurons than input variables, no dropout or other regularization was used. However, the performance discrepancy between the training and testing data (R2 = 0.37 versus R2 = 0.32, respectively) indicates that some regularization may be helpful:

m1

Model Details:
==============

H2ORegressionModel: deeplearning
Model ID:  DeepLearning_model_R_1451972322936_5 
Status of Neuron Layers: predicting V1, regression, gaussian distribution, Quadratic loss, 4,601 weights/biases, 72.5 KB, 13,702,476 training samples, mini-batch size 1
  layer units             type dropout       l1       l2 mean_rate
1     1    90            Input  0.00 %                            
2     2    50 RectifierDropout  0.00 % 0.000000 0.000000  0.009403
3     3     1           Linear         0.000000 0.000000  0.000218
  rate_RMS momentum mean_weight weight_RMS mean_bias bias_RMS
1                                                            
2 0.007939 0.000000   -0.018219   0.598229 -2.199141 2.245173
3 0.000202 0.000000   -0.042807   0.103305 -0.767868 0.000000


H2ORegressionMetrics: deeplearning
** Reported on training data. **
Description: Metrics reported on full training frame

MSE:  76
R2 :  0.37
Mean Residual Deviance :  76


H2ORegressionMetrics: deeplearning
** Reported on validation data. **
Description: Metrics reported on temporary (load-balanced) validation frame

MSE:  80
R2 :  0.32
Mean Residual Deviance :  80

Although our shallow neural network model was an improvement over linear regression, it still did not perform well and there is clearly room for improvement. Next, we will try a larger, deep feedforward neural network. In the following model code, we use three hidden layers with 200, 200, and 400 neurons, respectively. We will also introduce a modest amount of dropout on the hidden (but not input) layers. This model took 843 seconds to train:

m2 <- h2o.deeplearning(
  x = colnames(d)[-1],
  y = "V1",
  training_frame= h2omsd.train,
  validation_frame = h2omsd.test,
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 400),
  epochs = 100,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(.2, .2, .2),
  score_training_samples = 0,
  score_validation_samples = 0,
  diagnostics = TRUE,
  export_weights_and_biases = TRUE,
  variable_importances = TRUE
  )

Examining the performance of the model shows a noticeable improvement over the small, shallow model we tried first. In the testing data, the shallow model had an R2 of 0.32, whereas the deep model reaches an R2 of 0.35.
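
One way to line up the validation R2 values side by side, rather than scanning the printed summaries, is the following sketch (it assumes m1 and m2 are both still in the H2O cluster):

sapply(list(shallow = m1, deep = m2), h2o.r2, valid = TRUE)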

There is also a degree of overfitting. The difference in R2 between the training and testing data is 0.05, comparable to the simpler model, where the difference was also 0.05. The more complex model thus improves performance with little additional overfitting, perhaps due to the dropout used:

m2

Model Details:
==============

H2ORegressionModel: deeplearning
Model ID:  DeepLearning_model_R_1452031055473_5 
Status of Neuron Layers: predicting V1, regression, gaussian distribution, Quadratic loss, 139,201 weights/biases, 1.6 MB, 22,695,351 training samples, mini-batch size 1
  layer units             type dropout       l1       l2 mean_rate
1     1    90            Input  0.00 %                            
2     2   200 RectifierDropout 20.00 % 0.000000 0.000000  0.011513
3     3   200 RectifierDropout 20.00 % 0.000000 0.000000  0.014861
4     4   400 RectifierDropout 20.00 % 0.000000 0.000000  0.054338
5     5     1           Linear         0.000000 0.000000  0.001258
  rate_RMS momentum mean_weight weight_RMS mean_bias bias_RMS
1                                                            
2 0.004978 0.000000    0.000848   0.207373 -0.254659 0.321144
3 0.012359 0.000000   -0.032566   0.104347  1.017329 0.341556
4 0.036596 0.000000   -0.031768   0.072171  0.651546 0.292565
5 0.000505 0.000000    0.001421   0.020867 -0.596303 0.000000


H2ORegressionMetrics: deeplearning
** Reported on training data. **
Description: Metrics reported on full training frame

MSE:  66
R2 :  0.40
Mean Residual Deviance :  66


H2ORegressionMetrics: deeplearning
** Reported on validation data. **
Description: Metrics reported on temporary (load-balanced) validation frame

MSE:  70
R2 :  0.35
Mean Residual Deviance :  70

To see whether the performance on the testing data can be improved further, we will try one additional model with substantially more hidden neurons in each layer, more training iterations (epochs), and a higher degree of regularization. Readers may not wish to run the following code, as the model took over 10 hours to complete on the 10-core H2O cluster:

m3 <- h2o.deeplearning(
  x = colnames(d)[-1],
  y = "V1",
  training_frame= h2omsd.train,
  validation_frame = h2omsd.test,
  activation = "RectifierWithDropout",
  hidden = c(500, 500, 1000),
  epochs = 500,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(.5, .5, .5),
  score_training_samples = 0,
  score_validation_samples = 0,
  diagnostics = TRUE,
  export_weights_and_biases = TRUE
  )

The performance of this model on the testing data was actually worse than either of the previous two models, though still superior to the linear regression:

m3

Model Details:
==============

H2ORegressionModel: deeplearning
Model ID:  DeepLearning_model_R_1451972322936_15 
Status of Neuron Layers: predicting V1, regression, gaussian distribution, Quadratic loss, 798,001 weights/biases, 9.2 MB, 47,002,720 training samples, mini-batch size 1
  layer units             type dropout       l1       l2 mean_rate
1     1    90            Input  0.00 %                            
2     2   500 RectifierDropout 50.00 % 0.000000 0.000000  0.028872
3     3   500 RectifierDropout 50.00 % 0.000000 0.000000  0.047632
4     4  1000 RectifierDropout 50.00 % 0.000000 0.000000  0.084886
5     5     1           Linear         0.000000 0.000000  0.001238
  rate_RMS momentum mean_weight weight_RMS mean_bias bias_RMS
1                                                            
2 0.014727 0.000000    0.000941   0.069018  0.417255 0.048082
3 0.020226 0.000000   -0.007515   0.049535  0.968111 0.054521
4 0.062396 0.000000   -0.009451   0.038735  0.929930 0.032726
5 0.000445 0.000000    0.000538   0.014785 -0.478095 0.000000


H2ORegressionMetrics: deeplearning
** Reported on training data. **
Description: Metrics reported on full training frame

MSE:  84
R2 :  0.30
Mean Residual Deviance :  84


H2ORegressionMetrics: deeplearning
** Reported on validation data. **
Description: Metrics reported on temporary (load-balanced) validation frame

MSE:  85
R2 :  0.28
Mean Residual Deviance :  85

Our best model, then, is still the deep model with fewer hidden neurons per layer. One way to see whether that model can be improved is to train for additional epochs or iterations. The model output includes a model ID; for the best-performing model, this was DeepLearning_model_R_1452031055473_5. This ID can be passed to the checkpoint argument of the h2o.deeplearning() function so that training begins using the weights from the previous model. Note that the model ID will be different every time you run the code; when running it on your own computer or servers, you will need to use the model ID from your run.
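
Rather than copying the ID from the printed output, it can also be read directly from the fitted model object; a minimal sketch, assuming m2 is still in the R session:

m2@model_id  # the ID string, for example "DeepLearning_model_R_1452031055473_5"

The same value could equally be passed as checkpoint = m2@model_id in the call below.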

As long as the general architecture (the number of hidden neurons, layers, and connections) remains the same, using a checkpoint can be a great time saver. This is not only because the previous training iterations are re-used, but also because the earlier iterations tend to take longer than the later ones. The following example shows how to run the model, increasing the epochs to 1,000 and starting from the previous model run by specifying its model ID as a character string to the checkpoint argument:

m2b <- h2o.deeplearning(
  x = colnames(d)[-1],
  y = "V1",
  training_frame= h2omsd.train,
  validation_frame = h2omsd.test,
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 400),
  checkpoint = "DeepLearning_model_R_1452031055473_5",
  epochs = 1000,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(.2, .2, .2),
  score_training_samples = 0,
  score_validation_samples = 0,
  diagnostics = TRUE,
  export_weights_and_biases = TRUE,
  variable_importances = TRUE
  )

However, in the end, the additional epochs did not improve performance on the testing data; in fact, it became slightly worse:

m2b

Model Details:
==============

H2ORegressionModel: deeplearning
Model ID:  DeepLearning_model_R_1452031055473_81 
Status of Neuron Layers: predicting V1, regression, gaussian distribution, Quadratic loss, 139,201 weights/biases, 1.6 MB, 30,054,531 training samples, mini-batch size 1
  layer units             type dropout       l1       l2 mean_rate
1     1    90            Input  0.00 %                            
2     2   200 RectifierDropout 20.00 % 0.000000 0.000000  0.008598
3     3   200 RectifierDropout 20.00 % 0.000000 0.000000  0.012581
4     4   400 RectifierDropout 20.00 % 0.000000 0.000000  0.025138
5     5     1           Linear         0.000000 0.000000  0.000895
  rate_RMS momentum mean_weight weight_RMS mean_bias bias_RMS
1                                                            
2 0.004485 0.000000   -0.004116   0.473692 -1.601533 1.060434
3 0.017790 0.000000   -0.040249   0.239924  0.767950 1.305716
4 0.022843 0.000000   -0.048592   0.105753  0.360921 0.439503
5 0.000582 0.000000   -0.001778   0.029287 -0.065273 0.000000


H2ORegressionMetrics: deeplearning
** Reported on training data. **
Description: Metrics reported on full training frame

MSE:  62
R2 :  0.43
Mean Residual Deviance :  62


H2ORegressionMetrics: deeplearning
** Reported on validation data. **
Description: Metrics reported on temporary (load-balanced) validation frame

MSE:  72
R2 :  0.33
Mean Residual Deviance :  72

Working with model results

It is easy to save models in R but, when calling H2O from R, most results are not actually stored in R; instead, they are stored in the H2O cluster. Saving the R object therefore only saves a reference to the model in the H2O cluster and, if that cluster is shut down, the full model results are lost. To avoid this and save the full model results, we use the h2o.saveModel() function and specify the model to be saved (by passing the R object), the path, and whether to overwrite existing files (using force = TRUE):

h2o.saveModel(
  object = m2,
  path = "c:/Users/jwile/DeepLearning",
  force = TRUE)

This will create a directory with all of the files needed to load and use the model again. Once you have saved a model, you can load it back into a new H2O cluster using the h2o.loadModel() function. Note that you must also specify the folder name for the model results you want to load.
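
A sketch of loading the model back in a later session follows; the path shown is illustrative and will include the model ID from your own run:

m2.restored <- h2o.loadModel(
  path = "c:/Users/jwile/DeepLearning/DeepLearning_model_R_1452031055473_5")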

In addition to saving the model results to be loaded again into an H2O cluster, models can be saved as a Plain Old Java Object (POJO). Saving models as a POJO is useful because they can be embedded in other applications and used to score new data. H2O models can be exported this way using the h2o.download_pojo() function, passing the model and, optionally, a path where the files should be saved.
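
For example, downloading the m2 model as a POJO to the same directory used above might look like the following sketch (the path is illustrative):

h2o.download_pojo(m2, path = "c:/Users/jwile/DeepLearning")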

Another useful function is h2o.scoreHistory(). The score history shows the performance of the model across training iterations, along with a timestamp and the elapsed time at each scoring point. The following code shows how to use it and the results:

h2o.scoreHistory(m2)

Scoring History: 
             timestamp          duration training_speed   epochs
1  2016-01-06 23:20:18         0.000 sec                 0.00000
2  2016-01-06 23:20:26        15.537 sec 13922 rows/sec  0.21687
3  2016-01-06 23:21:51  1 min 40.761 sec 22603 rows/sec  4.11902
4  2016-01-06 23:23:15  3 min  4.790 sec 25030 rows/sec  8.66890
5  2016-01-06 23:24:39  4 min 28.208 sec 26347 rows/sec 13.43506
6  2016-01-06 23:26:00  5 min 49.401 sec 27540 rows/sec 18.41458
7  2016-01-06 23:27:21  7 min 10.032 sec 28317 rows/sec 23.39553
8  2016-01-06 23:28:40  8 min 29.325 sec 28928 rows/sec 28.37323
9  2016-01-06 23:29:59  9 min 48.908 sec 29354 rows/sec 33.34907
10 2016-01-06 23:31:21 11 min 10.056 sec 29771 rows/sec 38.54472
11 2016-01-06 23:32:41 12 min 30.532 sec 30130 rows/sec 43.73626
12 2016-01-06 23:34:04 13 min 53.652 sec 30444 rows/sec 49.14818
13 2016-01-06 23:34:12 14 min  1.667 sec 30442 rows/sec 49.14818
   iterations         samples training_MSE training_deviance
1           0        0.000000                               
2           1   100145.000000     73.50950          73.50950
3          19  1902057.000000     65.90201          65.90201
4          40  4003071.000000     66.39865          66.39865
5          62  6203960.000000     63.97995          63.97995
6          85  8503375.000000     65.20361          65.20361
7         108 10803448.000000     62.67372          62.67372
8         131 13102020.000000     63.91678          63.91678
9         154 15399734.000000     60.31355          60.31355
10        178 17798949.000000     60.15803          60.15803
11        202 20196268.000000     61.71012          61.71012
12        227 22695351.000000     58.34747          58.34747
13        227 22695351.000000     65.90201          65.90201
   training_r2 validation_MSE validation_deviance validation_r2
1                                                              
2      0.32564       73.67272            73.67272       0.30763
3      0.39543       69.57711            69.57711       0.34612
4      0.39087       71.70615            71.70615       0.32611
5      0.41306       70.45211            70.45211       0.33790
6      0.40184       71.98921            71.98921       0.32345
7      0.42505       70.90519            70.90519       0.33364
8      0.41364       72.69913            72.69913       0.31678
9      0.44670       70.49905            70.49905       0.33746
10     0.44812       70.76801            70.76801       0.33493
11     0.43389       72.22494            72.22494       0.32124
12     0.46473       70.55234            70.55234       0.33696
13     0.39543       69.57711            69.57711       0.34612

So far we have only examined the overall performance of the model. Although this is a useful summary, it provides less than a complete picture. Examining the model residuals can help us understand whether the model performs consistently across the range of the data, identify anomalous residuals, and generally assess performance more comprehensively. We can calculate residuals by getting predicted values for all cases using the h2o.predict() function and then taking the difference between the observed values and the predictions. The following code extracts the predictions, joins them with the observed values, and plots them. A residual of zero indicates a perfect prediction, with positive or negative residuals indicating over- or under-prediction. Since years are discrete, we can visualize the data using boxplots of the residuals for each actual year of song release, using the following code. This is shown in Figure 5.5:

yhat <- as.data.frame(h2o.predict(m1, h2omsd.train))
yhat <- cbind(as.data.frame(h2omsd.train[["V1"]]), yhat)

p.resid <- ggplot(yhat, aes(factor(V1), predict - V1)) +
  geom_boxplot() +
  geom_hline(yintercept = 0) +
  theme_classic() +
  theme(axis.text.x = element_text(
          angle = 90, vjust = 0.5, hjust = 0)) +
  xlab("Year of Release") +
  ylab("Residual (Predicted - Actual Year)")
print(p.resid)

Figure 5.5: Boxplots of residuals by actual year of release

The results show a marked pattern of decreasing residuals in later years or, conversely, extremely aberrant model predictions for the earlier years. In part, this may be due to the distribution of the data. With most cases coming from the mid-1990s to the 2000s, as we saw earlier in Figure 5.4, the model will be most sensitive to accurately predicting these values, and the comparatively few cases from before 1990 or 1980 will have less influence.

Because we used the variable_importances argument, we can extract the relative importance of each variable to the model using the h2o.varimp() function. Although it is difficult to accurately apportion the importance of each variable, this can provide a rough sense of which variables tend to make a larger contribution to the predictions than others, which can be helpful, for example, for excluding variables that contribute very little. The following code extracts the variable importances, prints the top 10 (the output is sorted from most to least important), and graphs the results to show the distribution, shown in Figure 5.6:

imp <- as.data.frame(h2o.varimp(m2))
imp[1:10, ]
   variable relative_importance scaled_importance percentage
1        V2                1.00              1.00      0.039
2        V3                0.66              0.66      0.026
3        V4                0.53              0.53      0.020
4       V14                0.47              0.47      0.018
5       V24                0.47              0.47      0.018
6        V7                0.44              0.44      0.017
7       V37                0.40              0.40      0.016
8        V6                0.39              0.39      0.015
9       V59                0.35              0.35      0.014
10      V26                0.34              0.34      0.013


p.imp <- ggplot(imp, aes(factor(variable, levels = variable), percentage)) +
  geom_point() +
  theme_classic() +
  theme(axis.text.x = element_blank()) +
  xlab("Variable Number") +
  ylab("Percentage of Total Importance")
print(p.imp)

Figure 5.6: Percentage of total importance by variable

From the description of the dataset, the first 12 variables represent various timbre measures of the music, with the next 78 being the unique elements of a covariance matrix computed from the first 12. Thus it is interesting that the top three variables all come from the first 12 timbre measures, not from the covariances. If, for example, the latter 78 variables were costly or difficult to collect, we might consider what performance is possible using only the first 12 predictors. The following model tests that approach using a simple, shallow model:

mtest <- h2o.deeplearning(
  x = colnames(d)[2:13],
  y = "V1",
  training_frame= h2omsd.train,
  validation_frame = h2omsd.test,
  activation = "RectifierWithDropout",
  hidden = c(50),
  epochs = 100,
  input_dropout_ratio = 0,
  hidden_dropout_ratios = c(0),
  score_training_samples = 0,
  score_validation_samples = 0,
  diagnostics = TRUE,
  export_weights_and_biases = TRUE,
  variable_importances = TRUE
)

mtest

H2ORegressionModel: deeplearning
Model ID:  DeepLearning_model_R_1452082402089_15 
Status of Neuron Layers: predicting V1, regression, gaussian distribution, Quadratic loss, 701 weights/biases, 13.6 KB, 27,398,762 training samples, mini-batch size 1
  layer units             type dropout       l1       l2 mean_rate
1     1    12            Input  0.00 %                            
2     2    50 RectifierDropout  0.00 % 0.000000 0.000000  0.003773
3     3     1           Linear         0.000000 0.000000  0.000985
  rate_RMS momentum mean_weight weight_RMS mean_bias bias_RMS
1                                                            
2 0.007925 0.000000    0.004197   0.504967 -0.679546 0.965184
3 0.000926 0.000000   -0.106522   0.286619 -1.400430 0.000000


H2ORegressionMetrics: deeplearning
** Reported on training data. **
Description: Metrics reported on full training frame

MSE:  82
R2 :  0.24
Mean Residual Deviance :  82


H2ORegressionMetrics: deeplearning
** Reported on validation data. **
Description: Metrics reported on temporary (load-balanced) validation frame

MSE:  83
R2 :  0.22
Mean Residual Deviance :  83

The results show an R2 of only 0.24 for the training data and 0.22 for the testing data. This is still comparable to the linear regression using all variables, but quite a bit lower than the 0.32 and 0.35 obtained using neural networks on the full set of predictors. Even though many of the variables individually have fairly small importance, together they add up to a noticeable difference.
