Chapter 6. Tuning and Optimizing Models

In this final chapter, we will discuss a few approaches to tuning models. We will cover ways of addressing missing data. Although the example datasets we have used so far contained no missing data, in real-world applications missing data is a common occurrence. We will also discuss what can be done when a model is performing poorly, including a detailed examination of how to search for and optimize model hyperparameters.

This chapter will cover the following topics:

  • Dealing with missing data
  • Solutions for models with low accuracy

In this chapter, we make use of two new packages: the gridExtra package for arranging graphics and the mgcv package for fitting generalized additive models later in the chapter. These new packages should be added to the checkpoint.R file, which should then be sourced to set up the R environment for the rest of the code shown. R can be set up and an H2O cluster initialized using the following code:

## load packages and set global options for the session
source("checkpoint.R")
options(width = 70, digits = 2)

## start a local H2O cluster with 12 GB of memory and 4 threads
cl <- h2o.init(
  max_mem_size = "12G",
  nthreads = 4)
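If the cluster started successfully, h2o.clusterInfo() prints its details (H2O version, memory allocated, and number of CPUs in use), which is a quick way to verify the setup:

## print details of the running H2O cluster
h2o.clusterInfo()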

Dealing with missing data

When working with real-world applications, we often must contend with missing data. H2O includes a function to impute variables using the mean, median, or mode, and optionally to do so by some other grouping variables.

To examine how to impute missing data this way, we will use the small Iris dataset on flowers. In particular, we will set the petal width and length values to missing for the species "setosa" and then impute their values:

## setup iris data with some missing
d <- as.data.table(iris)
d[Species == "setosa", c("Petal.Width", "Petal.Length") := .(NA, NA)]

## two copies on the H2O cluster: one kept with missing values,
## one to be mean imputed
h2o.dmiss <- as.h2o(d, destination_frame = "iris_missing")
h2o.dmeanimp <- as.h2o(d, destination_frame = "iris_missing_imp")
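Before imputing, it is worth confirming which columns actually contain missing values on the H2O side; h2o.nacnt() returns the count of NAs in each column of an H2OFrame:

## count of missing values per column; only the petal columns
## should be non-zero
h2o.nacnt(h2o.dmiss)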

First, we will do a simple mean imputation. This has to be done one column at a time:

## mean imputation
## identify the columns with any missing values
missing.cols <- colnames(h2o.dmiss)[sapply(d, anyNA)]

## h2o.impute() fills in missing values in place, one column at
## a time; "mean" is the default method for numeric columns
for (v in missing.cols) {
  h2o.impute(h2o.dmeanimp, column = v, method = "mean")
}

One problem with imputing the overall non-missing mean is that any systematic differences between groups are ignored; likewise, any information in the non-missing variables that could improve predictions of the missing values goes unused.
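When a suitable grouping variable is available, h2o.impute() can compute the aggregates within groups via its by argument. The following is a minimal sketch (the frame name h2o.dgrpimp is ours, created so the frames used later are left untouched); note that in this particular dataset grouping by Species cannot help, because every "setosa" petal value is missing, so the within-group mean is itself missing:

## separate copy so the frames used later remain unchanged
h2o.dgrpimp <- as.h2o(d, destination_frame = "iris_missing_grp")

## impute Petal.Width using the mean within each species rather
## than the overall mean; illustrative only, since all "setosa"
## petal values are missing and their group mean is also missing
h2o.impute(h2o.dgrpimp, column = "Petal.Width",
           method = "mean", by = "Species")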

Instead of a simple mean imputation, we could use a simple prediction model. The following code builds a random forest model, with all default settings, to predict each column with missing values from the remaining columns. If the random forests take too long to train, a generalized linear model (h2o.glm()) could be used instead:

## random forest imputation
## explicit copy so the original data.table is left untouched
d.imputed <- copy(d)

## prediction model: for each column with missing values, train a
## random forest on the other columns (H2O drops rows with a
## missing response during training), then fill in the holes
## with the model's predictions
for (v in missing.cols) {
  tmp.m <- h2o.randomForest(
    x = setdiff(colnames(h2o.dmiss), v),
    y = v,
    training_frame = h2o.dmiss)
  yhat <- as.data.frame(h2o.predict(tmp.m, newdata = h2o.dmiss))
  ## replace only the missing entries; observed values are kept
  d.imputed[[v]] <- ifelse(is.na(d.imputed[[v]]),
                           yhat$predict, d.imputed[[v]])
}
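Because we created the missingness ourselves, we can sanity-check the random forest imputations against the true values from the original iris data (a check that is, of course, impossible with genuinely missing data):

## mean of the imputed values for "setosa" versus the true means
cbind(
  imputed = colMeans(d.imputed[Species == "setosa",
                               .(Petal.Length, Petal.Width)]),
  truth = colMeans(iris[iris$Species == "setosa",
                        c("Petal.Length", "Petal.Width")]))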

To compare the different methods, we can create a scatter plot of petal length against petal width, with the color and shape of the points determined by the flower species. This graph has three panels. The top panel is the original data. The middle panel is the data using mean imputation. The bottom panel is the data using random forest imputation. The following code creates the graph shown in Figure 6.1:

grid.arrange(
  ggplot(iris, aes(Petal.Length, Petal.Width,
                   color = Species, shape = Species)) +
    geom_point() +
    theme_classic() +
    ggtitle("Original Data"),
  ggplot(as.data.frame(h2o.dmeanimp), aes(Petal.Length, Petal.Width,
                   color = Species, shape = Species)) +
    geom_point() +
    theme_classic() +
    ggtitle("Mean Imputed Data"),
  ggplot(d.imputed, aes(Petal.Length, Petal.Width,
                   color = Species, shape = Species)) +
    geom_point() +
    theme_classic() +
    ggtitle("Random Forest Imputed Data"),
  ncol = 1)
Figure 6.1: Original data (top), mean-imputed data (middle), and random forest-imputed data (bottom)

In this case, the mean imputation creates aberrant values far removed from the rest of the data. If needed, more advanced prediction models could be used. For statistical inference, multiple imputation is preferred over single imputation (regardless of the method), because single imputation fails to account for uncertainty: when imputing the missing values, there is some degree of uncertainty as to exactly what those values are. However, in most use cases for deep learning, the datasets are far too large and the computation too demanding to create multiple datasets with different imputed values, train models on each, and pool the results; thus, simpler methods, such as mean imputation or a single prediction model, are common.
