Chapter 11. Improving Model Performance

When a sports team falls short of meeting its goal—whether the goal is to obtain an Olympic gold medal, a league championship, or a world record time—it must search for possible improvements. Imagine that you're the team's coach. How would you spend your practice sessions? Perhaps you'd direct the athletes to train harder or train differently in order to maximize every bit of their potential. Or, you might emphasize better teamwork, making smarter use of the athletes' strengths and weaknesses.

Now imagine that you're training a world champion machine learning algorithm. Perhaps you hope to compete in data mining competitions such as those posted on Kaggle (http://www.kaggle.com/competitions). Maybe you simply need to improve business results. Where do you begin? Although the context differs, the strategies one uses to improve sports team performance can also be used to improve the performance of statistical learners.

As the coach, it is your job to find the combination of training techniques and teamwork skills that allows you to meet your performance goals. This chapter builds upon the material covered throughout this book to introduce a set of techniques for improving the predictive performance of machine learners. You will learn:

  • How to automate model performance tuning by systematically searching for the optimal set of training conditions
  • The methods for combining models into groups that use teamwork to tackle tough learning tasks
  • How to apply a variant of decision trees, which has quickly become popular due to its impressive performance

None of these methods will be successful for every problem. Yet, looking at the winning entries to machine learning competitions, you'll likely find that at least one of them has been employed. To be competitive, you too will need to add these skills to your repertoire.

Tuning stock models for better performance

Some learning problems are well-suited to the stock models presented in the previous chapters. In such cases, it may not be necessary to spend much time iterating and refining the model; it may perform well enough as it is. On the other hand, some problems are inherently more difficult. The underlying concepts to be learned may be extremely complex, requiring an understanding of many subtle relationships, or the data may be affected by random variation, making it difficult to separate the signal from the noise.

Developing models that perform extremely well on difficult problems is every bit as much an art as it is a science. Sometimes a bit of intuition is helpful when trying to identify areas where performance can be improved. In other cases, finding improvements will require a brute-force, trial-and-error approach. Of course, the process of searching through numerous possible improvements can be aided by the use of automated programs.

In Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, we attempted a difficult problem: identifying loans that were likely to enter into default. Although we were able to use performance tuning methods to obtain a respectable classification accuracy of about 82 percent, upon a more careful examination in Chapter 10, Evaluating Model Performance, we realized that the high accuracy was a bit misleading. In spite of the reasonable accuracy, the kappa statistic was only about 0.28, which suggested that the model was actually performing somewhat poorly. In this section, we'll revisit the credit scoring model to see whether we can improve the results.

Tip

To follow along with the examples, download the credit.csv file from the Packt Publishing website and save it to your R working directory. Load the file into R using the command credit <- read.csv("credit.csv").

You will recall that we first used a stock C5.0 decision tree to build the classifier for the credit data. We then attempted to improve its performance by adjusting the trials parameter to increase the number of boosting iterations. By increasing the number of iterations from the default of 1 up to the value of 10, we were able to increase the model's accuracy. This process of adjusting the model options to identify the best fit is called parameter tuning.
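
To make this process concrete, the following sketch repeats the kind of manual adjustment made in Chapter 5, using the C50 package directly. The credit_train and credit_test objects are hypothetical training and test splits of the credit data; only the trials value is changed from its default.

> library(C50)
> # increase the number of boosting iterations from the default of 1 to 10
> credit_boost <- C5.0(default ~ ., data = credit_train, trials = 10)
> # evaluate the adjusted model on the held-out test data
> boost_pred <- predict(credit_boost, credit_test)
> table(boost_pred, credit_test$default)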

Parameter tuning is not limited to decision trees. For instance, we tuned k-NN models when we searched for the best value of k, and we tuned neural networks and support vector machines as we adjusted the number of hidden nodes or chose different kernel functions. Most machine learning algorithms allow the adjustment of at least one parameter, and the most sophisticated models offer a large number of ways to tweak the model fit. Although this allows the model to be tailored closely to the learning task, the complexity of all the possible options can be daunting. A more systematic approach is warranted.

Using caret for automated parameter tuning

Rather than choosing arbitrary values for each of the model's parameters—a task that is not only tedious, but also somewhat unscientific—it is better to conduct a search through many possible parameter values to find the best combination.

The caret package, which we used extensively in Chapter 10, Evaluating Model Performance, provides tools to assist with automated parameter tuning. The core functionality is provided by a train() function that serves as a standardized interface for over 175 different machine learning models for both classification and regression tasks. By using this function, it is possible to automate the search for optimal models using a choice of evaluation methods and metrics.

Tip

Do not feel overwhelmed by the large number of models—we've already covered many of them in the earlier chapters. Others are simple variants or extensions of the base concepts. Given what you've learned so far, you should be confident that you have the ability to understand all of the available methods.

Automated parameter tuning requires you to consider three questions:

  • What type of machine learning model (and specific implementation) should be trained on the data?
  • Which model parameters can be adjusted, and how extensively should they be tuned to find the optimal settings?
  • What criteria should be used to evaluate the models to find the best candidate?

Answering the first question involves finding a well-suited match between the machine learning task and one of the 175 models. Obviously, this requires an understanding of the breadth and depth of machine learning models. It can also help to work through a process of elimination. Nearly half of the models can be eliminated depending on whether the task is classification or numeric prediction; others can be excluded based on the format of the data or the need to avoid black box models, and so on. In any case, there's also no reason you can't try several approaches and compare the best results of each.

Addressing the second question is largely dictated by the choice of model, since each algorithm utilizes a unique set of parameters. The available tuning parameters for the predictive models covered in this book are listed in the following table. Keep in mind that although some models have additional options not shown, only those listed in the table are supported by caret for automatic tuning.

Model                                          Learning Task   Method name  Parameters
k-Nearest Neighbors                            Classification  knn          k
Naive Bayes                                    Classification  nb           fL, usekernel
Decision Trees                                 Classification  C5.0         model, trials, winnow
OneR Rule Learner                              Classification  OneR         None
RIPPER Rule Learner                            Classification  JRip         NumOpt
Linear Regression                              Regression      lm           None
Regression Trees                               Regression      rpart        cp
Model Trees                                    Regression      M5           pruned, smoothed, rules
Neural Networks                                Dual use        nnet         size, decay
Support Vector Machines (Linear Kernel)        Dual use        svmLinear    C
Support Vector Machines (Radial Basis Kernel)  Dual use        svmRadial    C, sigma
Random Forests                                 Dual use        rf           mtry

Tip

For a complete list of the models and corresponding tuning parameters covered by caret, refer to the table provided by package author Max Kuhn at http://topepo.github.io/caret/modelList.html.

If you ever forget the tuning parameters for a particular model, the modelLookup() function can be used to find them. Simply supply the method name, as illustrated here for the C5.0 model:

> modelLookup("C5.0")
  model parameter                 label forReg forClass probModel
1  C5.0    trials # Boosting Iterations  FALSE     TRUE      TRUE
2  C5.0     model            Model Type  FALSE     TRUE      TRUE
3  C5.0    winnow                Winnow  FALSE     TRUE      TRUE
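
The same lookup works for any method name in the preceding table. For example, the following calls should list the parameters shown in that table for neural networks and random forests (output omitted here):

> modelLookup("nnet")   # size and decay
> modelLookup("rf")     # mtry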

The goal of automatic tuning is to search a set of candidate models comprising a matrix, or grid, of parameter combinations. Because it is impractical to search every conceivable combination, only a subset of possibilities is used to construct the grid. By default, caret searches at most three values for each of the p parameters. This means that at most 3^p candidate models will be tested. For example, by default, the automatic tuning of k-Nearest Neighbors will compare 3^1 = 3 candidate models with k=5, k=7, and k=9. Similarly, tuning a decision tree will result in a comparison of up to 27 different candidate models, comprising the grid of 3^3 = 27 combinations of model, trials, and winnow settings. In practice, however, only 12 models are actually tested. This is because the model and winnow parameters can only take two values (tree versus rules and TRUE versus FALSE, respectively), which makes the grid size 3 * 2 * 2 = 12.
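
To see where the count of 12 comes from, the default grid can be reconstructed, approximately, with the expand.grid() function. This is only an illustration of the counting; the exact trials values caret chooses by default may vary by version.

> # 3 trials values x 2 model types x 2 winnow settings = 12 candidates
> illustrative_grid <- expand.grid(trials = c(1, 10, 20),
                                   model = c("tree", "rules"),
                                   winnow = c(TRUE, FALSE))
> nrow(illustrative_grid)   # returns 12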

Tip

Since the default search grid may not be ideal for your learning problem, caret allows you to provide a custom search grid defined by a simple command, which we will cover later.

The third and final step in automatic model tuning involves identifying the best model among the candidates. This uses the methods discussed in Chapter 10, Evaluating Model Performance, such as the choice of resampling strategy for creating training and test datasets and the use of model performance statistics to measure the predictive accuracy.

All of the resampling strategies and many of the performance statistics we've learned are supported by caret. These include statistics such as accuracy and kappa (for classifiers) and R-squared or RMSE (for numeric models). Cost-sensitive measures such as sensitivity, specificity, and area under the ROC curve (AUC) can also be used, if desired.
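
For example, to rank candidate models by AUC rather than accuracy, the experiment needs class probabilities and a summary function that computes ROC statistics. The following is a minimal sketch using caret's twoClassSummary function and the trainControl() function covered later in this chapter; treat it as an illustration rather than a recipe.

> # classProbs = TRUE allows twoClassSummary to compute the ROC curve
> ctrl_roc <- trainControl(method = "cv", number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
> m_roc <- train(default ~ ., data = credit, method = "C5.0",
                 metric = "ROC", trControl = ctrl_roc)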

By default, caret will select the candidate model with the largest value of the desired performance measure. As this practice sometimes results in the selection of models that achieve marginal performance improvements via large increases in model complexity, alternative model selection functions are provided.

Given the wide variety of options, it is helpful that many of the defaults are reasonable. For instance, caret will use prediction accuracy on a bootstrap sample to choose the best performer for classification models. Beginning with these default values, we can then tweak the train() function to design a wide variety of experiments.

Creating a simple tuned model

To illustrate the process of tuning a model, let's begin by observing what happens when we attempt to tune the credit scoring model using the caret package's default settings. From there, we will adjust the options to our liking.

The simplest way to tune a learner requires you to only specify a model type via the method parameter. Since we used C5.0 decision trees previously with the credit model, we'll continue our work by optimizing this learner. The basic train() command for tuning a C5.0 decision tree using the default settings is as follows:

> library(caret)
> set.seed(300)
> m <- train(default ~ ., data = credit, method = "C5.0")

First, the set.seed() function is used to initialize R's random number generator to a set starting position. You may recall that we used this function in several prior chapters. By setting the seed parameter (in this case to the arbitrary number 300), the random numbers will follow a predefined sequence. This allows simulations that use random sampling to be repeated with identical results—a very helpful feature if you are sharing code or attempting to replicate a prior result.
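
A quick way to see this behavior is to draw the same random sample twice with the same seed; both draws return identical values.

> set.seed(300)
> sample(10, 5)
> set.seed(300)
> sample(10, 5)   # identical to the first draw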

Next, we define a tree as default ~ . using the R formula interface. This models loan default status (yes or no) using all of the other features in the credit data frame. The parameter method = "C5.0" tells caret to use the C5.0 decision tree algorithm.

After you've entered the preceding command, there may be a significant delay (dependent upon your computer's capabilities) as the tuning process occurs. Even though this is a fairly small dataset, a substantial amount of calculation must occur. R must repeatedly generate random samples of data, build decision trees, compute performance statistics, and evaluate the result.

The result of the experiment is saved in an object named m. If you would like to examine the object's contents, the str(m) command will list all the associated data, but this can be quite overwhelming. Instead, simply type the name of the object for a condensed summary of the results. For instance, typing m yields the following output (note that labels have been added for clarity):

[Figure: annotated console output of the m object, showing the dataset summary, the resampling method, the candidate models evaluated, and the final model selected]

The labels highlight four main components in the output:

  1. A brief description of the input dataset: If you are familiar with your data and have applied the train() function correctly, this information should not be surprising.
  2. A report of the preprocessing and resampling methods applied: Here, we see that 25 bootstrap samples, each including 1,000 examples, were used to train the models.
  3. A list of the candidate models evaluated: In this section, we can confirm that 12 different models were tested, based on the combinations of three C5.0 tuning parameters—model, trials, and winnow. The average and standard deviation of the accuracy and kappa statistics for each candidate model are also shown.
  4. The choice of the best model: As the footnote describes, the model with the largest accuracy was selected. This was the model that used a decision tree with 20 trials and the setting winnow = FALSE.

After identifying the best model, the train() function uses its tuning parameters to build a model on the full input dataset, which is stored in the m list object as m$finalModel. In most cases, you will not need to work directly with the finalModel sub-object. Instead, simply use the predict() function with the m object as follows:

> p <- predict(m, credit)

The resulting vector of predictions works as expected, allowing us to create a confusion matrix that compares the predicted and actual values:

> table(p, credit$default)

p      no yes
  no  700   2
  yes   0 298

Of the 1,000 examples used for training the final model, only two were misclassified. However, it is very important to note that since the model was built on both the training and test data, this accuracy is optimistic and thus, should not be viewed as indicative of performance on unseen data. The bootstrap estimate of 73 percent (shown in the summary output) is a more realistic estimate of future performance.
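
If you would like an additional safeguard, you might hold out a test set before calling train(), as we did in Chapter 10. The following is a minimal sketch using caret's createDataPartition() function; the object names and seed are arbitrary.

> set.seed(123)
> in_train <- createDataPartition(credit$default, p = 0.75, list = FALSE)
> credit_train <- credit[in_train, ]
> credit_test <- credit[-in_train, ]
> m2 <- train(default ~ ., data = credit_train, method = "C5.0")
> # accuracy on the 25 percent of records never seen during tuning
> mean(predict(m2, credit_test) == credit_test$default)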

Using the train() and predict() functions also offers a couple of benefits in addition to the automatic parameter tuning.

First, any data preparation steps applied by the train() function will be similarly applied to the data used for generating predictions. This includes transformations such as centering and scaling as well as imputation of missing values. Allowing caret to handle the data preparation will ensure that the steps that contributed to the best model's performance will remain in place when the model is deployed.
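
For instance, a centering and scaling step can be requested directly in the train() call via the preProcess parameter. The sketch below uses k-NN, where such rescaling matters; the method choice here is purely illustrative.

> # the predictors are centered and scaled before training, and the same
> # transformation is reapplied automatically inside predict()
> m_knn <- train(default ~ ., data = credit, method = "knn",
                 preProcess = c("center", "scale"))
> head(predict(m_knn, credit))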

Second, the predict() function provides a standardized interface for obtaining predicted class values and class probabilities, even for model types that ordinarily would require additional steps to obtain this information. The predicted classes are provided by default:

> head(predict(m, credit))
[1] no  yes no  no  yes no
Levels: no yes

To obtain the estimated probabilities for each class, use the type = "prob" parameter:

> head(predict(m, credit, type = "prob"))
         no        yes
1 0.9606970 0.03930299
2 0.1388444 0.86115561
3 1.0000000 0.00000000
4 0.7720279 0.22797208
5 0.2948062 0.70519385
6 0.8583715 0.14162851

Even in cases where the underlying model refers to the prediction probabilities using a different string (for example, "raw" for a naiveBayes model), the predict() function will translate type = "prob" to the appropriate string behind the scenes.
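
As a concrete example, a naive Bayes model tuned through caret can still be asked for probabilities with type = "prob", even though the underlying predict() method expects a different type string; caret performs the translation. This sketch assumes the packages supporting the nb method are installed.

> m_nb <- train(default ~ ., data = credit, method = "nb")
> head(predict(m_nb, credit, type = "prob"))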

Customizing the tuning process

The decision tree we created previously demonstrates the caret package's ability to produce an optimized model with minimal intervention. The default settings allow optimized models to be created easily. However, it is also possible to change the default settings to something more specific to a learning task, which may assist with unlocking the upper echelon of performance.

Each step in the model selection process can be customized. To illustrate this flexibility, let's modify our work on the credit decision tree to mirror the process we had used in Chapter 10, Evaluating Model Performance. If you remember, we had estimated the kappa statistic using 10-fold cross-validation. We'll do the same here, using kappa to optimize the boosting parameter of the decision tree. Note that decision tree boosting was previously covered in Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, and will also be covered in greater detail later in this chapter.

The trainControl() function is used to create a set of configuration options known as a control object, which guides the train() function. These options allow for the management of model evaluation criteria such as the resampling strategy and the measure used for choosing the best model. Although this function can be used to modify nearly every aspect of a tuning experiment, we'll focus on the two important parameters: method and selectionFunction.

Tip

If you're eager for more details, you can use the ?trainControl command for a list of all the parameters.

For the trainControl() function, the method parameter is used to set the resampling method, such as holdout sampling or k-fold cross-validation. The following table lists the possible method types as well as any additional parameters for adjusting the sample size and number of iterations. Although the default options for these resampling methods follow popular convention, you may choose to adjust these depending upon the size of your dataset and the complexity of your model; a brief sketch of one such adjustment follows the table.

Resampling method                  Method name  Additional options and default values
Holdout sampling                   LGOCV        p = 0.75 (training data proportion)
k-fold cross-validation            cv           number = 10 (number of folds)
Repeated k-fold cross-validation   repeatedcv   number = 10 (number of folds),
                                                repeats = 10 (number of iterations)
Bootstrap sampling                 boot         number = 25 (resampling iterations)
0.632 bootstrap                    boot632      number = 25 (resampling iterations)
Leave-one-out cross-validation     LOOCV        None
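
As a sketch of adjusting these defaults, the following control object requests five repeats of 10-fold cross-validation rather than a single pass; more repeats mean proportionally more computation.

> # five repeats of 10 folds: each candidate is fit and evaluated 50 times
> ctrl_repeat <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 5)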

The selectionFunction parameter is used to specify the function that will choose the optimal model among the various candidates. Three such functions are included. The best function simply chooses the candidate with the best value on the specified performance measure. This is used by default. The other two functions are used to choose the most parsimonious, or simplest, model that is within a certain threshold of the best model's performance. The oneSE function chooses the simplest candidate within one standard error of the best performance, and tolerance uses the simplest candidate within a user-specified percentage.

Tip

Some subjectivity is involved with the caret package's ranking of models by simplicity. For information on how models are ranked, see the help page for the selection functions by typing ?best at the R command prompt.

To create a control object named ctrl that uses 10-fold cross-validation and the oneSE selection function, use the following command (note that number = 10 is included only for clarity; since this is the default value for method = "cv", it could have been omitted):

> ctrl <- trainControl(method = "cv", number = 10,
                       selectionFunction = "oneSE")

We'll use the result of this function shortly.

In the meantime, the next step in defining our experiment is to create the grid of parameters to optimize. The grid must include a column named for each parameter in the desired model, prefixed by a period. It must also include a row for each desired combination of parameter values. Since we are using a C5.0 decision tree, this means we'll need columns named .model, .trials, and .winnow. For other machine learning models, refer to the table presented earlier in this chapter or use the modelLookup() function to look up the parameters as described previously.

Rather than filling this data frame cell by cell—a tedious task if there are many possible combinations of parameter values—we can use the expand.grid() function, which creates data frames from the combinations of all the values supplied. For example, suppose we would like to hold constant model = "tree" and winnow = "FALSE" while searching eight different values of trials. This can be created as:

> grid <- expand.grid(.model = "tree",
                      .trials = c(1, 5, 10, 15, 20, 25, 30, 35),
                      .winnow = "FALSE")

The resulting grid data frame contains 1 * 8 * 1 = 8 rows:

> grid
  .model .trials .winnow
1   tree       1   FALSE
2   tree       5   FALSE
3   tree      10   FALSE
4   tree      15   FALSE
5   tree      20   FALSE
6   tree      25   FALSE
7   tree      30   FALSE
8   tree      35   FALSE

The train() function will build a candidate model for evaluation using each row's combination of model parameters.

Given this search grid and the control list created previously, we are ready to run a thoroughly customized train() experiment. As we did earlier, we'll set the random seed to the arbitrary number 300 in order to ensure repeatable results. But this time, we'll pass our control object and tuning grid while adding a parameter metric = "Kappa", indicating the statistic to be used by the model evaluation function—in this case, "oneSE". The full command is as follows:

> set.seed(300)
> m <- train(default ~ ., data = credit, method = "C5.0",
             metric = "Kappa",
             trControl = ctrl,
             tuneGrid = grid)

This results in an object that we can view by typing its name:

> m
[Figure: console output of the customized train() experiment, listing the eight candidate models and the one selected by the oneSE rule]

Although much of the output is similar to the automatically tuned model, there are a few differences of note. As 10-fold cross-validation was used, the sample size to build each candidate model was reduced to 900 rather than the 1,000 used in the bootstrap. As we requested, eight candidate models were tested. Additionally, because model and winnow were held constant, their values are no longer shown in the results; instead, they are listed as a footnote.

The best model here differs quite significantly from the prior trial. Before, the best model used trials = 20, whereas here, it used trials = 1. This seemingly odd finding is due to the fact that we used the oneSE rule rather than the best rule to select the optimal model. Even though the 35-trial model offers the best raw performance according to kappa, the 1-trial model offers nearly the same performance with a much simpler form. Not only are simple models more computationally efficient, but they also reduce the chance of overfitting the training data.
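
To see the full set of candidate results behind this decision, you can inspect the results component of the train object, which stores the resampled statistics for every row of the tuning grid.

> # compare the cross-validated accuracy and kappa of each candidate
> m$results[, c("trials", "Accuracy", "Kappa")]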
