When a sports team falls short of meeting its goal, whether that goal is an Olympic gold medal, a league championship, or a world record time, it must search for possible improvements. Imagine that you're the team's coach. How would you spend your practice sessions? Perhaps you'd direct the athletes to train harder or train differently in order to maximize every bit of their potential. Or, you might emphasize better teamwork, making smarter use of the athletes' individual strengths and weaknesses.
Now imagine that you're training a world champion machine learning algorithm. Perhaps you hope to compete in data mining competitions such as those posted on Kaggle (http://www.kaggle.com/competitions). Maybe you simply need to improve business results. Where do you begin? Although the context differs, the strategies one uses to improve sports team performance can also be used to improve the performance of statistical learners.
As the coach, it is your job to find the combination of training techniques and teamwork skills that allows you to meet your performance goals. This chapter builds upon the material covered throughout this book to introduce a set of techniques for improving the predictive performance of machine learners.
None of these methods will be successful for every problem. Yet, looking at the winning entries to machine learning competitions, you'll likely find that at least one of them has been employed. To be competitive, you too will need to add these skills to your repertoire.
Some learning problems are well-suited to the stock models presented in the previous chapters. In such cases, it may not be necessary to spend much time iterating and refining the model; it may perform well enough as it is. On the other hand, some problems are inherently more difficult. The underlying concepts to be learned may be extremely complex, requiring an understanding of many subtle relationships, or the data may be affected by random variation, making it difficult to separate the signal from the noise.
Developing models that perform extremely well on difficult problems is every bit as much an art as it is a science. Sometimes a bit of intuition is helpful when trying to identify areas where performance can be improved. In other cases, finding improvements will require a brute-force, trial-and-error approach. Of course, the process of searching through numerous possible improvements can be aided by the use of automated programs.
In Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, we attempted a difficult problem: identifying loans that were likely to enter into default. Although we were able to use performance tuning methods to obtain a respectable classification accuracy of about 82 percent, upon a more careful examination in Chapter 10, Evaluating Model Performance, we realized that the high accuracy was a bit misleading. In spite of the reasonable accuracy, the kappa statistic was only about 0.28, which suggested that the model was actually performing somewhat poorly. In this section, we'll revisit the credit scoring model to see whether we can improve the results.
You will recall that we first used a stock C5.0 decision tree to build the classifier for the credit data. We then attempted to improve its performance by adjusting the trials parameter to increase the number of boosting iterations. By increasing the number of iterations from the default of 1 up to the value of 10, we were able to increase the model's accuracy. This process of adjusting the model options to identify the best fit is called parameter tuning.
Parameter tuning is not limited to decision trees. For instance, we tuned k-NN models when we searched for the best value of k. We also tuned neural networks and support vector machines as we adjusted the number of nodes or hidden layers, or chose different kernel functions. Most machine learning algorithms allow the adjustment of at least one parameter, and the most sophisticated models offer a large number of ways to tweak the model fit. Although this allows the model to be tailored closely to the learning task, the complexity of all the possible options can be daunting. A more systematic approach is warranted.
Rather than choosing arbitrary values for each of the model's parameters—a task that is not only tedious, but also somewhat unscientific—it is better to conduct a search through many possible parameter values to find the best combination.
The caret package, which we used extensively in Chapter 10, Evaluating Model Performance, provides tools to assist with automated parameter tuning. The core functionality is provided by a train() function that serves as a standardized interface for over 175 different machine learning models for both classification and regression tasks. By using this function, it is possible to automate the search for optimal models using a choice of evaluation methods and metrics.
Do not feel overwhelmed by the large number of models; we've already covered many of them in the earlier chapters. Others are simple variants or extensions of the base concepts. Given what you've learned so far, you should be confident in your ability to understand all of the available methods.
Automated parameter tuning requires you to consider three questions:

- What type of machine learning model (and specific implementation) should be trained on the data?
- Which model parameters can be adjusted, and how extensively should they be tuned to find the optimal settings?
- What criteria should be used to evaluate the candidate models and identify the best one?
Answering the first question involves finding a well-suited match between the machine learning task and one of the more than 175 models. Obviously, this requires an understanding of the breadth and depth of machine learning models. It can also help to work through a process of elimination. Nearly half of the models can be eliminated depending on whether the task is classification or numeric prediction; others can be excluded based on the format of the data or the need to avoid black box models, and so on. In any case, there's also no reason you can't try several approaches and compare the best results of each.
Addressing the second question is a matter largely dictated by the choice of model, since each algorithm utilizes a unique set of parameters. The available tuning parameters for the predictive models covered in this book are listed in the following table. Keep in mind that although some models have additional options not shown, only those listed in the table are supported by caret for automatic tuning.
Model | Learning Task | Method name | Parameters
---|---|---|---
k-Nearest Neighbors | Classification | knn | k
Naive Bayes | Classification | nb | fL, usekernel
Decision Trees | Classification | C5.0 | model, trials, winnow
OneR Rule Learner | Classification | OneR | None
RIPPER Rule Learner | Classification | JRip | NumOpt
Linear Regression | Regression | lm | None
Regression Trees | Regression | rpart | cp
Model Trees | Regression | M5 | pruned, smoothed, rules
Neural Networks | Dual use | nnet | size, decay
Support Vector Machines (Linear Kernel) | Dual use | svmLinear | C
Support Vector Machines (Radial Basis Kernel) | Dual use | svmRadial | C, sigma
Random Forests | Dual use | rf | mtry
For a complete list of the models and corresponding tuning parameters covered by caret, refer to the table provided by package author Max Kuhn at http://topepo.github.io/caret/modelList.html.
If you ever forget the tuning parameters for a particular model, the modelLookup() function can be used to find them. Simply supply the method name, as illustrated here for the C5.0 model:
> modelLookup("C5.0")
  model parameter                 label forReg forClass probModel
1  C5.0    trials # Boosting Iterations  FALSE     TRUE      TRUE
2  C5.0     model            Model Type  FALSE     TRUE      TRUE
3  C5.0    winnow                Winnow  FALSE     TRUE      TRUE
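As a quick aside (not part of the original walkthrough), the same lookup works for any supported method name. For example, to see the single tuning parameter of the rpart regression trees listed in the table above (output not shown here):

> modelLookup("rpart")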
The goal of automatic tuning is to search a set of candidate models comprising a matrix, or grid, of parameter combinations. Because it is impractical to search every conceivable combination, only a subset of possibilities is used to construct the grid. By default, caret searches at most three values for each of the p parameters. This means that at most 3^p candidate models will be tested. For example, by default, the automatic tuning of k-Nearest Neighbors will compare 3^1 = 3 candidate models with k = 5, k = 7, and k = 9. Similarly, tuning a decision tree will result in a comparison of up to 27 different candidate models, comprising the grid of 3^3 = 27 combinations of model, trials, and winnow settings. In practice, however, only 12 models are actually tested. This is because the model and winnow parameters can only take two values (tree versus rules and TRUE versus FALSE, respectively), which makes the grid size 3 * 2 * 2 = 12.
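To make the 3 * 2 * 2 arithmetic concrete, here is a minimal sketch that rebuilds a grid of the same shape using base R's expand.grid() function. The trials values are assumed for illustration; the point is the structure of the search space rather than the exact defaults caret chooses.

# A sketch of the default C5.0 search space (the trial values are assumed)
c50_grid <- expand.grid(model  = c("tree", "rules"),
                        trials = c(1, 10, 20),
                        winnow = c(TRUE, FALSE))
nrow(c50_grid)  # 12 candidate parameter combinations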
The third and final step in automatic model tuning involves identifying the best model among the candidates. This uses the methods discussed in Chapter 10, Evaluating Model Performance, such as the choice of resampling strategy for creating training and test datasets and the use of model performance statistics to measure the predictive accuracy.
All of the resampling strategies and many of the performance statistics we've learned are supported by caret. These include statistics such as accuracy and kappa (for classifiers) and R-squared or RMSE (for numeric models). Class-specific measures such as sensitivity and specificity, as well as the area under the ROC curve (AUC), can also be used if desired.
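As a hedged sketch of how such measures can be requested (it uses the trainControl() function, which is introduced properly later in this section), caret's twoClassSummary() function can replace the default summary so that the area under the ROC curve, sensitivity, and specificity are computed for each candidate model:

library(caret)

# Sketch only: select candidate models by AUC instead of accuracy
ctrl_roc <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

# The control object would then be passed to train(), for example:
# m_roc <- train(default ~ ., data = credit, method = "C5.0",
#                metric = "ROC", trControl = ctrl_roc)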
By default, caret will select the candidate model with the largest value of the desired performance measure. As this practice sometimes results in the selection of models that achieve marginal performance improvements via large increases in model complexity, alternative model selection functions are provided.
Given the wide variety of options, it is helpful that many of the defaults are reasonable. For instance, caret will use prediction accuracy on a bootstrap sample to choose the best performer for classification models. Beginning with these default values, we can then tweak the train() function to design a wide variety of experiments.
To illustrate the process of tuning a model, let's begin by observing what happens when we attempt to tune the credit scoring model using the caret package's default settings. From there, we will adjust the options to our liking.
The simplest way to tune a learner requires you to only specify a model type via the method parameter. Since we used C5.0 decision trees previously with the credit model, we'll continue our work by optimizing this learner. The basic train() command for tuning a C5.0 decision tree using the default settings is as follows:
> library(caret)
> set.seed(300)
> m <- train(default ~ ., data = credit, method = "C5.0")
First, the set.seed() function is used to initialize R's random number generator to a set starting position. You may recall that we used this function in several prior chapters. By setting the seed value (in this case to the arbitrary number 300), the random numbers will follow a predefined sequence. This allows simulations that use random sampling to be repeated with identical results, which is very helpful if you are sharing code or attempting to replicate a prior result.
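As a quick illustration (not part of the credit example itself), resetting the same seed before a random draw reproduces the draw exactly:

set.seed(300)
first_draw <- sample(1000, 5)    # five random numbers between 1 and 1000
set.seed(300)
second_draw <- sample(1000, 5)   # the seed was reset, so the same five numbers
identical(first_draw, second_draw)  # returns TRUE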
Next, we define a tree as default ~ . using the R formula interface. This models loan default status (yes or no) using all of the other features in the credit data frame. The parameter method = "C5.0" tells caret to use the C5.0 decision tree algorithm.
After you've entered the preceding command, there may be a significant delay (dependent upon your computer's capabilities) as the tuning process occurs. Even though this is a fairly small dataset, a substantial amount of calculation must occur. R must repeatedly generate random samples of data, build decision trees, compute performance statistics, and evaluate the result.
The result of the experiment is saved in an object named m. If you would like to examine the object's contents, the str(m) command will list all the associated data, but this can be quite overwhelming. Instead, simply type the name of the object for a condensed summary of the results. For instance, typing m yields the following output (note that labels have been added for clarity):
The labels highlight four main components in the output:

- A brief description of the input dataset: if you are familiar with your data and have applied the train() function correctly, this information should not be surprising.
- A report of the preprocessing and resampling methods that were applied.
- A list of the candidate models evaluated: here you can confirm the tested combinations of model, trials, and winnow. The average and standard deviation of the accuracy and kappa statistics for each candidate model are also shown.
- The choice of the best model: in this case, it was the candidate that used trials = 20 and winnow = FALSE.
After identifying the best model, the train() function uses its tuning parameters to build a model on the full input dataset, which is stored in the m list object as m$finalModel. In most cases, you will not need to work directly with the finalModel sub-object. Instead, simply use the predict() function with the m object as follows:
> p <- predict(m, credit)
The resulting vector of predictions works as expected, allowing us to create a confusion matrix that compares the predicted and actual values:
> table(p, credit$default)

p      no yes
  no  700   2
  yes   0 298
Of the 1,000 examples used for training the final model, only two were misclassified. However, it is very important to note that since the model was built on both the training and test data, this accuracy is optimistic and thus should not be viewed as indicative of performance on unseen data. The bootstrap estimate of 73 percent (shown in the summary output) is a more realistic estimate of future performance.
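If you want the honest, resampled numbers rather than the optimistic table above, the object returned by train() stores them. A brief sketch of where to look:

m$results   # resampled accuracy and kappa for each candidate parameter combination
m$bestTune  # the parameter values of the model that was ultimately selected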
Using the train() and predict() functions also offers a couple of benefits in addition to the automatic parameter tuning.
First, any data preparation steps applied by the train() function will be similarly applied to the data used for generating predictions. This includes transformations such as centering and scaling as well as imputation of missing values. Allowing caret to handle the data preparation will ensure that the steps that contributed to the best model's performance will remain in place when the model is deployed.
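As a hedged example (the original model was trained without it, and centering or scaling makes little difference for a decision tree, so this is purely to show the mechanism), the preProcess argument of train() is one way to request such steps; the same transformation is then reapplied automatically whenever predict() is called on new data:

# Sketch only: center and scale the predictors while tuning
m_scaled <- train(default ~ ., data = credit, method = "C5.0",
                  preProcess = c("center", "scale"))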
Second, the predict() function provides a standardized interface for obtaining predicted class values and class probabilities, even for model types that ordinarily would require additional steps to obtain this information. The predicted classes are provided by default:
> head(predict(m, credit))
[1] no  yes no  no  yes no
Levels: no yes
To obtain the estimated probabilities for each class, use the type = "prob" parameter:
> head(predict(m, credit, type = "prob"))
         no        yes
1 0.9606970 0.03930299
2 0.1388444 0.86115561
3 1.0000000 0.00000000
4 0.7720279 0.22797208
5 0.2948062 0.70519385
6 0.8583715 0.14162851
Even in cases where the underlying model refers to the prediction probabilities using a different string (for example, "raw" for a naiveBayes model), the predict() function will translate type = "prob" to the appropriate string behind the scenes.
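One use for these probabilities, sketched here as a hypothetical extension of the credit example, is to apply a stricter cutoff than the default 50 percent when the cost of missing a default is high:

# Hypothetical: flag a loan as a likely default whenever the predicted
# probability of "yes" exceeds 0.30 rather than 0.50
prob_yes <- predict(m, credit, type = "prob")[, "yes"]
p_strict <- factor(ifelse(prob_yes > 0.30, "yes", "no"), levels = c("no", "yes"))
table(p_strict, credit$default)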
The decision tree we created previously demonstrates the caret package's ability to produce an optimized model with minimal intervention. The default settings allow optimized models to be created easily. However, it is also possible to change the default settings to something more specific to a learning task, which may assist with unlocking the upper echelon of performance.
Each step in the model selection process can be customized. To illustrate this flexibility, let's modify our work on the credit decision tree to mirror the process we had used in Chapter 10, Evaluating Model Performance. If you remember, we had estimated the kappa statistic using 10-fold cross-validation. We'll do the same here, using kappa to optimize the boosting parameter of the decision tree. Note that decision tree boosting was previously covered in Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, and will also be covered in greater detail later in this chapter.
The trainControl() function is used to create a set of configuration options known as a control object, which guides the train() function. These options allow for the management of model evaluation criteria such as the resampling strategy and the measure used for choosing the best model. Although this function can be used to modify nearly every aspect of a tuning experiment, we'll focus on two important parameters: method and selectionFunction.
For the trainControl() function, the method parameter is used to set the resampling method, such as holdout sampling or k-fold cross-validation. The following table lists the possible method types as well as any additional parameters for adjusting the sample size and number of iterations. Although the default options for these resampling methods follow popular convention, you may choose to adjust these depending upon the size of your dataset and the complexity of your model.
Resampling method | Method name | Additional options and default values
---|---|---
Holdout sampling | LGOCV | p = 0.75 (training data proportion)
k-fold cross-validation | cv | number = 10 (number of folds)
Repeated k-fold cross-validation | repeatedcv | number = 10 (number of folds), repeats (number of repetitions)
Bootstrap sampling | boot | number = 25 (resampling iterations)
0.632 bootstrap | boot632 | number = 25 (resampling iterations)
Leave-one-out cross-validation | LOOCV | None
The selectionFunction parameter is used to specify the function that will choose the optimal model among the various candidates. Three such functions are included. The best function simply chooses the candidate with the best value on the specified performance measure; this is used by default. The other two functions are used to choose the most parsimonious, or simplest, model that is within a certain threshold of the best model's performance. The oneSE function chooses the simplest candidate within one standard error of the best performance, and tolerance uses the simplest candidate within a user-specified percentage.
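As a hedged sketch, these selection rules can also be applied by hand to a fitted object's results table. Here, caret's tolerance() function is asked for the simplest candidate within 2 percent of the best kappa value of the m object tuned earlier (the manual call is for illustration only; train() normally applies the rule for you):

# Sketch only: find the row of the simplest acceptable candidate
best_row <- tolerance(m$results, metric = "Kappa",
                      tol = 2, maximize = TRUE)
m$results[best_row, ]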
To create a control object named ctrl that uses 10-fold cross-validation and the oneSE selection function, use the following command (note that number = 10 is included only for clarity; since this is the default value for method = "cv", it could have been omitted):
> ctrl <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
We'll use the result of this function shortly.
In the meantime, the next step in defining our experiment is to create the grid of parameters to optimize. The grid must include a column named for each parameter in the desired model, prefixed by a period. It must also include a row for each desired combination of parameter values. Since we are using a C5.0 decision tree, this means we'll need columns named .model, .trials, and .winnow. For other machine learning models, refer to the table presented earlier in this chapter or use the modelLookup() function to look up the parameters as described previously.
Rather than filling this data frame cell by cell (a tedious task if there are many possible combinations of parameter values), we can use the expand.grid() function, which creates data frames from the combinations of all the values supplied. For example, suppose we would like to hold constant model = "tree" and winnow = "FALSE" while searching eight different values of trials. This can be created as:
> grid <- expand.grid(.model = "tree",
                      .trials = c(1, 5, 10, 15, 20, 25, 30, 35),
                      .winnow = "FALSE")
The resulting grid data frame contains 1 * 8 * 1 = 8 rows:
> grid
  .model .trials .winnow
1   tree       1   FALSE
2   tree       5   FALSE
3   tree      10   FALSE
4   tree      15   FALSE
5   tree      20   FALSE
6   tree      25   FALSE
7   tree      30   FALSE
8   tree      35   FALSE
The train() function will build a candidate model for evaluation using each row's combination of model parameters.
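The same pattern applies to any of the models listed earlier in the chapter. For instance, a hypothetical grid for a k-NN learner, whose only tuned parameter is k, needs just one column:

knn_grid <- expand.grid(.k = c(1, 3, 5, 7, 9, 11))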
Given this search grid and the control list created previously, we are ready to run a thoroughly customized train() experiment. As we did earlier, we'll set the random seed to the arbitrary number 300 in order to ensure repeatable results. But this time, we'll pass our control object and tuning grid while adding a parameter metric = "Kappa", indicating the statistic to be used by the model selection function (in this case, oneSE). The full command is as follows:
> set.seed(300)
> m <- train(default ~ ., data = credit, method = "C5.0",
             metric = "Kappa",
             trControl = ctrl,
             tuneGrid = grid)
This results in an object that we can view by typing its name:
> m
Although much of the output is similar to the automatically tuned model, there are a few differences of note. As 10-fold cross-validation was used, the sample size to build each candidate model was reduced to 900 rather than the 1,000 used in the bootstrap. As we requested, eight candidate models were tested. Additionally, because model and winnow were held constant, their values are no longer shown in the results; instead, they are listed as a footnote.
The best model here differs quite significantly from the prior trial. Before, the best model used trials = 20, whereas here, it used trials = 1. This seemingly odd finding is due to the fact that we used the oneSE rule rather than the best rule to select the optimal model. Even though the 35-trial model offers the best raw performance according to kappa, the 1-trial model offers nearly the same performance with a much simpler form. Not only are simple models more computationally efficient, but they also reduce the chance of overfitting the training data.
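If you would rather see this trade-off than read it from the results table, caret's plot method for train objects charts the resampled performance of each candidate. A one-line sketch:

plot(m, metric = "Kappa")   # kappa versus the number of boosting iterations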