As an alternative to increasing the performance of a single model, it is possible to combine several models to form a powerful team. Just as the best sports teams have players with complementary rather than overlapping skillsets, some of the best machine learning algorithms utilize teams of complementary models. Since a model brings a unique bias to a learning task, it may readily learn one subset of examples, but have trouble with another. Therefore, by intelligently using the talents of several diverse team members, it is possible to create a strong team of multiple weak learners.
This technique of combining and managing the predictions of multiple models falls into a wider set of meta-learning methods, that is, techniques that involve learning how to learn. This includes anything from simple algorithms that gradually improve performance by iterating over design decisions (for instance, the automated parameter tuning used earlier in this chapter) to highly complex algorithms that borrow concepts from evolutionary biology and genetics to self-modify and adapt to learning tasks.
For the remainder of this chapter, we'll focus on meta-learning only as it pertains to modeling a relationship between the predictions of several models and the desired outcome. The teamwork-based techniques covered here are powerful and are often used to build more effective classifiers.
Suppose you were a contestant on a television trivia show that allowed you to choose a panel of five friends to assist you with answering the final question for the million-dollar prize. Most people would try to stack the panel with a diverse set of subject matter experts. A panel containing professors of literature, science, history, and art, along with a current pop-culture expert would be a safely well-rounded group. Given their breadth of knowledge, it would be unlikely to find a question that stumps the group.
The meta-learning approach that utilizes a similar principle of creating a varied team of experts is known as an ensemble. All ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created. The various ensemble methods can be distinguished, in large part, by the answers to these two questions: how are the weak learning models chosen or constructed, and how are their predictions combined to make a single final prediction?
When answering these questions, it can be helpful to imagine the ensemble in terms of the following process diagram; nearly all ensemble approaches follow this pattern:
First, input training data is used to build a number of models. The allocation function dictates how much of the training data each model receives. Do they each receive the full training dataset or merely a sample? Do they each receive every feature or a subset?
Although the ideal ensemble includes a diverse set of models, the allocation function can increase diversity by artificially varying the input data to bias the resulting learners, even if they are the same type. For instance, it might use bootstrap sampling to construct unique training datasets or pass on a different subset of features or examples to each model. On the other hand, if the ensemble already includes a diverse set of algorithms—such as a neural network, a decision tree, and a k-NN classifier—the allocation function might pass the data on to each algorithm relatively unchanged.
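To make the idea of an allocation function concrete, here is a minimal base R sketch. The function name `allocate()` and its arguments are hypothetical (they do not come from any package), and the built-in `iris` data stands in for a real training set. Each call hands a model a bootstrap sample of the rows and a random subset of the features:

```r
# Hypothetical allocation function: gives each model a bootstrap sample of
# the rows and, optionally, a random subset of the feature columns.
allocate <- function(data, target, n_features = NULL) {
  features <- setdiff(names(data), target)
  # Bootstrap: draw nrow(data) row indices with replacement
  rows <- sample(nrow(data), replace = TRUE)
  # Keep every feature unless a subset size is requested
  if (!is.null(n_features)) {
    features <- sample(features, n_features)
  }
  data[rows, c(features, target), drop = FALSE]
}

set.seed(123)
# Three training sets for three models, built from the built-in iris data
train_sets <- lapply(1:3, function(i) allocate(iris, "Species", n_features = 2))
sapply(train_sets, names)  # each model sees a randomly chosen pair of features
sapply(train_sets, nrow)   # each sample has as many rows as the original data
```

Passing `n_features = NULL` would correspond to the case where every model receives all of the features and only the rows are resampled.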
After the models are constructed, they can be used to generate a set of predictions, which must be managed in some way. The combination function governs how disagreements among the predictions are reconciled. For example, the ensemble might use a majority vote to determine the final prediction, or it could use a more complex strategy such as weighting each model's votes based on its prior performance.
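The two combination strategies just mentioned can be sketched in a few lines of base R. The predictions and the model weights below are made up for illustration, and the helper names `majority_vote()` and `weighted_vote()` are hypothetical:

```r
# Predictions from three hypothetical models for five examples
preds <- data.frame(
  m1 = c("yes", "no",  "yes", "no", "yes"),
  m2 = c("yes", "yes", "no",  "no", "no"),
  m3 = c("no",  "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Majority vote: the most frequent class in each row wins
majority_vote <- function(votes) {
  names(which.max(table(votes)))
}
majority <- apply(preds, 1, majority_vote)

# Weighted vote: each model's vote counts in proportion to its weight
weighted_vote <- function(votes, weights) {
  totals <- tapply(weights, votes, sum)
  names(which.max(totals))
}
# Assumed weights, standing in for each model's prior performance
model_weights <- c(m1 = 0.60, m2 = 0.15, m3 = 0.25)
weighted <- apply(preds, 1, weighted_vote, weights = model_weights)

data.frame(preds, majority, weighted)
```

With these assumed weights, the first model carries more weight than the other two combined, so it can overrule them when they disagree with it. This is exactly the kind of behavior a weighting scheme is meant to allow for a model with a strong track record.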
Some ensembles even utilize another model to learn a combination function from various combinations of predictions. For example, suppose that when M1 and M2 both vote yes, the actual class value is usually no. In this case, the ensemble could learn to ignore the vote of M1 and M2 when they agree. This process of using the predictions of several models to train a final arbiter model is known as stacking.
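As a rough illustration of the M1 and M2 scenario, the following base R sketch learns a combination rule from made-up votes. The "arbiter" here is only a lookup table of the most common actual class for each combination of votes; a real stacked ensemble would typically fit a proper learning algorithm to held-out predictions. All object names are hypothetical:

```r
# Hypothetical votes of two models on eight training examples, plus the truth.
# When m1 and m2 both say "yes", the actual class is usually "no".
stack_train <- data.frame(
  m1     = c("yes", "yes", "yes", "yes", "no", "no",  "yes", "no"),
  m2     = c("yes", "yes", "yes", "no",  "no", "yes", "no",  "no"),
  actual = c("no",  "no",  "yes", "yes", "no", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Arbiter "model": for each combination of votes, remember the most common
# actual class seen in training (a lookup table learned from the data)
combos <- table(paste(stack_train$m1, stack_train$m2), stack_train$actual)
arbiter <- apply(combos, 1, function(counts) names(which.max(counts)))

# The final prediction for new votes is a lookup of the learned rule
stack_predict <- function(m1, m2) unname(arbiter[paste(m1, m2)])
stack_predict(c("yes", "yes", "no"), c("yes", "no", "no"))
```

Because the two models were usually wrong when they agreed on yes, the learned rule predicts no for that combination, overriding both votes.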
One of the benefits of using ensembles is that they may allow you to spend less time in pursuit of a single best model. Instead, you can train a number of reasonably strong candidates and combine them. Yet, convenience isn't the only reason why ensemble-based methods continue to rack up wins in machine learning competitions; ensembles also offer a number of performance advantages over single models:
None of these benefits would be very helpful if you weren't able to easily apply ensemble methods in R, and there are many packages available to do just that. Let's take a look at several of the most popular ensemble methods and how they can be used to improve the performance of the credit model we've been working on.
One of the first ensemble methods to gain widespread acceptance used a technique called bootstrap aggregating or bagging for short. As described by Leo Breiman in 1994, bagging generates a number of training datasets by bootstrap sampling the original training data. These datasets are then used to generate a set of models using a single learning algorithm. The models' predictions are combined using voting (for classification) or averaging (for numeric prediction).
Although bagging is a relatively simple ensemble, it can perform quite well as long as it is used with relatively unstable learners, that is, those generating models that tend to change substantially when the input data changes only slightly. Unstable models are essential in order to ensure the ensemble's diversity in spite of only minor variations between the bootstrap training datasets. For this reason, bagging is often used with decision trees, which have the tendency to vary dramatically given minor changes in the input data.
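The mechanics of bagging can be sketched without any packages. The example below is an illustration only, not how `ipred` works internally: the weak learner is a hand-written, hypothetical one-split decision stump (`fit_stump()` and `predict_stump()` are made-up names), and two classes of the built-in `iris` data stand in for a real dataset:

```r
# Hypothetical weak learner: a one-split "decision stump" on a numeric feature
fit_stump <- function(x, y) {
  best <- list(err = Inf)
  for (threshold in sort(unique(x))) {
    left  <- y[x <= threshold]
    right <- y[x > threshold]
    if (length(right) == 0) next
    # Predict the majority class on each side of the threshold
    l_class <- names(which.max(table(left)))
    r_class <- names(which.max(table(right)))
    err <- sum(left != l_class) + sum(right != r_class)
    if (err < best$err) {
      best <- list(cut = threshold, left = l_class, right = r_class, err = err)
    }
  }
  best
}
predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

set.seed(300)
dat <- subset(iris, Species != "setosa")
x <- dat$Petal.Width
y <- as.character(dat$Species)

# Train one stump per bootstrap sample of the training data
nbagg <- 25
stumps <- lapply(seq_len(nbagg), function(i) {
  idx <- sample(length(y), replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Each column holds one stump's predictions; a majority vote combines them
votes <- sapply(stumps, predict_stump, x = x)
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == y)  # accuracy on the training data
```

Using an odd number of models avoids tied votes in a two-class problem. Keep in mind that the accuracy printed here is measured on the training data, so it is an optimistic estimate.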
The `ipred` package offers a classic implementation of bagged decision trees. To train the model, the `bagging()` function works similarly to many of the models used previously. The `nbagg` parameter controls the number of decision trees voting in the ensemble (with a default value of `25`). Depending on the difficulty of the learning task and the amount of training data, increasing this number may improve the model's performance up to a limit. The downside is additional computational expense, because a large number of trees may take some time to train.

After installing the `ipred` package, we can create the ensemble as follows. We'll stick to the default value of 25 decision trees:

    > library(ipred)
    > set.seed(300)
    > mybag <- bagging(default ~ ., data = credit, nbagg = 25)

The resulting model works as expected with the `predict()` function:

    > credit_pred <- predict(mybag, credit)
    > table(credit_pred, credit$default)

    credit_pred  no yes
            no  699   2
            yes   1 298
Given the preceding results, the model seems to have fit the training data extremely well. To see how this translates into future performance, we can use the bagged trees with 10-fold CV using the `train()` function in the `caret` package. Note that the method name for the `ipred` bagged trees function is `treebag`:

    > library(caret)
    > set.seed(300)
    > ctrl <- trainControl(method = "cv", number = 10)
    > train(default ~ ., data = credit, method = "treebag",
            trControl = ctrl)

    Bagged CART

    1000 samples
      16 predictor
       2 classes: 'no', 'yes'

    No pre-processing
    Resampling: Cross-Validated (10 fold)

    Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...

    Resampling results

      Accuracy  Kappa      Accuracy SD  Kappa SD
      0.735     0.3297726  0.03439961   0.08590462
The kappa statistic of 0.33 for this model suggests that the bagged tree model performs at least as well as the best C5.0 decision tree we tuned earlier in this chapter. This illustrates the power of ensemble methods; a set of simple learners working together can outperform very sophisticated models.
To get beyond bags of decision trees, the `caret` package also provides a more general `bag()` function. It includes native support for a handful of models, though it can be adapted to other types with a bit of additional effort. The `bag()` function uses a control object to configure the bagging process. It requires the specification of three functions: one for fitting the model, one for making predictions, and one for aggregating the votes.

For example, suppose we wanted to create a bagged support vector machine model, using the `ksvm()` function in the `kernlab` package we used in Chapter 7, Black Box Methods – Neural Networks and Support Vector Machines. The `bag()` function requires us to provide functionality for training the SVMs, making predictions, and counting votes.
Rather than writing these ourselves, the `caret` package's built-in `svmBag` list object supplies three functions we can use for this purpose:

    > str(svmBag)
    List of 3
     $ fit      :function (x, y, ...)
     $ pred     :function (object, x)
     $ aggregate:function (x, type = "class")
By looking at the `svmBag$fit` function, we see that it simply calls the `ksvm()` function from the `kernlab` package and returns the result:

    > svmBag$fit
    function (x, y, ...)
    {
        library(kernlab)
        out <- ksvm(as.matrix(x), y, prob.model = is.factor(y), ...)
        out
    }
    <environment: namespace:caret>
The `pred` and `aggregate` functions for `svmBag` are similarly straightforward. By studying these functions and creating your own in the same format, it is possible to use bagging with any machine learning algorithm you would like.
Applying the three functions in the `svmBag` list, we can create a bagging control object:

    > bagctrl <- bagControl(fit = svmBag$fit,
                            predict = svmBag$pred,
                            aggregate = svmBag$aggregate)
By using this with the `train()` function and the training control object (`ctrl`) defined earlier, we can evaluate the bagged SVM model as follows (note that the `kernlab` package is required for this to work; you will need to install it if you have not done so previously):

    > set.seed(300)
    > svmbag <- train(default ~ ., data = credit, method = "bag",
                      trControl = ctrl, bagControl = bagctrl)
    > svmbag

    Bagged Model

    1000 samples
      16 predictors
       2 classes: 'no', 'yes'

    No pre-processing
    Resampling: Cross-Validation (10 fold)

    Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...

    Resampling results

      Accuracy  Kappa      Accuracy SD  Kappa SD
      0.728     0.2929505  0.04442222   0.1318101

    Tuning parameter 'vars' was held constant at a value of 35
Given that the kappa statistic is below 0.30, it seems that the bagged SVM model performs worse than the bagged decision tree model. It's worth pointing out that the standard deviation of the kappa statistic is fairly large compared to the bagged decision tree model. This suggests that the performance varies substantially among the folds in the cross-validation. Such variation may imply that the performance might be improved further by upping the number of models in the ensemble.
Another common ensemble-based method is called boosting because it boosts the performance of weak learners to attain the performance of stronger learners. This method is based largely on the work of Robert Schapire and Yoav Freund, who have published extensively on the topic.
Similar to bagging, boosting uses ensembles of models trained on resampled data and a vote to determine the final prediction. There are two key distinctions. First, the resampled datasets in boosting are constructed specifically to generate complementary learners. Second, rather than giving each learner an equal vote, boosting gives each learner's vote a weight based on its past performance. Models that perform better have greater influence over the ensemble's final prediction.
Boosting often yields performance that is considerably better than, and certainly no worse than, that of the best model in the ensemble. Since the models in the ensemble are built to be complementary, it is possible to increase ensemble performance to an arbitrary threshold simply by adding classifiers to the group, assuming that each classifier performs better than random chance. Given the obvious utility of this finding, boosting is thought to be one of the most significant discoveries in machine learning.
Although boosting can create a model that meets an arbitrarily low error rate, this may not always be reasonable in practice. For one, the performance gains are incrementally smaller as additional learners are added, making some thresholds practically infeasible. Additionally, the pursuit of pure accuracy may result in the model being overfitted to the training data and not generalizable to unseen data.
A boosting algorithm called AdaBoost or adaptive boosting was proposed by Freund and Schapire in 1997. The algorithm is based on the idea of generating weak learners that iteratively learn a larger portion of the difficult-to-classify examples by paying more attention (that is, giving more weight) to frequently misclassified examples.
Beginning from an unweighted dataset, the first classifier attempts to model the outcome. Examples that the classifier predicted correctly will be less likely to appear in the training dataset for the following classifier, and conversely, the difficult-to-classify examples will appear more frequently. As additional rounds of weak learners are added, they are trained on data with successively more difficult examples. The process continues until the desired overall error rate is reached or performance no longer improves. At that point, each classifier's vote is weighted according to its accuracy on the training data on which it was built.
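To see how the reweighting works, consider a single boosting round worked out in base R. This sketch follows the textbook AdaBoost.M1 update (correctly classified examples are down-weighted by the factor err / (1 - err)); the ten examples and the set of misclassified cases are made up, and specific packages may use variants of these formulas:

```r
# Ten training examples start with equal weights
w <- rep(1 / 10, 10)

# Suppose the first weak learner misclassifies examples 2, 5, and 9
misclassified <- c(2, 5, 9)
wrong <- seq_along(w) %in% misclassified

# Weighted error of this learner: the total weight of its mistakes
err <- sum(w[wrong])

# AdaBoost.M1: shrink the weights of correctly classified examples by
# beta = err / (1 - err), then rescale so that the weights sum to 1
beta <- err / (1 - err)
w_new <- ifelse(wrong, w, w * beta)
w_new <- w_new / sum(w_new)

# The learner's say in the final vote grows as its error shrinks
vote_weight <- log(1 / beta)

round(w_new, 4)
sum(w_new[wrong])  # the misclassified examples now hold half the total weight
vote_weight
```

After the update, the three misclassified examples together carry as much weight as the seven correct ones, so the next learner is pushed to concentrate on them. When the weights are used as sampling probabilities, this is what makes the difficult examples appear more frequently in the next training dataset.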
Though boosting principles can be applied to nearly any type of model, the principles are most commonly used with decision trees. We already used boosting in this way in Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, as a method to improve the performance of a C5.0 decision tree.
The AdaBoost.M1 algorithm provides another tree-based implementation of AdaBoost for classification. It can be found in the `adabag` package.

Let's create an `AdaBoost.M1` classifier for the credit data. The general syntax for this algorithm is similar to other modeling techniques:

    > library(adabag)
    > set.seed(300)
    > m_adaboost <- boosting(default ~ ., data = credit)
As usual, the `predict()` function is applied to the resulting object to make predictions:

    > p_adaboost <- predict(m_adaboost, credit)

Departing from convention, rather than returning a vector of predictions, this returns an object with information about the model. The predictions are stored in a sub-object called `class`:

    > head(p_adaboost$class)
    [1] "no"  "yes" "no"  "no"  "yes" "no"

A confusion matrix can be found in the `confusion` sub-object:

    > p_adaboost$confusion
                   Observed Class
    Predicted Class  no yes
                no  700   0
                yes   0 300
Did you notice that the AdaBoost model made no mistakes? Before you get your hopes up, remember that the preceding confusion matrix is based on the model's performance on the training data. Since boosting allows the error rate to be reduced to an arbitrarily low level, the learner simply continued until it made no more errors. This likely resulted in overfitting on the training dataset.
For a more accurate assessment of performance on unseen data, we need to use another evaluation method. The `adabag` package provides a simple function to use 10-fold CV:

    > set.seed(300)
    > adaboost_cv <- boosting.cv(default ~ ., data = credit)
Depending on your computer's capabilities, this may take some time to run, during which it will log each iteration to the screen. After it completes, we can view a more reasonable confusion matrix:

    > adaboost_cv$confusion
                   Observed Class
    Predicted Class  no yes
                no  594 151
                yes 106 149
We can find the kappa statistic using the `vcd` package as described in Chapter 10, Evaluating Model Performance:

    > library(vcd)
    > Kappa(adaboost_cv$confusion)
                   value       ASE
    Unweighted 0.3606965 0.0323002
    Weighted   0.3606965 0.0323002
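As a sanity check, the unweighted kappa can also be computed by hand from the cross-validated confusion matrix. The following base R sketch re-enters the four counts shown previously; the object names are our own:

```r
# Cross-validated confusion matrix (rows = predicted, columns = observed)
conf <- matrix(c(594, 106, 151, 149), nrow = 2,
               dimnames = list(predicted = c("no", "yes"),
                               observed  = c("no", "yes")))

n <- sum(conf)
pr_a <- sum(diag(conf)) / n                       # observed agreement
pr_e <- sum(rowSums(conf) * colSums(conf)) / n^2  # agreement expected by chance
kappa_stat <- (pr_a - pr_e) / (1 - pr_e)
kappa_stat
```

The observed agreement is 0.743 and the agreement expected by chance is 0.598, which gives (0.743 - 0.598) / (1 - 0.598), or about 0.3607, in line with the `Kappa()` output.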
With a kappa of about 0.36, this is our best-performing credit scoring model yet. Let's see how it compares to one last ensemble method.
Another ensemble-based method called random forests (or decision tree forests) focuses only on ensembles of decision trees. This method was championed by Leo Breiman and Adele Cutler, and combines the base principles of bagging with random feature selection to add additional diversity to the decision tree models. After the ensemble of trees (the forest) is generated, the model uses a vote to combine the trees' predictions.
Random forests combine versatility and power into a single machine learning approach. As the ensemble uses only a small, random portion of the full feature set, random forests can handle extremely large datasets, where the so-called "curse of dimensionality" might cause other models to fail. At the same time, their error rates for most learning tasks are on par with nearly any other method.
Although the term "Random Forests" is trademarked by Breiman and Cutler, the term is sometimes used colloquially to refer to any type of decision tree ensemble. A pedant would use the more general term "decision tree forests" except when referring to the specific implementation by Breiman and Cutler.
It's worth noting that random forests are quite competitive with other ensemble-based methods and offer key advantages over them. For instance, random forests tend to be easier to use and less prone to overfitting. The following table lists the general strengths and weaknesses of random forest models:
Strengths | Weaknesses |
---|---|
Due to their power, versatility, and ease of use, random forests are quickly becoming one of the most popular machine learning methods. Later on in this chapter, we'll compare a random forest model head-to-head against the boosted C5.0 tree.
Though there are several packages to create random forests in R, the `randomForest` package is perhaps the implementation that is most faithful to the specification by Breiman and Cutler, and is also supported by `caret` for automated tuning. The syntax for training this model is as follows:
By default, the `randomForest()` function creates an ensemble of 500 trees that consider `sqrt(p)` random features at each split, where `p` is the number of features in the training dataset and `sqrt()` refers to R's square root function. Whether or not these default parameters are appropriate depends on the nature of the learning task and training data. Generally, more complex learning problems and larger datasets (either more features or more examples) work better with a larger number of trees, though this needs to be balanced with the computational expense of training more trees.
The goal of using a large number of trees is to train enough so that each feature has a chance to appear in several models. This is the basis of the `sqrt(p)` default value for the `mtry` parameter; using this value limits the features sufficiently so that substantial random variation occurs from tree to tree. For example, since the credit data has 16 features, each tree would be limited to splitting on four features at any time.
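The arithmetic behind this default is easy to verify in base R. In the sketch below, the feature names are hypothetical placeholders for the 16 credit features, and `sample()` merely imitates the random draw of candidate features that happens at each split:

```r
p <- 16                 # number of features in the credit data
mtry <- floor(sqrt(p))  # default number of candidate features for classification

# Hypothetical feature names; a fresh random subset is drawn at each split
features <- paste0("feature_", seq_len(p))
set.seed(300)
candidates_split_1 <- sample(features, mtry)
candidates_split_2 <- sample(features, mtry)

mtry
candidates_split_1
candidates_split_2
```

Because a new subset is drawn for every split, a feature that is unavailable at one split may still be chosen at another, which is what lets every feature appear somewhere in a sufficiently large forest.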
Let's see how the default `randomForest()` parameters work with the credit data. We'll train the model just as we did with other learners. Again, the `set.seed()` function ensures that the result can be replicated:

    > library(randomForest)
    > set.seed(300)
    > rf <- randomForest(default ~ ., data = credit)
To look at a summary of the model's performance, we can simply type the resulting object's name:

    > rf

    Call:
     randomForest(formula = default ~ ., data = credit)
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 4

            OOB estimate of  error rate: 23.8%
    Confusion matrix:
         no yes class.error
    no  640  60  0.08571429
    yes 178 122  0.59333333
The output notes that the random forest included 500 trees and tried four variables at each split, just as we expected. At first glance, you might be alarmed at the seemingly poor performance according to the confusion matrix; the error rate of 23.8 percent is far worse than the resubstitution error of any of the other ensemble methods so far. However, this confusion matrix does not show resubstitution error. Instead, it reflects the out-of-bag error rate (listed in the output as `OOB estimate of error rate`), which, unlike resubstitution error, is an unbiased estimate of the test set error. This means that it should be a fairly reasonable estimate of future performance.
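The error figures in the output follow directly from the counts in the confusion matrix, where each row is an actual class and each column a predicted class. A short base R check, with the four counts re-entered by hand and object names of our own choosing:

```r
# Out-of-bag confusion matrix (rows = actual, columns = predicted)
oob_conf <- matrix(c(640, 178, 60, 122), nrow = 2,
                   dimnames = list(actual    = c("no", "yes"),
                                   predicted = c("no", "yes")))

# Overall out-of-bag error rate: the share of off-diagonal cases
oob_error <- 1 - sum(diag(oob_conf)) / sum(oob_conf)

# Per-class error: the share of each actual class that was misclassified
class_error <- 1 - diag(oob_conf) / rowSums(oob_conf)

oob_error
class_error
```

The overall rate is (60 + 178) / 1000, or 23.8 percent, but the per-class figures show that the errors are far from evenly spread: about 59 percent of the actual defaults were misclassified, compared to under 9 percent of the non-defaults.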
The out-of-bag estimate is computed during the construction of the random forest. Essentially, any example not selected for a single tree's bootstrap sample can be used to test the model's performance on unseen data. At the end of the forest construction, the predictions for each example each time it was held out are tallied, and a vote is taken to determine the final prediction for the example. The total error rate of such predictions becomes the out-of-bag error rate.
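A small simulation shows why there is always plenty of held-out data to work with. This base R sketch does not train any trees; it only draws the bootstrap samples for a forest of the same size as ours and records which examples each sample leaves out:

```r
set.seed(300)
n <- 1000     # examples, as in the credit data
ntree <- 500  # trees in the forest

# For each tree, mark which examples were left out of its bootstrap sample
oob <- sapply(seq_len(ntree), function(i) {
  in_bag <- sample(n, replace = TRUE)
  !(seq_len(n) %in% in_bag)
})

mean(oob)            # fraction held out per tree, close to exp(-1)
range(rowSums(oob))  # number of trees for which each example was out-of-bag
```

Roughly 37 percent of the examples are left out of any given bootstrap sample, so in a forest of 500 trees every example is held out by many trees. Those are the trees whose votes are tallied for that example's out-of-bag prediction.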
As mentioned previously, the `randomForest()` function is supported by `caret`, which allows us to optimize the model while, at the same time, calculating performance measures beyond the out-of-bag error rate. To make things interesting, let's compare an auto-tuned random forest to the best auto-tuned boosted C5.0 model we've developed. We'll treat this experiment as if we were hoping to identify a candidate model for submission to a machine learning competition.
We must first load `caret` and set our training control options. For the most accurate comparison of model performance, we'll use repeated 10-fold cross-validation, or 10-fold CV repeated 10 times. This means that the models will take a much longer time to build and will be more computationally intensive to evaluate, but since this is our final comparison we should be very sure that we're making the right choice; the winner of this showdown will be our only entry into the machine learning competition.

    > library(caret)
    > ctrl <- trainControl(method = "repeatedcv",
                           number = 10, repeats = 10)
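To get a feel for how much work repeated cross-validation implies, the fold assignments can be imitated in base R. This is a simplified sketch with a hypothetical helper, `make_folds()`; the way `caret` assigns folds internally may differ, for example by balancing the classes across folds:

```r
set.seed(300)
n <- 1000      # examples in the credit data
k <- 10        # folds per repeat
repeats <- 10  # number of times the 10-fold CV is repeated

# One repeat: shuffle the fold labels 1..k across the n examples
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# A list of 10 repeats, each assigning every example to one of 10 folds
fold_ids <- replicate(repeats, make_folds(n, k), simplify = FALSE)

table(fold_ids[[1]])  # each fold of the first repeat holds out 100 examples
k * repeats           # models trained per candidate tuning value
```

Each candidate tuning value is therefore trained and evaluated 100 times, which is why this final comparison takes so much longer than a single round of 10-fold CV.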
Next, we'll set up the tuning grid for the random forest. The only tuning parameter for this model is `mtry`, which defines how many features are randomly selected at each split. By default, we know that the random forest will use `sqrt(16)`, or four features at each split. To be thorough, we'll also test values half of that, twice that, as well as the full set of 16 features. Thus, we need to create a grid with values of `2`, `4`, `8`, and `16` as follows:

    > grid_rf <- expand.grid(.mtry = c(2, 4, 8, 16))
We can supply the resulting grid to the `train()` function with the `ctrl` object as follows. We'll use the kappa metric to select the best model:

    > set.seed(300)
    > m_rf <- train(default ~ ., data = credit, method = "rf",
                    metric = "Kappa", trControl = ctrl,
                    tuneGrid = grid_rf)
The preceding command may take some time to complete as it has quite a bit of work to do! When it finishes, we'll compare that to a boosted tree using `10`, `20`, `30`, and `40` iterations:

    > grid_c50 <- expand.grid(.model = "tree",
                              .trials = c(10, 20, 30, 40),
                              .winnow = FALSE)
    > set.seed(300)
    > m_c50 <- train(default ~ ., data = credit, method = "C5.0",
                     metric = "Kappa", trControl = ctrl,
                     tuneGrid = grid_c50)
When the C5.0 decision tree finally completes, we can compare the two approaches side-by-side. For the random forest model, the results are:

    > m_rf

    Resampling results across tuning parameters:

      mtry  Accuracy  Kappa      Accuracy SD  Kappa SD
       2    0.7247    0.1284142  0.01690466   0.06364740
       4    0.7499    0.2933332  0.02989865   0.08768815
       8    0.7539    0.3379986  0.03107160   0.08353988
      16    0.7556    0.3613151  0.03379439   0.08891300
For the boosted C5.0 model, the results are:

    > m_c50

    Resampling results across tuning parameters:

      trials  Accuracy  Kappa      Accuracy SD  Kappa SD
      10      0.7325    0.3215655  0.04021093   0.09519817
      20      0.7343    0.3268052  0.04033333   0.09711408
      30      0.7381    0.3343137  0.03672709   0.08942323
      40      0.7388    0.3335082  0.03934514   0.09746073
With a kappa of about 0.361, the random forest model with `mtry = 16` was the winner among these eight models. Its kappa was higher than that of the best C5.0 decision tree, which had a kappa of about 0.334, and very slightly higher than that of the `AdaBoost.M1` model (about 0.3607, against about 0.3613 for the random forest). Based on these results, we would submit the random forest as our final model. Without actually evaluating the model on the competition data, we have no way of knowing for sure whether it will end up winning, but given our performance estimates, it's the safer bet. With a bit of luck, perhaps we'll come away with the prize.