The final classifier that we will be discussing in this chapter is the aptly named Random Forest, which is an example of a meta-technique called ensemble learning. The idea and logic behind random forests is as follows:
Given that (unpruned) decision trees can be nearly bias-less, high-variance classifiers, a method of reducing variance at the cost of a marginal increase in bias could greatly improve upon the predictive accuracy of the technique. One salient approach to reducing the variance of decision trees is to train a bunch of unpruned decision trees on different random subsets of the training data, sampling with replacement; this is called bootstrap aggregating, or bagging. At the classification phase, the test observation is run through all of these trees (a forest, perhaps?), and each resulting classification casts a vote for the final classification of the whole forest. The class with the highest number of votes is the winner. It turns out that the consensus among many high-variance trees trained on bootstrapped subsets of the training data results in a significant accuracy improvement and vastly decreased variance.
Très bien ensemble!
Bagging is one example of an ensemble method, a meta-technique that uses multiple classifiers to improve predictive accuracy. Nearly bias-less, high-variance classifiers are the ones that seem to benefit the most from ensemble methods. Additionally, ensemble methods are easiest to use with classifiers that can be created and trained rapidly, since the method ipso facto relies on a large number of them. Decision trees fit all of these characteristics, which accounts for why bagged trees and random forests are the most common ensemble learning instruments.
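Before moving on, here is a minimal sketch of the bagging procedure just described, written with rpart trees rather than the randomForest machinery we will use shortly; the helper names bagged_trees and predict_bagged, and the default of 100 trees, are my own.

library(rpart)

# Grow B unpruned trees, each on a bootstrap sample (rows drawn with
# replacement) of the training data
bagged_trees <- function(formula, data, B = 100) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(formula, data = boot, method = "class",
          control = rpart.control(cp = 0, minsplit = 2))  # effectively unpruned
  })
}

# Each tree casts one vote per observation; the class with the most votes wins
predict_bagged <- function(trees, newdata) {
  votes <- sapply(trees, function(tree) {
    as.character(predict(tree, newdata, type = "class"))
  })
  factor(apply(votes, 1, function(v) names(which.max(table(v)))))
}

# Usage, assuming the training and test data frames used later in the chapter:
# ensemble <- bagged_trees(diabetes ~ ., data = training)
# accuracy(predict_bagged(ensemble, test), test[, 9])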
So far, what we have chronicled describes a technique called bagged trees. But random forests have one more trick up their sleeves! Observing that variance can be further reduced by making the component trees less similar to one another, random forests differ from bagged trees by forcing each tree to choose from only a random subset of the available predictors at each split during the growing phase.
Many people are initially confused as to how deliberately reducing the efficacy of the component trees can possibly result in a more accurate ensemble. To clear this up, consider that a few very influential predictors will dominate the structure of the trees, even if the bootstrapped subsets have little overlap. By constraining the number of predictors a tree can choose from at each split, a more diverse crop of trees is built. This results in a forest with lower variance than a forest grown with no such constraint.
Random forests are the modern darling of classifiers, and for good reason. For one, they are often extraordinarily accurate. Second, since random forests use only two hyper-parameters (the number of trees in the forest and the number of predictors to consider at each split), they are very easy to create and require little in the way of hyper-parameter tuning. Third, it is extremely difficult for a random forest to overfit, and in practice it rarely happens. For example, increasing the number of trees that make up the forest does not cause the forest to overfit, and fiddling with the number-of-predictors hyper-parameter cannot result in a forest with a higher variance than that of the component tree that overfits the most.
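To illustrate that last point, here is a quick sketch of my own (using the built-in iris data so that it runs on its own): the out-of-bag error rate, which we discuss next, settles down and stays flat as more trees are added, rather than creeping back up the way an overfitting model's test error would.

# Sketch: adding trees does not make a random forest overfit
# (my own illustration on the built-in iris data, not from the text)
library(randomForest)

set.seed(1)
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 2000)

# Plots the error rates against the number of trees; the out-of-bag error
# flattens out rather than rising as trees are added
plot(rf_demo)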
One last awesome property of the random forest is that the training error rate that it reports is a nearly unbiased estimator of the cross-validated error rate. This is because the training error rate (at least the one that R reports when using the predict function on a randomForest object with no newdata argument) is the average error rate of the classifier tested on all the observations that were kept out of the training sample at each stage of the bootstrap aggregation. Because these observations were not used for training, the resulting error rate closely approximates the CV error rate. The error rate computed on the observations left out of the sample at each bagging step is called the Out-Of-Bag (OOB) error rate.
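Continuing with the rf_demo forest from the sketch above (again, my own illustration rather than something from the text), the randomForest package exposes this OOB information directly:

# The printed model summary includes the OOB estimate of the error rate
# along with an OOB-based confusion matrix
print(rf_demo)

# err.rate has one row per tree; the "OOB" column is the running out-of-bag
# error rate, so the last entry is the final OOB estimate
tail(rf_demo$err.rate[, "OOB"], 1)

# confusion is the OOB confusion matrix with per-class error rates
rf_demo$confusion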
The primary drawback to random forests is that they, to some extent, revoke the chief benefit of decision trees: their interpretability; it is far harder to visualize the behavior of a random forest than it is for any of the component trees. This puts the interpretability of random forests somewhere between logistic regression (which is marginally more interpretable) and k-NN (which is largely un-interpretable).
At long last, let's use a random forest on our dataset to classify observations as being positive or negative for diabetes!
> library(randomForest)
> forest <- randomForest(diabetes ~ ., data=training,
+                        importance=TRUE,
+                        ntree=2000,
+                        mtry=5)
> accuracy(predict(forest), training[,9])
[1] 0.7654723
> predictions <- predict(forest, newdata = test)
> accuracy(predictions, test[,9])
[1] 0.7727273
In this incantation, we set the number of trees (ntree) to an arbitrarily high number and set the number of predictors considered at each split (mtry) to 5. Though it is not shown above, I used the OOB error rate to guide the choice of this hyper-parameter. Had we left it blank, it would have defaulted to the square root of the total number of predictors.
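The search itself is not shown in the text, but a sketch of the kind of OOB-guided check I have in mind might look like the following; the candidate values 1 through 8 (one per predictor in this dataset) and the reuse of ntree=2000 are my own choices.

# For each candidate mtry, fit a forest and record its final OOB error rate;
# the value with the lowest OOB error is a reasonable choice for mtry
set.seed(1)
oob_by_mtry <- sapply(1:8, function(m) {
  fit <- randomForest(diabetes ~ ., data = training, ntree = 2000, mtry = m)
  tail(fit$err.rate[, "OOB"], 1)
})
oob_by_mtry            # OOB error for mtry = 1 through 8
which.min(oob_by_mtry) # the best-performing mtry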
As you can see from the output of our accuracy function, the random forest is competitive with the performance of our highest-performing (on this dataset, at least) classifier: logistic regression. On other datasets, with other characteristics, random forests sometimes blow the competition out of the water.