Random forests

The final ensemble model that we will discuss in this chapter is unique to tree-based models and is known as the random forest. In a nutshell, the idea behind random forests stems from an observation on bagging trees. Let's suppose that the actual relationship between the features and the target variable can be adequately described with a tree structure. It is quite likely that during bagging with moderately sized bootstrapped samples, we will keep picking the same features to split on high up in the tree.

For example, in our Skillcraft data set, we expect to see APM as the feature that will be chosen at the top of most of the bagged trees. This is a form of tree correlation that essentially impedes our ability to derive the variance reduction benefits from bagging. Put differently, the different tree models that we build are not truly independent of each other because they will have many features and split points in common. Consequently, the averaging process at the end will be less successful in reducing the ensemble variance.

To counteract this effect, the random forest algorithm introduces an element of randomization in the tree construction process. Just as with bagging, random forests involve building a number of trees with bootstrapped samples and using the average of their predictions to form the ensemble prediction. When we construct individual trees, however, the random forest algorithm imposes a constraint.

At each node in the tree, we draw a random sample of size mtry from the total number of input features. Whereas in regular tree construction, we consider all the features at each node to determine which one to split on, with random forests, we only consider features from the sample we created for that node. We can often use a relatively small number for mtry.

The number of trees that we build, combined with the fact that each tree has many nodes, is usually enough to ensure that the more important features are sampled a sufficient number of times. Various heuristics have been proposed for choosing an appropriate value of mtry, such as one third or the square root of the total number of features available.
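
To make these heuristics concrete, here is a minimal sketch (illustrative only, not the internals of any particular package) that assumes a hypothetical count of 18 input features and draws the set of candidate features that would be eligible for splitting at a single node:

> p <- 18                                    # a hypothetical number of input features
> (mtry_regression <- max(floor(p / 3), 1))  # the one-third heuristic
[1] 6
> (mtry_classification <- floor(sqrt(p)))    # the square-root heuristic
[1] 4
> feature_names <- paste0("feature_", 1:p)   # made-up names, for illustration
> node_candidates <- sample(feature_names, mtry_regression)

The randomForest package that we will use shortly applies these same defaults when mtry is not specified explicitly: one third of the features for regression and the square root of the number of features for classification.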

This sampling step effectively forces the structures of the bagged trees to differ from one another and offers a number of benefits. Feature sampling allows us to make use of input features that are successful in splitting the data for only a small range of the target variable. Such locally relevant features are rarely chosen without the sampling constraint, because at any given node we usually prefer features that form good overall splits of the data. Nonetheless, we may want to include these features in our model if we don't want to overlook local variations in the output.

Similarly, sampling input features is useful when we have correlated input features. Regular tree construction tends to favor only one feature from a correlated set and ignore the rest, even though the splits produced by highly correlated features are not exactly the same. When we sample input features, correlated features are less likely to compete with each other at the same node, so a wider range of features ends up contributing to the model.

In general, the randomized nature of the sampling process is designed to combat overfitting, because we can think of it as regularizing the influence of each individual input feature. Overfitting can still be a problem if the input features that are unrelated to the target variable greatly outnumber those that are related, but this is a fairly rare scenario.

Random forests also scale quite favorably with the number of input features, precisely because the sampling process spares us from considering all the features when splitting at each node. In particular, this model is a good choice when the number of features exceeds the number of observations. Finally, the sampling process mitigates the cost of constructing a large number of trees, again because only a subset of the input features is considered when deciding how to split at each node. The number of trees is another tuning parameter that we must decide on in a random forest model; it is very common to build anywhere between several hundred and a few thousand trees.

In R, we can use the randomForest package in order to train random forest models. The randomForest() function takes in a formula and a training data frame, as well as a number of other optional parameters. Of particular interest are the ntree parameter, which controls the number of trees built for the ensemble, and the mtry parameter, which sets the number of features sampled as split candidates at each node. These parameters should be tuned by trying out different configurations, and we can use the tune() function from the e1071 package to do just that:

> library("randomForest")
> library("e1071")
> rf_ranges <- list(ntree = c(500, 1000, 1500, 2000), mtry = 3:8)
> rf_tune <- tune(randomForest, LeagueIndex ~ ., data = 
                  skillcraft_train, ranges = rf_ranges)
> rf_tune$best.parameters
   ntree mtry
14  1000    6
> rf_best <- rf_tune$best.model
> rf_best_predictions <- predict(rf_best, skillcraft_test)
> (rf_best_SSE <- compute_SSE(rf_best_predictions, 
                              skillcraft_test$LeagueIndex))
[1] 555.7611

The results show that on this data set, the best parameter combination is to train 1,000 trees and use a value of 6 for mtry. This last value corresponds to one third of the number of input features, which is the typical heuristic for regression problems. The SSE value on our test set is almost identical to what we obtained using gradient boosting.
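
As a quick check on whether 1,000 trees was enough, the randomForest package records the out-of-bag error as trees are added to the ensemble, so we can inspect whether the error curve has flattened out well before the last tree. A minimal sketch, assuming the fitted model is still available as rf_best (we omit the output, which depends on the fitted ensemble):

> plot(rf_best, main = "OOB error versus number of trees")
> tail(rf_best$mse, 1)    # out-of-bag MSE once all the trees have been added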

The importance of variables in random forests

We already discussed the fact that ensemble models do not, in general, have explanatory power. For random forests, it turns out that we can still compute variable importance scores for the different input features by accumulating, across all the trees in the ensemble, the reduction in our error function achieved whenever a particular feature is used to split a node. In this way, we can obtain a plot analogous to the one we obtained for a single tree when we looked at this data set in Chapter 6, Tree-based Methods.

To compute variable importance, we use the importance() function and plot the results.
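
Assuming the model from the previous section is still available as rf_best, a minimal sketch of this step is shown below; varImpPlot() from the randomForest package is one convenient way to draw the plot, and the exact scores depend on the fitted ensemble, so we omit the output:

> # total reduction in node impurity attributed to each feature
> rf_importances <- importance(rf_best)
> rf_importances[order(rf_importances[, 1], decreasing = TRUE), , drop = FALSE]
> # a built-in dot chart of the same scores
> varImpPlot(rf_best)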

Variable importance scores for the random forest model trained on the Skillcraft data set

Looking at this plot, we can see that APM and ActionLatency are once again the most important features, but their order is reversed. We also see that TotalHours is now third in importance, significantly higher than what we saw before in a single tree.

We have explored the Skillcraft data set using a number of different methods, but each time we treated this as a regression problem and measured our accuracy using the SSE. Our target variable is the league index, which tells us the gaming league in which a player competes. As such, it is actually an ordered factor.

As we've seen before, models whose output is an ordered factor can be tricky to train as well as to assess. For example, perhaps a more appropriate method of assessing our model would be to first round the model's numerical output so that we obtain a prediction of an actual player league. Then, we could assess the model using a weighted classification error rate that more heavily penalizes a predicted league index that is far from the actual league index. We leave this as an exercise for the reader.
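
As a hedged starting point for this exercise, one possible scheme (the clamping to the observed range of league indices and the use of absolute distance as the penalty weight are our own choices for illustration, not a prescription) might look like this:

> rounded_predictions <- round(rf_best_predictions)
> league_range <- range(skillcraft_train$LeagueIndex)
> rounded_predictions <- pmin(pmax(rounded_predictions, league_range[1]),
                              league_range[2])
> # penalize each prediction by how many leagues it is from the true league
> weighted_error <- mean(abs(rounded_predictions -
                             skillcraft_test$LeagueIndex))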

One issue that we often face when we treat this as a regression problem is that we have no way to force the model to predict across the full range of the original levels. In our particular data set, for example, we might never predict the lowest or highest league. For some suggestions on alternative ways to model ordered factors with regression trees, there is an insightful paper published in 2000 by Kramer and others, titled Prediction of Ordinal Classes Using Regression Trees. This appears in volume 34 of Fundamenta Informaticae, published by IOS Press.

Note

For random forests, the original reference is a 2001 paper by Leo Breiman, titled Random Forests, published in the journal Machine Learning. Besides this reference, a fantastic chapter with numerous examples appears in the book Statistical Learning from a Regression Perspective by Richard A. Berk, published by Springer.
