Random forests – a collection of trees

Combined forecasts tend to work better than single forecasts; this phenomenon is often called the Wisdom of the Crowd. Random forests exploit it by fitting and combining several trees. Because of this combination step, algorithms such as random forests are also called ensemble learning. Random forests are not the only ensemble learning algorithms; bagging, boosting, and committees also fit and combine several models.

In this section, we are aiming not only at random forests but also at other models and packages that could compete with them. This time, we are not benchmarking accuracy alone; elapsed time will also be taken into consideration. Note that it is a very simple measure and may vary widely from my machine to yours.

The time needed to train a single neural network model is sometimes greater than the time needed to fit hundreds of regressions. Yet, neural networks can represent non-linear relations without the need for explicit transformations.

Great members of the R community, Andy Liaw and Matthew Wiener, ported random forest methods from Fortran to R. These methods live in the randomForest package, just waiting for you to use them. Start by checking whether it's already installed:

if(!requireNamespace('randomForest', quietly = TRUE)){
  install.packages('randomForest')
}

If your internet connection is up and running, the preceding code block will make sure that randomForest is installed.

Using requireNamespace() instead of require() to check whether a package is already installed gives a small efficiency gain. The former only loads the package's namespace, while the latter loads and attaches it to the search path.
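If you want to see the difference in practice, here is a minimal sketch; it assumes randomForest is already installed but not yet attached in your session:

# loads the namespace only; nothing new shows up on the search path
requireNamespace('randomForest', quietly = TRUE)
'package:randomForest' %in% search()
# [1] FALSE

# loads and attaches, so its functions can be called without the :: prefix
require(randomForest, quietly = TRUE)
'package:randomForest' %in% search()
# [1] TRUE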

Random forests can be used to tackle problems such as classification, survival analysis, and prediction in general. More often than not, random forests easily outperform single regressions and other kinds of single models on both in-sample and out-of-sample measures. Forests also help to diminish overfitting.

A model overfits when it does very well on in-sample measures but derails on out-of-sample ones. If data partitioning is done right, large gaps between the two might indicate that the model is so attached to the estimation data that it can't generalize well beyond it. One of the modeler's usual goals is to overfit less.
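To make the idea concrete, here is a minimal sketch of how such a gap could be inspected; some_model is a hypothetical classifier already fitted on the estimation rows of dt_Chile, and i_out and test are the partition indices used throughout this chapter:

# hit rate on the data used for estimation (in-sample)
acc_in  <- mean(predict(some_model, type = 'class',
                        newdata = dt_Chile[-i_out, ]) == dt_Chile[-i_out, 'vote'])

# hit rate on data the model has never seen (out-of-sample)
acc_out <- mean(predict(some_model, type = 'class',
                        newdata = dt_Chile[test, ]) == dt_Chile[test, 'vote'])

# a large positive gap suggests the model memorized the estimation data
acc_in - acc_out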

As the package relies on the random seed generator, let's start by loading the package and setting seed:

library(randomForest)
set.seed(1999)

Now we can fit a random forest model. Sys.time() will be used to calculate elapsed time as follows:

time0_rf <- Sys.time()

rf_model <- randomForest(vote ~ .,
                         data = dt_Chile[-i_out, ],
                         ntree = 450,
                         na.action = na.exclude)

time1_rf <- Sys.time()

The preceding chunk starts by creating a variable named time0_rf (time zero random forest) and terminates with the creation of time1_rf (time one random forest). They respectively account for the system time that immediately precedes and follows the random forest model fitting. The difference between them (time1_rf - time0_rf) will tell us how much time it took to fit our model.

Many things, apart from how efficient your model fitting is, will interfere with elapsed time. How powerful your machine is and which programs are running in the background are examples of what could affect it. Nonetheless, for this context, that time measure will be enough. Beyond it, I suggest you try the tictoc or microbenchmark packages if you need to benchmark your code more seriously.
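For instance, a minimal sketch with tictoc could look like the following; the message passed to tic() is arbitrary and rf_check is just a throwaway name for a refitted forest:

if(!requireNamespace('tictoc', quietly = TRUE)){
  install.packages('tictoc')
}
library(tictoc)

tic('random forest fit')
rf_check <- randomForest(vote ~ .,
                         data = dt_Chile[-i_out, ],
                         ntree = 450,
                         na.action = na.exclude)
toc()
# toc() prints the elapsed time for everything run between tic() and toc()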

In between time0_rf and time1_rf, the code block saves a random forest model in a variable named rf_model. The function used to calibrate this model carries the same name as the package responsible for it, randomForest(). The way we set it up is very similar to the way we set up single decision-tree models in the Tree models section.

First, we set up a formula (vote ~ .). The dot sign is a shortcut for all of the other variables available in the dataset. The data argument brings no novelties; it sets the DataFrame used to train the model. The number of trees grown is ruled by ntree. Lastly, there is the na.action argument. We already know that there are no NAs in our data (check the Tree models section), but if there were, na.exclude would remove them from the fitting process:

mean(predict(rf_model, type = 'class',
             newdata = dt_Chile[test, ]) == dt_Chile[test, 'vote'])
# [1] 0.6547945
time1_rf - time0_rf
# Time difference of 1.061416 secs

The best (out-of-sample) hit rate we got until now for this problem was floating around 63.83%. Using random forests, we could achieve a hit rate near 65.48%. The print() and plot() methods are also available for randomForest objects such as rf_model. Try print(rf_model) to check a confusion matrix or plot(rf_model) to check a graph of the error rates.
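If you want to dig a little deeper into the fitted forest, randomForest also ships variable-importance helpers; the following lines are only a sketch of what is available:

print(rf_model)       # call, OOB error estimate, and a confusion matrix
plot(rf_model)        # error rates as a function of the number of trees
importance(rf_model)  # variable importance (mean decrease in Gini)
varImpPlot(rf_model)  # the same importance measures as a dot plot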

We can try bagging with the ipred package. The following code block shows how to make sure it's installed before loading the whole package:

if(!require(ipred)){install.packages('ipred')}
library(ipred)

With ipred installed, we can try the bagging algorithm with the bagging() function:

time0_bag <- Sys.time()

bag_model <- bagging(vote ~ ., data = dt_Chile[-i_out, ])

time1_bag <- Sys.time()

Again, variables were created using Sys.time() for benchmarking purposes. Let's check how bagging has rolled:

mean(predict(bag_model, type = 'class',
             newdata = dt_Chile[test, ]) == dt_Chile[test, 'vote'])
# [1] 0.6356164
time1_bag - time0_bag
# Time difference of 0.350996 secs

Bagging did not perform better than random forests here, but that does not mean it will go this way every single time; a single run is only weak evidence about whether ipred can beat randomForest. There is yet another algorithm that might compete with random forests: boosting.
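Before moving on to boosting, note that ipred::bagging() also exposes a couple of arguments worth experimenting with; the following is only a sketch, with bag_model2 as an illustrative name and values chosen arbitrarily:

bag_model2 <- bagging(vote ~ .,
                      data = dt_Chile[-i_out, ],
                      nbagg = 100,  # number of bootstrap replications (default is 25)
                      coob = TRUE)  # also compute an out-of-bag error estimate
print(bag_model2)
# with coob = TRUE, the print method also reports the out-of-bag error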

You can fit boosting models in R using the gbm package. There are a few tricks ahead. Let's start, as usual, by installing and loading the package we will be using next:

if(!require(gbm)){install.packages('gbm')}
library(gbm)

Fitting a model with gbm is pretty simple. Check it out:

time0_gbm <- Sys.time()

gbm_model <- gbm(formula = formula(dt_Chile[, c(8, 1:7)]),
                 data = dt_Chile[-i_out, ],
                 distribution = 'multinomial',
                 n.trees = 450)

time1_gbm <- Sys.time()

The formula input is equivalent to vote ~ .. The trick here is to take the DataFrame (dt_Chile[, c(8, 1:7)]) and tweak the column order, so that the predicted variable appears on the left-hand side of the expression produced by formula(). With this, I only wanted to introduce an alternative way to set up formulas; vote ~ . would work perfectly well there.
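If you are curious, you can print both versions to convince yourself they describe the same model (assuming vote sits in the eighth column of dt_Chile, as in the original Chile data):

formula(dt_Chile[, c(8, 1:7)])  # the first column becomes the left-hand side
vote ~ .                        # the shortcut used for the previous models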

Next, there is the data argument, as usual. The novelty here is the distribution argument; as we have multiple categories to classify, 'multinomial' is the way to go. If we had a binary problem, we might use 'bernoulli' instead. If you skip this argument, gbm::gbm() will make an educated guess.

There are tons of distributions available. Entering ?gbm::gbm() will help you find out which ones. Deeper knowledge might be needed to pick a distribution optimally.

As randomForest() fitted 450 trees, the same was asked from gbm() with n.trees = 450. There are many other arguments that can be used to tune your model; they are well described in the documentation, which you can reach with ?gbm::gbm().
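For instance, a sketch of a more heavily tuned call might look like this; gbm_tuned is a hypothetical name and the values below are only illustrative, not recommendations:

gbm_tuned <- gbm(vote ~ .,
                 data = dt_Chile[-i_out, ],
                 distribution = 'multinomial',
                 n.trees = 450,
                 interaction.depth = 3,  # depth of each tree (default is 1)
                 shrinkage = 0.05,       # learning rate applied to each tree
                 n.minobsinnode = 20)    # minimum observations in terminal nodes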

summary(gbm_model) will give a plot displaying each variable's contribution to the model. You will notice that, out of the seven independent variables, only three contributed and, among those three, statusquo alone got more than 80% of the relative influence.
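If you prefer numbers over a plot, summary.gbm() can skip the plotting; a minimal sketch:

summary(gbm_model, plotit = FALSE)
# returns a data frame with one row per variable (var) and its
# relative influence (rel.inf), ordered from most to least influential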

Calculating the hit rate may be a little bit trickier than what we have seen up to now, but that is not all bad. Predictions are given in a much more detailed (and maybe raw) way. This means that you can aim for higher levels of customization, which is actually an advantage of using R over some black-box statistical software.

We can store the predictions in an object for further examination:

pred_gbm <- predict(gbm_model,
                    type = 'response',
                    n.trees = 450,
                    newdata = dt_Chile[test, ])

Here, we can see how predict.gbm() works. The first argument must be a gbm class model. The 'class' option for type is not available here; the options are 'link' or 'response'. Each behaves differently depending on the distribution adopted in the fitting phase. Here, we asked for the probability of each observation belonging to each category.

The n.trees argument sets how many trees we want the prediction to be built on. This same argument can also take a vector; this way, you can retrieve predictions for specific iterations, as sketched below. The last argument, newdata, sets the new dataset used to draw predictions.
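As a sketch of the vector version, the following call (pred_iter is a hypothetical name) retrieves predictions at three different iteration counts:

pred_iter <- predict(gbm_model,
                     type = 'response',
                     n.trees = c(150, 300, 450),
                     newdata = dt_Chile[test, ])
dim(pred_iter)
# the third dimension now has three slices, one per value of n.trees;
# pred_iter[, , 3] holds the probabilities based on all 450 trees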

The object storing the predictions is a three-dimensional array. The third dimension has a single element. We could access predictions for different iterations if we had input n.trees with a vector instead of an integer. That said, this third dimension is not very useful here, so let's transform this object into a two-dimensional DataFrame:

pred_gbm <- as.data.frame(pred_gbm[,,1])

Notice how the third dimension was subsetted: pred_gbm[, , 1]. This keeps the column names from getting weird. Speaking in terms of forecasting, it's much better to have your predictions as probabilities rather than points. You are far more honest saying there is a 28% chance that Jack will vote yes than saying Jack will vote yes.

If you check head(pred_gbm), you will see that we have probabilistic predictions rather than point predictions (yes, no, and so on). This is actually richer information, but for the sake of comparability (and simplicity), we have to get point predictions just like the ones we got from the previous models when setting predict(..., type = 'class').

However, up to this point, we have only compared point measures (the hit rate), so we have to convert our probabilistic predictions into point ones. For each row, we have to check which column holds the greatest probability. We can easily work this out with a couple of functions while working with a DataFrame:

pred_gbm <- names(pred_gbm)[max.col(pred_gbm, 'first')]

Now, pred_gbm is a character vector displaying which vote each person in the test sample is most likely to choose. With this, we can measure the hit rate:

mean(pred_gbm == dt_Chile[test, 'vote'])
# [1] 0.6630137
time1_gbm - time0_gbm
# Time difference of 0.9355559 secs

The performance here was slightly better, but not by much. It does not necessarily mean that gbm always beats randomForest. As a matter of fact, this difference could come from the tuning parameters or even the samples picked for test and train. I personally don't have a preference here.

I really appreciate being able to see how each variable has influenced the process using summary.gbm() and being able to pick a distribution using gbm::gbm(). Yet, the results were so close that I am not sure about one performing better than the other. The time required to fit these models is often so small that you might rapidly run out of excuses for not fitting both at once.

Even if we had a dataset with high dimensionality, it is likely that only a few variables would carry real influence; here, only three showed some influence over the gbm model, and probably the same happened for the random forest model and for simpler decision trees. Luckily for us, the dataset was not so small that the models couldn't tell good features from not-so-good ones. Selecting features based on knowledge about the data is part of feature engineering.

In this section, we saw how easy it is to grow random forests and other ensemble learning models with R. Let me stress that, although random forests can handle both classification and regression problems, they tend to do better on classification problems than on regression ones.

Another downside is that they are regarded as a black-box type of model in the statistical sense. Next, we are moving on to SVM models, another supervised learning technique that can handle both regression and classification.
