How to do it...

In this recipe, we will work with the mushroom dataset that we used in the first recipe of this chapter and fit a random forest classifier. Random forests are built by combining the predictions of several (usually hundreds of) trees, each trained on a different random subset of features and on a bootstrapped dataset obtained from the original data. These bootstrap samples are built by sampling with replacement; this implies that some of the observations may appear multiple times in a given sample, whereas others won't appear at all.
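To make the bootstrap idea concrete, the following toy sketch (not part of the recipe; it just samples ten row indices with base R) shows how a bootstrap sample repeats some observations and leaves others out:
# Draw a bootstrap sample of the same size as the original, with replacement
set.seed(99)
ids = 1:10
boot = sample(ids, size = length(ids), replace = TRUE)
table(boot)          # some indices appear more than once
setdiff(ids, boot)   # indices that were never drawn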

  1. We first load the mushroom dataset and assign the column names, as they are missing from the .csv file. Both the target and the features are categorical:
library(caret)
set.seed(11)
mushroom_data = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header = FALSE)
colnames(mushroom_data) = c("edible", "cap_shape", "cap_surface", "cap_color", "bruises", "odor",
  "gill_attachment", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root",
  "stalk_surface_above_ring", "stalk_surface_below_ring", "stalk_color_above_ring",
  "stalk_color_below_ring", "veil_type", "veil_color", "ring_number", "ring_type",
  "spore_print_color", "population", "habitat")
  2. The objective of the data-processing part is to separate the target from the features, build the appropriate dummy variables for those features, and then join the features and target back together. We therefore save the target variable, which indicates whether each mushroom is edible (the first column of our dataset), and remove it from the dataset:
edible = mushroom_data[,1]
mushroom_data = mushroom_data[,-1]
  3. We then remove the veil_type variable, which has only one level (it is constant and adds nothing to the model). After this is done, we build the dummy-variable model with dummyVars, passing the dataset and a formula. The ~. means that we want every variable to be transformed, and sep="__" means that the resulting column names will follow the format <factorname>__<level> (a toy illustration of this naming follows the code):
mushroom_data = mushroom_data[,-which(colnames(mushroom_data)=="veil_type")]
mushroom_dummy_model = dummyVars(~., data = mushroom_data, sep = "__")
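As a quick illustration of this naming convention (the toy data frame below is an assumption, not part of the recipe), dummyVars produces one column per factor level:
# Toy one-factor data frame, just to inspect the generated column names
toy = data.frame(color = factor(c("red", "blue", "red")))
toy_dummies = dummyVars(~., data = toy, sep = "__")
colnames(predict(toy_dummies, toy))   # expected: "color__blue" "color__red"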
  4. We then transform the dataset using the dummy model we just defined and join the dummies and the target variable column-wise:
mushroom_data_model = cbind(data.frame(predict(mushroom_dummy_model, mushroom_data)), 
edible)

  5. We define a control object that will manage how the grid search works (note that the training in the next step takes some time to run). We will use repeated cross-validation (repeatedcv) with 4 folds and 1 repeat. This means that the data will be split into four parts and four models will be estimated, one at a time; each of them will be trained on three folds and evaluated on the remaining one. Because repeats=1, this whole procedure happens only once. There are several possible metrics we might be interested in, but since we are dealing with a balanced dataset, we can use the accuracy. The tunegrid variable holds the grid of values that will be tested: for random forests, caret only allows us to tune mtry, the number of features (chosen at random) used to train each tree (the objective of random forests is to train trees that are as decorrelated as possible). A small illustration of the fold split follows the code:
control = trainControl(method="repeatedcv", number=4, repeats=1)
metric = "Accuracy"
tunegrid = expand.grid(.mtry=c(2,5,7,10))
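As a small illustration of how a 4-fold split partitions the rows (the toy target below is an assumption, not recipe data), createFolds from caret returns the held-out indices used by each of the four models:
# Toy balanced target with 40 observations, split into 4 folds
toy_target = factor(rep(c("e", "p"), each = 20))
folds = createFolds(toy_target, k = 4)
sapply(folds, length)   # each fold holds out roughly a quarter of the rows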
  6. We then train the model, specifying that we want to model the edible target in terms of all the other variables in the dataset. The method used here is random forest (method="rf"), and the other parameters correspond to the tunegrid and control objects that we defined previously:
rf_default = train(edible~., data=mushroom_data_model, method="rf", metric=metric,  
tuneGrid=tunegrid, trControl=control)
  7. We finally print the results of the model. Essentially, with an mtry of 5 or more, we get 100% accuracy on the cross-validation samples. The kappa coefficient is similar to the accuracy, but it corrects for agreement that could happen by chance. For example, let's assume we have A and B labels and a classifier that labels 90% of the observations correctly. Depending on the proportion of A to B, this might be a spectacular classifier or a mediocre one.

If the proportion were 50/50, classifying 90% of the observations correctly would be quite good. If, on the contrary, we had a 90/10 (or 10/90) split, our 90% classifier would be mediocre. Why? Because in the latter case, simply assigning every observation to the majority label would already yield roughly 90% accuracy, whereas in the former case (50/50) that strategy would only reach 50%. Cohen's kappa corrects the accuracy for this chance agreement, so it is always smaller than or equal to the raw accuracy. For a balanced dataset with near-perfect accuracy, like the one we have here, Cohen's kappa will be almost identical to the accuracy (a worked sketch of the chance correction follows the output):

print(rf_default)

The output of print(rf_default) shows the cross-validated accuracy and kappa for each value of mtry in the grid.
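To make the chance correction concrete, here is a worked sketch using the toy numbers from the discussion above (90% observed accuracy, and we assume the classifier's predictions follow the same label proportions as the data):
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
observed = 0.90
chance_balanced = 0.5 * 0.5 + 0.5 * 0.5   # 50/50 labels: chance agreement = 0.50
chance_skewed = 0.9 * 0.9 + 0.1 * 0.1     # 90/10 labels: chance agreement = 0.82
(observed - chance_balanced) / (1 - chance_balanced)   # 0.80, a strong classifier
(observed - chance_skewed) / (1 - chance_skewed)       # ~0.44, far less impressive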

  8. There are several elements inside the tuned estimator, but in general we just want the best model, refitted on the full dataset. To get it, we can do the following (a small prediction sketch follows the code):
rf_default$finalModel
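As a quick usage sketch (using the objects created above; this is not part of the original recipe), the train object can also be used directly for prediction, which applies the best mtry found during the grid search:
# Predict on the same dummy-encoded data and cross-tabulate against the target
predictions = predict(rf_default, newdata = mushroom_data_model)
table(predictions, mushroom_data_model$edible)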