Random forest model

There are a number of approaches to learning in multiclass problems. Techniques such as random forest and discriminant analysis handle multiclass classification natively, while some techniques and/or packages won't—for example, generalized linear models fit with glm() in base R. The functionality built into mlr allows you to run a number of techniques for supervised and unsupervised learning. However, leveraging its power can be a little confusing the first couple of times you use it. If you follow the process outlined below, you'll be well on your way to developing powerful learning pipelines. We'll be using random forest in this demonstration.

We've already created the training and testing sets. You can do this in mlr, but I still prefer the technique we've been using with the caret package. One of the unique things about the mlr package is that you have to put your training data into a task structure, specifically, in this problem, a classification task. Optionally, you can place your test set in a task as well. You specify the dataset and the target column containing the labels:

> dna_task <- mlr::makeClassifTask(data = train, target = "Class")
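If you do want to place the test set in a task as well, the call is the same; a minimal sketch, assuming your held-out data frame is named test (a name not used in the text above):

```r
# Hypothetical: wrap the held-out data frame in its own classification task
test_task <- mlr::makeClassifTask(data = test, target = "Class")
```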

There are many ways to use mlr in your analysis, but I recommend creating a resample object.

In the following code block, we create a resampling object, consisting of five subsamples, to help us tune the number of trees for our random forest. Keep in mind that, as with the caret package, you have flexibility in the resampling method, with techniques such as cross-validation and repeated cross-validation:

> rdesc <- mlr::makeResampleDesc("Subsample", iters = 5)

The next object establishes the grid of trees for tuning, with the minimum number of trees set to 50 and the maximum set to 200. You can also establish multiple parameters, as we did with the caret package. You can explore your options by calling help for the makeParamSet() function:

> param <- ParamHelpers::makeParamSet(
+   ParamHelpers::makeDiscreteParam("ntree", values = c(50, 75, 100, 150, 175, 200))
+ )
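As a sketch of tuning more than one parameter at once, you could add mtry alongside ntree in the parameter set; the values below are illustrative assumptions, not from the experiment in this chapter:

```r
# Illustrative only: tune the number of trees and mtry together
param_multi <- ParamHelpers::makeParamSet(
  ParamHelpers::makeDiscreteParam("ntree", values = c(100, 150, 200)),
  ParamHelpers::makeDiscreteParam("mtry", values = c(2, 4, 8))
)
```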

Next, create a control object, establishing a numeric grid:

> ctrl <- mlr::makeTuneControlGrid()

With the preliminary objects created, we can now tune the hyperparameter for the optimal number of trees in the random forest, as per our grid. Notice that we're specifying classif.randomForest; the list of mlr's available models referenced earlier gives the proper syntax for your desired method. One thing we should do is bring the mlr library into the environment so that we can use its syntax directly. We also use the objects we just created:

> library(mlr)

> tuning <- mlr::tuneParams(
+   "classif.randomForest",
+   task = dna_task,
+   resampling = rdesc,
+   par.set = param,
+   control = ctrl
+ )

Once the algorithm completes its iterations, you can call up both the optimal number of trees and the associated out-of-sample error:

> tuning$x
$`ntree`
[1] 175

> tuning$y
mmce.test.mean
0.04635294

The optimal number of trees, as per our experiment grid, is 175, with a mean misclassification error of 0.046, or roughly 4.6 percent. It's now a simple matter of setting this parameter for training by wrapping setHyperPars() around the makeLearner() function. Notice that I set the predict type to "prob", as the default is the predicted class and not the probability:

> rf <- mlr::setHyperPars(
+   mlr::makeLearner("classif.randomForest", predict.type = "prob"),
+   par.vals = tuning$x
+ )

Now we train the model with the optimal 175 trees:

> fit_rf <- mlr::train(rf, dna_task)
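The confusion matrix on the test data further down relies on a prediction object, pred, which we should create from the fitted model. A minimal sketch, assuming the held-out data frame is named test with the same columns as the training data:

```r
# Assumption: test is the held-out data frame used throughout this chapter
pred <- predict(fit_rf, newdata = test)
```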

You can see the confusion matrix on the training data:

> fit_rf$learner.model

        OOB estimate of error rate: 5.14%
Confusion matrix:
    ei  ie    n class.error
ei 563  26   25  0.08306189
ie  16 575   21  0.06045752
n   10  33 1281  0.03247734

That's better than I expected, with an out-of-bag error of just over 5%. Also, no single class's error rate is wildly out of line with the others. Additionally, the model performs pretty well on the test data:

> mlr::calculateConfusionMatrix(pred)
        predicted
true     ei  ie   n -err.-
  ei    139   4  10     14
  ie      3 147   3      6
  n       2   3 325      5
  -err.-  5   7  13     25

The package has a full set of metrics available. Here, I pull up the test accuracy and log-loss:

> mlr::performance(pred, measures = list(acc, logloss))
      acc   logloss
0.9606918 0.2863458
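To see which of mlr's metrics apply to your particular task, the package provides listMeasures(); a quick sketch:

```r
# List the performance measures compatible with our classification task
mlr::listMeasures(dna_task)
```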

It has an impressive 96% accuracy on the test set and a baseline log-loss of 0.286. This leads us to the next step, where we see whether creating an ensemble by just combining the predictions of random forest and MARS can improve performance.
