Growing trees with tree and rpart

There are no big differences between growing trees with tree or rpart. Both work in similar ways, recursively partitioning the data. As the time cost of growing a tree is usually small, I can't see a reason not to try both, so this section explains how to grow trees using each package.

The following shows how to recursively grow a decision-tree model using the tree package:

if(!require(tree)){install.packages('tree')}
library(tree)
tree_tree <- tree(vote ~ .,
                  data = dt_Chile[-i_out,],
                  method = 'class',
                  mindev = 0)

The first two lines check whether the tree package is installed (installing it if it isn't) and then load and attach it. After that, an object called tree_tree is created: the tree() function from the tree package builds the decision tree. Let's take a closer look at its arguments.

It starts with the formula. By inputting vote ~ ., we ask for the vote variable to be explained by all of the other variables available in the data frame; the dot (.) is simply shorthand for every other available variable. The package grows the tree recursively, and the order in which we list the variables won't change the final result.

Yet, we could name fewer variables. For example, if you only want to build a tree based on region and income, the formula vote ~ region + income would do it, and vote ~ income + region would lead to the same result. Notice that the order on the right side of the expression doesn't matter as long as the variable named on the left side (the predicted variable) stays the same.
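As a quick sketch of that reduced formula (tree_small is just a throwaway name for this illustration, and it keeps the default control settings rather than mindev = 0):

# A smaller tree using only region and income as predictors
tree_small <- tree(vote ~ region + income,
                   data = dt_Chile[-i_out,],
                   method = 'class')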

The next arguments are data and method. While the latter is set to 'class', which is short for classification, the former is set to dt_Chile[-i_out,]. Notice how the i_out object was used to select only the training set; the minus sign played an essential role here by removing from the set all of the rows that correspond to the out-of-sample observations.

Last but not least, there is the mindev argument. The tree is grown recursively, and mindev sets a threshold on how much a new split must contribute (relative to the root node's deviance) for the tree to keep growing. By setting it to zero, we remove that restriction entirely. Can you figure out how this could go wrong?

Arguments such as mindev are called control parameters. To see the other control parameters, type ?tree.control into your R console.
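For instance, mincut (the minimum number of observations allowed in either child node) and minsize (the smallest allowed node size) can be passed straight to tree(), which forwards them to tree.control(). The sketch below is only illustrative: the values are arbitrary and tree_tree_b is a throwaway name.

# Passing extra control parameters through tree(); values are illustrative
tree_tree_b <- tree(vote ~ .,
                    data = dt_Chile[-i_out,],
                    method = 'class',
                    mincut = 10, minsize = 20, mindev = 0.005)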

Let's look at what we got so far with the following code:

plot(tree_tree)
text(tree_tree, cex = .7)

The following figure shows the mess we got ourselves into by setting mindev = 0 (the default of 0.01 would have been much better):

Figure 6.3: First decision tree for vote intentions or, as I like to call it, a mess

Do you recall me saying that a great advantage of decision trees is their simplicity? Grow too many nodes and that simplicity is likely to vanish.

A key concept of decision-tree models is the node. The first (top) node is called the root node. The last ones (at the bottom) are called leaf nodes, end nodes, or terminal nodes. The remaining ones are called chance nodes or intermediate nodes.
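If you prefer to see those nodes as data rather than as a drawing, the fitted object stores one row per node in its frame component; this is just an optional peek, not something used later in the chapter.

# One row per node; leaf nodes show "<leaf>" in the var column
head(tree_tree$frame)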

To print the text that shows how the tree splits, simply call the object storing it (tree_tree). An alternative is summary(). Given a tree created by the tree package, summary() outputs only brief information about the model, so your console won't be flooded with node-by-node output even if the tree had thousands of nodes:

summary(tree_tree)

# Classification tree:
# tree(formula = vote ~ ., data = dt_Chile[-i_out, ], method = "class",
# mindev = 0)
# Number of terminal nodes: 220
# Residual mean deviance: 1.05 = 1556 / 1482
# Misclassification error rate: 0.2244 = 382 / 1702

The misclassification error rate is an accuracy measure for classification trees. Judging by the training data, our model missed 22.44% of the time (382 out of 1,702 observations). The following code shows how a very similar tree can be grown using the rpart package:

if(!require(rpart)){install.packages('rpart')}
library(rpart)
tree_rpart <- rpart(vote ~ .,
                    data = dt_Chile[-i_out,],
                    method = 'class',
                    control = rpart.control(cp = 0))

The only real difference here is how the control parameters are set: the control argument takes an rpart.control() object, and the parameter playing a role similar to mindev is called cp (the complexity parameter). There are many other control parameters; try ?rpart.control.
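For example, minsplit (the minimum number of observations a node must have before a split is attempted) and maxdepth (the maximum depth of the tree) can be set the same way. Again, the values below are only illustrative and tree_rpart_b is a throwaway name:

# Other rpart control parameters; values are illustrative only
tree_rpart_b <- rpart(vote ~ .,
                      data = dt_Chile[-i_out,],
                      method = 'class',
                      control = rpart.control(cp = 0, minsplit = 40,
                                              maxdepth = 10))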

Calling summary(tree_rpart) won't give you anything similar to summary(tree_tree). We can estimate the misclassification error rate manually with the following code:

mean(residuals(tree_rpart))
# [1] 0.2514689
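As a hedged cross-check, roughly the same rate can be recovered from in-sample class predictions; the figure should land close to the residuals-based one, give or take any missing values in the training rows:

# In-sample misclassification rate from class predictions
# (na.rm guards against missing vote values, if any remain)
mean(predict(tree_rpart, newdata = dt_Chile[-i_out,], type = 'class')
     != dt_Chile[-i_out, 'vote'], na.rm = TRUE)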

The model fitted by rpart misclassified around 25.15% of the time. An upside of this package is that it can be combined with rattle and RColorBrewer to create stunning decision-tree visualizations. Check it out:

if(!require(rattle)){install.packages('rattle')}
if(!require(RColorBrewer)){install.packages('RColorBrewer')}
library(rattle)
fancyRpartPlot(tree_rpart, sub = '')

Figure 6.4 shows the visualization created by the fancyRpartPlot() function:

Figure 6.4: Decision-tree visualization created with the rattle package

It is much better than the visualization achieved in Figure 6.3 (we could have used the plot() and text() functions here as well), yet that is no thanks to the numerous nodes. As far as we know, only interpretability has been hurt so far. More often than not, having too many nodes also hurts the ability to predict new data, but we can't know for sure until we try the model on the test set.

Let's start by checking tree_tree (the decision tree grown with the tree package):

predict(tree_tree, type = 'class',
        newdata = dt_Chile[test,])

mean(predict(tree_tree, type = 'class',
             newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])
# [1] 0.569863

First, predict() was called to ask for predictions on the test dataset; it outputs a vector of classes. Then, mean() was combined with predict() and the equality operator (==) to calculate the hit rate on the test dataset. To calculate the misclassification error rate instead, replace == with !=.
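To see where those misses land, a simple cross-tabulation of predicted against observed classes can help; the predicted/observed argument names below are just labels for readability:

# Confusion table for the test set: rows are predictions,
# columns are the observed classes
table(predicted = predict(tree_tree, type = 'class',
                          newdata = dt_Chile[test,]),
      observed = dt_Chile[test, 'vote'])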

You may get slightly different results each time you calculate the hit and/or misclassification rate on the test dataset using the tree_tree object, unless you set a seed. This is likely due to how prediction ties are handled; they may be broken at random.
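A minimal sketch of pinning the result down (the seed value here is arbitrary):

# Fix the random tie-breaking so the hit rate is reproducible
set.seed(123)
mean(predict(tree_tree, type = 'class',
             newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])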

Much the same can be done to retrieve the predictions and hit rate for the test dataset using rpart:

predict(tree_rpart, type = 'class',
        newdata = dt_Chile[test,])

mean(predict(tree_rpart, type = 'class',
             newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])
# [1] 0.6109589

The test hit rate for tree_rpart was better (61.10%). That does not mean rpart will do better every single time; try both. The trees trained so far are overcomplicated and do not yet generalize well to unseen data. We could squeeze out some performance through pruning.

To put it simply, pruning cuts off nodes that don't contribute much to model performance. More often than not, this approach leads to better out-of-sample performance and to trees that are much easier to understand. The next code block prunes tree_tree, calculates the new hit rate, and plots the pruned tree:

p_tree <- prune.misclass(tree_tree,
                         best = 5,
                         newdata = dt_Chile[val,])

mean(predict(p_tree, type = 'class',
             newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])
# [1] 0.6383562

plot(p_tree);text(p_tree, cex = .7)

One way to prune a tree grown with the tree package is the prune.misclass() function. Feed it a tree model created by the tree package and an integer setting how many terminal nodes the pruned subtree may have (the best argument). An alternative to this integer pruning parameter is the k argument, which is a cost-complexity parameter.

There is no hard rule for choosing the value of best or k; try a few, or let cross-validation guide you, as sketched below.
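One hedged way to guide that choice is cross-validation with cv.tree() from the tree package; setting FUN = prune.misclass matches the pruning criterion used here, the seed is arbitrary, and cv_tree is a throwaway name:

# Cross-validate over subtree sizes to help pick best (or k)
set.seed(123)
cv_tree <- cv.tree(tree_tree, FUN = prune.misclass)
plot(cv_tree$size, cv_tree$dev, type = 'b',
     xlab = 'terminal nodes', ylab = 'CV misclassifications')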

The prune.misclass() function needs at least tree (the tree model) plus either best or k. There is yet another argument that can improve the pruning: newdata. It determines which data is used to compute the pruning criteria, so this is where the validation set can help. The whole pruned tree is stored in an object called p_tree.

Using the test dataset, the hit rate for p_tree was calculated. The new rate, 63.84%, shows a clear improvement over the one calculated for tree_tree (56.99%). Also, the plot feels much friendlier now, as shown here:

Figure 6.5: Pruned tree (tree package)

We can prune tree_rpart using a cost-complexity parameter:

p_rpart <- prune(tree_rpart, cp = .01)

mean(predict(p_rpart, type = 'class',
             newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])
# [1] 0.6356164

With rpart, we can call prune() to prune decision trees. The only arguments needed are the tree to be pruned (it must be an rpart object) and the complexity parameter (cp). Compared to tree_rpart, the pruned tree (p_rpart) gained around 2.5 percentage points of hit rate on the test dataset.
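If you would rather not guess cp, rpart stores a complexity table alongside the fit; a common heuristic, sketched here with throwaway names (cp_min, p_rpart_cv), is to prune at the cp with the lowest cross-validated error (xerror):

# Inspect the complexity table, then prune at the cp that minimizes
# the cross-validated error
printcp(tree_rpart)
cp_min <- tree_rpart$cptable[which.min(tree_rpart$cptable[, 'xerror']), 'CP']
p_rpart_cv <- prune(tree_rpart, cp = cp_min)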

We can visualize p_rpart using rattle:

# library(rattle)
fancyRpartPlot(p_rpart, sub = '')

The following diagram shows the results:

Figure 6.6: Pruned tree (rpart package)

This tree is much easier to understand. Still, the decision made by node number three is redundant: if node number one were connected directly to node number six, the outcome would be the same. Arguably, the only reason we got such complicated trees in the first place is that we switched off the complexity control parameter. The following code grows both trees without turning this parameter off and reports the hit rates on the test sample:

tree_tree2 <- tree(vote ~ .,
                   data = dt_Chile[-i_out,],
                   method = 'class',
                   mindev = .01)
mean(predict(tree_tree2, type = 'class',
             newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])
# [1] 0.6246575

tree_rpart2 <- rpart(vote ~ .,
                     data = dt_Chile[-i_out,],
                     method = 'class',
                     control = rpart.control(cp = .01))
mean(predict(tree_rpart2, type = 'class',
             newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])
# [1] 0.6356164

Point made: fewer nodes often mean more out-of-sample accuracy when it comes to decision trees. Decision trees also serve as the foundation for a very famous algorithm called random forests, and the relation between forests and trees in the names is no coincidence. The next section (not randomly) explores random forests, guided by the awesome R.
