We now move on to one of the most easily interpretable and most popular classifiers out there: the decision tree. Decision trees—which look like an upside-down tree with the trunk on top and the leaves on the bottom—play an important role in situations where classification decisions have to be transparent and easily understood and explained. They also handle continuous predictors, categorical predictors, outliers, and irrelevant predictors rather gracefully. Finally, the general ideas behind the algorithms that create decision trees are quite intuitive, though the details can sometimes get hairy.
Figure 9.7 depicts a simple decision tree designed to classify motor vehicles as motorcycles, golf carts, or sedans.
This is a rather simple decision tree with only three leaves (terminal nodes) and two decision points. Note that the first decision point is (a) on a binary categorical variable, and (b) results in one terminal node, motorcycle. The other branch contains the other decision point, a continuous variable with a split point. This split point was chosen carefully by the decision tree-creating algorithm to result in the most informative split—the one that best classifies the rest of the observations as measured by the misclassification rate of the training data.
Actually, in most cases, the decision tree-creating algorithm doesn't choose the split that results in the lowest misclassification rate of the training data, but chooses the one that minimizes either the Gini coefficient or the cross entropy of the remaining training observations. The reasons for this are twofold: (a) both the Gini coefficient and cross entropy have mathematical properties that make them more amenable to numerical optimization, and (b) it generally results in a final tree with less bias.
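If you are curious what these two measures actually compute, here is a minimal sketch in R. The gini.impurity and cross.entropy functions below are our own illustrative helpers, not part of any package used in this chapter; both take the vector of class proportions at a candidate node and return a number that is small when the node is nearly pure.

# class proportions at a candidate node, e.g., 80% "neg" and 20% "pos"
node.props <- c(neg=0.8, pos=0.2)

# Gini impurity: the sum over classes of p * (1 - p)
gini.impurity <- function(p) sum(p * (1 - p))

# cross entropy: the sum over classes of -p * log(p)
# (zero proportions are dropped, since 0 * log(0) is taken to be 0)
cross.entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

gini.impurity(node.props)   # 0.32
cross.entropy(node.props)   # about 0.50

Both measures shrink toward zero as a node becomes dominated by a single class, which is why splits that minimize them tend to produce purer and purer nodes.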
The overall idea of the decision tree-growing algorithm, recursive splitting, is simple: find the single split—a predictor and a split point—that best separates the classes, divide the observations into two groups according to that split, and then repeat the whole procedure on each of the resulting groups until some stopping criterion is met.
The stopping criterion is usually either a certain depth, beyond which the tree cannot grow, or a minimum number of observations in a node, below which the node will not be split any further. Both of these are hyper-parameters (also called tuning parameters) of the decision tree algorithm—just like the k in k-NN—and must be fiddled with in order to achieve the best possible decision tree for classifying an independent dataset.
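In the tree package that we use below, these knobs are exposed through the control argument. The following is only a sketch of where they live; the particular values are arbitrary, and the training data frame is the one we use throughout this section.

library(tree)

# mincut:  the smallest number of observations allowed in either child of a split
# minsize: the smallest node that the algorithm will still try to split
# mindev:  a node is split only if its deviance is at least this fraction
#          of the root node's deviance
smaller.tree <- tree(diabetes ~ ., data=training,
                     control=tree.control(nobs=nrow(training),
                                          mincut=10, minsize=20,
                                          mindev=0.05))

Larger values of these parameters yield smaller, more conservative trees; smaller values let the tree grow bushier.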
A decision tree, if not kept in check, can grossly overfit the data—returning an enormous and complicated tree with a minimum leaf node size of 1—resulting in a nearly bias-less classification mechanism with prodigious variance. To prevent this, either the tuning parameters must be chosen carefully or a huge tree can be built and cut down to size afterward. The latter technique is generally preferred and is, quite appropriately, called pruning. The most common pruning technique is called cost complexity pruning, where complex parts of the tree that provide little in the way of classification power, as measured by improvement of the final misclassification rate, are cut down and removed.
Enough theory—let's get started! First, we'll grow a full tree using the PID dataset and plot the result:
> library(tree)
> our.big.tree <- tree(diabetes ~ ., data=training)
> summary(our.big.tree)

Classification tree:
tree(formula = diabetes ~ ., data = training)
Variables actually used in tree construction:
[1] "glucose"  "age"      "mass"     "pedigree" "triceps"  "pregnant"
[7] "insulin"
Number of terminal nodes:  16
Residual mean deviance:  0.7488 = 447.8 / 598
Misclassification error rate: 0.184 = 113 / 614

> plot(our.big.tree)
> text(our.big.tree)
The resulting plot is depicted in Figure 9.8.
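A quick aside: the training and test data frames used above are the ones we split off from the PID dataset earlier in the chapter. If you need to recreate something like them, a minimal sketch (assuming the full dataset is in a data frame called PID, and using an arbitrary 80/20 split) might look like this:

set.seed(3)
# randomly assign roughly 80% of the rows to the training set
train.indices <- sample(1:nrow(PID), size=round(0.8 * nrow(PID)))
training <- PID[train.indices, ]
test     <- PID[-train.indices, ]

The exact rows you end up with will, of course, depend on the seed and on how the split was done earlier in the chapter.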
The power of a decision tree—which is usually not competitive with other classification mechanisms, accuracy-wise—is that the representation of the decision rules is transparent, easy to visualize, and easy to explain. This tree is rather large and unwieldy, which hinders its ability to be understood (or memorized) at a glance. Additionally, for all its complexity, it only achieves an 81% accuracy rate on the training data (as reported by the summary function).
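If you would like to verify that 81% figure yourself, one quick way (a sketch assuming the same training data frame and diabetes column used above) is to compare the tree's predictions on the training set against the true labels:

# predicted classes for the training observations themselves
train.preds <- predict(our.big.tree, newdata=training, type="class")
mean(train.preds == training$diabetes)   # 1 - 0.184, i.e., about 0.816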
We can (and will) do better! Next, we will investigate the optimal size of the tree by employing cross-validation, using the cv.tree function.
> set.seed(3)
> cv.results <- cv.tree(our.big.tree, FUN=prune.misclass)
> plot(cv.results$size, cv.results$dev, type="b")
In the preceding code, we are telling the cv.tree function that we want to prune our tree using the misclassification rate as our objective metric. Then, we are plotting the CV error rate (dev) as a function of tree size (size).
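If you prefer to read the numbers directly rather than off the plot, you can inspect the components of the object that cv.tree returns; the exact values depend on the random seed, so treat them as illustrative.

# the candidate tree sizes (numbers of terminal nodes) that were evaluated
cv.results$size
# the corresponding cross-validated counts of misclassifications
cv.results$dev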
As you can see from the output (shown in Figure 9.9), the optimal size (number of terminal nodes) of the tree seems to be five. However, a tree of size three is not terribly less performant than a tree of size five; so, for ease of visualization, interpretation, and memorization, we will be using a final tree with three terminal nodes. To actually perform the pruning, we will be using the prune.misclass function, which takes the desired number of terminal nodes via its best argument.
> pruned.tree <- prune.misclass(our.big.tree, best=3)
> plot(pruned.tree)
> text(pruned.tree)
> # let's test its accuracy
> pruned.preds <- predict(pruned.tree, newdata=test, type="class")
> accuracy(pruned.preds, test[,9])     # 71%
[1] 0.7077922
The final tree is depicted in Figure 9.10.
Rad! A tree this simple can be easily memorized by medical personnel, and it achieves the same testing-set accuracy as the unwieldy tree in Figure 9.8: 71%! Now, the accuracy rate, by itself, is nothing to write home about, particularly because the naïve classifier achieves a 65% accuracy rate. Nevertheless, the fact that a significantly better classifier can be built from two simple rules—closely following the logic physicians employ, anyway—is where decision trees have a huge leg up relative to other techniques. Further, we could have bumped up this accuracy rate with more samples and more careful hyper-parameter tuning.
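For reference, that 65% naïve-classifier baseline is just the accuracy you get by always predicting the most common class. A quick way to compute it on our test set (assuming, as above, that the diabetes labels sit in the ninth column) is:

# accuracy of always guessing the most frequent class in the test set
max(table(test[,9])) / nrow(test)   # should come out close to the 65% quoted above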