Decision trees

We now move on to one of the most popular and easily interpretable classifiers out there: the decision tree. Decision trees, which look like an upside-down tree with the trunk on top and the leaves on the bottom, play an important role in situations where classification decisions have to be transparent and easily understood and explained. They also handle continuous and categorical predictors, outliers, and irrelevant predictors rather gracefully. Finally, the general ideas behind the algorithms that create decision trees are quite intuitive, though the details can sometimes get hairy.

Figure 9.7 depicts a simple decision tree designed to classify motor vehicles into motorcycles, golf carts, or sedans.

Figure 9.7: A simple and illustrative decision tree that classifies motor vehicles into motorcycles, golf carts, or sedans

This is a rather simple decision tree with only three leaves (terminal nodes) and two decision points. Note that the first decision point (a) is on a binary categorical variable, and (b) results in one terminal node, motorcycle. The other branch contains the second decision point, a continuous variable with a split point. This split point was chosen carefully by the decision tree-creating algorithm to result in the most informative split: the one that best classifies the rest of the observations, as measured by the misclassification rate on the training data.

Note

Actually, in most cases, the decision tree-creating algorithm doesn't choose a split that results in the lowest misclassification rate on the training data, but rather one that minimizes either the Gini coefficient or the cross entropy of the remaining training observations. The reasons for this are two-fold: (a) both the Gini coefficient and cross entropy have mathematical properties that make them more amenable to numerical optimization, and (b) using them generally results in a final tree with less bias.
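
As a rough illustration (and not the tree package's internal code), both impurity measures can be computed from the class proportions within a node; both hit zero when a node contains only one class:

  > # a rough illustration, not the tree package's internals: both impurity
  > # measures are computed from the class proportions within a node, and
  > # both are smallest (zero) when the node contains only one class
  > gini <- function(labels) {
  +   p <- table(labels) / length(labels)
  +   sum(p * (1 - p))
  + }
  > cross.entropy <- function(labels) {
  +   p <- table(labels) / length(labels)
  +   -sum(ifelse(p > 0, p * log(p), 0))
  + }
  > gini(c("pos", "pos", "neg", "neg"))    # most impure two-class node
  [1] 0.5
  > gini(c("pos", "pos", "pos", "pos"))    # perfectly pure node
  [1] 0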

The overall idea of the decision tree-growing algorithm, recursive partitioning, is simple (a toy sketch in R follows the list below):

  1. Step 1: Choose a variable and split point that results in the best classification outcomes.
  2. Step 2: For each of the resulting branches, check to see if some stopping criterion is met. If so, leave the branch alone. If not, move on to the next step.
  3. Step 3: Repeat Step 1 on the branches that do not meet the stopping criteria.
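
Here is a toy sketch of this procedure for a single continuous predictor, written from scratch purely for illustration (it is not how the tree package does it). It scores each candidate split point by the weighted misclassification rate of the two resulting branches and recurses until a stopping criterion is met:

  > # toy sketch of recursive partitioning on one continuous predictor;
  > # not the tree package's implementation
  > misclass <- function(labels) {
  +   # fraction of observations not in the majority class
  +   1 - max(table(labels)) / length(labels)
  + }
  > best.split <- function(x, labels) {
  +   # candidate split points are midpoints between adjacent unique values
  +   vals <- sort(unique(x))
  +   splits <- head(vals, -1) + diff(vals) / 2
  +   scores <- sapply(splits, function(s) {
  +     left  <- labels[x <  s]
  +     right <- labels[x >= s]
  +     # weighted misclassification rate of the two resulting branches
  +     (length(left) * misclass(left) +
  +        length(right) * misclass(right)) / length(labels)
  +   })
  +   splits[which.min(scores)]
  + }
  > grow <- function(x, labels, depth=0, max.depth=3, min.size=10) {
  +   # stopping criteria: maximum depth reached, too few observations,
  +   # a pure node, or nothing left to split on
  +   if (depth >= max.depth || length(labels) < min.size ||
  +       length(unique(labels)) == 1 || length(unique(x)) < 2) {
  +     return(names(which.max(table(labels))))   # leaf: majority class
  +   }
  +   s <- best.split(x, labels)
  +   list(split.at = s,
  +        left  = grow(x[x < s],  labels[x < s],  depth + 1, max.depth, min.size),
  +        right = grow(x[x >= s], labels[x >= s], depth + 1, max.depth, min.size))
  + }

Calling something like grow(training$glucose, training$diabetes) on the PID data used later in this section would return a nested list of split points with majority-class labels at the leaves; real implementations also consider every predictor at every split and use the Gini coefficient or cross entropy, as the preceding note explains.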

The stopping criterion is usually either a maximum depth, past which the tree cannot grow, or a minimum number of observations in a node, below which the node will not be split any further. Both of these are hyper-parameters (also called tuning parameters) of the decision tree algorithm, just like the k in k-NN, and must be fiddled with in order to achieve the best possible decision tree for classifying an independent dataset.
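
For what it's worth, the tree package that we use below doesn't expose a depth limit directly; its growth is controlled by a minimum node size and a minimum within-node deviance instead, both set through tree.control. The following is only a sketch of how one might grow a deliberately smaller tree that way (it assumes the training data frame used in the next code block):

  > # sketch: growing a deliberately smaller tree by tightening the stopping
  > # criteria (minsize is roughly the smallest node the algorithm will try
  > # to split further; mindev is the minimum within-node deviance, relative
  > # to the root, required before a split is attempted)
  > library(tree)
  > smaller.tree <- tree(diabetes ~ ., data=training,
  +                      control=tree.control(nobs=nrow(training),
  +                                           minsize=40, mindev=0.02))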

A decision tree, if not kept in check, can grossly overfit the data—returning an enormous and complicated tree with a minimum leaf node size of 1—resulting in a nearly bias-less classification mechanism with prodigious variance. To prevent this, either the tuning parameters must be chosen carefully or a huge tree can be built and cut down to size afterward. The latter technique is generally preferred and is, quite appropriately, called pruning. The most common pruning technique is called cost complexity pruning, where complex parts of the tree that provide little in the way of classification power, as measured by improvement of the final misclassification rate, are cut down and removed.
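
The criterion behind cost complexity pruning fits in one line: a candidate subtree is scored by its error plus a penalty proportional to its number of leaves. The following is only a sketch of that idea, not the pruning routine itself; in the tree package, the penalty is controlled through the k argument of prune.misclass, whereas the best argument (which we use below) simply requests the best-scoring subtree of a given size.

  > # sketch of the cost-complexity criterion, not the package's internals:
  > # larger values of alpha penalize complexity more heavily, favoring
  > # smaller subtrees
  > cost.complexity <- function(misclass.rate, n.leaves, alpha) {
  +   misclass.rate + alpha * n.leaves
  + }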

Enough theory—let's get started! First, we'll grow a full tree using the PID dataset and plot the result:

  > library(tree)
  > our.big.tree <- tree(diabetes ~ ., data=training)
  > summary(our.big.tree)
  
  Classification tree:
  tree(formula = diabetes ~ ., data = training)
  Variables actually used in tree construction:
  [1] "glucose"  "age"      "mass"     "pedigree" "triceps"  "pregnant"
  [7] "insulin" 
  Number of terminal nodes:  16 
  Residual mean deviance:  0.7488 = 447.8 / 598 
  Misclassification error rate: 0.184 = 113 / 614 
  > plot(our.big.tree)
  > text(our.big.tree)

The resulting plot is depicted in Figure 9.8.

Figure 9.8: An unpruned and complex decision tree

The power of a decision tree, whose accuracy is usually not competitive with that of other classification mechanisms, is that the representation of its decision rules is transparent, easy to visualize, and easy to explain. This tree is rather large and unwieldy, which hinders its ability to be understood (or memorized) at a glance. Additionally, for all its complexity, it only achieves an 81.6% accuracy rate on the training data (as reported by the summary function).

We can (and will) do better! Next, we will investigate the optimal size of the tree by employing cross-validation, using the cv.tree function.

  > set.seed(3)
  > cv.results <- cv.tree(our.big.tree, FUN=prune.misclass)
  > plot(cv.results$size, cv.results$dev, type="b")

In the preceding code, we are telling the cv.tree function that we want to prune our tree using the misclassification rate as our objective metric. Then, we are plotting the CV error rate (dev) as a function of tree size (size).

Figure 9.9: A plot of cross-validated misclassification error as a function of tree size. Observe that a tree of size one performs terribly, and that the error rate declines steeply before rising slightly at larger sizes as the tree overfits.

As you can see from the output (shown in Figure 9.9), the optimal size (number of terminal nodes) of the tree seems to be five. However, a tree of size three is not much less performant than a tree of size five, so, for ease of visualization, interpretation, and memorization, we will be using a final tree with three terminal nodes. To actually perform the pruning, we will be using the prune.misclass function, which takes the desired size of the tree via its best argument.
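
If you would rather not eyeball the plot, the same choice can be read directly off the cv.results object created above; this is just a convenience sketch:

  > # the smallest tree size among those tied for the lowest CV error
  > # (should agree with what we read off the plot)
  > min(cv.results$size[cv.results$dev == min(cv.results$dev)])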

  > pruned.tree <- prune.misclass(our.big.tree, best=3)
  > plot(pruned.tree)
  > text(pruned.tree)
  > # let's test its accuracy
  > pruned.preds <- predict(pruned.tree, newdata=test, type="class")
  > accuracy(pruned.preds, test[,9])   # 71%
  [1] 0.7077922

The final tree is depicted in Figure 9.10.

Figure 9.10: Simpler decision tree with the same testing set performance as the tree in Figure 9.8

Rad! A tree so simple that it can easily be memorized by medical personnel, and it achieves the same testing-set accuracy as the unwieldy tree in Figure 9.8: 71%! Now, the accuracy rate by itself is nothing to write home about, particularly because the naïve classifier achieves a 65% accuracy rate. Nevertheless, the fact that a significantly better classifier can be built from two simple rules (closely following the logic physicians employ, anyway) is where decision trees have a huge leg up relative to other techniques. Further, we could have bumped up this accuracy rate with more samples and more careful hyper-parameter tuning.
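
For reference, the naïve classifier just predicts the majority class of the training set for every observation. A quick sketch of that baseline, assuming the training/test split and the accuracy helper used earlier in the chapter:

  > # naive baseline: predict the training set's majority class for every
  > # test observation (about 65%, as noted above)
  > majority <- names(which.max(table(training$diabetes)))
  > accuracy(rep(majority, nrow(test)), test[,9])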
