Investigating the inner workings of a decision tree

We established earlier that a decision tree is basically a flow chart that makes a series of decisions about the data. The process starts at the root node (the node at the very top), where the data is split into two groups (in the case of binary trees) based on some decision rule. This process is then repeated for each subgroup until all remaining samples have the same target label, at which point we have reached a leaf node.
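To make that procedure concrete, here is a minimal sketch of the recursion in plain Python (not the implementation used by any particular library); find_best_split is a hypothetical helper that would choose the decision rule:

```python
import numpy as np

def grow_tree(X, y):
    """Conceptual sketch of recursively growing a binary decision tree."""
    # Leaf node: all remaining samples share the same target label
    if len(np.unique(y)) == 1:
        return {'leaf': True, 'label': y[0]}

    # Hypothetical helper that picks the feature and cut-off for the decision rule
    feature, threshold = find_best_split(X, y)

    # Ask the true/false question for every sample and split into two groups
    mask = X[:, feature] <= threshold
    return {
        'leaf': False,
        'feature': feature,
        'threshold': threshold,
        'true_branch': grow_tree(X[mask], y[mask]),     # samples answering "true"
        'false_branch': grow_tree(X[~mask], y[~mask]),  # samples answering "false"
    }
```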

In the spam filter example earlier, decisions were made by asking true/false questions. For example, we asked whether an email contained a certain word. If it did, we followed the edge labeled true and asked the next question. However, this works not just for categorical features, but also for numerical features, if we slightly tweak the way we ask a true/false question. The trick is to define a cut-off value for a numerical feature, and then ask questions of the form: Is feature, f, larger than value, v? For example, we could have asked whether the word prince appeared more than five times in an email.
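Such a question boils down to a simple Boolean comparison on a numerical feature. A tiny sketch with made-up word counts:

```python
import numpy as np

# Hypothetical feature: how often the word "prince" appears in each email
prince_counts = np.array([0, 7, 2, 11, 0, 5])

# The question "is feature f larger than value v?" with v = 5
answer = prince_counts > 5
print(answer)  # [False  True False  True False False]
```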

So, what kind of questions were asked to arrive at the decision tree plotted earlier?

To get a better understanding, I have annotated the earlier graph with some important components of the decision tree algorithm in the following diagram:

Starting at the root node (which I have labeled node 0), you can see that the first question asked was whether the sodium concentration was less than or equal to 0.72 (Na <= 0.72). This resulted in two subgroups:

  • All data points where Na <= 0.72 (node 1), which was true for nine data points
  • All data points where Na > 0.72 (node 2), which was true for six data points

At node 1, the next question asked was whether the remaining data points did not have high cholesterol levels (cholesterol=high < 0.5), which was true for five data points (node 3) and false for four data points (node 4). At node 3, all five remaining data points had the same target label, which was drug C (class = C), meaning that there was no more ambiguity to resolve. We call such nodes pure. Hence, node 3 became a leaf node.

Back at node 4, the next question asked was whether sodium levels were less than or equal to 0.445 (Na <= 0.445), and the remaining four data points were split into node 7 and node 8 accordingly. At this point, both node 7 and node 8 became leaf nodes.
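If you prefer to read these questions directly off a fitted tree rather than from the plot, scikit-learn stores them on the classifier's tree_ attribute. Here is a small sketch on made-up data (the drug dataset from the earlier example is not reproduced here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data: two numerical features and a binary class label
rng = np.random.RandomState(42)
X = rng.rand(15, 2)
y = rng.randint(0, 2, 15)

clf = DecisionTreeClassifier(random_state=42).fit(X, y)
tree = clf.tree_

for node in range(tree.node_count):
    if tree.children_left[node] == tree.children_right[node]:
        # Leaf node: no further questions are asked here
        print(f"node {node}: leaf, class counts = {tree.value[node].ravel()}")
    else:
        # Internal node: the question is "is feature f <= threshold?"
        print(f"node {node}: is feature {tree.feature[node]} "
              f"<= {tree.threshold[node]:.3f}?")
```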

You can see how this algorithm could result in really complex structures if you allowed an unlimited number of questions. Even the tree in the preceding diagram can be a bit overwhelming—and this one has only a depth of three! Deeper trees are even harder to grasp. In real-world examples, a depth of 10 is actually not uncommon.
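In practice, you can rein in this complexity with the max_depth parameter, and you can query how deep a fitted tree actually grew with get_depth(). A quick sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# An unconstrained tree keeps asking questions until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(deep_tree.get_depth())

# max_depth caps the number of questions along any path from root to leaf
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(shallow_tree.get_depth())  # at most 3
```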

The smallest possible decision tree consists of just the root node, asking a single true/false question. We say such a tree has a depth of 1. A tree of depth 1 is also called a decision stump. It has been used productively in the adaptive boosting algorithm (see Chapter 10, Ensemble Methods for Classification).
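In scikit-learn terms, a decision stump is simply a tree with max_depth=1, and a depth-1 tree is the default weak learner inside AdaBoostClassifier. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A decision stump: a single true/false question, i.e., a tree of depth 1
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(stump.get_depth())  # 1

# AdaBoost combines many such stumps into a stronger classifier;
# a depth-1 tree is its default base estimator
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boosted.score(X, y))
```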