Algorithms for training decision trees

Now that we have understood how a decision tree works, we'll want to address the issue of how we can train one using some data. There are several algorithms that have been proposed to build decision trees, and in this section we will present a few of the most well-known. One thing we should bear in mind is that whatever tree-building algorithm we choose, we will have to answer four fundamental questions:

  • For every node (including the root node), how should we choose the input feature to split on and, given this feature, what is the value of the split point?
  • How do we decide whether a node should become a leaf node or if we should make another split point?
  • How deep should our tree be allowed to become?
  • Once we arrive at a leaf node, what value should we predict?

Note

A great introduction to decision trees is Chapter 3 of Machine Learning by Tom Mitchell. This book was probably the first comprehensive introduction to machine learning and is well worth reading. Although published in 1997, much of the material in the book remains relevant today. Furthermore, according to the book's website at http://www.cs.cmu.edu/~tom/mlbook.html, a second edition is in the works.

Classification and regression trees

The Classification and Regression Tree (CART) methodology, to which we will henceforth refer simply as CART, is one of the earliest proposed approaches to building tree-based models. As the name implies, the methodology encompasses both an approach to building regression trees and an approach to building classification trees.

CART regression trees

For regression trees, the key intuition behind the CART approach is that at any given point in the tree, we choose both the input feature to split on and the value of the split point within that feature by finding the combination of the two that maximizes the reduction in the Sum of Squared Error (SSE). For every leaf node in a regression tree built using CART, the predicted value is simply the average value of the output variable across all the data points that are assigned to that particular leaf node. To determine whether a new split point should be made or whether the tree should grow a leaf node, we simply count the number of data points that are currently assigned to a node, and if this value is less than a predetermined threshold, we create a new leaf node.

For any given node in the tree, including the root node, we begin by having some data points assigned to that node. At the root node, all the data points are assigned, but once we make a split, some of the data points are assigned to the left child and the remaining points are assigned to the right child. The starting value of the SSE is just the sum of squared error computed using the average value $\bar{y}$ of the output variable $y$ for the $n$ data points assigned to the current node:

$$SSE = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

If we split these data points into two groups of size $n_1$ and $n_2$ so that $n_1 + n_2 = n$, and we compute the new SSE for all the data points as the sum of the SSE values for each of the two new groups, we have:

$$SSE_{split} = \sum_{j=1}^{n_1} (y_j - \bar{y}_1)^2 + \sum_{k=1}^{n_2} (y_k - \bar{y}_2)^2$$

Here, the first sum iterates over j, which are the new indices of the data points in the first group corresponding to the left child, and the second sum iterates over k, which are the new indices of the data points inside the second group belonging to the right child. The idea behind CART is that we find a way to form these two groups of data points by considering every possible feature and every possible split point within that feature so that this new quantity is minimized. Thus, we can think of our error function in CART as the SSE.

One of the natural advantages of CART, and of tree-based models in general, is that they are capable of handling various input types, from numerical inputs (both discrete and continuous) to binary inputs as well as categorical inputs. Numerical inputs can be ordered in a natural way, for example by sorting them in ascending order. When we do this, we can see that if we have k distinct numbers, there are k - 1 distinct ways to split these into two groups so that all the numbers in one group are smaller than all the numbers in the second group, and both groups have at least one element. This is done simply by picking the numbers themselves as split points and not counting the smallest number as a split point (which would produce an empty group). So, if we have a feature vector x with the numbers {5.6, 2.8, 9.0}, we first sort these into {2.8, 5.6, 9.0}. Then, we take each number except the smallest (2.8) to form a split point and a corresponding rule that checks whether the input value is smaller than the split point. In this way, we produce the only two possible groupings for our feature vector:

  • Group1 = {2.8}, Group2 = {5.6, 9.0}: IF x < 5.6 THEN Group1 ELSE Group2
  • Group1 = {2.8, 5.6}, Group2 = {9.0}: IF x < 9.0 THEN Group1 ELSE Group2
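
As a small, purely illustrative sketch, we could enumerate these split points in R by sorting the distinct values of the feature and dropping the smallest one:

x <- c(5.6, 2.8, 9.0)

# Candidate split points: every distinct value except the smallest
split_points <- sort(unique(x))[-1]
split_points
# [1] 5.6 9.0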

Note that it is important to have at least one element in each group, otherwise we haven't actually split our data. Binary input features can also be handled very simply: the only possible split places all the data points for which the feature takes its first value in the first group, and the remaining data points, for which the feature takes its second value, in the second group.

Handling unordered categorical input features (factors) is substantially harder because there is no natural order. As a result, any combination of levels can be assigned to the first group and the remainder to the second group. If we are dealing with a factor that has k distinct levels, then there are $2^{k-1} - 1$ possible ways to form two groups with at least one level assigned to each group.
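
As a quick illustrative sketch (not code from the text), we can tabulate how rapidly this count grows with the number of levels k:

# Number of ways to split a factor with k levels into two non-empty groups
k <- 2:6
data.frame(levels = k, possible_splits = 2 ^ (k - 1) - 1)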

So, a binary-valued feature has one possible split, as we know, and a three-valued feature has three possible splits. With the numerical feature vector containing the numbers {5.6, 2.8, 9.0}, we've already seen two possible splits. The third possible split, which could arise if these numbers were treated as categorical labels, is the one in which one group has the data points with this feature taking the value 5.6, and the other group has those with the values 2.8 and 9.0. Clearly, this is not a valid split when we treat the feature as numerical.

As a final note, we always have the option of a one-versus-all approach for categorical input features, which is essentially the same as considering only splits in which one group consists of a single level. This is not always a good idea, as it may turn out that a particular subset of the levels, when taken together, is more predictive of the output than any single level. If this is the case, the resulting tree will probably be more complex, having a greater number of node splits.

There are various ways to deal with the large increase in complexity associated with finding and evaluating all the different split points for categorical input features, but we won't go into further detail right now. Instead, let's write some R code to see how we might find the split point of a numerical input feature using the SSE criterion that CART uses:

compute_SSE_split <- function(v, y, split_point) {
  index <- v < split_point
  y1 <- y[index]
  y2 <- y[!index]
  SSE <- sum((y1 - mean(y1)) ^ 2) + sum((y2 - mean(y2)) ^ 2)
  return(SSE)
}

compute_all_SSE_splits <- function(v, y) {
  sapply(unique(v), function(sp) compute_SSE_split(v, y, sp))
}

The first function, compute_SSE_split(), takes in a feature vector v, an output vector y, and a specific value for the feature vector that we want to split on, namely split_point. It uses these to compute the SSE value for that split. To do this, it first creates an indexing vector, identifying all the elements of the feature vector whose value is less than the split_point value. This is then used to split the output vector into two groups, y1 and y2; the former contains elements identified by the indexing vector and the latter contains the remaining elements. Finally, the SSE is computed as the sum of the SSE of the two groups using the familiar formula of the sum of squared distances from the group mean.

The second function, compute_all_SSE_splits(), takes in a feature vector v, and an output vector y, and uses all the unique elements of the feature vector as potential split points (for simplicity, we are ignoring the fact that we should not be splitting on the smallest value of the feature vector).

In order to demonstrate how splitting works in CART, we will generate a small artificial data set. The following code snippet generates twenty random observations of two input features and an output variable, and stores the result in a data frame, rcart_df:

> set.seed(99)
> x1 <- rbinom(20, 1, 0.5)
> set.seed(100)
> x2 <- round(10 + rnorm(20, 5, 5), 2)
> set.seed(101)
> y <- round((1 + (x2 * 2 / 5) + x1 - rnorm(20, 0, 3)), 2)
> rcart_df <- data.frame(x1, x2, y)
> rcart_df
   x1    x2     y
1   1 12.49  7.97
2   0 15.66  5.61
3   1 14.61  9.87
4   1 19.43  9.13
5   1 15.58  7.30
6   1 16.59  5.11
7   1 12.09  4.98
8   0 18.57  8.77
9   0 10.87  2.60
10  0 13.20  6.95
11  1 15.45  6.60
12  1 15.48 10.58
13  0 13.99  2.31
14  1 18.70 13.88
15  1 15.62  8.96
16  1 14.85  8.52
17  0 13.06  8.77
18  0 17.55  7.84
19  0 10.43  7.63
20  0 26.55 17.77

In practice, twenty data points might be a suitable number to use as a threshold for building a leaf node, but for this example, we will simply suppose that we want to make a new split using these data. We have two input features, x1 and x2. The former is a binary input feature that we have coded using the numerical labels 0 and 1. This allows us to reuse the functions we just wrote to compute the possible splits. The latter is a numerical input feature. By applying our compute_all_SSE_splits() function to each feature separately, we can compute the SSE for every possible split point of each feature. The following two plots show these SSE values for each feature in turn:

(Plots: the SSE values for every candidate split point of x1 and of x2)

Looking at both plots, we can see that the best possible split produces an SSE value of 124.05 and this can be achieved by splitting on the feature x2 at the value 18.7. Consequently, our regression tree would contain a node with the following splitting rule:

 If x2 < 18.7

The CART methodology always applies the same logic to determine whether to make a new split at each node as well as how to pick which feature and value to split on. This recursive approach of splitting up the data points at each node to build the regression tree is why this process is also known as recursive partitioning.
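
As a minimal sketch (reusing the rcart_df data frame and the functions defined above, with a few new intermediate variable names of our own), we could locate this best split programmatically rather than reading it off the plots:

# SSE for every candidate split point of each feature
sse_x1 <- compute_all_SSE_splits(rcart_df$x1, rcart_df$y)
sse_x2 <- compute_all_SSE_splits(rcart_df$x2, rcart_df$y)

# Pair each candidate split point of x2 with its SSE and pick the smallest;
# according to the plots described above, this should be the split
# x2 < 18.7 with an SSE of 124.05
splits_x2 <- data.frame(split_point = unique(rcart_df$x2), SSE = sse_x2)
splits_x2[which.min(splits_x2$SSE), ]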

Tree pruning

If we were to allow the recursive partitioning process to repeat indefinitely, we would eventually terminate by having leaf nodes with a single data point each, because that is when we cannot split the data any further. This model would fit the training data perfectly, but it is highly unlikely that its performance would generalize on unseen data. Thus, tree-based models are susceptible to overfitting. To combat this, we need to control the depth of our final decision tree.

The process of removing nodes from the tree to limit its size and complexity is known as pruning. One possible pruning method is to impose a threshold on the minimum number of data points that must be assigned to a node in order to make a new split; if a node has fewer data points than this threshold, it becomes a leaf node. This will create leaf nodes earlier on in the procedure, and the data points that are assigned to them may not all have the same output. In this case, we can simply predict the average value for regression (and the most popular class for classification). This is an example of pre-pruning, as we are pruning the tree while building it and before it is fully constructed.

Intuitively, we should be able to see that the greater the depth of the tree and the smaller the average number of data points assigned to its leaf nodes, the greater the degree of overfitting. Of course, if we have too few nodes in the tree, we probably aren't being granular enough in our modeling of the underlying data.

The question of how large a tree should be allowed to grow is thus effectively a question of how to model our data as closely as possible while controlling the degree of overfitting. In practice, using pre-pruning is tricky as it is difficult to find an appropriate threshold.

Another regularization process that is used by the CART methodology to prune trees is known as cost-complexity tuning. In effect, trees are often allowed to grow fully using the recursive partitioning approach described in the previous section. Once this completes, we prune the resulting tree, that is to say, we start removing split points and merging leaf nodes to shrink the tree according to a certain criterion. This is known as post-pruning, as we prune the tree after it has been built. When we construct the original tree, the error function that we use is the SSE. To prune the tree, we instead minimize a penalized version of the SSE:

$$SSE_{\alpha} = SSE + \alpha T_p$$

Here, α is a complexity parameter controlling the degree of regularization and $T_p$ is the number of nodes in the tree, which is a way to measure the size of the tree. Similar to the way in which the lasso limits the size of regression coefficients in generalized linear models, this regularization procedure limits the size of the resulting tree. A very small value of α results in a small degree of pruning, and in the limit where α takes the value 0, we have no pruning at all. On the other hand, a high value for this parameter results in trees that are much shorter; in the limit, we can end up with a tree with no splits at all, which predicts a single average value of the output for all possible inputs.

It turns out that every particular value of α corresponds to a unique tree structure that minimizes this penalized form of the SSE for that particular value. Put differently, given a particular value for α, there is a unique and predictable way to prune a tree in order to minimize the penalized SSE, but the details of this procedure are beyond the scope of this book. For now, we can simply assume that every value of α is associated with a single tree.

This particular feature is very useful, as we don't have any ambiguity in picking a tree once we settle on a value for the complexity parameter α. It does not, however, give us a way to determine what actual value we should use. Cross-validation, which we saw in Chapter 5, Support Vector Machines, is a commonly used approach for estimating an appropriate value of this parameter. Cross-validation applied to this problem involves partitioning the data into k folds. We then train and prune k trees, each time using all the data excluding a single fold, and repeat this for each of the k folds. Finally, we measure the SSE on the folds held out for testing and average the results. We can repeat our cross-validation procedure for different values of α. Another approach, when more data is available, is to use a validation data set for evaluating models that have been trained on the same training data set but with different values of α.
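
As a hedged illustration (the chapter works with specific packages later on; here we simply assume the widely used rpart package, which implements the CART methodology), growing a deep tree and then pruning it back using cross-validated error estimates might look something like this:

library(rpart)

# Grow a deliberately deep tree on the artificial data from earlier;
# cp = 0 effectively disables complexity-based pre-pruning
fit <- rpart(y ~ x1 + x2, data = rcart_df, method = "anova",
             control = rpart.control(minsplit = 2, cp = 0))

# printcp() reports the cross-validated error (xerror) for the sequence
# of complexity parameter values that rpart considers
printcp(fit)

# Prune back to the complexity parameter with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)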

Missing data

One characteristic of decision trees is that they have a natural way of handling missing data during training. For example, when we consider which feature to split on at a particular node, we can ignore data points that have a missing value for a particular feature and compute the potential reduction in our error function (deviance, SSE, and so on) using the remaining data points. Note that while this approach is handy, it could potentially increase the bias of the model substantially, especially if we are ignoring a large portion of our available training data because of missing values.

One might wonder whether we are able to handle missing values during prediction for unseen data points. If we are at a particular node in the tree that splits on a feature for which our test data point has a missing value, we are seemingly stuck. In practice, this situation can be dealt with via the use of surrogate splits. The key notion behind these is that for every node in the tree, apart from the feature that was optimally chosen to split on, we keep track of a list of other features that produce splits in the data similar to the one produced by the feature we actually chose. In this way, when our testing data point has a missing value for a feature that we need in order to make a prediction, we can refer to a node's surrogate splits instead, and use a different feature for this node.
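
As another hedged sketch using the rpart package (whose rpart.control() parameters maxsurrogate and usesurrogate govern this behavior), we could inspect the surrogate splits that were chosen for a fitted tree:

library(rpart)

# Fit a small tree on the earlier artificial data; summary() lists, for each
# node, the primary split followed by any surrogate splits that mimic it
fit <- rpart(y ~ x1 + x2, data = rcart_df, method = "anova",
             control = rpart.control(minsplit = 5, maxsurrogate = 5))
summary(fit)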

Regression model trees

One of the potential drawbacks of regression trees built with CART is that, even though we limit the number of data points that are assigned to a particular leaf node, these may still show significant variation in the output variable among themselves. When this happens, taking the average value and using this as a single prediction for that leaf node may not be the best idea.

Regression model trees attempt to overcome this limitation by using the data points at each leaf node to construct a linear model to predict the output. The original regression model tree algorithm was developed by J. Ross Quinlan and is known as M5. The M5 algorithm computes a linear model at each node in the tree. For a test data point, we first follow the decision path from the root node down to a leaf node. The prediction that is then made is the output of the linear model associated with that leaf node.
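
As a minimal sketch of what this means in practice (reusing the rcart_df data frame from earlier with a made-up leaf assignment, not the actual M5 procedure), prediction at a model tree leaf replaces the leaf average with the output of a fitted linear model:

# Suppose these are the training rows that were routed to a particular leaf node
leaf_data <- rcart_df[rcart_df$x2 < 18.7, ]

# A model tree stores a linear model at the leaf instead of a single mean value...
leaf_model <- lm(y ~ x1 + x2, data = leaf_data)

# ...and a new data point routed to this leaf is predicted with that model
predict(leaf_model, newdata = data.frame(x1 = 1, x2 = 14.0))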

M5 also differs from the algorithm used in CART in that it employs a different criterion to determine which feature to split on. This criterion is the weighted reduction in standard deviation:

$$SDR = \sigma - \sum_{i=1}^{p} \frac{n_i}{n} \sigma_i$$

This general equation assumes that we split the data into p partitions (as we have seen for trees, p is typically 2). For each partition i, we compute the standard deviation $\sigma_i$. Then we compute the weighted average of these standard deviations using the relative size of each partition ($n_i / n$) as the weights. This is subtracted from the initial standard deviation $\sigma$ of the unpartitioned data.
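
A minimal sketch of this criterion in R (illustrative only, not Quinlan's implementation; groups is assumed to be a list holding the output values of each partition):

# Weighted standard deviation reduction (SDR) for a candidate split
sdr <- function(y, groups) {
  n <- length(y)
  weighted_sd <- sum(sapply(groups, function(g) (length(g) / n) * sd(g)))
  sd(y) - weighted_sd
}

# Example: splitting six output values into two partitions of three
outputs <- c(1.2, 1.4, 1.1, 5.3, 5.6, 5.1)
sdr(outputs, list(outputs[1:3], outputs[4:6]))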

The idea behind this criterion is that splitting a node should produce groups of data points that within each group display less variability with respect to the output variable than all the data points when grouped together. We'll have a chance to see M5 trees as well as CART trees in action later on in this chapter.

CART classification trees

Building classification trees using the CART methodology continues the notion of recursively splitting up groups of data points in order to minimize some error function. Our first guess for an appropriate error function might be the classification accuracy. It turns out that this is not a particularly good measure to use to build a classification tree, because it is relatively insensitive to changes in the class proportions at a node; a split can make the resulting groups considerably purer without changing the class that would be predicted, and hence without improving accuracy at all.

What we would actually like to use is a measure for node purity that would score nodes based on whether they contain data points primarily belonging to one of the output classes. This is a very intuitive idea because what we are effectively aiming for in a classification tree is to eventually be able to group our training data points into sets of data points at the leaf nodes, so that each leaf node contains data points belonging to only one of the classes. This will mean that we can confidently predict this class if we arrive at that leaf node during prediction.

One possible measure of node purity, which is frequently used with CART for classification trees, is the Gini index. For an output variable with K different classes, the Gini index G is defined as follows:

$$G = \sum_{k=1}^{K} \hat{p}_k (1 - \hat{p}_k)$$

To calculate the Gini index, we compute an estimate of the probability of every class and multiply this by the probability of not being in that class. We then add up all these products. For a binary classification problem, it should be easy to see that the Gini index evaluates to $2\hat{p}(1 - \hat{p})$, where $\hat{p}$ is the estimated probability of one of the classes.

To compute the Gini index at a particular node in a tree, we can simply use the ratio of the number of data points labeled as class k over the total number of data points as an estimate for the probability of a data point belonging to class k at the node in question. Here is a simple R function to compute the Gini index:

gini_index <- function(v) {
  t <- table(v)
  probs <- t / sum(t)
  terms <- sapply(probs, function(p) p * (1 - p) )
  return(sum(terms))
}

To compute the Gini index, our gini_index() function first tabulates all the entries in a vector. It divides each of these frequency counts by the total number of counts to transform them into probability estimates. Finally, it computes the product $\hat{p}_k(1 - \hat{p}_k)$ for each of these and sums up all the terms. Let's try a few examples:

> gini_index(v = c(0, 0, 0, 1, 1, 1))
[1] 0.5
> gini_index(v = c(0, 0, 0, 1, 1, 1, 1, 1, 1))
[1] 0.4444444
> gini_index(v = c(0, 0, 0, 1, 1, 1, 2, 2, 2))
[1] 0.6666667
> gini_index(v = c(1, 1, 1, 1, 1, 1))
[1] 0

Note how the Gini index for a completely pure node (a node with only one class) is 0. For a binary output with equal proportions of the two classes, the Gini index is 0.5. Similar to the weighted standard deviation reduction we saw for regression model trees, we use the weighted reduction in the Gini index, where we weight each partition by its relative size, to determine appropriate split points:

$$\Delta G = G - \sum_{i=1}^{p} \frac{n_i}{n} G_i$$
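
Reusing the gini_index() function from above, here is a minimal sketch (with made-up labels and grouping) of this weighted reduction for a single two-way split:

# Weighted reduction in the Gini index for a split defined by a logical vector
gini_reduction <- function(labels, in_left) {
  n <- length(labels)
  left <- labels[in_left]
  right <- labels[!in_left]
  gini_index(labels) -
    (length(left) / n) * gini_index(left) -
    (length(right) / n) * gini_index(right)
}

# Example: a split that mostly separates the two classes
labels <- c(1, 1, 1, 1, 0, 0, 0, 0)
gini_reduction(labels, in_left = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE))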

Another commonly used criterion is the deviance. When we studied logistic regression, we saw that this is just the constant -2 multiplied by the log-likelihood of the data. In a classification tree setting, we compute the deviance of a node as:

$$D = -2 \sum_{k=1}^{K} n_k \log(\hat{p}_k)$$

Here, $n_k$ is the number of observations of class k at the node. Unlike the Gini index, these class counts, and not just the class proportions, affect the value of the deviance. All nodes that have the same proportion of data points across the different classes will have the same value of the Gini index, but if they have different numbers of observations, they will have different values of the deviance. Under both splitting criteria, however, a completely pure node will have a value of 0 and any other node a positive value.
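
A minimal sketch of this computation in R (an illustrative helper, assuming the $-2\sum_k n_k \log(\hat{p}_k)$ form given above):

# Deviance of a node given the vector of class labels assigned to it
node_deviance <- function(v) {
  counts <- table(v)
  probs <- counts / sum(counts)
  -2 * sum(counts * log(probs))
}

node_deviance(c(0, 0, 0, 1, 1, 1))   # maximally impure binary node
node_deviance(c(1, 1, 1, 1, 1, 1))   # completely pure node: deviance is 0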

Aside from using a different splitting criterion, the logic to build a classification tree using the CART methodology is exactly parallel to that of building a regression tree. Missing values are handled in the same way and the tree is pre-pruned in the same way using a threshold on the number of data points left to build leaf nodes. The tree is also post-pruned using the same cost-complexity approach outlined for regression trees, but after replacing the SSE as the error function with either the Gini index or the deviance.

C5.0

The C5.0 algorithm, developed by Ross Quinlan, builds decision trees for classification. It is the latest in a chain of successively improved versions starting from an algorithm known as ID3, which developed into C4.5 (with an open source implementation in the Java programming language known as J48) before culminating in C5.0. There are a good many acronyms used for decision trees, but thankfully many of them are related to each other. The C5.0 chain of algorithms has several differences from the CART methodology, most notably in the choice of splitting criterion as well as in the pruning procedure.

The splitting criterion used with C5.0 is known as entropy or the information statistic, and has its roots in information theory. Entropy is defined as the average number of binary digits (bits) needed to communicate information via a message as a function of the probabilities of the different symbols used. Entropy also has roots in statistical physics, where it is used to represent the degree of chaos and uncertainty in a system. When the symbols or components of a system have equal probabilities, there is a high degree of uncertainty, but entropy is lower when one symbol is far likelier than the others. This observation renders the definition of entropy very useful in measuring node purity. The formal definition of entropy in bits for the multiclass scenario with K classes is:

$$H = -\sum_{k=1}^{K} \hat{p}_k \log_2(\hat{p}_k)$$

In the binary case, the equation simplifies to (where p arbitrarily refers to the probability of one of the two classes):

$$H = -p \log_2(p) - (1 - p) \log_2(1 - p)$$

We can compare entropy to the Gini index for binary classification in the following plot:

(Plot: entropy and the Gini index as a function of the class probability p for a binary classification problem)
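
As a purely illustrative sketch, a comparison plot like this could be drawn with base R graphics using the two binary formulas above:

# Binary entropy (in bits) and Gini index as functions of the class probability p
entropy_binary <- function(p) -p * log2(p) - (1 - p) * log2(1 - p)
gini_binary <- function(p) 2 * p * (1 - p)

p <- seq(0.01, 0.99, by = 0.01)
plot(p, entropy_binary(p), type = "l", lty = 1, ylim = c(0, 1),
     xlab = "Probability of first class", ylab = "Impurity")
lines(p, gini_binary(p), lty = 2)
legend("topright", legend = c("Entropy", "Gini index"), lty = c(1, 2))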

From the plot, we can see that both functions have the same general shape for the binary class problem. Recall that the lower the entropy, the lower the uncertainty we have about the distribution of our classes and hence we have higher node purity. Consequently, we want to minimize the entropy as we build our tree. In ID3, the splitting criterion that is used is the weighted entropy reduction, which is also known as the information gain:

$$Gain = H - \sum_{i=1}^{p} \frac{n_i}{n} H_i$$

It turns out that this criterion suffers from selection bias, in that it tends to favor categorical features with many levels, because these offer a far larger number of possible groupings than the comparatively limited range of splits available for continuous features. To combat this, from C4.5 onwards, the criterion was refined into the information gain ratio. This is a normalized version of the information gain, where we normalize with respect to a quantity known as the split information value.

This, in turn, represents the information we could gain just from knowing which partition a data point falls into, that is, from the sizes of the partitions themselves. A high split information value occurs when we have evenly sized partitions, and a low value occurs when most of the data points are concentrated in a small number of the partitions. In summary, we have this:

$$GainRatio = \frac{Gain}{SplitInfo}, \quad SplitInfo = -\sum_{i=1}^{p} \frac{n_i}{n} \log_2\left(\frac{n_i}{n}\right)$$
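
A minimal sketch of these quantities in R (illustrative helper functions with made-up labels and a made-up two-way split, not the C4.5 implementation):

# Entropy in bits of a vector of class labels
entropy <- function(v) {
  probs <- table(v) / length(v)
  -sum(probs * log2(probs))
}

# Information gain and gain ratio for a split given as a list of partitions
information_gain <- function(labels, partitions) {
  weights <- sapply(partitions, length) / length(labels)
  entropy(labels) - sum(weights * sapply(partitions, entropy))
}

gain_ratio <- function(labels, partitions) {
  weights <- sapply(partitions, length) / length(labels)
  split_info <- -sum(weights * log2(weights))
  information_gain(labels, partitions) / split_info
}

labels <- c("a", "a", "a", "b", "b", "b")
partitions <- list(labels[1:4], labels[5:6])
information_gain(labels, partitions)
gain_ratio(labels, partitions)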

The C5.0 chain of algorithms also incorporates alternative methods to prune a tree that go beyond the simple elimination of nodes and sub-trees. For example, an inner node may be removed so that the sub-tree beneath it is pushed up (raised) to replace the removed node. C5.0, in particular, is a very powerful algorithm that also contains improvements in speed and memory usage, native support for boosting (covered in the next chapter), and the ability to specify a cost matrix so that the algorithm can avoid making certain types of misclassifications over others, just as we saw with support vector machines in the previous chapter.

We'll demonstrate how to build trees with C5.0 in R in a subsequent section.
