If you recall from Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, a decision tree builds a model much like a flowchart in which decision nodes, leaf nodes, and branches define a series of decisions that are used to classify examples. Such trees can also be used for numeric prediction by making only small adjustments to the tree-growing algorithm. In this section, we will consider only the ways in which trees for numeric prediction differ from trees used for classification.
Trees for numeric prediction fall into two categories. The first type, known as regression trees, was introduced in the 1980s as part of the seminal Classification and Regression Tree (CART) algorithm. Despite the name, regression trees do not use the linear regression methods described earlier in this chapter; rather, they make predictions based on the average value of the examples that reach a leaf.
The second type of tree for numeric prediction is known as a model tree. Introduced several years after regression trees, model trees are lesser known but perhaps more powerful. They are grown in much the same way as regression trees, but at each leaf, a multiple linear regression model is built from the examples reaching that node. Depending on the number of leaf nodes, a model tree may build tens or even hundreds of such models. This can make a model tree more difficult to understand than the equivalent regression tree, with the benefit that it may result in a more accurate model.
Trees that can perform numeric prediction offer a compelling yet often overlooked alternative to regression modeling. The strengths and weaknesses of regression trees and model trees relative to the more common regression methods are listed in the following table:
| Strengths | Weaknesses |
|---|---|
| Combines the strengths of decision trees with the ability to model numeric data | Not as well known as linear regression |
| Performs feature selection automatically, which allows the approach to be used with a very large number of features | Requires a large amount of training data |
| May fit data with complex, non-linear relationships much better than linear regression | Difficult to determine the overall net effect of individual features on the outcome |
| Does not require the form of the model or the distribution of the data to be specified in advance | Large trees can become more difficult to interpret than a regression model |
Though traditional regression methods are typically the first choice for numeric prediction tasks, in some cases numeric decision trees offer distinct advantages. For instance, decision trees may be better suited for tasks with many features or with many complex, non-linear relationships between the features and the outcome; such situations present challenges for regression. Regression modeling also makes assumptions about how numeric data is distributed that are often violated in real-world data, which is not the case for trees.
Trees for numeric prediction are built in much the same way as they are for classification. Beginning at the root node, the data is partitioned using a divide-and-conquer strategy according to the feature that will result in the greatest increase in homogeneity in the outcome after a split is performed. In classification trees, you will recall that homogeneity is measured by entropy, which is undefined for numeric data. Instead, for numeric decision trees, homogeneity is measured by statistics such as variance, standard deviation, or absolute deviation from the mean.
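Each of these statistics can be computed for a vector of numeric outcome values in R. The vector y used here is arbitrary and purely illustrative:

> y <- c(1, 2, 2, 3, 7, 8, 9)
> var(y)                  # variance
> sd(y)                   # standard deviation
> mean(abs(y - mean(y)))  # mean absolute deviation from the mean

The smaller the value of the chosen statistic within a group, the more homogeneous that group's outcome values are.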
One common splitting criterion is called the Standard Deviation Reduction (SDR). It is defined by the following formula:

$$\text{SDR} = \mathrm{sd}(T) - \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \mathrm{sd}(T_i)$$
In this formula, the sd(T) function refers to the standard deviation of the values in set T, while T1, T2, ..., Tn are the sets of values resulting from a split on a feature. The |T| term refers to the number of observations in set T. Essentially, the formula measures the reduction in standard deviation by comparing the standard deviation pre-split to the weighted standard deviation post-split.
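To make the weighting explicit, the same calculation can be written as a small R helper function. This is only a sketch; the sdr() name is hypothetical and assumes the post-split groups are supplied as a list of numeric vectors:

> sdr <- function(tee, splits) {
+   sd(tee) - sum(sapply(splits, function(t_i)
+     length(t_i) / length(tee) * sd(t_i)))
+ }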
For example, consider a case in which a tree is deciding whether to split on binary feature A or binary feature B. Using the groups that would result from each proposed split, we can compute the SDR for A and B as follows. The length() function used here returns the number of elements in a vector. Note that the overall group T is named tee to avoid masking R's built-in T (the shorthand for TRUE) and the t() function:
> tee <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)
> at1 <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
> at2 <- c(6, 6, 7, 7, 7, 7)
> bt1 <- c(1, 1, 1, 2, 2, 3, 4)
> bt2 <- c(5, 5, 6, 6, 7, 7, 7, 7)
> sdr_a <- sd(tee) - (length(at1) / length(tee) * sd(at1) + length(at2) / length(tee) * sd(at2))
> sdr_b <- sd(tee) - (length(bt1) / length(tee) * sd(bt1) + length(bt2) / length(tee) * sd(bt2))
Let's compare the SDR of A against the SDR of B:
> sdr_a
[1] 1.202815
> sdr_b
[1] 1.392751
The SDR for the split on feature A was about 1.2 versus about 1.4 for the split on feature B. Since the standard deviation was reduced more by the split on B, the decision tree would use B first; it results in slightly more homogeneous sets than a split on A would.
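If you use the hypothetical sdr() helper sketched earlier, the same values can be obtained directly from the grouped vectors:

> sdr(tee, list(at1, at2))
[1] 1.202815
> sdr(tee, list(bt1, bt2))
[1] 1.392751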
Suppose that the tree stopped growing here, after this one and only split; the regression tree's work would then be done. It can make predictions for new examples depending on whether the example's value of feature B places it into group T1 or T2. If the example ends up in T1, the model predicts mean(bt1) = 2; otherwise, it predicts mean(bt2) = 6.25.
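These predicted values are simply the group means, which we can confirm in R:

> mean(bt1)
[1] 2
> mean(bt2)
[1] 6.25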
In contrast, a model tree would go one step further. Using the seven training examples falling into group T1 and the eight in T2, the model tree could build a linear regression model of the outcome versus feature A. Note that feature B is of no help in building these regression models because all examples at a given leaf have the same value of B; they were placed into T1 or T2 according to their value of B. The model tree can then make predictions for new examples using whichever of the two linear models applies.
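As a rough sketch of this idea, suppose the binary features were coded as 0/1 vectors a and b that reproduce the groups defined above; this coding is illustrative and is not given in the example. A leaf-level regression model for the T2 group could then be fit as follows:

> a <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)  # split that produced at1/at2
> b <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)  # split that produced bt1/bt2
> leaf_t2 <- data.frame(y = tee, a = a)[b == 1, ]       # the eight examples in T2
> m_t2 <- lm(y ~ a, data = leaf_t2)                     # leaf-level linear model
> predict(m_t2, data.frame(a = 1))

For a new example that B places in T2 and that has a = 1, this leaf model would predict about 6.67 (the mean outcome of the six such training examples), whereas the regression tree would predict the single leaf average of 6.25.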
To further illustrate the differences between these two approaches, let's work through a real-world example.