Decision trees

Decision Trees are a predictive modeling technique that generates a set of rules, each of which determines the likelihood of an outcome from the answers to the questions that precede it. A decision tree is typically structured like a flowchart: a series of nodes connected in parent-child relationships. Nodes that do not link to any further nodes are known as leaves.

Decision Trees belong to a class of algorithms often referred to as CART (Classification and Regression Trees). If the outcome of interest is a categorical variable, the result is a classification tree, whereas if the outcome is numeric, it is a regression tree.
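To make the distinction concrete, the short sketch below uses rpart with R's built-in iris and mtcars datasets (chosen purely for illustration, rather than the diabetes data used later). The same rpart() call builds a classification tree when the outcome is a factor and a regression tree when it is numeric:

library(rpart) 
 
# Classification tree: Species is a factor, so rpart builds a classification tree 
class_tree <- rpart(Species ~ ., data = iris) 
 
# Regression tree: mpg is numeric, so rpart builds a regression (anova) tree 
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars) 
 
print(class_tree) 
print(reg_tree) 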

An example will help to make this concept clearer. Take a look at the chart:

The chart shows a hypothetical scenario: deciding whether a school is closed or not. The rectangular boxes (in blue) represent the nodes. The first rectangle (School Closed) represents the root node, whereas the inner rectangles represent the internal nodes. The rectangular boxes with angled edges (in green and italic letters) represent the 'leaves' (or terminal nodes).

Decision Trees are simple to understand and one of the few algorithms that are not a 'black box'. Algorithms such as those used to create Neural Networks are often considered black boxes, as it is very hard - if not impossible - to intuitively determine the exact path by which a final outcome was reached due to the complexity of the model.

In R, there are various facilities for creating Decision Trees; a commonly used library is rpart. We'll revisit our PimaIndiansDiabetes dataset to see how a decision tree can be created using the package.

We would like to create a model to determine how glucose, insulin, (body) mass, and age are related to diabetes. Note that in the dataset, diabetes is a categorical variable with a yes/no response.

To visualize the decision tree, we will use the rpart.plot package. The code is as follows:

install.packages("rpart") 
install.packages("rpart.plot") 
 
library(rpart) 
library(rpart.plot) 
 
rpart_model<- rpart (diabetes ~ glucose + insulin + mass + age, data = PimaIndiansDiabetes) 
 
 
> rpart_model 
n= 768  
 
node), split, n, loss, yval, (yprob) 
      * denotes terminal node 
 
  1) root 768 268 neg (0.6510417 0.3489583)   
    2) glucose< 127.5 485  94 neg (0.8061856 0.1938144) * 
    3) glucose>=127.5 283 109 pos (0.3851590 0.6148410)   
      6) mass< 29.95 76  24 neg (0.6842105 0.3157895)   
       12) glucose< 145.5 41   6 neg (0.8536585 0.1463415) * 
       13) glucose>=145.5 35  17 pos (0.4857143 0.5142857)   
         26) insulin< 14.5 21   8 neg (0.6190476 0.3809524) * 
         27) insulin>=14.5 14   4 pos (0.2857143 0.7142857) * 
      7) mass>=29.95 207  57 pos (0.2753623 0.7246377)   
       14) glucose< 157.5 115  45 pos (0.3913043 0.6086957)   
         28) age< 30.5 50  23 neg (0.5400000 0.4600000)   
           56) insulin>=199 14   3 neg (0.7857143 0.2142857) * 
           57) insulin< 199 36  16 pos (0.4444444 0.5555556)   
            114) age>=27.5 10   3 neg (0.7000000 0.3000000) * 
            115) age< 27.5 26   9 pos (0.3461538 0.6538462) * 
         29) age>=30.5 65  18 pos (0.2769231 0.7230769) * 
       15) glucose>=157.5 92  12 pos (0.1304348 0.8695652) * 
 
> rpart.plot(rpart_model, extra=102, nn=TRUE)

The plot shown below illustrates the decision tree that the model rpart_model represents.

Reading from the top, the graph shows that there are 500 cases of diabetes=neg in the dataset (out of a total of 768 records).

> sum(PimaIndiansDiabetes$diabetes=="neg") 
[1] 500 

Of the 768 records in the dataset, 485 had a glucose value below 127.5 (shown as glucose < 128 in the plot). This is Node Number 2, the first one on the left from the bottom; it predicts neg, and since 391 of those 485 records are in fact negative, the model classifies them correctly.

For the records with a glucose reading of 127.5 or higher, there were 283 in total (Node Number 3, the node immediately below the topmost/root node). This node predicts pos, and the model correctly classified 174 of these cases.
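Assuming the dataset is still loaded as shown earlier, these node counts can be verified directly in base R (127.5 is the split value reported in the printed model):

> node2 <- subset(PimaIndiansDiabetes, glucose < 127.5) 
> nrow(node2) 
[1] 485 
> sum(node2$diabetes == "neg") 
[1] 391 
> node3 <- subset(PimaIndiansDiabetes, glucose >= 127.5) 
> nrow(node3) 
[1] 283 
> sum(node3$diabetes == "pos") 
[1] 174 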

Another, more recent package for building intuitive decision trees with comprehensive visual information is FFTrees (Fast-and-Frugal Decision Trees). The following example is provided for informational purposes:

install.packages("FFTrees") 
library(caret) 
library(mlbench) 
library(FFTrees) 
set.seed(123) 
 
data("PimaIndiansDiabetes") 
diab<- PimaIndiansDiabetes 
diab$diabetes<- 1 * (diab$diabetes=="pos") 
 
train_ind<- createDataPartition(diab$diabetes,p=0.8,list=FALSE,times=1) 
 
training_diab<- diab[train_ind,] 
test_diab<- diab[-train_ind,] 
 
diabetes.fft<- FFTrees(diabetes ~.,data = training_diab,data.test = test_diab) 
plot(diabetes.fft)

The plot below illustrates the FFTrees decision tree stored in diabetes.fft.

Decision Trees work by splitting the data recursively until a stopping criterion is reached, such as a maximum depth, or the number of cases in a node falling below a specified value. Each split is made on the variable that produces the 'purest' subsets.
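For classification trees, rpart uses the Gini index as its default purity measure. The sketch below is a hand-rolled illustration (not rpart's internal code) that computes the reduction in Gini impurity achieved by the glucose < 127.5 split at the root of the tree above:

library(mlbench) 
data("PimaIndiansDiabetes") 
 
# Gini impurity of a vector of class labels: 1 minus the sum of squared class proportions 
gini <- function(labels) { 
  p <- table(labels) / length(labels) 
  1 - sum(p^2) 
} 
 
y     <- PimaIndiansDiabetes$diabetes 
left  <- y[PimaIndiansDiabetes$glucose <  127.5] 
right <- y[PimaIndiansDiabetes$glucose >= 127.5] 
 
impurity_before <- gini(y) 
impurity_after  <- (length(left) * gini(left) + length(right) * gini(right)) / length(y) 
impurity_before - impurity_after   # the impurity reduction gained by this split 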

In principle, we can grow an endless number of trees from a given set of variables, which makes finding the best tree a particularly hard, intractable problem. Numerous algorithms exist that provide efficient methods for splitting and creating decision trees. One such method is Hunt's Algorithm.

Further details about the algorithm can be found at: https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf.
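To give a flavour of how such a recursive, greedy construction proceeds, the following toy sketch grows a small tree in a similar spirit. It is purely illustrative (it is not rpart's or Hunt's exact implementation), and the choice of predictors, the depth limit, and the minimum node size are arbitrary assumptions:

# A toy recursive partitioner (illustrative only). At each node it greedily picks 
# the numeric split with the lowest weighted Gini impurity and recurses until a 
# node is pure, too small, or too deep. 
hunt <- function(data, target, depth = 1, max_depth = 3, min_n = 20) { 
  y <- data[[target]] 
  if (length(unique(y)) == 1 || nrow(data) < min_n || depth > max_depth) { 
    return(names(which.max(table(y))))          # leaf: majority class 
  } 
  gini <- function(v) { p <- table(v) / length(v); 1 - sum(p^2) } 
  best <- NULL 
  for (var in setdiff(names(data), target)) { 
    for (cut in unique(data[[var]])) { 
      l <- y[data[[var]] < cut]; r <- y[data[[var]] >= cut] 
      if (length(l) == 0 || length(r) == 0) next 
      imp <- (length(l) * gini(l) + length(r) * gini(r)) / length(y) 
      if (is.null(best) || imp < best$imp) best <- list(var = var, cut = cut, imp = imp) 
    } 
  } 
  if (is.null(best)) return(names(which.max(table(y)))) 
  list(split = paste(best$var, "<", best$cut), 
       left  = hunt(data[data[[best$var]] <  best$cut, , drop = FALSE], target, depth + 1, max_depth, min_n), 
       right = hunt(data[data[[best$var]] >= best$cut, , drop = FALSE], target, depth + 1, max_depth, min_n)) 
} 
 
library(mlbench) 
data("PimaIndiansDiabetes") 
str(hunt(PimaIndiansDiabetes[, c("glucose", "mass", "age", "diabetes")], "diabetes")) 

At each node the sketch chooses the single split with the lowest weighted Gini impurity, the same basic idea rpart applies, although production implementations add many refinements such as pruning and surrogate splits.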
