Classifying and predicting with decision trees

Decision trees are among the most frequently used data mining algorithms. The algorithm is not very complex, yet it gives good results in many cases, and the results are easy to understand and interpret. You use decision trees for classification and prediction. Typical usage scenarios include:

  • Predicting which customers will leave
  • Targeting the audience for mailings and promotional campaigns
  • Explaining reasons for a decision

Decision trees are a directed (supervised) technique. Your target variable is the one that holds the information about a particular decision, divided into a few discrete, broad categories (yes/no; liked/partially liked/disliked; and so on). You try to explain this decision using other information stored in the input variables (demographic data, purchasing habits, and so on). Based on the trained model, you then predict the target variable for a new case from the known values of its input variables, with limited statistical certainty.
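
For illustration only (this preparation is assumed to have been done earlier; the data frame name TM and the 0/1 encoding of the column are assumptions, not shown in this section), encoding a target column as a factor with two discrete classes could look like this:

# Assumption: TM is a data frame with a 0/1 integer column named BikeBuyer. 
# Encode the target as a factor with the two discrete classes No and Yes. 
TM$BikeBuyer <- factor(TM$BikeBuyer, 
                       levels = c(0, 1), 
                       labels = c("No", "Yes")); 
# Check the class distribution of the target variable. 
table(TM$BikeBuyer); 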

You use recursive partitioning to build the tree. The data is split into partitions using a specific value of one of the input (explaining) variables; the partitions are then split again and again. Initially, all of the data is in one big box. For the initial split, the algorithm tries all possible splits over all of the input variables, looking for the split that yields the purest partitions with respect to the classes of the target variable. The tree then continues to grow, using the new partitions as separate starting points and splitting them further. You have to stop the process somewhere; otherwise, you could end up with a completely fitted tree that has only a single case in each leaf. Each leaf would, of course, be absolutely pure, but the results would be useless for any meaningful prediction: the predictions would be 100% accurate on the training data only and would not generalize to new cases. This phenomenon is called overfitting.
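
To make the idea of purity concrete, here is a minimal sketch (an addition to the text, not part of the model-building code) that computes the Gini impurity of a partition from the class proportions of the target variable; a split is preferred when it lowers the impurity of the resulting partitions:

# Gini impurity of a vector of class labels: 
# 1 minus the sum of squared class proportions (0 means a perfectly pure partition). 
gini <- function(classes) { 
  p <- prop.table(table(classes)); 
  1 - sum(p ^ 2); 
} 
# A mixed partition versus a purer one. 
gini(c("Yes", "Yes", "No", "No"));         # 0.50 - maximally impure for two classes 
gini(c("Yes", "Yes", "Yes", "Yes", "No")); # 0.32 - purer 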

The following code uses the rpart() function from the rpart package (shipped with the standard R installation) to build a decision tree using all possible independent variables, with the factors properly defined, as used in the last logistic regression model:

library(rpart); 
# Grow a classification tree for BikeBuyer from all independent variables. 
TMDTree <- rpart(BikeBuyer ~ MaritalStatus + Gender + 
                 TotalChildren + NumberChildrenAtHome + 
                 Education + Occupation + 
                 HouseOwnerFlag + NumberCarsOwned + 
                 CommuteDistance + Region + 
                 YearlyIncome + Age, 
                 method = "class", data = TM.train); 

You can plot the tree to understand how the splits were made. The following code uses the prp() function from the rpart.plot package to plot the tree:

install.packages("rpart.plot"); 
library(rpart.plot); 
prp(TMDTree, type = 2, extra = 104, fallen.leaves = FALSE); 

You can see the plot of the decision tree in the following figure. You can easily read the rules from the tree; for example, if the number of cars owned is two, three, or four, and the yearly income is not lower than 65,000, then approximately 67% of the customers in that node are non-buyers (in the following figure, just follow the path to the leftmost node). The same rules can also be extracted as text, as shown after the figure:

Decision tree plot
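
If reading the rules from the plot is cumbersome, newer versions of the rpart.plot package (3.0.0 and later; this is an optional addition, not part of the original example) can also print the rule that leads to each leaf as text:

# Print one row per leaf: the predicted probability of buying, 
# the percentage of cases covered, and the conditions that lead to the leaf. 
rpart.rules(TMDTree, cover = TRUE); 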

So how does this model perform? The following code creates the classification matrix for this model:

predDT <- predict(TMDTree, TM.test, type = "class"); 
perfDT <- table(TM.test$BikeBuyer, predDT, 
                dnn = c("Actual", "Predicted")); 
perfDT; 

Here are the results:

          Predicted
    Actual   No  Yes
       No  2119  718
       Yes 1232 1477
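
To summarize the matrix in a single number, you can compute the overall accuracy, which is simply the share of correctly classified cases (this small calculation is an addition to the original text):

# Overall accuracy: correctly classified cases divided by all test cases. 
accDT <- sum(diag(perfDT)) / sum(perfDT); 
accDT; 

For the matrix shown previously, this works out to (2119 + 1477) / 5546, or roughly 65 percent.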

Compared with the last logistic regression model, the predictions are better in some cells and worse in others. The number of false negatives (actual Yes, predicted No) is especially disappointing. So let's try another model. This time, the following code uses the ctree() function from the party package:

install.packages("party", dependencies = TRUE); 
library("party"); 
TMDT <- ctree(BikeBuyer ~ MaritalStatus + Gender + 
              TotalChildren + NumberChildrenAtHome + 
              Education + Occupation + 
              HouseOwnerFlag + NumberCarsOwned + 
              CommuteDistance + Region + 
              YearlyIncome + Age, 
              data=TM.train); 
predDT <- predict(TMDT, TM.test, type = "response"); 
perfDT <- table(TM.test$BikeBuyer, predDT, 
                dnn = c("Actual", "Predicted")); 
perfDT; 

The results are:

          Predicted
    Actual   No  Yes
       No  2190  647
       Yes  685 2024

Now you can finally see some big improvements. The version of the decision trees algorithm used by the ctree() function is called conditional inference trees. This version uses statistical significance tests to decide where to make the splits. The function accepts many parameters. For example, the following code creates a new model, this time lowering the criterion required for a split (the mincriterion control, which by default requires 1 - p-value to exceed 0.95), thus forcing more splits:

TMDT <- ctree(BikeBuyer ~ MaritalStatus + Gender + 
              TotalChildren + NumberChildrenAtHome + 
              Education + Occupation + 
              HouseOwnerFlag + NumberCarsOwned + 
              CommuteDistance + Region + 
              YearlyIncome + Age, 
              data=TM.train,  
              controls = ctree_control(mincriterion = 0.70)); 
predDT <- predict(TMDT, TM.test, type = "response"); 
perfDT <- table(TM.test$BikeBuyer, predDT, 
                dnn = c("Actual", "Predicted")); 
perfDT; 

The classification matrix for this model is as follows:

          Predicted
    Actual   No  Yes
       No  2200  637
       Yes  603 2106

The predictions have slightly improved again. You could continue tuning like this until you reach the desired quality of the predictions, as sketched below.
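
As a rough sketch of such tuning (an addition to the text; the parameter values are arbitrary examples), you could loop over a few mincriterion values and compare the test-set accuracy of the resulting models:

# Try several split criteria and report the test-set accuracy for each. 
for (mc in c(0.95, 0.85, 0.70, 0.50)) { 
  TMDTtmp <- ctree(BikeBuyer ~ MaritalStatus + Gender + 
                   TotalChildren + NumberChildrenAtHome + 
                   Education + Occupation + 
                   HouseOwnerFlag + NumberCarsOwned + 
                   CommuteDistance + Region + 
                   YearlyIncome + Age, 
                   data = TM.train, 
                   controls = ctree_control(mincriterion = mc)); 
  predTmp <- predict(TMDTtmp, TM.test, type = "response"); 
  acc <- sum(predTmp == TM.test$BikeBuyer) / nrow(TM.test); 
  cat("mincriterion =", mc, " accuracy =", round(acc, 4), "\n"); 
} 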
