Performing the analyses in R

Now that we have our data ready, we will focus on performing the analyses in R.

Classification with C4.5

We will first predict the income of the participants using C4.5.

The unpruned tree

We will start by examining the unpruned tree. This is configured using the control = Weka_control(U = TRUE) argument of the J48() function in RWeka, which uses the formula notation we have seen previously. The dot (.) after the tilde indicates that all attributes except the class attribute are to be used as predictors. We use the control argument to tell R that we want an unpruned tree (we will discuss pruning later):

library(RWeka)
C45tree = J48(income ~ . , data = AdultTrain,
   control = Weka_control(U = TRUE))

You can examine the tree by typing:

C45tree

We will not display it here as it is very big: the size of the tree is 5,715, with 4,683 leaves; but we can examine how well the tree classified the cases:

summary(C45tree)

The performance of the classifier on the training dataset

We can see that about 89 percent of cases are correctly classified, and that the kappa statistic (which we discussed in the previous chapter) is 0.78, which is not bad. In practice, a value of 0.60 or higher is generally recommended.
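As a reminder of what kappa adds over raw accuracy, here is a minimal base R sketch that computes Cohen's kappa by hand from a hypothetical confusion matrix: it is the observed agreement corrected for the agreement expected by chance. The counts are made up for illustration, not taken from the Adult data:

```r
# Hypothetical 2 x 2 confusion matrix (rows = actual, columns = predicted)
tab <- matrix(c(50, 10,
                 5, 35), nrow = 2, byrow = TRUE)

n  <- sum(tab)
po <- sum(diag(tab)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(kappa, 2)   # 0.69, while the raw accuracy is 0.85
```

Even a classifier that guesses at random gets some cases right; kappa discounts that, which is why it can look much less flattering than accuracy.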

The following will try to classify the test set and assign the predictions to a new attribute in a data frame called Predictions:

Predictions = data.frame(matrix(nrow = nrow(AdultTest), ncol=0))
Predictions$C45 = predict(C45tree, AdultTest)

The pruned tree

Let's examine what happens with a pruned tree, before we see the result on unseen data (the testing dataset):

C45pruned = J48(income ~ . , data= AdultTrain,
   control= Weka_control(U=FALSE))

The resulting tree is smaller, but still quite big; it has a size of 2,278 and 1,767 leaves. Typing the following line, you will see that around the same number of instances were correctly classified and that kappa is now 0.76. Pruning the tree decreased the classification performance on the training data:

summary(C45pruned)

The following will classify the test set and assign the predictions to a new attribute called C45pr, which we will examine later:

Predictions$C45pr = predict(C45pruned, AdultTest)

C50

As a reminder, C5.0 can perform boosting, which is the reiteration of the classification with a higher weight given to misclassified observations. Let's run a boosted C5.0 with 10 trials (boosted 10 times) and examine the output. Only the accuracy and confusion matrix are displayed here, but you can examine the tree on your screen:

library(C50)
C50tree = C5.0(y = AdultTrain$income, x = AdultTrain[,-13],
   trials = 10)
summary(C50tree)

Here is part of the output (at the bottom of your screen):

Evaluation on training data (22654 cases):

        Decision Tree
      ----------------
      Size      Errors

       708  1980 ( 8.7%)   <<

        (a)    (b)    <-classified as
       ----   ----
      10061   1266    (a): class small
        714  10613    (b): class large

The size of the tree is 708, which is much smaller than the unpruned C4.5 tree we saw previously. We can see that the accuracy is a bit better using boosted C5.0 (8.7 percent misclassified observations). What about the kappa value? Let's try another way to get it this time! It can be obtained by first creating a table from the confusion matrix and then calling the cohen.kappa() function from the psych package on it:

TabC5.0 = as.table(matrix(c(10061, 1266, 714, 10613),
   byrow = TRUE, nrow = 2))
library(psych)
cohen.kappa(TabC5.0)

The output shows that the kappa value is 0.83. On the training data, C5.0 is thus a little better than C4.5. Let's now create the prediction on unseen data, which we will examine later:

Predictions$C5.0 = predict(C50tree, AdultTest)
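The reweighting idea behind boosting can be sketched in a few lines of base R. This is a generic AdaBoost-style update on made-up data, not the exact scheme C5.0 uses internally: cases misclassified by the previous trial gain weight, the others lose weight, and the weights are renormalized before the next trial:

```r
n <- 10
w <- rep(1 / n, n)            # start with uniform case weights
# Pretend the first trial misclassified the last two cases
misclassified <- c(rep(FALSE, 8), rep(TRUE, 2))

err   <- sum(w[misclassified])        # weighted error rate: 0.2
alpha <- 0.5 * log((1 - err) / err)   # importance of this trial's vote
w <- w * exp(ifelse(misclassified, alpha, -alpha))
w <- w / sum(w)                       # renormalize to sum to 1

round(w, 4)   # correctly classified cases: 0.0625; misclassified: 0.25
```

The next trial thus concentrates on the two cases the first trial got wrong, which is why boosting often improves on a single tree.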

CART

We will try to classify the same data again and see how well the CART performs on our training and testing datasets:

library(rpart)
CARTtree = rpart(income ~ . , data = AdultTrain)

The following line of code displays a big output about the performed splits. We will not comment on it here, as the plot that follows is more informative:

summary(CARTtree)

Here, we will simply include the plot of the tree, which is more readable:

library(rpart.plot)
rpart.plot(CARTtree, extra = 1)

As can be noted on the graph, the tree is far simpler than with C4.5 or C5.0:


A graph of the tree using CART

Obtaining the confusion matrix for the training data is a bit more difficult with CART: it requires predicting the class on the training data. The predictions are made in terms of probabilities (as you can see when looking at the tree in its textual form). We therefore recode values higher than .5 as a large income and lower values as a small income. Finally, we display the confusion matrix and the kappa value:

ProbsCART = predict(CARTtree, AdultTrain)
PredictCART = rep(0, nrow(ProbsCART))
PredictCART[ProbsCART[,1] <= .5] = "small"
PredictCART[ProbsCART[,1] > .5] = "large"
TabCART = table(AdultTrain$income, PredictCART)
TabCART
cohen.kappa(TabCART)

The output is shown here:

       PredictCART
        small  large
small    8640   2687
large    1630   9697

The accuracy for this classification is not very different from the other algorithms:

(8640 + 9697)/sum(TabCART) = 0.81, that is, 81 percent.

However, the kappa value is lower (0.62). Classification with CART is not good for this dataset, but there are ways to improve it.

Pruning

As mentioned previously, pruning in CART is based on generating simplified trees and assessing them on another dataset. So let's try to obtain an even simpler tree than before. The cp argument in the code corresponds to the complexity parameter that we discussed at the beginning of the section:

CARTtreePruned = prune(CARTtree, cp=0.03)
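Here we fixed cp = 0.03 by hand. In practice, you can let the cross-validation results that rpart stores guide this choice. The sketch below uses the small kyphosis dataset shipped with rpart (not our Adult data): printcp() displays the complexity table, and we prune at the cp value with the lowest cross-validated error (xerror):

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(fit)   # one row per candidate cp: CP, nsplit, rel error, xerror, xstd

# Prune at the cp value that minimizes the cross-validated error
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)
```

Because the xerror column comes from internal cross-validation, the selected cp can vary from run to run on small datasets; set a seed if you need reproducibility.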

The pruned tree is even smaller now:

rpart.plot(CARTtreePruned, extra = 1)

A graph of a pruned tree using CART

Let's now comment a bit on this tree. The root node splits on the relationship attribute. Let's examine the levels of this attribute:

levels(AdultTrain$relationship)

The output is as follows:

[1] "Husband"        "Not-in-family"  "Other-relative" "Own-child"
[5] "Unmarried"      "Wife"

From the plot, we can see that people in the Not-in-family, Other-relative, Own-child, and Unmarried categories mostly have a small income. Among the remaining people, the tree splits on the capital-gain attribute: those with a capital gain lower than 4668 mostly have a small income, while the others mostly have a large income.

Using the same approach as before, we obtain an accuracy of 78 percent on the training dataset and 81 percent on the testing set. The kappa values here are only about 0.56. This is not encouraging. However, remember that things can go much better with other datasets, including your own data (give it a try).

We now classify the test set for later:

ProbsCARTtest = predict(CARTtreePruned, AdultTest)
Predictions$CART = ifelse(ProbsCARTtest[,1] <= .5, "small", "large")

Random forests in R

Let's now fit a random forest to the same data, using the default parameters:

library(randomForest)
RF = randomForest(y = AdultTrain$income, x = AdultTrain[,-13])

The confusion matrix can be obtained by typing:

RF

The output is as follows:

Call:
 randomForest(x = AdultTrain[, -13], y = AdultTrain$income)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 35.79%
Confusion matrix:
      small large class.error
small 11301    26   0.0022954
large  8081  3246   0.7134281

This is pretty disappointing. The classification is even worse than with CART. What about prediction accuracy on the testing dataset? Let's add this to our Predictions data frame:

Predictions$RF = predict(RF, AdultTest)

Note

Here, we used the default parameters. The classification depends notably on two arguments (type ?randomForest in the console for more):

  • mtry: This determines the number of predictors to be included for each split. By default, mtry = sqrt(p), where p is the total number of predictors in the analysis.
  • cutoff: This determines the cutoff for the probability of membership to be used in the classification. By default, for dichotomous classification, cutoff = c(0.5, 0.5), which means that a cutoff value of 0.5 is used as a threshold.
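To make the role of cutoff concrete, here is a base R sketch of the decision rule described in the randomForest documentation: for each case, the winning class is the one with the largest ratio of vote proportion to cutoff. The vote matrix below is made up for illustration:

```r
# Hypothetical vote proportions for three cases (each row sums to 1)
votes <- matrix(c(0.62, 0.38,
                  0.48, 0.52,
                  0.55, 0.45),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("small", "large")))

# Default cutoff for two classes: the class with most votes wins
cutoff <- c(small = 0.5, large = 0.5)
pred <- colnames(votes)[apply(sweep(votes, 2, cutoff, "/"), 1, which.max)]
pred    # "small" "large" "small"

# Raising the cutoff for "large" makes that class harder to predict
cutoff2 <- c(small = 0.4, large = 0.6)
pred2 <- colnames(votes)[apply(sweep(votes, 2, cutoff2, "/"), 1, which.max)]
pred2   # "small" "small" "small"
```

Adjusting cutoff is one way to trade one class's error rate against the other's, which is relevant here given the very unbalanced class errors in the output above.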

Examining the predictions on the testing set

In the previous section, we predicted the income attribute on the testing set using several algorithms. We can now examine which algorithm was better in these predictions. We could do this manually, but it is faster and more fun to design some code that gives immediate access to the accuracy of the different algorithms on this data. So here we go:

values = data.frame(matrix(ncol = ncol(Predictions), nrow = 6))
rownames(values) = c("True +", "True -", "False +", "False -",
   "Accuracy", "Kappa")
names(values) = names(Predictions)
for (i in 1:ncol(Predictions)) {
   tab = table(AdultTest$income, Predictions[,i])
   values[1,i] = tab[1,1]
   values[2,i] = tab[2,2]
   values[3,i] = tab[1,2]
   values[4,i] = tab[2,1]
   values[5,i] = sum(diag(tab))/sum(tab)
   values[6,i] = cohen.kappa(tab)$kappa
}
round(values,2)

In the output, we can see that although the algorithms all reached more than 70 percent classification accuracy, the kappa value was never above 0.56. This is not sufficient, and we would not trust such a classification in practice. Contrary to what might be expected, random forest performed the worst. It is therefore crucial to always try several algorithms and pick the one that works best with your data.


Conditional inference trees in R

The following code was used to produce the figure we saw at the beginning of the chapter:

set.seed(999)
TitanicRandom = Titanic.df[sample(nrow(Titanic.df)),]
TitanicTrain = TitanicRandom[1:1100,]
TitanicTest = TitanicRandom[1101:2201,]

Let's now generate and plot the tree:

library(party)   # ctree() is also available in the newer partykit package
CItree <- ctree(Survived ~ Class + Sex + Age, data = TitanicTrain)
plot(CItree)

We can examine the confusion matrices for the training and testing datasets as we did with the previously presented algorithms:

CIpredictTrain = predict(CItree, TitanicTrain)
CIpredictTest = predict(CItree, TitanicTest)
TabCI_Train = table(TitanicTrain$Survived, CIpredictTrain)
TabCI_Test = table(TitanicTest$Survived, CIpredictTest)
TabCI_Train

The training dataset has the following confusion matrix:

CIpredictTrain
      No  Yes
No   724   10
Yes  231  135

Typing TabCI_Test displays the confusion matrix for the testing dataset:

CIpredictTest
      No  Yes
No   746   10
Yes  226  119

We can see that in both datasets, only about a third of the individuals who survived were correctly classified. The kappa values are very low: 0.42 for the training set and 0.40 for the testing set, despite statistical significance at all splits in the training set!

Yet, having a look at the figure at the beginning of the chapter, we can see that almost all women were correctly classified. Sometimes, some subgroups are easier to classify than others. This means that depending on the aim of your analysis, even the preceding results might be very informative!
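As a reminder, conditional inference trees only split when a statistical test of association between a predictor and the outcome is significant. The sketch below illustrates the idea with a plain chi-squared test on a hypothetical sex-by-survival table; the actual ctree() implementation relies on a more general permutation-test framework:

```r
# Hypothetical counts (not the real Titanic figures)
tab <- matrix(c(100, 250,
                300,  80),
              nrow = 2, byrow = TRUE,
              dimnames = list(Sex = c("Female", "Male"),
                              Survived = c("No", "Yes")))

test <- chisq.test(tab)
test$p.value   # far below 0.05: a tree would retain a split on Sex
```

Because each split must pass such a test, conditional inference trees need no pruning step: growth stops by itself when no significant association remains.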
