To determine which of the impurity measures to use, we first need to cover some foundational knowledge, beginning with the concept of information gain.
At its core, information gain is just what it sounds like: the gain in information from moving between two states. More precisely, the information gain of an event is the difference between the amount of information known before and after the event takes place. One common measure of this information is the Entropy, which can be defined as:

Entropy = -\sum_{j} p_j \log_2(p_j)

Where p_j is the frequency of label j at a node.
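To make the definition concrete, here is a minimal Scala sketch (computeEntropy is our own helper, not an MLlib API) that computes the Entropy of a node from its per-label counts:

// Minimal sketch: Entropy of a node given the count of each label.
// Labels with a zero count contribute nothing and are filtered out.
def computeEntropy(labelCounts: Seq[Long]): Double = {
  val total = labelCounts.sum.toDouble
  labelCounts.filter(_ > 0).map { c =>
    val p = c / total                  // frequency p_j of label j
    -p * (math.log(p) / math.log(2))   // -p_j * log2(p_j)
  }.sum
}

For example, a perfectly mixed binary node gives computeEntropy(Seq(5L, 5L)) == 1.0, while a pure node gives computeEntropy(Seq(10L, 0L)) == 0.0.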
Now that you are familiar with the concepts of information gain and Entropy, we can move on to what is meant by the Gini Index (which, despite the shared name, is a different measure from the Gini coefficient used in economics).
The Gini Index is a measure of how often a randomly chosen element would be misclassified if it were randomly assigned a label according to the distribution of labels at a given node. It can be defined as:

Gini = 1 - \sum_{j} p_j^2

where p_j is again the frequency of label j at the node.
Compared with the equation for Entropy, the Gini Index can be computed slightly faster due to the absence of a log computation, which may be why it is the default option for many machine learning libraries, including MLlib.
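As an illustrative counterpart to the Entropy sketch above (computeGini is likewise our own helper, not an MLlib API), the Gini Index needs only multiplications and additions:

// Minimal sketch: Gini Index of a node given the count of each label.
def computeGini(labelCounts: Seq[Long]): Double = {
  val total = labelCounts.sum.toDouble
  1.0 - labelCounts.map { c =>
    val p = c / total   // frequency p_j of label j
    p * p               // p_j squared -- no logarithm required
  }.sum
}

A pure node yields computeGini(Seq(10L, 0L)) == 0.0, while a perfectly mixed binary node yields computeGini(Seq(5L, 5L)) == 0.5.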
But does this make it a better measure on which to base the splits of our decision tree? It turns out that the choice of impurity measure has little effect on the performance of single decision tree algorithms. According to Tan et al., in the book Introduction to Data Mining, this is because the impurity measures are largely consistent with one another, so other factors, such as the strategy used to prune the tree, have a greater impact on the final result.
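If you want to check this claim on our data, one quick experiment (a sketch only; it assumes trainingData, testData, and the hyper-parameters defined for this chapter) is to train the same tree twice, swapping only the impurity parameter:

// Train with each impurity measure and compare the resulting test errors.
val impurityErrors = Seq("gini", "entropy").map { imp =>
  val model = DecisionTree.trainClassifier(trainingData, numClasses,
    categoricalFeaturesInfo, imp, maxDepth, maxBins)
  val err = testData.filter(p => model.predict(p.features) != p.label)
    .count.toDouble / testData.count()
  (imp, err)
}
impurityErrors.foreach { case (imp, err) =>
  println(f"$imp%8s impurity: test error = $err%.3f")
}

Per the observation above, you should expect the two error rates to land close together.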
Now it's time to train our decision tree classifier on the training data:
val dtreeModel = DecisionTree.trainClassifier(
  trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

// Show the tree
println("Decision Tree Model: " + dtreeModel.toDebugString)
This should yield a final output which looks like this (note that your results will be slightly different due to the random split of the data):
The output shows that the decision tree has a depth of 5 and 63 nodes organized into a hierarchy of decision predicates. Let's interpret it by looking at the first five decisions, which read: "If feature 25's value is less than or equal to 1.0559 AND also less than or equal to 0.61558, AND feature 27's value is less than or equal to 0.87310, AND feature 5's value is less than or equal to 0.89683, AND finally, feature 22's value is less than or equal to 0.76688, then the prediction is 1.0 (the Higgs-Boson)." All five conditions must be met for this prediction to hold. Notice that if the last condition fails (feature 22's value is > 0.76688) but the first four conditions still hold, the prediction changes from 1 to 0, indicating background noise.
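To make the chain of conditions easier to follow, here is the same logic written out as illustrative Scala (topOfTree is our own rendering of the first five decisions, with the thresholds copied from the output above; it is not how MLlib evaluates the tree):

// Illustrative rendering of the first five decisions; f is one point's features.
def topOfTree(f: Array[Double]): Option[Double] =
  if (f(25) <= 1.0559 && f(25) <= 0.61558 &&
      f(27) <= 0.87310 && f(5) <= 0.89683) {
    if (f(22) <= 0.76688) Some(1.0)  // all five conditions hold: Higgs-Boson
    else Some(0.0)                   // fifth condition fails: background noise
  } else None                        // other branches of the tree decide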
Now, let's score the model on our test dataset and print the prediction error:
// Score each test point and pair the actual label with the model's prediction
val treeLabelAndPreds = testData.map { point =>
  val prediction = dtreeModel.predict(point.features)
  (point.label.toInt, prediction.toInt)
}

// Test error = fraction of points where label and prediction disagree
val treeTestErr = treeLabelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println(f"Tree Model: Test Error = ${treeTestErr}%.3f")
The output is as follows:
After some period of time, the model scores all of the test set data and computes the error rate we defined in the preceding code. Again, your error rate will differ slightly from ours, but as shown, our simple decision tree model has an error rate of ~33%. However, there are different kinds of errors we can make, so it's worth exploring what those types of errors are by constructing a confusion matrix:
val cm = treeLabelAndPreds.combineByKey(
  // Start a (count of 0s, count of 1s) pair from the first prediction seen
  createCombiner = (pred: Int) => if (pred == 0) (1, 0) else (0, 1),
  // Fold another prediction into an existing pair
  mergeValue = (v: (Int, Int), pred: Int) =>
    if (pred == 0) (v._1 + 1, v._2) else (v._1, v._2 + 1),
  // Merge pairs computed on different partitions
  mergeCombiners = (v1: (Int, Int), v2: (Int, Int)) =>
    (v1._1 + v2._1, v1._2 + v2._2)).collect
The preceding code uses the advanced Spark method combineByKey, which lets us aggregate all the values sharing a key, much like the output of a group-by-key operation. In this case, each (K,V) pair holds the actual value K and the prediction V. We map each prediction to a tuple by creating a combiner (the createCombiner parameter): if the predicted value is 0, we map to (1,0); otherwise, we map to (0,1). Then we define how a combiner accepts a new value (mergeValue) and how two combiners are merged together (mergeCombiners). At the end, the method produces:
cm: Array[(Int, (Int, Int))] = Array((0,(5402,4131)), (1,(2724,7846)))
The resulting array contains two tuples: one for the actual value 0 and another for the actual value 1. Each tuple holds the number of 0 and 1 predictions for that actual value. Hence, it is easy to extract everything necessary to present a nice confusion matrix; before we do, the toy example below replays the same combineByKey pattern on a handful of pairs.
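As a quick sanity check of the pattern (toy data only; sc is the chapter's SparkContext), consider five (actual, predicted) pairs:

// Toy replay of the combineByKey pattern: key = actual label, value = prediction.
val toyPairs = sc.parallelize(Seq((0, 0), (0, 1), (1, 1), (1, 1), (1, 0)))
val toyCm = toyPairs.combineByKey(
  (pred: Int) => if (pred == 0) (1, 0) else (0, 1),
  (v: (Int, Int), pred: Int) => if (pred == 0) (v._1 + 1, v._2) else (v._1, v._2 + 1),
  (v1: (Int, Int), v2: (Int, Int)) => (v1._1 + v2._1, v1._2 + v2._2)).collect
// toyCm (up to key ordering): Array((0,(1,1)), (1,(1,2)))

Returning to our real predictions, we can now extract the individual counts: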
val (tn, tp, fn, fp) = (cm(0)._2._1, cm(1)._2._2, cm(1)._2._1, cm(0)._2._2)
println(f"""Confusion Matrix
           | ${0}%5d ${1}%5d ${"Err"}%10s
           |0 ${tn}%5d ${fp}%5d ${tn+fp}%5d ${fp.toDouble/(tn+fp)}%5.4f
           |1 ${fn}%5d ${tp}%5d ${fn+tp}%5d ${fn.toDouble/(fn+tp)}%5.4f
           | ${tn+fn}%5d ${fp+tp}%5d ${tn+fp+fn+tp}%5d ${(fp+fn).toDouble/(tn+fp+fn+tp)}%5.4f
           |""".stripMargin)
The code extracts the counts of true negative and true positive predictions, along with the false negatives and false positives, and prints the confusion matrix based on the template shown in Figure 9:
Going back to our first example in this chapter, we can see that our model made most of its errors when detecting the actual Boson particle. In that case, all the data points representing a detection of the Boson were wrongly misclassified as non-Boson, and yet the overall error rate was quite low! This is a nice example of how the overall error rate can be misleading for a dataset with an imbalanced response.
Next, we will consider another metric used to judge classification models, called the Area Under the Curve (AUC), where the curve in question is the Receiver Operating Characteristic (ROC) curve (see the following figure for an example). The ROC curve is a graphical representation of the True Positive Rate versus the False Positive Rate:
- True Positive Rate: The total number of true positives divided by the sum of true positives and false negatives. Expressed differently, it is the ratio of correctly predicted Higgs-Boson signals (where our model predicted label 1 and the actual label was 1) to all actual Higgs-Boson signals (where the actual label was 1). The value is shown on the y-axis.
- False Positive Rate: The total number of false positives divided by the sum of false positives and true negatives, which is plotted on the x-axis (both rates are computed in the snippet following this list).
- For more metrics, please see the figure for "Metrics derived from confusion matrix".
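Using the tn, tp, fn, and fp counts extracted from our confusion matrix, both rates are one-liners:

// TPR and FPR of our tree model, from the confusion-matrix counts above.
val tpr = tp.toDouble / (tp + fn)  // true positive rate (y-axis of the ROC curve)
val fpr = fp.toDouble / (fp + tn)  // false positive rate (x-axis of the ROC curve)
println(f"Tree Model: TPR = $tpr%.4f, FPR = $fpr%.4f")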
It follows that the ROC curve portrays our model's tradeoff of TPR against FPR as the decision threshold varies (the decision threshold is the cutoff point at which we declare a prediction to be label 0 or label 1). Therefore, the area under the ROC curve can be thought of as an average model accuracy, whereby a value of 1.0 would represent perfect classification, 0.5 would be a coin-flip (meaning our model is doing a 50-50 job of guessing 1 or 0), and anything less than 0.5 would mean that flipping a coin is more accurate than our model! This is an incredibly useful metric that, as we will see, can be used to compare different hyper-parameter settings and different models altogether. Let's create a function that calculates the AUC for our decision tree model, which we will use to compare against other models:
// Structural type: any model exposing a predict(Vector): Double method
type Predictor = {
  def predict(features: Vector): Double
}

def computeMetrics(model: Predictor, data: RDD[LabeledPoint]): BinaryClassificationMetrics = {
  val predAndLabels = data.map(newData => (model.predict(newData.features), newData.label))
  new BinaryClassificationMetrics(predAndLabels)
}

val treeMetrics = computeMetrics(dtreeModel, testData)
println(f"Tree Model: AUC on Test Data = ${treeMetrics.areaUnderROC()}%.3f")
The output is as follows:
Interested in a great read on this subject? While there is no holy bible that is the be-all and end-all, the book The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman of Stanford University, is a great source of information. It offers useful nuggets for both beginners and advanced practitioners of machine learning and is highly recommended.