First, we use the built-in model metrics that the Spark API provides. We are going to use the same approach as in the previous chapter, and we start by defining a method to extract model metrics for a given model and dataset:
import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.tree.model._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def getMetrics(model: RandomForestModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
  val predictionsAndLabels = data.map(example =>
    (model.predict(example.features), example.label)
  )
  new MulticlassMetrics(predictionsAndLabels)
}
Then we can directly compute the Spark MulticlassMetrics:
val rfModelMetrics = getMetrics(rfModel, testData)
Now we can look at the first interesting classification metric, the Confusion matrix. It is represented by the type org.apache.spark.mllib.linalg.Matrix, which allows you to perform algebraic operations:
println(s"""|Confusion matrix: |${rfModelMetrics.confusionMatrix}""".stripMargin)
The output is as follows:
In this case, Spark prints predicted classes in columns. The predicted classes are stored in the labels field of the rfModelMetrics object. However, the field contains only the translated indices (see the variable activityId2Idx created earlier). Nevertheless, we can easily create a function to transform a label index into the actual label string:
def idx2Activity(idx: Double): String =
  activityId2Idx.
    find(e => e._2 == idx.asInstanceOf[Int]).
    map(e => activitiesMap(e._1)).
    getOrElse("UNKNOWN")

val rfCMLabels = rfModelMetrics.labels.map(idx2Activity(_))
println(s"""|Labels:
            |${rfCMLabels.mkString(", ")}""".stripMargin)
The output is as follows:
For example, we can see that the other activity was often confused with other activities: it was predicted correctly in 36455 cases, but in 1261 cases the model predicted the other activity while the actual activity was house cleaning. On the other hand, the model sometimes predicted the activity folding laundry instead of the other activity.
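If you prefer to inspect a particular cell of the Confusion matrix programmatically instead of reading it from the printed output, a minimal sketch could look like the following. Note that the label strings "house cleaning" and "other" are only illustrative assumptions; they have to match the values produced by activitiesMap:

// Rows hold actual classes and columns hold predicted classes; the
// row/column order follows rfModelMetrics.labels (that is, rfCMLabels).
val cm = rfModelMetrics.confusionMatrix
val actualIdx = rfCMLabels.indexOf("house cleaning")   // assumed label string
val predictedIdx = rfCMLabels.indexOf("other")         // assumed label string
if (actualIdx >= 0 && predictedIdx >= 0)
  println(s"actual=house cleaning, predicted=other: ${cm(actualIdx, predictedIdx)}")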
You can also see that the overall prediction accuracy can be computed directly from the correctly predicted activities located on the diagonal of the Confusion matrix:
val rfCM = rfModelMetrics.confusionMatrix
val rfCMTotal = rfCM.toArray.sum
val rfAccuracy = (0 until rfCM.numCols).map(i => rfCM(i,i)).sum / rfCMTotal
println(f"RandomForest accuracy = ${rfAccuracy*100}%.2f %%")
The output is as follows:
However, the overall accuracy can be misleading when the classes are not evenly distributed (for example, when most of the instances belong to a single class). In such cases, the overall accuracy can be confusing, since a model that just predicts the dominant class would achieve high accuracy. Hence, we can look at our predictions in more detail and explore the accuracy of individual classes. However, first we look at the distribution of actual and predicted labels to see (1) whether there is a dominant class and (2) whether the model preserves the input distribution of classes and is not skewed towards predicting a single class:
import org.apache.spark.mllib.linalg.Matrix
def colSum(m: Matrix, colIdx: Int) = (0 until m.numRows).map(m(_, colIdx)).sum
def rowSum(m: Matrix, rowIdx: Int) = (0 until m.numCols).map(m(rowIdx, _)).sum
val rfCMActDist = (0 until rfCM.numRows).map(rowSum(rfCM, _)/rfCMTotal)
val rfCMPredDist = (0 until rfCM.numCols).map(colSum(rfCM, _)/rfCMTotal)
println(s"""^Class distribution
^${table(Seq("Class", "Actual", "Predicted"),
rfCMLabels.zip(rfCMActDist.zip(rfCMPredDist)).map(p => (p._1, p._2._1, p._2._2)),
Map(1 -> "%.2f", 2 -> "%.2f"))}
""".stripMargin('^'))
The output is as follows:
We can easily see that there is no dominant class; however, the classes are not uniformly distributed. It is also worth noting that the model preserves the distribution of the actual classes and shows no tendency to prefer a single class. This confirms our observation based on the Confusion matrix.
Finally, we can look at individual classes and compute precision (aka positive predictive value), recall (also called sensitivity), and the F-1 score. As a reminder of the definitions from the previous chapter: precision is the fraction of predictions for a given class that are correct (that is, TP/(TP+FP)), while recall is the fraction of all instances of a class that were correctly predicted (that is, TP/(TP+FN)). Finally, the F-1 score combines both of them, since it is computed as the harmonic mean of precision and recall. We can easily compute them with the help of the functions we already defined:
def rfPrecision(m: Matrix, feature: Int) = m(feature, feature) / colSum(m, feature)
def rfRecall(m: Matrix, feature: Int) = m(feature, feature) / rowSum(m, feature)
def rfF1(m: Matrix, feature: Int) = 2 * rfPrecision(m, feature) * rfRecall(m, feature) / (rfPrecision(m, feature) + rfRecall(m, feature))
val rfPerClassSummary = rfCMLabels.indices.map { i =>
(rfCMLabels(i), rfRecall(rfCM, i), rfPrecision(rfCM, i), rfF1(rfCM, i))
}
println(s"""^Per class summary:
^${table(Seq("Label", "Recall", "Precision", "F-1"),
rfPerClassSummary,
Map(1 -> "%.4f", 2 -> "%.4f", 3 -> "%.4f"))}
""".stripMargin('^'))
The output is as follows:
In our case, we are dealing with quite a good model, since most of the values are close to 1.0. This means that the model performs well for each input category, generating a low number of false positives (precision) and false negatives (recall).
A nice feature of the Spark API is that it already provides methods to compute all three metrics we computed manually. We can easily call the methods precision, recall, and fMeasure with a label index to get the same values. However, in the Spark case, the Confusion matrix is collected for each call, which increases the overall computation time.
In our case, we use the already computed Confusion matrix and get the same results directly. Readers can verify that the following code gives us the same numbers as stored in rfPerClassSummary:
val rfPerClassSummary2 = rfCMLabels.indices.map { i =>
  (rfCMLabels(i), rfModelMetrics.recall(i), rfModelMetrics.precision(i), rfModelMetrics.fMeasure(i))
}
Having per-class statistics, we can compute macro-averaged metrics simply by taking the mean of each of the computed metrics:
val rfMacroRecall = rfCMLabels.indices.map(i => rfRecall(rfCM, i)).sum/rfCMLabels.size
val rfMacroPrecision = rfCMLabels.indices.map(i => rfPrecision(rfCM, i)).sum/rfCMLabels.size
val rfMacroF1 = rfCMLabels.indices.map(i => rfF1(rfCM, i)).sum/rfCMLabels.size
println(f"""|Macro statistics
            |Recall, Precision, F-1
            |${rfMacroRecall}%.4f, ${rfMacroPrecision}%.4f, ${rfMacroF1}%.4f""".stripMargin)
The output is as follows:
The macro statistics give us an overall characterization of the per-class metrics. We can see values close to 1.0, as expected, since our model performs quite well on the testing data.
Moreover, the Spark ModelMetrics API also provides weighted precision, recall, and F-1 scores, which are mainly useful when we deal with unbalanced classes:
println(f"""|Weighted statistics |Recall, Precision, F-1 |${rfModelMetrics.weightedRecall}%.4f, ${rfModelMetrics.weightedPrecision}%.4f, ${rfModelMetrics.weightedFMeasure}%.4f |""".stripMargin)
The output is as follows:
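If you want to see where the weighted numbers come from, a small cross-check can be built from quantities we already computed: weighted recall is the per-class recall weighted by the actual class frequencies stored in rfCMActDist. This is only a sanity-check sketch, not part of the Spark API usage above:

// Weighted recall reconstructed from the per-class recalls and the actual
// class distribution computed earlier; it should match rfModelMetrics.weightedRecall.
val rfWeightedRecallCheck = rfCMLabels.indices
  .map(i => rfCMActDist(i) * rfRecall(rfCM, i))
  .sum
println(f"Weighted recall (manual) = ${rfWeightedRecallCheck}%.4f")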
Finally, we are going to look at one more way of computing model metrics, which is also useful when the classes are not evenly distributed. The method is called one-versus-all, and it evaluates the performance of the classifier with respect to one class at a time. This means that we compute a Confusion matrix for each output class - we can think of this approach as treating the classifier as a binary classifier that predicts a given class as the positive case and any other class as the negative case:
import org.apache.spark.mllib.linalg.Matrices

val rfOneVsAll = rfCMLabels.indices.map { i =>
  val icm = rfCM(i,i)
  val irowSum = rowSum(rfCM, i)
  val icolSum = colSum(rfCM, i)
  Matrices.dense(2, 2,
    Array(icm, irowSum - icm,
          icolSum - icm, rfCMTotal - irowSum - icolSum + icm))
}
println(rfCMLabels.indices.map(i => s"${rfCMLabels(i)} ${rfOneVsAll(i)}").mkString(" "))
This gives us the performance of each class with respect to all other classes, represented by a simple binary Confusion matrix. We can sum up all the matrices to get a single Confusion matrix and use it to compute the average accuracy and micro-averaged metrics per class:
val rfOneVsAllCM = rfOneVsAll.foldLeft(Matrices.zeros(2,2))((acc, m) =>
  Matrices.dense(2, 2,
    Array(acc(0, 0) + m(0, 0),
          acc(1, 0) + m(1, 0),
          acc(0, 1) + m(0, 1),
          acc(1, 1) + m(1, 1)))
)
println(s"Sum of oneVsAll CM: ${rfOneVsAllCM}")
The output is as follows:
Having an overall Confusion matrix, we can compute average accuracy per class:
println(f"Average accuracy: ${(rfOneVsAllCM(0,0) + rfOneVsAllCM(1,1))/rfOneVsAllCM.toArray.sum}%.4f")
The output is as follows:
The matrix also gives us the micro-averaged metrics (recall, precision, F-1). However, it is worth mentioning that our rfOneVsAllCM matrix is symmetric. This means that recall, precision, and F-1 have the same value (since the FP and FN counts are equal):
println(f"Micro-averaged metrics: ${rfOneVsAllCM(0,0)/(rfOneVsAllCM(0,0)+rfOneVsAllCM(1,0))}%.4f")
The output is as follows:
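As a quick check of the symmetry argument above, you can print the two off-diagonal entries of the summed matrix - they hold the false positive and false negative counts aggregated over all classes, and both equal the total number of misclassifications. This is just an illustrative check, not part of the original computation:

// Both off-diagonal cells of the summed one-vs-all matrix contain the same
// value: the total number of misclassified instances across all classes.
println(s"Off-diagonal counts: ${rfOneVsAllCM(0,1)} and ${rfOneVsAllCM(1,0)}")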
Furthermore, an understanding of model metrics, and especially of the role of the Confusion matrix in multiclass classification, is crucial and is not tied only to the Spark API. A great source of information is the scikit-learn documentation (http://scikit-learn.org/stable/modules/model_evaluation.html) or various R packages (for example, http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html).