Precision, recall, and specificity

The results of testing an ML model against test data can be summarized in four categories (assuming the model is predicting one of two classes, such as yes/no or broken/not broken). We will use the example of a wearable IoT sensor whose data is used to predict whether someone is walking or not walking:

  1. The ML model said the person was walking, but he was actually sitting on the couch watching Dancing with the Stars and drinking a super-sized Sprite. This is a False Positive (FP), also called a Type I error.
  2. The ML model said he was not walking when he was, in fact, walking quite quickly to the restroom due to the super-sized Sprite he had just drunk. This is a False Negative (FN), or a Type II error.
  3. The ML model said he was walking, and he was actually walking. This is called a True Positive (TP).
  4. The ML model said he was not walking and he was, in fact, snoring up a storm in his bed at the time, so not walking. This is called a True Negative (TN).

Taken together, these can be placed into a 2x2 matrix with the number of occurrences of each type shown. This is called a confusion matrix, and all sorts of useful diagnostic information about the performance of an ML model can be generated from it.

The following table shows a generic example of a confusion matrix:

                            Actual Positive         Actual Negative
    Predicted Positive      True Positive (TP)      False Positive (FP)
    Predicted Negative      False Negative (FN)     True Negative (TN)

Confusion matrix example

When generated in R, the confusion matrix will look like the following table. The confusionMatrix function in the caret package can be used to create it from the predictions of a trained model:

confusionMatrix(data = testPredictions, reference = classFactors, positive = "Class1")
Confusion matrix example in R.
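
As a runnable sketch, the same call can be reproduced with a small pair of factor vectors; the vectors below are made-up stand-ins for what testPredictions and classFactors might contain:

    library(caret)

    # Hypothetical actual classes and model predictions (made-up data)
    classFactors    <- factor(c("Class1", "Class1", "Class2", "Class2", "Class1"),
                              levels = c("Class1", "Class2"))
    testPredictions <- factor(c("Class1", "Class2", "Class2", "Class2", "Class1"),
                              levels = c("Class1", "Class2"))

    # Build the confusion matrix, treating "Class1" as the positive class
    confusionMatrix(data = testPredictions, reference = classFactors,
                    positive = "Class1")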

If you take all the instances that the ML model predicted positive and that actually were positive, and compare that against the total number of times it predicted positive, correctly or not, you get a measure of how much you can trust the positive predictions of a model. This is called Precision. It is also called Positive Predictive Value (PPV), but it is usually referred to as Precision:

Precision = True Positives/(True Positives + False Positives)
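
As a quick base R sketch, using made-up counts (TP = 48, FP = 7):

    tp <- 48   # true positives (hypothetical count)
    fp <- 7    # false positives (hypothetical count)
    precision <- tp / (tp + fp)
    precision  # 0.8727273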

If you take all the instances that the ML model predicted positive and that actually were positive and compare that against the total number of actual positives, predicted correctly or not, you get a measure of how well the model captures all the positive occurrences. It knows a thing when it sees a thing. This is called Recall. And also Sensitivity. And also True Positive Rate (TPR). And also hit rate. And also probability of detection. But mostly Recall and Sensitivity:

Recall = True Positives/(True Positives + False Negatives)
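
Using the same made-up confusion matrix counts, plus a hypothetical FN = 5:

    tp <- 48   # true positives (hypothetical count)
    fn <- 5    # false negatives (hypothetical count)
    recall <- tp / (tp + fn)
    recall  # 0.9056604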

If you would like to judge your ML model's capability as a critic, you can take all the instances that were predicted negative and actually were negative, and compare this against the total number of actual negatives, predicted correctly or not. This gives you a measure of how well a model knows something isn't a thing when it sees it. This is called Specificity or True Negative Rate (TNR):

Specificity = True Negatives/(True Negatives + False Positives)
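
And once more with made-up counts (TN = 40, FP = 7):

    tn <- 40   # true negatives (hypothetical count)
    fp <- 7    # false positives (hypothetical count)
    specificity <- tn / (tn + fp)
    specificity  # 0.8510638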

Each of these measures alone can be misleading. For example, if negative instances are rare, say 1% of cases, then a model could show excellent Recall (and even excellent Precision) by simply predicting positive for every instance. The Specificity in this case would be zero even though the Recall was perfect. It is always best to check multiple measures to verify that your model performs well.
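
To make that concrete, here is a sketch with made-up numbers: a degenerate model that predicts walking for every one of 100 test cases, 99 of which really are walking:

    # Always-predict-positive model on 100 cases: 99 walking, 1 not walking
    tp <- 99; fn <- 0   # every actual walker is labelled walking
    fp <- 1;  tn <- 0   # the single non-walker is also labelled walking

    tp / (tp + fn)   # Recall      = 1    (looks excellent)
    tp / (tp + fp)   # Precision   = 0.99 (also looks excellent)
    tn / (tn + fp)   # Specificity = 0    (exposes the problem)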
