Performance metrics

While there are many different types of classification algorithms, their evaluation metrics share more or less the same principles. In a supervised classification problem, there exists a true output and a model-generated predicted output for each data point. For this reason, the result for each data point can be assigned to one of four categories:

  • True positive (TP): Label is positive and prediction is also positive.
  • True negative (TN): Label is negative and prediction is also negative.
  • False positive (FP): Label is negative but prediction is positive.
  • False negative (FN): Label is positive but prediction is negative.

Now, to get a clearer idea of these four outcomes, refer to the following figure:

Figure 23: Prediction outcomes of a classifier (that is, the confusion matrix)

TP, FP, TN, and FN are the building blocks of most classifier evaluation metrics. A fundamental point when considering classifier evaluation is that plain accuracy (that is, whether the prediction was correct or incorrect) is generally not a good metric, because a dataset may be highly unbalanced. For example, suppose a model is designed to predict fraud from a dataset in which 95% of the data points are not fraud and 5% are fraud; a naive classifier that predicts not fraud regardless of the input will then be 95% accurate while never detecting a single fraudulent case. For this reason, metrics such as precision and recall are typically used, because they take the type of error into account. In most applications, there is some desired balance between precision and recall, which can be captured by combining the two into a single metric called the F-measure.
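
As a minimal illustration of this pitfall, consider the following Scala sketch. It uses made-up data (95 non-fraud and 5 fraud labels) and a naive classifier that always predicts not fraud; the object name and the data are purely illustrative:

```scala
// Why plain accuracy misleads on imbalanced data: 95% of the labels are
// "not fraud" (0) and the naive model always predicts 0.
object AccuracyPitfall {
  def main(args: Array[String]): Unit = {
    // 95 genuine transactions (label 0) and 5 fraudulent ones (label 1).
    val labels      = Array.fill(95)(0) ++ Array.fill(5)(1)
    // The naive classifier predicts "not fraud" for every input.
    val predictions = Array.fill(100)(0)

    val pairs = labels.zip(predictions)
    val tp = pairs.count { case (l, p) => l == 1 && p == 1 }
    val tn = pairs.count { case (l, p) => l == 0 && p == 0 }
    val fp = pairs.count { case (l, p) => l == 0 && p == 1 }
    val fn = pairs.count { case (l, p) => l == 1 && p == 0 }

    val accuracy  = (tp + tn).toDouble / (tp + tn + fp + fn)
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall    = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)

    println(f"Accuracy  = $accuracy%.2f")  // 0.95 -- looks excellent
    println(f"Precision = $precision%.2f") // 0.00 -- no fraud is ever caught
    println(f"Recall    = $recall%.2f")    // 0.00
  }
}
```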

Precision signifies how many of the positively classified instances were actually relevant, while recall signifies how good a test is at detecting the positives. In binary classification, recall is also called sensitivity. It is important to note that precision may not decrease monotonically as recall increases; the relationship between recall and precision can be observed in the stair-step area of the precision-recall plot. The following curves and summary measures are typically used in binary classification to study the output of a classifier (a short Spark code sketch follows the list):

  • Receiver operating characteristic (ROC)
  • Area under ROC curve
  • Area under precision-recall curve
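
In Spark MLlib, these quantities are exposed by the BinaryClassificationMetrics class. The following is a minimal sketch; the (score, label) pairs and the local Spark session setup are hypothetical placeholders for your own scored predictions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

object BinaryMetricsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BinaryMetricsExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (score, label) pairs: the classifier's confidence and the true label.
    val scoreAndLabels = sc.parallelize(Seq(
      (0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.6, 1.0),
      (0.4, 0.0), (0.3, 1.0), (0.2, 0.0), (0.1, 0.0)
    ))

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)

    // Precision and recall at each score threshold (the stair-step PR curve).
    metrics.precisionByThreshold().collect().foreach { case (t, p) =>
      println(s"Threshold: $t, Precision: $p")
    }
    metrics.recallByThreshold().collect().foreach { case (t, r) =>
      println(s"Threshold: $t, Recall: $r")
    }

    // Single-number summaries of the two curves.
    println(s"Area under PR curve  = ${metrics.areaUnderPR()}")
    println(s"Area under ROC curve = ${metrics.areaUnderROC()}")

    spark.stop()
  }
}
```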

However, it is often useful to combine precision and recall into a single number when choosing between two models, because keeping them as two separate evaluation metrics makes it harder to compare algorithms. Suppose you have two algorithms that perform as follows:

Classifier    Precision    Recall
X             96%          89%
Y             99%          84%

Here, neither classifier is obviously superior, so the table alone doesn't immediately guide you toward picking the better one. But the F1 score, a measure that combines precision and recall (namely, their harmonic mean), gives a single balanced number. Let's calculate it and place it in the table:

Classifier    Precision    Recall    F1 score
X             96%          89%       92.37%
Y             99%          84%       90.89%

Therefore, the F1 score helps when selecting from a large number of classifiers: it gives a single preference ranking across all of them and therefore a clear direction for progress, which in this case is classifier X.
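
The F1 values in the preceding table can be reproduced with a few lines of Scala; the object name is illustrative:

```scala
object F1Check {
  // Harmonic mean of precision and recall.
  def f1(precision: Double, recall: Double): Double =
    2 * precision * recall / (precision + recall)

  def main(args: Array[String]): Unit = {
    println(f"Classifier X: ${f1(0.96, 0.89) * 100}%.2f%%") // 92.37%
    println(f"Classifier Y: ${f1(0.99, 0.84) * 100}%.2f%%") // 90.89%
  }
}
```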

For binary classification, the preceding performance metrics can be calculated as follows:

Figure 24: Mathematical formula for computing performance metrics for binary classifiers (source: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html)
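
In compact form, the standard definitions are as follows (F1 is the F-measure with β = 1):

```latex
\begin{align*}
\text{Precision (PPV)} &= \frac{TP}{TP + FP}\\[4pt]
\text{Recall (TPR)}    &= \frac{TP}{TP + FN}\\[4pt]
\text{Accuracy}        &= \frac{TP + TN}{TP + TN + FP + FN}\\[4pt]
F_\beta &= (1 + \beta^2)\,
           \frac{\text{Precision} \cdot \text{Recall}}
                {\beta^2 \cdot \text{Precision} + \text{Recall}},
\qquad
F_1 = 2\,\frac{\text{Precision} \cdot \text{Recall}}
              {\text{Precision} + \text{Recall}}
\end{align*}
```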

However, in multiclass classification problems, where more than two predicted labels are involved, computing the preceding metrics is more complex, but they can be computed using the following mathematical equations:

Figure 25: Mathematical formula for computing performance metrics for multiclass classifiers

Where δ̂(x) is called the modified delta function, which takes the value 1 if x = 0 and 0 otherwise (source: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html).
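
In practice, these multiclass metrics do not need to be coded by hand: Spark MLlib implements them in the MulticlassMetrics class. The following is a minimal sketch; the (prediction, label) pairs for a three-class problem and the local Spark session setup are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.evaluation.MulticlassMetrics

object MulticlassMetricsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MulticlassMetricsExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (prediction, label) pairs for a three-class problem.
    val predictionAndLabels = sc.parallelize(Seq(
      (0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0),
      (2.0, 2.0), (2.0, 0.0), (1.0, 2.0), (0.0, 0.0)
    ))

    val metrics = new MulticlassMetrics(predictionAndLabels)

    // Overall metrics derived from the multiclass confusion matrix.
    println(s"Confusion matrix:\n${metrics.confusionMatrix}")
    println(s"Accuracy           = ${metrics.accuracy}")
    println(s"Weighted precision = ${metrics.weightedPrecision}")
    println(s"Weighted recall    = ${metrics.weightedRecall}")
    println(s"Weighted F1        = ${metrics.weightedFMeasure}")

    // Per-label precision, recall, and F1.
    metrics.labels.foreach { l =>
      println(s"Class $l: precision = ${metrics.precision(l)}, " +
        s"recall = ${metrics.recall(l)}, F1 = ${metrics.fMeasure(l)}")
    }

    spark.stop()
  }
}
```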
