Chapter 2. Evaluation Metrics

Evaluation metrics are tied to machine learning tasks. There are different metrics for the tasks of classification, regression, ranking, clustering, topic modeling, etc. Some metrics, such as precision-recall, are useful for multiple tasks. Classification, regression, and ranking are examples of supervised learning, which constitutes a majority of machine learning applications. We’ll focus on metrics for supervised learning models in this report.

Classification Metrics

Classification is about predicting class labels given input data. In binary classification, there are two possible output classes. In multiclass classification, there are more than two possible classes. I’ll focus on binary classification here. But all of the metrics can be extended to the multiclass scenario.

An example of binary classification is spam detection, where the input data could include the email text and metadata (sender, sending time), and the output label is either “spam” or “not spam.” (See Figure 2-1.) Sometimes, people use generic names for the two classes: “positive” and “negative,” or “class 1” and “class 0.”

There are many ways of measuring classification performance. Accuracy, confusion matrix, log-loss, and AUC are some of the most popular metrics. Precision-recall is also widely used; I’ll explain it in “Ranking Metrics”.

Figure 2-1. Email spam detection is a binary classification problem (source: Mark Enomoto | Dato Design)

Accuracy

Accuracy simply measures how often the classifier makes the correct prediction. It’s the ratio between the number of correct predictions and the total number of predictions (the number of data points in the test set):

\text{accuracy} = \frac{\#\text{ correct predictions}}{\#\text{ total data points}}

Confusion Matrix

Accuracy looks easy enough. However, it makes no distinction between classes; correct answers for class 0 and class 1 are treated equally—sometimes this is not enough. One might want to look at how many examples failed for class 0 versus class 1, because the cost of misclassification might differ for the two classes, or one might have a lot more test data of one class than the other. For example, a medical diagnosis that a patient has cancer when he doesn’t (known as a false positive) has very different consequences than the call that a patient doesn’t have cancer when he does (a false negative). A confusion matrix (or confusion table) shows a more detailed breakdown of correct and incorrect classifications for each class. The rows of the matrix correspond to ground truth labels, and the columns represent the prediction.

Suppose the test dataset contains 100 examples in the positive class and 200 examples in the negative class; then, the confusion table might look something like this:

                      Predicted as positive   Predicted as negative
Labeled as positive   80                      20
Labeled as negative   5                       195

Looking at the matrix, one can clearly tell that the positive class has lower accuracy  (80/(20 + 80) = 80%) than the negative class (195/(5 + 195) = 97.5%). This information is lost if one only looks at the overall accuracy, which in this case would be (80 + 195)/(100 + 200) = 91.7%.
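
As a concrete illustration, here is a minimal sketch, assuming scikit-learn and NumPy are available, that reproduces the table above from a vector of true labels and a vector of predictions; the label vectors are made up to match the example.

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels matching the example: 100 positives, 200 negatives.
y_true = np.array([1] * 100 + [0] * 200)
# Predictions that get 80 of the positives and 195 of the negatives right.
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 5 + [0] * 195)

# Rows are ground truth, columns are predictions; labels=[1, 0] lists
# the positive class first, matching the table above.
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[ 80  20]
#  [  5 195]]
print(accuracy_score(y_true, y_pred))   # 0.9166...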

Per-Class Accuracy

A variation of accuracy is the average per-class accuracy—the average of the accuracy for each class. Accuracy is an example of what’s known as a micro-average, and average per-class accuracy is a macro-average. In the above example, the average per-class accuracy would be (80% + 97.5%)/2 = 88.75%. Note that in this case, the average per-class accuracy is quite different from the accuracy.

In general, when there are different numbers of examples per class, the average per-class accuracy will be different from the accuracy. (Exercise for the curious reader: Try proving this mathematically!) Why is this important? When the classes are imbalanced, i.e., there are a lot more examples of one class than the other, then the accuracy will give a very distorted picture, because the class with more examples will dominate the statistic. In that case, you should look at the per-class accuracy, both the average and the individual per-class accuracy numbers.
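
To make the macro-average concrete, here is a small sketch that computes each class’s accuracy from the confusion matrix of the example above and averages them; the matrix is typed in by hand rather than taken from a real model.

import numpy as np

# Confusion matrix from the example: rows are true classes, columns are predictions.
cm = np.array([[80, 20],
               [5, 195]])

# Diagonal entries are correct predictions; row sums are the class sizes.
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(per_class_accuracy)          # [0.8   0.975]
print(per_class_accuracy.mean())   # 0.8875, the average per-class accuracy

If you are working from label vectors rather than a confusion matrix, scikit-learn’s balanced_accuracy_score computes the same macro-average.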

Per-class accuracy is not without its own caveats. For instance, if there are very few examples of one class, then test statistics for that class will have a large variance, which means that its accuracy estimate is not as reliable as other classes. Taking the average of all the classes obscures the confidence measurement of individual classes.

Log-Loss

Log-loss, or logarithmic loss, gets into the finer details of a classifier. In particular, if the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log-loss can be used. The probability can be understood as a gauge of confidence. If the true label is 0 but the classifier thinks it belongs to class 1 with probability 0.51, then even though the classifier would be making a mistake, it’s a near miss because the probability is very close to the decision boundary of 0.5. Log-loss is a “soft” measurement of accuracy that incorporates this idea of probabilistic confidence.

Mathematically, log-loss for a binary classifier looks like this:

\text{log-loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

Formulas like this are incomprehensible without years of grueling, inhuman training. Let’s unpack it. p_i is the probability that the ith data point belongs to class 1, as judged by the classifier. y_i is the true label and is either 0 or 1. Since y_i is either 0 or 1, the formula essentially “selects” either the left or the right summand. The minimum is 0, which happens when the predicted probability exactly matches the true label, i.e., p_i = 1 when y_i = 1 and p_i = 0 when y_i = 0. (We follow the convention that defines 0 log 0 = 0.)

The beautiful thing about this definition is that it is intimately tied to information theory: log-loss is the cross entropy between the distribution of the true labels and the predictions, and it is very closely related to what’s known as the relative entropy, or Kullback–Leibler divergence. Entropy measures the unpredictability of something. Cross entropy incorporates the entropy of the true distribution, plus the extra unpredictability when one assumes a different distribution than the true distribution. So log-loss is an information-theoretic measure to gauge the “extra noise” that comes from using a predictor as opposed to the true labels. By minimizing the cross entropy, we maximize the accuracy of the classifier.
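
For readers who prefer code to formulas, here is a small sketch that computes log-loss directly from the definition with NumPy and checks it against scikit-learn’s log_loss; the labels and probabilities are made up.

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1, 1])          # true labels
p = np.array([0.1, 0.51, 0.8, 0.9, 0.35])   # predicted probability of class 1

# Direct translation of the formula above.
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(manual)               # ~0.44
print(log_loss(y_true, p))  # same value, computed by scikit-learn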

AUC

AUC stands for area under the curve. Here, the curve is the receiver operating characteristic curve, or ROC curve for short. This exotic-sounding name originated in the 1950s from radio signal analysis, and was made popular by a 1978 paper by Charles Metz called “Basic Principles of ROC Analysis.” The ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate (see Figure 2-2). In other words, it shows you how many correct positive classifications can be gained as you allow for more and more false positives. The perfect classifier that makes no mistakes would hit a true positive rate of 100% immediately, without incurring any false positives—this almost never happens in practice.

Figure 2-2. Sample ROC curve (source: Wikipedia)

The ROC curve is not just a single number; it is a whole curve. It provides nuanced details about the behavior of the classifier, but it’s hard to quickly compare many ROC curves to each other. In particular, if one were to employ some kind of automatic hyperparameter tuning mechanism (a topic we will cover in Chapter 4), the machine would need a quantifiable score instead of a plot that requires visual inspection. The AUC is one way to summarize the ROC curve into a single number, so that it can be compared easily and automatically. A good ROC curve has a lot of space under it (because the true positive rate shoots up to 100% very quickly). A bad ROC curve covers very little area. So high AUC is good, and low AUC is not so good.
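
The sketch below, with made-up labels and scores and scikit-learn assumed, shows how the ROC curve and the AUC are typically obtained from a classifier’s raw scores rather than its hard labels.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

# fpr and tpr trace out the ROC curve as the decision threshold varies.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))   # area under that curve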

For more explanations about ROC and AUC, see this excellent tutorial by Kevin Markham. Outside of the machine learning and data science community, there are many popular variations of the idea of ROC curves. The marketing analytics community uses lift and gain charts. The medical modeling community often looks at odds ratios. The statistics community examines sensitivity and specificity.

Ranking Metrics

We’ve arrived at ranking metrics. But wait! We are not quite out of the classification woods yet. One of the primary ranking metrics, precision-recall, is also popular for classification tasks.

Ranking is related to binary classification. Let’s look at Internet search, for example. The search engine acts as a ranker. When the user types in a query, the search engine returns a ranked list of web pages that it considers to be relevant to the query. Conceptually, one can think of the task of ranking as first a binary classification of “relevant to the query” versus “irrelevant to the query,” followed by ordering the results so that the most relevant items appear at the top of the list. In an underlying implementation, the classifier may assign a numeric score to each item instead of a categorical class label, and the ranker may simply order the items by the raw score.

Another example of a ranking problem is personalized recommendation. The recommender might act either as a ranker or a score predictor. In the first case, the output is a ranked list of items for each user. In the case of score prediction, the recommender needs to return a predicted score for each user-item pair—this is an example of a regression model, which we will discuss later.

Precision-Recall

Precision and recall are actually two metrics, but they are often used together. Precision answers the question, “Out of the items that the ranker/classifier predicted to be relevant, how many are truly relevant?” Recall answers the question, “Out of all the items that are truly relevant, how many are found by the ranker/classifier?” Figure 2-3 contains a simple Venn diagram that illustrates precision versus recall.

Figure 2-3. Illustration of precision and recall

Mathematically, precision and recall can be defined as the following:

\text{precision} = \frac{\#\text{ happy correct answers}}{\#\text{ total items returned by ranker}}

\text{recall} = \frac{\#\text{ happy correct answers}}{\#\text{ total relevant items}}

Frequently, one might look at only the top k items from the ranker, k = 5, 10, 20, 100, etc. Then the metrics would be called “precision@k” and “recall@k.”

When dealing with a recommender, there are multiple “queries” of interest; each user is a query into the pool of items. In this case, we can average the precision and recall scores for each query and look at “average precision@k” and “average recall@k.” (This is analogous to the relationship between accuracy and average per-class accuracy for classification.)
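
As a sketch of how these could be computed, the helper below and its inputs are made up for illustration rather than taken from a standard library; it returns precision@k and recall@k for a single query.

def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one query.

    ranked_items: item ids in ranked order, best first.
    relevant_items: set of item ids that are truly relevant.
    """
    top_k = ranked_items[:k]
    num_hits = sum(1 for item in top_k if item in relevant_items)
    precision = num_hits / k
    recall = num_hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Hypothetical query: 3 of the top 5 returned items are relevant.
ranked = ["d3", "d1", "d7", "d2", "d9", "d5"]
relevant = {"d1", "d2", "d3", "d4"}
print(precision_recall_at_k(ranked, relevant, k=5))   # (0.6, 0.75)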

Precision-Recall Curve and the F1 Score

When we change k, the number of answers returned by the ranker, the precision and recall scores also change. By plotting precision versus recall over a range of k values, we get the precision-recall curve. This is closely related to the ROC curve. (Exercise for the curious reader: What’s the relationship between precision and the false-positive rate? What about recall?)

Just as it’s difficult to compare ROC curves to each other, the same goes for precision-recall curves. One way of summarizing the precision-recall curve is to fix k and combine precision and recall into a single number. A common way of combining the two is their harmonic mean, known as the F1 score:

F_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

Unlike the arithmetic mean, the harmonic mean tends toward the smaller of the two elements. Hence the F1 score will be small if either precision or recall is small.
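
A minimal sketch of the harmonic-mean computation, with made-up precision and recall values; for hard class labels, scikit-learn’s f1_score computes the same quantity from the counts of true and false positives and negatives.

from sklearn.metrics import f1_score

precision, recall = 0.8, 0.5
f1 = 2 * precision * recall / (precision + recall)
print(f1)   # ~0.615, pulled toward the smaller of the two numbers

# The same harmonic mean, computed from hard labels.
print(f1_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 0]))   # 0.5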

NDCG

Precision and recall treat all retrieved items equally; a relevant item in position k counts just as much as a relevant item in position 1. But this is not usually how people think. When we look at the results from a search engine, the top few answers matter much more than answers that are lower down on the list.

NDCG tries to take this behavior into account. NDCG stands for normalized discounted cumulative gain. There are three closely related metrics here: cumulative gain (CG), discounted cumulative gain (DCG), and finally, normalized discounted cumulative gain. Cumulative gain sums up the relevance of the top k items. Discounted cumulative gain discounts items that are further down the list. Normalized discounted cumulative gain, true to its name, is a normalized version of discounted cumulative gain. It divides the DCG by the perfect DCG score, so that the normalized score always lies between 0.0 and 1.0. See the Wikipedia article for detailed mathematical formulas.
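
To make the three quantities concrete, here is a small sketch implementing one common variant of the formulas (the log2(position + 1) discount); the relevance grades are made up.

import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k with the common log2(position + 1) discount."""
    rel = np.asarray(relevances, dtype=float)[:k]
    positions = np.arange(1, len(rel) + 1)
    return np.sum(rel / np.log2(positions + 1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance grades of the returned items, in ranked order.
rels = [3, 2, 3, 0, 1, 2]
print(sum(rels))             # cumulative gain (no discount)
print(dcg_at_k(rels, 6))     # discounted cumulative gain
print(ndcg_at_k(rels, 6))    # normalized to lie between 0.0 and 1.0

Recent versions of scikit-learn also ship an ndcg_score function if a library implementation is preferred.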

DCG and NDCG are important metrics in information retrieval and in any application where the positioning of the returned items is important.

Regression Metrics

In a regression task, the model learns to predict numeric scores. For example, when we try to predict the price of a stock on future days given past price history and other information about the company and the market, we can treat it as a regression task. Another example is personalized recommenders that try to explicitly predict a user’s rating for an item. (A recommender can alternatively optimize for ranking.)

RMSE

The most commonly used metric for regression tasks is RMSE (root-mean-square error), also known as RMSD (root-mean-square deviation). This is defined as the square root of the average squared distance between the actual score and the predicted score:

\text{RMSE} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n}}

Here, y_i denotes the true score for the ith data point, and ŷ_i denotes the predicted value. One intuitive way to understand this formula is that it is the Euclidean distance between the vector of true scores and the vector of predicted scores, divided by √n, where n is the number of data points.
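
A minimal sketch of the computation with NumPy and made-up numbers; the square root of scikit-learn’s mean_squared_error gives the same value.

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)                                         # ~0.61
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # same value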

Quantiles of Errors

RMSE may be the most common metric, but it has some problems. Most crucially, because it is an average, it is sensitive to large outliers. If the regressor performs really badly on a single data point, the average error could be very big. In statistical terms, we say that the mean is not robust (to large outliers).

Quantiles (or percentiles), on the other hand, are much more robust. To see why this is, let’s take a look at the median (the 50th percentile), which is the element of a set that is larger than half of the set, and smaller than the other half. If the largest element of a set changes from 1 to 100, the mean should shift, but the median would not be affected at all.

One thing that is certain with real data is that there will always be “outliers.” The model will probably not perform very well on them. So it’s important to look at robust estimators of performance that aren’t affected by large outliers. It is useful to look at the median absolute percentage error (MAPE):

\text{MAPE} = \text{median}\left( \left| (y_i - \hat{y}_i) / y_i \right| \right)

It gives us a relative measure of the typical error. Alternatively, we could compute the 90th percentile of the absolute percent error, which would give an indication of an “almost worst case” behavior.
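
Both quantities are one-liners with NumPy; the sketch below, with made-up numbers that include one terrible prediction, computes the median and the 90th percentile of the absolute percent error.

import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0, 5000.0])
y_pred = np.array([12.0, 18.0, 33.0, 38.0, 100.0])   # one huge miss

abs_pct_error = np.abs((y_true - y_pred) / y_true)
print(np.median(abs_pct_error))          # typical relative error, unmoved by the outlier
print(np.percentile(abs_pct_error, 90))  # "almost worst case" behavior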

“Almost Correct” Predictions

Perhaps the easiest metric to interpret is the percent of estimates that differ from the true value by no more than X%. The choice of X depends on the nature of the problem. For example, the percent of estimates within 10% of the true values would be the fraction of data points for which |(y_i - ŷ_i)/y_i| < 0.1. This gives us a notion of the precision of the regression estimate.
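
This, too, is a one-liner; the sketch below reuses the made-up arrays from the previous sketch and counts the fraction of predictions within 10% of the true value.

import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0, 5000.0])
y_pred = np.array([12.0, 18.0, 33.0, 38.0, 100.0])

# Fraction of predictions whose relative error is below 10%.
within_10_pct = np.mean(np.abs((y_true - y_pred) / y_true) < 0.1)
print(within_10_pct)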

Caution: The Difference Between Training Metrics and Evaluation Metrics

Sometimes, the model training procedure may use a different metric (also known as a loss function) than the evaluation. This can happen when we are reappropriating a model for a different task than it was designed for. For instance, we might train a personalized recommender by minimizing the loss between its predictions and observed ratings, and then use this recommender to produce a ranked list of recommendations.

This is not an optimal scenario. It makes the life of the model difficult—it’s being asked to do a task that it was not trained to do! Avoid this when possible. It is always better to train the model to directly optimize for the metric it will be evaluated on. But for certain metrics, this may be very difficult or impossible. (For instance, it’s very hard to directly optimize the AUC.) Always think about what the right evaluation metric is, and see whether the training procedure can optimize it directly.

Caution: Skewed Datasets—Imbalanced Classes, Outliers, and Rare Data

It’s easy to write down the formula of a metric. It’s not so easy to interpret the actual metric measured on real data. Book knowledge is no substitute for working experience. Both are necessary for successful applications of machine learning.

Always think about what the data looks like and how it affects the metric. In particular, always be on the lookout for data skew. By data skew, I mean situations where one “kind” of data is much rarer than others, or where there are very large or very small outliers that could drastically change the metric.

Earlier, we mentioned how imbalanced classes could be a caveat in measuring per-class accuracy. This is one example of data skew—one of the classes is much more rare compared to the other class. It is problematic not just for per-class accuracy, but for all of the metrics that give equal weight to each data point. Suppose the positive class is only a tiny portion of the observed data, say 1%—a common situation for real-world datasets such as click-through rates for ads, user-item interaction data for recommenders, malware detection, etc. This means that a “dumb” baseline classifier that always classifies incoming data as negative would achieve 99% accuracy. A good classifier should have accuracy much higher than 99%. Similarly, if looking at the ROC curve, only the top left corner of the curve would be important, so the AUC would need to be very high in order to beat the baseline. See Figure 2-4 for an illustration of these gotchas.

Figure 2-4. Illustration of classification accuracy and AUC under imbalanced classes

Any metric that gives equal weight to each data point has a hard time handling imbalanced classes, because by definition, the metric will be dominated by the class(es) with the most data. Furthermore, imbalanced classes are problematic not only for the evaluation stage, but even more so when training the model. If class imbalance is not properly dealt with, the resulting model may not know how to predict the rare classes at all.

Data skew can also create problems for personalized recommenders. Real-world user-item interaction data often contains many users who rate very few items, as well as items that are rated by very few users. Rare users and rare items are problematic for the recommender, both during training and evaluation. When not enough data is available in the training data, a recommender model would not be able to learn the user’s preferences, or the items that are similar to a rare item. Rare users and items in the evaluation data would lead to a very low estimate of the recommender’s performance, which compounds the problem of having a badly trained recommender.

Outliers are another kind of data skew. Large outliers can cause problems for a regressor. For instance, in the Million Song Dataset, a user’s score for a song is taken to be the number of times the user has listened to this song. The highest score is greater than 16,000! This means that any error made by the regressor on this data point would dwarf all other errors. The effect of large outliers during evaluation can be mitigated through robust metrics such as quantiles of errors. But this would not solve the problem for the training phase. Effective solutions for large outliers would probably involve careful data cleaning, and perhaps reformulating the task so that it’s not sensitive to large outliers.

Related Reading

Software Packages

Many of the metrics (and more) are implemented in various software packages for data science.
