6. Evaluating Classifiers

In [1]:

# setup
from mlwpy import *
%matplotlib inline

iris = datasets.load_iris()

tts = skms.train_test_split(iris.data, iris.target,
                            test_size=.33, random_state=21)

(iris_train_ftrs, iris_test_ftrs,
 iris_train_tgt, iris_test_tgt) = tts

In the previous chapter, we discussed evaluation issues that pertain to both classifiers and regressors. Now, I’m going to turn our attention to evaluation techniques that are appropriate for classifiers. We’ll start by examining baseline models as a standard of comparison. We will then progress to different metrics that help identify different types of mistakes that classifiers make. We’ll also look at some graphical methods for evaluating and comparing classifiers. Last, we’ll apply these evaluations on a new dataset.

6.1 Baseline Classifiers

I’ve emphasized—and the entire previous chapter reinforces—the notion that we must not lie to ourselves when we evaluate our learning systems. We discussed fair evaluation of single models and comparing two or more alternative models. These steps are great. Unfortunately, they miss an important point—it’s an easy one to miss.

Once we’ve invested time in making a fancy—new and improved, folks!—learning system, we are going to feel some obligation to use it. That obligation may be to our boss, or our investors who paid for it, or to ourselves for the time and creativity we invested in it. However, rolling a learner into production use presumes that the shiny, new, improved system is needed. It might not be. Sometimes, simple old-fashioned technology is more effective, and more cost-effective, than a fancy new product.

How do we know whether we need a campfire or an industrial stovetop? We figure that out by comparing against the simplest ideas we can come up with: baseline methods. sklearn calls these dummy methods.

We can imagine four levels of learning systems:

1. Baseline methods: prediction based on simple statistics or random guesses,
2. Simple off-the-shelf learning methods: predictors that are generally less resource-intensive,
3. Complex off-the-shelf learning methods: predictors that are generally more resource-intensive, and
4. Customized, boutique learning methods.

Most of the methods in this book fall into the second category. They are simple, off-the-shelf systems. We’ll glance at more complex systems in Chapter 15. If you need boutique solutions, you should hire someone who has taken a deeper dive into the world of machine learning and statistics—like your humble author. The most basic, baseline systems help us decide if we need a complicated system and if that system is better than something primitive. If our fancy systems are no better than the baseline, we may need to revisit some of our fundamental assumptions. We may need to gather more data or change how we are representing our data. We’ll talk about adjusting our representation in Chapters 10 and 13.

In sklearn, there are four baseline classification methods. We’ll actually show code for five, but two are duplicates. Each of the methods makes a prediction when given a test example. Two baseline methods are random; they flip coins to make a prediction for the example. Two methods return a constant value; they always predict the same thing. The random methods are (1) uniform: choose evenly among the target classes, so every class has the same chance of being picked no matter how common it is, and (2) stratified: choose randomly among the target classes in proportion to the frequency of those classes in the training data. The two constant methods are (1) constant (surprise?): return one target class that we’ve picked out and (2) most_frequent: return the single most common class in the training data. For class predictions, most_frequent is also available under the name prior.

The two random methods will behave differently when a dataset has rare occurrences, like a rare disease. With two classes (plentiful healthy people and rare sick people), the uniform method picks evenly, 50%–50%, between sick and healthy. It ends up labeling far more people as sick than there are in reality. The stratified method picks in a manner similar to stratified sampling: it picks healthy or sick as the target based on the percentages of healthy and sick people in the data. If 5% of the people are sick, it would pick sick around 5% of the time and healthy around 95% of the time.
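
To make the rare-disease contrast concrete, here is a small simulation using only the Python standard library. It is a sketch of the two random strategies as described above, not sklearn’s implementation; the 1,000-person dataset, the 5% sick rate, and all variable names are assumptions made up for illustration.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# hypothetical training targets: 5% sick, 95% healthy
train_tgt = ["sick"] * 50 + ["healthy"] * 950

classes = sorted(set(train_tgt))
counts = Counter(train_tgt)
class_freqs = [counts[c] / len(train_tgt) for c in classes]

n_preds = 10_000

# uniform: every class is equally likely, regardless of how common it is
uniform_preds = [rng.choice(classes) for _ in range(n_preds)]

# stratified: classes are drawn in proportion to their training frequency
stratified_preds = rng.choices(classes, weights=class_freqs, k=n_preds)

uniform_sick = uniform_preds.count("sick") / n_preds
stratified_sick = stratified_preds.count("sick") / n_preds

print(f"uniform    predicts sick {uniform_sick:.1%} of the time")
print(f"stratified predicts sick {stratified_sick:.1%} of the time")
```

The uniform rate should land near 50% and the stratified rate near 5%, matching the percentages in the paragraph above.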

Here’s a simple use of a most_frequent baseline method:

In [2]:

strategy	accuracy
constant	0.3600
uniform	0.3800
stratified	0.3400
prior	0.3000
most_frequent	0.3000
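
A comparison like the one in the table above might be produced along the following lines. This is a sketch, not necessarily the code behind the table: it imports from sklearn directly instead of going through mlwpy, the class chosen for the constant strategy (0) is an assumption, and the random seeds for uniform and stratified are arbitrary, so the printed accuracies may not match the table.

```python
from sklearn import datasets, dummy, metrics, model_selection

# same split as the setup cell at the top of the chapter
iris = datasets.load_iris()
(train_ftrs, test_ftrs,
 train_tgt, test_tgt) = model_selection.train_test_split(
    iris.data, iris.target, test_size=.33, random_state=21)

# strategy name -> extra constructor arguments
strategies = {
    "constant": {"constant": 0},         # assumption: always predict class 0
    "uniform": {"random_state": 42},     # arbitrary seed for repeatability
    "stratified": {"random_state": 42},  # arbitrary seed for repeatability
    "prior": {},
    "most_frequent": {},
}

accuracies = {}
for name, kwargs in strategies.items():
    baseline = dummy.DummyClassifier(strategy=name, **kwargs)
    baseline.fit(train_ftrs, train_tgt)   # baselines ignore the features
    preds = baseline.predict(test_ftrs)
    accuracies[name] = metrics.accuracy_score(test_tgt, preds)

for name, acc in accuracies.items():
    print(f"{name:15s}{acc:.4f}")
```

Because prior and most_frequent make the same class predictions, their accuracies come out identical; that is why they count as duplicates.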

	I think: pot is hot (Positive)	I think: pot is cold (Negative)
pot is hot	True (predicted) Positive	False (predicted) Negative
pot is cold	False (predicted) Positive	True (predicted) Negative

	Predicted Positive (PredP)	Predicted Negative (PredN)
Real Positive (RealP)	True Positive (TP)	False Negative (FN)
Real Negative (RealN)	False Positive (FP)	True Negative (TN)
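
The four cells of this table can be counted by hand in a few lines of plain Python. The function below is a minimal sketch that follows the table’s layout (real classes as rows, predicted classes as columns, positive first); the function name and the hot-pot example data are hypothetical.

```python
def confusion_counts(actual, predicted, positive=True):
    """Count (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for real, pred in zip(actual, predicted):
        if real == positive:
            if pred == positive:
                tp += 1   # real positive, predicted positive
            else:
                fn += 1   # real positive, predicted negative
        else:
            if pred == positive:
                fp += 1   # real negative, predicted positive
            else:
                tn += 1   # real negative, predicted negative
    return tp, fn, fp, tn

# hypothetical data: True means "pot is hot"
actual    = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  False, False, False, False]

tp, fn, fp, tn = confusion_counts(actual, predicted)

print("        PredP  PredN")
print(f"RealP   {tp:5d}  {fn:5d}")
print(f"RealN   {fp:5d}  {tn:5d}")
```

Note that library routines may order rows and columns differently (for example, by sorted label), so it is worth checking the layout before reading off a cell.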

	PredP	PredN
RealP	.05 .15 .25	.55 .65
RealN	.35 .45	.75 .85 .95
