5. Evaluating and Comparing Learners

In [1]:

# setup 
from mlwpy import *
diabetes = datasets.load_diabetes()
%matplotlib inline

5.1 Evaluation and Why Less Is More

Lao Tzu: Those that know others are wise. Those that know themselves are Enlightened.

The biggest risk in developing a learning system is overestimating how well it will do when we use it. I touched on this risk in our first look at classification. Those of us that have studied for a test and thought we had a good mastery of the material, and then bombed the test, will be intimately familiar with this risk. It is very, very easy to (1) think we know a lot and will do well on an exam and (2) not do very well on the exam. On a test, we may discover we need details when we only remember a general idea. I know it happened in the mid-nineteenth century, but was it 1861 or 1862!? Even worse, we might focus on some material at the expense of other material: we might miss studying some information entirely. Well, nuts: we needed to know her name but not his birth year.

In learning systems, we have two similar issues. When you study for the test, you are limited in what you can remember. Simply put, your brain gets full. You don’t have the capacity to learn each and every detail. One way around this is to remember the big picture instead of many small details. It is a great strategy—until you need one of those details! Another pain many of us have experienced is that when you’re studying for a test, your friend, spouse, child, anyone hollers at you, “I need your attention now!” Or it might be a new video game that comes out: “Oh look, a shiny bauble!” Put simply, you get distracted by noise. No one is judging, we’re all human here.

These two pitfalls—limited capacity and distraction by noise—are shared by computer learning systems. Now, typically, a learning system won’t be distracted by the latest YouTube sensation or Facebook meme. In the learning world, we call these sources of error by different names. For the impatient, they are bias for the capacity of what we can squeeze into our head and variance for how distracted we get by noise. For now, squirrel away that bit of intuition and don’t get distracted by noise.

Returning to the issue of overconfidence, what can we do to protect ourselves from . . . ourselves? Our most fundamental defense is not teaching to the test. We introduced this idea in our first look at classification (Section 3.3). To avoid teaching to the test, we use a very practical three-step recipe:

  • Step one: split our data into separate training and testing datasets.

  • Step two: learn on the training data.

  • Step three: evaluate on the testing data.

Not using all the data to learn may seem counterintuitive. Some folks—certainly none of my readers—could argue, “Wouldn’t building a model on more data lead to better results?” Our humble skeptic has a good point. Using more data should lead to better estimates by our learner. The learner should have better parameters—better knob settings on our factory machine. However, there’s a really big consequence of using all of the data for learning. How would we know that a more-data model is better than a less-data model? We have to evaluate both models somehow. If we teach to the test by learning and evaluating on all of the data, we are likely to overestimate our ability once we take our system into the big, scary, complex real world. The scenario is similar to studying a specific test from last year’s class—wow, multiple choice, easy!—and then being tested on this year’s exam which is all essays. Is there a doctor in the house? A student just passed out.
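
To make the recipe concrete, here is a minimal sketch with sklearn on the diabetes data we loaded above; the 3-NN regressor and the 25% test size are just stand-in choices for illustration.

# a minimal sketch of the three-step recipe (3-NN and the 25% test
# size are illustrative choices, not recommendations)
knn = neighbors.KNeighborsRegressor(n_neighbors=3)

# step one: split into separate training and testing datasets
(train_ftrs, test_ftrs,
 train_tgt,  test_tgt) = skms.train_test_split(diabetes.data,
                                               diabetes.target,
                                               test_size=.25)

# step two: learn on the training data only
fit = knn.fit(train_ftrs, train_tgt)

# step three: evaluate on the testing data
preds = fit.predict(test_ftrs)
print(metrics.mean_squared_error(test_tgt, preds))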

In this chapter, we will dive into general evaluation techniques that apply to both regression and classification. Some of these techniques will help us avoid teaching to the test. Others will give us ways of comparing and contrasting learners in very broad terms.

5.2 Terminology for Learning Phases

We need to spend a few minutes introducing some vocabulary. We need to distinguish between a few different phases in the machine learning process. We’ve hit on training and testing earlier. I want to introduce another phase called validation. Due to some historical twists and turns, I need to lay out clearly what I mean by these three terms—training, validation, and testing. Folks in different disciplines can use these terms with slight variations in meaning which can trip the unwary student. I want you to have a clear walking path.

5.2.1 Back to the Machines

I want you to return to the mental image of our factory learning machine from Section 1.3. The machine is a big black box of knobs, inputs, and outputs. I introduced that machine to give you a concrete image of what learning algorithms are doing and how we have control over them. We can continue the story. While the machine itself seems to be part of a factory, in reality, we are a business-to-business (that’s B2B to you early 2000s business students) provider. Other companies want to make use of our machine. However, they want a completely hands-off solution. We’ll build the machine, set all the knobs as in Figure 5.1, and send the machine to the customer. They won’t do anything other than feed it inputs and see what pops out the other side. This delivery model means that when we hand off the machine to our customer, it needs to be fully tuned and ready to rock-and-roll. Our challenge is to ensure the machine can perform adequately after the hand-off.

A machine model is shown. It contains knobs and switches. Inputs (I) are fed to the machine and outputs (O) are driven out.

Figure 5.1 Learning algorithms literally dial-in—or optimize—a relationship between input and output.

In our prior discussion of the machine, we talked about relating inputs to outputs by setting the knobs and switches on the side of the machine. We established that relationship because we had some known outputs that we were expecting. Now, we want to avoid teaching to the test when we set the dials on our machine. We want the machine to do well for us, but more importantly, we want it to do well for our customer. Our strategy is to hold out some of the input-output pairs and save them for later. We will not use the saved data to set the knobs. We will use the saved data, after learning, to evaluate how well the knobs are set. Great! Now we’re completely set and have a good process for making machines for our customers.

You know what’s coming. Wait for it. Here it comes. Houston, we have a problem. There are many different types of machines that relate inputs to outputs. We’ve already seen two classifiers and two regressors. Our customers might have some preconceived ideas about what sort of machine they want because they heard that Fancy Silicon Valley Technologies, Inc. was using one type of machine. FSVT, Inc. might leave it entirely up to us to pick the machine. Sometimes we—or our corporate overlords—will choose between different machines based on characteristics of the inputs and outputs. Sometimes we’ll choose based on resource use. Once we select a broad class of machines (for example, we decide we need a widget maker), there may be several physical machines we can pick (for example, the Widget Works 5000 or the WidgyWidgets Deluxe Model W would both work nicely). Often, we will pick the machine we use based on its learning performance (Figure 5.2).

Two machine models are shown. In the first model 'A' the knobs are arranged horizontally. Its input is 'I' and output is 'O subscript A.' In the second model 'B' the knobs are arranged vertically. Its input is 'I' and output is 'O subscript B.' A cartoon character on the side is confused about which machine to choose.

Figure 5.2 If we can build and optimize different machines, we select one of them for the customer.

Let me step out of the metaphor for a moment. A concrete example of a factory machine is a k-Nearest Neighbors (k-NN) classifier. For k-NN, different values of k are entirely different physical machines. k is not a knob we adjust on the machine. k is internal to the machine. No matter what inputs and outputs we see, we can’t adjust k directly on one machine (see Section 11.1 for details). It’s like looking at the transmission of a car and wanting a different gearing. That modification is at a level beyond the skillsets of most of us. But all is not lost! We can’t modify our car’s transmission, but we can buy a different car. We are free to have two different machines, say 3-NN and 10-NN. We can go further. The machines could also be completely different. We could get two sedans and one minivan. With learning models, they don’t all have to be k-NN variants. We could get a 3-NN, a 10-NN, and a Naive Bayes. To pick among them, we run our input-output pairs through the models to train them. Then, we evaluate how they perform on the held-out data to get a better—less trained on the test—idea of how our machines will perform for the customer (Figure 5.3).
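
Here's roughly how that pick-among-candidates step might look in code. I'm using the iris data as a stand-in classification task and a plain train-test split as the "held-out data"; the explicit naive_bayes import is an assumption in case mlwpy doesn't already pull it in.

from sklearn import naive_bayes  # assumption: may not come via mlwpy

iris = datasets.load_iris()
(tr_ftrs, held_out_ftrs,
 tr_tgt,  held_out_tgt) = skms.train_test_split(iris.data,
                                                iris.target,
                                                test_size=.25)

candidates = {'3-NN' : neighbors.KNeighborsClassifier(n_neighbors=3),
              '10-NN': neighbors.KNeighborsClassifier(n_neighbors=10),
              'NB'   : naive_bayes.GaussianNB()}

for name, model in candidates.items():
    fit = model.fit(tr_ftrs, tr_tgt)
    # accuracy of each trained machine on the held-out data
    print(name, fit.score(held_out_ftrs, held_out_tgt))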

An illustration depicts optimization and selection of machines.

Figure 5.3 Optimization (dial-setting) and selection (machine choice) as steps to create a great machine for our customer.

Hurray! We’re done. High-fives all around, it’s time for coffee, tea, soda, or beer (depending on your age and doctor’s advice).

Not so fast. We still have a problem. Just as we can teach to the test in setting the knobs on the machines, we can also teach to the test in terms of picking a machine. Imagine that we use our held-out data as the basis for picking the best machine. For k-NN that means picking the best k. We could potentially try all the values of k up to the size of our dataset. Assuming 50 examples, that’s all values of k from 1 to 50. Suppose we find that 27 is the best. That’s great, except we’ve been looking at the same held-out data every time we try a different k. We no longer have an unseen test to give us a fair evaluation of the machine we’re going to hand off to our customer. We used up our hold-out test set and fine-tuned our performance towards it. What’s the answer now?

The answer to teaching-to-the-test with knobs—tuning a given machine—was to have a separate set of held-out data that isn’t used to set the knobs. Since that worked pretty well there, let’s just do that again. We’ll have two sets of held-out data to deal with two separate steps. One set will be used to pick the machine. The second set will be used to evaluate how well the machine will work for the customer in a fair manner, without peeking at the test. Remember, we also have the non-held-out data that is used to tune the machine.

5.2.2 More Technically Speaking . . .

Let’s recap. We now have three distinct sets of data. We can break the discussion of our needs into three distinct phases. We’ll work from the outside to the inside—that is, from our final goal towards fitting the basic models.

  1. We need to provide a single, well-tuned machine to our customer. We want to have a final, no-peeking evaluation of how that machine will do for our customer.

  2. After applying some thought to the problem, we select a few candidate machines. With those candidate machines, we want to evaluate and compare them without peeking at the data we will use for our final evaluation.

  3. For each of the candidate machines, we need to set the knobs to their best possible settings. We want to do that without peeking at either of the datasets used for the other phases. Once we select one machine, we are back to the basic learning step: we need to set its knobs.

5.2.2.1 Learning Phases and Training Sets

Each of these three phases has a component of evaluation in it. In turn, each different evaluation makes use of a specific set of data containing different known input-output pairs. Let’s give the phases and the datasets some useful names. Remember, the term model stands for our metaphorical factory machine. The phases are

  1. Assessment: final, last-chance estimate of how the machine will do when operating in the wild

  2. Selection: evaluating and comparing different machines which may represent the same broad type of machine (different k in k-NN) or completely different machines (k-NN and Naive Bayes)

  3. Training: setting knobs to their optimal values and providing auxiliary side-tray information

The datasets used for these phases are:

  1. Hold-out test set

  2. Validation test set

  3. Training set

We can relate these phases and datasets to the factory machine scenario. This time, I’ll work from the inside out.

  1. The training set is used to adjust the knobs on the factory machine.

  2. The validation test set is used to get a non-taught-to-the-test evaluation of that finely optimized machine and help us pick between different optimized machines.

  3. The hold-out test set is used to make sure that the entire process of building one or more factory machines, optimizing them, evaluating them, and picking among them is evaluated fairly.

The last of these is a big responsibility: there are many ways to peek and be misled by distractions. If we train and validation-test over and over, we are building up a strong idea of what works and doesn’t work in the validation test set. It may be indirect, but we are effectively peeking at the validation test set. The hold-out test set—data we have never used before in any training or validation-testing for this problem—is necessary to protect us from this indirect peeking and to give us a fair evaluation of how our final system will do with novel data.

5.2.2.2 Terms for Test Sets

If you check out a number of books on machine learning, you’ll find the term validation set used, fairly consistently, for Selection. However, when you talk to practitioners, folks will verbally use the phrase test set for both datasets used for Selection and for Assessment. To sweep this issue under the carpet, if I’m talking about evaluation and it is either (1) clear from context or (2) doesn’t particularly matter, I’ll use the generic phrase testing for the data used in either Assessment or Selection. The most likely time that will happen is when we aren’t doing Selection of models—we are simply using a basic train-test split, training a model and then performing a held-out evaluation, Assessment, on it.

If the terms do matter, as when we’re talking about both phases together, I’ll be a bit more precise and use more specific terms. If we need to distinguish these datasets, I’ll use the terms hold-out test set (HOT) and validation set (ValS). Since Assessment is a one-and-done process—often at the end of all our hard work applying what we know about machine learning—we’ll be talking about the HOT relatively infrequently. That is not to say that the HOT is unimportant—quite the contrary. Once we use it, we can never use it as a HOT again. We’ve peeked. Strictly speaking, we’ve contaminated both ourselves and our learning system. We can delete a learning system and start from scratch, but it is very difficult to erase our own memories. If we do this repeatedly, we’d be right back into teaching to the test. The only solution for breaking the lockbox of the HOT is to gather new data. On the other hand, we are not obligated to use all of the HOT at once. We can use half of it, find we don’t like the results, and go back to square one. When we develop a new system that we need to evaluate before deployment, we still have the other half of the HOT for Assessment.

5.2.2.3 A Note on Dataset Sizes

A distinctly practical matter is figuring out how big each of these sets should be. It is a difficult question to answer. If we have lots of data, then all three sets can be very large and there’s no issue. If we have very little data, we have to be concerned with (1) using enough data in training to build a good model and (2) leaving enough data for the testing phases. To quote one of the highest-quality books in the field of machine and statistical learning, Elements of Statistical Learning, “It is difficult to give a general rule on how to choose the number of observations in each of the three parts.” Fortunately, Hastie and friends immediately take pity on us poor practitioners and give a generic recommendation of 50%–25%–25% for training, validation testing, and held-out testing. That’s about as good of a baseline split as we can get. With cross-validation, we could possibly consider a 75–25 split with 75% being thrown into the basket for cross-validation—which will be repeatedly split into training and validation-testing sets—and 25% saved away in a lockbox for final assessment. More on that shortly.
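
As a sketch of how a 50%–25%–25% carve-up might be done in code, we can chain two calls to train_test_split: peel off 25% as the hold-out test set, then split the remainder so that one third of it (25% of the total) becomes the validation set. The variable names are my own.

# first, lock away 25% as the hold-out test set
(tmp_ftrs, hot_ftrs,
 tmp_tgt,  hot_tgt) = skms.train_test_split(diabetes.data,
                                            diabetes.target,
                                            test_size=.25)

# then split the remaining 75% into training (50% of the total)
# and validation testing (25% of the total)
(train_ftrs, val_ftrs,
 train_tgt,  val_tgt) = skms.train_test_split(tmp_ftrs, tmp_tgt,
                                              test_size=1/3)

print(len(train_ftrs), len(val_ftrs), len(hot_ftrs))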

If we go back to the 50–25–25 split, let’s drill into that 50%. We’ll soon see evaluation tools called learning curves. These give us an indication of what happens to our validation-testing performance as we train on more and more examples. Often, at some high enough number of training examples, we will see a plateau in the performance. If that plateau happens within the 50% split size, things are looking pretty good for us. However, imagine a scenario where we need 90% of our available data to get a decent performance. Then, our 50–25–25 split is simply not going to give a sufficiently good classifier because we need more training data. We need a learner that is more efficient in its use of data.

5.2.2.4 Parameters and Hyperparameters

Now is the perfect time—I might be exaggerating—to deal with two other terms: parameters and hyperparameters. The knobs on a factory machine represent model parameters set by a learning method during the training phase. Choosing between different machines (3-NN or 10-NN) in the same overall class of machine (k-NN) is selecting a hyperparameter. Selecting hyperparameters, like selecting models, is done in the selection phase. Keep this distinction clear: parameters are set as part of the learning method in the training phase while hyperparameters are beyond the control of the learning method.
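
Here's the distinction at the code level with models we've already met. The n_neighbors value we pass to the constructor is a hyperparameter; the coef_ and intercept_ attributes that linear regression fills in during fit are its learned parameters.

# hyperparameter: chosen before training, outside the learning method
knn = neighbors.KNeighborsRegressor(n_neighbors=3)

# parameters: set by the learning method itself during training
lr = linear_model.LinearRegression()
lr.fit(diabetes.data, diabetes.target)
print(lr.coef_[:3], lr.intercept_)   # knob settings found by training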

For a given run of a learning method, the available parameters (knobs) and the way they are used (internals of the factory machine) are fixed. We can only adjust the values those parameters take. Conceptually, this limitation can be a bit hard to describe. If the phases described above are talked about from outer to inner—in analogy with outer and inner loops in a computer program—the order is Assessment, Selection, Training. Then, adjusting hyperparameters means stepping out one level from adjusting the parameters—stepping out from Training to Selection. We are thinking outside the box, if you will. At the same time—from a different perspective—we are diving into the inner workings of the machine like a mechanic. As is the case with rebuilding car engines, the training phase just doesn’t go there.

With that perfect moment passed, we’re going to minimize the discussion of hyperparameters for several chapters. If you want to know more about hyperparameters right now, go to Section 11.1. Table 5.1 summarizes the pieces we’ve discussed.

Table 5.1 Phases and datasets for learning.

Phase    Name         Dataset Used          Machine                 Purpose
inner    training     training set          set knobs               optimize parameters
middle   selection    validation test set   choose machines         select model, hyperparameters
outer    assessment   hold-out test set     evaluate performance    assess future performance

For the middle phase, selection, let me emphasize just how easily we can mislead ourselves. We’ve only considered two kinds of classifiers so far: NB and k-NN. While k could grow arbitrarily big, we commonly limit it to relatively small values below 20 or so. So, maybe we are considering 21 total possible models (20 k-NN variants and 1 Naive Bayes model). Still, there are many other methods. In this book, we’ll discuss about a half dozen. Several of these have almost infinite tunability. Instead of choosing between a k of 3, 10, or 20, some models have a C with any value from zero to infinity. Over many models and many tuning options, it is conceivable that we might hit the jackpot and find one combination that is perfect for our inner and middle phases. However, we’ve been indirectly peeking—homing in on the target by systematic guessing. Hopefully, it is now clear why the outer phase, assessment, is necessary to prevent ourselves from teaching to the test.

5.3 Major Tom, There’s Something Wrong: Overfitting and Underfitting

Now that we’ve laid out some terminology for the learning phases—training, selection, and assessment—I want to dive into things that can go wrong with learning. Let’s turn back to the exam scenario. Mea culpa. Suppose we take an exam and we don’t do as well as we’d like. It would be nice if we could attribute our failure to something more specific than “bad, don’t do that again.” Two distinct failures are (1) not bringing enough raw horsepower—capacity—to the exam and (2) focusing too much on irrelevant details. To align this story with our earlier discussion, number two is really just a case of being distracted by noise—but it makes us feel better about ourselves than binging on Netflix. These two sources of error have technical names: underfitting and overfitting. To investigate them, we’re going to cook up a simple practice dataset.

5.3.1 Synthetic Data and Linear Regression

Often, I prefer to use real-world datasets—even if they are small—for examples. But in this case, we’re going to use a bit of synthetic, genetically modified data. Creating synthetic data is a good tool to have in your toolbox. When we develop a learning system, we might need some data that we completely control. Creating our own data allows us to control the true underlying relationship between the inputs and the outputs and to manipulate how noise affects that relationship. We can specify both the type and amount of noise.

Here, we’ll make a trivial dataset with one feature and a target, and make a train-test split on it. Our noise is chosen uniformly (you might want to revisit our discussion of distributions in Section 2.4.4) from values between –2 and 2.

In [2]:

N = 20
ftr = np.linspace(-10, 10, num=N)                # ftr values
tgt = 2*ftr**2 - 3 + np.random.uniform(-2, 2, N) # tgt = func(ftr)

(train_ftr, test_ftr,
 train_tgt, test_tgt) = skms.train_test_split(ftr, tgt, test_size=N//2)

display(pd.DataFrame({"ftr":train_ftr,
                      "tgt":train_tgt}).T)

 

          0      1      2     3       4      5       6      7      8      9
ftr   -1.58  -6.84  -3.68  1.58   -7.90   3.68    7.89   4.74   5.79  -0.53
tgt    2.39  91.02  22.38  3.87  122.58  23.00  121.75  40.60  62.77  -1.61

Now we can take a look at that data visually. We have our known data points—the training set—in blue dots. The red pluses show the input feature values for the test set. We need to figure out how high up we should take each of those values.

In [3]:

plt.plot(train_ftr, train_tgt, 'bo')
plt.plot(test_ftr, np.zeros_like(test_ftr), 'r+');
A graph plots the data presented in a table.

The numbers are a fairly straightforward example of a regression task. We have a numerical target value that we want to predict from an input. Now, we only have a few regression tools in our toolbox at this point. So, let’s pull out linear regression (LR) and see what happens:

In [4]:

# note: sklearn *really* wants 2D inputs (a table)
# so we use reshape here.
sk_model = linear_model.LinearRegression()
sk_model.fit(train_ftr.reshape(-1, 1), train_tgt)
sk_preds = sk_model.predict(test_ftr.reshape(-1, 1))
sk_preds[:3]

Out[4]:

array([53.218 , 41.4552, 56.8374])

We’re not evaluating these predictions in any way. But at least—like our training targets—they are positive values.

5.3.2 Manually Manipulating Model Complexity

Up until now, we’ve relied entirely on sklearn to do all the heavy lifting for us. Basically, sklearn’s methods have been responsible for setting the knob values on all the machines we’ve been using for our demonstrations. But there are many other packages for finding those ideal knob values. Some of those packages are specifically geared towards machine learning. Others are geared towards specialized areas of mathematics and engineering.

One of those alternatives is the polyfit routine in NumPy. It takes input and output values, our features and a target, and a degree of polynomial to align with the data. It figures out the right knob values—actually, the coefficients of polynomials we discussed in Section 2.8—and then np.poly1d turns those coefficients into a function that can take inputs and produce outputs. Let’s explore how it works:

In [5]:

# fit-predict-evaluate a 1D polynomial (a line)
model_one = np.poly1d(np.polyfit(train_ftr, train_tgt, 1))
preds_one = model_one(test_ftr)
print(preds_one[:3])
[53.218 41.4552 56.8374]

Interesting. The first three predictions are the same as our LR model. Are all of the predictions from these inputs the same? Yes. Let’s demonstrate that and calculate the RMSE of the model:

In [6]:

# the predictions come back the same
print("all close?", np.allclose(sk_preds, preds_one))

# and we can still use sklearn to evaluate it
mse = metrics.mean_squared_error
print("RMSE:", np.sqrt(mse(test_tgt, preds_one)))
all close? True
RMSE: 86.69151817350722

Great. So, two take-home messages here. Message one: we can use alternative systems, not just sklearn, to learn models. We can even use those alternative systems with sklearn to do the evaluation. Message two: np.polyfit, as its name implies, can easily be manipulated to produce any degree of polynomial we are interested in. We have just fit a relatively simple line, but we can move beyond that to more complicated patterns. Let’s explore that now.

One way to manipulate the complexity of linear regression is to ask, “What happens if we break out of our straitjacket and allow bends?” We can start answering that by looking at what happens when we add a single bend. For the non-mathphobic, a curve with one bend in it—called a parabola—is described by a degree-two polynomial. Instead of fitting a straight line to the points and picking the line with the lowest squared error, we’re going to hold up parabolas—curves with a single bend—to the training data and find the one that fits best. The mathematics are surprisingly, or at least comfortingly, similar. As a result, our code only requires a minor tweak.

In [7]:

# fit-predict-evaluate a 2D polynomial (a parabola)
model_two = np.poly1d(np.polyfit(train_ftr, train_tgt, 2))
preds_two = model_two(test_ftr) 
print("RMSE:", np.sqrt(mse(test_tgt, preds_two)))
RMSE: 1.2765992188881117

Hey, our test error improved quite a bit. Remember, error is like heat going out of your windows in the winter: we want very little of it! If one bend helped so well, maybe we just need a little more wiggle in our lives? Let’s allow up to eight bends. If one was good, eight must be great! We can get eight bends from a degree-9 polynomial. You will really impress your dinner party guests if you tell them that a degree-9 polynomial is sometimes referred to as a nonic. Here’s our degree-9 model’s RMSE:

In [8]:

model_three = np.poly1d(np.polyfit(train_ftr, train_tgt, 9))
preds_three = model_three(test_ftr)
print("RMSE:", np.sqrt(mse(test_tgt, preds_three)))
RMSE: 317.3634424235501

The error is significantly higher—worse—than we saw with the parabola. That might be unexpected. Let’s investigate.

5.3.3 Goldilocks: Visualizing Overfitting, Underfitting, and “Just Right”

That didn’t exactly go as planned. We didn’t just get worse. We got utterly, terribly, horribly worse. What went wrong? We can break down what happened in the training and testing data visually:

In [9]:

fig, axes = plt.subplots(1, 2, figsize=(6, 3), sharey=True)

labels = ['line', 'parabola', 'nonic']
models = [model_one, model_two, model_three]
train = (train_ftr, train_tgt)
test  = (test_ftr, test_tgt)

for ax, (ftr, tgt) in zip(axes, [train, test]):
    ax.plot(ftr, tgt, 'k+')
    for m, lbl in zip(models, labels):
        ftr = sorted(ftr)
        ax.plot(ftr, m(ftr), '-', label=lbl)

axes[1].set_ylim(-20, 200)
axes[0].set_title("Train")
axes[1].set_title("Test");
axes[0].legend(loc='upper center');
Two graphs labeled train and test are shown.

model_one, the straight line, has great difficulty because our real model follows a curved trajectory. model_two eats that up: it follows the curve just about perfectly. model_three seems to do wonderfully when we train. It basically overlaps with both model_two and the real outputs. However, it has problems when we go to testing. It starts exploding out of control near ftr=-7. For ease of comparison, we can rerun the models and gather up the results in one table. Since it is easy to add another midway model, I’ll also throw in a degree-6 model.

In [10]:

results = []
for complexity in [1, 2, 6, 9]:
    model = np.poly1d(np.polyfit(train_ftr, train_tgt, complexity))
    train_error = np.sqrt(mse(train_tgt, model(train_ftr)))
    test_error = np.sqrt(mse(test_tgt, model(test_ftr)))
    results.append((complexity, train_error, test_error))
columns = ["Complexity", "Train Error", "Test Error"]
results_df = pd.DataFrame.from_records(results,
                                       columns=columns,
                                       index="Complexity")

results_df

Out[10]:

 

            Train Error  Test Error
Complexity
1               45.4951     86.6915
2                1.0828      1.2766
6                0.2819      6.1417
9                0.0000    317.3634

Let’s review what happened with each of the three models with complexity 1, 2, and 9.

  • Model one (Complexity 1—a straight line). Model one was completely outclassed. It brought a tricycle to a Formula One race. It was doomed from the beginning. The model doesn’t have enough raw horsepower, or capacity, to capture the complexity of a target. It is too biased towards flatness. The model is underfitting.

  • Model three (Complexity 9—a wiggly 9-degree polynomial). Model three certainly had enough horsepower. We see that it does very well on the training data. In fact, it gets to the point where it is perfect on the training data. But it completely falls apart when it comes to testing. Why? Because it memorized the noise—the randomness in the data. It varies too much with the data. We call this overfitting.

  • Model two (Complexity 2—a parabola). Here we have the Goldilocks solution: it’s not too hot, it’s not too cold, it’s just right. We have enough horsepower, but not so much that we can’t control it. We do well enough on the training data and we see that we are at the lowest testing error. If we had set up a full validation step to select between the three machines with different complexity, we would be quite happy with model two. Model two doesn’t exactly capture the training patterns because the training patterns include noise.

Let’s graph out the results on the train and test sets:

In [11]:

results_df.plot();
A graph presents the trendlines for the train error and test error.

The key take-away from the graph is that as we ratchet up the complexity of our model, we get to a point where we can make the training error very, very small—perhaps even zero. It is a Pyrrhic victory. Where it really counts—on the test set—we get worse. Then, we get terrible. We get give-up-and-go-home bad. To highlight the important pieces of that graph, a version with helpful labels is shown in Figure 5.4.

Errors and complexity are compared using graphs.

Figure 5.4 As complexity increases, we generally move from underfitting to just right to overfitting.

5.3.4 Simplicity

Let’s spend one more minute talking about complexity and its good friend, simplicity. We just saw a concrete example of added complexity making our performance worse. That’s because the added complexity wasn’t used for the right reasons. It was spent following the noise instead of the true pattern. We don’t really get to choose how complexity is used by our learners. They have a complexity—which can be used for good or bad. So, if we had several learners that all performed the same but had differing complexities, the potential abuse of power—by which I mean complexity—might lead us to prefer the simplest of these equal-performing models.

The underlying idea—simplicity is an important rule-of-thumb—is known throughout science and philosophy as Occam’s razor (Ockham’s razor for you historians) from a quote by William of Ockham, “Entities are not to be multiplied without necessity.” (That’s translated from Latin.) For us, the message is that we don’t want more complexity in our model unless there is a reason. Put concretely, we don’t want a higher-degree polynomial unless it pays us back in lower test error. Paraphrasing a much longer actual quote from Einstein, “Make things as simple as possible, but no simpler.” (Hey, if I’m going to name-drop, I’m not going to stop at a philosopher who lived around 1300!)

Here’s a thought that might keep you awake tonight. There is a learning method, which we will discuss later in Section 12.4, that can continue improving its test set performance even after it has apparently mastered the training set. Students of machine learning say it has driven the training error to zero but it is still improving. What’s a real-life equivalent? You might imagine smoothing out your own rough edges in a public performance. Even after you’ve completed rehearsing a scene or preparing a dish well enough for friends and family, there is more you can do before you are ready for the public. Before your opening night on stage, you want to be on point for the harshest critic. Amazingly, there’s a learning system that can transition from a friends-and-family rehearsal to a Broadway show.

5.3.5 Take-Home Notes on Overfitting

This section has a lot to consider. Here are the key points:

  • Underfitting: A very simple model may not be able to learn the pattern in the training data. It also does poorly on the testing data.

  • Overfitting: A very complex model may learn the training data perfectly. However, it does poorly on the testing data because it also learned irrelevant relationships in the training data.

  • Just-right: A medium-complexity model performs well on the training and testing data.

We need the right tradeoff between simplicity and complexity to find a just-right model.

5.4 From Errors to Costs

In our discussion of overfitting and underfitting, we compared model complexity and error rates. We saw that as we vary the complexity of a class of models—the degree of our polynomials—we have different training and test performance. These two aspects, error and complexity, are intimately tied together. As we wander deeper into the zoo of learning methods, we’ll see that some methods can explicitly trade off training error for complexity. In terms of our factory machine, we can consider both our success in copying the input-output relationship and the values we set on the knobs. Separating out these two aspects of “model goodness” lets us speak pretty generally about both regression and classification problems. When we progress a bit more, we’ll also be able to describe many different algorithms in terms of just a few choices. How a method treats errors and complexity are two of those choices.

5.4.1 Loss

So, what is our breakdown? First, we’ll construct a loss function that quantifies what happens when our model is wrong on a single example. We’ll use that to build a training loss function that measures how well our model does on the entire training set. More technical write-ups call this the empirical loss. Empirical simply means “by observation” or “as seen,” so it’s a loss based on the data we’ve seen. The training loss is the sum of the losses on each example. We can write that in code as:

In [12]:

def training_loss(loss, model, training_data):
    ' total training_loss on train_data with model under loss'
    return sum(loss(model.predict(x.reshape(1, -1)), y)
                                  for x, y in training_data)
def squared_error(prediction, actual):
    ' squared error on a single example '
    return (prediction - actual)**2

# could be used like:
# my_training_loss = training_loss(squared_error, model, training_data)

A generic mathematical way to write this is:

$$\mathrm{TrainingLoss}_{\mathrm{Loss}}(m, D_{\mathrm{train}}) = \sum_{(x,y) \in D_{\mathrm{train}}} \mathrm{Loss}(m(x), y)$$

and for the specific case of squared-error (SE) on 3-NN, where 3-NN(x) represents the prediction of 3-NN on an example x:

$$\mathrm{TrainingLoss}_{\mathrm{SE}}(\textrm{3-NN}, D_{\mathrm{train}}) = \sum_{(x,y) \in D_{\mathrm{train}}} \mathrm{SE}(\textrm{3-NN}(x), y) = \sum_{(x,y) \in D_{\mathrm{train}}} \left(\textrm{3-NN}(x) - y\right)^2$$

We can put that to use with:

In [13]:

knn = neighbors.KNeighborsRegressor(n_neighbors=3)
fit = knn.fit(diabetes.data, diabetes.target)

training_data = zip(diabetes.data, diabetes.target)

my_training_loss = training_loss(squared_error,
                                 knn,
                                 training_data)
print(my_training_loss)
[863792.3333]

If we use sklearn’s mean_squared_error and multiply it by the number of training examples—to undo the mean part—we get the same answer.

In [14]:

mse = metrics.mean_squared_error(diabetes.target,
                                 knn.predict(diabetes.data))
print(mse*len(diabetes.data))
863792.3333333333

The somewhat scary equation for TrainingLoss is a fundamental principle that underlies the evaluation calculations we use. We will also add on to that equation, literally, to deal with the problem of determining a good model complexity.

5.4.2 Cost

As we saw with overfitting, if we make our model more and more complex, we can capture any pattern—even patterns that are really noise. So, we need something that works against complexity and rewards simplicity. We do that by adding a value to the training loss to create a total notion of cost. Conceptually, cost = loss + complexity, but we have to fill in some details. The term we add to deal with complexity has several technical names: regularization, smoothing, penalization, or shrinkage. We’ll just call it complexity. In short, the total cost we pay to use a model on some data depends on (1) how well it does and (2) how complicated it is. You can think of the complexity part as a baseline investment. If we have a very high initial investment, we better not have many errors. Conversely, if we have a low initial investment, we might have some room to allow for error. All of this is because we want good performance on unseen data. The term for performance on novel, unseen data is generalization.

One last comment. We don’t have to have a fixed idea of how to trade off error and complexity. We can leave it as an open question and it will become part of the way our machine is built. In technical terms, it’s just another hyperparameter. To use a traditional naming scheme—and to help break mathphobias—I’m going to use a lower-case Greek letter, λ, pronounced “lamb-da” as in “it’s a lamb, duh.” Lambda represents that tradeoff. While it can be unnatural to phrase some learners strictly in terms of loss and complexity, it is very broadly possible. We’ll discuss that idea more in Chapter 15. We can choose a good value of λ by performing several rounds of validation testing and taking the λ that leads to the lowest cost.

In [15]:

def complexity(model):
    # a conceptual stand-in: how complexity is measured depends on the model
    return model.complexity

def cost(model, training_data, loss, _lambda):
    return (training_loss(loss, model, training_data) +
            _lambda * complexity(model))

Mathematically, that looks like

$$\mathrm{Cost}(m, D_{\mathrm{train}}, \mathrm{Loss}, \lambda) = \mathrm{TrainingLoss}_{\mathrm{Loss}}(m, D_{\mathrm{train}}) + \lambda \, \mathrm{Complexity}(m)$$

That is, our cost goes up (1) if we make more mistakes and (2) if we invest resources in more expensive, but also more flexible, models. If we take λ = 2, one unit of complexity is comparable to two units of loss. If we take λ = .5, two units of complexity are comparable to one unit of loss. Shifting λ adjusts how much we care about errors and complexity.
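
Here's a tiny numerical sketch of that tradeoff with made-up loss and complexity amounts; only the relative weighting matters.

# made-up numbers: 10 units of training loss, 4 units of complexity
loss_amount, complexity_amount = 10, 4

for lam in [0.5, 1.0, 2.0]:
    total_cost = loss_amount + lam * complexity_amount
    print(lam, total_cost)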

5.4.3 Score

You will also see the term score or scoring function. Scoring functions—at least in sklearn’s lexicon—are a variation on quantifying loss where bigger values are better. For our purpose, we can consider losses and scores to be inverses: as one goes up, the other goes down. So, we generally want a high score or a low loss. It simply depends on which sort of measurement we are using; they are two different ways of saying the same thing. Another set of opposites is that we will want to minimize a loss or loss function but we will maximize a score or scoring function. To summarize:

  • Score: higher is better, try to maximize.

  • Loss, error, and cost: lower is better, try to minimize.

Once again, if we have two models, we can compare their costs. If we have many different models, we can use some combination of brute force, blind or clever search, and mathematical trickery to pick the lowest-cost models among those—we discussed these alternatives in Section 4.4. Of course, we might be wrong. There might be models we didn’t consider that have even lower cost. Our cost might not be the ideal way of evaluating the models’ performance in the real world. Our complexity measure, or our tradeoff for complexity, might be too high or too low. All of these factors are working behind the scenes when we haltingly say, “We picked the best model and hyperparameters.” Well, we did, at least up to the guesses, assumptions, and constraints we used.

We’ll discuss the practical side of picking good, or at least better, model hyperparameters in Section 11.2. As we will see, modern machine learning software such as sklearn makes it very easy to try many models and combinations of hyperparameters.

5.5 (Re)Sampling: Making More from Less

If we content ourselves with a single train-test split, that single step provides and determines both the data we can train from and our testing environment. It is a simple method, for sure. However, we might get (un)lucky and get a very good train-test split. Imagine we get very hard training data and very easy testing data. Boom—all of a sudden, we are overestimating how well we will do in the big, bad real world. If overconfidence is our real concern, we’d actually like a worst-case scenario: easy training and hard testing that would lead us to underestimate the real-world performance. Enough with single-evaluation scenarios, however. Is there a way we could do better? If we ask people to estimate the number of beans in a jar, they will individually be wrong. But if we get many, many estimates, we can get a better overall answer. Hurray for the wisdom of crowds. So, how do we generate multiple estimates for evaluation? We need multiple datasets. But we only have one dataset available. How can we turn one dataset into many?

5.5.1 Cross-Validation

The machine learning community’s basic answer to generating multiple datasets is called cross-validation. Cross-validation is like a card game where we deal out all the cards to three players, play a round of the game, and then shift our cards to the player on the right—and repeat that process until we’ve played the game with each of the three different sets of cards. To figure out our overall score, we take the individual scores from each different hand we played with and combine them, often with an average.

Let’s go directly to an example. Cross-validation takes a number of folds which is like the number of players we had above. With three players, or three folds, we get three different attempts to play the game. For 3-fold cross-validation, we’ll take an entire set of labeled data and shake it up. We’ll let the data fall randomly—as evenly as possible—into the three buckets in Figure 5.5 labeled with Roman numerals: BI, BII, and BIII.

An illustration shows a bucket 'D' with 99 examples. The data is split equally among three buckets (BI, BII, and BIII), each with 33 examples.

Figure 5.5 Our setup for cross-validation splits the data into approximately equal buckets.

Now, we perform the following steps shown in Figure 5.6:

  1. Take bucket BI and put it to the side. Put BII and BIII together and use them as our training set. Train a model—we’ll call it ModelOne—from that combined training set. Now, evaluate ModelOne on bucket BI and record the performance as EvalOne.

  2. Take BII and put it to the side. Put buckets BI and BIII together and use them as our training set. Train ModelTwo from that combined training set. Now, evaluate ModelTwo on bucket BII and record the performance as EvalTwo.

  3. Take bucket BIII and put it to the side. Put BI and BII together as our training set. Train ModelThree from that combined training set. Now, evaluate ModelThree on BIII and record the performance as EvalThree.

Distribution of data using cross-validation bucket.

Figure 5.6 We use each cross-validation bucket, in turn, as our test data. We train on the remainder.

Now, we’ve recorded three performance values. We can do several things with the values, including graphing them and summarizing them statistically. Graphing them can help us understand how variable our performance is with respect to different training and testing datasets. It tells us something about how our model, our training sets, and our testing sets interact. If we see a very wide spread in our performance measures, we would be justified in being skeptical of any single performance score for our system. On the other hand, if the scores are all similar, we have some certainty that, regardless of the specific train-test split, our system’s performance will be similar. One caveat: our random sampling is done without replacement and the train-test splits are all dependent on each other. This breaks some of the usual assumptions we make in statistics land. If you are concerned about it, you might want to see Section 5.5.3.
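
Here's a minimal by-hand sketch of those three train-and-evaluate rounds on the diabetes data, using sklearn's KFold to form the buckets; the 10-NN regressor and mean squared error are my own choices for illustration.

evals = []
for train_idxs, test_idxs in skms.KFold(3).split(diabetes.data):
    # two buckets together form the training set; the third is set aside
    model = neighbors.KNeighborsRegressor(n_neighbors=10)
    fit = model.fit(diabetes.data[train_idxs], diabetes.target[train_idxs])
    preds = fit.predict(diabetes.data[test_idxs])
    evals.append(metrics.mean_squared_error(diabetes.target[test_idxs],
                                            preds))
print(evals)  # EvalOne, EvalTwo, EvalThree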

What we’ve described is 3-fold cross-validation. The general name for CV techniques is k-fold cross-validation—I’ll usually abbreviate it as k-fold CV or just k-CV. The amount of cross-validation we do depends on a few factors including the amount of data we have. 3-, 5-, and 10-fold CV are commonly used and recommended. But we’re getting ahead of ourselves. Here’s a simple example of 5-fold CV with sklearn:

In [16]:

# data, model, fit & cv-score
model = neighbors.KNeighborsRegressor(10)
skms.cross_val_score(model,
                     diabetes.data,
                     diabetes.target,
                     cv=5,
                     scoring='neg_mean_squared_error')
# notes:
# defaults for cross_val_score are
# cv=3 fold, no shuffle, stratified if classifier
# model.score by default (regressors: r2, classifiers: accuracy)

Out[16]:

array([-3206.7542, -3426.4313, -3587.9422, -3039.4944, -3282.6016])

The default value for the cv argument to cross_val_score is None. To understand what that means, we have to look into the documentation for cross_val_score. Here is the relevant part, cleaned up and simplified a bit:

cv: int or None or others. Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross validation,

  • Integer, to specify the number of folds in a (Stratified)KFold,

  • Others.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

We haven’t discussed stratification yet—we’ll get to it after the next commercial break—but there are two take-home lessons: (1) by default we’re doing 3-fold CV and (2) for classification problems, sklearn uses stratification.

So, what’s up with the scoring='neg_mean_squared_error' argument? You might recall mean squared error (MSE) being a thing. You are in the right ballpark. However, we have to reconcile “error up, bad” with “score up, good.” To do that, sklearn negates the MSE to go from an error measure to a score. The scores are all negative, but a bigger score is better. Think of it as losing less money: instead of being down $100, you are only down $7.50.
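
Here's a quick check of that sign flip using sklearn's get_scorer helper (a function from sklearn.metrics, not something we've used before): the neg_mean_squared_error score is just the mean squared error negated.

knn = neighbors.KNeighborsRegressor(10).fit(diabetes.data, diabetes.target)

loss   = metrics.mean_squared_error(diabetes.target,
                                    knn.predict(diabetes.data))
scorer = metrics.get_scorer('neg_mean_squared_error')
score  = scorer(knn, diabetes.data, diabetes.target)

print(loss, score)  # same magnitude, opposite sign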

One last bit on regression and scoring: the default scoring for regressors is r2 (R2 for the math folks). It is well known in statistics under the name coefficient of determination. We’ll discuss it in Section 7.2.3 but for now, let me simply say that it is very, very easy to misuse and abuse R2. You may be carrying R2 baggage with you—please leave it at the door. This is not your classic R2.

We haven’t talked about what value of k to use. The dirtiest secret of k-CV is that we need to balance three things. The first issue: how long does it take to train and test our models? A bigger k means more buckets which in turn means more training and testing phases. There are some incremental/decremental learning methods that can be implemented to minimize the amount of work necessary to train and test different models. Unfortunately, most common models are not implemented or interfaced with cross-validation routines in a way that allows this efficiency. To make matters even more interesting, training on smaller datasets is faster than training on larger datasets. So, we are comparing many small trainings (and large testings) with fewer larger trainings (and smaller testings). The run times will depend on the specific learning methods you use.

The second issue to balance is that the value of k slides up and down between two extremes. The smallest useful value is k = 2, which makes two buckets of data and two estimates of our test error. The largest value k can take is the number of data points we have, k = n. This results in each example being in its own bucket and n total models and estimates. In addition, when we have two buckets, we never train on the same data. With three buckets, we have some overlap—half the training set used to create the model tested on BI is in common with the training sets used to create the model tested on BII. Specifically, the common elements are those in BIII. With n buckets, between any two of our models, they are trained on the same n – 2 examples (see Figure 5.7). To get to the full n examples, there’s one more example for training that is different in the two CV folds and another example that is reserved for testing.

Overlap in n-fold cross-validation is illustrated.

Figure 5.7 Overlap in n-fold cross-validation.

The net effect is that the data in the training folds for k = 2 is very different and the data in the training folds for k = n is almost the same. This means the estimates we get out of k = 2 will be quite different—if there is a difference to be found. The estimates out of k = n will be very similar, because they are doing almost the same thing!

The third issue is that small values of k—relatively few folds—will result in training set sizes ranging from 50% (k = 2) to 90% (k = 10) of the data. Whether this is acceptable—whether learning on that much data will be sufficient—depends on the problem at hand. We can evaluate that graphically using learning curves, as in Section 5.7.1. If the learning curve flattens out at the percent of data we are trying to learn from—if that much data is enough to get us to a sufficient performance threshold—we are probably OK using the related number of folds.

5.5.2 Stratification

Let’s turn to a quick example of cross-validation in a classification context. Here, we tell cross_val_score to use 5-fold CV.

In [17]:

iris = datasets.load_iris()
model = neighbors.KNeighborsClassifier(10)
skms.cross_val_score(model, iris.data, iris.target, cv=5)

Out[17]:

array([0.9667, 1. , 1. , 0.9333, 1. ])

As mentioned above, the cross-validation was done in a stratified manner because it is the default for classifiers in sklearn. What does that mean? Basically, stratification means that when we make our training-testing splits for cross-validation, we want to respect the proportions of the targets that are present in our data. Huh? Let’s do an example. Here’s a tiny dataset targeting cats and dogs that we’ll take two-fold training samples from:

In [18]:

# not stratified
pet = np.array(['cat', 'dog', 'cat',
                'dog', 'dog', 'dog'])
list_folds = list(skms.KFold(2).split(pet))
training_idxs = np.array(list_folds)[:, 0, :]

print(pet[training_idxs])
[['dog' 'dog' 'dog']
 ['cat' 'dog' 'cat']]

Our cat-loving readers will notice that there were no cats in the first fold. That’s not great. If that were our target, we would have no examples to learn about cats. Simply put, that can’t be good. Stratified sampling enforces fair play among the cats and dogs:

In [19]:

# stratified
# note: typically this is behind the scenes
# making StratifiedKFold produce readable output
# requires some trickery. feel free to ignore.
pet = np.array(['cat', 'dog', 'cat', 'dog', 'dog', 'dog'])
idxs = np.array(list(skms.StratifiedKFold(2)
                         .split(np.ones_like(pet), pet)))
training_idxs = idxs[:, 0, :]
print(pet[training_idxs])
[['cat' 'dog' 'dog']
 ['cat' 'dog' 'dog']]

Now, both folds have a balanced number of cats and dogs, equal to their proportion in the overall dataset. Stratification ensures that we have the same (or nearly the same, once we round off uneven splits) percent of dogs and cats in each of our training sets as we do in our entire, available population. Without stratification, we could end up having too few (or even none) of a target class—in our nonstratified example, the first training set had no cats. We don’t expect that training data to lead to a good model.

Stratification is particularly useful when (1) we have limited data overall or (2) we have classes that are poorly represented in our dataset. Poor representation might be due to rareness—if we’re talking about an uncommon disease or winning lottery tickets—or it might be due to our data collection processes. Having a limited total amount of data makes everything rare, in a sense. We will discuss more issues around rare classes in Section 6.2.

How does the default stratification apply to the iris dataset in Chapter 3? It means that when we perform the cross-validation splits, we can be sure that each of the training sets has a balanced representation from each of the three possible target flowers. What if we don’t want stratification? It’s slightly more tricky, but we can do it:

In [20]:

# running nonstratified CV
iris = datasets.load_iris()
model = neighbors.KNeighborsClassifier(10)
non_strat_kf = skms.KFold(5)
skms.cross_val_score(model,
                     iris.data,
                     iris.target,
                     cv=non_strat_kf)

Out[20]:

array([1. , 1. , 0.8667, 0.9667, 0.7667])

We can make an educated guess that the last fold probably had a bad distribution of flowers. We probably didn’t see enough of one of the species to learn patterns to identify it.
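
One way to check that guess is to count how many examples of each species land in each nonstratified training fold; np.bincount is just a convenient counter here.

# class counts in each nonstratified training fold
for fold, (train_idxs, test_idxs) in enumerate(skms.KFold(5).split(iris.data)):
    print("fold", fold, "training class counts:",
          np.bincount(iris.target[train_idxs]))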

5.5.3 Repeated Train-Test Splits

Here’s another example of the train-test split with an added twist. (That’s a train-test twist, if you’re keeping track.) Our twist is that we are going to do some repeated coin flipping to generate several train-test splits. Why do we want to repeat the fundamental train-test split step? Any time we rely on randomness, we are subject to variation: several different train-test splits might give different results. Some of those might turn out wonderfully and some may turn out horribly. In some scenarios, like playing the lottery, the vast majority of outcomes are very similar—you don’t win money. In others, we don’t know ahead of time what the outcomes are. Fortunately, we have an extremely useful tool at our disposal that we can pull out when confronted with unknown randomness. Do the random thing many times and see what happens. Stand back, we’re about to try science!

In the case of train-test splits, we generally don’t know ahead of time how well we expect to perform. Maybe the problem is really easy and almost all of the train-test splits will give a good learner that performs well on the test set. Or maybe it is a very hard problem and we happen to select an easy subset of training data—we do great in training, but perform horribly in testing. We can investigate the variation due to the train-test split by making many train-test splits and looking at different results. We do that by randomly resplitting several times and evaluating the outcomes. We can even compute statistics—the mean, median, or variance—of the results if we really want to get technical. However, I am always a fan of looking at the data before we get into summarizing the data with statistics.

The multiple values—one per train-test split—get us a distribution of the results and how often they occur. Just like drawing a graph of the heights of students in a classroom gets us a distribution of those heights, repeated train-test splits get us a distribution of our evaluation measure—whether it is accuracy, root-mean-squared-error, or something else. The distribution is not over every possible source of variation. It is simply taking into account one difference due to randomness: how we picked the training and testing data. We can see how variable our result is due to the randomness of making a train-test split. Without further ado, let’s look at some results.

In [21]:

# as a reminder, these are some of the imports
# that are hidden behind: from mlwpy import *
# from sklearn import (datasets, neighbors,
#                      model_selection as skms,
#                      linear_model, metrics)
# see Appendix A for details

linreg   = linear_model.LinearRegression()
diabetes = datasets.load_diabetes()

scores = []
for r in range(10):
    tts = skms.train_test_split(diabetes.data,
                                diabetes.target,
                                test_size=.25)
    (diabetes_train_ftrs, diabetes_test_ftrs,
     diabetes_train_tgt, diabetes_test_tgt) = tts

    fit   = linreg.fit(diabetes_train_ftrs, diabetes_train_tgt)
    preds = fit.predict(diabetes_test_ftrs)

    score = metrics.mean_squared_error(diabetes_test_tgt, preds)
    scores.append(score)

scores = pd.Series(np.sqrt(sorted(scores)))
df = pd.DataFrame({'RMSE':scores})
df.index.name = 'Repeat'
display(df.T)

Repeat |     0 |     1 |     2 |     3 |     4 |     5 |     6 |     7 |     8 |     9
-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------
RMSE   | 49.00 | 50.19 | 51.97 | 52.07 | 53.20 | 55.70 | 56.25 | 57.49 | 58.64 | 58.69

You can certainly take looking at the data to an extreme. A raw list is only useful for relatively few values—people don’t scale well to reading too many numbers. Let’s make a plot. swarmplot, from the Seaborn library, is very useful here. It makes a single value plot—also called a stripplot—and stacks repeated values horizontally so you get a feel for where there are clumps of values.

In [22]:

ax = plt.figure(figsize=(4, 3)).gca()
sns.swarmplot(y='RMSE', data=df, ax=ax)
ax.set_xlabel('Over Repeated\nTrain-Test Splits');
A graph presents RMSE over repeated train-test splits. The points are stacked vertically in the range 49 to 58.69.

In [23]:

display(df.describe().T)

 

     |  count |   mean |   std |    min |    25% |    50% |    75% |    max
-----|--------|--------|-------|--------|--------|--------|--------|-------
RMSE | 10.000 | 54.322 | 3.506 | 49.003 | 51.998 | 54.451 | 57.182 | 58.694

When evaluating plots like this, always orient yourself to the scale of the data. At first, we might think that this data is pretty spread out, but on closer inspection we see that it is clustered between roughly 49 and 59. Whether that is “a lot” depends on the size of the RMSE values: the mean is near 55, so the spread is in the ballpark of ±10%. That’s large enough to warrant our attention.

As a quick Python lesson, here’s a way we can rewrite the score-computing code above with a list comprehension instead of a loop. The basic strategy is to (1) take the contents of the loop and turn it into a function and (2) use that function repeatedly in a list comprehension. This rewrite gets us a bit of a performance gain—but I’m not doing it for the resource optimization. The biggest win is that we’ve given a name to our process of making a train-test split, fitting, predicting, and evaluating. As in the book of Genesis, naming is one of the most powerful things we can do in a computer program. Defining a function also gives us a single entity that we can test and reuse in other code.

In [24]:

def tts_fit_score(model, data, msr, test_size=.25):
    ' apply a train-test split to fit model on data and eval with msr '
    tts = skms.train_test_split(data.data,
                                data.target,
                                test_size=test_size)

    (train_ftrs, test_ftrs, train_tgt, test_tgt) = tts

    # use the model that was passed in, not a global
    fit   = model.fit(train_ftrs, train_tgt)
    preds = fit.predict(test_ftrs)

    score = msr(test_tgt, preds)
    return score

linreg = linear_model.LinearRegression()
diabetes = datasets.load_diabetes()
scores = [tts_fit_score(linreg, diabetes,
                        metrics.mean_squared_error) for i in range(10)]
print(np.mean(scores))
3052.540273057884

I’ll leave you with one final comment on repeated train-test splits and cross-validation. With k-CV, we will get one, and only one, prediction for each and every example. Each example is in precisely one test bucket. The predictions for the whole dataset will be aggregated from the k models that are developed on different sets of data. With repeated train-test splits, we may completely ignore training or predicting on some examples and make repeated predictions on other examples as we see in Figure 5.8. In repeated train-test splits, it is all subject to the randomness of our selection process.

An illustration depicts splitting of data in cross-validation and repeated train-test splits.

Figure 5.8 RTTS may have duplication between repeats.
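If you would like to see that duplication for yourself, here is a rough sketch of my own (not from the book's notebook) that counts how many times each diabetes example lands in a test set under 10-fold CV versus ten ShuffleSplit repeats. With KFold, every example is tested exactly once; with repeated splits, some examples are tested several times and others not at all.

import numpy as np
from sklearn import datasets, model_selection as skms

diabetes = datasets.load_diabetes()
n_examples = len(diabetes.target)

# 10-fold CV: every example lands in a test fold exactly once
kfold_counts = np.zeros(n_examples, dtype=int)
for train_idx, test_idx in skms.KFold(10).split(diabetes.data):
    kfold_counts[test_idx] += 1

# 10 repeated train-test splits: test counts vary example to example
rtts_counts = np.zeros(n_examples, dtype=int)
ss = skms.ShuffleSplit(n_splits=10, test_size=.25, random_state=42)
for train_idx, test_idx in ss.split(diabetes.data):
    rtts_counts[test_idx] += 1

# how many examples were tested 0, 1, 2, ... times?
print("KFold        :", np.bincount(kfold_counts))
print("ShuffleSplit :", np.bincount(rtts_counts))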

5.5.4 A Better Way and Shuffling

Managing the repeated looping to make multiple train-test splits was a bit annoying. It was not heart- or back-breaking, but there are many places we could make a mistake. It would be nice if someone wrapped the process up in a single stand-alone function. Fortunately, sklearn has done that. If we pass in a ShuffleSplit data-splitter to the cv argument of cross_val_score, we get precisely the algorithm we hand-coded above.

In [25]:

linreg   = linear_model.LinearRegression()
diabetes = datasets.load_diabetes()

# nondefault cv= argument
ss = skms.ShuffleSplit(test_size=.25) # default, 10 splits
scores = skms.cross_val_score(linreg,
                              diabetes.data, diabetes.target,
                              cv=ss,
                              scoring='neg_mean_squared_error')

scores = pd.Series(np.sqrt(-scores))
df = pd.DataFrame({'RMSE':scores})
df.index.name = 'Repeat'
display(df.describe().T)

ax = sns.swarmplot(y='RMSE', data=df)
ax.set_xlabel('Over Repeated\nTrain-Test Splits');

 

     |  count |   mean |   std |    min |    25% |    50% |    75% |    max
-----|--------|--------|-------|--------|--------|--------|--------|-------
RMSE | 10.000 | 55.439 | 3.587 | 50.190 | 52.966 | 55.397 | 58.391 | 60.543

A graph presents RMSE over repeated train-test splits. The points are stacked vertically in the range 50 to 65.

The slight differences with our manual version are due to randomly selecting the train-test splits.

Now, I want to talk about another way that randomness affects us as intrepid students of machine learning. It’s the kind of randomness that the computer uses when we ask it to do random things. Here’s what’s going on behind the scenes with ShuffleSplit. Don’t worry, I’ll explain random_state in just a second.

In [26]:

ss = skms.ShuffleSplit(test_size=.25, random_state=42)

train, test = 0, 1
next(ss.split(diabetes.data))[train][:10]

Out[26]:

array([ 16, 408, 432, 316, 3, 18, 355, 60, 398, 124])

By the way, I use next because ShuffleSplit relies on a Python generator to produce one split after another. Saying next provides me with the next data split. After fetching the next data split, I pick out the training data [train] and then the first ten examples [:10].
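If generators are new to you, here is a tiny stand-alone illustration of my own (not part of the book's notebook) of how next pulls one value at a time:

def count_up():
    ' a generator: each call to next() hands back the following value '
    yield 1
    yield 2
    yield 3

gen = count_up()
print(next(gen))   # 1
print(next(gen))   # 2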

Good enough. Let’s try again.

In [27]:

ss = skms.ShuffleSplit(test_size=.25, random_state=42)
next(ss.split(diabetes.data))[train][:10]

Out[27]:

array([ 16, 408, 432, 316, 3, 18, 355, 60, 398, 124])

That can’t be good. Someone call—someone! We need help. The results are the same. Wasn’t this supposed to be random? The answer is yes . . . and no. Randomness on a computer is often pseudo-random. It’s a long list of numbers that, when put together, are random enough to fake their randomness. We start at some point in that list and start taking values. To an outside observer, they seem pretty random. But if you know the mechanism, you could actually know what values are coming ahead of time. Thus, (1) the values we generate will look mostly random, but (2) the process used to generate them is actually deterministic. This determinism has a nice side effect that we can take advantage of. If we specify a starting point for the sequence of pseudo-random numbers, we can get a reproducible list of the not-so-random values. When we use random_state, we are setting a starting point for ShuffleSplit to use when it asks for randomness. We’ll end up getting the same outputs. Repeatable train-test splitting is very useful for creating reproducible test cases, sharing examples with students, and eliminating some degrees of freedom when chasing down bugs.
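Here is a small sketch of my own showing the same seeding idea with NumPy's general-purpose random generator: seeding it with the same value twice reproduces the exact same draws, just as random_state does for ShuffleSplit.

import numpy as np

rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

print(rng_a.integers(0, 100, size=5))   # five "random" draws
print(rng_b.integers(0, 100, size=5))   # ... and the exact same five again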

While we’re at it, here’s another place where a similar issue comes up. Let’s do two separate runs of KFolding.

In [28]:

train, test = 0, 1
kf = skms.KFold(5)
next(kf.split(diabetes.data))[train][:10]

Out[28]:

array([89, 90, 91, 92, 93, 94, 95, 96, 97, 98])

In [29]:

kf = skms.KFold(5)
next(kf.split(diabetes.data))[train][:10]

Out[29]:

array([89, 90, 91, 92, 93, 94, 95, 96, 97, 98])

The lack of randomness, in places we want randomness, is starting to get a little old. The issue here is the default parameters to KFold:

skms.KFold(n_splits=3, shuffle=False, random_state=None)

shuffle=False, the default, means that we don’t shake up the examples before distributing them to different folds. If we want them shaken up, we need to say so. To keep the examples a bit more readable, we’ll switch back to the simple pet targets.

In [30]:

pet = np.array(['cat', 'dog', 'cat',
                'dog', 'dog', 'dog'])

kf = skms.KFold(3, shuffle=True)

train, test = 0, 1
split_1_group_1 = next(kf.split(pet))[train]
split_2_group_1 = next(kf.split(pet))[train]

print(split_1_group_1,
      split_2_group_1)
[0 1 4 5] [0 1 3 5]

If we set a random_state, it’s shared by the splitters:

In [31]:

kf = skms.KFold(3, shuffle=True, random_state=42)

split_1_group_1 = next(kf.split(pet))[train]
split_2_group_1 = next(kf.split(pet))[train]
print(split_1_group_1,
      split_2_group_1)
[2 3 4 5] [2 3 4 5]

5.5.5 Leave-One-Out Cross-Validation

I mentioned above that we could take an extreme approach to cross-validation and use as many cross-validation buckets as we have examples. So, with 20 examples, we could potentially make 20 train-test splits, do 20 training fits, do 20 testing rounds, and get 20 resulting evaluations. This version of CV is called leave-one-out cross-validation (LOOCV) and it is interesting because all of the models we generate are going to have almost all of their training data in common. With 20 examples, 90% of the data is shared between any two training runs. You might refer back to Figure 5.7 to see it visually.

In [32]:

linreg   = linear_model.LinearRegression()
diabetes = datasets.load_diabetes()

loo = skms.LeaveOneOut()
scores = skms.cross_val_score(linreg,
                              diabetes.data, diabetes.target,
                              cv=loo,
                              scoring='neg_mean_squared_error')

scores = pd.Series(np.sqrt(-scores))
df = pd.DataFrame({'RMSE':scores})
df.index.name = 'Repeat'

display(df.describe().T)

ax = sns.swarmplot(y='RMSE', data=df)
ax.set_xlabel('Over LOO\nTrain-Test Splits');

 

     |   count |   mean |    std |   min |    25% |    50% |    75% |     max
-----|---------|--------|--------|-------|--------|--------|--------|--------
RMSE | 442.000 | 44.356 | 32.197 | 0.208 | 18.482 | 39.547 | 63.973 | 158.236

A graph presents RMSE over the LOO train-test splits.

Curiously, there are three noticeable points with a high RMSE and there are about twenty points that form a distinct peak above the main body of the errors (RMSE > 100). That means there are about twenty points that are resistant to prediction with the model we are building using almost all of the data. It would be worthwhile to investigate any common factors in those difficult examples.

LOOCV is a deterministic evaluation method. There’s no randomness in the selection because everything is used in the same way every time we run LOOCV. This determinism can be useful for comparing and testing the correctness of learning algorithms. However, it can be expensive to run LOOCV because we need to train the model once for each left-out example. Some models have mathematical tricks that can be used to drastically reduce the overhead of retraining. On the evaluation side, the net effect of incorporating lots of training data—all but one example—in every CV partition is that LOOCV gives a relatively unbiased estimate of the real error rate. Because the single-example predictions are so closely related—most of the training data is shared and piped into the same learning algorithm—the estimates of our performance error on new examples can vary widely. Overall, a general recommendation is to prefer 5- or 10-fold CV over LOOCV.
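To put a number on that expense, here is a quick check of my own (not one of the book's cells): on the diabetes data, LOOCV performs one fit per example, compared to five fits for 5-fold CV.

from sklearn import datasets, model_selection as skms

diabetes = datasets.load_diabetes()

print(skms.LeaveOneOut().get_n_splits(diabetes.data))   # 442 model fits
print(skms.KFold(n_splits=5).get_n_splits())            # versus 5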

5.6 Break-It-Down: Deconstructing Error into Bias and Variance

Let’s imagine that we are at a race track and we start taking some basic measurements. We see cars zooming around the track and we measure where the cars are and how fast they are going. Let’s say we record two lap times, t1 and t2, for a total distance d = 2, along with the total time ttotal and an average speed s. If I have a table of these values and I don’t remember any high-school physics, I can start trying to relate the different columns to each other. For example, from the two lap times and the total time I might come up with the fact that t1 + t2 = ttotal. Again, imagine we forgot everything we learned in high-school physics or maybe even fourth-grade math.

Driver | t1 | t2 | ttotal |    s | d
-------|----|----|--------|------|--
Mario  | 35 | 75 |    110 | .018 | 2
Luigi  | 20 | 40 |     60 | .033 | 2
Yoshi  | 40 | 50 |     90 | .022 | 2

Let’s consider some variations in trying to relate the columns. First, are the measurements perfect or are there errors in how we recorded them? Second, what relationships—what mathematical operations—am I allowed to use to relate the columns? To keep things under control, we’ll limit ourselves to two simple operations, addition and multiplication, and see what happens with them. Lastly, we’ll consider relationships between different sets of columns as inputs and outputs.

In Table 5.2, I’ve laid out the different possibilities and some assessments of how perfectly they can describe the data and what goes wrong.

Table 5.2 Sources of errors in learning.

Inputs    | Output | Measurement errors | True Relationship | Try to Relate With | Perfect? | Why?
----------|--------|--------------------|-------------------|--------------------|----------|---------------------
t1, t2    | ttotal | no                 | add               | add                | yes      |
t1, t2    | ttotal | yes                | add               | add                | no       | measurement errors
ttotal, s | d      | no                 | multiply          | add                | no       | can't get right form

Two of these three cases are subpar. The two cases where we end up with “Perfect? No!” are equivalent to two sources of error that we must address when we develop learning systems. A third source of error is the interaction between the training data and the learner. We saw hints of this interaction when we saw the different results from training on different training sets. Together, these three examples of error give us a great foundation to break down the ways we can make mistakes in predictions. Measurement errors—the second line in Table 5.2—reduce our ability to relate values clearly, but those errors may be difficult to control. They may not even be our fault if someone else did the measuring; we’re doing the modeling. But the third line, where we have a mismatch between reality and our chosen model, is a problem of our own making.

5.6.1 Variance of the Data

When we make a mistake—when we have an incorrect class or a MSE greater than zero—there can be a few different causes. One of these causes—the actual randomness in the relationship between the input features and the output target—we have no real control over. For example, not every college graduate that majored in economics and has five years of professional work experience earns the same amount of money. There is a wide range of possibilities for their income. If we include more information, such as selectiveness of their undergrad school, we may be able to narrow down that range. However, randomness will still remain. Similarly, depending on our timing devices and user error at the race track, we may record the times more or less precisely (repeatably).

Having a range of outputs is a fundamental difference between the mathematical functions you saw in high school and random processes. Instead of one input having one-and-only-one output, a single input can have a range—that’s a distribution—of outputs. We’ve wrapped back around to rolling a die or flipping a coin: dealing with randomness. The degree to which our data is affected by randomness—either in measurement or in real-world differences—is called the variance of the data.

5.6.2 Variance of the Model

There are some sources of error we can control in a learning system, but there may be limits on our control. When we pick a single model—say, linear regression—and go through a training step, we set the values of the parameters of that model. We are setting the values on the knobs of our factory machine. If we choose our training and testing datasets at random, which we should, we lose some control over the outcome. The parameters of our model—the values of our knobs—are subject to the coin-flipping choice of training data. If we flip the coins again, we get different training data. With different training data we get a different trained model. The way models vary due to the random selection of the data we train on is called the variance of the model.

A trained model will give us different answers when we use it on test cases and in the wild. Here’s a concrete example. If we have one very bad data point, with 1-Nearest Neighbors, most of our training and testing examples will be unaffected by it. However, for any example whose nearest neighbor is that bad data point, things will go wrong. Conversely, if we used a large number of neighbors, the effect of that bad example would be diluted among many other training examples. We’ve ended up with a tradeoff: being able to account for tricky examples also leaves us exposed to following bad examples. Our racetrack example did not include an example of variance due to the model training.

5.6.3 Bias of the Model

Our last source of error is where we have the most control. When I choose between two models, one may have a fundamentally better resonance with the relationship between the inputs and outputs. We’ve already seen one example of poor resonance in Section 5.3.2: a line has great difficulty following the path of a parabola.

Let’s tie this idea to our current discussion. We’ll start by eliminating noise—the inherent randomness—we discussed a few paragraphs back. We eliminate it by considering only a best-guess output for any given input. So, while an input example made from level of education, degree program, and years post-graduation { college, economics, 5} actually has a range of possible income predictions, we’ll take one best value to represent the possible outputs. Now, we ask, “How well can model one line up with that single value?” and “How well can model two line up with that single value?” Then, we expand that process to all of our inputs—{secondary, vocational, 10}, {grad, psychology, 8}—and ask how well do the models match the single best guesses for every possible input.

Don’t worry, we’ll make these ideas more concrete in a moment.

We say that a model that cannot match the actual relationship between the inputs and outputs—after we ignore the inherent noisiness in the data—has higher bias. A highly biased model has difficulty capturing complicated patterns. A model with low bias can follow more complicated patterns. In the racetrack example, when we wanted to relate speed and time to come up with a distance, we couldn’t do it with addition because the true relationship between them is multiplication.

5.6.4 All Together Now

These three components give us a fundamental breakdown of the sources of errors in our predictions. The three components are (1) the inherent variability in our data, (2) the variability in creating our predicting model from training data, and (3) the bias of our model. The relationship between these and our overall error is called the bias-variance decomposition, written mathematically as

Error = Bias_Learner + Variance_Learner(Training) + Variance_Data

I’m sweeping many details of this equation under the carpet. But take heart! Even graduate-level, mathematically inclined textbooks sweep details of this particular equation under the carpet. Of course, they call it “removing unnecessary details,” but we won’t hold it against them. I’ll just say that we’re in good company. Before we look at some examples, let’s reiterate one more time. The errors in our predictions are due to randomness in the data, variability in building our model from training data, and the difference between what relationships our model can express and the actual, true relationship.
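To make the decomposition a bit more concrete before the examples below, here is a small simulation of my own devising (not from the book): we repeatedly draw noisy training sets from a known curve, fit a very simple model and a very wiggly one, and estimate the bias and variance of their predictions at a single test point. The curve y = x squared, the noise level, and the polynomial degrees are all arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return x**2                      # the "true" relationship, before noise

x0, n_train, n_repeats, noise_sd = 0.75, 20, 2000, 0.25

preds = {'mean-only (degree 0)': [], 'wiggly (degree 6)': []}
for _ in range(n_repeats):
    xs = rng.uniform(-1, 1, n_train)
    ys = true_fn(xs) + rng.normal(0, noise_sd, n_train)
    # degree-0 fit is just the training mean; degree-6 can wiggle a lot
    preds['mean-only (degree 0)'].append(np.polyval(np.polyfit(xs, ys, 0), x0))
    preds['wiggly (degree 6)'].append(np.polyval(np.polyfit(xs, ys, 6), x0))

print(f"variance of the data (noise): {noise_sd**2:.4f}")
for name, ps in preds.items():
    ps = np.array(ps)
    bias_sq  = (ps.mean() - true_fn(x0))**2   # how far off we are on average
    variance = ps.var()                       # how much we bounce around
    print(f"{name}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")

The simple model should show the larger bias and the wiggly model the larger variance, which is exactly the tradeoff the next section walks through method by method.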

5.6.5 Examples of Bias-Variance Tradeoffs

Let’s examine a few concrete examples of the bias-variance tradeoff by looking at how it applies to k-Nearest Neighbors, Linear Regression, and Naive Bayes.

5.6.5.1 Bias-Variance for k-NN

Let’s think about what happens with k-NN as we vary the number of neighbors. Start by going to extremes. The fewest number of neighbors we could use is one. This amounts to saying, “If I’m a new example, then find who is most like me and label me with their target.” 1-NN, as a strategy, has the potential to have a very jagged or wiggly border. Every training example gets to have its own say without consulting anyone else! From the opposite perspective, once we find the closest example, we ignore what everyone else says. If there were ten training examples, once we find our closest neighbor, nothing about the other nine matters.

Now, let’s go to the opposite extreme. Let’s say we have ten examples and we do 10-NN. Our strategy becomes “I’m a new example. Find my ten closest neighbors and average their target. That’s my predicted target.” Well, with just ten total examples, every new example we come across is going to have exactly those ten nearest neighbors. So, regardless of the example, we are averaging everyone’s target value. This is equivalent to saying, “Make my predicted target the overall training mean.” Our predictions here have no border: they are all exactly the same. We predict the same value regardless of the input predictor values. The only more biased prediction would be predicting some constant—say, 42—that isn’t computed from the data at all.

Figure 5.9 summarizes the bias-variance tradeoff for k-NN. Increasing the number of neighbors increases our bias and decreases our variance. Decreasing the number of neighbors increases our variance and decreases our bias.

An illustration represents bias-variance tradeoff for 1-NN and 3-NN.

Figure 5.9 Bias in k-NN.
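Here is a quick check of my own, on a tiny made-up regression dataset, of the claim that using every training example as a neighbor collapses the prediction to the training mean:

import numpy as np
from sklearn import neighbors

xs = np.arange(10).reshape(-1, 1)
ys = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)

# use every training example as a "neighbor"
knn_all = neighbors.KNeighborsRegressor(n_neighbors=len(xs)).fit(xs, ys)

print(knn_all.predict([[-100], [0], [42]]))   # the same value, three times
print(ys.mean())                              # ... and that value is the mean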

5.6.5.2 Bias-Variance for Linear Regression

What is the comparable analysis for linear regression? There are two different ways to think about it and I want to simplify both of them for now. Let’s modify a plain-vanilla linear regression in two ways:

  • Restricting the features that are included

  • Adding new pseudo-features that have a simple relationship to the original features

We’ll start with two possible linear regression models. ConstantLinear is just predicting a flat horizontal line or surface. The second model, PlainLinear, is our standard line or plane-like model that can incline and tilt. In terms of the weights we discussed in Section 4.3.2, the first model sets all weights except w0 to zero and gives the same output value regardless of input. It says, “I’m afraid of change, don’t confuse me with data.” The second model says, “It’s a party, invite everyone!” You can imagine a middle ground between these two extremes: pick and choose who to invite to the party. That is, set some of the weights to zero. So, we have four variations on the linear regression model:

  • Constant linear: include no features, wi = 0 for all i ≠ 0.

  • Few: include a few features, most wi = 0.

  • Many: include many features, a few wi = 0.

  • Plain linear: include all features, no wi = 0.

These give us a similar spectrum of complexities for our linear regression model as we saw with k-NN. As we include fewer features by setting more weights to zero, we lose our ability to distinguish between differences represented in the lost features. Put another way, our world gets flatter with respect to the missing dimensions. Think about taking a soda can and flattening it like a pancake. Any of the height information about the can has been completely lost. Even irregularities in the can—like that little lip that collects spilled out soda—are gone. Shining a light on objects (as in Figure 5.10) gives us an analogy for how differences are hidden when we lose information.

An illustration shows 3D views of a ball and a cup, placed in front of a bulb. The shadows of the ball and the cup are represented as 2D circles.

Figure 5.10 Losing features (dimensions) restricts our view of the world and increases our bias.

Knocking out some of the features completely, by setting wi’s to zero, is quite extreme. We’ll see a more gradual process of making the weights smaller, but not necessarily zero, in Section 9.1.

Now let’s turn to extending the features we include. As we saw earlier in this chapter, by adding more polynomial terms—x2, x3, and friends—we can accommodate more bends or wiggles in our data (Figure 5.11). We can use those bends to capture examples that appear to be oddities. As we also saw, that means we can be fooled by noise.

An illustration represents the effect of adding complex terms to a model. A model (a semicircle) represented as truth is shown. Several dots (represented as data) are marked on either side of the model. In the next diagram, a straight line and wiggles (represented as model fits) are drawn along the data of the model.

Figure 5.11 Adding complex terms lets our model wiggle more, but that variance might follow noise.

In linear regression, adding features—like polynomial terms—decreases bias but increases variance. Conversely, forcing the weights of features to zero increases bias and decreases variance.

5.6.5.3 Relating k-NN and Linear Regression

There is a nice connection between the different linear regression models and the k-NN spectrum. The constant linear model—the simplest and most biased model—predicts a single value everywhere: the mean. Similarly, a k-NN system that has k equal to the number of examples in the training dataset takes into account all of the data and summarizes it. That summary can be the mean. The most biased linear regression and nearest-neighbors models both predict the mean.
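If you want to convince yourself of the w0-equals-the-mean claim, here is a small sketch of my own: fitting linear regression on a single all-zeros "feature" leaves only the intercept to do the work, and that intercept comes out as the mean of the targets.

import numpy as np
from sklearn import linear_model

tgt = np.array([3., 1., 4., 1., 5., 9.])
no_info = np.zeros((len(tgt), 1))        # a "feature" that carries no information

w0_only = linear_model.LinearRegression().fit(no_info, tgt)

print(w0_only.intercept_)   # the fitted w0 ...
print(tgt.mean())           # ... is the mean of the targets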

On the other end of the spectrum—the end with less bias—we get complexity in two very different ways. PlainLinear includes information from all the features, but trades off feature values, based on parameter weights, to get to a “central” predicted value. 1-NN includes information from all the features in the distance calculation, but then it only considers the closest example to get the prediction value.

It is amazingly interesting—at least to me—that consulting more examples in nearest neighbors leads to more bias, yet consulting more features in linear regression leads to less bias. An explanation is that in nearest neighbors, the only thing we do when we consult someone else is average away the differences between examples—we are smoothing out the rough edges. So, it has as much to do with our method of combining information as it does with the fact that we are consulting more examples.

5.6.5.4 Bias-Variance for Naive Bayes

There’s an elephant in the room. We’ve discussed each of the methods we introduced in Part I, except Naive Bayes (NB). So, what about the bias-variance of NB? Describing the tradeoffs with NB is a little different because Naive Bayes is more like a single point on a spectrum of assumptions. The spectrum that NB lives on has the number of conditional independence assumptions on its x axis for complexity. Naive Bayes makes almost as many of these as possible: everything is conditionally independent given the class (Figure 5.12).

Bias-Variance for Naive Bayes are illustrated.

Figure 5.12 Naive Bayes is one bias-variance point on a spectrum of possibilities.

If class is independent of all the features—the ultimate independence assumption—the best we can do is guess based on the distribution of the class. For a continuous target, it implies that we guess the mean. We are back to a mean-only model for our least complex, most biased, most-assumptions model within a class of models. That’s pretty convenient. Other, more complicated models that try to add complexity to NB could make fewer and fewer independence assumptions. These would come with more and more complicated claims about dependency. Eventually, we get the most complicated type of dependency: the dreaded full joint distribution. Among other problems, if we want to adequately capture the distinctions in a fully dependent joint distribution, we need an amount of data that is exponential in the number of features of the data. For each additional feature we need something like—I’m taking liberties with the values—another factor-of-10 examples. If we need 100 examples for two features, we would need 1000 for three features. Yikes. Suffice it to say, that is not good for us.
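As a rough back-of-the-envelope illustration (my own, using binary features instead of the factor-of-10 hand-waving above), we can simply count table entries: the full joint needs on the order of 2 to the power d entries per class, while Naive Bayes needs about d per-feature estimates per class.

# counting with binary features (a rough, hand-wavy illustration)
for d in [2, 3, 10, 20]:
    full_joint  = 2**d          # entries per class in the full joint table
    naive_bayes = d             # one per-feature estimate per class under NB
    print(f"d={d:2d} features: full joint ~ {full_joint:>9,} entries,"
          f" Naive Bayes ~ {naive_bayes} estimates")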

5.6.5.5 Summary Table

It might be surprising that three learning methods, each motivated by different priorities and conceptualizations of the data, all have a common starting point: in a simple enough scenario, they predict the mean. Nearest neighbors with every example? It predicts a mean, if that’s our summary calculation. Linear regression with only w0? Yup, that w0 ends up being the mean! A simpler form of Naive Bayes—Even Naiver Bayes?—is exactly the mean (or the most frequent value for a classification problem) of the output target. Each method extends that in different ways, however. Table 5.3 shows how the models have different tradeoffs in terms of bias-variance and under- and overfitting.

Table 5.3 Tradeoffs between bias and variance.

Scenario                 | Example                                        | Good                     | Bad                     | Risk
-------------------------|------------------------------------------------|--------------------------|-------------------------|---------
high bias & low variance | more neighbors                                 | resists noise            | misses pattern          | underfit
                         | low-degree polynomial                          | forced to generalize     |                         |
                         | smaller or zero linear regression coefficients |                          |                         |
                         | more independence assumptions                  |                          |                         |
low bias & high variance | fewer neighbors                                | follows complex patterns | follows noise           | overfit
                         | high-degree polynomial                         |                          | memorizes training data |
                         | bigger linear regression coefficients          |                          |                         |
                         | fewer independence assumptions                 |                          |                         |

5.7 Graphical Evaluation and Comparison

Our discussion has turned a bit towards the theoretical. You can probably tell because we haven’t seen any code for a while. I want to remedy that by going immediately to evaluating performance visually. In addition to being fun, evaluating learners visually answers important questions in ways that can be hard to boil down to a single number. Remember, single numbers—mean, median, I’m looking at you—can be highly misleading.

5.7.1 Learning Curves: How Much Data Do We Need?

One of the simplest questions we can ask about a learning system is how its learning performance increases as we give it more training examples. If the learner never gets good performance, even when we give it a lot of data, we might be barking up the wrong proverbial tree. We could also see that when we use more, or all, of our training data, we continue improving our performance. That might give us confidence to spend some real-world effort to obtain more data to train our model. Here’s some code to get us started. sklearn provides learning_curve to do the calculations we need.

In [33]:

iris = datasets.load_iris()

# 10 data set sizes: 10% - 100%
# (that much data is piped to a 5-fold CV)
train_sizes = np.linspace(.1, 1.0, 10)
nn = neighbors.KNeighborsClassifier()

(train_N,
 train_scores,
 test_scores) = skms.learning_curve(nn, iris.data, iris.target,
                                    cv=5, train_sizes=train_sizes)

# collapse across the 5 CV scores; one result for each data set size
df = pd.DataFrame(test_scores, index=(train_sizes*100).astype(int))
df['Mean 5-CV'] = df.mean(axis='columns')
df.index.name = "% Data Used"
display(df)

 

% Data Used |      0 |      1 |      2 |      3 |      4 | Mean 5-CV
------------|--------|--------|--------|--------|--------|----------
10          | 0.3333 | 0.3333 | 0.3333 | 0.3333 | 0.3333 |    0.3333
20          | 0.3333 | 0.3333 | 0.3333 | 0.3333 | 0.3333 |    0.3333
30          | 0.3333 | 0.3333 | 0.3333 | 0.3333 | 0.3333 |    0.3333
40          | 0.6667 | 0.6667 | 0.6667 | 0.6667 | 0.6667 |    0.6667
50          | 0.6667 | 0.6667 | 0.6667 | 0.6667 | 0.6667 |    0.6667
60          | 0.6667 | 0.6667 | 0.6667 | 0.6667 | 0.6667 |    0.6667
70          | 0.9000 | 0.8000 | 0.8333 | 0.8667 | 0.8000 |    0.8400
80          | 0.9667 | 0.9333 | 0.9000 | 0.9000 | 0.9667 |    0.9333
90          | 0.9667 | 1.0000 | 0.9000 | 0.9667 | 1.0000 |    0.9667
100         | 0.9667 | 1.0000 | 0.9333 | 0.9667 | 1.0000 |    0.9733

learning_curve returns arrays with two dimensions: the number of training sizes by the number of cv-folds. I’ll call these (percents, folds). In the code above, the values are (10, 5). Unfortunately, turning those values—from the table immediately above—into graphs can be a bit of a headache.

Fortunately, Seaborn has—or had, it’s being deprecated—a helper we can use. We’re going to send the results to tsplot. tsplot creates multiple overlaid graphs, one for each condition, and it gives us a center and a range based on the repeated measurements we have. Remember how our college grads might have a range of possible incomes? It’s the same idea here. tsplot is geared towards plotting time series. It expects data with three components: times, conditions, and repeats. In turn, these three components become the x axis, the grouper (one line), and the repeats (the width around the line). The grouping serves to keep certain data points together; since we’re drawing multiple plots on one figure, we need to know which data belongs together in one shade. The repeats are the multiple assessments of the same scenario subject to some random variation. Do it again and you get a slightly different result. tsplot expects these components in the following order: (repeats, times, conds).

If we take the results of learning_curve and stack train_scores and test_scores on the outermost dimension, we end up with data that is structured like (train/test condition, percents, folds). We just need those dimensions turned inside out since (folds, percents, conditions) lines up with tsplot’s (repeats, times, conditions). The way we do this is with np.transpose.

In [34]:

# tsplot expects array data to have these dimensions:
# (repeats, times, conditions)
# for us, those translate to:
# (CV scores, percents, train/test)
joined = np.array([train_scores, test_scores]).transpose()

ax = sns.tsplot(joined,
                time=train_sizes,
                condition=['Train', 'Test'],
                interpolate=False)

ax.set_title("Learning Curve for 5-NN Classifier")
ax.set_xlabel("Number of Samples used for Training")
ax.set_ylabel("Accuracy");
A graph presents the learning curve for the 5-NN classifier.

We see some distinct steps in the testing error. Using up to about 30% of the data for training gives very poor test performance—just above 30% accuracy. Bumping the training data to 40–50% ups our test performance to about 67%, a big step in the right direction. As we move from 70% to 100% of the data used for training, the test performance starts to flatten out at a respectable accuracy in the high 90s. What about the training side? Why does it decrease? With few enough examples, there is apparently a simple enough pattern that our 5-NN can capture all of the training data—until we get to about 60%. After that, we start losing a bit of ground on the training performance. But remember—what we really care about is the test performance.

One other takeaway: we can convert this percent into a minimum number of examples we think we need for adequate training. For example, we needed almost 100% of the joined training folds to get reasonable 5-CV testing results. That translates into needing 80% of the full dataset for training. If that were inadequate, we could consider more CV splits, which would up the amount of training data in each fold: 10-fold CV, for example, trains on 90% of the data.
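Since tsplot is on its way out, here is one possible replacement of my own (not the book's code) that draws essentially the same learning curve with plain matplotlib: a mean line per condition plus a band of plus-or-minus one standard deviation across the five folds.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, neighbors, model_selection as skms

iris = datasets.load_iris()
train_sizes = np.linspace(.1, 1.0, 10)

(train_N,
 train_scores,
 test_scores) = skms.learning_curve(neighbors.KNeighborsClassifier(),
                                    iris.data, iris.target,
                                    cv=5, train_sizes=train_sizes)

fig, ax = plt.subplots(figsize=(6, 4))
for scores, label in [(train_scores, 'Train'), (test_scores, 'Test')]:
    mean, sd = scores.mean(axis=1), scores.std(axis=1)
    ax.plot(train_N, mean, 'o-', label=label)                 # mean curve
    ax.fill_between(train_N, mean - sd, mean + sd, alpha=.2)  # +/- 1 sd band
ax.set_title("Learning Curve for 5-NN Classifier")
ax.set_xlabel("Number of Samples used for Training")
ax.set_ylabel("Accuracy")
ax.legend();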

5.7.2 Complexity Curves

Earlier, when we discussed under- and overfitting, we drew a graph that showed what happened to our training and testing performance as we varied the complexity of our model. Graphing that was buried deep inside some nested loops and drawing functions. We can appropriate tsplot, as we did for the learning curves, to make the process fit in five logical lines of code. Since we’re not playing code-golf—not trying to minimize the number of lines of code to win a prize—I won’t literally write it in five lines.

In [35]:

num_neigh = [1, 3, 5, 10, 15, 20]
KNC = neighbors.KNeighborsClassifier
tt = skms.validation_curve(KNC(),
                           iris.data, iris.target,
                           param_name='n_neighbors',
                           param_range=num_neigh,
                           cv=5)

# stack and transpose trick (as above)
ax = sns.tsplot(np.array(tt).transpose(),
                time=num_neigh,
                condition=['Train', 'Test'],
                interpolate=False)

ax.set_title('5-fold CV Performance for k-NN')
ax.set_xlabel("\n".join(['k for k-NN',
                         'lower k, more complex',
                         'higher k, less complex']))

ax.set_ylim(.9, 1.01)
ax.set_ylabel('Accuracy');
A graph presents the five-fold CV performance for k-NN.

Now, I’ll leave you with a small puzzle to think about. Why does 1-NN get 100% training accuracy? Is that a good thing? When might it be a good thing? What does it mean that we seem to have the best validation test performance in the middle of the curve, near 10-NN?

I’m about to give you the answers. Look away if you want to think about them for yourself. With 1-NN, the training points are classified exactly as their own target value: they are exactly their own nearest neighbor. It’s potentially overfitting in the extreme. Now, if there is little or no noise, it might be OK. 10-NN might be a good value for our final system to deliver to our customer—it seems to make a good tradeoff between underfitting (bias) and overfitting (variance). We should assess it on a hold-out test set.
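As a follow-up sketch of my own (the test size and random_state are arbitrary choices, not the book's), here is what that final hold-out assessment of the 10-NN candidate might look like:

from sklearn import datasets, neighbors, model_selection as skms

iris = datasets.load_iris()
(train_ftrs, test_ftrs,
 train_tgt,  test_tgt) = skms.train_test_split(iris.data, iris.target,
                                               test_size=.25, random_state=42)

knn_10 = neighbors.KNeighborsClassifier(n_neighbors=10).fit(train_ftrs, train_tgt)
print(knn_10.score(test_ftrs, test_tgt))   # accuracy on the held-out test set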

5.8 Comparing Learners with Cross-Validation

One of the benefits of CV is that we can see the variability with respect to training on different training sets. We can only do that if we keep each evaluation as a stand-alone value because we need to compare individual values against each other. That is, we need to keep each fold separated and graphically distinct. We can do that with a simple plot.

In [36]:

classifiers = {'gnb' : naive_bayes.GaussianNB(),
               '5-NN' : neighbors.KNeighborsClassifier(n_neighbors=5)}

iris = datasets.load_iris()

fig, ax = plt.subplots(figsize=(6, 4))
for name, model in classifiers.items():
    cv_scores = skms.cross_val_score(model,
                                     iris.data, iris.target,
                                     cv=10,
                                     scoring='accuracy',
                                     n_jobs=-1) # use all cores
    my_lbl = "{} {:.3f}".format(name, cv_scores.mean())
    ax.plot(cv_scores, '-o', label=my_lbl) # marker=next(markers)
ax.set_ylim(0.0, 1.1)
ax.set_xlabel('Fold')
ax.set_ylabel('Accuracy')
ax.legend(ncol=2);
A graph shows two trendlines to illustrate cross-validation.

We see that we have a lot of similar outcomes. 5-NN appears to win on three folds, while GNB wins on one. The others are visual ties. We would be unable to do this sort of “who won” comparison on stripplots or from calculated means. The stripplot loses the connection to the fold and the mean compresses all the comparative information in the individual folds.
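If you want to tally those wins numerically instead of eyeballing the plot, here is a small extension of my own; because cv=10 with a classifier uses unshuffled stratified folds, both models are scored on exactly the same ten folds.

import numpy as np
from sklearn import datasets, naive_bayes, neighbors, model_selection as skms

iris = datasets.load_iris()

# cv=10 with no shuffling means both models see the same ten folds
gnb_scores = skms.cross_val_score(naive_bayes.GaussianNB(),
                                  iris.data, iris.target, cv=10)
knn_scores = skms.cross_val_score(neighbors.KNeighborsClassifier(5),
                                  iris.data, iris.target, cv=10)

print("5-NN wins:", np.sum(knn_scores > gnb_scores))
print("GNB wins: ", np.sum(gnb_scores > knn_scores))
print("ties:     ", np.sum(gnb_scores == knn_scores))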

5.9 EOC

5.9.1 Summary

We have become more sophisticated in the machinery we use to compare different learners: instead of single train-test splits, we make multiple splits. That’s good. We also dove into one of the fundamental theoretical issues in learning: the bias-variance tradeoff. Lastly, we looked at two graphical methods, learning curves and complexity curves, and at comparing learners across several folds of cross-validation. However, we’re still using only a small toolbox of methods and we’re not really selecting hyperparameters for those methods in a structured way. More to come!

5.9.2 Notes

External Issues in Evaluation When we build a bridge, we want to make certain that it is safe. We want to ensure that it can handle the weight and wind—hi Tacoma Narrows!—that it will experience. Yet, before we even start construction, someone needs to ask, “Should we build a bridge?” Are there other, better alternatives? Could we build a road around a bay? Could we build a tunnel under a channel? Could we make do with daily ferries? Would it be better to do nothing now and wait until technology advances or, perhaps, more users are in need of the crossing? Each of these options may be different when measured in different ways: construction cost, durability, maintenance cost, maximum weight per vehicle, maximum throughput of vehicles, total expected lifespan, expansion cost, and travel time for the end users. There are many, many ways of comparing alternatives.

In engineering terms, these two types of assessment have specific names. When we ask “Assuming we need to build a bridge, is the bridge safe?” we are talking about verifying that the bridge is safe. When we ask “Should we build a bridge or a road or a tunnel?”, we are validating which solution we should use.

  • Validation: are we building the right thing to solve a problem? Should we build something else, or solve the problem a different way?

  • Verification: is the thing we built doing its job the right way? Does it have a low error, or make few mistakes, or not fall apart?

In this chapter, we discussed the issue of verification as it is called by engineers. Assuming that we want a learning system, how can we assess its capabilities? I hope that I’ve reminded you to think outside the box (that you’re building): maybe you would be better off seeking an alternative solution. I have one confusing footnote to add. These terms—verification and validation—come out of a specific engineering context. But within the machine learning and related communities, some of the techniques by which we verify our systems have names that include validation. For example, the machine learning evaluation technique known as cross-validation is, in the engineering terms we just introduced, verifying the operation of the system. All I can do is say, “I’m sorry.”

General Notes What we’ve referred to as training is also, under the hood, solving some sort of optimization problem. Setting the dials on our machines is really about finding some suitable best values for the dials under some constraints.

There’s an interaction between the representational power of our learning systems—our bias and variance—and the raw data we have available. For example, two points define a line. If I only have two data points, I can’t really justify anything more wiggly than a straight line. But, there are—literally!—an infinite number of ways I could generate two data points. Almost all of those are not a simple straight line.

Another interesting aspect of nearest neighbor bias is considering what happens to 1-NN when there are very many features. You may have heard that in space, no one can hear you scream. When we have many features—as the cool kids say, “When we are in high-dimensional space”—that is true. In high-dimensional space, everyone is very far from everyone else, so no one is close enough to hear you. This phenomenon is one aspect of the Curse of Dimensionality. In a learning sense, this means that no one is close enough to really know what value you should take. Our predictions don’t have enough subtlety to them: we are too biased.

There are many different ways to resample data for evaluation purposes. We discussed cross-validation and repeated train-test splitting (RTTS). RTTS is also called Monte Carlo cross-validation. When you see Monte Carlo (a famous gambling town) in a learning or statistical phrase, feel free to substitute the phrase “repeated random rolls.” Here, Monte Carlo refers to the repeated randomness of selecting train-test splits—as opposed to the one-and-done randomness of standard cross-validation. Each of the resampling evaluators is making an estimate of a target quantity. Mind-blowingly enough, this means they themselves have a bias and a variance. Oh dear, we’ve slipped into the meta world again! If you’d like to learn more, you can start with this mathematically dense paper by Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection.”

There are other, related resampling methods called the jackknife and the bootstrap. The jackknife is very similar to leave-one-out style evaluation. We’ll discuss bootstrapping in a slightly different context—essentially using repeated sampling with replacement to build models—in Section 12.3.1. When we want to account for hyperparameters, we need another level of cross-validation: nested cross-validation. We’ll discuss it in Section 11.3.

The quote from Elements of Statistical Learning (ESL) is from the first edition, page 196. Likewise, if you’d like to see the details that were swept under the carpet in the bias-variance equation, ESL has those details starting on that same page. Funny, that. Be warned, it is not easy reading. A nice answer on CrossValidated—an online statistics and machine learning community—has some useful discussion also: https://stats.stackexchange.com/a/164391/1704.

If you’re wondering why I took the time to hack through using tsplot, you can check out the sklearn docs example for making a learning curve: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html. Drawing the graph by hand just didn’t interest me for some reason.

5.9.3 Exercises

We’ve introduced a number of evaluation techniques. We never compared them directly to each other. Can you directly compare leave-one-out with two-, three-, five-, and ten-fold cross-validation on a learning problem? Do you see any patterns as the number of folds increases?

A quick example of the variance in model estimates is the difference in training (calculating) a mean and a median from a random sample of a larger dataset. Make a small dataset with 20 random values in it. Randomly select 10 of those values and compute the mean and median of the 10 values. Repeat randomly selecting 10 values five times and compute the mean and median. How different are the 5 means? How different are the 5 medians? How do those values compare to the mean and median of the original 20 values?
