© Geoff Hulten 2018
Geoff Hulten, Building Intelligent Systems, https://doi.org/10.1007/978-1-4842-3432-7_19

19. Evaluating Intelligence

Geoff Hulten, Lynnwood, Washington, USA
Evaluation is creation, at least when it comes to building intelligence for Intelligent Systems. That’s because intelligence creation generally involves an iterative search for effective intelligence: produce a new candidate intelligence, compare it to the previous candidate, and choose the better of the two. To do this, you need to be able to look at a pair of intelligences and answer questions like these:
  • Which one of these should I use in my Intelligent System?
  • Which will do a better job of achieving the system’s goals?
  • Which will cause less trouble for me and my users?
  • Is either of them good enough to ship to customers or is there still more work to do?
There are two main ways to evaluate intelligence:
  • Online evaluation: By exposing it to customers and seeing how they respond. We've discussed this in earlier chapters when we talked about evaluating experience and managing intelligence (via silent intelligence, controlled rollouts, and flighting).
  • Offline evaluation: By looking at how well it performs on historical data. This is the subject of this chapter, as it is critical to the intelligence-creation process.
This chapter discusses what it means for an intelligence to be accurate. It will then explain how to use data to evaluate intelligence, as well as some of the pitfalls. It will introduce conceptual tools for comparing intelligences. And finally, it will explore methods for subjective evaluation of intelligence.

Evaluating Accuracy

An intelligence should be accurate, of course. But accurate isn’t a straightforward concept and there are many ways for an intelligence to fail. An effective intelligence will have the following properties :
  • It will generalize to situations it hasn’t seen before.
  • It will make the right types of mistakes.
  • It will distribute the mistakes it makes well.
This section explores these properties in more detail.

Generalization

One of the key challenges in intelligence creation is to produce intelligence that works well on things you don’t know about at the time you create the intelligence.
Consider a student who reads the course textbook and memorizes every fact. That's good. The student would be very accurate at parroting back the things that were in the textbook. But now imagine the teacher creates a test that doesn't ask the student to parrot back facts from the textbook. Instead, the teacher wants the student to demonstrate that they understood the concepts from the textbook and can apply them in a new setting. If the student developed a good mental model of the topic, they might pass this test. If the student has the wrong mental model of the topic, or has no mental model at all (and has just memorized facts), they won't do so well at applying the knowledge in a new setting. This is the same as the intelligence in an Intelligent System—it must generalize to new situations.
Let’s look at an example. Consider building intelligence that examines books and classifies them by genre—sci-fi, romance, technical, thriller, historical fiction, that kind of thing.
You gather 1,000 books, hand-label them with genres, and set about creating intelligence. The goal is to be able to take a new book (one that isn't part of the 1,000) and accurately predict its genre.
What if you built this intelligence by memorizing information about the authors? You might look at your 1,000 books, find that they were written by 815 different authors, and make a list like this:
  • Roy Royerson writes horror.
  • Tim Tiny writes sci-fi.
  • Neel Notson writes technical books.
  • And so on.
When you get a new book, you look up its author in this list. If the author is there, return the genre. If the author isn't there—well, you're stuck. This model doesn't understand the concept of “genre”; it just memorized some facts, and it won't generalize to authors it doesn't know about (and it will get pretty confused by authors who write in two different genres).
When evaluating the accuracy of intelligence, it is important to test how well it generalizes. Make sure you put the intelligence in situations it hasn't seen before and measure how well it adapts.

Types of Mistakes

Intelligences can make many types of mistakes and some mistakes cause more trouble than others. We’ve discussed the concept of false positive and false negative in Chapter 6 when we discussed intelligent experiences, but let’s review (see Figure 19-1). When predicting classifications, an intelligence can make mistakes that
  • Say something is of one class, when it isn't.
  • Say something isn't of a class, when it is.
[Figure 19-1: Different types of mistakes.]
For example, suppose the intelligence does one of these things:
  • Says there is someone at the door, but there isn’t; or it says there is no one at the door, but there is someone there.
  • Says it’s time to add fuel to the fire, but it isn’t (the fire is already hot enough); or it says it isn’t time to add fuel, but it is (because the fire is about to go out).
  • Says the book is a romance, but it isn’t; or it says the book isn’t a romance, but it is.
In order to be useful, an intelligence must make the right types of mistakes to complement its Intelligent System. For example, consider the intelligence that examines web pages to determine if they are funny or not. Imagine I told you I had an intelligence that was 99% accurate. Show this intelligence some new web page (one it has never seen before), the intelligence makes a prediction (funny or not), and 99% of the time the prediction is correct. That’s great. Very accurate generalization. This intelligence should be useful in our funny-webpage-detector Intelligent System.
But what if it turns out that most web pages aren’t funny—99% of web pages, to be precise. In that case, an intelligence could predict “not funny” 100% of the time and still be 99% accurate. On not-funny web pages it is 100% accurate. On funny web pages it is 0% accurate. And overall that adds up to 99% accuracy. And it also adds up to—completely useless.
One measure for trading off between these types of errors is to talk about false positive rate vs false negative rate. For the funny-page-finder, a “positive” is a web page that is actually funny. A “negative” is a web page that is not funny. (Actually, you can define a positive either way—be careful to define it clearly or other people on the project might define it differently and everyone will be confused.) So:
  • The false positive rate is defined as the fraction of all negatives that are falsely classified as positives (what portion of the not-funny page visits are flagged as funny).
    False Positive Rate = False Positives / (False Positives + True Negatives)
  • The false negative rate is defined as the fraction of all positives that are falsely classified as negatives (what portion of the funny page visits are flagged as not-funny).
    False Negative Rate = False Negatives / (False Negatives + True Positives)
Using this terminology, the brain-dead always-not-funny intelligence would have a 0% false positive rate (which is great) and a 100% false negative rate (which is useless).
Another very common way to talk about these mistake-trade-offs is by talking about a model’s precision and its recall.
  • The precision is defined as the fraction of all of the model’s positive responses that are actually positive (what portion of “this page is funny” responses are correct).
    Precision = True Positives / (True Positives + False Positives)
  • The recall is defined as the proportion of the positives that the model says are positive (what portion of the funny web pages get a positive response).
    Recall = True Positives / (True Positives + False Negatives)
Using this terminology, the brain-dead always-not-funny intelligence would have an undefined precision (because it never says positive and you can’t divide by zero, not even with machine learning) and a 0% recall (because it says positive on 0% of the positive pages).
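To make these definitions concrete, here is a minimal sketch (in Python, not from the book) of how the four quantities might be computed from labeled evaluation data; the data and variable names are made up, and the example reproduces the always-not-funny intelligence described above.

# Sketch: computing mistake metrics from labeled evaluation data.
# "actual" and "predicted" are hypothetical; True means "funny."
def mistake_metrics(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return {
        "false positive rate": fp / (fp + tn) if (fp + tn) else None,
        "false negative rate": fn / (fn + tp) if (fn + tp) else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,  # None: undefined (no positive responses)
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }

# The always-not-funny intelligence on 100 pages, 1 of which is funny:
actual = [True] + [False] * 99
predicted = [False] * 100
print(mistake_metrics(actual, predicted))
# Accuracy is 99%, yet precision is undefined and recall is 0%.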
An effective intelligence must balance the types of mistakes it makes appropriately to support the needs of the Intelligent System.

Distribution of Mistakes

In order to be effective, an intelligence must work reasonably well for all users. That is, it cannot focus its mistakes into specific sub-populations. Consider:
  • A system to detect when someone is at the door that never works for people under 5 feet tall.
  • A system to find faces in images that never finds people wearing glasses.
  • A system to filter spam that always deletes mail from banks.
These types of mistakes can be embarrassing. They can lead to unhappy users and bad reviews. It's possible to have an intelligence that generalizes well, that makes a good balance of the various types of mistakes, and that is still totally unusable because it focuses mistakes on specific users (or in specific contexts).
And finding this type of problem isn't easy. There are so many potential sub-populations, it can be difficult or impossible to enumerate all the ways poorly-distributed mistakes can cause problems for an Intelligent System.

Evaluating Other Types of Predictions

The previous section gave an introduction to evaluating classifications. But there are many, many ways to evaluate the answers that intelligences can give. You could read whole books on the topic, but this section will give a brief intuition for how to approach evaluation of regressions, probabilities, and rankings.

Evaluating Regressions

Regressions return numbers. You might want to know what fraction of the time they get the “right” number. But it is almost always more useful to know how close the predicted answers are to the right answers than to know how often the answers are exactly right.
The most common way to do this is to calculate the Mean Squared Error (MSE). That is, take the answer the intelligence gives, subtract it from the correct answer, and square the result. Then take the average of this across the contexts that are relevant to your measurement.
    MSE = (1 / n) × Σ (predicted value − correct value)²
When the MSE is small, the intelligence is usually giving answers that are close to the correct answer. When the MSE is large, the intelligence is usually giving answers that are far from the correct answer.
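As a minimal sketch (not from the book), the MSE calculation might look like this; the predicted and correct values are made up for illustration.

# Sketch: Mean Squared Error over the contexts relevant to the measurement.
def mean_squared_error(predicted, correct):
    squared_errors = [(p - c) ** 2 for p, c in zip(predicted, correct)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical grill-temperature predictions (degrees) vs. correct answers.
predicted = [420, 415, 400, 405]
correct   = [425, 410, 400, 395]
print(mean_squared_error(predicted, correct))  # 37.5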

Evaluating Probabilities

A probability is a number between 0 and 1.0. One way to evaluate probabilities is to use a threshold to convert them to classifications and then evaluate them as classifications.
Want to know if a book is a romance? Ask an intelligence for the probability that it is a romance. If the probability is above a threshold, say, 0.3 (30%), then call the book a romance; otherwise call it something else.
Using a high threshold for converting the probability into a classification, like 0.99 (99%), generally results in higher precision but lower recall—you only call a book a romance if the intelligence is super-certain.
Using a lower threshold for turning the probability into a classification, like 0.01 (1%), generally results in lower precision, but higher recall—you call just about any book a romance, unless the intelligence is super-certain it isn’t a romance.
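A minimal sketch of this thresholding, with made-up probabilities (the 0.99 and 0.01 thresholds match the examples above):

# Sketch: converting a probability into a classification with a threshold.
def classify_as_romance(probability, threshold):
    return probability >= threshold

book_probabilities = [0.95, 0.40, 0.05, 0.60]  # hypothetical model outputs

# High threshold: only super-certain books are called romance (higher precision, lower recall).
print([classify_as_romance(p, 0.99) for p in book_probabilities])  # [False, False, False, False]

# Low threshold: just about everything is called romance (lower precision, higher recall).
print([classify_as_romance(p, 0.01) for p in book_probabilities])  # [True, True, True, True]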
We’ll discuss this concept of thresholding further in a little while when we talk about operating points and comparing intelligences.
Another way to evaluate probabilities is called log loss. Conceptually, log loss is very similar to mean squared error (for regression), but there is a bit more math to it (which we'll skip). Suffice to say—less loss is better.

Evaluating Rankings

Rankings order content based on how relevant it is to a context. For example, given a user’s history, what flavor of soda are they most likely to order next? The ranking intelligence will place all the possible flavors in order, for example:
  1. Cola
  2. Orange Soda
  3. Root Beer
  4. Diet Cola
So how do we know if this is right?
One simple way to evaluate this is to imagine the intelligent experience. Say the intelligent soda machine can show 3 sodas on its display. The ranking is good if the user’s actual selection is in the top 3, and it is not good if the user’s actual selection isn’t in the top 3.
You can consider the top 3, the top 1, the top 10—whatever makes the most sense for your Intelligent System.
So one simple way to evaluate a ranking is as the percent of time the user’s selection is among the top K answers.
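A minimal sketch of this top-K evaluation (the soda data and the helper name are illustrative):

# Sketch: percent of interactions where the user's actual selection
# appears among the intelligence's top-K ranked answers.
def top_k_accuracy(rankings, selections, k=3):
    hits = sum(1 for ranking, chosen in zip(rankings, selections)
               if chosen in ranking[:k])
    return hits / len(selections)

rankings = [
    ["Cola", "Orange Soda", "Root Beer", "Diet Cola"],
    ["Diet Cola", "Cola", "Orange Soda", "Root Beer"],
]
selections = ["Root Beer", "Grape Soda"]  # what the users actually ordered
print(top_k_accuracy(rankings, selections, k=3))  # 0.5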

Using Data for Evaluation

Data is a key tool in evaluating intelligence. Conceptually, intelligence is evaluated by taking historical contexts, running the intelligence on them, and comparing the outputs that actually occurred to the outputs the intelligence predicted.
Of course, using historical data has risks, including:
  • You might accidentally evaluate the intelligence on data that was used to create the intelligence, resulting in over-optimistic estimates of the quality of the intelligence. Basically, letting the intelligence see the answers to the test before testing it.
  • The underlying problem might change between the time the testing data is collected and the time the new intelligence is deployed, resulting in over-optimistic estimates of the quality of the intelligence. When the problem changes, the intelligence might be great—at fighting the previous war.
This section will discuss ways to handle testing data to minimize these problems.

Independent Evaluation Data

The data used to evaluate intelligence must be completely separate from the data used to create the intelligence.
Imagine this. You come up with an idea for some heuristic intelligence, implement it, evaluate it on some evaluation data, and find the precision is 54%. At this point you’re fine. The intelligence is probably around 54% precise (plus or minus a bit, depending on statistical properties like sample size, and so on), and if you deploy it to users that’s probably about what they’ll see.
But now you look at some of the mistakes your intelligence is making on the evaluation data. You notice a pattern, so you change your intelligence, improve it. Then you evaluate the intelligence on the same test data and find the precision is now 66%.
At this point you are no longer fine. You have looked at the evaluation data and changed your intelligence because of what you found. At this point it really is hard to say how precise your intelligence will be when you deploy it to users; almost certainly less than the 66% you saw in your second evaluation. Possibly even worse than your initial 54%.
This is because you’ve cheated. You looked at the answers to the test as you built the intelligence. You tuned your intelligence to the part of the problem you can see. This is bad.
One common approach to avoiding this is to create a separate testing set for evaluation. A test set is created by randomly splitting your available data into two sets: one for creating and tweaking intelligence, the other for evaluation.
Now as you are tweaking and tuning your intelligence you don’t look at the test set—not even a little. You tweak all you want on the training set. When you think you have it done, you evaluate on the test set to get an unbiased evaluation of your work.
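A minimal sketch of such a random split (the 20% test fraction and fixed seed are arbitrary choices, not recommendations from the book):

# Sketch: randomly splitting available data into a training set (for creating
# and tweaking intelligence) and a test set (used only for final evaluation).
import random

def split_data(examples, test_fraction=0.2, seed=42):
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    split_point = int(len(shuffled) * test_fraction)
    return shuffled[split_point:], shuffled[:split_point]  # training_set, test_set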

Independence in Practice

The data used to evaluate intelligence should be completely separate from the data used to produce the intelligence. I know—this is exactly the same sentence I used to start the previous section. It’s just that important.
One key assumption made by machine learning is that each piece of data is independent of the others. That is, take any two pieces of data (two contexts, each with its own outcome). As long as you create intelligence on one, and evaluate it on the other, there is no way you can be cheating—those two pieces of data are independent. In practice this is not the case. Consider which of the following are more (or less) independent:
  • A pair of interactions from different users, compared to a pair of interactions from the same user.
  • Two web pages from different web sites, compared to two pages from the same site.
  • Two books by different authors, compared to two books by the same author.
  • Two pictures of different cows, compared to two pictures of the same cow.
Clearly some of these pieces of data are not as independent as others, and using data with these types of strong relationships to both create and evaluate intelligence will lead to inaccurate evaluations. Randomly splitting data into training and testing sets does not always work in practice.
Two approaches to achieving independence in practice are to partition your data by time or by identity.
  • Partition data by time. That is, reserve the most recent several days of telemetry to evaluate, and use data from the prior days, weeks, or months to build intelligence. For example, if it is September 5, 2017, the intelligence creator might reserve data from September 2, 3, and 4 for testing, and use all the data from September 1, 2017 and earlier to produce intelligence.
    If the problem is very stable (that is, does not have a time-changing component), this can be very effective. If the problem has a time-changing component, recent telemetry will be more useful than older telemetry, because it will more accurately represent the current state of the problem. In these cases you'll have to balance how much of the precious most-recent, most-relevant data to use for evaluation (vs training), and how far back in time to go when selecting training data.
  • Partition data by identity. That is, ensure that all interactions with a single identity end up in the same data partition. For example:
    • All interactions with a particular user are either used to create intelligence or they are used to evaluate it.
    • All interactions with the same web site are either used to create intelligence or they are used to evaluate it.
    • All sensor readings from the same house are either used to create intelligence or they are used to evaluate it.
    • All toasting events from the same toaster are either used to create intelligence or they are used to evaluate it.
Most Intelligent Systems will partition by time and by at least one identity when selecting data to use for evaluation.
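Here is a minimal sketch of partitioning by both time and identity; the record fields, the cutoff date, and the 20% evaluation fraction are assumptions for illustration.

# Sketch: keep all of a user's interactions on one side of the split, and
# use only the most recent days for evaluation.
from datetime import date
import hashlib

def partition_telemetry(telemetry, cutoff=date(2017, 9, 2), eval_fraction=0.2):
    training, evaluation = [], []
    for record in telemetry:  # hypothetical shape: {"user_id": ..., "date": ..., ...}
        # Stable per-user bucket in [0, 100): the same user_id always lands
        # in the same partition.
        bucket = int(hashlib.md5(record["user_id"].encode()).hexdigest(), 16) % 100
        is_eval_user = bucket < eval_fraction * 100
        if is_eval_user and record["date"] >= cutoff:
            evaluation.append(record)   # recent data from held-out users
        elif not is_eval_user and record["date"] < cutoff:
            training.append(record)     # older data from the remaining users
        # Other records are dropped to keep the two sets independent.
    return training, evaluation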

Evaluating for Sub-Populations

Sometimes it is critical that an intelligence does not systematically fail for critical sub-populations (gender, age, ethnicity, types of content, location, and so on).
For example, imagine a speech recognition system that needs to work well for all English speakers. Over time users complain that it isn’t working well for them. Upon investigation you discover many of the problems are focused in Hawaii—da pidgin stay too hard for fix, eh brah?
Ahem…
The problem could be bad enough that the product cannot sell in Hawaii—a major market. Something needs to be done!
To solve a problem, it must be measured (remember, verification first). So we need to update the evaluation procedure to measure accuracy specifically on the problematic sub-population. Every time the system evaluates a potential intelligence, it evaluates it in two ways:
  • Once across all users.
  • Once just for users who speak English with a pidgin-Hawaiian accent.
This evaluation procedure might find that the precision is 95% in general, but 75% for the members of the pidgin-speaking sub-population. And that is a pretty big discrepancy.
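A minimal sketch of that two-way evaluation, assuming each evaluation record is tagged with whether it comes from the sub-population (the record shape is hypothetical):

# Sketch: measuring precision overall and for a flagged sub-population.
def precision(records):
    tp = sum(1 for r in records if r["predicted"] and r["label"])
    fp = sum(1 for r in records if r["predicted"] and not r["label"])
    return tp / (tp + fp) if (tp + fp) else None

def evaluate_with_subpopulation(records):
    # Hypothetical record shape: {"label": bool, "predicted": bool, "in_subpopulation": bool}
    overall = precision(records)
    subpopulation = precision([r for r in records if r["in_subpopulation"]])
    return overall, subpopulation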
Evaluating accuracy on sub-populations presents some complications. The first is identifying whether an interaction is part of the sub-population. A new piece of telemetry arrives, including a context and an outcome. But now you need some extra information—you need to know whether the context (for example, the audio clip in the telemetry) is from a pidgin-Hawaiian speaker or not. Some approaches to this include:
  1. Identify interactions by hand. Inspect contexts by hand (listen to the audio clips) and find several thousand from members of the target sub-population. This will probably be expensive, and difficult to repeat regularly, but it will usually work. The resulting set can be preserved long-term to evaluate accuracy on the sub-population (unless the problem changes very fast).
  2. Identify entities from the sub-population. For example, flag users as “pidgin speakers” or not. Every interaction from one of your flagged users can be used to evaluate the sub-population. When the context contains an identity (such as a user ID), this approach can be very valuable. But this isn't always available.
  3. Use a proxy for the sub-population, like location. Everyone who is in Hawaii gets flagged as part of the sub-population whether they speak pidgin-Hawaiian or not. Not perfect, but sometimes it can be good enough, and sometimes you can do it automatically, saving a bunch of money and time.
A second complication to evaluating accuracy for sub-populations is getting enough evaluation data for each sub-population. If the sub-population is small, a random sample of evaluation data might contain just a few examples of it. Two ways to deal with this are:
  1. Use bigger evaluation sets. Set aside enough evaluation data so that the smallest important sub-population has enough representation to be evaluated (see the next section for more detail on the right amount of data).
  2. Up-sample sub-population members for evaluation. Skew your systems so members of the sub-population are more likely to show up in telemetry and are more likely to be used for evaluation instead of for intelligence creation. When doing this you have to be sure to correct for the skew when reporting evaluation results. For example, if users from Hawaii are sampled twice as often in telemetry, then each interaction from Hawaii-based users gets half as much weight when estimating the overall accuracy compared to other users. (A small sketch of this correction follows the list.)
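The arithmetic of that correction might look like the following sketch; the record shape and the 2x up-sampling assumption are illustrative.

# Sketch: correcting for up-sampling when estimating overall accuracy.
# Assumes interactions from the up-sampled sub-population were put into
# telemetry twice as often, so each one carries half the weight.
def weighted_accuracy(results):
    total_weight, correct_weight = 0.0, 0.0
    for is_correct, is_upsampled in results:  # hypothetical (bool, bool) pairs
        weight = 0.5 if is_upsampled else 1.0
        total_weight += weight
        if is_correct:
            correct_weight += weight
    return correct_weight / total_weight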

The Right Amount of Data

So how much data do you need to evaluate an intelligence? It depends—of course.
Recall that statistics can express how certain an answer is, for example: the precision of my intelligence is 92% plus or minus 4% (which means it is probably between 88% and 96%).
So how much data you need depends on how certain you need to be.
Assuming your data is very independent, the problem isn’t changing too fast, and you aren’t trying to optimize the last 0.1% out of a very hard problem (like speech recognition):
  • Tens of data points is too small a number to evaluate intelligence.
  • Hundreds of data points is a fine size for a sanity check, but really not enough.
  • Thousands of data points is probably a fine size for most things.
  • Tens of thousands of data points is probably overkill, but not crazy.
A starting point for choosing how much data to use to evaluate intelligence might be this:
  1. Ensure you have thousands of recent data points reserved for evaluation.
  2. Ensure you have hundreds of recent data points for each important sub-population.
  3. Use the rest of your (reasonably recent) data for intelligence creation.
  4. Unless you have ridiculous amounts of data, at which point simply reserve about 10% of your data to evaluate intelligence and use the rest for intelligence creation.
But you'll have to develop your own intuition in your setting (or use some statistics, if that is the way you like to think).

Comparing Intelligences

Now we know some metrics for evaluating intelligence performance and how to measure these metrics from data. But how can we tell if one intelligence is going to be more effective than another? Consider trying to determine whether the genre of a book is romance or not:
  • One intelligence might have a precision of 80% with a recall of 20%.
  • Another intelligence might have a precision of 50% with a recall of 40%.
Which is better? Well, it depends on the experience, and the broader objectives of the Intelligent System. For example, a system that is trying to find the next book for an avid romance reader might prefer a higher recall (so the user won’t miss a single kiss).

Operating Points

One tool to help evaluate intelligences is to select an operating point. That is, a precision point or a recall point that works well with your intelligent experience. Every intelligence must hit the operating point, and then the one that performs best there is used. For example:
  • In a funny-web-page detector the operating point might be set to 95% precision. All intelligence should be tuned to have the best recall possible at 95% precision.
  • In a spam-filtering system the operating point might be set to 99% precision (because it is deleting users’ email, and so must be very sure). All intelligences should strive to flag as much spam as possible, while keeping precision at or above 99%.
  • In a smart-doorbell system the operating point might be set to 80% recall. All intelligences should be set to flag 80% of the times a person walks up to the door, and compete on reducing false positives under that constraint.
This reduces the number of variables by choosing one of the types of mistakes an intelligence might make and setting a target. We need an intelligence that is 90% precise to support our experience—now, Mr. or Mrs. Intelligence creator, go and produce the best recall you can at that precision.
When comparing two intelligences it’s often convenient to compare them at an operating point. The one that is more precise at the target recall (or has higher recall at the target precision) is better.
Easy.
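When the intelligence produces a score or probability, comparing at an operating point can be done by sweeping a threshold. This sketch (with assumed score and label inputs, and at least one positive example) reports the best recall achievable while staying at or above a target precision:

# Sketch: best recall at or above a target precision, found by sweeping
# a classification threshold over the model's scores.
def recall_at_precision(scores, labels, target_precision=0.95):
    best_recall = 0.0
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, l in zip(predicted, labels) if p and l)
        fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
        fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
        if tp + fp == 0:
            continue  # precision undefined at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision >= target_precision:
            best_recall = max(best_recall, recall)
    return best_recall

# The intelligence with the higher recall at the target precision is better
# at this operating point.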

Curves

But sometimes operating points change. For example, maybe the experience needs to become more forceful and the operating point needs to move to a higher precision to support the change. Intelligences can be evaluated across a range of operating points. For example, by varying the threshold used to turn a probability into a classification, an intelligence might have these values:
  • 91% precision at 40% recall.
  • 87% precision at 45% recall.
  • 85% precision at 50% recall.
  • 80% precision at 55% recall.
And on and on. Note that any intelligence that can produce a probability or a score can be used this way (including many, many machine learning approaches).
Many heuristic intelligences and many “classification-only” machine learning techniques do not have the notion of a score or probability, and so they can only be evaluated at a single operating point (and not along a curve).
When comparing two intelligences that can make trade-offs between mistake types, you can compare them at any operating point. For example:
  • At 91% precision, one model has 40% recall, and the other has 47% recall.
  • At 87% precision, one model has 45% recall, and the other has 48% recall.
  • At 85% precision, one model has 50% recall, and the other has 49% recall.
One model might be better at some types of trade-offs and worse at others. For example, one model might be better when you need high precision, but a second model (built using totally different approaches) might be better when you need high recall.
It is sometimes helpful to visualize the various trade-offs an intelligence can make. A precision-recall curve (PR curve) is a plot of all the possible trade-offs a model can make. On the x-axis is every possible recall (from 0% to 100%) and on the y-axis is the precision the model can achieve at the indicated recall.
By plotting two models on a single PR curve it is easy to see which is better in various ranges of operating points.
[Figure: Precision-recall curves for two models plotted together.]
A similar concept is called a receiver operating characteristic curve (ROC curve) . An ROC curve would have false positive rate on the x-axis and true positive rate on the y-axis.
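If you have scored predictions, libraries such as scikit-learn and matplotlib (assumed to be available here; this is a rough sketch, not a prescribed tooling choice) can produce both kinds of curves:

# Sketch: plotting PR and ROC curves for several models on shared axes.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_curve

def plot_pr_and_roc(labels, model_scores):
    # model_scores: dict mapping a model name to its per-example scores.
    fig, (pr_ax, roc_ax) = plt.subplots(1, 2, figsize=(10, 4))
    for name, scores in model_scores.items():
        precision, recall, _ = precision_recall_curve(labels, scores)
        pr_ax.plot(recall, precision, label=name)
        fpr, tpr, _ = roc_curve(labels, scores)
        roc_ax.plot(fpr, tpr, label=name)
    pr_ax.set_xlabel("Recall"); pr_ax.set_ylabel("Precision"); pr_ax.legend()
    roc_ax.set_xlabel("False positive rate"); roc_ax.set_ylabel("True positive rate"); roc_ax.legend()
    plt.show()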

Subjective Evaluations

Sometimes an intelligence looks good by the numbers, but it just…isn’t… This can happen if you:
  • Have a metric that is out of sync with the actual objective.
  • Have a miscommunication between the experience and the intelligence.
  • Have an important sub-population that you haven't identified yet.
  • And more…
Because of this, it never hurts to look at some data, take a few steps back, and just think. Be a data-detective, take the user’s point of view and imagine what your intelligence will create for them. Some things that can help with subjective evaluations include:
  • Exploring the mistakes.
  • Imagining the user experience.
  • Finding the worst thing that could happen.
We'll discuss these in turn.

Exploring the Mistakes

Statistics (for example, those used to represent model quality with a precision and a recall) are nice and neat; they summarize lots of things into simple little numbers that go up or down. They are critical to creating intelligence. But they can hide all sorts of problems. Every so often you need to go look at the data.
One useful technique is to take a random sample of 100 contexts where the intelligence was wrong and look at them. When looking at the mistakes, consider:
  • How many of the mistakes would be hard for users to recover from?
  • How many of the mistakes would make sense to the user (vs seeming pretty stupid)?
  • Is there any structure to the mistakes? Any common properties? Maybe things that will turn into important sub-population-style problems in the future?
  • Can you find any hints that there might be a bug in some part of the implementation or the evaluation process?
  • Is there anything that could help improve the intelligence so it might stop making the mistakes? Some new information to add to the context? Some new type of feature?
In addition to inspecting random mistakes, it also helps to look at places where the intelligence was most certain of the answer—but was wrong (a false positive where the model said it was 100% sure it was a positive, or a false negative where the model said the probability of positive was 0%). These places will often lead to bugs in the implementation or to flaws in the intelligence-creation process.
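One way to pull both samples is sketched below; the record fields (label, predicted, score) are assumptions about how evaluation results might be stored.

# Sketch: selecting mistakes for manual inspection.
import random

def mistakes_to_inspect(records, sample_size=100, confident_count=20):
    mistakes = [r for r in records if r["label"] != r["predicted"]]
    # A random sample of mistakes to read through by hand.
    random_sample = random.sample(mistakes, min(sample_size, len(mistakes)))
    # Mistakes where the model was most certain (score near 0 or 1);
    # these often point at implementation bugs or flaws in intelligence creation.
    most_confident = sorted(mistakes, key=lambda r: abs(r["score"] - 0.5),
                            reverse=True)[:confident_count]
    return random_sample, most_confident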

Imagining the User Experience

While looking at mistakes, imagine the user’s experience. Visualize them coming to the context. Put yourself in their shoes as they encounter the mistake. What are they thinking? What is the chance they will notice the problem? What will it cost if they don’t notice it? What will they have to do to recover if they do notice it?
This is also a good time to think of the aggregate experience the user will have. For example:
  • How many mistakes will they see?
  • How many positive interactions will they have between mistakes?
  • How would they describe the types of mistakes the system makes, if they had to summarize them to a friend?
Putting yourself in your users' shoes will help you know whether the intelligence is good enough or needs more work. It can also help you come up with ideas for improving the intelligent experience.

Finding the Worst Thing

And imagine the worst thing that could happen. What types of mistakes could the system make that would really hurt its users or your business? For example:
  • A system for promoting romance novels that classifies 100% of a particular romance-writer's books as non-romance. This writer stops getting promoted, stops getting sales, doesn’t have the skills to figure out why or what to do, goes out of business. Their children get no presents for Christmas…
  • All the web pages from a particular law firm get classified as funny. They aren’t funny, but people start laughing anyway (because they are told the pages are funny). No one wants to hire a law firm full of clowns. The firm gets mad and sues your business—and they are a little bit pissed off and have nothing but time on their hands…
  • The pellet griller controller’s temperature sensor goes out. Because of this the intelligence always thinks the fire is not hot enough. It dumps fuel-pellet after fuel-pellet onto the fire, starting a huge, raging fire, but the intelligence still thinks it needs more…
These are a little bit silly, but the point is—be creative. Find really bad things for your users before they have to suffer through them for you, and use your discoveries to make better intelligence, or to influence the rest of the system (the experience and the implementation) to do better.

Summary

With intelligence, evaluation is creation.
There are three main components to accuracy: generalizing to new situations, making the right types of mistakes, and distributing mistakes well among different users/contexts.
  • Intelligence can be good at contexts it knows about, but fail at contexts it hasn't encountered before.
  • Intelligence can make many types of mistakes (including false positives and false negatives).
  • Intelligence can make random mistakes, or it can make mistakes that are focused on specific users and contexts—the focused mistakes can cause problems.
There are specific techniques for evaluating classifications, regressions, probability estimates, and rankings. This chapter presented some simple ones and some concepts, but you can find more information if you need it.
Intelligence can be evaluated with data. Some data should be held aside and used exclusively to evaluate intelligence (and not to create it). This data should be totally independent of the data used to create the intelligence (it should come from different users, different time periods, and so on).
An operating point helps focus intelligence on the types of mistakes it needs to make to succeed in the broader system. A precision-recall curve is a way of understanding (and visualizing) how an intelligence operates across all possible operating points.
It is important to look at the mistakes your intelligence makes. Try to understand what is going on, but also take your user’s point of view, and imagine the worst outcome they might experience.

For Thought…

After reading this chapter, you should be able to:
  • Describe what it means for an intelligence to be accurate.
  • Evaluate the quality of an intelligence across a wide range of practical criteria.
  • Create useful quality goals for intelligence and progress toward the goals.
  • Take a user’s point of view and see the mistakes an intelligence makes through their eyes.
You should be able to answer questions like these:
  • Describe three situations where an Intelligent System would need an intelligence with very high recall.
  • For each of those three, describe a small change to the system’s goals where it would instead need high precision.
  • Select one of the Intelligent Systems mentioned in this chapter and describe a potential sub-population (not mentioned in the chapter) where bad mistakes might occur.
  • Consider the system for classifying books into genres. Imagine you are going to examine 100 mistakes the system is making. Describe two different ways you might categorize the mistakes you examine to help gain intuition about what is going on. (For example, the mistake is on a short book vs a long book—oops! now you can’t use book length in your answer. Sorry.)