1. Let’s Discuss Learning

1.1 Welcome

From time to time, people trot out a tired claim that computers can “only do what they are told to do.” The claim is taken to mean that computers can only do what their programmers know how to do and can explain to the computer. This claim is false. Computers can perform tasks that their programmers cannot explain to them. Computers can solve tasks that their programmers do not understand. We will break down this paradox with an example of a computer program that learns.

I’ll start by discussing one of the oldest—if not the oldest known—examples of a programmed machine-learning system. I’ve turned this into a story, but it is rooted in historical facts. Arthur Samuel was working for IBM in the 1950s and he had an interesting problem. He had to test the big computing machines that were coming off the assembly line to make sure vacuum tubes didn’t blow out when you turned a machine on and ran a program—people don’t like smoke in their workplace. Now, Samuel quickly got bored with running simple toy programs and, like many computing enthusiasts, he turned his attention towards games. He built a computer program that let him play checkers against himself. That was fun for a while: he tested IBM’s computers by playing checkers. But, as is often the case, he got bored playing two-person games solo. His mind began to consider the possibility of getting a good game of checkers against a computer opponent. Problem was, he wasn’t good enough at checkers to explain good checkers strategies to a computer!

Samuel came up with the idea of having the computer learn how to play checkers. He set up scenarios where the computer could make moves and evaluate the costs and benefits of those moves. At first, the computer was bad, very bad. But eventually, the program started making progress. It was slow going. Suddenly, Samuel had a great two-for-one idea: he decided to let one computer play another and take himself out of the loop. Because the computers could make moves much faster than Samuel could enter his moves—let alone think about them—the result was many more cycles of “make a move and evaluate the outcome” per minute and hour and day.

Here is the amazing part. It didn’t take very long for the computer opponent to be able to consistently beat Samuel. The computer became a better checkers player than its programmer! How on earth could this happen, if “computers can only do what they are told to do”? The answer to this riddle comes when we analyze what the computer was told to do. What Samuel told the computer to do was not the play-checkers task; it was the learn-to-play-checkers task. Yes, we just went all meta on you. Meta is what happens when you take a picture of someone taking a picture (of someone else). Meta is what happens when a sentence refers to itself; the next sentence is an example. This sentence has five words. When we access the meta level, we step outside the box we were playing in and we get an entirely new perspective on the world. Learning to play checkers—a task that develops skill at another task—is a meta task. It lets us move beyond a limiting interpretation of the statement, computers can only do what they are told. Computers do what they are told, but they can be told to develop a capability. Computers can be told to learn.

1.2 Scope, Terminology, Prediction, and Data

There are many kinds of computational learning systems out there. The academic field that studies these systems is called machine learning. Our journey will focus on the current wunderkind of learning systems that has risen to great prominence: learning from examples. Even more specifically, we will mostly be concerned with supervised learning from examples. What is that? Here’s an example. I start by giving you several photos of two animals you’ve never seen before—with apologies to Dr. Seuss, they might be a Lorax or a Who—and then I tell you which animal is in which photo. If I give you a new, unseen photo you might be able to tell me the type of animal in it. Congratulations, you’re doing great! You just performed supervised learning from examples. When a computer is coaxed to learn from examples, the examples are presented a certain way. Each example is measured on a common group of attributes and we record the values for each attribute on each example. Huh?

Imagine—or glance at Figure 1.1—a cartoon character running around with a basket of different measuring sticks which, when held up to an object, return some characteristic of that object, such as this vehicle has four wheels, this person has brown hair, the temperature of that tea is 180°F, and so on ad nauseam (that’s an archaic way of saying until you’re sick of my examples).


Figure 1.1 Humans have an insatiable desire to measure all sorts of things.

1.2.1 Features

Let’s get a bit more concrete. For example—a meta-example, if you will—a dataset focused on human medical records might record several relevant values for each patient, such as height, weight, sex, age, smoking history, systolic and diastolic (that’s the high and low numbers) blood pressures, and resting heart rate. The different people represented in the dataset are our examples. The biometric and demographic characteristics are our attributes.

We can capture this data very conveniently as in Table 1.1.

Table 1.1 A simple biomedical data table. Each row is an example. Each column contains values for a given attribute. Together, each attribute-value pair is a feature of an example.

patient id | height | weight | sex | age | smoker | hr | sys bp | dia bp
-----------|--------|--------|-----|-----|--------|----|--------|-------
007        | 5’2”   | 120    | M   | 11  | no     | 75 | 120    | 80
2139       | 5’4”   | 140    | F   | 41  | no     | 65 | 115    | 75
1111       | 5’11”  | 185    | M   | 41  | no     | 52 | 125    | 75

Notice that each example—each row—is measured on the same attributes shown in the header row. The values of each attribute run down the respective columns.
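To make the rows-and-columns idea concrete in code, here is the same dataset as a plain-Python structure. This is only a sketch; in practice a data-handling library such as pandas would hold the table, but the structure is the same: each example is measured on a common set of attributes.

```python
# Table 1.1 as a plain-Python dataset: each example (row) is a dict
# mapping attribute names to that example's values.
patients = [
    {"patient id": "007",  "height": "5'2\"",  "weight": 120, "sex": "M",
     "age": 11, "smoker": "no", "hr": 75, "sys bp": 120, "dia bp": 80},
    {"patient id": "2139", "height": "5'4\"",  "weight": 140, "sex": "F",
     "age": 41, "smoker": "no", "hr": 65, "sys bp": 115, "dia bp": 75},
    {"patient id": "1111", "height": "5'11\"", "weight": 185, "sex": "M",
     "age": 41, "smoker": "no", "hr": 52, "sys bp": 125, "dia bp": 75},
]

# Every example is measured on the same attributes (the column headers).
attributes = list(patients[0].keys())
print(attributes)
print(len(patients), "examples")
```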

We call the rows of the table the examples of the dataset and we refer to the columns as the features. Features are the measurements or values of our attributes. Often, people use “features” and “attributes” as synonyms describing the same thing; what they are referring to are the columns of values. Still, some people like to distinguish among three concepts: what-is-measured, what-the-value-is, and what-the-measured-value-is. For those strict folks, the first is an attribute, the second is a value, and the last is a feature—an attribute and a value paired together. Again, we’ll mostly follow the typical conversational usage and call the columns features. If we are specifically talking about what-is-measured, we’ll stick with the term attribute. You will inevitably see both, used both ways, when you read about machine learning.

Let’s take a moment and look at the types of values our attributes—what is measured—can take. One type of value distinguishes between different groups of people. We might see such groups in a census or an epidemiological medical study—for example, sex {male, female} or a broad record of ethnic-cultural-genetic heritage {African, Asian, European, Native American, Polynesian}. Attributes like these are called discrete, symbolic, categorical, or nominal attributes, but we are not going to stress about those names. If you struggled with those in a social science class, you are free to give a hearty huzzah.

Here are two important, or at least practical, points about categorical data. One point is that these values are discrete. They take a small, limited number of possibilities that typically represent one of several options. You’re right that small and several are relative terms—just go with it. The second point is that the information in those attributes can be recorded in two distinct ways:

  • As a single feature that takes one value for each option, or

  • As several features, one per option, where one, and only one, of those features is marked as yes or true and the remainder are marked as no or false.

Here’s an example. Consider

Name  | Sex
------|-------
Mark  | Male
Barb  | Female
Ethan | Male

versus:

Name  | Sex is Female | Sex is Male
------|---------------|------------
Mark  | No            | Yes
Barb  | Yes           | No
Ethan | No            | Yes

If we had a column for community type in a census, the single-column form would take one of three values: Urban, Rural, or Suburban. The expanded, multicolumn form would take up three columns, one per community type. Generally, we aren’t motivated or worried about table size here. What matters is that some learning methods are, shall we say, particular in preferring one form or the other. There are other details to point out, but we’ll save them for later.
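The two recording styles can be converted into each other mechanically. Here is a minimal plain-Python sketch of the single-column-to-multicolumn conversion, often called one-hot encoding. Real code would likely lean on a library routine such as pandas’s get_dummies; the one_hot helper below is purely illustrative.

```python
# Convert a single categorical feature into several yes/no features,
# one per option ("one-hot" encoding).
def one_hot(rows, column):
    options = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged...
        new_row = {k: v for k, v in row.items() if k != column}
        # ...and add one yes/no column per option; exactly one is "Yes".
        for option in options:
            new_row[f"{column} is {option}"] = ("Yes" if row[column] == option
                                                else "No")
        encoded.append(new_row)
    return encoded

people = [{"Name": "Mark",  "Sex": "Male"},
          {"Name": "Barb",  "Sex": "Female"},
          {"Name": "Ethan", "Sex": "Male"}]

for row in one_hot(people, "Sex"):
    print(row)
```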

Some feature values can be recorded and operated on as numbers. We may lump them together under the term numerical features. In other contexts, they are known as continuous or, depending on other details, interval or ratio values. Values for attributes like height and weight are typically recorded as decimal numbers. Values for attributes like age and blood pressure are often recorded as whole numbers. Values like counts—say, how many wheels are on a vehicle—are strictly whole numbers. Conveniently, we can perform arithmetic (+, –, ×, /) on these. While we can record categorical data as numbers, we can’t necessarily perform meaningful numerical calculations directly on those values. If two states—say, Pennsylvania and Vermont—are coded as 2 and 14, it probably makes no sense to perform arithmetic on those values. There is an exception: if, by design, those values mean something beyond a unique identifier, we might be able to do some or all of the maths. For extra credit, you can find some meaning in the state values I used where mathematics would make sense.
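A quick sketch of why this matters in practice. The ages come from Table 1.1 and the state codes come straight from the paragraph above; the point is that averaging a genuinely numerical attribute yields a sensible number, while averaging category codes yields an arithmetically valid but meaningless one.

```python
# Arithmetic on a numerical attribute (the ages from Table 1.1) is meaningful:
ages = [11, 41, 41]
mean_age = sum(ages) / len(ages)
print(mean_age)        # 31.0, a sensible summary of the group

# Recording categories as numbers does not make them numerical.
# Averaging these state codes produces a value that names no state at all:
state_codes = {"Pennsylvania": 2, "Vermont": 14}
mean_code = sum(state_codes.values()) / len(state_codes)
print(mean_code)       # 8.0, arithmetically valid but meaningless
```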

1.2.2 Target Values and Predictions

Let’s shift our focus back to the list of biomedical attributes we gathered. As a reminder, the column headings were height, weight, sex, age, smoker, heart rate, systolic blood pressure, and diastolic blood pressure. These attributes might be useful data for a health care provider trying to assess the likelihood of a patient developing cardiovascular disease. To do so, we would need another piece of information: did these folks develop heart disease?

If we have that information, we can add it to the list of attributes. We could capture and record the idea of “developing heart disease” in several different ways. Did the patient:

  • Develop any heart disease within ten years: yes/no

  • Develop X-level severity heart disease within ten years: None or Grade I, II, III

  • Show some level of a specific indicator for heart disease within ten years: percent of coronary artery blockage

We could tinker with these questions based on resources at our disposal, medically relevant knowledge, and the medical or scientific puzzles we want to solve. Time is a precious resource; we might not have ten years to wait for an outcome. There might be medical knowledge about what percent of blockage is a critical amount. We could modify the time horizon or come up with different attributes to record.

In any case, we can pick a concrete, measurable target and ask, “Can we find a predictive relationship between the attributes we have today and the outcome that we will see at some future time?” We are literally trying to predict the future—maybe ten years from now—from things we know today. We call the concrete outcome our target feature or simply our target. If our target is a category like {sick, healthy} or {None, I, II, III}, we call the process of learning the relationship classification. Here, we are using the term classification in the sense of finding the different classes, or categories, of a possible outcome. If the target is a smooth, sweeping range of numerical values, like the usual decimal numbers from elementary school {27.2, 42.0, 3.14159, –117.6}, we call the process regression. If you want to know why, go and google Galton regression for the history lesson.

We now have some handy terminology in our toolbox: most importantly features, either categorical or numerical, and a target. If we want to emphasize the features being used to predict the future unknown outcome, we may call them input features or predictive features. There are a few issues I’ve swept under the carpet. In particular, we’ll address some alternative terminology at the end of the chapter.

1.3 Putting the Machine in Machine Learning

I want you to create a mental image of a factory machine. If you need help, glance at Figure 1.2. On the left-hand side, there is a conveyor belt that feeds inputs into the machine. On the right-hand side, the machine spits out outputs which are words or numbers. The words might be cat or dog. The numbers might be {0, 1} or {–2.1, 3.7}. The machine itself is a big hulking box of metal. We can’t really see what happens on the inside. But we can see a control panel on the side of the machine, with an operator’s seat in front of it. The control panel has some knobs we can set to numerical values and some switches we can flip on and off. By adjusting the knobs and switches, we can make different products appear on the right-hand side of the machine, depending on what came in the left-hand side. Lastly, there is a small side tray beside the operator’s chair. The tray can be used to feed additional information, that is not easily captured by knobs and switches, into the machine. Two quick notes for the skeptical reader: our knobs can get us arbitrarily small and large values (−∞ to ∞, if you insist) and we don’t strictly need on/off switches, since knobs set to precisely 0 or 1 could serve a similar purpose.


Figure 1.2 Descriptions go in. Categories or other values come out. We can adjust the machine to improve the relationship between the inputs and outputs.

Moving forward, this factory image is a great entry point to understand how learning algorithms figure out relationships between features and a target. We can sit down as the machine operator and press a magic—probably green—go button. Materials roll in the machine from the left and something pops out on the right. Our curiosity gets the best of us and we twiddle the dials and flip the switches. Then, different things pop out the right-hand side. We turn up KnobOne and the machine pays more attention to the sounds that the input object makes. We turn down KnobTwo and the machine pays less attention to the number of limbs on the input object. If we have a goal—if there is some known product we’d like to see the machine produce—hopefully our knob twiddling gets us closer to it.

Learning algorithms are formal rules for how we manipulate our controls. After seeing examples where the target is known, learning algorithms take a given big-black-box and use a well-defined method to set the dials and switches to good values. While good can be quite a debatable quality in an ethics class, here we have a gold standard: our known target values. If our predictions don’t match the known targets, we have a problem. The algorithm adjusts the control panel settings so our predicted outputs match the known outputs. Our name for the machine is a learning model or just a model.

An example goes into the machine and, based on the settings of the knobs and switches, a class or a numerical value pops out. Do you want a different output value from the same input ingredients? Turn the knobs to different settings or flip a switch. One machine has a fixed set of knobs and switches. The knobs can be turned, but we can’t add new knobs. If we add a knob, we have a different machine. Amazingly, the differences between knob-based learning methods boil down to answering three questions:

  1. What knobs and switches are there: what is on the control panel?

  2. How do the knobs and switches interact with an input example: what are the inner workings of the machine?

  3. How do we set the knobs from some known data: how do we align the inputs with the outputs we want to see?
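To make the three questions tangible, here is a toy machine sketched in Python. The knobs are numerical weights, a threshold plays the role of a switch, and every name (machine, the feature names, the knob settings) is invented for illustration. No real learning algorithm appears here; we set the knobs by hand, which is exactly the tedium a learning algorithm automates.

```python
# A toy "machine": the knobs are numerical weights on the input features,
# and a threshold acts as the switch that picks the output category.
# Purely illustrative -- a learning algorithm would set these knobs for us.
def machine(example, knobs, threshold=0.5):
    score = sum(knobs[name] * value for name, value in example.items())
    return "dog" if score > threshold else "cat"

example = {"makes_barking_sound": 1.0, "number_of_limbs": 4.0}

# Turn up KnobOne (attention to sound), turn down KnobTwo (attention to limbs):
knobs = {"makes_barking_sound": 2.0, "number_of_limbs": 0.0}
print(machine(example, knobs))   # dog

# Different knob settings give a different output for the same input:
knobs = {"makes_barking_sound": 0.0, "number_of_limbs": 0.1}
print(machine(example, knobs))   # cat
```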

Many learning models that we will discuss can be described as machines with knobs and switches—with no need for the additional side input tray. Other methods require the side tray. We’ll hold off discussing that more thoroughly, but if your curiosity is getting the best of you, flip to the discussion of nearest neighbors in Section 3.5.

Each learning method—which we imagine as a black-box factory machine and a way to set knobs on that machine—is really an implementation of an algorithm. For our purposes, an algorithm is a finite, well-defined sequence of steps to solve a task. An implementation of an algorithm is the specification of those steps in a particular programming language. The algorithm is the abstract idea; the implementation is the concrete existence of that idea—at least, as concrete as a computer program can be! In reality, algorithms can also be implemented in hardware, just like our factory machines; but it’s far easier for us to work with software.

1.4 Examples of Learning Systems

Under the umbrella of supervised learning from examples, there is a major distinction between two things: predicting values and predicting categories. Are we trying (1) to relate the inputs to one of a few possible categories indicated by discrete symbols, or (2) to relate the inputs to a more-or-less continuous range of numerical values? In short, is the target categorical or numerical? As I mentioned, predicting a category is called classification. Predicting a numerical value is called regression. Let’s explore examples of each.

1.4.1 Predicting Categories: Examples of Classifiers

Classifiers are models that take input examples and produce an output that is one of a small number of possible groups or classes:

  1. Image Classification. From an input image, output the animal (e.g., cat, dog, zebra) that is in the image, or none if there is no animal present. Image analysis of this sort is at the intersection of machine learning and computer vision. Here, our inputs will be a large collection of image files. They might be in different formats (png, jpeg, etc.). There may be substantial differences between the images: (1) they might be at different scales, (2) the animals may be centered or cut-off on the edges of the frame, and (3) the animals might be blocked by other things (e.g., a tree). These all represent challenges for a learning system—and for learning researchers! But, there are some nice aspects to image recognition. Our concept of cat and what images constitute a cat is fairly fixed. Yes, there could be blurred boundaries with animated cats—Hobbes, Garfield, Heathcliff, I’m looking at you—but short of evolutionary time scales, cat is a pretty static concept. We don’t have a moving target: the relationship between the images and our concept of cat is fixed over time.

  2. Stock Action. From a stock’s price history, company fundamental data, and other relevant financial and market data, output whether we should buy or sell a stock. This problem adds some challenges. Financial records might only be available in text form. We might be interested in relevant news stories but we have to somehow figure out what’s relevant—either by hand or (perhaps!) using another learning system. Once we’ve figured out the relevant text, we have to interpret it. These steps are where learning systems interact with the field of natural language processing (NLP). Continuing on with our larger task, we have a time series of data—repeated measurements over time. Our challenges are piling up. In financial markets, we probably have a moving target! What worked yesterday to pick a winning stock is almost certainly not going to work tomorrow in the exact same way. We may need some sort of method or technique that accounts for a changing relationship between the inputs and the output. Or, we may simply hope for the best and use a technique that assumes we don’t have a moving target. Disclaimer: I am not a financial advisor nor am I offering investment advice.

  3. Medical Diagnosis. From a patient’s medical record, output whether they are sick or healthy. Here we have an even more complicated task. We might be dealing with a combination of text and images: medical records, notes, and medical imaging. Depending on context that may or may not be captured in the records—for example, traveling to tropical areas opens up the possibility of catching certain nasty diseases—different signs and symptoms may lead to very different diagnoses. Likewise, for all our vast knowledge of medicine, we are only beginning to understand some areas. It would be great for our system to read and study, like a doctor and researcher, the latest and greatest techniques for diagnosing patients. Learning to learn-to-learn is a meta-task in the extreme.

These are big-picture examples of classification systems. As of 2019, learning systems exist that handle many aspects of these tasks. We will even dive into basic image and language classifiers in Chapter 14. While each of these examples has its own domain-specific difficulties, they share a common task in building a model that separates the target categories in a useful and accurate way.

1.4.2 Predicting Values: Examples of Regressors

Numerical values surround us in modern life. Physical measurements (temperature, distance, mass), monetary values, percents, and scores are measured, recorded, and processed endlessly. Each can easily become a target feature that answers a question of interest:

  1. Student Success. We could attempt to predict student scores on exams. Such a system might allow us to focus tutoring efforts on struggling students before an exam. We could include features like homework completion rates, class attendance, a measure of daily engagement or participation, and grades in previous courses. We could even include open-ended written assessments and recommendations from prior instructors. As with many regression problems, we could reasonably convert this regression problem to a classification problem by predicting a pass/fail or a letter grade instead of a raw numerical score.

  2. Stock Pricing. Similar to the buy/sell stock classifier, we could attempt to predict the future price—dollar value—of a stock. This variation seems like a more difficult task. Instead of being satisfied with a broad estimate of up or down, we want to predict that the price will be $20.73 in two weeks. Regardless of difficulty, the inputs could be essentially the same: various bits of daily trading information and as much fundamental information—think quarterly reports to shareholders—as we’d like to incorporate.

  3. Web Browsing Behavior. From an online user’s browsing and purchasing history, predict (in percentage terms) how likely the user is to click on an advertising link or to purchase an item from an online store. While the input features of browsing and purchasing history are not numerical, our target—a percentage value—is. So, we have a regression problem. As in the image classification task, we have many small pieces of information that each contribute to the overall result. The pieces need context—how they relate to each other—to really become valuable.

1.5 Evaluating Learning Systems

Learning systems are rarely perfect. So, one of our key criteria is measuring how well they do. How correct are the predictions? Since nothing comes for free, we also care about the cost of making the predictions. What computational resources did we invest to get those predictions? We’ll look at evaluating both of these aspects of learning system performance.

1.5.1 Correctness

Our key criterion for evaluating learning systems is that they give us correct predictive answers. If we didn’t particularly care about correct answers, we could simply flip a coin, spin a roulette wheel, or use a random-number generator on a computer to get our output predictions. We want our learning system—that we are investing time and effort in building and running—to do better than random guessing. So, (1) we need to quantify how well the learner is doing and (2) we want to compare that level of success—or sadly, failure—with other systems. Comparing with other systems can even include comparisons with random guessers. There’s a good reason to make that comparison: if we can’t beat a random guesser, we need to go back to the drawing board—or maybe take a long vacation.

Assessing correctness is a surprisingly subtle topic which we will discuss in great detail throughout this book. But, for now, let’s look at two classic examples of the difficulty of assessing correctness. In medicine, many diseases are—fortunately!—pretty rare. So, a doctor could get a large percentage of correct diagnoses by simply looking at every person in the street and saying, “that person doesn’t have the rare disease.” This scenario illustrates at least four issues that must be considered in assessing a potential diagnosis:

  1. How common is an illness: what’s the base rate of sickness?

  2. What is the cost of missing a diagnosis: what if a patient isn’t treated and gets gravely ill?

  3. What is the cost of making a diagnosis? Further testing might be invasive and costly; worrying a patient needlessly could be very bad for a patient with high levels of anxiety.

  4. Doctors typically diagnose patients that come into the office because they are symptomatic. That’s a pretty significant difference from a random person in the street.

A second example comes from the American legal system, where there is a presumption of innocence and a relatively high bar for determining guilt. Sometimes this criterion is paraphrased as, “It is better for 99 criminals to go free than for 1 honest citizen to go to jail.” As in medicine, we have the issue of rareness. Crime and criminals are relatively rare and getting rarer. We also have different costs associated with failures. We value clearing an honest citizen more than catching every criminal—at least that’s how it works in the idealized world of a high-school civics class. Both these domains, legal and medical, deal with unbalanced target classes: disease and guilt are not 50–50 balanced outcomes. We’ll talk about evaluating with unbalanced targets in Section 6.2.
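The rare-disease pitfall is easy to demonstrate in a few lines of code. The 1% base rate below is a made-up number for illustration: always predicting “healthy” earns a high accuracy while catching no sick patients at all.

```python
# With a rare disease (a made-up 1% base rate), always predicting
# "healthy" is 99% accurate -- yet it never catches a single sick patient.
actual = ["sick"] * 1 + ["healthy"] * 99
predicted = ["healthy"] * 100            # the lazy "doctor in the street"

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)                          # 0.99

# But among the actually sick, we caught nobody:
caught = sum(a == p == "sick" for a, p in zip(actual, predicted))
print(caught)                            # 0
```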

One last issue in assessing correctness: two wrongs don’t necessarily make a right. If we are predicting rainfall and, in one case, we underpredict by 2 inches while in another case we overpredict by 2 inches, these don’t always cancel out. We cannot simply say, “On average, we were perfect!” Well, the average error really is zero, and that might be good enough in some instances. Usually, however, we care, sometimes very deeply, that we were wrong in both predictions. If we were trying to determine the amount of additional water to give some plants, we might end up giving plants a double dose of water that causes them to drown. Brown Thumbs—myself included—might like using that excuse in the next Great Garden Fiasco.

1.5.2 Resource Consumption

In our modern age of disposable everything, it is tempting to apply a consumer-driven strategy to our learning systems: if we hit a barrier, just buy our way through it. Data storage is extremely cheap. Access to phenomenally powerful hardware, such as a computing cluster driven by graphics processors, is just an email or an online purchase away. This strategy raises a question: shouldn’t we just throw more hardware at problems that hit resource limits?

The answer might be yes—but let’s, at least, use quantitative data to make that decision. At each level of increased complexity of a computational system, we pay for the privilege of using that more complex system. We need more software support. We need more specialized human capital. We need more complicated off-the-shelf libraries. We lose the ability to rapidly prototype ideas. For each of these costs, we need to justify the expense. Further, for many systems, there are small portions of code that are a performance bottleneck. It is often possible to maintain the simplicity of the overall system and only have a small kernel that draws on more sophisticated machinery to go fast.

With all that said, there are two primary resources that we will measure: time and memory. How long does a computation take? What is the maximum memory it needs? It is often the case that these can be traded off one for the other. For example, I can precompute the answers to common questions and, presto, I have very quick answers available. This, however, comes at the cost of writing down those answers and storing them somewhere. I’ve reduced the time needed for a computation but I’ve increased my storage requirements.

If you’ve ever used a lookup table—maybe to convert lengths from imperial to metric—you’ve made use of this tradeoff. You could pull out a calculator, plug values into a formula, and get an answer for any specific input. Alternatively, you can just flip through a couple pages of tables and find a precomputed answer up to some number of digits. Now, since the formula method here is quite fast to begin with, we actually end up losing out by using a big table. If the formula were more complicated and expensive to compute, the table could be a big time saver.
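Here is the lookup-table tradeoff in miniature, using the imperial-to-metric example. The table of precomputed answers costs memory; answering a repeated question then costs almost no time.

```python
# Trade memory for time: precompute inches-to-centimeters conversions
# once, then answer repeated questions with a table lookup.
INCHES_TO_CM = {inches: inches * 2.54 for inches in range(1, 101)}

def convert(inches):
    return INCHES_TO_CM[inches]          # fast lookup, no arithmetic

print(convert(12))                       # 30.48 (up to float rounding)

# The table costs storage (100 entries here) while the formula costs a
# multiplication each time.  For this cheap formula the table is overkill;
# the tradeoff pays off only when the computation is expensive.
```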

A physical-world equivalent of precomputation is when chefs and mixologists premake important components of larger recipes. Then, when they need lime juice, instead of having to locate limes, clean them, and juice them, they simply pull a lime juice cube out of the freezer or pour some lime juice out of a jar. They’ve traded time at the beginning and some storage space in the refrigerator for faster access to lime juice to make your killer mojito or guacamole.

Likewise, a common computation technique called compression trades off time for space. I can spend some time finding a smaller, compact representation of Moby Dick—including the dratted chapter on cetology (the study of whales)—and store the compressed text instead of the raw text of the tome. Now, my hard drive or my bookshelf is less burdened by storage demands. Then, when I get a craving to read about 19th-century whaling adventures, I can do so. But first, I have to pay a computation cost in time because I have to decompress the book. Again, there is a tradeoff between computational time and storage space.
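The compression tradeoff can be sketched with Python’s standard zlib module. The repeated snippet below is a stand-in for the full text of Moby Dick.

```python
import zlib

# Trade time for space: compress the text once, then pay a decompression
# cost in time whenever we want to read it back.
text = ("Call me Ishmael. " * 200).encode("utf-8")   # stand-in for the tome

compressed = zlib.compress(text)
print(len(text), "->", len(compressed), "bytes")     # far less storage

# Reading requires spending time to decompress first:
restored = zlib.decompress(compressed)
assert restored == text                              # nothing was lost
```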

Different learning systems make different tradeoffs between what they remember about the data and how long they spend processing the data. From one perspective, learning algorithms compress data in a way that is suitable for predicting new examples. Imagine that we are able to take a large data table and reduce it to a few knobs on a machine: as long as we have a copy of that machine around, we only need a few pieces of information to recreate the table.

1.6 A Process for Building Learning Systems

Even in this brief introduction to learning systems, you’ve probably noticed that there are many, many options that describe a learning system.

  • There are different domains where we might apply learning, such as business, medicine, and science.

  • There are different tasks within a domain, such as animal image recognition, medical diagnosis, web browsing behavior, and stock market prediction.

  • There are different types of data.

  • There are different models relating features to a target.

We haven’t yet explicitly discussed the different types of models we’ll use, but we will in the coming chapters. Rest assured, there are many options.

Can we capture any generalities about building learning systems? Yes. Let’s take two different perspectives. First, we’ll talk at a high level where we are more concerned with the world around the learning system and less concerned with the learning system itself. Second, we’ll dive into some details at a lower level: imagine that we’ve abstracted away all the complexity of the world around us and are just trying to make a learning system go. More than that, we’re trying to find a solid relationship between the features and the target. Here, we’ve reduced a very open-ended problem to a defined and constrained learning task.

Here are the high-level steps:

  1. Develop an understanding of our task (task understanding).

  2. Collect and understand our data (data collection).

  3. Prepare the data for modeling (data preparation).

  4. Build models of relationships in the data (modeling).

  5. Evaluate and compare one or more models (evaluation).

  6. Transition the model into a deployable system (deployment).

These steps are shown in Figure 1.3. I’ll insert a few common caveats here. First, we normally have to iterate, or repeat, these steps. Second, some steps may feed back to prior steps. As with most real-world processes, progress isn’t always a straight line forward. These steps are taken from the CRISP-DM flow chart that organizes the high-level steps of building a learning system. I’ve renamed the first step from business understanding to task understanding because not all learning problems arise in the business world.
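To make the steps concrete, here's a toy Python sketch of the flow. Every function is a placeholder standing in for real work, and the names and the tiny pet dataset are invented for illustration, not any library's API.

```python
# Each function below is a placeholder for one high-level step.
def understand_task():
    return "predict a pet's species from its weight"

def collect_data(task):
    return [(4.0, "cat"), (20.0, "dog"), (4.5, "cat")]  # (weight, species)

def prepare_data(rows):
    return [(w, s) for w, s in rows if w > 0]  # e.g., drop impossible weights

def build_model(examples):
    # A deliberately simple "model": always predict the most common species.
    labels = [s for _, s in examples]
    return max(set(labels), key=labels.count)

def evaluate(model, examples):
    return sum(model == s for _, s in examples) / len(examples)

def deploy(model):
    return lambda weight: model  # wrap the model behind a callable interface

task = understand_task()
examples = prepare_data(collect_data(task))
model = build_model(examples)
score = evaluate(model, examples)
predictor = deploy(model)
```

In practice we would loop back: a poor evaluation score sends us back to modeling, or even to data collection, before anything gets deployed.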


Figure 1.3 A high-level view of machine learning.

Within the high-level modeling step—that’s step 4 above—there are a number of important choices for a supervised learning system:

  1. What part of the data is our target and what are the features?

  2. What sort of machine, or learning model, do we want to use to relate our input features to our target feature?

  3. Do the data and machine have any negative interactions? If so, do we need to do additional data preparation as part of our model building?

  4. How do we set the knobs on the machine? What is our algorithm?
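Here's a toy Python sketch of those choices on an invented table: we pick a target column (choice 1), choose a simple one-knob threshold machine (choice 2), and let an exhaustive search set its knobs (choice 4). The column names and values are made up for illustration, and this tidy data needs no extra preparation, so choice 3 doesn't come up.

```python
# A hypothetical data table: feature columns plus a target column.
rows = [
    {"age": 30, "rest_bp": 120, "disease": 0},
    {"age": 50, "rest_bp": 145, "disease": 1},
    {"age": 40, "rest_bp": 130, "disease": 0},
    {"age": 60, "rest_bp": 150, "disease": 1},
]

# Choice 1: which column is the target?  The rest are features.
target = "disease"
features = [k for k in rows[0] if k != target]

# Choice 2: the machine -- a single-feature threshold rule with two knobs.
def predict(row, feature, threshold):
    return 1 if row[feature] >= threshold else 0

def accuracy(feature, threshold):
    return sum(predict(r, feature, threshold) == r[target] for r in rows) / len(rows)

# Choice 4: the algorithm that sets the knobs -- here, exhaustive search
# over every (feature, observed value) pair.
best = max(((f, r[f]) for f in features for r in rows),
           key=lambda knobs: accuracy(*knobs))
```

Swapping in a different machine (a line, a tree, a neural network) or a different knob-setting algorithm (gradient descent instead of exhaustive search) changes the details but not the shape of these choices.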

While these breakdowns can help us organize our thoughts and discussions about learning systems, they are not the final story. Let's inform the emperor and empress that they are missing their clothes. Abstract models or flow-chart diagrams can never capture the messy reality of the real world. In the real world, folks building learning systems are often called in (1) after there is already a pile of data gathered and (2) after some primary stakeholders—ahem, bosses—have already decided what they want done. From our humble perspective—and from what I want you to get out of this book—that's just fine. We're not going to dive into the details of collecting data, experimental design, and determining good business, engineering, or scientific relationships to capture. We're just going to say, "Go!" We will move from that pile of data to usable examples, applying different learning systems, evaluating the results, and comparing alternatives.

1.7 Assumptions and Reality of Learning

Machine learning is not magic. I can see the look of shock on your faces. But seriously, learning cannot go beyond some fundamental limits. What are those limits? Two of them are directly related to the data we have available. If we are trying to predict heart disease, having information about preferred hair styles and sock colors is—likely—not going to help us build a useful model. If we have no useful features, we’ll only pick up on illusory patterns—random noise—in the data. Even with useful features, in the face of many irrelevant features, learning methods may bog down and stop finding useful relationships. Thus, we have a fundamental limit: we need features that are relevant to the task at hand.
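We can watch that limit in action with a small synthetic sketch. A learner that memorizes its training data looks flawless on the examples it has seen, but when every feature is irrelevant noise, its predictions on new examples hover near coin-flip accuracy—the "pattern" it found was illusory. All data here is randomly generated.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Thirty "sock color"-style features: pure noise, unrelated to the label.
def make_example():
    features = tuple(rng.randint(0, 1) for _ in range(30))
    label = rng.randint(0, 1)
    return features, label

train = [make_example() for _ in range(50)]
test = [make_example() for _ in range(200)]

# A memorizing learner: predict the label of the closest training example.
def predict(x):
    closest = min(train, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))
    return closest[1]

train_acc = sum(predict(f) == y for f, y in train) / len(train)
test_acc = sum(predict(f) == y for f, y in test) / len(test)
# Training accuracy is essentially perfect; test accuracy sits near chance.
```

The gap between the two accuracies is the tell: with no relevant features, what looks like learning on the training data is just memorized noise.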

The second data limit relates to quantity. An entire subject called computational learning theory is devoted to the details of telling us how many examples we need to learn relationships under certain mathematically idealized conditions. From a practical standpoint, however, the short answer is more. We want more data. This rule-of-thumb is often summarized as data is greater than (more important than) algorithms: data > algorithms. There’s truth there, but as is often the case, the details matter. If our data is excessively noisy—whether due to errors or randomness—it might not actually be useful. Bumping up to a stronger learning machine—sort of like bumping up a weight class in wrestling or getting a larger stone bowl for the kitchen—might get us better results. Yet, you can be bigger and not necessarily better: you might not be a more winning wrestler or make a better guacamole just because you are stronger or have better tools.

Speaking of errors in measurements, not every value we have in a data table is going to be 100% accurate. Our measuring rulers might be off by a bit; our ruler-readers might be rounding off their measurements in different ways. Worse yet, we might ask questions in surveys and receive lies in response—the horror! Such is reality. Even when we measure with great attention to detail, there can be differences when we repeat the process. Mistakes and uncertainty happen. The good news is that learning systems can tolerate these foibles. The bad news is that with enough noise it can be impossible to pick out intelligible patterns.

Another issue is that, in general, we don’t know every relevant piece of information. Our outcomes may not be known with complete accuracy. Taken together, these give us unaccounted-for differences when we try to relate inputs to outputs. Even if we have every relevant piece of information measured with perfect accuracy, some processes in the world are fundamentally random—quarks, I’m looking at you. If the random-walk stock market folks are right, the pricing of stocks is random in a very deep sense. In more macro-scale phenomena, the randomness may be less fundamental but it still exists. If we are missing a critical measurement, it may appear as if the relationship in our data is random. This loss of perspective is like trying to live in a three-dimensional world while only seeing two-dimensional shadows. There are many 3D objects that can give the same 2D shadow when illuminated from various angles; a can, a ball, and a coffee cup are all circles from the bird’s eye view (Figure 1.4). In the same way, missing measurements can obscure the real, detectable nature of a relationship.


Figure 1.4 Perspective can shape our view of reality.

Now for two last technical caveats that we’ll hold throughout this book. One is that the relationship between the features and the target is not, itself, a moving target. For example, over time the factors that go into a successful business have presumably changed. In industrial businesses, you need access to raw materials, so being in the right place at the right time is a massive competitive advantage. In knowledge-driven enterprises, the ability to attract high-quality employees from a relatively small pool of talent is a strong competitive advantage. Higher-end students of the mathematical arts call relationships that don’t change over time stationary learning tasks. Over time, or at least over different examples in our dataset, the underlying relationship is assumed to—we act as if it does—remain the same.

The second caveat is that we don’t necessarily assume that nature operates the same way as our machine. All we care about is matching the inputs and the outputs. A more scientific model may seek to explain the relationship between inputs and outputs with a mathematical formula that represents physical laws of the universe. We aren’t going down that rabbit hole. We are content to capture a surface view—a black box or gift-wrapped present—of the relationship. We have our cake, but we can’t eat it too.

1.8 End-of-Chapter Material

1.8.1 The Road Ahead

There isn’t much to summarize in an introductory chapter. Instead, I’ll talk a bit about what we’re going through in the four parts of this book.

Part I will introduce you to several types of learning machines and the basics of evaluating and comparing them. We’ll also take a brief look at some mathematical topics and concepts that you need to be familiar with to deal with our material. Hopefully, the math is presented in a way that doesn’t leave you running for the exits. As you will see, I use a different approach and I hope it will work for you.

Part II dives into detail about evaluating learning systems. My belief is that the biggest risk in developing learning systems is lying to ourselves about how well we are doing. Incidentally, the second biggest risk is blindly using a system without respect for the evolving and complex systems that surround it. Specifically, components in a complex socio-technical system are not swappable like parts in a car. We also need to tread very carefully with the assumption that the future is like the past. As for the first issue, after I get you up and running with some practical examples, we’ll take on the issue of evaluation immediately. As to the second issue—good luck with that! In all seriousness, it is beyond the scope of this book and it requires great experience and wisdom to deal with data that acts differently in different scenarios.

Part III fleshes out a few more learning methods and then shifts focus towards manipulating the data so we can use our various learning methods more effectively. We then turn our focus towards fine-tuning methods by manipulating their internal machinery—diving into their inner workings.

Part IV attacks some issues of increasing complexity: dealing with inadequate vocabulary in our data, using images or text instead of a nice table of examples and features, and making learners that are themselves composed of multiple sublearners. We finish by highlighting some connections between different learning systems and with some seemingly far more complicated methods.

1.8.2 Notes

If you want to know more about Arthur Samuel, this brief bio will get you started: http://history.computer.org/pioneers/samuel.html.

The idea of the meta level and self-reference is fundamental to higher computer science, mathematics, and philosophy. For a brilliant and broad-reaching look at meta, check out Gödel, Escher, Bach: An Eternal Golden Braid by Hofstadter. It is long, but intellectually rewarding.

There are many alternative terms for what we call features and target: inputs/outputs, independent/dependent variables, predictors/outcome, etc.

PA and VT were the 2nd and 14th states to join the United States.

What makes the word cat mean the object *CAT* and how is this related to the attributes that we take to define a cat: meowing, sleeping in sunbeams, etc.? To dive into this topic, take a look at Wittgenstein (https://plato.stanford.edu/entries/wittgenstein), particularly on language and meaning.

The examples I discussed introduce some of the really hard aspects of learning systems. In many cases, this book is about the easy stuff (running algorithms) plus some medium-difficulty components (feature engineering). The real world introduces complexities that are hard.

Outside of supervised learning from examples, there are several other types of learning. Clustering is not supervised learning although it does use examples. We will touch on it in later chapters. Clustering looks for patterns in data without specifying a special target feature. There are other, wilder varieties of learning systems; analytic learning and inductive logic programming, case-based reasoning, and reinforcement learning are some major players. See Tom Mitchell’s excellent book titled Machine Learning. Incidentally, Mitchell has an excellent breakdown of the steps involved in constructing a learning system (the modeling step of the CRISP-DM process).

Speaking of CRISP-DM, Foster Provost and Tom Fawcett have a great book, Data Science for Business, that dives into machine learning and its role in organizations. Although their approach is focused on the business world, anyone who has to make use of a machine-learning system that is part of a larger system or organization—that’s most of them, folks—can learn many valuable lessons from their book. They also have a great approach to tackling technical material. I highly recommend it.

There are many issues that make real-world data hard to process. Missing values is one of these issues. For example, data may be missing randomly, or it may be missing in concert with other values, or it may be missing because our data isn’t really a good sample of all the data we might consider. Each of these may require different steps to try to fill in the missing values.

Folks who have a background in the social sciences might be wondering why I didn’t adopt the classical distinctions of nominal, ordinal, interval, and ratio data. The answer is twofold. First, that breakdown of types misses some important distinctions; search the web for “level of measurement” to get you started. Second, our modeling techniques will convert categories, whether ordered or not, to numerical values and then do their thing. Those types aren’t treated in a fundamentally different way. However, there are statistical techniques, such as ordinal regression, that can account for the ordering of categories.
