Chapter 7. Statistics and modeling: concepts and foundations

This chapter covers

  • Statistical modeling as a core concept in data science
  • Mathematics as a foundation of statistics
  • Other useful statistical methods such as clustering and machine learning

Figure 7.1 shows where we are in the data science process: statistical analysis of data. Statistical methods are often considered to make up nearly half, or at least a third, of the skills and knowledge needed for doing good data science. The other large piece is software development and/or application, and the remaining, smaller piece is subject matter or domain expertise. Statistical theory and methods are hugely important to data science, but I’ve said relatively little about them so far in this book. In this chapter, I attempt to present a grand overview.

Figure 7.1. An important aspect of the build phase of the data science process: statistical data analysis

Statistics is a big field. I wouldn’t presume to be able to cover all of statistics in one book, let alone one chapter. Hundreds of textbooks, thousands of journal articles, and even more web pages have been written on the subject, so you’ll find plenty of references if you have specific questions. What I haven’t yet seen in another written work, however, is a conceptual description of statistics and its most important ideas that provides a solid theoretical foundation for someone aspiring to data science who doesn’t have formal statistical training or education. In this chapter I’ll introduce the field of statistics as a collection of related tools, each with pros and cons, for accomplishing the goals of data science. The aim of such an introduction is to enable you to begin to consider the range of possible statistical methods that might be applied in a project, to the point where you’re able to feel comfortable seeking more specific information from more detailed, technical references.

7.1. How I think about statistics

You may already feel comfortable with statistics and technical references describing how to apply its complex techniques, and in that case this chapter may seem unnecessary. But unless you’ve had much formal statistical education, there are likely many areas that you haven’t yet seen. Or you may not be familiar with how various statistical areas are related to each other. I do feel that even experienced data scientists can benefit from thinking about the field of statistics as a whole, how its components relate to each other, and how the methods of statistics are distinct from both the software that performs them and the data upon which they’re used. I don’t intend to present a definitive description of any of these concepts, but I do intend to initiate a discussion of these concepts, how they relate to each other, and how each can be important.

Throughout this chapter, I’ll continue to emphasize the distinctness of the methods, the software, and the data. Using a machine learning library isn’t the same thing as applying a type of machine learning to your data set. One is a tool; the other is an action. Likewise, a database isn’t the same thing as the data contained within it, no matter how intertwined they may be. Therefore, because I want to focus on statistical methods in this chapter, I’ll often mention software and data only abstractly, though I will refer to concrete examples at times when it seems appropriate.

Finally, before we dive in, I’d like to say that I think and write conceptually about the world of statistical methods. I imagine scenarios in which I’m grabbing data with my hands and stuffing it into the pipes of a machine that will somehow learn about this data, and my job is to tweak the pipes and the dials of the machine so that good, useful information comes out the other end. Or in the case of classifying data points, I picture myself drawing a line with chalk that best divides the red points from the blue points and then considering how I might draw another line to correct for some of the red points that fell on the blue side and vice versa. I think in that way, and I’m warning you in case you were expecting a chapter filled with differential equations and correlation coefficients. To the contrary, it’s going to be a big-picture chapter with lots of conceptual and imaginative passages. I like thinking about this stuff, so I hope I can present it in a way that’s fun for you, too.

7.2. Statistics: the field as it relates to data science

The Oxford Dictionary of Statistical Terms (OUP, 2006) describes statistics as “the study of the collection, analysis, interpretation, presentation, and organization of data.” For our purposes in this chapter, we’re going to skip the collection, presentation, and organization and focus on the analysis and interpretation. I’ll assume that you’ve already collected and organized your data, as described in previous chapters, and I’ll discuss presentation in a later chapter.

Analysis and interpretation, from my perspective, are the scientific aspects of statistics. They’re concerned with wringing knowledge from data and recognizing whether there’s enough evidence to support a given hypothesis or putative conclusion. In the face of much uncertainty—which is always the case for highly statistical projects in data science—good analysis and interpretation are important, and so I’d like to dedicate most of this chapter to discussing some of the methods by which statistics helps to achieve them.

7.2.1. What statistics is

Statistics lies between the theoretical field of mathematics and the reality of observable data. Mathematics, surprisingly to most people, has little if anything to do with data. Despite this, it has much to do with data science. Data scientists need mathematics in order to do meaningful statistical analyses, so I’d be remiss if I didn’t begin a discussion about statistics in data science with a discussion of mathematics. In the next section, I’ll write about mathematics, the main concepts it depends on, and how it’s useful in real-world applications.

On the one side of statistics is mathematics, and on the other side is data. Mathematics—particularly, applied mathematics—provides statistics with a set of tools that enables the analysis and interpretation that are the main focus of this chapter. In addition to mathematics, statistics possesses its own set of techniques that are primarily data centric.

Descriptive statistics, introduced in chapter 5, is a generally intuitive or simple kind of statistics that can provide a good overview of the data without being overly complex or difficult to understand. Descriptive statistics usually stays close to the data, in a sense.

Inferential statistics is inherently one or more steps removed from the data. Inference is the process of estimating unknown quantities based on measurable, related quantities. Typically, inferential statistics involves a statistical model that defines quantities, measurable and unmeasurable, and their relationships to each other. Methods from inferential statistics can range from quite simple to wildly complex, varying also in their precision, abstractness, and interpretability.

Statistical modeling is the general practice of describing a system using statistical constructs and then using that model to aid in analysis and interpretation of data related to the system. Both descriptive and inferential statistics rely on statistical models, but in some cases an explicit construction and interpretation of the model itself plays a secondary role. With statistical modeling, the primary focus is on understanding the model and the underlying system that it describes. Mathematical modeling is a related concept that places more emphasis on model construction and interpretation than on its relationship to data. Statistical modeling focuses on the model’s relationship to data.

Farthest from the raw data is a set of statistical techniques that are often called, for better or worse, black box methods. The term black box refers to the idea that some statistical methods have so many moving pieces with complex relationships to each other that it would be nearly impossible to dissect the method itself once it has been applied to specific data within a specific context. Many methods from machine learning and artificial intelligence fit this description. If you attempt to classify individuals appearing in a data set into one of several categories, and you apply a machine learning technique such as a random forest or neural network, it will often be difficult to say, after the fact, why a certain individual was classified in a certain way. Data goes into the black box, a classification comes out, and you’re not usually certain what exactly happened in between. I’ll discuss this concept more later in this chapter.

In the following sections, I’ll cover the various concepts in statistics in more detail. I’ll usually favor high-level descriptions over specific applications so as to be widely applicable in many cases, but I’ll use illustrative examples when they seem helpful. There are many excellent technical resources that can provide more detail about each particular topic, and I’ll try to provide enough detail, including key words and common method names, so that you’re able to find additional resources quickly, on the internet or elsewhere.

7.2.2. What statistics is not

The most common misconception about what I do as a data scientist usually appears when I talk to a recruiter or other hiring agent for a company or institution. On occasion, the misconception appears later, after I’ve already taken a job and I’m midway through a project. The misconception is that I, as a data scientist, can set up, load, and administer a number of data stores serving a large number of people in various ways. I’ve explained many times to many people that data science is not data management, and it is definitely not database administration. There’s absolutely nothing wrong with those two roles—in fact, I am forever thankful when I get the chance to work with a highly competent database administrator (DBA)—but those roles are quite the opposite of scientific.

Science is the endeavor to discover the unknown. Moving data around and improving reliability and query speed is an incredibly important job that has nothing to do with discovering the unknown. I’m not sure why, exactly, some people confuse these two data-oriented jobs, but it’s happened to me on more than one occasion. It’s particularly funny to me when someone asks me to set up a database for a large organization, because, of all of the most common data science tasks, database administration is probably the one in which I have the least experience. I can set up a database that serves me well, but I definitely wouldn’t count on myself to build a data management solution for a large organization.

Maybe it’s because I’m a mathematician, but I consider data management among those skills that are useful to me but that are peripheral to the main task. I want to get to the analysis. Anything that enables good data analysis is undeniably good, but I’ll suffer passable database performance for a long time before I feel the need to take control of it—and all of its administrative headaches—in the name of optimal performance. I’m all about the statistics, however long they take.

Data management is to statistics as a food supplier is to a chef: statistics is an art that depends deeply on reliable data management, much as a restaurant famous for its bacon-encrusted salmon relies heavily on timely, high-quality raw materials from local pig and salmon farms. (Apologies to my vegetarian readers and fans of wild salmon.) To me, statistics is the job; everything else is only helping. Restaurant-goers want, first and foremost, good food with good ingredients; secondarily, they might want to know that the source of their food was reliable and fast. Consumers of statistical analysis—the customers of data science projects—want to know that they’ve gleaned reliable information in some way. Then and only then would they care if the data store, software, and workflow that uses both are reliable and fast. Statistical analysis is the product, and data management is a necessary part of the process.

The role that statistics plays in data science is not a secondary, peripheral function of dealing with data. Statistics is the slice of data science that provides the insights. All of the software development and database administration that data scientists do contribute to their ability to do statistics. Web development and user interface design—two other tasks that might be asked of a data scientist—help deliver statistical analysis to the customer. As a mathematician and statistician, I might be biased, but I think statistics is the most intellectually challenging part of a data scientist’s job.

On the other hand, some of the biggest challenges I’ve dealt with in data science involve getting various software components to play nicely with one another, so I may be underestimating software engineering. It all depends on where you stand, I suppose. The next chapter will cover the basics of software, so I’ll put off further discussion of it until then.

7.3. Mathematics

The field of mathematics, though its exact boundaries are disputed, is based wholly on logic. Specifically, every mathematical concept can be deconstructed into a series of if-then statements plus a set of assumptions. Yes, even long division and finding the circumference of a circle can be boiled down to purely logical steps that follow from assumptions. It so happens that people have been doing math for so long that innumerable logical steps and assumptions have passed into common use, and we often take some of them for granted.

7.3.1. Example: long division

Long division—or plain division—as you learned it in elementary school, is an operation between two numbers that comes with a lot of assumptions. It’s likely that everyone reading this book learned how to do long division as a set of steps, a sort of algorithm that takes as input two numbers, the dividend and divisor, and gives a result called the quotient. Long division can be quite useful (more so in the absence of a calculator or computer) in everyday life when, for example, you want to divide a restaurant bill equally among several people or share a few dozen cupcakes with friends.

Many people think the field of mathematics is composed of numerous such moderately useful algorithms for calculating things, and these people wouldn’t be entirely wrong. But far more important than mathematical algorithms are the assumptions and logical steps that can be assembled into a proof that something is true or false. In fact, every mathematical algorithm is constructed from a series of logical statements that can end up proving that the algorithm itself does what it is supposed to do, given the required assumptions.

Take, for instance, three logical statements X, Y, and Z, each of which could be either true or false under various circumstances, as well as the following statements:

  • If X is true, then Y must be false.
  • If Y is false, then Z is true.

This is obviously an arbitrary set of statements that could be straight out of a logic text, but such statements lie at the heart of mathematics.

Given these statements, let’s say that you find out that X is true. It follows that Y is false and also that Z is true. That’s logic, and it doesn’t seem exciting. But what if I put real-life meaning into X, Y, and Z in an example that includes a visitor, or potential customer, to a retail website:

  • Statement X— The potential customer put more than two items into their online shopping cart.
  • Statement Y— The customer is only browsing.
  • Statement Z— The potential customer will buy something.

Those statements are all meaningful to an online retailer. And you know that the statement “Z is true” is exciting to any retailer that’s trying to make money, so the logical statements shown previously imply that the statement “X is true” should also be exciting to the retailer. More practically, it might imply that if the retailer is able to get a potential customer to put more than two items into the shopping cart, then they will make a sale. This might be a viable marketing strategy on the website if other paths to making a sale are more difficult. Obviously, real life is rarely this purely logical, but if you make all of the statements fuzzier such that “is true” becomes “probably is true” and likewise for “is false,” then this scenario might indeed be a realistic one in which a data scientist could help increase sales for the retailer. Such fuzzy statements are often best handled using statements of probability, which I cover later in this chapter.
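
To make the chain of reasoning concrete, here’s a minimal Python sketch of the retailer scenario (Python is my choice for illustration throughout this chapter; nothing in the text requires it). The crisp version applies the two if-then statements directly; the fuzzy version swaps “is true” for probabilities, which are invented purely for illustration and not drawn from any real data.

```python
# Crisp version: chain the two if-then statements for a single website visitor.
def will_buy(items_in_cart: int) -> bool:
    x = items_in_cart > 2     # Statement X: more than two items in the cart
    if x:
        y = False             # If X is true, then Y ("only browsing") must be false
        z = not y             # If Y is false, then Z ("will buy something") is true
        return z
    return False              # The statements say nothing about visitors with two or fewer items

print(will_buy(3))   # True: the logic implies this visitor will buy something
print(will_buy(1))   # False: the rules don't let us conclude anything here

# Fuzzy version: replace "is true" with "probably is true."
# These probabilities are hypothetical, chosen only to illustrate the chain.
P_NOT_BROWSING_GIVEN_BIG_CART = 0.8   # P(Y is false | X is true)
P_BUYS_GIVEN_NOT_BROWSING = 0.6       # P(Z is true | Y is false)

# Chaining the fuzzy statements (assuming this chain is the only path considered):
p_buys_given_big_cart = P_NOT_BROWSING_GIVEN_BIG_CART * P_BUYS_GIVEN_NOT_BROWSING
print(f"P(buys | more than two items in cart) = {p_buys_given_big_cart:.2f}")
```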

Back to the example of long division: even the algorithms of basic arithmetic (as you probably learned in school) are predicated on assumptions and logical statements. Before getting into those, instead of boring you with a description of how I do long division, let’s assume that you have a favorite way to do long division—correctly—with pencil and paper, and we’ll refer to that way as The Algorithm hereafter in this example. The Algorithm must be the kind that gives decimal results and not the kind of long division that gives an almost answer plus a remainder. (You’ll understand why in a few minutes.) Furthermore, let’s assume that The Algorithm was originally developed by a mathematician and that this mathematician has already proven that The Algorithm results in correct answers under the appropriate conditions. Now let’s explore some of the conditions that the mathematician requires in order for you to use The Algorithm properly.

First, you have to assume that the dividend, divisor, and quotient are elements of a set called the real numbers. The set of real numbers includes all the decimal and whole numbers you’re used to but not others, such as the imaginary numbers you might get if, for example, you tried to take the square root of a negative number. There are all sorts of other sets of non-real numbers as well as sets not containing numbers at all, but I’ll leave that to mathematics textbooks to discuss.

In addition to the assumption that you’re dealing with the set of real numbers, you also assume that this particular set of real numbers is also a specific type of set called a field. A field, a central concept in abstract algebra, is a set that’s defined by a number of properties, among which are the existence of two operations, commonly called addition and multiplication, which in an abstract sense are not guaranteed to work in the way they do in common arithmetic. You do know that these two operations in fields always operate on two elements of the field in certain specific ways, but the fact that addition and multiplication work in this one specific way is another assumption you have to make when doing long division. For more on fields, consult a reference on abstract algebra.

You’re assuming you have a field composed of real numbers and that you have the operations addition and multiplication that work in the specific ways you learned in school. As part of the definition of a field, these two operations must both have inverses. As you can probably guess, you often call the inverse of addition subtraction and the inverse of multiplication division. The inverse of any operation must undo what the operation does. A number A multiplied by B gives a result C such that C divided by B gives the number A again. Division is defined to be the inverse of multiplication as you know it. That’s not an assumption; it follows from the other assumptions and the definition of a field.

To summarize, here’s what you have on long division:

  • Assumptions:

    1. You have the set of real numbers.
    2. You have a field over the set of real numbers.
    3. The field operations addition and multiplication work as in arithmetic.
  • Statements:

    1. If you have a field, then addition and multiplication have inverses: subtraction and division, respectively.
    2. If the field operations addition and multiplication work as in arithmetic, then subtraction and division also work as in arithmetic.
    3. If division works as in arithmetic, then The Algorithm will give correct answers.

Putting together these assumptions with these statements yields the following:

  • Assumptions 1 and 2 together with statement 1 imply that the operations subtraction and division exist.
  • Assumption 3 and statement 2 imply that subtraction and division work as in arithmetic.
  • The previous two statements together with statement 3 imply that The Algorithm will give correct answers.

That example may seem trivial, and in some ways it is, but I think it’s illustrative of the way that our knowledge of the world, in particular on quantitative topics, is built on specific instances of mathematical constructs. If the system of real numbers didn’t apply for some reason, then long division with decimal results wouldn’t work. If instead you were using the set of whole numbers or the integers, then a different algorithm for long division would be appropriate, possibly one that resulted in a sort of quotient plus a remainder. The reason long division can’t work the same on whole numbers or integers as it does with the real numbers is that neither set—whole numbers or integers—forms a field. Knowledge of the underlying mathematics, such as when you have a field and when you don’t, is the only definitive way to determine when The Algorithm is appropriate and when it is not. Extending that idea, knowledge of mathematics can be useful in choosing analytical methods for data science projects and in diagnosing eventual problems with those methods and their results.
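
As a small, concrete illustration of that last point, here’s a quick Python sketch: floating-point division behaves like The Algorithm because it approximates division in the field of real numbers, whereas division restricted to the integers can only produce a quotient plus a remainder, exactly because the integers don’t form a field.

```python
dividend, divisor = 7, 4

# Division as in the field of real numbers (approximated here by floating point):
# a decimal quotient exists because every nonzero real number has a multiplicative inverse.
print(dividend / divisor)          # 1.75

# Division restricted to the integers: no multiplicative inverses, so the best
# you can get is a quotient plus a remainder.
quotient, remainder = divmod(dividend, divisor)
print(quotient, remainder)         # 1 and 3, because 7 = 4 * 1 + 3
```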

7.3.2. Mathematical models

A model is a description of a system and how it works. A mathematical model describes a system using equations, variables, constants, and other concepts from mathematics. If you’re trying to describe a system that exists in the real world, then you’re venturing into applied mathematics, a phrase that generally implies that the work done can be applied to something outside mathematics, such as physics, linguistics, or data science. Applied mathematics is certainly often close to statistics, and I won’t attempt to make a clear distinction between the two. But, generally speaking, applied math focuses on improving models and techniques, possibly without any data at all, whereas statistics concentrates on learning from data using mathematical models and techniques. The fields of mathematical modeling and applied mathematics are likewise not clearly distinguishable; the former focuses on the models, and the latter on some kind of real-world applications, but neither does so exclusively. The concept and use of mathematical models aren’t intuitive to everyone, so I’ll discuss them briefly here.

One of the simplest and most commonly used mathematical models is the linear model. A linear model is merely a line, described by a linear equation, that’s intended to represent the relationship between two or more variables. When the relationship is linear and passes through the origin, it’s equivalent to saying that the variables are directly proportional, terminology that’s used more often in some fields. A linear equation describing a linear model in two dimensions (two variables) can be written in slope-intercept form (remember from school!) as

y = Mx + B

where M is the slope and B is the y-intercept.

Linear models are used in many applications because they’re easy to work with and also because many natural quantities can be reasonably expected to follow an approximately linear relationship with each other. The relationship between distance driven in a car and the amount of gasoline used is a good example. The farther you drive, the more gasoline you burn. Exactly how much gasoline was used depends on other factors as well, such as the type of car, how fast you were driving, and traffic and weather conditions. Therefore, although you can reasonably assume that distance and gasoline usage are approximately linearly related, other somewhat random variables cause variations in gasoline usage from trip to trip.

Models are often used to make predictions. If you had a linear model for gasoline usage based on distance traveled, you could predict the amount of gasoline you’d use on your next trip by putting the distance of the trip into the linear equation that describes your model. If you used the linear model

y = 0.032x + 0.0027

where y is the amount of gasoline (in liters) needed and x is the distance traveled (in kilometers), then the slope of the line, 0.032, implies that trips in your data set required on average 0.032 liters of gasoline per kilometer traveled. In addition to that, there appears to be an additional 0.0027 liters of gasoline used per trip, regardless of distance traveled. This might account for the energy needed to start the car and idle for a few moments before beginning the trip. Regardless, using this model, you can predict the gasoline usage for an upcoming trip of, say, 100 km by setting x = 100 and calculating y. The prediction according to the model would be y = 3.2027 liters. This is a basic example of how a linear model might be used to make predictions.
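
Written out as a short Python sketch, using the same hypothetical slope and intercept as above, the prediction looks like this:

```python
# Hypothetical linear model of gasoline usage: y = 0.032 * x + 0.0027
SLOPE = 0.032        # liters of gasoline per kilometer traveled
INTERCEPT = 0.0027   # liters used per trip regardless of distance (starting the car, idling)

def predicted_fuel_liters(distance_km: float) -> float:
    """Predict gasoline usage in liters for a trip of the given distance in kilometers."""
    return SLOPE * distance_km + INTERCEPT

print(predicted_fuel_liters(100))   # 3.2027 liters, as in the text
```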

Figure 7.2 shows a graphical representation of a linear model without any axis labels or other context. I’ve included this graph without context because I’d like to focus on the purely conceptual aspects of the model and the mathematics, and the context can sometimes distract from those. In the graph, the line is the model, and the dots represent data that the line is attempting to model. The y-intercept seems to be approximately 5.0, the slope is about 0.25, and the line seems to follow the data reasonably well. But notice the dispersion of the data around the line. If, for example, you wanted to predict a y-value from a given x-value, the model probably wouldn’t give a perfect prediction, and there would be some error. Based on the dots, the predictions of y-values appear to be within about three or four units of the linear model, which may be good or not, depending on the goals of the project. I’ll discuss fitting models to data later, in the section on statistical modeling. The main idea here is the conceptual relationship between a model and data. Having an image such as this one in your mind—and the conceptual understanding it brings—while modeling data increases awareness and improves decision making throughout your analyses.

Figure 7.2. A representation of a linear model (line) and some data (dots) that the model attempts to describe. The line is a mathematical model, and its optimal parameters—slope and intercept—can be found using statistical modeling techniques.

It’s important to emphasize that mathematical models are fabricated things that don’t have any inherent connection to the real-life systems they describe. Models are descriptions of what you think happens in those systems, and there’s no guarantee that they work well. It’s the responsibility of the data scientist (or mathematician or statistician) to find a model that’s good enough to suit the purposes of the project and apply it correctly.

Examples of mathematical models

Einstein’s model of gravity, as described in his theory of general relativity, famously supplanted Newton’s model of gravity. Newton’s model is a simple equation that describes the forces of gravity quite accurately at normal masses and distances, but Einstein’s model, which is based on metric tensors (a sort of higher-order object describing linear relationships), is far more accurate at extreme scales.

The current Standard Model of particle physics, finalized in the 1970s, is a mathematical model based on quantum field theory that theorizes how physical forces and subatomic particles behave. It has survived a few experimental tests of its applicability as a model, most recently during the process of confirming the existence of the Higgs boson. The Higgs boson is an elementary particle predicted by the Standard Model for which, prior to 2012, there was little experimental proof of existence. Since then, experiments at the Large Hadron Collider at CERN have confirmed the existence of a particle consistent with the Higgs boson. Like any good scientist, the researchers at CERN won’t say with certainty that they found the Higgs boson and only the Higgs boson, because some properties of the particle are still unknown, but they do say with certainty that nothing in the Standard Model has yet been contradicted by experimental evidence.

From my own experience, some of the more interesting applications of mathematical models happen in social network analysis. Social networks are composed of individuals and connections between them, making graph theory an excellent mathematical field in which to look for applicable models. Theories of connectedness, centrality, and betweenness, among others, can be nearly directly applied to real-life scenarios in which groups of people interact in various ways. Graph theory has numerous other applications, and frankly it has some quite interesting purely mathematical (nonapplied) problems as well, but the recent advent of social networks on the internet provides a wealth of new phenomena to model as well as data to support it.

The geometric equivalent of a linear model is the concept of Euclidean geometry, which is the normal concept of how three-dimensional space works—length, height, and width, with parallel lines extending to infinity without ever intersecting—extended to any number of dimensions. But other geometries exist, and these can be useful in modeling certain systems. Spherical geometry is the geometry that exists on the surface of a sphere. If you’re standing on the Earth, a sphere (approximately), and you walk in a straight line, ignoring bodies of water, you’ll arrive back where you started some time later. This doesn’t happen in Euclidean geometry—where you’d be walking into infinity—and it’s a property that can come in handy when modeling certain processes. Certainly, any model of airplane traffic could benefit from a spherical geometry, and I’m sure there are many more uses, such as manufacturing engineering, where precision milling a ball joint or other curved surface might need a model of a spherical geometry to get the shape exactly right.

Mathematical models are used in every quantitative field, explicitly or implicitly. Like some of the logical statements and assumptions that we must make in everyday arithmetic, some mathematical models are used so often for a given purpose that they’re taken for granted. For example, surveys, democratic election polling, and medical testing make use of correlations between variables to draw informative conclusions. The common concept of correlation—specifically the Pearson correlation coefficient—assumes a linear model, but that fact is often taken for granted, or at least it’s not common knowledge. The next time you’re reading a forecast for an upcoming election, know that the predictions as well as the margins of error are based on a linear model.
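
To see what “assumes a linear model” means in practice, here’s a small sketch on simulated data (the numbers are invented for illustration): a noisy linear relationship gets a Pearson correlation near 1, while a strong but symmetric nonlinear relationship gets a correlation near 0, even though the two variables are clearly related.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)

y_linear = 2.0 * x + rng.normal(0, 0.5, size=x.size)      # linear relationship plus noise
y_quadratic = x ** 2 + rng.normal(0, 0.5, size=x.size)    # strong but nonlinear relationship

print(np.corrcoef(x, y_linear)[0, 1])      # close to 1: a line describes this well
print(np.corrcoef(x, y_quadratic)[0, 1])   # close to 0: the linear model misses the dependence
```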

7.3.3. Mathematics vs. statistics

Real math is made up of assumptions and logical statements, and only in specific instances does it involve numeric quantities. In that way, among all the topics that are taught in high school math classes in the United States, geometry—with its proofs of triangle congruence and the parallelism of lines—comes the closest to the heart of mathematics. But everyday life obviously deals in numeric quantities quite often, so we tend to focus on the branches of mathematics that deal with quantities. Data science does this quite often, but it has also been known to bleed into not-so-quantitative, or pure, branches of mathematics such as group theory, non-Euclidean geometry, and topology, if they seem useful. In that way, knowledge of some pure mathematical topics can prove useful to a data scientist.

In any case, mathematics generally doesn’t touch the real world. Based wholly on logic and always—always—starting with a set of assumptions, mathematics must first assume a world it can describe before it begins to describe it. Every mathematical statement can be formulated to start with an if (if the assumptions are true), and this if lifts the statement and its conclusion into abstractness. That is not to say that mathematics isn’t useful in the real world; quite the contrary. Mathematics, rather than being a science, is more of a vocabulary with which we can describe things. Some of these things might be in the real world. As with vocabularies and the words they contain, rarely is a description perfectly correct. The goal is to get as close to correct as possible. The mathematician and statistician George Box famously wrote, “Essentially, all models are wrong, but some are useful.” Indeed, if a model is reasonably close to correct, it can be useful.

The field of statistics shares these concerns about the correctness of models, but instead of being a vocabulary and a system of logic, statistics is a lens through which to see the world. Statistics begins with data, and though statistical models are mostly indistinguishable from mathematical models, the intent is quite different. Instead of seeking to describe a system from the inside out, a statistical model observes a system from the outside in by aggregating and manipulating relevant observable data.

Mathematics does, however, provide much of the heavy machinery that statistics uses. Statistical distributions are often described by complex equations with roots that are meaningful in a practical, scientific sense. Fitting statistical models often makes use of mathematical optimization techniques. Even the space in which a project’s data is assumed to lie must be described mathematically, even if the description is merely “N-dimensional Euclidean space.” Although the boundary is a bit blurry, I like to say that the point at which mathematics ends and statistics begins is the point at which real data enters an equation.

7.4. Statistical modeling and inference

In chapter 5, I mentioned statistical inference in the context of the rough statistical analysis I suggested as a part of data assessment. Inference is the task of estimating the value of a quantity you’re not able to measure directly. Because you don’t have direct measurements of the quantity, it’s necessary to construct a model that, at the least, describes the relationship between the quantity you want and the measurements you have. Because of the existence of a model in inference, I’ve lumped statistical modeling and inference together in this section.

A statistical model is not that different from a mathematical model, which I’ve already covered in this chapter. As I’ve written, the difference is mainly the focus: mathematical modeling focuses on the model and its inherent properties, but statistical modeling focuses on the model’s relationship to data. In both cases, the model is a set of variables whose relationships are described by equations and other mathematical relations. A linear model—which I’ve already introduced—between the quantities x and y might look like

y = Mx + B

whereas an exponential model might look like

y = Ae^x

where e is the exponential constant, also known as Euler’s number. The model parameter values M, B, and A are probably unknown until they’re estimated via some statistical inference technique.

Each of these two models is a description of how x and y might be related to one another. In the first case, the linear model, it is assumed that whenever x goes up by a certain amount, y goes up (or down, depending on the sign of M) by the same fixed amount, no matter how large x gets. In the second case, the exponential one, if x increases by a certain amount, then y will increase by an amount that depends on the size of x; if x is larger, then an increase in x will increase y by an even bigger amount than if x were smaller. In short, if x gets bigger, then bigger again, the second movement will cause a greater increase in y than the first.

A common example of exponential growth is unconstrained population growth. If resources aren’t scarce, then populations—bacteria, plants, animals, even people—sometimes grow exponentially. The growth rate might be 5%, 20%, or larger, but the term exponential implies that percentages (or proportions), and not scalar numbers, describe the growth. For example, if a population has 100 individuals and is growing at a rate of 20% per year, then after one year the population will contain 120 individuals. After two years, you expect to have 20% more than 120, which adds 24 individuals to the total, bringing it to 144. As you can see, the rate of increase grows as the population grows. That’s one of the characteristics of an exponential growth model.
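
Here’s that compounding arithmetic as a minimal Python sketch, using the same hypothetical 20% growth rate and starting population:

```python
# Hypothetical population of 100 individuals growing 20% per year.
population = 100.0
growth_rate = 0.20

for year in range(1, 4):
    population *= 1 + growth_rate
    print(f"Year {year}: {population:.1f} individuals")
# Year 1: 120.0, Year 2: 144.0, Year 3: 172.8 -- the yearly increase itself keeps growing
```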

Both of these models, linear and exponential, can be described by a single equation. If you use one of these models in a project, the challenge is to find estimates of the parameters M, B, and/or A that represent the data well and can provide insight into the system you’re modeling. But models can extend far beyond a single equation.

Now that you’ve seen a couple of simple examples, let’s have a look at what statistical models are in general.

7.4.1. Defining a statistical model

A statistical model is a description of a set of quantities or variables that are involved in a system and also a description of the mathematical relationships between those quantities. So far, you’ve seen a linear model as well as an exponential one, both of which pertain only to two variables, x and y, whatever those quantities are. Models can be far more complex, consisting of variables of many dimensions as well as requiring many equations of various types.

Beyond linear and exponential equations, all sorts of function types are used in statistical modeling: polynomial, piecewise polynomial (spline), differential equations, nonlinear equations of various types, and many others. Some equation or function types have more variables (moving parts) than others, which affects the complexity of the model description as well as the difficulty of estimating all the model parameters.

Beyond these mathematical descriptions of models, a statistical model should have some explicit relationship to data that’s relevant to the system being modeled. Usually, this means that the values for which data exists are included explicitly as variables in the model. For instance, if you consider the population growth example from the previous section, and your data set includes several measurements of population size over time, then you’ll want to include the population size in your model as well as a variable for time. In this case, it can be straightforward, such as using the model equation

P = P0 e^(rt)

where P is the population at time t, P0 is the population at time zero—you can choose when time zero is, and then all other times are relative to that t = 0 point—and r is the growth rate parameter (e is still the exponential constant).

Presumably, one of the goals of modeling this data is that you’d like to be able to predict the population at some time in the future. The data set that you have, a set of population sizes over time, is a collection of value pairs for P and t. The task is then to use these past values to find good model parameters that help you make good predictions about the population in the future. In this model, P0 is defined to be the population when t = 0, so the only remaining, unknown model parameter is r. Estimating a good value for r is one of the primary tasks of statistical modeling in this hypothetical project.

Once you have a good estimate of the model parameter r, and given that you would know the value for P0 because it’s defined by the data and the chosen definition for the time variable t, you’d then have a usable model of the growth of the population you’re studying. You can then use it to make conclusions and predictions about the past, present, and future state of the population. That’s the purpose of statistical modeling: to draw meaningful conclusions about the system you’re studying based on a model of that system and some data.
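
One common way to estimate r under this model is to notice that taking logarithms makes the equation linear in t: ln P = ln P0 + rt, so fitting a straight line to (t, ln P) gives an estimate of r as the slope. Here’s a minimal sketch on simulated data; the model as described would let you take P0 directly from the data, but here both parameters come out of the fit, purely for illustration.

```python
import numpy as np

# Simulated (t, P) measurements from a hypothetical population with true growth rate r = 0.20.
rng = np.random.default_rng(1)
t = np.arange(0, 10)                                              # years since time zero
P = 100 * np.exp(0.20 * t) * rng.normal(1.0, 0.02, size=t.size)   # noisy observations

# P = P0 * e^(r*t) becomes ln P = ln P0 + r * t, a straight line in t.
slope, intercept = np.polyfit(t, np.log(P), 1)
r_hat, P0_hat = slope, np.exp(intercept)

print(f"Estimated growth rate r: {r_hat:.3f}")           # should be close to 0.20
print(f"Estimated initial population P0: {P0_hat:.1f}")  # should be close to 100

# Use the fitted model to predict the population five years past the last observation.
t_future = t[-1] + 5
print(f"Predicted population at t = {t_future}: {P0_hat * np.exp(r_hat * t_future):.0f}")
```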

In order to draw meaningful conclusions about a system via statistical modeling, the model has to be good, the data has to be good, and the relationship between them also has to be good. That’s far easier said than done. Complex systems—and most real-life systems are quite complex—require special care in order to make sure that the model and its relationship to the data are good enough to draw those meaningful conclusions you seek. You often have to take into account many unknowns and moving parts in the system. Some unknowns can be included explicitly in the model, such as the growth rate in the exponential population model. These are called latent variables and are described in the next section.

7.4.2. Latent variables

When you create a model of a system, there are some quantities that you can measure and some you can’t. Even among the measurable quantities, there are some for which you already have measurements in your data and others for which you don’t. In the exponential growth model, it’s fairly obvious that the growth rate is a quantity that exists, regardless of whether you can measure it. Even if you wanted to use a different model, such as a linear one, there would probably still be at least one variable or parameter that represents the rate of growth of the population. In any case, this growth parameter is probably not measurable. There might be some rare cases in which you could keep track of the new members of the population, and in that case you might be able to measure the population growth rate directly, but this seems unlikely. Let’s assume that the growth rate isn’t directly measurable or at least that you don’t have direct measurements of it in the data. Whenever you know a variable or parameter exists but you don’t have measurements of it, you call it a latent variable.

Latent variables, as in the case of the growth rate parameter in an exponential population growth model, are often based on an intuitive concept of how a system works. If you know that a population is growing, you know there must be a growth rate, and so creating a variable for that growth rate is an intuitive thing to do that helps explain your system and how other variables are related. Furthermore, if you can draw conclusions about that growth rate, then that might be helpful to your project’s goals. Those are the two most common reasons to include a latent variable in a statistical model:

  • The variable plays an intuitive role in how the system works.
  • You’d like to draw statistical conclusions about this particular variable.

In either case, latent variables represent variables or quantities of which you’d like to know the value but can’t measure or haven’t measured for one reason or another. In order to use them, you have to infer them from what you know about other, related variables.

In the exponential population growth model, if you know about the population size P at multiple time points, then it’s reasonably easy to get some idea of the rate of change of that population. One way would be to take the differences/changes in population between consecutive time points, which is pretty close to a direct measurement but not quite. Then the question is whether the absolute differences are constant over time, implying linear growth, or if they grow as the population grows, implying an exponential growth or something similar. If the population seems to be growing by a constant number every time period (for example, year, month, day), then linear seems better suited, but if the population seems to be growing by a certain percentage every time period, then an exponential model probably suits the system better.
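
That quick diagnostic is easy to sketch in code. With simulated yearly counts (invented for illustration), roughly constant consecutive differences would point toward a linear model, while roughly constant consecutive ratios point toward an exponential one:

```python
import numpy as np

# Hypothetical yearly population counts.
population = np.array([100, 121, 143, 173, 207, 250])

differences = np.diff(population)                  # absolute change per year
ratios = population[1:] / population[:-1]          # proportional change per year

print("Yearly differences:", differences)          # keep growing: a constant-increase (linear) model fits poorly
print("Yearly ratios:     ", np.round(ratios, 2))  # all roughly 1.2: ~20% growth, so exponential fits well
```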

I’ll discuss model comparison later in this chapter—finding a good statistical model is important—but for now I’ll focus on the fact that the nature of a quantity that seems intuitive—the population growth rate in our example—depends heavily on the choice of model. It’s tempting to think that the population growth rate is directly measurable, but in reality, even if you could measure it, you’d still have to make at least the decision about whether the population grows by an absolute number each time period or whether it grows by a percentage each time period. In addition, many, many other models are also possible; linear (absolute) and exponential (percentage) are only the two most commonly used for population growth. The choice of model and the nature of its latent variables are closely related. Both are highly influenced by how the system works, intuitively, as well as the system’s relationship to the data.

7.4.3. Quantifying uncertainty: randomness, variance, and error terms

There’s always uncertainty in the estimated values of latent variables if you can’t measure them directly, but also even if you can. Getting near-exact values for latent variables and model parameters is difficult, so it’s often useful to explicitly include some variance and error terms in your model, which are typically represented by the notion of a probability distribution of values.

Using probability distributions in statistical models

If a quantity described by your model has an expected value—estimated by some statistical method—then the variance of that quantity describes how far from the expected value an individual instance of that quantity might be. For example, if you’re modeling the height of human beings, you might find the average height of men to be 179.0 cm. But each individual differs from that expected value by some amount; each man is probably a few centimeters taller or shorter than that, with some men almost exactly 179.0 cm tall, and then there are the extremely tall and extremely short, who are 20 or 30 cm taller or shorter. This concept of the dispersion of values around an expected value naturally evokes the idea of a bell curve, or normal distribution, with which most people are familiar.

Probability distributions, in general, describe precisely the dispersion of random values, across a range of possible values, that you’d get from a random process if you took a sample from it. If you observed values that are generated by a random process, the probability distribution for that process would tell you how often you’d expect to see certain values. Most people know how the normal distribution is shaped, and they might be able to say what percentage of values generated by a normally distributed random process would be above or below certain marks. Although the normal distribution is the most popular probability distribution, there are distributions of all shapes, continuous and discrete, and each of these carries with it a set of assumptions. The normal distribution in particular doesn’t deal well with outliers, so in the presence of outliers, a more robust distribution or method might be better. Each specific distribution has its own advantages and caveats, and choosing an appropriate one can have significant implications for your results. All randomness is not created equal, so it’s best to do a little investigation before settling on a particular distribution. Plenty of statistics literature exists for this purpose.
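
For example, here’s the kind of statement the normal distribution lets you make, sketched with the standard normal distribution (this assumes SciPy is available):

```python
from scipy.stats import norm

# Standard normal distribution: mean 0, standard deviation 1.
print(norm.cdf(0))                  # 0.5    -- half of all values fall below the mean
print(norm.cdf(1) - norm.cdf(-1))   # ~0.683 -- about 68% fall within one standard deviation of the mean
print(1 - norm.cdf(2))              # ~0.023 -- only about 2.3% fall more than two standard deviations above the mean
```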

The normal distribution is a probability distribution with two parameters: mean and variance. In the case of modeling a single quantity like human height, the normal distribution could describe the entire model. In that case, the system you’re modeling is, in some sense, a system that produces human beings of various heights. The mean parameter represents the height that the system is aiming for when it makes each human, and the variance represents how far from that mean height the system usually is.

Aiming for a value and missing it by some amount is the core idea of error terms in statistical models. An error in this sense is how far off the mark the instance (or measurement) of the quantity is. Conceptually, this implies that every man should be 179.0 cm tall, but that some error in the system caused some men to be taller and some shorter. In statistical models, these errors are generally considered as unexplainable noise, and the principal concern becomes making sure that all the errors across all the measurements are normally distributed.

Formulating a statistical model with uncertainty

A possible version of the human height model involving an error term is the equation

h_i = h_p + ε_i

where h_p is the expected height of the human-producing system, and h_i is the height of the individual with the label i. The error terms are represented by the variables ε_i, which are assumed to be normally distributed with mean zero and independent of each other. The Greek letter epsilon is the favored symbol for errors and other arbitrarily small quantities. Note that the subscript i indicates that there’s a different error for each individual. If you have a good estimate for h_p and the variance for ε_i, then you have a reliable model of male human height. This isn’t the only way to model male human height.

Because you know from experience that the height of human males varies among individuals, you might consider the conceptual difference between the expected height within a human population and the heights of the individuals. Individuals can be considered different instances of the same quantity or set of quantities. There are two conceptual reasons why different instances of the same quantity might vary:

  • The system or the measurement process is noisy.
  • The quantity itself can vary from one instance to another.

Note a subtle difference between these two reasons. The first reason is embodied by the notion of error terms. The second corresponds to the notion of a random variable.

Random variables possess their own inherent randomness that generally wouldn’t be considered noise. In the human height example, rather than calling the system noisy, you could assume that the system itself picks a height at random and then produces a human of that height, nearly exactly. This conceptual distinction does have benefits, particularly in more complex models. This version of the model might be described by

h_i ~ N( h_p , σ² )

m_i = h_i + ε_i

where the first statement indicates that the height h_i of individual i is generated by a random process using a normal distribution with mean h_p and variance σ². The second statement indicates that the measurement of h_i is a noisy process resulting in a measurement m_i. In this case, the error term ε_i corresponds only to the real-life measurement process and so would probably be only a fraction of a centimeter.

It’s probably bad form to mix probability distribution notation with error term notation in the same model description, because they describe practically the same random process, but I think it’s illustrative of the conceptual difference between inherently random variables that are important to the model and the presumed-unexplained error term.

A good example of an inherently random variable that isn’t an error term would appear if you were to generalize your model of male human height to also include women. On average, men around the world are consistently taller than women in the same population, so it would likely be a mistake to lump men and women together in a single model of human height. Let’s say that one of the goals of the modeling task is to be able to generate predictions about the height of a randomly chosen human, regardless of whether they are male, female, or otherwise. You could construct a model for males, as you did previously, and a corresponding one for females. But if you’re generating a prediction for the height of a randomly chosen human, you wouldn’t know definitively which model is more applicable, so you should include a random variable representing the sex of the individual. It could be a binary variable that you assume gets chosen first, before you make a prediction about the height from the normally distributed height model appropriate for that sex. This model might be described by the equations

s_i ~ B( 1, p )

h_i ~ N( h_p(s_i), σ²(s_i) )

where s_i is the sex of the individual, which according to the first statement is chosen randomly from a Bernoulli distribution (a common distribution with two possible outcomes), where the probability of choosing female is assumed to be p (for two sexes, the probability of male is assumed to be 1 − p). Given the individual’s sex s_i, the term h_p(s_i) is intended to represent the mean height of the population of people matching the sex s_i, and σ²(s_i) is the variance for people matching the sex s_i. The second statement, in summary, describes how the predicted height is generated from a normal distribution with parameters that are determined by an individual’s sex, which was randomly chosen from the Bernoulli distribution.

Drawing conclusions from models involving uncertainty

Assuming that you’ve found some good estimates for the parameters in this model, you can predict heights of randomly chosen individuals. Such prediction is useful in the analysis of small sample sizes; if, for example, you randomly chose a group of 10 people and found that they averaged 161 cm in height, you might want to know whether your sample is representative of the whole population. By generating height predictions from your model, in groups of 10, you can see how often you’d get such a small average height. If that turns out to be a rare occurrence, it would be evidence that your sample isn’t representative of the whole population, and you might want to take action to improve your sample in some way.
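
Here’s a minimal simulation of that two-stage model and that representativeness check. All of the parameter values (the proportion of females, the mean heights, and the standard deviations) are invented for illustration, not estimated from real data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameters for the two-stage model (not real anthropometric estimates).
P_FEMALE = 0.5
MEAN_HEIGHT = {0: 179.0, 1: 166.0}   # cm; 0 = male, 1 = female
SD_HEIGHT = {0: 7.0, 1: 6.5}         # cm

def sample_heights(n: int) -> np.ndarray:
    """Draw n heights: choose sex from a Bernoulli distribution, then height given sex."""
    sex = rng.binomial(1, P_FEMALE, size=n)                      # s_i ~ B(1, p)
    means = np.where(sex == 1, MEAN_HEIGHT[1], MEAN_HEIGHT[0])   # h_p(s_i)
    sds = np.where(sex == 1, SD_HEIGHT[1], SD_HEIGHT[0])         # sigma(s_i)
    return rng.normal(means, sds)                                # h_i ~ N(h_p(s_i), sigma^2(s_i))

# How often does a randomly chosen group of 10 people average 161 cm or less?
n_trials = 100_000
sample_means = sample_heights(10 * n_trials).reshape(n_trials, 10).mean(axis=1)
print(f"P(average height of 10 people <= 161 cm) is about {(sample_means <= 161).mean():.5f}")
# A tiny probability here suggests that a real group averaging 161 cm is probably
# not a representative sample of the modeled population.
```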

Random variables can be helpful in statistical modeling for a number of reasons, not the least of which is that many real-life systems contain randomness. In describing such a system using a model, it’s important never to confuse the expectations of the model with the distributions that the model relies on. For instance, even though the model of human male height expects individuals to be 179.0 cm tall, it doesn’t mean that every human male is 179.0 cm. This may seem obvious, but I’ve seen many academic papers confuse the two and take a statistical shortcut because it would be convenient to assume that everyone is of average height. Sometimes it may not matter much, but sometimes it might, and it pays to figure out which situation you’re in. If you’re an architect or a builder, you certainly wouldn’t want to build doorways that are 180 cm tall; probably 40% or more of the population would have to duck their head to walk through, even though the average man wouldn’t. If you’re going to make important decisions based on your project’s conclusions, it’s often best to admit uncertainty at various stages, including in the statistical model.

I hope that this discussion of random variables, variance, and error terms has been illustrative of how uncertainty—which is so pervasive in all of data science—also works its way into statistical models. That might be an understatement; in fact, I consider reducing uncertainty to be the primary job of statistics. But sometimes in order to reduce uncertainty in the way you want, you have to admit that uncertainty exists within the various pieces of the model. Treating uncertainties—randomness, variance, or error—as certainties can lead you to overly confident results or even false conclusions. Both of these are uncertainties themselves, but the bad kind—the kind you can’t explain in a rigorous and useful manner. For that reason, I tend to treat every quantity as a random variable at first, only replacing it with a certain, fixed value after I’ve managed to convince myself rigorously that it’s appropriate to do so.

7.4.4. Fitting a model

So far, I’ve discussed models in a mostly abstract sense, without saying much about the relationship between the model and the data. This was intentional, because I believe that it’s beneficial to think about the system I intend to model and decide how I think the model should work before I try to apply it to data. Fitting a model to a data set is the process of taking the model that you’ve designed and finding the parameter values that describe the data the best. The phrase “fit a model” is synonymous with estimating values for a model’s parameters.

Model fitting is optimization, among all possible combinations of parameter values, of a goodness-of-fit function. Goodness of fit can be defined in many ways. If your model is intended to be predictive, then its predictions should be close to the eventual outcome, so you could define a closeness-of-prediction function. If the model is supposed to represent a population, as in the model of human height discussed earlier in this chapter, then you might want random samples from the model to look similar to the population you’re modeling. There can be many ways to imagine your model being close to representing the data you have.

Because there are many possible ways to define goodness of fit, deciding which one is best for your purposes can be confusing. But a few common functions are suitable for a large number of applications. One of the most common is called the likelihood, and in fact this type of function is so common and well studied that I recommend using it as your goodness-of-fit function unless you have a compelling reason not to do so. One such compelling reason is that likelihood functions are applicable only to models that are specified by probability distributions, so if you have a model that isn’t based on probability, you can’t use the likelihood. In that case, it would be best to check some statistical literature on model fitting for a more appropriate goodness-of-fit function for your model.

The likelihood function

The word likelihood is used commonly in the English language, but it has a special meaning in statistics. It’s much like probability but in reverse, in a way.

When you have a model with known parameter values, you can choose a possible outcome arbitrarily and calculate the probability (or probability density) of that outcome. That’s evaluating the probability density function. If you have data and a model but you don’t know the parameter values for the model, you can sort of do the same thing in reverse: use the same probability function and calculate the joint probability (or probability density) for all points in the data set but do so over a range of model parameter values. The input to a likelihood function is a set of parameter values, and the output is a single number, the likelihood, which could also be called (somewhat improperly) the joint probability of the data. As you move the input parameter values around, you get different values for the likelihood of the data.

Probability is a function of outcomes that’s based on known parameters, and likelihood is a function of parameter values that’s based on known outcomes in a data set.
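
As a minimal sketch, assuming the heights follow a normal distribution with a hypothetical fixed standard deviation of 7 cm, here’s how you might evaluate the (log-)likelihood of a small, made-up data set over a few candidate values of the mean:

import numpy as np
from scipy.stats import norm

heights = np.array([172.0, 181.5, 175.2, 190.1, 168.4])   # hypothetical data

def log_likelihood(mu, sigma=7.0):
    # joint log-probability of all data points, given the parameter values
    return norm.logpdf(heights, loc=mu, scale=sigma).sum()

for mu in [170.0, 175.0, 180.0, 185.0]:
    print(mu, log_likelihood(mu))    # higher values mean the parameters fit the data better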

Maximum likelihood estimation

The maximum likelihood solution for a model with respect to a data set is exactly what it sounds like: the model parameter set that produces the highest value from the likelihood function, given the data. The task of maximum likelihood estimation (MLE) is to find that optimal parameter set.

For linear models with normally distributed error terms, MLE has a quick and easy mathematical solution. But that’s not the case for all models. Optimization is notoriously hard for large and complex parameter spaces. MLE and other methods that depend on optimization are searching for the highest point along a complex, multidimensional surface. I always picture it as a mountain-climbing expedition to find the highest peak in a large area that no one has ever explored. If no one has been to the area, and no aerial photos are available, then it’s difficult to find the highest peak. From the ground, you probably can see the closest peaks but not much farther, and if you head to the one that looks the tallest, you’ll usually see one that looks taller from where you are. Even worse, optimization is usually more akin to no-visibility mountain climbing. Along the mathematical surface that you’re trying to optimize, there’s usually no way to see beyond the immediate surroundings. Usually you know how high you are and maybe which direction the ground is sloping, but you can’t see far enough to find a higher point.
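
As a sketch of numerical MLE under the same hypothetical normal model, you could hand a negative log-likelihood to a general-purpose optimizer and let it search the parameter space; this is an illustration of the idea, not a recipe for any particular model:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

heights = np.array([172.0, 181.5, 175.2, 190.1, 168.4])   # hypothetical data

def neg_log_likelihood(params):
    mu, log_sigma = params                     # optimize log(sigma) so sigma stays positive
    return -norm.logpdf(heights, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=[170.0, np.log(5.0)], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)    # should land near the sample mean and standard deviation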

Numerous optimization algorithms can help the situation. The simplest strategy is always to walk uphill; that’s called a greedy algorithm and it doesn’t work well unless you can guarantee that there’s only one peak in the area. Other strategies incorporate some randomness and use some intelligent strategies that tentatively head in one direction before retracing their steps if it doesn’t turn out as well as they hoped.

In any case, MLE tries to find the highest peak in the space of all possible parameter values. It’s great if you know that the highest peak is what you want. But in some cases it might be better to find the highest plateau that has several very high peaks, or you might want to get a general idea of what the whole area looks like before you make a decision. You can use other model-fitting methods to accomplish that.

Maximum a posteriori estimation

Maximum likelihood estimation searches for the highest peak along the surface of all possible model parameter values. This might not be ideal for the purposes of your project. Sometimes you might be more interested in finding the highest collection of peaks than in finding the absolute highest one.

Take, for example, the Enron email data discussed at length in prior chapters and the project involving modeling behavior of the Enron employees based on social network analysis. Because social network analysis is based on a range of human behaviors that are, at best, fuzzy descriptions of tendencies of people to interact in certain ways with other people, I tend not to rely too much on any single behavior when making conclusions that are meaningful to the project. I would rather draw conclusions based on collections of behaviors, any of which could explain whatever phenomenon I’m looking at. Because of this, I’m also skeptical of quirky behavioral explanations that seem to be the best explanation of what has happened in the social network, when in fact if any aspect of the explanation wasn’t true, even a little bit, then the whole explanation and conclusion would fall apart. Finding a collection of pretty-good explanations would be better than finding one seemingly very good but potentially vulnerable explanation.

If you’re looking to find a collection of high peaks and not only the highest overall, then maximum a posteriori (MAP) methods can help. MAP methods are related to MLE methods, but by borrowing a concept from Bayesian statistics (discussed later in this chapter), specifically prior distributions on the variables of interest, MAP methods can help find a location in the model parameter space that’s surrounded by points that fit the data well, though perhaps not quite as well as the single highest peak. The choice between the two depends on your goals and assumptions.
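
A minimal sketch of how MAP differs from MLE, under the same hypothetical normal model: add the log of a prior density to the log-likelihood before optimizing. The Normal(175, 10) prior on the mean here is an assumption for illustration only:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

heights = np.array([172.0, 181.5, 175.2, 190.1, 168.4])   # hypothetical data

def neg_log_posterior(params):
    mu, log_sigma = params
    log_lik = norm.logpdf(heights, loc=mu, scale=np.exp(log_sigma)).sum()
    log_prior = norm.logpdf(mu, loc=175.0, scale=10.0)     # assumed prior belief about the mean
    return -(log_lik + log_prior)

result = minimize(neg_log_posterior, x0=[170.0, np.log(5.0)], method="Nelder-Mead")
print(result.x[0], np.exp(result.x[1]))    # the prior pulls the estimated mean slightly toward 175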

Expectation maximization and variational Bayes

Whereas both MLE and MAP methods result in point estimates of parameter values, both expectation maximization (EM) and variational Bayes (VB) methods find optimal distributions of those parameter values. I lean pretty heavily toward Bayesian statistics rather than frequentist statistics (discussed later in this chapter, in case you’re not familiar), and so methods like EM and VB appeal to me. Similar to how I like to treat every quantity as a random variable until I convince myself otherwise, I like to carry variance and uncertainty through all steps of modeling, including the estimated parameters and final results, if possible.

Both EM and VB are distribution centric in that they try to find the best probability distribution for each random variable in the model, with respect to describing the data. If MLE finds the highest peak, and MAP finds a point surrounded by many high areas, then EM and VB each find an area around which you can explore in every direction and always be in an area of fairly high altitude. In addition to that, EM and VB can tell you how far you can wander before you’re in a much lower area. In a sense, they’re the random variable versions of MAP—but they don’t come for free. They can be computationally intensive and difficult to formulate mathematically.

The specific difference between EM and VB lies mainly in the algorithm used to optimize the latent variable distributions in the model. When optimizing the distribution of one variable, EM makes more simplifications to the assumptions about the other variables in the model, so EM can sometimes be less complicated than VB in terms of both the mathematics involved and the computational needs. VB considers the full estimated distributions of all random variables at all times, taking no shortcuts in that realm, but it does make some of the other assumptions that EM also does, such as independence of most variables from each other.

Like MLE and MAP, EM and VB are focused on finding areas within the parameter space that have high likelihood. The main differences are in their sensitivity to changes. Whereas MLE might walk off a cliff, MAP probably won’t, but it can’t make many guarantees beyond a single location. EM understands the surrounding area, and in addition to that, VB pays a little more attention to how walking in one direction affects the landscape in other directions. That’s the hierarchy of some common parameter optimization methods in a nutshell.
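
To make EM a little more concrete, here’s a minimal sketch of the classic EM iteration for a two-component Gaussian mixture in one dimension. The data, starting values, and fixed component standard deviation are all hypothetical, and a real implementation would also update the variances and check for convergence:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# hypothetical data drawn from two overlapping groups
data = np.concatenate([rng.normal(165, 6, 300), rng.normal(182, 6, 300)])

mu = np.array([160.0, 190.0])    # initial guesses for the two component means
sigma = 6.0                      # component standard deviation, held fixed for simplicity
weight = np.array([0.5, 0.5])    # initial mixing proportions

for _ in range(50):
    # E-step: each component's responsibility for each data point
    dens = np.vstack([w * norm.pdf(data, m, sigma) for w, m in zip(weight, mu)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate means and mixing proportions from the responsibilities
    mu = (resp * data).sum(axis=1) / resp.sum(axis=1)
    weight = resp.mean(axis=1)

print(mu, weight)    # the means should settle near the two group centers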

Markov chain Monte Carlo

Whereas MLE, MAP, EM, and VB are all optimization methods that focus on finding a point or an area in the parameter space that explains the data well, Markov chain Monte Carlo (MCMC) methods are designed to explore and document the entire space of possible parameter values in a clever way, so that you have a topographical map of the whole space and can draw conclusions or explore further based on that map.

Without getting into too much detail—you can find a considerable amount of literature on its behaviors and properties—a single MCMC sampler begins at a certain point in the parameter space. It then chooses at random a direction and a distance in which to step. Typically, the step should be small enough that the sampler doesn’t step over important contours, such as an entire peak, and the step should be large enough that the sampler could theoretically traverse (usually within a few million steps, at the most) the whole region of the parameter space containing reasonable parameter values. MCMC samplers are clever in that they tend to step into areas of higher likelihood, but they don’t always do so. After selecting a tentative place to step, they make a random decision based on the likelihood at the current location and the likelihood at the new, tentative location. Because they’re clever, if a particular region of the parameter space has about twice the likelihood of another region, the MCMC sampler will tend to be located in that region about twice as often as in the other region, as it continues to traverse the space. A well-tuned MCMC sampler therefore finds itself in each region of the parameter space approximately as often as the likelihood function would predict. This means that the set of step locations (samples) is a good empirical representation of the optimal probability distributions of the model parameters.
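
Here’s a minimal sketch of a Metropolis sampler, one of the simplest MCMC algorithms, applied to the mean of the same hypothetical normal height model with the standard deviation held fixed. The step size is exactly the kind of tuning parameter discussed below:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
heights = np.array([172.0, 181.5, 175.2, 190.1, 168.4])   # hypothetical data

def log_posterior(mu, sigma=7.0):
    # flat prior on mu, so the log-posterior is just the log-likelihood
    return norm.logpdf(heights, loc=mu, scale=sigma).sum()

current = 170.0        # starting point in the parameter space
step_size = 2.0        # tuning parameter: how far a proposed step can go
samples = []

for _ in range(20_000):
    proposal = current + rng.normal(0, step_size)
    # accept the step with probability min(1, ratio of posterior densities)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(current):
        current = proposal
    samples.append(current)

samples = np.array(samples[2_000:])    # discard the early burn-in steps
print(samples.mean(), samples.std())   # empirical posterior mean and spread for mu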

To make sure that the set of samples does represent the distributions well, it’s usually best to start several MCMC samplers—ideally many of them—in different locations around the parameter space and then watch to see if they all, after some number of steps, tend to give the same picture of the landscape based on their step locations. If all the samplers tend to be mingling around the same areas repeatedly and in the same proportions, then the MCMC samplers are said to have converged. Some heuristic convergence diagnostics have been developed specifically for judging whether convergence has occurred in a meaningful sense.

On the one hand, MCMC usually requires less software development than the other methods, because all you need is the goodness-of-fit function and a statistical software package that has MCMC implemented (of which there are many). The other model-fitting algorithms I’ve mentioned often require manipulations of the goodness-of-fit function and various other model-specific optimization functions in order to find their solution, but not MCMC. MCMC can be off and running as long as you have a model specified mathematically and a data set to operate on.

On the downside, MCMC generally needs considerably more computational power than the others, because it explores the model parameter space randomly—albeit cleverly—and evaluates the altitude at every point at which it lands. It tends to stick around higher peaks but doesn’t stay there exclusively and commonly roves far enough to find another good peak. Another drawback of MCMC is that whether it is exploring the space cleverly or not is usually determined by a set of tuning parameters for the algorithm itself. If the tuning parameters aren’t set correctly, you can get poor results, so MCMC needs some babysitting. On a brighter note, some good evaluation heuristics have been implemented in common software packages that can quickly give you a good idea about whether your tuning parameters are set adequately.

In general, MCMC is a great technique for fitting models that don’t have an obviously better fitting method available. It should be able to fit almost any model, at the cost of increased computation time as well as the babysitting and checking of evaluation heuristics. To be fair, other model-fitting methods also require some babysitting and evaluation heuristics, but probably not quite as much as MCMC.

Over-fitting

Over-fitting is not a method of fitting a model, but it’s related; it’s something bad that can happen inadvertently while fitting a model. Over-fitting is the term most commonly used to refer to the idea that a model might seem to fit the data well, but if you get some new data that should be consistent with the old data, the model doesn’t fit that data well at all.

This is a common occurrence in modeling the stock market, for example. It seems that right when someone finds a pattern in stock prices, that pattern ceases. The stock market is a complex environment that produces a ton of data, so if you take a thousand or a million specific price patterns and check to see whether they fit the data, at least one of them will seem to fit. This is especially true if you tune the parameters of the pattern (for example, “This stock price usually goes up for four days and then down for two days”) to best explain past data. This is over-fitting. The pattern and the model might fit the data that you have, but they probably won’t fit the data that comes in next.

Certainly, the model should fit the data well, but that’s not the most important thing. The most important thing is that the model serves the purposes and goals of the project. To that end, I like to have a general idea of how a model should look and which aspects of it are indispensable to my project before I begin applying it to data. Using previous exploratory data assessment to inform model design is a good idea, but letting the data design your model for you is probably not.

Over-fitting can happen for a few reasons. If you have too many parameters in your model, then the model’s parameter values, after they’ve already explained the real phenomena in your data, will begin to explain spurious phenomena that don’t really exist, such as noise and peculiarities specific to your data set. Over-fitting can also happen when your data has some serious peculiarities that aren’t present in most of the data you could expect to receive in the future. If you’re modeling written language, then having a corpus full of Enron emails or children’s books will result in models that fit your data set well but don’t fit the entire body of the written English language well.

Two techniques that are valuable in checking for over-fitting of your model are train-test separation and cross-validation. In train-test separation, you train (fit) your model based on some of your data, the training data, and then you test your model on the rest of the data, the test data. If you’re over-fitting your model to the training data, it should be pretty obvious when you make predictions about your test data that you’re way off.

Likewise, cross-validation refers to the process of doing repeated train-test-separated evaluations based on different (often random) partitions of the data. If the predictions made based on training data match the outcomes on the test data in several replicates of cross-validation, you can be reasonably sure that your model will generalize to similar data. On the other hand, there can be many reasons why new data will be different from old data, and if you’ve cross-validated only on old data, you have no guarantee that new data will also fit the model. That’s the curse of the stock market, among other systems, and it can be circumvented only by careful application and testing of models and understanding of data.
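
As a minimal sketch of both ideas with scikit-learn, using a purely hypothetical numeric feature matrix X and outcome y:

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))                                 # hypothetical predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)     # hypothetical outcome

# train-test separation: fit on one part of the data, evaluate on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))            # performance on data the model never saw

# cross-validation: repeat the train-test evaluation across several partitions
print(cross_val_score(LinearRegression(), X, y, cv=5))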

7.4.5. Bayesian vs. frequentist statistics

Although both have existed since at least the eighteenth century, frequentist statistics were far more popular than Bayesian statistics for most of the twentieth century. Over the last few decades, there has been a debate—I’ll stop short of saying feud—over the merits of one versus the other. I don’t want to fan the flames of the debate, but the two schools of statistics are mentioned in conversation and in literature often enough that it’s helpful to have a decent idea of what they’re about.

The primary difference between the two is a theoretically interpretive one that does have an impact on how some statistical models work. In frequentist statistics, the concept of confidence in a result is a measure of how often you’d expect to get the same result if you repeated the experiment and analysis many times. A 95% confidence indicates that in 95% of the replicates of the experiment, you’d draw the same conclusion. The term frequentist stems from the notion that statistical conclusions are made based on the expected frequency, out of many repetitions, of a particular event happening.

Bayesian statistics holds more closely to the concept of probability. Results from Bayesian statistical inference, instead of having a frequentist confidence, are usually described using probability distributions. In addition, Bayesian probabilities can be described intuitively as a degree of belief that a random event is going to happen. This is in contrast with frequentist probability, which describes probability as a relative frequency of certain random events happening in an infinite series of such events.

To be honest, for many statistical tasks it doesn’t make a difference whether you use a frequentist or Bayesian approach. Common linear regression is one of them. Both approaches give the same result if you apply them in the most common way. But there are some differences between the two approaches that result in some practical differences, and I’ll discuss those here.

Disclaimer: I’m primarily a Bayesian, but I’m not so one-sided as to say that frequentist approaches are bad or inferior. Mainly I feel that the most important factor in deciding on an approach is understanding what assumptions each of them carries implicitly. As long as you understand the assumptions and feel they’re suitable, either approach can be useful.

Prior distributions

Bayesian statistics and inference require that you hold a prior belief about the values of the model parameters. This prior belief should technically be formulated before you begin analyzing your main data set. (Basing your prior belief on that same data is a technique called empirical Bayes, which can be useful but is frowned on in some circles.)

A prior belief can be as simple as “I think this parameter is pretty close to zero, give or take one or two,” which can be translated formally into a normal distribution or another appropriate distribution. In most cases, it’s possible to create non-informative (or flat) priors, which are designed to tell your statistical model “I don’t know,” in a rigorous sense. In any case, a prior belief must be codified into a probability distribution that becomes part of the statistical model. In the microarray protocol comparison example from earlier in this chapter, the hyper-parameters that I described are the parameters of the prior distributions for some of the model parameters.

Some frequentist statisticians take exception to the necessity of formulating such a prior distribution. Apparently they think that you shouldn’t have to formulate a prior belief if you know absolutely nothing about the model’s parameter values prior to seeing the data. I’m tempted to agree with them, but the existence in most cases of non-informative prior distributions allows Bayesians to sidestep the requirement for a prior distribution by making it irrelevant. In addition, the frequentist statistics concept of having no prior belief, if you attempted to formalize it, would look a lot like a non-informative prior in Bayesian statistics. You might conclude that frequentist methods often have an implied prior distribution that isn’t denoted explicitly. With this, I don’t mean to say that frequentists are wrong and that Bayesian methods are better; instead, I intend to illustrate how the two approaches can be quite similar and to debunk the notion that the requirement of having a prior belief is somehow a disadvantage.

Updating with new data

I’ve explained how the existence of a prior distribution in Bayesian statistics isn’t a disadvantage, because most of the time you can use a non-informative prior. Now I’ll explain how priors are not only not bad but also good.

One of the most commonly cited differences between frequentist and Bayesian statistics, along with “You have to have a prior,” is “You can update your models with new data without having to include the old data as well.” The way to accomplish this is quite simple in a Bayesian framework.

Let’s assume that a while back you had a statistical model, and you received your first batch of data. You did a Bayesian statistical analysis and fit your model using non-informative priors. The result of fitting a Bayesian model is a set of parameter distributions called posterior distributions because they’re formed after the data has been incorporated into the model. Prior distributions represent what you believe before you let the model see the data, and posterior distributions are the new beliefs based on your prior beliefs, plus the data that the model saw.

Now you’re getting more data. Instead of digging up the old data and refitting the model to all the data at once, using the old non-informative priors you can take the posterior distributions based on the first set of data and use those as your prior distributions for fitting the model to the second set of data. If the size of your data sets or computational power is a concern, then this technique of Bayesian updating can save considerable time and effort.
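
Here’s a minimal sketch of that updating pattern using a Beta-Binomial model, a conveniently conjugate example of my own choosing rather than one of the models from this chapter; the posterior from the first batch of data becomes the prior for the second:

# Beta-Binomial updating: prior Beta(a, b); each batch adds successes and failures
a, b = 1.0, 1.0          # non-informative (flat) prior

# first (hypothetical) batch: 30 successes out of 50 trials
a, b = a + 30, b + 20    # posterior after batch 1

# a second batch arrives later: 45 successes out of 60 trials;
# the batch-1 posterior serves as the prior, and the old data isn't needed again
a, b = a + 45, b + 15    # posterior after batch 2

print(a / (a + b))       # posterior mean estimate of the success probability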

Today, with so many real-time analytics services under development, Bayesian updating provides a way to analyze large quantities of data on the fly, without having to go back and reexamine all the past data every time you want a new set of results.

Propagating uncertainty

Of all the differences between frequentist and Bayesian statistics, I like this one the most, though I haven’t heard it mentioned that often. In short, because Bayesian statistics holds close the notion of probability—it begins with a prior probability distribution and ends with a posterior probability distribution—it allows uncertainty to propagate through quantities in the model, from old data sets into new ones and from data sets all the way into conclusions.

I’ve mentioned several times in this book that I’m a big fan of admitting when uncertainty exists and keeping track of it. By promoting probability distributions to first-class citizens, as Bayesian statistics does, each piece of the model can carry its own uncertainty with it, and if you continue to use it properly, you won’t find yourself being overconfident in the results and therefore drawing false conclusions.

My favorite of the few academic papers that I published in the field of bioinformatics emphasizes this exact concept. The main finding of that paper, called “Improved Inference of Gene Regulatory Networks through Integrated Bayesian Clustering and Dynamic Modeling of Time-Course Expression Data” (PloS ONE, 2013)—the title rolls off the tongue, doesn’t it?—showed how high technical variances in gene expression measurements can be propagated from the data, through the Bayesian model, and into the results, giving a more accurate characterization of which genes interact with which others. Most prior work on the same topic completely ignored the technical variances and assumed that each gene’s expression level was merely the average of the values from the technical replicates. Frankly, I found this absurd, and so I set out to rectify it. I may not quite have succeeded in that goal, as implied by the paper having so few citations, but I think it’s a perfect, real-life example of how admitting and propagating uncertainty in statistical analysis leads to better results. Also, I named the algorithm I presented in the paper BAyesian Clustering Over Networks, also known as BACON, so I have that going for me.

7.4.6. Drawing conclusions from models

With all this talk about designing models, building models, and fitting models, I feel I’ve almost lost track of the real purpose of statistical modeling: to learn about the system you’re studying.

A good statistical model contains all of the system’s variables and quantities that you’re interested in. If you’re interested in the worldwide average female height, that should be a variable in your model. If you’re interested in the gene-expression differences between male and female fruit flies, that should be in your model. If it’s important to you and your project to know about the responsiveness of Enron employees to incoming emails, you should have a variable that represents responsiveness in your model. Then the process of fitting the model, using whichever methods you choose, results in estimates for those variables. In the case of latent model parameters, fitting the model produces parameter estimates directly. In the case of predictive modeling, in which a prediction is a latent variable in the future (from the future?), the fitted model can generate an estimated value, or prediction.

Drawing conclusions based on your fitted model comes in many forms. First, you have to figure out what questions you want to ask of it. Consult the list of questions you generated during the planning phase of your project, discussed in the early chapters of this book. Which ones can the model help answer? A well-designed model, for the purposes of your project, should be able to answer many if not all of the project’s questions regarding the system in question. If it can’t, you may have to create a new model that can. It would be a shame to have to rebuild a model, but it’s better than using a bad or unhelpful model. You can avoid this type of situation by maintaining awareness of the project’s goals at all times, specifically the aspects of the project that this statistical model was intended to address.

Let’s say that many of the questions from the project’s planning phase involve variables and quantities that are represented in your model. For each of these variables, the model-fitting process has produced value estimates or probability distributions. You can ask two main types of questions of these estimates:

  • What is the value of the variable, approximately?
  • Is this variable greater/less than X?

I’ll cover the techniques to address each of these questions in their own subsections.

What’s the value? Estimates, standard error, and confidence intervals

All the model-fitting methods I described earlier produce best guesses—called point estimates—of variables and parameters in a model. Most but not all of them also give some measure of uncertainty of that value. Depending on which specific algorithm you’re using, MLE may not produce such a measure of uncertainty automatically, so if you need it, you may have to find an algorithm that produces one. All the other model-fitting methods give the uncertainty as an inherent output of the model-fitting process.

If all you need is a point estimate, then you’re good to go. But if, as is usually the case, you want some sort of guarantee that the value is approximately what you think it is, then you’ll need either a probability distribution or a standard error. A standard error for a parameter estimate is the frequentist equivalent of a standard deviation (square root of variance) of a probability distribution. In short, under the usual normal approximation, you can be about 95% confident that a parameter is within two standard errors of its point estimate, or about 99.7% confident that it’s within three standard errors. These are confidence intervals, and if the standard error is relatively small, then the confidence intervals will be narrow, and you can be reasonably (whichever percentage you choose) sure that the true value falls within the interval.

The Bayesian equivalents of standard error and confidence intervals are variance and credible intervals, respectively. They work almost exactly the same but, as usual, differ on philosophical grounds. Given that a Bayesian parameter estimate is a probability distribution, you can naturally extract the variance and create credible intervals such that for a normal distribution, the probability is 95% that the true value is within two standard deviations of the mean, and 99.7% that it’s within three standard deviations.
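
As a minimal sketch, assuming a hypothetical point estimate and standard error (or a posterior mean and standard deviation) are already in hand, the interval arithmetic is the same in both camps; only the interpretation differs:

from scipy.stats import norm

estimate, std_err = 179.0, 0.8    # hypothetical point estimate and standard error

z = norm.ppf(0.975)               # about 1.96 for a 95% interval
lower, upper = estimate - z * std_err, estimate + z * std_err
print(f"95% interval: ({lower:.2f}, {upper:.2f})")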

Sometimes reporting a value or an interval addresses one of the goals of your project. But other times you might want to know a little more—for example, whether the variable possesses a specific property, such as being greater or less than a certain amount. For this you need hypothesis testing.

Is this variable _______? Hypothesis testing

Often you need to know more about a variable than merely a point estimate or a range of values that probably contain the true value. Sometimes it’s important to know whether a variable possesses a certain property or not, such as these:

  • Is the variable X greater than 10?
  • Is the variable X less than 5?
  • Is the variable X non-zero?
  • Is the variable X substantially different from another variable Y?

Each of these questions can be answered by a hypothesis test. A hypothesis test consists of a null hypothesis, an alternative hypothesis, and then a statistical test that fits the two hypotheses.

A null hypothesis is kind of the status quo. If this hypothesis were true, it would be kind of boring. An alternative hypothesis is a hypothesis that, if true, would be exciting. For instance, let’s say you think that the variable X is greater than 10, and if this were true, it would have cool implications for the project (maybe X is the number of song downloads, in millions, that are expected to happen next week). “X is greater than 10” is a good alternative hypothesis. The null hypothesis would be the inverse: “X is not greater than 10.” Now, because you want the alternative hypothesis to be true, you need to show beyond a reasonable doubt that the null hypothesis is not true. In statistics, you generally never prove that something is true so much as show that the other possibility, the null hypothesis, is almost certainly not true. It’s a subtle distinction, linguistically, that has a fairly large impact mathematically.

To test the null and alternative hypotheses in the example and to reject the null, as they say, you need to show that the value of X almost certainly wouldn’t venture below 10. Let’s say that the posterior probability distribution for X based on your model is normally distributed with a mean of 16 and a standard deviation of 1.5. It’s important to note that this is a one-sided hypothesis test, because you care only if X is too low (below 10) and not if it’s too high. Choosing the correct version of the test (one-sided or two-sided) can make a difference in the significance of your results.

You need to check whether 10 is beyond a reasonable threshold of how low X might be. Let’s choose a confidence level of 99% (a significance level of 0.01) because you want to be sure that X isn’t below 10. Consulting a reference for a one-sided test, the 99% threshold is 2.33 standard deviations. You’ll notice that the estimate of 16 is 4.0 standard deviations above 10, meaning that you can be almost certain, and definitely more than 99% confident, that the value of X is above 10.
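
A minimal sketch of that arithmetic, using the numbers from this example:

from scipy.stats import norm

mean, sd = 16.0, 1.5
threshold = 10.0

z = (mean - threshold) / sd     # 4.0 standard deviations above the threshold
cutoff = norm.ppf(0.99)         # about 2.33 for a one-sided test at the 99% level
print(z, cutoff, z > cutoff)    # True: reject the null hypothesis

# equivalently, the probability that X is actually at or below 10
print(norm.cdf(threshold, loc=mean, scale=sd))   # about 3e-5, far below 1%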

I could hardly talk about hypothesis testing without at least mentioning p-values. You can find much more thorough explanations elsewhere, but a p-value is the complement of confidence (1 minus confidence; 99% confidence corresponds to p < 0.01) and represents the frequentist concept of how probable a result at least as extreme as yours would be if the null hypothesis were true. It’s important not to treat a p-value in frequentist statistics like a probability in Bayesian statistics; in particular, a p-value is not the probability that your conclusion is wrong. P-values should be used for thresholding of hypothesis test results and nothing else.

Another concept and potential pitfall arises when you run many hypothesis tests on the same data or the same model. Running a few might be OK, but if you run hundreds or more hypothesis tests, then you’re bound to find at least one test that passes. For example, if you do 100 hypothesis tests, all at the 99% confidence level, you’d still expect about one test to pass (a null hypothesis getting rejected) by chance alone, even if none of them should. If you must perform many hypothesis tests, it’s best to do multiple testing correction, in which the significance of the results is adjusted to compensate for the fact that some true null hypotheses will be rejected by random chance otherwise. There are a few different methods for multiple testing correction, and the differences between them are too nuanced to discuss in detail here, so I’ll skip them. Yet again, consult a good statistical reference if you’re interested!
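
As a minimal sketch of one common correction (Bonferroni, here via statsmodels) applied to a hypothetical list of p-values:

import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.012, 0.04, 0.3, 0.9])   # hypothetical test results

# Bonferroni correction: effectively multiplies each p-value by the number of tests
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.01, method="bonferroni")
print(reject)         # which null hypotheses can still be rejected after correction
print(p_corrected)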

7.5. Miscellaneous statistical methods

Statistical modeling is an explicit attempt to describe a system using mathematical and statistical concepts, with the aim of understanding how a system works inside and out. It’s a holistic process, and I feel that understanding a system holistically—along with the project’s data-to-goals process—is important, regardless of whether you end up implementing a statistical model in the strict sense. Many other statistical techniques fall at least partially outside my definition of a statistical model, and these can help inform your understanding of the system and possibly even be used in place of a formal model.

I’ve discussed descriptive statistics, inference, and other techniques that might be called atomic statistics—by that, I mean they form some of the core concepts and building blocks of statistics. If you move up the ladder of complexity, you can find statistical methods and algorithms that can’t be said to be atomic—they have too many moving pieces—but they’re so popular and often so useful that they should be mentioned in any overview of statistical analysis techniques. In the following subsections I’ll give brief descriptions of a few such higher-complexity techniques, when they might be used effectively, and what to watch out for when using them.

7.5.1. Clustering

Sometimes you have a bunch of data points, and you know some patterns are in there, but you’re not sure exactly where. In that case, you can group the data points into clusters of generally similar data points to get the broad strokes of what’s going on in the data. In that way, clustering could make a good technique for descriptive statistics, if it didn’t usually have so many moving pieces that are hard to dissect and diagnose.

Clustering can also be an integral part of a statistical model. I used clustering as a major aspect of the model of gene interactions—which I named BACON—mentioned earlier. The C in BACON stands for clustering. The full name is BAyesian Clustering Over Networks. In that model, I assumed that the expressions of some genes moved together in unison because they were involved in some of the same high-level processes. Much scientific literature supports this concept. I didn’t specify beforehand which genes’ expression moved together (literature is not often conclusive on this), but instead I incorporated a clustering algorithm into my model that allowed genes with similar expression movement to come together on their own, as the model was fit. Given that there were thousands of genes, clustering served to reduce the number of moving parts (called dimensionality reduction), which can be a goal unto itself, but in this particular case, academic literature provided some justification for the practice, and more importantly I found that the clustered model made better predictions than the unclustered model.

How it works

There are many different clustering algorithms—k-means, Gaussian mixture models, hierarchical—but all of them traverse the space of data points (all continuous numeric values) and group data points that are close to each other in some sense into the same cluster.

Both k-means and Gaussian mixture models are centroid-based clustering algorithms, meaning that each cluster has something like a center that generally represents the members of the cluster. In each of these algorithms, roughly speaking, a data point belongs to the cluster whose centroid is closest to it. Clusters can be fuzzy or probabilistic, meaning that a data point can partially belong to one cluster and partially to others. Usually you have to define a fixed number of clusters before running the algorithm, but there are alternatives to this.

Hierarchical clustering is a bit different. It focuses on individual data points and their proximity to one another. To put it simplistically, hierarchical clustering looks for the two data points that are closest together and joins them into a cluster. Then it looks for the two data points (including the ones in the new cluster) that are the closest two yet-unjoined points, and it joins those two together. The process continues until all data points are joined together into a single mega-cluster. The result is a tree (most statistical software packages will gladly draw one of these for you) organized with close data points close to each other along the structure of the tree’s branches. This is helpful for seeing which data points are close to which other data points. If you want multiple clusters instead of one big (tree) cluster, you can cut the tree at a depth that gives the number of (branch) clusters that you want—cutting the tree in this sense means separating the trunk and the largest limbs, at a certain height, from the smaller branches, each of which would remain a unified cluster.
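
Here’s a minimal sketch of both flavors, using scikit-learn and SciPy on a small hypothetical two-dimensional data set:

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# hypothetical data: three loose groups of points in two dimensions
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in [(0, 0), (4, 0), (2, 3)]])

# centroid-based clustering: ask for three clusters and get a label for each point
labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels_km[:10])

# hierarchical clustering: build the tree, then cut it into three branch clusters
tree = linkage(X, method="ward")
labels_hc = fcluster(tree, t=3, criterion="maxclust")
print(labels_hc[:10])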

When to use it

If you want to put your data points (or other variables) into groups for any reason, consider clustering. If you want to be able to describe the properties of each of the groups and what a typical cluster member looks like, try a centroid-based algorithm like k-means or a Gaussian mixture model. If you’d like to get an idea of which of your data points is closest to which other ones, and you don’t care that much how close that is—if closer is more important than close in an absolute sense—then hierarchical clustering might be a good choice.

What to watch out for

With clustering algorithms, there are usually many parameters to tweak. You usually have no guarantees that all designated members of each cluster are represented well by each cluster or that significant clusters even exist. If dimensions (aspects, fields) of your data points are highly correlated, that can be problematic. To help with all this, most software tools have many diagnostic tools for checking how well the clustering algorithm performed. Use them, and clustering in general, with great care.

7.5.2. Component analysis

It can be difficult to make sense of data that has many dimensions. Clustering puts data points together into similarity groups in order to reduce the number of entities under scrutiny or analysis. Methods of component analysis—of which principal component analysis (PCA) and independent component analysis (ICA) are the most popular—do something similar, but they group together the dimensions of the data instead of data points, and they rank the groupings in order of how much of the data’s variance they explain. In a sense, component analysis reduces the dimensionality of the data directly, and by ranking and evaluating the new dimensions that are built out of the old ones, you may be able to explain each data point in terms of only a few of those dimensions.

For example, let’s say you’re analyzing gasoline usage during car trips. In each data point, the fields include distance traveled, duration of the trip, type of car, age of the car, and a few others, and also the amount of gasoline used during the trip. If you were trying to explain gasoline usage using all of the other variables, there’s a good chance that both the distance traveled and the trip duration could help explain how much gasoline was used. Neither is a perfect predictor, but both can contribute substantially, and in fact they’re highly correlated with each other. You’d have to be careful building a model that predicts gasoline usage with both of these variables in the face of high correlation; many models would confuse such highly correlated variables with one another. Component analysis manipulates dimensions—mixing them, combining them, and ranking them—to minimize correlations, loosely speaking. A model that was predicting gasoline usage based on dimensions generated by component analysis usually wouldn’t confuse any of the dimensions.

The notion that distance traveled and trip duration are highly correlated is probably obvious, but imagine that you’re working with a less-familiar system that you’re studying, and you have dozens, hundreds, or even thousands of data dimensions. You know that some of the dimensions are probably correlated, but you don’t know exactly which ones, and you also know that it would be generally beneficial to reduce the total number of dimensions in a clever way. That’s what component analysis is good at.

How it works

Component analysis generally examines the data set as a sort of data cloud in many dimensions and then finds the component or angle along which the length of the data cloud is the longest. By component or angle, I mean a combination of dimensions, so that the dimension chosen might be diagonal, in some sense, when compared to the original dimensions. After the first component is chosen, that dimension is collapsed or disregarded in a clever way, and a second component is then selected, with the goal of finding the longest or widest component that has nothing in common with the first component. The process continues, finding as many components, in order of importance, as you’d like.
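
A minimal sketch with scikit-learn, on made-up car-trip data of the kind described above; note the scaling step, which matters for the reasons discussed in the upcoming What to watch out for subsection:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
distance = rng.uniform(1, 100, size=200)                   # hypothetical trip distances (km)
duration = distance / 60 + rng.normal(0, 0.2, size=200)    # correlated trip durations (hours)
car_age = rng.uniform(0, 15, size=200)                     # roughly independent of the others
X = np.column_stack([distance, duration, car_age])

# put all dimensions on a comparable scale before extracting components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)    # how much variance each new dimension explains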

When to use it

If you have numerous dimensions in your data and you want fewer, component analysis is probably the best way to reduce the number of dimensions, and in addition to that, the resulting dimensions usually have some nice properties. But if you need the dimensions to be interpretable, you’ll have to be careful.

What to watch out for

PCA, the most popular type of component analysis, is sensitive to the relative scale of the values along various dimensions of the data set. If you rescale a particular field in the data—say, you switch from kilometers to miles—then that will have a significant effect on the components that are generated by PCA. It’s not only that rescaling can be a problem, but that the original (or any) scales of the variables can be problematic as well. Each of the dimensions should be scaled such that the same size change in any of them would be, in some sense, equally notable. It’s probably best to consult a good reference before trying any stunts with component analysis.

7.5.3. Machine learning and black box methods

In the world of analytic software development, machine learning is all the rage these days. Not that it hasn’t been popular for a long time now, but in the last few years I’ve seen the first few products come to market that claim to “bring machine learning to the masses,” or something like that. It sounds great on some level, but on another level it sounds like they’re asking for trouble. I don’t think most people know how machine learning works or how to notice if it has gone wrong. If you’re new to machine learning, I’d like to emphasize that machine learning, in most of its forms, is a tricky tool that shouldn’t be considered a magic solution to anything. There’s a reason why it takes years or decades of academic research to develop a completely new machine learning technique, and it’s the same reason why most people wouldn’t yet understand how to operate it: machine learning is extremely complex.

The term machine learning is used in many contexts and has a somewhat fluid meaning. Some people use it to refer to any statistical methods that can draw conclusions from data, but that’s not the meaning I use. I use the term machine learning to refer to the classes of somewhat abstract algorithms that can make conclusions from data but whose models—if you want to call them that—are difficult to dissect and understand. In that sense, only the machine can understand its own model, in a way. Sure, with most machine learning methods, you can dig into the innards of the machine’s generated model and learn about which variables are most important and how they relate to each other, but in that way the machine’s model begins to feel like a data set unto itself—without reasonably sophisticated statistical analysis, it’s tough to get a handle on how the machine’s model even works. That’s why many machine learning tools are called black box methods.

There’s nothing wrong with having a black box that takes in data and produces correct answers. But it can be challenging to produce such a box and confirm that its answers continue to be correct, and it’s nearly impossible to look inside the box after you’ve finished and debug it. Machine learning is great, but probably more than any other class of statistical methods, it requires great care to use successfully.

I’ll stop short of giving lengthy explanations of machine learning concepts, because countless good references are available both on the internet and in print. I will, however, give some brief explanations of some of the key concepts to put them in context.

Feature extraction is a process by which you convert your data points into more informative versions of themselves. To get the best results, it’s crucial to extract good features every time you do machine learning—except, maybe, when doing deep learning. Each data point should be showing its best side(s), meaning its most informative features, to a machine learning algorithm if it’s to be classified or predicted correctly in the future. For example, in credit card fraud detection, one possible feature to add to a credit card transaction is the amount by which the transaction is above the normal transaction amount for the card; alternatively, the feature could be the percentile of the transaction size compared to all recent transactions. Likewise, good features are those that common sense would tell you might be informative in differentiating good from bad or any two classes from one another. There are also many valuable features that don’t make common sense, but you always have to be careful in determining whether these are truly valuable or if they’re artifacts of the training data set.
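
As a minimal sketch of the credit card example, with entirely hypothetical transaction amounts:

import numpy as np

# hypothetical recent transaction amounts for one card, plus a new transaction to score
recent_amounts = np.array([12.50, 8.99, 45.00, 23.10, 9.75, 31.40, 15.00])
new_amount = 250.00

# feature 1: how far the new transaction is above the card's typical amount
amount_above_normal = new_amount - recent_amounts.mean()

# feature 2: the percentile of the new transaction relative to recent history
percentile = (recent_amounts < new_amount).mean() * 100

print(amount_above_normal, percentile)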

Here are a few of the most popular machine learning algorithms that you would apply to the feature values you extracted from your data points:

  • Random forest— This is a funny name for a useful method. A decision tree is a series of yes/no questions that ends in a decision. A random forest is a collection of randomly generated decision trees that favors trees and branches that correctly classify data points. This is my go-to machine learning method when I know I want machine learning but I don’t have a good reason to choose a different one. It’s versatile and not too difficult to diagnose problems.
  • Support vector machine (SVM)— This was quite popular a few years ago, and now it has settled into the niches where it’s particularly useful as the next machine learning fads pass through. SVMs are designed to classify data points into one of two classes. They manipulate the data space, turning it and warping it in order to drive a wedge between two sets of data points that are known to belong to the two different classifications. SVMs focus on the boundary between the two classes, so if you have two classes of data points, with each class tending to stick together in the data space, and you’re looking for a method to divide the two classes with maximal separation (if possible), then an SVM is for you.
  • Boosting— This is a tricky one to explain, and my limited experience doesn’t provide all the insights I probably need. But I know that boosting was a big step forward in machine learning of certain types. If you have a bunch of so-so machine learning models (weak learners), boosting might be able to combine them intelligently to result in a good machine learning model (a strong learner). Because boosting combines the outputs of other machine learning methods, it’s often called a meta-algorithm. It’s not for the faint of heart.
  • Neural network— The heyday for the neural network seemed to be the last decades of the twentieth century, until the advent of deep learning. In their earlier popular incarnation, artificial neural networks (the more formal name) were perhaps the blackest of black boxes. They seemed to be designed not to be understood. But they worked well, in some cases at least. Neural networks consist of layers upon layers of one-way valves (or neurons), each of which transforms the inputs in some arbitrary way. The neurons are connected to each other in a large network that leads from the input data to the output prediction or classification, and all of the computational work for fitting the model involves weighting and reweighting the connections between neurons in clever ways to optimize results.
  • Deep learning— This is a new development in this millennium. Loosely speaking, deep learning refers to the idea that you might not need to worry much about feature extraction because, with enough computational power, the algorithm might be able to find its own good features and then use them to learn. More specifically, deep learning techniques are layered machine learning methods that, on a low level, do the same types of learning that other methods do, but then, on a higher level, they generate abstractions that can be applied generally to recognize important patterns in many forms. Today, the most popular deep learning methods are based on neural networks, causing a sort of revival in the latter.
  • Artificial intelligence— I’m including this term because it’s often conflated with machine learning and rightly so. There’s no fundamental difference between machine learning and artificial intelligence, but with artificial intelligence comes the connotation that the machine is approaching the intellectual capabilities of a human. In some cases, computers have already surpassed humans in specific tasks—famously, chess or Jeopardy!, for instance—but they’re nowhere near the general intelligence of an average human on a wide variety of day-to-day tasks.

How it works

Each specific machine learning algorithm is different. Data goes in, answers come out; you have to do much work to confirm that you didn’t make any mistakes, that you didn’t over-fit, that the data was properly train-test separated, and that your predictions, classifications, or other conclusions are still valid when brand-new data comes in.
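
A minimal sketch of that workflow with scikit-learn’s random forest, on a purely hypothetical feature matrix X and labels y:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 8))                                            # hypothetical extracted features
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 500) > 1).astype(int)   # hypothetical labels

# keep a held-out test set so the machine's answers can be checked on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(model.score(X_train, y_train))   # training accuracy: usually optimistic
print(model.score(X_test, y_test))     # test accuracy: the number to trust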

When to use it

Machine learning, in general, can do things that no other statistical methods can do. I prefer to try a statistical model first in order to get an understanding of the system and its relationship to the data, but if the model falls short in terms of results, then I begin to think about ways to apply machine learning techniques without giving up too much of the awareness that I had with the statistical model. I wouldn’t say that machine learning is my last resort, but I do usually favor the intuitiveness and insight of a well-formed statistical model until I find it lacking. Head straight for machine learning if you know what you’re doing and you have a complex problem that’s nowhere near linear, quadratic, or any of the other common variable relationships in statistical models.

What to watch out for

Don’t trust the machine’s model or its results until you’ve verified them—completely independently of the machine learning implementation—with test-train separation as well as some completely new data that you and the machine have never seen before. Data snooping—the practice of looking at data before you formally analyze it and using what you see to bias how you analyze—can be a problem if you didn’t already do test-train separation before you snooped.

Exercises

Continuing with the Filthy Money Forecasting personal finance app scenario first described in chapter 2, and relating to previous chapters’ exercises, try these:

1.

Describe two different statistical models you might use to make forecasts of personal financial accounts. For each, give at least one potential weakness or disadvantage.

2.

Assume you’ve been successful in creating a classifier that can accurately put transactions into the categories regular, one time, and any other reasonable categories you’ve thought of. Describe a statistical model for forecasting that makes use of these classifications to improve accuracy.

Summary

  • It’s worthwhile to think about a project and a problem theoretically before you start into a software-building or full-analysis phase. There’s much to be learned, in data science and elsewhere, by stopping and thinking about the problem for a while.
  • Mathematics is a vocabulary and framework for describing how a system works.
  • Statistical modeling is a process of describing a system and connecting it to data.
  • A vast range of analytic methods and software that implements them is available; choosing from among them can be daunting, but it shouldn’t be overwhelming.
  • Machine learning and other complex statistical methods can be good tools for accomplishing the otherwise impossible, but only if you’re careful with them.