Now that we've fully covered introductory inferential statistics, we're going to shift our attention to one of the most exciting and practically useful topics in data analysis: predictive analytics. Throughout this chapter, we are going to introduce concepts and terminology from a closely related field called statistical learning or, as it's (somehow) more commonly referred to, machine learning.
Whereas in the last unit, we were using data to make inferences about the world, this unit is primarily about using data to make inferences (or predictions) about other data. On the surface, this might not sound more appealing, but consider the fruits of this area of study: if you've ever received a call from your credit card company asking to confirm a suspicious purchase that you, in fact, did not make, it's because sophisticated algorithms learned your purchasing behavior and were able to detect deviation from that pattern.
Since this is the first chapter leaving inferential statistics and delving into predictive analytics, it's only natural that we would start with a technique that is used for both ends: linear regression.
At the surface level, linear regression is a method that is used both to predict the values that continuous variables take on, and to make inferences about how certain variables are related to a continuous variable. These two procedures, prediction and inference, foundationally rely on the information from statistical models. Statistical models are idealized representations of a theory meant to illustrate and explain a process that generates data. A model is usually an equation, or series of equations, with some number of parameters.
Throughout this chapter, remember the quote generally attributed to George Box:
All models are wrong but some are useful.
A model airplane or car might not be the real thing, but it can help us learn and understand some pretty powerful properties of the object that is being modeled.
Although linear regression is, at a high level, conceptually quite simple, it is absolutely indispensable to modern applied statistics, and a thorough understanding of linear models will pay enormous dividends throughout your career as an analyst.
A small baking outfit in upstate New York called No Scone Unturned keeps careful records of the baked goods it produces. The left panel of Figure 8.1 is a scatterplot of diameters and circumferences (in centimeters) of No Scone Unturned's cookies, and depicts their relationship:
A straight line is the perfect thing to represent this data. After fitting a straight line to the data, we can make predictions about the circumferences of cookies that we haven't observed, like those with diameters of 11 or 0.7 centimeters (if you weren't playing truant in grade school, you'd know there's a consistent and predictable relationship between the diameter of a circle and the circle's circumference, namely π, but we'll ignore that for now).
You may have learned that the equation that describes a line in a Cartesian plane is:

y = mx + b

where b is the y-intercept (the place where the line intersects with the vertical line at x = 0), and m is the slope (describing the direction and steepness of the line). In linear regression, the equation describing Y as a function of X is written as:

Y = β₀ + β₁X

where β₀ (sometimes b₀) is the y-intercept, and β₁ (sometimes b₁) is the slope. Collectively, the βs are known as the beta coefficients.
The equation of the line that best describes this data is:

Circumference = 0 + π(Diameter)

making β₀ and β₁ 0 and π, respectively.
Knowing this, it is easy to predict the circumferences of cookies that we haven't measured yet. The circumference of a cookie with a diameter of 11 centimeters is 0 + 3.1415(11), or about 34.56 centimeters, and that of a cookie with a diameter of 0.7 centimeters is 0 + 3.1415(0.7), or about 2.2 centimeters.
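To make this concrete, here is a minimal sketch (not from the text itself) of fitting a straight line to some hypothetical cookie measurements and using it to make predictions. The data values are made up for illustration; since the true relationship is exactly circumference = π × diameter, the fit recovers a slope of π and an intercept of essentially 0.

```python
import numpy as np

# Hypothetical diameters (cm) and their circumferences (cm);
# the true relationship is circumference = pi * diameter
diameters = np.array([2.0, 3.5, 5.0, 6.5, 8.0, 9.5])
circumferences = np.pi * diameters

# Fit a degree-1 polynomial (a straight line); polyfit returns the
# coefficients from highest degree down: (slope, intercept)
slope, intercept = np.polyfit(diameters, circumferences, 1)
# slope recovers pi (about 3.1416); intercept is effectively 0

# Predict the circumferences of unobserved cookies
print(round(intercept + slope * 11, 2))   # → 34.56
print(round(intercept + slope * 0.7, 2))  # → 2.2
```

Real data, of course, rarely falls exactly on a line; later sections deal with the more typical case where the fitted line only approximates the observations.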
In predictive analytics' parlance, the variable that we are trying to predict is called the dependent (or, sometimes, target) variable, because its values are dependent on other variables. The variables that we use to predict the dependent variable are called independent (or, sometimes, predictor) variables.
Before moving on to a less silly example, it is important to understand the proper interpretation of the slope, β₁: it describes how much the dependent variable increases (or decreases) for each unit increase of the independent variable. In this case, for every centimeter increase in a cookie's diameter, the circumference increases by π centimeters. In contrast, a negative β₁ indicates that as the independent variable increases, the dependent variable decreases.
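This interpretation of the slope can be checked directly (a small illustration using the cookie line's coefficients, not code from the text): the prediction at any diameter d + 1 exceeds the prediction at d by exactly the slope.

```python
import math

# The cookie line's coefficients: intercept 0, slope pi
beta0, beta1 = 0.0, math.pi

def predict(diameter):
    """Predicted circumference (cm) for a cookie of the given diameter (cm)."""
    return beta0 + beta1 * diameter

# A one-unit (1 cm) increase in diameter raises the predicted
# circumference by exactly beta1, regardless of where we start
print(round(predict(6.0) - predict(5.0), 4))  # → 3.1416
```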