On to a substantially less trivial example, let's say No Scone Unturned has been keeping careful records of how many raisins (in grams) they have been using for their famous oatmeal raisin cookies. They want to construct a linear model describing the relationship between the area of a cookie (in centimeters squared) and how many raisins they use, on average.
In particular, they want to use linear regression to predict how many grams of raisins they will need for a 1-meter long oatmeal raisin cookie. Predicting a continuous variable (grams of raisins) from other variables sounds like a job for regression! In particular, when we use just a single predictor variable (the area of the cookies), the technique is called simple linear regression.
The left panel of Figure 8.2 illustrates the relationship between the area of the cookies and the amount of raisins used. It also shows the best-fit regression line:
Note that, in contrast to the last example, virtually none of the data points actually rest on the best-fit line—there are now errors. This is because there is a random component to how many raisins are used.
The right panel of Figure 8.2 draws dashed red lines between each data point and what the best-fit line would predict is the amount of raisins necessary. These dashed lines represent the error in the prediction, and these errors are called residuals.
So far, we haven't discussed how the best-fit line is determined. In essence, the line of best fit will minimize the amount of dashed line. More specifically, the residuals are squared and all added up—this is called the Residual Sum of Squares (RSS). The line that is the best fit will minimize the RSS. This method is called ordinary least squares, or OLS.
Look at the two plots in Figure 8.3. Notice how the regression lines are drawn in ways that clearly do not minimize the amount of red line. The RSS can be further minimized by increasing the slope in the first plot, and decreasing it in the second plot:
Now that there are differences between the observed values and the predicted values—as there will be in every real-life linear regression you perform—the equation that describes $y$, the dependent variable, changes slightly:

$$y = b_0 + b_1 x + e$$
The equation without the residual term only describes our prediction, $\hat{y}$, pronounced y hat (because it looks like $y$ is wearing a little hat):

$$\hat{y} = b_0 + b_1 x$$
Our error term is, therefore, the difference between the value that our model predicts, $\hat{y}_i$, and the actual empirical value, $y_i$, for each observation $i$:

$$e_i = y_i - \hat{y}_i$$
Formally, the RSS is:

$$\text{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Recall that this is the term that gets minimized when finding the best-fit line.
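To make this concrete, here is a minimal sketch in R using made-up cookie data (the area and raisins vectors below are hypothetical, not No Scone Unturned's actual records). It fits a line with R's lm function, which we will meet properly in a moment, and shows that a line with a slightly different slope produces a larger RSS:

> # hypothetical cookie data, purely for illustration (not real records)
> area    <- c(40, 55, 63, 72, 80, 95)     # cookie areas (cm^2)
> raisins <- c(23, 29, 34, 37, 44, 49)     # grams of raisins used
> fit     <- lm(raisins ~ area)            # OLS finds the line minimizing the RSS
> sum(residuals(fit)^2)                    # RSS of the best-fit line
> # any other slope (here, the OLS slope plus 0.1) yields a larger RSS
> sum((raisins - (coef(fit)[1] + (coef(fit)[2] + 0.1) * area))^2)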
Whereas the RSS is the sum of the squared residuals (or error terms), the mean of the squared residuals is known as the Mean Squared Error (MSE), and it is a very important measure of the accuracy of a model.
Formally, the MSE is:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Occasionally, you will encounter the Root Mean Squared Error (RMSE) as a measure of model fit. This is just the square root of the MSE, putting it in the same units as the dependent variable (instead of units of the dependent variable squared). The difference between the MSE and the RMSE is like the difference between variance and standard deviation. In fact, in both cases (MSE/RMSE and variance/standard deviation), the error terms have to be squared for the very same reason; if they were not, the positive and negative residuals would cancel each other out.
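Continuing the hypothetical cookie fit from the sketch above, the MSE and RMSE fall straight out of the residuals:

> mse  <- mean(residuals(fit)^2)   # MSE, in grams of raisins squared
> rmse <- sqrt(mse)                # RMSE, back in grams of raisins
> c(mse=mse, rmse=rmse)

One caveat: the Residual standard error that R reports in its regression summaries divides the RSS by the degrees of freedom (n minus the number of coefficients) rather than by n, so it will be slightly larger than the RMSE computed this way.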
Now that we have a bit of the requisite math, we're ready to perform a simple linear regression ourselves, and interpret the output. We will be using the venerable mtcars data set, and try to predict a car's gas mileage (mpg) with the car's weight (wt). We will also be using R's base graphics system (not ggplot2) in this section, because the visualization of linear models is arguably simpler in base R.
First, let's plot the cars' gas mileage as a function of their weights:
> plot(mpg ~ wt, data=mtcars)
Here we employ the formula syntax that we were first introduced to in Chapter 3, Describing Relationships, and that we used extensively in Chapter 6, Testing Hypotheses. We will be using it heavily in this chapter as well. As a refresher, mpg ~ wt roughly reads mpg as a function of wt.
Next, let's run a simple linear regression with the lm function, and save it to a variable called model:
> model <- lm(mpg ~ wt, data=mtcars)
Now that we have the model saved, we can, very simply, add a plot of the linear model to the scatterplot we have already created:
> abline(model)
Finally, let's view the result of fitting the linear model using the summary function, and interpret the output:
> summary(model)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
The first block of text reminds us how the model was built syntax-wise (which can actually be useful in situations where the lm call is performed dynamically).
Next, we see a five-number summary of the residuals. Remember that this is in units of the dependent variable. In other words, the largest residual is 6.87, meaning that the model's prediction for that data point is off by 6.87 miles per gallon.
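If you want to reproduce that block yourself, the residuals can be pulled directly out of the fitted model object; summary applied to them gives essentially the same figures (plus the mean, which is always essentially zero for a model fit with an intercept):

> summary(residuals(model))   # residual quartiles, in miles per gallon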
In the next block, labeled Coefficients, direct your attention to the two values in the Estimate column; these are the beta coefficients that minimize the RSS. Specifically, $b_0 = 37.285$ and $b_1 = -5.345$. The equation that describes the best-fit linear model then is:

$$\hat{y} = 37.285 - 5.345x$$
Remember, the way to interpret the $b_1$ coefficient is that for every unit increase in the independent variable (it's in units of 1,000 pounds), the dependent variable goes down (because it's negative) by 5.345 units (which are miles per gallon). The $b_0$ coefficient (the intercept) indicates, rather nonsensically, that a car that weighs nothing would have a gas mileage of 37.285 miles per gallon. Recall that all models are wrong, but some are useful.
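If you need the estimates themselves rather than the printed summary, they can be extracted with the coef function; the names of the returned vector come from the formula we supplied:

> coef(model)   # a named vector holding b0 ("(Intercept)") and b1 ("wt")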
If we wanted to predict the gas mileage of a car that weighed 6,000 pounds, our equation would yield an estimate of about 5.2 miles per gallon. Instead of doing the math by hand, we can use the predict function, as long as we supply it with a data frame that holds the relevant information for the new observations that we want to predict:
> predict(model, newdata=data.frame(wt=6))
       1 
5.218297 
Interestingly, we would predict that a car that weighs 7,000 pounds would get -0.126 miles per gallon. Again, all models are wrong, but some are useful. For most reasonable car weights, our very simple model yields reasonable predictions.
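To get a feel for where the predictions stop being reasonable, we can hand predict a whole range of weights at once; this is just a quick sketch using the same mechanism as above:

> # predicted mpg for hypothetical cars weighing 1,000 through 7,000 pounds
> predict(model, newdata=data.frame(wt=1:7))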
If we were only interested in prediction—and only interested in this particular model—we would stop here. But, as I mentioned in this chapter's preface, linear regression is also a tool for inference—and a pretty powerful one at that. In fact, we will soon see that many of the statistical tests we were introduced to in Chapter 6, Testing Hypotheses can be equivalently expressed and performed as a linear model.
When viewing linear regression as a tool of inference, it's important to remember that our coefficients are actually just estimates. The cars observed in mtcars represent just a small sample of all extant cars. If somehow we observed all cars and built a linear model, the beta coefficients would be population coefficients. The coefficients that we asked R to calculate are best guesses based on our sample, and, just like our other estimates in previous chapters, they can undershoot or overshoot the population coefficients, and their accuracy is a function of factors such as the sample size, the representativeness of our sample, and the inherent volatility or noisiness of the system we are trying to model.
Because they are estimates, we can quantify our uncertainty in the beta coefficients using standard error, as introduced in Chapter 5, Using Data to Reason About the World. The column of values directly to the right of the Estimate column, labeled Std. Error, gives us these measures. The estimates of the beta coefficients also have a sampling distribution and, therefore, confidence intervals could be constructed for them.
Finally, because the beta coefficients have well defined sampling distributions (as long as certain simplifying assumptions hold true), we can perform hypothesis tests on them. The most common hypothesis test performed on beta coefficients asks whether they are significantly discrepant from zero. Semantically, if a beta coefficient is significantly discrepant from zero, it is an indication that the independent variable has a significant impact on the prediction of the dependent variable. Remember the long-running warning in Chapter 6, Testing Hypotheses though: just because something is significant doesn't mean it is important.
The hypothesis tests comparing the coefficients to zero yield p-values; those p-values are depicted in the final column of the Coefficients section, labeled Pr(>|t|). We usually don't care about the significance of the intercept coefficient ($b_0$), so we can ignore that. Rather importantly, the p-value for the coefficient belonging to the wt variable is near zero, indicating that the weight of a car has some predictive power on the gas mileage of that car.
Getting back to the summary output, direct your attention to the entry called Multiple R-squared. R-squared—also written $R^2$ and also called the coefficient of determination—is, like the MSE, a measure of how good a fit the model is. In contrast to the MSE though, which is in units of the dependent variable, $R^2$ is always between 0 and 1, and thus, can be interpreted more easily. For example, if we changed the units of the dependent variable from miles per gallon to miles per liter, the MSE would change, but the $R^2$ would not.
An $R^2$ of 1 indicates a perfect fit with no residual error, and an $R^2$ of 0 indicates the worst possible fit: the independent variable doesn't help predict the dependent variable at all.
Helpfully, the $R^2$ is directly interpretable as the proportion of variance in the dependent variable that is explained by the independent variable. In this case, for example, the weight of a car explains about 75.3% of the variance of the gas mileage. Whether 75% constitutes a good $R^2$ depends heavily on the domain, but in my field (the behavioral sciences), an $R^2$ of 75% is really good.
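As a quick check, the R-squared can be pulled out of the summary object, and, for a simple linear regression, it is the same as the squared correlation between the two variables:

> summary(model)$r.squared       # matches the Multiple R-squared shown above
> cor(mtcars$mpg, mtcars$wt)^2   # the same value: the squared correlation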
We will have to come back to the rest of the information in the summary output in the section about multiple regression.