Simple linear regression

On to a substantially less trivial example, let's say No Scone Unturned has been keeping careful records of how many raisins (in grams) they have been using for their famous oatmeal raisin cookies. They want to construct a linear model describing the relationship between the area of a cookie (in centimeters squared) and how many raisins they use, on average.

In particular, they want to use linear regression to predict how many grams of raisins they will need for a 1-meter long oatmeal raisin cookie. Predicting a continuous variable (grams of raisins) from other variables sounds like a job for regression! In particular, when we use just a single predictor variable (the area of the cookies), the technique is called simple linear regression.

The left panel of Figure 8.2 illustrates the relationship between the area of the cookies and the amount of raisins used in each. It also shows the best-fit regression line:


Figure 8.2: (left) A scatterplot of areas and grams of raisins in No Scone Unturned's cookies with a best-fit regression line; (right) the same plot with highlighted residuals

Note that, in contrast to the last example, virtually none of the data points actually rest on the best-fit line—there are now errors. This is because there is a random component to how many raisins are used.

The right panel of Figure 8.2 draws dashed red lines between each data point and what the best-fit line would predict is the amount of raisins necessary. These dashed lines represent the error in the prediction, and these errors are called residuals.

So far, we haven't discussed how the best-fit line is determined. In essence, the line of best fit will minimize the amount of dashed line. More specifically, the residuals are squared and all added up—this is called the Residual Sum of Squares (RSS). The best-fit line is the one that minimizes the RSS. This method is called ordinary least squares, or OLS.

Look at the two plots in Figure 8.3. Notice how the regression lines are drawn in ways that clearly do not minimize the amount of red line. The RSS can be further minimized by increasing the slope in the first plot, and decreasing it in the second plot:


Figure 8.3: Two regression lines that do not minimize the RSS
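To make this concrete, here is a minimal sketch in R—using simulated data, since we don't have the bakery's actual records—that compares the RSS of the OLS line against lines whose slopes have been nudged away from the optimum. The OLS line always has the smallest RSS:

  > # simulated stand-in for the cookie data: areas in cm^2, raisins in grams
  > set.seed(42)
  > area    <- runif(30, min=20, max=100)
  > raisins <- 2 + 0.5*area + rnorm(30, sd=5)
  > # RSS for any candidate intercept/slope pair
  > rss <- function(b0, b1) sum((raisins - (b0 + b1*area))^2)
  > fit <- lm(raisins ~ area)
  > rss(coef(fit)[1], coef(fit)[2])          # RSS of the best-fit line
  > rss(coef(fit)[1], coef(fit)[2] + 0.1)    # larger: slope too steep
  > rss(coef(fit)[1], coef(fit)[2] - 0.1)    # larger: slope too shallow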

Now that there are differences between the observed values and the predicted values—as there will be in every real-life linear regression you perform—the equation that describes $y$, the dependent variable, changes slightly:

$$ y_i = b_0 + b_1 x_i + e_i $$

The equation without the residual term only describes our prediction, $\hat{y}$, pronounced y hat (because it looks like the $y$ is wearing a little hat):

$$ \hat{y}_i = b_0 + b_1 x_i $$

Our error term is, therefore, the difference between the actual empirical value and the value that our model predicts, for each observation $i$:

$$ e_i = y_i - \hat{y}_i $$

Formally, the RSS is:

$$ RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 $$

Recall that this is the term that gets minimized when finding the best-fit line.

If the RSS is the sum of the squared residuals (or error terms), the mean of the squared residuals is known as the Mean Squared Error (MSE), and is a very important measure of the accuracy of a model.

Formally, the MSE is:

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 $$

Occasionally, you will encounter the Root Mean Squared Error (RMSE) as a measure of model fit. This is just the square root of the MSE, which puts it in the same units as the dependent variable (instead of units of the dependent variable squared). The relationship between the MSE and the RMSE is like the relationship between the variance and the standard deviation. In fact, in both cases (MSE/RMSE and variance/standard deviation), the terms have to be squared for the very same reason: if they were not, the positive and negative residuals (or deviations) would cancel each other out.
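To see how these quantities relate to one another, here is a quick sketch with a few made-up observed and predicted values; note that the RMSE is just the square root of the MSE:

  > y      <- c(10, 12, 15, 11, 14)    # hypothetical observed values
  > y.hat  <- c(11, 11, 13, 12, 15)    # hypothetical model predictions
  > resids <- y - y.hat                # the residuals
  > rss    <- sum(resids^2)            # residual sum of squares
  > mse    <- mean(resids^2)           # mean squared error
  > rmse   <- sqrt(mse)                # root mean squared error, same units as y
  > c(RSS=rss, MSE=mse, RMSE=rmse)
       RSS      MSE     RMSE 
  8.000000 1.600000 1.264911 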

Now that we have a bit of the requisite math, we're ready to perform a simple linear regression ourselves, and interpret the output. We will be using the venerable mtcars data set, and try to predict a car's gas mileage (mpg) with the car's weight (wt). We will also be using R's base graphics system (not ggplot2) in this section, because the visualization of linear models is arguably simpler in base R.

First, let's plot the cars' gas mileage as a function of their weights:

  > plot(mpg ~ wt, data=mtcars)

Here we employ the formula syntax that we were first introduced to in Chapter 3, Describing Relationships, and that we used extensively in Chapter 6, Testing Hypotheses. We will be using it heavily in this chapter as well. As a refresher, mpg ~ wt roughly reads as mpg as a function of wt.

Next, let's run a simple linear regression with the lm function, and save it to a variable called model:

  > model <- lm(mpg ~ wt, data=mtcars)

Now that we have the model saved, we can, very simply, add a plot of the linear model to the scatterplot we have already created:

  > abline(model)

Figure 8.4: The result of plotting output from lm

Finally, let's view the result of fitting the linear model using the summary function, and interpret the output:

   > summary(model)
  
  Call:
  lm(formula = mpg ~ wt, data = mtcars)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -4.5432 -2.3647 -0.1252  1.4096  6.8727 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
  wt           -5.3445     0.5591  -9.559 1.29e-10 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 3.046 on 30 degrees of freedom
  Multiple R-squared:  0.7528,  Adjusted R-squared:  0.7446 
  F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The first block of text reminds us how the model was built syntax-wise (which can actually be useful in situations where the lm call is performed dynamically).

Next, we see a five-number summary of the residuals. Remember that these are in units of the dependent variable. In other words, the observation with the largest residual is off by 6.87 miles per gallon.

In the next block, labeled Coefficients, direct your attention to the two values in the Estimate column; these are the beta coefficients that minimize the RSS. Specifically, $b_0 = 37.285$ and $b_1 = -5.345$. The equation that describes the best-fit linear model, then, is:

$$ \hat{y} = 37.285 - 5.345x $$

Remember, the way to interpret the $b_1$ coefficient is that for every one-unit increase in the independent variable (which is in units of 1,000 pounds), the dependent variable goes down (because the coefficient is negative) by 5.345 units (which are miles per gallon). The $b_0$ coefficient indicates, rather nonsensically, that a car that weighs nothing would have a gas mileage of 37.285 miles per gallon. Recall that all models are wrong, but some are useful.

If we wanted to predict the gas mileage of a car that weighed 6,000 pounds, our equation would yield an estimate of about 5.2 miles per gallon. Instead of doing the math by hand, we can use the predict function, as long as we supply it with a data frame that holds the relevant information for the new observations that we want to predict:

  > predict(model, newdata=data.frame(wt=6))
         1 
  5.218297
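
We can also reproduce this prediction by hand from the coefficients stored in the model; coef extracts the vector of estimates:

  > # the intercept and slope of the fitted model
  > b <- coef(model)
  > b[1] + b[2] * 6
  (Intercept) 
     5.218297 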

Interestingly, we would predict a car that weighs 7,000 pounds would get -0.126 miles per gallon. Again, all models are wrong, but some are useful. For most reasonable car weights, our very simple model yields reasonable predictions.

If we were only interested in prediction—and only interested in this particular model—we would stop here. But, as I mentioned in this chapter's preface, linear regression is also a tool for inference—and a pretty powerful one at that. In fact, we will soon see that many of the statistical tests we were introduced to in Chapter 6, Testing Hypotheses can be equivalently expressed and performed as a linear model.

When viewing linear regression as a tool of inference, it's important to remember that our coefficients are actually just estimates. The cars observed in mtcars represent just a small sample of all extant cars. If somehow we observed all cars and built a linear model, the beta coefficients would be population coefficients. The coefficients that we asked R to calculate are best guesses based on our sample, and, just like our other estimates in previous chapters, they can undershoot or overshoot the population coefficients, and their accuracy is a function of factors such as the sample size, the representativeness of our sample, and the inherent volatility or noisiness of the system we are trying to model.

As estimates, we can quantify our uncertainty in our beta coefficients using standard error, as introduced in Chapter 5, Using Data to Reason About the World. The column of values directly to the right of the Estimate column, labeled Std. Error, gives us these measures. The estimates of the beta coefficients also have a sampling distribution and, therefore, confidence intervals could be constructed for them.
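
For example, R will construct 95% confidence intervals around both estimates with the confint function; for this model, the interval for the wt coefficient runs from roughly -6.5 to -4.2 (comfortably excluding zero), and the interval for the intercept from roughly 33.5 to 41.1:

  > # 95% confidence intervals for the intercept and the wt coefficient
  > confint(model)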

Finally, because the beta coefficients have well defined sampling distributions (as long as certain simplifying assumptions hold true), we can perform hypothesis tests on them. The most common hypothesis test performed on beta coefficients asks whether they are significantly discrepant from zero. Semantically, if a beta coefficient is significantly discrepant from zero, it is an indication that the independent variable has a significant impact on the prediction of the dependent variable. Remember the long-running warning in Chapter 6, Testing Hypotheses though: just because something is significant doesn't mean it is important.

The hypothesis tests comparing the coefficients to zero yield p-values; those p-values are depicted in the final column of the Coefficients section, labeled Pr(>|t|). We usually don't care about the significance of the intercept coefficient ($b_0$), so we can ignore that. Rather importantly, the p-value for the coefficient belonging to the wt variable is near zero, indicating that the weight of a car has some predictive power for the gas mileage of that car.
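If you want these numbers programmatically instead of reading them off the printout, the whole coefficient table, including the p-values in the Pr(>|t|) column, can be pulled out of the summary object:

  > # the coefficient table as a matrix
  > coef(summary(model))
  > # just the p-value for the wt coefficient
  > coef(summary(model))["wt", "Pr(>|t|)"]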

Getting back to the summary output, direct your attention to the entry called Multiple R-squared. R-squared—also written $R^2$ and called the coefficient of determination—is, like the MSE, a measure of how good a fit the model is. In contrast to the MSE, though, which is in squared units of the dependent variable, $R^2$ is always between 0 and 1 and can therefore be interpreted more easily. For example, if we changed the units of the dependent variable from miles per gallon to miles per liter, the MSE would change, but the $R^2$ would not.

An $R^2$ of 1 indicates a perfect fit with no residual error, and an $R^2$ of 0 indicates the worst possible fit: the independent variable doesn't help predict the dependent variable at all.


Figure 8.5: Linear models (from left to right) with $R^2$s of 0.75, 0.33, and 0.92

Helpfully, the $R^2$ is directly interpretable as the proportion of the variance in the dependent variable that is explained by the independent variable. In this case, for example, the weight of a car explains about 75.3% of the variance of the gas mileage. Whether 75% constitutes a good $R^2$ depends heavily on the domain, but in my field (the behavioral sciences), an $R^2$ of 75% is really good.
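As with the p-values, the $R^2$ can be extracted directly from the summary object rather than read off the printout:

  > summary(model)$r.squared        # 0.7528...
  > summary(model)$adj.r.squared    # 0.7446... (discussed in the multiple regression section)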

We will have to come back to the rest of the information in the summary output in the section about multiple regression.

Note

Take note of the fact that the p-value of the F-statistic in the last line of the output is the same as the p-value of the t-statistic of the only non-intercept coefficient.
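
This is because, with only one non-intercept coefficient, the overall F-test and the t-test for that coefficient are testing the same thing; in fact, the F-statistic is just the square of the t-statistic, which you can verify yourself:

  > # with a single predictor, the F-statistic equals the square of its t-statistic
  > coef(summary(model))["wt", "t value"]^2    # roughly 91.4
  > summary(model)$fstatistic                  # value ~91.4 on 1 and 30 DF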
