Simple linear regression with a binary predictor

One of the coolest things about linear regression is that we are not limited to using predictor variables that are continuous. For example, in the last section, we used the continuous variable wt (weight) to predict miles per gallon. But linear models are adaptable to using categorical variables, like am (automatic or manual transmission) as well.

Normally, in the simple linear regression equation $Y = \beta_0 + \beta_1 X$, $X$ will hold the actual value of the predictor variable. In the case of a simple linear regression with a binary predictor (like am), $X$ will hold a dummy variable instead. Specifically, when the car is automatic, $X$ will be 0, and when the car is manual, $X$ will be 1.

More formally:

$$ Y = \beta_0 + \beta_1 X $$

$$ X = \begin{cases} 0 & \text{if the car is automatic} \\ 1 & \text{if the car is manual} \end{cases} $$

Put in this manner, the interpretation of the coefficients changes slightly: since $X$ will be zero when the car is automatic, $\beta_0$ is the mean miles per gallon of the automatic cars.

Similarly, since $\beta_0 + \beta_1 X$ will equal $\beta_0 + \beta_1$ when the car is manual, $\beta_1$ is equal to the mean difference in gas mileage between automatic and manual cars.
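
Spelled out in terms of conditional expectations (a short derivation; nothing here is specific to mtcars):

$$ E[Y \mid X = 0] = \beta_0 \qquad E[Y \mid X = 1] = \beta_0 + \beta_1 $$

$$ \beta_1 = E[Y \mid X = 1] - E[Y \mid X = 0] $$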

Concretely:

  > model <- lm(mpg ~ am, data=mtcars)
  > summary(model)
  
  Call:
  lm(formula = mpg ~ am, data = mtcars)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -9.3923 -3.0923 -0.2974  3.2439  9.5077 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   17.147      1.125  15.247 1.13e-15 ***
  am             7.245      1.764   4.106 0.000285 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 4.902 on 30 degrees of freedom
  Multiple R-squared:  0.3598,  Adjusted R-squared:  0.3385 
  F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
  >
  >
  > mean(mtcars$mpg[mtcars$am==0])
  [1] 17.14737
  > (mean(mtcars$mpg[mtcars$am==1]) - 
  + mean(mtcars$mpg[mtcars$am==0]))
  [1] 7.244939
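
As a quick check (a minimal sketch using the model object we just fit), we can pull the coefficients out of the fitted model and compare them to the group means directly; both comparisons should come back TRUE:

  > # the intercept should match the mean mpg of the automatic cars,
  > # and the slope should match the difference in group means above
  > all.equal(unname(coef(model)[1]),
  +           mean(mtcars$mpg[mtcars$am==0]))
  > all.equal(unname(coef(model)[2]),
  +           mean(mtcars$mpg[mtcars$am==1]) - mean(mtcars$mpg[mtcars$am==0]))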

The intercept term, $\beta_0$, is 17.15, which is the mean gas mileage of the automatic cars, and $\beta_1$ is 7.24, which is the difference of the means between the two groups.

The interpretation of the t-statistic and p-value is very special now; a hypothesis test checking whether $\beta_1$ (the difference in group means) is significantly different from zero is tantamount to a hypothesis test of the equality of means (Student's t-test)! Indeed, the t-statistic and p-values are the same:

  # use var.equal to choose Student's t-test
  # over Welch's t-test
  > t.test(mpg ~ am, data=mtcars, var.equal=TRUE)
  
    Two Sample t-test
  
  data:  mpg by am
  t = -4.1061, df = 30, p-value = 0.000285
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -10.84837  -3.64151
  sample estimates:
  mean in group 0 mean in group 1 
         17.14737        24.39231 
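
If you want to verify the equivalence yourself, you can pull the test statistic and p-value straight out of the model summary and compare them against the t.test output above (a small sketch; the sign of the t-statistic differs only because t.test subtracts the manual group's mean from the automatic group's):

  > # the slope's t-statistic and p-value from the regression...
  > summary(model)$coefficients["am", c("t value", "Pr(>|t|)")]
  > # ...match Student's t-test (up to the sign of the statistic)
  > t.test(mpg ~ am, data=mtcars, var.equal=TRUE)$statistic
  > t.test(mpg ~ am, data=mtcars, var.equal=TRUE)$p.value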

Isn't that neat!? A two-sample test of equality of means can be equivalently expressed as a linear model! This basic idea can be extended to handle non-binary categorical variables too—we'll see this in the section on multiple regression.

Note that in mtcars, the am column was already coded as 1s (manuals) and 0s (automatics). If automatic cars were dummy coded as 1 and manuals were dummy coded as 0, the results would be semantically the same; the only difference is that $\beta_0$ would be the mean of the manual cars, and $\beta_1$ would be the (negative of the) difference in means. The t-statistics and p-values would be the same.

If you are working with a dataset that doesn't already have the binary predictor dummy coded, R's lm can handle this too, so long as you wrap the column in a call to factor. For example:

  
  > mtcars$automatic <- ifelse(mtcars$am==0, "yes", "no")
  > model <- lm(mpg ~ factor(automatic), data=mtcars)
  > model
  
  Call:
  lm(formula = mpg ~ factor(automatic), data = mtcars)
  
  Coefficients:
           (Intercept)  factor(automatic)yes  
                24.392                -7.245  
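
If you want control over which level is treated as the baseline, you don't have to build the dummy coding by hand; a factor with explicit levels (or a call to relevel) does the job. For instance, with a hypothetical transmission column created just for illustration:

  > # "automatic" is listed first, so it becomes the baseline level and
  > # the coefficients match the original 0/1 coding
  > mtcars$transmission <- factor(mtcars$am, levels=c(0, 1),
  +                               labels=c("automatic", "manual"))
  > coef(lm(mpg ~ transmission, data=mtcars))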

Finally, note that a car being automatic or manual explains some of the variance in gas mileage, but far less than weight did: this model's $R^2$ is only 0.36.
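
If you want to see the gap for yourself, you can put the two values side by side (a quick sketch, assuming the weight model from the previous section, lm(mpg ~ wt)):

  > # R-squared of the transmission model vs. the weight model
  > # from the previous section (roughly 0.36 vs. 0.75)
  > summary(lm(mpg ~ am, data=mtcars))$r.squared
  > summary(lm(mpg ~ wt, data=mtcars))$r.squared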

A word of warning

Before we move on, a word of warning: the first part of every regression analysis should be to plot the relevant data. To convince you of this, consider Anscombe's quartet, depicted in Figure 8.6.

Figure 8.6: Four datasets with identical means, standard deviations, regression coefficients, and $R^2$

Anscombe's quartet holds four x-y pairs that have the same mean, standard deviation, correlation coefficient, linear regression coefficients, and $R^2$. In spite of these similarities, all four of these data pairs are very different. It is a warning not to blindly apply statistics to data that you haven't visualized. It is also a warning to take linear regression diagnostics (which we will go over before the chapter's end) seriously.
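
You can reproduce a plot along the lines of Figure 8.6 yourself from the built-in anscombe data frame (a quick sketch; the fitted lines are only there to show how similar the four regressions look):

  > op <- par(mfrow=c(2, 2))
  > plot(y1 ~ x1, data=anscombe); abline(lm(y1 ~ x1, data=anscombe), col="blue")
  > plot(y2 ~ x2, data=anscombe); abline(lm(y2 ~ x2, data=anscombe), col="blue")
  > plot(y3 ~ x3, data=anscombe); abline(lm(y3 ~ x3, data=anscombe), col="blue")
  > plot(y4 ~ x4, data=anscombe); abline(lm(y4 ~ x4, data=anscombe), col="blue")
  > par(op)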

Only two of the x-y pairs in Anscombe's quartet can be modeled with simple linear regression: the ones in the left column. Of particular interest is the one on the bottom left; it looks like it contains an outlier. After thorough investigation into why that datum made it into our dataset, if we decide we really should discard it, we can either (a) remove the offending row, or (b) use robust linear regression.
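
For option (a), one way to do it is sketched below (it picks out the observation with the largest absolute residual in the y3 ~ x3 pair, which is the same pair the rlm example below uses):

  > fit <- lm(y3 ~ x3, data=anscombe)
  > # drop the row with the largest absolute residual and refit
  > outlier <- which.max(abs(resid(fit)))
  > lm(y3 ~ x3, data=anscombe[-outlier, ])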

For a more or less drop-in replacement for lm that uses a robust version of OLS called Iteratively Re-Weighted Least Squares (IWLS), you can use the rlm function from the MASS package:

  > library(MASS)
  > data(anscombe)
  > plot(y3 ~ x3, data=anscombe)
  > abline(lm(y3 ~ x3, data=anscombe),
  +        col="blue", lty=2, lwd=2)
  > abline(rlm(y3 ~ x3, data=anscombe),
  +        col="red", lty=1, lwd=2)

Figure 8.7: The difference between a linear regression fit with OLS and a robust linear regression fit with IWLS

Note

OK, one more warning

Some suggest that you should almost always use rlm instead of lm. It's true that rlm is the bee's knees, but there is a subtle danger in doing this, as illustrated by the following statistical urban legend.

Sometime in 1984, NASA was studying the ozone concentrations from various locations. NASA used robust statistical methods that automatically discarded anomalous data points believing most of them to be instrument errors or errors in transmission. As a result of this, some extremely low ozone readings in the atmosphere above Antarctica were removed from NASA's atmospheric models. The very next year, British scientists published a paper describing a very deteriorated ozone layer in the Antarctic. Had NASA paid closer attention to outliers, they would have been the first to discover it.

It turns out that the relevant part of this story is a myth, but the fact that it is so widely believed is a testament to how plausible it is.

The point is, outliers should always be investigated and not simply ignored, because they may be indicative of poor model choice, faulty instrumentation, or a gigantic hole in the ozone layer. Once the outliers are accounted for, then use robust methods to your heart's content.
