One of the coolest things about linear regression is that we are not limited to predictor variables that are continuous. For example, in the last section, we used the continuous variable wt (weight) to predict miles per gallon. But linear models can accommodate categorical variables, like am (automatic or manual transmission), as well.
Normally, in the simple linear regression equation Y = β₀ + β₁X, X will hold the actual value of the predictor variable. In the case of a simple linear regression with a binary predictor (like am), X will hold a dummy variable instead. Specifically, when the predictor is automatic, X will be 0, and when the predictor is manual, X will be 1.
More formally:

mpg = β₀ + β₁ · am,   where am = 0 for automatic cars and am = 1 for manual cars

Put this way, the interpretation of the coefficients changes slightly. Since the β₁ · am term will be zero when the car is automatic, β₀ is the mean miles per gallon for automatic cars. Similarly, since mpg will equal β₀ + β₁ when the car is manual, β₁ is equal to the mean difference in gas mileage between automatic and manual cars.
Concretely:
> model <- lm(mpg ~ am, data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   17.147      1.125  15.247 1.13e-15 ***
am             7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

> mean(mtcars$mpg[mtcars$am==0])
[1] 17.14737
> (mean(mtcars$mpg[mtcars$am==1]) -
+  mean(mtcars$mpg[mtcars$am==0]))
[1] 7.244939
The intercept term, β₀, is 17.15, which is the mean gas mileage of the automatic cars, and β₁ is 7.24, which is the difference of the means between the two groups.
The interpretation of the t-statistic and p-value is special now: a hypothesis test checking whether β₁ (the difference in group means) is significantly different from zero is tantamount to a hypothesis test of the equality of the two group means (Student's t-test)! Indeed, the t-statistic and p-value are the same:
> # use var.equal to choose Student's t-test
> # over Welch's t-test
> t.test(mpg ~ am, data=mtcars, var.equal=TRUE)

	Two Sample t-test

data:  mpg by am
t = -4.1061, df = 30, p-value = 0.000285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.84837  -3.64151
sample estimates:
mean in group 0 mean in group 1 
       17.14737        24.39231 
Isn't that neat!? A two-sample test of equality of means can be equivalently expressed as a linear model! This basic idea can be extended to handle non-binary categorical variables too—we'll see this in the section on multiple regression.
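As a small preview of that extension (the cyl column, with its three levels, is my own choice of example here), R will dummy code a multi-level factor automatically, creating one indicator variable per non-baseline level:

```r
# a categorical predictor with three levels (4, 6, and 8 cylinders);
# wrapping it in factor() makes R create one dummy variable per
# non-baseline level
model <- lm(mpg ~ factor(cyl), data=mtcars)
coef(model)

# the intercept is the mean mpg of the baseline group (4-cylinder cars),
# and each other coefficient is that group's mean difference from baseline
mean(mtcars$mpg[mtcars$cyl==4])
```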
Note that in mtcars, the am column was already coded as 1s (manual) and 0s (automatic). If automatic cars had been dummy coded as 1 and manual cars as 0, the results would be semantically the same; the only difference is that β₀ would be the mean gas mileage of the manual cars, and β₁ would be the (negative) difference in means. The t-statistics and p-values would be the same.
If you are working with a dataset whose binary predictor isn't already dummy coded, R's lm can handle this too, as long as you wrap the column in a call to factor. For example:
> mtcars$automatic <- ifelse(mtcars$am==0, "yes", "no")
> model <- lm(mpg ~ factor(automatic), data=mtcars)
> model

Call:
lm(formula = mpg ~ factor(automatic), data = mtcars)

Coefficients:
         (Intercept)  factor(automatic)yes  
              24.392                -7.245  
Finally, note that a car being automatic or manual explains some of the variance in gas mileage, but far less than weight did: this model's R² is only 0.36.
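If you want to make this comparison yourself, the R-squared of a fitted model can be pulled straight out of its summary (the mpg ~ wt model is the one from the previous section):

```r
# proportion of mpg variance explained by transmission type vs. by weight
summary(lm(mpg ~ am, data=mtcars))$r.squared   # roughly 0.36
summary(lm(mpg ~ wt, data=mtcars))$r.squared   # roughly 0.75
```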
Before we move on, a word of warning: the first part of every regression analysis should be to plot the relevant data. To convince you of this, consider Anscombe's quartet, depicted in Figure 8.6.
Anscombe's quartet holds four x-y pairs that have the same mean, standard deviation, correlation coefficient, linear regression coefficients, and R². In spite of these similarities, the four data pairs look very different when plotted. It is a warning not to blindly apply statistics to data that you haven't visualized. It is also a warning to take linear regression diagnostics (which we will go over before the chapter's end) seriously.
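You don't have to take my word for it; the quartet ships with R as the anscombe data frame, and a quick sketch confirms the matching statistics:

```r
data(anscombe)

# summary statistics for each of the four x-y pairs
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.y=mean(y), sd.y=sd(y), correlation=cor(x, y))
})

# the regression coefficients match too (intercept ~3.0, slope ~0.5)
sapply(1:4, function(i)
  coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])))
```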
Only two of the x-y pairs in Anscombe's quartet can appropriately be modeled with simple linear regression: the ones in the left column. Of particular interest is the pair on the bottom left; it looks like it contains an outlier. If, after a thorough investigation into why that datum made it into our dataset, we decide we really should discard it, we can either (a) remove the offending row, or (b) use robust linear regression.
For a more or less drop-in replacement for lm that uses a robust version of OLS called Iteratively Re-Weighted Least Squares (IWLS), you can use the rlm function from the MASS package:
> library(MASS)
> data(anscombe)
> plot(y3 ~ x3, data=anscombe)
> abline(lm(y3 ~ x3, data=anscombe),
+        col="blue", lty=2, lwd=2)
> abline(rlm(y3 ~ x3, data=anscombe),
+        col="red", lty=1, lwd=2)
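Comparing the fitted coefficients makes the difference concrete; the outlier in the bottom-left pair sits high at a large x value, so it drags the ordinary least squares slope upward, while the IWLS fit stays close to the line through the remaining points:

```r
library(MASS)   # for rlm
data(anscombe)

coef(lm(y3 ~ x3, data=anscombe))    # OLS fit, pulled toward the outlier
coef(rlm(y3 ~ x3, data=anscombe))   # robust fit, nearly unaffected by it
```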
OK, one more warning
Some suggest that you should almost always use rlm in favor of lm. It's true that rlm is the bee's knees, but there is a subtle danger in doing this, as illustrated by the following statistical urban legend.
Sometime in 1984, NASA was studying the ozone concentrations from various locations. NASA used robust statistical methods that automatically discarded anomalous data points believing most of them to be instrument errors or errors in transmission. As a result of this, some extremely low ozone readings in the atmosphere above Antarctica were removed from NASA's atmospheric models. The very next year, British scientists published a paper describing a very deteriorated ozone layer in the Antarctic. Had NASA paid closer attention to outliers, they would have been the first to discover it.
It turns out that the relevant part of this story is a myth, but the fact that it is so widely believed is a testament to how plausible it is.
The point is, outliers should always be investigated and not simply ignored, because they may be indicative of poor model choice, faulty instrumentation, or a gigantic hole in the ozone layer. Once the outliers are accounted for, then use robust methods to your heart's content.