One of the coolest things about linear regression is that we are not limited to predictor variables that are continuous. For example, in the last section, we used the continuous variable wt (weight) to predict miles per gallon. But linear models can accommodate categorical variables, like am (automatic or manual transmission), as well.
Normally, in the simple linear regression equation Y = β₀ + β₁X, X will hold the actual value of the predictor variable. In the case of a simple linear regression with a binary predictor (like am), X will hold a dummy variable instead. Specifically, when the predictor is automatic, X will be 0, and when the predictor is manual, X will be 1.
More formally:

mpg = β₀ + β₁ · am,   where am = 0 for automatic cars and am = 1 for manual cars

Put this way, the interpretation of the coefficients changes slightly. Since the β₁ · am term will be zero when the car is automatic, β₀ is the mean miles per gallon for automatic cars. Similarly, since mpg will equal β₀ + β₁ when the car is manual, β₁ is equal to the mean difference in gas mileage between automatic and manual cars.
Concretely:
> model <- lm(mpg ~ am, data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   17.147      1.125  15.247 1.13e-15 ***
am             7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

> mean(mtcars$mpg[mtcars$am==0])
[1] 17.14737
> (mean(mtcars$mpg[mtcars$am==1]) -
+  mean(mtcars$mpg[mtcars$am==0]))
[1] 7.244939
The intercept term, β₀, is 17.15, which is the mean gas mileage of the automatic cars, and β₁ is 7.24, which is the difference of the means between the two groups.
The interpretation of the t-statistic and p-value is special now: a hypothesis test checking whether β₁ (the difference in group means) is significantly different from zero is tantamount to a hypothesis test of the equality of the two group means (Student's t-test)! Indeed, the t-statistic and p-value are the same:
> # use var.equal to choose Student's t-test
> # over Welch's t-test
> t.test(mpg ~ am, data=mtcars, var.equal=TRUE)

	Two Sample t-test

data:  mpg by am
t = -4.1061, df = 30, p-value = 0.000285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.84837  -3.64151
sample estimates:
mean in group 0 mean in group 1 
       17.14737        24.39231 
Isn't that neat!? A two-sample test of equality of means can be equivalently expressed as a linear model! This basic idea can be extended to handle non-binary categorical variables too—we'll see this in the section on multiple regression.
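As a small preview of that extension (the cyl column, with its three levels, is my own choice of example here), R will dummy code a multi-level factor automatically, creating one indicator variable per non-baseline level:

```r
# a categorical predictor with three levels (4, 6, and 8 cylinders);
# wrapping it in factor() makes R create one dummy variable per
# non-baseline level
model <- lm(mpg ~ factor(cyl), data=mtcars)
coef(model)

# the intercept is the mean mpg of the baseline group (4-cylinder cars),
# and each other coefficient is that group's mean difference from baseline
mean(mtcars$mpg[mtcars$cyl==4])
```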
Note that in mtcars, the am column was already coded as 1s (manual) and 0s (automatic). If automatic cars had been dummy coded as 1 and manual cars as 0, the results would be semantically the same; the only difference is that β₀ would be the mean gas mileage of the manual cars, and β₁ would be the (negative) difference in means. The t-statistics and p-values would be the same.
If you are working with a dataset whose binary predictor isn't already dummy coded, R's lm can handle this too, as long as you wrap the column in a call to factor. For example:
> mtcars$automatic <- ifelse(mtcars$am==0, "yes", "no")
> model <- lm(mpg ~ factor(automatic), data=mtcars)
> model

Call:
lm(formula = mpg ~ factor(automatic), data = mtcars)

Coefficients:
         (Intercept)  factor(automatic)yes  
              24.392                -7.245  
Finally, note that a car being automatic or manual explains some of the variance in gas mileage, but far less than weight did: this model's R² is only 0.36.
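If you want to make this comparison yourself, the R-squared of a fitted model can be pulled straight out of its summary (the mpg ~ wt model is the one from the previous section):

```r
# proportion of mpg variance explained by transmission type vs. by weight
summary(lm(mpg ~ am, data=mtcars))$r.squared   # roughly 0.36
summary(lm(mpg ~ wt, data=mtcars))$r.squared   # roughly 0.75
```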
Before we move on, a word of warning: the first part of every regression analysis should be to plot the relevant data. To convince you of this, consider Anscombe's quartet, depicted in Figure 8.6.
Anscombe's quartet holds four x-y pairs that have the same mean, standard deviation, correlation coefficient, linear regression coefficients, and R². In spite of these similarities, the four data pairs look very different when plotted. It is a warning not to blindly apply statistics to data that you haven't visualized. It is also a warning to take linear regression diagnostics (which we will go over before the chapter's end) seriously.
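You don't have to take my word for it; the quartet ships with R as the anscombe data frame, and a quick sketch confirms the matching statistics:

```r
data(anscombe)

# summary statistics for each of the four x-y pairs
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.y=mean(y), sd.y=sd(y), correlation=cor(x, y))
})

# the regression coefficients match too (intercept ~3.0, slope ~0.5)
sapply(1:4, function(i)
  coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])))
```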
Only two of the x-y pairs in Anscombe's quartet can appropriately be modeled with simple linear regression: the ones in the left column. Of particular interest is the pair on the bottom left; it looks like it contains an outlier. If, after a thorough investigation into why that datum made it into our dataset, we decide we really should discard it, we can either (a) remove the offending row, or (b) use robust linear regression.
For a more or less drop-in replacement for lm that uses a robust version of OLS called Iteratively Re-Weighted Least Squares (IWLS), you can use the rlm function from the MASS package:
> library(MASS)
> data(anscombe)
> plot(y3 ~ x3, data=anscombe)
> abline(lm(y3 ~ x3, data=anscombe),
+        col="blue", lty=2, lwd=2)
> abline(rlm(y3 ~ x3, data=anscombe),
+        col="red", lty=1, lwd=2)
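Comparing the fitted coefficients makes the difference concrete; the outlier in the bottom-left pair sits high at a large x value, so it drags the ordinary least squares slope upward, while the IWLS fit stays close to the line through the remaining points:

```r
library(MASS)   # for rlm
data(anscombe)

coef(lm(y3 ~ x3, data=anscombe))    # OLS fit, pulled toward the outlier
coef(rlm(y3 ~ x3, data=anscombe))   # robust fit, nearly unaffected by it
```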
OK, one more warning
Some suggest that you should almost always use rlm in favor of lm. It's true that rlm is the bee's knees, but there is a subtle danger in doing this, as illustrated by the following statistical urban legend.
Sometime in 1984, NASA was studying the ozone concentrations from various locations. NASA used robust statistical methods that automatically discarded anomalous data points believing most of them to be instrument errors or errors in transmission. As a result of this, some extremely low ozone readings in the atmosphere above Antarctica were removed from NASA's atmospheric models. The very next year, British scientists published a paper describing a very deteriorated ozone layer in the Antarctic. Had NASA paid closer attention to outliers, they would have been the first to discover it.
It turns out that the relevant part of this story is a myth, but the fact that it is so widely believed is a testament to how plausible it is.
The point is, outliers should always be investigated and not simply ignored, because they may be indicative of poor model choice, faulty instrumentation, or a gigantic hole in the ozone layer. Once the outliers are accounted for, then use robust methods to your heart's content.