More often than not, we want to include not just one but multiple predictors (independent variables) in our predictive models. Luckily, linear regression can easily accommodate us! The technique? Multiple regression.
By giving each predictor its very own beta coefficient in a linear model, the target variable gets informed by a weighted sum of its predictors. For example, a multiple regression using two predictor variables looks like this:

Y = β₀ + β₁X₁ + β₂X₂

Now, instead of estimating two coefficients (β₀ and β₁), we are estimating three: the intercept, the slope of the first predictor, and the slope of the second predictor.
Before explaining further, let's perform a multiple regression predicting gas mileage from weight and horsepower:
> model <- lm(mpg ~ wt + hp, data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
Since we are now dealing with three variables, the predictive model can no longer be visualized with a line; it must be visualized as a plane in 3D space, as seen in Figure 8.8:
Aided by the visualization, we can see that our predictions of mpg are informed by both wt and hp. Both of them contribute negatively to the gas mileage; you can see this from the fact that both coefficients are negative. Visually, we can verify this by noting that the plane slopes downward as wt increases and as hp increases, although the slope for the latter predictor is less dramatic.
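If you'd like to draw something along the lines of Figure 8.8 yourself, here is a minimal sketch of one way to do it; it assumes the add-on scatterplot3d package is installed (install.packages("scatterplot3d") if it isn't):

library(scatterplot3d)

# plot the raw data in 3D: wt and hp on the "floor", mpg on the vertical axis
s3d <- scatterplot3d(mtcars$wt, mtcars$hp, mtcars$mpg,
                     xlab="wt", ylab="hp", zlab="mpg", pch=16)

# overlay the plane fitted by our two-predictor model
s3d$plane3d(lm(mpg ~ wt + hp, data=mtcars))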
Although we lose the ability to easily visualize it, the prediction region formed by a linear model with more than two predictors is called a hyperplane, and it exists in n-dimensional space, where n is the number of predictor variables plus 1.
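As a quick illustration, here is what a three-predictor model might look like; disp (engine displacement) is just a stand-in third predictor here, not one we will analyze further:

# one intercept plus three slopes; the fitted surface is a hyperplane in 4-D space
coef(lm(mpg ~ wt + hp + disp, data=mtcars))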
The astute reader may have noticed that the beta coefficient belonging to the wt variable is not the same as it was in the simple linear regression. The beta coefficient for hp, too, is different than the one estimated using simple regression:
> coef(lm(mpg ~ wt + hp, data=mtcars))
(Intercept)          wt          hp 
37.22727012 -3.87783074 -0.03177295 

> coef(lm(mpg ~ wt, data=mtcars))
(Intercept)          wt 
  37.285126   -5.344472 

> coef(lm(mpg ~ hp, data=mtcars))
(Intercept)          hp 
30.09886054 -0.06822828 
The explanation has to do with a subtle difference in how the coefficients should be interpreted now that there is more than one independent variable. The proper interpretation of the coefficient belonging to wt is not that as the weight of the car increases by 1 unit (1,000 pounds), the gas mileage, on average, decreases by 3.878 miles per gallon. Instead, the proper interpretation is: Holding horsepower constant, as the weight of the car increases by 1 unit (1,000 pounds), the gas mileage, on average, decreases by 3.878 miles per gallon.

Similarly, the correct interpretation of the coefficient belonging to hp is: Holding the weight of the car constant, as the horsepower of the car increases by 1, the gas mileage, on average, decreases by 0.032 miles per gallon. Still confused?
It turns out that cars with more horsepower use more gas. It is also true that cars with higher horsepower tend to be heavier. When we put these predictors (weight and horsepower) into a linear model together, the model attempts to tease apart the independent contributions of each variable by removing the effects of the other. In multivariate analysis, this is known as controlling for a variable. Hence, the preface to the interpretation can, equivalently, be stated as Controlling for the effects of the weight of a car, as the horsepower…. Because cars with higher horsepower tend to be heavier, when you remove the effect of horsepower, the influence of weight goes down, and vice versa. This is why the coefficients for these predictors are both smaller in magnitude than they are in the simple single-predictor regressions.
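If it helps, the "holding the other predictor constant" reading can be checked numerically. The following sketch uses the model object we fit earlier and two hypothetical cars (the weight and horsepower values are made up) that differ only in weight; their predictions differ by exactly the wt coefficient:

# two hypothetical cars with the same horsepower, one unit (1,000 lbs) apart in weight
lighter <- predict(model, newdata=data.frame(wt=3.0, hp=110))
heavier <- predict(model, newdata=data.frame(wt=4.0, hp=110))

heavier - lighter   # -3.87783, the coefficient of wt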
In controlled experiments, scientists introduce an experimental condition on two samples that are virtually the same except for the independent variable being manipulated (for example, giving one group a placebo and the other group real medication). If they are careful, they can attribute any observed effect directly to the manipulated independent variable. In simple cases like this, statistical control is often unnecessary. But statistical control is of utmost importance in other areas of science (especially the behavioral and social sciences) and business, where we are privy only to data from non-controlled natural phenomena.
For example, suppose someone made the claim that gum chewing causes heart disease. To back up this claim, they appealed to data showing that the more someone chews gum, the higher the probability of developing heart disease. The astute skeptic could claim that it's not the gum chewing per se that is causing the heart disease, but the fact that smokers tend to chew gum more often than non-smokers to mask the gross smell of tobacco smoke. If the person who made the original claim went back to the data, and included the number of cigarettes smoked per day as a component of a regression analysis, there would be a coefficient representing the independent influence of gum chewing, and ostensibly, the statistical test of that coefficient's difference from zero would fail to reject the null hypothesis.
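To make this concrete, here is a small simulation sketch (the variable names and effect sizes are made up, not real data) in which smoking drives both gum chewing and heart-disease risk; gum chewing looks predictive on its own, but not once smoking is controlled for:

set.seed(1)
cigarettes <- rnorm(200)                    # cigarettes smoked per day (standardized)
gum        <- 0.8*cigarettes + rnorm(200)   # smokers chew more gum
risk       <- 1.5*cigarettes + rnorm(200)   # only smoking affects risk

# gum appears to be a significant predictor on its own...
coef(summary(lm(risk ~ gum)))

# ...but not after controlling for cigarettes smoked
coef(summary(lm(risk ~ gum + cigarettes)))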
In this situation, the number of cigarettes smoked per day is called a confounding variable. The purpose of a carefully designed scientific experiment is to eliminate confounds, but as mentioned earlier, this is often not a luxury available in certain circumstances and domains.
For example, we are so sure that cigarette smoking causes heart disease that it would be unethical to design a controlled experiment in which we take two random samples of people, and ask one group to smoke and one group to just pretend to smoke. Sadly, cigarette companies know this, and they can plausibly claim that it isn't cigarette smoking that causes heart disease, but rather that the kind of people who eventually become cigarette smokers also engage in behaviors that increase the risk of heart disease—like eating red meat and not exercising—and that it's those variables that are making it appear as if smoking is associated with heart disease. Since we can't control for every potential confound that the cigarette companies can dream up, we may never be able to thwart this claim.
Anyhow, back to our two-predictor example: examine the R-squared value, and note how it is different now that we've included horsepower as an additional predictor. Our model now explains more of the variance in gas mileage. As a result, our predictions will, on average, be more accurate.
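For a quick comparison, you can pull the R-squared values out of the model summaries directly; something like this shows the jump from the weight-only model to the two-predictor model:

# R-squared for the weight-only model (about 0.75)...
summary(lm(mpg ~ wt, data=mtcars))$r.squared

# ...versus the two-predictor model (about 0.83, as in the summary above)
summary(lm(mpg ~ wt + hp, data=mtcars))$r.squared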
Let's predict what the gas mileage of a 2,500 pound car with a horsepower of 275 (horses?) might be:
> predict(model, newdata = data.frame(wt=2.5, hp=275))
       1 
18.79513 
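As a sanity check, you can reproduce that prediction by hand from the coefficients, since a prediction is just the weighted sum given by the regression equation:

b <- coef(model)
b["(Intercept)"] + b["wt"]*2.5 + b["hp"]*275   # 37.227 - 9.694 - 8.738 = 18.795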
Finally, we can explain the last line of the linear model summary: the one with the F-statistic and associated p-value. The F-statistic measures the ability of the model, as a whole, to explain any variance in the dependent variable. Since it has a sampling distribution (the F-distribution) and associated degrees of freedom, it yields a p-value, which can be interpreted as the probability that a model would explain this much (or more) of the variance of the dependent variable if the predictors had no predictive power. The fact that our model has a p-value lower than 0.05 suggests that our model predicts the dependent variable better than chance.
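If you want to get at these numbers programmatically, the summary object stores the F-statistic along with its degrees of freedom, and the p-value can be recomputed with pf(); a quick sketch:

fstat <- summary(model)$fstatistic     # named vector: value, numdf, dendf
fstat

# the upper-tail probability under the F-distribution gives the p-value
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail=FALSE)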
Now we can see why the p-value for the F-statistic in the simple linear regression was the same as the p-value of the t-statistic for the only non-intercept predictor: the tests were equivalent because there was only one source of predictive capability.
We can also see now why the p-value associated with our F-statistic in the multiple regression analysis output earlier is far lower than the p-values of the t-statistics of the individual predictors: the latter capture only the predictive power of each (one) predictor, while the former captures the predictive power of the model as a whole (all two).