Let's start with a real and illuminating example of confounding. Suppose that we would like to predict the amount of air pollution based on the size of the city (measured as population in thousands of inhabitants). Air pollution is measured by the sulfur dioxide (SO2) concentration in the air, in milligrams per cubic meter. We will use the US air pollution data set (Hand and others 1994) from the gamlss.data package:
> library(gamlss.data)
> data(usair)
Let's draw our very first linear regression model by building a formula. The lm function from the stats package is used to fit linear models, an important tool for regression modeling:
> model.0 <- lm(y ~ x3, data = usair)
> summary(model.0)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.545 -14.456  -4.019  11.019  72.549 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.868316   4.713844   3.791 0.000509 ***
x3           0.020014   0.005644   3.546 0.001035 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.67 on 39 degrees of freedom
Multiple R-squared:  0.2438,    Adjusted R-squared:  0.2244 
F-statistic: 12.57 on 1 and 39 DF,  p-value: 0.001035
Formula notation is one of the best features of R, which lets you define flexible models in a human-friendly way. A typical model has the form response ~ terms, where response is the continuous response variable, and terms provides one or a series of numeric variables that specify a linear predictor for the response.
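As a quick, self-contained illustration of the notation (on dummy data, not on usair), here are a few standard variations of the formula syntax:

```r
# Standard R formula syntax on a small dummy data set
set.seed(1)
d <- data.frame(y = rnorm(10), a = rnorm(10), b = rnorm(10))

names(coef(lm(y ~ a + b, data = d)))  # two additive terms plus intercept
names(coef(lm(y ~ a * b, data = d)))  # main effects and the a:b interaction
names(coef(lm(y ~ a - 1, data = d)))  # -1 removes the intercept
```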
In the preceding example, the variable y denotes air pollution, while x3 stands for the population size. The coefficient of x3 says that a one unit (one thousand inhabitants) increase in the population size is associated with a 0.02 unit (0.02 milligram per cubic meter) increase in the sulfur dioxide concentration, and the effect is statistically significant with a p-value of 0.001035.
In general, the intercept is the value of the response variable when each predictor equals 0, but in this example there are no cities without inhabitants, so the intercept (17.87) has no direct interpretation. The two regression coefficients define the regression line:
> plot(y ~ x3, data = usair, cex.lab = 1.5)
> abline(model.0, col = "red", lwd = 2.5)
> legend('bottomright', legend = 'y ~ x3', lty = 1, col = 'red',
+   lwd = 2.5, title = 'Regression line')
As you can see, the intercept (17.87) is the value at which the regression line crosses the y-axis. The other coefficient (0.02) is the slope of the regression line: it measures how steep the line is. Here, the line runs uphill because the slope is positive (y increases as x3 increases); similarly, a negative slope would run downhill.
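If an interpretable intercept is desired, one common trick is to center the predictor. A minimal sketch on the built-in cars data set (the same idea applies to usair):

```r
# Centering the predictor gives the intercept a direct meaning:
# the expected response at the average predictor value.
fit.raw <- lm(dist ~ speed, data = cars)                   # built-in data
fit.ctr <- lm(dist ~ I(speed - mean(speed)), data = cars)  # centered copy

all.equal(unname(coef(fit.ctr)[1]), mean(cars$dist))            # TRUE: intercept = mean response
all.equal(unname(coef(fit.raw)[2]), unname(coef(fit.ctr)[2]))   # TRUE: slope unchanged
```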
You can easily understand how the estimates were obtained if you realize how the line was drawn: it is the line that best fits the data points. Here, best fit means the linear least-squares criterion, which is why the model is also known as ordinary least squares (OLS) regression.
The least-squares method finds the best fitting line by minimizing the sum of the squares of the residuals, where the residuals represent the error, which is the difference between the observed value (an original dot in the scatterplot) and the fitted or predicted value (a dot on the line with the same x-value):
> usair$prediction <- predict(model.0)
> usair$residual <- resid(model.0)
> plot(y ~ x3, data = usair, cex.lab = 1.5)
> abline(model.0, col = 'red', lwd = 2.5)
> segments(usair$x3, usair$y, usair$x3, usair$prediction,
+   col = 'blue', lty = 2)
> legend('bottomright', legend = c('y ~ x3', 'residuals'),
+   lty = c(1, 2), col = c('red', 'blue'), lwd = 2.5,
+   title = 'Regression line')
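In the single-predictor case, this minimization has a well-known closed-form solution: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). A quick sanity check on toy data (the same formulas reproduce the estimates of any simple lm fit):

```r
# Closed-form least-squares estimates versus lm() on toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

fit <- lm(y ~ x)
all.equal(c(intercept, slope), unname(coef(fit)))  # TRUE
```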
The term linear in linear regression refers to the fact that we model a linear relationship, which is more natural, easier to understand, and simpler to handle mathematically than more complex alternatives.
On the other hand, if we aim to model a more complex mechanism by separating the effect of the population size from the effect of the presence of industry, we have to control for the variable x2, which describes the number of manufacturers employing more than 20 workers. Now, we can either create a new model with lm(y ~ x3 + x2, data = usair), or use the update function to refit the previous model:
> model.1 <- update(model.0, . ~ . + x2)
> summary(model.1)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.389 -12.831  -1.277   7.609  49.533 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 26.32508    3.84044   6.855 3.87e-08 ***
x3          -0.05661    0.01430  -3.959 0.000319 ***
x2           0.08243    0.01470   5.609 1.96e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.49 on 38 degrees of freedom
Multiple R-squared:  0.5863,    Adjusted R-squared:  0.5645 
F-statistic: 26.93 on 2 and 38 DF,  p-value: 5.207e-08
Now, the coefficient of x3 is -0.06! While the crude association between air pollution and city size was positive in the previous model, after controlling for the number of manufacturers, the association becomes negative: a one thousand increase in the population is associated with a 0.06 unit decrease in the SO2 concentration, and the effect is statistically significant.
At first sight, this change of sign from positive to negative may be surprising, but it is rather plausible after a closer look: it's not the population size itself, but rather the level of industrialization that affects air pollution directly. In the first model, population size showed a positive effect because it implicitly measured industrialization as well. When we hold industrialization fixed, the effect of the population size becomes negative: growing a city at a fixed industrialization level spreads the same pollution over a wider area.
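This kind of sign flip is easy to reproduce on simulated data, where we know the true mechanism. The following sketch (not the usair data) builds a confounder z that drives both x and y:

```r
# Simulated confounding: x's true direct effect on y is -1, but the
# confounder z makes the crude association positive.
set.seed(42)
z <- rnorm(200)                         # confounder (think: industrialization)
x <- z + rnorm(200, sd = 0.5)           # predictor correlated with z
y <- 2 * z - x + rnorm(200, sd = 0.5)   # direct effect of x is -1

coef(lm(y ~ x))["x"]      # crude slope: positive
coef(lm(y ~ x + z))["x"]  # adjusted slope: close to the true -1
```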
So, we can conclude that x2 is a confounder here, as it biases the association between y and x3. Although it is beyond the scope of our current research question, we can interpret the coefficient of x2 as well: holding the city size constant, a one unit increase in the number of manufacturers increases the SO2 concentration by 0.08 mg per cubic meter.
Based on the model, we can predict the expected value of the response for any combination of predictors. For example, we can predict the expected level of sulfur dioxide concentration for a city with 400,000 inhabitants and 150 manufacturers, each employing more than 20 workers:
> as.numeric(predict(model.1, data.frame(x2 = 150, x3 = 400)))
[1] 16.04756
You could also calculate the prediction yourself: multiply each value by its slope, then add the intercept. All these numbers are simply copied from the previous model summary:
> -0.05661 * 400 + 0.08243 * 150 + 26.32508
[1] 16.04558
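More generally, predict() is just the design row multiplied by the coefficient vector. A sketch on the built-in cars data set (the arithmetic for model.1 is identical, with one extra term):

```r
# A prediction is the intercept plus each predictor times its slope
fit <- lm(dist ~ speed, data = cars)

p1 <- unname(predict(fit, data.frame(speed = 21)))
p2 <- sum(c(1, 21) * coef(fit))   # design row (1, 21) times coefficients

all.equal(p1, p2)  # TRUE
```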
If you have two predictors, the regression is represented by a plane in three-dimensional space, which can easily be shown via the scatterplot3d package:
> library(scatterplot3d)
> plot3d <- scatterplot3d(usair$x3, usair$x2, usair$y, pch = 19,
+   type = 'h', highlight.3d = TRUE, main = '3-D Scatterplot')
> plot3d$plane3d(model.1, lty = 'solid', col = 'red')
As it's rather hard to interpret this plot, let's draw the two-dimensional projections of this 3D graph, which might prove more informative after all. Here, the value of the third, non-displayed variable is held at zero:
> par(mfrow = c(1, 2))
> plot(y ~ x3, data = usair, main = '2D projection for x3')
> abline(model.1, col = 'red', lwd = 2.5)
> plot(y ~ x2, data = usair, main = '2D projection for x2')
> abline(lm(y ~ x2 + x3, data = usair), col = 'red', lwd = 2.5)
In line with the changed sign of the slope, it's worth noting that the y ~ x3 regression line has also changed: from uphill, it became downhill.