Linear regression with continuous predictors

Let's start with a real and illuminating example of confounding. Suppose we would like to predict the amount of air pollution based on the size of a city, measured as population in thousands of inhabitants. Air pollution is measured by the sulfur dioxide (SO2) concentration in the air, in micrograms per cubic meter. We will use the US air pollution data set (Hand and others 1994) from the gamlss.data package:

> library(gamlss.data)
> data(usair)

Model interpretation

Let's fit our very first linear regression model by building a formula. The lm function from the stats package fits linear models and is the core tool for regression modeling:

> model.0 <- lm(y ~ x3, data = usair)
> summary(model.0)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.545 -14.456  -4.019  11.019  72.549 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.868316   4.713844   3.791 0.000509 ***
x3           0.020014   0.005644   3.546 0.001035 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.67 on 39 degrees of freedom
Multiple R-squared:  0.2438,    Adjusted R-squared:  0.2244 
F-statistic: 12.57 on 1 and 39 DF,  p-value: 0.001035

Note

Formula notation is one of the best features of R, as it lets you define flexible models in a human-friendly way. A typical model has the form response ~ terms, where response is the continuous response variable and terms is one variable, or a series of variables combined with +, that specifies the linear predictor for the response.
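
As a minimal illustration of this notation (just a sketch, not tied to the air pollution example), a formula is an ordinary R object that you can store, inspect, and pass to lm later:

> f <- y ~ x3
> class(f)        # the object is of class 'formula'
> all.vars(f)     # the variable names used in the formula: "y" and "x3"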

In the preceding example, the variable y denotes air pollution, while x3 stands for the population size. The coefficient of x3 says that a one unit (one thousand inhabitants) increase in the population size is associated with a 0.02 unit (0.02 micrograms per cubic meter) increase in the sulfur dioxide concentration, and the effect is statistically significant with a p-value of 0.001035.
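
If you need these estimates programmatically rather than reading them off the printed summary, the coef and confint functions extract the point estimates and their 95 percent confidence intervals from the fitted model object; a quick sketch:

> coef(model.0)      # the intercept (17.87) and the slope of x3 (0.02)
> confint(model.0)   # 95 percent confidence intervals for both coefficients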

Note

See more details on the p-value in the How well does the line fit to the data? section. To keep it simple for now, we will refer to models as statistically significant when the p value is below 0.05.
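
If you prefer to extract the p-values programmatically instead of reading them from the printed output, they are stored in the coefficient matrix returned by summary; a minimal sketch:

> summary(model.0)$coefficients[, 'Pr(>|t|)']   # p-values of the intercept and x3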

In general, the intercept is the value of the response variable when each predictor equals 0, but in this example there are no cities without inhabitants, so the intercept (17.87) doesn't have a direct interpretation. The two regression coefficients define the regression line:

> plot(y ~ x3, data = usair, cex.lab = 1.5)
> abline(model.0, col = "red", lwd = 2.5)
> legend('bottomright', legend = 'y ~ x3', lty = 1, col = 'red',
+   lwd = 2.5, title = 'Regression line')
Figure: Scatterplot of y against x3 with the fitted regression line

As you can see, the intercept (17.87) is the value at which the regression line crosses the y-axis. The other coefficient (0.02) is the slope of the regression line: it measures how steep the line is. Here, the line runs uphill because the slope is positive (y increases as x3 increases); if the slope were negative, the line would run downhill.
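
You can verify these two roles directly from the fitted model: predicting at x3 = 0 reproduces the intercept, and the difference between two predictions that are one unit apart equals the slope; a small sketch:

> p <- predict(model.0, newdata = data.frame(x3 = c(0, 1)))
> p[1]       # fitted value at x3 = 0, that is, the intercept (17.87)
> diff(p)    # change in y for a one unit increase in x3, that is, the slope (0.02)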

You can easily understand how the estimates were obtained once you realize how the line was drawn: it is the line that best fits the data points. Here, best fit is defined by the linear least-squares criterion, which is why the model is also known as ordinary least squares (OLS) regression.

The least-squares method finds the best fitting line by minimizing the sum of the squares of the residuals, where the residuals represent the error, which is the difference between the observed value (an original dot in the scatterplot) and the fitted or predicted value (a dot on the line with the same x-value):

> usair$prediction <- predict(model.0)
> usair$residual <- resid(model.0)
> plot(y ~ x3, data = usair, cex.lab = 1.5)
> abline(model.0, col = 'red', lwd = 2.5)
> segments(usair$x3, usair$y, usair$x3, usair$prediction,
+   col = 'blue', lty = 2)
> legend('bottomright', legend = c('y ~ x3', 'residuals'),
+   lty = c(1, 2), col = c('red', 'blue'), lwd = 2.5,
+   title = 'Regression line')
Figure: Scatterplot of y against x3 with the fitted regression line and the residuals drawn as dashed vertical segments
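
To check numerically that these estimates indeed minimize the sum of squared residuals, you can compare the residual sum of squares with the value reported by deviance, and reproduce the coefficients with the closed-form least-squares solution for a single predictor; a short sketch using the columns we just added to usair:

> sum(usair$residual^2)               # residual sum of squares of model.0
> deviance(model.0)                   # the same value, returned directly
> with(usair, cov(x3, y) / var(x3))   # closed-form slope: cov(x, y) / var(x)
> with(usair, mean(y) - cov(x3, y) / var(x3) * mean(x3))   # closed-form intercept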

The linear term in linear regression refers to the fact that we model a linear relationship, which is more natural, easier to interpret, and simpler to handle mathematically than more complex, nonlinear alternatives.

Multiple predictors

On the other hand, if we aim to model a more complex mechanism by separating the effect of the population size from the effect of industry, we have to control for the variable x2, which describes the number of manufacturers employing more than 20 workers. We can either create a new model with lm(y ~ x3 + x2, data = usair), or use the update function to refit the previous model:

> model.1 <- update(model.0, . ~ . + x2)
> summary(model.1)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.389 -12.831  -1.277   7.609  49.533 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 26.32508    3.84044   6.855 3.87e-08 ***
x3          -0.05661    0.01430  -3.959 0.000319 ***
x2           0.08243    0.01470   5.609 1.96e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.49 on 38 degrees of freedom
Multiple R-squared:  0.5863,    Adjusted R-squared:  0.5645 
F-statistic: 26.93 on 2 and 38 DF,  p-value: 5.207e-08

Now, the coefficient of x3 is -0.06! While the crude association between air pollution and city size was positive in the previous model, after controlling for the number of manufacturers, the association becomes negative. This means that an increase of one thousand inhabitants in the population is associated with a 0.06 unit decrease in the SO2 concentration, which is a statistically significant effect.

At first sight, this change of sign from positive to negative may be surprising, but it is rather plausible after a closer look: it is not the population size but the level of industrialization that affects the air pollution directly. In the first model, population size showed a positive effect because it implicitly measured industrialization as well. Once we hold industrialization fixed, the effect of the population size becomes negative: growing a city at a fixed level of industrialization spreads the air pollution over a wider area.

So, we can conclude that x2 is a confounder here, as it biases the association between y and x3. Although it is beyond the scope of our current research question, we can interpret the coefficient of x2 as well: holding the city size constant, a one unit increase in the number of manufacturers is associated with a 0.08 micrograms per cubic meter increase in the SO2 concentration.
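
The sign flip is easy to demonstrate by putting the crude and the adjusted coefficients of x3 side by side; a quick comparison:

> cbind(crude = coef(model.0)['x3'], adjusted = coef(model.1)['x3'])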

Based on the model, we can predict the expected value of the response for any combination of predictors. For example, we can predict the expected sulfur dioxide concentration for a city with 400,000 inhabitants and 150 manufacturers, each employing more than 20 workers:

> as.numeric(predict(model.1, data.frame(x2 = 150, x3 = 400)))
[1] 16.04756

You could also calculate the prediction by hand, multiplying the predictor values by the corresponding slopes and adding the intercept; all these numbers are simply copied from the previous model summary:

> -0.05661 * 400 + 0.08243 * 150 + 26.32508
[1] 16.04558
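
Beyond a single point estimate, predict can also quantify the uncertainty around a prediction; for instance, a 95 percent prediction interval for the same hypothetical city can be requested via the interval argument (a sketch, output omitted):

> predict(model.1, data.frame(x2 = 150, x3 = 400), interval = 'prediction')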

Note

Prediction outside the range of the data is known as extrapolation. The further the values are from the data, the riskier your prediction becomes. The problem is that you cannot check model assumptions (for example, linearity) outside of your sample data.
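
A simple safeguard is to compare the values you want to predict for with the observed range of each predictor; a quick check:

> range(usair$x3)   # observed range of the population size
> range(usair$x2)   # observed range of the number of manufacturers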

If you have two predictors, the regression model is represented by a plane in three-dimensional space, which can easily be shown via the scatterplot3d package:

> library(scatterplot3d)
> plot3d <- scatterplot3d(usair$x3, usair$x2, usair$y, pch = 19,
+   type = 'h', highlight.3d = TRUE, main = '3-D Scatterplot') 
> plot3d$plane3d(model.1, lty = 'solid', col = 'red')
Figure: 3D scatterplot of x3, x2, and y with the fitted regression plane

As it's rather hard to interpret this plot, let's draw the two-dimensional projections of this 3D graph, which might prove to be more informative. In each panel, the value of the third, non-displayed variable is held at zero:

> par(mfrow = c(1, 2))
> plot(y ~ x3, data = usair, main = '2D projection for x3')
> abline(model.1, col = 'red', lwd = 2.5)
> plot(y ~ x2, data = usair, main = '2D projection for x2')
> abline(lm(y ~ x2 + x3, data = usair), col = 'red', lwd = 2.5)
Figure: 2D projections of the regression plane against x3 and against x2

In line with the changed sign of the slope, it is worth noting that the y ~ x3 regression line has also changed: it now runs downhill instead of uphill.
