How well does the line fit in the data?

Although we know that the trend line is the best fitting among the possible linear trend lines, we don't know how well it fits the actual data. The significance of the regression parameters is obtained by testing the null hypothesis that the given parameter equals zero. The F-test in the output pertains to the hypothesis that all the regression coefficients (except the intercept) are simultaneously zero. In a nutshell, it tests the significance of the regression in general. A p-value below 0.05 can be interpreted as "the regression line is significant." Otherwise, there is not much point in fitting the regression model at all.
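
For instance, assuming the model.0 object fitted earlier in this chapter, you can extract the F-statistic from the model summary and reproduce its p-value by hand from the F distribution:

> f <- summary(model.0)$fstatistic  # named vector: value, numdf, dendf
> pf(f['value'], f['numdf'], f['dendf'], lower.tail = FALSE)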

However, even a significant F-value does not tell you much about the fit of the regression line. We have seen that residuals characterize the error of the fit. The R-squared coefficient summarizes them into a single measure. R-squared is the proportion of the variance in the response variable explained by the regression. Mathematically, it is defined as the variance in the predicted Y values, divided by the variance in the observed Y values.
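
As a sanity check, you can reproduce R-squared from this definition yourself. The following minimal sketch assumes a fitted lm object such as model.0:

> y <- model.0$model[[1]]        # the observed response used in the fit
> var(fitted(model.0)) / var(y)  # variance explained by the regression
> 1 - sum(residuals(model.0)^2) / sum((y - mean(y))^2)  # equivalent form

Both expressions return the same value as summary(model.0)$r.squared.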

Note

In some cases, despite the significant F-test, the predictors, according to the R-squared, explain only a small proportion (<10 percent) of the total variance. You can interpret this by saying that although the predictors have a statistically significant effect on the response, the response is formed by a mechanism that is much more complex than your model suggests. This phenomenon is common in the areas of medicine and biology, where complex biological processes are modeled, while it is less common in econometrics, where macro-level, aggregated variables are modeled, which usually smooth out small variations in the data.

If we use the population size as the only predictor in our air pollution example, the R-squared equals 0.37, so we can say that 37 percent of the variation in SO2 concentration can be explained by the size of the city:

> model.0 <- update(model.0, data = usair[-31, ])
> summary(model.0)[c('r.squared', 'adj.r.squared')]
$r.squared
[1] 0.3728245
$adj.r.squared
[1] 0.3563199

After adding the number of manufacturers to the model, the R-squared increases considerably, from 0.37 to 0.64:

> summary(model.2)[c('r.squared', 'adj.r.squared')]
$r.squared
[1] 0.6433317
$adj.r.squared
[1] 0.6240523

Note

It's important to note here that every time you add an extra predictor to your model, the R-squared increases simply because you have more information to predict the response, even if the most recently added predictor has no important effect. Consequently, a model with more predictors may appear to have a better fit just because it is bigger.
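
You can demonstrate this effect with a predictor that is pure random noise. The junk variable and the usair2 copy below are made up for this illustration; model.2 and usair come from the chapter:

> set.seed(42)                        # for reproducibility
> usair2 <- usair[-31, ]              # the same rows the models were fitted on
> usair2$junk <- rnorm(nrow(usair2))  # a predictor with no real effect
> model.junk <- update(model.2, . ~ . + junk, data = usair2)
> summary(model.junk)[c('r.squared', 'adj.r.squared')]

The R-squared of model.junk cannot be lower than that of model.2, while the adjusted R-squared is expected to drop.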

The solution is to use the adjusted R-squared, which also takes the number of predictors into account. In the previous example, not only the R-squared but also the adjusted R-squared shows a clear advantage in favor of the latter model.
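
The adjustment itself is a simple function of the sample size and the number of predictors: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1). A minimal sketch, assuming the model.2 object from above, reproduces the adjusted value reported by summary():

> r2 <- summary(model.2)$r.squared
> n  <- nrow(model.2$model)  # number of observations used in the fit
> p  <- 2                    # number of predictors in model.2
> 1 - (1 - r2) * (n - 1) / (n - p - 1)  # matches adj.r.squared above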

The two previous models are nested, which means that the extended model contains every predictor of the first one. Unfortunately, the adjusted R-squared is not a sound basis for comparing non-nested models. If you have non-nested models, you can use the Akaike Information Criterion (AIC) to select the best model.

AIC is founded on information theory. It introduces a penalty term for the number of parameters in the model, which addresses the problem of bigger models tending to appear better fitted. When using this criterion, you should select the model with the lowest AIC. As a rule of thumb, two models are essentially indistinguishable if the difference between their AIC values is less than 2. In the example that follows, we have two plausible alternative models. Taking the AIC into account, model.4 is better than model.3, as its advantage over model.3 is about 10:

> summary(model.3 <- update(model.2, .~. -x2 + x1))$coefficients 
             Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) 77.429836 19.463954376  3.978114 3.109597e-04
x3           0.021333  0.004221122  5.053869 1.194154e-05
x1          -1.112417  0.338589453 -3.285444 2.233434e-03

> summary(model.4 <- update(model.2, .~. -x3 + x1))$coefficients 
               Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) 64.52477966 17.616612780  3.662723 7.761281e-04
x2           0.02537169  0.003880055  6.539004 1.174780e-07
x1          -0.85678176  0.304807053 -2.810899 7.853266e-03

> AIC(model.3, model.4)
        df      AIC
model.3  4 336.6405
model.4  4 326.9136
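
If you are curious where these numbers come from, the AIC of such a model is simply minus twice the maximized log-likelihood plus twice the number of estimated parameters, which you can verify by hand:

> ll <- logLik(model.4)
> -2 * as.numeric(ll) + 2 * attr(ll, 'df')  # reproduces AIC(model.4)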

Note

Note that AIC tells you nothing about the quality of a model in an absolute sense; your best model may still fit poorly. Neither does it provide a test of model fit. It is essentially a tool for ranking different models.
