Model assumptions

Linear regression models with standard estimation techniques make a number of assumptions about the outcome variable, the predictor variables, and also about their relationship:

  1. Y is a continuous variable (not binary, nominal, or ordinal)
  2. The errors (the residuals) are statistically independent
  3. There is a stochastic linear relationship between Y and each X
  4. Y has a normal distribution, holding each X fixed
  5. Y has the same variance, regardless of the fixed value of the Xs

A violation of assumption 2 occurs in trend analysis if we use time as the predictor: since consecutive years are not independent of each other, neither are the errors. For example, if mortality from a specific illness is high in one year, we can expect mortality to be high in the next year as well.
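Serially correlated errors like these can be detected with the Durbin-Watson test; the following is a minimal sketch with simulated data, assuming the lmtest package (not used elsewhere in this section) is installed:

> library(lmtest)
> set.seed(42)
> year <- 1:50
> # a random walk yields serially correlated values over time
> mortality <- 100 + cumsum(rnorm(50))
> dwtest(lm(mortality ~ year))

A test statistic well below 2 with a small p-value signals positive autocorrelation among the residuals.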

A violation of assumption 3 means that the relationship between Y and X deviates systematically from the linear trend line. Assumptions 4 and 5 require the conditional distribution of Y to be normal and to have the same variance, regardless of the fixed value of the Xs; they are needed for inferences from the regression (confidence intervals, F- and t-tests). Assumption 5 is known as the homoscedasticity assumption; when it is violated, we speak of heteroscedasticity.
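Both assumptions are usually checked on the residuals; here is a quick sketch on a simulated model, using base R's Shapiro-Wilk test and the Breusch-Pagan test from the lmtest package:

> library(lmtest)
> set.seed(7)
> x   <- rnorm(100)
> y   <- 2 * x + rnorm(100)
> fit <- lm(y ~ x)
> shapiro.test(resid(fit))  # normality of the residuals (assumption 4)
> bptest(fit)               # heteroscedasticity test (assumption 5)

In both tests, a small p-value indicates a violation of the corresponding assumption.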

The following plot helps to visualize these assumptions with a simulated dataset:

> library(Hmisc)
> library(ggplot2)
> library(gridExtra)
> set.seed(7)
> # trim the 25 most extreme values from both tails of the sample
> x  <- sort(rnorm(1000, 10, 100))[26:975]
> y  <- x * 500 + rnorm(950, 5000, 20000)
> df <- data.frame(x = x, y = y, cuts = factor(cut2(x, g = 5)),
+                  resid = resid(lm(y ~ x)))
> scatterPl <- ggplot(df, aes(x = x, y = y)) +
+    geom_point(aes(colour = cuts, fill = cuts), shape = 1,
+      show.legend = FALSE) + geom_smooth(method = lm, level = 0.99)
> plot_left <- ggplot(df,  aes(x = y, fill = cuts)) +
+    geom_density(alpha = .5) + coord_flip() + scale_y_reverse()
> plot_right <- ggplot(data = df, aes(x = resid, fill = cuts)) +
+    geom_density(alpha = .5) + coord_flip()
> grid.arrange(plot_left, scatterPl, plot_right,
+    ncol=3, nrow=1, widths=c(1, 3, 1))
[Figure: Model assumptions — conditional densities of Y (left), the scatterplot with the fitted regression line and its 99% confidence band (center), and conditional densities of the residuals (right)]

Tip

The code bundle, available for download from the Packt Publishing homepage, includes a slightly longer code chunk for the preceding plot with some tweaks on the plot margins, legends, and titles. The preceding code block focuses on the major parts of the visualization, without wasting too much space in the printed book on the style details.

We will discuss how to assess the model assumptions in more detail in Chapter 9, From Big to Smaller Data. If some of the assumptions fail, a possible solution is to look for outliers: if you have one, rerun the regression analysis without that observation and determine how the results differ. Ways of detecting outliers will be discussed in more detail in Chapter 8, Polishing Data.

The following example illustrates that dropping an outlier (observation number 31) may render the assumptions valid. To quickly verify whether a model's assumptions are satisfied, we will use the gvlma package.
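Here, model.1 refers to the regression model fitted earlier in this chapter; judging by the coefficients printed below, a minimal reconstruction (assuming the usair dataset introduced before) might look like this:

> model.1 <- lm(y ~ x3 + x2, data = usair)

Passing the fitted model to gvlma runs a global test of the assumptions along with four component tests: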

> library(gvlma)
> gvlma(model.1)

Coefficients:
(Intercept)           x3           x2  
   26.32508     -0.05661      0.08243  

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

                     Value  p-value                   Decision
Global Stat        14.1392 0.006864 Assumptions NOT satisfied!
Skewness            7.8439 0.005099 Assumptions NOT satisfied!
Kurtosis            3.9168 0.047805 Assumptions NOT satisfied!
Link Function       0.1092 0.741080    Assumptions acceptable.
Heteroscedasticity  2.2692 0.131964    Assumptions acceptable.

It seems that three out of the five assumptions are not satisfied. However, if we build the very same model on the same dataset excluding the 31st observation, we get much better results:

> model.2 <- update(model.1, data = usair[-31, ])
> gvlma(model.2)

Coefficients:
(Intercept)           x3           x2  
   22.45495     -0.04185      0.06847  

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

                    Value p-value                Decision
Global Stat        3.7099  0.4467 Assumptions acceptable.
Skewness           2.3050  0.1290 Assumptions acceptable.
Kurtosis           0.0274  0.8685 Assumptions acceptable.
Link Function      0.2561  0.6128 Assumptions acceptable.
Heteroscedasticity 1.1214  0.2896 Assumptions acceptable.

This suggests that we must always exclude the 31st observation from the dataset when building regression models in the sections that follow.

However, it's important to note that it is not acceptable to drop an observation just because it is an outlier. Before you decide, investigate the particular case. If it turns out that the outlier is due to incorrect data, you should drop it. Otherwise, run the analysis both with and without it, and state in your research report how the results changed and why you decided to exclude the extreme values.
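Base R offers simple tools for such an investigation; as a minimal sketch, assuming the model.1 object fitted above, Cook's distance quantifies how much each observation influences the fitted model:

> cd <- cooks.distance(model.1)
> which.max(cd)               # the single most influential observation
> which(cd > 4 / length(cd))  # a common rule-of-thumb cutoff of 4/n

Observations flagged this way are candidates for closer inspection, not for automatic removal.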

Note

You can fit a line to any set of data points; the least squares method will find the optimal solution, and the trend line will be interpretable. The regression coefficients and the R-squared coefficient remain meaningful even if the model assumptions fail. The assumptions are only needed if you want to interpret the p-values or if you aim to make good predictions.
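As a quick illustration, even a line fitted to pure noise returns coefficients and an R-squared value, although they will, of course, be close to zero here:

> set.seed(1)
> noise <- data.frame(x = runif(25), y = runif(25))
> fit.noise <- lm(y ~ x, data = noise)
> coef(fit.noise)                # coefficients are computed regardless
> summary(fit.noise)$r.squared  # as is the R-squared value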
