There are two issues with the overall shape of the regression data that should be diagnosed:
- Non-linearity: When the true relationship between a predictor and the dependent variable follows a shape other than a straight line.
- Heteroscedasticity: When the relationship may well be linear, but the residuals from the regression are uneven in size across the range of the data.
The following sections discuss the differences between these issues, how to diagnose them, and how to remedy them.
Two Types of Data Shape Issues
In
Figure 13.14 A fundamentally nonlinear pattern, we see an example of
non-linearity.
Here, a straight line would not be the correct shape to describe the
data. You would need a mathematical shape that looks like the data
seen here: one that rises, falls, and then rises again. It happens
that a cubic equation does exactly that, so you would fit a cubic
equation to the data.
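To make this concrete, here is a small Python sketch (the book's own examples use SAS; the data here are made up for illustration). Fitting a straight line to exactly cubic data leaves residuals that swing systematically from negative to positive and back again, rather than scattering randomly:

```python
# Made-up data following a rise-fall-rise (cubic) pattern: y = x^3 - 3x.
xs = [i / 10 for i in range(-20, 21)]          # x from -2.0 to 2.0
ys = [x ** 3 - 3 * x for x in xs]

# Ordinary least squares for a straight line y = intercept + slope * x.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Residual = actual y minus the straight line's prediction.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# The residuals wave systematically: negative at the far left, positive
# in the left-middle, negative in the right-middle, positive at the right.
print(round(residuals[0], 2), round(residuals[10], 2),
      round(residuals[30], 2), round(residuals[40], 2))
```

Long same-sign runs of residuals like these are exactly the signature that residual plots reveal when the true shape is not a straight line.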
In
Figure 13.15 Heteroscedasticity (straight line but with uneven residuals), we see an example of data in a fan shape, where you could
probably not do better than a straight line. However, the straight
line fits the data better at low levels of the independent variable,
where the data are packed into a closely defined line, than at higher
levels of the predictor, where the data have spread out. As you can
see, the residuals (the extent to which the straight line misses the
actual data) get bigger from left to right. This type of issue
– of which this is only one example – is known as
heteroscedasticity. The
absence of heteroscedasticity – where the data are evenly distributed
along the straight line – is known as homoscedasticity.
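To see numerically what “residuals getting bigger from left to right” means, here is an illustrative Python sketch on made-up fan-shaped data (the numbers are invented; the book works in SAS). The average absolute residual in the upper half of the predictor's range comes out several times larger than in the lower half:

```python
# Made-up fan-shaped data: the scatter around the true line y = 2 + 0.5x
# widens as x grows (the deviation alternates +/- and is proportional to x).
xs = list(range(1, 41))
ys = [2 + 0.5 * x + (0.1 * x if i % 2 == 0 else -0.1 * x)
      for i, x in enumerate(xs)]

# Ordinary least squares straight-line fit.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Compare average absolute residual in the lower vs. upper half of x.
half = n // 2
low_spread = sum(abs(r) for r in residuals[:half]) / half
high_spread = sum(abs(r) for r in residuals[half:]) / half
print(round(low_spread, 2), round(high_spread, 2))
```

A markedly larger spread at one end of the range, as here, is the numeric counterpart of the fan shape in the figure.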
With the differences between the patterns established, the next sections
discuss the effect these issues can have on your regression.
The Effects of Non-Linearity or Heteroscedasticity
Finding these data issues has the following implications:
- Non-linearity is a fundamentally interesting finding, which suggests that your initial linear regression is not quite right as a model. Usually, you need to change the fundamental mathematical shape against which you are comparing the data, and compare the data against the new model.
- Heteroscedasticity, on the other hand, is usually a less serious issue, because the correct underlying shape is a straight line. Usually, the sizes of the slopes – the crucial issue in regression – are not overly affected. The fact that the residuals are uneven across the length of the regression line does, however, sometimes affect confidence intervals and p-values, and therefore your accuracy assessments.
It is therefore important
to diagnose and distinguish between non-linearity and heteroscedasticity.
The following section discusses the common first diagnostic test,
namely analyzing residual plots.
Using Residual Plots to Diagnose Data Shape Issues
How do you diagnose
non-linear or heteroscedastic patterns? In
Figure 13.14 A fundamentally nonlinear pattern and
Figure 13.15 Heteroscedasticity (straight line but with uneven residuals), I used examples
based on raw data plots of one independent variable against the
dependent variable at a time. However, although you should certainly
plot these raw variable pairs prior to running a regression, they
do not always reveal data shape issues. The reason is that in a
multiple regression the influences of several independent variables
are considered simultaneously, so patterns can look different once
the other variables are taken into account.
Therefore, we need a
plot that assesses the data shapes after all variables have been considered
together as a set. This is made possible through residual
plots in regression. The next sections discuss
what residual plots are and how to diagnose them.
Understanding Residual Plots
After running an initial
regression, SAS and other software produce plots of how each data
point misses the regression line, i.e. residual plots. There is one
overall residual plot for the whole regression equation (see the left
panel of
Figure 13.16 Example of the main residual plot and a partial residual plot for the case
example) and another “partial” residual plot for each
independent variable (see below for the example for trust).
The point of any residual
graph is that it shows how the data miss the straight line,
which is represented by the horizontal zero (0) line. Points close
to the zero line lie close to the regression line.
Note that the main regression
residual graph is a slightly different thing from the partial plots,
although substantially the same idea. This graph is an overall view
of the residuals. The horizontal axis (predicted levels of the dependent
variable) essentially summarizes each observation’s values on all the
independent variables simultaneously. The vertical axis gives the residuals.
So how do you diagnose
the residual plots? The following subsections discuss this.
“Good” Residual Plots
Where the residuals are scattered evenly and randomly around the zero
line across the whole plot, there is not likely any non-linear or
heteroscedastic pattern.
Non-Linearity in Residual Plots
To diagnose non-linearity,
it is perhaps helpful to consider a few more hypothetical non-linear
shapes.
Figure 13.19 Examples of nonlinear shapes shows just
a few, in which we see increasing or decreasing growth trends in data,
curvilinearity (where the dependent variable is high or low for both
low and high values of the independent variable) and a cubic shape
as seen before.
If non-linearity exists, perhaps the key thing to look for in residual
plots is a systematic pattern with waves or curves that mimic those seen in
Figure 13.19 Examples of nonlinear shapes, or others. One clue is when the residuals lie entirely above
the 0 line in some parts of a residual plot and entirely below it in others.
If you examine the SAS
output, I do not believe you will find any obvious non-linear trends
in the plots for the case example, but there are fan shapes in the
residuals, which are discussed next as possibly diagnostic of
heteroscedasticity.
Heteroscedasticity in Residual Plots
Unlike non-linearity,
recall that heteroscedasticity has to do with the unequal distribution
of residuals around a straight line. As stated, if the residuals are
of consistent size throughout the range of the variables, then the data
are homoscedastic (which is good). If the residuals are bigger over
some ranges of the relationship than others, there could be a problem.
Figure 13.20 Examples of heteroscedasticity shows examples of residual plots with possible heteroscedasticity
issues. For example, the left panel shows a situation where the regression
line is increasingly inaccurate at larger values of the dependent
variable (which we can see from increasingly large residuals).
Other Tests for Data Shapes
Residual plots are not
your only option. There are also formal statistical tests of shapes
in residuals, such as the “spec” (specification) test.
This book does not discuss these further; the interested reader can
pursue them in more specialized texts.
Specifically with regard
to non-linear shapes, it is also possible that you hypothesize in
advance that your data may have such a shape. It would not then be
unusual to run an initial linear regression and see if the residuals
reflect your alternate form.
Remedies for Non-Linearity
If there is an obvious
nonlinear pattern such as curvilinearity, you should use slightly
different types of lines to try to fit the data. The process is,
however, exactly the same as in linear regression, i.e. find an appropriate
mathematical line that is the same shape as the data seem to be, and
fit that equation to the data (e.g. you would fit the mathematical
equation for a parabola to the data in the second panel of
Figure 13.19 Examples of nonlinear shapes). You do have
to know the appropriate mathematical form of the line, but these are
easy to look up. Details of how to achieve this are beyond the scope
of this book; the interested reader should read further in intermediate
regression and statistics texts.
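As a small illustration of the mechanics (outside the book's SAS workflow, and on made-up data), here is a Python sketch that fits a parabola by least squares using the normal equations. Because the data below follow an exact parabola, the fit recovers the quadratic coefficients, something no straight line could do:

```python
# Made-up curvilinear data: an exact parabola y = 1 + 2x - 0.5x^2.
xs = [i / 2 for i in range(-8, 9)]             # x from -4.0 to 4.0
ys = [1 + 2 * x - 0.5 * x * x for x in xs]

def solve3(a, b):
    """Solve a 3x3 linear system a * coef = b by Gaussian elimination."""
    m = [row[:] + [v] for row, v in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    coef = [0.0] * 3
    for r in (2, 1, 0):
        coef[r] = (m[r][3] - sum(m[r][c] * coef[c]
                                 for c in range(r + 1, 3))) / m[r][r]
    return coef

# Normal equations for the least squares fit of y = b0 + b1*x + b2*x^2.
sx = [sum(x ** p for x in xs) for p in range(5)]        # sums of x^0..x^4
sy = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]
a = [[sx[0], sx[1], sx[2]],
     [sx[1], sx[2], sx[3]],
     [sx[2], sx[3], sx[4]]]
b0, b1, b2 = solve3(a, sy)
print(round(b0, 4), round(b1, 4), round(b2, 4))   # recovers 1, 2, -0.5
```

The only change from straight-line regression is the shape of the equation being fitted; the least squares principle is identical.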
Remedies for Heteroscedasticity
As stated above, heteroscedasticity
implies that a straight line is mostly acceptable, but that the residuals
are a bit “off”.
First, do not overreact
to heteroscedasticity! It does not tend to affect estimation of the
important things (notably slopes) much; it mainly affects accuracy
measures, and even then only a little unless the situation is particularly
bad. Therefore, you do not necessarily need to apply remedies unless
the residuals are markedly uneven.
I do suggest the following
as checks and solutions:
- Transformations: Using mathematical transformations of the data can help in the case of some heteroscedasticity. For instance, in many cases researchers find that logging variables helps with certain types of heteroscedasticity. For an example, open, look at, and run “Code13b Regression logged”, where I have logged all the continuous variables. There are many types of transformations that may help, and a lot more to learn on this topic – the interested reader should follow up in more advanced texts.
- Rely on bootstrapped confidence intervals: Since heteroscedasticity only really affects the confidence intervals and p-values, using the bootstrap procedure described in Chapter 12 to get more accurate confidence intervals often mitigates the problem.
- Weighted regression: There are various weighting procedures that can down-weight observations with larger residuals. These procedures are often not quite as good as the right transformation or bootstrapping, but they can be useful.
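To show the bootstrap remedy in action (the book's bootstrap procedure is in SAS, described in Chapter 12; this is an illustrative Python sketch on made-up heteroscedastic data), resample the (x, y) pairs with replacement, refit the slope each time, and read the confidence interval off the percentiles of the resampled slopes. Resampling whole pairs makes no assumption that the residuals are even, which is why it is robust here:

```python
import random

random.seed(42)   # fixed seed so the illustration is reproducible

# Made-up heteroscedastic data: true line y = 3 + 1.5x, with noise
# whose standard deviation grows with x (a fan shape).
xs = [1 + 0.2 * i for i in range(50)]
ys = [3 + 1.5 * x + random.gauss(0, 0.3 * x) for x in xs]

def ols_slope(pairs):
    """Ordinary least squares slope for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return (sum((x - mx) * (y - my) for x, y in pairs)
            / sum((x - mx) ** 2 for x, _ in pairs))

data = list(zip(xs, ys))
# Resample whole (x, y) pairs with replacement and refit the slope.
boot = sorted(ols_slope(random.choices(data, k=len(data)))
              for _ in range(2000))
lo, hi = boot[49], boot[1949]    # approximate 2.5th / 97.5th percentiles
print(round(lo, 2), round(hi, 2))
```

Note that the slope estimate itself is usually close to its ordinary value; what the bootstrap corrects is the width and accuracy of the interval around it.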
This book does not cover
these solutions any further; refer to more advanced texts for more
help with them.