Multicollinearity or when the correlation is too high

In the previous example, we saw how a multiple linear regression model reacts to redundant variables, and we saw the importance of considering possible confounding variables. Now, we will take the previous example to an extreme and see what happens when two variables are highly correlated. To study this problem and its consequences for inference, we will use the same synthetic data and model as before, but now we will increase the degree of correlation between x_1 and x_2 by reducing the amount of Gaussian noise we add to x_1 to obtain x_2:

np.random.seed(42)
N = 100
x_1 = np.random.normal(size=N)
x_2 = x_1 + np.random.normal(size=N, scale=0.01)
y = x_1 + np.random.normal(size=N)
X = np.vstack((x_1, x_2)).T

This change in the data-generating code is practically equivalent to adding zero to x_1, and hence both variables are, for all practical purposes, equal. You can later try varying the value of the scale and using less extreme values (a short sketch for this follows Figure 3.23), but for now we want to make things crystal clear. After generating the new data, check what the scatter plot looks like:

scatter_plot(X, y)
Figure 3.23

You should see something like Figure 3.23; the scatter plot for x_1 and x_2 is virtually a straight line with a slope of around 1.
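
If you want to experiment with less extreme values of the scale, as suggested above, the following is a minimal sketch (the specific scale values are arbitrary choices, not values used in the text) showing how the correlation between x_1 and x_2 weakens as the noise grows:

# Sketch: how the correlation between x_1 and x_2 depends on the noise scale.
# The scale values below are arbitrary examples.
for scale in [1, 0.5, 0.1, 0.01]:
    x_2_tmp = x_1 + np.random.normal(size=N, scale=scale)
    print(f'scale={scale}: corr={np.corrcoef(x_1, x_2_tmp)[0, 1]:.3f}')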

We then run a multiple linear regression:

with pm.Model() as model_red:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=10, shape=2)
    ϵ = pm.HalfCauchy('ϵ', 5)

    μ = α + pm.math.dot(X, β)

    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)

    trace_red = pm.sample(2000)

Then, we check the results for the parameters with a forest plot:

az.plot_forest(trace_red, var_names=['β'], combined=True, figsize=(8, 2))
Figure 3.24

The HPD for the β coefficients is suspiciously wide. We can get a clue about what is going on with a scatter plot of the β coefficients:

az.plot_pair(trace_red, var_names=['β'])
Figure 3.25

Wow! The joint posterior for the β coefficients is a really narrow diagonal. When one β coefficient goes up, the other must go down; the two are strongly (negatively) correlated. This is just a consequence of the model and the data. According to our model, the mean μ is:

μ = α + β_1 x_1 + β_2 x_2

If we assume x_1 and x_2 are not just practically equivalent, but mathematically identical (call the common variable x), we can rewrite the model as:

μ = α + (β_1 + β_2) x

It turns out that it is the sum of β_1 and β_2, and not their separate values, that affects μ. We can make β_1 smaller and smaller as long as we make β_2 correspondingly larger, keeping their sum fixed. In practice we do not have two x variables, and thus we do not really have two β parameters either. We say that the model is indeterminate (or, equivalently, that the data is unable to restrict the parameters of the model). In our example, there are two reasons why β_1 and β_2 cannot freely move over the whole real line. First, both variables are almost the same, but they are not exactly equal; and second, and most importantly, we have a prior restricting the plausible values that β_1 and β_2 can take.
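
We can check this directly from the posterior samples. The following is a minimal sketch (it assumes, as is the case for a PyMC3 trace, that trace_red['β'] returns the samples as an array of shape (number of draws, 2)): each coefficient on its own has a wide posterior, while their sum is much more tightly constrained.

# Sketch: the sum β_1 + β_2 is well constrained even though each
# coefficient individually is not.
β_samples = trace_red['β']
print('sd of each coefficient:', β_samples.std(axis=0))
print('sd of their sum:       ', β_samples.sum(axis=1).std())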

There are a couple of things to notice from this example. First of all, the posterior is just the logical consequence of our data and model, and hence there is nothing wrong with obtaining such wide distributions for β. C'est la vie. Second, we can still rely on this model to make predictions. Try, for example, running posterior predictive checks (see the sketch below); the values predicted by the model are in agreement with the data, so the model is capturing the data very well. Third, this may not be a very good model for understanding our problem. It may be wiser to simply remove one of the variables from the model. We would end up with a model that predicts the data as well as before, but with an easier (and simpler) interpretation.
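
As a minimal sketch of such a posterior predictive check (using PyMC3's sample_posterior_predictive and ArviZ's plot_ppc; the exact calls may vary slightly between library versions):

# Sketch: posterior predictive check for model_red
with model_red:
    ppc = pm.sample_posterior_predictive(trace_red)
    idata = az.from_pymc3(trace=trace_red, posterior_predictive=ppc)
az.plot_ppc(idata)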

In any real dataset, correlations are going to exist to some degree. How strongly do two or more variables need to be correlated before they become a problem? Well, 0.9845. No, just kidding. Unfortunately, statistics is a discipline with very few magic numbers. You can always compute a correlation matrix before running any Bayesian model and check for variables with a high correlation, let's say above 0.9 or so. Nevertheless, the problem with this approach is that what really matters is not the pairwise correlations we can observe in a correlation matrix, but the correlation of the variables inside a model; and as we have already seen, variables behave differently in isolation than when they are put together in a model. Two or more variables can increase or decrease their correlation when put in the context of other variables in a multiple regression model. As always, careful inspection of the posterior, together with an iterative, critical approach to model building, is highly recommended and can help us to spot problems and understand the data and models.
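
For completeness, here is a minimal sketch of that first check, computing the pairwise correlation matrix of the predictors in the X array defined above with NumPy:

# Sketch: pairwise correlation matrix of the predictors in X
corr_matrix = np.corrcoef(X, rowvar=False)
print(corr_matrix)  # off-diagonal entries close to 1 signal near-redundant variables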

Just as a quick guide, what should we do if we find highly correlated variables?

  • If the correlation is really high, we can eliminate one of the variables from the analysis; given that both variables carry similar information, which one we eliminate is often irrelevant. We can choose on grounds of pure convenience, such as removing the variable that is less familiar in our discipline, or the one that is harder to interpret or measure.
  • Another possibility is to create a new variable by averaging the redundant variables. A more sophisticated version of this is to use a variable-reduction algorithm such as Principal Component Analysis (PCA). The problem with PCA is that the resulting variables are linear combinations of the original ones, which, in general, obfuscates the interpretability of the results.
  • Yet another solution is to use stronger priors that restrict the plausible values the coefficients can take; a minimal sketch of this option follows this list. In Chapter 6, Model Comparison, we briefly discuss some choices for such priors, known as regularizing priors.
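
As a sketch of that last option (the sd=1 value here is an arbitrary illustrative choice, not a recommendation from the text), we could refit the same model with a tighter prior on the coefficients:

# Sketch: the same model with a tighter (more regularizing) prior on β.
# sd=1 is an arbitrary illustrative choice.
with pm.Model() as model_reg:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=1, shape=2)
    ϵ = pm.HalfCauchy('ϵ', 5)
    μ = α + pm.math.dot(X, β)
    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)
    trace_reg = pm.sample(2000)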