Robust linear regression

Assuming that the data follows a Gaussian distribution is perfectly reasonable in many situations. By assuming Gaussianity, we are not necessarily saying that the data is really Gaussian; instead, we are saying that the Gaussian is a reasonable approximation for a given problem. The same applies to other distributions. As we saw in the previous chapter, this Gaussian assumption sometimes fails, for example, in the presence of outliers. We also learned that using a Student's t-distribution is an effective way to deal with outliers and obtain a more robust inference. The very same idea can be applied to linear regression.
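To get an intuition for why heavier tails make the inference more robust, we can compare the density that each likelihood assigns to a point far from the mean. This is just a quick sketch with SciPy; the point at 5 standard deviations is an arbitrary choice for illustration:

```python
from scipy import stats

# Density assigned to a point 5 standard deviations away from the mean.
# The Student's t (here with ν=2) gives such a point vastly more
# probability than the Gaussian, so a single outlier pulls the fit
# much less when we use a t likelihood.
p_norm = stats.norm(0, 1).pdf(5)
p_t = stats.t(df=2, loc=0, scale=1).pdf(5)
print(p_norm)           # on the order of 1e-6
print(p_t)              # on the order of 1e-2
print(p_t / p_norm)     # the t density is thousands of times larger
```

Under a Gaussian likelihood, the only way to make an outlier less improbable is to move the line toward it; the t-distribution can simply absorb it in its tails.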

To exemplify the robustness that a Student's t-distribution brings to a linear regression, we are going to use a very simple and nice dataset: the third data group from the Anscombe quartet. If you do not know what the Anscombe quartet is, remember to check it later at Wikipedia (https://en.wikipedia.org/wiki/Anscombe%27s_quartet). We can load it using pandas. We are going to center the data, just to make things easier for the sampler (even a cool sampler like NUTS needs a little help from time to time):

ans = pd.read_csv('../data/anscombe.csv')
x_3 = ans[ans.group == 'III']['x'].values
y_3 = ans[ans.group == 'III']['y'].values
x_3 = x_3 - x_3.mean()

Now, let's check what this little tiny dataset looks like:

_, ax = plt.subplots(1, 2, figsize=(10, 5))
beta_c, alpha_c = stats.linregress(x_3, y_3)[:2]
ax[0].plot(x_3, (alpha_c + beta_c * x_3), 'k',
           label=f'y = {alpha_c:.2f} + {beta_c:.2f} * x')
ax[0].plot(x_3, y_3, 'C0o')
ax[0].set_xlabel('x')
ax[0].set_ylabel('y', rotation=0)
ax[0].legend(loc=0)
az.plot_kde(y_3, ax=ax[1], rug=True)
ax[1].set_xlabel('y')
ax[1].set_yticks([])
plt.tight_layout()
Figure 3.10

Now, we are going to rewrite the previous model (model_g), but this time we are going to use a Student's t-distribution instead of a Gaussian. This change also introduces the need to specify the value of ν, the normality parameter. If you do not remember the role of this parameter, check Chapter 2, Programming Probabilistically, before continuing.

In the following model, we are using a shifted exponential to avoid values of ν close to zero, since the non-shifted exponential puts too much weight on such values. In my experience, the non-shifted version is fine for data with no to moderate outliers, but for data with extreme outliers (or data with only a few bulk points), like Anscombe's third dataset, it is better to avoid such low values. Take this, as well as other prior recommendations, with a pinch of salt. The defaults are good starting points, but there is no need to stick to them. Other common priors for ν are Gamma(2, 0.1) or Gamma(mu=20, sd=15):

with pm.Model() as model_t:
    α = pm.Normal('α', mu=y_3.mean(), sd=1)
    β = pm.Normal('β', mu=0, sd=1)
    ϵ = pm.HalfNormal('ϵ', 5)
    ν_ = pm.Exponential('ν_', 1/29)
    ν = pm.Deterministic('ν', ν_ + 1)

    y_pred = pm.StudentT('y_pred', mu=α + β * x_3,
                         sd=ϵ, nu=ν, observed=y_3)

    trace_t = pm.sample(2000)
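The shifted exponential prior used above can be sanity-checked with a quick NumPy simulation (a sketch; the sample size and seed are arbitrary). pm.Exponential is parameterized by its rate, so a rate of 1/29 implies a mean of 29, and adding 1 shifts the whole distribution so that ν can never drop below 1:

```python
import numpy as np

rng = np.random.default_rng(42)
# NumPy's exponential takes the scale (1/rate), so scale=29
# matches pm.Exponential('ν_', 1/29); the +1 is the shift.
nu_samples = rng.exponential(scale=29, size=100_000) + 1
print(nu_samples.min())   # never below 1, by construction
print(nu_samples.mean())  # close to 30 (29 from the exponential, plus the shift)
```

With this prior, values of ν near 1 (very heavy tails) are still reachable, but the pathological region right next to zero is excluded entirely.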

In Figure 3.11, we can see the robust fit, according to model_t, and the non-robust fit, according to SciPy's linregress (this function performs least-squares regression). As a bonus exercise, you may try adding the best-fit line obtained using model_g:

beta_c, alpha_c = stats.linregress(x_3, y_3)[:2]

plt.plot(x_3, (alpha_c + beta_c * x_3), 'k', label='non-robust', alpha=0.5)
plt.plot(x_3, y_3, 'C0o')
alpha_m = trace_t['α'].mean()
beta_m = trace_t['β'].mean()
plt.plot(x_3, alpha_m + beta_m * x_3, c='k', label='robust')

plt.xlabel('x')
plt.ylabel('y', rotation=0)
plt.legend(loc=2)
plt.tight_layout()
Figure 3.11

While the non-robust fit tries to compromise and include all points, the robust Bayesian model, model_t, automatically discards one point and fits a line that passes exactly through all the remaining points. I know this is a very peculiar dataset, but the message remains for more real and complex ones. A Student's t-distribution, due to its heavier tails, is able to give less importance to points that are far away from the bulk of the data.
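We can verify by hand that the robust fit is the line through the non-outlying points: remove the outlier ourselves and run ordinary least squares on the rest. The snippet below hardcodes the well-known Anscombe III values so it is self-contained (in the text they are loaded from the CSV file instead):

```python
import numpy as np
from scipy import stats

# Anscombe's third dataset (x is shared across groups I-III)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
              6.08, 5.39, 8.15, 6.42, 5.73])

# Drop the outlier at x=13 and fit least squares to the remaining
# points, centering x with the full-data mean as in the main text
mask = x != 13
x_c = x[mask] - x.mean()
slope, intercept = stats.linregress(x_c, y[mask])[:2]
print(slope, intercept)  # slope ≈ 0.35, intercept ≈ 7.11
```

These values agree with the posterior means of β and α reported below: the Student's t likelihood reaches, automatically, essentially the same answer as manually deleting the outlier.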

Before moving on, take a moment to contemplate the values of the parameters (I am omitting the intermediate parameter ν_ as it is not of direct interest):

az.summary(trace_t, var_names=['α', 'β', 'ϵ', 'ν'])

      mean   sd    mc error  hpd 3%  hpd 97%  eff_n   r_hat
α     7.11   0.00  0.0       7.11    7.12     2216.0  1.0
β     0.35   0.00  0.0       0.34    0.35     2156.0  1.0
ϵ     0.00   0.00  0.0       0.00    0.01     1257.0  1.0
ν     1.21   0.21  0.0       1.00    1.58     3138.0  1.0

As you can see, the values of α, β, and ϵ are very narrowly defined, even more so for ϵ, which is basically 0. This is totally reasonable, given that we are fitting a line to a perfectly aligned set of points (if we ignore the outlier).

Let's run a posterior predictive check to explore how well our model captures the data. We can let PyMC3 do the hard work of sampling from the posterior for us:

ppc = pm.sample_posterior_predictive(trace_t, samples=200, model=model_t, random_seed=2)

data_ppc = az.from_pymc3(trace=trace_t, posterior_predictive=ppc)
ax = az.plot_ppc(data_ppc, figsize=(12, 6), mean=True)
plt.xlim(0, 12)
Figure 3.12

For the bulk of the data, we get a very good match. Also notice that our model predicts values away from the bulk on both sides, not just above it. For our current purposes, this model is performing just fine and does not need further changes. Nevertheless, notice that for some problems we may want to avoid predictions below zero; in such a case, we should probably go back and change the model to restrict the possible values of y_pred to positive values.
