Student's t-distribution

As a general rule, Bayesians prefer to encode assumptions directly into the model by using different priors and likelihoods rather than through ad hoc heuristics such as outlier removal rules.

One very useful option when dealing with outliers and Gaussian distributions is to replace the Gaussian likelihood with a Student's t-distribution. This distribution has three parameters: the mean, the scale (analogous to the standard deviation), and the degrees of freedom, which is usually referred to with the Greek letter ν and can take values in the interval (0, ∞). Following Kruschke's nomenclature, we are going to call ν the normality parameter, since it is in charge of controlling how normal-like the Student's t-distribution is. For ν = 1, we get a distribution with very heavy tails, also known as the Cauchy or Lorentz distribution; this last name is especially popular among physicists. By heavy tails, we mean that it is more probable to find values far away from the mean than with a Gaussian; in other words, the values are not as concentrated around the mean as in a lighter-tailed distribution like the Gaussian. For example, 95% of the values from a Cauchy distribution lie between -12.7 and 12.7, whereas for a Gaussian with a standard deviation of one they lie between -1.96 and 1.96. At the other end of the parameter space, as ν approaches infinity, we recover the Gaussian distribution (you can't be more normal than the normal distribution, right?).

A very curious feature of the Student's t-distribution is that it has no defined mean when ν ≤ 1. Of course, in practice, any finite sample from a Student's t-distribution is just a bunch of numbers from which it is always possible to compute an empirical mean; it is the theoretical distribution itself that has no defined mean. Intuitively, this can be understood as follows: the tails of the distribution are so heavy that at any moment we could get a sampled value from almost anywhere on the real line, so if we keep drawing numbers, the empirical mean never converges to a fixed value; instead, the estimate keeps wandering around. Just try the following code several times (and then change df to a larger number, such as 100):

np.mean(stats.t(loc=0, scale=1, df=1).rvs(100)) 
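As a quick check of the 95% figures quoted above, we can compare the central quantiles of a Cauchy (a Student's t-distribution with ν = 1) against a standard Gaussian; this is just a minimal sketch using SciPy:

from scipy import stats

# central 95% interval of a Cauchy (Student's t with df=1): roughly (-12.7, 12.7)
print(stats.t(loc=0, scale=1, df=1).ppf([0.025, 0.975]))
# central 95% interval of a standard Gaussian: roughly (-1.96, 1.96)
print(stats.norm(loc=0, scale=1).ppf([0.025, 0.975]))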

In a similar fashion, the variance of this distribution is only defined for ν > 2. So, be careful: the scale of the Student's t-distribution is not the same as the standard deviation. For ν ≤ 2, the distribution has no defined variance and hence no defined standard deviation. The scale and the standard deviation become closer and closer as ν approaches infinity:

plt.figure(figsize=(10, 6))
x_values = np.linspace(-10, 10, 500)
for df in [1, 2, 30]:
    distri = stats.t(df)
    x_pdf = distri.pdf(x_values)
    plt.plot(x_values, x_pdf, label=fr'$\nu = {df}$', lw=3)

x_pdf = stats.norm.pdf(x_values)
plt.plot(x_values, x_pdf, 'k--', label=r'$\nu = \infty$')
plt.xlabel('x')
plt.yticks([])
plt.legend()
plt.xlim(-5, 5)

Figure 2.12
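The relation between the scale and the standard deviation can also be checked numerically. For a Student's t-distribution with scale 1 and ν > 2, the standard deviation is √(ν/(ν−2)), which approaches the scale as ν grows; here is a minimal sketch using SciPy:

from scipy import stats

# standard deviation of a Student's t with scale=1 for increasing ν;
# it is only defined for ν > 2 and approaches the scale (1) as ν grows
for df in [2.1, 5, 30, 1000]:
    print(df, stats.t(df=df, loc=0, scale=1).std())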

We are going to rewrite the previous model by replacing the Gaussian distribution with the Student's t-distribution:
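Written in statistical notation (a sketch reconstructed from the priors used in the PyMC3 code below), the model is:

$$
\begin{aligned}
\mu &\sim \mathrm{Uniform}(40, 75) \\
\sigma &\sim \mathrm{HalfNormal}(10) \\
\nu &\sim \mathrm{Exponential}(1/30) \\
y &\sim \mathrm{StudentT}(\mu, \sigma, \nu)
\end{aligned}
$$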

Because the Student's t-distribution has one more parameter (ν) than the Gaussian, we need to specify one more prior. We are going to set ν as an exponential distribution with a mean of 30. From Figure 2.12, we can see that a Student's t-distribution with ν = 30 looks pretty similar to a Gaussian (even though it is not one). In fact, from the same figure, we can see that most of the action happens for relatively small values of ν. Hence, we can say that the exponential prior with a mean of 30 is a weakly informative prior, telling the model that we more or less think ν should be around 30, but that it can move to smaller and larger values with ease. In many problems, estimating ν is not of direct interest. Graphically, we have:

Figure 2.13
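To get a more quantitative feeling for this prior, we can check how much mass an exponential distribution with a mean of 30 puts below a few reference values of ν; this is a quick sketch with SciPy, and the reference values are arbitrary:

from scipy import stats

# cumulative prior probability P(ν < value) for an exponential prior with
# mean 30 (SciPy parameterizes the exponential with the scale, i.e., the mean)
for value in [2, 10, 30, 100]:
    print(value, stats.expon(scale=30).cdf(value))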

As usual, PyMC3 allows us to write (and rewrite) models with just a few lines of code. The only cautionary word here is that the exponential distribution in PyMC3 is parameterized with the inverse of the mean (the rate), so an exponential prior with a mean of 30 is written as pm.Exponential('ν', 1/30):

with pm.Model() as model_t:
    μ = pm.Uniform('μ', 40, 75)
    σ = pm.HalfNormal('σ', sd=10)
    ν = pm.Exponential('ν', 1/30)
    y = pm.StudentT('y', mu=μ, sd=σ, nu=ν, observed=data)
    trace_t = pm.sample(1000)

az.plot_trace(trace_t)

Figure 2.14

Compare the trace from model_g (Figure 2.9) with the trace of model_t (Figure 2.14). Now, print the summary of model_t and compare it with the one from model_g. Before you keep reading, take a moment to spot the difference between both results. Did you notice something interesting?

az.summary(trace_t)

        mean   sd    mc error  hpd 3%  hpd 97%  eff_n   r_hat
μ       53.00  0.39  0.01      52.28   53.76    1254.0  1.0
σ       2.20   0.39  0.01      1.47    2.94     1008.0  1.0
ν       4.51   3.35  0.11      1.10    9.27     898.0   1.0

The estimation of μ is similar between both models, with a difference of ≈0.5. The estimation of σ changes from ≈3.5 to ≈2.2. This is a consequence of the Student's t-distribution giving less weight to (being less shocked by) values far away from the mean. We can also see that ν ≈ 4.5, that is, we get a distribution that is not very Gaussian-like, but one with heavier tails.
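A convenient way to compare the two posteriors side by side is a forest plot; this is a sketch that assumes trace_g (from model_g in the previous section) is still in memory:

import arviz as az

# forest plot of μ and σ under the Gaussian and Student's t models;
# assumes both trace_g and trace_t are available
az.plot_forest([trace_g, trace_t], model_names=['model_g', 'model_t'],
               var_names=['μ', 'σ'])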

Now, we are going to do a posterior predictive check of the Student's t model and compare it with the Gaussian model:

y_ppc_t = pm.sample_posterior_predictive(
    trace_t, 100, model_t, random_seed=123)
y_pred_t = az.from_pymc3(trace=trace_t, posterior_predictive=y_ppc_t)
ax = az.plot_ppc(y_pred_t, figsize=(12, 6), mean=False)
ax[0].legend(fontsize=15)
plt.xlim(40, 70)

Using the Student's t-distribution in our model leads to predictive samples that seem to fit the data better, both in terms of the location of the peak of the distribution and of its spread. Notice how the samples extend far away from the bulk of the data, and how a few of the predictive samples look very flat. This is a direct consequence of the Student's t-distribution expecting to see data points far away from the mean or bulk of the data, and it is the reason we set xlim to [40, 70]:

Figure 2.15

The Student's t-distribution allows us to obtain a more robust estimate because outliers have the effect of decreasing ν, instead of pulling the mean toward them and inflating the standard deviation. Thus, the mean and the scale are estimated by giving more weight to the data points in the bulk of the data than to those far from it. Once again, it is important to remember that the scale is not the standard deviation. Nevertheless, the scale is related to the spread of the data: the lower its value, the more concentrated the distribution. Also, for values of ν that are not too small, the scale tends to be pretty close (at least for most practical purposes) to the standard deviation estimated after removing the outliers. So, as a rule of thumb, and keeping in mind that it is not theoretically fully correct, we can treat the scale of a Student's t-distribution as a reasonable practical proxy for the standard deviation of the data after removing outliers.
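To get a feeling for this rule of thumb, we can compare the posterior mean of σ from model_t with the empirical standard deviation of the data, with and without its most extreme observations; this is only a sketch, and the 2% trim on each side is an arbitrary, purely illustrative choice:

import numpy as np

# posterior mean of the Student's t scale parameter (assumes trace_t from above)
print(trace_t['σ'].mean())

# empirical standard deviation of the full dataset (assumes data from above)
print(np.std(data))

# empirical standard deviation after trimming the most extreme observations;
# the 2% cut on each side is arbitrary and purely illustrative
lower, upper = np.percentile(data, [2, 98])
print(np.std(data[(data > lower) & (data < upper)]))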
