Shrinkage

To show you one of the main consequences of hierarchical models, I will require your assistance, so please join me in a brief experiment. I will need you to print and save the summary returned by az.summary(trace_h). Then, I want you to re-run the model two more times after making small changes to the synthetic data, keeping a record of the summary each time (a sketch of this workflow follows the list below). In total, we will have three runs:

  • One run setting all the elements of G_samples to 18
  • One run setting all the elements of G_samples to 3
  • One last run setting one element to 18 and the other two to 3
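
If you want a concrete starting point, here is a minimal sketch of one run, assuming PyMC3 and a hierarchical beta-binomial model like the one built earlier in the chapter; the exact priors on the hyper-parameters μ and κ are assumptions here, so adapt them to whatever specification produced your trace_h:

import numpy as np
import pymc3 as pm
import arviz as az

# Synthetic data: three groups of 30 samples each.
# Edit G_samples for each run: [18, 18, 18], [3, 3, 3], and [18, 3, 3].
N_samples = [30, 30, 30]
G_samples = [18, 18, 18]

# One Bernoulli observation per sample, plus an index mapping samples to groups
group_idx = np.repeat(np.arange(len(N_samples)), N_samples)
data = np.concatenate([np.repeat([1, 0], [g, n - g])
                       for g, n in zip(G_samples, N_samples)])

with pm.Model() as model_h:
    # Hyper-priors shared by all groups (assumed specification)
    μ = pm.Beta('μ', 1., 1.)
    κ = pm.HalfNormal('κ', 10)
    # One θ per group, drawn from the common beta prior
    θ = pm.Beta('θ', alpha=μ * κ, beta=(1.0 - μ) * κ, shape=len(N_samples))
    y = pm.Bernoulli('y', p=θ[group_idx], observed=data)
    trace_h = pm.sample(2000)

# Print (and save) the summary for this run
print(az.summary(trace_h))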

Before continuing, please take a moment to think about the outcome of this experiment. Focus on the estimated mean value of θ in each experiment. Based on the first two runs of the model, could you predict the outcome for the third case?

If we put the result in a table, we get something more or less like this; remember that small variations could occur due to the stochastic nature of the sampling process:

G_samples     θ (mean)
18, 18, 18    0.6, 0.6, 0.6
3, 3, 3       0.11, 0.11, 0.11
18, 3, 3      0.55, 0.13, 0.13

In the first row, we can see that for a dataset of 18 good samples out of 30 we get a mean value for θ of 0.6; remember that the mean of θ is now a vector of three elements, one per group. Then, on the second row, we have only 3 good samples out of 30 and the mean of θ is 0.11. On the last row, we get something interesting and probably unexpected. Instead of getting a mix of the mean estimates of θ from the other rows, such as 0.6, 0.11, and 0.11, we get different values, namely 0.55, 0.13, and 0.13. What on earth happened? Was it a convergence problem or some error in the model specification? Nothing like that: the estimates have been shrunk toward the common mean. And this is totally OK; indeed, it is just a consequence of our model. By using hyper-priors, we are estimating the parameters of the beta prior distribution from the data, so each group informs the rest and, in turn, is informed by the estimates of the others. Put more succinctly, the groups are effectively sharing information through the hyper-prior. As a result, we observe what is known as shrinkage. This effect is the consequence of partially pooling the data: we are modeling the groups neither as independent from each other, nor as a single big group. Instead, we have something in the middle.
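
A quick back-of-the-envelope comparison makes this pull explicit. The following sketch contrasts the unpooled per-group proportions for the third run with the posterior means reported in the table above (those values are approximate, so small differences are expected):

import numpy as np

# Third run: one group with 18/30 good samples, two with 3/30
G_samples = np.array([18, 3, 3])
N = 30

raw = G_samples / N                    # unpooled proportions: [0.6, 0.1, 0.1]
hier = np.array([0.55, 0.13, 0.13])    # posterior means reported in the table
pooled = G_samples.sum() / (3 * N)     # proportion over all 90 samples: ~0.27

# The hierarchical estimates lie between the unpooled proportions
# and the fully pooled one: they are shrunk toward the common mean.
print(raw, hier, pooled)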

Why is this useful? Because shrinkage contributes to more stable inferences. This is, in many ways, similar to what we saw with the Student's t-distribution and the outliers, where using a heavy-tailed distribution resulted in a model that is more robust, or less responsive, to data points far from the mean. Introducing hyper-priors, and hence inference at a higher level, results in a more conservative model (probably the first time I've used conservative in a flattering way), one that is less responsive to extreme values in individual groups. To illustrate this, imagine that the sample sizes differ from one neighborhood to the next, some small, some larger; the smaller the sample size, the easier it is to get bogus results. At the extreme, if you take only one sample in a given neighborhood, you may just hit the only really old lead pipe in the whole neighborhood or, on the contrary, the only one made out of PVC. In one case you will overestimate the bad quality, and in the other underestimate it. Under a hierarchical model, the misestimation of one group will be ameliorated by the information provided by the other groups. Of course, a larger sample size would also do the trick but, more often than not, that is not an option.

The amount of shrinkage depends, of course, on the data: a group with more data will pull the estimates of the other groups harder than a group with fewer data points. If several groups are similar and one group is different, the similar groups will inform the others of their similarity and reinforce a common estimate, while pulling the estimate of the less similar group toward them; this is exactly what we saw in the previous example.

The hyper-priors also play a part in modulating the amount of shrinkage. If we have trustworthy information about the group-level distribution, we can use an informative hyper-prior to shrink our estimates toward some reasonable value. Nothing prevents us from building a hierarchical model with just two groups, but we would prefer to have several. Intuitively, the reason is that getting shrinkage is like treating each group as a single data point at a higher level, where we are estimating the standard deviation across groups. Generally, we do not trust an estimate based on too few data points unless we have a strong prior to inform it. Something similar is true for a hierarchical model.
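
One way to build intuition for this is to look at how the hyper-parameter κ controls the beta prior placed over the group-level θ values, using the same μ, κ parameterization of the beta distribution used in the model: the larger κ is, the more concentrated the prior is around the common mean μ, and hence the stronger the shrinkage. Here is a short sketch with arbitrary illustrative values:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0, 1, 200)
μ = 0.3                      # illustrative common mean
for κ in [2, 10, 50]:        # illustrative concentration values
    # Same μ, κ parameterization of the beta distribution used by the model
    plt.plot(x, stats.beta(μ * κ, (1 - μ) * κ).pdf(x), label=f'κ = {κ}')
plt.xlabel('θ')
plt.legend()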

We may also be interested in seeing what the estimated prior looks like. One way to do this is as follows:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0, 1, 100)

# Plot the beta prior implied by 100 random posterior draws of μ and κ
for i in np.random.randint(0, len(trace_h), size=100):
    u = trace_h['μ'][i]
    k = trace_h['κ'][i]
    pdf = stats.beta(u * k, (1.0 - u) * k).pdf(x)
    plt.plot(x, pdf, 'C1', alpha=0.2)

# Plot the beta prior evaluated at the posterior means of μ and κ
u_mean = trace_h['μ'].mean()
k_mean = trace_h['κ'].mean()
dist = stats.beta(u_mean * k_mean, (1.0 - u_mean) * k_mean)
pdf = dist.pdf(x)
mode = x[np.argmax(pdf)]
mean = dist.moment(1)
plt.plot(x, pdf, lw=3, label=f'mode = {mode:.2f} mean = {mean:.2f}')

plt.yticks([])
plt.legend()
plt.xlabel('$θ_{prior}$')
plt.tight_layout()

Figure 2.21

This was supposed to be the last example in this chapter, but by popular demand, I will present one more hierarchical model as an encore, so please join me.
