One more example

Once again, we have chemical shift data. This data comes from a set of protein molecules I have personally prepared. To be precise, we should say the chemical shifts are from the Cα nuclei of proteins, as this is an observable we measure only for certain types of atomic nuclei. Proteins are made from sequences of 20 building blocks known as amino acid residues. Each amino acid can appear in the sequence zero or more times, and sequences can vary from a few amino acids to hundreds or even thousands of them. Each amino acid has one and only one Cα, so we can confidently associate each chemical shift with a particular amino acid residue in a protein. Furthermore, each one of these 20 amino acids has different chemical properties that contribute to the biological properties of the protein: some can carry a net electric charge while others can only be neutral; some like to be surrounded by water molecules while others prefer the company of the same or similar types of amino acids; and so on. The key aspect is that they are similar but not equal, and hence it may sound reasonable, or even natural, to frame any chemical shift-related inference in terms of 20 groups, as defined by the amino acid types. You can learn more about proteins with this excellent video: https://www.youtube.com/watch?v=wvTv8TqWC48.

For the sake of this example, I am simplifying things a little bit here. In practice, experiments are messy, and there is always a chance that we do not get a complete record of chemical shifts. One common problem is signal overlap, that is, the experiment does not have enough resolution to distinguish two or more close signals. For this example, I have simply removed those cases, so we will assume that the dataset is complete.

In the following code block, we load the data into a DataFrame; please take a moment to inspect it. You will see four columns: the first one is the ID of a protein—if you feel curious, you can access a lot of information about a protein by using its ID at https://www.rcsb.org/. The second column contains the name of the amino acid, using the standard three-letter code, while the remaining two columns correspond to the theoretically computed chemical shift values (obtained from quantum chemical computations) and the experimentally measured chemical shifts. The motivation for this example is to compare the two in order to assess, among other things, how well the theoretical computations reproduce the experimental measurements. For that reason, we compute the difference between the theoretical and experimental values:

import numpy as np
import pandas as pd
import pymc3 as pm
import arviz as az
cs_data = pd.read_csv('../data/chemical_shifts_theo_exp.csv')
diff = cs_data.theo.values - cs_data.exp.values  # theoretical minus experimental shifts
idx = pd.Categorical(cs_data['aa']).codes  # integer code (0-19) for each amino acid
groups = len(np.unique(idx))
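If you want a quick look at what we just loaded, something like the following works (just a sketch for inspection; apart from aa, theo, and exp, the column names are whatever the CSV file uses):

print(cs_data.head())  # first few rows: protein ID, amino acid, theoretical and experimental shifts
print(groups, 'amino acid groups,', len(diff), 'observed differences')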

To see the difference between a hierarchical and a non-hierarchical model, we are going to build two models. The first one is basically the same as the comparing_groups model:

with pm.Model() as cs_nh:
    μ = pm.Normal('μ', mu=0, sd=10, shape=groups)
    σ = pm.HalfNormal('σ', sd=10, shape=groups)

    y = pm.Normal('y', mu=μ[idx], sd=σ[idx], observed=diff)

    trace_cs_nh = pm.sample(1000)
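Before moving on, you may want a quick numerical summary of the 20 per-group estimates; ArviZ's summary function is one way to get it (the exact columns shown depend on your ArviZ version):

az.summary(trace_cs_nh, var_names=['μ'])  # posterior mean, sd, and credible interval per amino acid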

Now, we will build the hierarchical version of the model. We are adding two hyper-priors: one for the mean of μ and one for the standard deviation of μ. We are leaving σ without hyper-priors. This is just a modeling choice; I am deciding on a simpler model for pedagogical purposes. You may face a problem where this seems unacceptable and you may consider it necessary to add a hyper-prior for σ; feel free to do that:

with pm.Model() as cs_h:
    # hyper-priors
    μ_μ = pm.Normal('μ_μ', mu=0, sd=10)
    σ_μ = pm.HalfNormal('σ_μ', 10)

    # priors
    μ = pm.Normal('μ', mu=μ_μ, sd=σ_μ, shape=groups)
    σ = pm.HalfNormal('σ', sd=10, shape=groups)

    y = pm.Normal('y', mu=μ[idx], sd=σ[idx], observed=diff)

    trace_cs_h = pm.sample(1000)
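In statistical notation, the model we just wrote can be sketched as follows (this is just a transcription of the priors above, with j indexing the 20 amino acid groups, i the individual observations, and the second argument of each Normal being a standard deviation, as with PyMC3's sd):

\begin{aligned}
\mu_\mu &\sim \mathcal{N}(0,\ 10) \\
\sigma_\mu &\sim \text{HalfNormal}(10) \\
\mu_j &\sim \mathcal{N}(\mu_\mu,\ \sigma_\mu) \\
\sigma_j &\sim \text{HalfNormal}(10) \\
y_i &\sim \mathcal{N}(\mu_{j[i]},\ \sigma_{j[i]})
\end{aligned}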

We are going to compare the results using ArviZ's plot_forest function. We can pass more than one model to this function, which is useful when we want to compare the values of parameters from different models, as in the present example. Notice that we are passing several arguments to plot_forest to get the plot that we want, such as combined=True to merge the results from all the chains. I invite you to explore the rest of the arguments:

_, axes = az.plot_forest([trace_cs_nh, trace_cs_h],
                         model_names=['n_h', 'h'],
                         var_names='μ', combined=True, colors='cycle')
y_lims = axes[0].get_ylim()
axes[0].vlines(trace_cs_h['μ_μ'].mean(), *y_lims)

OK, so what do we have in Figure 2.22? We have a plot of the 40 estimated means, one per amino acid (20) times two because we have two models. We also have their 94% credible intervals and the interquartile ranges (the central 50% of each distribution). The vertical (black) line is the global mean according to the hierarchical model. This value is close to zero, as expected if the theoretical values reproduce the experimental ones well.

The most relevant part of this plot is that the estimates from the hierarchical model are pulled toward the partially-pooled mean, or equivalently, they are shrunken with respect to the unpooled estimates. You will also notice that the effect is more noticeable for the groups farther away from the mean (such as group 13), and that the uncertainty is on par with or smaller than that of the non-hierarchical model. The estimates are partially pooled because we have one estimate for each group, but the estimates for individual groups restrict each other through the hyper-prior.
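If you want to verify this shrinkage numerically rather than visually, a minimal sketch is to compare how far each group's posterior mean sits from the estimated global mean under both models (this assumes the traces defined above; PyMC3 traces can be indexed by variable name):

# distance of each per-group posterior mean from the hierarchical global mean
global_mean = trace_cs_h['μ_μ'].mean()
dist_nh = np.abs(trace_cs_nh['μ'].mean(axis=0) - global_mean)  # non-hierarchical model
dist_h = np.abs(trace_cs_h['μ'].mean(axis=0) - global_mean)    # hierarchical model
print((dist_h < dist_nh).sum(), 'of', groups, 'group means moved closer to the global mean')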

Therefore, we get an intermediate situation between having a single group (all chemical shifts together) and having 20 separate groups (one per amino acid). And that, ladies, gentlemen, and non-binary and gender-fluid people, is the beauty of hierarchical models:

Figure 2.22

Paraphrasing the Zen of Python, we can certainly say: hierarchical models are one honking great idea—let's do more of those! In the following chapters, we will keep building hierarchical models and learning how to use them to build better models. We will also discuss how hierarchical models are related to the pervasive overfitting/underfitting issue in statistics and machine learning in Chapter 5, Model Comparison. In Chapter 8, Inference Engines, we will discuss some technical problems that we may encounter when sampling from hierarchical models and how to diagnose and fix them.
