Gaussian inferences

Nuclear magnetic resonance (NMR) is a powerful technique used to study molecules and also living things such as humans or yeast (because, after all, we are just a bunch of molecules). NMR allows you to measure different kinds of observable quantities that are related to unobservable but interesting molecular properties. One of these observables is known as the chemical shift; we can only get chemical shifts for the nuclei of certain types of atoms.

All of this is in the domain of quantum chemistry and the details are irrelevant for this discussion, but let me explain the name chemical shift: shift because we measure a signal shift from a reference value, and chemical because the shift is somehow related to the chemical environment of the nuclei we are measuring. For all we care at the moment, we could have been measuring the height of a group of people, the average time to travel back home, the weight of bags of oranges, or even a discrete variable like the number of sexual partners of the Tokay gecko, if we are willing to approximate this number as a continuous variable, something that maybe makes sense for very promiscuous geckos! As I am not an ethologist, nor a paparazzo, I do not have a clue whether this is a good approximation. But keep in mind that we do not need our data to be truly Gaussian (or any other distribution, for that matter): we only demand that a Gaussian is a reasonable approximation to our data. For this example, we have 48 chemical shift values that we can load into a NumPy array and plot using the following code:

import arviz as az
import matplotlib.pyplot as plt
import numpy as np

data = np.loadtxt('../data/chemical_shifts.csv')
az.plot_kde(data, rug=True)
plt.yticks([0], alpha=0)  # hide the meaningless y-tick

The KDE plot of this dataset shows a Gaussian-like distribution, except for two data points that are far away from the mean:

Figure 2.7

Let's forget about those two points for a moment and assume that a Gaussian distribution is a proper description of the data. Since we do not know the mean or the standard deviation, we must set priors for both of them. Therefore, a reasonable model could be:

μ ∼ Uniform(l, h)
σ ∼ HalfNormal(σ_σ)
y ∼ Normal(μ, σ)

Thus, μ comes from a uniform distribution with boundaries l and h, which are the lower and upper bounds, respectively, and σ comes from a half-normal distribution with a standard deviation of σ_σ. A half-normal distribution is like the regular normal distribution but restricted to positive values (including zero). You can get samples from a half-normal by sampling from a normal distribution and then taking the absolute value of each sampled value. Finally, in our model, the data y comes from a normal distribution with the parameters μ and σ. Using Kruschke-style diagrams:

Figure 2.8
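To make the sampling trick for the half-normal concrete, here is a minimal sketch (plain NumPy/SciPy, not from the book; the seed and sample size are arbitrary) showing that taking absolute values of normal samples reproduces SciPy's half-normal:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sigma_sigma = 10  # standard deviation of the underlying normal

# Sample from a zero-centered normal and take absolute values
samples = np.abs(rng.normal(loc=0, scale=sigma_sigma, size=10_000))

# Both means should be close to sigma_sigma * sqrt(2/pi) ≈ 7.98
print(samples.mean(), stats.halfnorm(scale=sigma_sigma).mean())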

If we do not know the possible values of μ and σ, we can set priors reflecting our ignorance. One option is to set the boundaries of the uniform distribution to be l = 40 and h = 70, a range larger than the range of the data. Alternatively, we can choose a range based on our previous knowledge. We may know that it is not physically possible to have values below 0 or above 100 for this type of measurement. In such a case, we can set the prior for the mean as a uniform with parameters l = 0 and h = 100. For the half-normal, we can set σ_σ = 10, which is just a large value for this data. Using PyMC3, we can write the model as follows:

import pymc3 as pm

with pm.Model() as model_g:
    μ = pm.Uniform('μ', lower=40, upper=70)
    σ = pm.HalfNormal('σ', sd=10)
    y = pm.Normal('y', mu=μ, sd=σ, observed=data)
    trace_g = pm.sample(1000)

az.plot_trace(trace_g)

Figure 2.9

As you may have noticed, Figure 2.9, which was generated with the ArviZ function plot_trace, has one row for each parameter. For this model, the posterior is bi-dimensional, and so Figure 2.9 shows the marginal distributions of each parameter. We can use the plot_joint function from ArviZ to see what the bi-dimensional posterior looks like, together with the marginal distributions for μ and σ:

az.plot_joint(trace_g, kind='kde', fill_last=False)

Figure 2.10

If you want to get access to the values of any of the parameters stored in a trace object, you can index the trace with the name of the parameter in question. As a result, you will get a NumPy array. Try doing trace_g['σ'] or az.plot_kde(trace_g['σ']). By the way, when using Jupyter Notebook/Lab, you can get characters such as σ by writing \sigma in a code cell and then hitting the Tab key.
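As a quick illustration (a minimal sketch, not from the book; the exact number of samples depends on the draws and chains used), here is how you could compute summary statistics directly from those arrays:

sigma_samples = trace_g['σ']
print(sigma_samples.shape)  # all chains concatenated, e.g. (2000,)
print(sigma_samples.mean())  # posterior mean of σ
print(np.percentile(sigma_samples, [3, 97]))  # a percentile interval (not an HPD)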

We are going to print the summary for later use:

az.summary(trace_g)

     mean   sd    mc error  hpd 3%  hpd 97%  eff_n   r_hat
μ    53.49  0.50  0.00      52.5    54.39    2081.0  1.0
σ    3.54   0.38  0.01      2.8     4.22     1823.0  1.0

Now that we have computed the posterior, we can use it to simulate data and check how consistent the simulated data is with the observed data. If you remember from Chapter 1, Thinking Probabilistically, we generically call this type of comparison a posterior predictive check, because we are using the posterior to make predictions, and are using those predictions to check the model. With PyMC3, it is really easy to get posterior predictive samples using the sample_posterior_predictive function. With the following code, we are generating 100 predictions from the posterior, each one of the same size as the data. Notice that we have to pass the trace and the model to sample_posterior_predictive, while the other arguments are optional:

y_pred_g = pm.sample_posterior_predictive(trace_g, 100, model_g)

The y_pred_g variable is a dictionary, with the keys being the names of the observed variables in our model and the values arrays of shape (samples, size), in this case (100, len(data)). We have a dictionary because we could have models with more than one observed variable.
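Before plotting, it can help to verify this structure yourself (a quick sketch; 'y' is just the name we gave the observed variable in model_g):

print(list(y_pred_g.keys()))  # ['y']
print(y_pred_g['y'].shape)    # (100, len(data)), that is, (100, 48)

With that confirmed, we can use the plot_ppc function for a visual posterior predictive check: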

data_ppc = az.from_pymc3(trace=trace_g, posterior_predictive=y_pred_g)
ax = az.plot_ppc(data_ppc, figsize=(12, 6), mean=False)
ax[0].legend(fontsize=15)

Figure 2.11

In Figure 2.11, the single (black) line is a KDE of the data and the many semitransparent (cyan) lines are KDEs computed from each one of the 100 posterior predictive samples. The semitransparent (cyan) lines reflect the uncertainty we have about the inferred distribution of the predicted data. Sometimes, when you have very few data points, a plot like this one could show the predicted curves as hairy or wonky; this is due to the way the KDE is implemented in ArviZ. The density is estimated within the actual range of the data passed to the KDE function, while outside this range the density is assumed to be zero. While some could regard this as a bug, I think it's a feature, since it reflects a property of the data instead of over-smoothing it.

From Figure 2.11, we can see that the mean of the simulated data is slightly displaced to the right and that the variance seems to be larger for the simulated data than for the actual data. This is a direct consequence of the two observations that are separated from the bulk of the data. Can we use this plot to confidently say that the model is faulty and needs to be changed? Well, as always, the interpretation of a model and its evaluation is context-dependent. Based on my experience with this type of measurement and the ways I generally use this data, I would say this model is a reasonable enough representation of the data and useful for most of my analyses. Nevertheless, in the next section, we are going to learn how to refine model_g and get predictions that match the data even more closely.
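If you want to go beyond the visual impression, a simple numerical check is to compare summary statistics of each simulated dataset against those of the observed data. The following is a minimal sketch, reusing y_pred_g from above; proportions far from 0.5 hint at a systematic mismatch:

sim = y_pred_g['y']
# Proportion of simulated datasets whose mean (or std) exceeds the observed one
print(np.mean(sim.mean(axis=1) > data.mean()))
print(np.mean(sim.std(axis=1) > data.std()))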
