Posterior predictive checks

In Chapter 1, Thinking Probabilistically, we introduced the concept of posterior predictive checks, and in subsequent chapters we have used them as a way to evaluate how well models explain the same data that was used to fit them. The purpose of posterior predictive checks is not to determine whether a model is wrong; we already know that! By performing posterior predictive checks, we hope to get a better grasp of the limitations of a model, either to properly acknowledge them or to attempt to improve the model. Implicit in the previous statement is the fact that models will not generally reproduce all aspects of a problem equally well. This is not usually a problem, given that models are built with a purpose in mind. A posterior predictive check is a way to evaluate a model in the context of that purpose; thus, if we have more than one model, we can use posterior predictive checks to compare them.

Let's load and plot a very simple dataset:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
import arviz as az

dummy_data = np.loadtxt('../data/dummy.csv')
x_1 = dummy_data[:, 0]
y_1 = dummy_data[:, 1]

order = 2
x_1p = np.vstack([x_1**i for i in range(1, order+1)])
x_1s = (x_1p - x_1p.mean(axis=1, keepdims=True)) / x_1p.std(axis=1, keepdims=True)
y_1s = (y_1 - y_1.mean()) / y_1.std()
plt.scatter(x_1s[0], y_1s)
plt.xlabel('x')
plt.ylabel('y')

Figure 5.1

Now, we are going to fit this data with two slightly different models, a linear one and a polynomial of order 2, also known as a parabolic or quadratic model:

with pm.Model() as model_l:
    α = pm.Normal('α', mu=0, sd=1)
    β = pm.Normal('β', mu=0, sd=10)
    ϵ = pm.HalfNormal('ϵ', 5)

    μ = α + β * x_1s[0]

    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y_1s)

    trace_l = pm.sample(2000)

with pm.Model() as model_p:
    α = pm.Normal('α', mu=0, sd=1)
    β = pm.Normal('β', mu=0, sd=10, shape=order)
    ϵ = pm.HalfNormal('ϵ', 5)

    μ = α + pm.math.dot(β, x_1s)

    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y_1s)

    trace_p = pm.sample(2000)

Now, we will plot the mean fit for both models:

x_new = np.linspace(x_1s[0].min(), x_1s[0].max(), 100)

α_l_post = trace_l['α'].mean()
β_l_post = trace_l['β'].mean(axis=0)
y_l_post = α_l_post + β_l_post * x_new

plt.plot(x_new, y_l_post, 'C1', label='linear model')

α_p_post = trace_p['α'].mean()
β_p_post = trace_p['β'].mean(axis=0)
idx = np.argsort(x_1s[0])
# evaluate the posterior mean of the quadratic model at the (sorted) observed predictors
y_p_post = α_p_post + np.dot(β_p_post, x_1s)

plt.plot(x_1s[0][idx], y_p_post[idx], 'C2', label=f'model order {order}')

plt.scatter(x_1s[0], y_1s, c='C0', marker='.')
plt.legend()

Figure 5.2

The order 2 model seems to be doing a better job, but the linear model is not that bad. Let's use PyMC3 to obtain posterior predictive samples for both models:

y_l = pm.sample_posterior_predictive(trace_l, 2000, model=model_l)['y_pred']

y_p = pm.sample_posterior_predictive(trace_p, 2000, model=model_p)['y_pred']

As we already saw, posterior predictive checks are often performed using visualizations, as in the following example:

plt.figure(figsize=(8, 3))
data = [y_1s, y_l, y_p]
labels = ['data', 'linear model', 'order 2']
for i, d in enumerate(data):
    mean = d.mean()
    err = np.percentile(d, [25, 75])
    plt.errorbar(mean, -i, xerr=[[mean - err[0]], [err[1] - mean]], fmt='o')
    plt.text(mean, -i+0.2, labels[i], ha='center', fontsize=14)
plt.ylim([-i-0.5, 0.5])
plt.yticks([])

Figure 5.3

The preceding diagram shows the mean and the interquartile range (IQR) for the data and for the linear and quadratic models. In this diagram, we are averaging over the posterior predictive samples of each model. We can see that the mean is (on average) well reproduced by both models and that the interquartile range is not too far off, although there are some small differences that, in a real problem, could deserve some attention. We can make many different plots to explore a posterior predictive distribution. For example, we can plot the dispersion of the mean and the interquartile range, as opposed to just their average values. The following diagram is an example of such a plot:

fig, ax = plt.subplots(1, 2, figsize=(10, 3), constrained_layout=True)


def iqr(x, a=0):
    return np.subtract(*np.percentile(x, [75, 25], axis=a))


for idx, func in enumerate([np.mean, iqr]):
    T_obs = func(y_1s)
    ax[idx].axvline(T_obs, 0, 1, color='k', ls='--')
    for d_sim, c in zip([y_l, y_p], ['C1', 'C2']):
        T_sim = func(d_sim, 1)
        p_value = np.mean(T_sim >= T_obs)
        az.plot_kde(T_sim, plot_kwargs={'color': c},
                    label=f'p-value {p_value:.2f}', ax=ax[idx])
    ax[idx].set_title(func.__name__)
    ax[idx].set_yticks([])
    ax[idx].legend()

Figure 5.4

In Figure 5.4, the black dashed line represents the statistic computed from the data (either the mean or the IQR). Because we have a single dataset, we have a single value for the statistic (and not a distribution). The curves (using the same color code as in Figure 5.3) represent the distribution of the mean (left panel) or the interquartile range (right panel) computed from the posterior predictive samples. You may have also noted that Figure 5.4 includes values labeled as p-values. We compute such p-values by comparing the simulated data to the actual data. For both, we compute a summary statistic (the mean or the IQR, in this example), and then we count the proportion of times the summary statistic from the simulations is equal to or greater than the one computed from the data. If the data and the simulations agree, we should expect a p-value around 0.5; otherwise, we are in the presence of a biased posterior predictive distribution.

Bayesian p-values are just a way to get a number that measures the fit of a posterior predictive check.

If you are familiar with frequentist methods and you are reading this book because you were told that the cool kids were not using p-values anymore, you may feel shocked or even disappointed by your formerly beloved author. But keep calm and keep reading. These Bayesian p-values are indeed p-values because they are basically defined in the same way as their frequentist cousins:

$p_B = p(T_{sim} \geq T_{obs} \mid y)$

That is, we are computing the probability of getting a simulated statistic $T_{sim}$ equal to or more extreme than the statistic computed from the observed data, $T_{obs}$. Here, $T$ can be almost anything that provides a summary of the data. In Figure 5.4, $T$ is the mean for the left panel and the interquartile range for the right panel. $T$ should be chosen while taking into account the question that motivated the inference in the first place.
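To illustrate this last point, the following is a minimal sketch (not part of the book's code; the choice of statistic is only an example) that computes a Bayesian p-value for a different test statistic, the minimum, reusing the y_1s, y_l, and y_p arrays from before. A statistic like this could be relevant if the question that motivated the analysis concerned the lower tail of the response:

# a hedged sketch: the minimum of each dataset as the test statistic T
# y_l and y_p have shape (number of simulations, number of observations)
T_obs = y_1s.min()

for label, d_sim in zip(['linear model', 'order 2'], [y_l, y_p]):
    T_sim = d_sim.min(axis=1)          # minimum of each simulated dataset
    p_value = np.mean(T_sim >= T_obs)  # proportion of simulated minima at or above the observed one
    print(f'{label}: p-value for the minimum = {p_value:.2f}')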

These p-values are Bayesian because, for the sampling distribution, we are using the posterior predictive distribution. Also, notice that we are not conditioning on any null hypothesis; in fact, we have the entire posterior distribution of the parameters and we are conditioning on the observed data. Yet another difference is that we are not using any predefined threshold to declare statistical significance, nor are we performing hypothesis testing; we are just computing a number to assess the fit of the posterior predictive distribution to a dataset.
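Numerical summaries are not the only option; we can also inspect the posterior predictive distribution graphically in more detail. The following is a minimal sketch (an illustration, not code from the book) that overlays the KDE of the observed data on the KDEs of a subset of simulated datasets for each model; the number of simulated datasets shown (50) and the colors are arbitrary choices:

fig, ax = plt.subplots(1, 2, figsize=(10, 3), sharey=True, constrained_layout=True)

for axi, d_sim, title in zip(ax, [y_l, y_p], ['linear model', 'order 2']):
    # a faint curve for each of 50 randomly chosen simulated datasets
    for i in np.random.randint(0, len(d_sim), 50):
        az.plot_kde(d_sim[i], plot_kwargs={'color': 'C1', 'alpha': 0.1}, ax=axi)
    # the observed data on top
    az.plot_kde(y_1s, plot_kwargs={'color': 'C0', 'linewidth': 2}, ax=axi)
    axi.set_title(title)
    axi.set_yticks([])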

Posterior predictive checks, whether performed using plots, numerical summaries such as Bayesian p-values, or a combination of both, are a very flexible idea. The concept is general enough that analysts can use their imagination to come up with different ways of exploring the posterior predictive distribution and use whatever helps them tell a data-driven story, including but not restricted to model comparison. In the following sections, we are going to explore other methods for comparing models.
