Bayes factors and Information Criteria

Notice that if we take the logarithm of a Bayes factor, we turn the ratio of marginal likelihoods into a difference. Comparing differences of marginal likelihoods is similar to comparing differences of information criteria. Moreover, we can interpret Bayes factors, or more precisely marginal likelihoods, as containing a fitting term and a penalizing term. The term indicating how well the model fits the data is the likelihood, and the penalization comes from averaging over the prior. The larger the number of parameters, the larger the prior volume compared to the likelihood volume, and hence we end up averaging over zones where the likelihood takes very low values. The more parameters, the more diluted or diffuse the prior, and hence the greater the penalty when computing the evidence. This is why people say that Bayes' theorem leads to a natural penalization of complex models; in other words, Bayes' theorem comes with a built-in Occam's razor.
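To see this built-in Occam's razor at work, recall that the beta-binomial model is conjugate, so its marginal likelihood has the closed form p(y) = C(n, k) B(α + k, β + n − k) / B(α, β). The following minimal sketch (the flat Beta(1, 1) prior is our addition for illustration, not part of the running example) compares the evidence for a prior concentrated near the observed proportion against the flat one:

import numpy as np
from scipy.special import betaln, gammaln

def log_marginal(k, n, a, b):
    # closed-form log evidence of a beta-binomial model:
    # p(y) = C(n, k) * B(a + k, b + n - k) / B(a, b)
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return log_binom + betaln(a + k, b + n - k) - betaln(a, b)

# 9 heads in 30 flips: Beta(4, 8) concentrates near the observed
# proportion of 0.3, while Beta(1, 1) is flat over the whole interval
for a, b in [(4, 8), (1, 1)]:
    print(f'Beta({a}, {b}): p(y) = {np.exp(log_marginal(9, 30, a, b)):.3f}')

Both priors put mass on θ = 0.3, but the flat prior also spreads mass over values of θ where the likelihood is tiny, so the averaging penalizes it and its evidence comes out noticeably lower.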

We have already said that Bayes factors are more sensitive to priors than many people like (or even realize). It is like having differences that are practically irrelevant when performing inference, but that turn out to be important when computing Bayes factors. If there is an infinite multiverse, I am almost sure there is a Geraldo talk show with Bayesians fighting and cursing at each other about Bayes factors. In that universe (well... also in this one), I will be cheering for the anti-BF side. Nevertheless, we are now going to look at an example that will help clarify what Bayes factors are doing and what information criteria are doing, and how, while similar, they focus on two different aspects. Go back to the definition of the data for the coin-flip example and now set 300 coins and 90 heads; this is the same proportion as before, but we have 10 times more data. Then run each model separately:

import numpy as np
import pymc3 as pm
import arviz as az

traces = []
waics = []
for coins, heads in [(30, 9), (300, 90)]:
    # encode the data as a vector of 0s (tails) and 1s (heads)
    y_d = np.repeat([0, 1], [coins - heads, heads])
    for priors in [(4, 8), (8, 4)]:
        with pm.Model() as model:
            θ = pm.Beta('θ', *priors)
            y = pm.Bernoulli('y', θ, observed=y_d)
            trace = pm.sample(2000)
            traces.append(trace)
            waics.append(az.waic(trace))

Figure 5.12

By adding more data, we have almost entirely overcome the prior, and now both models make similar predictions. Using 30 coins and 9 heads as our data, we saw a BF of 11. If we repeat the computation (feel free to do it yourself) with the data of 300 coins and 90 heads, we will see a BF of 25. The Bayes factor is saying that model 0 is favored over model 1 even more than before. When we increase the data, the decision between models becomes clearer. This makes total sense, because now we are more certain that model 1 has a prior that does not agree with the data.
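Because the model is conjugate, you can verify both Bayes factors analytically instead of relying on sampling. A quick check, reusing the log_marginal helper from the earlier sketch:

for coins, heads in [(30, 9), (300, 90)]:
    log_bf = (log_marginal(heads, coins, 4, 8)
              - log_marginal(heads, coins, 8, 4))
    print(f'{coins} coins, {heads} heads: BF = {np.exp(log_bf):.1f}')

The printed values should be in the same ballpark as the ones quoted above.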

Also notice that, as we increase the amount of data, both models tend to agree on the value of θ; in fact, we get θ ≈ 0.3 with both models. Thus, if we decide to use θ to predict new outcomes, it will barely make any difference which model we use to compute the distribution of θ.
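This agreement is easy to verify from the conjugate posterior, which is Beta(α + k, β + n − k); a small check with SciPy (an illustration, not part of the chapter's original code):

from scipy import stats

# posterior for a Beta(a, b) prior after observing 90 heads in 300 flips
for a, b in [(4, 8), (8, 4)]:
    post = stats.beta(a + 90, b + 300 - 90)
    print(f'Beta({a}, {b}) prior -> posterior mean {post.mean():.3f}')

Both posterior means land close to 0.3, which is why predictions barely depend on which prior we started from.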

Now, let's compare this with what WAIC is telling us (see Figure 5.13). WAIC is ≈ 368.4 for model 0 and ≈ 368.6 for model 1, which intuitively sounds like a small difference. More important than the actual difference is that if you compute the information criteria again for the original data, that is, 30 coins and 9 heads, you will get WAIC values of ≈ 38.1 for model 0 and ≈ 39.4 for model 1. That is, the relative difference becomes smaller as we increase the data: the more similar the estimates of θ, the more similar the values of the predictive accuracy estimated by the information criteria. You will observe essentially the same if you use LOO instead of WAIC:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, sharey=True)

labels = model_names  # the model names defined earlier in the chapter
indices = [0, 0, 1, 1]
for i, (ind, d) in enumerate(zip(indices, waics)):
    mean = d.waic
    ax[ind].errorbar(mean, -i, xerr=d.waic_se, fmt='o')
    ax[ind].text(mean, -i+0.2, labels[i], ha='center')

ax[0].set_xlim(30, 50)
ax[1].set_xlim(330, 400)
plt.ylim([-i-0.5, 0.5])
plt.yticks([])
plt.subplots_adjust(wspace=0.05)
fig.text(0.5, 0, 'Deviance', ha='center', fontsize=14)

Figure 5.13
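If you want to check the LOO claim from before, it is enough to swap az.waic for az.loo in the sampling loop, as both functions share the same interface. A sketch:

loos = []
for coins, heads in [(30, 9), (300, 90)]:
    y_d = np.repeat([0, 1], [coins - heads, heads])
    for priors in [(4, 8), (8, 4)]:
        with pm.Model() as model:
            θ = pm.Beta('θ', *priors)
            y = pm.Bernoulli('y', θ, observed=y_d)
            trace = pm.sample(2000)
            loos.append(az.loo(trace))

The plotting code above can then be reused by replacing waics with loos and the WAIC attributes with their LOO counterparts.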

Bayes factors are focused on which model is better, while WAIC (and LOO) is focused on which model will give the better predictions. You can see these differences if you go back and check equations 5.6 and 5.11. WAIC, like other information criteria, uses the log-likelihood in one way or another, and the priors are not directly part of the computation. Priors participate only indirectly, by helping us estimate the value of θ. Instead, Bayes factors use priors directly, as we need to average the likelihood over the whole prior distribution.
