Model averaging

Model selection is appealing for its simplicity, but we are discarding information about the uncertainty in our models. This is somewhat similar to computing the full posterior and then keeping only its mean; we may become overconfident about what we really know. One alternative is to perform model selection but report and discuss the different models, along with the computed information criteria values, their standard errors, and perhaps also the posterior predictive checks. It is important to put all of these numbers and tests in the context of our problem so that we and our audience can get a better feel for the possible limitations and shortcomings of the models. If you are in the academic world, you can use this approach to add elements to the discussion section of a paper, presentation, thesis, and so on.

Yet another approach is to fully embrace the uncertainty in the model comparison and perform model averaging. The idea now is to generate a meta-model (and meta-predictions) using a weighted average of each model. One way to compute these weights is to apply the following formula:

$$w_i = \frac{e^{-\frac{1}{2} \Delta_i}}{\sum_{j=1}^{M} e^{-\frac{1}{2} \Delta_j}}$$

Here, $\Delta_i$ is the difference between the WAIC value of the $i$-th model and that of the model with the lowest WAIC. Instead of WAIC, you can use any other information criterion, such as AIC, or other measures, such as LOO. This formula is a heuristic way to compute the relative probability of each model (given a fixed set of models) from WAIC values (or other similar measures). Notice how the denominator is just a normalization term that ensures the weights sum up to one. You may remember this expression from Chapter 4, Generalizing Linear Models, because it is just the softmax function. Using the weights from the preceding formula to average models is known as pseudo Bayesian Model Averaging. True Bayesian Model Averaging would use the marginal likelihoods instead of WAIC or LOO. Nevertheless, even though using the marginal likelihoods sounds theoretically appealing, there are theoretical and empirical reasons to prefer WAIC or LOO over the marginal likelihood for both model comparison and model averaging. You will find more details about this in the Bayes factors section.
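To make the formula concrete, here is a minimal NumPy sketch using made-up WAIC values for three hypothetical models (lower WAIC is better):

import numpy as np

waic = np.array([220.3, 225.1, 232.8])  # hypothetical WAIC values, one per model
delta = waic - waic.min()               # difference to the model with the lowest WAIC
weights = np.exp(-0.5 * delta)
weights /= weights.sum()                # normalize so the weights sum to one
print(weights)                          # approximately [0.92, 0.08, 0.00]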

Using PyMC3, you can compute the weights that are expressed in the preceding formula by passing the method='pseudo-BMA' (pseudo Bayesian Model Averaging) argument to the az.compare function. One of the caveats of this formula is that it does not take into account the uncertainty in the computation of the WAIC (or LOO) values. Assuming a Gaussian approximation, we can compute the standard error for each of them. These are the errors returned by the functions az.waic and az.loo, and also by az.compare when the method='pseudo-BMA' argument is passed. We can also estimate this uncertainty using Bayesian bootstrapping, which is a more robust method than assuming normality. PyMC3 can compute this for you if you pass method='BB-pseudo-BMA' to the az.compare function.
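As a minimal sketch, assuming the trace_l and trace_p traces from the linear and order 2 models fitted earlier in this chapter, the weights can be obtained like this:

import arviz as az

# weights from the Gaussian approximation to the uncertainty
cmp_pbma = az.compare({'model_l': trace_l, 'model_p': trace_p},
                      method='pseudo-BMA')
# weights from Bayesian bootstrapping (more robust than assuming normality)
cmp_bb = az.compare({'model_l': trace_l, 'model_p': trace_p},
                    method='BB-pseudo-BMA')
cmp_bb['weight']  # the column containing the computed weights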

A different approach to computing weights for averaging models is known as stacking of predictive distributions, or just stacking. This is implemented in PyMC3 by passing method='stacking' to az.compare. The basic idea is to combine several models into a meta-model by minimizing the divergence between the meta-model and the true generating model. When using a logarithmic scoring rule, this is equivalent to the following:

$$\max_{w} \frac{1}{n} \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, p(y_i \mid y_{-i}, M_k)$$

Here, $n$ is the number of data points and $K$ is the number of models. To enforce a solution, we constrain the weights to be $w_k \ge 0$ and $\sum_{k=1}^{K} w_k = 1$. The quantity $p(y_i \mid y_{-i}, M_k)$ is the leave-one-out predictive distribution for the $M_k$ model. As we already discussed, computing it requires fitting each model $n$ times, each time leaving out one data point. Fortunately, we can approximate the exact leave-one-out predictive distribution using WAIC or LOO, and that is what PyMC3 does.
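As with the previous methods, a minimal sketch (again assuming the trace_l and trace_p traces from earlier in this chapter) only requires changing the method argument:

import arviz as az

cmp_stack = az.compare({'model_l': trace_l, 'model_p': trace_p},
                       method='stacking')
cmp_stack['weight']  # stacking weights, one per model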

There are other ways to average models, such as explicitly building a meta-model that includes all the models of interest as submodels. We can build such a model in a way that lets us perform inference over the parameters of each submodel and, at the same time, compute the relative probability of each model (see the Bayes factors section for an example of this).

Besides averaging discrete models, we can sometimes think of continuous versions of them. A toy example is to imagine that we have a coin-flipping problem and two different models: one with a prior biased toward heads and the other biased toward tails. A continuous version of this would be a hierarchical model where the prior distribution is estimated directly from the data. This hierarchical model includes the discrete models as special cases.
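The following is only a rough PyMC3 sketch of this idea, with simulated coin-flip data and hyperprior choices that are my own illustration rather than anything fitted in this chapter: instead of committing to a heads-biased prior such as Beta(20, 2) or a tails-biased one such as Beta(2, 20), we place hyperpriors on the Beta parameters and let the data inform them:

import numpy as np
import pymc3 as pm

flips = np.random.binomial(1, 0.7, size=50)  # simulated coin-flip data

with pm.Model() as continuous_version:
    # hyperpriors replace the two fixed, competing priors of the discrete models
    alpha = pm.HalfNormal('alpha', sd=10)
    beta = pm.HalfNormal('beta', sd=10)
    theta = pm.Beta('theta', alpha=alpha, beta=beta)
    y = pm.Bernoulli('y', p=theta, observed=flips)
    trace_cont = pm.sample(1000)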

Which approach is better? This depends on our concrete problem. Do we really have a good reason to think in terms of discrete models, or is our problem better represented by a continuous model? Is it important for our problem to single out one model because we are thinking in terms of competing explanations, or would averaging be a better idea because we are more interested in predictions, or because we can truly think of the data-generating process as an average of subprocesses? None of these questions are answered by statistics; they are only informed by statistics in the context of domain knowledge.

The following is just a dummy example of how to get a weighted posterior predictive sample from PyMC3. Here, we are using the pm.sample_posterior_predictive_w function (notice the w at the end of the function's name). The difference between pm.sample_posterior_predictive and pm.sample_posterior_predictive_w is that the latter accepts more than one trace and model, as well as a list of weights (by default, the weights are the same for all models). You can get these weights from az.compare or from any other source you want:

w = 0.5  # equal weights for the linear and order 2 models
y_lp = pm.sample_posterior_predictive_w([trace_l, trace_p],
                                        samples=1000,
                                        models=[model_l, model_p],
                                        weights=[w, 1-w])

_, ax = plt.subplots(figsize=(10, 6))
az.plot_kde(y_l, plot_kwargs={'color': 'C1'},
            label='linear model', ax=ax)
az.plot_kde(y_p, plot_kwargs={'color': 'C2'},
            label='order 2 model', ax=ax)
az.plot_kde(y_lp['y_pred'], plot_kwargs={'color': 'C3'},
            label='weighted model', ax=ax)

plt.plot(y_1s, np.zeros_like(y_1s), '|', label='observed data')
plt.yticks([])
plt.legend()

Figure 5.9

I said this is a dummy example because the quadratic model has a much lower WAIC value than the linear model, so its weight is essentially 1 while the weight of the linear model is essentially 0. To generate Figure 5.9, I have instead assumed that both models have the same weight.
