11
Choosing models

In this chapter we discuss model choice. So far we have postulated a fixed data-generating mechanism f(x|θ) without worrying about how f is chosen. From this perspective, we may think about model choice as the choice of which f to use. Depending on the application, f could be a simple parametric family, or a more elaborate model. George Box’s famous comment that “all models are wrong but some models are useful” (Box 1979) highlights the importance of taking a pragmatic viewpoint in evaluating models, and of setting criteria driven by the goals of the modeling. Decision theory would seem to be the perfect perspective to formalize Box’s concise statement of principle.

A view we could take is that of Chapter 10. Forecasters are incarnations of predictive models that we can evaluate and compare based on utility functions such as scoring rules. If we do this, we neatly separate the information that was used by the forecasters to develop and tune the prediction models from the information that we use to evaluate them. This separation is a luxury we do not always have. More often we would like to be able to entertain several approaches in parallel, and learn something about how well they do directly from the data that are used to develop them. Whether this is even possible is a matter of debate, and some hold, with good reason, that model training and model assessment should be separate.

But let us say we give in to the temptation of training and evaluating models at the same time. An important question is whether we can talk about a model as a “state of the world” in the same way we did for parameters or future events. Box’s comment that all models are wrong sounds like a negative answer. In the terminology of Savage, a model is perhaps like a “small world.” Within the small world we apply a theory that explains how we should learn from data and make good decisions. But can the theory tell us whether the “small world” is right?

Both these considerations suggest that model choice may require a richer conceptual framework than that of statistical decision theory. However, we can still make a little bit of progress if we are willing to stipulate that a true model exists in our list of candidate models. This is not a real change of perspective conceptually: the “small world” is a little bit bigger, and the model is then simply another parameter, but the results are helpful in clarifying the underpinning of some popular model choice approaches.

In Section 11.1 we set up the general framework for decision problems in which the model is unknown, and look at the implications for model selection and prediction. Then, changing slightly the paradigm, we consider the situation in which only one model is being contemplated and the question arises as to whether or not the model may be adequate. We present a way in which decision theory can be brought to bear on this problem in Section 11.2. We focus only on the Bayesian approach, though frequentist model selection approaches are also available.

We do not have a featured article for this chapter. Useful general readings are Box (1980), Bernardo and Smith (1994), and Clyde and George (2004).

11.1 The “true model” perspective

11.1.1 Model probabilities

Our discussion in most of this chapter is based on making our “small world” just a little bit bigger so that unknowns that are not normally considered part of it are now included. For example, we can imagine extending estimation of a population mean from a known to an unknown family of distributions, or extending a linear regression with two predictors to the bigger world in which any subset of five additional predictors could also be included in the model. We will assume we can make a list of possible models, and feel comfortable that one of the models is true—much as we could be confident that one real number or another is the true average height of a population. When we can do this, the model is part of the states of the world in the usual sense, and nothing differentiates it from what we normally call θ except habit. We can then apply all we know about decision theory, and handle the fact that the model is unknown in a goal-driven way. This approach takes very seriously the “some models are useful” part of Box’s aphorism, but it ignores the “all models are wrong” part.

Formally, we can take our familiar f(x|θ) and consider it as a special case of a larger collection of possible data-generating mechanisms defined by f(x|θM, M), where M denotes the model. We have a list of models, called ℳ. To fix ideas, consider the original f to be a normal with mean θ and variance 1. If you do not trust the normal model but you are reasonably confident the distribution should be symmetric, and your primary concern is occasional outliers, then you can consider the set of Student’s t distributions with M degrees of freedom and median θ to be your new larger collection of models. If ℳ is the set of all positive integers, the normal case is approached as M → ∞. M is much like θ in that it is an unknown state of the world. The different name reflects the common usage of the normal as a fixed assumption (or model) in practical analyses.

In this example, the parameter θ can be interpreted as the population median across all models. However, the interpretation of θ, as well as its dimensionality, may change across models in more complex examples. This is why we need to define separate random variables θM for each M.

In general, we have a model index M, and as many parameter sets as there are models, potentially. These are all unknown. The axiomatic foundations tell us that, if the small world has not exploded on us yet, we should have a joint probability distribution on the whole set. If we have a finite list of models, so M is an integer between 1 and M0, the prior is

$$\pi(\theta_1, \ldots, \theta_{M_0}, M), \qquad M = 1, \ldots, M_0. \tag{11.1}$$

This induces a joint distribution over the data, parameters, and models. Implicit in the specification of f(x|θM, M) is the idea that, conditional on M and θM, x is independent of all the θM′ with M′ ≠ M. So the joint probability distribution on all unknowns can be written as

$$f(x|\theta_M, M)\,\pi(\theta_1, \ldots, \theta_{M_0}, M).$$

This can be a very complicated distribution to specify, though for special decision problems some simplifications take place. For example, if all models harbor a common parameter θ, and the loss function depends on the parameters only through θ, then we do not need to specify the horrendous-dimensional prior (11.1). We define θM = (θ, ηM), where ηM are model-specific nuisance parameters. Then

$$\pi(\theta|x) \propto \sum_{M=1}^{M_0} \int f(x|\theta, \eta_M, M)\,\pi(\theta, \eta_M, M)\, d\eta_M, \tag{11.2}$$

where π(θ, ηM, M) is the model-specific prior defined by

$$\pi(\theta, \eta_M, M) = \int \cdots \int \pi(\theta_1, \ldots, \theta_{M_0}, M) \prod_{M' \neq M} d\theta_{M'}. \tag{11.3}$$

To get equation (11.2) reorder each of the M0 integrals so that the integral with respect to ηM is on the outside.
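
In symbols, pulling f(x|θ, ηM, M) out of the integrals over the θM′ with M′ ≠ M gives

$$\pi(\theta|x) \propto \sum_{M=1}^{M_0} \int \left[ \int \cdots \int f(x|\theta,\eta_M,M)\,\pi(\theta_1,\ldots,\theta_{M_0},M) \prod_{M'\neq M} d\theta_{M'} \right] d\eta_M = \sum_{M=1}^{M_0} \int f(x|\theta,\eta_M,M)\,\pi(\theta,\eta_M,M)\, d\eta_M,$$

which is (11.2), with the model-specific prior (11.3) appearing as the inner integral.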

In this case, because of Theorem 7.1, we can operate directly from the distribution π(θ|x). Compared to working from prior (11.1), here we “only” need to specify model-specific priors π(θ, ηM, M).

Another important quantity is the marginal distribution of the data given the model. This will be the critical quantity to look at when the loss function depends on the model but not specifically on the model parameters. An integration similar to that leading to equation (11.2) will produce

$$f(x|M) = \int f(x|\theta_M, M)\,\pi(\theta_M|M)\, d\theta_M, \tag{11.4}$$

which again depends only on the model-specific conditional prior π(θM|M). Then, conditional on the data x, the posterior model probabilities are given by

$$\pi(M|x) = \frac{f(x|M)\,\pi(M)}{\sum_{M'=1}^{M_0} f(x|M')\,\pi(M')}, \tag{11.5}$$

where π(M) is the prior model probability implied by (11.1). Moreover, the posterior predictive density for a new observation x̃ is given by

$$f(\tilde{x}|x) = \sum_{M=1}^{M_0} f(\tilde{x}|x, M)\,\pi(M|x), \tag{11.6}$$

that is, a posterior weighted mixture of the conditional predictive distributions f(x̃|x, M).
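
As a concrete illustration, here is a minimal numerical sketch of equations (11.4)–(11.6) for the Student’s t example of Section 11.1.1; the data, the prior on θ, the list of candidate degrees of freedom, and the grid quadrature are all made up for illustration.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Toy illustration of (11.4)-(11.6): Student-t sampling models with M degrees
# of freedom and median theta, a common N(0, 1) prior on theta, and a uniform
# prior over the listed values of M. Everything below is made up.
x = np.array([-0.3, 0.1, 0.4, 2.8, -0.2])      # assumed data
models = [1, 2, 5, 10, 30]                     # candidate degrees of freedom
theta = np.linspace(-5, 5, 2001)               # grid to integrate theta out
prior_theta = stats.norm.pdf(theta, 0, 1)

def marginal_likelihood(df):
    # f(x | M) = integral of f(x | theta, M) pi(theta | M) dtheta
    loglik = stats.t.logpdf(x[:, None], df, loc=theta).sum(axis=0)
    return trapezoid(np.exp(loglik) * prior_theta, theta)

f_xM = np.array([marginal_likelihood(m) for m in models])
post_M = f_xM / f_xM.sum()                     # (11.5), with uniform pi(M)
print(dict(zip(models, post_M.round(3))))

# Model-averaged predictive density at a new point x_new, as in (11.6)
x_new = 1.0
def predictive(df):
    loglik = stats.t.logpdf(x[:, None], df, loc=theta).sum(axis=0)
    post_theta = np.exp(loglik) * prior_theta
    post_theta /= trapezoid(post_theta, theta)
    return trapezoid(stats.t.pdf(x_new, df, loc=theta) * post_theta, theta)

f_pred = sum(p * predictive(m) for p, m in zip(post_M, models))
print(f"model-averaged predictive density at {x_new}: {f_pred:.3f}")
```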

The approach outlined here is a very elegant way of handling uncertainty about which model to use: if the true model is unknown, use a set of models, and let the data weigh in about which models are more likely to be the true one. In the rest of this section we will look at the implications for model choice, prediction, and estimation. Two major difficulties with this approach are the specification of prior probabilities and the specification of a list of models large enough to contain the true model, but small enough that prior probabilities can be meaningful. There are also significant computational challenges. For further reading on the Bayesian perspective on model uncertainty see Madigan and Raftery (1994), Draper (1995), Clyde and George (2004), and references therein. Bernardo and Smith (1994) also discuss alternative perspectives in which assigning probabilities over the model space is not considered justifiable, because it cannot be assumed that the true model is included in the list.

11.1.2 Model selection and Bayes factors

In model choice the goal is to choose a single “best” model according to some specified criterion. The simplest formulation is one in which we are interested in guessing the correct model, and consider all mistakes equally undesirable. Within the decision-theoretic framework the set of actions is A = ℳ and the loss function is 0 for choosing the true model and 1 otherwise. To simplify our discussion we assume that ℳ is a finite set, that is ℳ = {1, ..., M0}. This loss function allows us to frame the discussion in a decision-theoretic way and get some clear-cut results, but it is not very much in the spirit of the Box aphorism: it really punts on the question of why the model is useful. In this setting the prior expected loss for the action that declares M to be the true model is 1 − π(M) and the Bayes action a* is to choose the model with highest prior probability. Similarly, after seeing observations x, the optimal decision is to choose the model with highest posterior probability (11.5). Many model selection procedures are motivated by the desire to approximate this property. However, often they are used in practice to select a model and then perform inference or prediction conditioning on the model. This practice is expedient and often necessary, but a more consistent decision-theoretic approach would be to specify the loss function directly in terms of the final use of the model. We will elaborate on this in the next section.

Given any two models M and M′,

$$\frac{\pi(M|x)}{\pi(M'|x)} = \frac{\pi(M)}{\pi(M')} \times \frac{f(x|M)}{f(x|M')},$$

that is, the ratio between posterior probabilities for models M and M′ is the product of the prior odds ratio and the Bayes factor. We discussed Bayes factors in the context of hypothesis testing in Chapter 7. As in hypothesis testing, the Bayes factor measures the relative support for M versus M′ as provided by the data x. Because of its relation to posterior probabilities, choosing a model on the basis of the posterior probability is equivalent to choosing a model using Bayes factors.

In contrast to model selection where a true model is assumed and the utility function explicitly seeks to find it, in model comparison we are simply interested in quantifying the relative support that two models receive from the data. The literature on model comparison is extensive and Bayes factors play an important role. For an extensive discussion on Bayes factors in model comparison, see Kass and Raftery (1995). Alternatively, the Bayesian information criterion (BIC) (or Schwarz criterion) also provides a means for comparing models. The BIC is defined as

$$\mathrm{BIC}_M = 2 \log f(x|\hat{\theta}_M, M) - d \log n,$$

where θ̂M is the maximum likelihood estimate of the parameters under model M, d is the number of parameters in model M, and n is the sample size. A crude approximation to the Bayes factor for comparing models M and M′ is

$$\frac{f(x|M)}{f(x|M')} \approx \exp\left\{ \tfrac{1}{2}\left(\mathrm{BIC}_M - \mathrm{BIC}_{M'}\right) \right\}.$$
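
A minimal sketch of this approximation, assuming the BIC convention displayed above and two made-up normal models (mean fixed at zero versus mean free, variance unknown in both):

```python
import numpy as np
from scipy import stats

# Sketch of the crude BIC-based approximation to a Bayes factor, under the
# convention BIC_M = 2 log f(x | theta_hat_M, M) - d log n used above.
# The simulated data and the two candidate models are purely illustrative.
rng = np.random.default_rng(0)
x = rng.normal(0.2, 1.0, size=50)
n = len(x)

def bic(max_loglik, d):
    return 2.0 * max_loglik - d * np.log(n)

sd_free = x.std()                        # MLE of sigma when the mean is free
sd_null = np.sqrt(np.mean(x ** 2))       # MLE of sigma when the mean is fixed at 0
bic_M = bic(stats.norm.logpdf(x, x.mean(), sd_free).sum(), d=2)   # model M
bic_Mp = bic(stats.norm.logpdf(x, 0.0, sd_null).sum(), d=1)       # model M'

bayes_factor = np.exp(0.5 * (bic_M - bic_Mp))   # crude approximation above
print(f"approximate Bayes factor of M over M': {bayes_factor:.2f}")
```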

Spiegelhalter et al. (2002) propose an alternative criterion known as the deviance information criterion for model checking and comparison, which can be applied to complex settings such as generalized linear models and hierarchical models.

11.1.3 Model averaging for prediction and selection

Let us now bring in more explicitly the typical goals of a statistical analysis. We begin with prediction. Each model in our collection specifies a predictive density

$$f(\tilde{x}|x, M) = \int f(\tilde{x}|\theta_M, M)\,\pi(\theta_M|x, M)\, d\theta_M$$

for a future observation x̃. If the loss function depends only on our actions and x̃, then model uncertainty is taken into account by model averaging. Similarly to the previous section, we first compute the distribution f(x̃|x), integrating out both models and parameters, and then attack the decision problem by minimizing posterior expected loss. For squared error loss, point predictions can be expressed as

$$\delta^*(x) = E[\tilde{x}\,|\,x] = \sum_{M=1}^{M_0} E[\tilde{x}\,|\,x, M]\;\pi(M|x), \tag{11.7}$$

a weighted average of model-specific predictions, with weights given by posterior model probabilities. If, instead, we were to decide on a predictive distribution, using the negative log loss function of Section 10.3, the optimal predictive distribution would be f(x̃|x) in equation (11.6), which can also be expressed as a weighted average of model-specific densities.
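
A minimal numerical sketch of equation (11.7), with made-up posterior model probabilities and model-specific predictive means:

```python
import numpy as np

# Toy illustration of (11.7): the model-averaged point prediction is the
# posterior-probability-weighted average of the model-specific predictive
# means. All numbers are made up.
post_M = np.array([0.55, 0.30, 0.15])     # pi(M | x) for three candidate models
pred_M = np.array([1.8, 2.4, 3.1])        # E[x_new | x, M] under each model

delta_star = np.sum(post_M * pred_M)      # model-averaged prediction
print(delta_star)                         # 2.175
```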

Consider now a slightly different case: the decision maker is uncertain about which model is true, but must make predictions based on a single model. This applies, for example, when the model predictions must be produced in a setting where the model averaging approach is not computationally feasible. Formally, the decision maker has to choose a model in {1, ..., M0} and, subsequently, make a prediction for a future observation x̃ based on data x, assuming the chosen model is true. We formalize this decision problem by denoting by (a, b) the action where we select model a and use it to make a prediction b. For simplicity, we consider squared error loss

$$L\big((a, b), \tilde{x}\big) = (\tilde{x} - b)^2.$$

This is a nested decision problem. The model affects the final loss through the constraints it imposes on the prediction b. There is some similarity between this and the multistage decision problems we will encounter in Part Three, though a key difference here is that there is no additional data acquisition between the choice of the model and the prediction.

For any model a, the optimal prediction rule is

$$\delta_a(x) = E[\tilde{x}\,|\,x, M = a].$$

Plugging this solution back into the posterior expected loss function, the optimal model a* minimizes

$$\int \big(\tilde{x} - \delta_a(x)\big)^2 f(\tilde{x}|x)\, d\tilde{x}.$$

Working on the integral above we get

$$\int \big(\tilde{x} - \delta_a(x)\big)^2 f(\tilde{x}|x)\, d\tilde{x} = \sum_{M=1}^{M_0} \pi(M|x)\left[ \big(\delta_a(x) - \delta_M(x)\big)^2 + \mathrm{Var}(\tilde{x}\,|\,x, M) \right],$$

where δM(x) = E[x̃|x, M] is the model-specific prediction under model M.

The above expression depends on model a only through the first element of the sum in square brackets. The optimal model minimizes the weighted squared differences between its own model-specific prediction and the predictions of the other possible models, with weights given by the posterior model probabilities.

We can also compare this to δ*(x) = ΣM π(M|x)δM(x), the posterior averaged prediction. This is the global optimum when one is allowed to use predictions based on all models rather than being constrained to using a single one, and coincides with decision rule (11.7). Manipulating the first term of posterior expected loss above a little further we get

$$\sum_{M=1}^{M_0} \pi(M|x)\big(\delta_a(x) - \delta_M(x)\big)^2 = \big(\delta_a(x) - \delta^*(x)\big)^2 + \sum_{M=1}^{M_0} \pi(M|x)\big(\delta_M(x) - \delta^*(x)\big)^2.$$

Thus, the best model is the one whose prediction rule δa is closest to the posterior averaged prediction δ*.
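
Continuing the made-up numbers from the sketch above, the constrained-to-one-model choice can differ from the most probable model:

```python
import numpy as np

# When forced to predict from a single model under squared error loss, the
# optimal model is the one whose model-specific prediction is closest to the
# model-averaged prediction, which need not be the most probable model.
post_M = np.array([0.55, 0.30, 0.15])
pred_M = np.array([1.8, 2.4, 3.1])
delta_star = np.sum(post_M * pred_M)               # 2.175

a_star = int(np.argmin((pred_M - delta_star) ** 2))
print(a_star, pred_M[a_star])                      # model 1, prediction 2.4
```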

We now move to the case when the decision is about the entire prediction density. Let (a, b) denote the action where we choose model a and subsequently a density b as the predictive density for a future observation x̃. Let L(b, x̃) denote the loss function. We assume that such a loss function is a proper scoring rule, so that the optimal choice for a predictive density is in fact the actual belief, that is b = f(x̃|x, a). Then the posterior expected loss of choosing model a and proceeding optimally is

$$\int L\big(f(\tilde{x}|x, a), \tilde{x}\big)\, f(\tilde{x}|x)\, d\tilde{x}. \tag{11.8}$$

The optimal strategy is to choose the model that minimizes equation (11.8). San Martini and Spezzaferri (1984) consider the logarithmic loss function corresponding to the scoring rule of Section 10.3. In our context, this implies that model M is preferred to model M′ iff

$$\int f(\tilde{x}|x)\, \log \frac{f(\tilde{x}|x, M)}{f(\tilde{x}|x, M')}\, d\tilde{x} > 0.$$
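
Adding and subtracting ∫ f(x̃|x) log f(x̃|x) dx̃ shows that this amounts to comparing Kullback–Leibler divergences from the model-averaged predictive:

$$\int f(\tilde{x}|x)\,\log\frac{f(\tilde{x}|x,M)}{f(\tilde{x}|x,M')}\,d\tilde{x} > 0 \iff KL\big(f(\cdot|x)\,\big\|\,f(\cdot|x,M)\big) < KL\big(f(\cdot|x)\,\big\|\,f(\cdot|x,M')\big),$$

so, in parallel with the squared error case, the preferred model is the one whose predictive distribution is closest to the model-averaged predictive.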

San Martini and Spezzaferri (1984) further develop this choice criterion in the case of two nested linear regression models, and, under additional assumptions regarding prior distributions, observe that the criterion takes the form

$$2 \log LR > k\,(d_M - d_{M'}), \tag{11.9}$$

where LR is the likelihood ratio statistic, dM is the number of regression parameters in model M, and

Image

with n denoting the number of observations. Equation (11.9) resembles the comparison resulting from the AIC (Akaike 1973), for which k = 2, and the BIC (Schwarz 1978), where k = log(n). Poskitt (1987) provides another decision-theoretic development for a model selection criterion that resembles the BIC, assuming a continuous and bounded utility function.

11.2 Model elaborations

So far our approach to questioning our “small world” has been to make it bigger and deal with the new unknowns according to doctrine. In some settings, this approach also gives us guidance on how to make the world small again—by focusing back on a single model, perhaps different from the one we started out with. Here we briefly consider a different perspective, which is based on looking at small perturbations (or “elaborations”) of the small world that are designed to explore whether bigger worlds are likely to change our behavior substantially, without actually having to build a complete probabilistic representation for those.

In more statistical language, we are interested in model criticism—we plan by default to consider a specific model and we wish to get a sense for whether this choice may be inadequate. In the vast majority of applications this task is addressed by a combination of significance testing, typically for goodness of fit of the model, and exploratory data analysis, for example examination of residuals. For a great discussion and an entry point to the extensive literature see Box (1980). In this paper, Box describes scientific learning as “an iterative process consisting of Criticism and Estimation,” and holds that “sampling theory is needed for exploration and ultimate criticism of entertained models in the light of data, while Bayes’ theory is needed for estimation.” An interesting related discussion is in Gelman et al. (1996).

While we do find much wisdom in Box’s comment, in this section we look at how traditional Bayesian decision theory can be harnessed to criticize a model, and we also revisit traditional criticism metrics from a decision perspective. In our discussion, model M will be the model initially proposed, and the question is whether or not the decision maker should embark on more complex modeling before carrying out the decision analysis. A simple approach, reflecting some statistical practice, is to set up, for probing purposes, a second model M′ and compare it to M. Bernardo and Smith (1994) propose to choose an appropriate loss function for model evaluation, and to look at the change in posterior expected loss as an indication of the worthiness of M′ compared to M. A related approach is described in Carota et al. (1996), who use model elaborations to estimate the change in utility resulting from a larger model in a neighborhood of M.

If the current model holds, the joint density of the observed data x and unobserved parameters θ is

$$f(x, \theta\,|\,M) = f(x\,|\,\theta, M)\,\pi(\theta\,|\,M).$$

To evaluate model M we embed it in a class of models ℳ, called a model elaboration (see Box and Tiao 1973, Smith 1983, West 1992). For example, suppose that under the current model the data are exponential. To elaborate on this model we may take ℳ to be the family of Weibull distributions. The parameter ϕ indexes models in ℳ, so that

$$f(x, \theta, \phi\,|\,\mathcal{M}) = f(x\,|\,\theta, \phi, \mathcal{M})\,\pi(\theta\,|\,\phi, \mathcal{M})\,\pi(\phi\,|\,\mathcal{M}).$$

The idea of an elaboration is that the original model M is still a member of ℳ for some specific value ϕM of the elaboration parameter. In the model criticism situation, the prior distribution π(ϕ|ℳ) is concentrated around ϕM, to reflect the initial assumption that M is the default model. One way to think of this prior is that it provides a formalization of DeGroot’s pocket ε from Section 10.4.2.

To illustrate, let x|θ, M ~ N(θ, σ²) and θ|M ~ N(μ0, τ²), where σ² and τ² are both known. A useful elaboration ℳ is defined by

$$x\,|\,\theta, \phi, \mathcal{M} \sim N(\theta, \phi\,\sigma^2), \qquad \theta\,|\,\phi, \mathcal{M} \sim N(\mu_0, \tau^2).$$

The elaboration parameter ϕ corresponds to a variance inflation factor, and ϕM = 1. West (1985) and Efron (1986) show how this generalizes to the case where M is an exponential family.

For another illustration, a general way to connect two nonnested models M and M′ defined by densities f(x|M) and g(x|M′) is the elaboration

$$x\,|\,\phi, \mathcal{M} \sim c(\phi)\, f(x|M)^{\phi}\, g(x|M')^{1-\phi},$$

where c(ϕ) is an appropriate normalization function. See Cox (1962) for further discussion.

Given any elaboration, model criticism can be carried out by comparing the original and elaborated posterior expected losses, by comparing the posterior π(θ|x, M) to π(θ|x, ϕ, ℳ), or, lastly, by comparing π(ϕ|x, ℳ) to π(ϕ|ℳ). Carota et al. (1996) consider the latter for defining a criticism measure, and define a loss function capturing the distance between the two distributions. Even though this loss is not the actual terminal loss of the problem we started out with, if the data change the marginal posterior of ϕ by a large amount, it is likely that model M will perform poorly for the purposes of the original loss as well. To simplify notation, we drop the conditioning on ℳ in what follows.

Motivated by the logarithmic loss function of Section 10.3, we define the diagnostic measure as

$$\Delta = \int \pi(\phi\,|\,x)\, \log \frac{\pi(\phi\,|\,x)}{\pi(\phi)}\, d\phi. \tag{11.11}$$

Δ is the Kullback–Leibler divergence between the prior and posterior distributions of ϕ. In a decision problem where the goal is to choose a probability distribution on ϕ, and the utility function is logarithmic, it measures the change in expected utility attributable to observing the data. We will return to this point in Chapter 13 in our discussion of the Lindley information.

Low values of Δ indicate agreement between the prior and posterior distributions of the model elaboration parameter ϕ, validating the current model M. Interpreting high values of Δ is trickier, though, and may require the investigation of π(ϕ|x). If the value of Δ is large and π(ϕ|x) is peaked around ϕM, the value for which the elaborated model coincides with M, then model M is adequate. Otherwise, a large Δ indicates that model M is inappropriate.
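
As a concrete illustration, here is a minimal numerical sketch of Δ for a variance inflation elaboration like the one above; the data, the lognormal prior on ϕ, and the grid quadrature are all made up, and Monte Carlo could replace the grid in more complex settings.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Sketch of the diagnostic Delta in (11.11) for a variance inflation
# elaboration: x_i | theta, phi ~ N(theta, phi*s2), theta ~ N(mu0, t2),
# with a prior on phi concentrated around phi_M = 1. All values are made up.
rng = np.random.default_rng(1)
s2, t2, mu0 = 1.0, 4.0, 0.0
x = rng.normal(0.5, 2.0, size=30)            # data more dispersed than M allows
n = len(x)

phi = np.linspace(0.05, 8.0, 800)
prior = stats.lognorm.pdf(phi, s=0.3, scale=1.0)      # peaked near phi_M = 1
prior /= trapezoid(prior, phi)

def log_marginal(p):
    # f(x | phi), with theta integrated out: x | phi ~ N(mu0*1, p*s2*I + t2*J)
    cov = p * s2 * np.eye(n) + t2 * np.ones((n, n))
    return stats.multivariate_normal.logpdf(x, mean=np.full(n, mu0), cov=cov)

logf = np.array([log_marginal(p) for p in phi])
post = np.exp(logf - logf.max()) * prior
post /= trapezoid(post, phi)

# Delta = KL(posterior || prior) of phi; large values flag trouble with M
delta = trapezoid(post * np.log((post + 1e-300) / prior), phi)
print(f"Delta = {delta:.2f}")
```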

Direct evaluation of equation (11.11) is often difficult. As an alternative, the function Δ can be computed either by an approximation of the prior and the posterior, leading to analytical expressions for Δ, or by using a Monte Carlo approach (Müller and Parmigiani 1996). Another possibility is to consider a linearized diagnostic measure ΔL which approximates Δ when the prior on ϕ is peaked around ϕM. To derive the linearized diagnostic ΔL, observe that

$$\Delta = \int \pi(\phi\,|\,x)\, \log \frac{f(x\,|\,\phi)}{f(x)}\, d\phi,$$

since, by Bayes’ rule, π(ϕ|x)/π(ϕ) = f(x|ϕ)/f(x).

Now, expanding log f(x|ϕ) about ϕM we have

$$\log f(x\,|\,\phi) = \log f(x\,|\,\phi_M) + (\phi - \phi_M)\, \frac{\partial}{\partial \phi} \log f(x\,|\,\phi)\Big|_{\phi = \phi_M} + R(\phi)$$

for some remainder function R(.). The linearized version ΔL is defined as

$$\Delta_L = \log \frac{f(x\,|\,\phi_M)}{f(x)} + E[\phi - \phi_M\,|\,x]\; \frac{\partial}{\partial \phi} \log f(x\,|\,\phi)\Big|_{\phi = \phi_M}.$$

ΔL combines three elements, all of which are relevant model criticism statistics in their own right: the Savage density ratio, defined by

$$\frac{f(x\,|\,\phi_M)}{f(x)} = \frac{\pi(\phi_M\,|\,x)}{\pi(\phi_M)};$$

the posterior expected value of ϕ − ϕM; and the marginal score function (∂/∂ϕ) log f(x|ϕ), evaluated at ϕM. The Savage density ratio is equivalent, under certain conditions, to the Bayes factor for the null hypothesis that ϕ = ϕM against the family of alternatives defined by the elaboration. However, in the diagnostic context, the Bayes factor is a sufficient summary of the data only when the loss function assigns the same penalty to all incorrect models or when the elaboration is binary. The diagnostic approach differs from a model choice analysis based on a Bayes factor in that both Δ and ΔL incorporate in the utility function a penalty for the severity of the departure.

11.3 Exercises

Problem 11.1 In the context of Section 11.2 suppose that M is such that Image where θ0 is a known mean. Consider the elaborated model ℳ given by Image, and Image. Show that the diagnostic Δ is

Image

and the linearized diagnostic ΔL is

Image

Comment on the strength and limitations of these as metrics for model criticism.
