Bayesian Analysis in Testing and Estimation


This chapter is devoted to some topics of estimation and testing hypotheses from the point of view of statistical decision theory. The decision theoretic approach provides a general framework for both estimation of parameters and testing hypotheses. The objective is to study classes of procedures in terms of certain associated risk functions and determine the existence of optimal procedures. The results that we have presented in the previous chapters on minimum mean–squared–error (MSE) estimators and on most powerful tests can be considered as part of the general statistical decision theory. We have seen that uniformly minimum MSE estimators and uniformly most powerful tests exist only in special cases. One could overcome this difficulty by considering procedures that yield minimum average risk, where the risk is defined as the expected loss due to erroneous decision, according to the particular distribution Fθ. The MSE in estimation and the error probabilities in testing are special risk functions. The risk functions depend on the parameters θ of the parent distribution. The average risk can be defined as an expected risk according to some probability distribution on the parameter space. Statistical inference that considers the parameter(s) as random variables is called a Bayesian inference. The expected risk with respect to the distribution of θ is called in Bayesian theory the prior risk, and the probability measure on the parameter space is called a prior distribution. The estimators or test functions that minimize the prior risk, with respect to some prior distribution, are called Bayes procedures for the specified prior distribution. Bayes procedures have certain desirable properties. This chapter is devoted, therefore, to the study of the structure of optimal decision rules in the framework of Bayesian theory. We start Section 8.1 with a general discussion of the basic Bayesian tools and information functions. 
We outline the decision theory and provide an example of an optimal statistical decision procedure. In Section 8.2, we discuss testing of hypotheses from the Bayesian point of view, and in Section 8.3, we present Bayes credibility intervals. The Bayesian theory of point estimation is discussed in Section 8.4. Section 8.5 discusses analytical and numerical techniques for evaluating posterior distributions in complex cases. Section 8.6 is devoted to empirical Bayes procedures.


8.1.1 Prior, Posterior, and Predictive Distributions

In the previous chapters, we discussed problems of statistical inference, testing hypotheses, and estimation, considering the parameters of the statistical models as fixed unknown constants. This is the so–called classical approach to the problems of statistical inference. In the Bayesian approach, the unknown parameters are considered as values determined at random according to some specified distribution, called the prior distribution. This prior distribution can be conceived as a normalized nonnegative weight function that the statistician assigns to the various possible parameter values. It can express the statistician's degree of belief in the various parameter values or the amount of prior information available on the parameters. For the philosophical foundations of the Bayesian theory, see the books of De Finetti (1974), Barnett (1973), Hacking (1965), Savage (1962), and Schervish (1995). We discuss here only the basic mathematical structure.

Let ℱ = {F(x; θ); θ ∈ Θ} be a family of distribution functions specified by the statistical model. The parameters θ of the elements of ℱ are real or vector valued parameters. The parameter space Θ is specified by the model. Let ℋ be a family of distribution functions defined on the parameter space Θ. The statistician chooses an element H(θ) of ℋ and assigns it the role of a prior distribution. The actual parameter value θ0 of the distribution of the observable random variable X is considered to be a realization of a random variable having the distribution H(θ). After observing the value of X, the statistician adjusts the prior information on the value of the parameter θ by converting H(θ) to the posterior distribution H(θ | X). This is done by Bayes' Theorem, according to which if h(θ) is the prior probability density function (p.d.f.) of θ and f(x; θ) is the p.d.f. of X under θ, then the posterior p.d.f. of θ is

(8.1.1)   $h(\theta \mid x) = \dfrac{f(x;\theta)\,h(\theta)}{\int_{\Theta} f(x;\tau)\,h(\tau)\,d\tau}.$

If we are given a sample of n observations or random variables X1, X2, …, Xn, whose distributions belong to a family ℱ, the question is whether these random variables are independent identically distributed (i.i.d.) given θ, or whether θ might be randomly chosen from H(θ) for each observation.

At the beginning, we study the case that X1, …, Xn are conditionally i.i.d., given θ. This is the classical Bayesian model. In Section 8.6, we study the so–called empirical Bayes model, in which θ is randomly chosen from H(θ) for each observation. In the classical model, if the family ℱ admits a sufficient statistic T(X), then for any prior distribution H(θ), the posterior distribution is a function of T(X), and can be determined from the distribution of T(X) under θ. Indeed, by the Neyman–Fisher Factorization Theorem, if T(X) is sufficient for ℱ then f(x; θ) = k(x)g(T(x); θ). Hence,

(8.1.2)   $h(\theta \mid x) = \dfrac{g(T(x);\theta)\,h(\theta)}{\int_{\Theta} g(T(x);\tau)\,h(\tau)\,d\tau}.$

Thus, the posterior p.d.f. is a function of T(X). Moreover, the p.d.f. of T(X) is g* (t; θ) = k* (t)g(t; θ), where k* (t) is independent of θ. It follows that the conditional p.d.f. of θ given {T(X) = t} coincides with h(θ | x) on the sets {x; T(x) = t} for all t.

Bayes predictive distributions are the marginal distributions of the observed random variables, according to the model. More specifically, if a random vector X has a joint distribution F(x; θ) and the prior distribution of θ is H(θ) then the joint predictive distribution of X under H is

(8.1.3)   $F_H(x) = \int_{\Theta} F(x;\theta)\,dH(\theta).$

A most important question in Bayesian analysis is what prior distribution to choose. The answer is, generally, that the prior distribution should reflect possible prior knowledge available on possible values of the parameter. In many situations, the prior information on the parameters is vague. In such cases, we may use formal prior distributions, which are discussed in Section 8.1.2. On the other hand, in certain scientific or technological experiments much is known about possible values of the parameters. This may guide in selecting a prior distribution, as illustrated in the examples.

There are many examples of posterior distributions that belong to the same parametric family as the prior distribution. Generally, if the family of prior distributions ℋ relative to a specific family ℱ yields posteriors in ℋ, we say that ℱ and ℋ are conjugate families. For more discussion on conjugate prior distributions, see Raiffa and Schlaifer (1961). In Example 8.2, we illustrate a few conjugate prior families.
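As a small illustration of a conjugate pair, the sketch below uses the Beta family as prior for a binomial likelihood. This particular pair, the function names, and the numbers are assumptions chosen for illustration and are not taken from Example 8.2. The closed-form update is checked against a direct numerical application of Bayes' theorem on a grid.

```python
from fractions import Fraction


def beta_binomial_posterior(a, b, successes, trials):
    """Return the Beta parameters of the posterior after binomial data.

    Prior: theta ~ Beta(a, b).
    Likelihood: successes | theta ~ Binomial(trials, theta).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return a + successes, b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution as an exact fraction (integer a, b)."""
    return Fraction(a, a + b)


def grid_posterior_mean(a, b, successes, trials, n_grid=20_000):
    """Posterior mean computed from Bayes' theorem on a midpoint grid.

    No conjugacy is used here; the prior kernel times the likelihood kernel
    is normalized numerically.
    """
    step = 1.0 / n_grid
    num = den = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) * step
        prior = theta ** (a - 1) * (1.0 - theta) ** (b - 1)
        likelihood = theta ** successes * (1.0 - theta) ** (trials - successes)
        num += theta * prior * likelihood
        den += prior * likelihood
    return num / den


post_a, post_b = beta_binomial_posterior(2, 2, successes=7, trials=10)
print(post_a, post_b)                    # posterior Beta parameters
print(float(beta_mean(post_a, post_b)))  # closed-form posterior mean
print(grid_posterior_mean(2, 2, 7, 10))  # numerical check of the same quantity
```

The posterior depends on the data only through the number of successes, which is the sufficient statistic in this model.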

The situation when a conjugate prior structure exists is relatively simple and generally leads to an analytic expression of the posterior distribution. In research, however, we often encounter much more difficult problems, as illustrated in Example 8.3. In such cases, we often cannot express the posterior distribution in analytic form, and must resort to the numerical evaluations discussed in Section 8.5.

8.1.2 Noninformative and Improper Prior Distributions

It is sometimes tempting to obtain posterior densities by multiplying the likelihood function by a function h(θ), which is not a proper p.d.f. For example, suppose that X | θ ∼ N(θ, 1). In this case L(θ; X) = exp{−(X − θ)^2/2}. This likelihood function is integrable with respect to dθ. Indeed,

$\int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2}(X-\theta)^2\right\} d\theta = \sqrt{2\pi}.$

Thus, if we consider formally the function h(θ)dθ = c dθ, or h(θ) = c, then

(8.1.4)   $h(\theta \mid X) = \dfrac{1}{\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}(\theta - X)^2\right\},$

which is the p.d.f. of N(X, 1). The function h(θ) = c, c > 0 for all θ, is called an improper prior density since its integral over (−∞, ∞) is infinite. Another example is when X | λ ∼ P(λ), i.e., L(λ | X) = e^{−λ} λ^X. If we use the improper prior density h(λ) = c > 0 for all λ > 0, then the posterior p.d.f. is

(8.1.5)   $h(\lambda \mid X) = \dfrac{1}{X!}\,\lambda^{X} e^{-\lambda}, \quad \lambda > 0.$

This is a proper p.d.f. of G(1, X + 1) despite the fact that h(λ) is an improper prior density. Some people justify the use of an improper prior by arguing that it provides a “diffused” prior, yielding an equal weight to all points in the parameter space. For example, the improper priors that lead to the proper posterior densities (8.1.4) and (8.1.5) may reflect a state of ignorance, in which all points θ in (−∞, ∞) or λ in (0, ∞) are “equally” likely.
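The claim that a flat improper prior on a Poisson mean still yields a proper posterior can be checked numerically. The sketch below is only a sanity check under stated assumptions: the observed value x = 3, the truncation of the integral at 60, and the midpoint rule are illustrative choices.

```python
import math


def poisson_flat_posterior_pdf(lam, x):
    """Posterior density of lambda given X = x under the improper prior h(lambda) = c.

    The constant c cancels, leaving lambda^x * exp(-lambda) / x!.
    """
    return lam ** x * math.exp(-lam) / math.factorial(x)


def integrate(f, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


x = 3  # hypothetical observed count
# The upper limit 60 stands in for infinity; the neglected tail is negligible here.
total_mass = integrate(lambda lam: poisson_flat_posterior_pdf(lam, x), 0.0, 60.0)
post_mean = integrate(lambda lam: lam * poisson_flat_posterior_pdf(lam, x), 0.0, 60.0)

print(total_mass)  # close to 1: the posterior is a proper density
print(post_mean)   # close to x + 1, the mean of a gamma density with shape x + 1
```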

Lindley (1956) defines a prior density h(θ) to be noninformative if it maximizes the predictive gain in information on θ when a random sample of size n is observed. He shows that, in large samples, if the family ℱ satisfies the Cramér–Rao regularity conditions, and the maximum likelihood estimator (MLE) θ̂_n is minimal sufficient for ℱ, then the noninformative prior density is proportional to |I(θ)|^{1/2}, where |I(θ)| is the determinant of the Fisher information matrix. As will be shown in Example 8.4, h(θ) ∝ |I(θ)|^{1/2} is sometimes a proper p.d.f. and sometimes an improper one.

Jeffreys (1961) justified the use of the noninformative prior |I(θ)|^{1/2} on the basis of invariance. He argued that if a statistical model ℱ = {f(x; θ); θ ∈ Θ} is reparametrized to ℱ* = {f*(x; ω); ω ∈ Ω}, where ω = ψ(θ), then the prior density h(θ) should be chosen so that h(θ | X) = h(ω | X).

Let θ = ψ^{−1}(ω) and let J(ω) be the Jacobian matrix of this transformation, with determinant |J(ω)|; then the posterior p.d.f. of ω is

(8.1.6)   $h^{*}(\omega \mid X) \propto h(\psi^{-1}(\omega))\, f(x;\psi^{-1}(\omega))\, |J(\omega)|.$

Recall that the Fisher information matrix of ω is

(8.1.7)   $I^{*}(\omega) = J(\omega)'\, I(\psi^{-1}(\omega))\, J(\omega).$

Thus, if h(θ) ∝ |I(θ)|^{1/2} then from (8.1.6) and (8.1.7), since

(8.1.8)   $|I^{*}(\omega)|^{1/2} = |I(\psi^{-1}(\omega))|^{1/2}\, |J(\omega)|,$

we obtain

(8.1.9)   $h^{*}(\omega \mid X) \propto f(x;\psi^{-1}(\omega))\, |I^{*}(\omega)|^{1/2}.$

The structure of h(θ | X) and of h*(ω | X) is similar. This is the “invariance” property of the posterior, with respect to transformations of the parameter.

A prior density proportional to |I(θ)|^{1/2} is called a Jeffreys prior density.
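The invariance property can be checked numerically in a one-parameter case. The sketch below assumes a Bernoulli model with success probability θ, for which I(θ) = 1/(θ(1 − θ)), and the log-odds reparametrization ω = log(θ/(1 − θ)). Both are illustrative choices, not taken from the examples of this chapter. Transforming the Jeffreys prior of θ by the change-of-variables rule gives the same function as the square root of the Fisher information computed directly in terms of ω.

```python
import math


def fisher_info_theta(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


def jeffreys_theta(theta):
    """Unnormalized Jeffreys prior in the theta parametrization."""
    return math.sqrt(fisher_info_theta(theta))


def theta_of_omega(omega):
    """Inverse of the log-odds map omega = log(theta / (1 - theta))."""
    return 1.0 / (1.0 + math.exp(-omega))


def jacobian(omega):
    """d theta / d omega for the log-odds reparametrization."""
    theta = theta_of_omega(omega)
    return theta * (1.0 - theta)


def transformed_prior(omega):
    """Jeffreys prior of theta carried over to omega by change of variables."""
    return jeffreys_theta(theta_of_omega(omega)) * abs(jacobian(omega))


def jeffreys_omega(omega):
    """Jeffreys prior computed directly from the Fisher information of omega."""
    info_omega = fisher_info_theta(theta_of_omega(omega)) * jacobian(omega) ** 2
    return math.sqrt(info_omega)


for omega in (-3.0, -0.5, 0.0, 1.2, 4.0):
    print(omega, transformed_prior(omega), jeffreys_omega(omega))
```

In this Bernoulli case the Jeffreys prior in θ is proportional to θ^{−1/2}(1 − θ)^{−1/2}, which is a proper Beta density after normalization.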

8.1.3 Risk Functions and Bayes Procedures

In statistical decision theory, we consider the problems of inference in terms of a specified set of actions, 𝒜, and their outcomes. The outcome of the decision is expressed in terms of some utility function, which provides numerical quantities associated with actions of 𝒜 and the given parameters, θ, characterizing the elements of the family ℱ specified by the model. Instead of discussing utility functions, we discuss here loss functions, L(a, θ), a ∈ 𝒜, θ ∈ Θ, associated with actions and parameters. The loss functions are nonnegative functions that assume the value zero if the action chosen does not imply some utility loss when θ is the true state of Nature. One of the important questions is what type of loss function to consider. The answer to this question depends on the decision problem and on the structure of the model. In the classical approach to testing hypotheses, the loss function assumes the value zero if no error is committed and the value one if an error of either kind is committed. In a decision theoretic approach, testing hypotheses can be performed with more general loss functions, as will be shown in Section 8.2. In estimation theory, the squared–error loss function (θ̂(x) − θ)^2 is frequently applied, when θ̂(x) is an estimator of θ. A generalization of this type of loss function, which is of theoretical importance, is the general class of quadratic loss functions, given by

(8.1.10)   $L(\hat{\theta}(x), \theta) = Q(\theta)\,(\hat{\theta}(x) - \theta)^2,$

where Q(θ) > 0 is an appropriate function of θ. For example, (θ̂(x) − θ)^2/θ^2 is a quadratic loss function. Another type of loss function used in estimation theory is the type of function that depends on θ̂(x) and θ only through the absolute value of their difference. That is, L(θ̂(x), θ) = W(|θ̂(x) − θ|). For example, |θ̂(x) − θ|^ν where ν > 0, or log(1 + |θ̂(x) − θ|). Bilinear convex functions of the form

(8.1.11)   $L(\hat{\theta}, \theta) = a_1 (\hat{\theta} - \theta)^{-} + a_2 (\hat{\theta} - \theta)^{+}$

are also in use, where a1, a2 are positive constants; (θ̂ − θ)^− = −min(θ̂ − θ, 0) and (θ̂ − θ)^+ = max(θ̂ − θ, 0).

If the value of θ is known, one can always choose a proper action to ensure no loss. The essence of statistical decision problems is that the true parameter θ is unknown and decisions are made under uncertainty. The random vector X = (X1, …, Xn) provides information about the unknown value of θ. A function from the sample space 𝒳 of X into the action space 𝒜 is called a decision function. We denote it by d(X) and require that it should be a statistic. Let 𝒟 denote a specified set or class of proper decision functions. Using a decision function d(X), the associated loss L(d(X), θ) is a random variable, for each θ. The expected loss under θ, associated with a decision function d(X), is called the risk function and is denoted by R(d, θ) = E_θ{L(d(X), θ)}.

Given the structure of a statistical decision problem, the objective is to select an optimal decision function from 𝒟. Ideally, we would like to choose a decision function d0(X) that minimizes the associated risk function R(d, θ) uniformly in θ. Such a uniformly optimal decision function may not exist, since the function d0 for which R(d0, θ) = inf_{d ∈ 𝒟} R(d, θ) generally depends on the particular value of θ under consideration. There are several ways to overcome this difficulty. One approach is to restrict attention to a subclass of decision functions, like unbiased or invariant decision functions. Another approach for determining optimal decision functions is the Bayesian approach. We define here the notion of Bayes decision function in a general context.

Consider a specified prior distribution, H(θ), defined over the parameter space Θ. With respect to this prior distribution, we define the prior risk, ρ(d, H), as the expected risk value when θ varies over Θ, i.e.,

(8.1.12)   $\rho(d, H) = \int_{\Theta} R(d, \theta)\, h(\theta)\, d\theta,$

where h(θ) is the corresponding p.d.f. A Bayes decision function, with respect to a prior distribution H, is a decision function d_H(x) that minimizes the prior risk ρ(d, H), i.e.,

(8.1.13)   $\rho(d_H, H) = \inf_{d \in \mathcal{D}} \rho(d, H).$

Under some general conditions, a Bayes decision function d_H(x) exists. The Bayes decision function can be generally determined by minimizing the posterior expectation of the loss function for a given value x of the random variable X. Indeed, since L(d, θ) ≥ 0 one can interchange the integration operations below and write

(8.1.14)   $\rho(d, H) = \int_{\Theta} h(\theta) \left\{ \int_{\mathcal{X}} L(d(x),\theta)\, f(x;\theta)\, dx \right\} d\theta = \int_{\mathcal{X}} f_H(x) \left\{ \int_{\Theta} L(d(x),\theta)\, h(\theta \mid x)\, d\theta \right\} dx,$

where f_H(x) = ∫_Θ f(x; τ) h(τ)dτ is the predictive p.d.f. The conditional p.d.f. h(θ | x) = f(x; θ)h(θ)/f_H(x) is the posterior p.d.f. of θ, given X = x. Similarly, the conditional expectation

(8.1.15)   $R(d(x), H) = \int_{\Theta} L(d(x), \theta)\, h(\theta \mid x)\, d\theta$

is called the posterior risk of d(x) under H. Thus, for a given X = x, we can choose d(x) to minimize R(d(x), H). Since L(d(x), θ) ≥ 0 for all θ ∈ Θ and d ∈ 𝒟, the minimization of the posterior risk minimizes also the prior risk ρ(d, H). Thus, d_H(X) is a Bayes decision function.
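The recipe of minimizing the posterior risk for the observed x can be sketched on a toy problem. Everything specific below is an assumption for illustration: a three-point parameter space with a uniform prior, binomial data with 7 successes in 10 trials, squared-error loss, and a grid of candidate actions. The helper names are hypothetical.

```python
from math import comb


def posterior(prior, likelihood):
    """Posterior probabilities on a finite parameter set via Bayes' theorem."""
    joint = {theta: p * likelihood(theta) for theta, p in prior.items()}
    predictive = sum(joint.values())  # predictive probability of the observed data
    return {theta: v / predictive for theta, v in joint.items()}


def posterior_risk(action, post, loss):
    """Posterior expectation of the loss for a given action."""
    return sum(loss(action, theta) * p for theta, p in post.items())


def bayes_action(actions, post, loss):
    """Action minimizing the posterior risk among the candidate actions."""
    return min(actions, key=lambda a: posterior_risk(a, post, loss))


# Hypothetical setup: theta takes one of three values with equal prior weight.
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
x, n = 7, 10


def binomial_likelihood(theta):
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)


def squared_error(action, theta):
    return (action - theta) ** 2


post = posterior(prior, binomial_likelihood)
candidate_actions = [i / 100 for i in range(101)]
d_bayes = bayes_action(candidate_actions, post, squared_error)
post_mean = sum(theta * p for theta, p in post.items())

print(post)
print(d_bayes, post_mean)  # the minimizer is the grid point nearest the posterior mean
```

Under squared-error loss the minimizer of the posterior risk is the posterior mean, so the selected grid action lies within half a grid step of it.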


8.2.1 Testing Simple Hypothesis

We start with the problem of testing two simple hypotheses H0 and H1. Let F0(x) and F1(x) be two specified distribution functions. The hypothesis H0 specifies the parent distribution of X as F0(x); H1 specifies it as F1(x). Let f0(x) and f1(x) be the p.d.f.s corresponding to F0(x) and F1(x), respectively. Let π, 0 ≤ π ≤ 1, be the prior probability that H0 is true. In the special case of two simple hypotheses, the loss function can assign 1 unit to the case of rejecting H0 when it is true and b units to the case of rejecting H1 when it is true. The prior risks associated with accepting H0 and H1 are, respectively, ρ0(π) = (1 − π)b and ρ1(π) = π. For a given value of π, we accept hypothesis Hi (i = 0, 1) if ρi(π) is the minimal prior risk. Thus, a Bayes rule, prior to making observations, is

(8.2.1)   $d_{\pi} = \begin{cases} 0, & \text{if } \pi > \dfrac{b}{1+b}, \\ 1, & \text{otherwise}, \end{cases}$

where d = i is the decision to accept Hi (i = 0, 1).

Suppose that a sample of n i.i.d. random variables X1, …, Xn has been observed. After observing the sample, we determine the posterior probability π(X_n) that H0 is true. This posterior probability is given by

(8.2.2)   $\pi(\mathbf{X}_n) = \dfrac{\pi \prod_{j=1}^{n} f_0(X_j)}{\pi \prod_{j=1}^{n} f_0(X_j) + (1-\pi) \prod_{j=1}^{n} f_1(X_j)}.$

We use the decision rule (8.2.1) with π replaced by π(X_n). Thus, the Bayes decision function is

(8.2.3)   $d_{\pi}(\mathbf{X}_n) = \begin{cases} 0, & \text{if } \pi(\mathbf{X}_n) > \dfrac{b}{1+b}, \\ 1, & \text{otherwise}. \end{cases}$

The Bayes decision function can be written in terms of the test function discussed in Chapter 4 as

(8.2.4)   $\phi_{\pi}(\mathbf{X}_n) = \begin{cases} 1, & \text{if } \prod_{j=1}^{n} \dfrac{f_1(X_j)}{f_0(X_j)} \geq \dfrac{\pi}{b(1-\pi)}, \\ 0, & \text{otherwise}. \end{cases}$

The Bayes test function φ_π(X_n) is similar to the Neyman–Pearson most powerful test, except that the Bayes test is not necessarily randomized even if the distributions Fi(x) are discrete. Moreover, the likelihood ratio ∏_{j=1}^{n} f1(Xj)/f0(Xj) is compared to the ratio of the prior risks.
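A minimal sketch of this Bayes test follows. It assumes, purely for illustration, that H0 specifies N(0, 1) and H1 specifies N(1, 1), with a hypothetical sample of four observations, prior probability 1/2 for H0, and b = 1. The rule accepts H1 when the posterior probability of H0 does not exceed b/(1 + b), which is the threshold π* quoted in Section 8.2.3.

```python
import math
from statistics import NormalDist


def posterior_prob_h0(xs, f0, f1, prior_h0):
    """Posterior probability that H0 is true, given i.i.d. observations xs."""
    log_l0 = sum(math.log(f0(x)) for x in xs)
    log_l1 = sum(math.log(f1(x)) for x in xs)
    shift = max(log_l0, log_l1)  # subtract the larger value for numerical stability
    w0 = prior_h0 * math.exp(log_l0 - shift)
    w1 = (1.0 - prior_h0) * math.exp(log_l1 - shift)
    return w0 / (w0 + w1)


def bayes_decision(xs, f0, f1, prior_h0, b):
    """Return 0 to accept H0 or 1 to accept H1.

    Losses: 1 unit for rejecting a true H0, b units for rejecting a true H1.
    """
    threshold = b / (1.0 + b)
    return 1 if posterior_prob_h0(xs, f0, f1, prior_h0) <= threshold else 0


# Hypothetical densities and data.
f0 = NormalDist(0.0, 1.0).pdf
f1 = NormalDist(1.0, 1.0).pdf
sample = [0.9, 1.4, 0.3, 1.1]

post_h0 = posterior_prob_h0(sample, f0, f1, prior_h0=0.5)
decision = bayes_decision(sample, f0, f1, prior_h0=0.5, b=1.0)
print(post_h0, decision)
```

For these two normal densities the log likelihood ratio of one observation is x − 1/2, so the posterior probability of H0 here equals 1/(1 + e^{1.7}).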

We discuss now some of the important optimality characteristics of Bayes tests of two simple hypotheses. Let R0(φ) and R1(φ) denote the risks associated with an arbitrary test function φ, when H0 or H1 is true, respectively. Let R0(π) and R1(π) denote the corresponding risk values of a Bayes test function, with respect to a prior probability π. Generally

$R_0(\phi) = c_1\, \epsilon_0(\phi),$

$R_1(\phi) = c_2\, \epsilon_1(\phi),$

where ε0(φ) and ε1(φ) are the error probabilities of the test function φ, and c1 and c2 are the costs of erroneous decisions. The set R = {(R0(φ), R1(φ)); all test functions φ} is called the risk set. Since for every 0 ≤ α ≤ 1 and any test functions φ^(1) and φ^(2), αφ^(1) + (1 − α)φ^(2) is also a test function, and since

(8.2.5)   $R_i(\alpha \phi^{(1)} + (1-\alpha)\phi^{(2)}) = \alpha R_i(\phi^{(1)}) + (1-\alpha) R_i(\phi^{(2)}), \quad i = 0, 1,$

the risk set R is convex. Moreover, the set

(8.2.6)   $\{(R_0(\pi), R_1(\pi));\ 0 \leq \pi \leq 1\}$

of all risk points corresponding to the Bayes tests is the lower boundary for R. Indeed, according to (8.2.4) and the Neyman–Pearson Lemma, R1(π) is the smallest possible risk of all test functions φ with R0(φ) = R0(π). Accordingly, all the Bayes tests constitute a complete class in the sense that, for any test function outside the class, there exists a corresponding Bayes test with a risk point having components smaller than or equal to those of that particular test, with at least one component strictly smaller (Ferguson, 1967, Ch. 2). From the decision theoretic point of view there is no sense in considering test functions that do not belong to the complete class. These results can be generalized to the case of testing k simple hypotheses (Blackwell and Girshick, 1954; Ferguson, 1967).

8.2.2 Testing Composite Hypotheses

Let Θ0 and Θ1 be the sets of θ–points corresponding to the (composite) hypotheses H0 and H1, respectively. These sets contain a finite or infinite number of points. Let H(θ) be a prior distribution function specified over Θ = Θ0 ∪ Θ1. The posterior probability of H0, given n i.i.d. random variables X1, …, Xn, is

(8.2.7)   $\pi(\mathbf{X}_n) = \dfrac{\int_{\Theta_0} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)}{\int_{\Theta} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)},$

where f(x; θ) is the p.d.f. of X under θ. The notation in (8.2.7) signifies that if the sets are discrete the corresponding integrals are sums and dH(θ) are prior probabilities; otherwise dH(θ) = h(θ)dθ, where h(θ) is a p.d.f. The Bayes decision rule is obtained by computing the posterior risk associated with accepting H0 or with accepting H1 and making the decision associated with the minimal posterior risk. The form of the Bayes test depends, therefore, on the loss function employed.

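If the loss functions associated with accepting H0 or H1 are

$L_0(\theta) = c_0\, I\{\theta \in \Theta_1\} \quad \text{and} \quad L_1(\theta) = c_1\, I\{\theta \in \Theta_0\},$

then the associated posterior risk functions are

(8.2.8)   $R_0(\mathbf{X}) = c_0\, \dfrac{\int_{\Theta_1} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)}{\int_{\Theta} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)}$

$R_1(\mathbf{X}) = c_1\, \dfrac{\int_{\Theta_0} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)}{\int_{\Theta} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)}.$

In this case, the Bayes test function is

(8.2.9)   $\phi_H(\mathbf{X}) = \begin{cases} 1, & \text{if } \int_{\Theta_1} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta) > \dfrac{c_1}{c_0} \int_{\Theta_0} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta), \\ 0, & \text{otherwise}. \end{cases}$

In other words, the hypothesis H0 is rejected if the predictive likelihood ratio

(8.2.10)   $\Lambda_H(\mathbf{X}) = \dfrac{\int_{\Theta_1} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)}{\int_{\Theta_0} \prod_{j=1}^{n} f(X_j;\theta)\, dH(\theta)}$

is greater than the loss ratio c1/c0. This can be considered as a generalization of (8.2.4). The predictive likelihood ratio Λ_H(X) is also called the Bayes Factor in favor of H1 against H0 (Good, 1965, 1967).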
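A hedged numerical sketch of this test for a one-sided problem follows. It assumes the predictive likelihood ratio is the ratio of the prior-weighted likelihood integrated over Θ1 to the same integral over Θ0, which is consistent with the relative betting odds quoted below. The model (binomial data with a uniform prior on θ), the hypotheses H0: θ ≤ 0.5 against H1: θ > 0.5, and the costs are illustrative assumptions.

```python
def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


def predictive_likelihood_ratio(x, n, theta0=0.5):
    """Ratio of integrated likelihoods over theta > theta0 and theta <= theta0.

    Uniform prior density on (0, 1); the binomial coefficient cancels in the ratio.
    """
    def kernel(theta):
        return theta ** x * (1.0 - theta) ** (n - x)

    over_h1 = integrate(kernel, theta0, 1.0)
    over_h0 = integrate(kernel, 0.0, theta0)
    return over_h1 / over_h0


def reject_h0(ratio, c0, c1):
    """Reject H0 when the predictive likelihood ratio exceeds c1 / c0."""
    return ratio > c1 / c0


ratio = predictive_likelihood_ratio(x=7, n=10)
post_h0 = 1.0 / (1.0 + ratio)  # posterior probability of H0
print(ratio, post_h0, reject_h0(ratio, c0=1.0, c1=1.0))
```

With 7 successes in 10 trials the posterior of θ is Beta(8, 4), so the posterior probability of H0 can be cross-checked against the exact value 232/2048.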

Cornfield (1969) suggested as a test function the ratio of the posterior odds in favor of H0, i.e., P[H0 | X]/(1 − P[H0 | X]), to the prior odds π/(1 − π), where π = P[H0] is the prior probability of H0. The rule is to reject H0 when this ratio is smaller than a suitable constant. Cornfield called this statistic the relative betting odds. Note that the relative betting odds equal [Λ_H(X) π/(1 − π)]^{−1}. We see that Cornfield’s test function is equivalent to (8.2.9) for suitably chosen cost factors.

Karlin (1956) and Karlin and Rubin (1956) proved that in monotone likelihood ratio families the Bayes test function is monotone in the sufficient statistic T(X). For testing H0: θ ≤ θ0 against H1: θ > θ0, the Bayes procedure rejects H0 whenever T(X) ≥ ξ0. The result can be further generalized to the problem of testing multiple hypotheses (Zacks, 1971; Ch. 10).

The problem of testing the composite hypothesis that all the probabilities in a multinomial distribution have the same value has drawn considerable attention in the statistical literature; see in particular the papers of Good (1967), Good and Crook (1974), and Good (1975). The Bayes test procedure proposed by Good (1967) is based on the symmetric Dirichlet prior distribution. More specifically, if X = (X1, …, Xk)′ is a random vector having the multinomial distribution M(n, θ), then the parameter vector θ is ascribed the prior distribution with p.d.f.

(8.2.11)   $h(\theta_1, \ldots, \theta_k) = \dfrac{\Gamma(k\nu)}{\Gamma^{k}(\nu)} \prod_{i=1}^{k} \theta_i^{\nu-1},$

0 < θ1, …, θk < 1 and θ1 + ⋯ + θk = 1. The Bayes factor for testing H0: θ = (1/k)1 against the composite alternative hypothesis H1: θ ≠ (1/k)1, where 1 = (1, …, 1)′, according to (8.2.10) is

(8.2.12)   $\Lambda(\nu; \mathbf{X}) = \dfrac{k^{n}\, \Gamma(k\nu)}{\Gamma(k\nu + n)} \prod_{i=1}^{k} \dfrac{\Gamma(\nu + X_i)}{\Gamma(\nu)}.$
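The Bayes factor for the equiprobable multinomial hypothesis can be evaluated on the log scale to avoid overflow of the gamma functions. The sketch below assumes the closed form obtained by dividing the Dirichlet-multinomial marginal likelihood under a symmetric Dirichlet prior with parameter ν by the multinomial likelihood at θ = (1/k)1. This form is a derivation-based assumption, and the function name is hypothetical.

```python
import math


def log_bayes_factor_equiprobable(counts, nu):
    """Log Bayes factor of a symmetric Dirichlet(nu) alternative against equal cell probabilities.

    Assumed closed form:
        k^n * Gamma(k*nu) / Gamma(k*nu + n) * prod_i Gamma(nu + x_i) / Gamma(nu)
    where k is the number of cells and n the total count.
    """
    k = len(counts)
    n = sum(counts)
    log_bf = n * math.log(k) + math.lgamma(k * nu) - math.lgamma(k * nu + n)
    log_bf += sum(math.lgamma(nu + x) - math.lgamma(nu) for x in counts)
    return log_bf


# Two cells with a uniform prior (nu = 1): the marginal probability of any split
# of n = 4 observations is 1/5, which gives simple values to check by hand.
print(math.exp(log_bayes_factor_equiprobable([3, 1], nu=1.0)))
print(math.exp(log_bayes_factor_equiprobable([4, 0], nu=1.0)))
```

Values above the chosen cost ratio c1/c0 lead to rejection of the equiprobable hypothesis. The balanced split [3, 1] gives a factor below 1, while the extreme split [4, 0] gives a factor above 1.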

From the purely Bayesian point of view, the statistician should be able to choose an appropriate value of ν and some cost ratio c1/c0 for erroneous decisions, according to subjective judgment, and reject H0 if Λ (ν; X) ≥ c1/c0. In practice, it is generally not so simple to judge what are the appropriate values of ν and c1/c0. Good and Crook (1974) suggested two alternative ways to solve this problem. One suggestion is to consider an integrated Bayes factor

(8.2.13) numbered Display Equation

where inline (ν) is the p.d.f. of a log–Cauchy distribution, i.e.,

(8.2.14) numbered Display Equation

The second suggestion is to find the value ν0 for which Λ(ν; X) is maximized and reject H0 if Λ* = (2 log Λ(ν0; X))^{1/2} exceeds the (1 − α)–quantile of the asymptotic distribution of Λ* under H0. We see that non–Bayesian (frequentist) considerations are introduced in order to arrive at an appropriate critical level for Λ*. Good and Crook call this approach a “Bayes/Non–Bayes compromise.” We have presented this problem and the approaches suggested for its solution to show that in practical work a nondogmatic approach is needed. It may be reasonable to derive a test statistic in a Bayesian framework and apply it in a non–Bayesian manner.

8.2.3 Bayes Sequential Testing of Hypotheses

We consider in the present section an application of the general theory of Section 8.1.5 to the case of testing two simple hypotheses. We have seen in Section 8.2.1 that the Bayes decision test function, after observing X_n, is to reject H0 if the posterior probability, π(X_n), that H0 is true is less than or equal to a constant π*. The associated Bayes risk is ρ^(0)(π(X_n)) = π(X_n) I{π(X_n) ≤ π*} + b(1 − π(X_n)) I{π(X_n) > π*}, where π* = b/(1 + b). If π(X_n) = π then the posterior probability of H0 after the (n + 1)st observation is ψ(π, X_{n+1}) = π/(π + (1 − π)R(X_{n+1})), where R(x) = f1(x)/f0(x) is the likelihood ratio. The predictive risk associated with an additional observation is

(8.2.15)   $\bar{\rho}_1(\pi) = c + E\{\rho^{(0)}(\psi(\pi, X))\},$

where c is the cost of one observation, and the expectation is with respect to the predictive distribution of X given π. We can show that the function ρ̄_1(π) is concave on [0, 1] and thus continuous on (0, 1). Moreover, ρ̄_1(0) ≥ c and ρ̄_1(1) ≥ c. Note that ψ(π, X) → 0 w.p.1 if π → 0 and ψ(π, X) → 1 w.p.1 if π → 1. Since ρ^(0)(π) is bounded by π*, we obtain by the Lebesgue Dominated Convergence Theorem that E{ρ^(0)(ψ(π, X))} → 0 as π → 0 or as π → 1. The Bayes risk associated with an additional observation is

(8.2.16)   $\rho^{(1)}(\pi) = \min\{\rho^{(0)}(\pi),\ \bar{\rho}_1(\pi)\}.$

Thus, if c ≥ b/(1 + b) it is not optimal to make any observation. On the other hand, if c < b/(1 + b) there exist two points π_1^(1) and π_2^(1), such that 0 < π_1^(1) < π* < π_2^(1) < 1, and

(8.2.17)   $\rho^{(1)}(\pi) = \begin{cases} \rho^{(0)}(\pi), & \text{if } \pi \leq \pi_1^{(1)} \text{ or } \pi \geq \pi_2^{(1)}, \\ \bar{\rho}_1(\pi), & \text{otherwise}. \end{cases}$

Let

(8.2.18)   $\bar{\rho}_2(\pi) = c + E\{\rho^{(1)}(\psi(\pi, X))\},$

and let

(8.2.19)   $\rho^{(2)}(\pi) = \min\{\rho^{(0)}(\pi),\ \bar{\rho}_2(\pi)\}.$

Since ρ^(1)(ψ(π, X)) ≤ ρ^(0)(ψ(π, X)) for each π with probability one, we obtain that ρ̄_2(π) ≤ ρ̄_1(π) for all 0 ≤ π ≤ 1. Thus, ρ^(2)(π) ≤ ρ^(1)(π) for all π, 0 ≤ π ≤ 1. ρ̄_2(π) is also a concave function of π on [0, 1] and ρ̄_2(0) = ρ̄_2(1) = c. Thus, there exist π_1^(2) ≤ π_1^(1) and π_2^(2) ≥ π_2^(1) such that

(8.2.20)   $\rho^{(2)}(\pi) = \begin{cases} \rho^{(0)}(\pi), & \text{if } \pi \leq \pi_1^{(2)} \text{ or } \pi \geq \pi_2^{(2)}, \\ \bar{\rho}_2(\pi), & \text{otherwise}. \end{cases}$

We define now recursively, for each π on [0, 1],

(8.2.21)   $\bar{\rho}_n(\pi) = c + E\{\rho^{(n-1)}(\psi(\pi, X))\},$

(8.2.22)   $\rho^{(n)}(\pi) = \min\{\rho^{(0)}(\pi),\ \bar{\rho}_n(\pi)\}.$

These functions constitute for each π monotone sequences ρ̄_n(π) ≤ ρ̄_{n−1}(π) and ρ^(n)(π) ≤ ρ^(n−1)(π) for every n ≥ 1. Moreover, for each n there exist 0 < π_1^(n) ≤ π_1^(n−1) < π_2^(n−1) ≤ π_2^(n) < 1 such that

(8.2.23)   $\rho^{(n)}(\pi) = \begin{cases} \rho^{(0)}(\pi), & \text{if } \pi \leq \pi_1^{(n)} \text{ or } \pi \geq \pi_2^{(n)}, \\ \bar{\rho}_n(\pi), & \text{otherwise}. \end{cases}$

Let ρ(π) = lim_{n→∞} ρ^(n)(π) for each π in [0, 1] and ρ̄(π) = c + E{ρ(ψ(π, X))}. By the Lebesgue Monotone Convergence Theorem, we prove that ρ̄(π) = lim_{n→∞} ρ̄_n(π) for each π ∈ [0, 1]. The boundary points π_1^(n) and π_2^(n) converge to π1 and π2, respectively, where 0 < π1 < π2 < 1. Consider now a nontruncated Bayes sequential procedure, with the stopping variable

(8.2.24)   $N = \min\{n \geq 0 :\ \pi(\mathbf{X}_n) \leq \pi_1 \ \text{or}\ \pi(\mathbf{X}_n) \geq \pi_2\},$

where X_0 ≡ 0 and π(X_0) ≡ π. Since under H0, π(X_n) → 1 with probability one and under H1, π(X_n) → 0 with probability one, the stopping variable (8.2.24) is finite with probability one.

It is generally very difficult to determine the exact Bayes risk function ρ (π) and the exact boundary points π1 and π2. One can prove, however, that the Wald sequential probability ratio test (SPRT) (see Section 4.8.1) is a Bayes sequential procedure in the class of all stopping variables for which N≥ 1, corresponding to some prior probability π and cost parameter b. For a proof of this result, see Ghosh (1970, p. 93) or Zacks (1971, p. 456). A large sample approximation to the risk function ρ (π) was given by Chernoff (1959). Chernoff has shown that in the SPRT given by the boundaries (A, B) if A → −∞ and B → ∞, we have

(8.2.25) numbered Display Equation

where the cost of observations c→ 0 and I(0, 1), I(1, 0) are the Kullback–Leibler information numbers. Moreover, as c→ 0

(8.2.26) numbered Display Equation

Shiryayev (1973, p. 127) derived an expression for the Bayes risk ρ (π) associated with a continuous version of the Bayes sequential procedure related to a Wiener process. Reduction of the testing problem for the mean of a normal distribution to a free boundary problem related to the Wiener process was done also by Chernoff (1961, 1965, 1968); see also the book of Dynkin and Yushkevich (1969).

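A simpler sequential stopping rule for testing two simple hypotheses is

(8.2.27)   $N_{\epsilon} = \min\{n \geq 1 :\ \pi(\mathbf{X}_n) \leq \epsilon \ \text{or}\ \pi(\mathbf{X}_n) \geq 1 - \epsilon\}, \quad 0 < \epsilon < 1/2.$

If π(X_N) ≤ ε then H0 is rejected, and if π(X_N) ≥ 1 − ε then H0 is accepted. This stopping rule is equivalent to a Wald SPRT (A, B) on the log likelihood ratio Σ_{j=1}^{n} log(f1(Xj)/f0(Xj)), with the limits

$A = \log\dfrac{\epsilon}{1-\epsilon} + \log\dfrac{\pi}{1-\pi}, \qquad B = \log\dfrac{1-\epsilon}{\epsilon} + \log\dfrac{\pi}{1-\pi}.$

If π = 1/2 then, according to the results of Section 4.8.1, the average error probability is less than or equal to ε. This result can be extended to the problem of testing k simple hypotheses (k ≥ 2), as shown in the following.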
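The simpler stopping rule can be sketched as a loop that updates the posterior probability of H0 one observation at a time and stops once it leaves a central band. The tolerance name eps, the normal densities N(0, 1) and N(1, 1), and the data are assumptions for illustration. The function returns no decision if the supplied observations run out before the rule stops.

```python
from statistics import NormalDist


def sequential_posterior_test(observations, f0, f1, prior_h0, eps):
    """Stop when the posterior probability of H0 is <= eps or >= 1 - eps.

    Returns (n, decision, posterior), where decision is 1 (reject H0),
    0 (accept H0), or None if sampling ended without reaching a boundary.
    """
    post = prior_h0
    n = 0
    for n, x in enumerate(observations, start=1):
        w0 = post * f0(x)
        w1 = (1.0 - post) * f1(x)
        post = w0 / (w0 + w1)  # one-step Bayes update of the probability of H0
        if post <= eps:
            return n, 1, post
        if post >= 1.0 - eps:
            return n, 0, post
    return n, None, post


f0 = NormalDist(0.0, 1.0).pdf  # hypothetical density under H0
f1 = NormalDist(1.0, 1.0).pdf  # hypothetical density under H1
data = [1.2, 0.8, 1.5, 2.0, 1.1]

n_stop, decision, post = sequential_posterior_test(data, f0, f1, prior_h0=0.5, eps=0.05)
print(n_stop, decision, post)
```

For these densities each observation adds x − 1/2 to the log likelihood ratio, so with prior probability 1/2 the rule stops the first time the cumulative sum reaches log(19) in absolute value.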

Let H1, …, Hk be k hypotheses (k ≥ 2) concerning the distribution of a random variable (vector) X. According to Hj, the p.d.f. of X is fj(x; θ), θ ∈ Θ_j, j = 1, …, k. The parameter θ is a nuisance parameter, whose parameter space Θ_j may depend on Hj. Let Gj(θ), j = 1, …, k, be a prior distribution on Θ_j, and let π_j be the prior probability that Hj is the true hypothesis, π_1 + ⋯ + π_k = 1. Given n observations X1, …, Xn, which are assumed to be conditionally i.i.d., we compute the predictive likelihood of Hj, namely,

(8.2.28)   $L_j(\mathbf{X}_n) = \int_{\Theta_j} \prod_{i=1}^{n} f_j(X_i;\theta)\, dG_j(\theta),$

j = 1, …, k. Finally, the posterior probability of Hj, after n observations, is

(8.2.29)   $\pi_j(\mathbf{X}_n) = \dfrac{\pi_j L_j(\mathbf{X}_n)}{\sum_{i=1}^{k} \pi_i L_i(\mathbf{X}_n)}.$

We consider the following Bayesian stopping variable, for some 0 < ε < 1.

(8.2.30)   $N_{\epsilon} = \min\{n \geq 1 :\ \max_{1 \leq j \leq k} \pi_j(\mathbf{X}_n) \geq 1 - \epsilon\}.$

Obviously, one considers small values of ε, 0 < ε < 1/2, and for such ε, there is a unique value j* such that π_{j*}(X_{N_ε}) ≥ 1 − ε. At stopping, hypothesis H_{j*} is accepted.

For each n ≥ 1, partition the sample space inline(n) of Xn to (k + 1) disjoint sets

Unnumbered Display Equation

and inline = inlineninlineinline. As long as xn inline inline we continue sampling. Thus, Ninline = min inline. In this sequential testing procedure, decision errors occur at stopping, when the wrong hypothesis is accepted. Thus, let δij denote the predictive probability of accepting Hi when Hj is the correct hypothesis. That is,

(8.2.31) numbered Display Equation

Note that, for π* = 1 − ε, πj(xn) ≥ π* if, and only if,

(8.2.32) numbered Display Equation

Let α_j denote the predictive error probability of rejecting Hj when it is true, i.e., α_j = Σ_{i≠j} δ_ij.

The average predictive error probability is ᾱ_π = Σ_{j=1}^{k} π_j α_j.

Theorem 8.2.1. For the stopping variable N_ε, the average predictive error probability satisfies ᾱ_π ≤ ε.

Proof.   From the inequality (8.2.32), we obtain

(8.2.33) numbered Display Equation

Summing over i, we get

Unnumbered Display Equation


Unnumbered Display Equation

Summing over j, we obtain

(8.2.34) numbered Display Equation

The first term on the RHS of (8.2.34) is

(8.2.35) numbered Display Equation

The second term on the RHS of (8.2.34) is

(8.2.36) numbered Display Equation

Substitution of (8.2.35) and (8.2.36) into (8.2.34) yields

Unnumbered Display Equation


Unnumbered Display Equation        QED

Thus, the Bayes sequential procedure given by the stopping variable N_ε and the associated decision rule can provide an excellent testing procedure when the number of hypotheses k is large. Rogatko and Zacks (1993) applied this procedure for testing the correct gene order. In this problem, if one wishes to order m gene loci on a chromosome, the number of hypotheses to test is k = m!/2.


8.3.1 Credibility Intervals

Let ℱ = {F(x; θ); θ ∈ Θ} be a parametric family of distribution functions. Let H(θ) be a specified prior distribution of θ and H(θ | X) be the corresponding posterior distribution, given X. If θ is real then an interval (L_α(X), U_α(X)) is called a Bayes credibility interval of level 1 − α if for all X (with probability 1)

(8.3.1)   $P_H\{L_\alpha(X) \leq \theta \leq U_\alpha(X) \mid X\} \geq 1 - \alpha.$

In multiparameter cases, we can speak of Bayes credibility regions. Bayes tolerance intervals are defined similarly.

Box and Tiao (1973) discuss Bayes intervals, called highest posterior density (HPD) intervals. These intervals are defined as θ intervals for which the posterior coverage probability is at least (1−α) and every θ–point within the interval has a posterior density not smaller than that of any θ–point outside the interval. More generally, a region RH(X) is called a (1 − α) HPD region if

(i) P_H[θ ∈ R_H(X) | X] ≥ 1 − α, for all X; and
(ii) h(θ | x) ≥ h(θ′ | x), for every θ ∈ R_H(x) and θ′ ∉ R_H(x).

When the posterior distribution is unimodal but not symmetric, the HPD intervals are Bayes credibility intervals that are not equal–tailed. For various interesting examples, see Box and Tiao (1973).
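The difference between an equal-tailed credibility interval and an HPD interval can be seen numerically on a skewed posterior. The sketch below assumes a posterior density proportional to θ^8(1 − θ)^4 on (0, 1), that is, a Beta(9, 5) posterior, and approximates it on a grid. The grid size and helper names are illustrative choices.

```python
def grid_posterior(unnormalized_density, n_grid=20_000):
    """Discretize a density on (0, 1) into midpoints and normalized cell probabilities."""
    step = 1.0 / n_grid
    thetas = [(i + 0.5) * step for i in range(n_grid)]
    weights = [unnormalized_density(t) for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]


def equal_tail_interval(thetas, probs, alpha):
    """Interval cutting off probability alpha / 2 in each tail."""
    lower = upper = None
    cumulative = 0.0
    for theta, p in zip(thetas, probs):
        cumulative += p
        if lower is None and cumulative >= alpha / 2:
            lower = theta
        if cumulative >= 1.0 - alpha / 2:
            upper = theta
            break
    return lower, upper


def hpd_interval(thetas, probs, alpha):
    """Collect grid cells in order of decreasing density until coverage reaches 1 - alpha.

    For a unimodal density the collected cells form one interval.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    mass = 0.0
    chosen = []
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= 1.0 - alpha:
            break
    return thetas[min(chosen)], thetas[max(chosen)]


def coverage(thetas, probs, lower, upper):
    """Posterior probability of the closed interval [lower, upper] on the grid."""
    return sum(p for t, p in zip(thetas, probs) if lower <= t <= upper)


thetas, probs = grid_posterior(lambda t: t ** 8 * (1.0 - t) ** 4)
et_lo, et_hi = equal_tail_interval(thetas, probs, alpha=0.05)
hpd_lo, hpd_hi = hpd_interval(thetas, probs, alpha=0.05)
print("equal-tailed:", et_lo, et_hi)
print("HPD:         ", hpd_lo, hpd_hi)
```

Both intervals have posterior coverage of at least 0.95, and the HPD interval is no longer than the equal-tailed one.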

8.3.2 Prediction Intervals

Suppose X is a random variable (vector) having a p.d.f. f(x;θ), θ inline Θ. If θ is known, an interval Iα (θ) is called a prediction interval for X, at level (1 − α) if

$$P_\theta\{X \in I_\alpha(\theta)\} \ge 1 - \alpha. \tag{8.3.2}$$

When θ is unknown, one can use a Bayesian predictive distribution to determine an interval Iα(H) such that the predictive probability of {X ∈ Iα(H)} is at least 1 − α. This predictive interval depends on the prior distribution H(θ). After observing X1, …, Xn, one can determine a prediction interval (region) for (Xn+1, …, Xn+m) by using the posterior distribution H(θ | Xn) in the predictive distribution fH(x | xn) = ∫Θ f(x; θ) dH(θ | xn). In Example 8.12, we illustrate such prediction intervals. For additional theory and examples, see Geisser (1993).


8.4.1 General Discussion and Examples

When the objective is to provide a point estimate of the parameter θ or a function ω = g(θ), we identify the action space with the parameter space. The decision function d(X) is an estimator with domain χ and range Θ, or Ω = g(Θ). For various loss functions the Bayes decision is an estimator θ̂H(X) that minimizes the posterior risk. In the following table, we present some loss functions and the corresponding Bayes estimators.

In the examples, we derive Bayesian estimators for several models of interest and show the dependence of the resulting estimators on the loss function and on the prior distributions.

Loss Function                  Bayes Estimator
(θ̂ − θ)²                       θ̂(X) = EH{θ | X} (the posterior expectation)
Q(θ)(θ̂ − θ)²                   EH{θ Q(θ) | X}/EH{Q(θ) | X}
|θ̂ − θ|                        θ̂(X) = median of the posterior distribution, i.e., H−1(.5 | X)
a(θ̂ − θ)⁻ + b(θ̂ − θ)⁺          The a/(a + b) quantile of H(θ | X), i.e., H−1(a/(a + b) | X)
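The entries of the table can be checked numerically. The sketch below is a minimal illustration and not part of the original derivations. It draws from a hypothetical gamma posterior (shape 3 and scale 1/2 are assumed values), computes the posterior mean, the median, and the quantile of level a/(a + b), and compares the Monte Carlo posterior risks of the three estimates under the asymmetric loss of the last row, taking (x)⁻ = −min(0, x) and (x)⁺ = max(0, x).

```python
import random


def bayes_estimates(draws, a, b):
    """Posterior mean, median, and a/(a+b) quantile computed from posterior draws."""
    s = sorted(draws)
    n = len(s)
    mean = sum(s) / n
    median = s[n // 2]
    quantile = s[int(n * a / (a + b))]
    return mean, median, quantile


def asymmetric_risk(estimate, draws, a, b):
    """Monte Carlo posterior risk of a*(est - theta)^- + b*(est - theta)^+."""
    loss = 0.0
    for theta in draws:
        diff = estimate - theta
        loss += a * max(0.0, -diff) + b * max(0.0, diff)
    return loss / len(draws)


rng = random.Random(0)
# Hypothetical posterior: gamma with shape 3 and scale 0.5 (assumed for illustration).
draws = [rng.gammavariate(3.0, 0.5) for _ in range(20000)]
a, b = 1.0, 3.0                       # overestimation costs three times as much
mean, median, quantile = bayes_estimates(draws, a, b)
risks = {name: asymmetric_risk(est, draws, a, b)
         for name, est in [("mean", mean), ("median", median), ("quantile", quantile)]}
print(mean, median, quantile, risks)
```

Since overestimation is penalized more heavily here, the minimizing estimate lies below both the median and the mean.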

8.4.2 Hierarchical Models

Lindley and Smith (1972) and Smith (1973a, b) advocated a somewhat more complicated methodology. They argue that the choice of a proper prior should be based on the notion of exchangeability. Random variables W1, W2, …, Wk are called exchangeable if the joint distribution of (W1, …, Wk) is the same as that of (Wi1, …, Wik), where (i1, …, ik) is any permutation of (1, 2, …, k). The joint p.d.f. of exchangeable random variables can be represented as a mixture of appropriate p.d.f.s of i.i.d. random variables. More specifically, if, conditional on w, W1, …, Wk are i.i.d. with joint p.d.f. f(W1, …, Wk; w) = ∏i g(Wi, w), i = 1, …, k, and if w is given a probability distribution P(w), then the p.d.f.

$$f_P(W_1, \ldots, W_k) = \int \prod_{i=1}^{k} g(W_i, w)\, dP(w) \tag{8.4.1}$$

represents a distribution of exchangeable random variables. If the vector X represents the means of k independent samples, the present model coincides with the Model II of ANOVA, with known variance components and an unknown grand mean μ. This model is a special case of a Bayesian linear model called by Lindley and Smith a three–stage linear model, or a hierarchical model. The general formulation of such a model is

Unnumbered Display Equation


Unnumbered Display Equation

where X is an n × 1 vector, θi are pi × 1 (i = 1, 2, 3), A1, A2, A3 are known constant matrices, and V, Σ, C are known covariance matrices. Lindley and Smith (1972) have shown that for a noninformative prior for θ2 obtained by letting C−1 → 0, the Bayes estimator of θ1, for the loss function L(θ̂1, θ1) = ||θ̂1 − θ1||², is given by

(8.4.2) numbered Display Equation


(8.4.3) numbered Display Equation

We see that this Bayes estimator coincides with the LSE, (A′A)−1A′X, when V = I and Σ−1 → 0. This result depends very strongly on the knowledge of the covariance matrix V. Lindley and Smith (1972) suggested an iterative solution for a Bayesian analysis when V is unknown. Interesting special results for models of one–way and two–way ANOVA can be found in Smith (1973b).

A comprehensive Bayesian analysis of the hierarchical Model II of ANOVA is given in Chapter 5 of Box and Tiao (1973).

In Gelman et al. (1995, pp. 129–134), we find an interesting example of a hierarchical model in which

Unnumbered Display Equation

θ1, …, θk are conditionally i.i.d., with

Unnumbered Display Equation

and (α, β) have an improper prior p.d.f.

Unnumbered Display Equation

According to this model, θ = (θ1, …, θk) is a vector of a priori exchangeable (not independent) parameters. We can easily show that the posterior joint p.d.f. of θ, given J = (J1, …, Jk) and (α, β), is

(8.4.4) numbered Display Equation

In addition, the posterior p.d.f. of (α, β) is

(8.4.5) numbered Display Equation

The objective is to obtain the joint posterior p.d.f.

Unnumbered Display Equation

From h(θ| J) one can derive a credibility region for θ, etc.

8.4.3 The Normal Dynamic Linear Model

In time–series analysis for econometrics, signal processing in engineering and other areas of applications, one often encounters series of random vectors that are related according to the following linear dynamic model

(8.4.6) numbered Display Equation

where A and G are known matrices, which are (for simplicity) fixed; {εn} is a sequence of i.i.d. random vectors; {ωn} is a sequence of i.i.d. random vectors; {εn} and {ωn} are independent sequences; and

(8.4.7) numbered Display Equation

We further assume that θ0 has a prior normal distribution, i.e.,

(8.4.8) numbered Display Equation

and that θ0 is independent of {εt} and {ωt}. This model is called the normal random walk model.

We compute now the posterior distribution of θ1, given Y1. From multivariate normal theory, since

Unnumbered Display Equation


Unnumbered Display Equation

we obtain

Unnumbered Display Equation

Let F1 = Ω + GC0G′. Then, we obtain after some manipulations

Unnumbered Display Equation


(8.4.9) numbered Display Equation


(8.4.10) numbered Display Equation

Define, recursively for j ≥ 1

Unnumbered Display Equation


(8.4.11) numbered Display Equation

The recursive equations (8.4.11) are called the Kalman filter. Note that, for each n ≥ 1, ηn depends on 𝒟n = (Y1, …, Yn). Moreover, we can prove by induction on n that

(8.4.12) numbered Display Equation

for all n ≥ 1. For additional theory and applications in Bayesian forecasting and smoothing, see Harrison and Stevens (1976), West, Harrison, and Migon (1985), and the book of West and Harrison (1997). We illustrate this sequential Bayesian process in Example 8.19.
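A scalar version of the recursion can be sketched in Python as follows. The sketch assumes the usual form of the model, Yn = Aθn + εn and θn = Gθn−1 + ωn, with εn ∼ N(0, V), ωn ∼ N(0, Ω) and θ0 ∼ N(η0, C0). These forms, the gain expression, and the names `kalman_step` and `kalman_filter` follow the standard Kalman filter and are assumptions, not a transcription of (8.4.11).

```python
def kalman_step(eta, C, y, A=1.0, G=1.0, V=1.0, Omega=0.0):
    """One update of the posterior N(eta, C) of a scalar state after observing y.

    Assumed model: y = A*theta_n + eps_n, theta_n = G*theta_{n-1} + omega_n,
    eps_n ~ N(0, V), omega_n ~ N(0, Omega).
    """
    F = Omega + G * C * G            # variance of theta_n given the past data
    prior_mean = G * eta             # mean of theta_n given the past data
    S = V + A * F * A                # predictive variance of y
    gain = F * A / S
    eta_new = prior_mean + gain * (y - A * prior_mean)
    C_new = F - gain * A * F
    return eta_new, C_new


def kalman_filter(ys, eta0, C0, **model):
    """Run the recursion over y_1, ..., y_n and return the list of (eta_n, C_n)."""
    eta, C, path = eta0, C0, []
    for y in ys:
        eta, C = kalman_step(eta, C, y, **model)
        path.append((eta, C))
    return path


ys = [1.0, 2.0, 0.5]                                 # hypothetical observations
path = kalman_filter(ys, eta0=0.0, C0=4.0)           # static mean: Omega = 0
eta_n, C_n = path[-1]
print(eta_n, C_n)
```

With G = 1 and Ω = 0 the state is constant, and for V = 1 and C0 = τ² the recursion reproduces the conjugate normal posterior with mean τ²Σyi/(1 + nτ²) and variance τ²/(1 + nτ²).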


In this section, we discuss two types of methods to approximate posterior distributions and posterior expectations. The first type is analytical, which is usually effective in large samples. The second type of approximation is numerical. The numerical approximations are based either on numerical integration or on simulations. Approximations are required when an exact functional form for the factor of proportionality in the posterior density is not available. We have seen such examples earlier, like the posterior p.d.f. (8.1.4).

8.5.1 Analytical Approximations

The analytic approximations are saddle–point approximations, based on variations of the Laplace method, which is explained now.

Consider the problem of evaluating the integral

$$I = \int f(\theta) \exp\{-n k(\theta)\}\, d\theta, \tag{8.5.1}$$

where θ is m–dimensional, and k(θ) has sufficiently high–order continuous partial derivatives. Consider first the case of m = 1. Let θ̂ be an argument maximizing −k(θ). Make a Taylor expansion of k(θ) around θ̂, i.e.,

(8.5.2) numbered Display Equation

where k′(θ̂) = 0 and k″(θ̂) > 0. Thus, substituting (8.5.2) in (8.5.1), the integral I is approximated by

(8.5.3) numbered Display Equation

where EN{f(θ)} is the expected value of f(θ) with respect to the normal distribution with mean θ̂ and variance (n k″(θ̂))−1. The expectation EN{f(θ)} can sometimes be computed exactly, or one can apply the delta method to obtain the approximation

(8.5.4) numbered Display Equation

Often we see the simpler approximation, in which f(θ̂) is used for EN{f(θ)}. In this case, the approximation error is O(n⁻¹). If we use f(θ̂) for EN{f(θ)}, we obtain the approximation

$$I \cong f(\hat\theta) \sqrt{\frac{2\pi}{n\, k''(\hat\theta)}}\, \exp\{-n k(\hat\theta)\}. \tag{8.5.5}$$
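To see how accurate the simpler approximation can be, consider the integral I = ∫ exp{−n(θ − log θ)} dθ over (0, ∞), which equals Γ(n + 1)/n^(n+1). Here k(θ) = θ − log θ, f(θ) ≡ 1, θ̂ = 1 and k″(θ̂) = 1. This test integral and the function name `laplace_approximation` are our own choices for illustration. The sketch evaluates f(θ̂)(2π/(n k″(θ̂)))^(1/2) exp{−n k(θ̂)}, which we take to be the form of the approximation when f(θ̂) replaces EN{f(θ)}.

```python
import math


def laplace_approximation(f, k, k2, theta_hat, n):
    """Laplace approximation to the integral of f(theta) * exp(-n * k(theta)).

    theta_hat minimizes k, and k2 is the second derivative of k.
    """
    return (f(theta_hat)
            * math.sqrt(2.0 * math.pi / (n * k2(theta_hat)))
            * math.exp(-n * k(theta_hat)))


n = 50
approx = laplace_approximation(
    f=lambda t: 1.0,
    k=lambda t: t - math.log(t),
    k2=lambda t: 1.0 / t ** 2,
    theta_hat=1.0,
    n=n,
)
# Exact value Gamma(n + 1) / n^(n + 1), computed on the log scale.
exact = math.exp(math.lgamma(n + 1) - (n + 1) * math.log(n))
ratio = exact / approx
print(f"exact/approx = {ratio:.6f}")
```

The ratio exceeds 1 by roughly 1/(12n), in line with a relative error of order n⁻¹ (this is Stirling's formula in disguise).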

In the m > 1 case, the approximating formula becomes

(8.5.6) numbered Display Equation


(8.5.7) numbered Display Equation

These approximating formulae can be applied in Bayesian analysis by letting −nk(θ) be the log–likelihood function, l(θ* | Xn); θ̂ be the MLE, θ̂n; and Σ−1(θ̂) be J(θ̂n) given in (7.7.15). Accordingly, the posterior p.d.f., when the prior p.d.f. is h(θ), is approximated by

(8.5.8) numbered Display Equation

In this formula, θ̂n is the MLE of θ and

(8.5.9) numbered Display Equation

If we approximate EN{h(θ)} by h(θ̂n), then the approximating formula reduces to

(8.5.10) numbered Display Equation

This is a large sample normal approximation to the posterior density of θ. We can write this, for large samples, as

(8.5.11) numbered Display Equation

Note that Equation (8.5.11) does not depend on the prior distribution and is therefore not expected to yield a good approximation to h(θ | Xn) unless the samples are very large.

One can improve upon the normal approximation (8.5.11) by combining the likelihood function and the prior density h(θ) in the definition of k(θ). Thus, let

(8.5.12) numbered Display Equation

Let θ̃ be a value of θ maximizing −n k̃(θ), or θ̃n the root of

(8.5.13) numbered Display Equation


(8.5.14) numbered Display Equation

Then, the saddle–point approximation to the posterior p.d.f. h(θ | Xn) is

(8.5.15) numbered Display Equation

This formula is similar to the Barndorff–Nielsen p*–formula (7.7.15) and reduces to the p*–formula if h(θ)dθ ∝ dθ. The normal approximation is given by (8.5.11), in which θ̂n is replaced by θ̃n and J(θ̂n) is replaced by J̃(θ̃n).

For additional reading on analytic approximation for large samples, see Gamerman (1997, Ch. 3), Reid (1995, pp. 351–368), and Tierney and Kadane (1986).

8.5.2 Numerical Approximations

In this section, we discuss two types of numerical approximations: numerical integrations and simulations. The reader is referred to Evans and Swartz (2001).

I. Numerical Integrations

We have seen in the previous sections that, in order to evaluate posterior p.d.f., one has to evaluate integrals of the form

(8.5.16) numbered Display Equation

Sometimes these integrals are quite complicated, like that of the RHS of Equation (8.1.4).

Suppose that, as in (8.5.16), the range of integration is from −∞ to ∞ and I < ∞. Consider first the case where θ is real. Making the one–to–one transformation ω = e^θ/(1 + e^θ), the integral of (8.5.16) is reduced to

(8.5.17) numbered Display Equation

where q(θ) = L(θ | Xn)h(θ). There are many different methods of numerical integration. A summary of various methods and their accuracy is given in Abramowitz and Stegun (1968, p. 885). The reader is referred also to the book of Davis and Rabinowitz (1984).

If we define f(ω) so that

(8.5.18) numbered Display Equation

then, an n–point approximation to I is given by

(8.5.19) numbered Display Equation


(8.5.20) numbered Display Equation

The error in this approximation is

(8.5.21) numbered Display Equation

Integrals of the form

(8.5.22) numbered Display Equation

Thus, (8.5.22) can be computed according to (8.5.19). Another method is to use an n–point Gaussian quadrature formula:

(8.5.23) numbered Display Equation

where ui and wi are tabulated in Table 25.4 of Abramowitz and Stegun (1968, p. 916). Often it suffices to use n = 8 or n = 12 points in (8.5.23).
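The logistic transformation ω = e^θ/(1 + e^θ) can be combined with any quadrature rule on (0, 1). The following Python sketch uses the composite Simpson rule, which is our choice for a self-contained example and is not necessarily the rule of (8.5.19), nor the Gaussian quadrature (8.5.23). The function name `integrate_real_line`, the number of panels, and the test integrand q(θ) = exp(−θ²/2), whose integral is (2π)^(1/2), are assumptions.

```python
import math


def integrate_real_line(q, panels=1000):
    """Integrate q over (-inf, inf) via omega = e^theta / (1 + e^theta) and Simpson's rule.

    After the transformation, theta = log(omega / (1 - omega)) and
    d(theta) = d(omega) / (omega * (1 - omega)). `panels` must be even.
    """
    def g(w):
        if w <= 0.0 or w >= 1.0:
            return 0.0               # assumes the transformed integrand vanishes at 0 and 1
        theta = math.log(w / (1.0 - w))
        return q(theta) / (w * (1.0 - w))

    h = 1.0 / panels
    total = g(0.0) + g(1.0)
    for i in range(1, panels):
        total += (4.0 if i % 2 else 2.0) * g(i * h)
    return total * h / 3.0


value = integrate_real_line(lambda t: math.exp(-0.5 * t * t))
print(value, math.sqrt(2.0 * math.pi))
```

The assumption that the transformed integrand vanishes at the endpoints holds whenever q(θ) decays faster than e^(−|θ|); otherwise an open rule, such as a Gaussian quadrature, would be the safer choice.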

II. Simulation

The basic theorem applied in simulations to compute an integral I = ∫ f(θ) dH(θ) is the strong law of large numbers (SLLN). We have seen in Chapter 1 that if X1, X2, … is a sequence of i.i.d. random variables having a distribution F(x), and if ∫ |g(x)| dF(x) < ∞, then

$$\frac{1}{n} \sum_{i=1}^{n} g(X_i) \longrightarrow \int g(x)\, dF(x) \quad \text{a.s., as } n \to \infty.$$

This important result is applied to approximate an integral ∫ f(θ) dH(θ) by a sequence θ1, θ2, … of i.i.d. random variables, generated from the prior distribution H(θ). Thus, for large n,

$$\int f(\theta)\, dH(\theta) \cong \frac{1}{n} \sum_{i=1}^{n} f(\theta_i). \tag{8.5.24}$$

Computer programs are available in all statistical packages that simulate realizations of a sequence of i.i.d. random variables having specified distributions. Many such programs use linear congruential generators to generate “pseudo” random numbers that have approximately a uniform distribution on (0, 1). For discussion of these generators, see Bratley, Fox, and Schrage (1983).

Having generated i.i.d. uniform R(0, 1) random variables U1, U2, …, Un, one can obtain a simulation of i.i.d. random variables having a specific c.d.f. F, by the transformation

$$X_i = F^{-1}(U_i), \quad i = 1, \ldots, n. \tag{8.5.25}$$

In some special cases, one can use different transformations. For example, if U1, U2 are independent R(0, 1) random variables then the Box–Muller transformation

$$X_1 = (-2 \log U_1)^{1/2} \cos(2\pi U_2), \qquad X_2 = (-2 \log U_1)^{1/2} \sin(2\pi U_2) \tag{8.5.26}$$

yields two independent random variables having a standard normal distribution. It is easier to simulate a N(0, 1) random variable according to (8.5.26) than according to Φ−1(U). With today’s technology, one can choose from a rich menu of simulation procedures for many of the common distributions.
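Both transformations are easy to try out. In the following Python sketch, the exponential distribution is our example of a c.d.f. with a closed-form inverse, the Box–Muller map is written in its usual form, and the function names, seeds, and sample sizes are assumptions made for the illustration.

```python
import math
import random


def exponential_by_inversion(u, rate=1.0):
    """X = F^{-1}(U) for F(x) = 1 - exp(-rate * x), x >= 0."""
    return -math.log(1.0 - u) / rate


def box_muller(u1, u2):
    """Map two independent R(0, 1) values (u1 > 0) to two independent N(0, 1) values."""
    r = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return r * math.cos(angle), r * math.sin(angle)


rng = random.Random(1)
normals = []
for _ in range(10000):
    u1 = 1.0 - rng.random()          # lies in (0, 1], so log(u1) is finite
    z1, z2 = box_muller(u1, rng.random())
    normals.extend([z1, z2])
expos = [exponential_by_inversion(rng.random(), rate=2.0) for _ in range(20000)]

mean_z = sum(normals) / len(normals)
var_z = sum((z - mean_z) ** 2 for z in normals) / (len(normals) - 1)
mean_x = sum(expos) / len(expos)
print(mean_z, var_z, mean_x)
```

The sample mean and variance of the simulated normal values should be close to 0 and 1, and the mean of the exponential values close to 1/rate.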

If a prior distribution H(θ) is not in a simulation menu, or if h(θ) is not proper, one can approximate ∫ f(θ)h(θ) dθ by generating θ1, …, θn from another convenient distribution, λ(θ) say, and using the formula

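$$\int f(\theta) h(\theta)\, d\theta \cong \frac{1}{n} \sum_{i=1}^{n} f(\theta_i)\, \frac{h(\theta_i)}{\lambda(\theta_i)}. \tag{8.5.27}$$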

The method of simulating from a substitute p.d.f. λ(θ) is called importance sampling, and λ(θ) is called an importance density. The choice of λ(θ) should follow these guidelines:

(i) The support of λ (θ) should be the same as that of h(θ);
(ii) λ (θ) should be similar in shape, as much as possible, to h(θ); i.e., λ (θ) should have the same means, standard deviations and other features, as those of h(θ).

The second guideline is sometimes complicated. For example, if h(θ)dθ is the improper prior dθ and I = ∫ f(θ) dθ over (−∞, ∞), where ∫ |f(θ)| dθ < ∞, one could first use the monotone transformation x = e^θ/(1 + e^θ) to reduce I to I = ∫ f(log(x/(1 − x))) [x(1 − x)]⁻¹ dx over (0, 1). One can then use a beta, β(p, q), importance density to simulate from, and approximate I by

Unnumbered Display Equation

It would be simpler to use β (1, 1), which is the uniform R(0, 1).

An important question is how large the simulation sample should be so that the approximation is sufficiently precise. For large values of n, the approximation În = (1/n) Σi f(θi) h(θi)/λ(θi) is, by the Central Limit Theorem, approximately distributed like N(I, τ²/n), where

Unnumbered Display Equation

VS(·) is the variance according to the simulation density. Thus, n could be chosen sufficiently large, so that z1−α/2 · τ/√n < δ. This will guarantee that, with confidence probability close to (1 − α), the true value of I is within În ± δ. The problem, however, is that generally τ² is not simple or is unknown. To overcome this problem, one could use a sequential sampling procedure, which attains asymptotically the fixed width confidence interval. Such a procedure was discussed in Section 6.7.
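The importance sampling approximation and the sample-size consideration can be sketched together. In the Python sketch below, the flat improper prior h(θ) ≡ 1, the integrand f(θ) = exp(−θ²/2) with I = (2π)^(1/2), and the standard logistic importance density (which corresponds to the transformation x = e^θ/(1 + e^θ) with a uniform density for x) are assumptions chosen for illustration. The unknown τ is replaced by its sample estimate, which is a simple plug-in and not the sequential procedure of Section 6.7.

```python
import math
import random
from statistics import NormalDist


def importance_sample(f, h, draw, density, n, rng):
    """Importance-sampling estimate of the integral of f*h, its standard error, and tau-hat."""
    weights = []
    for _ in range(n):
        theta = draw(rng)
        weights.append(f(theta) * h(theta) / density(theta))
    estimate = sum(weights) / n
    tau2 = sum((w - estimate) ** 2 for w in weights) / (n - 1)   # estimate of tau^2
    return estimate, math.sqrt(tau2 / n), math.sqrt(tau2)


def draw_logistic(rng):
    """Standard logistic variate by inversion: theta = log(u / (1 - u))."""
    u = rng.random()
    while u == 0.0:                  # keep u strictly inside (0, 1)
        u = rng.random()
    return math.log(u / (1.0 - u))


def logistic_density(theta):
    e = math.exp(-abs(theta))        # symmetric form avoids overflow
    return e / (1.0 + e) ** 2


def required_n(tau, delta, alpha=0.05):
    """Smallest n with z_{1-alpha/2} * tau / sqrt(n) < delta."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.floor((z * tau / delta) ** 2) + 1


rng = random.Random(2)
estimate, std_err, tau_hat = importance_sample(
    f=lambda t: math.exp(-0.5 * t * t),
    h=lambda t: 1.0,                 # improper flat prior
    draw=draw_logistic,
    density=logistic_density,
    n=20000,
    rng=rng,
)
n_needed = required_n(tau_hat, delta=0.01)
print(estimate, std_err, tau_hat, n_needed)
```

For this integrand τ is about 1.35, so roughly 70,000 simulated values are needed to attain δ = 0.01 at the 95% level.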

We should remark in this connection that simulation results are less accurate than those of numerical integration. One should use, as far as possible, numerical integration rather than simulation.

To illustrate this point, suppose that we wish to compute numerically

Unnumbered Display Equation

Reduce I, as in (8.5.17), to

Unnumbered Display Equation

Simulation of N = 10,000 random variables Ui ∼ R(0, 1) yields the approximation

Unnumbered Display Equation

On the other hand, a 10–point numerical integration, according to (8.5.29), yields

Unnumbered Display Equation

When θ is m–dimensional, m ≥ 2, numerical integration might become too difficult. In such cases, simulations might be the answer.


Empirical Bayes estimators were introduced by Robbins (1956) for cases of repetitive estimation under similar conditions, when Bayes estimators are desired but the statistician does not wish to make specific assumptions about the prior distribution. The following example illustrates this approach. Suppose that X has a Poisson distribution P(λ), and λ has some prior distribution H(λ), 0 < λ < ∞. The Bayes estimator of λ for the squared–error loss function is

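$$E_H\{\lambda \mid X = x\} = \frac{\int_0^\infty \lambda\, p(x;\lambda)\, dH(\lambda)}{\int_0^\infty p(x;\lambda)\, dH(\lambda)},$$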

where p(x;λ) denotes the p.d.f. of P(λ) at the point x. Since λ p(x;λ) = (x + 1)· p(x + 1;λ) for every λ and each x = 0, 1, … we can express the above Bayes estimator in the form

$$E_H\{\lambda \mid X = x\} = (x + 1)\, \frac{p_H(x + 1)}{p_H(x)}, \quad x = 0, 1, \ldots, \tag{8.6.1}$$

where pH(x) is the predictive p.d.f. at x. Obviously, in order to determine the posterior expectation we have to know the prior distribution H(λ). On the other hand, if the problem is repetitive in the sense that a sequence (X1, λ1), (X2, λ2), …, (Xn, λn), … is generated independently so that λ1, λ2, … are i.i.d. having the same prior distribution H(λ), and X1, …, Xn are conditionally independent, given λ1, …, λn, then we consider the sequence of observable random variables X1, …, Xn, … as i.i.d. from the mixture of Poisson distributions with p.d.f. pH(j), j = 0, 1, 2, …. Thus, if on the nth epoch we observe Xn = i0, we estimate, on the basis of all the data, the value of pH(i0 + 1)/pH(i0). A consistent estimator of pH(j), for any j = 0, 1, …, is p̂n(j) = (1/n) Σi I{Xi = j}, i = 1, …, n, where I{Xi = j} is the indicator function of {Xi = j}. This follows from the SLLN. Thus, a consistent estimator of the Bayes estimator EH{λ | Xn} is

$$\hat\lambda_n = (X_n + 1)\, \frac{\sum_{i=1}^{n} I\{X_i = X_n + 1\}}{\sum_{i=1}^{n} I\{X_i = X_n\}}. \tag{8.6.2}$$

This estimator is independent of the unknown H(λ), and for large values of n is approximately equal to EH{λ | Xn}. The estimator λ̂n is called an empirical Bayes estimator. The question is whether the prior risks, under the true H(λ), of the estimators λ̂n converge, as n → ∞, to the Bayes risk under H(λ). A general discussion of this issue with sufficient conditions for such convergence of the associated prior risks is given in the paper of Robbins (1964).
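The empirical Bayes estimator is simple to compute from the observed counts. The Python sketch below simulates a repetitive Poisson problem under a hypothetical gamma prior (shape 2, scale 1; this prior, the seed, and the function names are assumptions) and compares the empirical Bayes estimate at Xn = 1 with the Bayes estimate (x + 2)/2 that this particular prior would give. The estimator itself never uses the prior.

```python
import math
import random
from collections import Counter


def robbins_estimate(xs):
    """Empirical Bayes estimate for the latest epoch, xs = [X_1, ..., X_n].

    Uses (X_n + 1) * p_hat(X_n + 1) / p_hat(X_n), where p_hat(j) is the relative
    frequency of j among X_1, ..., X_n (the factor 1/n cancels).
    """
    counts = Counter(xs)
    x = xs[-1]
    return (x + 1) * counts[x + 1] / counts[x]


def poisson_variate(rng, lam):
    """Knuth's multiplication method for a Poisson(lam) variate (adequate for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(3)
# Hypothetical prior, unknown to the estimator: lambda ~ gamma with shape 2, scale 1.
history = [poisson_variate(rng, rng.gammavariate(2.0, 1.0)) for _ in range(50000)]
xs = history + [1]                       # suppose the current observation is X_n = 1
eb = robbins_estimate(xs)
bayes = (1 + 2.0) / (1.0 + 1.0)          # E{lambda | X = 1} under the gamma prior
print(eb, bayes)
```

For small n, or for values of Xn that were rarely observed before, the ratio of counts is unstable, which reflects the remark that empirical Bayes estimators are only asymptotically optimal.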

Many papers were written on the application of the empirical Bayes estimation method to repetitive estimation problems in which it is difficult or impossible to specify the prior distribution exactly. We have to remark in this connection that the empirical Bayes estimators are only asymptotically optimal. We have an adaptive decision process which corrects itself and approaches the optimal decisions only when n grows. How fast does it approach the optimal decisions? It depends on the amount of a priori knowledge of the true prior distribution. The initial estimators may be far from the true Bayes estimators. A few studies have been conducted to estimate the rate of approach of the prior risks associated with the empirical Bayes decisions to the true Bayes risk. Lin (1974) considered the one parameter exponential family and the estimation of a function λ (θ) under squared–error loss. The true Bayes estimator is

Unnumbered Display Equation

and it is assumed that inline.jpg(x)fH(x) can be expressed in the form inline.jpg, where fH(i)(x) is the ith order derivative of fH(x) with respect to x. The empirical Bayes estimators considered are based on consistent estimators of the p.d.f. fH(x) and its derivatives. For the particular estimators suggested it is shown that the rate of approach is of the order O(n−α) with 0 < α ≤ 1/3, where n is the number of observations.

In Example 8.26, we show that if the form of the prior is known, the rate of approach becomes considerably faster. When the form of the prior distribution is known the estimators are called semi–empirical Bayes, or parametric empirical Bayes.

For further reading on the empirical Bayes method, see the book of Maritz (1970) and the papers of Casella (1985), Efron and Morris (1971, 1972a, 1972b), and Susarla (1982).

The EM algorithm discussed in Example 8.27 is a very important procedure for estimation and overcoming problems of missing values. The book by McLachlan and Krishnan (1997) provides the theory and many interesting examples.


Example 8.1. The experiment under consideration is to produce concrete under certain conditions of mixing the ingredients, temperature of the air, humidity, etc. Prior experience shows that concrete cubes manufactured in that manner will have a compressive strength X after 3 days of hardening, which has a log–normal distribution LN(μ, σ2). Furthermore, it is expected that 95% of such concrete cubes will have compressive strength in the range of 216–264 (kg/cm2).

According to our model, Y = log X ∼ N(μ, σ²). Taking the (natural) logarithms of the range limits, we expect most Y values to be within the interval (5.375, 5.580).

The conditional distribution of Y given (μ, σ2) is

Unnumbered Display Equation

Suppose that σ² is fixed at σ² = 0.001 and that μ has a prior normal distribution, μ ∼ N(μ0, τ²). Then the predictive distribution of Y is N(μ0, σ² + τ²). Substituting μ0 = 5.475, the predictive probability that Y ∈ (5.375, 5.580) is 0.95 if (σ² + τ²)^(1/2) = 0.051. Thus, we choose τ² = 0.0015 for the prior distribution of μ.

From this model, in which Y | μ, σ² ∼ N(μ, σ²) and μ ∼ N(μ0, τ²), the bivariate distribution of (Y, μ) is

Unnumbered Display Equation

Hence, the conditional distribution of μ given {Y = y} is, as shown in Section 2.9,

Unnumbered Display Equation

The posterior distribution of μ, given {Y = y}, is normal.
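The numbers in this example can be reproduced with a few lines of Python. The sketch uses the standard normal–normal result, μ | y ∼ N(μ0 + w(y − μ0), σ²τ²/(σ² + τ²)) with w = τ²/(σ² + τ²), which we assume is the form of the posterior obtained here. The observed value y = 5.5 and the function name `posterior_of_mu` are hypothetical.

```python
from statistics import NormalDist

sigma2, tau2, mu0 = 0.001, 0.0015, 5.475

# Predictive distribution of Y: N(mu0, sigma2 + tau2).
predictive = NormalDist(mu0, (sigma2 + tau2) ** 0.5)
prob = predictive.cdf(5.580) - predictive.cdf(5.375)


def posterior_of_mu(y):
    """Posterior of mu given Y = y under the normal-normal model (standard conjugate result)."""
    w = tau2 / (sigma2 + tau2)
    mean = mu0 + w * (y - mu0)
    variance = sigma2 * tau2 / (sigma2 + tau2)
    return NormalDist(mean, variance ** 0.5)


posterior = posterior_of_mu(5.5)         # y = 5.5 is a hypothetical observation
print(f"predictive probability: {prob:.3f}")
print(f"posterior mean {posterior.mean:.4f}, variance {posterior.variance:.6f}")
```

With τ² = 0.0015 the predictive standard deviation is 0.05, so the predictive probability of the interval is slightly above the targeted 0.95.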

Example 8.2. (a) X1, X2, …, Xn given λ are conditionally i.i.d., having a Poisson distribution P(λ), i.e., ℱ = {P(λ), 0 < λ < ∞}.

Let ℋ = {G(Λ, α), 0 < α, Λ < ∞}, i.e., ℋ is a family of prior gamma distributions for λ. The minimal sufficient statistic, given λ, is Tn = X1 + ⋯ + Xn, and Tn | λ ∼ P(nλ). Thus, the posterior p.d.f. of λ, given Tn, is

Unnumbered Display Equation

Hence, λ | Tn ∼ G(n + Λ, Tn + α). The posterior distribution belongs to ℋ.

(b) ℱ = {G(λ, α), 0 < λ < ∞}, α fixed. ℋ = {G(Λ, ν), 0 < ν, Λ < ∞}.

Unnumbered Display Equation

Thus, λ | X ∼ G(X + Λ, ν + α).

Example 8.3. The following problem is often encountered in high technology industry.

The number of soldering points on a typical printed circuit board (PCB) is often very large. There is an automated soldering technology, called “wave soldering,” which involves a large number of different factors (conditions) represented by variables X1, X2, …, Xk. Let J denote the number of faults in the soldering points on a PCB. One can model J as having a conditional Poisson distribution with mean λ, which depends on the manufacturing conditions X1, …, Xk according to a log–linear relationship

$$\lambda(x_1, \ldots, x_k) = \exp\{\boldsymbol{\beta}' \mathbf{x}\},$$

where β′ = (β0, …, βk) and x = (1, x1, …, xk). β is generally an unknown parametric vector. In order to estimate β, one can design an experiment in which the values of the control variables X1, …, Xk are changed.

Let Ji be the number of observed faulty soldering points on a PCB, under control conditions given by xi (i = 1, …, N). The likelihood function of β, given J1, …, JN and x1, …, xN, is

Unnumbered Display Equation

where inline.jpg. If we ascribe β a prior multinormal distribution, i.e., β ∼ N(β0, V), then the posterior p.d.f. of β, given 𝒟N = (J1, …, JN, x1, …, xN), is

Unnumbered Display Equation

It is very difficult to express analytically the proportionality factor that makes the RHS of h(β | 𝒟N) a p.d.f., even in special cases.

Example 8.4. In this example, we derive the Jeffreys prior density for several models.

A. ℱ = {b(x; n, θ), 0 < θ < 1}.

This is the family of binomial probability distributions. The Fisher information function is

Unnumbered Display Equation

Thus, the Jeffreys prior for θ is

Unnumbered Display Equation

In this case, the prior density is

Unnumbered Display Equation

This is a proper prior density. The posterior distribution of θ, given X, under the above prior is Beta(X + 1/2, n − X + 1/2).

B. ℱ = {N(μ, σ²); −∞ < μ < ∞, 0 < σ < ∞}.

The Fisher information matrix is given in (3.8.8). The determinant of this matrix is |I(μ, σ²)| = 1/(2σ⁶). Thus, the Jeffreys prior for this model is

Unnumbered Display Equation

Using this improper prior density the posterior p.d.f. of (μ, σ2), given X1, …, Xn, is

Unnumbered Display Equation

where inline.jpg. The parameter φ = 1/σ² is called the precision parameter. In terms of μ and φ, the improper prior density is

Unnumbered Display Equation

The posterior density of (μ, φ) correspondingly is

Unnumbered Display Equation

Example 8.5. Consider a simple inventory system in which a certain commodity is stocked at the beginning of every day, according to a policy determined by the following considerations. The daily demand (in number of units) is a random variable X whose distribution belongs to a specified parametric family ℱ. Let X1, X2, … denote a sequence of i.i.d. random variables, whose common distribution F(x; θ) belongs to ℱ and which represent the observed demand on consecutive days. The stock level at the beginning of each day, Sn, n = 1, 2, …, can be adjusted by increasing or decreasing the available stock at the end of the previous day. We consider the following inventory cost function

Unnumbered Display Equation

where c, 0 < c < ∞, is the daily cost of holding a unit in stock and h, 0 < h < ∞, is the cost (or penalty) for a shortage of one unit. Here (s − x)⁺ = max(0, s − x) and (s − x)⁻ = −min(0, s − x). If the distribution of X, F(x; θ), is known, then the expected cost R(S, θ) = Eθ{K(S, X)} is minimized by

Unnumbered Display Equation

where F−1(γ; θ) is the γ–quantile of F(x; θ). If θ is unknown we cannot determine S0(θ). We show now a Bayesian approach to the determination of the stock levels. Let H(θ) be a specific prior distribution of θ. The prior expected daily cost is

Unnumbered Display Equation

or, since all the terms are nonnegative

Unnumbered Display Equation

The value of S which minimizes ρ (S, H) is similar to (8.1.27),

Unnumbered Display Equation

i.e., the h/(c + h) th–quantile of the predictive distribution FH(x).

After observing the value x1 of X1, we convert the prior distribution H(θ) to a posterior distribution H1(θ | x1) and determine the predictive p.d.f. for the second day, namely

Unnumbered Display Equation

The expected cost for the second day is

Unnumbered Display Equation

Moreover, by the law of the iterated expectations

Unnumbered Display Equation


Unnumbered Display Equation

The conditional expectation E{K(S2, X2) | X1 = x} is the posterior expected cost given X1 = x, or the predictive cost for the second day. The optimal choice of S2 given X1 = x is, therefore, the h/(c + h)–quantile of the predictive distribution FH1(y | x), i.e., FH1−1(h/(c + h) | x). Since this function minimizes the predictive risk for every x, it minimizes ρ(S2, H). In the same manner, we prove that after n days, given Xn = (x1, …, xn), the optimal stock level for the beginning of the (n + 1)st day is the h/(c + h)–quantile of the predictive distribution of Xn+1, given Xn = xn, i.e., FHn−1(h/(c + h) | xn), where the predictive p.d.f. of Xn+1, given Xn = x, is

Unnumbered Display Equation

and h(θ | x) is the posterior p.d.f. of θ given Xn = x. The optimal stock levels are determined sequentially for each day on the basis of the demand of the previous days. Such a procedure is called an adaptive procedure. In particular, if X1, X2, … is a sequence of i.i.d. Poisson random variables (r.v.s), P(θ), and if the prior distribution H(θ) is the gamma distribution G(1/τ, ν), the posterior distribution of θ after n observations is the gamma distribution G(n + 1/τ, ν + Tn), where Tn = X1 + ⋯ + Xn. Let h(θ | Tn) denote the p.d.f. of this posterior distribution. The predictive distribution of Xn+1 given Xn, which actually depends only on Tn, is

Unnumbered Display Equation

where ψn = τ/(1 + (n + 1)τ). This is the p.d.f. of the negative binomial NB(ψn, ν + Tn). It is interesting that in the present case the predictive distribution belongs to the family of the negative–binomial distributions for all n = 1, 2, …. We can also include the case of n = 0 by defining T0 = 0. What changes from one day to another are the parameters (ψn, ν + Tn). Thus, the optimal stock level at the beginning of the (n + 1)st day is the h/(c + h)–quantile of the NB(ψn, ν + Tn).
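The adaptive rule is easy to program. The Python sketch below computes the h/(c + h)–quantile of the negative binomial predictive distribution, writing its p.d.f. as Γ(ν + x)/(Γ(ν) x!) (1 − ψ)^ν ψ^x, x = 0, 1, …, which is the parametrization we assume for NB(ψ, ν). The demand history, the cost values, and the function names are hypothetical.

```python
import math


def nb_pmf(x, psi, nu):
    """Negative binomial p.d.f. Gamma(nu+x) / (Gamma(nu) x!) * (1-psi)^nu * psi^x."""
    return math.exp(math.lgamma(nu + x) - math.lgamma(nu) - math.lgamma(x + 1)
                    + nu * math.log(1.0 - psi) + x * math.log(psi))


def optimal_stock(demands, tau, nu, c, h):
    """h/(c+h)-quantile of the predictive NB(psi_n, nu + T_n) given the past demands."""
    n, t_n = len(demands), sum(demands)
    psi = tau / (1.0 + (n + 1) * tau)
    target = h / (c + h)
    s, cdf = 0, nb_pmf(0, psi, nu + t_n)
    while cdf < target:              # smallest s whose predictive c.d.f. reaches the target
        s += 1
        cdf += nb_pmf(s, psi, nu + t_n)
    return s


# Hypothetical costs and prior parameters.
stock_prior = optimal_stock([], tau=1.0, nu=1.0, c=1.0, h=2.0)            # n = 0, T_0 = 0
stock_after = optimal_stock([10, 12, 9], tau=1.0, nu=1.0, c=1.0, h=2.0)   # after three days
print(stock_prior, stock_after)
```

After three days of demand around 10 units, the stock level moves from the value implied by the prior toward the observed level of demand.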

Example 8.6. Consider the testing problem connected with the problem of detecting disturbances in a manufacturing process. Suppose that the quality of a product is represented by a random variable X having a normal distribution N(θ, 1). When the manufacturing process is under control the value of θ should be θ0. Every hour an observation is taken on a product chosen at random from the process. Consider the situation after n hours. Let X1, …, Xn be independent random variables representing the n observations. It is suspected that after k hours of operation, 1 < k < n, a malfunction occurred and the expected value θ shifted to a value θ1 greater than θ0. The loss due to such a shift is (θ1 − θ0) [$] per hour. If a shift really occurred, the process should be stopped and rectified. On the other hand, if a shift has not occurred and the process is stopped, a loss of K [$] is charged. The prior probability that the shift occurred is ψ. We present here the Bayes test of the two hypotheses

Unnumbered Display Equation


Unnumbered Display Equation

for a specified k, 1 ≤ k ≤ n − 1, which is performed after the nth observation.

The likelihood functions under H0 and under H1 are, respectively, when Xn = xn

Unnumbered Display Equation


Unnumbered Display Equation

Thus, the posterior probability that H0 is true is

Unnumbered Display Equation

where π = 1 − ψ. The ratio of prior risks is, in the present case, Kπ/((1 − π)(n − k)(θ1 − θ0)). The Bayes test implies that H0 should be rejected if

Unnumbered Display Equation

where inline.jpg.

The Bayes (minimal prior) risk associated with this test is

Unnumbered Display Equation

where ε0(π) and ε1(π) are the error probabilities of rejecting H0 or H1 when they are true. These error probabilities are given by

Unnumbered Display Equation

where Φ(z) is the standard normal integral and

Unnumbered Display Equation


Unnumbered Display Equation

The function An−k(π) is monotone increasing in π and inline.jpg. Accordingly, ε0(0) = 1, ε1(0) = 0 and ε0(1) = 0, ε1(1) = 1.

Example 8.7. Consider the detection problem of Example 8.6 but now the point of shift k is unknown. If θ0 and θ1 are known then we have a problem of testing the simple hypothesis H0 (of Example 8.6) against the composite hypothesis

Unnumbered Display Equation

Let π0 be the prior probability of H0 and πj, j = 1, …, n − 1, the prior probabilities under H1 that {k = j}. The posterior probability of H0 is then

Unnumbered Display Equation

where inline.jpg and inline.jpg. The posterior probability of {k = j} is, for j = 1, …, n − 1,

Unnumbered Display Equation

Let Ri(Xn) (i = 0, 1) denote the posterior risk associated with accepting Hi. These functions are given by

Unnumbered Display Equation


Unnumbered Display Equation

H0 is rejected if R1(Xn) ≤ R0(Xn), or when

Unnumbered Display Equation

Example 8.8. We consider here the problem of testing whether the mean of a normal distribution is negative or positive. Let X1, …, Xn be i.i.d. random variables having a N(θ, 1) distribution. The null hypothesis is H0: θ ≤ 0 and the alternative hypothesis is H1: θ > 0. We assign the unknown θ a prior normal distribution, i.e., θ ∼ N(0, τ²). Thus, the prior probability of H0 is π = 1/2. The loss function L0(θ) of accepting H0 and that of accepting H1, L1(θ), are of the form

Unnumbered Display Equation

For the determination of the posterior risk functions, we have to determine first the posterior distribution of θ given Xn. Since X̄n is a minimal sufficient statistic, the conditional distribution of θ given X̄n is the normal

Unnumbered Display Equation

(See Example 8.9 for elaboration.) It follows that the posterior risk associated with accepting H0 is

Unnumbered Display Equation

where inline.jpg is the posterior mean. Generally, if X ∼ N(ξ, D²) then

Unnumbered Display Equation

Substituting the expressions

Unnumbered Display Equation

we obtain that

Unnumbered Display Equation

In a similar fashion, we prove that the posterior risk associated with accepting H1 is

Unnumbered Display Equation

The Bayes test procedure is to reject H0 whenever R1(X̄n) ≤ R0(X̄n). Thus, H0 should be rejected whenever

Unnumbered Display Equation

But this holds if, and only if, X̄n ≥ 0.
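The posterior computation behind this rule can be checked numerically. The following is a minimal sketch, assuming the N(0, τ²) prior and unit-variance observations of this example; the function names are illustrative and do not come from the text. It evaluates the posterior mean and variance of θ given the sample mean, and the posterior probability of H1, which exceeds 1/2 exactly when the sample mean is positive, in agreement with the rejection rule above.

```python
from statistics import NormalDist

def normal_posterior(xbar, n, tau2):
    """Posterior mean and variance of theta given the sample mean,
    for X_i ~ N(theta, 1) and prior theta ~ N(0, tau2)."""
    shrink = n * tau2 / (1.0 + n * tau2)
    return shrink * xbar, tau2 / (1.0 + n * tau2)

def posterior_prob_h1(xbar, n, tau2):
    """Posterior probability of H1: theta > 0."""
    mean, var = normal_posterior(xbar, n, tau2)
    return 1.0 - NormalDist(mean, var ** 0.5).cdf(0.0)
```

For instance, `posterior_prob_h1(0.3, 10, 2.0)` is above 1/2 and `posterior_prob_h1(-0.3, 10, 2.0)` is below it.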

Example 8.9. Let X1, X2, … be i.i.d. random variables having a normal distribution with mean μ and variance σ² = 1. We wish to test k = 3 composite hypotheses H−1: −∞ < μ < −1; H0: −1 ≤ μ ≤ 1; H1: μ > 1. Let μ have a prior normal distribution, μ ∼ N(0, τ²). Thus, let

Unnumbered Display Equation

and inline.jpg, be the prior probabilities of H−1, H0, and H1, respectively. Furthermore, let

Unnumbered Display Equation

Unnumbered Display Equation


Unnumbered Display Equation

The predictive likelihood functions of the three hypotheses are then

Unnumbered Display Equation


Unnumbered Display Equation

It follows that the posterior probabilities of Hj, j = −1, 0, 1 are as follows:

Unnumbered Display Equation

Thus, π−1(inline.jpgn) ≥ 1 – inline if

Unnumbered Display Equation

Similarly, π1(inline.jpgn) ≥ 1 – inline if

Unnumbered Display Equation

Thus, if inline.jpg then bn and −bn are outer stopping boundaries. In the region (−bn, bn), we have two inner boundaries (−cn, cn) such that if |X̄n| < cn then H0 is accepted. The boundaries ±cn can be obtained by solving the equation

Unnumbered Display Equation

cn ≥ 0 and cn > 0 only if n > n0, where inline.jpg, or n0 = inline.jpg.        inline

Example 8.10. Consider the problem of estimating circular probabilities in the normal case. In Example 6.4, we derived the uniformly most accurate (UMA) lower confidence limit of the function

Unnumbered Display Equation

where J is a inline.jpg random variable for cases of known ρ. We derive here the Bayes lower credibility limit of inline (σ2, ρ) for cases of known ρ. The minimal sufficient statistic is inline.jpg. This statistic is distributed like σ2χ2[2n] or like inline.jpg. Let ω = inline.jpg and let ωG(τ, ν). The posterior distribution of ω, given T2n, is

Unnumbered Display Equation

Accordingly, if G−1(p | T2n + τ, n + ν) designates the pth quantile of this posterior distribution,

Unnumbered Display Equation

with probability one (with respect to the mixed prior distribution of T2n). Thus, we obtain that a 1–α Bayes upper credibility limit for σ2 is

Unnumbered Display Equation

Note that if τ, and ν are close to zero then the Bayes credibility limit is very close to the non–Bayes UMA upper confidence limit derived in Example 7.4. Finally, the (1–α) Bayes lower credibility limit for inline (σ2, ρ) is inline.jpg.        inline

Example 8.11. We consider in the present example the problem of inverse regression. Suppose that the relationship between a controlled experimental variable x and an observed random variable Y(x) is describable by a linear regression

$$Y(x) = \alpha + \beta x + \epsilon,$$

where ε is a random variable such that E{ε} = 0 and E{ε²} = σ². The regression coefficients α and β are unknown. Given the results on n observations at x1, …, xn, estimate the value of ξ at which E{Y(ξ)} = η, where η is a preassigned value. We derive here Bayes confidence limits for ξ = (η − α)/β, under the assumption that m random variables are observed independently at each of x1 and x2, where x2 = x1 + Δ. Both x1 and Δ are determined by the design. Furthermore, we assume that the distribution of ε is N(0, σ²) and that (α, β) has a prior bivariate normal distribution with mean (α0, β0) and covariance matrix V = (vij; i, j = 1, 2). For the sake of simplicity, we assume in the present example that σ² is known. The results can be easily extended to the case of unknown σ².

The minimal sufficient statistic is (Ȳ1, Ȳ2), where Ȳi is the mean of the m observations at xi (i = 1, 2). The posterior distribution of (α, β) given (Ȳ1, Ȳ2) is the bivariate normal with mean vector

Unnumbered Display Equation


Unnumbered Display Equation

and I is the 2 × 2 identity matrix. Note that X is nonsingular. X is called the design matrix. The covariance matrix of the posterior distribution is

Unnumbered Display Equation

Let us denote the elements of inline by inline.jpgij, i, j = 1, 2. The problem is to determine the Bayes credibility interval for the parameter ξ = (η − α)/β. Let inline.jpg and inline.jpg denote the limits of such a (1 – α) Bayes credibility interval. These limits should satisfy the posterior confidence level requirement

Unnumbered Display Equation

If we consider equal tail probabilities, these confidence limits are obtained by solving simultaneously the equations

Unnumbered Display Equation

where inline.jpg and similarly inline.jpg. By inverting, we can realize that the credibility limits inline.jpg and inline.jpg are the two roots of the quadratic equation

Unnumbered Display Equation


Unnumbered Display Equation


Unnumbered Display Equation

The two roots (if they exist) are

Unnumbered Display Equation

where |inline.jpg| denotes the determinant of the posterior covariance matrix. These credibility limits exist if the discriminant

Unnumbered Display Equation

is nonnegative. After some algebraic manipulations, we obtain that

Unnumbered Display Equation

where tr {·} is the trace of the matrix in {}. Thus, if m is sufficiently large, Δ* > 0 and the two credibility limits exist with probability one.        inline

Example 8.12. Suppose that X1, …, Xn + 1 are i.i.d. random variables, having a Poisson distribution P(λ), 0 < λ < ∞. We ascribe λ a prior gamma distribution, i.e., λ ∼ G(Λ, α).

After observing X1, …, Xn, the posterior distribution of λ, given $T_n = \sum_{i=1}^{n} X_i$, is λ | Tn ∼ G(Λ + n, Tn + α). The predictive distribution of Xn + 1, given Tn, is the negative–binomial, i.e.,

Unnumbered Display Equation


Unnumbered Display Equation

Let NB−1(p; inline, α) denote the pth quantile of NB(inline, α). The prediction interval for Xn + 1, after observing X1, …, Xn, at level 1–α, is

Unnumbered Display Equation

According to Equation (2.3.12), the pth quantile of NB(inline, α) is NB−1(p | inline, α) = least integer k, k ≥ 1, such that I1 – inline(α, k + 1) ≥ p.        inline
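The predictive distribution and its quantiles can be computed directly. The sketch below assumes that G(Λ, α) denotes a gamma distribution with rate Λ and shape α, which is consistent with the posterior G(Λ + n, Tn + α) stated above; under that assumption the predictive probability function is negative binomial with shape α + Tn and success probability (Λ + n)/(Λ + n + 1). The function names and the equal-tails form of the interval are illustrative choices, and the quantile search starts at k = 0.

```python
import math

def predictive_pmf(j, t_n, n, rate, shape):
    """P{X_{n+1} = j | T_n = t_n} for a Poisson model with a gamma(rate, shape) prior."""
    a = shape + t_n            # posterior shape
    r = rate + n               # posterior rate
    log_p = (math.lgamma(a + j) - math.lgamma(a) - math.lgamma(j + 1)
             + a * math.log(r / (r + 1.0)) - j * math.log(r + 1.0))
    return math.exp(log_p)

def predictive_quantile(p, t_n, n, rate, shape):
    """Smallest integer k >= 0 whose predictive c.d.f. is at least p (0 < p < 1)."""
    k, cdf = 0, predictive_pmf(0, t_n, n, rate, shape)
    while cdf < p:
        k += 1
        cdf += predictive_pmf(k, t_n, n, rate, shape)
    return k

def prediction_interval(level, t_n, n, rate, shape):
    """Equal-tails prediction interval for X_{n+1} at the given level."""
    tail = (1.0 - level) / 2.0
    return (predictive_quantile(tail, t_n, n, rate, shape),
            predictive_quantile(1.0 - tail, t_n, n, rate, shape))
```

The predictive mean under this parametrization is (α + Tn)/(Λ + n), which gives a quick sanity check on the probability function.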

Example 8.13. Suppose that an n–dimensional random vector Xn has the multinormal distribution Xn | μ ∼ N(μ1n, Vn), where −∞ < μ < ∞ is unknown. The covariance matrix Vn is known. Assume that μ has a prior normal distribution, μ ∼ N(μ0, τ²). The posterior distribution of μ, given Xn, is μ | Xn ∼ N(η(Xn), Dn), where

Unnumbered Display Equation


Unnumbered Display Equation

Accordingly, the predictive distribution of a yet unobserved m–dimensional vector Ym | μ ∼ N(μ1m, Vm) is

Unnumbered Display Equation

Thus, a prediction region for Ym, at level (1–α) is the ellipsoid of concentration

Unnumbered Display Equation        inline

Example 8.14. A new drug is introduced and the physician wishes to determine a lower prediction limit with confidence probability of γ = 0.95 for the number of patients in a group of n = 10 that will be cured. If Xn is the number of patients cured among n and if θ is the individual probability to be cured, the model is binomial, i.e., Xn ∼ B(n, θ). The lower prediction limit, for a given value of θ, is an integer kγ(θ) such that Pθ{Xn ≥ kγ(θ)} ≥ γ. If B−1(p; n, θ) denotes the pth quantile of the binomial B(n, θ) then kγ(θ) = max(0, B−1(1 – γ; n, θ) − 1). Since the value of θ is unknown, we cannot determine kγ(θ). Lower tolerance limits, which were discussed in Section 6.5, could provide estimates of the unknown kγ(θ). A statistician may feel, however, that lower tolerance limits are too conservative, since he has good a priori information about θ. Suppose a statistician believes that θ is approximately equal to 0.8 and, therefore, assigns θ a prior beta distribution β(p, q) with mean 0.8 and variance 0.01. Setting the equations for the mean and variance of a β(p, q) distribution (see Table 2.1 of Chapter 2), namely p/(p + q) = 0.8 and pq/[(p + q)²(p + q + 1)] = 0.01, and solving for p and q, we obtain p = 12 and q = 3. We consider now the predictive distribution of Xn under the β(12, 3) prior distribution of θ. This predictive distribution has a probability function

Unnumbered Display Equation

For n = 10, we obtain the following predictive p.d.f. pH(j) and c.d.f. FH(j). According to this predictive distribution, the probability of at least 5 cures out of 10 patients is 0.972 and for at least 6 cures is 0.925.

j pH(j) FH(j)
0 0.000034 0.000034
1 0.000337 0.000370
2 0.001790 0.002160
3 0.006681 0.008841
4 0.019488 0.028329
5 0.046770 0.075099
6 0.094654 0.169752
7 0.162263 0.332016
8 0.231225 0.563241
9 0.256917 0.820158
10 0.179842 1.000000
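The table can be reproduced from the beta-binomial form of the predictive distribution, pH(j) = C(n, j) B(p + j, q + n − j)/B(p, q), which is what mixing B(n, θ) over a β(p, q) prior yields. The sketch below is illustrative (the function names are not from the text); it also finds the lower prediction limit as the largest k whose predictive probability P{Xn ≥ k} is at least γ.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(j, n, p, q):
    """Predictive probability of j successes in n trials under a beta(p, q) prior."""
    return math.comb(n, j) * math.exp(log_beta(p + j, q + n - j) - log_beta(p, q))

def lower_prediction_limit(gamma, n, p, q):
    """Largest k such that the predictive P{X_n >= k} is at least gamma."""
    upper_tail = 1.0           # P{X_n >= 0}
    k = 0
    while k < n and upper_tail - beta_binomial_pmf(k, n, p, q) >= gamma:
        upper_tail -= beta_binomial_pmf(k, n, p, q)
        k += 1
    return k
```

With n = 10, p = 12, q = 3, and γ = 0.95 the limit is 5, matching the probabilities 0.972 and 0.925 quoted above.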


Example 8.15. Suppose that in a given (rather simple) inventory system (see Example 8.2) the monthly demand, X, of some commodity is a random variable having a Poisson distribution P(θ), 0 < θ < ∞. We wish to derive a Bayes estimator of the expected demand θ. In many of the studies on Bayes estimators of θ, a prior gamma distribution G(1/τ, ν) is assumed for θ. The prior parameters τ and ν, 0 < τ, ν < ∞, are specified. Note that the prior expectation of θ is ντ and its prior variance is ντ². A large prior variance is generally chosen if the prior information on θ is vague. This yields a flat prior distribution. On the other hand, if the prior information on θ is strong, in the sense that we have a high prior confidence that θ lies close to a value θ0, say, pick ντ = θ0 and ντ² very small, by choosing τ to be small. In any case, the posterior distribution of θ, given a sample of n i.i.d. random variables X1, …, Xn, is determined in the following manner. $T_n = \sum_{i=1}^{n} X_i$ is a minimal sufficient statistic, where Tn ∼ P(nθ). The derivation of the posterior density can be based on the p.d.f. of Tn. Thus, the product of the p.d.f. of Tn by the prior p.d.f. of θ is proportional to $\theta^{t+\nu-1} e^{-\theta(n+1/\tau)}$, where Tn = t. The factors that were omitted from the product of the p.d.f.s are independent of θ and are, therefore, irrelevant. We recognize in the function $\theta^{t+\nu-1} e^{-\theta(n+1/\tau)}$ the kernel (the factor depending on θ) of a gamma p.d.f. Accordingly, the posterior distribution of θ, given Tn, is the gamma distribution G(n + 1/τ, ν + Tn). If we choose a squared–error loss function, then the posterior expectation is the Bayes estimator. We thus obtain the estimator $\hat{\theta} = (T_n + \nu)/(n + 1/\tau)$. Note that the unbiased estimator and MLE of θ, Tn/n, is not useful when Tn = 0, since we know that θ > 0. If certain commodities have a very slow demand (a frequently encountered phenomenon among replacement parts) then Tn may be zero even when n is moderately large.
On the other hand, the Bayes estimator θ̂ is always positive.
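A small numerical illustration of this point follows. It is a sketch under the assumption, consistent with the kernel θ^(t+ν−1) e^(−θ(n+1/τ)) above, that the posterior is gamma with shape ν + Tn and rate n + 1/τ, so that its mean is (Tn + ν)/(n + 1/τ). The function names are illustrative.

```python
def bayes_estimate(t_n, n, tau, nu):
    """Posterior mean of theta: gamma posterior with shape nu + t_n and rate n + 1/tau."""
    return (nu + t_n) / (n + 1.0 / tau)

def mle(t_n, n):
    """Maximum likelihood (and unbiased) estimate of theta."""
    return t_n / n

# Twelve months with no demand at all: the MLE is zero, the Bayes estimate is not.
no_demand_mle = mle(0, 12)                          # 0.0
no_demand_bayes = bayes_estimate(0, 12, 2.0, 0.5)   # 0.5 / 12.5 = 0.04
```

As n grows with Tn/n fixed, the two estimates approach each other, since the prior terms ν and 1/τ become negligible.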

Example 8.16. (a) Let X1, …, Xn be i.i.d. random variables having a normal distribution N(θ, 1), −∞ < θ < ∞. The minimal sufficient statistic is the sample mean X̄. We assume that θ has a prior normal distribution N(0, τ²). We derive the Bayes estimator for the zero–one loss function,

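$$L(\hat{\theta}, \theta) = \begin{cases} 0, & |\hat{\theta} - \theta| < \delta, \\ 1, & |\hat{\theta} - \theta| \ge \delta. \end{cases}$$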

The posterior distribution of θ given X̄ is normal, N(X̄(1 + 1/(nτ²))⁻¹, (n + 1/τ²)⁻¹). This can be verified by simple normal regression theory, recognizing that the joint distribution of (X̄, θ) is the bivariate normal, with zero expectation and covariance matrix

$$\begin{pmatrix} \dfrac{1}{n} + \tau^2 & \tau^2 \\ \tau^2 & \tau^2 \end{pmatrix}.$$

Thus, the posterior risk is the posterior probability of the event {|θ̂ − θ| ≥ δ}. This is given by

Unnumbered Display Equation

We can show then (Zacks, 1971; p. 265) that the Bayes estimator of θ is the posterior expectation, i.e.,

$$\hat{\theta}(\bar{X}) = \bar{X}\left(1 + \frac{1}{n\tau^2}\right)^{-1}.$$

In this example, the minimization of the posterior variance and the maximization of the posterior probability of covering θ by the interval (θ̂ − δ, θ̂ + δ) lead to the same estimator. This is due to the normal prior and posterior distributions.

(b) Continuing with the same model, suppose that we wish to estimate the tail probability

Unnumbered Display Equation

Since the posterior distribution of θ − ξ0 given X̄ is normal, the Bayes estimator for a squared–error loss is the posterior expectation

Unnumbered Display Equation

Note that this Bayes estimator is strongly consistent since, by the SLLN, X̄ → θ almost surely (a.s.), and Φ(·) is a continuous function. Hence, the Bayes estimator converges to Φ(θ − ξ0) a.s. as n → ∞. It is interesting to compare this estimator to the minimum variance unbiased estimator (MVUE) and to the MLE of the tail probability. All these estimators are very close in large samples.

If the loss function is the absolute deviation, |inline.jpg − inline|, rather than the squared–error, (inline.jpg − inline)², then the Bayes estimator of inline is the median of the posterior distribution of Φ(θ − ξ0). Since the Φ–function is strictly increasing, this median is Φ(θ0.5 − ξ0), where θ0.5 is the median of the posterior distribution of θ given X̄. We thus obtain that the Bayes estimator for absolute deviation loss is

Unnumbered Display Equation

This is different from the posterior expectation.        inline
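The difference between the two estimators in part (b) can be seen numerically. The sketch below assumes the tail probability is Φ(θ − ξ0) and uses the standard identity E{Φ(θ − ξ0)} = Φ((m − ξ0)/√(1 + v)) for θ ∼ N(m, v), where m and v are the posterior mean and variance from part (a). The function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def posterior_params(xbar, n, tau2):
    """Posterior mean and variance of theta given the sample mean (prior N(0, tau2))."""
    mean = xbar / (1.0 + 1.0 / (n * tau2))
    var = 1.0 / (n + 1.0 / tau2)
    return mean, var

def tail_prob_posterior_mean(xbar, n, tau2, xi0):
    """Bayes estimator of Phi(theta - xi0) under squared-error loss."""
    mean, var = posterior_params(xbar, n, tau2)
    return STD_NORMAL.cdf((mean - xi0) / (1.0 + var) ** 0.5)

def tail_prob_posterior_median(xbar, n, tau2, xi0):
    """Bayes estimator of Phi(theta - xi0) under absolute-deviation loss."""
    mean, _ = posterior_params(xbar, n, tau2)   # normal posterior: median = mean
    return STD_NORMAL.cdf(mean - xi0)
```

The two estimators differ only through the factor √(1 + v); since v = 1/(n + 1/τ²) tends to zero, they coincide in the limit as n → ∞.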

Example 8.17. In this example, we derive Bayes estimators for the parameters μ and σ² in the normal model N(μ, σ²) for squared–error loss. We assume that X1, …, Xn, given (μ, σ²), are i.i.d. N(μ, σ²). The minimal sufficient statistic is (X̄n, Q), where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $Q = \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$. Let inline = 1/σ² be the precision parameter, and consider the reparametrization (μ, σ²) → (μ, inline).

The likelihood function is

Unnumbered Display Equation

The following is a commonly assumed joint prior distribution for (μ, inline), namely,

Unnumbered Display Equation


Unnumbered Display Equation

where n0 is an integer, and 0 < inline < ∞. This joint prior distribution is called the Normal–Gamma prior, and denoted by inline.jpg. Since inline.jpgn and Q are conditionally independent, given (μ, inline), and since the distribution of Q is independent of μ, the posterior distribution of μ | inline.jpgn, inline is normal with mean

Unnumbered Display Equation

and variance

Unnumbered Display Equation

inline.jpgB is a Bayesian estimator of μ for the squared–error loss function. The posterior risk of inline.jpgB is

Unnumbered Display Equation

The second term on the RHS is zero. Thus, the posterior risk of inline.jpgB is

Unnumbered Display Equation

The posterior distribution of inline depends only on Q. Indeed, if we denote generally by p(inline.jpg| μ, inline) and p(Q | inline) the conditional p.d.f. of inline.jpg and Q then

Unnumbered Display Equation

Hence, the marginal posterior p.d.f. of inline is

Unnumbered Display Equation

Thus, from our model, the posterior distribution of inline, given Q, is the gamma inline.jpg. It follows that

Unnumbered Display Equation

Thus, the posterior risk of inline.jpgB is

Unnumbered Display Equation

The Bayesian estimator of σ2 is

Unnumbered Display Equation

The posterior risk of inline.jpg is

Unnumbered Display Equation

which is finite if n + n0 > 5.        inline

Example 8.18. Consider the model of the previous example, but with priorly independent parameters μ and inline, i.e., we assume that h(μ, inline) = h(μ)g(inline), where

Unnumbered Display Equation


Unnumbered Display Equation

If p(inline.jpg | μ, inline) and p(Q | inline) are the p.d.f. of inline.jpg and of Q, given μ, inline, respectively, then the joint posterior p.d.f. of (μ, inline), given (inline.jpg, Q), is

Unnumbered Display Equation

where A(inline.jpg, Q) > 0 is a normalizing factor. The marginal posterior p.d.f. of μ, given (inline.jpg, Q), is

Unnumbered Display Equation

It is straightforward to show that the integral on the RHS of (8.4.18) is proportional to inline.jpg. Thus,

Unnumbered Display Equation

A simple analytic expression for the normalizing factor A* (inline.jpg, Q) is not available. One can resort to numerical integration to obtain the Bayesian estimator of μ, namely,

Unnumbered Display Equation

By the Lebesgue Dominated Convergence Theorem

Unnumbered Display Equation

Thus, for large values of n,

Unnumbered Display Equation

where inline.jpg.

In a similar manner, we can show that the marginal posterior p.d.f. of inline is

Unnumbered Display Equation

where B* (inline.jpg, Q) > 0 is a normalizing factor. Note that for large values of n, g* (inline | inline.jpg, Q) is approximately the p.d.f. of inline.jpg.

In Chapter 5, we discussed the least–squares and MVUEs of the parameters in linear models. Here, we consider Bayesian estimators for linear models. Comprehensive Bayesian analysis of various linear models is given in the books of Box and Tiao (1973) and of Zellner (1971). The analysis in Zellner’s book (see Chapter III) follows a straightforward methodology of deriving the posterior distribution of the regression coefficients for informative and noninformative priors. Box and Tiao also provide geometrical representations of the posterior distributions (probability contours) and the HPD–regions of the parameters. Moreover, by analyzing the HPD–regions Box and Tiao establish the Bayesian justification for the analysis of variance and simultaneous confidence intervals of arbitrary contrasts (the Scheffé S–method). In Example 8.11, we derived the posterior distribution of the regression coefficients of the linear model Y = α + βx + ε, where ε ∼ N(0, σ²) and (α, β) have a prior normal distribution. In a similar fashion, the posterior distribution of β in the multiple regression model $Y = X\beta + \boldsymbol{\epsilon}$ can be obtained by assuming that $\boldsymbol{\epsilon} \sim N(0, V)$ and the prior distribution of β is N(β0, B). By applying the multinormal theory, we readily obtain that the posterior distribution of β, given Y, is

Unnumbered Display Equation

This result is quite general and can be applied whenever the covariance matrix V is known. Often we encounter in the literature the inline.jpg prior and the (observations) model

Unnumbered Display Equation


Unnumbered Display Equation

This model is more general than the previous one, since presently the covariance matrices V and B are known only up to a factor of proportionality. Otherwise, the models are equivalent. If we replace V by inline.jpg, where V* is a known positive definite matrix, and inline, 0 < inline < ∞, is an unknown precision parameter then, by factoring V* = C* C*′, and letting Y* = (C*)−1Y, X* = (C*)−1X we obtain

Unnumbered Display Equation

Similarly, if B = DD′ and β* = D−1β then

Unnumbered Display Equation

If X** = X* D then the previous model, in terms of Y* and X**, is reduced to

Unnumbered Display Equation

where Y* = C−1Y, X* = C−1XD, β* = D−1β, V = CC′, and B = DD′.

We obtained a linear model generalization of the results of Example 8.17. Indeed,

Unnumbered Display Equation

Thus, the Bayesian estimator of β, for the squared–error loss, ||βinline.jpg||2 is

Unnumbered Display Equation

As in Example 8.17, the conditional predictive distribution of Y, given inline, is normal,

Unnumbered Display Equation

Hence, the marginal posterior distribution of inline, given Y, is the gamma distribution, i.e.,

Unnumbered Display Equation

where n is the dimension of Y. Thus, the Bayesian estimator of inline is

Unnumbered Display Equation

Finally, if inline = n0 then the predictive distribution of Y is the multivariate t[n0;Xβ0, I + τ2XX′], defined in (2.13.12).        inline

Example 8.19. The following is a random growth model. We follow the model assumptions of Section 8.4.3:

Unnumbered Display Equation

where θ0, t and θ1, t vary at random according to a random–walk model, i.e.,

Unnumbered Display Equation

Thus, let θt = (θ0, t, θ1, t)′ and at = (1, t). The dynamic linear model is thus

Unnumbered Display Equation

Let ηt and Ct be the posterior mean and posterior covariance matrix of θt. We obtain the recursive equations

Unnumbered Display Equation

where σ2 = V{inlinet} and Ω is the covariance matrix of ωt.        inline
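One step of such a recursion can be sketched in code. The version below is the standard Kalman-filter update for an observation Yt = at′θt + εt with a random-walk state θt = θt−1 + ωt; it is offered under the assumption that it agrees with the recursive equations of this example, whose exact displayed form is not reproduced here. The function name and argument layout are hypothetical, and the 2 × 2 matrices are plain nested lists.

```python
def dlm_update(eta_prev, c_prev, y_t, t, sigma2, omega):
    """One posterior update of (eta_t, C_t) for the random growth model,
    with a_t = (1, t)', observation variance sigma2, and state-noise covariance omega."""
    a = (1.0, float(t))
    # Covariance of theta_t before observing Y_t: R_t = C_{t-1} + Omega.
    r = [[c_prev[i][j] + omega[i][j] for j in range(2)] for i in range(2)]
    ra = [r[i][0] * a[0] + r[i][1] * a[1] for i in range(2)]      # R_t a_t
    q = sigma2 + a[0] * ra[0] + a[1] * ra[1]                      # predictive variance of Y_t
    err = y_t - (a[0] * eta_prev[0] + a[1] * eta_prev[1])         # one-step forecast error
    eta = [eta_prev[i] + ra[i] * err / q for i in range(2)]
    c = [[r[i][j] - ra[i] * ra[j] / q for j in range(2)] for i in range(2)]
    return eta, c
```

With Ω = 0 the recursion reduces to sequential Bayesian linear regression, so feeding it noise-free observations from a line recovers the intercept and slope.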

Example 8.20. In order to illustrate the approximations of Section 8.5, we apply them here to a model in which the posterior p.d.f. can be computed exactly. Thus, let X1, …, Xn be conditionally i.i.d. random variables, having a common Poisson distribution P(λ), 0 < λ < ∞. Let the prior distribution of λ be that of a gamma, G(Λ, α). Thus, the posterior distribution of λ, given $T_n = \sum_{i=1}^{n} X_i$, is that of G(n + Λ, α + Tn), with p.d.f.

Unnumbered Display Equation

The MLE of λ is λ̂n = Tn/n. In this model, J(λ̂n) = n/λ̂n. Thus, formula (8.5.11) yields the normal approximation

Unnumbered Display Equation

From large sample theory, we know that

Unnumbered Display Equation

Thus, the approximation to h(λ | Tn) given by

Unnumbered Display Equation

should be better than inline.jpg, if the sample size is not very large.        inline
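The quality of the first normal approximation can be inspected by comparing it with the exact posterior density. The sketch below assumes that G(n + Λ, α + Tn) has rate n + Λ and shape α + Tn, and that the approximation based on J(λ̂n) = n/λ̂n is the N(λ̂n, λ̂n/n) density; the function names are illustrative, and Tn > 0 is required.

```python
import math

def gamma_posterior_pdf(lam, t_n, n, rate, shape):
    """Exact posterior density of lambda: gamma with shape shape + t_n and rate rate + n."""
    a, r = shape + t_n, rate + n
    return math.exp(a * math.log(r) + (a - 1.0) * math.log(lam) - r * lam - math.lgamma(a))

def normal_approx_pdf(lam, t_n, n):
    """Normal approximation centered at the MLE with variance 1 / J(mle) = mle / n."""
    mle = t_n / n
    var = mle / n
    return math.exp(-0.5 * (lam - mle) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
```

Evaluating both functions on a grid of λ values for a small n and again for a large n shows how the discrepancy shrinks as the sample size grows.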

Example 8.21. We consider again the model of Example 8.20. In that model, for 0 < λ < ∞,

Unnumbered Display Equation


Unnumbered Display Equation

and the maximizer of inline.jpg is

Unnumbered Display Equation


Unnumbered Display Equation

The normal approximation, based on inline.jpgn and inline.jpg, is inline.jpg This is very close to the large sample approximation (8.5.15). The only difference is that α, in inline.jpg(λ | Tn), is replaced by α′ = α − 1.        inline

Example 8.22. We consider again the model of Example 8.3. In that example, Yi | λi ∼ P(λi), where λi = exp{β′Xi}, i = 1, …, n. Let (X) = (X1, …, Xn)′ be the n × p matrix of covariates. The unknown parameter is β ∈ ℝ(p). The prior distribution of β is normal, i.e., β ∼ N(β0, τ²I). The likelihood function is, according to Equation (8.1.8),

Unnumbered Display Equation

where inline.jpg. The prior p.d.f. is,

Unnumbered Display Equation


Unnumbered Display Equation


Unnumbered Display Equation


Unnumbered Display Equation

The value inline.jpg is the root of the equation

Unnumbered Display Equation

Note that

Unnumbered Display Equation

where Δ(β) is an n × n diagonal matrix with ith diagonal element equal to $e^{\beta' X_i}$ (i = 1, …, n). The matrix inline.jpg is positive definite for all β ∈ ℝ(p). We can determine inline.jpg by solving the equation iteratively, starting with the LSE, inline.jpg, where inline.jpg is a vector whose ith component is

Unnumbered Display Equation

The approximating p.d.f. for h(β | (X), Yn) is the p.d.f. of the p–variate normal inline.jpg. This p.d.f. will be compared later numerically with a p.d.f. obtained by numerical integration and one obtained by simulation.        inline

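Example 8.23. Let (Xi, Yi), i = 1, …, n be i.i.d. random vectors, having a standard bivariate normal distribution, i.e., $(X, Y)' \sim N\left(\mathbf{0}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$. The likelihood function of ρ is

$$L(\rho \mid T_n) = (1-\rho^2)^{-n/2} \exp\left\{-\frac{Q_X - 2\rho P_{XY} + Q_Y}{2(1-\rho^2)}\right\},$$

where Tn = (QX, PXY, QY) and $Q_X = \sum_{i=1}^{n} X_i^2$, $Q_Y = \sum_{i=1}^{n} Y_i^2$, $P_{XY} = \sum_{i=1}^{n} X_i Y_i$. The Fisher information function for ρ is

$$I_n(\rho) = n\,\frac{1+\rho^2}{(1-\rho^2)^2}.$$

Using the Jeffreys prior

$$h(\rho)\,d\rho \propto \frac{(1+\rho^2)^{1/2}}{1-\rho^2}\,d\rho, \quad -1 < \rho < 1,$$

the Bayesian estimator of ρ for the squared error loss is

$$\hat{\rho}_B = \frac{\displaystyle\int_{-1}^{1} \rho\,(1+\rho^2)^{1/2}(1-\rho^2)^{-(n/2+1)} \exp\left\{-\frac{Q_X - 2\rho P_{XY} + Q_Y}{2(1-\rho^2)}\right\} d\rho}{\displaystyle\int_{-1}^{1} (1+\rho^2)^{1/2}(1-\rho^2)^{-(n/2+1)} \exp\left\{-\frac{Q_X - 2\rho P_{XY} + Q_Y}{2(1-\rho^2)}\right\} d\rho}.$$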

This estimator, for given values of QX, QY and PXY, can be evaluated accurately by 16–point Gaussian quadrature. For 16 points, we get from Table 25.4 of Abramowitz and Stegun (1968) the values

Unnumbered Display Equation

For negative values of u, we use −u with the same weight ωi. For a sample of size n = 10, with QX = 6.1448, QY = 16.1983, and PXY = 4.5496, we obtain ρ̂B = 0.3349.
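The quadrature can also be carried out without a table. The sketch below computes the Gauss–Legendre nodes and weights by Newton iteration on the Legendre polynomial rather than reading them from Abramowitz and Stegun, and it assumes the posterior kernel is the standard bivariate-normal likelihood in ρ times the Jeffreys prior, proportional to (1 + ρ²)^(1/2)(1 − ρ²)^(−(n/2+1)) exp{−(QX − 2ρPXY + QY)/(2(1 − ρ²))}. All function names are illustrative.

```python
import math

def gauss_legendre(m):
    """Nodes and weights of the m-point Gauss-Legendre rule on (-1, 1), m >= 2."""
    nodes, weights = [], []
    for i in range(1, m + 1):
        x = math.cos(math.pi * (i - 0.25) / (m + 0.5))   # initial guess for the i-th root
        for _ in range(100):                             # Newton iterations on P_m(x)
            p_prev, p = 1.0, x
            for k in range(2, m + 1):                    # three-term recurrence up to P_m
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = m * (x * p - p_prev) / (x * x - 1.0)    # derivative of P_m at x
            step = p / dp
            x -= step
            if abs(step) < 1e-14:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def log_posterior_kernel(rho, n, qx, qy, pxy):
    """Log of likelihood times Jeffreys prior, up to an additive constant."""
    one_m = 1.0 - rho * rho
    log_lik = -0.5 * n * math.log(one_m) - (qx - 2.0 * rho * pxy + qy) / (2.0 * one_m)
    log_prior = 0.5 * math.log(1.0 + rho * rho) - math.log(one_m)
    return log_lik + log_prior

def bayes_rho(n, qx, qy, pxy, points=16):
    """Posterior mean of rho as a ratio of two quadrature sums."""
    nodes, weights = gauss_legendre(points)
    logs = [log_posterior_kernel(u, n, qx, qy, pxy) for u in nodes]
    top = max(logs)                                      # rescale to avoid underflow
    dens = [w * math.exp(l - top) for w, l in zip(weights, logs)]
    return sum(u * d for u, d in zip(nodes, dens)) / sum(dens)
```

Calling `bayes_rho(10, 6.1448, 16.1983, 4.5496)` applies the 16-point rule to the data of this example; increasing `points` gives a check on the numerical accuracy.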

Example 8.24. In this example, we consider evaluating the integrals in ρ̂B of Example 8.23 by simulations. We simulate 100 random variables Ui ∼ R(−1, 1), i = 1, …, 100, and approximate the integrals in the numerator and denominator of ρ̂B by averages. For n = 100, and the same values of QX, QY, PXY, as in Example 8.23, we obtain the approximation ρ̂B = 0.36615.

Example 8.25. We return to Example 8.22 and compute the posterior expectation of β by simulation. Note that, for a large number M of simulation runs, E{β| X, Y} is approximated by

Unnumbered Display Equation

where βj is a random vector, simulated from the N(β0, τ2I) distribution.

To illustrate the result numerically, we consider a case where the observed sample contains 40 independent observations, ten for each one of the four x vectors:

Unnumbered Display Equation

The observed value of w40 is (6, 3, 11, 29). For β0 = (0.1, 0.1, 0.5, 0.5), we obtain the following Bayesian estimators inline.jpg, with M = 1000,

Unnumbered Table

We see that when τ = 0.01 the Bayesian estimators are very close to the prior mean β0. When τ = 0.05 the Bayesian estimators might be quite different from β0. In Example 8.22, we approximated the posterior distribution of β by a normal distribution. For the values in this example the normal approximation yields similar results to those of the simulation.
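The simulation scheme of this example can be sketched as follows. The code assumes that the approximation is the likelihood-weighted average of prior draws, Σ βj L(βj) / Σ L(βj), with βj simulated from N(β0, τ²I), and that the likelihood is that of the Poisson log-linear model of Example 8.22. The function names and the small data set used for illustration are hypothetical; they are not the covariates or observations of the example.

```python
import math
import random

def log_likelihood(beta, xs, ys):
    """Poisson log-linear log-likelihood, up to a term that does not depend on beta."""
    total = 0.0
    for x, y in zip(xs, ys):
        lin = sum(b * xi for b, xi in zip(beta, x))
        total += y * lin - math.exp(lin)
    return total

def posterior_mean_by_simulation(xs, ys, beta0, tau, runs=1000, seed=0):
    """Approximate E{beta | X, Y} by weighting prior draws with their likelihood."""
    rng = random.Random(seed)
    draws = [[rng.gauss(b, tau) for b in beta0] for _ in range(runs)]
    logs = [log_likelihood(b, xs, ys) for b in draws]
    top = max(logs)                       # subtract the maximum to avoid underflow
    w = [math.exp(l - top) for l in logs]
    total = sum(w)
    return [sum(wi * b[k] for wi, b in zip(w, draws)) / total
            for k in range(len(beta0))]

# Hypothetical data: three covariate vectors (intercept, dose) and observed counts.
xs_demo = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
ys_demo = [1, 3, 8]
tight = posterior_mean_by_simulation(xs_demo, ys_demo, [0.1, 0.5], 0.01)
loose = posterior_mean_by_simulation(xs_demo, ys_demo, [0.1, 0.5], 0.5)
```

With a small τ every draw lies close to β0, so the weighted average cannot move far from the prior mean; a larger τ lets the likelihood weights pull the estimate toward the data.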

Example 8.26. Consider the following repetitive problem. In a certain manufacturing process, a lot of N items is produced every day. Let Mj, j = 1, 2, …, denote the number of defective items in the lot of the jth day. The parameters M1, M2, … are unknown. At the end of each day, a random sample of size n is selected without replacement from that day’s lot for inspection. Let Xj denote the number of defectives observed in the sample of the jth day. The distribution of Xj is the hypergeometric H(N, Mj, n), j = 1, 2, …. Samples from different days are (conditionally) independent (given M1, M2, …). In this problem, it is often reasonable to assume that the parameters M1, M2, … are independent random variables having the same binomial distribution B(N, θ). θ is the probability of defectives in the production process. It is assumed that θ does not change in time. The value of θ is, however, unknown. It is simple to verify that for a prior B(N, θ) distribution of M, and a squared–error loss function, the Bayes estimator of Mj is

Unnumbered Display Equation

The corresponding Bayes risk is

Unnumbered Display Equation

A sequence of empirical Bayes estimators is obtained by substituting in inline.jpgj a consistent estimator of θ based on the results of the first (j − 1) days. Under the above assumption on the prior distribution of M1, M2, …, the predictive distribution of X1, X2, … is the binomial B(n, θ). A priori, for a given value of θ, X1, X2, … can be considered as i.i.d. random variables having the mixed distribution B(n, θ). Thus, inline.jpgj−1 = inline.jpg, for j ≥ 2, is a sequence of consistent estimators of θ. The corresponding sequence of empirical Bayes estimators is

Unnumbered Display Equation

The posterior risk of inline.jpgj given (Xj, inline.jpgj − 1) is

Unnumbered Display Equation

We consider now the conditional expectation of ρj(inline.jpgj) given Xj. This is given by

Unnumbered Display Equation

Notice that this converges as j → ∞ to ρ (θ).        inline
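The sequence of empirical Bayes estimators can be sketched in a few lines. The code assumes the Bayes estimator has the form Xj + (N − n)θ, which follows from the fact that, under the B(N, θ) prior, the N − n unsampled items are defective independently with probability θ, and it estimates θ on day j by the pooled fraction of defectives in the samples of days 1, …, j − 1. The function names are illustrative.

```python
def bayes_estimate(x_j, lot_size, sample_size, theta):
    """Posterior mean of M_j given X_j = x_j when M_j ~ B(N, theta)."""
    return x_j + (lot_size - sample_size) * theta

def empirical_bayes_estimates(xs, lot_size, sample_size):
    """Estimates of M_j for days j = 2, 3, ..., each using theta
    estimated from the samples of days 1, ..., j - 1."""
    estimates = []
    running_total = 0
    for j, x in enumerate(xs, start=1):
        if j >= 2:
            theta_hat = running_total / ((j - 1) * sample_size)
            estimates.append(bayes_estimate(x, lot_size, sample_size, theta_hat))
        running_total += x
    return estimates
```

For example, with N = 100, n = 10, and observed counts 2, 1, 3 on the first three days, the estimates for days 2 and 3 are 1 + 90(0.2) = 19 and 3 + 90(0.15) = 16.5.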

Example 8.27. This example shows the application of empirical Bayes techniques to the simultaneous estimation of many probability vectors. The problem was motivated by a problem of assessing the readiness probabilities of military units based on exercises of big units. For details, see Brier, Zacks, and Marlow (1986).

A large number, N, of units are tested independently on tasks that are classified into K categories. Each unit obtains on each task the value 1 if it is executed satisfactorily and the value 0 otherwise. Let i, i = 1, …, N be the index of the ith unit, and j, j = 1, …, K the index of a category. Unit i was tested on Mij tasks of category j. Let Xij denote the number of tasks in category j on which the ith unit received a satisfactory score. Let θij denote the probability of the ith unit executing satisfactorily a task of category j. There are N parametric vectors θi = (θi1, …, θiK)′, i = 1, …, N, to be estimated.

The model is that, conditional on θi, Xi1, …, XiK are independent random variables, having binomial distributions, i.e.,

Unnumbered Display Equation

In addition, the vectors θi (i = 1, …, N) are i.i.d. random variables, having a common distribution. Since Mij were generally large, but not the same, we have used first the variance stabilizing transformation

Unnumbered Display Equation

For large values of Mij the asymptotic distribution of Yij is N(ηij, 1/Mij), where ηij = 2 sin⁻¹(√θij), as shown in Section 7.6.

Let Yi = (Yi1, …, YiK)′, i = 1, …, N. The parametric empirical Bayes model is that (Yi, θi) are i.i.d, i = 1, …, N,

Unnumbered Display Equation


Unnumbered Display Equation

where ηi = (ηi1, …, ηiK)′ and

Unnumbered Display Equation

The prior parameters μ and inline are unknown. Note that if μ and inline are given then η1, η2, …, ηN are a posteriori independent, given Y1, …, YN. Furthermore, the posterior distribution of ηi, given Yi, is

Unnumbered Display Equation


Unnumbered Display Equation

Thus, if μ and inline are given, the Bayesian estimator of ηi, for a squared–error loss function L(ηi, inline.jpgi) = ||inline.jpgiηi||2, is the posterior mean, i.e.,

Unnumbered Display Equation

The empirical Bayes method estimates μ and inline from all the data. We derive now MLEs of μ and inline. These MLEs are then substituted in inline.jpgi(μ, inline) to yield empirical Bayes estimators of ηi.

Note that Yi | μ, inline ∼ N(μ, inline + Di), i = 1, …, N. Hence, the log–likelihood function of (μ, inline.jpg), given the data (Y1, …, YN), is

Unnumbered Display Equation

The vector inline.jpg(inline), which maximizes l(μ, inline), for a given inline is

Unnumbered Display Equation

Substituting inline.jpg(inline) in l(μ, inline) and finding inline that maximizes the expression can yield the MLE inline.jpg. Another approach, to find the MLE, is given by the E–M algorithm. The EM algorithm considers the unknown parameters η1, …, ηN as missing data. The algorithm is an iterative process, having two phases in each iteration. The first phase is the E–phase, in which the conditional expectation of the likelihood function is determined, given the data and the current values of (μ, inline). In the next phase, the M–phase, the conditionally expected likelihood is maximized by determining the maximizing arguments inline.jpg. More specifically, let

Unnumbered Display Equation

be the log–likelihood of (μ, inline) if η1, …, ηN were known. Let (μ(p), inline(p)) be the estimator of (μ, inline) after p iterations, p ≥ 0, where μ(0), inline(0) are initial estimates.

In the (p + 1)st iteration, we start (the E–phase) by determining

Unnumbered Display Equation

where the conditional expectation is determined as though μ(p) and inline(p) are the true values. It is well known that if E{X} = ξ and the covariance matrix of X is C(X) then E{X′AX} = ξ′Aξ + tr{AC(X)}, where tr{·} is the trace of the matrix (see Seber, 1977, p. 13). Thus,

Unnumbered Display Equation

where, inline.jpg, in which inline.jpg, and inline.jpg, i = 1, …, N. Thus,

Unnumbered Display Equation

where inline.jpg. In the M–phase, we determine μ(p + 1) and inline(p + 1) by maximizing l** (μ, inline | …).

One can immediately verify that

Unnumbered Display Equation


Unnumbered Display Equation

where inline.jpg, and

Unnumbered Display Equation

Finally, the matrix maximizing l** is

Unnumbered Display Equation

We can prove recursively, by induction on p, that

Unnumbered Display Equation

where inline.jpg and

Unnumbered Display Equation


Unnumbered Display Equation

One continues iterating until μ(p) and inline(p) do not change significantly.

Brier, Zacks, and Marlow (1986) studied the efficiency of these empirical Bayes estimators, in comparison to the simple MLE, and to another type of estimator that will be discussed in Chapter 9.        inline
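The two phases of the algorithm are easiest to see in the scalar special case K = 1. The sketch below is an illustration under that simplifying assumption, not the multivariate procedure of the example: each Yi ∼ N(ηi, di) with known di, each ηi ∼ N(μ, s2), and the ηi are treated as missing data. The function names and starting values are hypothetical choices.

```python
import math

def em_normal_hierarchy(ys, ds, iterations=200):
    """EM estimates of (mu, s2) for Y_i ~ N(eta_i, d_i), eta_i ~ N(mu, s2).
    Also returns the posterior means of eta_i from the last E-phase."""
    n = len(ys)
    mu = sum(ys) / n                                          # initial estimates
    s2 = max(sum((y - mu) ** 2 for y in ys) / n, 1e-8)
    means = list(ys)
    for _ in range(iterations):
        # E-phase: posterior mean and variance of each eta_i at the current (mu, s2).
        shrink = [s2 / (s2 + d) for d in ds]
        means = [mu + b * (y - mu) for b, y in zip(shrink, ys)]
        variances = [b * d for b, d in zip(shrink, ds)]
        # M-phase: maximize the conditionally expected complete-data log-likelihood.
        mu = sum(means) / n
        s2 = sum((m - mu) ** 2 + v for m, v in zip(means, variances)) / n
    return mu, s2, means

def marginal_log_likelihood(mu, s2, ys, ds):
    """Log-likelihood of (mu, s2) from Y_i ~ N(mu, s2 + d_i), up to a constant."""
    return -0.5 * sum(math.log(s2 + d) + (y - mu) ** 2 / (s2 + d)
                      for y, d in zip(ys, ds))
```

The returned posterior means play the role of the empirical Bayes estimates of the ηi, each shrinking Yi toward the estimated common mean by the factor s2/(s2 + di). The marginal log-likelihood does not decrease from one iteration to the next, which gives a simple convergence check.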

