Chapter 3: Sufficient Statistics and the Information in Samples

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 3

Sufficient Statistics and the Information in Samples

PART I: THEORY

3.1 INTRODUCTION

The problem of statistical inference is to draw conclusions from the observed sample on some characteristics of interest of the parent distribution of the random variables under consideration. For this purpose we formulate a model that presents our assumptions about the family of distributions to which the parent distribution belongs. For example, in an inventory management problem one of the important variables is the number of units of a certain item demanded every period by the customer. This is a random variable with an unknown distribution. We may be ready to assume that the distribution of the demand variable is Negative Binomial NB(, ν). The statistical model specifies the possible range of the parameters, called the parameter space, and the corresponding family of distributions . In this example of an inventory system, the model may be

Such a model represents the case where the two parameters, and ν, are unknown. The parameter space here is Θ = {(, ν); 0 < < 1, 0 < ν < ∞ }. Given a sample of n independent and identically distributed (i.i.d.) random variables X1, …, Xn, representing the weekly demand, the question is what can be said on the specific values of and ν from the observed sample?

Every sample contains a certain amount of information on the parent distribution. Intuitively we understand that the larger the number of observations in the sample (on i.i.d. random variables) the more information it contains on the distribution under consideration. Later in this chapter we will discuss two specific information functions, which are used in statistical design of experiments and data analysis. We start with the investigation of the question whether the sample data can be condensed by computing first the values of certain statistics without losing information. If such statistics exist they are called sufficient statistics. The term statistic will be used to indicate a function of the (observable) random variables that does not involve any function of the unknown parameters. The sample mean, sample variance, the sample order statistics, etc., are examples of statistics. As will be shown, the notion of sufficiency of statistics is strongly dependent on the model under consideration. For example, in the previously mentioned inventory example, as will be established later, if the value of the parameter ν is known, a sufficient statistic is the sample mean . On the other hand, if ν is unknown, the sufficient statistic is the order statistic (X(1), …, X(n)). When ν is unknown, the sample mean by itself does not contain all the information on and ν. In the following section we provide a definition of sufficiency relative to a specified model and give a few examples.

3.2 DEFINITION AND CHARACTERIZATION OF SUFFICIENT STATISTICS

3.2.1 Introductory Discussion

Let X = (X1, …, Xn) be a random vector having a joint c.d.f. Fθ (x) belonging to a family = {Fθ (x); θ Θ }. Such a random vector may consist of n i.i.d. variables or of dependent random variables. Let T(X) = (T1(X), …, Tr(X))′, 1 ≤ r ≤ n be a statistic based on X. T could be real (r = 1) or vector valued (r > 1). The transformations Tj(X), j = 1, …, r are not necessarily one–to–one. Let f(x; θ) denote the (joint) probability density function (p.d.f.) of X. In our notation here Ti(X) is a concise expression for Ti(X1, …, Xn). Similarly, Fθ (x) and f(x; θ) represent the multivariate functions Fθ (x1, …, xn) and f(x1, …, xn;θ). As in the previous chapter, we assume throughout the present chapter that all the distribution functions belonging to the same family are either absolutely continuous, discrete, or mixtures of the two types.

Definition of Sufficiency. Let be a family of distribution functions and let X = (X1, …, Xn) be a random vector having a distribution in . A statistic T(X) is called sufficient with respect to if the conditional distribution of X given T(X) is the same for all the elements of .

Accordingly, if the joint p.d.f. of X, f(x; θ), depends on a parameter θ and T(X) is a sufficient statistic with respect to , the conditional p.d.f. h(x| t) of X given {T(X) = t} is independent of θ. Since f(x;θ) = h(x| t)g(t; θ), where g(t; θ) is the p.d.f. of T(x), all the information on θ in x is summarized in T(x).

The process of checking whether a given statistic is sufficient for some family following the above definition may be often very tedious. Generally the identification of sufficient statistics is done by the application of the following theorem. This celebrated theorem was given first by Fisher (1922) and Neyman (1935). We state the theorem here in terms appropriate for families of absolutely continuous or discrete distributions. For more general formulations see Section 3.2.2. For the purposes of our presentation we require that the family of distributions consists of

(i) absolutely continuous distributions; or

(ii) discrete distributions, having jumps on a set of points {ξ1, ξ2, … } independent of θ, i.e., inline

for all θ

Θ; or

(iii) mixtures of distributions satisfying (i) or (ii). Such families of distributions will be called regular (Bickel and Doksum, 1977, p. 61).

The families of discrete or absolutely continuous distributions discussed in Chapter 2 are all regular.

Theorem 3.2.1 (The Neyman–Fisher Factorization Theorem). Let X be a random vector having a distribution belonging to a regular family and having a joint p.d.f. f(x; θ), θ Θ. A statistic T(X) is sufficient for if and only if

(3.2.1)

where K(x) ≥ 0 is independent of θ and g(T(x); θ) ≥ 0 depends on x only through T(x).

Proof. We provide here a proof for the case of discrete distributions.

(i) Sufficiency:

We show that (3.2.1) implies that the conditional distribution of X given {T(X) = t} is independent of θ. The (marginal) p.d.f. of T(X) is, according to (3.2.1),

(3.2.2) numbered Display Equation

The joint p.d.f. of X and T(X) is

(3.2.3)

Hence, the conditional p.d.f. of X, given {T(X) = t at every point t such that g* (t; θ) > 0, is

(3.2.4) numbered Display Equation

This proves that T(X) is sufficient for .

(ii) Necessity:

Suppose that T(X) is sufficient for . Then, for every t at which the (marginal) p.d.f. of T(X), g* (t;θ), is positive we have,

(3.2.5)

where B(x) ≥ 0 is independent of θ. Moreover, since (3.2.5) is a conditional p.d.f. Thus, for every x,

(3.2.6)

Finally, since for every x,

(3.2.7)

we obtain that

(3.2.8) QED

3.2.2 Theoretical Formulation

3.2.2.1 Distributions and Measures

We generalize the definitions and proofs of this section by providing measure–theoretic formulation. Some of these concepts were discussed in Chapter 1. This material can be skipped by students who have not had real analysis.

Let (Ω, , P) be a probability space. A random variable X is a finite real value measurable function on this probability space, i.e., X: Ω → . Let be the sample space (range of X), i.e., = X(Ω). Let be the Borel σ–field on , and consider the probability space (, , PX) where, for each B , PX{B} = P{X−1(B)}. Since X is a random variable, X = {A: A = X−1(B), B } .

The distribution function of X is

(3.2.9)

Let X1, X2, …, Xn be n random variables defined on the same probability space (Ω, , P). The joint distribution of X = (X1, …, Xn)′ is a real value function of n defined as

(3.2.10) numbered Display Equation

Consider the probability space ((n), (n), P(n)) where (n) = × ··· × , (n) = × ··· × (or the Borel σ–field generated by the intervals (−∞, x1] × ··· (−∞, xn], (x1, …, xn) n) and for B (n)

(3.2.11)

A function h: (n) → is said to be (n)–measurable if the sets h−1((−∞, ζ]) are in (n) for all −∞ < ζ < ∞. By the notation h (n) we mean that h is (n)–measurable.

A random sample of size n is the realization of n i.i.d. random variables (see Chapter 2 for definition of independence).

To economize in notation, we will denote by bold x the vector (x1, …, xn), and by F(x) the joint distribution of (X1, …, Xn). Thus, for all B (n),

(3.2.12)

This is a probability measure on ((n), B(n)) induced by F(x). Generally, a σ–finite measure μ on (n) is a nonnegative real value set function, i.e., μ: (n) → [0, ∞], such that

(i) μ (

) = 0;

(ii) if

is a sequence of mutually disjoint sets in

(n), i.e., Bi

Bj =

for any i ≠ j, then

Unnumbered Display Equation

(iii) there exists a partition of

(n), {B1, B2, … } for which μ (Bi) < ∞ for all i = 1, 2, … .

The Lebesque measure is a σ–finite measure on B(n).

If there is a countable set of marked points in n, S = {x1, x2, x3, … }, the counting measure is

N(B; S) is a σ–finite measure, and for any finite real value function g(x)

Notice that if B is such that N(B; S) = 0 then B g(x)d N(x; S) = 0. Similarly, if B is such that B dx = 0 then, for any positive integrable function g(x), B g(x)dx = 0. Moreover, ν (B) = B g(x)d x and λ (B) = B g(x)d N(x; S) are σ–finite measures on ((n), (n)).

Let ν and μ be two σ–finite measures defined on ((n), (n)). We say that ν is absolutely continuous with respect to μ if μ (B) = 0 implies that ν (B) = 0. We denote this relationship by ν μ. If ν μ and μ ν, we say that ν and μ are equivalent, ν ≡ μ. We will use the notation F μ if the probability measure PF is absolutely continuous with respect to μ. If F μ there exists a nonnegative function f(x), which is measurable, satisfying

(3.2.13)

f(x) is called the (Radon–Nikodym) derivative of F with respect to μ or the generalized density p.d.f. of F(x). We write

(3.2.14)

(3.2.15)

As discussed earlier, a statistical model is represented by a family of distribution functions Fθ on (n), θ Θ. The family is dominated by a σ–finite measure μ if Fθ μ, for each θ Θ.

We consider only models of dominated families. A theorem in measure theory states that if μ then there exists a countable sequence such that

(3.2.16)

induces a probability measure P*, which dominates .

A statistic T(X) is a measurable function of the data X. More precisely, let T: (n) → (k), k ≥ 1 and let (k) be the Borel σ–field of subsets of (k). The function T(X) is a statistic if, for every C (k), T−1(C) (n). Let T = {B: B = T−1(C) for C (k)}. The probability measure PT on (k), induced by PX, is given by

(3.2.17)

Thus, the induced distribution function of T is FT(t), where t k and

(3.2.18)

If F μ then FT μT, where μT(C) = μ (T−1(C)) for all C (k). The generalized density (p.d.f.) of T with respect to μT is gT(t) where

(3.2.19)

If h(x) is (n) measurable and |h(x)| dF(x) < ∞, then the conditional expectation of h(X) given {T(X) = t} is a T measurable function, EF{h(X)| T(X) = t}, for which

(3.2.20) numbered Display Equation

for all C (k). In particular, if C = (k) we obtain the law of the iterated expectation; namely

(3.2.21)

Notice that EF{h(X) | T(X)} assumes a constant value on the coset A(t) = {x: T(x) = t} = T−1({t}), t (k).

3.2.2.2 Sufficient Statistics

Consider a statistical model ((n), (n), ) where is a family of joint distributions of the random sample. A statistic T: ((n), (n), ) → ((k), (k), T) is called sufficient for if, for all B (n), PF{B| T(X) = t} = p(B; t) for all F . That is, the conditional distribution of X given T(X) is the same for all F in . Moreover, for a fixed t, p(B; t) is (n) measurable and for a fixed B, p(B; t) is (k) measurable.

Theorem 3.2.2 Let ((n), (n), ) be a statistical model and μ. Let such that F*(x) = inline and P*. Then T(X) is sufficient for if and only if for each θ Θ there exists a T measurable function gθ (T(x)) such that, for each B (n)

(3.2.22)

i.e.,

(3.2.23)

Proof. (i) Assume that T(X) is sufficient for . Accordingly, for each B (n),

for all θ Θ. Fix B in (n) and let C (k).

for each θ Θ. In particular,

By the Radon–Nikodym Theorem, since Fθ F* for each θ, there exists a T measurable function gθ (T(X)) so that, for every C (k),

Now, for B (n) and θ Θ,

Unnumbered Display Equation

Hence, dFθ (x)/d F* (x) = gθ (T(x)), which is T measurable.

(ii) Assume that there exists a T measurable function gθ (T(x)) so that, for each θ Θ,

Let A (n) and define the σ–finite measure . Thus,

Unnumbered Display Equation

Thus, Pθ {A| T(X)} = P*{A| T(X)} for all θ Θ. Therefore T is a sufficient statistic. QED

Theorem 3.2.3 (Abstract Formulation of the Neyman–Fisher Factorization Theorem) Let ((n), (n), ) be a statistical model with μ. Then T(X) is sufficient for if and only if

(3.2.24)

where h ≥ 0 and h (n), gθ T.

Proof. Since μ, , such that F*(x) = dominates . Hence, by the previous theorem, T(X) is sufficient for if and only if there exists a T measurable function gθ (T(x)) so that

Let fθn(x) = d Fθn(x)/dμ (x) and set h(x) = . The function h(x) (n) and

QED

3.3 LIKELIHOOD FUNCTIONS AND MINIMAL SUFFICIENT STATISTICS

Consider a vector X = (X1, …, Xn)′ of random variables having a joint c.d.f. Fθ (x) belonging to a family = {Fθ (x); θ Θ }. It is assumed that is a regular family of distributions, i.e., μ, and, for each θ Θ, there exists f(x;θ) such that

f(x;θ) is the joint p.d.f. of X with respect to μ (x). We define over the parameter space Θ a class of functions L(θ; X) called likelihood functions. The likelihood function of θ associated with a vector of random variables X is defined up to a positive factor of proportionality as

(3.3.1)

The factor of proportionality in (3.3.1) may depend on X but not on θ. Accordingly, we say that two likelihood functions L1(θ ;X) and L2(θ ; X) are equivalent, i.e., L1(θ ; X) ∼ L2(θ ;X), if L1(θ ;X) = A(X) L2(θ ; X) where A(X) is a positive function independent of θ. For example, suppose that X = (X1, …, Xn)′ is a vector of i.i.d. random variables having a N(θ, 1) distribution, −-∞ < θ < ∞. The likelihood function of θ can be defined as

(3.3.2) numbered Display Equation

where and and 1′ = (1, …, 1) or as

(3.3.3)

where . We see that for a given value of X, L1(θ ; X) ∼ L2(θ ; X). All the equivalent versions of a likelihood function L(θ; X) belong to the same equivalence class. They all represent similar functions of θ.

If S(X) is a statistic having a p.d.f. gS(s; θ), (θ Θ), then the likelihood function of θ given S(X) = s is LS(θ; s) gS(s; θ). LS(θ; s) may or may not have a shape similar to L(θ; X). From the Factorization Theorem we obtain that if L(θ; X)∼ LS(θ; S(X)), for all X, then S(X) is a sufficient statistic for . The information on θ given by X can be reduced to S(X) without changing the factor of the likelihood function that depends on θ. This factor is called the kernel of the likelihood function. In terms of the above example, if T(X) = , since ∼ N , LT(X)(θ; t) = exp . Thus, for all x such that T(x) = t, LX(θ; t) ∼ L1(θ; x) ∼ L2(θ; x). is indeed a sufficient statistic. The likelihood function LT(θ ; T(x)) associated with any sufficient statistic for is equivalent to the likelihood function L(θ; x) associated with X. Thus, if T(X) is a sufficient statistic, then the likelihood ratio

is independent of θ. A sufficient statistic T(X) is called minimal if it is a function of any other sufficient statistic S(X). The question is how to determine whether a sufficient statistic T(X) is minimal sufficient.

Every statistic S(X) induces a partition of the sample space χ(n) of the observable random vector X. Such a partition is a collection of disjoint sets whose union is χ(n). Each set in this partition is determined so that all its elements yield the same value of S(X). Conversely, every partition of χ(n) corresponds to some function of X. Consider now the partition whose sets contain only x points having equivalent likelihood functions. More specifically, let x0 be a point in χ(n). A coset of x0 in this partition is

(3.3.4)

The partition of χ(n) is obtained by varying x0 over all the points of χ(n). We call this partition the equivalent–likelihood partition. For example, in the N(θ, 1) case −∞ < θ < ∞, each coset consists of vectors x having the same mean = 1′x. These means index the cosets of the equivalent–likelihood partitions. The statistic T(X) corresponding to the equivalent–likelihood partition is called the likelihood statistic. This statistic is an index of the likelihood function L(θ; x). We show now that the likelihood statistic T(X) is a minimal sufficient statistic (m.s.s.).

Let x(1) and x(2) be two different points and let T(x) be the likelihood statistic. Then, T(x(1)) = T(x(2)) if and only if L(θ; x(1)) ∼ L(θ ;x(2)). Accordingly, L(θ; X) is a function of T(X), i.e., f(X;θ) = A(X) g*(T(X); θ). Hence, by the Factorization Theorem, T(X) is a sufficient statistic. If S(X) is any other sufficient statistic, then each coset of S(X) is contained in a coset of T(X). Indeed, if x(1) and x(2) are such that S(x(1)) = S(x(2)) and f(x(i), θ) > 0 (i = 1,2), we obtain from the Factorization Theorem that f(x(1); θ) = k(x(1)) g(S(x(1)); θ) = k(x(1))g(S(x(2)); θ) = k(x(1)) f(x(2); θ)/k(x(2)), where k(x(2)) > 0. That is, L(θ; X(1)) ∼ L(θ; X(2)) and hence T(X(1)) = T(X(2)). This proves that T(X) is a function of S(X) and therefore minimal sufficient.

The minimal sufficient statistic can be determined by determining the likelihood statistic or, equivalently, by determining the partition of χ(n) having the property that f(x(1); θ)/f(x(2); θ) is independent of θ for every two points at the same coset.

3.4 SUFFICIENT STATISTICS AND EXPONENTIAL TYPE FAMILIES

In Section 2.16 we discussed the k–parameter exponential type family of distributions. If X1, …, Xn are i.i.d. random variables having a k–parameter exponential type distribution, then the joint p.d.f. of X = (X1, …, Xn), in its canonical form, is

(3.4.1) numbered Display Equation

It follows that inline is a sufficient statistic. The statistic T(X) is minimal sufficient if the parameters {1, …, k} are linearly independent. Otherwise, by reparametrization we can reduce the number of natural parameters and obtain an m.s.s. that is a function of T(X).

Dynkin (1951) investigated the conditions under which the existence of an m.s.s., which is a nontrivial reduction of the sample data, implies that the family of distributions, , is of the exponential type. The following regularity conditions are called Dynkin’s Regularity Conditions. In Dynkin’s original paper, condition (iii) required only piecewise continuous differentiability. Brown (1964) showed that it is insufficient. We phrase (iii) as required by Brown.

Dynkin’s Regularity Conditions

(i) The family

= {Fθ (x); θ

Θ } is a regular parametric family. Θ is an open subset of the Euclidean space Rk.

(ii) If f(x; θ) is the p.d.f. of Fθ (x), then

= { x; f(x; θ) > 0} is independent of θ.

(iii) The p.d.f.s f(x; θ) are such that, for each θ

Θ, f(x; θ) is a continuously differentiable function of x over χ.

(iv) The p.d.f.s f(x; θ) are differentiable with respect to θ for each x

Theorem 3.4.1 (Dynkin’s). If the family is regular in the sense of Dynkin, and if for a sample of n ≥ k i.i.d. random variables U1(X), …, Uk(X) are linearly independent sufficient statistics, then the p.d.f. of X is

Unnumbered Display Equation

where the functions 1 (θ), …, k(θ) are linearly independent.

For a proof of this theorem and further reading on the subject, see Dynkin (1951), Brown (1964), Denny (1967, 1969), Tan (1969), Schmetterer (1974, p. 215), and Zacks (1971, p. 60). The connection between sufficient statistics and the exponential family was further investigated by Borges and Pfanzagl (1965), and Pfanzagl (1972). A one dimensional version of the theorem is proven in Schervish (1995, p. 109).

3.5 SUFFICIENCY AND COMPLETENESS

A family of distribution functions = {Fθ (x);θ Θ } is called complete if, for any integrable function h(X),

(3.5.1)

implies that Pθ [h(X) = 0] = 1 for all θ Θ.

A statistic T(X) is called complete sufficient statistic if it is sufficient for a family , and if the family T of all the distributions of T(X) corresponding to the distributions in is complete.

Minimal sufficient statistics are not necessarily complete. To show it, consider the family of distributions of Example 3.6 with ξ1 = ξ2 = ξ. It is a four–parameter, exponential–type distribution and the m.s.s. is

Unnumbered Display Equation

The family T is incomplete since inline for all θ = (ξ, σ1, σ2). But inline , all θ. The reason for this incompleteness is that when ξ1 = ξ2 the four natural parameters are not independent. Notice that in this case the parameter space Ω = { = (1, 2, 3, 4); 1 = 2 3/4} is three–dimensional.

Theorem 3.5.1 If the parameter space Ω corresponding to a k–parameter exponential type family is k–dimensional, then the family of the minimal sufficient statistic is complete.

The proof of this theorem is based on the analyticity of integrals of the type (2.16.4). For details, see Schervish (1995, p. 108).

From this theorem we immediately deduce that the following families are complete.

1. B(N, θ), 0 < θ < 1; N fixed.

2. P(λ), 0 < λ < ∞.

3. N B(

, ν), 0 <

< 1; ν fixed.

4. G(λ, ν), 0 < λ < ∞, 0 < ν < ∞.

5. β (p, q), 0 < p, q < ∞.

6. N(μ, σ2), −∞ < μ < ∞, 0 < σ < ∞.

7. M(N, θ), θ = (θ1, …, θk), 0 <

θi < 1; N fixed.

8. N(μ, V), μ

R(k); V positive definite.

We define now a weaker notion of boundedly complete families. These are families for which if h(x) is a bounded function and Eθ {h(X)} = 0, for all θ Θ, then Pθ {h(x) = 0} = 1, for all θ Θ. For an example of a boundedly complete family that is incomplete, see Fraser (1957, p. 25).

Theorem 3.5.2 (Bahadur). If T(X) is a boundedly complete sufficient statistic, then T(X) is minimal.

Proof. Suppose that S(X) is a sufficient statistic. If S(X) = (T(X)) then, for any Borel set B ,

Define

By the law of iterated expectation, Eθ {h(T)} = 0, for all θ Θ. But since T(X) is boundedly complete,

Hence, T S, which means that T is a function of S. Hence T(X) is an m.s.s. QED

3.6 SUFFICIENCY AND ANCILLARITY

A statistic A(X) is called ancillary if its distribution does not depend on the particular parameter(s) specifying the distribution of X. For example, suppose that X∼ N(θ 1n, In), −∞ < θ < ∞. The statistic U = (X2 − X1, …, Xn−X1) is distributed like N(On−1, In−1 + Jn−1). Since the distribution of U does not depend on θ, U is ancillary for the family = {N(θ 1, I), −∞ < θ < ∞ }. If S(X) is a sufficient statistic for a family , the inference on θ can be based on the likelihood based on S. If fS(s; θ) is the p.d.f. of S, and if A(X) is ancillary for , with p.d.f. h(a), one could write

(3.6.1)

where (s | a) is the conditional p.d.f. of S given {A = a}. One could claim that, for inferential objectives, one should consider the family of conditional p.d.f.s S|A ={ (s | a), θ Θ }. However, the following theorem shows that if S is a complete sufficient statistic, conditioning on A(X) does not yield anything different, since pS(s; θ) = (s | a), with probability one for each θ Θ.

Theorem 3.6.1 (Basu’s Theorem). Let X = (X1, …, Xn)′ be a vector of i.i.d. random variables with a common distribution belonging to = {Fθ (x), θ Θ }. Let T(X) be a boundedly complete sufficient statistic for . Furthermore, suppose that A(X) is an ancillary statistic. Then T(X) and A(X) are independent.

Proof. Let C A, where A is the Borel σ–subfield induced by A(X). Since the distribution of A(X) is independent of θ, we can determine P{A(X) C} without any information on θ. Moreover, the conditional probability P{A(X) C | T(X)} is independent of θ since T(X) is a sufficient statistic. Hence, P{A(X) C | T(X)} − P{A(X) C} is a statistic depending on T(X). According to the law of the iterated expectation,

(3.6.2)

Finally, since T(x) is boundedly complete,

(3.6.3)

with probability one for each θ. Thus, A(X) and T(X) are independent. QED

From Basu’s Theorem, we can deduce that only if the sufficient statistic S(X) is incomplete for , then an inference on θ, conditional on an ancillary statistic, can be meaningful. An example of such inference is given in Example 3.10.

3.7 INFORMATION FUNCTIONS AND SUFFICIENCY

In this section, we discuss two types of information functions used in statistical analysis: the Fisher information function and the Kullback–Leibler information function. These two information functions are somewhat related but designed to fulfill different roles. The Fisher information function is applied in various estimation problems, while the Kullback–Leibler information function has direct applications in the theory of testing hypotheses. Other types of information functions, based on the log likelihood function, are discussed by Basu (1975), Barndorff–Nielsen (1978).

3.7.1 The Fisher Information

We start with the Fisher information and consider parametric families of distribution functions with p.d.f.s f(x; θ), θ Θ, which depend only on one real parameter θ. A generalization to vector valued parameters is provided later.

Definition 3.7.1. The Fisher information function for a family = {F(x;θ); θ Θ }, where dF(x; θ) = f(x; θ)dμ (x), is

(3.7.1) numbered Display Equation

Notice that according to this definition, log f(x; θ) should exist with probability one, under Fθ, and its second moment should exist. The random variable log f(x; θ) is called the score function. In Example 3.11 we show a few cases.

We develop now some properties of the Fisher information when the density functions in satisfy the following set of regularity conditions.

(i) Θ is an open interval on the real line (could be the whole line);

(ii)

f(x; θ) exists (finite) for every x and every θ

Θ.

(iii) For each θ in Θ there exists a δ < 0 and a positive integrable function G(x; θ) such that, for all

in (θ − δ, θ + δ),

(3.7.2)

(iv) 0 < Eθ inline

< ∞ for each θ

Θ.

One can show that under condition (iii) (using the Lebesgue Dominated Convergence Theorem)

(3.7.3) numbered Display Equation

for all θ Θ. Thus, under these regularity conditions,

This may not be true if conditions (3.7.2) do not hold. Example 3.11 illustrates such a case where X ∼ R(0, θ). Indeed, if X ∼ R(0, θ) then

Unnumbered Display Equation

Moreover, in that example Vθ for all θ. Returning back to cases where regularity conditions (3.7.2) are satisfied, we find that if X1, …, Xn are i.i.d. and In(θ) is the Fisher information function based on their joint distribution,

(3.7.4) numbered Display Equation

Since X1, …, Xn are i.i.d. random variables, then

(3.7.5)

and due to (3.7.3),

(3.7.6)

Thus, under the regularity conditions (3.7.2), I(θ) is an additive function.

We consider now the information available in a statistic S = (S1(X), …, Sr(X)), where 1 ≤ r ≤ n. Let gS(y1, …, yr; θ) be the joint p.d.f. of S. The Fisher information function corresponding to S is analogously

(3.7.7) numbered Display Equation

We obviously assume that the family of induced distributions of S satisfies the regularity conditions (i)–(iv). We show now that

(3.7.8)

We first show that

(3.7.9)

We prove (3.7.9) first for the discrete case. The general proof follows. Let A(y) = {x; S1(x) = y1, …, Sr(x) = yr}. The joint p.d.f. of S at y is given by

(3.7.10)

where f(x;θ) is the joint p.d.f. of X. Accordingly,

(3.7.11) numbered Display Equation

Furthermore, for each x such that f(x; θ ) > 0 and according to regularity condition (iii),

(3.7.12) numbered Display Equation

To prove (3.7.9) generally, let S: (, ,) → (, Γ, ) be a statistic and μ and be regular. Then, for any C Γ,

(3.7.13) numbered Display Equation

Since C is arbitrary, (3.7.9) is proven. Finally, to prove (3.7.8), write

(3.7.14) numbered Display Equation

We prove now that if T(X)is a sufficient statistic for , then

Indeed, from the Factorization Theorem, if T(X) is sufficient for then f(x;θ) = K(x) g(T(x); θ), for all θ Θ. Accordingly, In(θ) = Eθ inline . On the other hand, the p.d.f. of T(X) is gT(t; θ) = A(t)g(t; θ), all θ Θ. Hence, log gT(t; θ) = log g(t; θ) for all θ and all t. This implies that

(3.7.15) numbered Display Equation

for all θ Θ. Thus, we have proven that if a family of distributions, , admits a sufficient statistic, we can determine the amount of information in the sample from the distribution of the m.s.s.

Under regularity conditions (3.7.2), for any statistic U(X),

Unnumbered Display Equation

By (3.7.15), if U(X) is an ancillary statistic, log gU(u; θ) is independent of θ. In this case log gU(u; θ) = , with probability 1, and

3.7.2 The Kullback–Leibler Information

The Kullback–Leibler (K–L) information function, to discriminate between two distributions Fθ(x) and F(x) of = {Fθ(x); θ Θ } is defined as

(3.7.16)

The family is assumed to be regular. We show now that I(θ, ) ≥ 0 with equality if and only if f(X; θ) = f(X; ) with probability one. To verify this, we remind that log x is a concave function of x and by the Jensen inequality (see problem 8, Section 2.5), log (E{Y}) ≥ E{log Y} for every nonnegative random variable Y, having a finite expectation. Accordingly,

(3.7.17) numbered Display Equation

Thus, multiplying both sides of (3.7.17) by −1, we obtain that I(θ, ) ≥ 0. Obviously, if Pθ {f(X; θ ) = f(X; )} = 1, then I(θ, ) = 0. If X1, …, Xn are i.i.d. random variables, then the information function in the whole sample is

(3.7.18) numbered Display Equation

This shows that the K–L information function is additive if the random variables are independent.

If S(X) = (S1(X), …, Sr(X)), 1 ≤ r ≤ n, is a statistic having a p.d.f. gS(y1, …, yr; θ), then the K–L information function based on the information in S(X) is

(3.7.19)

We show now that

(3.7.20)

for all θ, Θ and every statistic S(X) with equality if S(X) is a sufficient statistic. Since the logarithmic function is concave, we obtain from the Jensen inequality

(3.7.21)

Generally, if S is a statistic,

then for any C Γ

Unnumbered Display Equation

This proves that

(3.7.22)

Substituting this expression for the conditional expectation in (3.7.21) and multiplying both sides of the inequality by −1, we obtain (3.7.20). To show that if S(X) is sufficient then equality holds in (3.7.20), we apply the Factorization Theorem. Accordingly, if S(X) is sufficient for ,

(3.7.23)

at all points x at which K(x) > 0. We recall that this set is independent of θ and has probability 1. Furthermore, the p.d.f. of S(X) is

(3.7.24)

Therefore,

(3.7.25) numbered Display Equation

3.8 THE FISHER INFORMATION MATRIX

We generalize here the notion of the information for cases where f(x; θ) depends on a vector of k–parameters. The score function, in the multiparameter case, is defined as the random vector

(3.8.1)

Under the regularity conditions (3.7.2), which are imposed on each component of θ,

(3.8.2)

The covariance matrix of S(θ ;X) is the Fisher Information Matrix (FIM)

(3.8.3)

If the components of (3.8.1) are not linearly dependent, then I(θ) is positive definite.

In the k–parameter canonical exponential type family

(3.8.4)

The score vector is then

(3.8.5)

and the FIM is

(3.8.6) numbered Display Equation

Thus, in the canonical exponential type family, I() is the Hessian matrix of the cumulant generating function K().

It is interesting to study the effect of reparametrization on the FIM. Suppose that the original parameter vector is θ. We reparametrize by defining the k functions

Let

and

Then,

It follows that the FIM, in terms of the parameters w, is

(3.8.7)

Notice that I( (w)) is obtained from I(θ) by substituting (w) for θ.

Partition θ into subvectors θ(1), …, θ(l) (2 ≤ l ≤ k). We say that θ(1), …, θ(l) are orthogonal subvectors if the FIM is block diagonal, with l blocks, each containing only the parameters in the corresponding subvector.

In Example 3.14, μ and σ2 are orthogonal parameters, while 1 and 2 are not orthogonal.

3.9 SENSITIVITY TO CHANGES IN PARAMETERS

3.9.1 The Hellinger Distance

There are a variety of distance functions for probability functions. Following Pitman (1979), we apply here the Hellinger distance.

Let = {F(x; θ), θ Θ } be a family of distribution functions, dominated by a σ–finite measure μ, i.e., dF(x; θ ) = f(x; θ )dμ (x), for all θ Θ. Let θ1, θ2 be two points in Θ. The Hellinger distance between f(x; θ1) and f(x;θ2) is

(3.9.1)

Obviously, ρ (θ1, θ2) = 0 if θ1 = θ2.

Notice that

(3.9.2)

Thus, ρ(θ1, θ2) ≤ , for all θ1, θ2 Θ.

The sensitivity of ρ(θ1, θ0) at θ0 is the derivative (if it exists) of ρ(θ, θ0), at θ = θ0.

Notice that

(3.9.3)

If one can introduce the limit, as θ → θ0, under the integral at the r.h.s. of (3.9.3), then

(3.9.4) numbered Display Equation

Thus, if the regularity conditions (3.7.2) are satisfied, then

(3.9.5)

Equation (3.9.5) expresses the sensitivity of ρ(θ, θ0), at θ0, as a function of the Fisher information I(θ0).

Families of densities that do not satisfy the regularity conditions (3.7.2) usually will not satisfy (3.9.5). For example, consider the family of rectangular distributions = {R(0, θ), 0 < θ < ∞ }.

For θ > θ0 > 0,

Unnumbered Display Equation

Thus,

On the other hand, according to (3.7.1) with n = 1, .

The results of this section are generalizable to families depending on k parameters (θ1, …, θk). Under similar smoothness conditions, if λ = (λ1, …, λk)′ is such that λ′λ = 1, then

(3.9.6)

where I(θ0) is the FIM.

PART II: EXAMPLES

Example 3.1. Let X1, …, Xn be i.i.d. random variables having an absolutely continuous distribution with a p.d.f. f(x). Here we consider the family of all absolutely continuous distributions. Let T(X) = (X(1), …, X(n)), where X(1) ≤ ··· ≤ X(n), be the order statistic. It is immediately shown that

Thus, the order statistic is a sufficient statistic. This result is obvious because the order at which the observations are obtained is irrelevant to the model. The order statistic is always a sufficient statistic, when the random variables are i.i.d. On the other hand, as will be shown in the sequel, any statistic that further reduces the data is insufficient for and causes some loss of information.

Example 3.2. Let X1, …, Xn be i.i.d. random variables having a Poisson distribution, P(λ). The family under consideration is = {P(λ); 0 < λ < ∞ }. Let T(X) = . We know that T(X) ∼ P(n λ). Furthermore, the joint p.d.f. of X and T(X) is

Unnumbered Display Equation

Hence, the conditional p.d.f. of X given T(X) = t is

Unnumbered Display Equation

where x1, …, xn are nonnegative integers and t = 0, 1, …. We see that the conditional p.d.f. of X given T(X) = t is independent of λ. Hence T(X) is a sufficient statistic. Notice that X1, …, Xn have a conditional multinomial distribution given Σ Xi = t.

Example 3.3. Let X = (X1, …, Xn)′ have a multinormal distribution N(μ 1n, In), where 1n = (1, 1, …, 1)′. Let . We set X* = (X2, …, Xn) and derive the joint distribution of (X*, T). According to Section 2.9, (X*, T) has the multinormal distribution

Unnumbered Display Equation

where

Unnumbered Display Equation

Hence, the conditional distribution of X* given T is the multinormal

where is the sample mean and Vn−1 = In−1 − Jn−1. It is easy to verify that Vn−1 is nonsingular. This conditional distribution is independent of μ. Finally, the conditional p.d.f. of X1 given (X*, T) is that of a one–point distribution

We notice that it is independent of μ. Hence the p.d.f. of X given T is independent of μ and T is a sufficient statistic.

Example 3.4. Let (X1, Y1), …, (Xn, Yn) be i.i.d. random vectors having a bivariate normal distribution. The joint p.d.f. of the n vectors is

Unnumbered Display Equation

where −-∞ < ξ, η < ∞; 0 < σ1, σ2 < ∞; −1 ≤ ρ ≤ 1. This joint p.d.f. can be written in the form

Unnumbered Display Equation

where , , , , P(x, y) = (xi − )(yi − ).

According to the Factorization Theorem, a sufficient statistic for is

It is interesting that even if σ1 and σ2 are known, the sufficient statistic is still T(X, Y). On the other hand, if ρ = 0 then the sufficient statistic is T*(X, Y) = (, , Q(X), Q(Y)).

Example 3.5.

A. Binomial Distributions

= {B(n, θ), 0 < θ < 1}, n is known. X1, …, Xn is a sample of i.i.d. random variables. For every point x0, at which f(x0, θ) > 0, we have

Unnumbered Display Equation

Accordingly, this likelihood ratio can be independent of θ if and only if inline . Thus, the m.s.s. is T(X) = .

B. Hypergeometric Distributions

Xi ∼ H(N, M, S), i = 1, …, n. The joint p.d.f. of the sample is

Unnumbered Display Equation

The unknown parameter here is M, M = 0, …, N. N and S are fixed known values. The minimal sufficient statistic is the order statistic Tn = (X(1), …, X(n)). To realize it, we consider the likelihood ratio

Unnumbered Display Equation

This ratio is independent of (M) if and only if x(i) = , for all i = 1, 2, …, n.

C. Negative–Binomial Distributions

Xi ∼ NB(, ν), i = 1, …, n; 0 < < 1, 0 < ν < ∞.

(i) If ν is known, the joint p.d.f. of the sample is

Unnumbered Display Equation

Therefore, the m.s.s. is .

(ii) If ν is unknown, the p.d.f.s ratio is

Hence, the minimal sufficient statistic is the order statistic.

D. Multinomial Distributions

We have a sample of n i.i.d. random vectors, i = 1, …, n. Each X(i) is distributed like the multinomial M(s, θ). The joint p.d.f. of the sample is

Unnumbered Display Equation

Accordingly, an m.s.s. is , where , j = 1, …, k − 1. Notice that inline .

E. Beta Distributions

The joint p.d.f. of the sample is

0 ≤ xi ≤ 1 for all i = 1, …, n. Hence, an m.s.s. is inline . In cases where either p or q are known, the m.s.s. reduces to the component of Tn that corresponds to the unknown parameter.

F. Gamma Distributions

The joint distribution of the sample is

Unnumbered Display Equation

Thus, if both λ and ν are unknown, then an m.s.s. is inline . If only ν is unknown, the m.s.s. is . If only λ is unknown, the corresponding statistic is minimal sufficient.

G. Weibull Distributions

X has a Weibull distribution if (X −ξ)α ∼ E(λ). This is a three–parameter family, θ = (ξ, λ, α); where ξ is a location parameter (the density is zero for all x < ξ); λ−1 is a scale parameter; and α is a shape parameter. We distinguish among three cases.

(i) ξ and α are known.

Let Yi = Xi − ξ, i = 1, …, n. Since ∼ E(λ), we immediately obtain from that an m.s.s., which is,

(ii) If α and λ are known but ξ is unknown, then a minimal sufficient statistic is the order statistic.

(iii) α is unknown.

The joint p.d.f. of the sample is

Unnumbered Display Equation

for all i = 1, …, n. By examining this joint p.d.f., we realize that a minimal sufficient statistic is the order statistic, i.e., Tn = (X(1), …, X(n)).

H. Extreme Value Distributions

The joint p.d.f. of the sample is

Unnumbered Display Equation

Hence, if α is known then is a minimal sufficient statistic; otherwise, a minimal sufficient statistic is the order statistic.

Normal Distributions

(i) Single (Univariate) Distribution Model

The m.s.s. is inline . If ξ is known, then an m.s.s. is ; if σ is known, then the first component of Tn is sufficient.

(ii) Two Distributions Model

We consider a two–sample model according to which X1, …, Xn are i.i.d. having a N(ξ, ) distribution and Y1, …, Ym are i.i.d. having a N(η, ) distribution. The X–sample is independent of the Y–sample. In the general case, an m.s.s. is

Unnumbered Display Equation

If = then the m.s.s. reduces to inline . On the other hand, if ξ = η but σ1 ≠ σ2 then the minimal statistic is T.

Example 3.6. Let (X, Y) have a bivariate distribution , with −∞ < ξ1, ξ2 < ∞; 0 < σ1, σ2 < ∞. The p.d.f. of (X, Y) is

Unnumbered Display Equation

This bivariate p.d.f. can be written in the canonical form

where

and

Thus, if 1, 2, 3, and 4 are independent, then an m.s.s. is T(X) = inline . This is obviously the case when ξ1, ξ2, σ1, σ2 can assume arbitrary values. Notice that if ξ1 = ξ2 but σ1 ≠ σ2 then 1, …, 4 are still independent and T(X) is an m.s.s. On the other hand, if ξ1 ≠ ξ2 but σ1 = σ2 then an m.s.s. is

Unnumbered Display Equation

The case of ξ1 =ξ2, σ1 ≠ σ2, is a case of four–dimensional m.s.s., when the parameter space is three–dimensional. This is a case of a curved exponential family.

Example 3.7. Binomial Distributions

= {B(n, θ); 0 < θ < 1}, n fixed. Suppose that Eθ {h(X)} = 0 for all 0 < θ < 1. This implies that

Unnumbered Display Equation

0 < < ∞, where = θ/(1 − θ) is the odds ratio. Let an, j = , j = 0, …, n. The expected value of h(X) is a polynomial of order n in . According to the fundamental theorem of algebra, such a polynomial can have at most n roots. However, the hypothesis is that the expected value is zero for all in (0, ∞ ). Hence an, j = 0 for all j = 0, …, n, independently of . Or,

Example 3.8. Rectangular Distributions

Suppose that = {R(0, θ); 0 < θ < ∞ }. Let X1, …, Xn be i.i.d. random variables having a common distribution from . Let X(n) be the sample maximum. We show that the family of distributions of X(n), , is complete. The p.d.f. of X(n) is

Suppose that Eθ {h(X(n))} = 0 for all 0 < θ < ∞. That is

Consider this integral as a Lebesque integral. Differentiating with respect to θ yields

θ (0, ∞ ).

Example 3.9. In Example 2.15, we considered the Model II of analysis of variance. The complete sufficient statistic for that model is

where is the grand mean; inline is the “within” sample variance; and is the “between” sample variance. Employing Basu’s Theorem we can immediately conclude that is independent of (, ). Indeed, if we consider the subfamily σ, ρ for a fixed σ and ρ, then is a complete sufficient statistic. The distributions of and , however, do not depend on μ. Hence, they are independent of . Since this holds for any σ and ρ, we obtain the result.

Example 3.10. This example follows Example 2.23 of Barndorff–Nielsen and Cox (1994, p. 42). Consider the random vector N = (N1, N2, N3, N4) having a multinomial distribution M(n, p) where p = (p1, …, p4)′ and, for 0 < θ < 1,

Unnumbered Display Equation

The distribution of N is a curved exponential type. N is an m.s.s., but N is incomplete. Indeed, for all 0 < θ < 1, but < 1 for all θ. Consider the statistic A1 = N1 + N2. A1 ∼ B . Thus, A1 is ancillary. The conditional p.d.f. of N, given A1 = a is

Unnumbered Display Equation

for n1 = 0, 1, …, a; n3 = 0, 1, …, n − a; n2 = a − n1 and n4 = n − a − n3. Thus, N1 is conditionally independent of N3 given A1 = a.

Example 3.11. A. Let X ∼ B(n, θ), n known, 0 < θ < 1; f(x; θ) = (1 − θ)n − x satisfies the regularity conditions (3.7.2). Furthermore,

Hence, the Fisher information function is

Unnumbered Display Equation

B. Let X1, …, Xn be i.i.d. random variables having a rectangular distribution R(0, θ), 0 < θ < ∞. The joint p.d.f. is

where . Accordingly,

and the Fisher information in the whole sample is

C. Let X ∼ μ + G(1, 2), −∞ < μ < ∞. In this case,

Thus,

But

Hence, I(μ ) does not exist.

Example 3.12. In Example 3.10, we considered a four–nomial distribution with parameters pi(θ ), i = 1, …, 4, which depend on a real parameter θ, 0 < θ < 1. We considered two alternative ancillary statistics A = N1 + N2 and A′ = N1 + N4. The question was, which ancillary statistic should be used for conditional inference. Barndorff–Nielsen and Cox (1994, p. 43) recommend to use the ancillary statistic which maximizes the variance of the conditional Fisher information.

A version of the log–likelihood function, conditional on {A = a} is

This yields the conditional score function

Unnumbered Display Equation

The corresponding conditional Fisher information is

Finally, since A∼ , the Fisher information is

Unnumbered Display Equation

In addition,

In a similar fashion, we can show that

and

Thus, V{I(θ | A)} > V{I(θ | A′)} for all 0 < θ < 1. Ancillary A is preferred.

Example 3.13. We provide here a few examples of the Kullback–Leibler information function.

A. Normal Distributions

Let be the class of all the normal distributions {N(μ, σ2); −∞ < μ < ∞, 0 < σ < ∞ }. Let θ1 = (μ1, σ1) and θ2 = (μ2, σ2). We compute I(θ1, θ2). The likelihood ratio is

Unnumbered Display Equation

Thus,

Unnumbered Display Equation

Obviously, . On the other hand

Unnumbered Display Equation

where U ∼ N (0, 1). Hence, we obtain that

Unnumbered Display Equation

We see that the distance between the means contributes to the K–L information function quadratically while the contribution of the variances is through the ratio ρ = σ2/σ1.

B. Gamma Distributions

Let θi = (λi, νi), i = 1, 2, and consider the ratio

We consider here two cases.

Case I: ν1 = ν2 = ν. Since the νs are the same, we simplify by setting θi = λi (i = 1, 2). Accordingly,

Unnumbered Display Equation

This information function depends on the scale parameters λi (i = 1, 2), through their ratio ρ = λ2/λ1.

Case II: λ1 = λ2 = λ. In this case, we write

Furthermore,

Unnumbered Display Equation

The derivative of the log–gamma function is tabulated (Abramowitz and Stegun, 1968).

Example 3.14. Consider the normal distribution N(μ, σ2); −∞ < μ < ∞, 0 < σ2 < ∞. The score vector, with respect to θ = (μ, σ2), is

Unnumbered Display Equation

Thus, the FIM is

Unnumbered Display Equation

We have seen in Example 2.16 that this distribution is a two–parameter exponential type, with canonical parameters 1 = and 2 = . Making the reparametrization in terms of 1 and 2, we compute the FIM as a function of 1, 2.

The inverse transformation is

Unnumbered Display Equation

Thus,

Unnumbered Display Equation

Substituting (3.8.9) into (3.8.8) and applying (3.8.7) we obtain

Unnumbered Display Equation

Notice that .

Example 3.15. Let (X, Y) have the bivariate normal distribution , 0 < σ2 < ∞, −1 < ρ < 1. This is a two–parameter exponential type family with

where U1(x, y) = x2 + y2, U2(x, y) = xy and

The Hessian of K(1, 2) is the FIM, with respect to the canonical parameters. We obtain

Using the reparametrization formula, we get

Unnumbered Display Equation

Notice that in this example neither 1, 2 nor σ2, ρ are orthogonal parameters.

Example 3.16. Let = {E(λ), 0 < λ < ∞ }. ρ2(λ1, λ2) = . Notice that for all 0 < λ1, λ2 < ∞. If λ1 = λ2 then ρ (λ1, λ2) = 0. On the other hand, 0 < ρ2 (λ1, λ2) < 2 for all 0 < λ1, λ2 < ∞. However, for λ1 fixed

PART III: PROBLEMS

Section 3.2

3.2.1 Let X1, …, Xn be i.i.d. random variables having a common rectangular distribution R(θ1, θ2), −∞ < θ1 < θ2 < ∞.

(i) Apply the Factorization Theorem to prove that X(1) = min{Xi} and X(n) = max{Xi} are sufficient statistics.

(ii) Derive the conditional p.d.f. of X = (X1, …, Xn) given (X(1), X(n)).

3.2.2 Let X1, X2, …, Xn be i.i.d. random variables having a two–parameter exponential distribution, i.e., X ∼ μ + E(λ), −∞ < μ < ∞, 0 < λ < ∞. Let X(1) ≤ ··· ≤ Xn be the order statistic.

(i) Apply the Factorization Theorem to prove that X(1) and S =

− X(1)) are sufficient statistics.

(ii) Derive the conditional p.d.f. of X given (X(1), S).

(iii) How would you generate an equivalent sample X′ (by simulation) when the value of (X(1), S) are given?

3.2.3 Consider the linear regression model (Problem 3, Section 2.9). The unknown parameters are (α, β, σ). What is a sufficient statistic for ?

3.2.4 Let X1, …, Xn be i.i.d. random variables having a Laplace distribution with p.d.f.

0 < σ < ∞. What is a sufficient statistic for

(i) when μ is known?

(ii) when μ is unknown?

Section 3.3

3.3.1 Let X1, …, Xn be i.i.d. random variables having a common Cauchy distribution with p.d.f.

Unnumbered Display Equation

−∞ < μ < ∞, 0 < σ < ∞. What is an m.s.s. for ?

3.3.2 Let X1, …, Xn be i.i.d. random variables with a distribution belonging to a family of contaminated normal distributions, having p.d.f.s,

Unnumbered Display Equation

−∞ < μ < ∞; 0 < σ < ∞; 0 < α < 10−2. What is an m.s.s. for ?

3.3.3 Let X1, …, Xn be i.i.d. having a common distribution belonging to the family of all location and scale parameter beta distributions, having the p.d.f.s

−μ ≤ x ≤ μ + σ; −∞ < μ < ∞; 0 < σ < ∞; 0 < p, q < ∞.

(i) What is an m.s.s. when all the four parameters are unknown?

(ii) What is an m.s.s. when p, q are known?

(iii) What is an m.s.s. when μ, σ are known?

3.3.4 Let X1, …, Xn be i.i.d. random variables having a rectangular R(θ1, θ2), −∞ < θ1 < θ2 < ∞. What is an m.s.s.?

3.3.5 is a family of joint distributions of (X, Y) with p.d.f.s

Given a sample of n i.i.d. random vectors (Xi, Yi), i = 1, …, n, what is an m.s.s. for ?

3.3.6 The following is a model in population genetics, called the Hardy–Weinberg model. The frequencies N1, N2, N3, = n, of three genotypes among n individuals have a distribution belonging to the family of trinomial distributions with parameters (n, p1(θ), p2(θ ), p3(θ)), where

(3.3.1)

0 < θ < 1. What is an m.s.s. for ?

Section 3.4

3.4.1 Let X1, …, Xn be i.i.d. random variables having a common distribution with p.d.f.

−∞ < 1 < 2 < ∞. Prove that T(X) = inline is an m.s.s.

3.4.2 Let {(Xk, Yi), i = 1, …, n} be i.i.d. random vectors having a common bivariate normal distribution

Unnumbered Display Equation

where −∞ < ξ, η < ∞; 0 < , < ∞; −1 < ρ < 1.

(i) Write the p.d.f. in canonical form.

(ii) What is the m.s.s. for

3.4.3 In continuation of the previous problem, what is the m.s.s.

(i) when ξ = η = 0?

(ii) when σx = σy = 1?

(iii) when ξ = η = 0, σx = σy = 1?

Section 3.5

3.5.1 Let = {Gα (λ, 1); 0 < α < ∞, 0 < λ < ∞ } be the family of Weibull distributions. Is complete?

3.5.2 Let be the family of extreme–values distributions. Is complete?

3.5.3 Let = {R(θ1, θ2); −∞ < θ1 < θ2 < ∞ }. Let X1, X2, …, Xn, n ≥ 2, be a random sample from a distribution of . Is the m.s.s. complete?

3.5.4 Is the family of trinomial distributions complete?

3.5.5 Show that for the Hardy–Weinberg model the m.s.s. is complete.

Section 3.6

3.6.1 Let {(Xi, Yi), i = 1, …, n} be i.i.d. random vectors distributed like , −1 < ρ < 1.

(i) Show that the random vectors X and Y are ancillary statistics.

(ii) What is an m.s.s. based on the conditional distribution of Y given {X = x}?

3.6.2 Let X1, …, Xn be i.i.d. random variables having a normal distribution N(μ, σ2), where both μ and σ are unknown.

(i) Show that U(X) =

is ancillary, where Me is the sample median;

is the sample mean; Q1 and Q3 are the sample first and third quartiles.

(ii) Prove that U(X) is independent of |

|/S, where S is the sample standard deviation.

Section 3.7

3.7.1 Consider the one–parameter exponential family with p.d.f.s

Show that the Fisher information function for θ is

Check this result specifically for the Binomial, Poisson, and Negative–Binomial distributions.

3.7.2 Let (Xi, Yi), i = 1, …, n be i.i.d. vectors having the bivariate standard normal distribution with unknown coefficient of correlation ρ, −1 ≤ ρ ≤ 1. Derive the Fisher information function In(ρ).

3.7.3 Let (x) denote the p.d.f. of N(0, 1). Define the family of mixtures

Derive the Fisher information function I(α).

3.7.4 Let = {f(x; ), −∞ < < ∞ } be a one–parameter exponential family, where the canonical p.d.f. is

(i) Show that the Fisher information function is

(ii) Derive this Fisher information for the Binomial and Poisson distributions.

3.7.5 Let X1, …, Xn be i.i.d. N(0, σ2), 0 < σ2 < ∞.

(i) What is the m.s.s. T?

(ii) Derive the Fisher information I(σ2) from the distribution of T.

(iii) Derive the Fisher information IS2(σ2), where S2 =

(Xi −

)2 is the sample variance. Show that IS2(σ2) < I(σ2).

3.7.6 Let (X, Y) have the bivariate standard normal distribution , −1 < ρ < 1. X is an ancillary statistic. Derive the conditional Fisher information I(ρ | X) and then the Fisher information I(ρ ).

3.7.7 Consider the model of Problem 6. What is the Kullback–Leibler information function I(ρ1, ρ2) for discriminating between ρ1 and ρ2 where −1 ≤ ρ1 < ρ2 ≤ 1.

3.7.8 Let X ∼ P(λ). Derive the Kullback–Leibler information I(λ1, λ2) for 0 < λ1, λ2 < ∞.

3.7.9 Let X ∼ B(n, θ). Derive the Kullback–Liebler information function I(θ1, θ2), 0 < θ1, θ2 < 1.

3.7.10 Let X ∼ G(λ, ν), 0 < λ < ∞, ν known.

(i) Express the p.d.f. of X as a one–parameter canonical exponential type density, g(x;

(ii) Find

for which g(x;

) is maximal.

(iii) Find the Kullback–Leibler information function I(

) and show that

Section 3.8

3.8.1 Consider the trinomial distribution M(n, p1, p2), 0 < p1, p2, p1 + p2 < 1.

(i) Show that the FIM is

Unnumbered Display Equation

(ii) For the Hardy–Weinberg model, p1(θ) = θ2, p2(θ) = 2θ (1 − θ), derive the Fisher information function

3.8.2 Consider the bivariate normal distribution. Derive the FIM I(ξ, η, σ1, σ2, ρ).

3.8.3 Consider the gamma distribution G(λ, ν). Derive the FIM I(λ, ν).

3.8.4 Consider the Weibull distribution W(λ, α) ∼ (G(λ, 1))1/2; 0 < α, λ < ∞. Derive the Fisher informaton matrix I(λ, α).

Section 3.9

3.9.1 Find the Hellinger distance between two Poisson distributions with parameters λ1 and λ2.

3.9.2 Find the Hellinger distance between two Binomial distributions with parameters p1≠ p2 and the same parameter n.

3.9.3 Show that for the Poisson and the Binomial distributions Equation (3.9.4) holds.

PART IV: SOLUTIONS TO SELECTED PROBLEMS

3.2.1 X1, …, Xn are i.i.d. ∼ R(θ1, θ2), 0 < θ1 < θ2 < ∞.

(i)

Unnumbered Display Equation

Thus, f(X1, …, Xn; θ) = A(x)g(T(x), θ), where A(x) = 1 x and

T(X) = (X(1), X(n)) is a likelihood statistic and thus minimal sufficient.

(ii) The p.d.f. of (X(1), X(n)) is

Let (X(1), …, X(n)) be the order statistic. The p.d.f. of (X(1), …, X(n)) is

The conditional p.d.f. of (X(1), …, X(n)) given (X(1), X(n)) is

That is, (X(2), …, X(n−1)) given (X(1), X(n)) are distributed like the (n − 2) order statistic of (n − 2) i.i.d. from R(X(1), X(n)).

3.3.6 The likelihood function of θ, 0 < θ < 1, is

Since N3 = n − N1 − N2, 2N3 + N2 = 2n − 2N1 − N2. Hence,

The sample size n is known. Thus, the m.s.s. is Tn = 2N1 + N2.

3.4.1

Unnumbered Display Equation

The likelihood function of is

Unnumbered Display Equation

Thus inline is a likelihood statistic, i.e., minimal sufficient.

3.4.2 The joint p.d.f. of (Xi, Yi), i = 1, …, n is

Unnumbered Display Equation

In canonical form, the joint density is

Unnumbered Display Equation

where

Unnumbered Display Equation

and

The m.s.s. for is T(X, Y) = (Σ Xi, Σ Yi, Σ , Σ , Σ XiYi).

3.5.5 We have seen that the likelihood of θ is L(θ) θT(N)(1 − θ)2n − T(N) where T(N) = 2N1 + N2. This is the m.s.s. Thus, the distribution of T(N) is B(2n, θ). Finally = B(2n, θ), 0 < θ < 1} is complete.

3.6.2

(i)

Unnumbered Display Equation

independent of μ and σ.

(ii) By Basu’s Theorem, U(X) is independent of (

, S), which is a complete sufficient statistic. Hence, U(X) is independent of |

|/S.

3.7.1 The score function is

Hence, the Fisher information is

Consider the equation

Differentiating both sides of this equation with respect to θ, we obtain that

Differentiating the above equations twice with respect to θ, yields

Thus, we get

Therefore

In the Binomial case,

Thus, (θ) = log , K(θ) = −n log (1 − θ)

Unnumbered Display Equation

Hence, I(θ) = .

In the Poisson case,

Unnumbered Display Equation

8 In the Negative–Binomial case, ν known,

3.7.2 Let .

Let l(ρ) denote the log–likelihood function of ρ, −1 < ρ < 1. This is

Furthermore,

Recall that I(ρ) = E{−l″(ρ)}. Moreover,

Thus,

3.7.7

Unnumbered Display Equation

Thus, the Kullback–Leibler information is

Unnumbered Display Equation

The formula is good also for ρ1 > ρ2.

3.7.10 The p.d.f. of G(λ, ν) is

When ν is known, we can write the p.d.f. as g(x; ) = h(x) exp ( x + ν log (−)), where h(x) = , = −λ. The value of maximizing g(x, ) is = −. The K–L information I(1, ) is