In Chapter 4 we came across one form of analyzing data, namely exploratory data analysis. That approach made almost no assumptions about the probability mechanism generating the data. In later chapters we encountered models which plausibly explain the nature of random phenomena. In reality, we seldom have complete information about the parameters of the probability distributions. Historically, or intuitively, we may have enough information about the probability distributions, except for a few parameters. In this chapter we consider various methods for inferring about such parameters, using the data generated under the assumptions of these probability models.
Parametric statistical inference arises when we have a model describing an uncertain experiment except for a few values, called parameters, of the model. If the parameter values are known, the problems are more probabilistic than statistical in nature. The parameter values need to be obtained from data. The data may be a pure random sample in the sense of all the observations being drawn with the same probability. In practice, however, obtaining such a random sample may not be possible in many stochastic experiments. For example, the temperatures in the morning and afternoon are certainly not identically distributed observations. We undertake statistical inference for uncertain experiments in this chapter.
In Chapter 6, we came across a pool of diverse experiments which have certain underlying probability models, say . Under the assumption that is the truth, we now develop methods for inference about . We will begin with some important families of distributions in Section 7.2. This section and the next few sections rely heavily on Lehmann and Casella (1998). The form of the loss function plays a vital role in the usefulness of an estimator/statistic. For an observation from the binomial distribution, we discuss some choices of loss functions in Section 7.3. Data reduction through the concepts of sufficiency and completeness is theoretically examined in Section 7.4. Section 7.5 emphasizes the importance of the likelihood principle through visualization and examples. The role of the information function in obtaining the parameter values is also detailed in this section. The discussion thus far focuses on the preliminaries of point estimation.
Using the foundations from Sections 7.2–7.5, we next focus on specific techniques for obtaining estimates of parameters, namely the maximum likelihood estimator and the moment estimator, in Section 7.6. Estimators are further compared for their unbiasedness and variance in Section 7.7. The techniques discussed up to this point of the chapter return a single value for the parameter, which is seldom sufficient for drawing appropriate inference about the parameters. Thus, we seek a range, actually an interval, of plausible values of the parameters in Section 7.8. In the case of missing values, or a data structure which may be simplified through latent variables, the EM algorithm has become very popular, and the reader will find it illustrated in Section 7.16.
Sections 7.9–7.15 offer a transparent approach to the problem of testing statistical hypotheses. It is common for statistical software texts to focus on the testing framework using the straightforward and useful R functions available, such as prop.test, t.test, etc. However, here we take a pedagogical approach and begin with the preliminary concepts of Type I and Type II errors. The celebrated Neyman-Pearson lemma is stated and then demonstrated on various examples with R programs in Section 7.10. The Neyman-Pearson lemma yields a most powerful test, which cannot be directly extended to testing problems with composite hypotheses; thus we need slightly relaxed conditions, leading to uniformly most powerful tests and also uniformly most powerful unbiased tests, as seen in Sections 7.11 and 7.12. A more generic class of useful tests is available in the family of likelihood ratio tests, whose examples are detailed with R programs in Section 7.13. A very interesting problem, which is still unsolved, arises in the comparison of normal means from two populations whose variances are completely unknown. This famous Behrens-Fisher problem is discussed with appropriate solutions in Section 7.14. The last technical section of the chapter, Section 7.15, deals with the problem of testing multiple hypotheses.
7.2 Families of Distribution
We will begin with the definition of a group family, following page 17 of Lehmann and Casella (1998).
The identity element of a group and the inverse of any element are unique. The two important properties of a group are (i) closure under composition, and (ii) closure under inversion. For various reasons, we change the characteristics of random variables, for instance by adding constants to them or multiplying them by constants. We need some assurance that such changes do not render the probability model from which we started useless. The changes to which we subject the random variable are loosely referred to as transformations. Let us discuss these two properties in more detail.
Closure under composition. A 1:1 transformation may be, for instance, addition of a constant or multiplication by one. We say that a class of transformations is closed under composition if implies .
Closure under inversion. For any 1:1 transformation , the inverse of , denoted by , undoes the transformation , that is, . A class of transformations is said to be closed under inversion if for any , the inverse is also in .
In plain words, if a random variable from a group family is subjected to a transformation, the resulting distribution also belongs to the same family of distributions. The transformation may involve addition, multiplication, the sine function, etc.
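A minimal sketch of this idea, using the normal location-scale family as an illustrative example (the specific constants below are arbitrary choices): affine maps g(x) = a*x + b with a != 0 send one normal distribution to another, and the class of such maps is closed under both composition and inversion.

```r
# Sketch of closure for the normal location-scale family: affine maps
# g(x) = a*x + b (a != 0) send one normal distribution to another.
set.seed(123)
x <- rnorm(1e5, mean = 2, sd = 3)     # X ~ N(2, 3^2)
y <- 5 * x + 1                        # Y = 5X + 1 ~ N(11, 15^2)
c(mean(y), sd(y))                     # close to 11 and 15
# Closure under composition: composing two affine maps is again affine.
g1 <- function(x) 2 * x + 3
g2 <- function(x) -x + 4
g2(g1(7))                             # same as the single affine map -2*x + 1
# Closure under inversion: the inverse of g1 is itself affine.
g1inv <- function(x) (x - 3) / 2
g1inv(g1(7))                          # recovers 7
```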
Table 4.1 of Lehmann and Casella (1998) gives an important list of location-scale families.
7.2.1 The Exponential Family
The canonical form is not necessarily unique.
Note that we speak of a family of probability distributions belonging to an exponential family, and as such we are not focusing on a single density, say .
7.2.2 Pitman Family
The exponential family meets the regularity condition, since the range of the random variable is not a function of the parameter . In many interesting cases, the range of the random variable depends on a function of its parameters, and then the regularity condition is not satisfied. A nice discussion of such non-regular random variables appears on page 19 of Srivastava and Srivastava (2009).
The mathematical argument is as follows. Let be a positive function of and let and , be the extended real-valued functions of parameter . We can then define a density by
7.6
where
Thus, we can see that the range of depends on , and hence the density is non-regular. The family of such densities is referred to as the Pitman family. Table 7.1 lists some members of the Pitman family; refer to Table 1.3 of Srivastava and Srivastava (2009).
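A standard member of this family, using the uniform model as an illustration, is the U(0, θ) density:

```latex
f(x;\theta) = \frac{1}{\theta}, \qquad 0 < x < \theta, \quad \theta > 0.
```

Here the support (0, θ) of the random variable depends on θ, so the usual regularity condition fails; the same happens for the other entries of Table 7.1.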
7.3 Loss Functions

A statistic, or an estimator, is employed for the estimation of a parameter, say . The loss incurred as a consequence of using is captured by a loss function. In keeping with standard notation, the loss function of a parameter inferred by using will be denoted by . Some useful loss functions are now given.
In this section we will consider the squared error loss function only. To evaluate the performance of an estimator under a loss function , we require the notion of risk function defined as
7.7
For a fixed , calculates the risk of using the statistic . The risk function is the average loss due to . Under the squared error loss function, the risk function is popularly known as the mean squared error of the estimator. The next example illustrates the risk function for four different statistics . Since the statistic often also leads to decisions, we sometimes denote it by ; in fact, loss functions have a more prominent role in decision theory.
The role of loss functions will become more prominent in Section 7.7. The section closes with an interesting example where a degenerate estimator is to be preferred over a reasonable one.
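The risk comparison can be sketched by simulation. The model (a normal mean) and the four statistics below are illustrative choices, not necessarily those of the text's example; the degenerate estimator ignores the data entirely.

```r
# Simulation sketch of the risk function R(theta, T) = E[(T - theta)^2]
# under squared-error loss, for four illustrative statistics.
set.seed(1)
risk <- function(theta, statistic, n = 10, nsim = 5e4) {
  est <- replicate(nsim, statistic(rnorm(n, mean = theta)))
  mean((est - theta)^2)
}
theta <- 2
c(mean   = risk(theta, mean),                  # about 1/n = 0.1
  median = risk(theta, median),                # larger than the mean's risk
  first  = risk(theta, function(x) x[1]),      # about 1
  degen  = risk(theta, function(x) 2))         # 0 at theta = 2, poor elsewhere
```

The degenerate estimator has zero risk exactly at theta = 2 but is hopeless at other parameter values, which is why risk must be examined as a function of theta.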
7.4 Data Reduction
Statistical inference has two very important pillars: (i) The Sufficiency Principle, and (ii) The Likelihood Principle. Berger and Wolpert (1988) is a treatise for understanding these principles. In this section we will consider the sufficiency principle in depth.
7.4.1 Sufficiency
It is seen in Section 7.3 that many possible statistics exist for a given parameter. Some statistics have an advantage over others, and a meaningful criterion needs to be arrived at to help in these types of decisions.
Thus far we began with statistics and verified whether they satisfied the sufficiency condition. It is not always possible to guess what may turn out to be a sufficient statistic. The Neyman factorization theorem gives a result which helps to obtain sufficient statistics from the joint probability function of the random sample. A measure-theoretic framework for this theorem is due to Halmos and Savage (1949); see page 289 of Mukhopadhyay (2000) for more details. The following theorem may be found in Casella and Berger (2002).
We will now use this result to obtain the sufficient statistics in some important cases.
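As a preview of how the factorization works, consider the Poisson(λ) case: for a random sample x₁, …, xₙ the joint pmf factorizes as

```latex
\prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}
  \;=\;
  \underbrace{e^{-n\lambda}\,\lambda^{\sum_i x_i}}_{g\left(\sum_i x_i;\;\lambda\right)}
  \times
  \underbrace{\left(\prod_{i=1}^{n} x_i!\right)^{-1}}_{h(x_1,\ldots,x_n)},
```

so, by the factorization theorem, T = Σᵢ Xᵢ is sufficient for λ.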
If in the previous example is known, it can be seen that , , , and other permutations are all sufficient for . Thus, we need a more general framework to identify sufficient statistics. Furthermore, there may be more than one sufficient statistic, and in such cases we need to pick one among them.
7.4.2 Minimal Sufficiency
Dynkin (1951) gave a criterion for a statistic to be necessary.
In the discussion before this subsection, it can be seen that the statistics can be mathematically written as functions of and . Thus, if , , and were the only three sufficient statistics (though there are many more in this scenario), then would turn out to be a necessary statistic. This definition further guides us towards the concept of minimal sufficient statistics.
It is not practical to obtain a minimal sufficient statistic from its definition. Lehmann and Scheffé (1950) provide a result which is useful in obtaining a minimal sufficient statistic. We first describe this important result.
A useful result states that a complete sufficient statistic is minimal, and hence if the completeness of a sufficient statistic can be established, it is minimal sufficient too. In the case of exponential families of full rank, the statistic will be complete. We have already seen that is also sufficient, and hence, for exponential families of full rank, it will be minimal sufficient. Since the gamma, Poisson, and other examples are members of exponential families with full rank, the sufficient estimators/statistics seen earlier are complete, and hence minimal sufficient too. The likelihood principle is developed in the next section.
7.5 Likelihood and Information
As mentioned at the beginning of the previous section, we will now consider the second important principle in the theory of statistical inference: the likelihood principle. The likelihood function is first defined.
7.5.1 The Likelihood Principle
A major difference between the likelihood function and the (joint) pdf needs to be emphasized. In a pdf (or pmf), we know the parameters and try to make probability statements about the random variable. In the likelihood function the parameters are unknown, and hence we use the data to infer certain aspects of the parameters. In simple and practical terms, we generally plot the pdf against values of the variable to understand it, whereas we plot the likelihood function against the parameter . Obviously there is more to it than what we have said here, though this suffices for understanding the likelihood principle. Furthermore, it has been observed that many books lay far more emphasis on the maximum likelihood estimator, to be introduced in Section 7.6, while the likelihood function itself is given only a brief formal introduction. Barnard et al. (1962) have been emphatic about the importance of the likelihood function. Pawitan (2001) is another excellent source for understanding the importance of likelihood; moreover, Pawitan also provides R codes and functions for understanding this principle. Naturally, this section is influenced by his book.
The likelihood function contains more information about the data and the parameters than summary measures of the data. Plots of the likelihood, whenever possible, throw more light on the random phenomenon and should be employed wherever they are feasible. We will now formally state the likelihood principle.
More formal use of the likelihood function will be explained in the forthcoming subsection.
7.5.2 The Fisher Information
The likelihood function will now be more formally used towards the inference of the parameters. We will first define the score function.
The natural question is how we make use of the score function as given in Equation 7.15. We will begin with an illustration of the sampling variance of score functions.
An important exercise is to prove that the expectation of the score function equals zero, that is, , which we leave to the reader.
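Alongside the analytical proof, the claim can be checked numerically. The sketch below assumes the N(θ, 1) model, for which the score is U(θ) = Σ(xᵢ − θ); its simulated mean should be near zero, and its variance near the Fisher information of the sample, here n.

```r
# Numerical check, under the N(theta, 1) model, that the score
# U(theta) = sum(x - theta) has expectation zero and variance equal
# to the Fisher information of the sample (here n).
set.seed(456)
theta <- 3; n <- 25
scores <- replicate(2e4, sum(rnorm(n, mean = theta) - theta))
c(mean(scores), var(scores))   # close to 0 and to n = 25
```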
In the above definition, the term partial derivative has been used in the sense that the parameter may be a vector, in which case we consider the derivative of the likelihood function with respect to each component of the vector.
A useful result regarding the expected Fisher information, under the assumptions of the regularity conditions of course, is
We will next find the Fisher information for a few well-known probability distributions.
The next result is an important one when we have to calculate the Fisher information for a random sample.
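The result in question is the additivity of information over independent observations: if I₁(θ) denotes the Fisher information in a single observation, then for a random sample of size n

```latex
I_n(\theta) \;=\; n\, I_1(\theta).
```

For instance, for the Poisson(λ) model, I₁(λ) = 1/λ, so a sample of size n carries information n/λ.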
In the next section we will use the observed Fisher information in a random sample.
7.6 Point Estimation
“Point Estimation” or “Estimation Theory” are common names for the problem of estimating parameters. Lehmann and Casella (1998) is one of the best sources for classical inference. Modern accounts of this domain are Casella and Berger (2002), Shao (2003), and Mukhopadhyay (2000). For details about the implementation of the maximum likelihood technique, refer to Millar (2011).
7.6.1 Maximum Likelihood Estimation
In the previous section we introduced the likelihood function. It will now be used to find estimators.
Let be a random sample with common pdf or pmf , . The random variable and/or may be scalar or vector. Recall the definition of likelihood function:
It is important to note that the MLE definition simply requires that the likelihood function be optimized. There is no specific method outlined on how to obtain the MLE. We will begin with a few graphical methods, which will be a continuation of some of our earlier examples.
We will now attempt to obtain the MLEs when the parameters are continuous. A standard technique in calculus for obtaining the optimum of a differentiable function is to differentiate the function with respect to the continuous variable and set the resulting expression to zero. In our case, the resulting expression is the score function which we saw in Section 7.5. The MLE is then a root (solution) of the score equation:
Note that in each of the four score function plots in Figure 7.4, the parameter value at which the score function equals 0 is 4. This is not surprising, since we simulated the datasets to have a mean of 4. Let us verify again with a plot of the score function for the normal sample and check whether the plot helps to find the MLE, in the spirit of Pawitan (2001).
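A minimal sketch in the same spirit, assuming a N(mu, 1) sample simulated with mean 4: rather than reading the root off a plot, we can locate it numerically with uniroot and confirm that it coincides with the sample mean.

```r
# Locating the MLE as the root of the score function for a normal
# sample simulated with mean 4, using uniroot instead of a plot.
set.seed(789)
x <- rnorm(50, mean = 4, sd = 1)
score <- function(mu) sum(x - mu)    # score function for N(mu, 1)
mle <- uniroot(score, interval = c(0, 10))$root
c(mle = mle, xbar = mean(x))         # the root coincides with the sample mean
```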
Setting the score functions equal to zero, and then solving for the parameters, gives us the maximizers of the likelihood function, that is, the MLEs. For the score functions given in Equations 7.16–7.18, we obtain the MLEs as follows:
7.23
7.24
7.25
To be assured that , , and as given above are indeed the MLEs, we need to look at the derivatives of the score functions and verify that they are negative. The derivatives of the score functions for these distributions are stated next:
The derivatives of the log-likelihood functions for the normal and Poisson distributions, Equations 7.26 and 7.27, can easily be seen to be negative, whereas it is not as straightforward to obtain the answer for the binomial and Cauchy distributions. For the binomial case, the reader may refer to a slightly different treatment at http://www.montana.edu/rotella/502/binom_like.pdf.
Note that the MLE for the Cauchy distribution has not been derived here; it requires a different way of solving the score function. We will use three functions available in R, optimize, mle, and nlm, for obtaining MLEs. Since we have emphasized the likelihood function, it is always good practice to report the value of the likelihood function, or of its variants such as the log-likelihood or the negative log-likelihood.
The Cauchy distribution is slightly more complex and its MLE does not exist in a simple closed form. We resort to a numerical optimization technique, essentially the Newton-Raphson method, applied to its score function as given in Equation 7.19.
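A sketch of such a numerical solution, assuming a simulated Cauchy sample with location 4: rather than Newton-Raphson on the score, we can equivalently minimize the negative log-likelihood built from dcauchy with the one-dimensional optimizer optimize.

```r
# Numerical maximum likelihood for the Cauchy location parameter;
# optimize() searches the negative log-likelihood over an interval.
set.seed(101)
x <- rcauchy(200, location = 4)
negloglik <- function(mu) -sum(dcauchy(x, location = mu, log = TRUE))
fit <- optimize(negloglik, interval = c(-20, 20))
c(mle = fit$minimum, negloglik = fit$objective)   # location MLE near 4
```

Reporting fit$objective alongside the estimate follows the good practice, noted above, of always quoting the attained (negative) log-likelihood.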
The MLE problems, and the likelihood functions as well, have thus far been discussed in the context of a single unknown parameter. In many applied contexts it is common for multiple parameters to be unknown. The approach remains the same, though the details are obviously more involved. Let us consider a random sample from the normal distribution where both the parameters and are unknown. Recollect that in Example 7.5.5 we had assumed .
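A sketch of the two-parameter case via optim, assuming simulated data: sigma is parameterized on the log scale so that the search stays in the valid region, and the numerical answers are compared with the closed-form MLEs.

```r
# Joint maximum likelihood for (mu, sigma) in the normal model via
# optim(); log(sigma) is optimized to keep sigma positive.
set.seed(202)
x <- rnorm(100, mean = 4, sd = 2)
negloglik <- function(par) {
  mu <- par[1]; sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0), negloglik)
mle <- c(mu = fit$par[1], sigma = exp(fit$par[2]))
mle
# Closed-form MLEs for comparison: mean(x) and sqrt(mean((x - mean(x))^2))
c(mean(x), sqrt(mean((x - mean(x))^2)))
```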
We will close this sub-section with a very interesting example.
We will close this discussion with some properties of the MLE.
1. If is a sufficient statistic for the family of distributions and a unique MLE of exists, then the MLE is a function of the sufficient statistic .
2. If is an MLE for and is a one-to-one function of , then will be an MLE for .
3. Under certain regularity conditions, see Section 7.7, the MLE is a consistent estimator, and further
7.33
7.6.2 Method of Moments Estimator
Karl Pearson invented the method of moments estimator. Suppose that is the pmf (pdf), and that there are parameters, . The method of moments requires us to first find the theoretical moments of and equate them to the corresponding sample moments. We assume that we have a sample of size . Thus, we have as many equations as unknown quantities, and a solution of this set of equations gives the method of moments estimator. Symbolically, we have the following setup:
We will demonstrate applications of the moment estimators through some examples from Mukhopadhyay (2000) and Casella and Berger (2002). An important point to note here is that the moment estimators can be computed even if the complete density function is not known.
However, it is to be noted that the method of moments is an ad hoc approach to obtaining estimators and it is severely restricted. Refer to Mukhopadhyay (2000) and Casella and Berger (2002) for more pointers in this regard.
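A worked sketch for one standard case, the gamma(shape, scale) model, with simulated data: equating the first two theoretical moments, E[X] = ab and Var(X) = ab², to their sample counterparts and solving gives closed-form moment estimators.

```r
# Method of moments for the gamma(shape, scale) model: solve
#   m1 = a*b  and  m2 = a*b^2
# for the shape a and scale b, where m1 and m2 are the sample mean
# and second central sample moment.
set.seed(303)
x <- rgamma(500, shape = 3, scale = 2)
m1 <- mean(x)
m2 <- mean((x - m1)^2)
b.hat <- m2 / m1          # moment estimator of the scale
a.hat <- m1 / b.hat       # moment estimator of the shape
c(shape = a.hat, scale = b.hat)   # near the true values 3 and 2
```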
7.7 Comparison of Estimators
In the previous sections we considered two main methods of estimation. It is possible to propose several estimators for the same parameter, and we would then have to justify the use of one estimator over another. That is, we need criteria for the comparison of estimators. We will begin with unbiasedness as a criterion for comparison.
7.7.1 Unbiased Estimators
It needs to be recorded that lack of bias is a property of an estimator and not of a sample. An unbiased estimator hits the target on average, and its bias is 0 for all values of . An important measure of the performance of a statistic is provided by its mean squared error.
A result which connects variance, bias, and MSE is given next.
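The result, writing b_T(θ) = E_θ(T) − θ for the bias of T, is

```latex
\mathrm{MSE}_\theta(T) \;=\; E_\theta\!\left(T-\theta\right)^2
  \;=\; \mathrm{Var}_\theta(T) + \left[b_T(\theta)\right]^2,
```

which follows by adding and subtracting E_θ(T) inside the square and noting that the cross term vanishes.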
In the previous Example 7.7.1, we had four unbiased estimators . Among these we will prefer the estimator with the least variance. This leads to the following definition.
In the following, we will first consider improving unbiased estimators using sufficient statistics via the Rao-Blackwellization process. Next we will state the lower bound for the variance of unbiased estimators given by the Cramér-Rao inequality. Finally, we will briefly discuss how to obtain UMVU estimators.
7.7.2 Improving Unbiased Estimators
If we have an unbiased estimator, and we seek an improvement of it in terms of reduction in variance, we have an affirmative answer in the Rao-Blackwell theorem.
An illustration of how the Rao-Blackwell theorem works is required here.
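One such illustration, sketched by simulation for the Bernoulli(p) model: start with the crude unbiased estimator X₁ and condition on the sufficient statistic T = ΣXᵢ. Here E[X₁ | T = t] = t/n, i.e. the Rao-Blackwellized estimator is the sample mean, and its variance is smaller by a factor of n.

```r
# Rao-Blackwellization for Bernoulli(p): X1 is unbiased with variance
# p(1-p); conditioning on T = sum(X) gives E[X1 | T] = T/n = mean(x),
# unbiased with the much smaller variance p(1-p)/n.
set.seed(404)
n <- 20; p <- 0.3; nsim <- 2e4
crude <- improved <- numeric(nsim)
for (i in seq_len(nsim)) {
  x <- rbinom(n, size = 1, prob = p)
  crude[i]    <- x[1]       # crude unbiased estimator
  improved[i] <- mean(x)    # its Rao-Blackwellization
}
c(var(crude), var(improved))   # about 0.21 versus 0.0105
```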
The reader may consult Mukhopadhyay (2000) for more interesting examples of Rao-Blackwellization. We have taken one important step towards reducing the variance via sufficiency, and we should now ask what is the best we can do. Under some mild regularity conditions, the answer is provided by the Cramér-Rao lower bound. We need the following assumptions.
a. The support of is independent of .
b. exists .
c. .
d. .
e. .
As the above inequality involves the Fisher information, Lehmann and Casella (1998) refer to the Cramér-Rao inequality as the information inequality. The Cramér-Rao inequality can also be derived using the characteristic function; see Kay and Xu (2008).
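For reference, the standard statement of the bound is as follows: under assumptions (a)–(e), any unbiased estimator T of θ based on a sample with Fisher information I_n(θ) satisfies

```latex
\mathrm{Var}_\theta(T) \;\ge\; \frac{1}{I_n(\theta)},
```

with the more general form Var_θ(T) ≥ [ψ′(θ)]² / I_n(θ) when T is unbiased for a function ψ(θ).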
We will close this section with an important result which leads towards deriving UMVUE for some special family of probability distributions.
7.8 Confidence Intervals
Point estimates of the parameters are useful, as already seen. However, there is often a need to complement them with other techniques, and a very brief discussion of confidence intervals is given here. Admittedly, this topic deserves more depth; it is kept brief partly because the closely related problem of hypothesis testing is covered in a more rigorous manner, with R, later in the chapter. Another reason for skipping over some of the details is that most of the R statistical test functions provide confidence intervals as part of their output. In fact, the R function confint is also available, which extracts confidence intervals at desired confidence levels from fitted regression models.
The reader may refer to Chapter 5 of Tattar (2013) for the confidence interval functions binom_CI, normal_CI_ksd, and normal_CI_uksd. Chapter 8 of Ugarte et al. (2008) is a comprehensive account of the construction of confidence intervals in R. As noted above, statistical tests in R such as t.test, var.test, and binom.test give confidence intervals as by-products in their output, and the function confint is applicable to many statistical models fitted in R.
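A short sketch of both routes, on simulated data: the interval reported by t.test, and the same interval recovered by confint from an intercept-only linear model.

```r
# Confidence intervals as by-products of t.test, and via confint()
# on a fitted model; the data are simulated.
set.seed(505)
x <- rnorm(30, mean = 10, sd = 2)
ci95 <- t.test(x)$conf.int                      # default 95% CI for the mean
ci99 <- t.test(x, conf.level = 0.99)$conf.int   # wider 99% CI
fit <- lm(x ~ 1)    # intercept-only model; its coefficient is the mean
confint(fit, level = 0.95)                      # matches ci95
```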
7.9 Testing Statistical Hypotheses

In earlier sections we explored various techniques for estimating the unknown parameters. The next task is validation of these parameters. Especially, we would like to deduce whether the estimated parameters are in agreement with certain conjectures. We will now introduce some important terminology.
Consider a random sample whose underlying probability law is a pmf or a pdf .
The hypothesis testing problem is the statistical task of choosing between two plausible hypotheses; that is, we seek a mechanism to choose between the hypotheses and . Formally, the testing problem is
7.41
where and are two mutually exclusive subsets of , that is, and is empty. The choice between and is to be based on a random sample of size . The set of values in the sample space which lead to rejection of the hypothesis has a special name.
We need an instrument which decides between and . A formal definition is as follows.
A standard notation for a hypothesis test is . In terms of the rejection region, we can define the test as
The tests can also be defined in terms of decision rules. Let and denote the decisions of accepting and rejecting the hypothesis , respectively. Note that if , then .
The hypothesis test may lead to two types of error, which are brought out in Table 7.4.
It is customary to denote the probabilities of Type I and II errors by and respectively, that is,
7.42
7.43
Ideally, we would like to construct a hypothesis test that keeps both types of error to a minimum. Unfortunately, this is not possible in general, and hence we seek hypothesis tests which place an upper bound on the probability of a Type I error and minimize the probability of a Type II error subject to this bound. A formal definition captures this requirement.
The concept of Type I and Type II errors will be demonstrated in the next example with a small R program.
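A sketch of such a program, under the illustrative setup H0: mu = 0 versus H1: mu = 1 in the N(mu, 1) model, with a test that rejects H0 when the sample mean exceeds a critical value chosen for size alpha:

```r
# Simulating the two error rates: Type I = P(reject H0 | H0 true),
# Type II = P(accept H0 | H1 true).
set.seed(606)
n <- 10; alpha <- 0.05; nsim <- 2e4
crit <- qnorm(1 - alpha) / sqrt(n)   # size-alpha critical value for xbar
type1 <- mean(replicate(nsim, mean(rnorm(n, mean = 0)) > crit))
type2 <- mean(replicate(nsim, mean(rnorm(n, mean = 1)) <= crit))
c(type1 = type1, type2 = type2)      # type1 close to alpha = 0.05
```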
The level of significance places an upper bound on the probability of a Type I error. However, sometimes we need to find, from the data, the probability of rejecting the hypothesis .
The concept of the p-value will be especially useful in many of the hypothesis testing problems to be seen in Part IV. We will need one more concept here.
The power of a test, also called the power function, will be denoted by . If and are simple hypotheses, we have and , where and are the hypothesized values of the parameter. This notation for the power function should not be confused with the similar notation used in Section 7.16 on the EM algorithm.
The next section will present a nice technique to obtain meaningful tests.
7.10 The Neyman-Pearson Lemma
The Neyman-Pearson lemma is one of the ground-breaking results in statistics. It addresses the problem of testing a simple hypothesis against a simple hypothesis . The requirement of a size test with maximum power leads to the definition of a most powerful test.
The most powerful test is abbreviated as the MP test. We now state the lemma.
The test may be rewritten in the form of likelihood functions as
7.49
We will discuss some aspects of the Neyman-Pearson lemma before its applications. Some key points in this lemma are emphasized in the following:
a. Points are added to the critical region until the size of the test reaches . To understand this, note that we consider the likelihood ratio and rank the points of on the basis of the ratio of the explanation of under to the explanation under . Thus, the points with higher values of the likelihood ratio enjoy a better explanation under in comparison with .
b. The power of an MP test increases with corresponding increases in the test size.
d. The risk set defined in Equation 7.50 is a convex and compact set.
The Neyman-Pearson lemma will be illustrated through various examples now.
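One such illustration, sketched for the simple hypotheses H0: mu = 0 versus H1: mu = 1 in the N(mu, 1) model with known variance (an assumed setup, chosen for its clean algebra): the likelihood ratio L1/L0 equals exp(n·x̄ − n/2), which is increasing in x̄, so the size-alpha MP test rejects when x̄ exceeds qnorm(1 − alpha)/√n.

```r
# The Neyman-Pearson MP test for H0: mu = 0 vs H1: mu = 1, N(mu, 1).
set.seed(707)
n <- 25; alpha <- 0.05
x <- rnorm(n, mean = 1)                   # data generated under H1
lr <- exp(sum(dnorm(x, mean = 1, log = TRUE)) -
          sum(dnorm(x, mean = 0, log = TRUE)))   # likelihood ratio L1/L0
xbar <- mean(x)
reject <- xbar > qnorm(1 - alpha) / sqrt(n)      # equivalent LR threshold
c(likelihood.ratio = lr, xbar = xbar, reject = reject)
```

Rejecting for large x̄ and rejecting for large likelihood ratio are the same rule here, since the ratio is a monotone function of x̄.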
In the previous example, the value of under was less than under . Let us consider the same problem with roles reversed.
The reader may not be comfortable writing different programs depending on whether is less than or greater than . Recall that the MPNormal function takes care of both scenarios; the next function does likewise.
We will now move to the next type of hypotheses testing problem.
7.11 Uniformly Most Powerful Tests
The general one-sided hypothesis testing problem for a single real valued parameter is stated as:
Note that the hypotheses and are composite hypotheses. A definition of the size of a test for composite hypotheses is required.
We can now define the Uniformly Most Powerful Tests.
Let us first consider the problem of testing against . In this case the UMP test may be easily obtained. Towards this, fix a value and set up the Neyman-Pearson MP test for against . The MP test is then a UMP test if it remains unaffected by the specific choice of .
We note that UMP tests do not exist in general for one-sided testing problems. However, UMP tests do exist for families of distributions satisfying a particular property. The mathematical property which ensures the existence of UMP tests is defined next.
A result due to Karlin and Rubin states that whenever there exists a statistic for which the family admits the MLR property, a UMP test can be constructed for the one-sided hypothesis.
7.12 Uniformly Most Powerful Unbiased Tests
The general hypothesis of interest is of the form: against . We will begin with an example.
It needs to be noted, though, that a UMP test for a simple hypothesis against the two-sided alternative does exist for the uniform distribution; see Mukhopadhyay (2000). We will return to the problem of such hypotheses for the normal distribution. A condition needs to be relaxed for identifying meaningful tests, and hence we consider the next definition.
Let denote the collection of all size unbiased tests. The next definition follows naturally.
The main reason for discussing the results in this fashion thus far is the integration of statistical concepts with R. It is believed that the reader is by now convinced that the details of statistical theory can be understood using a software package. We now skip the rest of the details, though illustrations are given, and simply leave it as an exercise for the reader to verify that the Student's t-test is indeed a UMPU test. In fact, we need a host of other related and interesting concepts, such as similarity, to prove this. The details may be found in Lehmann and Romano (2005) and Srivastava and Srivastava (2009).
7.12.1 Tests for the Means: One- and Two-Sample t-Test
Assume that is a random sample from with both parameters unknown. Suppose we are interested in testing . The parameters and are estimated by the sample mean and the sample standard deviation, respectively. The t-test statistic is then given by
7.59
which has a t-distribution with degrees of freedom. The t-test in R may be found in the stats package; its signature is
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
The following is clear from the above display:
1. By default the function performs a one-sample test of mu = 0 with a two-sided alternative and reports a 95% confidence interval.
2. The user has the option of specifying the nature of the alternative, the value of mu, and the confidence level.
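These options can be sketched as follows; the heights here are simulated, not the parent-child data used in the text's example, and the hypothesized mean 67 is an arbitrary illustrative value.

```r
# One-sample t-test usage on simulated heights.
set.seed(808)
height <- rnorm(50, mean = 68, sd = 2.5)
tt <- t.test(height, mu = 67, alternative = "two.sided")
tt$statistic   # the t statistic of Equation 7.59
tt$parameter   # degrees of freedom, n - 1 = 49
tt$conf.int    # 95% confidence interval for the mean
```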
In the above example we specified as the mean of the parents' height. However, a more appropriate test would be a direct comparison of whether the height of the child is the same as the height of the parent.
Let be a random sample from , and a random sample from , with unknown. Here we assume that the variances are equal, though unknown. Suppose that we are interested in testing the hypothesis against the hypothesis . The two-sample t-test statistic is then given by
7.60
which has a t-distribution with degrees of freedom, being the pooled standard deviation.
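The pooled test can be sketched on simulated data; in R, var.equal = TRUE requests the equal-variance version, and the manual computation of the pooled statistic agrees with the t.test output.

```r
# Pooled two-sample t-test: t.test with var.equal = TRUE versus the
# pooled statistic computed by hand.
set.seed(909)
x <- rnorm(20, mean = 5.0, sd = 1.5)
y <- rnorm(25, mean = 5.8, sd = 1.5)
tt <- t.test(x, y, var.equal = TRUE)
sp <- sqrt(((20 - 1) * var(x) + (25 - 1) * var(y)) / (20 + 25 - 2))
tstat <- (mean(x) - mean(y)) / (sp * sqrt(1 / 20 + 1 / 25))
c(tt$statistic, tstat)   # the two computations agree
tt$parameter             # degrees of freedom: 20 + 25 - 2 = 43
```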
7.13 Likelihood Ratio Tests
Consider the generic testing problem vs . As earlier, it is assumed that a random sample of size is available.
The constant is to be determined from the size restriction:
The following examples deal with the construction of likelihood ratio tests.
The likelihood ratio tests are obtained for the normal distribution in the following subsections.
7.13.1 Normal Distribution: One-Sample Problems
In Section 7.11 we saw that UMP tests do not exist for many crucial types of hypothesis testing problems. As an example, for a random sample of size from , with known, the UMP test does not exist for testing against . We will consider these types of problems in this subsection.
7.13.2 Normal Distribution: Two-Sample Problem for the Mean
As in the previous subsection, we will only consider the testing problem related to means. A very brief summary is given here.
The general problem is described as follows. Let be a random sample from , and a random sample from . Assume that all three parameters , and are unknown. For a specified level , the aim is to obtain the likelihood ratio test of against . Define the following quantities:
7.67
7.68
7.69
The size likelihood ratio test for against is given by
7.70
The next R function, LRNormal2Mean, with an illustration, gives the likelihood ratio test.
To the best of our knowledge, R does not have an implementation for the likelihood ratio test.
7.14 Behrens-Fisher Problem
In the two-sample problems for normal distributions considered earlier, the problem of testing $H: \mu_1 = \mu_2$ against $K: \mu_1 \neq \mu_2$, when $\sigma_1^2$ and $\sigma_2^2$ are unknown and distinct, has not been considered. There is a special reason for this. Linnik (1968) proved that in this case the UMPU test does not exist; a lot of controversy surrounds the solutions proposed to date, and it remains an open problem even today.
The problem was first attempted by Behrens in 1929 and by Fisher in 1935. Kim and Cohen (1995) provide an excellent review of the solutions proposed by various statisticians. Scheffé (1943), Aspin (1948), Lindley (1965), Robinson (1976), and Welch (1938, 1947) are some of the important works in this direction.
Suppose that we have $m$ observations from $N(\mu_1, \sigma_1^2)$ and $n$ observations from $N(\mu_2, \sigma_2^2)$. As earlier, let $\bar{X}$, $\bar{Y}$, $S_1^2$, $S_2^2$, and $S_p^2$ denote the sample means, variances, and pooled variance. The Student's $t$-test pivotal statistic with $m + n - 2$ degrees of freedom is given by
$t = \dfrac{\bar{X} - \bar{Y}}{S_p \sqrt{1/m + 1/n}}.$
However, the Student's $t$-test procedure makes the assumption of the variances being equal. Thus, the use of the $t$-test is inappropriate here. An ad-hoc solution is the following. Compute $t'$ by
$t' = \dfrac{\bar{X} - \bar{Y}}{\sqrt{S_1^2/m + S_2^2/n}}$
and compare it with the critical value obtained from a $t$ variable with $\min(m-1, n-1)$ degrees of freedom. For a dataset from Kim and Cohen's review paper, an R program is given next.
A more satisfactory solution for the Behrens-Fisher problem is given by Welch and we will discuss his solution with an R program.
Compute the value of the test statistic $v$ in the same way as $t'$ above. Define $c = \dfrac{S_1^2/m}{S_1^2/m + S_2^2/n}$. Define $f$ by
$\dfrac{1}{f} = \dfrac{c^2}{m-1} + \dfrac{(1-c)^2}{n-1}$.    (7.71)
The Welch solution is to carry out the test by using the value $v$ and comparing it with a $t$ random variate with $f$ degrees of freedom. As may be expected, in general $f$ is not an integer. In such a case we round off the value to the nearest integer.
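The computation of $c$ and the Welch degrees of freedom $f$ is mechanical; a small Python sketch (function name ours, the chapter's own programs being in R):

```python
def welch_df(s2x, m, s2y, n):
    # c is the first sample's share of the variance of (xbar - ybar);
    # 1/f = c^2/(m-1) + (1-c)^2/(n-1) is Welch's approximation (7.71).
    c = (s2x / m) / (s2x / m + s2y / n)
    return 1.0 / (c ** 2 / (m - 1) + (1 - c) ** 2 / (n - 1))

f = welch_df(2.5, 10, 1.0, 8)  # illustrative variances and sample sizes
```

Note that with equal variances and equal sample sizes $f$ reduces to $m + n - 2$, agreeing with the pooled $t$-test; in general $f$ lies between $\min(m-1, n-1)$ and $m + n - 2$.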
For more details related to the Behrens-Fisher problem, refer to the review article of Kim and Cohen.
7.15 Multiple Comparison Tests
Consider the following hypothesis:
$H: \mu_1 = \mu_2 = \cdots = \mu_k$.    (7.72)
Here, we have a set of hypotheses to be tested and this framework is popularly known as the multiple comparison test. Such hypotheses are very common in Experimental Designs, see Chapter 15. Suppose that $\mu_i$, $i = 1, \ldots, k$, denotes the mean yield due to the $i$-th treatment. In its general setup, the hypothesis $H$ says that none of the treatment means differs significantly from the others. In case we fail to reject $H$, the conclusion is indeed that none of the treatment differences is significant and the analysis stops. However, if we reject the hypothesis $H$, a host of questions then arises. In this case, the conclusion says that at least one treatment is significantly different and the interest is then to identify such a treatment. A slight variant of the problem is testing the means against some pre-specified level $\mu_0$, which is generally known as the mean of the control treatment.
Let us begin with a naive approach. That is, we consider $k$ hypotheses $H_1, \ldots, H_k$ instead of a single hypothesis and consider the problem of testing each of them separately. Suppose each hypothesis is tested at level $\alpha$. A simple exploration shows the dire consequence of this naive approach. The forthcoming program will show that the probability of one or more false rejections increases drastically with $k$.
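The program can be sketched as follows; we use Python here (the book's own programs are in R) and assume the $k$ tests are independent, each of size $\alpha = 0.05$, with all null hypotheses true, so that the probability of at least one false rejection is $1 - (1 - \alpha)^k$:

```python
def fwer(alpha, k):
    # P(at least one false rejection) among k independent level-alpha tests,
    # when all k null hypotheses are true.
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 50, 100):
    print(k, round(fwer(0.05, k), 4))
```

Already at moderate $k$ the probability is substantial, and by $k = 100$ it exceeds 0.99.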
That is, the Type I error grows very fast, and with large $k$ we are almost certain of having committed the error. This motivates the next definition.
The goal of the multiple testing problem is to restrict the FWER, that is, the probability of committing at least one false rejection among the $k$ tests, to a pre-specified level $\alpha$:
$\text{FWER} \leq \alpha$.    (7.74)
In the next section we will focus on two simple, yet useful, procedures for the multiple testing problem.
7.15.1 Bonferroni's Method
Bonferroni's method is a simple consequence of using the Bonferroni inequality. Let $p_i$ be the $p$-value associated with hypothesis $H_i$, $i = 1, \ldots, k$. Then reject the hypothesis $H_i$ if $p_i \leq \alpha/k$. It may be easily verified in this case that $\text{FWER} \leq \alpha$.
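A minimal sketch of the rule in Python (function name and $p$-values ours, for illustration only):

```python
def bonferroni_reject(pvals, alpha=0.05):
    # Reject H_i exactly when p_i <= alpha / k; by the Bonferroni
    # inequality the FWER is then bounded by alpha.
    k = len(pvals)
    return [p <= alpha / k for p in pvals]

flags = bonferroni_reject([0.001, 0.02, 0.04])  # per-test threshold 0.05/3
```

Only the first hypothesis is rejected in this example, since $0.02$ and $0.04$ exceed $0.05/3 \approx 0.0167$.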
An illustration will be provided with an example in R.
7.15.2 Holm's Method
Consider the ordered $p$-values $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(k)}$ and let the associated hypotheses be $H_{(1)}, \ldots, H_{(k)}$. The Holm procedure is a step-down procedure and is described below, adapted from page 351 of Lehmann and Romano (2005).
Step 1. If $p_{(1)} \geq \alpha/k$, accept $H_{(1)}, \ldots, H_{(k)}$ and stop. If $p_{(1)} < \alpha/k$, reject $H_{(1)}$ and test the remaining $k - 1$ hypotheses at level $\alpha/(k-1)$.
Step 2. If $p_{(1)} < \alpha/k$ but $p_{(2)} \geq \alpha/(k-1)$, accept $H_{(2)}, \ldots, H_{(k)}$ and stop. If $p_{(1)} < \alpha/k$ and $p_{(2)} < \alpha/(k-1)$, reject $H_{(1)}$ and $H_{(2)}$ and test the remaining $k - 2$ hypotheses at level $\alpha/(k-2)$.
Step 3. Continue the steps until the first $j$ such that $p_{(j)} \geq \alpha/(k - j + 1)$, at which point $H_{(j)}, \ldots, H_{(k)}$ are accepted.
For a proof that Holm's method meets the requirement $\text{FWER} \leq \alpha$, see Theorem 9.1.2 of Lehmann and Romano (2005). We will close the discussion with an example.
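The step-down rule can be sketched in Python (the chapter's own programs are in R; data and names here are illustrative). Note that Holm rejects everything Bonferroni rejects, and possibly more:

```python
def holm_reject(pvals, alpha=0.05):
    # Step-down: compare the ordered p-values with alpha/k, alpha/(k-1), ...
    # and stop at the first failure; that and all later hypotheses are accepted.
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for step, i in enumerate(order):       # step = 0, 1, ..., k-1
        if pvals[i] < alpha / (k - step):  # thresholds alpha/k, alpha/(k-1), ...
            reject[i] = True
        else:
            break
    return reject

flags = holm_reject([0.001, 0.02, 0.04])
```

With these $p$-values, Holm rejects all three hypotheses ($0.001 < 0.05/3$, $0.02 < 0.05/2$, $0.04 < 0.05$), whereas Bonferroni rejects only the first.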
7.16 The EM Algorithm*
7.16.1 Introduction
The Expectation-Maximization Algorithm, more popularly known as the EM algorithm, is a very popular tool, not only among statisticians, but also among data miners. Wu and Kumar (2009) have selected the EM algorithm as one of the top ten useful algorithms for data miners. McLachlan and Krishnan (1998, 2008) give a rigorous mathematical introduction with a large number of illustrations of the EM algorithm. Little and Rubin (1987, 2002) is also one of the earliest books to give a detailed account of the algorithm. Dempster, Laird, and Rubin (1977) introduced the breakthrough EM algorithm and enhanced statistical methods to accommodate missing data. This paper is also popularly referred to as the DLR paper. The introductory literature has so far been listed in reverse chronological order.
It is important to understand that the EM algorithm is not really an algorithm in the traditional usage of the technical word “algorithm”. It is a generic tool which gives rise to different statistical methods depending on the context of the application. Ripley, in a reply to an R user, has aptly explained this as: “The EM algorithm is not an algorithm for solving problems, rather an algorithm for creating statistical methods.”
In the context of handling missing data, Schafer (2000) has rightly said that “The key ideas behind EM and data augmentation are the same: to solve a difficult incomplete-data problem by repeatedly solving tractable complete-data problems.” Terry Speed (2008) has also said this about the EM algorithm: “I know many statisticians are deeply in love with the EM algorithm.”
“EMMIX” is probably one of the few software packages which implement the EM algorithm for mixtures of multivariate normal or $t$-distributions.
7.16.2 The Algorithm
In general, the EM algorithm is stated in two steps: the E-step and the M-step. We will begin with a description as given in McLachlan and Krishnan (2008). Let $Y$ be a random vector and $y$ be its observed value. The sample space of $Y$ is denoted by $\mathcal{Y}$. We will denote the pdf of $Y$ by $g(y; \Theta)$, where $\Theta$ is the vector of unknown parameters.
To make use of the EM algorithm, we will always pretend that $y$ is incomplete in the sense that the experiment consists of some values which we treat as missing data. That is, we will assume that we have missing data $z$, and if this is augmented with $y$, we will have the complete data $x = (y, z)$. Let $\mathcal{X}$ denote the sample space of $X$.
The pdf of the complete observation will be denoted by $g_c(x; \Theta)$. Thus, under the assumption that $x$ is completely observed, the log-likelihood of $\Theta$ is given by
$\log L_c(\Theta) = \log g_c(x; \Theta)$.    (7.75)
Clearly, as the sample space of the $x$'s is larger than that of the $y$'s, we have a many-to-one mapping from $\mathcal{X}$ to $\mathcal{Y}$. Thus, the observed data can be written as a function $y = y(x)$. Hence, we have the relationship
$g(y; \Theta) = \int_{\mathcal{X}(y)} g_c(x; \Theta) \, dx$,
where $\mathcal{X}(y)$ denotes the set of all $x$ mapped to the observed $y$.
Assume that we have an initial value $\Theta^{(0)}$ as an estimate of $\Theta$. Using the observed data $y$ and $\Theta^{(0)}$, we next specify the conditional probability distribution of the complete data. Since the complete data log-likelihood $\log L_c(\Theta)$ is not observable, we will replace it by its conditional expectation given $y$ and $\Theta^{(0)}$. This conditional expectation is the famous Q-function defined by
$Q(\Theta; \Theta^{(0)}) = E_{\Theta^{(0)}}\left[\log L_c(\Theta) \mid y\right]$.    (7.76)
This is the famous E-step of the EM algorithm. In the M-step, we maximize $Q(\Theta; \Theta^{(0)})$ to obtain $\Theta^{(1)}$ such that
$Q(\Theta^{(1)}; \Theta^{(0)}) \geq Q(\Theta; \Theta^{(0)})$ for all $\Theta$.    (7.77)
Thus, at a general iteration $k + 1$, the EM algorithm can be summarized as below:
E-Step: Calculate $Q(\Theta; \Theta^{(k)})$, where
$Q(\Theta; \Theta^{(k)}) = E_{\Theta^{(k)}}\left[\log L_c(\Theta) \mid y\right]$.    (7.78)
M-Step: Select any value $\Theta^{(k+1)}$ such that
$Q(\Theta^{(k+1)}; \Theta^{(k)}) \geq Q(\Theta; \Theta^{(k)})$ for all $\Theta$.    (7.79)
The convergence criterion for the EM algorithm is that the difference $L(\Theta^{(k+1)}) - L(\Theta^{(k)})$ should be approximately 0. This explanation of the EM algorithm in two steps can be found almost everywhere. However, we have found the five-step description of the EM algorithm by Gupta and Chen (2011) to be more friendly and, despite a repetition of the above content, we will state it here.
1. Set $k = 0$ and obtain an initial estimate of $\Theta$ as $\Theta^{(0)}$.
2. Assume $\Theta^{(k)}$ to be the truth and, using the observed data $y$, completely specify the conditional probability distribution of the complete data $x$.
3. Obtain the conditional expected log-likelihood (the Q-function) $Q(\Theta; \Theta^{(k)})$.
4. Find $\Theta^{(k+1)}$ which maximizes $Q(\Theta; \Theta^{(k)})$.
5. Set $k = k + 1$ and return to Step 2.
We understand that the EM algorithm is best illustrated through applications.
7.16.3 Introductory Applications
We will consider problems which have been widely used to illustrate the EM algorithm.
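One of the most widely used illustrations is the genetic-linkage data of the DLR paper: counts $y = (125, 18, 20, 34)$ over four multinomial cells with probabilities $(1/2 + \theta/4, (1-\theta)/4, (1-\theta)/4, \theta/4)$. The first cell is viewed as incomplete, being the sum of a latent split with probabilities $1/2$ and $\theta/4$. A Python sketch of the resulting EM iteration (the chapter's own programs are in R):

```python
def em_linkage(y, theta=0.5, tol=1e-10):
    # EM for the DLR genetic-linkage example.
    y1, y2, y3, y4 = y
    while True:
        # E-step: expected count z falling in the theta/4 part of cell 1,
        # given the observed y1 and the current theta.
        z = y1 * (theta / 4) / (0.5 + theta / 4)
        # M-step: MLE of theta from the completed multinomial counts.
        new = (z + y4) / (z + y2 + y3 + y4)
        if abs(new - theta) < tol:
            return new
        theta = new

theta_hat = em_linkage((125, 18, 20, 34))
```

The iterates converge to the MLE $\hat{\theta} \approx 0.6268$, the positive root of $197\theta^2 - 15\theta - 68 = 0$ obtained by setting the observed-data score to zero.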
7.17 Further Reading
7.17.1 Early Classics
Fisher! In the 1920s, Sir R.A. Fisher wrote a series of ground-breaking papers on inference. Fisher (1925–1954) gives a first account of what should form the fundamentals of inference. Kendall and Stuart (1945–79) is one of the earliest rigorous developments of inference. Cramér's (1946) book is one of the landmarks for inference. Lehmann (1958) gave a detailed account related to testing of hypotheses. Rao (1965–73) is one of the all-time classics and goes beyond the “linear” indicated in its title. Wilks (1962), Zacks (1971), and Cox and Hinkley (1973) are some of the other rigorous books on statistical inference.
Let us now look at some of the earlier books which introduce the subject at an elementary level. Snedecor and Cochran (1937–89) may have been the first book on “Statistical Methods”. Mood, et al. (1950–74) is one of the earliest, most elegant, and elementary introductions to statistics. Hoel, et al. (1971), Hogg and Craig (1978), Hogg and Tanis (1977), and DeGroot and Schervish (2012) are also some of the best books written at their level.
In the Indian subcontinent, Das (1996) and Goon, et al. (1963) have written very useful texts.
7.17.2 Texts from the Last 30 Years
We do not intend to retain the chronological order of publication and simply jot down the texts which readily come to mind. As seen throughout this chapter, Mukhopadhyay (2000), Rohatgi and Saleh (2000), and Casella and Berger (2000) have influenced this chapter a lot. Pawitan (2001) has been freely used for the illustration of many concepts. Geisser and Johnson (2006) is a very compact work and will be useful for an expert to brush up on the details. Sen, et al. (2009) is a very concise course on the recent topics in inference. Wasserman (2004) is an advanced text which the reader will find useful for the modern development of the subject. Keener (2010), Dekking, et al. (2005), Liese and Miescke (2008), Knight (2000), Schervish (1995), and Shao (2003) are some of the finest written texts.
McLachlan and Krishnan (2008) is the first book to detail the EM algorithm. Huber and Ronchetti (2009) deals with the robustness of inference tools. As with the bibliography section of Chapter 5, we have again repeated a futile exercise.
7.18 Complements, Problems, and Programs
Problem 7.1 For different values of , obtain a plot of the curved normal family.
Problem 7.2 Italicize the -axis label in the expression part in Example 7.3.1.
Problem 7.3 Find a sufficient statistic for when .
Problem 7.4 Suppose follows a negative binomial distribution with parameters as defined in Equation (6.20). Assume that for obtaining failures, is noted as 10. Obtain the likelihood function plot and then graphically infer about the ML estimate of .
Problem 7.5 In a directory on a particular folder of a hard disk drive, there are files. Suppose that in a random selection of files, 9 are observed to be e-books. Under the assumption of a hypergeometric distribution, and by using the likelihood function approach, give the ML estimate of . Check Equation (6.30) if required to complete the R program.
Problem 7.6 For the two likelihood functions of the multinomial distribution in Examples 7.9.1 and 7.9.2, plot the likelihood function for obtaining the ML estimates.
Problem 7.7 Section 7.6 makes use of the functions optimize and mle to obtain the ML estimate. Will these techniques return the ML estimates for the parameters in the previous two examples? If a technique fails, test across values of the parameters and data; what may be the reason behind it?
Problem 7.8 Using Fisher's score function technique, obtain the ML estimate in the previous three examples.
Problem 7.9 For the galton dataset from UsingR package, what will be the conclusion of the MP test that the height of the child is against , given that variance is known to be 1.7873.
Problem 7.10 If the variance is unknown in the previous example, carry out the likelihood-ratio test, see LRNormalMean_UV, and draw the conclusion at the level of significance.
Problem 7.11 In Section 7.12, it was mentioned that for a sample from a uniform distribution a UMP test exists for the hypothesis testing problem of against . Obtain the UMP test and, if possible, an appropriate R program.
Problem 7.12 Assume that the variances for the two treatments of the Youden-Beale problem are unknown and there is no reasonable way they can be assumed to be equal. Use the two tests (R programs) developed in Section 7.14 in adhocBF and WelchBF to draw the right conclusions.
Problem 7.13 Carry out the multiple hypothesis testing problem, see glht function from the multcomp package, for the median polish regression model fitted in Section 4.5.2.
Problem 7.14 Interpret the R program in Example 7.10.2.
Problem 7.15 The -test used on the galton dataset is t.test(galton$child,mu=mean(galton$parent)). However, there is a “pairing” between the height of the child and the parent. Is the test t.test(galton$child,galton$parent,paired=TRUE) more appropriate?