Chapter 7
Parametric Inference

Package(s): UsingR

Dataset(s): ns, ps, bs, cs, galton, sleep, airquality

7.1 Introduction

In Chapter 4 we came across one form of analyzing data, namely exploratory data analysis. That approach makes essentially no assumptions about the probability mechanism generating the data. In later chapters we encountered certain models which plausibly explain the nature of a random phenomenon. In reality, we seldom have complete information about the parameters of the probability distributions. Historically, or intuitively, we may have enough information about the probability distribution, apart from a few parameters. In this chapter we consider various methods for inferring about such parameters, using the data generated under the assumptions of these probability models.

Parametric statistical inference arises when we have a model describing an uncertain experiment except for a few values, called parameters, of the model. If the parameter values are known, the problems are more probabilistic than statistical. The parameter values need to be obtained from data. The data may be a pure random sample in the sense of all the observations being drawn with the same probability. In practice, however, obtaining such a random sample may not be possible in many stochastic experiments. For example, the temperatures in the morning and afternoon are certainly not identically distributed observations. We undertake statistical inference for uncertain experiments in this chapter.

In Chapter 6, we came across a pool of diverse experiments which have certain underlying probability models, say c07-math-0001. Under the assumption that c07-math-0002 is the truth, we now develop methods for inference about c07-math-0003. We will begin with some important families of distributions in Section 7.2. This section and the next few sections rely heavily on Lehmann and Casella (1998). The form of the loss function plays a vital role in the usefulness of an estimator/statistic. For an observation from the binomial distribution, we discuss some choices of loss functions in Section 7.3. Data reduction through the concepts of sufficiency and completeness is theoretically examined in Section 7.4. Section 7.5 emphasizes the importance of the likelihood principle through visualization and examples. The role of the information function in obtaining the parameter values is also detailed in this section. The discussion thus far focuses on the preliminaries of point estimation.

Using the foundations from Sections 7.2–7.5, we next focus on the specific techniques of obtaining estimates of the parameters using the maximum likelihood estimator and the moment estimator in Section 7.6. Estimators are further compared for their unbiasedness and variance in Section 7.7. The techniques discussed up to this point of the chapter return a single value of the parameter, which is seldom sufficient for appropriate inference about the parameters. Thus, we seek a range, actually an interval, of plausible values of the parameters in Section 7.8. In the case of missing values, or a data structure which may be simplified through missing variables, the EM algorithm has become very popular, and the reader will find it illustrated in Section 7.16.

Sections 7.9–7.15 offer a transparent approach to the problem of Testing Statistical Hypotheses. It is common for statistical software texts to focus on the testing framework using the straightforward and useful R functions available in prop.test, t.test, etc. However, here we take a pedagogical approach and begin with the preliminary concepts of Type I and Type II errors. The celebrated Neyman-Pearson lemma is stated and then demonstrated for various examples with R programs in Section 7.10. The Neyman-Pearson lemma gives us a most powerful test for simple hypotheses, which cannot be directly extended to testing problems of composite hypotheses, and thus we need slightly relaxed conditions leading to uniformly most powerful tests and also uniformly most powerful unbiased tests, as seen in Sections 7.11 and 7.12. A more generic class of useful tests is available in the family of likelihood ratio tests. Its examples are detailed with R programs in Section 7.13. A very interesting problem, which is still unsolved, arises in the comparison of normal means from two populations whose variances are completely unknown. This famous Behrens-Fisher problem is discussed with appropriate solutions in Section 7.14. The last technical section of the chapter, Section 7.15, deals with the problem of testing multiple hypotheses.

7.2 Families of Distribution

We will begin with the definition of a group family; see page 17 of Lehmann and Casella (1998).

The identity element of a group and the inverse of any element are unique. The two important properties of a group are (i) closure under composition, and (ii) closure under inversion. Basically, for various reasons, we change the characteristics of random variables by adding them together, or by adding or multiplying constants, etc. We need some assurance that such changes do not render the probability model from which we started useless. The changes to which we subject the random variable are loosely referred to as transformations. Let us discuss these two properties in more detail.

Closure under composition. A 1:1 transformation may be, for example, addition or multiplication. We say that a class c07-math-0017 of transformations is closed under composition if c07-math-0018 implies c07-math-0019.

Closure under inversion. For any 1:1 transformation c07-math-0020, the inverse of c07-math-0021, denoted by c07-math-0022, undoes the transformation c07-math-0023, that is, c07-math-0024. A class of transformations c07-math-0025 is said to be closed under inversion if for any c07-math-0026, the inverse c07-math-0027 is also in c07-math-0028.

In plain words, if a random variable under a group family is subjected to a transformation, the resulting distribution also belongs to the same class of distributions. The transformation may include addition, multiplication, sine, etc.

Table 4.1 of Lehmann and Casella (1998) gives an important list of location-scale families.

7.2.1 The Exponential Family

The canonical form is not necessarily unique.

Note that we speak of a family of probability distributions belonging to an exponential family, and as such we are not focusing on a single density, say c07-math-0067.
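As a concrete illustration (a standard example, not a reproduction of the text's own derivations), the Poisson pmf can be written in the one-parameter exponential family form:

$$f(x \mid \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!} = \frac{1}{x!}\exp\{x\log\lambda - \lambda\}, \qquad x = 0, 1, 2, \ldots,$$

so that $T(x) = x$ is the natural sufficient statistic and $\eta = \log\lambda$ is the canonical parameter.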

7.2.2 Pitman Family

The exponential family meets the regularity condition since the range of the random variable is not a function of the parameter c07-math-0090. In many interesting cases, however, the range of the random variable depends on a function of its parameters. In such cases the regularity condition is not satisfied. A nice discussion about such non-regular random variables appears on page 19 of Srivastava and Srivastava (2009).

The mathematical argument is as follows. Let $g(x)$ be a positive function of $x$, and let $a(\theta)$ and $b(\theta)$, $a(\theta) < b(\theta)$, be extended real-valued functions of the parameter $\theta$. We can then define a density by

7.6 $$f(x \mid \theta) = \frac{g(x)}{h(\theta)}, \qquad a(\theta) \le x \le b(\theta),$$

where

$$h(\theta) = \int_{a(\theta)}^{b(\theta)} g(x)\,dx.$$

Thus, we can see that the range of $x$ depends on $\theta$, and hence the density is non-regular. The family of such densities is referred to as the Pitman family. In Table 7.1, a list of some members of the Pitman family is provided; refer to Table 1.3 of Srivastava and Srivastava (2009).

Table 7.1 Pitman Family of Distributions

Name Density Parameter Space Range
Uniform (continuous) c07-math-0101 c07-math-0102 c07-math-0103
Shifted Exponential c07-math-0104 c07-math-0105 c07-math-0106
Shifted Geometric c07-math-0107 c07-math-0108 c07-math-0109

We will next consider the role of loss functions.

7.3 Loss Functions

A statistic, or an estimator, c07-math-0110 is employed for the estimation of a parameter, say c07-math-0111. The loss incurred as a consequence of using c07-math-0112 is captured by a loss function. In keeping with the standard notation, the loss function of a parameter inferred by using c07-math-0113 will be denoted by c07-math-0114. Two useful loss functions are the squared-error loss $L(\theta, T) = (T - \theta)^2$ and the absolute-error loss $L(\theta, T) = |T - \theta|$.

In this section we will consider the squared error loss function only. To evaluate the performance of an estimator c07-math-0126 under a loss function c07-math-0127, we require the notion of risk function defined as

7.7 $$R(\theta, T) = E_\theta\left[L(\theta, T(X))\right].$$

For a fixed c07-math-0129, c07-math-0130 calculates the risk of using the statistic c07-math-0131. The risk function is the average loss due to c07-math-0132. The risk function under the squared-error loss function is also popularly known as the mean squared error of the estimator. The next example illustrates the risk function for four different statistics c07-math-0133. Since c07-math-0134 also often leads to making some decisions, we sometimes also denote the statistic by c07-math-0135. In fact, loss functions have a more prominent role in decision theory.

The role of loss functions will become more prominent in Section 7.7. The section will be closed with an interesting example where the degenerate estimator is to be preferred over a reasonable estimator.

7.4 Data Reduction

Statistical inference has two very important pillars: (i) the Sufficiency Principle, and (ii) the Likelihood Principle. Berger and Wolpert (1988) is a treatise for understanding these principles. In this section we will consider the sufficiency principle in depth.

7.4.1 Sufficiency

It was seen in Section 7.3 that many possible statistics exist for a given parameter. Some statistics have an advantage over others, and a meaningful criterion needs to be arrived at to help in these types of decisions.

Thus far we began with a statistic and verified whether it satisfied the sufficiency condition. It is not always possible to guess what may turn out to be a sufficient statistic. The Neyman factorization theorem gives a result which helps to obtain sufficient statistics from the joint probability function of the random sample. A measure-theoretic framework of this theorem is due to Halmos and Savage (1949); see page 289 of Mukhopadhyay (2000) for more details. The following theorem may be found in Casella and Berger (2002).

We will now use this result to obtain the sufficient statistics in some important cases.
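As a standalone illustration of the factorization theorem (a standard example, separate from the examples worked out in the text), consider an iid Poisson(θ) sample $x_1, \ldots, x_n$. The joint pmf factorizes as

$$\prod_{i=1}^{n}\frac{e^{-\theta}\theta^{x_i}}{x_i!} = \underbrace{e^{-n\theta}\,\theta^{\sum_i x_i}}_{g\left(\sum_i x_i;\;\theta\right)} \times \underbrace{\prod_{i=1}^{n}\frac{1}{x_i!}}_{h(x)},$$

so that $T(X) = \sum_{i=1}^{n} X_i$ is sufficient for $\theta$.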

If c07-math-0208 in the previous example is known, it can be seen that c07-math-0209, c07-math-0210, c07-math-0211, and other permutations are all sufficient for c07-math-0212. Thus, we need a more general framework to identify the sufficient statistics. Furthermore, there may be more than one sufficient statistic and in such cases we need to pick one among them.

7.4.2 Minimal Sufficiency

Dynkin (1951) gave the criterion for a statistic to be necessary.

In the discussion before this subsection, it can be seen that the statistics c07-math-0213 can be mathematically written as a function of c07-math-0214 and c07-math-0215. Thus, if c07-math-0216, c07-math-0217, and c07-math-0218 are the only three sufficient statistics, though there are many more sufficient statistics in the scenario, then c07-math-0219 turns out to be a necessary statistic. This definition further guides towards the concept of minimal sufficient statistics.

It is not practical to obtain a minimal sufficient statistic from its definition. Lehmann and Scheffé (1950) provide a result which is useful in obtaining a minimal sufficient statistic. We first describe this important result.

A useful result states that a complete sufficient statistic will be minimal and hence if the completeness of the sufficient statistic can be established, it will be minimal sufficient too. In the case of the exponential families of full rank, the statistic c07-math-0267 will be a complete statistic. We had already seen that c07-math-0268 is also sufficient, and hence for exponential families of full rank, it will be minimal sufficient. Since the examples of gamma, Poisson, etc., are members of an exponential family with full rank, the sufficient estimators/statistics seen earlier will be complete, and hence minimal sufficient too. The likelihood principle is developed in the next section.

7.5 Likelihood and Information

As mentioned at the beginning of the previous section, we will now consider the second important principle in the theory of statistical inference: the likelihood principle. The likelihood function is first defined.

7.5.1 The Likelihood Principle

A major difference between the likelihood function and the (joint) pdf needs to be emphasized. In a pdf (or pmf), we know the parameters and try to make certain probability statements about the random variable. In the likelihood function the parameters are unknown and hence we use the data to infer certain aspects of the parameters. In simple and practical terms, we generally plot the pdf c07-math-0273 against c07-math-0274 values to understand it, whereas for the likelihood function we plot c07-math-0275 against the parameter c07-math-0276. Obviously there is more to it than what we have simply said here, though what is detailed here suffices to understand the likelihood principle. Furthermore, it has been observed that many books lay far more emphasis on the maximum likelihood estimator, to be introduced in Section 7.6, while the likelihood function itself is given only a cursory formal introduction. Barnard et al. (1962) have been emphatic about the importance of the likelihood function. Pawitan (2001) is another excellent source for understanding the importance of likelihood. Moreover, Pawitan also provides R code and functions towards understanding this principle. Naturally, this section is influenced by his book.
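As a minimal sketch of such a likelihood plot (the data, 7 successes in 10 Bernoulli trials, are hypothetical and used purely for illustration), one may proceed along the following lines:

x <- 7; n <- 10                             # hypothetical data
p <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(x, size = n, prob = p)        # likelihood as a function of p
plot(p, lik, type = "l", xlab = "p", ylab = "Likelihood")
abline(v = x/n, lty = 2)                    # the likelihood peaks at x/n

The plot is drawn against the parameter p with the data held fixed, in contrast with a plot of the pmf against x for a fixed p.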

The likelihood function contains more information about the data and the parameters than some summary measures of the data. Plots of the likelihood, whenever possible, throw more light on the random phenomenon and should be employed in as many cases as possible. We will now formally state the likelihood principle.

More formal use of the likelihood function will be explained in the forthcoming subsection.

7.5.2 The Fisher Information

The likelihood function will now be more formally used towards the inference of the parameters. We will first define the score function.

The natural question is how do we make use of the score function as given in Equation 7.15. We will first begin with an illustration of the sampling variance of score functions.

An important exercise is to prove that the expectation of the score function equals zero, that is, c07-math-0352, which we leave to the reader.

In the above definition, the term partial derivative has been used in the sense that the parameter c07-math-0365 may be a vector and in which case we consider the derivative of the likelihood function with respect to each component of the vector.

A useful result regarding the expected Fisher information, under the assumptions of the regularity conditions of course, is

$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right] = -E_\theta\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right].$$

We will next find the Fisher information for a few well-known probability distributions.

The next result is an important one when we have to calculate the Fisher information for a random sample.
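The standard result along these lines, stated here under the usual regularity conditions, is that information is additive over independent observations: for an iid sample of size $n$,

$$I_n(\theta) = n\, I(\theta),$$

where $I(\theta)$ denotes the Fisher information in a single observation.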

In the next section we will use the observed Fisher information in a random sample.


7.6 Point Estimation

“Point Estimation” and “Estimation Theory” are common names for the problem of estimating parameters. Lehmann and Casella (1998) is one of the best sources for classical inference. Modern accounts of this domain are Casella and Berger (2002), Shao (2003), and Mukhopadhyay (2000). For details about the implementation of the maximum likelihood technique, refer to Millar (2011).

7.6.1 Maximum Likelihood Estimation

In the previous section we introduced the likelihood function. It will now be used to find estimators.

Let c07-math-0383 be a random sample with common pdf or pmf c07-math-0384, c07-math-0385. The random variable c07-math-0386 and/or c07-math-0387 may be scalar or vector. Recall the definition of likelihood function:

$$L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta).$$

It is important to note that the MLE definition simply requires that the likelihood function be optimized. There is no specific method outlined on how to obtain the MLE. We will begin with a few graphical methods, which will be a continuation of some of our earlier examples.

We will now attempt to obtain the MLEs when the parameters are continuous. A standard technique in calculus for obtaining the optimum value of a continuous function is by differentiating the function with respect to the continuous variable and setting the resultant expression to zero. In our case, such an expression turns out to be the score function which we saw in the previous Section 7.5. The MLE is then a root (solution) of the score function, that is, a solution of the equation:

$$\frac{\partial}{\partial\theta}\log L(\theta \mid x) = 0.$$

Note that in each of the four score function plots in Figure 7.4, the parameter value at which the score function c07-math-0405 equals 0 is 4. This is not surprising since we had simulated the datasets to have a mean of 4. Let us verify again with a plot of the score function for the normal sample and check if the plot helps to find the MLE, in the spirit of Pawitan (2001).

In the case of an iid sample, R can be used efficiently to define the likelihood function and then solve the score function to obtain the MLE. Two excellent introductions which illustrate this point may be found in Geyer (2003), see www.stat.umn.edu/geyer/5931/mle/mle.pdf, and Steenbergen (2006), see www.unc.edu/monogan/computing/r/MLE_in_R.pdf. Also check Monahan (2011) for numerical methods in statistics, which includes many interesting examples on MLE, see http://www4.stat.ncsu.edu/monahan/jul10/toc9r.html.

Setting the score functions equal to zero, and then obtaining an appropriate expression for the parameters, gives us the estimators which maximize the likelihood function, that is, the MLEs. For the score functions given in Equations 7.16–7.18, we obtain the MLEs by solving them as follows:

7.23 equation
7.24 equation
7.25 equation

To be assured that c07-math-0413, c07-math-0414, and c07-math-0415, as given above, are indeed the MLEs, we need to look at the derivatives of the score functions and verify that they are negative. The derivatives of the score functions for these distributions are stated next:

7.28 equation

The derivatives of the log-likelihood function for the normal and Poisson distributions, Equations 7.26 and 7.27, may easily be seen to be negative, whereas it is not straightforward to obtain the answer for the binomial and Cauchy distributions. In the binomial case, the reader may refer to a slightly different treatment at http://www.montana.edu/rotella/502/binom_like.pdf.

Note that the MLE for the Cauchy distribution has not been derived here; it requires a different way of solving the score function. We will be using three functions available in R, optimize, mle, and nlm, for obtaining the MLEs. Since we have emphasized the likelihood function, it is always good practice to report the values of the likelihood function, or its variants such as the log-likelihood or the negative log-likelihood.
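As a sketch of how optimize may be used (the Poisson sample below is simulated purely for illustration and is not one of the chapter's datasets):

set.seed(123)
x <- rpois(50, lambda = 4)                        # hypothetical sample
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
optimize(loglik, interval = c(0.01, 20), maximum = TRUE)
mean(x)                                           # the analytical MLE, for comparison

Reporting the maximized log-likelihood alongside the estimate, as optimize does, follows the good practice just mentioned.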

The Cauchy distribution is slightly more complex and its MLE does not exist in closed form. We resort to a numerical optimization technique, essentially the Newton-Raphson method, applied to its score function as given in Equation 7.19.
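A sketch of such a numerical approach, using nlm on the negative log-likelihood of a hypothetical Cauchy sample with known scale 1, is the following:

set.seed(321)
x <- rcauchy(30, location = 4)                    # hypothetical sample
negll <- function(mu) -sum(dcauchy(x, location = mu, log = TRUE))
nlm(negll, p = median(x))                         # the sample median is a robust starting value

The choice of starting value matters for the Cauchy likelihood, which may have multiple local maxima; the median is a common, robust choice.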

The MLE problems, and the likelihood functions as well, have thus far been discussed in the context of a single unknown parameter only. In many applied contexts, it is common for multiple parameters to be unknown. The approach remains the same, though the details are naturally more involved. Let us consider a random sample from the normal distribution, where both the parameters c07-math-0445 and c07-math-0446 are unknown. Recollect that in Example 7.5.5 we had assumed c07-math-0447.
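For the two-parameter normal problem just described, a sketch using the mle function from the stats4 package (with simulated data, purely for illustration) is:

library(stats4)
set.seed(11)
x <- rnorm(100, mean = 4, sd = 2)                 # hypothetical sample
nll <- function(mu, sigma) -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
fit <- mle(nll, start = list(mu = mean(x), sigma = sd(x)))
coef(fit)    # close to mean(x) and to the ML estimate sqrt((n-1)/n)*sd(x)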

We will close this sub-section with a very interesting example.

We will close this discussion with some properties of the MLE.

  1. If c07-math-0469 is a sufficient statistic for the family of distributions c07-math-0470 and a unique MLE of c07-math-0471 exists, then the MLE is a function of the sufficient statistic c07-math-0472.
  2. If c07-math-0473 is an MLE for c07-math-0474 and c07-math-0475 is a one-to-one function of c07-math-0476, then c07-math-0477 will be an MLE for c07-math-0478.
  3. Under certain regularity conditions, see Section 7.7, the MLE is a consistent estimator, and further

    7.33 $$\sqrt{n}\left(\hat{\theta}_n - \theta\right) \xrightarrow{d} N\left(0, I(\theta)^{-1}\right).$$

7.6.2 Method of Moments Estimator

Karl Pearson invented the method of moments estimator. Suppose that c07-math-0480 is the pmf (pdf), and that the parameters are c07-math-0481 in number, c07-math-0482. The method of moments estimator means that we need to first find the c07-math-0483 theoretical moments of c07-math-0484 and assume that they are equal to the corresponding sample moments. We assume that we have a sample of size c07-math-0485. Thus, we have c07-math-0486 equations for c07-math-0487 unknown quantities. A solution for this set of equations leads to the method of moments estimator. Symbolically, we have the following setup:

$$E\left[X^{j}\right] = \mu_j(\theta_1, \ldots, \theta_k) = \frac{1}{n}\sum_{i=1}^{n} x_i^{j}, \qquad j = 1, \ldots, k.$$

We will demonstrate the applications of the moment estimators through some examples from Mukhopadhyay (2000) and Casella and Berger (2002). An important point to note here is that the moment estimators can be computed even if the complete density function is not known.
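As a sketch (with a simulated gamma sample, not a dataset from the text), the method of moments equates the first two sample moments to their theoretical counterparts $E(X) = \text{shape}/\text{rate}$ and $\mathrm{Var}(X) = \text{shape}/\text{rate}^2$:

set.seed(7)
x <- rgamma(200, shape = 3, rate = 2)   # hypothetical sample
m1 <- mean(x); m2 <- mean(x^2)
rate_mom <- m1/(m2 - m1^2)              # rate = E(X)/Var(X)
shape_mom <- m1*rate_mom                # shape = E(X)*rate
c(shape = shape_mom, rate = rate_mom)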

However, it is to be noted that the moment estimator is an ad hoc solution for obtaining estimators and its use is severely restricted. Please refer to Mukhopadhyay (2000) and Casella and Berger (2002) for more pointers in this regard.


7.7 Comparison of Estimators

In the previous sections we considered two main methods of estimation. It is possible to propose several estimators for the same parameter, and we would then have to justify the use of one estimator over another. That is, we need criteria for comparing estimators. We will begin with unbiasedness as a criterion for comparison purposes.

7.7.1 Unbiased Estimators

It needs to be recorded that the lack of bias is a property of an estimator and not of a sample. An unbiased estimator reaches the target c07-math-0517 on average and its bias will be 0 for all values of c07-math-0518. An important measure of the performance of a statistic is provided by its Mean Squared Error.

A result which connects variance, bias, and MSE is given next.
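The standard decomposition connecting the three quantities is

$$\mathrm{MSE}_\theta(T) = E_\theta\left[(T-\theta)^{2}\right] = \mathrm{Var}_\theta(T) + \left[E_\theta(T) - \theta\right]^{2},$$

where the second term is the square of the bias of $T$.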

In the previous Example 7.7.1, we had four unbiased estimators c07-math-0550. Among these we will prefer the estimator with the least variance. This leads to the following definition.

In the following, we will first consider improving unbiased estimators using sufficient statistics via the Rao-Blackwellization process. Next we will state the lower bound for the variance of unbiased estimators given by the Cramér-Rao inequality. Finally, we will briefly discuss how to obtain UMVU estimators.

7.7.2 Improving Unbiased Estimators

If we have an unbiased estimator, and we seek an improvement of it in terms of reduction in variance, we have an affirmative answer in the Rao-Blackwell theorem.

An illustration of how the Rao-Blackwell theorem works is required here.

The reader may consult Mukhopadhyay (2000) for more interesting examples of Rao-Blackwellization. We have taken one important step towards reducing the variance via sufficiency, and we should now ask what is the best we can do beyond this. Under some mild regularity conditions, the answer is provided by the Cramér-Rao lower bound. We need the following assumptions.

  a. The support of c07-math-0582 is independent of c07-math-0583.
  b. c07-math-0584 exists c07-math-0585.
  c. c07-math-0586.
  d. c07-math-0587.
  e. c07-math-0588.

As the above inequality involves the Fisher information, Lehmann and Casella (1998) refer to the Cramér-Rao inequality as the information inequality. The Cramér-Rao inequality can be derived using the characteristic function, see Kay and Xu (2008).

We will close this section with an important result which leads towards deriving UMVUE for some special family of probability distributions.

7.8 Confidence Intervals

Point estimates of the parameters are useful, as already seen. However, there is often a need to complement them with other techniques, and a very brief discussion of the technique of confidence intervals is given here. It is admitted that this topic deserves more depth; it is restricted here partly because the closely related problem of hypothesis testing is covered in a more rigorous way, and with R. Another reason for skipping over some of the details on confidence intervals is that most of the well-known statistical tests in R also report confidence intervals as part of their output. In fact, the R function confint is also available, which extracts confidence intervals at the desired confidence level from fitted regression models.

The reader may refer to Chapter 5 of Tattar (2013) for the confidence interval functions binom_CI, normal_CI_ksd, and normal_CI_uksd. Chapter 8 of Ugarte et al. (2008) is a comprehensive account of the construction of confidence intervals in R. A further reason for not developing this section in more detail is that most of the statistical tests in R, such as t.test, var.test, and binom.test, give confidence intervals as by-products in their output. Furthermore, the function confint is applicable to many statistical models fitted in R.
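As a brief sketch (with simulated data), the textbook confidence interval for a normal mean with unknown variance can be computed directly and compared with the interval reported by t.test:

set.seed(1)
x <- rnorm(25, mean = 10, sd = 3)    # hypothetical sample
n <- length(x); alpha <- 0.05
mean(x) + c(-1, 1)*qt(1 - alpha/2, df = n - 1)*sd(x)/sqrt(n)
t.test(x, conf.level = 1 - alpha)$conf.int   # should agree with the interval above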

7.9 Testing Statistical Hypotheses–The Preliminaries

In earlier sections we explored various techniques for estimating the unknown parameters. The next task is validation of these parameters. In particular, we would like to deduce whether the estimated parameters are in agreement with certain conjectures. We will now introduce some important terminology.

Consider a random sample whose underlying probability law is a pmf or a pdf c07-math-0627.

The hypothesis testing problem is the statistical problem of choosing between two plausible hypotheses, that is, we need a mechanism to choose between hypotheses c07-math-0639 and c07-math-0640. Formally, the testing problem is

7.41 equation

where c07-math-0642 and c07-math-0643 are two subsets of c07-math-0644 and are mutually exclusive, that is, c07-math-0645 and c07-math-0646 is empty. The choice between c07-math-0647 and c07-math-0648 is to be based on a random sample c07-math-0649 of size c07-math-0650. The values in the sample space which lead to rejection of the hypothesis c07-math-0651 have a special name.

We need an instrument which will decide between c07-math-0655 or c07-math-0656. A formal definition is as follows.

A standard notation for a hypothesis test is c07-math-0660. In terms of the rejection region, we can define the test as

$$\varphi(x) = \begin{cases} 1, & x \in C, \\ 0, & x \notin C, \end{cases}$$

where $C$ denotes the rejection (critical) region.

The tests can also be defined in terms of decision rules. Let c07-math-0662 and c07-math-0663 denote the decisions of accepting or rejecting the hypothesis c07-math-0664. Note that if c07-math-0665, then c07-math-0666.

The hypothesis test may lead to two types of error, which are brought out in Table 7.4.

Table 7.4 Type I and II Error

Hypothesis Test c07-math-0667 True c07-math-0668 True
Accept c07-math-0669 No Error Type II Error
Accept c07-math-0670 Type I Error No Error

It is customary to denote the probabilities of Type I and II errors by c07-math-0671 and c07-math-0672 respectively, that is,

7.42 $$\alpha = P\left(\text{Reject } H \mid H \text{ is true}\right),$$
7.43 $$\beta = P\left(\text{Accept } H \mid K \text{ is true}\right).$$

Ideally, we would like to construct a hypothesis test that will keep both types of errors to a minimum. Unfortunately, it is not possible to do this, and hence we seek hypothesis tests which assign an upper bound on the probability of Type I errors and attempt to minimize Type II errors subject to this bound. A formal definition captures the requirement.

The concept of Type I and Type II errors will be demonstrated in the next example with a small R program.
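A minimal sketch of such a demonstration (the hypotheses H: mu = 0 against K: mu = 1, the sample size, and the rejection region are assumed purely for illustration) is the following:

set.seed(123)
n <- 10; crit <- qnorm(0.95)/sqrt(n)                 # reject H when the sample mean exceeds crit
xbarH <- replicate(10000, mean(rnorm(n, mean = 0)))  # sample means simulated under H
xbarK <- replicate(10000, mean(rnorm(n, mean = 1)))  # sample means simulated under K
mean(xbarH > crit)    # estimated Type I error, close to 0.05
mean(xbarK <= crit)   # estimated Type II error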

The level of significance places an upper bound on the probability of Type I error. However, we sometimes need to determine, from the data, the smallest level at which the hypothesis c07-math-0704 would be rejected.

The concept of the c07-math-0709-value will be especially useful in many of the hypothesis testing problems to be seen in Part IV. We will need one more concept here.

The power of a test, also called the power function, will be denoted by c07-math-0712. If c07-math-0713 and c07-math-0714 are simple hypotheses, we have c07-math-0715 and c07-math-0716, where c07-math-0717 and c07-math-0718 are the hypothesized values of the parameter. This notation for the power function should not be confused with the similar notation used in Section 7.16 on the EM algorithm.

The next section will present a nice technique to obtain meaningful tests.


7.10 The Neyman-Pearson Lemma

The Neyman-Pearson lemma is one of the ground-breaking results in statistics. It begins with the problem of testing a simple hypothesis c07-math-0721 against the simple hypothesis c07-math-0722. The requirement of a size c07-math-0723 test with maximum power leads to the definition of a most powerful test.

The most powerful test is abbreviated as the MP test. We now state the lemma.

The test c07-math-0753 may be rewritten in the form of likelihood functions as

7.49 $$\varphi(x) = \begin{cases} 1, & L(\theta_1 \mid x) > k\, L(\theta_0 \mid x), \\ \gamma, & L(\theta_1 \mid x) = k\, L(\theta_0 \mid x), \\ 0, & L(\theta_1 \mid x) < k\, L(\theta_0 \mid x). \end{cases}$$

We will discuss some aspects of the Neyman-Pearson lemma before its applications. Some key points in this lemma are emphasized in the following:

  a. Points are added to the critical region until the size of the test reaches c07-math-0755. To understand this, note that we consider the likelihood ratio c07-math-0756 and then rank the points of c07-math-0757 on the basis of the ratio of the explanation of c07-math-0758 under c07-math-0759 to the explanation under c07-math-0760. Thus, the points with higher values of the likelihood ratio enjoy a better explanation under c07-math-0761 in comparison with c07-math-0762.
  b. The power of MP tests increases with corresponding increases in the test size.
  c. The risk set for c07-math-0763 against c07-math-0764 is defined by
     7.50 c07-math-0765
  d. The risk set c07-math-0766 defined in Equation 7.50 is a convex and compact set.

The Neyman-Pearson lemma will be illustrated through various examples now.

In the previous example, the value of c07-math-0812 under c07-math-0813 was less than under c07-math-0814. Let us consider the same problem with roles reversed.

The reader may not be too comfortable with writing different programs depending on whether c07-math-0840 is less than or greater than c07-math-0841. Recall that the MPNormal function took care of both scenarios; the next function does the same for the binomial case.

MPbinomial <- function(Hp, Kp, alpha, n) {
  # MP test for a binomial proportion: Hp and Kp are the values under H and K,
  # alpha is the size, and n the number of trials. Returns the cut-off k and
  # the randomization probability gamma at X = k.
  if(Hp<Kp){
    # Right-tailed test: reject H for large X, randomize at X = k
    k <- min(which((1-pbinom(0:n,size=n,prob=Hp))<alpha))-1
    gamma <- (alpha-1+pbinom(k,size=n,prob=Hp))/dbinom(k,size=n,prob=Hp)
    return(list=c(k,gamma))
  }
  else {
    # Left-tailed test: reject H for small X, randomize at X = k
    k <- max(which((pbinom(0:n,size=n,prob=Hp))<alpha))
    gamma <- (alpha-pbinom(k-1,size=n,prob=Hp))/dbinom(k,size=n,prob=Hp)
    return(list=c(k,gamma))
  }
}

We will now move to the next type of hypotheses testing problem.


7.11 Uniformly Most Powerful Tests

The general one-sided hypothesis testing problem for a single real-valued parameter c07-math-0852 is stated as:

$$H: \theta \le \theta_0 \quad \text{against} \quad K: \theta > \theta_0.$$

Note that the hypotheses c07-math-0854 and c07-math-0855 are composite hypotheses. A definition of the size of a test for composite hypotheses is required.

We can now define the Uniformly Most Powerful Tests.

Let us first consider the problem of testing c07-math-0871 against c07-math-0872. In this case the UMP test may be easily obtained. Towards this, fix a value c07-math-0873 and set up the Neyman-Pearson MP test for c07-math-0874 against c07-math-0875. Then the MP test continues to be a UMP test if the test remains unaffected by the specific choice of c07-math-0876.

We note that UMP tests do not exist in general for one-sided testing problems. However, UMP tests do exist for families of distributions satisfying a particular property. The mathematical property which ensures the existence of UMP tests is defined next.

A result due to Karlin and Rubin states that whenever there exists a statistic c07-math-0887 for which c07-math-0888 admits the MLR property, a UMP test can be constructed for the one-sided hypothesis.


7.12 Uniformly Most Powerful Unbiased Tests

The general hypothesis of interest is of the form: c07-math-0941 against c07-math-0942. We will begin with an example.

It needs to be noted though that a UMP test for a simple hypothesis against the two-sided alternative exists for the uniform distribution, see Mukhopadhyay (2000). We will return to the problem of such hypotheses for the normal distribution. A condition needs to be relaxed for identifying meaningful tests and hence we consider the next definition.

Let c07-math-0971 denote the collection of all size c07-math-0972 unbiased tests. The next definition follows naturally.

The main reason for discussing the results in this fashion thus far is the integration of statistical concepts with R. It is believed that the reader is now convinced that the details of statistical theory can be understood using a software package. We now skip the rest of the details, though illustrations are given, and simply leave it as an exercise for the reader to verify that the Student's c07-math-0979-test is indeed a UMPU test. In fact, we need a host of other related and interesting concepts, such as similarity, to prove that the Student's c07-math-0980-test is indeed a UMPU test. The details may be found in Lehmann and Romano (2005) and Srivastava and Srivastava (2009).

7.12.1 Tests for the Means: One- and Two-Sample t-Test

Assume that c07-math-0981 is a random sample from c07-math-0982 with both parameters unknown. Suppose we are interested in testing c07-math-0983. The parameters c07-math-0984 and c07-math-0985 are estimated by the sample mean and the sample standard deviation respectively. The c07-math-0986-test statistic is then given by

7.59 $$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},$$

which has a c07-math-0988-distribution with c07-math-0989 degrees of freedom. The c07-math-0990-test in R may be found in the stats package, and its usage is given as

     t.test(x, y = NULL,
            alternative = c("two.sided", "less", "greater"),
            mu = 0, paired = FALSE, var.equal = FALSE,
            conf.level = 0.95, ...)

The following is clear from the above display:

  1. The default call performs a one-sample test with a two-sided alternative for c07-math-0991 and reports a 95% confidence interval.
  2. The user has the option of specifying the nature of the alternative, c07-math-0992, and the confidence level.

In the above example we specified that c07-math-0996 is the mean of the parent height. However, a more appropriate test would be a direct comparison of whether the height of the child is the same as the height of the parent.

Let c07-math-0997 be a random sample from c07-math-0998, and c07-math-0999 a random sample from c07-math-1000, with c07-math-1001 unknown. Here, we assume that the variances are equal, though unknown. Suppose that we are interested in testing the hypothesis c07-math-1002 against the hypothesis c07-math-1003. The two-sample c07-math-1004-test statistic is then given by

7.60 $$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{1/n_x + 1/n_y}},$$

which has a c07-math-1006-distribution with c07-math-1007 degrees of freedom, with c07-math-1008 being the pooled standard deviation.
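For instance (an illustration with the sleep data shipped with R, assuming equal variances), the two-sample test is obtained as:

t.test(extra ~ group, data = sleep, var.equal = TRUE)

Setting var.equal = TRUE requests the pooled-variance test described above; the default setting instead uses the Welch approximation of Section 7.14.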


7.13 Likelihood Ratio Tests

Consider the generic testing problem c07-math-1013 vs c07-math-1014. As earlier, it is assumed that a random sample c07-math-1015 of size c07-math-1016 is available.

The constant c07-math-1023 is to be determined from the size restriction:

$$\sup_{\theta \in \Theta_H} P_\theta\left(\Lambda(X) \le k\right) = \alpha,$$

where $\Lambda(X)$ denotes the likelihood ratio statistic.

The following examples deal with the construction of likelihood ratio tests.

The likelihood ratio tests are obtained for the normal distribution in the following subsections.

7.13.1 Normal Distribution: One-Sample Problems

In Section 7.11 we saw that UMP tests do not exist for many crucial types of hypothesis problems. As an example, for a random sample of size c07-math-1034 from c07-math-1035, c07-math-1036 known, the UMP test does not exist for testing c07-math-1037 against c07-math-1038. We will consider these types of problems in this subsection.

7.13.2 Normal Distribution: Two-Sample Problem for the Mean

As in the previous subsection, we will only consider the testing problem related to means. A very brief summary is given here.

The general problem is described as follows. Let c07-math-1085 be a random sample from c07-math-1086, and c07-math-1087 a random sample from c07-math-1088. Assume that all the three parameters c07-math-1089, and c07-math-1090 are unknown. For a specified level c07-math-1091, the aim is to obtain the likelihood ratio test for testing c07-math-1092 against c07-math-1093. Define the following quantities:

7.67 $$\bar{x} = \frac{1}{n_x}\sum_{i=1}^{n_x} x_i, \qquad \bar{y} = \frac{1}{n_y}\sum_{j=1}^{n_y} y_j,$$
7.68 $$S_x^2 = \frac{1}{n_x-1}\sum_{i=1}^{n_x}(x_i-\bar{x})^2, \qquad S_y^2 = \frac{1}{n_y-1}\sum_{j=1}^{n_y}(y_j-\bar{y})^2,$$
7.69 $$S_p^2 = \frac{(n_x-1)S_x^2 + (n_y-1)S_y^2}{n_x+n_y-2}.$$

The size c07-math-1097 likelihood ratio test for c07-math-1098 against c07-math-1099 is given by

7.70 $$\text{Reject } H \ \text{ if } \ \frac{|\bar{x}-\bar{y}|}{S_p\sqrt{1/n_x+1/n_y}} > t_{n_x+n_y-2,\,\alpha/2}.$$

The next R function, LRNormal2Mean, with an illustration, gives the likelihood ratio test.

> LRNormal2Mean <- function(x,y,alpha){
+     # Two-sample (pooled-variance) likelihood ratio test of equal means
+     xbar <- mean(x); ybar <- mean(y)
+     nx <- length(x); ny <- length(y)
+     Sx <- var(x); Sy <- var(y)
+     Sp <- ((nx-1)*Sx+(ny-1)*Sy)/(nx+ny-2)   # pooled variance
+     tcalc <- abs(xbar-ybar)/sqrt(Sp*(1/nx+1/ny))
+     conclusion <- ifelse(tcalc>qt(df=nx+ny-2,p=1-alpha/2),
+ "Reject Hypothesis H","Fail to Reject Hypothesis H")
+     return(c(tcalc,conclusion,Sp))
+ }
> lisa <- c(234.26, 237.18, 238.16, 259.53, 242.76, 237.81, 250.95, 277.83)
> mike <- c(187.73, 206.08, 176.71, 213.69, 224.34, 235.24)
> LRNormal2Mean(mike,lisa,0.05)
[1] "4.06112227911276"       "Reject Hypothesis H" "332.808456944444"

To the best of our knowledge, R does not have a dedicated implementation of this likelihood ratio test, though the equivalent pooled two-sample t-test is available through t.test with var.equal=TRUE.


7.14 Behrens-Fisher Problem

In the two-sample problems for normal distributions considered earlier, the problem of testing c07-math-1102 against c07-math-1103, when c07-math-1104 and c07-math-1105 are unknown and distinct, has not been considered. There is a special reason for this. Linnik (1968) proved that in this case a UMPU test does not exist; there has been a lot of controversy surrounding the solutions proposed to date, and the problem remains open even today.

The problem was first attempted by Behrens in 1929 and by Fisher in 1935. Kim and Cohen (1995) provide an excellent review of the solutions proposed by various statisticians. Scheffé (1943), Aspin (1948), Lindley (1965), Robinson (1976), and Welch (1938, 1947) are some of the important works in this direction.

Suppose that we have c07-math-1106 observations from c07-math-1107 and c07-math-1108 observations from c07-math-1109. As earlier, let c07-math-1110 denote the sample means, variances, and pooled variance. The Student's c07-math-1111-test pivotal statistic with c07-math-1112 degrees of freedom is given by

$$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{1/n_x + 1/n_y}}.$$

However, the Student's c07-math-1114-test procedure makes the assumption that the variances are equal. Thus, the use of the c07-math-1115-test is inappropriate here. An ad hoc solution is the following. Compute c07-math-1116 by

$$t^{*} = \frac{\delta - \bar{y} + \bar{x}}{\sqrt{s_x^2/n_x + s_y^2/n_y}},$$

and compare it with the critical value obtained from a c07-math-1118 variable with c07-math-1119 degrees of freedom. For a dataset from Kim and Cohen's review paper, an R program is given next.

> adhocBF <- function(x,y,delta,alpha){
+   # Ad-hoc Behrens-Fisher solution: compare t* with a t distribution on
+   # min(nx-1, ny-1) degrees of freedom; delta is the hypothesized difference
+   tstar <- (delta-mean(y)+mean(x))/sqrt(var(x)/length(x)+var(y)/length(y))
+   v <- min(length(x)-1,length(y)-1)
+   pval <- 2*(1-pt(tstar,v))
+   confint <- c(mean(y)-mean(x)-qt(1-alpha/2,v)*sqrt(var(x)/length(x)+
+   var(y)/length(y)),mean(y)-mean(x)+qt(1-alpha/2,v)*
+   sqrt(var(x)/length(x)+var(y)/length(y)))
+   return(list=c(tstar,pval,confint))
+ }
> x <- c(8,10,12,15)
> y <- c(1,7,11)
> adhocBF(x,y,delta=0,alpha=0.05)
[1]   1.5049258   0.2712717 -18.9736452   9.1403119

A more satisfactory solution for the Behrens-Fisher problem is given by Welch and we will discuss his solution with an R program.

Compute the value of the test statistic c07-math-1120 in the same way as c07-math-1121. Define $g_x = s_x^2/n_x$ and $g_y = s_y^2/n_y$. Define

7.71 $$\hat{\nu} = \frac{(g_x + g_y)^2}{g_x^2/(n_x-1) + g_y^2/(n_y-1)}.$$

The Welch solution is to carry out the test by using the value c07-math-1124 and comparing it with a c07-math-1125 random variate with c07-math-1126 degrees of freedom. As may be expected, in general c07-math-1127 is not an integer. In such a case we round off the value to the nearest integer.

> WelchBF <- function(x,y,alpha){
+     # Welch's solution: gx and gy are the estimated variances of the two
+     # sample means; vhat is the Welch-Satterthwaite degrees of freedom
+     gx <- var(x)/length(x); gy <- var(y)/length(y)
+     t <- (mean(x)-mean(y))/sqrt(gx+gy)
+     vhat <- (gx+gy)^2/(gx^2/(length(x)-1) + gy^2/(length(y)-1))
+     pval <- 2*(1-pt(t,round(vhat)))
+     ci <- qt(c(alpha/2,1-alpha/2),round(vhat))   # critical values at level alpha
+     return(list=c(t,pval,ci))
+ }
> WelchBF(x,y,alpha=0.05)
[1]  1.5049258  0.2294048 -3.1824463  3.1824463

For more details related to the Behrens-Fisher problem, refer to the review article of Kim and Cohen.


7.15 Multiple Comparison Tests

Consider the following hypothesis:

7.72 $$H: \mu_1 = \mu_2 = \cdots = \mu_k \quad \text{against} \quad K: \mu_i \ne \mu_j \ \text{for at least one pair } (i, j).$$

Here, we have a set of hypotheses to be tested, and this framework is popularly known as the multiple comparison test. Such hypotheses are very common in Experimental Designs, see Chapter 15. Suppose that c07-math-1130 denotes the mean yield due to the c07-math-1131-th treatment. In its general setup, the hypothesis says that none of the treatment means is significant. In case we fail to reject c07-math-1132, the conclusion is indeed that none of the treatment means is significant and the analysis stops. However, if we reject the hypothesis c07-math-1133, a host of questions then arise. In this case, the conclusion is that at least one treatment is significant, and the interest is then to identify such a treatment. A slight variant of the problem is testing against some pre-specified level, which is generally the mean of the control treatment.

Let us begin with a naive approach. That is, we consider c07-math-1134 hypotheses instead of a single hypothesis and consider the problem of testing c07-math-1135. Suppose each hypothesis is tested at level c07-math-1136. A simple exploration shows the dire consequence of this naive approach. The forthcoming program will show that the probability of one or more false rejections increases drastically with c07-math-1137.

> n <- c(1,2,5,10,50)
> alpha <- 0.05
> prob_rejection <- function(n,alpha) (1-(1-alpha)^{n})
> round(sapply(n,prob_rejection,alpha),2)
[1] 0.05 0.10 0.23 0.40 0.92

That is, the probability of at least one Type I error grows very fast and, with c07-math-1138, we are almost certain of having committed such an error. This motivates the next definition.

The goal of the multiple testing problem is to restrict the FWER to a pre-specified level c07-math-1143:

7.74 $$\text{FWER} = P\left(\text{reject at least one true } H_i\right) \le \alpha.$$

In the next section we will focus on two simple, yet useful, procedures for the multiple testing problem.

7.15.1 Bonferroni's Method

Bonferroni's method is a simple consequence of the Bonferroni inequality. Let c07-math-1145 be the c07-math-1146-value associated with hypothesis c07-math-1147. Then reject those hypotheses c07-math-1148 for which c07-math-1149. It may be easily verified in this case that

$$\text{FWER} = P\left(\bigcup_{i:\, H_i \text{ true}} \{p_i \le \alpha/n\}\right) \le \sum_{i:\, H_i \text{ true}} P\left(p_i \le \alpha/n\right) \le n \cdot \frac{\alpha}{n} = \alpha.$$

An illustration will be provided for the example provided in R.

7.15.2 Holm's Method

Consider the ordered c07-math-1153-values c07-math-1154 and let the associated hypotheses be c07-math-1155. The Holm procedure is a stepdown procedure and is described below, adapted from page 351 of Lehmann and Romano (2005).

  Step 1. If c07-math-1156, accept c07-math-1157 and stop. If c07-math-1158, reject c07-math-1159 and test the remaining c07-math-1160 hypotheses at level c07-math-1161.
  Step 2. If c07-math-1162 with c07-math-1163, accept c07-math-1164 and stop. If c07-math-1165 and c07-math-1166, reject c07-math-1167 and c07-math-1168 and test the remaining c07-math-1169 hypotheses at level c07-math-1170.
  Step 3. Continue the steps until c07-math-1171.

For a proof that Holm's method meets the requirement c07-math-1172, see Theorem 9.1.2 of Lehmann and Romano (2005). We will close the discussion with an example.
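A sketch in that spirit, using a hypothetical vector of p-values with R's built-in p.adjust function, is:

pvals <- c(0.001, 0.012, 0.030, 0.041, 0.20)     # hypothetical p-values
p.adjust(pvals, method = "bonferroni")
p.adjust(pvals, method = "holm")

Hypotheses whose adjusted p-value falls below the chosen level, say 0.05, are rejected; the Holm adjustment is never more conservative than Bonferroni's.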


7.16 The EM Algorithm*

7.16.1 Introduction

The Expectation-Maximization algorithm, more popularly known as the EM algorithm, is a very popular tool, not only among statisticians, but also among data miners. Wu and Kumar (2009) have selected the EM algorithm as one of the top ten useful algorithms for data miners. McLachlan and Krishnan (1998, 2008) give a rigorous mathematical introduction with a large number of illustrations of the EM algorithm. Little and Rubin (1987, 2002) is also one of the earliest books to give a detailed account of the algorithm. Dempster, Laird, and Rubin (1977) introduced the breakthrough EM algorithm and thereby enhanced the statistical methods which can accommodate missing data. This paper is also popularly referred to as the DLR paper. The introductory literature has so far been listed in reverse chronological order.

It is important to understand that the EM algorithm is not really an algorithm in the traditional sense of the technical word “algorithm”. It is a generic tool which gives rise to different statistical methods depending on the context of the application. Ripley, in a reply to an R user, has rightly explained this fact as: “The EM algorithm is not an algorithm for solving problems, rather an algorithm for creating statistical methods.”

In the context of handling missing data, Schafer (2000) has rightly said that “The key ideas behind EM and data augmentation are the same: to solve a difficult incomplete-data problem by repeatedly solving tractable complete-data problems.” Terry Speed (2008) has also said this about the EM algorithm: “I know many statisticians are deeply in love with the EM algorithm.”

“EMMIX” is probably one of the few software packages which implements the EM algorithm for mixtures of multivariate normal or c07-math-1174-distributions.

7.16.2 The Algorithm

In general, the EM algorithm is stated in two steps: the E-step and the M-step. We will begin with a description as given in McLachlan and Krishnan (2008). Let c07-math-1175 be a random vector and c07-math-1176 be its observed value. The sample space of c07-math-1177 is denoted by c07-math-1178. We will denote the pdf of c07-math-1179 by c07-math-1180, where c07-math-1181 is the vector of unknown parameters.

To make use of the EM algorithm, we will always pretend that c07-math-1182 is incomplete in the sense that the experiment consists of some values which we treat as missing data. That is, we will assume that we have missing data in c07-math-1183, and if this is augmented with c07-math-1184, we will have the complete data in c07-math-1185. Let c07-math-1186 denote the sample space of c07-math-1187.

The pdf of the complete observation c07-math-1188 will be denoted by c07-math-1189. Thus, under the assumption that c07-math-1190 is completely observed, the log-likelihood of c07-math-1191 is given by

7.75 equation

Clearly, as the sample space of the c07-math-1193's is larger than the c07-math-1194's, we have a many-to-one mapping from c07-math-1195 to c07-math-1196. Thus, the observed data can be written as the function c07-math-1197. Hence, we have the relationship

equation

Assume that we have an initial value as an estimate of c07-math-1199 in c07-math-1200. Using the observed data c07-math-1201 and c07-math-1202, we next specify the conditional probability distribution of c07-math-1203. Since the complete data log-likelihood c07-math-1204 is not observable, we will replace it by its conditional expectation given c07-math-1205 and c07-math-1206. This conditional expectation is the famous Q-function defined by

7.76 $$Q\left(\theta; \theta^{(0)}\right) = E\left[\log L_c(\theta) \mid y, \theta^{(0)}\right].$$

This is the famous E-step of the EM algorithm. In the M-step, we maximize c07-math-1208 to obtain c07-math-1209 such that

7.77 $$Q\left(\theta^{(1)}; \theta^{(0)}\right) \ge Q\left(\theta; \theta^{(0)}\right) \quad \text{for all } \theta.$$

Thus, the EM algorithm can be summarized as below:

  • E-Step: Calculate c07-math-1211, where

    7.78 c07-math-1212

  • M-Step: Select any value c07-math-1213 of c07-math-1214, such that

    7.79 c07-math-1215

The convergence criterion for the EM algorithm is that the difference c07-math-1216 should be approximately 0. This explanation of the EM algorithm in two steps can be found almost everywhere. However, we have found the five-step description of the EM algorithm by Gupta and Chen (2011) to be friendlier and, despite some repetition of the above content, we state it here.

  1. Set c07-math-1217 and obtain an initial estimate for c07-math-1218 as c07-math-1219.
  2. Assume that c07-math-1220 is the truth and, using the observed data c07-math-1221, completely specify the conditional probability distribution c07-math-1222 for the complete data c07-math-1223.
  3. Obtain the conditional expected log-likelihood, the Q-function: $Q\left(\theta \mid \theta^{(m)}\right) = E\left[\log L_c(\theta) \mid y, \theta^{(m)}\right]$.
  4. Find c07-math-1225, which maximizes c07-math-1226.
  5. Set c07-math-1227 and return to Step 2.

We understand that the EM algorithm is best illustrated through applications.

7.16.3 Introductory Applications

We will consider problems which have been widely used illustrating the EM algorithm.
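A classical illustration in this spirit is the two-component normal mixture. The following sketch (with simulated data, known component variances equal to 1, and hypothetical starting values) alternates the E- and M-steps described above:

set.seed(101)
x <- c(rnorm(150, mean = 0), rnorm(50, mean = 4))       # hypothetical data
p <- 0.5
mu1 <- as.numeric(quantile(x, 0.25)); mu2 <- as.numeric(quantile(x, 0.75))
for (iter in 1:100) {
  # E-step: posterior probability that each observation comes from component 2
  d1 <- (1 - p)*dnorm(x, mean = mu1); d2 <- p*dnorm(x, mean = mu2)
  w <- d2/(d1 + d2)
  # M-step: update the mixing weight and the two means by weighted averages
  p <- mean(w)
  mu1 <- sum((1 - w)*x)/sum(1 - w)
  mu2 <- sum(w*x)/sum(w)
}
c(p = p, mu1 = mu1, mu2 = mu2)    # should be near 0.25, 0, and 4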


7.17 Further Reading

7.17.1 Early Classics

Fisher! In the 1920s, Sir R.A. Fisher wrote a series of ground-breaking papers on inference. Fisher (1925–1954) gave a first account of what should form the fundamentals of inference. Kendall and Stuart (1945–79) is one of the earliest rigorous developments of inference. Cramér's (1946) book is one of the landmarks of inference. Lehmann (1958) gave a detailed account of the testing of hypotheses. Rao (1965–73) is one of the all-time classics and goes beyond the “linear” indicated in its title. Wilks (1962), Zacks (1971), and Cox and Hinkley (1973) are some of the other rigorous books on statistical inference.

Let us now look at some of the earlier books which introduce the subject at an elementary level. Snedecor and Cochran (1937–89) may have been the first book on “Statistical Methods”. Mood et al. (1950–74) is one of the earliest, elegant, and elementary introductions to statistics. Hoel et al. (1971), Hogg and Craig (1978), Hogg and Tanis (1977), and DeGroot and Schervish (2012) are also some of the best books written at their level.

In the Indian subcontinent, Das (1996) and Goon, et al. (1963) have written very useful texts.

7.17.2 Texts from the Last 30 Years

We do not intend to retain the chronological order of publication and simply jot down the texts which readily come to mind. As seen throughout this chapter, Mukhopadhyay (2000), Rohatgi and Saleh (2000), and Casella and Berger (2002) have influenced it a lot. Pawitan (2001) has been freely used for illustration of many concepts. Geisser and Johnson (2006) is a very compact work and will be useful for an expert to brush up on the details. Sen et al. (2009) is a very concise course on the recent topics in inference. Wasserman (2004) is an advanced text which the reader will find useful for the modern development of the subject. Keener (2010), Dekking et al. (2005), Liese and Miescke (2008), Knight (2000), Schervish (1995), and Shao (2003) are some of the finest written texts.

McLachlan and Krishnan (2008) is the first book to detail the EM algorithm. Huber and Ronchetti (2009) deals with the robustness of inference tools. As with the bibliography section of Chapter 5, we have again repeated a futile exercise.

7.18 Complements, Problems, and Programs

  1. Problem 7.1 For different values of c07-math-1331, obtain a plot of the curved normal family.

  2. Problem 7.2 Italicize the c07-math-1332-axis label in the expression part in Example 7.3.1.

  3. Problem 7.3 Find a sufficient statistic for c07-math-1333 when c07-math-1334.

  4. Problem 7.4 Suppose c07-math-1335 follows a negative binomial distribution with parameters as defined in Equation (6.20). Assume that for obtaining c07-math-1336 failures, c07-math-1337 is noted as 10. Obtain the likelihood function plot and then graphically infer about the ML estimate of c07-math-1338.

  5. Problem 7.5 In a directory on a particular folder of a hard disk drive, there are c07-math-1339 files. Suppose that in a random selection of c07-math-1340 files, 9 are observed to be e-books. Under the assumption of a hypergeometric distribution, and by using the likelihood function approach, give the ML estimate of c07-math-1341. Check Equation (6.30) if required to complete the R program.

  6. Problem 7.6 For the two likelihood functions of the multinomial distribution in Examples 7.9.1 and 7.9.2, plot the likelihood function for obtaining the ML estimates.

  7. Problem 7.7 Section 7.6 makes use of the functions optimize and mle to obtain the ML estimates. Will these techniques return the ML estimates for the parameters in the previous two examples? If the techniques fail for some values of the parameters and data, what may be the reason behind it?

  8. Problem 7.8 Using the Fisher score function technique, obtain the ML estimate in the previous three examples.

  9. Problem 7.9 For the galton dataset from the UsingR package, what will be the conclusion of the MP test that the height of the child is c07-math-1342 against c07-math-1343, given that the variance is known to be 1.7873?

  10. Problem 7.10 If the variance is unknown in the previous example, carry out the likelihood-ratio test, see LRNormalMean_UV, and draw the conclusion at the c07-math-1344 level of significance.

  11. Problem 7.11 In Section 7.12, it was mentioned that for a sample from a uniform distribution a UMP test exists for the hypothesis testing problem of c07-math-1345 against c07-math-1346. Obtain the UMP test and, if possible, an appropriate R program.

  12. Problem 7.12 Assume that the variances for the two treatments of the Youden-Beale problem are unknown and there is no reasonable way they can be assumed to be equal. Use the two tests (R programs) developed in Section 7.14 in adhocBF and WelchBF to draw the right conclusions.

  13. Problem 7.13 Carry out the multiple hypothesis testing problem, see glht function from the multcomp package, for the median polish regression model fitted in Section 4.5.2.

  14. Problem 7.14 Interpret the R program in Example 7.10.2.

  15. Problem 7.15 The c07-math-1347-test used on the galton dataset is t.test(galton$child,mu=mean(galton$parent)). However, there is a “pairing” between the height of the child and the parent. Is the test t.test(galton$child,galton$parent,paired=TRUE) more appropriate?
