Chapter 2

Basic Statistics

2.1 Introduction

In this chapter we present some basic nonparametric statistical procedures and show their computation in R. We begin with a brief example involving the distribution-free sign test. Then for the one-sample problem for continuous data, we present the signed-rank Wilcoxon nonparametric procedure and review the parametric t procedure. Next, we discuss inference based on bootstrapping (resampling). In the second part of the chapter, we turn our attention to discrete data. We discuss inference for the binomial probability models of the one- and two-sample problems, which we then generalize to the common goodness-of-fit χ2-tests including the usual tests of homogeneity of distributions and independence for discrete random variables. We next present McNemar’s test for significant change. We close the chapter with a brief discussion on robustness.

Our discussion focuses on the computation of these methods via R. More details of these nonparametric procedures can be found in the books by Hettmansperger and McKean (2011), Higgins (2003), and Hollander and Wolfe (1999). A more theoretical discussion of the χ2 goodness-of-fit tests can be found in Agresti (2002) or Hogg, McKean, and Craig (2013).

2.2 Sign Test

The sign test requires only very weak assumptions about the data. For instance, in comparing two objects, the sign test uses only the information that one object is better in some sense than the other.

As an example, suppose that we are comparing two brands of ice cream, say Brand A and Brand B. A blindfolded taster is given the ice creams in a randomized order with a washout period between tastes. His/her response is the preference of one ice cream over the other. For illustration, suppose that 12 tasters have been selected. Each taster is put through the blindfolded test. Suppose the results are such that Brand A is preferred by 10 of the tasters, Brand B by one of the tasters, and one taster has no preference. These data present pretty convincing evidence in favor of Brand A. How likely is such a result due to chance if the null hypothesis is true, i.e., no preference in the brands? As our sign test statistic, let S denote the number of tasters that prefer Brand A to Brand B. Then for our data S = 10. The null hypothesis is that there is no preference in brands; that is, one brand is selected over the other with probability 1/2. Under the null hypothesis, then, S has a binomial distribution with the probability of success of 1/2 and, in this case, n = 11 as the number of trials. A two-sided p-value can be calculated as follows.

> 2*dbinom(10,11,1/2)

[1] 0.01074219

On the basis of this p-value we would reject the null hypothesis at the 5% level. If we make a one-sided conclusion, we would say Brand A is preferred over Brand B.

The sign test is an example of a distribution-free (nonparametric) test. In the example, suppose we can measure numerically the goodness of taste. The distribution of the sign test under the null hypothesis does not depend on the distribution of the measure; hence, the term distribution-free. The distribution of the sign test, though, is not distribution-free under alternative hypotheses.

2.3 Signed-Rank Wilcoxon

The sign test discussed in the last section is a nonparametric procedure for the one-sample or paired problem. Although it requires few assumptions, its power can be low; for example, relative to the t-test when the data are normally distributed. In this section, we present the signed-rank Wilcoxon procedure, a nonparametric procedure whose power is nearly that of the t-test for normal distributions and generally greater than that of the t-test for distributions with heavier tails than the normal. More details for the Wilcoxon signed-rank test can be found in the references cited in Section 2.1. We discuss these two nonparametric procedures and the t-procedure for the one-sample location problem, showing their computation using R. For each procedure, we also discuss the R syntax for computing the associated estimate and confidence interval for the location effect.

We begin by defining a location model to set the stage for future discussions. Let X1,X2,...,Xn denote a random sample which follows the model

Xi = θ + ei,   i = 1, ..., n,     (2.1)

where, to simplify discussion, we assume that the random errors, e1, ..., en are independent and identically distributed (iid) with a continuous probability density function f(t) which is symmetric about 0. We call this model the location model. Under the assumption of symmetry any location measure (parameter) of Xi, including the mean and median, is equal to θ. Suppose we are interested in testing the hypotheses

H0 : θ = 0   versus   HA : θ > 0.     (2.2)

The sign test of the last section is based on the test statistic

S = Σ_{i=1}^n sign(Xi),     (2.3)

where sign(t) = −1, 0, or 1 for t < 0, t = 0, or t > 0, respectively. Let

S+ = #i{Xi > 0}.     (2.4)

Then S = 2S+ − n. This assumes that none of the Xi’s is equal to 0. In practice, generally observations with value 0 are omitted and the sample size is reduced accordingly. Note that under H0, S+ has a binomial distribution with n trials and probability of success 1/2. Hence, critical values of the test are obtained from the binomial distribution. Since the null distribution of S does not depend on f(t), we say that the sign test is distribution-free. Let s+ denote the observed (realized) value of S+ when the sample is drawn. Then the p-value of the sign test for the hypotheses (2.2) is PH0(S+ ≥ s+) = 1 − FB(s+ − 1; n, 0.5), where FB(t; n, p) denotes the cdf of a binomial distribution over n trials with probability of success p (pbinom is the R function which returns the cdf of a binomial distribution).
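
For instance, a minimal sketch of this calculation in R, using the ice cream preference data of Section 2.2 where s+ = 10 and n = 11:

splus <- 10; n <- 11              # observed S+ and number of tasters with a preference (Section 2.2)
1 - pbinom(splus - 1, n, 0.5)     # one-sided p-value, about 0.0059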

The traditional t-test of the hypotheses (2.2) is based on the sum of the observations.1 The distribution of the statistic T depends on the population pdf f(x). In particular, it is not distribution-free. The usual form of the test is the t-ratio

t = ˉX / (s/√n),     (2.5)

where ˉX and s are, respectively, the mean and standard deviation of the sample. If the population is normal, then t has a Student t-distribution with n − 1 degrees of freedom. Let t0 be the observed value of t. Then the p-value of the t-test for the hypotheses (2.2) is PH0(t ≥ t0) = 1 − FT(t0; n − 1), where FT(t; ν) denotes the cdf of the Student t-distribution with ν degrees of freedom (pt is the R function which returns the cdf of a t-distribution). This is an exact p-value if the population is normal; otherwise it is an approximation.

The difference between the t-test and the sign test is that the t-test statistic is a function of the distances of the sample items from 0 in addition to their signs. The signed-rank Wilcoxon test statistic, however, uses only the ranks of these distances. Let R|Xi| denote the rank of |Xi| among |X1|,..., |Xn|, from low to high. Then the signed-rank Wilcoxon test statistic is

W = Σ_{i=1}^n sign(Xi) R|Xi|.     (2.6)

Unlike the t-test statistic, W is distribution-free under H0. Its distribution, though, cannot be obtained in closed-form. There are iterated algorithms for its computation which are implemented in R (psignrank, qsignrank, etc.). Usually the statistic computed is the sum of the ranks of the positive items, W+, which is

W+ = Σ_{Xi>0} R|Xi| = (1/2)W + n(n + 1)/4.     (2.7)

The R function psignrank computes the cdf of W+. Let w+ be the observed value of W+. Then, for the hypotheses (2.2), the p-value of the signed-rank Wilcoxon test is PH0(W+ ≥ w+) = 1 − FW+(w+ − 1; n), where FW+(x; n) denotes the cdf of the signed-rank Wilcoxon distribution for a sample of size n.
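
As a minimal sketch of this calculation, consider w+ = 32 and n = 8, the values which arise in Example 2.3.1 below:

wplus <- 32; n <- 8               # observed W+ and sample size (see Example 2.3.1)
1 - psignrank(wplus - 1, n)       # one-sided p-value, about 0.0273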

2.3.1 Estimation and Confidence Intervals

Each of these three tests has an associated estimate and confidence interval for the location effect θ of Model (2.1). They are based on inversions2 of the associated process. In this section we present the results and then show their computation in R. As in the last section, assume that we have modeled the sample X1, X2, ..., Xn as the location model given in expression (2.1).

The confidence intervals discussed below involve the order statistics of a sample. We denote the order statistics with the usual notation; that is, X(1) is the minimum of the sample, X(2) is the next smallest, ..., and X(n) is the maximum of the sample. Hence, X(1) < X(2) < ... < X(n). For example, if the sample results in x1 = 51, x2 = 64, x3 = 43, then the ordered sample is given by x(1) = 43, x(2) = 51, x(3) = 64.

The estimator of the location parameter θ associated with the sign test is the sample median, which we write as

ˆθ = median{X1, X2, ..., Xn}.     (2.8)

For 0 < α < 1, a corresponding confidence interval for θ of confidence (1 − α)100% is given by (X(c1+1), X(n−c1)), where X(i) denotes the ith order statistic of the sample and c1 is the α/2 quantile of the binomial distribution, i.e., FB(c1; n, 0.5) = α/2; see Section 1.3 of Hettmansperger and McKean (2011) for details. This confidence interval is distribution-free and, hence, has exact confidence (1 − α)100% for any random error distribution. Due to the discreteness of the binomial distribution, for each value of n, there is a limited number of values for α. Approximate interpolated confidence intervals for the median are presented in Section 1.10 of Hettmansperger and McKean (2011).
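
A rough sketch of computing this interval in R, assuming the sample is in the vector x; here qbinom is used to approximate c1, so the achieved confidence is at least (1 − α)100%:

alpha <- 0.05
n <- length(x)
c1 <- qbinom(alpha/2, n, 0.5) - 1   # largest c1 with FB(c1; n, 0.5) <= alpha/2
sort(x)[c(c1 + 1, n - c1)]          # (X(c1+1), X(n-c1))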

With regard to the t-test, the associated estimator of location is the sample mean ˉX. The usual confidence interval for θ is (ˉX − tα/2,n−1 s/√n, ˉX + tα/2,n−1 s/√n), where FT(−tα/2,n−1; n − 1) = α/2. This interval has the exact confidence of (1 − α)100% provided the population is normal. If the population is not normal, then the confidence coefficient is approximately (1 − α)100%. Note that the t-procedures are not distribution-free.

For the signed-rank Wilcoxon, the estimator of location is the Hodges-Lehmann estimator which is given by

ˆθW = med_{i≤j} {(Xi + Xj)/2}.     (2.9)

The pairwise averages Aij = (Xi + Xj)/2, i ≤ j, are called the Walsh averages of the sample. Let A(1) < ⋯ < A(n(n+1)/2) denote the ordered Walsh averages. Then a (1 − α)100% confidence interval for θ is

(A(c2+1), A([n(n+1)/2]−c2)),

where c2 is the α/2 quantile of the signed-rank Wilcoxon distribution. Provided the random error pdf is symmetric, this is a distribution-free confidence interval which has exact confidence (1 − α)100%. Note that the range of W+ is the set {0, 1, ..., n(n + 1)/2}, which is of order n². So for moderate sample sizes the signed-rank Wilcoxon does not have the discreteness problems that the inference based on the sign test has; meaning the achievable α is close to the desired level.
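
A minimal sketch of computing the Walsh averages and the Hodges–Lehmann estimate directly, again assuming the sample is in the vector x:

walsh <- outer(x, x, "+")/2                     # pairwise averages (Xi + Xj)/2
walsh <- walsh[lower.tri(walsh, diag = TRUE)]   # keep one copy of each pair, i <= j
median(walsh)                                   # Hodges-Lehmann estimate (2.9)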

2.3.2 Computation in R

The signed-rank Wilcoxon and t procedures can be computed by the intrinsic R functions wilcox.test and t.test, respectively. Suppose x is the R vector containing the sample items. Then for the two-sided signed-rank Wilcoxon test of H0 : θ = 0, the call is

wilcox.test(x,conf.int=TRUE).

This returns the value of the test statistic W+, the p-value of the test, the Hodges–Lehmann estimate of θ, and the distribution-free 95% confidence interval for θ. The t.test function has similar syntax. The default hypothesis is two-sided. For the one-sided hypothesis HA : θ > 0, use alternative="greater" as an argument. If we are interested in testing the null hypothesis H0 : θ = 5, for example, use mu=5 as an argument. For, say, a 90% confidence interval use the argument conf.level=0.90. For more information see the help page (help(wilcox.test)). Although the sign procedure does not have an intrinsic R function, it is simple to code such a function. One such R function is given in Exercise 2.8.7.
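
For instance, a call combining several of these arguments might look as follows (a sketch, with x again denoting the sample vector):

wilcox.test(x, mu = 5, alternative = "greater", conf.int = TRUE, conf.level = 0.90)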

Example 2.3.1 (Nursery School Intervention).

This dataset is drawn from a study discussed by Siegel (1956). It involves eight pairs of identical twins who are of nursery school age. In the study, for each pair, one is randomly selected to attend nursery school while the other remains at home. At the end of the study period, all 16 children are given the same social awareness test. For each pair, the response of interest is the difference in the twins’ scores, (Twin at School – Twin at Home). Let θ be the true median effect. As discussed in Remark 2.3.1, the random selection within a pair ensures that the response is symmetrically distributed under H0 : θ = 0. So the signed-rank Wilcoxon process is appropriate for this study. The following R session displays the results of the signed-rank Wilcoxon and the Student t-tests for one-sided tests of H0 : θ = 0 versus HA : θ > 0.

> school<-c(82,69,73,43,58,56,76,65)
> home<-c(63,42,74,37,51,43,80,62)
> response <- school - home
> wilcox.test(response,alternative="greater",conf.int=TRUE)

   Wilcoxon signed rank test

data: response
V = 32, p-value = 0.02734
alternative hypothesis: true location is greater than 0
95 percent confidence interval:
  1 Inf
sample estimates:
(pseudo)median
  7.75

> t.test(response,alternative="greater",conf.int=TRUE)

   One Sample t-test

data: response
t = 2.3791, df = 7, p-value = 0.02447
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 1.781971  Inf
sample estimates:
mean of x
 8.75

Both procedures reject the null hypothesis at level 0.05. Note that the one-sided test option forces a one-sided confidence interval. To obtain a two-sided confidence interval use the two-sided option.

Remark 2.3.1 (Randomly Paired Designs).

The design used in the nursery school study is called a randomly paired design. For such a design, the experimental unit is a block of length two. In particular, in the nursery school study, the block was a set of identical twins. The factor of interest has two levels or there are two treatments. Within a block, the treatments are assigned at random, say, by a flip of a fair coin. Suppose H0 is true; i.e., there is no treatment effect. If d is a response realization, then whether we observe d or –d depends on whether the coin came up heads or tails. Hence, D and –D have the same distribution; i.e., D is symmetrically distributed about 0. Thus the symmetry assumption for the signed-rank Wilcoxon test automatically holds.

As a last example, we present the results of a small simulation study.

Example 2.3.2.

Which of the two tests, the signed-rank Wilcoxon or the t-test, is the more powerful? The answer depends on the distribution of the random errors. Discussions of the asymptotic power of these two tests can be found in the references cited at the beginning of this chapter. In this example, however, we compare the powers of these two tests empirically for a particular situation. Consider the situation where the random errors of Model (2.1) have a t-distribution with 2 degrees of freedom. Note that it suffices to use a standardized distribution such as this because the tests and their associated estimators are equivariant to location and scale changes. We are interested in the two-sided test of H0 : θ = 0 versus HA : θ ≠ 0 at level α = 0.05. The R code below obtains 10,000 samples from this situation. For each sample, it records the p-values of the two tests. Then the empirical power of a test is the proportion of times its p-value is less than or equal to 0.05.

n = 30; df = 2; nsims = 10000; mu = .5; collwil = rep(0,nsims)
collt = rep(0,nsims)
for(i in 1:nsims){
 x = rt(n,df) + mu
 wil = wilcox.test(x)
 collwil[i] = wil$p.value
 ttest = t.test(x)
 collt[i] = ttest$p.value
}
powwil = rep(0,nsims); powwil[collwil <= .05] = 1
powerwil = sum(powwil)/nsims
powt = rep(0,nsims); powt[collt <= .05] = 1
powert = sum(powt)/nsims

We ran this code for the three situations: θ = 0 (null situation) and the two alternative situations with θ = 0.5 and θ = 1. The empirical powers of the tests are:

Test        θ = 0    θ = 0.5    θ = 1
Wilcoxon    0.0503   0.4647     0.9203
t           0.0307   0.2919     0.6947

The empirical α level of the signed-rank Wilcoxon test is close to the nominal value of 0.05, which is not surprising because it is a distribution-free test. On the other hand, the t-test is somewhat conservative. In terms of power, the signed-rank Wilcoxon test is much more powerful than the t-test. So in this situation, the signed-rank Wilcoxon is the preferred test.

2.4 Bootstrap

As computers have become more powerful, the bootstrap, as well as resampling procedures in general, has gained widespread use. The bootstrap is a general tool that is used to measure the error in an estimate or the significance of a test of hypothesis. In this book we demonstrate the bootstrap for a variety of problems, though we still only scratch the surface; the reader interested in a thorough treatment is referred to Efron and Tibshirani (1993) or Davison and Hinkley (1997). In this section we illustrate estimation of confidence intervals and p-values for the one-sample and paired location problems.

To fix ideas, recall that a histogram of the sample is often used to provide context for the distribution of the random variable (e.g., location, variability, shape). One way to think of the bootstrap is that it is a procedure to provide some context for the sampling distribution of a statistic. A bootstrap sample is simply a sample from the original sample taken with replacement. The idea is that if the sample is representative of the population, or more concretely, the histogram of the sample resembles the pdf of the random variable, then sampling from the sample is representative of sampling from the population. Doing so repeatedly will yield an estimate of the sampling distribution of the statistic.

R offers a number of capabilities for implementing the bootstrap. We begin with an example which illustrates the bootstrap computed by the R function sample. Using sample is useful for illustration; however, in practice one will likely want to use one of R’s existing functions, and so the boot library (Canty and Ripley 2013) is also discussed.

Example 2.4.1.

To illustrate the use of the bootstrap, first generate a sample of size 25 from a normal distribution with mean 30 and standard deviation 5.

> x<-rnorm(25,30,5)

In the following code segment we obtain 1000 bootstrap samples and for each sample we calculate the sample mean. The resulting vector xbar contains the 1000 sample means. Figure 2.1 contains a histogram of the 1000 estimates. We have also included a plot of the true pdf of the sampling distribution of ˉX; i.e., a N(30, 5²/25) distribution.

Figure 2.1

Histogram of 1000 bootstrap estimates of the sample mean based on a sample of size n = 25 from a N(30, 5²) distribution. The pdf of a N(30, 1) is overlaid.

> B<-1000 # number of bootstrap samples to obtain
> xbar<-rep(0,B)
> for(i in 1:B) {
+ xbs<-sample(x,length(x),replace=TRUE)
+ xbar[i]<-mean(xbs)
+}

The standard deviation of the bootstrap sampling distribution may serve as an estimate of the standard error of the estimate.

> se.xbar<-sd(xbar)
> se.xbar
[1] 0.9568816

The estimated standard error may then be used for inference. For example, as we know the sample mean is normally distributed, we can calculate an approximate 95% confidence interval using t-critical values as follows. We have included the usual t-interval for comparison.

> tcv<-qt(0.975,length(x)-1)
> mean(x)+c(-1,1)*tcv*se.xbar

[1] 28.12227 31.87773

> mean(x)+c(-1,1)*tcv*sd(x)/sqrt(length(x))

[1] 29.89236 30.10764

2.4.1 Percentile Bootstrap Confidence Intervals

In Example 2.4.1 we presented a simple confidence interval based on a bootstrap estimate of standard error. Such an estimate requires assumptions on the sampling distribution of the estimate; for example, that the sampling distribution is symmetric and that the use of t-critical values is appropriate. In this section we present an alternative, the percentile bootstrap confidence interval, which is free of such assumptions. Let ˆθ be any location estimator.

Let x = [x1, ..., xn]T denote a vector of observations from the distribution F. Let ˆθ denote the estimate of θ based on this sample. Define the empirical cumulative distribution function of the sample by

ˆFn(t) = (1/n) Σ_{i=1}^n I(xi ≤ t).     (2.10)

Then a bootstrap sample is a sample taken with replacement from ˆFn; i.e., x*1, ..., x*n are iid ˆFn. Denote this sample by x* = [x*1, ..., x*n]T. Let ˆθ = T(x) be the estimate based on the original sample. Similarly, ˆθ* = T(x*) is the estimate based on the bootstrap sample. The bootstrap process is repeated a large number of times, say B, from which we obtain ˆθ*1, ..., ˆθ*B. Since the empirical distribution of the bootstrap estimates approximates the sampling distribution of ˆθ, we may use it to quantify the uncertainty in our estimate ˆθ. To obtain our confidence interval, we order the estimates ˆθ*(1) ≤ ˆθ*(2) ≤ ... ≤ ˆθ*(B).

Let m = [α/2 * B]. Then (ˆθ*(m), ˆθ*(B−m)) is an approximate (1 − α)100% confidence interval for θ. That is, the end points of the percentile bootstrap confidence interval are the α/2 and 1 − α/2 percentiles of the empirical distribution of the ˆθ*i's.

Returning again to our example, let T(x) = (1/n) Σ_{i=1}^n xi. The following code segment obtains a 95% bootstrap percentile confidence interval.

> quantile(xbar,probs=c(0.025,0.975),type=1)

 2.5% 97.5%
28.30090 32.14894

> m<-0.025*1000
> sort(xbar)[c(m,B-m)]

[1] 28.30090 32.14894

Next we illustrate the use of the boot library to arrive at the same result.

> library(boot)
> bsxbar<-boot(x,function(x,indices) mean(x[indices]), B)
> boot.ci(bsxbar)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = bsxbar)

Intervals :
Level  Normal   Basic
95%  (29.89, 30.11) (29.89, 30.11)

Level Percentile  BCa
95%  (29.89, 30.11) (29.89, 30.12)
Calculations and Intervals on Original Scale

> quantile(bsxbar$t,probs=c(0.025,0.975),type=1)

 2.5% 97.5%
29.88688 30.11383

2.4.2 Bootstrap Tests of Hypotheses

In bootstrap testing, the resampling is conducted under conditions that ensure the null hypothesis, H0, is true. This allows the formulation of a bootstrap p-value. In this section we illustrate the use of the bootstrap testing procedure by applying it to the paired and one-sample problems discussed in Sections 2.2 and 2.3.

The bootstrap testing procedure for the paired problem is as follows. First sample with replacement from the set of pairs; then the treatment is applied at random within the pair. Notice this preserves the correlation of the paired design. If d1, ..., dn denote the differences based on the original sample data, then in the bootstrap sample, if the ith pair is selected, di and −di each have probability 1/2 of being in the bootstrap sample; hence the null hypothesis is true. Let T*1, ..., T*B be the test statistics based on the B bootstrap samples. These form an estimate of the null distribution of the test statistic T. The bootstrap p-value is then calculated as

p-value = #{T*i ≥ T} / B.

Example 2.4.2 (Nursery School Intervention Continued).

There is more than one way to implement the bootstrap testing procedure for the paired problem; the following is one which utilizes the set of differences.

> d<-school-home
> dpm<-c(d,-d)

Then dpm contains all the 2n possible differences. Obtaining bootstrap samples from this vector ensures the null hypothesis is true. In the following we first obtain B = 5000 bootstrap samples and store them in the vector dbs.

> n<-length(d)
> B<-5000
> dbs<-matrix(sample(dpm,n*B,replace=TRUE),ncol=n)

Next we will use the apply function to obtain the Wilcoxon test statistic for each bootstrap sample. First we define a function which will return the value of the test statistic.

> wilcox.teststat<-function(x) wilcox.test(x)$statistic
> bs.teststat<-apply(dbs,1,wilcox.teststat)
> mean(bs.teststat>=wilcox.teststat(d))

[1] 0.0238

Hence, the p-value = 0.0238 and is significant at the 5% level.

For the second problem, consider the one-sample location problem where the goal is to test the hypothesis

H0 : θ = θ0   versus   HA : θ > θ0.

Let x1,..., xn be a random sample from a population with location θ. Let ˆθ=T(x) be an estimate of θ.

To ensure the null hypothesis is true, that we are sampling from a distribution with location θ0, we take our bootstrap samples from

x1 − ˆθ + θ0, ..., xn − ˆθ + θ0.     (2.11)

Denote the bootstrap estimates as ˆθ*1,...,ˆθ*B. Then the bootstrap p-value is given by

#{ˆθ*i ≥ ˆθ} / B.

We illustrate this bootstrap test with the following example.

Example 2.4.3 (Bootstrap test for sample mean).

In the following code segment we first take a sample from a N(1.5, 1) distribution. Then we illustrate a test of the hypothesis

H0 : θ = 1   versus   HA : θ > 1.

The sample size is n = 25.

> x<-rnorm(25,1.5,1)
> thetahat<-mean(x)
> x0<-x-thetahat+1 #theta0 is 1
> mean(x0) # notice H0 is true

[1] 1

> B<-5000
> xbar<-rep(0,B)
> for(i in 1:B) {
+ xbs<-sample(x0,length(x),replace=TRUE)
+ xbar[i]<-mean(xbs)
+}
> mean(xbar>=thetahat)

[1] 0.02

In this case the p-value = 0.02 is significant at the 5% level.

2.5 Robustness*

In this section, we briefly discuss the robustness properties of the three estimators discussed so far in this chapter, namely, the mean, the median, and the Hodges–Lehmann estimate. Three of the main concepts in robustness are efficiency, influence, and breakdown. In Chapter 3, we touch on efficiency, while in this section we briefly explore the other two concepts for the three estimators.

The finite sample version of the influence function of an estimator is its sensitivity curve. It measures the change in an estimator when an outlier is added to the sample. More formally, let the vector xn = (x1, x2,..., xn)T denote a sample of size n. Let ˆθn=ˆθn(xn) denote an estimator. Suppose we add a value x to the sample to form the new sample xn+1=(x1,x2,...,xn,x)T of size n + 1. Then the sensitivity curve for the estimator is defined by

S(x; ˆθ) = (ˆθn+1 − ˆθn) / (1/(n + 1)).     (2.12)

The value S(x;ˆθ) measures the rate of change of the estimator at the outlier x.
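
A minimal sketch of how a sensitivity curve can be computed for a given estimator; the function name and grid below are illustrative choices, not part of the text:

sens.curve <- function(x, est, xgrid) {
  n <- length(x)
  # S(x; thetahat) = (thetahat_{n+1} - thetahat_n) / (1/(n+1))
  sapply(xgrid, function(pt) (est(c(x, pt)) - est(x)) * (n + 1))
}
# e.g., sens.curve(x, mean, seq(-20, 20, by = 0.1)) or sens.curve(x, median, seq(-20, 20, by = 0.1))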

As an illustration consider the sample

{1.85, 2.35, −3.85, −5.25, −0.15, 2.15, 0.15, −0.25, −0.55, 2.65}.

The sample mean of this dataset is −0.09 while the median and Hodges–Lehmann estimates are both 0.0. The top panel of Figure 2.2 shows the sensitivity curves of the three estimators, the mean, median, and Hodges–Lehmann, for this sample when x is in the interval (−20, 20). Note that the sensitivity curve for the mean is unbounded; as the outlier becomes large, the rate of change of the mean becomes large. On the other hand, the median and the Hodges–Lehmann estimators change slightly as x changes sign, but their changes soon become constant no matter how large |x| is. The sensitivity curves for the median and Hodges–Lehmann estimates are bounded.

Figure 2.2

The top panel shows the sensitivity curves for the mean, median, and Hodges–Lehmann estimators for the sample given in the text. The bottom panel displays the influence functions of the three estimators.

While intuitive, a sensitivity curve depends on the sample items. Its theoretical analog is the influence function, which measures the rate of change of the functional of the estimator at the probability distribution, F(t), of the random errors of the location model. We say an estimator is robust if its influence function is bounded. Up to a constant of proportionality and centering, the influence functions of the mean, median, and Hodges–Lehmann estimators at a point x are respectively x, sign(x), and F(x) − 0.5. Hence, the median and the Hodges–Lehmann estimators are robust, while the mean is not. The influence functions of the three estimators are displayed in the lower panel of Figure 2.2 for a normal probability model. For the median and Hodges–Lehmann estimators, they are smooth versions of their respective sensitivity curves.

To briefly define the breakdown point of an estimator, consider again a sample xn = (x1, x2,..., xn)T from a location model with parameter θ. Let ˆθ=ˆθ(xn) be an estimator of θ. Suppose we contaminate m points in the sample, so that the sample becomes

x*n=(x*1,...,x*m,xm+1,...,xn)T,

where x*1, ..., x*m are the contaminated points. Think of the contaminated points as very large (nearly ∞) in absolute value. The smallest value of m so that the value of the estimator ˆθ(x*n) becomes meaningless is the breaking point of the estimator, and the ratio m/n is called the finite sample breakdown point of ˆθ. If this ratio converges to a finite value, we call this value the breakdown point of the estimator. Notice for the sample mean that one point of contamination suffices to make the mean meaningless (as x*1 → ∞, ˉx → ∞). Hence, the breakdown point of the mean is 0. On the other hand, the sample median can tolerate almost half of the data being contaminated. In the limit, its ratio converges to 0.50. So we say that the median has 50% breakdown. The Hodges–Lehmann estimate has breakdown point 0.29; see, for instance, Chapter 1 of Hettmansperger and McKean (2011).

In summary, the sample median and the Hodges–Lehmann estimator are robust, with positive breakdown points. The mean is not robust and has breakdown point 0. Of the two robust estimators, because of its higher breakdown point, it would seem that the median is preferred over the Hodges–Lehmann estimator. This, however, ignores the efficiency of the estimators, which is discussed in Chapter 3. Efficiency generally reverses this preference.

2.6 One- and Two-Sample Proportion Problems

For this and the next section, we focus on discrete variables. Recall that X is a discrete random variable if its range consists of categories. In this section, we consider discrete random variables whose ranges consist of two categories which we generally label as failure (0) and success (1). Let X denote such a random variable. Let p denote the probability of success. Then we say that X has a Bernoulli distribution with the probability model

x            0       1
P(X = x)   1 − p     p

It is easy to show that the mean of X is p and that the variance of X is p(1–p).

2.6.1 One-Sample Problems

Statistical problems consist of estimating p, forming confidence intervals for it, and testing hypotheses of the form

H0 : p = p0   versus   HA : p ≠ p0,     (2.13)

where p0 is specified. One-sided hypotheses can be formulated similarly.

Let X1, ..., Xn be a random sample on X. Let S be the total number of successes in the sample of size n. Then S has a binomial distribution with pmf

P(S = j) = (n choose j) p^j (1 − p)^(n−j),   j = 0, 1, ..., n.     (2.14)

The estimate of p is the sample proportion of successes; i.e.,

ˆp = S/n.     (2.15)

Based on the asymptotic normality of S, an approximate (1 – α) 100% confidence interval for p is

(ˆp − zα/2 √(ˆp(1 − ˆp)/n),  ˆp + zα/2 √(ˆp(1 − ˆp)/n)).     (2.16)

Example 2.6.1 (Squeaky Hip Replacements).

As a numerical example, Devore (2012), page 284, reports on a study of 143 subjects who have obtained ceramic hip replacements. Ten of the subjects in the study reported that their hip replacements squeaked. Consider patients who receive such a ceramic hip replacement and let p denote the true proportion of those whose replacement hips develop a squeak. Based on the data, we next compute3 the estimate of p and a confidence interval for it.

> phat<-10/143
> zcv<-qnorm(0.975)
> phat+c(-1,1)*zcv*sqrt(phat*(1-phat)/143)

[1] 0.02813069 0.11172945

Hence, we estimate between roughly 3 and 11% of patients who receive ceramic hip replacements such as the ones in the study will report squeaky replacements.

Asymptotic tests of hypotheses involving proportions, such as (2.13), are often used. For hypotheses (2.13), the usual test is to reject H0 in favor of HA, if |z| is large, where

z = (ˆp − p0) / √(p0(1 − p0)/n)     (2.17)

Note that z has an asymptotic N(0,1) distribution under H0, so an equivalent test statistic is based on χ2 = z2. The p-value for a two-sided test is p-value = P[χ2(1) > (Observed χ2)]. This χ2-formulation is the test and p-value computed by the R function prop.test with correct=FALSE indicating that a continuity correction not be applied. The two-sided hypothesis is the default, but one-sided hypotheses can be tested by specifying the alternative argument. The null value of p is set by the argument p.

Example 2.6.2 (Left-Handed Professional Ball Players).

As an example of this test, consider testing whether the proportion of left-handed professional baseball players is the same as the proportion of left-handed people in the general population, which is about 0.15. For our sample we use the dataset baseball that consists of observations on 59 professional baseball players, including throwing hand ('L' or 'R'). The following R segment computes the test:

> ind<-with(baseball,throw=='L')
> prop.test(sum(ind),length(ind),p=0.15,correct=FALSE)

   1-sample proportions test without continuity correction

data: sum(ind) out of length(ind), null probability 0.15
X-squared = 5.0279, df = 1, p-value = 0.02494
alternative hypothesis: true p is not equal to 0.15
95 percent confidence interval:
 0.1605598 0.3779614
sample estimates:
   p
0.2542373

Because the p-value of the test is 0.02494, H0 is rejected at the 5% level.

The above inference is based on the asymptotic distribution of S, the number of successes in the sample. This statistic, though, has a binomial distribution, (2.14), and inference can be formulated based on it. This includes finite sample tests and confidence intervals. For a given level α, though, these confidence intervals are conservative; that is, their true confidence level is at least 1 − α; see Section 4.3 of Hogg et al. (2013). These tests and confidence intervals are computed by the R function binom.test. We illustrate its computation for the baseball example.

> binom.test(sum(ind),59,p=.15)

   Exact binomial test

data: sum(ind) and 59
number of successes = 15, number of trials = 59, p-value = 0.04192
alternative hypothesis: true probability of success is not equal to 0.15
95 percent confidence interval:
 0.1498208 0.3844241
sample estimates:
probability of success
    0.2542373

Note that the confidence interval traps p = 0.15, even though the two-sided test rejects H0. This example illustrates the conservativeness of the finite sample confidence interval.

2.6.2 Two-Sample Problems

Consider two Bernoulli random variables X and Y with respective probabilities of success p1 and p2. The parameter of interest is the difference in proportions p1p2. Inference concerns estimates of this difference, confidence intervals for it, and tests of hypotheses of the form

H0 : p1 = p2   versus   HA : p1 ≠ p2.     (2.18)

Let X1, ..., Xn1 and Y1, ..., Yn2 be random samples on X and Y, respectively. Assume that the samples are independent of one another. Section 2.7.4 discusses the paired (dependent) case. The estimate of the difference in proportions is the difference in sample proportions, i.e., ˆp1 − ˆp2.

It follows that the estimator ˆp1 − ˆp2 has an asymptotic normal distribution. Based on this, a (1 − α)100% asymptotic confidence interval for p1 − p2 is

ˆp1 − ˆp2 ± zα/2 √(ˆp1(1 − ˆp1)/n1 + ˆp2(1 − ˆp2)/n2).     (2.19)

For the hypothesis (2.18), there are two test statistics which are used in practice. The Wald-type test is the standardization of ˆp1 − ˆp2 based on its standard error (the square-root term in expression (2.19)). The more commonly used test statistic is the scores test which standardizes under H0. Under H0 the population proportions are the same; hence the following average

ˆp = (n1ˆp1 + n2ˆp2) / (n1 + n2)     (2.20)

is an estimate of the common proportion. The scores test statistic is given by

z = (ˆp1 − ˆp2) / √(ˆp(1 − ˆp)(1/n1 + 1/n2)).     (2.21)

This test statistic is compared with z-critical values. As in the one-sample problem, the χ2-formulation, χ2 = z2, is often used. We illustrate the R computation of this analysis with the next example.

Example 2.6.3 (Polio Vaccine).

Rasmussen (1992), page 355, discusses one of the original clinical studies for the efficacy of the Salk polio vaccine which took place in 1954. The effectiveness of the vaccine was not known and there were fears that it could even cause polio since the vaccine contained live virus. Children with parental written consent were randomly divided into two groups. Children in the treatment group (1) were injected with the vaccine while those in the control or placebo group (2) were injected with a biologically inert solution. Let p1 and p2 denote the true proportions of children who get polio in the treatment and control groups, respectively. The hypothesis of interest is the two-sided hypothesis (2.18). The following data are taken from Rasmussen (1992):

Group        No. Children   No. Polio Cases   Sample Proportion
Treatment    200,745        57                0.000284
Control      201,229        199               0.000989

The R function for the analysis is the same function used in the one-sample proportion problem, namely, prop.test. The first and second arguments are respectively the vectors c(S1,S2) and c(n1,n2), where S1 and S2 are the number of successes for the two samples. The default hypotheses are the two-sided hypotheses (2.18). The following R segment provides the analysis for the polio vaccine data.

> prop.test(c(57,199),c(200745,201229),correct=FALSE)

   2-sample test for equality of proportions without
   continuity correction

data: c(57, 199) out of c(200745, 201229)
X-squared = 78.4741, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0008608391 -0.0005491224
sample estimates:
  prop 1  prop 2
0.0002839423 0.0009889231

The χ2 test statistic has the value 78.4741 with a p-value that is zero to 15 places; hence, the null hypothesis would certainly be rejected. The direction indicates that the vaccine has been effective.
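
To connect this output with expressions (2.20) and (2.21), the scores test statistic can also be computed directly; a minimal sketch:

n1 <- 200745; n2 <- 201229
p1 <- 57/n1; p2 <- 199/n2
phat <- (n1*p1 + n2*p2)/(n1 + n2)                    # pooled estimate, expression (2.20)
z <- (p1 - p2)/sqrt(phat*(1 - phat)*(1/n1 + 1/n2))   # scores statistic, expression (2.21)
z^2                                                  # agrees with the X-squared value above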

2.7 χ2 Tests

In Section 2.6.1, we discussed inference for Bernoulli (binomial) random variables; i.e., discrete random variables with a range consisting of two categories. In this section, we extend this discussion to discrete random variables whose range consists of a general number of categories. Recall that the tests of Section 2.6.1 could be formulated in terms of χ2-tests. We extend these χ2 goodness-of-fit tests for the situations of this section. Technical details may be found in Agresti (2002) or Hogg et al. (2013). Consider a hypothesis (null) concerning the categories. Then the χ2-test statistic is essentially the sum over the categories of the squared and standardized differences between the observed and expected frequencies, where the expected frequencies are formulated under the assumption that the null hypothesis is true. In general, under the null hypothesis, this test statistic has an asymptotic χ2-distribution with degrees of freedom equal to the number of free categories (cells) minus the number of parameters, if any, that need to be estimated to form the expected frequencies. As we note later, at times the exact null distribution can be used instead of the asymptotic distribution. For now, we present three general applications and their computation using R.

2.7.1 Goodness-of-Fit Tests for a Single Discrete Random Variable

Consider a discrete random variable X with range (categories) {1, 2, ..., c}. Let p(j) = P[X = j] denote the probability mass function (pmf) of the distribution of X. Suppose the hypotheses of interest are:

H0 : p(j) = p0(j), j = 1, ..., c   versus   HA : p(j) ≠ p0(j) for some j.     (2.22)

Suppose X1, ..., Xn is a random sample on X. Let Oj = #{Xi = j}. The statistics O1, ..., Oc are called the observed frequencies of the categories of X. The observed frequencies are constrained as Σ_{j=1}^c Oj = n; so, there are essentially c − 1 free cells. The expected frequencies of these categories under H0 are given by Ej = EH0[Oj], where EH0 denotes expectation under the null hypothesis. There are two cases.

In the first case, the null distribution probabilities, p0(j), are completely specified. In this case, Ej = np0(j) and the test statistic is given by

χ2 = Σ_{j=1}^c (Oj − Ej)² / Ej.     (2.23)

The hypothesis H0 is rejected in favor of HA for large values of χ2. Note that the vector of observed frequencies, (O1, ..., Oc)T, has a multinomial distribution, so the exact distribution of χ2 can be obtained. The test statistic is also asymptotically equivalent to the likelihood ratio test statistic,4 and, hence, has an asymptotic χ2-distribution with c − 1 degrees of freedom under H0. In practice, this asymptotic result is generally used. Let χ2_0 be the realized value of the statistic χ2 when the sample is drawn. Then the p-value of the goodness-of-fit test is 1 − Fχ2(χ2_0; c − 1), where Fχ2(·; c − 1) denotes the cdf of a χ2-distribution with c − 1 degrees of freedom.
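
In R this p-value can be obtained directly from the χ2 cdf; a one-line sketch, where chi2.0 and ncat are placeholders for the realized statistic and the number of categories:

pchisq(chi2.0, df = ncat - 1, lower.tail = FALSE)   # p-value = 1 - Fχ2(chi2.0; c - 1)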

The R function chisq.test computes the test statistic (2.23). The input consists of the vectors of observed frequencies and the pmf (p0(1), ..., p0(c))T. The uniform distribution (p(j) ≡ 1/c) is the default null distribution. The output includes the value of the χ2-test statistic and the p-value. The return list includes values for the observed frequencies (observed), the expected frequencies (expected), and the residuals (residuals). These residuals are (Oj − Ej)/√Ej, j = 1, ..., c, and are often called the Pearson residuals. The squares of the residuals are the categories’ contributions to the test statistic and offer valuable post-test information on which categories had large discrepancies from those expected under H0.

Here is a simple example. Suppose we roll a die n = 370 times and we observe the frequencies (58, 55, 62, 68, 66, 61)T. Suppose we are interested in testing to see if the die is fair; i.e., p(j) ≡ 1/6. Computation in R yields

> x <- c(58,55,62,68,66,61)
> chifit <- chisq.test(x)
> chifit

   Chi-squared test for given probabilities

data: x
X-squared = 1.9027, df = 5, p-value = 0.8624

> round(chifit$expected,digits=4)

[1] 61.6667 61.6667 61.6667 61.6667 61.6667 61.6667

> round((chifit$residuals)^2,digits=4)

[1] 0.2180 0.7207 0.0018 0.6505 0.3045 0.0072

Thus there is no evidence to support the die being unfair.

In the second case for the goodness-of-fit tests, only the form of the null pmf is known. Unknown parameters must be estimated.5 Then the expected values are the estimates of Ej based on the estimated pmf. The degrees of freedom, though, decrease by the number of parameters that are estimated.6 The following example illustrates this case.

Example 2.7.1 (Birth Rate of Males to Swedish Ministers).

This data is discussed on page 266 of Daniel (1978). It concerns the number of males in the first seven children for n = 1334 Swedish ministers of religion. The data are

No. of Males        0    1    2    3    4    5    6    7
No. of Ministers    6   57  206  362  365  256   69   13

For example, 206 of these ministers had 2 sons in their first 7 children. The null hypothesis is that the number of sons is binomial with probability of success p, where success is a son. The maximum likelihood estimator of p is the number of successes over the total number of trials which is

ˆp = Σ_{j=0}^7 j × Oj / (7 × 1334) = 0.5140.

The expected frequencies are computed as

Ej = n (7 choose j) ˆp^j (1 − ˆp)^(7−j).

The values of the pmf can be computed in R. The following code segment shows R computations of them along with the corresponding χ2 test. As we have estimated ˆp, the number of degrees of freedom of the test is 8 – 1 – 1 = 6.

> oc<-c(6,57,206,362,365,256,69,13)
> n<-sum(oc)
> range<-0:7
> phat<-sum(range*oc)/(n*7)
> pmf<-dbinom(range,7,phat)

The estimated probability mass function is given in the following code segment.

> rbind(range,round(pmf,3))

       [,1]  [,2] [,3]  [,4] [,5]  [,6]  [,7]  [,8]
range 0.000 1.000 2.00 3.000 4.00 5.000 6.000 7.000
      0.006 0.047 0.15 0.265 0.28 0.178 0.063 0.009

The p-value is calculated using pchisq with the correct degrees of freedom (reduced by one due to the estimation of p).

> test.result<-chisq.test(oc,p=pmf)
> pchisq(test.result$statistic,df=6,lower.tail=FALSE)

X-squared
0.4257546

With a p-value = 0.426 we would not reject H0. There is no evidence to refute a binomial probability model for the number of sons in the first seven children of a Swedish minister. The following provides the expected frequencies which can be compared with the observed.

> round(test.result$expected, 1)

[1]  8.5 63.2 200.6 353.7 374.1 237.4 83.7 12.6

Confidence Intervals

In this section, we have been discussing tests for a discrete random variable with a range consisting of c categories, say, {1, 2,..., c}. Write the distribution of X as pj=p(j)=P(X=j), j=1,2,...,c. Using the notation at the beginning of this section, for a given j, the estimate of pj is the proportion of sample items in category j; i.e., ˆpj=Oj/n. Note that this is a binomial situation where category j is success and all other categories are failures. Hence from expression (2.16), an asymptotic (1 – α)100% confidence interval for pj is

ˆpj ± zα/2 √(ˆpj(1 − ˆpj)/n).     (2.24)

Another confidence interval of interest in this situation is for a difference in proportions, say, pj − pk, j ≠ k. This parameter is the difference in two proportions in a multinomial setting; hence, the standard error7 of this estimate is

SE(ˆpj − ˆpk) = √[(ˆpj + ˆpk − (ˆpj − ˆpk)²)/n].     (2.25)

Thus, an asymptotic (1 − α)100% confidence interval for pj − pk is

ˆpj − ˆpk ± zα/2 √[(ˆpj + ˆpk − (ˆpj − ˆpk)²)/n].     (2.26)

Example 2.7.2 (Birth Rate of Males to Swedish Ministers, continued).

Consider Example 2.7.1 concerning the number of sons in the first seven children of Swedish ministers. Suppose we are interested in the difference in the probabilities of all females or all sons. The following R segment estimates this difference along with a 95% confidence interval, (2.26), for it. The counts for these categories are respectively 6 and 13 with n = 1334.

> n <- 1334; p0 <- 6/n; p7 <- 13/n
> se <- sqrt((p0+p7-(p0-p7)^2)/n)
> lb <- p0-p7 - 1.96*se; ub <- p0-p7 + 1.96*se
> res<- c(p0-p7,lb,ub)
> res
[1] -0.005247376 -0.011645562 0.001150809

Since 0 is in the confidence interval there is no discernible difference in the proportions at level 0.05.

A cautionary note is needed here. In general, many confidence intervals can be formulated for a given situation. For example, if there are c categories then there are (c choose 2) possible pairwise comparison confidence intervals. In such cases, the overall confidence may slip. This is called a multiple comparison problem (MCP) in statistics. There are several procedures to use. One such procedure is the Bonferroni procedure. Suppose there are m confidence intervals of interest. Then if each confidence interval is obtained with confidence coefficient (1 − (α/m)), the simultaneous confidence of all of the intervals is at least 1 − α. See Exercise 2.8.25.
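
A small sketch of the Bonferroni adjustment for the asymptotic intervals above; the values of m and α below are illustrative only:

alpha <- 0.05; m <- 3            # three simultaneous intervals, overall confidence at least 95%
zcv <- qnorm(1 - alpha/(2*m))    # use this critical value in each of the m intervals
zcv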

2.7.2 Several Discrete Random Variables

A frequent application of goodness-of-fit tests concerns several discrete random variables, say X1,..., Xr, which have the same range {1, 2,..., c}. The hypotheses of interest are

H0 : X1, ..., Xr have the same distribution
HA : Distributions of Xi and Xj differ for some i ≠ j.     (2.27)

Note that the null hypothesis does not specify the common distribution. Information consists of independent random samples on each random variable. Suppose the random sample on Xi is of size ni. Let n = Σ_{i=1}^r ni denote the total sample size. The observed frequencies are

Oij=#{sample items in sample drawn on  Xi such that  Xi=j},

for i = 1, ..., r and j = 1, ..., c. The set of {Oij}s forms an r × c matrix of observed frequencies. These matrices are often referred to as contingency tables. We want to compare these observed frequencies to the expected frequencies under H0. To obtain these we need to estimate the common distribution (p1, ..., pc)T, where pj is the probability that category j occurs. The nonparametric estimate of pj is

ˆpj = Σ_{i=1}^r Oij / n,   j = 1, ..., c.

Hence, the estimated expected frequencies are ˆEij=niˆpj. This formula is easy to remember since it is the row total times the column total over the total number. The test statistic is the χ2-test statistic, (2.23); that is,

χ2 = Σ_{i=1}^r Σ_{j=1}^c (Oij − ˆEij)² / ˆEij.     (2.28)

For degrees of freedom, note that each row has c − 1 free cells because the sample sizes ni are known. Further, c − 1 estimates had to be made. So there are r(c − 1) − (c − 1) = (r − 1)(c − 1) degrees of freedom. Thus, an asymptotic level α test is to reject H0 if χ2 ≥ χ2_{α,(r−1)(c−1)}, where χ2_{α,(r−1)(c−1)} is the α critical value of a χ2-distribution with (r − 1)(c − 1) degrees of freedom. This test is often referred to as the χ2-test of homogeneity (same distributions). We illustrate it with the following example.

Example 2.7.3 (Type of Crime and Alcoholic Status).

The contingency table, Table 2.1, contains the frequencies of criminals who committed certain crimes and whether or not they are alcoholics. We are interested in seeing whether or not the distribution of alcoholic status is the same for each type of crime. The data were obtained from Kendall and Stuart (1979).

Table 2.1

Contingency Table for Type of Crime and Alcoholic Status Data.

Crime      Alcoholic   Non-Alcoholic
Arson          50            43
Rape           88            62
Violence      155           110
Theft         379           300
Coining        18            14
Fraud          63           144

To compute the test for homogeneity for this data in R, assume the contingency table is in the matrix ct. Then the command is chisq.test(ct), as the following R session shows:

> c1 <- c(50,88,155,379,18,63)
> c2 <- c(43,62,110,300,14,144)
> ct <- cbind(c1,c2)
> chifit <- chisq.test(ct)
> chifit

  Pearson’s Chi-squared test

data: ct
X-squared = 49.7306, df = 5, p-value = 1.573e-09

> (chifit$residuals)^2

   c1   c2
[1,] 0.01617684  0.01809979
[2,] 0.97600214  1.09202023
[3,] 1.62222220  1.81505693
[4,] 1.16680759  1.30550686
[5,] 0.07191850  0.08046750
[6,] 19.61720859 21.94912045

The result is highly significant, but note that most of the contribution to the test statistic comes from the crime fraud. Next, we eliminate fraud and retest.

> ct2 <- ct[-6,]
> chisq.test(ct2)

  Pearson’s Chi-squared test

data: ct2
X-squared = 1.1219, df = 4, p-value = 0.8908

These results suggest that conditional on the criminal not committing fraud, his alcoholic status and type of crime are independent.

Confidence Intervals

For a given category, say, j, it may be of interest to obtain confidence intervals for differences such as P(Xi = j) − P(Xi′ = j). In the notation of this section, the estimate of this difference is (Oij/ni) − (Oi′j/ni′), where ni and ni′ are the respective sums of rows i and i′ of the contingency table. Since the samples on these random variables are independent, the two-sample proportion confidence interval given in expression (2.19) can be used. The cautionary note regarding simultaneous confidence of the last section holds here, also.

2.7.3 Independence of Two Discrete Random Variables

The χ2 goodness-of-fit test can be used to test the independence of two discrete random variables. Suppose X and Y are discrete random variables with respective ranges {1, 2,..., r} and {1, 2,..., c}. Then we can write the hypothesis of independence between X and Y as

H0 : P[X = i, Y = j] = P[X = i]P[Y = j] for all i and j   versus
HA : P[X = i, Y = j] ≠ P[X = i]P[Y = j] for some i and j.     (2.29)

To test this hypothesis, suppose we have observed the random sample (X1, Y1), ..., (Xn, Yn) on (X, Y). We categorize these data into the r × c contingency table with frequencies Oij where

Oij = #_{1≤l≤n} {(Xl, Yl) = (i, j)}.

So the Oijs are our observed frequencies and there are initially rc − 1 free cells. The expected frequencies are formulated under H0. Note that the maximum likelihood estimates (mles) of the marginal probabilities P[X = i] and P[Y = j] are the respective statistics Oi·/n and O·j/n. Hence, under H0, the mle of P[X = i, Y = j] is the product of these marginal estimates. So the expected frequencies are

ˆEij = n (Oi·/n)(O·j/n) = (ith row total × jth col. total) / (total number),     (2.30)

which is the same formula as for the expected frequencies in the test for homogeneity. Further, the degrees of freedom are also the same. To see this, there are rc − 1 free cells, and to formulate the expected frequencies we had to estimate r − 1 + c − 1 parameters. Hence, the degrees of freedom are rc − 1 − (r + c − 2) = (r − 1)(c − 1). Thus the R code for the χ2-test of independence is the same as for the test of homogeneity. Several examples are given in the exercises.
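
A minimal sketch, assuming the paired categorical observations are in vectors x and y:

ct <- table(x, y)   # r x c table of observed frequencies Oij
chisq.test(ct)      # chi-square test of independence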

Confidence Intervals

Notice that the sampling scheme in this section consists of one sample over the r × c categories. Hence, it is the same scheme as in the beginning of this section, Section 2.7.1. The estimate of each probability pij = P[X = i, Y = j] is Oij/n, and a confidence interval for pij is given by expression (2.24). Likewise, confidence intervals for differences of the form pij − pi′j′ can be obtained by using expression (2.26).

2.7.4 McNemar’s Test

McNemar’s test for significant change is used in many applications. The data are generally placed in a contingency table, but the analysis is not the χ2 goodness-of-fit tests discussed earlier. A simple example motivates the test. Suppose A and B are two candidates for a political office who are having a debate. Before and after the debate, the preference, A or B, of each member of the audience is recorded. We are interested in the difference between the proportion of the audience that changes preference from B to A and the proportion that changes from A to B. If the estimate of this difference is significantly greater than 0, we might conclude that A won the debate.

For notation, assume we are observing a pair of discrete random variables X and Y. In most applications, the ranges of X and Y have two values, say, {0, 1}.8 In our simple debate example, the common range can be written as {For A, For B}. Note that there are four categories: (0,0), (0,1), (1,0), (1,1). Let pij, i, j = 0, 1, denote the respective probabilities of these categories. Consider the hypothesis

H0 : p01 − p10 = 0   versus   HA : p01 − p10 ≠ 0.     (2.31)

One-sided tests are of interest, also; for example, in the debate situation, the claim that B wins the debate is expressed by the alternative HA : p01 > p10. Let (X1, Y1),...,(Xn,Yn) denote a random sample on (X,Y). Let Nij, i, j = 0, 1, denote the respective frequencies of the categories (0, 0), (0, 1), (1, 0), (1, 1). For convenience, the data can be written in the contingency table

        0      1
0     N00    N01
1     N10    N11

The estimate of p01 − p10 is ˆp01 − ˆp10 = (N01/n) − (N10/n). This is the difference in two proportions in a multinomial setting; hence, the standard error of this estimate is given in expression (2.25). For convenience, we repeat it with the current notation.

SE(ˆp01 − ˆp10) = √[(ˆp01 + ˆp10 − (ˆp01 − ˆp10)²)/n].     (2.32)

The Wald test statistic is the z-statistic which is the ratio of ˆp01 − ˆp10 over its standard error. Usually, though, a scores test is used. In this case the squared difference in the numerator of the standard error is replaced by 0, its parametric value under the null hypothesis. Then the square of the z-scores test statistic reduces to

χ2 = (N01 − N10)² / (N01 + N10).     (2.33)

Under H0, this test statistic has an asymptotic χ2-distribution with 1 degree of freedom. Letting χ2_0 be the realized value of the test statistic once the sample is drawn, the p-value of this test is 1 − Fχ2(χ2_0; 1). For a one-sided test, simply divide this p-value by 2.

Actually, an exact test is easily formulated. Note that this test is conditioned on the categories (0, 1) and (1, 0). Furthermore, the null hypothesis says that these two categories are equilikely. Hence under the null hypothesis, the statistic N01 has a binomial distribution with probability of success 1/2 and N01 + N10 trials. So the exact p-value can be determined from this binomial distribution. While either the exact or the asymptotic p-value is easily calculated by R, we recommend the exact p-value.
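
Since this exact p-value is based on a binomial distribution, it can also be obtained with the binom.test function of Section 2.6.1. A sketch, with n01 and n10 denoting the observed frequencies of the (0,1) and (1,0) categories:

n01 <- 15; n10 <- 7                                            # counts from Example 2.7.4 below
binom.test(n01, n01 + n10, p = 0.5, alternative = "greater")   # exact one-sided p-value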

Example 2.7.4 (Hodgkin’s Disease and Tonsillectomy).

Hollander and Wolfe (1999) report on a study concerning Hodgkin’s disease and tonsillectomy. A theory purports that tonsils offer protection against Hodgkin’s disease. The data in the study consist of 85 paired observations of siblings. For each pair, one of the pair has Hodgkin’s disease and the other does not. Whether or not each had a tonsillectomy was also reported. The data are:

                                        Sibling
                             Tonsillectomy (0)   No Tonsillectomy (1)
Hodgkin’s   Tonsillectomy (0)        26                  15
Patients    No Tonsillectomy (1)      7                  37

If the medical theory is correct then p01 > p10. So we are interested in a one-sided test. The following R calculations show how easily the test statistic and p-value (including the exact one) are calculated:

> teststat <- (15-7)^2/(15+7)
> pvalue <- (1 - pchisq(teststat,1))/2
> pexact <- 1 - pbinom(14,(15+7), .5)
> c(teststat,pvalue,pexact)

[1] 2.90909091 0.04404076 0.06690025

If the level of significance is set at 0.05 then different conclusions may be drawn depending on whether or not the exact p-value is used.

Remark 2.7.1.

In practice, the p-values for the χ2-tests discussed in this section are often the asymptotic p-values based on the χ2-distribution. For McNemar’s test we have the option of an exact p-value based on a binomial distribution. There are other situations where an exact p-value is an option. One such case concerns contingency tables where both margins are fixed. For such cases, Fisher’s exact test can be used; see, for example, Chapter 2 of Agresti (1996) for discussion. The R function for the analysis is fisher.test. One nonparametric example of this test concerns Mood’s two-sample median test (e.g. Hettmansperger and McKean 2011: Chapter 2). In this case, Fisher’s exact test is based on a hypergeometric distribution.

2.8 Exercises

2.8.1. Verify, via simulation, the level of the wilcox.test when sampling from a standard normal distribution. Use n = 30 and levels of α = 0.1, 0.05, 0.01. Based on the resulting estimate of α, the empirical level, obtain a 95% confidence interval for α.
2.8.2. Redo Exercise 2.8.1 for a t-distribution using 1, 2, 3, 5, and 10 degrees of freedom.
2.8.3. Redo Example 2.4.1 without a for loop and using the apply function.
2.8.4. Redo Example 2.3.2 without a for loop and using the apply function.
2.8.5. Suppose in a poll of 500 registered voters, 269 responded that they would vote for candidate P. Obtain a 90% percentile bootstrap confidence interval for the true proportion of registered voters who plan to vote for P.
2.8.6. For Example 2.3.1 obtain a 90% two-sided confidence interval for the treatment effect.
2.8.7. Write an R function which computes the sign analysis. For example, the following commands compute the statistic S+, assuming that the sample is in the vector x.

    xt <- x[x != 0]; nt <- length(xt); ind <- rep(0, nt)   # remove zeros
    ind[xt > 0] <- 1; splus <- sum(ind)                    # S+ = number of positive observations
2.8.8. Calculate the sign test for the nursery school example, Example 2.3.1. Show that the p-value for the one-sided sign test is 0.1445.
2.8.9. The data for the nursery school study were drawn from page 79 of Siegel (1956). In the data table, there is an obvious typographical error. In the 8th set of twins, the score for the twin that stayed at home is typed as 82 when it should be 62. Rerun the signed-rank Wilcoxon and t-analyses using the typographical error value of 82.
2.8.10. The contaminated normal distribution is frequently used in simulation studies. A standardized variable, X, having this distribution can be written as

    $X = (1 - I_{\epsilon})Z + cI_{\epsilon}Z$,

    where 0 ≤ ϵ < 1, Iϵ has a binomial distribution with n = 1 and probability of success ϵ, Z has a standard normal distribution, c > 1, and Iϵ and Z are independent random variables. When sampling from the distribution of X, (1 − ϵ)100% of the time the observations are drawn from a N(0, 1) distribution, but ϵ100% of the time the observations are drawn from a N(0, c²) distribution. These latter observations are often outliers. The distribution of X is a mixture distribution; see, for example, Section 3.4.1 of Hogg et al. (2013). We say that X has a CN(c, ϵ) distribution. (One possible generator is sketched after part (b) below.)

    (a) Using the R functions rbinom and rnorm, write an R function which obtains a random sample of size n from a contaminated normal distribution CN(c, ϵ).
    (b) Obtain samples of size 100 from a N(0,1) distribution and a CN(16, 0.25) distribution. Form histograms and comparison boxplots of the samples. Discuss the results.
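    A minimal sketch of one possible generator (our own; the function name rcn is not from the text) implements the mixture representation above using rbinom and rnorm.

    # Sketch: draw a sample of size n from a CN(c, eps) distribution
    rcn <- function(n, c, eps) {
        ind <- rbinom(n, 1, eps)       # contamination indicators I_eps
        z <- rnorm(n)                  # standard normal deviates
        (1 - ind)*z + c*ind*z          # X = (1 - I_eps)Z + c*I_eps*Z
    }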
2.8.11. Perform the simulation study of Example 2.3.2 when the population has a CN(16, 0.25) distribution. For the alternatives, select values of θ so that the spread in empirical powers of the signed-rank Wilcoxon test ranges from approximately 0.05 to 0.90.
2.8.12. The ratio of the expected squared lengths of confidence intervals is a measure of efficiency between two estimators. Based on a simulation of size 10,000, estimate this ratio between the Hodges–Lehmann estimator and the sample mean for n = 30 when the population has a standard normal distribution. Use 95% confidence intervals. Repeat the study when the population has a t-distribution with 2 degrees of freedom.
2.8.13. Suppose the cure rate for the standard treatment of a disease is 0.60. A new drug has been developed for the disease and it is thought that the cure rate for patients using it will exceed 0.60. In a small clinical trial 48 patients having the disease were treated with the new drug and 34 were cured.

    (a) Let p be the probability that a patient having the disease is cured by the new drug. Write the hypotheses of interest in terms of p.
    (b) Determine the p-value for the clinical study. What is the decision for a nominal level of 0.05?
2.8.14. Let p be the probability of success. Suppose it is of interest to test

    H0: p = 0.30  versus  HA: p < 0.30.

    Let S be the number of successes out of 75 trials. Suppose we reject H0 if S ≤ 16.

    (a) Determine the significance level of the test.
    (b) Determine the power of the test if the true p is 0.25.
    (c) Determine the power function of the test over the sequence of success probabilities {0.02, 0.03, ..., 0.35}. Then obtain a plot of the power curve.
2.8.15. For the situation of Exercise 2.8.13, a larger clinical study was run. In this study, patients were randomly assigned to either the standard drug or the new drug. Let p1 and p2 denote the cure rates for patients under the new drug and the standard drug, respectively. The hypotheses of interest are:

    H0: p1 = p2  versus  HA: p1 > p2.

    The results of the study are:

    Treatment        No. of Patients   No. Cured
    New Drug               200            135
    Standard Drug          210            130

    (a) Determine the p-value of the scores test (2.21). Conclude at the 5% level of significance.
    (b) Obtain the 95% confidence interval for p1 − p2.
2.8.16. Simulate the power of the Wald and scores type two-sample proportions tests for the hypotheses

    H0: p1 = p2  versus  HA: p1 > p2

    for the following situation. Assume that population 1 is Bernoulli with p1 = 0.6; population 2 is Bernoulli with p2 = 0.5; the level is α = 0.05; and n1 = n2 = 50. Recall that the call rbinom(m,n,p) returns m binomial variates with distribution bin(n,p).

2.8.17. In a large city, four candidates (Smith, Jones, Martinelli, and Wagner) are running for Mayor. A poll was conducted by random dialing with the following results:

    Smith   Jones   Martinelli   Wagner   Others
     442     208       460        180      205

    Using a 95% confidence interval, determine if there is a significant difference between the two front runners.

2.8.18. In Example 2.7.1 we tested whether or not a dataset was drawn from a binomial distribution. For this exercise, generate a sample of size n = 500 from a truncated Poisson distribution as illustrated with the following R code:

    x <- rpois(500, 3)     # Poisson(3) sample of size 500
    x[x >= 8] <- 7         # truncate values at 7
    (a) Obtain a plot of the histogram of the sample.
    (b) Obtain an estimate of the sample proportion (phat <- mean(x/7)).
    (c) Test to see if the sample has a binomial distribution with n = 7 (i.e., use the same test as in Example 2.7.1).
2.8.19. Rasmussen (1992) presents the following data on a survey of workers in a large factory on two variables: their feelings concerning a smoking ban (Approve, Do not approve, Not sure) and smoking status (Never smoked, Ex-smoker, Current smoker). Use the χ2-test to test the independence of these two variables. Using a post-test analysis, determine which categories contributed heavily to the dependence.

                        Approval of the smoking ban
    Smoking status      Approve   Do not approve   Not sure
    Never smoked          237            3             10
    Ex-smoker             106            4              7
    Current smoker         24           32             11

2.8.20. The following data are drawn from Agresti (1996). They concern the approval ratings of a Canadian prime minister in two surveys. In the first survey, ratings were obtained on 1600 citizens; then in a second survey, six months later, the same citizens were resurveyed. The data are tabulated below. Use McNemar’s test to see if, given a change in attitude toward the prime minister, the probability of going from approval to disapproval is higher than the probability of going from disapproval to approval. Also determine a 95% confidence interval for the difference of these two probabilities.

                       Second survey
    First survey     Approve   Disapprove
    Approve            794        150
    Disapprove          86        570

2.8.21. Even though the χ2-tests of homogeneity and independence are the same, they are based on different sampling schemes. The scheme for the test of independence is one sample of bivariate data, while the scheme for the test of homogeneity consists of one sample from each population. Let C be a contingency table with r rows and c columns. Assume for the test of homogeneity that the rows contain the samples from the r populations. Determine the (large sample) confidence intervals for each of the following parameters under both schemes, where pij is the probability of cell (i, j) occurring. Write R code to obtain these confidence intervals, assuming the input is a contingency table.

    (a) p11.
    (b) p11 − p12.
2.8.22. Mendel’s early work on heredity in peas is well known. Briefly, he conducted experiments in which the peas could be either round or wrinkled, and yellow or green. So there are four possible combinations: RY, RG, WY, WG. If his theory were correct, the peas would be observed in a 9:3:3:1 ratio. Suppose the outcome of the experiment yielded the following observed data:

     RY    RG    WY    WG
    315   108   101    32

    Calculate a p-value and comment on the results.

2.8.23. Suppose there are two ways of making widgets: process A and process B. Assume there is a reliable way to measure the overall quality of widgets made by either process, so that value can be measured with some accuracy.

    Suppose that a plant has 25 operators and each operator then makes a widget of each type in random order. The results are such that process A has more value than process B for 20 operators, B has more value than A for 3, and the measurements were not different for 2 operators. These data present pretty convincing evidence in favor of Process A. How likely is such a result due to chance if the processes were actually equal in terms of quality?

2.8.24. Conduct a Monte Carlo simulation to approximate the power of the test discussed in Example 2.4.3 when the true θ = 1.5.
2.8.25. Let 0 < α < 1. Suppose I1 and I2 are respective confidence intervals for two parameters θ1 and θ2, both with confidence coefficient 1 − (α/2); that is,

    $P_{\theta_{i}}[\theta_{i} \in I_{i}] = 1 - \dfrac{\alpha}{2}, \quad i = 1, 2.$

    Show that the simultaneous confidence for both intervals is at least 1 – α, i.e.,

    $P_{\theta_{1},\theta_{2}}\left[\{\theta_{1} \in I_{1}\} \cap \{\theta_{2} \in I_{2}\}\right] \geq 1 - \alpha.$

    Hint: Use the method of complements and Boole’s inequality, P[A ∪ B] ≤ P(A) + P(B). Extend the argument to m intervals, each with confidence coefficient 1 − (α/m), to obtain a set of m simultaneous Bonferroni confidence intervals.

1 For comparison purposes, the statistic can be written as $T = \sum_{i=1}^{n} \operatorname{sign}(X_{i})\,|X_{i}|$.

2 See, for example, Chapter 1 of Hettmansperger and McKean (2011).

3 The base R function prop.test provides a confidence interval which is computed by inverting the score test.

4 See, for example, Exercise 6.5.8 of Hogg et al. (2013).

5 For this situation, generally we estimate the unknown parameters of the pmf by their maximum likelihood estimators. See Hogg et al. (2013).

6 See Section 4.7 of Hogg et al. (2013).

7 See page 363 of Hogg et al. (2013).

8 See Hettmansperger and McKean (1973) for generalizations to more than two categories.
