3
Testing Hypotheses – One‐ and Two‐Sample Problems

3.1 Introduction

In empirical research, a scientist often formulates conjectures about the objects of his research. For instance, he may argue that the fat content in the milk of Jersey cows is higher than that of Holstein Friesians. To check such conjectures, he will perform an experiment. Now statistics comes into play.

Sometimes the aim of an investigation is not to estimate parameters but to test carefully considered hypotheses (assumptions, suppositions), and often also wishful notions, based on practical material. In this case, too, we establish a mathematical model in which the hypothesis is formulated in terms of the model parameters.

We assume that we have one random sample or two random samples from special distributions. We begin with the one‐sample problem and assume that the distribution of the components of the sample depends on a parameter (vector) θ. We would like to test a hypothesis about θ. First, we define what we have to understand by these terms.

A statistical test is a procedure that allows a decision to be made about accepting or rejecting a hypothesis concerning an unknown parameter occurring in the distribution of a random variable. We shall suppose in the following that two hypotheses are possible. The first (or main) hypothesis is the null hypothesis H0; the other is the alternative hypothesis HA. The hypothesis H0 is right if HA is wrong, and vice versa. Hypotheses can be simple or composite. A simple hypothesis prescribes the parameter value θ uniquely; e.g. the hypothesis H0 : θ = θ0 is simple. A composite hypothesis admits several values for the parameter θ.

We discuss hypothesis testing based on random samples. A special random variable, a so‐called test statistic, is derived from the random sample. We reject the null hypothesis if the realisation of the test statistic stands in some relation to a real number, say, it exceeds some quantile. This quantile is chosen so that the probability that the random test statistic exceeds it, if the null hypothesis is correct, equals a value 0 < α < 1, fixed in advance and called the first kind risk or significance level. The first kind risk equals the probability of making an error of the first kind, i.e. of rejecting the null hypothesis although it is correct. Usually we choose α relatively small. In the days when the quantiles mentioned above had to be calculated by hand and printed in tables, quantiles existed only for a few values of α; from this time stems the habit of using α = 0.05 or 0.01. However, even now it makes sense to use one of these values to make different experiments comparable.

Besides an error of the first kind, an error of the second kind may occur if we accept the null hypothesis although it is wrong; the probability that this occurs is the second kind risk.

Both errors have different consequences. Assume that the null hypothesis states that a special atomic power plant is safe. The alternative hypothesis is then that it is not safe. Of course, it is more harmful if the null hypothesis is erroneously accepted and the plant is assembled. Therefore, in this case the risk of the second kind is more important than the risk of the first kind. In many other cases, the risk of the first kind plays an important role and it is fixed in advance.

At the end of a classical statistical test we decide on one of two possibilities, namely accepting or rejecting the null hypothesis.

Tests for a given first kind risk α are called α‐tests; usually there are many α‐tests for a given pair of hypotheses. Amongst them, those are preferable that have the smallest second kind risk β or, equivalently, the largest power 1 − β. Such tests are called most powerful α‐tests. If the power is at least α whenever HA is valid, i.e. α ≤ 1 − β, the test is called unbiased.

A special situation arises in sequential testing. Here three decisions are possible after each of a series of sequentially registered observations:

  • accept H0
  • reject H0
  • make the next observation.

The value of the risk of the second kind depends on the distance between the parameter values of the null and the alternative hypothesis.

We need another term, the power function. This is the probability of rejecting H0. Its value equals α if the null hypothesis is correct.

It is completely wrong to state, after rejecting a null hypothesis based on observations, that this decision is wrong with probability α. This decision is either right or wrong. On the other hand, it is correct to say that the decision is based on a procedure, which supplies wrong rejections with probability α.

However, often a decision must be made.

Then, when accepting H0, one acts as if H0 were correct, even though this may be wrong.

Therefore, the user is recommended to choose α (or β, respectively) small enough that after a rejection (or acceptance) of H0 he can behave with a clear conscience as if H0 were wrong (or HA were right). There is also an important statistical consequence: if the user has to make many such decisions during his investigations, then he will decide wrongly in about 100α (or 100β, respectively) per cent of the cases. This is a realistic point of view, which is essentially confirmed by experience. If we move in traffic, we should be aware of the risk of our own and other people's incorrect actions or of an accident (observe that in this case α is considerably smaller than 0.05), but we must participate, just as a researcher must draw a conclusion from an experiment although he knows that he could be wrong. On the other hand, it is very important to control the risks. Concerning risks of the second kind, this is only possible if the sample size is determined before the experiment. The user should take care not to transfer probability statements to single observed cases.

In this chapter, we discuss tests on expectations, variances, and other general parameters. Tests on special parameters like regression coefficients are found in the corresponding chapters.

Statistical tests and confidence estimations for expectations of normal distributions are extremely robust against the violation of the normality assumption – see Rasch and Tiku (1985).

3.2 The One‐Sample Problem


In this section we assume that a random sample Y = (y1, y2, … , yn)T, n ≥ 1, of size n is taken from a distribution with an unknown parameter. In Sections 3.2.1 and 3.2.3 we handle problems in which this distribution is a normal one with parameter vector θ = (μ, σ2)T.

3.2.1 Tests on an Expectation

Hypotheses may be simple or composite.

In a simple hypothesis the value of θ = μ is fixed at a real number μ0.

A simple null hypothesis could be H0 : μ = μ0. In the pair H0 : μ = μ0; HA : μ = μ1 both hypotheses are simple.

Examples for composite null hypotheses are:

  1. H0 : μ = μ0, σ2 arbitrary
  2. H0 : μ = μ0  or  μ = μ1
  3. H0 : μ < μ0,
  4. H0 : μ ≠ μ0.

3.2.1.1 Testing the Hypothesis on the Expectation of a Normal Distribution with Known Variance

We first assume that the components of Y = (y1, y2, … , yn)T are independent and each is distributed as y, namely normally with unknown expectation μ and known variance σ2.

We wish to test the null hypothesis

  1. H0 : μ = μ0 against
    1. (a) HA : μ = μ1 > μ0 (one‐sided alternative)
    2. (b) HA : μ = μ1 < μ0 (one‐sided alternative)
    3. (c) HA : μ = μ1 ≠ μ0 (two‐sided alternative)

with a significance level α.

All hypotheses are simple hypotheses. In this case, usually the test statistic

(3.1)  z = (ȳ − μ0)√n / σ

is applied, where ȳ is the mean taken from the random sample Y of size n.

Starting with the (random) sample mean ȳ, first the value μ0 of the null hypothesis is subtracted and then this difference is divided by the standard error σ/√n of ȳ.

We know that z is normally distributed with variance 1 and, under the null hypothesis, the expectation of z is 0. Under the alternative hypothesis the expectation of z is λ. The value

(3.2)  λ = (μ1 − μ0)√n / σ

is called the non‐centrality parameter (ncp) or, in applications, it is also called the relative effect size, and δ = μ1 − μ0 is called the effect size.

We calculate from a realised sample, i.e. from observations y1, y2, … , yn, the sample mean ȳ following Problem 2.1 and from this the realised test statistic

(3.3)  z = (ȳ − μ0)√n / σ

Now we reject the null hypothesis for the alternative hypotheses (a), (b), and (c) with z from (3.3) and a fixed first kind risk or significance level α if:

  1. (a) z > Z(1 − α)
  2. (b) z < Z(α)
  3. (c) z > Z(1 − α/2) or z < Z(α/2).

Above Z(P), 0 < P < 1, is the P‐quantile of z, which has a standard normal distribution N(0,1).
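The decision rules (a)–(c) can be sketched in code. The book's own examples use R; the following Python function, the scipy dependency, and the data are assumptions for illustration only:

```python
# Sketch of the one-sample z-test (known sigma) with the rejection rules
# (a)-(c); z is the realised statistic (3.3), Z(P) is norm.ppf(P).
import math
from scipy.stats import norm

def z_test(y, mu0, sigma, alpha=0.05, alternative="two-sided"):
    n = len(y)
    ybar = sum(y) / n
    z = (ybar - mu0) * math.sqrt(n) / sigma
    if alternative == "greater":          # case (a): z > Z(1 - alpha)
        reject = z > norm.ppf(1 - alpha)
    elif alternative == "less":           # case (b): z < Z(alpha)
        reject = z < norm.ppf(alpha)
    else:                                 # case (c): |z| > Z(1 - alpha/2)
        reject = abs(z) > norm.ppf(1 - alpha / 2)
    return z, reject
```

For four observations with mean 1 and σ = 1, z = 2 and the two-sided test at α = 0.05 rejects, since 2 > Z(0.975) ≈ 1.96.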

3.2.1.2 Testing the Hypothesis on the Expectation of a Normal Distribution with Unknown Variance

We now assume that the components of a random sample Y = (y1, y2, … , yn)T are distributed normally with unknown expectation μ and unknown variance σ2.

We wish to test the null hypothesis:

  1. H0 : μ = μ0, σ2 arbitrary against
    1. (a) HA : μ = μ1 > μ0, σ2 arbitrary (one‐sided alternative)
    2. (b) HA : μ = μ1 < μ0, σ2 arbitrary (one‐sided alternative)
    3. (c) HA : μ = μ1 ≠ μ0, σ2 arbitrary (two‐sided alternative).

All hypotheses are composite hypotheses because σ2 is arbitrary. In this case, usually the test statistic (where s is the sample standard deviation)

(3.6)  t = (ȳ − μ0)√n / s

is used, which is non‐centrally t‐distributed with n − 1 degrees of freedom and ncp

λ = (μ1 − μ0)√n / σ.

Under the null hypothesis, the distribution is central t with n − 1 degrees of freedom.

If the type I error probability is α, H0 will be rejected if:

  1. in case (a), t > t(n − 1;1 − α),
  2. in case (b), t < −t(n − 1;1 − α),
  3. in case (c),|t| > t(n − 1;1 − α/2).
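These rejection rules can be mirrored in code; a Python sketch (an assumption, since the book works in R), cross-checked against scipy's one-sample t-test:

```python
# One-sample t-test: t = (ybar - mu0) * sqrt(n) / s with df = n - 1,
# following the rejection rules (a)-(c) of the text.
import math
from statistics import mean, stdev
from scipy.stats import t as tdist, ttest_1samp

def one_sample_t(y, mu0, alpha=0.05, alternative="two-sided"):
    n = len(y)
    tval = (mean(y) - mu0) * math.sqrt(n) / stdev(y)   # statistic (3.6)
    df = n - 1
    if alternative == "greater":      # case (a)
        reject = tval > tdist.ppf(1 - alpha, df)
    elif alternative == "less":       # case (b)
        reject = tval < -tdist.ppf(1 - alpha, df)
    else:                             # case (c)
        reject = abs(tval) > tdist.ppf(1 - alpha / 2, df)
    return tval, reject
```

The hand-computed statistic agrees with `scipy.stats.ttest_1samp`, which uses the same formula.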

Our precision requirement is given by α and the risk of the second kind β if μ1 − μ0 = δ.

From this we get

t(n − 1; 1 − α) ≤ t(n − 1; λ; β),

where t(n − 1; λ; β) is the β‐quantile of the non‐central t‐distribution with n − 1 degrees of freedom and ncp

λ = (δ/σ)√n.

For example, assuming a power of 0.9 the relative effect can be read on the abscissa; it is approximately 1.5 for n = 7.

From the precision requirement above the minimum sample size is iteratively determined as the smallest integer n satisfying

(3.7)  t(n − 1; P) ≤ t(n − 1; (δ/σ)√n; β)

(Rasch et al. 2011b) with P = 1 − α in the one‐sided case and P = 1 − α/2 in the two‐sided case. In Problem 3.3 it is shown what to do when σ is unknown.

The R‐implementation reads as follows:

 > size <- function(p, rdelta, beta)
   {f <- function(n, p, rdelta, beta)
    {qt(p, n - 1, 0) - qt(beta, n - 1, rdelta * sqrt(n))}
    k <- uniroot(f, c(2, 1000), p = p, rdelta = rdelta, beta = beta)$root
    k0 <- ceiling(k)
    print(paste("optimum sample number: n =", k0), quote = FALSE)}

In the command rdelta is equal to δ/σ and p equals 1 − α. As an example we calculate

 > size(p=0.9,rdelta=0.5,beta=0.1)
[1] optimum sample number: n = 28
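The same root search can be reproduced in Python (an assumption for illustration: scipy's noncentral t-distribution `nct` plays the role of R's `qt` with an ncp argument):

```python
# Python counterpart of the R size() routine: find the n for which
# qt(p, n-1) = qt(beta, n-1, rdelta*sqrt(n)), cf. (3.7), then round up.
import math
from scipy.stats import t as tdist, nct
from scipy.optimize import brentq

def size(p, rdelta, beta):
    f = lambda n: tdist.ppf(p, n - 1) - nct.ppf(beta, n - 1, rdelta * math.sqrt(n))
    return math.ceil(brentq(f, 2.0, 1000.0))
```

Treating n as continuous inside the root search and rounding up at the end mirrors `uniroot` plus `ceiling` in the R code.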

The requirement (3.7) is illustrated in Figure 3.2 for the one‐sided case.


Figure 3.2 Graphical representation of the two risks α and β and the non‐centrality parameter λ of a test based on (3.6).

3.2.2 Test on the Median

Sometimes the population median m should be preferred to the expectation of a random variable y. For any probability distribution of y on the real line R1 with distribution function F(y), regardless of whether it is a continuous or a discrete distribution, a median is by definition any real number m that satisfies the inequalities

P(y ≤ m) ≥ 1/2  and  P(y ≥ m) ≥ 1/2.

If m is unique, the population median is the point where the distribution function equals 0.5.

It is better to use the median instead of the expectation if we expect serious outliers or if the distribution is extremely skewed, so that in a sample the median characterises the data better than the mean. Tests on the population median are so‐called non‐parametric tests or distribution‐free tests, i.e. tests without a specific assumption about the distribution.

Notably, such an approach is not intrinsically justified by a suspected deviation from the normal distribution of the character in question, particularly as Rasch and Guiard (2004) showed that all tests based on the t‐distribution are extraordinarily robust against violations of the normal distribution. That is, even if the normal random variable is not a good model for the given character, the type I risk holds almost accurately. Therefore, if a test statistic results in significance, we can be quite sure that an erroneous decision concerning the rejection of the null hypothesis has no larger risk than in the case of a normal distribution. We describe Wilcoxon's signed‐ranks test, useful for continuous distributions.

We test the hypothesis

  1. H0: m = m0 against a one‐ or two‐sided alternative:
    • HA: m < m0
    • HA: m > m0
    • HA: m ≠ m0.

First, we calculate from each observation yi of a sample Y = (y1, y2, … , yn)T the difference di = yi − m0 between the observation and the hypothetical median m0. If the value of di is equal to zero, we do not use it. From the remaining data subset of size n* ≤ n the ranks of the absolute values |di| are calculated. The sum

(3.9)  V = ΣRi, summed over all i with di > 0,

with Ri the rank of |di| among |d1|, … , |dn*|, is the test statistic. Wilcoxon's signed‐ranks test is based on the distribution of the corresponding random variable V.

Planning a study is generally difficult for non‐parametric tests, because it is hard to formulate a non‐centrality parameter (however see, for instance, Munzel and Brunner 2002).
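For a continuous sample, the test can be carried out with scipy's `wilcoxon` (a Python sketch with invented data; for a one-sided alternative the statistic it returns is the sum V of the positive ranks):

```python
# Wilcoxon signed-rank test of H0: m = m0 via the differences d_i = y_i - m0;
# zero differences are dropped (zero_method="wilcox"), as in the text.
from scipy.stats import wilcoxon

y = [5.1, 4.8, 6.3, 5.9, 7.2, 4.4, 5.5]   # invented observations
m0 = 5.0
d = [yi - m0 for yi in y]
res = wilcoxon(d, alternative="greater", zero_method="wilcox")
# res.statistic is V of (3.9): the sum of the ranks of the positive d_i
```

Here the positive differences 0.1, 0.5, 0.9, 1.3, 2.2 carry ranks 1, 3, 5, 6, 7 among the seven absolute differences, so V = 22.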

3.2.3 Test on the Variance of a Normal Distribution

Equation (2.21) shows an unbiased estimator s2 of the variance σ2 of a normal distribution.

We would like to test the null hypothesis

  • H0 : σ2 = σ02 against one of the alternatives below:
    • HA : σ2 > σ02
    • HA : σ2 < σ02
    • HA : σ2 ≠ σ02

with a first kind risk α.

We know that

(3.10)  χ2 = (n − 1)s2 / σ02

is centrally χ2-distributed with n − 1 degrees of freedom if the null hypothesis is valid.

We calculate from a sample of size n the realisation of (3.10) and compare it with the P‐quantile CS(P, n − 1) of the central χ2 distribution; H0 is rejected if:

(3.11)
  • χ2 > CS(1 − α, n − 1) in the case HA : σ2 > σ02
  • χ2 < CS(α, n − 1) in the case HA : σ2 < σ02
  • χ2 < CS(α/2, n − 1) or χ2 > CS(1 − α/2, n − 1) in the case HA : σ2 ≠ σ02.
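A Python sketch of this variance test (the function name and data are invented; the quantiles CS(P, n − 1) are `chi2.ppf` values):

```python
# chi2 = (n - 1) s^2 / sigma0^2, compared with central chi-square quantiles.
from statistics import variance
from scipy.stats import chi2

def var_test(y, sigma0_sq, alpha=0.05, alternative="two-sided"):
    n = len(y)
    stat = (n - 1) * variance(y) / sigma0_sq        # statistic (3.10)
    df = n - 1
    if alternative == "greater":
        reject = stat > chi2.ppf(1 - alpha, df)
    elif alternative == "less":
        reject = stat < chi2.ppf(alpha, df)
    else:
        reject = stat < chi2.ppf(alpha / 2, df) or stat > chi2.ppf(1 - alpha / 2, df)
    return stat, reject
```

For the invented sample 1, …, 5 with s2 = 2.5 and σ02 = 2.5 the statistic equals n − 1 = 4, which lies between the two-sided 0.05 quantile bounds, so H0 is not rejected.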

3.2.4 Test on a Probability

We already discussed the analysis problem in Problem 3.5. Here we show how to determine the sample size approximately.

Let p be the probability of some event. To test one of the pairs of hypotheses below:

  1. (a) H0 : p ≤ p0, HA : p > p0
  2. (b) H0 : p ≥ p0, HA : p < p0
  3. (c) H0 : p = p0, HA : p ≠ p0

with a risk of the first kind α, and if we want the risk of the second kind to be not larger than β as long as

  1. (a) p1 > p0, and p1 − p0 ≥ δ
  2. (b) p1 < p0 and p0 − p1 ≥ δ
  3. (c) p0 ≠ p1 and |p0 − p1| ≥ δ.

Determine the minimum sample size from

(3.12)  n = ⌈[Z(1 − α)√(p0(1 − p0)) + Z(1 − β)√(p1(1 − p1))]² / (p1 − p0)²⌉

The same size is needed for the other one‐sided test. In the two‐sided case

H0 : p = p0, HA : p ≠ p0,

we replace α by α/2 and calculate (3.12) twice, for p1 = p0 − δ > 0 and for p1 = p0 + δ < 1, if δ is the difference from p0 which should not be overlooked with a probability of at least 1 − β. From the two n‐values we then take the maximum.
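Assuming the usual normal-approximation form for (3.12), the rule above can be sketched as follows (function names invented):

```python
# Approximate minimum sample size for the test on a probability; the formula
# used here is the standard normal approximation (an assumption about (3.12)).
import math
from scipy.stats import norm

def n_one_sided(p0, p1, alpha, beta):
    num = (norm.ppf(1 - alpha) * math.sqrt(p0 * (1 - p0))
           + norm.ppf(1 - beta) * math.sqrt(p1 * (1 - p1)))
    return math.ceil((num / (p1 - p0)) ** 2)

def n_two_sided(p0, delta, alpha, beta):
    # replace alpha by alpha/2, evaluate at p1 = p0 -/+ delta, take the maximum
    return max(n_one_sided(p0, p0 - delta, alpha / 2, beta),
               n_one_sided(p0, p0 + delta, alpha / 2, beta))
```

For instance, p0 = 0.5, p1 = 0.6, α = 0.05, β = 0.2 gives n = 153 in the one-sided case.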

3.2.5 Paired Comparisons

If we measure two traits x and y on the same individuals, we could say that we have a two‐sample problem. However, from a mathematical point of view the two measurements are realisations of a two‐dimensional random variable (x, y)T, and from the corresponding universe one sample of pairs (xi, yi), i = 1, … , n, is drawn. We then speak about paired observations or statistical twins. In place of the question of whether xi and yi have the same distribution we can ask whether the expectation of di = xi − yi is zero. The problem is thus reduced to that considered in Section 3.2.1 and need not be discussed again.
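This reduction is easy to verify numerically; a Python sketch with invented data showing that the paired t-test coincides with a one-sample t-test on the differences:

```python
# Paired comparison: ttest_rel(x, y) coincides with ttest_1samp(x - y, 0).
from scipy.stats import ttest_rel, ttest_1samp

x = [12.1, 11.4, 13.0, 12.7, 11.9]   # trait x, invented
y = [11.8, 11.5, 12.2, 12.9, 11.1]   # trait y, invented
d = [xi - yi for xi, yi in zip(x, y)]

paired = ttest_rel(x, y)
onesample = ttest_1samp(d, 0.0)
# both results carry the same statistic and p-value
```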

If the random variable d is extremely non‐normal (skew, for instance) with median m, the null hypothesis H0 : m = m0 can be tested against HA : m ≠ m0 with Wilcoxon's signed rank test, based on the n realisations di, i = 1, … , n. For this the differences di − m0 are calculated as well as their absolute values |di − m0|, after deleting the values with di − m0 = 0. The remaining data set has a size n* ≤ n. Finally, these values are ordered (ascending) and ranks Ri are assigned accordingly.

Ties (identical values) receive a rank equal to the average of the ranks they span. The sum V of the ranks belonging to the positive differences (the so‐called signed ranks) is the test statistic of Wilcoxon's signed rank test. The corresponding random variable V has expectation E(V) = n*(n* + 1)/4 and variance var(V) = n*(n* + 1)(2n* + 1)/24. The null hypothesis H0 : m = m0 is rejected if V > V(n*, 1 − α), where α is the risk of the first kind. The critical values V(n*, 1 − α) are calculated by the R program in the following example.

3.2.6 Sequential Tests

The technique of sequential testing offers the advantage that, given many studies, on average fewer research units need to be sampled as compared to the ‘classic’ approach of hypothesis testing in Section 3.2.1 with sample size fixed beforehand.

Nevertheless, we also need the precision requirements α, β, and δ; a sequential test cannot be applied without them, which means that we cannot start the observations without fixing β. Until now a sample of fixed size n was given; Stein (1945) proposed a method of realising a two‐stage experiment. In the first stage a sample of size n0 > 1 is drawn to estimate σ2 by the variance s02 of this sample and to calculate the sample size n of the method using (3.7). In the second stage n − n0 further measurements are taken. Following the original method of Stein, in the second stage at least one further measurement is necessary from a theoretical point of view. In this section we simplify this method by introducing the condition that no further measurements are taken if n − n0 ≤ 0. Nevertheless, this yields an α‐test of acceptable power.
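The simplified two-stage rule can be sketched as follows (Python; the function name and inputs are assumptions, and (3.7) is solved by a root search over the noncentral t-distribution):

```python
# Simplified Stein-type two-stage plan: estimate sigma by s from the first
# stage of size n0, determine n from (3.7), then take max(n - n0, 0) further
# measurements (the simplification: none if n <= n0).
import math
from scipy.stats import t as tdist, nct
from scipy.optimize import brentq

def further_measurements(s, delta, n0, alpha, beta):
    rdelta = delta / s     # estimated relative effect size delta/sigma
    f = lambda n: (tdist.ppf(1 - alpha, n - 1)
                   - nct.ppf(beta, n - 1, rdelta * math.sqrt(n)))
    n = math.ceil(brentq(f, 2.0, 1000.0))
    return max(n - n0, 0)
```

If the first stage is already larger than the computed n, the plan stops after the first stage.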

Since both parts of the experiment are carried out one after the other, such experiments are called sequential. Sometimes it is even tenable to make all measurements step by step, where each measurement is followed by calculating a new test statistic. A sequential testing of this kind can be used if the observations of a random variable in an experiment take place successively in time. Typical examples are series of single experiments in a laboratory, psychological diagnostics in single sessions, medical treatments of patients in hospitals, consultations of clients of certain institutions, and certain procedures of statistical quality control, where the sequential approach was used for the first time (Dodge and Romig 1929). The basic idea is to utilise the observations already made before the next ones are at hand.

For example, testing the hypothesis H0: μ = μ0 against HA: μ > μ0 (where the sample size can be determined by (3.7)) there are three possibilities in each step of evaluation, namely:

  • accept H0
  • reject H0
  • continue the investigation.

The, up to now unsurpassed, textbook of Wald (1947) has since been reprinted and is therefore generally available (Wald 2004), and new results can be found in the books of Ghosh and Sen (1991) and DeGroot (2005). We do not recommend the application of this general theory, but we recommend closed plans, which end after a finite number of steps with certainty (and not only with probability 1). We first give a general definition useful in other sections independent of our one‐sample problem.

3.3 The Two‐Sample Problem

We assume that two independent random samples from each of two populations are investigated.

Independent means that each outcome in one sample, as a particular event, will be observed independently of all outcomes in the other sample. Once the outcomes of the particular character are given, (point) estimates are calculated separately for each sample, as shown in Section 3.2. However, we discuss here the problem whether the two samples really stem from two different populations.

We will first consider again the parameter μ; that is, the two means of the two populations underlying the two samples.

3.3.1 Tests on Two Expectations

We assume that the character of interest is modelled in each population by a normally distributed random variable. That is to say, independent random samples of sizes n1 and n2, respectively, are drawn from populations 1 and 2, with observations x1, … , xn1 on the one hand and y1, … , yn2 on the other. We call the underlying parameters μ1 and μ2, and σ12 and σ22, respectively. The unbiased estimators x̄ and s12, and ȳ and s22, respectively, are then used for the expectations and variances (according to Section 3.2).

We test the null hypothesis

  1. H0: μ1 = μ2 = μ; against one of the alternative hypotheses
    1. (a) HA: μ1 < μ2
    2. (b) HA: μ1 > μ2
    3. (c) HA: μ1 ≠ μ2.

According to recent findings from simulation studies (Rasch et al. 2011a), the traditional approach is usually unsatisfactory because it is not theoretically regulated for all possibilities.

Nevertheless, we start with an introduction to the traditional approach because the current literature on applications of statistics refers to it, and methods for more complex issues are built on it. Further, there are situations in which it is certain that both populations have equal variances, especially if standardisation takes place, as in intelligence tests.

3.3.1.1 The Two‐Sample t‐Test

The traditional approach is based upon the test statistic in formula (3.15). It examines the given null hypothesis for the case that the variances σ12 and σ22 of the relevant variables in the two populations are not known; this is the usual case. However, it assumes further that these variances are equal in both populations. The test statistic

(3.15)  t = (x̄ − ȳ) / √{[(n1 − 1)s12 + (n2 − 1)s22]/(n1 + n2 − 2) · (1/n1 + 1/n2)}

is (centrally) t‐distributed with n1 + n2 − 2 degrees of freedom. This means that we can decide in favour of, or against, the null hypothesis, analogously to Section 3.2, by using the (1 − α)‐quantile t(n1 + n2 − 2, 1 − α) or the (1 − α/2)‐quantile t(n1 + n2 − 2, 1 − α/2) of the (central) t‐distribution, depending on whether it is a one‐ or two‐sided question.

H0: μ1 = μ2 will be rejected if:

  1. (a) t < t(n1 + n2 − 2, α)
  2. (b) t > t(n1 + n2 − 2, 1 − α)
  3. (c) |t| > t(n1 + n2 − 2, 1 − α/2).

If the null hypothesis is rejected, we say that the two expectations differ significantly.

Thus, the minimum sample size for the one-sided case is 106, for the two-sided case we would need at least 121 observations each.
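A Python sketch of the statistic (3.15), cross-checked against scipy's pooled-variance t-test (data invented):

```python
# Pooled two-sample t statistic with n1 + n2 - 2 degrees of freedom.
import math
from statistics import mean, variance
from scipy.stats import ttest_ind

def pooled_t(x, y):
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

x = [27.0, 31.5, 29.2, 30.8, 28.4]     # invented sample 1
y = [25.1, 26.8, 27.9, 24.5, 26.2, 25.7]   # invented sample 2
```

`ttest_ind(x, y, equal_var=True)` computes exactly this pooled statistic.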

3.3.1.2 The Welch Test

We now discuss the case that either it is known that the variances σ12 and σ22 of the two populations are unequal or that we are not sure that they are equal; this is often the case. Therefore, we recommend always using the Welch test proposed below in place of the two‐sample t‐test. This recommendation cannot be found in most textbooks of applied statistics and may therefore surprise experienced users; it is based on the results of current research in Rasch et al. (2011a). There it is also explained that first pre‐testing for equality of variances and, if equality is accepted, continuing with the two‐sample t‐test makes no sense. We show in Table 3.5 a part of table 2 from Rasch et al. (2011a).

Table 3.5 A summary of a simulation study in Rasch et al. (2011a).

δ = 0
                           With pre‐tests                       Without pre‐tests
√(σ12/σ22)  n1   n2      t     Welch (Levene)    W (KS)         t     Welch     W
1           10   10     4.95   4.55 (4.84)       0.00 (0.02)    4.96   4.82   5.21
1           30   30     4.97   4.82 (4.98)      11.11 (0.09)    4.96   4.95   4.92
1           10   30     5.00   4.87 (4.31)       0.00 (0.06)    5.01   5.05   4.87
1           30   10     4.97   5.09 (4.32)       0.00 (0.06)    4.96   5.15   5.00
1           30  100     4.86   5.00 (4.80)       9.09 (0.11)    4.84   4.91   4.82
2           10   10     6.08   3.33 (32.99)      0.00 (0.03)    5.38   4.93   6.06
2           30   30     7.37   4.78 (91.89)     10.00 (0.10)    5.15   4.98   5.88
2           10   30     1.02   5.90 (30.18)      0.00 (0.07)    0.90   5.00   2.19
2           30   10    19.38   2.98 (73.73)     16.67 (0.06)   15.51   5.19  10.09
2           30  100     1.33   4.99 (98.38)      0.00 (0.11)    0.63   4.96   1.89

The entries give the percentage of rejections of the (final) null hypothesis H0: μ1 = μ2. t stands for Student's t‐test and W for the Wilcoxon–Mann–Whitney test, both with and without pre‐testing; Levene means the Levene test and KS the Kolmogorov–Smirnov test on normality.

Pre‐tests of the assumptions are worthless, as shown by our simulation experiments. First testing the normality assumption per sample, then testing the null hypothesis of equality of variances with the Levene test from Section 3.3.4, and finally testing the null hypothesis of equality of means leads to terrible risks: the risk estimated by simulation is near the chosen value α = 0.05 only in the case of the Welch test, and it is exceeded by more than 20% in only a few cases. By contrast, the t‐test rejects the null hypothesis far too often. Moreover, the Wilcoxon W‐test (see Section 3.3.1.3) does not maintain the type I risk at all after pre‐testing; it is too large, and the test would have to be carried out with a newly collected pair of samples.

The Welch test is based on the test statistic

t* = (x̄ − ȳ) / √(s12/n1 + s22/n2).

The distribution of this test statistic was derived by Welch (1947) and is presented in theorem 3.18 in Mathematical Statistics (Rasch and Schott 2018).

If the pair of hypotheses

H0 : μ1 = μ2,  HA : μ1 ≠ μ2

is to be tested, often the approximate test statistic

t* = (x̄ − ȳ) / √(s12/n1 + s22/n2)

is used. H0 is rejected if |t*| is larger than the corresponding quantile of the central t‐distribution with the approximate, so‐called Satterthwaite's,

f = (s12/n1 + s22/n2)2 / [(s12/n1)2/(n1 − 1) + (s22/n2)2/(n2 − 1)]

degrees of freedom.
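The statistic t* and Satterthwaite's f can be written out and cross-checked against scipy's Welch variant (`ttest_ind` with `equal_var=False`); the data are invented:

```python
# Welch's t* and Satterthwaite's approximate degrees of freedom f.
import math
from statistics import mean, variance
from scipy.stats import ttest_ind

def welch(x, y):
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2   # s_i^2 / n_i
    tstar = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    f = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return tstar, f

x = [4.1, 5.2, 6.3, 5.8, 4.9]   # invented sample 1
y = [7.0, 6.1, 8.2, 7.5]        # invented sample 2
tstar, f = welch(x, y)
```

Note that f always lies between min(n1, n2) − 1 and n1 + n2 − 2.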

Referring to the t‐test, the desired relative effect size is, for equal sample sizes,

(μ1 − μ2)/σ.

For the Welch test, the desired relative effect size then reads:

(μ1 − μ2) / √[(σ12 + σ22)/2].

Planning a study for the Welch test is carried out in a rather similar manner to that for two‐sample t‐test. A problem arises, however: in most cases it is not known whether the two variances in the relevant population are equal or not, and if not, to what extent they are unequal. If one knew this, then it would be possible to calculate the necessary sample size exactly. In the other case, one can use the largest appropriate size, which results from equal, realistically maximum expected variances.

3.3.1.3 The Wilcoxon Rank Sum Test

Wilcoxon (1945) proposed for equal sample sizes, and Mann and Whitney (1947) later extended to unequal sample sizes, a two‐sample distribution‐free test based on the ranks of the observations. This test is not based on the normality assumption; in its exact form only two continuous distributions with all moments existing are assumed. We call it the Wilcoxon test. As is seen from the title of Mann and Whitney (1947), this test examines whether one of the underlying random variables is stochastically larger than the other. It can be used for testing the equality of medians under additional assumptions: if the continuous distributions are symmetric, the medians are equal to the expectations. The null hypothesis tested by the Wilcoxon test corresponds to the hypothesis of equality of the medians m1 and m2, H0 : m1 = m2 = m, if and only if all higher moments of the two populations exist and are equal. Otherwise a rejection of the Wilcoxon hypothesis says little about the rejection of H0 : m1 = m2 = m. The test runs as follows, based on the observations x1, … , xn1 and y1, … , yn2 of the two samples. We assume the sample sizes satisfy n1 ≤ n2.

Calculate:

(3.16)  the ranks Ri, i = 1, … , n1, of the x‐observations in the combined ordered sample of all N = n1 + n2 observations,

and

(3.17)  W = R1 + R2 + ⋯ + Rn1.

We denote the P‐quantiles of W based on n1 ≤ n2 observations by W(n1, n2, P).

Reject H0:

  • in case (a), if W > W(n1, n2, 1 − α),
  • in case (b), if W < W(n1, n2, α),
  • in case (c), if either W > W(n1, n2, 1 − α/2) or W < W(n1, n2, α/2),

and accept it otherwise (the three cases are those given in problem 3.13).

The expectation of W if H0 is valid is E(W) = n1(n1 + n2 + 1)/2.

The variance of W if H0 is valid is var(W) = n1 n2(n1 + n2 + 1)/12.

Mann and Whitney used the test statistic U, which is the number of times that an x‐observation is larger than a y‐observation. The relation between U and W is U = W − n1(n1 + 1)/2. Hence the expectation of U is E(U) = E(W) − n1(n1 + 1)/2 = n1n2/2, and the variance of U is var(U) = var(W).

The expressions for var(W) and var(U) must be corrected if there are ties (equal observations) in the data.

If there are groups of ties of sizes ti, then n1n2[Σ(ti³ − ti)]/[12N(N − 1)] must be subtracted from var(W), so that with ties

var(W) = var(U) = n1n2[N³ − N − Σ(ti³ − ti)] / [12N(N − 1)],

with N = n1 + n2.

The expectations E(W) and E(U) remain the same in the case of ties.

Be aware that the quantiles of the random test statistic U are denoted in R by W and are given by the R command

 > qwilcox(P,n1,n2) 

as an example take

 > qwilcox(0.05,15,20)
[1] 101
>  qwilcox(0.95,15,20)
[1] 199 
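The relation U = W − n1(n1 + 1)/2 can be checked directly (a Python sketch with invented data; scipy's `mannwhitneyu`, like R, reports U):

```python
# Rank sum W of the x-sample and Mann-Whitney U from the combined ranking.
from scipy.stats import mannwhitneyu, rankdata

x = [1.2, 3.4, 2.2]             # n1 = 3, invented
y = [4.5, 5.1, 2.9, 6.0]        # n2 = 4, invented
ranks = rankdata(x + y)         # joint ranks of all N = n1 + n2 values
W = float(ranks[: len(x)].sum())  # rank sum of the x-observations
U = mannwhitneyu(x, y, alternative="two-sided").statistic
```

Here the x-observations carry ranks 1, 4, 2, so W = 7 and U = 7 − 3·4/2 = 1, i.e. exactly one pair (x, y) with x > y.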

3.3.1.4 Definition of Robustness and Results of Comparing Tests by Simulation

We give here an introduction to methods of empirical statistics via simulations; these methods will be used later in this chapter and also in other chapters (especially in Chapter 11).

The robustness of a statistical method means that the essential properties of this method are relatively insensitive to variations of the assumptions. We especially want to investigate the robustness of the methods of this section with respect to violating normality.

3.3.1.5 Sequential Two‐Sample Tests

The technique of sequential testing offers the advantage that, given many studies, on average fewer research units need to be sampled as compared to the 'classic' approach of hypothesis testing in Sections 3.3.1.1–3.3.1.4 with sample sizes fixed beforehand. Nevertheless, we also need the precision requirements α, β, and δ. In the case of two samples from two populations and testing the null hypothesis H0: μ1 = μ2, a sequential triangular test is preferable. Its application is completely analogous to Section 3.2.6. The only difference is that we use δ = μ2 − μ1 for the relevant (minimum) difference between the null and the alternative hypothesis, instead of δ = μ0 − μ. Again we keep on sampling data, i.e. outcomes, until we are able to make a terminal decision, namely to accept or to reject the null hypothesis.

3.3.2 Test on Two Medians

To test the null hypothesis H0 that two populations have the same median against the alternative hypothesis HA that the medians are different, a random sample of size ni (i = 1, 2) is drawn from each population. Be aware that the scale of the measurement must be at least ordinal, or else the term median has no meaning. A 2 × 2 contingency table is constructed, with rows '>median' and '≤median' and one column for each sample i = 1, 2. The two entries in the ith column are the numbers of observations in the ith sample that lie above and below the combined‐sample grand median m̃, i.e. the median of all observations combined.

The median test has three appealing properties from the practical standpoint. In the first place it is primarily sensitive to differences in location between cells and not to the shapes of the cell distributions. Thus, if the observations of some cells were symmetrically distributed while in other cells they were positively skewed, the rank test would be inclined to reject the null hypothesis even though all population medians were the same. The median test would not be much affected by such differences in the shapes of the cell distributions. In the second place the computations associated with the median test are quite simple and the test itself is nothing more than the familiar contingency table test. In the third place when we come to consider more complex experiments it will be found that the median tests are not much affected by differing cell sizes. Bradley (1968) gives the following rationale and example for the Westenberg–Mood median test (Westenberg 1948; Mood 1950, 1954).

3.3.2.1 Rationale

Suppose that populations I and II are both continuously distributed populations of variate-values, that the median variate-value of the combined sample is $m$, and that the two row categories are $< m$ and $\geq m$, respectively. If the number of observations in the combined sample is odd, the single observation equal to $m$ is excluded from all frequencies in the 2 × 2 table, so that $N$ (the total number of remaining data) is even. Fisher's exact method for the 2 × 2 contingency table then becomes a test of the hypothesis that equal but inexactly known proportions of populations I and II lie below $m$ (or, if the original combined sample contained an even number of observations, below some population value lying between the $[N/2]$th and $[(N/2) + 1]$th observations in order of increasing size in the combined sample). Thus it tests the hypothesis that $m$ (or some value close to it) is the same, but unknown, quantile in the two populations. If H0 is true and $N$ is large, the common proportion of the two populations lying below $m$ will tend to be close to 0.5, so $m$ will tend to be a population quantile that is nearly a median. For this reason the Westenberg–Mood test is sometimes stated to test for identical population medians; however, it is quite possible for two populations to have unequal medians and equal '$m$-quantiles', or vice versa, especially when both samples are small. Rather than a median test, it might therefore better be described as a quasi-median test, or a test for a common, probably more or less centrally located, quantile.
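The construction just described is easy to carry out with software. The book's computations use R, but here is a hedged Python sketch using scipy's Fisher exact test on the 2 × 2 table; the data are simulated and the function name `median_test_2x2` is our own:

```python
import numpy as np
from scipy.stats import fisher_exact

def median_test_2x2(x, y):
    """Westenberg-Mood median test via Fisher's exact method (illustrative sketch)."""
    m = np.median(np.concatenate([x, y]))   # grand median of the combined sample
    # observations equal to the grand median are excluded, as in the rationale
    table = [[int(np.sum(x > m)), int(np.sum(y > m))],
             [int(np.sum(x < m)), int(np.sum(y < m))]]
    _, p = fisher_exact(table)              # two-sided by default
    return m, table, p

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 20)   # sample from population I
y = rng.normal(1.0, 1.0, 20)   # sample from population II, shifted in location
m, table, p = median_test_2x2(x, y)
print(m, table, p)
```

scipy also ships `scipy.stats.median_test`, which builds the same table but applies a chi-squared test by default.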

3.3.3 Test on Two Probabilities

Let y be distributed as B(n, p). Based on one observation y = Y we want to test H0: p = p0 against HA: p ≠ p0, with p0 ∈ (0, 1). The natural parameter is $\eta = \ln[p/(1-p)]$, and y is sufficient with respect to the family of binomial distributions. Therefore the uniformly most powerful unbiased (UMPU) α-test is given by (3.27) in Rasch and Schott (2018), where the constants $c_i$ and $\gamma_i$ ($i = 1, 2$) have to be determined from (3.28) and (3.29) in Rasch and Schott (2018). With

$$b(y \mid n, p_0) = \binom{n}{y} p_0^{\,y} (1 - p_0)^{\,n-y},$$

Equation (3.28) in Rasch and Schott (2018) has the form

$$\sum_{y=c_1+1}^{c_2-1} b(y \mid n, p_0) + (1-\gamma_1)\, b(c_1 \mid n, p_0) + (1-\gamma_2)\, b(c_2 \mid n, p_0) = 1 - \alpha. \tag{3.20}$$

Regarding

$$E(y) = n p_0,$$

the relation (3.29) in Rasch and Schott (2018) leads to

$$\sum_{y=c_1+1}^{c_2-1} y\, b(y \mid n, p_0) + (1-\gamma_1)\, c_1\, b(c_1 \mid n, p_0) + (1-\gamma_2)\, c_2\, b(c_2 \mid n, p_0) = (1-\alpha)\, n p_0. \tag{3.21}$$

The solution of these two simultaneous equations can be obtained with statistical software, e.g. with R. Further results can be found in the book by Fleiss et al. (2003).
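Determining $c_1$, $c_2$, $\gamma_1$, and $\gamma_2$ from these equations requires numerical root-finding. In applications one often settles for the conservative, non-randomized exact binomial test instead; a minimal Python sketch (the numbers y = 13, n = 20, p0 = 0.4 are invented for illustration):

```python
from scipy.stats import binomtest

y, n, p0 = 13, 20, 0.4                     # illustrative data only
result = binomtest(y, n, p0, alternative='two-sided')
print(result.pvalue)                       # reject H0: p = p0 if pvalue < alpha
```

Unlike the randomized UMPU test, this non-randomized version generally has an actual first kind risk somewhat below the nominal α.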

3.3.4 Tests on Two Variances

We consider two independent random samples $Y_1 = (y_{11}, \ldots, y_{1n_1})$ and $Y_2 = (y_{21}, \ldots, y_{2n_2})$, where the components $y_{ij}$ are supposed to be distributed as $N(\mu_i, \sigma_i^2)$. We intend to test the null hypothesis

$$H_0 : \sigma_1^2 = \sigma_2^2$$

against the alternative

$$H_A : \sigma_1^2 \neq \sigma_2^2.$$

Under $H_0$ we have $\sigma_1^2 / \sigma_2^2 = 1$, and the random variable

$$F = \frac{s_1^2}{s_2^2}$$

(with $s_i^2$ the sample variance of the $i$th sample) does not depend on $\mu_1$ and $\mu_2$ or on the common variance $\sigma^2 = \sigma_1^2 = \sigma_2^2$. Since the random variable $F$ is centrally distributed as $F(n_1 - 1, n_2 - 1)$ under $H_0$, the function

$$k(Y_1, Y_2) = \begin{cases} 1 & \text{if } F < F\!\left(n_1 - 1, n_2 - 1 \,\middle|\, \tfrac{\alpha}{2}\right) \text{ or } F > F\!\left(n_1 - 1, n_2 - 1 \,\middle|\, 1 - \tfrac{\alpha}{2}\right), \\ 0 & \text{otherwise} \end{cases}$$

defines a UMPU-α-test, where $F(n_1 - 1, n_2 - 1 \mid P)$ is the $P$-quantile of the $F$-distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. This test is very sensitive to deviations from the normal distribution and should therefore not be used in applications; for this reason we present no example or problem for it.

Instead we propose the Levene test. Box (1953) already pointed out the extreme non-robustness of the F-test for comparing two variances. Rasch and Guiard (2004) reported on extensive simulation experiments devoted to this problem. Box's results show that non-robustness has to be expected even for relatively small deviations from the normal distribution. Hence we generally suggest applying the test of Levene (1960), which we describe now.

For j = 1, 2 we put

$$z_{ji} = \left| y_{ji} - \bar{y}_{j} \right|, \quad i = 1, \ldots, n_j,$$

and

$$\bar{z}_{j} = \frac{1}{n_j} \sum_{i=1}^{n_j} z_{ji},$$

where $\bar{y}_{j} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ji}$.

The null hypothesis H0 is rejected if

$$|t| = \frac{\left| \bar{z}_1 - \bar{z}_2 \right|}{s_z \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} > t\left(n_1 + n_2 - 2 \,\Big|\, 1 - \frac{\alpha}{2}\right), \tag{3.22}$$

where $s_z^2$ is the pooled sample variance of the $z_{ji}$ and $t(n_1 + n_2 - 2 \mid P)$ is the $P$-quantile of the central $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom.
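A direct implementation of this two-sample Levene procedure (absolute deviations from the group means, followed by a pooled two-sample t-test on the transformed values) can be sketched in Python; the data are simulated and the function name `levene_t` is our own:

```python
import numpy as np
from scipy.stats import t

def levene_t(y1, y2, alpha=0.05):
    """Two-sample Levene test: pooled t-test on the values |y_ji - mean_j|."""
    z1 = np.abs(y1 - y1.mean())          # absolute deviations, sample 1
    z2 = np.abs(y2 - y2.mean())          # absolute deviations, sample 2
    n1, n2 = len(z1), len(z2)
    # pooled variance of the transformed values
    s2 = ((n1 - 1) * z1.var(ddof=1) + (n2 - 1) * z2.var(ddof=1)) / (n1 + n2 - 2)
    t_stat = (z1.mean() - z2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
    crit = t.ppf(1 - alpha / 2, n1 + n2 - 2)   # (1 - alpha/2)-quantile
    return t_stat, abs(t_stat) > crit          # True => reject H0

rng = np.random.default_rng(4)
y1 = rng.normal(0, 1, 30)
y2 = rng.normal(0, 3, 30)      # three times the standard deviation
t_stat, reject = levene_t(y1, y2)
print(t_stat, reject)
```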

In R the Levene test is not available in the base packages, but it is quite easy to program yourself, or it can be taken from the package car (v3.0.0) by John Fox; see https://cran.r-project.org/web/packages/car/car.pdf.
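In Python, a ready-made counterpart is `scipy.stats.levene`; `center='mean'` corresponds to mean-based deviations, while the default `center='median'` gives the more robust Brown–Forsythe variant. A minimal sketch with simulated data:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(5)
y1 = rng.normal(0, 1, 30)
y2 = rng.normal(0, 3, 30)      # three times the standard deviation

stat, p = levene(y1, y2, center='mean')
print(stat, p)                 # reject H0 of equal variances if p < alpha
```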

References

  1. Box, G.E.P. (1953). Non‐normality and tests on variances. Biometrika 40: 318–335.
  2. Bradley, J.V. (1968). Distribution‐Free Statistical Tests. Englewood Cliffs, New Jersey: Prentice‐Hall, Inc.
  3. DeGroot, M.H. (2005). Optimal Statistical Decisions. New York: Wiley Online Library.
  4. Dodge, H.F. and Romig, H.G. (1929). A method of sampling inspection. Bell Syst. Tech. J. 8: 613–631.
  5. Fleishman, A.J. (1978). A method for simulating non‐normal distributions. Psychometrika 43: 521–532.
  6. Fleiss, J.L., Levin, B., and Paik, M.C. (2003). Statistical Methods for Rates and Proportions, 3e. Hoboken: Wiley.
  7. Ghosh, M. and Sen, P.K. (1991). Handbook of Sequential Analysis. Boca Raton: CRC Press.
  8. Levene, H. (1960). Robust tests for equality of variances. In: Contributions to Probability and Statistics, 278–292. Stanford: Stanford University Press.
  9. Mann, H.B. and Whitney, D.R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18: 50–60.
  10. Mood, A.M. (1950). Introduction to the Theory of Statistics. New York: McGraw‐Hill.
  11. Mood, A.M. (1954). On asymptotic efficiency of certain nonparametric two‐sample tests. Ann. Math. Stat. 25: 514–533.
  12. Munzel, U. and Brunner, E. (2002). An exact paired rank test. Biom. J. 44: 584–593.
  13. Rasch, D. and Guiard, V. (2004). The robustness of parametric statistical methods. Psychol. Sci. 46: 175–208.
  14. Rasch, D. and Schott, D. (2018). Mathematical Statistics. Oxford: Wiley.
  15. Rasch, D. and Tiku, M.L. (eds.) (1985). Robustness of Statistical Methods and Nonparametric Statistics. Proc. Conf. on Robustness of Statistical Methods and Nonparametric Statistics, Schwerin (DDR), May 29–June 2, 1983. Dordrecht: Reidel Publishing Company.
  16. Rasch, D., Kubinger, K.D., and Moder, K. (2011a). The two‐sample t test: pre‐testing its assumptions does not pay‐off. Stat. Pap. 52: 219–231.
  17. Rasch, D., Pilz, J., Verdooren, R., and Gebhardt, A. (2011b). Optimal Experimental Design with R. Boca Raton: Chapman and Hall.
  18. Rasch, D., Kubinger, K.D., and Yanagida, T. (2011c). Statistics in Psychology Using R and SPSS. Hoboken: Wiley.
  19. Stein, C. (1945). A two sample test for a linear hypothesis whose power is independent of the variance. Ann. Math. Stat. 16: 243–258.
  20. Wald, A. (1947). Sequential Analysis. New York: Dover Publ.
  21. Wald, A. (2004). Sequential Analysis, reprint edn. New York: Wiley.
  22. Welch, B.L. (1947). The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34: 28–35.
  23. Westenberg, J. (1948). Significance test for median and interquartile range in samples from continuous populations of any form. Proc. Koninklijke Nederlandse Akademie van Wetenschappen, (The Netherlands) 51: 252–261.
  24. Whitehead, J. (1997). The Design and Analysis of Sequential Clinical Trials, 2e Revised. New York: Wiley.
  25. Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biom. Bull. 1: 80–82.