J. Roy. Statist. Soc., Series B, 38 (1) (1976), 54–59.
This chapter introduces a test of the composite hypothesis of normality. The test is based on the property of the normal distribution that its entropy exceeds that of any other distribution with a density that has the same variance. The test statistic is based on a class of estimators of entropy constructed here. The test is shown to be a consistent test of the null hypothesis for all alternatives without a singular continuous part. The power of the test is estimated against several alternatives. It is observed that the test compares favorably with other tests for normality.
The entropy of a distribution F with a density function f is defined as

$$H(f) = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx. \tag{1}$$

Let $x_1, x_2, \ldots, x_n$ be a sample from the distribution F. Express (1) in the form

$$H(f) = \int_0^1 \log\left\{\frac{d}{dp}F^{-1}(p)\right\}dp. \tag{2}$$
An estimate of (2) can be constructed by replacing the distribution function F by the empirical distribution function $F_n$, and using a difference operator in place of the differential operator. The derivative of $F^{-1}(p)$ is then estimated by $n\,(x_{(i+m)} - x_{(i-m)})/(2m)$ for $(i-1)/n < p \le i/n$, $i = m+1, m+2, \ldots, n-m$, where $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ are the order statistics and m is a positive integer smaller than n/2. One-sided differences of the type $n\,(x_{(i+m)} - x_{(1)})/(2m)$ and $n\,(x_{(n)} - x_{(i-m)})/(2m)$ are used in place of $n\,(x_{(i+m)} - x_{(i-m)})/(2m)$ when $i \le m$, $i > n-m$, respectively. This produces an estimate of $H(f)$ of the form

$$H_{mn} = \frac{1}{n}\sum_{i=1}^{n}\log\left\{\frac{n}{2m}\left(x_{(i+m)} - x_{(i-m)}\right)\right\}, \tag{3}$$

where $x_{(i)} = x_{(1)}$ for $i < 1$, and $x_{(i)} = x_{(n)}$ for $i > n$.
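As a computational aside (not part of the original paper), the estimate (3) takes only a few lines of code. In this sketch the function name vasicek_entropy and the use of NumPy are assumptions of the illustration:

```python
import numpy as np

def vasicek_entropy(x, m):
    """Entropy estimate H_mn of Eq. (3): the sample mean of
    log{(n/2m) * (x_(i+m) - x_(i-m))}, with the convention
    x_(i) = x_(1) for i < 1 and x_(i) = x_(n) for i > n."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if not 0 < m < n / 2:
        raise ValueError("m must be a positive integer smaller than n/2")
    i = np.arange(n)
    upper = x[np.minimum(i + m, n - 1)]   # x_(i+m), clamped at x_(n)
    lower = x[np.maximum(i - m, 0)]       # x_(i-m), clamped at x_(1)
    return np.log(n / (2.0 * m) * (upper - lower)).mean()
```

For a large standard normal sample, with m growing slowly with n, the value should approach $\log\sqrt{2\pi e} \approx 1.419$.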
To investigate the behavior of $H_{mn}$, it is useful to write it as a sum of three components,

$$H_{mn} = A_n + B_{mn} + C_{mn}, \tag{4}$$

where

$$A_n = -\frac{1}{n}\sum_{i=1}^{n}\log f(x_i),$$

$$B_{mn} = \frac{1}{n}\sum_{i=1}^{n}\log\left\{\frac{f(x_{(i)})\left(x_{(i+m)} - x_{(i-m)}\right)}{F(x_{(i+m)}) - F(x_{(i-m)})}\right\},$$

$$C_{mn} = \frac{1}{n}\sum_{i=1}^{n}\log\left\{\frac{n}{2m}\left(F(x_{(i+m)}) - F(x_{(i-m)})\right)\right\}.$$
The first term in Eq. (4) does not depend on m and represents the sample mean estimate of $H(f)$, assuming that the value of f at the points $x_i$ is known. If the variance of $-\log f(x)$ is finite, it is the minimum variance unbiased estimate of $H(f)$ given the values of f at the sample points. The two remaining terms represent two sources of additional estimation error. The term $B_{mn}$ is due to estimation of f by finite differences. For fixed n, its effect decreases with decreasing values of m. The term $C_{mn}$ corresponds to the error due to estimating increments of F by increments of $F_n$. The increments are taken over the intervals $(x_{(i-m)}, x_{(i+m)})$, whose length increases with m, and therefore the disturbance due to $C_{mn}$ becomes smaller as m becomes larger.

As $n \to \infty$, simultaneous reduction of the effect of these two noise terms requires that $m \to \infty$, $m/n \to 0$. An optimal choice of m for a given n, however, depends on the (unknown) distribution F. In general, the smoother the density of F, the larger is the optimal value of m.
Since $F(x_{(1)}), F(x_{(2)}), \ldots, F(x_{(n)})$ are distributed as an ordered sample of size n from the uniform distribution on (0, 1), the distribution of $C_{mn}$ does not depend on F. Its limiting behavior is given by the following lemma.

Lemma 1. The variable $C_{mn}$ converges to zero in probability as $n \to \infty$, $m \to \infty$, $m/n \to 0$.
Proof. Put $p_i = F(x_{(i)})$, $i = 1, 2, \ldots, n$, with $p_i = p_1$ for $i < 1$ and $p_i = p_n$ for $i > n$. Since the geometric mean does not exceed the arithmetic mean, it follows that

$$\exp(C_{mn}) \le \frac{1}{n}\sum_{i=1}^{n}\frac{n}{2m}\left(p_{i+m} - p_{i-m}\right) = \frac{1}{2m}\sum_{i=1}^{n}\left(p_{i+m} - p_{i-m}\right) \le 1.$$

Therefore, $C_{mn}$ is a nonpositive variable with the mean

$$E\,C_{mn} = \log\frac{n}{2m} + \frac{1}{n}\sum_{i=1}^{n}E\log\left(p_{i+m} - p_{i-m}\right).$$

The variable $p_{i+m} - p_{i-m}$ has the beta distribution with parameters $(2m,\, n-2m+1)$ for $m < i \le n-m$, and with parameters $(i+m-1,\, n-i-m+2)$ and $(n-i+m,\, i-m+1)$ for $i \le m$ and $i > n-m$, respectively. The expected value of the logarithm of a beta variable with parameters $(a, b)$ is easily evaluated by differentiation of its moment generating function at zero as $\psi(a) - \psi(a+b)$, where $\psi(x) = (d/dx)\log\Gamma(x)$ is the digamma function. Thus, after some algebra,

$$E\,C_{mn} = \log\frac{n}{2m} + \left(1 - \frac{2m}{n}\right)\psi(2m) - \psi(n+1) + \frac{2}{n}\sum_{i=1}^{m}\psi(i+m-1). \tag{5}$$

The right-hand side of the last equality converges to zero as $n \to \infty$, $m \to \infty$, $m/n \to 0$. Thus, $C_{mn}$ forms a sequence of nonpositive variables with expectations approaching zero, and consequently

$$\operatorname{plim} C_{mn} = 0.$$
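The moment identity used above, $E\log Z = \psi(a) - \psi(a+b)$ for Z beta-distributed with parameters $(a, b)$, is easy to verify numerically; the following small check is an illustration added here, not part of the paper:

```python
import numpy as np
from scipy.special import digamma

# Z ~ Beta(a, b) with, e.g., a = 2m = 4 and b = n - 2m + 1 = 11
a, b = 4.0, 11.0
z = np.random.default_rng(0).beta(a, b, size=1_000_000)
print(np.log(z).mean())              # Monte Carlo estimate of E log Z
print(digamma(a) - digamma(a + b))   # exact value, about -1.418
```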
Since the distribution of $C_{mn}$ is independent of F, the bias due to the presence of $C_{mn}$ in (4) can be eliminated by using

$$H^{*}_{mn} = H_{mn} - b_{mn}$$

rather than $H_{mn}$, as an estimate of entropy. Here $b_{mn} = E\,C_{mn}$ is given by (5). The following theorem deals with the consistency of $H_{mn}$ (and, by Lemma 1, also that of $H^{*}_{mn}$).
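Before stating it: the correction (5) involves only the digamma function, as in this sketch (the helper names bias_bmn and vasicek_entropy_corrected are assumptions of the illustration, and vasicek_entropy is reused from the earlier sketch; (5) itself is a reconstruction, as noted above):

```python
import numpy as np
from scipy.special import digamma

def bias_bmn(n, m):
    """b_mn = E C_mn of Eq. (5): the distribution-free bias of H_mn."""
    return (np.log(n / (2.0 * m))
            + (1.0 - 2.0 * m / n) * digamma(2 * m)
            - digamma(n + 1)
            + (2.0 / n) * digamma(np.arange(1, m + 1) + m - 1).sum())

def vasicek_entropy_corrected(x, m):
    """H*_mn = H_mn - b_mn, the bias-eliminated estimate of entropy."""
    return vasicek_entropy(x, m) - bias_bmn(len(x), m)
```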
Theorem 1. Let $x_1, x_2, \ldots, x_n$ be a sample from a distribution F with a density f and a finite variance. Then

$$\operatorname{plim} H_{mn} = H(f) \quad \text{as } n \to \infty,\ m \to \infty,\ m/n \to 0.$$
Proof. With some reorganization, $H_{mn}$ can be written

$$H_{mn} = C_{mn} + \frac{1}{2m}\sum_{j=1}^{2m}S_j, \tag{6}$$

where

$$S_j = \sum_{\substack{1 \le i \le n \\ i \equiv j\ (\mathrm{mod}\ 2m)}} \log\left\{\frac{x_{(i+m)} - x_{(i-m)}}{F(x_{(i+m)}) - F(x_{(i-m)})}\right\}\left(F_n(x_{(i+m)}) - F_n(x_{(i-m)})\right)$$

and $F_n$ is the empirical distribution function. When $x_{(i-m)}$, $x_{(i+m)}$ belong to an interval in which f is positive and continuous, then there exists a value $\xi_i$, $x_{(i-m)} \le \xi_i \le x_{(i+m)}$, such that

$$F(x_{(i+m)}) - F(x_{(i-m)}) = f(\xi_i)\left(x_{(i+m)} - x_{(i-m)}\right).$$

Therefore, $S_j$ is a Stieltjes sum of the function $-\log f(x)$ with respect to the measure $F_n$ over the sum of intervals of continuity of f in which $f > 0$. The contribution of the terms in $S_j$ that correspond to the remaining intervals approaches zero with $n \to \infty$. Since in any interval in which f is positive $x_{(i+m)} - x_{(i-m)} \to 0$ a.s. as $n \to \infty$, $m/n \to 0$, and $F_n(x) \to F(x)$ a.s. uniformly over x, $S_j$ converges a.s. to

$$H(f) = -\int_{-\infty}^{\infty}\log f(x)\,dF(x),$$

which is either finite or $-\infty$ in virtue of the finite variance of F. Moreover, this convergence is uniform over j. Consequently,

$$\operatorname{plim}\frac{1}{2m}\sum_{j=1}^{2m}S_j = H(f). \tag{7}$$

Since

$$\operatorname{plim} C_{mn} = 0$$

by Lemma 1, the statement of the theorem follows from (7).
A well-known theorem of information theory (Shannon, 1949, p. 55) states that among all distributions that possess a density function f and have a given variance $\sigma^2$, the entropy $H(f)$ is maximized by the normal distribution. The entropy of the normal distribution with variance $\sigma^2$ is $\log\left(\sigma\sqrt{2\pi e}\right)$. The question arises as to whether a test of the composite hypothesis of normality can be based on this property. The estimate $H_{mn}$ will be used for that purpose.
Definition. Let $x_1, x_2, \ldots, x_n$ be a sample from a distribution F and let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ be the order statistics. Let m be a positive integer smaller than $n/2$ and define $x_{(i)} = x_{(1)}$ for $i < 1$, $x_{(i)} = x_{(n)}$ for $i > n$. The K test of the composite hypothesis of normality is a test with critical region $K_{mn} \le K^{*}_{mn}(\alpha)$, where

$$K_{mn} = \frac{n}{2ms}\left\{\prod_{i=1}^{n}\left(x_{(i+m)} - x_{(i-m)}\right)\right\}^{1/n} \tag{8}$$

and

$$s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.$$

Note that $K_{mn} = e^{H_{mn}}/s$. Under the null hypothesis,

$$\operatorname{plim} K_{mn} = (2\pi e)^{1/2} \quad \text{as } n \to \infty,\ m \to \infty,\ m/n \to 0.$$
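In code, (8) is simply the exponential of the entropy estimate divided by the sample standard deviation. A minimal sketch (reusing the illustrative vasicek_entropy function from above; the $(1/n)$ normalization of s follows the definition in (8)):

```python
import numpy as np

def k_statistic(x, m):
    """K_mn of Eq. (8), equal to exp(H_mn)/s. Under normality it tends
    to sqrt(2*pi*e) ~ 4.13; small values are evidence against H0."""
    x = np.asarray(x, dtype=float)
    s = x.std()                 # ddof=0: s^2 = (1/n) * sum (x_i - xbar)^2
    return np.exp(vasicek_entropy(x, m)) / s

# Reject normality at level alpha when k_statistic(x, m) <= K*_mn(alpha);
# e.g., for n = 20, m = 3, alpha = .05, Table 35.1 gives K* = 2.77.
```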
Under an alternative distribution with density f and a finite variance $\sigma^2$,

$$\operatorname{plim} K_{mn} = e^{H(f)}/\sigma < (2\pi e)^{1/2}.$$

This means that the K test is consistent for such alternatives. There is no need, however, to restrict the use of the test to distributions with a density and a finite second moment, as will be established in Theorem 2. First, a lemma will be proven.
Lemma 2. Let F be a distribution with a density function f and without a finite second moment. Put

$$P_c = \int_{-c}^{c} f(x)\,dx.$$

For each c such that $P_c > 0$, define a density function $f_c$ by

$$f_c(x) = \begin{cases} f(x)/P_c, & |x| \le c, \\ 0, & |x| > c. \end{cases} \tag{9}$$

Denote the variance of $f_c$ by $\sigma_c^2$. Then $H(f_c) - \log\sigma_c \to -\infty$ as $c \to \infty$.
Proof. Let d be such that $P_d > 0$. Then for $c \ge d$,

$$H(f_c) = -\int_{|x| \le d} f_c(x)\log f_c(x)\,dx - \int_{d < |x| \le c} f_c(x)\log f_c(x)\,dx,$$

where the two integrals are taken over the sets $\{|x| \le d\}$ and $\{d < |x| \le c\}$, respectively. According to an inequality in information theory (cf., for instance, Kullback, 1959, p. 15),

$$\int f_1 \log\frac{f_1}{f_2}\,dx \ge \left(\int f_1\,dx\right)\log\frac{\int f_1\,dx}{\int f_2\,dx} \tag{10}$$

for nonnegative functions $f_1$, $f_2$. Let g be the density of the normal distribution with the same mean and variance as $f_c$. An application of inequality (10) with $f_1 = f_c$, $f_2 = g$ over the set $\{d < |x| \le c\}$, together with the inequality $\log u \le u - 1$ applied over $\{|x| \le d\}$, then yields

$$H(f_c) - \log\sigma_c \le 2d + \tfrac{1}{2}\log(2\pi e) + e^{-1} - \frac{P_d}{P_c}\log\left(\sigma_c\sqrt{2\pi}\right).$$

For a fixed d and $c \to \infty$, the right-hand side of the last inequality approaches minus infinity, since $\sigma_c \to \infty$ for a distribution without a finite second moment, as was to be proven.
Theorem 2. The K test of any size $\alpha$ is a consistent test, as $n \to \infty$, $m \to \infty$, $m/n \to 0$, for all alternatives without a singular continuous part.
Proof. Let $x_1, x_2, \ldots, x_n$ be a sample from a distribution F. If F has a density and a finite variance, the consistency of the test follows from Theorem 1. Assume that F has a density f but the second moment is infinite. Let $f_c$ be the truncated density (9) with variance $\sigma_c^2$. Define a statistic $K^{c}_{mn}$ as

$$K^{c}_{mn} = \frac{n_c}{2ms_c}\left\{\prod_{i=1}^{n_c}\left(y_{(i+m)} - y_{(i-m)}\right)\right\}^{1/n_c},$$

where $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n_c)}$ is the subsample of all $x_i$ such that $|x_i| \le c$, $n_c$ is the subsample size, and $s_c^2$ its sample variance. Since the subsample has the density $f_c$ and $n_c \to \infty$ a.s. as $n \to \infty$, it follows that

$$\operatorname{plim} K^{c}_{mn} = e^{H(f_c)}/\sigma_c.$$

The difference $K_{mn} - K^{c}_{mn}$ converges to zero in probability with $c \to \infty$ uniformly over n. Therefore,

$$\operatorname{plim} K_{mn} = 0$$

in virtue of Lemma 2, which establishes consistency for that class of alternatives.
Finally, let F have an atom a with a weight $p > 0$. Then

$$P\left\{x_{(i+m)} - x_{(i-m)} = 0 \text{ for some } i\right\} \to 1$$

as $n \to \infty$, $m/n \to 0$, since the number of sample points equal to a eventually exceeds 2m a.s., so that at least one factor in (8) vanishes. Thus,

$$\operatorname{plim} K_{mn} = 0,$$

and the consistency of the test for alternatives with an atom follows. This completes the proof.
It can be shown that always

$$K_{mn} \le \left(x_{(n)} - x_{(1)}\right)/s,$$

a consequence of the inequality between the geometric and arithmetic means.
Except in the simplest cases, the distribution of $K_{mn}$ under the null hypothesis has not been obtained analytically. To determine the percentage points $K^{*}_{mn}(\alpha)$, Monte Carlo simulations were employed. For each n, samples of size n from the normal distribution were formed, using the congruence method of generating pseudo-random numbers and obtaining approximately normal deviates as sums of 12 uniform deviates. The statistic $K_{mn}$ for several values of m was calculated from each sample, and percentage points of the distribution of $K_{mn}$ were estimated by the corresponding order statistics. For each significance level and each value of m, the estimates were smoothed by fitting a polynomial in n. The lower-tail 5 percent significance points of $K_{mn}$ for selected values of n, m are given in Table 35.1.
Table 35.1 0.05 points for the K statistic
| n | m = 1 | m = 2 | m = 3 | m = 4 | m = 5 |
|----|-------|-------|-------|-------|-------|
| 3 | 0.99 | | | | |
| 4 | 1.05 | | | | |
| 5 | 1.19 | 1.70 | | | |
| 6 | 1.33 | 1.77 | | | |
| 7 | 1.46 | 1.87 | 1.97 | | |
| 8 | 1.57 | 1.97 | 2.05 | | |
| 9 | 1.67 | 2.06 | 2.13 | | |
| 10 | 1.76 | 2.15 | 2.21 | | |
| 12 | 1.90 | 2.31 | 2.36 | | |
| 14 | 2.01 | 2.43 | 2.49 | | |
| 16 | 2.11 | 2.54 | 2.60 | 2.57 | |
| 18 | 2.18 | 2.62 | 2.69 | 2.67 | |
| 20 | 2.25 | 2.69 | 2.77 | 2.76 | |
| 25 | | 2.83 | 2.93 | 2.93 | 2.91 |
| 30 | | 2.93 | 3.04 | 3.06 | 3.05 |
| 35 | | 3.00 | 3.13 | 3.16 | 3.16 |
| 40 | | | 3.19 | 3.24 | 3.24 |
| 45 | | | 3.25 | 3.29 | 3.30 |
| 50 | | | 3.29 | 3.34 | 3.35 |
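The Monte Carlo determination of the percentage points can be replayed with a modern generator; in the following sketch (which reuses the illustrative k_statistic function above), the replication count and seed are this note's assumptions, not the paper's:

```python
import numpy as np

def critical_point(n, m, alpha=0.05, reps=10_000, seed=1):
    """Estimate K*_mn(alpha) as the empirical alpha-quantile of K_mn
    over `reps` simulated normal samples of size n (cf. Table 35.1)."""
    rng = np.random.default_rng(seed)
    stats = [k_statistic(rng.standard_normal(n), m) for _ in range(reps)]
    return np.quantile(stats, alpha)

# critical_point(20, 3) should land near the 2.77 shown in Table 35.1.
```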
The power of the test was estimated against several alternatives. The method was that of Monte Carlo simulation of the distribution of $K_{mn}$ under alternative population distributions. For each alternative, 1,000 samples of sizes n = 10, 20, 50 were generated, and the test power was estimated by the frequency of the samples falling into the critical region. The continuous alternatives investigated were gamma (1) (exponential), gamma (2), beta (1, 1) (uniform), beta (2, 1), and Cauchy distributions.
For these alternatives, the maximum power was typically attained by a moderate value of m (for n = 20, by m = 3, the statistic $K_3$ reported in Table 35.2). With increasing n, the optimal choice of m also increases, while the ratio m/n tends to zero.
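A power estimate in the same spirit can be sketched as follows (reusing k_statistic from above; the sampler interface and seed are assumptions of this illustration):

```python
import numpy as np

def power_estimate(sampler, n, m, k_star, reps=1_000, seed=2):
    """Estimate test power as the frequency of samples from the
    alternative whose K_mn falls in the critical region K_mn <= k_star."""
    rng = np.random.default_rng(seed)
    hits = sum(k_statistic(sampler(rng, n), m) <= k_star for _ in range(reps))
    return hits / reps

# Exponential alternative with n = 20, m = 3, K*(.05) = 2.77; the result
# should be near the .85 reported for K_3 in Table 35.2.
print(power_estimate(lambda rng, n: rng.exponential(size=n), 20, 3, 2.77))
```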
The power of the K test was compared with that of some other tests for normality against the same alternatives. The tests investigated by Stephens (1974) were considered. These are the Kolmogorov-Smirnov D, Cramér-von Mises $W^2$, Kuiper V, Watson $U^2$, Anderson-Darling $A^2$, and Shapiro-Wilk W tests. Of these, only the Shapiro-Wilk test is a test of the composite hypothesis of normality. The tests D, $W^2$, V, $U^2$, and $A^2$, based on the empirical distribution function (EDF), require a complete specification of the null hypothesis. When these tests are used to test the composite hypothesis, the parameters must be estimated from the sample. Critical values corresponding to such a modification of the test statistics are then applicable.
Table 35.2 lists power estimates of .05 size tests with sample size n = 20. These results have been obtained by Stephens (1974) for the EDF statistics against the exponential, uniform, and Cauchy alternatives; by Van Soest (1967) for $W^2$ against gamma (2); and by Shapiro and Wilk (1965) for W. The powers of D, V, $U^2$, and $A^2$ against gamma (2) and of the EDF statistics against beta (2, 1) were estimated by the author from 2,000 samples, using the critical values given in Stephens (1974). The standard error of the power estimates in Table 35.2 does not exceed .015.
Table 35.2 Powers of .05 tests against some alternatives (n = 20)
| Alternative | D | $W^2$ | V | $U^2$ | $A^2$ | W | $K_3$ |
|---|---|---|---|---|---|---|---|
| Exponential | .59 | .74 | .71 | .70 | .82 | .84 | .85 |
| Gamma (2) | .33 | .45 | .33 | .37 | .48 | .50 | .45 |
| Uniform | .12 | .16 | .17 | .18 | .21 | .23 | .44 |
| Beta (2, 1) | .17 | .23 | .20 | .23 | .28 | .35 | .43 |
| Cauchy | .86 | .88 | .87 | .88 | .98 | .88 | .75 |
It is apparent from Table 35.2 that none of the tests considered performs better than all other tests against all alternatives. Compared with any other test, however, the K test exhibits higher power against at least three of the five alternative distributions. For three of the alternatives, the power of the K test is uniformly the highest. Similar results hold for other sample sizes and sizes of the test.
These results, together with the relative simplicity of the K test (no tables of coefficients or function values are needed to calculate the test statistic) and its asymptotic properties against any alternative, suggest that the K test may be preferred in many situations.
This research was partly supported by Wells Fargo Bank, N.A. The author is indebted to Larry J. Cuneo for help with the computer simulations. The author wishes to express his thanks to the editor and referees of the Journal for their helpful suggestions.