6.3.8 Empty Cells

Analyzing multifactor data with empty (or missing) cells often gives results of questionable value, since the data contain insufficient information to estimate the parameters of the model. (See Freund (1980) for a discussion of this problem.) The absence of data in one or more cells makes it very difficult to establish guidelines for imposing restrictions or generating appropriate estimable functions. PROC GLM helps investigate alternate estimable functions for such analyses, but no packaged program, including PROC GLM, provides a single best solution for all situations.

The problem of empty cells is illustrated by deleting the A=1, B=3 cell from the 2*3 factorial data in Output 6.7.

Table 6.6 gives the general form of estimable functions, showing that the data with the empty cell involve only five linearly independent estimable functions. This is because there are only five cells with data, so that only five (linearly independent) functions of the parameters are estimable.

For this example, only the Type III and Type IV estimable functions will be discussed. The following statements are used:

proc glm;
   class a b;
   model y=a b a*b / e3 e4 solution;
   lsmeans a b / e stderr;
run;

Table 6.6 General Form of Estimable Functions

Effect              Coefficients
Intercept           L1
A     1             L2
      2             L1 − L2
B     1             L4
      2             L5
      3             L1 − L4 − L5
A*B   1 1           L7
      1 2           L2 − L7
      2 1           L4 − L7
      2 2           −L2 + L5 + L7
      2 3           L1 − L4 − L5

The coefficients of the functions for the A effect appear in Table 6.7. Type III and Type IV are identical for A, since A has only two levels.

Table 6.7 Estimable Functions for Factor A

Effect              Type III & IV Coefficients
Intercept           0
A     1             L2
      2             −L2
B     1             0
      2             0
      3             0
A*B   1 1           0.5*L2
      1 2           0.5*L2
      2 1           −0.5*L2
      2 2           −0.5*L2
      2 3           0

As in a classification with complete data, Type III and Type IV functions for A do not involve the parameters of factor B. Setting L2=1 shows that the A effect is equal to

.5(μ11 + μ12) − .5(μ21 + μ22)

This is shown in this two-way, μ-model diagram:

              B
          1      2      3
A   1    .5     .5      .
    2   −.5    −.5      0

No information from the A2B3 cell is used, since there is no matching data from the A1B3 cell.
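The claim that the A function involves no B parameters can be checked by expanding each cell mean in model terms (μij = Int + αi + βj + αβij). This small script is our illustration, not part of the text; it accumulates the parameter coefficients of .5(μ11 + μ12) − .5(μ21 + μ22):

```python
# Expand .5(mu11 + mu12) - .5(mu21 + mu22) in model parameters
# (mu_ij = Int + a_i + b_j + ab_ij) and collect coefficients.
coeffs = {}

def add_cell(i, j, c):
    for name in ("Int", f"a{i}", f"b{j}", f"ab{i}{j}"):
        coeffs[name] = coeffs.get(name, 0.0) + c

for (i, j), c in {(1, 1): .5, (1, 2): .5, (2, 1): -.5, (2, 2): -.5}.items():
    add_cell(i, j, c)

# Only a1, a2 (with coefficients 1 and -1) and the four interaction
# terms (with +/-.5) survive; every B coefficient cancels.
nonzero = {k: v for k, v in coeffs.items() if v != 0}
print(nonzero)
```

The surviving coefficients are exactly those in Table 6.7 with L2=1.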

Table 6.8 gives the Type III and Type IV coefficients associated with the B factor. (Notice that there are no nonzero coefficients for μ, α1, and α2.)

Table 6.8 Estimable Functions for Factor B

Effect              Type III Coefficients      Type IV Coefficients
Intercept           0                          0
A     1             0                          0
      2             0                          0
B     1             L4                         L4
      2             L5                         L5
      3             −L4−L5                     −L4−L5
A*B   1 1           0.25*L4−0.25*L5            0
      1 2           −0.25*L4+0.25*L5           0
      2 1           0.75*L4+0.25*L5            L4
      2 2           0.25*L4+0.75*L5            L5
      2 3           −L4−L5                     −L4−L5

First, consider the Type III coefficients. A set of two estimable functions tested with the Type III F statistic for B can be obtained by setting L4=1, L5=0 and L4=0, L5=1, which test H0: (β1 −β3 = 0) and (β2 −β3 = 0), respectively. The coefficients of the cell means in terms of the μ model are shown in the diagram:

Type III:

(L4 = 1, L5 = 0)                      (L4 = 0, L5 = 1)

              B                                     B
          1      2      3                       1      2      3
A   1   .25   −.25      .             A   1   −.25    .25      .
    2   .75    .25   −1.0                 2    .25    .75   −1.0

The choices L4=1, L5=0 and L4=0, L5=1 did not result in very appealing comparisons, since they involve cell means from levels of B that are not part of the desired hypotheses. Therefore, make another selection. Taking L4=1, L5=1 and L4=1, L5=−1, more interesting coefficients result.

(L4 = 1, L5 = 1)                      (L4 = 1, L5 = −1)

              B                                     B
          1      2      3                       1      2      3
A   1     0      0      .             A   1    .5    −.5      .
    2     1      1     −2                 2    .5    −.5      0

The hypotheses being tested are

H0: μ21 + μ22 − 2μ23 = 0 and .5(μ11 + μ21 − μ12 − μ22) = 0

or equivalently

H0: .5(μ21 + μ22) = μ23 and μ̅.1 = μ̅.2

Thus, the Type III F-test for B simultaneously compares B3 with the average of B1 and B2 within the level of A that has complete data and compares B1 with B2 averaged across the levels of A. Remember that the first selection of coefficients (L4=1, L5=0 and L4=0, L5=1) provides an equivalent H0; it is just more difficult to understand.
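That equivalence can be verified numerically: both pairs of coefficient vectors span the same row space over the cell means (μ11, μ12, μ21, μ22, μ23), so they define the same two-degree-of-freedom hypothesis. A quick check with numpy (our sketch, not from the text):

```python
import numpy as np

# Type III cell-mean coefficients; columns: mu11, mu12, mu21, mu22, mu23.
pair1 = np.array([[ .25, -.25, .75,  .25, -1.0],   # L4=1, L5=0
                  [-.25,  .25, .25,  .75, -1.0]])  # L4=0, L5=1
pair2 = np.array([[ 0.0,  0.0, 1.0,  1.0, -2.0],   # L4=1, L5=1
                  [ 0.5, -0.5, 0.5, -0.5,  0.0]])  # L4=1, L5=-1

rank = np.linalg.matrix_rank
# Equal row spaces: each pair has rank 2 and stacking adds nothing.
print(rank(pair1), rank(pair2), rank(np.vstack([pair1, pair2])))  # 2 2 2
```

In fact the second pair is just the sum and difference of the first pair's rows, which is why the tested hypothesis is unchanged.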

For comparison, with no missing cells the Type III hypothesis for B is

H0: μ.1 = μ.2 = μ.3

or equivalently

H0: μ̅.1 = μ̅.., μ̅.2 = μ̅.., μ̅.3 = μ̅..

For the example with the empty A1B3 cell, the parameter μ13 is not estimable unless further conditions are imposed. (If there are no data from a population, the mean of that population cannot be estimated unless some relationship between this mean and other means is established.) The Type III hypothesis is equivalent to

H0: μ.1 = μ.2 = μ.3

subject to μ13 = .5(μ11 + μ12).

Now consider the Type IV coefficients. Taking L4=1, L5=0 and L4=0, L5=1 results in the diagram

Type IV:

(L4 = 1, L5 = 0)                      (L4 = 0, L5 = 1)

              B                                     B
          1      2      3                       1      2      3
A   1     0      0      .             A   1     0      0      .
    2     1      0     −1                 2     0      1     −1

Thus, the Type IV H0 is clearly different from the Type III H0, because the Type IV H0 does not involve the means in level 1 of A, namely μ11 and μ12, even though there are data in these cells.

Output 6.13 shows that Type III and Type IV sums of squares are indeed different (Type III SS=16.073, Type IV SS=41.733).

Output 6.13 Comparison of the Type III and Type IV Sums of Squares with Empty Cells

The GLM Procedure
 
Class Level Information
 
Class Levels    Values
 
a 2    1 2
 
b 3    1 2 3
Number of observations   18
 
NOTE: Due to missing values, only 17 observations can be used in this analysis.
 
Dependent Variable: y  
 
  Sum of  
  Source DF Squares   Mean Square  F Value  Pr > F
 
  Model 4   45.81568627   11.45392157 5.27 0.0110
   
  Error 12 26.06666667 2.17222222
   
  Corrected Total 16 71.88235294
 
R-Square Coeff Var Root MSE y Mean
 
0.637370 27.53339 1.473846 5.352941
 
Source DF Type III SS Mean Square F Value Pr > F
 
a 1 0.35072464 0.35072464 0.16 0.6949
b 2 16.07330642 8.03665321 3.70 0.0560
a*b 1 29.56811594 29.56811594 13.61 0.0031
 
Source DF Type IV SS Mean Square F Value Pr > F
 
a 1* 0.35072464 0.35072464 0.16 0.6949
b 2* 41.73333333 20.86666667 9.61 0.0032
a*b 1 29.56811594 29.56811594 13.61 0.0031
 
* NOTE: Other Type IV Testable Hypotheses exist which may yield different SS.
 
      Standard    
Parameter   Estimate Error  t Value  Pr > |t|
 
Intercept     5.400000000 B   0.65912400 8.19 <.0001
a 1 -3.733333333 B 1.07634498 -3.47 0.0046
a 2 0.000000000 B  ⋅         ⋅    ⋅      
b 1 -2.900000000 B 1.23310809 -2.35 0.0366
b 2 2.933333333 B 1.07634498 2.73 0.0184
b 3 0.000000000 B  ⋅         ⋅    ⋅      
a*b 1 1 6.733333333 B 1.82503171 3.69 0.0031
a*b 1 2 0.000000000 B  ⋅         ⋅    ⋅      
a*b 2 1 0.000000000 B  ⋅         ⋅    ⋅      
a*b 2 2 0.000000000 B  ⋅         ⋅    ⋅      
a*b 2 3 0.000000000 B  ⋅         ⋅    ⋅      
 
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

The message in Output 6.13,

*NOTE: Other Type IV Testable Hypotheses exist which may yield different SS.

is a warning that the Type IV estimable functions (and consequently the associated hypotheses and sums of squares) are not unique. This refers to the phenomenon that Type IV estimable functions depend on the location of the empty cell.

Suppose, for example, that the values of the levels of B are changed, say, from B=1 to B=9, B=2 to B=8, and B=3 to B=7. Since PROC GLM sorts on the levels, the first and third columns are interchanged, placing the empty cell in the upper-left-hand corner. An examination of the estimable functions (which involve all cells that contain data) reveals these coefficients:

   

Type III coefficients for A (the empty cell is now in the upper-left corner):

                    B
               7         8         9
A     1        .        .5        .5
      2        0       −.5       −.5
           (old B=3) (old B=2) (old B=1)

Now we repeat the analysis with the missing cell placed in all possible locations. Note that the data are not changed. The same data are missing; only subscripts are changed to locate the missing cell in a different position.

The partitioning of the sums of squares from these analyses is given in the first two columns of Table 6.9. Two items of interest are that

❏ the Type III and Type IV sums of squares always give identical results for the sum of squares due to the A factor.

❏ the Type IV sums of squares due to B depend on the location of the missing cell with respect to the B factor and are never the same as the Type III sums of squares.

The right-hand portion of Table 6.9 gives estimates of functions whose analogs with no missing data would be B1 vs B3 and B2 vs B3, obtained by taking L4=1, L5=0 and L4=0, L5=1 from the table. Estimates of the A effects are not given since they are not affected by changing the location of the missing cell. Note that the Type III and Type IV estimates disagree even in some cases when the function does not involve the missing cell. When the missing cell is in the last level of B, they always differ, in which case the Type IV function uses only information in rows not containing the empty cell.

Table 6.10 gives the estimable function for the interaction effect. Since there is only one degree of freedom, there is one function, involving only the complete portion of the factorial structure.

Table 6.9 Sums of Squares and Estimates for B Effect

                 Sums of Squares     B1 vs B3 (L4=1, L5=0)      B2 vs B3 (L4=0, L5=1)
Location of
Missing Cell     A        B          Basis          Estimate    Basis          Estimate

1,3              .351     41.733      0    0    .   −2.900       0    0    .    2.933
                                      1    0   −1                0    1   −1

2,3              .351     41.733      1    0   −1   −2.900       0    1   −1    2.933
                                      0    0    .                0    0    .

1,1              .351     23.386      .   .5  −.5     .467       .    0    0    2.933
                                     .5   .5   −1                0    1   −1

2,1              .351     23.386     .5   .5   −1     .467       0    1   −1    2.933
                                      .   .5  −.5                .    0    0

1,2              .351     18.978      0    .    0   −2.900      −.5   .   .5   −4.33
                                      1    0   −1               .5   .5   −1

2,2              .351     23.286     .5   .5   −1     .467       0    1   −1    2.933
                                     .5    .  −.5                0    .    0

All (Type III)   .351     16.073    .25 −.25    x   −1.217     −.25  .25   x    1.250
                                    .75  .25   −1               .25  .75  −1

Each basis is a 2×3 array of cell-mean coefficients (rows A=1, A=2; columns B=1, 2, 3). A dot marks the empty cell, and "x" in the Type III row stands in for whichever cell is empty. The "Sums of Squares" columns give the SS for A (identical for Types III and IV) and the Type IV SS for B; the Type III SS for B is 16.073 for every location.

Table 6.10 Estimable Functions for A*B

Effect              All Types Coefficients
Intercept           0
A     1             0
      2             0
B     1             0
      2             0
      3             0
A*B   1 1           L7
      1 2           −L7
      2 1           −L7
      2 2           L7
      2 3           0

Now consider these CONTRAST, ESTIMATE, and LSMEANS statements used on the data with an empty A1B3 cell:

proc glm;
   class a b;
   model y=a b a*b;
   contrast 'B 2 vs 3' b 0 -1 1 / e;
   estimate 'A in B1' a -1 1 a*b -1 0 1 / e;
   lsmeans a / stderr e;
run;

The estimable functions resulting from the E option appear in Output 6.14.

Output 6.14 Estimable Functions for the CONTRAST, ESTIMATE, and LSMEANS Statements

Coefficients for Contrast B 2 vs 3
 
Row 1
 
Intercept 0
 
a 1 0
a 2 0
 
b 1 0
b 2 -1
b 3 1
 
a*b 1 1 0
a*b 1 2 -0.5
a*b 2 1 0
a*b 2 2 -0.5
a*b 2 3 1
 
Coefficients for Estimate A in B1
 
Row 1
 
Intercept 0
 
a 1 -1
a 2 1
 
b 1 0
b 2 0
b 3 0
 
a*b 1 1 -1
a*b 1 2 0
a*b 2 1 1
a*b 2 2 0
a*b 2 3 0
 
Coefficients for a Least Square Means
 
a Level
Effect 1 2
 
Intercept 1 1
a 1   1 0
a 2   0 1
b 1   0.33333333 0.33333333
b 2   0.33333333 0.33333333
b 3   0.33333333 0.33333333
a*b 1 1   0.5 0
a*b 1 2   0.5 0
a*b 2 1   0 0.33333333
a*b 2 2   0 0.33333333
a*b 2 3   0 0.33333333

Output 6.15 Output from the ESTIMATE and LSMEANS Statements

    Standard  
a y LSMEAN Error   Pr > |t|
 
1 Non-est ⋅         ⋅     
2 5.41111111   0.49940294 <.0001
 
    Standard    
Parameter Estimate Error t Value Pr > |t|
 
A in B1 -3.00000000 1.47384606 -2.04 0.0645

The function specified by the CONTRAST statement B2 vs 3 is designated nonestimable in the SAS log (not shown) because it involves level 3 of B, which contained the empty cell. Therefore, no statistical computation is printed. A more technical reason for the nonestimable CONTRAST function is ascertainable from the coefficients printed in Output 6.14. They show that the function is

Lβ = −β2 + β3 − .5αβ12 − .5αβ22 + αβ23

In order for a function Lβ to be estimable, it must be equal to a linear function of the μij. This is equivalent to the condition that the coefficients of the μij are equal to the corresponding coefficients of αβij in Lβ. Thus, the condition for the Lβ in the CONTRAST statement to be estimable is that Lβ is equal to

−.5μ12 − .5μ22 + μ23

but

−.5μ12 − .5μ22 + μ23 = −.5α1 + .5α2 + [−β2 + β3 − .5αβ12 − .5αβ22 + αβ23]

which contains α1 and α2 and is thus not the same function as Lβ.
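This estimability condition can also be checked mechanically: a function Lβ is estimable exactly when L lies in the row space of the design matrix X. The sketch below is ours (the column ordering is an assumption, with an all-zero column kept for the empty αβ13 cell); it tests both the CONTRAST and the ESTIMATE functions by comparing matrix ranks:

```python
import numpy as np

# One design row per populated cell; columns:
# Int, a1, a2, b1, b2, b3, ab11, ab12, ab13, ab21, ab22, ab23.
def row(i, j):
    r = [0.0] * 12
    r[0] = 1                  # intercept
    r[i] = 1                  # a_i
    r[2 + j] = 1              # b_j
    r[5 + 3*(i - 1) + j] = 1  # ab_ij
    return r

X = np.array([row(1,1), row(1,2), row(2,1), row(2,2), row(2,3)])

def estimable(L):
    L = np.asarray(L, float)
    return np.linalg.matrix_rank(np.vstack([X, L])) == np.linalg.matrix_rank(X)

# 'B 2 vs 3' coefficients from Output 6.14.
contrast = [0, 0, 0, 0, -1, 1, 0, -0.5, 0, 0, -0.5, 1]
# 'A in B1' coefficients from Output 6.14.
estimate = [0, -1, 1, 0, 0, 0, -1, 0, 0, 1, 0, 0]

print(estimable(contrast), estimable(estimate))  # False True
```

The contrast fails the rank test for exactly the reason given above, while the 'A in B1' function equals row(2,1) − row(1,1) and therefore passes.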

Compare the estimable functions in Output 6.14 with those in Output 6.12 to see further effects of the empty cell. Later in this section, you see that this CONTRAST statement does produce an estimable function if no interaction is specified in the model.

The function specified in the ESTIMATE statement is estimable because it involves only cells that contain data; namely A1B1 and A2B1 (see Output 6.14). From the printed coefficients in Output 6.14, the function is evidently

−α1 + α2 − αβ11 + αβ21

In terms of the μ model the function is

−μ11 + μ21

and is therefore estimable because it is expressible as a linear function of the μij for cells containing data.
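As a numerical check, the cell means implied by the (non-unique) solution in Output 6.13 reproduce the 'A in B1' estimate printed in Output 6.15. This script is our illustration; it rebuilds μ11 and μ21 from the printed 'B' estimates:

```python
# Cell means from the Output 6.13 solution: mu_ij = Int + a_i + b_j + ab_ij.
intercept = 5.4
a  = {1: -3.733333333, 2: 0.0}
b  = {1: -2.9, 2: 2.933333333, 3: 0.0}
ab = {(1, 1): 6.733333333}        # all other a*b estimates are zero

def mu(i, j):
    return intercept + a[i] + b[j] + ab.get((i, j), 0.0)

est = -mu(1, 1) + mu(2, 1)        # -mu11 + mu21, the 'A in B1' function
print(round(est, 6))              # -3.0, as in Output 6.15
```

Although the individual parameter estimates are not unique, this estimable function of them is.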

Caution is required in using the ESTIMATE or CONTRAST statements with empty cells. Compare the ESTIMATE statements that produced Output 6.12 and Output 6.14. If we used the ESTIMATE statement that produced Output 6.12 with a data set that had the A1B3 cell empty, the following nonestimable function would result:

Table 6.11 Coefficients Produced by the ESTIMATE Statement

Effect              Coefficients
Intercept           0
A     1             −1
      2             1
B     1             0
      2             0
      3             0
A*B   1 1           −1
      1 2           0
      2 1           0
      2 2           1
      2 3           0

That is, specifying A*B –1 0 0 1 places the 1 as a coefficient on αβ22 instead of αβ21 as desired. When interaction parameters are involved, it is necessary to know the location of the empty cells if you are using the ESTIMATE statement.
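A toy illustration (ours, not SAS syntax) of that positional matching: the coefficients you type are paired with the a*b parameters of the populated cells in sorted order, padded with zeros, so with A1B3 empty the fourth coefficient lands on cell (2,2):

```python
# Populated cells in the sort order used for the a*b parameters.
cells = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
typed = [-1, 0, 0, 1]                       # a*b -1 0 0 1, as typed
padded = typed + [0] * (len(cells) - len(typed))
mapping = dict(zip(cells, padded))
print(mapping[(2, 2)], mapping[(2, 1)])     # 1 0: the 1 hits (2,2), not (2,1)
```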

The estimable functions from the LSMEANS statement show the least-squares mean for A1 to be nonestimable and the least-squares mean for A2 to be estimable. The principles involved are the same as those already discussed with respect to the CONTRAST and ESTIMATE statements.

To illustrate the computations, consider the estimability of functions produced by CONTRAST, ESTIMATE, and LSMEANS statements in the context of a MODEL statement that contains no interaction:

proc glm;
   class a b;
   model y=a b / solution ss1 ss2 ss3 ss4;
   contrast 'B 2 vs 3' b 0 -1 1 / e;
   estimate 'A EFFECT' a -1 1 / e;
   lsmeans a / e;
run;

Partial results appear in Output 6.16. None of the functions are nonestimable even though the A1B3 cell has no data. This is because the functions involve only the parameters in the model

Yijk = μ + αi + βj + εijk

under which the means (μij = μ + αi + βj) are all estimable, even μ13. Any linear function of the μij is estimable, and all the functions in Output 6.16 are linear functions of the μij. For example, the A1 least-squares mean is

μ + α1 + .3333(β1 + β2 + β3) = .3333(3μ + 3α1 + β1 + β2 + β3) = .3333(μ11 + μ12 + μ13)
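Estimability here can be confirmed with the same rank check as before, now on the additive-model design (our sketch; columns Int, a1, a2, b1, b2, b3). Even though cell (1,3) is empty, the additive model ties its mean to populated cells (μ13 = μ11 − μ21 + μ23), so the A1 LSMEAN coefficient vector from Output 6.16 lies in the row space:

```python
import numpy as np

def row(i, j):
    # Additive model: mu_ij = Int + a_i + b_j.
    r = [0.0] * 6
    r[0] = 1; r[i] = 1; r[2 + j] = 1
    return r

# Rows only for the five populated cells (A1B3 is empty).
X = np.array([row(1,1), row(1,2), row(2,1), row(2,2), row(2,3)])
lsm_a1 = np.array([1, 1, 0, 1/3, 1/3, 1/3])   # A1 LSMEAN from Output 6.16

rank = np.linalg.matrix_rank
print(rank(X), rank(np.vstack([X, lsm_a1])))  # 4 4: estimable
```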

Output 6.16 Coefficients with an Empty Cell and No Interaction

The GLM Procedure
 
Coefficients for Contrast B 2 vs 3
 
Row 1
 
Intercept 0
 
a 1 0
a 2 0
 
b 1 0
b 2 -1
b 3 1
 
The GLM Procedure
 
Coefficients for Estimate A EFFECT
 
Row 1
 
Intercept 0
 
a 1 -1
a 2 1
 
b 1 0
b 2 0
b 3 0
 
Coefficients for a Least Square Means
 
a Level
Effect 1 2
 
Intercept 1 1
a 1   1 0
a 2   0 1
b 1   0.33333333 0.33333333
b 2   0.33333333 0.33333333
b 3   0.33333333 0.33333333

Table 6.12 summarizes results of the CONTRAST, ESTIMATE, and LSMEANS statements.

Table 6.12 Summary of Results from CONTRAST, ESTIMATE, and LSMEANS Output for a Two-Way Layout with an Empty Cell and No Interaction

Contrast     DF    SS            F       Pr > F
B2 vs 3      1     4.68597211    1.09    0.3144

Parameter    Estimate       T for H0:      Pr > |T|    STD Error
                            Parameter=0                of Estimate
A            −1.39130435    −1.14          0.2747      1.22006396

Least-Squares Means
A    Y LSMEAN        Std Err LSMEAN    Pr > |T| H0: LSMEAN=0
1    4.26376812      0.92460046        0.0005
2    5.65507246      0.69479997        0.0001

Understanding the results from an analysis with empty cells is admittedly difficult, another reminder that the existence of empty cells precludes a universally correct analysis. The problem is that empty cells leave a gap in the data and make it difficult to estimate interactions and to adjust for interaction effects. If the interaction is not requested, the empty cell causes fewer problems. Of course, the analysis is incorrect if the interaction is present in the data.

For situations that require analyses including interaction effects, the GLM procedure does not claim to always have the correct answer, but it does provide information about the nature of the estimates and allows the experimenter to decide whether these results have any real meaning.

6.4 Mixed-Model Issues

Previous sections in this chapter were concerned with fixed-effects models, in which all parameters are measures of the effects of given levels of the factors. For such models, F-tests of hypotheses on those parameters use the residual mean square in the denominator. In some models, however, one or more terms represent a random variable that measures the effect of a random sample of levels of the corresponding factor. Such terms are called random effects; models containing random effects only are called random models, whereas models containing both random and fixed effects are called mixed models. See Steel and Torrie (1980), especially Sections 7.5 and 16.6.

6.4.1 Proper Error Terms

For situations in which the proper error term is known by the construction of the design, PROC GLM provides the TEST statement (see Section 4.6.1, “A Standard Split-Plot Experiment”). PROC GLM also allows specification of appropriate error terms in MEANS, LSMEANS, and CONTRAST statements. For situations that are not obvious, PROC GLM gives a set of expected mean squares that can be used to indicate proper denominators of F-statistics. In some cases, these may have to be computed by hand. To illustrate the use and interpretation of these tools, consider a variation of the split-plot design involving the effect on yield of different irrigation treatments and cultivars.

Irrigation treatments are more easily applied to larger areas (main plots), whereas different cultivars may be planted in smaller areas (subplots). In this example, consider three irrigation treatments (IRRIG) assigned in a completely random manner to nine main-plot units (REPS). REPS are each split into two subplots, and two cultivars (CULT), A and B, are randomly assigned to the subplots.

The appropriate partitioning of sums of squares is

Source            DF
IRRIG              2
REPS in IRRIG      6
CULT               1
IRRIG*CULT         2
ERROR              6
TOTAL             17

The proper error term for irrigation is REPS within IRRIG because the REPS are the experimental units for the IRRIG factor. The other effects are tested by ERROR (which is actually CULT*REPS in IRRIG). Data for the irrigation experiment appear in Output 6.17.

Output 6.17 Data for a Split-Plot Experiment

The SAS System
 
Obs irrig reps cult yield
 
1 1 1 A 27.4
2 1 1 B 29.7
3 1 2 A 34.5
4 1 2 B 29.4
5 1 3 A 32.5
6 1 3 B 34.4
7 2 1 A 28.9
8 2 1 B 28.7
9 2 2 A 33.4
10 2 2 B 28.7
11 2 3 A 32.4
12 2 3 B 36.4
13 3 1 A 28.6
14 3 1 B 29.7
15 3 2 A 32.9
16 3 2 B 27.2
17 3 3 A 29.1
18 3 3 B 32.6

The analysis is implemented as follows:

proc glm;
   class irrig reps cult;
   model yield=irrig reps(irrig)
               cult irrig*cult;
   test h=irrig e=reps(irrig);
   contrast 'IRRIG 1 vs IRRIG 2' irrig 1 -1 / e=reps(irrig);
run;

Note the use of the nested effect in the MODEL statement (see Section 4.2, “Nested Classifications”).

The TEST statement requests that the IRRIG effect be tested against the REP within the IRRIG mean square.

The CONTRAST statement requests a comparison of the means for irrigation methods 1 and 2. The appropriate error mean square for testing the contrast is REPS(IRRIG) for the same reason that this was the appropriate error term for testing the IRRIG factor. The results appear in Output 6.18.

Output 6.18 F-Tests with Correct Denominators

The GLM Procedure
 
  Sum of  
Source DF Squares  Mean Square F Value Pr > F
 
Model 11 31.91611111 2.90146465 1.34 0.3766
 
Error 6 13.02666667 2.17111111
 
Corrected Total 17 44.94277778
 
Source DF Type III SS Mean Square F Value Pr > F
 
irrig 2 13.06777778 6.53388889 3.01 0.1244
reps(irrig) 6 13.32000000 2.22000000 1.02 0.4896
cult 1 3.12500000 3.12500000 1.44 0.2755
irrig*cult 2 2.40333333 1.20166667 0.55 0.6017
 
Tests of Hypotheses Using the Type III MS for reps(irrig) as an Error Term
 
Source DF Type III SS Mean Square F Value Pr > F
 
irrig 2 13.06777778 6.53388889 2.94 0.1286
 
Tests of Hypotheses Using the Type III MS for reps(irrig) as an Error Term
 
Source DF Type III SS Mean Square F Value Pr > F
 
IRRIG 1 vs IRRIG 2 1 0.70083333 0.70083333 0.32 0.5946
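The sums of squares above come from the standard balanced split-plot formulas, with each squared mean deviation weighted by the number of observations per mean (6 per IRRIG mean, 2 per REPS-within-IRRIG mean, 9 per CULT mean, 3 per IRRIG*CULT mean). The hand computation can be sketched in Python (our sketch, using the data as printed in Output 6.17; it verifies the decomposition identity rather than the printed values):

```python
from itertools import product

obs = [  # (irrig, reps, cult, yield), from Output 6.17
    (1,1,"A",27.4), (1,1,"B",29.7), (1,2,"A",34.5), (1,2,"B",29.4),
    (1,3,"A",32.5), (1,3,"B",34.4), (2,1,"A",28.9), (2,1,"B",28.7),
    (2,2,"A",33.4), (2,2,"B",28.7), (2,3,"A",32.4), (2,3,"B",36.4),
    (3,1,"A",28.6), (3,1,"B",29.7), (3,2,"A",32.9), (3,2,"B",27.2),
    (3,3,"A",29.1), (3,3,"B",32.6),
]

def mean(sel):
    v = [y for i, j, k, y in obs if sel(i, j, k)]
    return sum(v) / len(v)

g   = mean(lambda i, j, k: True)
mi  = {i: mean(lambda a, b, c, i=i: a == i) for i in (1, 2, 3)}
mij = {(i, j): mean(lambda a, b, c, i=i, j=j: a == i and b == j)
       for i, j in product((1, 2, 3), repeat=2)}
mk  = {k: mean(lambda a, b, c, k=k: c == k) for k in ("A", "B")}
mik = {(i, k): mean(lambda a, b, c, i=i, k=k: a == i and c == k)
       for i, k in product((1, 2, 3), ("A", "B"))}

ss_irrig = 6 * sum((mi[i] - g) ** 2 for i in mi)
ss_reps  = 2 * sum((mij[i, j] - mi[i]) ** 2 for i, j in mij)
ss_cult  = 9 * sum((mk[k] - g) ** 2 for k in mk)
ss_ic    = 3 * sum((mik[i, k] - mi[i] - mk[k] + g) ** 2 for i, k in mik)
ss_err   = sum((y - mij[i, j] - mik[i, k] + mi[i]) ** 2 for i, j, k, y in obs)
ss_tot   = sum((y - g) ** 2 for _, _, _, y in obs)

f_irrig = (ss_irrig / 2) / (ss_reps / 6)   # IRRIG over REPS(IRRIG)
# In a balanced design the five components add exactly to the total.
print(abs(ss_irrig + ss_reps + ss_cult + ss_ic + ss_err - ss_tot) < 1e-9)  # True
```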

6.4.2 More on Expected Mean Squares

Expected mean squares are algebraic expressions specifying the functions of the model parameters that are estimated by the mean squares resulting from partitioning the sums of squares. Generally, these expected mean squares are linear functions of elements that represent the

❏ error variance

❏ functions of variances of random effects

❏ functions of sums of squares and products (quadratic forms) of fixed effects.

The underlying principle of an F-test on a set of fixed-effects parameters is that the expected mean square for the denominator contains a linear function of variances of random effects, whereas the expected mean square for the numerator contains the same function of these variances plus a quadratic form of the parameters being tested. If no such matching pair of variance functions is available, no proper test exists; however, approximate tests are available.

For fixed models, all mean squares estimate the residual error variance plus a quadratic form (variance) of the parameters in question. Hence, the proper denominator for all tests is the error term. Expected mean squares are usually not required for fixed models.
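The matching rule in the two paragraphs above can be expressed mechanically: the proper denominator for an effect is a source whose expected mean square equals the numerator's expected mean square with the tested quadratic form removed. A toy encoding of the EMS table from Output 6.19 (ours; the names and the explicit "Error" line are our additions):

```python
# Each source maps its variance components to coefficients; "Q" marks a
# quadratic form of fixed effects. The "Error" line is implied, not printed.
ems = {
    "irrig":       {"Var(Error)": 1, "Var(reps(irrig))": 2, "Q": "irrig,irrig*cult"},
    "reps(irrig)": {"Var(Error)": 1, "Var(reps(irrig))": 2},
    "cult":        {"Var(Error)": 1, "Q": "cult,irrig*cult"},
    "irrig*cult":  {"Var(Error)": 1, "Q": "irrig*cult"},
    "Error":       {"Var(Error)": 1},
}

def random_part(effect):
    return {k: v for k, v in ems[effect].items() if k != "Q"}

def denominator(effect):
    # Proper denominator: a source with no Q term whose EMS equals the
    # numerator's EMS after the tested quadratic form is removed.
    target = random_part(effect)
    for src, comps in ems.items():
        if src != effect and "Q" not in comps and comps == target:
            return src
    return None

print(denominator("irrig"), denominator("cult"))  # reps(irrig) Error
```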

For a mixed model, the expected mean squares are requested by a RANDOM statement in PROC GLM, specifying the model effects that are random.

For the irrigation example, REPS within IRRIG is random. Regarding IRRIG and CULT as fixed effects in the example, a model is

yijk = μ + αi + wij + βk + αβik + eijk (6.9)

where αi, βk, and αβik are the main effect of IRRIG, the main effect of CULT, and the interaction effect of IRRIG*CULT. The wij term is the random effect of REPS within IRRIG.

Use the following SAS statements to obtain an analysis for this model:

proc glm;
   class irrig reps cult;
   model yield=irrig reps(irrig) cult cult*irrig / ss3;
   contrast 'IRRIG 1 vs IRRIG 2' irrig 1 -1 / e=reps(irrig);
   random reps(irrig) / q;
run;

The RANDOM statement specifies that REPS(IRRIG) is a random effect. The Q option requests that the coefficients of the quadratic form in the expected mean squares (EMS) be printed. The analysis of variance is the same as in Output 6.18.

The expected mean squares appear in Output 6.19.

Output 6.19 Expected Mean Squares and Quadratic Form for IRRIG

The SAS System
 
The GLM Procedure
 
Source Type III Expected Mean Square
 
irrig Var(Error) + 2 Var(reps(irrig)) + Q(irrig,irrig*cult)
 
reps(irrig) Var(Error) + 2 Var(reps(irrig))
 
cult Var(Error) + Q(cult,irrig*cult)
 
irrig*cult Var(Error) + Q(irrig*cult)
 
 
Contrast Contrast Expected Mean Square
 
IRRIG 1 vs IRRIG 2 Var(Error) + 2 Var(reps(irrig)) + Q(irrig,irrig*cult)
 
Quadratic Forms of Fixed Effects in the Expected Mean Squares
 
Source: Type III Mean Square for irrig
 
irrig 1 irrig 2 irrig 3 Dummy010 Dummy011
 
irrig 1 4.00000000 -2.00000000 -2.00000000 2.00000000 2.00000000
irrig 2 -2.00000000 4.00000000 -2.00000000 -1.00000000 -1.00000000
irrig 3 -2.00000000 -2.00000000 4.00000000 -1.00000000 -1.00000000
Dummy010 2.00000000 -1.00000000 -1.00000000 1.00000000 1.00000000
Dummy011 2.00000000 -1.00000000 -1.00000000 1.00000000 1.00000000
Dummy012 -1.00000000 2.00000000 -1.00000000 -0.50000000 -0.50000000
Dummy013 -1.00000000 2.00000000 -1.00000000 -0.50000000 -0.50000000
Dummy014 -1.00000000 -1.00000000 2.00000000 -0.50000000 -0.50000000
Dummy015 -1.00000000 -1.00000000 2.00000000 -0.50000000 -0.50000000
 
Dummy012 Dummy013 Dummy014 Dummy015
 
irrig 1 -1.00000000 -1.00000000 -1.00000000 -1.00000000
irrig 2 2.00000000 2.00000000 -1.00000000 -1.00000000
irrig 3 -1.00000000 -1.00000000 2.00000000 2.00000000
Dummy010 -0.50000000 -0.50000000 -0.50000000 -0.50000000
Dummy011 -0.50000000 -0.50000000 -0.50000000 -0.50000000
Dummy012 1.00000000 1.00000000 -0.50000000 -0.50000000
Dummy013 1.00000000 1.00000000 -0.50000000 -0.50000000
Dummy014 -0.50000000 -0.50000000 1.00000000 1.00000000
Dummy015 -0.50000000 -0.50000000 1.00000000 1.00000000

Because this experiment is completely balanced, all four types of expected mean squares are identical. Consider the EMSs for IRRIG. The results in Output 6.19 translate into

σe² + 2σw² + (quadratic form in the αs and αβs)/(a − 1)

where σe² = V(e), σw² = V(w), and a = the number of levels of IRRIG. (The a − 1 divisor is not explicitly indicated by PROC GLM.) The quadratic form is a measure of differences among the irrigation means

μ̄i.. = E(ȳi..) = μ + αi + β̄. + ᾱβ̄i.

The coefficients appear in Output 6.19. (Although not shown here, the labels DUMMY010–DUMMY015 would be previously indicated in the output to correspond to αβ11, αβ12, αβ21, αβ22, αβ31, and αβ32.) In matrix notation, the quadratic form is α′Aα, where

A = ⎡  4   −2   −2    2    2   −1   −1   −1   −1  ⎤
    ⎢ −2    4   −2   −1   −1    2    2   −1   −1  ⎥
    ⎢ −2   −2    4   −1   −1   −1   −1    2    2  ⎥
    ⎢  2   −1   −1    1    1   −.5  −.5  −.5  −.5 ⎥
    ⎢  2   −1   −1    1    1   −.5  −.5  −.5  −.5 ⎥
    ⎢ −1    2   −1   −.5  −.5   1    1   −.5  −.5 ⎥
    ⎢ −1    2   −1   −.5  −.5   1    1   −.5  −.5 ⎥
    ⎢ −1   −1    2   −.5  −.5  −.5  −.5   1    1  ⎥
    ⎣ −1   −1    2   −.5  −.5  −.5  −.5   1    1  ⎦

and

α′ = (α1  α2  α3  αβ11  αβ12  αβ21  αβ22  αβ31  αβ32)

This is the general expression for the quadratic form in which no constraints are assumed on the parameters. In this representation, irrigation mean differences are functions not only of the αi but also of the αβij, that is,

μ1.. − μ2.. = α1 − α2 + αβ1. − αβ2.

Thus, the quadratic form measuring these combined differences involves the αβijs. In most texts, these expected mean squares are presented with the constraints α1 + α2 + α3 = 0 and αβ1. = αβ2. = αβ3. = 0 (where αβi. = αβi1 + αβi2). When these constraints are imposed, the quadratic form reduces to

α′Aα = 6α1² + 6α2² + 6α3²

To see this, note that the contribution from the first row of A is

α1(4α1 − 2α2 − 2α3 + 2αβ11 + 2αβ12 − αβ21 − αβ22 − αβ31 − αβ32)
  = α1(4α1 + 2α1 + 2αβ1. − αβ2. − αβ3.)
  = α1(4α1 + 2α1 + 0 − 0 − 0)
  = 6α1²

Similarly, row 2 and row 3 yield 6α2² and 6α3², respectively. Row 4 gives

αβ11(2α1 − α2 − α3 + αβ11 + αβ12 − .5αβ21 − .5αβ22 − .5αβ31 − .5αβ32) = αβ11(3α1)

and row 5 gives αβ12(3α1). Thus, the sum of rows 4 and 5 is 0. Similarly, the net contribution from rows 6 through 9 is 0.
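The reduction of α′Aα can also be verified numerically. The script below (ours) builds A from the coefficients in Output 6.19, draws a random parameter vector satisfying Σαi = 0 and αβi1 + αβi2 = 0 for each i, and checks that the quadratic form equals 6(α1² + α2² + α3²):

```python
import random

# A as printed in Output 6.19; parameter order:
# (a1, a2, a3, ab11, ab12, ab21, ab22, ab31, ab32).
A = [
    [ 4, -2, -2,   2,   2,  -1,  -1,  -1,  -1],
    [-2,  4, -2,  -1,  -1,   2,   2,  -1,  -1],
    [-2, -2,  4,  -1,  -1,  -1,  -1,   2,   2],
    [ 2, -1, -1,   1,   1, -.5, -.5, -.5, -.5],
    [ 2, -1, -1,   1,   1, -.5, -.5, -.5, -.5],
    [-1,  2, -1, -.5, -.5,   1,   1, -.5, -.5],
    [-1,  2, -1, -.5, -.5,   1,   1, -.5, -.5],
    [-1, -1,  2, -.5, -.5, -.5, -.5,   1,   1],
    [-1, -1,  2, -.5, -.5, -.5, -.5,   1,   1],
]

random.seed(2001)
a1, a2 = random.random(), random.random()
a3 = -a1 - a2                      # alpha sum-to-zero constraint
ab = []
for _ in range(3):                 # ab_i1 + ab_i2 = 0 for each i
    u = random.random()
    ab += [u, -u]
v = [a1, a2, a3] + ab

quad = sum(v[r] * A[r][c] * v[c] for r in range(9) for c in range(9))
print(abs(quad - 6 * (a1*a1 + a2*a2 + a3*a3)) < 1e-9)  # True
```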

Under the null hypothesis of no difference between irrigation methods, the expected mean square for IRRIG becomes σe² + 2σw². But this is the same as the expected mean square for REPS(IRRIG). Therefore, REPS(IRRIG) is, in fact, the correct denominator in the F-test for IRRIG.

Now, view this example as if the irrigation treatments are in a randomized-blocks design instead of a completely random design—that is, assume REPS is crossed with IRRIG rather than nested in IRRIG. If both IRRIG and CULT are fixed effects, then a model is

yijk = μ + φj + αi + wij + βk + αβik + εijk    (6.10)

where αi, βk, αβik, and εijk have the same meaning as in the model shown in equation (6.9). The φj term is the random block (REPS) effect, with V(φj) = σρ², and wij is the random main-plot error, that is, the random block by irrigation (REPS*IRRIG) effect. The model in equation (6.10) is equivalent to model 12.3 of Steel and Torrie (1980, p. 245) with their A=IRRIG and B=CULT. Now it is commonly presumed that any interaction effect must be a random effect if it involves a random main effect. In terms of equation (6.10), this presumption implies that if φj is random, then wij must be random. However, PROC GLM does not operate under this presumption, and, therefore, both REPS and REPS*IRRIG must be explicitly designated as random in the RANDOM statement in order to obtain expected mean squares corresponding to φj and wij both as random. In other words, if only REPS appears in the RANDOM statement, then the expected mean squares printed by PROC GLM would correspond to wij as a fixed effect.

Output 6.20 contains the expected mean squares resulting from the following statements:

proc glm;
   class reps irrig cult;
   model yield=reps irrig reps*irrig cult cult*irrig / ss1;
   random reps reps*irrig;
run;

Output 6.20 Expected Mean Squares for Split Plot: Fixed Main-Plot and Subplot Factors

The SAS System
 
The GLM Procedure
 
Source Type III Expected Mean Square
 
irrig Var(Error) + 2 Var(irrig*reps) + Q(irrig,irrig*cult)
 
reps Var(Error) + 2 Var(irrig*reps) + 6 Var(reps)
 
irrig*reps Var(Error) + 2 Var(irrig*reps)
 
cult Var(Error) + Q(cult,irrig*cult)
 
irrig*cult Var(Error) + Q(irrig*cult)

The line for IRRIG in the output translates to

σe² + 2σw² + (quadratic form in the αs and αβs)/2

and, as discussed above, the quadratic form reduces to 6Σiαi² if Σiαi = Σiαβij = 0. This matches the expected mean squares given in Steel and Torrie (1980, p. 394) for this design.

6.4.3 An Issue of Model Formulation Related to Expected Mean Squares

Before leaving this example, you should understand one more point concerning the expected mean squares in mixed models. Suppose that CULT is a random effect, but IRRIG remains fixed. Following Steel and Torrie (1980), the table below shows the expected mean squares:

Source                   Expected Mean Squares
REPS                     σε² + 2σγ² + 6σρ²
IRRIG                    σε² + 2σγ² + 3(3/2)σαβ² + 6Σiαi²/2
Error(A) = REPS*IRRIG    σε² + 2σγ²
CULT                     σε² + 9σβ²
CULT*IRRIG               σε² + 3(3/2)σαβ²
Error(B)                 σε²

Output 6.21 contains the expected mean squares output from the following statements:

proc glm;
   class reps irrig cult;
   model yield=reps irrig reps*irrig cult
               irrig*cult / ss3;
   random reps reps*irrig cult cult*irrig;
run;

Output 6.21 Expected Mean Squares for a Split-Plot Design: Fixed Main-Plot and Random Subplot Factors

The SAS System
 
The GLM Procedure
 
Source Type III Expected Mean Square
 
irrig Var(Error) + 3 Var(irrig*cult) + 2 Var(irrig*reps) + Q(irrig)
 
reps Var(Error) + 2 Var(irrig*reps) + 6 Var(reps)
 
irrig*reps Var(Error) + 2 Var(irrig*reps)
 
cult Var(Error) + 3 Var(irrig*cult) + 9 Var(cult)
 
irrig*cult Var(Error) + 3 Var(irrig*cult)

The only quadratic form for fixed effects is Q(IRRIG) in the IRRIG line, which corresponds to 6Σαi², assuming Σαi = 0. The IRRIG line also contains Var(irrig*cult) and Var(irrig*reps), in agreement with the expected mean squares of Steel and Torrie given above. The lines in the output for REPS, IRRIG*REPS, and IRRIG*CULT also agree with these expected mean squares. However, although the line for CULT in Output 6.21 contains Var(irrig*cult), the expected mean square given by Steel and Torrie for CULT (factor B) does not contain σαβ². The exclusion by Steel and Torrie is in agreement with the general principle that if U is a fixed effect and V is a random effect, then the expected mean square for U contains σ²U*V, but the expected mean square for V does not contain σ²U*V.

This principle is an item of controversy among statisticians and relates to formulation and parameterization of the model. Basically, exclusion or inclusion of σαβ² in the line for CULT depends on variance and covariance definitions in the model. See Hocking (1973, 1985) for detailed accounts of modeling ramifications for the two-way mixed model. The PROC GLM output for a two-way mixed model corresponds to the results shown by Hocking (1973) for Model III in his Table 2. Refer to Hartley and Searle (1969), who point out that exclusion of σαβ² is inconsistent with results commonly reported for the unbalanced case. More recently, other authors, including Samuels, Casella, and McCabe (1991), Lentner, Arnold, and Hinkelmann (1989), and McLean, Sanders, and Stroup (1991), have discussed this modeling problem, but it is not resolved.

6.5 ANOVA Issues for Unbalanced Mixed Models

We stated in Chapter 5 that there are no definitive guidelines for using ANOVA methods for analyzing unbalanced mixed-model data. This is true even in the simplest case of a two-way classification. In this section we discuss in greater detail the issues involved, although we are not able to resolve all of them. One of the main problems is to choose numerator and denominator mean squares to construct approximate F-ratios for hypotheses about fixed effects.

6.5.1 Using Expected Mean Squares to Construct Approximate F-Tests for Fixed Effects

Refer again to the unbalanced clinical trial data set. We consider the studies to be random and drugs fixed. Run the following statements to obtain Types I, II, and III sums of squares and their expected mean squares:

proc glm data=drugs;
   class trt study;
   model flush = trt study trt*study / e1 e2 e3;
   random study trt*study / q test;
run;

ANOVA results appear in Output 6.22.

Output 6.22  Three Types of Sums of Squares for an Unbalanced Two-Way Classification

Unbalanced Two-way Classification
 
The GLM Procedure
 
Dependent Variable: FLUSH  
 
Source             DF    Sum of Squares    Mean Square   F Value   Pr > F
 
Model              17       16618.75357      977.57374      2.24   0.0063
 
Error             114       49684.09084      435.82536
 
Corrected Total   131       66302.84440
 
Source DF Type I SS Mean Square F Value Pr > F
 
TRT 1 1134.560964 1134.560964 2.60 0.1094
STUDY 8 6971.606045 871.450756 2.00 0.0526
TRT*STUDY 8 8512.586561 1064.073320 2.44 0.0178
 
Source DF Type II SS Mean Square F Value Pr > F
 
TRT 1 1377.550724 1377.550724 3.16 0.0781
STUDY 8 6971.606045 871.450756 2.00 0.0526
TRT*STUDY 8 8512.586561 1064.073320 2.44 0.0178
 
Source DF Type III SS Mean Square F Value Pr > F
 
TRT 1 1843.572090 1843.572090 4.23 0.0420
STUDY 8 7081.377266 885.172158 2.03 0.0488
TRT*STUDY 8 8512.586561 1064.073320 2.44 0.0178

First of all, we consider the choice of a mean square for TRT to use in the numerator of an approximate F-statistic. You see that the values of MS(TRT) range from 1134.6 for Type I to 1843.6 for Type III. But it is not legitimate to choose the mean square based on its observed value. Instead, the choice should be made based on the expected mean squares, which describe what the mean squares measure. Expected mean squares are computed as a result of the RANDOM statement, but in order to get the Types I and II expected mean squares, the e1 and e2 options must be specified in the MODEL statement. The Types I–III expected mean squares are shown in Output 6.23.

Output 6.23  Three Types of Expected Mean Squares for an Unbalanced Two-Way Classification

Source Type I Expected Mean Square
 
TRT Var(Error) + 9.1461 Var(TRT*STUDY) + 0.04 Var(STUDY) + Q(TRT)
 
STUDY Var(Error) + 7.1543 Var(TRT*STUDY) + 14.213 Var(STUDY)
 
TRT*STUDY Var(Error) + 7.0585 Var(TRT*STUDY)
 
 
Source Type II Expected Mean Square
 
TRT Var(Error) + 9.1385 Var(TRT*STUDY) + Q(TRT)
 
STUDY Var(Error) + 7.1543 Var(TRT*STUDY) + 14.213 Var(STUDY)
 
TRT*STUDY Var(Error) + 7.0585 Var(TRT*STUDY)
 
 
Source Type III Expected Mean Square
 
TRT Var(Error) + 4.6613 Var(TRT*STUDY) + Q(TRT)
 
STUDY Var(Error) + 7.0585 Var(TRT*STUDY) + 14.117 Var(STUDY)
 
TRT*STUDY Var(Error) + 7.0585 Var(TRT*STUDY)

Basically, we want to choose a mean square for TRT that results in the most powerful test for differences between drug means. The mean squares are quadratic forms of normally distributed data, and therefore they are approximately distributed as a constant times non-central chi-square random variables. We want to select the mean square with the largest non-centrality parameter. The non-centrality parameters are equal to Q(TRT)/[Var(Error) + k1 Var(TRT*STUDY) + k2 Var(STUDY)]. The coefficient k2 is equal to 0 except for Type I, where it is very small (0.04). Thus, the choice boils down to selecting the mean square with the largest value of Q(TRT)/k1. The coefficients k1 appear in Output 6.23, and the quantities Q(TRT) are available from the Q option in the RANDOM statement, shown in Output 6.24.

Output 6.24  Three Types of Quadratic Forms for TRT in an Unbalanced Two-Way Classification

Source: Type I Mean Square for TRT
 
  TRT A TRT B
 
TRT A 32.99242424 -32.99242424
TRT B -32.99242424 32.99242424
 
Source: Type II Mean Square for TRT
 
  TRT A TRT B
 
TRT A 32.80309690 -32.80309690
TRT B -32.80309690 32.80309690
 
Source: Type III Mean Square for TRT
 
  TRT A TRT B
 
TRT A 20.97586951 -20.97586951
TRT B -20.97586951 20.97586951

In terms of model parameters, for each type of mean square, Q(TRT) is equal to α′Aα, where α′ = (α1, α2) and A is the matrix shown in Output 6.24. Since there are only two treatments, all elements of A have the same absolute value, and Q(TRT) is determined by this number. Thus the mean square with the largest value of that number divided by k1 will have the largest non-centrality parameter. For this example the values are

Type I:   32.99/9.15 = 3.61
Type II:  32.81/9.14 = 3.59
Type III: 20.98/4.66 = 4.50

There is not much to choose between Types I and II, but the value for Type III is approximately 25% larger than the others. Therefore, Type III has the largest non-centrality parameter.
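The ratio comparison above, and the approximate F-test that follows from it, can be checked with a few lines of Python (not SAS, but the arithmetic is the same). The numbers are taken from Outputs 6.22, 6.23, and 6.24, and the Satterthwaite-style denominator synthesis is the same kind of computation the TEST option in the RANDOM statement performs:

```python
# Compare Q(TRT)/k1 across the three types of MS(TRT).
# Q(TRT) is the common magnitude in Output 6.24; k1 is the
# Var(TRT*STUDY) coefficient in Output 6.23.
candidates = {
    "Type I":   (32.99242424, 9.1461),
    "Type II":  (32.80309690, 9.1385),
    "Type III": (20.97586951, 4.6613),
}
ratios = {t: q / k1 for t, (q, k1) in candidates.items()}
best = max(ratios, key=ratios.get)

# Approximate F-test for TRT based on the Type III mean square.
# Mean squares and degrees of freedom are from Output 6.22.
ms_trt = 1843.572090
ms_ts, df_ts = 1064.073320, 8      # MS(TRT*STUDY)
ms_err, df_err = 435.82536, 114    # MS(Error)

# Synthesize a denominator whose expectation is
# Var(Error) + 4.6613 Var(TRT*STUDY), matching E[MS(TRT)] under H0.
c = 4.6613 / 7.0585
denom = c * ms_ts + (1 - c) * ms_err

# Satterthwaite approximation for the denominator degrees of freedom.
df_denom = denom ** 2 / ((c * ms_ts) ** 2 / df_ts
                         + ((1 - c) * ms_err) ** 2 / df_err)

f_approx = ms_trt / denom
print(best, round(f_approx, 2), round(df_denom, 1))
```

For these data the synthesized denominator is about 850.7 with roughly 11.7 degrees of freedom, giving an approximate F of about 2.17 for the Type III numerator.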

6.6 GLS and Likelihood Methodology for the Mixed Model

The MIXED procedure uses generalized least squares (GLS) and likelihood methodology to construct test statistics, estimates, and standard errors of estimates. In this section we present a brief description of the methods. Although the methods are based on sound criteria, they usually do not result in “exact” results, in the sense that p-values and confidence coefficients are only approximate.

6.6.1 An Overview of Generalized Least Squares Methodology

The statistical model on which PROC MIXED is based is

Y = Xβ + ZU + ε,

where Y is the vector of data, Xβ is the set of linear combinations for fixed effects, ZU is a set of linear combinations of random effects, and ε is a vector of residual errors. The matrices X and Z contain known constants, often values of independent variables or values of indicator variables for classification variables. The vector β contains fixed but unknown constants, such as regression parameters or differences between treatment means. The random vector U has expectation E(U) = 0 and covariance V(U) = G. The random vector ε has expectation E(ε) = 0 and covariance V(ε) = R. With these specifications, the data vector Y has expectation E(Y) = Xβ and variance V(Y) = ZGZ' + R. In most applications, we assume U and ε are normally distributed, which makes Y normally distributed, also.
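As an illustrative sketch (not the PROC MIXED implementation), the construction V(Y) = ZGZ′ + R can be written out in a few lines of Python for a toy model with two clusters; the variance values here are made up:

```python
import numpy as np

# Toy illustration of V(Y) = Z G Z' + R for the model Y = Xb + ZU + e.
# Four observations in two clusters; the cluster effect is random with
# variance 2 (in G), and the residual variance is 1 (in R).
Z = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
G = 2.0 * np.eye(2)      # V(U) = G
R = 1.0 * np.eye(4)      # V(e) = R
V = Z @ G @ Z.T + R      # V(Y) = ZGZ' + R

print(V)
```

The result is block diagonal: observations in the same cluster have covariance 2 (they share a random effect), while observations in different clusters are uncorrelated.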

Let V = ZGZ' + R be the covariance matrix of Y. Then GLS methodology gives us the following results: The best estimator of β is

b = (X'V-1X)-1X'V-1Y            (6.11)

and the variance of b is

V(b) = (X'V-1X)-1            (6.12)

It follows that the best estimator of a linear combination, or set of linear combinations, Lβ is Lb and its variance is L(X'V-1X)-1L'. These results provide the basis for statistical inference about model parameters that is used in PROC MIXED. Here are some of the most common inference procedures:

Standard error of an estimate Lb:     (L(X′V-1X)-1L′)1/2            (6.13)

Test about linear combinations H0: Lβ = 0:   χ2 = (Lb)′(L(X′V-1X)-1L′)-1(Lb)            (6.14)

These statistical inference methods are exact. However, they cannot be used in most practical applications because they require knowledge of the covariance matrix V. Therefore, V must be estimated from the data. In most cases, the elements of V are functions of a small number of parameters. These are called covariance parameters, although individual parameters might not represent covariances, strictly speaking. For example, in a repeated measures analysis, V might involve variation between subjects and variation between repeated measures within subjects. One of the parameters might be the variance between subjects, and other parameters might represent variances and covariances within subjects. (See Chapter 8.)
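A minimal Python sketch of equations (6.11) through (6.14), assuming V is known; the X, Y, V, and L values below are hypothetical:

```python
import numpy as np

def gls(X, Y, V):
    """GLS estimate b (6.11) and its covariance V(b) (6.12), V assumed known."""
    Vinv = np.linalg.inv(V)
    Vb = np.linalg.inv(X.T @ Vinv @ X)   # (X'V^-1 X)^-1, equation (6.12)
    b = Vb @ X.T @ Vinv @ Y              # equation (6.11)
    return b, Vb

def wald_chi2(L, b, Vb):
    """Chi-square statistic for H0: L beta = 0, equation (6.14)."""
    Lb = L @ b
    return float(Lb.T @ np.linalg.inv(L @ Vb @ L.T) @ Lb)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([1.0, 2.0, 2.0, 4.0])
V = np.diag([1.0, 1.0, 2.0, 2.0])        # known heteroscedastic covariance

b, Vb = gls(X, Y, V)
L = np.array([[0.0, 1.0]])               # test the slope
se = np.sqrt(L @ Vb @ L.T)[0, 0]         # equation (6.13)
chi2 = wald_chi2(L, b, Vb)
print(b, se, chi2)
```

With V equal to the identity, (6.11) reduces to ordinary least squares, which is a convenient sanity check on the computation.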

Equations (6.11) through (6.14) assume that the inverse (X'V-1X)-1 exists. If (X'V-1X) is singular, then a g-inverse is computed instead.

The structure of V is specified with REPEATED and RANDOM statements. The REPEATED statement determines R, and the RANDOM statement determines Z and G. The parameters in R and G are estimated, usually by the method of maximum likelihood or a variation of it, and the values of the estimates are inserted into R and G. Then the estimates of R and G are used to obtain the estimate of V, which is inserted into the expressions in (6.11)–(6.14). As a result, the “exactness” no longer holds, and most results become approximate. In addition, t-distributions are used instead of z-distributions for confidence intervals, and F-distributions are used instead of chi-square distributions for tests of fixed effects.

There are more consequences of estimating the parameters in V. The degrees of freedom for the t- and F-statistics must be estimated using Satterthwaite-type methods. Also, standard errors of estimates are biased downward, because the expressions do not account for the additional variation induced by estimating V.
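As a sketch of the Satterthwaite idea, the approximate degrees of freedom for a linear combination of independent mean squares can be computed as follows; the coefficients, mean squares, and degrees of freedom below are hypothetical:

```python
# Generic Satterthwaite approximation for the degrees of freedom of a
# linear combination  sum_i c_i * MS_i  of independent mean squares.
def satterthwaite_df(coefs, mean_squares, dfs):
    num = sum(c * ms for c, ms in zip(coefs, mean_squares)) ** 2
    den = sum((c * ms) ** 2 / df
              for c, ms, df in zip(coefs, mean_squares, dfs))
    return num / den

# Hypothetical example: average of two mean squares, each with 4 df.
print(round(satterthwaite_df([0.5, 0.5], [10.0, 40.0], [4, 4]), 2))
```

Note that the approximate degrees of freedom are smaller than the total 8 df, reflecting the unequal contributions of the two mean squares; a single mean square recovers its own degrees of freedom exactly.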

6.6.2 Some Practical Issues about Generalized Least Squares Methodology

The methods employed by PROC MIXED are based on sound principles as prescribed by the specified model. Of course, the model must be valid in order for results to be valid. Assuming the model is valid, the inferential methods are not exact only because covariance parameters must be estimated. Nonetheless, standard errors of estimates from the MODEL statement, and from the ESTIMATE and LSMEANS statements, use the correct basic mathematical expression. Therefore, these standard errors are credible, whereas standard errors computed by PROC GLM in mixed-model applications are usually suspect because they are not based on the correct basic mathematical expression. Instead, they are based on expressions for fixed-effects models.

PROC MIXED treats random effects as random and fixed effects as fixed. These concepts are built into the model. This eliminates much of the guesswork that using PROC GLM requires. Here are two examples. One, in choosing a numerator mean square for a test of fixed effects, you do not have to be concerned about the involvement of random effects. You simply choose the type of hypothesis you want to test based on considerations of only the fixed effects. Two, estimability is judged on the basis of fixed effects alone, without regard to the random effects.

1 You can use the ORDER= option in the PROC GLM statement to alter the column position
