11.6 A Lack-of-Fit Analysis

A lack-of-fit analysis can provide information on the adequacy of a model that does not include all possible terms. The basic principle is to compare the fits of “complete” and “reduced” models as described in Chapter 2. The sum of squares for the complete model can be obtained from a one-way analysis of variance that computes the sums of squares among all unique treatments. The error sum of squares for the full model is subtracted from the error sum of squares for the incompletely specified model to obtain the sum of squares for all terms not specified.
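Written in terms of error sums of squares, the comparison takes the form of a general F statistic. The following is a brief sketch in LaTeX, using SSE_R and df_R for the error sum of squares and error degrees of freedom of the reduced model and SSE_F and df_F for the complete model (notation introduced here only for illustration):

F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F}

Under the reduced model, F is compared with an F distribution having (df_R - df_F) numerator and df_F denominator degrees of freedom.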

One of the most common applications of lack-of-fit analysis is testing the adequacy of a regression model. In this procedure, you want to determine if a fitted model accounts for essentially all of the variation in a response variable due to differences between the levels of a quantitative independent variable. For example, consider an experiment in which chickens were fed a form of dietary copper to relate the copper uptake in the liver to copper intake. The chickens were fed a basal diet of 11 ppm copper sulfate, plus a supplemental rate of 0, 150, 300, or 450 ppm. There were six chickens in each of the four treatment groups. The data for this experiment are shown in the SAS data set LIVCU printed in Output 11.25. The variable LOGLIVCU is the logarithm (base 10) of the copper in the livers of the chickens.

Output 11.25 Data for a Lack-of-Fit Analysis

Liver Copper in Poultry Fed Sulfate or Lysine Source

 

Obs level lackofit loglivcu
 
1 0 0 1.16761
2 0 0 1.25789
3 0 0 1.27312
4 0 0 1.09688
5 0 0 1.26881
6 0 0 1.24391
7 150 150 1.38957
8 150 150 1.46716
9 150 150 1.51402
10 150 150 1.30969
11 150 150 1.24596
12 150 150 1.37160
13 300 300 1.99269
14 300 300 2.19897
15 300 300 2.14038
16 300 300 1.83695
17 300 300 1.97164
18 300 300 2.11470
19 450 450 2.41911
20 450 450 2.34434
21 450 450 2.15644
22 450 450 2.32868
23 450 450 2.46058
24 450 450 2.43342

The variable LACKOFIT was defined to be equal to LEVEL in the DATA step. Its purpose will become apparent.
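The DATA step itself is not shown in the text; a minimal sketch of one that could produce LIVCU follows. Because the raw liver-copper values are not listed, LOGLIVCU is read directly, and LACKOFIT is simply a copy of LEVEL:

data livcu;
   input level loglivcu @@;
   lackofit=level;   /* duplicate of LEVEL, to be used in the CLASS statement */
   datalines;
  0 1.16761    0 1.25789    0 1.27312    0 1.09688    0 1.26881    0 1.24391
150 1.38957  150 1.46716  150 1.51402  150 1.30969  150 1.24596  150 1.37160
300 1.99269  300 2.19897  300 2.14038  300 1.83695  300 1.97164  300 2.11470
450 2.41911  450 2.34434  450 2.15644  450 2.32868  450 2.46058  450 2.43342
;
run;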

We want to fit a linear regression of LOGLIVCU on LEVEL and determine if the linear equation adequately models the response of LOGLIVCU to LEVEL. There are three degrees of freedom for differences between treatments. The linear regression accounts for one of those. The other two account for lack of fit of the linear regression, plus random error. The challenge is to determine how much is lack of fit.

Within each treatment group there are six observations. Variation between these observations, within a group, measures random variation. This is sometimes called “pure error.” Pooled across treatment groups, there are 20 DF for pure error. The analysis of variance is

Source         DF

LEVEL           1
Lack of Fit     2
Pure Error     20

This ANOVA can be obtained by the Type I sums of squares that are provided by the following statements:

proc glm;
   class lackofit;
   model loglivcu=level lackofit/ss1;

The term LEVEL in the MODEL statement to the right of the equal sign is the usual linear effect of LEVEL. The LACKOFIT variable measures the variation in LOGLIVCU due to treatment that is not accounted for by the linear regression. Specifying the variable LACKOFIT, which has four levels, in the CLASS statement causes the generation of four dummy variables. If LACKOFIT were the only variable in the MODEL statement, then it would account for 3 DF. One of these is confounded with the linear effect. Preceding LACKOFIT by LEVEL leaves only 2 DF for LACKOFIT in the Type I sums of squares. It is important to precede LACKOFIT by LEVEL, because if LACKOFIT preceded LEVEL, then all 3 DF would go to LACKOFIT and 0 DF would go to LEVEL. Note, however, that the Type II, III, and IV sums of squares for LEVEL would be zero in either case, because the LEVEL column lies in the space spanned by the LACKOFIT dummy variables.

Results of the preceding statements appear in Output 11.26.

Output 11.26 A Lack-of-Fit Analysis

The GLM Procedure

 

Dependent Variable: loglivcu

 
Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 3 5.23096460 1.74365487 155.47 <.0001
 
Error 20 0.22430992 0.01121550    
 
Corrected Total 23 5.45527452      
 
R-Square Coeff Var Root MSE loglivcu Mean
 
0.958882 6.051018 0.105903 1.750172
 
Source DF Type I SS Mean Square F Value Pr > F
 
level 1 4.98592403 4.98592403 444.56 <.0001
lackofit 2 0.24504057 0.12252029 10.92 0.0006

The test for LACKOFIT is significant with p=0.0006. This indicates that the relation between LOGLIVCU and LEVEL is not linear. However, the analysis sheds no light on the true relationship.
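As a check on the arithmetic, the LACKOFIT test in Output 11.26 can be reproduced from its components: the lack-of-fit mean square is 0.24504/2 = 0.12252, the pure-error mean square is 0.22431/20 = 0.01122, and their ratio is 0.12252/0.01122 ≈ 10.92, the F value printed for LACKOFIT.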

You could test whether the relationship is quadratic with a similar analysis provided by the statements

proc glm;
   class lackofit;
   model loglivcu=level level*level lackofit/ss1;

Output is not shown. There would be 1 DF for LEVEL and 1 DF for LEVEL*LEVEL.

Now consider a three-factor factorial experiment with factors A, B, and C. Suppose you specify an incomplete model that omits the B*C and A*B*C interactions, and you want to test for lack of fit of this model. Run the following statements:

proc glm;
   class a b c;
   model y=a b a*b c a*c;

The difference between the error sum of squares that you obtain and the error sum of squares from a between-cell analysis of variance provides the additional sum of squares due to both B*C and A*B*C.

You can get the between-cell analysis of variance from PROC ANOVA. If the CLASS variables are integers, a single variable can be generated to represent all cell combinations. Assume, for example, that the values of three CLASS variables (A, B, C) consist of integers between 1 and 10. The following assignment statement placed in the DATA step provides the subscript:

group=100*a+10*b+c;

The following SAS statements provide the desired error sums of squares:

proc glm;
   class group;
   model y=group;

The CLASS variables may not conform to these specifications. Character values can be concatenated, but the resulting single classification variable may exceed eight characters. Alternatively, the following statements can be used to compute the sum of squares for differences between the cells:

proc glm;
   class a b c;
   model y=a*b*c;
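The two error sums of squares can then be combined into a lack-of-fit test by hand or with a short DATA step. The following macro is a hedged sketch (the macro name and its parameters are illustrative, not from the text); supply the error sums of squares and degrees of freedom printed by the reduced-model and between-cell runs:

%macro lackfit(sse_r=, dfe_r=, sse_f=, dfe_f=);
   /* SSE_R, DFE_R: error SS and DF from the reduced model          */
   /* SSE_F, DFE_F: error SS and DF from the between-cell analysis  */
   data _lof;
      ss_lof=&sse_r-&sse_f;                 /* SS for the omitted terms */
      df_lof=&dfe_r-&dfe_f;                 /* DF for the omitted terms */
      f=(ss_lof/df_lof)/(&sse_f/&dfe_f);    /* lack-of-fit F statistic  */
      p=1-probf(f,df_lof,&dfe_f);           /* p-value                  */
      put f= p=;
   run;
%mend lackfit;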

11.7 An Unbalanced Nested Structure

Nested structure concerns samples within samples, as discussed in Section 4.2, “Nested Classifications.” An example is treatments applied to plants in pots. There are several pots per treatment and several plants per pot. The pots do not necessarily have the same number of plants, and there may be different numbers of pots per treatment. Another example of nested structure occurs in sample surveys in which households are sampled within blocks, blocks are sampled within precincts, precincts are sampled within cities, and so on. In many cases, PROC NESTED is adequate for such analyses. Some applications, however, require PROC GLM or PROC MIXED. This section addresses some basic issues of computing means with unbalanced data and random effects. Section 11.8 continues to address these issues in a more complex setting.
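For a purely nested structure in which every factor is treated as random, such as the sample-survey example just described, a minimal PROC NESTED sketch might look like the following. The data set SURVEY and the variables CITY, PRECINCT, BLOCK, and INCOME are hypothetical, and PROC NESTED expects the data to be sorted by the CLASS variables, listed from the outermost level inward:

proc sort data=survey;
   by city precinct block;
run;
proc nested data=survey;
   class city precinct block;
   var income;
run;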

Data containing a nested structure often have both fixed and random components, which raises questions about proper error terms. Consider an experiment with t treatments (TRT) applied randomly to a number of pots (POT), each containing several plants (PLANT). The data appear in Output 11.27.

Output 11.27 Data from an Unbalanced Nested Classification

Unbalanced Nested Structure

 

Obs   TRT   POT PLANT Y
 
1 1 1 1 15
2 1 1 2 13
3 1 1 3 16
4 1 2 1 17
5 1 2 2 19
6 1 3 1 12
7 2 1 1 20
8 2 1 2 21
9 2 2 1 20
10 2 2 2 23
11 2 2 3 19
12 2 2 4 19
13 3 1 1 12
14 3 1 2 13
15 3 1 3 14
16 3 2 1 11
17 3 3 1 12
18 3 3 2 13
19 3 3 3 15
20 3 3 4 11
21 3 3 5 9

The model is

yijk = μ + λi + ρij + εijk

where

yijk is the observed response in the kth PLANT of the jth POT in the ith TREATMENT.

μ is the overall mean response.

λi is the effect of the ith TREATMENT.

ρij is the effect of the jth POT within the ith TREATMENT.

εijk is the effect of the kth individual PLANT in the jth POT of the ith TREATMENT. This effect is usually considered to be the random error.

You are primarily interested in tests and estimates related to the TREATMENTs as well as the variation among POTs and PLANTs. The analysis is implemented by the following SAS statements:

proc glm;
   class trt pot;
   model y=trt pot(trt) / ss1 ss3;
   means trt pot(trt);
   lsmeans trt pot(trt);

Most items in these statements are similar to those from previous examples. The analysis of variance appears in Output 11.28.

Output 11.28 Types I and III ANOVA for an Unbalanced Nested Classification

The GLM Procedure

 

Dependent Variable: Y

 
Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 7 267.2261905 38.1751701 12.43 <.0001
 
Error 13 39.9166667 3.0705128    
 
Corrected Total 20 307.1428571      
 
R-Square Coeff Var Root MSE Y Mean
 
0.870039 11.35742 1.752288 15.42857
 
Source DF Type I SS Mean Square F Value Pr > F
 
TRT 2 236.9206349 118.4603175 38.58 <.0001
POT(TRT) 5 30.3055556 6.0611111 1.97 0.1499
 
Source DF Type III SS Mean Square F Value Pr > F
 
TRT 2 200.1109726 100.0554863 32.59 <.0001
POT(TRT) 5 30.3055556 6.0611111 1.97 0.1499

The analysis of variance has the same form as given in Output 4.4. Note, however, that there is a slight difference in the Type I and Type III sums of squares of TRT in Output 11.28, due to the unbalanced structure of the data. This difference is made clearer by noting the differences between the unadjusted means from the MEANS statement and the adjusted or least-squares means from the LSMEANS statement. Table 11.2 shows the differences.

The means and least-squares means are shown in Output 11.29.

Output 11.29 Means and Least-Squares Means from the GLM Procedure

The GLM Procedure

 

Level of   ---------Y---------
TRT N Mean Std Dev
 
1 6 15.3333333 2.58198890
2 6 20.3333333 1.50554531
3 9 12.2222222 1.78730088
 
Level of Level of   ---------Y---------
POT TRT N Mean Std Dev
 
1 1 3 14.6666667 1.52752523
2 1 2 18.0000000 1.41421356
3 1 1 12.0000000  .        
1 2 2 20.5000000 0.70710678
2 2 4 20.2500000 1.89296945
1 3 3 13.0000000 1.00000000
2 3 1 11.0000000  .        
3 3 5 12.0000000 2.23606798


The GLM Procedure
Least Squares Means

 

TRT Y LSMEAN
 
1 14.8888889
2 20.3750000
3 12.0000000
 
POT TRT Y LSMEAN
 
1 1 14.6666667
2 1 18.0000000
3 1 12.0000000
1 2 20.5000000
2 2 20.2500000
1 3 13.0000000
2 3 11.0000000
3 3 12.0000000

Table 11.2 Means and Least-Squares Means

TRT   POT   N     Means     Least-Squares Means

 1           6    15.333    14.889
 2           6    20.333    20.375
 3           9    12.222    12.000
 1     1     3    14.667    14.667
       2     2    18.000    18.000
       3     1    12.000    12.000
 2     1     2    20.500    20.500
       2     4    20.250    20.250
 3     1     3    13.000    13.000
       2     1    11.000    11.000
       3     5    12.000    12.000

The values produced by the MEANS statement are the means of all observations in a TREATMENT. These are the weighted POT means, as shown in the following equation:

mean(i) = ȳi.. = (1/ni.) Σj nij ȳij.

nij is the number of plants in POT j of TREATMENT i. On the other hand, the least-squares means are the unweighted POT means, as shown in the following equation:

least-squares mean(i) = (1/ki) Σj ȳij.

ki is the number of pots in TREATMENT i.
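For TREATMENT 1, for example, the POT means in Output 11.29 are 14.667, 18.000, and 12.000, based on 3, 2, and 1 plants. The weighted mean is (3×14.667 + 2×18.000 + 1×12.000)/6 = 15.333, which matches the MEANS output, and the unweighted mean is (14.667 + 18.000 + 12.000)/3 = 14.889, which matches the least-squares mean.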

Both of these types of means have specific uses. In sample surveys, particularly self-weighting samples, it is usually appropriate to use the ordinary means. For the present case, POTs would probably be considered a random effect (see Section 4.2.1, “Analysis of Variance for Nested Classifications”). In this event, the variance of least-squares mean(i) is less than the variance of mean(i) if σρ² is large relative to σ², and conversely, where σρ² = V(ρij) and σ² = V(εijk).
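A sketch of the variance comparison under this model (random POT effects, independent PLANT errors) is

Var[\text{mean}(i)] = \frac{\sigma_\rho^2 \sum_j n_{ij}^2}{n_{i\cdot}^2} + \frac{\sigma^2}{n_{i\cdot}}, \qquad Var[\text{LS mean}(i)] = \frac{\sigma_\rho^2}{k_i} + \frac{\sigma^2}{k_i^2}\sum_j \frac{1}{n_{ij}}.

For TREATMENT 1 (nij = 3, 2, 1) these work out to 0.389σρ² + 0.167σ² and 0.333σρ² + 0.204σ², respectively, so the least-squares mean is the less variable of the two when σρ² dominates and the more variable when σ² dominates.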

11.8 An Analysis of Multi-Location Data

Multiple location studies, such as clinical trials conducted at several centers, or on-farm trials in agriculture, raise several linear model issues. These issues primarily involve mixed-model inference considerations introduced in Chapter 4, and linear model unbalanced data concepts discussed in Chapters 5 and 6. The analysis of multi-location data can be both confusing and controversial, partly because different kinds of multi-location studies call for different approaches and partly because there is disagreement within the statistics community on what methods are appropriate. This section presents an example multi-location data set and several alternative analyses using linear and mixed-model methods. The purpose of this section is not to prescribe, but simply to demonstrate the various linear model approaches and in the process frame the main linear model issues.

Output 11.30 contains data from a study to compare 3 treatments (TRT) conducted at 8 locations (LOC). At each location, a randomized complete-blocks design was used, but the number of blocks varied. Locations 1-4 used 3 blocks each, locations 5 and 6 used 6 blocks each, and locations 7 and 8 used 12 blocks each. In the interest of space, not all the data are shown in Output 11.30. However, Output 11.31 shows the response variable (Y) mean and number of observations (blocks) per treatment for each location.

Output 11.30 Multi-Location Data

Obs loc blk trt y
 
1 1  1 1 46.6
2 1  1 2 46.4
3 1  1 3 44.4
4 1  2 1 43.7
5 1  2 2 43.6
6 1  2 3 31.4
7 1  3 1 37.9
8 1  3 2 39.5
9 1  3 3 48.2
10 2  1 1 34.0
      .  
      .  
      .  
124 8  6 1 43.5
125 8  6 2 52.1
126 8  6 3 61.4
127 8  7 1 44.1
128 8  7 2 54.8
129 8  7 3 59.9
130 8  8 1 43.3
131 8  8 2 49.4
132 8  8 3 63.0
133 8  9 1 44.2
134 8  9 2 54.6
135 8  9 3 64.8
136 8 10 1 54.6
137 8 10 2 56.6
138 8 10 3 64.6
139 8 11 1 52.1
140 8 11 2 44.3
141 8 11 3 59.7
142 8 12 1 44.9
143 8 12 2 43.3
144 8 12 3 65.0

Output 11.31 Mean Response for Each Treatment by Location

Obs loc trt _FREQ_ y_mean
 
1 1 1 3 42.7333
2 1 2  3 43.1667
3 1 3  3 41.3333
4 2 1  3 33.5333
5 2 2  3 37.0000
6 2 3  3 22.2333
7 3 1  3 36.6667
8 3 2  3 43.4000
9 3 3  3 47.9000
10 4 1  3 47.7000
11 4 2  3 52.3000
12 4 3  3 73.7000
13 5 1  6 41.8000
14 5 2  6 45.9500
15 5 3  6 47.0000
16 6 1  6 33.9667
17 6 2  6 38.1667
18 6 3  6 30.2333
19 7 1 12 38.6417
20 7 2 12 44.1833
21 7 3 12 51.8500
22 8 1 12 47.5417
23 8 2 12 50.6500
24 8 3 12 60.5500

The variable _FREQ_ in Output 11.31 refers to the number of blocks in a given location.

These are the main controversies for the analysis of multi-location data:

❏ Should the location×treatment interaction be included in the model, or should it be assumed to be zero?

❏ Should locations be considered fixed or random?

❏ Should means be weighted by the number of observations per location, or equivalently, should Type I or Type III SS be used if nonzero location×treatment interactions are assumed?

❏ If random locations and hence random location×treatment interaction effects are assumed, how should location-specific treatment effects, if they arise, be handled?

The following analyses illustrate different approaches to these questions. These illustrations suggest advantages and disadvantages for each approach.

11.8.1 An Analysis Assuming No Location×Treatment Interaction

This approach assumes the model yijk = μ + Li + B(L)ij + τk + eijk, where Li, B(L)ij, and τk are the location, block-within-location, and treatment effects, respectively, and the random errors eijk are assumed i.i.d. N(0, σ²). Its rationale presumes that 1) the reason for having multiple locations is solely to provide a practical way to obtain an adequate sample size and 2) treatment effects are known with certainty not to be location-specific. You can use the following SAS statements to implement the analysis:

proc glm data=mloc;
   class loc blk trt;
   model y=trt loc blk(loc);
   means trt;
   lsmeans trt;

Output 11.32 shows the results. Normally, you would place TRT last in the model. It is placed before LOC and BLK(LOC) here to illustrate a point. Though not shown here, in practice you would usually add CONTRAST statements or mean comparison options to the MEANS or LSMEANS statements to complete inference about treatment effects.
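As a hedged sketch of what such additions might look like (these statements are not the ones that produced Output 11.32, and the contrasts shown are only examples):

proc glm data=mloc;
   class loc blk trt;
   model y=trt loc blk(loc);
   lsmeans trt / pdiff stderr;
   contrast 'trt 1 vs trt 2' trt 1 -1  0;
   contrast 'trt 1 vs trt 3' trt 1  0 -1;
run;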

Output 11.32 An Analysis of Multi-Location Data Using the No LOC×TRT Model

Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 49 10530.96049 214.91756 4.15 <.0001
 
Error 94 4869.34944 51.80159    
 
Corrected Total 143 15400.30993      
 
Source DF Type I SS Mean Square F Value Pr > F
 
trt 2 1641.777222 820.888611 15.85 <.0001
loc 7 7770.496597 1110.070942 21.43 <.0001
blk(loc) 40 1118.686667 27.967167 0.54 0.9846
 
Source DF Type III SS Mean Square F Value Pr > F
 
trt 2 1641.777222 820.888611 15.85 <.0001
loc 7 7770.496597 1110.070942 21.43 <.0001
blk(loc) 40 1118.686667 27.967167 0.54 0.9846
 
Level of   ---------y---------
trt  N Mean Std Dev
 
1 48 41.0562500 6.9066698
2 48 45.2145833 6.8837456
3 48 49.3270833 14.0586876
 

Least Squares Means

 
trt y LSMEAN
1 39.6986111
2 43.8569444
3 47.9694444

You can see that the MEANS and LS means are different. This reflects a different weighting scheme for the Li effects: The MEANS weight them according to the number of observations per location, whereas the LS means weight them equally. Notice that the differences among pairs of treatment MEANS and LS means, however, are unaffected and the Type I and Type III SS are identical. You get different estimates of means but identical estimates of treatment differences regardless of whether you use MEANS or LS means. This is because differences among the MEANS and LS means eliminate weighting based on the number of observations per location. You can see this by using the E option with the LSMEANS statement (as shown in Chapter 6) to show the weighting scheme. The estimates of the treatment effects are thus disproportionately affected by the locations with the greatest number of observations (in this case locations 7 and 8).
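For example, the statement below (a minimal sketch) prints the coefficient vector behind each treatment LS mean, so you can verify that every location receives equal weight:

   lsmeans trt / e;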

The main risk of using this analysis is that it is very sensitive to the assumption of no location-specific treatment effects. Even minor violations of this assumption can seriously affect the results when you use this model.

11.8.2 A Fixed-Location Analysis with an Interaction

In many, perhaps most, multi-location studies, researchers are not prepared to assume no location×treatment interaction without at least testing the assumption. One approach is to modify the model from Section 11.8.1 by adding an interaction term, yielding the model equation yijk = μ + Li + B(L)ij + τk + (τL)ik + eijk, where (τL)ik denotes the location×treatment interaction. Use the following SAS statements to implement the analysis:

proc glm data=mloc;
   class loc blk trt;
   model y=loc blk(loc) trt loc*trt;
   means trt;
   lsmeans trt loc*trt/slice=loc;
run;

As with any factorial arrangement, the appropriate strategy for inference is first to test the location×treatment (LOC*TRT) interaction and then evaluate simple effects of treatment by location (for example, using the SLICE=LOC option) if the interaction is non-negligible, or otherwise, evaluate main effects. The results appear in Output 11.33.

Output 11.33 An Analysis of Multi-Location Data Using a Fixed-Location Model with an Interaction

Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 63 13042.45660 207.02312 7.02 <.0001
 
Error 80 2357.85333 29.47317    
 
Corrected Total 143 15400.30993      
 
Source DF Type I SS Mean Square F Value Pr > F
 
loc 7 7770.496597 1110.070942 37.66 <.0001
blk(loc) 40 1118.686667 27.967167 0.95 0.5634
trt 2 1641.777222 820.888611 27.85 <.0001
loc*trt 14 2511.496111 179.392579 6.09 <.0001
 
Source DF Type III SS Mean Square F Value Pr > F
 
loc 7 7770.496597 1110.070942 37.66 <.0001
blk(loc) 40 1118.686667 27.967167 0.95 0.5634
trt 2 757.254848 378.627424 12.85 <.0001
loc*trt 14 2511.496111 179.392579 6.09 <.0001
 
Level of   ---------y---------
trt  N Mean Std Dev
 
1 48 41.0562500 6.9066698
2 48 45.2145833 6.8837456
3 48 49.3270833 14.0586876

 

Least Squares Means

 

trt y LSMEAN
1 40.3229167
2 44.3520833
3 46.8500000

 

loc*trt Effect Sliced by loc for y

 

loc DF Sum of
Squares
Mean Square F Value Pr > F
 
1 2 5.508889 2.754444 0.09 0.9109
2 2 357.762222 178.881111 6.07 0.0035
3 2 191.775556 95.887778 3.25 0.0438
4 2 1155.120000 577.560000 19.60 <.0001
5 2 90.730000 45.365000 1.54 0.2208
6 2 189.031111 94.515556 3.21 0.0457
7 2 1055.791667 527.895833 17.91 <.0001
8 2 1107.553889 553.776944 18.79 <.0001

Output 11.33 reveals several points about the data. First, there is very strong evidence of a location×treatment interaction (F=6.09, p<0.0001). The SLICE output partially reveals the nature of the interaction: Significant treatment effects were observed at locations 2, 3, 4, 6, 7, and 8, but not at locations 1 or 5. You could pursue this by computing the simple effect comparisons among treatments for each location using the steps presented in Section 3.7.5, “Simple Effect Comparisons.” These comparisons are not shown, but you can inspect the treatment means by location in Output 11.31 to anticipate the results: In locations 2 and 6, the mean of treatment 3 is substantially lower than the means of treatments 1 and 2, whereas in locations 4, 7, and 8, and to a lesser extent locations 3 and 5, the opposite is true.
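As a hedged sketch, a simple-effect comparison at a particular location can be requested with a CONTRAST statement of the following form (shown for treatment 1 versus treatment 3 at location 2; the LOC*TRT coefficients are ordered by location and then by treatment, and trailing zeros can be omitted):

contrast 't1 vs t3 at loc 2' trt 1 0 -1
         loc*trt 0 0 0  1 0 -1;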

If you do proceed with inference on main effects, despite the evidence of interaction, then you can see that the MEANS and their associated test using the Type I SS for TRT produce different results than the LS means and their associated test using Type III SS. This mainly results from the fact that the MEANS weight locations 7 and 8 more heavily relative to the other locations, whereas the LS means weight all locations equally. Thus, the large difference between treatment 3 and the others in locations 7 and 8 affects the MEANS to a much greater extent than the LS means.

Recalling the discussion of MEANS and Type I SS versus LS means and Type III SS from Chapter 6, you would want to use the MEANS if the number of observations per location closely reflects the true proportions of the populations in the various locations. In other words, if locations 7 and 8, for example, are in communities whose populations are roughly four times the populations of locations 1 through 4, then the proportion of observations is representative. On the other hand, if the number of observations per location is mainly a sampling artifact, and does not reflect the actual size of the populations in each location, then the MEANS may seriously misrepresent the actual treatment effects. Note that if you decide to drop LOC*TRT from the model based on the test for interaction, your subsequent inference is implicitly based on the MEANS.

Keep in mind that the fixed-locations model with interaction makes two critical assumptions about the data. First, recalling the discussion of fixed-effects versus random-effects inference from Chapter 4, the fixed-locations model assumes that the observed locations are the entire population. The analysis neither measures nor recovers any information about distribution among locations. Second, fixed-locations analysis assumes that the only relevant source of uncertainty comes from variation among observations within locations, making MS(ERROR) an appropriate error term for testing TRT. If locations are meant to represent a larger population, this assumption is probably untrue and, as you will see in the next section, the tests for treatment shown in this section are incorrect and likely to be misleading. Assuming fixed locations when in fact the location and location×treatment effects represent probability distributions can produce severely inflated Type I error rates for the test of treatment effects. Therefore, you should use this model only when the locations in fact are the entire population of inference or when the locations are chosen to represent a second treatment factor associated with known characteristics of the location (for example, soil type or climatic zone in agricultural trials, or socioeconomic group in multi-center clinical trials).

11.8.3 A Random-Location Analysis

In many multi-location studies, locations represent a larger target population. Implicitly, the goal of these studies is to apply inference beyond the observed locations to the entire population. Recalling the criteria for distinguishing fixed from random effects given in the introduction to Chapter 4, location effects are random when the locations actually observed represent a probability distribution of locations that could, in theory, have been sampled. In most multi-location studies, locations are not drawn from a true random sample, but again recalling the discussion in Chapter 4, this is usually a moot point. Location effects are random if the locations plausibly represent the population (and if they don’t, you should question either the study design or the use of the data to draw inference beyond the observed locations).

The model equation for random-location analysis is identical to the equation given in Section 11.8.2 for fixed-location analysis with interaction, but the assumptions are different: The location effects, Li, are assumed i.i.d. N(0, σL²), and the location×treatment effects, (τL)ik, are assumed i.i.d. N(0, σLT²). In addition, the block-within-location effects are assumed to be random as well. You can obtain the expected mean squares and the overall test for treatment effects using PROC GLM, but, as with other mixed-model examples discussed in previous chapters, the standard errors and tests of various treatment comparisons are wrong or awkward to obtain using PROC GLM. PROC MIXED is a better choice. Use the SAS statements

proc mixed method=type3;
   class loc blk trt;
   model y=trt/ddfm=kr;
   random loc blk(loc) loc*trt;
   lsmeans trt/diff;

The METHOD=TYPE3 option is not necessary in practice; it is used here merely to show the expected mean squares, which appear in Output 11.34. The rest of the analysis appears in Output 11.35.

Output 11.34 An Analysis of Variance and Expected Mean Squares for a Random-Locations Analysis of Multi-Location Data

Type 3 Analysis of Variance

Source DF Sum of
Squares
Mean Square
 
trt 2 757.254848 378.627424
loc 7 7770.496597 1110.070942
blk(loc) 40 1118.686667 27.967167
loc*trt 14 2511.496111 179.392579
Residual 80 2357.853333 29.473167
 
Source Expected Mean Square
 
trt Var(Residual) + 4.3636 Var(loc*trt) + Q(trt)
loc Var(Residual) + 5.6786 Var(loc*trt)
+ 3 Var(blk(loc)) + 17.036 Var(loc)
blk(loc) Var(Residual) + 3 Var(blk(loc))
loc*trt Var(Residual) + 5.6786 Var(loc*trt)
Residual Var(Residual)

Note the coefficients of the LOC*TRT variance for the TRT main effect and the LOC*TRT interaction. They are different because of the unequal number of observations per location. With unbalanced data and the random location×treatment interaction, the appropriate error term for testing treatment effects is a linear combination of MS(LOC*TRT) and MS(ERROR).
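A sketch of how such a denominator can be synthesized from the expected mean squares in Output 11.34 is

MS_{denom} = \frac{4.3636}{5.6786}\,MS(\text{LOC*TRT}) + \left(1 - \frac{4.3636}{5.6786}\right) MS(\text{Residual}),

which has expectation Var(Residual) + 4.3636 Var(loc*trt), matching the expected mean square for TRT apart from Q(trt); the Satterthwaite approximation supplies its degrees of freedom. PROC MIXED applies further adjustments (for example, through the DDFM=KR option), so the printed F value need not equal this naive ratio exactly.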

Output 11.35 Random Location Analysis of Multi-Location Data

Covariance Parameter
Estimates

 

Cov Parm Estimate
 
loc 54.7194
blk(loc) -0.5020
loc*trt 26.4009
Residual 29.4732

 

Type 3 Tests of Fixed Effects

 

Effect Num
DF
Den
DF
F Value Pr > F
 
trt 2 18.1 2.77 0.0893

 

Least Squares Means

 

Effect trt Estimate Standard
Error
DF  t Value  Pr > |t|
 
trt 1 40.2770 3.3091 15.8 12.17 <.0001
trt 2 44.3284 3.3091 15.8 13.40 <.0001
trt 3 46.9789 3.3091 15.8 14.20 <.0001

 

Differences of Least Squares Means

 

Effect trt _trt Estimate   Standard
Error
DF  t Value  Pr > |t|
 
trt 1 2 -4.0515 2.8690 18.1 -1.41 0.1749
trt 1 3 -6.7020 2.8690 18.1 -2.34 0.0312
trt 2 3 -2.6505 2.8690 18.1 -0.92 0.3677

You can see from Output 11.35 that the test of treatment effect is considerably more conservative than the corresponding tests in the fixed-locations analyses. This is partly because the MS(LOC*TRT) term is considerably larger than MS(ERROR)—recall the highly significant location×treatment interaction in Output 11.33—and partly because the denominator degrees of freedom depend mainly on the degrees of freedom for LOC*TRT and are thus substantially lower. The BLK(LOC) variance estimate is allowed to remain negative when the METHOD=TYPE3 option is used. The REML default sets the estimate to zero, with some impact on the LOC*TRT variance estimate and some of the test statistics. Section 4.4.2, “Standard Errors for the Two-Way Mixed Model: GLM versus MIXED,” discussed the arguments for and against the REML default; this remains an unresolved controversy in mixed-model inference.
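If you want to see the effect of the boundary constraint under the REML default, one hedged option is to rerun the model with the NOBOUND option in the PROC MIXED statement, which removes the nonnegativity constraint on the variance component estimates:

proc mixed data=mloc nobound;
   class loc blk trt;
   model y=trt / ddfm=kr;
   random loc blk(loc) loc*trt;
   lsmeans trt / diff;
run;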

Now look at the LS means and the estimates of treatment differences. The estimates are close to the values you would get using the LS means in the fixed location with interaction model in Output 11.33. This means that the random-locations model implicitly weights locations approximately equally. The standard errors and denominator degrees of freedom are considerably different from the fixed-locations analysis because the mixed model uses the LOC*TRT variance.

You can consider location-specific effects with the random-locations model by using best linear unbiased predictors. The following SAS statements obtain the location-specific BLUPs for the differences between treatments 1 and 2 and between 1 and 3, respectively:

estimate 't1 vs t2 at loc 1' trt 1 -1 0 | loc*trt 1 -1 0;
estimate 't1 vs t3 at loc 1' trt 1 0 -1 | loc*trt 1 0 -1;
estimate 't1 vs t2 at loc 2' trt 1 -1 0 | loc*trt 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 2' trt 1 0 -1 | loc*trt 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 3' trt 1 -1 0
         | loc*trt 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 3' trt 1 0 -1
         | loc*trt 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 4' trt 1 -1 0
         | loc*trt 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 4' trt 1 0 -1
         | loc*trt 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 5' trt 1 -1 0
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 5' trt 1 0 -1
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 6' trt 1 -1 0
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 6' trt 1 0 -1
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 7' trt 1 -1 0
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 7' trt 1 0 -1
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 8' trt 1 -1 0
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0    1 -1 0;
estimate 't1 vs t3 at loc 8' trt 1 0 -1
         | loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0     1 0 -1;

The results appear in Output 11.36.

Output 11.36 Location-Specific Best Linear Unbiased Predictors for Multi-Location Data

Estimates

 

Label Estimate Standard
Error
DF t Value Pr > |t|
 
t1 vs t2 at loc 1 -1.4146 3.8811 133 -0.36 0.7161
t1 vs t3 at loc 1 -0.7973 3.8811 133 -0.21 0.8376
t1 vs t2 at loc 2 -3.6253 3.8811 133 -0.93 0.3519
t1 vs t3 at loc 2 6.4178 3.8811 133 1.65 0.1006
t1 vs t2 at loc 3 -6.0060 3.8811 133 -1.55 0.1241
t1 vs t3 at loc 3 -10.0044 3.8811 133 -2.58 0.0110
t1 vs t2 at loc 4 -4.4512 3.8811 133 -1.15 0.2535
t1 vs t3 at loc 4 -20.7663 3.8811 133 -5.35 <.0001
t1 vs t2 at loc 5 -4.1345 2.9216 104 -1.42 0.1600
t1 vs t3 at loc 5 -5.4356 2.9216 104 -1.86 0.0656
t1 vs t2 at loc 6 -4.1767 2.9216 104 -1.43 0.1558
t1 vs t3 at loc 6 2.0963 2.9216 104 0.72 0.4747
t1 vs t2 at loc 7 -5.4148 2.1376 90.7 -2.53 0.0130
t1 vs t3 at loc 7 -12.6546 2.1376 90.7 -5.92 <.0001
t1 vs t2 at loc 8 -3.1886 2.1376 90.7 -1.49 0.1392
t1 vs t3 at loc 8 -12.4716 2.1376 90.7 -5.83 <.0001

These estimates are similar to the simple effects you would obtain in the fixed location with interaction analysis in Section 11.8.2, except that the estimates in Output 11.36 are shrinkage estimators to account for the location and location×treatment distributions. Also note that you should use the DDFM=KR option in the MODEL statement of the PROC MIXED program; otherwise, the standard errors of the BLUPs are biased downward.

The main argument for using the random-locations analysis is that it most accurately reflects the inference implicitly intended in most multi-location studies. The primary disadvantage is that in order to get reasonable estimates of LOC and LOC*TRT variance and in order to have sufficient denominator degrees of freedom to test TRT effects, studies must be designed so that there are an adequate number of locations and that the locations plausibly represent the target population.

11.8.4 Further Analysis of a Location×Treatment Interaction Using a Location Index

Closer inspection of the treatment means by location in Output 11.31 and the location-specific BLUPs in Output 11.36 reveals that the treatment 1 minus treatment 3 difference tends to favor treatment 1 in locations with relatively low overall mean responses, whereas treatment 3 tends to be favored in locations with relatively high overall mean responses. You can formalize this relationship with an analysis that regresses the response on an index characterizing the mean response at each location. This method is closely related to the Tukey test of non-additivity in randomized-complete-blocks designs and has been used by Eberhart and Russell (1966) to characterize genotype-by-environment interactions (which are a special case of multi-location studies). Milliken and Johnson (1989, Chapters 1-3) give an excellent overview of these methods.

The model for this analysis is yijk = μ + Li + B(L)ij + τk + βkIi + (τL)ik + eijk, where all of the terms in the model equation are defined as previously with the addition of a location index, Ii, and a regression coefficient, βk, for the kth treatment. The location index is usually defined as the mean response over all observations on the ith location. You can implement the analysis using the following SAS statements:

proc sort data=mloc;
   by loc;
proc means noprint data=mloc;
   by loc; var y;
   output out=env_indx mean=index;
data all;
   merge mloc env_indx;
   by loc;
proc mixed data=all;
   class loc blk trt;
   model y=trt trt*index/noint solution ddfm=satterth;
   random loc blk(loc) loc*trt;
   lsmeans trt/diff;
   contrast 'trt at mean index'
      trt 1 -1 0 trt*index 45.2 -45.2 0,
      trt 1 0 -1 trt*index 45.2 0 -45.2;

The PROC SORT and PROC MEANS statements generate a new data set, called ENV_INDX, which contains the means of Y by location. The new variable is called INDEX. The ENV_INDX data are then merged with the original data set. You compute the analysis using PROC MIXED. You can see that the MIXED program is similar to the random-locations analysis in Section 11.8.3, except that you add the term TRT*INDEX to the MODEL statement. This term corresponds to βkIi in the model equation given above. The NOINT and SOLUTION options allow easier interpretation of the output. The CONTRAST statement computes an appropriate test of the equality of treatment effects; because this model is a special case of an unequal slopes analysis-of-covariance model, the test of treatment effects varies with the covariate. The test shown here is for the mean of the INDEX variable over all locations. You could choose different values of INDEX. In fact, in many studies you would want to test treatment effect at several values of the INDEX, say at relatively low and relatively high values, to get an idea of how treatment effects change over locations with different mean responses. Output 11.37 shows the results of the analysis.

Output 11.37 Location Index Analysis of Multi-Location Data

Covariance Parameter
Estimates

 

Cov Parm Estimate
 
loc 0
blk(loc) 0
loc*trt 0.8334
Residual 27.9211

 

Solution for Fixed Effects

 

Effect trt Estimate Standard
Error
DF t Value Pr > |t|
 
trt 1 12.4035 5.1377 32.7 2.41 0.0215
trt 2 17.0483 5.1377 32.7 3.32 0.0022
trt 3 -29.4519 5.1377 32.7 -5.73 <.0001
index*trt 1 0.6345 0.1128 29.2 5.62 <.0001
index*trt 2 0.6232 0.1128 29.2 5.52 <.0001
index*trt 3 1.7423 0.1128 29.2 15.44 <.0001

 

Type 3 Tests of Fixed Effects

 

Effect Num
DF
Den
DF
F Value Pr > F
 
trt 3 32.7 16.57 <.0001
index*trt 3 29.2 100.17 <.0001

 

Contrasts

 

Label Num
DF
Den
DF
F Value Pr > F
 
trt at mean index 2 19.2 23.45 <.0001

 

Least Squares Means

 

Effect trt Estimate Standard
Error
DF t Value Pr > |t|
 
trt 1 41.0822 0.8483 19.2 48.43 <.0001
trt 2 45.2182 0.8483 19.2 53.31 <.0001
trt 3 49.2975 0.8483 19.2 58.12 <.0001

 

Differences of Least Squares Means

 

Effect trt _trt Estimate Standard
Error
DF t Value Pr > |t|
 
trt 1 2 -4.1360 1.1996 19.2 -3.45 0.0027
trt 1 3 -8.2153 1.1996 19.2 -6.85 <.0001
trt 2 3 -4.0793 1.1996 19.2 -3.40 0.0030

The “Covariance Parameter Estimates” show that the INDEX accounts for most of the variation among locations. The LOC variance estimate is 0 and the LOC*TRT variance is sharply reduced compared to its estimate in the random-locations analysis in Output 11.35. The “Solution for Fixed Effects” parameters have the following interpretation. The TRT parameters estimate μ + τk, and the INDEX*TRT parameters estimate βk. Thus, TRT + (INDEX*TRT)×(location index) gives you the expected treatment mean at a given value of the location index. For example, at a location whose average response is 45.2, the expected mean of treatment 1 is 12.4035 + (0.6345)(45.2) = 41.1, the LS mean shown in the output, aside from round-off error. The INDEX*TRT estimates tell you how the expected treatment mean changes as the location index increases.

The INDEX*TRT estimate is much larger for treatment 3, and the intercept (TRT) is much smaller, which tells you that treatment 3 is expected to have a low mean relative to treatments 1 and 2 in locations with a relatively low mean response. But its expected mean increases more quickly with the location index, so treatment 3 is expected to have a higher mean relative to treatments 1 and 2 in locations with relatively high mean responses.

The “Type 3 Tests of Fixed Effects” results for TRT and INDEX*TRT test the joint equality of these terms to zero. As such, the test for TRT is usually not of interest; the CONTRAST result for TRT at a given value of INDEX supersedes the TRT test. The INDEX*TRT test tells you whether the expected responses vary linearly with the location index for the three treatments. You could construct a similar CONTRAST defined on INDEX*TRT to test the equality of the βk’s, which you interpret as a linear location index×treatment interaction.
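For example, a hedged sketch of such a statement, testing β1 = β2 = β3 (with the interaction written as it appears in the MODEL statement):

contrast 'equal slopes' trt*index 1 -1  0,
                        trt*index 1  0 -1;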

As discussed earlier, the CONTRAST result tests treatment effects at a location index of 45.2. The “Least Squares Means” and “Differences of Least Squares Means” output are computed for the mean INDEX value determined by the LS means algorithm. You could vary the LS means and differences using the AT option to see what happens in different environments. For example, the following statements compute LS means for INDEX values of 30.9 (roughly the lowest INDEX for any location observed) and 57.9 (roughly the highest INDEX among all locations). For completeness, the LSMEANS statement with the default AT MEANS option is also shown. The SAS statements are

lsmeans trt/at index=30.9 diff;
lsmeans trt/at means diff;
lsmeans trt/at index=57.9 diff;

The results appear in Output 11.38.

Output 11.38 LS Means and Differences Computed at Various Location Indices

Least Squares Means

 

Effect trt index Estimate Standard
Error
DF t Value Pr > |t|
 
trt 1 30.90 32.0094 1.7935 36.6 17.85 <.0001
trt 2 30.90 36.3064 1.7935 36.6 20.24 <.0001
trt 3 30.90 24.3842 1.7935 36.6 13.60 <.0001
trt 1 45.20 41.0822 0.8483 19.2 48.43 <.0001
trt 2 45.20 45.2182 0.8483 19.2 53.31 <.0001
trt 3 45.20 49.2975 0.8483 19.2 58.12 <.0001
trt 1 57.90 49.1407 1.6936 19.6 29.02 <.0001
trt 2 57.90 53.1338 1.6936 19.6 31.37 <.0001
trt 3 57.90 71.4255 1.6936 19.6 42.17 <.0001

 

Differences of Least Squares Means

 

Effect trt _trt index  Estimate Standard
Error
DF t Value Pr > |t|
 
trt 1 2 30.90 -4.2970 2.5363 36.6 -1.69 0.0987
trt 1 3 30.90 7.6252 2.5363 36.6 3.01 0.0048
trt 2 3 30.90 11.9221 2.5363 36.6 4.70 <.0001
trt 1 2 45.20 -4.1360 1.1996 19.2 -3.45 0.0027
trt 1 3 45.20 -8.2153 1.1996 19.2 -6.85 <.0001
trt 2 3 45.20 -4.0793 1.1996 19.2 -3.40 0.0030
trt 1 2 57.90 -3.9930 2.3951 19.6 -1.67 0.1114
trt 1 3 57.90 -22.2848 2.3951 19.6 -9.30 <.0001
trt 2 3 57.90 -18.2917 2.3951 19.6 -7.64 <.0001

You can see that in locations with the lowest index, or mean response, the expected response of treatment 3 is considerably lower than that for the other two treatments. This is consistent with what was observed in locations 2 and 6 (see Output 11.31). On the other hand, in locations with the highest mean, the expected response of treatment 3 exceeds that of the other two treatments by a considerable margin, as was observed in locations 4, 7, and 8.

As a final note, this analysis suggests that when there are strong location-specific effects, the argument over weighted versus unweighted means (that is, MEANS versus LS means) from the fixed- and random-location models with interaction in Sections 11.8.2 and 11.8.3 is probably moot. Evaluating changes in treatment response at different locations and trying to understand why they occur is usually more to the point.

11.9 Absorbing Nesting Effects

Nested effects can produce a very large number of dummy variables in models and challenge the capacity of computers. This problem is trivial with modern computers compared with only a few years ago, but it is still an issue. A methodology called absorption greatly reduces the size of the problem by eliminating the need to obtain an explicit solution to the complete set of normal equations. In most applications, nested effects are random, and their estimates might not be required.

Absorption reduces the number of normal equations by eliminating the parameters for one factor from the system before a solution is obtained. This is analogous to the method of solving a set of three equations in three unknowns, x1, x2, and x3. Suppose you combine the first and second equations and eliminate x3. Next, combine the first and third equations and eliminate x3. Then, with two equations left involving x1 and x2 (the variable x3 having been absorbed), solve the reduced set for x1 and x2.
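In matrix terms, a sketch of the same idea: partition the normal equations into the retained block (b1) and the block to be absorbed (b2),

\begin{bmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{bmatrix}\begin{bmatrix} b_1\\ b_2\end{bmatrix} = \begin{bmatrix} c_1\\ c_2\end{bmatrix} \quad\Longrightarrow\quad \left(A_{11} - A_{12}A_{22}^{-}A_{21}\right)b_1 = c_1 - A_{12}A_{22}^{-}c_2,

where A22⁻ is a generalized inverse. Only the reduced system in b1 is solved, which is why sums of squares and parameter estimates for the absorbed effects are not produced.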

The use of the ABSORB statement is illustrated with data on 65 steers from Harvey (1975). Several values are recorded for each steer, including line number (LINE), sire number (SIRE), age of dam (AGEDAM), steer age (AGE), initial weight (INTLWT), and the dependent variable, average daily gain (AVDLYGN). Output 11.39 shows the data.

Output 11.39 Data Set Sires

Obs line sire agedam steerno age intlwt avdlygn
 
1 1 1 3 1 192 390 2.24
2 1 1 3 2 154 403 2.65
3 1 1 4 3 185 432 2.41
4 1 1 4 4 193 457 2.25
5 1 1 5 5 186 483 2.58
6 1 1 5 6 177 469 2.67
7 1 1 5 7 177 428 2.71
8 1 1 5 8 163 439 2.47
9 1 2 4 9 188 439 2.29
10 1 2 4 10 178 407 2.26
11 1 2 5 11 198 498 1.97
12 1 2 5 12 193 459 2.14
13 1 2 5 13 186 459 2.44
14 1 2 5 14 175 375 2.52
15 1 2 5 15 171 382 1.72
16 1 2 5 16 168 417 2.75
17 1 3 3 17 154 389 2.38
18 1 3 4 18 184 414 2.46
19 1 3 5 19 174 483 2.29
20 1 3 5 20 170 430 2.30
21 1 3 5 21 169 443 2.94
22 2 4 3 22 158 381 2.50
23 2 4 3 23 158 365 2.44
24 2 4 4 24 169 386 2.44
25 2 4 4 25 144 339 2.15
26 2 4 5 26 159 419 2.54
27 2 4 5 27 152 469 2.74
28 2 4 5 28 149 379 2.50
29 2 4 5 29 149 375 2.54
30 2 5 3 30 189 395 2.65
31 2 5 4 31 187 447 2.52
32 2 5 4 32 165 430 2.67
33 2 5 5 33 181 453 2.79
34 2 5 5 34 177 385 2.33
35 2 5 5 35 151 414 2.67
36 2 5 5 36 147 353 2.69
37 3 6 4 37 184 411 3.00
38 3 6 4 38 184 420 2.49
39 3 6 5 39 187 427 2.25
40 3 6 5 40 184 409 2.49
41 3 6 5 41 183 337 2.02
42 3 6 5 42 177 352 2.31
43 3 7 3 43 205 472 2.57
44 3 7 3 44 193 340 2.37
45 3 7 4 45 162 375 2.64
46 3 7 5 46 206 451 2.37
47 3 7 5 47 205 472 2.22
48 3 7 5 48 187 402 1.90
49 3 7 5 49 178 464 2.61
50 3 7 5 50 175 414 2.13
51 3 8 3 51 200 466 2.16
52 3 8 3 52 184 356 2.33
53 3 8 3 53 175 449 2.52
54 3 8 4 54 178 360 2.45
55 3 8 5 55 189 385 1.44
56 3 8 5 56 184 431 1.72
57 3 8 5 57 183 401 2.17
58 3 9 3 58 166 404 2.68
59 3 9 4 59 187 482 2.43
60 3 9 4 60 186 350 2.36
61 3 9 4 61 184 483 2.44
62 3 9 5 62 180 425 2.66
63 3 9 5 63 177 420 2.46
64 3 9 5 64 175 440 2.52
65 3 9 5 65 164 405 2.42

The analysis, as performed by Harvey, can be obtained directly by PROC GLM with the following SAS statements:

proc glm;
   class line sire agedam;
   model avdlygn=line sire(line) agedam
         line*agedam age intlwt / solution ss3;
   test h=line e=sire(line);

The results appear in Output 11.40.

Output 11.40 Complete Analysis of Variance

The GLM Procedure

 

Dependent Variable: avdlygn

 

Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 16 2.52745871 0.15796617 3.14 0.0011
 
Error 48 2.41191667 0.05024826    
 
Corrected Total 64 4.93937538      
 
R-Square Coeff Var Root MSE avdlygn Mean
 
0.511696 9.295956 0.224161 2.411385
 
Source DF Type III SS Mean Square F Value Pr > F
 
line 2 0.13620255 0.06810128 1.36 0.2676
sire(line) 6 0.97388905 0.16231484 3.23 0.0095
agedam 2 0.13010623 0.06505311 1.29 0.2834
line*agedam 4 0.45343434 0.11335859 2.26 0.0768
age 1 0.38127612 0.38127612 7.59 0.0083
intlwt 1 0.26970425 0.26970425 5.37 0.0248

 

Tests of Hypotheses Using the Type III MS for sire(line) as an Error Term

 
Source DF Type III SS Mean Square F Value Pr > F
 
line 2 0.13620255 0.06810128 0.42 0.6752
Parameter Estimate Standard
Error
t Value Pr > |t|
 
Intercept 2.996269167 B 0.51285394 5.84 <.0001
line         1 0.071824656 B 0.14550628 0.49 0.6238
line         2 0.252468579 B 0.13716655 1.84 0.0719
line         3 0.000000000 B  ⋅          ⋅    ⋅    
sire(line) 1 1 0.085729012 B 0.13027803 0.66 0.5137
sire(line) 2 1 -0.121705157 B 0.13622078 -0.89 0.3761
sire(line) 3 1 0.000000000 B  ⋅          ⋅    ⋅    
sire(line) 4 2 -0.244601122 B 0.12669287 -1.93 0.0594
sire(line) 5 2 0.000000000 B  ⋅          ⋅    ⋅    
sire(line) 6 3 0.105395737 B 0.12908764 0.82 0.4183
sire(line) 7 3 -0.019520926 B 0.12037674 -0.16 0.8719
sire(line) 8 3 -0.330235387 B 0.12566795 -2.63 0.0115
sire(line) 9 3 0.000000000 B  ⋅          ⋅    ⋅    
agedam     3 0.370387027 B 0.11455814 3.23 0.0022
agedam     4 0.275459487 B 0.10377628 2.65 0.0107
agedam     5 0.000000000 B  ⋅          ⋅    ⋅    
line*agedam 1 3 -0.448936131 B 0.19581259 -2.29 0.0263
line*agedam 1 4 -0.282831924 B 0.16085047 -1.76 0.0851
line*agedam 1 5 0.000000000 B  ⋅          ⋅    ⋅    
line*agedam 2 3 -0.260782670 B 0.19528690 -1.34 0.1880
line*agedam 2 4 -0.350258133 B 0.17438656 -2.01 0.0502
line*agedam 2 5 0.000000000 B  ⋅          ⋅    ⋅    
line*agedam 3 3 0.000000000 B  ⋅          ⋅    ⋅    
line*agedam 3 4 0.000000000 B  ⋅          ⋅    ⋅    
line*agedam 3 5 0.000000000 B  ⋅          ⋅    ⋅    
age -0.008530438   0.00309679 -2.75 0.0083
intlwt 0.002026334   0.00087464 2.32 0.0248

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

 

The factor AGEDAM is treated as a discrete variable with levels (3, 4, and ≥5). The denominator for the F-test for testing LINE is SIRE(LINE).

To introduce the ABSORB statement, Harvey's model has been simplified. All sources of variation except the main effects of SIRE and AGEDAM have been disregarded. For the abbreviated model, the following SAS statements give the desired analysis:

proc glm;
   class sire agedam;
   model avdlygn=sire agedam / solution ss1 ss2 ss3;

The results appear in Output 11.41.

Output 11.41 Abbreviated Least-Squares Analysis of Variance

The GLM Procedure

 

Dependent Variable: avdlygn

 
Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 10 1.42537863 0.14253786 2.19 0.0324
 
Error 54 3.51399676 0.06507401    
 
Corrected Total 64 4.93937538      
 
R-Square Coeff Var Root MSE avdlygn Mean
0.288575 10.57882 0.255096 2.411385
 
Source DF Type I SS Mean Square F Value Pr > F
 
sire 8 1.30643634 0.16330454 2.51 0.0214
agedam 2 0.11894229 0.05947115 0.91 0.4071
 
Source DF Type II SS Mean Square F Value Pr > F
 
agedam 2 0.11894229 0.05947115 0.91 0.4071
 
Source DF Type III SS Mean Square F Value Pr > F
 
agedam 2 0.11894229 0.05947115 0.91 0.4071

 

Parameter Estimate Standard
Error
t Value Pr > |t|
 
agedam     3 0.1173825552 B 0.08911680 1.32 0.1933
agedam     4 0.0482979994 B 0.07715379 0.63 0.5340
agedam     5 0.0000000000 B  .           .    .    

 

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

If the number of sires were large, then this analysis would be expensive. However, because there is little concern for the actual estimates of the effects of SIRE, considerable expense can be avoided by using the ABSORB statement:

proc glm;
   absorb sire;
   class agedam;
   model avdlygn=agedam / solution ss1 ss2 ss3;

The results appear in Output 11.42.

Output 11.42 Abbreviated Least-Squares Analysis of Variance Using the ABSORB Statement

The GLM Procedure

 

Dependent Variable: avdlygn

 
Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 10 1.42537863 0.14253786 2.19 0.0324
 
Error 54 3.51399676 0.06507401    
 
Corrected Total 64 4.93937538      
 
R-Square Coeff Var Root MSE avdlygn Mean
 
0.288575 10.57882 0.255096 2.411385
 
Source DF Type I SS Mean Square F Value Pr > F
 
sire 8 1.30643634 0.16330454 2.51 0.0214
agedam 2 0.11894229 0.05947115 0.91 0.4071
 
Source DF Type II SS Mean Square F Value Pr > F
 
agedam 2 0.11894229 0.05947115 0.91 0.4071
 
Source DF Type III SS Mean Square F Value Pr > F
 
agedam 2 0.11894229 0.05947115 0.91 0.4071
 
Parameter Estimate Standard
Error
t Value Pr > |t|
 
agedam     3 0.1173825552 B 0.08911680 1.32 0.1933
agedam     4 0.0482979994 B 0.07715379 0.63 0.5340
agedam     5 0.0000000000 B  .           .    .    

 

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable

The results in Output 11.41 and Output 11.42 are the same except that the SIRE sums of squares and the SIRE parameter estimates are not printed when SIRE is absorbed. (Note: Type I sums of squares for absorbed effects are computed as nested effects.)

Output 11.40 and Output 11.41 include results for the following statements:

contrast 'young vs old' agedam .5 .5 -1;
estimate 'young vs old' agedam .5 .5 -1;

The output illustrates that the CONTRAST and ESTIMATE statements are legitimate with the ABSORB statement as long as the coefficients of the linear function do not involve absorbed effects—that is, parameter estimates that are not printed (in this case the SIRE parameter estimates). The following ESTIMATE statement would not be legitimate when SIRE is absorbed because it involves the SIRE parameters:

estimate 'oldmean' sire .111111 ... .111111 agedam 1;

For the same reason, the LSMEANS statement for SIRE is not legitimate with the ABSORB statement.

The ABSORB statement is now applied to the full analysis as given by Harvey (see Output 11.40). If the sums of squares for LINE and SIRE(LINE) are not required, the remaining sums of squares can be obtained with the following statements:

proc glm;
   absorb line sire;
   class line agedam;
   model avdlygn=agedam line*agedam age
         intlwt / solution ss3;

Output 11.43 contains the output, which is identical to Harvey’s original results (see Output 11.40) except that neither the sums of squares nor the parameter estimates for LINE and SIRE(LINE) are computed when LINE and SIRE are absorbed.

Output 11.43 Complete Least-Squares Analysis of Variance Using the ABSORB Statement

The GLM Procedure

 

Dependent Variable: avdlygn

 
Source DF Sum of
Squares
Mean Square F Value Pr > F
 
Model 16 2.52745871 0.15796617 3.14 0.0011
 
Error 48 2.41191667 0.05024826    
 
Corrected Total 64 4.93937538      
 
R-Square Coeff Var Root MSE avdlygn Mean
 
0.511696 9.295956 0.224161 2.411385
 
Source DF Type III SS Mean Square F Value Pr > F
 
agedam 2 0.13010623 0.06505311 1.29 0.2834
line*agedam 4 0.45343434 0.11335859 2.26 0.0768
age 1 0.38127612 0.38127612 7.59 0.0083
intlwt 1 0.26970425 0.26970425 5.37 0.0248
 
Contrast DF Contrast SS Mean Square F Value Pr > F
 
young vs old 1 0.11895160 0.11895160 2.37 0.1305
Parameter Estimate Standard
Error
t Value Pr > |t|
 
young vs old 0.09912178 0.06442352 1.54 0.1305
 
Parameter Estimate Standard
Error
t Value Pr > |t|
 
agedam      3 0.3703870271 B 0.11455814 3.23 0.0022
agedam      4 0.2754594872 B 0.10377628 2.65 0.0107
agedam      5 0.0000000000 B  .           .    .    
line*agedam 1 3 -.4489361310 B 0.19581259 -2.29 0.0263
line*agedam 1 4 -.2828319237 B 0.16085047 -1.76 0.0851
line*agedam 1 5 0.0000000000 B  .           .    .    
line*agedam 2 3 -.2607826701 B 0.19528690 -1.34 0.1880
line*agedam 2 4 -.3502581329 B 0.17438656 -2.01 0.0502
line*agedam 2 5 0.0000000000 B  .           .    .    
line*agedam 3 3 0.0000000000 B  .           .    .    
line*agedam 3 4 0.0000000000 B  .           .    .    
line*agedam 3 5 0.0000000000 B  .           .    .    
age -.0085304380   0.00309679 -2.75 0.0083
intlwt 0.0020263340   0.00087464 2.32 0.0248

We conclude this section by running a mixed-model analysis using PROC MIXED in which we consider sires to be random. This analysis might be considered preferable to the analysis of variance using PROC GLM because it truly treats sires as random. In some more complicated situations with very large data sets, however, PROC MIXED might overwhelm the computer.

The appropriate statements are

proc mixed data=sires;
   class line sire agedam;
   model avdlygn=line agedam line*agedam age intlwt/
   ddfm=satterthwaite;
   random sire(line);
   contrast 'young vs old' agedam .5 .5 -1;
   estimate 'young vs old' agedam .5 .5 -1;
run;

Edited results are shown in Output 11.44.

Output 11.44 A Mixed-Model Analysis

The Mixed Procedure

 

Model Information

 

Data Set   WORK.SIRES
Dependent Variable   avdlygn
Covariance Structure   Variance Components
Estimation Method   REML
Residual Variance Method   Profile
Fixed Effects SE Method   Model-Based
Degrees of Freedom Method   Satterthwaite

 

Class Level Information

 

Class Levels     Values
line 3     1 2 3
sire 9     1 2 3 4 5 6 7 8 9
agedam 3     3 4 5

 

Covariance Parameter
Estimates

 

Cov Parm Estimate
 
sire(line) 0.01792
Residual 0.05028

 

Type 3 Tests of Fixed Effects

 

Effect Num
DF
Den
DF
 F Value  Pr > F
 
line 2 7.2 0.43 0.6687
agedam 2 50 1.21 0.3068
line*agedam 4 49.5 2.00 0.1095
age 1 53.8 7.55 0.0082
intlwt 1 51.6 5.92 0.0185

 

Estimates

 

Label Estimate Standard
Error
DF t Value Pr > |t|
 
young vs old 0.09581 0.06348 50.6 1.51 0.1374

 

Contrasts

 

Label Num
DF
Den
DF
F Value Pr > F
 
young vs old 1 50.6 2.28 0.1374

As you have seen with other examples, the mixed-model analysis provides estimates and tests that use appropriate error terms, at least in principle. The “experimental unit” for LINE is SIRE(LINE), and the table for “Type 3 Tests of Fixed Effects” reflects this, with two numerator DF and 7.2 denominator DF. The other tests have approximately 50 denominator DF, which is essentially the same as the 48 DF for residual error in Output 11.40. Steer is the “experimental unit” for AGEDAM, LINE*AGEDAM, AGE, and INTLWT. Significance probabilities in Outputs 11.40 and 11.44 agree, for practical purposes, for these effects, but it is worth noting that the ANOVA tests in Output 11.40 are exact F-tests, whereas the mixed-model tests in Output 11.44 are approximate due to estimating the covariance parameters. This illustrates a basic lesson: if you are interested only in the effects AGEDAM, LINE*AGEDAM, AGE, and INTLWT, then there is really no benefit in using mixed-model methodology.
