A lack-of-fit analysis can provide information on the adequacy of a model that does not include all possible terms. The basic principle is to compare the fits of “complete” and “reduced” models as described in Chapter 2. The sum of squares for the complete model can be obtained from a one-way analysis of variance that computes the sums of squares among all unique treatments. The error sum of squares for the full model is subtracted from the error sum of squares for the incompletely specified model to obtain the sum of squares for all terms not specified.
One of the most common applications of lack-of-fit analysis is testing the adequacy of a regression model. In this procedure, you want to determine if a fitted model accounts for essentially all of the variation in a response variable due to differences between the levels of a quantitative independent variable. For example, consider an experiment in which chickens were fed a form of dietary copper to relate the copper uptake in the liver to copper intake. The chickens were fed a basal diet of 11 ppm copper sulfate, plus a supplemental rate of 0, 150, 300, or 450 ppm. There were six chickens in each of the four treatment groups. The data for this experiment are shown in the SAS data set LIVCU printed in Output 11.25. The variable LOGLIVCU is the logarithm (base 10) of the copper in the livers of the chickens.
Output 11.25 Data for a Lack-of-Fit Analysis
Liver Copper in Poultry Fed Sulfate or Lysine Source
Obs | level | lackofit | loglivcu |
1 | 0 | 0 | 1.16761 |
2 | 0 | 0 | 1.25789 |
3 | 0 | 0 | 1.27312 |
4 | 0 | 0 | 1.09688 |
5 | 0 | 0 | 1.26881 |
6 | 0 | 0 | 1.24391 |
7 | 150 | 150 | 1.38957 |
8 | 150 | 150 | 1.46716 |
9 | 150 | 150 | 1.51402 |
10 | 150 | 150 | 1.30969 |
11 | 150 | 150 | 1.24596 |
12 | 150 | 150 | 1.37160 |
13 | 300 | 300 | 1.99269 |
14 | 300 | 300 | 2.19897 |
15 | 300 | 300 | 2.14038 |
16 | 300 | 300 | 1.83695 |
17 | 300 | 300 | 1.97164 |
18 | 300 | 300 | 2.11470 |
19 | 450 | 450 | 2.41911 |
20 | 450 | 450 | 2.34434 |
21 | 450 | 450 | 2.15644 |
22 | 450 | 450 | 2.32868 |
23 | 450 | 450 | 2.46058 |
24 | 450 | 450 | 2.43342 |
The variable LACKOFIT was defined to be equal to LEVEL in the DATA step. Its purpose will become apparent.
We want to fit a linear regression of LOGLIVCU on LEVEL and determine if the linear equation adequately models the response of LOGLIVCU to LEVEL. There are three degrees of freedom for differences between treatments. The linear regression accounts for one of those. The other two account for lack of fit of the linear regression, plus random error. The challenge is to determine how much is lack of fit.
Within each treatment group there are six observations. Variation between these observations, within a group, measures random variation. This is sometimes called “pure error.” Pooled across treatment groups, there are 20 DF for pure error. The analysis of variance is
Source | DF |
---|---|
LEVEL | 1 |
Lack of Fit | 2 |
Pure Error | 20 |
This ANOVA can be obtained by the Type I sums of squares that are provided by the following statements:
proc glm;
class lackofit;
model loglivcu=level lackofit / ss1;
run;
The term LEVEL in the MODEL statement to the right of the equal sign is the usual linear effect of LEVEL. The LACKOFIT variable measures the variation in LOGLIVCU due to treatment that is not accounted for by the linear regression. Specifying the variable LACKOFIT, which has four levels, in the CLASS statement causes the generation of four dummy variables. If LACKOFIT were the only variable in the MODEL statement, then it would account for 3 DF. One of these is confounded with the linear effect. Preceding LACKOFIT by LEVEL leaves only 2 DF for LACKOFIT in the Type I sums of squares. It is important to precede LACKOFIT by LEVEL, because if LACKOFIT preceded LEVEL, then all 3 DF would go to LACKOFIT and 0 DF would go to LEVEL. Note, however, that all other types of sums of squares (Type II, III, and IV) for LEVEL would be zero, because LEVEL lies in the column space of the LACKOFIT dummy variables.
Results of the preceding statements appear in Output 11.26.
Output 11.26 A Lack-of-Fit Analysis
The GLM Procedure
Dependent Variable: loglivcu
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
---|---|---|---|---|---
Model | 3 | 5.23096460 | 1.74365487 | 155.47 | <.0001
Error | 20 | 0.22430992 | 0.01121550 | |
Corrected Total | 23 | 5.45527452 | | |

R-Square | Coeff Var | Root MSE | loglivcu Mean
---|---|---|---
0.958882 | 6.051018 | 0.105903 | 1.750172

Source | DF | Type I SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
level | 1 | 4.98592403 | 4.98592403 | 444.56 | <.0001
lackofit | 2 | 0.24504057 | 0.12252029 | 10.92 | 0.0006
The test for LACKOFIT is significant with p=0.0006. This indicates that the relation between LOGLIVCU and LEVEL is not linear. However, the analysis sheds no light on the true relationship.
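The arithmetic behind this decomposition can be checked outside SAS. The following Python sketch (an illustration, not part of the SAS analysis) computes the pure-error, among-treatment, linear-regression, and lack-of-fit sums of squares directly from the LIVCU data in Output 11.25; small discrepancies from the printed values reflect only the rounding of the printed data:

```python
import numpy as np

# LOGLIVCU by supplemental copper LEVEL (ppm), transcribed from Output 11.25
groups = {
    0:   [1.16761, 1.25789, 1.27312, 1.09688, 1.26881, 1.24391],
    150: [1.38957, 1.46716, 1.51402, 1.30969, 1.24596, 1.37160],
    300: [1.99269, 2.19897, 2.14038, 1.83695, 1.97164, 2.11470],
    450: [2.41911, 2.34434, 2.15644, 2.32868, 2.46058, 2.43342],
}
x = np.concatenate([np.full(len(v), lev, dtype=float) for lev, v in groups.items()])
y = np.concatenate([np.asarray(v) for v in groups.values()])

# Pure error (20 DF): pooled within-treatment variation
ss_pure = sum(np.sum((np.asarray(v) - np.mean(v)) ** 2) for v in groups.values())

# Among-treatment SS (3 DF): the "complete" (cell-means) model
ss_trt = sum(len(v) * (np.mean(v) - y.mean()) ** 2 for v in groups.values())

# Linear-regression SS (1 DF): Sxy**2 / Sxx
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
ss_linear = sxy ** 2 / sxx

# Lack of fit (2 DF): among-treatment variation not explained by the line
ss_lof = ss_trt - ss_linear
f_lof = (ss_lof / 2) / (ss_pure / 20)
```

The computed values agree with Output 11.26: pure error 0.2243, among-treatment 5.2310, linear 4.9859, lack of fit 0.2450, and F approximately 10.92.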
You could test whether the relationship is quadratic with a similar analysis provided by the statements
proc glm;
class lackofit;
model loglivcu=level level*level lackofit / ss1;
run;
Output is not shown. There would be 1 DF for LEVEL, 1 DF for LEVEL*LEVEL, and the remaining 1 DF for LACKOFIT.
Now consider a three-factor factorial experiment with factors A, B, and C. Suppose you specify an incomplete model that omits the B*C and A*B*C interactions, and you want to test for lack of fit of this model. Run the following statements:
proc glm;
class a b c;
model y=a b a*b c a*c;
run;
The difference between the error sum of squares that you obtain and the error sum of squares from a between-cell analysis of variance provides the additional sum of squares due to both B*C and A*B*C.
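The same principle can be illustrated with a small simulated example (hypothetical data, not from the text): in a balanced 2×2×2 factorial, the drop in error SS between the reduced model and the full cell-means model carries exactly the 2 DF for B*C and A*B*C, and equals the sum of the two orthogonal contrast sums of squares.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical balanced 2x2x2 factorial, 2 replicates per cell (simulated data)
cells = [(a, b, c) for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)]
A, B, C, y = (np.array(v) for v in zip(*[
    (a, b, c, 10 + 2*a + b + 0.5*a*b + c + 1.5*b*c + rng.normal())
    for (a, b, c) in cells for _ in range(2)]))

def sse(X):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

one = np.ones_like(y)
# Reduced model omits B*C and A*B*C
X_red = np.column_stack([one, A, B, A*B, C, A*C])
# Adding the two omitted contrasts spans the full cell-means model
X_full = np.column_stack([X_red, B*C, A*B*C])

# Lack-of-fit SS for the omitted terms (2 DF)
ss_lackfit = sse(X_red) - sse(X_full)

# With balanced data this equals the two orthogonal contrast SS
ss_bc = (B*C @ y) ** 2 / len(y)
ss_abc = (A*B*C @ y) ** 2 / len(y)
```

With unbalanced data the contrast columns are no longer orthogonal and only the difference in error sums of squares gives the correct combined SS, which is why the subtraction approach in the text is the general one.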
You can get the between-cell analysis of variance from PROC GLM (or PROC ANOVA). If the CLASS variables are integers, a single variable can be generated to represent all cell combinations. Assume, for example, that the values of the three CLASS variables (A, B, C) are integers between 1 and 10. The following assignment statement, placed in the DATA step, constructs the composite classification variable:
group=100*a+10*b+c;
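A quick Python check (illustration only, not SAS) confirms that this composite code is unique for integer levels from 1 through 10, and shows how it fails for wider ranges:

```python
# The code is unique for levels 1-10 because the difference between two codes,
# 100*(a-a') = 10*(b'-b) + (c'-c), cannot reach 100 when levels stay in 1..10
codes = {100*a + 10*b + c
         for a in range(1, 11) for b in range(1, 11) for c in range(1, 11)}
assert len(codes) == 10 ** 3   # one distinct code per (a, b, c) cell

# With levels up to 11, collisions appear: (a, b, c) = (1, 11, 1)
# and (2, 1, 1) both code to 211
assert 100*1 + 10*11 + 1 == 100*2 + 10*1 + 1 == 211
```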
The following SAS statements provide the desired error sums of squares:
proc glm;
class group;
model y=group;
run;
The CLASS variables may not conform to these specifications. Character values can be concatenated, but the resulting single classification variable may exceed eight characters. Alternatively, the following statements compute the sum of squares for differences between the cells:
proc glm;
class a b c;
model y=a*b*c;
run;
Nested structure concerns samples within samples, as discussed in Section 4.2, “Nested Classifications.” An example is treatments applied to plants in pots. There are several pots per treatment and several plants per pot. The pots do not necessarily have the same number of plants, and there may be different numbers of pots per treatment. Another example of nested structure occurs in sample surveys in which households are sampled within blocks, blocks are sampled within precincts, precincts are sampled within cities, and so on. In many cases, PROC NESTED is adequate for such analyses. Some applications, however, require PROC GLM or PROC MIXED. This section addresses some basic issues of computing means with unbalanced data and random effects. Section 11.8 addresses these issues in a more complex setting.
Data containing a nested structure often have both fixed and random components, which raises questions about proper error terms. Consider an experiment with t treatments (TRT) applied randomly to a number of pots (POT), each containing several plants (PLANT). The data appear in Output 11.27.
Output 11.27 Data from an Unbalanced Nested Classification
Unbalanced Nested Structure
Obs | TRT | POT | PLANT | Y |
1 | 1 | 1 | 1 | 15 |
2 | 1 | 1 | 2 | 13 |
3 | 1 | 1 | 3 | 16 |
4 | 1 | 2 | 1 | 17 |
5 | 1 | 2 | 2 | 19 |
6 | 1 | 3 | 1 | 12 |
7 | 2 | 1 | 1 | 20 |
8 | 2 | 1 | 2 | 21 |
9 | 2 | 2 | 1 | 20 |
10 | 2 | 2 | 2 | 23 |
11 | 2 | 2 | 3 | 19 |
12 | 2 | 2 | 4 | 19 |
13 | 3 | 1 | 1 | 12 |
14 | 3 | 1 | 2 | 13 |
15 | 3 | 1 | 3 | 14 |
16 | 3 | 2 | 1 | 11 |
17 | 3 | 3 | 1 | 12 |
18 | 3 | 3 | 2 | 13 |
19 | 3 | 3 | 3 | 15 |
20 | 3 | 3 | 4 | 11 |
21 | 3 | 3 | 5 | 9 |
The model is
yijk = μ + λi + ρij + εijk
where
yijk | is the observed response in the kth PLANT of the jth POT in the ith TREATMENT. |
μ | is the overall mean response. |
λi | is the effect of the ith TREATMENT. |
ρij | is the effect of the jth POT within the ith TREATMENT. |
εijk | is the effect of the kth individual PLANT in the jth POT of the ith TREATMENT. This effect is usually considered to be the random error. |
You are primarily interested in tests and estimates related to the TREATMENTs as well as the variation among POTs and PLANTs. The analysis is implemented by the following SAS statements:
proc glm;
class trt pot;
model y=trt pot(trt) / ss1 ss3;
means trt pot(trt);
lsmeans trt pot(trt);
run;
Most items in these statements are similar to those from previous examples. The analysis of variance appears in Output 11.28.
Output 11.28 Types I and III ANOVA for an Unbalanced Nested Classification
The GLM Procedure
Dependent Variable: Y
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
---|---|---|---|---|---
Model | 7 | 267.2261905 | 38.1751701 | 12.43 | <.0001
Error | 13 | 39.9166667 | 3.0705128 | |
Corrected Total | 20 | 307.1428571 | | |

R-Square | Coeff Var | Root MSE | Y Mean
---|---|---|---
0.870039 | 11.35742 | 1.752288 | 15.42857

Source | DF | Type I SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
TRT | 2 | 236.9206349 | 118.4603175 | 38.58 | <.0001
POT(TRT) | 5 | 30.3055556 | 6.0611111 | 1.97 | 0.1499

Source | DF | Type III SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
TRT | 2 | 200.1109726 | 100.0554863 | 32.59 | <.0001
POT(TRT) | 5 | 30.3055556 | 6.0611111 | 1.97 | 0.1499
The analysis of variance has the same form as given in Output 4.4. Note, however, that there is a slight difference in the Type I and Type III sums of squares of TRT in Output 11.28, due to the unbalanced structure of the data. This difference is made clearer by noting the differences between the unadjusted means from the MEANS statement and the adjusted or least-squares means from the LSMEANS statement. Table 11.2 shows the differences.
The means and least-squares means are shown in Output 11.29.
Output 11.29 Means and Least-Squares Means from the GLM Procedure
The GLM Procedure
Level of TRT | N | Mean (Y) | Std Dev (Y)
---|---|---|---
1 | 6 | 15.3333333 | 2.58198890
2 | 6 | 20.3333333 | 1.50554531
3 | 9 | 12.2222222 | 1.78730088

Level of POT | Level of TRT | N | Mean (Y) | Std Dev (Y)
---|---|---|---|---
1 | 1 | 3 | 14.6666667 | 1.52752523
2 | 1 | 2 | 18.0000000 | 1.41421356
3 | 1 | 1 | 12.0000000 | .
1 | 2 | 2 | 20.5000000 | 0.70710678
2 | 2 | 4 | 20.2500000 | 1.89296945
1 | 3 | 3 | 13.0000000 | 1.00000000
2 | 3 | 1 | 11.0000000 | .
3 | 3 | 5 | 12.0000000 | 2.23606798
The GLM Procedure
Least Squares Means
TRT | Y LSMEAN
---|---
1 | 14.8888889
2 | 20.3750000
3 | 12.0000000

POT | TRT | Y LSMEAN
---|---|---
1 | 1 | 14.6666667
2 | 1 | 18.0000000
3 | 1 | 12.0000000
1 | 2 | 20.5000000
2 | 2 | 20.2500000
1 | 3 | 13.0000000
2 | 3 | 11.0000000
3 | 3 | 12.0000000
Table 11.2 Means and Least-Squares Means

TRT | POT | N | Mean | Least-Squares Mean
---|---|---|---|---
1 | | 6 | 15.333 | 14.889
2 | | 6 | 20.333 | 20.375
3 | | 9 | 12.222 | 12.000
1 | 1 | 3 | 14.667 | 14.667
 | 2 | 2 | 18.000 | 18.000
 | 3 | 1 | 12.000 | 12.000
2 | 1 | 2 | 20.500 | 20.500
 | 2 | 4 | 20.250 | 20.250
3 | 1 | 3 | 13.000 | 13.000
 | 2 | 1 | 11.000 | 11.000
 | 3 | 5 | 12.000 | 12.000
The values produced by the MEANS statement are the means of all observations in a TREATMENT. These are the weighted POT means, as shown in the following equation:
mean(i) = y̅i.. = (1/ni.) Σj nij y̅ij.
nij is the number of plants in POT j of TREATMENT i. On the other hand, the least-squares means are the unweighted POT means, as shown in the following equation:
least-squares mean(i) = (1/ki) Σj y̅ij.
ki is the number of pots in TREATMENT i.
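Both formulas can be checked against the data in Output 11.27. A minimal Python sketch (for illustration; SAS computes these via the MEANS and LSMEANS statements):

```python
# Plant responses by (TRT, POT), transcribed from Output 11.27
pots = {
    (1, 1): [15, 13, 16], (1, 2): [17, 19], (1, 3): [12],
    (2, 1): [20, 21], (2, 2): [20, 23, 19, 19],
    (3, 1): [12, 13, 14], (3, 2): [11], (3, 3): [12, 13, 15, 11, 9],
}
means = {}
for trt in (1, 2, 3):
    plants = [y for (t, _), v in pots.items() if t == trt for y in v]
    pot_means = [sum(v) / len(v) for (t, _), v in pots.items() if t == trt]
    means[trt] = (sum(plants) / len(plants),         # mean(i): weighted pot means
                  sum(pot_means) / len(pot_means))   # LS mean(i): unweighted
```

For TREATMENT 1 this gives 15.3333 versus 14.8889, reproducing the MEANS and LSMEANS values in Output 11.29; the two agree exactly only when every pot in a treatment has the same number of plants.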
Both of these types of means have specific uses. In sample surveys, particularly self-weighting samples, it is usually appropriate to use the ordinary means. For the present case, POTs would probably be considered a random effect (see Section 4.2.1, “Analysis of Variance for Nested Classifications”). In this event, the variance of least-squares mean(i) is less than the variance of mean(i) if σ²ρ is large relative to σ², and conversely, where σ²ρ = Var(ρij) and σ² = Var(εijk).
Multiple location studies, such as clinical trials conducted at several centers, or on-farm trials in agriculture, raise several linear model issues. These issues primarily involve mixed-model inference considerations introduced in Chapter 4, and linear model unbalanced data concepts discussed in Chapters 5 and 6. The analysis of multi-location data can be both confusing and controversial, partly because different kinds of multi-location studies call for different approaches and partly because there is disagreement within the statistics community on what methods are appropriate. This section presents an example multi-location data set and several alternative analyses using linear and mixed-model methods. The purpose of this section is not to prescribe, but simply to demonstrate the various linear model approaches and in the process frame the main linear model issues.
Output 11.30 contains data from a study to compare 3 treatments (TRT) conducted at 8 locations (LOC). At each location, a randomized complete-blocks design was used, but the number of blocks varied. Locations 1-4 used 3 blocks each, locations 5 and 6 used 6 blocks each, and locations 7 and 8 used 12 blocks each. In the interest of space, not all the data are shown in Output 11.30. However, Output 11.31 shows the response variable (Y) mean and number of observations (blocks) per treatment for each location.
Output 11.30 Multi-Location Data
Obs | loc | blk | trt | y |
1 | 1 | 1 | 1 | 46.6 |
2 | 1 | 1 | 2 | 46.4 |
3 | 1 | 1 | 3 | 44.4 |
4 | 1 | 2 | 1 | 43.7 |
5 | 1 | 2 | 2 | 43.6 |
6 | 1 | 2 | 3 | 31.4 |
7 | 1 | 3 | 1 | 37.9 |
8 | 1 | 3 | 2 | 39.5 |
9 | 1 | 3 | 3 | 48.2 |
10 | 2 | 1 | 1 | 34.0 |
. | ||||
. | ||||
. | ||||
124 | 8 | 6 | 1 | 43.5 |
125 | 8 | 6 | 2 | 52.1 |
126 | 8 | 6 | 3 | 61.4 |
127 | 8 | 7 | 1 | 44.1 |
128 | 8 | 7 | 2 | 54.8 |
129 | 8 | 7 | 3 | 59.9 |
130 | 8 | 8 | 1 | 43.3 |
131 | 8 | 8 | 2 | 49.4 |
132 | 8 | 8 | 3 | 63.0 |
133 | 8 | 9 | 1 | 44.2 |
134 | 8 | 9 | 2 | 54.6 |
135 | 8 | 9 | 3 | 64.8 |
136 | 8 | 10 | 1 | 54.6 |
137 | 8 | 10 | 2 | 56.6 |
138 | 8 | 10 | 3 | 64.6 |
139 | 8 | 11 | 1 | 52.1 |
140 | 8 | 11 | 2 | 44.3 |
141 | 8 | 11 | 3 | 59.7 |
142 | 8 | 12 | 1 | 44.9 |
143 | 8 | 12 | 2 | 43.3 |
144 | 8 | 12 | 3 | 65.0 |
Output 11.31 Mean Response for Each Treatment by Location
Obs | loc | trt | _FREQ_ | y_mean |
1 | 1 | 1 | 3 | 42.7333 |
2 | 1 | 2 | 3 | 43.1667 |
3 | 1 | 3 | 3 | 41.3333 |
4 | 2 | 1 | 3 | 33.5333 |
5 | 2 | 2 | 3 | 37.0000 |
6 | 2 | 3 | 3 | 22.2333 |
7 | 3 | 1 | 3 | 36.6667 |
8 | 3 | 2 | 3 | 43.4000 |
9 | 3 | 3 | 3 | 47.9000 |
10 | 4 | 1 | 3 | 47.7000 |
11 | 4 | 2 | 3 | 52.3000 |
12 | 4 | 3 | 3 | 73.7000 |
13 | 5 | 1 | 6 | 41.8000 |
14 | 5 | 2 | 6 | 45.9500 |
15 | 5 | 3 | 6 | 47.0000 |
16 | 6 | 1 | 6 | 33.9667 |
17 | 6 | 2 | 6 | 38.1667 |
18 | 6 | 3 | 6 | 30.2333 |
19 | 7 | 1 | 12 | 38.6417 |
20 | 7 | 2 | 12 | 44.1833 |
21 | 7 | 3 | 12 | 51.8500 |
22 | 8 | 1 | 12 | 47.5417 |
23 | 8 | 2 | 12 | 50.6500 |
24 | 8 | 3 | 12 | 60.5500 |
The variable _FREQ_ in Output 11.31 refers to the number of blocks in a given location.
These are the main controversies for the analysis of multi-location data:
❏ Should the location×treatment interaction be included in the model, or should it be assumed to be zero?
❏ Should locations be considered fixed or random?
❏ Should means be weighted by the number of observations per location, or equivalently, should Type I or Type III SS be used if nonzero location×treatment interactions are assumed?
❏ If random locations and hence random location×treatment interaction effects are assumed, how should location-specific treatment effects, if they arise, be handled?
The following analyses illustrate different approaches to these questions. These illustrations suggest advantages and disadvantages for each approach.
This approach assumes the model yijk = μ + Li + B(L)ij + τk + eijk, where Li, B(L)ij, and τk are the location, block-within-location, and treatment effects, respectively, and the random errors eijk are assumed i.i.d. N(0, σ²). Its rationale presumes that 1) the reason for having multiple locations is solely to provide a practical way to obtain an adequate sample size and 2) treatment effects are known with certainty not to be location-specific. You can use the following SAS statements to implement the analysis:
proc glm data=mloc;
class loc blk trt;
model y=trt loc blk(loc);
means trt;
lsmeans trt;
run;
Output 11.32 shows the results. Normally, you would place TRT last in the model. It is placed before LOC and BLK(LOC) here to illustrate a point. Though not shown here, in practice you would usually add CONTRAST statements or mean comparison options to the MEANS or LSMEANS statements to complete inference about treatment effects.
Output 11.32 An Analysis of Multi-Location Data Using the No LOC×TRT Model
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
---|---|---|---|---|---
Model | 49 | 10530.96049 | 214.91756 | 4.15 | <.0001
Error | 94 | 4869.34944 | 51.80159 | |
Corrected Total | 143 | 15400.30993 | | |

Source | DF | Type I SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
trt | 2 | 1641.777222 | 820.888611 | 15.85 | <.0001
loc | 7 | 7770.496597 | 1110.070942 | 21.43 | <.0001
blk(loc) | 40 | 1118.686667 | 27.967167 | 0.54 | 0.9846

Source | DF | Type III SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
trt | 2 | 1641.777222 | 820.888611 | 15.85 | <.0001
loc | 7 | 7770.496597 | 1110.070942 | 21.43 | <.0001
blk(loc) | 40 | 1118.686667 | 27.967167 | 0.54 | 0.9846
Level of trt | N | Mean (y) | Std Dev (y)
---|---|---|---
1 | 48 | 41.0562500 | 6.9066698
2 | 48 | 45.2145833 | 6.8837456
3 | 48 | 49.3270833 | 14.0586876

Least Squares Means

trt | y LSMEAN
---|---
1 | 39.6986111
2 | 43.8569444
3 | 47.9694444
You can see that the MEANS and LS means are different. This reflects a different weighting scheme for the Li effects: the MEANS weight them according to the number of observations per location, whereas the LS means weight them equally. Notice, however, that the differences among pairs of treatment MEANS and LS means are unaffected, and the Type I and Type III SS are identical. You get different estimates of means but identical estimates of treatment differences regardless of whether you use MEANS or LS means, because taking differences eliminates the weighting based on the number of observations per location. You can see this by using the E option with the LSMEANS statement (as shown in Chapter 6) to display the weighting scheme. The MEANS estimates themselves are thus disproportionately affected by the locations with the greatest number of observations (in this case, locations 7 and 8).
The main risk of using this analysis is that it is very sensitive to the assumption of no location-specific treatment effects. Even minor violations of this assumption can seriously affect the results when you use this model.
In many, perhaps most, multi-location studies, researchers are not prepared to assume no location×treatment interaction without at least testing the assumption. One approach is to modify the model from Section 11.8.1 by adding an interaction term, yielding the model equation yijk = μ + Li + B(L)ij + τk + (τL)ik + eijk, where (τL)ik denotes the location×treatment interaction. Use the following SAS statements to implement the analysis:
proc glm data=mloc;
class loc blk trt;
model y=loc blk(loc) trt loc*trt;
means trt;
lsmeans trt loc*trt/slice=loc;
run;
As with any factorial arrangement, the appropriate strategy for inference is first to test the location×treatment (LOC*TRT) interaction and then evaluate simple effects of treatment by location (for example, using the SLICE=LOC option) if the interaction is non-negligible, or otherwise, evaluate main effects. The results appear in Output 11.33.
Output 11.33 An Analysis of Multi-Location Data Using a Fixed-Location Model with an Interaction
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
---|---|---|---|---|---
Model | 63 | 13042.45660 | 207.02312 | 7.02 | <.0001
Error | 80 | 2357.85333 | 29.47317 | |
Corrected Total | 143 | 15400.30993 | | |

Source | DF | Type I SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
loc | 7 | 7770.496597 | 1110.070942 | 37.66 | <.0001
blk(loc) | 40 | 1118.686667 | 27.967167 | 0.95 | 0.5634
trt | 2 | 1641.777222 | 820.888611 | 27.85 | <.0001
loc*trt | 14 | 2511.496111 | 179.392579 | 6.09 | <.0001

Source | DF | Type III SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
loc | 7 | 7770.496597 | 1110.070942 | 37.66 | <.0001
blk(loc) | 40 | 1118.686667 | 27.967167 | 0.95 | 0.5634
trt | 2 | 757.254848 | 378.627424 | 12.85 | <.0001
loc*trt | 14 | 2511.496111 | 179.392579 | 6.09 | <.0001
Level of trt | N | Mean (y) | Std Dev (y)
---|---|---|---
1 | 48 | 41.0562500 | 6.9066698
2 | 48 | 45.2145833 | 6.8837456
3 | 48 | 49.3270833 | 14.0586876

Least Squares Means

trt | y LSMEAN
---|---
1 | 40.3229167
2 | 44.3520833
3 | 46.8500000
loc*trt Effect Sliced by loc for y
loc | DF | Sum of Squares | Mean Square | F Value | Pr > F
---|---|---|---|---|---
1 | 2 | 5.508889 | 2.754444 | 0.09 | 0.9109
2 | 2 | 357.762222 | 178.881111 | 6.07 | 0.0035
3 | 2 | 191.775556 | 95.887778 | 3.25 | 0.0438
4 | 2 | 1155.120000 | 577.560000 | 19.60 | <.0001
5 | 2 | 90.730000 | 45.365000 | 1.54 | 0.2208
6 | 2 | 189.031111 | 94.515556 | 3.21 | 0.0457
7 | 2 | 1055.791667 | 527.895833 | 17.91 | <.0001
8 | 2 | 1107.553889 | 553.776944 | 18.79 | <.0001
Output 11.33 reveals several points about the data. First, there is very strong evidence of a location×treatment interaction (F=6.09, p<0.0001). The SLICE output partially reveals the nature of the interaction: significant treatment effects were observed at locations 2, 3, 4, 6, 7, and 8, but not at locations 1 or 5. You could pursue this by computing the simple-effect comparisons among treatments at each location using the steps presented in Section 3.7.5, “Simple Effect Comparisons.” These comparisons are not shown, but you can inspect the treatment means by location in Output 11.31 to anticipate the results: in locations 2 and 6, the mean of treatment 3 is substantially lower than the means of treatments 1 and 2, whereas in locations 4, 7, and 8, and to a lesser extent location 3, the opposite is true.
If you do proceed with inference on main effects, despite the evidence of interaction, then you can see that the MEANS and their associated test using the Type I SS for TRT produce different results than the LS means and their associated test using Type III SS. This mainly results from the fact that the MEANS weight locations 7 and 8 more heavily relative to the other locations, whereas the LS means weight all locations equally. Thus, the large difference between treatment 3 and the others in locations 7 and 8 affects the MEANS to a much greater extent than the LS means.
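The two weighting schemes can be reproduced directly from the location-by-treatment means in Output 11.31. A Python sketch (for illustration; SAS produces these via MEANS and LSMEANS):

```python
# Block counts and treatment means by location, transcribed from Output 11.31
freq = [3, 3, 3, 3, 6, 6, 12, 12]   # blocks at locations 1-8
loc_means = {
    1: [42.7333, 33.5333, 36.6667, 47.7000, 41.8000, 33.9667, 38.6417, 47.5417],
    2: [43.1667, 37.0000, 43.4000, 52.3000, 45.9500, 38.1667, 44.1833, 50.6500],
    3: [41.3333, 22.2333, 47.9000, 73.7000, 47.0000, 30.2333, 51.8500, 60.5500],
}
means = {}
for trt, m in loc_means.items():
    weighted = sum(f * v for f, v in zip(freq, m)) / sum(freq)  # MEANS
    unweighted = sum(m) / len(m)                                # LS means
    means[trt] = (weighted, unweighted)
```

The weighted values reproduce the MEANS in Output 11.33 (41.06, 45.21, 49.33) and the unweighted values reproduce the LS means (40.32, 44.35, 46.85), confirming that the only difference between the two is how the locations are weighted.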
Recalling the discussion of MEANS and Type I SS versus LS means and Type III SS from Chapter 6, you would want to use the MEANS if the number of observations per location closely reflects the true proportion of the populations in the various locations. In other words, if locations 7 and 8, for example, are in communities whose populations are roughly four times the populations of locations 1 through 4, then the proportion of observations is representative. On the other hand, if the number of observations per location is mainly a sampling artifact, and does not reflect the actual size of the populations in each location, then the MEANS may seriously misrepresent the actual treatment effects. Note that if you decide to drop LOC*TRT from the model based on the test for interaction, your subsequent inference is implicitly based on the MEANS.
Keep in mind that the fixed-locations model with interaction makes two critical assumptions about the data. First, recalling the discussion of fixed-effects versus random-effects inference from Chapter 4, the fixed-locations model assumes that the observed locations are the entire population. The analysis neither measures nor recovers any information about distribution among locations. Second, fixed-locations analysis assumes that the only relevant source of uncertainty comes from variation among observations within locations, making MS(ERROR) an appropriate error term for testing TRT. If locations are meant to represent a larger population, this assumption is probably untrue and, as you will see in the next section, the tests for treatment shown in this section are incorrect and likely to be misleading. Assuming fixed locations when in fact the location and location×treatment effects represent probability distributions can produce severely inflated Type I error rates for the test of treatment effects. Therefore, you should use this model only when the locations in fact are the entire population of inference or when the locations are chosen to represent a second treatment factor associated with known characteristics of the location (for example, soil type or climatic zone in agricultural trials, or socioeconomic group in multi-center clinical trials).
In many multi-location studies, locations represent a larger target population. Implicitly, the goal of these studies is to apply inference beyond the observed locations to the entire population. Recalling the criteria for distinguishing fixed from random effects given in the introduction to Chapter 4, location effects are random when the locations actually observed represent a probability distribution of locations that could, in theory, have been sampled. In most multi-location studies, locations are not drawn from a true random sample, but again recalling the discussion in Chapter 4, this is usually a moot point. Location effects are random if the locations plausibly represent the population (and if they don’t, you should question either the study design or the use of the data to draw inference beyond the observed locations).
The model equation for random-location analysis is identical to the equation given in Section 11.8.2 for fixed-location analysis with interaction, but the assumptions are different: the location effects Li are assumed i.i.d. N(0, σ²L), and the location×treatment effects (τL)ik are assumed i.i.d. N(0, σ²TL). In addition, the block-within-location effects are assumed to be random as well. You can obtain the expected mean squares and the overall test for treatment effects using PROC GLM, but, as with other mixed-model examples discussed in previous chapters, the standard errors and tests of various treatment comparisons are wrong or awkward to obtain using PROC GLM. PROC MIXED is a better choice. Use the SAS statements
proc mixed method=type3;
class loc blk trt;
model y=trt / ddfm=kr;
random loc blk(loc) loc*trt;
lsmeans trt / diff;
run;
The METHOD=TYPE3 option is not necessary in practice; it is used here merely to show the expected mean squares, which appear in Output 11.34. The rest of the analysis appears in Output 11.35.
Output 11.34 An Analysis of Variance and Expected Mean Squares for a Random-Locations Analysis of Multi-Location Data
Type 3 Analysis of Variance

Source | DF | Sum of Squares | Mean Square
---|---|---|---
trt | 2 | 757.254848 | 378.627424
loc | 7 | 7770.496597 | 1110.070942
blk(loc) | 40 | 1118.686667 | 27.967167
loc*trt | 14 | 2511.496111 | 179.392579
Residual | 80 | 2357.853333 | 29.473167

Source | Expected Mean Square
---|---
trt | Var(Residual) + 4.3636 Var(loc*trt) + Q(trt)
loc | Var(Residual) + 5.6786 Var(loc*trt) + 3 Var(blk(loc)) + 17.036 Var(loc)
blk(loc) | Var(Residual) + 3 Var(blk(loc))
loc*trt | Var(Residual) + 5.6786 Var(loc*trt)
Residual | Var(Residual)
Note the coefficients of the LOC*TRT variance for the TRT main effect and the LOC*TRT interaction. They are different because of the unequal number of observations per location. With unbalanced data and the random location×treatment interaction, the appropriate error term for testing treatment effects is a linear combination of MS(LOC*TRT) and MS(ERROR).
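The construction of such a synthesized denominator can be sketched from the Type 3 mean squares in Output 11.34. This is an illustration of the principle only: the F of 2.77 with 18.1 DF that PROC MIXED reports in Output 11.35 also reflects the Kenward-Roger adjustment requested by DDFM=KR and the handling of the negative BLK(LOC) estimate, so the simple combination below does not reproduce it exactly.

```python
# Mean squares and EMS coefficients from Output 11.34
ms_trt, ms_loctrt, ms_resid = 378.627424, 179.392579, 29.473167

# E(MS_trt) contains 4.3636*Var(loc*trt); E(MS_loctrt) contains 5.6786 of it.
# Weight the two mean squares so the Var(loc*trt) contributions match:
a = 4.3636 / 5.6786
denom = a * ms_loctrt + (1 - a) * ms_resid   # expectation: Var(Res) + 4.3636*Var(loc*trt)
f_approx = ms_trt / denom

# Satterthwaite degrees of freedom for the synthesized denominator
df_satt = denom**2 / ((a * ms_loctrt)**2 / 14 + ((1 - a) * ms_resid)**2 / 80)
```

The point to take away is structural: because the EMS coefficient for LOC*TRT differs between the TRT line (4.3636) and the LOC*TRT line (5.6786), no single mean square is a valid error term, and the denominator DF fall far below the 80 residual DF.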
Output 11.35 Random Location Analysis of Multi-Location Data
Covariance Parameter Estimates

Cov Parm | Estimate
---|---
loc | 54.7194
blk(loc) | -0.5020
loc*trt | 26.4009
Residual | 29.4732
Type 3 Tests of Fixed Effects

Effect | Num DF | Den DF | F Value | Pr > F
---|---|---|---|---
trt | 2 | 18.1 | 2.77 | 0.0893
Least Squares Means

Effect | trt | Estimate | Standard Error | DF | t Value | Pr > |t|
---|---|---|---|---|---|---
trt | 1 | 40.2770 | 3.3091 | 15.8 | 12.17 | <.0001
trt | 2 | 44.3284 | 3.3091 | 15.8 | 13.40 | <.0001
trt | 3 | 46.9789 | 3.3091 | 15.8 | 14.20 | <.0001

Differences of Least Squares Means

Effect | trt | _trt | Estimate | Standard Error | DF | t Value | Pr > |t|
---|---|---|---|---|---|---|---
trt | 1 | 2 | -4.0515 | 2.8690 | 18.1 | -1.41 | 0.1749
trt | 1 | 3 | -6.7020 | 2.8690 | 18.1 | -2.34 | 0.0312
trt | 2 | 3 | -2.6505 | 2.8690 | 18.1 | -0.92 | 0.3677
You can see from Output 11.35 that the test of treatment effect is considerably more conservative than the corresponding tests in the fixed-locations analyses. This is partly because the MS(LOC*TRT) term is considerably larger than MS(ERROR) (recall the highly significant location×treatment interaction in Output 11.33) and partly because the denominator degrees of freedom depend mainly on the degrees of freedom for LOC*TRT and are thus substantially lower. The BLK(LOC) variance estimate is allowed to remain negative when the METHOD=TYPE3 option is used. The REML default sets the estimate to zero, with some impact on the LOC*TRT variance estimate and some of the test statistics. Section 4.4.2, “Standard Errors for the Two-Way Mixed Model: GLM versus MIXED,” discussed the arguments for and against the REML default; this remains an unresolved controversy in mixed-model inference.
Now look at the LS means and the estimates of treatment differences. The estimates are close to the values you would get using the LS means in the fixed location with interaction model in Output 11.33. This means that the random-locations model implicitly weights locations approximately equally. The standard errors and denominator degrees of freedom are considerably different from the fixed-locations analysis because the mixed model uses the LOC*TRT variance.
You can consider location-specific effects with the random-locations model by using best linear unbiased predictors. The following SAS statements obtain the location-specific BLUPs for the differences between treatments 1 and 2 and between 1 and 3, respectively:
estimate 't1 vs t2 at loc 1' trt 1 -1 0 | loc*trt 1 -1 0;
estimate 't1 vs t3 at loc 1' trt 1 0 -1 | loc*trt 1 0 -1;
estimate 't1 vs t2 at loc 2' trt 1 -1 0 | loc*trt 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 2' trt 1 0 -1 | loc*trt 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 3' trt 1 -1 0
| loc*trt 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 3' trt 1 0 -1
| loc*trt 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 4' trt 1 -1 0
| loc*trt 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 4' trt 1 0 -1
| loc*trt 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 5' trt 1 -1 0
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 5' trt 1 0 -1
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 6' trt 1 -1 0
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 6' trt 1 0 -1
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 7' trt 1 -1 0
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 7' trt 1 0 -1
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1;
estimate 't1 vs t2 at loc 8' trt 1 -1 0
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0;
estimate 't1 vs t3 at loc 8' trt 1 0 -1
| loc*trt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 -1;
The results appear in Output 11.36.
Output 11.36 Location-Specific Best Linear Unbiased Predictors for Multi-Location Data
Estimates
Label | Estimate | Standard Error | DF | t Value | Pr > |t| |
t1 vs t2 at loc 1 | -1.4146 | 3.8811 | 133 | -0.36 | 0.7161 |
t1 vs t3 at loc 1 | -0.7973 | 3.8811 | 133 | -0.21 | 0.8376 |
t1 vs t2 at loc 2 | -3.6253 | 3.8811 | 133 | -0.93 | 0.3519 |
t1 vs t3 at loc 2 | 6.4178 | 3.8811 | 133 | 1.65 | 0.1006 |
t1 vs t2 at loc 3 | -6.0060 | 3.8811 | 133 | -1.55 | 0.1241 |
t1 vs t3 at loc 3 | -10.0044 | 3.8811 | 133 | -2.58 | 0.0110 |
t1 vs t2 at loc 4 | -4.4512 | 3.8811 | 133 | -1.15 | 0.2535 |
t1 vs t3 at loc 4 | -20.7663 | 3.8811 | 133 | -5.35 | <.0001 |
t1 vs t2 at loc 5 | -4.1345 | 2.9216 | 104 | -1.42 | 0.1600 |
t1 vs t3 at loc 5 | -5.4356 | 2.9216 | 104 | -1.86 | 0.0656 |
t1 vs t2 at loc 6 | -4.1767 | 2.9216 | 104 | -1.43 | 0.1558 |
t1 vs t3 at loc 6 | 2.0963 | 2.9216 | 104 | 0.72 | 0.4747 |
t1 vs t2 at loc 7 | -5.4148 | 2.1376 | 90.7 | -2.53 | 0.0130 |
t1 vs t3 at loc 7 | -12.6546 | 2.1376 | 90.7 | -5.92 | <.0001 |
t1 vs t2 at loc 8 | -3.1886 | 2.1376 | 90.7 | -1.49 | 0.1392 |
t1 vs t3 at loc 8 | -12.4716 | 2.1376 | 90.7 | -5.83 | <.0001 |
These estimates are similar to the simple effects you would obtain in the fixed-location-with-interaction analysis in Section 11.8.2, except that the estimates in Output 11.36 are shrinkage estimators that account for the location and location×treatment distributions. Also note that you should use the DDFM=KR option in the MODEL statement of the PROC MIXED program; otherwise, the standard errors of the BLUPs are biased downward.
The main argument for using the random-locations analysis is that it most accurately reflects the inference implicitly intended in most multi-location studies. The primary disadvantage is that, in order to obtain reasonable estimates of the LOC and LOC*TRT variances and to have sufficient denominator degrees of freedom to test TRT effects, studies must be designed with an adequate number of locations, and those locations must plausibly represent the target population.
Closer inspection of the treatment means by location in Output 11.31 and the location-specific BLUPs in Output 11.36 reveals that the treatment 1 minus treatment 3 difference tends to be large, favoring treatment 1, in locations with relatively low overall mean responses, whereas treatment 3 tends to be favored in locations with relatively high overall mean responses. You can formalize this relationship with an analysis that regresses the response on an index characterizing the mean response at each location. This method is closely related to the Tukey test of non-additivity in randomized-complete-blocks designs and was used by Eberhart and Russell (1966) to characterize genotype-by-environment interactions (a special case of multi-location studies). Milliken and Johnson (1989, Chapters 1-3) give an excellent overview of these methods.
The model for this analysis is y_ijk = μ + L_i + B(L)_ij + τ_k + β_k I_i + (τL)_ik + e_ijk, where all of the terms in the model equation are defined as previously, with the addition of a location index, I_i, and a regression coefficient, β_k, for the kth treatment. The location index is usually defined as the mean response over all observations at the ith location. You can implement the analysis using the following SAS statements:
proc sort data=mloc;
by loc;
proc means noprint data=mloc;
by loc; var y;
output out=env_indx mean=index;
data all;
merge mloc env_indx;
by loc;
proc mixed data=all;
class loc blk trt;
model y=trt trt*index/noint solution ddfm=satterth;
random loc blk(loc) loc*trt;
lsmeans trt/diff;
contrast 'trt at mean index'
trt 1 -1 0 trt*index 45.2 -45.2 0,
trt 1 0 -1 trt*index 45.2 0 -45.2;
The PROC SORT and PROC MEANS statements generate a new data set, ENV_INDX, which contains the means of Y by location in a new variable called INDEX. The ENV_INDX data are then merged with the original data set, and the analysis is computed with PROC MIXED. The MIXED program is similar to the random-locations analysis in Section 11.8.3, except that the term TRT*INDEX is added to the MODEL statement. This term corresponds to β_k I_i in the model equation given above. The NOINT and SOLUTION options allow easier interpretation of the output. The CONTRAST statement computes an appropriate test of the equality of treatment effects; because this model is a special case of an unequal-slopes analysis-of-covariance model, the test of treatment effects varies with the covariate. The test shown here is at the mean of the INDEX variable over all locations, but you could choose different values of INDEX. In fact, in many studies you would want to test treatment effects at several values of INDEX, say at relatively low and relatively high values, to get an idea of how treatment effects change over locations with different mean responses. Output 11.37 shows the results of the analysis.
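The index construction performed by the PROC SORT, PROC MEANS, and MERGE steps amounts to computing each location's mean response and attaching it to every observation from that location. A minimal pure-Python sketch of the same operation, using hypothetical records rather than the MLOC data:

```python
# Compute a location index (the mean of y per location) and attach
# it to each observation -- the same operation as the PROC MEANS /
# MERGE steps in the text.  The records below are hypothetical.

records = [
    {"loc": 1, "trt": 1, "y": 40.0},
    {"loc": 1, "trt": 2, "y": 44.0},
    {"loc": 2, "trt": 1, "y": 50.0},
    {"loc": 2, "trt": 2, "y": 58.0},
]

# accumulate (sum, count) of y within each location
totals = {}
for r in records:
    s, n = totals.get(r["loc"], (0.0, 0))
    totals[r["loc"]] = (s + r["y"], n + 1)
index = {loc: s / n for loc, (s, n) in totals.items()}

# merge the index back onto every observation
for r in records:
    r["index"] = index[r["loc"]]

print(index)  # {1: 42.0, 2: 54.0}
```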
Output 11.37 Location Index Analysis of Multi-Location Data
Covariance Parameter Estimates
Cov Parm | Estimate |
loc | 0 |
blk(loc) | 0 |
loc*trt | 0.8334 |
Residual | 27.9211 |
Solution for Fixed Effects
Effect | trt | Estimate | Standard Error | DF | t Value | Pr > |t| |
trt | 1 | 12.4035 | 5.1377 | 32.7 | 2.41 | 0.0215 |
trt | 2 | 17.0483 | 5.1377 | 32.7 | 3.32 | 0.0022 |
trt | 3 | -29.4519 | 5.1377 | 32.7 | -5.73 | <.0001 |
index*trt | 1 | 0.6345 | 0.1128 | 29.2 | 5.62 | <.0001 |
index*trt | 2 | 0.6232 | 0.1128 | 29.2 | 5.52 | <.0001 |
index*trt | 3 | 1.7423 | 0.1128 | 29.2 | 15.44 | <.0001 |
Type 3 Tests of Fixed Effects
Effect | Num DF | Den DF | F Value | Pr > F |
trt | 3 | 32.7 | 16.57 | <.0001 | |
index*trt | 3 | 29.2 | 100.17 | <.0001 |
Contrasts
Label | Num DF | Den DF | F Value | Pr > F |
trt at mean index | 2 | 19.2 | 23.45 | <.0001 |
Least Squares Means
Effect | trt | Estimate | Standard Error | DF | t Value | Pr > |t| |
trt | 1 | 41.0822 | 0.8483 | 19.2 | 48.43 | <.0001 |
trt | 2 | 45.2182 | 0.8483 | 19.2 | 53.31 | <.0001 |
trt | 3 | 49.2975 | 0.8483 | 19.2 | 58.12 | <.0001 |
Differences of Least Squares Means
Effect | trt | _trt | Estimate | Standard Error | DF | t Value | Pr > |t| |
trt | 1 | 2 | -4.1360 | 1.1996 | 19.2 | -3.45 | 0.0027 |
trt | 1 | 3 | -8.2153 | 1.1996 | 19.2 | -6.85 | <.0001 |
trt | 2 | 3 | -4.0793 | 1.1996 | 19.2 | -3.40 | 0.0030 |
The “Covariance Parameter Estimates” show that the INDEX accounts for most of the variation among locations. The LOC variance estimate is 0, and the LOC*TRT variance is sharply reduced compared to its estimate in the random-locations analysis in Output 11.35. The “Solution for Fixed Effects” parameters have the following interpretation: the TRT parameters estimate μ + τ_k, and the INDEX*TRT parameters estimate β_k. Thus, TRT + (INDEX*TRT)×(location index) gives you the expected treatment mean at a given value of the location index. For example, at a location whose average response is 45.2, the expected mean of treatment 1 is 12.4035+(0.6345)×(45.2)=41.1, the LS mean shown in the output, aside from round-off error. The INDEX*TRT estimates tell you how the expected treatment mean changes as the location index increases.
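The intercept-plus-slope arithmetic can be checked for all three treatments directly from the "Solution for Fixed Effects" values in Output 11.37; each reproduces the corresponding LS mean at INDEX=45.2 to within rounding:

```python
# Reconstruct the LS means at INDEX = 45.2 from the fixed-effects
# solution in Output 11.37: expected mean = TRT + (INDEX*TRT) * index.
intercepts = {1: 12.4035, 2: 17.0483, 3: -29.4519}
slopes     = {1: 0.6345,  2: 0.6232,  3: 1.7423}

for trt in (1, 2, 3):
    mean = intercepts[trt] + slopes[trt] * 45.2
    print(trt, round(mean, 1))
# 41.1, 45.2, and 49.3 -- matching the LS means 41.0822,
# 45.2182, and 49.2975 to within round-off
```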
The INDEX*TRT estimate is much larger for treatment 3, and the intercept (TRT) is much smaller. This tells you that treatment 3 is expected to have a low mean relative to treatments 1 and 2 in locations with a relatively low mean response, but its mean increases more quickly with the location index, so it is expected to exceed treatments 1 and 2 in locations with relatively high mean responses.
The “Type 3 Tests of Fixed Effects” results for TRT and INDEX*TRT test the joint equality of these terms to zero. As such, the test for TRT is usually not of interest; the CONTRAST result for TRT at a given value of INDEX supersedes it. Use the INDEX*TRT test to assess whether the expected responses vary linearly with the location index for the three treatments. You could also construct a CONTRAST defined on INDEX*TRT to test the equality of the β_k's, which you interpret as a linear location-index×treatment interaction.
As discussed earlier, the CONTRAST result tests treatment effects at a location index of 45.2. The “Least Squares Means” and “Differences of Least Squares Means” output are computed at the mean INDEX value determined by the LS means algorithm. You can vary the LS means and differences using the AT option to see what happens in different environments. For example, the following statements compute LS means for INDEX values of 30.9 (roughly the lowest INDEX for any location observed) and 57.9 (roughly the highest INDEX among all locations). For completeness, the LSMEANS statement with the default AT MEANS option is also shown. The SAS statements are
lsmeans trt/at index=30.9 diff;
lsmeans trt/at means diff;
lsmeans trt/at index=57.9 diff;
The results appear in Output 11.38.
Output 11.38 LS Means and Differences Computed at Various Location Indices
Least Squares Means
Effect | trt | index | Estimate | Standard Error | DF | t Value | Pr > |t| |
trt | 1 | 30.90 | 32.0094 | 1.7935 | 36.6 | 17.85 | <.0001 |
trt | 2 | 30.90 | 36.3064 | 1.7935 | 36.6 | 20.24 | <.0001 |
trt | 3 | 30.90 | 24.3842 | 1.7935 | 36.6 | 13.60 | <.0001 |
trt | 1 | 45.20 | 41.0822 | 0.8483 | 19.2 | 48.43 | <.0001 |
trt | 2 | 45.20 | 45.2182 | 0.8483 | 19.2 | 53.31 | <.0001 |
trt | 3 | 45.20 | 49.2975 | 0.8483 | 19.2 | 58.12 | <.0001 |
trt | 1 | 57.90 | 49.1407 | 1.6936 | 19.6 | 29.02 | <.0001 |
trt | 2 | 57.90 | 53.1338 | 1.6936 | 19.6 | 31.37 | <.0001 |
trt | 3 | 57.90 | 71.4255 | 1.6936 | 19.6 | 42.17 | <.0001 |
Differences of Least Squares Means
Effect | trt | _trt | index | Estimate | Standard Error | DF | t Value | Pr > |t| |
trt | 1 | 2 | 30.90 | -4.2970 | 2.5363 | 36.6 | -1.69 | 0.0987 |
trt | 1 | 3 | 30.90 | 7.6252 | 2.5363 | 36.6 | 3.01 | 0.0048 |
trt | 2 | 3 | 30.90 | 11.9221 | 2.5363 | 36.6 | 4.70 | <.0001 |
trt | 1 | 2 | 45.20 | -4.1360 | 1.1996 | 19.2 | -3.45 | 0.0027 |
trt | 1 | 3 | 45.20 | -8.2153 | 1.1996 | 19.2 | -6.85 | <.0001 |
trt | 2 | 3 | 45.20 | -4.0793 | 1.1996 | 19.2 | -3.40 | 0.0030 |
trt | 1 | 2 | 57.90 | -3.9930 | 2.3951 | 19.6 | -1.67 | 0.1114 |
trt | 1 | 3 | 57.90 | -22.2848 | 2.3951 | 19.6 | -9.30 | <.0001 |
trt | 2 | 3 | 57.90 | -18.2917 | 2.3951 | 19.6 | -7.64 | <.0001 |
You can see that in locations with the lowest index, or mean response, the expected response of treatment 3 is considerably lower than that of the other two treatments. This is consistent with what was observed in locations 2 and 6 (see Output 11.31). On the other hand, in locations with the highest mean, the expected response of treatment 3 exceeds that of the other two treatments by a considerable margin, as was observed in locations 4, 7, and 8.
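The crossover suggested by Outputs 11.31 and 11.38 can be located directly from the fitted lines: setting the treatment 1 and treatment 3 regressions equal and solving for the index gives the location mean at which treatment 3 is expected to overtake treatment 1. This is an illustrative computation from the estimates in Output 11.37, not part of the original analysis:

```python
# Index value at which the fitted lines for treatments 1 and 3 cross:
# 12.4035 + 0.6345*I = -29.4519 + 1.7423*I  =>  solve for I.
crossover = (12.4035 - (-29.4519)) / (1.7423 - 0.6345)
print(round(crossover, 1))  # about 37.8
```

This is consistent with the LS means in Output 11.38: at INDEX=30.9 treatment 3 trails the others, while at INDEX=45.2 and above it leads.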
As a final note, this analysis suggests that when there are strong location-specific effects, the argument over weighted versus unweighted means (that is, MEANS versus LS means, respectively, from the fixed- and random-locations models with interaction in Sections 11.8.2 and 11.8.3) is probably moot. Evaluating how treatment response changes across locations, and trying to understand why, is usually more to the point.
Nested effects can produce a very large number of dummy variables in a model, challenging the capacity of computers. This is far less of a problem with modern computers than it was only a few years ago, but it can still be an issue. A methodology called absorption greatly reduces the size of the problem by eliminating the need to obtain an explicit solution to the complete set of normal equations. In most applications, nested effects are random, and their estimates might not be required.
Absorption reduces the number of normal equations by eliminating the parameters for one factor from the system before a solution is obtained. This is analogous to the method of solving a set of three equations in three unknowns, x1, x2, and x3. Suppose you combine the first and second equations and eliminate x3. Next, combine the first and third equations and eliminate x3. Then, with two equations left involving x1 and x2 (the variable x3 having been absorbed), solve the reduced set for x1 and x2.
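The elimination idea can be made concrete with a small numeric example, unrelated to the steer data:

```python
# Absorption analogy: eliminate x3 from a 3x3 linear system, solve
# the reduced 2x2 system, then back-substitute.
#   x1 +  x2 + x3 = 6
#  2x1 -  x2 + x3 = 3
#   x1 + 2x2 - x3 = 2

# From the first equation: x3 = 6 - x1 - x2.  Substitute ("absorb" x3):
#   eq2 becomes:  x1 - 2*x2 = -3
#   eq3 becomes: 2*x1 + 3*x2 =  8
# Solve the reduced system:
#   x1 = 2*x2 - 3  =>  2*(2*x2 - 3) + 3*x2 = 8  =>  7*x2 = 14
x2 = 14 / 7
x1 = 2 * x2 - 3
x3 = 6 - x1 - x2
print(x1, x2, x3)  # 1.0 2.0 3.0
```

Absorption does the same thing at scale: the absorbed factor's parameters are swept out of the normal equations, the smaller system is solved, and the absorbed effects are never explicitly estimated.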
The use of the ABSORB statement is illustrated with data on 65 steers from Harvey (1975). Several values are recorded for each steer, including line number (LINE), sire number (SIRE), age of dam (AGEDAM), steer age (AGE), initial weight (INTLWT), and the dependent variable, average daily gain (AVDLYGN). Output 11.39 shows the data.
Output 11.39 Data Set Sires
Obs | line | sire | agedam | steerno | age | intlwt | avdlygn |
1 | 1 | 1 | 3 | 1 | 192 | 390 | 2.24 |
2 | 1 | 1 | 3 | 2 | 154 | 403 | 2.65 |
3 | 1 | 1 | 4 | 3 | 185 | 432 | 2.41 |
4 | 1 | 1 | 4 | 4 | 193 | 457 | 2.25 |
5 | 1 | 1 | 5 | 5 | 186 | 483 | 2.58 |
6 | 1 | 1 | 5 | 6 | 177 | 469 | 2.67 |
7 | 1 | 1 | 5 | 7 | 177 | 428 | 2.71 |
8 | 1 | 1 | 5 | 8 | 163 | 439 | 2.47 |
9 | 1 | 2 | 4 | 9 | 188 | 439 | 2.29 |
10 | 1 | 2 | 4 | 10 | 178 | 407 | 2.26 |
11 | 1 | 2 | 5 | 11 | 198 | 498 | 1.97 |
12 | 1 | 2 | 5 | 12 | 193 | 459 | 2.14 |
13 | 1 | 2 | 5 | 13 | 186 | 459 | 2.44 |
14 | 1 | 2 | 5 | 14 | 175 | 375 | 2.52 |
15 | 1 | 2 | 5 | 15 | 171 | 382 | 1.72 |
16 | 1 | 2 | 5 | 16 | 168 | 417 | 2.75 |
17 | 1 | 3 | 3 | 17 | 154 | 389 | 2.38 |
18 | 1 | 3 | 4 | 18 | 184 | 414 | 2.46 |
19 | 1 | 3 | 5 | 19 | 174 | 483 | 2.29 |
20 | 1 | 3 | 5 | 20 | 170 | 430 | 2.30 |
21 | 1 | 3 | 5 | 21 | 169 | 443 | 2.94 |
22 | 2 | 4 | 3 | 22 | 158 | 381 | 2.50 |
23 | 2 | 4 | 3 | 23 | 158 | 365 | 2.44 |
24 | 2 | 4 | 4 | 24 | 169 | 386 | 2.44 |
25 | 2 | 4 | 4 | 25 | 144 | 339 | 2.15 |
26 | 2 | 4 | 5 | 26 | 159 | 419 | 2.54 |
27 | 2 | 4 | 5 | 27 | 152 | 469 | 2.74 |
28 | 2 | 4 | 5 | 28 | 149 | 379 | 2.50 |
29 | 2 | 4 | 5 | 29 | 149 | 375 | 2.54 |
30 | 2 | 5 | 3 | 30 | 189 | 395 | 2.65 |
31 | 2 | 5 | 4 | 31 | 187 | 447 | 2.52 |
32 | 2 | 5 | 4 | 32 | 165 | 430 | 2.67 |
33 | 2 | 5 | 5 | 33 | 181 | 453 | 2.79 |
34 | 2 | 5 | 5 | 34 | 177 | 385 | 2.33 |
35 | 2 | 5 | 5 | 35 | 151 | 414 | 2.67 |
36 | 2 | 5 | 5 | 36 | 147 | 353 | 2.69 |
37 | 3 | 6 | 4 | 37 | 184 | 411 | 3.00 |
38 | 3 | 6 | 4 | 38 | 184 | 420 | 2.49 |
39 | 3 | 6 | 5 | 39 | 187 | 427 | 2.25 |
40 | 3 | 6 | 5 | 40 | 184 | 409 | 2.49 |
41 | 3 | 6 | 5 | 41 | 183 | 337 | 2.02 |
42 | 3 | 6 | 5 | 42 | 177 | 352 | 2.31 |
43 | 3 | 7 | 3 | 43 | 205 | 472 | 2.57 |
44 | 3 | 7 | 3 | 44 | 193 | 340 | 2.37 |
45 | 3 | 7 | 4 | 45 | 162 | 375 | 2.64 |
46 | 3 | 7 | 5 | 46 | 206 | 451 | 2.37 |
47 | 3 | 7 | 5 | 47 | 205 | 472 | 2.22 |
48 | 3 | 7 | 5 | 48 | 187 | 402 | 1.90 |
49 | 3 | 7 | 5 | 49 | 178 | 464 | 2.61 |
50 | 3 | 7 | 5 | 50 | 175 | 414 | 2.13 |
51 | 3 | 8 | 3 | 51 | 200 | 466 | 2.16 |
52 | 3 | 8 | 3 | 52 | 184 | 356 | 2.33 |
53 | 3 | 8 | 3 | 53 | 175 | 449 | 2.52 |
54 | 3 | 8 | 4 | 54 | 178 | 360 | 2.45 |
55 | 3 | 8 | 5 | 55 | 189 | 385 | 1.44 |
56 | 3 | 8 | 5 | 56 | 184 | 431 | 1.72 |
57 | 3 | 8 | 5 | 57 | 183 | 401 | 2.17 |
58 | 3 | 9 | 3 | 58 | 166 | 404 | 2.68 |
59 | 3 | 9 | 4 | 59 | 187 | 482 | 2.43 |
60 | 3 | 9 | 4 | 60 | 186 | 350 | 2.36 |
61 | 3 | 9 | 4 | 61 | 184 | 483 | 2.44 |
62 | 3 | 9 | 5 | 62 | 180 | 425 | 2.66 |
63 | 3 | 9 | 5 | 63 | 177 | 420 | 2.46 |
64 | 3 | 9 | 5 | 64 | 175 | 440 | 2.52 |
65 | 3 | 9 | 5 | 65 | 164 | 405 | 2.42 |
The analysis, as performed by Harvey, can be obtained directly by PROC GLM with the following SAS statements:
proc glm;
class line sire agedam;
model avdlygn=line sire(line) agedam
line*agedam age intlwt / solution ss3;
test h=line e=sire(line);
The results appear in Output 11.40.
Output 11.40 Complete Analysis of Variance
The GLM Procedure
Dependent Variable: avdlygn
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 16 | 2.52745871 | 0.15796617 | 3.14 | 0.0011 |
Error | 48 | 2.41191667 | 0.05024826 | ||
Corrected Total | 64 | 4.93937538 |
R-Square | Coeff Var | Root MSE | avdlygn Mean |
0.511696 | 9.295956 | 0.224161 | 2.411385 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
line | 2 | 0.13620255 | 0.06810128 | 1.36 | 0.2676 |
sire(line) | 6 | 0.97388905 | 0.16231484 | 3.23 | 0.0095 |
agedam | 2 | 0.13010623 | 0.06505311 | 1.29 | 0.2834 |
line*agedam | 4 | 0.45343434 | 0.11335859 | 2.26 | 0.0768 |
age | 1 | 0.38127612 | 0.38127612 | 7.59 | 0.0083 |
intlwt | 1 | 0.26970425 | 0.26970425 | 5.37 | 0.0248 |
Tests of Hypotheses Using the Type III MS for sire(line) as an Error Term
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
line | 2 | 0.13620255 | 0.06810128 | 0.42 | 0.6752 |
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
Intercept | 2.996269167 B | 0.51285394 | 5.84 | <.0001 | |
line 1 | 0.071824656 B | 0.14550628 | 0.49 | 0.6238 | |
line 2 | 0.252468579 B | 0.13716655 | 1.84 | 0.0719 | |
line 3 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
sire(line) 1 1 | 0.085729012 B | 0.13027803 | 0.66 | 0.5137 | |
sire(line) 2 1 | -0.121705157 B | 0.13622078 | -0.89 | 0.3761 | |
sire(line) 3 1 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
sire(line) 4 2 | -0.244601122 B | 0.12669287 | -1.93 | 0.0594 | |
sire(line) 5 2 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
sire(line) 6 3 | 0.105395737 B | 0.12908764 | 0.82 | 0.4183 | |
sire(line) 7 3 | -0.019520926 B | 0.12037674 | -0.16 | 0.8719 | |
sire(line) 8 3 | -0.330235387 B | 0.12566795 | -2.63 | 0.0115 | |
sire(line) 9 3 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
agedam 3 | 0.370387027 B | 0.11455814 | 3.23 | 0.0022 | |
agedam 4 | 0.275459487 B | 0.10377628 | 2.65 | 0.0107 | |
agedam 5 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
line*agedam 1 3 | -0.448936131 B | 0.19581259 | -2.29 | 0.0263 | |
line*agedam 1 4 | -0.282831924 B | 0.16085047 | -1.76 | 0.0851 | |
line*agedam 1 5 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
line*agedam 2 3 | -0.260782670 B | 0.19528690 | -1.34 | 0.1880 | |
line*agedam 2 4 | -0.350258133 B | 0.17438656 | -2.01 | 0.0502 | |
line*agedam 2 5 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
line*agedam 3 3 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
line*agedam 3 4 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
line*agedam 3 5 | 0.000000000 B | ⋅ | ⋅ | ⋅ | |
age | -0.008530438 | 0.00309679 | -2.75 | 0.0083 | |
intlwt | 0.002026334 | 0.00087464 | 2.32 | 0.0248 |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
The factor AGEDAM is treated as a discrete variable with levels 3, 4, and ≥5. The denominator of the F-test for LINE is the SIRE(LINE) mean square.
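The LINE test in the second table of Output 11.40 can be reproduced by hand: the F statistic is the LINE mean square divided by the SIRE(LINE) mean square rather than the residual mean square:

```python
# F-test for LINE using SIRE(LINE) as the error term (Output 11.40).
ms_line = 0.06810128       # Type III mean square for line
ms_sire_line = 0.16231484  # Type III mean square for sire(line)
f_line = ms_line / ms_sire_line
print(round(f_line, 2))  # 0.42, matching the printed F value
```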
To introduce the ABSORB statement, Harvey's model has been simplified. All sources of variation except the main effects of SIRE and AGEDAM have been disregarded. For the abbreviated model, the following SAS statements give the desired analysis:
proc glm;
class sire agedam;
model avdlygn=sire agedam / solution ss1 ss2 ss3;
The results appear in Output 11.41.
Output 11.41 Abbreviated Least-Squares Analysis of Variance
The GLM Procedure
Dependent Variable: avdlygn
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 10 | 1.42537863 | 0.14253786 | 2.19 | 0.0324 |
Error | 54 | 3.51399676 | 0.06507401 | ||
Corrected Total | 64 | 4.93937538 |
R-Square | Coeff Var | Root MSE | avdlygn Mean |
0.288575 | 10.57882 | 0.255096 | 2.411385 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
sire | 8 | 1.30643634 | 0.16330454 | 2.51 | 0.0214 |
agedam | 2 | 0.11894229 | 0.05947115 | 0.91 | 0.4071 |
Source | DF | Type II SS | Mean Square | F Value | Pr > F |
agedam | 2 | 0.11894229 | 0.05947115 | 0.91 | 0.4071 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
agedam | 2 | 0.11894229 | 0.05947115 | 0.91 | 0.4071 |
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
agedam 3 | 0.1173825552 B | 0.08911680 | 1.32 | 0.1933 | |
agedam 4 | 0.0482979994 B | 0.07715379 | 0.63 | 0.5340 | |
agedam 5 | 0.0000000000 B | . | . | . |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
If the number of sires were large, then this analysis would be expensive. However, because there is little concern for the actual estimates of the effects of SIRE, considerable expense can be avoided by using the ABSORB statement:
proc glm;
absorb sire;
class agedam;
model avdlygn=agedam / solution ss1 ss2 ss3;
The results appear in Output 11.42.
Output 11.42 Abbreviated Least-Squares Analysis of Variance Using the ABSORB Statement
The GLM Procedure
Dependent Variable: avdlygn
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 10 | 1.42537863 | 0.14253786 | 2.19 | 0.0324 |
Error | 54 | 3.51399676 | 0.06507401 | ||
Corrected Total | 64 | 4.93937538 |
R-Square | Coeff Var | Root MSE | avdlygn Mean |
0.288575 | 10.57882 | 0.255096 | 2.411385 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
sire | 8 | 1.30643634 | 0.16330454 | 2.51 | 0.0214 |
agedam | 2 | 0.11894229 | 0.05947115 | 0.91 | 0.4071 |
Source | DF | Type II SS | Mean Square | F Value | Pr > F |
agedam | 2 | 0.11894229 | 0.05947115 | 0.91 | 0.4071 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
agedam | 2 | 0.11894229 | 0.05947115 | 0.91 | 0.4071 |
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
agedam 3 | 0.1173825552 B | 0.08911680 | 1.32 | 0.1933 |
agedam 4 | 0.0482979994 B | 0.07715379 | 0.63 | 0.5340 |
agedam 5 | 0.0000000000 B | . | . | . |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
The results in Output 11.41 and Output 11.42 are the same except that the SIRE sums of squares and the SIRE parameter estimates are not printed when SIRE is absorbed. (Note: Type I sums of squares for absorbed effects are computed as nested effects.)
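For a single absorbed classification factor, absorption is equivalent to centering the response and the remaining regressors within the levels of that factor and fitting the reduced model to the centered data (the Frisch-Waugh result). A small numeric sketch with made-up data, not the SIRES data:

```python
# Absorbing one classification factor = regressing within-group-
# centered y on within-group-centered x.  Toy data, two groups.
groups = {
    "A": {"x": [1.0, 2.0], "y": [1.0, 3.0]},
    "B": {"x": [3.0, 5.0], "y": [10.0, 14.0]},
}

sxy = sxx = 0.0
for g in groups.values():
    xbar = sum(g["x"]) / len(g["x"])
    ybar = sum(g["y"]) / len(g["y"])
    for x, y in zip(g["x"], g["y"]):
        sxy += (x - xbar) * (y - ybar)
        sxx += (x - xbar) ** 2

slope = sxy / sxx  # same slope as fitting y = group + x by least squares
print(slope)  # 2.0
```

This is why the AGEDAM estimates in Output 11.42 match Output 11.41 exactly even though the SIRE parameters were never computed.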
Output 11.40 and Output 11.41 include results for the following statements:
contrast 'young vs old' agedam .5 .5 -1;
estimate 'young vs old' agedam .5 .5 -1;
The output illustrates that the CONTRAST and ESTIMATE statements are legitimate with the ABSORB statement as long as the coefficients of the linear function do not involve absorbed effects—that is, parameter estimates that are not printed (in this case the SIRE parameter estimates). The following ESTIMATE statement would not be legitimate when SIRE is absorbed because it involves the SIRE parameters:
estimate 'oldmean' sire .111111 ... .111111 agedam 1;
For the same reason, the LSMEANS statement for SIRE is not legitimate with the ABSORB statement.
The ABSORB statement is now applied to the full analysis as given by Harvey (see Output 11.40). If the sums of squares for LINE and SIRE(LINE) are not required, the remaining sums of squares can be obtained with the following statements:
proc glm;
absorb line sire;
class line agedam;
model avdlygn=agedam line*agedam age
intlwt / solution ss3;
Output 11.43 contains the output, which is identical to Harvey’s original results (see Output 11.40) except that neither the sums of squares nor the parameter estimates for LINE and SIRE(LINE) are computed when LINE and SIRE are absorbed.
Output 11.43 Complete Least-Squares Analysis of Variance Using the ABSORB Statement
The GLM Procedure
Dependent Variable: avdlygn
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 16 | 2.52745871 | 0.15796617 | 3.14 | 0.0011 |
Error | 48 | 2.41191667 | 0.05024826 | ||
Corrected Total | 64 | 4.93937538 |
R-Square | Coeff Var | Root MSE | avdlygn Mean |
0.511696 | 9.295956 | 0.224161 | 2.411385 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
agedam | 2 | 0.13010623 | 0.06505311 | 1.29 | 0.2834 |
line*agedam | 4 | 0.45343434 | 0.11335859 | 2.26 | 0.0768 |
age | 1 | 0.38127612 | 0.38127612 | 7.59 | 0.0083 |
intlwt | 1 | 0.26970425 | 0.26970425 | 5.37 | 0.0248 |
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
young vs old | 1 | 0.11895160 | 0.11895160 | 2.37 | 0.1305 |
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
young vs old | 0.09912178 | 0.06442352 | 1.54 | 0.1305 |
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
agedam 3 | 0.3703870271 B | 0.11455814 | 3.23 | 0.0022 |
agedam 4 | 0.2754594872 B | 0.10377628 | 2.65 | 0.0107 |
agedam 5 | 0.0000000000 B | . | . | . |
line*agedam 1 3 | -.4489361310 B | 0.19581259 | -2.29 | 0.0263 |
line*agedam 1 4 | -.2828319237 B | 0.16085047 | -1.76 | 0.0851 |
line*agedam 1 5 | 0.0000000000 B | . | . | . |
line*agedam 2 3 | -.2607826701 B | 0.19528690 | -1.34 | 0.1880 |
line*agedam 2 4 | -.3502581329 B | 0.17438656 | -2.01 | 0.0502 |
line*agedam 2 5 | 0.0000000000 B | . | . | . |
line*agedam 3 3 | 0.0000000000 B | . | . | . |
line*agedam 3 4 | 0.0000000000 B | . | . | . |
line*agedam 3 5 | 0.0000000000 B | . | . | . |
age | -.0085304380 | 0.00309679 | -2.75 | 0.0083 |
intlwt | 0.0020263340 | 0.00087464 | 2.32 | 0.0248 |
We conclude this section by running a mixed-model analysis using PROC MIXED in which we consider sires to be random. This analysis might be considered preferable to the analysis of variance using PROC GLM because it truly treats sires as random. In some more complicated situations with very large data sets, however, PROC MIXED might overwhelm the computer.
The appropriate statements are
proc mixed data=sires;
class line sire agedam;
model avdlygn=line agedam line*agedam age intlwt/
ddfm=satterthwaite;
random sire(line);
contrast 'young vs old' agedam .5 .5 -1;
estimate 'young vs old' agedam .5 .5 -1;
run;
Edited results are shown in Output 11.44.
Output 11.44 A Mixed-Model Analysis
The Mixed Procedure
Model Information
Data Set | WORK.SIRES |
Dependent Variable | avdlygn |
Covariance Structure | Variance Components |
Estimation Method | REML |
Residual Variance Method | Profile |
Fixed Effects SE Method | Model-Based |
Degrees of Freedom Method | Satterthwaite |
Class Level Information
Class | Levels | Values |
line | 3 | 1 2 3 |
sire | 9 | 1 2 3 4 5 6 7 8 9 |
agedam | 3 | 3 4 5 |
Covariance Parameter Estimates
Cov Parm | Estimate |
sire(line) | 0.01792 |
Residual | 0.05028 |
Type 3 Tests of Fixed Effects
Effect | Num DF | Den DF | F Value | Pr > F |
line | 2 | 7.2 | 0.43 | 0.6687 | |
agedam | 2 | 50 | 1.21 | 0.3068 | |
line*agedam | 4 | 49.5 | 2.00 | 0.1095 | |
age | 1 | 53.8 | 7.55 | 0.0082 | |
intlwt | 1 | 51.6 | 5.92 | 0.0185 |
Estimates
Label | Estimate | Standard Error | DF | t Value | Pr > |t| |
young vs old | 0.09581 | 0.06348 | 50.6 | 1.51 | 0.1374 |
Contrasts
Label | Num DF | Den DF | F Value | Pr > F |
young vs old | 1 | 50.6 | 2.28 | 0.1374 |
As you have seen with other examples, the mixed-model analysis provides estimates and tests that use appropriate error terms, at least in principle. The "experimental unit" for LINE is SIRE(LINE), and the "Type 3 Tests of Fixed Effects" table reflects this, with 2 numerator DF and 7.2 denominator DF. The other tests have approximately 50 denominator DF, essentially the same as the 48 DF for residual error in Output 11.40; steer is the "experimental unit" for AGEDAM, LINE*AGEDAM, AGE, and INTLWT. Significance probabilities in Outputs 11.40 and 11.44 agree, for practical purposes, for these effects, but it is worth noting that the ANOVA tests in Output 11.40 are exact F-tests, whereas the mixed-model tests in Output 11.44 are approximate because the covariance parameters are estimated. This illustrates a basic lesson: if you are interested only in the effects AGEDAM, LINE*AGEDAM, AGE, and INTLWT, there is really no benefit in using mixed-model methodology.