Chapter 7 Analysis of Covariance
7.2.2 Means and Least-Squares Means
7.3.1 Testing the Heterogeneity of Slopes
7.3.2 Estimating Different Slopes
7.3.3 Testing Treatment Differences with Unequal Slopes
7.4 A Two-Way Structure without Interaction
7.5 A Two-Way Structure with Interaction
7.6 Orthogonal Polynomials and Covariance Methods
7.6.2 Use of the IML ORPOL Function to Obtain Orthogonal Polynomial Contrast Coefficients
7.6.3 Use of Analysis of Covariance to Compute ANOVA and Fit Regression
Analysis of covariance can be described as a combination of the methods of regression and analysis of variance. Regression models use direct independent variables—that is, variables whose values appear directly in the model—for example, the linear regression y=β0+β1x. Analysis of variance models use class variables—that is, the independent variable’s classifications appear in the model—for example, the one-way ANOVA model y=μ+τi. In more theoretical terms, ANOVA class variables set up dummy 0-1 columns in the X matrix (see Chapter 6). Analysis-of-covariance models use both direct and class variables. A simple example combines linear regression and one-way ANOVA, yielding y=μ+τi+βx.
Analysis of covariance uses at least two measurements on each unit: the response variable y, and another variable x, called a covariable. You may have more than one covariable. The basic objective is to use information about y that is contained in x in order to refine inference about the response. This is done primarily in three ways:
❏ In all applications, variation in y that is associated with x is removed from the error variance, resulting in more precise estimates and more powerful tests.
❏ In some applications, group means of the y variable are adjusted to correspond to a common value of x, thereby producing an equitable comparison of the groups.
❏ In other applications, the regression of y on x for each group is of intrinsic interest, either to predict the effect of x on y for each group, or to compare the effect of x on y among groups.
Textbook discussions of covariance analysis focus on the first two points, with the main goal of establishing differences among adjusted treatment means. By including a related variable that accounts for substantial variation in the dependent variable of interest, you can reduce error. This increases the precision of the model parameter estimates. Textbooks discuss separate slopes versus common slope models—that is, covariance models with a separate regression slope coefficient for each treatment group, versus a single slope coefficient for all treatment groups. In these discussions, the main role of the separate slope model is to test for differences in slopes among the treatments. Typically, this test should be conducted as a preliminary step before an analysis of covariance, because, aside from carefully defined exceptions, the validity of comparing adjusted means using the analysis of covariance requires that the slopes be homogeneous. The beginning sections of this chapter present the textbook approach to analysis of covariance.
There are broader uses of covariance models, such as the study of partial regression coefficients adjusted for treatment effects. Applied to factorial experiments with qualitative and quantitative factors, covariance models provide a convenient alternative to orthogonal polynomial and related contrasts that are often tedious and awkward. Later sections of this chapter present these methods.
To give a practical definition, analysis of covariance refers to models containing both continuous variables and group indicators (CLASS variables in the GLM procedure). Because CLASS variables create less-than-full-rank models, covariance models are typically more complex and hence involve more difficulties in interpretation than regression-only or ANOVA-only models. These issues are addressed throughout this chapter.
Analysis of covariance can be applied in any data classification whenever covariables are measured. This section deals with the simplest type of classification, the one-way structure.
The simplest covariance model is written

yij = β0 + τi + βxij + εij

and combines a one-way treatment structure with parameters τi, one independent covariate xij, and associated regression parameter β. An equivalent form of the model is

yij = β0i + βxij + εij

where β0i is the intercept for the ith treatment. This expression reveals a model that represents a set of parallel lines; the common slope of the lines is β, and the intercepts are β0i = (β0 + τi). The model contains all the elements of an analysis-of-variance model of less-than-full rank, requiring restrictions on the τi or the use of generalized inverses and estimable functions. Note, however, that the regression coefficient β is not affected by the singularity of the X′X matrix; hence, the estimate of β is unique.
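To make the parallel-lines structure concrete, here is a minimal numerical sketch (in Python with numpy, using small hypothetical data rather than any example in this chapter): ordinary least squares on a design matrix with one 0-1 intercept column per treatment plus a single covariate column recovers the group intercepts β0i and the common slope β.

```python
import numpy as np

# Hypothetical data: 3 treatment groups, covariate x, response y.
# y lies exactly on parallel lines with intercepts 1, 2, 3 and slope 0.5.
trt = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
x   = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
y   = np.array([1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0, 4.5])

# Full-rank design: one 0-1 intercept column per group, plus the covariate
X = np.column_stack([(trt == i).astype(float) for i in range(3)] + [x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, slope = coef[:3], coef[3]  # group intercepts beta_0i and common slope beta

print(np.round(b0, 3), round(slope, 3))  # [1. 2. 3.] 0.5
```

Using one intercept column per group (rather than an overall intercept plus group dummies) keeps the design full rank, so no generalized inverse is needed in this sketch.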
The following example, using data on the growth of oysters, illustrates the basic features of analysis of covariance. The goal is to determine
❏ if exposure to water heated artificially affects growth
❏ if the position in the water column (surface or bottom) affects growth.
Four bags with ten oysters in each bag are randomly placed at each of five stations in the cooling water canal of a power-generating plant. Each location, or station, is considered a treatment and is represented by the variable TRT. Each bag is considered to be one experimental unit. Two stations are located in the intake canal, and two stations are located in the discharge canal, one at the surface (TOP), the other at the bottom (BOTTOM) of each location. A single mid-depth station is located in a shallow portion of the bay near the power plant. The treatments are described below:
Treatment | Station |
---|---|
1 | INTAKE-BOTTOM |
2 | INTAKE-SURFACE |
3 | DISCHARGE-BOTTOM |
4 | DISCHARGE-SURFACE |
5 | BAY |
Stations in the intake canal act as controls for those in the discharge canal, which has a higher temperature. The station in the bay is an overall control in case some factor other than heat, such as water depth or location, is responsible for an observed change in growth rate.
The oysters are cleaned and measured at the beginning of the experiment and then again about one month later. The initial weight and the final weight are recorded for each bag. The data appear in Output 7.1.
Output 7.1 Data for Analysis of Covariance
Obs | trt | rep | initial | final |
1 | 1 | 1 | 27.2 | 32.6 |
2 | 1 | 2 | 32.0 | 36.6 |
3 | 1 | 3 | 33.0 | 37.7 |
4 | 1 | 4 | 26.8 | 31.0 |
5 | 2 | 1 | 28.6 | 33.8 |
6 | 2 | 2 | 26.8 | 31.7 |
7 | 2 | 3 | 26.5 | 30.7 |
8 | 2 | 4 | 26.8 | 30.4 |
9 | 3 | 1 | 28.6 | 35.2 |
10 | 3 | 2 | 22.4 | 29.1 |
11 | 3 | 3 | 23.2 | 28.9 |
12 | 3 | 4 | 24.4 | 30.2 |
13 | 4 | 1 | 29.3 | 35.0 |
14 | 4 | 2 | 21.8 | 27.0 |
15 | 4 | 3 | 30.3 | 36.4 |
16 | 4 | 4 | 24.3 | 30.5 |
17 | 5 | 1 | 20.4 | 24.6 |
18 | 5 | 2 | 19.6 | 23.4 |
19 | 5 | 3 | 25.1 | 30.3 |
20 | 5 | 4 | 18.1 | 21.8 |
You can address the objectives given above by analysis of covariance. The response variable is final weight, but the analysis must also account for initial weight. You can do this using initial weight as the covariate. The following SAS statements are required to compute the basic analysis:
proc glm;
class trt;
model final=trt initial / solution;
The CLASS statement specifies that TRT is a classification variable. The variable INITIAL is the covariate. The MODEL statement defines the model yij = β0 + τi + βxij + εij. Specifying the SOLUTION option requests printing of the coefficient vector.
Results of these statements appear in Output 7.2.
Output 7.2 Results of Analysis of Covariance
The GLM Procedure | |||||
Dependent Variable: final | |||||
Sum of | |||||
Source | DF | Squares | Mean Square | F Value | Pr > F |
Model | 5 | 354.4471767 | 70.8894353 | 235.05 | <.0001 |
Error | 14 | 4.2223233 | 0.3015945 | ||
Corrected Total | 19 | 358.6695000 |
R-Square | Coeff Var | Root MSE | final Mean |
0.988228 | 1.780438 | 0.549176 | 30.84500 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
trt | 4 | 198.4070000 | 49.6017500 | 164.47 | <.0001 |
initial | 1 | 156.0401767 | 156.0401767 | 517.38 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
trt | 4 | 12.0893593 | 3.0223398 | 10.02 | 0.0005 |
initial | 1 | 156.0401767 | 156.0401767 | 517.38 | <.0001 |
Standard | |||||
Parameter | Estimate | Error | t Value | Pr > |t| | |
Intercept | 2.494859769 B | 1.02786287 | 2.43 | 0.0293 | |
trt | 1 | -0.244459378 B | 0.57658196 | -0.42 | 0.6780 |
trt | 2 | -0.280271345 B | 0.49290825 | -0.57 | 0.5786 |
trt | 3 | 1.654757698 B | 0.42943036 | 3.85 | 0.0018 |
trt | 4 | 1.107113519 B | 0.47175112 | 2.35 | 0.0342 |
trt | 5 | 0.000000000 B | ⋅ | ⋅ | ⋅ |
initial | 1.083179819 | 0.04762051 | 22.75 | <.0001 | |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable. |
Consider the Type I and Type III SS (Type II and Type IV would be the same as Type III here). The Type I SS for TRT is the unadjusted treatment sum of squares. The ERROR SS for a simple analysis of variance can be reconstructed by subtracting the Type I SS from the TOTAL SS, for example,
Source | DF | SS | MS | F |
---|---|---|---|---|
TRT | 4 | 198.407 | 49.602 | 4.642 |
ERROR | 15 | 160.263 | 10.684 | |
TOTAL | 19 | 358.670 | | |
The resulting F-value indicates that p is less than .05. Thus, a simple analysis of variance leads to concluding that statistically significant treatment differences in final weight exist even when initial weights are not considered.
Now compare these results with the analysis of covariance. The Type III TRT SS is 12.089, whereas the Type I TRT SS equals the one-way ANOVA TRT SS of 198.407. The Type III TRT SS reflects differences among treatment means that have been adjusted to a common value of the covariate, INITIAL. In analysis of covariance, the Type III TRT SS is the adjusted treatment sum of squares; the Type I TRT SS is the unadjusted treatment sum of squares because it reflects the differences among treatment means prior to adjustment for the covariate. In this example, the unadjusted TRT SS is much larger than the adjusted one. However, the reduction in the error mean square from 10.684 to 0.302 yields an increase in the F-statistic from 4.642 in the simple analysis of variance to 10.02 in Output 7.2. The power of the test for treatment differences increases when the covariate is included because most of the error in the simple analysis of variance is due to variation in INITIAL values.
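The two error sums of squares can be reproduced outside of SAS. The following sketch (Python with numpy, using the oyster data from Output 7.1) fits both models by ordinary least squares:

```python
import numpy as np

# Oyster data from Output 7.1: 5 treatments (trt), 4 bags each
trt = np.repeat(np.arange(5), 4)
initial = np.array([27.2, 32.0, 33.0, 26.8,  28.6, 26.8, 26.5, 26.8,
                    28.6, 22.4, 23.2, 24.4,  29.3, 21.8, 30.3, 24.3,
                    20.4, 19.6, 25.1, 18.1])
final = np.array([32.6, 36.6, 37.7, 31.0,  33.8, 31.7, 30.7, 30.4,
                  35.2, 29.1, 28.9, 30.2,  35.0, 27.0, 36.4, 30.5,
                  24.6, 23.4, 30.3, 21.8])

def error_ss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return r @ r

D = np.column_stack([(trt == i).astype(float) for i in range(5)])  # trt dummies
anova_sse  = error_ss(D, final)                              # one-way ANOVA
ancova_sse = error_ss(np.column_stack([D, initial]), final)  # with covariate

print(round(anova_sse, 2), round(ancova_sse, 2))  # about 160.26 and 4.22
```

These match the error SS of the simple ANOVA (160.263) and of the covariance model in Output 7.2 (4.222).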
The last part of Output 7.2 contains the SOLUTION vector. In this one-factor case, the TRT estimates are obtained by setting the estimates for the last treatment (TRT 5) to 0. Therefore, the INTERCEPT estimate is the intercept for TRT 5, and the other four treatment effects are differences between each TRT and TRT 5. Because TRT 5 is the control, the output estimates, standard errors, and t-tests are for treatment versus control. Note that the means of TRT 3 and TRT 4 in the discharge canal differ significantly from TRT 5.
The coefficient associated with INITIAL is the pooled within-groups regression coefficient relating FINAL to INITIAL. The coefficient estimate is a weighted average of the regression coefficients of FINAL on INITIAL, estimated separately for each of the five treatment groups. This coefficient estimates that a difference of 1.083 units in FINAL is associated with a one-unit difference in INITIAL.
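The weighted-average interpretation can be sketched numerically (Python with numpy, using the oyster data from Output 7.1): the pooled estimate is the Sxx-weighted average of the five within-group slopes.

```python
import numpy as np

# Oyster data from Output 7.1
trt = np.repeat(np.arange(5), 4)
initial = np.array([27.2, 32.0, 33.0, 26.8,  28.6, 26.8, 26.5, 26.8,
                    28.6, 22.4, 23.2, 24.4,  29.3, 21.8, 30.3, 24.3,
                    20.4, 19.6, 25.1, 18.1])
final = np.array([32.6, 36.6, 37.7, 31.0,  33.8, 31.7, 30.7, 30.4,
                  35.2, 29.1, 28.9, 30.2,  35.0, 27.0, 36.4, 30.5,
                  24.6, 23.4, 30.3, 21.8])

# Within-group slopes b_i = Sxy_i / Sxx_i and their weights Sxx_i
slopes, weights = [], []
for i in range(5):
    dx = initial[trt == i] - initial[trt == i].mean()
    dy = final[trt == i] - final[trt == i].mean()
    slopes.append((dx @ dy) / (dx @ dx))
    weights.append(dx @ dx)

pooled = np.dot(weights, slopes) / np.sum(weights)  # Sxx-weighted average
print(round(pooled, 4))  # about 1.0832, the INITIAL coefficient in Output 7.2
```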
A MEANS statement requests the unadjusted treatment means of all continuous (non-CLASS) variables in the model, that is, the response variable and the covariate. You can suppress printing the covariate means by using the DEPONLY option. These means are not strictly relevant to an analysis of covariance unless they are used to determine the effect of the covariance adjustment. The DUNCAN and WALLER options, among others, for multiple-comparisons tests are also available, but they are not useful here.
The LSMEANS (least-squares means) statement produces the estimates that are usually called adjusted treatment means. They are defined as

ȳi. + β̂(x̄.. − x̄i.)

or, equivalently,

β̂0 + τ̂i + β̂x̄..

Consistent with the MODEL statement in the SAS statements above, this example uses the latter form. Recall that you also obtain adjusted means for the unbalanced two-way classification using the same LSMEANS statement, which is
lsmeans trt / stderr tdiff;
The TDIFF option requests LSD tests among the adjusted means. You can use the ADJUST= option to obtain alternative multiple comparison tests, depending on the relative seriousness of Type I and Type II error. These issues were discussed in more detail in Section 3.3.3. The LSMEANS and TDIFF results appear in Output 7.3.
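As a numerical check (Python with numpy, using the oyster data from Output 7.1 and the pooled slope 1.083180 reported in Output 7.2), the adjusted mean for treatment i can be computed as the unadjusted mean plus the slope times the distance from the group covariate mean to the overall covariate mean:

```python
import numpy as np

# Oyster data from Output 7.1
trt = np.repeat(np.arange(5), 4)
initial = np.array([27.2, 32.0, 33.0, 26.8,  28.6, 26.8, 26.5, 26.8,
                    28.6, 22.4, 23.2, 24.4,  29.3, 21.8, 30.3, 24.3,
                    20.4, 19.6, 25.1, 18.1])
final = np.array([32.6, 36.6, 37.7, 31.0,  33.8, 31.7, 30.7, 30.4,
                  35.2, 29.1, 28.9, 30.2,  35.0, 27.0, 36.4, 30.5,
                  24.6, 23.4, 30.3, 21.8])

beta = 1.083180          # pooled within-group slope reported in Output 7.2
grand = initial.mean()   # overall covariate mean, 25.76

# Adjusted mean for group i: ybar_i + beta * (grand mean of x - xbar_i)
adj = np.array([final[trt == i].mean() + beta * (grand - initial[trt == i].mean())
                for i in range(5)])
print(np.round(adj, 3))  # close to the LS means in Output 7.3
```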
Output 7.3 Results of Analysis of Covariance: Adjusted Treatment Means (Least-Squares Means)
Least Squares Means | ||||
trt | final LSMEAN | Standard Error | Pr > |t| | LSMEAN Number |
1 | 30.1531125 | 0.3339174 | <.0001 | 1 |
2 | 30.1173006 | 0.2827350 | <.0001 | 2 |
3 | 32.0523296 | 0.2796295 | <.0001 | 3 |
4 | 31.5046854 | 0.2764082 | <.0001 | 4 |
5 | 30.3975719 | 0.3621988 | <.0001 | 5 |
t for H0: LSMean(i)=LSMean(j) / Pr > |t| | |||||
i/j | 1 | 2 | 3 | 4 | 5 |
1 | 0.087941 | -4.1466 | -3.22289 | -0.42398 | |
0.9312 | 0.0010 | 0.0061 | 0.6780 | ||
2 | -0.08794 | -4.76003 | -3.55771 | -0.56861 | |
0.9312 | 0.0003 | 0.0032 | 0.5786 | ||
3 | 4.146599 | 4.76003 | 1.378002 | 3.853378 | |
0.0010 | 0.0003 | 0.1898 | 0.0018 | ||
4 | 3.222892 | 3.557715 | -1.378 | 2.346817 | |
0.0061 | 0.0032 | 0.1898 | 0.0342 | ||
5 | 0.42398 | 0.568608 | -3.8533 | -2.34682 | |
0.6780 | 0.5786 | 0.0018 | 0.0342 |
NOTE: To ensure overall protection level, only probabilities associated with preplanned comparisons should be used. |
The estimated least-squares means are followed by their standard errors, which are printed because of the STDERR option. The t-values and associated significance probabilities for all pairwise tests of treatment differences are printed because of the TDIFF option.
The table below shows the unadjusted and adjusted means for the response variable FINAL and the mean of the covariate, INITIAL for each treatment group:
TRT | FINAL Unadjusted Means | Adjusted Least-Squares Means | INITIAL Covariate Mean |
---|---|---|---|
1 | 34.475 | 30.153 | 29.750 |
2 | 31.650 | 30.117 | 27.175 |
3 | 30.850 | 32.052 | 24.650 |
4 | 32.225 | 31.504 | 26.425 |
5 | 25.025 | 30.398 | 20.800 |
Figure 7.1 illustrates the distinction between adjusted and unadjusted means. The five linear regressions for each treatment are parallel, each with a common slope. Four points appear on each regression line: the end points are the predicted values of the response variable FINAL for the minimum and maximum values of INITIAL in the data set. The other two are 1) the predicted value of FINAL at the INITIAL mean for that treatment, and 2) the predicted value of FINAL at the mean value of the covariate INITIAL for the entire data set (shown by the solid vertical line at the mean value of INITIAL = 25.76). The former are the unadjusted sample means of FINAL; the latter are the adjusted, or LS means. The light-shaded vertical lines represent the mean of the covariate INITIAL at each treatment; the light-shaded horizontal lines correspond to the unadjusted sample means.
Figure 7.1 Regressions for Five Oyster Data Treatments Showing Means and LS Means
You can see that there are large changes from the unadjusted to the adjusted treatment means for the variable FINAL. These changes result from the large treatment differences in the covariable INITIAL. Apparently, the random assignment of oysters to treatments did not result in equal mean initial weights. Some treatments, particularly TRT 5, received smaller oysters than other treatments. This biases the unadjusted treatment means. Computation of the adjusted treatment means is intended to remove the bias.
Note: Although the purpose of adjusted means is to remove bias resulting from unequal covariate means among the treatments, adjusted means are not always appropriate. The basic rule is that if the covariate means themselves depend on the treatments, adjustment is likely to be misleading, whereas if there is no reason to believe covariate means depend on treatment, failing to adjust is likely to be misleading. In the oyster growth example, there is no reason initial weights of oysters should depend on TRT. Therefore, you should use adjusted means. Consider, however, a typical example in plant breeding. Plant yield is affected by the population density of the plants. Accounting for plant density is essential to reduce error variance and hence allow manageable experiments to provide adequate power and precision. However, different plant varieties have inherently different plant densities, for a number of well-known reasons. Adjusting mean yield to a common plant density would distort differences among varieties you would actually see under realistic conditions. In this case, you should use unadjusted means. However, analysis of covariance is still useful because it improves precision.
You can use ESTIMATE statements to provide further insight into the mean and LS mean. The following SAS statements illustrate treatments 1 and 2 and their difference:
estimate 'trt 1 adj mean'
intercept 1 trt 1 0 0 0 0 initial 25.76;
estimate 'trt 2 adj mean'
intercept 1 trt 0 1 0 0 0 initial 25.76;
estimate 'adj trt diff' trt 1 -1 0 0 0;
The overall mean of the covariable INITIAL is x̄.. = 25.76; hence, the first two ESTIMATE statements compute the adjusted means β̂0 + τ̂i + β̂x̄.. for treatments 1 and 2. Because β̂0 and β̂x̄.. are the same for all adjusted means, the adjusted treatment difference is τ̂1 − τ̂2. The unadjusted means estimate β̂0 + τ̂i + β̂x̄i., where x̄i. is the sample mean of the covariate for the ith treatment. Use the following ESTIMATE statements to compute the unadjusted means:
estimate 'trt 1 unadj mean'
intercept 1 trt 1 0 0 0 0 initial 29.75;
estimate 'trt 2 unadj mean'
intercept 1 trt 0 1 0 0 0 initial 27.175;
estimate 'unadj diff' trt 1 -1 0 0 0 initial 2.575;
For the unadjusted means, x̄i. is different for each treatment, so the unadjusted treatment difference estimates τ̂1 − τ̂2 + β̂(x̄1. − x̄2.). This shows how the unadjusted means are confounded with the x̄i.. The results of both sets of ESTIMATE statements appear in Output 7.4.
Output 7.4 ESTIMATE Statements for Adjusted and Unadjusted Means
Standard | ||||
Parameter | Estimate | Error | t Value | Pr > |t| |
trt 1 adj mean | 30.1531125 | 0.33391743 | 90.30 | <.0001 |
trt 2 adj mean | 30.1173006 | 0.28273504 | 106.52 | <.0001 |
adj diff | 0.0358120 | 0.40722674 | 0.09 | 0.9312 |
trt 1 unadj mean | 34.4750000 | 0.27458811 | 125.55 | <.0001 |
trt 2 unadj mean | 31.6500000 | 0.27458811 | 115.26 | <.0001 |
unadj diff | 2.8250000 | 0.38832623 | 7.27 | <.0001 |
You can see that the apparent difference between treatments 1 and 2 in unadjusted means results from their different mean initial weights. When these are adjusted, the difference disappears. Notice that, with an equal number of observations per treatment, the standard errors of the unadjusted treatment means are equal, whereas for the adjusted means, they depend on the difference between x̄i. and x̄... As a final note, you can modify the LSMEANS statement to obtain results similar to Output 7.3, but for the unadjusted means. You use the statement
lsmeans trt / bylevel stderr tdiff;
The BYLEVEL option causes x̄i. to be used in place of x̄.. in computing the LS mean. These results are not shown, but they would simply complete the results in Output 7.4 for the unadjusted means and differences.
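The confounding of unadjusted means with the covariate means can be checked with simple arithmetic, using the pooled slope from Output 7.2 and the adjusted difference and covariate means for treatments 1 and 2 reported above:

```python
# Decomposition of the unadjusted difference between treatments 1 and 2:
# unadjusted diff = adjusted diff + beta_hat * (xbar_1 - xbar_2)
beta_hat = 1.083180            # pooled slope (Output 7.2)
adj_diff = 0.035812            # adjusted trt 1 - trt 2 difference (Output 7.4)
xbar1, xbar2 = 29.750, 27.175  # INITIAL means for trt 1 and trt 2

unadj_diff = adj_diff + beta_hat * (xbar1 - xbar2)
print(round(unadj_diff, 3))  # 2.825, the unadjusted difference in Output 7.4
```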
This section illustrates comparing means with contrasts, using the oyster growth example discussed in Section 7.2.1, “Covariance Model.” The five treatments can also be looked upon as a 2×2 factorial (BOTTOM/TOP × INTAKE/DISCHARGE) plus a CONTROL. The adjusted treatment means from the analysis of covariance can be analyzed further with four orthogonal contrasts implemented by the following CONTRAST statements:
contrast 'CONTROL VS. TREATMENT' TRT -1 -1 -1 -1 4;
contrast 'BOTTOM VS. TOP' TRT -1 1 -1 1 0;
contrast 'INTAKE VS. DISCHARGE' TRT -1 -1 1 1 0;
contrast 'BOT/TOP*INT/DIS' TRT 1 -1 -1 1 0;
The output that results from these statements follows the partitioning of sums of squares in Output 7.5. Note that the only significant contrast is INTAKE VS. DISCHARGE. Also, note that these are comparisons among adjusted means. If the objectives of your study compel you to define contrasts among unadjusted means, these must include the contribution from the β̂x̄i., which do not cancel out as they do for the adjusted means. For example, for the first contrast above, CONTROL VS. TREATMENT, you need to include the term β̂(−x̄1. − x̄2. − x̄3. − x̄4. + 4x̄5.), and thus the required SAS statement is
contrast 'CTL V TRT UNADJUSTED'
TRT -1 -1 -1 -1 4 INITIAL -24.85;
Notice that the F-value of the unadjusted contrast far exceeds that of the adjusted CONTROL VS. TREATMENT contrast, reflecting the large discrepancy between adjusted and unadjusted means, especially for the CONTROL, noted earlier.
Equivalent results can be obtained with the ESTIMATE statement, which also gives the estimated coefficients for the contrasts. All options for CONTRAST and ESTIMATE statements discussed in Chapter 3, “Analysis of Variance for Balanced Data,” and in Chapter 6, “Understanding Linear Models Concepts,” apply here. Although constructed to be orthogonal, these contrasts are not orthogonal to the covariable; hence, their sums of squares do not add to the adjusted treatment sums of squares.
Output 7.5 Results of Analysis of Covariance: Orthogonal Contrasts Plus an Unadjusted Example
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
CONTROL VS. TREATMENT | 1 | 0.52000411 | 0.52000411 | 1.72 | 0.2103 |
BOTTOM VS. TOP | 1 | 0.33879074 | 0.33879074 | 1.12 | 0.3071 |
INTAKE VS. DISCHARGE | 1 | 8.59108077 | 8.59108077 | 28.49 | 0.0001 |
BOT/TOP*INT/DIS | 1 | 0.22934155 | 0.22934155 | 0.76 | 0.3979 |
CTL V TRT UNADJ | 1 | 169.9923582 | 169.9923582 | 563.65 | <.0001 |
Multiple covariates are specified as continuous (non-CLASS) variables in the MODEL statement. If the CLASS variable is designated as the first independent variable, the Type I sums of squares for individual covariates can be added to get the adjusted sums of squares due to all covariates. The Type III sums of squares are the fully adjusted sums of squares for the individual regression coefficients as well as those for the adjusted treatment means.
Section 7.2 presented analysis of covariance assuming the equal slopes model. The unequal slopes model is a natural extension of covariance analysis. Both models are usually applied to data characterized by treatment groups and one or more covariates. The unequal slopes model allows you to test for heterogeneity of slopes—that is, it tests whether or not the regression coefficients are constant over groups. The analysis in Section 7.2 assumes constant regression coefficients and is invalid if this assumption fails. You can draw valid inference from the unequal slopes model, but it requires considerable care. This section presents the test for heterogeneity of slopes and inference strategies appropriate for the unequal slopes model.
Extending the one independent variable (covariate) and one-way treatment structure used in Section 7.2, the unequal slopes analysis-of-covariance model can be written
yij = β0i + β1ixij + εij
where i denotes the treatment group. The hypothesis of equal slopes is
H0: β1i = β1i' for all i ≠ i′
An alternate formulation of the model is
yij = β0 + αi + β1xij + δixij + εij
where β0 and β1 are overall intercept and slope coefficients, and αi and δi are coefficients for the treatment effect on intercept and slope, respectively. Comparing the two formulations of the model, β0i = β0 + αi and β1i = β1 + δi. Under the alternate formulation, the hypothesis of equal slopes becomes
H0: δi = 0 for all i = 1, 2,...,t
Note that any possible intercept differences are irrelevant to both hypotheses.
Regression relationships that differ among treatment groups actually reflect an interaction between the treatment groups and the covariates. In fact, the GLM procedure specifies and analyzes this phenomenon as an interaction. Thus, if you use the following statements, the expression X*A produces the appropriate statistics for estimating different regressions of Y on X for the different values, or classes, specified by A:
proc glm;
class a;
model y=a x x*a / solution;
This MODEL statement fits the formulation yij = β0 + αi + β1xij + δixij + εij. The αi correspond to A, β1xij corresponds to X, and δixij corresponds to X*A. In this application, the Type I sums of squares for this model provide the most useful information:
X | is the sum of squares due to a single regression of Y on X, ignoring the group. |
A | is the sum of squares due to different intercepts (adjusted treatment differences), assuming equal slopes. |
X*A | is the additional sum of squares due to different regression coefficients for the groups specified by the factor A. |
The associated sequence of tests provides a logical stepwise analysis to determine the most appropriate model. Equivalent results can also be obtained by fitting the nested-effects formulation yij = β0i + β1ixij + εij. Use the following statements:
proc glm;
class a;
model y=a x(a) / noint solution;
Here, the β0i correspond to A, and the β1i correspond to X(A). This formulation is more convenient than the alternative for obtaining estimates of the slopes for each treatment group. You can write a CONTRAST statement that generates a test equivalent to X*A in the previous model. However, X(A) does not test for the heterogeneity of slopes. Instead, it tests the hypothesis that all regression coefficients are 0. Also, A tests the hypothesis that all intercepts are equal to 0. For models like this one, the Type III (or Type IV) sums of squares have little meaning; it is not instructive to consider the effect of the CLASS variable over and above the effect of different regressions.
This section uses the oyster growth data from Output 7.1 to demonstrate the test for the homogeneity of slopes. In a practical situation, you would do this testing before proceeding to the inference based on equal slopes shown in Section 7.2. Use the following SAS statements to obtain the relevant test statistics:
proc glm;
class trt;
model final=trt initial trt*initial;
Output 7.6 shows the results.
Output 7.6 Unequal Slopes Analysis of Covariance for Oyster Growth Data
Sum of | |||||
Source | DF | Squares | Mean Square | F Value | Pr > F |
Model | 9 | 355.8354908 | 39.5372768 | 139.51 | <.0001 |
Error | 10 | 2.8340092 | 0.2834009 | ||
Corrected Total | 19 | 358.6695000 | |||
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
initial | 1 | 342.3578175 | 342.3578175 | 1208.03 | <.0001 |
trt | 4 | 12.0893593 | 3.0223398 | 10.66 | 0.0012 |
initial*trt | 4 | 1.3883141 | 0.3470785 | 1.22 | 0.3602 |
The Type I sums of squares show that
❏ the INITIAL weight has an effect on FINAL weight (F=1208.03, p<0.0001)
❏ TRT has an effect on FINAL weight, at any given INITIAL weight (F=10.66, p=0.0012)
❏ there is no significant difference in the INITIAL/FINAL relationship among the different levels of TRT (F=1.22, p=0.3602). That is, there is no evidence to contradict the null hypothesis of homogeneous slopes.
The last result is especially important, as it validates the analysis using the equal slopes model. The first two results essentially reiterate the results in Section 7.2.
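The sequential (Type I) sums of squares in Output 7.6 are differences in error sums of squares between successively larger models. A sketch (Python with numpy, using the oyster data from Output 7.1):

```python
import numpy as np

# Oyster data from Output 7.1
trt = np.repeat(np.arange(5), 4)
initial = np.array([27.2, 32.0, 33.0, 26.8,  28.6, 26.8, 26.5, 26.8,
                    28.6, 22.4, 23.2, 24.4,  29.3, 21.8, 30.3, 24.3,
                    20.4, 19.6, 25.1, 18.1])
final = np.array([32.6, 36.6, 37.7, 31.0,  33.8, 31.7, 30.7, 30.4,
                  35.2, 29.1, 28.9, 30.2,  35.0, 27.0, 36.4, 30.5,
                  24.6, 23.4, 30.3, 21.8])

def error_ss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return r @ r

ones = np.ones((20, 1))
D = np.column_stack([(trt == i).astype(float) for i in range(5)])
DX = D * initial[:, None]  # one slope column per treatment (initial*trt)

m0 = error_ss(ones, final)                              # intercept only
m1 = error_ss(np.column_stack([ones, initial]), final)  # + initial
m2 = error_ss(np.column_stack([D, initial]), final)     # + trt
m3 = error_ss(np.column_stack([D, DX]), final)          # + initial*trt

# Each Type I SS is the drop in error SS when its term enters the model
ss_initial, ss_trt, ss_inter = m0 - m1, m1 - m2, m2 - m3
print(round(ss_initial, 2), round(ss_trt, 2), round(ss_inter, 2))
# about 342.36, 12.09, and 1.39, as in Output 7.6
```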
In many cases, the individual regression slopes for each treatment contain useful information. Output 7.7 contains data to illustrate this. The data are from a study of the relationship between the price of oranges and sales per customer. The hypothesis is that sales vary as a function of price differences for different stores (STORE) and days of the week (DAY). The price is varied daily for two varieties of oranges. The variables P1 and P2 denote the prices for the two varieties, respectively. Q1 and Q2 are the sales per customer of the corresponding varieties.
Output 7.7 Orange Sales Data
Obs | STORE | DAY | P1 | P2 | Q1 | Q2 |
1 | 1 | 1 | 37 | 61 | 11.3208 | 0.0047 |
2 | 1 | 2 | 37 | 37 | 12.9151 | 0.0037 |
3 | 1 | 3 | 45 | 53 | 18.8947 | 7.5429 |
4 | 1 | 4 | 41 | 41 | 14.6739 | 7.0652 |
5 | 1 | 5 | 57 | 41 | 8.6493 | 21.2085 |
6 | 1 | 6 | 49 | 33 | 9.5238 | 16.6667 |
7 | 2 | 1 | 49 | 49 | 7.6923 | 7.1154 |
8 | 2 | 2 | 53 | 53 | 0.0017 | 1.0000 |
9 | 2 | 3 | 53 | 45 | 8.0477 | 24.2176 |
10 | 2 | 4 | 53 | 53 | 6.7358 | 2.9361 |
11 | 2 | 5 | 61 | 37 | 6.1441 | 40.5720 |
12 | 2 | 6 | 49 | 65 | 21.7939 | 2.8324 |
13 | 3 | 1 | 53 | 45 | 4.2553 | 6.0284 |
14 | 3 | 2 | 57 | 57 | 0.0017 | 2.0906 |
15 | 3 | 3 | 49 | 49 | 11.0196 | 13.9329 |
16 | 3 | 4 | 53 | 53 | 6.2762 | 6.5551 |
17 | 3 | 5 | 53 | 45 | 13.2316 | 10.6870 |
18 | 3 | 6 | 53 | 53 | 5.0676 | 5.1351 |
19 | 4 | 1 | 57 | 57 | 5.6235 | 3.9120 |
20 | 4 | 2 | 49 | 49 | 14.9893 | 7.2805 |
21 | 4 | 3 | 53 | 53 | 13.7233 | 16.3105 |
22 | 4 | 4 | 53 | 45 | 6.0669 | 23.8494 |
23 | 4 | 5 | 53 | 53 | 8.1602 | 4.1543 |
24 | 4 | 6 | 61 | 37 | 1.4423 | 21.1538 |
25 | 5 | 1 | 45 | 45 | 6.9971 | 6.9971 |
26 | 5 | 2 | 53 | 45 | 5.2308 | 3.6923 |
27 | 5 | 3 | 57 | 57 | 8.2560 | 10.6679 |
28 | 5 | 4 | 49 | 49 | 14.5000 | 16.7500 |
29 | 5 | 5 | 53 | 53 | 20.7627 | 15.2542 |
30 | 5 | 6 | 53 | 45 | 3.6115 | 21.5442 |
31 | 6 | 1 | 53 | 53 | 11.3475 | 4.9645 |
32 | 6 | 2 | 53 | 45 | 9.4650 | 11.7284 |
33 | 6 | 3 | 53 | 53 | 22.6103 | 14.8897 |
34 | 6 | 4 | 61 | 37 | 0.0020 | 19.2000 |
35 | 6 | 5 | 49 | 65 | 20.5997 | 2.3468 |
36 | 6 | 6 | 37 | 37 | 28.1828 | 17.9543 |
In this example, consider variety 1 only—that is, the response variable, sales, for variety 1 is Q1, and the covariable, price, is P1. Examples in Section 7.4 also involve variety 2. Here, you compute the unequal slopes covariance model using the SAS statements
proc glm;
class day;
model q1=p1 day p1*day/solution;
Output 7.8 shows the results of this analysis.
Output 7.8 Unequal Slopes Analysis of Covariance for Orange Sales Data
Sum of | |||||
Source | DF | Squares | Mean Square | F Value | Pr > F |
Model | 11 | 1111.522562 | 101.047506 | 4.64 | 0.0008 |
Error | 24 | 522.153228 | 21.756384 | ||
Corrected Total | 35 | 1633.675790 | |||
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
P1 | 1 | 516.5921408 | 516.5921408 | 23.74 | <.0001 |
DAY | 5 | 430.5384175 | 86.1076835 | 3.96 | 0.0093 |
P1*DAY | 5 | 164.3920040 | 32.8784008 | 1.51 | 0.2236 |
Standard | |||||
Parameter | Estimate | Error | t Value | Pr > |t| | |
Intercept | 73.27263578 B | 13.48373708 | 5.43 | <.0001 | |
P1 | -1.22521164 B | 0.26520396 | -4.62 | 0.0001 | |
DAY | 1 | -54.59714671 B | 19.73545845 | -2.77 | 0.0107 |
DAY | 2 | -34.78570099 B | 20.25105926 | -1.72 | 0.0987 |
DAY | 3 | -27.94295765 B | 29.42842946 | -0.95 | 0.3518 |
DAY | 4 | -24.12342640 B | 21.39334761 | -1.13 | 0.2706 |
DAY | 5 | 4.62631110 B | 30.62842608 | 0.15 | 0.8812 |
DAY | 6 | 0.00000000 B | ⋅ | ⋅ | ⋅ |
P1*DAY | 1 | 1.00474758 B | 0.39410534 | 2.55 | 0.0176 |
P1*DAY | 2 | 0.60164207 B | 0.39876566 | 1.51 | 0.1444 |
P1*DAY | 3 | 0.61415851 B | 0.57034268 | 1.08 | 0.2923 |
P1*DAY | 4 | 0.42959726 B | 0.41510986 | 1.03 | 0.3110 |
P1*DAY | 5 | 0.02936476 B | 0.57034268 | 0.05 | 0.9594 |
P1*DAY | 6 | 0.00000000 B | ⋅ | ⋅ | ⋅ |
From Output 7.8, you can see that there is no evidence to reject the null hypothesis of equal slopes (F for P1*DAY is 1.51 and p=0.2236). Ordinarily, you would then proceed as with the oyster growth data, using an equal slopes model. The Type I results here indicate a significant effect of price (P1) on sales and a significant effect of DAY on sales at any given price. For these data, however, a closer look at the estimates of the daily regression coefficients reveals additional information. Although the differences in the daily regressions are not statistically significant, it is instructive to look at their estimates.
The estimated daily regression slope coefficients are β̂1i = β̂1 + δ̂i. The estimate corresponding to P1 is β̂1, and the estimates corresponding to P1*DAY are the δ̂i. For example, for DAY 1, the estimated slope is β̂11 = −1.2252 + 1.0047 = −0.2205. You can use ESTIMATE statements to obtain the daily regression coefficients:
estimate 'P1:DAY 1' p1 1 p1*day 1 0 0 0 0 0;
estimate 'P1:DAY 2' p1 1 p1*day 0 1 0 0 0 0;
estimate 'P1:DAY 3' p1 1 p1*day 0 0 1 0 0 0;
estimate 'P1:DAY 4' p1 1 p1*day 0 0 0 1 0 0;
estimate 'P1:DAY 5' p1 1 p1*day 0 0 0 0 1 0;
estimate 'P1:DAY 6' p1 1 p1*day 0 0 0 0 0 1;
The results appear in Output 7.9.
Output 7.9 Estimated Regression Coefficients for Each TRT
Standard | ||||
Parameter | Estimate | Error | t Value | Pr > |t| |
P1:DAY 1 | -0.22046406 | 0.29152337 | -0.76 | 0.4569 |
P1:DAY 2 | -0.62356957 | 0.29779341 | -2.09 | 0.0470 |
P1:DAY 3 | -0.61105313 | 0.50493329 | -1.21 | 0.2380 |
P1:DAY 4 | -0.79561438 | 0.31934785 | -2.49 | 0.0200 |
P1:DAY 5 | -1.19584688 | 0.50493329 | -2.37 | 0.0263 |
P1:DAY 6 | -1.22521164 | 0.26520396 | -4.62 | 0.0001 |
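Each coefficient in Output 7.9 is simply the P1 estimate plus the corresponding P1*DAY estimate from Output 7.8. A quick check in Python (estimates transcribed from the output):

```python
# Daily slope = common P1 estimate (the DAY 6 slope) + P1*DAY deviation,
# using the solution estimates from Output 7.8
p1 = -1.22521164
p1_day = [1.00474758, 0.60164207, 0.61415851,
          0.42959726, 0.02936476, 0.00000000]

slopes = [p1 + d for d in p1_day]
for day, b in enumerate(slopes, start=1):
    print(f"P1:DAY {day}: {b:.8f}")   # matches the ESTIMATE results in Output 7.9
```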
Note that these estimated coefficients are larger in absolute value toward the end of the week. This is quite reasonable given the higher level of overall sales activity near the end of the week, which may result in a proportionately larger response in sales to changes in price. Thus, it is likely that a coefficient specifically testing for a linear trend in price response during the week would be significant.
You can obtain the results in Output 7.9 more conveniently using the nested-effects formulation, whose SAS statements are
proc glm;
class day;
model q1=day p1(day)/noint solution;
contrast 'equal slopes' p1(day) 1 0 0 0 0 -1,
p1(day) 0 1 0 0 0 -1,
p1(day) 0 0 1 0 0 -1,
p1(day) 0 0 0 1 0 -1,
p1(day) 0 0 0 0 1 -1;
The CONTRAST statement contains one independent comparison of the daily regression slopes for each degree of freedom among the daily regressions. In this case, there are 6 days and hence 5 DF. If all five differences are zero, then H0: all β1i equal must be true. Hence the contrast is equivalent to the test generated by P1*DAY in Output 7.8. Output 7.10 contains the results.
Output 7.10 Analysis of Orange Sales Data Using a Nested Covariance Model
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
DAY | 6 | 4008.414213 | 668.069035 | 30.71 | <.0001 |
P1(DAY) | 6 | 861.125290 | 143.520882 | 6.60 | 0.0003 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
DAY | 6 | 1250.581757 | 208.430293 | 9.58 | <.0001 |
P1(DAY) | 6 | 861.125290 | 143.520882 | 6.60 | 0.0003 |
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
equal slopes | 5 | 164.3920040 | 32.8784008 | 1.51 | 0.2236 |
Standard | |||||
Parameter | Estimate | Error | t Value | Pr > |t| | |
DAY | 1 | 18.67548906 | 14.41100810 | 1.30 | 0.2073 |
DAY | 2 | 38.48693478 | 15.10940884 | 2.55 | 0.0177 |
DAY | 3 | 45.32967813 | 26.15762403 | 1.73 | 0.0959 |
DAY | 4 | 49.14920937 | 16.60915881 | 2.96 | 0.0068 |
DAY | 5 | 77.89894687 | 27.50071487 | 2.83 | 0.0092 |
DAY | 6 | 73.27263578 | 13.48373708 | 5.43 | <.0001 |
P1(DAY) | 1 | -0.22046406 | 0.29152337 | -0.76 | 0.4569 |
P1(DAY) | 2 | -0.62356957 | 0.29779341 | -2.09 | 0.0470 |
P1(DAY) | 3 | -0.61105313 | 0.50493329 | -1.21 | 0.2380 |
P1(DAY) | 4 | -0.79561437 | 0.31934785 | -2.49 | 0.0200 |
P1(DAY) | 5 | -1.19584687 | 0.50493329 | -2.37 | 0.0263 |
P1(DAY) | 6 | -1.22521164 | 0.26520396 | -4.62 | 0.0001 |
As noted earlier, the Type I and Type III sums of squares provide no useful information: P1(DAY) tests the null hypothesis that all daily regression slopes are equal to zero. You can see that the contrast for equal slopes is identical to the equal-slopes test given by P1*DAY in Output 7.8. The parameter estimates are the most useful output. The daily regression coefficients, P1(DAY), are identical to the estimates obtained via the ESTIMATE statements in Output 7.9, although Output 7.10 is easier to obtain because no ESTIMATE statements are required. Output 7.10 also gives the intercept of the regression for each day. For example, using DAY 1 and P1(DAY) 1, you can see that the regression for DAY 1 is Q1 = 18.675 − 0.2205×P1.
Using more than one covariable is straightforward, and you can determine which covariables have a different coefficient for each treatment group. More complex designs are not difficult to implement, but they may be difficult to interpret.
Tests among adjusted means with the equal-slopes model apply to all values of the covariable. For example, Section 7.2.2 showed that the difference between the adjusted means of TRT 1 and TRT 2 is (μ + τ1 + βx̄) − (μ + τ2 + βx̄) = τ1 − τ2. If you use any other value of the covariable in this expression, the difference is still τ1 − τ2. However, for the unequal-slopes model, the adjusted mean is μ + τi + β1ix, and hence the difference depends on x. Typically, x̄ is used for x. If you compare TRT 1 and TRT 2 at x̄, the difference is τ1 − τ2 + (β11 − β12)x̄. If you evaluate the treatment difference at a different value of x, the difference changes. Thus, in the unequal-slopes model, you can compare treatments, but only conditional upon a specified value of the covariable.
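To see this dependence concretely, compare DAY 1 and DAY 6 using the nested intercept and slope estimates from Output 7.10 (values transcribed from the output; a sketch):

```python
# DAY 1 and DAY 6 regressions from Output 7.10: Q1 = a + b*P1
a1, b1 = 18.67548906, -0.22046406   # DAY 1 intercept, slope
a6, b6 = 73.27263578, -1.22521164   # DAY 6 intercept, slope

def diff(x):
    """Adjusted-mean difference, DAY 1 minus DAY 6, at covariable value x."""
    return (a1 + b1 * x) - (a6 + b6 * x)

print(diff(0))        # the intercept difference, about -54.6
print(diff(51.2222))  # about -3.13 at the mean price: a very different answer
```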
PROC GLM and PROC MIXED offer a great deal of flexibility for unequal slopes models, but you must be careful, because some defaults do not necessarily result in sensible tests.
Output 7.11 contains the Type III sums of squares and the LS means for the unequal slopes models for the orange sales data. The SAS statements are
proc glm;
class day;
model q1=p1 day p1*day;
lsmeans day;
Output 7.11 Type III SS and Default LS Means for Unequal Slopes Covariance Analysis of Orange Sales Data
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
P1 | 1 | 554.7860985 | 554.7860985 | 25.50 | <.0001 |
DAY | 5 | 201.1717701 | 40.2343540 | 1.85 | 0.1412 |
P1*DAY | 5 | 164.3920040 | 32.8784008 | 1.51 | 0.2236 |
Least Squares Means | |
DAY | Q1 LSMEAN |
1 | 7.3828299 |
2 | 6.5463159 |
3 | 14.0301792 |
4 | 8.3960731 |
5 | 16.6450125 |
6 | 10.5145730 |
The least-squares means are computed at the overall covariable mean, which for these data is 51.222. You can use the AT MEANS option with the LSMEANS statement to print the value of the covariable mean being used. The SAS statement is
lsmeans day/at means;
Output 7.12 shows the results.
Output 7.12 Least-Squares Means for Orange Sales Data Using the AT MEANS Option
Least Squares Means at P1=51.22222 |
DAY | Q1 LSMEAN |
1 | 7.3828299 |
2 | 6.5463159 |
3 | 14.0301792 |
4 | 8.3960731 |
5 | 16.6450125 |
6 | 10.5145730 |
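You can reproduce these least-squares means from the Output 7.10 estimates: each is that day's regression evaluated at the mean price, intercept + slope × 51.22222 (values transcribed from the output):

```python
# LS mean for each DAY = daily intercept + daily slope * mean(P1),
# using the nested-model estimates from Output 7.10
intercepts = [18.67548906, 38.48693478, 45.32967813,
              49.14920937, 77.89894687, 73.27263578]
slopes = [-0.22046406, -0.62356957, -0.61105313,
          -0.79561437, -1.19584687, -1.22521164]
p1_bar = 51.22222

lsmeans = [a + b * p1_bar for a, b in zip(intercepts, slopes)]
for day, m in enumerate(lsmeans, start=1):
    print(f"DAY {day}: {m:.7f}")   # matches Outputs 7.11 and 7.12
```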
Suppose you want to test the null hypothesis of equality among the means adjusted to a common value x̄. Clearly, the Type I sum of squares for DAY from Output 7.8 does not do this: it tests means adjusted only for a common slope, because it is not adjusted for P1*DAY, the differences among the slopes. What about the Type III sum of squares for DAY in Output 7.11? The Type III sum of squares tests the means adjusted not to x̄ but to x=0. You can see this from the following CONTRAST statement:
contrast 'trt' day 1 -1 0 0 0 0,
day 1 0 -1 0 0 0,
day 1 0 0 -1 0 0,
day 1 0 0 0 -1 0,
day 1 0 0 0 0 -1;
Output 7.13 shows the results.
Output 7.13 Contrast Testing the Equality of Sales Means Adjusted to Covariable Price=0
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
trt | 5 | 201.1717701 | 40.2343540 | 1.85 | 0.1412 |
You can see that the F=1.85 and p-value of 0.1412 are identical to the Type III SS results in Output 7.11.
You can use the following LSMEANS statement to compute adjusted means at x=0 to see what you are testing:
lsmeans day/at p1=0;
Output 7.14 shows the results.
Output 7.14 Orange Sales Adjusted Means at Price Covariable=0
Least Squares Means at P1=0 |
DAY | Q1 LSMEAN |
1 | 18.6754891 |
2 | 38.4869348 |
3 | 45.3296781 |
4 | 49.1492094 |
5 | 77.8989469 |
6 | 73.2726358 |
Recalling Output 7.10, these adjusted means are in fact the intercepts of the separate daily regression equations. Testing their equality makes no sense in this context, because the oranges are not going to be sold at a price of P1=0; that is, they are not going to be given away.
You can test the adjusted means at x̄ = 51.222 using the following CONTRAST statement:
contrast 'trt' day 1 -1 0 0 0 0
p1*day 51.2222 -51.2222 0 0 0 0,
day 1 0 -1 0 0 0 p1*day 51.2222 0 -51.2222 0 0 0,
day 1 0 0 -1 0 0 p1*day 51.2222 0 0 -51.2222 0 0,
day 1 0 0 0 -1 0 p1*day 51.2222 0 0 0 -51.2222 0,
day 1 0 0 0 0 -1 p1*day 51.2222 0 0 0 0 -51.2222;
This statement uses a set of independent comparisons of the unequal-slopes model differences, whose form is τi − τ6 + (β1i − β16)x̄. The number of comparisons in the contrast equals the DF for DAY. Output 7.15 gives the results.
Output 7.15 Contrast to Test the Equality of Means Adjusted to Mean Covariable Price=51.22
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
trt | 5 | 376.3758925 | 75.2751785 | 3.46 | 0.0170 |
You can change the value of the AT P1= option in the LSMEANS statement and the coefficients for P1*DAY in the above contrast to test the equality of the adjusted means at any value of the covariable deemed reasonable. Alternatively, you can center the covariable in the DATA step. For example, you can define a new covariable X=P1–51.22 and use X in place of P1 in the analysis. The default Type III sum of squares tests the equality of adjusted means at X=0, which corresponds to P1=51.22, the overall mean. The crucial thing to keep in mind with unequal slopes models is that the treatment difference changes with the covariable, so the test is only valid if it is done at a value of the covariable agreed to be reasonable.
Analysis of covariance can be applied to other experimental and treatment structures. This section illustrates covariance analysis of a two-factor factorial experiment with two covariates.
This example uses the data from the study of the relationship between the price of oranges and sales per customer described in Section 7.3 and presented in Output 7.7. Recall that the data set had two varieties. The section uses only response variable Q1, sales per customer for the first variety, to illustrate the main ideas. You can easily adapt these methods to Q2, the response variable for the second variety.
A model for the sales of oranges for variety 1 is
Q1 = μ + τi + δj + β1P1 + β2P2 + e
where

Q1 is the sales per customer for the first variety.
τi is the effect of the ith STORE, i = 1, 2, . . . , 6.
δj is the effect of the jth DAY, j = 1, 2, . . . , 6.
β1 is the coefficient of the relationship between sales Q1 and P1 (the price of one variety of oranges).
β2 is the coefficient of the relationship between Q1 and P2 (the price of the other variety of oranges).
e is the random error term.
Note that because there is no replication, the STORE×DAY interaction must serve as the error term. In this example, the primary focus is on the influence of the prices, P1 and P2, on sales, Q1. The DAY and STORE differences are of secondary importance.
To implement the model, use the following SAS statements:
proc glm;
class store day;
model q1=store day p1 p2 / solution;
lsmeans day / stderr;
The results appear in Output 7.16.
Output 7.16 Results of Analysis of Covariance: Two-Way Structure without Interaction
The GLM Procedure | |||||
Dependent Variable: q1 | |||||
Sum of | |||||
Source | DF | Squares | Mean Square | F Value | Pr > F |
Model | 12 | 1225.367548 | 102.113962 | 5.75 | 0.0002 |
Error | 23 | 408.308242 | 17.752532 | ||
Corrected Total | 35 | 1633.675790 |
R-Square | Coeff Var | Root MSE | y Mean |
0.750068 | 41.23842 | 4.213375 | 10.21711 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
store | 5 | 313.4198071 | 62.6839614 | 3.53 | 0.0163 |
day | 5 | 250.3972723 | 50.0794545 | 2.82 | 0.0396 |
p1 | 1 | 622.0082168 | 622.0082168 | 35.04 | <.0001 |
p2 | 1 | 39.5422519 | 39.5422519 | 2.23 | 0.1492 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
store | 5 | 223.8326734 | 44.7665347 | 2.52 | 0.0583 |
day | 5 | 433.0968700 | 86.6193740 | 4.88 | 0.0035 |
p1 | 1 | 538.1688512 | 538.1688512 | 30.32 | <.0001 |
p2 | 1 | 39.5422519 | 39.5422519 | 2.23 | 0.1492 |
Standard | |||||
Parameter | Estimate | Error | t Value | Pr > |t| | |
Intercept | 51.69987930 B | 9.79103443 | 5.28 | <.0001 | |
store | 1 | -7.64532641 B | 2.69194414 | -2.84 | 0.0093 |
store | 2 | -5.60226472 B | 2.46416942 | -2.27 | 0.0327 |
store | 3 | -7.36284806 B | 2.46416942 | -2.99 | 0.0066 |
store | 4 | -4.36498239 B | 2.48754952 | -1.75 | 0.0926 |
store | 5 | -5.02052157 B | 2.43612208 | -2.06 | 0.0508 |
store | 6 | 0.00000000 B | ⋅ | ⋅ | ⋅ |
day | 1 | -5.83036664 B | 2.51932754 | -2.31 | 0.0299 |
day | 2 | -4.89997548 B | 2.44708866 | -2.00 | 0.0572 |
day | 3 | 2.26978922 B | 2.54028189 | 0.89 | 0.3808 |
day | 4 | -2.65249315 B | 2.44667751 | -1.08 | 0.2895 |
day | 5 | 4.04702055 B | 2.55655852 | 1.58 | 0.1271 |
day | 6 | 0.00000000 B | ⋅ | ⋅ | ⋅ |
p1 | -0.83036470 | 0.15081334 | -5.51 | <.0001 | |
p2 | 0.14884706 | 0.09973319 | 1.49 | 0.1492 | |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
Least Squares Means |
Standard | |||
DAY | Q1 LSMEAN | Error | Pr > |t| |
1 | 5.5644154 | 1.7680833 | 0.0045 |
2 | 6.4948065 | 1.7289585 | 0.0010 |
3 | 13.6645712 | 1.7515046 | <.0001 |
4 | 8.7422889 | 1.7339197 | <.0001 |
5 | 15.4418026 | 1.7858085 | <.0001 |
6 | 11.3947820 | 1.7667260 | <.0001 |
In addition to the details previously discussed, the following are also of interest:
❏ The Type I SS for P1 and P2 can be summed to obtain a test of the combined contribution of both prices: F = [(622.008 + 39.542)/2] / 17.753 = 18.63 with (2, 23) DF.
❏ The Type III SS show that all effects are highly significant except P2, the price of the competing orange.
❏ Because there is no interaction, each coefficient estimate is the estimated mean difference between the corresponding CLASS variable level (STORE, DAY) and the last level of that CLASS variable.
❏ The P1 coefficient is negative, indicating the expected negatively sloping price response (demand function). The P2 coefficient, although not significant, has the expected positive sign for the price response of a competing product.
❏ Least-squares means are requested only for DAY; they show the expected higher sales toward the end of the week.
Contrasts and estimates of linear functions could, of course, be requested with this analysis.
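The combined-price test in the first bullet above can be assembled by hand from Output 7.16; a quick Python check (sums of squares and error mean square transcribed from the output):

```python
# Combined test of P1 and P2: pool their Type I SS (2 DF) over the error MS
ss_p1, ss_p2 = 622.0082168, 39.5422519   # Type I SS from Output 7.16
ms_error = 17.752532                      # error mean square, 23 DF

f_combined = ((ss_p1 + ss_p2) / 2) / ms_error
print(round(f_combined, 2))   # about 18.63 on (2, 23) DF
```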
The most complex covariance model discussed in this chapter is a two-factor factorial with two stages of subsampling. Output 7.17 shows data from a study whose objective is to estimate y, the weight of usable lint, from x, the total weight of cotton bolls. In addition, the researcher wants to see if lint estimation is affected by the variety of cotton (VARIETY) or the distance between planting rows (SPACING), using x, the boll weight (BOLLWT), as a covariate in the analysis of y, the lint weight (LINT). The study is a factorial experiment with two levels of VARIETY (37 and 213) and two levels of SPACING (30 and 40). There are two plants (PLANT) for each VARIETY×SPACING treatment combination, and from five to nine bolls per plant.
Output 7.17 Data for Analysis of Covariance: Two-Way Structure with Interaction
Obs | variety | spacing | plant | bollwt | lint |
1 | 37 | 30 | 3 | 8.4 | 2.9 |
2 | 37 | 30 | 3 | 8.0 | 2.5 |
3 | 37 | 30 | 3 | 7.4 | 2.7 |
4 | 37 | 30 | 3 | 8.9 | 3.1 |
5 | 37 | 30 | 5 | 5.6 | 2.1 |
6 | 37 | 30 | 5 | 8.0 | 2.7 |
7 | 37 | 30 | 5 | 7.6 | 2.5 |
8 | 37 | 30 | 5 | 5.4 | 1.5 |
9 | 37 | 30 | 5 | 6.9 | 2.5 |
10 | 37 | 40 | 3 | 4.5 | 1.3 |
11 | 37 | 40 | 3 | 9.1 | 3.1 |
12 | 37 | 40 | 3 | 9.0 | 3.1 |
13 | 37 | 40 | 3 | 8.0 | 2.3 |
14 | 37 | 40 | 3 | 7.2 | 2.2 |
15 | 37 | 40 | 3 | 7.6 | 2.5 |
16 | 37 | 40 | 3 | 9.0 | 3.0 |
17 | 37 | 40 | 3 | 2.3 | 0.6 |
18 | 37 | 40 | 3 | 8.7 | 3.0 |
19 | 37 | 40 | 5 | 8.0 | 2.6 |
20 | 37 | 40 | 5 | 7.2 | 2.5 |
21 | 37 | 40 | 5 | 7.6 | 2.4 |
22 | 37 | 40 | 5 | 6.9 | 2.2 |
23 | 37 | 40 | 5 | 6.9 | 2.5 |
24 | 37 | 40 | 5 | 7.6 | 2.4 |
25 | 37 | 40 | 5 | 4.7 | 1.4 |
26 | 213 | 30 | 3 | 4.6 | 1.7 |
27 | 213 | 30 | 3 | 6.8 | 1.7 |
28 | 213 | 30 | 3 | 3.5 | 1.3 |
29 | 213 | 30 | 3 | 2.4 | 1.0 |
30 | 213 | 30 | 3 | 3.0 | 1.0 |
31 | 213 | 30 | 5 | 2.8 | 0.5 |
32 | 213 | 30 | 5 | 3.6 | 0.9 |
33 | 213 | 30 | 5 | 6.7 | 1.9 |
34 | 213 | 40 | 0 | 7.4 | 2.1 |
35 | 213 | 40 | 0 | 4.9 | 1.0 |
36 | 213 | 40 | 0 | 5.7 | 1.0 |
37 | 213 | 40 | 0 | 3.0 | 0.7 |
38 | 213 | 40 | 0 | 4.7 | 1.5 |
39 | 213 | 40 | 0 | 5.0 | 1.3 |
40 | 213 | 40 | 0 | 2.8 | 0.4 |
41 | 213 | 40 | 0 | 5.2 | 1.2 |
42 | 213 | 40 | 0 | 5.6 | 1.0 |
43 | 213 | 40 | 3 | 4.5 | 1.0 |
44 | 213 | 40 | 3 | 5.6 | 1.2 |
45 | 213 | 40 | 3 | 2.0 | 0.7 |
46 | 213 | 40 | 3 | 1.2 | 0.2 |
47 | 213 | 40 | 3 | 4.2 | 1.2 |
48 | 213 | 40 | 3 | 5.3 | 1.2 |
49 | 213 | 40 | 3 | 7.0 | 1.7 |
The model for the analysis is
yijkl = μ + vi + τj + (vτ)ij + γ(vτ)ijk + βxijkl + εijkl
where
yijkl equals the weight of the lint for the lth boll of the kth PLANT within the ith VARIETY and jth SPACING combination.
μ is the intercept.
vi is the effect of the ith VARIETY.
τj is the effect of the jth SPACING.
(vτ)ij is the VARIETY×SPACING interaction.
γ(vτ)ijk is the effect of the kth PLANT in the (i,j)th VARIETY and SPACING combination.
xijkl is the total weight of each boll, the covariate.
β is the regression effect of the covariate.
εijkl equals the error variation among bolls within plants.
The primary focus of this study is on estimating lint weight from boll weight (that is, on the regression) and only secondarily on determining whether this relationship is affected by the VARIETY and SPACING factors. In the SAS program to analyze the data, the order of variables in the MODEL statement is chosen so that the Type I sums of squares provide the appropriate information:
proc glm;
class variety spacing plant;
model lint=bollwt variety spacing variety*spacing
plant(variety*spacing) / solution;
random plant(variety*spacing)/test;
Note that the RANDOM statement with the TEST option has been added because the plant-to-plant variation provides the appropriate error term. Results of the analysis appear in Output 7.18.
Because PLANT(VARIETY*SPACING) is a random effect, an alternative is to use PROC MIXED. You use the following SAS statements:
proc mixed;
class variety spacing plant;
model lint=bollwt variety spacing
variety*spacing/solution;
random plant(variety*spacing);
The results for this analysis appear in Output 7.19. Littell et al. (1996) discuss analysis of covariance for mixed models in much greater detail.
Output 7.18 Results of Analysis of Covariance: Two-Way Structure with Interaction
Dependent Variable: lint | |||||
Sum of | |||||
Source | DF | Squares | Mean Square | F Value | Pr > F |
Model | 8 | 31.16009287 | 3.89501161 | 80.70 | <.0001 |
Error | 40 | 1.93051938 | 0.04826298 | ||
Corrected Total | 48 | 33.09061224 | |||
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
bollwt | 1 | 29.06931406 | 29.06931406 | 602.31 | <.0001 |
variety | 1 | 1.26353553 | 1.26353553 | 26.18 | <.0001 |
spacing | 1 | 0.46664798 | 0.46664798 | 9.67 | 0.0034 |
variety*spacing | 1 | 0.09326994 | 0.09326994 | 1.93 | 0.1722 |
plant(variet*spacin) | 4 | 0.26732535 | 0.06683134 | 1.38 | 0.2565 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
bollwt | 1 | 11.11855999 | 11.11855999 | 230.37 | <.0001 |
variety | 1 | 0.94242614 | 0.94242614 | 19.53 | <.0001 |
spacing | 1 | 0.37483940 | 0.37483940 | 7.77 | 0.0081 |
variety*spacing | 1 | 0.04785515 | 0.04785515 | 0.99 | 0.3253 |
plant(variet*spacin) | 4 | 0.26732535 | 0.06683134 | 1.38 | 0.2565 |
Tests of Hypotheses for Mixed Model Analysis of Variance | |||||
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
bollwt | 1 | 11.118560 | 11.118560 | 230.37 | <.0001 |
Error: MS(Error) | 40 | 1.930519 | 0.048263 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F | |
* | variety | 1 | 0.942426 | 0.942426 | 16.27 | 0.0021 |
Error | 10.657 | 0.617126 | 0.057907 | |||
Error: 0.5194*MS(plant(variet*spacin)) + 0.4806*MS(Error) | ||||||
* | This test assumes one or more other fixed effects are zero. | |||||
Source | DF | Type III SS | Mean Square | F Value | Pr > F | |
* | spacing | 1 | 0.374839 | 0.374839 | 5.76 | 0.0660 |
Error | 4.6073 | 0.300008 | 0.065116 | |||
Error: 0.9076*MS(plant(variet*spacin)) + 0.0924*MS(Error) | ||||||
* | This test assumes one or more other fixed effects are zero. |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
variety*spacing | 1 | 0.047855 | 0.047855 | 0.74 | 0.4324 |
Error | 4.6791 | 0.303859 | 0.064939 | ||
Error: 0.8981*MS(plant(variet*spacin)) + 0.1019*MS(Error) |
Standard | |||||
Parameter | Estimate | Error | t Value | Pr > |t| | |
Intercept | -.2724440749 B | 0.11934010 | -2.28 | 0.0278 | |
bollwt | 0.3056076686 | 0.02013479 | 15.18 | <.0001 | |
variety | 37 | 0.4232705043 B | 0.12964467 | 3.26 | 0.0022 |
variety | 213 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
spacing | 30 | 0.0379572553 B | 0.15161542 | 0.25 | 0.8036 |
spacing | 40 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
variety*spacing | 37 30 | 0.0236449357 B | 0.19897993 | 0.12 | 0.9060 |
variety*spacing | 37 40 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
variety*spacing | 213 30 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
variety*spacing | 213 40 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
plant(variet*spacin) | 3 37 30 | 0.0892286888 B | 0.15033417 | 0.59 | 0.5562 |
plant(variet*spacin) | 5 37 30 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
plant(variet*spacin) | 3 37 40 | -.0271310434 B | 0.11085696 | -0.24 | 0.8079 |
plant(variet*spacin) | 5 37 40 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
plant(variet*spacin) | 3 213 30 | 0.3337196850 B | 0.16055649 | 2.08 | 0.0441 |
plant(variet*spacin) | 5 213 30 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
plant(variet*spacin) | 0 213 40 | -.0984914494 B | 0.11151946 | -0.88 | 0.3824 |
plant(variet*spacin) | 3 213 40 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
The Type I SS for BOLLWT is what would be obtained by a simple linear regression of LINT on BOLLWT. If you ran this simple linear regression, you would get an R2 value of 29.069/33.091 = 0.878, a residual mean square of (33.091 − 29.069)/47 = 0.08557, and an F-statistic of 339.69, thus indicating a strong relationship of lint weight to boll weight.
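These summary statistics can be recovered from the ANOVA table in Output 7.18; a quick Python check (the results agree with the quoted values up to rounding):

```python
# Simple linear regression of LINT on BOLLWT, reconstructed from the ANOVA table
ss_bollwt = 29.06931406   # Type I SS for bollwt (1 DF)
ss_total = 33.09061224    # corrected total SS (48 DF)

r2 = ss_bollwt / ss_total
ms_resid = (ss_total - ss_bollwt) / 47   # residual DF = 48 - 1
f_stat = ss_bollwt / ms_resid
print(round(r2, 3), round(ms_resid, 5), round(f_stat, 1))
```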
The Type III tests from the RANDOM statement with the TEST option show a nonsignificant contribution from the VARIETY*SPACING interaction (F=0.74, p-value 0.4324). The VARIETY effect (F=16.27, p-value 0.0021) is statistically significant, whereas the SPACING main effect (F=5.76, p-value 0.0660) is only marginally significant. Note that the error terms for the VARIETY and SPACING main effects and their interaction are linear combinations of the PLANT(VARIETY*SPACING) and ERROR mean squares. This follows from the complex set of expected mean squares that results from the analysis of covariance.
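The synthesized denominators reported by the TEST option can be verified from the mean squares themselves. For the VARIETY test, for example (coefficients and mean squares transcribed from Output 7.18; a sketch):

```python
# Synthesized error for VARIETY reported by the TEST option:
# 0.5194*MS(plant(variety*spacing)) + 0.4806*MS(Error)
ms_plant, ms_error = 0.06683134, 0.04826298
ms_denom = 0.5194 * ms_plant + 0.4806 * ms_error

f_variety = 0.94242614 / ms_denom   # Type III MS for variety over synthesized MS
print(round(ms_denom, 6), round(f_variety, 2))   # about 0.057907 and 16.27
```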
It might seem simpler to conclude, from inspection of the analysis-of-covariance sources of variation, that PLANT(VARIETY*SPACING) is the proper error term for VARIETY, SPACING, and VARIETY*SPACING, and to use the statement
test h=variety spacing variety*spacing
e=plant(variety*spacing);
in place of the RANDOM statement. If you do this, however, the resulting test statistics will be affected, and you risk drawing erroneous conclusions. Now consider the results obtained using PROC MIXED.
Output 7.19 Analysis of Covariance Results Using PROC MIXED
Covariance Parameter Estimates | |
Cov Parm | Estimate |
plant(variet*spacin) | 0 |
Residual | 0.04995 |
Solution for Fixed Effects | |||||||
Standard | |||||||
Effect | variety | spacing | Estimate | Error | DF | t Value | Pr > |t| |
Intercept | -0.3210 | 0.1078 | 4 | -2.98 | 0.0408 | ||
bollwt | 0.3041 | 0.01990 | 40 | 15.28 | <.0001 | ||
variety | 37 | 0.4671 | 0.09351 | 4 | 5.00 | 0.0075 | |
variety | 213 | 0 | ⋅ | ⋅ | ⋅ | ⋅ | |
spacing | 30 | 0.3013 | 0.09720 | 4 | 3.10 | 0.0362 | |
spacing | 40 | 0 | ⋅ | ⋅ | ⋅ | ⋅ | |
variety*spacing | 37 | 30 | -0.1844 | 0.1350 | 4 | -1.37 | 0.2436 |
variety*spacing | 37 | 40 | 0 | ⋅ | ⋅ | ⋅ | ⋅ |
variety*spacing | 213 | 30 | 0 | ⋅ | ⋅ | ⋅ | ⋅ |
variety*spacing | 213 | 40 | 0 | ⋅ | ⋅ | ⋅ | ⋅ |
Type 3 Tests of Fixed Effects | ||||
Num | Den | |||
Effect | DF | DF | F Value | Pr > F |
bollwt | 1 | 40 | 233.51 | <.0001 |
variety | 1 | 4 | 18.21 | 0.0130 |
spacing | 1 | 4 | 9.68 | 0.0358 |
variety*spacing | 1 | 4 | 1.87 | 0.2436 |
Compared to Output 7.18 for PROC GLM, the parameter estimates and F-values differ somewhat. This is partly because PROC MIXED recovers information from the random model effects, similar to the recovery of interblock information for incomplete-blocks designs discussed in Chapter 4, and partly because the variance component estimate for PLANT(VARIETY*SPACING) is zero. The MIXED default for computing F-values in the presence of zero or negative variance component estimates differs from the F-ratios that PROC GLM derives from the expected mean squares.
Section 4.4.2 discussed the possible bias in F-statistics that the MIXED default of setting negative variance component estimates to zero may introduce. This is especially evident in this example with the main effect test for SPACING. The p-value here is 0.0358, whereas it was 0.0660 with the GLM analysis. The MIXED result reflects potential bias. Here, as in Section 4.4.2, you can avoid this problem by using the METHOD=TYPE3 option. The results are not shown here, but they are very similar to the results in Output 7.18. Slight discrepancies result from the recovery of random-effects information present in PROC MIXED but not in PROC GLM.
Both the GLM and MIXED results suggest dropping the terms VARIETY*SPACING and PLANT(VARIETY*SPACING). For VARIETY*SPACING, the justification is the same in both analyses, namely the F-test. With MIXED, there is no test for PLANT(VARIETY*SPACING); the zero variance component estimate provides equivalent justification. Given these results, the model has more terms than necessary; the extra coefficient estimates serve no purpose and can make further inference needlessly awkward. Drop VARIETY*SPACING and PLANT(VARIETY*SPACING), and use the following GLM statements to re-estimate the model:
model lint=bollwt variety spacing / solution;
lsmeans variety spacing / stderr;
You can use equivalent MIXED statements and obtain the same results. The results appear in Output 7.20. They differ only slightly from those in Outputs 7.18 and 7.19.
Output 7.20 Results of a Simplified Covariance Analysis: Two-Way Structure with Interaction
Sum of | |||||
Source | DF | Squares | Mean Square | F Value | Pr > F |
Model | 3 | 30.79949757 | 10.26649919 | 201.65 | <.0001 |
Error | 45 | 2.29111467 | 0.05091366 | ||
Corrected Total | 48 | 33.09061224 | |||
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
bollwt | 1 | 29.06931406 | 29.06931406 | 570.95 | <.0001 |
variety | 1 | 1.26353553 | 1.26353553 | 24.82 | <.0001 |
spacing | 1 | 0.46664798 | 0.46664798 | 9.17 | 0.0041 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
bollwt | 1 | 11.57173388 | 11.57173388 | 227.28 | <.0001 |
variety | 1 | 1.19732512 | 1.19732512 | 23.52 | <.0001 |
spacing | 1 | 0.46664798 | 0.46664798 | 9.17 | 0.0041 |
Standard | |||||
Parameter | Estimate | Error | t Value | Pr > |t| | |
Intercept | -.2769483300 B | 0.10384452 | -2.67 | 0.0106 | |
bollwt | 0.3014429094 | 0.01999507 | 15.08 | <.0001 | |
variety | 37 | 0.4106564020 B | 0.08468173 | 4.85 | <.0001 |
variety | 213 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
spacing | 30 | 0.2052058951 B | 0.06778167 | 3.03 | 0.0041 |
spacing | 40 | 0.0000000000 B | ⋅ | ⋅ | ⋅ |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
Least Squares Means | |||
Standard | |||
var | lint LSMEAN | Error | Pr > |t| |
37 | 2.00805710 | 0.05320406 | <.0001 |
213 | 1.59740070 | 0.05523778 | <.0001 |
Standard | |||
spac | lint LSMEAN | Error | Pr > |t| |
30 | 1.90533185 | 0.05479483 | <.0001 |
40 | 1.70012595 | 0.03988849 | <.0001 |
This model specifies a single regression coefficient relating LINT to BOLLWT (0.3014) but different intercepts for the four treatment combinations. These intercepts can be constructed from the SOLUTION vector by summing the appropriate component values.
For example, for VARIETY=37, SPACING=30, the model estimate is
y = μ + v1 + τ1 + βx = −.2769 + .4107 + .2052 + .3014x
For the other treatment combinations, the results are

VARIETY | SPACING | Values for Model
37 | 40 | 0.1338 + 0.3014x
213 | 30 | –0.0715 + 0.3014x
213 | 40 | –0.2769 + 0.3014x
Note that these results can be obtained with ESTIMATE statements. Least-squares means appear in Output 7.20; other statistics can be obtained but are not necessary in this situation.
A standard method for analyzing treatments with quantitative levels is to decompose the treatment sum of squares using orthogonal polynomial contrasts, that is, contrasts whose coefficients measure the linear, quadratic, and higher-order regression effects associated with the treatment levels. Most statistical methods textbooks have tables of orthogonal polynomial coefficients for balanced data with equally spaced treatment levels. You can use the ORPOL function in PROC IML to determine the coefficients for a design that standard tables do not cover, such as one with unequally spaced levels. Section 7.6.2 shows you how to use the ORPOL function.
In many practical applications, orthogonal polynomials are awkward to use. Often, you want to estimate the regression equation, not merely decide what is “significant.” Even with treatment designs covered by standard tables, extracting the regression equation from orthogonal polynomials is laborious. For factorial experiments, interest usually centers on interaction. That is, are the regressions over the quantitative factor the same for all levels of the other factor? Except for very simple factorial treatment designs, trying to use orthogonal polynomials to measure interaction can become a daunting task.
This section presents analysis-of-covariance methods that are equivalent to orthogonal polynomial contrasts. The main advantage of the covariance, or direct regression, approach is that in most cases, it is easier to implement using SAS.
Output 7.21 contains data from an experiment designed to compare response to increasing dosage for two types of drug. There were three levels of the actual dosage (DOSE in the SAS data set): 1, 10, and 100 units. The data were analyzed using LOGDOSE, the base-10 logarithm of the dosage. Note that the levels of LOGDOSE are equally spaced. The experiment was conducted as a randomized-complete-blocks design; BLOC denotes the blocks and Y denotes the response variable.
Output 7.21 Data for a Type-Dose Factorial Orthogonal Polynomial Example
Obs | bloc | type | dose | logdose | y |
1 | 1 | 1 | 1 | 0 | 63 |
2 | 1 | 2 | 1 | 0 | 59 |
3 | 1 | 1 | 10 | 1 | 62 |
4 | 1 | 2 | 10 | 1 | 62 |
5 | 1 | 1 | 100 | 2 | 62 |
6 | 1 | 2 | 100 | 2 | 68 |
7 | 2 | 1 | 1 | 0 | 50 |
8 | 2 | 2 | 1 | 0 | 49 |
9 | 2 | 1 | 10 | 1 | 49 |
10 | 2 | 2 | 10 | 1 | 55 |
11 | 2 | 1 | 100 | 2 | 48 |
12 | 2 | 2 | 100 | 2 | 58 |
13 | 3 | 1 | 1 | 0 | 53 |
14 | 3 | 2 | 1 | 0 | 47 |
15 | 3 | 1 | 10 | 1 | 52 |
16 | 3 | 2 | 10 | 1 | 51 |
17 | 3 | 1 | 100 | 2 | 51 |
18 | 3 | 2 | 100 | 2 | 50 |
19 | 4 | 1 | 1 | 0 | 52 |
20 | 4 | 2 | 1 | 0 | 48 |
21 | 4 | 1 | 10 | 1 | 54 |
22 | 4 | 2 | 10 | 1 | 49 |
23 | 4 | 1 | 100 | 2 | 55 |
24 | 4 | 2 | 100 | 2 | 72 |
The analysis of variance of these data is as follows:
SOURCE OF VARIATION | DF
Block | 3
Type | 1
Log Dose | 2
Type × Log Dose | 2
Error | 15
The dose main effect can be partitioned into linear and quadratic components by orthogonal polynomials whose contrast coefficients are
COEFFICIENT FOR LOGDOSE
CONTRAST | 0 | 1 | 2
Linear | –1 | 0 | 1
Quadratic | –1 | 2 | –1
Similarly, the Type×Log Dose interaction can be partitioned into a Linear×Type and a Quadratic×Type component. Because the log dose levels are equally spaced, you can look up the contrast coefficients shown above; most statistical methods texts have such a table. If you don't have a table readily available, or if you want contrasts for a situation not in the table, for instance, partitioning DOSE rather than Log Dose into linear and quadratic components, you can use the ORPOL function in PROC IML, demonstrated in Section 7.6.2.
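Outside of SAS, the same coefficients can be generated numerically. Below is a small Python/numpy sketch of an ORPOL-like helper (the name `orpol` and the QR-based construction are illustrative, not SAS's implementation); its columns agree with the tabled contrasts up to sign and scaling:

```python
import numpy as np

def orpol(levels, maxdeg=None):
    """Orthogonal polynomial contrast coefficients for the given
    (possibly unequally spaced) levels, via QR of a Vandermonde matrix."""
    x = np.asarray(levels, dtype=float)
    if maxdeg is None:
        maxdeg = len(x) - 1
    V = np.vander(x, maxdeg + 1, increasing=True)  # columns: 1, x, x^2, ...
    Q, R = np.linalg.qr(V)
    Q = Q * np.sign(np.diag(R))  # fix signs so leading coefficients are positive
    return Q                     # column k = orthonormal degree-k contrast

coef = orpol([0, 1, 2])          # the three LOGDOSE levels
lin, quad = coef[:, 1], coef[:, 2]
print(lin * np.sqrt(2))          # proportional to the Linear contrast (-1, 0, 1)
print(quad * np.sqrt(6))         # proportional to the Quadratic contrast, up to sign
```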
For analysis with LOGDOSE, use the following SAS statements:
proc glm;
class bloc type logdose;
model y=bloc type|logdose;
contrast 'linear logdose' logdose -1 0 1;
contrast 'quadratic logdose' logdose -1 2 -1;
contrast 'linear logdose x type' type*logdose 1 0 -1 -1 0 1;
contrast 'quad logdose x type' type*logdose 1 -2 1 -1 2 -1;
lsmeans type*logdose;
Output 7.22 contains the results.
Output 7.22 Analysis of Variance for Type-Dose Data
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model | 8 | 816.500000 | 102.062500 | 6.06 | 0.0014 |
Error | 15 | 252.458333 | 16.830556 | ||
Corrected Total | 23 | 1068.958333 | |||
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
bloc | 3 | 538.7916667 | 179.5972222 | 10.67 | 0.0005 |
type | 1 | 12.0416667 | 12.0416667 | 0.72 | 0.4109 |
logdose | 2 | 121.5833333 | 60.7916667 | 3.61 | 0.0524 |
type*logdose | 2 | 144.0833333 | 72.0416667 | 4.28 | 0.0338 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
bloc | 3 | 538.7916667 | 179.5972222 | 10.67 | 0.0005 |
type | 1 | 12.0416667 | 12.0416667 | 0.72 | 0.4109 |
logdose | 2 | 121.5833333 | 60.7916667 | 3.61 | 0.0524 |
type*logdose | 2 | 144.0833333 | 72.0416667 | 4.28 | 0.0338 |
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
linear logdose | 1 | 115.5625000 | 115.5625000 | 6.87 | 0.0193 |
quadratic logdose | 1 | 6.0208333 | 6.0208333 | 0.36 | 0.5587 |
linear logdose x type | 1 | 138.0625000 | 138.0625000 | 8.20 | 0.0118 |
quad logdose x type | 1 | 6.0208333 | 6.0208333 | 0.36 | 0.5587 |
From this output you can see that there is a significant Type×Log Dose interaction. The Linear Logdose×Type interaction explains most of the interaction. A look at the Type×Log Dose least-squares means (Output 7.23) reveals why. For Type 1, LOGDOSE does not affect mean response, whereas for Type 2, mean response increases approximately linearly with increasing LOGDOSE.
Output 7.23 Least-Squares Means for Type-Dose Data
Least Squares Means | ||
type | logdose | y LSMEAN |
1 | 0 | 54.5000000 |
1 | 1 | 54.2500000 |
1 | 2 | 54.0000000 |
2 | 0 | 50.7500000 |
2 | 1 | 54.2500000 |
2 | 2 | 62.0000000 |
By inspection, a linear regression for each type appears sufficient to explain the LOGDOSE effect. You can use the following SAS code to formally confirm this:
proc glm;
class bloc type logdose;
model y=bloc type logdose(type);
contrast 'lin in type 1' logdose(type) 1 0 -1 0 0 0;
contrast 'lin in type 2' logdose(type) 0 0 0 1 0 -1;
contrast 'quad in type 1' logdose(type) 1 -2 1 0 0 0;
contrast 'quad in type 2' logdose(type) 0 0 0 1 -2 1;
Output 7.24 shows the results.
Output 7.24 Orthogonal Polynomial Contrast Results within Each Type
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
lin in type 1 | 1 | 0.5000000 | 0.5000000 | 0.03 | 0.8655 |
lin in type 2 | 1 | 253.1250000 | 253.1250000 | 15.04 | 0.0015 |
quad in type 1 | 1 | 0.0000000 | 0.0000000 | 0.00 | 1.0000 |
quad in type 2 | 1 | 12.0416667 | 12.0416667 | 0.72 | 0.4109 |
You can see that neither type has a significant quadratic regression; the F-values are 0.00 and 0.72 for Types 1 and 2, respectively. Type 2 does have a highly significant linear regression, as shown by the Lin In Type 2 contrast: the F-value is 15.04, and the associated p-value is 0.0015. You would then fit a linear regression over LOGDOSE for each type.
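The contrast sums of squares in Output 7.24 can be reproduced by hand from the LS-means in Output 7.23. For a single-degree-of-freedom contrast applied to means of n observations each (here n = 4 blocks), SS = n (Σ cᵢ ȳᵢ)² / Σ cᵢ². A quick arithmetic check in plain Python, outside SAS (the helper function is ours):

```python
# Reproduce the single-df contrast sums of squares from the cell means:
#   SS = n * (sum of c_i * ybar_i)^2 / (sum of c_i^2),  n = 4 blocks per mean.
means_type1 = [54.50, 54.25, 54.00]   # LS-means from Output 7.23, Type 1
means_type2 = [50.75, 54.25, 62.00]   # LS-means from Output 7.23, Type 2
n = 4

def contrast_ss(coef, means, n):
    est = sum(c * m for c, m in zip(coef, means))
    return n * est**2 / sum(c**2 for c in coef)

print(contrast_ss([-1, 0, 1], means_type1, n))  # 0.5     (lin in type 1)
print(contrast_ss([-1, 0, 1], means_type2, n))  # 253.125 (lin in type 2)
print(contrast_ss([1, -2, 1], means_type2, n))  # 12.0417 (quad in type 2)
```

These match the Contrast SS column of Output 7.24, which confirms that each contrast is simply a scaled comparison of the cell means.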
The contrast coefficients used in the previous section are the standard orthogonal polynomial coefficients for equally spaced treatments. You can find these coefficients in tables contained in many statistical methods textbooks. Suppose, however, that you want to partition treatment effects into linear effects, quadratic effects, and so forth, but your treatment levels are not equally spaced. Or suppose that you simply don’t have convenient access to a table of standard orthogonal polynomial contrasts. The interactive matrix algebra procedure PROC IML has a function, ORPOL, that computes orthogonal polynomial contrasts for any set of quantitative treatment levels. The function is simple to use, and does not require knowledge of matrix algebra.
To illustrate the ORPOL function, suppose you want to use DOSE rather than LOGDOSE in the example in Section 7.6.1. The levels are 1, 10, and 100, which are not equally spaced, so you won't find the correct coefficients in a standard table. Use the following PROC IML statements:
proc iml;
levels={1,10,100};
coef=orpol(levels);
print coef;
Output 7.25 shows the results. The LEVELS={level 1, level 2,...} statement defines a variable named LEVELS that contains a list of the treatment levels. Note that the treatment levels are separated by commas. The name of the variable is your choice; here it is called LEVELS, but you can give it any name you like within the conventions of allowable SAS variable names. The variable named COEF will contain the contrast coefficients; its name is your choice as well. You set it equal to the ORPOL function and put the variable with the treatment levels in parentheses. The PRINT statement causes the variable COEF to be printed.
Output 7.25 Contrast Coefficients for Unequally Spaced Levels of DOSE Using ORPOL
COEF | ||
0.5773503 | -0.464991 | 0.6711561 |
0.5773503 | -0.348743 | -0.738272 |
0.5773503 | 0.8137335 | 0.0671156 |
The first column of numbers, all 0.577, is the contrast for the mean, which is rarely, if ever, used. The second column gives you the coefficients for the linear contrasts. The third column gives you the quadratic contrast coefficients. Note that for each contrast, there is one coefficient per treatment (in this case dose) level. Also, there is one contrast per treatment degree of freedom. If you had four treatment levels, there would be four coefficients per contrast and a cubic as well as a linear and quadratic contrast.
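If you want to see what ORPOL is doing, the same coefficients can be reproduced outside SAS by Gram-Schmidt orthonormalization of the columns of the Vandermonde matrix of the levels (1, x, x², ...). A plain-Python sketch; the function name mirrors ORPOL but the implementation is our own:

```python
import math

# Orthonormalize the Vandermonde columns [1, x, x^2, ...] by Gram-Schmidt.
# Each resulting column is one contrast: mean, linear, quadratic, ...
def orpol(levels):
    cols = [[x**p for x in levels] for p in range(len(levels))]
    ortho = []
    for col in cols:
        v = list(col)
        for u in ortho:                       # remove components along
            proj = sum(a * b for a, b in zip(v, u))   # earlier contrasts
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        ortho.append([a / norm for a in v])   # scale to unit length
    return ortho

mean_c, lin_c, quad_c = orpol([1, 10, 100])
print([round(c, 6) for c in lin_c])   # approx. [-0.464991, -0.348743, 0.813733]
print([round(c, 6) for c in quad_c])  # approx. [0.671156, -0.738272, 0.067116]
```

The printed values agree with the second and third columns of Output 7.25.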
The information from Output 7.25 allows you to write the appropriate CONTRAST statements. For example, for linear and quadratic dose effects, use the statements:
contrast 'linear dose' dose -0.465 -0.349 0.814;
contrast 'quadratic dose' dose 0.671 -0.738 0.067;
Pay attention to the sum of the contrast coefficients. Occasionally, rounding causes the coefficients to sum to a small nonzero value, such as 0.001, rather than exactly zero. This causes both the GLM and MIXED procedures to declare the contrast nonestimable, and you get no output. Simply adjust one coefficient so that the sum is exactly zero; the impact of this adjustment on the resulting computations is negligible.
You can use ORPOL for equally spaced treatments. For example, add the following statements to the IML program given above to compute the coefficients for the equally spaced levels of LOGDOSE:
log_lev=log10(levels);
coef=orpol(log_lev);
print log_lev;
print coef;
fuzzed_coef=fuzz(coef);
print fuzzed_coef;
Output 7.26 shows the results. The LOG10 function takes the base 10 log of each element of the vector variable LEVELS, defined above. The name of the new variable is your choice; here it is called LOG_LEV. Occasionally, machine-rounding error causes numbers that are supposed to be zero to be computed as very small nonzero numbers, as with the coefficient for the second treatment level in the linear contrast (second column). The FUZZ function cleans up these rounding errors and sets them to zero, as shown in the variable FUZZED_COEF, the COEF variable with FUZZ applied.
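The effect of FUZZ is easy to mimic. A plain-Python sketch of the idea; the tolerance here is our own choice, and PROC IML's internal fuzz threshold may differ:

```python
# Snap values that are nearly zero (machine-rounding noise) to exactly zero,
# in the spirit of the IML FUZZ function. The tolerance is our assumption.
def fuzz(x, tol=1e-12):
    return 0.0 if abs(x) < tol else x

# The linear contrast column from Output 7.26, including the 8.194E-17 noise.
coef_linear = [-0.7071067811865475, 8.194e-17, 0.7071067811865476]
print([fuzz(c) for c in coef_linear])  # middle entry becomes 0.0
```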
Output 7.26 ORPOL Results for Equally Spaced Treatment Level (LOGDOSE)
LOG_LEV | ||
0 | ||
1 | ||
2 | ||
COEF | ||
0.5773503 | -0.707107 | 0.4082483 |
0.5773503 | 8.194E-17 | -0.816497 |
0.5773503 | 0.7071068 | 0.4082483 |
FUZZED_COEF | ||
0.5773503 | -0.707107 | 0.4082483 |
0.5773503 | 0 | -0.816497 |
0.5773503 | 0.7071068 | 0.4082483 |
Note that the coefficients are given in orthonormal form; that is, the squared coefficients for each contrast sum to one, for example, (–0.707)² + 0² + (0.707)² = 1 for the linear contrast (second column). You can rescale these coefficients to integer values, for example, –1, 0, and 1 for the linear contrast and 1, –2, and 1 for the quadratic, without affecting the sums of squares or F-values for the contrasts. Up to an overall sign, which does not matter for a contrast, these are the same coefficients you used in the previous section for LOGDOSE.
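This invariance under rescaling is easy to verify numerically. Using the Type 2 LS-means from Output 7.23 (a plain-Python check outside SAS; the helper function is ours):

```python
# A contrast sum of squares is unchanged when the coefficients are rescaled,
# e.g., from orthonormal form to convenient integers.
means = [50.75, 54.25, 62.00]   # Type 2 LS-means, Output 7.23
n = 4                           # observations (blocks) per mean

def contrast_ss(coef, means, n):
    est = sum(c * m for c, m in zip(coef, means))
    return n * est**2 / sum(c**2 for c in coef)

orthonormal = [-0.7071068, 0.0, 0.7071068]
integer = [-1, 0, 1]
print(contrast_ss(orthonormal, means, n))  # 253.125
print(contrast_ss(integer, means, n))      # 253.125
```

The scale factor cancels between the numerator and denominator, so any multiple of a contrast gives the same sum of squares and F-value.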
In the previous sections you saw how to assess regression over treatment levels using orthogonal polynomial contrasts and how to use ORPOL to obtain the needed contrast coefficients. However, orthogonal polynomials can be awkward to use, especially when you want to estimate the regression equation in addition to merely partitioning variation for testing purposes. In practical situations, orthogonal polynomials are something of a holdover from pre-computer-era statistical analysis. You can use analysis-of-covariance methods to compute the same statistics you got using orthogonal polynomials, as well as additional statistics that are often useful. This section uses the example from Section 7.6.1 to illustrate.
You can reproduce the essential elements of the analysis in Output 7.22 using analysis-of-covariance methods. First, define a new variable in the DATA step for the square of LOGDOSE. In the SAS program below, the new variable is called LOGD2, defined in the DATA step as LOGDOSE*LOGDOSE. Then use the following SAS statements:
proc glm;
class bloc type;
model y=bloc type logdose logd2
type*logdose type*logd2;
The results appear in Output 7.27. Notice that LOGDOSE does not appear in the CLASS statement. You use it, as well as LOGD2, as a direct regression variable, exactly as you would if LOGDOSE were a covariable and you suspected heterogeneous quadratic regressions of y on the covariable.
Output 7.27 ANOVA for Type-Dose Data Using Analysis-of-Covariance Methods
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
bloc | 3 | 538.7916667 | 179.5972222 | 10.67 | 0.0005 |
type | 1 | 12.0416667 | 12.0416667 | 0.72 | 0.4109 |
logdose | 1 | 115.5625000 | 115.5625000 | 6.87 | 0.0193 |
logd2 | 1 | 6.0208333 | 6.0208333 | 0.36 | 0.5587 |
logdose*type | 1 | 138.0625000 | 138.0625000 | 8.20 | 0.0118 |
logd2*type | 1 | 6.0208333 | 6.0208333 | 0.36 | 0.5587 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
bloc | 3 | 538.7916667 | 179.5972222 | 10.67 | 0.0005 |
type | 1 | 28.1250000 | 28.1250000 | 1.67 | 0.2157 |
logdose | 1 | 0.3894231 | 0.3894231 | 0.02 | 0.8811 |
logd2 | 1 | 6.0208333 | 6.0208333 | 0.36 | 0.5587 |
logdose*type | 1 | 0.8125000 | 0.8125000 | 0.05 | 0.8291 |
logd2*type | 1 | 6.0208333 | 6.0208333 | 0.36 | 0.5587 |
Comparing Output 7.27 to Output 7.22, you can see that the Type I SS for LOGDOSE and LOGD2 are identical to the contrast results for Linear Logdose and Quadratic Logdose: 115.5625 and 6.0208, respectively. Also, the TYPE*LOGDOSE and TYPE*LOGD2 Type I sums of squares are identical to the Linear Logdose×Type and Quadratic Logdose×Type contrasts from Output 7.22. In both cases, the significant difference between linear regressions of y on LOGDOSE for the two types is the main result. Note that the Type III sums of squares produce nonsense results for the LOGDOSE main-effect and interaction terms because they are adjusted for the quadratic terms. In general, the Type III SS should be ignored when using analysis of covariance in lieu of orthogonal polynomials.
Outputs 7.22 and 7.27 both lead to the conclusion that you should fit a linear regression over LOGDOSE for each type. You can use the following SAS statements to do so:
proc glm;
class bloc type;
model y= bloc type logdose(type)/solution;
estimate 'beta-0, type 1'
intercept 4 bloc 1 1 1 1 type 4 0/divisor=4;
estimate 'beta-0, type 2'
intercept 4 bloc 1 1 1 1 type 0 4/divisor=4;
These statements are similar to those used in Section 7.3 to fit the unequal slopes. The additional ESTIMATE statements allow you to compute the β0i terms for the ith type, which the MODEL statement implicitly defines to be the sum of the intercept, the average block effect, and the type effect. The results appear in Output 7.28.
Output 7.28 Intercept and Slope Estimates for the Type-Dose Data
Parameter | Estimate | Standard Error | t Value | Pr > |t|
beta-0, type 1 | 54.5000000 | 1.80039484 | 30.27 | <.0001 |
beta-0, type 2 | 50.0416667 | 1.80039484 | 27.79 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > |t|
Intercept | 50.08333333 B | 2.27733935 | 21.99 | <.0001 | |
bloc | 1 | 7.66666667 B | 2.27733935 | 3.37 | 0.0037 |
bloc | 2 | -3.50000000 B | 2.27733935 | -1.54 | 0.1427 |
bloc | 3 | -4.33333333 B | 2.27733935 | -1.90 | 0.0741 |
bloc | 4 | 0.00000000 B | ⋅ | ⋅ | ⋅ |
type | 1 | 4.45833333 B | 2.54614280 | 1.75 | 0.0980 |
type | 2 | 0.00000000 B | ⋅ | ⋅ | ⋅ |
logdose(type) | 1 | -0.25000000 | 1.39457984 | -0.18 | 0.8598 |
logdose(type) | 2 | 5.62500000 | 1.39457984 | 4.03 | 0.0009 |
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
From Output 7.28, the main results are the regression equations. For Type 1, the equation is y=54.5 – 0.25*LOGDOSE; for Type 2, it is y=50.042+5.625*LOGDOSE.
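Because this design is balanced, the same intercepts and slopes can be recovered by ordinary least squares applied directly to the LS-means in Output 7.23 over LOGDOSE = 0, 1, 2. A plain-Python check outside SAS (the `ols` helper is ours):

```python
# Simple least squares of each type's LS-means on LOGDOSE = 0, 1, 2
# reproduces the reported intercepts and slopes in this balanced design.
x = [0, 1, 2]

def ols(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
            / sum((xi - xbar)**2 for xi in x)
    return ybar - slope * xbar, slope   # (intercept, slope)

print(ols(x, [54.50, 54.25, 54.00]))  # approx. (54.5, -0.25)   Type 1
print(ols(x, [50.75, 54.25, 62.00]))  # approx. (50.042, 5.625) Type 2
```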
PROC MIXED makes it somewhat more convenient to obtain the regression equations. You can use the SAS statements
proc mixed;
class bloc type;
model y=type logdose(type)/noint solution;
random bloc;
The results appear in Output 7.29.
Output 7.29 Estimate of Linear Regression Equations for Each Type
Parameter | Estimate | Standard Error | t Value | Pr > |t|
type | 1 | 54.50000000 | 2.89268414 | 18.84 | <.0001 |
type | 2 | 50.04166667 | 2.89268414 | 17.30 | <.0001 |
logdose(type) | 1 | -0.25000000 | 2.24066350 | -0.11 | 0.9123 |
logdose(type) | 2 | 5.62500000 | 2.24066350 | 2.51 | 0.0208 |
The RANDOM BLOC statement and the NOINT option cause the β0i terms to be estimated directly, rather than requiring ESTIMATE statements. Treating BLOC as random does not affect the parameter estimates here, but it does change the standard errors of the regression coefficients: PROC MIXED gives 2.89 and 2.24 for the intercepts and slopes, respectively, which are valid if it is reasonable to regard blocks as random. The PROC GLM standard errors, 1.80 and 1.39, treat blocks as fixed.