Analysis of covariance combines elements of the analysis of variance and of regression analysis. Because the analysis of variance can be seen as a multiple linear regression analysis, the analysis of covariance (ANCOVA) can be defined as a multiple linear regression analysis in which there is at least one categorical explanatory variable and one quantitative explanatory variable. Usually the categorical variable is a treatment of primary interest, and the response y is measured at the experimental units. The quantitative variable x is also measured at the experimental units, in the anticipation that it is linearly associated with the response to the treatment. This quantitative variable x is called a covariate, covariable, or concomitant variable.
Before we explain this in general, we give some examples performed in a completely randomised design.
One could perhaps analyse the yield per plant (y/x) in kilograms as a means of removing differences in plant numbers. This is satisfactory only if the relation between y and x is a straight line through the origin. However, the regression line of y on x is a straight line that does not pass through the origin, and the estimated regression coefficient b is often substantially less than the mean yield per plant, because when plant numbers are high, competition between plants reduces the yield per plant. If this happens, the use of y/x overcorrects for the stand of plants. Of course, the yield per plant should be analysed if there is direct interest in this quantity.
When one wants to use ANCOVA it is a good idea to check whether the linear regression lines are parallel for the different treatments. If the regression lines are not parallel, then there is an interaction between the treatments and the covariate. Hence it is a good idea to first run an ANCOVA model with the interaction of treatments and covariate; if the slopes are not statistically different (no significant interaction), then we can use an ANCOVA model with parallel lines, i.e. a separate-intercepts regression model. The main use of ANCOVA is to test a treatment effect while using a quantitative control variable as covariate to gain power.
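This check can be traced numerically. The following pure-Python sketch (synthetic data, not from the book; the book's own analyses use R) fits a separate slope per treatment group and a common pooled slope, and forms the F-statistic for the hypothesis of parallel (equal) slopes:

```python
# Test of parallel slopes (treatment x covariate interaction), pure Python.
# Synthetic data: a = 2 groups, 3 observations each.
x = {1: [1.0, 2.0, 3.0], 2: [1.0, 2.0, 3.0]}
y = {1: [1.0, 2.0, 3.0], 2: [3.0, 5.0, 8.0]}
a = len(x)
N = sum(len(v) for v in x.values())

def sums(xs, ys):
    """Corrected sums of squares and products for one group."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxx = sum((u - xb) ** 2 for u in xs)
    sxy = sum((u - xb) * (v - yb) for u, v in zip(xs, ys))
    syy = sum((v - yb) ** 2 for v in ys)
    return sxx, sxy, syy

# Residual SS with a separate slope in every group
sse_sep = 0.0
Exx = Exy = Eyy = 0.0
for i in x:
    sxx, sxy, syy = sums(x[i], y[i])
    sse_sep += syy - sxy ** 2 / sxx
    Exx, Exy, Eyy = Exx + sxx, Exy + sxy, Eyy + syy

# Residual SS with one pooled (parallel) slope b = Exy/Exx
sse_par = Eyy - Exy ** 2 / Exx

# F-test of H0: beta_1 = ... = beta_a (parallel lines)
df1, df2 = a - 1, N - 2 * a
F = ((sse_par - sse_sep) / df1) / (sse_sep / df2)
print(F)  # a large F speaks against parallel lines
```

A large F (small p-value) means the interaction is significant and the parallel-lines ANCOVA model should not be used.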
Note that, if needed, the ANCOVA model can be extended with more covariates; R handles this easily. If the regression of y on the covariate x is quadratic, we have an ANCOVA model with two covariates, x1 = x and x2 = x².
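As an illustration of this extension (in practice one would simply add the covariates to the R model formula), here is a minimal pure-Python sketch with made-up data that fits the separate-intercepts model with covariates x1 = x and x2 = x² by solving the least squares normal equations directly:

```python
# ANCOVA with a quadratic covariate: design columns are intercept,
# dummy for treatment group 2, x1 = x and x2 = x^2 (synthetic data).
xs = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
grp = [1, 1, 1, 2, 2, 2]
# y generated exactly from 1 + 2*[group == 2] + 3*x - 0.5*x^2
ys = [3.5, 5.0, 5.5, 5.5, 7.0, 7.5]

X = [[1.0, 1.0 if g == 2 else 0.0, v, v * v] for g, v in zip(grp, xs)]
p = 4

# Normal equations: (X'X) beta = X'y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
Xty = [sum(r[i] * yv for r, yv in zip(X, ys)) for i in range(p)]

# Solve by Gaussian elimination with partial pivoting
A = [row[:] + [rhs] for row, rhs in zip(XtX, Xty)]
for c in range(p):
    piv = max(range(c, p), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    for r in range(c + 1, p):
        f = A[r][c] / A[c][c]
        for k in range(c, p + 1):
            A[r][k] -= f * A[c][k]
beta = [0.0] * p
for c in range(p - 1, -1, -1):
    beta[c] = (A[c][p] - sum(A[c][k] * beta[k] for k in range(c + 1, p))) / A[c][c]
print(beta)  # recovers [1, 2, 3, -0.5] up to rounding
```

Since the synthetic y was generated without noise, the fit recovers the generating coefficients exactly.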
We first discuss the balanced design.
We assume that we have a balanced completely randomised design for a treatment A with a classes. Assuming further that there is a linear relationship between the response y and the covariate x, we find that an appropriate statistical model is

yij = μ + ai + βxij + eij, i = 1, …, a; j = 1, …, n, (9.1)

where μ is a constant, ai is the treatment effect with the side condition ∑i ai = 0, β is the coefficient for the linear regression of yij on xij, and the eij are random independent normally distributed experimental errors with expectation 0 and variance σ².
Two additional key assumptions for this model are that the regression coefficient β is the same for all treatment groups (parallel regression lines) and the treatments do not influence the covariate x.
The first objective of the covariance analysis is to determine whether the addition of the covariate has reduced the estimate of the experimental error variance; that is, whether the test of the null hypothesis Hβ0: β = 0 against the alternative hypothesis HβA: β ≠ 0 results in rejection of Hβ0. If the reduction of the estimate of the experimental error variance is significant, then we obtain estimates of the treatment group means μ + ai adjusted to the same value of the covariate x for each of the treatment groups and determine the significance of treatment differences on the basis of the adjusted treatment means. Usually the statistical packages estimate the adjusted treatment means at the overall mean x̄.. of the covariate x.
The least squares estimates of the parameters of model (9.1) are

b = Exy/Exx,  μ̂ = ȳ.. − b x̄..,  âi = ȳi. − ȳ.. − b(x̄i. − x̄..),

with Exx = ∑i ∑j (xij − x̄i.)² and Exy = ∑i ∑j (xij − x̄i.)(yij − ȳi.); see Montgomery (2013), section 15.3. For the derivation of the least squares estimates of the parameters see the general unbalanced completely randomised design in Section 9.2.2.
For the balanced case we then have ni = n for i = 1, …, a.
The nested ANOVA table (note the sequence of the source of variation) for the test of the null hypothesis Hβ0: β = 0 is given in Table 9.1.
Table 9.1 Nested ANOVA table for the test of the null hypothesis Hβ0: β = 0.
Source of variation | df | SS |
Treatments | a − 1 | Tyy = n ∑i (ȳi. − ȳ..)² |
Regression coefficient | 1 | SSb = (Exy)²/Exx |
Error | a(n − 1) − 1 | SSE = SSyy − Tyy − SSb |
Corrected total | an − 1 | SSyy = ∑i ∑j (yij − ȳ..)² |
The test of the null hypothesis Hβ0: β = 0 against the alternative hypothesis HβA: β ≠ 0 is done with the test-statistic

F = SSb / [SSE/(a(n − 1) − 1)],

which has under Hβ0: β = 0 the F-distribution with df = 1 for the numerator and df = a(n − 1) − 1 for the denominator.
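The quantities of Table 9.1 and this F-statistic can be traced on a tiny synthetic data set; the following pure-Python sketch is only illustrative (the book's analyses use R):

```python
# Nested ANOVA quantities of Table 9.1 on synthetic data:
# a = 2 treatments, n = 3 replicates per treatment.
y = {1: [10.0, 13.0, 14.0], 2: [15.0, 17.0, 19.0]}
x = {1: [1.0, 2.0, 3.0],    2: [2.0, 3.0, 4.0]}
a, n = 2, 3
N = a * n
ybar = {i: sum(v) / n for i, v in y.items()}
xbar = {i: sum(v) / n for i, v in x.items()}
ybar_all = sum(sum(v) for v in y.values()) / N

Tyy = n * sum((ybar[i] - ybar_all) ** 2 for i in y)               # treatments
SSyy = sum((v - ybar_all) ** 2 for vs in y.values() for v in vs)  # corrected total
# Within-treatment (error) sums of squares and products
Exx = sum((v - xbar[i]) ** 2 for i in x for v in x[i])
Exy = sum((x[i][j] - xbar[i]) * (y[i][j] - ybar[i]) for i in y for j in range(n))
SSb = Exy ** 2 / Exx                   # regression coefficient, df = 1
SSE = SSyy - Tyy - SSb                 # error, df = a(n - 1) - 1
F = SSb / (SSE / (a * (n - 1) - 1))    # test statistic for H_beta0: beta = 0
print(Tyy, SSb, SSE, F)
```

Note that the three sums of squares in the table add up to the corrected total SSyy by construction.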
For the test of the null hypothesis HA0: ‘all ai are equal’ against the alternative hypothesis HAA: ‘at least one ai is different from the others’ we use the nested ANOVA table given in Table 9.2.
Table 9.2 Nested ANOVA table for the test HA0: ‘all ai are equal’.
Source of variation | df | SS |
Regression coefficient | 1 | Sb = (Sxy)²/Sxx |
Treatments | a − 1 | ST = SSyy − Sb − SSE |
Error | a(n − 1) − 1 | SSE |
Corrected total | an − 1 | SSyy = ∑i ∑j (yij − ȳ..)² |

Here Sxx = ∑i ∑j (xij − x̄..)² and Sxy = ∑i ∑j (xij − x̄..)(yij − ȳ..) are the total (corrected) sums of squares and products.
The test of the null hypothesis HA0: ‘all ai are equal’ against the alternative hypothesis HAA: ‘at least one ai is different from the others’ is done with the test-statistic

F = [ST/(a − 1)] / [SSE/(a(n − 1) − 1)],

which has under HA0: ‘all ai are equal’ the F-distribution with df = a − 1 for the numerator and df = a(n − 1) − 1 for the denominator.
The estimate of the treatment mean adjusted for the covariate at x = x̄.. (the overall mean) is

ȳi. − b(x̄i. − x̄..). (9.2)
The estimate of the standard error of this estimate (9.2) is

√( s² [1/n + (x̄i. − x̄..)²/Exx] ), (9.3)

where s² = SSE/(a(n − 1) − 1) is the error mean square.
The estimate for the difference between two adjusted treatment means (which does not depend on the value x̄.. at which the means are adjusted) is

[ȳi. − b(x̄i. − x̄..)] − [ȳj. − b(x̄j. − x̄..)] = ȳi. − ȳj. − b(x̄i. − x̄j.). (9.4)
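The adjusted means and their difference can be traced numerically; this pure-Python sketch uses made-up data:

```python
# Adjusted treatment means ybar_i. - b*(xbar_i. - xbar_..),
# with b = Exy/Exx the pooled within-treatment slope (synthetic data).
y = {1: [10.0, 13.0, 14.0], 2: [15.0, 17.0, 19.0]}
x = {1: [1.0, 2.0, 3.0],    2: [2.0, 3.0, 4.0]}
n = 3
ybar = {i: sum(v) / n for i, v in y.items()}
xbar = {i: sum(v) / n for i, v in x.items()}
xbar_all = sum(sum(v) for v in x.values()) / (2 * n)

Exx = sum((v - xbar[i]) ** 2 for i in x for v in x[i])
Exy = sum((x[i][j] - xbar[i]) * (y[i][j] - ybar[i]) for i in y for j in range(n))
b = Exy / Exx

adj = {i: ybar[i] - b * (xbar[i] - xbar_all) for i in y}
# Difference of adjusted means; the term in xbar_.. cancels:
diff = adj[2] - adj[1]
same = (ybar[2] - ybar[1]) - b * (xbar[2] - xbar[1])
print(b, adj, diff)
```

The two expressions for the difference agree, as in (9.4).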
However, we first want to check whether we can use an ANCOVA model with parallel lines, i.e. a separate-intercepts regression model. Therefore, we first run an ANCOVA model with the interaction of treatments and covariate.
An appropriate statistical model is

yij = μ + ai + βi xij + eij, i = 1, …, a; j = 1, …, n,

where μ is a constant, ai is the treatment effect with the side condition ∑i ai = 0, βi is the coefficient for the linear regression of yij on xij for class Ai of treatment A, and the eij are random independent normally distributed experimental errors with mean 0 and variance σ².
In R we would make an ANOVA table with this model and look at the test of the interaction effect of treatment and covariate to see whether we must reject the null hypothesis HB0: β1 = … = βa and accept the alternative hypothesis HBA: ‘there is at least one βi different from another βj with i ≠ j’. If this is the case we cannot use the ANCOVA model with parallel lines.
From the ANOVA table with machine as the last model variable we find the p-value Pr(>F) = 0.1181 > 0.05, so we cannot reject the null hypothesis HA0: ‘all ai are equal’.
Using y for strength and x for diameter we have the regression lines for the machines:
the regression line for M1 is y = 17.3592 + 0.954x;
the regression line for M2 is y = 18.3960 + 0.954x;
the regression line for M3 is y = 15.7752 + 0.954x.
Note that in the output of Problem 9.3, in the output of summary(machine0), we find the estimate (Intercept) 17.360, which is the intercept I1; further we find machine2 1.037 and machine3 −1.584. The intercept I2 is (Intercept) + machine2 = 17.360 + 1.037 = 18.397, and the intercept I3 is (Intercept) + machine3 = 17.360 + (−1.584) = 15.776.
In the rationale in Problem 9.4 we have found that the difference in the adjusted means of M2 − M1 is 41.4192 − 40.3824 = 1.0368, and this is given in Problem 9.3 in the summary(machine0) output as the Estimate machine2 1.037. But the difference of the adjusted means of M2 − M1 is, according to (9.4), [ȳ2. − b(x̄2. − x̄..)] − [ȳ1. − b(x̄1. − x̄..)] = [ȳ2. − b x̄2.] − [ȳ1. − b x̄1.] = I2 − I1.
Analogously, the difference of the adjusted means of M3 − M1 is 38.7984 − 40.3824 = −1.584, and this is given in Problem 9.3 in the summary(machine0) output as the Estimate machine3 −1.584. However, the difference in the adjusted means of M3 − M1 is, according to (9.4), I3 − I1.
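That the difference of adjusted means equals the difference of the fitted intercepts under a common slope is easy to verify numerically; the numbers below are illustrative, not the machine data:

```python
# With a common slope b, the separate-intercepts fit has intercepts
# I_i = ybar_i. - b * xbar_i., so for any evaluation point xbar_..:
# [ybar_2. - b(xbar_2. - xbar_..)] - [ybar_1. - b(xbar_1. - xbar_..)] = I2 - I1.
b = 0.954                      # common slope (illustrative value)
ybar = {1: 12.0, 2: 14.5}      # group means of y (made up)
xbar = {1: 3.0, 2: 4.2}        # group means of x (made up)
xbar_all = 3.6                 # overall mean of x (made up)

I = {i: ybar[i] - b * xbar[i] for i in ybar}
adj = {i: ybar[i] - b * (xbar[i] - xbar_all) for i in ybar}
print(adj[2] - adj[1], I[2] - I[1])  # equal up to rounding
```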
Now we will give the ANCOVA for an unbalanced completely randomised design for a treatment A with a classes. Assuming further that there is a linear relationship between the response y and the covariate x, we find that an appropriate statistical model is

yij = μ + ai + βxij + eij, i = 1, …, a; j = 1, …, ni, (9.8)

where μ is a constant, the ai are the treatment effects with the side condition ∑i ai = 0, β is the coefficient for the linear regression of yij on xij, and the eij are random independent normally distributed experimental errors with mean 0 and variance σ².
The least squares estimators of the parameters in (9.8) are derived as follows. Setting the first partial derivatives of

S = ∑i ∑j (yij − μ − ai − βxij)²

with respect to the parameters equal to zero, and replacing the parameter values in the resulting equations by their estimates, leads to the normal equations (9.9). From these, we obtain the estimates explicitly as

b = Exy/Exx and μ̂ + âi = ȳi. − b x̄i., i = 1, …, a, (9.10)

with Exx = ∑i ∑j (xij − x̄i.)² and Exy = ∑i ∑j (xij − x̄i.)(yij − ȳi.).
Because S is a convex function of the parameters, the solution of the equations obtained by setting the partial derivatives equal to zero gives a minimum of S.
Replacing the realisations of the random variables in (9.8) by the corresponding random variables results in the least squares estimators
The estimators in (9.11) are best linear unbiased estimators (BLUE) and normally distributed.
The analysis for the unbalanced ANCOVA uses the same R commands; the differences are in the estimator of β, the standard errors of the estimates of the adjusted treatment means, and the standard errors of the differences between the estimates of the adjusted treatment means.
The estimate of β is b = Exy/Exx with

Exy = ∑i ∑j (xij − x̄i.)(yij − ȳi.) and Exx = ∑i ∑j (xij − x̄i.)²,

as is also given in (9.10).
The estimate of the treatment mean adjusted for the covariate at x = x̄.. (the overall mean) is

ȳi. − b(x̄i. − x̄..).
The estimate of the standard error of this adjusted mean is

√( s² [1/ni + (x̄i. − x̄..)²/Exx] ),

where s² is the error mean square.
The estimate for the difference between two adjusted treatment means is

ȳi. − ȳj. − b(x̄i. − x̄j.),

with estimated standard error √( s² [1/ni + 1/nj + (x̄i. − x̄j.)²/Exx] ).
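These unbalanced-case formulas can be traced with a small pure-Python sketch (invented data with n1 = 2 and n2 = 3):

```python
# Unbalanced ANCOVA: pooled slope b = Exy/Exx over groups of unequal size,
# adjusted means ybar_i. - b*(xbar_i. - xbar_..), and the standard error of
# a difference, sqrt(s2 * (1/n_i + 1/n_j + (xbar_i. - xbar_j.)^2 / Exx)).
y = {1: [5.0, 9.0], 2: [4.0, 9.0, 12.0]}
x = {1: [1.0, 3.0], 2: [2.0, 4.0, 6.0]}
n = {i: len(v) for i, v in y.items()}
N = sum(n.values())
ybar = {i: sum(v) / n[i] for i, v in y.items()}
xbar = {i: sum(v) / n[i] for i, v in x.items()}
xbar_all = sum(sum(v) for v in x.values()) / N

Exx = sum((v - xbar[i]) ** 2 for i in x for v in x[i])
Exy = sum((u - xbar[i]) * (v - ybar[i])
          for i in y for u, v in zip(x[i], y[i]))
Eyy = sum((v - ybar[i]) ** 2 for i in y for v in y[i])
b = Exy / Exx

adj = {i: ybar[i] - b * (xbar[i] - xbar_all) for i in y}
s2 = (Eyy - Exy ** 2 / Exx) / (N - len(y) - 1)   # error mean square
se_diff = (s2 * (1 / n[1] + 1 / n[2]
                 + (xbar[1] - xbar[2]) ** 2 / Exx)) ** 0.5
print(b, adj, se_diff)
```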
Because the interaction Group:x has the p-value Pr(>F) = 0.42794 > 0.05, we cannot reject the null hypothesis HB0: β1 = … = βa, and we can use the ANCOVA model (9.8) with parallel regression lines of y on x for the groups.
Because in the ANOVA table with x as the last model variable we find the p-value Pr(>F) = 5.063e-09 < 0.05, we reject the null hypothesis Hβ0: β = 0. Of course this can also be concluded from the t-test for x, where we find the same p-value Pr(>|t|) = 5.06e-09 < 0.05.
> summary(testbeta0)
Call:
lm(formula = y ∼ Group + x, data = example9_2)
Residuals:
Min 1Q Median 3Q Max
-19.0353 -6.8607 0.1951 6.1214 18.7915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.814 10.163 7.361 2.75e-08 ***
Group2 -10.222 3.363 -3.040 0.00478 **
x -11.268 1.410 -7.991 5.06e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.678 on 31 degrees of freedom
Multiple R-squared: 0.6848, Adjusted R-squared: 0.6644
F-statistic: 33.67 on 2 and 31 DF, p-value: 1.692e-08
We see that the estimate of the standard error of the difference of the adjusted means (at the overall mean of x) of group 1 and group 2 is SEG1G2 = 3.363, which we have already found as 3.363 in summary(Group0) in Problem 9.9.
The 0.95-confidence interval for the difference of the expected adjusted means (at the overall mean of x) of group 1 and group 2 is [3.363233; 17.08077].
Hence the regression line for Fibralo = group 1 is y = 74.8135 − 11.268x and the regression line for Gemfibrozil = group 2 is y = 64.59181 − 11.268x.
The scatter plot with the regression lines is done as follows.
> x1 <- c(7.0,6.0,7.1,8.6,6.3,7.5,6.6,7.4,5.3,6.5,6.2,7.8,8.5,9.2,5.0,7.0)
> y1 <- c(5,10,-5,-20,0,-15,10,-10,20,-15,5,0,-40,-25,25,-10)
> x2 <- c(5.1,6.0,7.2,6.4,5.5,6.0,5.6,5.5,6.7,8.6,6.4,6.0,9.3,8.5, 7.9,7.4,5.0,6.5)
> y2 <- c(10,15,-15,5,10,-15,-5,-10,-20,-40,-5,-10,-40,-20,-35,0,0,-10)
> Group <- rep(1:2, times= c(16,18))
> x <- c(x1,x2)
> y <- c(y1,y2)
> plot(x, y, main = "Group 1 ---, Group 2 ----", pch = as.character(Group))
> # intercepts and common slope from the fit above
> I1 <- 74.8135; I2 <- 64.59181; estimate.beta <- -11.268
> abline(I1, estimate.beta, lty = 1, lwd = 2)
> abline(I2, estimate.beta, lty = 2, lwd = 3)
See Figure 9.2.
We assume that we have a balanced randomised complete block design for a treatment A with a classes and b blocks. Assuming further that there is a linear relationship between the response y and the covariate x, we find that an appropriate statistical model is

yijk = μ + ai + cj + βxijk + eijk, i = 1, …, a; j = 1, …, b; k = 1, …, n,

where μ is a constant, ai is the treatment effect with the side condition ∑i ai = 0, cj is the block effect with the side condition ∑j cj = 0, β is the coefficient for the linear regression of yijk on xijk, and the eijk are random independent normally distributed experimental errors with mean 0 and variance σ².
The analysis is analogous to Section 9.2.1 for the balanced completely randomised design; only the ANOVA table additionally includes the effect of blocks.
We first want to check whether we can use an ANCOVA model with parallel lines, i.e. a separate-intercepts regression model. Therefore, we first run an ANCOVA model with the interaction of treatments and covariate.
An appropriate statistical model is

yijk = μ + ai + cj + βi xijk + eijk, i = 1, …, a; j = 1, …, b; k = 1, …, n,

where μ is a constant, ai is the treatment effect with the side condition ∑i ai = 0, cj is the block effect with the side condition ∑j cj = 0, βi is the coefficient for the linear regression of yijk on xijk for class Ai of treatment A, and the eijk are random independent normally distributed experimental errors with mean 0 and variance σ².
In R we make an ANOVA table for this model and look at the test of the interaction effect of treatment and covariate to see whether we must reject the null hypothesis HB0: β1 = … = βa and accept the alternative hypothesis HBA: ‘there is at least one βi different from another βj with i ≠ j’. If this is the case we cannot use the ANCOVA model with parallel lines.
If the covariate x as well as the primary response variable y is affected by the treatments, the resultant response is multivariate and the covariance adjustment of treatment means is inappropriate. In these cases an analysis of the bivariate response (x, y) utilising multivariate methods is in order. Multivariate methods are not covered in this book.
Adjustment for the covariate is appropriate if it is measured prior to treatment administration since the treatments have not yet had the opportunity to affect its value. If the covariate is measured concurrently with the response variable, then it must be decided whether it could be affected by the treatments before the covariance adjustments are considered. See the rationale given in Example 9.3.
Practical application of the analysis of covariance has been demonstrated only with completely randomised designs and randomised complete block designs; however, the use of covariates can be extended to any treatment and experiment design, as well as to comparative observational studies of complex structure and to studies requiring multiple covariates for adjustment. Using R, the analysis with multiple covariates is easy to do: simply add the covariates to the model formula.
For further information about topics in analysis of covariance, see Snedecor and Cochran (1989) sections 18.5–18.9.
Extensive discussions on the use and misuses of covariates in research studies were provided in two special issues of Biometrics (1957), Volume 13, No. 3; and Biometrics (1982), Volume 38, No. 3. Of particular interest are articles by Cochran (1957), Smith (1957), and Cox and McCullagh (1982). A number of issues arise relevant to the use of covariates. Amongst those concerns are the applicability in certain situations and the relationship between blocking and covariates.
Analysis of covariance for general random and mixed‐effects models is considerably more difficult. Henderson and Henderson (1979) and Henderson (1982) discuss the problem and possible approaches.