9
Analysis of Covariance (ANCOVA)

9.1 Introduction

Analysis of covariance combines elements of the analysis of variance and of regression analysis. Because the analysis of variance can be seen as a multiple linear regression analysis, the analysis of covariance (ANCOVA) can be defined as a multiple linear regression analysis in which there is at least one categorical explanatory variable and one quantitative explanatory variable. Usually the categorical variable is a treatment of primary interest, and the response y is measured on the experimental unit. The quantitative variable x is also measured on the experimental units, in the anticipation that it is linearly associated with the response to the treatment. This quantitative variable x is called a covariate, covariable or concomitant variable.

Before we explain this in general, we give some examples performed in a completely randomised design.

  1. In a completely randomised design treatments were applied on tea bushes where the yields yij are the yields in kilograms of the tea bushes. An important source of error is that, by the luck of the draw by randomisation, some treatments will be allotted to a more productive set of bushes than others. Fisher described in 1925 in his first edition of “Statistical Methods for Research Workers” the application of the covariate xij, which was the yield in kilograms of the tea bushes in a period before treatments were applied. Since the relative yields of the tea bushes show a good deal of stability from year to year, xij serves as a linear predictor of the inherent yielding stabilities of the bushes. The regression lines of yij on xij are parallel regression lines for the treatments. By adjusting the treatment yields so as to remove these differences in yielding ability, we obtain a lower variance of the experimental error and more precise comparisons amongst the treatments. See Fisher (1935) section 49.1.
  2. In variety trials with corn (Zea mays) or sugar beet (Beta vulgaris) usually plots with 30 plants are used. The weight per plot in kilograms is the response yij, but often some plants per plot are missing. As covariate, xij is taken as the number of plants per plot. The regression lines of yij on xij are parallel regression lines for the treatments.

    One could perhaps analyse the yield per plant (y/x) in kilograms as a means of removing differences in plant numbers. This is satisfactory if the relation between y and x is a straight line through the origin. However, the regression line of y on x is a straight line not through the origin and the estimated regression coefficient b is often substantially less than the mean yield per plant because when plant numbers are high, competition between plants reduces the yield per plant. If this happens, the use of y/x overcorrects for the stand of plants. Of course, the yield per plant should be analysed if there is direct interest in this quantity.

  3. A common clinical method to evaluate an individual's cardiovascular capacity is through treadmill exercise testing. One of the measures obtained during treadmill testing, maximal oxygen uptake, is the best index of work capacity and maximal cardiovascular function. As subjects, e.g., 12 healthy males who did not participate in a regular exercise program were chosen. Two treatments selected for the study were a 12‐week step aerobics training program and a 12‐week outdoor running regimen on a flat terrain. Six men were randomly assigned to each group in a completely randomised design. Various respiratory measurements were made on the subjects while on the treadmill before the 12‐week period. There were no differences in the respiratory measurements of the two groups of subjects prior to treatment. The measurement of interest yij is the change in maximal ventilation (litres/minute) of oxygen over the 12‐week period. The relationship between maximal ventilation change and age (years) is linear and the regression lines for the two treatments are parallel regression lines. Hence as covariate xij is taken as the age of the subjects.
  4. One wants to assess the strength of threads yij in pounds made by three different machines. Each thread is made from a batch of cotton, and some batches tend to form thicker thread than other batches. There is no way to know how thick the thread will be until it is made. Regardless of how the machines may affect thread strength, thicker threads are stronger. Thus, we record the diameter xij in 10−3 in. as a covariate. The regression lines of yij on xij are parallel regression lines for the machines.

When one wants to use the ANCOVA it is a good idea to check whether the linear regression lines are parallel for the different treatments. If the regression lines are not parallel then we have an interaction between the treatments and the covariate. Hence it is a good idea to run first an ANCOVA model with interaction of treatments and covariate; if the slopes are not statistically different (no significant interaction), then we can use an ANCOVA model with parallel lines, which means that there is a separate intercepts regression model. The main use of ANCOVA is for testing a treatment effect while using a quantitative control variable as covariate to gain power.
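The two-step check just described can be sketched in R; the data, the treatment effects and the common slope below are purely illustrative:

```r
# Illustrative data: 3 treatments with a common slope (parallel lines hold)
set.seed(1)
treat <- factor(rep(1:3, each = 10))
x <- rnorm(30, mean = 10, sd = 2)
y <- 5 + c(0, 1, -1)[treat] + 2 * x + rnorm(30)

# Step 1: model with treatment-by-covariate interaction (separate slopes)
fit_int <- lm(y ~ treat * x)
anova(fit_int)               # inspect the row 'treat:x'

# Step 2: if 'treat:x' is not significant, use the parallel-lines
# (separate intercepts) ANCOVA model
fit_par <- lm(y ~ treat + x)
anova(fit_par)
```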

Note that, if needed, the ANCOVA model can be extended with more covariates. The R program can handle this easily. If the regression of y on the covariate x is quadratic, we then have an ANCOVA model with two covariates, $x_1 = x$ and $x_2 = x^2$.
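As a sketch of this extension (simulated data; names are illustrative), a quadratic regression on x is fitted by adding I(x^2) as a second covariate:

```r
# ANCOVA with a quadratic regression on the covariate: x1 = x, x2 = x^2
set.seed(2)
treat <- factor(rep(1:2, each = 12))
x <- runif(24, 1, 5)
y <- 3 + c(0, 2)[treat] + 1.5 * x - 0.2 * x^2 + rnorm(24, sd = 0.3)

fit_quad <- lm(y ~ treat + x + I(x^2))  # I(x^2) supplies the covariate x2
summary(fit_quad)
```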

9.2 Completely Randomised Design with Covariate

We first discuss the balanced design.

9.2.1 Balanced Completely Randomised Design

We assume that we have a balanced completely randomised design for a treatment A with a classes. Assuming further that there is a linear relationship between the response y and the covariate x, we find that an appropriate statistical model is

$$y_{ij} = \mu + a_i + \beta(x_{ij} - \bar{x}_{..}) + e_{ij}, \quad i = 1, \ldots, a; \; j = 1, \ldots, n \tag{9.1}$$

where μ is a constant, $a_i$ is the treatment effect with the side condition $\sum_{i=1}^{a} a_i = 0$, β is the coefficient for the linear regression of $y_{ij}$ on $x_{ij}$, and the $e_{ij}$ are random independent normally distributed experimental errors with expectation 0 and variance $\sigma^2$.

Two additional key assumptions for this model are that the regression coefficient β is the same for all treatment groups (parallel regression lines) and the treatments do not influence the covariate x.

The first objective of the covariance analysis is to determine whether the addition of the covariate has reduced the estimate of the experimental error variance. This means that the test of the null hypothesis Hβ0: β = 0 against the alternative hypothesis HβA: β ≠ 0 results in rejection of Hβ0. If the reduction of the estimate of the experimental error variance is significant then we obtain estimates of the treatment group means μ + ai adjusted to the same value of the covariate x for each of the treatment groups and determine the significance of treatment differences on the basis of the adjusted treatment means. Usually the statistical packages estimate the adjusted treatment means at the overall mean $\bar{x}_{..}$ of the covariate x.

The least squares estimates of the parameters of model (9.1) are:

$$\hat{\mu} = \bar{y}_{..}, \qquad \hat{a}_i = \bar{y}_{i.} - \bar{y}_{..} - \hat{\beta}(\bar{x}_{i.} - \bar{x}_{..}), \qquad \hat{\beta} = b = \frac{E_{xy}}{E_{xx}},$$

with $E_{xx} = \sum_{i=1}^{a}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})^2$ and $E_{xy} = \sum_{i=1}^{a}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.})$;

see Montgomery (2013), section 15.3. For the derivation of the least squares estimates of the parameters see the general unbalanced completely randomised design in Section 9.2.2.

For the balanced case we then have $n_i = n$ for $i = 1, \ldots, a$.

The nested ANOVA table (note the sequence of the source of variation) for the test of the null hypothesis Hβ0: β = 0 is given in Table 9.1.

Table 9.1 Nested ANOVA table for the test of the null hypothesis Hβ0: β = 0.

Source of variation | df | SS
Treatments | $a - 1$ | $T_{yy} = n\sum_{i=1}^{a}(\bar{y}_{i.} - \bar{y}_{..})^2$
Regression coefficient $\beta$ | 1 | $SS_b = (E_{xy})^2 / E_{xx}$
Error | $a(n - 1) - 1$ | $SSE = SS_{yy} - T_{yy} - SS_b$
Corrected total | $an - 1$ | $SS_{yy} = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{..})^2$

The test of the null hypothesis Hβ0: β = 0 against the alternative hypothesis HβA: β ≠ 0 is done with the test‐statistic

$$F = \frac{SS_b}{SSE/(a(n-1)-1)} \tag{9.2}$$

which has under Hβ0: β = 0 the F‐distribution with df = 1 for the numerator and df = a(n‐1)‐1 for the denominator.
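The sums of squares of Table 9.1 and the test statistic (9.2) can be verified numerically. The following sketch uses simulated data (all values illustrative) and checks the hand computation against R's sequential anova():

```r
# Hand computation of Table 9.1 (balanced case) and the F-test of beta
set.seed(3)
a <- 3; n <- 8
treat <- factor(rep(1:a, each = n))
x <- rnorm(a * n, mean = 10, sd = 2)
y <- 20 + c(0, 3, -3)[treat] + 1.2 * x + rnorm(a * n)

ybar <- mean(y)
Tyy  <- n * sum((tapply(y, treat, mean) - ybar)^2)       # treatment SS
Exx  <- sum((x - ave(x, treat))^2)                       # within-group SS of x
Exy  <- sum((x - ave(x, treat)) * (y - ave(y, treat)))   # within-group SP
SSb  <- Exy^2 / Exx                                      # SS for the regression
SSyy <- sum((y - ybar)^2)                                # corrected total SS
SSE  <- SSyy - Tyy - SSb                                 # error SS
F_beta <- SSb / (SSE / (a * (n - 1) - 1))                # statistic (9.2)

tab <- anova(lm(y ~ treat + x))   # sequential: treatments first, then x
all.equal(F_beta, tab["x", "F value"])
```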

For the test of the null hypothesis HA0: ‘all ai are equal’ against the alternative hypothesis HAA: ‘at least one ai is different from the other’ we use the nested ANOVA table given in Table 9.2.

Table 9.2 Nested ANOVA table for the test HA0: ‘all ai are equal’.

Source of variation | df | SS
Regression coefficient $\beta$ | 1 | $S_b = (S_{xy})^2 / S_{xx}$
Treatments | $a - 1$ | $S_T = SS_{yy} - S_b - SSE$
Error | $a(n - 1) - 1$ | $SSE$
Corrected total | $an - 1$ | $SS_{yy} = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{..})^2$

Here $S_{xx} = \sum_{i=1}^{a}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{..})^2$ and $S_{xy} = \sum_{i=1}^{a}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{..})(y_{ij} - \bar{y}_{..})$ are the totally corrected sums of squares and products.

The test of the null hypothesis HA0: ‘all ai are equal’ against the alternative hypothesis HAA: ‘at least one ai is different from the other’ is done with the test‐statistic

$$F = \frac{S_T/(a-1)}{SSE/(a(n-1)-1)} \tag{9.3}$$

which has under HA0: ‘all ai are equal’ the F‐distribution with df = a − 1 for the numerator and df = a(n − 1) − 1 for the denominator.
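In R the order of Table 9.2 (covariate fitted first, treatments adjusted for the regression) is obtained simply by reversing the term order in the model formula; a minimal sketch with simulated, illustrative data:

```r
# Sequential ANOVA with the covariate first: the treatment SS is adjusted
set.seed(3)
a <- 3; n <- 8
treat <- factor(rep(1:a, each = n))
x <- rnorm(a * n, mean = 10, sd = 2)
y <- 20 + c(0, 3, -3)[treat] + 1.2 * x + rnorm(a * n)

anova(lm(y ~ x + treat))   # rows: x (S_b), treat (S_T), Residuals (SSE)
```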

The estimate of the treatment mean adjusted for the covariate at x = overall mean $\bar{x}_{..}$ is

$$\widehat{\mu + a_i} = \bar{y}_{i.} - b(\bar{x}_{i.} - \bar{x}_{..}) \tag{9.4}$$

The estimate of the standard error of this estimate (9.4) is

$$\sqrt{MSE\left(\frac{1}{n} + \frac{(\bar{x}_{i.} - \bar{x}_{..})^2}{E_{xx}}\right)}, \qquad MSE = \frac{SSE}{a(n-1)-1} \tag{9.5}$$

The estimate for the difference between two adjusted treatment means at overall mean $\bar{x}_{..}$ is

$$(\bar{y}_{i.} - \bar{y}_{j.}) - b(\bar{x}_{i.} - \bar{x}_{j.}) \tag{9.6}$$
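The adjusted means (9.4) and their standard errors (9.5) can be checked against predict() evaluated at the overall covariate mean; a sketch with simulated, illustrative data:

```r
# Adjusted treatment means and standard errors, by hand and via predict()
set.seed(4)
a <- 3; n <- 8
treat <- factor(rep(1:a, each = n))
x <- rnorm(a * n, mean = 10, sd = 2)
y <- 20 + c(0, 3, -3)[treat] + 1.2 * x + rnorm(a * n)

fit <- lm(y ~ treat + x)
b   <- coef(fit)["x"]
MSE <- summary(fit)$sigma^2
Exx <- sum((x - ave(x, treat))^2)

ybar_i <- tapply(y, treat, mean)
xbar_i <- tapply(x, treat, mean)
adj    <- ybar_i - b * (xbar_i - mean(x))                    # (9.4)
se_adj <- sqrt(MSE * (1 / n + (xbar_i - mean(x))^2 / Exx))   # (9.5)

# predict() at the overall covariate mean reproduces both
nd <- data.frame(treat = factor(1:a), x = mean(x))
pr <- predict(fit, nd, se.fit = TRUE)
all.equal(unname(adj), unname(pr$fit))
all.equal(unname(se_adj), unname(pr$se.fit))
```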

However, we first want to check if we can use an ANCOVA model with parallel lines, which means that there is a separate intercepts regression model. Therefore, we run first an ANCOVA model with interaction of treatments and covariate.

An appropriate statistical model is

$$y_{ij} = \mu + a_i + \beta_i(x_{ij} - \bar{x}_{..}) + e_{ij}, \quad i = 1, \ldots, a; \; j = 1, \ldots, n \tag{9.7}$$

where μ is a constant, $a_i$ is the treatment effect with the side condition $\sum_{i=1}^{a} a_i = 0$, $\beta_i$ is the coefficient for the linear regression for treatment A with class $a_i$ of $y_{ij}$ on $x_{ij}$, and the $e_{ij}$ are random independent normally distributed experimental errors with mean 0 and variance $\sigma^2$.

In R we would make an ANOVA table with this model and look at the test of the interaction effect of treatment and covariate to see whether we must reject the null hypothesis Hβ0: β1 = … = βa and accept the alternative hypothesis HβA: ‘there is at least one βi different from another βj with i ≠ j’. If this is the case we cannot use the ANCOVA model with parallel lines.

From the ANOVA table with machine as the last model variable we find the p‐value Pr(>F) = 0.1181 > 0.05, so we cannot reject the null hypothesis HA0: ‘all ai are equal’.

Using y for strength and x for diameter we have the regression lines for the machines: the regression line for M1 is y = 17.3592 + 0.954x; the regression line for M2 is y = 18.396 + 0.954x; the regression line for M3 is y = 15.7752 + 0.954x.

Note that in the output of summary(machine0) in Problem 9.3 we find the Estimate (Intercept) 17.360, which is the intercept I1; further we find machine2 1.037 and machine3 −1.584.

The intercept I2 is (intercept) + machine2 = 17.360 + 1.037 = 18.397 and the intercept I3 is (intercept) + machine3 = 17.360 + (−1.584) = 15.776.

In the rationale in Problem 9.4 we have found that the difference in the adjusted means of M2 − M1 is 41.4192 − 40.3824 = 1.0368, and this is given in Problem 9.3 in summary(machine0) as the Estimate machine2 1.037.

But the difference of the adjusted means of M2 − M1 is according to (9.4) $[\bar{y}_{2.} - b(\bar{x}_{2.} - \bar{x}_{..})] - [\bar{y}_{1.} - b(\bar{x}_{1.} - \bar{x}_{..})] = (\bar{y}_{2.} - b\bar{x}_{2.}) - (\bar{y}_{1.} - b\bar{x}_{1.}) = I_2 - I_1$.

Analogously, the difference of the adjusted means of M3 − M1 is 38.7984 − 40.3824 = −1.584, and this is given in Problem 9.3 in summary(machine0) as the Estimate machine3 −1.584. However, the difference in the adjusted means of M3 − M1 is, according to (9.4), I3 − I1.

9.2.2 Unbalanced Completely Randomised Design

Now we will give the ANCOVA for an unbalanced completely randomised design for a treatment A with a classes. Assuming further that there is a linear relationship between the response y and the covariate x, we find that an appropriate statistical model is

$$y_{ij} = \mu + a_i + \beta(x_{ij} - \bar{x}_{..}) + e_{ij}, \quad i = 1, \ldots, a; \; j = 1, \ldots, n_i \tag{9.8}$$

where μ is a constant, $a_i$ are the treatment effects with the side condition $\sum_{i=1}^{a} a_i = 0$, β is the coefficient for the linear regression of $y_{ij}$ on $x_{ij}$, and the $e_{ij}$ are random independent normally distributed experimental errors with mean 0 and variance $\sigma^2$.

The least squares estimators of the parameters in (9.8) are derived as follows.

Deriving the first partial derivatives of

$$S = \sum_{i=1}^{a}\sum_{j=1}^{n_i}\left(y_{ij} - \mu - a_i - \beta(x_{ij} - \bar{x}_{..})\right)^2$$

with respect to the parameters and setting the results to zero, at which point the parameter values in the equations are replaced by their estimates ($\hat{\mu}, \hat{a}_i, b$), leads to the normal equations:

$$\sum_{j=1}^{n_i}\left(y_{ij} - \hat{\mu} - \hat{a}_i - b(x_{ij} - \bar{x}_{..})\right) = 0, \quad i = 1, \ldots, a; \qquad \sum_{i=1}^{a}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{..})\left(y_{ij} - \hat{\mu} - \hat{a}_i - b(x_{ij} - \bar{x}_{..})\right) = 0 \tag{9.9}$$

From this, we obtain the estimates explicitly as:

$$\hat{\mu} + \hat{a}_i = \bar{y}_{i.} - b(\bar{x}_{i.} - \bar{x}_{..}), \quad i = 1, \ldots, a; \qquad b = \frac{E_{xy}}{E_{xx}} = \frac{\sum_{i=1}^{a}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.})}{\sum_{i=1}^{a}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i.})^2} \tag{9.10}$$

Because

$$S = \sum_{i=1}^{a}\sum_{j=1}^{n_i}\left(y_{ij} - \mu - a_i - \beta(x_{ij} - \bar{x}_{..})\right)^2$$

is a convex function, the solution of the equations obtained by setting the partial derivatives equal to zero gives a minimum of S.

Replacing the realisations of the random variables in (9.8) by the corresponding random variables results in the least squares estimators

$$\hat{\mu} + \hat{a}_i = \bar{Y}_{i.} - \hat{\beta}(\bar{x}_{i.} - \bar{x}_{..}), \qquad \hat{\beta} = \frac{\sum_{i=1}^{a}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i.})(Y_{ij} - \bar{Y}_{i.})}{E_{xx}} \tag{9.11}$$

The estimators in (9.11) are best linear unbiased estimators (BLUE) and normally distributed.

The analysis for the unbalanced ANCOVA goes with R with the same commands, but the differences are in the estimator of β, the standard errors of the estimate of the adjusted treatment means and the standard errors of the differences between the estimates of the adjusted treatment means.

The estimate of β is $b = E_{xy}/E_{xx}$ with

$$E_{xx} = \sum_{i=1}^{a}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i.})^2, \qquad E_{xy} = \sum_{i=1}^{a}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.}),$$

which is also given in (9.10) with $\hat{\beta} = b$.
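That b = Exy/Exx is the pooled within-group slope can be checked in R for unequal group sizes; the data and group sizes below are illustrative:

```r
# Pooled within-group slope b = Exy/Exx in an unbalanced design
set.seed(5)
ni <- c(5, 8, 11)                       # unequal group sizes
treat <- factor(rep(1:3, times = ni))
x <- rnorm(sum(ni), mean = 10, sd = 2)
y <- 20 + c(0, 3, -3)[treat] + 1.2 * x + rnorm(sum(ni))

Exx <- sum((x - ave(x, treat))^2)
Exy <- sum((x - ave(x, treat)) * (y - ave(y, treat)))
b   <- Exy / Exx

# identical to the covariate coefficient of the ANCOVA fit
all.equal(b, unname(coef(lm(y ~ treat + x))["x"]))
```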

The estimates of the treatment means adjusted for the covariate at x = overall mean $\bar{x}_{..}$ are

$$\bar{y}_{i.} - b(\bar{x}_{i.} - \bar{x}_{..}) \tag{9.12}$$

The estimate of the standard error of this estimate (9.12) is

$$\sqrt{MSE\left(\frac{1}{n_i} + \frac{(\bar{x}_{i.} - \bar{x}_{..})^2}{E_{xx}}\right)} \tag{9.13}$$

The estimate for the difference between two adjusted treatment means is

$$(\bar{y}_{i.} - \bar{y}_{j.}) - b(\bar{x}_{i.} - \bar{x}_{j.}), \quad \text{with estimated standard error} \quad \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j} + \frac{(\bar{x}_{i.} - \bar{x}_{j.})^2}{E_{xx}}\right)} \tag{9.14}$$
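For two groups, the difference of adjusted means and its standard error coincide with the treatment coefficient reported by lm() and its standard error; a sketch under simulated, unbalanced, illustrative data:

```r
# Difference of adjusted means and its standard error, unbalanced case
set.seed(6)
ni <- c(6, 10)
treat <- factor(rep(1:2, times = ni))
x <- rnorm(sum(ni), mean = 10, sd = 2)
y <- 20 + c(0, 4)[treat] + 1.2 * x + rnorm(sum(ni))

fit <- lm(y ~ treat + x)
MSE <- summary(fit)$sigma^2
Exx <- sum((x - ave(x, treat))^2)
b   <- sum((x - ave(x, treat)) * (y - ave(y, treat))) / Exx

xbar_i <- tapply(x, treat, mean)
ybar_i <- tapply(y, treat, mean)
diff_adj <- (ybar_i[1] - ybar_i[2]) - b * (xbar_i[1] - xbar_i[2])  # group1 - group2
se_diff  <- sqrt(MSE * (1/ni[1] + 1/ni[2] +
                        (xbar_i[1] - xbar_i[2])^2 / Exx))          # estimated SE

# 'treat2' estimates group2 - group1 with the same standard error
all.equal(unname(-diff_adj), unname(coef(fit)["treat2"]))
all.equal(unname(se_diff),
          unname(summary(fit)$coefficients["treat2", "Std. Error"]))
```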

Because the interaction Group:x has the p‐value Pr(>F) = 0.42794 > 0.05 we cannot reject the null hypothesis Hβ0: β1 = … = βa, and we can use the ANCOVA model (9.8) with parallel regression lines of y on x for the groups.

Because in the ANOVA table with x as the last model variable we find the p‐value Pr(>F) = 5.063e‐09 < 0.05, we reject the null hypothesis Hβ0: β = 0.

Of course this can also be concluded from the t‐test for x, where we find the same p‐value Pr(>|t|) = 5.06e‐09 < 0.05.

 > summary(testbeta0)
Call:
lm(formula = y ~ Group + x, data = example9_2)
Residuals:
     Min       1Q   Median       3Q      Max
-19.0353  -6.8607   0.1951   6.1214  18.7915
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   74.814     10.163   7.361 2.75e-08 ***
Group2       -10.222      3.363  -3.040  0.00478 **
x            -11.268      1.410  -7.991 5.06e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.678 on 31 degrees of freedom
Multiple R-squared:  0.6848,    Adjusted R-squared:  0.6644
F-statistic: 33.67 on 2 and 31 DF,  p-value: 1.692e-08 

We see that the estimate of the standard error of the difference of the adjusted means at the overall mean of x of group 1 − group 2 is SEG1−G2 = 3.363, which we have already found as 3.363 in summary(Group0) in Problem 9.9.

The 0.95‐confidence interval for the expected difference of the adjusted means at the overall mean of x of group 1 − group 2 is: [3.363233; 17.08077].

Hence the regression line for Fibralo = group 1 is y = 74.8135 − 11.268x and the regression line for Gemfibrozil = group 2 is y = 64.59181 − 11.268x.

The scatter plot with the regression lines is done as follows.

 > x1 <- c(7.0,6.0,7.1,8.6,6.3,7.5,6.6,7.4,5.3,6.5,6.2,7.8,8.5,9.2,5.0,7.0)
> y1 <- c(5,10,-5,-20,0,-15,10,-10,20,-15,5,0,-40,-25,25,-10)
> x2 <- c(5.1,6.0,7.2,6.4,5.5,6.0,5.6,5.5,6.7,8.6,6.4,6.0,9.3,8.5, 7.9,7.4,5.0,6.5)
> y2 <- c(10,15,-15,5,10,-15,-5,-10,-20,-40,-5,-10,-40,-20,-35,0,0,-10)
> Group <- rep(1:2, times= c(16,18))
> x <- c(x1,x2)
> y <- c(y1,y2)
> plot(x, y, main = "Group 1 ---, Group 2 ----", pch = as.character(Group))
> I1 <- 74.8135             # intercept group 1, from summary(testbeta0)
> I2 <- 64.59181            # intercept group 2 = I1 + Group2 estimate
> estimate.beta <- -11.268  # common slope estimate
> abline(I1, estimate.beta, lty=1, lwd=2)
> abline(I2, estimate.beta, lty=2, lwd=3) 

See Figure 9.2.


Figure 9.2 Scatter‐plot with regression lines of the example in Problem 9.2.

9.3 Randomised Complete Block Design with Covariate

We assume that we have a balanced randomised complete block design for a treatment A with a classes and b blocks. Assuming further that there is a linear relationship between the response y and the covariate x, we find that an appropriate statistical model is

$$y_{ijk} = \mu + a_i + c_j + \beta(x_{ijk} - \bar{x}_{...}) + e_{ijk}, \quad i = 1, \ldots, a; \; j = 1, \ldots, b \tag{9.15}$$

where μ is a constant, $a_i$ is the treatment effect with the side condition $\sum_{i=1}^{a} a_i = 0$, $c_j$ is the block effect with the side condition $\sum_{j=1}^{b} c_j = 0$, β is the coefficient for the linear regression of $y_{ijk}$ on $x_{ijk}$, and the $e_{ijk}$ are random independent normally distributed experimental errors with mean 0 and variance $\sigma^2$.

The analysis is analogous to Section 9.2.1 for the balanced completely randomised design; only the ANOVA table additionally includes the effect of blocks.

We want first to check if we can use an ANCOVA model with parallel lines, which means that there is a separate intercepts regression model. Therefore, we run first an ANCOVA model with interaction of treatments and covariate.

An appropriate statistical model is

$$y_{ijk} = \mu + a_i + c_j + \beta_i(x_{ijk} - \bar{x}_{...}) + e_{ijk}, \quad i = 1, \ldots, a; \; j = 1, \ldots, b \tag{9.16}$$

where μ is a constant, $a_i$ is the treatment effect with the side condition $\sum_{i=1}^{a} a_i = 0$, $c_j$ is the block effect with the side condition $\sum_{j=1}^{b} c_j = 0$, $\beta_i$ is the coefficient for the linear regression for treatment A with class $A_i$ of $y_{ijk}$ on $x_{ijk}$, and the $e_{ijk}$ are random independent normally distributed experimental errors with mean 0 and variance $\sigma^2$.

In R we make an ANOVA table for this model and look at the test of the interaction effect of treatment and covariate to see whether we must reject the null hypothesis Hβ0: β1 = … = βa and accept the alternative hypothesis HβA: ‘there is at least one βi different from another βj with i ≠ j’. If this is the case we cannot use the ANCOVA model with parallel lines.
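A sketch of this check and of the subsequent parallel-lines analysis in R, with illustrative simulated data (treatment, block and slope values are assumptions):

```r
# RCBD with covariate: slope-homogeneity check, then parallel-lines ANCOVA
set.seed(7)
a <- 3; b_blocks <- 4
treat <- factor(rep(1:a, times = b_blocks))
block <- factor(rep(1:b_blocks, each = a))
x <- rnorm(a * b_blocks, mean = 10, sd = 2)
y <- 15 + c(0, 2, -2)[treat] + c(0, 1, -1, 2)[block] + 0.8 * x +
  rnorm(a * b_blocks, sd = 0.5)

anova(lm(y ~ block + treat * x))   # row 'treat:x' tests slope homogeneity
anova(lm(y ~ block + treat + x))   # parallel-lines model of the RCBD ANCOVA
```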

9.4 Concluding Remarks

If the covariate x as well as the primary response variable y is affected by the treatments, the resultant response is bivariate and the covariance adjustment of treatment means is inappropriate. In these cases an analysis of the bivariate response (x, y) utilising multivariate methods is in order. Multivariate methods are not covered in this book.

Adjustment for the covariate is appropriate if it is measured prior to treatment administration since the treatments have not yet had the opportunity to affect its value. If the covariate is measured concurrently with the response variable, then it must be decided whether it could be affected by the treatments before the covariance adjustments are considered. See the rationale given in Example 9.3.

Practical application of the analysis of covariance has been demonstrated only with completely randomised designs and randomised complete block designs; however, the use of covariates can be extended to any treatment and experiment design as well as to comparative observational studies of complex structure and studies requiring multiple covariates for adjustment. Using R, the analysis with multiple covariates is easy to do: simply add the covariates to the model formula.

For further information about topics in analysis of covariance, see Snedecor and Cochran (1989) sections 18.5–18.9.

Extensive discussions on the use and misuses of covariates in research studies were provided in two special issues of Biometrics (1957), Volume 13, No. 3; and Biometrics (1982), Volume 38, No. 3. Of particular interest are articles by Cochran (1957), Smith (1957), and Cox and McCullagh (1982). A number of issues arise relevant to the use of covariates. Amongst those concerns are the applicability in certain situations and the relationship between blocking and covariates.

Analysis of covariance for general random and mixed‐effects models is considerably more difficult. Henderson and Henderson (1979) and Henderson (1982) discuss the problem and possible approaches.

References

  1. Cochran, W.G. (1957). Analysis of covariance: its nature and uses. Biometrics 13: 261–281.
  2. Cox, D.R. and McCullagh, P. (1982). Some aspects of analysis of covariance. Biometrics 38: 541–561.
  3. Fisher, R.A. (1935). Statistical Methods for Research Workers, 5th ed. Edinburgh: Oliver & Boyd.
  4. Henderson, C.R. Jr. (1982). Analysis of covariance in the mixed model: higher‐level, nonhomogeneous, and random regressions. Biometrics 38: 623–640.
  5. Henderson, C.R. Jr. and Henderson, C.R. (1979). Analysis of covariance in mixed models with unequal subclass numbers. Communications in Statistics A 8: 751–788.
  6. Montgomery, D.C. (2013). Design and Analysis of Experiments, 8th ed. New York: John Wiley & Sons, Inc.
  7. Smith, H.F. (1957). Interpretation of adjusted treatment means and regressions in analysis of covariance. Biometrics 13: 282–308.
  8. Snedecor, G.W. and Cochran, W.G. (1989). Statistical Methods, 8th ed. Ames: Iowa State University Press.
  9. Walker, G.A. (1997). Common Statistical Methods for Clinical Research, with SAS Examples. Cary, NC: SAS Institute Inc.