Analysis of variance (Anova) is used to fit data to a linear model when all explanatory variables are categorical. Each of these categorical explanatory variables is known as a factor, and each factor can have two or more levels. When a single factor is present with three or more levels, we use a one-way Anova to analyze the data. If we had a single factor with only two levels, we would use a Student's t-test instead. When there are two or three factors, we use a two-way or three-way Anova, respectively. You can easily perform an Anova in R using the aov() function.
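To illustrate the link between the two-level case and the t-test, here is a minimal sketch on made-up data. The vectors y and g are hypothetical and are not part of the example that follows. It checks that a one-way Anova on a two-level factor agrees with a Student's t-test that assumes equal variances: the F statistic equals the square of the t statistic, and the p values match.

```r
# Hypothetical data: one factor with two levels, five observations each
y <- c(4.1, 5.0, 4.7, 5.3, 4.4, 6.2, 5.9, 6.8, 6.1, 6.5)
g <- factor(rep(c("A", "B"), each = 5))

# Student's t-test with pooled variance, and a one-way Anova on the same data
tt <- t.test(y ~ g, var.equal = TRUE)
av <- summary(aov(y ~ g))[[1]]   # first list element is the Anova table

t.stat <- unname(tt$statistic)
f.stat <- av[["F value"]][1]

all.equal(t.stat^2, f.stat)                # F equals t squared
all.equal(tt$p.value, av[["Pr(>F)"]][1])   # the p values agree
```

Note that the default t.test() call uses the Welch correction (var.equal = FALSE), which would not match the Anova exactly.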
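In the first example, we will look at the relationship between the dose of drugA and the level of fatigue reported by 20 patients. Note that the model is fitted with drugA_dose as the response and fatigue as the categorical explanatory variable: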
> patient.fatigue <- read.table(header=TRUE, text='
   patients fatigue drugA_dose
1         1     low        0.2
2         2     low        0.2
3         3     med        0.2
4         4     med        0.2
5         5     med        0.2
6         6     low        0.4
7         7     low        0.4
8         8     low        0.4
9         9     med        0.4
10       10     med        0.4
11       11     med        0.8
12       12    high        0.8
13       13     med        0.8
14       14     med        0.8
15       15    high        0.8
16       16    high        1.2
17       17    high        1.2
18       18    high        1.2
19       19    high        1.2
20       20     med        1.2
')
> attach(patient.fatigue)
> aov(drugA_dose ~ fatigue)
Call:
   aov(formula = drugA_dose ~ fatigue)
Terms:
                 fatigue Residuals
Sum of Squares  1.666444  1.283556
Deg. of Freedom        2        17
Residual standard error: 0.2747786
Estimated effects may be unbalanced
A more informative way to view the results of the one-way Anova analysis is to use the summary() function, which reports the F statistic and the p value, as follows:
> summary(aov(drugA_dose ~ fatigue))
            Df Sum Sq Mean Sq F value   Pr(>F)
fatigue      2  1.666  0.8332   11.04 0.000847 ***
Residuals   17  1.284  0.0755
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
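Because attach() places the columns on the search path, where they can mask or be masked by other objects, a common alternative is to pass the data frame through the data argument. The following is a minimal sketch of that approach, not the method used in this example. It rebuilds the same 20 rows with data.frame() and pulls the F statistic and p value out of the summary table. The names fit, tab, f.value, and p.value are introduced here for illustration only.

```r
# Same 20 observations as in the example, built directly as a data frame
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)

# Pass the data frame explicitly instead of calling attach()
fit <- aov(drugA_dose ~ fatigue, data = patient.fatigue)

# summary() returns a list; its first element is the Anova table
tab <- summary(fit)[[1]]
f.value <- tab[["F value"]][1]
p.value <- tab[["Pr(>F)"]][1]

round(f.value, 2)    # F statistic for fatigue
signif(p.value, 3)   # corresponding p value
```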
From the analysis, you can see that the level of fatigue reported is significantly related to the dose of drugA administered (F = 11.04, p = 0.000847). Now, we can plot the model to check that its assumptions are met, namely that the variance is constant and the errors are normally distributed. Let's take a look at this in the following lines of code:
> modelA <- aov(drugA_dose ~ fatigue)
> par(mfrow=c(2,2))
> plot(modelA)
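The diagnostic plots can be complemented with formal tests of the same two assumptions. The sketch below is one possible way to do this and is not part of the original analysis. It applies shapiro.test() to the residuals to check normality and bartlett.test() to check for equal variance across the fatigue levels. The names normality and homogeneity are illustrative, and the data frame is rebuilt so that the block runs on its own.

```r
# Same 20 observations as in the example
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)
modelA <- aov(drugA_dose ~ fatigue, data = patient.fatigue)

# Shapiro-Wilk test on the residuals (null hypothesis: errors are normal)
normality <- shapiro.test(residuals(modelA))

# Bartlett test (null hypothesis: equal variance in every fatigue group)
homogeneity <- bartlett.test(drugA_dose ~ fatigue, data = patient.fatigue)

normality$p.value
homogeneity$p.value
```

With only 20 observations, these tests have limited power, so a large p value should be read alongside the plots rather than instead of them.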
In the residuals versus leverage plot, we can see that the observation for patient 20 (patients[20]) has a large influence on the fitted model. Let's see what happens when we remove this data point:
> modelB <- update(modelA, subset=(patients != 20))
> summary(modelB)
            Df Sum Sq Mean Sq F value   Pr(>F)
fatigue      2 1.8152  0.9076   17.79 8.57e-05 ***
Residuals   16 0.8163  0.0510
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
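The influence of an observation can also be checked numerically rather than read off the plot. The following sketch uses cooks.distance(), which accepts aov fits because they inherit from lm, to rank the observations. The names cd and most.influential are introduced for illustration, and the data frame is rebuilt so that the block is self-contained.

```r
# Same 20 observations as in the example
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)
modelA <- aov(drugA_dose ~ fatigue, data = patient.fatigue)

# Cook's distance for every observation, named by row
cd <- cooks.distance(modelA)

# Row name of the observation with the largest Cook's distance
most.influential <- names(which.max(cd))
most.influential

# The three largest values, for comparison
round(sort(cd, decreasing = TRUE)[1:3], 3)

# Refit without that patient, as in the example
modelB <- update(modelA, subset = patients != 20)
```

On this data, the largest Cook's distance belongs to patient 20, which agrees with the residuals versus leverage plot.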
As you can see, the p value changes (from 0.000847 to 8.57e-05), but the interpretation remains the same. We can also investigate the effects of the different levels using the summary.lm() function as follows:
> summary.lm(modelB)
Call:
aov(formula = drugA_dose ~ fatigue, subset = (patients != 20))
Residuals:
    Min      1Q  Median      3Q     Max
-0.2750 -0.1933  0.0800  0.1333  0.3250
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.06667    0.09221  11.567 3.50e-09 ***
fatiguelow  -0.74667    0.13678  -5.459 5.25e-05 ***
fatiguemed  -0.59167    0.12199  -4.850 0.000177 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2259 on 16 degrees of freedom
Multiple R-squared:  0.6898, Adjusted R-squared:  0.651
F-statistic: 17.79 on 2 and 16 DF,  p-value: 8.575e-05
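In order to interpret the meaning of these coefficients, we need to remember that lm(y ~ x) with a continuous x is interpreted as y = a + bx by R, where a is the intercept and b is the slope. Similarly, for a factor x with three levels, the model aov(y ~ x) is interpreted as y = a + bx1 + cx2, where x1 and x2 are 0/1 indicator variables for the second and third levels. Therefore, the intercept a is the mean of the response at the reference level, which by R convention is the factor level that comes first in alphabetical order. So, in the coefficients section of our summary table, the intercept refers to fatiguehigh: the mean dose in the high fatigue group is 1.067, and the fatiguelow and fatiguemed coefficients are the differences of the low and med group means from it (0.747 and 0.592 lower, respectively).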
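The reading of the coefficients as a reference-level mean plus differences can be checked against the group means directly. The sketch below assumes the same data with patient 20 removed. The names sub, group.means, coefs, and refit are introduced for illustration. It also shows that relevel() changes which level the intercept refers to.

```r
# Same 20 observations as in the example
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)

# Drop patient 20 and refit, mirroring modelB
sub <- subset(patient.fatigue, patients != 20)
modelB <- aov(drugA_dose ~ fatigue, data = sub)
coefs <- coef(modelB)

# Mean dose within each fatigue level
group.means <- sapply(split(sub$drugA_dose, sub$fatigue), mean)

# The intercept is the mean of the reference level ("high")
all.equal(coefs[["(Intercept)"]], group.means[["high"]])

# The other coefficients are differences from the reference level
all.equal(coefs[["fatiguelow"]], group.means[["low"]] - group.means[["high"]])
all.equal(coefs[["fatiguemed"]], group.means[["med"]] - group.means[["high"]])

# Making "low" the reference level changes what the intercept means
sub$fatigue <- relevel(sub$fatigue, ref = "low")
refit <- aov(drugA_dose ~ fatigue, data = sub)
coef(refit)   # the intercept is now the mean of the low group
```

Releveling changes only the parameterization: the fitted values and the Anova table stay the same.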
Now, if we wanted to factor in the gender of each patient in addition to fatigue, we can perform a two-way Anova on the original dataset as follows:
> patient.sex <- as.factor(c("F", "F", "F", "M", "M", "F", "M", "M", "M", "F", "F", "M", "M", "F", "F", "F", "M", "M", "F", "M"))
> modelC = aov(drugA_dose ~ fatigue*patient.sex)
> summary(modelC)
                    Df Sum Sq Mean Sq F value  Pr(>F)
fatigue              2 1.6664  0.8332   9.243 0.00276 **
patient.sex          1 0.0067  0.0067   0.075 0.78842
fatigue:patient.sex  2 0.0148  0.0074   0.082 0.92158
Residuals           14 1.2620  0.0901
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the summary, you can see that the fatigue:patient.sex interaction is not significant (p = 0.92), so there is no evidence that the effects of gender and fatigue are non-additive. We also see that there is no significant relationship between drugA_dose and patient.sex (p = 0.79). We can compare the two models using the anova() function as follows:
> anova(modelA, modelC)
Analysis of Variance Table
Model 1: drugA_dose ~ fatigue
Model 2: drugA_dose ~ fatigue * patient.sex
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     17 1.2836
2     14 1.2620  3  0.021556 0.0797   0.97
As you can see, the two models are not significantly different from each other (p = 0.97). So, we can choose the simpler model to explain the data.
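The same comparison can include an intermediate model in which sex enters additively, without the interaction. The sketch below is an illustration under assumptions, not part of the original analysis. The name modelD, the column name sex, and the name comparison are hypothetical, and the data are stored in a single data frame instead of being attached.

```r
# Same 20 observations as in the example, with sex added as a column
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5),
  sex        = factor(c("F", "F", "F", "M", "M", "F", "M", "M", "M", "F",
                        "F", "M", "M", "F", "F", "F", "M", "M", "F", "M"))
)

modelA <- aov(drugA_dose ~ fatigue, data = patient.fatigue)        # fatigue only
modelD <- aov(drugA_dose ~ fatigue + sex, data = patient.fatigue)  # additive (hypothetical name)
modelC <- aov(drugA_dose ~ fatigue * sex, data = patient.fatigue)  # with interaction

# Sequential comparison of the three nested models
comparison <- anova(modelA, modelD, modelC)
comparison
```

Each row of the resulting table tests whether the extra terms in that model reduce the residual sum of squares significantly relative to the model on the row above.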