Analysis of variance

Analysis of variance (Anova) is used to fit data to a linear model when all explanatory variables are categorical. Each of these categorical explanatory variables is known as a factor, and each factor can have two or more levels. When a single factor with three or more levels is present, we use a one-way Anova to analyze the data. With a single factor that has only two levels, a Student's t-test is sufficient. With two or three factors, we use a two-way or three-way Anova, respectively. You can perform an Anova in R using the aov() function.
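To make the link between the t-test and the one-way Anova concrete, here is a minimal sketch. It assumes the sleep data set that ships with base R, which is not part of this chapter's example, and the object names (tt, fit, tab) are only illustrative. For a factor with two levels, the pooled-variance t-test and the one-way Anova should give the same p-value, with F equal to t squared.

```r
# Two-level factor: compare a pooled-variance t-test with a one-way Anova.
# 'sleep' is a built-in R data set (an assumption of this sketch).
tt  <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
fit <- aov(extra ~ group, data = sleep)

# summary() of an aov object is a list; its first element is the Anova table
tab     <- summary(fit)[[1]]
f_value <- tab[["F value"]][1]
p_anova <- tab[["Pr(>F)"]][1]
t_sq    <- unname(tt$statistic)^2

print(c(t_squared = t_sq, F = f_value))
print(c(p_ttest = tt$p.value, p_anova = p_anova))
```

Note that var.equal=TRUE is needed here, because the default Welch t-test does not assume equal variances and will generally give a slightly different p-value.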

In the first example, we will look at the relationship between the dose of drugA and the level of fatigue reported by 20 patients. Note that the model fitted below uses drugA_dose as the response and fatigue as the factor:

> patient.fatigue <- read.table(header=TRUE, text='
   patients fatigue drugA_dose
1         1     low        0.2
2         2     low        0.2
3         3     med        0.2
4         4     med        0.2
5         5     med        0.2
6         6     low        0.4
7         7     low        0.4
8         8     low        0.4
9         9     med        0.4
10       10     med        0.4
11       11     med        0.8
12       12    high        0.8
13       13     med        0.8
14       14     med        0.8
15       15    high        0.8
16       16    high        1.2
17       17    high        1.2
18       18    high        1.2
19       19    high        1.2
20       20     med        1.2 ')

> attach(patient.fatigue)
> aov(drugA_dose ~ fatigue)
Call:
   aov(formula = drugA_dose ~ fatigue)

Terms:
                 fatigue Residuals
Sum of Squares  1.666444  1.283556
Deg. of Freedom        2        17

Residual standard error: 0.2747786
Estimated effects may be unbalanced
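The same model can be fitted without attach() by passing the data frame through the data argument of aov(), which avoids masking other objects in the workspace. The following is a sketch under that assumption; it rebuilds the same 20 observations with data.frame() so that it runs on its own, and the names fit and tab are only illustrative.

```r
# Rebuild the example data (same values as the table above)
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)

# Fit the one-way Anova without attaching the data frame
fit <- aov(drugA_dose ~ fatigue, data = patient.fatigue)

# Extract the Anova table as a data frame
tab <- summary(fit)[[1]]
print(tab)
```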

A more concise way to view the results of the one-way Anova analysis is to use the summary() function as follows:

> summary(aov(drugA_dose ~ fatigue))
            Df Sum Sq Mean Sq F value   Pr(>F)    
fatigue      2  1.666  0.8332   11.04 0.000847 ***
Residuals   17  1.284  0.0755                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the analysis, you can see that the dose of drugA administered differs significantly among the levels of fatigue reported (F = 11.04, p = 0.000847). Now, we can plot the model to check that its assumptions are met, namely that the variance is constant and the errors are normally distributed. Let's take a look at this in the following lines of code:

> modelA <- aov(drugA_dose ~ fatigue)
> par(mfrow=c(2,2)) 
> plot(modelA)  
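The diagnostic plots can be complemented with formal tests. The sketch below is one possible approach, not the only one: it assumes the Shapiro-Wilk test (shapiro.test()) for normality of the residuals and Bartlett's test (bartlett.test()) for equal variances across the fatigue levels. The data frame is rebuilt so that the block runs on its own, and the names normality and equal_var are only illustrative.

```r
# Rebuild the example data (same values as the table above)
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)
fit <- aov(drugA_dose ~ fatigue, data = patient.fatigue)

# Normality of the residuals (null hypothesis: residuals are normal)
normality <- shapiro.test(residuals(fit))

# Homogeneity of variance across groups (null hypothesis: equal variances)
equal_var <- bartlett.test(drugA_dose ~ fatigue, data = patient.fatigue)

print(normality)
print(equal_var)
```

With only 20 observations these tests have limited power, so they are best read alongside the plots rather than in place of them.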

In the residuals versus leverage plot, we can see that the observation for patient 20 strongly influences the fit. Let's see what happens when we remove this data point:

> modelB <- update(modelA, subset=(patients !=20))
> summary(modelB)
            Df Sum Sq Mean Sq F value   Pr(>F)    
fatigue      2 1.8152  0.9076   17.79 8.57e-05 ***
Residuals   16 0.8163  0.0510                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
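Influential observations can also be identified numerically. As a sketch, assuming Cook's distance is an acceptable measure of influence here, the cooks.distance() function can be applied to the fitted model and the largest value located. The data frame is rebuilt so that the block runs on its own, and the names cd and most_influential are only illustrative.

```r
# Rebuild the example data (same values as the table above)
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)
fit <- aov(drugA_dose ~ fatigue, data = patient.fatigue)

# Cook's distance for every observation, named by row
cd <- cooks.distance(fit)

# Row with the largest influence on the fit
most_influential <- as.integer(names(which.max(cd)))
print(round(sort(cd, decreasing = TRUE)[1:3], 3))
print(most_influential)
```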

As you can see, the p-value changes but the interpretation remains the same. We can also investigate the effects of the individual levels using the summary.lm() function as follows:

> summary.lm(modelB)

Call:
aov(formula = drugA_dose ~ fatigue, subset = (patients != 20))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2750 -0.1933  0.0800  0.1333  0.3250 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.06667    0.09221  11.567 3.50e-09 ***
fatiguelow  -0.74667    0.13678  -5.459 5.25e-05 ***
fatiguemed  -0.59167    0.12199  -4.850 0.000177 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2259 on 16 degrees of freedom
Multiple R-squared:  0.6898,  Adjusted R-squared:  0.651 
F-statistic: 17.79 on 2 and 16 DF,  p-value: 8.575e-05

To interpret these coefficients, remember that R interprets lm(y ~ x) as y = a + bx, where a is the intercept and b is the slope. Similarly, for a factor x with three levels, aov(y ~ x) is interpreted as y = a + bx1 + cx2, where x1 and x2 are 0/1 indicator variables for the second and third levels. The intercept a is therefore the mean of the reference level, which by R convention is the level that comes first in alphabetical order. So, in the coefficients section of our summary table, the intercept is the mean dose for the fatigue level high, while fatiguelow and fatiguemed are the differences between the low and med means and the high mean.
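The interpretation of the coefficients can be checked directly against the group means. The sketch below rebuilds the data, drops patient 20 as in modelB, and compares coef() with the means returned by tapply(). It also shows how relevel() could be used to choose a different reference level; the names no20, group_means, fit_low, and the column fatigue_low are only illustrative.

```r
# Rebuild the example data (same values as the table above)
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5)
)

# Same subset as modelB: patient 20 removed
no20 <- subset(patient.fatigue, patients != 20)
fit  <- aov(drugA_dose ~ fatigue, data = no20)

# Mean dose per fatigue level; levels are ordered alphabetically
group_means <- tapply(no20$drugA_dose, no20$fatigue, mean)
coefs       <- coef(fit)

print(group_means)
print(coefs)  # intercept = mean of "high"; others = differences from it

# Use "low" as the reference level instead of "high"
no20$fatigue_low <- relevel(no20$fatigue, ref = "low")
fit_low <- aov(drugA_dose ~ fatigue_low, data = no20)
print(coef(fit_low))  # intercept is now the mean of "low"
```

Changing the reference level changes the individual coefficients but not the fitted values or the overall F test.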

Now, if we want to include the sex of each patient in addition to fatigue in our model, we can perform a two-way Anova on the original dataset as follows:

> patient.sex <- as.factor(c("F", "F", "F", "M", "M", "F", "M", "M", "M", "F", "F", "M", "M", "F", "F", "F", "M", "M", "F", "M")) 
> modelC <- aov(drugA_dose ~ fatigue*patient.sex) 
> summary(modelC)
                    Df Sum Sq Mean Sq F value  Pr(>F)   
fatigue              2 1.6664  0.8332   9.243 0.00276 **
patient.sex          1 0.0067  0.0067   0.075 0.78842   
fatigue:patient.sex  2 0.0148  0.0074   0.082 0.92158   
Residuals           14 1.2620  0.0901                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the summary, you can see that the fatigue:patient.sex interaction is not significant (p = 0.92), so there is no evidence that the effects of sex and fatigue are non-additive. We also see that there is no significant relationship between drugA_dose and patient.sex (p = 0.79). We can compare the two models using the anova() function as follows:

> anova(modelA, modelC)
Analysis of Variance Table

Model 1: drugA_dose ~ fatigue
Model 2: drugA_dose ~ fatigue * patient.sex
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     17 1.2836                           
2     14 1.2620  3  0.021556 0.0797   0.97

As you can see, the models are not significantly different from each other (p = 0.97). So, we can choose the simpler model to explain the data.
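The comparison above goes directly from the one-factor model to the model with an interaction. As a sketch of a more stepwise comparison, an intermediate additive model (fatigue + sex, without the interaction) can be included in the same anova() call. Here the sex values are stored as a column of the data frame, which is our choice for this example, and the names fit_one, fit_add, fit_int, and cmp are only illustrative.

```r
# Rebuild the example data, with sex stored as a column
patient.fatigue <- data.frame(
  patients   = 1:20,
  fatigue    = factor(c("low", "low", "med", "med", "med",
                        "low", "low", "low", "med", "med",
                        "med", "high", "med", "med", "high",
                        "high", "high", "high", "high", "med")),
  drugA_dose = rep(c(0.2, 0.4, 0.8, 1.2), each = 5),
  sex        = factor(c("F", "F", "F", "M", "M", "F", "M", "M", "M", "F",
                        "F", "M", "M", "F", "F", "F", "M", "M", "F", "M"))
)

# Three nested models, from simplest to most complex
fit_one <- aov(drugA_dose ~ fatigue,       data = patient.fatigue)
fit_add <- aov(drugA_dose ~ fatigue + sex, data = patient.fatigue)
fit_int <- aov(drugA_dose ~ fatigue * sex, data = patient.fatigue)

# Each row tests whether the added terms improve on the previous model
cmp <- anova(fit_one, fit_add, fit_int)
print(cmp)
```

If neither step gives a significant improvement, the one-factor model remains the preferred choice.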
