Generalized linear models

We just saw how to fit our data to a model using linear regression. However, for that model to be valid, we had to assume that the variance is constant and that the errors are normally distributed. A generalized linear model (GLM) is an alternative to linear regression that allows the errors to follow probability distributions other than the normal distribution. GLMs are typically used for response variables that represent count data or binary outcomes. To fit your data to a GLM in R, you can use the glm() function.

A GLM has three important properties:

  • An error structure
  • A linear predictor
  • A link function

The error structure specifies the distribution used to model the errors and is set with the family argument. For example, you might want to use a Poisson distribution to model the errors for count data and a Gamma distribution to model data showing a constant coefficient of variation, as follows:

glm(y ~ z, family = poisson)  # count data
glm(y ~ z, family = Gamma)    # constant coefficient of variation

The linear predictor incorporates the information about the independent variables into the model; it is defined as the linear sum of the effects of one or more explanatory variables. The link function specifies the relationship between the linear predictor and the mean of the response variable. By fitting the model with different link functions, their performance can be compared; ideally, the best link function is the one that produces the minimal residual deviance, where the deviance is a goodness-of-fit statistic often used for statistical hypothesis testing. The canonical link functions are the defaults used when a particular error structure is specified. Let's take a look at this in the following table:

Error structure      Canonical link function
binomial             link = "logit"
gaussian             link = "identity"
Gamma                link = "inverse"
inverse.gaussian     link = "1/mu^2"
poisson              link = "log"
quasi                link = "identity", variance = "constant"
quasibinomial        link = "logit"
quasipoisson         link = "log"

To use an alternative link function when fitting our data to a GLM, we can change the link argument as follows:

glm(y ~ z, family = binomial(link = "probit"))
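In practice, we can compare link functions by fitting the same model once per candidate link and inspecting the residual deviance of each fit. The following is a minimal sketch, assuming y is a binary response (or a two-column success/failure matrix) and z is an explanatory variable; the link producing the smaller residual deviance is preferred:

logit_model <- glm(y ~ z, family = binomial(link = "logit"))
probit_model <- glm(y ~ z, family = binomial(link = "probit"))
deviance(logit_model)   # residual deviance under the logit link
deviance(probit_model)  # residual deviance under the probit link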

Now, let's go over a detailed example of how to fit your data to a GLM in R. In the first example, we will look at the effect of increasing doses of a compound on mortality in groups of 20 male and female mice. Let's take a look at this in the following lines of code:

> cmp1.ld <- read.table(header=TRUE, text='
   lethaldose sex numdead numalive
1           0   M       1       19
2           1   M       3       17
3           2   M       9       11
4           3   M      14        6
5           4   M      17        3
6           5   M      20        0
7           0   F       0       20
8           1   F       2       18
9           2   F       2       18
10          3   F       3       17
11          4   F       4       16
12          5   F       6       14
')
> attach(cmp1.ld)

We can plot the data to take a look at the relationship between the dose of the compound and the proportion of deaths by sex. Let's take a look at this in the following lines of code:

> proportion_dead <- numdead/20
> plot(proportion_dead ~ lethaldose, pch=as.character(sex))

The result is shown in the following plot:

[Plot: proportion_dead versus lethaldose, with each point labeled M or F according to sex]

Now we will fit our data to a GLM. First, we combine the number of dead and alive mice into a two-column counts matrix, with successes (deaths) in the first column and failures in the second, as follows:

> counts <- cbind(numdead, numalive)
> cmp1.ld.model <- glm( counts ~ sex * lethaldose, family=binomial)
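As an aside, the same binomial model can also be specified with the proportion of deaths as the response and the weights argument giving the number of trials in each row. This is a minimal sketch using the proportion_dead variable computed earlier; it should produce the same coefficient estimates as the counts matrix form:

> glm(proportion_dead ~ sex * lethaldose, family = binomial, weights = rep(20, 12))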

We can summarize the results for the GLM model with the summary() function as follows:

> summary(cmp1.ld.model)


Call:
glm(formula = counts ~ sex * lethaldose, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.23314  -0.14226  -0.03905   0.17624   1.11956  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.2507     0.6774  -4.799 1.59e-06 ***
sexM              0.3342     0.8792   0.380  0.70387    
lethaldose        0.4856     0.1812   2.681  0.00735 ** 
sexM:lethaldose   0.7871     0.2804   2.807  0.00500 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 125.811  on 11  degrees of freedom
Residual deviance:   3.939  on  8  degrees of freedom
AIC: 40.515

Number of Fisher Scoring iterations: 4

At first glance, there seems to be a significant interaction between the sex and lethaldose explanatory variables (p = 0.0050). However, before we draw this conclusion, we need to check whether there is more variability in the data than expected under the statistical model (overdispersion). We can check for overdispersion in our model by dividing the residual deviance by the number of degrees of freedom as follows:

> 3.939/8
[1] 0.492375
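Rather than typing the numbers by hand, the same ratio can be computed directly from the fitted model object using the deviance() and df.residual() accessor functions, which avoids manual rounding:

> deviance(cmp1.ld.model) / df.residual(cmp1.ld.model)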

The value is less than 1, which reassures us that the data is not overdispersed. However, if we had obtained a value greater than 1, we would want to refit the model with a quasibinomial error structure to account for the overdispersion. For the sake of argument, let's see if we can improve the model by using a quasibinomial error structure instead of a binomial one. Let's take a look at this in the following lines of code:

> summary(glm( counts ~ sex * lethaldose, family=quasibinomial))

Call:
glm(formula = counts ~ sex * lethaldose, family = quasibinomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.23314  -0.14226  -0.03905   0.17624   1.11956  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -3.2507     0.3957  -8.214 3.61e-05 ***
sexM              0.3342     0.5137   0.651  0.53355    
lethaldose        0.4856     0.1058   4.588  0.00178 ** 
sexM:lethaldose   0.7871     0.1638   4.805  0.00135 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.3413398)

    Null deviance: 125.811  on 11  degrees of freedom
Residual deviance:   3.939  on  8  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

As you can see, the change in error structure left the coefficient estimates unchanged; only the standard errors and, consequently, the p values of the explanatory variables changed. Now let's see what happens if we remove the sex:lethaldose interaction from our original model. Let's take a look at this in the following lines of code:

> cmp1.ld.model3 <- update(cmp1.ld.model, ~ . -sex:lethaldose )
> summary(cmp1.ld.model3)

Call:
glm(formula = counts ~ sex + lethaldose, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2468  -0.6442   0.1702   0.6824   1.8965  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -4.8871     0.6169  -7.922 2.34e-15 ***
sexM          2.7820     0.4310   6.455 1.08e-10 ***
lethaldose    0.9256     0.1373   6.740 1.58e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 125.811  on 11  degrees of freedom
Residual deviance:  11.975  on  9  degrees of freedom
AIC: 46.551

Number of Fisher Scoring iterations: 5

The first thing you will notice is that the residual deviance/degrees of freedom ratio is greater than 1, suggesting overdispersion:

> 11.975/9
[1] 1.330556

Now let's statistically test whether this model is significantly different from our initial model using the anova() function, with the test argument set to "Chi" for a chi-squared test, since we are dealing with a binomial family. Let's take a look at this in the following lines of code:

> anova(cmp1.ld.model, cmp1.ld.model3, test="Chi")
Analysis of Deviance Table

Model 1: counts ~ sex * lethaldose
Model 2: counts ~ sex + lethaldose
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
1         8      3.939                        
2         9     11.975 -1  -8.0356 0.004587 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this analysis, we see that the two models are significantly different from each other. Since the second model also shows signs of overdispersion, we stick with the first model as the better fit for this data. The important thing to remember when choosing the best model is to pick the one with the lowest residual deviance (the best fit) while keeping the number of parameters as small as possible (parsimony).
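One convenient way to balance fit and parsimony is to compare AIC values, which penalize a model for each additional parameter. The summaries above already reported an AIC of 40.515 for the interaction model and 46.551 for the additive model, so the interaction model is preferred on this criterion as well. Both values can be retrieved in a single call as follows:

> AIC(cmp1.ld.model, cmp1.ld.model3)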
