We just saw how to fit our data to a model using linear regression. However, for that model to be valid, we had to assume that the variance is constant and that the errors are normally distributed. A generalized linear model (GLM) is an alternative to linear regression that allows the errors to follow probability distributions other than the normal distribution. GLMs are typically used for response variables that represent count data or binary outcomes. To fit your data to a GLM in R, you can use the glm() function.
GLM has three important properties:
The error structure informs us of the error distribution to use to model the data and is specified by the family argument. For example, you might use a Poisson distribution to model the errors for count data, or a Gamma distribution for data showing a constant coefficient of variation, as follows:
glm(y ~ z, family = poisson)
glm(y ~ z, family = Gamma)
The linear predictor incorporates the information about the independent variables into the model and is defined as the linear sum of the effects of one or more explanatory variables.

The link function specifies the relationship between the linear predictor and the mean of the response variable. By fitting the same model with different link functions, their performance can be compared. Ideally, the best link function to use is the one that produces the smallest residual deviance, where the deviance is a goodness-of-fit statistic often used for statistical hypothesis testing. The canonical link function is the default used when a particular error structure is specified. Let's take a look at this in the following table:
Error structure | Canonical link function
---|---
binomial | logit
gaussian | identity
Gamma | inverse
inverse.gaussian | 1/mu^2
poisson | log
quasi | identity
quasibinomial | logit
quasipoisson | log
To use an alternative link function when fitting a GLM, we can change the link argument as follows:

glm(y ~ z, family = binomial(link = "probit"))
Now, let's go over a detailed example of how to fit your data to a GLM in R. In the first example, we will look at the effect of increasing doses of a compound on the number of deaths in groups of 20 male and female mice. Let's take a look at this in the following lines of code:
> cmp1.ld <- read.table(header=TRUE, text='
   lethaldose sex numdead numalive
1           0   M       1       19
2           1   M       3       17
3           2   M       9       11
4           3   M      14        6
5           4   M      17        3
6           5   M      20        0
7           0   F       0       20
8           1   F       2       18
9           2   F       2       18
10          3   F       3       17
11          4   F       4       16
12          5   F       6       14
')
> attach(cmp1.ld)
We can plot the data to take a look at the relationship between the dose of the compound and the proportion of deaths by mouse sex. Let's take a look at this in the following lines of code:
> proportion_dead <- numdead/20
> plot(proportion_dead ~ lethaldose, pch=as.character(sex))
The result is shown in the following plot:
Now we will fit our data to a GLM model. First, we combine the number of dead and alive mice into a counts
matrix as follows:
> counts <- cbind(numdead, numalive)
> cmp1.ld.model <- glm(counts ~ sex * lethaldose, family=binomial)
We can summarize the results for the GLM model with the summary()
function as follows:
> summary(cmp1.ld.model)

Call:
glm(formula = counts ~ sex * lethaldose, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.23314  -0.14226  -0.03905   0.17624   1.11956  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.2507     0.6774  -4.799 1.59e-06 ***
sexM              0.3342     0.8792   0.380  0.70387    
lethaldose        0.4856     0.1812   2.681  0.00735 ** 
sexM:lethaldose   0.7871     0.2804   2.807  0.00500 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 125.811  on 11  degrees of freedom
Residual deviance:   3.939  on  8  degrees of freedom
AIC: 40.515

Number of Fisher Scoring iterations: 4
At first glance, there seems to be a significant interaction between the sex
and the lethaldose
explanatory variables (p = 0.0050). However, before we can make this conclusion, we need to check if there is more variability in the data than expected from the statistical model (overdispersion). We can check for overdispersion in our model by dividing the residual deviance by the number of degrees of freedom as follows:
> 3.939/8
[1] 0.492375
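Rather than typing the numbers by hand, the same check can be computed directly from the fitted model object. Here is a minimal sketch (the helper name overdispersion_ratio is my own, demonstrated on simulated data rather than the chapter's model):

```r
# Sketch: computing residual deviance / residual degrees of freedom
# from a fitted glm object (helper name is illustrative)
overdispersion_ratio <- function(model) {
  deviance(model) / df.residual(model)
}

# Quick demonstration on simulated binomial data
set.seed(1)
x <- 1:20
y <- rbinom(20, size = 10, prob = plogis(0.3 * x - 3))
m <- glm(cbind(y, 10 - y) ~ x, family = binomial)
overdispersion_ratio(m)   # values near 1 suggest no overdispersion
```

Applied to a model like the one fitted above, this returns the same deviance-to-degrees-of-freedom ratio we just calculated manually.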
The value is less than 1, which reassures us that the data is not overdispersed. However, if we got a value greater than 1, we might want to refit the model with a quasibinomial error structure to account for the overdispersion. For the sake of argument, let's see if we can improve the model by using a quasibinomial error structure instead of a binomial one. Let's take a look at this in the following lines of code:
> summary(glm(counts ~ sex * lethaldose, family=quasibinomial))

Call:
glm(formula = counts ~ sex * lethaldose, family = quasibinomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.23314  -0.14226  -0.03905   0.17624   1.11956  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -3.2507     0.3957  -8.214 3.61e-05 ***
sexM              0.3342     0.5137   0.651  0.53355    
lethaldose        0.4856     0.1058   4.588  0.00178 ** 
sexM:lethaldose   0.7871     0.1638   4.805  0.00135 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.3413398)

    Null deviance: 125.811  on 11  degrees of freedom
Residual deviance:   3.939  on  8  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4
As you can see, changing the error structure left the coefficient estimates unchanged and only altered the standard errors and p values of the explanatory variables. Now let's see what happens if we remove the sex:lethaldose interaction from our original model. Let's take a look at this in the following lines of code:
> cmp1.ld.model3 <- update(cmp1.ld.model, ~ . -sex:lethaldose)
> summary(cmp1.ld.model3)

Call:
glm(formula = counts ~ sex + lethaldose, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2468  -0.6442   0.1702   0.6824   1.8965  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -4.8871     0.6169  -7.922 2.34e-15 ***
sexM          2.7820     0.4310   6.455 1.08e-10 ***
lethaldose    0.9256     0.1373   6.740 1.58e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 125.811  on 11  degrees of freedom
Residual deviance:  11.975  on  9  degrees of freedom
AIC: 46.551

Number of Fisher Scoring iterations: 5
The first thing you will notice is that the residual deviance/degrees of freedom ratio is greater than 1, suggesting overdispersion:
> 11.975/9
[1] 1.330556
Now let's statistically test whether this model is significantly different from our initial model using the anova() function, with the test argument set to "Chi" for a Chi-squared test, since we are dealing with a binomial family. Let's take a look at this in the following lines of code:
> anova(cmp1.ld.model, cmp1.ld.model3, test="Chi")
Analysis of Deviance Table

Model 1: counts ~ sex * lethaldose
Model 2: counts ~ sex + lethaldose
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
1         8      3.939                        
2         9     11.975 -1  -8.0356 0.004587 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this analysis, we see that the two models are significantly different from each other. Since the second model also shows signs of overdispersion, we can stick with the first model as the better fit for this data. The important thing to remember when choosing the best model is to pick the one with the lowest residual deviance (goodness of fit) while keeping the model as parsimonious as possible.
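The deviance-versus-parsimony trade-off above can also be summarized with AIC, which penalizes each extra parameter. Here is a small sketch (using simulated data and my own variable names, not the chapter's dataset) comparing a model with and without an interaction term:

```r
# Sketch: comparing nested binomial GLMs by AIC and a Chi-squared
# deviance test; simulated data loosely mimicking a dose-by-sex design.
set.seed(7)
dose <- rep(0:5, times = 2)
sex  <- rep(c("M", "F"), each = 6)
p    <- plogis(-3 + 0.9 * dose + ifelse(sex == "M", 0.6 * dose, 0))
dead <- rbinom(12, size = 20, prob = p)
cnts <- cbind(dead, 20 - dead)

m_full <- glm(cnts ~ sex * dose, family = binomial)  # with interaction
m_red  <- glm(cnts ~ sex + dose, family = binomial)  # without

AIC(m_full, m_red)                     # lower AIC is preferred
anova(m_red, m_full, test = "Chisq")   # deviance test of the interaction
```

A significant Chi-squared test together with a lower AIC for the fuller model supports keeping the interaction, mirroring the reasoning used above.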