As in the analysis of variance (ANOVA) or regression analysis, a generalised linear model (GLM) describes the relation between a random regressand (response variable) y and a vector xT = (x0, … , xk) of regressor variables influencing it. It is a flexible generalisation of ordinary linear regression that allows regressands with error distributions other than the normal. The GLM generalises linear regression by relating the linear model to the regressand via a link function of the corresponding exponential family and by allowing the variance of each measurement to be a function of its predicted value.
Possibly the first to introduce a GLM was Rasch (1960). The Rasch model is a psychometric model for analysing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade‐off between (i) the respondent's abilities, attitudes or personality traits and (ii) the item difficulty. For example, we may use it to estimate a student's reading ability, or the extremity of a person's attitude to capital punishment from responses on a questionnaire. In addition to psychometrics and educational research, the Rasch model and its extensions are used in other areas, including the health professions and market research, because of their general applicability. The mathematical theory underlying Rasch models is a special case of a GLM. Specifically, in the original Rasch model, the probability of a correct response is modelled as a logistic function of the difference between the person and item parameter – see Section 11.5.
GLMs were formulated by Nelder and Wedderburn (1972) and later by McCullagh and Nelder (1989) as a way of unifying various other statistical models, including linear regression, logistic regression (Rasch model), and Poisson regression. They proposed an iteratively reweighted least squares method for maximum likelihood (ML) estimation of the model parameters. Maximum likelihood estimation remains popular and is the default method in many statistical computing packages. Other approaches, including Bayesian approaches and least squares fits to variance‐stabilised responses, have also been developed. Nelder and Wedderburn were possibly unaware of the Rasch model; Rasch (1980) later extended his approach.
In GLMs, exponential families of distributions play an important role.
The term GLM usually refers to conventional linear regression models for a continuous or discrete response variable y given continuous and/or categorical predictors. In this chapter we assume that the distribution of y belongs to an exponential family. A GLM does not assume a linear relationship between the regressand and the regressors; instead, the relation between the regressor variables (x0, x1, … , xk) and the parameter(s) is described by a link function, for which we usually use the natural parameter of the canonical form of the exponential family.
GLMs are a broad class of models that include linear regression, ANOVA, ANCOVA, Poisson regression, log‐linear models, etc. Table 11.1 provides a summary of GLMs following Agresti (2018, chapter 4), where 'Mixed' means categorical (nominal or ordinal) and/or continuous.
Table 11.1 Link function, random and systematic components of some GLMs.
| Model | Random component | Link | Systematic component |
|---|---|---|---|
| Linear regression | Normal | Identity | Continuous |
| ANOVA | Normal | Identity | Categorical |
| ANCOVA | Normal | Identity | Mixed |
| Log‐linear regression | Gamma | Log | Continuous |
| Logistic regression | Binomial | Logit | Mixed |
| Log‐linear regression | Poisson | Log | Categorical |
| Multinomial response | Multinomial | Generalised logit | Mixed |
In all GLMs in this chapter we assume the following:
In a GLM the deviance plays the role of the sum of squares, and the residual deviance that of the residual mean squares. After fitting a GLM it may happen that the residual deviance exceeds the value expected. In such cases we speak of overdispersion.
Sources of this may be:
We demonstrate this for binary data.
Let p be a random variable with E(p) = μ and var(p) = σ². Given a realisation p of p, we assume that k is B(n, p)‐distributed. Then by the laws of iterated expectation and variance

E(k) = E[E(k | p)] = E(np) = nμ

and

var(k) = E[var(k | p)] + var[E(k | p)] = E[np(1 − p)] + var(np) = nμ(1 − μ) + n(n − 1)σ².

As we can see, var(k) is larger than the variance nμ(1 − μ) of a binomial distribution with parameter μ. For n = 1 the additional term n(n − 1)σ² vanishes, so overdispersion cannot be detected.
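This mixture variance can be checked numerically. A minimal sketch, assuming SciPy is available: when p follows a beta distribution, the mixture of binomials is the beta-binomial distribution, whose exact variance must agree with the formula above.

```python
# Numerical check of var(k) = n*mu*(1-mu) + n*(n-1)*sigma^2 when p is
# beta-distributed (the beta-binomial mixture). SciPy is an assumption here.
from scipy.stats import beta, betabinom

n, a, b = 10, 2.0, 3.0
mu, sigma2 = beta.stats(a, b, moments="mv")       # E(p), var(p)
var_k = betabinom.var(n, a, b)                    # exact mixture variance
formula = n * mu * (1 - mu) + n * (n - 1) * sigma2
print(var_k, formula)                             # the two agree
print(n * mu * (1 - mu))                          # plain binomial variance is smaller
```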
How to detect and handle overdispersion is demonstrated in Section 11.5.2.
We demonstrate the analysis by fitting a GLM to the data of Example 5.9, but now using the R program glm2 – this means the linear case. The analysis of intrinsic GLMs is shown in Sections 11.5, 11.6, and 11.7.
Logistic regression is a special type of regression where the probability of 'success' is modelled through a set of predictors (regressors). The predictors may be categorical (nominal or ordinal) or continuous. Binary logistic regression is a special type of regression where a binary response (regressand) variable is related to explanatory variables, which can be discrete and/or continuous.
For a fixed factor with a ≥ 1 levels we consider in each level a binary random variable yi with possible outcomes 0 and 1. We assume P(yi = 1) = pi; 0 < pi < 1, i = 1, … , a. The statistics (absolute frequencies) of independent random samples with independent components distributed as yi are binomially distributed with parameters ni (known) and pi. The relation between the regressor variables (xi0, … , xik) and the parameter pi is described by the logit link function

ln[pi/(1 − pi)] = β0xi0 + β1xi1 + … + βkxik.
Analogous to Section 8.1.1, the vector yi may depend on a vector of (fixed) regressor variables.
After fitting a GLM it may happen that the estimated variance, computed from the residual deviance, exceeds the value expected; this we call overdispersion. We discuss this here for the binomial model and in Section 11.6.2 for the Poisson model. If the residual deviance exceeds the residual degrees of freedom, overdispersion is present.
Possible sources of overdispersion are, amongst others:
Let p be a random variable with E(p) = μ and var(p) = σ². Further let k be B(n, p)‐distributed given a realisation p of p. Then

E(k) = E[E(k | p)] = nμ

and

var(k) = E[var(k | p)] + var[E(k | p)] = nμ(1 − μ) + n(n − 1)σ².

We see that for n > 1, var(k) exceeds the variance of a binomial distribution. If n = 1, overdispersion cannot be detected.
How can we reduce overdispersion?
First we try to correct the systematic component of the model. Further we may choose a better link function and model the variance with an additional dispersion parameter ϕ by

var(y) = ϕnμ(1 − μ).
ϕ is estimated by dividing the residual deviance by the corresponding degrees of freedom (see Example 11.5).
More detailed information about overdispersion can be found in Collett (1991). In an analogous way underdispersion can be handled.
Poisson regression refers to a GLM where the random component is specified by the Poisson distribution of the response variable y, which is a count. However, we can also take the rate y/t as the response variable, where t is an interval representing time (hour, day), space (square metres) or some other grouping. The response variable y has the expectation λ. Because counts are non‐negative, the expectation λ is positive. The relation between the regressor variables (x0, x1, … , xk) and the parameter λ is described for k = 1 by

ln λ = β0 + β1x

and in the case of k regressors we have, analogous to (11.4),

ln λ = β0x0 + β1x1 + … + βkxk.
In this section we mainly handle the case of one regressor variable. From (11.9) we receive the expectations (with equal variances)

λi = exp(β0 + β1xi), i = 1, … , n.
We estimate the parameter λ by the maximum likelihood (ML) method, maximising the logarithm of the likelihood function of n > 0 observations yi, i = 1, … , n:

ln L(λ) = −nλ + ln λ · Σ yi − Σ ln(yi!).

Differentiating with respect to λ and setting this derivative to zero gives the ML estimate

λ̂ = (1/n) Σ yi = ȳ,

i.e. the estimate is the arithmetic mean of the observed counts; because the second derivative is negative, the solution gives a maximum.
Because in the Poisson distribution the expectation equals the variance, overdispersion is often caused by unmodelled factors. Agresti (2018) mentions that the negative binomial distribution may be better adapted to count data because it permits the variance to exceed the expectation.
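This property of the negative binomial can be checked directly. A minimal sketch, assuming SciPy, whose nbinom(r, p) has mean r(1 − p)/p and variance r(1 − p)/p², i.e. variance = mean + mean²/r:

```python
# The negative binomial allows variance above the mean; the Poisson
# forces them to be equal. SciPy is an assumption here.
from scipy.stats import nbinom, poisson

r, p = 5, 0.4
mean, var = nbinom.stats(r, p, moments="mv")
print(mean, var)                              # var exceeds mean
print(poisson.stats(mean, moments="mv"))      # Poisson: var equals mean
```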
Gamma regression is a GLM where the random component is specified by the gamma distribution of the response variable y, which is continuous. We use here the two‐parameter gamma distribution with density

f(y | λ, ν) = [λ^ν/Γ(ν)] y^(ν−1) e^(−λy), y > 0; λ > 0, ν > 0.
The relation between the regressor variables (x0, x1, … , xk) and the parameters λ and ν is described by link functions for λ and ν.
Assume that we have a random sample YT = (y1, y2, … , yn) of size n with components distributed like y, i.e. they have the same parameters λ, ν. We further assume that yi depends on regressor variables (xi0, xi1, … , xik) influencing the link function g(λi, νi), i = 1, … , n, via the linear predictor.
Without loss of generality we use the inverse link function in place of the canonical link function in the denominator. That there is no loss of generality stems from the fact that the canonical link of the gamma distribution, −1/μ, differs from the inverse link 1/μ only in sign, and the sign can be absorbed into the regression coefficients.
The multinomial logit model is a generalisation of the binomial logit model. It describes an (m − 1)‐dimensional response variable occurring with probabilities p1, … , pm, Σ pi = 1. The probability (likelihood) function of the multinomial distribution is

P(y1, … , ym) = [n!/(y1! ⋯ ym!)] p1^y1 ⋯ pm^ym, Σ yi = n.
The relation between the regressor variables (x0, x1, … , xk) and the probabilities pi is described by the multinomial (generalised) logit function as link function

ln(pi/p1) = βi0x0 + βi1x1 + … + βikxk, i = 2, … , m,
where p1 is the probability of the reference category.
The model is fitted as described in Section 11.6.1 for the Poisson model because we have an equivalence between the multinomial distribution and a Poisson distribution with fixed sum of all counts.
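The equivalence invoked here can be verified numerically: conditional on their total, independent Poisson counts are multinomial with pi = λi/Σλ. A minimal sketch, assuming SciPy:

```python
# Check of the multinomial-Poisson equivalence: conditional on the total,
# independent Poisson counts are multinomial with p_i = lambda_i/sum(lambda).
import numpy as np
from scipy.stats import multinomial, poisson

lam = np.array([2.0, 3.0, 5.0])
y = np.array([1, 2, 4])
n = y.sum()

pois_joint = np.prod(poisson.pmf(y, lam))     # joint pmf of the Poisson counts
pois_total = poisson.pmf(n, lam.sum())        # pmf of the total at n
multi = multinomial.pmf(y, n, lam / lam.sum())
print(pois_joint / pois_total, multi)         # equal
```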