Chapter 17
Generalized Linear Models

Package(s): gdata, RSADBE

Dataset(s): chdage, lowbwt, sat, Disease, BS, caesareans

17.1 Introduction

In Chapter 16 we discussed many useful statistical methods for analysis of categorical data, which may be nominal or ordinal data. The related regression problems were deliberately not touched upon there, the reason for omission being that the topic is more appropriate here. We will see in the next section that the linear regression methods of Chapter 12 are not appropriate for explaining the relationship between the regressors and discrete regressands. The statistical models, which are more suitable for addressing this problem, are known as the generalized linear models, which we abbreviate to GLM.

In this chapter, we will consider the three families of the GLM: logistic, probit, and log-linear models. The logistic regression model will be covered in more detail, and the applications of the others will be clearly brought out in the rest of this chapter.

We begin with the problem of using the linear regression model for count/discrete data in Section 17.2. The exponential family continues to provide excellent theoretical properties for GLMs and the relationship will be brought out in Section 17.3. The important logistic regression model will be introduced in Section 17.4. The statistical inference aspects of the logistic regression model are developed and illustrated in Section 17.5. As with the linear regression model, we will consider the problem of model selection in Section 17.6. Probit and Poisson regression models are developed in Sections 17.7 and 17.8.

17.2 Regression Problems in Count/Discrete Data

By count/discrete data, we refer to the case where the regressand/output is discrete. That is, the output $c17-math-0001$ takes values in the set $c17-math-0002$ , or $c17-math-0003$ . As in the case of the linear regression model, we have some explanatory variables which affect the output $c17-math-0004$ . We will immediately see the shortcomings of the approach of the linear regression model.

Example 17.2.1. Linear Model for Binary Outcome – Pass/Fail Indicator as Linear Function of the SAT Score

Johnson and Albert (1999) give a simple example of the drawbacks of the linear regression model approach for count data. The Scholastic Assessment Test (SAT) scores for the Mathematics subject, denoted by the variable Sat in the dataset sat, of students at the time of admission to a course are available, and it is known whether or not the student completed the course in a later examination, denoted by Pass. The variable Sat is the input variable $c17-math-0005$ , while the output variable Pass is the regressand $c17-math-0006$ . The course completion status Pass takes only two possible values, 0 (fail) or 1 (pass). The task is to predict $c17-math-0007$ based on the covariate $c17-math-0008$ . The scatter plot looks as in Figure 17.1.

Figure 17.1 A Conditional Density Plot for the SAT Data

Though the scatter plot does not suggest that a linear model is appropriate, we will nevertheless model the output $c17-math-0009$ , that is Pass, in terms of the covariate $c17-math-0010$ , that is Sat. Let us check how well the linear regression model of earlier chapters works here.

> library(RSADBE)
> data(sat)
> par(mfrow=c(1,2))
> plot(sat$Sat,sat$Pass,xlab="SAT-M Score",ylab="Pass/Fail",
+ main="Scatterplot of SAT-PASS Data")
> satlm <- lm(Pass~Sat,data=sat)
> abline(reg=satlm)
> predict(satlm)
       1        2       24       25         29       30
 0.44667  0.50593  0.88370 -0.01260    0.65407  0.62444
> predict(satlm,newdata=list(Sat=c(300,750)))
     1      2
-1.220  2.113
> plot(predict(satlm),satlm$residuals,xlab="Predicted Values",ylab=
+ "Residuals",main="Residuals vs Predicted Plot")

First, the scatter plot is obtained using the plot function, which gives the diagram on the left-hand side of Figure 17.1, and then the linear model is fitted using the lm function. The regression line is drawn on the scatter plot using the reg option of the abline graphical function. The shortcomings of the linear regression model are readily apparent. For instance, the predicted values, predict(...), exceed 1 for a few students and fall below 0 for a few others. In simple words, we see that for discrete and bounded random variables, the predicted values from the simple linear regression model can overshoot the boundaries. Finally, the residuals-vs-predicted plot, the right-hand panel of Figure 17.1, shows two parallel lines, which is very unlikely if a linear model is appropriate. Thus we need a model which overcomes such limitations.□
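Both symptoms are structural and not peculiar to the SAT data. The following is a minimal sketch on a small synthetic dataset; the vectors score and passed are invented for illustration and are not the sat data. It shows that predictions at extreme covariate values leave the interval [0, 1], and that for a 0/1 response the sum of residual and fitted value can only be 0 or 1, which is exactly why the residual plot consists of two parallel lines.

```r
# Toy data (hypothetical, not the sat dataset): pass if and only if score > 550
score  <- seq(400, 700, by = 10)
passed <- as.numeric(score > 550)

toy_lm <- lm(passed ~ score)

# Predictions at extreme scores escape the [0, 1] range of a probability
extreme <- predict(toy_lm, newdata = data.frame(score = c(300, 800)))
print(extreme)

# For a 0/1 response, residual = y - fitted, so residual + fitted is 0 or 1:
# the residuals lie on two parallel lines with slope -1
offsets <- round(residuals(toy_lm) + fitted(toy_lm), 10)
print(table(offsets))
```

The same two checks can be run on satlm after replacing the toy vectors with sat$Sat and sat$Pass.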

Example 17.2.2. Understanding the Relationship between Coronary Heart Disease and the Age of the Patient

A well-known explanation of heart disease is that as age increases, the risk of coronary heart disease also increases. The current dataset and the example may be found in Chapter 1 of Hosmer and Lemeshow (1990–2013). We will first plot the indicator value of coronary heart disease (CHD) against the age of the patient with the usual plot command.

It is fairly clear from the plot, Figure 17.2, that there is not a strong association between the CHD indicator and the age of the patient. However, a different way of re-plotting the same data gives a useful insight. Let us group the patients into certain age intervals, and calculate the percentage of the patients who have CHD. The patients are grouped in the interval groups 19–29, 29–34, 34–39,…, 54–59, and 59–69 using the cut function. The numbers of patients falling in these age groups are obtained with table(agegrp,chdage$CHD). With mytable denoting this table, the code prop.table(mytable,1) gives row-wise percentages, and with the option 2 in place of 1, the column-wise percentages. The second column of this proportions table gives the percentage of CHD patients in these age groups. We will then plot this percentage against the centers of the intervals with the points(...) function.

> data(chdage)
> plot(chdage$AGE,chdage$CHD,xlab="AGE",ylab="CHD Indicator",
+ main="Scatter plot for CHD Data")
> agegrp <- cut(chdage$AGE,c(19,29,34,39,44,49,54,59,69),
+ include.lowest=TRUE, labels=c(25,seq(31.5,56.5,5),64.5))
> mp <- c(25,seq(31.5,56.5,5),64.5) # mid-points
> chd_percent <- prop.table(table(agegrp,chdage$CHD),1)[,2]
> points(mp,chd_percent,"l",col="red")

The red curve in Figure 17.2 shows us that the percentage of people having CHD increases with age. Furthermore, the curve is S-shaped; in particular, it looks like a sigmoid function. It is this fact that forms the basis of the logistic regression model.□
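For reference, the sigmoid shape can be inspected directly. The sketch below is an illustration and not part of the CHD analysis: the helper name sigmoid is our own, and the base R function plogis computes the same quantity.

```r
# The standard logistic (sigmoid) function; plogis() in base R is equivalent
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- seq(-6, 6, by = 0.5)
p <- sigmoid(z)

# S-shaped: bounded in (0, 1), strictly increasing, and equal to 0.5 at z = 0
print(round(p, 3))
```

Calling plot(z, p, type="l") draws the curve, which may be compared with the red curve of Figure 17.2.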

Figure 17.2 Understanding the Coronary Heart Disease Data in Terms of Percentage

We will now digress a bit from applications and attempt to understand the general underlying theory of GLMs. The important members of the GLM family will be looked at briefly here.

$c17-math-0011$

17.3 Exponential Family and the GLM

The exponential family of distributions was introduced in Chapter 7. Chapter 3 of Dobson (2002) highlights the role of the exponential family for the GLMs considered in this chapter. Recall the definition of the exponential family from Chapter 7 as

where $c17-math-0013$ are some known functions. This form of exponential family may be rewritten as

where $c17-math-0015$ and $c17-math-0016$ . If $c17-math-0017$ , the distribution is said to be in canonical form and in this case $c17-math-0018$ is called the natural parameter of the distribution. We saw some members of the exponential family in Section 7.2.1. The dependency of the GLM on the exponential family is now discussed.

Consider an independent sample $c17-math-0019$ , with the following characteristics:

Each observation $c17-math-0020$ has a distribution from the exponential family.
The distributions of $c17-math-0021$ are of the same form, that is, all are either normal, or all binomial, etc.
The canonical form $c17-math-0022$ is specified by
17.1

This translates into the form that the joint probability density function of the random sample $c17-math-0024$ , is given by

17.2

In a GLM, we are interested in estimation of the $c17-math-0026$ parameters $c17-math-0027$ . The trick is to consider some function of $c17-math-0028$ , say $c17-math-0029$ , such that $c17-math-0030$ , and then allow the $c17-math-0031$ $c17-math-0032$ 's to vary as a function of $c17-math-0033$ , regression coefficients. That is, we define $c17-math-0034$ by

17.3

where $c17-math-0036$ is the covariate vector associated with $c17-math-0037$ . The function $c17-math-0038$ is called the link function of the GLM. The essential requirements of the function $c17-math-0039$ are that it should be monotonic and differentiable. Table 17.1 gives a summary of important members of the GLM. There will be more focus on logistic regression in this chapter.

Table 17.1 GLM and the Exponential Family

Probability Model	Name of Link Function	Link Function	Mean Function
Normal	Identity	$c17-math-0040$	$c17-math-0041$
Exponential/Gamma	Inverse	$c17-math-0042$	$c17-math-0043$
Inverse Gaussian	Inverse Squared	$c17-math-0044$	$c17-math-0045$
Poisson	Log	$c17-math-0046$	$c17-math-0047$
Binomial	Logit	$c17-math-0048$	$c17-math-0049$
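In R, the link function and its inverse, the mean function, are stored in the family objects that are passed to glm. The following sketch evaluates them for the families of Table 17.1; the numerical inputs 0.25, 2, and 3 are arbitrary illustrative values.

```r
# Family objects carry the link function (linkfun) and its inverse (linkinv)
fam_binom <- binomial(link = "logit")
fam_pois  <- poisson(link = "log")
fam_gamma <- Gamma(link = "inverse")
fam_ig    <- inverse.gaussian(link = "1/mu^2")

mu  <- 0.25
eta <- fam_binom$linkfun(mu)          # logit: log(mu / (1 - mu))
print(c(eta = eta, back = fam_binom$linkinv(eta)))

print(fam_pois$linkfun(3))            # log link: log(3)
print(fam_gamma$linkfun(2))           # inverse link: 1 / 2
print(fam_ig$linkfun(2))              # inverse squared link: 1 / 2^2
```

Applying linkinv to the output of linkfun recovers the mean, which is the monotonicity requirement at work.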

17.4 The Logistic Regression Model

We saw in Section 17.2 how the probability curve resembles a sigmoid curve. Now we will introduce the concepts in a more formal and mathematical way. We have found the paper of Czepiel, http://czep.net/stat/mlelr.pdf, to be the most pedagogically appropriate, and this section is a liberal adaptation of it. Let the binary outcomes be represented by $c17-math-0050$ , and the covariates associated with them be respectively $c17-math-0051$ . The covariate is assumed to be a vector of $c17-math-0052$ elements, that is, $c17-math-0053$ . Without loss of generality, we can write the covariate vector as $c17-math-0054$ , with $c17-math-0055$ . The probability of success is specified by $c17-math-0056$ . The logistic regression model is then given by

17.4

where $c17-math-0058$ is the vector of regression coefficients. The probability of a failure has the simple form:

An important identity in the context of logistic regression is the form of the odds ratio, abbreviated and denoted as OR:

17.5

Taking the logarithm on both sides of the above equation, we get the form of logistic regression model:

The expression $c17-math-0062$ is known as the logit function and since it is linear in the covariates, the logistic regression model, based on the logit function, is a particular class of the well-known generalized linear models. The logistic regression model is given by

17.6

where $c17-math-0064$ is the error term. Thus, if $c17-math-0065$ , the error is $c17-math-0066$ with probability $c17-math-0067$ . Otherwise, the error is $c17-math-0068$ with probability $c17-math-0069$ . That is

17.7

Hence, the error $c17-math-0071$ has a binomial distribution with mean $c17-math-0072$ and variance $c17-math-0073$ . The next section will deal with the inferential aspect of $c17-math-0074$ .
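The quantities of this section are easily computed for a single observation. The following is a minimal sketch: the coefficient values are the estimates reported for the coronary heart disease data in Example 17.5.1, while the age of 50 and the names b0, b1, and p are illustrative assumptions. It checks that the log of the odds equals the linear predictor and that the two-point error term has mean zero and variance p(1-p).

```r
# Illustrative coefficients: the CHD estimates reported in Example 17.5.1
b0  <- -5.3094534
b1  <- 0.1109211
age <- 50                            # hypothetical covariate value

eta  <- b0 + b1 * age                # linear predictor (the logit)
p    <- exp(eta) / (1 + exp(eta))    # probability of success
odds <- p / (1 - p)                  # odds of success, equal to exp(eta)

# The error y - p equals 1 - p with probability p, and -p with probability 1 - p
err      <- c(1 - p, -p)
prob     <- c(p, 1 - p)
err_mean <- sum(err * prob)
err_var  <- sum((err - err_mean)^2 * prob)

print(c(logit = eta, probability = p, odds = odds))
print(c(error_mean = err_mean, error_variance = err_var, p_times_q = p * (1 - p)))
```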

17.5 Inference for the Logistic Regression Model

17.5.1 Estimation of the Regression Coefficients and Related Parameters

The likelihood function based on $c17-math-0075$ observations is given by

17.8

As in most cases, it is easier to work with the log-likelihood function:

17.9

Differentiating the above log-likelihood function with respect to $c17-math-0078$ , we obtain the score function:

17.10

Let us denote the covariate matrix, as in earlier chapters, by $c17-math-0080$ , the probability of success vector by $c17-math-0081$ , and the outcome vector by $c17-math-0082$ . The normal equation, obtained by equating the above equation to zero, for the logistic regression model is then given by

17.11

The above equation looks similar to the normal equation of the linear models. However, Equation 17.11 cannot be solved immediately as the vector $c17-math-0084$ contains the vector of regression coefficients. We need the help of a different algorithm to obtain the estimates of the regression coefficients. This algorithm is generally known as the iteratively reweighted least squares algorithm, abbreviated as IRLS. In the context of the logistic regression model, the IRLS algorithm is given below; see Fox, http://socserv.mcmaster.ca/jfox/Courses/UCLA/logistic-regression-notes.pdf.

Initialize the vector of regression coefficients $c17-math-0085$ .
The $c17-math-0086$ improvement of the regression coefficients is given by
17.12

where $c17-math-0088$ is a diagonal matrix with the $c17-math-0089$ diagonal element given by $c17-math-0090$ , and $c17-math-0091$ .
Repeat the above step until $c17-math-0092$ is close to $c17-math-0093$ .
The asymptotic covariance matrix of the regression coefficients is given by $c17-math-0094$ , see next sub-section.

In the next example, we will bring out the steps of the IRLS algorithm clearly. The R function glm directly returns the estimates of the regression coefficients. However, the working of the IRLS algorithm does not become clear from it, and a first-time reader may feel that IRLS is some kind of black box. It is to be understood that software, be it R or any other statistical software, does not hide anything. The onus of a clear understanding of software functionality is with the reader, and sometimes the authors. For the sake of simplicity, we will focus on the simple GLM case of a single covariate only. The IRLS function is first given and discussed.

irls <- function(output, input) {
  # Prepend a column of 1's for the intercept term
  input <- cbind(rep(1,length(output)),input)
  bt <- rep(0,ncol(input)) # Initializing the regression coefficients
  probs <- as.vector(1/(1+exp(-input%*%bt)))
  temp <- rep(1,ncol(input)) # Dummy previous value, forces the first iteration
  # Iterate while the squared change in the coefficients exceeds 1e-4
  while(sum((bt-temp)^2)>0.0001) {
    temp <- bt
    # The update of Equation 17.12
    bt <- bt+as.vector(solve(t(input)%*%diag(probs*(1-probs))
      %*%input)%*%(t(input)%*%(output-probs)))
    probs <- as.vector(1/(1+exp(-input%*%bt)))
  }
  return(bt)
}

The irls R function defined here returns the estimates as required by the IRLS algorithm; in particular, it handles the computations of Equation 17.12. Given the covariate values for the $c17-math-0095$ observations from a dataset, the first step is to take care of the intercept term, and thus we begin by inserting a column of 1's with input <- cbind(rep(1,length(output)),input). The initial estimate of the regression coefficients is set equal to 0, and hence the initial probabilities $c17-math-0096$ 's will all be equal to 0.5, see bt and probs in the irls program. The regression coefficient vector of the previous iteration is denoted by temp, and as long as the squared distance between the coefficient vectors of the current and the previous iteration is greater than 1e-4, the iterations are carried out in the while loop. The iteration as required by Equation 17.12 is provided by the R code solve(t(input)%*%diag(probs*(1-probs))%*%input)%*%(t(input) %*%(output-probs)). When the convergence criterion is met, the vector of regression coefficients is returned by return(bt) as the output. The irls function will be tested for the coronary heart disease problem and the results will be further verified with the R glm function.

Example 17.5.1. Understanding the Relationship between Coronary Heart Disease and the Age of the Patient

Contd. We will now see how the above irls function compares with the R modules. The R function glm will be useful to build GLM, and it works on similar lines as the linear model function lm. The formula $c17-math-0097$ as used in lm continues to hold for glm and does not need further elaboration.

> data(chdage)
> chdglm <- glm(chdage$CHD~chdage$AGE,family='binomial')
> irls(chdage$CHD,chdage$AGE)
[1] -5.3094530  0.1109211
> chdglm$coefficients
(Intercept)  chdage$AGE
 -5.3094534   0.1109211
> summary(chdglm)
Call:
glm(formula = chdage$CHD ~ chdage$AGE, family = "binomial")
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9718  -0.8456  -0.4576   0.8253   2.2859
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
chdage$AGE   0.11092    0.02406   4.610 4.02e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 136.66  on 99  degrees of freedom
Residual deviance: 107.35  on 98  degrees of freedom
AIC: 111.35
Number of Fisher Scoring iterations: 4

Thus, the irls function returns the same estimates as the R function glm, up to the sixth decimal place, as seen by the output for irls(chdage$CHD,chdage$AGE) and chdglm$coefficients. It is now hoped that the reader is comfortable with the IRLS algorithm.

The summary function shows that AGE is a significant variable for explaining CHD. Further description will follow in the rest of this section.□
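The last step of the IRLS algorithm, the asymptotic covariance matrix, is not returned by the irls function above. The following self-contained sketch repeats the iteration on simulated data and adds that step; the data, the seed, the stopping rule, and all object names are illustrative assumptions, and the chdage data are not used.

```r
set.seed(17)                                  # simulated data, for illustration only
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + x))

X    <- cbind(1, x)                           # design matrix with intercept column
beta <- rep(0, ncol(X))                       # start from zero coefficients
for (iter in 1:25) {
  p    <- as.vector(plogis(X %*% beta))       # current success probabilities
  info <- t(X) %*% diag(p * (1 - p)) %*% X    # X'WX with W = diag(p(1-p))
  step <- solve(info, t(X) %*% (y - p))       # IRLS increment
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break           # stop when the update is negligible
}

# Covariance matrix: the inverse of X'WX evaluated at the final estimate
p         <- as.vector(plogis(X %*% beta))
vcov_irls <- solve(t(X) %*% diag(p * (1 - p)) %*% X)

fit <- glm(y ~ x, family = binomial)
print(cbind(irls = beta, glm = coef(fit)))
print(cbind(irls_se = sqrt(diag(vcov_irls)), glm_se = sqrt(diag(vcov(fit)))))
```

The square roots of the diagonal elements agree with the Std. Error column that summary reports for the glm fit.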

Estimates of the link function and odds ratio are straightforwardly obtained by plugging in the values of the estimated regression coefficients, that is,

We will next consider an example of multiple logistic regression, that is, the case of more than one covariate.

Example 17.5.2. The Low Birth-Weight Problem

Low birth weight of new-born infants is a serious concern. If the weight of the new-born is less than 2500 grams, we consider that instance as a low-birth weight case. A study was carried out at Baystate Medical Center in Springfield, Massachusetts. Table 17.2 gives a description of the variables in the study.

Table 17.2 The Low Birth-Weight Variables

Serial Number	Description	Abbreviation
1	Identification Code	`ID`
2	Low Birth Weight	`LOW`
3	Age of Mother	`AGE`
4	Weight of Mother at Last Menstrual Period	`LWT`
5	Race	`RACE`
6	Smoking Status During Pregnancy	`SMOKE`
7	History of Premature Labor	`PTL`
8	History of Hypertension	`HT`
9	Presence of Uterine Irritability	`UI`
10	Number of Physician Visits During the First Trimester	`FTV`
11	Birth Weight	`BWT`

The multiple logistic regression model will be built for the low birth weight category LOW as a function of the covariates AGE, LWT, RACE, and FTV.

> data(lowbwt)
> lowglm <- glm(LOW~AGE+LWT+RACE+FTV,data=lowbwt,family='binomial')
> lowglm$coefficients
(Intercept)         AGE      RACE3         FTV
 1.29536575 -0.02382297 0.43310843 -0.04930832

Note that RACE has been modeled as a factor variable and not as a continuous variable, and hence it enters the model through two indicator variables, RACE2 and RACE3. The reasoning is similar to that in the linear models.□
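The expansion of a factor into indicator variables can be seen from the design matrix. The sketch below uses a hypothetical three-level factor named race, not the lowbwt data, and contrasts it with the same codes treated as a numeric variable.

```r
# Hypothetical three-level factor standing in for a variable such as RACE
race <- factor(c(1, 2, 3, 1, 2, 3))

mm_factor  <- model.matrix(~ race)               # level 1 is the baseline
mm_numeric <- model.matrix(~ as.numeric(race))   # codes treated as a number

print(colnames(mm_factor))
print(ncol(mm_numeric))
```

The factor contributes two columns beyond the intercept, one for each non-baseline level, whereas the numeric coding contributes a single slope.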

The estimated vector of regression coefficients $c17-math-0099$ needs to be tested for significance. The first step towards this would be to obtain an estimate of the variance-covariance matrix of $c17-math-0100$ .

17.5.2 Estimation of the Variance-Covariance Matrix of $c17-math-0101$

We have seen earlier in Chapter 7 that the variance of the score function gives the Fisher information. Thus, to obtain the variance-covariance matrix of $c17-math-0102$ , we look at the second-order partial derivatives of the log-likelihood function. The technique of estimating the variance-covariance matrix of $c17-math-0103$ follows from the theory of the maximum likelihood estimation, see Rao (1973). Differentiating the score equations of the log-likelihood function from sub-section 17.5.1 once more, we get

for $c17-math-0105$ . The Fisher information for $c17-math-0106$ , denoted by $c17-math-0107$ , consists of elements as specified in the above two equations. Adapting the results from Chapter 7, the inverse matrix of the information matrix gives us the variance-covariance matrix of $c17-math-0108$ , that is,

17.13

Thus, the variance of the estimator of the regression coefficient $c17-math-0110$ , denoted by $c17-math-0111$ , is the $c17-math-0112$ diagonal element of $c17-math-0113$ . Similarly, the covariance between the estimators of two regression coefficients, denoted by $c17-math-0114$ , is the $c17-math-0115$ element of $c17-math-0116$ .

In R, it is easy to obtain the variance-covariance matrix for a fitted logistic regression model. For the Low Birth-Weight example, we can obtain this using the component cov.unscaled of the summary object.

> lowglm_summary <- summary(lowglm)
> lowglm_summary$cov.unscaled
            (Intercept)     AGE     LWT   RACE2   RACE3     FTV
(Intercept)      1.1480 -0.0220 -0.0045 -0.0373 -0.1339  0.0054
AGE             -0.0220  0.0011  0.0000  0.0031  0.0013 -0.0010
LWT             -0.0045  0.0000  0.0000 -0.0008  0.0003 -0.0001
RACE2           -0.0373  0.0031 -0.0008  0.2479  0.0571 -0.0003
RACE3           -0.1339  0.0013  0.0003  0.0571  0.1312  0.0058
FTV              0.0054 -0.0010 -0.0001 -0.0003  0.0058  0.0280
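The standard errors and the correlations between the estimators follow directly from this matrix. The following is a small sketch on simulated data; the data and object names are illustrative assumptions, and the lowbwt data are not used. For the binomial family the dispersion is fixed at 1, so cov.unscaled coincides with the matrix returned by vcov.

```r
set.seed(1)                                   # simulated data, for illustration only
x <- rnorm(150)
z <- rnorm(150)
y <- rbinom(150, 1, plogis(0.3 + 0.8 * x - 0.5 * z))
fit <- glm(y ~ x + z, family = binomial)

V  <- summary(fit)$cov.unscaled               # variance-covariance matrix
se <- sqrt(diag(V))                           # standard errors of the estimators

print(se)
print(round(cov2cor(V), 3))                   # correlations between the estimators
```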

17.5.3 Confidence Intervals and Hypotheses Testing for the Regression Coefficients

The natural hypothesis testing problem is inspection of the significance of the regressors, that is, $c17-math-0117$ . The Wald test statistic for $c17-math-0118$ is given by

17.14

Under the hypothesis $c17-math-0120$ , the Wald statistic $c17-math-0121$ (asymptotically) follows a standard normal distribution. Furthermore, the $c17-math-0122$ confidence interval for $c17-math-0123$ is given by

17.15

We will illustrate these concepts for the low birth-weight study problem.

Example 17.5.3. The Low Birth-Weight Problem

Contd. The Wald statistics values and the corresponding $c17-math-0125$ -values are given in the third and fourth columns of the “Coefficients” table of the summary function, that is:

> lowglm_summary$coefficients[,3:4]
            z value Pr(>|z|)
(Intercept)  1.2090   0.2267
AGE         -0.7063   0.4800
LWT         -2.1778   0.0294
RACE2        2.0164   0.0438
RACE3        1.1956   0.2318
FTV         -0.2948   0.7681
> confint(lowglm)
Waiting for profiling to be done...
              2.5 %  97.5 %
(Intercept) -0.7603  3.4594
AGE         -0.0917  0.0413
LWT         -0.0278 -0.0020
RACE2        0.0230  1.9910
RACE3       -0.2787  1.1462
FTV         -0.3899  0.2707

Note that at the $c17-math-0126$ level of significance, LWT and RACE, through the level RACE2, are the two significant variables in explaining the low birth weight LOW. The 95% confidence intervals for the regression coefficients confirm the hypothesis test results, since 0 does not lie in the intervals for LWT and RACE2. For details about profiling, refer to Dalgaard (2008).□
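The Wald statistic and the Wald confidence interval are simple enough to compute by hand. The sketch below uses the estimate and standard error of AGE reported by summary(chdglm) in Example 17.5.1; the object names are our own.

```r
# Estimate and standard error for AGE as reported in Example 17.5.1
est <- 0.11092
se  <- 0.02406

z_wald  <- est / se                             # Wald statistic
p_value <- 2 * pnorm(-abs(z_wald))              # two-sided p-value
ci_wald <- est + c(-1, 1) * qnorm(0.975) * se   # 95% Wald confidence interval

print(c(z = z_wald, p = p_value))
print(ci_wald)
```

Note that confint applied to a glm object reports profile-likelihood intervals, as the message about profiling indicates, and these generally differ slightly from the Wald interval. The function confint.default returns intervals of the Wald form.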

An overall model level significance test needs to be in place. Towards this test, we need to first define the notion of deviance, and this topic will be taken up in sub-section 17.5.5. First, we need to define the various types of residuals on the lines of linear models for the logistic regression model.

$c17-math-0127$

17.5.4 Residuals for the Logistic Regression Model

Recall from the definition of residuals for the linear regression model in Section 12.5 that the residual is the difference between the actual and predicted $c17-math-0128$ values. Similarly, the residuals for the logistic regression model are defined by

17.16

The residuals $c17-math-0130$ are sometimes called the response residuals. The different variants of the residuals which will be useful are defined next, see Chapter 7 of Tattar (2013). The hat matrix $c17-math-0131$ played an important role in defining the residuals for the linear regression model, recall Equation (12.37). A similar matrix will be required for the logistic regression model, and we have the problem here that the observation $c17-math-0132$ is not a straightforward linear function of the regression coefficients. A linear approximation for the fitted values $c17-math-0133$ , which gives a similar hat matrix for the logistic regression model, has been proposed by Pregibon (1981), see Section 5.3 of Hosmer and Lemeshow (1990–2013). The hat matrix for the logistic regression model will be denoted by $c17-math-0134$ , where the subscript $c17-math-0135$ denotes logistic regression and not linear regression, and is defined by

17.17

where $c17-math-0137$ is a diagonal matrix with

The diagonal elements, hat values, $c17-math-0139$ , or simply $c17-math-0140$ , are given by

17.18

with $c17-math-0142$ capturing the model-based estimator of the variance of $c17-math-0143$ and $c17-math-0144$ computing the weighted distance of $c17-math-0145$ from the average of the covariate design matrix $c17-math-0146$ . Similar to the linear regression model case, the diagonal element $c17-math-0147$ is useful for obtaining the variance of the residual:

17.19

The Pearson residual for the logistic regression model is then defined by

17.20

The standardized Pearson residual is defined by

17.21

The deviance residual for the logistic regression model is defined as the signed square root of the contribution of the observation to the model deviance and is given by

17.22

The residuals are obtained for a disease outbreak dataset, which is adapted from Kutner, et al. (2005).

Example 17.5.4. Disease Outbreak Problem

The purpose of this health study is investigation of an epidemic outbreak due to mosquitoes. A random sample of individuals from two sectors of a city has been tested to determine if they had contracted the disease, which forms the binary outcome $c17-math-0152$ . The variables of age x1, socioeconomic status (of three categories lower, middle, and upper) through data variables x2 and x3, and sector of the city x4, are used to determine if the disease has been contracted by the individual. This dataset is available in the file Disease_Outbreak.csv and imported into the R session as the data.frame object Disease. First, a logistic regression model will be fitted with glm(y~.,data=Disease,family='binomial') and the fitted model DO_LR will be used to obtain the residuals: response, deviance, Pearson, standardized Pearson, and the important hat values. The R functions glm, residuals with the options of response, deviance, and pearson, hatvalues, and fitted will help us to obtain the quantities defined in Equations 17.16–17.22.

> data(Disease)
> DO_LR <- glm(y~.,data=Disease,family='binomial')
> LR_Residuals <- data.frame(Y = Disease$y,Fitted = fitted(DO_LR),
+ Hatvalues = hatvalues(DO_LR),Response = residuals(DO_LR,
+ "response"), Deviance = residuals(DO_LR,"deviance"),
+ Pearson = residuals(DO_LR,"pearson"),
+ Pearson_Standardized = residuals(DO_LR,"pearson")/
+ sqrt(1-hatvalues(DO_LR)))
> LR_Residuals
   Y Fitted Hatvalues Response Deviance Pearson Pearson_Standardized
1  0 0.2090    0.0387  -0.2090   -0.685  -0.514               -0.524
2  0 0.2190    0.0404  -0.2190   -0.703  -0.529               -0.541
3  0 0.1058    0.0332  -0.1058   -0.473  -0.344               -0.350
4  0 0.3710    0.0895  -0.3710   -0.963  -0.768               -0.805
5  1 0.1108    0.0252   0.8892    2.098   2.833                2.869
94 0 0.1630    0.0460  -0.1630   -0.596  -0.441               -0.452
95 1 0.1589    0.0326   0.8411    1.918   2.300                2.339
96 0 0.1138    0.0254  -0.1138   -0.491  -0.358               -0.363
97 0 0.0919    0.0241  -0.0919   -0.439  -0.318               -0.322
98 0 0.1712    0.0356  -0.1712   -0.613  -0.455               -0.463

Unlike in the linear regression model, plots of residuals against the fitted values of the type plot(fitted,residuals) are not very useful here. A departure from the usual approach is required to obtain meaningful residual plots, and it will be developed now.□
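To connect the definitions of this sub-section with the R functions, the residuals may also be computed from first principles. The following is a minimal sketch on simulated data; the data and the names r_response, r_pearson, r_std, and r_deviance are illustrative assumptions, and the Disease data are not used.

```r
set.seed(42)                                   # simulated data, for illustration only
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.7 * x))
fit <- glm(y ~ x, family = binomial)

p <- fitted(fit)                               # fitted probabilities
h <- hatvalues(fit)                            # diagonal of the hat matrix

r_response <- y - p                            # response residual
r_pearson  <- r_response / sqrt(p * (1 - p))   # Pearson residual
r_std      <- r_pearson / sqrt(1 - h)          # standardized Pearson residual
# Deviance residual: signed square root of each observation's deviance share
r_deviance <- sign(r_response) *
  sqrt(-2 * (y * log(p) + (1 - y) * log(1 - p)))

print(head(cbind(r_response, r_pearson, r_std, r_deviance)))
```

Each of these agrees with the corresponding call residuals(fit,"response"), residuals(fit,"pearson"), rstandard(fit,type="pearson"), and residuals(fit,"deviance").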

Recollect from Equation 17.7 that the mean and variance of the error term are respectively $c17-math-0153$ and $c17-math-0154$ . Thus, the LOESS plot, refer to Section 8.4, of errors against the fitted values should reflect a line around 0 if the assumption of the logistic regression is appropriate. This perspective will be explored as a continuation of the previous example.

Example 17.5.5. Disease Outbreak Problem

Contd. The plots of the residuals against the fitted values are easily obtained. We need to complement the residual plots with the LOESS approximation. Thus, the loess function is used in the usual way, loess(y~x). The fitted LOESS curve is then added to each residual plot, and we check whether the curve lies approximately at 0.

> par(mfrow=c(2,2))
> plot(LR_Residuals$Fitted,LR_Residuals$Response,
+ xlab="Fitted Values",ylab="Response Residual")
> response_loess <- loess(Response~Fitted,data=LR_Residuals)
> points(response_loess$x,predict(response_loess))
> plot(LR_Residuals$Fitted,LR_Residuals$Deviance,
+ xlab="Fitted Values", ylab="Deviance Residual")
> deviance_loess <- loess(Deviance~Fitted,data=LR_Residuals)
> points(deviance_loess$x,predict(deviance_loess))
> plot(LR_Residuals$Fitted,LR_Residuals$Pearson,
+ xlab="Fitted Values", ylab="Pearson Residual")
> pearson_loess <- loess(Pearson~Fitted,data=LR_Residuals)
> points(pearson_loess$x,predict(pearson_loess))
> plot(LR_Residuals$Fitted,LR_Residuals$Pearson_Standardized,
+ xlab="Fitted Values", ylab="Standardized Pearson Residual")
> pearson_standardized_loess <- loess(Pearson_Standardized~Fitted,
+ data=LR_Residuals)
> points(pearson_standardized_loess$x,predict(pearson_standardized_loess))
> title(main="The Loess Approach for Residual Validation of Logistic
+ Regression Model",outer=TRUE,line=-2)

Figure 17.3 shows that the logistic regression model is appropriate for the disease outbreak data problem.□

Figure 17.3 Residual Plots using LOESS

Residuals are important for more than validating the model assumptions. The overall fit of the model will be investigated next.

17.5.5 Deviance Test and Hosmer-Lemeshow Goodness-of-Fit Test

Testing for the significance of the GLM is carried out using the deviance statistic, denoted by $c17-math-0155$ , by comparing the likelihood function for the fitted model, which includes the covariates, with the likelihood of the saturated model. A saturated model is a model which takes into consideration all possible parameters and thus results in a perfect fit, in the sense that all successful outcomes are predicted as successes and all failures are predicted as failures. The deviance statistic $c17-math-0156$ is then defined by

Since the saturated model, by definition, leads to a perfect fit, we have $c17-math-0158$ , and thus

Thus, the deviance statistic becomes

To know the significance of the set of the independent variables, we need to compare the value of $c17-math-0161$ with and without the set of independent variables/covariates. That is

17.23

We further see that

17.24

where $c17-math-0164$ and $c17-math-0165$ . Basically, we are substituting the MLEs of $c17-math-0166$ and $c17-math-0167$ for the model without the set of independent variables. Under the hypothesis that all the regression coefficients are zero, the test statistic $c17-math-0168$ follows a $c17-math-0169$ -distribution with $c17-math-0170$ degrees of freedom. With the appropriate significance level, it is possible to infer the model significance using the $c17-math-0171$ statistic. This $c17-math-0172$ statistic may be seen as the parallel of the $c17-math-0173$ statistic in the linear model case. For more details, refer to Hosmer and Lemeshow (2000).

Example 17.5.6. The Low Birth-Weight Problem

Contd. Recall that the R object lowglm_summary had been used to store summary values of the glm object in lowglm. To calculate the $c17-math-0174$ -statistic, which is not directly given in R, we note that the summarized object gives us the deviance and null.deviance of the model. Thus,

> gstat_lowbwt <- lowglm_summary$null.deviance - lowglm_summary$deviance
> gstat_lowbwt
[1] 12.09909
> 1-pchisq(gstat_lowbwt,5)
[1] 0.03345496
> with(lowglm, pchisq(null.deviance - deviance,df.null - df.residual,
+ lower.tail = FALSE)) # Equivalently
[1] 0.0335

The $c17-math-0175$ -value is significant at the 0.05 level, which leads us to reject the hypothesis $c17-math-0176$ and conclude that at least one or even all of $c17-math-0177$ variable effects are significantly different from zero.□
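The difference of deviances used above is the same as twice the difference of the maximized log-likelihoods of the fitted and the intercept-only models. The following sketch checks this on simulated data; the data and object names are illustrative assumptions, and the lowbwt data are not used.

```r
set.seed(7)                                    # simulated data, for illustration only
n  <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.2 + 0.9 * x1))

full <- glm(y ~ x1 + x2, family = binomial)    # model with the covariates
null <- glm(y ~ 1, family = binomial)          # intercept-only model

G_dev <- full$null.deviance - full$deviance    # difference of deviances
G_ll  <- 2 * (as.numeric(logLik(full)) - as.numeric(logLik(null)))
df    <- full$df.null - full$df.residual       # number of covariates tested

print(c(G_dev = G_dev, G_ll = G_ll, df = df))
print(pchisq(G_dev, df, lower.tail = FALSE))   # p-value of the test
print(anova(null, full, test = "Chisq"))       # the same test via anova
```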

17.6 Model Selection in Logistic Regression Models

We consider “Stepwise Logistic Regression” for selecting the best variables in a logistic regression model. The most important variable is the one that produces the maximum change in the log-likelihood relative to a model containing no variables. That is, the variable which produces the maximum change in the value of the $c17-math-0178$-statistic is considered the most important. The complete process of stepwise logistic regression is described in the following steps, the last of which is labeled Step S.

Step 0

Consider all plausible variables, say $c17-math-0179$.
Fit the “intercept only model”, and evaluate the log-likelihood $c17-math-0180$.
Fit each of the possible $c17-math-0181$ univariate logistic regression models and note the log-likelihood values $c17-math-0182$. Furthermore, calculate the $c17-math-0183$-statistic for each of the $c17-math-0184$ models, $c17-math-0185$.
Obtain the $c17-math-0186$-value for each model.
The most important variable is then the one with the least $c17-math-0187$-value, $c17-math-0188$.
Denote the most important variable by $c17-math-0189$.
Define the entry criterion $c17-math-0190$-value as $c17-math-0191$, which will at any time throughout this procedure decide if a variable is to be included or not. That is, the least $c17-math-0192$-value must be less than $c17-math-0193$ for the corresponding variable to be selected in the final model.
If none of the variables has a $c17-math-0194$-value less than $c17-math-0195$, we stop.

Step 1. The Forward Selection Step

Replace $c17-math-0196$ of the previous step with $c17-math-0197$.
Fit $c17-math-0198$ models with variables $c17-math-0199$ and the remaining variables $c17-math-0200$, $c17-math-0201$, and $c17-math-0202$ distinct from $c17-math-0203$.
For each of the $c17-math-0204$ models, calculate the log-likelihood $c17-math-0205$ and the $c17-math-0206$-statistics $c17-math-0207$, and the corresponding $c17-math-0208$-values, denoted $c17-math-0209$.
Define $c17-math-0210$.
If $c17-math-0211$, stop.

Step 2A. The Backward Elimination Step

Adding $c17-math-0212$ may leave $c17-math-0213$ statistically insignificant.
Let $c17-math-0214$ denote the log-likelihood of the model with variable $c17-math-0215$ removed.
Calculate the likelihood-ratio test statistics of these reduced models with respect to the full model at the beginning of this step, $c17-math-0216$, and calculate the $c17-math-0217$-values $c17-math-0218$.
The variable to be deleted is the one whose removal results in the maximum $c17-math-0219$-value.
Denote by $c17-math-0220$ the variable which is to be removed, and define $c17-math-0221$.
To remove variables, we need to have a value $c17-math-0222$ with respect to which we compare $c17-math-0223$.
We need to have $c17-math-0224$. (Why?)
Variables are removed if $c17-math-0225$.

Step 2B. The Forward Selection Phase

Continue the forward selection method with the $c17-math-0226$ remaining variables and find $c17-math-0227$.
Let $c17-math-0228$ be the variable associated with $c17-math-0229$.
If $c17-math-0230$ is less than $c17-math-0231$, proceed to Step 3; otherwise stop.

Step 3

Fit the model including the variable selected in the previous step and perform backward elimination and then forward selection phase.
Repeat until the last Step S.

Step S

Stopping happens if all the variables have been selected in the model.
It also happens if all the $c17-math-0232$-values in the model are less than $c17-math-0233$, and the remaining variables have $c17-math-0234$-values exceeding $c17-math-0235$.

The step function in R may be used to build a model. The criterion used there is the Akaike Information Criterion (AIC). To the best of our knowledge, there is no package/function which implements stepwise logistic regression using the $c17-math-0237$-values generated by the $c17-math-0236$-statistic. The Hosmer and Lemeshow (2000) approach of model selection will be used here.
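The steps above can also be sketched with base R building blocks: add1 and drop1 with test="Chisq" report the likelihood-ratio test for adding or removing a single term of a fitted glm. The function below is only an illustration under stated assumptions and not the implementation used in this chapter. The name stepwise_lrt and its arguments are hypothetical, the thresholds pe and pr play the roles of the entry and removal criteria, and the built-in mtcars data stand in for a real study.

```r
# A sketch: stepwise selection for a binomial GLM driven by the
# likelihood-ratio p-values from add1() and drop1().
# The function name and its arguments are hypothetical.
stepwise_lrt <- function(data, response, candidates,
                         pe = 0.25, pr = 0.40, max_steps = 50) {
  stopifnot(pe < pr)                   # entry threshold below removal threshold
  selected <- character(0)
  for (s in seq_len(max_steps)) {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break  # every variable is in the model
    rhs <- if (length(selected) > 0) selected else "1"
    fit <- glm(reformulate(rhs, response), data = data, family = binomial)

    # Forward step: p-value of each remaining variable, given the current model
    fwd  <- add1(fit, scope = reformulate(c(selected, remaining)), test = "Chisq")
    p_in <- fwd[remaining, "Pr(>Chi)"]
    if (min(p_in) >= pe) break         # nothing qualifies for entry
    selected <- c(selected, remaining[which.min(p_in)])

    # Backward step: p-value for removing each variable now in the model
    fit   <- glm(reformulate(selected, response), data = data, family = binomial)
    bwd   <- drop1(fit, test = "Chisq")
    p_out <- bwd[selected, "Pr(>Chi)"]
    if (max(p_out) > pr) selected <- selected[-which.max(p_out)]
  }
  selected
}

# Usage on stand-in data: am is binary, wt and hp are the candidates
chosen <- stepwise_lrt(mtcars, "am", c("wt", "hp"))
print(chosen)
```

Because pe is smaller than pr, a variable that has just entered cannot be removed in the same pass, and max_steps guards against cycling. The sketch assumes complete data, since add1 requires the same rows to be used in every fit.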

Example 17.6.1. Stepwise Regression for Low Birth-Weight Study

Let us begin by reading the dataset into R and carrying out some essential data manipulations.

> data(lowbwt)
> attach(lowbwt)
> # Dummy variables for RACE, with RACE = 1 as the reference level
> RACE_2 <- as.numeric(lowbwt$RACE==2)
> RACE_3 <- as.numeric(lowbwt$RACE==3)
> design <- cbind(rep(1,nrow(lowbwt)),lowbwt[,3],
+ lowbwt[,4],RACE_2,RACE_3,lowbwt[,10])
> colnames(design)=c("intercept", "AGE","LWT",
+ "RACE_2","RACE_3","FTV")
> n <- nrow(design)
> n1 <- sum(lowbwt$LOW); n0 <- n-n1
> nullloglik <- n1*log(n1/n) + n0*log(n0/n)

We have thus obtained the null log-likelihood value. We will first define two functions: glmllv, which calculates the log-likelihood value of a fitted GLM, and pvalue, which, given two log-likelihood values, returns the $c17-math-0238$-value for the fitted GLM.

> # Functions which calculate the log-likelihood
> # values and the p-value
> glmllv <- function(fit, x) {
+ # fit: a fitted binomial glm; x: its covariates, in model order
+ y <- fit$y
+ x1 <- cbind(1, x) # design matrix with the intercept column
+ logitx <- x1 %*% fit$coefficients
+ pix <- plogis(logitx) # same as exp(logitx)/(1+exp(logitx))
+ llvalue <- sum(y*log(pix)) + sum((1-y)*log(1-pix))
+ return(llvalue)
+ }
> pvalue <- function(lik1,lik0,df){
+ # lik1, lik0: log-likelihoods of the larger and the smaller model
+ gstat <- -2*(lik0-lik1)
+ pval <- 1-pchisq(gstat,df)
+ return(pval)
+ }
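As a check on functions of this kind, the hand-computed log-likelihood can be compared with the value returned by logLik, and the closed-form null log-likelihood with that of an intercept-only fit. The sketch below assumes the built-in mtcars data with am as a stand-in binary response; the names ll_hand and ll_null are introduced only for this illustration.

```r
# Illustration only: mtcars is an assumption, not a dataset of this chapter
y   <- mtcars$am
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Log-likelihood computed from the fitted probabilities
pi_hat  <- fitted(fit)
ll_hand <- sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))

# Null log-likelihood in closed form: n1*log(n1/n) + n0*log(n0/n)
n  <- length(y)
n1 <- sum(y)
n0 <- n - n1
ll_null <- n1 * log(n1 / n) + n0 * log(n0 / n)
fit0 <- glm(am ~ 1, data = mtcars, family = binomial)

# Each pair of values should agree
print(c(ll_hand, as.numeric(logLik(fit))))
print(c(ll_null, as.numeric(logLik(fit0))))
```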

We will now set the entry and exit criteria and decide which variable will first enter the model.

> # The p-values for the entry and exit criteria
> pe <- 0.25; pr <- 0.4
> # Selecting the first variable to be included in the model
> glm_AGE <- glm(LOW~AGE,family='binomial')
> ll_AGE <- glmllv(glm_AGE,AGE)
> (pvalue_AGE <- pvalue(ll_AGE,nullloglik,1))
[1] 0.09664596
> glm_LWT <- glm(LOW~LWT,family='binomial')
> ll_LWT <- glmllv(glm_LWT,LWT)
> (pvalue_LWT <- pvalue(ll_LWT,nullloglik,1))
[1] 0.01445812
> glm_RACE_2 <- glm(LOW~RACE_2,family='binomial')
> ll_RACE_2 <- glmllv(glm_RACE_2,RACE_2)
> (pvalue_RACE_2 <- pvalue(ll_RACE_2,nullloglik,1))
[1] 0.1985105
> glm_RACE_3 <- glm(LOW~RACE_3,family='binomial')
> ll_RACE_3 <- glmllv(glm_RACE_3,RACE_3)
> (pvalue_RACE_3 <- pvalue(ll_RACE_3,nullloglik,1))
[1] 0.1829021
> glm_FTV <- glm(LOW~FTV,family='binomial')
> ll_FTV <- glmllv(glm_FTV,FTV)
> (pvalue_FTV <- pvalue(ll_FTV,nullloglik,1))
[1] 0.3792461

We see that the minimum $c17-math-0239$-value of 0.0145 is associated with the LWT variable, and is also less than $c17-math-0240$. We include this variable in the model now and move to Step 1.

> # Selecting the variables for Step 1
> glm_LWT <- glm(LOW~LWT,family='binomial')
> ll_LWT <- glmllv(glm_LWT,LWT)
> glm_LWT_AGE <- glm(LOW~LWT+AGE,family='binomial')
> ll_LWT_AGE <- glmllv(glm_LWT_AGE,cbind(LWT,AGE))
> (pvalue_LWT_AGE <- pvalue(ll_LWT_AGE,ll_LWT,1))
[1] 0.2106024
> glm_LWT_RACE_2 <- glm(LOW~LWT+RACE_2,family='binomial')
> ll_LWT_RACE_2 <- glmllv(glm_LWT_RACE_2,cbind(LWT,RACE_2))
> (pvalue_LWT_RACE_2 <- pvalue(ll_LWT_RACE_2,ll_LWT,1))
[1] 0.05723459
> glm_LWT_RACE_3 <- glm(LOW~LWT+RACE_3,family='binomial')
> ll_LWT_RACE_3 <- glmllv(glm_LWT_RACE_3,cbind(LWT,RACE_3))
> (pvalue_LWT_RACE_3 <- pvalue(ll_LWT_RACE_3,ll_LWT,1))
[1] 0.442516
> glm_LWT_FTV <- glm(LOW~LWT+FTV,family='binomial')
> ll_LWT_FTV <- glmllv(glm_LWT_FTV,cbind(LWT,FTV))
> (pvalue_LWT_FTV <- pvalue(ll_LWT_FTV,ll_LWT,1))
[1] 0.5457832

Since the $c17-math-0241$-value associated with RACE_2 is the least and is less than $c17-math-0242$, it can be selected in our model. We will now check if some variable needs to leave the model.

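> # Backward elimination method of Step 2
> # Since RACE consists of both RACE_2 and RACE_3,
> # we include both in Step 2
> glm_LWT_RACE <- glm(LOW~LWT+RACE_2+RACE_3,family='binomial')
> ll_LWT_RACE <- glmllv(glm_LWT_RACE, cbind(LWT,RACE_2,RACE_3))
> glm_RACE <- glm(LOW~RACE_2+RACE_3,family='binomial')
> ll_RACE <- glmllv(glm_RACE,cbind(RACE_2,RACE_3))
> pvalue(ll_LWT_RACE,ll_LWT,2) # p-value for removing RACE
[1] 0.06615272
> p_LWT <- pvalue(ll_LWT_RACE,ll_RACE,1) # p-value for removing LWT

Note that glm_RACE must also be fitted with family='binomial'; without this option glm fits a Gaussian model, and glmllv then returns a meaningless log-likelihood. The $c17-math-0243$-value for removing RACE, 0.066, is less than $c17-math-0244$, and hence RACE is retained. LWT, which entered the model first with a $c17-math-0243$-value of 0.0145, is retained as long as p_LWT is also less than $c17-math-0244$. That is, the backward elimination step is not expected to remove any variable. We need to redo Step 2 until we reach Step S described earlier.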

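> # Step 3 continues Step 2 until the stopping criteria are met
> glm_LWT_RACE_AGE <- glm(LOW~LWT+RACE_2+RACE_3+AGE,family='binomial')
> ll_LWT_RACE_AGE <- glmllv(glm_LWT_RACE_AGE,cbind(LWT,RACE_2,RACE_3,AGE))
> pvalue_LWT_RACE_AGE <- pvalue(ll_LWT_RACE_AGE,ll_LWT_RACE,1)
> glm_LWT_RACE_FTV <- glm(LOW~LWT+RACE_2+RACE_3+FTV,family='binomial')
> ll_LWT_RACE_FTV <- glmllv(glm_LWT_RACE_FTV,cbind(LWT,RACE_2,RACE_3,FTV))
> pvalue_LWT_RACE_FTV <- pvalue(ll_LWT_RACE_FTV,ll_LWT_RACE,1)

Here too the models must be fitted with family='binomial' for the log-likelihoods to be meaningful. Neither AGE nor FTV can be expected to enter the model. The likelihood-ratio statistic of the model with LWT and RACE is about 5.98 + 5.43 = 11.41, as implied by the $c17-math-0245$-values 0.0145 and 0.0662 obtained above, while the model of Example 17.5.6 with five covariates has a statistic of 12.10. Assuming that the latter model consists of AGE, LWT, RACE_2, RACE_3, and FTV, either variable alone adds at most 0.69 to the statistic, which corresponds to a $c17-math-0245$-value of at least 0.40 and exceeds $c17-math-0246$. Thus, our best model includes the variables LWT and RACE.□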

We will next illustrate the backward selection method using the AIC criterion.

Example 17.6.2. Backward Selection Method for Low Birth-Weight Study

We will again use the step function with the backward option for the direction to get the desired result.

> library(gdata) # provides read.xls
> lowbwt <- read.xls("lowbwt.xls",sheet=1,header=TRUE)
> lowbwt <- lowbwt[,-1]
> lowglm <- glm(LOW~.,data=lowbwt,family='binomial')
> lowbackglm <- step(lowglm,direction="backward")
> step(lowglm,direction="backward")
Start:  AIC=20
LOW ~ AGE + LWT + RACE + SMOKE + PTL + HT + UI + FTV + BWT
        Df Deviance    AIC
- SMOKE  1     0.00  18.00
- AGE    1     0.00  18.00
- RACE   1     0.00  18.00
- UI     1     0.00  18.00
- PTL    1     0.00  18.00
- FTV    1     0.00  18.00
- LWT    1     0.00  18.00
- HT     1     0.00  18.00
<none>         0.00  20.00
- BWT    1   204.19 222.19
Step:  AIC=18
LOW ~ AGE + LWT + RACE + PTL + HT + UI + FTV + BWT
       Df Deviance    AIC
- AGE   1     0.00  16.00
- PTL   1     0.00  16.00
- LWT   1     0.00  16.00
- FTV   1     0.00  16.00
- HT    1     0.00  16.00
- RACE  1     0.00  16.00
- UI    1     0.00  16.00
<none>        0.00  18.00
- BWT   1   209.89 225.89
Step:  AIC=4
LOW ~ BWT
       Df Deviance    AIC
<none>        0.00   4.00
- BWT   1   234.67 236.67
Call:  glm(formula = LOW ~ BWT, family = "binomial", data = lowbwt)
Coefficients:
(Intercept)          BWT
   2976.010       -1.186
Degrees of Freedom: 188 Total (i.e. Null);  187 Residual
Null Deviance:    234.7
Residual Deviance: 4.931e-07 AIC: 4
> summary(lowbackglm)
Call:
glm(formula = LOW ~ BWT, family = "binomial", data = lowbwt)
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-4.986e-04  -2.107e-08  -2.107e-08   2.107e-08   2.472e-04
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   2976.010 218927.077   0.014     0.99
BWT             -1.186     87.251  -0.014     0.99
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 2.3467e+02  on 188  degrees of freedom
Residual deviance: 4.9313e-07  on 187  degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25

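Thus, only the BWT variable has been retained for explaining the LOW birth indicator. This result should be read with caution: LOW is itself defined from BWT (a birth weight below 2500 grams), so BWT separates the two classes perfectly. The zero residual deviance, the very large standard errors, and the 25 Fisher scoring iterations in the output are symptoms of this perfect separation and not of a genuinely good model.□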
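The pattern seen in this output (a residual deviance of practically zero together with very large standard errors) arises whenever a binary response is a deterministic function of a covariate. The sketch below uses hypothetical birth weights; the vector bwt and the 2500 gram cut-off are assumptions made for illustration and are not the lowbwt data.

```r
# Hypothetical data: an indicator derived directly from the covariate
bwt <- seq(1500, 3500, by = 100)  # made-up birth weights in grams
low <- as.integer(bwt < 2500)     # 1 if below the assumed cut-off

# Warnings about fitted probabilities of 0 or 1 are suppressed here
fit_sep <- suppressWarnings(glm(low ~ bwt, family = binomial))

print(deviance(fit_sep))              # residual deviance of the fit
print(summary(fit_sep)$coefficients)  # inspect the standard errors
```

In such a situation the covariate that defines the response is usually left out of the candidate set before model selection.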

An alternative to the logistic regression model for the binary regression problem is given by the probit model, which we will discuss in the next section.


17.7 Probit Regression

Bliss (1935) proposed the use of the probit regression model for problems where the response variable is a binary variable. Recall that in the logistic regression model we used the logit link function to ensure that the predicted probabilities were always in the unit interval. Bliss suggested using the cumulative distribution function of the normal distribution for the same purpose.

Let $c17-math-0248$ denote the binary outcome as earlier and the covariates be represented by $c17-math-0249$ . The probit regression model is constructed through the use of an auxiliary RV denoted by $c17-math-0250$ , see Chapter 7 of Tattar (2013). The auxiliary RV $c17-math-0251$ is then modeled as the multiple linear regression model:

where the error $c17-math-0253$ follows the normal distribution $c17-math-0254$ . Without loss of generality we assume $c17-math-0255$ . The vectors $c17-math-0256$ and $c17-math-0257$ have their usual meanings. The probit regression model is then developed as a latent variable model through the use of the auxiliary RV:

17.25

The probit model is then given by

17.26

where $c17-math-0260$ denotes the cumulative distribution of a standard normal variable. If we have $c17-math-0261$ pairs of observations $c17-math-0262$ , the statistical inference for $c17-math-0263$ may be carried out through the likelihood function given by

This likelihood function has no closed-form maximizer and must be maximized numerically. A probit regression model is set up in R, again using the glm function, with the option family=binomial(probit). Since the fitted probit model is again a member of the glm class, the techniques available and discussed for the logistic regression model are also available for the fitted probit model. For the sake of brevity, the mathematical details of the probit model will be skipped. First, we consider the simple probit model, followed by an example of a multiple probit model.
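For reference, the latent variable construction described above may be sketched in standard notation as follows. The symbols are assumptions made here: $Y_i^{*}$ for the auxiliary RV, $\mathbf{x}_i$ for the covariate vector, $\boldsymbol{\beta}$ for the regression coefficients, and $\Phi$ for the standard normal cumulative distribution function. The equation numbers are matched to the text on the assumption that 17.25 is the latent variable rule and 17.26 the probit model.

```latex
\begin{align*}
Y_i^{*} &= \mathbf{x}_i^{T}\boldsymbol{\beta} + \epsilon_i,
  \qquad \epsilon_i \sim N(0, \sigma^2), \quad \sigma^2 = 1, \\
Y_i &= \begin{cases} 1, & \text{if } Y_i^{*} > 0, \\ 0, & \text{otherwise}, \end{cases} \tag{17.25} \\
P(Y_i = 1 \mid \mathbf{x}_i) &= P(\epsilon_i > -\mathbf{x}_i^{T}\boldsymbol{\beta})
  = \Phi(\mathbf{x}_i^{T}\boldsymbol{\beta}), \tag{17.26} \\
L(\boldsymbol{\beta}) &= \prod_{i=1}^{n}
  \left[\Phi(\mathbf{x}_i^{T}\boldsymbol{\beta})\right]^{y_i}
  \left[1 - \Phi(\mathbf{x}_i^{T}\boldsymbol{\beta})\right]^{1-y_i}.
\end{align*}
```

The step from the latent variable to $\Phi(\mathbf{x}_i^{T}\boldsymbol{\beta})$ uses the symmetry of the normal distribution about zero.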

Example 17.7.1. Probit Model for the CHD Data

We need to use binomial(probit) as the family option in the glm function to fit a probit model.

> # The Probit Regression Model
> data(chdage)
> chdprobit <- glm(CHD~AGE,data=chdage,family=binomial(probit))
> summary(chdprobit)
Call:
glm(formula = CHD ~ AGE, family = binomial(probit), data = chdage)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9713  -0.8608  -0.4499   0.8359   2.3269
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.14573    0.62460  -5.036 4.74e-07 ***
AGE          0.06580    0.01335   4.930 8.20e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 136.66  on 99  degrees of freedom
Residual deviance: 107.50  on 98  degrees of freedom
AIC: 111.50
Number of Fisher Scoring iterations: 4
> summary(glm(CHD~AGE,data=chdage,family='binomial'))$coefficients
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -5.3094534 1.13365365 -4.683488 2.820338e-06
AGE          0.1109211 0.02405982  4.610224 4.022356e-06

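A comparison with the logistic regression model, see Example 17.5.1, shows that the coefficients have the same signs but differ in scale: the logistic estimates are about 1.7 times the probit estimates (the ratios 5.309/3.146 and 0.1109/0.0658 are both approximately 1.69), which reflects the larger spread of the standard logistic distribution compared with the standard normal distribution. AGE is found to be a significant variable, as in the case of the logistic regression model.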

The confidence intervals and model significance may be obtained in R for this model, as seen next.

> confint(chdprobit)
Waiting for profiling to be done...
                  2.5 %      97.5 %
(Intercept) -4.41109357 -1.97605229
AGE          0.04070297  0.09278964
> with(chdprobit, pchisq(null.deviance - deviance,df.null
+ - df.residual,lower.tail = FALSE))
[1] 6.649872e-08

Since 0 does not lie in the 95% confidence interval of either the intercept term or AGE, we again conclude that the variable is significant. Finally, since the overall $c17-math-0265$-value for the fitted model, 6.649872e-08, is very close to 0, the fitted model is a significant model.□

A lot of similarity exists between logistic regression and probit regression regarding model building in R, since both models are fitted using the glm function. In particular, the computational and inference methods remain almost the same. We will close this section with an illustration of the stepwise regression method for the probit regression model.

Example 17.7.2. Stepwise Regression for the Probit Regression Model with an Application to the Low Birth Weight Study

We are now familiar with this dataset. Hence, we will directly show how to perform the stepwise regression for the probit regression model.

> lowprobit <- glm(LOW~.,data=lowbwt,family=binomial(probit))
> step(lowprobit,direction="both")
Start:  AIC=20
LOW ~ AGE + LWT + RACE + SMOKE + PTL + HT + UI + FTV + BWT
        Df Deviance    AIC
- RACE   1     0.00  18.00
- UI     1     0.00  18.00
- AGE    1     0.00  18.00
- SMOKE  1     0.00  18.00
- PTL    1     0.00  18.00
- FTV    1     0.00  18.00
- LWT    1     0.00  18.00
- HT     1     0.00  18.00
<none>         0.00  20.00
- BWT    1   203.74 221.74
Step:  AIC=18
LOW ~ AGE + LWT + SMOKE + PTL + HT + UI + FTV + BWT
        Df Deviance    AIC
- PTL    1     0.00  16.00
- AGE    1     0.00  16.00
- LWT    1     0.00  16.00
- FTV    1     0.00  16.00
- SMOKE  1     0.00  16.00
- UI     1     0.00  16.00
- HT     1     0.00  16.00
<none>         0.00  18.00
+ RACE   1     0.00  20.00
- BWT    1   208.46 224.46
Step:  AIC=4
LOW ~ BWT
        Df Deviance    AIC
<none>         0.00   4.00
+ UI     1     0.00   6.00
+ LWT    1     0.00   6.00
+ AGE    1     0.00   6.00
+ PTL    1     0.00   6.00
+ RACE   1     0.00   6.00
+ HT     1     0.00   6.00
+ FTV    1     0.00   6.00
+ SMOKE  1     0.00   6.00
- BWT    1   234.67 236.67
Call:  glm(formula = LOW ~ BWT,
family = binomial(probit), data = lowbwt)
Coefficients:
(Intercept)          BWT
     928.57        -0.37
Degrees of Freedom: 188 Total (i.e. Null);  187 Residual
Null Deviance:    234.7
Residual Deviance: 8.798e-07 AIC: 4

Note that the probit regression model has also identified BWT as the single most important regressor of the low birth-weight indicator, as in the logistic regression model in the previous section. As there, the residual deviance is practically zero, because LOW is derived from BWT.□

The next section will deal with a different type of discrete variable.


17.8 Poisson Regression Model

The logistic and probit regression models are useful when the discrete output is a binary variable. By extension, we can incorporate multinomial variables too. However, these variables are class indicators, or nominal variables. In Section 16.5, the role of the Poisson RV was briefly indicated in the context of discrete data. If the discrete output is of a quantitative type and there are covariates related to it, we need to use the Poisson regression model.

The Poisson regression models are useful in two slightly different contexts: (i) the events arising as a percentage of the exposures, along with covariates which may be continuous or categorical, and (ii) the exposure effect is constant with the covariates being categorical variables only. In the first type of modeling, the parameter (rate $c17-math-0267$ ) of the Poisson RV is specified in terms of the units of exposure. As an example, the quantitative variable may refer to the number of accidents per thousand vehicles arriving at a traffic signal, or the number of successful sales per thousand visitors on a web page. In the second case, the regressand $c17-math-0268$ may refer to the count in each cell of a contingency table. The Poisson model in this case is popularly referred to as the log-linear model. For a detailed treatment of this model, refer to Chapter 9 of Dobson (2002).

Let $c17-math-0269$ denote the number of responses from $c17-math-0270$ events for the $c17-math-0271$ exposure, $c17-math-0272$ , and let $c17-math-0273$ represent the vector of explanatory variables. The Poisson regression model is stated as

17.27

Here, the natural link function is the logarithmic function

The extra term $c17-math-0276$ that appears in the link function is referred to as the offset term. It is important to clearly specify here the form of the pmf of $c17-math-0277$ :

The likelihood function is then given by

17.28

The (ML) technique of obtaining $c17-math-0280$ is again based on the IRLS algorithm. However, this aspect will not be dealt with here. It is assumed that the ML estimate is available for $c17-math-0281$ , as returned by say R software, and by using it we will then look at other aspects of the model. The fitted values from the Poisson regression model are given by

17.29

and hence the residuals are

17.30

Note that the mean and variance of the Poisson model are equal, and hence an estimate of the variance of the residual is $c17-math-0284$ , which leads to the Pearson residuals:

17.31

A $c17-math-0286$ goodness-of-fit test statistic is then given by

17.32
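In the notation of Dobson (2002), the quantities introduced in this section may be sketched as follows. The symbols are assumptions made here: $\mu_i$ for the mean of $Y_i$, $n_i$ for the exposure, $\mathbf{x}_i$ for the vector of explanatory variables, $\hat{\boldsymbol{\beta}}$ for the ML estimate, and $X^2$ for the goodness-of-fit statistic. The equation numbers follow the order in which they appear in the text.

```latex
\begin{align*}
E(Y_i) &= \mu_i = n_i \exp(\mathbf{x}_i^{T}\boldsymbol{\beta}), \tag{17.27} \\
\log \mu_i &= \log n_i + \mathbf{x}_i^{T}\boldsymbol{\beta}, \\
P(Y_i = y_i) &= \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots, \\
L(\boldsymbol{\beta}) &= \prod_{i=1}^{n} \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \tag{17.28} \\
\hat{y}_i &= \hat{\mu}_i = n_i \exp(\mathbf{x}_i^{T}\hat{\boldsymbol{\beta}}), \tag{17.29} \\
e_i &= y_i - \hat{y}_i, \tag{17.30} \\
r_i &= \frac{y_i - \hat{y}_i}{\sqrt{\hat{y}_i}}, \tag{17.31} \\
X^2 &= \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\hat{y}_i}. \tag{17.32}
\end{align*}
```

The term $\log n_i$ in the second line is the offset, which enters the linear predictor with a coefficient fixed at 1.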

These concepts will be demonstrated through the next example.

Example 17.8.1. British Doctors Smoking and Coronary Heart Disease

The data for this example is taken from Table 9.1 of Dobson (2002) and is available in the file British_Smokers.csv. The problem is to investigate the impact of smoking tobacco among British doctors, refer to Example 9.2.1 of Dobson. In the year 1951, a survey was sent out to all British doctors asking them whether or not they smoked tobacco and their age group Age_Group. The data also contains the person-years Person_Years of the doctors in the respective age group. A follow-up after ten years reveals the number of deaths Deaths and the smoking group indicator Smoker_Cat. The data is slightly re-coded to extract variables, with Age_Cat taking values 1 to 5 respectively for the age groups 35-44, 45-54, 55-64, 65-74, and 75-84. To check the presence of a non-linear impact of the variable age, the square of Age_Cat is created in Age_Square. The variable Smoke_Age is created, which takes the Age_Cat values for the smokers' group and 0 for the non-smokers. The number of deaths is standardized to 100 000 Person_Years.

The model can be built with the glm function using the option family='poisson' together with the offset option.

> data(BS)
> BS_Pois <- glm(Deaths~Age_Cat+Age_Square+Smoke_Ind+Smoke_Age,
+ offset=log(Person_Years), data=BS,family='poisson')
> logLik(BS_Pois)
'log Lik.' -28 (df=5)
> summary(BS_Pois)
Call:
glm(formula = Deaths ~ Age_Cat + Age_Square + Smoke_Ind + Smoke_Age,
    family = "poisson", data = BS, offset = log(Person_Years))
Deviance Residuals:
      1        2        9       10
 0.4382  -0.2733    -0.4106  -0.0127
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.7918     0.4501  -23.98  < 2e-16 ***
Age_Cat       2.3765     0.2079   11.43  < 2e-16 ***
Age_Square   -0.1977     0.0274   -7.22  5.1e-13 ***
Smoke_Ind     1.4410     0.3722    3.87  0.00011 ***
Smoke_Age    -0.3075     0.0970   -3.17  0.00153 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 935.0673  on 9  degrees of freedom
Residual deviance:   1.6354  on 5  degrees of freedom
AIC: 66.7
Number of Fisher Scoring iterations: 4
> with(BS_Pois, pchisq(null.deviance - deviance, df.null - df.residual,lower.tail = FALSE))
[1] 1e-200
> confint(BS_Pois)
Waiting for profiling to be done...
            2.5 % 97.5 %
(Intercept) -11.7   -9.9
Age_Cat       2.0    2.8
Age_Square   -0.3   -0.1
Smoke_Ind     0.7    2.2
Smoke_Age    -0.5   -0.1

The R output clearly shows that each of the variables included is significant for the number of deaths. However, we are not really interested in this aspect of the model! We would like to know if the death rate for smokers is higher than that for non-smokers and, if it is higher, by how much. This will be answered through the next discussion.□

For a binary regressor, we can obtain a precise answer for the influence of the variable on the rate of the Poisson model through the concept of the rate ratio. The rate ratio, denoted by RR, gives the impact of the variable on the rate by looking at the ratio of the expected value of $c17-math-0289$ when the variable is present to the expected value when it is absent:

If the RR value is close to unity, it implies that the indicator variable has no influence on $c17-math-0291$. The confidence interval for RR, based on the invariance principle of the MLE, is obtained by a simple exponentiation of the confidence interval for the estimated regression coefficient $c17-math-0292$.
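With the logarithmic link, the rate ratio of an indicator variable is the exponential of its regression coefficient. The following is a minimal sketch on hypothetical data: the vectors events, exposure, and exposed are made up for illustration and are not the BS data. It uses the Wald interval from confint.default, which is an assumption of this sketch, whereas the examples in this section use the profile-likelihood interval from confint.

```r
# Hypothetical data: event counts and person-years in four strata
events   <- c(5, 9, 20, 31)
exposure <- c(1000, 2000, 1000, 2000)
exposed  <- c(0, 0, 1, 1)  # indicator variable of interest

fit <- glm(events ~ exposed, family = poisson, offset = log(exposure))

# Rate ratio and its confidence interval by exponentiation
rr    <- unname(exp(coef(fit)["exposed"]))
rr_ci <- exp(confint.default(fit)["exposed", ])  # Wald interval

# With a single indicator, RR equals the ratio of the crude rates
crude <- (sum(events[exposed == 1]) / sum(exposure[exposed == 1])) /
         (sum(events[exposed == 0]) / sum(exposure[exposed == 0]))

# Pearson goodness-of-fit statistic and its p-value
X2    <- sum(residuals(fit, type = "pearson")^2)
p_gof <- pchisq(X2, df.residual(fit), lower.tail = FALSE)
print(c(RR = rr, crude = crude, X2 = X2, p = p_gof))
```

Taking the degrees of freedom from df.residual(fit) avoids hard-coding them when the model changes.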

Example 17.8.2. British Doctors Smoking and Coronary Heart Disease

Contd. Apart from the computation of $c17-math-0293$ , we will quickly look at the residuals (Pearson) and the $c17-math-0294$ goodness-of-fit test statistics.

> exp(BS_Pois$coefficients)
(Intercept)     Age_Cat  Age_Square   Smoke_Ind   Smoke_Age
       0.00       10.77        0.82        4.22        0.74
> exp(confint(BS_Pois))
Waiting for profiling to be done...
            2.5 % 97.5 %
(Intercept)  0.00   0.00
Age_Cat      7.23  16.34
Age_Square   0.78   0.87
Smoke_Ind    2.09   9.01
Smoke_Age    0.61   0.89
> residuals(BS_Pois,'pearson')
     1      2      3      4      5      6      7      8      9     10
 0.444 -0.272 -0.152  0.235 -0.057 -0.766  0.135  0.655 -0.405 -0.013
> sum(residuals(BS_Pois,'pearson')^2)
[1] 1.550
> 1-pchisq(1.55,5)
[1] 0.907229

None of the confidence intervals for $c17-math-0295$ includes 1. The RR value for Smoke_Ind is 4.22; however, since the model also contains Smoke_Age, whose RR value is 0.74, the death rate of smokers relative to non-smokers is exp(1.4410 - 0.3075 × Age_Cat), which is about 3.1 for the youngest age group and decreases to about 0.9 for the oldest. The Pearson residuals are all small, which rules out the presence of any outlier. The $c17-math-0296$ goodness-of-fit test statistic, with a p-value of 0.907, indicates that the fitted model is good enough!□

Example 17.8.3. The Caesarean Cases

An increasing concern has been the number of Caesarean deliveries, especially in private hospitals. We have obtained the small dataset from http://www.oxfordjournals.org/our_journals/tropej/online/ma_chap13.pdf. Here, we know the number of births, the type of hospital (private or public), and the number of Caesareans. We would like to model the number of Caesareans as a function of the number of births and the type of hospital. A Poisson regression model is fitted for this dataset.

> data(caesareans)
> names(caesareans)
[1] "Births"        "Hospital_Type" "Caesareans"
> cae_pois <- glm(Caesareans~Hospital_Type+Births,data=caesareans, family='poisson')
> summary(cae_pois)
Call:
glm(formula = Caesareans ~ Hospital_Type + Births,  family = "poisson",
    data = caesareans)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3270  -0.6121  -0.0899   0.5398   1.6626
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.351e+00  2.501e-01   5.402 6.58e-08 ***
Hospital_Type 1.045e+00  2.729e-01   3.830 0.000128 ***
Births        3.261e-04  6.032e-05   5.406 6.45e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 99.990  on 19  degrees of freedom
Residual deviance: 18.039  on 17  degrees of freedom
AIC: 110.80
Number of Fisher Scoring iterations: 4

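The coefficient of Hospital_Type is 1.045, that is, a rate ratio of exp(1.045) ≈ 2.8. Thus, for a given number of births, the analysis shows that Caesarean sections are nearly three times as common in public hospitals as in private ones, assuming that Hospital_Type is coded as 1 for public hospitals.□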


17.9 Further Reading

Agresti (2007) is a very nice introduction for a first course in CDA. Simonoff (2003) has a different approach to CDA and is addressed to a more advanced audience. Johnson and Albert (1999) is a dedicated text for analysis of ordinal data from a Bayesian perspective. Congdon (2005) is a more exclusive account of Bayesian methods for CDA. Graphics of CDA are different in nature, as the scale properties are not necessarily preserved here. Specialized graphical methods have been developed in Friendly (2000). Though Friendly's book uses SAS as the sole software for statistical analysis and graphics, it is a very good account of the ideas and thought processes for CDA. Blasius and Greenacre (1998), Chen, et al. (2008), and Unwin, et al. (2005) are advanced level edited collections of research papers with dedicated emphasis on the graphical and visualization methods for data.

McCullagh and Nelder (1983–9) is among the first set of classics which deals with GLM. Kleinbaum and Klein (2002–10), Dobson (1990–2002), Christensen (1997), and Lindsey (1997) are a handful of very good introductory books in this field. Hosmer and Lemeshow (1990–2013) forms the spirit of this chapter for details on logistic regression. GLMs have evolved in more depth than we can possibly address here. Though we cannot deal with all of them, we will see here the other directional dimensions of this field. Bayesian methods are the natural alternative school for considering the inference methods in GLMs. Dey, Ghosh, and Mallick (2000) is a very good collection of peer reviewed articles for Bayesian methods in the domain of GLM. Agresti (2002, 2007) and Congdon (2005) also deal with Bayesian analysis of GLMs.

Chapter 13 of Dalgaard (2008), Chapter 7 of Everitt and Hothorn (2010), Chapter 6 of Faraway (2006), Chapter 8 of Maindonald and Braun (2006), and Chapter 13 of Crawley (2007) are some good introductions to GLM with R.

17.10 Complements, Problems, and Programs

Problem 17.1 The irls function given in Section 17.5 needs to be modified for incorporating more than one covariate. Extend the program which can then be used for any logistic regression model.
Problem 17.2 Obtain the 90% confidence intervals for the logistic regression model chdglm, as discussed in Example 17.5.1. Also, carry out the deviance test to find out if the overall fitted model chdglm is a significant model. Validate the assumptions of the logistic regression model.
Problem 17.3 Suppose that you have a new observation $c17-math-0298$. Find details with ?predict.glm and use them for prediction purposes for any $c17-math-0299$ of your choice with chdglm.
Problem 17.4 The likelihood function for the logistic regression model is given in Equation 17.8. It may be tempting to write a function, say lik_Logistic, which is proportional to the likelihood function. However, optimize does not return the MLE! Write a program and check if you can obtain the MLE.
Problem 17.5 The residual plot technique extends to the probit regression and the reader should verify the same for the fitted probit regression models in Section 17.7.
Problem 17.6 Obtain the 99% confidence intervals for the lowprobit model in Example 17.7.2.
