Chapter 17
Generalized Linear Models

Package(s): gdata, RSADBE

Dataset(s): chdage, lowbwt, sat, Disease, BS, caesareans

17.1 Introduction

In Chapter 16 we discussed many useful statistical methods for analysis of categorical data, which may be nominal or ordinal data. The related regression problems were deliberately not touched upon there, the reason for omission being that the topic is more appropriate here. We will see in the next section that the linear regression methods of Chapter 12 are not appropriate for explaining the relationship between the regressors and discrete regressands. The statistical models, which are more suitable for addressing this problem, are known as the generalized linear models, which we abbreviate to GLM.

In this chapter, we will consider the three families of the GLM: logistic, probit, and log-linear models. The logistic regression model will be covered in more detail, and the applications of the others will be clearly brought out in the rest of this chapter.

We first begin with the problem of using the linear regression model for count/discrete data in Section 17.2. The exponential family continues to provide excellent theoretical properties for GLMs, and the relationship will be brought out in Section 17.3. The important logistic regression model will be introduced in Section 17.4. The statistical inference aspects of the logistic regression model are developed and illustrated in Section 17.5. As with the linear regression model, we will consider the related problem of model selection in Section 17.6. Probit and Poisson regression models are developed in Sections 17.7 and 17.8.

17.2 Regression Problems in Count/Discrete Data

By count/discrete data, we refer to the case where the regressand/output is discrete. That is, the output $Y$ takes values in the set $\{0, 1\}$, or $\{0, 1, 2, \ldots\}$. As in the case of the linear regression model, we have some explanatory variables which affect the output $Y$. We will immediately see the shortcomings of the approach of the linear regression model.

We will now digress a bit from applications and attempt to understand the general underlying theory of GLMs. The important members of the GLM family will be briefly discussed here.


17.3 Exponential Family and the GLM

The exponential family of distributions was introduced in Chapter 7. Chapter 3 of Dobson (2002) highlights the role of the exponential family for the GLMs considered in this chapter. Recall the definition of the exponential family from Chapter 7 as

$$f(y;\theta) = s(y)\, t(\theta)\, e^{a(y) b(\theta)},$$

where $s(y)$, $t(\theta)$, $a(y)$, and $b(\theta)$ are some known functions. This form of the exponential family may be rewritten as

$$f(y;\theta) = \exp\left\{ a(y) b(\theta) + c(\theta) + d(y) \right\},$$

where $c(\theta) = \log t(\theta)$ and $d(y) = \log s(y)$. If $a(y) = y$, the distribution is said to be in canonical form, and in this case $b(\theta)$ is called the natural parameter of the distribution. We had seen some members of the exponential family in Section 7.2.1. The dependency of GLM on the exponential family is now discussed.

Consider an independent sample $Y_1, Y_2, \ldots, Y_n$, with the following characteristics:

  • Each observation $Y_i$, $i = 1, \ldots, n$, has a distribution from the exponential family.
  • The distributions of the $Y_i$'s are of the same form, that is, all are either normal, or all binomial, etc.
  • The canonical form of $Y_i$ is specified by

    17.1 $$f(y_i;\theta_i) = \exp\left\{ y_i b(\theta_i) + c(\theta_i) + d(y_i) \right\}.$$

    This translates into the form that the joint probability density function of the random sample $Y_1, \ldots, Y_n$ is given by

    17.2 $$f(y_1, \ldots, y_n; \theta_1, \ldots, \theta_n) = \exp\left\{ \sum_{i=1}^n y_i b(\theta_i) + \sum_{i=1}^n c(\theta_i) + \sum_{i=1}^n d(y_i) \right\}.$$

In a GLM, we are interested in the estimation of the $p$ parameters $\beta = (\beta_1, \ldots, \beta_p)$. The trick is to consider some function of the mean $\mu_i = E(Y_i)$, say $g(\mu_i)$, and then allow the $n$ values $g(\mu_i)$ to vary as a linear function of the $p$ regression coefficients $\beta$. That is, we define $g(\mu_i)$ by

17.3 $$g(\mu_i) = x_i^T \beta, \quad i = 1, \ldots, n,$$

where $x_i$ is the covariate vector associated with $Y_i$. The function $g$ is called the link function of the GLM. Essential requirements of the function $g$ are that it be monotonic and differentiable. Table 17.1 gives a summary of important members of the GLM, and a small R sketch of the links follows the table. There will be more focus on logistic regression in this chapter.

Table 17.1 GLM and the Exponential Family

Probability Model    Name of Link Function   Link Function                    Mean Function
Normal               Identity                $g(\mu) = \mu$                   $\mu = x^T \beta$
Exponential/Gamma    Inverse                 $g(\mu) = \mu^{-1}$              $\mu = (x^T \beta)^{-1}$
Inverse Gaussian     Inverse Squared         $g(\mu) = \mu^{-2}$              $\mu = (x^T \beta)^{-1/2}$
Poisson              Log                     $g(\mu) = \log \mu$              $\mu = e^{x^T \beta}$
Binomial             Logit                   $g(\mu) = \log\{\mu/(1-\mu)\}$   $\mu = e^{x^T \beta}/(1 + e^{x^T \beta})$
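
In R, each of these link functions is available through a family object, whose linkfun and linkinv components implement $g$ and its inverse. A minimal sketch, with the numerical values chosen purely for illustration:

> binomial(link = "logit")$linkfun(0.5)  # logit(0.5) = 0
> poisson(link = "log")$linkinv(1)       # exp(1) = 2.718282
> Gamma(link = "inverse")$linkfun(2)     # 1/2 = 0.5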

17.4 The Logistic Regression Model

We saw in the previous section how the probability curve is a sigmoid curve. Now we will introduce the concepts in a more formal and mathematical way. We have found the paper of Czepiel, http://czep.net/stat/mlelr.pdf, to be pedagogically most appropriate, and this section is a liberal adaptation of the same. Let the binary outcomes be represented by $y_1, \ldots, y_n$, and the covariates associated with them be respectively $x_1, \ldots, x_n$. The covariate is assumed to be a vector of $k$ elements, that is, $x_i = (x_{i1}, \ldots, x_{ik})$. Without loss of generality, we can write the covariate vector as $x_i = (x_{i0}, x_{i1}, \ldots, x_{ik})$, with $x_{i0} = 1$. The probability of success is specified by $\pi(x_i) = P(y_i = 1 | x_i)$. The logistic regression model is then given by

17.4 $$\pi(x_i) = \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}},$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ is the vector of regression coefficients. The probability of a failure has the simple form:

$$1 - \pi(x_i) = \frac{1}{1 + e^{x_i^T \beta}}.$$

An important identity in the context of logistic regression is the form of the odds ratio, abbreviated and denoted as OR:

17.5 $$\mathrm{OR} = \frac{\pi(x_i)}{1 - \pi(x_i)} = e^{x_i^T \beta}.$$

Taking the logarithm on both sides of the above equation, we get the form of logistic regression model:

$$\log\left( \frac{\pi(x_i)}{1 - \pi(x_i)} \right) = x_i^T \beta.$$

The expression $\log\left\{ \pi(x_i)/(1 - \pi(x_i)) \right\}$ is known as the logit function, and since it is linear in the covariates, the logistic regression model, based on the logit function, is a particular class of the well-known generalized linear models. The logistic regression model is given by

17.6 $$y_i = \pi(x_i) + \epsilon_i,$$

where $\epsilon_i$ is the error term. Thus, if $y_i = 1$, the error is $\epsilon_i = 1 - \pi(x_i)$ with probability $\pi(x_i)$. Otherwise, the error is $\epsilon_i = -\pi(x_i)$ with probability $1 - \pi(x_i)$. That is,

17.7 $$\epsilon_i = \begin{cases} 1 - \pi(x_i), & \text{with probability } \pi(x_i), \\ -\pi(x_i), & \text{with probability } 1 - \pi(x_i). \end{cases}$$

Hence, the error $\epsilon_i$ has a distribution with mean $0$ and variance $\pi(x_i)(1 - \pi(x_i))$. The next section will deal with the inferential aspects of $\beta$.
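
As a quick numerical illustration of Equation 17.4, the success probability may be computed directly; the coefficient and covariate values below are arbitrary:

> pi_x <- function(beta, x) as.vector(exp(x %*% beta)/(1 + exp(x %*% beta)))
> pi_x(beta = c(-1, 0.5), x = matrix(c(1, 2), nrow = 1)) # here x^T beta = 0
[1] 0.5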

17.5 Inference for the Logistic Regression Model

17.5.1 Estimation of the Regression Coefficients and Related Parameters

The likelihood function based on $n$ observations is given by

17.8 $$L(\beta | y) = \prod_{i=1}^n \pi(x_i)^{y_i} \left( 1 - \pi(x_i) \right)^{1 - y_i}.$$

As in most cases, it is easier to work with the log-likelihood function:

17.9 $$\log L(\beta | y) = \sum_{i=1}^n \left[ y_i \log \pi(x_i) + (1 - y_i) \log\left( 1 - \pi(x_i) \right) \right].$$

Differentiating the above log-likelihood function with respect to $\beta$, we obtain the score function:

17.10 $$\frac{\partial \log L(\beta | y)}{\partial \beta} = \sum_{i=1}^n x_i \left( y_i - \pi(x_i) \right).$$

Let us denote the covariate matrix, as in earlier chapters, by $X$, the probability of success vector by $\pi = (\pi(x_1), \ldots, \pi(x_n))$, and the outcome vector by $y = (y_1, \ldots, y_n)$. The normal equation, obtained by equating the above equation to zero, for the logistic regression model is then given by

17.11 $$X^T (y - \pi) = 0.$$

The above equation looks similar to the normal equation of the linear model. However, Equation 17.11 cannot be solved directly, as the vector $\pi$ is a nonlinear function of the regression coefficients. We need the help of a different algorithm to obtain the estimates of the regression coefficients. This algorithm is generally known as the iteratively reweighted least squares, abbreviated as IRLS, algorithm. In the context of the logistic regression model, the IRLS algorithm is given below; see Fox, http://socserv.mcmaster.ca/jfox/Courses/UCLA/logistic-regression-notes.pdf.

  • Initialize the vector of regression coefficients at $\beta^{(0)} = 0$.
  • The $(t+1)^{th}$ improvement of the regression coefficients is given by

    17.12 $$\beta^{(t+1)} = \beta^{(t)} + \left( X^T W^{(t)} X \right)^{-1} X^T \left( y - \pi^{(t)} \right),$$

    where $W^{(t)}$ is a diagonal matrix with the $i^{th}$ diagonal element given by $\pi_i^{(t)} \left( 1 - \pi_i^{(t)} \right)$, and $\pi_i^{(t)} = 1/\left( 1 + e^{-x_i^T \beta^{(t)}} \right)$.

  • Repeat the above step until $\beta^{(t+1)}$ is close to $\beta^{(t)}$.
  • The asymptotic covariance matrix of the regression coefficients is given by $\left( X^T W X \right)^{-1}$; see the next sub-section.

The next example will clearly bring out the steps of the IRLS algorithm. The R function glm directly returns the estimates of the regression coefficients. However, the working of the IRLS algorithm does not become clear from it, and at first the reader may feel that IRLS is some kind of black box. It is to be understood that no software, R or otherwise, hides anything; the onus of a clear understanding of the software's functionality is on the reader, and sometimes the authors. For the sake of simplicity, we will focus on the simple GLM case of a single covariate only. The IRLS function is first given and discussed.

irls <- function(output, input) {
  input <- cbind(rep(1, length(output)), input) # column of 1's for the intercept
  bt <- rep(0, ncol(input))                     # initializing the regression coefficients
  probs <- as.vector(1/(1 + exp(-input %*% bt)))
  temp <- rep(1, ncol(input))                   # coefficients of the previous iteration
  while (sum((bt - temp)^2) > 0.0001) {
    temp <- bt
    # The IRLS update of Equation 17.12
    bt <- bt + as.vector(solve(t(input) %*% diag(probs * (1 - probs)) %*% input)
                         %*% (t(input) %*% (output - probs)))
    probs <- as.vector(1/(1 + exp(-input %*% bt)))
  }
  return(bt)
}

The irls R function defined here returns the estimates as required by the IRLS algorithm; in particular, the Equation 17.12 computations are handled here. Given the covariate values for the $n$ observations from a dataset, the first step is to take care of the intercept term, and thus we begin by inserting a column of 1's with input <- cbind(rep(1,length(output)),input). The initial estimate of the regression coefficients is set equal to 0, and hence the initial probabilities $\pi_i$ all equal 0.5; see bt and probs in the irls program. The regression coefficient vector of the previous iteration is stored in temp, and as long as the squared vector distance between the current and the previous iteration is greater than 1e-4, the iterations are carried out in the while loop. The update required by Equation 17.12 is provided by the R code solve(t(input)%*%diag(probs*(1-probs))%*%input)%*%(t(input) %*%(output-probs)). When the convergence criterion is met, the vector of regression coefficients is returned by return(bt) as the output. The irls function will be tested on the coronary heart disease problem, and the results will be further verified with the R glm function.
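
A minimal sketch of such a test run follows, assuming that the chdage data frame, with the age covariate AGE and the binary outcome CHD, has been loaded:

> irls(output = chdage$CHD, input = chdage$AGE)
> # Cross-check with the built-in IRLS fitter:
> coef(glm(CHD ~ AGE, data = chdage, family = binomial()))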

Estimates of the link function and odds ratio are straightforwardly obtained by plugging in the values of the estimated regression coefficients, that is,

$$\widehat{\mathrm{logit}}(x_i) = x_i^T \hat{\beta}, \qquad \widehat{\mathrm{OR}} = e^{x_i^T \hat{\beta}}.$$

We will next consider an example of multiple logistic regression, that is, when we have more than one covariate.

The estimated vector of regression coefficients $\hat{\beta}$ needs to be tested for significance. The first step towards this is to obtain an estimate of the variance-covariance matrix of $\hat{\beta}$.

17.5.2 Estimation of the Variance-Covariance Matrix of $\hat{\beta}$

We have seen earlier in Chapter 7 that the variance of the score function gives the Fisher information. Thus, to obtain the variance-covariance matrix of $\hat{\beta}$, we look at the second-order partial derivatives of the log-likelihood function. The technique of estimating the variance-covariance matrix of $\hat{\beta}$ follows from the theory of maximum likelihood estimation; see Rao (1973). Differentiating once more the score function of Subsection 17.5.1 (Equation 17.10), we get

$$\frac{\partial^2 \log L(\beta | y)}{\partial \beta_j \, \partial \beta_{j'}} = - \sum_{i=1}^n x_{ij} x_{ij'} \pi(x_i) \left( 1 - \pi(x_i) \right),$$

for $j, j' = 0, 1, \ldots, k$. The Fisher information for $\beta$, denoted by $I(\beta)$, consists of the negatives of the elements specified in the above equation. Adapting the results from Chapter 7, the inverse of the information matrix gives us the variance-covariance matrix of $\hat{\beta}$, that is,

17.13 $$\widehat{\mathrm{Var}}(\hat{\beta}) = I^{-1}(\hat{\beta}).$$

Thus, we specifically have the variance of the estimator of the regression coefficient $\beta_j$, denoted by $\widehat{\mathrm{Var}}(\hat{\beta}_j)$, as the $j^{th}$ diagonal element of $I^{-1}(\hat{\beta})$. Similarly, the covariance between the estimators of two regression coefficients, denoted by $\widehat{\mathrm{Cov}}(\hat{\beta}_j, \hat{\beta}_{j'})$, is the $(j, j')^{th}$ element of $I^{-1}(\hat{\beta})$.

In R, it is easy to obtain the variance-covariance matrix for a fitted logistic regression model. For the Low Birth-Weight example, we can obtain this using the cov.unscaled component of the summary object.
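
The fitted object lowglm used below is the logistic regression model for the low birth-weight data lowbwt; a plausible reconstruction of the fit, with the model formula assumed from the coefficient names in the output and RACE taken as a factor, is:

> lowglm <- glm(LOW ~ AGE + LWT + RACE + FTV, data = lowbwt, family = binomial())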

> lowglm_summary <- summary(lowglm)
> lowglm_summary$cov.unscaled
            (Intercept)     AGE     LWT   RACE2   RACE3     FTV
(Intercept)      1.1480 -0.0220 -0.0045 -0.0373 -0.1339  0.0054
AGE             -0.0220  0.0011  0.0000  0.0031  0.0013 -0.0010
LWT             -0.0045  0.0000  0.0000 -0.0008  0.0003 -0.0001
RACE2           -0.0373  0.0031 -0.0008  0.2479  0.0571 -0.0003
RACE3           -0.1339  0.0013  0.0003  0.0571  0.1312  0.0058
FTV              0.0054 -0.0010 -0.0001 -0.0003  0.0058  0.0280

17.5.3 Confidence Intervals and Hypotheses Testing for the Regression Coefficients

The natural hypothesis testing problem is inspection of the significance of the regressors, that is, $H_0: \beta_j = 0$ against $H_1: \beta_j \neq 0$, for $j = 0, 1, \ldots, k$. The Wald test statistic for $\beta_j$ is given by

17.14 $$Z_j = \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)}.$$

Under the hypothesis $H_0: \beta_j = 0$, the Wald statistic $Z_j$ (asymptotically) follows a standard normal distribution. Furthermore, the $100(1 - \alpha)\%$ confidence interval for $\beta_j$ is given by

17.15 $$\hat{\beta}_j \pm z_{\alpha/2} \, \widehat{\mathrm{SE}}(\hat{\beta}_j).$$

We will illustrate these concepts for the low birth-weight study problem.
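
In R, a brief sketch of these computations for the lowglm model, where confint.default returns the Wald intervals of Equation 17.15:

> lowglm_summary$coefficients # estimates, standard errors, z values, and p-values
> confint.default(lowglm)     # 95% Wald confidence intervals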

An overall model-level significance test needs to be in place. Towards this test, we first need to define the notion of deviance, and this topic will be taken up in Subsection 17.5.5. First, we define the various types of residuals for the logistic regression model along the lines of the linear model.


17.5.4 Residuals for the Logistic Regression Model

Recollect from the definition of residuals for the linear regression model, as defined in Section 12.5, that the residual is the difference between the actual and predicted $y$ values. Similarly, the residuals for the logistic regression model are defined by

17.16 $$e_i = y_i - \hat{\pi}(x_i), \quad i = 1, \ldots, n.$$

The residuals $e_i$ are sometimes called the response residuals. The different variants of the residuals which will be useful are defined next; see Chapter 7 of Tattar (2013). The hat matrix $H = X (X^T X)^{-1} X^T$ played an important role in defining the residuals for the linear regression model; recall Equation (12.37). A similar matrix will be required for the logistic regression model, and we have the problem here that the fitted value $\hat{\pi}(x_i)$ is not a straightforward linear function of the regression coefficients. A linear approximation for the fitted values, which gives a similar hat matrix for the logistic regression model, has been proposed by Pregibon (1981); see Section 5.3 of Hosmer and Lemeshow (1990–2013). The hat matrix for the logistic regression model will be denoted by $H_L$, the subscript $L$ denoting logistic regression and not linear regression, and is defined by

17.17 $$H_L = W^{1/2} X \left( X^T W X \right)^{-1} X^T W^{1/2},$$

where $W$ is a diagonal matrix with the $i^{th}$ diagonal element

$$w_{ii} = \hat{\pi}(x_i) \left( 1 - \hat{\pi}(x_i) \right).$$

The diagonal elements, the hat values $h_{ii}$, or simply $h_i$, are given by

17.18 $$h_i = \hat{\pi}(x_i) \left( 1 - \hat{\pi}(x_i) \right) x_i^T \left( X^T W X \right)^{-1} x_i,$$

with $\hat{\pi}(x_i)(1 - \hat{\pi}(x_i))$ capturing the model-based estimator of the variance of $y_i$, and $x_i^T (X^T W X)^{-1} x_i$ computing the weighted distance of $x_i$ from the average of the covariate design matrix $X$. Similar to the linear regression model case, the diagonal element $h_i$ is useful for obtaining the variance of the residual:

17.19 $$\mathrm{Var}(e_i) = \hat{\pi}(x_i) \left( 1 - \hat{\pi}(x_i) \right) (1 - h_i).$$

The Pearson residual for the logistic regression model is then defined by

17.20 $$r_i = \frac{y_i - \hat{\pi}(x_i)}{\sqrt{\hat{\pi}(x_i) \left( 1 - \hat{\pi}(x_i) \right)}}.$$

The standardized Pearson residual is defined by

17.21 $$r_{si} = \frac{r_i}{\sqrt{1 - h_i}}.$$

The deviance residual for the logistic regression model is defined as the signed square root of the contribution of the observation to the model deviance and is given by

17.22 $$d_i = \mathrm{sign}\left( y_i - \hat{\pi}(x_i) \right) \sqrt{-2 \left[ y_i \log \hat{\pi}(x_i) + (1 - y_i) \log\left( 1 - \hat{\pi}(x_i) \right) \right]}.$$

The residuals are obtained next for a disease outbreak dataset, which is adapted from Kutner et al. (2005).
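
In R, the residual variants of a fitted glm object are available through the residuals function, and the hat values through hatvalues; a brief sketch, with disglm denoting an assumed logistic fit to the Disease data:

> residuals(disglm, type = "response") # Equation 17.16
> residuals(disglm, type = "pearson")  # Equation 17.20
> residuals(disglm, type = "deviance") # Equation 17.22
> hatvalues(disglm)                    # the hat values of Equation 17.18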

Recollect from Equation 17.7 that the mean and variance of the error term are respectively $0$ and $\pi(x_i)(1 - \pi(x_i))$. Thus, the LOESS plot, refer to Section 8.4, of the errors against the fitted values should reflect a line around 0 if the assumptions of the logistic regression model are appropriate. This perspective will be explored as a continuation of the previous example.
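
A sketch of such a diagnostic plot, continuing with the assumed disglm object:

> res <- residuals(disglm, type = "response")
> plot(fitted(disglm), res)
> lines(lowess(fitted(disglm), res), col = "red") # the LOESS smoother
> abline(h = 0, lty = 2) # the smoother should hover around this line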

The residuals have more importance than validation of the model assumptions. The overall fit of the model will be investigated next.

17.5.5 Deviance Test and Hosmer-Lemeshow Goodness-of-Fit Test

Testing for the significance of the GLM is carried out using the deviance statistic, denoted by $D$, by comparing the likelihood of the fitted model, which includes the covariates, with the likelihood of the saturated model. A saturated model is the model which takes into consideration all possible parameters, and thus results in a perfect fit in the sense that all successful outcomes are predicted as successes and all failures as failures. The deviance statistic $D$ is then defined by

$$D = -2 \log \left[ \frac{L(\text{fitted model})}{L(\text{saturated model})} \right].$$

Since the saturated model, by definition, leads to a perfect fit, we have $L(\text{saturated model}) = 1$, and thus

$$D = -2 \log L(\text{fitted model}).$$

Thus, the deviance statistic becomes

$$D = -2 \sum_{i=1}^n \left[ y_i \log \hat{\pi}(x_i) + (1 - y_i) \log\left( 1 - \hat{\pi}(x_i) \right) \right].$$

To know the significance of the set of independent variables, we need to compare the value of $D$ with and without the set of independent variables/covariates. That is,

17.23 $$G = D(\text{model without the variables}) - D(\text{model with the variables}).$$

We further see that

17.24 $$G = -2 \log \left[ \frac{(n_1/n)^{n_1} (n_0/n)^{n_0}}{\prod_{i=1}^n \hat{\pi}(x_i)^{y_i} \left( 1 - \hat{\pi}(x_i) \right)^{1 - y_i}} \right],$$

where $n_1 = \sum_{i=1}^n y_i$ and $n_0 = \sum_{i=1}^n (1 - y_i)$. Basically, we are substituting the MLEs $n_1/n$ and $n_0/n$ of $\pi$ and $1 - \pi$ for the model without the set of independent variables. Under the hypothesis that none of the regression coefficients is significant, the test statistic $G$ follows a $\chi^2$-distribution with $k$ degrees of freedom. With an appropriate significance level, it is possible to infer the model significance using the $G$ statistic. This $G$ statistic may be seen as the parallel of the $F$ statistic in the linear model case. For more details, refer to Hosmer and Lemeshow (2000).
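
In R, the statistic $G$ is readily computed from the components of the fitted glm object; a small sketch with the lowglm model fitted earlier:

> G <- lowglm$null.deviance - lowglm$deviance # Equation 17.23
> df <- lowglm$df.null - lowglm$df.residual
> 1 - pchisq(G, df) # p-value for the overall model significance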

17.6 Model Selection in Logistic Regression Models

We consider “Stepwise Logistic Regression” for the best variable selection in a logistic regression model. The most important variable is the one that produces the maximum change in the log-likelihood relative to a model containing no variables; that is, the variable with the maximum value of the $G$-statistic is considered the most important one. The complete process of stepwise logistic regression is given in steps 0 through S below.

Step 0

  • Consider all plausible variables, say $x_1, x_2, \ldots, x_k$.
  • Fit the “intercept only model”, and evaluate the log-likelihood $L_0$.
  • Fit each of the $k$ possible univariate logistic regression models and note the log-likelihood values $L_j^{(0)}$, $j = 1, \ldots, k$. Furthermore, calculate the $G$-statistic for each of the $k$ models, $G_j^{(0)} = 2\left( L_j^{(0)} - L_0 \right)$.
  • Obtain the $p$-value for each model.
  • The most important variable is then the one with the least $p$-value, $p_{e_1}^{(0)} = \min_j p_j^{(0)}$.
  • Denote the most important variable by $x_{e_1}$.
  • Define the entry criterion $p$-value as $p_E$, which will at any time throughout this procedure decide if a variable is to be included or not. That is, the variable with the least $p$-value is selected in the model only if that $p$-value is less than $p_E$.
  • If none of the variables has a $p$-value less than $p_E$, we stop.

Step 1. The Forward Selection Step

  • Replace the log-likelihood $L_0$ of the previous step with $L_{e_1}$, the log-likelihood of the model containing $x_{e_1}$.
  • Fit the $k - 1$ models containing the variables $x_{e_1}$ and $x_j$, for $j = 1, \ldots, k$ and $j$ distinct from $e_1$.
  • For each of the $k - 1$ models, calculate the log-likelihood $L_{e_1 j}^{(1)}$, the $G$-statistic $G_j^{(1)} = 2\left( L_{e_1 j}^{(1)} - L_{e_1} \right)$, and the corresponding $p$-value, denoted $p_j^{(1)}$.
  • Define $p_{e_2}^{(1)} = \min_j p_j^{(1)}$, and let $x_{e_2}$ be the associated variable.
  • If $p_{e_2}^{(1)} > p_E$, stop.

Step 2A. The Backward Elimination Step

  • Adding $x_{e_2}$ may leave $x_{e_1}$ statistically insignificant.
  • Let $L_{-e_j}^{(2)}$ denote the log-likelihood of the model with variable $x_{e_j}$ removed.
  • Calculate the likelihood-ratio tests of these reduced models with respect to the full model at the beginning of this step, $G_{-e_j}^{(2)} = 2\left( L_{e_1 e_2} - L_{-e_j}^{(2)} \right)$, and calculate the $p$-values $p_{-e_j}^{(2)}$.
  • The deleted variable must be the one resulting in the maximum $p$-value for the modified model.
  • Denote by $x_{r_2}$ the variable which is to be removed, and define $p_{r_2}^{(2)} = \max_j p_{-e_j}^{(2)}$.
  • To remove variables, we need to have a value $p_R$ with respect to which we compare $p_{r_2}^{(2)}$.
  • We need to have $p_R > p_E$. (Why?)
  • The variable $x_{r_2}$ is removed if $p_{r_2}^{(2)} > p_R$.

Step 2B. The Forward Selection Phase

  • Continue the forward selection method with the $k - 2$ remaining variables and find the least $p$-value $p_{e_3}$.
  • Let $x_{e_3}$ be the variable associated with $p_{e_3}$.
  • If $p_{e_3}$ is less than $p_E$, proceed to Step 3, otherwise stop.

Step 3

  • Fit the model including the variable selected in the previous step, and perform the backward elimination step followed by the forward selection phase.
  • Repeat until the last Step S.

Step S

  • Stopping happens if all the variables have been selected in the model.
  • It also happens if all the variables in the model have $p$-values less than $p_R$, and the remaining variables have $p$-values exceeding $p_E$.

The step function in R may be used to build a model. The criterion used there is the Akaike Information Criterion, AIC. To the best of our knowledge, there is no package/function which implements stepwise logistic regression using the $p$-values generated by the $G$-statistic. The Hosmer and Lemeshow (2000) approach of model selection will be used here.

We will next illustrate the concept of the backward selection method using the AIC criterion.
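
A minimal sketch with the step function, reusing the lowglm object; the iteration trace is suppressed for brevity:

> lowglm_back <- step(lowglm, direction = "backward", trace = 0)
> summary(lowglm_back)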

An alternative to the binary regression problem, as discussed here with the logistic regression model, is given by the probit model and we will discuss this in the next section.


17.7 Probit Regression

Bliss (1935) proposed the use of the probit regression model for problems where the response variable is binary. Recall that in the logistic regression model we used the logit link function to ensure that the predicted probabilities always lie in the unit interval. Bliss suggested modeling the link function using the normal cumulative distribution function.

Let $Y$ denote the binary outcome as earlier, and let the covariates be represented by $x$. The probit regression model is constructed through the use of an auxiliary RV denoted by $Y^*$; see Chapter 7 of Tattar (2013). The auxiliary RV $Y^*$ is then modeled as the multiple linear regression model:

$$Y^* = x^T \beta + \epsilon,$$

where the error $\epsilon$ follows the normal distribution $N(0, \sigma^2)$. Without loss of generality, we assume $\sigma = 1$. The vectors $x$ and $\beta$ have their usual meanings. The probit regression model is then developed as a latent variable model through the use of the auxiliary RV:

17.25 $$Y = \begin{cases} 1, & \text{if } Y^* > 0, \\ 0, & \text{otherwise}. \end{cases}$$

The probit model is then given by

17.26 $$P(Y = 1 | x) = P(Y^* > 0) = \Phi\left( x^T \beta \right),$$

where $\Phi$ denotes the cumulative distribution function of a standard normal variable. If we have $n$ pairs of observations $(y_i, x_i)$, $i = 1, \ldots, n$, the statistical inference for $\beta$ may be carried out through the likelihood function given by

$$L(\beta | y) = \prod_{i=1}^n \left[ \Phi(x_i^T \beta) \right]^{y_i} \left[ 1 - \Phi(x_i^T \beta) \right]^{1 - y_i}.$$

It is obvious that this likelihood function is difficult to evaluate analytically. A probit regression model is set up in R, again using the glm function, with the option binomial(link = "probit"). Since the fitted probit model is again a member of the glm class, the techniques available and discussed for the logistic regression model are also available for the fitted probit model. For the sake of brevity, the mathematical details of the probit model will be skipped. First, we consider the simple probit model, followed by an example of the multiple probit model.
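
A minimal sketch of a probit fit, with the model formula assumed by reusing the low birth-weight variables; the name lowprobit merely anticipates the later example:

> lowprobit <- glm(LOW ~ AGE + LWT, data = lowbwt, family = binomial(link = "probit"))
> summary(lowprobit)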

A lot of similarity exists between logistic regression and probit regression regarding model building in R, since both are generated using the glm function. In particular, the computational and inference methods remain almost the same. We will close this section with an illustration of the step-wise regression method for the probit regression model.

The next section will deal with a different type of discrete variable.


17.8 Poisson Regression Model

The logistic and probit regression models are useful when the discrete output is a binary variable. By an extension, we can incorporate multinomial variables too. However, these variables are class indicators, or nominal variables. In Section 16.5, the role of the Poisson RV was briefly indicated in the context of discrete data. If the discrete output is of a quantitative type and there are covariates related to it, we need to use the Poisson regression model.

The Poisson regression models are useful in two slightly different contexts: (i) the events arising as a percentage of the exposures, along with covariates which may be continuous or categorical, and (ii) the exposure effect is constant with the covariates being categorical variables only. In the first type of modeling, the parameter (rate $\lambda$) of the Poisson RV is specified in terms of the units of exposure. As an example, the quantitative variable may refer to the number of accidents per thousand vehicles arriving at a traffic signal, or the number of successful sales per thousand visitors on a web page. In the second case, the regressand $Y$ may refer to the count in each cell of a contingency table. The Poisson model in this case is popularly referred to as the log-linear model. For a detailed treatment of this model, refer to Chapter 9 of Dobson (2002).

Let $Y_i$ denote the number of responses out of the $n_i$ events for the $i^{th}$ exposure, $i = 1, \ldots, n$, and let $x_i$ represent the vector of explanatory variables. The Poisson regression model is stated as

17.27 $$E(Y_i | x_i) = \mu_i = n_i e^{x_i^T \beta}.$$

Here, the natural link function is the logarithmic function:

$$\log \mu_i = \log n_i + x_i^T \beta.$$

The extra term $\log n_i$ that appears in the link function is referred to as the offset term. It is important to clearly specify here the form of the pmf of $Y_i$:

$$f(y_i; \mu_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots$$

The likelihood function is then given by

17.28 $$L(\beta | y) = \prod_{i=1}^n \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}.$$

The ML technique of obtaining $\hat{\beta}$ is again based on the IRLS algorithm. However, this aspect will not be dealt with here. It is assumed that the ML estimate $\hat{\beta}$ is available, as returned by, say, the R software, and using it we will then look at other aspects of the model. The fitted values from the Poisson regression model are given by

17.29 $$\hat{y}_i = \hat{\mu}_i = n_i e^{x_i^T \hat{\beta}},$$

and hence the residuals are

17.30 $$e_i = y_i - \hat{y}_i.$$

Note that the mean and variance of the Poisson model are equal, and hence an estimate of the variance of the residual is $\hat{y}_i$, which leads to the Pearson residuals:

17.31 $$r_i = \frac{y_i - \hat{y}_i}{\sqrt{\hat{y}_i}}.$$

A $\chi^2$ goodness-of-fit test statistic is then given by

17.32 $$\chi^2 = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n \frac{\left( y_i - \hat{y}_i \right)^2}{\hat{y}_i}.$$

These concepts will be demonstrated through the next example.
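
As a quick sketch of Equations 17.31 and 17.32 in R, let psglm denote an assumed Poisson fit, say psglm <- glm(y ~ z + offset(log(n)), data = cs, family = poisson()), for a hypothetical data frame cs with counts y, exposures n, and an indicator covariate z:

> pres <- residuals(psglm, type = "pearson") # Equation 17.31
> sum(pres^2)                                # the chi-square statistic of Equation 17.32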

For a binary regressor, we can obtain a precise answer for the influence of the variable on the rate of the Poisson model through the concept of the rate ratio. The rate ratio, denoted by RR, gives the impact of the variable on the rate by looking at the ratio of the expected value of $Y$ when the variable is present to the expected value when it is absent:

$$\mathrm{RR} = \frac{E(Y | \text{variable present})}{E(Y | \text{variable absent})} = e^{\beta_j},$$

where $\beta_j$ is the regression coefficient of the indicator variable.

If the RR value is close to unity, it implies that the indicator variable has no influence on $Y$. The confidence intervals, based on the invariance principle of the MLE, are obtained with a simple exponentiation of the confidence intervals for the estimated regression coefficient $\hat{\beta}_j$.
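
A short sketch of the rate ratio and its confidence interval, continuing with the hypothetical psglm fit and indicator variable z from above:

> exp(coef(psglm)["z"])       # the estimated rate ratio RR
> exp(confint.default(psglm)) # CIs by exponentiating the Wald intervals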


17.9 Further Reading

Agresti (2007) is a very nice introduction for a first course in CDA. Simonoff (2003) has a different approach to CDA and is addressed to a more advanced audience. Johnson and Albert (1999) is a dedicated text for the analysis of ordinal data from a Bayesian perspective. Congdon (2005) is a more exclusive account of Bayesian methods for CDA. Graphics for CDA are different in nature, as the scale properties are not necessarily preserved here. Specialized graphical methods have been developed in Friendly (2000). Though Friendly's book uses SAS as the sole software for statistical analysis and graphics, it is a very good account of the ideas and thought processes for CDA. Blasius and Greenacre (1998), Chen et al. (2008), and Unwin et al. (2005) are advanced-level edited collections of research papers with a dedicated emphasis on graphical and visualization methods for data.

McCullagh and Nelder (1983–9) is among the first set of classics which deal with the GLM. Kleinbaum and Klein (2002–10), Dobson (1990–2002), Christensen (1997), and Lindsey (1997) are a handful of very good introductory books in this field. Hosmer and Lemeshow (1990–2013) forms the spirit of this chapter for details on logistic regression. GLMs have evolved in more depth than we can possibly address here; though we cannot deal with all of these developments, we indicate the other directions of this field. Bayesian methods are the natural alternative school for considering the inference methods in GLMs. Dey, Ghosh, and Mallick (2000) is a very good collection of peer-reviewed articles for Bayesian methods in the domain of GLM. Agresti (2002, 2007) and Congdon (2005) also deal with Bayesian analysis of GLMs.

Chapter 13 of Dalgaard (2008), Chapter 7 of Everitt and Hothorn (2010), Chapter 6 of Faraway (2006), Chapter 8 of Maindonald and Braun (2006), and Chapter 13 of Crawley (2007) are some good introductions to GLM with R.

17.10 Complements, Problems, and Programs

  1. Problem 17.1 The irls function given in Section 17.5 needs to be modified to incorporate more than one covariate. Extend the program so that it can be used for any logistic regression model.

  2. Problem 17.2 Obtain the 90% confidence intervals for the logistic regression model chdglm, as discussed in Example 17.5.1. Also, carry out the deviance test to find whether the overall fitted model chdglm is significant. Validate the assumptions of the logistic regression model.

  3. Problem 17.3 Suppose that you have a new observation $x_{\text{new}}$. Find details with ?predict.glm and use them for prediction purposes for any $x_{\text{new}}$ of your choice with chdglm.

  4. Problem 17.4 The likelihood function for the logistic regression model is given in Equation 17.8. It may be tempting to write a function, say lik_Logistic, which is proportional to the likelihood function. However, optimize does not return the MLE! Write a program and check if you can obtain the MLE.

  5. Problem 17.5 The residual plot technique extends to probit regression, and the reader should verify the same for the fitted probit regression models in Section 17.7.

  6. Problem 17.6 Obtain the 99% confidence intervals for the lowprobit model in Example 17.8.2.
