Assessing logistic regression models

The summary of the logistic regression model produced with the glm() function has a format similar to that of the linear regression model produced with the lm() function. It shows that for each of our categorical variables, we have one fewer binary feature than the number of levels in the original variable; for example, the three-valued THAL input feature produced two binary features, labeled THAL6 and THAL7. We'll begin by looking at the regression coefficients estimated by our model. These are presented with their corresponding z-statistic, which is analogous to the t-statistic that we saw in linear regression: the higher the absolute value of the z-statistic, the more likely it is that the feature is significantly related to our output variable. The p-values next to the z-statistics express this notion as a probability and are annotated with stars and dots, as they were in linear regression, indicating the smallest significance level threshold that the corresponding p-value falls below.

Because logistic regression models are trained with the maximum likelihood criterion, we use the standard normal distribution to perform significance tests on our coefficients. For example, to reproduce the p-value for the THAL7 feature, which corresponds to the listed z-value of 3.362, we can write the following (for a negative coefficient, we would set the lower.tail parameter to T instead):

> pnorm(3.362 , lower.tail = F) * 2
[1] 0.0007738012
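
The same computation can be wrapped in a small helper that handles coefficients of either sign by taking the absolute value of the z-statistic, so that lower.tail never needs to be changed. This is just a convenience sketch; the function name two_sided_p_value is our own:

 two_sided_p_value <- function(z) {
     # Two-sided p-value for a z-statistic under the standard normal
     pnorm(abs(z), lower.tail = F) * 2
 }

 two_sided_p_value(3.362)    # reproduces the p-value above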

Note

An excellent reference for learning about the essential concepts of distributions in statistics is All of Statistics, Larry Wasserman, Springer.

From the model summary, we see that FLUOR, CHESTPAIN4, and THAL7 are the strongest predictors of heart disease. A number of input features have relatively high p-values, indicating that they are probably not good indicators of heart disease in the presence of the other features. We'll stress once again the importance of interpreting this table correctly. The table does not say that age, for example, is not a good indicator of heart disease; rather, it says that in the presence of the other input features, age does not add much to the model. Furthermore, note that we almost certainly have some degree of collinearity in our features, as the regression coefficient of AGE is negative, whereas we would expect the likelihood of heart disease to increase with age. Of course, this expectation describes the marginal relationship between age and heart disease, in the absence of all other input features. Indeed, if we retrain a logistic regression model with only the AGE variable, we get a positive regression coefficient as well as a low p-value, both of which support our belief that the features are collinear:

> heart_model2 <- glm(OUTPUT ~ AGE, data = heart_train, family = binomial("logit"))
> summary(heart_model2)

Call:
glm(formula = OUTPUT ~ AGE, family = binomial("logit"), data = heart_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5027  -1.0691  -0.8435   1.2061   1.6759  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.71136    0.86348  -3.140  0.00169 **
AGE          0.04539    0.01552   2.925  0.00344 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 315.90  on 229  degrees of freedom
Residual deviance: 306.89  on 228  degrees of freedom
AIC: 310.89

Number of Fisher Scoring iterations: 4

Note that the AIC value of this simpler model is higher than the one we obtained with the full model, so we would expect the simpler model to perform worse.
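
Rather than reading the criterion off each summary, we can extract it directly with R's built-in AIC() function, which accepts fitted glm objects:

 AIC(heart_model2)    # 310.89, as reported in the summary above
 AIC(heart_model)     # lower for the full model, consistent with a better fit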

Model deviance

To understand the remainder of the model summary, we need to introduce an important concept known as deviance. In linear regression, our residuals were defined simply as the difference between the predicted value and the actual value of the output we are trying to predict. Logistic regression is trained using maximum likelihood, so it is natural to expect that an analogous concept to the residual would involve the likelihood. There are several closely related definitions of deviance; here, we will use the definitions that the glm() function uses in order to explain the model's output. The deviance of an observation is -2 times the log likelihood of that observation, and the deviance of a data set is the sum of the observation deviances. The deviance residual of an observation is derived from the deviance itself and is analogous to the residual of a linear regression. It can be computed as follows:

$$dr_i = \operatorname{sign}(y_i - \hat{y}_i)\sqrt{d_i}$$

For an observation $i$, $dr_i$ represents the deviance residual, $d_i$ represents the deviance, $y_i$ is the actual label, and $\hat{y}_i$ is the predicted probability. Note that squaring a deviance residual eliminates the sign function and produces just the deviance of the observation. Consequently, the sum of squared deviance residuals is the deviance of the data set, which is just the log likelihood of the data set scaled by the constant -2. Maximizing the log likelihood of the data is therefore equivalent to minimizing the sum of squared deviance residuals, so our analogy with linear regression is complete.
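
We can check this identity directly on our fitted model: the sum of squared deviance residuals, obtained with R's built-in residuals() function and type = "deviance", should equal the residual deviance reported in the model summary:

 sum(residuals(heart_model, type = "deviance") ^ 2)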

In order to reproduce the results that are shown in the model summary, and to understand how deviance is computed, we'll write some of our own functions in R. We'll begin by computing the log likelihood for our data set using the equation for the log likelihood that we saw earlier on in this chapter. From the equation, we'll create two functions. The log_likelihoods() function computes a vector of log likelihoods for all the observations in a data set, given the probabilities that the model predicts and the actual target labels, and dataset_log_likelihood() sums these up to produce the log likelihood of a data set:

 log_likelihoods <- function(y_labels, y_probs) {
     y_a <- as.numeric(y_labels)
     y_p <- as.numeric(y_probs)
     y_a * log(y_p) + (1 - y_a) * log(1 - y_p)
 }
 
 dataset_log_likelihood <- function(y_labels, y_probs) {
     sum(log_likelihoods(y_labels, y_probs))
 }
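
As a quick sanity check, the result of dataset_log_likelihood() can be compared against R's built-in logLik() function, which extracts the log likelihood directly from a fitted model. Assuming OUTPUT is coded as numeric 0/1 (as the functions above require) and that no rows were dropped during fitting, the two values should agree:

 dataset_log_likelihood(heart_train$OUTPUT,
     predict(heart_model, newdata = heart_train, type = "response"))

 logLik(heart_model)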

Next, we can use the definition of deviance to compute two analogous functions, deviances() and dataset_deviance(). The first of these computes a vector of observation deviances, and the second sums these up for the whole data set:

 deviances <- function(y_labels, y_probs) {
     -2 * log_likelihoods(y_labels, y_probs)
 }

 dataset_deviance <- function(y_labels, y_probs) {
     sum(deviances(y_labels, y_probs))
 }

Given these functions, we can now create a function that computes the deviance of a model. To do this, we use the predict() function to compute the model's probability predictions for the observations in the training data. This works just as it did with linear regression, except that by default, predict() returns predictions on the scale of the linear predictor, that is, the logit scale. To get actual probabilities, we need to specify the value response for the type parameter:

model_deviance <- function(model, data, output_column) {
  y_labels = data[[output_column]]
  y_probs = predict(model, newdata = data, type = "response")
  dataset_deviance(y_labels, y_probs)
}

To check whether our function is working, let's compute the model deviance, also known as the residual deviance, for our heart model:

> model_deviance(heart_model, data = heart_train, output_column = 
                 "OUTPUT")
[1] 140.3561
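
As a quick cross-check, R stores this quantity on the fitted model object, so the built-in deviance() function should return the same value:

 deviance(heart_model)    # also returns 140.3561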

Reassuringly, this is the same value as the one listed in our model summary. One way to evaluate a logistic regression model is to compute the difference between the model deviance and the deviance of the null model, which is the model trained without any features. The deviance of the null model is known as the null deviance. As it has no features, the null model predicts class 1 with a constant probability, which is estimated as the proportion of class 1 observations in the training data; we can obtain this by simply averaging the OUTPUT column:

 null_deviance <- function(data, output_column) {
     y_labels <- data[[output_column]]
     y_probs <- mean(data[[output_column]])
     dataset_deviance(y_labels, y_probs)
 }

> null_deviance(data = heart_train, output_column = "OUTPUT")
[1] 314.3811
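
Equivalently, we can fit the null model explicitly as an intercept-only glm() model; its residual deviance is, by construction, the null deviance we just computed. The name heart_null is our own, introduced only for this check:

 heart_null <- glm(OUTPUT ~ 1, data = heart_train,
                   family = binomial("logit"))
 deviance(heart_null)    # matches the null deviance above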

Once again, we see that we have reproduced the value that R computes for us in the model summary. The residual deviance and the null deviance are analogous to the Residual Sum of Squares (RSS) and the Total Sum of Squares (TSS) that we saw in linear regression. If the difference between the two is large, our model has explained away a substantial portion of the deviance in the output variable, just as a low RSS relative to the TSS indicates that a linear regression has explained away much of the variance. Continuing with this analogy, we can define a pseudo R2 value for our model using the same equation that we used to compute R2 for linear regression, but substituting in the deviances. We implement this in R as follows:

 model_pseudo_r_squared <- function(model, data, output_column) {
     1 - ( model_deviance(model, data, output_column) / 
           null_deviance(data, output_column) )
 }

> model_pseudo_r_squared(heart_model, data = heart_train, 
                         output_column = "OUTPUT")
[1] 0.5556977
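
Since the fitted model object stores both deviances, the same pseudo R2 can be computed as a one-line cross-check:

 1 - heart_model$deviance / heart_model$null.deviance    # 0.5556977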

Our logistic regression model is said to explain roughly 56 percent of the null deviance. This is not particularly high; most likely, we don't have a rich enough feature set to make accurate predictions with a logistic model. Unlike linear regression, it is possible for the pseudo R2 to be negative, but this only happens under problematic circumstances where the residual deviance exceeds the null deviance. If this happens, we should not trust the model, and instead proceed with feature selection methods or try out alternative models.

Besides the pseudo R2, we may also want a statistical test to check whether the difference between the null deviance and the residual deviance is significant. R does not perform such a test for us, as the absence of a p-value next to the residual deviance in the model summary shows. It turns out that the difference between the null and residual deviances asymptotically follows a χ2 (pronounced chi-squared) distribution. We'll define a function to compute a p-value for this difference, keeping in mind that this is only an approximation.

First, we need the difference between the null deviance and the residual deviance. We also need the degrees of freedom for this difference, computed by subtracting the degrees of freedom of our model from those of the null model. The null model has only an intercept, so its degrees of freedom are the total number of observations in our data set minus 1. For our model, we estimate a number of regression coefficients, including the intercept, so we need to subtract this number from the total number of observations. Finally, we use the pchisq() function to obtain a p-value, noting that we are performing an upper-tail test and hence need to set the lower.tail parameter to FALSE. The code is as follows:

model_chi_squared_p_value <- function(model, data, output_column) {
    null_df <- nrow(data) - 1
    model_df <- nrow(data) - length(model$coefficients)
    difference_df <- null_df - model_df
    null_dev <- null_deviance(data, output_column)
    model_dev <- model_deviance(model, data, output_column)
    difference_deviance <- null_dev - model_dev
    pchisq(difference_deviance, difference_df, lower.tail = F)
}

> model_chi_squared_p_value(heart_model, data = heart_train, 
                            output_column = "OUTPUT")
[1] 7.294219e-28
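
As a cross-check, the deviances and their degrees of freedom are stored on the fitted model object, so the same test can be written without our helper functions:

 pchisq(heart_model$null.deviance - heart_model$deviance,
        heart_model$df.null - heart_model$df.residual, lower.tail = F)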

The p-value that we obtain is tiny, so we can be confident that our model produces predictions that are significantly better than those of the null model. In our original model summary, we also saw a summary of the deviance residuals. Using the definition of the deviance residual that we gave earlier, we'll define a function to compute the vector of deviance residuals:

model_deviance_residuals <- function(model, data, output_column) {
    y_labels = data[[output_column]]
    y_probs = predict(model, newdata = data, type = "response")
    residual_sign = sign(y_labels - y_probs)
    residuals = sqrt(deviances(y_labels, y_probs))
    residual_sign * residuals
}

Finally, we can use the summary() function on the deviance residuals that we obtain with our model_deviance_residuals() function to obtain a table:

> summary(model_deviance_residuals(heart_model, data = 
          heart_train, output_column = "OUTPUT"))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.71400 -0.44210 -0.13820 -0.02765  0.35880  2.81200 
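
The same table can be produced with R's built-in residuals() function, which computes deviance residuals directly from the fitted model:

 summary(residuals(heart_model, type = "deviance"))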

Once again, we can verify that we obtain the correct result. Our model summary provides us with one final diagnostic, namely, the number of Fisher scoring iterations, which we have not yet discussed. This number is typically in the range of 4 to 8 and is a convergence diagnostic. If the optimization procedure that R uses to train the logistic model has not converged, we expect to see a high number here. In that case, our model is suspect and we may not be able to use it to make predictions. In our case, we are within the expected range.
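
The iteration count is also stored on the fitted model object, so it can be inspected without printing the whole summary:

 heart_model$iter    # a value far above 8 would suggest convergence problems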

Test set performance

We've seen how we can use the predict() function to compute the output of our model, which is the probability of an input belonging to class 1. To perform binary classification, we apply a threshold to this probability. We'll do this with both our training and test data and compare the resulting class predictions with the actual outputs to measure classification accuracy:

> train_predictions <- predict(heart_model, newdata = heart_train, 
                               type = "response")
> train_class_predictions <- as.numeric(train_predictions > 0.5)
> mean(train_class_predictions == heart_train$OUTPUT)
[1] 0.8869565
> test_predictions = predict(heart_model, newdata = heart_test, 
                             type = "response")
> test_class_predictions = as.numeric(test_predictions > 0.5)
> mean(test_class_predictions == heart_test$OUTPUT)
[1] 0.9
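
Raw accuracy hides which kinds of errors the model makes. As a quick follow-up, a confusion matrix built with the base table() function breaks the test set predictions down by actual and predicted class (the exact counts will depend on the train/test split):

 table(actual = heart_test$OUTPUT,
       predicted = test_class_predictions)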

The classification accuracies on the training and test sets are very similar and are close to 90 percent, which is a very good starting point for a modeler to work from. The coefficients table of our model showed us that several features did not seem to be significant, and we also saw a degree of collinearity, which means we could now proceed with variable selection. We could also look for more features, either by computing them from the existing ones or by obtaining additional data about our patients; the pseudo R2 computation, which showed that we did not explain enough of the deviance, also supports this.
