Remember when I said, in the previous chapter, that a thorough understanding of linear models would pay enormous dividends throughout your career as an analyst? Well, I wasn't lying! This next classifier is the product of a generalization of linear regression that allows it to act as a classifier.
What if we used linear regression on a binary outcome variable, representing diabetes as 1 and not diabetes as 0? We know that the output of linear regression is a continuous prediction, but what if, instead of predicting the binary class (diabetes or not diabetes), we attempted to predict the probability of an observation having diabetes? So far, the idea is to train a linear regression on a training set where the variable we are trying to predict is dummy-coded as 0 or 1, and to interpret the predictions on an independent test set as a continuous probability of class membership.
It turns out this idea is not quite as crazy as it sounds—the predictions are indeed proportional to the probability of each observation's class membership. The biggest problem is that the predictions are only proportional to the class membership probability and can't be directly interpreted as true probabilities. The reason is simple: probability is, indeed, a continuous measurement, but it is also a constrained measurement—it is bounded by 0 and 1. With regular old linear regression, we will often get predicted outcomes below 0 and above 1, and it is unclear how to interpret those outcomes.
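If you want to see this for yourself, here is a minimal sketch of the dummy-coding idea; it is my own illustration (not part of the analysis we build below), it assumes the PID data frame from earlier in the chapter is already loaded, and the choice of glucose and pregnant as predictors is arbitrary:

> diabetes01 <- ifelse(PID$diabetes == "pos", 1, 0)  # dummy-code: "pos" -> 1, "neg" -> 0
> lp.model <- lm(diabetes01 ~ glucose + pregnant, data=PID)
> range(fitted(lp.model))  # the fitted "probabilities" are not guaranteed to stay between 0 and 1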
But what if we had a way of taking the outcome of a linear regression (a linear combination of beta coefficients and predictors) and applying a function to it that constrains it to be between 0 and 1 so that it can be interpreted as a proper probability? Luckily, we can do this with the logistic function:

y = 1 / (1 + e^(-x))

whose plot is depicted in Figure 9.6.
Note that no matter what value of x (the output of the linear regression) we use—from negative infinity to positive infinity—the y (the output of the logistic function) is always between 0 and 1. Now we can adapt linear regression to output probabilities!
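If you'd like to convince yourself of this (and reproduce something like Figure 9.6 on your own), the following quick sketch plots the logistic function in base R; the function name and the plotting range are just my own choices for illustration:

> logistic <- function(x) 1 / (1 + exp(-x))  # the logistic function
> curve(logistic, from=-6, to=6, ylab="y")   # the output is squashed into the interval between 0 and 1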
The function that we apply to the linear combination of predictors to change it into the kind of prediction we want is called the inverse link function. The function that transforms the dependent variable into a value that can be modeled using linear regression is just called the link function. In logistic regression, the link function (which is the inverse of the inverse link function, the logistic function) is called the logit function.
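As an aside, both of these functions are built into R as the quantile and distribution functions of the logistic distribution, so you can check that they really are inverses of each other; the value 0.75 below is just an arbitrary example:

> qlogis(0.75)          # the logit (link) function: log(0.75 / (1 - 0.75))
> plogis(qlogis(0.75))  # applying the logistic (inverse link) function undoes the logit, returning 0.75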
Before we get started using this powerful idea on our data, there are two other problems that we must contend with. The first is that we can't use ordinary least squares to solve for the coefficients anymore, because the link function is non-linear. Most statistical software solves this problem using a technique called Maximum Likelihood Estimation (MLE) instead, though there are other alternatives.
The second problem is that an assumption of linear regression (if you remember from last chapter) is that the error distribution is normally distributed. With a binary outcome, this assumption doesn't make sense, because the dependent variable is a binary categorical variable. So, logistic regression models the error distribution as a Bernoulli distribution (or a binomial distribution, depending on how you look at it).
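If you're curious what MLE looks like under the hood, here is a minimal sketch that fits a one-predictor logistic regression by directly minimizing the negative Bernoulli log-likelihood with optim; everything about it (using glucose as the sole predictor, the function names, relying on optim's default optimizer) is my own illustration, not something you need to do in practice:

> y <- as.numeric(PID$diabetes == "pos")  # 1 if diabetic, 0 if not
> X <- cbind(1, PID$glucose)              # design matrix: intercept and glucose
> neg.log.lik <- function(beta) {
+   eta <- X %*% beta                     # the linear predictor
+   sum(log(1 + exp(eta)) - y * eta)      # negative Bernoulli log-likelihood
+ }
> mle <- optim(c(0, 0), neg.log.lik)
> mle$par                                 # should be close to the glm() estimates below
> coef(glm(diabetes ~ glucose, data=PID, family=binomial(logit)))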
Generalized Linear Model (GLM)
If you are surprised that linear regression can be generalized enough to accommodate classification, prepare to be astonished by generalized linear models!
GLMs are a generalization of regular linear regression that allow for other link functions to map from the linear model output to the dependent variable, and other error distributions to describe the residuals. In logistic regression, the link function and error distribution are the logit and the binomial, respectively. In regular linear regression, the link function is the identity function (a function that returns its argument unchanged), and the error distribution is the normal distribution.
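You can verify this last point yourself: fitting a GLM with a Gaussian error distribution and an identity link gives the same coefficient estimates as ordinary linear regression. The example below is just my own illustration using R's built-in mtcars dataset:

> coef(lm(mpg ~ wt, data=mtcars))                              # ordinary linear regression
> coef(glm(mpg ~ wt, data=mtcars, family=gaussian(identity)))  # the same estimates, expressed as a GLM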
Besides regular linear regression and logistic regression, there are still other species of GLM that use other link functions and error distributions. Another common GLM is Poisson regression, a technique that is used to model count data (number of traffic stops, number of red cards, and so on), which uses the logarithm as the link function and the Poisson distribution as its error distribution. The use of the log link function constrains the response variable (the dependent variable) so that it is always above 0.
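In R, this looks just like the logistic regression call we're about to make, only with a different family; the toy data below (simulated counts of traffic stops by hour) is entirely made up for illustration:

> set.seed(1)
> stops <- data.frame(hour    = 1:24,
+                     n.stops = rpois(24, lambda=exp(1 + 0.05 * (1:24))))
> pois.model <- glm(n.stops ~ hour, data=stops, family=poisson(log))
> coef(pois.model)  # the estimates are on the log scale; exponentiate them to get multiplicative effects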
Remember that we expressed the t-test and ANOVA in terms of the linear model? So the GLM encompasses not only linear regression, logistic regression, Poisson regression, and the like, but it also encompasses t-tests, ANOVA, and the related technique called ANCOVA (Analysis of Covariance). Pretty cool, eh?!
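If you want a quick reminder of that equivalence, the snippet below (my own example, using the built-in mtcars data) shows that an equal-variance two-sample t-test and a linear regression on a dummy-coded grouping variable report the same p-value:

> t.test(mpg ~ am, data=mtcars, var.equal=TRUE)$p.value             # classical two-sample t-test
> summary(lm(mpg ~ am, data=mtcars))$coefficients["am", "Pr(>|t|)"] # same p-value from the linear model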
Performing logistic regression—an advanced and widely used classification method—could scarcely be easier in R. To fit a logistic regression, we use the familiar glm function. The difference now is that we'll be specifying our own error distribution and link function (the glm calls of last chapter assumed we wanted the regular linear regression error distribution and link function, by default). These are specified in the family argument:
> model <- glm(diabetes ~ ., data=PID, family=binomial(logit))
Here, we build a logistic regression using all available predictor variables.
You may also see logistic regressions being performed where the family argument looks like family="binomial" or family=binomial()—it's all the same thing; I just like being more explicit.
Let's look at the output from calling summary on the model.
> summary(model)

Call:
glm(formula = diabetes ~ ., family = binomial(logit), data = PID)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5566  -0.7274  -0.4159   0.7267   2.9297

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
pregnant     0.1231823  0.0320776   3.840 0.000123 ***
glucose      0.0351637  0.0037087   9.481  < 2e-16 ***
pressure    -0.0132955  0.0052336  -2.540 0.011072 *
...
The output is similar to that of regular linear regression; for example, we still get estimates of the coefficients and associated p-values. The interpretation of the beta coefficients requires a little more care this time around, though. The beta coefficient of pregnant, 0.123, means that a one unit increase in pregnant (an increase in the number of times being pregnant by one) is associated with an increase of 0.123 in the logarithm of the odds of the observation being diabetic. If this is confusing, concentrate on the fact that if the coefficient is positive, it has a positive impact on the probability of the binary outcome, and if the coefficient is negative, it has a negative impact on that probability. Whether positive means a higher probability of diabetes or a higher probability of not diabetes depends on how your binary dependent variable is dummy-coded.
To find the training set accuracy of our model, we can use the accuracy function we wrote in the last section (a minimal sketch of it is reproduced just after the next code block, in case you skipped that section). In order to use it correctly, though, we need to convert the probabilities into class labels, as follows:
> predictions <- round(predict(model, type="response"))
> predictions <- ifelse(predictions == 1, "pos", "neg")
> accuracy(predictions, PID$diabetes)
[1] 0.7825521
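(If you don't have the accuracy function handy, here is a minimal sketch of what it might look like; I'm assuming it simply returns the proportion of predictions that match the true labels.)

> accuracy <- function(predictions, answers) {
+   sum(predictions == answers) / length(answers)  # proportion of correct predictions
+ }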
Cool, we get a 78% accuracy on the training data, but remember: if we overfit, our training set accuracy will not be a reliable estimate of performance on an independent dataset. In order to test this model's generalizability, let's perform k-fold cross-validation, just like in the previous chapter!
> set.seed(3)
> library(boot)
> cv.err <- cv.glm(PID, model, K=5)
> cv.err$delta[2]
[1] 0.154716
> 1 - cv.err$delta[2]
[1] 0.845284
Wow, our CV-estimated accuracy rate is 85%! This indicates that it is highly unlikely that we are overfitting. If you are wondering why we used all available predictors after I said that doing so was dangerous business in the last chapter, it's because, although the extra predictors do make the model more complex, they didn't cause the model to overfit.
Finally, let's test the model on the independent test set so that we can compare this model's accuracy against k-NN's:
> predictions <- round(predict(model, type="response",
+                              newdata=test))
> predictions <- ifelse(predictions == 1, "pos", "neg")
> accuracy(predictions, test[,9])   # 78%
[1] 0.7792208
Nice! A 78% accuracy rate!
It looks like logistic regression may have given us a slight improvement over the more flexible k-NN. Additionally, the model gives us at least a little transparency into why each observation is classified the way it is—a luxury not available to us via k-NN.
Before we move on, it's important to discuss two limitations of logistic regression.