Although logistic regression was partly covered in Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth), we will revisit the topic here, as it is often used to solve classification problems, with some related examples and a few notes on, for example, the multinomial version of logistic regression, which was not introduced in the previous chapters.
Our data often fails to meet the requirements of discriminant analysis. In such cases, logistic, logit, or probit regression can be a reasonable choice, as these methods are not sensitive to non-normal distributions or unequal variances within each group; on the other hand, they require much larger sample sizes. For small samples, discriminant analysis is much more reliable.
As a rule of thumb, you should have at least 50 observations for each independent variable, which means that, if we want to build a logistic regression model for the mtcars dataset as earlier, we would need at least 500 observations, but we have only 32.
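The arithmetic behind this rule of thumb can be checked directly. The following is a trivial sketch; the variable names are illustrative, and 50 observations per predictor is a heuristic rather than a hard limit:

```r
# mtcars has 11 columns; using am as the response leaves 10 candidate predictors
predictors <- ncol(mtcars) - 1
needed     <- 50 * predictors  # heuristic minimum sample size
c(needed = needed, available = nrow(mtcars))
# needed: 500, available: 32
```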
For this reason, we will restrict this section to a couple of quick examples of how to conduct a logit regression; for example, estimating whether a car has an automatic or a manual transmission based on the performance and weight of the automobile:
> lr <- glm(am ~ hp + wt, data = mtcars, family = binomial)
> summary(lr)

Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2537  -0.1568  -0.0168   0.1543   1.3449  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
hp           0.03626    0.01773   2.044  0.04091 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8
The most important part of the preceding output is the coefficients table, which shows whether the model and the independent variables contribute significantly to the value of the dependent variable. It seems that, despite (or rather due to) the low sample size, the model fits the data very well, and the horsepower and weight of the cars can explain whether a car has an automatic or a manual transmission:
> table(mtcars$am, round(predict(lr, type = 'response')))
   
     0  1
  0 18  1
  1  1 12
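Since the fitted coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to interpret. The following is a quick sketch based on the same model as above:

```r
lr <- glm(am ~ hp + wt, data = mtcars, family = binomial)
# Odds ratios: the multiplicative change in the odds of a manual
# transmission (am = 1) for a one-unit increase in each predictor
exp(coef(lr))
```

For instance, as wt is measured in units of 1,000 lbs, the odds ratio of roughly exp(-8.08) suggests that each additional 1,000 lbs of weight reduces the odds of a manual transmission dramatically, all else being equal.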
But running the preceding command on the number of gears instead of the transmission type would fail, as logit regression by default expects a dichotomous variable. We can overcome this by fitting multiple models on the data, such as separate models with dummy variables indicating whether a car has 3, 4, or 5 gears, or by fitting a multinomial logistic regression. The nnet package has a very convenient function to do so:
> library(nnet)
> (mlr <- multinom(factor(gear) ~ ., data = mtcars))
# weights:  36 (22 variable)
initial  value 35.155593 
iter  10 value 5.461542
iter  20 value 0.035178
iter  30 value 0.000631
final  value 0.000000 
converged
Call:
multinom(formula = factor(gear) ~ ., data = mtcars)

Coefficients:
  (Intercept)       mpg       cyl      disp         hp     drat
4  -12.282953 -1.332149 -10.29517 0.2115914 -1.7284924 15.30648
5    7.344934  4.934189 -38.21153 0.3972777 -0.3730133 45.33284
         wt        qsec        vs       am     carb
4 21.670472   0.1851711  26.46396 67.39928 45.79318
5 -4.126207 -11.3692290 -38.43033 32.15899 44.28841

Residual Deviance: 4.300374e-08 
AIC: 44
As expected, it returns a model that fits our small dataset extremely well:
> table(mtcars$gear, predict(mlr))
   
     3  4  5
  3 15  0  0
  4  0 12  0
  5  0  0  5
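Instead of hard class labels, we can also extract the predicted probability of each gear level by calling predict with type = 'probs'. Here is a minimal sketch; the trace = FALSE argument merely silences the iteration log seen above:

```r
library(nnet)
mlr <- multinom(factor(gear) ~ ., data = mtcars, trace = FALSE)
# One row per car, one column per gear level (3, 4, 5);
# each row of probabilities sums to 1
head(predict(mlr, type = 'probs'), 3)
```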
However, due to the small sample size, this model is extremely limited. Before proceeding to the next examples, please remove the updated mtcars dataset from the current R session to avoid unexpected errors:
> rm(mtcars)