Logistic regression

Logistic regression was partly covered in Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth). Because it is often used to solve classification problems, we revisit the topic here with some related examples and a few notes on, for example, the multinomial version of logistic regression, which was not introduced in the previous chapters.

Our data often does not meet the requirements of discriminant analysis. In such cases, logistic (logit) or probit regression can be a reasonable choice, as these methods are not sensitive to non-normal distributions and unequal variances within each group; on the other hand, they require much larger sample sizes. For small samples, discriminant analysis is much more reliable.

As a rule of thumb, you should have at least 50 observations for each independent variable. If we wanted to build a logistic regression model on the 10 candidate predictors of the mtcars dataset, as earlier, we would need at least 500 observations, but we have only 32.
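The arithmetic behind this rule of thumb can be checked directly in R. The following is a minimal sketch; the names min_per_predictor, n_predictors, and n_required are ours, and it assumes that every column of mtcars except the response is used as a predictor:

```r
# Rule of thumb from the text: 50 observations per independent variable
min_per_predictor <- 50

# One column is the response, so the remaining ones are candidate predictors
n_predictors <- ncol(mtcars) - 1
n_required <- min_per_predictor * n_predictors

n_required    # observations the rule of thumb asks for
nrow(mtcars)  # observations actually available
```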

For this reason, we restrict this section to one or two quick examples of how to fit a logit regression, for example, to estimate whether a car has an automatic or a manual transmission based on the horsepower and weight of the automobile:

> lr <- glm(am ~ hp + wt, data = mtcars, family = binomial)
> summary(lr)

Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2537  -0.1568  -0.0168   0.1543   1.3449  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
hp           0.03626    0.01773   2.044  0.04091 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

The most important part of the preceding output is the coefficients table, which shows whether the independent variables contribute significantly to the value of the dependent variable. We can conclude that:

  • A 1-unit increase in horsepower increases the log odds of having a manual transmission by about 0.036 (at least back in 1974, when the data was collected)
  • A 1-unit increase in weight (measured in 1,000 lbs), on the other hand, decreases the same log odds by about 8
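Log odds are hard to interpret directly, so it can help to exponentiate the coefficients into odds ratios. The following is a minimal sketch that refits the model above so that it runs on its own (the name odds_ratios is ours):

```r
# Refit the model from above so that this snippet is self-contained
lr <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# exp() turns log-odds coefficients into multiplicative odds ratios
odds_ratios <- exp(coef(lr))
round(odds_ratios, 4)
```

An odds ratio above 1 (hp) means that the odds of a manual transmission grow with each additional unit, while an odds ratio below 1 (wt) means that they shrink.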

It seems that, despite (or rather due to) the small sample size, the model fits the data very well, and the horsepower and weight of a car largely explain whether it has an automatic or a manual transmission:

> table(mtcars$am, round(predict(lr, type = 'response')))
     0  1
  0 18  1
  1  1 12
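The table above can be summarized in a single number, the share of correctly classified cars. A minimal sketch, assuming a cutoff of 0.5 on the predicted probability (which is what round() does in effect) and using names of our own choosing:

```r
# Refit the model from above so that this snippet is self-contained
lr <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Classify a car as manual (1) when its predicted probability exceeds 0.5
predicted <- as.integer(predict(lr, type = "response") > 0.5)
confusion <- table(actual = mtcars$am, predicted = predicted)

# Share of cars on the diagonal, that is, correctly classified
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

With the counts shown in the table, this comes to 30 of 32 cars. Note that this is in-sample accuracy, measured on the same data the model was fitted on.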

Running the same glm call on the number of gears instead of the transmission would fail, however, as logit regression expects a dichotomous dependent variable by default. We can overcome this either by fitting multiple models on the data, each using a dummy variable to check whether a car has 3, 4, or 5 gears or not, or by fitting a multinomial logistic regression. The nnet package has a very convenient function to do so:

> library(nnet) 
> (mlr <- multinom(factor(gear) ~ ., data = mtcars)) 
# weights:  36 (22 variable)
initial  value 35.155593 
iter  10 value 5.461542
iter  20 value 0.035178
iter  30 value 0.000631
final  value 0.000000 
converged
Call:
multinom(formula = factor(gear) ~ ., data = mtcars)

Coefficients:
  (Intercept)       mpg       cyl      disp         hp     drat
4  -12.282953 -1.332149 -10.29517 0.2115914 -1.7284924 15.30648
5    7.344934  4.934189 -38.21153 0.3972777 -0.3730133 45.33284
         wt        qsec        vs       am     carb
4 21.670472   0.1851711  26.46396 67.39928 45.79318
5 -4.126207 -11.3692290 -38.43033 32.15899 44.28841

Residual Deviance: 4.300374e-08 
AIC: 44
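The AIC of 44 in the preceding output hints at why the fit is so tight. The following minimal sketch refits the same model with the iteration log suppressed and counts the estimated coefficients (the name n_params is ours):

```r
library(nnet)

# trace = FALSE suppresses the iteration log
mlr <- multinom(factor(gear) ~ ., data = mtcars, trace = FALSE)

# Two non-reference gear levels, each with an intercept and 10 slopes
n_params <- length(coef(mlr))
n_params
nrow(mtcars)

# AIC is the residual deviance plus twice the number of parameters
mlr$deviance + 2 * n_params
```

With a residual deviance that is practically zero, the AIC is essentially twice the number of parameters: the model estimates 22 coefficients from only 32 cars.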

As expected, the model fits our small dataset perfectly:

> table(mtcars$gear, predict(mlr))
     3  4  5
  3 15  0  0
  4  0 12  0
  5  0  0  5
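The other option mentioned earlier, fitting one binary model per number of gears, can be sketched as follows. This is only an illustration under our own assumptions: the choice of hp and wt as predictors, the names gears, ovr, probs, and pred, and the rule of picking the class with the highest predicted probability are not part of the original example, and glm may emit warnings on such a small sample:

```r
# The distinct classes to model: 3, 4, and 5 gears
gears <- sort(unique(mtcars$gear))

# One logit model per class: does the car have exactly g gears or not?
ovr <- lapply(gears, function(g) {
  glm(I(gear == g) ~ hp + wt, data = mtcars, family = binomial)
})
names(ovr) <- gears

# Matrix of fitted probabilities, one column per class
probs <- sapply(ovr, predict, type = "response")

# Assign each car to the class with the highest fitted probability
pred <- gears[max.col(probs)]
table(mtcars$gear, pred)
```

Unlike the multinomial model, the three fitted probabilities of a car are estimated independently here and need not sum to 1.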

However, due to the small sample size, the usefulness of such a model is extremely limited. Before proceeding to the next examples, please remove the updated mtcars dataset from the current R session to avoid unexpected errors:

> rm(mtcars)