Extensions of the binary logistic classifier

So far, the focus of this chapter has been on the binary classification task, where we have two classes. We'll now turn to the problem of multiclass prediction. In Chapter 1, Gearing Up for Predictive Modeling, we studied the iris data set, where the goal is to distinguish between three different species of iris based on features that describe the external appearance of iris flower samples. Before presenting additional examples of multiclass problems, we'll state an important caveat: several other methods for classification that we will study in this book, such as neural networks and decision trees, are both more natural and more commonly used than logistic regression for classification problems involving more than two classes. With that in mind, we'll turn to multinomial logistic regression, our first extension of the binary logistic classifier.

Multinomial logistic regression

Suppose our target variable comprises K classes. For example, in the iris data set, K = 3. Multinomial logistic regression tackles the multiclass problem by fitting K-1 independent binary logistic classifiers. This is done by arbitrarily choosing one of the output classes as a reference class and fitting K-1 regression models that compare each of the remaining classes to it. For example, if we have two features, X1 and X2, and three classes, which we could call 0, 1, and 2, we construct the following two models:

$$\log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_{10} + \beta_{11} X_1 + \beta_{12} X_2$$

$$\log\left(\frac{P(Y=2)}{P(Y=0)}\right) = \beta_{20} + \beta_{21} X_1 + \beta_{22} X_2$$

Here we used class 0 as the baseline and built two binary regression models. In the first, we compare class 1 against class 0 and in the second, we compare class 2 against class 0. Note that because we now have more than one binary regression model, our model coefficients have two subscripts. The first subscript identifies the model and the second subscript pairs the coefficient with a feature. For example, β12 is the coefficient of feature X2 in the first model. We can write a general expression for the probability that our combined model predicts class k when there are K classes in total, numbered from 0 to K-1, and class 0 is chosen as the reference class:

$$P(Y=k) = \frac{e^{\beta_{k0} + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p}}{1 + \sum_{j=1}^{K-1} e^{\beta_{j0} + \beta_{j1} X_1 + \cdots + \beta_{jp} X_p}}, \qquad k = 1, \ldots, K-1$$

$$P(Y=0) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\beta_{j0} + \beta_{j1} X_1 + \cdots + \beta_{jp} X_p}}$$

The reader should verify that the sum of all the output class probabilities is 1, as required. This particular mathematical form, an exponential divided by a sum of exponentials, is known as the softmax function. For our three-class problem discussed previously, we simply substitute K = 3 in the preceding equations.
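As a quick numerical check, here is a small R sketch for the three-class, two-feature example; the coefficient values are made up purely for illustration:

> b1 <- c(0.5, 1.2, -0.7)   # made-up intercept and coefficients for class 1 vs 0
> b2 <- c(-0.3, 0.4, 0.9)   # made-up intercept and coefficients for class 2 vs 0
> x <- c(1, 2.0, 1.5)       # a feature vector, with a leading 1 for the intercept
> exps <- exp(c(sum(b1 * x), sum(b2 * x)))
> probs <- c(1, exps) / (1 + sum(exps))   # probabilities of classes 0, 1, and 2
> sum(probs)                # equals 1, as required

At this point, we should mention some important characteristics of this approach.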

To begin with, we are training one fewer model than the total number of classes in our output variable, and as a result, it should be easy to see that this approach does not scale very well when we have a large number of output classes from which to choose. The fact that we are building and training so many models also means that we tend to need a much larger data set to produce results with reasonable accuracy. Finally, as we independently compare each output class to a reference class, we make an assumption, known as the Independence of Irrelevant Alternatives (IIA) assumption.

The IIA assumption, in a nutshell, states that the odds of predicting one particular output class over another do not change when we add new classes to the set of possible output classes. To illustrate this, suppose for simplicity that we model our iris data set using multinomial logistic regression, and the predicted probabilities of the output classes are 0.33 : 0.33 : 0.33 for the three different species, so that every species is in a 1 : 1 odds ratio with every other species. The IIA assumption states that if we refit the model after including samples of a new type of iris, for example, ensata (the Japanese iris), the odds ratios between the previous three iris species are maintained. A new distribution of 0.2 : 0.2 : 0.2 : 0.4 across the four species (where the 0.4 corresponds to species ensata) would be valid, for example, because the 1 : 1 ratios between the old three species are maintained.
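The following sketch simply checks the arithmetic of this example, confirming that the pairwise ratios between the original three species are unchanged after ensata is added:

> before <- c(0.33, 0.33, 0.33)    # distribution over the original three species
> after <- c(0.2, 0.2, 0.2, 0.4)   # distribution after adding ensata
> before[1] / before[2]            # 1, as every pair is in a 1 : 1 ratio
> after[1] / after[2]              # still 1, so the IIA assumption is respected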

Predicting glass type

In this section, we'll demonstrate how we can train multinomial logistic regression models in R by way of an example data set. The data we'll examine comes from the field of forensic science. Here, our goal is to examine the properties of glass fragments found at crime scenes and predict the source of these fragments, for example, headlamps. The glass identification data set is hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Glass+Identification. We'll first load the data into a data frame, rename the columns using information from the website, and discard the first column, which is a unique identifier for each sample; this identifier has been arbitrarily assigned and is not needed by our model:

> glass <- read.csv("glass.data", header = FALSE)
> names(glass) <- c("id","RI","Na", "Mg", "Al", "Si", "K", "Ca", 
                    "Ba", "Fe", "Type")
> glass <- glass[,-1]

Next, we'll look at a table showing what each column in our data frame represents.

| Column name | Type | Definition |
|-------------|------|------------|
| RI | Numerical | Refractive index |
| Na | Numerical | Percentage of Sodium Oxide by weight |
| Mg | Numerical | Percentage of Magnesium Oxide by weight |
| Al | Numerical | Percentage of Aluminium Oxide by weight |
| Si | Numerical | Percentage of Silicon Oxide by weight |
| K | Numerical | Percentage of Potassium Oxide by weight |
| Ca | Numerical | Percentage of Calcium Oxide by weight |
| Ba | Numerical | Percentage of Barium Oxide by weight |
| Fe | Numerical | Percentage of Iron Oxide by weight |
| Type | Categorical | Type of glass (1: float processed building windows, 2: nonfloat processed building windows, 3: float processed vehicle windows, 4: nonfloat processed vehicle windows, 5: containers, 6: tableware, 7: headlamps) |

As usual, we'll proceed by preparing a training and test set for our glass data:

> library(caret)    # provides createDataPartition()
> set.seed(4365677)
> glass_sampling_vector <- createDataPartition(glass$Type, p = 0.80, 
                                               list = FALSE)
> glass_train <- glass[glass_sampling_vector,]
> glass_test <- glass[-glass_sampling_vector,]

Now, to perform multinomial logistic regression, we will use the nnet package. This package also contains functions for working with neural networks, so we will revisit it in the next chapter. The multinom() function is used for multinomial logistic regression and has a familiar interface: we specify a formula and a data frame. In addition, we can specify the maxit parameter, which determines the maximum number of iterations for which the underlying optimization procedure will run. Sometimes, we may find that training a model stops before convergence is reached. In this case, one possible approach is to increase this parameter and allow the model to train over a larger number of iterations, bearing in mind that the model may take longer to train:

> library(nnet)
> glass_model <- multinom(Type ~ ., data = glass_train, maxit = 1000)
> summary(glass_model)
Call:
multinom(formula = Type ~ ., data = glass_train, maxit = 1000)

Coefficients:
  (Intercept)         RI         Na         Mg          Al
2   52.259841  229.29126 -3.3704788  -5.975435  0.07372541
3  596.591193 -237.75997 -1.2230210  -2.435149 -0.65752347
5   -1.107583  -22.94764 -0.7434635  -4.244450  8.39355868
6   -7.493074  -11.83462 11.7893062  -6.383788 35.54561277
7  -55.888124  442.23590 -2.5269178 -10.479849  1.35983136
          Si            K         Ca          Ba           Fe
2 -4.0428142   -3.4934439 -4.6096363   -6.319183    3.2295218
3 -2.6703131   -4.1221815 -1.7952780   -3.910554    0.2818498
5  0.6992306   -0.2149109 -0.8790202   -4.642283    4.3379314
6 -2.2672275 -138.1047925  0.9011624 -161.700857 -200.9598019
7 -6.5363409   -7.5444163 -8.5710078   -4.087614  -67.9907347



Std. Errors:
  (Intercept)         RI         Na        Mg       Al        Si
2  0.03462075 0.08068713  0.5475710 0.7429120 1.282725 0.1392131
3  0.05425817 0.08750688  0.7339134 0.9173184 1.544409 0.1805758
5  0.06674926 0.11759231  1.0866157 1.4062285 2.738635 0.3225212
6  0.17049665 0.28791033 17.2280091 4.9726046 2.622643 4.3385330
7  0.06432732 0.10522206  2.2561142 1.5246356 3.244288 0.4733835
           K        Ca           Ba         Fe
2 1.98021049 0.4897356 1.473156e+00 2.45881312
3 2.35233054 0.5949799 4.222783e+00 3.45835575
5 2.78360034 0.9807043 5.471887e+00 5.52299959
6 0.02227295 7.2406622 1.656563e-08 0.01779519
7 3.25038195 1.7310334 4.381655e+00 0.28562065

Residual Deviance: 219.2651 
AIC: 319.2651

Our model summary shows us that we have five sets of coefficients. This is because our Type output variable has six levels, which is to say that we are choosing to predict one of six different sources of glass; there are no examples in the data where Type takes the value 4. The model also shows us standard errors but no significance tests. In general, testing for coefficient significance is a lot trickier than with binary logistic regression, and this is one of the weaknesses of this approach. Often, we resort to independently testing the significance of coefficients for each of the binary models that we trained.
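Although multinom() does not report significance tests, one common informal check is to compute Wald z-statistics by dividing each coefficient by its standard error, and to convert these into two-sided p-values with a normal approximation. Treat this as a rough diagnostic rather than a rigorous test, especially here, where some of the coefficients are very large:

> glass_summary <- summary(glass_model)
> z <- glass_summary$coefficients / glass_summary$standard.errors
> p <- 2 * pnorm(-abs(z))   # two-sided p-values from the normal approximation
> round(p, 4)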

We won't dwell on this any further, but will instead check the accuracy on our training data to give us a sense of the overall quality of the fit:

> glass_predictions <- predict(glass_model, glass_train)
> mean(glass_predictions == glass_train$Type)
[1] 0.7209302

Our training accuracy is 72 percent, which is not especially high. Here is the confusion matrix:

> table(predicted = glass_predictions, actual = glass_train$Type)
         actual
predicted  1  2  3  5  6  7
        1 46 17  8  0  0  0
        2 13 40  6  2  0  1
        3  0  0  0  0  0  0
        5  0  1  0  7  0  0
        6  0  0  0  0  7  0
        7  0  0  0  0  0 24

The confusion matrix reveals certain interesting facts. The first of these is that the model does not seem to distinguish well between the first two classes, as many of its errors involve these two. Part of the reason for this, however, is that these two classes are the most frequent in the data. The second problem is that the model never predicts class 3; in fact, it completely confuses this class with the first two. The seven examples of class 6 are perfectly distinguished, and accuracy for class 7 is also near perfect, with only 1 mistake out of 25. Overall, 72 percent accuracy on training data is mediocre, but given that we have six output classes and only 172 observations in our training data, this is to be expected with this type of model.
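We can quantify this behavior per class by computing each class's recall, that is, the proportion of the actual examples of that class that the model labels correctly, directly from the confusion matrix:

> glass_confusion <- table(predicted = glass_predictions, 
                           actual = glass_train$Type)
> round(diag(glass_confusion) / colSums(glass_confusion), 2)   # per-class recall

Let's repeat this for the test data set: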

> glass_test_predictions <- predict(glass_model, glass_test)
> mean(glass_test_predictions == glass_test$Type)
[1] 0.6428571
> table(predicted = glass_test_predictions, actual = 
        glass_test$Type)
         actual
predicted  1  2  3  5  6  7
        1  7  2  2  0  0  0
        2  4 15  1  2  0  0
        3  0  0  0  0  0  0
        5  0  0  0  1  0  2
        6  0  0  0  0  2  0
        7  0  1  0  1  0  2

As we can see, the confusion matrix paints a fairly similar picture to what we saw in training. Again, our model never predicts class 3, and the first two classes are still hard to distinguish. The test set accuracy is only 64 percent, somewhat less than what we saw in training. If our sample sizes were larger, we might suspect that our model suffers from overfitting, but our test set contains only 42 observations, so the variance of our test set performance is high due to the small sample size.

With multinomial logistic regression, we assumed that there was no natural ordering to the output classes. If our output variable is ordinal, represented in R as an ordered factor, we can train a different model known as ordinal logistic regression. This is our second extension of the binary logistic regression model and is presented in the next section.

Ordinal logistic regression

Ordered factors are very common in a number of scenarios. For example, human responses to surveys are often given on subjective scales, with scores ranging from 1 to 5 or with qualitative labels that have an intrinsic ordering, such as disagree, neutral, and agree. We can try to treat these problems as regression problems, but we will face issues similar to those we encountered when treating the binary classification problem as a regression problem. Instead of training K-1 binary logistic regression models, as multinomial logistic regression does, ordinal logistic regression trains a single model with multiple thresholds on the output. In order to achieve this, it makes an important assumption known as the assumption of proportional odds. If we have K classes and want to place thresholds on the output of a single model, we will need K-1 thresholds or cutoff points. The proportional odds assumption is that, on the logit scale, the K-1 cumulative log-odds curves are parallel: the model uses a single set of βi coefficients that determine a common slope, but there are K-1 separate intercept terms. For a model with p features and an output variable with K classes numbered from 0 to K-1, our model predicts:

$$\log\left(\frac{P(Y \le k)}{P(Y > k)}\right) = \beta_{0k} + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p, \qquad k = 0, 1, \ldots, K-2$$

This assumption may be a little hard to visualize and is perhaps best understood by way of an example. Suppose that we are trying to predict the results of a survey on the opinions of the public on a particular government policy, based on demographic data about survey participants.

The output variable is an ordered factor that ranges from strongly disagree to strongly agree on a five-point scale (also known as a Likert scale). Suppose that l0 is the log-odds of strongly disagreeing versus giving any better response, l1 is the log-odds of disagreeing or strongly disagreeing versus at least being neutral, and so on up to l3. Under the proportional odds assumption, changing a participant's demographic features shifts all four log-odds l0 to l3 by exactly the same amount; the differences between them stay constant, which is why the four cumulative log-odds curves are parallel.
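The following sketch illustrates this parallel-shift property numerically, using made-up threshold and slope values for a single feature. It follows the parametrization used by the polr() function that we will meet shortly, where the cumulative log-odds are the thresholds minus the linear predictor:

> zeta <- c(-2.0, -0.5, 1.0, 2.5)   # made-up thresholds for l0 to l3
> beta <- 0.8                       # a single slope shared by all thresholds
> logodds <- function(x) zeta - beta * x
> logodds(1) - logodds(2)           # the same constant, 0.8, at all four thresholds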

Note

Even though the proportional odds model is the most frequently cited logistic regression model that handles ordered factors, there are alternative approaches. A good reference that discusses the proportional odds model as well as other related models, such as the adjacent-category logistic model, is Applied Logistic Regression Third Edition, Hosmer Jr., Lemeshow, and Sturdivant, published by Wiley.

Predicting wine quality

The data set for our ordinal logistic regression example is the wine quality data set from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/Wine+Quality. The observations in this data set consist of wine samples taken from both red and white wines of the Portuguese Vinho Verde variety. The wine samples have been rated on a scale from 0 to 10 by a number of wine experts. The goal is to predict the rating that an expert will give to a wine sample, using a range of physicochemical properties, such as acidity and alcohol content. The data is split into two files, one for red wines and one for white wines. We will use the white wine data set, as it contains a larger number of samples. In addition, for simplicity and because the distribution of wine samples by score is sparse, we will contract our original output variable to a three-point scale from 1 to 3. First, let's load and process our data:

> wine <- read.csv("winequality-white.csv", sep = ";")
> wine$quality <- factor(ifelse(wine$quality < 5, 1,                     
                         ifelse(wine$quality > 6, 3, 2)))

The following table shows our input features and output variable:

| Column name | Type | Definition |
|-------------|------|------------|
| fixed.acidity | Numerical | Fixed acidity (g(tartaric acid)/dm³) |
| volatile.acidity | Numerical | Volatile acidity (g(acetic acid)/dm³) |
| citric.acid | Numerical | Citric acid (g/dm³) |
| residual.sugar | Numerical | Residual sugar (g/dm³) |
| chlorides | Numerical | Chlorides (g(sodium chloride)/dm³) |
| free.sulfur.dioxide | Numerical | Free sulfur dioxide (mg/dm³) |
| total.sulfur.dioxide | Numerical | Total sulfur dioxide (mg/dm³) |
| density | Numerical | Density (g/cm³) |
| pH | Numerical | pH |
| sulphates | Numerical | Sulphates (g(potassium sulphate)/dm³) |
| alcohol | Numerical | Alcohol (% vol.) |
| quality | Categorical | Wine quality (1 = Poor, 2 = Average, 3 = Good) |

First, we'll prepare a training and test set:

> set.seed(7644)
> wine_sampling_vector <- createDataPartition(wine$quality, p = 
                          0.80, list = FALSE)
> wine_train <- wine[wine_sampling_vector,]
> wine_test <- wine[-wine_sampling_vector,]

Next, we'll use the polr() function from the MASS package to train a proportional odds logistic regression model. Just as with the other model functions we have seen so far, we specify a formula and a data frame with our training data. In addition, we must set the Hess parameter to TRUE in order to obtain a model that includes additional information, such as standard errors on the coefficients:

> library(MASS)
> wine_model <- polr(quality ~ ., data = wine_train, Hess = T)
> summary(wine_model)
Call:
polr(formula = quality ~ ., data = wine_train, Hess = T)

Coefficients:
                          Value Std. Error    t value
fixed.acidity         4.728e-01   0.055641     8.4975
volatile.acidity     -4.211e+00   0.435288    -9.6741
citric.acid           9.896e-02   0.353466     0.2800
residual.sugar        3.386e-01   0.009835    34.4248
chlorides            -2.891e+00   0.116025   -24.9162
free.sulfur.dioxide   1.176e-02   0.003234     3.6374
total.sulfur.dioxide -1.618e-04   0.001384    -0.1169
density              -7.534e+02   0.625157 -1205.1041
pH                    3.107e+00   0.301434    10.3087
sulphates             2.199e+00   0.338923     6.4873
alcohol               2.883e-02   0.041479     0.6951

Intercepts:
    Value      Std. Error t value   
1|2  -736.9784     0.6341 -1162.3302
2|3  -731.4177     0.6599 -1108.4069

Residual Deviance: 4412.75 
AIC: 4438.75 

Our model summary shows that, with three output classes, we have two intercepts. Now, in this data set, many wines were rated average (either 5 or 6), and as a result, this class is the most frequent. We'll use the table() function to count the number of samples for each output score and then apply prop.table() to express these counts as relative frequencies:

> prop.table(table(wine$quality))

         1          2          3 
0.03736219 0.74622295 0.21641486

Class 2, which corresponds to average wines, is by far the most frequent. In fact, a simple baseline model that always predicts this category would be correct 74.6 percent of the time. Let's see whether our model does better than this. We'll begin by looking at the fit on the training data and the corresponding confusion matrix:

> wine_predictions <- predict(wine_model, wine_train)
> mean(wine_predictions == wine_train$quality)
[1] 0.7647359
> table(predicted = wine_predictions,actual = wine_train$quality)
         actual
predicted    1    2    3
        1    4    1    0
        2  141 2764  619
        3    2  159  229

Our model performs only marginally better on the training data than our baseline model. We can see why this is the case—it predicts the average class (2) very often and almost never predicts class 1. Repeating with the test set reveals a similar situation:

> wine_test_predictions <- predict(wine_model, wine_test)
> mean(wine_test_predictions == wine_test$quality)
[1] 0.7681307
> table(predicted = wine_test_predictions, 
           actual = wine_test$quality)
         actual
predicted   1   2   3
        1   2   2   0
        2  33 693 155
        3   1  36  57

It seems that our model is not a particularly good choice for this data set. As we know, there are a number of possible reasons for this, ranging from having chosen the wrong type of model to having insufficient features or the wrong kind of features. One aspect of the ordinal logistic regression model that we should always try to check is whether the proportional odds assumption is valid. There is no universally accepted way to do this, but a number of different statistical tests have been proposed in the literature. Unfortunately, it is very difficult to find reliable implementations of these tests in R. One simple check that is easy to perform, however, is to train a second model using multinomial logistic regression and then compare the AIC values of the two models. Let's do this:

> wine_model2 <- multinom(quality ~ ., data = wine_train, 
                          maxit = 1000)
> wine_predictions2 <- predict(wine_model2, wine_test)
> mean(wine_predictions2 == wine_test$quality)
[1] 0.7630235
> table(predicted = wine_predictions2, actual = wine_test$quality)
         actual
predicted   1   2   3
        1   2   2   0
        2  32 682 149
        3   2  47  63

The two models have virtually no difference in the quality of the fit. Let's check their AIC values:

> AIC(wine_model)
[1] 4438.75
> AIC(wine_model2)
[1] 4367.448

The AIC is lower for the multinomial logistic regression model, which suggests that we might be better off working with that model. Another possible avenue for improvement on this data set would be to carry out feature selection. The step() function that we saw in the previous chapter, for example, also works on models trained with the polr() function. We'll leave it as an exercise for the reader to verify that we can, in fact, get practically the same level of performance by removing some of the features. Unsatisfied with the results of logistic regression on this data set, we will revisit it in subsequent chapters to see whether more sophisticated classification models can do better.
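As a starting point for that exercise, here is a minimal sketch of stepwise selection by AIC on our proportional odds model; exactly which features are removed will depend on the data and the search path:

> wine_model_reduced <- step(wine_model, trace = 0)   # backward selection by AIC
> AIC(wine_model_reduced)
> mean(predict(wine_model_reduced, wine_test) == wine_test$quality)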
