Discriminant Function Analysis (DA) refers to the process of determining which continuous independent (predictor) variables discriminate between the categories of a discrete dependent (response) variable; it can be considered a reversed Multivariate Analysis of Variance (MANOVA).
This suggests that DA is very similar to logistic regression (see Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth) and the following section), which is more generally used because of its flexibility. While logistic regression can handle both categorical and continuous data, DA requires numeric independent variables and has a few further requirements that logistic regression does not have: the independent variables should be approximately normally distributed within each group, the variance-covariance matrices of the groups should be homogeneous, and the predictors should not be highly correlated with each other.
There are two different types of DA, and we will use lda from the MASS package for the linear discriminant function, and qda for the quadratic discriminant function.
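As a quick preview of the quadratic version, here is a minimal sketch run on the mtcars dataset that we will also work with below. Note that qda requires more observations than predictors in each group, so with only five 5-gear cars we restrict the model to three arbitrarily chosen predictors (the variable name q and the choice of mpg, hp, and wt are purely illustrative; the output is omitted here):

> library(MASS)
> # qda needs more cases than predictors in every group,
> # so we use only three (arbitrarily chosen) predictors
> q <- qda(factor(gear) ~ mpg + hp + wt, data = mtcars, CV = TRUE)
> table(mtcars$gear, q$class)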
Let us start with the dependent variable being the number of gears, and we will use all the other numeric values as independent variables. To make sure that we start with a standard mtcars dataset, not one overwritten in the preceding examples, let's clear the namespace and update the gear column to include categories instead of the actual numeric values:
> rm(mtcars)
> mtcars$gear <- factor(mtcars$gear)
Due to the low number of observations (and as we have already discussed the related options in Chapter 9, From Big to Smaller Data), we will skip the normality and other assumption tests for now and proceed with the actual analysis.
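For reference, with a larger sample such checks could be run with base R tools; the following lines are only an illustrative sketch using the wt column (any of the other predictors could be substituted):

> # univariate normality of a predictor within each gear category
> tapply(mtcars$wt, mtcars$gear, function(x) shapiro.test(x)$p.value)
> # homogeneity of variances across the gear categories
> bartlett.test(wt ~ gear, data = mtcars)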
We call the lda function, setting cross-validation (CV) to TRUE, which performs leave-one-out cross-validation, so that we can test the accuracy of the predictions. The dot in the formula refers to all variables except the explicitly mentioned gear:
> library(MASS)
> d <- lda(gear ~ ., data = mtcars, CV = TRUE)
So now we can check the accuracy of the predictions by comparing them to the original values via the confusion matrix:
> (tab <- table(mtcars$gear, d$class))
     3  4  5
  3 14  1  0
  4  2 10  0
  5  1  1  3
To present row proportions instead of the raw counts, we can do a quick transformation:
> tab / rowSums(tab)
             3          4          5
  3 0.93333333 0.06666667 0.00000000
  4 0.16666667 0.83333333 0.00000000
  5 0.20000000 0.20000000 0.60000000
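The built-in prop.table function provides a shortcut for the same row-wise proportions:

> prop.table(tab, 1)

Here the second argument specifies the margin, so 1 computes the proportions within each row of the table.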
And we can also compute the overall proportion of correctly classified cases:
> sum(diag(tab)) / sum(tab)
[1] 0.84375
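Subtracting this from one gives the misclassification rate:

> 1 - sum(diag(tab)) / sum(tab)
[1] 0.15625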
So around 84 percent of the cases were assigned to their actual category. These class assignments are based on posterior probabilities, which can be extracted from the posterior element of the returned list:
> round(d$posterior, 4)
                         3      4      5
Mazda RX4           0.0000 0.8220 0.1780
Mazda RX4 Wag       0.0000 0.9905 0.0095
Datsun 710          0.0018 0.6960 0.3022
Hornet 4 Drive      0.9999 0.0001 0.0000
Hornet Sportabout   1.0000 0.0000 0.0000
Valiant             0.9999 0.0001 0.0000
Duster 360          0.9993 0.0000 0.0007
Merc 240D           0.6954 0.2990 0.0056
Merc 230            1.0000 0.0000 0.0000
Merc 280            0.0000 1.0000 0.0000
Merc 280C           0.0000 1.0000 0.0000
Merc 450SE          1.0000 0.0000 0.0000
Merc 450SL          1.0000 0.0000 0.0000
Merc 450SLC         1.0000 0.0000 0.0000
Cadillac Fleetwood  1.0000 0.0000 0.0000
Lincoln Continental 1.0000 0.0000 0.0000
Chrysler Imperial   1.0000 0.0000 0.0000
Fiat 128            0.0000 0.9993 0.0007
Honda Civic         0.0000 1.0000 0.0000
Toyota Corolla      0.0000 0.9995 0.0005
Toyota Corona       0.0112 0.8302 0.1586
Dodge Challenger    1.0000 0.0000 0.0000
AMC Javelin         1.0000 0.0000 0.0000
Camaro Z28          0.9955 0.0000 0.0044
Pontiac Firebird    1.0000 0.0000 0.0000
Fiat X1-9           0.0000 0.9991 0.0009
Porsche 914-2       0.0000 1.0000 0.0000
Lotus Europa        0.0000 0.0234 0.9766
Ford Pantera L      0.9965 0.0035 0.0000
Ferrari Dino        0.0000 0.0670 0.9330
Maserati Bora       0.0000 0.0000 1.0000
Volvo 142E          0.0000 0.9898 0.0102
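Cross-referencing these probabilities with the original gear values, we can also list the cars that were misclassified during cross-validation; a quick sketch in base R:

> rownames(mtcars)[d$class != mtcars$gear]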
Now we can run lda again without cross-validation to see the actual discriminants and how the different categories of gear are structured:
> d <- lda(gear ~ ., data = mtcars)
> plot(d)
The numbers in the preceding plot stand for the cars in the mtcars dataset, labeled by their actual number of gears. The two discriminants separate the groups rather clearly: cars with the same number of gears are plotted close to each other, while cars with different values in the gear column end up further apart.
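The numeric details behind this plot are also stored in the fitted object: d$scaling holds the coefficients of the linear discriminants, and the squared singular values give the proportion of between-group variance explained by each discriminant (the figure reported as the proportion of trace when the model is printed):

> d$scaling
> d$svd^2 / sum(d$svd^2)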
These discriminant scores can also be extracted from the d object by calling predict, or rendered directly on a histogram to see the distribution of the first discriminant by the categories of the dependent variable:
> plot(d, dimen = 1, type = "both")
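The same scores shown on these histograms can be computed manually via predict and then visualized in any other way; a minimal sketch with base graphics, picking the 3-gear cars as an example (the scores object and the plot title are only illustrative):

> scores <- predict(d)$x
> head(scores)
> hist(scores[mtcars$gear == "3", 1], xlab = "LD1",
+   main = "LD1 scores of the 3-gear cars")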