Discriminant analysis

Discriminant Function Analysis (DA) refers to the process of determining which continuous independent (predictor) variables discriminate between a discrete dependent (response) variable's categories, which can be considered as a reversed Multivariate Analysis of Variance (MANOVA).

This suggests that DA is very similar to logistic regression (see Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth) and the following section), which is more generally used because of its flexibility. While logistic regression can handle both categorical and continuous data, DA requires numeric independent variables and has a few further requirements that logistic regression does not have:

  • Normal distribution is assumed
  • Outliers should be eliminated
  • No two variables should be highly correlated (multi-collinearity)
  • The sample size of the smallest category should be higher than the number of predictor variables
  • The number of independent variables should not exceed the sample size

There are two different types of DA: we will use lda from the MASS package for the linear discriminant function, and qda for the quadratic discriminant function (a brief sketch of the latter closes this section).

Let us start with the dependent variable being the number of gears, and we will use all the other numeric values as independent variables. To make sure that we start with a standard mtcars dataset not overwritten in the preceding examples, let's clear the namespace and update the gear column to include categories instead of the actual numeric values:

> rm(mtcars)
> mtcars$gear <- factor(mtcars$gear)

Due to the low number of observations (and as we have already discussed the related options in Chapter 9, From Big to Smaller Data), we will skip the formal normality and other tests here. For reference, here is a quick sketch of how some of the preceding assumptions could be checked; the choice of variables and tests is illustrative:
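> ## the smallest class should include more cases than the number of predictors
> table(mtcars$gear)
 3  4  5 
15 12  5 
> ## look for highly correlated predictor pairs (multi-collinearity)
> cors <- cor(mtcars[, names(mtcars) != 'gear'])
> which(abs(cors) > 0.9 & cors < 1, arr.ind = TRUE)
> ## univariate normality of one predictor within each class, for example mpg
> tapply(mtcars$mpg, mtcars$gear, shapiro.test)

With that said, let's proceed with the actual analysis.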

We call the lda function with the CV argument set to TRUE, which enables leave-one-out cross-validation, so that we can test the accuracy of the predictions. The dot in the formula refers to all variables except the explicitly mentioned gear:

> library(MASS)
> d <- lda(gear ~ ., data = mtcars, CV = TRUE)

So now we can check the accuracy of the predictions by comparing them to the original values via the confusion matrix:

> (tab <- table(mtcars$gear, d$class)) 
     3  4  5
  3 14  1  0
  4  2 10  0
  5  1  1  3

To present relative percentages instead of the raw numbers, we can do some quick transformations:

> tab / rowSums(tab)
             3          4          5
  3 0.93333333 0.06666667 0.00000000
  4 0.16666667 0.83333333 0.00000000
  5 0.20000000 0.20000000 0.60000000
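
As a side note, the same matrix could also be produced with the built-in prop.table function:

> prop.table(tab, margin = 1)   # divides each cell by its row total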

And we can also compute the overall accuracy, that is, the proportion of correctly classified cases:

> sum(diag(tab)) / sum(tab)
[1] 0.84375
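
This means that the misclassification rate is simply the complement of the above, around 16 percent:

> 1 - sum(diag(tab)) / sum(tab)
[1] 0.15625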

So around 84 percent of the cases were assigned to the correct class. These assignments are based on the posterior probabilities, which can be extracted via the posterior element of the returned list:

> round(d$posterior, 4)
                         3      4      5
Mazda RX4           0.0000 0.8220 0.1780
Mazda RX4 Wag       0.0000 0.9905 0.0095
Datsun 710          0.0018 0.6960 0.3022
Hornet 4 Drive      0.9999 0.0001 0.0000
Hornet Sportabout   1.0000 0.0000 0.0000
Valiant             0.9999 0.0001 0.0000
Duster 360          0.9993 0.0000 0.0007
Merc 240D           0.6954 0.2990 0.0056
Merc 230            1.0000 0.0000 0.0000
Merc 280            0.0000 1.0000 0.0000
Merc 280C           0.0000 1.0000 0.0000
Merc 450SE          1.0000 0.0000 0.0000
Merc 450SL          1.0000 0.0000 0.0000
Merc 450SLC         1.0000 0.0000 0.0000
Cadillac Fleetwood  1.0000 0.0000 0.0000
Lincoln Continental 1.0000 0.0000 0.0000
Chrysler Imperial   1.0000 0.0000 0.0000
Fiat 128            0.0000 0.9993 0.0007
Honda Civic         0.0000 1.0000 0.0000
Toyota Corolla      0.0000 0.9995 0.0005
Toyota Corona       0.0112 0.8302 0.1586
Dodge Challenger    1.0000 0.0000 0.0000
AMC Javelin         1.0000 0.0000 0.0000
Camaro Z28          0.9955 0.0000 0.0044
Pontiac Firebird    1.0000 0.0000 0.0000
Fiat X1-9           0.0000 0.9991 0.0009
Porsche 914-2       0.0000 1.0000 0.0000
Lotus Europa        0.0000 0.0234 0.9766
Ford Pantera L      0.9965 0.0035 0.0000
Ferrari Dino        0.0000 0.0670 0.9330
Maserati Bora       0.0000 0.0000 1.0000
Volvo 142E          0.0000 0.9898 0.0102

Now we can run lda again without cross validation to see the actual discriminants and how the different categories of gear are structured:

> d <- lda(gear ~ ., data = mtcars)
> plot(d)
[Figure: the cars plotted along the first two linear discriminants, each point labeled with its number of gears]

The numbers in the preceding plot stand for the cars in the mtcars dataset, each rendered as its actual number of gears. The two discriminants separate the groups rather clearly: cars with the same number of gears are plotted close together, while cars with different values in the gear column are placed further apart.
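
Besides the plot, the coefficients of the linear discriminants and the share of the between-group variance captured by each discriminant can also be inspected; both are stored in the fitted object:

> d$scaling               # coefficients of the linear discriminants
> d$svd^2 / sum(d$svd^2)  # proportion of between-group variance per discriminant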

These discriminants can also be extracted from the d object by calling predict, or can be rendered directly on a histogram to see the distribution of this continuous variable by the categories of the dependent variable:

> plot(d, dimen = 1, type = "both")
[Figure: histograms and density plots of the first linear discriminant, by number of gears]
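Finally, a brief note on the quadratic discriminant function mentioned at the beginning of this section: qda requires each class to include more cases than the number of predictors, which the 5-gear group of mtcars (5 cars versus 10 predictors) does not satisfy. As a minimal, purely illustrative sketch, it can be fitted on the larger iris dataset in exactly the same way:

> q <- qda(Species ~ ., data = iris, CV = TRUE)
> ## leave-one-out confusion matrix, analogous to the lda example above
> table(iris$Species, q$class)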