Supervised data compression via linear discriminant analysis

Linear Discriminant Analysis (LDA) can be used as a technique for feature extraction to increase the computational efficiency and reduce the degree of over-fitting due to the curse of dimensionality in nonregularized models.

The general concept behind LDA is very similar to that of PCA. Whereas PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal in LDA is to find the feature subspace that optimizes class separability. Both LDA and PCA are linear transformation techniques that can be used to reduce the number of dimensions in a dataset; PCA is an unsupervised algorithm, whereas LDA is supervised. Thus, we might intuitively think that LDA is a superior feature extraction technique for classification tasks compared to PCA. However, A. M. Martinez reported that preprocessing via PCA tends to result in better classification results in an image recognition task in certain cases, for instance, if each class consists of only a small number of samples (A. M. Martinez and A. C. Kak. PCA Versus LDA. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(2):228–233, 2001).

Note

Although LDA is sometimes also called Fisher's LDA, Ronald A. Fisher initially formulated Fisher's Linear Discriminant for two-class classification problems in 1936 (R. A. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179–188, 1936). Fisher's Linear Discriminant was later generalized for multi-class problems by C. Radhakrishna Rao under the assumption of equal class covariances and normally distributed classes in 1948, which we now call LDA (C. R. Rao. The Utilization of Multiple Measurements in Problems of Biological Classification. Journal of the Royal Statistical Society. Series B (Methodological), 10(2):159–203, 1948).

The following figure summarizes the concept of LDA for a two-class problem. Samples from class 1 are shown as crosses and samples from class 2 are shown as circles:

[Figure: LDA concept for a two-class problem, with candidate linear discriminants LD 1 (x-axis) and LD 2 (y-axis)]

A linear discriminant, as shown on the x-axis (LD 1), would separate the two normally distributed classes well. Although the exemplary linear discriminant shown on the y-axis (LD 2) captures a lot of the variance in the dataset, it would fail as a good linear discriminant since it does not capture any of the class-discriminatory information.

One assumption in LDA is that the data is normally distributed. Also, we assume that the classes have identical covariance matrices and that the features are statistically independent of each other. However, even if one or more of those assumptions are slightly violated, LDA for dimensionality reduction can still work reasonably well (R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd. Edition. New York, 2001).

Before we take a look into the inner workings of LDA in the following subsections, let's summarize the key steps of the LDA approach:

  1. Standardize the $d$-dimensional dataset ($d$ is the number of features).
  2. For each class, compute the $d$-dimensional mean vector.
  3. Construct the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$.
  4. Compute the eigenvectors and corresponding eigenvalues of the matrix $S_W^{-1} S_B$.
  5. Choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues to construct a $d \times k$-dimensional transformation matrix $W$; the eigenvectors are the columns of this matrix.
  6. Project the samples onto the new feature subspace using the transformation matrix $W$.
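
The following subsections implement these steps one by one on the Wine dataset. As a compact preview, a minimal NumPy sketch of the whole pipeline could look like the following (here, X_std, y, and lda_transform are illustrative names rather than variables used later in this chapter):

>>> import numpy as np
>>> def lda_transform(X_std, y, k=2):
...     """Sketch: project standardized data onto the top k linear discriminants."""
...     d = X_std.shape[1]
...     mean_overall = np.mean(X_std, axis=0)
...     S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
...     for label in np.unique(y):
...         X_c = X_std[y == label]
...         mean_c = X_c.mean(axis=0)
...         # within-class scatter: sum of squared deviations from the class mean
...         S_W += (X_c - mean_c).T.dot(X_c - mean_c)
...         # between-class scatter: weighted outer product of mean differences
...         diff = (mean_c - mean_overall).reshape(d, 1)
...         S_B += X_c.shape[0] * diff.dot(diff.T)
...     # eigendecomposition of S_W^{-1} S_B; keep the k leading eigenvectors
...     eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
...     top = np.argsort(np.abs(eigen_vals))[::-1][:k]
...     return X_std.dot(np.real(eigen_vecs[:, top]))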

Note

The assumptions that we make when we are using LDA are that the features are normally distributed and independent of each other. Also, the LDA algorithm assumes that the covariance matrices for the individual classes are identical. However, even if we violate those assumptions to a certain extent, LDA may still work reasonably well in dimensionality reduction and classification tasks (R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd. Edition. New York, 2001).

Computing the scatter matrices

Since we have already standardized the features of the Wine dataset in the PCA section at the beginning of this chapter, we can skip the first step and proceed with the calculation of the mean vectors, which we will use to construct the within-class scatter matrix and the between-class scatter matrix, respectively. Each mean vector $\mathbf{m}_i$ stores the mean feature value $\mu_m$ with respect to the samples of class $i$:

$$\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in D_i} \mathbf{x}_m$$

This results in three mean vectors:

$$\mathbf{m}_i = \begin{bmatrix} \mu_{i,\text{alcohol}} & \mu_{i,\text{malic acid}} & \cdots & \mu_{i,\text{proline}} \end{bmatrix}^T, \quad i \in \{1, 2, 3\}$$
>>> np.set_printoptions(precision=4)
>>> mean_vecs = []
>>> for label in range(1,4):
...     mean_vecs.append(np.mean(
...                X_train_std[y_train==label], axis=0))
...     print('MV %s: %s\n' % (label, mean_vecs[label-1]))
MV 1: [ 0.9259 -0.3091  0.2592 -0.7989  0.3039  0.9608  1.0515 -0.6306  0.5354
  0.2209  0.4855  0.798   1.2017]

MV 2: [-0.8727 -0.3854 -0.4437  0.2481 -0.2409 -0.1059  0.0187 -0.0164  0.1095
 -0.8796  0.4392  0.2776 -0.7016]

MV 3: [ 0.1637  0.8929  0.3249  0.5658 -0.01   -0.9499 -1.228   0.7436 -0.7652
  0.979  -1.1698 -1.3007 -0.3912]

Using the mean vectors, we can now compute the within-class scatter matrix $S_W$:

$$S_W = \sum_{i=1}^{c} S_i$$

This is calculated by summing up the individual scatter matrices $S_i$ of each individual class $i$:

$$S_i = \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$
>>> d = 13 # number of features
>>> S_W = np.zeros((d, d))
>>> for label,mv in zip(range(1,4), mean_vecs):
...     class_scatter = np.zeros((d, d)) 
...     for row in X_train_std[y_train == label]:
...         row, mv = row.reshape(d, 1), mv.reshape(d, 1) 
...         class_scatter += (row-mv).dot((row-mv).T)
...     S_W += class_scatter                             
>>> print('Within-class scatter matrix: %sx%s'
...        % (S_W.shape[0], S_W.shape[1]))
Within-class scatter matrix: 13x13

The assumption that we are making when we are computing the scatter matrices is that the class labels in the training set are uniformly distributed. However, if we print the number of class labels, we see that this assumption is violated:

>>> print('Class label distribution: %s' 
...       % np.bincount(y_train)[1:])
Class label distribution: [40 49 35]

Thus, we want to scale the individual scatter matrices $S_i$ before we sum them up as scatter matrix $S_W$. When we divide the scatter matrices by the number of class samples $N_i$, we can see that computing the scatter matrix is in fact the same as computing the covariance matrix $\Sigma_i$. The covariance matrix is a normalized version of the scatter matrix:

$$\Sigma_i = \frac{1}{N_i} S_i = \frac{1}{N_i} \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$$
>>> d = 13 # number of features
>>> S_W = np.zeros((d, d))
>>> for label,mv in zip(range(1, 4), mean_vecs):
...     class_scatter = np.cov(X_train_std[y_train==label].T)
...     S_W += class_scatter
>>> print('Scaled within-class scatter matrix: %sx%s' 
...       % (S_W.shape[0], S_W.shape[1]))
Scaled within-class scatter matrix: 13x13
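
As a quick, optional check that np.cov indeed computes a scaled version of our manual scatter matrix, we can compare the two for a single class (class 1 is used here only as an example; note that np.cov normalizes by N_i - 1 by default, so we divide the scatter matrix by the same factor):

>>> X_c = X_train_std[y_train == 1]
>>> S_1 = (X_c - X_c.mean(axis=0)).T.dot(X_c - X_c.mean(axis=0))
>>> print(np.allclose(S_1 / (X_c.shape[0] - 1), np.cov(X_c.T)))
True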

After we have computed the scaled within-class scatter matrix (or covariance matrix), we can move on to the next step and compute the between-class scatter matrix $S_B$:

$$S_B = \sum_{i=1}^{c} N_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$$

Here, $\mathbf{m}$ is the overall mean, computed over the samples from all classes.

>>> mean_overall = np.mean(X_train_std, axis=0)
>>> d = 13 # number of features
>>> S_B = np.zeros((d, d))
>>> for i,mean_vec in enumerate(mean_vecs):
...     n = X_train[y_train==i+1, :].shape[0]
...     mean_vec = mean_vec.reshape(d, 1)
...     mean_overall = mean_overall.reshape(d, 1) 
...     S_B += n * (mean_vec - mean_overall).dot(
...                 (mean_vec - mean_overall).T)
>>> print('Between-class scatter matrix: %sx%s'
...       % (S_B.shape[0], S_B.shape[1]))
Between-class scatter matrix: 13x13

Selecting linear discriminants for the new feature subspace

The remaining steps of LDA are similar to those of PCA. However, instead of performing the eigendecomposition on the covariance matrix, we solve the generalized eigenvalue problem of the matrix $S_W^{-1} S_B$:

>>> eigen_vals, eigen_vecs = \
...      np.linalg.eig(np.linalg.inv(S_W).dot(S_B))

After we computed the eigenpairs, we can now sort the eigenvalues in descending order:

>>> eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:,i]) 
...              for i in range(len(eigen_vals))]
>>> eigen_pairs = sorted(eigen_pairs, 
...               key=lambda k: k[0], reverse=True)
>>> print('Eigenvalues in decreasing order:\n')
>>> for eigen_val in eigen_pairs:
...     print(eigen_val[0])


Eigenvalues in decreasing order:

452.721581245
156.43636122
8.11327596465e-14
2.78687384543e-14
2.78687384543e-14
2.27622032758e-14
2.27622032758e-14
1.97162599817e-14
1.32484714652e-14
1.32484714652e-14
1.03791501611e-14
5.94140664834e-15
2.12636975748e-16

In LDA, the number of linear discriminants is at most $c-1$, where $c$ is the number of class labels, since the between-class scatter matrix $S_B$ is the sum of $c$ matrices with rank one or less. We can indeed see that we only have two nonzero eigenvalues (eigenvalues 3-13 are not exactly zero, but this is due to floating-point arithmetic in NumPy). Note that in the rare case of perfect collinearity (all sample points falling on a straight line), the covariance matrix would have rank one, which would result in only one eigenvector with a nonzero eigenvalue.
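
As a quick sanity check of this rank argument, we can ask NumPy for the numerical rank of the between-class scatter matrix that we computed earlier; for the three Wine classes, this should report c - 1 = 2:

>>> print('Rank of S_B: %d' % np.linalg.matrix_rank(S_B))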

To measure how much of the class-discriminatory information is captured by the linear discriminants (eigenvectors), let's plot the linear discriminants sorted by decreasing eigenvalues, similar to the explained variance plot that we created in the PCA section. For simplicity, we will call the content of class-discriminatory information discriminability.

>>> tot = sum(eigen_vals.real)
>>> discr = [(i / tot) for i in sorted(eigen_vals.real, reverse=True)]
>>> cum_discr = np.cumsum(discr)
>>> plt.bar(range(1, 14), discr, alpha=0.5, align='center',
...         label='individual "discriminability"')
>>> plt.step(range(1, 14), cum_discr, where='mid',
...          label='cumulative "discriminability"')
>>> plt.ylabel('"discriminability" ratio')
>>> plt.xlabel('Linear Discriminants')
>>> plt.ylim([-0.1, 1.1])
>>> plt.legend(loc='best')
>>> plt.show()

As we can see in the resulting figure, the first two linear discriminants capture about 100 percent of the useful information in the Wine training dataset:

[Figure: individual and cumulative "discriminability" of the linear discriminants]
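
If we prefer a numeric confirmation over the plot, we can print the cumulative values directly; since the remaining eigenvalues are on the order of 1e-14, the cumulative "discriminability" of the first two linear discriminants is already 1.0 for all practical purposes:

>>> print(cum_discr[:2])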

Let's now stack the two most discriminative eigenvector columns to create the transformation matrix $W$:

>>> w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
...                eigen_pairs[1][1][:, np.newaxis].real))
>>> print('Matrix W:\n', w)
Matrix W:
[[ 0.0662 -0.3797]
[-0.0386 -0.2206]
[ 0.0217 -0.3816]
[-0.184 0.3018]
[ 0.0034 0.0141]
[-0.2326 0.0234]
[ 0.7747 0.1869]
[ 0.0811 0.0696]
[-0.0875 0.1796]
[-0.185 -0.284 ]
[ 0.066 0.2349]
[ 0.3805 0.073 ]
[ 0.3285 -0.5971]]

Projecting samples onto the new feature space

Using the transformation matrix $W$ that we created in the previous subsection, we can now transform the training dataset by multiplying the matrices:

$$X' = XW$$
>>> X_train_lda = X_train_std.dot(w)
>>> colors = ['r', 'b', 'g']
>>> markers = ['s', 'x', 'o']
>>> for l, c, m in zip(np.unique(y_train), colors, markers):
...     plt.scatter(X_train_lda[y_train==l, 0]*(-1),
...                 X_train_lda[y_train==l, 1]*(-1),
...                 c=c, label=l, marker=m)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower right')
>>> plt.show()

As we can see in the resulting plot, the three wine classes are now linearly separable in the new feature subspace:

[Figure: Wine training samples projected onto the new two-dimensional feature subspace (LD 1 vs. LD 2)]

LDA via scikit-learn

The step-by-step implementation was a good exercise for understanding the inner workings of LDA and the differences between LDA and PCA. Now, let's take a look at the LDA class implemented in scikit-learn:

>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
>>> lda = LDA(n_components=2)
>>> X_train_lda = lda.fit_transform(X_train_std, y_train)
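
As a side note, recent scikit-learn versions expose the per-discriminant ratio of between-class variance via the explained_variance_ratio_ attribute of the fitted estimator, which we can compare to the "discriminability" values that we computed manually in the previous subsection:

>>> print(lda.explained_variance_ratio_)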

Next, let's see how the logistic regression classifier handles the lower-dimensional training dataset after the LDA transformation:

>>> lr = LogisticRegression()
>>> lr = lr.fit(X_train_lda, y_train)
>>> plot_decision_regions(X_train_lda, y_train, classifier=lr)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower left')
>>> plt.show()

Looking at the resulting plot, we see that the logistic regression model misclassifies one of the samples from class 2:

[Figure: logistic regression decision regions on the LDA-transformed training data]
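
We can also quantify this with the classifier's score method, which returns the mean classification accuracy on the given data; with a single misclassified sample, the training accuracy should be just below 1.0:

>>> print('Training accuracy: %.3f' % lr.score(X_train_lda, y_train))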

By lowering the regularization strength, we could probably shift the decision boundaries so that the logistic regression model classifies all samples in the training dataset correctly. However, let's take a look at the results on the test set:

>>> X_test_lda = lda.transform(X_test_std)
>>> plot_decision_regions(X_test_lda, y_test, classifier=lr)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower left')
>>> plt.show()

As we can see in the resulting plot, the logistic regression classifier achieves a perfect accuracy score on the test dataset using only a two-dimensional feature subspace instead of the original 13 Wine features:

[Figure: logistic regression decision regions on the LDA-transformed test data]
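
To back up this visual impression with a number, we can compute the test accuracy directly via the score method, which should report 1.0 here:

>>> print('Test accuracy: %.3f' % lr.score(X_test_lda, y_test))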