LDA versus PCA – iris dataset

Finally, we arrive at the moment where we can try using both PCA and LDA in our machine learning pipelines. Because we have been working with the iris dataset extensively in this chapter, we will continue to demonstrate the utility of both LDA and PCA as feature transformational pre-processing steps for supervised and unsupervised machine learning.
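If you are jumping into this section directly, note that the iris_X and iris_y arrays (as well as the PCA and LinearDiscriminantAnalysis classes) were set up earlier in the chapter. A minimal, equivalent setup using scikit-learn's built-in loader would look like the following sketch (the variable names simply mirror the ones used below):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# the iris dataset: 150 flowers, four quantitative traits, three species
iris = load_iris()
iris_X, iris_y = iris.data, iris.target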

We will start with supervised machine learning and attempt to build a classifier to recognize the species of flower given the four quantitative flower traits:

  1. We begin by importing three tools from scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

We will use KNN as our supervised model and the Pipeline class to combine our KNN model with our feature transformation tools into machine learning pipelines that can be cross-validated using the cross_val_score function. We will try a few different machine learning pipelines and record their performance:

  2. Let's begin by creating three new variables, one to hold our LDA, one to hold our PCA, and another to hold a KNN model:
# Create a PCA module to keep a single component
single_pca = PCA(n_components=1)

# Create an LDA module to keep a single component
single_lda = LinearDiscriminantAnalysis(n_components=1)

# Instantiate a KNN model
knn = KNeighborsClassifier(n_neighbors=3)
  3. Let's invoke the KNN model without any transformational techniques to get the baseline accuracy. We will use this to compare the two feature transformation algorithms:
# run a cross validation on the KNN without any feature transformation
knn_average = cross_val_score(knn, iris_X, iris_y).mean()

# This is our baseline accuracy. With no feature transformation at all, KNN on its own achieves 98% accuracy
knn_average

0.98039215686274517
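Note that cross_val_score returns one accuracy score per cross-validation fold; the .mean() call above simply averages them. If you want to inspect the per-fold scores themselves, a quick sketch (using the same default cross-validation settings):

# one accuracy per fold; the 98.04% baseline above is the mean of these
print(cross_val_score(knn, iris_X, iris_y))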
  4. The baseline accuracy to beat is 98.04%. Let's use our LDA, which keeps only the most powerful component:
lda_pipeline = Pipeline([('lda', single_lda), ('knn', knn)])
lda_average = cross_val_score(lda_pipeline, iris_X, iris_y).mean()

# lower than our baseline accuracy, but (as we will see shortly) better than PCA by a good amount
lda_average

0.9673202614379085
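To get a feel for why a single discriminant already gets this close, it helps to look at how the between-class variance splits across the discriminants. LinearDiscriminantAnalysis exposes an explained_variance_ratio_ attribute after fitting, so a quick sketch (with three classes, LDA can produce at most two discriminants):

# fit an LDA with the maximum number of discriminants (classes - 1 = 2)
full_lda = LinearDiscriminantAnalysis(n_components=2)
full_lda.fit(iris_X, iris_y)

# the share of between-class variance captured by each discriminant
print(full_lda.explained_variance_ratio_)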
  5. It seems that using only a single linear discriminant isn't enough to beat our baseline accuracy. Now, let's try PCA. Our hypothesis is that PCA will not outperform LDA, for the simple reason that PCA is not trying to optimize for class separation the way LDA is:
# create a pipeline that performs PCA
pca_pipeline = Pipeline([('pca', single_pca), ('knn', knn)])

pca_average = cross_val_score(pca_pipeline, iris_X, iris_y).mean()

pca_average

0.8941993464052288

Definitely the worst so far. 
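This result lines up with the hypothesis: a single principal component can capture most of the dataset's total variance while still mixing the classes together, because total variance and class separation are different objectives. To confirm how much variance that one component retains, PCA's explained_variance_ratio_ attribute shows it (a sketch):

# fit the single-component PCA on its own and inspect the variance it keeps
single_pca.fit(iris_X)
print(single_pca.explained_variance_ratio_)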

It is worth exploring whether adding another LDA component will help us:

# try LDA with 2 components
lda_pipeline = Pipeline([('lda', LinearDiscriminantAnalysis(n_components=2)),
                         ('knn', knn)])

lda_average = cross_val_score(lda_pipeline, iris_X, iris_y).mean()

# Just as good as using original data
lda_average

0.98039215686274517

With two components, we are able to match the original accuracy! This is great, but we want to do better than our baseline. Let's see whether a feature selection tool from the last chapter can help. We will import SelectKBest and see if statistical feature selection can beat our LDA pipeline:

# compare our feature transformation tools to a feature selection tool
from sklearn.feature_selection import SelectKBest
# try all possible values for k, excluding keeping all columns
for k in [1, 2, 3]:
    # make the pipeline
    select_pipeline = Pipeline([('select', SelectKBest(k=k)), ('knn', knn)])
    # cross validate the pipeline
    select_average = cross_val_score(select_pipeline, iris_X, iris_y).mean()
    print(k, "best feature has accuracy:", select_average)

# LDA is even better than the best selectkbest
1 best feature has accuracy: 0.953839869281
2 best feature has accuracy: 0.960784313725
3 best feature has accuracy: 0.97385620915
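By default, SelectKBest scores columns here with the ANOVA F-test (f_classif). To see which of the four original columns it keeps and how they score, a short sketch:

# fit SelectKBest with k=3 and inspect the scores and the chosen columns
select = SelectKBest(k=3)
select.fit(iris_X, iris_y)
print(select.scores_)        # one F-score per original column
print(select.get_support())  # True for the columns that are kept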

Our LDA with two components is winning so far. In production, it is quite common to use both unsupervised and supervised feature transformations. Let's set up a GridSearchCV to find the best combination of the following:

  • Scaling data (with or without mean/std)
  • PCA components
  • LDA components
  • KNN neighbors

The following code block sets up a function called get_best_model_and_accuracy, which takes in a model (a scikit-learn pipeline or any other estimator), a parameter grid in the form of a dictionary, and our X and y datasets, and prints the results of the grid search. The output will be the model's best performance (in terms of accuracy), the parameters that led to that performance, the average time it took to fit, and the average time it took to predict:

def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model,          # the model to grid search
                        params,         # the parameter set to try
                        error_score=0.) # if a parameter set raises an error, continue and set the performance as a big, fat 0

    grid.fit(X, y) # fit the model and parameters
    # our classical metric for performance
    print("Best Accuracy: {}".format(grid.best_score_))
    # the best parameters that caused the best accuracy
    print("Best Parameters: {}".format(grid.best_params_))
    # the average time it took a model to fit to the data (in seconds)
    avg_time_fit = round(grid.cv_results_['mean_fit_time'].mean(), 3)
    print("Average Time to Fit (s): {}".format(avg_time_fit))
    # the average time it took a model to predict out of sample data (in seconds)
    # this metric gives us insight into how this model will perform in real-time analysis
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))

Once we have our function set up to take in models and parameters, let's use it to test our pipeline with our combinations of scaling, PCA, LDA, and KNN:


from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

iris_params = {
    'preprocessing__scale__with_std': [True, False],
    'preprocessing__scale__with_mean': [True, False],
    'preprocessing__pca__n_components': [1, 2, 3, 4],
    # according to the scikit-learn docs, the max allowed n_components for LDA is the number of classes - 1
    'preprocessing__lda__n_components': [1, 2],

    'clf__n_neighbors': range(1, 9)
}

# make a larger pipeline
preprocessing = Pipeline([('scale', StandardScaler()),
                          ('pca', PCA()),
                          ('lda', LinearDiscriminantAnalysis())])


iris_pipeline = Pipeline(steps=[('preprocessing', preprocessing), ('clf', KNeighborsClassifier())])


get_best_model_and_accuracy(iris_pipeline, iris_params, iris_X, iris_y)

Best Accuracy: 0.986666666667
Best Parameters: {'preprocessing__scale__with_std': False, 'preprocessing__pca__n_components': 3, 'preprocessing__scale__with_mean': True, 'preprocessing__lda__n_components': 2, 'clf__n_neighbors': 3}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.001

The best accuracy so far (nearly 99%) uses a combination of scaling, PCA, and LDA. It is common to use all three of these algorithms in the same pipeline and to perform hyperparameter tuning to fine-tune the process. This shows us that, more often than not, the best production-ready machine learning pipelines are in fact a combination of multiple feature engineering methods.
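Because get_best_model_and_accuracy only prints metrics, one way to actually use the winning configuration is to set the reported parameters on the pipeline directly and refit it. A sketch using the best parameters from the output above:

# refit the winning configuration so the pipeline can score new flower measurements
best_pipeline = iris_pipeline.set_params(
    preprocessing__scale__with_mean=True,
    preprocessing__scale__with_std=False,
    preprocessing__pca__n_components=3,
    preprocessing__lda__n_components=2,
    clf__n_neighbors=3)

best_pipeline.fit(iris_X, iris_y)
print(best_pipeline.predict(iris_X[:5]))  # species predictions for the first five rows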
