Keeping the top k eigenvalues (sorted by descending value)

Now that we have our four eigenvalues, we need to choose how many of them to keep as principal components. We could keep all four if we wished, but we generally want to choose a number smaller than the original number of features. So what is the right number? We could grid search for the answer by brute force; however, we have another tool in our arsenal, called the scree plot.

A scree plot is a simple line graph that shows the percentage of the total variance in the data explained by each principal component. To build this plot, we sort the eigenvalues in descending order and plot the cumulative variance explained by each component and all components before it. In the case of iris, we will have four points on our scree plot, one for each principal component. Each component on its own explains a percentage of the total variance, and when the percentages of all the components are added up, they should account for 100% of the variance in the dataset.

Let's calculate the percentage of variance explained by each eigenvector (principal component) by taking the eigenvalue associated with that eigenvector and dividing it by the sum of all eigenvalues:

# the percentage of the variance captured by each eigenvalue
# is equal to the eigenvalue of that component divided by
# the sum of all eigenvalues

explained_variance_ratio = eig_val_cov/eig_val_cov.sum()
explained_variance_ratio

array([ 0.92461621, 0.05301557, 0.01718514, 0.00518309])

What this is telling us is that our four principal components differ vastly in the amount of variance they account for. The first principal component, as a single feature/column, is able to account for over 92% of the variance in the data. That is astonishing! It means that this single super-column can theoretically do nearly all of the work of the four original columns.
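To make that idea concrete, here is a minimal sketch of building that super-column by projecting the data onto the first eigenvector. The names X (the mean-centered iris feature matrix) and eig_vec_cov (the eigenvector matrix from the earlier decomposition, with its columns ordered to match the descending eigenvalues) are assumptions following the eig_val_cov naming convention, not code shown in this section:

# a sketch, not output from the text: project the data onto the
# first principal component to obtain the single super-column
# (assumes X is the mean-centered iris matrix and eig_vec_cov holds
# the eigenvectors, ordered to match the descending eigenvalues)

first_pc = eig_vec_cov[:, 0]        # eigenvector with the largest eigenvalue
super_column = X.dot(first_pc)      # one value per row instead of four

super_column.shape                  # (150,) -- one projected value per flower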

To visualize our scree plot, let's create a plot with the four principal components on the x axis and the cumulative variance explained on the y axis. For each data point, the y value represents the total percentage of variance explained by all of the principal components up to and including that one:

# Scree Plot

plt.plot(np.arange(1, 5), np.cumsum(explained_variance_ratio))
plt.title('Scree Plot')
plt.xlabel('Principal Component (k)')
plt.ylabel('% of Variance Explained <= k')
plt.show()

The following is the output of the preceding code:

This is telling us that the first two components by themselves account for nearly 98% of the total variance of the original dataset, meaning that if we used only the first two eigenvectors as our new principal components, we would be in good shape. We would be able to shrink our dataset to half its width (from four columns to two) while maintaining integrity in model performance and speeding up computation. We will take a closer look at machine learning examples to validate these theoretical notions in the upcoming sections.
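As a rough sketch of what that two-column reduction looks like in code (again assuming X is the mean-centered iris matrix and eig_vec_cov holds the eigenvectors ordered to match the descending eigenvalues; both names are assumptions carried over from the earlier sketch):

# keep only the first two eigenvectors and project the data onto them
top_two_pcs = eig_vec_cov[:, :2]    # 4 x 2 matrix of the top two eigenvectors
X_reduced = X.dot(top_two_pcs)      # 150 x 4 becomes 150 x 2

X_reduced.shape                     # (150, 2)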

An eigenvalue decomposition will always result in as many eigenvectors as we have features. It is up to us to choose how many of them to use as principal components once they are all calculated. This highlights the fact that PCA, like most other algorithms in this text, is semi-supervised and requires some human input.
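As one illustration of that human input (a common rule of thumb rather than anything prescribed here), we could pick the smallest k whose cumulative explained variance crosses a chosen threshold:

# choose the smallest k whose cumulative explained variance
# reaches a chosen threshold (the 95% figure is arbitrary)
threshold = 0.95
k = np.argmax(np.cumsum(explained_variance_ratio) >= threshold) + 1
k                                   # 2 for the eigenvalues above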