We can repeat the same experiment performed with FA and heteroscedastic noise to assess the maximum likelihood score of PCA. We are going to use the PCA class with the same number of components (n_components=64). To achieve the maximum accuracy, we also set the svd_solver='full' parameter, which forces Scikit-Learn to apply a full SVD instead of the truncated version. In this way, the top eigenvalues are selected only after the complete decomposition, avoiding the risk of imprecise estimations:
from sklearn.decomposition import PCA
pca = PCA(n_components=64, svd_solver='full', random_state=1000)
Xpca = pca.fit_transform(Xh)
print(pca.score(Xh))
-3772.7483580391995
The result is not surprising: the score is much lower than the one obtained with FA, because of the incorrect estimations caused by the heteroscedastic noise. I invite the reader to compare the results with different datasets and noise levels, considering that the training performance of PCA is normally higher than that of FA. Therefore, when working with large datasets, a trade-off between accuracy and training speed is often desirable. As with FA, it's possible to retrieve the components through the components_ instance variable.
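As a quick check, the following sketch inspects the components_ array. Since Xh is built earlier in the chapter, it is rebuilt here as a hypothetical stand-in (the digits dataset corrupted with per-feature heteroscedastic Gaussian noise; the exact noise levels are an assumption):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Hypothetical stand-in for the noisy dataset Xh used in the chapter:
# digits data normalized to [0, 1] plus heteroscedastic noise (one
# standard deviation per feature, drawn uniformly in [0, 0.25])
digits = load_digits()
rng = np.random.RandomState(1000)
X = digits['data'] / np.max(digits['data'])
Xh = X + rng.uniform(0.0, 0.25, size=X.shape[1]) * rng.normal(size=X.shape)

pca = PCA(n_components=64, svd_solver='full', random_state=1000)
pca.fit(Xh)

# Each row of components_ is a principal direction in the original
# 64-dimensional feature space; the rows form an orthonormal basis
print(pca.components_.shape)  # (64, 64)
```

Each row of components_ has unit norm and is orthogonal to the others, so projecting onto them preserves distances in the retained subspace.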
It's interesting to check the total explained variance (as a fraction of the total input variance) through the instance array explained_variance_ratio_, which contains the ratio for each component:
print(np.sum(pca.explained_variance_ratio_))
0.862522337381
With 64 components, we are explaining 86% of the total input variance. Of course, it's also useful to compare the per-component explained variance using a plot:
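A minimal sketch of such a plot, again assuming the hypothetical reconstruction of Xh (digits plus per-feature noise) and a working matplotlib installation:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Hypothetical stand-in for the noisy dataset Xh used in the chapter
digits = load_digits()
rng = np.random.RandomState(1000)
X = digits['data'] / np.max(digits['data'])
Xh = X + rng.uniform(0.0, 0.25, size=X.shape[1]) * rng.normal(size=X.shape)

pca = PCA(n_components=64, svd_solver='full', random_state=1000)
pca.fit(Xh)

# Bar plot of the explained variance ratio of each component
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(np.arange(1, 65), pca.explained_variance_ratio_)
ax.set_xlabel('Component')
ax.set_ylabel('Explained variance ratio')
fig.savefig('pca_explained_variance.png')
```

Since PCA sorts the eigenvalues in descending order, the bars decrease monotonically from left to right.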
As usual, the first components explain the largest part of the variance; however, after about the twentieth component, each contribution drops below 1% (decreasing toward 0%). This analysis suggests two observations. First, it's possible to further reduce the number of components with an acceptable loss (using the previous snippet, it's easy to restrict the sum to the first n components and compare the results). Second, PCA will be able to exceed a higher threshold (such as 95%) only by adding a large number of new components. In this particular case, we know that the dataset is made up of handwritten digits; therefore, we can suppose that the tail is due to secondary differences (a line slightly longer than average, a marked stroke, and so on); hence, we can drop all the components beyond the first 64 (or even fewer) without problems (it's also easy to verify this visually by rebuilding an image with the inverse_transform() method). However, it is always best practice to perform a complete analysis before moving on to further processing steps, particularly when the dimensionality of X is high.
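Both observations can be checked numerically. The following sketch (using the same hypothetical reconstruction of Xh as in the earlier snippets) computes the cumulative explained variance to find how many components are needed for a given threshold, and rebuilds the samples via inverse_transform():

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Hypothetical stand-in for the noisy dataset Xh used in the chapter
digits = load_digits()
rng = np.random.RandomState(1000)
X = digits['data'] / np.max(digits['data'])
Xh = X + rng.uniform(0.0, 0.25, size=X.shape[1]) * rng.normal(size=X.shape)

pca = PCA(n_components=64, svd_solver='full', random_state=1000)
Xpca = pca.fit_transform(Xh)

# Cumulative explained variance: the first index reaching the threshold
# tells us how many components are needed
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_95 = int(np.argmax(cumulative >= 0.95) + 1)
print('Components needed for a 95% threshold:', n_for_95)

# Rebuild the samples from the reduced representation; with all 64
# components retained, the reconstruction is (numerically) exact
Xrec = pca.inverse_transform(Xpca)
print('Reconstruction shape:', Xrec.shape)
```

Rerunning the same computation with a smaller n_components shows the reconstruction error growing as the tail components are discarded, which is exactly the trade-off discussed above.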