We can repeat the same experiment performed with FA and heteroscedastic noise to assess the maximum likelihood score of PCA. We are going to use the PCA class with the same number of components (n_components=64). To achieve the maximum accuracy, we also set the svd_solver='full' parameter, which forces Scikit-Learn to apply a full SVD instead of the truncated version. In this way, the top eigenvalues are selected only after the complete decomposition, avoiding the risk of imprecise estimations:
from sklearn.decomposition import PCA
pca = PCA(n_components=64, svd_solver='full', random_state=1000)
Xpca = pca.fit_transform(Xh)
print(pca.score(Xh))
-3772.7483580391995
The result is not surprising: the score is much lower than the one obtained with FA, because of the incorrect estimations caused by the heteroscedastic noise. I invite the reader to compare the results with different datasets and noise levels, considering that the training performance of PCA is normally higher than that of FA. Therefore, when working with large datasets, a trade-off between accuracy and training speed is often desirable. As with FA, it's possible to retrieve the components through the components_ instance variable.
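As a quick check, the following sketch inspects the components_ array. Since Xh is built earlier in the chapter, it is rebuilt here as a hypothetical stand-in (the digits dataset corrupted with per-feature heteroscedastic Gaussian noise; the exact noise levels are an assumption):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Hypothetical stand-in for the noisy dataset Xh used in the chapter:
# digits data normalized to [0, 1] plus heteroscedastic noise (one
# standard deviation per feature, drawn uniformly in [0, 0.25])
digits = load_digits()
rng = np.random.RandomState(1000)
X = digits['data'] / np.max(digits['data'])
Xh = X + rng.uniform(0.0, 0.25, size=X.shape[1]) * rng.normal(size=X.shape)

pca = PCA(n_components=64, svd_solver='full', random_state=1000)
pca.fit(Xh)

# Each row of components_ is a principal direction in the original
# 64-dimensional feature space; the rows form an orthonormal basis
print(pca.components_.shape)  # (64, 64)
```

Each row of components_ has unit norm and is orthogonal to the others, so projecting onto them preserves distances in the retained subspace.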
It's interesting to check the total explained variance (as a fraction of the total input variance) through the instance array explained_variance_ratio_, which contains the ratio for each component:
print(np.sum(pca.explained_variance_ratio_))
0.862522337381
With 64 components, we are explaining 86% of the total input variance. Of course, it's also useful to compare the per-component explained variance using a plot:
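A minimal sketch of such a plot, again assuming the hypothetical reconstruction of Xh (digits plus per-feature noise) and a working matplotlib installation:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Hypothetical stand-in for the noisy dataset Xh used in the chapter
digits = load_digits()
rng = np.random.RandomState(1000)
X = digits['data'] / np.max(digits['data'])
Xh = X + rng.uniform(0.0, 0.25, size=X.shape[1]) * rng.normal(size=X.shape)

pca = PCA(n_components=64, svd_solver='full', random_state=1000)
pca.fit(Xh)

# Bar plot of the explained variance ratio of each component
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(np.arange(1, 65), pca.explained_variance_ratio_)
ax.set_xlabel('Component')
ax.set_ylabel('Explained variance ratio')
fig.savefig('pca_explained_variance.png')
```

Since PCA sorts the eigenvalues in descending order, the bars decrease monotonically from left to right.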
As usual, the first components explain the largest part of the variance; however, after about the twentieth component, each contribution drops below 1% (decreasing toward 0%). This analysis suggests two observations. First, it's possible to further reduce the number of components with an acceptable loss (using the previous snippet, it's easy to restrict the sum to the first n components and compare the results). Second, PCA will be able to exceed a higher threshold (such as 95%) only by adding a large number of new components. In this particular case, we know that the dataset is made up of handwritten digits; therefore, we can suppose that the tail is due to secondary differences (a line slightly longer than average, a marked stroke, and so on); hence, we can drop all the components beyond the first 64 (or even fewer) without problems (it's also easy to verify this visually by rebuilding an image with the inverse_transform() method). However, it is always best practice to perform a complete analysis before moving on to further processing steps, particularly when the dimensionality of X is high.
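Both observations can be checked numerically. The following sketch (using the same hypothetical reconstruction of Xh as in the earlier snippets) computes the cumulative explained variance to find how many components are needed for a given threshold, and rebuilds the samples via inverse_transform():

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Hypothetical stand-in for the noisy dataset Xh used in the chapter
digits = load_digits()
rng = np.random.RandomState(1000)
X = digits['data'] / np.max(digits['data'])
Xh = X + rng.uniform(0.0, 0.25, size=X.shape[1]) * rng.normal(size=X.shape)

pca = PCA(n_components=64, svd_solver='full', random_state=1000)
Xpca = pca.fit_transform(Xh)

# Cumulative explained variance: the first index reaching the threshold
# tells us how many components are needed
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_95 = int(np.argmax(cumulative >= 0.95) + 1)
print('Components needed for a 95% threshold:', n_for_95)

# Rebuild the samples from the reduced representation; with all 64
# components retained, the reconstruction is (numerically) exact
Xrec = pca.inverse_transform(Xpca)
print('Reconstruction shape:', Xrec.shape)
```

Rerunning the same computation with a smaller n_components shows the reconstruction error growing as the tail components are discarded, which is exactly the trade-off discussed above.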