An example of factor analysis with Scikit-Learn

We can now work through an example of FA with Scikit-Learn using the MNIST handwritten digits dataset (70,000 grayscale images of size 28 × 28), both in its original version and with added heteroscedastic noise (with ωi randomly selected from the interval [0, 0.75]).

The first step is to load and zero-center the original dataset (I'm using the helper functions defined in Chapter 1, Machine Learning Model Fundamentals):

import numpy as np

from sklearn.datasets import fetch_mldata

# Load the MNIST dataset (on recent Scikit-Learn versions, fetch_mldata has been
# removed and fetch_openml('mnist_784') can be used instead)
digits = fetch_mldata('MNIST original')

# Zero-center the samples (zero_center is defined in Chapter 1) and shuffle them
X = zero_center(digits['data'].astype(np.float64))
np.random.shuffle(X)

# Add heteroscedastic noise whose standard deviations are drawn from [0, 0.75]
Omega = np.random.uniform(0.0, 0.75, size=X.shape[1])
Xh = X + np.random.normal(0.0, Omega, size=X.shape)
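
For reference, zero_center simply removes the feature-wise mean; if you don't have the Chapter 1 helpers at hand, a minimal equivalent (assuming that this is all the function does) could be:

def zero_center(Xd):
    # Subtract the per-feature (per-pixel) mean so that every feature has zero mean
    return Xd - np.mean(Xd, axis=0)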

After this step, the variable X will contain the zero-centered original dataset, while Xh is the noisy version. The following screenshot shows a random selection of samples from both versions:
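
A figure like this can be generated, for example, with a short matplotlib snippet (an illustrative sketch, not part of the original example; it assumes matplotlib is installed):

import matplotlib.pyplot as plt

# Pick 10 random samples and show the original (top row) and noisy (bottom row) digits
idx = np.random.choice(X.shape[0], size=10, replace=False)
fig, ax = plt.subplots(2, 10, figsize=(15, 3))
for i, j in enumerate(idx):
    ax[0, i].imshow(X[j].reshape(28, 28), cmap='gray')
    ax[1, i].imshow(Xh[j].reshape(28, 28), cmap='gray')
    ax[0, i].axis('off')
    ax[1, i].axis('off')
plt.show()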

We can perform FA on both datasets using the Scikit-Learn FactorAnalysis class with the n_components=64 parameter and check the score (the average log-likelihood over all samples). If the noise variance is known (or there's a good estimation of it), it's possible to pass a starting point through the noise_variance_init parameter; otherwise, it will be initialized with a vector of ones (which corresponds to an identity noise covariance matrix):

from sklearn.decomposition import FactorAnalysis

# Fit one model (64 components) on the original dataset and one on the noisy version
fa = FactorAnalysis(n_components=64, random_state=1000)
fah = FactorAnalysis(n_components=64, random_state=1000)

Xfa = fa.fit_transform(X)
Xfah = fah.fit_transform(Xh)

print(fa.score(X))
-2162.70193446

print(fah.score(Xh))
-3046.19385694
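
Since we generated the heteroscedastic noise ourselves, its per-feature variance (Omega ** 2) is a reasonable value to pass through the noise_variance_init parameter mentioned previously; a possible sketch (not part of the original example, and its score isn't reported here) is:

# Use the known noise variances as the initial guess for the noisy dataset
fah_init = FactorAnalysis(n_components=64,
                          noise_variance_init=Omega ** 2,
                          random_state=1000)
fah_init.fit(Xh)
print(fah_init.score(Xh))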

As expected, the presence of the noise has reduced the final average log-likelihood (from about -2162 to about -3046). Following an example provided by A. Gramfort and D. A. Engemann in the original Scikit-Learn documentation, we can create a benchmark for the log-likelihood using the Ledoit-Wolf algorithm (a shrinkage method for improving the conditioning of the covariance matrix estimate, which is beyond the scope of this book; for further information, read A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices, Ledoit O., Wolf M., Journal of Multivariate Analysis, 88, 2/2004):

from sklearn.covariance import LedoitWolf

# Fit a Ledoit-Wolf covariance estimator on both datasets to use as a benchmark
ldw = LedoitWolf()
ldwh = LedoitWolf()

ldw.fit(X)
ldwh.fit(Xh)

print(ldw.score(X))
-2977.12971009

print(ldwh.score(Xh))
-2989.27874799

With the original dataset, FA performs much better than the benchmark, while it's slightly worse in the presence of heteroscedastic noise. The reader can try other combinations using a grid search with different numbers of components and noise variances, and can also experiment with the effect of removing the zero-centering step. It's possible to plot the extracted components using the components_ instance variable:
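
For example, a grid like the one in the following figure can be obtained with a snippet similar to this one (a matplotlib-based sketch, not taken from the original text):

import matplotlib.pyplot as plt

# Plot the 64 extracted components (each row of components_ is a 784-dimensional image)
fig, ax = plt.subplots(8, 8, figsize=(8, 8))
for i in range(64):
    ax[i // 8, i % 8].imshow(fa.components_[i].reshape(28, 28), cmap='gray')
    ax[i // 8, i % 8].axis('off')
plt.show()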

A plot of the 64 components extracted with the factor analysis on the original dataset

A careful analysis shows that the components are a superimposition of many low-level visual features. This is a consequence of the assumption of a Gaussian prior distribution over the components (z ∼ N(0, I)). In fact, one of the disadvantages of this distribution is its intrinsic denseness: the probability of sampling values far from the mean is often too high, while in some cases it would be desirable to have a peaked distribution that discourages values far from its mean, so as to obtain more selective components. Moreover, considering the distribution p[Z|X; θ], its covariance matrix is not, in general, diagonal (trying to impose this constraint can lead to an unsolvable problem), so the resulting multivariate Gaussian distribution isn't normally made up of independent components. In general, the single variables zi (conditioned on an input sample xi) are statistically dependent, and the reconstruction of xi is obtained with the participation of almost all the extracted features. In all these cases, we say that the coding is dense and the dictionary of features is under-complete (the dimensionality of the components is lower than dim(xi)).
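
A quick (and admittedly rough) way to observe this density is to count how many factor scores are close to zero; with a dense coding, this fraction is expected to be small (the threshold below is arbitrary and the snippet is only an illustrative sketch, not part of the original example):

# Fraction of factor scores whose absolute value falls below an arbitrary threshold
near_zero = np.mean(np.abs(Xfa) < 0.1)
print('Fraction of near-zero factor scores: {:.3f}'.format(near_zero))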

The lack of independence can also be an issue, considering that any orthogonal transformation Q applied to A (the factor loading matrix) doesn't affect the distribution p[X|Z; θ]. In fact, as QQᵀ = I, the following applies:

(AQ)(AQ)ᵀ = AQQᵀAᵀ = AAᵀ

Hence, the covariance AAᵀ (and, consequently, the covariance of x) remains unchanged.

In other words, any feature rotation (x = AQz + ν) is always a solution to the original problem, and it's impossible to decide which one is the real loading matrix. All these conditions lead to the further conclusion that the mutual information among the components is not equal to zero, nor even close to a minimum (in which case, each of them would carry a specific portion of information). On the other hand, our main goal was to reduce the dimensionality; therefore, it's not surprising to obtain dependent components, because we aim to preserve the maximum amount of original information contained in p(X) (remember that the amount of information is related to the entropy, and the latter is proportional to the variance).
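
This invariance is easy to verify numerically: drawing a random orthogonal matrix Q and rotating the loadings learned by Scikit-Learn leaves the model covariance unchanged. The following is only a sketch (it assumes SciPy is available; components_ stores the loadings as a (64, 784) matrix and noise_variance_ the diagonal noise term):

from scipy.stats import ortho_group

# Random orthogonal matrix (QQ^T = I) acting on the 64 latent dimensions
Q = ortho_group.rvs(64, random_state=1000)

A = fa.components_
C_original = A.T @ A + np.diag(fa.noise_variance_)
C_rotated = (Q @ A).T @ (Q @ A) + np.diag(fa.noise_variance_)

# The two covariance matrices coincide, so the rotated loadings are an equally valid solution
print(np.allclose(C_original, C_rotated))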

The same phenomenon can be observed with PCA (which is still based on a Gaussian assumption), but later in this chapter we're going to discuss a technique, called ICA, whose goal is to create a representation of each sample (without the constraint of dimensionality reduction) starting from a set of statistically independent features. This approach, even if it has its peculiarities, belongs to a large family of algorithms called sparse coding. In this scenario, if the corresponding dictionary has dim(zi) > dim(xi), it is called over-complete (of course, in this case, the main goal is no longer dimensionality reduction).

However, we're going to consider only the case where the dictionary is at most complete (dim(zi) = dim(xi)), because ICA with over-complete dictionaries requires a more complex approach. The level of sparsity, of course, is proportional to dim(zi), and with ICA it's always achieved as a secondary goal (the primary one is always the independence of the components).
