Principal Component Analysis

Another common approach to the problem of reducing the dimensionality of a high-dimensional dataset is based on the assumption that, normally, the total variance is not explained equally by all components. If p_data is a multivariate Gaussian distribution with covariance matrix Σ, then the entropy (which is a measure of the amount of information contained in the distribution) is as follows:

H(p_data) = (1/2) ln[(2πe)^n det(Σ)]

Here, det(Σ) is equal to the product of the eigenvalues of Σ (that is, of the variances along the principal directions).

Therefore, if some components have a very low variance, they also have a limited contribution to the entropy, providing little additional information. Hence, they can be removed without a high loss of accuracy.

Just as we've done with FA, let's consider a dataset drawn from p_data ∼ N(0, Σ) (for simplicity, we assume that it's zero-centered, even if it's not necessary):

X = {x_1, x_2, ..., x_M} with x_i ∈ ℝ^n

Our goal is to define a linear transformation, z = A^T x (a vector is normally considered a column, therefore x has a shape (n × 1)), such as the following:

z = A^T x ∈ ℝ^k with k < n

As we want to find out the directions where the variance is higher, we can build our transformation matrix, A, starting from the eigen decomposition of the input covariance matrix, Σ (which is real, symmetric, and positive definite):

Σ = VΩV^T

V is an (n × n) matrix containing the eigenvectors (as columns), while Ω is a diagonal matrix containing the eigenvalues. Moreover, V is also orthogonal, hence the eigenvectors constitute a basis. An alternative approach is based on the singular value decomposition (SVD), which has an incremental variant, and for which there are algorithms that can compute a decomposition truncated at an arbitrary number of components, speeding up convergence (such as Scikit-Learn's TruncatedSVD implementation).
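As a minimal sketch of both options (assuming a hypothetical zero-centered NumPy array X with the samples as rows; all the names and sizes here are illustrative), the eigen decomposition of the sample covariance matrix can be computed with np.linalg.eigh(), while TruncatedSVD extracts only the first k components working directly on X:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical zero-centered dataset: M = 500 samples, n = 10 features
rng = np.random.RandomState(1000)
X = rng.normal(0.0, 1.0, size=(500, 10))
X -= np.mean(X, axis=0)

# Eigen decomposition of the sample covariance matrix (real and symmetric)
Sigma_s = np.dot(X.T, X) / X.shape[0]
Omega, V = np.linalg.eigh(Sigma_s)
Omega, V = Omega[::-1], V[:, ::-1]  # eigh() returns ascending order; sort descending

# Truncated SVD limited to the first 3 components, working directly on X
tsvd = TruncatedSVD(n_components=3)
Z = tsvd.fit_transform(X)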

In this case, it's immediately noticeable that the sample covariance is as follows:

Σ_s = (1/M) X^T X

If we apply the SVD to the matrix X (each row represents a single sample, with a shape (1 × n)), we obtain the following:

X = UΛV^T

U is a unitary matrix containing (as columns) the left singular vectors (the eigenvectors of XX^T), V (also unitary) contains (as columns) the right singular vectors (corresponding to the eigenvectors of X^T X), while Λ is a diagonal matrix containing the singular values of X (which are the square roots of the eigenvalues of both XX^T and X^T X). Conventionally, the singular values are sorted in descending order and the singular vectors are rearranged to match the corresponding positions.

Hence, we can directly use the matrix Λ to select the most relevant eigenvalues (the square root is an increasing function and doesn't change the order) and the matrix V to retrieve the corresponding eigenvectors (the factor 1/M is only a proportionality constant). In this way, we don't need to compute and eigen decompose the covariance matrix Σ (which contains n × n elements), and we can exploit some very fast approximate algorithms that work only with the dataset (without computing X^T X). Using the SVD, the transformation of X can be done directly, considering that U and V are unitary matrices (this means that UU^T = U^T U = I; therefore, the conjugate transpose is also the inverse):

XV = UΛV^T V = UΛ
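This identity is easy to verify numerically (a sketch on a hypothetical zero-centered matrix X; note that np.linalg.svd() returns the matrix V already transposed):

import numpy as np

# Hypothetical zero-centered dataset
rng = np.random.RandomState(1000)
X = rng.normal(0.0, 1.0, size=(500, 10))
X -= np.mean(X, axis=0)

# np.linalg.svd() returns the right singular vectors already transposed (Vh = V^T)
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Rotation into the eigenvector space: XV = U Lambda
print(np.allclose(np.dot(X, Vh.T), U * s))  # True (up to numerical precision)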

Right now, X has only been projected into the eigenvector space (it has been simply rotated) and its dimensionality hasn't changed. However, from the definition of an eigenvector, we know that the following is true:

Σv_i = λ_i v_i

If λ_i is large, the projection along v_i will be amplified proportionally to the variance explained by the direction of the corresponding eigenvector. Therefore, if it hasn't already been done, we can sort (and rename) the eigenvalues and the corresponding eigenvectors so as to have the following:

λ_1 ≥ λ_2 ≥ ... ≥ λ_n

If we select the top k eigenvalues, we can build a transformation matrix based on the corresponding eigenvectors (the principal components) that projects X onto a subspace of the original eigenvector space:

A_k = (v_1 v_2 ... v_k), with a shape (n × k)

Using the SVD, instead of A_k, we can directly truncate U and Λ, creating the matrices U_k (which contains only the first k left singular vectors) and Λ_k, a diagonal matrix with the top k singular values.

When choosing the value for k, we are assuming that the following is true:

(λ_1 + λ_2 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_n) ≈ 1

To achieve this goal, it is normally necessary to compare the performances obtained with different numbers of components. In the following graph, the explained variance ratio (the variance explained by each component divided by the total variance) and the cumulative variance are plotted as functions of the number of components:

Explained variance per component (left) and cumulative variance per component (right)

In this case, the first 10 components are able to explain 80% of the total variance. The remaining 25 components have an increasingly smaller impact and could be removed. However, the choice must always be based on the specific context, considering the loss of value induced by the loss of information.
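A sketch of this kind of analysis (on a hypothetical dataset X; the 80% threshold is only an example) can be based on Scikit-Learn's PCA, which exposes the explained_variance_ratio_ attribute, whose cumulative sum can be used to pick the number of components:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset with correlated features (M = 500, n = 35)
rng = np.random.RandomState(1000)
X = np.dot(rng.normal(0.0, 1.0, size=(500, 35)), rng.normal(0.0, 1.0, size=(35, 35)))

pca = PCA(n_components=35)
pca.fit(X)

# Per-component and cumulative explained variance ratios
evr = pca.explained_variance_ratio_
cumulative = np.cumsum(evr)

# Smallest number of components explaining at least 80% of the total variance
k = np.argmax(cumulative >= 0.8) + 1
print(k, cumulative[k - 1])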

A trick for determining the right number of components is based on the analysis of the eigenvalues of Σ (or, equivalently, of the singular values of X). After sorting them in descending order, it's possible to consider the differences between subsequent values, d = {λ1 - λ2, λ2 - λ3, ..., λn-1 - λn}. The largest difference, λk - λk+1, determines the index k of a potential optimal reduction (obviously, it's necessary to impose a constraint on the minimum number of components to keep, because λ1 - λ2 is often the largest difference). For example, if the sorted eigenvalues are {4, 4, 3, 0.2, 0.18, 0.05}, the original dimensionality is n = 6; the largest difference is λ3 - λ4 = 2.8, so it's reasonable to reduce the dimensionality to k = 3. The reason is straightforward: the eigenvalues determine the magnitude of each component, but we need a relative measure because the scale changes. In the example, the last three eigenvectors point to directions where the explained variance is negligible when compared to the first three components.
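The same heuristic can be implemented in a few lines (a sketch; the function name gap_heuristic and the min_k constraint are illustrative, not part of any library):

import numpy as np

def gap_heuristic(eigenvalues, min_k=1):
    # Sort the eigenvalues in descending order
    lambdas = np.sort(np.asarray(eigenvalues))[::-1]
    # Differences between subsequent eigenvalues
    d = lambdas[:-1] - lambdas[1:]
    # Index k of the largest gap, subject to a minimum number of components
    return np.argmax(d[min_k - 1:]) + min_k

# Example from the text: the largest gap is lambda_3 - lambda_4 = 2.8, hence k = 3
print(gap_heuristic([4.0, 4.0, 3.0, 0.2, 0.18, 0.05]))  # 3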

Once we've defined the transformation matrix A_k, it's possible to perform the actual projection of the original vectors onto the new subspace, through the following relation:

z = A_k^T x, with a shape (k × 1)

The complete transformation of the whole dataset is simply obtained as follows:

Z = XA_k, with a shape (M × k)

Now, let's analyze the new covariance matrix, E[zz^T]. If the original distribution is p_data ∼ N(0, Σ), p(z) will also be Gaussian, with the following mean and covariance:

E[z] = A^T E[x] = 0 and Σ_z = E[zz^T] = A^T E[xx^T] A = A^T Σ A

We know that V is orthogonal; therefore, v_i • v_j = 0 if i ≠ j and v_i • v_i = 1. If we analyze the term A^T V (remembering that the columns of A are the first k eigenvectors), we get the following:

A^T V = (v_1 v_2 ... v_k)^T (v_1 v_2 ... v_n) = (I_k | 0), hence Σ_z = A^T VΩV^T A = (I_k | 0) Ω (I_k | 0)^T

Considering that Ω is diagonal, the resulting matrix Σz will be diagonal as well. This means that the PCA decorrelates the transformed covariance matrix. At the same time, we can state that every algorithm that decorrelates the input covariance matrix performs a PCA (with or without dimensionality reduction). For example, the whitening process is a particular PCA without dimensionality reduction, while Isomap (see Chapter 3, Graph-Based Semi-Supervised Learning) performs the same operation working with the Gram matrix with a more geometric approach. This result will be used in Chapter 6, Hebbian Learning, to show how some particular neural networks can perform a PCA without eigen decomposing Σ.
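The decorrelation can be checked directly: after projecting onto all the eigenvectors, the sample covariance matrix of Z is (numerically) diagonal, with the eigenvalues on the diagonal (a sketch on a hypothetical dataset; the matrix sizes are arbitrary):

import numpy as np

# Hypothetical zero-centered dataset with correlated features
rng = np.random.RandomState(1000)
X = np.dot(rng.normal(0.0, 1.0, size=(500, 5)), rng.normal(0.0, 1.0, size=(5, 5)))
X -= np.mean(X, axis=0)

# Eigen decomposition of the sample covariance matrix
Sigma_s = np.dot(X.T, X) / X.shape[0]
Omega, V = np.linalg.eigh(Sigma_s)

# Project and compute the covariance of Z: the off-diagonal terms vanish
Z = np.dot(X, V)
Sigma_z = np.dot(Z.T, Z) / Z.shape[0]
print(np.allclose(Sigma_z, np.diag(Omega)))  # True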

Let's now consider an FA with homoscedastic noise. We have seen that the covariance matrix of the conditional distribution, p(X|Z; θ), is equal to AA^T + Ω. In the case of homoscedastic noise, it becomes AA^T + ωI. For a generic covariance matrix, Σ, it's possible to prove that adding a constant diagonal matrix (Σ + aI) doesn't modify the original eigenvectors and shifts all the eigenvalues by the same quantity:

(Σ + aI)v = Σv + av = λv + av = (λ + a)v
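The property is easy to verify numerically (a sketch with a random symmetric, positive definite matrix and a = 1.5; both values are arbitrary):

import numpy as np

# Hypothetical symmetric, positive definite covariance matrix
rng = np.random.RandomState(1000)
B = rng.normal(0.0, 1.0, size=(4, 4))
Sigma = np.dot(B, B.T) + 4.0 * np.eye(4)
a = 1.5

w1, V1 = np.linalg.eigh(Sigma)
w2, V2 = np.linalg.eigh(Sigma + a * np.eye(4))

print(np.allclose(w2, w1 + a))              # the eigenvalues are shifted by a
print(np.allclose(np.abs(V1), np.abs(V2)))  # the eigenvectors are unchanged (up to sign)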

Therefore, we can consider the generic case of absence of noise without loss of generality. We know that the goal of FA (with Ω = (0)) is finding a matrix, A, so that AA^T ≈ Q (the input covariance). Hence, thanks to the symmetry of Q and imposing the asymptotic equality, we can write the following:

AA^T = Q = VΩV^T = (VΩ^(1/2))(VΩ^(1/2))^T ⇒ A = VΩ^(1/2)

This result implies that FA is a more generic (and robust) way to manage dimensionality reduction in the presence of heteroscedastic noise, while PCA is a restriction to the homoscedastic case. When a PCA is performed on datasets affected by heteroscedastic noise, the MLE worsens because the different noise components, altering the magnitude of the eigenvalues at different levels, can lead to the selection of eigenvectors that, in the original dataset, explain only a low percentage of the variance (and that, in a noiseless scenario, would normally be discarded in favor of more important directions). If you think back to the example discussed at the beginning of the previous section, we know that the noise is strongly heteroscedastic, but we don't have any tool to inform the PCA about it, so the variance of the first component will be much higher than expected, considering that the two sources are identical. Unfortunately, in a real-life scenario, the noise is often correlated, and neither a factor analysis nor a PCA can efficiently solve the problem when the noise power is very high. In all those cases, more sophisticated denoising techniques must be employed. Whenever, instead, it's possible to define an approximately diagonal noise covariance matrix, FA is surely more robust and efficient than PCA; the latter should be considered only in noiseless or quasi-noiseless scenarios. In both cases, the results can never lead to well-separated features. For this reason, ICA has been studied and many different strategies have been engineered.
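Since Scikit-Learn provides both PCA and FactorAnalysis (and both expose a score() method returning the average log-likelihood), the effect of heteroscedastic noise can be checked empirically with cross-validation (a sketch on a hypothetical rank-3 dataset; the noise levels are arbitrary):

import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

# Hypothetical rank-3 dataset corrupted by strongly heteroscedastic noise
rng = np.random.RandomState(1000)
W = rng.normal(0.0, 1.0, size=(3, 10))
X = np.dot(rng.normal(0.0, 1.0, size=(500, 3)), W)
X += rng.normal(0.0, 1.0, size=(500, 10)) * np.linspace(0.5, 4.0, 10)

# Average held-out log-likelihood of the two models with 3 components
pca_score = np.mean(cross_val_score(PCA(n_components=3), X, cv=5))
fa_score = np.mean(cross_val_score(FactorAnalysis(n_components=3), X, cv=5))
print(pca_score, fa_score)

In a setting like this one, FA normally achieves a higher held-out log-likelihood than PCA for the same number of components, because its diagonal noise term can absorb the feature-dependent variances.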

The complete algorithm for the PCA is as follows:

  1. Create a matrix X with shape (M × n), containing all the samples x_i as rows
  2. Eigen decomposition version:
      1. Compute the sample covariance matrix Σ_s = (1/M) X^T X
      2. Eigen decompose Σ_s = VΩV^T
  3. SVD version:
      1. Compute the SVD of the matrix X = UΛV^T
  4. Select the top k eigenvalues (from Ω or Λ) and the corresponding eigenvectors (from V)
  5. Create the matrix A with shape (n × k), whose columns are the top k eigenvectors (each of them has a shape (n × 1))
  6. Project the dataset onto the low-dimensional space: Z = XA (eigen decomposition) or Z = U_kΛ_k (SVD)
Some packages (such as NumPy and SciPy; for example, both np.linalg.svd() and scipy.linalg.svd() behave this way) return the matrix V (right singular vectors) already transposed. In this case, it's necessary to use V^T instead of V when selecting the eigenvectors (step 4 of the algorithm). I suggest always checking the documentation when implementing these kinds of algorithms.
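To summarize, here is a minimal NumPy-only sketch of both versions of the algorithm (the function names are illustrative and X is assumed to be zero-centered); the two projections coincide up to the sign of each principal component:

import numpy as np

def pca_eig(X, k):
    # Eigen decomposition version: Sigma_s = (1/M) X^T X = V Omega V^T
    Sigma_s = np.dot(X.T, X) / X.shape[0]
    Omega, V = np.linalg.eigh(Sigma_s)
    top = np.argsort(Omega)[::-1][:k]   # indices of the top k eigenvalues
    A = V[:, top]                       # (n x k) matrix of principal components
    return np.dot(X, A)

def pca_svd(X, k):
    # SVD version: X = U Lambda V^T (np.linalg.svd() returns V already transposed)
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]             # Z = U_k Lambda_k

# Hypothetical zero-centered dataset
rng = np.random.RandomState(1000)
X = rng.normal(0.0, 1.0, size=(200, 8))
X -= np.mean(X, axis=0)

# The two projections coincide up to the sign of each component
print(np.allclose(np.abs(pca_eig(X, 3)), np.abs(pca_svd(X, 3))))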