An overview of the principal components

PCA is the process of finding the principal components. What exactly are these?

We can consider a component to be a normalized linear combination of the features (James, 2012). The first principal component in a dataset is the linear combination that captures the maximum variance in the data. The second component is created by selecting another linear combination that maximizes the variance, with the constraint that its direction is perpendicular to the first component. The subsequent components (there are as many components as there are variables) follow this same rule.
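
As a concrete (if hypothetical) illustration of this definition, the following Python/NumPy sketch builds a small correlated two-variable dataset, takes the first principal direction from the covariance matrix, and confirms that no arbitrary unit-length combination of the two features captures more variance; the data and seed are assumptions, not an example from this book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pair of correlated variables (illustrative only)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                 # center the data

# The principal directions are the eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, -1]                   # unit-length loadings for the first component

# Variance along PC1 versus variance along a random unit-length combination
random_dir = rng.normal(size=2)
random_dir /= np.linalg.norm(random_dir)
print(np.var(X @ pc1, ddof=1))         # the maximum achievable variance
print(np.var(X @ random_dir, ddof=1))  # never larger than the value above
```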

A couple of things to note here. This definition describes a linear combination, which points to one of the key assumptions in PCA: linearity. If you ever try to apply PCA to a dataset of variables having low correlations, you will likely end up with a meaningless analysis. Another key assumption is that the mean and variance of a variable are sufficient statistics. What this tells us is that the data should fit a normal distribution so that the covariance matrix fully describes our dataset, that is, multivariate normality. However, PCA is fairly robust to non-normally distributed data and is even used in conjunction with binary variables, so the results are still interpretable.
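
A quick, informal check of the correlation assumption might look like the following sketch (synthetic data; the variable layout is an assumption): if the off-diagonal correlations are all close to zero, there is little shared variance for the components to summarize and the analysis is unlikely to be meaningful:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed data: two correlated variables plus one that is essentially independent
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     0.85 * x1 + 0.3 * rng.normal(size=200),
                     rng.normal(size=200)])

# Inspect the correlation matrix before running PCA; weak correlations
# throughout would suggest PCA will add little value
print(np.corrcoef(X, rowvar=False).round(2))
```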

Now, what is this direction described here and how is the linear combination determined? The best way to grasp this subject is with visualization. Let's take a small dataset with two variables and plot it. PCA is sensitive to scale, so the data has been scaled with a mean of zero and standard deviation of one. You can see in the following diagram that this data happens to form the shape of an oval with the diamonds representing each observation:

Looking at the plot, the data has the most variance along the x axis, so we can draw a dashed horizontal line to represent our first principal component, as shown in the following diagram. This component is the linear combination of our two variables, or PC1 = α11X1 + α12X2, where the coefficient weights are the variable loadings on the principal component. They define the direction along which the data varies the most. The loadings are constrained so that the sum of their squares equals one (α11² + α12² = 1), in order to prevent the selection of arbitrarily large values. Another way to look at this is that the dashed line minimizes the total distance between itself and the data points. This distance is shown for a couple of points as arrows, as follows:

The second principal component is then calculated in the same way, but it is uncorrelated with the first, that is, its direction is at a right angle or orthogonal to the first principal component. The following plot shows the second principal component added as a dotted line:

With the principal component loadings calculated for each variable, the algorithm will then provide us with the principal component scores. The scores are calculated for each observation on each principal component. For PC1 and the first observation, this would equate to the formula Z11 = α11 * (X11 - average of X1) + α12 * (X12 - average of X2). For PC2 and the first observation, the equation would be Z12 = α21 * (X11 - average of X1) + α22 * (X12 - average of X2). These principal component scores are now the new feature space to be used in whatever analysis you will undertake.
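
Pulling the last few steps together, here is a minimal Python/NumPy sketch (synthetic data standing in for X1 and X2, not the book's own example) that scales the variables, derives the loadings from the covariance matrix, confirms the unit-length and orthogonality constraints described earlier, and then computes the component scores:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic two-variable dataset standing in for X1 and X2 (illustrative only)
x1 = rng.normal(loc=5.0, size=100)
x2 = 0.6 * x1 + 0.5 * rng.normal(size=100)
X = np.column_stack([x1, x2])

# Scale to mean 0 and standard deviation 1, as PCA is sensitive to scale
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Loadings come from the eigendecomposition of the covariance matrix;
# columns are reordered so the first column is PC1 (largest eigenvalue)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
loadings = eigvecs[:, ::-1]

# Each loading vector has unit length, and the two directions are orthogonal
print(np.linalg.norm(loadings, axis=0))      # [1. 1.]
print(loadings[:, 0] @ loadings[:, 1])       # ~0

# Scores: for observation i and component j,
# Z[i, j] = sum over k of loadings[k, j] * (X[i, k] - average of column k)
scores = (X - X.mean(axis=0)) @ loadings
print(scores[0])                             # Z11 and Z12 for the first observation
```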

Recall that the algorithm will create as many principal components as there are variables, accounting for 100 percent of the possible variance. So, how do we narrow down the components to achieve the original objective of reducing the dimensions? There are some heuristics that one can use, and we will look at the specifics in the upcoming modeling process; a common rule is to retain a principal component only if its eigenvalue is greater than one. While the algebra behind the estimation of eigenvalues and eigenvectors is outside the scope of this book, it is important to discuss what they are and how they are used in PCA.

The optimized linear weights are determined using linear algebra in order to create what is referred to as an eigenvector. The weights are optimal because no other possible combination could explain variation better than they do. The eigenvalue for a principal component, then, is the total amount of variation that it explains in the entire dataset.
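
The link between an eigenvector and its eigenvalue can be checked numerically. In the following assumed NumPy sketch, the variance of the data projected onto each eigenvector of the covariance matrix matches the corresponding eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative centered two-variable data
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.9 * x1 + 0.3 * rng.normal(size=500)])
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

# The eigenvalue attached to each eigenvector equals the variance of the data
# projected onto that direction, that is, the variation the component explains
for value, vector in zip(eigvals, eigvecs.T):
    print(value, np.var(X @ vector, ddof=1))   # the two numbers agree
```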

Recall that the equation for the first principal component is PC1 = α11X1 + α12X2.

As the first principal component accounts for the largest amount of variation, it will have the largest eigenvalue. The second component will have the second highest eigenvalue, and so forth. So, an eigenvalue greater than one indicates that the principal component accounts for more variance than any single one of the original (standardized) variables does on its own. If you divide each eigenvalue by the sum of all the eigenvalues, you will have the proportion of the total variance that each component explains. This will also aid you in determining a proper cut-off point.
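
As an illustration of the eigenvalue-greater-than-one heuristic and the variance-explained calculation, here is a hedged NumPy sketch on an assumed five-variable standardized dataset (none of this is the book's own data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed dataset with five variables, two of which are strongly correlated
X = rng.normal(size=(300, 5))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize before PCA

# Eigenvalues of the correlation matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Eigenvalue criterion: keep components whose eigenvalue exceeds one
print(eigvals)
print(eigvals > 1)

# Proportion (and cumulative proportion) of total variance each component explains
proportion = eigvals / eigvals.sum()
print(proportion)
print(np.cumsum(proportion))
```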

The eigenvalue criterion is certainly not a hard-and-fast rule and must be balanced with your knowledge of the data and business problem at hand. Once you have selected the number of principal components, you can rotate them in order to simplify their interpretation.
