Principal component analysis and total variance

Simply put, PCA is a tool that enables dimension reduction of data. While this sounds simple enough, let's discuss it in a bit more detail. If we start with a dataset that has a large number of variables, say 100, then the question arises as to whether we really need all 100 variables or whether there is some redundancy in the data such that it can be summarized with fewer variables. In this case, redundancy does not mean complete duplication of a measurement but rather a significant amount of overlap.

We will get to some real data in a moment, but for now, let's consider the voting results on Senate bills in the United States. Each bill gets voted on by up to 100 senators (some may abstain), and we want to get an idea of how each bill was received by the voting members of the Senate. There are two political parties in the United States, Democratic and Republican, and there may be a lot of redundancy in voting. In fact, ample research shows that, in general, members of Congress tend to vote with the rest of their party, suggesting that such redundancy does exist. If this is the case, then we can keep only a small portion of the data and still be mostly accurate (though sometimes wrong) in figuring out how a bill was received by members of Congress.

There are a number of dimension reduction techniques, but PCA is one of the most commonly used.

Understanding the basics of PCA

The basic idea of PCA is to figure out whether data represented in multiple dimensions can be effectively modeled with fewer dimensions. PCA actually returns the same number of dimensions that we started with, but it tells us whether most of the information loads on the first few dimensions, with only a little additional information contributed by the remaining dimensions. This is probably best explained with an example using simulated data.

First, simulate data for two variables, such that one variable is strongly related to the other:

set.seed(20)
x <- sample(0:100, 1000, replace = TRUE)       # 1,000 draws from the integers 0 to 100
y <- x + sample(-10:10, 1000, replace = TRUE)  # y is x plus a small amount of noise
plot(y ~ x)

The result is shown in the following plot:

[Figure: scatter plot of y against x for the simulated data]

As can be seen in the preceding plot, we have actually plotted a fuzzy line. We are using two dimensions to locate points on this line, but if we reduced this to only a single dimension by rotating our axes, we would not lose much information regarding where on the line each point is located. The trick is figuring out the rotation that gives us only one dimension, something that is simple in this two-dimensional, well-behaved dataset, but more difficult in multidimensional and less well-behaved data. This is where PCA comes in; it rotates the axes giving us a new set of axes with just as many dimensions, but it captures as much of the variance as possible on the first dimension, as much of the remaining variance as possible on the second dimension, and so on. In this way, the first few of our newly created dimensions will explain as much of the data as possible. If they explain enough, then perhaps the following dimensions are not needed.

Mathematically, PCA works by taking an eigenvalue decomposition (see Chapter 5, Linear Algebra) of the covariance or the correlation matrix, where each eigenvalue represents the variance explained by the corresponding eigenvector (or principal component), or by performing a singular value decomposition on the raw data. It does matter whether correlations or covariances are used in the eigenvalue method, and we may need to scale the data prior to performing an SVD-based PCA. We will discuss this further in the section Scaled versus unscaled PCA.
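To make the equivalence concrete, here is a minimal check using the simulated x and y from above: the eigenvalues of the covariance matrix should match the squared standard deviations reported by the SVD-based prcomp function, which is introduced next.

sim.matrix <- matrix(c(x, y), ncol = 2)
eigen(cov(sim.matrix))$values   # variance explained by each principal component
prcomp(sim.matrix)$sdev^2       # the same quantities from the SVD-based prcomp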

Tip

What is the difference between princomp and prcomp in R?

There are two R functions in base R for performing PCA: prcomp and princomp. They differ primarily in how they perform the computation: princomp does an eigenvalue decomposition of the covariance (or correlation) matrix, whereas prcomp performs a singular value decomposition of the (centered) data. The two commands differ slightly in their output, and prcomp is generally considered the more numerically accurate of the two.
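As a quick, optional comparison on the simulated data above, the two functions give essentially the same answer here; princomp reports marginally smaller standard deviations because it divides by n rather than n - 1 when computing the covariance matrix.

sample.matrix <- matrix(c(x, y), ncol = 2)
summary(princomp(sample.matrix))   # eigenvalue decomposition of the covariance matrix
summary(prcomp(sample.matrix))     # singular value decomposition of the (centered) data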

We can apply R's prcomp function, which performs the PCA, and summarize the results as follows:

> pca.sample <- prcomp(matrix(c(x, y), ncol = 2))
> summary(pca.sample)

Importance of components:
                           PC1     PC2
Standard deviation     42.3675 4.31650
Proportion of Variance  0.9897 0.01027
Cumulative Proportion   0.9897 1.00000

This shows that almost 99 percent of the variance is explained by the first newly created dimension (called a principal component).

As indicated earlier, PCA just rotates the axes. This suggests that we should be able to use the results of a PCA to see how our data can be plotted on this new set of axes, and in fact this can be done easily. The object returned by prcomp contains a matrix of loadings in its rotation element:

> pca.sample$rotation
            PC1        PC2
[1,] -0.6983965  0.7157111
[2,] -0.7157111 -0.6983965
> rotation.matrix <- -pca.sample$rotation

In this case, the loadings are mostly negative. The sign of the loadings is completely arbitrary, so we reverse it before saving the result into a new matrix, the name of which hints at what it will be used for. Multiplying the original data matrix by this rotation matrix gives us a new set of coordinates, which represent how each observation in the original data can be projected onto our new coordinate system:

rotated.data <- matrix(c(x,y), ncol = 2) %*% rotation.matrix

Now, if we plot this rotated data, we will see something that looks a lot like our original plot of the data; only here, we have rotated the axes as shown in the following plot:

plot(rotated.data[,1], rotated.data[,2], xlim = c(0, 150), ylim = c(-75, 75))
[Figure: the simulated data plotted on the rotated axes]

Compare the previous plot with the one before it to see that this is simply a rotation of the original data onto new axes. Notice that the new horizontal axis has points from 0 to almost 150, whereas the old horizontal axis only went as high as 100. This is because the new horizontal axis is really a hypotenuse passing through the cloud of data points, and the hypotenuse of a right triangle with two legs of length 100 is about 141.
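As a quick sanity check of that last figure, the arithmetic is just the Pythagorean theorem:

sqrt(100^2 + 100^2)   # length of the diagonal of a 100 x 100 square
# [1] 141.4214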

How does PCA relate to SVD?

As we discussed in Chapter 5, Linear Algebra, a singular value decomposition factors a matrix M into two matrices, U and V, and a diagonal matrix of singular values, D:

M = UDV^T

Multiplying the matrix U by the diagonal matrix D, whose diagonal entries are the singular values, projects the data onto the new coordinates in the same way as multiplying the original data by the rotation matrix. For example, we can reproduce the rotation that we achieved with R's prcomp command in the following way with the svd command:

svd.sample <- svd(matrix(c(x,y), ncol = 2))
manual.rotation <- svd.sample$u %*% -diag(svd.sample$d)
plot(manual.rotation[,1], manual.rotation[,2], xlim = c(0, 150), ylim = c(-75, 75))

The result is shown in the following plot:

[Figure: the data projected onto the new coordinates using the SVD]

Scaled versus unscaled PCA

Previously, we discussed the fact that a correlation is a scaled covariance. It is important to point out here that scaling of the data can make a big difference. Both the eigenvalue decomposition and the SVD attempt to tell us the directions in which the data in a matrix point. We run into a problem in datasets whose variables are placed on different scales. In such datasets, an eigenvalue decomposition of the covariances or an SVD of the raw data will tend to give the illusion that the data is aligned in the direction of the variables with the largest values. This is a problem if some variables have large values only because they are expressed in small units (for example, expressing a length in millimeters rather than meters multiplies the value by 1,000). These large values mean large variances, and PCA rotates the first component in the direction of the largest variance. Thus, if variables are measured on very different scales, then it is necessary to somehow scale the variables (typically to a variance of one). This is best seen with some examples.
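Before turning to real data, here is a minimal sketch with made-up data showing the effect: two unrelated lengths, one recorded in meters and one in millimeters. The variable names and numbers are purely illustrative.

set.seed(1)
meters      <- runif(200, 1, 2)          # lengths between 1 and 2 meters
millimeters <- runif(200, 1, 2) * 1000   # unrelated lengths, recorded in millimeters
toy.data    <- cbind(meters, millimeters)
summary(prcomp(toy.data))                  # PC1 is essentially the millimeter variable
summary(prcomp(toy.data, scale. = TRUE))   # rescaling to unit variance restores the balance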

Let's load the wine dataset, as follows:

red.wine <- read.csv('winequality-red.txt')

Now, we will take an eigenvalue decomposition of the covariance matrix ignoring the last variable, followed by an eigenvalue decomposition of the correlation matrix (this is close to what R does in the princomp command):

wine.eigen.cov <- eigen(cov(red.wine[,-12]))
wine.eigen.cor <- eigen(cor(red.wine[,-12]))

Now, let's look at the proportion of variance explained by each principal component in the two different eigenvalue decompositions. We will compute the proportion of variance explained by dividing each eigenvalue by the sum of eigenvalues:

> wine.eigen.cov$values / sum(wine.eigen.cov$values)
 [1] 9.465770e-01 4.836830e-02 2.589172e-03 1.518968e-03
 [5] 8.735540e-04 3.456072e-05 1.936276e-05 9.472781e-06
 [9] 8.413766e-06 1.214728e-06 4.687628e-10
> wine.eigen.cor$values / sum(wine.eigen.cor$values)
 [1] 0.281739313 0.175082699 0.140958499 0.110293866 0.087208370
 [6] 0.059964388 0.053071929 0.038450609 0.031331102 0.016484833
[11] 0.005414392

It can be seen that there is a stark difference between the two decompositions. Based on the covariance matrix, the first principal component explains almost 95 percent of the variance, whereas using the correlation matrix, the first principal component does not even explain 30 percent of the variance. The reason is that some variables (for example, those related to sulfur dioxide) are expressed in units that produce much bigger numbers than the other variables.
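A quick way to confirm which variables are responsible is to look at the raw variances of the columns (a simple check, assuming the red.wine data frame loaded above):

# Raw variance of each wine variable; the sulfur dioxide measurements sit far
# above the rest, so the unscaled first component points in their direction
sort(apply(red.wine[, -12], 2, var), decreasing = TRUE)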

This can be a problem when using R's prcomp command if we are not careful:

wine.prcomp <- prcomp(red.wine[,-12])
wine.prcomp.scaled <- prcomp(red.wine[,-12], scale = TRUE)
summary(wine.prcomp)
summary(wine.prcomp.scaled)

The output is as shown in the following screenshot:

[Screenshot: summary output for the unscaled and scaled PCA of the red wine data]

The unscaled version of this PCA gives us a misleading result because one particular variable, sulfur dioxide, has a very large absolute variance, which is a consequence of the units it is measured in. PC1 is essentially the sulfur dioxide variable. This is akin to measuring most variables in meters and one variable in centimeters; centimeters are smaller than meters, so variances expressed in terms of centimeters will be numerically larger. By adding the scale = TRUE argument, we are telling R to rescale all variables to unit variance. If variables are not expressed on the same or at least similar scales, rescaling is needed.

There are a number of key points to keep in mind with PCA:

  • PCA retains the same number of dimensions in the data but rotates the axes in the direction of the data
  • PCA attempts to account for all of the variance in the original variables
  • Variables on very different scales may need to be rescaled
  • The principal components created in PCA do not have to mean anything substantive

PCA for dimension reduction

Thus far, we have discussed the mathematical properties of PCA but not actually discussed how it can be used. In short, PCA is a technique that allows us to interpret observations from a multidimensional dataset in terms of fewer dimensions. Here, we will use a dataset of abalone measurements and, later, a dataset examining the chemical properties of red wine to demonstrate PCA in R. We will use commands from R's FactoMineR package, which is based on SVD and has numerous utilities for interpreting the results of PCA.
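FactoMineR is a contributed package, so if it is not already on your system, it needs to be installed once before it can be loaded with library() as shown below:

install.packages("FactoMineR")   # only needed the first time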

While PCA has been described as a dimension reduction technique, an application for it may not be immediately obvious. To demonstrate its use, let's look at the abalone dataset that consists of a variety of measurements on abalone:

abalone <- read.csv('abalone.txt')
library(FactoMineR)
abalone.pca <- PCA(abalone[, -1])   # drop the first column, which is not a numeric measurement

The PCA command produces a variables factor map, which plots each original variable as a vector to show how it falls on the rotated axes produced by PCA, as shown in the following diagram:

[Figure: variables factor map for the abalone PCA]

The preceding figure only shows two dimensions, so it is of limited use when many dimensions are retained, but it is very useful in cases where one or two dimensions capture most of the variance. It appears that nearly all of the individual measurements of an abalone can be captured on a single dimension. From a practical standpoint, this suggests that most of these abalone traits are really just slight variants of a single underlying trait, size, which as we can see captures nearly 84 percent of the variance in all of the measurements.
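The exact percentages behind that statement can be read off the eigenvalue table stored in the returned object (a quick check on the abalone.pca object created above):

abalone.pca$eig   # eigenvalues and percentages of variance for each principal component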

PCA to summarize wine properties

The abalone example demonstrates how PCA can arrive at a nice and simple solution for a dataset. Unfortunately, most real-world datasets are not so conveniently summarized. Let's return to the red wine dataset for an example of this. First, let's just look at the correlation matrix of the red wine data, usually a good first step before engaging in PCA. R's cor command applied to a data frame gives a correlation matrix:

cor(red.wine)
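The full matrix is a little easier to scan if the entries are rounded (a small convenience, not a required step):

round(cor(red.wine), 2)   # correlations rounded to two decimal places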

We won't get into the details of wine chemistry and oenology here, but some high correlations are expected. For example, citric acid is one of the fixed acids in wine, so we would expect a high correlation between citric acid and fixed acidity. However, we see a mix of high and low correlations, which suggests that there may be some structure to this data consisting of more than one dimension, but fewer than the total number of wine measurements. Let's perform a PCA on this dataset now:

wine.pca <- PCA(red.wine, quanti.sup = 12)

By declaring quanti.sup = 12, we are telling the PCA command not to use the 12th variable, quality, in the construction of the new dimensions. However, we are asking the PCA command to estimate the coordinates of this variable on the newly created dimensions, as shown in the following figure:

[Figure: variables factor map for the red wine PCA, with quality as a supplementary variable]

The preceding graphic gives us a hint of what we are getting into with our attempt at dimension reduction, and it is much more representative of the datasets commonly seen in the real world. Let's dig into the wine.pca object to try to make sense of things:

summary(wine.pca)

Let's have a look at the following screenshot:

[Screenshot: summary output of wine.pca]

The preceding screenshot summarizes the eigenvalue decomposition, shows how to plot the individual wines on the rotated axes, and provides a summary of the variables. We will discuss each of these individually here.

The eigenvalues come from an eigenvalue decomposition of a correlation matrix (although an SVD is actually used under the hood). The sum of these eigenvalues equals the total number of variables, in this case 11. The proportion of variance explained by a particular principal component (that is, a rotated new dimension) is simply the ratio of its eigenvalue to the total number of variables. The importance of this is that once an eigenvalue is less than one, the principal component associated with that eigenvalue explains less of the variance than a single original variable does. We will discuss this in a bit more detail in the next section.
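As a quick check of that arithmetic, we can recompute the proportions directly from the eigenvalues stored in the wine.pca object (assuming the first column of $eig holds the eigenvalues):

wine.pca$eig[, 1]                            # the eigenvalues themselves
wine.pca$eig[, 1] / sum(wine.pca$eig[, 1])   # proportion of variance; the eigenvalues sum to 11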

The individual information and variable information simply provide the data needed to plot the individual wines and the original variables along the new rotated axes. By default, this data is provided for only the first three principal components. There are a number of columns shown in the preceding screenshot. The column denoted by Dim gives the coordinate of each variable on each principal component. The column denoted by ctr gives the contribution of that variable to the creation of that dimension. The quality variable has no contribution column because it did not contribute to the creation of the principal components. The final column, named cos2, gives the squared cosine. The closer the squared cosine is to 1, the better that variable is projected onto the axis and the more closely that variable is linked to that principal component.
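If we want to look at the supplementary quality variable on its own, its projection onto the new dimensions is stored in the returned object (a quick inspection of the wine.pca object created above):

wine.pca$quanti.sup   # how the supplementary quality variable projects onto the new dimensions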

Choosing the number of principal components to retain

As we discussed earlier, PCA (regardless of which R command is used) simply rotates the axes in the direction of the data, and there are as many principal components as there are original variables. The question then is how PCA can be used as a dimension reduction technique if it returns the same number of dimensions that we started with. As we have already touched upon, the key is to keep only as many principal components as are important and ignore the rest. This leaves the question of how we decide how many principal components are important.

There are a number of possible ways to choose the number of principal components to retain. Qualitative methods include choosing based on theory (for example, we are interested in only a given number of dimensions a priori) or keeping only those principal components in which we can find substantive meaning. The two most commonly used quantitative methods are the Kaiser-Guttman rule and the scree test.

The Kaiser-Guttman rule simply states that principal components with an eigenvalue greater than one should be retained. The rationale for this is that principal components with eigenvalues less than one explain less of the total variance than a single original variable does on average. Based on this rule, we should keep four principal components in the PCA of the red wine data. The scree test simply plots the eigenvalues (or the proportion of variance) as a function of the principal components; components beyond the point where the plot levels off are regarded as unnecessary. The code to plot the graph is shown as follows:

# The first column of $eig holds the eigenvalues
plot(wine.pca$eig[, 1], type = 'b', xlab = 'Principal Component', ylab = 'Eigenvalue', main = 'Eigenvalues of Principal Components')

The result is as shown in the following plot:

[Figure: scree plot of the eigenvalues for the red wine PCA]

As can be seen in the preceding plot, the leveling off is not clear-cut, so by the scree criterion we would not really be able to determine an appropriate number of principal components to retain for the red wine dataset. In comparison, let's look at what the scree plot shows for the abalone PCA:

plot(abalone.pca$eig[, 1], type = 'b', xlab = 'Principal Component', ylab = 'Eigenvalue', main = 'Eigenvalues of Principal Components')

The result is shown in the following scree plot:

[Figure: scree plot of the eigenvalues for the abalone PCA]

In the preceding figure, the scree plot for abalone largely levels off starting at the second principal component and definitely by the third. As such, by scree plot criteria, we would keep one or at most two dimensions in the PCA of the abalone dataset.
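For comparison with the scree criterion, the Kaiser-Guttman counts can be computed directly from the eigenvalues (a one-line check for each dataset, again using the first column of $eig):

sum(wine.pca$eig[, 1] > 1)      # components with an eigenvalue above one; four for the red wine data
sum(abalone.pca$eig[, 1] > 1)   # probably just one for the abalone data, where PC1 dominates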

Tip

An important limitation of the Kaiser-Guttman rule is related to the fact that the components in PCA do not necessarily have any meaning. For example, if we apply PCA to 20 simulated, uncorrelated random variables, the Kaiser-Guttman rule will probably suggest that we retain a large number of components, despite the fact that no meaningful correlation structure actually exists. Run the following R code to see a demonstration of this:

simulated.data <- matrix(sample(1:100, 20000, replace = TRUE), ncol = 20)
summary(PCA(simulated.data))

Let's return to the wine PCA to discuss an interpretation of the PCA results. We will attempt to make sense of four dimensions, which explain nearly 71 percent of the variance. So, we will print the summary of the PCA results again, telling R that we want the coordinates, contributions, and squared cosines for four dimensions rather than the default three:

summary(wine.pca, ncp = 4)

The results are shown in the following screenshot:

[Screenshot: summary of wine.pca showing the first four dimensions]

There is no guarantee of substantive meaning in these four components, but by examining the loadings and the contributions of the variables to each, one possible (and very subjective) way to describe them is:

  • Complexity (something that acids are thought to impact)
  • Sulfite burden
  • Yeastiness
  • Grapiness

Bear in mind that, by the very nature of PCA, the complexity dimension will capture more variance than any subsequent component, so this interpretation also implies that complexity is the foremost aspect of wine to consider, followed by sulfite burden, yeastiness, and finally grapiness. An important consideration is whether this classification scheme makes sense.
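One way to sanity-check labels like these is FactoMineR's dimdesc function, which lists the variables most strongly associated with each dimension; it is an aid to interpretation rather than a substitute for judgment.

dimdesc(wine.pca, axes = 1:4)   # variables significantly correlated with each of the four dimensions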

Tip

It is critical to remember that both the selection of the number of components to retain and the interpretation of those components are not strictly matters of quantitative analysis; instead, they require some substantive understanding of the topic at hand and, sometimes, the subjective judgment of the researcher.

Now, the ultimate question is: how good is the wine? We included the wine quality variable as a supplementary variable, and as we can see in the output, it has a low squared cosine with all four principal components. Thus, while our four principal components explain slightly more than 70 percent of the variance, quality is not strongly tied to any one of them. Maybe there really is something artistic and magical about wine that is not explained by simple chemistry.
