Chapter 6. Dimensionality Reduction with Principal Component Analysis

Nowadays, accessing data is easier and cheaper than ever before, which has led to a proliferation of data in organizations' data warehouses and on the Internet. Analyzing this data is not a trivial task, as its sheer quantity often makes analysis difficult or impractical. For instance, the data is often more abundant than the available memory on the machines, and the available computational power is often not enough to analyze it in a reasonable time frame. One solution is to turn to technologies designed for high-dimensional, high-volume data (Big Data). These solutions typically use the memory and computing power of several machines (computer clusters) for analysis. But most organizations do not have such an infrastructure. A more practical solution, therefore, is to reduce the dimensionality of the data while keeping the essential information intact.

Another reason to reduce dimensionality is that, in some cases, there are more attributes than observations. If scientists were to store the genome of every inhabitant of Europe and the United States, the number of cases (approximately 1 billion) would still be much smaller than the roughly 3 billion base pairs in human DNA.

Most analyses do not work well, or at all, when there are fewer observations than attributes. Confronted with this problem, data analysts might use their domain knowledge to select groups of attributes that go together (for instance, height and weight) and thereby reduce the dimensionality of the dataset.

The data is often structured along several relatively independent dimensions, with each dimension measured through several attributes. This is where Principal Component Analysis (PCA) is an essential tool: it assigns each observation a score on each dimension (the dimensions being determined by PCA itself) while allowing us to discard the attributes from which the dimensions are computed. In other words, for each of the obtained dimensions, PCA produces values (scores) that combine several attributes, and these scores can be used in further analyses. In the next section, we will use PCA to combine several attributes in questionnaire data. We will see that participants' self-reports on items such as lively, excited, and enthusiastic (among many more) can be combined into a single dimension we call positive arousal. In this sense, PCA performs both dimensionality reduction (it discards the attributes) and feature extraction (it computes the dimensions).

Another use of PCA is to check that the underlying structure of the data corresponds to a theoretical model. For instance, in a questionnaire, one group of questions might assess construct A, another group construct B, and so on. PCA will find two different factors if there is indeed more similarity in participants' answers within each group of questions than across the questionnaire as a whole. Researchers in fields such as psychology mainly use PCA to test their theoretical models in this fashion.

In what follows, we will:

  • Examine how PCA works
  • Continue with a tutorial on using PCA in R, in which we will notably discover how to interpret the results, select the appropriate number of dimensions, and perform diagnostics

The inner working of Principal Component Analysis

Principal Component Analysis aims at finding the dimensions (principal components) that explain most of the variance in a dataset. Once these components are found, a principal component score is computed for each row on each principal component. Remember the example of the questionnaire data we discussed in the preceding section: these scores can be understood as summaries (combinations) of the attributes that compose the data frame.

PCA produces the principal components by computing the eigenvalues of the covariance matrix of the dataset. There is one eigenvalue for each row (and column) of the covariance matrix. The eigenvectors are also required in order to compute the principal component scores. The eigenvalues and eigenvectors are obtained from the following equation, where A is the covariance matrix of interest, I is the identity matrix, k is a positive integer, λ is an eigenvalue, and v is the corresponding eigenvector:

$(A - \lambda I)^{k} v = 0$
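For readers who would like to see this equation in action, here is a minimal sketch (not part of the original code, and using the iris measurements as an assumed example) that computes the eigenvalues and eigenvectors of a covariance matrix with eigen() and checks that the first eigenpair satisfies the equation for k = 1:

# Minimal illustration: eigen decomposition of a covariance matrix
A = cov(iris[1:4])        # covariance matrix of the four numeric attributes
eig = eigen(A)            # eigenvalues (in descending order) and eigenvectors
# For k = 1, the equation reduces to A %*% v = lambda * v;
# we check it for the first eigenpair
lambda1 = eig$values[1]
v1 = eig$vectors[,1]
all.equal(as.vector(A %*% v1), lambda1 * v1)   # should return TRUE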

What is important to understand for current purposes is that the principal components are sorted in descending order of their eigenvalues: the higher the eigenvalue, the more variance the component explains, and the more variance is explained, the more useful the principal component is in summarizing the data. The part of variance explained by a principal component is computed by dividing its eigenvalue by the sum of all the eigenvalues. The equation is as follows, where partVar is the part of variance explained and eigen denotes an eigenvalue:

$\text{partVar}_i = \dfrac{\text{eigen}_i}{\sum_{j=1}^{n}\text{eigen}_j}$
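A quick illustration of this computation (again assuming the iris measurements as data, which is my own example rather than the book's):

# Part of variance explained, computed from the eigenvalues
eigenvalues = eigen(cov(iris[1:4]))$values
partVar = eigenvalues / sum(eigenvalues)   # one proportion per principal component
sum(partVar)                               # the proportions sum to 1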

Although the actual computation is more involved than this short explanation suggests, the scores can be thought of as the matrix multiplication (operator %*% in R) of the mean-centered data matrix and the factor loadings. In other words, the scores are made from the original data weighted by the factor loadings. The factor loadings are equal to the eigenvectors in an unrotated solution (we will see later what unrotated means). The loadings also allow us to examine the relationship between attributes and dimensions: the higher the loading, the stronger the relationship.

In what follows, we create our own PCA function. The reader is advised to use it for didactic purposes only and not for actual analyses, as some adjustments made in PCA implementations are not included here. Once we have our function ready, we will examine the principal components of the iris dataset using our own solution and then compare the results with those of the princomp() function provided in the stats package. We will see that these adjustments usually leave the components that explain the most variance (those of most interest) largely unaffected, while the less important components are affected more noticeably:

1  myPCA = function (df) {
2     eig = eigen(cov(df))                                 # eigenvalues and eigenvectors of the covariance matrix
3     means = unlist(lapply(df, mean))                     # mean of each attribute
4     scores = scale(df, center = means) %*% eig$vectors   # principal component scores
5     list(values = eig$values, vectors = eig$vectors, scores = scores)
6  }

In this function, we first compute, in line 2, the eigenvalues and eigenvectors using the eigen() function. On line 3, we create a vector called means containing the mean of each attribute of the original dataset. In line 4, we use this vector to center the values around the means and multiply (matrix multiplication) the resulting matrix by the eigenvectors to produce the principal component scores. In line 5, we create an object of class list (which the function then returns) containing the eigenvalues, the eigenvectors, and the principal component scores.

In what follows, we examine the principal components in the iris dataset (omitting the Species attribute):

my_pca = myPCA(iris[1:4])
my_pca

The following screenshot provides part of the output:


PCA results using the custom myPCA function

Now let's compare the eigenvalues from our function with the corresponding values obtained using princomp(). We first run princomp() on the data and assign the result to the pca object. The sdev component of a princomp object contains the square roots of the eigenvalues. In order to obtain a comparable metric, we apply the same transformation to the eigenvalues in the my_pca object using the sqrt() function:

pca = princomp(iris[1:4], scores = T)
cbind(unlist(lapply(my_pca[1], sqrt)), pca$sdev)

The following is the output:

[Screenshot: the square roots of the eigenvalues from myPCA (first column) alongside pca$sdev (second column)]

We can see that the standard deviations of the principal components (the square roots of the eigenvalues) are quite similar between our function in the first column and princomp() in the second column.
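The small remaining difference is expected: the princomp() documentation notes that it uses divisor N for the covariance matrix, whereas cov(), used in our function, divides by N - 1. As a quick check (a sketch based on this assumption, not part of the original code), rescaling our eigenvalues accordingly should bring the two columns into agreement:

# Rescale our eigenvalues to princomp()'s divisor-N convention
n = nrow(iris)
adjusted = sqrt(my_pca$values * (n - 1) / n)
round(cbind(adjusted, pca$sdev), 6)   # the two columns should now match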

The summary of the object generated by princomp() provides the amount of variance explained by each of the components:

summary(pca)

The output is as follows:

[Screenshot: summary(pca) showing the standard deviation, proportion of variance, and cumulative proportion for each component]

We can compare these values to the result using our function:

my_pca[[1]] / sum(my_pca[[1]])

The output appears as follows:

[1] 0.924618723 0.053066483 0.017102610 0.005212184

We can see that the values are identical and, clearly, the first component is the most important, as it explains more than 92 percent of the variance in the dataset.
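When deciding how many components to retain, it can also help to look at the cumulative proportion of variance. A brief sketch using the my_pca object:

# Cumulative proportion of variance explained
cumsum(my_pca$values) / sum(my_pca$values)

The first two components together account for almost 98 percent of the variance.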

The eigenvectors are accessible from princomp() as the loadings component, but the scores are more useful for our purposes. So, let's compare the scores generated by our function with those of princomp(). We will do so by measuring the correlation between them. We start by creating a dataset with the scores from both analyses, then run the correlation analysis but display only the part of the output we are interested in, that is, the correlations between our principal component scores and those of princomp():

scores = cbind(matrix(unlist(my_pca[3]),ncol = 4), pca$scores)
round(cor(scores)[1:4,5:8],3)

The output is provided here:

[Screenshot: correlations between the myPCA scores (rows) and the princomp() scores (columns)]

On the diagonal, we can see that the correlation between our scores and those of princomp() is almost perfect for the first two components (the sign is not important in this case, as we will explain later). The third component is already less strongly correlated, and the correlation for the last component is not that good. In what follows, we will discover how to use PCA with existing datasets.
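As a side note on the sign: if v is an eigenvector, so is -v, so two implementations can return the same component with opposite signs. A small sketch (not part of the original code) that ignores the sign when comparing the two sets of scores:

# Absolute correlations ignore the arbitrary sign of the components
round(abs(cor(scores))[1:4, 5:8], 3)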
