Principal Component Analysis

Finding the really important fields in databases with a huge number of variables may prove to be a challenging task for the data scientist. This is where Principal Component Analysis (PCA) comes into the picture: to find the core components of data. It was invented more than 100 years ago by Karl Pearson, and it has been widely used in diverse fields since then.

The objective of PCA is to represent the data in a more meaningful structure with the help of an orthogonal transformation. This linear transformation is intended to reveal the internal structure of the dataset by choosing a new basis in the vector space that best explains the variance of the data. In plain English, this simply means that we compute new variables from the original data, where these new variables capture the variance of the original variables in decreasing order.

This can be done either by eigendecomposition of the covariance or correlation matrix (so-called R-mode PCA) or by singular value decomposition of the dataset itself (so-called Q-mode PCA). Each approach has its own advantages, such as computational performance, lower memory requirements, or, when the correlation matrix is used in the eigendecomposition, avoiding the need to standardize the data before passing it to PCA.
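
To make the distinction more concrete, here is a minimal sketch of how the components of mtcars (the dataset used throughout this section) could be computed manually; both approaches should agree with the higher-level functions used below, up to the sign of the loadings:

> ## R-mode: eigendecomposition of the correlation matrix
> e <- eigen(cor(mtcars))
> e$values       # eigenvalues, that is, the variances of the components
> e$vectors[, 1] # loadings of the first component
> ## Q-mode: singular value decomposition of the scaled data
> s <- svd(scale(mtcars))
> s$d^2 / (nrow(mtcars) - 1) # the same eigenvalues recovered from the singular values
> s$v[, 1]                   # loadings of the first component (sign may be flipped)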

Either way, PCA delivers a lower-dimensional image of the data, in which the uncorrelated principal components are linear combinations of the original variables. This informative overview can be a great help to the analyst when identifying the underlying structure of the variables, which is why the technique is so often used for exploratory data analysis.

PCA results in exactly as many extracted components as there are original variables. The first component includes most of the common variance, so it is the most important one for describing the original dataset, while the last components often carry only some unique information from a single original variable. Based on this, we would usually keep only the first few components for further analysis, but we will also see some use cases where we concentrate on the extracted unique variance.

PCA algorithms

R provides a variety of functions to run PCA. Although it's possible to compute the components manually via eigen or svd as R-mode or Q-mode PCA, we will focus on the higher-level functions for the sake of simplicity. Relying on my stats-teacher background, I think that sometimes it's more efficient to concentrate on how to run an analysis and how to interpret the results rather than spending way too much time on the linear algebra background, especially given the time/page limits.

R-mode PCA can be conducted with princomp, or with principal from the psych package, while the generally preferred Q-mode PCA is available via prcomp. Now let's focus on the latter and see what the components of mtcars look like:

> prcomp(mtcars, scale = TRUE)
Standard deviations:
 [1] 2.57068 1.62803 0.79196 0.51923 0.47271 0.46000 0.36778 0.35057
 [9] 0.27757 0.22811 0.14847

Rotation:
           PC1       PC2       PC3        PC4       PC5       PC6
mpg   -0.36253  0.016124 -0.225744 -0.0225403  0.102845 -0.108797
cyl    0.37392  0.043744 -0.175311 -0.0025918  0.058484  0.168554
disp   0.36819 -0.049324 -0.061484  0.2566079  0.393995 -0.336165
hp     0.33006  0.248784  0.140015 -0.0676762  0.540047  0.071436
drat  -0.29415  0.274694  0.161189  0.8548287  0.077327  0.244497
wt     0.34610 -0.143038  0.341819  0.2458993 -0.075029 -0.464940
qsec  -0.20046 -0.463375  0.403169  0.0680765 -0.164666 -0.330480
vs    -0.30651 -0.231647  0.428815 -0.2148486  0.599540  0.194017
am    -0.23494  0.429418 -0.205767 -0.0304629  0.089781 -0.570817
gear  -0.20692  0.462349  0.289780 -0.2646905  0.048330 -0.243563
carb   0.21402  0.413571  0.528545 -0.1267892 -0.361319  0.183522
           PC7        PC8       PC9      PC10      PC11
mpg   0.367724 -0.7540914  0.235702  0.139285 -0.1248956
cyl   0.057278 -0.2308249  0.054035 -0.846419 -0.1406954
disp  0.214303  0.0011421  0.198428  0.049380  0.6606065
hp   -0.001496 -0.2223584 -0.575830  0.247824 -0.2564921
drat  0.021120  0.0321935 -0.046901 -0.101494 -0.0395302
wt   -0.020668 -0.0085719  0.359498  0.094394 -0.5674487
qsec  0.050011 -0.2318400 -0.528377 -0.270673  0.1813618
vs   -0.265781  0.0259351  0.358583 -0.159039  0.0084146
am   -0.587305 -0.0597470 -0.047404 -0.177785  0.0298235
gear  0.605098  0.3361502 -0.001735 -0.213825 -0.0535071
carb -0.174603 -0.3956291  0.170641  0.072260  0.3195947

Note

Please note that we have called prcomp with scale set to TRUE, which defaults to FALSE for backward compatibility with the S language. In general, however, scaling is highly recommended. Using the scale option is equivalent to running PCA on a previously scaled dataset, such as prcomp(scale(mtcars)), where scale returns data with unit variance.
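
If in doubt, the equivalence is easy to verify; the following quick sketch compares the two calls, which should return identical standard deviations and, up to the sign of the columns, identical loadings:

> p1 <- prcomp(mtcars, scale = TRUE)
> p2 <- prcomp(scale(mtcars))
> all.equal(p1$sdev, p2$sdev)
> all.equal(abs(p1$rotation), abs(p2$rotation)) # abs() guards against sign flips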

First, prcomp returned the standard deviations of the principal components, which show how much information was preserved by each of the 11 components. The standard deviation of the first component is a lot larger than any subsequent value, and this component alone explains more than 60 percent of the variance:

> summary(prcomp(mtcars, scale = TRUE))
Importance of components:
                         PC1   PC2   PC3    PC4    PC5    PC6    PC7
Standard deviation     2.571 1.628 0.792 0.5192 0.4727 0.4600 0.3678
Proportion of Variance 0.601 0.241 0.057 0.0245 0.0203 0.0192 0.0123
Cumulative Proportion  0.601 0.842 0.899 0.9232 0.9436 0.9628 0.9751
                          PC8   PC9    PC10  PC11
Standard deviation     0.3506 0.278 0.22811 0.148
Proportion of Variance 0.0112 0.007 0.00473 0.002
Cumulative Proportion  0.9863 0.993 0.99800 1.000

Besides the first component, only the second one has a standard deviation higher than 1, which means that only the first two components include at least as much information as a single original variable did. Or, in other words, only the first two components have an eigenvalue higher than one. The eigenvalues can be computed as the squares of the standard deviations of the principal components, and they sum to the number of original variables, as expected:

> sum(prcomp(scale(mtcars))$sdev^2)
[1] 11

Determining the number of components

PCA algorithms always compute the same number of principal components as the number of variables in the original dataset. The importance of the components decreases from the first one to the last one.

As a rule of thumb, we can simply keep all the components whose standard deviation is higher than 1. This means that we keep the components that explain at least as much variance as a single original variable does:

> prcomp(scale(mtcars))$sdev^2
 [1] 6.608400 2.650468 0.627197 0.269597 0.223451 0.211596 0.135262
 [8] 0.122901 0.077047 0.052035 0.022044

So the preceding summary suggests keeping only two components out of the 11, which together explain almost 85 percent of the variance:

> (6.6 + 2.65) / 11
[1] 0.8409091
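
Rather than typing the rounded eigenvalues by hand, the same cumulative proportions can of course be computed directly from the prcomp object; this one-liner reproduces the Cumulative Proportion row of the earlier summary:

> cumsum(prcomp(scale(mtcars))$sdev^2) / ncol(mtcars)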

An alternative and great visualization tool to help us determine the optimal number of components is the scree plot. Fortunately, there are at least two convenient functions in the psych package we can use here: scree and VSS.scree:

> VSS.scree(cor(mtcars))
> scree(cor(mtcars))

The only difference between the preceding two plots is that scree also shows the eigenvalues of a factor analysis besides PCA. Read more about this in the next section of this chapter.

As can be seen, VSS.scree provides a visual overview of the eigenvalues of the principal components, and it also highlights the critical value of 1 with a horizontal line. This is usually referred to as the Kaiser criterion.

Besides this rule of thumb, as discussed previously, one can also rely on the so-called elbow rule, which simply suggests viewing the line plot as an arm; the optimal number of components is the point where this arm's elbow is found. So we have to look for the point from which the curve becomes noticeably less steep. This sharp break is probably at 3 in this case, instead of the 2 we found with the Kaiser criterion.
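
The same inspection can also be done without the psych package; a minimal base R sketch that plots the eigenvalues and marks the Kaiser cut-off at 1 looks like this, and the elbow can then be judged by eye:

> plot(prcomp(scale(mtcars))$sdev^2, type = 'b',
+     xlab = 'Component', ylab = 'Eigenvalue')
> abline(h = 1, lty = 'dashed')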

And besides Cattell's original scree test, we can also compare the previously described scree of the components with that of randomized data to identify the optimal number of components to keep; this is known as parallel analysis:

> fa.parallel(mtcars)
Parallel analysis suggests that the number of factors = 2 
and the number of components =  2

Now we have verified the optimal number of principal components to keep for further analysis with a variety of statistical tools, and we can work with only two variables instead of 11 after all, which is great! But what do these artificially created variables actually mean?

Interpreting components

The only problem with reducing the dimension of our data is that it can be very frustrating to work out what our newly created, highly compressed, and transformed variables actually mean. Now we have PC1 and PC2 for each of our 32 cars:

> pc <- prcomp(mtcars, scale = TRUE)
> head(pc$x[, 1:2])
                        PC1      PC2
Mazda RX4         -0.646863  1.70811
Mazda RX4 Wag     -0.619483  1.52562
Datsun 710        -2.735624 -0.14415
Hornet 4 Drive    -0.306861 -2.32580
Hornet Sportabout  1.943393 -0.74252
Valiant           -0.055253 -2.74212

These values were computed by multiplying the scaled original dataset by the identified weights, the so-called loadings (stored in the rotation element), also known as the component matrix. This is a standard linear transformation:

> head(scale(mtcars) %*% pc$rotation[, 1:2])
                        PC1      PC2
Mazda RX4         -0.646863  1.70811
Mazda RX4 Wag     -0.619483  1.52562
Datsun 710        -2.735624 -0.14415
Hornet 4 Drive    -0.306861 -2.32580
Hornet Sportabout  1.943393 -0.74252
Valiant           -0.055253 -2.74212

Both variables are centered around a mean of zero, and their standard deviations are the ones reported previously:

> summary(pc$x[, 1:2])
      PC1              PC2        
 Min.   :-4.187   Min.   :-2.742  
 1st Qu.:-2.284   1st Qu.:-0.826  
 Median :-0.181   Median :-0.305  
 Mean   : 0.000   Mean   : 0.000  
 3rd Qu.: 2.166   3rd Qu.: 0.672  
 Max.   : 3.892   Max.   : 4.311  
> apply(pc$x[, 1:2], 2, sd)
   PC1    PC2 
2.5707 1.6280 
> pc$sdev[1:2]
[1] 2.5707 1.6280

All scores computed by PCA are centered, and PCA always returns the values transformed into a new coordinate system with an orthogonal basis, which means that the components are uncorrelated:

> round(cor(pc$x))
     PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
PC1    1   0   0   0   0   0   0   0   0    0    0
PC2    0   1   0   0   0   0   0   0   0    0    0
PC3    0   0   1   0   0   0   0   0   0    0    0
PC4    0   0   0   1   0   0   0   0   0    0    0
PC5    0   0   0   0   1   0   0   0   0    0    0
PC6    0   0   0   0   0   1   0   0   0    0    0
PC7    0   0   0   0   0   0   1   0   0    0    0
PC8    0   0   0   0   0   0   0   1   0    0    0
PC9    0   0   0   0   0   0   0   0   1    0    0
PC10   0   0   0   0   0   0   0   0   0    1    0
PC11   0   0   0   0   0   0   0   0   0    0    1

To see what the principal components actually mean, it's really helpful to check the loadings matrix, as we have seen before:

> pc$rotation[, 1:2]
          PC1       PC2
mpg  -0.36253  0.016124
cyl   0.37392  0.043744
disp  0.36819 -0.049324
hp    0.33006  0.248784
drat -0.29415  0.274694
wt    0.34610 -0.143038
qsec -0.20046 -0.463375
vs   -0.30651 -0.231647
am   -0.23494  0.429418
gear -0.20692  0.462349
carb  0.21402  0.413571

This analytical table is probably more meaningful in a visual form, for example as a biplot, which shows not only the observations (black labels) but also the original variables (red labels) on the same plot, in the new coordinate system defined by the principal components:

> biplot(pc, cex = c(0.8, 1.2))
> abline(h = 0, v = 0, lty = 'dashed')

We can conclude that PC1 includes information mostly from the number of cylinders (cyl), displacement (disp), weight (wt), and gas consumption (mpg), although the latter has a negative loading and thus decreases the value of PC1. This was found by checking the highest and lowest values on the PC1 axis. Similarly, we find that PC2 is constructed mostly from quarter-mile time (qsec), number of gears (gear), number of carburetors (carb), and transmission type (am).
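
Instead of eyeballing the biplot, we can also rank the variables programmatically; a quick sketch is to sort the first two columns of the loadings matrix and read off the variables at both ends:

> sort(pc$rotation[, 1]) # variables dominating PC1 at either end
> sort(pc$rotation[, 2]) # variables dominating PC2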

To verify this, we can easily compute the correlation coefficient between the original variables and the principal components:

> cor(mtcars, pc$x[, 1:2])
          PC1       PC2
mpg  -0.93195  0.026251
cyl   0.96122  0.071216
disp  0.94649 -0.080301
hp    0.84847  0.405027
drat -0.75617  0.447209
wt    0.88972 -0.232870
qsec -0.51531 -0.754386
vs   -0.78794 -0.377127
am   -0.60396  0.699103
gear -0.53192  0.752715
carb  0.55017  0.673304

Does this make sense? How would you name PC1 and PC2? The number of cylinders and displacement seem like engine parameters, while the weight is probably influenced more by the body of the car. Gas consumption should be affected by both specs. The other component's variables deal with the suspension, but we also have speed there, not to mention the bunch of mediocre correlation coefficients in the preceding matrix. Now what?

Rotation methods

Because rotation methods work in a subspace of the retained components, rotation is always suboptimal compared to the previously discussed PCA, in the sense that the new axes after rotation explain less variance than the original components do.

On the other hand, rotation simplifies the structure of the components and thus makes it a lot easier to understand and interpret the results; thus, these methods are often used in practice.

Note

Rotation methods can be (and usually are) applied to both PCA and FA (more on the latter later in this chapter). Orthogonal methods are generally preferred.

There are two main types of rotation:

  • Orthogonal, where the new axes are orthogonal to each other. There is no correlation between the components/factors.
  • Oblique, where the new axes are not necessarily orthogonal to each other; thus there might be some correlation between the components/factors.

Varimax rotation is one of the most popular rotation methods. It was developed by Kaiser in 1958 and has been popular ever since. It is often used because it maximizes the variance of the squared loadings within each column of the loadings matrix, which results in loadings that are easier to interpret:

> varimax(pc$rotation[, 1:2])
$loadings
     PC1    PC2   
mpg  -0.286 -0.223
cyl   0.256  0.276
disp  0.312  0.201
hp           0.403
drat -0.402       
wt    0.356  0.116
qsec  0.148 -0.483
vs          -0.375
am   -0.457  0.174
gear -0.458  0.217
carb -0.106  0.454

                 PC1   PC2
SS loadings    1.000 1.000
Proportion Var 0.091 0.091
Cumulative Var 0.091 0.182

$rotmat
         [,1]    [,2]
[1,]  0.76067 0.64914
[2,] -0.64914 0.76067

Now the first component seems to be mostly affected (negatively dominated) by the transmission type, number of gears, and rear axle ratio, while the second one is affected by quarter-mile time, horsepower, and the number of carburetors. This suggests naming PC2 power, while PC1 instead refers to transmission. Let's see those 32 automobiles in this new coordinate system:

> pcv <- varimax(pc$rotation[, 1:2])$loadings
> plot(scale(mtcars) %*% pcv, type = 'n',
+     xlab = 'Transmission', ylab = 'Power')
> text(scale(mtcars) %*% pcv, labels = rownames(mtcars))

Based on the preceding plot, every data scientist should pick a car from the upper left quarter to go with the top-rated models, right? Those cars have great power based on the y axis and good transmission systems as shown on the x axis, although do not forget that the transmission axis is negatively correlated with the original variables. But let's see some other rotation methods and their advantages as well!
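
Before moving on, if you would rather not read the car names off the plot, the candidates in that upper left quarter can also be listed directly from the rotated scores; here is a quick sketch based on the pcv object computed above:

> scores <- scale(mtcars) %*% pcv
> rownames(mtcars)[scores[, 1] < 0 & scores[, 2] > 0]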

Quartimax rotation is an orthogonal method, as well, and minimizes the number of components needed to explain each variable. This often results in a general component and additional smaller components. When a compromise between Varimax and Quartimax rotation methods is needed, you might opt for Equimax rotation.
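
Neither quartimax nor equimax has a dedicated function in the base stats package, but, as an illustration, the GPArotation package (introduced for the oblique methods below) should provide a quartimax function that can be applied to the very same loadings:

> library(GPArotation)
> quartimax(pc$rotation[, 1:2])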

Oblique rotation methods include Oblimin and Promax. Oblimin is not available in the base stats package, but we can load the GPArotation package, which provides a wide range of rotation methods for PCA and FA alike. For demonstration purposes, let's see how Promax rotation works, which is a lot faster than, for example, Oblimin:

> library(GPArotation)
> promax(pc$rotation[, 1:2])
$loadings

Loadings:
     PC1    PC2   
mpg  -0.252 -0.199
cyl   0.211  0.258
disp  0.282  0.174
hp           0.408
drat -0.416       
wt    0.344       
qsec  0.243 -0.517
vs          -0.380
am   -0.502  0.232
gear -0.510  0.276
carb -0.194  0.482

                 PC1   PC2
SS loadings    1.088 1.088
Proportion Var 0.099 0.099
Cumulative Var 0.099 0.198

$rotmat
         [,1]    [,2]
[1,]  0.65862 0.58828
[2,] -0.80871 0.86123

> cor(promax(pc$rotation[, 1:2])$loadings)
         PC1      PC2
PC1  1.00000 -0.23999
PC2 -0.23999  1.00000

The result of the last command supports the view that oblique rotation methods generate components that might be correlated, unlike those produced by an orthogonal rotation.

Outlier-detection with PCA

PCA can be used for a variety of goals besides exploratory data analysis. For example, we can use PCA to generate eigenfaces, compress images, classify observations, or detect outliers in a multidimensional space via a kind of image filtering. Now, we will construct a simplified version of the model discussed in a related post published on R-bloggers in 2012: http://www.r-bloggers.com/finding-a-pin-in-a-haystack-pca-image-filtering.

The challenge described in the post was to detect a foreign metal object in the sand photographed by the Curiosity Rover on Mars. The image can be found on the official NASA website at http://www.nasa.gov/images/content/694811main_pia16225-43_full.jpg, for which I've created a shortened URL for future use: http://bit.ly/nasa-img.

In the following image, you can see the strange metal object highlighted with a black circle in the sand, just to make sure you know what we are looking for. The image found at the preceding URL does not have this highlight:

[Image: the photo taken by the Curiosity Rover, with the foreign metal object marked by a black circle]

And now let's use some statistical methods to identify that object without (much) human intervention! First, we need to download the image from the Internet and load it into R. The jpeg package will be really helpful here:

> library(jpeg)
> t <- tempfile()
> download.file('http://bit.ly/nasa-img', t)
trying URL 'http://bit.ly/nasa-img'
Content type 'image/jpeg' length 853981 bytes (833 Kb)
opened URL
==================================================
downloaded 833 Kb

> img <- readJPEG(t)
> str(img)
 num [1:1009, 1:1345, 1:3] 0.431 0.42 0.463 0.486 0.49 ...

The readJPEG function returns the RGB values of every pixel in the picture, resulting in a three-dimensional array where the first dimension is the row, the second is the column, and the third dimension holds the three color values.

Note

RGB is an additive color model that can reproduce a wide variety of colors by mixing red, green, and blue at given intensities, with optional transparency. This color model is widely used in computer science.
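
For example, R's rgb function maps such intensities (on a 0 to 1 scale) to the familiar hexadecimal color codes, which we will also rely on a bit later:

> rgb(1, 0, 0)
[1] "#FF0000"
> rgb(0, 1, 0)
[1] "#00FF00"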

As PCA requires a matrix as an input, we have to convert this 3-dimensional array to a 2-dimensional dataset. To this end, let's not bother with the order of pixels for the time being, as we can reconstruct that later, but let's simply list the RGB values of all pixels, one after the other:

> h <- dim(img)[1]
> w <- dim(img)[2]
> m <- matrix(img, h*w)
> str(m)
 num [1:1357105, 1:3] 0.431 0.42 0.463 0.486 0.49 ...

In a nutshell, we saved the original height of the image (in pixels) in variable h, saved the width in w, and then converted the 3D array to a matrix with 1,357,105 rows. And, after four lines of data loading and three lines of data transformation, we can call the actual, rather simplified statistical method at last:

> pca <- prcomp(m)

As we've seen before, data scientists do indeed deal with data preparation most of the time, while the actual data analysis can be done easily, right?

The extracted components seem to perform pretty well; the first component alone explains more than 96 percent of the variance:

> summary(pca)
Importance of components:
                         PC1    PC2     PC3
Standard deviation     0.277 0.0518 0.00765
Proportion of Variance 0.965 0.0338 0.00074
Cumulative Proportion  0.965 0.9993 1.00000

Previously, interpreting RGB values was pretty straightforward, but what do these components mean?

> pca$rotation
          PC1      PC2      PC3
[1,] -0.62188  0.71514  0.31911
[2,] -0.57409 -0.13919 -0.80687
[3,] -0.53261 -0.68498  0.49712

It seems that the first component mixes all three colors fairly evenly, the second component gives little weight to green, while the third component is dominated by green. Why not visualize this instead of trying to imagine what these artificial values look like? To this end, let's extract the color intensities from the preceding component/loading matrix with the following quick helper function:

> extractColors <- function(x)
+     rgb(x[1], x[2], x[3])

Calling this on the absolute values of the component matrix results in the hex-color codes that describe the principal components:

> (colors <- apply(abs(pca$rotation), 2, extractColors))
      PC1       PC2       PC3 
"#9F9288" "#B623AF" "#51CE7F"

These color codes can be easily rendered, for example, on a pie chart, where the slices are sized by the standard deviations of the principal components:

> pie(pca$sdev, col = colors, labels = colors)

Now we no longer have red, green, or blue intensities or actual colors in the computed scores stored in pca$x; rather, each principal component describes every pixel in terms of the colors visualized previously. As discussed, the third component stands for a greenish color, the second one misses green (resulting in a purple color), while the first component includes rather high values from all RGB channels, resulting in a tawny color, which is not surprising at all given that the photo was taken in the Martian desert.

Now we can render the original image with monochrome colors to show the intensity of the principal components. The following few lines of code produce two modified photos of the Curiosity Rover and its environment based on PC1 and PC2:

> par(mfrow = c(1, 2), mar = rep(0, 4))
> image(matrix(pca$x[, 1], h), col = gray.colors(100))
> image(matrix(pca$x[, 2], h), col = gray.colors(100), yaxt = 'n')

Although the image was rotated by 90 degrees due to the way image renders a matrix, it's pretty clear that the first image is not really helpful in finding the foreign metal object in the sand. As a matter of fact, this image represents the noise in the desert area, as PC1 includes sand-like color intensities, so this component is mostly useful for describing the variety of tawny colors.

On the other hand, the second component highlights the metal object in the sand very well! The surrounding pixels show little variation, due to the low ratio of purple in normal sand, while the anomalous object clearly stands out.
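
To close the loop on the outlier-detection idea, the anomalous pixels can also be flagged programmatically by thresholding the second component's scores; in the following sketch, the cut-off of four times the median absolute deviation is an arbitrary choice for illustration only:

> pc2 <- pca$x[, 2]
> flagged <- abs(pc2 - median(pc2)) > 4 * mad(pc2)
> image(matrix(as.numeric(flagged), h), col = c('white', 'black'))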

I really like this piece of R code and the simplified example: although they're still basic enough to follow, they also demonstrate the power of R and how standard data analytic methods can be used to harvest information from raw data.
