Finding the truly important fields in databases with a huge number of variables can be a challenging task for the data scientist. This is where Principal Component Analysis (PCA) comes into the picture: it finds the core components of the data. It was invented more than 100 years ago by Karl Pearson, and it has been widely used in diverse fields ever since.
The objective of PCA is to represent the data in a more meaningful structure with the help of an orthogonal transformation. This linear transformation is intended to reveal the internal structure of the dataset via a new basis in the vector space that best explains the variance of the data. In plain English, this simply means that we compute new variables from the original data, where these new variables account for the variance of the original variables in decreasing order.
This can be done either by eigendecomposition of the covariance or correlation matrix (so-called R-mode PCA) or by singular value decomposition of the dataset (so-called Q-mode PCA). Each method has its own advantages, such as computational performance or memory requirements; for example, using the correlation matrix in the eigendecomposition spares us from standardizing the data before passing it to PCA.
Either way, PCA delivers a lower-dimensional image of the data, where the uncorrelated principal components are linear combinations of the original variables. This informative overview can be a great help to the analyst when identifying the underlying structure of the variables, which is why the technique is so often used for exploratory data analysis.
PCA extracts exactly as many components as there are original variables. The first component includes most of the common variance, so it has the highest importance in describing the original dataset, while the last component often includes only some unique information from a single original variable. Based on this, we would usually keep only the first few components of PCA for further analysis, but we will also see some use cases where we concentrate on the extracted unique variance.
R provides a variety of functions to run PCA. Although it's possible to compute the components manually by eigen or svd, as R-mode or Q-mode PCA, we will focus on the higher-level functions for the sake of simplicity. Relying on my stats-teacher background, I think that it's sometimes more efficient to concentrate on how to run an analysis and how to interpret the results rather than spending way too much time on the linear algebra background, especially with given time/page limits.
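Just to demonstrate this equivalence, here's a minimal sketch of the manual approach; the outputs (not shown here) match the standard deviations returned by prcomp below:

> e <- eigen(cor(mtcars))        # R-mode: eigendecomposition of the correlation matrix
> sqrt(e$values)                 # standard deviations of the components
> s <- svd(scale(mtcars))        # Q-mode: SVD of the standardized dataset
> s$d / sqrt(nrow(mtcars) - 1)   # the very same standard deviations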
R-mode PCA can be conducted by princomp, or by principal from the psych package, while the generally preferred Q-mode PCA can be called by prcomp. Now let's focus on the latter and see what the components of mtcars look like:
> prcomp(mtcars, scale = TRUE)
Standard deviations:
 [1] 2.57068 1.62803 0.79196 0.51923 0.47271 0.46000 0.36778 0.35057
 [9] 0.27757 0.22811 0.14847

Rotation:
          PC1       PC2       PC3        PC4       PC5       PC6
mpg  -0.36253  0.016124 -0.225744 -0.0225403  0.102845 -0.108797
cyl   0.37392  0.043744 -0.175311 -0.0025918  0.058484  0.168554
disp  0.36819 -0.049324 -0.061484  0.2566079  0.393995 -0.336165
hp    0.33006  0.248784  0.140015 -0.0676762  0.540047  0.071436
drat -0.29415  0.274694  0.161189  0.8548287  0.077327  0.244497
wt    0.34610 -0.143038  0.341819  0.2458993 -0.075029 -0.464940
qsec -0.20046 -0.463375  0.403169  0.0680765 -0.164666 -0.330480
vs   -0.30651 -0.231647  0.428815 -0.2148486  0.599540  0.194017
am   -0.23494  0.429418 -0.205767 -0.0304629  0.089781 -0.570817
gear -0.20692  0.462349  0.289780 -0.2646905  0.048330 -0.243563
carb  0.21402  0.413571  0.528545 -0.1267892 -0.361319  0.183522
           PC7        PC8       PC9      PC10       PC11
mpg   0.367724 -0.7540914  0.235702  0.139285 -0.1248956
cyl   0.057278 -0.2308249  0.054035 -0.846419 -0.1406954
disp  0.214303  0.0011421  0.198428  0.049380  0.6606065
hp   -0.001496 -0.2223584 -0.575830  0.247824 -0.2564921
drat  0.021120  0.0321935 -0.046901 -0.101494 -0.0395302
wt   -0.020668 -0.0085719  0.359498  0.094394 -0.5674487
qsec  0.050011 -0.2318400 -0.528377 -0.270673  0.1813618
vs   -0.265781  0.0259351  0.358583 -0.159039  0.0084146
am   -0.587305 -0.0597470 -0.047404 -0.177785  0.0298235
gear  0.605098  0.3361502 -0.001735 -0.213825 -0.0535071
carb -0.174603 -0.3956291  0.170641  0.072260  0.3195947
Please note that we have called prcomp with scale set to TRUE, which is FALSE by default for backward compatibility with the S language. In general, however, scaling is highly recommended. Using the scaling option is equivalent to running PCA on a dataset that was scaled beforehand, as in prcomp(scale(mtcars)), which results in data with unit variance.
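We can quickly double-check this equivalence:

> all.equal(prcomp(mtcars, scale = TRUE)$sdev,
+           prcomp(scale(mtcars))$sdev)
[1] TRUE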
First, prcomp returned the standard deviations of the principal components, which show how much information was preserved by the 11 components. The standard deviation of the first component is a lot larger than any subsequent value, and the first component alone explains more than 60 percent of the variance:
> summary(prcomp(mtcars, scale = TRUE))
Importance of components:
                         PC1   PC2   PC3    PC4    PC5    PC6    PC7
Standard deviation     2.571 1.628 0.792 0.5192 0.4727 0.4600 0.3678
Proportion of Variance 0.601 0.241 0.057 0.0245 0.0203 0.0192 0.0123
Cumulative Proportion  0.601 0.842 0.899 0.9232 0.9436 0.9628 0.9751
                          PC8   PC9    PC10  PC11
Standard deviation     0.3506 0.278 0.22811 0.148
Proportion of Variance 0.0112 0.007 0.00473 0.002
Cumulative Proportion  0.9863 0.993 0.99800 1.000
Besides the first component, only the second one has a standard deviation higher than 1, which means that only the first two components include at least as much information as an original variable does. Or, in other words, only the first two components have an eigenvalue above one. The eigenvalue can be computed as the square of the standard deviation of a principal component, and the eigenvalues sum up to the number of original variables, as expected:
> sum(prcomp(scale(mtcars))$sdev^2)
[1] 11
PCA algorithms always compute the same number of principal components as the number of variables in the original dataset. The importance of the components decreases from the first to the last.
As a rule of thumb, we can simply keep all components with a standard deviation higher than 1. This means that we keep the components that explain at least as much variance as an original variable does:
> prcomp(scale(mtcars))$sdev^2
 [1] 6.608400 2.650468 0.627197 0.269597 0.223451 0.211596 0.135262
 [8] 0.122901 0.077047 0.052035 0.022044
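If you would rather have R identify these components programmatically, a simple filter on the eigenvalues does the trick:

> which(prcomp(scale(mtcars))$sdev^2 > 1)
[1] 1 2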
So the preceding summary suggests keeping only two components out of the 11, which together explain almost 85 percent of the variance:
> (6.6 + 2.65) / 11
[1] 0.8409091
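Instead of rounding by hand, the exact cumulative proportions can also be computed directly from the eigenvalues:

> round(cumsum(prcomp(scale(mtcars))$sdev^2) / ncol(mtcars), 4)[1:2]
[1] 0.6008 0.8417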
An alternative and great visualization tool to help us determine the optimal number of components is the scree plot. Fortunately, there are at least two great functions in the psych package that we can use here: scree and VSS.scree:
> VSS.scree(cor(mtcars))
> scree(cor(mtcars))
The only difference between the preceding two plots is that scree also shows the eigenvalues of a factor analysis besides PCA. Read more about this in the next section of this chapter.
As can be seen, VSS.scree provides a visual overview of the eigenvalues of the principal components, and it also highlights the critical value of 1 with a horizontal line. This is usually referred to as the Kaiser criterion.
Besides this rule of thumb, as discussed previously, one can also rely on the so-called elbow rule, which simply suggests that the line plot represents an arm and that the optimal number of components is the point where the elbow of this arm is found. So we have to look for the point where the curve becomes less steep. This sharp break is probably at 3 in this case, instead of the 2 we found with the Kaiser criterion.
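By the way, if you do not want to depend on the psych package, a bare-bones scree plot with the Kaiser threshold takes only a few lines of base R:

> plot(prcomp(scale(mtcars))$sdev^2, type = 'b',
+      xlab = 'Component', ylab = 'Eigenvalue')
> abline(h = 1, lty = 'dashed')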
And besides Cattell's original scree test, we can also compare the scree of the components described previously with that of a bit of randomized data to identify the optimal number of components to keep:
> fa.parallel(mtcars)
Parallel analysis suggests that the number of factors = 2 and the number of components = 2
Now we have verified the optimal number of principal components to keep for further analysis with a variety of statistical tools, and we can work with only two variables instead of 11 after all, which is great! But what do these artificially created variables actually mean?
The only problem with reducing the dimensionality of our data is that it can be very frustrating to find out what our newly created, highly compressed, and transformed variables actually mean. Now we have PC1 and PC2 for our 32 cars:
> pc <- prcomp(mtcars, scale = TRUE)
> head(pc$x[, 1:2])
                        PC1      PC2
Mazda RX4         -0.646863  1.70811
Mazda RX4 Wag     -0.619483  1.52562
Datsun 710        -2.735624 -0.14415
Hornet 4 Drive    -0.306861 -2.32580
Hornet Sportabout  1.943393 -0.74252
Valiant           -0.055253 -2.74212
These values were computed by multiplying the original dataset by the identified weights, the so-called loadings (rotation) or component matrix. This is a standard linear transformation:
> head(scale(mtcars) %*% pc$rotation[, 1:2])
                        PC1      PC2
Mazda RX4         -0.646863  1.70811
Mazda RX4 Wag     -0.619483  1.52562
Datsun 710        -2.735624 -0.14415
Hornet 4 Drive    -0.306861 -2.32580
Hornet Sportabout  1.943393 -0.74252
Valiant           -0.055253 -2.74212
Both variables are scaled: their mean is zero, and their standard deviations match the values returned previously:
> summary(pc$x[, 1:2])
      PC1              PC2
 Min.   :-4.187   Min.   :-2.742
 1st Qu.:-2.284   1st Qu.:-0.826
 Median :-0.181   Median :-0.305
 Mean   : 0.000   Mean   : 0.000
 3rd Qu.: 2.166   3rd Qu.: 0.672
 Max.   : 3.892   Max.   : 4.311
> apply(pc$x[, 1:2], 2, sd)
   PC1    PC2
2.5707 1.6280
> pc$sdev[1:2]
[1] 2.5707 1.6280
All scores computed by PCA are scaled, because PCA always returns values transformed to a new coordinate system with an orthogonal basis, which means that the components are uncorrelated:
> round(cor(pc$x))
     PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
PC1    1   0   0   0   0   0   0   0   0    0    0
PC2    0   1   0   0   0   0   0   0   0    0    0
PC3    0   0   1   0   0   0   0   0   0    0    0
PC4    0   0   0   1   0   0   0   0   0    0    0
PC5    0   0   0   0   1   0   0   0   0    0    0
PC6    0   0   0   0   0   1   0   0   0    0    0
PC7    0   0   0   0   0   0   1   0   0    0    0
PC8    0   0   0   0   0   0   0   1   0    0    0
PC9    0   0   0   0   0   0   0   0   1    0    0
PC10   0   0   0   0   0   0   0   0   0    1    0
PC11   0   0   0   0   0   0   0   0   0    0    1
To see what the principal components actually mean, it's really helpful to check the loadings matrix, as we have seen before:
> pc$rotation[, 1:2]
          PC1       PC2
mpg  -0.36253  0.016124
cyl   0.37392  0.043744
disp  0.36819 -0.049324
hp    0.33006  0.248784
drat -0.29415  0.274694
wt    0.34610 -0.143038
qsec -0.20046 -0.463375
vs   -0.30651 -0.231647
am   -0.23494  0.429418
gear -0.20692  0.462349
carb  0.21402  0.413571
This analytical table might be more meaningful in a visual form, for example as a biplot, which shows not only the original variables but also the observations (black labels) on the same plot, in the new coordinate system based on the principal components (red labels):
> biplot(pc, cex = c(0.8, 1.2))
> abline(h = 0, v = 0, lty = 'dashed')
We can conclude that PC1 includes information mostly from the number of cylinders (cyl), displacement (disp), weight (wt), and gas consumption (mpg), although the latter seems to decrease the value of PC1. This was found by checking the highest and lowest values on the PC1 axis. Similarly, we find that PC2 is constructed from speed-up (qsec), the number of gears (gear), the number of carburetors (carb), and the transmission type (am).
To verify this, we can easily compute the correlation coefficients between the original variables and the principal components:
> cor(mtcars, pc$x[, 1:2])
          PC1       PC2
mpg  -0.93195  0.026251
cyl   0.96122  0.071216
disp  0.94649 -0.080301
hp    0.84847  0.405027
drat -0.75617  0.447209
wt    0.88972 -0.232870
qsec -0.51531 -0.754386
vs   -0.78794 -0.377127
am   -0.60396  0.699103
gear -0.53192  0.752715
carb  0.55017  0.673304
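By the way, these coefficients did not really require a separate computation: as the variables were standardized, the correlation between an original variable and a component equals the loading multiplied by the standard deviation of that component, so the following one-liner reproduces the preceding matrix:

> sweep(pc$rotation[, 1:2], 2, pc$sdev[1:2], '*')   # loadings scaled by sdev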
Does this make sense? How would you name PC1 and PC2? The number of cylinders and the displacement seem like engine parameters, while weight is probably rather influenced by the body of the car, and gas consumption should be affected by both specifications. The other component's variables deal with suspension, but we also have speed there, not to mention the bunch of mediocre correlation coefficients in the preceding matrix. Now what?
This is where rotation methods can help. As rotations are performed in a subspace of the retained components, a rotated solution is always suboptimal compared to the previously discussed PCA: the new axes after rotation will explain less variance than the original components.
On the other hand, rotation simplifies the structure of the components and thus makes the results a lot easier to understand and interpret, which is why these methods are often used in practice.
There are two main types of rotation: orthogonal methods (such as Varimax, Quartimax, and Equimax), where the rotated axes remain perpendicular, and oblique methods (such as Promax and Oblimin), where this restriction is dropped and the components are allowed to correlate.
Varimax rotation is one of the most popular rotation methods. It was developed by Kaiser in 1958 and has been widely used ever since, as it maximizes the variance of the squared loadings, resulting in more interpretable components:
> varimax(pc$rotation[, 1:2])
$loadings

Loadings:
     PC1    PC2
mpg  -0.286 -0.223
cyl   0.256  0.276
disp  0.312  0.201
hp           0.403
drat -0.402
wt    0.356  0.116
qsec  0.148 -0.483
vs          -0.375
am   -0.457  0.174
gear -0.458  0.217
carb -0.106  0.454

                 PC1   PC2
SS loadings    1.000 1.000
Proportion Var 0.091 0.091
Cumulative Var 0.091 0.182

$rotmat
         [,1]    [,2]
[1,]  0.76067 0.64914
[2,] -0.64914 0.76067
Now the first component seems to be mostly affected (negatively dominated) by the transmission type, the number of gears, and the rear axle ratio, while the second one is affected by speed-up, horsepower, and the number of carburetors. This suggests naming PC2 power, while PC1 instead refers to transmission. Let's see those 32 automobiles in this new coordinate system:
> pcv <- varimax(pc$rotation[, 1:2])$loadings
> plot(scale(mtcars) %*% pcv, type = 'n',
+      xlab = 'Transmission', ylab = 'Power')
> text(scale(mtcars) %*% pcv, labels = rownames(mtcars))
Based on the preceding plot, every data scientist should pick a car from the upper-left quadrant to go with the top-rated models, right? Those cars have great power, based on the y axis, and good transmission systems, as shown on the x axis (do not forget that the transmission axis is negatively correlated with the original variables). But let's see some other rotation methods and their advantages as well!
Quartimax rotation is an orthogonal method as well; it minimizes the number of components needed to explain each variable, which often results in one general component plus additional smaller components. When a compromise between the Varimax and Quartimax rotation methods is needed, you might opt for Equimax rotation.
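There is no quartimax function in the base stats package, but the GPArotation package (which we will load in a moment for the oblique methods anyway) provides one, so a Quartimax rotation can be sketched as follows; the output is not shown here:

> library(GPArotation)
> quartimax(pc$rotation[, 1:2])$loadings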
Oblique rotation methods include Oblimin and Promax. While promax is available in the base stats package, Oblimin is not found there, nor in the highly used psych package; the GPArotation package mentioned previously, however, provides a wide range of rotation methods for both PCA and FA. For demonstration purposes, let's see how Promax rotation works, which is also a lot faster than, for example, Oblimin:
> library(GPArotation)
> promax(pc$rotation[, 1:2])
$loadings

Loadings:
     PC1    PC2
mpg  -0.252 -0.199
cyl   0.211  0.258
disp  0.282  0.174
hp           0.408
drat -0.416
wt    0.344
qsec  0.243 -0.517
vs          -0.380
am   -0.502  0.232
gear -0.510  0.276
carb -0.194  0.482

                 PC1   PC2
SS loadings    1.088 1.088
Proportion Var 0.099 0.099
Cumulative Var 0.099 0.198

$rotmat
         [,1]    [,2]
[1,]  0.65862 0.58828
[2,] -0.80871 0.86123

> cor(promax(pc$rotation[, 1:2])$loadings)
         PC1      PC2
PC1  1.00000 -0.23999
PC2 -0.23999  1.00000
The result of the last command supports the view that oblique rotation methods may yield correlated components, unlike orthogonal rotations.
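One way to see this difference directly is to check the rotation matrices themselves: multiplying an orthogonal rotation matrix by its own transpose yields the identity matrix, which clearly does not hold for the Promax rotation matrix:

> round(crossprod(varimax(pc$rotation[, 1:2])$rotmat), 2)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> round(crossprod(promax(pc$rotation[, 1:2])$rotmat), 2)
      [,1]  [,2]
[1,]  1.09 -0.31
[2,] -0.31  1.09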
PCA can be used for a variety of goals besides exploratory data analysis. For example, we can use PCA to generate eigenfaces, compress images, or classify observations, or we can use it to detect outliers in a multidimensional space. Now, we will construct a simplified version of the image-filtering model discussed in a related post published on R-bloggers in 2012: http://www.r-bloggers.com/finding-a-pin-in-a-haystack-pca-image-filtering.
The challenge described in the post was to detect a foreign metal object in the sand photographed by the Curiosity rover on Mars. The image can be found on the official NASA website at http://www.nasa.gov/images/content/694811main_pia16225-43_full.jpg, for which I've created a shortened URL for future use: http://bit.ly/nasa-img.
In the following image, you can see the strange metal object highlighted by a black circle in the sand, just to make sure you know what we are looking for. The image found at the preceding URL does not have this highlight:
And now let's use some statistical methods to identify that object without (much) human intervention! First, we need to download the image from the Internet and load it into R. The jpeg package will be really helpful here:
> library(jpeg)
> t <- tempfile()
> download.file('http://bit.ly/nasa-img', t)
trying URL 'http://bit.ly/nasa-img'
Content type 'image/jpeg' length 853981 bytes (833 Kb)
opened URL
==================================================
downloaded 833 Kb
> img <- readJPEG(t)
> str(img)
 num [1:1009, 1:1345, 1:3] 0.431 0.42 0.463 0.486 0.49 ...
The readJPEG function returns the RGB values of every pixel in the picture, resulting in a three-dimensional array, where the first dimension is the row, the second is the column, and the third holds the three color values.
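So, for example, the red, green, and blue intensities of the top-left pixel can be extracted in the following way (the object's dimensions were already shown by str above):

> dim(img)
[1] 1009 1345    3
> img[1, 1, ]   # RGB values of the top-left pixel, scaled between 0 and 1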
As PCA requires a matrix as input, we have to convert this three-dimensional array to a two-dimensional dataset. To this end, let's not bother with the order of the pixels for the time being, as we can reconstruct that later; let's simply list the RGB values of all pixels, one after the other:
> h <- dim(img)[1]
> w <- dim(img)[2]
> m <- matrix(img, h * w)
> str(m)
 num [1:1357105, 1:3] 0.431 0.42 0.463 0.486 0.49 ...
In a nutshell, we saved the original height of the image (in pixels) in the variable h and its width in w, and then converted the 3D array to a matrix with 1,357,105 rows. And, after four lines of data loading and three lines of data transformation, we can at last call the actual, rather simple statistical method:
> pca <- prcomp(m)
As we've seen before, data scientists do indeed deal with data preparation most of the time, while the actual data analysis can be done easily, right?
The extracted components seem to perform pretty well; the first component alone explains more than 96 percent of the variance:
> summary(pca)
Importance of components:
                         PC1    PC2     PC3
Standard deviation     0.277 0.0518 0.00765
Proportion of Variance 0.965 0.0338 0.00074
Cumulative Proportion  0.965 0.9993 1.00000
Previously, interpreting RGB values was pretty straightforward, but what do these components mean?
> pca$rotation
          PC1      PC2      PC3
[1,] -0.62188  0.71514  0.31911
[2,] -0.57409 -0.13919 -0.80687
[3,] -0.53261 -0.68498  0.49712
It seems that the first component is rather a mix of all three colors, the second component lacks green, while the third component consists almost entirely of green. Why not visualize this instead of trying to imagine how these artificial values look? To this end, let's extract the color intensities from the preceding component/loading matrix with the following quick helper function:
> extractColors <- function(x)
+     rgb(x[1], x[2], x[3])
Calling this on the absolute values of the component matrix results in the hex color codes that describe the principal components:
> (colors <- apply(abs(pca$rotation), 2, extractColors))
      PC1       PC2       PC3
"#9F9288" "#B623AF" "#51CE7F"
These color codes can be easily rendered, for example, on a pie chart, where the area of each slice is proportional to the standard deviation of the corresponding principal component:
> pie(pca$sdev, col = colors, labels = colors)
Now we no longer have red, green, and blue intensities, or actual colors, in the computed scores stored in pca$x; rather, the principal components describe each pixel in terms of the colors visualized previously. And, as discussed, the third component stands for a greenish color, the second one lacks green (resulting in a purple color), while the first component includes rather high values from all RGB channels, resulting in a tawny color, which is not surprising at all, knowing that the photo was taken in the desert of Mars.
Now we can render monochrome images showing the intensity of the principal components. The following few lines of code produce two modified photos of the Curiosity rover and its environment, based on PC1 and PC2:
> par(mfrow = c(1, 2), mar = rep(0, 4))
> image(matrix(pca$x[, 1], h), col = gray.colors(100))
> image(matrix(pca$x[, 2], h), col = gray.colors(100), yaxt = 'n')
Although the image was rotated by 90 degrees in one of the linear transformations, it's pretty clear that the first image is not really helpful in finding the foreign metal object in the sand. As a matter of fact, this image represents the noise of the desert area: PC1 captures sand-like color intensities, so this component is useful for describing the variety of tawny colors.
On the other hand, the second component highlights the metal object in the sand very well! All the surrounding pixels are dim, due to the low amount of purple in normal sand, while the anomalous object shows up as a rather dark spot.
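If we wanted to go one step further, we could even flag these anomalous pixels programmatically. The following is only a rough sketch with an arbitrarily chosen six-sigma cutoff on the PC2 scores, which would probably need tuning for other images:

> ## the scores are centered, so we can threshold their absolute values
> outliers <- (abs(pca$x[, 2]) > 6 * sd(pca$x[, 2])) + 0   # 1 flags an outlier pixel
> image(matrix(outliers, h), col = c('white', 'black'))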
I really like this piece of R code and the simplified example: although they're still basic enough to follow, they also demonstrate the power of R and how standard data analytic methods can be used to harvest information from raw data.