Principal component analysis

Principal component analysis (PCA) is another exploratory method you can use to separate your samples into groups. PCA converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components, and it is widely used as a dimension reduction technique to help visualize data. Unlike LDA, PCA doesn't use class information: it explores the structure of the variable values without knowing which group each sample belongs to. For example, let's perform a PCA to explore our simulated fish.data dataset. Before performing PCA, it is important to remember that the magnitude of the variables and any skew in the data will influence the resulting principal components, so we need to transform and scale our data.
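
To see why scaling matters, here is a minimal sketch on a made-up two-variable dataset (the toy data frame and its column names are our own assumptions and are not part of fish.data). It compares the PC1 loadings with and without scaling, and checks that the resulting scores are uncorrelated:

```r
# Hypothetical data: two correlated variables measured on very different scales.
set.seed(1)
x <- rnorm(100)
toy <- data.frame(small = x + rnorm(100, sd = 0.5),
                  large = 1000 * (x + rnorm(100, sd = 0.5)))

pca.raw    <- prcomp(toy, center = TRUE, scale. = FALSE)
pca.scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Without scaling, PC1 is dominated by the large-magnitude variable.
print(abs(pca.raw$rotation[, "PC1"]))

# With scaling, both variables carry the same weight on PC1.
print(abs(pca.scaled$rotation[, "PC1"]))

# The principal component scores are linearly uncorrelated.
print(round(cor(pca.scaled$x), 10))
```

The signs of the loadings are arbitrary, which is why the sketch compares absolute values.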

First, we recommend that you log transform the data (if necessary). Then, run PCA using the prcomp() function as follows:

> fish.data.mx <- as.matrix(fish.data[, 1:3])
> fish.data.log <- log(fish.data.mx) 
> fish.log.pca <- prcomp(fish.data.log, scale. = TRUE, center = TRUE)
> summary(fish.log.pca) 
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.2563 0.9953 0.6566
Proportion of Variance 0.5261 0.3302 0.1437
Cumulative Proportion  0.5261 0.8563 1.0000
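
The last two rows of the summary follow directly from the first. As a quick check, the sketch below recomputes them from the rounded standard deviations printed above (sdev is just a name we chose here; with the fitted object you could use fish.log.pca$sdev instead):

```r
# Standard deviations copied from the summary output (rounded to 4 digits).
sdev <- c(PC1 = 1.2563, PC2 = 0.9953, PC3 = 0.6566)

# The variance of each component is its squared standard deviation.
pc.var <- sdev^2

# Proportion of variance and its running total.
prop.var <- pc.var / sum(pc.var)
cum.prop <- cumsum(prop.var)

print(round(rbind(prop.var, cum.prop), 4))
```

Because the three variables were scaled to unit variance, the component variances add up to roughly 3, the number of variables.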

Instead of the summary, you might want to see the standard deviation of each principal component and the rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables. To get this information, we can use the print() function as follows:

> print(fish.log.pca)
Standard deviations:
[1] 1.2563128 0.9952860 0.6565698

Rotation:
             PC1         PC2         PC3
length 0.7012316 -0.09252202  0.70690444
weight 0.7016076 -0.08648088 -0.70729617
speed  0.1265742  0.99194795  0.00427102
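
To make "coefficients of the linear combinations" concrete, the following sketch rebuilds the scores by hand from the rotation matrix. Since fish.data is not defined in this section, it uses a hypothetical stand-in (toy, with made-up log-normal values), so the numbers will not match the output above:

```r
set.seed(42)
# Hypothetical stand-in for fish.data[, 1:3]; all values are positive so log() is defined.
toy <- data.frame(length = rlnorm(50, 3, 0.2),
                  weight = rlnorm(50, 5, 0.3),
                  speed  = rlnorm(50, 1, 0.4))
toy.pca <- prcomp(log(toy), center = TRUE, scale. = TRUE)

# Center and scale the log data with the same values prcomp() used.
toy.scaled <- scale(log(toy), center = toy.pca$center, scale = toy.pca$scale)

# Each score is a linear combination of the scaled variables,
# with the columns of the rotation matrix as coefficients.
manual.scores <- toy.scaled %*% toy.pca$rotation
print(all.equal(unname(manual.scores), unname(toy.pca$x)))

# The loading vectors have unit length and are mutually orthogonal.
print(round(crossprod(toy.pca$rotation), 10))
```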

We can also plot the variance associated with each principal component as follows:

> plot(fish.log.pca, ylim=c(0, 2)) #plot not shown

Now let's visualize the loadings on the first principal component as a bar plot with the ggplot2 package as follows:

> library("ggplot2")
> loadings <- as.data.frame(fish.log.pca$rotation)

> # Add a column with the name of each variable to the loadings data frame.
> loadings$variables <- colnames(fish.data[,1:3])

> # Plot the PC1 loadings as bars; stat = "identity" uses the y values as the bar heights
> q <- ggplot(loadings, aes(x = variables, y = PC1)) + geom_bar(stat = "identity")

> # Adjust axis label sizes
> q + theme(axis.title = element_text(face="bold", size=20), axis.text  = element_text(size=18))

The result is shown in the following plot:

[Figure: bar plot of the PC1 loadings for length, weight, and speed]

We can also plot the scores from our PCA results as follows:

> scores <- as.data.frame(fish.log.pca$x)
> q2 <- qplot(x = PC1, y = PC2, data = scores, geom = "point", col = fish.data$fish)
> q2 <- q2 + theme(legend.title=element_blank(), legend.text = element_text( size = 20, face = "bold"))
> q2 <- q2 + theme(axis.title = element_text(face="bold", size=20), axis.text  = element_text(size=18))
> q2

The result is shown in the following plot:

[Figure: scatter plot of the PC1 and PC2 scores, colored by fish group]

We can also view the PCA results with the biplot() function. A biplot allows us to visualize the samples and the variables at the same time. Its representation is similar to the last plot we generated with the qplot() function, except that it also represents the variables as vectors. Let's take a look at the following biplot() call:

> biplot(fish.log.pca)

The result is shown in the following plot:

[Figure: biplot of the PCA results]

As you can see, in this case, the samples are shown as points and the variables (speed, length, and weight) are shown as vectors.

You may also want to take advantage of the ggbiplot() function from the ggbiplot package (available at https://github.com/vqv/ggbiplot) to produce a better figure as follows:

> library("devtools")
> install_github("vqv/ggbiplot")
> library("ggbiplot")

> fish.class  <- fish.data$fish  
> g <- ggbiplot(fish.log.pca, obs.scale = 1, var.scale = 1, groups = fish.class, ellipse = TRUE, circle = TRUE)
> g <- g + scale_color_discrete(name = '')
> g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
> g <- g + theme(legend.title=element_blank(), legend.text = element_text( size = 20, face = "bold"))
> g <- g + theme(axis.title = element_text(face="bold", size=20), axis.text  = element_text(size=18))
> g

The result is shown in the following plot:

[Figure: ggbiplot of the PCA results with an ellipse for each fish group]

We can also create a three-dimensional view of the principal components with the plot3d() function available in the rgl package. The plot3d() function creates an interactive 3D scatter plot, so you can rotate the plot to get a better idea of how the samples separate along the principal components. Let's take a look at the following code:

> library(rgl)
> fish.pca.col=c("red", "blue", "green", "magenta", "black")
> plot3d(fish.log.pca$x[,1:3], col=fish.pca.col[sort(rep(1:5, 50))])
[Figure: 3D scatter plot of the first three principal components]
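
The col argument above assigns colors by position: the first 50 rows get the first color, the next 50 rows the second, and so on. The sketch below, which needs no 3D device, shows what that index expands to. As an alternative that we are assuming rather than taking from the original analysis, it also derives the same colors from a factor of group labels (the labels A to E are hypothetical):

```r
fish.pca.col <- c("red", "blue", "green", "magenta", "black")

# Position-based index: 50 ones, then 50 twos, ..., then 50 fives.
idx <- sort(rep(1:5, 50))
point.col <- fish.pca.col[idx]
print(table(point.col))

# Label-based alternative: convert a grouping factor to integer codes.
# The group labels here are hypothetical placeholders.
groups <- factor(rep(c("A", "B", "C", "D", "E"), each = 50))
point.col.by.group <- fish.pca.col[as.integer(groups)]

# Both approaches agree when the rows are ordered by group.
print(identical(point.col, point.col.by.group))
```

The label-based version keeps working when the rows are not sorted by group or when the groups have different sizes.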

To save an image, we use the rgl.snapshot() function as follows:

> rgl.snapshot("PCAfigure1.png")