Component extraction

To extract the components with the psych package, use the principal() function. The call takes the data and an argument specifying whether or not to rotate the components at this point:

> pca <- principal(train_scale, rotate = "none")

You can examine the components by calling the pca object we created. However, my primary intent is to determine the number of components to retain. For that, a scree plot will suffice. A scree plot helps you assess which components explain the most variance in the data. It shows the component number on the x-axis and the associated eigenvalues on the y-axis. For simplicity of interpretation, I include only the first 10 components:

> plot(pca$values[1:10], type = "b", ylab = "Eigenvalues", xlab = "Component")

The following is the output of the preceding command:

What you are looking for is a point in the scree plot where the rate of change decreases. This is what is commonly called an elbow or bend in the plot. The elbow marks the point at which the additional variance explained by each new component stops differing greatly from one component to the next. In other words, it is the breakpoint where the plot flattens out. In this plot, four, five, or six components all look plausible, so I think more information is needed. Here, we can see the eigenvalues of those 10 components. A common rule of thumb recommends retaining all components with an eigenvalue greater than 1:

> head(pca$values, 10)
[1] 52.2361 11.3294 4.7375 3.0193 1.9830 1.5153 1.2896 1.0655 1.0275
[10] 0.9185
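Drawing the cutoff directly on the scree plot makes the eigenvalue rule easier to apply by eye. A minimal sketch using the ten eigenvalues printed above, so it does not depend on the pca object:

```r
# Eigenvalues as printed by head(pca$values, 10)
ev <- c(52.2361, 11.3294, 4.7375, 3.0193, 1.9830,
        1.5153, 1.2896, 1.0655, 1.0275, 0.9185)

plot(ev, type = "b", ylab = "Eigenvalues", xlab = "Component")
abline(h = 1, lty = 2)  # dashed reference line at eigenvalue = 1

sum(ev > 1)  # nine of these first ten components exceed the cutoff
```

Note that the eigenvalue rule alone would keep far more components than the elbow suggests, which is why a variance-explained check is a useful complement.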

Another rule I've learned over the years is that your selected components should capture about 70 percent of the total variance; that is, the cumulative variance explained by the retained components should account for at least 70 percent of the variance explained by all the components combined. That is pretty simple to check. I'm inclined to go with five components:

> sum(pca$values)
[1] 93

> sum(pca$values[1:5])
[1] 73.31
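The two sums above are all you need to verify the cumulative variance. A quick sketch using the first five eigenvalues as printed earlier (the total is 93 because, with standardized features, each variable contributes one unit of variance to the correlation matrix):

```r
# First five eigenvalues from head(pca$values, 10)
ev5 <- c(52.2361, 11.3294, 4.7375, 3.0193, 1.9830)

# Cumulative share of the total variance across the 93 standardized features
round(cumsum(ev5) / 93, 2)  # 0.56 0.68 0.73 0.77 0.79
```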

We are capturing 79 percent of the total variance with just five components. The eigenvalues sum to 93 because the data were standardized, so each of the 93 features contributes one unit of variance. Let's put that together:

> pca_5 <- psych::principal(train_scale, nfactors = 5, rotate = "none")

Calling the object gives a number of results. Here are the abbreviated results for the top portion of the output:

> pca_5
Principal Components Analysis
Call: psych::principal(r = train_scale, nfactors = 5, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
                                PC1   PC2  PC3   PC4   PC5   h2    u2 com
abdominalextensiondepthsitting 0.58  0.66 0.09 -0.08 -0.26 0.85 0.146 2.4
acromialheight                 0.92 -0.27 0.13  0.19 -0.03 0.98 0.025 1.3
acromionradialelength          0.84 -0.29 0.16 -0.04 -0.11 0.83 0.167 1.4
anklecircumference             0.67  0.34 0.00  0.02  0.34 0.69 0.314 2.0

Here, we see each feature's loadings on the five components. For example, acromialheight has the highest positive loading of any feature on component 1. Here, I paste the part of the output that shows the sums of squares:

                        PC1   PC2  PC3  PC4  PC5
SS loadings           52.24 11.33 4.74 3.02 1.98
Proportion Var         0.56  0.12 0.05 0.03 0.02
Cumulative Var         0.56  0.68 0.73 0.77 0.79
Proportion Explained   0.71  0.15 0.06 0.04 0.03
Cumulative Proportion  0.71  0.87 0.93 0.97 1.00

Here, the SS loadings row contains the eigenvalues of each component. When each eigenvalue is divided by the variance captured by the five retained components, you end up with the Proportion Explained row, which, as you may have guessed, is the share of that retained variance attributable to each component. Dividing instead by the total variance gives the Proportion Var row. So principal component 1 explains 56 percent of the total variance, but 71 percent of the variance captured by the five selected components. Remember the heuristic rule we examined previously: the selected components should account for a minimum of 70 percent of the total variation. The Cumulative Var row confirms that the cumulative variance is 79 percent, as demonstrated previously.
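The relationship between the rows can be checked by hand: Proportion Var divides each SS loading by the total variance of 93, while Proportion Explained divides by the variance held by the five retained components only. A sketch using the SS loadings row above:

```r
ss <- c(52.24, 11.33, 4.74, 3.02, 1.98)  # SS loadings (eigenvalues)

round(ss / 93, 2)       # Proportion Var:       0.56 0.12 0.05 0.03 0.02
round(ss / sum(ss), 2)  # Proportion Explained: 0.71 0.15 0.06 0.04 0.03
```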
