Principal Components and Exploratory Factor Analysis

Principal Component Analysis (PCA) is a data-reduction technique. You typically use it as an intermediate step in a more complex analytical session. Imagine that you need to work with hundreds of input variables, which might be correlated. With PCA, you convert a collection of possibly correlated variables into a new collection of linearly uncorrelated variables called principal components. The transformation is defined in such a way that the first principal component accounts for the largest possible share of the overall variance in the dataset, and each succeeding component in turn has the highest possible variance under the constraint that it is orthogonal to (uncorrelated with) the preceding components. Principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. The first principal component is the eigenvector with the highest eigenvalue, the second with the second highest, and so on. In short, each eigenvector defines a direction in the data space, and its eigenvalue measures the variance explained along that direction. In further analysis, you keep only a few principal components instead of the plethora of original variables. Of course, you lose some variability; nevertheless, the first few principal components should retain the majority of the overall variability of the dataset.
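
If you want to see this relationship between eigenvectors, eigenvalues, and principal components directly, the following minimal sketch uses base R only on two hypothetical simulated variables (x1 and x2 are not part of the book's dataset) and compares the eigen decomposition of the covariance matrix with the output of the built-in prcomp() function:

# Simulate two correlated variables (hypothetical data, for illustration only)
set.seed(1234);
x1 <- rnorm(1000);
x2 <- 0.7 * x1 + 0.5 * rnorm(1000);
X <- cbind(x1, x2);
# Eigen decomposition of the covariance matrix
ev <- eigen(cov(X));
ev$vectors;    # directions of the principal components (orthogonal)
ev$values;     # variance explained along each direction
# The built-in PCA function gives the same decomposition
prcomp(X)$rotation;    # same directions (signs may be flipped)
prcomp(X)$sdev ^ 2;    # same variances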

The following figure explains the process of defining principal components graphically. Initially, the variables used in the analysis form a multidimensional space of dimensionality m, if you use m variables. The figure shows a two-dimensional space. The values of the variables v1 and v2 define the cases in this 2-D space. The variability of the cases is spread approximately equally across both source variables. Finding principal components means finding m new axes, where m is exactly equal to the number of the source variables.

However, these new axes are selected in such a way that most of the variability of the cases is spread over a single new variable, or principal component, as shown in the following figure:

Principal Components Analysis

As mentioned, you use PCA to reduce the number of variables in further analysis. In addition, you can use PCA for anomaly detection, that is, for finding cases that are somehow different from the majority of cases. For this task, you use the residual variability that is not explained by the first two principal components.
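
One possible sketch of this idea, using the base prcomp() function on a hypothetical numeric matrix X (not the dataset introduced below), reconstructs each case from the first two components only and uses the size of the residual as an anomaly score:

# A sketch of PCA-based anomaly detection on a hypothetical numeric matrix X
pcaX <- prcomp(X, center = TRUE, scale. = TRUE);
# Reconstruct every case from the first two components only
reconstructed <- pcaX$x[, 1:2] %*% t(pcaX$rotation[, 1:2]);
# The residual not explained by the two components is the anomaly score
residual <- scale(X) - reconstructed;
anomalyScore <- rowSums(residual ^ 2);
# Cases with the highest scores are the most unusual ones
head(order(anomalyScore, decreasing = TRUE));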

Similar to PCA is Exploratory Factor Analysis (EFA), which is used to uncover the latent structure in a collection of input variables. A smaller set of calculated variables, called factors, is used to explain the relations between the input variables.

To start, the following code creates a subset of the TM data frame by extracting only the numerical variables:

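# Subset of the TM data frame with the numerical variables only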
TMPCAEFA <- TM[, c("TotalChildren", "NumberChildrenAtHome", 
                   "HouseOwnerFlag", "NumberCarsOwned", 
                   "BikeBuyer", "YearlyIncome", "Age")]; 

You can use the princomp() function from the base R installation to calculate the principal components. However, the following code uses the principal() function from the psych package, which returns results that are easier to understand. The code installs the package, loads it into memory, and calculates two principal components from the seven input variables. Note the comment in the code: the components are not rotated. You will learn about rotation very soon:

install.packages("psych"); 
library(psych); 
# PCA not rotated 
pcaTM_unrotated <- principal(TMPCAEFA, nfactors = 2, rotate = "none"); 
pcaTM_unrotated; 

Here are some partial results:

Standardized loadings (pattern matrix) based upon correlation matrix
                           PC1   PC2    h2   u2 com
    TotalChildren         0.73  0.43 0.712 0.29 1.6
    NumberChildrenAtHome  0.74 -0.30 0.636 0.36 1.3
    HouseOwnerFlag        0.20  0.44 0.234 0.77 1.4
    NumberCarsOwned       0.70 -0.34 0.615 0.39 1.5
    BikeBuyer            -0.23 -0.21 0.097 0.90 2.0
    YearlyIncome          0.67 -0.43 0.628 0.37 1.7
    Age                   0.46  0.65 0.635 0.36 1.8
    
                           PC1  PC2
    SS loadings           2.32 1.23
    Proportion Var        0.33 0.18
    Cumulative Var        0.33 0.51

In the pattern matrix part, you can see the loadings of the two principal components (PCs) in the PC1 and PC2 columns. These loadings are the correlations of the observed variables with the two PCs. You can use the component loadings to interpret the meaning of the PCs. The h2 column tells you how much of each variable's variance is explained by the two components, while the u2 column shows the remaining unexplained part. In the SS loadings row, you can see the eigenvalues of the first two PCs, and the Proportion Var and Cumulative Var rows show how much variability is explained by each of them and cumulatively by both of them.
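
If you want to check these eigenvalues yourself, one possibility (a minimal sketch using base R and the scree() function from the psych package, assuming the TMPCAEFA data frame defined earlier has no missing values) is to compute them directly from the correlation matrix:

# Eigenvalues of the correlation matrix - compare with the SS loadings row
eigen(cor(TMPCAEFA))$values;
# Scree plot to help decide how many components to retain
scree(TMPCAEFA);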

From the pattern matrix, you can see that most of the input variables correlate more strongly with PC1 than with PC2. This is logical, because this is how the components were calculated. However, it is hard to interpret the meaning of the two components. Note that for pure machine learning, you might not even be interested in such an interpretation; you might simply continue with further analysis using only the two principal components instead of the original variables.
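
For example, a minimal sketch of that approach (assuming the pcaTM_unrotated object from the previous call, which computes the component scores by default) would replace the seven original variables with the two score columns:

# Component scores: two new variables, one pair of values per case
TMcomponents <- as.data.frame(pcaTM_unrotated$scores);
head(TMcomponents);
# The scores of the unrotated components are (nearly) uncorrelated
cor(TMcomponents);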

You can improve your understanding of the PCs by rotating them. This means you rotate the axes of the multidimensional hyperspace. The rotation is done in such a way that each PC is associated as strongly as possible with a different subset of the input variables. The rotation can be orthogonal, where the rotated components are still uncorrelated, or oblique, where correlation between the PCs is allowed. In principal component analysis, you typically use an orthogonal rotation, because you probably want to use uncorrelated components in further analysis. The following code recalculates the two PCs using varimax, the most popular orthogonal rotation:

pcaTM_varimax <- principal(TMPCAEFA, nfactors = 2, rotate = "varimax");
pcaTM_varimax;

The abbreviated results are now slightly different, and definitely more interpretable:

Standardized loadings (pattern matrix) based upon correlation matrix
                           PC1   PC2    h2   u2 com
    TotalChildren         0.38  0.76 0.712 0.29 1.5
    NumberChildrenAtHome  0.78  0.15 0.636 0.36 1.1
    HouseOwnerFlag       -0.07  0.48 0.234 0.77 1.0
    NumberCarsOwned       0.78  0.09 0.615 0.39 1.0
    BikeBuyer            -0.08 -0.30 0.097 0.90 1.1
    YearlyIncome          0.79  0.01 0.628 0.37 1.0
    Age                   0.04  0.80 0.635 0.36 1.0
    
                           PC1  PC2
    SS loadings           2.00 1.56
    Proportion Var        0.29 0.22
    Cumulative Var        0.29 0.51

Now you can easily see that PC1 has high loadings on, or correlates highly with, the number of children at home, the number of cars owned, the yearly income, and also the total number of children. PC2 correlates mostly with the total number of children and the age, reasonably well with the house ownership flag, and negatively with the bike buyer flag.

When you do an EFA, you definitely want to understand the results. The factors are the underlying combined variables that help you understand your data. This is similar to adding computed variables, just in a more complex way, with many input variables. For example, the obesity index could be interpreted as a very simple factor that combines height and weight, and gives you much more information about a person's health than the two base variables do. Therefore, you typically rotate the factors, and also allow correlations between them. The following code uses the fa() function from the psych package to extract two factors from the same dataset as used for the PCA, this time with promax rotation, which is an oblique rotation:

efaTM_promax <- fa(TMPCAEFA, nfactors = 2, rotate = "promax"); 
efaTM_promax; 

Here are the abbreviated results:

Standardized loadings (pattern matrix) based upon correlation matrix
                           MR1   MR2    h2   u2 com
    TotalChildren         0.23  0.72 0.684 0.32 1.2
    NumberChildrenAtHome  0.77  0.04 0.618 0.38 1.0
    HouseOwnerFlag        0.03  0.19 0.040 0.96 1.1
    NumberCarsOwned       0.55  0.07 0.332 0.67 1.0
    BikeBuyer            -0.05 -0.14 0.027 0.97 1.2
    YearlyIncome          0.60 -0.02 0.354 0.65 1.0
    Age                  -0.18  0.72 0.459 0.54 1.1
    
                           MR1  MR2
    SS loadings           1.38 1.13
    Proportion Var        0.20 0.16
    Cumulative Var        0.20 0.36
    
    With factor correlations of 
         MR1  MR2
    MR1 1.00 0.36
    MR2 0.36 1.00
    

Note that this time the results include the correlation between the two factors. In order to interpret the results even more easily, you can use the fa.diagram() function, as the following code shows:

fa.diagram(efaTM_promax, simple = FALSE, 
           main = "EFA Promax"); 

The code produces the following diagram:

Exploratory Factor Analysis

You can now easily see which variables correlate with which of the two factors, and also the correlation between the two factors.
