Introduction: The Basics of Principal Component Analysis

Principal component analysis is an appropriate procedure when you have obtained measures on a number of observed variables and want to develop a smaller number of variables (called principal components) that will account for most of the variance in the observed variables. The principal components can then be used as predictors or criterion variables in subsequent analyses.

A Variable Reduction Procedure

Principal component analysis is a variable reduction procedure. It is useful when you obtain data for a number of variables (possibly a large number of variables) and believe that there is redundancy among those variables. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, you believe that it should be possible to reduce the observed variables into a smaller number of principal components that account for most of the observed variance in the variables.

Because it is a variable reduction procedure, principal component analysis is similar in many respects to exploratory factor analysis. In fact, the steps followed when conducting a principal component analysis are virtually identical to those followed when conducting an exploratory factor analysis. However, there are significant conceptual differences between the two procedures, and it is important that you do not mistakenly claim that you are performing factor analysis when you are actually performing principal component analysis. The differences between these two procedures are described in greater detail in a later section, “Principal Component Analysis Is Not Factor Analysis.”

An Illustration of Variable Redundancy

The following fictitious research example illustrates the concept of variable redundancy. Imagine that you have developed a 7-item measure of job satisfaction. The instrument is reproduced here:

Please respond to each of the following statements by placing a rating in the space to the left of the statement. In making your ratings, use any number from 1 to 7 in which 1 = "strongly disagree" and 7 = "strongly agree."

_____ 1.  My supervisor(s) treats me with consideration.

_____ 2.  My supervisor(s) consults me concerning important decisions that affect my work.

_____ 3.  My supervisor(s) gives me recognition when I do a good job.

_____ 4.  My supervisor(s) gives me the support I need to do my job well.

_____ 5.  My pay is fair.

_____ 6.  My pay is appropriate, given the amount of responsibility that comes with my job.

_____ 7.  My pay is comparable to the pay earned by other employees whose jobs are similar to mine.


Perhaps you began your investigation with the intention of administering this questionnaire to 200 employees, using their responses to the seven items as seven separate variables in subsequent analyses.

There are several problems with conducting the study in this manner, however. One of the more important problems involves the concept of redundancy that was previously mentioned. Examine the content of the seven items in the questionnaire closely. Notice that items 1 to 4 all deal with the same topic: employees’ satisfaction with their supervisors. In this way, items 1 to 4 are somewhat redundant. Similarly, notice that items 5 to 7 also all seem to deal with the same topic: employees’ satisfaction with their pay.

Empirical findings can further support the notion that there is redundancy among items. Assume that you administer the questionnaire to 200 employees and compute all possible correlations among responses to the seven items. The resulting fictitious correlations are presented in Table 15.1:

Table 15.1. Correlations among Seven Job Satisfaction Items

Variable      1      2      3      4      5      6      7
    1       1.00
    2        .75   1.00
    3        .83    .82   1.00
    4        .68    .92    .88   1.00
    5        .03    .01    .04    .01   1.00
    6        .05    .02    .05    .07    .89   1.00
    7        .02    .06    .00    .03    .91    .76   1.00

Note. N = 200.

When correlations among several variables are computed, they are typically summarized in the form of a correlation matrix such as the one presented in Table 15.1. This is an appropriate opportunity to review just how a correlation matrix is interpreted. The rows and columns of Table 15.1 correspond to the seven variables included in the analysis. Row 1 (and column 1) represents variable 1, row 2 (and column 2) represents variable 2, and so forth. Where a given row and column intersect, you will find the correlation between the two corresponding variables. For example, where the row for variable 2 intersects with the column for variable 1, you find a correlation of .75; this means that the correlation between variables 1 and 2 is .75.
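A correlation matrix such as this one could be produced with the SAS CORR procedure. The following is a minimal sketch; the dataset name JOBSAT and the variable names V1 through V7 are hypothetical:

proc corr data=jobsat;
   var v1-v7;   /* the seven job satisfaction items */
run;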

The correlations in Table 15.1 show that the seven items seem to hang together in two distinct groups. First, notice that items 1 through 4 show relatively strong correlations with one another. This could be because items 1 through 4 are measuring the same construct. In the same way, items 5 through 7 correlate strongly with one another (a possible indication that they also measure a single construct). Even more interesting, notice that items 1 through 4 demonstrate very weak correlations with items 5 through 7. This is what you would expect to see if items 1 through 4 and items 5 through 7 were measuring two different constructs.

Given this apparent redundancy, it is likely that the seven items of the questionnaire are not really measuring seven different constructs. More likely, items 1 through 4 are measuring a single construct that could reasonably be labeled “satisfaction with supervision” whereas items 5 through 7 are measuring a different construct that could be labeled “satisfaction with pay.”

If responses to the seven items actually displayed the redundancy suggested by the pattern of correlations in Table 15.1, it would be advantageous to reduce the number of variables in this dataset, so that (in a sense) items 1 through 4 are collapsed into a single new variable that reflects employees’ satisfaction with supervision and items 5 through 7 are collapsed into a single new variable that reflects satisfaction with pay. You could then use these two new variables (rather than the seven original variables) as predictor variables in multiple regression or any other type of analysis.

In essence, this is what is accomplished by principal component analysis: it allows you to reduce a set of observed variables into a smaller set of variables called principal components. The resulting principal components can then be used in subsequent analyses.

What Is a Principal Component?

How Principal Components Are Computed

A principal component can be defined as a linear combination of optimally weighted observed variables. In order to understand the meaning of this definition, it is necessary to first describe how scores on a principal component are computed.

In the course of performing a principal component analysis, it is possible to calculate a score for each participant for a given principal component. For example, in the preceding study, each participant would have scores on two components: one score on the satisfaction with supervision component and one score on the satisfaction with pay component. Participants’ actual scores on the seven questionnaire items would be optimally weighted and then summed to compute their scores for a given component.

Below is the general form for the formula to compute scores on the first component extracted (created) in a principal component analysis:

C1 = b11(X1) + b12(X2) + ... + b1p(Xp)

where

C1 = the participant’s score on principal component 1 (the first component extracted);
b1p = the regression coefficient (or weight) for observed variable p, as used in creating principal component 1;
Xp = the participant’s score on observed variable p.

For example, assume that component 1 in the present study was “satisfaction with supervision.” You could determine each participant’s score on principal component 1 by using the following fictitious formula:

C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)

In the present case, the observed variables (the “X” variables) are responses to the seven job satisfaction questions: X1 represents question 1; X2 represents question 2; and so forth. Notice that different regression coefficients were assigned to the different questions in computing scores on component 1: questions 1 through 4 were assigned relatively large regression weights that range from .32 to .44 whereas questions 5 through 7 were assigned very small weights ranging from .01 to .03. This makes sense, because component 1 is the satisfaction with supervision component and satisfaction with supervision is assessed by questions 1 through 4. It is therefore appropriate that items 1 through 4 are given a good deal of weight in computing participant scores on this component, while items 5 through 7 have comparatively little weight.

Obviously, a different equation, with different regression weights, would be used to compute scores on component 2 (i.e., satisfaction with pay). Below is a fictitious illustration of this formula:

C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)

The preceding shows that, in creating scores on the second component, much weight is given to items 5 through 7 and comparatively little is given to items 1 through 4. As a result, component 2 should account for much of the variability in the three satisfaction with pay items (i.e., it should be strongly correlated with those three items).
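In SAS, component scores such as these could be computed directly in a DATA step. The sketch below uses the fictitious weights shown above and assumes a hypothetical dataset named JOBSAT containing the item responses as variables V1 through V7; in an actual analysis, the items would first be standardized, and PROC FACTOR can compute and output these scores for you (as shown later in this chapter):

data scored;
   set jobsat;
   /* fictitious weights for component 1: satisfaction with supervision */
   c1 = .44*v1 + .40*v2 + .47*v3 + .32*v4 + .02*v5 + .01*v6 + .03*v7;
   /* fictitious weights for component 2: satisfaction with pay */
   c2 = .01*v1 + .04*v2 + .02*v3 + .02*v4 + .48*v5 + .31*v6 + .39*v7;
run;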

At this point, it is reasonable to wonder how the regression weights from the preceding equations are determined. The SAS PROC FACTOR procedure generates these weights by using a special type of equation called an eigenequation. The weights produced by these eigenequations are optimal weights in the sense that, for a given set of data, no other set of weights could produce components that are more effective in accounting for variance among observed variables. The weights are created so as to satisfy a principle of least squares similar (but not identical) to the principle of least squares used in multiple regression (see Chapter 14, “Multiple Regression”). Later, this chapter shows how PROC FACTOR can be used to extract (create) principal components.
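To see what an eigenequation produces, you could submit a correlation matrix to the SAS IML procedure and request its eigenvalues and eigenvectors. The following sketch uses the fictitious correlations from Table 15.1:

proc iml;
   /* fictitious correlation matrix from Table 15.1 */
   R = {1.00 0.75 0.83 0.68 0.03 0.05 0.02,
        0.75 1.00 0.82 0.92 0.01 0.02 0.06,
        0.83 0.82 1.00 0.88 0.04 0.05 0.00,
        0.68 0.92 0.88 1.00 0.01 0.07 0.03,
        0.03 0.01 0.04 0.01 1.00 0.89 0.91,
        0.05 0.02 0.05 0.07 0.89 1.00 0.76,
        0.02 0.06 0.00 0.03 0.91 0.76 1.00};
   call eigen(evals, evecs, R);   /* eigenvalues and eigenvectors of R */
   print evals evecs;
quit;

Each eigenvalue gives the amount of total variance accounted for by the corresponding component, and the elements of each eigenvector supply (up to scaling) the optimal weights applied to the standardized variables.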

It is now possible to understand the definition provided at the beginning of this section more fully. There, a principal component was defined as a linear combination of optimally weighted observed variables. The words “linear combination” refer to the fact that scores on a component are created by adding together scores on the observed variables being analyzed. “Optimally weighted” refers to the fact that the observed variables are weighted in such a way that the resulting components account for a maximal amount of observed variance in the dataset.

Number of Components Extracted

The preceding section might have created the impression that, if a principal component analysis were performed on data from the 7-item job satisfaction questionnaire, only two components would be created. However, such an impression would not be entirely correct.

In reality, the number of components extracted in a principal component analysis is equal to the number of observed variables being analyzed. This means that an analysis of your 7-item questionnaire would result in seven components, not two.

In most analyses, however, only the first few components account for meaningful amounts of variance, so only these first few components are retained, interpreted, and used in subsequent analyses (such as multiple regression). For example, in your analysis of the 7-item job satisfaction questionnaire, it is likely that only the first two components would account for a meaningful amount of variance; therefore only these would be retained for interpretation. You would assume that the remaining five components account for only trivial amounts of variance, so they would not be retained, interpreted, or further analyzed.
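A sketch of the corresponding PROC FACTOR statements follows; it again assumes a hypothetical raw dataset named JOBSAT, and the full syntax is developed later in this chapter:

proc factor data=jobsat
            method=prin      /* extraction by principal components */
            nfactors=2       /* retain only the first two components */
            rotate=varimax   /* an orthogonal rotation */
            out=scores;      /* write component scores to a new dataset */
   var v1-v7;
run;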

Characteristics of Principal Components

The first component extracted in a principal component analysis accounts for a maximal amount of total variance in the observed variables. Under typical conditions, this means that the first component is correlated with at least some observed variables. It is usually correlated with many.

The second component extracted will have two important characteristics. First, this component will account for a maximal amount of variance in the dataset that was not accounted for by the first component. Again under typical conditions, this means that the second component is correlated with some observed variables that did not display strong correlations with component 1.

The second characteristic of the second component is that it is uncorrelated with the first component. Literally, if you were to compute the correlation between components 1 and 2, that correlation would be zero. (For the exception, see the following section regarding oblique solutions.)

The remaining components extracted in the analysis display the same two characteristics: each component accounts for a maximal amount of variance in the observed variables that is not accounted for by the preceding components and is uncorrelated with all of the preceding components. A principal component analysis proceeds in this manner with each new component accounting for progressively smaller amounts of variance. This is why only the first few components are usually retained and interpreted. When the analysis is complete, the resulting components display varying degrees of correlation with the observed variables, but are completely uncorrelated with one another.
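This property can be verified empirically. If the component scores were saved with the OUT= option, as in the earlier PROC FACTOR sketch, correlating the score variables (which PROC FACTOR names FACTOR1, FACTOR2, and so on) should return zero, within rounding, for an orthogonal solution:

proc corr data=scores;
   var factor1 factor2;   /* component scores from the earlier sketch */
run;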

What is meant by “total variance” in the dataset?

To understand the meaning of total variance as it is used in a principal component analysis, remember that the observed variables are standardized in the course of the analysis. This means that each variable is transformed so that it has a mean of zero and a standard deviation of one (and hence a variance of one). The “total variance” in the dataset is simply the sum of variances of these observed variables. Because they have been standardized to have a standard deviation of one, each observed variable contributes one unit of variance to “total variance” in the dataset. Because of this, total variance in a principal component analysis always equals the number of observed variables analyzed. For example, if seven variables are analyzed, the total variance equals seven. The components that are extracted in the analysis partition this variance. Perhaps the first component accounts for 3.2 units of total variance; perhaps the second component accounts for 2.1 units. The analysis continues in this way until all variance in the dataset is accounted for.
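PROC FACTOR performs this standardization implicitly when it analyzes the correlation matrix, but the transformation can be made explicit with PROC STANDARD. A sketch, again assuming the hypothetical JOBSAT dataset:

proc standard data=jobsat mean=0 std=1 out=jobsat_std;
   var v1-v7;   /* each standardized item contributes one unit of variance */
run;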


Orthogonal versus Oblique Solutions

This chapter discusses only principal component analyses that result in orthogonal solutions. An orthogonal solution is one in which the components are uncorrelated (orthogonal means “uncorrelated”).

It is possible to perform a principal component analysis that results in correlated components. Such a solution is referred to as an oblique solution. In some situations, oblique solutions are preferred to orthogonal solutions because they produce cleaner, more easily interpreted results.

However, oblique solutions can also be somewhat more complicated to interpret. For this reason, the present chapter focuses only on the interpretation of orthogonal solutions.
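For reference, the difference between the two types of solutions often comes down to a single rotation option in PROC FACTOR. The sketch below uses PROMAX, one commonly used oblique rotation; the remainder of this chapter uses orthogonal rotations only:

proc factor data=jobsat method=prin nfactors=2 rotate=promax;
   var v1-v7;   /* promax allows the components to correlate */
run;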

Principal Component Analysis Is Not Factor Analysis

Principal component analysis is often confused with factor analysis. This is understandable because there are many important similarities between the two procedures. Both are methods that can be used to identify groups of observed variables that tend to hang together empirically. Both procedures can also be performed with the SAS FACTOR procedure and they generally provide similar results.

Nonetheless, there are some important conceptual differences between principal component analysis and factor analysis that should be understood at the outset. Perhaps the most important deals with the assumption of an underlying causal structure. Factor analysis assumes that the covariation in the observed variables is due to the presence of one or more latent variables (factors) that exert causal influence on these observed variables. An example of such a causal structure is presented in Figure 15.1:

Figure 15.1. Example of the Underlying Causal Structure That Is Assumed in Factor Analysis


The ovals in Figure 15.1 represent the latent (unmeasured) factors of “satisfaction with supervision” and “satisfaction with pay.” These factors are latent in the sense that they are assumed to actually exist in the employees’ belief systems but cannot be measured directly. However, they do exert an influence on employees’ responses to the seven items that constitute the job satisfaction questionnaire described earlier. (These seven items are represented as squares labeled V1 to V7 in the figure.) It can be seen that the “supervision” factor exerts influence on items V1 to V4 (the supervision questions) whereas the “pay” factor exerts influence on items V5 to V7 (the pay items).

Researchers use factor analysis when they believe that certain latent factors exist that exert causal influence on the observed variables they are studying. Exploratory factor analysis helps the researcher identify the number and nature of such latent factors.

In contrast, principal component analysis makes no assumption about an underlying causal structure. Principal component analysis is simply a variable reduction procedure that (typically) results in a relatively small number of components that account for most of the variance in a set of observed variables (i.e., groupings of observed variables vs. latent constructs).
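In PROC FACTOR, this conceptual distinction surfaces concretely in the prior communality estimates. A hedged sketch of the contrast (the dataset name remains hypothetical):

/* principal component analysis: analyze total variance */
proc factor data=jobsat method=prin priors=one nfactors=2;
   var v1-v7;
run;

/* common (principal) factor analysis: analyze shared variance only,
   using squared multiple correlations as prior communality estimates */
proc factor data=jobsat method=prin priors=smc nfactors=2;
   var v1-v7;
run;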

In summary, both factor analysis and principal component analysis have important roles to play in social science research, but their conceptual foundations are quite distinct.
