Pearson Correlations

When to Use

You can use the Pearson product-moment correlation coefficient (symbolized by the letter r) to assess the nature of the relationship between two variables when both are measured on either an interval or ratio level of measurement. It is further assumed that both variables include a relatively large number of values. For example, you would not use this statistic if one of the variables could assume only three values.

It would be appropriate to compute a Pearson correlation coefficient to investigate the nature of the relationship between GRE verbal test scores and grade point average (GPA). GRE verbal is assessed on an interval level of measurement and can assume a wide variety of values (i.e., possible scores range from 200 through 800). Grade point average is also assessed on an interval level and can also assume a wide variety of values from 0.00 through 4.00.

In addition to interval- or ratio-scale measurement, the Pearson statistic also assumes that the observed values are distributed normally. When one or both variables display a markedly non-normal distribution (e.g., when one or both variables are markedly skewed), it might be more appropriate to analyze the data with the Spearman correlation coefficient. A later section of this chapter discusses the Spearman coefficient. (Summarized assumptions of both the Pearson and Spearman correlations are at the end of this chapter.)

Interpreting the Coefficient

To more fully understand the nature of the relationship between the two variables studied, it is necessary to interpret two characteristics of a Pearson correlation coefficient. First, the sign of the coefficient tells you whether there is a positive or negative relationship between variables. A positive correlation indicates that as values for one variable increase, values for the second variable also increase. A positive correlation is illustrated in Figure 6.2, which shows the relationship between GRE verbal test scores and GPA in a fictitious sample of data.

Figure 6.2. A Positive Correlation


You can see that participants who received low scores on the predictor variable (GRE verbal) also received low scores on the criterion variable (GPA). At the same time, participants who received high scores on GRE verbal also received high scores on GPA. The two variables can therefore be said to be positively correlated.

With a negative correlation, as values for one variable increase, values for the second variable decrease. For example, you might expect to see a negative correlation between GRE verbal test scores and the number of errors that participants make on a vocabulary test (i.e., the students with high GRE verbal scores tend to make few mistakes, and the students with low GRE scores tend to make many mistakes). This relationship is illustrated with fictitious data in Figure 6.3.

Figure 6.3. A Negative Correlation


The second characteristic of a correlation coefficient is its magnitude: the greater the absolute value of a correlation coefficient, the stronger the relationship between the two variables. Pearson correlation coefficients can range in size from –1.00 through 0.00 through +1.00. Coefficients of 0.00 indicate no relationship between variables. For example, if there were a zero correlation between GRE scores and GPA, then knowing a person’s GRE score would tell you nothing about his or her GPA. In contrast, correlations of –1.00 or +1.00 indicate perfect relationships. If the correlation between GRE scores and GPA were 1.00, it would mean that knowing someone’s GRE score would allow you to predict his or her GPA with complete accuracy. In the real world, however, GRE scores are not that strongly related to GPA, so you would expect the correlation between them to be considerably less than 1.00.

The following is an approximate guide for interpreting the strength of the relationship between two variables, based on the absolute value of the coefficient:

±1.00 = Perfect correlation
 ±.80 = Strong correlation
 ±.50 = Moderate correlation
 ±.20 = Weak correlation
 ±.00 = No correlation

We recommend that you consider the magnitude of correlation coefficients as opposed to whether or not coefficients are statistically significant. This is because significance estimates are strongly influenced by sample size. For instance, an r value of .15 (weak correlation) would be statistically significant with samples in excess of 700, whereas a coefficient of .50 (moderate correlation) would not be statistically significant with a sample of only 15 participants.
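The dependence of significance on sample size is easy to verify. The p value for r comes from the test statistic t = r√(n − 2)/√(1 − r²), evaluated with n − 2 degrees of freedom. This quick sketch (in Python with SciPy, purely an aside to illustrate the point; it is not part of the SAS workflow) reproduces the two cases just described:

```python
from math import sqrt
from scipy.stats import t as t_dist

def r_p_value(r, n):
    """Two-tailed p value for H0: rho = 0, given sample r and sample size n."""
    t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * t_dist.sf(abs(t_stat), df=n - 2)

# A weak correlation with a very large sample is significant...
print(r_p_value(0.15, 700))  # p < .05

# ...while a moderate correlation with a tiny sample is not.
print(r_p_value(0.50, 15))   # p > .05
```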

Remember that you consider the absolute value of the coefficient when interpreting its size. That is, a correlation of –.50 is just as strong as a correlation of +.50, a correlation of –.75 is just as strong as a correlation of +.75, and so forth.
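Although SAS computes r for you, it may help to see what the number actually is: the covariance of the two variables divided by the product of their standard deviations. The following Python sketch (an aside; the GRE/GPA values are invented for illustration) computes r directly from this definition:

```python
# Pearson r = covariance(X, Y) / (SD_X * SD_Y)
# Illustrative values only; not data from this chapter.
gre = [300, 400, 500, 600, 700]
gpa = [2.0, 2.4, 3.1, 3.2, 3.9]

n = len(gre)
mean_gre = sum(gre) / n
mean_gpa = sum(gpa) / n

# Population (divide-by-n) formulas; the n's cancel in the ratio,
# so r is the same either way.
cov = sum((x - mean_gre) * (y - mean_gpa) for x, y in zip(gre, gpa)) / n
sd_gre = (sum((x - mean_gre) ** 2 for x in gre) / n) ** 0.5
sd_gpa = (sum((y - mean_gpa) ** 2 for y in gpa) / n) ** 0.5

r = cov / (sd_gre * sd_gpa)
print(round(r, 3))  # → 0.983, a strong positive correlation
```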

Linear versus Nonlinear Relationships

The Pearson correlation is appropriate only if there is a linear relationship between the two variables. There is a linear relationship between two variables when their scattergram follows the form of a straight line. For example, it is possible to draw a straight line through the center of the scattergram presented in Figure 6.4, and this straight line fits the pattern of the data fairly well. This means that there is a linear relationship between GRE verbal test scores and GPA.

Figure 6.4. A Linear Relationship


In contrast, there is a nonlinear relationship between two variables if their scattergram does not follow the form of a straight line. For example, imagine that you have constructed a test of creativity and have administered it to a large sample of college students. With this test, higher scores reflect higher levels of creativity. Imagine further that you obtain the GRE verbal test scores for these students, plot their GRE scores against their creativity scores, and obtain the scattergram presented in Figure 6.5.

Figure 6.5. A Nonlinear Relationship


The scattergram in Figure 6.5 reveals a nonlinear relationship between GRE scores and creativity. It shows that:

  • students with low GRE scores tend to have low creativity scores;

  • students with moderate GRE scores tend to have high creativity scores;

  • students with high GRE scores tend to have low creativity scores.

It is not possible to draw a good-fitting straight line through the data points of Figure 6.5. This is why we say that there is a nonlinear (or perhaps a curvilinear) relationship between GRE scores and creativity scores.

When one uses the Pearson correlation to assess the relationship between variables reflecting a nonlinear relationship, the resulting correlation coefficient usually underestimates the actual strength of the relationship between variables. For example, computing the Pearson correlation between the GRE scores and creativity scores presented in Figure 6.5 might result in a coefficient of .10, which would indicate a very weak relationship between the two variables. From the diagram, however, there is clearly a fairly strong relationship between GRE scores and creativity. The figure shows that if you know someone’s GRE score, you can accurately predict his or her creativity score.
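You can see this numerically with a small simulated example. Here a perfect but nonlinear (inverted-U) relationship, like the one in Figure 6.5, yields a Pearson coefficient of essentially zero (a Python aside; the numbers are invented):

```python
import numpy as np

# Creativity peaks at moderate "GRE" values and falls off at the extremes:
# a perfect, deterministic, inverted-U relationship.
gre = np.linspace(200, 800, 25)
creativity = -(gre - 500) ** 2

r = np.corrcoef(gre, creativity)[0, 1]
print(round(r, 4))  # essentially 0, despite a perfect relationship
```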

The implication of this is that you should always verify that there is a linear relationship between two variables before computing a Pearson correlation for those variables. One of the easiest ways of verifying that the relationship is linear is to prepare a scattergram similar to those presented in the preceding figures. Fortunately, this is easily done by SAS with the GPLOT procedure.

Producing Scattergrams with PROC GPLOT

Here is the general form for requesting a scattergram with the GPLOT procedure:

PROC GPLOT   DATA=dataset-name;
   PLOT    criterion-variable*predictor-variable ;
RUN;

The variable listed as the “criterion-variable” in the preceding program is plotted on the vertical axis, and the “predictor-variable” is plotted on the horizontal axis.

To illustrate this procedure, imagine that you have conducted a study dealing with the investment model, a theory of commitment in romantic associations (Rusbult, 1980). The investment model identifies a number of variables that are believed to influence a person’s commitment to a romantic association. Commitment refers to the participant’s intention to remain in the relationship. These are some of the variables that are predicted to influence participant commitment:

Satisfaction: The participant’s affective response to the relationship
Investment size: The amount of time and personal resources that the participant has put into the relationship
Alternative value: The attractiveness of the participant’s alternatives to the relationship (e.g., the attractiveness of alternate romantic partners)

Assume that you have developed a 16-item questionnaire to measure these four variables. The questionnaire is administered to 20 participants who are currently involved in a romantic relationship, and the participants are asked to complete the instrument while thinking about their relationship. When they have completed the questionnaire, it is possible to use their responses to compute four scores for each participant. First, each receives a score on the commitment scale. Higher values on the commitment scale reflect greater commitment to the relationship. Each participant also receives a score on the satisfaction scale, where higher scores reflect greater satisfaction with the relationship. Higher scores on the investment scale mean that the participant believes that he or she has invested a great deal of time and effort in the relationship. Finally, with the alternative value scale, higher scores mean that it would be attractive to the respondent to find a different romantic partner.

Once the data have been entered, you can use the GPLOT procedure to prepare scattergrams for various combinations of variables. The following SAS program inputs some fictitious data and requests that a scattergram be prepared in which commitment scores are plotted against satisfaction scores:

 1     DATA D1;
 2        INPUT   #1   @1   COMMITMENT    2.
 3                     @4   SATISFACTION  2.
 4                     @7   INVESTMENT    2.
 5                     @10  ALTERNATIVES  2.   ;
 6     DATALINES;
 7     20 20 28 21
 8     10 12  5 31
 9     30 33 24 11
10      8 10 15 36
11     22 18 33 16
12     31 29 33 12
13      6 10 12 29
14     11 12  6 30
15     25 23 34 12
16     10  7 14 32
17     31 36 25  5
18      5  4 18 30
19     31 28 23  6
20      4  6 14 29
21     36 33 29  6
22     22 21 14 17
23     15 17 10 25
24     19 16 16 22
25     12 14 18 27
26     24 21 33 16
27     ;
28     RUN;
29
30     PROC GPLOT   DATA=D1;
31        PLOT COMMITMENT*SATISFACTION;
32     RUN;

In the preceding program, scores on the commitment scale are entered in columns 1 to 2, and are given the SAS variable name COMMITMENT. Similarly, scores on the satisfaction scale are entered in columns 4 to 5, and are given the name SATISFACTION; scores on the investment scale appear in columns 7 to 8 and are given the name INVESTMENT; and scores on the alternative value scale appear as the last column of data and are given the name ALTERNATIVES.

The data for the 20 participants appear on lines 7 to 26 in the program. There is one line of data for each participant.

Line 30 of the program requests the GPLOT procedure, specifying that the dataset to be analyzed is dataset D1. The PLOT command on line 31 specifies COMMITMENT as the criterion variable and SATISFACTION as the predictor variable for this analysis. The results of this analysis appear in Output 6.1.

Output 6.1. Scattergram of Commitment Scores Plotted against Satisfaction Scores


Notice that in this output, the criterion variable (COMMITMENT) is plotted on the vertical axis while the predictor variable (SATISFACTION) is plotted on the horizontal axis. The shape of the scattergram indicates that there is a linear relationship between SATISFACTION and COMMITMENT. This is evident from the fact that it would be possible to draw a relatively good-fitting straight line through the center of the scattergram. Given that the relationship is linear, it seems safe to proceed with the computation of a Pearson correlation coefficient for this pair of variables.

The general shape of the scattergram also suggests that there is a fairly strong relationship between the two variables: knowing where a participant stands on the SATISFACTION variable allows you to predict with some accuracy where that participant will stand on the COMMITMENT variable. Later, you will compute the correlation coefficient for these two variables to determine just how strong the relationship is.

Output 6.1 also indicates that the relationship between SATISFACTION and COMMITMENT is positive (i.e., large values on SATISFACTION are associated with large values on COMMITMENT and small values on SATISFACTION are associated with small values on COMMITMENT). This makes intuitive sense; you would expect that participants who are highly satisfied with their relationships would also be highly committed to those relationships. To illustrate a negative relationship, you can plot COMMITMENT against ALTERNATIVES. To do this, include the following statements in the preceding program:

     PROC GPLOT   DATA=D1;
        PLOT COMMITMENT*ALTERNATIVES;
     RUN;

These statements are identical to the earlier statements except that ALTERNATIVES is now specified as the predictor variable. These statements produce the scattergram presented in Output 6.2.

Output 6.2. Scattergram of Commitment Scores Plotted against Alternative Value Scores


Notice that the relationship between these two variables is negative. This is what you would expect as it makes intuitive sense that participants who indicate that alternatives to their current romantic partner are attractive would not be overly committed to a current partner. The relationship between ALTERNATIVES and COMMITMENT also appears to be linear. It is therefore appropriate to assess the strength of the relationship between these variables with the Pearson correlation coefficient.

Computing Pearson Correlations with PROC CORR

The CORR procedure offers a number of options regarding what type of coefficient will be computed as well as a number of options regarding the way they will appear on the printed page. Some of these options are discussed here.

Computing a Single Correlation Coefficient

In some instances, you might want to compute the correlation between just two variables. Here is the general form for the statements that will accomplish this:

PROC CORR   DATA=dataset-name   options;
   VAR   variable1   variable2;
RUN;

The choice of which variable is “variable1” and which is “variable2” is arbitrary. For a specific example, assume that you want to compute the correlation between commitment and satisfaction. These are the required statements:

PROC CORR   DATA=D1;
    VAR COMMITMENT SATISFACTION;
RUN;

These statements result in a single page of output, reproduced here as Output 6.3:

Output 6.3. Computing the Pearson Correlation between Commitment and Satisfaction
                                 The CORR Procedure

                      2  Variables:    COMMITMENT   SATISFACTION


                                Simple Statistics
Variable         N      Mean     Std Dev         Sum     Minimum     Maximum

COMMITMENT      20  18.60000    10.05459   372.00000     4.00000    36.00000
SATISFACTION    20  18.50000     9.51177   370.00000     4.00000    36.00000


                       Pearson Correlation Coefficients, N = 20
                               Prob > |r| under H0: Rho=0

                                       COMMITMENT      SATISFACTION

                     COMMITMENT           1.00000           0.96252
                                                             <.0001

                     SATISFACTION         0.96252           1.00000
                                           <.0001

The first part of Output 6.3 presents simple descriptive statistics for the variables being analyzed. This allows you to verify that everything looks appropriate (e.g., the correct number of cases were analyzed, no variables were out of range). The names of the variables appear below the “Variable” heading, and the statistics for the variables appear to the right of the variable names. These descriptive statistics show that 20 participants provided usable data for the COMMITMENT variable, that the mean for COMMITMENT is 18.6, and the standard deviation is 10.05. It is always important to review the “Minimum” and “Maximum” columns to verify that no impossible scores appear in the data. With COMMITMENT, the lowest possible score was 4 and the highest possible score was 36. The “Minimum” and “Maximum” columns of Output 6.3 show that no observed values were out of range, thus providing no evidence of misentered data. (Again, these proofing procedures do not guarantee that no errors were made in entering data but they are useful for identifying some types of errors.) Since the descriptive statistics provide no obvious evidence of entering or programming mistakes, you are now free to review the correlations themselves.

The bottom half of Output 6.3 provides the correlations requested in the VAR statement. There are actually four correlation coefficients in the output because your statement requested that the system compute every possible correlation between the variables COMMITMENT and SATISFACTION. This caused SAS to compute the correlation between COMMITMENT and SATISFACTION, between SATISFACTION and COMMITMENT, between COMMITMENT and COMMITMENT, and between SATISFACTION and SATISFACTION.

The correlation between COMMITMENT and COMMITMENT appears in the upper-left corner of the matrix of correlation coefficients in Output 6.3. You can see that the correlation between these variables is 1.00. This makes sense, because the correlation of any variable with itself is always equal to 1.00. Similarly, in the lower-right corner, you see that the correlation between SATISFACTION and SATISFACTION is also 1.00.

The coefficient you are actually interested in appears where the column headed COMMITMENT intersects with the row headed SATISFACTION. The top number in the “cell” where this column and row intersect is .96, which is the Pearson correlation between COMMITMENT and SATISFACTION (rounded to two decimal places).

Just below the correlation is the p value associated with the correlation. This is the significance estimate obtained from a test of the null hypothesis that the correlation between COMMITMENT and SATISFACTION is zero in the population. More technically, the p value gives the probability that you would obtain a sample correlation this large (or larger) if the correlation between COMMITMENT and SATISFACTION were really zero in the population. For the present correlation coefficient of r = .96, the corresponding p value is less than .0001. This means that, given your sample size, there is less than 1 chance in 10,000 of obtaining a correlation of .96 or larger from this population by chance alone. You can therefore reject the null hypothesis and tentatively conclude that COMMITMENT is related to SATISFACTION in the population. (The alternative hypothesis for this statistical test is that the correlation is not equal to zero in the population. This alternative hypothesis is two-sided, which means that it does not predict whether the correlation coefficient is positive or negative, only that it is not equal to zero.)
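As a check on the logic (not something you would normally need to do), the coefficient in Output 6.3 can be reproduced outside SAS from the 20 data lines in the program. In Python:

```python
import numpy as np

# COMMITMENT and SATISFACTION scores for the 20 participants,
# transcribed from the DATALINES section of the program.
commitment = [20, 10, 30,  8, 22, 31,  6, 11, 25, 10,
              31,  5, 31,  4, 36, 22, 15, 19, 12, 24]
satisfaction = [20, 12, 33, 10, 18, 29, 10, 12, 23,  7,
                36,  4, 28,  6, 33, 21, 17, 16, 14, 21]

r = np.corrcoef(commitment, satisfaction)[0, 1]
print(round(r, 5))  # → 0.96252, matching the value reported by PROC CORR
```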

Determining Sample Size

The size of the sample used in computing the correlation coefficient can appear in one of two places on the output page. First, if all correlations in the analysis were based on the same number of participants, the sample size appears only once on the page, in the line above the matrix of correlations. This line appears just below the descriptive statistics. In Output 6.3, the line says:

Pearson Correlation Coefficients, N = 20

The “N =” portion of this output indicates the sample size. In Output 6.3, the sample size is 20.

However, if you request correlations between several different pairs of variables, it is possible that certain coefficients will be based on more participants than others (due to missing data). In this case, the sample size is printed for each correlation coefficient. Specifically, the sample size appears immediately below the correlation coefficient and its associated significance level (i.e., p value), following this format:

correlation
p value
N

Computing All Possible Correlations for a Set of Variables

Here is the general form for computing all possible Pearson correlation coefficients for a set of variables:

PROC CORR   DATA=dataset-name   options;
   VAR   variable-list;
RUN;

Each variable name in the preceding “variable-list” should be separated by at least one space. For example, assume that you now want to compute all possible correlations for the variables COMMITMENT, SATISFACTION, INVESTMENT, and ALTERNATIVES. The statements that request these correlations are as follows:

PROC CORR   DATA=D1;
   VAR COMMITMENT SATISFACTION INVESTMENT ALTERNATIVES;
RUN;

The preceding program produces the output reproduced here as Output 6.4:

Output 6.4. Computing All Possible Pearson Correlations
                            The CORR Procedure

        4  Variables:    COMMITMENT   SATISFACTION INVESTMENT   ALTERNATIVES


                                Simple Statistics

Variable      N      Mean     Std Dev           Sum       Minimum       Maximum

COMMITMENT   20  18.60000    10.05459     372.00000       4.00000      36.00000
SATISFACTION 20  18.50000     9.51177     370.00000       4.00000      36.00000
INVESTMENT   20  20.20000     9.28836     404.00000       5.00000      34.00000
ALTERNATIVES 20  20.65000     9.78869     413.00000       5.00000      36.00000


                     Pearson Correlation Coefficients, N = 20
                            Prob > |r| under H0: Rho=0

                 COMMITMENT      SATISFACTION      INVESTMENT      ALTERNATIVES

COMMITMENT          1.00000           0.96252         0.71043          -0.95604
                                       <.0001          0.0004            <.0001

SATISFACTION        0.96252           1.00000         0.61538          -0.93355
                     <.0001                            0.0039            <.0001

INVESTMENT          0.71043           0.61538         1.00000          -0.72394
                     0.0004            0.0039                            0.0003

ALTERNATIVES       -0.95604          -0.93355        -0.72394           1.00000
                     <.0001            <.0001          0.0003

You can interpret the correlations and significance values in this output in exactly the same way as with Output 6.3. For example, to find the correlation coefficient between INVESTMENT and COMMITMENT, you find the cell where the row for INVESTMENT intersects with the column for COMMITMENT. The top number in this cell is .71, which is the Pearson correlation coefficient between these two variables. Just below this correlation coefficient is its p value, .0004, meaning that there are only about 4 chances in 10,000 of observing a sample correlation this large if the population correlation is really zero. The observed correlation is statistically significant.

Notice that the pattern of the correlations supports some of the predictions of the investment model: commitment is positively related to satisfaction and investment size; and is negatively related to alternative value. With respect to magnitude, the correlations range from being moderately strong to very strong. (Remember that these data are fictitious.)

What happens if I omit the VAR statement?

It is possible to run PROC CORR without the VAR statement. This causes every possible correlation to be computed between all quantitative variables in the dataset. Use caution when doing this, however; with large datasets, leaving off the VAR statement can result in a very long printout.


Computing Correlations between Subsets of Variables

By using the WITH statement in the SAS program, it is possible to compute correlations between one subset of variables and a second subset of variables. The general form is as follows:

PROC CORR   DATA=dataset-name   options;
   VAR   variables-that-will-appear-as-columns;
   WITH  variables-that-will-appear-as-rows;
RUN;

Any number of variables can appear in the VAR statement and any number of variables can also appear in the WITH statement. To illustrate, assume that you want to prepare a matrix of correlation coefficients in which there is one column of coefficients representing the COMMITMENT variable, and there are three rows of coefficients representing the SATISFACTION, INVESTMENT, and ALTERNATIVES variables. The following statements would create this matrix:

PROC CORR   DATA=D1;
   VAR  COMMITMENT;
   WITH SATISFACTION INVESTMENT ALTERNATIVES;
RUN;

Output 6.5 presents the results generated by this program. As you can see, the correlations in this output are identical to those obtained in Output 6.4, though Output 6.5 is more compact. This is why it is often wise to use the WITH statement in conjunction with the VAR statement: it can produce smaller and more manageable printouts than those obtained with the VAR statement alone.

Output 6.5. Computing Pearson Correlations for Subsets of Variables
                               The CORR Procedure

           3 With Variables:    SATISFACTION INVESTMENT   ALTERNATIVES
           1      Variables:    COMMITMENT


                                     Simple Statistics

Variable         N       Mean    Std Dev         Sum    Minimum    Maximum

SATISFACTION    20   18.50000    9.51177   370.00000    4.00000   36.00000
INVESTMENT      20   20.20000    9.28836   404.00000    5.00000   34.00000
ALTERNATIVES    20   20.65000    9.78869   413.00000    5.00000   36.00000
COMMITMENT      20   18.60000   10.05459   372.00000    4.00000   36.00000

                    Pearson Correlation Coefficients, N = 20
                           Prob > |r| under H0: Rho=0

                                                COMMITMENT

                          SATISFACTION            0.96252
                                                   <.0001

                          INVESTMENT              0.71043
                                                   0.0004

                          ALTERNATIVES           -0.95604
                                                   <.0001

Options Used with PROC CORR

The following items are some of the PROC CORR options that you might find especially useful when conducting social science research. Remember that the option names should appear before the semicolon that ends the PROC CORR statement:

ALPHA

prints coefficient alpha (a measure of scale reliability) for the variables listed in the VAR statement. (Chapter 7 deals with coefficient alpha in greater detail.)

COV

prints covariances between the variables. This is useful when you need a variance-covariance table, rather than a table of correlations.

KENDALL

prints Kendall’s tau-b coefficient, a measure of bivariate association for variables assessed at the ordinal level.

NOMISS

drops from the analysis any observation (participant) with missing data on any of the variables listed in the VAR statement. Using this option ensures that all correlations are based on exactly the same observations (and, therefore, on the same number of observations).
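The difference between the default (pairwise) treatment of missing data and the NOMISS (listwise) treatment can be sketched with pandas (an aside, outside SAS; the values are invented):

```python
import numpy as np
import pandas as pd

# Two of the five participants have a missing score on one variable.
df = pd.DataFrame({
    "commitment":   [20, 10, 30, np.nan, 22],
    "satisfaction": [20, 12, 33, 10, 18],
    "investment":   [28, np.nan, 24, 15, 33],
})

# Pairwise deletion (pandas' default, like PROC CORR without NOMISS):
# each correlation uses all cases available for that pair, so N can differ.
pairwise = df.corr()

# Listwise deletion (like the NOMISS option): only the 3 complete cases
# are used, so every correlation is based on exactly the same observations.
listwise = df.dropna().corr()

print(pairwise.round(2))
print(listwise.round(2))
```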

NOPROB

prevents the printing of p values associated with the correlations.

RANK

for each variable, reorders the correlations from highest to lowest (in absolute value) and prints them in this order.

SPEARMAN

prints Spearman correlations, which are appropriate for variables measured on an ordinal level. Spearman correlations are discussed in the following section.
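As a preview of that section: the Spearman coefficient is simply a Pearson correlation computed on the ranks of the scores rather than on the raw scores, which is why it tolerates skewed data and extreme values. A Python illustration (an aside; the data are invented):

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# A monotone but nonlinear relationship: one extreme raw value.
x = [1, 2, 3, 4, 100]
y = [2, 4, 5, 8, 9]

r_raw, _ = pearsonr(x, y)                        # Pearson on the raw scores
rho, _ = spearmanr(x, y)                         # Spearman
r_ranks, _ = pearsonr(rankdata(x), rankdata(y))  # Pearson on the ranks

print(round(r_raw, 2))  # → 0.68; the extreme value weakens Pearson r
print(rho, r_ranks)     # both 1.0: Spearman is Pearson computed on ranks
```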
