Inputting a Correlation or Covariance Matrix

There are times when, for reasons of either necessity or convenience, you might choose to analyze a correlation matrix or covariance matrix rather than raw data (e.g., when working with very large datasets). SAS allows you to input such a matrix as data, and some (but not all) SAS procedures can then analyze the resulting dataset. For example, a correlation or covariance matrix can be analyzed using PROC REG, PROC FACTOR, and PROC CALIS, among other procedures.

Inputting a Correlation Matrix

This type of data input is sometimes necessary when a researcher obtains a correlation matrix from an earlier study (perhaps from an article published in a research journal) and wishes to perform further analyses on the data. You could input the published correlation matrix as a dataset and analyze it in the same way you would analyze raw data.

For example, imagine that you have read an article that tested a social psychology theory called the investment model (Rusbult, 1980). The investment model identifies a number of variables that are believed to influence a person’s satisfaction with, and commitment to, a romantic relationship. The following are short definitions for the variables that constitute the investment model:

Commitment: the person’s intention to remain in the relationship;
Satisfaction: the person’s affective (emotional) response to the relationship;
Rewards: the number of good things or benefits associated with the relationship;
Costs: the number of bad things or hardships associated with the relationship;
Investment size: the amount of time, energy, and personal resources put into the relationship;
Alternative value: the attractiveness of alternatives to the relationship (e.g., attractiveness of alternative romantic partners).

One interpretation of the investment model predicts that commitment to the relationship is determined by satisfaction, investment size, and alternative value, while satisfaction with the relationship is determined by rewards and costs. The predicted relationships among these variables are shown in Figure 3.1:

Figure 3.1. Predicted Relationships between Investment Model Variables


Assume that you have read an article that reports an investigation of the investment model and that the article included the following fictitious table:

Table 3.1. Standard Deviations and Intercorrelations for All Variables

                                        Intercorrelations
Variable             SD        1        2        3        4        5        6
1. Commitment      2.3192   1.0000
2. Satisfaction    1.7744    .6742   1.0000
3. Rewards         1.2525    .5501    .6721   1.0000
4. Costs           1.4086   -.3499   -.5717   -.4405   1.0000
5. Investments     1.5575    .6444    .5234    .5346   -.1854   1.0000
6. Alternatives    1.8701   -.6929   -.4952   -.4061    .3525   -.3934   1.0000

Note: N = 240.

Supplied with this information, you can now create a SAS dataset that includes just these correlation coefficients and standard deviations. Here are the necessary data input statements:

 1      DATA D1(TYPE=CORR) ;
 2        INPUT _TYPE_ $ _NAME_ $ V1-V6 ;
 3        LABEL
 4           V1 ='COMMITMENT'
 5           V2 ='SATISFACTION'
 6           V3 ='REWARDS'
 7           V4 ='COSTS'
 8           V5 ='INVESTMENTS'
 9           V6 ='ALTERNATIVES' ;
10      DATALINES;
11      N      .    240     240     240     240     240     240
12      STD    .   2.3192  1.7744  1.2525  1.4086  1.5575  1.8701
13      CORR  V1   1.0000   .       .       .       .       .
14      CORR  V2    .6742  1.0000   .       .       .       .
15      CORR  V3    .5501   .6721  1.0000   .       .       .
16      CORR  V4   -.3499  -.5717  -.4405  1.0000   .       .
17      CORR  V5    .6444   .5234   .5346  -.1854  1.0000   .
18      CORR  V6   -.6929  -.4952  -.4061   .3525  -.3934  1.0000
19      ;
20      RUN;

The following shows the general form for this DATA step in which six variables are to be analyzed. The program would, of course, be modified if the analysis involved a different number of variables.

 1     DATA dataset-name(TYPE=CORR) ;
 2       INPUT _TYPE_ $ _NAME_ $ variable-list ;
 3       LABEL
 4          V1 ='long-name'
 5          V2 ='long-name'
 6          V3 ='long-name'
 7          V4 ='long-name'
 8          V5 ='long-name'
 9          V6 ='long-name' ;
10     DATALINES;
11     N      .    n       n       n       n       n       n
12     STD    .    std     std     std     std     std     std
13     CORR  V1  1.0000   .       .       .       .       .
14     CORR  V2    r     1.0000   .       .       .       .
15     CORR  V3    r       r     1.0000   .       .       .
16     CORR  V4    r       r       r     1.0000   .       .
17     CORR  V5    r       r       r       r     1.0000   .
18     CORR  V6    r       r       r       r       r     1.0000
19     ;
20     RUN;

where:

variable-list = List of variables (e.g., V1-V6).
long-name = Full name for the given variable. This is used to label the variable when it appears in the SAS output. If this is not desired, you can omit the entire LABEL statement.
n = Number of observations contributing to the correlation matrix. Each correlation in this matrix should be based on the same observations and hence the same number of observations. (This is automatically the case if the matrix is created using the NOMISS option with PROC CORR, as discussed in Chapter 6.)
std = Standard deviation obtained for each variable. These standard deviations are needed when you input a correlation matrix but want SAS to convert it into a variance-covariance matrix for the analysis. If you instead input a variance-covariance matrix directly, standard deviations are not required.
r = Correlation coefficient between a pair of variables.
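
The conversion that the standard deviations make possible is simple arithmetic: the covariance between two variables is their correlation multiplied by both standard deviations, and each variance is the squared standard deviation. The following is only an illustrative sketch using the Commitment and Satisfaction values from Table 3.1; the variable names R, SD1, SD2, COV12, and VAR1 are hypothetical and are not part of the original program:

      DATA _NULL_;
        R   = .6742;             /* correlation between V1 and V2       */
        SD1 = 2.3192;            /* standard deviation of V1            */
        SD2 = 1.7744;            /* standard deviation of V2            */
        COV12 = R * SD1 * SD2;   /* covariance between V1 and V2        */
        VAR1  = SD1 ** 2;        /* variance of V1 (diagonal element)   */
        PUT COV12= VAR1=;        /* write both values to the SAS log    */
      RUN;

With these values the covariance works out to approximately 2.77. SAS procedures perform this conversion internally when it is needed; the step is shown here only to make the role of the STD line concrete.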

The observations that appear on lines 11 to 18 in the preceding program are easier to understand if you think of the observations as a matrix with eight rows and eight columns. The first column in this matrix (running vertically) contains the _TYPE_ variable. (Notice that the INPUT statement tells SAS that the first variable it will read is a character variable named “_TYPE_”.) If an “N” appears as a value in this _TYPE_ column, then SAS knows that sample sizes will appear on that line. If “STD” appears as a value in the _TYPE_ column, then the system knows that standard deviations appear on that line. Finally, if “CORR” appears as a value in the _TYPE_ column, then SAS knows that correlation coefficients appear on that line.

The second column in this matrix contains short names for the observed variables. These names should appear only on the CORR lines. Periods (for missing data) should appear where the N and STD lines intersect with this column.

Looking at the matrix from the other direction, you see eight rows running horizontally. The first row is the N row (or “line”); it should contain the following:

  • the N symbol;

  • a period for the missing variable name;

  • the sample sizes for the variables, each separated by at least one blank space. In the investment model program shown earlier, the sample size was 240 for each variable.

The STD row (or line) should contain the following:

  • the STD symbol;

  • the period for the missing variable name;

  • the standard deviations for the variables, each separated by at least one blank space. If the STD line is omitted, SAS cannot convert the correlations into covariances, so the analysis can be performed only on the correlation matrix, not on the covariance matrix.

Finally, the correlation coefficients should appear where rows 3 to 8 intersect with columns 3 to 8. The coefficients themselves appear below the diagonal; ones appear on the diagonal (because the correlation of a variable with itself is always equal to 1.0); and periods appear above the diagonal (where the redundant correlation coefficients would appear if this were a full matrix). Be very careful when entering these correlations; a single missing period can cause an error in reading the data.

You can see that the columns of data in this matrix are lined up in an organized fashion. Technically, neatness is not required as this INPUT statement is in free format. You should try to be equally organized when preparing your matrix, however, as this will minimize the chance of leaving out an entry and causing an error.
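
One way to check that the matrix was read as intended is to list the dataset and inspect its attributes after the DATA step has run. The following is a minimal sketch, assuming the dataset is named D1 as in the investment model example:

      PROC CONTENTS DATA=D1;   /* reports dataset attributes, including its type */
      RUN;

      PROC PRINT DATA=D1;      /* lists the N, STD, and CORR rows as SAS read them */
      RUN;

If a period was left out of a line, the values in the printed listing will typically be shifted or a row will be missing, which makes the mistake easier to spot than it is in the log alone.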

Inputting a Covariance Matrix

The procedure for inputting a covariance matrix is similar to that used with a correlation matrix. An example is presented here:

 1        DATA D1(TYPE=COV) ;
 2          INPUT _TYPE_ $ _NAME_ $ V1-V6 ;
 3          LABEL
 4             V1 ='COMMITMENT'
 5             V2 ='SATISFACTION'
 6             V3 ='REWARDS'
 7             V4 ='COSTS'
 8             V5 ='INVESTMENTS'
 9             V6 ='ALTERNATIVES' ;
10        DATALINES;
11        N      .    240     240     240     240     240     240
12        COV   V1 11.1284   .       .       .       .       .
13        COV   V2  5.6742  9.0054   .       .       .       .
14        COV   V3  4.5501  3.6721  6.8773   .       .       .
15        COV   V4 -3.3499 -5.5717 -2.4405 10.9936   .       .
16        COV   V5  7.6444  2.5234  3.5346 -4.1854  7.1185   .
17        COV   V6 -8.6329 -3.4952 -6.4061  4.3525 -5.3934  9.2144
18        ;
19        RUN;

Notice that the DATA statement now specifies TYPE=COV rather than TYPE=CORR. The line providing standard deviations is no longer needed and has been removed. The matrix itself now provides variances on the diagonal and covariances below the diagonal; the beginning of each line now specifies COV to indicate that this is a covariance matrix. The remaining sections are identical to those used to input a correlation matrix.
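
Once a dataset of either type has been created, it can be named in the DATA= option of a procedure that accepts matrix input. As a sketch only, and not an analysis taken from the article described above, the two sets of predictions from the investment model could be examined with PROC REG, assuming the dataset D1 and the variables V1 to V6 from the preceding programs:

      PROC REG DATA=D1;
        MODEL V1 = V2 V5 V6;   /* commitment on satisfaction, investments, alternatives */
        MODEL V2 = V3 V4;      /* satisfaction on rewards and costs                     */
      RUN;
      QUIT;

Because no raw observations are available, output that depends on individual cases (such as residuals or predicted values) cannot be produced from this kind of input.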
