Chapter 10: Correlation

Introduction

Creating a Permanent SAS Data Set

Reading the Exercise.xls Workbook and Creating a Permanent SAS Data Set

Using the Statistics Correlation Task

Generating Correlation and Scatter Plot Matrices

Interpreting Correlation Coefficients

Generating Spearman Non-Parametric Correlations

Conclusions

Problems

Introduction

There are several ways to quantify the relationship between two continuous variables, the most common being the Pearson correlation coefficient. This chapter describes it, along with a nonparametric alternative called the Spearman rank-order correlation.

In addition to showing you how to compute these two correlation coefficients, you will also see how to save a SAS data set in a permanent library. Until now, all the SAS data sets created using the Import Utility or by reading raw data were stored in the Work library. Data sets in the Work library are deleted each time you close your SAS session. This is fine for homework problems or learning how to write SAS programs, but if you are working on a project that may take days (or months or years), you will want to save your SAS data sets in a permanent location.

Creating a Permanent SAS Data Set

If you plan to use only temporary SAS data sets in the WORK library, you can skim over this section or skip it entirely. Creating permanent SAS data sets in a virtual environment is somewhat more complicated than creating SAS data sets in a non-virtual environment.

One of the data sources that you are going to use to see how the correlation tasks work is an Excel workbook called Exercise.xls, stored in c:\SASUniversityEdition\myfolders. Figure 1 shows the first few rows of this worksheet:

Figure 1: Workbook Exercise.xls

image

This worksheet contains information on age, the number of pushups a person can perform, and pulse rates under three conditions: resting, maximal exertion, and while running. By the way, the data values are not real. They were generated by a small SAS program that used random number generators in such a way that certain variables would be related to others.

There are 50 subjects in the file. Notice that the column headings are all valid SAS variable names. The first step is to use the Import Utility to convert this workbook into a SAS data set.

Because you want this data set to be permanent, you need to create what is called a SAS library. You have already seen the SASHELP and WORK libraries. For this exercise you are going to create a new library called BOOKDATA.

SAS library names (also called librefs—library references) are created with a LIBNAME statement. Library names are a maximum of eight characters in length and follow the same naming conventions as SAS variable names and SAS data sets. That is, they must begin with a letter or underscore with the remaining characters being letters, digits, or the underscore character.

Rather than having to write a LIBNAME statement, you will see how SAS Studio can create this statement for you and save it in a location called AUTOEXEC.SAS. The advantage of saving the LIBNAME statement there is that the library will be available every time you open a SAS session. This is important because SAS libraries must be created every time you open a new session.

To have SAS Studio do all this work for you, first click Server Files and Folders and place the cursor on the shared folder Book_Data (see Figure 2). To review how this shared folder was created, refer to the shared folders section in Chapter 3.

Figure 2: The Shared Folder Created in VirtualBox

image

Now, right-click on this folder shortcut, select Create and then click Library (Figure 3):

Figure 3: Creating a Permanent Library

image

Be sure to check the box to re-create this library at start-up (Figure 4):

Figure 4: Creating the New Library

image

Clicking OK writes the appropriate LIBNAME statement to AUTOEXEC.SAS. Here is a screen shot of AUTOEXEC.SAS, with the new LIBNAME statement outlined:

Figure 5: AUTOEXEC.SAS with New LIBNAME Statement Added

image

Now, every time you start a SAS session, the BOOKDATA library will be available for you to read or write SAS data sets.
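For reference, the statement that SAS Studio writes has the general form sketched below. The path shown is an assumption about where the shared folder appears inside the virtual machine; the path in your own AUTOEXEC.SAS file is the one to trust.

```sas
/* Hypothetical path: copy the real one from your AUTOEXEC.SAS file */
libname BOOKDATA '/folders/myshortcuts/Book_Data';
```

Once a statement like this has run, a two-level name such as BOOKDATA.Exercise refers to a data set stored in that folder rather than in the WORK library.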

One final note: The actual folder on your hard drive is c:ookBook_Data (it contains an underscore character). This is the same name that you gave to the shared folder that you created using Oracle VirtualBox. It is important to remember that the name of the shared folder cannot contain any blanks. The library name that identifies this folder (BOOKDATA) does not contain the underscore character. Why? Although an underscore character is valid in a library name, adding it would exceed the 8-character maximum length for library names. As a matter of fact, the name that you choose for this library does not need to have any relationship to the actual folder name. It just makes sense to use a name that helps you remember the actual folder name. The good news is that once you have checked the option to create the library name at start-up using the Autoexec file, you can always check this file to see what library name you used and the name of the shared folder. Now you see why placing your data files and SAS data sets in SASUniversityEdition\myfolders is a good idea.

Reading the Exercise.xls Workbook and Creating a Permanent SAS Data Set

To convert the Exercise.xls workbook into a permanent SAS data set, proceed as you did with all the previous Excel workbook examples in this book:

Tasks and Utilities ▶ Utilities ▶ Import Data

This time, when you change the name of the output data set, select the BOOKDATA library and call the data set Exercise (see Figure 6):

Figure 6: Saving the Exercise Data Set in the BOOKDATA Library

image

When you look in Server Files and Folders, you will see a file called exercise.sas7bdat listed under the folder shortcut Book_Data. SAS uses the extension sas7bdat for its data set names (see Figure 7):

Figure 7: The SAS Data Set Is Listed under Folder Shortcuts - Book_Data

image
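The Import Data utility generates SAS code behind the scenes. A rough equivalent is sketched here for readers who prefer to see the program. The file path is an assumed location for the workbook inside the virtual machine, and the code that the utility actually generates may differ in its details.

```sas
/* Sketch only: the file path is an assumed location inside the virtual machine */
proc import datafile='/folders/myfolders/Exercise.xls'
            out=BOOKDATA.Exercise   /* two-level name = permanent data set */
            dbms=xls
            replace;
   getnames=yes;                    /* first row holds the variable names */
run;

/* List the variables to confirm that the import worked */
proc contents data=BOOKDATA.Exercise;
run;
```

Because the output name starts with BOOKDATA rather than WORK, the data set is still there the next time you open a SAS session.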

Using the Statistics Correlation Task

As a quick summary, a Pearson correlation coefficient measures the strength of the linear relationship between two variables. These variables are usually continuous, but there are types of correlations where one or both of the variables are binary (0 or 1) or ordinal. The formula for calculating Pearson correlation coefficients (from here on, just called correlations) is such that correlation values lie between -1 and +1, inclusive. Positive correlations indicate a positive relationship between two variables. For example, height and weight would be positively correlated for a group of young children. The taller children would tend to be heavier and vice versa. Correlations near zero tell you that knowing the value of one variable gives you no better guess for the value of the other. Finally, correlations near -1 indicate a strong inverse relationship between two variables; as one increases, the other decreases. For example, the dose of insulin would be negatively correlated with blood sugar levels: the higher the dose, the lower the blood sugar level.
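For readers who would like to see it, the usual textbook definition of the Pearson coefficient is shown below. The notation is introduced here for illustration: the pairs (x_i, y_i) are the n observations, and x-bar and y-bar are the two sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Swapping the roles of x and y leaves the expression unchanged, and the denominator can never be smaller than the absolute value of the numerator, which is why r stays between -1 and +1.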

It is important to remember two things about correlations. First, it doesn't matter which variable you place on the x- or y-axis. Second, the correlation is strongly influenced by outliers. The following two figures show the effect of a single outlier on a set of data points that, without the outlier, has a correlation close to zero.

First, here is a scatter plot of x-y data with a correlation close to zero:

Figure 8: X and Y with Close to a Zero Correlation

image

The correlation is -0.023 (almost zero). Here is the same set of Xs and Ys with a single outlier (x=11, y=12) added:

Figure 9: Sample Plot with a Single Outlier

image

The correlation is now .782. The lesson here is that outliers can have a very large effect on Pearson correlation coefficients. When you find an extreme outlier, you might want to do two things: first, make sure it isn’t a data entry error; second, make sure that the case really belongs to the population under consideration. For example, if you were looking at the relationship between income and school achievement, and one member of your sample was a multi-billionaire, you might want to remove that data point from the sample. Of course, if you did so, you would want to report that fact in any article or report you wrote concerning those data.

This brings up one of the most important rules about reporting correlation coefficients in a study: Always inspect a scatter plot to identify data points that may have undue influence.

It's time to investigate correlations among the variables in the Exercise data set. Start by clicking on Tasks and Utilities. Then choose Statistics, and finally, Correlation Analysis:

Tasks and Utilities ▶ Statistics ▶ Correlation Analysis

Generating Correlation and Scatter Plot Matrices

The Correlation DATA tab looks like this:

Figure 10: The Correlation DATA Tab

image

You have two choices. One is to enter all the numeric variables of interest in the Analysis variables box (as was done here) to compute correlations between every pair of variables. The other is to select one or more variables in the Analysis variables box and other variables in the Correlate with box. If you do this, the task will compute correlations for every combination of the Analysis variables and the Correlate with variables. For example, if you have variables A and B in the Analysis variables box and variables X, Y, and Z in the Correlate with box, the task will compute the correlations for the pairs: AX, AY, AZ, BX, BY, and BZ.
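In program form, these two choices map onto PROC CORR roughly as sketched below. The sketch assumes that the Analysis variables box corresponds to the VAR statement and the Correlate with box to the WITH statement, and that the data set is stored as BOOKDATA.Exercise; the code that the task itself generates may include additional options.

```sas
/* Choice 1: correlations between every pair of analysis variables */
proc corr data=BOOKDATA.Exercise;
   var Age Pushups Rest_Pulse Max_Pulse Run_Pulse;
run;

/* Choice 2: each VAR variable paired with each WITH variable */
proc corr data=BOOKDATA.Exercise;
   var Age Pushups;
   with Rest_Pulse Max_Pulse Run_Pulse;
run;
```

The second step produces six correlations (two VAR variables times three WITH variables) instead of a full square matrix.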

On the Options tab (Figure 11 below), a matrix of scatter plots has been requested, and the option to include histograms on the diagonal of the matrix has been checked.

Figure 11: Correlation Analysis OPTIONS Tab

image

Figure 12 shows the correlation matrix:

Figure 12: Correlation Matrix

image

The intersection of each row and column in this table shows you the correlation coefficient (top number) and the p-value (the bottom number). For example, the correlation between Age and Pushups is -.49191 with a p-value of .0003. The older the subject, the fewer pushups he or she could do. Because of the symmetry of the matrix, you only need to look at the upper or lower triangle of the matrix.

Before you spend time investigating the correlation coefficient, you should take a moment to inspect the p-value. What does it mean to have a "significant" correlation? To understand the p-value, imagine two variables that are completely unrelated—the correlation between them is 0 (see Figure 13 below):

Figure 13: Population with a Zero Correlation

image

Now, imagine taking a random sample of 5 subjects from this population. You might wind up with a selection that looks like the circled points in Figure 14. The problem is that, with a small sample, you may, by chance, end up with a correlation that is either close to 1 or close to -1. The p-value that you see in the correlation tables is the probability that you would obtain a correlation with an absolute value as large as or larger than the one you obtained by chance alone, given that the true population correlation between your two variables is actually 0.

Figure 14: Random Sample: Correlation = .8

image
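As a check on this idea, the p-value for a Pearson correlation can be reproduced by hand from the correlation and the sample size, using a t statistic with n - 2 degrees of freedom. The short DATA step below is a sketch that does this for the Age and Pushups correlation reported earlier. It assumes that all 50 subjects have nonmissing values for both variables.

```sas
/* Assumes n = 50 with no missing values for Age or Pushups */
data _null_;
   r = -0.49191;                          /* correlation from the matrix */
   n = 50;                                /* number of subjects */
   t = r * sqrt(n - 2) / sqrt(1 - r**2);  /* t statistic with n - 2 df */
   p = 2 * (1 - probt(abs(t), n - 2));    /* two-sided p-value */
   put t= 8.4 p= 8.4;                     /* results are written to the log */
run;
```

If the assumption about the sample size holds, the value of p written to the log should be close to the .0003 shown in the correlation matrix.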

Now, back to the output. Following the correlation matrix is a matrix of scatter plots. Because you checked the box to include histograms on the diagonal, they are included as well (see Figure 15):

Figure 15: Scatter Plot Matrix

image

Here you see a scatter plot for all combinations of the variables (corresponding to the correlation matrix above). This scatter plot matrix is an excellent way to see relationships among the variables in your data set.
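If you prefer to write the program yourself, a scatter plot matrix with histograms on the diagonal can be requested with the PLOTS= option of PROC CORR. This is a minimal sketch, again assuming the data set is stored as BOOKDATA.Exercise; it is not necessarily the exact code that the task generates.

```sas
ods graphics on;   /* usually already on in SAS Studio */

proc corr data=BOOKDATA.Exercise plots=matrix(histogram);
   var Age Pushups Rest_Pulse Max_Pulse Run_Pulse;
run;
```

Replacing matrix(histogram) with scatter requests a separate plot for each pair of variables instead of a single matrix.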

If you choose individual scatter plots instead of a scatter plot matrix, each of the small plots displayed in the matrix is displayed as an individual plot. A typical plot looks like the one displaying Max_Pulse versus Rest_Pulse in the figure below:

Figure 16: Example of an Individual Scatter Plot

image

Interpreting Correlation Coefficients

The question often comes up: "What is a large correlation?" The short answer is "that depends." That's not very satisfying. A useful approach to interpreting a correlation coefficient is to square it. The value of r-square is the proportion of variance (the standard deviation squared) of one of the variables that can be explained by the other. For example, look at the correlation between Run_Pulse and Rest_Pulse. It is .76139, and this value squared is .5797. Both of these variables differ among the subjects, and you can compute the variance of each of the resting and running pulse rates. Because the value of r-square is .5797, you can say that 57.97% of the variance in the running pulse rate can be explained by the fact that these subjects all have different resting pulse rates.

Generating Spearman Non-Parametric Correlations

As with many statistical tests, there is a nonparametric alternative to a Pearson correlation. One of the most popular is called a Spearman correlation. The Spearman method substitutes ranks for the two variables and then computes a correlation on the ranks. When there are outliers on your scatter plot, you may want to consider computing Spearman correlations. As a matter of fact, it's not a bad idea to routinely compute both Pearson and Spearman correlations and take special notice when they produce substantially different results.

To add Spearman correlations to the output of the Correlation analysis tab, expand the Nonparametric tab and click the box next to Spearman's rank-order correlation (see Figure 17):

Figure 17: Requesting Spearman Correlations

image

Two variables, Run_Pulse and Rest_Pulse, were selected for this example. The resulting output is shown below:

Figure 18: Pearson and Spearman Correlations

image

The value of the Spearman correlation (-.33960) is almost identical to the Pearson correlation (r=-.34555). When you have outliers or other non-normal distributions, these two values may be quite different.
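A program that requests both coefficients might look like the first step below. The second and third steps are an optional illustration of the statement that the Spearman method computes a correlation on the ranks: they rank the two variables and then run an ordinary Pearson correlation on those ranks. The data set name Ranked and the variable names Rank_Run and Rank_Rest are made up for this sketch.

```sas
/* Pearson and Spearman correlations in one step */
proc corr data=BOOKDATA.Exercise pearson spearman;
   var Run_Pulse Rest_Pulse;
run;

/* Replace each value with its rank (tied values get the mean rank) */
proc rank data=BOOKDATA.Exercise out=Ranked ties=mean;
   var Run_Pulse Rest_Pulse;
   ranks Rank_Run Rank_Rest;   /* hypothetical names for the rank variables */
run;

/* A Pearson correlation on the ranks */
proc corr data=Ranked pearson;
   var Rank_Run Rank_Rest;
run;
```

The Pearson correlation from the last step should match the Spearman correlation from the first.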

Conclusions

The correlation analysis statistics task enables you to correlate one set of variables with another or to produce a correlation matrix showing correlations between every pair of variables. You should routinely request either individual plots or a scatter plot matrix as part of the procedure. Finally, consider computing Spearman rank-order correlations.

Problems

10-1: Start with the Excel workbook Blood_Pressure.xlsx and create a temporary SAS data set called BP. Compute a Pearson and Spearman (nonparametric) correlation between the two variables SBP and DBP. Select the appropriate option to print a scatter plot.

10-2: Using the SASHELP data set Heart, compute a correlation matrix with the variables Height, Weight, and Cholesterol. Be sure to include p-values in the table. Request a matrix of scatter plots and be sure to set the number of points plotted to unlimited.

10-3: Take a random sample of size 500 from the SASHELP.Heart data set (use a fixed seed of 13579). Call the sample Sample_Heart. Include all the variables in the sample. Repeat problem 10-2 using the random sample. Compare the correlations and p-values between the full data set and the sample.

10-4: Run the program below to create a SAS data set called Corr_Population. Run the Correlation Analysis task to confirm that the correlation between X and Y is close to zero. Next, use the Select Random Sample task under Data tasks to create a random sample of seven from Corr_Population. Compute the correlation between X and Y in this random sample. Do this several times (you can just click on the two tabs for Select Random Sample and Correlation Analysis) and notice how the correlation changes each time.

data Corr_Population;
   call streaminit(12345);
   do i = 1 to 1000;
      X = ceil(rand('uniform')*100);
      Y = ceil(rand('uniform')*100);
      output;
   end;
   drop i;
run;

For the reader who is interested in this SAS program, the RAND function with the argument 'uniform' produces uniform random numbers between 0 and 1. The call routine CALL STREAMINIT enables you to select a fixed seed value to initiate the random sequence. The statements that compute the X and Y values produce random integers from 1 to 100. Because both X and Y are randomly determined, the correlation between them should be (and is) close to zero. You can use the List Data task or PROC PRINT to print the first few observations in this data set.

10-5: Use the DATA step below to create a data set called Outlier.

data Outlier;
   input X Y @@;
datalines;
0 2 5 6 6 2 3 3 1 3 4 4 8 1 6 4 2 5 4 2 6 5
;

Compute the correlation between X and Y. Then add the single data point X=15, Y=15, and rerun the correlation task. Request both a Pearson and a Spearman correlation. What lesson is to be learned from this problem? Note: To add the additional data point, place two 15s separated by spaces at the end of the line of data in the program.
