Checking a Range Using an Algorithm Based on Standard Deviation

One way of deciding what constitutes reasonable cutoffs for low and high data values is to use an algorithm based on the data values themselves. For example, you could decide to flag all values more than two standard deviations from the mean. However, if you had some severe data errors, the standard deviation could be so badly inflated that obviously incorrect data values might lie within two standard deviations. A possible workaround for this would be to compute the standard deviation after removing some of the highest and lowest values. For example, you could compute a standard deviation of the middle 50% of your data and use this to decide on outliers. Another popular alternative is to use an algorithm based on the interquartile range (the difference between the 25th percentile and the 75th percentile). Some programs and macros based on these ideas are presented in the next two sections.

Let’s first see how you could identify data values more than two standard deviations from the mean. You can use PROC MEANS to compute the standard deviations and a short DATA step to select the outliers, as shown in Program 2-19.

Program 2-19. Detecting Outliers Based on the Standard Deviation
LIBNAME CLEAN "C:\CLEANING";
***Output means and standard deviations to a data set;
PROC MEANS DATA=CLEAN.PATIENTS NOPRINT;
   VAR HR SBP DBP;
   OUTPUT OUT=MEANS(DROP=_TYPE_ _FREQ_)
          MEAN=M_HR M_SBP M_DBP
          STD=S_HR S_SBP S_DBP;
RUN;
%LET N_SD = 2;
*** The number of standard deviations to list;


DATA _NULL_;
   FILE PRINT;
   TITLE "Statistics for Numeric Variables";
   SET CLEAN.PATIENTS;
   ***Bring in the means and standard deviations;
   IF _N_ = 1 THEN SET MEANS;
   ARRAY RAW[3] HR SBP DBP;
   ARRAY _MEAN[3] M_HR M_SBP M_DBP;
   ARRAY _STD[3] S_HR S_SBP S_DBP;

   DO I = 1 TO DIM(RAW);
      IF (RAW[I] LT _MEAN[I] - &N_SD*_STD[I] AND RAW[I] NE .)
      OR RAW[I] GT _MEAN[I] + &N_SD*_STD[I] THEN PUT PATNO= RAW[I]=;
   END;
RUN;

The PROC MEANS step computes the mean and standard deviation for each of the numeric variables listed in the VAR statement. To make the program more flexible, the number of standard deviations above or below the mean that you would like to report is assigned to a macro variable (N_SD). To compare each of the raw data values against the limits defined by the mean and standard deviation, you need to combine the values in the single-observation data set created by PROC MEANS with the original data set. You use the same trick you used earlier: you execute a SET statement only once, when _N_ is equal to one. Because all the variables brought into the program data vector (PDV) with a SET statement are retained, these summary values will be available in each observation in the PATIENTS data set. Finally, to save some typing, three arrays hold the original raw variables, the means, and the standard deviations, respectively. The IF statement at the bottom of this DATA step prints the ID variable and the raw data value for any value above the upper cutoff or below the lower cutoff. The test RAW[I] NE . keeps missing values, which SAS treats as lower than any number, from being reported as low outliers.

The results of running this program on the PATIENTS data set with N_SD set to two follow:

Statistics for Numeric Variables

PATNO=009 DBP=180
PATNO=011 SBP=300
PATNO=321 HR=900
PATNO=321 SBP=400
PATNO=321 DBP=200
PATNO=020 DBP=8
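
The one-time SET technique used in Program 2-19 may be easier to see in isolation. The following is a minimal sketch, not part of the original program: the data set names TOY, TOY_STATS, and TOY_COMBINED, the variables ID and X, and the four data values are all made up for illustration.

```sas
***Hypothetical toy data, not the PATIENTS data set;
DATA TOY;
   INPUT ID $ X;
DATALINES;
A 10
B 12
C 11
D 50
;

***One-observation data set holding the mean and standard deviation;
PROC MEANS DATA=TOY NOPRINT;
   VAR X;
   OUTPUT OUT=TOY_STATS(DROP=_TYPE_ _FREQ_)
          MEAN=M_X
          STD=S_X;
RUN;

DATA TOY_COMBINED;
   SET TOY;
   ***This SET executes on the first iteration only. M_X and S_X
      are retained, so they stay in the PDV for every observation;
   IF _N_ = 1 THEN SET TOY_STATS;
RUN;

PROC PRINT DATA=TOY_COMBINED NOOBS;
RUN;
```

In the printed listing, each of the four observations should carry the same M_X (20.75 for these made-up values) and the same S_X, which is what lets the DATA step in Program 2-19 compare every raw value against the same limits.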


How would you go about computing cutoffs based on the middle 50% of your data? Calculating a mean and standard deviation on the middle 50% of the data (called trimmed statistics by robust statisticians, and I know some statisticians who are very robust!) is easy if you first use PROC RANK (with a GROUPS= option) to identify quartiles, and then use this information in a subsequent PROC MEANS step to compute the mean and standard deviation of the middle 50% of your data. Deciding how many of these trimmed standard deviation units should define an outlier is somewhat of a trial-and-error process. Obviously (well, maybe not that obviously), the standard deviation computed on the middle 50% of your data will be smaller than the standard deviation computed from all of your data, and the difference between the two will be even larger if there are some dramatic outliers in your data. (This will be demonstrated later in this section.) As an approximation, if your data are normally distributed, the trimmed standard deviation is about 2.6 times smaller than the untrimmed value (closer to 2.65 if you carry another decimal place). So, if your original cutoff was plus or minus two standard deviations, you might choose a value between about 5 and 5.3 trimmed standard deviations as your cutoff. Program 2-20 uses 5.25; it computes trimmed statistics and uses them to identify outliers.
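
Before turning to that program, the figure of roughly 2.6 can be checked with a quick simulation. This is a sketch under stated assumptions rather than anything from the original text: it generates 100,000 standard normal values, and the data set names (SIM, SIM_R, ALL_SD, MID_SD), the variable names, and the seed value are arbitrary choices made for illustration.

```sas
***Simulated standard normal data. The seed 12345 is arbitrary;
DATA SIM;
   CALL STREAMINIT(12345);
   DO I = 1 TO 100000;
      X = RAND('NORMAL');
      OUTPUT;
   END;
RUN;

***Quartile groups 0 to 3, as in Program 2-20;
PROC RANK DATA=SIM OUT=SIM_R GROUPS=4;
   VAR X;
   RANKS R_X;
RUN;

***Standard deviation of all the values;
PROC MEANS DATA=SIM_R NOPRINT;
   VAR X;
   OUTPUT OUT=ALL_SD(DROP=_TYPE_ _FREQ_) STD=S_ALL;
RUN;

***Standard deviation of the middle 50% only;
PROC MEANS DATA=SIM_R NOPRINT;
   WHERE R_X IN (1,2);
   VAR X;
   OUTPUT OUT=MID_SD(DROP=_TYPE_ _FREQ_) STD=S_MID;
RUN;

***Each data set has one observation, so a MERGE with no BY
   statement simply places the two values side by side;
DATA _NULL_;
   TITLE "Untrimmed versus Trimmed Standard Deviation";
   FILE PRINT;
   MERGE ALL_SD MID_SD;
   RATIO = S_ALL / S_MID;
   PUT S_ALL= S_MID= RATIO=;
RUN;
```

With normally distributed data, RATIO should land in the neighborhood of 2.6 to 2.7. With real data that are skewed or heavy-tailed, the ratio can differ noticeably, which is one reason the choice of cutoff remains a matter of trial and error.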

Program 2-20. Detecting Outliers Based on a Trimmed Mean
PROC RANK DATA=CLEAN.PATIENTS OUT=TMP GROUPS=4;
   VAR HR;
   RANKS R_HR;
RUN;


PROC MEANS DATA=TMP NOPRINT;
   WHERE R_HR IN (1,2);  ***The middle 50%;
   VAR HR;
   OUTPUT OUT=MEANS(DROP=_TYPE_ _FREQ_)
          MEAN=M_HR
          STD=S_HR;
RUN;

%LET N_SD = 5.25;
***The value of 5.25 trimmed standard deviations is
   approximately equivalent to the 2 standard deviations
   you used before, computed from all the data. Set this
   value approximately 2.65 times larger than the number
   of standard deviations you would use with untrimmed data;

DATA _NULL_;
   TITLE "Outliers Based on Trimmed Standard Deviation";
   FILE PRINT;
   SET CLEAN.PATIENTS;
   IF _N_ = 1 THEN SET MEANS;
   IF HR LT M_HR - &N_SD*S_HR AND HR NE .
      OR HR GT M_HR + &N_SD*S_HR THEN PUT PATNO= HR=;
RUN;

There is one slight complication here, compared to the earlier nontrimmed version of the program. The middle 50% of the observations can be different for each of the numeric variables you want to test. So, if you want to run the program for several variables, it would be convenient to assign to a macro variable the name of the numeric variable that will be tested. This is done next, but first, a brief explanation of the program. PROC RANK is used with the GROUPS=4 option to create a new variable (R_HR), which will have a value of 0, 1, 2, or 3, depending on the quartile in which the HR value lies. Because you want both the original value for HR and the rank value, use a RANKS statement, which lets you name the variable that will hold the rank of the variable listed in the VAR statement. All that is left to do is to run PROC MEANS as you did before, except that a WHERE statement selects the middle 50% of the data values (groups 1 and 2). The DATA _NULL_ step that follows is the same as the one in Program 2-19, except that arrays are not needed because you can process only one variable at a time. Finally, here is the output from Program 2-20.

Outliers Based on Trimmed Standard Deviation

PATNO=008  HR=210
PATNO=014  HR=22
PATNO=017  HR=208
PATNO=321  HR=900
PATNO=020  HR=10
PATNO=023  HR=22


Notice that the method based on a nontrimmed standard deviation reported only one HR as an outlier (PATNO=321, HR=900), while the method based on trimmed statistics identified six values. The reason? The heart rate value of 900 inflated the nontrimmed standard deviation so much that all of the other values fell within two standard deviations of the mean.
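
The generalization mentioned earlier, placing the name of the variable to test in a macro variable, could look something like the following. This is a minimal sketch and not necessarily the author's version: the macro variable name VAR is a hypothetical choice, and the code assumes the CLEAN libref is already assigned and that PATIENTS contains PATNO and the variable named in VAR.

```sas
%LET VAR = HR;
***VAR is a hypothetical macro variable naming the variable to test;
%LET N_SD = 5.25;
***Number of trimmed standard deviations used as the cutoff;

PROC RANK DATA=CLEAN.PATIENTS OUT=TMP GROUPS=4;
   VAR &VAR;
   RANKS R_&VAR;
RUN;

PROC MEANS DATA=TMP NOPRINT;
   WHERE R_&VAR IN (1,2);  ***The middle 50%;
   VAR &VAR;
   OUTPUT OUT=MEANS(DROP=_TYPE_ _FREQ_)
          MEAN=M_&VAR
          STD=S_&VAR;
RUN;

DATA _NULL_;
   TITLE "Outliers for &VAR Based on Trimmed Standard Deviation";
   FILE PRINT;
   SET CLEAN.PATIENTS;
   IF _N_ = 1 THEN SET MEANS;
   IF (&VAR LT M_&VAR - &N_SD*S_&VAR AND &VAR NE .)
      OR &VAR GT M_&VAR + &N_SD*S_&VAR THEN PUT PATNO= &VAR=;
RUN;
```

To test SBP or DBP instead, change only the first %LET statement and rerun. Because the quartile groups are recomputed for whichever variable is named, each variable gets its own middle 50%.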
