Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage

Let’s return to the problem of locating the “n” highest and “n” lowest values for each of several numeric variables in the data set. Remember that earlier in this chapter, you used PROC UNIVARIATE to list the five highest and five lowest values for your three numeric variables. First of all, this procedure prints lots of other statistics that you don’t need (or want), unless you use the output delivery system to limit the output.

If you are running a version of SAS software prior to Version 7 or you want to control the number of high and low values to list, you can write a custom program to give you exactly what you want. The approach is to have PROC UNIVARIATE output a data set containing the cutoff values on the lower and upper range of interest. The first program described lists the bottom and top “n” percent of the values. Next, the program is turned into a macro so that it is easier to use.

Program 2-12 uses PROC UNIVARIATE to print out the bottom and top “n” percent of the data values.

Program 2-12. Using PROC UNIVARIATE to Print the Top and Bottom “n” Percent of Data Values
***Solution using PROC UNIVARIATE and Percentiles;


LIBNAME CLEAN "C:CLEANING";
***The two macro variables that follow define the lower and upper
   percentile cut points;


***Change the value in the line below to the percentile cut-off
   you want;
%LET LOW_PER=20;  1


***Compute the upper cut-off value;
%LET UP_PER= %EVAL(100 - &LOW_PER);  2


***Choose a variable to operate on;
%LET VAR = HR;  3


PROC UNIVARIATE DATA=CLEAN.PATIENTS NOPRINT;  4
   VAR &VAR;
   ID PATNO;
   OUTPUT OUT=TMP PCTLPTS=&LOW_PER &UP_PER PCTLPRE = L_;  5
RUN;


DATA HILO;
   SET CLEAN.PATIENTS(KEEP=PATNO &VAR);  6
   ***Bring in upper and lower cutoffs for variable;
   IF _N_ = 1 THEN SET TMP;  7
   IF &VAR LE L_&LOW_PER THEN DO;
      RANGE = 'LOW ';
      OUTPUT;
   END;
   ELSE IF &VAR GE L_&UP_PER THEN DO;
      RANGE = 'HIGH';
      OUTPUT;
   END;
RUN;


PROC SORT DATA=HILO(WHERE=(&VAR NE .)); 8
   BY DESCENDING RANGE &VAR;
RUN;
PROC PRINT DATA=HILO;
   TITLE "High and Low Values for Variables";
   ID PATNO;
   VAR RANGE &VAR;
RUN;

Let’s go through this program step by step. To make the program somewhat general, it uses several macro variables. Line assigns the lower percentile to a macro variable (LOW_PER) using a %PUT statement. Line computes the upper percentile cutoff (UP_PER) by subtracting the lower percentile cutoff from 100. (Note: The %EVAL function is needed here to perform the integer arithmetic. If the value of LOW_PER was 20, the value of &UP_PER, without the %EVAL function, would be the text string “100 - 20” instead of 80.) If you look at line , you see the two macro variables LOW_PER and UP_PER preceded by an ampersand (&). As discussed earlier, before the SAS processor runs any SAS program, it runs the macro processor, which processes all the macro statements and substitutes the assigned values of the macro variables. In this program, after the macro processor does its job, line reads:

OUTPUT OUT=TMP PCTLPTS=20 80 PCTLPRE = L_;

That is, the two macro variables, &LOW_PER and &UP_PER are replaced by the values assigned by the %LET statements, 20 and 80 respectively.

In line , a macro variable (VAR) is assigned the value of one of the numeric variables to be checked (HR). To run this program on another numeric variable, SBP for example, you only have to change the variable name in line .

PROC UNIVARIATE can be used to create an output data set containing information that is normally printed out by the procedure. Because you only want the output data set and not the listing from the procedure, use the NOPRINT option as shown in line . As you did before, you are supplying PROC UNIVARIATE with an ID statement so that the ID variable (PATNO in this case) will be included in the output data set. Line defines the name of the output data set and specifies the information you want it to include. The keyword OUT= names your data set (TMP) and PCTLPTS= instructs the program to create two variables; one to hold the value of the VAR variable at the 20th percentile and the other for the 80th percentile. In order for this procedure to create the variable names for these two variables, the keyword PCTLPRE= (percentile prefix) is used. Because you set the prefix to L_, the procedure creates two variables, L_20 and L_80.

The cut points you choose are combined with your choice of prefix to create these two variable names. The data set TMP contains only one observation and three variables, PATNO (because of the ID statement), L_20, and L_80. The value of L_20 is 58 and the value of _80 is 88, the 20th and 80th percentile cutoffs, respectively. The remainder of the program is easier to follow.

You want to add the two values of L_20 and L_80 to every observation in the original PATIENTS data set. Let’s do this with a “trick.” The SET statement in line brings in an observation from the PATIENTS data set, keeping only the variables PATNO and HR (because the macro variable &VAR was set to HR). Line is executed only on the first iteration of this DATA step (when _N_ is equal to 1). Because all variables brought in with a SET statement are automatically retained, the values for L_20 and L_80 are added to every observation in the data set HILO.

Finally, for each observation coming in from the PATIENTS data set, the value of HR is compared to the lower and upper cutoff points defined by L_20 and L_80. If the value of HR is at or below the value of L_20, RANGE is set to the value ’LOW’ and the observation is added to the data set HILO. Likewise, if the value of HR is at or above the value of L_80, RANGE is set to ’HIGH’ and the observation is added to the data set HILO. Before you print out the contents of the data set HILO, you sort it first so that the low values and high values are grouped, and within these groups, the values sorted from lowest to highest. The keyword DESCENDING is used in the first level sort so that the LOW values are listed before the HIGH values (’H’ comes before ’L’ in a normal ascending alphabetical sort). Within each of these two groups, the data values are listed from low to high. It would probably be nicer for the HIGH values to be listed from highest to lowest, but it would not be worth the effort. The final listing from this program is shown next.

High and Low Values for Variables

PATNO    RANGE      HR

 020     LOW        10
 014     LOW        22
 023     LOW        22
 022     LOW        48
 003     LOW        58
 019     LOW        58
 001     HIGH       88
 007     HIGH       88
         HIGH       90
 004     HIGH      101
 017     HIGH      208
 008     HIGH      210
 321     HIGH      900


To turn the above program into a macro is actually quite straightforward. The macro version is shown in Program 2-13.

Program 2-13. Creating a Macro to List the Highest and Lowest “n” Percent of the Data Using PROC UNIVARIATE
*---------------------------------------------------------------*
| Program Name: HILOWPER.SAS  in C:CLEANING                    |
| Purpose: To list the n percent highest and lowest values for  |
|          a selected variable.                                 |
| Arguments: DSN    - Data set name                             |
|            VAR     - Numeric variable to test                 |
|            PERCENT - Upper and Lower percentile cutoff        |
|            IDVAR   - ID variable to print in the report       |
| Example: %HILOWPER(CLEAN.PATIENTS,SBP,20,PATNO)               |
*---------------------------------------------------------------*;


%MACRO HILOWPER(DSN,VAR,PERCENT,IDVAR);


   ***Compute upper percentile cutoff;
   %LET UP_PER = %EVAL(100 - &PERCENT);


   PROC UNIVARIATE DATA=&DSN NOPRINT;
      VAR &VAR;
      ID &IDVAR;
      OUTPUT OUT=TMP PCTLPTS=&PERCENT &UP_PER PCTLPRE = L_;
   RUN;


   DATA HILO;
      SET &DSN(KEEP=&IDVAR &VAR);
      IF _N_ = 1 THEN SET TMP;
      IF &VAR LE L_&PERCENT THEN DO;
         RANGE = 'LOW ';
         OUTPUT;
      END;
      ELSE IF &VAR GE L_&UP_PER THEN DO;
         RANGE = 'HIGH';
         OUTPUT;
      END;
   RUN;


   PROC SORT DATA=HILO(WHERE=(&VAR NE .));
      BY DESCENDING RANGE &VAR;
   RUN;
   PROC PRINT DATA=HILO;
      TITLE "Low and High Values for Variables";
      ID &IDVAR;
      VAR RANGE &VAR;
   RUN;


   PROC DATASETS LIBRARY=WORK NOLIST;
     DELETE TMP;
     DELETE HILO;
   RUN;
   QUIT;


 %MEND HILOWPER ;

The only change, besides the four macro variables, is the addition of PROC DATASETS to delete the two temporary data sets TMP and HILO. To demonstrate this macro, the three lines below call the macro to list the highest and lowest 20 % of the values for heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP) in the data set PATIENTS.

%HILOWPER(CLEAN.PATIENTS,HR,20,PATNO)
%HILOWPER(CLEAN.PATIENTS,SBP,20,PATNO)
%HILOWPER(CLEAN.PATIENTS,DBP,20,PATNO)

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.45.137