Using PROC RANK to Look for Highest and Lowest Values by Percentage

There is a simpler and more efficient way to list the highest and lowest “n” percent of the data values: PROC RANK. The previous, more complicated program was shown because it produces a slightly more accurate listing than the program in this section. PROC RANK is designed to produce a new variable (or replace the values of an existing variable) with values equal to the ranks of another variable. For example, if the variable X has values of 7, 3, 2, and 8, the equivalent ranks are 3, 2, 1, and 4, respectively. PROC RANK also has a very useful option, GROUPS=, that lets you group your data values. For example, if you set GROUPS=4, the new variable that usually holds the rank values will instead have values of 0, 1, 2, and 3, with observations in group 0 being in the bottom quartile and observations in group 3 being in the top quartile. So, if you want to print the top 20% of your data values, set the GROUPS= option to 5, so that each group represents 20% of your data values. The bottom 20% corresponds to the ranked variable having a value of 0, and the top 20% corresponds to a value of 4. (Yes, it is odd that without the GROUPS= option the ranks start at 1, running from 1 to the number of observations, whereas with GROUPS=n the group numbers start at 0, running from 0 to n - 1.)
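
The difference between plain ranks and grouped ranks can be seen with the four values used above. The following is a minimal sketch, not part of the original program; the data set names DEMO, RANKED, and GROUPED and the variable names X_RANK and X_GROUP are made up for illustration.

```sas
*** Four values from the text: 7, 3, 2, and 8;
DATA DEMO;
   INPUT X @@;
DATALINES;
7 3 2 8
;
RUN;

*** Without GROUPS=, X_RANK holds ordinary ranks (1 = smallest);
PROC RANK DATA=DEMO OUT=RANKED;
   VAR X;
   RANKS X_RANK;
RUN;

*** With GROUPS=4, X_GROUP holds group numbers that start at 0;
PROC RANK DATA=DEMO OUT=GROUPED GROUPS=4;
   VAR X;
   RANKS X_GROUP;
RUN;

PROC PRINT DATA=RANKED;
   TITLE "Ranks Without the GROUPS= Option";
RUN;

PROC PRINT DATA=GROUPED;
   TITLE "Group Numbers With GROUPS=4";
RUN;
```

With only four observations and four groups, each value should land in its own group, so the two listings should differ only in that the second starts counting at 0 instead of 1.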

Now let’s see how to apply this idea to a program that lists the top and bottom “n” percent of your data values. Because you have already seen a program and a macro that list the highest and lowest “n” percent of your data values, only the macro version is shown here.

Program 2-14. Creating a Macro to List the Highest and Lowest “n” Percent of the Data Using PROC RANK
*---------------------------------------------------------------- *
| Macro Name: HI_LOW_P                                            |
| Purpose: To list the upper and lower n% of values               |
| Arguments: DSN     - Data set name (one- or two-level)          |
|            VAR     - Variable to test                           |
|            PERCENT - Upper and lower n%                         |
|            IDVAR   - ID variable                                |
| Example: %HI_LOW_P(CLEAN.PATIENTS,SBP,20,PATNO)                 |
*----------------------------------------------------------------*;


%MACRO HI_LOW_P(DSN,VAR,PERCENT,IDVAR);
   ***Compute number of groups for PROC RANK;
   %LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);  /* 1 */
   ***Value of the highest GROUP from PROC RANK, equal to the
      number of groups - 1;
   %LET TOP = %EVAL(&GRP - 1);  /* 2 */


   PROC FORMAT;  /* 3 */
      VALUE RNK 0='Low' &TOP='High';
   RUN;


   PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP;  /* 4 */
      VAR &VAR;
      RANKS RANGE;
   RUN;


   ***Sort and keep top and bottom n%;
   PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP)));  /* 5 */
      BY  &VAR;
   RUN;
   ***Produce the report;
   PROC PRINT DATA=NEW;  /* 6 */
   TITLE "Upper and Lower &PERCENT.% Values for %UPCASE(&VAR)";
      ID &IDVAR;
      VAR RANGE &VAR;
      FORMAT RANGE RNK.;
   RUN;


   PROC DATASETS LIBRARY=WORK NOLIST;  /* 7 */
      DELETE NEW;
   RUN;
   QUIT;


%MEND HI_LOW_P;

First, you need to compute the approximate number of groups that corresponds to the percentage you want. Line 1 uses the %SYSEVALF function to do this computation. This function, unlike its companion %EVAL, allows noninteger arithmetic and also provides various conversions (CEIL, FLOOR, INTEGER, or BOOLEAN) for the result. The FLOOR conversion was chosen because you would rather have the program list too many values (that is, use a smaller value for the GROUPS= option) than too few. For example, if you want the top and bottom 8% of your data values, the value of GRP is FLOOR(100/8) = FLOOR(12.5) = 12, and the value of TOP (line 2) is 11. With 12 groups, each group holds about 8.3% of the values rather than exactly 8%. It is this rounding that may produce a slightly less accurate report than the program that uses PROC UNIVARIATE. The RNK format (line 3) assigns the labels 'Low' and 'High' to the values 0 and &TOP of the ranked variable.
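
To see the arithmetic by itself, the two %LET statements can be run in open code, with the results written to the SAS log. This is a small sketch for the 8% example; the macro variables RAW and UP are hypothetical names added here only to show the other conversions.

```sas
%LET PERCENT = 8;

*** No conversion: %SYSEVALF returns the noninteger result 12.5;
%LET RAW = %SYSEVALF(100 / &PERCENT);

*** FLOOR rounds down (fewer, larger groups, so more values listed);
%LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);

*** CEIL rounds up (more, smaller groups, so fewer values listed);
%LET UP  = %SYSEVALF(100 / &PERCENT,CEIL);

*** Highest group number is one less than the number of groups;
%LET TOP = %EVAL(&GRP - 1);

%PUT RAW=&RAW GRP=&GRP UP=&UP TOP=&TOP;
```

The log line should show RAW=12.5, GRP=12, UP=13, and TOP=11, which matches the values worked out above.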

The key to the whole program is PROC RANK (line 4), which uses the GROUPS= option to divide the data values into groups. The sort (line 5) accomplishes two things: 1) it subsets the data set with the WHERE= data set option, keeping only the top and bottom groups, and 2) it puts the data values in order from smallest to largest. All that is left to do is to print the report (line 6) and delete the temporary data set (line 7).
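
The rank, subset, and sort steps can also be tried outside the macro. The sketch below assumes a made-up data set (DEMO, with 20 invented SBP values) rather than CLEAN.PATIENTS, and it writes the subset to a separate data set (EXTREMES) with the OUT= option. That is a variation on the macro, which sorts NEW in place.

```sas
*** Hypothetical test data: 20 patients with made-up SBP values;
DATA DEMO;
   DO PATNO = 1 TO 20;
      SBP = 100 + 3*PATNO;
      OUTPUT;
   END;
RUN;

*** GROUPS=5 puts roughly 20% of the values in each of groups 0 to 4;
PROC RANK DATA=DEMO OUT=RANKED GROUPS=5;
   VAR SBP;
   RANKS RANGE;
RUN;

*** Keep only the bottom (0) and top (4) groups, ordered by SBP.
    OUT= writes the subset to EXTREMES and leaves RANKED intact;
PROC SORT DATA=RANKED (WHERE=(RANGE IN (0,4))) OUT=EXTREMES;
   BY SBP;
RUN;

PROC PRINT DATA=EXTREMES;
   TITLE "Bottom and Top 20% of SBP in the Demo Data";
   ID PATNO;
   VAR RANGE SBP;
RUN;
```

Keeping the ranked data set separate from the subset makes it easy to inspect the full set of group assignments while you are testing, at the cost of one extra temporary data set to clean up.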

Issue the following statement to see a list of the top and bottom 10% of your values for SBP (systolic blood pressure):

%HI_LOW_P(CLEAN.PATIENTS,SBP,10,PATNO)

This produces the following output:

  Upper and Lower 10% Values for SBP

  PATNO           RANGE          SBP

   020            Low             20
   023            Low             34
   011            High           300
   321            High           400

