There is a simpler and more efficient way to list the highest and lowest "n" percent of the data values: use PROC RANK. The previous, more complicated program was shown because it produces a slightly more accurate listing than the program shown in this section.

PROC RANK is designed to produce a new variable (or replace the values of an existing variable) with values equal to the ranks of another variable. For example, if the variable X has values of 7, 3, 2, and 8, the equivalent ranks would be 3, 2, 1, and 4, respectively. However, PROC RANK has a very useful option (GROUPS=) that allows you to group your data values. For example, if you set GROUPS=4, the new variable that usually holds the rank values will now have values of 0, 1, 2, and 3, with observations in group 0 being in the bottom quartile and observations in group 3 being in the top quartile. So, if you want to print out the top 20% of your data values, you set the GROUPS= option to 5, each group representing 20% of your data values. The bottom 20% corresponds to the ranked variable having a value of 0, and the top 20% corresponds to the ranked variable having a value of 4. (Yes, it is odd that without the GROUPS= option ranks start at 1 and go up to the number of observations, while with GROUPS=k the group numbers start at 0 and go up to k - 1.)
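The difference between plain ranks and grouped ranks can be seen in a minimal sketch such as the one below. The data set names DEMO, RANKED, and GROUPED and the variable names RANK_X and GROUP_X are hypothetical, chosen only for this illustration; the four X values are the ones used in the example above.

```sas
/* Hypothetical four-observation data set for illustration */
DATA DEMO;
   INPUT X @@;
DATALINES;
7 3 2 8
;

/* Default behavior: RANK_X holds the ranks 1 through 4 */
PROC RANK DATA=DEMO OUT=RANKED;
   VAR X;
   RANKS RANK_X;
RUN;

/* With GROUPS=2: GROUP_X is 0 for the lower half of the values
   and 1 for the upper half */
PROC RANK DATA=DEMO OUT=GROUPED GROUPS=2;
   VAR X;
   RANKS GROUP_X;
RUN;

PROC PRINT DATA=RANKED NOOBS;
RUN;

PROC PRINT DATA=GROUPED NOOBS;
RUN;
```

In the second listing, the two smallest values (2 and 3) should fall in group 0 and the two largest (7 and 8) in group 1.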
Now let's see how to apply this idea to a program that will list the top and bottom "n" percent of your data values. Because you have already seen a program and a macro that list the highest and lowest "n" percent of your data values, only the macro version is shown here.
/*----------------------------------------------------------------*
 | Macro Name: HI_LOW_P                                           |
 | Purpose:    To list the upper and lower n% of values           |
 | Arguments:  DSN     - Data set name (one- or two-level)        |
 |             VAR     - Variable to test                         |
 |             PERCENT - Upper and lower n%                       |
 |             IDVAR   - ID variable                              |
 | Example:    %HI_LOW_P(CLEAN.PATIENTS,SBP,20,PATNO)             |
 *----------------------------------------------------------------*/

%MACRO HI_LOW_P(DSN,VAR,PERCENT,IDVAR);

   ***Compute number of groups for PROC RANK;
   %LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);            /* 1 */

   ***Value of the highest GROUP from PROC RANK, equal to the
      number of groups - 1;
   %LET TOP = %EVAL(&GRP - 1);                            /* 2 */

   PROC FORMAT;                                           /* 3 */
      VALUE RNK 0='Low' &TOP='High';
   RUN;

   PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP;               /* 4 */
      VAR &VAR;
      RANKS RANGE;
   RUN;

   ***Sort and keep top and bottom n%;
   PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP)));        /* 5 */
      BY &VAR;
   RUN;

   ***Produce the report;
   PROC PRINT DATA=NEW;                                   /* 6 */
      TITLE "Upper and Lower &PERCENT.% Values for %UPCASE(&VAR)";
      ID &IDVAR;
      VAR RANGE &VAR;
      FORMAT RANGE RNK.;
   RUN;

   PROC DATASETS LIBRARY=WORK NOLIST;                     /* 7 */
      DELETE NEW;
   RUN;
   QUIT;

%MEND HI_LOW_P;
First, you need to compute the approximate number of groups that will correspond to the percentage you want. The statement marked 1 uses the %SYSEVALF function to do this computation. This function, unlike its companion %EVAL, allows noninteger arithmetic and also provides various conversions (CEIL, FLOOR, INTEGER, or BOOLEAN) for the results. The FLOOR conversion was chosen because you would rather have the program list too many values (i.e., a smaller value for the GROUPS= option) than too few. For example, if you want the top and bottom 8% of your data values, the value of GRP would be FLOOR(100/8) = 12 and the value for TOP would be 11. With 12 groups, each group holds roughly 8.3% of the values rather than exactly 8%. It is this rounding that may produce a slightly less accurate report than the program that uses PROC UNIVARIATE. The RNK format (statement 3) assigns the labels 'Low' and 'High' to the lowest and highest values of the ranked variable.
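To see how the two macro functions behave, a short sketch like the following can be submitted in open code. It uses only %EVAL, %SYSEVALF, %LET, and %PUT; the value 8 for PERCENT is the one from the example above, and the macro variables are set outside the macro purely for illustration.

```sas
/* Integer arithmetic only: the result of 100/8 is truncated to 12 */
%PUT Integer result: %EVAL(100 / 8);

/* Floating-point arithmetic: the result is 12.5 */
%PUT Floating-point result: %SYSEVALF(100 / 8);

/* Conversions applied to the floating-point result */
%PUT With FLOOR: %SYSEVALF(100 / 8, FLOOR);
%PUT With CEIL: %SYSEVALF(100 / 8, CEIL);

/* The same two computations the macro performs, for PERCENT = 8 */
%LET PERCENT = 8;
%LET GRP = %SYSEVALF(100 / &PERCENT, FLOOR);
%LET TOP = %EVAL(&GRP - 1);
%PUT GRP=&GRP TOP=&TOP;
```

The last line should write GRP=12 TOP=11 to the SAS log, matching the values worked out in the text.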
The key to the whole program is PROC RANK (statement 4), which uses the GROUPS= option to divide the data values into groups. The sort (statement 5) accomplishes two things: 1) it subsets the data set with the WHERE= data set option, keeping only the top and bottom groups, and 2) it puts the data values in order from the smallest to the largest. All that is left to do is to print the report (statement 6) and delete the temporary data set (statement 7).
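Because the number of groups is rounded, it can be useful to check how many observations actually land in each group before relying on the report. The following is one possible check, not part of the macro itself; the output data set name CHECK is hypothetical, while CLEAN.PATIENTS and SBP are the data set and variable used in this chapter's examples.

```sas
/* Assign each SBP value to one of 10 groups (0 through 9) */
PROC RANK DATA=CLEAN.PATIENTS OUT=CHECK GROUPS=10;
   VAR SBP;
   RANKS RANGE;
RUN;

/* Count observations per group; the MISSING option also shows
   observations where SBP, and therefore RANGE, is missing */
PROC FREQ DATA=CHECK;
   TABLES RANGE / MISSING;
RUN;
```

Groups 0 and 9 in the frequency table correspond to the 'Low' and 'High' rows that the macro would print for a PERCENT value of 10.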
Issue the following statement to see a list of the top and bottom 10% of your values for SBP (systolic blood pressure):
%HI_LOW_P(CLEAN.PATIENTS,SBP,10,PATNO)
This produces the following output:
Upper and Lower 10% Values for SBP

PATNO    RANGE    SBP
020      Low       20
023      Low       34
011      High     300
321      High     400