Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers

One of the simplest ways to check for invalid numeric values is to run either PROC MEANS or PROC UNIVARIATE. By default, PROC MEANS lists the minimum and maximum values, along with the n, mean, and standard deviation. PROC UNIVARIATE is somewhat more useful in detecting invalid values, because it provides you with a listing of the five highest and five lowest values, along with graphical output (stem-and-leaf plots and box plots). Let’s first look at how you can use PROC MEANS for very simple checking of numeric variables. The program below checks the three numeric variables, heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP), in the PATIENTS data set.

Program 2-1. Using PROC MEANS to Detect Invalid and Missing Values
LIBNAME CLEAN "C:CLEANING";
PROC MEANS DATA=CLEAN.PATIENTS N NMISS MIN MAX MAXDEC=3;
   TITLE "Checking Numeric Variables in the PATIENTS Data Set";
   VAR HR SBP DBP;
RUN;

Let’s choose the options N, NMISS, MIN, MAX, and MAXDEC=3 for this procedure. The N and NMISS options report the number of nonmissing and missing observations for each variable, respectively. The MIN and MAX options list the smallest and largest nonmissing values for each variable. The MAXDEC=3 option is used so that the minimum and maximum values will be printed to three decimal places. Because HR, SBP, and DBP are supposed to be integers, you might have thought to set the MAXDEC option to 0. However, you might want to catch any data errors where a decimal point was entered by mistake.

Here is the output from Program 2-1.

Checking Numeric Variables in the PATIENTS Data Set

The MEANS Procedure

                                           N
Variable  Label                     N      Miss   
     Minimum       Maximum
--------------------------------------------------
----------------------------
HR        Heart Rate                28     3      
     10.000        900.000
SBP       Systolic Blood Pressure   27     4      
     20.000        400.000
DBP       Diastolic Blood Pressure  28     3      
      8.000        200.000
--------------------------------------------------
----------------------------


This output is not particularly useful. It does show the number of nonmissing and missing observations along with the highest and lowest values. Inspection of the minimum and maximum values for all three variables shows that there are probably some data errors in the PATIENTS data set. If you want a slightly prettier output, you can use PROC TABULATE to accomplish the same task. For an excellent reference on PROC TABULATE, let me suggest a book written by Lauren E. Haworth, called PROC TABULATE by Example, published by SAS Institute, Cary, NC, as part of their Books by Users series.

Here is the equivalent PROC TABULATE program, followed by the output. (Assume that the libref CLEAN has been previously defined in this program and in any future programs where it is not included in the program.)

Program 2-2. Using PROC TABULATE to Display Descriptive Data
PROC TABULATE DATA=CLEAN.PATIENTS FORMAT=7.3;  1
   TITLE "Statistics for Numeric Variables";
   VAR HR SBP DBP;  2
   TABLES HR SBP DBP,
          N*F=7.0 NMISS*F=7.0 MEAN MIN MAX / RTSPACE=18;  3
   KEYLABEL N     = 'Number'  4
            NMISS = 'Missing'
            MEAN  = 'Mean'
            MIN   = 'Lowest'
            MAX   = 'Highest';
RUN;

The FORMAT option tells the procedure to use the numeric format 7.3 (a field width of 7 with 3 places to the right of the decimal point) for all the output, unless otherwise specified. The analysis variables HR, SBP, and DBP are listed in a VAR statement . Let’s place these variables on the row dimension and the statistics along the column dimension. The TABLE option RTSPACE=18 allows for 18 spaces for all row labels, including the spaces for the lines forming the table. In addition, the format 7.0 is to be used for N and NMISS in the table. Finally, the KEYLABEL statement replaces the keywords for the selected statistics with more meaningful labels. Below is the output from PROC TABULATE.

A more useful procedure might be PROC UNIVARIATE. Running this procedure for your numeric variables yields much more information.

Program 2-3. Using PROC UNIVARIATE to Look for Outliers
PROC UNIVARIATE DATA=CLEAN.PATIENTS PLOT;
   TITLE "Using PROC UNIVARIATE to Look for Outliers";
   VAR HR SBP DBP;
RUN;

The procedure option PLOT provides you with several graphical displays of the data; a stem-and-leaf plot, a box plot, and a normal probability plot. Output from this procedure is shown next. (Note: To save some space, the PROC UNIVARIATE output for the variable SBP has been omitted)

Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable:  HR  (Heart Rate)

                            Moments

N                          28    Sum Weights      
            28
Mean               107.392857    Sum Observations 
          3007
Std Deviation      161.086436    Variance         
    25948.8399
Skewness           4.73965876    Kurtosis         
    23.7861582
Uncorrected SS        1023549    Corrected SS     
    700618.679
Coeff Variation    149.997347    Std Error Mean   
     30.442475

              Basic Statistical Measures

    Location                    Variability

Mean     107.3929     Std Deviation           161
.08644
Median    74.0000     Variance                   
 25949
Mode      68.0000     Range                   890
.00000
                      Interquartile Range      27
.00000

           Tests for Location: Mu0=0

Test           -Statistic-    -----p Value------

Student's t    t  3.527731    Pr > |t|     0.0015
Sign           M        14    Pr >= |M|    <.0001
Signed Rank    S       203    Pr >= |S|    <.0001

Quantiles (Definition 5)

Quantile      Estimate

100% Max           900
99%                900
95%                210
90%                208
75% Q3              87
50% Median          74
25% Q1              60
10%                 22
5%                  22
1%                  10
0% Min              10

Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable:  HR  (Heart Rate)

        Extreme Observations

----Lowest----        ----Highest---

Value      Obs        Value      Obs

   10       23           90        8
   22       25          101        4
   22       15          208       19
   48       24          210        9
   58       20          900       22


             Missing Values

                       -----Percent Of-----
Missing                             Missing
  Value       Count     All Obs         Obs

      .           3        9.68      100.00


   Stem Leaf                         #  Boxplot
      9 0                            1     *
      8
      7
      6
      5
      4
      3
      2 11                           2     *
      1 0                            1     +
      0 122566667777777888889999    24  +--0--+
        ----+----+----+----+----
    Multiply Stem.Leaf by 10**+2


Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable:  HR  (Heart Rate)

                       Normal Probability Plot
     950+                                         
    *
        |
        |
     650+
        |                                         
         +
        |                                         
   ++++++
     350+                                     +++++++
        |                               ++++++ *  *
        |                         ++++++     *
      50+     *   *    **  *+**+***** * **  *
         +----+----+----+----+----+----+----+----+
----+----+
             -2        -1         0        +1     
   +2

Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable:  DBP  (Diastolic Blood Pressure)

                            Moments

N                          28    Sum Weights      
            28
Mean               88.0714286    Sum Observations 
          2466
Std Deviation      37.2915724    Variance         
    1390.66138
Skewness           1.06190956    Kurtosis         
    3.67139184
Uncorrected SS         254732    Corrected SS     
    37547.8571
Coeff Variation     42.342418    Std Error Mean   
    7.04744476


              Basic Statistical Measures

    Location                    Variability

Mean     88.07143     Std Deviation            37
.29157
Median   81.00000     Variance                    
 1391
Mode     78.00000     Range                   192
.00000
                      Interquartile Range      26
.00000

NOTE: The mode displayed is the smallest of 2
 modes with a
      count of 3.


           Tests for Location: Mu0=0

Test           -Statistic-    -----p Value------

Student's t    t  12.49693    Pr > |t|    <.0001
Sign           M        14    Pr >= |M|   <.0001
Signed Rank    S       203    Pr >= |S|   <.0001


      Quantiles (Definition 5)

Quantile       Estimate

100% Max            200
99%                 200
95%                 180
90%                 120
75% Q3              100
50% Median           81
25% Q1               74
10%                  64
5%                   20
1%                    8
0% Min                8

Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable:  DBP  (Diastolic Blood Pressure)

        Extreme Observations

----Lowest----        ----Highest---

Value      Obs        Value      Obs

    8       23          106       28
   20       12          120        4
   64       14          120       11
   68       27          180       10
   68        6          200       22


          Missing Values


                       -----Percent Of-----
Missing                             Missing
  Value      Count      All Obs         Obs

      .          3         9.68      100.00

   Stem Leaf                     #  Boxplot
     20 0                        1     *
     18 0                        1     *
     16
     14
     12 00                       2     |
     10 0026                     4  +-----+
      8 000244800                9  *--+--*
      6 488044888                9  +-----+
      4
      2 0                        1     0
      0 8                        1     0
        ----+----+----+----+
    Multiply Stem.Leaf by 10**+1


Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable:  DBP  (Diastolic Blood Pressure)

                       Normal Probability Plot
     210+                                         
    *
        |                                        
 *        +
        |                                         
    +++++
        |                                       ++++++
        |                                  ++*+*
     110+                             ++*** *
        |                       ****+**
        |            * *  **+*+*
        |             +++++
        |       ++*+++
      10+  +++*+
         +----+----+----+----+----+----+----+----+
----+----+
             -2        -1         0        +1     
   +2


You certainly get lots of information from PROC UNIVARIATE, perhaps too much information. Starting off, you see some descriptive univariate statistics (hence the procedure name) for each of the variables listed in the VAR statement. Most of these statistics are not very useful in the data checking operation. The number of nonmissing observations (N), the number of observations not equal to zero (Num ^= 0), and the number of observations greater than zero (Num > 0) are probably the only items that are of interest to you at this time.

One of the most important sections of the PROC UNIVARIATE output, for data checking purposes, is the section labeled “Extremes.” Here you see the five highest and five lowest values for each of your variables. For example, for the variable HR (heart rate), there are three possible data errors under the column label “Lowest” (10, 22, and 22) and three possible data errors under the column label “Highest” (208, 210, and 900). Obviously, having knowledge of reasonable values for each of your variables is essential if this information is to be of any use. Next to the listing of the highest and lowest values is the observation number containing this value. What would be more useful would be the patient or subject number you assigned to each patient. This is easily accomplished by adding an ID statement to PROC UNIVARIATE. You list the name of your identifying variable following the keyword ID. The values of this ID variable are then used in addition to the OBS column. (Note: In versions of SAS software prior to Version 7, the ID column replaced the OBS column; in versions after Version 7, both the ID column and the OBS column are displayed.)

If you are running Version 7 or later of SAS software, you can include an ODS (output delivery system) statement to limit the PROC UNIVARIATE output to just the table of extreme values. Here are the PROC UNIVARIATE statements with an ID statement added, as well as the ODS statement to limit the output to the five highest and five lowest data values (the “Extremes”).

Program 2-4. Adding an ID Statement to PROC UNIVARIATE
/******************************************************************
| The ODS statement is valid for V7 and later.                     |
| Note that the name EXTREMEOBS may change in future SAS releases. |
| Use ODS TRACE ON; before the PROC and ODS TRACE OFF; after       |
| the PROC to obtain a list of output object names (found in       |
| the SAS Log).                                                    |
******************************************************************/


ODS SELECT EXTREMEOBS;


PROC UNIVARIATE DATA=CLEAN.PATIENTS;
   TITLE "Using PROC UNIVARIATE to Look for Outliers";
   ID PATNO;
   VAR HR SBP DBP;
RUN;

The section of output showing the “Extremes” for the variable heart rate (HR) follows:

Using PROC UNIVARIATE to Look for Outliers


The UNIVARIATE Procedure
Variable:  HR  (Heart Rate)


                Extreme Observations

--------Lowest--------        --------Highest-------

Value   PATNO      Obs        Value   PATNO      Obs

   10   020         23           90                8
   22   023         25          101   004          4
   22   014         15          208   017         19
   48   022         24          210   008          9
   58   019         20          900   321         22


Missing Value         .
Count                 3
% Count/Nobs      10.00


Before the addition of the ID statement, we only had columns labeled VALUE and OBS. With the ID statement there is a new column labeled PATNO that contains the values of your ID variable (PATNO), making it easier to locate the original patient data and check for errors. (Note: The column, which contains the values of the ID variable, that here is labeled PATNO will be labeled ID in releases of SAS software prior to Version 7).

The middle section of the output on page 28 contains a stem-and-leaf plot and a box plot. These two data visualizations come from an area of statistics known as exploratory data analysis (EDA). (For an excellent reference see Exploratory Data Analysis, by John Tukey, Reading, Massachusetts: Addison-Wesley.) Let’s focus on the plots for the variable DBP (diastolic blood pressure). The stem-and-leaf plot can be thought of as a sideways histogram. For this variable, the diastolic blood pressures are grouped in 20-point intervals. For example, the stem labeled “8” represents the values from 80 to 99. Instead of simply placing X’s or some other symbol to represent the bar in this sideways histogram, the next digit in the value is used instead. Thus, you see that there were three values of 80, one 82, two 84’s, one 88, and two 90’s. You can ignore these values and just think of the stem-and-leaf plot as a histogram, or you might be interested in the additional information that the leaf values give you. A quick examination of this plot shows that there were some abnormally low and high diastolic blood pressure values. This useful information complements the “Extremes” information. The “Extremes” only lists the five highest and five lowest values; the stem-and-leaf plot shows all the values in your data set.

To the right of the stem-and-leaf plot is a box plot. This plot shows the mean (+ sign), the median (the dashed line between the two asterisks), and the 25th and 75th percentiles (the bottom and top of the box, respectively). The distance between the top and bottom of the box is called the interquartile range and can also be found earlier in the outout labeled as “Q1-Q3.” Extending from both the top and bottom of the box are whiskers and outliers. The whiskers represent data values within one-and-a-half interquartile ranges above or below the box. (Note: The EDA people call the top and bottom of the box, the hinges.) Any data values more than one-and-a-half but less than three interquartile ranges above or below the box (hinges) are represented by 0’s. Two data values for DBP (8 and 20) fit this description in the box plot on page 28. Finally, any data values more than three interquartile ranges above or below the top and bottom hinges are represented by asterisks. For your DBP variable, two data points, 180 and 200, fit this description. The final graph, called a normal probability plot, is of interest to statisticians and helps determine deviations from a theoretical distribution called a normal or Gaussian distribution. The information displayed in the normal probability plot may not be useful for your data cleaning task because you are looking for data errors and are not particularly interested if the data are normally distributed or not.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.196.223