One of the simplest ways to check for invalid numeric values is to run either PROC MEANS or PROC UNIVARIATE. By default, PROC MEANS lists the minimum and maximum values, along with the n, mean, and standard deviation. PROC UNIVARIATE is somewhat more useful in detecting invalid values, because it provides you with a listing of the five highest and five lowest values, along with graphical output (stem-and-leaf plots and box plots). Let’s first look at how you can use PROC MEANS for very simple checking of numeric variables. The program below checks the three numeric variables, heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP), in the PATIENTS data set.
LIBNAME CLEAN "C:CLEANING"; PROC MEANS DATA=CLEAN.PATIENTS N NMISS MIN MAX MAXDEC=3; TITLE "Checking Numeric Variables in the PATIENTS Data Set"; VAR HR SBP DBP; RUN; |
Let’s choose the options N, NMISS, MIN, MAX, and MAXDEC=3 for this procedure. The N and NMISS options report the number of nonmissing and missing observations for each variable, respectively. The MIN and MAX options list the smallest and largest nonmissing values for each variable. The MAXDEC=3 option is used so that the minimum and maximum values will be printed to three decimal places. Because HR, SBP, and DBP are supposed to be integers, you might have thought to set the MAXDEC option to 0. However, you might want to catch any data errors where a decimal point was entered by mistake.
Here is the output from Program 2-1.
Checking Numeric Variables in the PATIENTS Data Set The MEANS Procedure N Variable Label N Miss Minimum Maximum -------------------------------------------------- ---------------------------- HR Heart Rate 28 3 10.000 900.000 SBP Systolic Blood Pressure 27 4 20.000 400.000 DBP Diastolic Blood Pressure 28 3 8.000 200.000 -------------------------------------------------- ---------------------------- |
This output is not particularly useful. It does show the number of nonmissing and missing observations along with the highest and lowest values. Inspection of the minimum and maximum values for all three variables shows that there are probably some data errors in the PATIENTS data set. If you want a slightly prettier output, you can use PROC TABULATE to accomplish the same task. For an excellent reference on PROC TABULATE, let me suggest a book written by Lauren E. Haworth, called PROC TABULATE by Example, published by SAS Institute, Cary, NC, as part of their Books by Users series.
Here is the equivalent PROC TABULATE program, followed by the output. (Assume that the libref CLEAN has been previously defined in this program and in any future programs where it is not included in the program.)
PROC TABULATE DATA=CLEAN.PATIENTS FORMAT=7.3; 1 TITLE "Statistics for Numeric Variables"; VAR HR SBP DBP; 2 TABLES HR SBP DBP, N*F=7.0 NMISS*F=7.0 MEAN MIN MAX / RTSPACE=18; 3 KEYLABEL N = 'Number' 4 NMISS = 'Missing' MEAN = 'Mean' MIN = 'Lowest' MAX = 'Highest'; RUN; |
The FORMAT option tells the procedure to use the numeric format 7.3 (a field width of 7 with 3 places to the right of the decimal point) for all the output, unless otherwise specified. The analysis variables HR, SBP, and DBP are listed in a VAR statement . Let’s place these variables on the row dimension and the statistics along the column dimension. The TABLE option RTSPACE=18 allows for 18 spaces for all row labels, including the spaces for the lines forming the table. In addition, the format 7.0 is to be used for N and NMISS in the table. Finally, the KEYLABEL statement replaces the keywords for the selected statistics with more meaningful labels. Below is the output from PROC TABULATE.
A more useful procedure might be PROC UNIVARIATE. Running this procedure for your numeric variables yields much more information.
PROC UNIVARIATE DATA=CLEAN.PATIENTS PLOT; TITLE "Using PROC UNIVARIATE to Look for Outliers"; VAR HR SBP DBP; RUN; |
The procedure option PLOT provides you with several graphical displays of the data; a stem-and-leaf plot, a box plot, and a normal probability plot. Output from this procedure is shown next. (Note: To save some space, the PROC UNIVARIATE output for the variable SBP has been omitted)
Using PROC UNIVARIATE to Look for Outliers The UNIVARIATE Procedure Variable: HR (Heart Rate) Moments N 28 Sum Weights 28 Mean 107.392857 Sum Observations 3007 Std Deviation 161.086436 Variance 25948.8399 Skewness 4.73965876 Kurtosis 23.7861582 Uncorrected SS 1023549 Corrected SS 700618.679 Coeff Variation 149.997347 Std Error Mean 30.442475 Basic Statistical Measures Location Variability Mean 107.3929 Std Deviation 161 .08644 Median 74.0000 Variance 25949 Mode 68.0000 Range 890 .00000 Interquartile Range 27 .00000 Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 3.527731 Pr > |t| 0.0015 Sign M 14 Pr >= |M| <.0001 Signed Rank S 203 Pr >= |S| <.0001 Quantiles (Definition 5) Quantile Estimate 100% Max 900 99% 900 95% 210 90% 208 75% Q3 87 50% Median 74 25% Q1 60 10% 22 5% 22 1% 10 0% Min 10 Using PROC UNIVARIATE to Look for Outliers The UNIVARIATE Procedure Variable: HR (Heart Rate) Extreme Observations ----Lowest---- ----Highest--- Value Obs Value Obs 10 23 90 8 22 25 101 4 22 15 208 19 48 24 210 9 58 20 900 22 Missing Values -----Percent Of----- Missing Missing Value Count All Obs Obs . 3 9.68 100.00 Stem Leaf # Boxplot 9 0 1 * 8 7 6 5 4 3 2 11 2 * 1 0 1 + 0 122566667777777888889999 24 +--0--+ ----+----+----+----+---- Multiply Stem.Leaf by 10**+2 Using PROC UNIVARIATE to Look for Outliers The UNIVARIATE Procedure Variable: HR (Heart Rate) Normal Probability Plot 950+ * | | 650+ | + | ++++++ 350+ +++++++ | ++++++ * * | ++++++ * 50+ * * ** *+**+***** * ** * +----+----+----+----+----+----+----+----+ ----+----+ -2 -1 0 +1 +2 Using PROC UNIVARIATE to Look for Outliers The UNIVARIATE Procedure Variable: DBP (Diastolic Blood Pressure) Moments N 28 Sum Weights 28 Mean 88.0714286 Sum Observations 2466 Std Deviation 37.2915724 Variance 1390.66138 Skewness 1.06190956 Kurtosis 3.67139184 Uncorrected SS 254732 Corrected SS 37547.8571 Coeff Variation 42.342418 Std Error Mean 7.04744476 Basic Statistical Measures Location Variability Mean 88.07143 Std Deviation 37 .29157 Median 81.00000 Variance 1391 Mode 78.00000 Range 192 .00000 Interquartile Range 26 .00000 NOTE: The mode displayed is the smallest of 2 modes with a count of 3. Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 12.49693 Pr > |t| <.0001 Sign M 14 Pr >= |M| <.0001 Signed Rank S 203 Pr >= |S| <.0001 Quantiles (Definition 5) Quantile Estimate 100% Max 200 99% 200 95% 180 90% 120 75% Q3 100 50% Median 81 25% Q1 74 10% 64 5% 20 1% 8 0% Min 8 Using PROC UNIVARIATE to Look for Outliers The UNIVARIATE Procedure Variable: DBP (Diastolic Blood Pressure) Extreme Observations ----Lowest---- ----Highest--- Value Obs Value Obs 8 23 106 28 20 12 120 4 64 14 120 11 68 27 180 10 68 6 200 22 Missing Values -----Percent Of----- Missing Missing Value Count All Obs Obs . 3 9.68 100.00 Stem Leaf # Boxplot 20 0 1 * 18 0 1 * 16 14 12 00 2 | 10 0026 4 +-----+ 8 000244800 9 *--+--* 6 488044888 9 +-----+ 4 2 0 1 0 0 8 1 0 ----+----+----+----+ Multiply Stem.Leaf by 10**+1 Using PROC UNIVARIATE to Look for Outliers The UNIVARIATE Procedure Variable: DBP (Diastolic Blood Pressure) Normal Probability Plot 210+ * | * + | +++++ | ++++++ | ++*+* 110+ ++*** * | ****+** | * * **+*+* | +++++ | ++*+++ 10+ +++*+ +----+----+----+----+----+----+----+----+ ----+----+ -2 -1 0 +1 +2 |
You certainly get lots of information from PROC UNIVARIATE, perhaps too much information. Starting off, you see some descriptive univariate statistics (hence the procedure name) for each of the variables listed in the VAR statement. Most of these statistics are not very useful in the data checking operation. The number of nonmissing observations (N), the number of observations not equal to zero (Num ^= 0), and the number of observations greater than zero (Num > 0) are probably the only items that are of interest to you at this time.
One of the most important sections of the PROC UNIVARIATE output, for data checking purposes, is the section labeled “Extremes.” Here you see the five highest and five lowest values for each of your variables. For example, for the variable HR (heart rate), there are three possible data errors under the column label “Lowest” (10, 22, and 22) and three possible data errors under the column label “Highest” (208, 210, and 900). Obviously, having knowledge of reasonable values for each of your variables is essential if this information is to be of any use. Next to the listing of the highest and lowest values is the observation number containing this value. What would be more useful would be the patient or subject number you assigned to each patient. This is easily accomplished by adding an ID statement to PROC UNIVARIATE. You list the name of your identifying variable following the keyword ID. The values of this ID variable are then used in addition to the OBS column. (Note: In versions of SAS software prior to Version 7, the ID column replaced the OBS column; in versions after Version 7, both the ID column and the OBS column are displayed.)
If you are running Version 7 or later of SAS software, you can include an ODS (output delivery system) statement to limit the PROC UNIVARIATE output to just the table of extreme values. Here are the PROC UNIVARIATE statements with an ID statement added, as well as the ODS statement to limit the output to the five highest and five lowest data values (the “Extremes”).
/****************************************************************** | The ODS statement is valid for V7 and later. | | Note that the name EXTREMEOBS may change in future SAS releases. | | Use ODS TRACE ON; before the PROC and ODS TRACE OFF; after | | the PROC to obtain a list of output object names (found in | | the SAS Log). | ******************************************************************/ ODS SELECT EXTREMEOBS; PROC UNIVARIATE DATA=CLEAN.PATIENTS; TITLE "Using PROC UNIVARIATE to Look for Outliers"; ID PATNO; VAR HR SBP DBP; RUN; |
The section of output showing the “Extremes” for the variable heart rate (HR) follows:
Using PROC UNIVARIATE to Look for Outliers The UNIVARIATE Procedure Variable: HR (Heart Rate) Extreme Observations --------Lowest-------- --------Highest------- Value PATNO Obs Value PATNO Obs 10 020 23 90 8 22 023 25 101 004 4 22 014 15 208 017 19 48 022 24 210 008 9 58 019 20 900 321 22 Missing Value . Count 3 % Count/Nobs 10.00 |
Before the addition of the ID statement, we only had columns labeled VALUE and OBS. With the ID statement there is a new column labeled PATNO that contains the values of your ID variable (PATNO), making it easier to locate the original patient data and check for errors. (Note: The column, which contains the values of the ID variable, that here is labeled PATNO will be labeled ID in releases of SAS software prior to Version 7).
The middle section of the output on page 28 contains a stem-and-leaf plot and a box plot. These two data visualizations come from an area of statistics known as exploratory data analysis (EDA). (For an excellent reference see Exploratory Data Analysis, by John Tukey, Reading, Massachusetts: Addison-Wesley.) Let’s focus on the plots for the variable DBP (diastolic blood pressure). The stem-and-leaf plot can be thought of as a sideways histogram. For this variable, the diastolic blood pressures are grouped in 20-point intervals. For example, the stem labeled “8” represents the values from 80 to 99. Instead of simply placing X’s or some other symbol to represent the bar in this sideways histogram, the next digit in the value is used instead. Thus, you see that there were three values of 80, one 82, two 84’s, one 88, and two 90’s. You can ignore these values and just think of the stem-and-leaf plot as a histogram, or you might be interested in the additional information that the leaf values give you. A quick examination of this plot shows that there were some abnormally low and high diastolic blood pressure values. This useful information complements the “Extremes” information. The “Extremes” only lists the five highest and five lowest values; the stem-and-leaf plot shows all the values in your data set.
To the right of the stem-and-leaf plot is a box plot. This plot shows the mean (+ sign), the median (the dashed line between the two asterisks), and the 25th and 75th percentiles (the bottom and top of the box, respectively). The distance between the top and bottom of the box is called the interquartile range and can also be found earlier in the outout labeled as “Q1-Q3.” Extending from both the top and bottom of the box are whiskers and outliers. The whiskers represent data values within one-and-a-half interquartile ranges above or below the box. (Note: The EDA people call the top and bottom of the box, the hinges.) Any data values more than one-and-a-half but less than three interquartile ranges above or below the box (hinges) are represented by 0’s. Two data values for DBP (8 and 20) fit this description in the box plot on page 28. Finally, any data values more than three interquartile ranges above or below the top and bottom hinges are represented by asterisks. For your DBP variable, two data points, 180 and 200, fit this description. The final graph, called a normal probability plot, is of interest to statisticians and helps determine deviations from a theoretical distribution called a normal or Gaussian distribution. The information displayed in the normal probability plot may not be useful for your data cleaning task because you are looking for data errors and are not particularly interested if the data are normally distributed or not.
18.188.196.223