Using Formats to Check for Invalid Values

Another way to check for invalid values of a character variable from raw data is to use user-defined formats. There are several possibilities here. One, you can create a format that leaves all valid character values as is and formats all invalid values to a single error code. Let’s start out with a program that simply assigns formats to the character variables and uses PROC FREQ to list the number of valid and invalid codes. Following that, you will extend the program by using a DATA step to identify which ID’s have invalid values. Program 1-6 uses formats to convert all invalid data values to a single value.

Program 1-6. Using a User-Defined Format and PROC FREQ to List Invalid Data Values
PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid'  /* See important note below */
             ' '           = 'Missing'
                 OTHER     = 'Miscoded';


   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
              OTHER  = 'Miscoded';
RUN;


PROC FREQ DATA=CLEAN.PATIENTS;
   TITLE "Using Formats to Identify Invalid Values";
   FORMAT GENDER $GENDER.
          DX     $DX.
          AE     $AE.;
   TABLES GENDER DX AE / NOCUM NOPERCENT MISSING;
RUN;

For the variables GENDER and AE, which have specific valid values, you list each of the valid values in the range to the left of the equal sign in the VALUE statement. Format each of these values with the value ‘Valid’. For the $DX format, you specify a range of values on the left side of the equal sign.

Important Note: It should be pointed out here, that the range ‘001’ - ‘999’ will behave differently on Windows and UNIX platforms compared to MVS and CMS platforms. You may want to test several values on your platform to be sure the program is performing as you intend. For example, the value ‘0A1’ will be considered ‘Valid’ on a Windows or a UNIX platform and ‘Invalid’ on MVS or CMS (as pointed out by two of my reviewers, John Laing and Mike Zdeb). You may want to test for alphabetic values for DX in a short DATA step, prior to running Program 1-6.

You may choose to lump the missing value with the valid values if that is appropriate, or you may want to keep track of missing values separately as was done here. Finally, any value other than the valid values or a missing value will be formatted as ‘Miscoded’. All that is left is to run PROC FREQ to count the number of ‘Valid’, ‘Missing’, and ‘Miscoded’ values. The TABLES option MISSING causes the missing values to be listed in the body of the PROC FREQ output. Here is the output from PROC FREQ.

Using Formats to Identify Invalid Values
The FREQ Procedure

       Gender
GENDER      Frequency
---------------------
Missing            1
Miscoded           4
Valid             26

Diagnosis Code

DX          Frequency
---------------------
Missing            8
Valid             21
Miscoded           2

Adverse Event?

AE          Frequency
---------------------
Missing            1
Valid             29
Miscoded           1


This output isn’t particularly useful. It doesn’t tell you which observations (patient numbers) contain missing or invalid values. Let’s modify the program by adding a DATA step, so that ID’s with invalid character values are listed.

Program 1-7. Using a User-Defined Format and a DATA Step to List Invalid Data Values
PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid'
             ' '           = 'Missing'
                 OTHER     = 'Miscoded';
   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
               OTHER = 'Miscoded';
RUN;
DATA _NULL_;
   INFILE "C:CLEANINGPATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Invalid Patient Numbers and Data Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO    $3.
         @4  GENDER   $1.
         @24 DX       $3.
         @27 AE       $1.;


   IF PUT(GENDER,$GENDER.) = 'Miscoded' THEN PUT PATNO= GENDER=;
   IF PUT(DX,$DX.) = 'Miscoded' THEN PUT PATNO= DX=;
   IF PUT(AE,$AE.) = 'Miscoded' THEN PUT PATNO= AE=;
RUN;

The “heart” of this program is the PUT function. To review, the PUT function is similar to the INPUT function. It takes the following form:

character_variable = PUT(variable,format)

where character_variable is a character variable that contains the value of the variable listed as the first argument to the function, formatted by the format listed as the second argument to the function. The result of a PUT function is always a character variable and the function is frequently used to perform numeric-to-character conversions. In Program 1-7, the first argument of the PUT function is a character variable, and the result of the PUT function for any invalid data values would be the value ‘Miscoded’.

Here is the output from Program 1-7.

Listing of Invalid Patient Numbers and Data Values

PATNO=002  DX=X
PATNO=003  GENDER=X
PATNO=004  AE=A
PATNO=010  GENDER=f
PATNO=013  GENDER=2
PATNO=002  DX=X
PATNO=023  GENDER=f


..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.22.170.83