Using Informats to Check for Invalid Values

PROC FORMAT is also used to create informats. Remember that formats are used to control how variables look in output or how they are classified by such procedures as PROC FREQ. Informats modify the value of variables as they are read from the raw data, or they can be used with an INPUT function to create new variables in the DATA step. User-defined informats are created in much the same way as user-defined formats. Instead of a VALUE statement that creates formats, an INVALUE statement is used to create informats. The only difference between the two is that informat names can only be seven characters in length. (Note: For those curious readers, the reason is that informats and formats are both stored in the same catalog and an “@” is placed before informats to distinguish them from formats.) The following is a program that changes invalid values for GENDER and AE to missing values by using a user-defined informat.

Program 1-8. Using a User-Defined Informat to Set Invalid Data Values to Missing
*---------------------------------------------------------------- *
| PROGRAM NAME: INFORM1.SAS IN C:CLEANING                        |
| PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS2              |
|          AND SET ANY INVALID VALUES FOR GENDER AND AE TO        |
|          MISSING, USING A USER-DEFINED INFORMAT                 |
*--------------------------------------------------------------- *;
LIBNAME CLEAN "C:CLEANING";


PROC FORMAT;
   INVALUE $GEN    'F','M' = _SAME_
                   OTHER   = ' ';
   INVALUE $AE    '0','1' = _SAME_
                   OTHER  = ' ';
RUN;


DATA CLEAN.PATIENTS2;
   INFILE "C:CLEANINGPATIENTS.TXT" PAD;
   INPUT @1  PATNO    $3.
         @4  GENDER   $GEN1.
         @27 AE       $AE1.;


   LABEL PATNO   = "Patient Number"
         GENDER  = "Gender"
         DX      = "Diagnosis Code"
         AE      = "Adverse Event?";
RUN;

PROC PRINT DATA=CLEAN.PATIENTS2;
   TITLE "Listing of Data Set PATIENTS2";
   VAR PATNO GENDER AE;
RUN;

Notice the INVALUE statements in the PROC FORMAT above. The key word _SAME_ is a SAS reserved value that does what its name implies — it leaves any of the values listed in the range specification unchanged. The key word OTHER in the subsequent line refers to any values not matching one of the previous ranges. Notice also, that the informats in the INPUT statement use the user-defined informat name followed by the number of columns to be read, the same method that is used with predefined SAS informats.

Output from the PROC PRINT is shown next.

Listing of Data Set PATIENTS2

Obs     PATNO     GENDER     AE

  1      001        M        0
  2      002        F        0
  3      003                 1
  4      004        F
  5      XX5        M        0
  6      006                 1
  7      007        M        0
  8                 M        0
  9      008        F        0
 10      009        M        1
 11      010                 0
 12      011        M        1
 13      012        M        0
 14      013
 15      014        M        1
 16      002        F        0
 17      003        M        0
 18      015        F        1
 19      017        F        0
 20      019        M        0
 21      123        M        0
 22      321        F        1
 23      020        F        0
 24      022        M        1
 25      023                 0
 26      024        F        0
 27      025        M        1
 28      027        F        0
 29      028        F        0
 30      029        M        1
 31      006        F        0


Notice that invalid values for GENDER and AE are now missing values, including the two lowercase ‘f’s (patient numbers 010 and 023).

Let’s add one more feature to this program. By using the keyword UPCASE in the informat specification, you can automatically convert the values being read to uppercase before the ranges are checked. Here are the PROC FORMAT statements, rewritten to use this option.

PROC FORMAT;
   INVALUE $GEN (UPCASE)  'F' = 'F'
                          'M' = 'M'
                        OTHER = ' ';
   INVALUE $AE '0','1' = _SAME_
                OTHER  = ' ';
RUN;

The UPCASE option is placed in parenthesis following the informat name. Notice some other changes as well. You cannot use the keyword _SAME_ anymore because the value is changed to uppercase for comparison purposes, but the _SAME_ specification would leave the original lowercase value unchanged. By specifying each value individually, the lowercase ‘f’ (the only lowercase GENDER value) would match the range ‘F’ and be assigned the value of an uppercase ‘F’.

The output of this data set is identical to the output for Program 1-8 except the value of GENDER for patients 010 and 023 are an uppercase ‘F’.

If you want to preserve the original value of the variable, you can use a user-defined informat with an INPUT function instead of an INPUT statement. You can use this method to check a raw data file or a SAS data set. Program 1-9 reads the SAS data set CLEAN.PATIENTS and uses user-defined informats to detect errors.

Program 1-9. Using a User-Defined Informat with the INPUT Function
PROC FORMAT;
   INVALUE $GENDER 'F','M' = _SAME_
                    OTHER  = 'ERROR';
   INVALUE $AE      '0','1' = _SAME_
                    OTHER   = 'ERROR';
RUN;


DATA _NULL_;
   FILE PRINT;
   SET CLEAN.PATIENTS;
   IF INPUT (GENDER,$GENDER.) = 'ERROR' THEN
      PUT @1 "Error for Gender for Patient:" PATNO" Value is " GENDER;
   IF INPUT (AE,$AE.) = 'ERROR' THEN
      PUT @1 "Error for AE for Patient:" PATNO" Value is " AE;
RUN;

The advantage of this program over Program 1-8 is that the original values of the variables are not lost.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.161.228