Debugging a DATA Step

Diagnosing Errors in the Compilation Phase

Errors that are detected during the compilation phase include these:
  • misspelled keywords and data set names
  • unbalanced quotation marks
  • invalid options
During the compilation phase, SAS can interpret some syntax errors (such as the keyword DATA misspelled as DAAT). If it cannot interpret the error, SAS does the following:
  • prints the word ERROR followed by an error message in the log
  • compiles but does not execute the step where the error occurred, and prints the following message:
    NOTE: The SAS System stopped processing this step because of errors.
Some errors are explained fully by the message that SAS prints; other error messages are not as easy to interpret. For example, because SAS statements are free-format, when you fail to end a SAS statement with a semicolon, SAS cannot detect the error.

Diagnosing Errors in the Execution Phase

When SAS detects an error in the execution phase, the following can occur, depending on the type of error:
  • A note, warning, or error message is displayed in the log.
  • The values that are stored in the program data vector are displayed in the log.
  • The processing of the step either continues or stops.

Example: Execution Phase Error

Suppose you misspelled the fileref in the INFILE statement below. This is not a syntax error, because SAS does not validate the file that you reference until the execution phase. During the compilation phase, the fileref Invnt is assumed to reference some external raw data file.
data work.update; 
   infile invnt; 
   input Item $ 1-13 IDnum $ 15-19 
         InStock 21-22 BackOrd 24-25; 
   Total=instock+backord;  
run;
This error is not detected until the execution phase begins. Because there is no external file that is referenced by the fileref Invnt, the DATA step stops processing.
Log 7.2 SAS Log
3486 data work.update; 
3487    infile invnt;
3488    input Item $1-13 IDnum $15-19
3489    InStock 21-22 BackOrd 24-25;
3490    Total=instock+backord;
3491 run;

ERROR: No logical assign for filename INVNT.
NOTE: The SAS System topped processing this step because of errors.
WARNING: The data set WORK.UPDATE may be incomplete. When this step was stopped
					there was 0 observations and 5 variables. 
WARNING: Data set WORK.UPDATE was not replaced because this step was stopped. 
NOTE: DATA statement used (Total process time):
			 real time           0.01 seconds
      cpu time            0.01 seconds
Because Invent is misspelled, the statement in the DATA step that identifies the raw data is incorrect. Note, however, that the correct number of variables was defined in the descriptor portion of the data set.
Incorrectly identifying a variable's type is another common execution-time error. Recall that the values for IDnum are character values. Suppose you forget to place the dollar sign ($) after the variable's name in your INPUT statement. This is not a compile-time error, because SAS cannot verify IDnum's type until the data values for IDnum are read.
Raw Data File Invent
data work.update; 
   infile invent; 
   input Item $ 1-13 IDnum 15-19 
         InStock 21-22 BackOrd 24-25; 
   Total=instock+backord;  
run;
In this case, the DATA step completes the execution phase, and the observations are written to the data set. However, several notes appear in the log.
Log 7.3 SAS Log (partial log)
NOTE: The infile INVENT is:
      Filename=Z:sasuserinvent.dat,
      RECFM=V,LRECL=32767,File Size (bytes)=738,
      Last Modified=02Feb2017:13:50:21,
      Create Time=19Dec2016:12:49:08

NOTE: Invalid data for IDnum in line 1 15-19.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----
1         Bird Feeder   LG088  3 20                                                        80
Item=Bird Feeder IDnum=. InStock=3 BackOrd=20 Total=23 _ERROR_=1 _N_=1
NOTE: Invalid data for IDnum in line 2 15-19.
2         6 Glass Mugs  SB082  6 12                                                        80
Item=6 Glass Mugs IDnum=. InStock=6 BackOrd=12 Total=18 _ERROR_=1 _N_=2
NOTE: Invalid data for IDnum in line 3 15-19.
3         Glass Tray    BQ049 12  6                                                        80
Item=Glass Tray IDnum=. InStock=12 BackOrd=6 Total=18 _ERROR_=1 _N_=3
NOTE: Invalid data for IDnum in line 4 15-19.
4         Padded Hangrs MN256 15  6                                                        80
Item=Padded Hangrs IDnum=. InStock=15 BackOrd=6 Total=21 _ERROR_=1 _N_=4
NOTE: Invalid data for IDnum in line 5 15-19.
Each note identifies the location of the invalid data for each observation. In this example, the invalid data is located in columns 15-19 for all observations.
The second line in each note (excluding the RULE line) displays the raw data record. Notice that the second field displays the values for IDnum, which are obviously character values.
The third and fourth lines display the values that are stored in the program data vector. Here, the values for IDnum are missing, although the other values have been correctly assigned to their respective variables. Notice that _ERROR_ has a value of 1, indicating that a data error has occurred.
The PRINT procedure displays the data set, showing that the values for IDnum are missing. In this example, the periods indicate that IDnum is a numeric variable, although it should have been defined as a character variable.
proc print data=work.update;
run;
Figure 7.7 Output from the PRINT Procedure Showing Missing Values for IDnum
Output from the PRINT Procedure Showing Missing Values for IDnum

Validating and Cleaning Data

Recall that data errors occur when data values are not appropriate for the SAS statements that are specified in a program. SAS detects data errors during program execution. When a data error is detected, SAS continues to execute the program.
In general, SAS procedures analyze data, produce output, or manage SAS files. In addition, SAS procedures can be used to detect invalid data. In addition to the PRINT procedure showing missing values, the following procedures can be used to detect invalid data:
  • PROC FREQ
  • PROC MEANS
The FREQ procedure detects invalid character and numeric values by looking at distinct values. You can use PROC FREQ to identify any variables that were not given an expected value.
Syntax, FREQ procedure:
PROC FREQ DATA=SAS-data-set <NLEVELS>;
TABLES variable(s);
RUN;
  • The NLEVELS option displays a table that provides the number of distinct values for each variable that is named in the TABLES statement.
  • The TABLES statement specifies the frequency tables to produce.
In the following example, the data set contains invalid characters for the variables Gender and Age. PROC FREQ displays the distinct values of variables and is therefore useful for finding invalid values in data. You can use PROC FREQ with the TABLES statement to produce a frequency table for specific variables.
proc freq data=work.Patients;
      tables Gender Age; 
run;
In the following figures, notice the valid (M and F) and invalid (G) values for Gender, and the valid and invalid (242) values for Age. In both the Gender and Age FREQ tables, one observation needs data to be cleaned.
Figure 7.8 HTML Output: FREQ Procedure
The FREQ Procedure uses the TABLES statement to produce a frequency table for specific variables
The MEANS procedure can also be used to validate data because it produces summary reports that display descriptive statistics. For example, PROC MEANS can show whether the values for a particular variable are within their expected range.
Syntax, MEANS procedure:
PROC MEANS DATA=SAS-data-set <statistics>;
VAR variable(s);
RUN;
  • The statistics to display can be specified as an option in the PROC MEANS statement.
  • The VAR statement specifies the analysis variables and their order in the results.
Using the same data set as in the previous example, you can submit PROC MEANS to determine whether the age of all test subjects is within a reasonable range. Notice that the VAR statement is specified with that particular variable (Age) to get the statistical information, or range, of the data values.
proc means data=work.Patients;
      var Age;
run;
The following figure shows the output for the MEANS procedure. It displays a range of 44 to 242, which clearly indicates that there is invalid data somewhere in the Age column.
Figure 7.9 HTML Output: MEANS Procedure
The MEANS procedure uses the VAR statement to get statistical information of the data values.

Using an Assignment Statement to Clean the Data

You can use an assignment statement or a conditional clause to programmatically clean invalid data when it is identified. For example, if your input data contains a field called Gender, and that field has an invalid value (a value other than M or F), then you can clean your data by changing the invalid value to a valid value for Gender. To avoid overwriting your original data set, you can use the DATA statement to create a new data set. The new data set contains all of the data from your original data set, along with the correct values for invalid data.
The following example assumes that Gender has an invalid value of G in the input data. This error might be the result of a data entry error. If G should actually be F, it is possible to correct the invalid data for Gender by using an assignment statement along with an IF-THEN statement:
data work.clean_data;
   set work.patients;
   gender=upcase(Gender);
   if Gender='G' then Gender='F';
run;

proc print data=work.clean_data;
run;
Notice that two observations contain invalid values for Age. These values exceed a maximum value of 110. It is possible to uniquely identify each of the observations by specifying the variable Empid. After checking the date of birth in each of the observations and determining the correct value for Age, you can change the data by inserting an IF-THEN-ELSE statement:
data work.clean_data;
   set work.patients;
   if empid=3294 then age=65;
   else if empid=7391 then age=75;
run;

proc print data=work.clean_data;
run;
Another way of ensuring that your output data set contains valid data is to programmatically identify invalid data and delete the associated observations from your output data set:
data work.clean_data;
   set work.patients;
   if Age>110 then delete;
run;

proc print data=work.clean_data;
run;
Last updated: January 10, 2018
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.189.170.134