Debugging a DATA Step

Diagnosing Errors in the Compilation Phase

Errors that are detected during the compilation phase include these:
  • misspelled keywords and data set names
  • unbalanced quotation marks
  • invalid options
During the compilation phase, SAS can interpret some syntax errors (such as the keyword DATA misspelled as DAAT). If it cannot interpret the error, SAS does the following:
  • prints the word ERROR followed by an error message in the SAS log
  • compiles but does not execute the step where the error occurred, and prints the following message:
    NOTE: The SAS System stopped processing this step because of errors.
Some errors are explained fully by the message that SAS prints; other error messages are not as easy to interpret. For example, because SAS statements are free-format, when you fail to end a SAS statement with a semicolon, SAS cannot detect the error.

Diagnosing Errors in the Execution Phase

When SAS detects an error in the execution phase, the following can occur, depending on the type of error:
  • A note, warning, or error message is displayed in the SAS log.
  • The values that are stored in the PDV are displayed in the SAS log.
  • The processing of the step either continues or stops.

Debugging Data Errors

Recall that data errors occur when data values are not appropriate for the SAS statements that are specified in a program. SAS detects data errors during program execution. When a data error is detected, SAS continues to execute the program.
In general, SAS procedures analyze data, produce output, or manage SAS files. In addition, SAS procedures can be used to detect invalid data. In addition to the PRINT procedure showing missing values, the following procedures can be used to detect invalid data:
  • PROC FREQ
  • PROC MEANS
The FREQ procedure detects invalid character and numeric values by looking at distinct values. You can use PROC FREQ to identify any variables that were not given an expected value.
Syntax, FREQ procedure:
PROC FREQ DATA=SAS-data-set <NLEVELS>;
TABLES variable(s);
RUN;
  • The NLEVELS option displays a table that provides the number of distinct values for each variable that is named in the TABLES statement.
  • The TABLES statement specifies the frequency tables to produce based on the number of variables that are specified.
In the following example, the data set contains invalid characters for the variables Gender and Age. PROC FREQ displays the distinct values of variables and is therefore useful for finding invalid values in data. You can use PROC FREQ with the TABLES statement to produce a frequency table for specific variables.
proc freq data=cert.pats;
      tables Gender Age; 
run;
In the following figures, notice the valid (M and F) and invalid (G) values for Gender, and the valid and invalid (202) values for Age. In both the Gender and Age FREQ tables, data in one observation needs to be cleaned.
Output 7.2 FREQ Procedure Output
The FREQ Procedure uses the TABLES statement to produce a frequency table for specific variables
The MEANS procedure can also be used to validate data because it produces summary reports that display descriptive statistics. For example, PROC MEANS can show whether the values for a particular variable are within their expected range.
Syntax, MEANS procedure:
PROC MEANS DATA=SAS-data-set <statistics>;
VAR variable(s);
RUN;
  • The statistics to display can be specified as an option in the PROC MEANS statement.
  • The VAR statement specifies the analysis variables and their order in the results.
Using the same data set as in the previous example, you can submit PROC MEANS to determine whether the age of all test subjects is within a reasonable range. Notice that the VAR statement is specified with that particular variable (Age) to get the statistical information, or range, of the data values.
proc means data=cert.pats;
  var Age;
run;
The following figure shows the output for the MEANS procedure. It displays a range of 16 to 202, which clearly indicates that there is invalid data somewhere in the Age column.
Output 7.3 MEANS Procedure Output
The MEANS procedure uses the VAR statement to get statistical information of the data values.

Using an Assignment Statement to Clean Invalid Data

You can use an assignment statement or a conditional clause to programmatically clean invalid data when it is identified.
For example, if your input data contains a field and that field contains an invalid value, you can use an assignment statement to clean your data. To avoid overwriting your original data set, you can use the DATA statement to create a new data set. The new data set contains all of the data from your original data set, along with the correct values for invalid data.
The following example assumes that Gender has an invalid value of G in the input data. This error might be the result of a data entry error. If G should actually be M, it is possible to correct the invalid data for Gender by using an assignment statement along with an IF-THEN statement:
data work.pats_clean;
  set cert.pats;
  gender=upcase(Gender);
  if Gender='G' then Gender='M';
run;
proc print data=work.pats_clean;
run;
Notice that two observations contain invalid values for Age. These values exceed a maximum value of 100. It is possible to uniquely identify each of the observations by specifying the variable ID. After checking the date of birth in each of the observations and determining the correct value for Age, you can change the data by inserting an IF-THEN-ELSE statement:
data work.clean_data;
  set cert.pats;
  gender=upcase(Gender);
  if Gender='G' then Gender='M';
  if id=1147 then age=65;
  else if id=5277 then age=75;
run;
proc print data=work.clean_data;
run;
Output 7.4 PROC PRINT Output of Work.Clean_Data Data Set
PROC PRINT Output of Work.Clean_Data Data Set
Another way of ensuring that your output data set contains valid data is to programmatically identify invalid data and delete the associated observations from your output data set:
data work.clean_data;
  set cert.pats;
  gender=upcase(Gender);
  if Gender='G' then Gender='M';
  if Age>110 then delete;
run;
proc print data=work.clean_data;
run;
Output 7.5 PROC PRINT Output of Work.Clean_Data Data Set with Deleted Observations
PROC PRINT of Work.Clean_Data Data Set with Deleted Observations
Last updated: August 23, 2018
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.219.228.88