Example 9.2 Determining the Type of a Variable's Content

Goal

Determine whether a character variable's value contains numeric data, alphabetic text, missing data, or nonalphanumeric characters. Save the values that are determined to be numeric in a numeric variable.

Example Features

Featured Step: DATA step
Featured Step Options and Statements: COMPRESS, INPUT with ?? format modifier, NOTALPHA, REVERSE, STRIP, SUBSTR, and UPCASE functions
A Closer Look: Understanding Functions That Evaluate the Content of Variable Values

Input Data Set

Data set CHEMTEST contains preliminary results of 14 lab samples in character variable CHEM_PPB. The values of CHEM_PPB contain a mix of alphanumeric and other characters.

            CHEMTEST

Obs    sample    chem_ppb

  1     57175    1250.3
  2     71309    2.53E3
  3     28009    40 ppb
  4     40035    -81
  5     55128    3,900
  6     41930    ~1000
  7     21558    4?23
  8     46801    <1%
  9     18322
 10     11287    <1000
 11     37175    >5000
 12     22195    Sample lost
 13     81675    Invalid: GHK
 14     88810    N/A

Resulting Data Set

Output 9.2 CHEMEVAL Data Set

             Example 9.2 CHEMEVAL Data Set

Obs    sample    chem_ppb        eval            result

  1     57175    1250.3          Numeric         1250.3
  2     71309    2.53E3          Numeric         2530.0
  3     28009    40 ppb          Numeric           40.0
  4     40035    -81             Error               .
  5     55128    3,900           Error               .
  6     41930    ~1000           Error               .
  7     21558    4?23            Error               .
  8     46801    <1%             Error               .
  9     18322                    Undefined           .
 10     11287    <1000           Below Range         .
 11     37175    >5000           Above Range         .
 12     22195    Sample lost     Text                .
 13     81675    Invalid: GHK    Text                .
 14     88810    N/A             Text                .


Example Overview

This example examines a character variable's values. It uses the information that is returned by several SAS functions to determine the content of a variable value. The values of the character variable are categorized based on the results that are returned by the SAS functions. The category assignments are saved in a new character variable. Values that the DATA step determines to be numeric are saved in a numeric variable.

Data set CHEMTEST contains preliminary results of 14 lab samples in character variable CHEM_PPB. The values of CHEM_PPB contain a mix of alphanumeric and other characters.

The goal of the DATA step is to examine and categorize the values of CHEM_PPB. Variable CHEM_PPB holds the chemical concentration in parts per billion of a sample. The values of CHEM_PPB vary widely from numbers to text. A series of IF statements applies several SAS functions to determine the content of CHEM_PPB. Only the values that are evaluated to be numeric are saved as numeric data in new variable RESULT. Each value of CHEM_PPB is categorized into one of the following six categories. The character category value is saved in variable EVAL:

  • Numeric: The value of CHEM_PPB can be read in as numeric. This includes numbers that are written in exponential notation and numbers that include decimal points. It also includes values that are specified as numbers followed by the PPB label in uppercase or lowercase. Values that contain commas or negative signs are considered errors in data entry.

  • Below Range: The value of CHEM_PPB is below the lowest detectable value. The value starts with a less than sign (<) and the remainder can be read in as numeric.

  • Above Range: The value of CHEM_PPB is above the highest detectable value. The value starts with a greater than sign (>) and the remainder can be read in as numeric.

  • Undefined: CHEM_PPB is missing.

  • Text: The value of CHEM_PPB contains alphabetic characters, no numbers, and no special characters other than blanks, parentheses, periods, commas, hyphens, ampersands, and colons, or the value contains the text "N/A".

  • Error: The value of CHEM_PPB does not fit into any of the other five categories.

Note that the CHEM_PPB value for sample 55128 is a number specified with a comma: 3,900. The program classifies this value as an error because the BEST12. informat does not accept commas. You could use the COMMAw.d informat instead of the BEST12. informat to read this value as a number. However, the COMMAw.d informat also removes percent signs, so it would strip the percent sign from the value for sample 46801. This might or might not be acceptable in your application. An alternative is to remove the commas from the values of CHEM_PPB before reading them with the INPUT function, so that the value 3,900 becomes the character value '3900', which BEST12. converts to a numeric value.
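The comma-removal idea can be sketched as follows. This is a minimal illustration, assuming a throwaway DATA _NULL_ step with two hard-coded test values copied from CHEMTEST; it is not part of the example program.

```sas
/* Sketch: remove commas before applying the BEST12. informat. */
/* The two test values are copied from CHEMTEST.               */
data _null_;
  length chem_ppb $ 12;
  do chem_ppb='3,900', '<1%';
    /* COMPRESS deletes only the characters named in its */
    /* second argument, here the comma.                  */
    result=input(compress(chem_ppb,','),?? best12.);
    put chem_ppb= result=;
  end;
run;
```

With the comma removed, 3,900 should convert to 3900, while <1% should still fail to convert because BEST12. accepts neither the less than sign nor the percent sign.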

Program

Create data set CHEMEVAL. Read the observations in CHEMTEST. Define a new character variable, EVAL, and a new numeric variable, RESULT.

When the value of CHEM_PPB is missing, assign the text "Undefined" to variable EVAL. Execute the outer DO group when CHEM_PPB is not missing.

Attempt to read CHEM_PPB as a numeric value. Include the ?? format modifier in the INPUT function to suppress the error messages, and to prevent automatic variable _ERROR_ from being set to 1, when CHEM_PPB cannot be read in as a number. Use the BEST12. informat to allow for various representations of numeric values.

Evaluate numeric values that are successfully returned by the INPUT function. Assign an evaluation of "Numeric" when the value that is returned is greater than or equal to 0. For negative values, classify the observation as an error and reset RESULT to missing.

Execute the next DO group when the value that is returned by the INPUT function is missing, which means the BEST12. informat could not be successfully applied to CHEM_PPB.

Check whether the value of CHEM_PPB ends with the text "PPB" in uppercase or lowercase. Reverse the text of CHEM_PPB, strip leading and trailing blanks from the reversed string, and convert it to uppercase. Then test whether the result begins with "BPP". Attempt to read the value of CHEM_PPB with its last three characters removed to see whether the beginning part of CHEM_PPB is numeric. Evaluate the returned value as before: assign "Numeric" when it is greater than or equal to 0, and for negative values classify the observation as an error and reset RESULT to missing.

Otherwise, test whether the value of CHEM_PPB is a mix of letters and specific punctuation or whether its uppercased value is 'N/A'. Remove the acceptable nonalphabetic characters from CHEM_PPB with COMPRESS so that the NOTALPHA function does not find them. Use NOTALPHA to determine whether any of the remaining characters are not uppercase or lowercase alphabetic letters.

Execute the last DO group when the value of CHEM_PPB has not been categorized as text. Determine whether the value of CHEM_PPB is structured to indicate that the sample was below the limit of detection. If the first character of the CHEM_PPB value is a less than sign and the remainder can be read in as a numeric value, conclude that the value is for a sample below the limit of detection. Follow the same process with the greater than sign to determine whether the sample was above the limit of detection. For all values that remain unclassified, assign a value of 'Error' to EVAL.

data chemeval;
  set chemtest;
  length eval $ 12 result 8;

  /* Missing value */
  if chem_ppb=' ' then eval='Undefined';
  else do;
    /* Try to read the whole value as a number. The ?? modifier */
    /* suppresses log messages and keeps _ERROR_ at 0.          */
    result=input(chem_ppb,?? best12.);
    if result ne . then do;
      if result ge 0 then eval='Numeric';
      else do;
        /* Negative numbers are treated as data entry errors */
        result=.;
        eval='Error';
      end;
    end;
    else do;
      /* Value ends in PPB (any case): read the part before it */
      if upcase(strip(reverse(chem_ppb)))=:'BPP' then do;
        result=input(substr(chem_ppb,1,length(chem_ppb)-3),
                     ?? best12.);
        if result ne . then do;
          if result ge 0 then eval='Numeric';
          else do;
            result=.;
            eval='Error';
          end;
        end;
      end;
      /* Letters plus allowed punctuation, or the text N/A */
      else if notalpha(compress(chem_ppb,' ().,-&:'))=0
              or upcase(chem_ppb)='N/A' then eval='Text';
      else do;
        /* A < or > sign followed by a number */
        if char(chem_ppb,1)='<' and
           input(substr(chem_ppb,2),?? best12.) ne .
          then eval='Below Range';
        else if char(chem_ppb,1)='>' and
           input(substr(chem_ppb,2),?? best12.) ne .
          then eval='Above Range';
        else eval='Error';
      end;
    end;
  end;
run;
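
To see what the ?? modifier and the PPB suffix test do in isolation, the following sketch applies them to a few hard-coded values. The DATA _NULL_ step and the variable names VALUE, WHOLE, and PREFIX are assumptions made for illustration and do not appear in the example program.

```sas
/* Sketch: try the whole-value conversion and the suffix test */
/* on hard-coded values instead of the CHEMTEST data set.     */
data _null_;
  length value $ 12;
  do value='1250.3', '2.53E3', '40 ppb', 'N/A';
    /* Whole-value conversion. With ??, a failed conversion */
    /* returns missing, writes no invalid-data note, and    */
    /* leaves _ERROR_ at 0.                                 */
    whole=input(value,?? best12.);

    /* Suffix test: reverse, strip, and uppercase the value, */
    /* then compare its leading characters to BPP with =:.   */
    prefix=.;
    if upcase(strip(reverse(value)))=:'BPP' then
      prefix=input(substr(value,1,length(value)-3),?? best12.);

    put value= whole= prefix= _error_=;
  end;
run;
```

The PUT statement writes one line per value to the SAS log, which makes it easy to compare the result of reading the whole value with the result of reading only the part that precedes the PPB text.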

A Closer Look

Understanding Functions That Evaluate the Content of Variable Values

The SAS language contains many functions that can test the content of variable values. With functions FIND, FINDC, INDEX, INDEXC, INDEXW, and VERIFY, you can specify complex arguments to search for sets of specific values. The two series of functions—one that starts with ANY and the other that starts with NOT—look for specific types of values within a character string. The preceding DATA step used one of these, NOTALPHA, to look for nonalphabetic characters in a string where specific characters had been removed by the COMPRESS function. These specific characters were allowed in the value when categorizing the value as text.
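
The COMPRESS and NOTALPHA combination described above can be tried on its own. The sketch below assumes a DATA _NULL_ step with hard-coded values and a hypothetical variable POS; the list of allowed characters is the one used in the example program.

```sas
/* Sketch: test for text after removing the allowed characters. */
data _null_;
  length value $ 12;
  do value='Sample lost', 'Invalid: GHK', '40 ppb';
    /* Nest COMPRESS inside NOTALPHA, as the program does.    */
    /* Blanks are in the removal list, so trailing blanks are */
    /* removed too and are not reported as nonalphabetic.     */
    pos=notalpha(compress(value,' ().,-&:'));
    put value= pos=;
  end;
run;
```

A POS value of 0 means that only letters remained after the allowed characters were removed. A nonzero value is the position of the first remaining nonalphabetic character in the compressed string.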

The ANY- and NOT- functions are simple to use because they look for specific sets of predefined characters. The first argument to these functions is your character variable. The optional second argument is the position at which to start the examination of the value, and its sign sets the direction of the search. For a left-to-right search, specify the column position. For a right-to-left search, precede the column position with a minus sign (-).

The ANY- series of functions returns the first column position where a character of the type being searched for is found. It returns a value of 0 when its argument does not contain any characters of the type being searched for.

The NOT- series of functions returns the first column position where a character is found that is not of the type being searched for. It returns a value of 0 when all the characters in the argument are of the type being searched for.
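
The return values and the optional start position can be explored with a short sketch such as the following. The DATA _NULL_ step and all variable names are assumptions made for illustration. Note that a character variable is padded with blanks to its defined length, and a blank is not a digit, so the sketch applies STRIP before NOTDIGIT when testing whether a value consists only of digits.

```sas
/* Sketch: ANY- and NOT- functions with and without a start position. */
data _null_;
  length value num $ 12;
  value='40 ppb';
  num='1250';

  /* Search left to right from column 1 */
  first_digit=anydigit(value);

  /* Start at column 3 and search to the right */
  digit_from3=anydigit(value,3);

  /* Negative start: begin at the last nonblank column */
  /* and search to the left                             */
  last_alpha=anyalpha(value,-length(value));

  /* NUM is stored with trailing blanks, which NOTDIGIT */
  /* reports as nondigits unless they are stripped      */
  padded=notdigit(num);
  stripped=notdigit(strip(num));

  put first_digit= digit_from3= last_alpha= padded= stripped=;
run;
```

Comparing PADDED with STRIPPED in the log shows why a test such as NOTDIGIT(value)=0 usually needs the trailing blanks removed first.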

Table 9.1 lists the ANY- and NOT- series of functions that can examine the content of a variable value. For more information about the other six functions listed at the beginning of this section, see SAS documentation.

Table 9.1. ANY- and NOT- Series of SAS Functions

ANY Version    NOT Version    Searches for This Type of Character

ANYALNUM       NOTALNUM       Alphanumeric: digits 0–9, uppercase letter, lowercase letter
ANYALPHA       NOTALPHA       Alphabetic: uppercase letter, lowercase letter
ANYCNTRL       NOTCNTRL       Control: line feeds, page ejects, etc.
ANYDIGIT       NOTDIGIT       Digit: digits 0–9
ANYFIRST       NOTFIRST       Character that is valid as the first character in a SAS variable name under VALIDVARNAME=V7: uppercase letter, lowercase letter, underscore
ANYGRAPH       NOTGRAPH       Graphical: any printable character other than white space
ANYLOWER       NOTLOWER       Alphabetic: lowercase letter
ANYNAME        NOTNAME        Character that is valid in a SAS variable name under the rules for SAS system option VALIDVARNAME=V7
ANYPRINT       NOTPRINT       Printable character
ANYPUNCT       NOTPUNCT       Punctuation
ANYSPACE       NOTSPACE       White-space character: blank, horizontal and vertical tab, carriage return, line feed, form feed
ANYUPPER       NOTUPPER       Alphabetic: uppercase letter
ANYXDIGIT      NOTXDIGIT      Hexadecimal character that represents a digit
