Example 9.19 Validating and Standardizing Complex Character Data

Goal

Validate character data values that can be stored in multiple patterns. When valid, reformat values to conform to one pattern.

Example Features

Featured StepDATA step
Featured Step Options and StatementsRegular expressions and associated functions: PRXCHANGE, PRXMATCH, PRXPARSE, and PRXPOSN CATS function
A Closer LookUnderstanding the Regular Expressions Defined in This Example

Input Data Sets

Data set NEWCONTACTS contains contact information about eight people. The values for each variable are not uniformly specified.

                                NEWCONTACTS

 Obs     contactname       contactphone   contactemail
  1  Young Wilson, Karen  (315) 555- 3210 [email protected]
  2  Denise M. Morris     (607)555-0123   denise [email protected]
  3  Bonita Ann Henderson 920-555-6719    [email protected]
  4  Brown, Anne-Marie    9255554031      [email protected]
  5  Butler, L. William   612–555-7171    [email protected]
  6  Davis-Lee, Kim       111-555-4444    [email protected]
  7  Joseph Mitchell      (505)555-5432   [email protected]
  8  Sanchez,   R. T.     334   5551234   sanc001@schooledu

Resulting Data Set

Output 9.19 CONTACTS_EDITED Data Set

                       Example 9.19 CONTACTS_EDITED

Obs  contactname           contactphone   contactemail

 1   Karen Young Wilson    (315)555-3210  [email protected]
 2   Denise M. Morris      (607)555-0123  **Invalid**
 3   Bonita Ann Henderson  (920)555-6719  **Invalid**
 4   Anne-Marie Brown      (925)555-4031  **Invalid**
 5   L. William Butler     (612)555-7171  [email protected]
 6   Kim Davis-Lee         **Invalid**    [email protected]
 7   Joseph Mitchell       (505)555-5432  [email protected]
 8   R. T. Sanchez         (334)555-1234  **Invalid**


Example Overview

Character data can sometimes be collected in a free-form manner that is difficult to work with unless you modify them to conform to a specific structure. Values that are collected this way are usually not checked for errors either.

Many SAS functions and SAS language statements can be used to check and modify your data. Functions such as INDEX, VERIFY, SUBSTR, and the series of concatenation functions commonly are used in a series of IF-THEN-ELSE statements to check and revise character data. However, when you want to verify and edit complex data values, it might be more efficient and require less code to work with the set of Perl regular expression functions and CALL routines. With these expressions, you can match multiple patterns to your text values so that you can validate, modify, and extract data.

This example works with complex character data values. Each variable's values record the same type of information, but not necessarily the same way.

Data set NEWCONTACTS contains three variables of information about eight people. It contains the person's name, phone number, and e-mail address. There are differences in how the values for each variable are stored. For example, some of the names are written first name first followed by last name, while others are last name first followed by a comma and first name.

Some of the data values are also invalid. For example, the area code for one person starts with a 1, which is invalid. Some of the e-mail addresses are also invalid because they include invalid characters or site names.

To make a regular expression, you must first understand the patterns that you want to match and those you do not want to match. You describe the patterns in your data with alphabetic characters and special characters that are called metacharacters. Description of the many ways of specifying regular expressions is beyond the scope of this book. For more complete information, see SAS documentation. The following example will present just a few ways of specifying patterns to validate and edit the data.

Working with regular expressions is usually a two-step process. First the regular expression is compiled with the PRXPARSE function or CALL routine, which assigns a regular expression ID to a variable. Next the identifier variable is referenced by other PRX functions or CALL routines.

Usually your expressions need to be compiled only once. You can either put your PRXPARSE calls inside a DO group that executes only on the first iteration of the DATA step, or you can add the "o" option to the PRXPARSE call. The "o" option tells SAS to compile the regular expression only once.

After defining your expressions, statements in your DATA step that call PRX functions can then reference the regular expression ID variable.

Instead of making this a two-step process, you can specify your regular expression in functions such as PRXCHANGE that use them. However, this style might make reading your code more difficult because many regular expressions can be long and complicated to read.

In this DATA step, the PRXPARSE function defines a regular expression for each of the character variables. The calls to PRXPARSE are placed in a DO block that executes only on the first iteration of the DATA step. The regular expressions are compiled once.

  • The regular expression that is used with CONTACTNAME is NAME_RX. Its definition specifies that contact names in the output data set that are last name first, first name last, and last and first names separated by a comma will be rearranged to be first name first and last name last.

  • The regular expression that is used with CONTACTPHONE is PHONE_RX. Its definition will be used to validate telephone data and reformat valid telephone data to one style.

  • The regular expression that is used with CONTACTEMAIL is EMAIL_RX. Its definition will be used to validate e-mail address specification.

"Understanding the Regular Expressions Defined in This Example" in the "A Closer Look" section describes in more detail how the regular expressions in this example are coded.

Following the first DO block, the DATA step then performs these actions:

  • Uses PRXCHANGE to rearrange CONTACTNAME so that values that are last name first, first name last, and last and first names separated by a comma will be rearranged to first name first and last name last.

  • Uses PRXMATCH to determine whether the value of CONTACTPHONE is a valid phone number. When it is valid, calls to PRXPOSN determine the positions of the three parts of the telephone number. The CATS function then concatenates the three parts and encloses the area code in parentheses so that all telephone numbers are specified the same way. When it is not valid, the value of CONTACTPHONE is replaced with the text "**Invalid**".

  • Uses PRXMATCH to determine whether the value of CONTACTEMAIL is valid. When it is not valid, the value of CONTACTEMAIL is replaced with the text "**Invalid**".

The complex patterns and modifications would be challenging to write in a series of IF-THEN-ELSE statements so writing regular expressions is appropriate for this example.

Program

Create data set CONTACTS_EDITED. Read the observations in NEWCONTACTS. Retain across all iterations of the DATA step the values that are assigned to these variables on the first iteration of the DATA step.

Execute this block on the first iteration of the DATA step. Build with PRXPARSE a regular expression whose ID number is assigned to NAME_RX. Build with PRXPARSE a regular expression whose ID number is assigned to PHONE_RX. Build with PRXPARSE a regular expression whose ID number is assigned to EMAIL_RX.

Modify the values of CONTACTNAME according to the regular expression NAME_RX. Specify -1 as the second argument so that the process of matching patterns and replacing text continues until the end of the value of CONTACTNAME is reached. Execute the DO group if the pattern of the current value of CONTACTPHONE matches that of the regular expression that is identified by the value of PHONE_RX. Because the regular expression that was identified by PHONE_RX was defined with three capture buffers, one for each of the three parts of the phone number, extract the first and assign it to AREACODE. Extract the second capture buffer from CONTACTPHONE and assign it to PREFIX. Extract the third capture buffer from CONTACTPHONE and assign it to EXTENSION. Concatenate the three parts of the telephone number. Enclose the area code in parentheses. Insert a hyphen between the prefix and extension parts of the telephone number.

When the value of CONTACTPHONE does not match the pattern in the regular expression that is identified by PHONE_RX, replace the current value of CONTACTPHONE with a specific text value. When the value of CONTACTEMAIL does not match the pattern in the regular expression that is identified by EMAIL_RX starting in column 1, replace the current value of CONTACTEMAIL with a specific text value.

data contacts_edited;

  set newcontacts;

  retain name_rx phone_rx email_rx;




  drop name_rx phone_rx email_rx areacode prefix extension;
  if _n_=1 then do;

    name_rx=
prxparse('s/([a-z]+[-.a-zs]+),s*?([a-z]+[-.a-zs]+)/2 1/i');


     phone_rx=
prxparse(
       "/(?([2-9]dd))?s*-?s*([2-9]dd)s*-?s*(d{4})/");

     email_rx=
   prxparse("/(w+[.-w]*)@(w+[.-w]*).(gov|com|edu|net)/i");



  end;

  contactname=prxchange(name_rx,-1,contactname);





  if prxmatch(phone_rx,contactphone) then do;




    areacode=prxposn(phone_rx,1,contactphone);






    prefix=prxposn(phone_rx,2,contactphone);

    extension=prxposn(phone_rx,3,contactphone);


    contactphone=cats('(',areacode,')',prefix,'-',extension);


  end;
  else contactphone='**Invalid**';





  if prxmatch(email_rx,contactemail)^=1 then
                  contactemail='**Invalid**';


run;

A Closer Look

Understanding the Regular Expressions Defined in This Example

This section describes in more detail how the three regular expressions that were defined in the preceding DATA step were coded. This section does not explain in general how to build your own regular expressions. For more information, see SAS documentation.

When you use regular expressions, you compactly describe expected patterns in your data and add programming statements to perform specific actions based on the results of matching your data values to the expected patterns coded in the regular expressions.

Forward slashes enclose regular expressions. Within each expression, you can match specific text to specific positions in a string. Regular expressions also support wildcard specifications so that your coding can indicate specific characters in one part of your string and any character in another. Ranges of values such as all alphabetic characters or digits from 0 to 9 can be specified. Coding can indicate how many times a pattern can repeat itself. Parts of an expression can be grouped with parentheses.

Table 9.3 describes the metacharacters in the regular expression that is associated with CONTACTNAME. The patterns that are defined in this regular expression are used to find values of CONTACTNAME that are stored last name first, first name last, and first and last names separated by a comma. When a value of CONTACTNAME matches the patterns, the value is rearranged by using the PRXCHANGE function so that the value is stored first name first and last name last.

When the value of CONTACTNAME does not match the patterns in the regular expression, the value remains unchanged. For example, values of CONTACTNAME that are already stored as first name first and last name last are not modified. These values do not have commas and thus do not match the regular expression.

The part of the regular expression that is being described is highlighted and enlarged.

Table 9.3. Description of Metacharacters in Regular Expression Associated with Variable CONTACTNAME
name_rx=prxparse('s/([a-z]+[-.a-zs]+),s*([a-z]+[-.a-zs]+)/2 1/i');

Use the regular expression in a substitution. Place the pattern to be matched between the first two slash characters. Place the pattern of the result of the substitution between the last two slash characters. Indicate that case of the patterns is to be ignored by placing a lowercase 'i' after the last slash character.
name_rx=prxparse('s/([a-z]+[-.a-zs]+),s*([a-z]+[-.a-zs]+)/2 1/i');

Group the contents of the regular expression by enclosing the metacharacters in the two groups in parentheses. These groups then become capture buffers that are referenced by the metacharacters between the second set of slash characters. The first capture buffer corresponds to the last name. The second capture buffer corresponds to the first name.
name_rx=prxparse('s/([a-z]+[-.a-zs]+),s*([a-z]+[-.a-zs]+)/2 1/i');

Specify metacharacters identically for both the last name and the first name. Enclose in square brackets the allowable characters in each name.
name_rx=prxparse('s/([a-z]+[-.a-zs]+),s*([a-z]+[-.a-zs]+)/2 1/i');

Specify in the pattern at least one letter as the first character in the first name and in the last name by placing a plus sign (+) following the square bracket. Place the range of characters that are allowed within the square brackets. Because an 'i' is placed at the end of the substitution expression, case of the alphabetic characters is ignored and specifying just "a-z" includes both uppercase and lowercase versions of the letters.
name_rx=prxparse('s/([a-z]+[-.a-zs]+),s*([a-z]+[-.a-zs]+)/2 1/i');

Specify in the pattern that after the first letter in the first name and in the last name, there should be an alphabetic character, a hyphen (-), a period (.), or a space (s). Indicate a space with the s metacharacter. Including a space and a hyphen in the pattern allows matches when the value has two first names, two last names, or hyphenated names. A period allows initials in a matching value.
name_rx=prxparse('s/([a-z]+[-.a-zs]+),s*([a-z]+[-.a-zs]+)/2 1/i');

Specify in the pattern that a match must have the two parts of the name separated with a comma and optionally one or more spaces. Indicate a space with the s metacharacter. Indicate that there can be in a match none, one, or many spaces with the asterisk (*) repetition factor metacharacter.
name_rx=prxparse('s/([a-z]+[-.a-zs]+),s*([a-z]+[-.a-zs]+)/2 1/i');

Indicate the arrangement of the results of the substitution. Place the second capture buffer before the first capture buffer. Place a space between them. Precede the capture buffer number with the backward slash escape character so that the number is not interpreted as text.

Table 9.4 describes the metacharacters in the PHONE_RX regular expression that is associated with CONTACTPHONE. The patterns that are defined in this regular expression are used to find valid telephone numbers in the values of CONTACTPHONE.

When a value of CONTACTPHONE matches a pattern in PHONE_RX, the three parts of the telephone number, which are the area code, prefix, and extension, are extracted from the value by using PRXPOSN. The CATS function then concatenates the three parts so that all the matches have telephone numbers stored the same way in the output data set.

When the value of CONTACTPHONE does not match the patterns in the PHONE_RX regular expression, the value is replaced with the text "**Invalid**.

The part of the regular expression that is being described is highlighted and enlarged.

Table 9.4. Description of Metacharacters in Regular Expression Associated with Variable CONTACTPHONE
phone_rx=prxparse("/(?([2-9]dd))? s*-?s*([2-9]dd) s*-?s*(d{4})/");

Place the pattern to be matched between two slash characters.
phone_rx=prxparse("/(?([2-9]dd))? s*-?s*([2-9]dd) s*-?s*(d{4})/");

Make two groups in the regular expression by enclosing each of them with parentheses. The first group corresponds to the prefix part of the telephone number. The second group corresponds to the extension part of the telephone number. The parentheses in the first part of the expression do not form a group; they are instead part of the pattern to match the area code and are described next.
phone_rx=prxparse("/(?([2-9]dd))? s*-?s*([2-9]dd) s*-?s*(d{4})/");

Specify the pattern of the area code part of the telephone number. Allow matches to optionally enclose the area code in parentheses. To prevent the open and close parentheses from being interpreted as metacharacters, precede them with the backslash escape character. Follow the open parenthesis and the close parenthesis with a question mark (?) to indicate that a match can be made if the parenthesis character is present one or zero times. Allow matches to have only digits from 2 to 9 in the first position of the area code. In the second and third positions of the area code, allow matches to have any of the 10 digits from 0 to 9 by placing the metacharacter d in each position. Metacharacter d means any digit from 0 to 9.
phone_rx=prxparse("/(?([2-9]dd))? s*-?s*([2-9]dd) s*-?s*(d{4})/");

Specify the same pattern between the area code and the prefix and the prefix and the extension. Allow matches to have one or zero hyphens (-?) between none or any number of spaces (s*). The asterisk (*) repitition factor metacharacter that follows the two space metacharacters (s) allows matches to have zero or more spaces.
phone_rx=prxparse("/(?([2-9]dd))? s*-?s*([2-9]dd)s*-?s* (d{4})/");

Specify the pattern of the prefix part of the telephone number. Allow matches to have only digits from 2 to 9 in the first position of the prefix. In the second and third positions of the prefix, allow matches to have any of the 10 digits from 0 to 9 by placing the metacharacter d in each position. Metacharacter d means any digit from 0 to 9.
phone_rx=prxparse("/(?([2-9]dd))? s*-?s* ([2-9]dd) s*-?s*(d{4})/");

Specify the pattern of the extension part of the telephone number. Require that a match must have four digits each ranging from 0 to 9. Indicate a digit from 0 to 9 with the d metacharacter. Indicate the exact number of digits by following the d metacharacter with the number of digits in curly brackets.

Table 9.5 describes the metacharacters in the EMAIL_RX regular expression that is associated with CONTACTEMAIL. The patterns that are defined in this regular expression are used to find valid e-mail addresses in the values of CONTACTEMAIL.

When a value of CONTACTEMAIL matches a pattern of EMAIL_RX, the value of CONTACTEMAIL remains unchanged.

When the value of CONTACTEMAIL does not match the patterns in the EMAIL_RX regular expression, the value is replaced with the text "**Invalid**.

The part of the regular expression that is being described is highlighted and enlarged.

Table 9.5. Description of Metacharacters in Regular Expression Associated with Variable CONTACTEMAIL
email_rx=prxparse("/(w+[.-w]*)@(w+[.-w]*).(gov|com|edu|net)/i");

Place the pattern to be matched between two slash characters. Indicate that case of the patterns is to be ignored by placing a lowercase 'i' after the last slash character.
email_rx=prxparse("/(w+[.-w]*)@(w+[.-w]*).(gov|com|edu|net)/i");

Group the contents of the regular expression by enclosing the metacharacters in the three parts of the e-mail address in parentheses. The first group is the user name. The second group is the subdomain name. The third group is the set of valid domains.
email_rx=prxparse("/(w+[.-w]*)@(w+[.-w]*).(gov|com|edu|net)/i");

Specify identically the pattern of the user name part and the subdomain part of the e-mail address. A match must start with at least one word character (w). A word character is any alphanumeric character plus the underscore (_). The plus sign (+) repetition factor metacharacter specifies the match must be made at least one time. The remainder of each part of the e-mail address can be missing or can be any combination of one or more periods (-), hyphens (-), and word characters (w). The asterisk (*) repetition factor metacharacter allows matches to have zero or more of these characters.
email_rx=prxparse("/(w+[.-w]*)@(w+[.-w]*).(gov|com|edu|net)/i");

Specify that a match must have an at sign (@) between the user name and subdomain parts of the e-mail address.
email_rx=prxparse("/(w+[.-w]*)@(w+[.-w]*).(gov|com|edu|net)/i");

Specify that a match must have a period (.) between the subdomain and domain parts of the e-mail address. To prevent the period from being interpreted as a metacharacter, precede it with the backslash escape character so that it is interpreted as text.
email_rx=prxparse("/(w+[.-w]*)@(w+[.-w]*).(gov|com|edu|net)/i");

Specify that a match must have one of four domain names. Separate the four choices with the vertical bar character (|) to indicate that they are options and one must be present in a match.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.26.138