Example 9.17 Extracting a Character String without Breaking the Text in the Middle of a Word

Goal

Extract from a variable a character string that is no longer than a specified length and that does not end in the middle of a word.

Example Features

Featured Step: DATA step
Featured Step Options and Statements: ANYPUNCT, ANYSPACE, LARGEST, and SUBSTR functions

Input Data Set

Data set WORKPLACE has comments from six employees.

                          WORKPLACE

  Obs    comments
   1     Cannot discuss work-related issues with my supervisor.
   2     Need flex scheduling&job-sharing options.
   3     Add closer parking areas.
   4     Love the cafeteria
   5     More programs for career advancement
   6     Mentoring? Coaching? Either available?
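The listing above does not show how WORKPLACE was created. To run the example yourself, a DATA step along the following lines could build it. This is only a sketch: the length of 60 for COMMENTS is an assumption chosen to hold the longest comment (54 characters), not a value taken from the original data set.

```sas
/* Sketch: build WORKPLACE from the listing above.        */
/* Assumption: COMMENTS is defined with a length of 60.   */
data workplace;
  length comments $ 60;
  infile datalines truncover;
  input comments $60.;
  datalines;
Cannot discuss work-related issues with my supervisor.
Need flex scheduling&job-sharing options.
Add closer parking areas.
Love the cafeteria
More programs for career advancement
Mentoring? Coaching? Either available?
;
run;
```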

Resulting Data Set

Output 9.17 TRUNC_COMMENTS Data Set

                           Example 9.17 TRUNC_COMMENTS

Obs  comments                                                truncated

 1   Cannot discuss work-related issues with my supervisor.  Cannot discuss work-
 2   Need flex scheduling&job-sharing options.               Need flex scheduling&job-
 3   Add closer parking areas.                               Add closer parking areas.
 4   Love the cafeteria                                      Love the cafeteria
 5   More programs for career advancement                    More programs for career
 6   Mentoring? Coaching? Either available?                  Mentoring? Coaching?


Example Overview

This example demonstrates how to use several SAS language functions to extract a text string at a word boundary so that the length of the extracted text is less than or equal to a specified length and as close to that length as possible.

Data set WORKPLACE has comments from six employees. The goal is to extract a string up to 25 characters in length from the comment variable COMMENTS. The string should not end in the middle of a word.

The DATA step uses the ANYPUNCT and ANYSPACE functions to find the column at which to cut the text in COMMENTS. Both functions start at column 25 of COMMENTS and search backward toward the beginning of the value.

Because the search runs right to left, the ANYPUNCT function returns the position of the first punctuation character that it encounters at or to the left of column 25. The ANYSPACE function does the same for whitespace characters, such as blanks. Each function returns 0 if it finds no such character. The second argument to both functions is -25; the negative sign tells the functions to start at column 25 and work backward to column 1. Searching right to left from column 25 ensures that the maximum amount of text will be extracted.
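As a small illustration of the backward search, the following sketch applies both functions to the comment from observation 6. The DATA _NULL_ step and the length of 60 for COMMENTS are assumptions made for the demonstration and are not part of the example program.

```sas
/* Sketch: backward search on the comment from observation 6. */
data _null_;
  length comments $ 60;   /* assumed length */
  comments='Mentoring? Coaching? Either available?';
  colpunct=anypunct(comments,-25);   /* second "?" is in column 20 */
  colblank=anyspace(comments,-25);   /* blank after it is in column 21 */
  put colpunct= colblank=;
run;
```

Column 25 falls inside the word "Either", so both functions scan leftward past "Eith" before reaching the blank in column 21 and the question mark in column 20.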

The LARGEST function then picks the larger of the two values that are returned by the two functions. This value is the length of the extracted text and is specified as the third argument to the SUBSTR function.
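The way these pieces fit together can be traced on the comment from observation 1. This is a sketch only; the DATA _NULL_ step and the length of 60 for COMMENTS are assumptions for the demonstration.

```sas
/* Sketch: pick the cut column and extract, using observation 1. */
data _null_;
  length comments $ 60 truncated $ 25;   /* length 60 is assumed */
  comments='Cannot discuss work-related issues with my supervisor.';
  colpunct=anypunct(comments,-25);       /* hyphen in column 20 */
  colblank=anyspace(comments,-25);       /* blank in column 15  */
  cutcol=largest(1,colpunct,colblank);   /* larger of the two: 20 */
  truncated=substr(comments,1,cutcol);   /* "Cannot discuss work-" */
  put cutcol= truncated=;
run;
```

With only two values to compare, LARGEST(1, ...) returns the same result that the MAX function would.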

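The values of COMMENTS for the third and fourth observations are no longer than 25 characters (25 and 18 characters, respectively), so the DATA step assigns the full value of COMMENTS to TRUNCATED. In observation 3, the search stops immediately at the period in column 25. In observation 4, it stops at a blank in column 25, because SAS pads a character value with trailing blanks up to the defined length of the variable.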
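The short-value case can be checked with a sketch like the following, which uses the comment from observation 4. The length of 60 for COMMENTS is an assumption; any defined length of 25 or more would leave a padding blank in column 25.

```sas
/* Sketch: a value shorter than 25 characters (observation 4). */
data _null_;
  length comments $ 60 truncated $ 25;   /* length 60 is assumed */
  comments='Love the cafeteria';         /* 18 characters, then padding */
  colpunct=anypunct(comments,-25);       /* no punctuation found: 0 */
  colblank=anyspace(comments,-25);       /* padding blank in column 25 */
  cutcol=largest(1,colpunct,colblank);   /* 25 */
  truncated=substr(comments,1,cutcol);
  textlen=lengthn(truncated);            /* 18: trailing blanks not counted */
  put colpunct= colblank= cutcol= textlen=;
run;
```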

The punctuation characters that ANYPUNCT searches for are dependent on your operating system. In this example, the text for observation 2 breaks at the hyphen (-) punctuation character.

The whitespace characters that ANYSPACE searches for are dependent on your operating system. In addition to blanks, ANYSPACE searches for horizontal and vertical tabs, carriage returns, line feeds, and form feeds.

If you need more control over which characters count as break points than this example's code provides, you could use the FINDC function. For information about its usage, see SAS documentation.
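As a hedged sketch of the FINDC alternative, the following step breaks only at a blank, period, question mark, or exclamation point, so that the ampersand and hyphen in observation 2 no longer count as break points. The character list is a hypothetical choice for illustration, and the length of 60 for COMMENTS is an assumption.

```sas
/* Sketch: restrict the break characters with FINDC. */
data _null_;
  length comments $ 60 truncated $ 25;   /* length 60 is assumed */
  comments='Need flex scheduling&job-sharing options.';
  /* Hypothetical break set: blank . ? !                         */
  /* A negative start position searches right to left from 25.   */
  cutcol=findc(comments,' .?!',-25);     /* blank in column 10 */
  if cutcol > 0 then truncated=substr(comments,1,cutcol);
  put cutcol= truncated=;
run;
```

Here the text is cut after "Need flex" rather than after "job-", because the nearest allowed break character to the left of column 25 is the blank in column 10.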

For more information about functions such as ANYPUNCT and ANYSPACE, see "Understanding Functions That Evaluate the Content of Variable Values" in Example 9.2 of the "A Closer Look" section.

Program

Create data set TRUNC_COMMENTS. Read the observations in data set WORKPLACE. Define a new character variable whose length is the maximum length of the string to extract from COMMENTS.

Search right to left for the first punctuation character. Start the search in column 25.

Search right to left for the first whitespace character. Start the search in column 25.

Determine the length of the text string to extract from COMMENTS. Specify 1 as the first argument to LARGEST so that the larger of COLPUNCT and COLBLANK is saved in CUTCOL, ensuring the maximum amount of text will be extracted.

Extract text from COMMENTS that has the length of the value of CUTCOL.

data trunc_comments;
  set workplace;
  length truncated $ 25;
  keep comments truncated;

  colpunct=anypunct(comments,-25);
  colblank=anyspace(comments,-25);
  cutcol=largest(1,colpunct,colblank);
  truncated=substr(comments,1,cutcol);
run;
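The program relies on every comment having a blank or punctuation character somewhere in its first 25 columns, which is true for the six observations in WORKPLACE. If a value began with a single unbroken word longer than 25 characters, both search functions would return 0, and SUBSTR would receive a length of 0, which SAS is expected to flag as an invalid argument. The following variant is a sketch of one way to guard against that case. The output data set name and the fallback of cutting at column 25 are assumptions, not part of the original example.

```sas
/* Sketch: guard against values with no break point in columns 1-25. */
data trunc_comments_guarded;             /* hypothetical data set name */
  set workplace;
  length truncated $ 25;
  keep comments truncated;

  colpunct=anypunct(comments,-25);
  colblank=anyspace(comments,-25);
  cutcol=largest(1,colpunct,colblank);

  if cutcol > 0 then truncated=substr(comments,1,cutcol);
  /* Assumed fallback: no break point found, so keep the first 25    */
  /* characters. Assigning to a $ 25 variable truncates the value.   */
  else truncated=comments;
run;
```

Whether a hard cut, a missing value, or some other fallback is appropriate depends on how TRUNCATED will be used.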
