5.7. Regular Expressions

To awk, a regular expression is a pattern that consists of characters enclosed in forward slashes. Awk supports the same regular expression metacharacters as egrep; they are listed in Table 5.6. If any string in the input line is matched by the regular expression, the condition is true and any action associated with the expression is executed. If no action is specified, each record matched by the regular expression is printed.

Example 5.26.
% awk '/Mary/' employees
Mary Adams  5346   11/4/63  28765

Explanation

Awk will display all lines in the employees file containing the regular expression pattern Mary.

Example 5.27.
% awk '/Mary/{print $1, $2}' employees
Mary Adams

Explanation

Awk will display the first and second fields of all lines in the employees file containing the regular expression pattern Mary.
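The default action can be checked directly. The following is a minimal sketch, assuming a two-line sample file modelled on the employees file (the temporary file and the variable names are illustrative, not part of the original examples). It shows that a pattern with no action behaves like the same pattern with an explicit print $0.

```shell
# Hypothetical sample data modelled on the chapter's employees file.
tmp=$(mktemp)
printf '%s\n' \
  'Tom Jones 4424 5/12/66 543354' \
  'Mary Adams 5346 11/4/63 28765' > "$tmp"

# A pattern with no action prints every matching record in full.
implicit=$(awk '/Mary/' "$tmp")

# Spelling out the default action gives the same result.
explicit=$(awk '/Mary/ { print $0 }' "$tmp")

echo "$implicit"
echo "$explicit"
rm -f "$tmp"
```

Both commands print the single record for Mary Adams.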

Table 5.6. Regular Expression Metacharacters
Metacharacter           Meaning
^                       Matches at the beginning of a string
$                       Matches at the end of a string
.                       Matches any single character
*                       Matches zero or more of the preceding character
+                       Matches one or more of the preceding character
?                       Matches zero or one of the preceding character
[ABC]                   Matches any one character in the set of characters, i.e., A, B, or C
[^ABC]                  Matches any one character not in the set of characters, i.e., anything other than A, B, or C
[A-Z]                   Matches any one character in the range from A to Z
A|B                     Matches either A or B
(AB)+                   Matches one or more sets of AB
\*                      Matches a literal asterisk
&                       Used in the replacement string to represent what was found in the search string (e.g., can be used with sub, gsub, etc.)
A{m}, A{m,}, A{m,n}     Repetition of character A: exactly m times, at least m times, or between m and n times
\y [a]                  Matches an empty string at either the beginning or the end of a word
\B                      Matches an empty string within a word
\<                      Matches an empty string at the beginning of a word, also called the beginning-of-word anchor
\>                      Matches an empty string at the end of a word, also called the end-of-word anchor
\w                      Matches a word character (a letter, digit, or underscore)
\W                      Matches a nonword character (anything other than a letter, digit, or underscore)
\`                      Matches an empty string at the beginning of a string
\'                      Matches an empty string at the end of a string

[a] All metacharacters from here to the end of the table are specific to gawk, not UNIX versions of awk.
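A few of the metacharacters in Table 5.6 can be tried on one-line inputs. This is a small sketch that uses only the metacharacters above the gawk-specific part of the table. The input strings and variable names are made up for illustration.

```shell
# A|B: print lines containing either Tom or Mary.
alt=$(printf 'Tom\nMary\nSally\n' | awk '/Tom|Mary/')

# ?: the preceding character is optional, so color and colour both match.
opt=$(printf 'color\ncolour\ncolr\n' | awk '/^colou?r$/')

# (AB)+: one or more repetitions of the group AB, anchored at both ends.
grp=$(printf 'ABAB\nAAB\nBA\n' | awk '/^(AB)+$/')

# &: in the replacement string of sub, stands for the text that was matched.
amp=$(echo 'Mary Adams' | awk '{ sub(/Mary/, "[&]"); print }')

echo "$alt"
echo "$opt"
echo "$grp"
echo "$amp"
```

The last command prints [Mary] Adams, because & is replaced by the matched text Mary.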

Example 5.28.
% awk '/^Mary/' employees
Mary Adams  5346  11/4/63  28765

Explanation

Awk will display all lines in the employees file that start with the regular expression Mary.

Example 5.29.
% awk '/^[A-Z][a-z]+ /' employees
Tom Jones     4424    5/12/66    543354
Mary Adams    5346    11/4/63    28765
Sally Chang   1654    7/22/54    650000
Billy Black   1683    9/23/44    336500

Explanation

Awk will display all lines in the employees file that begin with an uppercase letter, followed by one or more lowercase letters, followed by a space.

The POSIX Character Class

POSIX (the Portable Operating System Interface) is an industry standard to ensure that programs are portable across operating systems. To be portable, POSIX recognizes that countries or locales may differ in the way characters are encoded, in their alphabets, in the symbols used to represent currency, and in how times and dates are represented. To handle different types of characters, POSIX added the bracketed character classes shown in Table 5.7 to the basic and extended regular expressions. Gawk supports these character classes, whereas awk and nawk do not.

The class [:alnum:] is another way of saying A-Za-z0-9. To be recognized as a regular expression, the class must be enclosed in another set of brackets. For example, A-Za-z0-9 by itself is not a regular expression, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first form depends on ASCII character encoding, whereas the second form allows characters from other languages to be represented in the class, such as Swedish rings and German umlauts.

Table 5.7. Bracketed Character Class Added by POSIX
Bracket Class   Meaning
[:alnum:]       alphanumeric characters
[:alpha:]       alphabetic characters
[:cntrl:]       control characters
[:digit:]       numeric characters
[:graph:]       nonblank characters (not spaces, control characters, etc.)
[:lower:]       lowercase letters
[:print:]       like [:graph:], but includes the space character
[:punct:]       punctuation characters
[:space:]       all white-space characters (newlines, spaces, tabs)
[:upper:]       uppercase letters
[:xdigit:]      allows digits in a hexadecimal number (0-9a-fA-F)

Example 5.30.
% awk '/[[:lower:]]+g[[:space:]]+[[:digit:]]/' employees
Sally Chang 1654 7/22/54 650000

Explanation

Awk searches for one or more lowercase letters, followed by a g, followed by one or more white-space characters, followed by a digit.
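The range-based pattern of Example 5.29 can also be written with POSIX classes. The following is a minimal sketch, assuming an awk that understands bracketed classes (gawk does; as noted above, older versions of awk and nawk may not). The three input lines and the variable name are made up for illustration.

```shell
# Hypothetical input: only the first line starts with an uppercase letter
# followed by one or more lowercase letters and a white-space character.
posix=$(printf 'Tom Jones 4424\nsally chang 1654\nX 1683\n' |
  awk '/^[[:upper:]][[:lower:]]+[[:space:]]/')

echo "$posix"
```

Only the line beginning with Tom is printed: the second line starts with a lowercase letter, and the third has no lowercase letters after the X.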

5.7.1. The Match Operator

The match operator, the tilde (~), is used to test whether a record or field matches a regular expression. Its negated form, !~, is true when the record or field does not match.

Example 5.31.
% cat employees
Tom Jones     4424      5/12/66    543354
Mary Adams    5346      11/4/63    28765
Sally Chang   1654      7/22/54    650000
Billy Black   1683      9/23/44    336500

% awk '$1 ~ /[Bb]ill/' employees
Billy Black   1683      9/23/44    336500

Explanation

Awk will display any lines whose first field contains Bill or bill.

Example 5.32.
% awk '$1 !~ /ly$/' employees
Tom Jones      4424     5/12/66     543354
Mary Adams     5346     11/4/63     28765

Explanation

Awk will display any lines whose first field does not end in ly.
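The effect of tying a pattern to a field can be seen by running the same regular expression against the first field and against the whole record. This is a sketch that assumes a temporary copy of the employees data with single spaces between fields. The variable names are illustrative.

```shell
# Hypothetical copy of the employees data used in this section.
tmp=$(mktemp)
printf '%s\n' \
  'Tom Jones 4424 5/12/66 543354' \
  'Mary Adams 5346 11/4/63 28765' \
  'Sally Chang 1654 7/22/54 650000' \
  'Billy Black 1683 9/23/44 336500' > "$tmp"

# Tested against $1, the $ anchor means the end of the first field.
field_match=$(awk '$1 ~ /ly$/ { print $1 }' "$tmp")

# With no field given, the pattern is tested against the whole record,
# so $ means the end of the line and nothing matches in this data.
record_match=$(awk '/ly$/ { print $1 }' "$tmp")

# !~ selects the records for which the match fails.
no_match=$(awk '$1 !~ /ly$/ { print $1 }' "$tmp")

echo "$field_match"
echo "$no_match"
rm -f "$tmp"
```

The first command prints Sally and Billy, the last prints Tom and Mary, and the whole-record version prints nothing because every line ends in a number.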
