To awk, a regular expression is a pattern consisting of characters enclosed in forward slashes. Awk supports the same regular expression metacharacters as egrep; they modify how the pattern is matched. If a string in the input line is matched by the regular expression, the resulting condition is true, and any actions associated with the expression are executed. If no action is specified and an input line is matched by the regular expression, the record is printed. See Table 5.6.
% awk '/Mary/' employees
Mary Adams 5346 11/4/63 28765
Explanation
Awk will display all lines in the employees file containing the regular expression pattern Mary.
% awk '/Mary/{print $1, $2}' employees
Mary Adams
Explanation
Awk will display the first and second fields of all lines in the employees file containing the regular expression pattern Mary.
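The default-action rule described above can be checked with a small sketch. It assumes a scratch copy of the employees data with single spaces between fields; the variable names `employees`, `implicit`, and `explicit` are illustrative and not part of the original examples.

```shell
# Scratch copy of the sample data (assumed layout: single spaces between fields).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# Pattern only: with no action given, awk prints the whole matching record.
implicit=$(awk '/Mary/' "$employees")

# The same pattern with the default action written out explicitly.
explicit=$(awk '/Mary/ { print $0 }' "$employees")

echo "$implicit"
echo "$explicit"
rm -f "$employees"
```

Both commands should print the same record, which is why the action can be left out when the whole line is wanted.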
Metacharacter | Meaning |
---|---|
^ | Matches at the beginning of a string |
$ | Matches at the end of a string |
. | Matches any single character |
* | Matches zero or more of the preceding character |
+ | Matches one or more of the preceding character |
? | Matches zero or one of the preceding character |
[ABC] | Matches any one character in the set of characters, i.e., A, B, or C |
[^ABC] | Matches any one character not in the set of characters, i.e., A, B, or C |
[A-Z] | Matches any one character in the range from A to Z |
A|B | Matches either A or B |
(AB)+ | Matches one or more sets of AB |
\* | Matches a literal asterisk |
& | Used in the replacement string to represent what was found in the search string (e.g., can be used with sub, gsub, etc.) |
A{m}, A{m,}, A{m,n} | Repetition of character A: exactly m times, at least m times, or between m and n times |
\y [a] | Matches an empty string at either the beginning or the end of a word |
\B | Matches an empty string within a word |
\< | Matches an empty string at the beginning of a word (also called the beginning-of-word anchor) |
\> | Matches an empty string at the end of a word (also called the end-of-word anchor) |
\w | Matches an alphanumeric word character |
\W | Matches a nonalphanumeric word character |
\` | Matches an empty string at the beginning of a string |
\' | Matches an empty string at the end of a string |
[a] All metacharacters from here to the end of the table are specific to gawk, not UNIX versions of awk.
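A few of the portable metacharacters from the table can be tried out with the sketch below. It assumes the same employees data, written to a temporary file with single spaces between fields; the variable names are illustrative only. It deliberately avoids the gawk-specific entries and the interval expressions, since not every awk implements them.

```shell
# Scratch copy of the sample data (assumed layout: single spaces between fields).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# ^ anchor, grouping, and alternation: records that begin with Tom or Sally.
alt=$(awk '/^(Tom|Sally) /' "$employees")

# ? makes the preceding character optional, so "Bill " or "Billy " would both match.
opt=$(awk '/^Billy? / { print $1 }' "$employees")

# & in the replacement string of sub() stands for the text the regex matched.
amp=$(awk '/^Mary/ { sub(/Adams/, "<&>"); print $1, $2 }' "$employees")

echo "$alt"
echo "$opt"
echo "$amp"
rm -f "$employees"
```

In the last command, sub() rewrites the current record, so the second field is printed with the angle brackets wrapped around the matched name.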
% awk '/^Mary/' employees
Mary Adams 5346 11/4/63 28765
Explanation
Awk will display all lines in the employees file that start with the regular expression Mary.
% awk '/^[A-Z][a-z]+ /' employees
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
Explanation
Awk will display all lines in the employees file that begin with an uppercase letter, followed by one or more lowercase letters, followed by a space.
POSIX (the Portable Operating System Interface) is an industry standard intended to ensure that programs are portable across operating systems. To that end, POSIX recognizes that countries or locales may differ in how characters are encoded, in their alphabets, in the symbols used to represent currency, and in how times and dates are represented. To handle different types of characters, POSIX added the bracketed character classes shown in Table 5.7 to the basic and extended regular expressions. Gawk supports these character classes, whereas awk and nawk do not.
The class [:alnum:] is another way of saying A-Za-z0-9. To be recognized as a regular expression, the class must be enclosed in another set of brackets. For example, A-Za-z0-9 by itself is not a regular expression, but [A-Za-z0-9] is. Likewise, [:alnum:] should be written [[:alnum:]]. The difference between the first form, [A-Za-z0-9], and the bracketed form, [[:alnum:]], is that the first depends on ASCII character encoding, whereas the second allows characters from other languages to be represented in the class, such as Swedish rings and German umlauts.
Bracket Class | Meaning |
---|---|
[:alnum:] | alphanumeric characters |
[:alpha:] | alphabetic characters |
[:cntrl:] | control characters |
[:digit:] | numeric characters |
[:graph:] | nonblank characters (not spaces, control characters, etc.) |
[:lower:] | lowercase letters |
[:print:] | like [:graph:], but includes the space character |
[:punct:] | punctuation characters |
[:space:] | all white-space characters (newlines, spaces, tabs) |
[:upper:] | uppercase letters |
[:xdigit:] | allows digits in a hexadecimal number (0-9a-fA-F) |
% awk '/[[:lower:]]+g[[:space:]]+[[:digit:]]/' employees
Sally Chang 1654 7/22/54 650000
Explanation
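Awk searches for one or more lowercase letters, followed by a g, followed by one or more white-space characters, followed by a digit. Only the Sally Chang record matches, because "Chang" ends in g and is followed by a space and the digit 1.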
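The bracketed classes in Table 5.7 can be combined like any other bracket expression. The following is a minimal sketch, assuming an awk that implements the POSIX classes (gawk does; some older awks may not) and a scratch copy of the employees data with single spaces between fields. The variable names are illustrative.

```shell
# Scratch copy of the sample data (assumed layout: single spaces between fields).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# One uppercase letter, then one or more lowercase letters, then white space,
# anchored at the start of the record. Prints the first field of each match.
names=$(awk '/^[[:upper:]][[:lower:]]+[[:space:]]/ { print $1 }' "$employees")

# A 4 with punctuation on both sides, i.e. the "/4/" inside a date.
april=$(awk '/[[:punct:]]4[[:punct:]]/ { print $1 }' "$employees")

echo "$names"
echo "$april"
rm -f "$employees"
```

The first command is the locale-aware counterpart of the earlier /^[A-Z][a-z]+ / example; the second matches only the record whose date contains /4/.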
The match operator, the tilde (~), is used to match an expression within a record or field.
% cat employees
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
% awk '$1 ~ /[Bb]ill/' employees
Billy Black 1683 9/23/44 336500
Explanation
Awk will display any lines matching Bill or bill in the first field.
% awk '$1 !~ /ly$/' employees
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Explanation
Awk will display any lines whose first field does not end in ly.
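The sketch below illustrates why the match operator matters: a bare /regex/ is tested against the whole record, whereas ~ and !~ test it against a chosen field. It assumes a scratch copy of the employees data with single spaces between fields, and the variable names are illustrative.

```shell
# Scratch copy of the sample data (assumed layout: single spaces between fields).
employees=$(mktemp)
cat > "$employees" <<'EOF'
Tom Jones 4424 5/12/66 543354
Mary Adams 5346 11/4/63 28765
Sally Chang 1654 7/22/54 650000
Billy Black 1683 9/23/44 336500
EOF

# Without ~, $ anchors to the end of the record; every record ends in a digit,
# so nothing matches.
whole=$(awk '/ly$/' "$employees")

# With ~, $ anchors to the end of the first field only.
first=$(awk '$1 ~ /ly$/ { print $1 }' "$employees")

# The operator works on any field: here the fourth field must end in /66.
born66=$(awk '$4 ~ /\/66$/ { print $1, $2 }' "$employees")

# !~ negates the test: second fields that do not start with A, B, or C.
others=$(awk '$2 !~ /^[A-C]/ { print $2 }' "$employees")

echo "$first"
echo "$born66"
echo "$others"
rm -f "$employees"
```

Note that the slash inside the regular expression for the date field is escaped with a backslash, since an unescaped slash would end the pattern.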