While it is assumed that the reader is familiar with regular expressions, it is useful to review. This will ensure that the terminology is understood, and it may encourage you to use features that you've not been using.
Like shell patterns, regular expressions match on a character-by-character basis unless a meta-character is encountered in the pattern. Regular expressions have more meta-characters than shell patterns, which makes them more powerful. It also makes them more difficult to master.
When searching for text within a file, it is often necessary to use anchors. An anchor is a meta-character that ties a pattern to a position in the text instead of matching a character. Regular expressions define two anchors:
The beginning | ^ |
The end | $ |
The anchors may be attached to the beginning and end of a line or to the beginning and end of a string. The context of the anchor depends on the application.
The egrep(1) command uses regular expressions and can be used to illustrate. In the following example, only those lines that start with the letters ftp are displayed from the file /etc/services:
$ egrep '^ftp' /etc/services
ftp-data 20/tcp #File Transfer [Default Data]
ftp-data 20/udp #File Transfer [Default Data]
ftp 21/tcp #File Transfer [Control]
ftp 21/udp #File Transfer [Control]
ftp-agent 574/tcp #FTP Software Agent System
ftp-agent 574/udp #FTP Software Agent System
$
The egrep(1) pattern '^ftp' causes lines starting with ftp to be selected. The regular expression used here is ^ftp. The ^ anchor indicates that the pattern match can only succeed if ftp starts the text line. Without the anchor, other lines would have matched, including, for example, lines starting with tftp.
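The effect of the ^ anchor can be checked without depending on the contents of /etc/services. The following is a minimal sketch: the sample lines are made up for illustration, and grep -E is used as the equivalent of egrep(1).

```shell
# Made-up sample lines standing in for /etc/services.
sample='ftp 21/tcp
tftp 69/udp
ftp-data 20/tcp
sftp 115/tcp'

# With the ^ anchor, ftp must start the line.
anchored=$(printf '%s\n' "$sample" | grep -E '^ftp')

# Without the anchor, ftp may appear anywhere on the line.
unanchored=$(printf '%s\n' "$sample" | grep -E 'ftp')

printf 'anchored:\n%s\n\nunanchored:\n%s\n' "$anchored" "$unanchored"
```

The anchored search selects only the ftp and ftp-data lines, while the unanchored search also selects the tftp and sftp lines.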
The next example matches lines ending with the text system:
$ egrep 'system$' /etc/services
# 24/tcp any private mail system
# 24/udp any private mail system
remotefs 556/tcp rfs rfs_server # Brunhoff remote filesystem
remotefs 556/udp rfs rfs_server # Brunhoff remote filesystem
mshnet 1989/tcp #MHSnet system
mshnet 1989/udp #MHSnet system
$
The $ anchor causes the pattern system$ to succeed only when the pattern ends at the end of the line. The anchors can also be used together:
$ egrep '^#$' /etc/services
#
#
#
#
$
In this example, the anchors in the pattern ^#$ were used to select only those lines in which # is the only character on the line. In basic regular expressions, the ^ and $ anchors lose their special meaning when used in places other than the beginning and end of a pattern. For example, the basic pattern $#^ has no meta-characters in it. Extended regular expressions, as used by egrep(1), treat ^ and $ as anchors wherever they appear outside a set.
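A short sketch of the $ anchor, alone and combined with ^, follows. The sample text is invented for illustration, and grep -E stands in for egrep(1); the -c option prints a count of matching lines instead of the lines themselves.

```shell
# Made-up sample text; the second line starts with "system" but does not end with it.
sample='# any private mail system
system monitor 9999/tcp
#
# comment
#'

# The $ anchor: the text system must end the line.
ends_with=$(printf '%s\n' "$sample" | grep -E 'system$')

# Both anchors: # must be the only character on the line (count the matches).
only_hash=$(printf '%s\n' "$sample" | grep -E -c '^#$')

printf 'ends with system: %s\n' "$ends_with"
printf 'lines holding only #: %s\n' "$only_hash"
```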
A set is a collection of characters between the meta-characters [ and ]. Sets work the same as they do in shell patterns. The following egrep(1) command shows a set of two characters:
$ egrep '^[tm]ftp' /etc/services
tftp 69/tcp #Trivial File Transfer
tftp 69/udp #Trivial File Transfer
mftp 349/tcp
mftp 349/udp
$
The first character on the line matches a t or m from the specified set [tm] in the regular expression.
When the character ^ occurs as the first character of the set, it becomes a meta-character. It reverses the sense of the set. For example the pattern [^tm] matches any character except t or m. If the ^ character occurs in any other place within the set, it is not special. For example, the pattern [tm^] matches the characters t, m, or ^.
To include the ] character within the set, make it the first character of the set (or immediately following the ^ character). The following example searches for a line that starts with <abc> or [abc].
$ egrep '^[[<]abc[]>]' file
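The set rules described above can be tried against a small word list. This is a sketch under stated assumptions: the words are made up for illustration, and grep -E is used as the equivalent of egrep(1).

```shell
# Made-up input, one candidate per line.
words='tftp
mftp
sftp
^ftp
[abc]
<abc>
(abc)'

# A plain set: the first character must be t or m.
in_set=$(printf '%s\n' "$words" | grep -E '^[tm]ftp')

# A leading ^ reverses the set: any first character except t or m.
negated=$(printf '%s\n' "$words" | grep -E '^[^tm]ftp')

# A ^ elsewhere in the set is an ordinary member of the set.
caret=$(printf '%s\n' "$words" | grep -E '^[tm^]ftp')

# A ] placed first in the set is an ordinary member of the set.
bracket=$(printf '%s\n' "$words" | grep -E '^[[<]abc[]>]')

printf '%s\n--\n%s\n--\n%s\n--\n%s\n' "$in_set" "$negated" "$caret" "$bracket"
```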
A range is an extension of the set idea. A range is specified within the meta-characters [ and ] and has the hyphen character used between the extremes. For example, the range pattern [A-Z] specifies the set of all uppercase letters.
Ranges can be grouped together. For example, the range [A-Za-z] allows you to select any letter, without regard to case. They may also be combined with sets. The range pattern [A-Z01] will match any uppercase character or the digits 0 or 1.
As with sets, the ^ character reverses the sense of a range if it occurs as the first character. For example, the pattern [^A-Z] matches any character except uppercase alphabetic characters.
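The following sketch exercises the range patterns just described. The sample lines are invented, grep -E stands in for egrep(1), and LC_ALL=C is set as a precaution because the ordering used by ranges can depend on the locale.

```shell
# Made-up input lines.
sample='Alpha
beta
9lives
0day
_tmp'

# Helper for this example only: run grep -E in the C locale on the sample.
match() { printf '%s\n' "$sample" | LC_ALL=C grep -E "$1"; }

upper=$(match '^[A-Z]')        # starts with an uppercase letter
letter=$(match '^[A-Za-z]')    # two ranges grouped together
mixed=$(match '^[A-Z01]')      # a range combined with the set 01
not_upper=$(match '^[^A-Z]')   # reversed range

printf '%s\n--\n%s\n--\n%s\n--\n%s\n' "$upper" "$letter" "$mixed" "$not_upper"
```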
Regular expressions also include character classes. These use the meta-character pair [: and :] and appear inside a set. An example of a character class is [:digit:], which represents any numeric digit; the set [[:digit:]] therefore matches a single digit. Valid class names are as follows and are listed in ctype(3):
alnum | digit | punct |
alpha | graph | space |
blank | lower | upper |
cntrl | print | xdigit |
These class names correspond to the ctype(3) macros isalnum(3), isdigit(3), ispunct(3), and so on.
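A character class is written inside a set, so a pattern that matches one digit is [[:digit:]]. The sketch below shows a few classes in use; the sample lines are made up, and grep -E is used as the equivalent of egrep(1).

```shell
# Made-up input lines; the third line begins with a space.
sample='21/tcp
ftp
 indented
X11'

# The line starts with a digit.
digit_start=$(printf '%s\n' "$sample" | grep -E '^[[:digit:]]')

# The line starts with white space.
space_start=$(printf '%s\n' "$sample" | grep -E '^[[:space:]]')

# An uppercase letter followed by a digit.
upper_digit=$(printf '%s\n' "$sample" | grep -E '^[[:upper:]][[:digit:]]')

printf '%s\n--\n%s\n--\n%s\n' "$digit_start" "$space_start" "$upper_digit"
```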
The . meta-character matches any single character. The following example shows a pattern in which any first character is accepted as a match:
$ egrep '^.ftp' /etc/services
tftp 69/tcp #Trivial File Transfer
tftp 69/udp #Trivial File Transfer
sftp 115/tcp #Simple File Transfer Protocol
sftp 115/udp #Simple File Transfer Protocol
bftp 152/tcp #Background File Transfer Program
bftp 152/udp #Background File Transfer Program
mftp 349/tcp
mftp 349/udp
$
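The . meta-character always consumes exactly one character, which the following sketch makes visible. The sample lines are invented for illustration, and grep -E stands in for egrep(1).

```shell
# Made-up input: zero, one, and two characters before ftp.
sample='ftp 21/tcp
tftp 69/tcp
sftp 115/tcp
xyftp 1/tcp'

# Exactly one character, whatever it is, must come before ftp.
one_before=$(printf '%s\n' "$sample" | grep -E '^.ftp')

printf '%s\n' "$one_before"
```

Only the tftp and sftp lines are selected: the ftp line has no character before ftp, and the xyftp line has two.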
A regular expression can be included within the parenthesis characters ( and ), which perform a grouping function. The following egrep(1) command illustrates a simple example:
$ egrep '^[tm](ftp)' /etc/services
tftp 69/tcp #Trivial File Transfer
tftp 69/udp #Trivial File Transfer
mftp 349/tcp
mftp 349/udp
$
Parenthesized matches cause substrings to be extracted from a matching operation. This and other uses of parentheses will become clearer as the chapter progresses.
An atom is a unit that participates in pattern matching. The following are atoms within regular expressions:
Any single non–meta-character
A single anchor (^ or $)
A set (such as [abc])
A range (such as [A-Z])
A character class (such as [:digit:])
A parenthesized match (such as (abc[de]))
Atoms are important to understanding how a piece works in regular expressions.
A piece is an atom followed by the meta-character *, +, or ?. These meta-characters influence the matching process in the following ways:
* | Matches zero or more atoms |
+ | Matches one or more atoms |
? | Matches zero or one atom |
The pattern A* will match any of the following:
"" | Null string |
A | One A |
AA | Two As |
AAA | Three As |
The pattern A+ insists that at least one A be matched. Alternatively, the pattern A? matches the null string or a single A character.
The pattern (abc)+ shows a parenthesized expression. This pattern matches any of the following:
abc | The + matches one () expression. |
abcabc | The + matches two () expressions. |
abcabcabc | The + matches three () expressions, and so on for any number. |
The possibilities are nearly endless when you include sets and ranges within the parentheses.
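The pieces described above can be compared side by side. In this sketch the input is made up (its first line is deliberately empty), grep -E stands in for egrep(1), the -x option requires the pattern to match the whole line, and -c prints a count of matching lines.

```shell
# Made-up input; the first line is empty (the null string).
lines='
A
AA
B
abc
abcabc
abcab'

# Count the lines that consist entirely of the given piece.
zero_or_more=$(printf '%s\n' "$lines" | grep -E -x -c 'A*')
one_or_more=$(printf '%s\n' "$lines" | grep -E -x -c 'A+')
zero_or_one=$(printf '%s\n' "$lines" | grep -E -x -c 'A?')

# A parenthesized atom repeated one or more times.
groups=$(printf '%s\n' "$lines" | grep -E -x '(abc)+')

printf 'A*: %s  A+: %s  A?: %s\n' "$zero_or_more" "$one_or_more" "$zero_or_one"
printf '%s\n' "$groups"
```

A* accepts the empty line, A, and AA; A+ rejects the empty line; A? rejects AA. The line abcab is not selected by (abc)+ because the final group is incomplete.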
A branch of a regular expression is a pattern component that is separated by the pipe symbol |. It is used to specify alternative patterns to be matched. The following example shows two branches in the pattern:
$ egrep '^ftp|^telnet' /etc/services
ftp-data 20/tcp #File Transfer [Default Data]
ftp-data 20/udp #File Transfer [Default Data]
ftp 21/tcp #File Transfer [Control]
ftp 21/udp #File Transfer [Control]
telnet 23/tcp
telnet 23/udp
ftp-agent 574/tcp #FTP Software Agent System
ftp-agent 574/udp #FTP Software Agent System
telnets 992/tcp
$
The example selects those lines that begin with the text ftp or telnet. Branches can also be used within parenthesized subexpressions, and a subexpression as a whole can be modified by a following operator:
$ egrep '^ftp(-agent)?' /etc/services
ftp-data 20/tcp #File Transfer [Default Data]
ftp-data 20/udp #File Transfer [Default Data]
ftp 21/tcp #File Transfer [Control]
ftp 21/udp #File Transfer [Control]
ftp-agent 574/tcp #FTP Software Agent System
ftp-agent 574/udp #FTP Software Agent System
$
In this example, the line must start with the letters ftp. The subexpression (-agent) matches the text -agent. It is modified, however, by the ? operator that follows it, which says that zero or one occurrence of the subexpression must match. Because nothing anchors the end of the pattern, every line that starts with ftp is selected, including those for ftp, ftp-data, and ftp-agent.
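The sketch below contrasts top-level branches, branches inside a parenthesized subexpression, and an optional subexpression. The sample lines are made up, and grep -E is used as the equivalent of egrep(1). The trailing space in the last two patterns is an addition for this example: it pins down where the service name ends.

```shell
# Made-up sample lines standing in for /etc/services.
sample='ftp 21/tcp
ftp-data 20/tcp
telnet 23/tcp
tftp 69/udp
ftps 990/tcp'

# Two branches, each with its own anchor.
either=$(printf '%s\n' "$sample" | grep -E '^ftp|^telnet')

# Branches inside a subexpression share one anchor; the space ends the name.
grouped=$(printf '%s\n' "$sample" | grep -E '^(ftp|telnet) ')

# An optional subexpression: ftp or ftp-data, then a space.
optional=$(printf '%s\n' "$sample" | grep -E '^ftp(-data)? ')

printf '%s\n--\n%s\n--\n%s\n' "$either" "$grouped" "$optional"
```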
You have already seen how the *, +, and the ? meta-characters affect the preceding atom. It is also possible to specify a bound instead. A bound consists of an opening brace character ({), an unsigned integer, a comma (,), another unsigned integer, and a closing brace (}). The fully specified bound {2,5} indicates that at least 2 atoms must match but no more than 5.
The second component of the bound is optional. For example, a bound of the form {3} indicates that exactly 3 matches must be made.
A bound may also be specified with a missing second count. For example, the bound {2,} specifies that 2 or more matches can be made.
The valid range for unsigned integers is between 0 and the value RE_DUP_MAX (which is 255 on most platforms). The following example demonstrates how to select those lines with a 6 followed by at least three zeros (the egrep(1) option -E is required to enable the bounds feature):
$ egrep -E '60{3,}' /etc/services
netviewdm1 729/tcp #IBM NetView DM/6000 Server/Client
netviewdm1 729/udp #IBM NetView DM/6000 Server/Client
netviewdm2 730/tcp #IBM NetView DM/6000 send/tcp
netviewdm2 730/udp #IBM NetView DM/6000 send/tcp
netviewdm3 731/tcp #IBM NetView DM/6000 receive/tcp
netviewdm3 731/udp #IBM NetView DM/6000 receive/tcp
#x11 6000-6063/tcp X Window System
#x11 6000-6063/udp X Window System
$
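The three forms of a bound can be compared on a small input. This is a sketch with made-up sample lines; grep -E stands in for egrep(1), and the -x option, used in the first two searches, requires the pattern to match the whole line.

```shell
# Made-up input lines.
sample='60
600
6000
60000
16000-6063'

# Exactly three zeros after the 6, and nothing else on the line.
exact=$(printf '%s\n' "$sample" | grep -E -x '60{3}')

# Two or three zeros after the 6, and nothing else on the line.
between=$(printf '%s\n' "$sample" | grep -E -x '60{2,3}')

# A 6 followed by at least three zeros, anywhere on the line.
at_least=$(printf '%s\n' "$sample" | grep -E '60{3,}')

printf '%s\n--\n%s\n--\n%s\n' "$exact" "$between" "$at_least"
```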
Given the number of meta-characters used in regular expressions, it is often necessary to quote meta-characters to remove their special meaning. The quote character used in regular expressions is the backslash (\) character. Any character that follows this backslash is interpreted literally; it is not treated as a meta-character.
For example, if you want to match a pattern that includes parentheses, you need to quote the parenthesis characters. The expression \(abc\) matches the string (abc).
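The difference that quoting makes can be seen in the following sketch. The sample lines are invented for illustration, grep -E stands in for egrep(1), and -x requires the pattern to match the whole line. The patterns are placed in single quotes so that the shell passes the backslashes through unchanged.

```shell
# Made-up input lines.
sample='(abc)
abc
a.c
abc$'

# Quoted parentheses match literal ( and ) characters.
quoted_parens=$(printf '%s\n' "$sample" | grep -E -x '\(abc\)')

# Unquoted parentheses only group; the line must be exactly abc.
grouping=$(printf '%s\n' "$sample" | grep -E -x '(abc)')

# A quoted . matches only a literal period.
quoted_dot=$(printf '%s\n' "$sample" | grep -E 'a\.c')

# An unquoted . matches any single character between a and c.
any_char=$(printf '%s\n' "$sample" | grep -E 'a.c')

# A quoted $ matches a literal dollar sign instead of anchoring.
quoted_dollar=$(printf '%s\n' "$sample" | grep -E 'abc\$')

printf '%s\n--\n%s\n--\n%s\n--\n%s\n' "$quoted_parens" "$grouping" "$quoted_dot" "$quoted_dollar"
```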