Understanding Regular Expressions

While it is assumed that the reader is familiar with regular expressions, it is useful to review. This will ensure that the terminology is understood, and it may encourage you to use features that you've not been using.

Like shell patterns, regular expressions match on a character-by-character basis unless a meta-character is encountered in the pattern. Regular expressions have more meta-characters than shell patterns, which makes them more powerful. It also makes them more difficult to master.

Anchors

When searching for text within a file, it is often necessary to use anchors. An anchor is a meta-character that can cause a pattern to be attached to another entity. Regular expressions define two anchors:

The beginning ^
The end $

The anchors may be attached to the beginning and end of a line or to the beginning and end of a string. The context of the anchor depends on the application.

The egrep(1) command uses regular expressions and can be used to illustrate. In the following example, only those lines that start with the letters ftp are displayed from the file /etc/services:

$ egrep '^ftp' /etc/services
ftp-data         20/tcp    #File Transfer [Default Data]
ftp-data         20/udp    #File Transfer [Default Data]
ftp              21/tcp    #File Transfer [Control]
ftp              21/udp    #File Transfer [Control]
ftp-agent       574/tcp    #FTP Software Agent System
ftp-agent       574/udp    #FTP Software Agent System
$

The egrep(1) pattern '^ftp' causes lines starting with ftp to be selected. The regular expression used here is ^ftp. The ^ anchor indicates that the pattern match can only succeed if ftp starts the text line. Without the anchor, other lines would have matched, including, for example, lines starting with tftp.

The next example matches lines ending with the text system:

$ egrep 'system$' /etc/services
#                24/tcp    any private mail system
#                24/udp    any private mail system
remotefs        556/tcp    rfs rfs_server       # Brunhoff remote filesystem
remotefs        556/udp    rfs rfs_server       # Brunhoff remote filesystem
mshnet          1989/tcp   #MHSnet system
mshnet          1989/udp   #MHSnet system
$

The $ anchor causes the pattern system$ to succeed only when the pattern ends at the end of the line. The anchors can also be used together:

$ egrep '^#$' /etc/services
#
#
#
#
$

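In this example, the anchors in the pattern ^#$ were used to select only those lines in which # is the only character on the line. In basic regular expressions, such as those used by grep(1) without the -E option, the ^ and $ anchors lose their special meaning when used in places other than the beginning and end of a pattern. For example, the basic pattern $#^ has no meta-characters in it. In the extended regular expressions used by egrep(1), ^ and $ remain anchors wherever they appear outside a set, so the same pattern can never match.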
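The anchor behavior described above can be tried without depending on the contents of /etc/services. The following is a minimal sketch: the sample lines are invented for illustration, and grep -E is used as the equivalent of egrep(1).

```shell
# Sample input invented for this sketch; it is not taken from /etc/services.
sample=$(printf '%s\n' 'ftp 21/tcp' 'tftp 69/udp' '#' '# file system' 'mail system')

# ^ anchors the match to the start of the line, so tftp is rejected.
starts=$(printf '%s\n' "$sample" | grep -E '^ftp')

# $ anchors the match to the end of the line.
ends=$(printf '%s\n' "$sample" | grep -E 'system$')

# Both anchors together: the line must consist of a single # character.
only=$(printf '%s\n' "$sample" | grep -E '^#$')

printf '%s\n' "$starts" "$ends" "$only"
```

Only the line ftp 21/tcp is selected by the first pattern, the two lines ending in system by the second, and the lone # by the third.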

Sets

A set is a collection of characters between the meta-characters [ and ]. Sets work the same as they do in shell patterns. The following egrep(1) command shows a set of two characters:

$ egrep '^[tm]ftp' /etc/services
tftp             69/tcp    #Trivial File Transfer
tftp             69/udp    #Trivial File Transfer
mftp            349/tcp
mftp            349/udp
$

The first character on the line matches a t or m from the specified set [tm] in the regular expression.

When the character ^ occurs as the first character of the set, it becomes a meta-character. It reverses the sense of the set. For example the pattern [^tm] matches any character except t or m. If the ^ character occurs in any other place within the set, it is not special. For example, the pattern [tm^] matches the characters t, m, or ^.
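The three forms of set discussed so far can be compared side by side. This is a small sketch using invented sample lines (including one that begins with a literal ^), with grep -E standing in for egrep(1).

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' 'tftp' 'mftp' 'sftp' '^ftp')

# [tm] matches a t or an m in the first position.
in_set=$(printf '%s\n' "$sample" | grep -E '^[tm]ftp')

# [^tm] reverses the sense: any first character except t or m.
negated=$(printf '%s\n' "$sample" | grep -E '^[^tm]ftp')

# [tm^] treats ^ as an ordinary member because it is not first in the set.
literal=$(printf '%s\n' "$sample" | grep -E '^[tm^]ftp')

printf '%s\n' "$in_set" "$negated" "$literal"
```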

To include the ] character within the set, make it the first character of the set (or immediately following the ^ character). The following example searches for a line that starts with <abc> or [abc].

$ egrep '^[[<]abc[]>]' file

Range

A range is an extension of the set idea. A range is specified within the meta-characters [ and ] and has the hyphen character used between the extremes. For example, the range pattern [A-Z] specifies the set of all uppercase letters.

Ranges can be grouped together. For example, the range [A-Za-z] allows you to select any letter, without regard to case. They may also be combined with sets. The range pattern [A-Z01] will match any uppercase character or the digits 0 or 1.

Like sets, the ^ character reverses the sense of the set if it occurs as the first character. For example, the pattern [^A-Z] matches any character except uppercase alphabetic characters.
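The range patterns above can be checked against a few invented lines. In this sketch, LC_ALL=C is an added assumption: ranges follow the collating order of the current locale, and the C locale keeps [A-Z] limited to the ASCII uppercase letters.

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' 'Alpha' 'beta' '0day' '9lives' '_x')

# A single range: lines starting with an uppercase letter.
upper=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z]')

# Two ranges grouped together: any letter, without regard to case.
letter=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Za-z]')

# A range combined with a set: an uppercase letter or the digits 0 or 1.
mixed=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z01]')

# A reversed range: anything except an uppercase letter.
not_upper=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[^A-Z]')

printf '%s\n' "$upper" "$letter" "$mixed" "$not_upper"
```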

Character Classes

Regular expressions also include character classes. These use the meta-character pair [: and :] and appear within a set. An example of a character class is [:digit:], which represents any numeric digit, so the pattern [[:digit:]] matches a single digit. Valid class names are as follows and are listed in ctype(3):

alnum digit punct
alpha graph space
blank lower upper
cntrl print xdigit

These class names correspond to the ctype(3) macros isalnum(3), isdigit(3), ispunct(3), and so on.
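As a brief illustration, character classes are written inside the brackets of a set, which is why the patterns below contain doubled brackets. The sample lines are invented, and grep -E is used in place of egrep(1).

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' 'port 21' 'no digits here' 'ABC')

# [[:digit:]] is a set containing the digit class.
has_digit=$(printf '%s\n' "$sample" | grep -E '[[:digit:]]')

# The whole line must consist of uppercase letters.
all_upper=$(printf '%s\n' "$sample" | grep -E '^[[:upper:]]+$')

# Two classes can share one set: letters or white space only.
words=$(printf '%s\n' "$sample" | grep -E '^[[:alpha:][:space:]]+$')

printf '%s\n' "$has_digit" "$all_upper" "$words"
```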

The . Meta-Character

The . meta-character matches any single character. The following example shows a pattern in which any first character is accepted as a match:

$ egrep '^.ftp' /etc/services
tftp             69/tcp    #Trivial File Transfer
tftp             69/udp    #Trivial File Transfer
sftp            115/tcp    #Simple File Transfer Protocol
sftp            115/udp    #Simple File Transfer Protocol
bftp            152/tcp    #Background File Transfer Program
bftp            152/udp    #Background File Transfer Program
mftp            349/tcp
mftp            349/udp
$

Parenthesized Match Subexpression

A regular expression can be included within the parenthesis characters ( and ), which perform a grouping function. The following egrep(1) command illustrates a simple example:

$ egrep '^[tm](ftp)' /etc/services
tftp             69/tcp    #Trivial File Transfer
tftp             69/udp    #Trivial File Transfer
mftp            349/tcp
mftp            349/udp
$

Parenthesized matches cause substrings to be extracted from a matching operation. This and other uses of the parenthesis will become clearer as the chapter progresses.
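The egrep(1) command only selects lines, so it cannot show the extracted substrings. One way to see them from the shell is with sed(1), which is not used elsewhere in this section. The sketch below assumes a sed that accepts the -E option for extended regular expressions (GNU and BSD versions do), and the sample lines are invented.

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' 'tftp 69/tcp' 'mftp 349/tcp' 'sftp 115/tcp')

# -n with the p flag prints only lines where the substitution succeeded.
# \1 is the text matched by ([tm]); \2 is the text matched by (ftp).
extracted=$(printf '%s\n' "$sample" |
    sed -n -E 's/^([tm])(ftp).*/first=\1 second=\2/p')

printf '%s\n' "$extracted"
```

The sftp line produces no output because its first character is not in the set [tm].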

Atoms

An atom is a unit that participates in pattern matching. The following are atoms within regular expressions:

  • Any single non–meta-character

  • A single anchor (^ or $)

  • A set (such as [abc])

  • A range (such as [A-Z])

  • A character class (such as [:digit:])

  • A parenthesized match (such as (abc[de]))

Atoms are important to understanding how a piece works in regular expressions.

Piece

A piece is an atom followed by the meta-character *, +, or ?. These meta-characters influence the matching process in the following ways:

* Matches zero or more atoms
+ Matches one or more atoms
? Matches zero or one atom

The pattern A* will match any of the following:

"" Null string
A One A
AA Two As
AAA Three As

The pattern A+ insists that at least one A be matched. Alternatively, the pattern A? matches the null string or a single A character.

The pattern (abc)+ shows a parenthesized expression. This pattern matches any of the following:

abc The + matches one () expression.
abcabc The + matches two () expressions.
abcabcabc The + matches three () expressions, and so on for any number of repetitions.

The possibilities are nearly endless when you include sets and ranges within the parentheses.
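The effect of each piece operator is easiest to see when the pattern is anchored at both ends. The following sketch uses invented sample lines and a trailing B so that the zero-occurrence case is still a visible, non-empty line.

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' 'B' 'AB' 'AAB' 'abc' 'abcabc' 'ab')

# A* allows zero or more As before the B.
star=$(printf '%s\n' "$sample" | grep -E '^A*B$')

# A+ insists on at least one A.
plus=$(printf '%s\n' "$sample" | grep -E '^A+B$')

# A? allows zero or one A.
opt=$(printf '%s\n' "$sample" | grep -E '^A?B$')

# The + applies to the whole parenthesized atom (abc).
group=$(printf '%s\n' "$sample" | grep -E '^(abc)+$')

printf '%s\n' "$star" "$plus" "$opt" "$group"
```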

Branch

A branch of a regular expression is a pattern component that is separated by the pipe symbol |. It is used to specify alternative patterns to be matched. The following example shows two branches in the pattern:

$ egrep '^ftp|^telnet' /etc/services
ftp-data         20/tcp    #File Transfer [Default Data]
ftp-data         20/udp    #File Transfer [Default Data]
ftp              21/tcp    #File Transfer [Control]
ftp              21/udp    #File Transfer [Control]
telnet           23/tcp
telnet           23/udp
ftp-agent       574/tcp    #FTP Software Agent System
ftp-agent       574/udp    #FTP Software Agent System
telnets         992/tcp
$

The example selects those lines that begin with the text ftp or telnet. Branches can also be used within parenthesized subexpressions, and a parenthesized subexpression can itself be followed by *, +, or ?:

$ egrep '^ftp(-agent)?' /etc/services
ftp-data         20/tcp    #File Transfer [Default Data]
ftp-data         20/udp    #File Transfer [Default Data]
ftp              21/tcp    #File Transfer [Control]
ftp              21/udp    #File Transfer [Control]
ftp-agent       574/tcp    #FTP Software Agent System
ftp-agent       574/udp    #FTP Software Agent System
$

In this example, the line must start with the letters ftp. The subexpression (-agent) matches the text -agent. It is modified, however, by the ? operator that follows it, which says that zero or one of these subexpressions must match. Because the pattern is not anchored at the end, any text may follow the match. Consequently, lines are selected that start with ftp, including ftp-data and ftp-agent.
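To see how branches and optional subexpressions differ, it helps to anchor the pattern at the end of the line as well. This is a sketch with invented one-word sample lines, using grep -E in place of egrep(1).

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' 'ftp' 'ftp-data' 'ftp-agent' 'telnet' 'tftp' 'ssh')

# Two top-level branches, each with its own ^ anchor.
either=$(printf '%s\n' "$sample" | grep -E '^ftp|^telnet')

# The same alternatives as branches inside a parenthesized subexpression;
# the anchors now apply to the group as a whole.
exact=$(printf '%s\n' "$sample" | grep -E '^(ftp|telnet)$')

# With $ added, the optional (-agent) decides which lines are selected.
agent=$(printf '%s\n' "$sample" | grep -E '^ftp(-agent)?$')

printf '%s\n' "$either" "$exact" "$agent"
```

With the $ anchor in place, ftp-data is no longer selected by the last pattern, because -data is neither -agent nor the end of the line.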

Expression Bounds

You have already seen how the *, +, and the ? meta-characters affect the preceding atom. It is also possible to specify a bound instead. A bound consists of an opening brace character ({), an unsigned integer, a comma (,), another unsigned integer, and a closing brace (}). The fully specified bound {2,5} indicates that at least 2 atoms must match but no more than 5.

The second component of the bound is optional. For example, a bound of the form {3} indicates that exactly 3 matches must be made.

A bound may also be specified with a missing second count. For example, the bound {2,} specifies that 2 or more matches can be made.

The valid range for unsigned integers is between 0 and the value RE_DUP_MAX (which is 255 on most platforms). The following example demonstrates how to select those lines with a 6 followed by at least three zeros (the egrep(1) option -E is required to enable the bounds feature):

$ egrep -E '60{3,}' /etc/services
netviewdm1      729/tcp    #IBM NetView DM/6000 Server/Client
netviewdm1      729/udp    #IBM NetView DM/6000 Server/Client
netviewdm2      730/tcp    #IBM NetView DM/6000 send/tcp
netviewdm2      730/udp    #IBM NetView DM/6000 send/tcp
netviewdm3      731/tcp    #IBM NetView DM/6000 receive/tcp
netviewdm3      731/udp    #IBM NetView DM/6000 receive/tcp
#x11            6000-6063/tcp   X Window System
#x11            6000-6063/udp   X Window System
$
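The three forms of bound can be compared on a small invented input. In this sketch, the anchored patterns make the upper limit visible, since an unanchored bound would also match inside a longer run of zeros.

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' '60' '600' '6000' '60000')

# {3,} : a 6 followed by three or more zeros.
at_least=$(printf '%s\n' "$sample" | grep -E '60{3,}')

# {2} : a 6 followed by exactly two zeros and nothing else.
exactly=$(printf '%s\n' "$sample" | grep -E '^60{2}$')

# {1,2} : a 6 followed by one or two zeros and nothing else.
between=$(printf '%s\n' "$sample" | grep -E '^60{1,2}$')

printf '%s\n' "$at_least" "$exactly" "$between"
```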

Quoted Characters

Given the number of meta-characters used in regular expressions, it is often necessary to quote meta-characters to remove their special meaning. The quote character used in regular expressions is the backslash (\) character. Any character that follows this backslash is interpreted literally; it is not treated as a meta-character.

For example, if you want to match a pattern that includes parentheses, you need to quote the parenthesis characters. The expression \(abc\) matches the string (abc).
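The difference between quoted and unquoted meta-characters can be demonstrated as follows. This is a sketch with invented sample lines; note that the patterns are placed in single quotes so that the shell passes the backslashes to grep -E unchanged.

```shell
# Sample input invented for this sketch.
sample=$(printf '%s\n' '(abc)' 'abc' 'a.c')

# Quoted parentheses are literal: only the line containing (abc) matches.
quoted_paren=$(printf '%s\n' "$sample" | grep -E '\(abc\)')

# Unquoted parentheses only group, so any line containing abc matches.
grouped=$(printf '%s\n' "$sample" | grep -E '(abc)')

# A quoted . matches a literal period instead of any single character.
quoted_dot=$(printf '%s\n' "$sample" | grep -E 'a\.c')

printf '%s\n' "$quoted_paren" "$grouped" "$quoted_dot"
```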
