Chapter 15. Regular Expression Grammars

When a thought takes one’s breath away, a lesson on grammar seems an impertinence.

Preface to Emily Dickinson’s Poems, First Series
THOMAS WENTWORTH HIGGINSON

A regular expression grammar defines the rules for writing and interpreting a regular expression. The TR1 library supports six grammars. These grammars have a lot in common, so many regular expressions match the same set of target sequences, regardless of which grammar is being used. However, the grammars also have significant differences, so some things that are valid under one grammar are illegal under another; worse, some things that mean one thing under one grammar mean something different under another. It’s important to know which grammar you’re using, so that you don’t accidentally write a regular expression that does something other than what you intended it to do.

The six grammars supported by the TR1 library are

1. BRE: Basic Regular Expressions, defined in Part 1 of the POSIX Standard ([Int03b])

2. ERE: Extended Regular Expressions, defined in Part 1 of the POSIX Standard ([Int03b])

3. ECMAScript: ECMAScript Regular Expressions, based on the ECMAScript Language Specification ([Ecm03])[1]

4. awk: regular expressions as used in the awk utility, defined in Part 3 of the POSIX Standard ([Int03c])

5. grep: regular expressions as used in the grep utility, defined in Part 3 of the POSIX Standard ([Int03c])

6. egrep: regular expressions as used in the grep utility with the -E option, defined in Part 3 of the POSIX Standard ([Int03c])

If you don’t ask for a particular grammar, you’ll get ECMAScript.
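
As a rough illustration of why the choice of grammar matters, the sketch below compiles one pattern under different grammars. It assumes the C++11 header <regex>, whose interface descends from the TR1 library described here; with a TR1 implementation the same names would typically live in namespace std::tr1. The helper name full_match is invented for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the entire target matches the pattern
// under the given grammar (ECMAScript when no grammar is named).
bool full_match(const std::string& pattern, const std::string& target,
                std::regex_constants::syntax_option_type grammar =
                    std::regex_constants::ECMAScript)
{
    const std::regex rgx(pattern, grammar);
    return std::regex_match(target, rgx);
}
```

Under these assumptions, full_match("a+", "aa") succeeds because ECMAScript treats ‘+’ as a repetition count, while full_match("a+", "aa", std::regex_constants::basic) fails: in BRE the ‘+’ is an ordinary character, so the pattern matches only the text “a+”.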

Because they have so many similarities, we look at all six grammars together. Some constructs aren’t available in some grammars. Discussions of those constructs list the grammars that support them. Some constructs use different syntax in different grammars. Discussions of those constructs point out those syntactic differences.

This grammar discussion has three parts. The first (Section 15.1) gives an overview of the structure of a regular expression. You should read this section carefully and be sure that you understand it. The second (Section 15.2) is a table, showing which constructs are available in each of the grammars. You should look through the entries for ECMAScript, since that’s the default, to be sure that you know what can be done with it. The third (Section 15.3) gives the details. You’ll probably refer back to it fairly often when you get confused.

In this chapter, names in SMALL CAPITALS refer to the definitions in Section 15.3.

This discussion of regular expressions is complete but terse. See [Fri02] for lengthier discussions and lots of examples.

15.1. Structure of Regular Expressions

15.1.1. Element

An element can be any of the following:

• An ORDINARY CHARACTER, which matches the same character in the target sequence.

• A WILDCARD CHARACTER, ‘.’, which matches any character in the target sequence except a newline.

• A BRACKET EXPRESSION, of the form “[expr]”, which matches a single character or a COLLATING ELEMENT in the target sequence that is also in the set defined by the expression expr, or of the form “[^expr]”, which matches a single character or a COLLATING ELEMENT in the target sequence that is not in the set defined by the expression expr. In either case, the expression expr can consist of any combination of any number of each of the following:

— An INDIVIDUAL CHARACTER, which adds that character to the set defined by expr

— A CHARACTER RANGE, of the form ch1-ch2, which adds all the characters represented by values in the closed range [ch1, ch2] to the set defined by expr

— A CHARACTER CLASS, of the form “[:name:]”, which adds all the characters in the named class to the set defined by expr

— A COLLATING SYMBOL, of the form “[.elt.]”, which adds the collation element elt to the set defined by expr

— An EQUIVALENCE CLASS, of the form “[=elt=]”, which adds the collating elements that are equivalent to elt to the set defined by expr

• An ANCHOR, either ‘^’ or ‘$’, which matches the beginning or the end of the target sequence, respectively.

• A CAPTURE GROUP, of the form “(subexpression)” (written as “\(subexpression\)” in BRE and grep), which matches the sequence of characters in the target sequence matched by the subexpression between the delimiters.

• An IDENTITY ESCAPE, of the form “\k”, which matches the character k in the target sequence.

For example:

“a” matches the target sequence “a” but does not match any of the target sequences “B”, “b”, or “c”.

“.” matches all the target sequences “a”, “B”, “b”, and “c”.

“[b-z]” matches the target sequences “b” and “c” but does not match the target sequence “a” or the target sequence “B”.

“[[:lower:]]” matches the target sequences “a”, “b”, and “c” but does not match the target sequence “B”.

“(a)” matches the target sequence “a” and associates capture group 1 with the subsequence “a” but does not match any of the target sequences “B”, “b”, or “c”.
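
These examples can be checked with a small program. The following is a minimal sketch, assuming the C++11 header <regex> and the default ECMAScript grammar; the helper names full_match and capture are invented for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// True if the entire target sequence matches the pattern.
bool full_match(const std::string& pattern, const std::string& target)
{
    const std::regex rgx(pattern);
    return std::regex_match(target, rgx);
}

// Text associated with capture group n after a successful full match;
// an empty string if the match fails.
std::string capture(const std::string& pattern, const std::string& target,
                    std::size_t n)
{
    std::smatch results;
    const std::regex rgx(pattern);
    if (!std::regex_match(target, results, rgx))
        return std::string();
    return results[n].str();
}
```

With these helpers, full_match("[b-z]", "c") is true and full_match("[b-z]", "B") is false, and capture("(a)", "a", 1) returns “a”.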

In ECMAScript, BRE, and grep, an element can also be

• A BACK REFERENCE, of the form “\d”, as well as “\dd” in ECMAScript, which matches a sequence of characters in the target sequence that is the same as the sequence of characters matched by the Nth CAPTURE GROUP, where N is the value represented by the decimal digit d or by the decimal digits dd.

For example:

“(a)\1” matches the target sequence “aa” because the first, and only, capture group matches the initial sequence “a”, and the back reference \1 then matches the final sequence “a”.
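
A back reference can be tried out as follows. This is a sketch that assumes the C++11 header <regex>; the helper full_match is a name made up for the example. Note that each backslash in the regular expression is written twice in the C++ string literal.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: full match of target against pattern under a grammar.
bool full_match(const std::string& pattern, const std::string& target,
                std::regex_constants::syntax_option_type grammar =
                    std::regex_constants::ECMAScript)
{
    const std::regex rgx(pattern, grammar);
    return std::regex_match(target, rgx);
}

// ECMAScript spelling of the example: (a)\1
const char ecma_backref[] = "(a)\\1";
// BRE spelling of the same expression: \(a\)\1
const char bre_backref[] = "\\(a\\)\\1";
```

Here full_match(ecma_backref, "aa") and full_match(bre_backref, "aa", std::regex_constants::basic) should both succeed, while the target “ab” fails under either grammar.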

In ECMAScript, an element can also be any of the following:

• A NONCAPTURE GROUP, of the form “(?:subexpression)”; the group matches the sequence of characters in the target sequence matched by the subexpression between the delimiters.

• A limited FILE FORMAT ESCAPE, of the form “\f”, “\n”, “\r”, “\t”, or “\v”; these match a form feed, newline, carriage return, horizontal tab, and vertical tab, respectively, in the target sequence.

• A POSITIVE ASSERT, of the form “(?=subexpression)”, which matches the sequence of characters in the target sequence matched by the subexpression between the delimiters but does not change the match position in the target sequence.

• A NEGATIVE ASSERT, of the form “(?!subexpression)”, which matches any sequence of characters in the target sequence that does not match the subexpression between the delimiters but does not change the match position in the target sequence.

• A HEXADECIMAL ESCAPE SEQUENCE, of the form “\xhh”; the sequence matches a character in the target sequence whose representation is the value represented by the two hexadecimal digits hh.

• A UNICODE ESCAPE SEQUENCE, of the form “\uhhhh”, which matches a character in the target sequence whose representation is the value represented by the four hexadecimal digits hhhh.

• A CONTROL ESCAPE SEQUENCE, of the form “\ck”, which matches the control character named by the character k.

• A WORD BOUNDARY ASSERT, of the form “\b”, which matches if the current position in the target sequence is immediately after a word boundary.

• A NEGATIVE WORD BOUNDARY ASSERT, of the form “\B”; the assert matches if the current position in the target sequence is not immediately after a word boundary.

• A DSW CHARACTER ESCAPE, of the form “\d”, “\D”, “\s”, “\S”, “\w”, or “\W”, which provides a short name for a character class.

For example:

“(?:a)” matches the target sequence “a”.

“(?:a)\1” is invalid, because there is no capture group 1.

“(?=a)a” matches the target sequence “a”. The assert matches the initial sequence “a” in the target sequence, and the final “a” in the regular expression matches the initial sequence “a” in the target sequence.

“(?!a)a” does not match the target sequence “a”; nor does it match any other target sequence.

“a\b.” matches the target sequence “a!” but does not match the target sequence “ab”.

“a\B.” matches the target sequence “ab” but does not match the target sequence “a!”.
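
The ECMAScript-only elements above can be exercised with a short sketch. It assumes the C++11 header <regex>, in which an invalid regular expression is reported by throwing std::regex_error; the helper names are invented for illustration.

```cpp
#include <regex>
#include <string>

// True if the entire target matches the pattern (ECMAScript grammar).
bool full_match(const std::string& pattern, const std::string& target)
{
    const std::regex rgx(pattern);
    return std::regex_match(target, rgx);
}

// True if the pattern is accepted by the regular expression constructor.
bool compiles(const std::string& pattern)
{
    try {
        const std::regex rgx(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;   // e.g. a back reference to a nonexistent group
    }
}
```

For instance, compiles("(?:a)\\1") should report false, full_match("(?=a)a", "a") should succeed, and full_match("a\\b.", "a!") should succeed while full_match("a\\b.", "ab") fails.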

In awk, an element can also be one of the following:

• A FILE FORMAT ESCAPE, of the form “\\”, “\a”, “\b”, “\f”, “\n”, “\r”, “\t”, or “\v”; these match a backslash, alert, backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively, in the target sequence.

• An OCTAL ESCAPE SEQUENCE, of the form “\ooo”, which matches a character in the target sequence whose representation is the value represented by the one, two, or three octal digits ooo.

15.1.2. Repetition

Any element other than a POSITIVE ASSERT, a NEGATIVE ASSERT, or an ANCHOR can be followed by a repetition count. The most general form of a repetition count is “{min, max}” (written as “\{min, max\}” in BRE and grep). An element followed by this form of repetition count matches at least min and no more than max successive occurrences of sequences that match the element.

For example:

“a{2, 3}” matches the target sequence “aa” and the target sequence “aaa” but not the target sequence “a” or the target sequence “aaaa”.

A repetition count can also take one of the following forms:

“{min}” (written as “\{min\}” in BRE and grep), which is equivalent to “{min, min}”

“{min,}” (written as “\{min,\}” in BRE and grep), which is equivalent to “{min, unbounded}”

“*”, which is equivalent to “{0, unbounded}”

For example:

“a{2}” matches the target sequence “aa” but not the target sequence “a” or “aaa”.

“a{2,}” matches the target sequence “aa”, the target sequence “aaa”, and so on, but does not match the target sequence “a”.

“a*” matches the target sequence “”, the target sequence “a”, the target sequence “aa”, and so on.

For all grammars except BRE and grep, a repetition count can also take one of the following forms:

“?”, which is equivalent to “{0, 1}”

“+”, which is equivalent to “{1, unbounded}”

For example:

“a?” matches the target sequence “” and the target sequence “a” but not the target sequence “aa”.

“a+” matches the target sequence “a”, the target sequence “aa”, and so on, but not the target sequence “”.
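
The repetition counts can be checked with a sketch like the following, which assumes the C++11 header <regex>; full_match is a hypothetical helper. The patterns in the code are written without a space after the comma, as in “a{2,3}”.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: full match of target against pattern under a grammar.
bool full_match(const std::string& pattern, const std::string& target,
                std::regex_constants::syntax_option_type grammar =
                    std::regex_constants::ECMAScript)
{
    const std::regex rgx(pattern, grammar);
    return std::regex_match(target, rgx);
}

// The same bounded repetition in two spellings.
const char ecma_count[] = "a{2,3}";      // ECMAScript, ERE, egrep, awk
const char bre_count[]  = "a\\{2,3\\}";  // BRE and grep: \{ and \}
```

Under these assumptions, full_match(ecma_count, "aaa") and full_match(bre_count, "aaa", std::regex_constants::basic) both succeed, and both fail for the target “aaaa”.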

All the previous repetition counts apply a greedy repetition, which matches as many characters as possible in the target sequence. In ECMAScript, all the forms of repetition count can be followed by the character ‘?’ to specify a nongreedy repetition. A nongreedy repetition matches as few characters as possible in the target sequence.

For example:

“(a+)a*” matches the target sequence “aa” and associates capture group 1 with the entire target sequence because the element inside the capture group (“a+”) uses a greedy match.

“(a+?)a*” matches the target sequence “aa” and associates capture group 1 with the initial subsequence “a” because the element inside the capture group (“a+?”) uses a nongreedy match.
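
The difference shows up in what capture group 1 holds. A minimal sketch, assuming the C++11 header <regex> and an invented helper first_group:

```cpp
#include <regex>
#include <string>

// Text held by capture group 1 after a full match (ECMAScript grammar);
// an empty string if the target does not match.
std::string first_group(const std::string& pattern, const std::string& target)
{
    std::smatch results;
    const std::regex rgx(pattern);
    if (!std::regex_match(target, results, rgx))
        return std::string();
    return results[1].str();
}
```

With this helper, first_group("(a+)a*", "aa") yields “aa”, and first_group("(a+?)a*", "aa") yields “a”.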

15.1.3. Concatenation

Regular expression elements, with or without repetition counts, can be concatenated to form longer regular expressions. Such a concatenated regular expression matches a target sequence that is a concatenation of sequences matched by the individual elements.

For example:

“a{2, 3}c” matches the target sequence “aac” and the target sequence “aaac” but does not match the target sequence “ac” or the target sequence “aaaac”.

“ab{2, 3}c” matches the target sequence “abbc” and the target sequence “abbbc” but does not match the target sequence “ababc”.

“(ab){2, 3}c” matches the target sequence “ababc” and the target sequence “abababc” but does not match the target sequence “abbc”.

15.1.4. Alternation

For all the regular expression grammars except BRE and grep, a concatenated regular expression can be followed by the character ‘|’ and another concatenated regular expression, which can be followed by another ‘|’ and another concatenated regular expression, and so on. Such an expression matches any target sequence that matches one or more of the concatenated regular expressions.

For example:

“ab|cd” matches the target sequence “ab” and the target sequence “cd” but does not match the target sequence “abd” or the target sequence “acd”.

In grep and egrep, a newline character (‘\n’) can be used to separate alternations.[2]

When a match succeeds, if more than one of the concatenated regular expressions in an alternation could match part of the target sequence, ECMAScript chooses the first of the concatenated regular expressions that matches the target sequence as the match; the other regular expression grammars choose the one that results in the longest match.

For example:

“(a|ab).*” matches the target sequence “abc”. In ECMAScript, the capture group is associated with the initial sequence “a” because it matched the first element in the alternation. Under the other grammars, the capture group is associated with the initial sequence “ab” because it gave the longest match in the alternation.
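
The ECMAScript half of this example can be observed directly. The sketch below assumes the C++11 header <regex>; first_group is a hypothetical helper that also accepts a grammar flag, so the same call can be repeated with std::regex_constants::extended to see what a particular library does for the POSIX grammars, where results may depend on how closely the implementation follows the longest-match rule.

```cpp
#include <regex>
#include <string>

// Text held by capture group 1 after a full match under the given grammar;
// an empty string if the target does not match.
std::string first_group(const std::string& pattern, const std::string& target,
                        std::regex_constants::syntax_option_type grammar =
                            std::regex_constants::ECMAScript)
{
    std::smatch results;
    const std::regex rgx(pattern, grammar);
    if (!std::regex_match(target, results, rgx))
        return std::string();
    return results[1].str();
}
```

In ECMAScript, first_group("(a|ab).*", "abc") returns “a”, because the first alternative is tried first and the rest of the expression can still match.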

15.1.5. Subexpression

A subexpression is a concatenated regular expression in BRE and grep and an alternation in the other regular expression grammars. This is where the specification for regular expressions becomes recursive. In particular, as we saw earlier, a capture group can hold a subexpression. This makes it possible to nest subexpressions to create rather complicated—and potentially unreadable—regular expressions.

For example:

“(a(.*)d)” matches the target sequence “abcd”, associates capture group 1 with the text “abcd”, and associates capture group 2 with the text “bc”.[3]

“(a(.*)d)\1” matches the target sequence “abcdabcd”. It associates capture group 1 with the initial text “abcd”, and the back reference matches the corresponding text at the end of the target text.

“(a(.*)d)\2” matches the target sequence “abcdbc”. It associates capture group 2 with the first occurrence of “bc”, and the back reference matches the corresponding text at the end of the target text.
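
Nested groups and their back references can be inspected as in this sketch, which assumes the C++11 header <regex> and the ECMAScript grammar; the helper name capture is invented for the example.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Text associated with capture group n after a full match;
// "<no match>" if the target does not match the pattern.
std::string capture(const std::string& pattern, const std::string& target,
                    std::size_t n)
{
    std::smatch results;
    const std::regex rgx(pattern);
    if (!std::regex_match(target, results, rgx))
        return "<no match>";
    return results[n].str();
}
```

For the pattern "(a(.*)d)" and the target “abcd”, group 1 is “abcd” and group 2 is “bc”; for "(a(.*)d)\\2" and the target “abcdbc”, group 2 is again “bc”.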

15.2. Grammar Features

Table 15.1 presents the full list of regular expression grammar features and the grammars that support them.

Table 15.1. Grammar Features

[Table 15.1 appeared as an image in the source and is not reproduced here.]

15.3. Regular Expression Details

15.3.1. Anchor

An anchor matches a position in the target string and not a character. A ‘^’ matches the beginning of the target sequence, and a ‘$’ matches the end of the target sequence.

15.3.2. Back Reference

A back reference is a backslash followed by a decimal value N and matches the contents of the Nth CAPTURE GROUP. The value of N must not be greater than the number of complete capture groups that precede the back reference. In BRE and grep, the value of N is never greater than 9, even if the regular expression has more than nine capture groups. In ECMAScript, the value of N is never greater than 99. Back references are not supported in ERE, egrep, and awk. For example:

“((a+)(b+))(c+)\3” matches the target sequence “aabbbcbbb”. The back reference “\3” matches the text in the third capture group, that is, the “(b+)”. The regular expression does not match the target sequence “aabbbcbb”.

“(a)\2” is not valid.

“(b(((((((((a))))))))))\10” matches the target sequence “baa” in ECMAScript; the back reference is “\10” and it matches the tenth capture group (i.e., the innermost one). In BRE, in the analogous regular expression

“\(b\(\(\(\(\(\(\(\(\(a\)\)\)\)\)\)\)\)\)\)\10”

the back reference is “\1”. It matches the first capture group (the one beginning with “\(b” and ending with the final “\)” preceding the back reference), and the final ‘0’ matches the ordinary character ‘0’.
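
The two readings of “\10” can be compared with the following sketch. It assumes the C++11 header <regex>; the helpers are invented for illustration, and the nested expressions are assembled programmatically only to avoid miscounting parentheses.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: full match of target against pattern under a grammar.
bool full_match(const std::string& pattern, const std::string& target,
                std::regex_constants::syntax_option_type grammar =
                    std::regex_constants::ECMAScript)
{
    const std::regex rgx(pattern, grammar);
    return std::regex_match(target, rgx);
}

// Concatenates n copies of s.
std::string repeat(const std::string& s, int n)
{
    std::string out;
    for (int i = 0; i < n; ++i)
        out += s;
    return out;
}

// ECMAScript: ten nested capture groups followed by back reference \10.
std::string nested_ecma()
{
    return "(b" + repeat("(", 9) + "a" + repeat(")", 10) + "\\10";
}

// BRE: same nesting with \( and \); here \1 is the back reference
// and the trailing 0 is an ordinary character.
std::string nested_bre()
{
    return "\\(b" + repeat("\\(", 9) + "a" + repeat("\\)", 10) + "\\10";
}
```

Under these assumptions, nested_ecma() matches “baa”, while nested_bre(), compiled with std::regex_constants::basic, matches “baba0”: the first group matches “ba”, the back reference repeats it, and the ‘0’ matches itself.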

15.3.3. Bracket Expression

A bracket expression defines a set of characters and COLLATING ELEMENTS. If the bracket expression begins with the character ‘^’, the match succeeds if none of the elements in the set matches the current character in the target sequence. Otherwise, the match succeeds if any of the elements in the set matches the current character in the target sequence.

The set of characters can be defined by listing any combination of INDIVIDUAL CHARACTERS, CHARACTER RANGES, CHARACTER CLASSES, COLLATING SYMBOLS, and EQUIVALENCE CLASSES.

15.3.4. Capture Group

A capture group marks its contents as a single unit in the regular expression grammar and associates the capture group with the subsequence of the target sequence that matches its contents. Each capture group has a number, determined by counting the left delimiters (‘(’ or “\(”) marking capture groups up to and including the left parenthesis marking the current capture group. For example:

“ab+” matches the target sequence “abb” but not the target sequence “abab”.

“(ab)+” does not match the target sequence “abb” but matches the target sequence “abab”.

“((a+)(b+))(c+)” matches the target sequence “aabbbc” and associates capture group 1 with the subsequence “aabbb”, capture group 2 with the subsequence “aa”, capture group 3 with the subsequence “bbb”, and capture group 4 with the subsequence “c”.
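
The numbering rule can be seen by listing every group after a match. This is a sketch assuming the C++11 header <regex> and the ECMAScript grammar; the helper name groups is made up for the example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Texts of capture groups 1..N after a full match, in group-number order;
// an empty vector if the target does not match.
std::vector<std::string> groups(const std::string& pattern,
                                const std::string& target)
{
    std::vector<std::string> texts;
    std::smatch results;
    const std::regex rgx(pattern);
    if (std::regex_match(target, results, rgx)) {
        // results[0] is the whole match; numbered groups start at 1.
        for (std::size_t i = 1; i < results.size(); ++i)
            texts.push_back(results[i].str());
    }
    return texts;
}
```

For "((a+)(b+))(c+)" and the target “aabbbc”, the returned texts are “aabbb”, “aa”, “bbb”, and “c”, in the order of the left parentheses.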

15.3.5. Character Class

A character class in a bracket expression adds all the characters in the named class to the character set defined by the bracket expression. To add the contents of a character class to a bracket expression, use “[:” followed by the name of the character class followed by “:]”. Internally, the name of a character class is recognized by calling the member functions lookup_classname and isctype on the regular expression object’s traits object. The default traits class supports the following class names:

“alnum”: lowercase letters, uppercase letters, and digits

“alpha”: lowercase letters and uppercase letters

“blank”: space or tab

“cntrl”: the FILE FORMAT ESCAPE characters

“digit”: digits

“graph”: lowercase letters, uppercase letters, digits, and punctuation

“lower”: lowercase letters

“print”: lowercase letters, uppercase letters, digits, punctuation, and space

“punct”: punctuation

“space”: space

“upper”: uppercase letters

“xdigit”: digits, ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’

“d”: same as “digit”

“s”: same as “space”

“w”: same as “alnum”
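
A few of these class names in use, as a sketch that assumes the C++11 header <regex>, the default traits class, and the default "C" locale; full_match is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// True if the entire target matches the pattern (ECMAScript grammar).
bool full_match(const std::string& pattern, const std::string& target)
{
    const std::regex rgx(pattern);
    return std::regex_match(target, rgx);
}

// A character class is written inside a bracket expression,
// so the brackets are doubled: [[:name:]].
const char hex_digits[]   = "[[:xdigit:]]+";
const char letter_digit[] = "[[:alpha:]][[:digit:]]";
```

With these definitions, full_match(hex_digits, "09afAF") succeeds and full_match(hex_digits, "g") fails; full_match(letter_digit, "x7") succeeds.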

15.3.6. Character Range

A character range in a bracket expression adds all the characters in the range to the character set defined by the bracket expression. To create a character range put the character ‘-’ between the first and last characters in the range. This puts into the set all the characters whose representation has a numeric value that is greater than or equal to the numeric value of the representation of the first character and less than or equal to the numeric value of the representation of the last character. Note that the contents and validity of this set of added characters depend on the platform-specific representation of characters. If the character ‘-’ occurs at the beginning or the end of a bracket expression or as the first or last character of a character range it represents itself. For example:

“[0-7]” represents the characters in the set {‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’}. It matches the target sequences “0”, “1”, and so on, but not the target sequence “a”.

“[h-k]” represents the characters in the set {‘h’, ‘i’, ‘j’, ‘k’} on systems that use the ASCII character encoding. It matches the target sequences “h”, “i”, and so on, but not the target sequences “\x8A” or “0”.

“[h-k]” represents the characters in the set {‘h’, ‘i’, ‘\x8A’, ‘\x8B’, ‘\x8C’, ‘\x8D’, ‘\x8E’, ‘\x8F’, ‘\x90’, ‘j’, ‘k’} on systems that use the EBCDIC character encoding; ‘h’ is encoded as 0x88, and ‘k’ is encoded as 0x92. It matches the target sequences “h”, “i”, “\x8A”, and so on, but not “0”.

“[-0-24]” represents the set of characters {‘-’, ‘0’, ‘1’, ‘2’, ‘4’}.

“[0-2-]” represents the set of characters {‘0’, ‘1’, ‘2’, ‘-’}.

“[+--]” represents the set of characters {‘+’, ‘,’, ‘-’} on systems that use ASCII.

You can change this behavior by passing a flag to the constructor of a regular expression object. When you select locale-sensitive ranges, the characters are determined by the collation rules for the regular expression’s locale. Characters that collate after the first character in the definition of the range and before the last character in the definition of the range are in the set, as well as the two end characters.
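
Some of the range examples can be checked with the sketch below. It assumes the C++11 header <regex>, the ECMAScript grammar without locale-sensitive ranges, and an ASCII-based character encoding; in_set is an invented helper.

```cpp
#include <regex>
#include <string>

// True if the single character ch is matched by the bracket expression.
bool in_set(const std::string& bracket_expr, char ch)
{
    const std::regex rgx(bracket_expr);
    return std::regex_match(std::string(1, ch), rgx);
}

// A '-' at the beginning or end of the bracket expression stands for itself.
const char leading_dash[]  = "[-0-24]";
const char trailing_dash[] = "[0-2-]";
```

For example, in_set(leading_dash, '-') and in_set(leading_dash, '4') are true, while in_set(leading_dash, '3') is false.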

15.3.7. Collating Element

A collating element is a multicharacter sequence that is treated as a single character.

15.3.8. Collating Symbol

A collating symbol in a bracket expression adds a COLLATING ELEMENT to the set defined by the bracket expression. To add a collating symbol to the set defined by a bracket expression, use “[.” followed by the collating element followed by “.]”.

15.3.9. Control Escape Sequence

A control escape sequence is a backslash followed by the letter ‘c’ followed by one of the letters ‘a’ through ‘z’ or ‘A’ through ‘Z’. The escape sequence matches the ASCII control character named by that letter. For example:

“\ci” matches the target sequence “\x09” because <ctrl-i> has the value 0x09.

15.3.10. DSW Character Escape

A dsw character escape is a short name for a bracket expression containing a single character class. In Table 15.2, the second column shows the meaning of each of the dsw character escapes, and the third column shows the meaning with the default traits class.

Table 15.2. DSW Character Escapes

[Table 15.2 appeared as an image in the source and is not reproduced here.]

15.3.11. Equivalence Class

An equivalence class in a bracket expression adds all the characters and COLLATING ELEMENTS that are equivalent to the collating element in the equivalence class definition to the set defined by the bracket expression. To add the characters in an equivalence class to the set defined by a bracket expression, use “[=” followed by a collating element followed by “=]”.

15.3.12. File Format Escape

A file format escape represents the usual C-language character escapes (“\\”, “\a”, “\b”, “\f”, “\n”, “\r”, “\t”, and “\v”) with their usual meanings, namely, backslash, alert, backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively. In ECMAScript, “\a” and “\b” are not allowed. (Although “\\” is allowed, technically it’s an identity escape, not a file format escape.)

15.3.13. Hexadecimal Escape Sequence

A hexadecimal escape sequence is a backslash followed by the letter ‘x’ followed by two hexadecimal digits (0-9a-fA-F). This sequence matches a character in the target sequence with the value specified by the two digits. For example:

“\x41” matches the target sequence “A” when the ASCII character encoding is used.
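
In C++ source the backslash must itself be escaped so that it reaches the regular expression parser. A minimal sketch, assuming the C++11 header <regex>, the ECMAScript grammar, and an ASCII-based encoding; full_match is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// True if the entire target matches the pattern (ECMAScript grammar).
bool full_match(const std::string& pattern, const std::string& target)
{
    const std::regex rgx(pattern);
    return std::regex_match(target, rgx);
}

// The regular expression \x41, written as a C++ string literal.
// With a single backslash the compiler, not the regular expression
// parser, would interpret the escape.
const char hex_A[] = "\\x41";
```

Under these assumptions, full_match(hex_A, "A") succeeds and full_match(hex_A, "B") fails.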

15.3.14. Identity Escape

An identity escape is a backslash followed by a single character. The escape matches that character and is needed when the character has a special meaning in the regular expression grammar; using the identity escape removes the special meaning. For example:

“a*” matches the target sequence “aaa” but does not match the target sequence “a*”.

“a\*” does not match the target sequence “aaa” but does match the target sequence “a*”.
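
The pair of examples translates into code as follows. This sketch assumes the C++11 header <regex> and the ECMAScript grammar, and the helper full_match is invented for illustration.

```cpp
#include <regex>
#include <string>

// True if the entire target matches the pattern (ECMAScript grammar).
bool full_match(const std::string& pattern, const std::string& target)
{
    const std::regex rgx(pattern);
    return std::regex_match(target, rgx);
}

const char star_repeats[] = "a*";    // '*' is a repetition count
const char star_literal[] = "a\\*";  // regular expression a\* : literal '*'
```

Here full_match(star_repeats, "aaa") and full_match(star_literal, "a*") succeed, while full_match(star_repeats, "a*") and full_match(star_literal, "aaa") fail.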

The set of characters allowed in an identity escape depends on the regular expression grammar.

BRE and grep

   . [ \ * ^ $

ERE and egrep

   . [ \ ( ) * + ? { | ^ $

awk

   . [ \ ( ) * + ? { | ^ $ " /

ECMAScript: all characters except those that can be part of an identifier: roughly speaking, letters, digits, ‘$’, ‘_’, and Unicode escape sequences. For full details, see the ECMAScript Language Specification ([Ecm03]).

15.3.15. Individual Character

An individual character in a bracket expression adds that character to the character set defined by the bracket expression. A ‘^’ anywhere other than at the beginning of a bracket expression represents itself. For example:

“[abc]” matches the target sequences “a”, “b”, and “c” but not “d”.

“[^abc]” matches the target sequence “d” but not “a”, “b”, and “c”.

“[a^bc]” matches the target sequences “a”, “b”, “c”, and “^” but not “d”.

In all the regular expression grammars except ECMAScript, if a ‘]’ is the first character following the opening ‘[’ in a bracket expression or the first character following an initial ‘^’ in a bracket expression, it represents itself. For example:

“[]a” is invalid because there is no ‘]’ to end the bracket expression.

“[]abc]” matches the target sequences “]”, “a”, “b”, and “c” but not “d”.

“[^]abc]” matches the target sequence “d” but not “]”, “a”, “b”, and “c”.

In ECMAScript, use “\]” to represent the character ‘]’ in a bracket expression.

“[]a” does not match any target sequence, because the bracket expression is empty.

“[\]abc]” matches the target sequences “]”, “a”, “b”, and “c” but not “d”.

15.3.16. Negative Assert

A negative assert matches anything but its contents; it does not consume any characters in the target sequence. For example:

“(?!aa)(a*)” matches the target sequence “a” and associates capture group 1 with the subsequence “a” but it does not match the target sequence “aa” or the target sequence “aaa”.
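
A sketch of this example, assuming the C++11 header <regex> and the ECMAScript grammar; the helper name first_group is invented.

```cpp
#include <regex>
#include <string>

// Text held by capture group 1 after a full match;
// "<no match>" if the target does not match the pattern.
std::string first_group(const std::string& pattern, const std::string& target)
{
    std::smatch results;
    const std::regex rgx(pattern);
    if (!std::regex_match(target, results, rgx))
        return "<no match>";
    return results[1].str();
}

// The assert rejects any position where "aa" comes next, then (a*)
// consumes the text itself; the assert consumes nothing.
const char not_double_a[] = "(?!aa)(a*)";
```

With this helper, first_group(not_double_a, "a") returns “a”, while the targets “aa” and “aaa” produce no match.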

15.3.17. Negative Word Boundary Assert

A negative word boundary assert matches if the current position in the target sequence is not immediately after a word boundary. See WORD BOUNDARY ASSERT.

15.3.18. Noncapture Group

A noncapture group marks its contents as a single unit in the regular expression grammar but does not associate any group number with the matching target sequence. For example:

“(a)(?:b)(c)” matches the target text “abc” and associates capture group 1 with the subsequence “a” and capture group 2 with the subsequence “c”.
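
The effect on group numbering can be confirmed with the member function mark_count, which reports the number of capture groups in a compiled expression. A minimal sketch, assuming the C++11 header <regex>; the helper names are invented.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Number of capture groups in the pattern (noncapture groups not counted).
unsigned group_count(const std::string& pattern)
{
    const std::regex rgx(pattern);
    return rgx.mark_count();
}

// Text associated with capture group n after a full match;
// an empty string if the target does not match.
std::string capture(const std::string& pattern, const std::string& target,
                    std::size_t n)
{
    std::smatch results;
    const std::regex rgx(pattern);
    if (!std::regex_match(target, results, rgx))
        return std::string();
    return results[n].str();
}
```

For "(a)(?:b)(c)", group_count reports 2, and with the target “abc” group 1 holds “a” and group 2 holds “c”.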

15.3.19. Octal Escape Sequence

An octal escape sequence is a backslash followed by one, two, or three octal digits (0-7). This sequence matches a character in the target sequence whose representation has the value specified by those digits. If all the digits are ‘0’, the sequence is invalid. For example:

“\101” matches the target sequence “A” when the ASCII character encoding is used.

15.3.20. Ordinary Character

An ordinary character is any valid character that doesn’t have a special meaning in the current grammar.

• In ECMAScript, the characters that have special meanings are

     ^ $ \ . * + ? ( ) [ ] { } |

• In BRE and grep, the characters that have special meanings are

     . [ \

‘*’ has a special meaning in all cases except when it is the first character in a regular expression, the first character following an initial ‘^’ in a regular expression, the first character in a capture group, or the first character following an initial ‘^’ in a capture group.

• In ERE, egrep, and awk, the characters that have special meanings are

     . [ \ ( * + ? { |

In addition, the following characters have special meanings when used in a particular context:

‘)’ has a special meaning when it matches a preceding ‘(’.

‘^’ has a special meaning when it is the first character of a regular expression.

‘$’ has a special meaning when it is the last character of a regular expression.

An ordinary character matches the same character in the target sequence. By default, this means that the match succeeds if the two characters are represented by the same value. When you select case-insensitive matching, character equality is determined by calling the member function translate_nocase on the regular expression object’s traits object. When you select a locale-sensitive match, character equality is determined by calling the member function translate on the regular expression object’s traits object.

15.3.21. Positive Assert

A positive assert matches its contents but does not consume any characters in the target sequence. For example:

“(aa)(a*)” matches the target sequence “aaaa” and associates capture group 1 with the subsequence “aa” at the beginning of the target sequence and capture group 2 with the subsequence “aa” at the end of the target sequence.

“(?=aa)(a*)” matches the target sequence “aaaa” and associates capture group 1 with the subsequence “aaaa”.

“(?=aa)(a*)|(a*)” matches the target sequence “a” and associates capture group 1 with an empty sequence—because the positive assert failed—and capture group 2 with the subsequence “a”. The regular expression also matches the target sequence “aa” and associates capture group 1 with the subsequence “aa” and capture group 2 with an empty sequence.
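
To see which groups took part in a match, the matched member of each submatch can be examined. The sketch below assumes the C++11 header <regex> and the ECMAScript grammar; describe_groups is a hypothetical helper that prints each group’s text in brackets and a ‘-’ for a group that did not participate.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Describes capture groups 1..N after a full match, e.g. "[aa] -".
// Returns "<no match>" if the target does not match the pattern.
std::string describe_groups(const std::string& pattern,
                            const std::string& target)
{
    std::smatch results;
    const std::regex rgx(pattern);
    if (!std::regex_match(target, results, rgx))
        return "<no match>";
    std::string out;
    for (std::size_t i = 1; i < results.size(); ++i) {
        if (i > 1)
            out += ' ';
        // matched is false for a group that was not part of the match.
        out += results[i].matched ? "[" + results[i].str() + "]"
                                  : std::string("-");
    }
    return out;
}
```

Under these assumptions, describe_groups("(?=aa)(a*)", "aaaa") gives “[aaaa]”, describe_groups("(?=aa)(a*)|(a*)", "a") gives “- [a]”, and the same pattern with the target “aa” gives “[aa] -”.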

15.3.22. Unicode Escape Sequence

A Unicode escape sequence is a backslash followed by the letter ‘u’ followed by four hexadecimal digits (0-9a-fA-F). This sequence matches a character in the target sequence with the value specified by the four digits. For example:

“\u0041” matches the target sequence “A”.

15.3.23. Wildcard Character

A wildcard character matches any character in the target sequence except a newline.

15.3.24. Word Boundary Assert

A word boundary assert matches if the current position in the target sequence is immediately after a word boundary. A word boundary occurs in the following situations.

• The current position is at the beginning of the target sequence, and the current character is one of the word characters 0-9a-zA-Z_.

• The current position is at the end of the target sequence, and the last character in the target sequence is one of the word characters.

• The character at the current position is one of the word characters, and the preceding character is not.

• The character at the current position is not one of the word characters, and the preceding character is.
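
A common use of the word boundary assert is searching for a whole word. The following is a sketch, assuming the C++11 header <regex> and the ECMAScript grammar; contains_word is an invented helper, and it assumes that word consists only of word characters, so that nothing in it needs escaping.

```cpp
#include <regex>
#include <string>

// True if text contains word delimited by word boundaries on both sides.
// Assumption: word holds only the characters 0-9, a-z, A-Z, and '_'.
bool contains_word(const std::string& text, const std::string& word)
{
    const std::regex rgx("\\b" + word + "\\b");
    // regex_search looks for a match anywhere in the text.
    return std::regex_search(text, rgx);
}
```

For example, contains_word("a cat sat", "cat") is true, but contains_word("concatenate", "cat") is false because the characters on either side of “cat” are word characters.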

15.4. About the Exercises

We haven’t looked yet at the details of the various templates and classes in the header <regex>, so it’s difficult to write code that uses them. Instead, for these exercises, use the header “rgxutil.h”, which provides simple functions for checking regular expression syntax and for regular expression matching. The function match_ECMA uses the ECMAScript grammar; the function match_grep uses the grep grammar; and the function match_ere uses the ERE grammar. The header is available on my Web site in the archive that contains the examples. In addition, its contents are listed in Appendix B.3.

To check the syntax of a regular expression, write the regular expression as a C-style string, and pass it to one of the functions match_ECMA, match_grep, or match_ere.

Example 15.1. Checking Syntax (regexgram/check.cpp)


// demonstrate use of match_XXX functions to check regular expression syntax
#include "rgxutil.h"

int main ()
  { // demonstrate functions match_XXX
  const char expr [] = "a*";
  match_ECMA (expr);
  match_grep (expr);
  match_ere (expr);

  match_ECMA ("*");
  match_grep ("*");
  match_ere ("*");
  return 0;
  }
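
The header rgxutil.h is not reproduced in this chapter. As a rough, hypothetical stand-in (not the author’s code), a syntax check can be sketched directly with the C++11 header <regex>, whose constructor throws std::regex_error when a regular expression is invalid. The name syntax_ok is invented for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for a syntax check: true if the pattern is
// accepted under the given grammar, false if the constructor rejects it.
bool syntax_ok(const std::string& pattern,
               std::regex_constants::syntax_option_type grammar =
                   std::regex_constants::ECMAScript)
{
    try {
        const std::regex rgx(pattern, grammar);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

With this sketch, syntax_ok("a*") is true under the ECMAScript, grep, and ERE grammars, and syntax_ok("*") is false under ECMAScript, where a repetition count must follow an element. What a library reports for “*” under the grep grammar may vary between implementations.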


To check whether a regular expression matches some target text, write both the regular expression and the target text as C-style strings, and pass them to one of the functions match_ECMA, match_grep, or match_ere.

Example 15.2. Matching (regexgram/match.cpp)


// demonstrate use of match_XXX functions for regular expression searching
#include "rgxutil.h"

int main ()
  { // demonstrate functions match_XXX
  match_ECMA ("[b-z]",  "b");
  match_grep ("[b-z]",  "b");
  match_ere ("[b-z]",  "b");

  match_ECMA ("[b-z]",  "c");
  match_grep ("[b-z]",  "c");
  match_ere ("[b-z]",  "c");

  match_ECMA ("[b-z]",  "a");
  match_grep ("[b-z]",  "a");
  match_ere ("[b-z]",  "a");

  match_ECMA ("[b-z]",  "B");
  match_grep ("[b-z]",  "B");
  match_ere ("[b-z]",  "B");
  return 0;
  }


Exercises

Exercise 1

The best way to learn about the details of regular expressions is to use them. If you’re confused by any of the examples in this chapter, try them out with the various match functions. Be careful of backslashes: In a C-style string, you need two for each one in the regular expression.

Exercise 2

For each of the following target sequences, write a regular expression using the ECMAScript grammar and a regular expression using the grep grammar—they will often be the same—that matches it and nothing else:

1. The letter ‘a’

2. The letter ‘b’

3. The letter ‘a’ followed by the letter ‘b’

4. Any character

5. Any character followed by any character

6. Any vowel

7. Any character that is not a vowel

8. Any of the characters ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’, ‘h’

9. Any letter of the alphabet

10. The character ‘{’

11. The character ‘\’

12. Any character followed by the same character

13. Any letter except ‘Q’

14. Any letter except ‘Q’ and ‘x’

15. Any number of occurrences of the letter ‘a’

16. Any number of occurrences of the letter sequence “ab”

17. Any number of occurrences of either of the letters ‘a’ and ‘b’

18. Zero or one occurrence of either of the letters ‘a’ and ‘b’

19. One or more occurrences of either of the letters ‘a’ and ‘b’

20. Seventeen or more occurrences of either of the letters ‘a’ and ‘b’

21. One of the three HTML tags “<EM>”, “<CODE>”, and “<PRE>”; don’t make allowances for lowercase characters

22. One of those same three HTML tags, followed by an arbitrary sequence of characters that does not include a ‘<’, followed by the corresponding closing tag (“</EM>”, “</CODE>”, or “</PRE>”)

Exercise 3

How many ways can you think of to write a regular expression that matches any of the target sequences “0”, “1”, and “2”?

Exercise 4

How many ways can you think of to write a regular expression that matches a single hexadecimal digit—any decimal digit or any of the letters ‘a’ through ‘f’, either lowercase or uppercase?

Exercise 5

1. Write a regular expression that matches a sequence of three characters followed by the same three characters in the same order.

2. Write a regular expression that matches a sequence of three characters followed by the same three characters in reverse order.
