17.3.1. Using the Regular Expression Library

As a fairly simple example, we’ll look for words that violate a well-known spelling rule of thumb, “i before e except after c”:

// find the characters ei that follow a character other than c
string pattern("[^c]ei");
// we want the whole word in which our pattern appears
pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
regex r(pattern); // construct a regex to find pattern
smatch results;   // define an object to hold the results of a search
// define a string that has text that does and doesn't match pattern
string test_str = "receipt freind theif receive";
// use r to find a match to pattern in test_str
if (regex_search(test_str, results, r)) // if there is a match
    cout << results.str() << endl;      // print the matching word

We start by defining a string to hold the regular expression we want to find. The regular expression [^c] says we want any character that is not a 'c', and [^c]ei says we want any such letter that is followed by the letters ei. This pattern describes strings containing exactly three characters. We want the entire word that contains this pattern. To match the word, we need a regular expression that will match the letters that come before and after our three-letter pattern.

That regular expression consists of zero or more letters followed by our original three-letter pattern followed by zero or more additional characters. By default, the regular-expression language used by regex objects is ECMAScript. In ECMAScript, the pattern [[:alpha:]] matches any alphabetic character, and the symbols + and * signify that we want “one or more” or “zero or more” matches, respectively. Thus, [[:alpha:]]* will match zero or more characters.

Having stored our regular expression in pattern, we use it to initialize a regex object named r. We next define a string that we’ll use to test our regular expression. We initialize test_str with words that match our pattern (e.g., “freind” and “theif”) and words (e.g., “receipt” and “receive”) that don’t. We also define an smatch object named results, which we will pass to regex_search. If a match is found, results will hold the details about where the match occurred.

Next we call regex_search. If regex_search finds a match, it returns true. We use the str member of results to print the part of test_str that matched our pattern. The regex_search function stops looking as soon as it finds a matching substring in the input sequence. Thus, the output will be

freind

§ 17.3.2 (p. 734) will show how to find all the matches in the input.

Specifying Options for a regex Object

When we define a regex or call assign on a regex to give it a new value, we can specify one or more flags that affect how the regex operates. These flags control the processing done by that object. The last six flags listed in Table 17.6 indicate the language in which the regular expression is written. Exactly one of the flags that specify a language must be set. By default, the ECMAScript flag is set, which causes the regex to use the ECMA-262 specification, which is the regular expression language that many Web browsers use.

Table 17.6. regex (and wregex) Operations

Image

The other three flags let us specify language-independent aspects of the regular-expression processing. For example, we can indicate that we want the regular expression to be matched in a case-independent manner.

As one example, we can use the icase flag to find file names that have a particular file extension. Most operating systems recognize extensions in a case-independent manner—we can store a C++ program in a file that ends in .cc, or .Cc, or .cC, or .CC. We’ll write a regular expression to recognize any of these along with other common file extensions as follows:

// one or more alphanumeric characters followed by a '.' followed by "cpp" or "cxx" or "cc"
regex r("[[:alnum:]]+\.(cpp|cxx|cc)$", regex::icase);
smatch results;
string filename;
while (cin >> filename)
    if (regex_search(filename, results, r))
        cout << results.str() << endl;  // print the current match

This expression will match a string of one or more letters or digits followed by a period and followed by one of three file extensions. The regular expression will match the file extensions regardless of case.

Just as there are special characters in C++ (§ 2.1.3, p. 39), regular-expression languages typically also have special characters. For example, the dot (.) character usually matches any character. As we do in C++, we can escape the special nature of a character by preceding it with a backslash. Because the backslash is also a special character in C++, we must use a second backslash inside a string literal to indicate to C++ that we want a backslash. Hence, we must write \. to represent a regular expression that will match a period.

Errors in Specifying or Using a Regular Expression

We can think of a regular expression as itself a “program” in a simple programming language. That language is not interpreted by the C++ compiler. Instead, a regular expression is “compiled” at run time when a regex object is initialized with or assigned a new pattern. As with any programming language, it is possible that the regular expressions we write can have errors.


Image Note

It is important to realize that the syntactic correctness of a regular expression is evaluated at run time.


If we make a mistake in writing a regular expression, then at run time the library will throw an exception (§ 5.6, p. 193) of type regex_error . Like the standard exception types, regex_error has a what operation that describes the error that occurred (§ 5.6.2, p. 195). A regex_error also has a member named code that returns a numeric code corresponding to the type of error that was encountered. The values code returns are implementation defined. The standard errors that the RE library can throw are listed in Table 17.7.

Table 17.7. Regular Expression Error Conditions

Image

For example, we might inadvertently omit a bracket in a pattern:

try {
    // error: missing close bracket after alnum; the constructor will throw
    regex r("[[:alnum:]+\.(cpp|cxx|cc)$", regex::icase);
} catch (regex_error e)
  { cout << e.what() << " code: " << e.code() << endl; }

When run on our system, this program generates

regex_error(error_brack):
The expression contained mismatched [ and ].
code: 4

Our compiler defines the code member to return the position of the error as listed in Table 17.7, counting, as usual, from zero.

Regular Expression Classes and the Input Sequence Type

We can search any of several types of input sequence. The input can be ordinary char data or wchar_t data and those characters can be stored in a library string or in an array of char (or the wide character versions, wstring or array of wchar_t). The RE library defines separate types that correspond to these differing types of input sequences.

For example, the regex class holds regular expressions of type char. The library also defines a wregex class that holds type wchar_t and has all the same operations as regex. The only difference is that the initializers of a wregex must use wchar_t instead of char.

The match and iterator types (which we will cover in the following sections) are more specific. These types differ not only by the character type, but also by whether the sequence is in a library string or an array: smatch represents string input sequences; cmatch, character array sequences; wsmatch, wide string (wstring) input; and wcmatch, arrays of wide characters.

The important point is that the RE library types we use must match the type of the input sequence. Table 17.8 indicates which types correspond to which kinds of input sequences. For example:

regex r("[[:alnum:]]+\.(cpp|cxx|cc)$", regex::icase);
smatch results;  // will match a string input sequence, but not char*
if (regex_search("myfile.cc", results, r)) // error: char* input
    cout << results.str() << endl;

Table 17.8. Regular Expression Library Classes

Image

The (C++) compiler will reject this code because the type of the match argument and the type of the input sequence do not match. If we want to search a character array, then we must use a cmatch object:

cmatch results;  // will match character array input sequences
if (regex_search("myfile.cc", results, r))
    cout << results.str() << endl;  // print the current match

In general, our programs will use string input sequences and the corresponding string versions of the RE library components.


Exercises Section 17.3.1

Exercise 17.14: Write several regular expressions designed to trigger various errors. Run your program to see what output your compiler generates for each error.

Exercise 17.15: Write a program using the pattern that finds words that violate the “i before e except after c” rule. Have your program prompt the user to supply a word and indicate whether the word is okay or not. Test your program with words that do and do not violate the rule.

Exercise 17.16: What would happen if your regex object in the previous program were initialized with "[^c]ei"? Test your program using that pattern to see whether your expectations were correct.


..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.58.51.228