17.3.3. Using Subexpressions

A pattern in a regular expression often contains one or more subexpressions. A subexpression is a part of the pattern that itself has meaning. Regular-expression grammars typically use parentheses to denote subexpressions.

As an example, the pattern that we used to match C++ files (§ 17.3.1, p. 730) used parentheses to group the possible file extensions. Whenever we group alternatives using parentheses, we are also declaring that those alternatives form a subexpression. We can rewrite that expression so that it gives us access to the file name, which is the part of the pattern that precedes the period, as follows:

// r has two subexpressions: the first is the part of the file name before the period
// the second is the file extension
regex r("([[:alnum:]]+)\.(cpp|cxx|cc)$", regex::icase);

Our pattern now has two parenthesized subexpressions:

([[:alnum:]]+), which is a sequence of one or more characters

(cpp| cxx| cc), which is the file extension

We can also rewrite the program from § 17.3.1 (p. 730) to print just the file name by changing the output statement:

if (regex_search(filename, results, r))
    cout << results.str(1) << endl;  // print the first subexpression

As in our original program, we call regex_search to look for our pattern r in the string named filename, and we pass the smatch object results to hold the results of the match. If the call succeeds, then we print the results. However, in this program, we print str(1), which is the match for the first subexpression.

In addition to providing information about the overall match, the match objects provide access to each matched subexpression in the pattern. The submatches are accessed positionally. The first submatch, which is at position 0, represents the match for the entire pattern. Each subexpression appears in order thereafter. Hence, the file name, which is the first subexpression in our pattern, is at position 1, and the file extension is in position 2.

For example, if the file name is foo.cpp, then results.str(0) will hold foo.cpp; results.str(1) will be foo; and results.str(2) will be cpp. In this program, we want the part of the name before the period, which is the first subexpression, so we print results.str(1).

Subexpressions for Data Validation

One common use for subexpressions is to validate data that must match a specific format. For example, U.S. phone numbers have ten digits, consisting of an area code and a seven-digit local number. The area code is often, but not always, enclosed in parentheses. The remaining seven digits can be separated by a dash, a dot, or a space; or not separated at all. We might want to allow data with any of these formats and reject numbers in other forms. We’ll do a two-step process: First, we’ll use a regular expression to find sequences that might be phone numbers and then we’ll call a function to complete the validation of the data.

Before we write our phone number pattern, we need to describe a few more aspects of the ECMAScript regular-expression language:

{d} represents a single digit and {d}{n} represents a sequence of n digits. (E.g., {d}{3} matches a sequence of three digits.)

• A collection of characters inside square brackets allows a match to any of those characters. (E.g., [-. ] matches a dash, a dot, or a space. Note that a dot has no special meaning inside brackets.)

• A component followed by ’?’ is optional. (E.g., {d}{3}[-. ]?{d}{4} matches three digits followed by an optional dash, period, or space, followed by four more digits. This pattern would match 555-0132 or 555.0132 or 555 0132 or 5550132.)

• Like C++, ECMAScript uses a backslash to indicate that a character should represent itself, rather than its special meaning. Because our pattern includes parentheses, which are special characters in ECMAScript, we must represent the parentheses that are part of our pattern as (or ).

Because backslash is a special character in C++, each place that a appears in the pattern, we must use a second backslash to indicate to C++ that we want a backslash. Hence, we write \{d}{3} to represent the regular expression {d}{3}.

In order to validate our phone numbers, we’ll need access to the components of the pattern. For example, we’ll want to verify that if a number uses an opening parenthesis for the area code, it also uses a close parenthesis after the area code. That is, we’d like to reject a number such as (908.555.1800.

To get at the components of the match, we need to define our regular expression using subexpressions. Each subexpression is marked by a pair of parentheses:

// our overall expression has seven subexpressions: ( ddd ) separator ddd separator dddd
// subexpressions 1, 3, 4, and 6 are optional; 2, 5, and 7 hold the number
"(\()?(\d{3})(\))?([-. ])?(\d{3})([-. ]?)(\d{4})";

Because our pattern uses parentheses, and because we must escape backslashes, this pattern can be hard to read (and write!). The easiest way to read it is to pick off each (parenthesized) subexpression:

1. (\()? an optional open parenthesis for the area code

2. (\d{3}) the area code

3. (\))? an optional close parenthesis for the area code

4. ([-. ])? an optional separator after the area code

5. (\d{3}) the next three digits of the number

6. ([-. ])? another optional separator

7. (\d{4}) the final four digits of the number

The following code uses this pattern to read a file and find data that match our overall phone pattern. It will call a function named valid to check whether the number has a valid format:

string phone =
    "(\()?(\d{3})(\))?([-. ])?(\d{3})([-. ]?)(\d{4})";
regex r(phone);  // a regex to find our pattern
smatch m;
string s;
// read each record from the input file
while (getline(cin, s)) {
    // for each matching phone number
    for (sregex_iterator it(s.begin(), s.end(), r), end_it;
               it != end_it; ++it)
        // check whether the number's formatting is valid
        if (valid(*it))
            cout << "valid: " << it->str() << endl;
        else
            cout << "not valid: " << it->str() << endl;
    }

Using the Submatch Operations

We’ll use submatch operations, which are outlined in Table 17.11, to write the valid function. It is important to keep in mind that our pattern has seven subexpressions. As a result, each smatch object will contain eight ssub_match elements. The element at [0]represents the overall match; the elements [1]. . .[7] represent each of the corresponding subexpressions.

When we call valid, we know that we have an overall match, but we do not know which of our optional subexpressions were part of that match. The matched member of the ssub_match corresponding to a particular subexpression is true if that subexpression is part of the overall match.

In a valid phone number, the area code is either fully parenthesized or not parenthesized at all. Therefore, the work valid does depends on whether the number starts with a parenthesis or not:

bool valid(const smatch& m)
{
    // if there is an open parenthesis before the area code
    if(m[1].matched)
        // the area code must be followed by a close parenthesis
        // and followed immediately by the rest of the number or a space
        return m[3].matched
               && (m[4].matched == 0 || m[4].str() == " ");
    else
        // then there can't be a close after the area code
        // the delimiters between the other two components must match
        return !m[3].matched
               && m[4].str() == m[6].str();
}

We start by checking whether the first subexpression (i.e., the open parenthesis) matched. That subexpression is in m[1]. If it matched, then the number starts with an open parenthesis. In this case, the overall number is valid if the subexpression following the area code also matched (meaning that there was a close parenthesis after the area code). Moreover, if the number is correctly parenthesized, then the next character must be a space or the first digit in the next part of the number.

If m[1] didn’t match (i.e., there was no open parenthesis), the subexpression following the area code must also be empty. If it’s empty, then the number is valid if the remaining separators are equal and not otherwise.


Exercises Section 17.3.3

Exercise 17.19: Why is it okay to call m[4].str() without first checking whether m[4] was matched?

Exercise 17.20: Write your own version of the program to validate phone numbers.

Exercise 17.21: Rewrite your phone number program from § 8.3.2 (p. 323) to use the valid function defined in this section.

Exercise 17.22: Rewrite your phone program so that it allows any number of whitespace characters to separate the three parts of a phone number.

Exercise 17.23: Write a regular expression to find zip codes. A zip code can have five or nine digits. The first five digits can be separated from the remaining four by a dash.


..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.189.171.153