Chapter 14. The <regex> Header

Great blunders are often made, like large ropes, of a multitude of fibers.

Les Misérables
VICTOR HUGO

Powerful software facilities, too, are often made of a multitude of fibers. The <regex> header offers a somewhat daunting set of class templates and function templates for searches using regular expressions. Don’t let the size of the header deter you, though; you need to understand only a few basic ideas to use regular expressions effectively. You need to know how to write a regular expression (Chapter 15), how to create an object that holds a regular expression (Chapter 16), how to use a regular expression object to search for matches in a target string (Chapter 17), and how to hold the results of a search (Chapter 18). For more sophisticated applications you can create iterator objects that perform multiple sequential searches (Chapter 19), suitable for use as input sequences to STL algorithms, and you can scan input text, replacing portions selected by a regular expression (Chapter 20). Finally, if you really need to, you can customize some aspects of the regular expression engine (Chapter 21). We look at each of these subjects in a bit more detail in the rest of this chapter and in far more detail in subsequent chapters.

A regular expression is a sequence of characters that can match one or more target sequences of characters according to the rules of a regular expression grammar. For example, the regular expression “sequence[^s]” matches the text “sequence” earlier in this sentence but not the text “sequences.” The rules that determine what is and isn’t a valid regular expression and what a valid regular expression means are called the regular expression grammar. The grammars supported by the TR1 library are discussed in Chapter 15.
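The “sequence[^s]” example can be checked with a few lines of code. This is a minimal sketch, not part of the book's sample code: the function name is hypothetical, and it assumes a C++11 or later compiler, where the facilities described in this chapter live in namespace std rather than std::tr1.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether text contains "sequence" followed
// by any character other than 's'.
// Assumes C++11, where regex is in namespace std instead of std::tr1.
bool has_sequence_not_followed_by_s(const std::string& text)
{
    static const std::regex rgx("sequence[^s]");
    return std::regex_search(text, rgx);
}
```

Note that “[^s]” has to match an actual character, so the pattern does not match the word “sequence” when it appears at the very end of the text.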

When writing regular expressions, it’s important to keep in mind that a backslash character has a special meaning both in regular expression grammars and in C++. When you write a regular expression as a string literal in code, the compiler gets the first shot at any backslashes and will treat them as escape characters. If you need to have a backslash in the regular expression itself, you must use two backslashes in the string literal. For example, the regular expression “\.a” is the character ‘\’ followed by the character ‘.’ followed by the character ‘a’. In code, however, a string literal representing that same regular expression has two backslashes.

std::string str("\\.a");     // str holds the character sequence '\' '.' 'a'
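The effect of the doubled backslash can be verified directly. The sketch below uses hypothetical function names. Its second function relies on a C++11 raw string literal, which is an assumption beyond what this chapter covers: inside a raw string literal the compiler does not treat backslashes as escape characters.

```cpp
#include <string>

// The regular expression \.a written as an ordinary string literal:
// the compiler turns the two backslashes into a single backslash.
std::string escaped_pattern()
{
    return "\\.a";
}

// The same three characters written as a C++11 raw string literal
// (not used in this chapter): the backslash is taken as is.
std::string raw_pattern()
{
    return R"(\.a)";
}
```

Both functions return the same three-character string, so either spelling can be handed to a regular expression object.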

Once you know how to write a regular expression that correctly describes the text pattern you want to search for, you need to create an object that encapsulates that pattern. The class template basic_regex does this for more or less arbitrary types of characters. You’ll almost always be providing regular expressions as sequences of char or wchar_t, for which you’ll use the specializations of basic_regex named regex and wregex, respectively. Objects of these types are constructed from a text sequence that defines a regular expression:

std::tr1::regex rgx(str);    // rgx holds the regular expression "\.a"
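The two specializations are used in the same way; only the character type differs. The following sketch shows one overload for each. The function name is hypothetical, and the code assumes a C++11 library, where regex and wregex are in namespace std rather than std::tr1.

```cpp
#include <regex>
#include <string>

// Narrow characters: regex is basic_regex<char>.
bool contains_dot_a(const std::string& target)
{
    std::regex rgx("\\.a");
    return std::regex_search(target, rgx);
}

// Wide characters: wregex is basic_regex<wchar_t>.
bool contains_dot_a(const std::wstring& target)
{
    std::wregex rgx(L"\\.a");
    return std::regex_search(target, rgx);
}
```

The character type of the regular expression object has to agree with the character type of the text being searched, which is why each overload constructs its own kind of object.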

When you search for text that matches the pattern defined by a regular expression, you’re often interested in more than simply whether a match was found. You usually want to know where the match was in the target sequence and, sometimes, where some matching subsequences occurred. These results are reported through the class templates sub_match and match_results or, more commonly, through their specializations for use with particular kinds of target sequences. In the following code snippet, cmatch can hold the results of a search through an array of char:

std::tr1::cmatch match;    // match will receive search results
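To suggest what such a results object carries, here is a minimal sketch that pulls the position of a match and two matching subsequences out of a cmatch object. The struct, the function name, and the pattern are all hypothetical, and the parenthesized groups rely on the default grammar, which is described in Chapter 15. The code assumes a C++11 library, with the names in namespace std rather than std::tr1.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical record of what a search found.
struct KeyValue
{
    bool found;
    std::string key;        // text matched by the first group
    std::string value;      // text matched by the second group
    std::ptrdiff_t offset;  // where the overall match begins
};

// Looks for text of the form name=digits in a char array.
KeyValue find_key_value(const char* text)
{
    static const std::regex rgx("([a-z]+)=([0-9]+)");
    std::cmatch match;      // match_results for an array of char
    KeyValue result = { false, "", "", -1 };
    if (std::regex_search(text, match, rgx))
    {
        result.found = true;
        result.key = match[1].str();      // each element is a sub_match
        result.value = match[2].str();
        result.offset = match.position(0);
    }
    return result;
}
```

Element 0 of the results object describes the overall match; elements 1 and 2 describe the text matched by the two parenthesized groups.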

Of course, the reason for using a regular expression in the first place is to search for text that matches it. Three function templates search for matching text. The function template regex_match checks whether a target sequence exactly matches the regular expression. The function template regex_search looks for the first matching subsequence. The function template regex_replace looks for matches and replaces them with new text. These functions all take a regular expression object that defines the pattern to search for and a target sequence that will be searched. The various overloads of the function templates regex_match and regex_search all return a Boolean value that indicates whether a match was found. Some of the overloads of these function templates also take a match_results object for more detailed results:

if (std::tr1::regex_search("aba.a", match, rgx))
   std::cout << "match found after "
      << match.prefix() << ' ';

In the preceding code snippet, regex_search looks for the first position in the text “aba.a” that matches the regular expression. That match consists of the fourth and fifth characters, so the code snippet will display “match found after aba”.
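The difference among the three function templates is easiest to see when they are applied to the same regular expression. The sketch below wraps each one in a small function. The function names are hypothetical, and the code assumes a C++11 library, where these templates are in namespace std rather than std::tr1.

```cpp
#include <regex>
#include <string>

// Shared regular expression object for the pattern \.a
const std::regex& dot_a()
{
    static const std::regex rgx("\\.a");
    return rgx;
}

// regex_match: true only if the entire text matches the pattern.
bool whole_text_matches(const std::string& text)
{
    return std::regex_match(text, dot_a());
}

// regex_search: true if any subsequence of the text matches.
bool some_part_matches(const std::string& text)
{
    return std::regex_search(text, dot_a());
}

// regex_replace: returns a copy of text with each match replaced.
std::string replace_matches(const std::string& text,
                            const std::string& replacement)
{
    return std::regex_replace(text, dot_a(), replacement);
}
```

For the text “aba.a”, whole_text_matches returns false because the first three characters are not part of the match, some_part_matches returns true, and replace_matches with the replacement “!” returns “aba!”.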

Here is the complete program that these code snippets were taken from.

Example 14.1. Regular Expression Overview (regexhdr/overview.cpp)


#include <regex>
#include <iostream>
#include <string>
using std::tr1::regex;
using std::tr1::cmatch; using std::tr1::regex_search;
using std::cout; using std::string;

int main()
  { // search for the regular expression \.a in "aba.a"
  string str("\\.a");
  regex rgx(str);
  cmatch match;
  if (std::tr1::regex_search("aba.a", match, rgx))
    std::cout << "match found after "
      << match.prefix() << ' ';
  return 0;
  }


Sometimes, your program needs to split its input text into chunks according to a set of rules that can be defined by a regular expression. Two forms of regular expression iterators do this. You can call an STL algorithm with such an iterator object, and the algorithm will see the individual chunks. In the example that follows, all the code before main is infrastructure. The regex object word_sep holds a regular expression that matches any sequence of text consisting of one or more separator characters, where a separator character is a space, a comma, a period, a horizontal tab, a newline, a semicolon, or a colon. The object words, of type sregex_token_iterator, uses word_sep to separate the input sequence [text.begin(), text.end()) into tokens separated by subsequences that match the regular expression. Thus, each token is a word from the input text. The map object word_count counts the number of times each word appears in the text. The while loop loops through the words, as determined by words, and increments the count for each word it encounters. The call to copy shows the result.

Example 14.2. Regular Expression Iteration (regexhdr/regexiter.cpp)



#include <regex>
#include <algorithm>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
using std::tr1::regex;
using std::tr1::sregex_token_iterator;
using std::map;
using std::cout; using std::basic_ostream;
using std::setw; using std::ostream_iterator;
using std::string; using std::copy;

string text =
"The quality of mercy is not strain'd, "
"It droppeth as the gentle rain from heaven "
"Upon the place beneath: it is twice bless'd; "
"It blesseth him that gives and him that takes: "
"'Tis mightiest in the mightiest; it becomes "
"The throned monarch better than his crown; "
"His sceptre shows the force of temporal power, "
"The attribute to awe and majesty, "
"Wherein doth sit the dread and fear of kings ";
// William Shakespeare, The Merchant of Venice

typedef map<string, int> counter;
typedef counter::value_type pairs;

namespace std { // add inserter to namespace std
template <class Elem, class Traits>
basic_ostream<Elem, Traits>& operator<<(
  basic_ostream<Elem, Traits>& out, const pairs& val)
  { // insert pair<string, int> into stream
  return out << setw(10) << val.first
    << ": " << val.second;
  }
}

int main ()
  {  // count occurrences of each word
  regex word_sep ("[ ,.\t\n;:]+");
  sregex_token_iterator words(
    text.begin(), text.end(), word_sep, -1);
  sregex_token_iterator end;

  map<string, int> word_count;
  while (words != end)
    ++word_count[*words++];
  copy(word_count.begin(), word_count.end(),
    ostream_iterator<pairs>(cout, " "));
  return 0;
  }
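The example above uses the token form of iterator, which here reports the text between matches. The other form, sregex_iterator, reports the matches themselves. As a hedged sketch of that second form, the following function collects every word by matching the words directly instead of the separators. The function name and the pattern are hypothetical, and the code assumes a C++11 library, with the names in namespace std rather than std::tr1.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns each run of letters and apostrophes in text.
std::vector<std::string> find_words(const std::string& text)
{
    static const std::regex word("[A-Za-z']+");
    std::vector<std::string> result;

    // Each dereference yields a match_results object for one match.
    std::sregex_iterator it(text.begin(), text.end(), word);
    std::sregex_iterator end;   // default-constructed: end of sequence
    for (; it != end; ++it)
        result.push_back(it->str());
    return result;
}
```

Which form is more convenient depends on whether the separators or the items of interest are easier to describe with a regular expression.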


Finally, it’s possible to customize the regular expression grammars in limited ways. Each basic_regex object has a traits object that it uses to determine whether a particular character has a special meaning and what that meaning is, whether two characters should be treated as equivalent, and so on. This customization is discussed in Chapter 21.
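To give a sense of the questions a traits object answers, the sketch below puts two of them to the default traits class for char. It does not customize anything. The function names are hypothetical, and the code assumes a C++11 library, where the class template regex_traits is in namespace std rather than std::tr1.

```cpp
#include <regex>
#include <string>

// Asks the default traits object whether c belongs to a named
// character class such as "digit" or "alpha".
bool in_class(char c, const std::string& class_name)
{
    std::regex_traits<char> traits;
    std::regex_traits<char>::char_class_type cls =
        traits.lookup_classname(class_name.begin(), class_name.end());
    return traits.isctype(c, cls);
}

// Asks whether two characters are treated as equivalent when case
// is ignored.
bool same_ignoring_case(char a, char b)
{
    std::regex_traits<char> traits;
    return traits.translate_nocase(a) == traits.translate_nocase(b);
}
```

A customized traits class supplies its own answers to questions like these, which changes how the regular expression engine classifies and compares characters.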

Further Reading

Mastering Regular Expressions [Fri02] is a very good discussion, with lots of examples, of how to use regular expressions and of the intricacies of their grammars.

Ecma-262, ECMAScript Language Specification [Ecm03] is the formal reference that the TR1 library’s ECMAScript grammar is based on.

Portable Operating System Interface (POSIX) [Int03b, Int03c] is the formal reference that the TR1 library’s bre, ere, grep, egrep, and awk grammars are based on.
