Chapter 21. Customizing Regular Expressions

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”

Through the Looking-Glass
LEWIS CARROLL

In Chapter 16, we saw that the template basic_regex<Elem> has a second template argument, with a default type regex_traits<Elem>. In this chapter, we look at that parameter in more detail. As a reminder, the declaration of basic_regex looks like this:

    // CLASS TEMPLATE basic_regex
template <class Elem,
    class RXtraits = regex_traits <Elem> >
    class basic_regex;
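As a quick check of that default argument, the following sketch assumes a C++11 library, where the same templates live in namespace std rather than std::tr1. The helper name uses_default_traits is hypothetical, introduced only for this illustration.

```cpp
#include <regex>
#include <type_traits>

// std::regex and std::wregex are basic_regex with the default traits argument.
static_assert(std::is_same<std::regex,
                           std::basic_regex<char, std::regex_traits<char> > >::value,
              "regex uses regex_traits<char>");
static_assert(std::is_same<std::wregex,
                           std::basic_regex<wchar_t, std::regex_traits<wchar_t> > >::value,
              "wregex uses regex_traits<wchar_t>");

// Hypothetical helper: reports whether basic_regex<Elem> picks
// regex_traits<Elem> when no traits argument is given.
// traits_type is the member typedef that C++11 adds to basic_regex.
template <class Elem>
constexpr bool uses_default_traits() {
  return std::is_same<typename std::basic_regex<Elem>::traits_type,
                      std::regex_traits<Elem> >::value;
}
```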

Each basic_regex object contains an object of type RXtraits, which determines how the basic_regex object interprets some features of the regular expression grammar. The default template, regex_traits<Elem>, is required to work only when Elem is char or wchar_t. If you write your own traits type, you can support additional character types, provide a hook for your own type of locale objects, change the rules for character matching, change the collation rules, and add or remove character classifications. If you’re ambitious, you could, for example, add Unicode support to the basic_regex template. Simply pick a suitable character type, such as the up-and-coming char32_t, write a class that provides the necessary traits (let’s call it unicode_traits), and then create a regular expression object: basic_regex<char32_t, unicode_traits>.

The following sections each present a related subset of the requirements for a regular expression traits class. The required elements are presented as code snippets that must be valid for a conforming traits class. The list of elements is followed by a discussion of how the rest of the regular expression library uses these elements, then by the specific choices made in the TR1 library for the required traits types, regex_traits<char> and regex_traits<wchar_t>.

The last section has a synopsis of the class template regex_traits, so that you can see all these elements in one place.

The code snippets use the following names, with their specified meanings:

Tr: the name of the traits type

ctr: a const object of type Tr

tr: an object of type Tr

ch: a value of type Tr::char_type

loc: an object of type Tr::locale_type

p: a value of type const Tr::char_type*

base: one of the values 8, 10, or 16

F1, F2: forward iterators that point at characters of type Tr::char_type and define the range [F1, F2)

21.1. Character Traits

Tr::char_type: a synonym for the type of character that will be used to describe the regular expression.

Tr::string_type: a synonym for std::basic_string<char_type>.

Tr::length(p): returns a value of type std::size_t, which is the smallest non-negative value len such that p[len] == 0. Its time complexity is O(len).

ctr.value(ch, base): returns the numeric value represented by the character ch in the given base. If the character ch is not a valid digit in that base, the expression yields -1.

Usage

The static member function length is called whenever the implementation needs to know the length of a null-terminated character string.

The member function value is called whenever the implementation needs to translate a series of digits into a numeric value, that is, when it encounters an OCTAL ESCAPE SEQUENCE, a HEXADECIMAL ESCAPE SEQUENCE, a UNICODE ESCAPE SEQUENCE, or a REPETITION with an explicit count.

Class Template regex_traits Specializations

The class template regex_traits implements length(p) by calling std::char_traits<Elem>::length(p).

The specialization regex_traits<char> treats the characters ‘0’ through ‘7’ as octal digits, the characters ‘0’ through ‘9’ as decimal digits, and the characters ‘0’ through ‘9’, ‘a’ through ‘f’, and ‘A’ through ‘F’ as hexadecimal digits, with their usual meanings. The specialization regex_traits<wchar_t> treats the wide-character equivalents of those characters in the same way.
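A minimal sketch of these two members in use, assuming the C++11 std::regex_traits<char> as a stand-in for the TR1 specialization. The wrapper names traits_length and digit_value are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>

// Length of a null-terminated string, as the traits class reports it.
std::size_t traits_length(const char* p) {
  return std::regex_traits<char>::length(p);  // static member function
}

// Numeric value of one digit character, or -1 if ch is not a digit in base.
int digit_value(char ch, int base) {
  // The traits requirements define value() only for these three bases.
  assert(base == 8 || base == 10 || base == 16);
  std::regex_traits<char> tr;
  return tr.value(ch, base);
}
```

Under these assumptions, digit_value('8', 8) yields -1, because '8' is not an octal digit, while digit_value('F', 16) and digit_value('f', 16) both yield 15.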

21.2. Locales

Tr::locale_type: a synonym for a copy-constructible type that represents the locale used by the traits class.

tr.imbue(loc): copies loc to tr’s locale object and makes any other changes needed to use the new locale correctly. It returns a copy of the previous locale object.

ctr.getloc(): returns a copy of ctr’s locale object.

Usage

The class template basic_regex<Elem, RXtraits> has three members that use the locale elements of RXtraits. The nested type name basic_regex<Elem, RXtraits>::locale_type is a synonym for RXtraits::locale_type, and the two member functions basic_regex<Elem, RXtraits>::imbue(loc) and basic_regex<Elem, RXtraits>::getloc() both forward to the corresponding member function of the nested RXtraits object.

Class Template regex_traits Specializations

The class template regex_traits defines the nested type name locale_type to be a synonym for std::locale. Each regex_traits object holds a locale object. The member function getloc() returns a copy of that object. The member function imbue(loc) discards any cached information based on the previous locale, copies loc onto the stored locale object, and returns a copy of the previous locale object.
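The locale members can be exercised as follows. This is a sketch against the C++11 std names, not the TR1 ones, and the function names replace_locale and regex_reports_imbued_locale are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <type_traits>

// basic_regex re-exports the locale type of its traits class,
// and regex_traits<char> uses std::locale for it.
static_assert(std::is_same<std::regex::locale_type,
                           std::regex_traits<char>::locale_type>::value,
              "basic_regex forwards locale_type");
static_assert(std::is_same<std::regex_traits<char>::locale_type,
                           std::locale>::value,
              "regex_traits<char> uses std::locale");

// Installs loc in the traits object and returns the locale it replaced.
std::locale replace_locale(std::regex_traits<char>& tr, const std::locale& loc) {
  std::locale previous = tr.imbue(loc);
  assert(tr.getloc() == loc);  // getloc() now reports the new locale
  return previous;
}

// Imbues loc into a regex object and reports whether getloc() sees it.
bool regex_reports_imbued_locale(const std::locale& loc) {
  std::regex re;  // imbue before assigning a pattern
  re.imbue(loc);
  return re.getloc() == loc;
}
```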

21.3. Character Matching

ctr.translate(ch): returns a value of type Tr::char_type. If two characters ch1 and ch2 are equivalent, ctr.translate(ch1) == ctr.translate(ch2).

ctr.translate_nocase(ch): returns a value of type Tr::char_type. If two characters ch1 and ch2 are equivalent without regard to case, ctr.translate_nocase(ch1) == ctr.translate_nocase(ch2).

ctr.lookup_collatename(F1, F2): returns a Tr::string_type object that holds the characters that make up the collating element named by the text sequence pointed at by [F1, F2). If the text sequence does not name a valid collating element, the function returns an empty string.

Usage

Comparing two characters ch1 and ch2 for equality follows these rules.

• If flags() & regex_constants::icase is nonzero (i.e., the icase flag was passed to the basic_regex object’s constructor), ch1 and ch2 are equal if ctr.translate_nocase(ch1) == ctr.translate_nocase(ch2).

• Otherwise, if flags() & regex_constants::collate is nonzero (i.e., the collate flag was passed to the basic_regex object’s constructor and the icase flag was not), ch1 and ch2 are equal if ctr.translate(ch1) == ctr.translate(ch2).

• Otherwise, ch1 and ch2 are equal if ch1 == ch2.
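The three rules above can be written out as a small function. This is an illustration of the rules as stated, not the code of any particular library. It assumes the C++11 std names, and has_flag, chars_equal, and example_chars_equal are hypothetical names.

```cpp
#include <cassert>
#include <locale>
#include <regex>

namespace rc = std::regex_constants;

// True if flag is set in flags; compares against a value-initialized
// syntax_option_type so that it works for any bitmask representation.
bool has_flag(rc::syntax_option_type flags, rc::syntax_option_type flag) {
  return (flags & flag) != rc::syntax_option_type();
}

// Applies the comparison rules in the order the text gives them.
bool chars_equal(const std::regex_traits<char>& tr,
                 rc::syntax_option_type flags, char ch1, char ch2) {
  if (has_flag(flags, rc::icase))
    return tr.translate_nocase(ch1) == tr.translate_nocase(ch2);
  if (has_flag(flags, rc::collate))
    return tr.translate(ch1) == tr.translate(ch2);
  return ch1 == ch2;
}

// Usage: in the classic "C" locale, 'A' and 'a' are equal only under icase.
void example_chars_equal() {
  std::regex_traits<char> tr;
  tr.imbue(std::locale::classic());
  assert(chars_equal(tr, rc::icase, 'A', 'a'));
  assert(!chars_equal(tr, rc::ECMAScript, 'A', 'a'));
}
```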

Adding a COLLATING SYMBOL to a bracket expression (i.e., “[[.elt.]]”) adds the contents of the string object returned by calling ctr.lookup_collatename with the text between the “[.” and “.]” delimiters. If the returned string object is empty (i.e., the text is not the name of a valid collating element), the implementation throws a regex_error object whose code member function returns error_collate.
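The failure path for an unknown collating symbol might be observed like this. The sketch assumes the C++11 std names and a library that follows the rule just described. The name "nosuchelement" and the helper functions are made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Asks the default traits class for the collating element called name.
std::string collating_element(const std::string& name) {
  std::regex_traits<char> tr;
  return tr.lookup_collatename(name.begin(), name.end());
}

// Builds "[[.name.]]" and reports whether construction threw a
// regex_error whose code() is error_collate.
bool rejects_collating_symbol(const std::string& name) {
  try {
    std::regex re("[[." + name + ".]]");
    (void)re;      // constructed without error
    return false;
  } catch (const std::regex_error& e) {
    return e.code() == std::regex_constants::error_collate;
  }
}

// Usage: a made-up name has no collating element, so the regex is rejected.
void example_collating_symbol() {
  assert(collating_element("nosuchelement").empty());
  assert(rejects_collating_symbol("nosuchelement"));
}
```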

Class Template regex_traits Specializations

The class template regex_traits implements translate(ch) by returning ch. The template implements translate_nocase(ch) by returning use_facet<ctype<Elem> >(getloc()).tolower(ch). That is, it uses its current locale to translate the character to lowercase.
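That description can be compared against a library directly. The sketch below assumes the C++11 std::regex_traits<char>. The helper names lowercase_in and translate_matches_description are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>

// What the text says translate_nocase(ch) computes: the ctype facet of
// the given locale converts ch to lowercase.
char lowercase_in(const std::locale& loc, char ch) {
  return std::use_facet<std::ctype<char> >(loc).tolower(ch);
}

// Checks both members against the description for one character.
bool translate_matches_description(char ch) {
  std::regex_traits<char> tr;
  return tr.translate(ch) == ch &&
         tr.translate_nocase(ch) == lowercase_in(tr.getloc(), ch);
}

// Usage: letters and non-letters alike should satisfy the description.
void example_translate() {
  assert(translate_matches_description('Q'));
  assert(translate_matches_description('7'));
}
```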

The TR1 specification does not impose any testable requirements for regex_traits::lookup_collatename.

21.4. Collating

ctr.transform(F1, F2): returns a Tr::string_type object that can be used as a sort key for the character sequence pointed at by the iterator range [F1, F2). If a character sequence pointed at by the iterator range [G1, G2) should sort before the character sequence pointed at by the iterator range [H1, H2), ctr.transform(G1, G2) < ctr.transform(H1, H2).

ctr.transform_primary(F1, F2): returns a Tr::string_type object that can be used as a sort key for the character sequence pointed at by the iterator range [F1, F2) without regard to case. If a character sequence pointed at by the iterator range [G1, G2) should sort before the character sequence pointed at by the iterator range [H1, H2), without regard to case, ctr.transform_primary(G1, G2) < ctr.transform_primary(H1, H2).

Usage

The transform function is used to decide whether a character ch0 is in a character range “[ch1-ch2]”. It is used as follows.

• If flags() & regex_constants::collate is zero (i.e., the collate flag was not passed to the basic_regex object’s constructor), the character is in the range only if ch1 <= ch0 && ch0 <= ch2.

• Otherwise, the match is determined by first translating each of the three characters ch0, ch1, and ch2 with either ctr.translate_nocase(chx) if flags() & regex_constants::icase is nonzero or with ctr.translate(chx) otherwise, producing ch0a, ch1a, and ch2a, respectively. These three characters are each then converted to Tr::string_type objects str0, str1, and str2, respectively. These three string objects are then transformed into sort keys by calling ctr.transform(strx.begin(), strx.end()), producing str0a, str1a, and str2a. The character ch0 is in the range if str1a <= str0a && str0a <= str2a.
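The range rules above translate almost line for line into code. The following is a sketch of the procedure as described, not of any library's internals. It assumes the C++11 std names, and flag_set, sort_key, in_range, and example_in_range are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

namespace rc = std::regex_constants;

// True if flag is set in flags, for any bitmask representation.
bool flag_set(rc::syntax_option_type flags, rc::syntax_option_type flag) {
  return (flags & flag) != rc::syntax_option_type();
}

// Translates ch according to icase, then turns it into a sort key.
std::string sort_key(const std::regex_traits<char>& tr,
                     rc::syntax_option_type flags, char ch) {
  char translated = flag_set(flags, rc::icase) ? tr.translate_nocase(ch)
                                               : tr.translate(ch);
  std::string str(1, translated);
  return tr.transform(str.begin(), str.end());
}

// Decides whether ch0 lies in the range [ch1-ch2].
bool in_range(const std::regex_traits<char>& tr, rc::syntax_option_type flags,
              char ch0, char ch1, char ch2) {
  if (!flag_set(flags, rc::collate))
    return ch1 <= ch0 && ch0 <= ch2;  // plain character comparison
  std::string key0 = sort_key(tr, flags, ch0);
  std::string key1 = sort_key(tr, flags, ch1);
  std::string key2 = sort_key(tr, flags, ch2);
  return key1 <= key0 && key0 <= key2;
}

// Usage, in the classic "C" locale where sort keys follow byte order.
void example_in_range() {
  std::regex_traits<char> tr;
  tr.imbue(std::locale::classic());
  assert(in_range(tr, rc::collate, 'c', 'a', 'f'));
  assert(in_range(tr, rc::collate | rc::icase, 'C', 'a', 'f'));
}
```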

The transform_primary function is used to decide whether a character ch belongs to an EQUIVALENCE CLASS “[[=eq=]]”. The equivalence class name is passed to ctr.transform_primary, and the single-character sequence ch is passed to ctr.transform_primary. If the two returned strings compare equal, the character is a member of the equivalence class.
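The membership test for an equivalence class reduces to comparing two primary sort keys. A sketch, assuming the C++11 std::regex_traits<char> in the classic locale. The function names are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Compares the primary sort key of the class name with the primary
// sort key of the single-character sequence ch.
bool in_equivalence_class(const std::regex_traits<char>& tr,
                          const std::string& class_name, char ch) {
  std::string single(1, ch);
  return tr.transform_primary(class_name.begin(), class_name.end()) ==
         tr.transform_primary(single.begin(), single.end());
}

// Usage: 'a' belongs to the class named "a"; 'b' does not.
void example_equivalence_class() {
  std::regex_traits<char> tr;
  tr.imbue(std::locale::classic());
  assert(in_equivalence_class(tr, "a", 'a'));
  assert(!in_equivalence_class(tr, "a", 'b'));
}
```

Whether 'A' also belongs to the class named "a" depends on how the library and locale define primary keys, so the example makes no claim about it.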

Class Template regex_traits Specializations

The template member function template <class FwdIt> regex_traits<Elem>::transform(FwdIt, FwdIt) constructs a regex_traits<Elem>::string_type object str that holds a copy of the text sequence pointed at by the two iterator arguments. It returns use_facet<collate<Elem> >(getloc()).transform(str.data(), str.data() + str.size()).
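To see that correspondence, the traits result can be compared with a direct call to the collate facet. The sketch assumes the C++11 std names. The helpers facet_sort_key and transform_matches_description are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// The sort key that the collate facet of loc produces for str.
std::string facet_sort_key(const std::locale& loc, const std::string& str) {
  return std::use_facet<std::collate<char> >(loc).transform(
      str.data(), str.data() + str.size());
}

// Checks that regex_traits<char>::transform agrees with the facet
// of the traits object's own locale.
bool transform_matches_description(const std::string& text) {
  std::regex_traits<char> tr;
  return tr.transform(text.begin(), text.end()) ==
         facet_sort_key(tr.getloc(), text);
}

// Usage example.
void example_transform() {
  assert(transform_matches_description("hello"));
}
```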

The template member function template <class FwdIt> regex_traits<Elem>::transform_primary(FwdIt, FwdIt) returns a case-insensitive version of the sort key returned by regex_traits<Elem>::transform.

21.5. Character Classes

Tr::char_class_type: names an unspecified bitmask type that identifies sets of character classifications.

ctr.lookup_classname(F1, F2): returns a value of type Tr::char_class_type that identifies the character classification named by the text sequence pointed at by [F1, F2). If the text sequence does not name a valid character classification, 0 is returned.

ctr.isctype(ch, cl): returns true if the character ch is a member of one of the character classifications identified by the Tr::char_class_type value cl; otherwise, false.

Usage

When a regular expression uses a CHARACTER CLASS, the implementation calls lookup_classname with the name of the class. If the function returns 0, it throws a regex_error object whose code member function returns error_ctype. Otherwise, to match a character against that character class, the implementation calls isctype with the character and the value returned by lookup_classname. The character matches only if isctype returns true.
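The two-step lookup can be sketched as a single function. It mirrors the procedure described above rather than any library's internals. It assumes the C++11 std names, and matches_class and example_matches_class are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Looks up the class called name, then tests ch against it.
// Throws regex_error(error_ctype) if the name is not a valid class.
bool matches_class(const std::regex_traits<char>& tr,
                   const std::string& name, char ch) {
  typedef std::regex_traits<char>::char_class_type class_type;
  class_type cls = tr.lookup_classname(name.begin(), name.end());
  if (cls == class_type())  // a zero mask means "no such class"
    throw std::regex_error(std::regex_constants::error_ctype);
  return tr.isctype(ch, cls);
}

// Usage: '5' is a digit, 'x' is not.
void example_matches_class() {
  std::regex_traits<char> tr;
  tr.imbue(std::locale::classic());
  assert(matches_class(tr, "digit", '5'));
  assert(!matches_class(tr, "digit", 'x'));
}
```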

Class Template regex_traits Specializations

For regex_traits<char>, the member function lookup_classname treats all the following names and, optionally, more, without regard to case, as valid class names:

“d”

“s”

“w”

“alnum”

“alpha”

“blank”

“cntrl”

“digit”

“graph”

“lower”

“print”

“punct”

“space”

“upper”

“xdigit”

For regex_traits<wchar_t>, the member function lookup_classname does the same, for the wide-character equivalents of these names.

The member function regex_traits::isctype converts its argument cl into a value cl_mask of type std::ctype_base::mask in an unspecified manner and then calls use_facet<ctype<Elem> >(getloc()).is(ch, cl_mask). If the result is true, the function returns true. Otherwise, if cl has bits set that identify the character class named “w” and ch is ‘_’, the function returns true. Otherwise, if cl has bits set that identify the character class named “blank” and ch is in an implementation-defined set of characters for which isspace(ch, getloc()) returns true, the function returns true. Otherwise, it returns false.

What that comes down to is that isctype looks to the traits object’s current locale for the details of its character classifications. Granted, the mapping to the mask type is unspecified, but no implementer will violate the obvious correspondence. The rest of the paragraph does some fine-tuning. The character class “w”[1] includes underscores. The description of the character class “blank”[2] paraphrases the description of the isblank function that was added to C with C99 and is included in the TR1 library.
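The two fine-tuning cases can be probed directly. The sketch assumes the C++11 std::regex_traits<char> in the classic "C" locale. The helper names is_in_named_class and example_word_and_blank are hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Tests ch against the character class called name.
bool is_in_named_class(const std::regex_traits<char>& tr,
                       const std::string& name, char ch) {
  return tr.isctype(ch, tr.lookup_classname(name.begin(), name.end()));
}

// Usage: "w" accepts the underscore in addition to letters and digits,
// and "blank" accepts the space character.
void example_word_and_blank() {
  std::regex_traits<char> tr;
  tr.imbue(std::locale::classic());
  assert(is_in_named_class(tr, "w", '_'));
  assert(is_in_named_class(tr, "w", 'a'));
  assert(!is_in_named_class(tr, "w", '-'));
  assert(is_in_named_class(tr, "blank", ' '));
  assert(!is_in_named_class(tr, "blank", 'x'));
}
```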

21.6. The regex_traits Class Template

The class template regex_traits is defined in the header <regex>.

namespace std {   // C++ standard library
  namespace tr1 { // TR1 additions

    // CLASS TEMPLATE regex_traits
template <class Elem>
  struct regex_traits {

    // CHARACTER TRAITS
  typedef Elem  char_type;
  typedef basic_string <Elem> string_type;
  static size_t  length (const char_type *str);
  int value (Elem ch, int base) const;

    // LOCALES
  typedef unspecified locale_type;
  locale_type imbue(locale_type loc);
  locale_type getloc () const;

    // CHARACTER MATCHING
  char_type translate (char_type ch) const;
  char_type translate_nocase (char_type ch) const;
  template<class FwdIt>
    string_type lookup_collatename (
      FwdIt first, FwdIt last) const;

    // COLLATING
  template<class FwdIt>
    string_type transform (
      FwdIt first, FwdIt last) const;
  template<class FwdIt>
    string_type transform_primary (
      FwdIt first, FwdIt last) const;

    // CHARACTER CLASSES
  typedef unspecified char_class_type;
  template<class FwdIt>
    char_class_type lookup_classname (
      FwdIt first, FwdIt last) const;
  bool isctype (char_type ch, char_class_type cls) const;

  };

    // SPECIALIZATIONS OF CLASS TEMPLATE regex_traits
template <>
  struct regex_traits<char>;
template <>
  struct regex_traits<wchar_t>;
} }
