“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”
Through the Looking-Glass
LEWIS CARROLL
In Chapter 16, we saw that the template basic_regex<Elem>
has a second template argument, with a default type regex_traits<Elem>
. In this chapter, we look at that parameter in more detail. As a reminder, the declaration of basic_regex
looks like this:
// CLASS TEMPLATE basic_regex
template <class Elem,
          class RXtraits = regex_traits<Elem> >
class basic_regex;
Each basic_regex
object contains an object of type RXtraits
, which determines how the basic_regex
object interprets some features of the regular expression grammar. The default template, regex_traits<Elem>
, is required to work only when Elem
is char
or wchar_t
. If you write your own traits type, you can support additional character types, provide a hook for your own type of locale objects, change the rules for character matching, change the collation rules, and add or remove character classifications. If you’re ambitious, you could, for example, add Unicode support to the basic_regex
template. Simply pick a suitable character type, such as the up-and-coming char32_t
, write a class that provides the necessary traits (let’s call it unicode_traits), and then create a regular expression object: basic_regex<char32_t, unicode_traits>
.
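As a minimal sketch of the default template argument at work, the following compile-time checks assume a C++11 or later compiler, where the TR1 regular expression facilities live in namespace std rather than std::tr1. The helper uses_default_traits is a hypothetical name introduced only for this illustration.

```cpp
#include <regex>
#include <type_traits>

// std::regex is basic_regex<char>; its second template argument
// defaults to regex_traits<char>.
static_assert(std::is_same<std::regex,
                           std::basic_regex<char, std::regex_traits<char> > >::value,
              "default traits for char");

// The traits type is also visible through the nested name traits_type.
static_assert(std::is_same<std::wregex::traits_type,
                           std::regex_traits<wchar_t> >::value,
              "default traits for wchar_t");

// Hypothetical helper: true if a basic_regex specialization uses the
// default traits type for its character type.
template <class Regex>
constexpr bool uses_default_traits() {
    return std::is_same<typename Regex::traits_type,
                        std::regex_traits<typename Regex::value_type> >::value;
}
```

A regular expression type built on a user-written traits class would make uses_default_traits return false.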
The following sections each present a related subset of the requirements for a regular expression traits class. The required elements are presented as code snippets that must be valid for a conforming traits class. The list of elements is followed by a discussion of how the rest of the regular expression library uses these elements, then by the specific choices made in the TR1 library for the required traits types, regex_traits<char>
and regex_traits<wchar_t>
.
The last section has a synopsis of the class template regex_traits
, so that you can see all these elements in one place.
The code snippets use the following names, with their specified meanings:
• Tr
: the name of the traits type
• ctr
: a const
object of type Tr
• tr
: an object of type Tr
• ch
: a value of type Tr::char_type
• loc
: an object of type Tr::locale_type
• p
: a value of type const Tr::char_type*
• base
: one of the values 8, 10, or 16
• F1
, F2
: forward iterators that point at characters of type Tr::char_type
and define the range [F1, F2)
• Tr::
char_type
: a synonym for the type of character that will be used to describe the regular expression.
• Tr::
string_type
: a synonym for std::basic_string<char_type>
.
• Tr::
length
(p)
: returns a value of type std::size_t
, which is the smallest non-negative value len
such that p[len] == 0
. Its time complexity is O(len).
• ctr.
value
(ch, base)
: returns the numeric value represented by the character ch
in the given base. If the character ch
is not a valid digit in that base, the expression yields -1
.
The static member function length
is called whenever the implementation needs to know the length of a null-terminated character string.
The member function value
is called whenever the implementation needs to translate a series of digits into a numeric value, that is, when it encounters an OCTAL ESCAPE SEQUENCE
, a HEXADECIMAL ESCAPE SEQUENCE
, a UNICODE ESCAPE SEQUENCE
, or a REPETITION
with an explicit count.
regex_traits Specializations
The class template regex_traits implements length(p) by calling std::char_traits<Elem>::length(p).
The specialization regex_traits<char>
treats the characters ‘0
’ through ‘7
’ as octal digits, the characters ‘0
’ through ‘9
’ as decimal digits, and the characters ‘0
’ through ‘9
’, ‘a
’ through ‘f
’, and ‘A
’ through ‘F
’ as hexadecimal digits, with their usual meanings. The specialization regex_traits<wchar_t>
treats the wide-character equivalents of those characters in the same way.
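The way an implementation might combine length and value when it reads a run of digits can be sketched as follows. This is an illustration only, assuming a C++11 compiler with <regex> in namespace std; digits_to_number is a hypothetical helper, not a library function, and real implementations are free to do this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>

// Hypothetical helper: converts a null-terminated string of digits to a
// number by calling value() on each character. Returns -1 if any
// character is not a valid digit in the given base.
long digits_to_number(const char* digits, int base) {
    assert(base == 8 || base == 10 || base == 16);  // the only bases value() accepts

    std::regex_traits<char> tr;
    const std::size_t len = std::regex_traits<char>::length(digits);

    long result = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const int v = tr.value(digits[i], base);
        if (v < 0)
            return -1;  // not a digit in this base
        result = result * base + v;
    }
    return result;
}
```

For example, digits_to_number("41", 16) accumulates 4 * 16 + 1, and digits_to_number("19", 8) reports failure because ‘9’ is not an octal digit.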
• Tr::
locale_type
: a synonym for a copy-constructible type that represents the locale used by the traits class.
• tr.
imbue
(loc)
: copies loc
to tr
’s locale object and makes any other changes needed to use the new locale correctly. It returns a copy of the previous locale object.
• ctr.
getloc
()
: returns a copy of ctr
’s locale object.
The class template basic_regex<Elem, RXtraits>
has three members that use the locale elements of RXtraits
. The nested type name basic_regex<Elem, RXtraits>::locale_type
is a synonym for RXtraits::locale_type
, and the two member functions basic_regex<Elem, RXtraits>::imbue(loc)
and basic_regex<Elem, RXtraits>::getloc()
both forward to the corresponding member function of the nested RXtraits
object.
regex_traits Specializations
The class template regex_traits
defines the nested type name locale_type
to be a synonym for std::locale
. Each regex_traits
object holds a locale object. The member function getloc()
returns a copy of that object. The member function imbue(loc)
discards any cached information based on the previous locale, copies loc
onto the stored locale object, and returns a copy of the previous locale object.
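A short sketch of the locale members in use, assuming a C++11 compiler with <regex> in namespace std. The function name imbue_round_trip is hypothetical; the sketch uses only the classic “C” locale so that it does not depend on which named locales are installed.

```cpp
#include <cassert>
#include <locale>
#include <regex>

// Hypothetical helper: imbues a traits object and a regex object with the
// classic "C" locale and reports whether getloc() then returns it.
bool imbue_round_trip() {
    std::regex_traits<char> tr;
    const std::locale before = tr.getloc();

    // imbue returns a copy of the locale that it replaced.
    const std::locale previous = tr.imbue(std::locale::classic());
    assert(previous == before);

    // basic_regex forwards imbue and getloc to its traits object.
    std::regex re;
    re.imbue(std::locale::classic());

    return tr.getloc() == std::locale::classic()
        && re.getloc() == std::locale::classic();
}
```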
• ctr.
translate
(ch)
: returns a value of type Tr::char_type
. If two characters ch1
and ch2
are equivalent, ctr.translate(ch1) == ctr.translate(ch2)
.
• ctr.
translate_nocase
(ch)
: returns a value of type Tr::char_type
. If two characters ch1
and ch2
are equivalent without regard to case, ctr.translate_nocase(ch1) == ctr.translate_nocase(ch2)
.
• ctr.
lookup_collatename
(F1, F2)
: returns a Tr::string_type
object that holds the characters that make up the collating element named by the text sequence pointed at by [F1, F2)
. If the text sequence does not name a valid collating element, the function returns an empty string.
Comparing two characters ch1
and ch2
for equality follows these rules.
• If flags() & regex_constants::icase
is nonzero (i.e., the icase
flag was passed to the basic_regex
object’s constructor), ch1
and ch2
are equal if ctr.translate_nocase(ch1) == ctr.translate_nocase(ch2)
.
• Otherwise, if flags() & regex_constants::collate
is nonzero (i.e., the collate
flag was passed to the basic_regex
object’s constructor and the icase
flag was not), ch1
and ch2
are equal if ctr.translate(ch1) == ctr.translate(ch2)
.
• Otherwise, ch1
and ch2
are equal if ch1 == ch2
.
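The three rules above can be sketched as a single function. This is an illustration of the rules as stated, not the code of any particular library; chars_equal is a hypothetical name, and the sketch assumes a C++11 compiler with <regex> in namespace std.

```cpp
#include <regex>

// Hypothetical helper: compares two characters the way the rules above
// describe, given a traits object and the regex's syntax flags.
template <class Tr>
bool chars_equal(const Tr& ctr,
                 std::regex_constants::syntax_option_type flags,
                 typename Tr::char_type ch1,
                 typename Tr::char_type ch2) {
    namespace rc = std::regex_constants;

    if ((flags & rc::icase) != 0)      // rule 1: case-insensitive match
        return ctr.translate_nocase(ch1) == ctr.translate_nocase(ch2);
    if ((flags & rc::collate) != 0)    // rule 2: locale-sensitive match
        return ctr.translate(ch1) == ctr.translate(ch2);
    return ch1 == ch2;                 // rule 3: plain comparison
}
```

With std::regex_traits<char> and the icase flag, ‘A’ and ‘a’ compare equal; without icase they do not.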
Adding a COLLATING SYMBOL to a bracket expression (i.e., “[[.elt.]]”) adds the contents of the string object returned by calling ctr.lookup_collatename with the text between the “[.” and “.]” delimiters. If the returned string object is empty (i.e., the text is not the name of a valid collating element), the implementation throws a regex_error
object whose code
member function returns error_collate
.
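The following sketch probes this behavior from the outside, assuming a C++11 compiler with <regex> in namespace std. Both helper names are hypothetical. Which names count as valid collating elements depends on the library and the locale, so the sketch makes no claim about any particular valid name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if regex_traits<char>, in its current locale,
// accepts the name as a collating element.
bool is_collating_element(const std::string& name) {
    std::regex_traits<char> tr;
    return !tr.lookup_collatename(name.begin(), name.end()).empty();
}

// Hypothetical helper: true if compiling the pattern throws regex_error.
bool compile_fails(const std::string& pattern) {
    try {
        std::regex re(pattern);
    } catch (const std::regex_error&) {
        return true;
    }
    return false;
}
```

A name such as "no-such-element" is not expected to be recognized, so a pattern like "[[.no-such-element.]]" should fail to compile.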
regex_traits Specializations
The class template regex_traits implements translate(ch) by returning ch. The template implements translate_nocase(ch) by returning use_facet<ctype<Elem> >(getloc()).tolower(ch). That is, it uses its current locale to translate the character to lowercase.
The TR1 specification does not impose any testable requirements for regex_traits::lookup_collatename
.
• ctr.
transform
(F1, F2)
: returns a Tr::string_type
object that can be used as a sort key for the character sequence pointed at by the iterator range [F1,F2)
. If a character sequence pointed at by the iterator range [G1,G2)
should sort before the character sequence pointed at by the iterator range [H1,H2)
, ctr.transform(G1, G2) < ctr.transform(H1, H2)
.
• ctr.
transform_primary
(F1, F2)
: returns a Tr::string_type
object that can be used as a sort key for the character sequence pointed at by the iterator range [F1,F2)
without regard to case. If a character sequence pointed at by the iterator range [G1,G2)
should sort before the character sequence pointed at by the iterator range [H1,H2)
, without regard to case, ctr.transform_primary(G1, G2) < ctr.transform_primary(H1, H2)
.
The transform function is used to decide whether a character ch0 is in a character range “[ch1-ch2]”. It is used as follows.
• If flags() & regex_constants::collate
is zero (i.e., the collate
flag was not passed to the basic_regex
object’s constructor), the character is in the range only if ch1 <= ch0 && ch0 <= ch2
.
• Otherwise, the match is determined by first translating each of the three characters ch0, ch1, and ch2 with ctr.translate_nocase(chx) if flags() & regex_constants::icase is nonzero or with ctr.translate(chx) otherwise, producing ch0a, ch1a, and ch2a, respectively. These three characters are each then converted to Tr::string_type objects str0, str1, and str2, respectively. These three string objects are then transformed into sort keys by calling ctr.transform(strx.begin(), strx.end()), producing str0a, str1a, and str2a. The character ch0 is in the range if str1a <= str0a && str0a <= str2a.
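The range test described by these two rules can be sketched as follows. This is a direct transcription of the description, not any library’s actual code; in_range is a hypothetical name, and the sketch assumes a C++11 compiler with <regex> in namespace std.

```cpp
#include <regex>

// Hypothetical helper: is ch0 in the character range [ch1-ch2]?
template <class Tr>
bool in_range(const Tr& ctr,
              std::regex_constants::syntax_option_type flags,
              typename Tr::char_type ch0,
              typename Tr::char_type ch1,
              typename Tr::char_type ch2) {
    namespace rc = std::regex_constants;
    typedef typename Tr::char_type char_type;
    typedef typename Tr::string_type string_type;

    // Without the collate flag, compare the character values directly.
    if ((flags & rc::collate) == 0)
        return ch1 <= ch0 && ch0 <= ch2;

    // With collate: translate each character, then compare sort keys.
    const bool nocase = (flags & rc::icase) != 0;
    auto sort_key = [&ctr, nocase](char_type ch) -> string_type {
        ch = nocase ? ctr.translate_nocase(ch) : ctr.translate(ch);
        const string_type str(1, ch);
        return ctr.transform(str.begin(), str.end());
    };

    const string_type key0 = sort_key(ch0);
    const string_type key1 = sort_key(ch1);
    const string_type key2 = sort_key(ch2);
    return key1 <= key0 && key0 <= key2;
}
```

In the classic “C” locale the sort keys order single characters the same way their character values do, so the two branches agree there; they can differ in locales with other collation rules.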
The transform_primary function is used to decide whether a character ch belongs to an EQUIVALENCE CLASS “[[=eq=]]”
. The equivalence class name is passed to ctr.transform_primary
, and the single-character sequence ch
is passed to ctr.transform_primary
. If the two returned strings compare equal, the character is a member of the equivalence class.
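The equivalence-class test can be sketched in a few lines, assuming a C++11 compiler with <regex> in namespace std. The name in_equivalence_class is hypothetical. The keys that transform_primary produces depend on the library and the locale, so the results for a given pair of characters may vary between implementations.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does ch belong to the equivalence class [[=name=]]?
template <class Tr>
bool in_equivalence_class(const Tr& ctr,
                          typename Tr::char_type ch,
                          const typename Tr::string_type& name) {
    // Build the single-character sequence for ch.
    const typename Tr::string_type single(1, ch);

    // The character is a member if the two primary sort keys are equal.
    return ctr.transform_primary(name.begin(), name.end())
        == ctr.transform_primary(single.begin(), single.end());
}
```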
regex_traits Specializations
The template member function template <class FwdIt> regex_traits<Elem>::transform(FwdIt, FwdIt)
constructs a regex_traits<Elem>::string_type
object str
that holds a copy of the text sequence pointed at by the two iterator arguments. It returns use_facet<collate<Elem> >(getloc()).transform(str.data(), str.data() + str.size())
.
The template member function template <class FwdIt> regex_traits<Elem>::transform_primary(FwdIt, FwdIt)
returns a case-insensitive version of the sort key returned by regex_traits<Elem>::transform
.
• Tr::
char_class_type
: names an unspecified bitmask type that identifies sets of character classifications.
• ctr.
lookup_classname
(F1, F2)
: returns a value of type Tr::char_class_type
that identifies the character classification named by the text sequence pointed at by [F1, F2)
. If the text sequence does not name a valid character classification, 0 is returned.
• ctr.
isctype
(ch, cl)
: returns true
if the character ch
is a member of one of the character classifications identified by the Tr::char_class_type
value cl
; otherwise, false
.
When a regular expression uses a CHARACTER CLASS
, the implementation calls lookup_classname
with the name of the class. If the function returns 0, it throws a regex_error
object whose code
member function returns error_ctype
. Otherwise, to match a character against that character class, the implementation calls isctype
with the character and the value returned by lookup_classname
. The character matches only if isctype
returns true
.
regex_traits Specializations
For regex_traits<char>
, the member function lookup_classname
treats all the following names and, optionally, more, without regard to case, as valid class names:
• “d”
• “s”
• “w”
• “alnum”
• “alpha”
• “blank”
• “digit”
• “graph”
• “lower”
• “print”
• “punct”
• “space”
• “upper”
• “xdigit”
For regex_traits<wchar_t>
, the member function lookup_classname
does the same, for the wide-character equivalents of these names.
The member function regex_traits::isctype
converts its argument cl
into a value cl_mask
of type std::ctype_base::mask
in an unspecified manner and then calls use_facet<ctype<Elem> >(getloc()).is(ch, cl_mask)
. If the result is true
, the function returns true
. Otherwise, if cl
has bits set that identify the character class named “w”
and ch
is ‘_
’, the function returns true
. Otherwise, if cl
has bits set that identify the character class named “blank”
and ch
is in an implementation-defined set of characters for which isspace(ch, getloc())
returns true
, the function returns true
. Otherwise, it returns false
.
What that comes down to is that isctype
looks to the traits object’s current locale for the details of its character classifications. Granted, the mapping to the mask type is unspecified, but no implementer will violate the obvious correspondence. The rest of the paragraph does some fine-tuning. The character class “w”
[1] includes underscores. The description of the character class “blank”
[2] paraphrases the description of the isblank
function that was added to C with C99 and is included in the TR1 library.
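The two-step protocol of lookup_classname followed by isctype can be sketched as follows, assuming a C++11 compiler with <regex> in namespace std. The helper in_class is a hypothetical name; it throws regex_error with error_ctype for an unknown class name to mirror the behavior described above for the regular expression compiler.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether ch belongs to the named character
// class, using regex_traits<char> in its current locale.
bool in_class(char ch, const std::string& name) {
    typedef std::regex_traits<char>::char_class_type mask_type;

    std::regex_traits<char> tr;
    const mask_type cl = tr.lookup_classname(name.begin(), name.end());

    // A zero (value-initialized) mask means the name was not recognized.
    if (cl == mask_type())
        throw std::regex_error(std::regex_constants::error_ctype);

    return tr.isctype(ch, cl);
}
```

For example, in_class('_', "w") is expected to be true because of the special handling of underscores, even though the locale’s own classification does not treat ‘_’ as alphanumeric.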
regex_traits Class Template
The class template regex_traits
is defined in the header <regex>
.
namespace std {     // C++ standard library
  namespace tr1 {   // TR1 additions

    // CLASS TEMPLATE regex_traits
    template <class Elem>
    struct regex_traits {
      // CHARACTER TRAITS
      typedef Elem char_type;
      typedef basic_string<Elem> string_type;
      static size_t length(const char_type *str);
      int value(Elem ch, int base) const;

      // LOCALES
      typedef unspecified locale_type;
      locale_type imbue(locale_type loc);
      locale_type getloc() const;

      // CHARACTER MATCHING
      char_type translate(char_type ch) const;
      char_type translate_nocase(char_type ch) const;
      template <class FwdIt>
        string_type lookup_collatename(
          FwdIt first, FwdIt last) const;

      // COLLATING
      template <class FwdIt>
        string_type transform(
          FwdIt first, FwdIt last) const;
      template <class FwdIt>
        string_type transform_primary(
          FwdIt first, FwdIt last) const;

      // CHARACTER CLASSES
      typedef unspecified char_class_type;
      template <class FwdIt>
        char_class_type lookup_classname(
          FwdIt first, FwdIt last) const;
      bool isctype(char_type ch, char_class_type cls) const;
    };

    // SPECIALIZATIONS OF CLASS TEMPLATE regex_traits
    template <>
    struct regex_traits<char>;
    template <>
    struct regex_traits<wchar_t>;

  } }