10

Strings and Regular Expressions

Prefer the standard to the offbeat.

– Strunk & White

10.1 Introduction

Text manipulation is a major part of most programs. The C++ standard library offers a string type to save most users from C-style manipulation of arrays of characters through pointers. A string_view type allows us to manipulate sequences of characters however they may be stored (e.g., in a std::string or a char[]). In addition, regular expression matching is offered to help find patterns in text. The regular expressions are provided in a form similar to what is common in most modern languages. Both strings and regex objects can use a variety of character types (e.g., Unicode).

10.2 Strings

The standard library provides a string type to complement the string literals (§1.2.1); string is a regular type (§8.2, §14.5) for owning and manipulating a sequence of characters of various character types. The string type provides a variety of useful string operations, such as concatenation. For example:

string compose(const string& name, const string& domain)
{
        return name + @ + domain;
}

auto addr = compose("dmr","bell-labs.com");

Here, addr is initialized to the character sequence [email protected]. “Addition” of strings means concatenation. You can concatenate a string, a string literal, a C-style string, or a character to a string. The standard string has a move constructor, so returning even long strings by value is efficient (§6.2.2).

In many applications, the most common form of concatenation is adding something to the end of a string. This is directly supported by the += operation. For example:

void m2(string& s1, string& s2)
{
        s1 = s1 + '
';   // append newline
        s2 += '
';        // append newline
}

The two ways of adding to the end of a string are semantically equivalent, but I prefer the latter because it is more explicit about what it does, more concise, and possibly more efficient.

A string is mutable. In addition to = and +=, subscripting (using [ ]) and substring operations are supported. For example:

string name = "Niels Stroustrup";

void m3()
{
        string s = name.substr(6,10);            // s = "Stroustrup"
        name.replace(0,5,"nicholas");          // name becomes "nicholas Stroustrup"
        name[0] = toupper(name[0]);            // name becomes "Nicholas Stroustrup"
}

The substr() operation returns a string that is a copy of the substring indicated by its arguments. The first argument is an index into the string (a position), and the second is the length of the desired substring. Since indexing starts from 0, s gets the value Stroustrup.

The replace() operation replaces a substring with a value. In this case, the substring starting at 0 with length 5 is Niels; it is replaced by nicholas. Finally, I replace the initial character with its uppercase equivalent. Thus, the final value of name is Nicholas Stroustrup. Note that the replacement string need not be the same size as the substring that it is replacing.

Among the many useful string operations are assignment (using =), subscripting (using [ ] or at() as for vector; §12.2.2), comparison (using == and !=), and lexicographical ordering (using <, <=, >, and >=), iteration (using iterators, begin(), and end() as for vector; §13.2), input (§11.3), and streaming (§11.7.3).

Naturally, strings can be compared against each other, against C-style strings §1.7.1), and against string literals. For example:

string incantation;

void respond(const string& answer)
{
        if (answer == incantation) {
                // ... perform magic ...
        }
        else if (answer == "yes") {
                // ...
        }
        // ...
}

If you need a C-style string (a zero-terminated array of char), string offers read-only access to its contained characters (c_str() and data()). For example:

void print(const string& s)
{
        printf("For people who like printf: %s
",s.c_str());           // s.c_str() returns a pointer to s' characters
        cout << "For people who like streams: " << s << '
';
}

A string literal is by definition a const char*. To get a literal of type std::string use an s suffix. For example:

auto cat = "Cat"s;    // a std::string
auto dog = "Dog";   // a C-style string: a const char*

To use the s suffix, you need to use the namespace std::literals::string_literals6.6).

10.2.1 string Implementation

Implementing a string class is a popular and useful exercise. However, for general-purpose use, our carefully crafted first attempts rarely match the standard string in convenience or performance. These days, string is usually implemented using the short-string optimization. That is, short string values are kept in the string object itself and only longer strings are placed on free store. Consider:

string s1 {"Annemarie"};                          // short string
string s2 {"Annemarie Stroustrup"};       // long string

The memory layout will be something like this:

Images

When a string’s value changes from a short to a long string (and vice versa) its representation adjusts appropriately. How many characters can a “short” string have? That’s implementation defined, but “about 14 characters” isn’t a bad guess.

The actual performance of strings can depend critically on the run-time environment. In particular, in multi-threaded implementations, memory allocation can be relatively costly. Also, when lots of strings of differing lengths are used, memory fragmentation can result. These are the main reasons that the short-string optimization has become ubiquitous.

To handle multiple character sets, string is really an alias for a general template basic_string with the character type char:

template<typename Char>
class basic_string {
        // ... string of Char ...
};

using string = basic_string<char>;

A user can define strings of arbitrary character types. For example, assuming we have a Japanese character type Jchar, we can write:

using Jstring = basic_string<Jchar>;

Now we can do all the usual string operations on Jstring, a string of Japanese characters.

10.3 String Views

The most common use of a sequence of characters is to pass it to some function to read. This can be achieved by passing a string by value, a reference to a string, or a C-style string. In many systems there are further alternatives, such as string types not offered by the standard. In all of these cases, there are extra complexities when we want to pass a substring. To address this, the standard library offers string_view; a string_view is basically a (pointer,length) pair denoting a sequence of characters:

Images

A string_view gives access to a contiguous sequence of characters. The characters can be stored in many possible ways, including in a string and in a C-style string. A string_view is like a pointer or a reference in that it does not own the characters it points to. In that, it resembles an STL pair of iterators (§13.3).

Consider a simple function concatenating two strings:

string cat(string_view sv1, string_view sv2)
{
     string res {sv1};           // initialize from sv1
     return res += sv2;        // append from sv2 and return

}

We can call this cat():

string king = "Harold";
auto s1 = cat(king,"William");                          // HaroldWilliam: string and const char*
auto s2 = cat(king,king);                                   // HaroldHarold: string and string
auto s3 = cat("Edward","Stephen"sv);           // EdwardStephen: const char * and string_view
auto s4 = cat("Canute"sv,king);                      // CanuteHarold
auto s5 = cat({&king[0],2},"Henry"sv);           // HaHenry
auto s6 = cat({&king[0],2},{&king[2],4});         // Harold

This cat() has three advantages over the compose() that takes const string& arguments (§10.2):

  • It can be used for character sequences managed in many different ways.

  • We can easily pass a substring.

  • We don’t have to create a string to pass a C-style string argument.

Note the use of the sv (“string view”) suffix. To use that, we need to make it visible:

using namespace std::literals::string_view_literals;       // §6.6

Why bother with a suffix? The reason is that when we pass "Edward" we need to construct a string_view from a const char* and that requires counting the characters. For "Stephen"sv the length is computed at compile time.

A string_view defines a range, so we can traverse its characters. For example:

void print_lower(string_view sv1)
{
        for (char ch : sv1)
                cout << tolower(ch);
}

One significant restriction of string_view is that it is a read-only view of its characters. For example, you cannot use a string_view to pass characters to a function that modifies its argument to lowercase. For that, you might consider using a span15.2.2).

Think of string_view as a kind of pointer; to be used, it must point to something:

string_view bad()
{
        string s = "Once upon a time";
        return {&s[5],4};                // bad: returning a pointer to a local
}

Here, the returned string will be destroyed before we can use its characters.

The behavior of out-of-range access to a string_view is undefined. If you want guaranteed range checking, use at(), which throws out_of_range for attempted out-of-range access, or gsl::string_span15.2.2).

10.4 Regular Expressions

Regular expressions are a powerful tool for text processing. They provide a way to simply and tersely describe patterns in text (e.g., a U.S. postal code such as TX 77845, or an ISO-style date, such as 2009-06-07) and to efficiently find such patterns. In <regex>, the standard library provides support for regular expressions in the form of the std::regex class and its supporting functions. To give a taste of the style of the regex library, let us define and print a pattern:

regex pat {R"(w{2}s*d{5}(-d{4})?)"};   // U.S. postal code pattern: XXddddd-dddd and variants

People who have used regular expressions in just about any language will find w{2}s*d{5}(-d{4})? familiar. It specifies a pattern starting with two letters w{2} optionally followed by some space s* followed by five digits d{5} and optionally followed by a dash and four digits -d{4}. If you are not familiar with regular expressions, this may be a good time to learn about them ([Stroustrup,2009], [Maddock,2009], [Friedl,1997]).

To express the pattern, I use a raw string literal starting with R"( and terminated by )". This allows backslashes and quotes to be used directly in the string. Raw strings are particularly suitable for regular expressions because they tend to contain a lot of backslashes. Had I used a conventional string, the pattern definition would have been:

regex pat {"\w{2}\s*\d{5}(-\d{4})?"};    // U.S. postal code pattern

In <regex>, the standard library provides support for regular expressions:

  • regex_match(): Match a regular expression against a string (of known size) (§10.4.2).

  • regex_search(): Search for a string that matches a regular expression in an (arbitrarily long) stream of data (§10.4.1).

  • regex_replace(): Search for strings that match a regular expression in an (arbitrarily long) stream of data and replace them.

  • regex_iterator: Iterate over matches and submatches (§10.4.3).

  • regex_token_iterator: Iterate over non-matches.

10.4.1 Searching

The simplest way of using a pattern is to search for it in a stream:

int lineno = 0;
for (string line; getline(cin,line); ) {                  // read into line buffer
        ++lineno;
        smatch matches;                                       // matched strings go here
        if (regex_search(line,matches,pat))         // search for pat in line
                cout << lineno << ": " << matches[0] << '
';
}

The regex_search(line,matches,pat) searches the line for anything that matches the regular expression stored in pat and if it finds any matches, it stores them in matches. If no match was found, regex_search(line,matches,pat) returns false. The matches variable is of type smatch. The “s” stands for “sub” or “string,” and an smatch is a vector of submatches of type string. The first element, here matches[0], is the complete match. The result of a regex_search() is a collection of matches, typically represented as an smatch:

void use()
{
        ifstream in("file.txt");        // input file
        if (!in) {                               // check that the file was opened
                cerr << "no file
";
                return;
        }

        regex pat {R"(w{2}s*d{5}(-d{4})?)"};   // U.S. postal code pattern

        int lineno = 0;
        for (string line; getline(in,line); ) {
                ++lineno;
                smatch matches;     // matched strings go here
                if (regex_search(line, matches, pat)) {
                        cout << lineno << ": " << matches[0] << '
';           // the complete match
                        if (1<matches.size() && matches[1].matched)        // if there is a sub-pattern
                                                                                                            // and if it is matched
                                cout << "	: " << matches[1] << '
';                 // submatch
                  }
          }
}

This function reads a file looking for U.S. postal codes, such as TX77845 and DC 20500-0001. An smatch type is a container of regex results. Here, matches[0] is the whole pattern and matches[1] is the optional four-digit subpattern (-d{4})?.

The newline character, , can be part of a pattern, so we can search for multiline patterns. Obviously, we shouldn’t read one line at a time if we want to do that.

The regular expression syntax and semantics are designed so that regular expressions can be compiled into state machines for efficient execution [Cox,2007]. The regex type performs this compilation at run time.

10.4.2 Regular Expression Notation

The regex library can recognize several variants of the notation for regular expressions. Here, I use the default notation, a variant of the ECMA standard used for ECMAScript (more commonly known as JavaScript). The syntax of regular expressions is based on special characters:

Regular Expression Special Characters

.

Any single character (a “wildcard”)

Next character has a special meaning

[

Begin character class

*

Zero or more (suffix operation)

]

End character class

+

One or more (suffix operation)

{

Begin count

?

Optional (zero or one) (suffix operation)

}

End count

|

Alternative (or)

(

Begin grouping

^

Start of line; negation

)

End grouping

$

End of line

For example, we can specify a line starting with zero or more As followed by one or more Bs

followed by an optional C like this:

^A*B+C?$

Examples that match:

AAAAAAAAAAAABBBBBBBBBC
BC
B

Examples that do not match:

AAAAA                    // no B
    AAAABC             // initial space
AABBCC                 // too many Cs

A part of a pattern is considered a subpattern (which can be extracted separately from an smatch) if it is enclosed in parentheses. For example:

d+-d+            // no subpatterns
d+(-d+)          // one subpattern
(d+)(-d+)        // two subpatterns

A pattern can be optional or repeated (the default is exactly once) by adding a suffix:

Repetition

{ n }

Exactly n times

{ n, } n

or more times

{n,m}

At least n and at most m times

*

Zero or more, that is, {0,}

+

One or more, that is, {1,}

?

Optional (zero or one), that is {0,1}

For example:

A{3}B{2,4}C*

Examples that match:

AAABBC
AAABBB

Examples that do not match:

AABBC                     // too few As
AAABC                     // too few Bs
AAABBBBBCCC     // too many Bs

A suffix ? after any of the repetition notations (?, *, +, and { }) makes the pattern matcher “lazy” or “non-greedy.” That is, when looking for a pattern, it will look for the shortest match rather than the longest. By default, the pattern matcher always looks for the longest match; this is known as the Max Munch rule. Consider:

ababab

The pattern (ab)+ matches all of ababab. However, (ab)+? matches only the first ab.

The most common character classifications have names:

Character Classes

alnum

Any alphanumeric character

alpha

Any alphabetic character

blank

Any whitespace character that is not a line separator

cntrl

Any control character

d

Any decimal digit

digit

Any decimal digit

graph

Any graphical character

lower

Any lowercase character

print

Any printable character

punct

Any punctuation character

s

Any whitespace character

space

Any whitespace character

upper

Any uppercase character

w

Any word character (alphanumeric characters plus the underscore)

xdigit

Any hexadecimal digit character

In a regular expression, a character class name must be bracketed by [: :]. For example, [:digit:] matches a decimal digit. Furthermore, they must be used within a [ ] pair defining a character class.

Several character classes are supported by shorthand notation:

Character Class Abbreviations

d

A decimal digit

[[:digit:]]

s

A space (space, tab, etc.)

[[:space:]]

w

A letter (a-z) or digit (0-9) or underscore (_)

[_[:alnum:]]

D

Not d

[^[:digit:]]

S

Not s

[^[:space:]]

W

Not w

[^_[:alnum:]]

In addition, languages supporting regular expressions often provide:

Nonstandard (but Common) Character Class Abbreviations

l

A lowercase character

[[:lower:]]

u

An uppercase character

[[:upper:]]

L

Not l

[^[:lower:]]

U

Not u

[^[:upper:]]

For full portability, use the character class names rather than these abbreviations.

As an example, consider writing a pattern that describes C++ identifiers: an underscore or a letter followed by a possibly empty sequence of letters, digits, or underscores. To illustrate the subtleties involved, I include a few false attempts:

[:alpha:][:alnum:]*                  // wrong: characters from the set ":alpha" followed by ...
[[:alpha:]][[:alnum:]]*             // wrong: doesn't accept underscore ( _ is not alpha)
([[:alpha:]]|_)[[:alnum:]]*        // wrong: underscore is not part of alnum either

([[:alpha:]]|_)([[:alnum:]]|_)*           // OK, but clumsy
[[:alpha:]_][[:alnum:]_]*                  // OK: include the underscore in the character classes
[_[:alpha:]][_[:alnum:]]*                  // also OK
[_[:alpha:]]w*                                 // w is equivalent to [_[:alnum:]]

Finally, here is a function that uses the simplest version of regex_match()10.4.1) to test whether a string is an identifier:

bool is_identifier(const string& s)
{
        regex pat {"[_[:alpha:]]\w*"};  // underscore or letter
                                                           // followed by zero or more underscores, letters, or digits
        return regex_match(s,pat);
}

Note the doubling of the backslash to include a backslash in an ordinary string literal. Use raw string literals (§10.4)to alleviate problems with special characters. For example:

bool is_identifier(const string& s)
{
        regex pat {R"([_[:alpha:]]w*)"};
        return regex_match(s,pat);
}

Here are some examples of patterns:

Ax*                  // A, Ax, Axxxx
Ax+                 // Ax, Axxx         Not A
d-?d              // 1-2, 12            Not 1--2
w{2}-d{4,5}   // Ab-1234, XX-54321, 22-5432         Digits are in w
(d*:)?(d+)      // 12:3, 1:23, 123, :123       Not 123:
(bs|BS)           // bs, BS             Not bS
[aeiouy]          // a, o, u              An English vowel, not x
[^aeiouy]        // x, k                  Not an English vowel, not e
[a^eiouy]        // a, ^, o, u           An English vowel or ^

A group (a subpattern) potentially to be represented by a sub_match is delimited by parentheses. If you need parentheses that should not define a subpattern, use (?: rather than plain (. For example:

(s|:|,)*(d*)      // optional spaces, colons, and/or commas followed by an optional number

Assuming that we were not interested in the characters before the number (presumably separators), we could write:

(?:s|:|,)*(d*)    // optional spaces, colons, and/or commas followed by an optional number

This would save the regular expression engine from having to store the first characters: the (?: variant has only one subpattern.

Regular Expression Grouping Examples

d*sw+

No groups (subpatterns)

(d*)s(w+)

Two groups

(d*)(s(w+))+

Two groups (groups do not nest)

(s*w*)+

One group; one or more subpatterns;

only the last subpattern is saved as a sub_match

<(.*?)>(.*?)</1>

Three groups; the 1 means “same as group 1”

That last pattern is useful for parsing XML. It finds tag/end-of-tag markers. Note that I used a non-greedy match (a lazy match), .*?, for the subpattern between the tag and the end tag. Had I used plain .*, this input would have caused a problem:

Always look on the <b>bright</b> side of <b>life</b>.

A greedy match for the first subpattern would match the first < with the last >. That would be correct behavior, but unlikely what the programmer wanted.

For a more exhaustive presentation of regular expressions, see [Friedl,1997].

10.4.3 Iterators

We can define a regex_iterator for iterating over a sequence of characters finding matches for a pattern. For example, we can use a sregex_iterator (a regex_iterator<string>) to output all whitespace-separated words in a string:

void test()
{
        string input = "aa as; asd ++e^asdf asdfg";
        regex pat {R"(s+(w+))"};
        for (sregex_iterator p(input.begin(),input.end(),pat); p!=sregex_iterator{}; ++p)
                cout << (*p)[1] << '
';
}

This outputs:

as
asd
asdfg

We missed the first word, aa, because it has no preceding whitespace. If we simplify the pattern to R"((w+))", we get

aa
as
asd
e
asdf
asdfg

A regex_iterator is a bidirectional iterator, so we cannot directly iterate over an istream (which offers only an input iterator). Also, we cannot write through a regex_iterator, and the default regex_iterator (regex_iterator{}) is the only possible end-of-sequence.

10.5 Advice

[1] Use std::string to own character sequences; §10.2; [CG: SL.str.1].

[2] Prefer string operations to C-style string functions; §10.1.

[3] Use string to declare variables and members rather than as a base class; §10.2.

[4] Return strings by value (rely on move semantics and copy elision); §10.2, §10.2.1; [CG: F.15].

[5] Directly or indirectly, use substr() to read substrings and replace() to write substrings; §10.2.

[6] A string can grow and shrink, as needed; §10.2.

[7] Use at() rather than iterators or [ ] when you want range checking; §10.2, §10.3.

[8] Use iterators and [ ] rather than at() when you want to optimize speed; §10.2, §10.3.

[9] Use a range-for to safely minimize range checking §10.2, §10.3.

[10] string input doesn’t overflow; §10.2, §11.3.

[11] Use c_str() or data() to produce a C-style string representation of a string (only) when you have to; §10.2.

[12] Use a stringstream or a generic value extraction function (such as to<X>) for numeric conversion of strings; §11.7.3.

[13] A basic_string can be used to make strings of characters on any type; §10.2.1.

[14] Use the s suffix for string literals meant to be standard-library strings; §10.3 [CG: SL.str.12].

[15] Use string_view as an argument of functions that needs to read character sequences stored in various ways; §10.3 [CG: SL.str.2].

[16] Use string_span<char> as an argument of functions that needs to write character sequences stored in various ways; §10.3. [CG: SL.str.2] [CG: SL.str.11].

[17] Think of a string_view as a kind of pointer with a size attached; it does not own its characters; §10.3.

[18] Use the sv suffix for string literals meant to be standard-library string_views; §10.3.

[19] Use regex for most conventional uses of regular expressions; §10.4.

[20] Prefer raw string literals for expressing all but the simplest patterns; §10.4.

[21] Use regex_match() to match a complete input; §10.4, §10.4.2.

[22] Use regex_search() to search for a pattern in an input stream; §10.4.1.

[23] The regular expression notation can be adjusted to match various standards; §10.4.2.

[24] The default regular expression notation is that of ECMAScript; §10.4.2.

[25] Be restrained; regular expressions can easily become a write-only language; §10.4.2.

[26] Note that i for a digit i allows you to express a subpattern in terms of a previous subpattern; §10.4.2.

[27] Use ? to make patterns “lazy”; §10.4.2.

[28] Use regex_iterators for iterating over a stream looking for a pattern; §10.4.3.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.22.49