Chapter 17. Searching

There are and can be only two ways of searching into and discovering truth.

Novum Organum
FRANCIS BACON

Suppose that you want to scan a text file encoded in HTML and extract all the lines that contain code snippets. Each snippet begins with “<CODE>” and ends with “</CODE>”. The two markers are not case sensitive. For now, let’s assume that no snippet is longer than one line. This means that we need to search only for lines that contain the first marker, so we can use this text to define a regular expression for the search:

const char *expr = "<CODE>";

It’s easy to read a file one line at a time, so we don’t need to do anything more sophisticated in the regular expression to pick out lines. Simply read a line and see whether it has any text that matches the regular expression. To check for that match, we’ll call the template function regex_search, passing it the current line of text from the file and the regular expression object. The function returns true if any subsequence in the target text matches the regular expression.

Example 17.1. Searching for Lines with Code Snippets (regexsrch/snippets.cpp)



#include <regex>
#include <iostream>
#include <fstream>
#include <string>
#include <stdlib.h>
using std::tr1::regex; using std::tr1::regex_search;
using std::string; using std::ifstream; using std::cout;

static void show_matches (const char *fname)
  { // scan file named by fname, line by line
  ifstream input (fname);
  string str;
  const char *expr = "<CODE>";

  regex rgx (expr, regex::icase);
  while (getline (input, str))
    { // check line for match
    if (regex_search (str, rgx))
      cout << str << ' ';
    }
  }

int main (int argc, char *argv[])
  { // search for lines with code snippets in text file
  if (argc != 2)
    { // wrong number of arguments
    cout << "Usage : snippets <filename> ";
    return EXIT_FAILURE ;
    }
  try
    { // search the file
    show_matches (argv[1]);
    }
  catch (…)
    { // something went wrong
    cout << "Error ";
    return EXIT_FAILURE;
    }
  return 0;
  }


There are two ways of searching with regular expressions. You can ask whether a regular expression exactly matches the entire target sequence, and you can ask whether a regular expression matches a subsequence of the target sequence. The first is done with the overloaded set of function templates regex_match, discussed in Section 17.2; the second is done with the overloaded set of function templates regex_search, discussed in Section 17.3.

The argument lists for the regex_match and regex_search function templates all start with one or two arguments that designate the sequence of characters to be searched. This target sequence is followed by a basic_regex object that defines the regular expression to search for. Finally, an argument of type match_flag_type affects the matching rules. This argument has a default value of match_default, so you often don’t have to mention it.

In this chapter, we look at three overloaded versions of regex_match and three overloaded versions of regex_search. For each of these six function templates, a corresponding function template can take the same arguments and an additional argument whose type is an instantiation of match_results, which the algorithm fills in with details of how the match was accomplished. We look at those versions in the next chapter.

The characters in the target sequence must have the same type as the element type of the basic_regex object. For example, when the regular expression that you’re searching for is held in a basic_regex<wchar_t> object, the target sequence must consist of characters of type wchar_t.

When you call these functions, you can specify the target sequence in three ways.

1. A pointer ptr of type Elem* designates a null-terminated sequence, such as a C string, when Elem is a synonym for char. The sequence contains the characters in the contiguous array beginning at ptr and ending with the first character whose value is Elem(), which typically has the value 0 and is not part of the target sequence.

2. An object of type std::basic_string<Elem> designates the character sequence managed by the basic_string object.

3. A pair of bidirectional iterators first and last designates the character sequence defined by the half-open range [first, last).

Example 17.2. Target Sequences (regexsrch/targets.cpp)


#include <regex>
#include <iostream>
#include <list>
#include <string>
using std::tr1::regex; using std::tr1::regex_match;
using std::cout;
using std::list;
using std::basic_string;

int main()
  { // designating the target sequence
  regex rgx("b(c*)d");

  // pointer
  const char *tgt1 = "bcd";
  if (regex_match(tgt1, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";

  // string

  basic_string <char> tgt2("bcd");
  if (regex_match(tgt2, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";

  // pair of bidirectional iterators
  list <char> tgt0;
  tgt0.push_back('b'),
  tgt0.push_back('c'),
  tgt0.push_back('d'),
  if (regex_match(tgt0.begin(), tgt0.end(), rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";

  return 0;
  }


17.1. Header <regex> Partial Synopsis

In this chapter, we look at the following components of the header <regex>:

namespace std {  // C++ standard library
 namespace tr1 { // TR1 additions
    
    // FUNCTION TEMPLATE regex_match
template<class Elem, class RXtraits>
  bool regex_match(
    const Elem *ptr,
    const basic_regex<Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class IOtraits, class IOalloc,
    class Elem, class RXtraits>
  bool regex_match(
    const basic_string<Elem, IOtraits, IOalloc>& str,
    const basic_regex<Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class BidIt, class Elem, class RXtraits>

  bool regex_match(
    BidIt first, Bidit last,
    const basic_regex<Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);

    // FUNCTION TEMPLATE regex_search
template<class Elem, class RXtraits>
  bool regex_search(
    const Elem *ptr,
    const basic_regex<Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class IOtraits, class IOalloc,
    class Elem, class RXtraits>
  bool regex_search(
    const basic_string<Elem, IOtraits, IOalloc>& str,
    const basic_regex<Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class BidIt, class Elem, class RXtraits>
  bool regex_search(
    BidIt first, Bidit last,
    const basic_regex<Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
} }

17.2. Matching Exactly

template<class Elem, class RXtraits>
  bool regex_match (
    const Elem *ptr,
    const basic_regex <Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class IOtraits, class IOalloc,
    class Elem, class RXtraits>
  bool regex_match (
    const basic_string <Elem, IOtraits, IOalloc>& str,
    const basic_regex <Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class BidIt, class Elem, class RXtraits>
  bool regex_match (
    BidIt first, Bidit last,
    const basic_regex <Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);

The function templates return true only if the regular expression argument rgx matches the entire target sequence.

Example 17.3. Using regex_match (regexsrch/match.cpp)


#include <regex>
#include <iomanip>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_match;
using std::cout; using std::setw;

void show_match (const regex & rgx, const char *str)
  { // check for match
  cout << setw (10) << str <<": ";
  if (regex_match (str, rgx))
    cout << "matches ";
  else
    cout << "doesn' t match ";
  }

int main()
  { // demonstrate exact match
  regex rgx("b(c*) d");
  show_match(rgx, "bd");                        // matches
  show_match(rgx, "d");                         // doesn't match
  show_match(rgx, "bcd");                     // matches
  show_match(rgx, "bc");                      // doesn't match
  show_match(rgx, "abd");                     // doesn't match
  show_match(rgx, "bde");                     // doesn't match
  return 0;
  }


17.3. Searching

template<class Elem, class RXtraits>
  bool regex_search (
    const Elem *ptr,
    const basic_regex <Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class IOtraits, class IOalloc,
    class Elem, class RXtraits>
  bool regex_search (

  const basic_string <Elem, IOtraits, IOalloc>& str,
    const basic_regex <Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);
template<class BidIt, class Elem, class RXtraits>
  bool regex_search (
    BidIt first, Bidit last,
    const basic_regex <Elem, RXtraits>& rgx,
    match_flag_types flags = match_default);

The function templates return true only if the regular expression argument rgx matches a subsequence of the target sequence.

Example 17.4. Using regex_search (regexsrch/search.cpp)


#include <regex>
#include <iomanip>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_search;
using std::cout; using std::setw;

void show_match ( const regex & rgx, const char *str)
  { // check for match
  cout << setw (10) << str << ": ";
  if (regex_search (str, rgx))
    cout << "matches  ";
  else
  cout << "doesn't match  ";
  }

int main()
  { // demonstrate use of regex_search
  regex rgx ("b(c*) d");
  show_match (rgx, "bd");                        // matches
  show_match (rgx, "d");                         // doesn't match
  show_match (rgx, "bcd");                       // matches
  show_match (rgx, "bc");                        // doesn't match
  show_match (rgx, "abd");                       // matches
  show_match (rgx, "bde");                       // matches
  return 0;
  }


17.4. Search Options

namespace regex_constants {
  static const match_flag_type
    match_default,
    match_not_bol,
    match_not_eol,
    match_not_bow,
    match_not_eow,
    match_any,
    match_not_null,
    match_continuous,
    natch_prev_avail ;
}

Values of type match_flag_type can be combined with the logical OR operator and passed to the various search algorithms. Flags that are not applicable to a particular grammar are ignored when that grammar was used to create the regular expression object used for the search. The flag values have the following meanings:

match_default: use normal matching rules.

match_not_bol: do not treat the first position in the target sequence as the beginning of a line.

match_not_eol: do not treat the last position in the target sequence as the end of a line.

match_not_bow: do not treat the first position in the target sequence as the beginning of a word.

match_not_eow: do not treat the last position in the target sequence as the end of a word.

match_any: if more than one match is possible, any match is acceptable.

match_not_null: do not treat an empty sequence as a match.

match_continuous: do not search for matches other than at the beginning of the target sequence.

match_prev_avail: when an algorithm is called with a target sequence defined by a pair of iterators first and last, --first is a valid iterator; ignore match_not_bol and match_not_bow.

These flag values can be passed to the algorithms regex_match, regex_search, and regex_replace[1] to change the matching rules.

17.4.1. The match_default Flag

The flag designates the default matching rules for the grammar that was used to create the regular expression object used for the search. All of the search algorithms have a default value of match_default for their flags argument, so you often won’t have to pass this flag to the algorithms.[2]

17.4.2. The match_not_bol Flag

Ordinarily, the beginning of the target sequence is treated as the beginning of a line. This flag tells the search engine not to apply this rule. That’s useful in repetitive searches that resume where the previous search finished. We look at repetitive searches in Chapter 19. In the meantime, this example shows how this flag works.

Example 17.5. Flag match_not_bol (regexsrch/notbol.cpp)


#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_match;
using namespace std::tr1::regex_constants;
using std::cout;

int main()
  { // demonstrate use of match_not_bol
  regex rgx("^abcd");
  const char *tgt = "abcd";
  if (regex_match(tgt, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  if (regex_match (tgt, rgx, match_not_bol))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  return 0;
  }


The first search succeeds because the beginning of the target sequence is treated as the beginning of a line, thus matching the initial character ‘^’. The second search fails because the beginning of the target sequence is not treated as the beginning of a line.

17.4.3. The match_not_eol Flag

Ordinarily, the end of the target sequence is treated as the end of a line. This flag tells the search engine not to apply this rule. That’s useful when searching through buffered input, where additional input might not yet have been read into the buffer.

Example 17.6. Flag match_not_eol (regexsrch/noteol.cpp)


#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_match;
using namespace std::tr1::regex_constants;
using std::cout;

int main()
  { // demonstrate use of match_not_eol
  regex rgx ("abcd$");
  const char *txt = "abcde";
  const char *end = txt + 4;        // points to 'e'
  if (regex_match(txt, end, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  if (regex_match (txt, end, rgx, match_not_eol))
    cout << "Matches . ";
  else
    cout << "Doesn't match . ";
  return 0;
  }


The first search succeeds because the end of the target sequence is treated as the end of a line, thus matching the final character ‘$’. The second search fails because the end of the target sequence is not treated as the end of a line.

17.4.4. The match_not_bow Flag

Ordinarily, the beginning of the target sequence is treated as the beginning of a word if the first character can be part of a word. This flag tells the search engine not to apply this rule. That’s useful in repetitive searches that resume where the previous search finished. We look at repetitive searches in Chapter 19. In the meantime, this example shows how this flag works.

Example 17.7. Flag match_not_bow (regexsrch/notbow.cpp)


#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_match;
using namespace std::tr1::regex_constants;
using std::cout;

int main()
  { // demonstrate use of match_not_bow
  regex rgx ("\ babcd");
  const char *tgt = "abcd";
  if (regex_match (tgt, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn 't match . ";
  if (regex_match (tgt, rgx, match_not_bow))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  return 0;
  }


The first search succeeds because the beginning of the target sequence is treated as the beginning of a word, thus matching the initial escape sequence “”. The second search fails because the beginning of the target sequence is not treated as the beginning of a word.

17.4.5. The match_not_eow Flag

Ordinarily, the end of the target sequence is treated as the end of a word if the last character can be part of a word. This flag tells the search engine not to apply this rule. That’s useful when searching through buffered input, where additional input might not yet have been read into the buffer.

Example 17.8. Flag match_not_eow (regexsrch/noteow.cpp)


#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_match;
using namespace std::tr1::regex_constants;

using std::cout;

int main()
  { // demonstrate use of match_not_eow
  regex rgx ("abcd \ b");
  const char *txt = "abcde";
  const char *end = txt + 4;        // points to 'e'
  if (regex_match (txt, end, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  if (regex_match (txt, end, rgx, match_not_eow))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  return 0;
  }


The first search succeeds because the end of the target sequence is treated as the end of a word, thus matching the final escape sequence “”. The second search fails because the end of the target sequence is not treated as the end of a word.

17.4.6. The match_any Flag

For grammars other than ECMAScript, when a regular expression includes an alternation, a successful match should use the longest possible alternative. For example, when matching the regular expression “(wee|week).*” to the target sequence “weeknights”, a successful match will associate the text “week” with the first capture group, because that’s a longer match than “wee”. This means that even after finding that the first alternative, “wee”, leads to a successful match, the search engine must continue, in order to find the match that uses the longer alternative. This flag tells the search engine not to apply that rule. Any successful match is acceptable, even if a longer branch of an alternative would lead to a match with the same overall length.

This flag does not affect the requirement of finding the longest possible match. It simplifies the rule for breaking ties among equally long matches, by not requiring repeated searches for the longest leftmost alternative. We look at examples using this flag in Chapter 18.

17.4.7. The match_not_null Flag

This flag tells the search engine not to consider a match that matches zero characters. That’s useful in repetitive searches that resume where the previous search left off. If the previous search matched zero characters, it left off at the place where it started, and repeating the search would simply find the zero-length match again. In order to make progress, a repetitive search should use the match_not_null flag after a zero-length match. We look at repetitive searches in Chapter 19. In the meantime, this example shows how this flag works.

Example 17.9. Flag match_not_null (regexsrch/notnull.cpp)


#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_match;
using namespace std::tr1::regex_constants;
using std::cout;

int main()
  { // demonstrate use of match_not_null
  regex rgx ("a *| b");
  const char *tgt = "ccc";
  if (regex_search (tgt, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  if (regex_search (tgt, rgx, match_not_null))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  return 0;
  }


In the first call to regex_search, the first alternative in the regular expression, “a*”, successfully matches zero characters at the start of the target sequence. When the search is repeated with the match_not_null flag, it fails because this match is not allowed.

17.4.8. The match_continuous Flag

This flag tells the search engine to consider only matches beginning at the start of the target sequence. This doesn’t affect the result of calling regex_match, which matches the entire target sequence. This flag can change the result of calling regex_search.

Example 17.10. Flag match_continuous (regexsrch/continuous.cpp)


#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_search;
using namespace std::tr1::regex_constants;
using std::cout;

int main()
  { // demonstrate use of match_continuous
  regex rgx ("bcd");
  const char *tgt = "abcd";
  if (regex_search (tgt, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  if (regex_search (tgt, rgx, match_continuous))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  return 0;
  }


The first search succeeds because the regular expression “bcd” matches the end of the target sequence. The second search fails because the match is required to begin at the beginning of the target sequence.

17.4.9. The match_prev_avail Flag

When passed to a search that uses a pair of iterators to designate the target sequence, this flag tells the search engine that the position preceding the first iterator is a valid element. This is needed when a search begins after the beginning of a sequence, so that the special rules for the first character in the target sequence won’t be applied.

Example 17.11. Flag match_prev_avail (regexsrch/prevavail.cpp)


#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::regex_match;
using namespace std::tr1::regex_constants;
using std::cout;

int main()
  { // demonstrate use of match_prev_avail
  regex rgx ("\ bbcd");
  const char *txt = "abcd";
  const char *tgt = txt + 1;
  if (regex_match (tgt, rgx))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  if ( regex_match (tgt, rgx, match_prev_avail))
    cout << "Matches . ";
  else
    cout << "Doesn' t match . ";
  return 0;
  }


The first match succeeds because the search engine treats the character ‘b’ in the target sequence as the beginning of a word. The second match fails because the search engine looks back one character and sees that the character ‘b’ in the target sequence is not the beginning of a word.

Exercises

Exercise 1

For each of the following errors, write a simple test case containing the error, and try to compile it. In the error messages, look for the key words that relate to the error in the code.

1. Calling regex_match with a pointer to a null-terminated character sequence with a character type that’s different from the element type used to define the regular expression object

2. Calling regex_match with a basic_string object holding a character type that’s different from the element type used to define the regular expression object

3. Calling regex_match with a pair of iterators of type Iter, where the element type used to define the regular expression object is different from iterator_traits<Iter>::value_type

4. Calling regex_match with a pair of forward iterators

Exercise 2

The regular expression “http://([^/:]+)(:(d+))?(/.*)?” can be used to recognize an HTTP hostname.[3]

1. Write a program that uses regex_match to check each of the following character sequences to see whether the entire sequence is a valid HTTP hostname.

a. www.petebecker.com

b. http://www.petebecker.com

c.“Answers are at http://www.petebecker.com:80.”

d. http://www.petebecker.com has the answers.”

2. Write a program that uses regex_search to check each of the preceding character sequences to see whether the entire sequence is a valid HTTP hostname.

3. Write a program that uses regex_search to check each of the preceding character sequences to see whether the sequence contains a valid HTTP hostname.

4. Write a program that uses regex_match to check each of the preceding character sequences to see whether the sequence contains a valid HTTP hostname.

Exercise 3

The UNIX utility egrep searches files for occurrences of text that match a regular expression. Try your hand at implementing some of its options.

1. Write a program that takes two arguments from the command line and treats the first argument as a regular expression and the second as the name of a file. The program should search through the named file, writing out each line that contains text that matches the regular expression.

2. Change the program so that it writes out lines that do not contain text that matches the regular expression.

3. Change the program so that, instead of writing out each line that has a match, it counts them and writes out the number of matched lines when it finishes.

4. Change the program so that it writes out only lines in which the regular expression matches a complete word.

5. Change the program to make the match case insensitive.

Exercise 4

Suppose that you have an application that writes a log file where each line is in one of three forms. Successful operations are logged in a line beginning with the word success, followed by a description of what was done. Failures are logged in one of two forms. Some failures are logged with the word error, followed by a space, followed by a one-, two-, or three-digit decimal error code. Other failures are logged with a generic message beginning with the word error, followed by a space, followed by ordinary text describing the error; this text will never begin with a digit. Write a program that scans the log file, looking for error messages with error codes, and reports the number of messages for each error code that it encountered. You don’t need to parse the error message to do this; simply count the number of times each matching text sequence occurs.

Exercise 5

Suppose that you need to summarize all the error messages in a log file that was written in the format given in the previous example. That is, the program should determine how many error messages had each error code and how many had no error code. You can reuse most of the previous program for this exercise. If the test from the previous program fails but the text line begins with error the line is a generic error message. Add a test after the one that recognizes error codes to check for generic error messages, and count them separately. Make sure that the program handles such lines as “success, no error detected” correctly.

Exercise 6

The code for the previous exercise tests for the leading string error twice. That’s annoying and often inefficient. Rewrite the code to check first for the leading text error, then for an error code, and classify the message accordingly.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.111.30