Chapter 20. Formatting and Text Replacement

Give this much to the Luftwaffe. When it knocked down our buildings, it didn’t replace them with anything more offensive than rubble. We did that.

Speech in London, December 1987
CHARLES PHILIP ARTHUR GEORGE,
PRINCE OF WALES

Suppose that you’ve been assigned to write a program that will send personalized e-mail to a list of pet owners whose names and e-mail addresses are stored in a file of comma-separated fields, one line per person. The fields in each line are, in order, the person’s e-mail address, first name, last name, pet’s name, and the kind of animal it is.[1] Since you read Chapter 19, you know how to write a regular expression to extract fields from a comma-separated list. Now you need to extract that information and insert it into the right places in the e-mail message. A brute-force approach might look like this.[2]

Example 20.1. Inserting Fields (regexform/inserting.cpp)


#include <regex>
#include <iostream>
#include <string>
#include <sstream>
using std::tr1::regex; using std::tr1::smatch;
using std::cout; using std::string;
using std::istringstream;

static const string addrlist =
"[email protected], Joe, Bob, Bubba, iguana\n"
"[email protected], Missy, Treadwell,"
  "Reginald Addington Farnsworth II,"
  "prize-winning Toy Poodle\n"
"[email protected], Spike, Redwood,"
  "Fangs, snake\n"
"[email protected], Sally, Smith,"
  "Mr. Bubbles, goldfish\n";

static void write_letter(const smatch& match)
  { // write one personalized letter to cout
  cout << "To: " << match.str(1) << '\n';
  cout << "Dear " << match.str(2) << ",\n";
  cout << "I'm sure your " << match.str(5)
    << ", " << match.str(4) << ",\n";
  cout << "as well as all the other pets in the "
    << match.str(3) << " family,\n";
  cout << "will enjoy our latest offering,"
    << " Universal-Ultra-Mega Vitamins,\n";
  cout << "Now available for all kinds of animals,"
    << " including " << match.str(5) << "s.\n";
  cout << "Don't delay, get some today!\n";
  }

int main()
  {
  // five comma-separated fields, one capture group per field
  regex rgx(
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)");
  smatch match;
  istringstream addresses(addrlist);
  string address;
  while (getline(addresses, address)
    && regex_match(address, match, rgx))
    write_letter(match);
  return 0;
  }


The function write_letter is rather poorly designed. First, it ought to format its text into a string or a stream so that the rest of the program can more easily manipulate it. Second, and more important, it should take as input a format string that gives the core of the text to write out, with placeholders for the pieces to be replaced for customization. So, with the format of our address list in mind, let’s look at one way to write that input text:


static string formletter =
"To: $1\n"
"Dear $2,\n"
"I'm sure your $5, $4,\n"
"as well as all the other pets in the $3 family,\n"
"will enjoy our latest offering,"
" Universal-Ultra-Mega Vitamins,\n"
"Now available for all kinds of animals,"
" including $5s.\n"
"Don't delay, get some today!\n";

This text removes all the stream inserters and replaces every call to match.str(n) with the text $n. It’s much easier to read, but it’s not as easy to generate our customized messages: the program has to scan through the entire text, searching for the escape sequences and replacing them with the corresponding text from the match object. I won’t bore you with the details of that code. If you had to, you could write it yourself. But you’ll certainly prefer not having to write it. Instead, you can do this, using the template member function format of match_results:

Example 20.2. Formatting (regexform/formatting.cpp)


#include <regex>
#include <iostream>
#include <string>
#include <sstream>
using std::tr1::regex; using std::tr1::smatch;
using std::cout; using std::string;
using std::istringstream;

static const string addrlist =
"[email protected], Joe, Bob, Bubba, iguana\n"
"[email protected], Missy, Treadwell,"
  "Reginald Addington Farnsworth II,"
  "prize-winning Toy Poodle\n"
"[email protected], Spike, Redwood,"
  "Fangs, snake\n"
"[email protected], Sally, Smith,"
  "Mr. Bubbles, goldfish\n";

static string formletter =
"To: $1\n"
"Dear $2,\n"
"I'm sure your $5, $4,\n"
"as well as all the other pets in the $3 family,\n"
"will enjoy our latest offering,"
" Universal-Ultra-Mega Vitamins,\n"
"Now available for all kinds of animals,"
" including $5s.\n"
"Don't delay, get some today!\n";

int main()
  {
  // five comma-separated fields, one capture group per field
  regex rgx(
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)");
  smatch match;
  istringstream addresses(addrlist);
  string address;
  while (getline(addresses, address)
    && regex_match(address, match, rgx))
    { // expand the $n escapes in formletter from this match
    string letter = match.format(formletter);
    cout << letter;
    }
  return 0;
  }


This can be written still more simply, using the algorithm regex_replace.

Example 20.3. Replacing (regexform/replacing.cpp)


#include <regex>
#include <iostream>
#include <string>
#include <sstream>
using std::tr1::regex; using std::tr1::regex_replace;
using std::cout; using std::string;
using std::istringstream;

static const string addrlist =
"[email protected], Joe, Bob, Bubba, iguana\n"
"[email protected], Missy, Treadwell,"
  "Reginald Addington Farnsworth II,"
  "prize-winning Toy Poodle\n"
"[email protected], Spike, Redwood,"
  "Fangs, snake\n"
"[email protected], Sally, Smith,"
  "Mr. Bubbles, goldfish\n";

static string formletter =
"To: $1\n"
"Dear $2,\n"
"I'm sure your $5, $4,\n"
"as well as all the other pets in the $3 family,\n"
"will enjoy our latest offering,"
" Universal-Ultra-Mega Vitamins,\n"
"Now available for all kinds of animals,"
" including $5s.\n"
"Don't delay, get some today!\n";

int main()
  {
  // five comma-separated fields, one capture group per field
  regex rgx(
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)[[:space:]]*,[[:space:]]*"
    "(.*)");
  // replace every matching line of addrlist with a formatted letter
  string letter =
    regex_replace(addrlist, rgx, formletter);
  cout << letter;
  return 0;
  }


In this chapter, we look at both of those approaches. We start, in Section 20.1, with the flag values that you can use to control the result. In Section 20.2, we look at the template member function format. In Section 20.3, we look at the algorithm regex_replace.

20.1. Formatting Options

namespace regex_constants {
   static const match_flag_type
     format_default,
     format_sed,
     format_no_copy,
     format_first_only;
}

The flag values have the following meanings:

format_default: use ECMAScript formatting rules; copy all nonmatching text; replace all occurrences of text matching the regular expression.

format_sed: use sed formatting rules.

format_no_copy: do not copy nonmatching text.

format_first_only: replace only the first occurrence of text that matches the regular expression.

The first two flags apply to both the template member function match_results::format and the algorithm regex_replace. The last two are meaningful only for regex_replace; the format member functions will ignore them.

The ECMAScript formatting rules are defined in [Ecm03]; the sed rules, in [Int03c]. The rules define escape sequences and their meanings. When you use these escape sequences in the format string, each escape sequence is replaced by text according to the rules in Table 20.1.

Table 20.1 Format Escape Sequences

[Table 20.1 is an image in the source and is not reproduced here.]
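To make the difference between the two rule sets concrete, here is a minimal sketch of a few common escape sequences under each. It assumes a C++11 or later compiler, where the TR1 regular expression components live in namespace std rather than std::tr1. The helper format_name and its sample text are hypothetical, introduced only for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds "first last" in a fixed sample text and
// formats the match with fmt, under ECMAScript rules by default or
// under sed rules when format_sed is passed.
std::string format_name(const std::string& fmt,
                        std::regex_constants::match_flag_type flags =
                            std::regex_constants::format_default)
{
    static const std::regex rgx("(\\w+) (\\w+)");
    const std::string text("name: Joe Bob!");
    std::smatch match;
    if (!std::regex_search(text, match, rgx))
        return std::string();
    // match now holds "Joe Bob", with "Joe" and "Bob" as groups 1 and 2
    return match.format(fmt, flags);
}
```

Under the default rules, format_name("$2, $1") produces "Bob, Joe", "$&" stands for the whole match, and "$$" produces a literal dollar sign. With format_sed, the same swap is written "\\2, \\1" in a C++ string literal, and a bare & stands for the whole match.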

20.2. Formatting Text

template <class BidIt, class Alloc>
template <class OutIt>
  OutIt match_results<BidIt, Alloc>::format(
    OutIt out,
    const string_type& fmt,
    match_flag_type flags = format_default) const;
template <class BidIt, class Alloc>
  string_type match_results<BidIt, Alloc>::format(
    const string_type& fmt,
    match_flag_type flags = format_default) const;

The first template member function generates an output sequence by copying the contents of fmt, replacing escape sequences in fmt with the corresponding text. The function then sequentially assigns each character in the output sequence to *out++. It returns the new value of out.

The second member function constructs a string_type object res, calls format(std::back_inserter(res), fmt, flags), and returns res.
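The relationship between the two overloads can be sketched as follows. This is only an illustration, assuming a C++11 or later compiler with the regular expression components in namespace std; the helper name format_appending is hypothetical.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: formats a match through the iterator overload,
// appending to whatever text the caller has already put in result.
// With an empty result this produces the same text as the overload
// that returns a string.
std::string format_appending(const std::smatch& match,
                             const std::string& fmt,
                             std::string result)
{
    match.format(std::back_inserter(result), fmt);
    return result;
}
```

Passing an empty string as result gives the same text as match.format(fmt). Passing a nonempty string shows what the iterator overload adds: the formatted text lands after text that is already there.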

We saw the second version of format in one of the examples in the previous section, in the line

string letter = match.format(formletter);

That call used the default flags. To pass a string that uses the sed format escapes instead of the ECMAScript escapes, pass the flag format_sed as the second argument:

string letter = match.format(formletter, format_sed);

The first version of format is more flexible. It takes an output iterator as the target for the output sequence and returns an iterator that points just past the end of the formatted text. The returned iterator can then be used as the target of further assignments, which will append text to the output sequence that format produced.[3]

Example 20.4. Using the Returned Iterator (regexform/returned.cpp)


#include <regex>
#include <iostream>
#include <string>
#include <algorithm>
using std::tr1::regex; using std::tr1::smatch;
using std::tr1::regex_search;
using std::string; using std::cout;
using std::copy;

int main()
  { // demonstrate match_results::format
  string result("The URL '");
  string tail("' was found.");
  regex rgx("http://([^/: ]+)");
  string text("The site http://www.petebecker.com has"
    " useful information.\n");
  smatch match;
  if (regex_search(text, match, rgx))
    { // show result of successful match
    // format appends the match to result and returns an iterator
    // past it; copy then appends tail through that iterator
    copy(tail.begin(), tail.end(),
      match.format(back_inserter(result), "$&"));
    cout << result << '\n';
    }
  return 0;
  }


20.3. Replacing Text


template <class OutIt, class BidIt,
    class RXtraits, class Elem>
  OutIt regex_replace(
    OutIt out, BidIt first, BidIt last,
    const basic_regex<Elem, RXtraits>& rgx,
    const basic_string<Elem>& fmt,
    match_flag_type flags = match_default);
template <class RXtraits, class Elem>
  basic_string<Elem> regex_replace(
    const basic_string<Elem>& str,
    const basic_regex<Elem, RXtraits>& rgx,
    const basic_string<Elem>& fmt,
    match_flag_type flags = match_default);

The first algorithm begins by constructing a regex_iterator object iter(first, last, rgx, flags) and using it to split its input range [first, last) into a series of alternating nonmatching and matching subsequences T0 M0 T1 M1 ... TN-1 MN-1 TN, where Mn is the nth match detected by the iterator. If no matches are found, T0 is the entire input range and N is 0. If (flags & format_first_only) != 0, only the first match is used, T1 is all the input text that follows the match, and N is 1. The algorithm then generates an output sequence as follows: For each index i in the range [0, N), if (flags & format_no_copy) == 0, the algorithm appends the text in the range Ti to the output sequence; regardless of the value of flags & format_no_copy, it then appends the text generated by a call to match.format(outseq, fmt, flags), where match is the match_results object returned by the iterator object iter for the subsequence Mi, and outseq is an output iterator that points at the current position in the output sequence. Finally, if (flags & format_no_copy) == 0, it appends the text in the range TN to the output sequence. Then it sequentially assigns each character in the output sequence to *out++ and returns the resulting value of out.

The second algorithm constructs a local variable result of type basic_string<Elem> and then calls regex_replace(back_inserter(result), str.begin(), str.end(), rgx, fmt, flags), returning result.

That first description is pretty dense. It has to be, to get the formal requirements right. Informally, the function copies text from the input sequence [first, last) to the output sequence pointed at by out. Whenever it finds text that matches the regular expression rgx, it replaces that text with the output sequence produced by calling match_results::format with the format string fmt. If you pass the flag format_no_copy, it skips the text that doesn’t match the regular expression and copies only the output sequences produced by match_results::format. If you pass the flag format_first_only, it looks only for the first match to the regular expression; all the text after that match is either copied without change or ignored, depending on whether you also passed format_no_copy.
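The steps just described can be sketched as a hand-written loop. This is not the library's implementation, only an illustration of the description above. It assumes a C++11 or later compiler with the regular expression components in namespace std, and the function name replace_sketch is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper that follows the description of regex_replace:
// copy each nonmatching piece T_i (unless format_no_copy is set),
// then append the formatted replacement for each match M_i.
std::string replace_sketch(const std::string& input,
                           const std::regex& rgx,
                           const std::string& fmt,
                           std::regex_constants::match_flag_type flags =
                               std::regex_constants::match_default)
{
    using namespace std::regex_constants;
    const bool copy_unmatched = (flags & format_no_copy) == 0;
    const bool first_only = (flags & format_first_only) != 0;

    std::string result;
    std::back_insert_iterator<std::string> out = std::back_inserter(result);
    std::string::const_iterator tail = input.begin();  // start of current T_i

    std::sregex_iterator iter(input.begin(), input.end(), rgx, flags);
    const std::sregex_iterator end;
    for (; iter != end; ++iter) {
        const std::smatch& match = *iter;
        if (copy_unmatched)
            out = std::copy(tail, match[0].first, out);  // T_i
        out = match.format(out, fmt, flags);             // replacement for M_i
        tail = match[0].second;
        if (first_only)
            break;                                       // N is at most 1
    }
    if (copy_unmatched)
        std::copy(tail, input.end(), out);               // T_N
    return result;
}
```

For the input "a1b22c", the regular expression [0-9]+, and the format string "<$&>", this sketch produces "a<1>b<22>c" with the default flags, "<1><22>" with format_no_copy, and "a<1>b22c" with format_first_only.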

For example, to replace every occurrence of the word “Intel” in a text sequence with the word “Microsoft”, try this.[4]

Example 20.5. Basic Search and Replace (regexform/basicrepl.cpp)


#include <regex>
#include <iostream>
#include <string>
using std::tr1::regex; using std::tr1::regex_replace;
using std::cout;
using std::string;

static const string text =
"For some reason, instead of using the name Microsoft,\n"
"I used the name Intel when I wrote this. Now I need\n"
"to change it, since I wasn't talking about Intel,\n"
"but about Microsoft. Intelligent people like to think\n"
"they don't make such silly mistakes, but sometimes,\n"
"alas, they do.\n";

int main()
  { // demonstrate basic search and replace
  regex rgx("Intel");
  string replacement("Microsoft");
  string result;
  regex_replace(back_inserter(result),
    text.begin(), text.end(), rgx, replacement);
  cout << result;
  return 0;
  }


To display only text that matches a regular expression, with each match on a separate line, try this.

Example 20.6. Basic Search (regexform/basicsrch.cpp)


#include <regex>
#include <iostream>
#include <string>
using std::tr1::regex; using std::tr1::regex_replace;
using namespace std::tr1::regex_constants;
using std::cout;
using std::string;

static const string text =
"Each morning I check http://www.nytimes.com and\n"
"http://www.boston.com for news of what happened\n"
"overnight. I also look at http://www.tnr.com to\n"
"see any new articles they have posted.";

int main()
  { // demonstrate basic search
  regex rgx("http://([^/: ]+)");
  string replacement("$&\n");   // each match on its own line
  string result;
  regex_replace(back_inserter(result),
    text.begin(), text.end(),
    rgx, replacement, format_no_copy);
  cout << result;
  return 0;
  }


These examples don’t take advantage of the iterator that they pass to receive the output sequence. They could just as easily have been written using the second form of regex_replace. To use that function to replace only the first URL in a text sequence, try this.

Example 20.7. Replace First (regexform/basicfirst.cpp)


#include <regex>
#include <iostream>
#include <string>
using std::tr1::regex; using std::tr1::regex_replace;
using namespace std::tr1::regex_constants;
using std::cout;
using std::string;

static const string text =
"Each morning I check http://www.nytimes.com and\n"
"http://www.boston.com for news of what happened\n"
"overnight. I also look at http://www.tnr.com to\n"
"see any new articles they have posted.";

int main()
  { // demonstrate replacing only the first match
  regex rgx("http://([^/: ]+)");
  string replacement("http://www.wsj.com");
  string result;
  regex_replace(back_inserter(result),
    text.begin(), text.end(),
    rgx, replacement, format_first_only);
  cout << result;
  return 0;
  }
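
The same replacement can also be written with the form of regex_replace that returns a string, which avoids the explicit output iterator. The following is a minimal sketch, assuming a C++11 or later compiler with the regular expression components in namespace std; the helper name replace_first_url is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces only the first host name found in text,
// using the overload of regex_replace that returns a string.
std::string replace_first_url(const std::string& text,
                              const std::string& replacement)
{
    static const std::regex rgx("http://([^/: ]+)");
    return std::regex_replace(text, rgx, replacement,
        std::regex_constants::format_first_only);
}
```

Text that contains no match comes back unchanged, because the nonmatching text is still copied unless format_no_copy is also passed.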


Exercises

Exercise 1

In this exercise, we look at three approaches to displaying the results of regular expression searches. First, as we’ve been doing, we’ll write each result directly to cout; then we’ll use match_results::format to insert characters from each match into an output iterator that writes to cout; finally, we’ll use regex_replace to manage the search loop for us.

1. Write a program that searches for text in the form “name: first-name last-name” and inserts the contents of each successful match into cout with the last name first, followed by a comma, followed by the first name.

2. Write a program that uses a pair of regex_iterator objects to do the search and calls iter->format to write all the desired text to cout. You can use an iterator object ostream_iterator<char> out(cout, "") to insert individual characters into cout.

3. Write a program that uses regex_replace with the flag format_no_copy to search for matches and write out their contents.

Exercise 2

Write a program that searches for one occurrence of text that matches a hostname[5] and uses match_results::format to create a string object with an HTML link to that host. For example, the hostname http://www.petebecker.com would be converted to <A HREF="http://www.petebecker.com">http://www.petebecker.com</A>.

Exercise 3

Write a program that copies an input file and replaces every occurrence of text that matches a hostname with an HTML link to that host.

Exercise 4

Write a program that searches an input file for text that matches a hostname and for each match writes a line of text with an HTML link to that host into an output file.

Exercise 5

Write a program that searches the standard input for text that matches a hostname and for each match writes a line of text with an HTML link to that host to the standard output.
