Chapter 3. Associative Containers

Somewhat like the polychromatic horse in The Wizard of Oz movie, the associative containers are creatures of a different color. True, they share many characteristics with the sequence containers, but they differ in a number of fundamental ways. For example, they automatically keep themselves sorted; they view their contents through the lens of equivalence instead of equality; sets and maps reject duplicate entries; and maps and multimaps generally ignore half of each object they contain. Yes, the associative containers are containers, but if you’ll excuse my likening vectors and strings to Kansas, we are definitely not in Kansas any more.

In the Items that follow, I explain the critical notion of equivalence, describe an important restriction on comparison functions, motivate custom comparison functions for associative containers of pointers, discuss the official and practical sides of key constness, and offer advice on improving the efficiency of associative containers.

At the end of the chapter, I examine the STL’s lack of containers based on hash tables, and I survey two common (nonstandard) implementations. Though the STL proper doesn’t offer hash tables, you need not write your own or do without. High-quality implementations are readily available.

Item 19: Understand the difference between equality and equivalence

The STL is awash in comparisons of objects to see if they have the same value. For example, when you ask find to locate the first object in a range with a particular value, find has to be able to compare two objects to see if the value of one is the same as the value of the other. Similarly, when you attempt to insert a new element into a set, set::insert has to be able to determine whether that element’s value is already in the set.

The find algorithm and set’s insert member function are representative of many functions that must determine whether two values are the same. Yet they do it in different ways. find’s definition of “the same” is equality, which is based on operator==. set::insert’s definition of “the same” is equivalence, which is usually based on operator<. Because these are different definitions, it’s possible for one definition to dictate that two objects have the same value while the other definition decrees that they do not. As a result, you must understand the difference between equality and equivalence if you are to make effective use of the STL.

Operationally, the notion of equality is based on operator==. If the expression “x == y” returns true, x and y have equal values, otherwise they don’t. That’s pretty straightforward, though it’s useful to bear in mind that just because x and y have equal values does not necessarily imply that all of their data members have equal values. For example, we might have a Widget class that internally keeps track of its last time of access,

class Widget {
public:
  ...

private:
  TimeStamp lastAccessed;
  ...

};

and we might have an operator== for Widgets that ignores this field:

bool operator==(const Widget& lhs, const Widget& rhs)
{
  // code that ignores the lastAccessed field
}

In that case, two Widgets would have equal values even if their lastAccessed fields were different.
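
For concreteness, here is a sketch of what such an operator== might look like. The member it compares (a hypothetical name field) is invented purely for illustration; the point is only that lastAccessed plays no part in the comparison:

bool operator==(const Widget& lhs, const Widget& rhs)
{
  return lhs.name() == rhs.name();        // compare only the "real" state;
}                                         // lastAccessed is deliberately ignored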

Equivalence is based on the relative ordering of object values in a sorted range. Equivalence makes the most sense if you think about it in terms of the sort order that is part of every standard associative container (i.e., set, multiset, map, and multimap). Two objects x and y have equivalent values with respect to the sort order used by an associative container c if neither precedes the other in c’s sort order. That sounds complicated, but in practice, it’s not. Consider, as an example, a set<Widget> s. Two Widgets w1 and w2 have equivalent values with respect to s if neither precedes the other in s’s sort order. The default comparison function for a set<Widget> is less<Widget>, and by default less<Widget> simply calls operator< for Widgets, so w1 and w2 have equivalent values with respect to operator< if the following expression is true:

!(w1 < w2)                            // it's not true that w1 < w2
&&                                    // and
!(w2 < w1)                            // it's not true that w2 < w1

This makes sense: two values are equivalent (with respect to some ordering criterion) if neither precedes the other (according to that criterion).

In the general case, the comparison function for an associative container isn’t operator< or even less, it’s a user-defined predicate. (See Item 39 for more information on predicates.) Every standard associative container makes its sorting predicate available through its key_comp member function, so two objects x and y have equivalent values with respect to an associative container c’s sorting criterion if the following evaluates to true:

!c.key_comp()(x, y) && !c.key_comp()(y, x)  // it's not true that x precedes
                                            // y in c's sort order and it's
                                            // also not true that y precedes
                                            // x in c's sort order

The expression !c.key_comp()(x, y) looks nasty, but once you understand that c.key_comp() is returning a function (or a function object), the nastiness dissipates. !c.key_comp()(x, y) is just calling the function (or function object) returned by key_comp, and it’s passing x and y as arguments. Then it negates the result. c.key_comp()(x, y) returns true only when x precedes y in c’s sort order, so !c.key_comp()(x, y) is true only when x doesn’t precede y in c’s sort order.
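
If you make this test often, you can capture the expression in a small helper. The following template is not part of the STL; it's just a sketch:

template<typename C, typename Value>                 // true if and only if x and y
bool equivalentIn(const C& c,                        // are equivalent with respect
                  const Value& x, const Value& y)    // to c's sort order
{
  return !c.key_comp()(x, y) && !c.key_comp()(y, x);
}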

To fully grasp the implications of equality versus equivalence, consider a case-insensitive set<string>, i.e., a set<string> where the set’s comparison function ignores the case of the characters in the strings. Such a comparison function would consider “STL” and “stL” to be equivalent. Item 35 shows how to implement a function, ciStringCompare, that performs a case-insensitive comparison, but set wants a comparison function type, not an actual function. To bridge this gap, we write a functor class whose operator() calls ciStringCompare:

struct CIStringCompare:                            // class for case-insensitive
  public                                           // string comparisons;
  binary_function<string, string, bool> {          // see Item 40 for info on
                                                   // this base class
  bool operator()(const string& lhs,
                  const string& rhs) const
  {
    return ciStringCompare(lhs, rhs);              // see Item 35 for how
  }                                                // ciStringCompare is
                                                   // implemented
};

Given CIStringCompare, it’s easy to create a case-insensitive set<string>:

set<string, CIStringCompare> ciss;                 // ciss = "case-insensitive
                                                   //         string set"

If we insert "Persephone" and "persephone" into the set, only the first string is added, because the second one is equivalent to the first:

ciss.insert("Persephone");      // a new element is added to the set

ciss.insert("persephone");      // no new element is added to the set

If we now search for the string "persephone" using set’s find member function, the search will succeed,

if (ciss.find("persephone") != ciss.end()) ...      // this test will succeed

but if we use the non-member find algorithm, the search will fail:

if (find(ciss.begin(), ciss.end(),
         "persephone") != ciss.end()) ...           // this test will fail

That’s because "persephone" is equivalent to "Persephone" (with respect to the comparison functor CIStringCompare), but it’s not equal to it (because string("persephone") != string("Persephone")). This example demonstrates one reason why you should follow the advice in Item 44 and prefer member functions (like set::find) to their non-member counterparts (like find).

You might wonder why the standard associative containers are based on equivalence instead of equality. After all, most programmers have an intuition about equality that they lack for equivalence. (Were that not the case, there’d be no need for this Item.) The answer is simple at first glance, but the more closely you look, the more subtle it becomes.

The standard associative containers are kept in sorted order, so each container must have a comparison function (less, by default) that defines how to keep things sorted. Equivalence is defined in terms of this comparison function, so clients of a standard associative container need to specify only one comparison function (the one determining the sort order) for any container they use. If the associative containers used equality to determine when two objects have the same value, each associative container would need, in addition to its comparison function for sorting, a second comparison function for determining when two values are equal. (By default, this comparison function would presumably be equal_to, but it’s interesting to note that equal_to is never used as a default comparison function in the STL. When equality is needed in the STL, the convention is to simply call operator== directly. For example, this is what the non-member find algorithm does.)

Let’s suppose we had a set-like STL container called set2CF, “set with two comparison functions.” The first comparison function would be used to determine the sort order of the set, the second would be used to determine whether two objects had the same value. Now consider this set2CF:

set2CF<string, CIStringCompare, equal_to<string> > s;

Here, s sorts its strings internally without regard to case, and the equality criterion is the intuitive one: two strings have the same value if they are equal to one another. Let’s insert our two spellings of Hades’ reluctant bride (Persephone) into s:

s.insert("Persephone");
s.insert("persephone");

What should this do? If we observe that "Persephone" != "persephone" and insert them both into s, what order should they go in? Remember that the sorting function can’t tell them apart. Do we insert them in some arbitrary order, thus giving up the ability to traverse the contents of the set in deterministic order? (This inability to traverse associative container elements in a deterministic order already afflicts multisets and multimaps, because the Standard places no constraints on the relative ordering of equivalent values (for multisets) or keys (for multimaps).) Or do we insist on a deterministic ordering of s’s contents and ignore the second attempted insertion (the one for "persephone")? If we do that, what happens here?

if (s.find("persephone") != s.end()) ...      // does this test succeed or fail?

Presumably find employs the equality check, but if we ignored the second call to insert in order to maintain a deterministic ordering of the elements in s, this find will fail, even though the insertion of "persephone" was ignored on the basis of its being a duplicate value!

The long and short of it is that by using only a single comparison function and by employing equivalence as the arbiter of what it means for two values to be “the same,” the standard associative containers sidestep a whole host of difficulties that would arise if two comparison functions were allowed. Their behavior may seem a little odd at first (especially when one realizes that member and non-member find may return different results), but in the long run, it avoids the kind of confusion that would arise from mixing uses of equality and equivalence within standard associative containers.

Interestingly, once you leave the realm of sorted associative containers, the situation changes, and the issue of equality versus equivalence can be—and has been—revisited. There are two common designs for nonstandard (but widely available) associative containers based on hash tables. One design is based on equality, while the other is based on equivalence. I encourage you to turn to Item 25 to learn more about these containers and the design decisions they’ve adopted.

Item 20: Specify comparison types for associative containers of pointers

Suppose you have a set of string* pointers and you insert the names of some animals into the set:

set<string*> ssp;                                 // ssp = "set of string ptrs"

ssp.insert(new string("Anteater"));
ssp.insert(new string("Wombat"));
ssp.insert(new string("Lemur"));
ssp.insert(new string("Penguin"));

You then write the following code to print the contents of the set, expecting the strings to come out in alphabetical order. After all, sets keep their contents sorted.

for (set<string*>::const_iterator i = ssp.begin();    // you expect to see
     i != ssp.end();                                  // this: "Anteater",
     ++i)                                             // "Lemur", "Penguin",
  cout << *i << endl;                                 // "Wombat"

The comment describes what you expect to see, but you don’t see that at all. Instead, you see four hexadecimal numbers. They are pointer values. Because the set holds pointers, *i isn’t a string, it’s a pointer to a string. Let this be a lesson that reminds you to adhere to the guidance of Item 43 and avoid writing your own loops. If you’d used a call to the copy algorithm instead,

copy(ssp.begin(), ssp.end(),                          // copy the strings in
     ostream_iterator<string>(cout, " "));           // ssp to cout (but this
                                                      // won't compile)

you’d not only have typed fewer characters, you’d have found out about your error sooner, because this call to copy won’t compile. ostream_iterator insists on knowing the type of object being printed, so when you tell it it’s a string (by passing that as the template parameter), your compilers detect the mismatch between that and the type of objects stored in ssp (which is string*), and they refuse to compile the code. Score another one for type safety.

If you exasperatedly change the *i in your explicit loop to **i, you might get the output you want, but you probably won’t. Yes, the animal names will be printed, but the chances of their coming out in alphabetical order are only 1 in 24. ssp keeps its contents in sorted order, but it holds pointers, so it sorts by pointer value, not by string value. There are 24 possible permutations for four pointer values, so there are 24 possible orders for the pointers to be stored in. Hence the odds of 1 in 24 of your seeing the strings in alphabetical order.

Practically speaking, the 24 permutations are not equally likely, so the “1 in 24” statement is a bit misleading. Still, there are 24 different orders, and you could get any one of them.

To surmount this problem, it helps to recall that

set<string*> ssp;

is shorthand for this:

set<string*, less<string*> > ssp;

Well, to be completely accurate, it’s shorthand for

set<string*, less<string*>, allocator<string*> > ssp;

but allocators don’t concern us in this Item, so we’ll ignore them.

If you want the string* pointers to be stored in the set in an order determined by the string values, you can’t use the default comparison functor class less<string*>. You must instead write your own comparison functor class, one whose objects take string* pointers and order them by the values of the strings they point to. Like this:

struct StringPtrLess:
  public binary_function<const string*,            // see Item 40 for the
                         const string*,            // reason for this base
                         bool>{                    // class

  bool operator()(const string *ps1, const string *ps2) const
  {
    return *ps1 < *ps2;
  }
};

Then you can use StringPtrLess as ssp’s comparison type:

typedef set<string*, StringPtrLess> StringPtrSet;

StringPtrSet ssp;                       // create a set of strings and order
                                        // them as defined by StringPtrLess

...                                     // insert the same four strings as
                                        // before

Now your loop will finally do what you want it to do (provided you’ve fixed the earlier problem whereby you used *i instead of **i):

for (StringPtrSet::const_iterator i = ssp.begin();  // prints "Anteater",
     i != ssp.end();                                // "Lemur",
     ++i)                                           // "Penguin",
   cout << **i << endl;                             // "Wombat"

If you want to use an algorithm instead, you could write a function that knows how to dereference string* pointers before printing them, then use that function in conjunction with for_each:

void print(const string *ps)                        // print to cout the
{                                                   // object pointed to
  cout << *ps << endl;                              // by ps
}

for_each(ssp.begin(), ssp.end(), print);            // invoke print on each
                                                    // element in ssp

Or you could get fancy and write a generic dereferencing functor class, then use that with transform and an ostream_iterator:

// when functors of this type are passed a T*, they return a const T&
struct Dereference {

  template <typename T>
  const T& operator()(const T *ptr) const
  {
    return *ptr;
  }
};

transform(ssp.begin(), ssp.end(),                // "transform" each
          ostream_iterator<string>(cout, " "),  // element in ssp by
          Dereference());                        // dereferencing it,
                                                 // writing the results
                                                 // to cout

Replacement of loops with algorithms, however, is not the point, at least not for this Item. (It is the point for Item 43.) The point is that anytime you create a standard associative container of pointers, you must bear in mind that the container will be sorted by the values of the pointers. Only rarely will this be what you want, so you’ll almost always want to create your own functor class to serve as a comparison type.

Notice that I wrote “comparison type.” You may wonder why you have to go to the trouble of creating a functor class instead of simply writing a comparison function for the set. For example, you might think to try this:

bool stringPtrLess(const string* ps1,     // would-be comparison
                   const string* ps2)     // function for string*
{                                         // pointers to be sorted by
  return *ps1 < *ps2;                     // string value
}

set<string*, stringPtrLess> ssp;          // attempt to use stringPtrLess
                                          // as ssp's comparison function;
                                          // this won't compile

The problem here is that each of the set template’s three parameters is a type. Unfortunately, stringPtrLess isn’t a type, it’s a function. That’s why the attempt to use stringPtrLess as set’s comparison function won’t compile. set doesn’t want a function, it wants a type that it can internally instantiate to create a function.
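
If you're determined to use a comparison function rather than a functor class, you can make the comparison type a function pointer type and pass the function to the container's constructor. It works, but it's clumsier than defining a small functor class. A sketch:

typedef bool (*StringPtrComp)(const string*,        // type: pointer to a
                              const string*);       // comparison function

set<string*, StringPtrComp> ssp(stringPtrLess);     // the comparison type is a
                                                    // function pointer type;
                                                    // stringPtrLess must be passed
                                                    // to the constructor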

Anytime you create associative containers of pointers, figure you’re probably going to have to specify the container’s comparison type, too. Most of the time, your comparison type will just dereference the pointers and compare the pointed-to objects (as is done in StringPtrLess above). That being the case, you might as well keep a template for such comparison functors close at hand. Like this:

struct DereferenceLess {                   // see Item 7 for why we're using
                                           // a class containing a templatized
                                           // operator()
  template <typename PtrType>
  bool operator()(PtrType pT1,             // parameters are passed by
                  PtrType pT2) const       // value, because we expect them
  {                                        // to be (or to act like) pointers
    return *pT1 < *pT2;
  }
};

Such a template eliminates the need to write classes like StringPtrLess, because we can use DereferenceLess instead:

set<string*, DereferenceLess> ssp;               // behaves the same as
                                                 // set<string*, StringPtrLess>

Oh, one more thing. This Item is about associative containers of pointers, but it applies equally well to containers of objects that act like pointers, e.g., smart pointers and iterators. If you have an associative container of smart pointers or of iterators, plan on specifying the comparison type for it, too. Fortunately, the solution for pointers tends to work for pointeresque objects, too. Just as DereferenceLess is likely to be suitable as the comparison type for an associative container of T*, it’s likely to work as the comparison type for containers of iterators and smart pointers to T objects, too.
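
For example, assuming a reference-counted smart pointer such as shared_ptr (from TR1 or Boost), a sketch looks like this:

set<shared_ptr<string>, DereferenceLess> ssps;      // set of shared_ptr<string>,
                                                    // sorted by the values of the
                                                    // strings they point to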

Item 21: Always have comparison functions return false for equal values

Let me show you something kind of cool. Create a set where less_equal is the comparison type, then insert 10 into the set:

set<int, less_equal<int> > s;       // s is sorted by "<="

s.insert(10);                       // insert the value 10

Now try inserting 10 again:

s.insert(10);

For this call to insert, the set has to figure out whether 10 is already present. We know that it is, but the set is dumb as toast, so it has to check. To make it easier to understand what happens when the set does this, we’ll call the 10 that was initially inserted 10A and the 10 that we’re trying to insert 10B.

The set runs through its internal data structures looking for the place to insert 10B. It ultimately has to check 10B to see if it’s the same as 10A. The definition of “the same” for associative containers is equivalence (see Item 19), so the set tests to see whether 10B is equivalent to 10A. When performing this test, it naturally uses the set’s comparison function. In this example, that’s operator<=, because we specified less_equal as the set’s comparison function, and less_equal means operator<=. The set thus checks to see whether this expression is true:

!(10A <= 10B) && !(10B <= 10A)        // test 10A and 10B for equivalence

Well, 10A and 10B are both 10, so it’s clearly true that 10A <= 10B. Equally clearly, 10B <= 10A. The above expression thus simplifies to

!(true) && !(true)

and that simplifies to

false && false

which is simply false. That is, the set concludes that 10A and 10B are not equivalent, hence not the same, and it thus goes about inserting 10B into the container alongside 10A. Technically, this action yields undefined behavior, but the nearly universal outcome is that the set ends up with two copies of the value 10, and that means it’s not a set any longer. By using less_equal as our comparison type, we’ve corrupted the container! Furthermore, any comparison function where equal values return true will do the same thing. Isn’t that cool?

Okay, maybe your definition of cool isn’t the same as mine. Even so, you’ll still want to make sure that the comparison functions you use for associative containers always return false for equal values. You’ll need to be vigilant, however. It’s surprisingly easy to run afoul of this constraint.

For example, Item 20 describes how to write a comparison function for containers of string* pointers such that the container sorts its contents by the values of the strings instead of the values of the pointers. That comparison function sorts them in ascending order, but let’s suppose you’re in need of a comparison function for a container of string* pointers that sorts in descending order. The natural thing to do is to grab the existing code and modify it. If you’re not careful, you might come up with this, where I’ve highlighted the changes to the code in Item 20:

struct StringPtrGreater:                         // highlights show how
  public binary_function<const string*,          // this code was changed
                         const string*,          // from page 89. Beware,
                         bool> {                 // this code is flawed!

  bool operator()(const string *ps1, const string *ps2) const
  {
    return *ps1 >= *ps2 ? false : !(*ps1 < *ps2); // just negate the old test;
  }                                              // this is incorrect!

};

The idea here is to reverse the sort order by negating the test inside the comparison function. Unfortunately, negating “<” doesn’t give you “>” (which is what you want), it gives you “>=”. And you now understand that “>=”, because it will return true for equal values, is an invalid comparison function for associative containers.

The comparison type you really want is this one:

struct StringPtrGreater:                         // this is a valid
  public binary_function<const string*,          // comparison type for
                         const string*,          // associative containers
                         bool> {

  bool operator()(const string *ps1, const string *ps2) const
  {
    return *ps2 < *ps1;                          // return whether *ps2
  }                                              // precedes *ps1 (i.e., swap
                                                 // the order of the
};                                               // operands)

To avoid falling into this trap, all you need to remember is that the return value of a comparison function indicates whether one value precedes another in the sort order defined by that function. Equal values never precede one another, so comparison functions should always return false for equal values.

Sigh.

I know what you’re thinking. You’re thinking, “Sure, that makes sense for set and map, because those containers can’t hold duplicates. But what about multiset and multimap? Those containers may contain duplicates, so what do I care if the container thinks that two objects of equal value aren’t equivalent? It will store them both, which is what the multi containers are supposed to do. No problem, right?”

Wrong. To see why, let’s go back to the original example, but this time we’ll use a multiset:

multiset<int, less_equal<int> > s;      // s is still sorted by "<="

s.insert(10);                           // insert 10A

s.insert(10);                           // insert 10B

s now has two copies of 10 in it, so we’d expect that if we do an equal_range on it, we’ll get back a pair of iterators that define a range containing both copies. But that’s not possible. equal_range, its name notwithstanding, doesn’t identify a range of equal values, it identifies a range of equivalent values. In this example, s’s comparison function says that 10A and 10B are not equivalent, so there’s no way that both can be in the range identified by equal_range.
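
Here's a sketch of what that looks like in code. (Because the container's invariants have been violated, the precise outcome is technically undefined, but you should not count on getting a two-element range.)

typedef multiset<int, less_equal<int> >::iterator MSIter;

pair<MSIter, MSIter> p = s.equal_range(10);         // ask for the range of values
                                                    // equivalent to 10

multiset<int, less_equal<int> >::size_type n =
  distance(p.first, p.second);                      // you'd expect n to be 2, but
                                                    // 10A and 10B aren't equivalent
                                                    // under "<=", so don't count
                                                    // on it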

You see? Unless your comparison functions always return false for equal values, you break all standard associative containers, regardless of whether they are allowed to store duplicates.

Technically speaking, comparison functions used to sort associative containers must define a “strict weak ordering” over the objects they compare. (Comparison functions passed to algorithms like sort (see Item 31) are similarly constrained.) If you’re interested in the details of what it means to be a strict weak ordering, you can find them in many comprehensive STL references, including Josuttis’ The C++ Standard Library [3], Austern’s Generic Programming and the STL [4], and the SGI STL Web Site [21]. I’ve never found the details terribly illuminating, but one of the requirements of a strict weak ordering bears directly on this Item. That requirement is that any function defining a strict weak ordering must return false if it’s passed two copies of the same value.

Hey! That is this Item!
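
If you'd like a quick way to catch the most common violation, a check like the following can be applied to any comparison functor you plan to use with an associative container. It's only a sketch, not a substitute for verifying the full strict weak ordering requirements:

template<typename Comp, typename T>
bool falseForEqualValues(Comp comp, const T& x)     // a valid comparison function
{                                                   // must return false when both
  return !comp(x, x);                               // arguments have the same value
}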

Item 22: Avoid in-place key modification in set and multiset

The motivation for this Item is easy to understand. Like all the standard associative containers, set and multiset keep their elements in sorted order, and the proper behavior of these containers is dependent on their remaining sorted. If you change the value of an element in an associative container (e.g., change a 10 to a 1000), the new value might not be in the correct location, and that would break the sortedness of the container. Simple, right?

It’s especially simple for map and multimap, because programs that attempt to change the value of a key in these containers won’t compile:

map<int, string> m;
...
m.begin()->first = 10;           // error! map keys can't be changed

multimap<int, string> mm;
...
mm.begin()->first = 20;          // error! multimap keys can't
                                 // be changed, either

That’s because the elements in an object of type map<K, V> or multimap<K, V> are of type pair<const K, V>. Because the type of the key is const K, it can’t be changed. (Well, you can probably change it if you employ a const_cast, as we’ll see below. Believe it or not, sometimes that’s what you want to do.)

But notice that the title of this Item doesn’t mention map or multimap. There’s a reason for that. As the example above demonstrates, in-place key modification is impossible for map and multimap (unless you use a cast), but it may be possible for set and multiset. For objects of type set<T> or multiset<T>, the type of the elements stored in the container is simply T, not const T. Hence, the elements in a set or multiset may be changed anytime you want to. No cast is required. (Actually, things aren’t quite that straightforward, but we’ll come to that presently. There’s no reason to get ahead of ourselves. First we crawl. Later we crawl on broken glass.)

Let us begin with an understanding of why the elements in a set or multiset aren’t const. Suppose we have a class for employees:

class Employee {
public:
  ...
  const string& name() const;           // get employee name
  void setName(const string& name);     // set employee name

  const string& title() const;          // get employee title
  void setTitle(const string& title);   // set employee title

  int idNumber() const;                 // get employee ID number
  ...
};

As you can see, there are various bits of information we can get about employees. Let us make the reasonable assumption, however, that each employee has a unique ID number, a number that is returned by the idNumber function. To create a set of employees, then, it could easily make sense to order the set by ID number only, like this:

struct IDNumberLess:
  public binary_function<Employee, Employee, bool>{      // see Item 40

  bool operator()(const Employee& lhs,
                  const Employee& rhs) const
  {
    return lhs.idNumber() < rhs.idNumber();
  }

};

typedef set<Employee, IDNumberLess> EmpIDSet;

EmpIDSet se;                           // se is a set of employees
                                       // sorted by ID number

Practically speaking, the employee ID number is the key for the elements in this set. The rest of the employee data is just along for the ride. That being the case, there’s no reason why we shouldn’t be able to change the title of a particular employee to something interesting. Like so:

Employee selectedID;                          // variable to hold employee
                                              // with selected ID number
...

EmpIDSet::iterator i = se.find(selectedID);
if (i != se.end()){
    i->setTitle("Corporate Deity");           // give employee new title;
}                                             // this shouldn't compile --
                                              // see next page for why

Because all we’re doing here is changing an aspect of an employee that is unrelated to the way the set is sorted (a non-key part of the employee), this code can’t corrupt the set. That’s why it’s reasonable. But such reasonable code precludes having the elements of a set/multiset be const. And that’s why they aren’t.

Why, you might wonder, doesn’t the same logic apply to the keys in maps and multimaps? Wouldn’t it make sense to create a map from employees to, say, the country in which they live, a map where the comparison function was IDNumberLess, just as in the previous example? And given such a map, wouldn’t it be reasonable to change an employee’s title without affecting the employee’s ID number, just as in the previous example?

To be brutally frank, I think it would. Being equally frank, however, it doesn’t matter what I think. What matters is what the Standardization Committee thought, and what it thought is that map/multimap keys should be const and set/multiset values shouldn’t be.

Because the values in a set or multiset are not const, attempts to change them may compile. The purpose of this Item is to alert you that if you do change an element in a set or multiset, you must be sure not to change a key part—a part of the element that affects the sortedness of the container. If you do, you may corrupt the container, using the container will yield undefined results, and it will be your fault. On the other hand, this constraint applies only to the key parts of the contained objects. For all other parts of the contained elements, it’s open season; change away!

Except for the broken glass. Remember the broken glass I referred to earlier? We’re there now. Grab some bandages and follow me.

Even if set and multiset elements aren’t const, there are ways for implementations to keep them from being modified. For example, an implementation could have operator* for a set<T>::iterator return a const T&. That is, it could have the result of dereferencing a set iterator be a reference-to-const element of the set. Under such an implementation, there’d be no way to modify set or multiset elements, because all the ways of accessing the elements would add a const before letting you at them.

Are such implementations legal? Arguably yes. And arguably no. The Standard is inconsistent in this area, and in accord with Murphy’s Law, different implementers have interpreted it in different ways. The result is that it’s not uncommon to find STL implementations where this code compiles and others where it does not:

EmpIDSet se;                            // as before, se is a set of employees
                                        // sorted by ID number

Employee selectedID;                    // as before, selectedID is a dummy
                                        // employee with the selected ID
                                        // number
...

EmpIDSet::iterator i = se.find(selectedID);
if (i != se.end()){
    i->setTitle("Corporate Deity");     // some STL implementations
}                                       // accept this line, others reject it

Because of the equivocal state of the Standard and the differing interpretations it has engendered, code that attempts to modify elements in a set or multiset isn’t portable.

The Standardization Committee has since clarified that elements in a set or map should not be modifiable without a cast. However, versions of the STL implemented prior to this clarification continue to be used, so, practically speaking, the material in this Item continues to apply.

So where do we stand? Encouragingly, things aren’t complicated:

• If portability is not a concern, you want to change the value of an element in a set or multiset, and your STL implementation will let you get away with it, go ahead and do it. Just be sure not to change a key part of the element, i.e., a part of the element that could affect the sortedness of the container.

• If you value portability, assume that the elements in sets and multisets cannot be modified, at least not without a cast.

Ah, casts. We’ve seen that it can be entirely reasonable to change a non-key portion of an element in a set or a multiset, so I feel compelled to show you how to do it. How to do it correctly and portably, that is. It’s not hard, but it requires a subtlety too many programmers overlook: you must cast to a reference. As an example, look again at the setTitle call we just saw that failed to compile under some implementations:

EmpIDSet::iterator i = se.find(selectedID);
if (i != se.end()){
    i->setTitle("Corporate Deity");    // some STL implementations will
}                                      // reject this line because *i is const

To get this to compile and behave correctly, we must cast away the constness of *i. Here’s the correct way to do it:

if (i != se.end()) {                                           // cast away
    const_cast<Employee&>(*i).setTitle("Corporate Deity");     // constness
}                                                              // of *i

This takes the object pointed to by i, tells your compilers to treat the result of the cast as a reference to a (non-const) Employee, then invokes setTitle on the reference. Rather than explain why this works, I’ll explain why an alternative approach fails to behave the way people often expect it to.

Many people come up with this code,

if (i != se.end()) {                                             // cast *i
    static_cast<Employee>(*i).setTitle("Corporate Deity");       // to an
}                                                                // Employee

which is equivalent to the following:

if (i != se.end()) {                                         // same as above,
   ((Employee)(*i)).setTitle("Corporate Deity");             // but using C
}                                                            // cast syntax

Both of these compile, and because these are equivalent, they’re wrong for the same reason. At runtime, they don’t modify *i! In both cases, the result of the cast is a temporary anonymous object that is a copy of *i, and setTitle is invoked on the anonymous object, not on *i! *i isn’t modified, because setTitle is never invoked on that object, it’s invoked on a copy of that object. Both syntactic forms are equivalent to this:

if (i != se.end()) {
    Employee tempCopy(*i);                    // copy *i into tempCopy
    tempCopy.setTitle("Corporate Deity");     // modify tempCopy
}

Now the importance of a cast to a reference should be clear. By casting to a reference, we avoid creating a new object. Instead, the result of the cast is a reference to an existing object, the object pointed to by i. When we invoke setTitle on the object indicated by the reference, we’re invoking setTitle on *i, and that’s exactly what we want.

What I’ve just written is fine for sets and multisets, but when we turn to maps and multimaps, the plot thickens. Recall that a map<K, V> or a multimap<K, V> contains elements of type pair<const K, V>. That const means that the first component of the pair is defined to be const, and that means that attempts to modify it (even after casting away its constness) are undefined. If you’re a stickler for following the rules laid down by the Standard, you’ll never try to modify an object serving as a key to a map or multimap.

You’ve surely heard that casts are dangerous, and I hope this book makes clear that I believe you should avoid them whenever you can. To perform a cast is to shuck temporarily the safety of the type system, and the pitfalls we’ve discussed exemplify what can happen when you leave your safety net behind.

Most casts can be avoided, and that includes the ones we’ve just considered. If you want to change an element in a set, multiset, map, or multimap in a way that always works and is always safe, do it in five simple steps:

  1. Locate the container element you want to change. If you’re not sure of the best way to do that, Item 45 offers guidance on how to perform an appropriate search.
  2. Make a copy of the element to be modified. In the case of a map or multimap, be sure not to declare the first component of the copy const. After all, you want to change it!
  3. Modify the copy so it has the value you want to be in the container.
  4. Remove the element from the container, typically via a call to erase (see Item 9).
  5. Insert the new value into the container. If the location of the new element in the container’s sort order is likely to be the same or adjacent to that of the removed element, use the “hint” form of insert to improve the efficiency of the insertion from logarithmic-time to amortized constant-time. Use the iterator you got from Step 1 as the hint.

Here’s the same tired employee example, this time written in a safe, portable manner:

EmpIDSet se;                          // as before, se is a set of employees
                                      // sorted by ID number

Employee selectedID;                  // as before, selectedID is a dummy
                                      // employee with the desired ID number
...

EmpIDSet::iterator i =
  se.find(selectedID);               // Step 1: find element to change

if (i != se.end()) {
    Employee e(*i);                   // Step 2: copy the element

    e.setTitle("Corporate Deity");    // Step 3: modify the copy

    se.erase(i++);                    // Step 4: remove the element;
                                      // increment the iterator to maintain
                                      // its validity (see Item 9)

    se.insert(i, e);                  // Step 5: insert new value; hint that its
                                      // location is the same as that of the
}                                     // original element
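
The same five steps apply to map and multimap when the thing you need to change is the key itself. (When only the mapped value changes, you can modify it directly through the iterator; no erase and insert are needed.) Here's a sketch, assuming a map<int, string> m and keys oldKey and newKey:

map<int, string>::iterator i =
  m.find(oldKey);                     // Step 1: find the element to change

if (i != m.end()) {
    pair<int, string> p(*i);          // Step 2: copy it; note that the key
                                      // in the copy is not const

    p.first = newKey;                 // Step 3: modify the copy

    m.erase(i++);                     // Step 4: remove the element; increment
                                      // the iterator to keep it valid

    m.insert(i, p);                   // Step 5: insert the new value; the hint
}                                     // pays off only if newKey belongs at or
                                      // near oldKey's old position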

You’ll excuse my putting it this way, but the key thing to remember is that with set and multiset, if you perform any in-place modifications of container elements, you are responsible for making sure that the container remains sorted.

Item 23: Consider replacing associative containers with sorted vectors

Many STL programmers, when faced with the need for a data structure offering fast lookups, immediately think of the standard associative containers, set, multiset, map, and multimap. That’s fine, as far as it goes, but it doesn’t go far enough. If lookup speed is really important, it’s almost certainly worthwhile to consider the nonstandard hashed containers as well (see Item 25). With suitable hashing functions, hashed containers can be expected to offer constant-time lookups. (With poorly chosen hashing functions or with table sizes that are too small, the performance of hash table lookups may degrade significantly, but this is relatively uncommon in practice.) For many applications, the expected constant-time lookups of hashed containers are preferable to the guaranteed logarithmic-time lookups that are the hallmark of set, map and their multi companions.

Even if logarithmic-time lookup is what you want, the standard associative containers may not be your best bet. Counterintuitively, it’s not uncommon for the standard associative containers to offer performance that is inferior to that of the lowly vector. If you want to make effective use of the STL, you need to understand when and how a vector can offer faster lookups than a standard associative container.

The standard associative containers are typically implemented as balanced binary search trees. A balanced binary search tree is a data structure that is optimized for a mixed combination of insertions, erasures, and lookups. That is, it’s designed for applications that do some insertions, then some lookups, then maybe some more insertions, then perhaps some erasures, then a few more lookups, then more insertions or erasures, then more lookups, etc. The key characteristic of this sequence of events is that the insertions, erasures, and lookups are all mixed up. In general, there’s no way to predict what the next operation on the tree will be.

Many applications use their data structures in a less chaotic manner. Their use of data structures falls into three distinct phases, which can be summarized like this:

  1. Setup. Create a new data structure by inserting lots of elements into it. During this phase, almost all operations are insertions and erasures. Lookups are rare or nonexistent.
  2. Lookup. Consult the data structure to find specific pieces of information. During this phase, almost all operations are lookups. Insertions and erasures are rare or nonexistent. There are so many lookups, the performance of this phase makes the performance of the other phases incidental.
  3. Reorganize. Modify the contents of the data structure, perhaps by erasing all the current data and inserting new data in its place. Behaviorally, this phase is equivalent to phase 1. Once this phase is completed, the application returns to phase 2.

For applications that use their data structures in this way, a vector is likely to offer better performance (in both time and space) than an associative container. But not just any vector will do. It has to be a sorted vector, because only sorted containers work correctly with the lookup algorithms binary_search, lower_bound, equal_range, etc. (see Item 34). But why should a binary search through a (sorted) vector offer better performance than a binary search through a binary search tree? Because some things are trite but true, and one of them is that size matters. Others are less trite but no less true, and one of those is that locality of reference matters, too.

Consider first the size issue. Suppose we need a container to hold Widget objects, and, because lookup speed is important to us, we are considering both an associative container of Widgets and a sorted vector<Widget>. If we choose an associative container, we’ll almost certainly be using a balanced binary tree. Such a tree would be made up of tree nodes, each holding not only a Widget, but also a pointer to the node’s left child, a pointer to its right child, and (typically) a pointer to its parent. That means that the space overhead for storing a Widget in an associative container would be at least three pointers.

In contrast, there is no overhead when we store a Widget in a vector; we simply store a Widget. The vector itself has overhead, of course, and there may be empty (reserved) space at the end of the vector (see Item 14), but the per-vector overhead is typically insignificant (usually three machine words, e.g., three pointers or two pointers and an int), and the empty space at the end can be lopped off via “the swap trick” if necessary (see Item 17). Even if the extra space is not eliminated, it’s unimportant for the analysis below, because that memory won’t be referenced when doing a lookup.

Assuming our data structures are big enough, they’ll be split across multiple memory pages, but the vector will require fewer pages than the associative container. That’s because the vector requires no per-Widget overhead, while the associative container exacts three pointers per Widget. To see why this is important, suppose you’re working on a system where a Widget is 12 bytes in size, pointers are 4 bytes, and a memory page holds 4096 (4K) bytes. Ignoring the per-container overhead, you can fit 341 Widgets on a page when they are stored in a vector, but you can fit at most 170 when they are stored in an associative container. You’ll thus use about twice as much memory for the associative container as you would for the vector. If you’re working in an environment where virtual memory is available, it’s easy to see how that can translate into a lot more page faults and, therefore, a system that is significantly slower for large sets of data.

I’m actually being optimistic about the associative containers here, because I’m assuming that the nodes in the binary trees are clustered together on a relatively small set of memory pages. Most STL implementations use custom memory managers (implemented on top of the containers’ allocators—see Items 10 and 11) to achieve such clustering, but if your STL implementation fails to take steps to improve locality of reference among tree nodes, the nodes could end up scattered all over your address space. That would lead to even more page faults. Even with the customary clustering memory managers, associative containers tend to have more problems with page faults, because, unlike contiguous-memory containers such as vector, node-based containers find it more difficult to guarantee that container elements that are close to one another in a container’s traversal order are also close to one another in physical memory. Yet this is precisely the kind of memory organization that minimizes page faults when performing a binary search.

Bottom line: storing data in a sorted vector is likely to consume less memory than storing the same data in a standard associative container, and searching a sorted vector via binary search is likely to be faster than searching a standard associative container when page faults are taken into account.

Of course, the big drawback of a sorted vector is that it must remain sorted! When a new element is inserted, everything beyond the new element must be moved up by one. That’s as expensive as it sounds, and it gets even more expensive if the vector has to reallocate its underlying memory (see Item 14), because then all the elements in the vector typically have to be copied. Similarly, if an element is removed from the vector, all the elements beyond it must be moved down. Insertions and erasures are expensive for vectors, but they’re cheap for associative containers. That’s why it makes sense to consider using a sorted vector instead of an associative container only when you know that your data structure is used in such a way that lookups are almost never mixed with insertions and erasures.
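
To make that cost concrete, here's a sketch of what a single insertion into a sorted vector involves (the names are illustrative only):

vector<Widget> vw;                                  // sorted vector of Widgets
Widget w;                                           // value to insert
...

vector<Widget>::iterator p =
  lower_bound(vw.begin(), vw.end(), w);             // find the insertion point
                                                    // (logarithmic time)

vw.insert(p, w);                                    // insert w there; every element
                                                    // from p to the end moves up by
                                                    // one (linear time), and the
                                                    // vector may have to reallocate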

This Item has featured a lot of text, but it’s been woefully short on examples, so let’s take a look at a code skeleton for using a sorted vector instead of a set:

vector<Widget> vw;                               // alternative to set<Widget>

...                                              // Setup phase: lots of insertions,
                                                 // few lookups

sort(vw.begin(), vw.end());                      // end of Setup phase. (When
                                                 // simulating a multiset, you might
                                                 // prefer stable_sort instead; see
                                                 // Item 31.) When emulating a set,
                                                 // be sure vw holds no duplicates!

Widget w;                                        // object for value to look up

...                                              // start Lookup phase

if (binary_search(vw.begin(), vw.end(), w)) ...  // lookup via binary_search

vector<Widget>::iterator i =
  lower_bound(vw.begin(), vw.end(), w);          // lookup via lower_bound;
if (i != vw.end() && !(w < *i)) ...              // see Item 19 for an explana-
                                                 // tion of the "!(w < *i)" test

pair<vector<Widget>::iterator,
     vector<Widget>::iterator> range =
  equal_range(vw.begin(), vw.end(), w);          // lookup via equal_range
if (range.first != range.second) ...

...                                              // end Lookup phase, start
                                                 // Reorganize phase

sort(vw.begin(), vw.end());                      // begin new Lookup phase...

As you can see, it’s all pretty straightforward. The hardest thing about it is deciding among the search algorithms (e.g., binary_search, lower_bound, etc.), and Item 45 helps you do that.

Things get a bit more interesting when you decide to replace a map or multimap with a vector, because the vector must hold pair objects. After all, that’s what map and multimap hold. Recall, however, that if you declare an object of type map<K, V> (or its multimap equivalent), the type of elements stored in the map is pair<const K, V>. To emulate a map or multimap using a vector, you must omit the const, because when you sort the vector, the values of its elements will get moved around via assignment, and that means that both components of the pair must be assignable. When using a vector to emulate a map<K, V>, then, the type of the data stored in the vector will be pair<K, V>, not pair<const K, V>.

maps and multimaps keep their elements in sorted order, but they look only at the key part of the element (the first component of the pair) for sorting purposes, and you must do the same when sorting a vector. You’ll need to write a custom comparison function for your pairs, because pair’s operator< looks at both components of the pair.

Interestingly, you’ll need a second comparison function for performing lookups. The comparison function you’ll use for sorting will take two pair objects, but lookups are performed given only a key value. The comparison function for lookups, then, must take an object of the key type (the value being searched for) and a pair (one of the pairs stored in the vector)—two different types. As an additional twist, you can’t know whether the key value or the pair will be passed as the first argument, so you really need two comparison functions for lookups: one where the key value is passed first and one where the pair is passed first.

Here’s an example of how to put all the pieces together:

typedef pair<string, int> Data;                         // type held in the "map"
                                                        // in this example

class DataCompare {                                     // class for comparison
public:                                                 // functions

  bool operator()(const Data& lhs,                            // comparison func
                  const Data& rhs) const                      // for sorting
  {
    return keyLess(lhs.first, rhs.first);                     // keyLess is below
  }

  bool operator()(const Data& lhs,                            // comparison func
                  const Data::first_type& k) const            // for lookups
  {                                                           // (form 1)
    return keyLess(lhs.first, k);
  }

  bool operator()(const Data::first_type& k,                  // comparison func
                  const Data& rhs) const                      // for lookups
  {                                                           // (form 2)
    return keyLess(k, rhs.first);
  }

private:
  bool keyLess(const Data::first_type& k1,                    // the "real"
               const Data::first_type& k2) const              // comparison
  {                                                           // function
    return k1 < k2;
  }
};

In this example, we assume that our sorted vector will be emulating a map<string, int>. The code is pretty much a literal translation of the discussion above, except for the presence of the member function keyLess. That function exists to ensure consistency among the various operator() functions. Each such function simply compares two key values, so rather than duplicate the logic, we put the test inside keyLess and have the operator() functions return whatever keyLess does. This admirable act of software engineering enhances the maintainability of DataCompare, but there is a minor drawback. Its provision for operator() functions with different parameter types renders the resulting function objects unadaptable (see Item 40). Oh well.

Using a sorted vector as a map is essentially the same as using it as a set. The only major difference is the need to use DataCompare objects as comparison functions:

vector<Data> vd;                                // alternative to
                                                // map<string, int>

...                                             // Setup phase: lots of
                                                // insertions, few lookups

sort(vd.begin(), vd.end(), DataCompare());      // end of Setup phase. Simu-
                                                // lated maps must avoid
                                                // dupes; simulated multimaps
                                                // might use stable_sort

string s;                                       // object for value to look up

...                                             // start Lookup phase

if (binary_search(vd.begin(), vd.end(), s,
                  DataCompare())) ...           // lookup via binary_search

vector<Data>::iterator i =
  lower_bound(vd.begin(), vd.end(), s,
              DataCompare());                   // lookup via lower_bound;
if (i != vd.end() && !DataCompare()(s, *i)) ... // again, see Item 19 for info
                                                // on the
                                                // "!DataCompare()(s, *i)" test

pair<vector<Data>::iterator,
     vector<Data>::iterator> range =
  equal_range(vd.begin(), vd.end(), s,
              DataCompare());                   // lookup via equal_range
if (range.first != range.second) ...

...                                             // end Lookup phase, start
                                                // Reorganize phase

sort(vd.begin(), vd.end(), DataCompare());      // begin new Lookup phase...

As you can see, once you’ve written DataCompare, things pretty much fall into place. And once in place, they’ll often run faster and use less memory than the corresponding design using a real map as long as your program uses the data structure in the phased manner described on page 101. If your program doesn’t operate on the data structure in a phased manner, use of a sorted vector instead of a standard associative container will almost certainly end up wasting time.
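
As the comment in the Setup phase above notes, a simulated map must keep duplicate keys out of the sorted vector. One way to do that, sketched below under the assumption that the usual <algorithm> header and the vd/Data/DataCompare declarations above are in scope, is to sort and then erase all but the first element of each run of equivalent keys. The DataKeyEqual predicate is not part of the earlier code; it's introduced here purely for illustration:

struct DataKeyEqual {                            // true when two Data objects
  bool operator()(const Data& lhs,               // have equivalent keys
                  const Data& rhs) const
  {
    return !(lhs.first < rhs.first) && !(rhs.first < lhs.first);
  }
};

sort(vd.begin(), vd.end(), DataCompare());       // order by key

vd.erase(unique(vd.begin(), vd.end(),            // keep only the first element
                DataKeyEqual()),                 // of each run of elements
         vd.end());                              // with equivalent keys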

Item 24: Choose carefully between map::operator[] and map::insert when efficiency is important

Let’s suppose we have a Widget class that supports default construction as well as construction and assignment from a double:

class Widget {
public:
  Widget();
  Widget(double weight);

  Widget& operator=(double weight);
  ...
};

Let’s now suppose we’d like to create a map from ints to Widgets, and we’d like to initialize the map with particular values. ’Tis simplicity itself:

map<int, Widget> m;

m[1] = 1.50;
m[2] = 3.67;
m[3] = 10.5;
m[4] = 45.8;
m[5] = 0.0003;

In fact, the only thing simpler is forgetting what’s really going on. That’s too bad, because what’s going on could incur a considerable performance hit.

The operator[] function for maps is a novel creature, unrelated to the operator[] functions for vectors, deques, and strings and equally unrelated to the built-in operator[] that works with arrays. Instead, map::operator[] is designed to facilitate “add or update” functionality. That is, given

map<K, V> m;

the expression

m[k] = v;

checks to see if the key k is already in the map. If not, it’s added, along with v as its corresponding value. If k is already in the map, its associated value is updated to v.

The way this works is that operator[] returns a reference to the value object associated with k. v is then assigned to the object to which the reference (the one returned from operator[]) refers. This is straightforward when an existing key’s associated value is being updated, because there’s already a value object to which operator[] can return a reference. But if k isn’t yet in the map, there’s no value object for operator[] to refer to. In that case, it creates one from scratch by using the value type’s default constructor. operator[] then returns a reference to this newly-created object.
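
One consequence worth keeping in mind (a small aside, not part of the example that follows): because operator[] default-constructs a value object for any key it doesn't find, even an apparently read-only use of operator[] can modify the map:

map<int, Widget> m;                 // empty map

Widget w = m[42];                   // looks like a read, but it inserts a
                                    // default-constructed Widget under key 42,
                                    // then copies that Widget into w

m.find(42);                         // a true lookup; never inserts anything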

Let’s look again at the first part of our original example:

map<int, Widget> m;

m[1] = 1.50;

The expression m[1] is shorthand for m.operator[](1), so this is a call to map::operator[]. That function must return a reference to a Widget, because m’s mapped type is Widget. In this case, m doesn’t yet have anything in it, so there is no entry in the map for the key 1. operator[] therefore default-constructs a Widget to act as the value associated with 1, then returns a reference to that Widget. Finally, the Widget becomes the target of an assignment; the assigned value is 1.50.

In other words, the statement

m[1] = 1.50;

is functionally equivalent to this:

typedef map<int, Widget> IntWidgetMap;              // convenience
                                                    // typedef

pair<IntWidgetMap::iterator, bool> result =         // create new map
  m.insert(IntWidgetMap::value_type(1, Widget()));  // entry with key 1
                                                    // and a default-
                                                    // constructed value
                                                    // object; see below
                                                    // for a comment on
                                                    // value_type

result.first->second = 1.50;                        // assign to the
                                                    // newly-constructed
                                                    // value object

Now it should be clear why this approach may degrade performance. We first default-construct a Widget, then we immediately assign it a new value. If it’s measurably more efficient to construct a Widget with the value we want instead of default-constructing the Widget and then doing the assignment, we’d be better off replacing our use of operator[] (including its attendant construction plus assignment) with a straightforward call to insert:

m.insert(IntWidgetMap::value_type(1, 1.50));

This has precisely the same ultimate effect as the code above, but it typically saves you three function calls: one to create a temporary default-constructed Widget object, one to destruct that temporary object, and one to Widget's assignment operator. The more expensive those function calls, the more you save by using map::insert instead of map::operator[].

The code above takes advantage of the value_type typedef that’s provided by every standard container. There’s nothing particularly significant about this typedef, but it’s important to remember that for map and multimap (as well as the nonstandard containers hash_map and hash_multimap—see Item 25), the type of the contained elements will always be some kind of pair.
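
One detail the typedef hides: for map and multimap, the key half of that pair is const, so a value_type object's key can be set only at construction. A quick illustration using our example map:

map<int, Widget>::value_type p(1, 1.50);   // p is a pair<const int, Widget>;
                                           // its key can't be changed after
                                           // construction...

p.second = 2.0;                            // ...but its mapped value can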

I remarked earlier that operator[] is designed to facilitate “add or update” functionality, and we now understand that when an “add” is performed, insert is more efficient than operator[]. The situation is reversed when we do an update, i.e., when an equivalent key (see Item 19) is already in the map. To see why that is the case, look at our update options:

m[k] = v;                                                   // use operator[]
                                                            // to update k's
                                                            // value to be v

m.insert(IntWidgetMap::value_type(k, v)).first->second = v; // use insert to
                                                            // update k's
                                                            // value to be v

The syntax alone is probably enough to convince you to favor operator[], but we’re focusing on efficiency here, so we’ll overlook that.

The call to insert requires an argument of type IntWidgetMap::value_type (i.e., pair<int, Widget>), so when we call insert, we must construct and destruct an object of that type. That costs us a pair constructor and destructor. That, in turn, entails a Widget construction and destruction, because a pair<int, Widget> itself contains a Widget object. operator[] uses no pair object, so it constructs and destructs no pair and no Widget.

Efficiency considerations thus lead us to conclude that insert is preferable to operator[] when adding an element to a map, and both efficiency and aesthetics dictate that operator[] is preferable when updating the value of an element that’s already in the map.

It would be nice if the STL provided a function that offered the best of both worlds, i.e., efficient add-or-update functionality in a syntactically attractive package. For example, it’s not hard to envision a calling interface such as this:

iterator affectedPair =                 // if key k isn't in map m, efficiently
  efficientAddOrUpdate(m, k, v);        // add pair (k,v) to m; otherwise
                                        // efficiently update to v the value
                                        // associated with k. Return an
                                        // iterator to the added or
                                        // modified pair

There’s no function like this in the STL, but, as the code below demonstrates, it’s not terribly hard to write yourself. The comments summarize what’s going on, and the paragraphs that follow provide some additional explanation.

template<typename MapType,                // type of map
         typename KeyArgType,             // see below for why
         typename ValueArgType>           // KeyArgType and
                                          // ValueArgType are type
typename MapType::iterator                // parameters
efficientAddOrUpdate(MapType& m,
                     const KeyArgType& k,
                     const ValueArgType& v)
{
  typename MapType::iterator lb =         // find where k is or should
    m.lower_bound(k);                     // be; see page 7 for why
                                          // "typename" is needed
                                          // here

  if (lb != m.end() &&                    // if lb points to a pair
      !(m.key_comp()(k, lb->first))) {    // whose key is equiv to k...

     lb->second = v;                      // update the pair's value
     return lb;                           // and return an iterator
  }                                       // to that pair

  else {
    typedef typename MapType::value_type MVT;
    return m.insert(lb, MVT(k, v));       // add pair(k, v) to m and
  }                                       // return an iterator to the
}                                         // new map element

To perform an efficient add or update, we need to be able to find out if k’s value is in the map; if so, where it is; and if not, where it should be inserted. That job is tailor-made for lower_bound (see Item 45), so that’s the function we call. To determine whether lower_bound found an element with the key we were looking for, we perform the second half of an equivalence test (see Item 19), being sure to use the correct comparison function for the map; that comparison function is available via map::key_comp. The result of the equivalence test tells us whether we’re performing an add or an update.

If it’s an update, the code is straightforward. The insert branch is more interesting, because it uses the “hint” form of insert. The construct m.insert(lb, MVT(k, v)) “hints” that lb identifies the correct insertion location for a new element with key equivalent to k, and the Standard guarantees that if the hint is correct, the insertion will occur in amortized constant, rather than logarithmic, time. In efficientAddOrUpdate, we know that lb identifies the proper insertion location, so the call to insert is guaranteed to be an amortized constant-time operation.
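
For reference, here is the difference between the two insert forms in isolation. This is a sketch, assuming m is an IntWidgetMap (as in the earlier examples) and lb is a correctly positioned hint iterator into it, as it is inside efficientAddOrUpdate:

m.insert(IntWidgetMap::value_type(4, 45.8));       // hint-free form: always
                                                   // logarithmic

m.insert(lb, IntWidgetMap::value_type(4, 45.8));   // hint form: amortized
                                                   // constant when the hint
                                                   // identifies the right
                                                   // insertion location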

An interesting aspect of this implementation is that KeyArgType and ValueArgType need not be the types stored in the map. They need only be convertible to the types stored in the map. The alternative would be to eliminate the type parameters KeyArgType and ValueArgType, using instead MapType::key_type and MapType::mapped_type. However, if we did that, we might force unnecessary type conversions to occur at the point of call. For instance, look again at the definition of the map we’ve been using in this Item’s examples:

map<int, Widget> m;                          // as before

And recall that Widget accepts assignment from a double:

class Widget {                               // also as before
public:
  ...
  Widget& operator=(double weight);
  ...
};

Now consider this call to efficientAddOrUpdate:

efficientAddOrUpdate(m, 10, 1.5);

Suppose that it’s an update operation, i.e., m already contains an element whose key is 10. In that case, the template above deduces that ValueArgType is double, and the body of the function directly assigns 1.5 as a double to the Widget associated with the key 10. That’s accomplished by an invocation of Widget::operator=(double). If we had used MapType::mapped_type as the type of efficientAddOrUpdate’s third parameter, we’d have converted 1.5 to a Widget at the point of call, and we’d thus have paid for a Widget construction (and subsequent destruction) we didn’t need.

Subtleties in the implementation of efficientAddOrUpdate may be interesting, but they’re not as important as the main point of this Item, which is that you should choose carefully between map::operator[] and map::insert when efficiency is important. If you’re updating an existing map element, operator[] is preferable, but if you’re adding a new element, insert has the edge.

Item 25: Familiarize yourself with the nonstandard hashed containers

It generally doesn’t take much time for STL programmers to begin to wonder, “Vectors, lists, maps, sure, but where are the hash tables?” Alas, there aren’t any hash tables in the standard C++ library. Everyone agrees that this is unfortunate, but the Standardization Committee felt that the work needed to add them would have unduly delayed completion of the Standard. It’s a foregone conclusion that the next version of the Standard will include hash tables, but for the time being, the STL doesn’t do hashing.

If you like hash tables, however, take heart. You need not do without or roll your own. STL-compatible hashed associative containers are available from multiple sources, and they even have de facto standard names: hash_set, hash_multiset, hash_map, and hash_multimap.

Behind these common names, different implementations, er, differ. They differ in their interfaces, their capabilities, their underlying data structures, and the relative efficiency of the operations they support. It’s still possible to write reasonably portable code using hash tables, but it’s not as easy as it would be had the hashed containers been standardized. (Now you know why standards are important.)

Of the several available implementations for hashed containers, the two most common are from SGI (see Item 50) and Dinkumware (see Appendix B), so in what follows, I restrict myself to the designs of the hashed containers from these vendors. STLport (again, see Item 50) also offers hashed containers, but the STLport hashed containers are based on those from SGI. For purposes of this Item, assume that whatever I write about the SGI hashed containers applies to the STLport hashed containers, too.

Hashed containers are associative containers, so it should not surprise you that, like all associative containers, they need to know the type of objects stored in the container, the comparison function for such objects, and the allocator for these objects. In addition, hashed containers require specification of a hashing function. That suggests the following declaration for hashed containers:

template<typename T,
         typename HashFunction,
         typename CompareFunction,
         typename Allocator = allocator<T> >
class hash_container;

This is quite close to the SGI declaration for hashed containers, the primary difference being that SGI provides default types for HashFunction and CompareFunction. The SGI declaration for hash_set looks essentially like this (I’ve tweaked it a bit for presentation purposes):

template<typename T,
         typename HashFunction = hash<T>,
         typename CompareFunction = equal_to<T>,
         typename Allocator = allocator<T> >
class hash_set;

A noteworthy aspect of the SGI design is the use of equal_to as the default comparison function. This is a departure from the conventions of the standard associative containers, where the default comparison function is less. This design decision signifies more than a simple change in default comparison functions. SGI’s hashed containers determine whether two objects in a hashed container have the same value by testing for equality, not equivalence (see Item 19). For hashed containers, this is not an unreasonable decision, because hashed associative containers, unlike their standard (typically tree-based) counterparts, are not kept in sorted order.
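
In practice, that means putting a user-defined type into an SGI-style hashed container requires supplying both a hashing function and an equality function yourself. Here is a minimal sketch; the Widget::id member used below is a hypothetical accessor invented for illustration, not part of the Widget class shown earlier:

struct WidgetHash {                              // hashing function object
  size_t operator()(const Widget& w) const
  { return static_cast<size_t>(w.id()); }        // id() is hypothetical
};

struct WidgetEqual {                             // equality function object
  bool operator()(const Widget& lhs,
                  const Widget& rhs) const
  { return lhs.id() == rhs.id(); }               // equality, not equivalence
};

hash_set<Widget, WidgetHash, WidgetEqual> ws;    // SGI-style hash_set of
                                                 // Widgets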

The Dinkumware design for hashed containers adopts some different strategies. It still allows you to specify object types, hash function types, comparison function types, and allocator types, but it moves the default hash and comparison functions into a separate traits-like class called hash_compare, and it makes hash_compare the default argument for the HashingInfo parameter of the container templates. (If you’re unfamiliar with the notion of a “traits” class, open a good STL reference like Josuttis’ The C++ Standard Library [3] and study the motivation and implementation of the char_traits and iterator_traits templates.)

For example, here’s the Dinkumware hash_set declaration (again, tweaked for presentation):

template<typename T, typename CompareFunction>
class hash_compare;

template<typename T,
         typename HashingInfo = hash_compare<T, less<T> >,
         typename Allocator = allocator<T> >
class hash_set;

The interesting part of this interface design is the use of HashingInfo. The container’s hashing and comparison functions are stored there, but the HashingInfo type also holds enums controlling the minimum number of buckets in the table as well as the maximum allowable ratio of container elements to buckets. When this ratio is exceeded, the number of buckets in the table is increased, and some elements in the table are rehashed. (SGI provides member functions that afford similar control over the number of table buckets and, hence, the ratio of table elements to buckets.)

After some tweaks for presentation, hash_compare (the default value for HashingInfo) looks more or less like this:

template<typename T, typename CompareFunction = less<T> >
class hash_compare {
public:

  enum {
    bucket_size = 4,                     // max ratio of elements to buckets
    min_buckets = 8                      // minimum number of buckets
  };

  size_t operator()(const T&) const;     // hash function

  bool operator()(const T&,              // comparison function
                  const T&) const;

  ...                                    // a few things omitted, including
};                                       // the use of CompareFunction

The overloading of operator() (in this case, to implement both the hashing and comparison functions) is a strategy that crops up more frequently than you might imagine. For another application of the same idea, take a look at Item 23.

The Dinkumware design allows you to write your own hash_compare-like class (possibly by deriving from hash_compare itself), and as long as your class provides definitions for bucket_size, min_buckets, two operator() functions (one taking one argument, one taking two), plus a few things I’ve left out, you can use it to control the configuration and behavior of a Dinkumware hash_set or hash_multiset. Configuration control for hash_map and hash_multimap is similar.
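
Here is a sketch of what such a class might look like for a hash table of strings. The bucket parameters simply mirror hash_compare's defaults, and the character-summing hash is an arbitrary choice made for illustration only:

struct StringHashCompare {                    // hash_compare-like class
  enum {
    bucket_size = 4,                          // max ratio of elements to buckets
    min_buckets = 8                           // minimum number of buckets
  };

  size_t operator()(const string& s) const   // hash function
  {
    size_t h = 0;
    for (string::size_type i = 0; i < s.size(); ++i) h = 5 * h + s[i];
    return h;
  }

  bool operator()(const string& lhs,          // comparison (ordering) function
                  const string& rhs) const
  { return lhs < rhs; }
};

hash_set<string, StringHashCompare> ss;       // Dinkumware-style hash_set
                                              // of strings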

Notice that with both the SGI and the Dinkumware designs, you can leave all the decision-making to the implementations and simply write something like this:

hash_set<int> intTable;          // create a hashed set of ints

For this to compile, the hash table must hold an integral type (such as int), because the default hashing functions are generally limited to integral types. (SGI’s default hashing function is slightly more flexible. Item 50 will tell you where to find all the details.)
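
For a non-integral element type such as string, on the other hand, you'd generally have to name the hashing function yourself. A sketch under the SGI-style interface, assuming a user-written StringHash function object (built, say, around the character-summing scheme used in StringHashCompare above):

hash_set<string, StringHash> stringTable;   // hashing function named
                                            // explicitly; comparison defaults
                                            // to equal_to<string>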

Under the hood, the SGI and Dinkumware implementations go their very separate ways. SGI employs a conventional open hashing scheme composed of an array (the buckets) of pointers to singly linked lists of elements. Dinkumware also employs open hashing, but its design is based on a novel data structure consisting of an array of iterators (essentially the buckets) into a doubly linked list of elements, where adjacent pairs of iterators identify the extent of the elements in each bucket. (For details, consult Plauger’s aptly titled column, “Hash Tables” [16].)

As a user of these implementations, it’s likely you’ll be interested in the fact that the SGI implementation stores the table elements in singly linked lists, while the Dinkumware implementation uses a doubly linked list. The difference is noteworthy, because it affects the iterator categories for the two implementations. SGI’s hashed containers offer forward iterators, so you forgo the ability to perform reverse iterations; there are no rbegin or rend member functions in SGI’s hashed containers. The iterators for Dinkumware’s hashed containers are bidirectional, so they offer both forward and reverse traversals. In terms of memory usage, the SGI design is a bit more parsimonious than that from Dinkumware.
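
The practical upshot for traversal looks like this (a sketch assuming a hash_set<int> filled elsewhere; rbegin/rend availability is as described above):

hash_set<int> table;
...

for (hash_set<int>::iterator i = table.begin();           // forward traversal:
     i != table.end(); ++i) ...                           // supported by both
                                                          // designs

for (hash_set<int>::reverse_iterator r = table.rbegin();  // reverse traversal:
     r != table.rend(); ++r) ...                          // Dinkumware only;
                                                          // SGI's containers
                                                          // lack rbegin and rend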

Which design is best for you and your applications? I can’t possibly know. Only you can determine that, and this Item hasn’t tried to give you enough information to come to a reasonable conclusion. Instead, the goal of this Item is to make sure you know that though the STL itself lacks hashed containers, STL-compatible hashed containers (with varying interfaces, capabilities, and behavioral trade-offs) are not difficult to come by. In the case of the SGI and STLport implementations, you can even come by them for free, because they’re available for free download.
