Hash Tables

In the same way that arrays are sort of like lists, hash tables are sort of like alists, except that they also allow you to access arbitrary elements more quickly.

In fact, hash tables are so efficient that they can, at times, seem like magic. Think of the Babel fish in the Hitchhiker’s Guide to the Galaxy trilogy—something so impossibly useful that it really has no business existing in the first place. That’s why almost all modern languages now offer the hash table datatype.

image with no caption

Working with Hash Tables

Create a new hash table with the make-hash-table command:

> (make-hash-table)
#S(HASH-TABLE ...)

Like alists, hash tables store items using a lookup key and a value. We can retrieve an item from the hash table using the item’s key with the gethash function:

> (defparameter x (make-hash-table))
  #S(HASH-TABLE ...)
  > (gethash 'yup x)
 NIL ;
 NIL

So far, our hash table remains empty. This means that when we look up any key in the hash table, such as 'yup in this example, we receive NIL as an answer . Actually, we receive two NILs —the gethash command returns multiple values, which you can do in Common Lisp (discussed in the next section). The first returned value is the actual value stored in the hash table, and the second indicates whether the key was found in the table (in this case, it wasn’t).

Just as with arrays, we can once again combine a command used for referencing data elements—in this case, gethash—with the setf command in order to fill our table with data:

> (defparameter x (make-hash-table))
  #S(HASH-TABLE ...)
 > (setf (gethash 'yup x) '25)
  25
  > (gethash 'yup x)
 25 ;
 T

In this example, we’ve stored the value 25 in the hash table with a lookup key of yup . Then, when we look up yup in the table, we get the answer of 25 . We also get a second value of t , which means, “Yes, I found the key in the table.”

Remember when we discussed alists, we set up a data structure containing an order for coffee drinks? Here is that same data, but this time it’s stored using a hash table:

> (defparameter *drink-order* (make-hash-table))
#S(HASH-TABLE ...)
> (setf (gethash 'bill *drink-order*) 'double-espresso)
DOUBLE-ESPRESSO
> (setf (gethash 'lisa *drink-order*) 'small-drip-coffee)
SMALL-DRIP-COFFEE
> (setf (gethash 'john *drink-order*) 'medium-latte)
MEDIUM-LATTE

Accessing the drink order for any person is now simple:

> (gethash 'lisa *drink-order*)
'small-drip-coffee ;
T

Returning Multiple Values

Common Lisp allows you to return more than one value as a result. Some of the core Common Lisp core functions do this, including the gethash function, as you’ve seen. Another commonly used function that does this is the round function, which rounds off a number:

> (round 2.4)
 2 ;
 0.4

Calling this function appropriately rounded our number to 2 , but then it also generated a second value, which is the remainder of the rounding operation . Both values are returned from this function call.

You can also create multiple values in your own code by using the values function. Here, for instance, we can write a foo function that returns two separate numbers, 3 and 7:

> (defun foo ()
      (values 3 7))
  FOO
  > (foo)
 3 ;
 7

Both of these values are printed out at the REPL , just as with the round function. However, Lisp considers the first value to be more important, and it will always be used by default during follow-up calculations. For instance, we can perform an addition after calling foo, like this:

> (+ (foo) 5)
8

In this case, the addition operator just ignores the second value that foo returns.

However, sometimes you might need to use that additional returned value. You can do this by using the multiple-value-bind command:

> (multiple-value-bind (a b) (foo)
                       (* a b))
21

In this example, we’ve bound the variables a and b to both of the values returned by foo (3 and 7). Calling our function with multiple-value-bind lets us use the extra values returned from the function, which would otherwise be ignored.

You might be wondering whether you could just return a list from your function instead of using the multiple-value feature. The answer is, yes, you could. However, it’s possible that using the multiple-value feature can lead to more optimized and cleaner code.

In this book, we will not be making much use of multiple values. In fact, more recent Lisp dialects, such as Arc and Clojure, do not support multiple values at all. Instead, they just return a list in the few cases where more than one value needs to be returned.

Hash Table Performance

As with arrays, accessing and modifying a value inside a hash table requires only a constant amount of time, no matter how many items your hash table contains. For instance, suppose we have a hash table with only 10 items in it. We access a value in the table, using a key, and find it takes on average 1 millisecond to do so. Now suppose that the hash table has 1,000,000 items in it. Because of how hash tables are designed, we could still expect it to take only about 1 millisecond to retrieve a value. In other words, no matter how big the table is, we can access items at a constant time of 1 millisecond.

Think of how incredible that is! Even if your hash table contained 1,000,000 items, the gethash function could take the key you gave it and determine in a constant amount of time exactly where your desired item could be found!

In this era of web-based programs backed by enormous amounts of data, the ability of hash tables to store large numbers of values with fast retrieval makes them indispensable. The efficient storage of key/value pairs is essential for most online storage systems. Even the latest tools for storing vast amounts of online data, like Google’s BigTable or Amazon’s S3, are built around the quick retrieval of values using keys, which makes them similar to hash tables.

However, you can’t always expect hash tables to provide the best performance. Here’s why:

Virtual memory paging and cache misses:

As with arrays, large hash tables may cause your operating system to start paging virtual memory to your hard drive, thus degrading performance. Similarly, they can increase the number of cache misses within your CPU.

Hash collisions:

Internally, hash tables use a special function called a hash function, which converts keys into numbers. Such a hash function can cause hash collisions. Basically, a hash collision happens when, by chance, two keys are converted by the hash function into the same number. In this case, the hash table will still behave correctly, but at a slightly degraded performance. In rare cases, certain types of keys can interact with a hash function to increase the number of collisions and impede an application’s ability to perform lookups, degrading performance even more.

Inefficiency with small tables:

With very small tables, the creation and lookup time required by hash tables can make them less inefficient than simpler structures, such as alists. The performance benefits of hash tables are noticeable only when they contain larger amounts of data in them.

Varying speed for operations:

In Common Lisp, if you create a small hash table, and then fill it with values, you will find that occasionally, adding a new value will be unusually slow. This is because the make-hash-table function is designed to minimize the cost for creating small hash tables. However, as you start adding values to make the table big, Lisp will need to take extra time to allocate more memory so that the table can hold more items. These extra allocations will lead to occasional slow insertions into the table as it grows.

There is one final reason why hash tables are not always the best solution: They are simply not as Lispy as traditional Lisp structures built from cons cells. This means they can be harder to debug than cons cells, since they cannot be read and printed as naturally in the Lisp REPL. Therefore, a good rule of thumb is to stay away from arrays and hash tables as you conceive a new piece of code. Then, if performance ends up becoming an issue, and only then, judiciously modify the critical sections of your code to take advantage of arrays and hash tables to resolve any performance problems.

A Faster Grand Theft Wumpus Using Hash Tables

Let’s look at a practical example of what hash tables can do for your code. There is a glaring inefficiency in our latest game, Grand Theft Wumpus, that we can now correct with hash tables.

Recall from the previous chapter that Grand Theft Wumpus uses lists of nodes and edges to represent the graph of the city. This means that in order to find connections to a given node, we must do a linear search through a list. This isn’t a big deal in Grand Theft Wumpus, because Congestion City doesn’t have a lot of intersections. But what if our city had a thousand nodes with a thousand edges? Let’s time the get-connected function and see what kind of numbers we get:

> (setf *edge-num* 1000)
1000
> (setf *node-num* 1000)
1000
> (time (dotimes (i 100) (get-connected 1 (make-edge-list))))
Real time: 57.699303 sec.
Run time: 57.687607 sec.
Space: 39566832 Bytes
GC: 43, GC time: 0.120005 sec.

The time command is a Lisp utility that outputs all kinds of useful timing information about a chunk of code, and the dotimes function lets us run our code 100 times, building 100 cities. Using these commands, it took about a minute to run this code on my computer. Given how many gazillion instructions a CPU can crunch in a minute, this is absolutely horrifyingly bad performance.

To fix this problem, we’ll replace our edge list for this code with a hash table so that the get-connected function will be able to find connections to a node in constant time. We’ll also replace our visited list with a visited table, so the function can quickly tell whether a node has already been visited.

Here is the code that makes this happen, consisting of hashed versions of our previous functions:

 (defun hash-edges (edge-list)
   (let ((tab (make-hash-table)))
     (mapc (lambda (x)
              (let ((node (car x)))
               (push (cdr x) (gethash node tab))))
            edge-list)
     tab))

First, we need hash-edges, a function that converts our edge list into a hash table . At the beginning of the function, we create a new hash table and name it tab . Then, we iterate through the table with mapc . Remember that mapc is just like mapcar, except that you use it in places where you care only about the side effects and don’t care about generating a final list as a result.

For every node, we want the table to contain a list of nodes connected to it. Therefore, as we iterate through the list, we push a new neighbor onto the current list of neighbors for the current starting node . We can use the push command on hash table values just as for regular Lisp variable values. This, again, makes use of the general variables system built into Common Lisp, which we’ll discuss in Handling Data in a Generic Way in Handling Data in a Generic Way.

You may be wondering why we don’t need to deal with the case where there is no value yet for a node in the table. How can we push something into a value in the table if no value exists? Well, it turns out that because the gethash function returns NIL when a key is not found in the table, this code will simply push the new neighbor onto an empty list and stick a new record into the table where none was found before. In this way, the push command magically does the “right thing,” no matter whether the node is new or old.

Finally, once our table is populated, we return it as a result . It contains the same data as the original edge list. The difference is that we can now find the neighbors of any node in Congestion City at blazing speeds.

  (defun get-connected-hash (node edge-tab)
   (let ((visited (make-hash-table)))
      (labels ((traverse (node)
                (unless (gethash node visited)
                  (setf (gethash node visited) t)
                  (mapc (lambda (edge)
                           (traverse edge))
                         (gethash node edge-tab)))))
        (traverse node))
     visited))

Now we’re ready to write get-connected-hash, which retrieves all the nodes connected to a starting node in Congestion City . It is identical in behavior to get-connected, but is optimized through hash tables.

The first thing this function does is create a hash table of visited nodes . Then we travel through the nodes of Congestion City, beginning with the starting node. Every time we visit a new node, we ask ourselves if we’ve visited it before. We can now answer this question very efficiently by looking up the current node in the visited table . If the answer is no, we’ll need to mark this node as visited and check all of its neighbors by mapcing through them—checking our edge table . Finally, we return our visited table, which in the end will hold all nodes that are connected to the starting node .

Now we can rerun our test with this new logic:

> (time (dotimes (i 100)
                 (get-connected-hash 1 (hash-edges (make-edge-list)))))
Real time: 1.221269 sec.
Run time: 1.224076 sec.
Space: 33096264 Bytes
GC: 36, GC time: 0.10801 sec. :

As you can see, instead of taking a minute to calculate the connections in the graph, it now takes only one second to do the same thing! This is why you must know how to use hash tables.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.179.85