Chapter 3. Dictionary Data Structures
To enhance the efficiency of A*, this chapter discusses different dictionary data structures. Priority queues are provided for integer and totally ordered keys. For duplicate elimination, hashing is studied, including schemes with provably constant access time. Moreover, maintaining partial states in the form of substrings or subsets is considered.
Keywords: priority queue, 1-level bucket, 2-level bucket, radix heap, van Emde Boas queue, heap, pairing heap, weak heap, Fibonacci heap, relaxed weak queue, hash function, hash table, incremental hashing, universal hashing, perfect hashing, ranking and unranking, cuckoo hashing, suffix list, bitstate hashing, hash compaction, collapse compression, subset dictionary, unlimited branching tree, substring dictionary, suffix tree.
The exploration efficiency of algorithms like A* is often measured with respect to the number of expanded/generated problem graph nodes, but the actual runtimes depend crucially on how the Open and Closed lists are implemented. In this chapter we take a closer look at efficient data structures for representing these sets.
For the Open list, different options for implementing a priority queue data structure are considered. We distinguish between integer and general edge costs, and introduce bucket and advanced heap implementations.
For efficient duplicate detection and removal we also look at hash dictionaries. We devise a variety of hash functions that can be computed efficiently and that minimize the number of collisions by approximating uniformly distributed addresses, even if the set of stored keys is not uniformly distributed (which is almost always the case). Next we explore memory-saving dictionaries whose space requirements come close to the information-theoretic lower bound, and we provide a treatment of approximate dictionaries.
Subset dictionaries address the problem of finding partial state vectors in a set. Searching the set is referred to as the Subset Query or the Containment Query problem. The two problems are equivalent to the Partial Match retrieval problem for retrieving a partially specified input query word from a file of k-letter words with k being fixed. A simple example is the search for a word in a crossword puzzle. In a state space search, subset dictionaries are important to store partial state vectors like dead-end patterns in Sokoban that generalize pruning rules (see Ch. 10).
In state space search, string dictionaries are helpful to exclude a set of forbidden action sequences, like UD or RL in the (n² − 1)-Puzzle, from being generated. Such sets of excluded words are formed by concatenating move labels on paths; they can be learned and generalized (see Ch. 10). Therefore, string dictionaries provide an option to detect and eliminate duplicates without hashing and can help to reduce the effort for storing all visited states. Besides the efficient insertion (and deletion) of strings, the task of determining whether a query string is a substring of a stored string is most important and has to be executed very efficiently. The main application for string dictionaries is web search engines. The most flexible data structure for efficiently solving this Dynamic Dictionary Matching problem is the Generalized Suffix Tree.

3.1. Priority Queues

When applying the A* algorithm to explore a problem graph, we rank all generated but not yet expanded nodes u in the list Open by their priority f(u) = g(u) + h(u). As basic operations we need to find the element with the minimal f-value, to insert a node together with its f-value, and to update the structure if a node receives a better f-value due to a shorter path. An abstract data structure for the three operations Insert, DeleteMin, and DecreaseKey is a priority queue.
In Dijkstra's original implementation, the Open list is a plain array of nodes, together with a bitvector indicating whether elements are currently open. The minimum is found through a complete scan, yielding quadratic execution time in the number of nodes. More refined data structures have been developed since that are suitable for different classes of weight functions. We will discuss integer and general weights; for integer costs we look at bucket structures, and for general weights we consider refined heap implementations.

3.1.1. Bucket Data Structures

In many applications, edge weights can only be positive integers (sometimes, for fractional values, it is also possible and beneficial to achieve this by rescaling). As a general assumption, we require that the difference between the largest key and the smallest key is less than or equal to a constant C.

Buckets

A simple implementation for the priority queue is a 1-Level Bucket. This implementation consists of an array of C + 1 buckets, each of which is the first link in a linked list of elements. With the array we associate the three numbers minValue, minPos, and n: minValue denotes the smallest f-value in the queue, minPos fixes the index of the bucket with the smallest key, and n is the number of stored elements. Bucket b[(minPos + i) mod (C + 1)] contains all elements v with f(v) = minValue + i, for 0 ≤ i ≤ C. Figure 3.1 illustrates an example for the set of keys {16, 16, 18, 20, 23, 25}. The implementations of the four main priority queue operations Initialize, Insert, DeleteMin, and DecreaseKey are shown in Algorithms 3.1 through 3.4.
With doubly linked lists (each element has a predecessor and successor pointer) we achieve constant runtimes for the Insert and DecreaseKey operations, while the DeleteMin operation consumes O(C) time in the worst case for searching a nonempty bucket. For DecreaseKey we generally assume that a pointer to the element is available. Consequently, Dijkstra's algorithm and A* run in O(e + n·C) time, where e is the number of edges (generated) and n is the number of nodes (expanded).
Figure 3.1. Example for a 1-Level Bucket data structure.
Algorithm 3.1. Initializing a 1-Level Bucket.
Algorithm 3.2. Inserting an element into a 1-Level Bucket.
Algorithm 3.3. Deleting the minimum element in a 1-Level Bucket.
Given that the f-value can often be bounded by a constant fmax in a practical state space search, authors usually omit the modulo operation mod (C + 1) and take a plain array b[0..fmax] addressed by f instead (trading the O(C) space for O(fmax)). If fmax is not known in advance, a doubling strategy can be applied.
Algorithm 3.4. Updating the key in a 1-Level Bucket.

Multilayered Buckets

In state space search, we often have edge weights that are of moderate size, say realized by a 32-bit integer, for which a bucket array b of size 2^32 is too large, whereas 2^16 can be afforded.
The space complexity and the worst-case time complexity O(C) for DeleteMin can be reduced to an amortized complexity of O(√C) operations by using a 2-Level Bucket data structure with one top and one bottom level, both of length ⌈√(C + 1)⌉.
In this structure we have two pointers for the minimum position, minPosTop and minPosBottom, and a number nbot of bottom elements. Although each bucket in the bottom array holds a list of elements with the same key as before, each bucket of the top layer spans an interval of ⌈√(C + 1)⌉ keys and points to a lower-level array. If after a DeleteMin operation that yields a minimum key k no insertion is performed with a key less than k (as is the case for a consistent heuristic in A*), it is sufficient to maintain only one bottom bucket (at minPosTop) and to collect elements in higher buckets at the top level; a lower-level bucket needs to be created only when the current bucket at minPosTop becomes empty and minPosTop moves on to a higher one. One advantage is that in the case of maximum distance between keys, DeleteMin has to inspect only the ⌈√(C + 1)⌉ buckets of the top level; moreover, the structure saves space if only a small fraction of the available range C is actually filled.
As an example, take C = 80, so that top and bottom each consist of ⌈√(C + 1)⌉ = 9 buckets and every top bucket spans an interval of 9 key values; let minPosTop = 2 and minPosBottom = 1. The expanded top bucket b[2] covers the key interval [6, 14]: its nonempty bottom buckets 1, 5, and 7 contain the elements 7, 11, and 13, respectively, so that minValue = 7. All elements with larger keys remain unexpanded in the top-level buckets covering the intervals [15, 23], [24, 32], and so on. Figure 3.2 illustrates the example.
Figure 3.2. Example for a 2-Level Bucket data structure.
Since DeleteMin reuses the bottom bucket in case it becomes empty, it is fast in some cases and slow in others. In our case of the 2-Level Bucket, the lth DeleteMin operation uses O(√C + m_l) time in the worst case, where m_l is the number of elements that move from top to bottom: the term O(√C) is the worst-case distance traveled in the top bucket, and m_l is the effort for the reassignment. Because each element can move from the top to the bottom level at most once before it is deleted, the reassignment costs amortize over the operations that inserted those elements, so the worst-case work is spread over a longer time period. By amortization we thus obtain O(√C) time per DeleteMin. Both operations Insert and DecreaseKey run in real and amortized constant time.

Radix Heaps

For achieving an even better amortized runtime, namely O(lg C) per operation, a so-called Radix Heap maintains a list of B = ⌈lg(C + 1)⌉ + 1 buckets of sizes 1, 1, 2, 4, 8, 16, and so on (see Fig. 3.3). The main difference from layered buckets is the use of buckets of exponentially increasing sizes instead of a hierarchy. Therefore, only O(lg C) buckets are needed.
For the implementation we maintain buckets b[1..B] and bounds u[1..B + 1] with u[1] = 0 and u[B + 1] = ∞. Furthermore, the bucket number φ(k) denotes the index of the actual bucket for key k. The invariants of the algorithms are: (1) all keys in b[i] are in [u[i], u[i + 1] − 1]; (2) u[2] = u[1] + 1; and (3) for all i ∈ {2, ..., B − 1} we have 0 ≤ u[i + 1] − u[i] ≤ 2^(i−2).
Figure 3.3. Example for a Radix Heap. The bottom row numbers denote the current values of the bounds u, and the top row numbers denote the size of the interval defined by two successive u-values.
The operations are as follows. Initialize generates an empty Radix Heap according to the invariants (2) and (3). The pseudo code is shown in Algorithm 3.5.
To insert an element with key k, a bucket i is searched in a linear scan, starting from the largest one (i = B). The new element is then inserted into the bucket b[i] with u[i] ≤ k < u[i + 1]. The pseudo-code implementation is depicted in Algorithm 3.6.
Algorithm 3.5. Creating a Radix Heap.
Algorithm 3.6. Inserting an element into a Radix Heap.
Algorithm 3.7. Decreasing the key of an element in a Radix Heap.
For DecreaseKey, bucket i for an element with key k is searched linearly. The difference is that the search starts from the element's current bucket i = φ(k), as stored with it, and proceeds downward. The implementation is shown in Algorithm 3.7.
For DeleteMin we first search for the first nonempty bucket b[i] and identify the element with minimum key k therein. If the smallest bucket contains an element, it is returned. In the other case, u[1] is set to k and the bucket bounds are adjusted according to the invariants; that is, u[2] is set to k + 1, and for j = 3, ..., i, bound u[j] is set to min{u[j − 1] + 2^(j−2), u[i + 1]}. Lastly, the elements of b[i] are distributed to the buckets b[1], ..., b[i − 1], and the minimum element is extracted from the (nonempty) smallest bucket. The implementation is shown in Algorithm 3.8.
Algorithm 3.8. Deleting the minimum from a Radix Heap.
As a short example for DeleteMin, consider a Radix Heap with bounds (u[1], ..., u[6]) = (0, 1, 2, 4, 8, ∞) storing the keys 0, 6, and 7 (see Fig. 3.4). Extracting key 0 from bucket 1 leaves the keys 6 and 7, whose first nonempty bucket is b[4] with interval [4, 7]. The new minimum is k = 6, so the bounds become u[1] = 6, u[2] = 7, and u[3] = min{7 + 2, 8} = 8, and the keys 6 and 7 are distributed to buckets 1 and 2, respectively. In general, if b[i] is the first nonempty bucket, its interval size is at most 2^(i−2), and in b[1], ..., b[i − 1] we have 1 + 1 + 2 + ⋯ + 2^(i−3) = 2^(i−2) bucket slots available. Since all keys in b[i] lie in [k, k + 2^(i−2) − 1], all of its elements fit into b[1], ..., b[i − 1].
Figure 3.4. Example for a DeleteMin operation in a Radix Heap.
The amortized analysis of the costs of maintaining a Radix Heap uses the potential Φ_l = Σ_x φ(k(x)), summed over all elements x in the heap, for operation l. Initialize and Insert run in O(B) time. DecreaseKey has an amortized time complexity of O(1), since its linear scan toward smaller buckets is paid for by the decrease in potential, and DeleteMin runs in O(B) amortized time, since during redistribution every element moves to a strictly smaller bucket. In total we have a running time of O(m·B + l) for m Insert and DeleteMin operations and l DecreaseKey operations.
Utilizing this representation, A* runs in O(e + n·lg C) time. For current computers, the value of lg C for the entire integer range is small (32 or 64), so that A* on integers using a Radix Heap runs in practically linear time.

Van Emde Boas Priority Queues

A Van Emde Boas Priority Queue is efficient when the keys are drawn from a bounded universe U = {0, ..., N − 1}. In this implementation, all priority queue operations reduce to successor computation, which takes O(lg lg N) time. The space requirements are O(N lg lg N).
We start by considering a data structure T_N on the elements {0, ..., N − 1} defining only three operations: Insert(x), Delete(x), and Succ(x), where the first two have the obvious semantics and the last one returns the smallest item in T_N that is larger than or equal to x. All priority queue operations use the recursive operation Succ(x) that finds the smallest y in the structure T_N with y ≥ x. For the priority queue data structure, DeleteMin is simply implemented as Delete(Succ(0)), assuming positive key values, and DecreaseKey is a combination of a Delete and an Insert operation.
Using an ordinary bitvector, Insert and Delete are constant-time operations, but Succ is inefficient. Using balanced trees, all operations run in time O(lg N). A better solution is a recursive representation with √N distinct versions of T_√N. These trees are called bottom, and an element i = a·√N + b is represented by the entry b in bottom(a). The conversion from i to a and b in bitvector representation is simple, since a and b refer to the most and least significant halves of the bits. Moreover, we have another version T_√N, called top, that contains a if and only if bottom(a) is nonempty.
Algorithm 3.9 depicts a pseudo-code implementation of Succ. The recursion for the runtime is T(N) = T(√N) + O(1). If we set N = 2^K, then T(2^K) = T(2^(K/2)) + O(1), so that T(2^K) = O(lg K) and T(N) = O(lg lg N). The subsequent implementations for Insert and Delete are shown in Algorithm 3.10 and Algorithm 3.11. Inserting element x into T_N locates a possible place by first seeking the successor Succ(x) of x. This leads to a running time of O(lg lg N). Deletion uses the doubly linked structure and the successor relation; it also runs in O(lg lg N) time.
A Van Emde Boas Priority Queue k-structure is recursively defined. Consider the example k = 4 (implying N = 16) for a set S of five elements. Set top is a 2-structure on {0, ..., 3} based on the set of two-bit prefixes occurring in the binary encodings of the values in S. Set bottom is a vector of 2-structures (based on the two-bit suffixes of the binary encodings in S): for each element i = 4a + b of S, suffix b is stored in bottom(a). Representing top as a 2-structure in turn implies k = 2 and N = 4, so that top itself decomposes into a sub-top structure on {0, 1} and two sub-bottom structures.
Algorithm 3.9. Finding the successor in a Van Emde Boas Priority Queue.
Algorithm 3.10. Inserting an element in a Van Emde Boas Priority Queue.
Algorithm 3.11. Deleting an element from a Van Emde Boas Priority Queue.
To realize the structures in practice, a mixed representation of the element set is appropriate. On one hand, a doubly linked list contains the elements sorted according to the values they have in the universe. On the other hand, a bitvector b is devised, with bit i denoting whether an element with value i is contained in the list. The two structures are connected via links that point from each nonzero bit to the corresponding item in the doubly linked list. The mixed representation (bitvector and doubly linked leaf list) for the earlier 4-structure (without unrolling the references to the single top and four bottom structures) is shown in Figure 3.5.
Figure 3.5. An example of a 4-structure for the Van Emde Boas Priority Queue.

3.1.2. Heap Data Structures

Let us now assume that keys can be arbitrary (e.g., floating-point) values. Each operation on a priority queue then decomposes into compare-exchange steps. For this case, the most common implementation of a priority queue (besides a plain list) is a Binary Search Tree or a Heap.

Binary Search Trees

A Binary Search Tree is a binary tree implementation of a priority queue in which each internal node x stores an element. The keys in the left subtree of x are smaller than (or equal to) the key of x, and the keys in the right subtree of x are larger than the key of x. Operations on a Binary Search Tree take time proportional to the height of the tree. If the tree degenerates to a linear chain of nodes, a linear number of comparisons may be required in the worst case. If the tree is balanced, a logarithmic number of operations suffices for insertion and deletion. Because balancing can be involved, in the following we discuss more flexible and faster data structures for implementing a priority queue.

Heaps

A Heap is a complete binary tree; that is, all levels are completely filled except possibly the lowest one, which is filled from the left. This means that the depth of the tree (and every path length from the root to a leaf) is Θ(lg n). Each internal node v satisfies the heap property: the key of v is smaller than or equal to the key of either of its two children.
Complete binary trees can be embedded in an array A as follows. The elements are stored levelwise from left to right in ascending cells of the array; A[1] is the root; the left and right children of A[i] are A[2i] and A[2i + 1], respectively; and the parent of A[i] is A[⌊i/2⌋]. On most current microprocessors, multiplication by two can be realized as a single shift instruction. An example of a Heap (including its array embedding) is provided in Figure 3.6.
Figure 3.6. Example of a Heap. Array indices are attached to the nodes.
To insert an element into a Heap, we first tentatively place it in the next available leaf. As this might violate the heap property, we restore the heap property by swapping the element with its parent if the parent's key is larger; then we check against the grandparent's key, and so on, until the heap property is valid or the element reaches the root. Thus, Insert needs at most O(lg n) time. In the array embedding we start with the last unused index n + 1 in array A, place key k into A[n + 1], and climb up the ancestors until a correct Heap is constructed. An implementation is provided in Algorithm 3.12.
DecreaseKey starts at the node x that has changed its value. This reference has to be maintained with the elements that are stored. Algorithm 3.13 shows a possible implementation.
Algorithm 3.12. Inserting an element into a Heap.
Algorithm 3.13. Decreasing the key of an element in a Heap.
Extracting the minimum key is particularly easy: it is always stored at the root. However, we have to delete it and guarantee the heap property afterward. First, we tentatively fill the gap at the root with the last element on the bottom level of the tree. Then we restore the heap property using two comparisons per node while going down. This operation is referred to as SiftDown: at a node, we determine the minimum of the current key and the keys of the two children; if the node itself holds the minimum of the three, we are done; otherwise it is exchanged with the minimum, and the balancing continues at that child's position. Hence, the running time for DeleteMin is again O(lg n) in the worst case. The implementation is displayed in Algorithm 3.14. Different SiftDown procedures are known: (1) top-down (as in Alg. 3.15); (2) bottom-up (first following the special path of smaller children down to a leaf, then sifting the root element up as in Insert); or (3) with binary search (on the special path).
Algorithm 3.14. Extracting the minimum element from a Heap.
Algorithm 3.15. Rearranging a Heap (SiftDown).
An implementation of the priority queue using a Heap leads to an O((e + n)·lg n) algorithm for A*, where n (resp. e) is the number of generated problem graph nodes (resp. edges). The data structure is fast in practice if n is small, say a few million elements (an accurate number depends on the efficiency of the implementation).

Pairing Heaps

A Pairing Heap is a heap-ordered (not necessarily binary) self-adjusting tree. The basic operation on a Pairing Heap is pairing, which combines two Pairing Heaps by attaching the root with the larger key to the other root as its left-most child. More precisely, for two Pairing Heaps with respective root keys k_1 and k_2, pairing inserts the first as the left-most subtree of the second if k_1 > k_2, and otherwise inserts the second into the first as its left-most subtree. Pairing takes constant time and the minimum is found at the root.
In a multiway tree representation, realizing the priority queue operations is simple. Insert pairs the new node with the root of the heap. DecreaseKey splits the node and its subtree from the heap (if the node is not the root), decreases the key, and then pairs it with the root of the heap. Delete splits the node to be deleted and its subtree, performs a DeleteMin on the subtree, and pairs the resulting tree with the root of the heap. DeleteMin removes the root and combines the remaining trees in two passes: first the trees are paired from left to right; then the resulting trees are incrementally paired from right to left (see Alg. 3.16).
Since the multiple-child representation is difficult to maintain, the child-sibling binary tree representation for Pairing Heaps is often used, in which siblings are connected as follows. The left link of a node accesses its first child, and the right link of a node accesses its next sibling, so that the value of a node is less than or equal to all the values of nodes in its left subtree. It has been shown that in this representation Insert takes O(1) and DeleteMin takes O(lg n) amortized time, and that DecreaseKey takes at least Ω(lg lg n) and at most O(2^(2√(lg lg n))) amortized steps.
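The following Python sketch uses an explicit child list per node instead of the child-sibling representation; it is illustrative only, and the list-based child removal in decrease_key stands in for what is O(1) pointer surgery in the child-sibling form.

```python
class PNode:
    def __init__(self, key, item):
        self.key, self.item = key, item
        self.children, self.parent = [], None

def pair(a, b):
    """Attach the root with the larger key below the other root; O(1)."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    b.parent = a
    a.children.append(b)
    return a

def insert(root, key, item):
    return pair(root, PNode(key, item))      # returns the new root

def decrease_key(root, node, new_key):
    node.key = new_key                       # new_key <= old key
    if node is root:
        return root
    node.parent.children.remove(node)        # split node's subtree off ...
    node.parent = None
    return pair(root, node)                  # ... and pair it with the root

def delete_min(root):
    """Return (min node, new root), pairing the children in two passes."""
    kids = root.children
    for c in kids:
        c.parent = None
    paired = [pair(kids[i], kids[i + 1] if i + 1 < len(kids) else None)
              for i in range(0, len(kids), 2)]   # pass 1: pair left to right
    new_root = None
    for t in reversed(paired):                   # pass 2: right to left
        new_root = pair(t, new_root)
    return root, new_root
```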

Weak Heaps

A Weak Heap is obtained by relaxing the Heap requirements. It satisfies three conditions: the key of a node is smaller than or equal to all keys in its right subtree, the root has no left child, and leaves are found on the last two levels only.
The array representation uses extra bits Reverse[i] ∈ {0, 1}, i ∈ {0, ..., n − 1}. The left child of node i is located at index 2i + Reverse[i] and the right child at 2i + 1 − Reverse[i]; by flipping Reverse[i], the locations of the left child and the right child are exchanged. An example array pair (A, Reverse) and its binary tree equivalent are shown in Figure 3.7.
Figure 3.7. Example of a Weak Heap. Reflected nodes are shown in gray.
Algorithm 3.16. Rearranging a Pairing Heap.
The function Grandparent is defined as Grandparent(i) = Grandparent(⌊i/2⌋) in case i is a left child, and Grandparent(i) = ⌊i/2⌋ if i is a right one. In a Weak Heap, Grandparent(i) refers to the index of the deepest ancestor known to be smaller than or equal to the element at i. An illustration is given in Figure 3.8.
Figure 3.8. Grandparent relationship in a Weak Heap, indicated using dashed arrows.
Let node v be the root of a balanced tree T, and suppose that node u together with the left subtree of T, and v together with the right subtree of T, each form a Weak Heap. Merging u and v yields a new Weak Heap: if A[u] ≤ A[v], then the tree with root u and right child v is already a Weak Heap; if, however, A[v] < A[u], we swap A[u] with A[v] and reflect the subtrees in T (see Fig. 3.9, right). Algorithm 3.17 provides the pseudo-code implementation for Merge and Grandparent.
Figure 3.9. Merging operation in a Weak Heap.
To restore the Weak Heap, all subtrees corresponding to grandchildren of the root are combined. Algorithm 3.18 shows the implementation of this Merge-Forest procedure. The element at position m serves as a root node. We traverse the grandchildren of the root, among which the second smallest element is located. Then, in a bottom-up traversal, the Weak Heap property is restored by a series of Merge operations.
For DeleteMin we restore the Weak Heap property after exchanging the root element with the last one in the underlying array. Algorithm 3.19 gives an implementation.
To construct a Weak Heap from scratch, all nodes at index i, for decreasing i, are merged with their grandparents, resulting in the minimal number of n − 1 comparisons.
Algorithm 3.17. Implementation of different subroutines for Weak Heaps.
Algorithm 3.18. Restoration of Weak Heaps.
For Insert, given a key k, we start with the last unused index x in array A and place k into A[x]. Then we climb up the grandparents until the Weak Heap property is satisfied (see Alg. 3.20). On average, the path of grandparents from a leaf node to the root has approximately half the length of the tree's depth.
Algorithm 3.19. Extracting the minimum element from a Weak Heap.
Algorithm 3.20. Inserting an element into a Weak Heap.
For the DecreaseKey operation we start at the node x that has changed its value. Algorithm 3.21 shows an implementation.
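Algorithms 3.17 through 3.20 are not reproduced here; the following Python sketch combines Grandparent, Merge, construction, and DeleteMin in the array representation with reverse bits described above (construction uses exactly n − 1 comparisons).

```python
class WeakHeap:
    def __init__(self, keys):
        self.a = list(keys)
        self.r = [0] * len(self.a)               # reverse (reflection) bits
        for i in range(len(self.a) - 1, 0, -1):  # bottom-up construction,
            self._merge(self._grandparent(i), i) # n - 1 comparisons in total

    def _grandparent(self, i):
        while (i & 1) == self.r[i >> 1]:         # while i is a left child
            i >>= 1
        return i >> 1

    def _merge(self, g, i):                      # one comparison
        if self.a[i] < self.a[g]:                # restore: a[g] <= a[i]
            self.a[i], self.a[g] = self.a[g], self.a[i]
            self.r[i] ^= 1                       # reflect the subtree of i

    def delete_min(self):
        mn, last = self.a[0], self.a.pop()
        self.r.pop()
        n = len(self.a)
        if n > 0:
            self.a[0] = last                     # move last element to root
        if n > 1:
            x = 1
            while 2 * x + self.r[x] < n:         # descend along left children
                x = 2 * x + self.r[x]
            while x > 0:                         # merge the root with the
                self._merge(0, x)                # grandchildren, bottom-up
                x >>= 1
        return mn

# usage: WeakHeap([4, 8, 1, 7, 2, 6, 3, 5]) yields 1, 2, ..., 8 on
# successive delete_min calls
```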

Fibonacci Heaps

A Fibonacci Heap is an involved data structure whose detailed presentation exceeds the scope of this book; in the following, we therefore only motivate Fibonacci Heaps.
Intuitively, Fibonacci Heaps are relaxed versions of Binomial Queues, which themselves are extensions of Binomial Trees. A Binomial Tree B_n is a tree of height n with 2^n nodes in total and (n choose i) nodes at depth i. The structure of B_n is obtained by unifying two copies of B_{n−1}, where the root of one is added as an additional successor to the root of the other.
Binomial Queues are unions of heap-ordered Binomial Trees. An example is shown in Figure 3.10. Tree B_i is represented in queue Q if the ith bit in the binary representation of n is set. The partition of a Binomial Queue structure Q into trees B_i is unique, as there is only one binary representation of a given number. Since the minimum is always located at the root of one of the B_i, operation Min takes O(lg n) time. Binomial Queues Q_1 and Q_2 of sizes n_1 and n_2 are melded by simulating binary addition of n_1 and n_2. This corresponds to a parallel scan of the root lists of Q_1 and Q_2, where a carry corresponds to linking two trees of equal rank into one tree of the next higher rank. For n = n_1 + n_2, melding can be performed in time O(lg n).
Algorithm 3.21. Decreasing the key of an element in a Weak Heap.
Binomial Queues are themselves priority queues. Operations Insert and DeleteMin both use procedure meld as a subroutine. The former creates a tree B_0 with one element, and the latter extracts the tree B_i containing the minimal element and splits it into its subtrees B_0, ..., B_{i−1}. In both cases the resulting trees are melded with the remaining queue to perform the update. DecreaseKey for element v updates the Binomial Tree B_i in which v is located by propagating the element change bottom-up. All operations run in O(lg n) time.
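The correspondence between melding and binary addition can be made explicit in a few lines of Python; the rank-indexed list representation and the (key, children) tree encoding here are our illustration, not the book's.

```python
def link(t1, t2):
    """B_i + B_i -> B_{i+1}: the root with the larger key becomes a child."""
    if t2[0] < t1[0]:
        t1, t2 = t2, t1
    t1[1].append(t2)             # trees are (key, list_of_children) pairs
    return t1

def meld(q1, q2):
    """Meld rank-indexed binomial queues (q[i] is the B_i tree or None)."""
    B = max(len(q1), len(q2)) + 1                 # room for a final carry
    out, carry = [None] * B, None
    for i in range(B):                            # binary addition with carry
        ts = [t for t in (q1[i] if i < len(q1) else None,
                          q2[i] if i < len(q2) else None, carry) if t]
        out[i] = ts.pop() if len(ts) % 2 == 1 else None
        carry = link(ts[0], ts[1]) if len(ts) >= 2 else None
    return out
```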
A Fibonacci Heap is a collection of heap-ordered Binomial Trees, maintained in the form of a circular, doubly linked, unordered list of root nodes. In contrast to Binomial Queues, more than one Binomial Tree of rank i may be present in a Fibonacci Heap. Consolidation traverses the linear list and merges trees of the same rank until each rank is unique. For this purpose, an additional array is devised that supports finding the trees of the same rank in the root list. The minimum element in a Fibonacci Heap is accessible in O(1) time through a pointer in the root list. Insert performs a meld operation with a singleton tree.
Figure 3.10. Example of a Binomial Queue.
For the critical operation consolidate (see Algorithm 3.22), a node is marked if it loses a child. Before a node is marked twice, a cut is performed, which separates the node from its parent. The subtree of the node is inserted into the root list (where the node becomes unmarked again). The cut may cascade, as it is propagated to the parent node. An example of a cascading cut is shown in Figure 3.11: the nodes with the keys 3, 6, and 8 are already marked; now we decrease the key 9 to 1, so that 3, 6, and 8 each lose their second child.
Algorithm 3.22. Consolidation with bitvectors and heuristic factor in a Fibonacci Heap.
DecreaseKey performs the update on the element in the heap-ordered tree. It removes the updated node from the child list of its parent and inserts it into the root list while updating the minimum. DeleteMin extracts the minimum and includes all subtrees into the root list and consolidates it.
Figure 3.11. Cascading cut in a heap-ordered tree.
A heuristic parameter can be set to call consolidation less frequently. Moreover, a bitvector can improve the performance of consolidation, as it avoids additional links and allows faster access to the trees of the same rank that are to be merged. In an eager variant, Fibonacci Heaps maintain the consolidated heap store at all times.

Relaxed Weak Queues

Relaxed Weak Queues are worst-case efficient priority queues: the running times that Fibonacci Heaps achieve only in the amortized sense are guaranteed in the worst case.
Weak Queues build on the observation that Perfect Weak Heaps are in one-to-one correspondence with Binomial Trees if only the edges defined by the Grandparent relation are taken. Note that in a Perfect Weak Heap the right subtree of the root is a complete binary tree. A Weak Queue stores n elements as a collection of disjoint (nonembedded) Perfect Weak Heaps, based on the binary representation of n. In its basic form, a Weak Queue contains a Perfect Weak Heap H_i of size 2^i if and only if the ith bit in the binary representation of n is set.
Relaxed Weak Queues relax the requirement of having exactly one Weak Heap of a given rank in the Weak Queue and allow some inconsistent elements that violate the Weak Heap property. A structure (of logarithmic size) called the heap store maintains Perfect Weak Heaps of the same rank similar to Fibonacci Heaps. At most two heaps per rank suffice to efficiently realize injection and ejection of the heaps. To keep the worst-case complexity bounds, merging Weak Heaps of the same rank is delayed by maintaining the following structural property on the sequence of numbers of Perfect Weak Heaps of the same rank.
The rank sequence (r_0, r_1, ...) of the numbers of heaps per rank is regular if any digit 2 is preceded by a digit 0, possibly having some digits 1 in between. A subsequence of the form (0 1^j 2), j ≥ 0, is called a block. That is, every digit 2 must be part of a block, but there can be digits 0 and 1 that are not part of a block. For example, the rank sequence (1011202012) contains three blocks. For injecting a Weak Heap, we join the first two Weak Heaps that are of the same size, if there are any. They are found by scanning the rank sequence. For O(1) access, a stack of pending joins, the so-called join schedule, implements the rank sequence of pending joins. Then we insert the new Weak Heap, which preserves the regularity of the rank sequence. For ejection, the smallest Weak Heap is eliminated from the heap sequence and, if this Perfect Weak Heap forms a pair with some other Perfect Weak Heap, the top of the join schedule is popped as well.
To keep the complexity of DecreaseKey constant, resolving Weak Heap order violations is also delayed. The primary purpose of a node store is to keep track of and reduce the number of potential violation nodes, at which the key may be smaller than the key of the grandparent. A node that is a potential violation node is marked. A marked node is tough if it is the left child of its parent and the parent is marked as well. A chain of consecutive tough nodes followed by a single nontough marked node is called a run. All tough nodes of a run are called its members; the single nontough marked node of that run is called its leader. A marked node that is neither a member nor a leader of a run is called a singleton. To summarize, we can divide the set of all nodes into four disjoint categories: unmarked nodes, run members, run leaders, and singletons.
A pair (type, height), with type being either unmarked, member, leader, or singleton and height being a value in {0, ..., ⌊lg n⌋}, denotes the state of a node. Transformations induce a constant number of state transitions. A simple example of such a transformation is a join, where the height of the new root must be increased by one. Other operations are the cleaning, parent, sibling, and pair transformations (see Fig. 3.12). A cleaning transformation rotates a marked left child into a marked right one, provided its neighbor and parent are unmarked. A parent transformation reduces the number of marked nodes or pushes the marking one level up. A sibling transformation reduces the markings by eliminating two markings in one level while generating a new marking one level up. A pair transformation has a similar effect, but also operates on disconnected trees.
All transformations run in constant time. The node store consists of different list items containing the type of the node marking, which can either be a fellow, a chairman, a leader, or a member of a run, where fellows and chairmen refine the concept of singletons. A fellow is a marked node, with an unmarked parent, if it is a left child. If more than one fellow has a certain height, one of them is elected a chairman. The list of chairmen is required for performing a singleton transformation. Nodes that are left children of a marked parent are members, while the parent of such runs is entitled the leader. The list of leaders is needed for performing a run transformation.
Figure 3.12. Primitives used in a λ-reduction: (a) cleaning transformation, (b) parent transformation, (c) sibling transformation, and (d) pair transformation.
The four primitive transformations are combined into a λ-reduction, which invokes either a singleton or a run transformation (see Alg. 3.23). A singleton transformation reduces the number of markings in a given level by one without producing a marking in the level above, or it reduces the number of markings in a level by two, producing one marking in the level above. A similar observation applies to a run transformation, so that in both cases the number of markings is reduced by at least one in a constant amount of work and comparisons. A λ-reduction is invoked once for each DecreaseKey operation and is enforced once the number of marked nodes exceeds ⌊lg n⌋.
Table 3.1 reports the time (in microseconds per operation) for inserting n integers (randomly assigned values from n to 2n − 1); next, their values are decreased by 10, and then the minimum element is deleted n times. (The lack of results in one row is due to the fact that Fibonacci Heaps ran out of space.)
Table 3.1 Performance of priority queue data structures on n integers.

                      |       n = 25,000,000       |       n = 50,000,000
                      | Insert  Dec.Key  Del.Min   | Insert  Dec.Key  Del.Min
Relaxed Weak Queues   | 0.048   0.223    4.38      | 0.049   0.223    5.09
Pairing Heaps         | 0.010   0.020    6.71      | 0.009   0.020    8.01
Fibonacci Heaps       | 0.062   0.116    6.98      | -       -        -
Heaps                 | 0.090   0.064    5.22      | 0.082   0.065    6.37
Algorithm 3.23. Reducing the number of markings in a Relaxed Weak Queue.

3.2. Hash Tables

Duplicate detection is essential for state space search to avoid redundant expansions. Since no access to all states is given in advance, a dynamically growing dictionary has to be provided to represent sets of states. For the Closed list, we memorize nodes that have been expanded, and for each generated state we check whether it is already stored. We also have to search for duplicates in the Open list, so another dictionary is needed to assist lookups in the priority queue. The Dictionary problem consists of providing a data structure with the operations Insert, Lookup, and Delete. In search applications, deletion is not always necessary. The slightly easier membership problem neglects any associated information; however, many implementations of membership data structures can easily be generalized to dictionary data structures by adding a pointer. Instead of maintaining two dictionaries for Open and Closed individually, the two lists are more frequently maintained together in one combined dictionary.
There are two major techniques for implementing dictionaries: (balanced) search trees and hashing. The former class of algorithms can achieve all operations in O(lg n) worst-case time and O(n) storage space, where n is the number of stored elements. For hashing, constant lookup time is the design goal, so we concentrate on hash dictionaries. We first introduce different hash functions and algorithms. Incremental hashing will be helpful to enhance the efficiency of computing hash addresses. In perfect hashing we consider bijective mappings of states to addresses. In universal hashing, we consider a class of hash functions that will be useful for more general perfect hashing strategies. Because memory is a big concern in state space search, we will also address memory-saving dictionary data structures. At the end of this section, we show how to save additional space by being imprecise (reporting an element as stored when in fact it is not).

3.2.1. Hash Dictionaries

Hashing serves as a method to store and retrieve states efficiently. A dictionary over a universe U of possible keys is a partial function from a subset R ⊆ U (the stored keys) to some set I (the associated information). In state space hashing, every state x ∈ S is assigned a key k(x), which is a part of the representation that uniquely identifies x. Note that every state representation can be interpreted as a binary integer number; then not all integers in the universe will correspond to valid states. For simplicity, in the following we identify states with their keys.
The keys are mapped into a linear array T[0..m − 1], called the hash table. The mapping h : U → {0, ..., m − 1} is called the hash function (see Fig. 3.13). The lack of injectivity yields address collisions; that is, different states can be mapped to the same table location. Roughly speaking, hashing is all about computing keys and detecting collisions. The overall time complexity for hashing depends on the time to compute the hash function, the collision strategy, and the ratio between the number of stored keys and the hash table size, but usually not on the size of the keys.
Figure 3.13. Basic principle of hashing.
The choice of a good hash function is the central problem for hashing. In the worst case, all keys are mapped to the same address; for example, h(x) = c for all x ∈ U and some constant c ∈ {0, ..., m − 1}. In the best case, we have no collisions and the access time to an element is constant. A special case is that of a fixed stored set R and a hash table with at least |R| entries; then a suitable hash function is h(x_i) = i for R = {x_1, ..., x_n}, which yields no collisions at all.
These two extreme cases are mostly of theoretical interest; in practice, the worst case can be avoided by a proper design of the hash function.
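For reference, here is a minimal Python sketch of a hash dictionary that resolves address collisions by chaining (one standard strategy); the interface mirrors the Insert, Lookup, and Delete operations named above.

```python
class ChainedDict:
    def __init__(self, m, h):
        self.t = [[] for _ in range(m)]   # hash table T[0..m-1] of chains
        self.h = h                        # hash function: key -> {0,...,m-1}

    def insert(self, key, info):
        self.t[self.h(key)].append((key, info))

    def lookup(self, key):
        for k, info in self.t[self.h(key)]:
            if k == key:                  # collisions resolved by comparison
                return info
        return None

    def delete(self, key):
        chain = self.t[self.h(key)]
        self.t[self.h(key)] = [(k, i) for (k, i) in chain if k != key]

# e.g., d = ChainedDict(1024, lambda x: x % 1024)
```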

3.2.2. Hash Functions

A good hash function is one that can be computed efficiently and minimizes the number of address collisions. The returned addresses for given keys should be uniformly distributed, even if the set of chosen keys in S is not, which is almost always the case.
Given a hash table of size m and the sequence k_1, ..., k_n of keys to be inserted, for each pair (k_i, k_j) of keys with i < j, we define a random variable

X_{ij} = 1 if h(k_i) = h(k_j), and X_{ij} = 0 otherwise.

Then X = Σ_{i<j} X_{ij} is the total number of collisions. Assuming a random hash function with uniform distribution, the expected value of X is

E(X) = Σ_{i<j} P(h(k_i) = h(k_j)) = n(n − 1)/(2m).

Using a hash table of size m = n for, say, 1 million elements, we expect about (n − 1)/2 ≈ 500,000 address collisions.

Remainder Method

If we can extend S to the integers Z, then Z/mZ = {[0], ..., [m − 1]} is the quotient space with equivalence classes [i] = {x ∈ Z : x ≡ i (mod m)} induced by the relation

x ∼ y if and only if x mod m = y mod m.

Therefore, the mapping h(x) = x mod m distributes S over the table T. For uniformity, the choice of m is important; for example, if m is even, then h(x) is even if and only if x is.
The choice m = 2^w, for some w ≥ 1, is also not appropriate, since for x = Σ_{i≥0} x_i·2^i we have

x mod 2^w = Σ_{i=0..w−1} x_i·2^i.

This means that the distribution takes only the last w bits into account.
A good choice for m is a prime that does not divide any number r^j ± 1 for small j, where r is the radix of the key representation, because m | r^j − 1 is equivalent to r^j ≡ 1 (mod m), so that (in the case of +1)

(Σ_i x_i·r^i) mod m = (Σ_i x_i·r^(i mod j)) mod m;

that is, for j = 1, keys with the same sum of digits are mapped to the same address.

Multiplicative Hashing

In this approach, the product of the key and an irrational number φ is computed and the fractional part is preserved, resulting in a mapping into [0, 1). This can be used for a hash function that maps the key x to {0, ..., m − 1} as follows:

h(x) = ⌊m·(x·φ − ⌊x·φ⌋)⌋.
One of the best choices for φ in multiplicative hashing is the golden ratio conjugate (√5 − 1)/2 ≈ 0.6180339887. As an example, take m = 1000 and x = 123456; then x·φ ≈ 76300.0041, so that h(x) = ⌊1000 · 0.0041...⌋ = 4.
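In code, multiplicative hashing is a one-liner; the sketch below also shows the common fixed-point variant for table sizes m = 2^p, which replaces floating-point arithmetic by one multiplication with the constant ⌊φ·2^32⌋ = 2654435769 and a shift.

```python
import math

PHI = (math.sqrt(5) - 1) / 2            # golden ratio conjugate

def mult_hash(x, m):
    """h(x) = floor(m * frac(x * phi))."""
    return int(m * ((x * PHI) % 1.0))

def mult_hash_fixed(x, p, w=32):
    """Fixed-point variant for m = 2**p on w-bit words."""
    A = 2654435769                       # floor(phi * 2**32)
    return ((x * A) & ((1 << w) - 1)) >> (w - p)

# mult_hash(123456, 1000) == 4, matching the example above
```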

Rabin and Karp Hashing

For incremental hashing based on the idea of Rabin and Karp, states are interpreted as strings over a fixed alphabet. In case no natural string representation exists it is possible to interpret the binary representation of a state as a string over the alphabet {0, 1}. To increase the effectiveness of the method the string of bits may be divided into blocks. For example, a state vector consisting of bytes yields 256 different characters.
The idea of Rabin and Karp originates in matching a pattern string M = M_1 ⋯ M_m to a text T = T_1 ⋯ T_n. For a certain hash function h, pattern M is mapped to the number h(M), assuming that h(M) fits into a single memory cell and can be processed in constant time. For 1 ≤ j ≤ n − m + 1, the algorithm checks whether h(M) = h(T_j ⋯ T_{j+m−1}). Due to possible collisions, this check is a necessary but not a sufficient condition for a valid match of M at position j. To validate that a match is indeed valid in case the hash values agree, a character-by-character comparison has to be performed.
To compute h(T_{j+1} ⋯ T_{j+m}) in constant time, the value is calculated incrementally: the algorithm takes the known value h(T_j ⋯ T_{j+m−1}) into account to determine h(T_{j+1} ⋯ T_{j+m}) with a few CPU operations. The hash function has to be chosen carefully to suit this incremental computation; for example, linear hash functions based on a radix number representation, such as h(x_1 ⋯ x_m) = (Σ_{i=1..m} x_i·r^(m−i)) mod q, are suitable, where q is a prime and the radix r is equal to |Σ|.
Algorithmically, the approach works as follows. Let q be a sufficiently large prime, and assume that numbers of size r·q fit into a memory cell, so that all operations can be performed with single-precision arithmetic. To ease notation, we identify characters in Σ with their ordinal numbers. The algorithm of Rabin and Karp, as presented in Algorithm 3.24, performs the matching process.
The algorithm is correct due to the following observation.
Algorithm 3.24. Algorithm of Rabin and Karp.
Theorem 3.1
(Correctness of Rabin-Karp) Let the steps of Algorithm 3.24 be numbered with respect to the loop counter j. At the start of the jth iteration we have

t = h(T_j ⋯ T_{j+m−1}) = (T_j·r^(m−1) + T_{j+1}·r^(m−2) + ⋯ + T_{j+m−1}) mod q.
Proof
Certainly, the invariant holds for j = 1, since t is initialized to h(T_1 ⋯ T_m). Inductively, we have

t′ = ((t − T_j·r^(m−1))·r + T_{j+m}) mod q = (T_{j+1}·r^(m−1) + ⋯ + T_{j+m}) mod q = h(T_{j+1} ⋯ T_{j+m}).
As an example, take Σ = {0, 1, ..., 9} (so r = 10) and q = 13. Furthermore, let M = 31415 and T = 2359023141526739921. The application of the mapping h is illustrated in Figure 3.14.
Figure 3.14. Example of Rabin-Karp hashing for string matching.
We see that h produces collisions. The incremental computation works as follows:

h(T_{j+1} ⋯ T_{j+m}) = ((h(T_j ⋯ T_{j+m−1}) − T_j·r^(m−1))·r + T_{j+m}) mod q.

For example, when the window 31415 with h(31415) = 31415 mod 13 = 7 shifts to 14152, we obtain h(14152) = ((7 − 3·(10^4 mod 13))·10 + 2) mod 13 = ((7 − 3·3)·10 + 2) mod 13 = −18 mod 13 = 8.
The computation of all hash addresses has a resulting running time of O(n + m), which is also the best-case overall running time. In the worst case, the matching is still of order O(n·m), as the example problem of searching M = a^m in T = a^n shows.

Incremental Hashing

For a state space search, we often have the case that a state transition changes only a part of the representation. In this case, the hash function can be computed incrementally. We refer to this approach as incremental state space hashing. The alphabet Σ denotes the set of characters in the string to be hashed; in state space search, Σ will denote the domain(s) of the state variables.
Take, for example, the Fifteen-Puzzle. With Σ = {0, ..., 15}, a natural vector representation for state u is t = (t_0, ..., t_15), where t_i = l means that the tile labeled l is located at position i, and tile 0 is the blank. Because successor generation is fast and the Manhattan distance heuristic can be computed incrementally in constant time (using a table addressed by the tile's label, the tile's move direction, and the position of the tile that is being moved), the computational burden is on computing the hash function.
One hash value of a Fifteen-Puzzle state u is h(u) = (t_0·16^0 + t_1·16^1 + ⋯ + t_15·16^15) mod q for some prime q. Let state u′ with representation t′ = (t′_0, ..., t′_15) be a successor of u. We know that there is only one transposition in the vectors t and t′. Let j be the position of the blank in u and k be the position of the blank in u′. We have t′_j = t_k, t′_k = 0, and t′_i = t_i for all i ∉ {j, k}. Therefore,

h(u′) = (h(u) + t_k·(16^j − 16^k)) mod q.
To save time, we may precompute (l·16^k) mod q for each k and l in {0, ..., 15}. If we were to store (l·(16^j − 16^k)) mod q for each value of j, k, and l, we would save another addition. As the sum of the old hash value and the precomputed correction term lies between 0 and 2q − 1, we may further substitute the last mod operation by one comparison and one subtraction.
As a particular case, we look at an instance of the Fifteen-Puzzle where tile 12 is to be moved downward from its position 11 to position 15, the current position of the blank. We have h(u′) = (h(u) + 12·(16^15 − 16^11)) mod q.
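A Python sketch of this update rule follows; the prime q = 1000003 is our illustrative choice, and in a tuned implementation the powers 16^i mod q (or the full correction terms) would be precomputed tables rather than pow calls.

```python
Q = 1000003                              # some prime (illustrative)

def full_hash(t):
    """h(u) = (sum of t[i] * 16^i) mod Q over all 16 positions."""
    return sum(l * pow(16, i, Q) for i, l in enumerate(t)) % Q

def incr_hash(h, t, j, k):
    """Blank moves from j to k: tile t[k] slides to j; O(1) update."""
    return (h + t[k] * (pow(16, j, Q) - pow(16, k, Q))) % Q

# tile 12 moves down from position 11 into the blank at position 15:
t = list(range(1, 16)) + [0]             # t[11] == 12, blank at 15
t2 = t[:]
t2[15], t2[11] = t2[11], 0
assert incr_hash(full_hash(t), t, 15, 11) == full_hash(t2)
```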
Next we generalize our observations. The savings grow with the size of the state vector: for the (n² − 1)-Puzzle, nonincremental hashing requires O(n²) time per state, whereas the effort for incremental hashing remains constant. Moreover, incremental hashing is available for many search problems that obey a static vector representation. Hence, we assume that state u is a vector (u_1, ..., u_k) with u_i in a finite domain D_i, i ∈ {1, ..., k}.
Theorem 3.2
(Efficiency of Incremental Hashing) Let I(a) be the set of indices in the state vector that change when applying action a, and let A be the set of actions. Given the hash value of u, the hash value of the successor v of u via a is available in time:
1. O(|I(a)|), using an O(k)-size table.
2. O(1), using an O(Δ)-size table, where Δ = Σ_{a ∈ A} Π_{i ∈ I(a)} |D_i|.
Proof
We define h(u) = (Σ_{i=1..k} u_i·M_i) mod q as the hash function, with M_1 = 1 and M_i = M_{i−1}·|D_{i−1}| for i > 1. For case 1 we store M_i mod q for all i ∈ {1, ..., k} in a precomputed table; then h(v) = (h(u) + Σ_{i ∈ I(a)} (v_i − u_i)·M_i) mod q requires O(|I(a)|) lookups and arithmetic operations. For case 2 we precompute the correction term (Σ_{i ∈ I(a)} (v_i − u_i)·M_i) mod q for every possible instantiated action a ∈ A. The number of table entries is bounded by Δ, since for each action the changed indices in I(a) can take at most Π_{i ∈ I(a)} |D_i| different value combinations.
Note that the number of possible actions is much smaller in practice. The effectiveness of incremental hashing relies on two factors: the state vector's locality (i.e., how many state variables are affected by a state transition) and the node expansion efficiency (i.e., the running time of all other operations needed to generate one successor). In Rubik's Cube, exploiting locality is limited: If we represent the position and orientation of each subcube as a number in the state vector, then each twist changes 8 of the 20 entries. In contrast, for Sokoban the node expansion efficiency is small: During move execution, the set of pushable balls has to be determined in time linear in the size of the board, and the (incremental) computation of the minimum matching heuristic requires at least quadratic time in the number of balls.
Incremental hashing is very efficient, but it has to be applied with care: as with ordinary hashing, the suggested schema induces collisions, which have to be resolved.
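To make the scheme concrete, here is a minimal Python sketch of incremental hashing for the Fifteen-Puzzle under the representation and hash function given earlier; the prime table size M and the function names are illustrative, not part of the original formulation.

    M = 1000003  # hypothetical prime table size m

    def full_hash(t, m=M):
        # nonincremental reference: h(u) = (sum_i t[i] * 16^i) mod m, O(16) work
        return sum(l * pow(16, i, m) for i, l in enumerate(t)) % m

    def incremental_hash(h_u, t, j, k, m=M):
        # blank at position j in u and at position k in the successor u';
        # the moved tile has label t[k], so h(u') = (h(u) + t[k]*(16^j - 16^k)) mod m
        return (h_u + t[k] * (pow(16, j, m) - pow(16, k, m))) % m

Precomputing the sixteen powers 16^i mod m (the O(k)-size table of Theorem 3.2) turns each update into two table lookups, one multiplication, and one addition.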

Universal Hash Functions

Universal hashing requires a set of hash functions to have, on average, a good distribution for any subset of stored keys. It is the basis for FKS and cuckoo hashing and has many useful properties. Universal hashing is often used in state space search when restarting a randomized incomplete algorithm with a different hash function.
Let {0, …, m − 1} be the set of hash addresses and R be the set of possible keys. A set H of hash functions is universal if for all x, y ∈ R with x ≠ y,
|{h ∈ H | h(x) = h(y)}| / |H| ≤ 1/m.
The intuition in the design of universal hash functions is to include a suitable random number generator inside the hash computation. For example, the Lehmer generator refers to linear congruences and is one of the most common methods for generating pseudo-random numbers. With respect to a triple of constants a, b, and c, a sequence of pseudo-random numbers x_i is generated according to the recursion
x_{i+1} = (a · x_i + b) mod c.
Universal hash functions lead to a good distribution of values on the average. If h is drawn randomly from H and S is the set of keys to be inserted in the hash table, the expected cost of each Lookup, Insert, and Delete operation is bounded by O(1 + |S|/m). We give an example of a class of universal hash functions. Let R = {0, …, N − 1}, and let p be prime with p ≥ N. For a ∈ {1, …, p − 1} and b ∈ {0, …, p − 1}, define
h_{a,b}(x) = ((a · x + b) mod p) mod m.
Then
H = { h_{a,b} | a ∈ {1, …, p − 1}, b ∈ {0, …, p − 1} }
is a set of universal hash functions. As an example, take m = 3 and p = 5. Then we have 20 functions in H:
x, x + 1, x + 2, x + 3, x + 4, 2x, 2x + 1, 2x + 2, 2x + 3, 2x + 4, 3x, 3x + 1, 3x + 2, 3x + 3, 3x + 4, 4x, 4x + 1, 4x + 2, 4x + 3, 4x + 4,
all taken mod 5 mod 3. Hashing the keys 1 and 4 yields address collisions for exactly four of these functions:
h(x) = x, h(x) = x + 4, h(x) = 4x, and h(x) = 4x + 4.
To prove that H is universal, let us look at the probability that two distinct keys x, y ∈ R are mapped to locations r and s by the inner part of the hash function,
(a · x + b) mod p = r and (a · y + b) mod p = s.
This means that a · (x − y) ≡ r − s (mod p), which has exactly one solution a since Z_p is a field (we needed p ≥ N to ensure that x ≢ y (mod p)). Value r cannot be equal to s, since this would imply a = 0, contrary to the definition of the hash function. Therefore, we now assume r ≠ s. Then there is a 1 in p − 1 chance that a has the right value. Given this value of a, we need b = (r − a · x) mod p, and there is a 1/p chance that b gets this value. Consequently, the overall probability that the inner function maps x to r and y to s is 1/(p(p − 1)).
Now, the probability that x and y collide is equal to this 1/(p(p − 1)), times the number of pairs (r, s) with r ≠ s such that r ≡ s (mod m). We have p choices for r, and subsequently at most ⌈p/m⌉ − 1 choices for s (the −1 is for disallowing s = r). Using ⌈v/w⌉ ≤ (v + w − 1)/w for integers v and w, the product is at most p · ((p + m − 1)/m − 1) = p(p − 1)/m.
Putting this all together, we obtain for the probability of a collision between x and y,
Pr[h_{a,b}(x) = h_{a,b}(y)] ≤ (1/(p(p − 1))) · (p(p − 1)/m) = 1/m.
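A small Python sketch may help to connect the class H with the collision count just derived; drawing a and b at random implements the random choice of h ∈ H (the function names are ours).

    import random

    def draw_universal_hash(p, m):
        # h_{a,b}(x) = ((a*x + b) mod p) mod m with a in {1,...,p-1}, b in {0,...,p-1}
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    # enumerate the whole class for p = 5, m = 3 and count collisions of keys 1 and 4
    colliding = [(a, b) for a in range(1, 5) for b in range(5)
                 if ((a * 1 + b) % 5) % 3 == ((a * 4 + b) % 5) % 3]
    assert len(colliding) == 4   # 4 of 20 functions, i.e., 1/5 <= 1/m = 1/3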

Perfect Hash Functions

Can we find a hash function h such that (besides the efforts to compute the hash function) all lookups require constant time? The answer is yes: this leads to perfect hashing. An injective mapping of R with |R| = n to {0, …, m − 1} is called a perfect hash function; it allows access without collisions. If n = m we have a minimal perfect hash function. The design of perfect hashing yields an optimal worst-case performance of O(1) accesses. Since perfect hashing uniquely determines an address, a state S can often be reconstructed given h(S).
If we invest enough space, perfect (and incremental) hash functions are not difficult to obtain. In the example of the Eight-Puzzle, for a state u in vector representation (t_0, …, t_8) we may choose h(u) = Σ_{i=0}^{8} t_i · 9^i for 9^9 different hash addresses (equivalent to about 46 megabytes of space). Unfortunately, this approach leaves most hash addresses vacant. A better hash function is to compute the rank of the permutation in some given ordering, resulting in 9! states or about 44 kilobytes.

Lexicographic Ordering

The lexicographic rank of permutation π (of size N) is defined as rank(π) = d_0 · (N − 1)! + d_1 · (N − 2)! + … + d_{N−1} · 0!, where the coefficients d_i are called the inverted index or factorial base.
By looking at a permutation tree it is easy to see that such a hash function exists. Leaves in the tree are all permutations, and at each node in level i the i-th vector value is selected, reducing the range of available values in level i + 1. This leads to an O(N²) algorithm. A linear algorithm maps a permutation to its factorial base (d_0, …, d_{N−1}), with d_i being equal to t_i minus the number of elements t_j, j < i, that are smaller than t_i; that is, d_i = t_i − c_i, with c_i being set to |{j < i | t_j < t_i}|. For example, the lexicographic rank of permutation (1, 0, 3, 2) is equal to 1 · 3! + 0 · 2! + 1 · 1! + 0 · 0! = 7, corresponding to d = (1, 0, 1, 0) and c = (0, 0, 2, 2). The values c_i are computed in linear time using a table lookup in a 2^N-size table T. In the table T we store the number of ones in the binary representation of a value, T[x] = Ones(x) with x ∈ {0, …, 2^N − 1}. For computing the hash value, while processing vector position t_i we mark bit t_i in a bitvector x (initially set to 0). Thus, x denotes the tiles we have seen so far, and we can take T[x mod 2^{t_i}] as the value for c_i. Since this approach consumes exponential space, time-space trade-offs have been discussed.
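The linear-time ranking just described translates directly into code; the following Python sketch uses an integer as the bitvector x and a popcount in place of the table T (names are ours).

    from math import factorial

    def lex_rank(t):
        # rank(pi) = sum_i d_i * (N-1-i)! with d_i = t_i - c_i,
        # where c_i counts the smaller elements seen before position i
        n, seen, r = len(t), 0, 0
        for i, ti in enumerate(t):
            ci = bin(seen & ((1 << ti) - 1)).count("1")   # ones below bit t_i
            r += (ti - ci) * factorial(n - 1 - i)
            seen |= 1 << ti                               # mark t_i as seen
        return r

    assert lex_rank((1, 0, 3, 2)) == 7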
For the design of a minimum perfect hash function of the sliding-tile puzzles we observe that in a lexicographic ordering every two successive permutations have an alternating signature (parity of the number of inversions) and differ by exactly one transposition. For minimal perfect hashing of an (n² − 1)-Puzzle state to {0, …, n²!/2 − 1} we consequently compute the lexicographic rank and divide it by 2. For unranking, we now have to determine which one of the two uncompressed permutations of the puzzle is reachable. This amounts to finding the signature of the permutation, which allows us to separate solvable from insolvable states. It is computed as the parity of the number of inversions, sign(π) = (Σ_{i=0}^{N−1} (i − c_i)) mod 2. For example, with π = (1, 0, 3, 2) and c = (0, 0, 2, 2) from above we have sign(π) = (0 + 1 + 0 + 1) mod 2 = 0.
B9780123725127000031/u03-25-9780123725127.jpg is missing
Algorithm 3.25.
Rank operation for permutations.
There is one subtle problem with the blank. Simply taking the minimum perfect hash value for the alternation group in S_{n²} does not suffice, since swapping a tile with the blank does not necessarily toggle the solvability status (e.g., it may be a legal move). To resolve this problem, we partition the state space along the position of the blank. Let B_0, …, B_{n²−1} denote the sets of blank-projected states. Then each B_i contains (n² − 1)!/2 elements. Given index i and the rank inside B_i, it is simple to reconstruct the state.

Myrvold Ruskey Ordering

We next turn to alternative permutation indices proposed by Myrvold and Ruskey. The basic motivation is the generation of a random permutation by swapping π_i with π_r, where r is a random number uniformly chosen in {0, …, i}, and i decreases from N − 1 down to 1.
One (recursive) algorithm Rank is shown in Algorithm 3.25. The permutation π and its inverse B9780123725127000031/si330.gif is missing are initialized according to the permutation, for which a rank has to be determined.
The inverse π⁻¹ of π can be computed by setting π⁻¹[π_i] = i for all i ∈ {0, …, N − 1}. Take as an example permutation B9780123725127000031/si334.gif is missing. Then its rank is B9780123725127000031/si335.gif is missing. This unrolls to B9780123725127000031/si336.gif is missing. It is also possible to compile a rank back into a permutation in linear time. The inverse procedure Unrank, initialized with the identity permutation, is shown in Algorithm 3.26. The depth value N is initialized with the size of the permutation, and the rank r is the value computed with Algorithm 3.25. (As a side effect, if the algorithm is terminated after l swaps, the positions N − l, …, N − 1 hold a random l-permutation of the numbers 0, …, N − 1.)
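The following Python sketch pairs a recursive rank with an iterative unrank in the style of Myrvold and Ruskey; it is written to match the recursion-free pair (Algs. 3.27 and 3.28), and the signature bookkeeping anticipates Theorem 3.3 (here 1 encodes an even permutation, as in Table 3.2).

    def mr_rank(n, pi, inv):
        # destructively computes the rank; pi and inv are modified in place
        if n == 1:
            return 0
        s = pi[n - 1]
        pi[n - 1], pi[inv[n - 1]] = pi[inv[n - 1]], pi[n - 1]
        inv[s], inv[n - 1] = inv[n - 1], inv[s]
        return s + n * mr_rank(n - 1, pi, inv)

    def mr_unrank(n, r, pi):
        # iterative unrank; also returns the signature of the result
        sign = 1                      # 1 = even permutation
        while n > 1:
            j = r % n
            if j != n - 1:            # a proper swap toggles the parity
                pi[n - 1], pi[j] = pi[j], pi[n - 1]
                sign ^= 1
            r //= n
            n -= 1
        return sign

    pi = list(range(4))
    mr_unrank(4, 0, pi)               # yields [1, 2, 3, 0], cf. Table 3.2, index 0
    assert mr_rank(4, pi, [pi.index(i) for i in range(4)]) == 0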
Algorithm 3.27 shows another (in this case nonrecursive) unrank algorithm proposed by Myrvold and Ruskey. It also detects the parity of the number of inversions (the signature of the permutation) efficiently and matches the ranking function in Algorithm 3.28. All permutations of size N = 4, together with their signatures and their ranks according to the two approaches, are listed in Table 3.2.
Table 3.2 Myrvold and Ruskey's perfect permutation hash functions.

Index   Unrank (Alg. 3.26)   Signature   Unrank (Alg. 3.27)   Signature
0       (2,1,3,0)            0           (1,2,3,0)            0
1       (2,3,1,0)            1           (3,2,0,1)            0
2       (3,2,1,0)            0           (1,3,0,2)            0
3       (1,3,2,0)            1           (1,2,0,3)            1
4       (1,3,2,0)            1           (2,3,1,0)            0
5       (3,1,2,0)            0           (2,0,3,1)            0
6       (3,2,0,1)            1           (3,0,1,2)            0
7       (2,3,0,1)            0           (2,0,1,3)            1
8       (2,0,3,1)            1           (1,3,2,0)            1
9       (0,2,3,1)            0           (3,0,2,1)            1
10      (3,0,2,1)            0           (1,0,3,2)            1
11      (0,3,2,1)            1           (1,0,2,3)            0
12      (1,3,0,2)            0           (2,1,3,0)            1
13      (3,1,0,2)            1           (2,3,0,1)            1
14      (3,0,1,2)            0           (3,1,0,2)            1
15      (0,3,1,2)            1           (2,1,0,3)            0
16      (1,0,3,2)            1           (3,2,1,0)            1
17      (0,1,3,2)            0           (0,2,3,1)            1
18      (1,2,0,3)            1           (0,3,1,2)            1
19      (2,1,0,3)            0           (0,2,1,3)            0
20      (2,0,1,3)            1           (3,1,2,0)            0
21      (0,2,1,3)            0           (0,3,2,1)            0
22      (1,0,2,3)            0           (0,1,3,2)            0
23      (0,1,2,3)            1           (0,1,2,3)            1
Theorem 3.3
(Myrvold-Ruskey Permutation Signature) Given the Myrvold-Ruskey rank (as computed by Alg. 3.28), the signature of a permutation can be computed in O(N) time within Algorithm 3.27.
Proof
In the unrank function we always have N − 1 element exchanges. Swapping two elements u and v at respective positions i and j with i < j can be expressed by 2(j − i) − 1 transpositions of neighboring elements: j − i of them move u to position j, and the remaining (j − i) − 1 move v to position i. As 2(j − i) − 1 is odd, every such swap changes the parity of the number of inversions, so that the parity toggles in each iteration. The only exception is i = j, where no change occurs. Hence, the sign of the permutation can be determined by executing the Myrvold-Ruskey algorithm in O(N) time.
B9780123725127000031/u03-26-9780123725127.jpg is missing
Algorithm 3.26.
Unrank operation for permutations.
B9780123725127000031/u03-27-9780123725127.jpg is missing
Algorithm 3.27.
Recursion-free unranking and signature computation.
Theorem 3.4
(Compression of Alternation Group) Let π(i) denote the permutation returned by Myrvold and Ruskey's Unrank function (Alg. 3.27) for index i. Then π(i) matches π(i + N!/2) except for transposing π_0 and π_1.
Proof
The last call for n = 2 in Algorithm 3.27 is Swap(π_1, π_{r mod 2}), which resolves to either Swap(π_1, π_1) or Swap(π_1, π_0). Only the latter one induces a change. If r_N, …, r_2 denote the values r mod n in the iterations n = N, …, 2 of Myrvold and Ruskey's Unrank function, then r_2 = ⌊r/(N · (N − 1) · … · 3)⌋ mod 2 = ⌊r/(N!/2)⌋ mod 2, which is 1 for i + N!/2 and 0 for i (assuming i < N!/2).

3.2.3. Hashing Algorithms

There are two standard options for dealing with colliding items: chaining and open addressing. In hashing with chaining, keys x are kept in linked overflow lists. The dictionary operations Lookup, Insert, and Delete amount to computing h(x) and then performing pure list manipulations in the list stored at T[h(x)]. Their pseudo-code implementations are provided in Algorithm 3.29, Algorithm 3.30, and Algorithm 3.31. They assume a null pointer ⊥ and a link Next to the successor in the chained list. Operations Insert and Delete suggest a call to Lookup prior to their invocation to determine whether or not the element is contained in the hash table. An example of hashing the characters of the term heuristic search into a table of 10 elements with respect to their lexicographical order modulo 10 is depicted in Figure 3.15.
B9780123725127000031/f03-15-9780123725127.jpg is missing
Figure 3.15
Hashing the characters of the term heuristic search with chaining.
B9780123725127000031/u03-28-9780123725127.jpg is missing
Algorithm 3.28.
Recursion-free ranking operation for permutations.
B9780123725127000031/u03-29-9780123725127.jpg is missing
Algorithm 3.29.
Searching a chained hash table.
B9780123725127000031/u03-30-9780123725127.jpg is missing
Algorithm 3.30.
Inserting an element into a chained hash table.
B9780123725127000031/u03-31-9780123725127.jpg is missing
Algorithm 3.31.
Deleting an element from a chained hash table.
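A minimal executable sketch of hashing with chaining in Python might look as follows (mirroring the flow of Algorithms 3.29 through 3.31; Python lists stand in for the linked overflow lists with their Next pointers):

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m
            self.table = [[] for _ in range(m)]   # one overflow list per address

        def lookup(self, x):
            return x in self.table[hash(x) % self.m]

        def insert(self, x):
            bucket = self.table[hash(x) % self.m]
            if x not in bucket:                   # Lookup before Insert, as suggested
                bucket.append(x)

        def delete(self, x):
            bucket = self.table[hash(x) % self.m]
            if x in bucket:
                bucket.remove(x)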
Hashing with open addressing integrates the colliding elements at free locations in the hash table; that is, if T[h(x)] is occupied, an alternative location for x is searched. Searching a key x starts at h(x) and continues in the probing sequence until either x or an empty table entry is found. When deleting an element, some keys may have to be moved back to fill the hole in the lookup sequence.
The linear probing strategy considers the locations (h(x) − j) mod m for j = 0, …, m − 1. In general, we have the sequence
h(x), (h(x) − s(1, x)) mod m, (h(x) − s(2, x)) mod m, …, (h(x) − s(m − 1, x)) mod m
for a probing function s(j, x). There is a broad spectrum of suitable probing sequences, for example,
s(j, x) = j (linear probing), s(j, x) = (−1)^j · ⌈j/2⌉² (quadratic probing), s(j, x) = j · r_x (random probing), and s(j, x) = j · h′(x) (double hashing),
where r_x is a random number depending on x, and h′ is a second hash function, which determines the step size of the probing sequence in double hashing.
B9780123725127000031/u03-32-9780123725127.jpg is missing
Algorithm 3.32.
Searching an element in an open hash table.
To exploit the whole table, h(x), (h(x) − s(1, x)) mod m, …, (h(x) − s(m − 1, x)) mod m should be a permutation of {0, …, m − 1}.
An implementation of the procedure Lookup for a generic probing function s is provided in Algorithm 3.32. The implementation assumes an additional array Tag that associates one of the values Empty, Occupied, and Deleted with each element. Deletions (see Alg. 3.34) are handled by setting the Deleted tag for the cell of the deleted key. Lookups skip over deleted cells, and insertions (see Alg. 3.33) overwrite them.
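The tag-based bookkeeping can be sketched as follows in Python; the probing function s is passed in as a parameter, so linear probing, random probing, or double hashing is obtained by supplying the corresponding s(j, x) (class and parameter names are ours).

    EMPTY, OCCUPIED, DELETED = 0, 1, 2

    class OpenHashTable:
        def __init__(self, m, s):
            self.m, self.s = m, s                 # table size and probing function
            self.keys = [None] * m
            self.tag = [EMPTY] * m

        def _probes(self, x):
            h = hash(x) % self.m
            return ((h - self.s(j, x)) % self.m for j in range(self.m))

        def lookup(self, x):
            for i in self._probes(x):
                if self.tag[i] == EMPTY:
                    return False                  # stop at the first empty cell
                if self.tag[i] == OCCUPIED and self.keys[i] == x:
                    return True                   # deleted cells are skipped
            return False

        def insert(self, x):
            for i in self._probes(x):
                if self.tag[i] != OCCUPIED:       # overwrite empty or deleted cells
                    self.keys[i], self.tag[i] = x, OCCUPIED
                    return
            raise RuntimeError("table full")

        def delete(self, x):
            for i in self._probes(x):
                if self.tag[i] == EMPTY:
                    return
                if self.tag[i] == OCCUPIED and self.keys[i] == x:
                    self.tag[i] = DELETED
                    return

    T = OpenHashTable(11, lambda j, x: j)         # linear probing: s(j, x) = j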
When the hash table is nearly full, unsuccessful searches lead to long probe sequences. An optimization is ordered hashing, which maintains all probe sequences sorted. Thus, we can abort a Lookup operation as soon as we reach a larger key in the probe sequence. The according algorithm for inserting a key x is depicted in Algorithm 3.35. It consists of a search phase and an insertion phase. First, the probe sequence is followed up to a table slot that is either empty or contains an element that is larger than x. The insertion phase restores the sorting condition to make the algorithm work properly. If x replaces an element y, the latter has to be reinserted into its respective probe sequence, in turn. This leads to a sequence of updates that ends when an empty bin is found. It can be shown that the average number of probes to insert a key into the hash table is the same as in ordinary hashing.
B9780123725127000031/u03-33-9780123725127.jpg is missing
Algorithm 3.33.
Inserting an element into an open hash table.
B9780123725127000031/u03-34-9780123725127.jpg is missing
Algorithm 3.34.
Deleting an element from an open hash table.
B9780123725127000031/u03-35-9780123725127.jpg is missing
Algorithm 3.35.
Inserting an element for ordered hashing.
For a hash table of size m that stores n keys, the quotient α = n/m is called the load factor. The load factor crucially determines the efficiency of hash table operations. The analysis assumes uniformity of h; that is, Pr[h(x) = j] = 1/m for all keys x ∈ R and all addresses j ∈ {0, …, m − 1}. Under this precondition, the expected number of memory probes for insertion and unsuccessful lookup is approximately
(1/2) · (1 + 1/(1 − α)²) for linear probing, 1/(1 − α) − α + ln(1/(1 − α)) for quadratic probing, 1/(1 − α) for double hashing, and 1 + α for chaining.
Thus, for any load factor 0 < α < 1, in terms of the number of probes we obtain the following rank order: ideal hashing, chained hashing, double hashing, quadratic probing, and linear probing. The order of the last four methods holds for any α.
Although chaining scores quite favorably in terms of memory probes, the comparison is not totally fair, since it dynamically allocates memory and uses extra linear space to store the pointers.

FKS Hashing Scheme

With a hash function h drawn from a class H of universal hash functions, we can easily obtain constant lookup time if we don't mind spending a quadratic amount of memory. Say we allocate a hash table of size m = n². Since there are n(n − 1)/2 pairs of stored keys, each with a chance of 1/m of colliding with each other, the probability of a collision in the hash table is bounded by n(n − 1)/2m < 1/2. In other words, the chance of drawing a perfect hash function for the set of stored keys is at least 1/2. While for each given hash function there is a worst set of stored keys that maps all of them into the same bin, the crucial element of the algorithm is randomized rehashing: If the chosen h actually leads to a collision, just try again with another hash function drawn with uniform probability from H.
Perfect hashing has important advantages for storing and retrieving information in the visited list and, equally important, for fast lookup of pattern databases (see Ch. 4). The apparent question for practical search is whether the memory consumption of perfect hashing can be reduced. The so-called FKS hashing scheme (named after the initials of its inventors Fredman, Komlós, and Szemerédi) ended a long dispute in research about whether it is possible to achieve constant access time with a linear storage size of O(n). The algorithm uses a two-level approach: First, hash into a table of size n, which will produce some collisions, and then, for each resulting bin, rehash it as just described, squaring the size of the hash bucket to get zero collisions.
Denote the subset of elements mapped to bin i as R_i, with n_i = |R_i|. We will use the property that
(3.1)
E[Σ_{i=0}^{n−1} n_i²] ≤ n + n(n − 1)/m.
This can be seen by noting that Σ_i n_i² is the total number of ordered pairs of keys (including pairs of a key with itself) that land in the same bin of the table; we have
E[Σ_i n_i²] = n + Σ_{x ≠ y} Pr[h(x) = h(y)] ≤ n + n(n − 1)/m.
Using the Markov inequality Pr[X ≥ λ · E[X]] ≤ 1/λ with λ = 2 shows Pr[Σ_i n_i² ≥ 2 · E[Σ_i n_i²]] ≤ 1/2. Consequently,
Pr[Σ_i n_i² < 2 · E[Σ_i n_i²]] ≥ 1/2.
Choosing m = n, the expectation is smaller than 2n, and this implies that for at least half of the functions h ∈ H we have
(3.2)
Σ_{i=0}^{n−1} n_i² < 4n.
At the second level, we use the same property with the choice m_i = n_i² for the size of the hash table for R_i. Then, for at least half of the functions h_i ∈ H we obtain
Σ_j n_ij · (n_ij − 1) = 0,
where n_ij is the number of elements in the second-level bin j of R_i; in other words, n_ij ≤ 1 for all j.
So, the total space used is O(n) for the first table (assuming it takes a constant amount of space to store each hash function), plus
Σ_{i=0}^{n−1} O(n_i²) = O(Σ_{i=0}^{n−1} n_i²) = O(n)
for the other tables. For the last equality we used Equation 3.2.
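A compact Python sketch of the static FKS construction may clarify the two levels; it reuses the universal class h_{a,b} from earlier and retries hash functions exactly as described (p must be a prime larger than the largest key; all names are ours).

    import random

    def draw_hash(p, m):
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    def fks_build(keys, p):
        n = len(keys)
        while True:                                    # retry until Equation 3.2 holds
            h = draw_hash(p, n)
            bins = [[] for _ in range(n)]
            for x in keys:
                bins[h(x)].append(x)
            if sum(len(b) ** 2 for b in bins) < 4 * n:
                break
        seconds = []
        for bucket in bins:                            # rehash each bin into n_i^2 slots
            mi = max(len(bucket) ** 2, 1)
            while True:
                hi = draw_hash(p, mi)
                slots = [None] * mi
                for x in bucket:
                    if slots[hi(x)] is not None:
                        break                          # collision: draw a new function
                    slots[hi(x)] = x
                else:
                    seconds.append((hi, slots))        # injective on the bucket
                    break
        return h, seconds

    def fks_lookup(h, seconds, x):
        hi, slots = seconds[h(x)]
        return slots[hi(x)] == x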

Dynamic Perfect Hashing

Unfortunately, the FKS hashing scheme is only applicable to the static case, where the hash table is created once with a fixed set of keys, and no insertions and deletions are allowed afterward.
Later the algorithm was generalized to allow for update operations. Deletion of an element is handled simply by tagging it, and subsequently ignoring tagged keys, overwriting them later.
A standard doubling strategy is used to cope with a growing or shrinking number of stored elements. Every time a predetermined maximum number of update operations has occurred, the structure is recreated from scratch in the same way as in the static case, but slightly larger than necessary to accommodate future insertions. More precisely, it is planned for a maximum capacity of m = (1 + c) · n for some constant c > 0, where n is the number of currently stored keys. The top-level hash function maps to s(m) bins, where s(m) is defined as an O(n) function. Each second-level bin is allocated for a capacity m_i of twice as many elements as it currently holds; that is, if n_i keys fall into bin i, its size is chosen as 2m_i(m_i − 1), with m_i = 2n_i. The resulting new structure will be used subsequently for at most c · n update operations.
Before this maximum update count is reached, an insert operation first tries to insert an element according to the given structure and hash functions; this is possible if Equation 3.2 is still valid, the bin i the element is mapped to has some spare capacity left (i.e., n_i < m_i), and the position within bin i assigned by the second-level hash function is empty. If only the last condition doesn't hold, bin i is reorganized by randomly drawing a new second-level hash function. If n_i = m_i, the bin's capacity is doubled (from m_i to 2m_i) prior to reorganization. If, on the other hand, Equation 3.2 is violated, a new top-level hash function has to be selected, and hence the whole structure must be recreated.
It can be shown that this scheme uses O (n) storage; Lookup and Delete are executed in constant worst-case time, whereas Insert runs in constant amortized expected time.

Cuckoo Hashing

FKS hashing is involved, and it is unclear whether the approach can be made incremental. For the first-level hash function this is possible, but for the selection of a universal hash function for each bucket an appropriate answer is lacking.
Therefore, we consider an alternative conflict resolution strategy. Cuckoo hashing implements the dictionary with two hash tables, T1 and T2, and two different hash functions, h1 and h2. Each key k is contained either in T1[h1(k)] or in T2[h2(k)]. Algorithm 3.36 provides a pseudo-code implementation for searching a key. If an item produces a collision in the first table, the detected synonym is deleted and inserted into the other table. Figure 3.16 gives an example; arrows point to the alternative bucket in the other hash table. If D is hashed into the first hash table where it preempts C, then C needs to go into the second hash table where it preempts B, and B needs to go into the first hash table where it finds an empty location. During the insertion process the arrows have to be inverted. That is, B that has been moved to the first table points to C in the second table, and C that has been moved to the second table now points to the inserted element D in the first table.
B9780123725127000031/f03-16-9780123725127.jpg is missing
Figure 3.16
A successful insertion of an element via cuckoo hashing.
There is a small probability that the cuckoo process may not terminate at all and loop forever; Figure 3.17 gives an example. If D is hashed into the first hash table where it preempts C, then C needs to go into the second hash table where it preempts A, A needs to go into the first hash table where it preempts E, E needs to go into the second hash table where it preempts B, B needs to go into the first hash table where it preempts D, D needs to go into the second hash table where it preempts G, G needs to go into the first hash table where it preempts F, F needs to go into the second hash table where it preempts D, D needs to go into the first hash table where it preempts B, and so on. The analysis shows that such a situation is rather unlikely, so that we can pick fresh hash functions and rehash the entire structure after a fixed number t of failures. Algorithm 3.37 provides an implementation for the insert procedure.
B9780123725127000031/u03-36-9780123725127.jpg is missing
Algorithm 3.36.
Lookup for a key in a cuckoo hash table.
Although reorganization costs linear time, it contributes only a small amount to the expected runtime. The analysis reveals that if t is fixed appropriately (t = ⌈3 log_{1+ε} r⌉ for r being the individual hash table sizes, with n ≤ r/(1 + ε)), the probability of a rehash is O(1/n²). Therefore, rehashing n elements causes no recursive rehash with probability 1 − O(1/n). As the expected time for inserting one element is constant, the total expected time to reinsert all n elements is O(n). This is also the total expected time for rehashing.
B9780123725127000031/f03-17-9780123725127.jpg is missing
Figure 3.17
An infinite cuckoo process; rehashing is needed.
In summary, cuckoo hashing has worst-case constant access time and expected amortized constant insertion time. It is simple to implement and efficient in practice.
B9780123725127000031/u03-37-9780123725127.jpg is missing
Algorithm 3.37.
Inserting a key into a cuckoo hash table.
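A short Python sketch of the two-table scheme (salted calls to the built-in hash stand in for the two independent hash functions, and t is the fixed failure threshold; all names are ours):

    import random

    class CuckooTable:
        def __init__(self, r, t=32):
            self.r, self.t = r, t
            self._draw()

        def _draw(self):
            s1, s2 = random.random(), random.random()
            self.h = (lambda k, s=s1: hash((s, k)) % self.r,
                      lambda k, s=s2: hash((s, k)) % self.r)
            self.T = ([None] * self.r, [None] * self.r)

        def lookup(self, k):
            return self.T[0][self.h[0](k)] == k or self.T[1][self.h[1](k)] == k

        def insert(self, k):
            if self.lookup(k):
                return
            for _ in range(self.t):
                for i in (0, 1):
                    pos = self.h[i](k)
                    k, self.T[i][pos] = self.T[i][pos], k   # kick out the synonym
                    if k is None:
                        return
            items = [x for T in self.T for x in T if x is not None] + [k]
            self._draw()                                    # rehash after t failures
            for x in items:
                self.insert(x)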

3.2.4. Memory-Saving Dictionaries

The information-theoretic lower bound on the number of bits required to store an arbitrary subset of size n of a universe of size N is
B = ⌈lg C(N, n)⌉,
where C(N, n) denotes the binomial coefficient, since we have to be able to represent all possible combinations of selecting n values out of the N. Using Stirling's approximation and defining r = N/n, we obtain
B ≈ n · (lg r + lg e),
with an error of lower order, where e is Euler's constant. Alternatively, writing lg C(N, n) as a sum of logarithms, we can approximate this sum by two corresponding integrals. If we properly bias the integral limits we can be sure to compute a lower bound
B ≥ n · lg r.
For the case of dynamic dictionaries (where insertions and deletions are fully supported), we want to be able to maintain subsets of varying size, say, of zero up to a maximum of n elements. This results in a minimum number of
B_dyn = ⌈lg Σ_{i=0}^{n} C(N, i)⌉
bits. For n ≪ N (which is usually the case for nontrivial search problems) we have
C(N, n) ≤ Σ_{i=0}^{n} C(N, i) ≤ 2 · C(N, n).
The correctness follows from the property of binomial coefficients C(N, i − 1) ≤ C(N, i)/2 for i ≤ n ≪ N, which bounds the sum by a geometric series. We are only interested in the logarithms, so we conclude
B ≤ B_dyn ≤ B + 1.
Obviously, in this restricted range it is sufficient to concentrate on the last binomial coefficient; the error in our estimate is at most one bit. In the end, as we look at the logarithms, the dynamic case is not much different from the static case.
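For concreteness, the bound and its approximation are easy to evaluate; a small sketch (math.comb does exact integer arithmetic, so exact_bits is the true value of B):

    from math import comb, log2, e

    def exact_bits(N, n):
        # B = ceil(lg C(N, n)); for x >= 1, ceil(lg x) = (x - 1).bit_length()
        return (comb(N, n) - 1).bit_length()

    def approx_bits(N, n):
        # B ~ n * (lg r + lg e) with r = N/n, valid for n << N
        return n * (log2(N / n) + log2(e))

For n ≪ N the two values agree up to lower-order terms.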
If N is large compared to n, listing all elements, for example, in a hash table comes close to the information-theoretic minimum number of bits B. In the other border case, for small r it is optimal to list the answers, for example, in the form of a bitvector of size N. The more difficult part is to find appropriate representations for intermediate sizes.

Suffix Lists

Given B bits of memory, for space-efficient state storage we want to maintain a dynamically evolving visited list closed under inserts and membership queries. For ease of presentation, the entries of Closed are integers from {0, …, N − 1}. Using hashing with open addressing, the maximal number m of Closed nodes is limited to about B/lg N, since lg N bits are required to encode a state. A gain is only to be expected if we can exploit redundancies in the state vector set. In the following we describe a simple but very space-efficient approach with small update and query times.
Let (b_1, …, b_{lg N}) be the binary representation of an element u from the set Closed. We split this representation into the p high bits and the s = lg N − p low bits; the former is called the prefix of u, the latter its suffix.
A Suffix List data structure consists of a linear array P of size 2^p bits and of a two-dimensional array L of size m · (s + 1) bits. The basic idea of a Suffix List is to store a common prefix of several entries as a single bit in P, whereas the distinctive suffixes form a group within L. P is stored as a bit array. L can hold several groups, with each group consisting of a multiple of s + 1 bits. The first bit of each (s + 1)-bit row in L serves as a group bit: The first s-bit suffix entry of a group has group bit one, and the other elements of the group have group bit zero. We place the elements of a group together in lexicographic order (see Fig. 3.18).
To search an element u, we first compute its prefix, which gives us the search position in the prefix array P. Then we simply count the number of ones in P from position 0 up to this search position; let z be this number. Finally, we search through L until we have found the z-th suffix of L with group bit one, which starts the group of all stored elements sharing the prefix of u. If we have to perform a membership query we simply search in this group. Note that searching a single entry may require scanning large areas of main memory.
B9780123725127000031/f03-18-9780123725127.jpg is missing
Figure 3.18
Example for a Suffix List with p = 4 and s = 3.
To insert entry u we first search the corresponding group as described earlier. In case u opens a new group within L, this involves setting group bits in P and L. The suffix of u is inserted in its group while maintaining the elements of the group sorted. Note that an insert may need to shift many rows in L to create space at the desired position. The maximum number m of elements that can be stored in B bits is limited as follows: We need 2^p bits for P and s + 1 bits for each entry of L. Hence, we choose p so that m is maximal subject to
2^p + m · (s + 1) ≤ B.
For B9780123725127000031/si440.gif is missing the space requirement for both P and the suffixes in L is small enough to guarantee B9780123725127000031/si441.gif is missing.
We now show how to speed up the operations. When searching or inserting an element u we have to compute z to find the correct group in L. Instead of scanning potentially large parts of P and L for each single query, we maintain checkpoints, called one-counters, that store the number of ones seen so far. Checkpoints have to lie close enough together to support rapid search, but must not consume more than a small fraction of the main memory. Since each array contains at most m ones, lg m bits are sufficient for each one-counter.
Keeping one-counters only after every Θ(lg² m) entries limits the total space requirement to a small fraction of B. Binary search on the one-counters of P then reduces the scan area for computing the correct value of z to O(lg² m) bits.
Searching in L is slightly more difficult because a group can extend over up to 2^s entries, thus potentially spanning several one-counters with equal values. Nevertheless, finding the beginning and the end of large groups is possible within the stated bounds. As we keep the elements within a group sorted, another binary search on the actual entries is sufficient to locate the position in L.
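The following Python sketch condenses the structure to its logic: P is a bit array over the 2^p prefixes, and L is a list of (group bit, suffix) rows. Packed bit arrays, one-counters, and binary search are omitted for brevity, so the linear scans here stand in for the logarithmic searches described earlier (all names are ours).

    class SuffixListSketch:
        def __init__(self, w, p):                 # w-bit keys, p-bit prefixes
            self.s = w - p
            self.P = [0] * (1 << p)
            self.L = []                           # rows (group bit, suffix)

        def member(self, u):
            pre, suf = u >> self.s, u & ((1 << self.s) - 1)
            if not self.P[pre]:
                return False
            z = sum(self.P[:pre + 1])             # one-counting up to the prefix
            starts = [i for i, (g, _) in enumerate(self.L) if g]
            lo = starts[z - 1]
            hi = starts[z] if z < len(starts) else len(self.L)
            return suf in [s for _, s in self.L[lo:hi]]

        def insert(self, u):                      # assumes u is not yet stored
            pre, suf = u >> self.s, u & ((1 << self.s) - 1)
            new_group = not self.P[pre]
            self.P[pre] = 1
            z = sum(self.P[:pre + 1])
            starts = [i for i, (g, _) in enumerate(self.L) if g]
            if new_group:                         # open a new group in prefix order
                at = starts[z - 1] if z - 1 < len(starts) else len(self.L)
                self.L.insert(at, (1, suf))
            else:                                 # keep the group sorted
                lo = starts[z - 1]
                hi = starts[z] if z < len(starts) else len(self.L)
                sufs = sorted([s for _, s in self.L[lo:hi]] + [suf])
                self.L[lo:hi] = [(int(i == 0), s) for i, s in enumerate(sufs)]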
We now turn to insertions, where two problems remain: Adding a new element to a group may require shifting large amounts of data, and after each insert the checkpoints must be updated. A simple solution uses a second buffer data structure BU that is less space efficient but supports rapid inserts and lookups. When the number of elements in BU exceeds a certain threshold, BU is merged with the old Suffix List to obtain a new up-to-date space-efficient representation. Choosing an appropriate size of BU, amortized analysis shows improved computational bounds for inserts while achieving asymptotically the same number of phases for the graph search algorithm.
Note that membership queries must be extended to BU as well. We implement BU as an array for hashing with open addressing. BU stores at most c_2 · B/lg N elements of lg N bits each, for some small constant c_2. As long as there is 10% space left in BU we continue to insert elements into BU; otherwise BU is sorted and the suffixes are moved from BU into the proper groups of L. The reason not to exploit the full hash table size is again to bound the expected search and insert time within BU by a constant number of tests. Altogether, we can prove the following theorem.
Theorem 3.5
(Time Complexity Suffix List) Searching and inserting n items into a Suffix List under space restriction B amounts to a runtime of O(n lg n).
Proof
For a membership query we perform binary searches on numbers of lg m bits or s bits, respectively. So, to search an element we need O(lg m · lg N) bit operations, since both the one-counters and the suffixes consist of O(lg N) bits.
Each of the O(B/lg N) buffer entries consists of O(lg N) bits; hence, sorting the buffer can be done with
O(B · lg B)
bit operations. Starting with the biggest occurring keys, merging can be performed in O(1) memory scans. This also includes updating all one-counters. In spite of the additional data structures we still have
m = Θ(B/lg N).
Thus, the total bit complexity for n inserts and membership queries is given by
O(n · lg n · lg N).
Assuming a machine word length of Ω(lg N), any modification or comparison of entries with O(lg N) bits appearing in a Suffix List can be done using O(1) machine operations. Hence, the total complexity reduces to O(n lg n) operations.
The constants can be improved using the following observation: In the case n = (1 + ϵ) ⋅ B, for a small ϵ > 0 nearly half of the entries in P will always be zero, namely those that are lexicographically bigger than the suffix of n itself. Cutting the P array at this position leaves more room for L, which in turn enables us to keep more elements.
Table 3.3 compares the Suffix List data structure with hashing with open addressing. The constants for the Suffix List are chosen so that B9780123725127000031/si461.gif is missing, which means that if m elements can be treated, we set aside B9780123725127000031/si462.gif is missing bits to speed up internal computations. For hashing with open addressing we also leave 10% of the memory free to keep the internal computation time moderate. When using a Suffix List instead of hashing, note that only the ratio between n and B is important.
Table 3.3 Fractions of n that can be accommodated in a Suffix List and in hashing with open addressing. Note: n is the size of the search space to be memorized, and B is the number of bits available for storing the data. The columns denote the maximal portion of the state space that can be stored according to the information-theoretic bound, the Suffix List structure, and ordinary hashing for two practical values of n.

n/B     Upper Bound   Suffix Lists   Hashing (n = 2^20)   Hashing (n = 2^30)
1.05    33.2%         22.7%          4.3%                 2.9%
1.10    32.4%         21.2%          4.1%                 2.8%
1.25    24.3%         17.7%          3.6%                 2.4%
1.50    17.4%         13.4%          3.0%                 2.0%
2.00    11.0%         9.1%           2.3%                 1.5%
3.00    6.1%          5.3%           1.5%                 1.0%
4.00    4.1%          3.7%           1.1%                 0.7%
8.00    1.7%          1.5%           0.5%                 0.4%
16.00   0.7%          0.7%           0.3%                 0.2%
Hence, the Suffix List data structure can close the memory gap in search algorithms between the best possible approach and trivial approaches like hashing with open addressing.

3.2.5. Approximate Dictionaries

If we relax the requirements to a membership data structure, allowing it to store a slightly different key set than intended, new possibilities for space reduction arise.
The idea of erroneous dictionaries was first exploited by Bloom. A Bloom Filter is a bitvector v of length m, together with k independent hash functions h_1, …, h_k. Initially, v is set to zero. To insert a key x, compute h_i(x) for all i ∈ {1, …, k}, and set each bit v[h_i(x)] to one. To look up a key, check the status of v[h_1(x)]; if it is zero, x is not stored; otherwise, continue with v[h_2(x)], and so on. If all these bits are set, report that x is in the filter. However, since they might have been turned on by different keys, the filter can make false positive errors. Deletions are not supported by this data structure, but they can be incorporated by replacing the bits by counters that are incremented on insertions rather than just set to one.
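A Bloom Filter in a few lines of Python (the salted calls to the built-in hash simulate the k independent hash functions; names are ours):

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.v = [0] * m            # counters instead of bits would allow deletion

        def _addresses(self, x):
            return (hash((i, x)) % self.m for i in range(self.k))

        def insert(self, x):
            for i in self._addresses(x):
                self.v[i] = 1

        def lookup(self, x):
            # never a false negative; false positives become more likely
            # as the bitvector fills up
            return all(self.v[i] for i in self._addresses(x))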

Bit-State Hashing

For large problem spaces, it can be most efficient to apply a depth-first search strategy in combination with duplicate detection via a membership data structure. Bit-state hashing is a Bloom Filter storage technique that does not store the complete state vectors. If the problem contains 2^30 states or more (which implies a memory consumption of 1 GB times the state vector size in bytes), it resorts to approximate hashing. Obviously, the algorithm is then no longer guaranteed to find a shortest solution (or any solution at all, for that matter). As an illustration of the bit-state hashing idea, Figure 3.19, Figure 3.20, and Figure 3.21 depict the range of possible hash structures: usual hashing with chaining, single-bit hashing, and double-bit hashing.
B9780123725127000031/f03-19-9780123725127.jpg is missing
Figure 3.19
Ordinary hashing with chaining.
B9780123725127000031/f03-20-9780123725127.jpg is missing
Figure 3.20
Single bit-state hashing.
B9780123725127000031/f03-21-9780123725127.jpg is missing
Figure 3.21
Double bit-state hashing.
Let n be the number of reachable states and m be the maximal number of bits available. As a coarse approximation for single bit-state hashing with n ≪ m, the average probability P_1 of a false positive error during the course of the search is bounded by
P_1 ≤ (1/n) · Σ_{i=1}^{n} (i − 1)/m ≤ n/(2m),
since the i-th element collides with one of the i − 1 already inserted elements with a probability of at most (i − 1)/m, 1 ≤ i ≤ n. For multibit hashing using h (independent) hash functions with the assumption h · n ≪ m, the average probability of collision P_h is reduced to (1/n) · Σ_{i=1}^{n} (h(i − 1)/m)^h, since i elements occupy at most h · i addresses, 1 ≤ i ≤ n. In the special case of double bit-state hashing, this simplifies to
P_2 ≤ (1/n) · Σ_{i=1}^{n} (2(i − 1)/m)² ≈ 4n²/(3m²).
An attempt to remedy the incompleteness of partial search is to reinvoke the algorithm several times with different hash functions to improve the coverage of the search tree. This technique, called sequential hashing, successively examines various beams in the search tree (up to a certain threshold depth). In considerably large problems, sequential hashing succeeds in finding solutions (but often returns long paths). As a rough estimate on the error probability we take the following: If each hash function covers a fraction p of the search space, the probability that a state x is not generated in d independent runs is (1 − p)^d, such that x is reached with probability 1 − (1 − p)^d.
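The error estimates are straightforward to evaluate; a small sketch using the formulas as derived above:

    def p_single(n, m):
        # single bit-state hashing: P1 <= (1/n) sum_{i=1}^{n} (i-1)/m ~ n/(2m)
        return (n - 1) / (2 * m)

    def p_double(n, m):
        # double bit-state hashing: P2 ~ 4 n^2 / (3 m^2)
        return 4 * n * n / (3 * m * m)

    def sequential_coverage(p, d):
        # chance that a fixed state is reached in d runs, each covering a fraction p
        return 1 - (1 - p) ** d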

Hash Compaction

To increase the coverage of the search space, further lossy compression techniques have been considered in practice to best exploit the limited amount of memory. Like bit-state hashing, the hash compaction method aims at reducing the memory requirements for the state table. However, it stores a compressed state descriptor in a conventional hash table instead of setting bits corresponding to hash values of the state descriptor in a bitvector. The compression function c maps a state to a b-bit number in {0, …, 2^b − 1}. Since different states can have the same compression, false positive errors can arise. Note, however, that if the probe sequence and the compression are calculated independently of the state, the same compressed state can occur at different locations in the table.
In the analysis, we assume that breadth-first search with ordered hashing using open addressing is applied. Let the goal state s_d be located at depth d, and (s_0, s_1, …, s_d) be a shortest path to it.
It can be shown that the probability p_k of a false positive error, given that the table already contains k elements, is approximately equal to
(3.3)
p_k ≈ (m + 1)/(2^b · k) · (H_{m+1} − H_{m−k+1}),
where H_i = Σ_{j=1}^{i} 1/j denotes a harmonic number.
Let k_i be the number of states stored in the hash table after the algorithm has completely explored the nodes in level i. Then there were at most k_i states in the hash table when we tried to insert s_i. Hence, the probability P that no state on the solution path was omitted is bounded by
P ≥ Π_{i=0}^{d} (1 − p_{k_i}).
If the algorithm is run up to a maximum depth d, it can record the k_i values online and report this bound after termination.
To obtain an a priori estimate, knowledge of the depth of the search space and the distribution of the k_i is required. For a coarse approximation, we assume that the table fills up completely (m = n), that half of the states on the solution path experience an (almost) empty table during insertion, and that the other half experiences the table with only one empty slot. This models (crudely) the typically bell-shaped state distribution over the levels 0, …, d. Assuming further that the individual factors in Equation 3.3 are close enough to one to approximate the product by a sum, we obtain the approximation
1 − P ≈ (d/2) · (1/2^b) + (d/2) · (ln(m + 1)/2^b) = d · (1 + ln(m + 1))/2^{b+1}.
Assuming, more conservatively, only one empty slot for all states on the solution path would increase this estimate by a factor of two.

Collapse Compression

A related memory-saving strategy is collapse compression, which stores complex state vectors in an efficient way. The main idea is to store the different parts of the state vector in separate descriptors and to represent the actual state as a vector of indices into them. Collapse compression is based on the observation that although the number of distinct search states can become very large, the number of distinct parts of the state vector is usually much smaller. In contrast to jointly storing the leading part of the state vector, as in Suffix Lists, different parts of the state can be shared (across all the visited states that are stored). This is especially important when only small parts of the state vector change, and it avoids storing the complete vector every time a new state is visited.
Essentially, different components are stored in separate hash tables. Each entry in one of the tables is given a unique number. A whole state vector is then identified by a vector of numbers that refer to corresponding components in the hash tables. This greatly reduces the storage needs for storing the set of already explored states. An illustration of this technique is provided in Figure 3.22. Besides the memory capacity for the state components, collapse compression additionally needs an overall hash table to represent the combined state. The collapsed state vector consists of (hash) IDs for the individual components. Therefore, there is a gain only if the individual state components that are collapsed are themselves complex data structures. Collapse compression can be implemented lossless or lossy.
B9780123725127000031/f03-22-9780123725127.jpg is missing
Figure 3.22
Effect of collapse compression. On the left there is the search tree. On the right, the states have been partitioned in three parts, which are stored individually and jointly addressed with indices. When moving from a to b only the part in the middle of the state vector changes and has to be newly stored.
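The bookkeeping behind collapse compression amounts to interning each component once and storing states as tuples of component IDs; a lossless Python sketch (names are ours):

    class CollapseStore:
        def __init__(self, parts):
            self.tables = [{} for _ in range(parts)]    # one table per vector part
            self.visited = set()                        # collapsed state vectors

        def _intern(self, i, component):
            ids = self.tables[i]
            return ids.setdefault(component, len(ids))  # new components get fresh IDs

        def insert(self, state):
            # state: a tuple of hashable components; returns True if it is new
            key = tuple(self._intern(i, c) for i, c in enumerate(state))
            new = key not in self.visited
            self.visited.add(key)
            return new

The gain materializes when components repeat across many states: each distinct component is stored once, and every further occurrence costs only an integer ID.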

3.3. Subset Dictionaries

The problem of finding an element in a set of elements such that this element is a subset (or a superset) of the query occurs in many search applications; for example, in the matching of a large number of production rules, in the identification of inconsistent subgoals in AI planning, and in finding string completions for constructing or solving crossword puzzles. Moreover, efficiently storing and searching partial information is central to many learning processes. The problem also occurs in search applications that allow the user to search for documents containing a given set of words, and therefore extends the setting in the previous section.
For a state space search the stored sets often correspond to partially specified state vectors (patterns). As an example, consider the solitaire game Sokoban (see Ch. 1), together with a selection of dead-end patterns. Since a given state is unsolvable whenever a dead-end pattern is a subset of it, we want to quickly detect whether or not such a dead-end pattern is present in the data structure (see Ch. 10). We assume the patterns to be sets of elements from a certain universe Γ (the set of board coordinates in Sokoban).
Definition 3.1
(Subset and Containment Query Problem, Subset Dictionary) Let D be a set of n subsets over a universe Γ. The Subset Query (Containment Query) problem asks, for any query set q ⊆ Γ, if there is any p ∈ D with q ⊆ p (p ⊆ q, respectively).
A subset dictionary is an abstract data structure providing insertion of sets to D while supporting subset and containment queries.
Since p is a subset of q if and only if its complement is a superset of the complement of q, the two query problems are equivalent.
For the Sokoban problem, we have that each board position is an element of Γ. Inserting a pattern amounts to inserting a subset of Γ to the subset dictionary. Subsequently, determining whether or not a state matches a stored pattern is a containment query to the dictionary.
From an implementation point of view we may think of subset dictionaries as hash tables that contain generalized information about sets of problem states. But before diving into implementation issues, we draw another equivalence, which turns out to be essential.
Definition 3.2
(Partial Match) Let ∗ denote a special don't care character that matches every character contained in an alphabet Σ. Given a set D of n vectors over the alphabet Σ, the Partial Match problem asks for a data structure that, for any query q ∈ (Σ ∪ {∗})^m, detects if there is any entry p in D such that q matches p.
The application for this problem is to solve approximate matching problems in information retrieval. A sample application is a crossword puzzle dictionary. A query like B*T*R in the Crossword Puzzle would be answered with words like BETTER, BITTER, BUTLER, or BUTTER.
Theorem 3.6
(Equivalence Partial Match and Subset Query Problems) The Partial Match problem is equivalent to the Subset Query problem.
Proof
As we can adjust any algorithm for solving the Partial Match problem to handle arbitrary symbols by using a binary encoding, it is sufficient to consider the alphabet Σ = {0, 1}.
To reduce the Partial Match to the Subset Query problem, we replace each p ∈ D by the set of all pairs (i, p_i) for i ∈ {1, …, m}. Moreover, we replace each query q by the set of all pairs (i, q_i) for which q_i is not the don't care symbol ∗. Solving this instance of the Subset Query problem also solves the Partial Match problem.
To reduce the Subset Query to the Partial Match problem, we replace each database set by its characteristic vector, and the query set q by its characteristic vector with the zeros replaced by don't cares.
As the Subset Query is equivalent to the Containment Query problem, the latter one can also be solved by algorithms for the Partial Match. For the sake of simplicity, in the following data structures we restrict the alphabet for the Partial Match problem to {0, 1}.

3.3.1. Arrays and Lists

These problems have two straightforward solutions. The first approach is to store all answers to all possible queries in a (perfect) hash table or array of size 2^m, with m = |Γ|. Query time is O(m), needed to compute the hash address. For the Containment Query each hash table entry contains a list of the sets from the database corresponding to the query (state), which in turn is interpreted as a bitvector. Unfortunately, the memory requirements of this implementation are too large for most practical applications, because we have to reserve a table entry for every possible query (corresponding to the entire state space).
The list representation, with each list item containing one database entry, is the other extreme. The storage requirement of O(n) is optimal, but searching for a match now takes O(nm) time, a term that is also too big for practical applications. In the following, we propose compromises in between storing plain arrays and lists.

3.3.2. Tries

One possible implementation that immediately comes to mind is a Trie, which compares a query string to the set of stored entries. A Trie is a lexicographic search tree structure in which each node spawns at most |Σ| children. The transitions are labeled by symbols a ∈ Σ and are mutually exclusive for two successors of a node. Leaf nodes correspond to stored strings. A Trie is a natural and unique representation for a set of strings.
Since inserting and deleting strings in a Trie is simple, in Algorithm 3.38 we only consider the traversal of the tree for a search. For notational convenience, we consider the Partial Match problem as introduced earlier. The recursive procedure Lookup is initially invoked with the root of the Trie, the query q = (q_1, …, q_m) with q_i ∈ Σ ∪ {∗}, i ∈ {1, …, m}, and the level 1. The expected number of nodes examined has been estimated as O(n^{(m−s)/m}), where s is the number of indices specified in a query.
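A Trie over nested dictionaries suffices to illustrate the traversal; on a ∗ the search branches into all children, otherwise it follows the single matching transition (a Python sketch; the crossword query is adapted to six-letter words):

    def trie_insert(root, word):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})        # one child per character
        node[None] = True                         # leaf marker

    def partial_match(node, q, i=0):
        if i == len(q):
            return None in node
        if q[i] == '*':                           # don't care: follow every child
            return any(partial_match(child, q, i + 1)
                       for key, child in node.items() if key is not None)
        child = node.get(q[i])
        return child is not None and partial_match(child, q, i + 1)

    root = {}
    for w in ("BETTER", "BITTER", "BUTLER", "BUTTER"):
        trie_insert(root, w)
    assert partial_match(root, "B*TT*R")          # BETTER, BITTER, BUTTER match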

3.3.3. Hashing

An alternative reduction of the space complexity for the array representation is to hash the query sets to a smaller table. The lists in the chained hash table again correspond to database sets. However, the lists have to be searched to filter the elements that match.
A refined implementation of the array approach appropriate for Sokoban is to construct containers L_i of all patterns that share a ball at position i. In the pattern lookup for position u we test whether or not B9780123725127000031/si506.gif is missing is empty. Insertion and retrieval times depend on the sizes |L_i| and on the individual storage structures chosen for them (e.g., sorted lists, bitvectors, balanced trees).
B9780123725127000031/u03-38-9780123725127.jpg is missing
Algorithm 3.38.
Searching a Trie for a partial match.
B9780123725127000031/u03-39-9780123725127.jpg is missing
Algorithm 3.39.
Searching a hash table for a partial match.
Generalizing the idea for the Partial Match problem leads to the following hashing approach. Let h be the hash function mapping {0, 1}^m to the chained hash table. A record p is stored in the list L_j if and only if h(p) = j.
For mapping queries q we have to consider all elements of {0, 1}^m that are covered by q, and define h(q) as the union of all h(p) such that q matches p. The implementation of the Lookup procedure is shown in Algorithm 3.39.
The complexity of computing the set h(q) heavily depends on the chosen hash function h. For a balanced hash function, consider the partition of {0, 1}^m induced by h into the blocks h^{−1}(j). A hash function is balanced if |h^{−1}(j)| is equal to 2^m divided by the hash table size b for all j.
For large alphabets (as in the Crossword Puzzle problem) the hash table size b can be scaled to some value larger than 2^m and letters can be individually mapped. More precisely, we assume an auxiliary hash function h′ that maps each letter of Σ to a small number of bits. The address h(p) is determined by the concatenation h′(p_1) h′(p_2) ⋯ h′(p_m). A partial match query q would be answered by inspecting all table entries in h(q) that agree with q on its specified positions; for a one-bit function h′ these are 2^(m−s) entries, where s is the number of fixed bits.
For small alphabets (like the binary case) we have Σ = {0, 1}. One suitable approach is to extract the first ⌈lg b⌉ bits of each record as the hash table index. However, the worst-case behavior can be poor: If none of the specified bits occurs among the extracted positions, then every list must be searched.
To obtain hash functions that are good also in the worst case, they have to depend on every input character.
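The following Python sketch illustrates the concatenation hash with a one-bit per-letter function (h1 stands in for h′ and is our own choice): a query with s specified positions expands to 2^(m−s) candidate addresses.

```python
# Concatenated per-letter hashing for Partial Match -- a sketch. The
# auxiliary one-bit hash h1 is an assumption, not the book's function.

from itertools import product

def h1(c):
    return ord(c) & 1                  # one bit per letter (assumption)

def h(word):
    addr = 0
    for c in word:
        addr = (addr << 1) | h1(c)
    return addr

def candidate_addresses(query):
    # '*' = unspecified position: both bit values must be inspected.
    choices = [[h1(c)] if c != '*' else [0, 1] for c in query]
    addrs = []
    for bits in product(*choices):
        a = 0
        for bit in bits:
            a = (a << 1) | bit
        addrs.append(a)
    return addrs

table = {}
for w in ("abc", "adc", "abd"):
    table.setdefault(h(w), []).append(w)

hits = [w for a in candidate_addresses("a*c") for w in table.get(a, [])
        if all(q in ('*', c) for q, c in zip("a*c", w))]
assert sorted(hits) == ["abc", "adc"]
```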

3.3.4. Unlimited Branching Trees

A compromise between the Trie and the hash table subset dictionary is an ordered list of tries, called an Unlimited Branching Tree. Insertion is similar to an ordinary Trie insertion, with the exception that we maintain a distinctive root for the first element in the sorted representation of the set.
Figure 3.23 displays the Unlimited Branching Tree data structure during the insertion of {1, 2, 3, 4}, {1, 2, 4}, and {3, 4}. To the left of the figure, the first subset generates a new Unlimited Branching Tree. In the middle of the figure, we see that insertion can result in branching. The insertion executed to the right of the figure shows that a new Trie is inserted into the root list. The corresponding pseudo code is provided in Algorithm 3.40. The algorithm traverses the root list to detect whether or not a matching root element is present. If such an element exists, the ordinary insert routine for the corresponding Trie (not shown) is called. If there is no such element, a new one is constructed and added to the list.
[Figure 3.23: Evolution of an Unlimited Branching Tree (image not reproduced).]
The running time of the algorithm is O(k + l), where k is the size of the current Trie list and l is the size of the sorted set, plus the time O(l lg l) to sort the elements. As in our example it is often the case that all elements are selected from the set {1, …, n}, so that, using a linear-time bucket sort, the running time is O(n) altogether.
The data structure is designed to solve the Subset Query and the Containment Query problem. In Algorithm 3.41 we show a possible implementation for the latter. First, all root elements matching the query are retrieved. Then the corresponding tries are searched individually for a possible match with the query. As both the query and the stored set are sorted, a match is available in time linear in the size of the query set. The number of root elements that have to be processed can grow considerably and is bounded by the size |Γ| of the universe.
The worst-case running time of the algorithm is O(k ⋅ m), where k is the size of the current Trie list and m is the size of the query set, plus the time O(m lg m) to sort the elements. If all set elements have been selected from the set {1, …, n}, the worst-case running time is bounded by O(n ⋅ m).
[Algorithm 3.40: Inserting a set in an Unlimited Branching Tree (pseudo code not reproduced).]
[Algorithm 3.41: Searching for subsets in an Unlimited Branching Tree (pseudo code not reproduced).]
The matching efforts can be further reduced using a data structure from rule-based production systems. The so-called Rete algorithm exploits the fact that the firing of rules, or the playing of moves as in our case, changes only a few parts of the state, and that rules are structurally similar, meaning that the same subpattern can occur in multiple rules. The Rete algorithm uses a rooted acyclic directed graph, the Rete, where the nodes, with the exception of the root, represent patterns, and the edges represent dependencies (the relation ⊆ defined earlier can be directly mapped).

3.4. String Dictionaries

String dictionaries offer substring and superstring queries, and are a specialization of subset dictionaries, since substrings and superstrings are consecutive character vectors that do not include gaps. In the following, we study string dictionaries based on Suffix Trees.
A Suffix Tree is a compact trie representation of all suffixes of a given string. The substring information stored at each suffix node is simply given by the indices of the first and the last characters. In the following, the Suffix Tree data structure and its linear-time construction algorithm are explained in detail.
Inserting each suffix of a string m in a Trie yields a Suffix Trie. To avoid conflicts at terminal nodes, we append a special character $ to m. For notational convenience, in the following this tag is commonly interpreted as an integral part of m. An example of a Suffix Trie is shown in Figure 3.24. Each node in the Suffix Trie corresponds to exactly one unique substring of m. Unfortunately, the trie can consist of Θ(|m|²) nodes. Take, for example, strings of the form a^(n/2) b^(n/2). They include n²/4 + n different substrings (see Exercises).
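A naive Suffix Trie construction is only a few lines of Python (a sketch; nodes are plain dictionaries, and the quadratic size is accepted):

```python
# Naive Suffix Trie: insert every suffix of m$ into a trie. The size can
# grow quadratically in |m|, which motivates the compact Suffix Tree.

def suffix_trie(m):
    m = m + "$"
    root = {}
    for i in range(len(m)):
        node = root
        for c in m[i:]:
            node = node.setdefault(c, {})
    return root

def is_substring(trie, pattern):
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True

t = suffix_trie("11010")
assert is_substring(t, "101") and not is_substring(t, "00")
```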

3.4.1. Suffix Trees

A Suffix Tree (see Fig. 3.25) is a compact representation of a Suffix Trie in which each node with only one successor is merged with its parent. (Such a compressed Trie is sometimes referred to as a Patricia Trie, for practical algorithm to retrieve information coded in alphanumeric.) Each internal node in the Suffix Tree for m has more than one successor, and the tree has |m| leaves. As a consequence, it consumes at most O(|m|) space.
For efficient Suffix Tree construction, we need some definitions. A partial path is a consecutive sequence of edges starting at the root. A path is a partial path that ends at a leaf. The locus of a string α is the node at the end of the path of α (if it exists). An extension of a string α is each string that has α as a prefix. The extended locus of α is the locus of the shortest extension of α. The contracted locus of a string α is the locus of the longest prefix of α. The term suf_i refers to the suffix of m starting at position i, so that suf_1 = m. The string head_i is the longest prefix of suf_i that is also a prefix of suf_j for some j < i, and tail_i is defined by suf_i = head_i tail_i. As an example take m = 11010$; then suf_2 = 1010$, head_2 = 1, and tail_2 = 010$. A naive approach starts with the empty tree T_0 and inserts suf_{i+1} to construct T_{i+1} from T_i for an increasing value of i.
To generate a Suffix Tree efficiently, suffix links are helpful, where a suffix link points from the locus of aα, a ∈ Σ, α ∈ Σ*, to the locus of α. Suffix links are used as shortcuts during construction and search. We have that head_i is the longest prefix of suf_i that has an extended locus in T_{i−1}, since in T_i all suffixes suf_j, j ≤ i, already have a locus.
For inserting suf_{i+1}, tree T_{i+1} can be constructed from T_i as follows (see Fig. 3.26). First, we determine the extended locus of head_{i+1} in T_i, divide the last edge that leads to it into two new edges, and introduce a new node. Then, we create a new leaf for suf_{i+1}. For the given example string, Figure 3.27 depicts the modifications to transform T_2 into T_3.
[Figure 3.24: A Suffix Trie for the string 11010$ (image not reproduced).]
[Figure 3.25: A Suffix Tree for the same string with suffix links (image not reproduced).]
[Figure 3.26: Inserting suf_i (image not reproduced).]
[Figure 3.27: Insertion for the example string (image not reproduced).]
The algorithm takes a linear number of steps. If the extended locus of head_{i+1} in T_i is found, the extension of the tree can be accomplished in constant time. Algorithm 3.42 has two stages. First, it determines head_{i+1} in T_i in amortized constant time. Then, it sets another suffix link.
We observe that if head_i = aγ for a character a and a (possibly empty) string γ, then γ is a prefix of head_{i+1}: if head_i = aγ, then there is a j < i such that aγ is a prefix of both suf_i and suf_j, according to the definition of head_i. Hence, γ is a prefix of both suf_{i+1} and suf_{j+1}.
The loop invariants of the algorithm are (I1) all internal nodes in T_{i−1} have a correct suffix link in T_i, and (I2) during the construction of T_i the contracted locus of head_i in T_{i−1} is visited. The invariants certainly hold for i = 1. By (I2), the construction of T_{i+1} from T_i can start at the contracted locus of head_i in T_{i−1}. If head_i ≠ ε, we write head_i = a_i α_i β_i, where a_i is the first letter of head_i, α_i is the concatenation of the edge labels on the path to the contracted locus of head_i without a_i, and β_i is the remaining part. In this case, T_i can be visualized as shown in Figure 3.28.
[Figure 3.28: Partition of T_i (image not reproduced).]
[Algorithm 3.42: Constructing a Suffix Tree in linear time (pseudo code not reproduced).]
Based on the lemma, α_i β_i is a prefix of head_{i+1}. From the contracted locus v′ of head_i we already have a correct suffix link in T_i to a node u, according to (I1). To build the locus of head_{i+1} in T_i we start at u instead of at the root of T_i, as the naive approach would. In an actual implementation both stages would have to be interleaved.
Lemma 3.1
If the locus of α_i β_i in T_i does not exist, then head_{i+1} = α_i β_i.
Proof
Let v be the contracted locus and w be the extended locus of α_i β_i in T_i. Let the labeling of the edges on the path to v be equal to γ and let the label of the edge (v, w) be equal to δ = δ_1 δ_2 with γδ_1 = α_i β_i and δ_2 ≠ ε. Then all suffixes with prefix γδ_1 are contained in the subtree of T_i with root node w, and all suffixes in this subtree even have the prefix γδ_1 δ_2. Therefore, every suf_j, j ≤ i, with prefix α_i β_i also has the prefix α_i β_i δ_2. We have to show that head_{i+1} = α_i β_i; that is, that the common prefix cannot be extended.
Let suf_j be a suffix with prefix a_i α_i β_i. Then suf_{j+1} has prefix α_i β_i and, by the above, suf_{j+1} has the prefix α_i β_i δ_2. Since the locus of α_i β_i does not exist in T_i, the first letter a of δ_2 and the first letter b that follows α_i β_i in suf_{i+1} are different. Therefore, the prefix of suf_{i+1} is α_i β_i b and the prefix of suf_{j+1} is α_i β_i a, so that the longest common prefix is α_i β_i.
As an example, consider the construction of T_14 from T_13 by inserting suf_14, shown in Figure 3.29.
[Figure 3.29: Construction of T_14 from T_13 (image not reproduced).]
Theorem 3.7
(Time Complexity of Suffix Tree Construction) Algorithm 3.42 takes O(|m|) time to generate a Suffix Tree for m.
Proof
In every step a suffix of m is scanned and rescanned. We first analyze rescanning. Since α_i β_i is a prefix of head_{i+1}, at an edge we only have to test how many characters we have to skip in β_i. Consequently, we require constant time for each traversed edge, so that the total number of steps during rescanning is proportional to the number of traversed edges. At each edge e that is traversed while rescanning β_{i−1}, a nonempty label part δ of edge e is consumed and never rescanned again. Hence, with k_i denoting the number of rescanned edges in step i, we have Σ_i k_i ≤ |m|, so that rescanning takes O(|m|) time in total.
Next we analyze scanning. The number of scanned characters in step i equals |head_{i+1}| − |α_i β_i| + 1, and |α_i β_i| = |head_i| − 1 for head_i ≠ ε. Therefore, the total number of scanned characters is bounded by Σ_i (|head_{i+1}| − |head_i| + 2) ≤ |m| + 2|m| = O(|m|), since the sum telescopes.

3.4.2. Generalized Suffix Trees

A Generalized Suffix Tree is a string data structure appropriate for web search and for solving problems in computational biology. After introducing Generalized Suffix Trees we first consider the problem of updating the information to obtain optimal space performance even in a dynamic setting.
The efficient construction of a Suffix Tree can be extended naturally to more than one string by building the Suffix Tree of the concatenated string m_1 $_1 m_2 $_2 ⋯ m_k $_k with pairwise different terminal symbols $_i. It is not difficult to show (see Exercises) that this Suffix Tree is isomorphic to the compacted Trie for all suffixes of m_1 $_1 up to all suffixes of m_k $_k. Furthermore, the trees are identical except for the labels of the edges incident to the leaves. This fact allows us to insert and search a string in an existing Suffix Tree.
A straightforward deletion of strings causes problems, since each edge stored at the subsequent nodes includes substring interval information of some previously inserted strings. Therefore, the update procedure also has to update substring references in the tree. The solution to this nontrivial problem is based on maintaining an additional Inverted Trie. Let M be the set of strings in the Generalized Suffix Tree S and let T be the Trie that contains all inverted (reversed) strings. Then there is a bijection between the set of nodes in T and the set of leaf nodes in S: On one hand, each suffix of a string m_i corresponds to a leaf node; on the other hand, each suffix of m_i is a prefix of the inverted string and, therefore, corresponds to a node in T. Figure 3.30 shows a snapshot of inserting the string 11010$ in a Generalized Suffix Tree with an associated Inverted Trie. Nodes of the same index indicate the bijection.
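A naive (uncompressed) generalized suffix structure already illustrates the multi-string idea; the following Python sketch records at each node which strings contain the corresponding substring (deletion, the hard part addressed above via the Inverted Trie and twins, is omitted):

```python
# Naive generalized suffix trie over several strings -- a sketch. Each
# node stores the ids of the dictionary strings passing through it.

class GNode:
    def __init__(self):
        self.children = {}
        self.ids = set()          # strings containing this substring

def add_string(root, m, sid):
    for i in range(len(m)):
        node = root
        for c in m[i:]:
            node = node.children.setdefault(c, GNode())
            node.ids.add(sid)

def strings_containing(root, pattern):
    node = root
    for c in pattern:
        if c not in node.children:
            return set()
        node = node.children[c]
    return node.ids

root = GNode()
add_string(root, "11010", 0)
add_string(root, "0011", 1)
assert strings_containing(root, "101") == {0}
assert strings_containing(root, "01") == {0, 1}
```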
[Figure 3.30: Generalized Suffix Tree during insertion of 11010$ (image not reproduced).]
Given the associated Inverted Trie, it is possible to delete a string from the longest suffix to the shortest. As a consequence, in each step the suffix links are correct. The problem that is often not dealt with in the literature is that references into deleted strings may still be needed inside the generalized tree. The idea of the improvement is to extend the unique representation of the leaves given by T bottom-up to the internal nodes. Therefore, we invent twins that refer to the history of leaf generation. Figure 3.31 gives an example.
[Figure 3.31: Twin structure in the Generalized Suffix Tree (image not reproduced).]
As with the algorithm for constructing an ordinary Suffix Tree, the insertion process can be divided into a sequence of update operations. In the pseudo-code implementation of Algorithm 3.43 we assume a procedure to insert a suffix at an existing locus, and a procedure to split an existing edge. Deletion, as shown in Algorithm 3.44, is based on a subroutine for removing a leaf that is called for each removed node while deleting the inverted string in the Inverted Trie T. When removing a leaf, we access and adjust the string representation of a twin. The correctness argument is based on the following result.
[Algorithm 3.43: Insertion of one suffix in a Generalized Suffix Tree (pseudo code not reproduced).]
[Algorithm 3.44: Deleting a string in a Generalized Suffix Tree (pseudo code not reproduced).]
Lemma 3.2
Let Internal and Leaves be the sets of all internal and leaf nodes in the Generalized Suffix Tree. Let Succ(p) be the set of successors of p and Twin(p) be the set of twin successors of p. The following invariants are preserved:
(Ia) For all p ∈ Internal there is a q ∈ Twin(p), which has the same string representative as p.
(Ib) For all p ∈ Internal we have |Succ(p)| = |Twin(p)| + 1; that is, the number of ordinary successors is always one larger than the number of twin successors.
Proof
To prove the result we perform a case study:
Inserting a suffix at a given node: A newly inserted leaf extends both sets Succ(p) and Twin(p) of the existing node p by one element. The string representation of the leaf and of the existing node is set to the inserted string m. Therefore, the invariants remain satisfied.
Inserting a suffix between two nodes: In this case both newly generated nodes refer to m, so that we have (Ia). The internal node gets two successors and one twin successor (the new leaf node). Therefore, |Succ(p)| = |Twin(p)| + 1 and (Ib) holds.
Removing a leaf: Let q be the node to be deleted, s be its predecessor, and p be its twin predecessor. The algorithm considers two cases:
[s = p] Since q is removed from both Succ(s) and Twin(s), invariant (Ib) is satisfied. If s survives the deletion, then there exists a leaf r in Twin(s). Leaf r changes the string representation such that s no longer refers to the string representation of q. Therefore, we have (Ia) for node s. If, however, s is deleted for good, nothing has to be shown.
[s ≠ p] This case is tricky. If |Succ(s)| = 2, then s is deleted. Moreover, Twin(p) is updated such that |Twin(p)| remains unchanged. Otherwise, |Succ(s)| > 2. Using (Ib), at the time when q is deleted there are k + 1 successors and k twin successors of s. Consequently, besides q there is another twin successor r of s. This node is used to determine the string representation for p; that is, the representative of p is set to the one of r. We see that both invariants are maintained.
Hence, we can prove the following result.
Theorem 3.8
(Space Optimality of the Generalized Suffix Tree) Let S be a Generalized Suffix Tree after an arbitrary number of insert and delete operations, and let n be the maximal total number of characters of the strings in the dictionary M; that is, n = max_i Σ_{m ∈ M_i} |m|, where i denotes the operation step. The space requirements of S are bounded by O(n).
To find a substring of a given string m, we can determine the longest prefix h of the strings stored in the Generalized Suffix Tree that matches m starting at position i, 1 ≤ i ≤ |m|. Similarly, we can determine the longest substring h of the strings stored in the Generalized Suffix Tree that matches m ending at position i. In both cases we have to check if h is maximal; that is, if an accepting node that corresponds to a path for a full string in the dictionary has been reached.

3.5. Summary

The search algorithms discussed in the previous chapter need to keep track of the generated and expanded states. A*, for example, needs to be able to check whether a state is in the Open list; insert a state into the Open list with a given f-value; decrease the f-value of a state in the Open list; extract the state with the smallest f-value from the Open list; check whether a state is in the Closed list; insert a state into the Closed list; and perhaps delete a state from the Closed list. These operations need to be fast since they are typically performed a large number of times during each search. In this chapter, therefore, we discussed algorithms and data structures for implementing them.
The Open list is basically a priority queue. The type of the priorities (e.g., the f-values for A*) determines how the operations on the Open list can be implemented. If the priorities are floating-point values, then the operations can be implemented with heaps, including the advanced heap structures discussed in this chapter. A Heap is a complete binary tree that stores a state at every node so that the priority of the state at a node is always higher than the priorities of the states at the children of the node. A Fibonacci Heap, a Weak Heap, and a Weak Queue relax this requirement in different ways. If the priorities are integers, then the operations can also be implemented with buckets of fixed or exponentially increasing sizes (Radix Heap) or hierarchical structures of buckets, including the Van Emde Boas Priority Queue. Buckets consist of randomly accessible storage locations in a consecutive address range that are labeled with consecutive ranges of priorities, where each storage location stores the set of states whose priorities are in its range. Implementations that use buckets are usually faster than those that use heaps.
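For integer keys, a 1-Level Bucket queue needs only a few lines; the following Python sketch (our illustration, with a set per bucket and C the maximal edge weight) shows why DecreaseKey and Insert are O(1) while DeleteMin may scan O(C) buckets:

```python
# 1-Level Bucket priority queue for integer keys -- a sketch. Keys alive
# at any time span a window of width C+1, so C+1 cyclic buckets suffice.

class BucketQueue:
    def __init__(self, C):
        self.C = C
        self.buckets = [set() for _ in range(C + 1)]
        self.minkey = 0                          # lower end of the key window

    def insert(self, item, key):                 # O(1)
        self.buckets[key % (self.C + 1)].add(item)

    def decrease_key(self, item, old, new):      # O(1)
        self.buckets[old % (self.C + 1)].remove(item)
        self.buckets[new % (self.C + 1)].add(item)

    def delete_min(self):                        # O(C) bucket scan
        while not self.buckets[self.minkey % (self.C + 1)]:
            self.minkey += 1
        return self.buckets[self.minkey % (self.C + 1)].pop()

q = BucketQueue(C=5)
q.insert("a", 3); q.insert("b", 1); q.insert("c", 3)
assert q.delete_min() == "b"
```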
Table 3.4 gives an overview of the priority queue data structures introduced in this chapter. The complexities for integer-based methods are measured in the number of instructions. For generic weights we express complexities in the number of comparisons. The parameters are C = max edge weight, N = max key, n = number of nodes stored, and e = number of nodes visited. The star (*) denotes amortized costs.
Table 3.4 Priority queue data structures; top are bucket structures for integer keys, bottom are heap structures for general keys in a totally ordered set defined through a comparison function.
| Data Structure | DecreaseKey | DeleteMin | Insert | Dijkstra/A* |
| 1-Level Bucket (Algorithms 3.1, 3.2, 3.3, and 3.4) | O(1) | O(C) | O(1) | O(e + Cn) |
| 2-Level Bucket | O(1) | O(√C) | O(1) | O(e + n√C) |
| Radix Heap (Algorithms 3.5, 3.6, 3.7, and 3.8) | O(1)* | O(lg C)* | O(1)* | O(e + n lg C) |
| Emde Boas (Algorithms 3.9, 3.10, and 3.11) | O(lg lg N) | O(lg lg N) | O(lg lg N) | O((e + n) lg lg N) |
| Binary Search Tree | O(lg n) | O(lg n) | O(lg n) | O((e + n) lg n) |
| Binomial Queue | O(lg n) | O(lg n) | O(lg n) | O((e + n) lg n) |
| Heap (Algorithms 3.12, 3.13, and 3.14) | 2 lg n | 2 lg n | 2 lg n | O((e + n) lg n) |
| Weak Heap (Algorithms 3.17, 3.18, 3.19, 3.20, and 3.21) | lg n | lg n | lg n | O((e + n) lg n) |
| Pairing Heap (3.16) | O(2^(2√(lg lg n)))* | O(lg n)* | O(1) | O(e ⋅ 2^(2√(lg lg n)) + n lg n) |
| Fibonacci Heap (3.22) | O(1)* | O(lg n)* | O(1) | O(e + n lg n) |
| Rel. Weak Queue (3.23) | O(1) | O(lg n) | O(1) | O(e + n lg n) |
The Closed list is a simple set. The operations on it can, therefore, be implemented with bitvectors, lists, search trees, or hash tables. Bitvectors assign a bit to every state in a set. The bit is set to one if the state is in the set. They are a good choice if the percentage of states in the set (out of all states) is large. Lists simply represent all states in the set, perhaps storing compressed versions of states by representing similar parts of several states only once (Suffix Tree list). They are a good choice if the percentage of states in the set is small. The question then becomes how to test membership efficiently. To this end, lists are often represented as search trees or, more commonly since faster, hash tables rather than linked lists. Hash tables (hash dictionaries) consist of randomly accessible storage locations in a consecutive address range. Hashing maps each state to an address.
We discussed different hash functions. Perfect hashing (similar to bitvectors) maps every state to its own address. To insert a state into a hash table, we store the state at its address. To delete a state from the hash table, we remove the state from its address. To check whether a state is in the hash table, we compare the state in question against the state stored at its address. If and only if there is a state stored at its address and it matches the state in question, then the state in question is in the hash table. Perfect hashing is memory intensive. Regular hashing can map two states to the same address, which is called an address collision. Address collisions can be handled either via chaining or open addressing. Chaining resolves the conflict by storing all states in the hash table that map to the same address in a linked list and stores a pointer to the linked list at this address. Open addressing resolves the conflict by storing a state at a different address in either the same or a different hash table when some other state is already stored at its address. We discussed different ways of determining this other address, including using more than one hash table.
We also discussed how to increase the size of the hash table in case the number of successive address collisions becomes too large to find an empty address quickly. Regular hashing is less memory intensive than perfect hashing but can still be memory intensive. Approximate hashing saves memory by storing an insufficient amount of information to implement the membership test exactly. For example, it might store only a compressed version of the state in one or more hash tables. In the extreme case, it might only set a single bit to one in one or more hash tables to indicate that some state is stored at an address. In case of several hash tables, the state is considered stored if and only if all hash tables report that it is stored. Approximate hashing can make the mistake of determining that a state is in the Closed list even though it is not, which means that a search might not expand a state since it thinks it has expanded the state already, and thus might fail to find a path even if one exists.
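A tiny sketch of approximate (bit-state) hashing with two hash functions (our illustration; Python's built-in hash with two different salts stands in for two independent hash functions):

```python
# Bit-state hashing -- a sketch: only bits are stored, so membership can
# report false positives (lossy), but never false negatives.

class BitState:
    def __init__(self, size):
        self.bits = bytearray(size)

    def _addrs(self, state):
        # two "independent" hash functions, simulated by salting
        return (hash(("h1", state)) % len(self.bits),
                hash(("h2", state)) % len(self.bits))

    def insert(self, state):
        for a in self._addrs(state):
            self.bits[a] = 1

    def contains(self, state):
        # True only means "probably stored"; False is definite
        return all(self.bits[a] for a in self._addrs(state))

closed = BitState(1 << 16)
closed.insert((3, 1, 2))
assert closed.contains((3, 1, 2))
```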
Table 3.5 gives an overview of the different hash methods and their time complexity. We indicate whether states are stored in a compressed or ordinary way, and whether or not the hashing method is lossy.
Table 3.5 Hashing algorithms. With Y we denote the expected chain length, which depends on the current hash table load α. With p+(α) and p−(α) we denote the time complexities for successful and unsuccessful search based on the current hash table load α. More accurate values depend on the conflict resolution strategy used.
| Method | Insert | Lookup | Compressed | Lossy |
| Chaining (Algorithms 3.29, 3.30, and 3.31) | O(1) | O(Y) | – | – |
| Open addressing (Algorithms 3.32, 3.33, 3.34, and 3.35) | O(p−(α)) | O(p+(α)) | – | – |
| Suffix List hashing | O(lg n)* | O(lg n)* | yes | – |
| FKS hashing | O(1)* | O(1) | – | – |
| Cuckoo hashing (Algorithms 3.36, 3.37, and 3.38) | O(1)* | O(1) | – | – |
| Bit-state hashing | O(1) | O(1) | yes | yes |
| Hash compact | O(1) | O(1) | yes | yes |
Moreover, we have seen two storage structures for partial information. The subset dictionary stores partial states in the form of sets, and the substring dictionary stores partial paths in the form of substrings. In the first case, different implementations have been discussed to solve one of the equivalent problems, Subset Query, Containment Query, and Partial Match; for the second case, we have concentrated on the Suffix Tree data structure and its extensions for solving the Dynamic Dictionary Matching problem.

3.6. Exercises

3.1 Display the
1. 2-Level Bucket data structure (C = 80)
2. Radix Heap data structure
for the elements {28, 7, 69, 3, 24, 7, 72}.
3.2 A union-find data structure is a dictionary for maintaining partitions of a set. We may represent each interval by its right-most element, so that a partitioning of {1, …, n} into intervals is represented by the set of these right-most elements. Consider the data type that represents a partition of {1, …, n} into intervals with the operations:
Find (x) that returns the interval containing x.
Union (x) that unifies an interval with the immediately following one.
Split (x) that splits the interval T containing x into the two intervals T ∩ {1, …, x} and T ∩ {x + 1, …, n}.
1. How do the basic operations act on this set?
2. Use a Van Emde Boas Priority Queue to implement this strategy.
3.3 In a randomly filled array with n entries the minimal and maximal elements have to be found. For the sake of simplicity, you may assume n to be a power of 2.
1. Describe a divide-and-conquer algorithm that uses 3n/2 − 2 comparisons.
2. Use Weak Heaps to elegantly solve the problem with 3n/2 − 2 comparisons.
3. Show as a lower bound that 3n/2 − 2 comparisons are required to solve the problem.
4. Show a similar bound for searching the first- and second-smallest element.
3.4
1. Show that the path to index n in a Heap is determined by the binary representation of n.
2. Let f(n) be the number of Heaps with n pairwise different keys and let s_i be the size of the subtree rooted at i, 1 ≤ i ≤ n. Show that f(n) = n! / (s_1 ⋅ s_2 ⋯ s_n).
3.5 Merge two Heaps with n1 and n2 elements efficiently.
1. Assume that n1 and n2 are very different; for example, n1 is much larger than n2.
2. Assume that n1 and n2 are almost the same, say n1 = n2.
Provide the time complexities for both cases in big-oh notation.
3.6 Double-ended priority queues allow the insertion of elements and the deletion of both the minimum and the maximum element.
1. Transform a Heap/Weak Heap into its dual and estimate the number of required comparisons.
2. Show how to perform the transposition of elements in such a compare-exchange structure.
3.7 Transform a Weak Heap into a heap in-place with a small number of comparisons.
1. Study the base case of a Weak Heap of size 8.
2. Develop a recursive algorithm. For the ease of construction, you may assume n to be a power of 2.
3. Compare the number of comparisons with ordinary heap construction and top-down, bottom-up, and binary search sift-down.
3.8 In an initially empty Binomial Queue perform the following operations:
1. Insert(45), Insert(33), Insert(28), Insert(21), Insert(17),Insert(14)
2. Insert(9), Insert(6), Insert(5), Insert(1),DeleteMin
3. DecreaseKey(33,11), Delete(21), DecreaseKey(28,3),DeleteMin
Display the data structure for all intermediate results.
3.9 Consider an initially empty hash table with 11 entries. Insert the keys 16, 21, 15, 10, 5, 19, and 8 according to the following hash algorithms and display the table after the last insertion. Use the hash function h(k) = k mod 11 together with a second hash function h′(k) for double hashing.
1. Linear probing using s(j, k) = (h(k) + j) mod 11, quadratic probing using s(j, k) = (h(k) + j²) mod 11.
2. Double and ordered hashing, single and double bit-state hashing.
3.10 Let u = (u_1, …, u_k) be a state for an Atomix problem on a board of size n × n, where u_j denotes the position of atom j. We define its hash value as h(u) = (Σ_{j=1}^{k} u_j ⋅ (n²)^(j−1)) mod q for some prime q. Let v be an immediate successor of u that differs from its predecessor u only in the position of atom i.
1. Determine h (v) based on h (u) using incremental hashing.
2. Use a precomputed table with k ⋅ n² entries to accelerate the computation.
3. Avoid computationally expensive modulo operators by using addition/subtraction.
3.11 Dynamic incremental hashing considers hashing of state vectors of variable size.
1. How does the hash function h(u) change if:
• A value is added to/deleted at the end of the existing state vector of u?
• A value is added to/deleted at the beginning of the existing state vector of u?
For both cases devise a formula that can be computed in time O (1).
2. For the general case, where a value is changed somewhere in the state vector u = (u_1, …, u_k), compute the hash address in O(lg k) time.
3.12 In the Suffix List example of Figure 3.18, insert a further element and delete one of the stored elements.
3.13 Compute the
1. perfect hash value (rank) of a given permutation of size 15,
2. permutation (of size 15) for a given rank,
according to the lexicographic ordering and according to the one of Myrvold and Ruskey.
3.14 Devise two hash functions and a sequence of insertions that lead to an infinite cuckoo process.
3.15 Let n ≤ N and let N be a power of 2. A Navigation Pile is a complete binary tree with 2N − 1 nodes. The first n leaves store one element each and the remaining leaves are empty. Interior nodes (branches) contain links to the leaf nodes in the form of binary-encoded relative index information. For each branch the leaf is addressed that contains the smallest element of all elements stored in its leaf sequence.
The representation of a Navigation Pile consists of two sequences: A for the elements and B for the navigation information, pointing to the elements in A. An example Navigation Pile of size 14 and capacity 16 is shown in Figure 3.32. The parent/child relationship is shown with dotted arrows and the navigation information with solid arrows.
[Figure 3.32: Example of a Navigation Pile (image not reproduced).]
1. Show that all navigation information can be stored with O(N) bits.
2. Argue that the following operations can be supported in constant time: depth, height, parent, first-leaf, last-leaf, first-child, second-child, root, is-root, and ancestor.
3. Show that bottom-up construction of a Navigation Pile requires at most n − 1 comparisons.
4. Show how to implement Insert with at most lg n + O(1) comparisons and one element move (given that O(lg n) additional instructions are allowed).
3.16 Draw a Suffix Tree for 10100100110001100$ including the suffix links.
3.17 Consider a text t.
1. Show how to report all substrings in t in between two given strings with respect to their lexicographic ordering. For example, ACCGTA is in between ACA and ACCT.
2. Devise an efficient algorithm to find the longest substring that occurs at least twice in t.
3.18 There are n(n + 1)/2 substrings of a string T of size n. Some of the substrings are identical.
1. Show that T = a^(n/2) b^(n/2) has n²/4 + n different substrings.
2. Show how to print all different substrings in time proportional to their total length.
3.19 Let D be a set of k strings.
1. Devise an efficient algorithm to determine for each string in D if it is a substring of another string in D.
2. Devise an algorithm that computes the longest common substring for all pairs of strings in D. The running time should be O(d ⋅ k), where d is the sum of the sizes of the strings in D.
3. Let D = {m_1, …, m_k}. Devise an algorithm that computes the longest common prefix for all pairs of strings in D. The running time should be O(d + p), where p is the number of pairs of which the common prefix is not empty.
3.20 Consider a (very long) text T = t_1 … t_n over the alphabet Σ to be searched for a maximal pattern P = t_i t_{i+1} … t_j such that the reversal P^R is also a pattern in T. For example, in T = 10000111100101111000001 the pair P = 000011110 and P^R = 011110000 is maximal. Describe an efficient algorithm to solve the problem and provide its time and space complexities.

3.7. Bibliographic Notes

Dial (1969) has invented the 1-Level Bucket priority queue data structure. Its variants have been studied by Ahuja, Magnanti, and Orlin (1989). The 2-level architecture can be further refined to an arbitrary number k of levels, with k arrays of size O(C^(1/k)) each. Space and time can be improved accordingly, but the implementation becomes quite involved. Two-layered Radix Heap data structures improve the bound for DeleteMin to O(lg C / lg lg C), and a hybrid with a Fibonacci Heap yields an O(e + n √(lg C)) time algorithm. Alternative priority queue data structures based on integer keys have been studied by van Emde Boas, Kaas, and Zijlstra (1977). Dementiev, Kettner, Mehnert, and Sanders (2004) have provided cache-efficient implementations.
Fredman and Tarjan (1987) have given the amortized analysis for the Fibonacci Heap, which supports Insert and DecreaseKey in amortized constant time and DeleteMin in amortized logarithmic time. Cherkassky, Goldberg, and Silverstein (1997b) compare different priority queue implementations and provide an efficient shortest path library (Cherkassky, Goldberg, and Radzik, 1997a). Many priority queue implementations have been integrated in LEDA by Mehlhorn and Näher (1999).
The Weak Heap data structure has been introduced by Dutton (1993) and analyzed in detail by Edelkamp and Wegener (2000). Edelkamp and Stiegeler (2002) have implemented a sorting index based on Weak Heapsort with n lg n − 0.9n comparisons in the worst case and an in-place Quicksort variant with n lg n + 0.2n comparisons on average. The latter approach is based on replacing the original Heapsort with Weak Heapsort in the Quick-Heapsort hybrid originally proposed by Cantone and Cincotti (2002).
Minimizing the number of moves has been considered by Munro and Raman (1996). The Navigation Pile data structure has been introduced by Katajainen and Vitale (2003). It has been applied to sorting, yielding an algorithm with O(n lg n) comparisons, O(n) element moves, and O(n lg n) further instructions. Independently, Franceschini and Geffert (2003) devised a sorting algorithm with O(n) moves and O(n lg n) comparisons. Other double-ended priority queue structures are Min-Max Heaps proposed by Atkinson, Sack, Santoro, and Strothotte (1986), Deaps by Carlsson (1987), and Interval Heaps by van Leeuwen and Wood (1993).
Thorup (1999) has shown that for integer weights in undirected graphs a deterministic linear-time shortest path algorithm can be devised. It bypasses the requirement for extracting the minimum element; the priority queue is substituted by a growing Component Tree. However, the algorithm is quite involved and rather of theoretical interest, since its underlying data structure, an Atomic Heap, pays off only for astronomically large inputs. Thorup (2000) has studied RAM priority queues. For a random access machine with arbitrary word size a priority queue is obtained supporting Insert, Delete, and DeleteMin operations in doubly logarithmic worst-case time, which improves on the bounds for a hybrid Radix Heap. Relaxed Weak Queues (a.k.a. Run-Relaxed Weak Queues) by Elmasry, Jensen, and Katajainen (2005) are binary tree variants of the run-relaxed heaps invented by Driscoll, Gabow, Shrairman, and Tarjan (1988) and implement a worst-case efficient priority queue (with constant-time efficiencies for Insert and DecreaseKey and logarithmic time for Delete and DeleteMin). Other structures achieving this performance are Brodal Heaps by Brodal (1996) and Fat Heaps by Kaplan, Shafrir, and Tarjan (2002). By sacrificing worst-case for average-case performance, Rank-Relaxed Weak Queues achieve a better practical efficiency. Another promising competitor in this class are Violation Heaps proposed by Elmasry (2010). Policy-based benchmarking has been done by Bruun, Edelkamp, Katajainen, and Rasmussen (2010).
Theoretical advances reducing the number of comparisons for deleting the minimum toward the optimal lg n + O(1) have been discussed by Elmasry, Jensen, and Katajainen (2008c) and Elmasry, Jensen, and Katajainen (2008b). Pairing Heaps have been suggested by Fredman, Sedgewick, Sleator, and Tarjan (1986). A refined implementation has been suggested by Stasko and Vitter (1987). A transformational approach and survey on efficient double-ended priority queues has been provided by Elmasry, Jensen, and Katajainen (2008a).
Hashing is fundamental to state space search and, by the need for good distribution functions, links to the generation of pseudo-random numbers. One generator has been proposed by Lehmer (1949) and its improvement has been suggested by Schrage (1979). The distribution and the selection of good random numbers have been analyzed by Park and Miller (1988).
Karp and Rabin (1987) have suggested incremental hashing for string search. A related incremental hashing for game playing has been introduced by Zobrist (1970). Its application to state space search and multiple pattern databases has been proposed by Mehler and Edelkamp (2005). For dynamic state vectors, incremental hashing has been extended by Mehler and Edelkamp (2006). Recursive hashing has been introduced by Cohen (1997) and most prominently implemented in the software model checker SPIN (Holzmann, 2004). Gains for incremental recursive hashing in SPIN are documented by Nguyen and Ruys (2008). In this context universal hashing has been shown to have advantages by Eckerle and Lais (1998). In experiments the authors have shown that the ideal circumstances for error prediction in sequential hashing are not found in practice and refine the model for coverage prediction to match the observation. Bit-state hashing has been adopted in the protocol validator SPIN, which parses the expressive concurrent Promela protocol specification language (Holzmann, 1998). Hash compaction has been contributed by Stern and Dill (1996) and collapse compression has been implemented by Holzmann (1997) and Lerda and Visser (2001).
The Bloom Filter has been invented by Bloom (1970) and been used in the web context by Marais and Bharat (1997) as a mechanism for identifying which pages have associated comments stored. Holzmann and Puri (1999) have suggested a finite-state machine description that shares similarities with binary decision diagrams. In work by Geldenhuys and Valmari (2003) the practical performance ratio has been shown to be close to the information theoretical bound for some protocols like the Dining Philosophers. As with the suffix list by Edelkamp and Meyer (2001) the construction aims at redundancies in the state vector. Similar ideas have been considered by Choueka, Fraenkel, Klein, and Segal (1986), but the data structure there is static and not theoretically analyzed. Another dynamic variant achieving asymptotically equivalent storage bounds is sketched in Brodnik and Munro (1999). Constants are only given for two static examples. Comparing with the numbers of Brodnik, a dynamic Suffix Tree list can host up to five times more elements of the same value range. However, the data structure of Brodnik provides constant access time.
Ranking permutations in linear time is due to Myrvold and Ruskey (2001). Korf and Schultze (2005) have used precomputed lookup tables to compute lexicographic ranks, and Bonet (2008) has discussed different time-space trade-offs. Mares and Straka (2007) proposed a linear-time algorithm for lexicographic ranking, which relies on constant-time bitvector manipulations.
FKS hashing is due to Fredman, Komlós, and Szemerédi (1984). Dietzfelbinger, Karlin, Mehlhorn, Meyer auf der Heide, Rohnert, and Tarjan (1994) have devised the first dynamic version of it, yielding a worst-case constant-time hashing algorithm. Östlin and Pagh (2003) have shown that the space complexity for dynamic perfect hashing can be greatly reduced, and Fotakis, Pagh, and Sanders (2003) have studied how to further reduce the space complexity to the information-theoretic minimum. Practical perfect hashing has been analyzed by Botelho, Pagh, and Ziviani (2007) and an external memory perfect hash function variant has been given by Botelho and Ziviani (2007). The complexity is bounded by the need to sort all elements by their hash value in the partitioning step. Minimum perfect hash functions can be stored with less than four bits per item. With cuckoo hashing, Pagh and Rodler (2001) have devised a further practical and theoretically worst-case optimal hashing algorithm.
The Suffix Tree data structure is of widespread use in the context of web search (Stephen, 1994) and computational biology (Gusfield, 1997). The linear-time construction algorithm is due to McCreight (1976). Dynamic Dictionary Matching problems have been proposed and solved by Amir, Farach, Galil, Giancarlo, and Park (1994) and Amir, Farach, Idury, Poutré, and Schäffer (1995). The optimal space bound for arbitrary deletions has been proven by Edelkamp (1998b). Another class of string search algorithms based on bit manipulation has been contributed by Baeza-Yates and Gonnet (1992).
Probably the first nontrivial result for the Partial Match problem has been obtained by Rivest (1976). He showed that the 2^m space of the exhaustive storage solution can be improved. New algorithms for subset queries and partial matching have been provided by Charikar, Indyk, and Panigrahy (2002), studying two algorithms with different trade-offs. The Rete algorithm is due to Forgy (1982). The related Two-Dimensional Pattern String Matching problem has been studied intensively in the literature, for example, by Fredriksson, Navarro, and Ukkonen (2005). Hoffmann and Koehler (1999) have suggested Unlimited Branching Trees.