
Cache-Oblivious Data Structures*

Lars Arge

University of Aarhus

Gerth Stølting Brodal

University of Aarhus

Rolf Fagerberg

University of Southern Denmark

35.1 The Cache-Oblivious Model

35.2 Fundamental Primitives

Van Emde Boas Layout • k-Merger

35.3 Dynamic B-Trees

Density Based • Exponential Tree Based

35.4 Priority Queues

Merge Based Priority Queue: Funnel Heap • Exponential Level Based Priority Queue

35.5 2d Orthogonal Range Searching

Cache-Oblivious kd-Tree • Cache-Oblivious Range Tree

Acknowledgments

References

35.1 The Cache-Oblivious Model

The memory system of most modern computers consists of a hierarchy of memory levels, with each level acting as a cache for the next; for a typical desktop computer the hierarchy consists of registers, level 1 cache, level 2 cache, level 3 cache, main memory, and disk. One of the essential characteristics of the hierarchy is that the memory levels get larger and slower the further they get from the processor, with the access time increasing most dramatically between main memory and disk. Another characteristic is that data is moved between levels in large blocks. As a consequence of this, the memory access pattern of an algorithm has a major influence on its practical running time. Unfortunately, the RAM model (Figure 35.1) traditionally used to design and analyze algorithms is not capable of capturing this, since it assumes that all memory accesses take equal time.

Figure 35.1 The RAM model.

Because of the shortcomings of the RAM model, a number of more realistic models have been proposed in recent years. The most successful of these models is the simple two-level I/O-model introduced by Aggarwal and Vitter [1] (Figure 35.2). In this model the memory hierarchy is assumed to consist of a fast memory of size M and a slower infinite memory, and data is transferred between the levels in blocks of B consecutive elements. Computation can only be performed on data in the fast memory, and it is assumed that algorithms have complete control over transfers of blocks between the two levels. We denote such a transfer a memory transfer. The complexity measure is the number of memory transfers needed to solve a problem. The strength of the I/O-model is that it captures part of the memory hierarchy, while being sufficiently simple to make design and analysis of algorithms feasible. In particular, it adequately models the situation where the memory transfers between two levels of the memory hierarchy dominate the running time, which is often the case when the size of the data exceeds the size of main memory. Aggarwal and Vitter showed that comparison based sorting and searching require Θ(SortM,B(N)) = Θ((N/B) logM/B (N/B)) and Θ(logB N) memory transfers in the I/O-model, respectively [1]. Subsequently a large number of other results have been obtained in the model; see the surveys by Arge [2] and Vitter [3] for references. Also see Chapter 28.

Figure 35.2 The I/O model.

More elaborate models of multi-level memory than the I/O-model have been proposed (see, e.g., [3] for an overview) but these models have been less successful, mainly because of their complexity. A major shortcoming of the proposed models, including the I/O-model, has also been that they assume that the characteristics of the memory hierarchy (the level and block sizes) are known. Very recently, however, the cache-oblivious model, which assumes no knowledge about the hierarchy, was introduced by Frigo et al. [4]. In essence, a cache-oblivious algorithm is an algorithm formulated in the RAM model but analyzed in the I/O model, with the analysis required to hold for any B and M. Memory transfers are assumed to be performed by an off-line optimal replacement strategy. The beauty of the cache-oblivious model is that since the I/O-model analysis holds for any block and memory size, it holds for all levels of a multi-level memory hierarchy (see [4] for details). In other words, by optimizing an algorithm to one unknown level of the memory hierarchy, it is optimized on all levels simultaneously. Thus the cache-oblivious model is effectively a way of modeling a complicated multi-level memory hierarchy using the simple two-level I/O-model.

Frigo et al. [4] described optimal Θ(SortM,B(N)) memory transfer cache-oblivious algorithms for matrix transposition, fast Fourier transform, and sorting; Prokop also described a static search tree obtaining the optimal O(logB N) transfer search bound [5]. Subsequently, Bender et al. [6] described a cache-oblivious dynamic search tree with the same search cost, and simpler and improved cache-oblivious dynamic search trees were then developed by several authors [7–10]. Cache-oblivious algorithms have also been developed for, for example, problems in computational geometry [7,11,12], for scanning dynamic sets [7], for layout of static trees [13], for partial persistence [7], and for a number of fundamental graph problems [14] using cache-oblivious priority queues [14,15]. Most of these results make the so-called tall cache assumption, that is, they assume that M = Ω(B^2); we make the same assumption throughout this chapter.

Empirical investigations of the practical efficiency of cache-oblivious algorithms for sorting [16], searching [9,10,17] and matrix problems [4] have also been performed. The overall conclusion of these investigations is that cache-oblivious methods often outperform RAM algorithms, but not always as much as algorithms tuned to the specific memory hierarchy and problem size. On the other hand, cache-oblivious algorithms perform well on all levels of the memory hierarchy, and seem to be more robust to changing problem sizes than cache-aware algorithms.

In the rest of this chapter we describe some of the most fundamental and representative cache-oblivious data structure results. In Section 35.2 we discuss two fundamental primitives used to design cache-oblivious data structures. In Section 35.3 we describe two cache-oblivious dynamic search trees, and in Section 35.4 two priority queues. Finally, in Section 35.5 we discuss structures for 2-dimensional orthogonal range searching.

35.2 Fundamental Primitives

The most fundamental cache-oblivious primitive is scanning—scanning an array with N elements incurs Θ(N/B) memory transfers for any value of B. Thus algorithms such as median finding and data structures such as stacks and queues that only rely on scanning are automatically cache-oblivious. In fact, the examples above are optimal in the cache-oblivious model. Other examples of algorithms that only rely on scanning include Quicksort and Mergesort. However, they are not asymptotically optimal in the cache-oblivious model since they use O((N/B) log(N/M)) memory transfers rather than Θ(SortM,B(N)).

Apart from algorithms and data structures that only utilize scanning, most cache-oblivious results use recursion to obtain efficiency; in almost all cases, the sizes of the recursive problems decrease double-exponentially. In this section we describe two of the most fundamental such recursive schemes, namely the van Emde Boas layout and the k-merger.

35.2.1 Van Emde Boas Layout

One of the most fundamental data structures in the I/O-model is the B-tree [18]. A B-tree is basically a fanout Θ(B) tree with all leaves on the same level. Since it has height O(logB N) and each node can be accessed in O(1) memory transfers, it supports searches in O(logB N) memory transfers. It also supports range queries, that is, the reporting of all K elements in a given query range, in O(logB N + K/B) memory transfers. Since B is an integral part of the definition of the structure, it seems challenging to develop a cache-oblivious B-tree structure. However, Prokop [5] showed how a binary tree can be laid out in memory in order to obtain a (static) cache-oblivious version of a B-tree. The main idea is to use a recursively defined layout called the van Emde Boas layout closely related to the definition of a van Emde Boas tree [19]. The layout has been used as a basic building block of most cache-oblivious search structures (e.g., in [6–11,13]).

35.2.1.1 Layout

For simplicity, we only consider complete binary trees. A binary tree is complete if it has N = 2^h − 1 nodes and height h for some integer h. The basic idea in the van Emde Boas layout of a complete binary tree T with N leaves is to divide T at the middle level and lay out the pieces recursively (Figure 35.3). More precisely, if T only has one node it is simply laid out as a single node in memory. Otherwise, we define the top tree T0 to be the subtree consisting of the nodes in the topmost ⌊h/2⌋ levels of T, and the bottom trees T1, …, Tk to be the Θ(√N) subtrees rooted in the nodes on level ⌈h/2⌉ of T; note that all the subtrees have size Θ(√N). The van Emde Boas layout of T consists of the van Emde Boas layout of T0 followed by the van Emde Boas layouts of T1, …, Tk.

Figure 35.3 The van Emde Boas layout.
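For concreteness, the recursive layout is easily expressed in code. The following sketch (with a (depth, index) node naming invented purely for illustration) emits the van Emde Boas memory order of the nodes of a complete tree of a given height.

def veb_layout(h):
    """Return the nodes of a complete binary tree of height h in van Emde
    Boas order. Nodes are identified as (depth, index within depth)."""
    def rec(depth, index, height):
        if height == 1:
            return [(depth, index)]
        top = height // 2                  # the topmost floor(height/2) levels
        bottom = height - top              # height of the bottom trees
        order = rec(depth, index, top)     # top tree T0 first ...
        for i in range(2 ** top):          # ... then the bottom trees, left to right
            order += rec(depth + top, index * 2 ** top + i, bottom)
        return order
    return rec(0, 0, h)

# Example: a height-3 tree (7 nodes). The root is followed by its two
# children (the top tree), then the four bottom trees of height 1.
print(veb_layout(3))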

35.2.1.2 Search

To analyze the number of memory transfers needed to perform a search in T, that is, traverse a root-leaf path, we consider the first recursive level of the van Emde Boas layout where the subtrees are smaller than B. At this level T is divided into a set of base trees of size between Θ(√B) and Θ(B), that is, of height Ω(log B) (Figure 35.4). By the definition of the layout, each base tree is stored in O(B) contiguous memory locations and can thus be accessed in O(1) memory transfers. That the search is performed in O(logB N) memory transfers then follows since the search path traverses O((log N)/log B) = O(logB N) different base trees.

Figure 35.4 A search path.

35.2.1.3 Range query

To analyze the number of memory transfers needed to answer a range query [x1, x2] on T using the standard recursive algorithm that traverses the relevant parts of T (starting at the root), we first note that the two paths to x1 and x2 are traversed in O(logB N) memory transfers. Next we consider traversed nodes v that are not on the two paths to x1 and x2. Since all elements in the subtree Tv rooted at such a node v are reported, and since a subtree of height log B stores Θ(B) elements, O(K/B) subtrees Tv of height log B are visited. This in turn means that the number of visited nodes above the last log B levels of T is also O(K/B); thus they can all be accessed in O(K/B) memory transfers. Consider the smallest recursive level of the van Emde Boas layout that completely contains Tv. This level is of size between Ω(B) and O(B^2) (Figure 35.5a). On the next level of recursion Tv is broken into a top part and O(√B) bottom parts of size between Ω(√B) and O(B) each (Figure 35.5b). The top part is contained in a recursive level of size O(B) and is thus stored within O(B) consecutive memory locations; therefore it can be accessed in O(1) memory transfers. Similarly, the O(B) nodes in the O(√B) bottom parts are stored consecutively in memory; therefore they can all be accessed in a total of O(1) memory transfers. Therefore, the optimal paging strategy can ensure that any traversal of Tv is performed in O(1) memory transfers, simply by accessing the relevant O(1) blocks. Thus overall a range query is performed in O(logB N + K/B) memory transfers.

Figure 35.5 Traversing tree Tv with O(B) leaves; (a) smallest recursive van Emde Boas level containing Tv has size between Ω(B) and O(B^2); (b) next level in recursive subdivision.

Theorem 35.1

Let T be a complete binary tree with N leaves laid out using the van Emde Boas layout. The number of memory transfers needed to perform a search (traverse a root-to-leaf path) and a range query in T is O(logB N) and O(logB N + K/B), respectively.

Note that the navigation from node to node in the van Emde Boas layout is straightforward if the tree is implemented using pointers. However, navigation using arithmetic on array indexes is also possible [9]. This avoids the use of pointers and hence saves space.

The constant in the O(logB N) bound for searching in Theorem 35.1 can be seen to be four. Further investigations of which constants are possible for cache-oblivious comparison based searching appear in [20].

35.2.2 k-Merger

In the I/O-model the two basic optimal sorting algorithms are multi-way versions of Mergesort and distribution sorting (Quicksort) [1]. Similarly, Frigo et al. [4] showed how both merge based and distribution based optimal cache-oblivious sorting algorithms can be developed. The merging based algorithm, Funnelsort, is based on a so-called k-merger. This structure has been used as a basic building block in several cache-oblivious algorithms. Here we describe a simplified version of the k-merger due to Brodal and Fagerberg [12].

35.2.2.1 Binary mergers and merge trees

A binary merger merges two sorted input streams into a sorted output stream: In one merge step an element is moved from the head of one of the input streams to the tail of the output stream; the heads of the input streams, as well as the tail of the output stream, reside in buffers of a limited capacity.

Binary mergers can be combined to form binary merge trees by letting the output buffer of one merger be the input buffer of another—in other words, a binary merge tree is a binary tree with mergers at the nodes and buffers at the edges, and it is used to merge a set of sorted input streams (at the leaves) into one sorted output stream (at the root). Refer to Figure 35.6 for an example.

Figure 35.6 A 16-merger consisting of 15 binary mergers. Shaded parts represent elements in buffers.

An invocation of a binary merger in a binary merge tree is a recursive procedure that performs merge steps until the output buffer is full (or both input streams are exhausted); if an input buffer becomes empty during the invocation (and the corresponding stream is not exhausted), the input buffer is recursively filled by an invocation of the merger having this buffer as output buffer. If both input streams of a merger get exhausted, the corresponding output stream is marked as exhausted. A procedure FILL(v) performing an invocation of a binary merger v is shown in Figure 35.7 (ignoring exhaustion issues). A single invocation FILL(r) on the root r of a merge tree will merge the streams at the leaves of the tree.

Figure 35.7 Invocation of binary merger v.
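Figure 35.7 gives the invocation procedure in pseudocode; the following Python sketch is one possible rendering of it. The Stream and Merger classes are illustration-level scaffolding, and a minimal exhaustion rule (absent from the figure, as noted above) is added so that the example runs to completion.

from collections import deque

class Stream:
    """An edge buffer of bounded capacity; source is the merger filling it
    (None for the input streams at the leaves)."""
    def __init__(self, capacity, items=(), source=None):
        self.buf = deque(items)
        self.capacity = capacity
        self.source = source

class Merger:
    """A binary merger: two input streams and one output stream."""
    def __init__(self, left, right, out_capacity):
        self.left, self.right = left, right
        self.done = False                    # set once both inputs are exhausted
        self.out = Stream(out_capacity, source=self)

def exhausted(s):
    return not s.buf and (s.source is None or s.source.done)

def fill(v):
    """FILL(v): perform merge steps until v's output buffer is full or both
    input streams are exhausted; an empty, non-exhausted input buffer is
    refilled by a recursive invocation of the merger having it as output."""
    while len(v.out.buf) < v.out.capacity:
        for s in (v.left, v.right):
            if not s.buf and not exhausted(s):
                fill(s.source)
        if not v.left.buf and not v.right.buf:
            v.done = True                    # both input streams exhausted
            return
        if v.left.buf and (not v.right.buf or v.left.buf[0] <= v.right.buf[0]):
            v.out.buf.append(v.left.buf.popleft())
        else:
            v.out.buf.append(v.right.buf.popleft())

# A 4-merger built from three binary mergers, merging four sorted runs.
runs = ([1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16])
leaves = [Stream(capacity=4, items=r) for r in runs]
left = Merger(leaves[0], leaves[1], out_capacity=4)
right = Merger(leaves[2], leaves[3], out_capacity=4)
root = Merger(left.out, right.out, out_capacity=16)
fill(root)
print(list(root.out.buf))    # [1, 2, 3, ..., 16]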

35.2.2.2 k-merger

A k-merger is a binary merge tree with specific buffer sizes. For simplicity, we assume that k is a power of two, in which case a k-merger is a complete binary tree of k − 1 binary mergers. The output buffer at the root has size k^3, and the sizes of the rest of the buffers are defined recursively in a manner resembling the definition of the van Emde Boas layout: Let i = log k be the height of the k-merger. We define the top tree to be the subtree consisting of all mergers of depth at most ⌈i/2⌉, and the bottom trees to be the subtrees rooted in nodes at depth ⌈i/2⌉ + 1. We let the edges between the top and bottom trees have buffers of size k^{3/2}, and define the sizes of the remaining buffers by recursion on the top and bottom trees. The input buffers at the leaves hold the input streams and are not part of the k-merger definition. The space required by a k-merger, excluding the output buffer at the root, is given by S(k) = k^{1/2} · k^{3/2} + (k^{1/2} + 1) · S(k^{1/2}), which has the solution S(k) = Θ(k^2).
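The Θ(k^2) solution can also be checked numerically. The following sketch evaluates the recurrence at k = 2^{2^j} (so that square roots stay integral), with the base case S(2) = 1 as an arbitrary stand-in for the Θ(1) space of a single binary merger.

import math

def merger_space(k):
    """S(k) = k^{1/2} * k^{3/2} + (k^{1/2} + 1) * S(k^{1/2}); note that the
    first term equals k^2."""
    if k <= 2:
        return 1
    r = math.isqrt(k)
    return k * k + (r + 1) * merger_space(r)

for j in range(1, 6):
    k = 2 ** (2 ** j)
    print(k, merger_space(k) / k ** 2)   # the ratio stays bounded: S(k) = Θ(k^2)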

We now analyze the number of memory transfers needed to fill the output buffer of size k^3 at the root of a k-merger. In the recursive definition of the buffer sizes in the k-merger, consider the first level where the subtrees (excluding output buffers) have size less than M/3; if k̄ is the number of leaves of one such subtree, we by the space usage of k-mergers have k̄^2 ≤ M/3 and (k̄^2)^2 = k̄^4 = Ω(M). We call these subtrees of the k-merger base trees and the buffers between the base trees large buffers. Assuming B^2 ≤ M/3, a base tree Tv rooted in v together with one block from each of the large buffers surrounding it (i.e., its single output buffer and k̄ input buffers) can be contained in fast memory, since M/3 + B + k̄·B ≤ M/3 + B + (M/3)^{1/2}·(M/3)^{1/2} ≤ M. If the k-merger consists of a single base tree, the number of memory transfers used to fill its output buffer with k^3 elements during an invocation is trivially O(k^3/B + k). Otherwise, consider an invocation of the root v of a base tree Tv, which will fill up the size Ω(k̄^3) output buffer of v. Loading Tv and one block for each of the k̄ buffers just below it into fast memory will incur O(k̄^2/B + k̄) memory transfers. This is O(1/B) memory transfers for each of the Ω(k̄^3) elements output, since k̄^4 = Ω(M) implies k̄^2 = Ω(M^{1/2}) = Ω(B), from which k̄ = O(k̄^3/B) follows. Provided that none of the input buffers just below Tv become empty, the output buffer can then be filled in O(k̄^3/B) memory transfers since elements can be read from the input buffers in O(1/B) transfers amortized. If a buffer below Tv becomes empty, a recursive invocation is needed. This invocation may evict Tv from memory, leading to its reloading when the invocation finishes. We charge this cost to the Ω(k̄^3) elements in the filled buffer, or O(1/B) memory transfers per element. Finally, the last time an invocation is used to fill a particular buffer, the buffer may not be completely filled (due to exhaustion). However, this happens only once for each buffer, so we can pay the cost by charging O(1/B) memory transfers to each position in each buffer in the k-merger. As the entire k-merger uses O(k^2) space and merges k^3 elements, these charges add up to O(1/B) memory transfers per element.

We charge an element O(1/B) memory transfers each time it is inserted into a large buffer. Since k̄ = Ω(M^{1/4}), each element is inserted in O(log_{k̄} k) = O(logM k^3) large buffers. Thus we have the following.

Theorem 35.2

Excluding the output buffers, the size of a k-merger is O(k^2) and it performs O((k^3/B) logM k^3 + k) memory transfers during an invocation to fill up its output buffer of size k^3.

35.2.2.3 Funnelsort

The cache-oblivious sorting algorithm Funnelsort is easily obtained once the k-merger structure is defined: Funnelsort breaks the N input elements into N^{1/3} groups of size N^{2/3}, sorts them recursively, and then merges the sorted groups using an N^{1/3}-merger.
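The recursion pattern can be sketched as follows, with Python's heapq.merge standing in for the N^{1/3}-merger (so the sketch shows the recursion only, not the cache-efficient merging itself) and an arbitrary cutoff for the base case.

import heapq

def funnelsort(a):
    """Split the N inputs into ~N^{1/3} groups of size ~N^{2/3}, sort each
    recursively, and merge the sorted runs."""
    n = len(a)
    if n <= 8:                               # assumed small-case cutoff
        return sorted(a)
    g = max(1, round(n ** (2 / 3)))          # group size ~ N^{2/3}
    runs = [funnelsort(a[i:i + g]) for i in range(0, n, g)]
    return list(heapq.merge(*runs))          # stand-in for the k-merger

print(funnelsort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))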

Funnelsort can be analyzed as follows: Since the space usage of a k-merger is sub-linear in its output, the elements in a recursive sort of size M/3 only need to be loaded into memory once during the entire following recursive sort. For k-mergers at the remaining higher levels in the recursion tree, we have k^3 ≥ M/3 ≥ B^2, which implies k^2 ≥ B^{4/3} > B and hence k^3/B > k. By Theorem 35.2, the number of memory transfers during a merge involving N′ elements is then O(logM(N′)/B) per element. Hence, the total number of memory transfers per element is

O((1/B) · (1 + Σ_{i=0}^{∞} logM N^{(2/3)^i})) = O((logM N)/B).

Since logM x = Θ(logM/B x) when B^2 ≤ M/3, we have the following theorem.

Theorem 35.3

Funnelsort sorts N elements using O(SortM,B(N)) memory transfers.

In the above analysis, the exact (tall cache) assumption on the size of the fast memory is B^2 ≤ M/3. In [12] it is shown how to generalize Funnelsort such that it works under the weaker assumption B^{1+ɛ} ≤ M, for fixed ɛ > 0. The resulting algorithm incurs the optimal O(SortM,B(N)) memory transfers when B^{1+ɛ} = M, at the price of incurring O(1/ɛ · SortM,B(N)) memory transfers when B^2 ≤ M. It is shown in [21] that this trade-off is the best possible for comparison based cache-oblivious sorting.

35.3 Dynamic B-Trees

The van Emde Boas layout of a binary tree provides a static cache-oblivious version of B-trees. The first dynamic solution was given by Bender et al. [6], and later several simplified structures were developed [7–10]. In this section, we describe two of these structures [7,9].

35.3.1 Density Based

In this section we describe the dynamic cache-oblivious search tree structure of Brodal et al. [9]. A similar proposal was given independently by Bender et al. [8].

The basic idea in the structure is to embed a dynamic binary tree of height log N + O(1) into a static complete binary tree, that is, into a tree with 2^h − 1 nodes and height h, which in turn is embedded into an array using the van Emde Boas layout. Refer to Figure 35.8.

Figure 35.8 Illustration of embedding a height H tree into a complete static tree of height H, and the van Emde Boas layout of this tree.

To maintain the dynamic tree we use techniques for maintaining small height in a binary tree developed by Andersson and Lai [22]; in a different setting, similar techniques have also been given by Itai et al. [23]. These techniques give an algorithm for maintaining height log N + O(1) using amortized O(log^2 N) time per update. If the height bound is violated after performing an update in a leaf l, this algorithm performs rebalancing by rebuilding the subtree rooted at a specific node v on the search path from the root to l. The subtree is rebuilt to perfect balance in time linear in the size of the subtree. In a binary tree of perfect balance the element in any node v is the median of all the elements stored in the subtree Tv rooted in v. This implies that only the lowest level in Tv is not completely filled and the empty positions appearing at this level are evenly distributed across the level. Hence, the net effect of the rebuilding is to redistribute the empty positions in Tv. Note that this can lower the cost of future insertions in Tv, and consequently it may in the long run be better to rebuild a subtree larger than strictly necessary to reestablish the height bound. The criterion for choosing how large a subtree to rebuild, that is, for choosing the node v, is the crucial part of the algorithms by Andersson and Lai [22] and Itai et al. [23]. Below we give the details of how they can be used in the cache-oblivious setting.

35.3.1.1 Structure

As mentioned, the data structure consists of a dynamic binary tree T embedded into a static complete binary tree T′ of height H, which in turn is embedded into an array using the van Emde Boas layout.

In order to present the update and query algorithms, we define the density ρ(u) of a node u as |Tu|/|T′u|, where |Tu| and |T′u| are the number of nodes in the trees rooted in u in T and T′, respectively. In Figure 35.8, the node containing the element 4 has density 4/7. We also define two density thresholds τi and γi for the nodes on each level i = 1, 2, …, H (where the root is at level 1). The upper density thresholds τi are evenly spaced values between 3/4 and 1, and the lower density thresholds γi are evenly spaced values between 1/4 and 1/8. More precisely, τi = 3/4 + (i − 1)/(4(H − 1)) and γi = 1/4 − (i − 1)/(8(H − 1)).
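The thresholds are straightforward to compute from these formulas; note that they are tightest at the root (level 1) and loosest at the leaves (level H). A small sketch:

def density_thresholds(H, i):
    """Upper and lower density thresholds (tau_i, gamma_i) for level i of a
    static tree of height H >= 2, straight from the formulas above."""
    tau = 3 / 4 + (i - 1) / (4 * (H - 1))
    gamma = 1 / 4 - (i - 1) / (8 * (H - 1))
    return tau, gamma

print(density_thresholds(10, 1))    # (0.75, 0.25): tightest, at the root
print(density_thresholds(10, 10))   # (1.0, 0.125): loosest, at the leaves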

35.3.1.2 Updates

To insert a new element into the structure we first locate the position in T of the new node w. If the insertion of w violates the height bound H, we rebalance T as follows: First we find the lowest ancestor v of w satisfying γi ≤ ρ(v) ≤ τi, where i is the level of v. If no ancestor v satisfies the requirement, we rebuild the entire structure, that is, T, T′ and the layout of T′: For k the integer such that 2^k ≤ N < 2^{k+1} we choose the new height H of the tree T′ as k + 1 if N ≤ 5/4 · 2^k; otherwise we choose H = k + 2. On the other hand, if the ancestor v exists we rebuild Tv: We first create a sorted list of all elements in Tv by an in-order traversal of Tv. The ⌈|Tv|/2⌉th element becomes the element stored at v, the smallest ⌊(|Tv| − 1)/2⌋ elements are recursively distributed in the left subtree of v, and the largest ⌈(|Tv| − 1)/2⌉ elements are recursively distributed in the right subtree of v.
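The redistribution itself is a simple recursion on the sorted element list. The following sketch keeps the static tree in a plain array with heap indexing (children of position v at 2v+1 and 2v+2, None marking empty positions), a simplification of the chapter's van Emde Boas layout chosen to show only the redistribution logic:

def rebuild(tree, v, elems):
    """Rebuild the subtree rooted at array position v to perfect balance:
    the ceil(n/2)-th of the n sorted elements goes to v, the smallest
    floor((n-1)/2) elements to the left subtree, the rest to the right."""
    if not elems:
        clear(tree, v)
        return
    mid = (len(elems) - 1) // 2            # 0-indexed ceil(n/2)-th element
    tree[v] = elems[mid]
    rebuild(tree, 2 * v + 1, elems[:mid])
    rebuild(tree, 2 * v + 2, elems[mid + 1:])

def clear(tree, v):
    if v < len(tree) and tree[v] is not None:
        tree[v] = None
        clear(tree, 2 * v + 1)
        clear(tree, 2 * v + 2)

tree = [None] * 7                 # static complete tree of height 3
rebuild(tree, 0, [1, 2, 3, 4])    # 4 elements, redistributed to perfect balance
print(tree)                       # [2, 1, 3, None, None, None, 4]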

We can delete an element from the structure in a similar way: We first locate the node w in T containing the element e to be deleted. If w is not a leaf and has a right subtree, we then locate the node w′ containing the immediate successor of e (the node reached by following left children in the right subtree of w), swap the elements in w and w′, and let w = w′. We repeat this until w is a leaf. If on the other hand w is not a leaf but only has a left subtree, we instead repeatedly swap w with the node containing the predecessor of e. Finally, we delete the leaf w from T, and rebalance the tree by rebuilding the subtree rooted at the lowest ancestor v of w satisfying γi ≤ ρ(v) ≤ τi, where i is the level of v; if no such node exists we rebuild the entire structure completely.

Similar to the proof by Andersson and Lai [22] and Itai et al. [23] that updates are performed in O(log^2 N) time, Brodal et al. [9] showed that using the above algorithms, updates can be performed in amortized O(logB N + (log^2 N)/B) memory transfers.

35.3.1.3 Range queries

In Section 35.2, we discussed how a range query can be answered in O(logB N + K/B) memory transfers on a complete tree laid out using the van Emde Boas layout. Since it can be shown that the above update algorithm maintains a lower density threshold of 1/8 for all nodes, we can also perform range queries in T efficiently: To answer a range query [x1, x2] we traverse the two paths to x1 and x2 in T, as well as O(log N) subtrees rooted in children of nodes on these paths. Traversing one subtree Tv in T incurs at most the number of memory transfers needed to traverse the corresponding (full) subtree T′v in T′. By the lower density threshold of 1/8 we know that the size of T′v is at most a factor of eight larger than the size of Tv. Thus a range query is answered in O(logB N + K/B) memory transfers.

Theorem 35.4

There exists a linear size cache-oblivious data structure for storing N elements, such that updates can be performed in amortized O(logB N + (log^2 N)/B) memory transfers and range queries in O(logB N + K/B) memory transfers.

Using the method for moving between nodes in a van Emde Boas layout using arithmetic on the node indices rather than pointers, the above data structure can be implemented as a single size O(N) array of data elements. The amortized complexity of updates can also be lowered to O(logB N) by changing leaves into pointers to buckets containing Θ(log N) elements each. With this modification a search can still be performed in O(logB N) memory transfers. However, range queries can then no longer be answered efficiently, since the O(K/log N) buckets can reside in arbitrary positions in memory.

35.3.2 Exponential Tree Based

The second dynamic cache-oblivious search tree we consider is based on the so-called exponential layout of Bender et al. [7]. For simplicity, we here describe the structure slightly differently than in [7].

35.3.2.1 Structure

Consider a complete balanced binary tree T with N leaves. Intuitively, the idea in an exponential layout of T is to recursively decompose T into a set of components, which are each laid out using the van Emde Boas layout. More precisely, we define component C0 to consist of the first (1/2) log N levels of T. The component C0 contains √N nodes and is called an N-component because its root is the root of a tree with N leaves (that is, T). To obtain the exponential layout of T, we first store C0 using the van Emde Boas layout, followed immediately by the recursive layouts of the √N subtrees, T1, T2, …, T√N, of size √N, beneath C0 in T, ordered from left to right. Note how the definition of the exponential layout naturally defines a decomposition of T into log log N + O(1) layers, with layer i consisting of a number of N^{1/2^{i−1}}-components. An X-component is of size Θ(√X) and its Θ(√X) leaves are connected to √X-components. Thus the root of an X-component is the root of a tree containing X elements. Refer to Figure 35.9. Since the described layout of T is really identical to the van Emde Boas layout, it follows immediately that it uses linear space and that a root-to-leaf path can be traversed in O(logB N) memory transfers.

Figure 35.9 Components and exponential layout.

By slightly relaxing the requirements on the layout described above, we are able to maintain it dynamically: We define an exponential layout of a balanced binary tree T with N leaves to consist of a decomposition of T into log log N + O(1) layers, with layer i consisting of a number of N^{1/2^{i−1}}-components, each laid out using the van Emde Boas layout (Figure 35.9). An X-component has size Θ(√X), but unlike above we allow its root to be root in a tree containing between X and 2X elements. Note how this means that an X-component has between X/(2√X) = (1/2)√X and 2X/√X = 2√X leaves. We store the layout of T in memory almost as previously: If the root of T is root in an X-component C0, we store C0 first in 2 · 2√X − 1 memory locations (the maximal size of an X-component), followed immediately by the layouts of the subtrees (√X-components) rooted in the leaves of C0 (in no particular order). We make room in the layout for the at most 2√X such subtrees. This exponential layout for T uses S(N) = Θ(√N) + 2√N · S(√N) space, which is Θ(N log N).

35.3.2.2 Search

Even with the modified definition of the exponential layout, we can traverse any root-to-leaf path in T in O(logB N) memory transfers: The path passes through exactly one N^{1/2^{i−1}}-component for 1 ≤ i ≤ log log N + O(1). Each X-component is stored in a van Emde Boas layout of size Θ(√X) and can therefore be traversed in O(logB √X) memory transfers (Theorem 35.1). Thus, if we use at least one memory transfer in each component, we perform a search in O(logB N) + log log N memory accesses. However, we do not actually use a memory transfer for each of the log log N + O(1) components: Consider the traversed X-component with √B ≤ X ≤ B. This component is of size O(√B) and can therefore be loaded in O(1) memory transfers. All smaller traversed components are of total size O(√B log B) = O(B), and since they are stored in consecutive memory locations they can also be traversed in O(1) memory transfers. Therefore only O(1) memory transfers are used to traverse the last log log B − O(1) components. Thus, the total cost of traversing a root-to-leaf path is O(logB N + log log N − log log B) = O(logB N).

35.3.2.3 Updates

To perform an insertion in T we first search for the leaf l where we want to perform the insertion; inserting the new element below l will increase the number of elements stored below each of the log log N + O(1) components on the path to the root, and may thus result in several components needing rebalancing (an X-component with 2X elements stored below it). We perform the insertion and rebalance the tree in a simple way as follows: We find the topmost X-component Cj on the path to the root with 2X elements below it. Then we divide these elements into two groups of X elements and store them separately in the exponential layout (effectively we split Cj with 2X elements below it into two X-components with X elements each). This can easily be done in O(X) memory transfers. Finally, we update a leaf and insert a new leaf in the X^2-component above Cj (corresponding to the two new X-components); we can easily do so in O(X) memory transfers by rebuilding it. Thus overall we have performed the insertion and rebalancing in O(X) memory transfers. The rebuilding guarantees that after rebuilding an X-component, X inserts have to be performed below it before it needs rebalancing again. Therefore we can charge the O(X) cost to the X insertions that occurred below Cj since it was last rebuilt, and argue that each insertion is charged O(1) memory accesses on each of the log log N + O(1) levels. In fact, using the same argument as above for the searching cost, we can argue that we only need to charge an insertion O(1) transfers on the last log log B − O(1) levels of T, since rebalancing on any of these levels can always be performed in O(1) memory transfers. Thus overall we perform an insertion in O(logB N) memory transfers amortized.

Deletions can easily be handled in O(logB N) memory transfers using global rebuilding: To delete the element in a leaf l of T we simply mark l as deleted. If l's sibling is also marked as deleted, we mark their parent deleted too; we continue this process along one path to the root of T. This way we can still perform searches in O(logB N) memory transfers, as long as we have only deleted a fraction of the elements in the tree. After N/2 deletes we therefore rebuild the entire structure in O(N logB N) memory accesses, or O(logB N) accesses per delete operation.

Bender et al. [7] showed how to modify the update algorithms to perform updates “lazily” and obtain worst case O(logB N) bounds.

35.3.2.4 Reducing space usage

To reduce the space of the layout of a tree T to linear we simply make room for 2 log N elements in each leaf, and maintain that a leaf contains between log N and 2 log N elements. This does not increase the O(logB N) search and update costs, since the O(log N) elements in a leaf can be scanned in O((log N)/B) = O(logB N) memory accesses. However, it reduces the number of elements stored in the exponential layout to O(N/log N).

Theorem 35.5

The exponential layout of a search tree T on N elements uses linear space and supports updates in O(logB N) memory accesses and searches in O(logB N) memory accesses.

Note that the analogue of Theorem 35.1 does not hold for the exponential layout, that is, it does not support efficient range queries. The reason is partly that the √X-components below an X-component are not located in (sorted) order in memory because components are rebalanced by splitting, and partly that the leaves contain Θ(log N) elements. However, Bender et al. [7] showed how the exponential layout can be used to obtain a number of other important results: The structure as described above can easily be extended such that if two subsequent searches are separated by d elements, then the second search can be performed in O(log* d + logB d) memory transfers. It can also be extended such that R queries (batched searching) can be answered simultaneously in O(R logB (N/R) + SortM,B(R)) memory transfers. The exponential layout can also be used to develop a persistent B-tree, where updates can be performed in the current version of the structure and queries can be performed in the current as well as all previous versions, with both operations incurring O(logB N) memory transfers. It can also be used as a basic building block in a linear space planar point location structure that answers queries in O(logB N) memory transfers.

35.4 Priority Queues

A priority queue maintains a set of elements, each with a priority (or key), under the operations INSERT and DELETEMIN, where an INSERT operation inserts a new element in the queue, and a DELETEMIN operation finds and deletes the element with the minimum key in the queue. Frequently we also consider a DELETE operation, which deletes an element with a given key from the priority queue. This operation can easily be supported using INSERT and DELETEMIN: To perform a DELETE we insert a special delete-element in the queue with the relevant key, such that we can detect if an element returned by a DELETEMIN has really been deleted by performing another DELETEMIN.
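The following sketch illustrates this reduction, with Python's heapq standing in for the cache-oblivious priority queue and a 0/1 tag encoding invented for the delete-elements:

import heapq

class PQWithDelete:
    """DELETE on top of INSERT/DELETEMIN only: a special delete-element with
    the relevant key is inserted, and a returned element is recognized as
    deleted when the following DELETEMIN yields its matching delete-element."""
    ELEM, DEL = 0, 1

    def __init__(self):
        self.heap = []

    def insert(self, key):
        heapq.heappush(self.heap, (key, self.ELEM))

    def delete(self, key):
        heapq.heappush(self.heap, (key, self.DEL))

    def deletemin(self):
        while self.heap:
            key, tag = heapq.heappop(self.heap)
            if tag == self.DEL:
                continue                      # stray delete-element
            # "perform another DELETEMIN" to check for a delete-element
            if self.heap and self.heap[0] == (key, self.DEL):
                heapq.heappop(self.heap)      # element was deleted; skip it
                continue
            return key
        return None

pq = PQWithDelete()
for k in (4, 2, 7):
    pq.insert(k)
pq.delete(2)
print(pq.deletemin(), pq.deletemin())   # 4 7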

A balanced search tree can be used to implement a priority queue. Thus the existence of a dynamic cache-oblivious B-tree immediately implies the existence of a cache-oblivious priority queue where all operations can be performed in O(logB N) memory transfers, where N is the total number of elements inserted. However, it turns out that one can design a priority queue where all operations can be performed in Θ(SortM,B(N)/N) = O((1/B) logM/B (N/B)) memory transfers; for most realistic values of N, M, and B, this bound is less than 1 and we can, therefore, only obtain it in an amortized sense. In this section we describe two different structures that obtain these bounds [14,15].

35.4.1 Merge Based Priority Queue: Funnel Heap

The cache-oblivious priority queue Funnel Heap due to Brodal and Fagerberg [15] is inspired by the sorting algorithm Funnelsort [4,12]. The structure only uses binary merging; essentially it is a heap-ordered binary tree with mergers in the nodes and buffers on the edges.

35.4.1.1 Structure

The main part of the Funnel Heap structure is a sequence of k-mergers (Section 35.2.2) with double-exponentially increasing k, linked together in a list using binary mergers; refer to Figure 35.10. This part of the structure constitutes a single binary merge tree. Additionally, there is a single insertion buffer I.

Figure 35.10 Funnel Heap: Sequence of k-mergers (triangles) linked together using buffers (rectangles) and binary mergers (circles).

More precisely, let ki and si be values defined inductively by

(k1, s1) = (2, 8),   s_{i+1} = si · (ki + 1),   k_{i+1} = ⌈⌈s_{i+1}^{1/3}⌉⌉,

(35.1)

where ⌈⌈x⌉⌉ denotes the smallest power of two above x, that is, ⌈⌈x⌉⌉ = 2^{⌈log x⌉}. We note that si^{1/3} ≤ ki < 2·si^{1/3}, from which si^{4/3} < s_{i+1} < 3·si^{4/3} follows, so both si and ki grow double-exponentially: s_{i+1} = Θ(si^{4/3}) and k_{i+1} = Θ(ki^{4/3}). We also note that by induction on i we have si = s1 + Σ_{j=1}^{i−1} kj·sj for all i.
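For illustration, the first few values of the recurrence are easy to tabulate; in this sketch the float cube root is adequate for the small indices shown.

def funnel_heap_links(n):
    """The sequences (k_i, s_i) of recurrence (35.1); ceil2(x) is the
    smallest power of two above x, i.e. 2^{ceil(log x)}."""
    def ceil2(x):
        p = 1
        while p < x:
            p *= 2
        return p
    ks, ss = [2], [8]
    for _ in range(n - 1):
        ss.append(ss[-1] * (ks[-1] + 1))
        ks.append(ceil2(ss[-1] ** (1 / 3)))
    return ks, ss

print(funnel_heap_links(5))   # ([2, 4, 8, 16, 32], [8, 24, 120, 1080, 18360])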

A Funnel Heap consists of a linked list with link i containing a binary merger vi, two buffers Ai and Bi, and a ki-merger Ki having ki input buffers Si1, …, Siki. We refer to Bi, Ki, and Si1, …, Siki as the lower part of the link. The size of both Ai and Bi is ki^3, and the size of each Sij is si. Link i has an associated counter ci for which 1 ≤ ci ≤ ki + 1. The initial value of ci is one for all i. The structure also has one insertion buffer I of size s1. We maintain the following invariants:

Invariant 35.1

For link i, Sici,…, Siki are empty.

Invariant 35.2

On any path in the merge tree from some buffer to the root buffer A1, elements appear in decreasing order.

Invariant 35.3

Elements in buffer I appear in sorted order.

Invariant 35.2 can be rephrased as the entire merge tree being in heap order. It implies that in all buffers in the merge tree, the elements appear in sorted order, and that the minimum element in the queue will be in A1 or I, if buffer A1 is non-empty. Note that an invocation (Figure 35.7) of any binary merger in the tree maintains the invariants.

35.4.1.2 Layout

The Funnel Heap is laid out in consecutive memory locations in the order I, link 1, link 2,…, with link i being laid out in the order ci, Ai, vi, Bi, Ki, Si1, Si2,…, Siki.

35.4.1.3 Operations

To perform a DELETEMIN operation we compare the smallest element in I with the smallest element in A1 and remove the smallest of these; if A1 is empty we first perform an invocation of v1. The correctness of this procedure follows immediately from Invariant 35.2.

To perform an INSERT operation we insert the new element among the (constant number of) elements in I, maintaining Invariant 35.3. If the number of elements in I is now s1, we examine the links in order to find the lowest index i for which ciki. Then we perform the following SWEEP(i) operation.

In SWEEP(i), we first traverse the path p from A1 to Sici and record how many elements are contained in each encountered buffer. Then we traverse the part of p going from Ai to Sici, remove the elements in the encountered buffers, and form a sorted stream σ1 of the removed elements. Next we form another sorted stream σ2 of all elements in links 1,…, i − 1 and in buffer I; we do so by marking Ai temporarily as exhausted and calling DELETEMIN repeatedly. We then merge σ1 and σ2 into a single stream σ, and traverse p again while inserting the front (smallest) elements of σ in the buffers on p such that they contain the same numbers of elements as before we emptied them. Finally, we insert the remaining elements from σ into Sici, reset cl to one for l = 1,2,…, i − 1, and increment ci.

To see that SWEEP(i) does not insert more than the allowed si elements into Sici, first note that the lower part of link i is emptied each time ci is reset to one. This implies that the lower part of link i never contains more than the number of elements inserted into Si1, Si2, …, Siki by the at most ki SWEEP(i) operations occurring since the last time ci was reset. Since si = s1 + Σ_{j=1}^{i−1} kj·sj for all i, it follows by induction on time that no instance of SWEEP(i) inserts more than si elements into Sici.

Clearly, SWEEP(i) maintains Invariants 35.1 and 35.3, since I and the lower parts of links 1,…, i − 1 are empty afterwards. Invariant 35.2 is also maintained, since the new elements in the buffers on p are the smallest elements in σ, distributed such that each buffer contains exactly the same number of elements as before the SWEEP(i) operation. After the operation, an element on this path can only be smaller than the element occupying the same location before the operation, and therefore the merge tree is in heap order.

35.4.1.4 Analysis

To analyze the amortized cost of an INSERT or DELETEMIN operation, we first consider the number of memory transfers used to move elements upwards (towards A1) by invocations of binary mergers in the merge tree. For now we assume that all invocations result in full buffers, that is, that no exhaustions occur. We imagine charging the cost of filling a particular buffer evenly to the elements being brought into the buffer, and will show that this way an element from an input buffer of Ki is charged O((1/B) logM/B si) memory transfers during its ascent to A1.

Our proof relies on the optimal replacement strategy keeping as many as possible of the first links of the Funnel Heap in fast memory at all times. To analyze the number of links that fit in fast memory, we define Δi to be the sum of the space used by links 1 to i and define iM to be the largest i for which Δi ≤ M. By the space bound for k-mergers in Theorem 35.2 we see that the space used by link i is dominated by the Θ(si·ki) = Θ(ki^4) space use of Si1, …, Siki. Since k_{i+1} = Θ(ki^{4/3}), the space used by link i grows double-exponentially with i. Hence, Δi is a sum of double-exponentially increasing terms and is therefore dominated by its last term. In other words, Δi = Θ(ki^4) = Θ(si^{4/3}). By the definition of iM we have Δ_{iM} ≤ M < Δ_{iM+1}. Using s_{i+1} = Θ(si^{4/3}) we see that logM(siM) = Θ(1).

Now consider an element in an input buffer of Ki. If i ≤ iM the element will not get charged at all in our charging scheme, since no memory transfers are used to fill buffers in the links that fit in fast memory. So assume i > iM. In that case the element will get charged for the ascent through Ki to Bi and then through vj to Aj for j = i, i − 1, …, iM. First consider the cost of ascending through Ki: By Theorem 35.2, an invocation of the root of Ki to fill Bi with ki^3 elements incurs O(ki + (ki^3/B) logM/B ki^3) memory transfers altogether. Since M < Δ_{iM+1} = Θ(k_{iM+1}^4) we have M = O(ki^4). By the tall cache assumption M = Ω(B^2) we get B = O(ki^2), which implies ki = O(ki^3/B). Under the assumption that no exhaustions occur, that is, that buffers are filled completely, it follows that an element is charged O((1/B) logM/B ki^3) = O((1/B) logM/B si) memory transfers to ascend through Ki and into Bi. Next consider the cost of ascending through vj, that is, insertion into Aj, for j = i, i − 1, …, iM: Filling Aj incurs O(1 + |Aj|/B) memory transfers. Since B = O(k_{iM+1}^2) = O(k_{iM}^{8/3}) and |Aj| = kj^3, this is O(|Aj|/B) memory transfers, so an element is charged O(1/B) memory transfers for each Aj (under the assumption of no exhaustions). It only remains to bound the number of such buffers Aj, that is, to bound i − iM. From si^{4/3} < s_{i+1} we have s_{iM}^{(4/3)^{i−iM}} < si. Using logM(siM) = Θ(1) we get i − iM = O(log logM si). From log logM si = O(logM si) and the tall cache assumption M = Ω(B^2) we get i − iM = O(logM si) = O(logM/B si). In total we have proved our claim that, assuming no exhaustions occur, an element in an input buffer of Ki is charged O((1/B) logM/B si) memory transfers during its ascent to A1.

We imagine maintaining the credit invariant that each element in a buffer holds enough credits to be able to pay for its ascent from its current position to A1, at the cost analyzed above. In particular, an element needs O((1/B) logM/B si) credits when it is inserted in an input buffer of Ki. The cost of these credits we attribute to the SWEEP(i) operation inserting the element, effectively making all invocations of mergers be prepaid by SWEEP(i) operations.

A SWEEP(i) operation also incurs memory transfers by itself; we now bound these. In the SWEEP(i) operation we first form σ1 by traversing the path p from A1 to Sici. Since the links are laid out sequentially in memory, this traversal at most constitutes a linear scan of the consecutive memory locations containing A1 through Ki. Such a scan takes O((Δ_{i−1} + |Ai| + |Bi| + |Ki|)/B) = O(ki^3/B) = O(si/B) memory transfers. Next we form σ2 using DELETEMIN operations; the cost of these is paid for by the credits placed on the elements. Finally, we merge σ1 and σ2 into σ, and place some of the elements in buffers on p and some of the elements in Sici. The number of memory transfers needed for this is bounded by the O(si/B) memory transfers needed to traverse p and Sici. Hence, the memory transfers incurred by the SWEEP(i) operation itself are O(si/B).

After the SWEEP(i) operation, the credit invariant must be reestablished. Each of the O(si) elements inserted into Sici must receive O((1/B) logM/B si) credits. Additionally, the elements inserted into the part of the path p from A1 through A_{i−1} must receive enough credits to cover their ascent to A1, since the credits that resided with elements in the same positions before the operation were used when forming σ2 by DELETEMIN operations. This constitutes O(Δ_{i−1}) = o(si) elements, which, by the analysis above, must receive O((1/B) logM/B si) credits each. Altogether O(si/B) + O((si/B) logM/B si) = O((si/B) logM/B si) memory transfers are attributed to a SWEEP(i) operation, again under the assumption that no exhaustions occur during invocations.

To actually account for exhaustions, that is, the memory transfers incurred when filling buffers that become exhausted, we note that filling a buffer partly incurs at most the same number of memory transfers as filling it entirely. This number was analyzed above to be O(|Ai|/B) for Ai and O((|Bi|/B) logM/B si) for Bi, when i > iM. If Bi becomes exhausted, only a SWEEP(i) can remove that status. If Ai becomes exhausted, only a SWEEP(j) for j ≥ i can remove that status. As at most a single SWEEP(j) with j > i can take place between one SWEEP(i) and the next, Bi can only become exhausted once for each SWEEP(i), and Ai can only become exhausted twice for each SWEEP(i). From |Ai| = |Bi| = ki^3 = Θ(si) it follows that charging SWEEP(i) an additional cost of O((si/B) logM/B si) memory transfers will cover all costs of filling buffers when exhaustion occurs.

Overall we have shown that we can account for all memory transfers if we attribute O((si/B) logM/B si) memory transfers to each SWEEP(i). By induction on i, we can show that at least si insertions have to take place between each SWEEP(i). Thus, if we charge the SWEEP(i) cost to the last si insertions preceding the SWEEP(i), each insertion is charged O((1/B) logM/B si) memory transfers. Given a sequence of operations on an initially empty priority queue, let imax be the largest i for which SWEEP(i) takes place. We have s_{imax} ≤ N, where N is the number of insertions in the sequence. An insertion can be charged by at most one SWEEP(i) for each i = 1, …, imax, so by the double-exponential growth of si, the number of memory transfers charged to an insertion is

O(Σ_{k=0}^{∞} (1/B) logM/B N^{(3/4)^k}) = O((1/B) logM/B N) = O((1/B) logM/B (N/B)),

where the last equality follows from the tall cache assumption M = Ω(B^2).

Finally, we bound the space use of the entire structure. To ensure a space usage linear in N, we create a link i when it is first used, that is, when the first SWEEP(i) occurs. At that point in time, ci, Ai, vi, Bi, Ki, and Si1 are created. These take up Θ(si) space combined. At each subsequent SWEEP(i) operation, we create the next input buffer Sici of size si. As noted above, each SWEEP(i) is preceded by at least si insertions, from which an O(N) space bound follows. To ensure that the entire structure is laid out in consecutive memory locations, the structure is moved to a larger memory area when it has grown by a constant factor. When allocated, the size of the new memory area is chosen such that it will hold the input buffers Sij that will be created before the next move. The amortized cost of this is O(1/B) per insertion.

Theorem 35.6

Using Θ(M) fast memory, a sequence of N INSERT, DELETEMIN, and DELETE operations can be performed on an initially empty Funnel Heap using O(N) space in O((1/B) logM/B (N/B)) amortized memory transfers each.

Brodal and Fagerberg [15] gave a refined analysis for a variant of the Funnel Heap that shows that the structure adapts to different usage profiles. More precisely, they showed that the ith insertion uses amortized O((1/B) logM/B (Ni/B)) memory transfers, where Ni can be defined in any of the following three ways: (a) Ni is the number of elements present in the priority queue when the ith insertion is performed, (b) if the ith inserted element is removed by a DELETEMIN operation prior to the jth insertion then Ni = j − i, or (c) Ni is the maximum rank of the ith inserted element during its lifetime in the priority queue, where rank denotes the number of smaller elements in the queue.

35.4.2 Exponential Level Based Priority Queue

While the Funnel Heap is inspired by Mergesort and uses k-mergers as the basic building block, the exponential level priority queue of Arge et al. [14] is somewhat inspired by distribution sorting and uses sorting as a basic building block.

35.4.2.1 Structure

The structure consists of Θ(log log N) levels whose sizes vary from N to some small size c below a constant threshold ct; the size of a level corresponds (asymptotically) to the number of elements that can be stored within it. The ith level from above has size N^{(2/3)^{i−1}} and for convenience we refer to the levels by their size. Thus the levels from largest to smallest are level N, level N^{2/3}, level N^{4/9}, …, level X^{9/4}, level X^{3/2}, level X, level X^{2/3}, level X^{4/9}, …, level c^{9/4}, level c^{3/2}, and level c. In general, a level can contain any number of elements less than or equal to its size, except level N, which always contains Θ(N) elements. Intuitively, smaller levels store elements with smaller keys or elements that were more recently inserted. In particular, the minimum key element and the most recently inserted element are always in the smallest (lowest) level c. Both insertions and deletions are initially performed on the smallest level and may propagate up through the levels.
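The level sizes are easy to tabulate; in the following sketch the constant threshold is arbitrarily set to 4:

def level_sizes(n, c_t=4.0):
    """Sizes N^((2/3)^(i-1)) of the Θ(log log N) levels, from largest to
    smallest, stopping at the first size below the constant threshold."""
    sizes, x = [], float(n)
    while x >= c_t:
        sizes.append(x)
        x **= 2 / 3
    sizes.append(x)          # the small size c below the threshold
    return sizes

print([round(s) for s in level_sizes(10 ** 9)])
# [1000000000, 1000000, 10000, 464, 60, 15, 6, 3]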

Elements are stored in a level in a number of buffers, which are also used to transfer elements between levels. Level X consists of one up buffer uX that can store up to X elements, and at most X^{1/3} down buffers dX1, …, dXX^{1/3}, each containing between (1/2)X^{2/3} and 2X^{2/3} elements. Thus level X can store up to 3X elements. We refer to the maximum possible number of elements that can be stored in a buffer as the size of the buffer. Refer to Figure 35.11. Note that the size of a down buffer at one level matches the size (up to a constant factor) of the up buffer one level down.

Figure 35.11 Levels X^{2/3}, X, X^{3/2}, and X^{9/4} of the priority queue data structure.

We maintain three invariants about the relationships between the elements in buffers of various levels:

Invariant 35.4

At level X, elements are sorted among the down buffers, that is, elements in dXi have smaller keys than elements in dXi+1, but elements within dXi are unordered.

The element with largest key in each down buffer dXi is called a pivot element. Pivot elements mark the boundaries between the ranges of the keys of elements in down buffers.

Invariant 35.5

At level X, the elements in the down buffers have smaller keys than the elements in the up buffer.

Invariant 35.6

The elements in the down buffers at level X have smaller keys than the elements in the down buffers at the next higher level X^{3/2}.

The three invariants ensure that the keys of the elements in the down buffers get larger as we go from smaller to larger levels of the structure. Furthermore, an order exists between the buffers on one level: keys of elements in the up buffer are larger than keys of elements in down buffers. Therefore, down buffers are drawn below up buffers in Figure 35.11. However, the keys of the elements in an up buffer are unordered relative to the keys of the elements in down buffers one level up. Intuitively, up buffers store elements that are "on their way up," that is, they have yet to be resolved as belonging to a particular down buffer in the next (or higher) level. Analogously, down buffers store elements that are "on their way down"—these elements are partitioned by the down buffers into several clusters so that we can quickly find the cluster of smallest key elements of size roughly equal to the next level down. In particular, the element with overall smallest key is in the first down buffer at level c.

35.4.2.2 Layout

The priority queue is laid out in memory such that the levels are stored consecutively from smallest to largest, with each level occupying a single region of memory. For level X we reserve space for exactly 3X elements: X for the up buffer and 2X^{2/3} for each possible down buffer. The up buffer is stored first, followed by the down buffers stored in an arbitrary order but linked together to form an ordered linked list. Thus O(Σ_{i=0}^{log_{3/2} log_c N} N^{(2/3)^i}) = O(N) is an upper bound on the total memory used by the priority queue.

35.4.2.3 Operations

To implement the priority queue operations we use two general operations, push and pull. Push inserts X elements into level X^{3/2}, and pull removes the X elements with smallest keys from level X^{3/2} and returns them in sorted order. An INSERT or a DELETEMIN is performed simply by performing a push or pull on the smallest level c.

Push. To push X elements into level X^{3/2}, we first sort the X elements cache-obliviously using O(1 + (X/B) logM/B (X/B)) memory transfers. Next we distribute the elements in the sorted list into the down buffers of level X^{3/2} by scanning through the list and simultaneously visiting the down buffers in (linked) order. More precisely, we append elements to the end of the current down buffer, and advance to the next down buffer as soon as we encounter an element with a larger key than the pivot of the current buffer. Elements with larger keys than the pivot of the last down buffer are inserted in the up buffer of level X^{3/2}. Scanning through the X elements takes O(1 + X/B) memory transfers. Even though we do not scan through every down buffer, we might perform at least one memory transfer for each of the X^{1/2} possible buffers. Thus the total cost of distributing the X elements is O(X/B + X^{1/2}) memory transfers.
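The distribution step can be sketched as follows, with plain lists standing in for the linked, block-aligned buffers and buffer splitting (described next) omitted:

def distribute(sorted_elems, down_buffers, pivots, up_buffer):
    """Scan the sorted elements while visiting the down buffers in linked
    order: append each element to the current down buffer until its key
    exceeds that buffer's pivot; elements larger than the last pivot go to
    the up buffer."""
    j = 0                                # index of the current down buffer
    for x in sorted_elems:
        while j < len(pivots) and x > pivots[j]:
            j += 1                       # advance to the next down buffer
        if j < len(down_buffers):
            down_buffers[j].append(x)
        else:
            up_buffer.append(x)

down, up = [[], [], []], []
distribute([1, 3, 4, 8, 9], down, pivots=[2, 5, 7], up_buffer=up)
print(down, up)    # [[1], [3, 4], []] [8, 9]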

During the distribution of elements a down buffer may run full, that is, come to contain 2X elements. In this case, we split the buffer into two down buffers each containing X elements using O(1 + X/B) transfers. We place the new buffer in any free down buffer location for the level and update the linked list accordingly. If the level already has the maximum number X^{1/2} of down buffers, we remove the last down buffer by inserting its no more than 2X elements into the up buffer using O(1 + X/B) memory transfers. Since X elements must have been inserted since the last time the buffer split, the amortized splitting cost per element is O(1/X + 1/B) transfers. In total, the amortized number of memory transfers used on splitting buffers while distributing the X elements is O(1 + X/B).

If the up buffer runs full during the above process, that is, contains more than X^{3/2} elements, we recursively push all of these elements into the next level up. Note that after such a recursive push, X^{3/2} elements have to be inserted (pushed) into the up buffer of level X^{3/2} before another recursive push is needed.

Overall we can perform a push of $X$ elements from level $X$ into level $X^{3/2}$ in $O(X^{1/2} + \frac{X}{B}\log_{M/B}\frac{X}{B})$ memory transfers amortized, not counting the cost of any recursive push operations; it is easy to see that a push maintains all three invariants.
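The distribution step can be summarized in the following minimal sketch, under the interpretation (suggested by the rule above) that a buffer's pivot is the largest key it may hold. Plain Python lists stand in for the consecutive memory layout, and the recursive push triggered by a full up buffer is omitted; all names are illustrative.

```python
def push(elements, down, up, x, max_buffers):
    """Distribute x sorted elements into the down buffers of level x**1.5.
    `down` is a list of [pivot, items] pairs in key order; `up` is the up
    buffer. A full up buffer would trigger a recursive push (omitted)."""
    i = 0
    for e in sorted(elements):
        # Advance past down buffers whose pivot (largest key) is below e.
        while i < len(down) and e > down[i][0]:
            i += 1
        if i == len(down):                 # beyond the last pivot:
            up.append(e)                   # the element goes to the up buffer
            continue
        down[i][1].append(e)
        if len(down[i][1]) == 2 * x:       # a down buffer ran full: split it
            items = sorted(down[i][1])
            down[i][1] = items[x:]                     # upper half keeps old pivot
            down.insert(i, [items[x - 1], items[:x]])  # lower half, new pivot
            if len(down) > max_buffers:    # too many buffers: spill the last
                up.extend(down.pop()[1])   # one's elements into the up buffer
```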

Pull. To describe how to pull the $X$ smallest-key elements from level $X^{3/2}$, we first assume that the down buffers contain at least $\frac{3}{2}X$ elements. In this case the first three down buffers $d_1^{X^{3/2}}$, $d_2^{X^{3/2}}$, and $d_3^{X^{3/2}}$ contain the between $\frac{3}{2}X$ and $6X$ smallest elements (Invariants 4 and 5). We find and remove the $X$ smallest elements simply by sorting these elements using $O(1 + \frac{X}{B}\log_{M/B}\frac{X}{B})$ memory transfers. The remaining between $X/2$ and $5X$ elements are left in one, two, or three down buffers containing between $X/2$ and $2X$ elements each. These buffers can easily be constructed in $O(1 + \frac{X}{B})$ transfers. Thus we use $O(1 + \frac{X}{B}\log_{M/B}\frac{X}{B})$ memory transfers in total. It is easy to see that Invariants 4–6 are maintained.

In the case where the down buffers contain fewer than $\frac{3}{2}X$ elements, we first pull the $X^{3/2}$ elements with smallest keys from the next level up. Because these elements do not necessarily have smaller keys than the, say $U$, elements in the up buffer $u^{X^{3/2}}$, we then sort this up buffer and merge the two sorted lists. Then we insert the $U$ elements with largest keys into the up buffer, and distribute the remaining between $X^{3/2}$ and $X^{3/2} + \frac{3}{2}X$ elements into $X^{1/2}$ down buffers containing between $X$ and $X + \frac{3}{2}X^{1/2}$ elements each (such that the $O(\frac{1}{X} + \frac{1}{B})$ amortized down buffer split bound is maintained). It is easy to see that this maintains the three invariants. Afterwards, we can find the $X$ minimal-key elements as above. Note that after a recursive pull, $X^{3/2}$ elements have to be deleted (pulled) from the down buffers of level $X^{3/2}$ before another recursive pull is needed. Note also that a pull on level $X^{3/2}$ does not affect the number of elements in the up buffer $u^{X^{3/2}}$. Since we distribute elements into the down and up buffers after a recursive pull using one sort and one scan of $X^{3/2}$ elements, the cost of doing so is dominated by the cost of the recursive pull operation itself. Thus, ignoring the cost of recursive pulls, we have shown that a pull of $X$ elements from level $X^{3/2}$ down to level $X$ can be performed in $O(1 + \frac{X}{B}\log_{M/B}\frac{X}{B})$ memory transfers amortized, while maintaining Invariants 4–6.
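A matching sketch of the simple case of pull, in the same illustrative representation as the push sketch above; the refilling case, which recursively pulls $X^{3/2}$ elements from the next level up and merges them with the up buffer, is omitted.

```python
def pull(down, x):
    """Remove and return the x smallest elements of the level, assuming
    its first three down buffers together hold at least 3x/2 elements."""
    pool = sorted(down[0][1] + down[1][1] + down[2][1])  # at most 6x elements
    smallest, rest = pool[:x], pool[x:]
    del down[:3]
    # Re-pack the at most 5x remaining elements into at most three down
    # buffers; a real implementation balances them so each holds between
    # x/2 and 2x elements, while this sketch just chunks greedily.
    for j, start in enumerate(range(0, len(rest), 2 * x)):
        chunk = rest[start:start + 2 * x]
        down.insert(j, [chunk[-1], chunk])
    return smallest
```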

35.4.2.4 Analysis

To analyze the amortized cost of an INSERT or DELETEMIN operation, we consider the total number of memory transfers used to perform push and pull operations during $N/2$ operations; to ensure that the structure always consists of $O(\log\log N)$ levels and uses $O(N)$ space, we rebuild it using $O(\frac{N}{B}\log_{M/B}\frac{N}{B})$ memory transfers (or $O(\frac{1}{B}\log_{M/B}\frac{N}{B})$ transfers per operation) after every $N/2$ operations [14].

The total cost of $N/2$ such operations is analyzed as follows: We charge a push of $X$ elements from level $X$ up to level $X^{3/2}$ to level $X$. Since $X$ elements have to be inserted in the up buffer $u^X$ of level $X$ between such pushes, and as elements can only be inserted in $u^X$ when elements are inserted (pushed) into level $X$, $O(N/X)$ pushes are charged to level $X$ during the $N/2$ operations. Similarly, we charge a pull of $X$ elements from level $X^{3/2}$ down to level $X$ to level $X$. Since between such pulls $\Theta(X)$ elements have to be deleted from the down buffers of level $X$ by pulls on level $X$, $O(N/X)$ pulls are charged to level $X$ during the $N/2$ operations.

Above we argued that a push or pull charged to level $X$ uses $O(X^{1/2} + \frac{X}{B}\log_{M/B}\frac{X}{B})$ memory transfers. We can reduce this cost to $O(\frac{X}{B}\log_{M/B}\frac{X}{B})$ by examining the costs for differently sized levels more carefully. First consider a push or pull of $X \ge B^2$ elements into or from level $X^{3/2} \ge B^3$. In this case $X^{1/2} \le \frac{X}{B}$ (since $X^{1/2} \ge B$), and we trivially have that $O(X^{1/2} + \frac{X}{B}\log_{M/B}\frac{X}{B}) = O(\frac{X}{B}\log_{M/B}\frac{X}{B})$. Next, consider the case $B^{4/3} \le X < B^2$, where the $X^{1/2}$ term in the push bound can dominate and we have to analyze the cost of a push more carefully. In this case we are working on a level $X^{3/2}$ where $B^2 \le X^{3/2} < B^3$; there is only one such level. Recall that the $X^{1/2}$ cost stems from distributing $X$ sorted elements into the fewer than $X^{1/2}$ down buffers of level $X^{3/2}$. More precisely, a block of each buffer may have to be loaded and written back without transferring a full block of elements into the buffer. Assuming $M = \Omega(B^2)$, the fact that $X^{1/2} \le B$ means that a block for each of the buffers can fit into fast memory. Consequently, if a fraction of the fast memory is used to keep a partially filled block of each buffer of level $X^{3/2}$ ($B^2 \le X^{3/2} < B^3$) in fast memory at all times, and only full blocks are written to disk, the $X^{1/2}$ cost is eliminated. In addition, if all of the levels of size less than $B^2$ (of total size $O(B^2)$) are also kept in fast memory, all transfer costs associated with them are eliminated. The optimal paging strategy is able to keep the relevant blocks in fast memory at all times and thus eliminates these costs.

Finally, since each of the $O(N/X)$ push and pull operations charged to level $X$ (for $X > B^2$) uses $O(\frac{X}{B}\log_{M/B}\frac{X}{B})$ amortized memory transfers, the total amortized transfer cost of an INSERT or DELETEMIN operation in the sequence of $N/2$ such operations is

$$O\left(\sum_{i=0}^{\infty} \frac{1}{B}\log_{M/B}\frac{N^{(2/3)^i}}{B}\right) = O\left(\frac{1}{B}\log_{M/B}\frac{N}{B}\right).$$

Theorem 35.7

Using $\Theta(M)$ fast memory, $N$ INSERT, DELETEMIN, and DELETE operations can be performed on an initially empty exponential level priority queue using $O(N)$ space in $O(\frac{1}{B}\log_{M/B}\frac{N}{B})$ amortized memory transfers each.

35.5 2d Orthogonal Range Searching

As discussed in Section 35.3, there exist cache-oblivious B-trees that support updates and queries in $O(\log_B N)$ memory transfers (e.g., Theorem 35.5); several cache-oblivious B-tree variants can also support (one-dimensional) range queries in $O(\log_B N + \frac{K}{B})$ memory transfers [6,8,9], but at an increased amortized update cost of $O(\log_B N + \frac{\log^2 N}{B}) = O(\log_B^2 N)$ memory transfers (e.g., Theorem 35.4).

In this section we discuss cache-oblivious data structures for two-dimensional orthogonal range searching, that is, structures for storing a set of $N$ points in the plane such that the points in an axis-parallel query rectangle can be reported efficiently. In Section 35.5.1 we first discuss a cache-oblivious version of a kd-tree. This structure uses linear space and answers queries in $O(\sqrt{N/B} + \frac{K}{B})$ memory transfers; this is optimal among linear space structures [24]. It supports updates in $O(\frac{\log N}{B}\log_{M/B} N) = O(\log_B^2 N)$ transfers. In Section 35.5.2 we then discuss a cache-oblivious version of a two-dimensional range tree. The structure answers queries in the optimal $O(\log_B N + \frac{K}{B})$ memory transfers but uses $O(N\log^2 N)$ space. Both structures were first described by Agarwal et al. [11].

35.5.1 Cache-Oblivious kd-Tree

35.5.1.1 Structure

The cache-oblivious kd-tree is simply a normal kd-tree laid out in memory using the van Emde Boas layout. This structure, proposed by Bentley [25], is a binary tree of height $O(\log N)$ with the $N$ points stored in the leaves of the tree. The internal nodes represent a recursive decomposition of the plane by means of axis-orthogonal lines that partition the set of points into two subsets of equal size. On even levels of the tree the dividing lines are horizontal, and on odd levels they are vertical. In this way a rectangular region $R_v$ is naturally associated with each node $v$, and the nodes on any particular level of the tree partition the plane into disjoint regions. In particular, the regions associated with the leaves represent a partition of the plane into rectangular regions containing one point each. Refer to Figure 35.12.

fig35_12.jpg

Figure 35.12 kd-tree and the corresponding partitioning.

35.5.1.2 Query

An orthogonal range query $Q$ on a kd-tree $T$ is answered recursively starting at the root: At a node $v$ we advance the query to a child $v_c$ of $v$ if $Q$ intersects the region $R_{v_c}$ associated with $v_c$. At a leaf $w$ we return the point in $w$ if it is contained in $Q$. A standard argument shows that the number of nodes in $T$ visited when answering $Q$, or equivalently, the number of nodes $v$ where $R_v$ intersects $Q$, is $O(\sqrt{N} + K)$: $O(\sqrt{N})$ nodes $v$ are visited where $R_v$ is intersected by the boundary of $Q$, and $K$ nodes $u$ with $R_u$ completely contained in $Q$ [25].

If the kd-tree $T$ is laid out using the van Emde Boas layout, we can bound the number of memory transfers used to answer a query by considering the nodes $\log B$ levels above the leaves of $T$. There are $O(\frac{N}{B})$ such nodes, as the subtree $T_v$ rooted in one such node $v$ contains $B$ leaves. By the standard query argument, the number of these nodes visited by a query is $O(\sqrt{N/B} + \frac{K}{B})$. Thus, the number of memory transfers used to visit nodes more than $\log B$ levels above the leaves is $O(\sqrt{N/B} + \frac{K}{B})$. This is also the overall number of memory transfers used to answer a query, since (as argued in Section 35.2.1) the nodes in $T_v$ are contained in $O(1)$ blocks, that is, any traversal of (any subset of) the nodes in a subtree $T_v$ can be performed in $O(1)$ memory transfers.
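The recursion itself is short; the following is a minimal sketch with a simple node representation (leaves carry a point, internal nodes carry children and the rectangle $R_v$). The field names are illustrative and independent of the layout, which only determines where nodes reside in memory.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    region: tuple                  # R_v as ((x1, x2), (y1, y2))
    point: Optional[tuple] = None  # set at leaves only
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None

def intersects(r, q):
    return (r[0][0] <= q[0][1] and q[0][0] <= r[0][1] and
            r[1][0] <= q[1][1] and q[1][0] <= r[1][1])

def contains(q, p):
    return q[0][0] <= p[0] <= q[0][1] and q[1][0] <= p[1] <= q[1][1]

def range_query(v, q, out):
    """Report into `out` all points in v's subtree lying in rectangle q."""
    if v is None or not intersects(v.region, q):
        return                     # R_v misses Q: prune this subtree
    if v.point is not None:        # leaf: report the point if it is in Q
        if contains(q, v.point):
            out.append(v.point)
        return
    range_query(v.left, q, out)    # Q intersects R_v: advance to children
    range_query(v.right, q, out)
```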

35.5.1.3 Construction

In the RAM model, a kd-tree on $N$ points can be constructed recursively in $O(N\log N)$ time; the root dividing line is found using an $O(N)$ time median algorithm, the points are distributed into two sets according to this line in $O(N)$ time, and the two subtrees are constructed recursively. Since median finding and distribution can be performed cache-obliviously in $O(N/B)$ memory transfers [4,5], a cache-oblivious kd-tree can be constructed in $O(\frac{N}{B}\log N)$ memory transfers. Agarwal et al. [11] showed how to construct $\log\sqrt{N} = \frac{1}{2}\log N$ levels in $O(\mathrm{Sort}_{M,B}(N))$ memory transfers, leading to a recursive construction algorithm using only $O(\mathrm{Sort}_{M,B}(N))$ memory transfers.
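A sketch of the recursive RAM construction follows; the cache-oblivious construction of [11] instead builds $\frac{1}{2}\log N$ levels per recursive step. Here the library routine `statistics.median_low` stands in for a linear-time selection algorithm (so this sketch runs in $O(N\log^2 N)$ time), and distinct coordinates on each axis are assumed.

```python
import statistics

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree over points with distinct coordinates
    per axis, alternating between the two dividing-line orientations."""
    if len(points) == 1:
        return ("leaf", points[0])
    axis = depth % 2                    # alternate splitting dimension
    med = statistics.median_low([p[axis] for p in points])
    left = [p for p in points if p[axis] <= med]    # includes the median
    right = [p for p in points if p[axis] > med]
    return ("node", med, axis,
            build_kdtree(left, depth + 1),
            build_kdtree(right, depth + 1))
```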

35.5.1.4 Updates

In the RAM model a kd-tree $T$ can relatively easily be modified to support deletions in $O(\log N)$ time using global rebuilding. To delete a point from $T$, we simply find the relevant leaf $w$ in $O(\log N)$ time and remove it. We then remove $w$'s parent and connect $w$'s grandparent to $w$'s sibling. The resulting tree is no longer a kd-tree, but it still answers queries in $O(\sqrt{N} + K)$ time, since the standard argument still applies. To ensure that $N$ is proportional to the actual number of points in $T$, the structure is completely rebuilt after $N/2$ deletions. Insertions can be supported in $O(\log^2 N)$ time using the so-called logarithmic method [26], that is, by maintaining $\log N$ kd-trees where the $i$'th kd-tree is either empty or of size $2^i$, and then rebuilding a carefully chosen set of these structures when performing an insertion.
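A sketch of the logarithmic method's insertion routine, with each static structure modeled simply as a sorted list (a kd-tree rebuild in the actual application); names are illustrative.

```python
def logarithmic_insert(structures, point):
    """structures[i] is None or a static structure on 2^i points (modeled
    here as a sorted list). Insert by merging the new point with the
    maximal run of full structures and rebuilding at the first empty slot."""
    carry = [point]
    for i in range(len(structures)):
        if structures[i] is None:
            structures[i] = sorted(carry)   # "rebuild"; |carry| = 2^i here
            return
        carry += structures[i]              # keep merging upward
        structures[i] = None
    structures.append(sorted(carry))        # a new largest structure
```

A query is answered by querying each of the $O(\log N)$ structures, and each point takes part in $O(\log N)$ rebuilds, which is where the $O(\log^2 N)$ amortized insertion bound comes from.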

Deletes in a cache-oblivious kd-tree are basically performed as in the RAM version. However, to still be able to load a subtree $T_v$ with $B$ leaves in $O(1)$ memory transfers and obtain the $O(\sqrt{N/B} + \frac{K}{B})$ query bound, data locality needs to be carefully maintained. By laying out the kd-tree using (a slightly relaxed version of) the exponential layout (Section 35.3.2) rather than the van Emde Boas layout, and by periodically rebuilding parts of this layout, Agarwal et al. [11] showed how to perform a delete in $O(\log_B N)$ memory transfers amortized while maintaining locality. They also showed how a slightly modified version of the logarithmic method and the $O(\mathrm{Sort}_{M,B}(N))$ construction algorithm can be used to perform inserts in $O(\frac{\log N}{B}\log_{M/B} N) = O(\log_B^2 N)$ memory transfers amortized.

Theorem 35.8

There exists a cache-oblivious (kd-tree) data structure for storing a set of $N$ points in the plane using linear space, such that an orthogonal range query can be answered in $O(\sqrt{N/B} + \frac{K}{B})$ memory transfers. The structure can be constructed cache-obliviously in $O(\mathrm{Sort}_{M,B}(N))$ memory transfers and supports updates in $O(\frac{\log N}{B}\log_{M/B} N) = O(\log_B^2 N)$ memory transfers.

35.5.2 Cache-Oblivious Range Tree

The main part of the cache-oblivious range tree structure for answering (four-sided) orthogonal range queries is a structure for answering three-sided queries $Q = [x_l, x_r] \times [y_b, \infty)$, that is, for finding all points with $x$-coordinates in the interval $[x_l, x_r]$ and $y$-coordinates above $y_b$. Below we discuss the two structures separately.

35.5.2.1 Three-Sided Queries

35.5.2.1.1 Structure

Consider dividing the plane into $\sqrt{N}$ vertical slabs $X_1, X_2, \ldots, X_{\sqrt{N}}$ containing $\sqrt{N}$ points each. Using these slabs we define $2\sqrt{N} - 1$ buckets. A bucket is a rectangular region of the plane that completely spans one or more consecutive slabs and is unbounded in the positive $y$-direction, like a three-sided query. To define the $2\sqrt{N} - 1$ buckets, we start with $\sqrt{N}$ active buckets $b_1, b_2, \ldots, b_{\sqrt{N}}$ corresponding to the $\sqrt{N}$ slabs. The $x$-ranges of the slabs define a natural linear ordering on these buckets. We then imagine sweeping a horizontal sweep line from $y = -\infty$ to $y = \infty$. Every time the total number of points above the sweep line in two adjacent active buckets, $b_i$ and $b_j$, in the linear order falls to $\sqrt{N}$, we mark $b_i$ and $b_j$ as inactive. Then we construct a new active bucket spanning the slabs spanned by $b_i$ and $b_j$, with a bottom $y$-boundary equal to the current position of the sweep line. This bucket replaces $b_i$ and $b_j$ in the linear ordering of active buckets. The total number of buckets defined in this way is $2\sqrt{N} - 1$, since we start with $\sqrt{N}$ buckets and the number of active buckets decreases by one every time a new bucket is constructed. Note that the procedure defines an active $y$-interval for each bucket in a natural way. Buckets overlap, but the set of buckets with active $y$-intervals containing a given $y$-value (the buckets active when the sweep line was at that value) are non-overlapping and span all the slabs. This means that the active $y$-intervals of buckets spanning a given slab are non-overlapping. Refer to Figure 35.13a.

fig35_13.jpg

Figure 35.13 (a) Active intervals of buckets spanning slab $X_i$; (b) buckets active at $y_b$.
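The sweep is easy to simulate in code. The sketch below processes the points by increasing $y$-coordinate and maintains, for each active bucket, how many of its points are still above the sweep line, merging adjacent buckets when a pair's total falls to $\sqrt{N}$. It assumes $N$ is a perfect square and all coordinates are distinct; the representation and all names are illustrative.

```python
import bisect
import math

def build_buckets(points):
    """Return (first_slab, last_slab, bottom_y) for all 2*sqrt(N)-1
    buckets, assuming N is a perfect square and distinct coordinates."""
    n = len(points)
    s = math.isqrt(n)                                # sqrt(N) slabs
    xs = sorted(p[0] for p in points)
    bounds = [xs[i * s] for i in range(1, s)]        # slab x-boundaries
    active = [[i, i, s] for i in range(s)]           # [first, last, above]
    buckets = [(i, i, -math.inf) for i in range(s)]  # the initial buckets
    for x, y in sorted(points, key=lambda p: p[1]):  # sweep upward in y
        slab = bisect.bisect_right(bounds, x)
        j = next(k for k, b in enumerate(active) if b[0] <= slab <= b[1])
        active[j][2] -= 1                            # one more point below
        k = 0
        while k + 1 < len(active):                   # merge adjacent pairs
            if active[k][2] + active[k + 1][2] <= s:     # fell to sqrt(N)
                active[k] = [active[k][0], active[k + 1][1],
                             active[k][2] + active[k + 1][2]]
                del active[k + 1]
                buckets.append((active[k][0], active[k][1], y))
            else:
                k += 1
    return buckets
```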

After defining the $2\sqrt{N} - 1$ buckets, we are ready to present the three-sided query data structure; it is defined recursively: It consists of a cache-oblivious B-tree $T$ on the $\sqrt{N}$ boundaries defining the $\sqrt{N}$ slabs, as well as a cache-oblivious B-tree for each of the $\sqrt{N}$ slabs; the tree $T_i$ for slab $i$ contains the bottom endpoints of the active $y$-intervals of the $O(\sqrt{N})$ buckets spanning the slab. For each bucket $b_i$ we also store the $\sqrt{N}$ points in $b_i$ in a list $B_i$ sorted by $y$-coordinate. Finally, recursive structures $S_1, S_2, \ldots, S_{2\sqrt{N}-1}$ are built on the $\sqrt{N}$ points in each of the $2\sqrt{N} - 1$ buckets.

35.5.2.1.2 Layout

The layout of the structure in memory consists of $O(N)$ memory locations containing $T$, then $T_1, \ldots, T_{\sqrt{N}}$ and $B_1, \ldots, B_{2\sqrt{N}-1}$, followed by the recursive structures $S_1, \ldots, S_{2\sqrt{N}-1}$. Thus the total space usage satisfies $S(N) \le 2\sqrt{N}\cdot S(\sqrt{N}) + O(N) = O(N\log N)$.

35.5.2.1.3 Query

To answer a three-sided query $Q$, we consider the buckets whose active $y$-intervals contain $y_b$. These buckets are non-overlapping and together they contain all points in $Q$, since they span all slabs and have bottom $y$-boundaries below $y_b$. We report all points that satisfy $Q$ in each of the buckets with $x$-range completely between $x_l$ and $x_r$. At most two other buckets, $b_l$ and $b_r$ (the ones containing $x_l$ and $x_r$), can contain points in $Q$, and we find these points recursively by advancing the query to $S_l$ and $S_r$. Refer to Figure 35.13b.

We find the buckets $b_l$ and $b_r$ that need to be queried recursively, and report the points in the completely spanned buckets, as follows. We first query $T$ using $O(\log_B \sqrt{N})$ memory transfers to find the slab $X_l$ containing $x_l$. Then we query $T_l$ using another $O(\log_B \sqrt{N})$ memory transfers to find the bucket $b_l$ with active $y$-interval containing $y_b$. We can similarly find $b_r$ in $O(\log_B \sqrt{N})$ memory transfers. If $b_l$ spans slabs $X_l, X_{l+1}, \ldots, X_m$, we then query $T_{m+1}$ with $y_b$ in $O(\log_B \sqrt{N})$ memory transfers to find the active bucket $b_i$ to the right of $b_l$ completely spanned by $Q$ (if it exists). We report the relevant points in $b_i$ by scanning $B_i$ top-down until we encounter a point not contained in $Q$. If $K'$ is the number of reported points, a scan of $B_i$ takes $O(1 + \frac{K'}{B})$ memory transfers. We continue this procedure for each of the completely spanned active buckets. By construction, we know that every two adjacent such buckets contain at least $\sqrt{N}$ points above $y_b$.

First consider the part of the query that takes place on recursive levels of size $N \ge B^2$, such that $\sqrt{N}/B \ge \log_B \sqrt{N} \ge 1$. In this case the $O(\log_B \sqrt{N})$ overhead in finding and processing two consecutive completely spanned buckets is smaller than the $\Omega(\sqrt{N}/B)$ memory transfers used to report the at least $\sqrt{N}$ output points they contain; thus we spend $O(\log_B \sqrt{N} + \frac{K_i}{B})$ memory transfers altogether to answer a query on such a level, not counting the recursive queries, where $K_i$ is the number of points reported on the level. Since we perform at most two queries on each level of the recursion (in the active buckets containing $x_l$ and $x_r$), the total cost over all levels of size at least $B^2$ is $O\big(\sum_{i=1}^{\log\log_B N}\big(\log_B N^{1/2^i} + \frac{K_i}{B}\big)\big) = O(\log_B N + \frac{K}{B})$ transfers.

Next consider the case where $N = B$. In this case the whole level, that is, $T$, $T_1, \ldots, T_{\sqrt{B}}$ and $B_1, \ldots, B_{2\sqrt{B}-1}$, is stored in $O(B)$ contiguous memory locations and can thus be loaded in $O(1)$ memory transfers; the optimal paging strategy can therefore ensure that we only spend $O(1)$ transfers on answering a query. In the case where $N \le \sqrt{B}$, the level and all levels of recursion below it occupy $O(\sqrt{B}\log B) = O(B)$ space, so the optimal paging strategy can load it and all relevant lower levels in $O(1)$ memory transfers. This means that overall we answer a query in $O(\log_B N + \frac{K}{B})$ memory transfers, provided that $N$ and $B$ are such that we have a level of size $B^2$ (and thus also levels of size $B$ and $\sqrt{B}$); when answering a query on a level of size between $B$ and $B^2$ we cannot charge the $O(\log_B \sqrt{N})$ cost of visiting two consecutive active buckets to the fewer than $B$ points found in the two buckets. Agarwal et al. [11] showed how to guarantee that we have a level of size $B^2$ by assuming that $B = 2^{2^d}$ for some non-negative integer $d$. Using a somewhat different construction, Arge et al. [27] showed how to remove this assumption.
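The query logic can be summarized in the following skeleton, in which the B-tree searches on $T$ and $T_i$ are replaced by linear scans over an explicit bucket list. `Bucket` and its fields are simplified, hypothetical stand-ins for the structure above; it is assumed that, at every level, the outermost buckets extend to $\pm\infty$ in $x$ (the slabs partition the whole plane) and the last-created bucket has $y$-interval reaching $+\infty$.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bucket:
    x_lo: float                  # spanned x-range [x_lo, x_hi)
    x_hi: float
    y_lo: float                  # active y-interval [y_lo, y_hi)
    y_hi: float
    points: list                 # the bucket's points, by decreasing y
    sub: Optional[list] = None   # recursive structure S_i (None = base case)

def active_at(buckets, x, yb):
    """The unique bucket whose x-range contains x and whose active
    y-interval contains yb (B-tree searches in the real structure)."""
    return next(b for b in buckets
                if b.x_lo <= x < b.x_hi and b.y_lo <= yb < b.y_hi)

def query3(buckets, xl, xr, yb, out):
    """Report all stored points in [xl, xr] x [yb, inf) into `out`."""
    bl, br = active_at(buckets, xl, yb), active_at(buckets, xr, yb)
    for b in ([bl] if bl is br else [bl, br]):    # the end buckets
        if b.sub is not None:
            query3(b.sub, xl, xr, yb, out)        # recurse into S_l, S_r
        else:                                     # base case: scan directly
            out.extend(p for p in b.points
                       if xl <= p[0] <= xr and p[1] >= yb)
    b = active_at(buckets, bl.x_hi, yb) if bl is not br else br
    while b is not br:                            # completely spanned buckets
        for p in b.points:                        # scan B_i top-down
            if p[1] < yb:
                break
            out.append(p)
        b = active_at(buckets, b.x_hi, yb)
```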

Theorem 35.9

There exists a cache-oblivious data structure for storing $N$ points in the plane using $O(N\log N)$ space, such that a three-sided orthogonal range query can be answered in $O(\log_B N + \frac{K}{B})$ memory transfers.

35.5.2.2 Four-Sided Queries

Using the structure for three-sided queries, we can construct a cache-oblivious range tree structure for four-sided orthogonal range queries in a standard way. The structure consists of a cache-oblivious B-tree $T$ on the $N$ points sorted by $x$-coordinate. With each internal node $v$ we associate a secondary structure for answering three-sided queries on the points stored in the leaves of the subtree rooted at $v$: if $v$ is the left child of its parent, then we have a three-sided structure for answering queries with the opening to the right, and if $v$ is the right child, then we have a three-sided structure for answering queries with the opening to the left. The secondary structures on each level of the tree use $O(N\log N)$ space, for a total space usage of $O(N\log^2 N)$.

To answer an orthogonal range query $Q$, we search down $T$ using $O(\log_B N)$ memory transfers to find the first node $v$ where the left and right $x$-coordinates of $Q$ are contained in different children of $v$. Then we query the right-opening secondary structure of the left child of $v$, and the left-opening secondary structure of the right child of $v$, using $O(\log_B N + \frac{K}{B})$ memory transfers. Refer to Figure 35.14. It is easy to see that this correctly reports all $K$ points in $Q$.

fig35_14.jpg

Figure 35.14 Answering a four-sided query in v using two three-sided queries in v's children.
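The top-level search can be sketched as follows, over a binary tree with hypothetical fields: internal nodes carry a split key and the two secondary structures, and leaves carry a point. Here `open_right.query` and `open_left.query` stand for three-sided queries of the form $[x_l,\infty)\times[y_b,\infty)$ and $(-\infty,x_r]\times[y_b,\infty)$, respectively.

```python
def query4(v, xl, xr, yb, out):
    """Answer Q = [xl, xr] x [yb, inf) from the root v of the range tree;
    the left child of a node holds keys < v.key, the right child >= v.key."""
    # Descend while both x-coordinates of Q fall in the same child.
    while v.left is not None and (xr < v.key or xl >= v.key):
        v = v.left if xr < v.key else v.right
    if v.left is None:                    # reached a leaf: test its point
        if xl <= v.point[0] <= xr and v.point[1] >= yb:
            out.append(v.point)
        return
    # Split node: one right-opening and one left-opening three-sided query.
    v.left.open_right.query(xl, yb, out)  # points with x >= xl, y >= yb
    v.right.open_left.query(xr, yb, out)  # points with x <= xr, y >= yb
```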

Theorem 35.10

There exists a cache-oblivious data structure for storing $N$ points in the plane using $O(N\log^2 N)$ space, such that an orthogonal range query can be answered in $O(\log_B N + \frac{K}{B})$ memory transfers.

Acknowledgments

Lars Arge was supported in part by the National Science Foundation through ITR grant EIA–0112849, RI grant EIA–9972879, CAREER grant CCR–9984099, and U.S.–Germany Cooperative Research Program grant INT–0129182.

Gerth Stølting Brodal was supported by the Carlsberg Foundation (contract number ANS-0257/20), BRICS (Basic Research in Computer Science, www.brics.dk, funded by the Danish National Research Foundation), and the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).

Rolf Fagerberg was supported by BRICS (Basic Research in Computer Science, www.brics.dk, funded by the Danish National Research Foundation), and the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). Part of this work was done while at University of Aarhus.

References

1.A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, Sept. 1988.

2.L. Arge. External memory data structures. In J. Abello, P. M. Pardalos, and M. G. C. Resende, editors, Handbook of Massive Data Sets, pages 313–358. Kluwer Academic Publishers, 2002.

3.J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, 33(2):209–271, June 2001.

4.M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. 40th Annual IEEE Symposium on Foundations of Computer Science, pages 285–298. IEEE Computer Society Press, 1999.

5.H. Prokop. Cache-oblivious algorithms. Master’s thesis, Massachusetts Institute of Technology, Cambridge, MA, June 1999.

6.M. A. Bender, E. D. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In Proc. 41st Annual IEEE Symposium on Foundations of Computer Science, pages 399–409. IEEE Computer Society Press, 2000.

7.M. A. Bender, R. Cole, and R. Raman. Exponential structures for cache-oblivious algorithms. In Proc. 29th International Colloquium on Automata, Languages, and Programming, volume 2380 of Lecture Notes in Computer Science, pages 195–207. Springer, 2002.

8.M. A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 29–38. SIAM, 2002.

9.G. S. Brodal, R. Fagerberg, and R. Jacob. Cache oblivious search trees via binary trees of small height. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 39–48. SIAM, 2002.

10.N. Rahman, R. Cole, and R. Raman. Optimized predecessor data structures for internal memory. In Proc. 3rd Workshop on Algorithm Engineering, volume 2141 of Lecture Notes in Computer Science, pages 67–78. Springer, 2001.

11.P. K. Agarwal, L. Arge, A. Danner, and B. Holland-Minkley. Cache-oblivious data structures for orthogonal range searching. In Proc. 19th ACM Symposium on Computational Geometry, pages 237–245. ACM Press, 2003.

12.G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In Proc. 29th International Colloquium on Automata, Languages, and Programming, volume 2380 of Lecture Notes in Computer Science, pages 426–438. Springer, 2002.

13.M. Bender, E. Demaine, and M. Farach-Colton. Efficient tree layout in a multilevel memory hierarchy. In Proc. 10th Annual European Symposium on Algorithms, volume 2461 of Lecture Notes in Computer Science, pages 165–173. Springer, 2002. Full version at http://www.cs.sunysb.edu/bender/pub/treelayout-full.ps.

14.L. Arge, M. Bender, E. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious priority-queue and graph algorithms. In Proc. 34th ACM Symposium on Theory of Computation, pages 268–276. ACM Press, 2002.

15.G. S. Brodal and R. Fagerberg. Funnel heap - a cache oblivious priority queue. In Proc. 13th International Symposium on Algorithms and Computation, volume 2518 of Lecture Notes in Computer Science, pages 219–228. Springer, 2002.

16.G. S. Brodal, R. Fagerberg, and K. Vinther. Engineering a cache-oblivious sorting algorithm. In Proc. 6th Workshop on Algorithm Engineering and Experiments, 2004.

17.R. E. Ladner, R. Fortna, and B.-H. Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, From Algorithm Design to Robust and Efficient Software (Dagstuhl seminar, September 2000), volume 2547 of Lecture Notes in Computer Science, pages 78–92. Springer, 2002.

18.R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:173–189, 1972.

19.P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6:80–82, 1977.

20.M. A. Bender, G. S. Brodal, R. Fagerberg, D. Ge, S. He, H. Hu, J. Iacono, and A. López-Ortiz. The cost of cache-oblivious searching. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science, pages 271–282. IEEE Computer Society Press, 2003.

21.G. S. Brodal and R. Fagerberg. On the limits of cache-obliviousness. In Proc. 35th ACM Symposium on Theory of Computation, pages 307–315. ACM Press, 2003.

22.A. Andersson and T. W. Lai. Fast updating of well-balanced trees. In Proc. 2nd Scandinavian Workshop on Algorithm Theory, volume 447 of Lecture Notes in Computer Science, pages 111–121. Springer, 1990.

23.A. Itai, A. G. Konheim, and M. Rodeh. A sparse table implementation of priority queues. In Proc. 8th International Colloquium on Automata, Languages, and Programming, volume 115 of Lecture Notes in Computer Science, pages 417–431. Springer, 1981.

24.K. V. R. Kanth and A. K. Singh. Optimal dynamic range searching in non-replicating index structures. In Proc. International Conference on Database Theory, volume 1540 of Lecture Notes in Computer Science, pages 257–276. Springer, 1999.

25.J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, 1975.

26.J. L. Bentley. Decomposable searching problems. Information Processing Letters, 8(5):244–251, 1979.

27.L. Arge, G. S. Brodal, and R. Fagerberg. Improved cache-oblivious two-dimensional orthogonal range searching. Unpublished results, 2004.

*This chapter has been reprinted from first edition of this Handbook, without any content updates.
