The information-theoretic lower bound on the number of bits required to store an arbitrary subset of size n of a universe of size N is ⌈lg (N choose n)⌉, since we have to be able to represent all (N choose n) possible combinations of selecting n values out of the N. Using Stirling's approximation and defining r = N/n, we obtain lg (N choose n) ≈ n ⋅ (lg r + lg e), with an error of less than lg n bits for N sufficiently large relative to n, where e is Euler's constant. Alternatively, using lg (N choose n) = ∑_{i=0}^{n−1} lg ((N − i)/(n − i)) we can approximate the logarithm by two corresponding integrals. If we properly bias the integral limits, we can be sure to compute a lower bound.
For the case of dynamic dictionaries (where insertions and deletions are fully supported), we want to be able to maintain subsets of varying size, say, of zero up to a maximum of n elements. This results in a minimum number of ⌈lg ∑_{i=0}^{n} (N choose i)⌉ bits. For N ≥ 3n (which is usually the case for nontrivial search problems) we have ∑_{i=0}^{n} (N choose i) ≤ 2 ⋅ (N choose n). The correctness follows from the property of binomial coefficients (N choose i − 1) ≤ (1/2) ⋅ (N choose i) for i ≤ n ≤ N/3, so the sum is dominated by a geometric series. We are only interested in the logarithms, so we conclude lg (N choose n) ≤ lg ∑_{i=0}^{n} (N choose i) ≤ lg (N choose n) + 1. Obviously, in this restricted range it is sufficient to concentrate on the last binomial coefficient; the error in our estimate is at most one bit. In the end, as we look only at the logarithms, the dynamic case is not much different from the static case.
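The two bounds can be checked numerically. The following Python sketch (sizes chosen arbitrarily for illustration) computes the exact static bound lg (N choose n), the dynamic bound for subsets of size up to n, and the Stirling-style estimate n ⋅ (lg r + lg e):

```python
import math

def static_bits(N, n):
    """Information-theoretic bound lg C(N, n) for a fixed-size subset."""
    return math.log2(math.comb(N, n))

def dynamic_bits(N, n):
    """Bound lg sum_{i=0}^{n} C(N, i) for subsets of size 0..n."""
    return math.log2(sum(math.comb(N, i) for i in range(n + 1)))

def stirling_estimate(N, n):
    """Approximation n * (lg r + lg e) with r = N/n."""
    r = N / n
    return n * (math.log2(r) + math.log2(math.e))

N, n = 1 << 16, 64                             # example with N >= 3n
print(static_bits(N, n))                       # exact static bound in bits
print(stirling_estimate(N, n))                 # close to the static bound
print(dynamic_bits(N, n) - static_bits(N, n))  # less than one bit
```

For N ≥ 3n the dynamic bound indeed exceeds the static one by less than a single bit, confirming that the dynamic case is essentially as cheap as the static one.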
Suffix Lists
Given B bits of memory, for space-efficient state storage we want to maintain a dynamically evolving visited list Closed under inserts and membership queries. For ease of presentation, the entries of Closed are integers from {0, …, N − 1}. Using hashing with open addressing, the maximal number m of Closed nodes is limited to ⌊B/⌈lg N⌉⌋, since ⌈lg N⌉ bits are required to encode a state. A gain is only to be expected if we can exploit redundancies in the state vector set. In the following we describe a simple but very space-efficient approach with small update and query times.
Let (b_{⌈lg N⌉−1}, …, b_0) be the binary representation of an element u from the set Closed. We split u into its p high bits and its s = ⌈lg N⌉ − p low bits. Furthermore, prefix(u) = (b_{⌈lg N⌉−1}, …, b_s) denotes the prefix of u, and suffix(u) = (b_{s−1}, …, b_0) stands for the suffix of u.
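In code, this split amounts to a shift and a mask. A minimal sketch (the bit widths below are chosen only for illustration):

```python
def split(u, s):
    """Split u into its high-bit prefix and its s low bits (the suffix)."""
    return u >> s, u & ((1 << s) - 1)

# Example with lg N = 8 bits, p = 3 high bits, and s = 5 low bits:
prefix, suffix = split(0b10110101, 5)   # prefix 0b101, suffix 0b10101
```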
A Suffix List data structure consists of a linear array P of size 2^p bits and of a two-dimensional array L of size m ⋅ (s + 1) bits. The basic idea of a Suffix List is to store a common prefix of several entries as a single bit in P, whereas the distinctive suffixes form a group within L.
P is stored as a bit array. L can hold several groups, with each group consisting of a multiple of (s + 1) bits. The first bit of each (s + 1)-bit row in L serves as a group bit. The first s-bit suffix entry of a group has group bit one, and the other elements of the group have group bit zero. We place the elements of a group together in lexicographical order (see Fig. 3.18).
First, we compute prefix(u), which gives us the search position in the prefix array P. If the bit at position prefix(u) is zero, u cannot be contained in the Suffix List. Otherwise, we simply count the number of ones in P starting from position 0 until we reach position prefix(u). Let z be this number. Finally, we search through L until we have found the z-th suffix of L with group bit one. If we have to perform a membership query, we simply search in this group. Note that searching a single entry may require scanning large areas of main memory.
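The search procedure can be sketched as follows. To keep the example minimal, P is a plain Python list of bits and L a list of (group bit, suffix) rows rather than packed bit arrays, and the group is scanned linearly instead of with the binary search introduced later:

```python
def member(P, L, u, s):
    """Membership query: rank over P selects the group in L, then scan it."""
    prefix, suffix = u >> s, u & ((1 << s) - 1)
    if not P[prefix]:
        return False                 # prefix bit zero: no group exists
    z = sum(P[:prefix + 1])          # z = number of ones up to prefix(u)
    seen, group = 0, []
    for gbit, suf in L:              # collect the z-th group of L
        if gbit:
            seen += 1
        if seen == z:
            group.append(suf)
        elif seen > z:
            break
    return suffix in group

# Elements {0b0101, 0b0111, 0b1100} with p = s = 2:
P = [0, 1, 0, 1]
L = [(1, 1), (0, 3), (1, 0)]
```

With this layout, member(P, L, 0b0111, 2) locates the group for prefix 01 and finds suffix 11 inside it.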
To insert entry u we first search the corresponding group as described earlier. In case u opens a new group within L, this involves setting group bits in P and L. The suffix of u is inserted into its group while maintaining the elements of the group in sorted order. Note that an insert may need to shift many rows in L to create space at the desired position. The maximum number m of elements that can be stored in B bits is limited as follows: We need 2^p bits for P and s + 1 bits for each entry of L. Hence, we choose p so that m is maximal subject to 2^p + m ⋅ (s + 1) ≤ B.
For p = ⌈lg m⌉ the space requirement for both P and the suffixes in L is small enough to guarantee m ≈ B/(lg (N/m) + 2), which comes close to the information-theoretic minimum of roughly m ⋅ (lg (N/m) + lg e) bits.
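The parameter choice can be made concrete with a small search over p. The constraint 2^p + m ⋅ (s + 1) ≤ B used below is an assumption matching the space accounting above (2^p bits for P, one (s + 1)-bit row per element):

```python
def best_parameters(B, lgN):
    """Maximize the capacity m over p subject to 2**p + m*(s+1) <= B."""
    best_m, best_p = 0, 0
    for p in range(1, lgN):
        if (1 << p) >= B:
            break                    # P alone would exhaust the memory
        s = lgN - p
        m = (B - (1 << p)) // (s + 1)
        if m > best_m:
            best_m, best_p = m, p
    return best_m, best_p

# B = 2**20 bits of memory, states of lg N = 32 bits:
m, p = best_parameters(1 << 20, 32)
# Hashing with open addressing would store only B / lg N = 32768 states.
```

In this example m comes out noticeably larger than the B/lg N states that plain hashing with open addressing can hold in the same memory.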
We now show how to speed up the operations. When searching or inserting an element u we have to compute z to find the correct group in L. Instead of scanning potentially large parts of P and L for each single query, we maintain checkpoints, called one-counters, that store the number of ones seen so far. Checkpoints are to lie close enough to support rapid search but must not consume more than a small fraction of the main memory. Since the number of ones is bounded by m ≤ B for both arrays, ⌈lg B⌉ bits are sufficient for each one-counter. Keeping one-counters after every Θ(lg² B) bits limits the total space requirement for the checkpoints to a vanishing fraction of B. Binary search on the one-counters of P now reduces the scan area needed to compute the correct value of z to O(lg² B) bits.
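A sketch of the one-counter idea: checkpoints record the ones seen so far, so a rank query touches one checkpoint plus at most one block of P. The spacing of 64 bits is purely illustrative (the text suggests spacing on the order of lg² B), and direct indexing of equally spaced checkpoints here plays the role of the binary search:

```python
CHECK = 64  # checkpoint spacing in bits (illustrative; think lg^2 B)

def build_counters(P):
    """counters[k] = number of ones in P[0 : k*CHECK]."""
    counters = [0]
    for i in range(CHECK, len(P) + 1, CHECK):
        counters.append(counters[-1] + sum(P[i - CHECK:i]))
    return counters

def rank(P, counters, pos):
    """Number of ones in P[0..pos], scanning at most CHECK bits of P."""
    block = pos // CHECK
    return counters[block] + sum(P[block * CHECK:pos + 1])
```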
Searching in L is slightly more difficult because groups could extend over up to 2^s entries, thus potentially spanning several one-counters with equal values. Nevertheless, finding the beginning and the end of large groups is possible within the stated bounds. As we keep the elements within a group sorted, another binary search on the actual entries is sufficient to locate the position in L.
We now turn to insertions, where two problems remain: adding a new element to a group may require shifting large amounts of data, and after each insert the checkpoints must be updated. A simple solution uses a second buffer data structure BU that is less space efficient but supports rapid inserts and lookups. When the number of elements in BU exceeds a certain threshold, BU is merged with the old Suffix List to obtain a new, up-to-date, space-efficient representation. Choosing an appropriate size of BU, amortized analysis shows improved computational bounds for inserts while asymptotically preserving the time bounds of the graph search algorithm.
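A minimal sketch of this buffering scheme, with a Python set standing in for the open-addressing hash table BU and a sorted Python list standing in for the space-efficient main structure; the threshold and merge policy are deliberately simplified:

```python
from bisect import bisect_left

class BufferedSet:
    """Fast buffer BU in front of a sorted, space-efficient main list."""

    def __init__(self, threshold):
        self.main = []            # stand-in for the Suffix List
        self.bu = set()           # stand-in for the hash table BU
        self.threshold = threshold

    def insert(self, u):
        self.bu.add(u)
        if len(self.bu) >= self.threshold:
            # merge: one sorted pass, amortized over `threshold` inserts
            self.main = sorted(set(self.main) | self.bu)
            self.bu.clear()

    def member(self, u):
        if u in self.bu:          # queries must consult BU as well
            return True
        i = bisect_left(self.main, u)
        return i < len(self.main) and self.main[i] == u
```

The amortization argument is visible here: each expensive merge pass is paid for by the `threshold` cheap buffered inserts preceding it.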
Note that membership queries must be extended to BU as well. We implement BU as an array for hashing with open addressing. BU stores at most c2 ⋅ B/lg N elements of size ⌈lg N⌉ bits, for some small constant c2. As long as there is 10% space left in BU, we continue to insert elements into BU; otherwise, BU is sorted and the suffixes are moved from BU into the proper groups of L. The reason not to exploit the full hash table size is, again, to bound the expected search and insert time within BU by a constant number of probes. Altogether, we can prove the following theorem.
Theorem 3.5
(Time Complexity Suffix List) Searching and inserting n items into a Suffix List under space restriction amounts to a runtime of O(n ⋅ lg n).
Proof
For a membership query we perform binary searches on numbers of ⌈lg B⌉ bits or s bits, respectively. So, to search an element we need O(lg B ⋅ (lg B + s)) bit operations, since the binary search on the one-counters inspects O(lg B) numbers of ⌈lg B⌉ bits each, and the binary search within a group inspects O(lg B) suffixes of s bits each. Each of the O(B/lg N) buffer entries consists of O(lg N) bits; hence, sorting the buffer can be done with O(B ⋅ lg B) bit operations. Starting with the biggest occurring keys, merging can be performed in O(1) memory scans. This also includes updating all one-counters. In spite of the additional data structures, we still have a total space consumption of at most B bits. Thus, the total bit complexity for n inserts and membership queries is given by O(n ⋅ lg B ⋅ (lg B + s)). Assuming a machine word length of Θ(lg N), any modification or comparison of entries with O(lg N) bits appearing in a Suffix List can be done using O(1) machine operations. Hence, the total complexity reduces to O(n ⋅ lg n) operations.
The constants can be improved using the following observation: in the case n = (1 + ϵ) ⋅ B for a small ϵ > 0, nearly half of the entries in P will always be zero, namely those that are lexicographically bigger than the prefix of n itself. Cutting the P array at this position leaves more room for L, which in turn enables us to keep more elements.
Table 3.3 compares the Suffix List data structure with hashing with open addressing. The constants for the Suffix List are chosen so that roughly 10% of the B bits are set aside for the one-counters and the buffer BU; that is, if m elements can be treated, we reserve about 0.1 ⋅ B bits to speed up internal computations. For hashing with open addressing we also leave 10% of the memory free to keep the internal computation time moderate. When using a Suffix List instead of hashing, note that only the ratio between n and B is important.
Table 3.3 Fractions of n that can be accommodated in a Suffix List and in hashing with open addressing. Note: n is the size of the search space to be memorized, and B is the number of bits available for storing the data. The columns denote the maximal portion of the state space that can be stored according to the information-theoretic bound, the Suffix List structure, and ordinary hashing for two practical values of n.

| n/B | Upper Bound | Suffix Lists | Hashing, n = 2^20 | Hashing, n = 2^30 |
|---|---|---|---|---|
| 1.05 | 33.2% | 22.7% | 4.3% | 2.9% |
| 1.10 | 32.4% | 21.2% | 4.1% | 2.8% |
| 1.25 | 24.3% | 17.7% | 3.6% | 2.4% |
| 1.50 | 17.4% | 13.4% | 3.0% | 2.0% |
| 2.00 | 11.0% | 9.1% | 2.3% | 1.5% |
| 3.00 | 6.1% | 5.3% | 1.5% | 1.0% |
| 4.00 | 4.1% | 3.7% | 1.1% | 0.7% |
| 8.00 | 1.7% | 1.5% | 0.5% | 0.4% |
| 16.00 | 0.7% | 0.7% | 0.3% | 0.2% |
Hence, the Suffix List data structure can close the memory gap in search algorithms between the information-theoretic optimum and trivial approaches like hashing with open addressing.