120 High Performance Visualization
number of array-based research databases, such as SciDB [6] or MonetDB [4],
provide advanced index/query technology, are optimized for scientific data.
However, scientific data is commonly stored outside of a database system. In
practice, it is often too expensive to make copies of extremely large scientific
data sets for the purpose of indexing, and the conversion of data to different
file formats may break downstream data processing pipelines. Query-based
analyses, hence, require efficient data interfaces that make semantic indexing
methods accessible within state-of-the-art scientific data formats (see 7.2.2).
The first section, 7.2.1, discusses bitmap indexing in detail. Then, in 7.2.2,
FastQuery [9] is introduced as an example index and query system for integrat-
ing bitmap indexing technology with state-of-the-art scientific data formats,
including extensions of bitmap indexing to the processing of queries on large
distributed-memory platforms.
7.2.1 Bitmap Indexing
A bitmap index stores the bulk of its information in a series of bit sequences
known as bitmaps. Figure 7.1 shows a small example with one variable named
I, with a value range of 0, 1, 2, or 3. For each of the four possible values,
a separate bitmap is created to record which rows, or records, contain the
corresponding value. For example, the value 0 only appears in row 1. The
bitmap representing the value 0, hence, contains a 1, as the first bit, while the
remaining bits are 0. This is the most basic form of a bitmap index [22].
A scientific data set often involves many variables, and a query may involve
an arbitrary combination of the variables. When the B-Tree index [10] or a
multidimensional index [13] is used, the most efficient way to answer a query
is to create an index with the variables involved in the query. This approach
requires a different index for each combination of variables. Because the num-
ber of possible combinations is large, a large amount of disk space is needed
to store the different indices. These combinations are necessary because the
results from these tree-based indices cannot be easily combined to produce
the final answer. In contrast, when using a bitmap index to answer a query, a
bitmap is produced to represent the answer. A multivariate query can be an-
swered by resolving the condition on each variable separately, using the bitmap
index for that variable, and then combining all the intermediate answers with
bitwise logical operations to produce the final answer. In this case, the total
size of all bitmap indices grows linearly with the number of variables in the
data. Typically, the bitwise logical operations are well-supported by computer
hardware. Therefore, the bitmap indices can answer queries efficiently.
The variable I shown in Figure 7.1 has only four possible values. In general,
for a variable with C distinct values, the basic bitmap index needs C bitmaps
of N bits each, where N is the number of rows (or records) in the data set [8].
If C is the cardinality of the variable, then in a worst-case scenario, C = N ,
and the corresponding bitmap index has N
2
bits. Even for a moderate size
data set, N
2
bits can easily fill up all disks of a large computer system. Ideally,