Chapter 7
Query-Driven Visualization and Analysis
Oliver Rübel
Lawrence Berkeley National Laboratory
E. Wes Bethel
Lawrence Berkeley National Laboratory
Prabhat
Lawrence Berkeley National Laboratory
Kesheng Wu
Lawrence Berkeley National Laboratory
7.1 Introduction ...................................................... 118
7.2 Data Subsetting and Performance ............................... 119
7.2.1 Bitmap Indexing ......................................... 120
7.2.2 Data Interfaces .......................................... 122
7.3 Formulating Multivariate Queries ............................... 124
7.3.1 Parallel Coordinates Multivariate Query Interface ...... 125
7.3.2 Segmenting Query Results .............................. 127
7.4 Applications of Query-Driven Visualization ..................... 129
7.4.1 Applications in Forensic Cybersecurity ................. 129
7.4.2 Applications in High Energy Physics .................... 132
7.4.2.1 Linear Particle Accelerator ................ 133
7.4.2.2 Laser Plasma Particle Accelerator ......... 135
7.5 Conclusion ........................................................ 139
References .......................................................... 140
This chapter focuses on an approach to high performance visualization and
analysis, termed query-driven visualization and analysis (QDV). QDV aims
to reduce the amount of data that needs to be processed by the visualiza-
tion, analysis, and rendering pipelines. The goal of the data reduction process
is to separate out data that is “scientifically interesting” and to focus visu-
alization, analysis, and rendering on that interesting subset. The premise is
that for any given visualization or analysis task, the data subset of interest
is much smaller than the larger, complete data set. This strategy of extracting
smaller data subsets of interest and focusing the visualization processing on
these subsets is complementary to the approach of increasing the capacity of
the visualization, analysis, and rendering pipelines through parallelism,
which is the subject of many other chapters in this book. This chapter dis-
cusses the fundamental concepts in QDV, their relationship to different stages
in the visualization and analysis pipelines, and presents QDV’s application to
problems in diverse areas, ranging from forensic cybersecurity to high energy
physics.
7.1 Introduction
Query-driven visualization and analysis (QDV) addresses the big-data
analysis challenge by focusing on data that is “scientifically interesting” and,
hence, reducing the amount of data that needs to be processed by the visu-
alization, analysis, and rendering pipelines. The data reduction is typically
based on range queries, which are used to constrain the data variables of in-
terest. The term QDV was coined by Stockinger et al. [31, 30] to describe the
combination of high performance index/query methods with visual data ex-
ploration methods. Using this approach, the complexity of the visualization,
analysis and rendering is reduced to O(k), with k being the number of ob-
jects retrieved by the query. The premise of QDV, then, is that for any given
visualization or analysis task, the data subset of interest, k, is much smaller
than the larger, complete data set. QDV methods are among a small subset
of techniques that are able to address both large and highly complex data.
In a typical QDV task, the user begins the analysis by forming definitions
for data subsets of interest. This characterization is done through the concept
of range queries, that is, the user defines a set of constraints for variables of
interest. For example, in the context of a combustion simulation, a scientist
may only be interested in regions of a particular temperature t and pressure
p, such that (1000F < t < 1500F) AND (p < 800Pa). QDV uses such range
queries to filter data before it is passed to the subsequent data processing
pipelines, thereby, focusing the visualization and analysis exclusively on the
data that is meaningful to the user. In practice, a user may not have a pre-
cise notion of which parts of the data are of the most interest. Rather than
specifying the exact data subset(s) of interest, a user may begin the analysis
by excluding large data portions that are known to be of no interest and/or
specifying regions that are known to contain the data of interest. Through
a process of iterative refinement of queries and analysis of query results, the
user is able to gain further insight into the data and identify and specify more
precisely the data subset(s) of interest.
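
To make the notion of a range query concrete, the sketch below evaluates the combustion query from above as a brute-force filter over in-memory arrays. The variable names and synthetic values are purely illustrative; a real QDV system would evaluate the same condition through an index rather than scanning every record.

import numpy as np

# Hypothetical combustion data: one temperature and one pressure value per record.
temperature = np.random.uniform(500.0, 2000.0, size=1_000_000)   # degrees F
pressure = np.random.uniform(200.0, 1200.0, size=1_000_000)      # Pa

# The range query (1000F < t < 1500F) AND (p < 800Pa) expressed as a boolean mask.
mask = (temperature > 1000.0) & (temperature < 1500.0) & (pressure < 800.0)

# Only the k selected records are handed to the downstream visualization and
# analysis pipelines; everything else is filtered out up front.
selected_ids = np.nonzero(mask)[0]
print(f"{selected_ids.size} of {temperature.size} records satisfy the query")

Note that this filter still touches all N records; the indexing methods discussed in 7.2 exist precisely to avoid that cost.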
QDV inherently relies on two main technologies: fast methods for evaluat-
ing data queries; and efficient methods for specification, validation, visualiza-
tion, and analysis of query results. In the context of QDV of scientific data,
approaches based on bitmap indexing have been shown to be very effective for
accelerating multivariate data queries, which will be the focus of 7.2. Then, 7.3
discusses different methods used for specification, validation, and visualization
of data queries, with a particular focus on parallel coordinates as an effective
interface for formulating multivariate data queries (see 7.3.1). To facilitate the
query-based data analysis process, the QDV approach has also been combined
with automated analysis methods to suggest further query refinements, as well
as to automate the definition of queries and extraction of features of inter-
est. As an example, 7.3.2 discusses various methods for the segmentation and
labeling of query components. To illustrate the effectiveness and wide applica-
bility of QDV methods, two case studies are investigated: the applications of
QDV to network traffic data in 7.4.1, and particle accelerator simulation data
in 7.4.2. This chapter concludes with an evaluation of the QDV approach and
a discussion of future challenges in 7.5.
7.2 Data Subsetting and Performance
Efficient methods for data subsetting are fundamental to the concept of
QDV. The database community has developed a variety of indexing methods
to accelerate data searches. Many popular database systems use variations
of the B-tree [2]. The B-tree was designed for transaction-type applications,
exemplified by interactions between a bank and its customer. Interactions
with transactional data are characterized by the frequent modification of data
records, one record at a time, and retrieval of a relatively small number of data
records, such as the look-up of a customer’s banking information. In contrast,
scientific data is typically not modified after creation, but is processed in a
read-only fashion. Furthermore, scientific search operations typically retrieve
many more data records than typical transaction-type queries, and the num-
ber of records may also vary considerably more. For example, when studying
the ignition process in a combustion data set, a query that isolates regions
of high temperature may initially only retrieve very few data records; how-
ever, as the combustion process progresses from an initial spark into a large
flame, the same query may result in the retrieval of a sizable fraction of the
data. Scientific search operations also often involve a large number of data
dimensions. For example, when analyzing the combustion process, a scientist
may be interested in regions that exhibit high temperature, high pressure, and
contain a high/low concentration of different chemical species. For these types
of versatile, high-dimensional, read-only query operations, the bitmap index
is a more appropriate indexing structure [28], which is the subject of 7.2.1.
There is a pressing need for efficient data interfaces that make indexing
technology accessible to the scientific data processing pipelines. Across sci-
entific applications, array-based data models are most commonly used for
data storage. However, common array-based scientific data formats, such as
HDF5 [32] and NetCDF [33], do not support semantic indexing methods. A
number of array-based research databases, such as SciDB [6] or MonetDB [4],
provide advanced index/query technology and are optimized for scientific data.
However, scientific data is commonly stored outside of a database system. In
practice, it is often too expensive to make copies of extremely large scientific
data sets for the purpose of indexing, and the conversion of data to different
file formats may break downstream data processing pipelines. Query-based
analyses, hence, require efficient data interfaces that make semantic indexing
methods accessible within state-of-the-art scientific data formats (see 7.2.2).
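
As a point of reference, the following sketch shows the kind of query evaluation that is possible on HDF5 data today without any index: the full arrays are read and scanned. The file name and dataset paths are hypothetical. A system such as FastQuery instead attaches bitmap indexes to the arrays inside the same file format, so that only the selected records need to be read.

import numpy as np
import h5py

# Hypothetical file layout: two one-dimensional datasets holding per-record values.
with h5py.File("combustion.h5", "r") as f:
    t = f["/fields/temperature"][...]   # reads the full array; a scan, not an indexed query
    p = f["/fields/pressure"][...]

hits = np.nonzero((t > 1000.0) & (t < 1500.0) & (p < 800.0))[0]
print(f"query selected {hits.size} of {t.size} records")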
The first section, 7.2.1, discusses bitmap indexing in detail. Then, in 7.2.2,
FastQuery [9] is introduced as an example index and query system for integrat-
ing bitmap indexing technology with state-of-the-art scientific data formats,
including extensions of bitmap indexing to the processing of queries on large
distributed-memory platforms.
7.2.1 Bitmap Indexing
A bitmap index stores the bulk of its information in a series of bit sequences
known as bitmaps. Figure 7.1 shows a small example with one variable named
I, with a value range of 0, 1, 2, or 3. For each of the four possible values,
a separate bitmap is created to record which rows, or records, contain the
corresponding value. For example, the value 0 only appears in row 1. The
bitmap representing the value 0, hence, contains a 1, as the first bit, while the
remaining bits are 0. This is the most basic form of a bitmap index [22].
A scientific data set often involves many variables, and a query may involve
an arbitrary combination of the variables. When the B-Tree index [10] or a
multidimensional index [13] is used, the most efficient way to answer a query
is to create an index with the variables involved in the query. This approach
requires a different index for each combination of variables. Because the num-
ber of possible combinations is large, a large amount of disk space is needed
to store the different indices. These combinations are necessary because the
results from these tree-based indices cannot be easily combined to produce
the final answer. In contrast, when using a bitmap index to answer a query, a
bitmap is produced to represent the answer. A multivariate query can be an-
swered by resolving the condition on each variable separately, using the bitmap
index for that variable, and then combining all the intermediate answers with
bitwise logical operations to produce the final answer. In this case, the total
size of all bitmap indices grows linearly with the number of variables in the
data. Typically, the bitwise logical operations are well-supported by computer
hardware. Therefore, the bitmap indices can answer queries efficiently.
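
The following sketch illustrates this evaluation strategy with the basic, equality-encoded index of Figure 7.1. Plain boolean arrays stand in for the bitmaps, and the second attribute J is hypothetical; a production index would store the bitmaps in compressed form, as discussed below.

import numpy as np

def build_equality_bitmaps(values):
    # One bitmap per distinct value: bit i is set iff row i holds that value.
    return {v: (values == v) for v in np.unique(values)}

# The attribute I from Figure 7.1, plus a second, hypothetical attribute J.
I = np.array([0, 1, 2, 3, 3, 3, 3, 1])
J = np.array([1, 0, 1, 1, 0, 1, 0, 0])

bitmaps_I = build_equality_bitmaps(I)
bitmaps_J = build_equality_bitmaps(J)

# Multivariate query (I = 3) AND (J = 1): each condition is resolved from its own
# index, and the intermediate answers are combined with a bitwise AND.
answer = bitmaps_I[3] & bitmaps_J[1]
print(np.nonzero(answer)[0])   # row positions (0-based) satisfying the query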
The variable I shown in Figure 7.1 has only four possible values. In general,
for a variable with C distinct values, the basic bitmap index needs C bitmaps
of N bits each, where N is the number of rows (or records) in the data set [8].
If C is the cardinality of the variable, then in a worst-case scenario, C = N,
and the corresponding bitmap index has N² bits. Even for a moderate size
data set, N² bits can easily fill up all disks of a large computer system. Ideally,
the index size should be proportional to N, instead of N². The techniques for
controlling the bitmap index sizes roughly fall in three categories: compression,
encoding, and binning.

FIGURE 7.1: A sample bitmap index where RID is the record ID and I is the
integer attribute with values in the range of 0 to 3.

    RID   I    =0 (b1)   =1 (b2)   =2 (b3)   =3 (b4)
      1   0       1         0         0         0
      2   1       0         1         0         0
      3   2       0         0         1         0
      4   3       0         0         0         1
      5   3       0         0         0         1
      6   3       0         0         0         1
      7   3       0         0         0         1
      8   1       0         1         0         0
Compression: In principle, any lossless compression method can be used
to compress bitmap indices. In practice, the most efficient bitmap compression
methods are based on run-length encoding, as they support bitwise logical op-
erations on the compressed data without decompressing the bitmaps [1, 37].
Working directly with the compressed bitmaps reduces the time needed for
reading the bitmaps from disk to memory, as well as the amount of time to
perform the necessary computations. Furthermore, using Word-Aligned Hy-
brid (WAH) compression, it was shown that the time needed to evaluate a
query is, in the worst case, a linear function of the number of records h that
satisfy the query [38]. Therefore, the WAH compressed bitmap index has an
optimal computational complexity of O(h). In the context of QDV, this means
that the computational complexity of the analysis no longer depends on the
total number of data records, N, but only on the number of records, h, that
define the feature(s) of interest specified by the query.
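
The following toy run-length encoder shows why the sparse answer bitmaps produced by selective queries compress so well. It is a simplification for illustration only: WAH additionally aligns runs to machine words and supports the bitwise logical operations directly on the compressed representation, which this sketch does not attempt.

import numpy as np

def run_length_encode(bits):
    # Encode a bit sequence as a list of (bit value, run length) pairs.
    runs = []
    i, n = 0, len(bits)
    while i < n:
        j = i
        while j < n and bits[j] == bits[i]:
            j += 1
        runs.append((int(bits[i]), j - i))
        i = j
    return runs

# A sparse bitmap, as produced by a selective query: long runs of 0s collapse
# into a handful of pairs.
bitmap = np.zeros(1000, dtype=bool)
bitmap[[5, 6, 7, 700]] = True
print(run_length_encode(bitmap))
# [(0, 5), (1, 3), (0, 692), (1, 1), (0, 299)]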
Encoding: The basic bitmap encoding used in Figure 7.1 is also called
equality encoding, as it shows the best performance for equality queries, such as
temperature, t = 100F. Other well-known bitmap encoding methods include
range encoding [7] and interval encoding [8]. They are better suited for other
query types, such as (35.8Pa < p < 56.7Pa). There are also a number of
different ways of composing more complex bitmap indices by using combinations
of the equality, range, and interval encoding [40].
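
As an illustration of why the choice of encoding matters, the sketch below builds a range-encoded index, in which the bitmap for a value v marks all rows with values less than or equal to v. A two-sided range condition then touches at most two bitmaps, whereas an equality-encoded index would have to OR together one bitmap per value inside the range. The data values are again illustrative.

import numpy as np

def build_range_bitmaps(values):
    # Range encoding: the bitmap for value v marks the rows with values <= v.
    return {v: (values <= v) for v in np.unique(values)}

x = np.array([0, 1, 2, 3, 3, 3, 3, 1])
bitmaps = build_range_bitmaps(x)

# The condition (0 < x <= 2) needs only two bitmaps: the rows with x <= 2 that
# are not also rows with x <= 0.
answer = bitmaps[2] & ~bitmaps[0]
print(np.nonzero(answer)[0])   # rows 1, 2, and 7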
Binning: The basic bitmap index works well for low-cardinality attributes,
such as “gender,” “types of cars sold per month,” or “airplane models produced
by Airbus and Boeing.” However, for scientific data, the variables often have a
large value-range, while different values are hardly ever repeated. These types
of scientific data are referred to as high-cardinality data. A bitmap index that
uses binning controls the index size for such data by associating each bitmap
with a bin, that is, a range of values, rather than with a single distinct value;
queries are then answered from the bin bitmaps, with an additional candidate
check of the records in bins that straddle the query boundaries.
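
A minimal sketch of this idea, assuming uniform-width bins over a hypothetical floating-point temperature field, is given below. Bins that lie entirely inside the query range contribute hits directly; only the bin that straddles the query boundary requires a candidate check against the raw values.

import numpy as np

# Hypothetical high-cardinality variable: values are essentially never repeated,
# so one bitmap per distinct value would be impractical.
temperature = np.random.uniform(500.0, 2000.0, size=100_000)

# 100 uniform bins over the value range, one membership bitmap per bin.
edges = np.linspace(500.0, 2000.0, num=101)
bin_of_row = np.digitize(temperature, edges) - 1
bin_bitmaps = [bin_of_row == b for b in range(100)]

# Query: temperature > 1000F. Bins entirely above 1000F are hits outright; the
# bin containing 1000F only yields candidates that must be checked against the
# raw data (the "candidate check" cost of binning).
boundary = int(np.digitize(1000.0, edges)) - 1
sure_hits = np.any(bin_bitmaps[boundary + 1:], axis=0)
candidates = bin_bitmaps[boundary]
hits = sure_hits | (candidates & (temperature > 1000.0))
print(f"{hits.sum()} records satisfy the query")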