Chapter 7
Query-Driven Visualization and Analysis
Oliver Rübel
Lawrence Berkeley National Laboratory
E. Wes Bethel
Lawrence Berkeley National Laboratory
Prabhat
Lawrence Berkeley National Laboratory
Kesheng Wu
Lawrence Berkeley National Laboratory
7.1 Introduction ...................................................... 118
7.2 Data Subsetting and Performance ............................... 119
7.2.1 Bitmap Indexing ......................................... 120
7.2.2 Data Interfaces .......................................... 122
7.3 Formulating Multivariate Queries ............................... 124
7.3.1 Parallel Coordinates Multivariate Query Interface ...... 125
7.3.2 Segmenting Query Results .............................. 127
7.4 Applications of Query-Driven Visualization ..................... 129
7.4.1 Applications in Forensic Cybersecurity ................. 129
7.4.2 Applications in High Energy Physics .................... 132
7.4.2.1 Linear Particle Accelerator ................ 133
7.4.2.2 Laser Plasma Particle Accelerator ......... 135
7.5 Conclusion ........................................................ 139
References .......................................................... 140
This chapter focuses on an approach to high performance visualization and
analysis, termed query-driven visualization and analysis (QDV). QDV aims
to reduce the amount of data that needs to be processed by the visualiza-
tion, analysis, and rendering pipelines. The goal of the data reduction process
is to separate out data that is “scientifically interesting” and to focus visu-
alization, analysis, and rendering on that interesting subset. The premise is
that for any given visualization or analysis task, the data subset of interest
is much smaller than the larger, complete data set. This strategy of extracting
smaller data subsets of interest and focusing the visualization processing on
these subsets is complementary to the approach of increasing the capacity of
the visualization, analysis, and rendering pipelines through parallelism,
which is the subject of many other chapters in this book. This chapter dis-
cusses the fundamental concepts in QDV, their relationship to different stages
in the visualization and analysis pipelines, and presents QDV’s application to
problems in diverse areas, ranging from forensic cybersecurity to high energy
physics.
7.1 Introduction
Query-driven visualization and analysis (QDV) addresses the big-data
analysis challenge by focusing on data that is “scientifically interesting” and,
hence, reducing the amount of data that needs to be processed by the visu-
alization, analysis, and rendering pipelines. The data reduction is typically
based on range queries, which are used to constrain the data variables of in-
terest. The term QDV was coined by Stockinger et al. [31, 30] to describe the
combination of high performance index/query methods with visual data ex-
ploration methods. Using this approach, the complexity of the visualization,
analysis and rendering is reduced to O(k), with k being the number of ob-
jects retrieved by the query. The premise of QDV, then, is that for any given
visualization or analysis task, the data subset of interest, k, is much smaller
than the larger, complete data set. QDV methods are among a small subset
of techniques that are able to address both large and highly complex data.
In a typical QDV task, the user begins the analysis by forming definitions
for data subsets of interest. This characterization is done through the concept
of range queries, that is, the user defines a set of constraints for variables of
interest. For example, in the context of a combustion simulation, a scientist
may only be interested in regions of a particular temperature t and pressure
p, such that (1000F < t < 1500F) AND (p < 800Pa). QDV uses such range
queries to filter data before it is passed to the subsequent data processing
pipelines, thereby, focusing the visualization and analysis exclusively on the
data that is meaningful to the user. In practice, a user may not have a pre-
cise notion of which parts of the data are of the most interest. Rather than
specifying the exact data subset(s) of interest, a user may begin the analysis
by excluding large data portions that are known to be of no interest and/or
specifying regions that are known to contain the data of interest. Through
a process of iterative refinement of queries and analysis of query results, the
user is able to gain further insight into the data and identify and specify more
precisely the data subset(s) of interest.
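
To make the notion of a range query concrete, the sketch below evaluates the combustion query from above as a brute-force filter over in-memory arrays. The variable names and synthetic values are purely illustrative; a real QDV system would evaluate the same condition through an index rather than scanning every record.

import numpy as np

# Hypothetical combustion data: one temperature and one pressure value per record.
temperature = np.random.uniform(500.0, 2000.0, size=1_000_000)   # degrees F
pressure = np.random.uniform(200.0, 1200.0, size=1_000_000)      # Pa

# The range query (1000F < t < 1500F) AND (p < 800Pa) expressed as a boolean mask.
mask = (temperature > 1000.0) & (temperature < 1500.0) & (pressure < 800.0)

# Only the k selected records are handed to the downstream visualization and
# analysis pipelines; everything else is filtered out up front.
selected_ids = np.nonzero(mask)[0]
print(f"{selected_ids.size} of {temperature.size} records satisfy the query")

Note that this filter still touches all N records; the indexing methods discussed in 7.2 exist precisely to avoid that cost.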
QDV inherently relies on two main technologies: fast methods for evaluat-
ing data queries; and efficient methods for specification, validation, visualiza-
tion, and analysis of query results. In the context of QDV of scientific data,
approaches based on bitmap indexing have been shown to be very effective for
accelerating multivariate data queries, which will be the focus of 7.2. Then, 7.3
discusses different methods used for specification, validation, and visualization
of data queries, with a particular focus on parallel coordinates as an effective
interface for formulating multivariate data queries (see 7.3.1). To facilitate the
query-based data analysis process, the QDV approach has also been combined
with automated analysis methods to suggest further query refinements, as well
as to automate the definition of queries and extraction of features of inter-
est. As an example, 7.3.2 discusses various methods for the segmentation and
labeling of query components. To illustrate the effectiveness and wide applica-
bility of QDV methods, two case studies are investigated: the applications of
QDV to network traffic data in 7.4.1, and particle accelerator simulation data
in 7.4.2. This chapter concludes with an evaluation of the QDV approach and
a discussion of future challenges in 7.5.
7.2 Data Subsetting and Performance
Efficient methods for data subsetting are fundamental to the concept of
QDV. The database community has developed a variety of indexing methods
to accelerate data searches. Many popular database systems use variations
of the B-tree [2]. The B-tree was designed for transaction-type applications,
exemplified by interactions between a bank and its customer. Interactions
with transactional data are characterized by the frequent modification of data
records, one record at a time, and retrieval of a relatively small number of data
records, such as the look-up of a customer’s banking information. In contrast,
scientific data is typically not modified after creation, but is processed in a
read-only fashion. Furthermore, scientific search operations typically retrieve
many more data records than typical transaction-type queries, and the num-
ber of records may also vary considerably more. For example, when studying
the ignition process in a combustion data set, a query that isolates regions
of high temperature may initially only retrieve very few data records; how-
ever, as the combustion process progresses from an initial spark into a large
flame, the same query may result in the retrieval of a sizable fraction of the
data. Scientific search operations also often involve a large number of data
dimensions. For example, when analyzing the combustion process, a scientist
may be interested in regions that exhibit high temperature, high pressure, and
contain a high/low concentration of different chemical species. For these types
of versatile, high-dimensional, read-only query operations, the bitmap index
is a more appropriate indexing structure [28], which is the subject of 7.2.1.
There is a pressing need for efficient data interfaces that make indexing
technology accessible to the scientific data processing pipelines. Across sci-
entific applications, array-based data models are most commonly used for
data storage. However, common array-based scientific data formats, such as
HDF5 [32] and NetCDF [33], do not support semantic indexing methods. A
number of array-based research databases, such as SciDB [6] or MonetDB [4],
provide advanced index/query technology and are optimized for scientific data.
However, scientific data is commonly stored outside of a database system. In
practice, it is often too expensive to make copies of extremely large scientific
data sets for the purpose of indexing, and the conversion of data to different
file formats may break downstream data processing pipelines. Query-based
analyses, hence, require efficient data interfaces that make semantic indexing
methods accessible within state-of-the-art scientific data formats (see 7.2.2).
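
As a point of reference, the following sketch shows the kind of query evaluation that is possible on HDF5 data today without any index: the full arrays are read and scanned. The file name and dataset paths are hypothetical. A system such as FastQuery instead attaches bitmap indexes to the arrays inside the same file format, so that only the selected records need to be read.

import numpy as np
import h5py

# Hypothetical file layout: two one-dimensional datasets holding per-record values.
with h5py.File("combustion.h5", "r") as f:
    t = f["/fields/temperature"][...]   # reads the full array; a scan, not an indexed query
    p = f["/fields/pressure"][...]

hits = np.nonzero((t > 1000.0) & (t < 1500.0) & (p < 800.0))[0]
print(f"query selected {hits.size} of {t.size} records")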
The first section, 7.2.1, discusses bitmap indexing in detail. Then, in 7.2.2,
FastQuery [9] is introduced as an example index and query system for integrat-
ing bitmap indexing technology with state-of-the-art scientific data formats,
including extensions of bitmap indexing to the processing of queries on large
distributed-memory platforms.
7.2.1 Bitmap Indexing
A bitmap index stores the bulk of its information in a series of bit sequences
known as bitmaps. Figure 7.1 shows a small example with one variable named
I, with a value range of 0, 1, 2, or 3. For each of the four possible values,
a separate bitmap is created to record which rows, or records, contain the
corresponding value. For example, the value 0 only appears in row 1. The
bitmap representing the value 0, hence, contains a 1, as the first bit, while the
remaining bits are 0. This is the most basic form of a bitmap index [22].
A scientific data set often involves many variables, and a query may involve
an arbitrary combination of the variables. When the B-Tree index [10] or a
multidimensional index [13] is used, the most efficient way to answer a query
is to create an index with the variables involved in the query. This approach
requires a different index for each combination of variables. Because the num-
ber of possible combinations is large, a large amount of disk space is needed
to store the different indices. These combinations are necessary because the
results from these tree-based indices cannot be easily combined to produce
the final answer. In contrast, when using a bitmap index to answer a query, a
bitmap is produced to represent the answer. A multivariate query can be an-
swered by resolving the condition on each variable separately, using the bitmap
index for that variable, and then combining all the intermediate answers with
bitwise logical operations to produce the final answer. In this case, the total
size of all bitmap indices grows linearly with the number of variables in the
data. Typically, the bitwise logical operations are well-supported by computer
hardware. Therefore, the bitmap indices can answer queries efficiently.
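
The following sketch illustrates this evaluation strategy with the basic, equality-encoded index of Figure 7.1. Plain boolean arrays stand in for the bitmaps, and the second attribute J is hypothetical; a production index would store the bitmaps in compressed form, as discussed below.

import numpy as np

def build_equality_bitmaps(values):
    # One bitmap per distinct value: bit i is set iff row i holds that value.
    return {v: (values == v) for v in np.unique(values)}

# The attribute I from Figure 7.1, plus a second, hypothetical attribute J.
I = np.array([0, 1, 2, 3, 3, 3, 3, 1])
J = np.array([1, 0, 1, 1, 0, 1, 0, 0])

bitmaps_I = build_equality_bitmaps(I)
bitmaps_J = build_equality_bitmaps(J)

# Multivariate query (I = 3) AND (J = 1): each condition is resolved from its own
# index, and the intermediate answers are combined with a bitwise AND.
answer = bitmaps_I[3] & bitmaps_J[1]
print(np.nonzero(answer)[0])   # row positions (0-based) satisfying the query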
The variable I shown in Figure 7.1 has only four possible values. In general,
for a variable with C distinct values, the basic bitmap index needs C bitmaps
of N bits each, where N is the number of rows (or records) in the data set [8].
If C is the cardinality of the variable, then in a worst-case scenario, C = N,
and the corresponding bitmap index has N² bits. Even for a moderate size
data set, N² bits can easily fill up all disks of a large computer system. Ideally,
the index size should be proportional to N, instead of N². The techniques for
controlling the bitmap index sizes roughly fall in three categories: compression,
encoding, and binning.

FIGURE 7.1: A sample bitmap index where RID is the record ID and I is the
integer attribute with values in the range of 0 to 3.

    RID   I    =0 (b1)   =1 (b2)   =2 (b3)   =3 (b4)
      1   0       1         0         0         0
      2   1       0         1         0         0
      3   2       0         0         1         0
      4   3       0         0         0         1
      5   3       0         0         0         1
      6   3       0         0         0         1
      7   3       0         0         0         1
      8   1       0         1         0         0
Compression: In principle, any lossless compression method can be used
to compress bitmap indices. In practice, the most efficient bitmap compression
methods are based on run-length encoding, as they support bitwise logical op-
erations on the compressed data without decompressing the bitmaps [1, 37].
Working directly with the compressed bitmaps reduces the time needed for
reading the bitmaps from disk to memory, as well as the amount of time to
perform the necessary computations. Furthermore, using Word-Aligned Hy-
brid (WAH) compression, it was shown that the time needed to evaluate a
query is, in the worst case, a linear function of the number of records h that
satisfy the query [38]. Therefore, the WAH compressed bitmap index has an
optimal computational complexity of O(h). In the context of QDV, this means
that the computational complexity of the analysis no longer depends on the
total number of data records, N, but only on the number of records, h, that
define the feature(s) of interest specified by the query.
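
The following toy run-length encoder shows why the sparse answer bitmaps produced by selective queries compress so well. It is a simplification for illustration only: WAH additionally aligns runs to machine words and supports the bitwise logical operations directly on the compressed representation, which this sketch does not attempt.

import numpy as np

def run_length_encode(bits):
    # Encode a bit sequence as a list of (bit value, run length) pairs.
    runs = []
    i, n = 0, len(bits)
    while i < n:
        j = i
        while j < n and bits[j] == bits[i]:
            j += 1
        runs.append((int(bits[i]), j - i))
        i = j
    return runs

# A sparse bitmap, as produced by a selective query: long runs of 0s collapse
# into a handful of pairs.
bitmap = np.zeros(1000, dtype=bool)
bitmap[[5, 6, 7, 700]] = True
print(run_length_encode(bitmap))
# [(0, 5), (1, 3), (0, 692), (1, 1), (0, 299)]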
Encoding: The basic bitmap encoding used in Figure 7.1 is also called
equality encoding, as it shows the best performance for equality queries, such as
temperature, t = 100F. Other well-known bitmap encoding methods include
range encoding [7] and interval encoding [8]. They are better suited for other
query types, such as (35.8Pa < p < 56.7Pa). There are also a number of
different ways of composing more complex bitmap indices by using combinations
of the equality, range, and interval encoding [40].
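
As an illustration of why the choice of encoding matters, the sketch below builds a range-encoded index, in which the bitmap for a value v marks all rows with values less than or equal to v. A two-sided range condition then touches at most two bitmaps, whereas an equality-encoded index would have to OR together one bitmap per value inside the range. The data values are again illustrative.

import numpy as np

def build_range_bitmaps(values):
    # Range encoding: the bitmap for value v marks the rows with values <= v.
    return {v: (values <= v) for v in np.unique(values)}

x = np.array([0, 1, 2, 3, 3, 3, 3, 1])
bitmaps = build_range_bitmaps(x)

# The condition (0 < x <= 2) needs only two bitmaps: the rows with x <= 2 that
# are not also rows with x <= 0.
answer = bitmaps[2] & ~bitmaps[0]
print(np.nonzero(answer)[0])   # rows 1, 2, and 7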
Binning: The basic bitmap index works well for low-cardinality attributes,
such as “gender,” “types of cars sold per month,” or “airplane models produced
by Airbus and Boeing.” However, for scientific data, the variables often have a
large value-range, while different values are hardly ever repeated. These types
of scientific data are referred to as high-cardinality data. A bitmap index that
uses binning controls the index size for such data by associating each bitmap
with a bin, that is, a range of values, rather than with a single distinct value;
queries are then answered from the bin bitmaps, with an additional candidate
check of the records in bins that straddle the query boundaries.
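
A minimal sketch of this idea, assuming uniform-width bins over a hypothetical floating-point temperature field, is given below. Bins that lie entirely inside the query range contribute hits directly; only the bin that straddles the query boundary requires a candidate check against the raw values.

import numpy as np

# Hypothetical high-cardinality variable: values are essentially never repeated,
# so one bitmap per distinct value would be impractical.
temperature = np.random.uniform(500.0, 2000.0, size=100_000)

# 100 uniform bins over the value range, one membership bitmap per bin.
edges = np.linspace(500.0, 2000.0, num=101)
bin_of_row = np.digitize(temperature, edges) - 1
bin_bitmaps = [bin_of_row == b for b in range(100)]

# Query: temperature > 1000F. Bins entirely above 1000F are hits outright; the
# bin containing 1000F only yields candidates that must be checked against the
# raw data (the "candidate check" cost of binning).
boundary = int(np.digitize(1000.0, edges)) - 1
sure_hits = np.any(bin_bitmaps[boundary + 1:], axis=0)
candidates = bin_bitmaps[boundary]
hits = sure_hits | (candidates & (temperature > 1000.0))
print(f"{hits.sum()} records satisfy the query")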