bin, and rendering order takes into account region density so that important
regions are not hidden by occlusion. Different colors could be assigned to differ-
ent timesteps, producing a temporal parallel coordinates plot (see Fig. 7.3c).
This approach helps reveal multivariate trends and outliers across large, time-varying data sets.
7.3.2 Segmenting Query Results
A query describes a binary classification of the data, based on whether
a record satisfies the query condition(s). However, typically, the feature(s)
of interest to the end users are not individual data records, but regions of
space defined by connected components of a query—such as ignition kernels or
flame fronts. Besides physical components of a query, a feature of interest may
also be defined by groups of records (clusters) in high-dimensional variable
space. Combining QDV with methods for the segmentation and classification
of query results can facilitate the analysis of large data sets by supporting
the identification of subfeatures of a query, by suggesting strategies for the
refinement of queries, or by automating the definition of complex queries for
automatic feature detection.
A common approach for enhancing the QDV analysis process consists of
identifying spatially connected components in the query results. This has been
accomplished in the past using a technique known as connected component
labeling [35, 39]. Information from connected component labels provides the
means to enhance the visualization—as shown in Figure 7.4b—and enables
further quantitative analysis. For example, statistical analysis of the number of physical components in a query result and of their size or volume distributions can provide valuable information about the state and evolution of dynamic physical processes, such as a flame [5].
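For regular grids, the core operation is straightforward to sketch. The following example is a minimal illustration rather than the parallel implementations cited above: it treats the query result as a boolean mask and labels its connected components with SciPy. All field names and thresholds are hypothetical.

```python
import numpy as np
from scipy import ndimage

# Hypothetical scalar fields on a regular 3D grid; in practice these
# would be read from the simulation output.
temperature = np.random.rand(64, 64, 64)
fuel = np.random.rand(64, 64, 64)

# A query is a binary classification of the data: True where a grid
# cell satisfies the query condition(s).
mask = (temperature > 0.95) & (fuel < 0.1)

# Group adjacent selected cells into spatially connected components
# (e.g., candidate ignition kernels).
labels, n_components = ndimage.label(mask)

# Per-component cell counts support quantitative analysis of the
# number and size distribution of the detected features.
sizes = ndimage.sum(mask, labels, index=range(1, n_components + 1))
```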
Connected component analysis for QDV has a wide range of applications.
Stockinger et al. [31] applied this approach to combustion simulation data. In
this context, a researcher might be interested in finding ignition kernels, or
regions of extinction. On the other hand, when studying the stability of mag-
netic confinement for fusion, a researcher might be interested in regions with
high electric potential because of their association with zonal flows, critical to
the stability of the magnetic confinement—as shown in Figure 7.4a [41].
In practice, connected component analysis is mainly useful for data with
known topology, in particular, data defined on regular meshes. For scattered
data—such as particle data—where no connectivity is given, density-based
clustering approaches may provide a useful alternative to group selected par-
ticles based on their spatial distribution. One of the main limitations of a
connected component analysis, in the context of QDV, is that the labeling
of components by itself does not yield any direct feedback about possible
strategies for the refinement of data queries.
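As an illustration of the scattered-data alternative, the following sketch clusters queried particles with the DBSCAN implementation from scikit-learn; the particle positions and the eps and min_samples parameters are hypothetical and would need tuning to the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical positions (N x 3) of particles that satisfy a query,
# e.g., particles whose momentum exceeds a threshold.
positions = np.random.rand(10000, 3)

# Density-based clustering groups the selected particles by spatial
# proximity, without requiring any mesh connectivity.
clustering = DBSCAN(eps=0.02, min_samples=10).fit(positions)

# Labels >= 0 identify clusters; -1 marks isolated "noise" particles.
labels = clustering.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```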
To address this problem, Gosink et al. [17] proposed the use of multivariate
statistics to support the exploration of the solution space of data queries. To
FIGURE 7.4: Applications of segmentation of query results. Image (a) displays the magnetic confinement fusion visualization showing regions of high magnetic potential colored by their connected component label. Image source: Wu et al., 2011 [41]. Image (b) shows the query selecting the eye of a hurricane. Multivariate statistics-based segmentation reveals three distinct regions in which the query's joint distribution is dominated by the influence of pressure (blue), velocity (green), and temperature (red). Image source: Gosink et al., 2011 [17]. Image (c) is the volume rendering of the plasma density (gray), illustrating the wake of the laser in a plasma-based accelerator. The data set contains 229 × 10^6 particles per timestep. The particles of the two main beams, automatically detected by the query-based analysis, are shown colored by their momentum in the acceleration direction (px). Image source: Rübel et al., 2009 [26].
analyze the structure of a query result, they used kernel density estimation to
compute the joint and univariate probability distributions for the multivariate
solution space of a query. Visual exploration of the joint density function—for
example, using isosurfaces in physical space—helps users with the visual iden-
tification of regions in which the combined behavior of the queried variables is
statistically more important to the inquiry. Based on the univariate distribu-
tion functions, the query solution is then segmented into subregions, in each of which the distribution of a different variable is most important in defining the query's solution. Figure 7.4b shows an example of the segmentation of
a query defining the eye of a hurricane. Segmentation of the query solution
based on the univariate density functions reveals three regions in which the
query’s joint distribution is dominated by the influence of pressure (blue), ve-
locity (green), and temperature (red). The comparison of a query’s univariate
distributions to the corresponding distributions restricted to the segmented regions can help identify parameters for further query refinement.
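The following sketch conveys the flavor of this segmentation using SciPy's kernel density estimation. It is a simplified stand-in for the published method [17], and the variable set and max-normalization used to compare densities are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical records satisfying the query, one column per queried
# variable (e.g., pressure, velocity, temperature).
solution = np.random.rand(5000, 3)

# Univariate kernel density estimates over the query's solution space.
kdes = [gaussian_kde(solution[:, i]) for i in range(solution.shape[1])]

# Evaluate each variable's density at every record and normalize so
# that densities of different variables can be compared.
density = np.stack([k(solution[:, i]) for i, k in enumerate(kdes)], axis=1)
density /= density.max(axis=0, keepdims=True)

# Assign each record to the variable whose univariate distribution
# dominates there (cf. the three colored regions in Fig. 7.4b).
segment = np.argmax(density, axis=1)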
As the above example illustrates, understanding the structure of a query, as defined by the trends and interactions of variables within its solution, can provide important insight for the refinement of queries.
To this end, Gosink et al. [15] proposed the use of univariate cumulative dis-
tribution functions restricted to the solution space of the query to identify
principal level sets for all variables deemed most relevant to the query's solution. Additional derived scalar fields, describing the local pairwise correlation between two variables in physical space, are then used to identify pairwise variable interactions. Visualization of these correlation fields, in conjunction with principal isosurfaces, then enables the identification of statistically important interactions and trends between any three variables.
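A compact sketch of these two ingredients follows. The choice of the median of the restricted CDF as the principal isovalue and the gradient-alignment form of the correlation field are illustrative assumptions, not the exact definitions from [15].

```python
import numpy as np

# Hypothetical 3D fields on a regular grid and a boolean query mask.
a = np.random.rand(64, 64, 64)
b = np.random.rand(64, 64, 64)
mask = a > 0.5

# Empirical CDF of `a` restricted to the query's solution; here the
# value at which the restricted CDF reaches 0.5 serves as an
# illustrative "principal" isovalue.
vals = np.sort(a[mask])
principal_isovalue = vals[len(vals) // 2]

# Derived scalar field describing local pairwise correlation: the
# cosine of the angle between the two fields' gradients (+1 means
# locally aligned trends, -1 opposing trends).
grad_a = np.stack(np.gradient(a))
grad_b = np.stack(np.gradient(b))
denom = (np.linalg.norm(grad_a, axis=0) *
         np.linalg.norm(grad_b, axis=0) + 1e-12)
correlation_field = (grad_a * grad_b).sum(axis=0) / denom
```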
The manual, query-driven exploration of extremely large data sets en-
ables the user to build and test new hypotheses. However, manual exploration
is often a time-consuming and complex process and is, therefore, not well-
suited for the processing of large collections of scientific data. Combining the
QDV concept with methods to automate the definition of advanced queries
to extract specific features of interest helps to support the analysis of large data collections. For example, in the context of plasma-based particle accelerators, many analyses rely on the ability to accurately define the subset of particles that form the main particle beams of interest. To this end, 7.4.2
describes an efficient query-based algorithm for the automatic detection of par-
ticle beams [26]. Combining the automatic query-based beam analysis with
advanced visualization supports an efficient analysis of complex plasma-based
accelerator simulations, as shown in Figure 7.4c.
7.4 Applications of Query-Driven Visualization
QDV is among the small subset of techniques that can address both large
and highly complex data. QDV is an efficient, feature-focused analysis ap-
proach that has a wide range of applications. The case studies that follow
all make use of supercomputing platforms to perform QDV and analysis on
data sets of unprecedented size. One theme across all these studies is that
QDV reduces visualization and analysis processing time from hours or days to seconds or minutes.
The study results presented in 7.4.1 aim to accelerate analysis within the
context of forensic cybersecurity [3, 29]. Two high energy physics case studies
are presented in 7.4.2, both of which are computational experiments aimed at
designing next-generation particle accelerators: a free-electron laser
(7.4.2.1) and a plasma-wakefield accelerator (7.4.2.2).
7.4.1 Applications in Forensic Cybersecurity
Modern forensic analytics applications, like network traffic analysis, consist
of hypothesis testing, knowledge discovery, and data mining on very large data
sets. One key strategy for reducing the time-to-solution is to quickly focus on the subset of data relevant to a given analysis. This case
study, presented in earlier work [3, 29], uses a combination of supercomputing
platforms for parallel processing, high performance indexing for fast searches,
and user interface technologies for specifying queries and examining query
results.
The problem here is to find and analyze a distributed network scan attack
buried in one year’s worth of network flow data, which consists of 2.5 billion
records. In a distributed scan, multiple distributed hosts systematically probe
for vulnerabilities on ports of a set of target hosts. The traditional approach,
consisting of command line scripts and using ASCII files, could have required
many weeks of processing time. The approach presented below reduces this
time to minutes.
Data: In this experiment, the data set consists of network connection data
for a period of 42 weeks—a total of 2.5 billion records—acquired from a Bro
system [23] running at a large supercomputing facility. The data set contains, for each record, the “standard” set of connection variables, such as source and destination IP addresses, ports, connection duration, etc. The data is stored in
flat files, using an uncompressed binary format for a total size of about 281GB.
The IP addresses are split into four octets, A:B:C:D, to improve query performance. For instance, IPS_A refers to the class A octet of the source IP address. The FastBit bitmap indices used to accelerate data queries require a total of 78.6GB of space.
Interactive Query Interface: The study's authors presented a histogram-based visualization and query interface to facilitate the exploration of network traffic data. After loading the metadata from a file, the system invokes FastBit to generate coarse temporal resolution histograms containing, for example, counts of variables over the entire temporal range at a weekly resolution. The user then refines the display by drilling into a narrower temporal range at a finer temporal resolution. The result of such a query is another histogram,
which is listed as a new variable in the user interface. A more complex query
may be formed by the “cross-product” of histograms of arbitrary numbers
of data variables, and may be further qualified with arbitrary, user-specified
conditions.
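The drill-down pattern can be emulated in a few lines of NumPy, with the caveat that the actual system delegates these conditional histograms to FastBit bitmap indices; the record arrays below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                        # hypothetical number of flow records
week = rng.integers(0, 42, n)        # week of each record (0..41)
dport = rng.integers(0, 65536, n)    # destination port
failed = rng.integers(0, 2, n) == 1  # unsuccessful-connection flag

# Coarse pass: weekly counts of unsuccessful connection attempts.
weekly = np.bincount(week[failed], minlength=42)

# Drill-down: restrict to one suspicious week and histogram a second
# variable under the refined condition; the result is itself a new
# "variable" that can be qualified and refined further.
per_port = np.bincount(dport[failed & (week == 7)], minlength=65536)
```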
One advantage of such a histogram-centric approach is that it allows for
an effective iterative refinement of queries. This iterative, multiresolution ap-
proach, which consists of an analyst posing different filtering criteria to exam-
ine data at different temporal scales and resolutions, enables effective context
and focus changes. Rather than trying to visually analyze 42 weeks' worth of
data at one-second resolution all at once, the user can identify higher-level
features first and then analyze those features and their subfeatures in greater
detail.
Another advantage is performance. The traditional methodology in net-
work traffic analysis consists of running command-line scripts on logfiles. Due
to the nature of how FastBit stores and operates on bitmap indices, it can
compute conditional histograms much more quickly than a traditional sequen-
tial scan. This idea was the subject of several different performance experi-
ments [3, 29].
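The following toy example shows why, using uncompressed bitmaps (FastBit additionally compresses each bitmap): evaluating a compound condition reduces to bitwise ANDs of precomputed bitmaps, and the count comes from a population count rather than a scan of the records.

```python
import numpy as np

# Hypothetical columns of a tiny flow-record table.
dport = np.array([5554, 80, 5554, 22, 5554, 80])
failed = np.array([True, False, True, True, False, True])

# Bitmap index: one boolean bitmap per distinct column value.
dport_index = {v: dport == v for v in np.unique(dport)}
failed_index = {v: failed == v for v in (False, True)}

# "dport == 5554 AND failed" becomes a bitwise AND of two bitmaps;
# the hit count is a population count, not a scan of the records.
hits = dport_index[5554] & failed_index[True]
count = int(hits.sum())
```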
Network Traffic Analysis: The objective of this analysis example is to
FIGURE 7.5: Histograms showing the number of unsuccessful connection at-
tempts (with radiation excluded): (a) on ports 2000 to 65535 over a 42-week
period, indicating high levels of activity on port 5554 during the 7th week; (b)
per source A octet during the 7th week on port 5554, indicating suspicious
activity from IPs with a 220 A octet; (c) per destination C octet scanned
by seven suspicious source hosts (color), indicating a clear scanning pattern.
Image source: Bethel et al., 2006 [3].
examine 42 weeks' worth of network data to investigate an initial intrusion de-
tection system (IDS) alert indicating a large number of scanning attempts on
TCP port 5554, which is indicative of a so-called “Sasser worm.” The first step,
then, is to search the complete data set to identify the ports for which large
numbers of unsuccessful connection attempts have been recorded by the Bro
system. Figure 7.5a shows the counts of unsuccessful connection attempts on
all ports over the entire time range at weekly temporal resolution. The chart
reveals a high degree of suspicious activity in the seventh week, on destination
port 5554.
Now that the suspicious activity has been confirmed and localized, the next
goal is to identify the addresses of the host(s) responsible for the unsuccessful
connection attempts. The splitting of IP addresses into four octets (A:B:C:D) enables an iterative search to determine the A, then B, C, and finally D octet addresses of the attacking hosts. The first query then asks for the number of unsuccessful connection attempts on port 5554 during the seventh week over each address within the Class A octet. As can be seen in Figure 7.5b, the most suspicious activity originates from host(s) having IP addresses with a 220 A octet. Further refinements of the query then reveal the B octet (IP = 220:184:x:x) and a list of C octets (IPS_C = {26, 31, 47, 74, 117, 220, 232}) from which most of the suspicious activity originates, indicating that a series of hosts on these seven IPS_C network segments may be involved in a distributed scan.