sampling, a sample is randomly chosen from each stratum, whereas in cluster
sampling, only the randomly selected clusters are explored; and (3) cluster
sampling primarily aims to reduce costs by increasing sampling efficiency,
whereas stratified sampling aims to increase effectiveness (i.e., precision).
However, both of these sampling methods are limited by the unknown
size of the total data.
As introduced in [27], several issues confront existing sampling
techniques. First, data streams have an unknown dataset size; therefore,
the sampling process on a data stream requires special analysis to limit
the error bounds. Another problem is that a fixed sampling strategy may
be inappropriate for detecting anomalies in surveillance analysis, because
the data rates in the stream are always changing. Thus, we explore
the relationship among the data rate, sampling rate, and error bounds for
real applications.
10.3 Wavelet Method
The wavelet-based technique is a fundamental tool for analyzing data
streams. From a traditional perspective, a wavelet is a mathematical function
used to divide a given function or continuous time series into different scale
components [6]. This approach has been successfully applied in applications
such as signal processing, motion recognition, image compression, and so on
[63, 7]. The wavelet technique provides a concise and general summarization
of the data (i.e., the stream), which can be used as the basis for efficient and
accurate query processing methods. Numerous strategies have been introduced
based on the idea of wavelets, among which the most commonly used approach
for data streams is the Haar wavelet [68].
The Haar wavelet provides a foundation for query processing on stream
and relational data. It creates a decomposition of the data (or a compact
summary) into a set of Haar wavelet functions, which can be used for later
query processing. The essential step is the determination of the Haar wavelet
coefficients, of which only those with high values are typically stored. Higher-order
coefficients in the decomposition generally indicate broad trends in
the data, whereas lower-order coefficients represent local trends. We will
show a concrete example to illustrate the Haar wavelet process [66, 31].
Suppose our data stream is {3, 2, 4, 3, 1, 5, 0, 3}. The data in the
vector are averaged pairwise between neighbors to obtain a lower resolution
representation (i.e., level 2) of the data, namely
$\left[\frac{3+2}{2}, \frac{4+3}{2}, \frac{1+5}{2}, \frac{0+3}{2}\right] = \left[\frac{5}{2}, \frac{7}{2}, 3, \frac{3}{2}\right]$.
This averaging alone results in a loss of information, so additional information
must be stored to recover the original data. The Haar wavelet technique
therefore also computes the pairwise differences of the neighboring values, namely
$\left[\frac{3-2}{2}, \frac{4-3}{2}, \frac{1-5}{2}, \frac{0-3}{2}\right] = \left[\frac{1}{2}, \frac{1}{2}, -2, -\frac{3}{2}\right]$.
These eight values (the four averages and the four differences) are the
first-order coefficients of the sample data, from which the original data set
can be recovered. Similarly, we can obtain the lower resolution representation
at level 1 as
$\left[\frac{\frac{5}{2}+\frac{7}{2}}{2}, \frac{3+\frac{3}{2}}{2}, \frac{\frac{5}{2}-\frac{7}{2}}{2}, \frac{3-\frac{3}{2}}{2}\right] = \left[3, \frac{9}{4}, -\frac{1}{2}, \frac{3}{4}\right]$.
Recursively, we obtain the final Haar wavelet transformation of the data as
$\left\{\frac{21}{8}, \frac{3}{8}, -\frac{1}{2}, \frac{3}{4}, \frac{1}{2}, \frac{1}{2}, -2, -\frac{3}{2}\right\}$.
The whole transformation process is illustrated in Fig. 10.3.1.
Haar wavelet transformation on {3, 2, 4, 3, 1, 5, 0, 3}
Original data (level 3): 3   2   4   3   1   5   0   3
Level 2 (averages | differences): 2.5  3.5  3  1.5  |  0.5  0.5  -2  -1.5
Level 1 (averages | differences): 3  2.25  |  -0.5  0.75
Level 0 (average | difference): 2.625  |  0.375
Figure 10.3.1: Sample Haar wavelet transformation
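The decomposition above can be reproduced with a few lines of code. The following is a minimal sketch, assuming the stream fits in memory and has a power-of-two length; the function name haar_transform is illustrative and not taken from the text.

```python
def haar_transform(data):
    """Recursively compute pairwise averages and differences (Haar wavelet).

    Returns the coefficients ordered as [overall average, level-0 difference,
    level-1 differences, ..., finest-level differences].
    """
    coeffs = []
    current = list(data)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        differences = [(a - b) / 2 for a, b in pairs]
        coeffs = differences + coeffs   # finer-level differences go to the right
        current = averages
    return current + coeffs             # prepend the overall average

print(haar_transform([3, 2, 4, 3, 1, 5, 0, 3]))
# [2.625, 0.375, -0.5, 0.75, 0.5, 0.5, -2.0, -1.5]
```

The output matches the final transformation derived above: the overall average 21/8, followed by the difference coefficients of levels 0, 1, and 2.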
Haar wavelet analysis commonly assumes that the size q of the time
series data is a power of two, without loss of generality, because the series
can be decomposed into segment subseries, each of which has a length that
is a power of two. From the process of Haar wavelet transformation, we
obtain $2^{l+1}$ coefficients at level l (as shown in Fig. 10.3.1). Each of these
$2^{l+1}$ coefficients represents (and summarizes) a contiguous part of the data
stream of length $q/2^{l+1}$. In the segment series data, the ith of the $2^{l+1}$
coefficients covers the part beginning from $(i-1) \cdot q/2^{l+1} + 1$ to $i \cdot q/2^{l+1}$.
Previous works were not concerned with retaining all coefficients,
but only a smaller number of them (i.e., the top-B) [36]. Given this
simplification, some information on the original data is lost during the
transformation. In [36], the authors report that the highest B-term
approximation is in fact the best B-term approximation, in that it minimizes
the sum of squared errors for a given B.
A large number of existing approaches employ a lossy mechanism
because of the large number of coefficients introduced by the Haar
wavelet transformation; that is, the number of coefficients equals the length
of the total data stream. The top-B coefficients with the largest values are
kept, while the other ones are set to zero. This heuristic is thus employed
to reduce the dimensionality of the time series data. However, a trade-off
always exists between the number of coefficients retained and the error
introduced by the transformation. Obtaining the optimal number of Haar
wavelet coefficients is an interesting issue in the literature. Nevertheless,
previous works assume that only a small number of coefficients dominates
the full effectiveness of the transformation.
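As an illustration of this heuristic, the following minimal sketch keeps the B coefficients of largest absolute value from the worked example above and measures the reconstruction error; the helper names and the choice B = 4 are illustrative assumptions, not taken from the text.

```python
def keep_top_b(coeffs, B):
    """Zero out all but the B coefficients with the largest absolute values."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(ranked[:B])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def inverse_haar(coeffs):
    """Invert the Haar transform: child values are average +/- difference."""
    current = coeffs[:1]                      # overall average
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        current = [v for a, d in zip(current, details) for v in (a + d, a - d)]
        pos += len(details)
    return current

# Coefficients of {3, 2, 4, 3, 1, 5, 0, 3} taken from Fig. 10.3.1.
coeffs = [2.625, 0.375, -0.5, 0.75, 0.5, 0.5, -2.0, -1.5]
approx = inverse_haar(keep_top_b(coeffs, B=4))
sse = sum((x - y) ** 2 for x, y in zip([3, 2, 4, 3, 1, 5, 0, 3], approx))
print(approx, sse)
```

Increasing B lowers the reconstruction error but enlarges the summary, which is exactly the trade-off discussed above.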
In addition to the determination of the optimal number of Haar
wavelet coefficients, another issue is the selection of the appropriate
coefficients. The absolute value is not the sole criterion that determines
performance. For example, depending on the evaluation metric used for a
given application, such as mean square error versus least maximum error,
different selection strategies may be appropriate [29, 30, 63]. However, other
issues such as computational efficiency should also be considered.
As introduced in [8], other important topics are related to Haar wavelet
transformation. A number of applications prefer to monitor large quantities
of information simultaneously from the same stream data source. Taking sensor
applications as an example, the monitors may store a large number of data
features, such as location, pressure, wind direction, and so on. Therefore, the
Haar wavelet transformation needs to process all of these features simultaneously.
An intuitive approach is to apply the decomposition to each feature,
such that the top-B coefficients are discovered by merging the transformed
results [63]. However, this strategy may be inefficient, because duplicate
transformations may be conducted on the same individual data (relative to
different features). The authors in [22] introduced several effective strategies
for the simultaneous monitoring and transformation of multi-feature data
by using bitmaps to determine the optimal features.
In summary, Haar wavelet transformation continuously builds a
summarization of the B representative wavelet coefficients (e.g., those with
the largest absolute values) for a time series data set. Considering the special
properties of dynamic stream data, the following criteria should be satisfied:
(1) sub-linear space usage must be available to store the summarization,
and (2) sub-linear per-item update time must be sufficient to maintain the
summarization. Applications on query processing may further need to
consider another criterion, that is, sub-linear query time.
10.4 Sketch Method
The sketch-based technique is one of the major tools for stream data analysis.
Sketches are small space summarizations of stream data in a centralized or
distributed environment. The main advantage of sketch-based techniques is
that they require storage that is significantly smaller than the input stream
length. For most sketch-based algorithms, the storage usage is sub-linear in
N, that is, $O(\log^k N)$, where N is the input size and k is some constant. A hashing
function is generally employed by sketch-based algorithms to project the
data stream into a small-space sketch vector that can be easily updated and
queried. As demonstrated by numerous experimental evaluations, the cost
of updating and querying the sketch vector is only constant time per
operation.
Given the property of a sketch, the answers to queries that are
determined by examining the sketches are only approximations, because
only part of the data information is stored (as the sketch). A sketch generally
contains multiple counters for random variables relative to different
attributes. The error bound on the answers should hold with
probabilistic guarantees. The introduced sketch-based algorithms differ in
terms of defining and updating the random variables as well as in the efficient
querying of the sketches.
The sketch-based technique is closely related to the random projection
strategy [45]. Indyk et al. [41] introduced this strategy into the database
domain (i.e., the time series domain) to discover representative trends.
The key idea is that a data point with dimensionality d is reduced by
projecting it onto k randomly chosen vectors, that is, by computing the dot
product between the data point and each of the k random vectors. The
components of each random vector follow a normal distribution with zero mean and
unit variance, and each random vector is normalized to unit magnitude.
Accuracy depends on the value of k, where a larger
value of k results in higher accuracy. We will now introduce several sketch-
based algorithms for processing data streams. Please refer to [17, 62, 8] for
a more detailed survey of related issues.
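The following minimal sketch illustrates this random projection idea under the stated assumptions (Gaussian random vectors normalized to unit length); the dimensions, the rescaling factor sqrt(d/k), and the function names are illustrative choices, not taken from [41] or [45].

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_vectors(d, k):
    """Draw k random d-dimensional vectors with N(0, 1) entries, normalized to unit length."""
    R = rng.normal(0.0, 1.0, size=(k, d))
    return R / np.linalg.norm(R, axis=1, keepdims=True)

def sketch(x, R):
    """Project a d-dimensional point onto the k random vectors (k dot products)."""
    return R @ x

d, k = 1024, 64
R = make_random_vectors(d, k)
x, y = rng.normal(size=d), rng.normal(size=d)

# Distances between sketches (rescaled by sqrt(d/k) for unit-norm rows)
# approximate distances between the original points; the approximation
# tightens as k grows.
print(np.linalg.norm(x - y))
print(np.linalg.norm(sketch(x, R) - sketch(y, R)) * np.sqrt(d / k))
```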
10.4.1 Sliding Window-based Sketch
Indyk et al. [41] introduced the sketch-based technique into the database
domain to discover the trends governing data streams. The authors observed
that the length of a time series can be treated as one dimensionality.
Therefore, we can construct the sketch by considering the length as the
random vector. Two situations are considered in [41]: fixed window sketches
and variable window sketches.
For fixed window sketches, the aim is to obtain sliding window sketches
with a fixed length l. Thus, l · k operations should be conducted for a sketch
with size k. Given the total of O(n − l) sliding windows, O(n · l · k) operations
are necessary. From this analysis, we find that if the window length l is large
(i.e., of the same order of magnitude as the time series), the cost of calculation
would be quadratic in the size of the series. Therefore, this approach could
be impractical for large time series data (i.e., streams). As introduced in
[41], the construction of fixed window-based sketches can be considered
as the computation of the polynomial convolution of random vectors of
appropriate length over the time series data. Therefore, we can use the fast
Fourier transform to address the issue. This observation indicates that the
fixed window sketches can be obtained efficiently.
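The following minimal sketch illustrates this observation for a single random vector; the series length, the window length, and the use of scipy.signal.fftconvolve are illustrative assumptions, not details from [41].

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(1)
series = rng.normal(size=10_000)
l = 256                                  # fixed window length
r = rng.normal(size=l)                   # one random sketch vector

# Direct computation: one dot product per sliding window, O(n * l) in total.
direct = np.array([series[i:i + l] @ r for i in range(len(series) - l + 1)])

# Convolution-based computation: correlating the series with the random
# vector (i.e., convolving with its reversal) yields all window sketches
# at once, and the FFT makes this O(n log n).
fast = fftconvolve(series, r[::-1], mode="valid")

print(np.allclose(direct, fast))         # True (up to floating-point error)
```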
For variable window sketches, the aim is to construct the sketches for
any sub-vector between length l and u [41]. This approach requires $O(n^2)$
sub-vectors that may have the length O(n) in the worst case. Through this
analysis, we find that the total cost is $O(n^3)$, which limits the application of
the technique for large time series data (i.e., stream). To address this issue,
Indyk et al. [41] proposed the construction of a set of sketches. The size of
this group must be considerably smaller than the total number of sketches. To determine
the sketches to be stored, the authors deliberately chose sub-vectors out of
the original ones, such that the computation can be achieved in O(1) time
with sufficient accuracy guaranteed for each vector. The experimental
evaluation demonstrates the effectiveness and efficiency of the introduced
strategies [41]. Please refer to [41] for more details about this technique.
10.4.2 Count Sketch
Alon et al. [10] were the first to introduce the term sketch, as the tug-of-war
sketch. The authors aim to measure and optimize the second frequency
moment $F_2 = \sum_i f_i^2$. Notably, more recent studies found that the
summarization introduced in [10] can also be utilized to measure the inner
product of two frequency distributions, that is, $\sum_i f_i f'_i$, where $f_i$ and $f'_i$
denote the two frequency distributions. From this observation, we find that
if the sketch of $f_i$ can be obtained from a data stream, the inner product
with the distribution given by $f'_i = 1$ and $f'_j = 0$ for all $j \neq i$ can be
calculated during query processing, which yields an estimate of $f_i$. The
error bound on this estimate is $\varepsilon F_2^{1/2} \leq \varepsilon n$, which holds with
probability of more than $1 - \delta$ for a sketch of size $O(\frac{1}{\varepsilon^2} \log \frac{1}{\delta})$.
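A minimal sketch of the tug-of-war summary is shown below, assuming the sign hashes can be simulated with hashlib (the original construction in [10] uses 4-wise independent hash families); the parameter k and the example streams are likewise illustrative.

```python
import hashlib

def sign(item, m):
    """Pseudo-random +/-1 sign for (item, counter index m)."""
    digest = hashlib.md5(f"{m}:{item}".encode()).digest()
    return 1 if digest[0] % 2 == 0 else -1

def ams_sketch(stream, k=64):
    """One counter per sign hash: counter[m] = sum over items i of s_m(i) * f_i."""
    counters = [0] * k
    for item in stream:
        for m in range(k):
            counters[m] += sign(item, m)
    return counters

def estimate_f2(counters):
    """F_2 = sum_i f_i^2 is estimated by averaging the squared counters."""
    return sum(c * c for c in counters) / len(counters)

def estimate_inner_product(counters_a, counters_b):
    """sum_i f_i * f'_i is estimated by averaging the pairwise counter products."""
    return sum(a * b for a, b in zip(counters_a, counters_b)) / len(counters_a)

stream_a = ["x"] * 30 + ["y"] * 20 + ["z"] * 10
stream_b = ["x"] * 5 + ["y"] * 5
sk_a, sk_b = ams_sketch(stream_a), ams_sketch(stream_b)
print(estimate_f2(sk_a))                   # roughly 30**2 + 20**2 + 10**2 = 1400
print(estimate_inner_product(sk_a, sk_b))  # roughly 30*5 + 20*5 = 250
```

Note that every arriving item updates all k counters, which is exactly the update cost that Charikar et al. [14] reduce, as discussed next.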
The cost of updating such a count sketch is high because all of the sketch
counters must be updated whenever a new instance arrives. Thus, the naive count sketch
technique is not an appropriate strategy for data streams. To address the issue,
Charikar et al. [14] proposed an improved algorithm that requires only a
small part of the sketch to be updated when a new instance comes in.
Thus, performance is significantly improved.
We briefly describe the idea in [14]; interested readers are referred
to that paper for more details. The introduced sketch structure
contains a d × ω array (denoted as C) that stores counters (where d is the
number of rows), with two hash functions associated with each row. One hash
function g maps the instances of the data stream into [ω], whereas the other
hash function h maps the instances into {−1, +1}. For each row j (i.e., $1 \leq j \leq d$),
the corresponding mapping of instance i, that is, $h_j(i)$, is added to the
array element $C[j, g_j(i)]$. The frequency $f_i$ is then estimated as
$\mathrm{median}_{1 \leq j \leq d}\; h_j(i) \cdot C[j, g_j(i)]$ [17].
The analysis shows that for each value of j, the expectation and the variance,
which depends on $F_2/\omega$, can be accurately derived.
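The following is a minimal sketch of this structure under the stated assumptions: the per-row hash functions g_j and h_j are simulated with hashlib rather than with the hash families used in [14], and the default sizes d = 5 and ω = 256 are illustrative.

```python
import hashlib
import statistics

class CountSketch:
    """A d x w array of counters with a bucket hash g_j and a sign hash h_j per row."""

    def __init__(self, d=5, w=256):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]

    def _hashes(self, item, j):
        # Derive g_j(item) in [w] and h_j(item) in {-1, +1} from one digest.
        digest = hashlib.md5(f"{j}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % self.w
        sign = 1 if digest[4] % 2 == 0 else -1
        return bucket, sign

    def update(self, item, count=1):
        # Each arriving instance touches exactly one counter per row.
        for j in range(self.d):
            bucket, sign = self._hashes(item, j)
            self.C[j][bucket] += sign * count

    def estimate(self, item):
        # f_i is estimated as the median over rows of h_j(i) * C[j, g_j(i)].
        values = []
        for j in range(self.d):
            bucket, sign = self._hashes(item, j)
            values.append(sign * self.C[j][bucket])
        return statistics.median(values)

# Usage: feed a stream, then query approximate item frequencies.
cs = CountSketch()
for x in ["a"] * 100 + ["b"] * 40 + ["c"] * 5:
    cs.update(x)
print(cs.estimate("a"), cs.estimate("b"))   # approximately 100 and 40
```

In contrast to the tug-of-war sketch, each update here modifies only d counters (one per row), which is the improvement discussed above.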