sampling, a sample is randomly chosen from each stratum, whereas in cluster
sampling, only the randomly selected clusters are explored; and (3) cluster
sampling primarily aims to reduce costs by increasing sampling efficiency,
whereas stratified sampling aims to increase effectiveness (i.e., precision).
However, both of these sampling methods are limited by the unknown
size of the total data.
As introduced in [27], several issues confront existing sampling
techniques. First, data streams have an unknown dataset size; therefore,
the sampling process on a data stream requires special analysis to limit
the error bounds. Another problem is that a fixed sampling strategy may
be inappropriate for detecting anomalies in surveillance analysis, because
the data rates in the stream are always changing. Thus, we explore
the relationship among the data rate, sampling rate, and error bounds for
real applications.
10.3 Wavelet Method
The wavelet-based technique is a fundamental tool for analyzing data
streams. From a traditional perspective, a wavelet is a mathematical function
used to divide a given function or continuous time series into different scale
components [6]. This approach has been successfully applied in applications
such as signal processing, motion recognition, image compression, and so on
[63, 7]. The wavelet technique provides a concise and general summarization
of the data (i.e., the stream), which can be used as the basis for efficient and
accurate query processing methods. Numerous strategies have been introduced
based on the idea of wavelets, among which the most commonly used approach
for data streams is the Haar wavelet [68].
The Haar wavelet provides a foundation for query processing on stream
and relational data. It creates a decomposition of the data (or a compact
summary) into a set of Haar wavelet functions, which can be used for later
query processing. The essential step is the determination of the Haar wavelet
coefficients, of which only those with high values are typically stored. Higher-order
coefficients in the decomposition generally indicate broad trends in
the data, whereas lower-order coefficients represent local trends. We will
show a concrete example to illustrate the Haar wavelet process [66, 31].
Suppose our data stream is {3, 2, 4, 3, 1, 5, 0, 3}. The data in the
vector are averaged pairwise between neighbors to obtain a lower resolution
representation (i.e., level 2) of the data, namely
$\left[\frac{3+2}{2}, \frac{4+3}{2}, \frac{1+5}{2}, \frac{0+3}{2}\right] = \left[\frac{5}{2}, \frac{7}{2}, 3, \frac{3}{2}\right]$.
This averaging alone results in a loss of information, so additional information
must be stored to recover the original data. The Haar wavelet technique
therefore also computes the pairwise differences of the neighboring values, namely
$\left[\frac{3-2}{2}, \frac{4-3}{2}, \frac{1-5}{2}, \frac{0-3}{2}\right] = \left[\frac{1}{2}, \frac{1}{2}, -2, -\frac{3}{2}\right]$.
These eight values (the four averages and the four differences) are the
first-order coefficients of the sample data, from which the original data set
can be recovered. Similarly, we can obtain the lower resolution representation
at level 1 as
$\left[\frac{\frac{5}{2}+\frac{7}{2}}{2}, \frac{3+\frac{3}{2}}{2}, \frac{\frac{5}{2}-\frac{7}{2}}{2}, \frac{3-\frac{3}{2}}{2}\right] = \left[3, \frac{9}{4}, -\frac{1}{2}, \frac{3}{4}\right]$.
Recursively, we obtain the final Haar wavelet transformation of the data as
$\left\{\frac{21}{8}, \frac{3}{8}, -\frac{1}{2}, \frac{3}{4}, \frac{1}{2}, \frac{1}{2}, -2, -\frac{3}{2}\right\}$.
The whole transformation process is illustrated in Fig. 10.3.1.
Haar wavelet transformation on {3, 2, 4, 3, 1, 5, 0, 3}
Original data (level 3): 3   2   4   3   1   5   0   3
Level 2 (averages | differences): 2.5  3.5  3  1.5  |  0.5  0.5  -2  -1.5
Level 1 (averages | differences): 3  2.25  |  -0.5  0.75
Level 0 (average | difference): 2.625  |  0.375
Figure 10.3.1: Sample Haar wavelet transformation
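The decomposition above can be reproduced with a few lines of code. The following is a minimal sketch, assuming the stream fits in memory and has a power-of-two length; the function name haar_transform is illustrative and not taken from the text.

```python
def haar_transform(data):
    """Recursively compute pairwise averages and differences (Haar wavelet).

    Returns the coefficients ordered as [overall average, level-0 difference,
    level-1 differences, ..., finest-level differences].
    """
    coeffs = []
    current = list(data)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        differences = [(a - b) / 2 for a, b in pairs]
        coeffs = differences + coeffs   # finer-level differences go to the right
        current = averages
    return current + coeffs             # prepend the overall average

print(haar_transform([3, 2, 4, 3, 1, 5, 0, 3]))
# [2.625, 0.375, -0.5, 0.75, 0.5, 0.5, -2.0, -1.5]
```

The output matches the final transformation derived above: the overall average 21/8, followed by the difference coefficients of levels 0, 1, and 2.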
Haar wavelet analysis commonly assumes that the size q of the time
series data is a power of two, without loss of generality, because the series
can be decomposed into segment subseries, each of which has a length that
is a power of two. From the process of Haar wavelet transformation, we
obtain $2^{l+1}$ coefficients at level l (as shown in Fig. 10.3.1). Each of these
$2^{l+1}$ coefficients represents (and summarizes) a contiguous part of the data
stream of length $q/2^{l+1}$. In the segment series data, the ith of the $2^{l+1}$
coefficients covers the part beginning from $(i-1) \cdot q/2^{l+1} + 1$ to $i \cdot q/2^{l+1}$.
Previous works were not concerned with retaining all coefficients,
but only a smaller number of them (i.e., the top-B) [36]. Given this
simplification, some information on the original data is lost during the
transformation. In [36], the authors report that the highest B-term
approximation is in fact the best B-term approximation, in that it minimizes
the sum of squared errors for a given B.
A large number of existing approaches employ a lossy mechanism
because of the large number of coefficients introduced by the Haar
wavelet transformation; that is, the number of coefficients equals the length
of the total data stream. The top-B coefficients with the largest values are
kept, while the other ones are set to zero. This heuristic is thus employed
to reduce the dimensionality of the time series data. However, a trade-off
always exists between the number of coefficients retained and the error
introduced by the transformation. Obtaining the optimal number of Haar
wavelet coefficients is an interesting issue in the literature. Nevertheless,
previous works assume that only a small number of coefficients dominates
the full effectiveness of the transformation.
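As an illustration of this heuristic, the following minimal sketch keeps the B coefficients of largest absolute value from the worked example above and measures the reconstruction error; the helper names and the choice B = 4 are illustrative assumptions, not taken from the text.

```python
def keep_top_b(coeffs, B):
    """Zero out all but the B coefficients with the largest absolute values."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(ranked[:B])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def inverse_haar(coeffs):
    """Invert the Haar transform: child values are average +/- difference."""
    current = coeffs[:1]                      # overall average
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        current = [v for a, d in zip(current, details) for v in (a + d, a - d)]
        pos += len(details)
    return current

# Coefficients of {3, 2, 4, 3, 1, 5, 0, 3} taken from Fig. 10.3.1.
coeffs = [2.625, 0.375, -0.5, 0.75, 0.5, 0.5, -2.0, -1.5]
approx = inverse_haar(keep_top_b(coeffs, B=4))
sse = sum((x - y) ** 2 for x, y in zip([3, 2, 4, 3, 1, 5, 0, 3], approx))
print(approx, sse)
```

Increasing B lowers the reconstruction error but enlarges the summary, which is exactly the trade-off discussed above.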
In addition to the determination of the optimal number of Haar
wavelet coefficients, another issue is the selection of the appropriate
coefficients. The absolute value is not the sole criterion that determines
performance. For example, depending on the evaluation metric used for a
given application, such as mean square error versus least maximum error,
different selection strategies may be appropriate [29, 30, 63]. However, other
issues such as computational efficiency should also be considered.
As introduced in [8], other important topics are related to Haar wavelet
transformation. A number of applications prefer to monitor large quantities
of information simultaneously from the same stream data source. Taking sensor
applications as an example, the monitors may store a large number of data
features, such as location, pressure, wind direction, and so on. Therefore, the
Haar wavelet transformation needs to process all of these features simultaneously.
An intuitive approach is to apply the decomposition to each feature,
such that the top-B coefficients are discovered by merging the transformed
results [63]. However, this strategy may be inefficient, because duplicate
transformations may be conducted on the same individual data (relative to
different features). The authors in [22] introduced several effective strategies
for the simultaneous monitoring and transformation of multi-feature data
by using bitmaps to determine the optimal features.
In summary, Haar wavelet transformation continuously builds a
summarization of the B representative wavelet coefficients (e.g., those with
the largest absolute values) for a time series data set. Considering the special
properties of dynamic stream data, the following criteria should be satisfied:
(1) sub-linear space usage must be available to store the summarization,
and (2) sub-linear per-item update time must be sufficient to maintain the
summarization. Applications on query processing may further need to
consider another criterion, that is, sub-linear query time.
10.4 Sketch Method
The sketch-based technique is one of the major tools for stream data analysis.
Sketches are small space summarizations of stream data in a centralized or
distributed environment. The main advantage of sketch-based techniques is
that they require storage that is significantly smaller than the input stream
length. For most sketch-based algorithms, the storage usage is sub-linear in
N, that is, $O(\log^k N)$, where N is the input size and k is some constant. A hashing
function is generally employed by sketch-based algorithms to project the
data stream into a small-space sketch vector that can be easily updated and
queried. As demonstrated by numerous experimental evaluations, the cost
of updating and querying the sketch vector is only constant time per
operation.
Given the property of a sketch, the answers to queries that are
determined by examining the sketches are only approximations, because
only part of the data information is stored (as the sketch). A sketch generally
contains multiple counters for random variables relative to different
attributes. The error bound on the answers should hold with
probabilistic guarantees. The introduced sketch-based algorithms differ in
terms of defining and updating the random variables as well as in the efficient
querying of the sketches.
The sketch-based technique is closely related to the random projection
strategy [45]. Indyk et al. [41] introduced this strategy into the database
domain (i.e., the time series domain) to discover representative trends.
The key idea is that a data point with dimensionality d is reduced by
projecting it onto k randomly chosen vectors, that is, by computing the dot
product between the data point and each of the k random vectors. The
components of each random vector follow a normal distribution with zero mean and
unit variance, and each random vector is normalized to unit magnitude.
Accuracy depends on the value of k, where a larger
value of k results in higher accuracy. We will now introduce several sketch-
based algorithms for processing data streams. Please refer to [17, 62, 8] for
a more detailed survey of related issues.
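The following minimal sketch illustrates this random projection idea under the stated assumptions (Gaussian random vectors normalized to unit length); the dimensions, the rescaling factor sqrt(d/k), and the function names are illustrative choices, not taken from [41] or [45].

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_vectors(d, k):
    """Draw k random d-dimensional vectors with N(0, 1) entries, normalized to unit length."""
    R = rng.normal(0.0, 1.0, size=(k, d))
    return R / np.linalg.norm(R, axis=1, keepdims=True)

def sketch(x, R):
    """Project a d-dimensional point onto the k random vectors (k dot products)."""
    return R @ x

d, k = 1024, 64
R = make_random_vectors(d, k)
x, y = rng.normal(size=d), rng.normal(size=d)

# Distances between sketches (rescaled by sqrt(d/k) for unit-norm rows)
# approximate distances between the original points; the approximation
# tightens as k grows.
print(np.linalg.norm(x - y))
print(np.linalg.norm(sketch(x, R) - sketch(y, R)) * np.sqrt(d / k))
```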
10.4.1 Sliding Window-based Sketch
Indyk et al. [41] introduced the sketch-based technique into the database
domain to discover the trends governing data streams. The authors observed
that the length of a time series can be treated as one dimensionality.
Therefore, we can construct the sketch by considering the length as the
random vector. Two situations are considered in [41]: fixed window sketches
and variable window sketches.
For fixed window sketches, the aim is to obtain sliding window sketches
with a fixed length l. Thus, l · k operations should be conducted for a sketch
with size k. Given the total of O(n − l) sliding windows, O(n · l · k) operations
are necessary. From this analysis, we find that if the window length l is large
(i.e., of the same order of magnitude as the time series), the cost of calculation
would be quadratic in the size of the series. Therefore, this approach could
be impractical for large time series data (i.e., streams). As introduced in
[41], the construction of fixed window-based sketches can be considered
as the computation of the polynomial convolution of random vectors of
appropriate length over the time series data. Therefore, we can use the fast
Fourier transform to address the issue. This observation indicates that the
fixed window sketches can be obtained efficiently.
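The following minimal sketch illustrates this observation for a single random vector; the series length, the window length, and the use of scipy.signal.fftconvolve are illustrative assumptions, not details from [41].

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(1)
series = rng.normal(size=10_000)
l = 256                                  # fixed window length
r = rng.normal(size=l)                   # one random sketch vector

# Direct computation: one dot product per sliding window, O(n * l) in total.
direct = np.array([series[i:i + l] @ r for i in range(len(series) - l + 1)])

# Convolution-based computation: correlating the series with the random
# vector (i.e., convolving with its reversal) yields all window sketches
# at once, and the FFT makes this O(n log n).
fast = fftconvolve(series, r[::-1], mode="valid")

print(np.allclose(direct, fast))         # True (up to floating-point error)
```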
For variable window sketches, the aim is to construct the sketches for
any sub-vector between length l and u [41]. This approach requires $O(n^2)$
sub-vectors that may have the length O(n) in the worst case. Through this
analysis, we find that the total cost is $O(n^3)$, which limits the application of
the technique for large time series data (i.e., stream). To address this issue,
Indyk et al. [41] proposed the construction of a set of sketches. The size of
this group must be considerably smaller than the total number of sketches. To determine
the sketches to be stored, the authors deliberately chose sub-vectors out of
the original ones, such that the computation can be achieved in O(1) time
with sufficient accuracy guaranteed for each vector. The experimental
evaluation demonstrates the effectiveness and efficiency of the introduced
strategies [41]. Please refer to [41] for more details about this technique.
10.4.2 Count Sketch
Alon et al. [10] were the first to introduce the term sketch, as the tug-of-war
sketch. The authors aim to measure and optimize the second frequency
moment $F_2 = \sum_i f_i^2$. Notably, more recent studies found that the
summarization introduced in [10] can also be utilized to measure the inner
product of two frequency distributions, that is, $\sum_i f_i f'_i$, where $f_i$ and $f'_i$
denote the two frequency distributions. From this observation, we find that
if the sketch of $f_i$ can be obtained from a data stream, the inner product
with the distribution given by $f'_i = 1$ and $f'_j = 0$ for all $j \neq i$ can be
calculated during query processing, which yields an estimate of $f_i$. The
error bound on this estimate is $\varepsilon F_2^{1/2} \leq \varepsilon n$, which holds with
probability of more than $1 - \delta$ for a sketch of size $O(\frac{1}{\varepsilon^2} \log \frac{1}{\delta})$.
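A minimal sketch of the tug-of-war summary is shown below, assuming the sign hashes can be simulated with hashlib (the original construction in [10] uses 4-wise independent hash families); the parameter k and the example streams are likewise illustrative.

```python
import hashlib

def sign(item, m):
    """Pseudo-random +/-1 sign for (item, counter index m)."""
    digest = hashlib.md5(f"{m}:{item}".encode()).digest()
    return 1 if digest[0] % 2 == 0 else -1

def ams_sketch(stream, k=64):
    """One counter per sign hash: counter[m] = sum over items i of s_m(i) * f_i."""
    counters = [0] * k
    for item in stream:
        for m in range(k):
            counters[m] += sign(item, m)
    return counters

def estimate_f2(counters):
    """F_2 = sum_i f_i^2 is estimated by averaging the squared counters."""
    return sum(c * c for c in counters) / len(counters)

def estimate_inner_product(counters_a, counters_b):
    """sum_i f_i * f'_i is estimated by averaging the pairwise counter products."""
    return sum(a * b for a, b in zip(counters_a, counters_b)) / len(counters_a)

stream_a = ["x"] * 30 + ["y"] * 20 + ["z"] * 10
stream_b = ["x"] * 5 + ["y"] * 5
sk_a, sk_b = ams_sketch(stream_a), ams_sketch(stream_b)
print(estimate_f2(sk_a))                   # roughly 30**2 + 20**2 + 10**2 = 1400
print(estimate_inner_product(sk_a, sk_b))  # roughly 30*5 + 20*5 = 250
```

Note that every arriving item updates all k counters, which is exactly the update cost that Charikar et al. [14] reduce, as discussed next.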
The cost of updating such a count sketch is high because all of the sketch
counters must be updated whenever a new instance arrives. Thus, the naive count sketch
technique is not an appropriate strategy for data streams. To address the issue,
Charikar et al. [14] proposed an improved algorithm that requires only a
small part of the sketch to be updated when a new instance comes in.
Thus, performance is significantly improved.
We briefly describe the idea in [14]; interested readers are referred
to that paper for more details. The introduced sketch structure
contains a d × ω array (denoted as C) that stores counters (where d is the
number of rows), with two hash functions associated with each row. One hash
function g maps the instances of the data stream into [ω], whereas the other
hash function h maps the instances into {−1, +1}. For each row j (i.e., $1 \leq j \leq d$),
the corresponding mapping of instance i, that is, $h_j(i)$, is added to the
array element $C[j, g_j(i)]$. The frequency $f_i$ is then estimated as
$\mathrm{median}_{1 \leq j \leq d}\; h_j(i) \cdot C[j, g_j(i)]$ [17].
The analysis shows that for each value of j, the expectation and the variance,
which depends on $F_2/\omega$, can be accurately derived.
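The following is a minimal sketch of this structure under the stated assumptions: the per-row hash functions g_j and h_j are simulated with hashlib rather than with the hash families used in [14], and the default sizes d = 5 and ω = 256 are illustrative.

```python
import hashlib
import statistics

class CountSketch:
    """A d x w array of counters with a bucket hash g_j and a sign hash h_j per row."""

    def __init__(self, d=5, w=256):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]

    def _hashes(self, item, j):
        # Derive g_j(item) in [w] and h_j(item) in {-1, +1} from one digest.
        digest = hashlib.md5(f"{j}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % self.w
        sign = 1 if digest[4] % 2 == 0 else -1
        return bucket, sign

    def update(self, item, count=1):
        # Each arriving instance touches exactly one counter per row.
        for j in range(self.d):
            bucket, sign = self._hashes(item, j)
            self.C[j][bucket] += sign * count

    def estimate(self, item):
        # f_i is estimated as the median over rows of h_j(i) * C[j, g_j(i)].
        values = []
        for j in range(self.d):
            bucket, sign = self._hashes(item, j)
            values.append(sign * self.C[j][bucket])
        return statistics.median(values)

# Usage: feed a stream, then query approximate item frequencies.
cs = CountSketch()
for x in ["a"] * 100 + ["b"] * 40 + ["c"] * 5:
    cs.update(x)
print(cs.estimate("a"), cs.estimate("b"))   # approximately 100 and 40
```

In contrast to the tug-of-war sketch, each update here modifies only d counters (one per row), which is the improvement discussed above.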