220 Applied Data Mining
sampling, a sample is randomly chosen from each stratum, whereas in cluster
sampling, only the randomly selected clusters are explored; and (3) cluster
sampling primarily aims to reduce costs by increasing sampling efficiency,
whereas stratified sampling aims to increase effectiveness (i.e., precision).
However, both of these sampling methods are limited by the unknown
size of the total data.
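The contrast between the two methods can be made concrete with a minimal sketch. The function names, the dictionary layout (group name mapped to a list of items), and the parameters `per_group` and `n_clusters` are assumptions chosen for illustration; they do not come from the text.

```python
import random


def stratified_sample(groups, per_group, rng):
    """Draw up to `per_group` random items from EVERY stratum."""
    return {
        name: rng.sample(items, min(per_group, len(items)))
        for name, items in groups.items()
    }


def cluster_sample(groups, n_clusters, rng):
    """Pick `n_clusters` groups at random and keep ALL items in each."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: list(groups[name]) for name in chosen}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make runs repeatable
    # Hypothetical data: four groups of five records each.
    groups = {f"g{i}": list(range(i * 10, i * 10 + 5)) for i in range(4)}
    print(stratified_sample(groups, per_group=2, rng=rng))
    print(cluster_sample(groups, n_clusters=2, rng=rng))
```

Both functions need the full grouping before any item is drawn, which is exactly the limitation noted above when the total size of the data is unknown.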
As introduced in [27], several issues confront existing sampling
techniques. First, data streams have an unknown dataset size. Therefore,
the sampling process on a data stream requires a special analysis to limit
the error bounds. Second, a sampling strategy may be inappropriate for
detecting anomalies in surveillance analysis because the data rates in
the stream are always changing. Thus, we explore
the relationship among the data rate, sampling rate, and error bounds for
real applications.
10.3 Wavelet Method
The wavelet-based technique is a fundamental tool for analyzing data
streams. From a traditional perspective, a wavelet is a mathematical function
used to divide a given function or continuous time series into different scale
components [6]. This approach has been successfully applied to applications
such as signal processing, motion recognition, image compression, and so on
[63, 7]. The wavelet technique provides a concise and general summarization
of the data (i.e., the stream), which can be used as the basis for efficient
and accurate query processing methods. Numerous strategies have been
introduced based on the idea of wavelets; the one most commonly used for
data streams is the Haar wavelet [68].
The Haar wavelet provides a foundation for query processing on stream
and relational data. It creates a decomposition of the data (or compact
summary) into a set of Haar wavelet functions, which can be used for later
query processing. The essential step is the determination of the Haar wavelet
coefficients. Only coefficients with high values are typically stored. Higher
order coefficients in the decomposition generally indicate broad trends in
the data, whereas lower order coefficients represent the local trends. We will
show a concrete example to illustrate the Haar wavelet process [66, 31].
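The decomposition described above can be sketched in a few lines of code for a short eight-value stream. This is an illustrative sketch under stated assumptions, not the formulation of [66, 31]: it assumes the unnormalized convention in which each pair (a, b) yields the average (a + b)/2 and the detail coefficient (a - b)/2, and it assumes the input length is a power of two. The function names are hypothetical, and ranking coefficients by raw magnitude in `keep_largest` is a simplification.

```python
def haar_decompose(values):
    """Return [overall average, details from coarsest to finest level]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        diffs = [(a - b) / 2 for a, b in pairs]  # assumed convention
        details = diffs + details  # coarser levels go in front
        current = averages
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: expand each average into (avg + d, avg - d)."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        expanded = []
        for avg, d in zip(current, diffs):
            expanded.extend([avg + d, avg - d])
        pos += len(current)
        current = expanded
    return current


def keep_largest(coeffs, k):
    """Zero out all but the k coefficients of largest absolute value."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


if __name__ == "__main__":
    stream = [3, 2, 4, 3, 1, 5, 0, 3]
    coeffs = haar_decompose(stream)
    print(coeffs)  # [2.625, 0.375, -0.5, 0.75, 0.5, 0.5, -2.0, -1.5]
    print(haar_reconstruct(coeffs))  # recovers the original stream
    print(haar_reconstruct(keep_largest(coeffs, 4)))  # lossy approximation
```

Keeping all coefficients reproduces the stream exactly, whereas keeping only the largest ones gives a compact but approximate summary.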
Suppose our data stream is {3, 2, 4, 3, 1, 5, 0, 3}. The data in the
vector are computed as averaged values between neighbors to obtain
a lower resolution representation (i.e., level 2) of the data, such as
$\left[\frac{3+2}{2}, \frac{4+3}{2}, \frac{1+5}{2}, \frac{0+3}{2}\right] = \left[\frac{5}{2}, \frac{7}{2}, 3, \frac{3}{2}\right]$. This transformation results in the loss
of information, thus requiring more information to be stored. The Haar
wavelet technique computes the differences of the averaged values between