10.4.3 Fast Count Sketch
To further improve the efficiency of count sketches, Thorup and Zhang [64]
introduced the fast count sketch technique, which uses a single random hash
evaluation per update to speed up processing while maintaining reasonable
error bounds. The price of this improvement is that more sketch vectors are
utilized and the hash function must be designed carefully. In fast count
sketches, the counters in the vector are the same as those of count sketches;
the only difference is that a four-universal hash function is associated
with the vector.
When a new data instance i arrives, its mapped value ω is immediately
added to the corresponding counter, that is, x_f[h(i)] = x_f[h(i)] + ω, where
h : I → {1, . . . , n} is the four-universal hash function [64].
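The update rule above can be illustrated with a small sketch in Python. The class and variable names below are ours, and the random degree-3 polynomial hash is only a stand-in for the tabulation-based four-universal hash used in [64]:

```python
import random

class FastCountSketchRow:
    """One fast-count-sketch vector: n counters plus one hash function h that
    maps item ids to counter positions (illustrative names, not from [64])."""

    def __init__(self, n, seed=0):
        self.n = n
        self.counters = [0] * n
        rng = random.Random(seed)
        self.p = 2_147_483_647                      # prime modulus (2^31 - 1)
        # A random degree-3 polynomial over Z_p gives a four-universal hash.
        self.coeffs = [rng.randrange(1, self.p) for _ in range(4)]

    def _h(self, item):
        """Map an item to a bucket in {0, ..., n-1}."""
        x = hash(item) % self.p
        acc = 0
        for c in self.coeffs:                       # Horner evaluation
            acc = (acc * x + c) % self.p
        return acc % self.n

    def update(self, item, weight=1):
        """x_f[h(i)] = x_f[h(i)] + w: a single hash evaluation per update."""
        self.counters[self._h(item)] += weight
```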
We can deduce an estimate of the join size based on the second frequency
moment, as in [62]. As claimed by the authors, the estimate is unbiased for
the inner product f · g of the two frequency vectors. The variance of the
fast count sketch is bounded in a manner similar to that of the count sketch:
the multiplicative factor is 1/(n − 1) for the fast count sketches but 1/n
for the count sketches. Consequently, more entries are needed in the fast
count sketches than in the count sketches to reach the same error, but the
difference is negligible for large values of n.
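To make the estimation step concrete, the following sketch (reusing the FastCountSketchRow class from the previous example) shows one way the inner product could be estimated from two vectors built with the same hash function. The bias-correction term is our assumption and should be checked against [62, 64]; in practice several independent rows are combined to control the variance:

```python
def estimate_inner_product(row_f, row_g):
    """Estimate f . g from two sketch rows built with the SAME hash function.
    The raw dot product of the counter vectors over-counts colliding items,
    so a correction using the total masses is applied (this particular form
    is an assumption; see [62, 64] for the exact estimator)."""
    assert row_f.n == row_g.n, "rows must share n and the hash coefficients"
    n = row_f.n
    raw = sum(a * b for a, b in zip(row_f.counters, row_g.counters))
    total_f = sum(row_f.counters)      # equals the total mass of f
    total_g = sum(row_g.counters)
    return (n / (n - 1)) * (raw - total_f * total_g / n)
```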
10.4.4 Count Min Sketch
Cormode and Muthukrishnan [18] introduced another effective sketch type,
the count min sketch, which facilitates the construction and update of
the synopsis. The data structure of count min sketches is the same as that
of fast count sketches. Count min sketches apply a series of two-universal
hash functions to map the data instances, in contrast to the four-universal
functions used in fast count sketches. The mechanism of sketch update in
count min sketches is the same as that in fast count sketches.
An issue for count min sketches is that their error guarantee is expressed
in terms of the L1 norm, whereas that of count sketches is expressed in terms
of the L2 norm. Thus, count min sketches require more space to maintain the
same level of error bound compared with count sketches. This is attributable
to the fact that the L2 norm is generally smaller than the L1 norm.
In summary, given an input data stream of length N and user-specified
parameters ε and δ, the count min sketch technique can store the
frequencies of all the instances with the following guarantees: (1) every
stored frequency differs from the true frequency by at most εN with
probability at least 1 − δ; (2) the space usage is O((1/ε) log(1/δ)); and
(3) the cost of each update and query is O(log(1/δ)).
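These guarantees follow from keeping about ⌈ln(1/δ)⌉ rows of ⌈e/ε⌉ counters each. The following is a minimal, self-contained sketch of the structure; the class and parameter names are ours, and the simple (a·x + b) mod p hashes stand in for any two-universal family:

```python
import math
import random

class CountMinSketch:
    """Minimal count min sketch: depth rows x width counters, one
    two-universal hash per row (an illustrative sketch, not a tuned library)."""

    def __init__(self, epsilon, delta, seed=0):
        self.width = math.ceil(math.e / epsilon)        # counters per row
        self.depth = math.ceil(math.log(1.0 / delta))   # number of rows
        self.p = 2_147_483_647                          # prime modulus
        rng = random.Random(seed)
        self.rows = [[0] * self.width for _ in range(self.depth)]
        # Each row uses h(x) = ((a*x + b) mod p) mod width, a two-universal family.
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(0, self.p))
                       for _ in range(self.depth)]

    def _bucket(self, r, item):
        a, b = self.hashes[r]
        return ((a * (hash(item) % self.p) + b) % self.p) % self.width

    def update(self, item, count=1):
        """Increment one counter per row: O(log(1/delta)) work per update."""
        for r in range(self.depth):
            self.rows[r][self._bucket(r, item)] += count

    def query(self, item):
        """Point query: never underestimates, and overestimates by at most
        epsilon*N with probability at least 1 - delta."""
        return min(self.rows[r][self._bucket(r, item)] for r in range(self.depth))
```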
10.4.5 Some Related Issues on Sketches
In addition to the sketch-based algorithms introduced above, several other
extended approaches build on sketch techniques.
Pseudo random vector generation. A primary issue for sketch construction
is that the number of distinct items may be large, and therefore the size of
the corresponding random vector will also be large. This reduces the
efficiency of building the sketches. To address this problem, we first
generate a set of k random vectors, and then, when a data instance arrives,
we map it to the corresponding pre-generated random vector. This strategy,
however, may consume a large amount of space. A more feasible idea is to
store the random vectors implicitly (i.e., as seeds), which are used
dynamically to generate the vector entries on demand.
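As a toy illustration of this idea (our own naming; the exact constructions in [10, 37] differ in detail), an entry of a ±1 random vector can be regenerated on demand from a small seed instead of being stored explicitly:

```python
def pseudo_random_sign(seed_coeffs, i, p=2_147_483_647):
    """Return the i-th +/-1 entry of an implicitly stored random vector.
    Only the small seed (four polynomial coefficients, O(log N) bits) is kept;
    the random degree-3 polynomial gives approximate four-wise independence.
    Illustrative only: the papers cited above use more careful mappings."""
    acc = 0
    for c in seed_coeffs:                  # Horner evaluation at point i
        acc = (acc * (i % p) + c) % p
    return 1 if acc % 2 == 0 else -1

# Example: the "vector" is fully determined by four integers.
seed = [123456789, 987654321, 192837465, 564738291]
entry_42 = pseudo_random_sign(seed, 42)
```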
The authors of [10] showed that four-wise independent random vectors can be
generated from a seed of size O(log N). Gilbert et al. [37] demonstrated
that seven-wise independent random vectors can be generated if Reed-Muller
codes are used. The properties of the pseudo random vector generation
approach are as follows: (1) a random vector can be generated from the seed
in poly-logarithmic time; and (2) the dot product of two vectors can be
approximately computed using only their sketch representations. We can
observe that the dot product of two vectors is closely related to the
Euclidean distance between them, a relationship that underlies the random
projection strategy [45].
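Concretely, the connection between dot products and Euclidean distances referred to here is the standard identity below, so a sketch that approximates inner products also yields approximate distances:

```latex
\|x - y\|_2^2 \;=\; \|x\|_2^2 + \|y\|_2^2 - 2\,(x \cdot y)
```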
Sketch partitioning. Dobra et al. [23] introduced the sketch partitioning
technique. The authors deliberately partition the join attributes and
construct a separate sketch for each group. The final estimate is
accumulated over all partitions. The essential part of the technique is
partitioning the domains so as to bound the variance, which can yield high
accuracy for applications. The authors of [24] further studied the issue by
extending it to multi-query processing.
Sketch skimming. Ganguly et al. [28] observed that sketch skimming
can be utilized to improve the estimation of join size. The variance of the
join estimate is largely determined by the most frequent items, which are
generally few even for a large data set. Given that high variance is
undesirable, the frequent instances are deliberately separated from the
others. The skimmed sketches are therefore obtained by removing the
contributions of the frequent instances from the sketches. The join size can
then be estimated using four-wise independent random vectors, and the
experimental evaluation in [28] demonstrates the efficiency of the proposed
technique.
10.4.6 Applications of Sketches
A large number of applications can utilize sketch-based strategies. One
practical problem is that of heavy hitters [19, 49, 17], in which we need to
detect the most frequent items in the data stream. Recognizing differences
in network data streams was explored in [20], and detecting differences
among data streams was studied in [25, 26]. Similar issues on
XML data (or tree data) were presented in [57, 58, 61]. These works aimed
to construct the synopsis for structured queries, which can be used later to
improve query processing performance.
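As a toy illustration of the heavy hitter problem (reusing the CountMinSketch class sketched earlier; the threshold parameter phi and the single-pass candidate set are our simplifications, not the algorithms of [19, 49, 17]), frequent items can be reported as follows:

```python
def heavy_hitters(stream, phi=0.01, epsilon=0.001, delta=0.01):
    """Report items whose estimated frequency exceeds phi * N.
    Single-pass toy version; practical algorithms keep a small heap of
    candidates and prune it as N grows."""
    cms = CountMinSketch(epsilon, delta)
    candidates = set()
    n = 0
    for item in stream:
        n += 1
        cms.update(item)
        if cms.query(item) >= phi * n:
            candidates.add(item)
    # Re-check candidates against the final threshold phi * N.
    return {x for x in candidates if cms.query(x) >= phi * n}
```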
Sketch-based strategies are also widely utilized in network research.
Improving the communication efficiency for signals in sensor networks
is important. Moreover, considering resource limitations (e.g., battery),
efficient storage based on concise summarization of the stream data is an
essential issue for sensor networks. A number of works have addressed these
issues [15, 38, 46]. Please refer to [8] for more details.
10.4.7 Advantages and Limitations of Sketch Strategies
Sketch-based strategies have several advantages. The first is space usage.
Sketch-based approaches have been theoretically and experimentally
proven to obtain optimal sub-linear space usage in the data size. This
finding can be attributed to the fact that the space requirement is
logarithmic in the number of distinct items in the stream, which is
relatively small considering the large volume of the data.
Despite the advantages of sketch-based methods, several challenging
issues remain. First, almost all related studies use the Lp norm as the
aggregate measure, which may not reflect the actual data distribution. Thus,
the sketch summarization may fail to store the essential information of the
data.
Another issue is high dimensionality. The existence of hundreds of
independent dimensions in the data stream may hinder the practical
usage of the existing state-of-the-art sketch-based techniques. This issue
has been raised by several researchers [16]. However, the problem remains
challenging because of its intrinsic complexity.
As highlighted in [8], most sketch-based works only focus on identifying
the frequent instances and estimating the frequency moments and join size.
This emphasis on the micro view may neglect the macro trends in the stream
data, such as the temporal property. Thus, the temporal information may
be lost because of the transformation process when building the sketches.
Although several scholars have mentioned this issue and consequently
introduced effective techniques for temporal analysis [41], the strategy
requires significant space usage, which makes the approach impractical
for real large data streams. Extending the state-of-the-art strategies for
temporal trends analysis with limited space usage remains an interesting
and challenging issue.
As previously mentioned, a trade-off exists between space usage and
data rate. For real applications, given sufficient storage resources (i.e.,
main memory, SSD), handling data streams with a relatively slow update rate
(but possibly a large number of distinct items) [37] may be practical. The
truly challenging applications are therefore those that must process very
fast data streams, such as sensor networks [21, 55, 46, 47], because of
power and hardware limitations.
10.5 Histogram Method
The histogram approach is another major tool used for analyzing data
streams. In statistics, the term histogram was first introduced as a
graphical representation to illustrate the visual impression of the
distribution of data [2]. A histogram is an estimate of the probability
distribution of a continuous variable and was first introduced by Pearson [56].
Histograms are commonly used to describe the density of data and to
estimate the probability density function of the underlying variable.
Specifically, in database research, the histogram approach partitions the
data into a series of categories (known as bins or buckets) relative to some
feature (or dimension), and the count of each bin is stored.
From a formal mathematical view, a histogram is a function m that counts
the number of instances falling into each of the disjoint buckets, whereas
the graph of a histogram is merely one way to depict it. Let n be the total
number of instances, k be the total number of bins, and m_i be the count of
the i-th bin; we then have n = Σ_{i=1}^{k} m_i. A cumulative histogram is a
mapping that measures the cumulative number of instances in all buckets up
to the specified bucket. The cumulative histogram M_i of a histogram m is
defined as M_i = Σ_{j=1}^{i} m_j.
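The two definitions above translate directly into code. The following sketch (with illustrative names, and with fixed equal-width buckets chosen only for simplicity) computes the bin counts m_i and the cumulative counts M_i for a batch of values:

```python
from itertools import accumulate

def equi_width_histogram(values, k, lo, hi):
    """Compute bin counts m_1..m_k over k equal-width buckets covering [lo, hi],
    plus the cumulative histogram M_i = m_1 + ... + m_i."""
    width = (hi - lo) / k
    m = [0] * k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bucket
        m[idx] += 1
    M = list(accumulate(m))                        # cumulative counts
    assert M[-1] == len(values)                    # n equals the sum of all m_i
    return m, M
```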
By analyzing the process of histogram construction, we find that the space
usage of a histogram is determined by the total number of buckets used.
Buckets can intuitively be obtained by partitioning the data into
equal-width ranges. Such an equi-width division technique is related to Haar
wavelet coefficients: if the wavelet summarization of the frequency
distribution is built relative to any dimension, then the Haar coefficients
represent the difference in relative frequencies between equi-width
histogram buckets [8].
Although this equi-width technique is easy to implement, it has the drawback
of low representation accuracy. This low accuracy can be attributed to the
fact that the distribution of the data is not well preserved by the
equi-width mechanism because of its assumption of a uniform distribution.
The localized data distribution is commonly cut by the bucket boundaries;
for instance, the number of points in different buckets may vary
significantly. This makes query estimation difficult. Therefore, the
histogram technique requires the design of an appropriate bucket
construction mechanism.
Similar to the idea of the kd-tree, we can build buckets so that each one
contains approximately the same number of instances (known as an equi-depth
histogram). Numerous experiments have shown that equi-depth histograms are
considerably more effective than equi-width histograms. Therefore, a large
number of commercial vendors switched to equi-depth histograms in the years
following their introduction [42]. Multidimensional equi-depth histograms
were introduced in [52]. However, for special data such as streams, the
construction of buckets based on the equi-depth technique is difficult
because the data are dynamic and unknown a priori.
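For data that can be sorted offline, equi-depth boundaries are simply quantiles, as the small sketch below illustrates (our own naming; streaming settings instead rely on approximate quantile summaries because the data are not known a priori):

```python
def equi_depth_boundaries(values, k):
    """Return k-1 bucket boundaries so that each of the k buckets holds
    roughly the same number of instances (offline illustration only)."""
    data = sorted(values)
    n = len(data)
    # Boundary j is the value below which roughly j*n/k instances fall.
    return [data[min(n - 1, (j * n) // k)] for j in range(1, k)]
```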
To improve the effectiveness of histograms, Ioannidis et al. introduced
the V-optimal histogram [43], which aims to minimize the variance of the
frequencies of the values within each bucket. In this way, the assumption of
a uniform distribution within buckets is better satisfied. Specifically, if
a bucket b with total count c contains n instances, then the average
frequency of each instance in b is c/n. Let f_1, . . . , f_n be the
frequencies of the n instances in b. The variance v of the frequencies
around this average is obtained as v = Σ_{i=1}^{n} (f_i − c/n)^2. Finally,
the overall variance V over all buckets is obtained as V = Σ_b v_b.
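For concreteness, the cost being minimized can be evaluated as follows for a candidate bucketing (illustrative code; finding the bucketing that minimizes this cost is typically done with dynamic programming, which is not shown here):

```python
def v_optimal_cost(frequencies, boundaries):
    """Total variance V = sum over buckets of sum_i (f_i - avg_b)^2, where the
    candidate bucketing is given as a sorted list of boundary indices."""
    cost = 0.0
    start = 0
    for end in list(boundaries) + [len(frequencies)]:
        bucket = frequencies[start:end]
        if bucket:
            avg = sum(bucket) / len(bucket)       # c/n for this bucket
            cost += sum((f - avg) ** 2 for f in bucket)
        start = end
    return cost

# Example: frequencies of 6 values split into buckets {0,1,2} and {3,4,5}.
print(v_optimal_cost([9, 8, 9, 1, 2, 1], [3]))
```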
Improvements on V-optimal histogram construction were introduced in [44].
In that work, the Lp-difference between two vectors whose cardinalities are
based on the distinct instances is used as the objective function. Other
works consider alternative objective functions to optimize the histogram
construction [60]. The advantages and limitations of V-optimal histograms
are explained as follows.
Advantage of V-optimal histograms: V-optimal histograms summarize the
contents of the buckets optimally. Any histogram incurs some error when used
to summarize data; V-optimal histograms bound this error by finding the
bucketing with the smallest variance among all possible bucketings. As
demonstrated in [59], V-optimal histograms achieve the best accuracy in
summarizing data.
Limitation of V-optimal histograms: The major drawback of V-optimal
histograms is that they are difficult to update. Rebuilding the entire
histogram is necessary when new data are received. By contrast, the
equi-width histogram technique can address this issue easily. Moreover,
although equi-depth histograms also have to be rebuilt, the cost is lower
compared with