10.4.3 Fast Count Sketch
To further improve the efficiency of count sketches, Thorup and Zhang [64]
introduced the fast count sketch technique, which uses a single random hash
evaluation per update to speed up processing while maintaining reasonable
error bounds. The price of this improvement is that more sketch vectors are
utilized and the hash function must be designed carefully. In fast count
sketches, the counters in the vector are the same as those of count sketches;
the only difference is that a four-universal hash function is associated
with the vector.
When a new data instance i arrives, its mapped value ω is immediately
added to the corresponding counter, that is, x_f[h(i)] = x_f[h(i)] + ω, where
h : I → {1, . . . , n} is the four-universal hash function [64].
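The update rule above can be illustrated with a small sketch in Python. The class and variable names below are ours, and the random degree-3 polynomial hash is only a stand-in for the tabulation-based four-universal hash used in [64]:

```python
import random

class FastCountSketchRow:
    """One fast-count-sketch vector: n counters plus one hash function h that
    maps item ids to counter positions (illustrative names, not from [64])."""

    def __init__(self, n, seed=0):
        self.n = n
        self.counters = [0] * n
        rng = random.Random(seed)
        self.p = 2_147_483_647                      # prime modulus (2^31 - 1)
        # A random degree-3 polynomial over Z_p gives a four-universal hash.
        self.coeffs = [rng.randrange(1, self.p) for _ in range(4)]

    def _h(self, item):
        """Map an item to a bucket in {0, ..., n-1}."""
        x = hash(item) % self.p
        acc = 0
        for c in self.coeffs:                       # Horner evaluation
            acc = (acc * x + c) % self.p
        return acc % self.n

    def update(self, item, weight=1):
        """x_f[h(i)] = x_f[h(i)] + w: a single hash evaluation per update."""
        self.counters[self._h(item)] += weight
```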
We can deduce an estimate of the join size based on the second frequency
moment, as in [62]. As claimed by the authors, the estimate is unbiased for
the inner product f · g of the two frequency vectors. The variance of the
fast count sketch is bounded in a manner similar to that of the count sketch:
the multiplicative factor is 1/(n − 1) for the fast count sketches but 1/n
for the count sketches. Consequently, more entries are needed in the fast
count sketches than in the count sketches to reach the same error, but the
difference is negligible for large values of n.
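To make the estimation step concrete, the following sketch (reusing the FastCountSketchRow class from the previous example) shows one way the inner product could be estimated from two vectors built with the same hash function. The bias-correction term is our assumption and should be checked against [62, 64]; in practice several independent rows are combined to control the variance:

```python
def estimate_inner_product(row_f, row_g):
    """Estimate f . g from two sketch rows built with the SAME hash function.
    The raw dot product of the counter vectors over-counts colliding items,
    so a correction using the total masses is applied (this particular form
    is an assumption; see [62, 64] for the exact estimator)."""
    assert row_f.n == row_g.n, "rows must share n and the hash coefficients"
    n = row_f.n
    raw = sum(a * b for a, b in zip(row_f.counters, row_g.counters))
    total_f = sum(row_f.counters)      # equals the total mass of f
    total_g = sum(row_g.counters)
    return (n / (n - 1)) * (raw - total_f * total_g / n)
```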
10.4.4 Count Min Sketch
Cormode and Muthukrishnan [18] introduced another effective sketch type,
the count min sketch, which facilitates the construction and update of
the synopsis. The data structure of count min sketches is the same as that
of fast count sketches. Count min sketches apply a series of two-universal
hash functions to map the data instances, in contrast to the four-universal
functions used in fast count sketches. The mechanism of sketch update in
count min sketches is the same as that in fast count sketches.
An issue for count min sketches is that their error guarantee is expressed
in terms of the L1 norm, whereas that of count sketches is expressed in terms
of the L2 norm. Thus, count min sketches require more space to maintain the
same level of error bound compared with count sketches. This is attributable
to the fact that the L2 norm is generally smaller than the L1 norm.
In summary, given an input data stream of length N and user-specified
parameters ε and δ, the count min sketch technique can store the
frequencies of all the instances with the following guarantees: (1) every
stored frequency differs from the true frequency by at most εN with
probability at least 1 − δ; (2) the space usage is O((1/ε) log(1/δ)); and
(3) the cost of each update and query is O(log(1/δ)).
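These guarantees follow from keeping about ⌈ln(1/δ)⌉ rows of ⌈e/ε⌉ counters each. The following is a minimal, self-contained sketch of the structure; the class and parameter names are ours, and the simple (a·x + b) mod p hashes stand in for any two-universal family:

```python
import math
import random

class CountMinSketch:
    """Minimal count min sketch: depth rows x width counters, one
    two-universal hash per row (an illustrative sketch, not a tuned library)."""

    def __init__(self, epsilon, delta, seed=0):
        self.width = math.ceil(math.e / epsilon)        # counters per row
        self.depth = math.ceil(math.log(1.0 / delta))   # number of rows
        self.p = 2_147_483_647                          # prime modulus
        rng = random.Random(seed)
        self.rows = [[0] * self.width for _ in range(self.depth)]
        # Each row uses h(x) = ((a*x + b) mod p) mod width, a two-universal family.
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(0, self.p))
                       for _ in range(self.depth)]

    def _bucket(self, r, item):
        a, b = self.hashes[r]
        return ((a * (hash(item) % self.p) + b) % self.p) % self.width

    def update(self, item, count=1):
        """Increment one counter per row: O(log(1/delta)) work per update."""
        for r in range(self.depth):
            self.rows[r][self._bucket(r, item)] += count

    def query(self, item):
        """Point query: never underestimates, and overestimates by at most
        epsilon*N with probability at least 1 - delta."""
        return min(self.rows[r][self._bucket(r, item)] for r in range(self.depth))
```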
10.4.5 Some Related Issues on Sketches
In addition to the sketch-based algorithms introduced above, several other
extended approaches build on sketch techniques.
Pseudo random vector generation. A primary issue for sketch construction
is that the number of distinct items may be large, and therefore the size of
the corresponding random vector will also be large. This reduces the
efficiency of building the sketches. To address this problem, we first
generate a set of k random vectors, and then, when a data instance arrives,
we map it to the corresponding pre-generated random vector. This strategy,
however, may consume a large amount of space. A more feasible idea is to
store the random vectors implicitly (i.e., as seeds), which are used
dynamically to generate the vector entries on demand.
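As a toy illustration of this idea (our own naming; the exact constructions in [10, 37] differ in detail), an entry of a ±1 random vector can be regenerated on demand from a small seed instead of being stored explicitly:

```python
def pseudo_random_sign(seed_coeffs, i, p=2_147_483_647):
    """Return the i-th +/-1 entry of an implicitly stored random vector.
    Only the small seed (four polynomial coefficients, O(log N) bits) is kept;
    the random degree-3 polynomial gives approximate four-wise independence.
    Illustrative only: the papers cited above use more careful mappings."""
    acc = 0
    for c in seed_coeffs:                  # Horner evaluation at point i
        acc = (acc * (i % p) + c) % p
    return 1 if acc % 2 == 0 else -1

# Example: the "vector" is fully determined by four integers.
seed = [123456789, 987654321, 192837465, 564738291]
entry_42 = pseudo_random_sign(seed, 42)
```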
The authors of [10] showed that four-wise independent random vectors can be
generated from a seed of size O(log N). Gilbert et al. [37] demonstrated
that seven-wise independent random vectors can be generated if Reed-Muller
codes are used. The properties of the pseudo random vector generation
approach are as follows: (1) a random vector can be generated from the seed
in poly-logarithmic time; and (2) the dot product of two vectors can be
approximately computed using only their sketch representations. We can
observe that the dot product of two vectors is closely related to the
Euclidean distance between them, a relationship that underlies the random
projection strategy [45].
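Concretely, the connection between dot products and Euclidean distances referred to here is the standard identity below, so a sketch that approximates inner products also yields approximate distances:

```latex
\|x - y\|_2^2 \;=\; \|x\|_2^2 + \|y\|_2^2 - 2\,(x \cdot y)
```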
Sketch partitioning. Dobra et al. [23] introduced the sketch partitioning
technique. The authors deliberately partition the join attributes and
construct a separate sketch for each group. The final estimate is
accumulated over all partitions. The essential part of the technique is
partitioning the domains so as to bound the variance, which can yield high
accuracy for applications. The authors of [24] further studied the issue by
extending it to multi-query processing.
Sketch skimming. Ganguly et al. [28] observed that sketch skimming
can be utilized to improve the estimation of join size. The variance of the
join estimate is largely determined by the most frequent items, which are
generally few even for a large data set. Given that high variance is
undesirable, the frequent instances are deliberately separated from the
others. The skimmed sketches are therefore obtained by removing the
contributions of the frequent instances from the sketches. The join size can
then be estimated using four-wise independent random vectors, and the
experimental evaluation in [28] demonstrates the efficiency of the proposed
technique.
10.4.6 Applications of Sketches
A large number of applications can utilize sketch-based strategies. One
practical problem is that of heavy hitters [19, 49, 17], in which we need to
detect the most frequent items in the data stream. Recognizing differences
in network data streams was explored in [20], and detecting differences
among data streams was studied in [25, 26]. Similar issues on
XML data (or tree data) were presented in [57, 58, 61]. These works aimed
to construct the synopsis for structured queries, which can be used later to
improve query processing performance.
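As a toy illustration of the heavy hitter problem (reusing the CountMinSketch class sketched earlier; the threshold parameter phi and the single-pass candidate set are our simplifications, not the algorithms of [19, 49, 17]), frequent items can be reported as follows:

```python
def heavy_hitters(stream, phi=0.01, epsilon=0.001, delta=0.01):
    """Report items whose estimated frequency exceeds phi * N.
    Single-pass toy version; practical algorithms keep a small heap of
    candidates and prune it as N grows."""
    cms = CountMinSketch(epsilon, delta)
    candidates = set()
    n = 0
    for item in stream:
        n += 1
        cms.update(item)
        if cms.query(item) >= phi * n:
            candidates.add(item)
    # Re-check candidates against the final threshold phi * N.
    return {x for x in candidates if cms.query(x) >= phi * n}
```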
Sketch-based strategies are also widely utilized in network research.
Improving the communication efficiency for signals in sensor networks
is important. Moreover, considering resource limitations (e.g., battery),
efficient storage based on concise summarization of the stream data is an
essential issue for sensor networks. A number of works have addressed these
issues [15, 38, 46]. Please refer to [8] for more details.
10.4.7 Advantages and Limitations of Sketch Strategies
Sketch-based strategies have several advantages. The first is space usage.
Sketch-based approaches have been theoretically and experimentally
proven to obtain optimal sub-linear space usage in the data size. This
finding can be attributed to the fact that the space requirement is
logarithmic in the number of distinct items in the stream, which is
relatively small considering the large volume of the data.
Despite the advantages of sketch-based methods, several challenging
issues remain. First, almost all related studies use the Lp norm as the
aggregate measure, which may not reflect the actual data distribution. Thus,
the sketch summarization may fail to store the essential information of the
data.
Another issue is high dimensionality. The existence of hundreds of
independent dimensions in the data stream may hinder the practical
usage of the existing state-of-the-art sketch-based techniques. This issue
has been raised by several researchers [16]. However, the problem remains
challenging because of its intrinsic complexity.
As highlighted in [8], most sketch-based works only focus on identifying
the frequent instances and estimating the frequency moments and join size.
This emphasis on the micro view may neglect the macro trends in the stream
data, such as the temporal property. Thus, the temporal information may
be lost because of the transformation process when building the sketches.
Although several scholars have mentioned this issue and consequently
introduced effective techniques for temporal analysis [41], the strategy
requires significant space usage, which makes the approach impractical
for real large data streams. Extending the state-of-the-art strategies for
temporal trends analysis with limited space usage remains an interesting
and challenging issue.
As previously mentioned, a trade-off exists between space usage and
data rate. For real applications, given sufficient storage resources (i.e.,
main memory, SSD), handling data streams with a relatively slow update rate
(but possibly a large number of distinct items) [37] may be practical. The
truly challenging applications are therefore those that must process very
fast data streams, such as sensor networks [21, 55, 46, 47], because of
power and hardware limitations.
10.5 Histogram Method
The histogram approach is another major tool used for analyzing data
streams. In statistics, the term histogram was first introduced as a
graphical representation to illustrate the visual impression of the
distribution of data [2]. A histogram is an estimate of the probability
distribution of a continuous variable and was first introduced by Pearson [56].
Histograms are commonly used to describe the density of data and to
estimate the probability density function of the underlying variable.
Specifically, in database research, the histogram approach partitions the
data into a series of categories (known as bins or buckets) relative to some
feature (or dimension), and the count of each bin is stored.
From a formal mathematical view, a histogram is a function m that counts
the number of instances falling into each of the disjoint buckets, whereas
the graph of a histogram is merely one way to depict it. Let n be the total
number of instances, k be the total number of bins, and m_i be the count of
the i-th bin; we then have n = Σ_{i=1}^{k} m_i. A cumulative histogram is a
mapping that measures the cumulative number of instances in all buckets up
to the specified bucket. The cumulative histogram M_i of a histogram m is
defined as M_i = Σ_{j=1}^{i} m_j.
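The two definitions above translate directly into code. The following sketch (with illustrative names, and with fixed equal-width buckets chosen only for simplicity) computes the bin counts m_i and the cumulative counts M_i for a batch of values:

```python
from itertools import accumulate

def equi_width_histogram(values, k, lo, hi):
    """Compute bin counts m_1..m_k over k equal-width buckets covering [lo, hi],
    plus the cumulative histogram M_i = m_1 + ... + m_i."""
    width = (hi - lo) / k
    m = [0] * k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bucket
        m[idx] += 1
    M = list(accumulate(m))                        # cumulative counts
    assert M[-1] == len(values)                    # n equals the sum of all m_i
    return m, M
```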
By analyzing the process of histogram construction, we find that the space
usage of a histogram is determined by the total number of buckets used.
Buckets can intuitively be obtained by partitioning the data into
equal-width ranges. Such an equi-width division technique is related to Haar
wavelet coefficients: if the wavelet summarization of the frequency
distribution is built relative to any dimension, then the Haar coefficients
represent the difference in relative frequencies between equi-width
histogram buckets [8].
Although this equi-width technique is easy to implement, it has the drawback
of low representation accuracy. This low accuracy can be attributed to the
fact that the distribution of the data is not well preserved by the
equi-width mechanism because of its assumption of a uniform distribution.
The localized data distribution is commonly cut by the bucket boundaries;
for instance, the number of points in different buckets may vary
significantly. This makes query estimation difficult. Therefore, the
histogram technique requires the design of an appropriate bucket
construction mechanism.
Similar to the idea of the kd-tree, we can build buckets so that each one
contains approximately the same number of instances (known as an equi-depth
histogram). Numerous experiments have shown that equi-depth histograms are
considerably more effective than equi-width histograms. Therefore, a large
number of commercial vendors switched to equi-depth histograms in the years
following their introduction [42]. Multidimensional equi-depth histograms
were introduced in [52]. However, for special data such as streams, the
construction of buckets based on the equi-depth technique is difficult
because the data are dynamic and unknown a priori.
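For data that can be sorted offline, equi-depth boundaries are simply quantiles, as the small sketch below illustrates (our own naming; streaming settings instead rely on approximate quantile summaries because the data are not known a priori):

```python
def equi_depth_boundaries(values, k):
    """Return k-1 bucket boundaries so that each of the k buckets holds
    roughly the same number of instances (offline illustration only)."""
    data = sorted(values)
    n = len(data)
    # Boundary j is the value below which roughly j*n/k instances fall.
    return [data[min(n - 1, (j * n) // k)] for j in range(1, k)]
```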
To improve the effectiveness of histograms, Ioannidis et al. introduced
the V-optimal histogram [43], which aims to minimize the variance of the
frequencies of the values within each bucket. In this way, the assumption of
a uniform distribution within buckets is better satisfied. Specifically, if
a bucket b with total count c contains n instances, then the average
frequency of each instance in b is c/n. Let f_1, . . . , f_n be the
frequencies of the n instances in b. The variance v of the frequencies
around this average is obtained as v = Σ_{i=1}^{n} (f_i − c/n)^2. Finally,
the overall variance V over all buckets is obtained as V = Σ_b v_b.
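For concreteness, the cost being minimized can be evaluated as follows for a candidate bucketing (illustrative code; finding the bucketing that minimizes this cost is typically done with dynamic programming, which is not shown here):

```python
def v_optimal_cost(frequencies, boundaries):
    """Total variance V = sum over buckets of sum_i (f_i - avg_b)^2, where the
    candidate bucketing is given as a sorted list of boundary indices."""
    cost = 0.0
    start = 0
    for end in list(boundaries) + [len(frequencies)]:
        bucket = frequencies[start:end]
        if bucket:
            avg = sum(bucket) / len(bucket)       # c/n for this bucket
            cost += sum((f - avg) ** 2 for f in bucket)
        start = end
    return cost

# Example: frequencies of 6 values split into buckets {0,1,2} and {3,4,5}.
print(v_optimal_cost([9, 8, 9, 1, 2, 1], [3]))
```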
Improvements on V-optimal histogram construction were introduced in [44].
In that work, the Lp-difference between two vectors whose cardinalities are
based on the distinct instances is used as the objective function. Other
works consider alternative objective functions to optimize the histogram
construction [60]. The advantages and limitations of V-optimal histograms
are explained as follows.
Advantage of V-optimal histograms: V-optimal histograms summarize the
contents of the buckets optimally. Any histogram incurs some error when used
to summarize data; V-optimal histograms bound this error by finding the
bucketing with the smallest variance among all possible bucketings. As
demonstrated in [59], V-optimal histograms achieve the best accuracy in
summarizing data.
Limitation of V-optimal histograms: The major drawback of V-optimal
histograms is that they are difficult to update. Rebuilding the entire
histogram is necessary when new data are received. By contrast, the
equi-width histogram technique can address this issue easily. Moreover,
although equi-depth histograms also have to be rebuilt, the cost is lower
compared with