Incremental unsupervised learning using clustering

The concept behind clustering in a data stream remains the same as in batch or offline modes: finding interesting clusters or patterns that group the data, while treating finite memory and the time available for processing as constraints. The basic changes that make existing algorithms suitable for stream or real-time unsupervised learning are single-pass modifications, or keeping a small memory buffer so that mini-batch versions of the algorithms can be run.

Modeling techniques

The clustering modeling techniques for online learning are divided into partition-based, hierarchical-based, density-based, and grid-based, similar to the case of batch-based clustering.

Partition based

The concept of partition-based algorithms is similar to batch-based clustering: k clusters are formed to optimize an objective function, such as minimizing the intra-cluster distance or maximizing the inter-cluster distance.

Online k-Means

k-Means is the most popular clustering algorithm; it partitions the data into a user-specified number of clusters k, typically by minimizing the squared error, or distance, between the centroids and the points assigned to each cluster. We will illustrate a very basic online adaptation of k-Means, of which several variants exist.

Inputs and outputs

Mainly, numeric features are considered as inputs; a few tools take categorical features and convert them into some form of numeric representation. The algorithm itself takes as parameters the number of clusters k and the maximum number of iterations n.

How does it work?
  1. The input data stream is considered to be infinite but is processed in blocks of constant size.
  2. A memory buffer of the block size is kept reserved to store the data or a compressed representation of the data.
  3. Initially, the first block of the stream is used to find the k centroids of the clusters; the centroid information is stored and the buffer is cleared.
  4. Whenever the buffer fills up with the next block of data (a minimal sketch follows this list):
    1. For at most the maximum number of iterations, or until the centroids no longer change:
    2. Execute k-Means on the buffer data, starting from the current centroids.
    3. Minimize the sum of squared errors between the centroids and the data assigned to each cluster.
    4. After the iterations, the buffer is cleared and the new centroids are retained.
  5. Repeat step 4 until data is no longer available.
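
The following is a minimal sketch in Python of the block-wise procedure described above. The function name online_kmeans, the block_size parameter, and the random initialization from the first block are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def online_kmeans(stream, k, block_size, max_iter=10, seed=0):
    """Block-wise online k-Means sketch: the first full block initializes
    the centroids, every subsequent block refines them."""
    rng = np.random.default_rng(seed)
    centroids, buffer = None, []
    for x in stream:
        buffer.append(np.asarray(x, dtype=float))
        if len(buffer) < block_size:
            continue
        block = np.vstack(buffer)
        buffer.clear()
        if centroids is None:
            # Initialize centroids from the first block (assumes k <= block_size).
            centroids = block[rng.choice(len(block), k, replace=False)]
        for _ in range(max_iter):
            # Assign each point in the block to its closest centroid.
            dists = np.linalg.norm(block[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            new_centroids = np.array([
                block[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
    return centroids
```
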
Advantages and limitations

The advantages and limitations are as follows:

  • Similar to the batch-based algorithm, the shape of the detected clusters depends on the distance measure used, and the method is not appropriate in problem domains with irregularly shaped clusters.
  • The choice of parameter k, as in batch-based, can limit the performance in datasets with many distinct patterns or clusters.
  • Outliers and missing data can cause many irregularities in the clustering behavior of online k-Means.
  • If the selected buffer size or the block size of the stream on which iterative k-Means runs is small, it will not find the right clusters. If the chosen block size is large, it can result in slowdown or missed changes in the data. Extensions such as the Very Fast k-Means Algorithm (VFKM), which uses the Hoeffding bound (recalled below) to determine the buffer size, overcome this limitation to a large extent.
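
For reference, a standard form of the Hoeffding bound on which such extensions rely states that, for a random variable with range R, the mean of n independent observations differs from the true mean by more than ε with probability at most δ, where:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

Solving this for n suggests how many examples a buffer must hold to bound the error of each clustering step with confidence 1 − δ.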

Hierarchical based and micro clustering

Hierarchical methods are normally based on Clustering Features (CF) and CF Trees. We will describe the basic elements of hierarchical clustering and the BIRCH algorithm, on an extension of which the CluStream algorithm is based.

The Clustering Feature is a way to compute and preserve summary statistics about a cluster in a compressed form, rather than holding on to all the data belonging to the cluster. For a d-dimensional dataset with N points in a cluster, two aggregates are computed for each dimension: the linear sum LS of the data and the squared sum SS of the data. The vector formed by this triplet is the Clustering Feature:

$$CF_j = \langle N, LS_j, SS_j \rangle$$

These statistics are useful in summarizing the entire cluster information. The centroid of the cluster can be easily computed using:

$$\text{centroid}_j = \frac{LS_j}{N}$$

The radius of the cluster can be estimated using:

$$R = \sqrt{\frac{SS}{N} - \left(\frac{LS}{N}\right)^2}$$

The diameter of the cluster can be estimated using:

$$D = \sqrt{\frac{2N \cdot SS - 2\,LS^2}{N(N-1)}}$$

CF vectors have useful incremental and additive properties, which become valuable for stream or incremental updates.

For an incremental update, when a new point x must be absorbed into the CF vector, the following holds true:

$$N \leftarrow N + 1$$
$$LS_j \leftarrow LS_j + x_j$$
$$SS_j \leftarrow SS_j + x_j^2$$

When two CFs have to be merged, the following holds true:

$$N = N_1 + N_2$$
$$LS_j = LS_{1j} + LS_{2j}$$
$$SS_j = SS_{1j} + SS_{2j}$$
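
A minimal Python sketch of a Clustering Feature follows, showing the incremental update, merge, and derived statistics just described; the class and method names are illustrative assumptions.

```python
import numpy as np

class ClusteringFeature:
    """Compressed summary <N, LS, SS> of a cluster of d-dimensional points."""

    def __init__(self, d):
        self.n = 0              # number of points absorbed
        self.ls = np.zeros(d)   # per-dimension linear sum
        self.ss = np.zeros(d)   # per-dimension squared sum

    def add(self, x):
        # Incremental update: N <- N + 1, LS <- LS + x, SS <- SS + x^2
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        # Additive property: merging two CFs just adds their components.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of member points from the centroid.
        var = self.ss / self.n - (self.ls / self.n) ** 2
        return float(np.sqrt(np.maximum(var, 0.0).sum()))
```
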

The Clustering Feature Tree (CF Tree) is a hierarchical tree structure. The construction of the CF tree requires two user-defined parameters:

  • Branching factor b, which is the maximum number of sub-clusters or non-leaf nodes any node can have
  • Maximum diameter (or radius) threshold T, which limits the number of examples a leaf node can absorb: a leaf keeps absorbing examples only as long as its diameter (or radius) stays below T

CF Tree operations such as insertion are done by recursively traversing the CF Tree, using the CF vectors to find the closest node according to the distance metric. If a leaf node has already absorbed as many examples as the threshold T allows, the node is split. At the end of the operation, the CF vectors along the path are updated with the new statistics.


Figure 3 An example Clustering Feature Tree illustrating hierarchical structure.

We will discuss BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), which follows this concept.

Inputs and outputs

BIRCH only accepts numeric features. CF and CF tree parameters, such as branching factor b and maximum diameter (or radius) T for leaf are user-defined inputs.

How does it work?

BIRCH, designed for very large databases, was originally a two-pass algorithm; that is, it scans the entire data once and then re-scans it, making it an O(N) algorithm. It can easily be modified to run online as a single-pass algorithm while preserving the same properties (a usage example follows the phases):

  1. In the first phase or scan, it goes over the data and creates an in-memory CF Tree structure by visiting the points sequentially and carrying out the CF Tree operations discussed previously.
  2. The second phase is optional: outliers are removed and sub-clusters are merged.
  3. Phase three overcomes the sensitivity to the order of the data in phase one by using agglomerative hierarchical clustering to refactor the CF Tree.
  4. Phase four, the last and also an optional phase, computes statistics such as centroids and assigns the data to the closest centroids for greater effectiveness.
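
As an illustration, scikit-learn provides a BIRCH implementation with a partial_fit method that can be used in a block-wise, single-pass fashion; the parameter values and the simulated stream below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import Birch

# threshold corresponds to T and branching_factor to b in the discussion above.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)

rng = np.random.default_rng(0)
stream = (rng.normal(size=(100, 2)) for _ in range(20))  # simulated stream of blocks

for block in stream:
    birch.partial_fit(block)   # phase one: update the in-memory CF Tree

labels = birch.predict(rng.normal(size=(10, 2)))  # assign new points to clusters
```
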

Advantages and limitations

The advantages and limitations are as follows:

  • It is one of the most popular algorithms that scales linearly on a large database or stream of data.
  • It has compact memory representation in the form of the CF and CF Tree for statistics and operations on incoming data.
  • It handles outliers better than most algorithms.
  • One of the major limitations is that it has been shown not to perform well when the shape of the clusters is not spherical.
  • The concepts of the CF vector and clustering in BIRCH were extended to meet efficient stream mining requirements by Aggarwal et al. in the form of micro-clusters and the CluStream algorithm, discussed next.

Inputs and outputs

CluStream only accepts numeric features. Among the user-defined parameters are the number of micro-clusters kept in memory (q) and the threshold (δ) in time after which they can be deleted. Additionally, the input includes time-sensitive parameters for storing the micro-cluster information, given by α and l.

How does it work?

  1. The micro-cluster extends the CF vector and keeps two additional measures. They are the sum of the timestamps and sum of the squares of timestamps:

    $$microCluster_j = \langle N, LS_j, SS_j, ST, SST \rangle$$

  2. The algorithm stores q micro-clusters in memory and each micro-cluster has a maximum boundary that can be computed based on means and standard deviations between centroid and cluster instance distances. The measures are multiplied by a factor which decreases exponentially with time.
  3. For each new instance, we select the closest micro-cluster based on Euclidean distance and decide whether it should be absorbed:
    1. If the distance between the new instance and the centroid of the closest micro-cluster falls within the maximum boundary, it is absorbed and the micro-cluster statistics are updated.
    2. If none of the micro-clusters can absorb the instance, a new micro-cluster is created for it. To keep the number of micro-clusters at q, either the oldest micro-cluster is deleted or two micro-clusters are merged, based on the timestamps and the threshold δ.
    3. Assuming a normal distribution of timestamps, if the relevance time of the oldest micro-cluster (the approximate arrival time of its recent instances, as estimated by CluStream) is below the user-specified threshold δ, that micro-cluster is considered outdated and removed. Otherwise, the two closest micro-clusters are merged.
  4. The micro-cluster information is stored in secondary storage from time to time using the pyramidal time window concept: snapshots are taken at time intervals that vary exponentially in powers of α, with the granularity levels controlled by l. These snapshots enable efficient search in both time and space. A sketch of the absorption logic follows this list.
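
The following is a minimal sketch of the micro-cluster structure and the absorption decision described above; the class names, the boundary_factor parameter, and the simplified handling of singleton micro-clusters are illustrative assumptions.

```python
import numpy as np

class MicroCluster:
    """CF vector extended with timestamp sums: <N, LS, SS, ST, SST>."""

    def __init__(self, x, t):
        x = np.asarray(x, dtype=float)
        self.n, self.ls, self.ss = 1, x.copy(), x * x
        self.st, self.sst = t, t * t   # sum and squared sum of timestamps

    def centroid(self):
        return self.ls / self.n

    def rms_deviation(self):
        var = self.ss / self.n - (self.ls / self.n) ** 2
        return float(np.sqrt(np.maximum(var, 0.0).sum()))

    def absorb(self, x, t):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x * x
        self.st += t
        self.sst += t * t

def process_instance(x, t, micro_clusters, boundary_factor=2.0):
    """Absorb x into the closest micro-cluster if it falls within the maximum
    boundary; otherwise return a new micro-cluster (caller then deletes the
    oldest cluster or merges two, as described in the steps above)."""
    if not micro_clusters:
        return MicroCluster(x, t)
    closest = min(micro_clusters,
                  key=lambda mc: np.linalg.norm(np.asarray(x) - mc.centroid()))
    boundary = boundary_factor * closest.rms_deviation()
    if np.linalg.norm(np.asarray(x) - closest.centroid()) <= boundary:
        closest.absorb(x, t)
        return None
    return MicroCluster(x, t)
```
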

Advantages and limitations

The advantages and limitations are as follows:

  • CluStream has been shown to be very effective in finding clusters in real time.
  • Through effective storage using the pyramidal timestamp scheme, the CluStream algorithm has efficient time and space usage.
  • CluStream, like BIRCH, can find only spherical-shaped clusters.

Density based

Similar to batch clustering, density-based techniques overcome the "shape" issue faced by distance-based algorithms. Here we will present a well-known density-based algorithm, DenStream, which is based on the concepts of CF and CF Trees discussed previously.

Inputs and outputs

The extent of the neighborhood of a core micro-cluster is the user-defined radius ϵ. A second input is the minimum total weight µ of the micro-cluster, which is the sum of a weight function evaluated at the arrival time of each instance in the micro-cluster, where the weight decays at a rate governed by another user-defined parameter, λ. Finally, an input factor β ∈ (0,1) is used to distinguish potential core micro-clusters from outlier micro-clusters.

How does it work?

  1. Based on the micro-cluster concepts of CluStream, DenStream holds two data structures: p-micro-clusters for potential clusters and o-micro-clusters for outlier detection.
  2. Each p-micro-cluster structure has:
    1. A weight associated with it, which decreases exponentially with the age of the timestamps at which it was updated. If there are j objects in the micro-cluster, arriving at timestamps T1, …, Tj, the weight at time t is:

      $$w = \sum_{i=1}^{j} f(t - T_i), \quad \text{where } f(t) = 2^{-\lambda t}$$
    2. Weighted linear sum (WLS) and weighted linear sum of squares (WSS) are stored in micro-clusters similar to linear sum and sum of squares:
      $$WLS = \sum_{i=1}^{j} f(t - T_i)\, x_i$$

      $$WSS = \sum_{i=1}^{j} f(t - T_i)\, x_i^2$$

    3. The mean, radius, and diameter of the clusters are then computed using the weighted measures defined previously, exactly like in CF. For example, the radius can be given as:
    $$r = \sqrt{\frac{WSS}{w} - \left(\frac{WLS}{w}\right)^2}$$
  3. Each o-micro-cluster has the same structure as a p-micro-cluster, together with the timestamp of its creation.
  4. When a new instance arrives:
    1. The closest p-micro-cluster is found and the instance is inserted if the resulting radius stays within the user-defined boundary ϵ. If inserted, the p-micro-cluster statistics are updated accordingly.
    2. Otherwise, the closest o-micro-cluster is found and the instance is inserted if the resulting radius is again within the boundary. A promotion threshold is defined by β × µ, the product of the user-defined parameters: if the weight of the o-micro-cluster grows beyond this value, it is moved to the set of p-micro-clusters.
    3. If the instance cannot be absorbed by an o-micro-cluster, then a new micro-cluster is added to the o-micro-clusters.
  5. At every time interval Tp, the weights are checked and an o-micro-cluster can become a p-micro-cluster or vice versa (a sketch of the decay bookkeeping follows this list). The interval is defined in terms of λ, β, and µ as:

    $$T_p = \left\lceil \frac{1}{\lambda} \log_2\!\left(\frac{\beta\mu}{\beta\mu - 1}\right) \right\rceil$$
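
The following minimal sketch shows the decayed-weight bookkeeping and the periodic check interval described above; the names, the OMicroCluster class, and the promotion helper are illustrative assumptions (the formula assumes β × µ > 1).

```python
import math

def decay(delta_t, lam):
    """Exponential decay function f(t) = 2^(-lambda * t)."""
    return 2.0 ** (-lam * delta_t)

def check_interval(lam, beta, mu):
    """Period T_p after which p/o micro-cluster weights are re-checked."""
    return math.ceil((1.0 / lam) * math.log2((beta * mu) / (beta * mu - 1.0)))

class OMicroCluster:
    """Outlier micro-cluster: decayed weight plus creation timestamp."""

    def __init__(self, x, t, lam):
        self.lam, self.t0, self.last_t = lam, t, t
        self.weight = 1.0
        self.wls = [xi for xi in x]        # weighted linear sum
        self.wss = [xi * xi for xi in x]   # weighted squared sum

    def absorb(self, x, t):
        d = decay(t - self.last_t, self.lam)
        # Age the existing statistics, then add the new instance with weight 1.
        self.weight = self.weight * d + 1.0
        self.wls = [w * d + xi for w, xi in zip(self.wls, x)]
        self.wss = [w * d + xi * xi for w, xi in zip(self.wss, x)]
        self.last_t = t

    def is_potential(self, beta, mu):
        # Promote to a p-micro-cluster once the weight exceeds beta * mu.
        return self.weight > beta * mu
```
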

Advantages and limitations

The advantages and limitations are as follows:

  • Based on the parameters, DenStream can find effective clusters and outliers for real-time data.
  • It has the advantage of finding clusters and outliers of any shape or size.
  • The housekeeping job of updating the o-micro-clusters and p-micro-clusters can be computationally expensive if the parameters are not selected properly.

Grid based

This technique is based on discretizing the multi-dimensional continuous space into a multi-dimensional discretized version with grids. Mapping incoming instances to the grid online and maintaining the grid offline results in an efficient and effective way of finding clusters in real time.

Here we present D-Stream, which is a grid-based online stream clustering algorithm.

Inputs and outputs

As in density-based algorithms, the idea of decaying weight of instances is used in D-Stream. Additionally, as described next, cells in the grid formed from the input space may be deemed sparse, dense, or sporadic, distinctions that are central to the computational and space efficiency of the algorithm. The inputs to the grid-based algorithm, then, are:

  • λ: The decay factor
  • 0 < Cl < 1 and Cm > 1: Parameters that control the boundary between dense and sparse cells in the grid
  • β > 0: A constant that controls one of the conditions when a sparse cell is to be considered sporadic.

How does it work?

  1. Each instance arriving at time t has a density coefficient that decreases exponentially over time:
    $$D(x, t) = \lambda^{t - t_c}$$

    Here, $t_c$ is the arrival time of instance x.
  2. Density of the grid cell g at any given time t is given by D(g, t) and is the sum of the adjusted density of all instances given by E(g, t) that are mapped to grid cell g:
    $$D(g, t) = \sum_{x \in E(g, t)} D(x, t)$$
  3. Each cell in the grid captures the statistics as a characterization vector given by:
    • CV(g) = <tg, tm, D, label, status>, where:
    • tg = last time grid cell was updated
    • tm = last time grid cell was removed due to sparseness
    • D = density of the grid cell when last updated
    • label = class label of the grid cell
    • status = {NORMAL or SPORADIC}
  4. When the new instance arrives, it gets mapped to a cell g and the characteristic vector is updated. If g is not available, it is created and the list of grids is updated.
  5. Grid cells that contain no instances are removed. Also, cells that have not been updated for a long time can become sparse and, conversely, cells to which many instances are mapped become dense.
  6. At a regular time interval known as the gap, the grid cells are inspected, and those whose density falls below a value determined by a density threshold function are treated as sporadic outliers and removed. The density update with decay is sketched after this list.
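
A minimal sketch of the per-cell density update with exponential decay follows; the CharacteristicVector class, the cell_width parameter, and the global grid dictionary are illustrative assumptions.

```python
class CharacteristicVector:
    """Per-cell statistics <tg, tm, D, label, status> for a grid cell."""

    def __init__(self, t, lam):
        self.lam = lam
        self.tg = t            # last update time
        self.tm = 0            # last time the cell was removed as sparse
        self.density = 1.0     # decayed density at time tg
        self.label = None
        self.status = "NORMAL"

    def update(self, t):
        # Decay the stored density to time t, then add the new instance.
        self.density = self.density * (self.lam ** (t - self.tg)) + 1.0
        self.tg = t

grid = {}   # maps a discretized cell index (tuple) to its characteristic vector

def map_instance(x, t, cell_width, lam):
    cell = tuple(int(v // cell_width) for v in x)   # discretize each dimension
    if cell not in grid:
        grid[cell] = CharacteristicVector(t, lam)
    else:
        grid[cell].update(t)
    return cell
```
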

Advantages and limitations

The advantages and limitations are as follows:

  • D-Stream has been shown, both theoretically and empirically, to find sporadic and normal clusters with very high efficiency in space and time.
  • It can find clusters of any shape or size effectively.

Validation and evaluation techniques

Many of the static clustering evaluation measures discussed in Chapter 3, Unsupervised Machine Learning Techniques, have an assumption of static and non-evolving patterns. Some of these internal and external measures are used even in streaming based cluster detection. Our goal in this section is to first highlight problems inherent to cluster evaluation in stream learning, then describe different internal and external measures that address these, and finally, present some existing measures—both internal and external—that are still valid.

Key issues in stream cluster evaluation

It is important to understand some of the important issues that are specific to streaming and clustering, as the measures need to address them:

  • Aging: The property of points no longer being relevant to the clustering measure after a given time.
  • Missed points: Not only whether a point was missed as belonging to a cluster, but also the amount by which it was missed.
  • Misplaced points: Changes in the clustering caused by newly evolving clusters, and by merging or deleting existing clusters, misplace more and more points over time. The impact of these changes with respect to time must be taken into account.
  • Cluster noise: Including data that should not belong to a cluster, or forming clusters around noise, also has an impact that must be assessed over time.

Evaluation measures

Evaluation measures for clustering in the context of streaming data must provide a useful index of the quality of clustering, taking into consideration the effect of evolving and noisy data streams, overlapping and merging clusters, and so on. Here we present some external measures used in stream clustering. Many internal measures encountered in Chapter 3, Unsupervised Machine Learning Techniques, such as the Silhouette coefficient, Dunn's Index, and R-Squared, are also used and are not repeated here.

Cluster Mapping Measures (CMM)

The idea behind CMM is to quantify the connectivity of the points to clusters given the ground truth. It works in three phases:

Mapping phase: In this phase, clusters assigned by the stream learning algorithm are mapped to the ground truth clusters. Based on these, various statistics of distance and point connectivity are measured using the concepts of k-Nearest Neighbors.

The average distance of point p to its closest k neighbors in a cluster Ci is given by:

$$knhDist(p, C_i) = \frac{1}{k} \sum_{o \in knh(p, C_i)} dist(p, o)$$

Here, $knh(p, C_i)$ denotes the set of k nearest neighbors of p in $C_i$.

The average distance for a cluster Ci is given by:

$$knhDist(C_i) = \frac{1}{|C_i|} \sum_{p \in C_i} knhDist(p, C_i)$$

The point connectivity of a point p in a cluster Ci is given by:

$$con(p, C_i) = \begin{cases} 1 & \text{if } knhDist(p, C_i) < knhDist(C_i) \\[4pt] 0 & \text{if } C_i = \emptyset \\[4pt] \dfrac{knhDist(C_i)}{knhDist(p, C_i)} & \text{otherwise} \end{cases}$$
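
A minimal sketch of these kNN-based quantities, under the assumption of Euclidean distance, is shown below; the function names are illustrative.

```python
import numpy as np

def knh_dist_point(p, cluster, k):
    """Average distance of point p to its k nearest neighbors in `cluster`.
    For simplicity, if p itself belongs to `cluster`, its zero self-distance
    is included among the neighbors."""
    dists = np.sort(np.linalg.norm(cluster - p, axis=1))
    return dists[:k].mean()

def knh_dist_cluster(cluster, k):
    """Average kNN distance over all points of the cluster."""
    return np.mean([knh_dist_point(p, cluster, k) for p in cluster])

def connectivity(p, cluster, k):
    """Point connectivity con(p, C): 1 if p is at least as well connected as
    an average member of C, otherwise the ratio of the two distances."""
    if len(cluster) == 0:
        return 0.0
    d_point = knh_dist_point(p, cluster, k)
    d_cluster = knh_dist_cluster(cluster, k)
    return 1.0 if d_point < d_cluster else d_cluster / d_point
```
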

Class frequencies are counted for each cluster, and the mapping of a cluster to the ground truth is performed by comparing the class frequency histograms of the cluster and the ground truth clusters.

Specifically, a cluster Ci found by the algorithm is mapped to the ground truth cluster Clj that covers the majority of the class frequencies of Ci. The surplus is defined as the number of instances of a given class in Ci that are not covered by the ground truth cluster Clj, and the total surplus of cluster Ci with respect to Clj, summed over the classes Cl1, Cl2, …, Cln, is given by:

$$surplus(C_i, Cl_j) = \sum_{l=1}^{n} \max\!\left(0,\; |\{o \in C_i : class(o) = Cl_l\}| - |\{o \in Cl_j : class(o) = Cl_l\}|\right)$$

Cluster Ci is mapped using:

$$map(C_i) = \underset{Cl_j}{\arg\min}\; surplus(C_i, Cl_j)$$

Penalty phase: The penalty for every instance that is mapped incorrectly is calculated in this step using the notion of fault objects, that is, objects which are not noise and yet are incorrectly placed:

$$pen(o, C_i) = con(o, Cl(o)) \cdot \left(1 - con(o, map(C_i))\right)$$

The overall penalty of point o with respect to all clusters found is given by:

$$pen(o, C) = \begin{cases} pen(o, C_i) & \text{if } o \in F \text{ and } o \text{ is assigned to cluster } C_i \\[4pt] 0 & \text{otherwise} \end{cases}$$

CMM calculation: The final measure, using all the penalties weighted over the points' lifespans, is given by:

$$CMM(C, Cl) = 1 - \frac{\sum_{o \in F} w(o) \cdot pen(o, C)}{\sum_{o \in F} w(o) \cdot con(o, Cl(o))}$$

Here, C is the set of found clusters, Cl is the set of ground truth clusters, F is the set of fault objects, and w(o) is the weight of instance o.

V-Measure

Validity, or V-Measure, is an external measure computed from two properties that are of interest in stream clustering, namely Homogeneity and Completeness. If there are n classes in the set C = {c1, c2, …, cn} and m clusters in the set K = {k1, k2, …, km}, a contingency table A = {aij} is created, where aij is the count of instances of class ci assigned to cluster kj, and N denotes the total number of instances.

Homogeneity: Homogeneity is defined as a property of a cluster that reflects the extent to which all the data in the cluster belongs to the same class.

Conditional entropy and class entropy:

$$H(C|K) = -\sum_{j=1}^{m} \sum_{i=1}^{n} \frac{a_{ij}}{N} \log \frac{a_{ij}}{\sum_{i'=1}^{n} a_{i'j}}$$

$$H(C) = -\sum_{i=1}^{n} \frac{\sum_{j=1}^{m} a_{ij}}{N} \log \frac{\sum_{j=1}^{m} a_{ij}}{N}$$

Homogeneity is defined as:

$$h = \begin{cases} 1 & \text{if } H(C, K) = 0 \\[4pt] 1 - \dfrac{H(C|K)}{H(C)} & \text{otherwise} \end{cases}$$

A higher value of homogeneity is more desirable.

Completeness: Completeness is defined as the mirror property of Homogeneity, that is, having all instances of a single class belong to the same cluster.

Similar to Homogeneity, conditional entropies and cluster entropy are defined as:

$$H(K|C) = -\sum_{i=1}^{n} \sum_{j=1}^{m} \frac{a_{ij}}{N} \log \frac{a_{ij}}{\sum_{j'=1}^{m} a_{ij'}}$$

$$H(K) = -\sum_{j=1}^{m} \frac{\sum_{i=1}^{n} a_{ij}}{N} \log \frac{\sum_{i=1}^{n} a_{ij}}{N}$$

Completeness is defined as:

$$c = \begin{cases} 1 & \text{if } H(K, C) = 0 \\[4pt] 1 - \dfrac{H(K|C)}{H(K)} & \text{otherwise} \end{cases}$$

V-Measure is defined as the harmonic mean of homogeneity and completeness using a weight factor β:

$$V_\beta = \frac{(1 + \beta) \cdot h \cdot c}{\beta \cdot h + c}$$

A higher value of completeness or V-measure is better.
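
A minimal sketch of the homogeneity, completeness, and V-Measure computation from a contingency table follows; it assumes hard assignments and natural logarithms, and the function name is illustrative.

```python
import numpy as np

def v_measure(contingency, beta=1.0):
    """Compute (homogeneity, completeness, V) from an n_classes x n_clusters
    contingency table A, where A[i, j] counts instances of class i in cluster j."""
    A = np.asarray(contingency, dtype=float)
    N = A.sum()
    class_totals = A.sum(axis=1)      # instances per class
    cluster_totals = A.sum(axis=0)    # instances per cluster

    def entropy(counts):
        p = counts[counts > 0] / N
        return -np.sum(p * np.log(p))

    def conditional_entropy(M, totals):
        # H(rows | columns): sum over nonzero cells of (a/N) * log(a / column total)
        h = 0.0
        for j, col_total in enumerate(totals):
            col = M[:, j]
            nz = col > 0
            h -= np.sum((col[nz] / N) * np.log(col[nz] / col_total))
        return h

    h_c, h_k = entropy(class_totals), entropy(cluster_totals)
    h_c_given_k = conditional_entropy(A, cluster_totals)
    h_k_given_c = conditional_entropy(A.T, class_totals)

    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    denom = beta * homogeneity + completeness
    v = (1 + beta) * homogeneity * completeness / denom if denom > 0 else 0.0
    return homogeneity, completeness, v
```
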

Other external measures

Some external measures that are quite popular for comparing clustering algorithms, or for measuring the effectiveness of clustering when the classes are known, are given next:

Purity and Entropy: They are similar to homogeneity and completeness defined previously.

Purity is defined as:

$$Purity = \frac{1}{n} \sum_{r=1}^{k} \max_{i}\, n_r^i$$

Entropy is defined as:

$$Entropy = \sum_{r=1}^{k} \frac{n_r}{n} \left( -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r} \right)$$

Here, q is the number of classes, k is the number of clusters, $n_r$ is the size of cluster r, $n_r^i$ is the number of instances of class i in cluster r, and n is the total number of instances.

Precision, Recall, and F-Measure: Information retrieval measures modified for clustering algorithms are as follows:

Given $n_r^i$, the number of instances of class i in cluster r, $n_r$, the size of cluster r, and $n_i$, the number of instances in class i:

Precision is defined as:

$$Precision(i, r) = \frac{n_r^i}{n_r}$$

Recall is defined as:

$$Recall(i, r) = \frac{n_r^i}{n_i}$$

F-measure is defined as:

$$F(i, r) = \frac{2 \cdot Precision(i, r) \cdot Recall(i, r)}{Precision(i, r) + Recall(i, r)}$$
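
A minimal sketch of these measures, computed from the same class-by-cluster contingency table used earlier, follows; the function name is illustrative.

```python
import numpy as np

def external_measures(contingency):
    """Purity, plus per-(class, cluster) precision, recall, and F-measure,
    from a q x k contingency table A where A[i, r] = n_r^i."""
    A = np.asarray(contingency, dtype=float)
    n = A.sum()
    cluster_sizes = A.sum(axis=0)   # n_r for each cluster r
    class_sizes = A.sum(axis=1)     # n_i for each class i

    purity = A.max(axis=0).sum() / n   # (1/n) * sum_r max_i n_r^i

    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(cluster_sizes[None, :] > 0,
                             A / cluster_sizes[None, :], 0.0)
        recall = np.where(class_sizes[:, None] > 0,
                          A / class_sizes[:, None], 0.0)
        f_measure = np.where(precision + recall > 0,
                             2 * precision * recall / (precision + recall), 0.0)
    return purity, precision, recall, f_measure
```
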