The concept behind clustering in a data stream remains the same as in batch or offline modes: finding interesting clusters or patterns that group together in the data, subject to the constraints of finite memory and bounded processing time. Making single-pass modifications to existing algorithms, or keeping a small memory buffer to run mini-batch versions of them, are the basic changes applied across algorithms to make them suitable for stream or real-time unsupervised learning.
The clustering modeling techniques for online learning are divided into partition-based, hierarchical-based, density-based, and grid-based, similar to the case of batch-based clustering.
The concept of partition-based algorithms is similar to batch-based clustering, where k clusters are formed to optimize certain objective functions, such as minimizing the intra-cluster distance, maximizing the inter-cluster distance, and so on.
k-Means is the most popular clustering algorithm; it partitions the data into a user-specified number of clusters k, typically by minimizing the squared distance between centroids and the points assigned to them. We will illustrate a very basic online adaptation of k-Means, of which several variants exist.
Mainly, numeric features are considered as inputs; a few tools accept categorical features and convert them into some form of numeric representation. The algorithm itself takes as parameters the number of clusters k and the maximum number of iterations n.
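As a sketch of what such a single-pass adaptation might look like, the following sequential k-Means update assigns each arriving point to its nearest centroid and nudges that centroid toward the point with a decaying, per-cluster learning rate (the 1/count rate and the seeding of initial centroids are illustrative choices, not part of any specific tool):

```python
def online_kmeans(stream, k, centroids):
    """Single-pass k-Means: assign each point to the nearest centroid
    and nudge that centroid toward the point with a 1/count rate."""
    counts = [1] * k  # assume centroids were seeded from the first points
    for x in stream:
        # find the nearest centroid by squared Euclidean distance
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))
        counts[j] += 1
        eta = 1.0 / counts[j]  # decaying learning rate
        centroids[j] = [c + eta * (a - c) for c, a in zip(centroids[j], x)]
    return centroids

# usage: two 1-D clusters around 0 and 10
cents = online_kmeans([(0.2,), (9.8,), (0.1,), (10.1,), (-0.1,), (9.9,)],
                      2, [[0.0], [10.0]])
```

With a 1/count rate, each centroid ends up at the running mean of the points assigned to it, which mirrors the batch k-Means centroid update.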
The advantages and limitations are as follows:
Hierarchical methods are normally based on Clustering Features (CF) and Clustering Feature Trees (CF Trees). We will describe the basics and elements of hierarchical clustering and the BIRCH algorithm, on an extension of which the CluStream algorithm is based.
The Clustering Feature is a way to compute and preserve summary statistics about the cluster in a compressed form, rather than holding on to all the data belonging to the cluster. For a d-dimensional dataset with N points in the cluster, two aggregates are computed for each dimension: the linear sum of the data, LS, and the squared sum of the data, SS. The vector representing this triplet forms the Clustering Feature:
CFj = < N, LSj, SSj >
These statistics are useful in summarizing the entire cluster information. The centroid of the cluster can be easily computed using:
centroidj = LSj/N
The radius of the cluster can be estimated using:

radiusj = √(SSj/N − (LSj/N)²)
The diameter of the cluster can be estimated using:

diameterj = √((2N·SSj − 2·LSj²)/(N(N − 1)))
CF vectors have useful incremental and additive properties, which become valuable in stream or incremental updates.
For an incremental update, when a new instance x must be absorbed into the CF vector, the following holds true:

CFj + x = < N + 1, LSj + x, SSj + x² >
When two CFs have to be merged, the following holds true:

CF1 + CF2 = < N1 + N2, LS1 + LS2, SS1 + SS2 >
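The triplet and both update rules can be sketched in a few lines (pure Python; the class and method names are illustrative):

```python
import math

class ClusteringFeature:
    """CF = <N, LS, SS>: count, per-dimension linear sum, per-dimension squared sum."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim

    def add(self, x):
        # incremental update: CF + x = <N + 1, LS + x, SS + x^2>
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        # additive property: CF1 + CF2 = <N1 + N2, LS1 + LS2, SS1 + SS2>
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # sqrt of average squared deviation from the centroid, per dimension
        return math.sqrt(sum(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
                             for i in range(len(self.ls))))

cf = ClusteringFeature(2)
for p in [(1.0, 2.0), (3.0, 4.0)]:
    cf.add(p)
```

Because only N, LS, and SS are stored, the memory per cluster is constant regardless of how many points the cluster absorbs.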
The Clustering Feature Tree (CF Tree) represents a hierarchical tree structure. Its construction requires two user-defined parameters: the branching factor b, which limits the number of entries in a node, and the threshold T on the maximum diameter (or radius) of a leaf entry.
CF Tree operations such as insertion are done by recursively traversing the CF Tree, using the CF vectors to find the closest node according to a distance metric. If the closest leaf entry cannot absorb the instance without its diameter (or radius) exceeding the threshold T, a new entry is created and, if the leaf node overflows, it is split. At the end of the operation, the CF vectors along the traversed path are updated with the new statistics.
We will discuss BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), which follows this concept.
BIRCH only accepts numeric features. CF and CF tree parameters, such as branching factor b and maximum diameter (or radius) T for leaf are user-defined inputs.
BIRCH, designed for very large databases, was conceived as a two-pass algorithm; that is, it scans the entire data once and then re-scans it, making it an O(N) algorithm. It can be modified easily into a single-pass online algorithm that preserves the same properties:
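A simplified single-pass sketch of the leaf-level decision BIRCH makes, absorb the instance into the closest leaf CF if the radius stays within the threshold T, otherwise start a new leaf, using a flat list of leaf CFs instead of a full tree (names and structure are illustrative):

```python
import math

def insert_point(leaves, x, T):
    """Each leaf is a (n, ls, ss) triple over the dimensions of x."""
    def dist2(leaf):
        n, ls, _ = leaf
        # squared distance from x to the leaf centroid
        return sum((v - s / n) ** 2 for v, s in zip(x, ls))

    def radius_after(leaf):
        # radius of the leaf if it absorbed x, plus the updated triple
        n, ls, ss = leaf
        n2 = n + 1
        ls2 = [s + v for s, v in zip(ls, x)]
        ss2 = [s + v * v for s, v in zip(ss, x)]
        r = math.sqrt(sum(q / n2 - (s / n2) ** 2 for q, s in zip(ss2, ls2)))
        return r, (n2, ls2, ss2)

    if leaves:
        best = min(range(len(leaves)), key=lambda i: dist2(leaves[i]))
        r, updated = radius_after(leaves[best])
        if r <= T:           # absorb: radius stays within threshold T
            leaves[best] = updated
            return
    # closest leaf cannot absorb x within T: create a new leaf
    leaves.append((1, list(x), [v * v for v in x]))

leaves = []
for p in [(0.0,), (0.1,), (5.0,), (5.2,)]:
    insert_point(leaves, p, T=0.5)
```

Two well-separated groups end up in two leaf CFs; a full CF Tree adds the recursive descent and node splits on top of exactly this absorb-or-create decision.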
The advantages and limitations are as follows:
CluStream only accepts numeric features. The user-defined parameters include the number of micro-clusters kept in memory (q) and the threshold (δ) on the time after which a micro-cluster can be deleted. Additionally, the input includes the time-sensitive parameters α and l, which govern how snapshots of the micro-cluster information are stored.
microClusterj = < N, LSj, SSj, ST, SST >

Here, ST and SST are the sum and the squared sum, respectively, of the timestamps of the instances absorbed by the micro-cluster.
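The micro-cluster tuple extends the CF triplet with the timestamp sums ST and SST; a sketch of the update, where the mean and standard deviation of arrival times derived from ST and SST are what staleness decisions can be based on (names are illustrative):

```python
import math

class MicroCluster:
    """CluStream-style micro-cluster: <N, LS, SS, ST, SST>."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # per-dimension linear sum
        self.ss = [0.0] * dim   # per-dimension squared sum
        self.st = 0.0           # sum of timestamps
        self.sst = 0.0          # squared sum of timestamps

    def add(self, x, t):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        self.st += t
        self.sst += t * t

    def mean_time(self):
        return self.st / self.n

    def std_time(self):
        return math.sqrt(max(self.sst / self.n - (self.st / self.n) ** 2, 0.0))

mc = MicroCluster(1)
for t, x in enumerate([(1.0,), (1.2,), (0.8,)]):
    mc.add(x, float(t))
```

A micro-cluster whose mean arrival time falls before the current time minus the threshold δ becomes a candidate for deletion or merging.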
The advantages and limitations are as follows:
Similar to batch clustering, density-based techniques overcome the "shape" issue faced by distance-based algorithms. Here we will present a well-known density-based algorithm, DenStream, which is based on the concepts of CF and CF Trees discussed previously.
The extent of the neighborhood of a core micro-cluster is given by the user-defined radius ϵ. A second input is the minimum total weight µ of the micro-cluster, computed as the sum of a weighting function over the arrival time of each instance it contains, where the weight decays with a time constant proportional to another user-defined parameter, λ. Finally, an input factor β ∈ (0, 1) is used to distinguish potential core micro-clusters from outlier micro-clusters.
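The decayed weight and the core/potential/outlier distinction can be sketched as follows, assuming the commonly used fading function f(Δt) = 2^(−λ·Δt); the parameter values in the usage lines are illustrative:

```python
def decayed_weight(arrival_times, now, lam):
    """Sum of the fading function 2^(-lambda * age) over all instances."""
    return sum(2.0 ** (-lam * (now - t)) for t in arrival_times)

def classify(weight, mu, beta):
    """Core if weight >= mu; potential core if weight >= beta * mu; else outlier."""
    if weight >= mu:
        return "core"
    if weight >= beta * mu:
        return "potential"
    return "outlier"

# three instances seen at t = 0, 1, 2; evaluated at t = 2 with lambda = 1
w = decayed_weight([0.0, 1.0, 2.0], now=2.0, lam=1.0)  # 0.25 + 0.5 + 1.0
label = classify(w, mu=2.0, beta=0.5)
```

As instances age, their contribution halves every 1/λ time units, so a micro-cluster that stops receiving points gradually slides from core to potential to outlier status.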
The advantages and limitations are as follows:
This technique discretizes the multi-dimensional continuous input space into a grid. Mapping each incoming instance to a grid cell online, and clustering the grid cells offline, results in an efficient and effective way of finding clusters in real time.
Here we present D-Stream, which is a grid-based online stream clustering algorithm.
As in density-based algorithms, D-Stream uses the idea of decaying instance weights. Additionally, as described next, cells in the grid formed from the input space may be deemed sparse, dense, or sporadic, distinctions that are central to the computational and space efficiency of the algorithm. The inputs to the grid-based algorithm, then, are:
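The online half of D-Stream, mapping an instance to its grid cell and applying a decayed density update of the form D(g, t) = decay^(t − t_last)·D + 1, can be sketched as follows (the cell width and decay factor are illustrative choices):

```python
def cell_of(x, width):
    """Map a point to its grid cell by flooring each coordinate."""
    return tuple(int(v // width) for v in x)

def update_density(grid, x, t, width=1.0, decay=0.9):
    """grid maps cell -> (density, last_update_time)."""
    g = cell_of(x, width)
    d, t_last = grid.get(g, (0.0, t))
    # decay the old density to the current time, then add this instance
    grid[g] = (d * decay ** (t - t_last) + 1.0, t)
    return g

grid = {}
update_density(grid, (0.3, 0.7), t=0)
update_density(grid, (0.9, 0.1), t=1)   # lands in the same cell (0, 0)
update_density(grid, (2.5, 2.5), t=1)   # different cell (2, 2)
```

Only touched cells are stored, and densities are brought up to date lazily on access, which is what keeps the space and time cost per instance constant.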
Many of the static clustering evaluation measures discussed in Chapter 3, Unsupervised Machine Learning Techniques, assume static, non-evolving patterns. Some of these internal and external measures are used even in streaming-based cluster detection. Our goal in this section is to first highlight problems inherent to cluster evaluation in stream learning, then describe different internal and external measures that address them, and finally present some existing measures, both internal and external, that remain valid.
It is important to understand some of the issues specific to streaming and clustering, as the measures need to address them:
Evaluation measures for clustering in the context of streaming data must provide a useful index of the quality of clustering, taking into consideration the effect of evolving and noisy data streams, overlapping and merging clusters, and so on. Here we present some external measures used in stream clustering. Many internal measures encountered in Chapter 3, Unsupervised Machine Learning Techniques, such as the Silhouette coefficient, Dunn's Index, and R-Squared, are also used and are not repeated here.
The idea behind CMM is to quantify the connectivity of the points to clusters given the ground truth. It works in three phases:
Mapping phase: In this phase, clusters assigned by the stream learning algorithm are mapped to the ground truth clusters. Based on these, various statistics of distance and point connectivity are measured using the concepts of k-Nearest Neighbors.
The average distance of point p to its closest k neighbors in a cluster Ci is given by:

knnDist(p, Ci) = (1/k) Σ dist(p, o), summed over the k nearest neighbors o of p in Ci
The average of this distance over a cluster Ci is given by:

knnDist(Ci) = (1/|Ci|) Σ knnDist(p, Ci), summed over all points p in Ci
The point connectivity of a point p in a cluster Ci is given by:

con(p, Ci) = 1 if knnDist(p, Ci) < knnDist(Ci); otherwise con(p, Ci) = knnDist(Ci)/knnDist(p, Ci)
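These kNN-based quantities can be sketched as follows (Euclidean distance is assumed, and the function names are illustrative):

```python
import math

def knn_dist(p, cluster, k):
    """Average Euclidean distance from p to its k closest neighbors in the cluster."""
    dists = sorted(math.dist(p, q) for q in cluster if q != p)
    return sum(dists[:k]) / min(k, len(dists))

def cluster_knn_dist(cluster, k):
    """Average of knn_dist over all points of the cluster."""
    return sum(knn_dist(p, cluster, k) for p in cluster) / len(cluster)

def connectivity(p, cluster, k):
    """1 if p is at least as tightly connected as the cluster average,
    otherwise the ratio of the cluster's average kNN distance to p's."""
    d_p = knn_dist(p, cluster, k)
    d_c = cluster_knn_dist(cluster, k)
    return 1.0 if d_p <= d_c else d_c / d_p

# a tight group plus one straggler: the straggler's connectivity drops below 1
cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
```

Points deep inside a cluster get connectivity 1, while points far from their k nearest neighbors are discounted, which is what the penalty phase exploits.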
Class frequencies are counted for each cluster and the mapping of the cluster to the ground truth is performed by calculating the histograms and similarity in the clustering.
Specifically, a cluster Ci found by the algorithm is mapped to the ground truth cluster that covers the majority of the class frequencies of Ci. The surplus is defined as the number of instances of a class Clj in Ci that are not covered by a candidate ground truth cluster, and the total surplus of Ci with respect to a candidate is the sum of these per-class surpluses over the classes Cl1, Cl2, …, Clk present in Ci. Cluster Ci is then mapped to the candidate with the smallest total surplus.
Penalty phase: The penalty for every instance that is mapped incorrectly is calculated in this step using computations of fault objects; that is, objects which are not noise and yet incorrectly placed. For a fault object o placed in cluster Ci, the penalty is:

pen(o, Ci) = con(o, Cl(o)) · (1 − con(o, map(Ci)))

where Cl(o) is the ground truth cluster of o.
The overall penalty of point o with respect to all clusters found is given by:

pen(o) = 0 if o is not a fault object; otherwise pen(o) = pen(o, Ci) for the cluster Ci to which o was assigned
CMM calculation: Using all the penalties, weighted over the lifespan of the instances, CMM is given by:

CMM(C, Cl) = 1 − (Σ o∈F w(o)·pen(o)) / (Σ o∈F w(o)·con(o, Cl(o)))

Here, C is the set of found clusters, Cl is the set of ground truth clusters, F is the set of fault objects, and w(o) is the weight of instance o. When there are no fault objects, CMM equals 1.
Validity or V-Measure is an external measure which is computed based on two properties that are of interest in stream clustering, namely, Homogeneity and Completeness. If there are n classes in the set C = {c1, c2, …, cn} and m clusters in the set K = {k1, k2, …, km}, a contingency table A = {aij} is created, where aij is the count of instances of class ci assigned to cluster kj.
Homogeneity: Homogeneity is defined as a property of a cluster that reflects the extent to which all the data in the cluster belongs to the same class.
Conditional entropy and class entropy:

H(C|K) = −Σj Σi (aij/N) log(aij/Σi aij)

H(C) = −Σi (Σj aij/N) log(Σj aij/N)

Here, N is the total number of instances.
Homogeneity is defined as:

h = 1 if H(C|K) = 0; otherwise h = 1 − H(C|K)/H(C)
A higher value of homogeneity is more desirable.
Completeness: Completeness is defined as the mirror property of Homogeneity, that is, having all instances of a single class belong to the same cluster.
Similar to Homogeneity, the conditional entropy and cluster entropy are defined as:

H(K|C) = −Σi Σj (aij/N) log(aij/Σj aij)

H(K) = −Σj (Σi aij/N) log(Σi aij/N)
Completeness is defined as:

c = 1 if H(K|C) = 0; otherwise c = 1 − H(K|C)/H(K)
V-Measure is defined as the weighted harmonic mean of homogeneity and completeness using a weight factor β:

Vβ = ((1 + β)·h·c)/(β·h + c)
A higher value of completeness or V-measure is better.
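Homogeneity, completeness, and the V-Measure can be computed directly from the contingency table A = {aij}; the following sketch uses natural logarithms and β = 1:

```python
import math

def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def v_measure(a, beta=1.0):
    """a[i][j] = number of instances of class i assigned to cluster j."""
    n = sum(map(sum, a))
    class_tot = [sum(row) for row in a]          # per-class totals
    clus_tot = [sum(col) for col in zip(*a)]     # per-cluster totals
    h_c = entropy(class_tot, n)
    h_k = entropy(clus_tot, n)
    # conditional entropies H(C|K) and H(K|C)
    h_ck = -sum(a[i][j] / n * math.log(a[i][j] / clus_tot[j])
                for i in range(len(a)) for j in range(len(a[0])) if a[i][j] > 0)
    h_kc = -sum(a[i][j] / n * math.log(a[i][j] / class_tot[i])
                for i in range(len(a)) for j in range(len(a[0])) if a[i][j] > 0)
    hom = 1.0 if h_c == 0 else 1.0 - h_ck / h_c
    com = 1.0 if h_k == 0 else 1.0 - h_kc / h_k
    v = (1 + beta) * hom * com / (beta * hom + com)
    return hom, com, v

# perfect clustering: each class captured by exactly one cluster
hom, com, v = v_measure([[5, 0], [0, 5]])
```

A clustering that splits a class across clusters lowers completeness, while one that mixes classes inside a cluster lowers homogeneity; V balances the two.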
Some external measures that are quite popular for comparing clustering algorithms, or for measuring the effectiveness of clustering when the classes are known, are given next:
Purity and Entropy: These are similar to homogeneity and completeness, defined previously. Purity is defined as:

Purity = (1/n) Σr maxi nir
Entropy is defined as:

Entropy = Σr (nr/n)·Er, where Er = −(1/log q) Σi (nir/nr) log(nir/nr)

Here, q = number of classes, k = number of clusters, n = total number of instances, nr = size of cluster r, and nir = number of instances of class i in cluster r.
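Purity and the normalized Entropy can be sketched as follows, assuming each cluster is represented by its vector of per-class counts:

```python
import math

def purity(clusters):
    """clusters[r] is the list of per-class counts for cluster r."""
    n = sum(sum(c) for c in clusters)
    # each cluster contributes its majority-class count
    return sum(max(c) for c in clusters) / n

def total_entropy(clusters, q):
    """Size-weighted sum of per-cluster entropies, normalized by log q."""
    n = sum(sum(c) for c in clusters)
    total = 0.0
    for counts in clusters:
        nr = sum(counts)
        e = -sum(c / nr * math.log(c / nr) for c in counts if c > 0) / math.log(q)
        total += nr / n * e
    return total

clusters = [[9, 1], [0, 10]]   # two clusters, two classes
p = purity(clusters)           # (9 + 10) / 20
```

Higher purity and lower entropy both indicate clusters dominated by a single class.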
Precision, Recall, and F-Measure: Information retrieval measures modified for clustering algorithms are as follows:
Precision is defined as:

Precision(i, r) = nir/nr
Recall is defined as:

Recall(i, r) = nir/ni

where ni is the total number of instances of class i.
F-Measure is defined as:

F(i, r) = (2·Precision(i, r)·Recall(i, r))/(Precision(i, r) + Recall(i, r))
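These per-class, per-cluster measures can be sketched as follows (n_ri is the number of class-i instances in cluster r, n_r the cluster size, n_i the class size):

```python
def precision(n_ri, n_r):
    # fraction of the cluster made up of class i
    return n_ri / n_r

def recall(n_ri, n_i):
    # fraction of class i captured by the cluster
    return n_ri / n_i

def f_measure(n_ri, n_r, n_i):
    p, r = precision(n_ri, n_r), recall(n_ri, n_i)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# cluster of 10 points, 8 of class i; class i has 16 instances overall
f = f_measure(8, 10, 16)   # precision 0.8, recall 0.5
```

The harmonic mean penalizes a cluster that is pure but captures only a sliver of the class, and vice versa.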