Unsupervised learning and clustering

Clustering analysis divides data samples, or data points, into homogeneous classes called clusters. A simple definition of clustering is therefore the process of organizing objects into groups whose members are similar in some way.
A cluster, then, is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. As shown in Figure 2, given a collection of objects, a clustering algorithm puts those objects into groups based on similarity. A clustering algorithm such as K-means then locates the centroid of each group of data points. To make the clustering accurate and effective, the algorithm evaluates the distance of each point from the centroid of its cluster. Ultimately, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.

Figure 2: Clustering raw data
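
The centroid-and-distance idea described above can be sketched in a few lines of plain Python. This is only a minimal illustration of the standard K-means iteration, not Spark's implementation: the function names (euclidean, mean_point, kmeans), the random choice of initial centroids, the Euclidean distance measure, and the sample data are all assumptions made for this example.

```python
import math
import random


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, labels


# Hypothetical toy data: two visibly separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random initialization, so code that consumes the labels should not rely on a particular numbering.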

Spark supports many clustering algorithms, such as K-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), bisecting K-means, and streaming K-means. LDA is commonly used in text mining for document classification and clustering. PIC is used for clustering the vertices of a graph whose edge properties are pairwise similarities. However, to keep this chapter clear and focused, we will confine our discussion to the K-means, bisecting K-means, and Gaussian mixture algorithms.
