Clustering

In contrast to supervised classification, the goal of clustering is to identify intrinsic groups in a set of unlabeled data. It can be applied to finding representative examples of homogeneous groups, discovering useful and natural groupings, or detecting unusual examples, such as outliers.

We'll demonstrate how to implement clustering by analyzing a bank dataset. The dataset consists of 600 instances described by 11 attributes: age, sex, region, income, marital status, children, car ownership status, savings account activity, current account activity, mortgage status, and PEP. In our analysis, we will try to identify common groups of clients by applying expectation maximization (EM) clustering.

EM works as follows: given a set of clusters, EM first assigns each instance a probability distribution of belonging to each cluster. For example, if we start with three clusters—namely, A, B, and C—an instance might get the probability distribution of 0.70, 0.10, and 0.20 of belonging to the A, B, and C clusters, respectively. In the second step, EM re-estimates the parameter vector of the probability distribution of each cluster. The algorithm iterates these two steps until the parameters converge or the maximum number of iterations is reached.
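The two alternating steps can be sketched in a minimal, self-contained form. The following is not the implementation used in this analysis, but a toy one-dimensional Gaussian mixture EM written with only the standard library; the data values and initialization scheme are illustrative assumptions:

```python
import math

def em_gaussian_1d(data, k=2, iters=50):
    """Toy 1-D Gaussian mixture EM: alternate soft assignment (E-step)
    and parameter re-estimation (M-step)."""
    lo, hi = min(data), max(data)
    # Illustrative initialization: means spread evenly over the data range
    means = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    stds = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: probability of each instance belonging to each cluster
        resp = []
        for x in data:
            dens = [w * math.exp(-((x - m) ** 2) / (2 * s ** 2))
                    / (s * math.sqrt(2 * math.pi))
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each cluster's weight, mean, and std dev
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-6)
    return weights, means, stds

# Two well-separated groups; EM should recover means near 0 and 10
data = [0.1, -0.2, 0.3, 0.0, 10.1, 9.8, 10.2, 9.9]
weights, means, stds = em_gaussian_1d(data, k=2)
```

The responsibilities computed in the E-step are exactly the per-instance probability distributions described above (for example, 0.70/0.10/0.20 across three clusters), and the M-step re-estimates each cluster's parameters using those soft assignments as weights.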

The number of clusters to be used in EM can be set either manually or automatically by cross-validation. Another approach to determining the number of clusters in a dataset is the elbow method. This method looks at the percentage of variance that is explained as a function of the number of clusters. The method suggests increasing the number of clusters until an additional cluster does not add much information, that is, until it explains little additional variance.
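The elbow method can be illustrated with a small standard-library sketch. Here the explained variance is computed as one minus the ratio of within-cluster to total sum of squares, using a simple one-dimensional k-means as the underlying partitioner; the helper name, data values, and initialization are assumptions for illustration:

```python
def explained_variance(data, k, iters=20):
    """Fraction of total variance explained by a k-cluster partition:
    1 - (within-cluster SSE / total SSE), via a simple 1-D k-means."""
    lo, hi = min(data), max(data)
    # Illustrative initialization: centers spread evenly over the range
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Recompute centers; an empty cluster keeps its old center
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    overall = sum(data) / len(data)
    total_sse = sum((x - overall) ** 2 for x in data)
    within_sse = sum((x - centers[j]) ** 2
                     for j, c in enumerate(clusters) for x in c)
    return 1 - within_sse / total_sse

# Two clear groups: the explained variance jumps sharply at k=2
# and barely improves afterwards, so the "elbow" is at k=2
data = [1.0, 1.2, 0.9, 1.1, 8.0, 8.2, 7.9, 8.1]
ev = {k: explained_variance(data, k) for k in (1, 2, 3)}
```

Plotting these values against k produces the characteristic elbow curve: the point where the curve flattens is the suggested number of clusters.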
