In contrast to supervised classification, the goal of clustering is to identify intrinsic groups in a set of unlabeled data. It can be applied to identify representative examples of homogeneous groups, to find useful and suitable groupings, or to find unusual examples, such as outliers.
We'll demonstrate how to implement clustering by analyzing the Bank dataset. The dataset consists of 11 attributes describing 600 instances: age, sex, region, income, marital status, children, car ownership status, saving activity, current-account activity, mortgage status, and PEP. In our analysis, we will try to identify common groups of clients by applying Expectation Maximization (EM) clustering.
EM works as follows: given a set of clusters, EM first assigns each instance a probability distribution of belonging to each cluster. For example, if we start with three clusters (A, B, and C), an instance might get the probability distribution of 0.70, 0.10, and 0.20 of belonging to the A, B, and C clusters, respectively. In the second step, EM re-estimates the parameter vector of the probability distribution of each cluster. The algorithm iterates these two steps until the parameters converge or the maximum number of iterations is reached.
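To make the first of these two steps concrete, the following is a minimal sketch (not Weka code) of a single E-step for a one-dimensional, three-cluster Gaussian mixture. The means, variances, and mixing weights are purely illustrative assumptions; the point is that each instance's cluster memberships are computed from the current parameters and normalized to sum to one, just like the 0.70, 0.10, and 0.20 distribution above.

import java.util.Arrays;

public class EMStepSketch {

  public static void main(String[] args) {
    // Illustrative (assumed) parameters for three one-dimensional Gaussian clusters
    double[] mean   = {20.0, 40.0, 60.0};
    double[] var    = {25.0, 25.0, 25.0};
    double[] weight = {1.0 / 3, 1.0 / 3, 1.0 / 3};

    double x = 25.0; // a single instance, for example an age value

    // E-step: unnormalized membership = mixing weight * Gaussian density
    double[] p = new double[mean.length];
    double sum = 0;
    for (int k = 0; k < mean.length; k++) {
      double diff = x - mean[k];
      p[k] = weight[k] * Math.exp(-diff * diff / (2 * var[k]))
             / Math.sqrt(2 * Math.PI * var[k]);
      sum += p[k];
    }
    // Normalize so the membership probabilities sum to one
    for (int k = 0; k < p.length; k++) {
      p[k] /= sum;
    }

    System.out.println(Arrays.toString(p));
  }
}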
The number of clusters to be used in EM can be set either manually or automatically by cross validation. Another approach to determining the number of clusters is the elbow method, which looks at the percentage of variance explained by a specific number of clusters: the number of clusters is increased until an additional cluster adds little information, that is, explains little additional variance.
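In Weka, this choice is exposed through the EM clusterer's parameters. The fragment below is a minimal sketch of both variants; the fixed value of three clusters and the iteration cap of 100 are arbitrary example settings, not values recommended for the Bank dataset.

EM em = new EM();
em.setNumClusters(-1);    // -1 lets EM select the number of clusters by cross validation
// em.setNumClusters(3);  // ...or fix the number of clusters manually
em.setMaxIterations(100); // cap the number of EM iterations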
The process of building a cluster model is quite similar to the process of building a classification model, that is, load the data and build a model. Clustering algorithms are implemented in the weka.clusterers package, as follows:
import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.clusterers.EM;

public class Clustering {

  public static void main(String args[]) throws Exception {

    // load data
    Instances data = new Instances(new BufferedReader(new FileReader(args[0])));

    // new instance of clusterer
    EM model = new EM();
    // build the clusterer
    model.buildClusterer(data);

    System.out.println(model);
  }
}
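Assuming the Bank dataset has been exported to an ARFF file, for example bank.arff (the file name here is only an assumption), the program can be run with the Weka JAR on the classpath, passing the file as the first argument:

java -cp weka.jar:. Clustering bank.arff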
The model identified the following six clusters:
EM
==

Number of clusters selected by cross validation: 6

                 Cluster
Attribute              0        1        2        3        4        5
                   (0.1)   (0.13)   (0.26)   (0.25)   (0.12)   (0.14)
======================================================================
age
  0_34          10.0535  51.8472 122.2815  12.6207   3.1023   1.0948
  35_51         38.6282  24.4056  29.6252  89.4447  34.5208   3.3755
  52_max        13.4293   6.693    6.3459  50.8984  37.861   81.7724
  [total]       62.1111  82.9457 158.2526 152.9638  75.4841  86.2428
sex
  FEMALE        27.1812  32.2338  77.9304  83.5129  40.3199  44.8218
  MALE          33.9299  49.7119  79.3222  68.4509  34.1642  40.421
  [total]       61.1111  81.9457 157.2526 151.9638  74.4841  85.2428
region
  INNER_CITY    26.1651  46.7431  73.874   60.1973  33.3759  34.6445
  TOWN          24.6991  13.0716  48.4446  53.1731  21.617   17.9946
...
The table can be read as follows: the first row indicates that six clusters were selected, while the first column shows the attributes and their ranges. For example, the attribute age is split into three ranges: 0-34, 35-51, and 52-max. The cluster columns indicate how many instances fall into each range; for example, clients in the 0-34 years age group are mostly in cluster #2 (122 instances).
A clustering algorithm's quality can be estimated using the log-likelihood measure, which measures how consistent the identified clusters are. The dataset is split into multiple folds, the clusterer is built on the training folds, and the log likelihood is computed on the held-out fold. The motivation here is that if the clustering algorithm assigns high probability to similar data that wasn't used to fit the parameters, it has probably done a good job of capturing the data structure. Weka offers the ClusterEvaluation class to estimate it, as follows:
double logLikelihood = ClusterEvaluation.crossValidateModel(model, data, 10, new Random(1));
System.out.println(logLikelihood);
-8.773410259774291
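The same measure can be used to compare different clusterer configurations. The following sketch continues from the snippet above (it reuses data, EM, ClusterEvaluation, and Random) and contrasts a manually fixed number of clusters with Weka's automatic selection; the value of three clusters is an arbitrary example. The configuration with the higher (less negative) cross-validated log likelihood fits the held-out data better.

// compare a fixed number of clusters against automatic selection
EM fixed = new EM();
fixed.setNumClusters(3); // arbitrary example value
double llFixed = ClusterEvaluation.crossValidateModel(fixed, data, 10, new Random(1));

EM auto = new EM(); // default: number of clusters selected by cross validation
double llAuto = ClusterEvaluation.crossValidateModel(auto, data, 10, new Random(1));

System.out.println("fixed: " + llFixed + ", automatic: " + llAuto);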