Clustering

In contrast to supervised classification, the goal of clustering is to identify intrinsic groups in a set of unlabeled data. It can be applied to identify representative examples of homogeneous groups, to find useful and suitable groupings, or to find unusual examples, such as outliers.

We'll demonstrate how to implement clustering by analyzing the Bank dataset. The dataset consists of 11 attributes describing 600 instances: age, sex, region, income, marital status, children, car ownership status, saving activity, current activity, mortgage status, and PEP. In our analysis, we will try to identify common groups of clients by applying the Expectation Maximization (EM) clustering algorithm.

EM works as follows: given a set of clusters, EM first assigns each instance a probability distribution over the clusters, that is, the probability of the instance belonging to each cluster. For example, if we start with three clusters (A, B, and C), an instance might get the probability distribution 0.70, 0.10, and 0.20 for the A, B, and C clusters, respectively. In the second step, EM re-estimates the parameter vector of the probability distribution of each cluster. The algorithm iterates these two steps until the parameters converge or the maximum number of iterations is reached.
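The two steps can be sketched in plain Java for a one-dimensional mixture of Gaussians. This is only a minimal illustration under stated assumptions, not Weka's implementation: the class EmStepSketch, its method names, and the starting parameters are hypothetical, and only the cluster means are re-estimated to keep the sketch short.

```java
import java.util.Arrays;

public class EmStepSketch {

    // Gaussian density N(x | mean, variance).
    static double gaussian(double x, double mean, double variance) {
        double d = x - mean;
        return Math.exp(-d * d / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    // First step: probability that instance x belongs to each cluster.
    static double[] responsibilities(double x, double[] weights,
                                     double[] means, double[] variances) {
        double[] r = new double[weights.length];
        double sum = 0;
        for (int k = 0; k < r.length; k++) {
            r[k] = weights[k] * gaussian(x, means[k], variances[k]);
            sum += r[k];
        }
        for (int k = 0; k < r.length; k++) {
            r[k] /= sum; // normalize so the probabilities sum to 1
        }
        return r;
    }

    // Second step (means only): responsibility-weighted mean for each cluster.
    // resp[i][k] is the probability that instance i belongs to cluster k.
    static double[] updateMeans(double[] xs, double[][] resp) {
        int clusters = resp[0].length;
        double[] means = new double[clusters];
        for (int k = 0; k < clusters; k++) {
            double weightedSum = 0, totalWeight = 0;
            for (int i = 0; i < xs.length; i++) {
                weightedSum += resp[i][k] * xs[i];
                totalWeight += resp[i][k];
            }
            means[k] = weightedSum / totalWeight;
        }
        return means;
    }

    public static void main(String[] args) {
        // Hypothetical starting parameters for three clusters A, B, and C.
        double[] weights = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        double[] means = {25, 45, 65};
        double[] variances = {100, 100, 100};
        System.out.println(Arrays.toString(responsibilities(30, weights, means, variances)));
    }
}
```

A full EM loop would also re-estimate the cluster weights and variances, and would repeat both steps until the parameters stop changing or an iteration limit is hit.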

The number of clusters to be used in EM can be set either manually or automatically by cross-validation. Another approach to determining the number of clusters in a dataset is the elbow method. The method looks at the percentage of variance that is explained with a specific number of clusters, and suggests increasing the number of clusters until an additional cluster no longer adds much information, that is, explains little additional variance.
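The elbow rule can be written down in a few lines. The sketch below assumes the explained-variance fractions have already been computed for 1, 2, 3, ... clusters; the class ElbowSketch, the method chooseK, the threshold minGain, and the sample numbers are all hypothetical and serve only to illustrate the idea.

```java
public class ElbowSketch {

    // explained[i] is the fraction of variance explained with (i + 1) clusters.
    // Returns the number of clusters after which one more cluster
    // adds less than minGain of explained variance.
    static int chooseK(double[] explained, double minGain) {
        for (int i = 1; i < explained.length; i++) {
            double gain = explained[i] - explained[i - 1];
            if (gain < minGain) {
                return i; // i clusters were enough; cluster i + 1 adds little
            }
        }
        return explained.length; // no elbow found in the given range
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        double[] explained = {0.40, 0.65, 0.80, 0.83, 0.84};
        System.out.println("Suggested number of clusters: " + chooseK(explained, 0.05));
    }
}
```

The threshold is a judgment call; in practice the elbow is often picked by looking at a plot of explained variance against the number of clusters.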

Clustering algorithms

The process of building a cluster model is quite similar to the process of building a classification model, that is, load the data and build a model. Clustering algorithms are implemented in the weka.clusterers package, as follows:

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.clusterers.EM;

public class Clustering {

  public static void main(String[] args) throws Exception {

    // load the data; args[0] is the path to an ARFF file
    Instances data = new Instances(new BufferedReader(new FileReader(args[0])));
    
    // new instance of clusterer
    EM model = new EM();
    // build the clusterer
    model.buildClusterer(data);
    System.out.println(model);

  }
}

The model identified the following six clusters:

EM
==

Number of clusters selected by cross validation: 6

                 Cluster
Attribute              0        1        2        3        4        5
                   (0.1)   (0.13)   (0.26)   (0.25)   (0.12)   (0.14)
======================================================================
age
  0_34            10.0535  51.8472 122.2815  12.6207   3.1023   1.0948
  35_51           38.6282  24.4056  29.6252  89.4447  34.5208   3.3755
  52_max          13.4293    6.693   6.3459  50.8984   37.861  81.7724
  [total]         62.1111  82.9457 158.2526 152.9638  75.4841  86.2428
sex
  FEMALE          27.1812  32.2338  77.9304  83.5129  40.3199  44.8218
  MALE            33.9299  49.7119  79.3222  68.4509  34.1642   40.421
  [total]         61.1111  81.9457 157.2526 151.9638  74.4841  85.2428
region
  INNER_CITY      26.1651  46.7431   73.874  60.1973  33.3759  34.6445
  TOWN            24.6991  13.0716  48.4446  53.1731   21.617  17.9946
...

The table can be read as follows: the header row lists the six clusters, while the first column shows the attributes and their ranges. For example, the attribute age is split into three ranges: 0-34, 35-51, and 52-max. The remaining columns, to the right, indicate how many instances fall into the specific range in each cluster; for example, clients in the 0-34 age group are mostly in cluster #2 (about 122 instances).
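Reading a row this way amounts to finding the column with the largest count. As a small illustration, the following sketch does that for the age 0_34 row, with the values copied from the output above; the class ClusterTableSketch and the method largestCluster are hypothetical helpers, not part of Weka.

```java
public class ClusterTableSketch {

    // Returns the index of the cluster with the largest count in a table row.
    static int largestCluster(double[] counts) {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Row "0_34" of the age attribute, clusters 0 to 5.
        double[] age0to34 = {10.0535, 51.8472, 122.2815, 12.6207, 3.1023, 1.0948};
        // prints "Largest cluster for age 0_34: 2"
        System.out.println("Largest cluster for age 0_34: " + largestCluster(age0to34));
    }
}
```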

Evaluation

A clustering algorithm's quality can be estimated using the log-likelihood measure, which measures how well the identified clusters fit the data. The dataset is split into multiple folds; for each fold, the clusterer is built on the remaining data and the log likelihood is computed on the held-out fold. The motivation here is that if the clustering algorithm assigns high probability to similar data that wasn't used to fit parameters, then it has probably done a good job of capturing the data structure. Weka offers the ClusterEvaluation class (in the weka.clusterers package) to estimate it; the snippet also needs java.util.Random:

double logLikelihood = ClusterEvaluation.crossValidateModel(model, data, 10, new Random(1));
System.out.println(logLikelihood);

It has the following output:

   -8.773410259774291
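To make the measure more concrete, the following sketch computes an average per-instance log likelihood of held-out values under a one-dimensional Gaussian mixture. It is an illustration under stated assumptions rather than Weka's code: the class LogLikelihoodSketch, its methods, and the sample data are hypothetical. Values closer to zero mean that the model assigns higher probability to the unseen data.

```java
public class LogLikelihoodSketch {

    // Gaussian density N(x | mean, variance).
    static double gaussian(double x, double mean, double variance) {
        double d = x - mean;
        return Math.exp(-d * d / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    // Average log likelihood of held-out values under a mixture model.
    static double averageLogLikelihood(double[] heldOut, double[] weights,
                                       double[] means, double[] variances) {
        double total = 0;
        for (double x : heldOut) {
            double density = 0;
            for (int k = 0; k < weights.length; k++) {
                density += weights[k] * gaussian(x, means[k], variances[k]);
            }
            total += Math.log(density);
        }
        return total / heldOut.length;
    }

    public static void main(String[] args) {
        // Made-up held-out values and two candidate models for illustration.
        double[] heldOut = {24, 26, 44, 46};
        double[] weights = {0.5, 0.5};
        double[] variances = {4, 4};
        double good = averageLogLikelihood(heldOut, weights, new double[]{25, 45}, variances);
        double bad = averageLogLikelihood(heldOut, weights, new double[]{0, 100}, variances);
        System.out.println("well-placed clusters:   " + good);
        System.out.println("poorly placed clusters: " + bad);
    }
}
```

The model whose cluster means sit near the held-out values scores the higher (less negative) log likelihood, which is the sense in which the measure rewards capturing the data structure.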