Clustering (Simple)

This recipe will demonstrate how to perform basic clustering. We will again use the bank data introduced in the previous recipe.

Getting ready

The idea of clustering is to automatically group examples in the dataset by a similarity measure, most commonly by Euclidean distance.
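
As a concrete illustration of the similarity measure mentioned above, here is a minimal sketch of Euclidean distance between two numeric examples. The class name EuclideanDistanceSketch and the sample values are hypothetical. Weka ships its own distance implementations, so this plain-Java version only shows the arithmetic.

```java
public class EuclideanDistanceSketch {

    // Straight-line distance between two equal-length numeric vectors.
    public static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two made-up examples with two numeric attributes each (illustrative values only).
        double[] first = {30.0, 40.0};
        double[] second = {33.0, 44.0};
        System.out.println(distance(first, second)); // prints 5.0
    }
}
```

The smaller the distance, the more similar the two examples. In practice, numeric attributes are usually scaled to a common range first so that an attribute with large values does not dominate the sum.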

How to do it...

Execute the following code:


import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.clusterers.EM;

public class Clustering {

  public static void main(String args[]) throws Exception {

    // load data
    Instances data = new Instances(new BufferedReader(new FileReader("dataset/bank-data.arff")));

    // new instance of clusterer
    EM model = new EM();
    // build the clusterer
    model.buildClusterer(data);
    // print the model
    System.out.println(model);
  }
}


Here is the output:


Number of clusters selected by cross validation: 6

Attribute              0        1        2        3        4        5
                   (0.1)   (0.13)   (0.26)   (0.25)   (0.12)   (0.14)
  0_34            10.0535  51.8472 122.2815  12.6207   3.1023   1.0948
  35_51           38.6282  24.4056  29.6252  89.4447  34.5208   3.3755
  52_max          13.4293    6.693   6.3459  50.8984   37.861  81.7724
  [total]         62.1111  82.9457 158.2526 152.9638  75.4841  86.2428
  FEMALE          27.1812  32.2338  77.9304  83.5129  40.3199  44.8218
  MALE            33.9299  49.7119  79.3222  68.4509  34.1642   40.421
  [total]         61.1111  81.9457 157.2526 151.9638  74.4841  85.2428
  INNER_CITY      26.1651  46.7431   73.874  60.1973  33.3759  34.6445
  TOWN            24.6991  13.0716  48.4446  53.1731   21.617  

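The output lists the clusters as columns and the attribute values as rows. The number in parentheses under each cluster index is that cluster's prior probability, that is, the share of the data it accounts for.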

How it works...

The necessary clustering classes are located in the weka.clusterers package. This example demonstrates simple Expectation Maximization (EM) clustering:

import weka.clusterers.EM;

Load the data:

Instances data = new Instances(new BufferedReader(new FileReader("dataset/bank-data.arff")));

Initialize a clusterer:

EM model = new EM();

Building a clusterer is similar to building a classifier, except that the buildClusterer(Instances) method is called instead of buildClassifier(Instances):

model.buildClusterer(data);

Finally, output the model:

System.out.println(model);

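EM created six clusters. The first column lists the attributes and their values, and each remaining column describes one cluster. For nominal attributes, as in the excerpt above, a cell holds the weighted number of instances in that cluster that take the given value; for numeric attributes, the mean and standard deviation are shown instead.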

There's more...

This section shows additional tasks that can be performed with clustering.

Cluster classification

A trained clusterer can assign an instance to a cluster in much the same way that a classifier assigns a class; the only difference is the method name. Instead of the classifyInstance(Instance) method, call clusterInstance(Instance). The distributionForInstance(Instance) method is still available and returns the membership probability for each cluster:

// load data
Instances data = new Instances(new BufferedReader(new FileReader("dataset/bank-data.arff")));

// hold out the first instance and remove it from the dataset
Instance inst = data.instance(0);
data.delete(0);

// new instance of clusterer
EM model = new EM();
// build the clusterer
model.buildClusterer(data);

// assign the held-out instance to a cluster
int cls = model.clusterInstance(inst);
System.out.println("Cluster: " + cls);

// membership probability for each cluster
double[] dist = model.distributionForInstance(inst);
for (int i = 0; i < dist.length; i++)
  System.out.println("Cluster " + i + ".\t" + dist[i]);

Here is the output:

Cluster: 4
Cluster 0.  0.05197208212603157
Cluster 1.  3.42266240021125E-4
Cluster 2.  2.4532781490129885E-6
Cluster 3.  0.09898134885565614
Cluster 4.  0.8311195273577744
Cluster 5.  0.01753085601118695
Cluster 6.  5.146613118085059E-5

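The first line of the output is the cluster the instance was assigned to. The remaining lines give the probability that the instance belongs to each cluster; these are membership probabilities rather than distances, so they sum to 1, and the assigned cluster (4) is the one with the highest probability.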
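
To make the link between the two parts of this output explicit, the following sketch picks the index of the largest value in the printed distribution and sums the values. It is plain Java with no Weka dependency. The class name MostLikelyCluster is hypothetical, and the numbers are copied from the output above.

```java
public class MostLikelyCluster {

    // Index of the largest value in a membership distribution.
    public static int argMax(double[] dist) {
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        return best;
    }

    // Sum of all values; close to 1.0 for a probability distribution.
    public static double sum(double[] dist) {
        double total = 0.0;
        for (double p : dist) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Values copied from the recipe's output.
        double[] dist = {
            0.05197208212603157, 3.42266240021125E-4, 2.4532781490129885E-6,
            0.09898134885565614, 0.8311195273577744, 0.01753085601118695,
            5.146613118085059E-5
        };
        System.out.println("Most likely cluster: " + argMax(dist)); // prints "Most likely cluster: 4"
        System.out.println("Sum of memberships: " + sum(dist));
    }
}
```

The largest value belongs to cluster 4, which matches the cluster reported on the first line of the output.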

Incremental clustering

A clusterer that implements the weka.clusterers.UpdateableClusterer interface can be trained incrementally, in the same way as incremental classifiers. This example uses the bank data, loaded with weka.core.converters.ArffLoader, to train weka.clusterers.Cobweb:

Load the data:

// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("dataset/bank-data.arff"));
Instances data = loader.getStructure();

Call buildClusterer(Instances) with the structure of the dataset (it may or may not contain any actual data rows):

Cobweb model = new Cobweb();
model.buildClusterer(data);

Subsequently, call the updateClusterer(Instance) method to feed the clusterer new weka.core.Instance objects, one by one:

Instance current;
while ((current = loader.getNextInstance(data)) != null)
  model.updateClusterer(current);

Call updateFinished() after all Instance objects have been processed, so that the clusterer can perform any additional computations, and then print the model:

model.updateFinished();
System.out.println(model);

Here is the output:

Number of merges: 183
Number of splits: 144
Number of clusters: 850

node 0 [600]
|   node 1 [266]
|   |   node 2 [65]
|   |   |   node 3 [15]
|   |   |   |   node 4 [4] 

The printed model shows the cluster hierarchy as a tree of nodes; the number in brackets is the number of instances under each node.
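
To show what training one instance at a time means in general, here is a toy sketch that maintains a running centroid without storing earlier examples. This is not Cobweb's algorithm and does not use Weka. The RunningCentroid class, its method names, and the sample values are hypothetical, chosen only to mirror the update-then-read pattern used above.

```java
import java.util.Arrays;

public class RunningCentroid {
    private double[] mean;
    private int count;

    // Fold one example into the centroid; earlier examples are not kept.
    public void update(double[] example) {
        if (mean == null) {
            mean = new double[example.length];
        }
        count++;
        for (int i = 0; i < example.length; i++) {
            // Incremental mean: move the old mean toward the new value by 1/count.
            mean[i] += (example[i] - mean[i]) / count;
        }
    }

    // Copy of the current centroid.
    public double[] centroid() {
        if (mean == null) {
            throw new IllegalStateException("No examples seen yet");
        }
        return mean.clone();
    }

    public static void main(String[] args) {
        RunningCentroid model = new RunningCentroid();
        double[][] stream = { {1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0} };
        for (double[] example : stream) {
            model.update(example); // one example at a time
        }
        System.out.println(Arrays.toString(model.centroid())); // prints [3.0, 4.0]
    }
}
```

The benefit of this pattern is that memory use does not grow with the number of examples, which is why incremental clusterers suit datasets that are too large to load at once.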

Cluster evaluation

A clusterer can be evaluated with the weka.clusterers.ClusterEvaluation class.

For example, we can use separate train and test sets, and output the number of clusters found:

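ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(model);          // a clusterer already built on the training set
eval.evaluateClusterer(testData);  // testData: a separate Instances object (name assumed here)
System.out.println("# of clusters: " + eval.getNumClusters());

Here is the output: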

# of clusters: 6

For density-based clusterers, cross-validation can be applied as follows (the official documentation notes that the MakeDensityBasedClusterer class can turn any clusterer into a density-based one):

Instances data = new Instances(new BufferedReader(new FileReader("dataset/bank-data.arff")));
EM model = new EM();
double logLikelihood = ClusterEvaluation.crossValidateModel(model, data, 10, new Random(1));
System.out.println(logLikelihood);



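The output is the cross-validated log likelihood, which measures how well the models built on the training folds explain the held-out instances. Higher (less negative) values indicate a better fit.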
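
To give a feel for how a log likelihood score behaves, the sketch below averages the natural log of the densities a model assigns to held-out instances. It is plain Java and does not reproduce Weka's exact computation. The class name LogLikelihoodSketch and the density values are hypothetical.

```java
public class LogLikelihoodSketch {

    // Average natural-log density over a set of held-out instances.
    public static double averageLogLikelihood(double[] densities) {
        if (densities.length == 0) {
            throw new IllegalArgumentException("At least one density is required");
        }
        double sum = 0.0;
        for (double d : densities) {
            sum += Math.log(d);
        }
        return sum / densities.length;
    }

    public static void main(String[] args) {
        // Made-up densities: a model that fits the held-out data well
        // assigns higher density to each instance than one that fits poorly.
        double[] goodFit = {0.5, 0.4, 0.6, 0.5};
        double[] poorFit = {0.05, 0.01, 0.02, 0.04};
        System.out.println("Good fit: " + averageLogLikelihood(goodFit));
        System.out.println("Poor fit: " + averageLogLikelihood(poorFit));
    }
}
```

Both scores are negative because the densities are below 1, and the better-fitting model scores closer to zero. The same comparison can be used to choose between clusterers or parameter settings evaluated on the same data.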

