This recipe will demonstrate how to perform basic clustering. We will again use the bank data introduced in the previous recipe.
The idea of clustering is to automatically group examples in the dataset by a similarity measure, most commonly by Euclidean distance.
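To make the similarity measure concrete, here is a minimal, self-contained sketch (independent of Weka) of the Euclidean distance between two numeric feature vectors; the vectors and the class name are made-up examples:

```java
public class EuclideanDistanceExample {

    // Euclidean distance: square root of the sum of squared
    // per-attribute differences
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] y = {4.0, 6.0};
        System.out.println(distance(x, y)); // prints 5.0
    }
}
```

Examples whose distance is small end up in the same cluster; examples far apart end up in different ones.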
Execute the following code:
import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.clusterers.EM;

public class Clustering {

  public static void main(String[] args) throws Exception {
    // load data
    Instances data = new Instances(new BufferedReader(
        new FileReader("dataset/bank-data.arff")));

    // new instance of clusterer
    EM model = new EM();
    // build the clusterer
    model.buildClusterer(data);
    System.out.println(model);
  }
}
Here is the output:
EM
==

Number of clusters selected by cross validation: 6

                Cluster
Attribute        0        1        2        3        4        5
               (0.1)   (0.13)   (0.26)   (0.25)   (0.12)   (0.14)
==================================================================
age
  0_34       10.0535  51.8472 122.2815  12.6207   3.1023   1.0948
  35_51      38.6282  24.4056  29.6252  89.4447  34.5208   3.3755
  52_max     13.4293   6.693    6.3459  50.8984  37.861   81.7724
  [total]    62.1111  82.9457 158.2526 152.9638  75.4841  86.2428
sex
  FEMALE     27.1812  32.2338  77.9304  83.5129  40.3199  44.8218
  MALE       33.9299  49.7119  79.3222  68.4509  34.1642  40.421
  [total]    61.1111  81.9457 157.2526 151.9638  74.4841  85.2428
region
  INNER_CITY 26.1651  46.7431  73.874   60.1973  33.3759  34.6445
  TOWN       24.6991  13.0716  48.4446  53.1731  21.617   17.9946
...
The algorithm outputs the clusters as columns; the rows show, for each attribute value, its estimated weight within each cluster.
The necessary clustering classes are located in the weka.clusterers package. This example demonstrates simple Expectation Maximization (EM) clustering:
import weka.clusterers.EM;
Load the data:
Instances data = new Instances(new BufferedReader(new FileReader("dataset/bank-data.arff")));
Initialize a clusterer:
EM model = new EM();
Building a clusterer is similar to building a classifier, except that the buildClusterer(Instances) method is called instead:
model.buildClusterer(data);
Finally, output the model:
System.out.println(model);

This prints the model summary:

EM
==

Number of clusters selected by cross validation: 6

                Cluster
Attribute        0        1        2        3        4        5
               (0.1)   (0.13)   (0.26)   (0.25)   (0.12)   (0.14)
==================================================================
age
  0_34       10.0535  51.8472 122.2815  12.6207   3.1023   1.0948
  35_51      38.6282  24.4056  29.6252  89.4447  34.5208   3.3755
  52_max     13.4293   6.693    6.3459  50.8984  37.861   81.7724
  [total]    62.1111  82.9457 158.2526 152.9638  75.4841  86.2428
...
EM created six clusters. The first column lists the attributes and their values; each attribute value's average weight in a specific cluster is shown in the corresponding cluster column.
This section shows additional tasks that can be performed with clustering.
Clustering can also be used to classify instances, similar to classification; the only difference is the method name. Instead of the classifyInstance(Instance) method, call clusterInstance(Instance). You can still use the distributionForInstance(Instance) method to obtain the cluster membership distribution:
// load data
Instances data = new Instances(new BufferedReader(
    new FileReader("dataset/bank-data.arff")));

// remove the first instance from the dataset
Instance inst = data.instance(0);
data.delete(0);

// new instance of clusterer
EM model = new EM();
// build the clusterer
model.buildClusterer(data);

// cluster the instance
int cls = model.clusterInstance(inst);
System.out.println("Cluster: " + cls);

double[] dist = model.distributionForInstance(inst);
for (int i = 0; i < dist.length; i++) {
  System.out.println("Cluster " + i + ". " + dist[i]);
}
Here is the output:
Cluster: 4
Cluster 0. 0.05197208212603157
Cluster 1. 3.42266240021125E-4
Cluster 2. 2.4532781490129885E-6
Cluster 3. 0.09898134885565614
Cluster 4. 0.8311195273577744
Cluster 5. 0.01753085601118695
Cluster 6. 5.146613118085059E-5
The output shows the probability of the instance belonging to each of the clusters; the instance is assigned to cluster 4, which has the highest membership probability.
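Conceptually, the assigned cluster is simply the index of the largest value in the distribution. A minimal sketch of that relationship, using the probability values printed above (the class name is made up for illustration):

```java
public class ClusterArgmax {

    // Return the index of the largest value, i.e. the most likely cluster
    static int argmax(double[] dist) {
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // membership probabilities from the output above
        double[] dist = {
            0.05197208212603157, 3.42266240021125E-4,
            2.4532781490129885E-6, 0.09898134885565614,
            0.8311195273577744, 0.01753085601118695,
            5.146613118085059E-5
        };
        System.out.println("Cluster: " + argmax(dist)); // prints Cluster: 4
    }
}
```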
A clusterer that implements the weka.clusterers.UpdateableClusterer interface can be trained incrementally, similar to classifiers. The following example loads the bank data with weka.core.converters.ArffLoader to train weka.clusterers.Cobweb:
Load the data:
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("dataset/bank-data.arff"));
Instances data = loader.getStructure();
Call the buildClusterer(Instances) method with the structure of the dataset (which may or may not contain any actual data rows):
Cobweb model = new Cobweb();
model.buildClusterer(data);
Subsequently, call the updateClusterer(Instance) method to feed the clusterer new weka.core.Instance objects, one by one:
Instance current;
while ((current = loader.getNextInstance(data)) != null) {
  model.updateClusterer(current);
}
Call updateFinished() after all the Instance objects have been processed, so that the clusterer can perform any remaining computations:
model.updateFinished();
System.out.println(model);
Here is the output:
Number of merges: 183
Number of splits: 144
Number of clusters: 850

node 0 [600]
|   node 1 [266]
|   |   node 2 [65]
|   |   |   node 3 [15]
|   |   |   |   node 4 [4]
...
The output displays the hierarchy of cluster nodes that Cobweb built, along with the number of instances covered by each node.
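The essence of incremental training is that each new instance updates the model's statistics without revisiting earlier instances. Cobweb itself maintains a hierarchy, but the idea can be sketched with a much simpler running-mean centroid update; the class below is an illustrative example, not part of Weka:

```java
public class IncrementalCentroid {

    private final double[] mean;
    private long count = 0;

    IncrementalCentroid(int numAttributes) {
        this.mean = new double[numAttributes];
    }

    // Incorporate one instance in O(attributes) time using the
    // running-mean identity: mean += (x - mean) / n
    void update(double[] x) {
        count++;
        for (int i = 0; i < mean.length; i++) {
            mean[i] += (x[i] - mean[i]) / count;
        }
    }

    double[] mean() {
        return mean.clone();
    }

    public static void main(String[] args) {
        IncrementalCentroid c = new IncrementalCentroid(1);
        c.update(new double[]{2.0});
        c.update(new double[]{4.0});
        c.update(new double[]{6.0});
        System.out.println(c.mean()[0]); // prints 4.0
    }
}
```

As with updateClusterer(Instance), each call sees only the current instance, so the full dataset never has to fit in memory.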
A clusterer can be evaluated with the weka.clusterers.ClusterEvaluation
class.
For example, we can use separate train and test sets, and output the number of clusters found:
ClusterEvaluation eval = new ClusterEvaluation();
model.buildClusterer(data);
eval.setClusterer(model);
eval.evaluateClusterer(newData); // separate test set
System.out.println("# of clusters: " + eval.getNumClusters());
Output:
# of clusters: 6
When density-based clusterers are used, cross-validation can be applied as follows (note that the official documentation states that the MakeDensityBasedClusterer class can turn any clusterer into a density-based one):
Instances data = new Instances(new BufferedReader(
    new FileReader("dataset/bank-data.arff")));
EM model = new EM();
double logLikelihood = ClusterEvaluation.crossValidateModel(
    model, data, 10, new Random(1));
System.out.println(logLikelihood);
Output:
-8.773410259774291
The output is the log likelihood averaged over the folds, measuring how well the estimated cluster distributions fit the held-out data; higher (less negative) values indicate a better fit.