This case study uses another well-known publicly available dataset to demonstrate active learning techniques using open source Java libraries. As before, we begin by defining the business problem, the tools and frameworks used, how the principles of machine learning are realized in the solution, and what the data analysis steps reveal. Next, we describe the experiments that were conducted, evaluate the performance of the various models, and analyze the results.
JCLAL, a Java framework for active learning that supports both single-label and multi-label learning, was the tool used for these experiments.
JCLAL is open source and is distributed under the GNU General Public License: https://sourceforge.net/p/jclal/git/ci/master/tree/.
The abalone dataset, which is used in these experiments, contains data on various physical and anatomical characteristics of abalone—commonly known as sea snails. The goal is to predict the number of rings in the shell, which is indicative of the age of the specimen.
As we have seen, active learning is characterized by starting with a small set of labeled data and applying query strategies to the unlabeled data so that instances are incrementally added to the labeled set. This is performed over multiple iterations, one batch at a time. The number of iterations and the batch size are hyperparameters of these techniques. The query strategy and the choice of supervised learning method used to train on the growing labeled set are additional inputs.
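The iterative loop described above can be sketched in plain Java. This is an illustrative sketch only, not the JCLAL API: the class, method names, and toy data below are invented for the example, and entropy over the classifier's predicted class distribution stands in for the query strategy.

```java
import java.util.*;

// Minimal pool-based active learning loop (illustrative, not the JCLAL API).
// Each iteration scores every unlabeled instance with an uncertainty measure,
// queries the top batchSize instances, and would then move them (with their
// oracle-provided labels) into the labeled set before retraining.
public class ActiveLearningSketch {

    // Entropy of a predicted class distribution: higher means more uncertain.
    static double entropy(double[] probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0) h -= p * Math.log(p);
        }
        return h;
    }

    // One querying iteration: indices of the batchSize most uncertain instances.
    static List<Integer> queryBatch(double[][] poolPredictions, int batchSize) {
        Integer[] idx = new Integer[poolPredictions.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) ->
                Double.compare(entropy(poolPredictions[b]), entropy(poolPredictions[a])));
        return Arrays.asList(idx).subList(0, batchSize);
    }

    public static void main(String[] args) {
        // Toy predicted class distributions for a 3-instance unlabeled pool.
        double[][] pool = {
                {0.9, 0.1},   // confident prediction
                {0.5, 0.5},   // maximally uncertain
                {0.7, 0.3}
        };
        System.out.println(queryBatch(pool, 1)); // the uniform distribution is queried first
    }
}
```

In a real run, the selected instances would be labeled by an oracle, appended to the labeled set, and the base classifier retrained before the next iteration.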
As before, we will use an existing dataset available from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Abalone). The original owners of the database are the Department of Primary Industry and Fisheries in Tasmania, Australia.
The data types and descriptions of the attributes accompany the data and are reproduced in Table 5. The class attribute, Rings, has 29 distinct classes:
| Name | Data type | Measurement units | Description |
|---|---|---|---|
| Sex | nominal | M, F, and I (infant) | sex of specimen |
| Length | continuous | mm | longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | count | +1.5 gives the age in years |
Table 5. Abalone dataset features
For this experiment, we treated a randomly selected 4,155 records as unlabeled and kept the remaining 22 as labeled. No transformation was applied to the data.
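A split like this can be produced by shuffling the record indices and keeping a small seed set as labeled. The sketch below is an assumption about how such a split might be done; the book does not specify the shuffling method or seed, and the class and method names are illustrative.

```java
import java.util.*;

// Sketch: randomly partition record indices into a small labeled seed set
// and an unlabeled pool (shuffling method and seed are illustrative).
public class SplitSketch {

    // Returns two index lists: [labeled seed set, unlabeled pool].
    static List<List<Integer>> split(int total, int labeledCount, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < total; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        return Arrays.asList(
                new ArrayList<>(idx.subList(0, labeledCount)),
                new ArrayList<>(idx.subList(labeledCount, total)));
    }
}
```

The two index lists would then be written out as the labeled and unlabeled ARFF files that the JCLAL configuration points to.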
With only eight features, there is no need for dimensionality reduction. The dataset comes with some statistics on the features, reproduced in Table 6:
| | Length | Diameter | Height | Whole | Shucked | Viscera | Shell | Rings |
|---|---|---|---|---|---|---|---|---|
| Min | 0.075 | 0.055 | 0 | 0.002 | 0.001 | 0.001 | 0.002 | 1 |
| Max | 0.815 | 0.65 | 1.13 | 2.826 | 1.488 | 0.76 | 1.005 | 29 |
| Mean | 0.524 | 0.408 | 0.14 | 0.829 | 0.359 | 0.181 | 0.239 | 9.934 |
| SD | 0.12 | 0.099 | 0.042 | 0.49 | 0.222 | 0.11 | 0.139 | 3.224 |
| Correl | 0.557 | 0.575 | 0.557 | 0.54 | 0.421 | 0.504 | 0.628 | 1 |
Table 6. Summary statistics by feature
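Statistics like those in Table 6 are straightforward to recompute. The sketch below shows the mean, standard deviation, and Pearson correlation with the Rings target; whether the published SD row uses the n or n − 1 denominator is not stated, so the sample (n − 1) form here is an assumption, as are the class and method names.

```java
// Sketch of the per-feature statistics behind Table 6 (toy inputs, not the
// abalone data). Correl is the Pearson correlation with the Rings column.
public class FeatureStats {

    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    // Sample standard deviation (n - 1 denominator; an assumption, see above).
    static double sd(double[] x) {
        double m = mean(x), s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return Math.sqrt(s / (x.length - 1));
    }

    // Pearson correlation between a feature column and the Rings target.
    static double correl(double[] x, double[] y) {
        double mx = mean(x), my = mean(y), num = 0, dx = 0, dy = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx += (x[i] - mx) * (x[i] - mx);
            dy += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }
}
```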
We conducted two sets of experiments, the first using pool-based scenarios and the second stream-based. In each set, we used entropy sampling, least confident sampling, margin sampling, and vote entropy sampling. The classifiers used were Naïve Bayes, Logistic Regression, and J48 (the Weka implementation of C4.5). For every experiment, 100 iterations were run, with batch sizes of 1 and 10. In Tables 7 to 12, we present a subset of these results: the pool-based and stream-based scenarios for each sampling method using the Naïve Bayes, Simple Logistic, and C4.5 classifiers with a batch size of 10.
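The uncertainty-based strategies named above reduce to simple functions of the classifier's predicted class distribution. The sketch below shows two of them under their usual definitions (the class and method names are illustrative, not JCLAL's): least confident scores an instance by how unsure the classifier is of its top prediction, while margin scores it by the gap between the two most likely classes.

```java
// Uncertainty scores behind two of the query strategies (illustrative sketch).
public class QueryScores {

    // Least confident: 1 - P(most likely class); higher means more uncertain.
    static double leastConfident(double[] probs) {
        double max = 0;
        for (double p : probs) max = Math.max(max, p);
        return 1.0 - max;
    }

    // Margin: gap between the two most likely classes; LOWER means more uncertain.
    static double margin(double[] probs) {
        double first = Double.NEGATIVE_INFINITY, second = Double.NEGATIVE_INFINITY;
        for (double p : probs) {
            if (p > first) { second = first; first = p; }
            else if (p > second) { second = p; }
        }
        return first - second;
    }
}
```

Entropy sampling scores the whole distribution rather than just the top one or two classes, and vote entropy extends the idea to the disagreement among a committee of classifiers.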
The full set of results can be seen at https://github.com/mjmlbook/mastering-java-machine-learning/tree/master/Chapter4.
The JCLAL library requires an XML configuration file to specify which scenario to use, the query strategy selected, batch size, max iterations, and base classifier. The following is an example configuration:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<experiment>
  <process evaluation-method-type="net.sf.jclal.evaluation.method.RealScenario">
    <file-labeled>datasets/abalone-labeled.arff</file-labeled>
    <file-unlabeled>datasets/abalone-unlabeled.arff</file-unlabeled>
    <algorithm type="net.sf.jclal.activelearning.algorithm.ClassicalALAlgorithm">
      <stop-criterion type="net.sf.jclal.activelearning.stopcriteria.MaxIteration">
        <max-iteration>10</max-iteration>
      </stop-criterion>
      <stop-criterion type="net.sf.jclal.activelearning.stopcriteria.UnlabeledSetEmpty"/>
      <listener type="net.sf.jclal.listener.RealScenarioListener">
        <informative-instances>reports/real-scenario-informative-data.txt</informative-instances>
      </listener>
      <scenario type="net.sf.jclal.activelearning.scenario.PoolBasedSamplingScenario">
        <batch-mode type="net.sf.jclal.activelearning.batchmode.QBestBatchMode">
          <batch-size>1</batch-size>
        </batch-mode>
        <oracle type="net.sf.jclal.activelearning.oracle.ConsoleHumanOracle"/>
        <query-strategy type="net.sf.jclal.activelearning.singlelabel.querystrategy.EntropySamplingQueryStrategy">
          <wrapper-classifier type="net.sf.jclal.classifier.WekaClassifier">
            <classifier type="weka.classifiers.bayes.NaiveBayes"/>
          </wrapper-classifier>
        </query-strategy>
      </scenario>
    </algorithm>
  </process>
</experiment>
The tool itself is invoked as follows:
java -jar jclal-<version>.jar -cfg <config-file>
In the following three tables, we compare results using pool-based scenarios when using Naïve Bayes, Simple Logistic, and C4.5 classifiers.
Naïve Bayes:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-NaiveBayes-b10 | 0.6021 | 0.1032 | 0.0556(1) | 0.1805 | 0.1304 |
| PoolBased-KLDivergence-NaiveBayes-b10 | 0.6639(1) | 0.1441(1) | 0.0563 | 0.1765 | 0.1504 |
| PoolBased-LeastConfidentSampling-NaiveBayes-b10 | 0.6406 | 0.1300 | 0.0827 | 0.1835(1) | 0.1810(1) |
| PoolBased-VoteEntropy-NaiveBayes-b10 | 0.6639(1) | 0.1441(1) | 0.0563 | 0.1765 | 0.1504 |
Table 7. Performance of pool-based scenario using Naïve Bayes classifier
Logistic Regression:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-SimpleLogistic-b10 | 0.6831 | 0.1571 | 0.1157 | 0.1651 | 0.2185(1) |
| PoolBased-KLDivergence-SimpleLogistic-b10 | 0.7175(1) | 0.1616 | 0.1049 | 0.2117(1) | 0.2065 |
| PoolBased-LeastConfidentSampling-SimpleLogistic-b10 | 0.6629 | 0.1392 | 0.1181(1) | 0.1751 | 0.1961 |
| PoolBased-VoteEntropy-SimpleLogistic-b10 | 0.6959 | 0.1634(1) | 0.0895 | 0.2307 | 0.1880 |
Table 8. Performance of pool-based scenario using Logistic Regression classifier
C4.5:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-J48-b10 | 0.6730(1) | 0.3286(1) | 0.0737 | 0.3432(1) | 0.3278(1) |
| PoolBased-KLDivergence-J48-b10 | 0.6686 | 0.2979 | 0.0705(1) | 0.3153 | 0.2955 |
| PoolBased-LeastConfidentSampling-J48-b10 | 0.6591 | 0.3094 | 0.0843 | 0.3124 | 0.3227 |
| PoolBased-VoteEntropy-J48-b10 | 0.6686 | 0.2979 | 0.0706 | 0.3153 | 0.2955 |
Table 9. Performance of pool-based scenario using C4.5 classifier
In the following three tables, we have results for experiments on stream-based scenarios using Naïve Bayes, Logistic Regression, and C4.5 classifiers with four different sampling methods.
Naïve Bayes:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-NaiveBayes-b10 | 0.6673(1) | 0.1432(1) | 0.0563 | 0.1842(1) | 0.1480 |
| StreamBased-LeastConfidentSampling-NaiveBayes-b10 | 0.5585 | 0.0923 | 0.1415 | 0.1610 | 0.1807(1) |
| StreamBased-MarginSampling-NaiveBayes-b10 | 0.6736(1) | 0.1282 | 0.0548(1) | 0.1806 | 0.1475 |
| StreamBased-VoteEntropyQuery-NaiveBayes-b10 | 0.5585 | 0.0923 | 0.1415 | 0.1610 | 0.1807(1) |
Table 10. Performance of stream-based scenario using Naïve Bayes classifier
Logistic Regression:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-SimpleLogistic-b10 | 0.7343(1) | 0.1994(1) | 0.0871 | 0.2154 | 0.2185(1) |
| StreamBased-LeastConfidentSampling-SimpleLogistic-b10 | 0.7068 | 0.1750 | 0.0906 | 0.2324(1) | 0.2019 |
| StreamBased-MarginSampling-SimpleLogistic-b10 | 0.7311 | 0.1994(1) | 0.0861 | 0.2177 | 0.2140 |
| StreamBased-VoteEntropy-SimpleLogistic-b10 | 0.5506 | 0.0963 | 0.0667(1) | 0.1093 | 0.1117 |
Table 11. Performance of stream-based scenario using Logistic Regression classifier
C4.5:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-J48-b10 | 0.6648 | 0.3053 | 0.0756 | 0.3189(1) | 0.3032 |
| StreamBased-LeastConfidentSampling-J48-b10 | 0.6748(1) | 0.3064(1) | 0.0832 | 0.3128 | 0.3189(1) |
| StreamBased-MarginSampling-J48-b10 | 0.6660 | 0.2998 | 0.0728(1) | 0.3163 | 0.2967 |
| StreamBased-VoteEntropy-J48-b10 | 0.4966 | 0.0627 | 0.0742 | 0.1096 | 0.0758 |
Table 12. Performance of stream-based scenario using C4.5 classifier
It is quite interesting to see that pool-based Query By Committee, an ensemble method, using KL-divergence sampling does really well across most classifiers. As discussed earlier, these methods have a theoretical guarantee on reducing errors by maintaining a large hypothesis space, and this experimental result supports it empirically.
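The two committee disagreement measures behind these results can be sketched directly (an illustrative sketch, not JCLAL's implementation; the class and method names are invented): vote entropy measures the spread of the committee's hard votes, while the KL-divergence strategy averages each member's divergence from the committee's consensus distribution.

```java
// Query-by-committee disagreement measures (illustrative sketch).
public class CommitteeDisagreement {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Vote entropy: entropy of the distribution of the members' hard votes.
    static double voteEntropy(int[] votes, int numClasses) {
        double[] counts = new double[numClasses];
        for (int v : votes) counts[v]++;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / votes.length;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // Mean KL divergence of each member's predicted distribution from the
    // consensus (the average of the members' distributions).
    static double meanKLToConsensus(double[][] memberProbs) {
        int k = memberProbs[0].length;
        double[] consensus = new double[k];
        for (double[] m : memberProbs)
            for (int c = 0; c < k; c++) consensus[c] += m[c] / memberProbs.length;
        double kl = 0;
        for (double[] m : memberProbs)
            for (int c = 0; c < k; c++)
                if (m[c] > 0) kl += m[c] * log2(m[c] / consensus[c]);
        return kl / memberProbs.length;
    }
}
```

Instances on which the committee disagrees most (high vote entropy or high mean KL divergence) are the ones queried next.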
Pool-based entropy sampling with C4.5 as the classifier has the highest precision, recall, and F-measure of all the experiments, and with stream-based entropy sampling and C4.5 the metrics are similarly high. In fact, the other pool-based sampling techniques with C4.5 (KL-divergence, least confident, and vote entropy) also give significantly higher metrics than the other classifiers. This can therefore be attributed more strongly to the underlying C4.5 classifier and its ability to find non-linear patterns than to the sampling strategy.
The Logistic Regression algorithm performs very well in both the stream-based and pool-based scenarios when considering the area under the ROC curve. This may be because logistic regression produces well-calibrated class probabilities, an important factor in achieving good AUC scores.