This case study uses another well-known publicly available dataset to demonstrate active learning techniques using open source Java libraries. As before, we begin by defining the business problem, the tools and frameworks used, how the principles of machine learning are realized in the solution, and what the data analysis steps reveal. Next, we describe the experiments that were conducted, evaluate the performance of the various models, and analyze the results.
JCLAL, a Java framework for active learning that supports both single-label and multi-label learning, was the tool used for these experiments.
JCLAL is open source and is distributed under the GNU General Public License: https://sourceforge.net/p/jclal/git/ci/master/tree/.
The abalone dataset, which is used in these experiments, contains data on various physical and anatomical characteristics of abalone—commonly known as sea snails. The goal is to predict the number of rings in the shell, which is indicative of the age of the specimen.
As we have seen, active learning is characterized by starting with a small set of labeled data and applying query strategies to the unlabeled data so that instances are incrementally added to the labeled set. This is performed over multiple iterations, one batch at a time. The number of iterations and the batch size are hyperparameters of these techniques. The query strategy and the choice of supervised learning method used to train on the growing labeled set are additional inputs.
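The iterative loop described above can be sketched in plain Java. This is an illustrative sketch only, not the JCLAL API: the class, method names, and toy data below are invented for the example, and entropy over the classifier's predicted class distribution stands in for the query strategy.

```java
import java.util.*;

// Minimal pool-based active learning loop (illustrative, not the JCLAL API).
// Each iteration scores every unlabeled instance with an uncertainty measure,
// queries the top batchSize instances, and would then move them (with their
// oracle-provided labels) into the labeled set before retraining.
public class ActiveLearningSketch {

    // Entropy of a predicted class distribution: higher means more uncertain.
    static double entropy(double[] probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0) h -= p * Math.log(p);
        }
        return h;
    }

    // One querying iteration: indices of the batchSize most uncertain instances.
    static List<Integer> queryBatch(double[][] poolPredictions, int batchSize) {
        Integer[] idx = new Integer[poolPredictions.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) ->
                Double.compare(entropy(poolPredictions[b]), entropy(poolPredictions[a])));
        return Arrays.asList(idx).subList(0, batchSize);
    }

    public static void main(String[] args) {
        // Toy predicted class distributions for a 3-instance unlabeled pool.
        double[][] pool = {
                {0.9, 0.1},   // confident prediction
                {0.5, 0.5},   // maximally uncertain
                {0.7, 0.3}
        };
        System.out.println(queryBatch(pool, 1)); // the uniform distribution is queried first
    }
}
```

In a real run, the selected instances would be labeled by an oracle, appended to the labeled set, and the base classifier retrained before the next iteration.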
As before, we will use an existing dataset available from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Abalone). The original owners of the database are the Department of Primary Industry and Fisheries in Tasmania, Australia.
The data types and descriptions of the attributes accompany the data and are reproduced in Table 5. The class attribute, Rings, has 29 distinct classes:
| Name | Data type | Measurement units | Description |
|---|---|---|---|
| Sex | nominal | M, F, and I (infant) | sex of specimen |
| Length | continuous | mm | longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | count | +1.5 gives the age in years |
Table 5. Abalone dataset features
For this experiment, we treated a randomly selected 4,155 records as unlabeled and kept the remaining 22 as labeled. No transformation was applied to the data.
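A split like this can be produced by shuffling the record indices and keeping a small seed set as labeled. The sketch below is an assumption about how such a split might be done; the book does not specify the shuffling method or seed, and the class and method names are illustrative.

```java
import java.util.*;

// Sketch: randomly partition record indices into a small labeled seed set
// and an unlabeled pool (shuffling method and seed are illustrative).
public class SplitSketch {

    // Returns two index lists: [labeled seed set, unlabeled pool].
    static List<List<Integer>> split(int total, int labeledCount, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < total; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        return Arrays.asList(
                new ArrayList<>(idx.subList(0, labeledCount)),
                new ArrayList<>(idx.subList(labeledCount, total)));
    }
}
```

The two index lists would then be written out as the labeled and unlabeled ARFF files that the JCLAL configuration points to.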
With only eight features, there is no need for dimensionality reduction. The dataset comes with some statistics on the features, reproduced in Table 6:
| | Length | Diameter | Height | Whole | Shucked | Viscera | Shell | Rings |
|---|---|---|---|---|---|---|---|---|
| Min | 0.075 | 0.055 | 0 | 0.002 | 0.001 | 0.001 | 0.002 | 1 |
| Max | 0.815 | 0.65 | 1.13 | 2.826 | 1.488 | 0.76 | 1.005 | 29 |
| Mean | 0.524 | 0.408 | 0.14 | 0.829 | 0.359 | 0.181 | 0.239 | 9.934 |
| SD | 0.12 | 0.099 | 0.042 | 0.49 | 0.222 | 0.11 | 0.139 | 3.224 |
| Correl | 0.557 | 0.575 | 0.557 | 0.54 | 0.421 | 0.504 | 0.628 | 1 |
Table 6. Summary statistics by feature
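Statistics like those in Table 6 are straightforward to recompute. The sketch below shows the mean, standard deviation, and Pearson correlation with the Rings target; whether the published SD row uses the n or n − 1 denominator is not stated, so the sample (n − 1) form here is an assumption, as are the class and method names.

```java
// Sketch of the per-feature statistics behind Table 6 (toy inputs, not the
// abalone data). Correl is the Pearson correlation with the Rings column.
public class FeatureStats {

    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    // Sample standard deviation (n - 1 denominator; an assumption, see above).
    static double sd(double[] x) {
        double m = mean(x), s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return Math.sqrt(s / (x.length - 1));
    }

    // Pearson correlation between a feature column and the Rings target.
    static double correl(double[] x, double[] y) {
        double mx = mean(x), my = mean(y), num = 0, dx = 0, dy = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx += (x[i] - mx) * (x[i] - mx);
            dy += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }
}
```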
We conducted two sets of experiments, the first using pool-based scenarios and the second stream-based. In each set, we used entropy sampling, least confident sampling, margin sampling, and vote entropy sampling. The classifiers used were Naïve Bayes, Logistic Regression, and J48 (the Weka implementation of C4.5). For every experiment, 100 iterations were run, with batch sizes of 1 and 10. In Tables 7 to 12, we present a subset of these results: the pool-based and stream-based scenarios for each sampling method using the Naïve Bayes, Simple Logistic, and C4.5 classifiers with a batch size of 10.
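The uncertainty-based strategies named above reduce to simple functions of the classifier's predicted class distribution. The sketch below shows two of them under their usual definitions (the class and method names are illustrative, not JCLAL's): least confident scores an instance by how unsure the classifier is of its top prediction, while margin scores it by the gap between the two most likely classes.

```java
// Uncertainty scores behind two of the query strategies (illustrative sketch).
public class QueryScores {

    // Least confident: 1 - P(most likely class); higher means more uncertain.
    static double leastConfident(double[] probs) {
        double max = 0;
        for (double p : probs) max = Math.max(max, p);
        return 1.0 - max;
    }

    // Margin: gap between the two most likely classes; LOWER means more uncertain.
    static double margin(double[] probs) {
        double first = Double.NEGATIVE_INFINITY, second = Double.NEGATIVE_INFINITY;
        for (double p : probs) {
            if (p > first) { second = first; first = p; }
            else if (p > second) { second = p; }
        }
        return first - second;
    }
}
```

Entropy sampling scores the whole distribution rather than just the top one or two classes, and vote entropy extends the idea to the disagreement among a committee of classifiers.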
The full set of results can be seen at https://github.com/mjmlbook/mastering-java-machine-learning/tree/master/Chapter4.
The JCLAL library requires an XML configuration file to specify which scenario to use, the query strategy selected, batch size, max iterations, and base classifier. The following is an example configuration:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<experiment>
  <process evaluation-method-type="net.sf.jclal.evaluation.method.RealScenario">
    <file-labeled>datasets/abalone-labeled.arff</file-labeled>
    <file-unlabeled>datasets/abalone-unlabeled.arff</file-unlabeled>
    <algorithm type="net.sf.jclal.activelearning.algorithm.ClassicalALAlgorithm">
      <stop-criterion type="net.sf.jclal.activelearning.stopcriteria.MaxIteration">
        <max-iteration>10</max-iteration>
      </stop-criterion>
      <stop-criterion type="net.sf.jclal.activelearning.stopcriteria.UnlabeledSetEmpty"/>
      <listener type="net.sf.jclal.listener.RealScenarioListener">
        <informative-instances>reports/real-scenario-informative-data.txt</informative-instances>
      </listener>
      <scenario type="net.sf.jclal.activelearning.scenario.PoolBasedSamplingScenario">
        <batch-mode type="net.sf.jclal.activelearning.batchmode.QBestBatchMode">
          <batch-size>1</batch-size>
        </batch-mode>
        <oracle type="net.sf.jclal.activelearning.oracle.ConsoleHumanOracle"/>
        <query-strategy type="net.sf.jclal.activelearning.singlelabel.querystrategy.EntropySamplingQueryStrategy">
          <wrapper-classifier type="net.sf.jclal.classifier.WekaClassifier">
            <classifier type="weka.classifiers.bayes.NaiveBayes"/>
          </wrapper-classifier>
        </query-strategy>
      </scenario>
    </algorithm>
  </process>
</experiment>
The tool itself is invoked as follows:
java -jar jclal-<version>.jar -cfg <config-file>
In the following three tables, we compare results using pool-based scenarios when using Naïve Bayes, Simple Logistic, and C4.5 classifiers.
Naïve Bayes:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-NaiveBayes-b10 | 0.6021 | 0.1032 | 0.0556(1) | 0.1805 | 0.1304 |
| PoolBased-KLDivergence-NaiveBayes-b10 | 0.6639(1) | 0.1441(1) | 0.0563 | 0.1765 | 0.1504 |
| PoolBased-LeastConfidentSampling-NaiveBayes-b10 | 0.6406 | 0.1300 | 0.0827 | 0.1835(1) | 0.1810(1) |
| PoolBased-VoteEntropy-NaiveBayes-b10 | 0.6639(1) | 0.1441(1) | 0.0563 | 0.1765 | 0.1504 |
Table 7. Performance of pool-based scenario using Naïve Bayes classifier
Logistic Regression:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-SimpleLogistic-b10 | 0.6831 | 0.1571 | 0.1157 | 0.1651 | 0.2185(1) |
| PoolBased-KLDivergence-SimpleLogistic-b10 | 0.7175(1) | 0.1616 | 0.1049 | 0.2117(1) | 0.2065 |
| PoolBased-LeastConfidentSampling-SimpleLogistic-b10 | 0.6629 | 0.1392 | 0.1181(1) | 0.1751 | 0.1961 |
| PoolBased-VoteEntropy-SimpleLogistic-b10 | 0.6959 | 0.1634(1) | 0.0895 | 0.2307 | 0.1880 |
Table 8. Performance of pool-based scenario using Logistic Regression classifier
C4.5:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-J48-b10 | 0.6730(1) | 0.3286(1) | 0.0737 | 0.3432(1) | 0.3278(1) |
| PoolBased-KLDivergence-J48-b10 | 0.6686 | 0.2979 | 0.0705(1) | 0.3153 | 0.2955 |
| PoolBased-LeastConfidentSampling-J48-b10 | 0.6591 | 0.3094 | 0.0843 | 0.3124 | 0.3227 |
| PoolBased-VoteEntropy-J48-b10 | 0.6686 | 0.2979 | 0.0706 | 0.3153 | 0.2955 |
Table 9. Performance of pool-based scenario using C4.5 classifier
In the following three tables, we have results for experiments on stream-based scenarios using Naïve Bayes, Logistic Regression, and C4.5 classifiers with four different sampling methods.
Naïve Bayes:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-NaiveBayes-b10 | 0.6673(1) | 0.1432(1) | 0.0563 | 0.1842(1) | 0.1480 |
| StreamBased-LeastConfidentSampling-NaiveBayes-b10 | 0.5585 | 0.0923 | 0.1415 | 0.1610 | 0.1807(1) |
| StreamBased-MarginSampling-NaiveBayes-b10 | 0.6736(1) | 0.1282 | 0.0548(1) | 0.1806 | 0.1475 |
| StreamBased-VoteEntropyQuery-NaiveBayes-b10 | 0.5585 | 0.0923 | 0.1415 | 0.1610 | 0.1807(1) |
Table 10. Performance of stream-based scenario using Naïve Bayes classifier
Logistic Regression:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-SimpleLogistic-b10 | 0.7343(1) | 0.1994(1) | 0.0871 | 0.2154 | 0.2185(1) |
| StreamBased-LeastConfidentSampling-SimpleLogistic-b10 | 0.7068 | 0.1750 | 0.0906 | 0.2324(1) | 0.2019 |
| StreamBased-MarginSampling-SimpleLogistic-b10 | 0.7311 | 0.1994(1) | 0.0861 | 0.2177 | 0.2140 |
| StreamBased-VoteEntropy-SimpleLogistic-b10 | 0.5506 | 0.0963 | 0.0667(1) | 0.1093 | 0.1117 |
Table 11. Performance of stream-based scenario using Logistic Regression classifier
C4.5:
| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-J48-b10 | 0.6648 | 0.3053 | 0.0756 | 0.3189(1) | 0.3032 |
| StreamBased-LeastConfidentSampling-J48-b10 | 0.6748(1) | 0.3064(1) | 0.0832 | 0.3128 | 0.3189(1) |
| StreamBased-MarginSampling-J48-b10 | 0.6660 | 0.2998 | 0.0728(1) | 0.3163 | 0.2967 |
| StreamBased-VoteEntropy-J48-b10 | 0.4966 | 0.0627 | 0.0742 | 0.1096 | 0.0758 |
Table 12. Performance of stream-based scenario using C4.5 classifier
It is quite interesting to see that pool-based Query By Committee, an ensemble method, using KL-divergence sampling does really well across most classifiers. As discussed earlier, these methods have a theoretical guarantee on reducing errors by maintaining a large hypothesis space, and this experimental result supports it empirically.
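The two committee disagreement measures behind these results can be sketched directly (an illustrative sketch, not JCLAL's implementation; the class and method names are invented): vote entropy measures the spread of the committee's hard votes, while the KL-divergence strategy averages each member's divergence from the committee's consensus distribution.

```java
// Query-by-committee disagreement measures (illustrative sketch).
public class CommitteeDisagreement {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Vote entropy: entropy of the distribution of the members' hard votes.
    static double voteEntropy(int[] votes, int numClasses) {
        double[] counts = new double[numClasses];
        for (int v : votes) counts[v]++;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / votes.length;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // Mean KL divergence of each member's predicted distribution from the
    // consensus (the average of the members' distributions).
    static double meanKLToConsensus(double[][] memberProbs) {
        int k = memberProbs[0].length;
        double[] consensus = new double[k];
        for (double[] m : memberProbs)
            for (int c = 0; c < k; c++) consensus[c] += m[c] / memberProbs.length;
        double kl = 0;
        for (double[] m : memberProbs)
            for (int c = 0; c < k; c++)
                if (m[c] > 0) kl += m[c] * log2(m[c] / consensus[c]);
        return kl / memberProbs.length;
    }
}
```

Instances on which the committee disagrees most (high vote entropy or high mean KL divergence) are the ones queried next.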
Pool-based entropy sampling with C4.5 as the classifier has the highest precision, recall, and F-measure of all the experiments, and with stream-based entropy sampling and C4.5 the metrics are similarly high. In fact, the other pool-based sampling techniques with C4.5 (KL-divergence, least confident, and vote entropy) also give significantly higher metrics than the other classifiers. This can therefore be attributed more strongly to the underlying C4.5 classifier and its ability to find non-linear patterns than to the sampling strategy.
The Logistic Regression algorithm performs very well in both the stream-based and pool-based scenarios when considering the area under the ROC curve. This may be because logistic regression produces well-calibrated class probabilities, an important factor in achieving good AUC scores.