Case study in active learning

This case study uses another well-known publicly available dataset to demonstrate active learning techniques using open source Java libraries. As before, we begin by defining the business problem, the tools and frameworks used, how the principles of machine learning are realized in the solution, and what the data analysis steps reveal. Next, we describe the experiments that were conducted, evaluate the performance of the various models, and provide an analysis of the results.

Tools and software

For the active learning experiments, JCLAL was the tool used. JCLAL is a Java framework for active learning that supports both single-label and multi-label learning.

Note

JCLAL is open source and is distributed under the GNU General Public License: https://sourceforge.net/p/jclal/git/ci/master/tree/.

Business problem

The abalone dataset, which is used in these experiments, contains data on various physical and anatomical characteristics of abalone—commonly known as sea snails. The goal is to predict the number of rings in the shell, which is indicative of the age of the specimen.

Machine learning mapping

As we have seen, active learning is characterized by starting with a small set of labeled data and applying query strategies to the unlabeled data so that instances are incrementally added to the labeled set. This is performed over multiple iterations, a batch at a time. The number of iterations and the batch size are hyper-parameters of these techniques. The query strategy and the choice of supervised learning method used to train on the growing set of labeled instances are additional inputs.
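To make these moving parts concrete, the following is a minimal, self-contained sketch of a pool-based loop on toy one-dimensional data. The class, the nearest-centroid "model", and the boundary-distance query strategy are illustrative stand-ins, not part of JCLAL:

```java
import java.util.*;

public class ActiveLearningLoop {
    // Toy pool-based active learning: 1-D points, true label = (x > 0.5).
    // The "model" is a nearest-centroid rule over the labeled points.
    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Double> unlabeled = new ArrayList<>();
        for (int i = 0; i < 200; i++) unlabeled.add(rnd.nextDouble());
        List<double[]> labeled = new ArrayList<>();          // {x, label}
        labeled.add(new double[]{0.1, 0});                   // small seed set
        labeled.add(new double[]{0.9, 1});

        int iterations = 10, batchSize = 5;                  // hyper-parameters
        for (int it = 0; it < iterations && !unlabeled.isEmpty(); it++) {
            double c0 = centroid(labeled, 0), c1 = centroid(labeled, 1);
            // Query strategy: take the batch the model is least certain about,
            // i.e. the points closest to the boundary between the centroids.
            unlabeled.sort(Comparator.comparingDouble(
                (Double x) -> Math.abs(Math.abs(x - c0) - Math.abs(x - c1))));
            for (int b = 0; b < batchSize && !unlabeled.isEmpty(); b++) {
                double x = unlabeled.remove(0);
                double y = x > 0.5 ? 1 : 0;                  // oracle labels it
                labeled.add(new double[]{x, y});
            }
        }
        System.out.println("labeled set size: " + labeled.size());
        // prints "labeled set size: 52" (2 seeds + 10 iterations x 5 queries)
    }

    static double centroid(List<double[]> data, double label) {
        return data.stream().filter(p -> p[1] == label)
                   .mapToDouble(p -> p[0]).average().orElse(label);
    }
}
```

Each iteration scores the pool with the current model, has the oracle label the most informative batch, and retrains on the enlarged labeled set; the iteration count and batch size are exactly the hyper-parameters named above.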

Data collection

As before, we will use an existing dataset available from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Abalone). The original owners of the database are the Department of Primary Industry and Fisheries in Tasmania, Australia.

The data types and descriptions of the attributes accompany the data and are reproduced in Table 5. The class attribute, Rings, has 29 distinct classes:

| Name | Data type | Measurement units | Description |
|---|---|---|---|
| Sex | nominal | M, F, and I (infant) | sex of specimen |
| Length | continuous | mm | longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | count | +1.5 gives the age in years |

Table 5. Abalone dataset features

Data sampling and transformation

For this experiment, we treated a randomly selected 4,155 records as unlabeled and kept the remaining 22 as labeled. No transformation of the data was performed.
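The abalone data contains 4,177 rows in total, so holding out 4,155 of them as unlabeled leaves a labeled seed set of 22 instances. A stdlib-only sketch of such a seeded random split over row indices (the class and method names are illustrative, not part of JCLAL):

```java
import java.util.*;

public class LabeledUnlabeledSplit {
    /** Shuffle the row indices with a fixed seed and split them into a
     *  small labeled seed set and an unlabeled pool. */
    static List<List<Integer>> split(int total, int labeledSize, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < total; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));          // deterministic
        List<Integer> labeled = new ArrayList<>(idx.subList(0, labeledSize));
        List<Integer> unlabeled = new ArrayList<>(idx.subList(labeledSize, total));
        return List.of(labeled, unlabeled);
    }

    public static void main(String[] args) {
        // Abalone has 4,177 rows; keep a 22-instance labeled seed set.
        List<List<Integer>> parts = split(4177, 22, 1L);
        System.out.println(parts.get(0).size() + " labeled, "
                + parts.get(1).size() + " unlabeled");
        // prints "22 labeled, 4155 unlabeled"
    }
}
```

The selected indices would then be used to write the two ARFF files (`abalone-labeled.arff` and `abalone-unlabeled.arff`) that the JCLAL configuration points at.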

Feature analysis and dimensionality reduction

With only eight features, there is no need for dimensionality reduction. The dataset comes with some statistics on the features, reproduced in Table 6:

 

|  | Length | Diameter | Height | Whole | Shucked | Viscera | Shell | Rings |
|---|---|---|---|---|---|---|---|---|
| Min | 0.075 | 0.055 | 0 | 0.002 | 0.001 | 0.001 | 0.002 | 1 |
| Max | 0.815 | 0.65 | 1.13 | 2.826 | 1.488 | 0.76 | 1.005 | 29 |
| Mean | 0.524 | 0.408 | 0.14 | 0.829 | 0.359 | 0.181 | 0.239 | 9.934 |
| SD | 0.12 | 0.099 | 0.042 | 0.49 | 0.222 | 0.11 | 0.139 | 3.224 |
| Correl (with Rings) | 0.557 | 0.575 | 0.557 | 0.54 | 0.421 | 0.504 | 0.628 | 1 |

Table 6. Summary statistics by feature

Models, results, and evaluation

We conducted two sets of experiments. The first used pool-based scenarios and the second, stream-based. In each set, we used query strategies including entropy sampling, least confident sampling, margin sampling, KL-divergence sampling, and vote entropy sampling. The classifiers used were Naïve Bayes, Logistic Regression, and J48 (an implementation of C4.5). For every experiment, 100 iterations were run, with batch sizes of 1 and 10. In Tables 7 through 12, we present a subset of these results, specifically, pool-based and stream-based scenarios for each sampling method using the Naïve Bayes, Simple Logistic, and C4.5 classifiers with a batch size of 10.
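For reference, the single-label uncertainty scores behind entropy, least confident, and margin sampling can be computed directly from a classifier's class-posterior distribution. A minimal stdlib sketch (the class and method names are illustrative, not JCLAL's API):

```java
import java.util.Arrays;

public class UncertaintyMeasures {
    // Entropy of a class-posterior distribution: -sum p * log2(p).
    static double entropy(double[] p) {
        double h = 0;
        for (double v : p) if (v > 0) h -= v * Math.log(v) / Math.log(2);
        return h;
    }

    // Least-confident score: 1 minus the probability of the top class.
    static double leastConfident(double[] p) {
        return 1 - Arrays.stream(p).max().getAsDouble();
    }

    // Margin: gap between the two most probable classes
    // (a smaller margin means a more uncertain instance).
    static double margin(double[] p) {
        double[] s = p.clone();
        Arrays.sort(s);
        return s[s.length - 1] - s[s.length - 2];
    }

    public static void main(String[] args) {
        double[] confident = {0.9, 0.05, 0.05};
        double[] uncertain = {0.4, 0.35, 0.25};
        System.out.printf("entropy: %.3f vs %.3f%n",
                entropy(confident), entropy(uncertain));
        System.out.printf("least confident: %.3f vs %.3f%n",
                leastConfident(confident), leastConfident(uncertain));
        System.out.printf("margin: %.3f vs %.3f%n",
                margin(confident), margin(uncertain));
    }
}
```

In a pool-based scenario, these scores are computed for every unlabeled instance and the top batch is queried; in a stream-based scenario, each arriving instance is scored once and queried or discarded.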

The JCLAL library requires an XML configuration file that specifies the scenario, the query strategy, the batch size, the maximum number of iterations, and the base classifier. The following is an example configuration:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<experiment>
    <process evaluation-method-type="net.sf.jclal.evaluation.method.RealScenario">
        <file-labeled>datasets/abalone-labeled.arff</file-labeled>
        <file-unlabeled>datasets/abalone-unlabeled.arff</file-unlabeled>
        <algorithm type="net.sf.jclal.activelearning.algorithm.ClassicalALAlgorithm">
            <stop-criterion type="net.sf.jclal.activelearning.stopcriteria.MaxIteration">
                <max-iteration>10</max-iteration>
            </stop-criterion>
            <stop-criterion type="net.sf.jclal.activelearning.stopcriteria.UnlabeledSetEmpty"/>
            <listener type="net.sf.jclal.listener.RealScenarioListener">
                <informative-instances>reports/real-scenario-informative-data.txt</informative-instances>
            </listener>
            <scenario type="net.sf.jclal.activelearning.scenario.PoolBasedSamplingScenario">
                <batch-mode type="net.sf.jclal.activelearning.batchmode.QBestBatchMode">
                    <batch-size>1</batch-size>
                </batch-mode>
                <oracle type="net.sf.jclal.activelearning.oracle.ConsoleHumanOracle"/>
                <query-strategy type="net.sf.jclal.activelearning.singlelabel.querystrategy.EntropySamplingQueryStrategy">
                    <wrapper-classifier type="net.sf.jclal.classifier.WekaClassifier">
                        <classifier type="weka.classifiers.bayes.NaiveBayes"/>
                    </wrapper-classifier>
                </query-strategy>
            </scenario>
        </algorithm>
    </process>
</experiment>

The tool itself is invoked as follows:

java -jar jclal-<version>.jar -cfg <config-file>

Pool-based scenarios

In the following three tables, we compare results for pool-based scenarios using the Naïve Bayes, Simple Logistic, and C4.5 classifiers. The best value in each column is marked with (1).

Naïve Bayes:

| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-NaiveBayes-b10 | 0.6021 | 0.1032 | 0.0556(1) | 0.1805 | 0.1304 |
| PoolBased-KLDivergence-NaiveBayes-b10 | 0.6639(1) | 0.1441(1) | 0.0563 | 0.1765 | 0.1504 |
| PoolBased-LeastConfidentSampling-NaiveBayes-b10 | 0.6406 | 0.1300 | 0.0827 | 0.1835(1) | 0.1810(1) |
| PoolBased-VoteEntropy-NaiveBayes-b10 | 0.6639(1) | 0.1441(1) | 0.0563 | 0.1765 | 0.1504 |

Table 7. Performance of pool-based scenario using Naïve Bayes classifier

Logistic Regression:

| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-SimpleLogistic-b10 | 0.6831 | 0.1571 | 0.1157 | 0.1651 | 0.2185(1) |
| PoolBased-KLDivergence-SimpleLogistic-b10 | 0.7175(1) | 0.1616 | 0.1049 | 0.2117(1) | 0.2065 |
| PoolBased-LeastConfidentSampling-SimpleLogistic-b10 | 0.6629 | 0.1392 | 0.1181(1) | 0.1751 | 0.1961 |
| PoolBased-VoteEntropy-SimpleLogistic-b10 | 0.6959 | 0.1634(1) | 0.0895 | 0.2307 | 0.1880 |

Table 8. Performance of pool-based scenario using Logistic Regression classifier

C4.5:

| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| PoolBased-EntropySampling-J48-b10 | 0.6730(1) | 0.3286(1) | 0.0737 | 0.3432(1) | 0.3278(1) |
| PoolBased-KLDivergence-J48-b10 | 0.6686 | 0.2979 | 0.0705(1) | 0.3153 | 0.2955 |
| PoolBased-LeastConfidentSampling-J48-b10 | 0.6591 | 0.3094 | 0.0843 | 0.3124 | 0.3227 |
| PoolBased-VoteEntropy-J48-b10 | 0.6686 | 0.2979 | 0.0706 | 0.3153 | 0.2955 |

Table 9. Performance of pool-based scenario using C4.5 classifier

Stream-based scenarios

In the following three tables, we present results for stream-based scenarios using the Naïve Bayes, Logistic Regression, and C4.5 classifiers with four different sampling methods.

Naïve Bayes:

| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-NaiveBayes-b10 | 0.6673(1) | 0.1432(1) | 0.0563 | 0.1842(1) | 0.1480 |
| StreamBased-LeastConfidentSampling-NaiveBayes-b10 | 0.5585 | 0.0923 | 0.1415 | 0.1610 | 0.1807(1) |
| StreamBased-MarginSampling-NaiveBayes-b10 | 0.6736(1) | 0.1282 | 0.0548(1) | 0.1806 | 0.1475 |
| StreamBased-VoteEntropyQuery-NaiveBayes-b10 | 0.5585 | 0.0923 | 0.1415 | 0.1610 | 0.1807(1) |

Table 10. Performance of stream-based scenario using Naïve Bayes classifier

Logistic Regression:

| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-SimpleLogistic-b10 | 0.7343(1) | 0.1994(1) | 0.0871 | 0.2154 | 0.2185(1) |
| StreamBased-LeastConfidentSampling-SimpleLogistic-b10 | 0.7068 | 0.1750 | 0.0906 | 0.2324(1) | 0.2019 |
| StreamBased-MarginSampling-SimpleLogistic-b10 | 0.7311 | 0.1994(1) | 0.0861 | 0.2177 | 0.2140 |
| StreamBased-VoteEntropy-SimpleLogistic-b10 | 0.5506 | 0.0963 | 0.0667(1) | 0.1093 | 0.1117 |

Table 11. Performance of stream-based scenario using Logistic Regression classifier

C4.5:

| Experiment | Area Under ROC | F Measure | False Positive Rate | Precision | Recall |
|---|---|---|---|---|---|
| StreamBased-EntropySampling-J48-b10 | 0.6648 | 0.3053 | 0.0756 | 0.3189(1) | 0.3032 |
| StreamBased-LeastConfidentSampling-J48-b10 | 0.6748(1) | 0.3064(1) | 0.0832 | 0.3128 | 0.3189(1) |
| StreamBased-MarginSampling-J48-b10 | 0.6660 | 0.2998 | 0.0728(1) | 0.3163 | 0.2967 |
| StreamBased-VoteEntropy-J48-b10 | 0.4966 | 0.0627 | 0.0742 | 0.1096 | 0.0758 |

Table 12. Performance of stream-based scenario using C4.5 classifier

Analysis of active learning results

It is quite interesting to see that pool-based Query By Committee (an ensemble method) using KL-divergence sampling does very well across most classifiers. As discussed earlier, these methods have a theoretical guarantee of reducing error by maintaining a large hypothesis space, and this experimental result supports that claim empirically.
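A Query By Committee strategy with KL-divergence scores an instance by how much each committee member's posterior diverges from the committee's consensus. A minimal stdlib sketch of that disagreement measure (the class and method names are illustrative, not JCLAL's API):

```java
import java.util.Arrays;

public class CommitteeDisagreement {
    // KL divergence D(p || q) in bits.
    static double kl(double[] p, double[] q) {
        double d = 0;
        for (int i = 0; i < p.length; i++)
            if (p[i] > 0) d += p[i] * Math.log(p[i] / q[i]) / Math.log(2);
        return d;
    }

    // Mean KL divergence of each committee member's posterior from the
    // consensus (average) posterior; larger means more disagreement.
    static double klToMean(double[][] committee) {
        int k = committee.length, c = committee[0].length;
        double[] mean = new double[c];
        for (double[] m : committee)
            for (int i = 0; i < c; i++) mean[i] += m[i] / k;
        double sum = 0;
        for (double[] m : committee) sum += kl(m, mean);
        return sum / k;
    }

    public static void main(String[] args) {
        double[][] agree = {{0.8, 0.2}, {0.8, 0.2}, {0.8, 0.2}};
        double[][] disagree = {{0.9, 0.1}, {0.1, 0.9}, {0.5, 0.5}};
        System.out.printf("agreement: %.4f, disagreement: %.4f%n",
                klToMean(agree), klToMean(disagree));
    }
}
```

Instances on which the committee disagrees most lie where the current hypothesis space is still largest, which is why querying them shrinks it fastest.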

Pool-based entropy sampling using C4.5 as the classifier has the highest precision, recall, and F-measure. With stream-based entropy sampling and C4.5, the metrics are similarly high. The other pool-based sampling techniques with C4.5, namely KL-divergence, least confident, and vote entropy, also score significantly higher than the other classifiers. Thus, this can be attributed more strongly to the underlying C4.5 classifier's ability to find non-linear patterns.

The Logistic Regression algorithm performs very well in both stream-based and pool-based scenarios when considering AUC. This may be largely due to the fact that logistic regression produces well-calibrated class probabilities, an important factor in achieving good AUC scores.
