Although active learning has many similarities with semi-supervised learning, it takes its own distinctive approach to modeling datasets that contain both labeled and unlabeled data. It has roots in basic human psychology: asking more questions often helps solve problems.
The main idea behind active learning is that if the learner gets to pick the instances to learn from, rather than being handed labeled data, it can learn more effectively with less data (References [6]). Starting with a very small amount of labeled data, it carefully picks instances from the unlabeled data, requests their labels, and uses them to iteratively improve learning. This basic approach of querying for labels of unlabeled data from a so-called oracle (an expert in the domain) distinguishes active learning from semi-supervised or passive learning. The following figure illustrates the difference and the iterative process involved:
The dataset D, which represents all the data instances and their labels, is given by D = {(x1, y1), (x2, y2), … (xn, yn)}, where {x1, x2, … xn} are the individual instances of data and {y1, y2, … yn} is the set of associated labels. D is composed of two sets: L, the labeled data, and U, the unlabeled data. x denotes the set {x1, x2, … xn} of data instances without their labels.
The dataset L consists of all labeled data with known outcomes {y1, y2, … yl}, whereas U is the dataset where the outcomes are not known. As before, |U| >> |L|.
Active learning scenarios can be broadly classified as:
Regardless of the scenario involved, every active learning approach includes selecting a query strategy or sampling method which establishes the mechanism for picking the queries in each iteration. Each method reveals a distinct way of seeking out unlabeled examples with the best information content for improving the learning process. In the following subsections, we describe the major query strategy frameworks, how they work, their merits and limitations, and the different strategies within each framework.
The key idea behind this form of sampling is to select instances from the unlabeled pool that the current model is most uncertain about. The learner can then avoid the instances the model is already certain or confident about classifying (References [8]).
Probabilistic models (Naïve Bayes, Logistic Regression, and so on) are the most natural choices for such methods: for a given model θ and a data instance x, they provide a confidence measure for each class yi, i ϵ classes, in the form of the posterior probability Pθ(yi|x).
The general process for all uncertainty-based algorithms is outlined as follows:
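As a hedged sketch, the iterative loop common to these algorithms might look as follows in Python. The callables `train`, `uncertainty`, and `oracle` are hypothetical placeholders, not part of any particular library:

```python
# Minimal sketch of the generic uncertainty-based active learning loop.
# `train`, `uncertainty`, and `oracle` are assumed, user-supplied callables.
def active_learning_loop(L, U, train, uncertainty, oracle, n_queries):
    """L: list of (x, y) labeled pairs; U: list of unlabeled instances."""
    for _ in range(n_queries):
        model = train(L)                                 # 1. fit on labeled data
        x = max(U, key=lambda u: uncertainty(model, u))  # 2. pick the most uncertain instance
        U.remove(x)                                      # 3. move it out of U ...
        L.append((x, oracle(x)))                         # 4. ... into L with the oracle's label
    return train(L)                                      # final model on the enlarged L
```

Any model with a fit procedure and any uncertainty score (least confident, margin, entropy) can be plugged into this skeleton.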
Some of the most common uncertainty sampling strategies for selecting informative instances from the data are given next.
In this technique, the data instances are sorted in ascending order of the model's confidence, and the instances most likely to be queried or selected are the ones the model is least confident about. The idea is that the least confident instances are the ones near the margin, or separating hyperplane, and getting their labels is the most effective way to learn the decision boundary.
This can be formulated as x* = argmaxx (1 − Pθ(ŷ|x)), where ŷ = argmaxy Pθ(y|x) is the label the model considers most likely for x.
The disadvantage of this method is that it effectively considers information about the most probable label only; information in the rest of the posterior distribution is not used.
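A minimal sketch of least-confident selection, assuming the model's posteriors are already available as a NumPy array of shape (n_samples, n_classes):

```python
import numpy as np

def least_confident(probs):
    """Return the index of the instance whose top-class posterior is lowest.

    probs: (n_samples, n_classes) array of model posterior estimates.
    """
    top = probs.max(axis=1)        # confidence in each instance's most likely label
    return int(np.argmin(top))     # least confident instance wins the query

probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.70, 0.30]])
# the second instance (index 1) is nearest the decision boundary
```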
The hypothesis space H is the set of all models that generalize or explain the training data; for example, all possible sets of weights that separate two linearly separable classes. A version space V is the subset of H consistent with the training data, as defined by Tom Mitchell (References [15]): V = {h ϵ H | h(xi) = yi for all (xi, yi) ϵ L}.
The idea behind this sampling is to query instances from the unlabeled dataset that reduce the size of the version space or minimize |V|.
QBD is one of the earliest algorithms of this kind. It works by maintaining a version space V; when two hypotheses disagree on the label of a new incoming instance, that instance is selected for labeling by the oracle or expert.
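A hedged sketch of this disagreement-based selection: the two threshold classifiers below are hypothetical stand-ins for two hypotheses in the version space (in practice they would be derived from the labeled data L):

```python
# Two assumed hypotheses that are both consistent with the labeled data:
def h_general(x):
    return x > 0.2   # a very general boundary

def h_specific(x):
    return x > 0.8   # a very specific boundary

# Stream of incoming unlabeled instances.
stream = [0.1, 0.5, 0.9]

# Query the oracle only where the hypotheses disagree (the region 0.2..0.8).
queried = [x for x in stream if h_general(x) != h_specific(x)]
# instances where both hypotheses agree can be labeled without the oracle
```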
The entire algorithm can be summarized as follows:
For this algorithm:
There are two tasks involved, training the committee of learners and choosing a disagreement measure, and each has various choices.
The different models can be trained either on different samples drawn from L, or by using ensemble methods such as boosting and bagging.
Vote entropy is one of the methods chosen as the disagreement metric for ranking. It can be represented mathematically as:

x*VE = argmaxx − Σi (V(yi)/|C|) log (V(yi)/|C|)
Here, V(yi) is the number of votes given to the label yi out of all possible labels, and |C| is the size of the committee.
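A minimal sketch of vote entropy for a single instance, where `votes` holds one predicted label per committee member:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Vote entropy disagreement: votes is a list of labels, one per member."""
    size = len(votes)                             # |C|
    return -sum((c / size) * math.log(c / size)   # -sum V(yi)/|C| * log(V(yi)/|C|)
                for c in Counter(votes).values())
```

A unanimous committee yields entropy 0; an evenly split committee yields the maximum value.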
Kullback-Leibler (KL) divergence is an information-theoretic measure of the divergence between two probability distributions. The disagreement is quantified as the average divergence of each committee member's prediction from that of the consensus of the committee C:
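This measure can be sketched as follows, assuming `member_probs` is a (|C|, n_classes) array holding one posterior distribution per committee member for a single unlabeled instance (strictly positive probabilities are assumed, so the logarithm is well defined):

```python
import numpy as np

def kl_disagreement(member_probs):
    """Average KL divergence of each member's posterior from the consensus.

    member_probs: (|C|, n_classes) array, one distribution per member.
    """
    consensus = member_probs.mean(axis=0)                       # consensus P_C(y|x)
    kl = (member_probs * np.log(member_probs / consensus)).sum(axis=1)
    return kl.mean()                                            # average over the committee
```

Members with identical posteriors give a disagreement of zero; the more their distributions diverge from the consensus, the higher the score and the more informative the instance.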
The advantages and limitations are the following:
The previous methods selected the best instances from the unlabeled set either based on the uncertainty the samples posed to the model or by reducing the size of the hypothesis space. Neither of these methods considered what is best for the model itself. The idea behind data distribution sampling is that adding samples that reduce the model's expected error improves its predictions on unseen instances (References [13 and 14]).
There are different ways to determine the best sample for a given model, and we describe each in some detail here.
The idea behind this is to select examples from the unlabeled set that will bring the maximum change in the model. A common formulation is to choose the instance that, if labeled and added to the training set, would produce the largest expected gradient of the training objective:

x*EMC = argmaxx Σi Pθ(yi|x) ||∇ℓθ(L ∪ <x, yi>)||

Here, Pθ(yi|x) gives the expectation over the possible labels of x, and ∇ℓθ(L ∪ <x, yi>) is the gradient of the objective function if x with hypothesized label yi were added to the labeled set.
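A hedged sketch of expected model change for a binary logistic model (an assumption made purely for illustration): each hypothetical label of x is weighted by its posterior, and we measure the log-loss gradient norm that adding <x, y> would contribute:

```python
import numpy as np

def expected_gradient_norm(w, x):
    """Expected gradient length of x under a binary logistic model with weights w."""
    p1 = 1.0 / (1.0 + np.exp(-w @ x))          # P(y=1 | x) under the current model
    total = 0.0
    for y, p in ((0, 1.0 - p1), (1, p1)):      # expectation over hypothetical labels
        grad = (p1 - y) * x                    # log-loss gradient contribution of <x, y>
        total += p * np.linalg.norm(grad)
    return total
```

The candidate with the largest value would be queried, on the grounds that it is expected to change the model the most.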
Here, the approach is to select examples from the unlabeled set that reduce the model's generalization error the most. The generalization error is measured over the unlabeled set using expected labels:

x*EER = argminx Σi Pθ(yi|x) Σx' ϵ U Hθ+(y|x')

Here, Pθ(yi|x) gives the expectation over the labels of x, and the inner term is the sum, over the unlabeled instances x', of the entropy of the predictions of the model θ+ obtained by retraining with <x, yi> added.
The general equation for estimating the out-of-sample error in terms of noise, bias, and variance is given by the following:

ET[(G(x) − y)2 | x] = E[(y − E[y|x])2] + (ET[G(x)] − E[y|x])2 + ET[(G(x) − ET[G(x)])2]

The first term is the noise, the second the squared bias, and the third the variance of the model over training sets T.
Here, G(x) is the model's prediction for the instance x and y is its true label. In variance reduction, we select examples from the unlabeled set that most reduce the variance of the model:
Here, θ+ represents the model after it has been retrained with the new point x' and its label y'.
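A small Monte Carlo sanity check of the noise + bias2 + variance decomposition above, using a deliberately simple, hypothetical setup: the "model" G(x) is the mean of a noisy training sample drawn around a fixed true value:

```python
import numpy as np

rng = np.random.default_rng(0)
f_x, sigma = 2.0, 0.5                  # true value at a fixed x, noise std dev
n_trials, n_train = 20000, 10

# Train the sample-mean "model" on many independent training draws.
preds = np.empty(n_trials)
for t in range(n_trials):
    train_y = f_x + sigma * rng.standard_normal(n_train)
    preds[t] = train_y.mean()          # G(x) for this training set

# Out-of-sample squared error against fresh noisy labels.
test_y = f_x + sigma * rng.standard_normal(n_trials)
mse = np.mean((preds - test_y) ** 2)

noise = sigma ** 2                     # irreducible label noise
bias2 = (preds.mean() - f_x) ** 2      # squared bias of G(x)
variance = preds.var()                 # variance of G(x) across training sets
# mse should come out close to noise + bias2 + variance
```

Since the label noise is irreducible, variance reduction targets the only term the query strategy can influence.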
The advantages and limitations are as follows: