Active learning

Although active learning has many similarities with semi-supervised learning, it takes its own distinctive approach to modeling with datasets containing both labeled and unlabeled data. It has roots in basic human psychology: asking more questions often helps to solve problems.

The main idea behind active learning is that if the learner gets to pick the instances to learn from, rather than being handed labeled data, it can learn more effectively with less data (References [6]). Starting with a very small amount of labeled data, it carefully picks instances from the unlabeled data, queries for their labels, and uses the answers to iteratively improve learning. This basic approach of querying unlabeled data to get labels from a so-called oracle, an expert in the domain, distinguishes active learning from semi-supervised or passive learning. The following figure illustrates the difference and the iterative process involved:


Figure 7. Active Machine Learning process contrasted with Supervised and Semi-Supervised Learning process.

Representation and notation

The dataset D, which represents all the data instances and their labels, is given by D = {(x1, y1), (x2, y2), … (xn, yn)}, where x = {x1, x2, … xn} are the individual instances of data and {y1, y2, … yn} is the set of associated labels. D is composed of two sets: the labeled data L and the unlabeled data U.

The dataset L = {(x1, y1), (x2, y2), … (xl, yl)} consists of all labeled data with known outcomes {y1, y2, … yl}, whereas U = {xl+1, xl+2, … xn} is the dataset whose outcomes are not known. As before, |U| ≫ |L|.
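
To keep the later sketches concrete, this split can be written in code. The following is a minimal sketch; the array names, sizes, and the random stand-in data are illustrative assumptions, not part of the original notation.

```python
import numpy as np

# Stand-in data: n = 1000 instances with 5 features each and binary outcomes
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)

n_labeled = 50                                        # |L| << |U|
X_labeled, y_labeled = X[:n_labeled], y[:n_labeled]   # L: instances with known outcomes
X_unlabeled = X[n_labeled:]                           # U: instances whose outcomes are unknown
```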

Active learning scenarios

Active learning scenarios can be broadly classified as:

  • Stream-based active learning: In this approach, instances are drawn one at a time from the unlabeled dataset, and a decision is made whether to discard each instance or pass it on to the oracle to get its label (References [10, 11]).
  • Pool-based active learning: In this approach, instances are queried from the unlabeled dataset, ranked on the basis of informativeness, and a set of the top-ranked instances is sent to the oracle to get their labels (References [12]).
  • Query synthesis: In this method, the learner has information only about the input space (the features) and synthesizes new instances to query the oracle for their membership (that is, their labels). It is rarely used in practical applications because it usually does not take the data-generating distribution into account, and hence the synthesized queries are often arbitrary or meaningless.

Active learning approaches

Regardless of the scenario involved, every active learning approach includes selecting a query strategy or sampling method which establishes the mechanism for picking the queries in each iteration. Each method reveals a distinct way of seeking out unlabeled examples with the best information content for improving the learning process. In the following subsections, we describe the major query strategy frameworks, how they work, their merits and limitations, and the different strategies within each framework.

Uncertainty sampling

The key idea behind this form of sampling is to select the instances from the unlabeled pool that the current model is most uncertain about. The learner can then avoid querying the instances the model is already certain or confident about classifying (References [8]).

Probabilistic models (Naïve Bayes, Logistic Regression, and so on) are the most natural choices for such methods, as they give confidence measures for a given model θ on the data x: for each class yi, i ∈ classes, the probability Pθ(yi | x) is the posterior probability of that class.

How does it work?

The general process for all uncertainty-based algorithms is outlined as follows; a minimal code sketch of the loop is given after the list:

  1. Initialize the data as labeled, L, and unlabeled, U.
  2. While there is still unlabeled data:
    1. Train the classifier model f with the labeled data L.
    2. Apply the classifier model f on the unlabeled data U to assess informativeness using one of the sampling mechanisms (see the next section).
    3. Choose the k most informative data points from U as the set Lu to get labels from the oracle.
    4. Augment the labeled data with the k new labeled data points obtained in the previous step: L = L ∪ Lu, and remove them from U.
  3. Repeat all the steps under 2.
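
The following is a minimal sketch of this loop in Python for the pool-based scenario, assuming a scikit-learn style classifier and a hypothetical query_oracle function that returns the true labels of the instances passed to it; least confidence (described next) is used as the informativeness measure, but any of the strategies below can be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_unlabeled, query_oracle, k=10, budget=100):
    """Pool-based uncertainty sampling sketch using least confidence."""
    model = LogisticRegression(max_iter=1000)
    queried = 0
    while len(X_unlabeled) > 0 and queried < budget:
        model.fit(X_labeled, y_labeled)                           # 2.1 train on L
        probs = model.predict_proba(X_unlabeled)                  # 2.2 assess informativeness on U
        informativeness = 1.0 - probs.max(axis=1)                 # least-confidence score
        top_k = np.argsort(informativeness)[-k:]                  # 2.3 pick the k most informative
        y_new = query_oracle(X_unlabeled[top_k])                  # ask the oracle for their labels
        X_labeled = np.vstack([X_labeled, X_unlabeled[top_k]])    # 2.4 L = L U Lu
        y_labeled = np.concatenate([y_labeled, y_new])
        X_unlabeled = np.delete(X_unlabeled, top_k, axis=0)       # remove Lu from U
        queried += k
    return model.fit(X_labeled, y_labeled)
```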

Some of the most common sampling algorithms used to select informative instances from the data are given next.

Least confident sampling

In this technique, the data instances are sorted in reverse order of confidence, so that the instances most likely to be queried or selected are the ones the model is least confident about. The idea behind this is that the least confident instances are the ones near the margin or separating hyperplane, and getting their labels is the most effective way to learn the decision boundaries.

This can be formulated as:

$$x_{LC}^{*} = \underset{x}{\operatorname{argmax}}\left(1 - P_{\theta}(\hat{y} \mid x)\right), \quad \text{where } \hat{y} = \underset{y}{\operatorname{argmax}}\, P_{\theta}(y \mid x)$$

The disadvantage of this method is that it effectively considers only the information of the most probable label; information about the rest of the posterior distribution is not used.

Smallest margin sampling

This is margin-based sampling, where the instances with smaller margins have more ambiguity than instances with larger margins.

This can be formulated as:

$$x_{M}^{*} = \underset{x}{\operatorname{argmin}}\left(P_{\theta}(\hat{y}_1 \mid x) - P_{\theta}(\hat{y}_2 \mid x)\right)$$

where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable labels for the instance x.

Label entropy sampling

Entropy, which measures the average information content of the data and is also used as an impurity measure, can be used to sample the instances. This can be formulated as:

$$x_{H}^{*} = \underset{x}{\operatorname{argmax}}\left(-\sum_{i} P_{\theta}(y_i \mid x) \log P_{\theta}(y_i \mid x)\right)$$
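
As an illustration, all three uncertainty measures just described can be computed directly from a classifier's predicted class probabilities. The sketch below assumes probs is an (n_instances, n_classes) array of posterior probabilities, for example the output of a scikit-learn predict_proba call:

```python
import numpy as np

def least_confident(probs):
    # 1 - P(y_hat | x): largest for the instance whose best label is least certain
    return 1.0 - probs.max(axis=1)

def smallest_margin(probs):
    # difference between the two most probable labels; a smaller margin means more ambiguity
    ordered = np.sort(probs, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def label_entropy(probs):
    # -sum_i P(y_i | x) log P(y_i | x): average information content of the label distribution
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)
```

Instances are then ranked by the largest least-confidence or entropy scores, or by the smallest margin.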

Advantages and limitations

The advantages and limitations are:

  • Uncertainty sampling is the simplest approach and can work with any probabilistic classifier; this is its biggest advantage
  • The presence of outliers or wrong feedback can go unnoticed, and the models can degrade

Version space sampling

Hypothesis H is the set of all models that generalize or explain the training data; for example, all possible sets of weights that separate two linearly separable classes. Version spaces V are subsets of the hypothesis space H that are consistent with the training data, as defined by Tom Mitchell (References [15]), such that:

$$V = \{h \in H : h(x_i) = y_i \ \text{for all } (x_i, y_i) \in L\}$$

The idea behind this sampling is to query instances from the unlabeled dataset that reduce the size of the version space or minimize |V|.

Query by disagreement (QBD)

QBD is one of the earliest algorithms of this kind. It works by maintaining a version space V: when two hypotheses in V disagree on the label of new incoming data, that instance is selected for getting its label from the oracle or expert.

How does it work?

The entire algorithm can be summarized as follows; a toy code sketch is given after the list:

  1. Initialize V = H as the set of all legal hypotheses.
  2. Initialize the data as L labeled and U unlabeled.
  3. While there is data x' in U:
    1. If h1(x') ≠ h2(x') for any pair h1, h2 ∈ V:
      1. Query the label of x' and get y'.
      2. V = {h ∈ V : h(x') = y'}, that is, retain only the hypotheses that remain consistent with all labeled points.
    2. Otherwise:
      1. Ignore x'.
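
To make the procedure concrete, here is a toy, hypothetical sketch with a finite hypothesis space of one-dimensional threshold classifiers; the oracle lambda stands in for the human expert, and the hidden threshold 0.37 exists only to simulate its answers.

```python
import numpy as np

thresholds = np.linspace(0.0, 1.0, 101)            # hypothesis space H: h_t(x) = 1 if x >= t
V = list(thresholds)                               # version space, initially all of H

def predict(t, x):
    return int(x >= t)

oracle = lambda x: int(x >= 0.37)                  # hidden true concept, stands in for the expert
labeled = []

for x in np.random.rand(200):                      # stream of unlabeled instances
    predictions = {predict(t, x) for t in V}
    if len(predictions) > 1:                       # hypotheses in V disagree: query the oracle
        y = oracle(x)
        labeled.append((x, y))
        V = [t for t in V if predict(t, x) == y]   # keep only consistent hypotheses
    # otherwise the instance is ignored

print(f"queried {len(labeled)} labels; version space size is now {len(V)}")
```
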
Query by Committee (QBC)

Query by Committee overcomes Query by Disagreement's limitation of having to maintain the entire version space, by creating a committee of classifiers and using their votes as the mechanism to capture the disagreement (References [7]).

How does it work?

For this algorithm:

  1. Initialize the data as L labeled and U unlabeled.
  2. Train the committee of models C = {θ1, θ2, … θc} on the labeled data L (see the following).
  3. For all data x' in U:
    1. Get the vote or prediction of each committee member on x': {θ1(x'), θ2(x'), … θc(x')}.
    2. Rank the instances based on maximum disagreement (see the following).
    3. Choose the k most informative data points from U as the set Lu to get labels from the oracle.
    4. Augment the labeled data with the k new labeled data points: L = L ∪ Lu.
    5. Retrain the models {θ1, θ2, … θc} with the new L.

There are two design choices here: how to train the committee of learners and how to measure their disagreement. Each has several options.

The different models can be trained either on different samples drawn from L or by using ensemble methods such as bagging and boosting.
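
For example, a bagging-style committee can be built by training each member on a bootstrap sample of L. The following is a small sketch under that assumption, with decision trees as the base learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_committee(X_labeled, y_labeled, n_members=5, seed=0):
    """Bagging-style committee: each member is fit on a bootstrap sample of L."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_labeled), size=len(X_labeled))   # sample with replacement
        committee.append(DecisionTreeClassifier().fit(X_labeled[idx], y_labeled[idx]))
    return committee
```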

Vote entropy is one common choice of disagreement metric used to rank the instances. It can be represented mathematically as:

$$x_{VE}^{*} = \underset{x}{\operatorname{argmax}}\left(-\sum_{i} \frac{V(y_i)}{|C|} \log \frac{V(y_i)}{|C|}\right)$$

Here, V(yi) is the number of votes given to the label yi out of all possible labels, and |C| is the size of the committee.

Kullback-Leibler (KL) divergence is an information-theoretic measure of the divergence between two probability distributions. The disagreement is quantified as the average divergence of each committee member's prediction from that of the consensus of the committee C:

$$x_{KL}^{*} = \underset{x}{\operatorname{argmax}}\; \frac{1}{|C|} \sum_{c=1}^{|C|} D_{KL}\!\left(P_{\theta_c}(y \mid x) \,\middle\|\, P_{C}(y \mid x)\right), \quad \text{where } P_{C}(y \mid x) = \frac{1}{|C|}\sum_{c=1}^{|C|} P_{\theta_c}(y \mid x)$$
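
A small sketch of both disagreement measures for a single unlabeled instance, assuming the committee's outputs are available either as hard votes or as one predictive distribution per member (these input shapes are assumptions for illustration):

```python
import numpy as np

def vote_entropy(votes, n_labels):
    """votes: one hard label prediction per committee member for a single instance."""
    counts = np.bincount(votes, minlength=n_labels)
    p = counts / len(votes)                        # V(y_i) / |C|
    p = p[p > 0]                                   # 0 * log(0) is treated as 0
    return -np.sum(p * np.log(p))

def kl_disagreement(member_probs):
    """member_probs: (|C|, n_labels) array, one predictive distribution per member."""
    consensus = member_probs.mean(axis=0)          # P_C(y | x), the consensus distribution
    eps = 1e-12
    kl = np.sum(member_probs * np.log((member_probs + eps) / (consensus + eps)), axis=1)
    return kl.mean()                               # average divergence from the consensus
```

Instances with the highest vote entropy or the highest average KL divergence are the ones sent to the oracle.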

Advantages and limitations

The advantages and limitations are the following:

  • Simplicity and the fact that they can work with any supervised algorithm give these methods a great advantage.
  • There are theoretical guarantees of minimizing errors and generalizing under some conditions.
  • Query by Disagreement suffers from having to maintain a large number of valid hypotheses.
  • These methods still suffer from the issue of wrong feedback going unnoticed and models potentially degrading.

Data distribution sampling

The previous methods selected the best instances from the unlabeled set based either on the uncertainty the samples posed to the model or on reducing the size of the hypothesis space. Neither of these methods directly optimizes what is best for the model itself. The idea behind data distribution sampling is that adding the samples that most reduce the model's expected error improves its predictions on unseen instances (References [13, 14]).

How does it work?

There are different ways of finding the best sample for a given model, and here we describe each one in some detail.

Expected model change

The idea behind this is to select examples from the unlabeled set that will bring about the maximum change in the model:

$$x_{EMC}^{*} = \underset{x}{\operatorname{argmax}} \sum_{i} P_{\theta}(y_i \mid x)\, \left\lVert \nabla \ell_{\theta}\!\left(L \cup \langle x, y_i \rangle\right) \right\rVert$$

Here, Pθ(yi | x) gives the expectation over the possible labels of x, and ∇ℓθ(L ∪ ⟨x, yi⟩) is the gradient of the training objective with respect to the model parameters when the candidate ⟨x, yi⟩ is added to the labeled set; the instance with the largest expected gradient length is chosen.
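
For binary logistic regression this has a convenient approximation: if we assume the gradient of the objective on the current labeled set is close to zero at the fitted solution, the change caused by adding ⟨x, y⟩ is roughly the single-point gradient (p − y)·x, so the expected gradient length reduces to 2p(1 − p)·‖x‖. A hypothetical sketch:

```python
import numpy as np

def expected_gradient_length(model, X_pool):
    """Approximate expected-model-change scores for a fitted binary logistic regression.

    Assumes the gradient on the current labeled set is ~0, so adding <x, y>
    contributes roughly the single-point gradient (p - y) * x.
    """
    p = model.predict_proba(X_pool)[:, 1]          # P(y = 1 | x) under the current model
    # E_y ||(p - y) x|| = p * (1 - p) * ||x|| + (1 - p) * p * ||x|| = 2 p (1 - p) ||x||
    return 2.0 * p * (1.0 - p) * np.linalg.norm(X_pool, axis=1)
```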

Expected error reduction

Here, the approach is to select examples from the unlabeled set that reduce the model's generalization error the most. The generalization error is measured over the unlabeled set using the expected labels:

$$x_{EER}^{*} = \underset{x}{\operatorname{argmin}} \sum_{i} P_{\theta}(y_i \mid x) \sum_{x' \in U} H_{\theta^{+}}\!\left(y \mid x'\right)$$

Here, Pθ(yi | x) gives the expectation over the possible labels of x, and the inner sum is the total entropy over the unlabeled instances x' under the model θ+ obtained after retraining with ⟨x, yi⟩ included.
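
A brute-force sketch of this idea for scoring one candidate instance, assuming a scikit-learn style probabilistic classifier. It is expensive by construction: the model is retrained once per candidate and per possible label, which is the main practical limitation noted below.

```python
import numpy as np
from sklearn.base import clone

def expected_error_reduction(model, X_labeled, y_labeled, X_pool, candidate_idx):
    """Lower score = adding this candidate is expected to leave less entropy over U."""
    x = X_pool[candidate_idx:candidate_idx + 1]
    rest = np.delete(X_pool, candidate_idx, axis=0)               # remaining unlabeled instances
    p_current = model.predict_proba(x)[0]                         # P_theta(y_i | x)
    score = 0.0
    for y_i, p_y in zip(model.classes_, p_current):
        retrained = clone(model).fit(np.vstack([X_labeled, x]),
                                     np.append(y_labeled, y_i))   # theta+ with <x, y_i> added
        probs = retrained.predict_proba(rest)
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # H_theta+(y | x')
        score += p_y * entropy.sum()                              # weight by P_theta(y_i | x)
    return score
```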

Variance reduction

The general decomposition of the expected out-of-sample error into noise, bias, and variance terms is given by the following:

$$E_T\!\left[\left(\hat{y} - y\right)^2 \mid x\right] = E\!\left[\left(y - E[y \mid x]\right)^2\right] + \left(E_L[\hat{y}] - E[y \mid x]\right)^2 + E_L\!\left[\left(\hat{y} - E_L[\hat{y}]\right)^2\right]$$

Here, ŷ = G(x) is the model's prediction for the instance x and y is its true label; the three terms are the noise, the squared bias, and the variance. In variance reduction, we select examples from the unlabeled set that most reduce the variance of the model:

$$x_{VR}^{*} = \underset{x}{\operatorname{argmin}} \sum_{x' \in U} \sigma_{\theta^{+}}^{2}\!\left(\hat{y} \mid x'\right)$$

Here, θ+ represents the model after it has been retrained with the new point x' and its label y'.

Density weighted methods

In this approach, we select examples from the unlabeled set that are not only informative but also representative of the underlying input distribution, that is, instances that have high average similarity to the other unlabeled instances.

This can be represented as follows:

$$x_{ID}^{*} = \underset{x}{\operatorname{argmax}}\; H_{\theta}(y \mid x) \times \left(\frac{1}{|U|} \sum_{x' \in U} \mathrm{sim}(x, x')\right)^{\beta}$$

Here, sim(x, x') is the density or similarity term, Hθ(y | x) is the base utility measure (for example, the label entropy), and β controls the relative importance of the density term.
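
A small sketch of this weighting, assuming the base utility (for example, the label entropy from the uncertainty sampling section) has already been computed for each pool instance, and using cosine similarity for the sim(x, x') term; the beta exponent and its default value are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density(base_utility, X_pool, beta=1.0):
    """Weight each instance's utility by its average similarity to the rest of the pool."""
    sim = cosine_similarity(X_pool)                # sim(x, x') for every pair of pool instances
    density = sim.mean(axis=1)                     # average similarity per instance
    return base_utility * density ** beta          # high = both informative and representative
```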

Advantages and limitations

The advantages and limitations are as follows:

  • The biggest advantage is that they work directly on the model through an explicit optimization objective, rather than through the implicit or indirect criteria of the methods described before
  • These methods can work on pool- or stream-based scenarios
  • These methods have some theoretical guarantees on the bounds and generalizations
  • The biggest disadvantage of these methods is computational cost and difficulty in implementation