Although active learning has many similarities with semi-supervised learning, it takes its own distinctive approach to modeling datasets that contain both labeled and unlabeled data. It has roots in basic human psychology: asking more questions often helps solve problems.
The main idea behind active learning is that if the learner gets to pick the instances to learn from, rather than being handed labeled data, it can learn more effectively with less data (References [6]). Starting with a very small amount of labeled data, it carefully picks instances from the unlabeled data, requests their labels, and uses them to iteratively improve learning. This basic approach of querying for labels of unlabeled data from a so-called oracle (an expert in the domain) distinguishes active learning from semi-supervised or passive learning. The following figure illustrates the difference and the iterative process involved:
The dataset D, which represents all the data instances and their labels, is given by D = {(x1, y1), (x2, y2), … (xn, yn)}, where {x1, x2, … xn} are the individual instances of data and {y1, y2, … yn} is the set of associated labels. D is composed of two sets: L, the labeled data, and U, the unlabeled data. x denotes the set {x1, x2, … xn} of data instances without their labels.
The dataset L consists of all labeled data with known outcomes {y1, y2, … yl}, whereas U is the dataset where the outcomes are not known. As before, |U| >> |L|.
Active learning scenarios can be broadly classified as:
Regardless of the scenario involved, every active learning approach includes selecting a query strategy or sampling method which establishes the mechanism for picking the queries in each iteration. Each method reveals a distinct way of seeking out unlabeled examples with the best information content for improving the learning process. In the following subsections, we describe the major query strategy frameworks, how they work, their merits and limitations, and the different strategies within each framework.
The key idea behind this form of sampling is to select instances from the unlabeled pool that the current model is most uncertain about. The learner can then avoid the instances the model is already certain or confident about classifying (References [8]).
Probabilistic models (Naïve Bayes, Logistic Regression, and so on) are the most natural choices for such methods: for a given model θ and a data instance x, they provide a confidence measure for each class yi, i ϵ classes, in the form of the posterior probability Pθ(yi|x).
The general process for all uncertainty-based algorithms is outlined as follows:
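As a hedged sketch, the iterative loop common to these algorithms might look as follows in Python. The callables `train`, `uncertainty`, and `oracle` are hypothetical placeholders, not part of any particular library:

```python
# Minimal sketch of the generic uncertainty-based active learning loop.
# `train`, `uncertainty`, and `oracle` are assumed, user-supplied callables.
def active_learning_loop(L, U, train, uncertainty, oracle, n_queries):
    """L: list of (x, y) labeled pairs; U: list of unlabeled instances."""
    for _ in range(n_queries):
        model = train(L)                                 # 1. fit on labeled data
        x = max(U, key=lambda u: uncertainty(model, u))  # 2. pick the most uncertain instance
        U.remove(x)                                      # 3. move it out of U ...
        L.append((x, oracle(x)))                         # 4. ... into L with the oracle's label
    return train(L)                                      # final model on the enlarged L
```

Any model with a fit procedure and any uncertainty score (least confident, margin, entropy) can be plugged into this skeleton.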
Some of the most common uncertainty sampling strategies for selecting informative instances from the data are given next.
In this technique, the data instances are sorted in ascending order of the model's confidence, and the instances most likely to be queried or selected are the ones the model is least confident about. The idea is that the least confident instances are the ones near the margin, or separating hyperplane, and getting their labels is the most effective way to learn the decision boundary.
This can be formulated as x* = argmaxx (1 − Pθ(ŷ|x)), where ŷ = argmaxy Pθ(y|x) is the label the model considers most likely for x.
The disadvantage of this method is that it effectively considers information about the most probable label only; information in the rest of the posterior distribution is not used.
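A minimal sketch of least-confident selection, assuming the model's posteriors are already available as a NumPy array of shape (n_samples, n_classes):

```python
import numpy as np

def least_confident(probs):
    """Return the index of the instance whose top-class posterior is lowest.

    probs: (n_samples, n_classes) array of model posterior estimates.
    """
    top = probs.max(axis=1)        # confidence in each instance's most likely label
    return int(np.argmin(top))     # least confident instance wins the query

probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.70, 0.30]])
# the second instance (index 1) is nearest the decision boundary
```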
The hypothesis space H is the set of all models that generalize or explain the training data; for example, all possible sets of weights that separate two linearly separable classes. A version space V is the subset of H consistent with the training data, as defined by Tom Mitchell (References [15]): V = {h ϵ H | h(xi) = yi for all (xi, yi) ϵ L}.
The idea behind this sampling is to query instances from the unlabeled dataset that reduce the size of the version space or minimize |V|.
QBD is one of the earliest algorithms of this kind. It works by maintaining a version space V; when two hypotheses disagree on the label of a new incoming instance, that instance is selected for labeling by the oracle or expert.
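A hedged sketch of this disagreement-based selection: the two threshold classifiers below are hypothetical stand-ins for two hypotheses in the version space (in practice they would be derived from the labeled data L):

```python
# Two assumed hypotheses that are both consistent with the labeled data:
def h_general(x):
    return x > 0.2   # a very general boundary

def h_specific(x):
    return x > 0.8   # a very specific boundary

# Stream of incoming unlabeled instances.
stream = [0.1, 0.5, 0.9]

# Query the oracle only where the hypotheses disagree (the region 0.2..0.8).
queried = [x for x in stream if h_general(x) != h_specific(x)]
# instances where both hypotheses agree can be labeled without the oracle
```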
The entire algorithm can be summarized as follows:
For this algorithm:
There are two tasks involved, training the committee of learners and choosing a disagreement measure, and each has various choices.
The different models can be trained either on different samples drawn from L, or by using ensemble methods such as boosting and bagging.
Vote entropy is one of the methods chosen as the disagreement metric for ranking. It can be represented mathematically as:

x*VE = argmaxx − Σi (V(yi)/|C|) log (V(yi)/|C|)
Here, V(yi) is the number of votes given to the label yi out of all possible labels, and |C| is the size of the committee.
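A minimal sketch of vote entropy for a single instance, where `votes` holds one predicted label per committee member:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Vote entropy disagreement: votes is a list of labels, one per member."""
    size = len(votes)                             # |C|
    return -sum((c / size) * math.log(c / size)   # -sum V(yi)/|C| * log(V(yi)/|C|)
                for c in Counter(votes).values())
```

A unanimous committee yields entropy 0; an evenly split committee yields the maximum value.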
Kullback-Leibler (KL) divergence is an information-theoretic measure of the divergence between two probability distributions. The disagreement is quantified as the average divergence of each committee member's prediction from that of the consensus of the committee C:
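This measure can be sketched as follows, assuming `member_probs` is a (|C|, n_classes) array holding one posterior distribution per committee member for a single unlabeled instance (strictly positive probabilities are assumed, so the logarithm is well defined):

```python
import numpy as np

def kl_disagreement(member_probs):
    """Average KL divergence of each member's posterior from the consensus.

    member_probs: (|C|, n_classes) array, one distribution per member.
    """
    consensus = member_probs.mean(axis=0)                       # consensus P_C(y|x)
    kl = (member_probs * np.log(member_probs / consensus)).sum(axis=1)
    return kl.mean()                                            # average over the committee
```

Members with identical posteriors give a disagreement of zero; the more their distributions diverge from the consensus, the higher the score and the more informative the instance.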
The advantages and limitations are the following:
The previous methods selected the best instances from the unlabeled set either based on the uncertainty the samples posed to the model or by reducing the size of the hypothesis space. Neither of these methods considered what is best for the model itself. The idea behind data distribution sampling is that adding samples that reduce the model's expected error improves its predictions on unseen instances (References [13 and 14]).
There are different ways to determine the best sample for a given model, and we describe each in some detail here.
The idea behind this is to select examples from the unlabeled set that will bring the maximum change in the model. A common formulation is to choose the instance that, if labeled and added to the training set, would produce the largest expected gradient of the training objective:

x*EMC = argmaxx Σi Pθ(yi|x) ||∇ℓθ(L ∪ <x, yi>)||

Here, Pθ(yi|x) gives the expectation over the possible labels of x, and ∇ℓθ(L ∪ <x, yi>) is the gradient of the objective function if x with hypothesized label yi were added to the labeled set.
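A hedged sketch of expected model change for a binary logistic model (an assumption made purely for illustration): each hypothetical label of x is weighted by its posterior, and we measure the log-loss gradient norm that adding <x, y> would contribute:

```python
import numpy as np

def expected_gradient_norm(w, x):
    """Expected gradient length of x under a binary logistic model with weights w."""
    p1 = 1.0 / (1.0 + np.exp(-w @ x))          # P(y=1 | x) under the current model
    total = 0.0
    for y, p in ((0, 1.0 - p1), (1, p1)):      # expectation over hypothetical labels
        grad = (p1 - y) * x                    # log-loss gradient contribution of <x, y>
        total += p * np.linalg.norm(grad)
    return total
```

The candidate with the largest value would be queried, on the grounds that it is expected to change the model the most.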
Here, the approach is to select examples from the unlabeled set that reduce the model's generalization error the most. The generalization error is measured over the unlabeled set using expected labels:

x*EER = argminx Σi Pθ(yi|x) Σx' ϵ U Hθ+(y|x')

Here, Pθ(yi|x) gives the expectation over the labels of x, and the inner term is the sum, over the unlabeled instances x', of the entropy of the predictions of the model θ+ obtained by retraining with <x, yi> added.
The general equation for estimating the out-of-sample error in terms of noise, bias, and variance is given by the following:

ET[(G(x) − y)2 | x] = E[(y − E[y|x])2] + (ET[G(x)] − E[y|x])2 + ET[(G(x) − ET[G(x)])2]

The first term is the noise, the second the squared bias, and the third the variance of the model over training sets T.
Here, G(x) is the model's prediction for the instance x and y is its true label. In variance reduction, we select examples from the unlabeled set that most reduce the variance of the model:
Here, θ+ represents the model after it has been retrained with the new point x' and its label y'.
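A small Monte Carlo sanity check of the noise + bias2 + variance decomposition above, using a deliberately simple, hypothetical setup: the "model" G(x) is the mean of a noisy training sample drawn around a fixed true value:

```python
import numpy as np

rng = np.random.default_rng(0)
f_x, sigma = 2.0, 0.5                  # true value at a fixed x, noise std dev
n_trials, n_train = 20000, 10

# Train the sample-mean "model" on many independent training draws.
preds = np.empty(n_trials)
for t in range(n_trials):
    train_y = f_x + sigma * rng.standard_normal(n_train)
    preds[t] = train_y.mean()          # G(x) for this training set

# Out-of-sample squared error against fresh noisy labels.
test_y = f_x + sigma * rng.standard_normal(n_trials)
mse = np.mean((preds - test_y) ** 2)

noise = sigma ** 2                     # irreducible label noise
bias2 = (preds.mean() - f_x) ** 2      # squared bias of G(x)
variance = preds.var()                 # variance of G(x) across training sets
# mse should come out close to noise + bias2 + variance
```

Since the label noise is irreducible, variance reduction targets the only term the query strategy can influence.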
The advantages and limitations are as follows: