5 Advanced active learning

This chapter covers

  • Combining uncertainty sampling and diversity sampling techniques
  • Using active transfer learning to sample the most uncertain and the most representative items
  • Implementing adaptive transfer learning within an active learning cycle

In chapters 3 and 4, you learned how to identify where your model is uncertain (what your model knows it doesn’t know) and what is missing from your model (what your model doesn’t know that it doesn’t know). In this chapter, you learn how to combine these techniques into a comprehensive active learning strategy. You also learn how to use transfer learning to adapt your models to predict which items to sample.

5.1 Combining uncertainty sampling and diversity sampling

This section explores ways to combine all the active learning techniques that you have learned up to this point so that you can use them effectively for your particular use cases. You will also learn one new active learning strategy: expected error reduction, which combines principles of uncertainty sampling and diversity sampling. Recall from chapter 1 that an ideal strategy for active learning tries to sample items that are near the decision boundary but are distant from one another, as shown in figure 5.1.

Figure 5.1 One possible result of combining uncertainty sampling and diversity sampling. When these strategies are combined, items near diverse sections of the decision boundary are selected. Therefore, we are optimizing the chance of finding items that are likely to result in a changed decision boundary when they’re added to the training data.

You have learned to identify items that are near the decision boundary (uncertainty sampling) and distant from one another (cluster-based sampling and adaptive representative sampling). This chapter shows you how to sample items that are both near the decision boundary and diverse, like those shown in figure 5.1.

5.1.1 Least confidence sampling with cluster-based sampling

The most common way that uncertainty sampling and diversity sampling are combined in industry is to take a large sample from one method and then further filter that sample with another method. This technique has no common name, despite its ubiquity, probably because so many companies have invented it independently by necessity.

If you sampled the 50% most uncertain items with least confidence sampling and then applied cluster-based sampling to sample 10% of those items, you could end up with a sample of 5% of your data more or less like those in figure 5.1: a near-optimal combination of uncertainty and diversity. Figure 5.2 represents this result graphically. First, you sample the 50% most uncertain items; then you apply clustering to ensure diversity within that selection, sampling the centroid of each cluster.

Figure 5.2 An example combining least confidence and clustering-based sampling. First, uncertainty sampling finds items near the decision boundary; then clustering ensures diversity within that selection. In this figure, the centroids from each cluster are sampled. Alternatively, or in addition, you could select random members or outliers within each cluster.

With the code you have already learned, you can see that combining least confidence sampling and clustering is a simple extension in advanced_active_learning.py within the same code repository that we have been using (https://github.com/rmunro/pytorch_active_learning), as shown in the following listing.

Listing 5.1 Combining least confidence sampling and clustering

def get_clustered_uncertainty_samples(self,  model, unlabeled_data, method, 
 feature_method, perc_uncertain = 0.1, num_clusters=20, max_epochs=10, 
 limit=10000):
        
    if limit > 0:
      shuffle(unlabeled_data)
      unlabeled_data = unlabeled_data[:limit]            
    uncertain_count = math.ceil(len(unlabeled_data) * perc_uncertain)
        
    uncertain_samples = self.uncertainty_sampling.get_samples(model, 
     unlabeled_data, 
     method, feature_method, uncertain_count, limit=limit)                
    samples = self.diversity_sampling.get_cluster_samples(uncertain_samples,
     num_clusters=num_clusters)                                           
        
    for item in samples:
      item[3] = method.__name__+"_"+item[3] # record the sampling method
            
    return samples

Get a large sample of the most uncertain items.

Within those uncertain items, use clustering to ensure a diverse sample.

Only two new lines of code are needed to combine the two approaches: one to get the most uncertain items and one to cluster them. If you are interested in the disaster-response text classification task, try it with this new command:

> python active_learning.py --clustered_uncertainty=10 --verbose

You’ll immediately see that the data tends to fall near the divide between text that may or may not be disaster-related and that the items are a diverse selection. You have many options for using uncertainty sampling to find items near the decision boundary and then applying cluster-based sampling to ensure diversity within those items. You can experiment with different types of uncertainty sampling, different thresholds for your uncertainty cutoff, and different parameters for clustering. In many settings, this combination of clustering and uncertainty sampling will be the fastest way to drill down on the highest-value items for active learning and should be one of the first strategies that you try for almost any use case.

The simple methods of combining strategies rarely make it into academic papers; academia favors papers that combine methods into a single algorithm rather than chaining multiple simpler algorithms. This makes sense, because combining the methods is easy, as you have already seen; there is no need to write an academic paper about something that can be implemented in a few lines of code. But as a developer building real-world active learning systems, you should always implement the easy solutions before attempting more experimental algorithms.

Another reason to try simple methods first is that you might need to keep supporting them in your applications for a long time. It will be easier to maintain your code if you can get 99% of the way there without having to invent new techniques. See the following sidebar for a great example of how early decisions matter.

Your early data decisions continue to matter

Expert anecdote by Kieran Snyder

The decisions that you make early in a machine learning project can influence the products that you are building for many years to come. This is especially true for data decisions: your feature-encoding strategies, labeling ontologies, and source data will have long-term impacts.

In my first job out of graduate school, I was responsible for building the infrastructure that allowed Microsoft software to work in dozens of languages around the world. This job included making fundamental decisions such as deciding on the alphabetical order of the characters in a language—something that didn’t exist for many languages at the time. When the 2004 tsunami devastated countries around the Indian Ocean, it was an immediate problem for Sinhalese-speaking people in Sri Lanka: there was no easy way to support searching for missing people because Sinhalese didn’t yet have standardized encodings. Our timeline for Sinhalese support went from several months to several days so that we could help the missing-persons service, working with native speakers to build solutions as quickly as possible.

The encodings that we decided on at that time were adopted by Unicode as the official encodings for the Sinhalese language and now encode that language forever. You won’t always be working on such critical timelines, but you should always consider the long-term impact of your product decisions right from the start.

Kieran Snyder is CEO and co-founder of Textio, a widely used augmented writing platform. Kieran previously held product leadership roles at Microsoft and Amazon and has a PhD in linguistics from the University of Pennsylvania.

Don’t assume that a complicated solution is necessarily the best; you may find that a simple combination of least confidence and clustering is all you need for your data. As always, you can test different methods to see which results in the biggest change in accuracy against a baseline of random sampling.

5.1.2 Uncertainty sampling with model-based outliers

When you combine uncertainty sampling with model-based outliers, you are maximizing your model’s current confusion. You are looking for items near the decision boundary and making sure that their features are relatively unknown to the current model. Figure 5.3 shows the kinds of samples that this approach might generate.

Figure 5.3 This example of combining uncertainty sampling with model-based outliers selects items that are near the decision boundary but that are different from the current training data items and, therefore, different from the model.

Listing 5.2 Combining uncertainty sampling with model-based outliers

def get_uncertain_model_outlier_samples(self,  model, outlier_model,  
 unlabeled_data, training_data, validation_data, method, feature_method,
 perc_uncertain = 0.1, number=10, limit=10000):
        
    if limit > 0:
      shuffle(unlabeled_data)
      unlabeled_data = unlabeled_data[:limit]            
    uncertain_count = math.ceil(len(unlabeled_data) * perc_uncertain)
 
    uncertain_samples = self.uncertainty_sampling.get_samples(model, 
     unlabeled_data, method, feature_method, uncertain_count, limit=limit) 
        
    samples = self.diversity_sampling.get_model_outliers(outlier_model,
     uncertain_samples, validation_data, feature_method, 
     number=number, limit=limit)                                           
 
    for item in samples:
      item[3] = method.__name__+"_"+item[3]
            
    return samples

Get the most uncertain items.

Apply model-based outlier sampling to those items.

As in the example in listing 5.1, you need only two lines of code here to pull everything together. Although combining uncertainty sampling with model-based outliers is optimal for targeting items that are most likely to increase your model’s knowledge and overall accuracy, it can also sample similar items. You can try this technique with this command:

> python active_learning.py --uncertain_model_outliers=100 --verbose

5.1.3 Uncertainty sampling with model-based outliers and clustering

Because the method in section 5.1.2 might oversample items that are close to one another, you may want to implement this strategy first and then apply clustering to ensure diversity. It takes only one line of code to add clustering to the end of the previous method, so you could implement it easily, as in the sketch below. Alternatively, if you have quick active learning iterations, you can sample a small number of items in each iteration, which also ensures more diversity when you combine uncertainty sampling and model-based outliers.
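As a minimal sketch (not from the book's repository), this is what the combined method could look like, reusing the get_uncertain_model_outlier_samples() method from listing 5.2 and the same diversity_sampling helper used throughout this chapter; the method name and signature here are my assumptions:

def get_uncertain_model_outlier_cluster_samples(self, model, outlier_model,
 unlabeled_data, training_data, validation_data, method, feature_method,
 perc_uncertain=0.1, number=100, num_clusters=20, limit=10000):

    # steps 1 and 2: uncertainty sampling filtered by model-based outliers,
    # reusing the method from listing 5.2
    samples = self.get_uncertain_model_outlier_samples(model, outlier_model,
     unlabeled_data, training_data, validation_data, method, feature_method,
     perc_uncertain=perc_uncertain, number=number, limit=limit)

    # step 3: the one extra line: cluster those samples to ensure diversity
    samples = self.diversity_sampling.get_cluster_samples(samples,
     num_clusters=num_clusters)

    return samples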

5.1.4 Representative sampling with cluster-based sampling

One shortcoming of the representative sampling technique that you learned in chapter 4 is that it treats the training data and target domain as single clusters. In reality, your data will often be multimodal in a way that a single cluster cannot optimally capture.

To capture this complexity, you can combine representative sampling and cluster-based sampling in a slightly more complicated architecture. You can cluster your training data and your unlabeled data independently, identify the clusters that are most representative of your unlabeled data, and oversample from them. This approach gives you a more diverse set of items than representative sampling alone (figure 5.4).

Figure 5.4 An example (bottom) of combining representative sampling and cluster-based sampling. This method samples items that are most like your application domain relative to your current training data and also different from one another. By comparison, the simpler representative sampling method in chapter 4 treats each distribution as a single distribution.

As you can see in figure 5.4, your current training data and target domains may not be uniform distributions within your feature space. Clustering the data first will help you model your feature space more accurately and sample a more diverse set of unlabeled items. First, create the clusters for the training data and unlabeled data from the application domain.

Listing 5.3 Combining representative sampling and clustering

def get_representative_cluster_samples(self, training_data, unlabeled_data,
 number=10, num_clusters=20, max_epochs=10, limit=10000):
    """Gets the most representative unlabeled items, compared to training data,
     across multiple clusters
        
    Keyword arguments:
      training_data -- data with a label, that the current model is trained on
      unlabeled_data -- data that does not yet have a label
      number -- number of items to sample
      limit -- sample from only this many items for faster sampling (-1 = 
       no limit)
      num_clusters -- the number of clusters to create
      max_epochs -- maximum number of epochs to create clusters
       
    """ 
            
    if limit > 0:
      shuffle(training_data)
      training_data = training_data[:limit]
      shuffle(unlabeled_data)
      unlabeled_data = unlabeled_data[:limit]
            
    # Create clusters for training data
    
    training_clusters = CosineClusters(num_clusters)
    training_clusters.add_random_training_items(training_data)
        
    for i in range(0, max_epochs):        
      print("Epoch "+str(i))
      added = training_clusters.add_items_to_best_cluster(training_data)
      if added == 0:
        break
    
    # Create clusters for unlabeled data
    
    unlabeled_clusters = CosineClusters(num_clusters)    
    unlabeled_clusters.add_random_training_items(unlabeled_data)
        
    for i in range(0, max_epochs):        
      print("Epoch "+str(i))
      added = unlabeled_clusters.add_items_to_best_cluster(unlabeled_data)
      if added == 0:
        break

Create clusters within the existing training data.

Create clusters within the unlabeled data.

Then iterate through each cluster of unlabeled data, and find the item in each cluster that best fits its unlabeled cluster relative to how well it fits the training data clusters.

Listing 5.4 Combining representative sampling and clustering, continued

        most_representative_items = []
        
        # for each cluster of unlabeled data
        for cluster in unlabeled_clusters.clusters:
            most_representative = None
            representativeness = float("-inf")
            
            # find the item in that cluster most like the unlabeled data 
            item_keys = list(cluster.members.keys())
             
            for key in item_keys:
                item = cluster.members[key]
                
                _, unlabeled_score = unlabeled_clusters.get_best_cluster(item)
                _, training_score = training_clusters.get_best_cluster(item)  
    
                cluster_representativeness = unlabeled_score - training_score 
    
                if cluster_representativeness > representativeness:
                    representativeness = cluster_representativeness 
                    most_representative = item
                    
            most_representative[3] = "representative_clusters"            
            most_representative[4] = representativeness
            most_representative_items.append(most_representative)
                     
        most_representative_items.sort(reverse=True, key=lambda x: x[4])       
        return most_representative_items[:number]     

Find the best-fit cluster within the unlabeled data clusters.

Find the best-fit cluster within the training data clusters.

Record the difference between the two as our representativeness score.

In design, this code is almost identical to the representative sampling method that you implemented in chapter 4, but you are asking the clustering algorithm to create multiple clusters for each distribution instead of only one for training data and one for unlabeled data. You can try this technique with this command:

> python active_learning.py --representative_clusters=100 --verbose

5.1.5 Sampling from the highest-entropy cluster

If you have high entropy in a certain cluster, a lot of confusion exists about the right labels for items in that cluster. In other words, these clusters have the highest average uncertainty across all the items. These items, therefore, are the ones whose predicted labels are most likely to change and that have the most room for improvement with new training data.

The example in figure 5.5 is the opposite of clustering for diversity in some ways, as it deliberately focuses on one part of the problem space. But sometimes, that focus is exactly what you want.

Figure 5.5 This example of combining cluster-based sampling with entropy (bottom) samples items within the cluster that show the most confusion. You might think of this cluster as being the one that straddles the decision boundary most closely. In this example, random items are sampled in the cluster, but you could experiment by sampling the centroid, outliers, and/or oversampling items within the cluster that have the highest entropy. By comparison, simple clustering (top) samples items from every cluster.

Note that this approach works best when you have data with accurate labels and are confident that the task can be solved with machine learning. If you have data that has a lot of inherent ambiguity, this method will tend to focus on those areas. To solve this problem, see how much of your existing training data falls into your high-entropy clusters. If the cluster is already well represented in your training data, you have good evidence that it is an inherently ambiguous part of your feature space and that additional labels will not help. The following listing shows the code for selecting the cluster with the highest average entropy.

Listing 5.5 Sampling from the cluster with the highest entropy

def get_high_uncertainty_cluster(self,  model, unlabeled_data, method, 
 feature_method, number=10, num_clusters=20, max_epochs=10, limit=10000):
    """Gets items from the cluster with the highest average uncertainty
        
    Keyword arguments:
      model -- machine learning model to get predictions from to determine 
       uncertainty
      unlabeled_data -- data that does not yet have a label
      method -- method for uncertainty sampling (eg: least_confidence())
      feature_method -- the method for extracting features from your data
      number -- number of items to sample
      num_clusters -- the number of clusters to create
      max_epochs -- maximum number of epochs to create clusters
      limit -- sample from only this many items for faster sampling 
       (-1 = no limit)
    """
                
    if limit > 0:
      shuffle(unlabeled_data)
      unlabeled_data = unlabeled_data[:limit]            
 
    unlabeled_clusters = CosineClusters(num_clusters)    
    unlabeled_clusters.add_random_training_items(unlabeled_data)
        
    for i in range(0, max_epochs):                           
      print("Epoch  "str(i))
      added = unlabeled_clusters.add_items_to_best_cluster(unlabeled_data)
      if added == 0:
        break
    
    # get scores
        
    most_uncertain_cluster = None
    highest_average_uncertainty = 0.0
        
    # for each cluster of unlabeled data
    for cluster in unlabeled_clusters.clusters:
      total_uncertainty = 0.0
      count = 0
 
      item_keys = list(cluster.members.keys())
             
      for key in item_keys:
        item = cluster.members[key]
        text = item[1] # the text for the message
                
        feature_vector = feature_method(text)
        hidden, logits, log_probs = model(feature_vector, 
         return_all_layers=True)
    
        # the probability distribution of our prediction
        prob_dist = torch.exp(log_probs)
                
        # get the specific type of uncertainty sampling
        score = method(prob_dist.data[0])
                
        total_uncertainty += score
        count += 1
                
      average_uncertainty = total_uncertainty / count         
      if average_uncertainty > highest_average_uncertainty:
          highest_average_uncertainty = average_uncertainty
          most_uncertain_cluster = cluster
            
    samples = most_uncertain_cluster.get_random_members(number)
            
    return samples

Create the clusters.

Calculate the average uncertainty (using entropy) for the items in each cluster.

In this code example, we are taking the average entropy of all items in a cluster. You can try different aggregate statistics based on your sampling strategy. If you know that you are sampling only the top 100 items, for example, you could calculate the average entropy across the 100 most uncertain items in each cluster rather than across every item in the cluster. You can try this technique with this command:

> python active_learning.py --high_uncertainty_cluster=100 --verbose
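As a minimal sketch of the top-k variant described above, you could replace the per-cluster averaging loop in listing 5.5 with something like the following; the helper name and the k parameter are my assumptions, and the inner scoring loop mirrors listing 5.5:

import torch

def top_k_average_uncertainty(cluster, model, method, feature_method, k=100):
    """Average uncertainty of the k most uncertain items in one cluster."""
    scores = []
    for item in cluster.members.values():
        text = item[1] # the text for the message
        feature_vector = feature_method(text)
        hidden, logits, log_probs = model(feature_vector,
         return_all_layers=True)
        prob_dist = torch.exp(log_probs)
        scores.append(method(prob_dist.data[0]))

    scores.sort(reverse=True) # most uncertain items first
    top_k = scores[:k]
    return sum(top_k) / len(top_k)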

5.1.6 Other combinations of active learning strategies

There are too many possible combinations of active learning techniques to cover in this book, but by this stage, you should have a good idea of how to combine them. Here are some starting points:

  • Combining uncertainty sampling and representative sampling—You can sample items that are most representative of your target domains and are also uncertain. This approach will be especially helpful in later iterations of active learning. If you used uncertainty sampling for early iterations, your target domain will have items that are disproportionately far from the decision boundary and could be selected erroneously as representative.

  • Combining model-based outliers and representative sampling—This combination is the ultimate method for domain adaptation, targeting items that are unknown to your model today but are also relatively common in your target domain.

  • Combining clustering with itself for hierarchical clusters—If you have some large clusters or want to sample for diversity within one cluster, you can take the items from one cluster and use them to create a new set of clusters (see the sketch after this list).

  • Combining sampling from the highest-entropy cluster with margin of confidence sampling (or some other uncertainty metric)—You can find the cluster with the highest entropy and then sample all the items within it that fall closest to a decision boundary.

  • Combining ensemble methods or dropouts with individual strategies—You may be building multiple models and decide that a Bayesian model is better for determining uncertainty, but a neural model is better for determining model-based outliers. You can sample with one model and further refine with another. If you’re clustering based on hidden layers, you could adapt the dropout method from uncertainty sampling and randomly ignore some neurons while creating clusters. This approach will prevent the clusters from overfitting to the internal representation of your network.
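As a minimal sketch of the hierarchical-clustering idea mentioned in the list above, you could take the largest cluster from a first clustering pass and create a new set of clusters within it. The CosineClusters class and its methods are from the chapter's repository; the function name, the choice of the largest cluster, and sampling one random member per subcluster are my assumptions:

def get_hierarchical_cluster_samples(data, number=10, num_clusters=20,
 sub_clusters=10, max_epochs=10):
    # first pass: cluster all the data, as in chapter 4
    clusters = CosineClusters(num_clusters)
    clusters.add_random_training_items(data)
    for i in range(0, max_epochs):
        if clusters.add_items_to_best_cluster(data) == 0:
            break

    # take the largest cluster and create a new set of clusters within it
    largest_cluster = max(clusters.clusters, key=lambda c: len(c.members))
    members = list(largest_cluster.members.values())

    inner_clusters = CosineClusters(sub_clusters)
    inner_clusters.add_random_training_items(members)
    for i in range(0, max_epochs):
        if inner_clusters.add_items_to_best_cluster(members) == 0:
            break

    # sample one random member per subcluster for diversity
    # within the original large cluster
    samples = []
    for cluster in inner_clusters.clusters:
        samples += cluster.get_random_members(1)
    return samples[:number]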

5.1.7 Combining active learning scores

An alternative to piping the output from one sampling strategy to another is taking the scores from the different sampling strategies and finding the highest average score, which makes mathematical sense for all methods other than clustering. You could average each item’s score for margin of confidence, model-based outliers, and representative sampling, for example, and then rank all items by that single aggregate score.

Although all the scores should be in a [0–1] range, note that some of them may be clustered in small ranges and therefore not contribute as much to the average. If this is the case with your data, you can try converting all your scores to percentiles (quantiles), effectively turning all the sampling scores into stratified rank orders. You can use built-in functions from your math library of choice to turn any list of numbers into percentiles. Look for functions called rank(), percentile(), or percentileofscore() in various Python libraries. Compared with the other methods that you are using for sampling, converting scores to percentiles is relatively quick, so don’t worry about trying to find the optimal function; choose a function from a library that you are already using.
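As a minimal sketch of the percentile conversion, here is one way to convert each strategy's scores to percentiles with scipy.stats.percentileofscore() and then average them; the function name and the input format are my assumptions:

import numpy as np
from scipy import stats

def average_percentile_scores(scores_per_strategy):
    """scores_per_strategy: one list of raw scores per sampling strategy,
    aligned so that index i refers to the same item in every list."""
    percentiles_per_strategy = []
    for scores in scores_per_strategy:
        # rank each score within its own strategy's distribution so that
        # strategies with narrow score ranges still contribute equally
        percentiles = [stats.percentileofscore(scores, s) for s in scores]
        percentiles_per_strategy.append(percentiles)

    # one aggregate score per item: the mean percentile across strategies
    return np.mean(percentiles_per_strategy, axis=0)

You can then rank the unlabeled items by the returned aggregate scores and sample from the top.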

You could also sample via the union of the methods rather than filtering (which is a combination via intersection). This approach can be used for any methods and might make the most sense when you are combining multiple uncertainty sampling scores. You could sample the items that are in the 10% most uncertain by any of least confidence, margin of confidence, ratio of confidence, or entropy to produce a general “uncertain” set of samples, and then use those samples directly or refine the sampling by combining it with additional methods. There are many ways to combine the building blocks that you have learned, and I encourage you to experiment with them.
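Here is a minimal sketch of sampling via the union of methods, assuming that each uncertainty sampling method has already produced an (item id, score) pair for every unlabeled item; all of the names here are my assumptions:

def union_of_top_percent(scores_by_method, percent=0.1):
    """Items in the top `percent` most uncertain by any method."""
    sampled_ids = set()
    for method_name, scored_items in scores_by_method.items():
        # scored_items: a list of (item_id, uncertainty_score) pairs
        ranked = sorted(scored_items, key=lambda x: x[1], reverse=True)
        cutoff = int(len(ranked) * percent)
        sampled_ids.update(item_id for item_id, score in ranked[:cutoff])
    return sampled_ids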

5.1.8 Expected error reduction sampling

Expected error reduction is one of a handful of active learning strategies in the literature that aim to combine uncertainty sampling and diversity sampling into a single metric. This algorithm is included here for completeness, with the caveat that I have not seen it implemented in real-world situations. The core metric for expected error reduction sampling is how much the error in the model would be reduced if an unlabeled item were given a label.1 You could give each unlabeled item the possible labels that it could have, retrain the model with those labels, and then look at how the model accuracy changes. You have two common ways to calculate the change in model accuracy:

  • Overall accuracy—What is the change in number of items predicted correctly if this item had a label?

  • Overall entropy—What is the change in aggregate entropy if this item had a label? This method uses the definition of entropy that you learned in the uncertainty sampling chapter in sections 3.2.4 and 3.2.5. It is sensitive to the confidence of the prediction, unlike the first method, which is sensitive only to the predicted label.

The score is weighted across labels by the frequency of each label. You sample the items that are most likely to improve the model overall (see the formula sketch after the following list). This algorithm has some practical problems, however:

  • Retraining the model once for every unlabeled item multiplied by every label is prohibitively expensive for most algorithms.

  • There can be so much variation when retraining a model that the change from one additional label could be indistinguishable from noise.

  • The algorithm can oversample items a long way from the decision boundary, thanks to high entropy across labels that each have a diminishingly small likelihood.
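To make the weighting concrete, here is one formulation of the expected error reduction objective, following Roy and McCallum; the notation is mine, not the chapter's. For a candidate unlabeled item x with possible labels y:

\[ \phi(x) = \sum_{y \in Y} P(y)\, \hat{E}\big(\theta^{+(x,y)}\big) \]

where P(y) is the frequency (prior probability) of label y, \( \theta^{+(x,y)} \) is the model retrained with (x, y) added to the training data, and \( \hat{E} \) is that model's estimated error (or aggregate entropy) over the unlabeled pool. You sample the items with the lowest \( \phi(x) \). The variant discussed below replaces P(y) with the model's predicted probability \( P_\theta(y \mid x) \). Written this way, the cost is plain: scoring a single item requires one retraining per possible label.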

So there are practical limitations to using this method with neural models. The original authors of this algorithm used incremental Naive Bayes, which can be adapted to new training items by updating the counts of a new item’s features, and is deterministic. Given this fact, expected error reduction works for the authors’ particular algorithm. The problem of oversampling items away from the decision boundary can be addressed by using the predicted probability of each label rather than the label frequency (prior probability), but you will need accurate confidence predictions from your model, which you may not have, as you learned in chapter 3.

If you do try to implement expected error reduction, you could experiment with different accuracy measures and with uncertainty sampling algorithms other than entropy. Because this method uses entropy, which comes from information theory, you might see it called information gain in the literature on variations of this algorithm. Read these papers closely, because gain can mean lower information. Although the term is mathematically correct, it can seem counterintuitive to say that your model knows more when the predictions have less information.

As stated at the start of this section, no one has (as far as I know) published on whether expected error reduction is better than the simple combination of methods through the intersection and/or union of sampling strategies. You could try implementing expected error reduction and related algorithms to see whether they help in your systems. You may be able to implement them by retraining only the final layer of your model with the new item, which will speed the process.

If you want to sample items with a goal similar to expected error reduction, you can cluster your data and then look at clusters with the highest entropy in the predictions, like the example in figure 5.5 earlier in this chapter. Expected error reduction has a problem, however, in that it might find items in only one part of the feature space, like the uncertainty sampling algorithms used in isolation. If you extend the example in figure 5.5 to sample items from the N highest-entropy clusters, not only the single highest-entropy cluster, you will have addressed the limitations of expected error reduction in only a few lines of code.

Rather than try to handcraft an algorithm that combines uncertainty sampling and diversity sampling into one algorithm, however, you can let machine learning decide on that combination for you. The original expected error reduction paper was titled “Toward Optimal Active Learning through Sampling Estimation of Error Reduction” and is 20 years old, so this is likely the direction that the authors had in mind. The rest of this chapter builds toward machine learning models for the sampling process itself in active learning.

5.2 Active transfer learning for uncertainty sampling

The most advanced active learning methods use everything that you have learned so far in this book: the sampling strategies for interpreting confusion that you learned in chapter 3, the methods for querying the different layers in your models that you learned in chapter 4, and the combinations of techniques that you learned in the first part of this chapter.

Using all these techniques, you can build a new model with the task of predicting where the greatest uncertainty occurs. First, let’s revisit the description of transfer learning from chapter 1, shown here in figure 5.6.

Figure 5.6 We have a model that predicts a label as “A,” “B,” “C,” or “D” and a separate dataset with the labels “W,” “X,” “Y,” and “Z.” When we retrain only the last layer of the model, the model is able to predict labels “W,” “X,” “Y,” and “Z,” using far fewer human-labeled items than if we were training a model from scratch.

In the example in figure 5.6, you can see how a model can be trained on one set of labels and then retrained on another set of labels by keeping the architecture the same and freezing part of the model, retraining only the last layer in this case. There are many more ways to use transfer learning and contextual models for human-in-the-loop machine learning. The examples in this chapter are variations on the type of transfer learning shown in figure 5.6.

5.2.1 Making your model predict its own errors

The new labels from transfer learning can be any categories that you want, including information about the task itself. This fact is the core insight for active transfer learning: you can use transfer learning to ask your model where it is confused by making it predict its own errors. Figure 5.7 outlines this process.

Figure 5.7 Validation items are predicted by the model and bucketed as “Correct” or “Incorrect” according to whether they were classified correctly. Then the last layer of the model is retrained to predict whether items are “Correct” or “Incorrect,” effectively turning the two buckets into new labels.

As figure 5.7 shows, this process has several steps:

  1. Apply the model to a validation dataset, and capture which validation items were classified correctly and incorrectly. This data is your new training data. Now your validation items have an additional label of “Correct” or “Incorrect.”

  2. Create a new output layer for the model, and train that new layer on your new training data, predicting your new “Correct” and “Incorrect” labels.

  3. Run your unlabeled data items through the new model, and sample the items that are predicted to be “Incorrect” with the highest confidence.

Now you have a sample of items that are predicted by your model as the most likely to be incorrect and therefore will benefit from a human label.

5.2.2 Implementing active transfer learning

The simplest forms of active transfer learning can be built with the building blocks of code that you have already learned. To implement the architecture in figure 5.7, you can create the new layer as its own model and use the final hidden layer as the features for that layer.

Here are the three steps from section 5.2.1, implemented in PyTorch. First, apply the model to a validation dataset, and capture which validation items were classified correctly and incorrectly. This data is your new training data. Your validation items have an additional label of “Correct” or “Incorrect,” which is in the (verbosely but transparently named) get_deep_active_transfer_learning_uncertainty_samples() method.

Listing 5.6 Active transfer learning

correct_predictions = [] # validation items predicted correctly
incorrect_predictions = [] # validation items predicted incorrectly
item_hidden_layers = {} # hidden layer of each item, by id
 
for item in validation_data:
    
    id = item[0]
    text = item[1]
    label = item[2]
 
    feature_vector = feature_method(text)
    hidden, logits, log_probs = model(feature_vector, return_all_layers=True)
 
    item_hidden_layers[id] = hidden              
 
    prob_dist = torch.exp(log_probs)     
    # get confidence that item is disaster-related
    prob_related = math.exp(log_probs.data.tolist()[0][1]) 
 
    if item[3] == "seen":
        correct_predictions.append(item)         
 
    elif(label=="1" and prob_related > 0.5) or (label=="0" and prob_related 
     <= 0.5):
        correct_predictions.append(item)
    else:
        incorrect_predictions.append(item)       

Store the hidden layer for this item to use later for our new model.

The item was correctly predicted, so it gets a “Correct” label in our new model.

The item was incorrectly predicted, so it gets an “Incorrect” label in our new model.

Second, create a new output layer for the model trained on your new training data, predicting your new “Correct” and “Incorrect” labels.

Listing 5.7 Creating a new output layer

correct_model = SimpleUncertaintyPredictor(128)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(correct_model.parameters(), lr=0.01)   
 
for epoch in range(epochs):                           
    if self.verbose:
        print("Epoch: "+str(epoch))
    current = 0
    
    # make a subset of data to use in this epoch
    # with an equal number of items from each label
    
    shuffle(correct_predictions) #randomize the order of the validation data
    shuffle(incorrect_predictions) #randomize the order of the validation data
  
    correct_ids = {}
    for item in correct_predictions:
        correct_ids[item[0]] = True         
    epoch_data = correct_predictions[:select_per_epoch]
    epoch_data += incorrect_predictions[:select_per_epoch]
    shuffle(epoch_data) 
            
    # train the final layers model
    for item in epoch_data:                
        id = item[0]
        label = 0
        if id in correct_ids:
            label = 1
    
        correct_model.zero_grad() 
    
        feature_vec = item_hidden_layers[id]          
        target = torch.LongTensor([label])
    
        log_probs = correct_model(feature_vec)
    
        # compute loss function, do backward pass, and update the gradient
        loss = loss_function(log_probs, target)
        loss.backward(retain_graph=True)
        optimizer.step()    

The code for training is similar to the other examples in this book.

Here, we use the hidden layer from the original model as our feature vector.

Finally, run your unlabeled data items through the new model, and sample the items that are predicted to be incorrect with the highest confidence.

Listing 5.8 Predicting “Incorrect” labels

deep_active_transfer_preds = []
 
with torch.no_grad():                                                       
    for item in unlabeled_data:
        text = item[1]
        
        # get prediction from main model
        feature_vector = feature_method(text)                               
        hidden, logits, log_probs = model(feature_vector, 
         return_all_layers=True)
 
        # use hidden layer from main model as input to the model 
        # predicting correct/errors
        logits, log_probs = correct_model(hidden, return_all_layers=True)   
    
        # get confidence that the item was classified correctly
        # (index 1 is the "Correct" label from listing 5.7)
        prob_correct = math.exp(log_probs.data.tolist()[0][1]) 
            
        item[3] = "predicted_error"            
        item[4] = 1 - prob_correct
        deep_active_transfer_preds.append(item)
 
       
deep_active_transfer_preds.sort(reverse=True, key=lambda x: x[4])
  
return deep_active_transfer_preds[:number]

The code for evaluation is similar to the others in this book.

First, we need to get the hidden layer from our original model.

Then we use that hidden layer as the feature vector for our new model.

If you are interested in the disaster-response text classification task, try it with this new method for active transfer learning:

> python active_learning.py --transfer_learned_uncertainty 10 --verbose 

As you can see in this code, we are not altering our original model for predicting whether a message is related to disaster response. Instead of replacing the final layer of that model, we are effectively adding a new output layer over the existing model. As an alternative, you could replace the final layer with the same result.

This architecture is used in this book because it is nondestructive. The old model remains. This architecture prevents unwanted errors when you still want to use the original model, either in production or for other sampling strategies. You also avoid needing the extra memory to have two copies of the full model in parallel. Building a new layer or copying and modifying the model are equivalent, so choose whichever approach is right for your codebase. All this code is in the same file as the methods discussed earlier in this chapter: advanced_active_learning.py.

5.2.3 Active transfer learning with more layers

You don’t need to limit active transfer learning to a single new layer or build on only the last hidden layer. As figure 5.8 shows, you can build multiple new layers, and they can connect directly with any hidden layer.

Figure 5.8 More-complicated active transfer learning architectures, using active transfer learning to create a prediction. The top example has a single neuron in the new output layer. The bottom example is a more-complicated architecture, with a new hidden layer that connects with multiple existing hidden layers.

The extension to the more-complicated architecture in figure 5.8 requires only a few lines of extra code. First, the new model to predict “Correct” or “Incorrect” needs a hidden layer. Then that new model will take its features from multiple hidden layers. You can append the vectors from the different layers to one another, and this flattened vector becomes the features for the new model.
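Here is a minimal PyTorch sketch of the flattening step, with illustrative shapes; assume that h1 and h2 are activations captured from two hidden layers of your existing model for a single item:

import torch

h1 = torch.rand(1, 128) # e.g., the last hidden layer for one item
h2 = torch.rand(1, 64)  # e.g., an earlier hidden layer for the same item

# append the vectors from the different layers to one another; this
# flattened vector becomes the feature vector for the new model
representation = torch.cat([h1.flatten(), h2.flatten()], dim=0)
print(representation.shape) # torch.Size([192])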

If you are familiar with contextual models for natural language processing (NLP) or convolutional models for computer vision, this process is a familiar one; you are extracting the activations of neurons from several parts of your network and flattening them into one long feature vector. The resulting vector is often called a representation because you are using the neurons from one model to represent your features in another model. We will return to representations in chapter 9, where they are also important for some semi-automated methods for creating training data.

The fact that you can build a more complicated model, however, doesn’t mean that you should build it. If you don’t have a lot of validation data, you are more likely to overfit a more-complicated model. It is a lot easier to avoid training errors if you are training only a single new output neuron. Use your instincts about how complicated your model needs to be, based on what you would normally build for that amount of data for a binary prediction task.

5.2.4 The pros and cons of active transfer learning

Active transfer learning has some nice properties that make it suitable for a wide range of problems:

  • You are reusing your hidden layers, so you are building models directly based on your model’s current information state.

  • You don’t need too many labeled items for the model to be effective, especially if you are retraining only the last layer (handy if your validation data is not large).

  • It is fast to train, especially if you are retraining only the last layer.

  • It works with many architectures. You may be predicting labels at document or image level, predicting objects within an image, or generating sequences of text. For all these use cases, you can add a new final layer or layers to predict “Correct” or “Incorrect.” (For more on active learning use cases, see chapter 6.)

  • You don’t need to normalize the different ranges of activation across different neurons, because your model is going to work out that task for you.

The fifth point is especially nice. Recall that with model-based outliers, you need to quantize the activation with the validation data because some of the neurons could be arbitrarily higher or lower in their average activation. It is nice to be able to pass the information to another layer of the neurons and tell that new layer to figure out exactly what weight to apply to the activation of each existing neuron. Active transfer learning also has some drawbacks:

  • Like other uncertainty sampling techniques, it can focus too much on one part of the feature space; therefore, it lacks diversity.

  • You can overfit your validation data. If there aren’t many validation items, your model for predicting uncertainty may not generalize beyond your validation data to your unlabeled data.

The first problem can be partially addressed without additional human labels, as you see later in this chapter in section 5.3.2. This fact is one of the biggest strengths of this approach compared with the other uncertainty sampling algorithms.

The overfitting problem can be diagnosed relatively easily too, because it manifests itself as high confidence that an item is an error. If you have a binary prediction for your main model, and your error-prediction model is 95% confident that an item was classified incorrectly, your main model should have classified that item correctly in the first place.

If you find that you are overfitting and that stopping the training earlier doesn’t help, you can try to avoid overfitting by getting multiple predictions, using the ensemble methods from section 3.4 of chapter 3. These methods include training multiple models, using dropouts at inference (Monte Carlo sampling), and drawing from different subsets of the validation items and features. A sketch of the dropout variant follows.
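As a minimal sketch of the dropout variant, assuming that your error-prediction model contains dropout layers (the SimpleUncertaintyPredictor in listing 5.7 may not, so treat that as an assumption):

import torch

def monte_carlo_error_prediction(correct_model, feature_vec, num_runs=20):
    """Average the error model's prediction across multiple dropout runs."""
    correct_model.train() # keeps dropout active at inference time
    probs = []
    with torch.no_grad():
        for i in range(num_runs):
            log_probs = correct_model(feature_vec)
            probs.append(torch.exp(log_probs))
    correct_model.eval()
    return torch.stack(probs).mean(dim=0) # averaged probability distribution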

5.3 Applying active transfer learning to representative sampling

We can apply the same active transfer learning principles to representative sampling. That is, we can adapt our models to predict whether an item is most like the application domain of our model compared with the current training data.

This approach will help with domain adaptation, like the representative sampling methods that you learned in chapter 4. In fact, representative sampling is not too different. In both chapter 4 and the example in the following sections, you are building a new model to predict whether an item is most representative of the data to which you are trying to adapt your model.

5.3.1 Making your model predict what it doesn’t know

In principle, you don’t need your existing model to predict whether an item is in your training data or in your unlabeled data. You can build a new model that uses both your training data and your unlabeled data as a binary prediction problem. In practice, it is useful to include features that are important for the machine learning task that you are trying to build.

Figure 5.9 shows the process and architecture for representative active transfer learning, showing how you can retrain your model to predict whether unlabeled items are more like your current training data or more like the application domain for your model.

Figure 5.9 We can build a model to sample the items that are most unlike the current training data. To begin, we take validation data from the same distribution as the training data and give it a “Training” label. Then we take unlabeled data from our target domain and give it an “Application” label. We train a new output layer to predict the “Training” and “Application” labels, giving it access to all layers of the model. We apply the new model to the unlabeled data (ignoring the unlabeled items that we trained on), and sample the items that are most confidently predicted as “Application.”

As figure 5.9 shows, there are few differences from active transfer learning for uncertainty sampling. First, the original model predictions are ignored. The validation and unlabeled data can be given labels directly. The validation data is from the same distribution as the training data, so it is given a “Training” label. The unlabeled data from the target domain is given an “Application” label. Then the model is trained on these labels.

Second, the new model should have access to more layers. If you are adapting to a new domain, you may have many features that do not yet exist in your training data. In such a case, the only information that your existing model contains is the fact that these features exist in the input layer as features but have not contributed to any other layer in the previous model. The more-complicated type of architecture will capture this information.

5.3.2 Active transfer learning for adaptive representative sampling

Just as representative sampling (chapter 4) can be adaptive, active transfer learning for representative sampling can be adaptive, meaning that you can have multiple iterations within one active learning cycle, as shown in figure 5.10.

Figure 5.10 Because our sampled items will get a human label later, we can assume that they become part of the training data without needing to know what the label is. To begin, we take validation data from the same distribution as the training data and give it a “Training” label. We take unlabeled data from our target domain and give it an “Application” label. We train a new output layer to predict the “Training” and “Application” labels, giving it access to all layers of the model. We apply the new model to the unlabeled data (ignoring the unlabeled items that we trained on) and sample the items that are most confidently predicted as “Application.” We can assume that those items will later get labels and become part of the training data. So we can take those sampled items, change their label from “Application” to “Training,” and retrain our final layer(s) on the new dataset.

The process in figure 5.10 starts like the non-adaptive version. We create new output layers to classify whether an item is in the existing training data or in the target domain, sampling the items that are most confidently predicted as “Application.” To extend the process to the adaptive strategy, we can assume that the sampled items will later get a label and become part of the training data. So we can take those sampled items, change their label from “Application” to “Training,” and retrain our final layer(s) on the new dataset. This process can be repeated until there are no more confident predictions for “Application” domain items, or until you reach the maximum number of items that you want to sample in this iteration of active learning.
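A minimal sketch of that loop follows; it assumes a get_representative_samples() method implementing the non-adaptive architecture in figure 5.9, and the wrapper's name and signature are my assumptions (compare the ATLAS wrapper in listing 5.9, later in this chapter, which has the same shape):

def get_adaptive_representative_samples(self, model, unlabeled_data,
 validation_data, feature_method, number=100, number_per_iteration=10):

    adaptive_samples = []

    while len(adaptive_samples) < number:
        # sample the items most confidently predicted as "Application"
        samples = self.get_representative_samples(model, unlabeled_data,
         validation_data, feature_method, number_per_iteration)

        for item in samples:
            adaptive_samples.append(item)
            unlabeled_data.remove(item)

            # assume the item will get a human label later: flip it from
            # "Application" to "Training" by adding it to the validation
            # data before the final layer(s) are retrained
            validation_data.append(item)

    return adaptive_samples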

5.3.3 The pros and cons of active transfer learning for representative sampling

The pros and cons of active transfer learning for representative sampling are the same as for the simpler representative sampling methods in chapter 4. Compared with those methods, the pros can be more positive because you are using more powerful models, but some of the cons, such as the danger of overfitting, become bigger potential errors.

To summarize those strengths and weaknesses again: representative sampling is effective when you have all the data in a new domain, but if you’re adapting to future data that you haven’t sampled yet, your model can wind up being stuck in the past. This method is also the most prone to noise of all the active learning strategies in this book. If you have new data that is corrupted—text from a language that is not part of your target domain, corrupted image files, artifacts that arise from using different cameras, and so on—any of these factors could look different from your current training data, but not in an interesting way. Finally, active transfer learning for representative sampling can do more harm than good if you apply it in iterations after you use uncertainty sampling, because your application domain will have more items away from the decision boundary than your training data. For these reasons, I recommend that you deploy active transfer learning for representative sampling only in combination with other sampling strategies, as you learned in section 5.1.

5.4 Active transfer learning for adaptive sampling

The final algorithm for active learning in this book is also the most powerful; it is a form of uncertainty sampling that can be adaptive within one iteration of active learning. All the uncertainty sampling techniques that you learned in chapter 3 were non-adaptive. Within one active learning cycle, all these techniques risk sampling items from only one small part of the problem space.

Active transfer learning for adaptive sampling (ATLAS) is an exception, allowing adaptive sampling within one iteration without also using clustering to ensure diversity. ATLAS is introduced here with the caveat that it is the least-tested algorithm in this book at the time of publication. I invented ATLAS in late 2019 when I realized that active transfer learning had certain properties that could be exploited to make it adaptive. ATLAS has been successful on the data that I have been experimenting with, but it has not yet been widely deployed in industry or tested under peer review in academia. As you would with any new method, be prepared to experiment to be certain that this algorithm is right for your data.

5.4.1 Making uncertainty sampling adaptive by predicting uncertainty

As you learned in chapter 3, most uncertainty sampling algorithms have the same problem: they can sample from one part of the feature space, meaning that all the samples are similar in one iteration of active learning. You can end up sampling items from only one small part of your feature space if you are not careful.

As you learned in section 5.1.1, you can address this problem by combining clustering and uncertainty sampling. This approach is still the recommended way to think about beginning your active learning strategy; you can try ATLAS after you have that baseline. You can exploit two interesting properties of active transfer learning for uncertainty sampling:

  • You are predicting whether the model is correct, not the actual label.

  • You can generally expect to predict the labels of your training data items correctly.

Taken together, these two items mean that you can assume that your sampled items will be correct later, even if you don’t yet know the labels (figure 5.11).

Figure 5.11 Because our sampled items will later get a human label and become part of the training data, we can assume that the model will later predict those items correctly, because models are typically the most accurate on the actual items on which they trained. To begin, validation items are predicted by the model and bucketed as “Correct” or “Incorrect,” according to whether they were classified correctly. The last layer of the model is retrained to predict whether items are “Correct” or “Incorrect,” effectively turning the two buckets into new labels. We apply the new model to the unlabeled data, predicting whether each item will be “Correct” or “Incorrect.” We can sample the most likely to be “Incorrect.” Then we can assume that those items will get labels later and become part of the training data, which will be labeled correctly by a model that predicted on that same data. So we can take those sampled items, change their label from “Incorrect” to “Correct,” and retrain our final layer(s) on the new dataset.

The process in figure 5.11 starts like the non-adaptive version. We create new output layers to classify whether an item is “Correct” or “Incorrect,” sampling the items that are most confidently predicted as “Incorrect.” To extend this architecture to the adaptive strategy, we can assume that those sampled items will be labeled later and become part of the training data, and that they will be predicted correctly after they receive a label (whatever that label might be). So we can take those sampled items, change their label from “Incorrect” to “Correct,” and retrain our final layer(s) on the new dataset. This process can be repeated until there are no more confident predictions of “Incorrect” items, or until we reach the maximum number of items that we want to sample in this iteration of active learning. It takes only 10 lines of code to implement ATLAS as a wrapper for active transfer learning for uncertainty sampling.

Listing 5.9 Active transfer learning for adaptive sampling

def get_atlas_samples(self, model, unlabeled_data, validation_data, 
 feature_method, number=100, limit=10000, number_per_iteration=10,
 epochs=10, select_per_epoch=100):
    """Uses transfer learning to predict uncertainty within the model
 
    Keyword arguments:
        model -- machine learning model to get predictions from to determine 
         uncertainty
        unlabeled_data -- data that does not yet have a label
        validation_data -- data with a label that is not in the training set, 
         to be used for transfer learning
        feature_method -- the method for extracting features from your data
        number -- number of items to sample
        number_per_iteration -- number of items to sample per iteration
        limit -- sample from only this many items for faster sampling 
         (-1 = no limit)
    """ 
 
    if(len(unlabeled_data) < number):
        raise Exception('More samples requested than the number of '
         'unlabeled items')
        
    atlas_samples = [] # all items sampled by atlas
        
    while(len(atlas_samples) < number):
        samples = self.get_deep_active_transfer_learning_uncertainty_samples(
         model, unlabeled_data, validation_data, feature_method, 
         number_per_iteration, limit, epochs, select_per_epoch)
                          
        for item in samples:
            atlas_samples.append(item)
            unlabeled_data.remove(item)
 
            item = copy.deepcopy(item)
            item[3] = "seen" # mark this item as already seen
        
            # append so that it is in the next iteration
            validation_data.append(item)
    
    return atlas_samples  

The key line of code adds a copy of the sampled item to the validation data after each cycle. If you are interested in the disaster-response text classification task, try it with this new method for an implementation of ATLAS:

> python active_learning.py --atlas=100 --verbose 

Because you are selecting 10 items by default (number_per_iteration=10) and want 100 total, you should see the model retrain 10 times during the sampling process. Play around with smaller numbers per iteration for a more diverse selection, which will take more time to retrain.

Although ATLAS adds only one step to the active transfer learning for uncertainty sampling architecture that you first learned, it can take a little bit of time to get your head around it. There aren’t many cases in machine learning in which you can confidently give a label to an unlabeled item without human review. The trick is that we are not giving our items an actual label; we know that the label will come later.

5.4.2 The pros and cons of ATLAS

The biggest pro of ATLAS is that it addresses both uncertainty sampling and diversity sampling in one method. This method has another interesting advantage over the other methods of uncertainty sampling: it won’t get stuck in inherently ambiguous parts of your feature space. If you have data that is inherently ambiguous, that data will continue to have high uncertainty for your model. After you annotate the data in one iteration of active learning, your model might still find the most uncertainty in that data in the next iteration. Here, our model’s (false) assumption that it will get this data right later helps us. We need to see only a handful of ambiguous items for ATLAS to start focusing on other parts of our feature space. There aren’t many cases in which a model’s making a mistake will help, but this case is one of them.

The biggest con is the flip side: sometimes, you won’t get enough labels from one part of your feature space. You won’t know for certain how many items you need from each part of your feature space until you get the actual labels. This problem is the equivalent of deciding how many items to sample from each cluster when combining clustering and uncertainty sampling. Fortunately, future iterations of active learning will take you back to this part of your feature space if you don’t have enough labels. So it is safe to underestimate if you know that you will have more iterations of active learning later.

The other cons largely come from the fact that this method is untested and has the most complicated architecture. You may need a fair amount of hyperparameter tuning to build the most accurate models to predict “Correct” and “Incorrect.” If you can’t automate that tuning and need to do it manually, this process is not an automated adaptive process. Because the models are a simple binary task, and you are not retraining all the layers, the models shouldn’t require much tuning.

5.5 Advanced active learning cheat sheets

For quick reference, figures 5.12 and 5.13 show cheat sheets for the advanced active learning strategies in section 5.1 and the active transfer learning techniques in sections 5.2, 5.3, and 5.4.

Figure 5.12 Advanced active learning cheat sheet

 

Figure 5.13 Active transfer learning cheat sheet

5.6 Further reading for active transfer learning

As you learned in this chapter, there is little existing work on the advanced active learning techniques in which one method is used to sample a large number of items and a second method is used to refine the sample. Academic papers about combining uncertainty sampling and diversity sampling focus on single metrics that combine the two methods, but in practice, you can simply chain the methods: apply one method to get a large sample and then refine that sample with another method. The academic papers tend to compare the combined metrics with the individual methods in isolation, so they will not give you an idea of whether they are better than chaining the methods together (section 5.1).

The active transfer learning methods in this chapter are more advanced than the methods currently reported in academic or industry-focused papers. I have given talks about the methods before publishing this book, but all the content in those talks appears in this chapter, so there is nowhere else to read about them. I didn’t discover the possibility of extending active transfer learning to adaptive learning until late 2019, while I was creating the PyTorch library to accompany this chapter. After this book is published, look for papers that cite ATLAS for the up-to-date research.

If you like the fact that ATLAS turns active learning into a machine learning problem in itself, you can find a long list of interesting research papers. For as long as active learning has existed, people have been thinking about how to apply machine learning to the process of sampling items for human review. One good recent paper that I recommend is “Learning Active Learning from Data,” by Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua (http://mng.bz/Gxj8). Look for the most-cited works in this paper and more recent work that cites this paper for approaches to active learning that use machine learning. For a deep dive, look at the PhD dissertation of Ksenia Konyushkova, the first author of the NeurIPS paper, which includes a comprehensive literature review.

For an older paper that looks at ways to combine uncertainty and representative sampling, I recommend “Optimistic Active Learning Using Mutual Information,” by Yuhong Guo and Russ Greiner (http://mng.bz/zx9g).

Summary

  • You have many ways to combine uncertainty sampling and diversity sampling. These techniques will help you optimize your active learning strategy to sample the items for annotation that will most help your model’s accuracy.

  • Combining uncertainty sampling and clustering is the most common active learning technique and is relatively easy to implement after everything that you have learned in this book so far, so it is a good starting point for exploring advanced active learning strategies.

  • Active transfer learning for uncertainty sampling allows you to build a model to predict whether unlabeled items will be labeled correctly, using your existing model as the starting point for the uncertainty-predicting model. This approach allows you to use machine learning within the uncertainty sampling process.

  • Active transfer learning for representative sampling allows you to build a model to predict whether unlabeled items are more like your target domain than your existing training data. This approach allows you to use machine learning within the representative sampling process.

  • ATLAS allows you to extend active transfer learning for uncertainty sampling so that you are not oversampling items from one area of your feature space, combining aspects of uncertainty sampling and diversity sampling into a single machine learning model.


1.“Toward Optimal Active Learning through Sampling Estimation of Error Reduction,” by Nicholas Roy and Andrew McCallum (https://dl.acm.org/doi/10.5555/645530.655646).
