Summary

After a tour of supervised and unsupervised machine learning techniques and their application to real-world datasets in the previous chapters, this chapter introduces the concepts, techniques, and tools of Semi-Supervised Learning (SSL) and Active Learning (AL).

In SSL, we are given a few labeled examples and many unlabeled ones. The goal is either simply to use the labeled examples to classify the unlabeled ones (transductive SSL), or to use both the labeled and unlabeled examples to train a model that correctly classifies new, unseen data (inductive SSL). All SSL techniques rest on one or more of the semi-supervised smoothness, cluster, and manifold assumptions.

Different SSL techniques suit different situations. Simple self-training is straightforward and works with most supervised learning algorithms. When the data offers more than one independent view of the features, co-training is a suitable method. When the cluster assumption holds, the cluster-and-label technique can be used; transductive graph label propagation exploits a "closeness" measure between instances but can be computationally expensive. Transductive SVM (TSVM) performs well on both linearly and non-linearly separable data, and we see an example of training a TSVM with a Gaussian kernel on the UCI Breast Cancer dataset using the JKernelMachines library. The SSL portion of the chapter concludes with experiments comparing SSL models using KEEL, a graphical Java tool.
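To make the self-training loop concrete, here is a minimal sketch, not drawn from the chapter's tools: a nearest-centroid base learner on toy 1-D data repeatedly pseudo-labels the unlabeled point it is most confident about and retrains. The classifier, the data, and the margin-based confidence measure are all illustrative assumptions.

```java
import java.util.*;

/** Self-training sketch: a nearest-centroid learner repeatedly pseudo-labels
 *  the unlabeled point it is most confident about (illustrative toy example). */
public class SelfTrainingDemo {

    /** Runs self-training and returns the final labeled pool as {x, label} pairs. */
    static List<double[]> run(List<double[]> labeled, List<Double> unlabeled) {
        while (!unlabeled.isEmpty()) {
            // Recompute the two class centroids from the current labeled pool.
            double[] sum = new double[2];
            int[] cnt = new int[2];
            for (double[] ex : labeled) { sum[(int) ex[1]] += ex[0]; cnt[(int) ex[1]]++; }
            double c0 = sum[0] / cnt[0], c1 = sum[1] / cnt[1];

            // Confidence = gap between the distances to the two centroids;
            // pick the unlabeled point with the largest gap.
            int best = 0;
            double bestMargin = -1;
            for (int i = 0; i < unlabeled.size(); i++) {
                double x = unlabeled.get(i);
                double margin = Math.abs(Math.abs(x - c0) - Math.abs(x - c1));
                if (margin > bestMargin) { bestMargin = margin; best = i; }
            }
            double x = unlabeled.remove(best);
            int pseudoLabel = Math.abs(x - c0) <= Math.abs(x - c1) ? 0 : 1;
            labeled.add(new double[]{x, pseudoLabel});  // grow the training set
        }
        return labeled;
    }

    public static void main(String[] args) {
        // Toy 1-D data: class 0 clusters near 1.0, class 1 near 9.0.
        List<double[]> labeled = new ArrayList<>(List.of(
                new double[]{1.0, 0}, new double[]{9.0, 1}));
        List<Double> unlabeled = new ArrayList<>(List.of(1.5, 8.5, 2.0, 9.5));
        for (double[] ex : run(labeled, unlabeled)) {
            System.out.println(ex[0] + " -> class " + (int) ex[1]);
        }
    }
}
```

The order in which points are absorbed matters: confident points near the centroids are pseudo-labeled first, so they refine the centroids before the harder, boundary-adjacent points are labeled.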

The second half of the chapter introduced active learning (AL). In this type of learning, various strategies are used to query the unlabeled portion of the dataset so as to present the expert with the examples that will prove most effective in learning from the entire dataset. As the expert, or oracle, labels the selected instances, the learner steadily improves its ability to generalize. AL techniques are characterized by the choice of classifier, or committee of classifiers, and, importantly, by the querying strategy chosen. These strategies include uncertainty sampling, where the instances the model is least confident about are queried; version space sampling, where instances are chosen to shrink the set of hypotheses consistent with the training data; and data distribution sampling, where instances are selected to reduce the expected generalization error. We presented a case study on the UCI abalone dataset to demonstrate active learning in practice, using the JCLAL Java framework.
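As an illustration of the uncertainty sampling strategy, here is a small sketch, not the chapter's JCLAL code: given an assumed fixed logistic scorer and a toy unlabeled pool, the learner queries the instance whose predicted class probability is closest to 0.5 (the least-confident prediction).

```java
/** Uncertainty sampling sketch: query the pool instance whose predicted
 *  class probability is closest to 0.5 (least-confident prediction). */
public class UncertaintySamplingDemo {

    // A fixed toy scorer: logistic probability with decision boundary at x = 5.
    static double prob(double x) { return 1.0 / (1.0 + Math.exp(-(x - 5.0))); }

    /** Returns the index of the instance the learner is least confident about. */
    static int query(double[] pool) {
        int best = 0;
        double bestUncertainty = -1;
        for (int i = 0; i < pool.length; i++) {
            double p = prob(pool[i]);
            // Least-confidence measure: 1 minus the probability of the predicted class.
            double uncertainty = 1.0 - Math.max(p, 1.0 - p);
            if (uncertainty > bestUncertainty) { bestUncertainty = uncertainty; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] pool = {1.0, 4.8, 9.0, 7.0};
        // 4.8 lies nearest the boundary at 5, so it is queried first.
        System.out.println("query index: " + query(pool)); // query index: 1
    }
}
```

In a full active learning loop, the queried instance would be labeled by the oracle, added to the training set, and the model retrained before the next query.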
