Chapter 11

Beyond supervised and unsupervised learning

Abstract

In many practical applications, labeled data is rare or costly to obtain. “Semisupervised” learning exploits unlabeled data to improve the performance of supervised learning. We first discuss how to combine clustering with classification; more specifically, how mixture model clustering using expectation maximization can be combined with Naïve Bayes to blend information from both labeled and unlabeled data. Next, we discuss how a generative approach for learning from unlabeled data, such as the one based on fitting a mixture model, can be combined with discriminative learning from labeled data. Following that, we consider the “cotraining” method for semisupervised learning, which can be applied when two different views are available for the same data. These two views normally correspond to two distinct sets of attributes for the same instances. Finally, we see how cotraining can be combined with expectation maximization in the so-called “Co-EM” method to yield further improvements in predictive performance in some scenarios. In addition to semisupervised learning, this chapter also discusses sophisticated techniques for multi-instance learning, which improve on the simple methods presented in Chapter 4, Algorithms: the basic methods.

Keywords

Semisupervised learning; combining clustering and classification; semisupervised learning using expectation maximization; combining generative and discriminative modeling; cotraining; Co-EM; multi-instance learning

Modern machine learning embraces scenarios that transcend the classic dichotomy of supervised versus unsupervised learning. For example, in many practical applications labeled data is very scarce but unlabeled data is plentiful. “Semisupervised” learning attempts to improve the accuracy of supervised learning by exploiting information in unlabeled data. This sounds like magic, but it can work! This chapter reviews several well-established approaches to semisupervised learning: applying EM-style clustering to classification, combining generative and discriminative methods, and cotraining. We will also see how cotraining and EM-based semisupervised learning can be merged into a single algorithm.

Another nonstandard scenario with many practical applications is multi-instance learning. Here, each example is a bag of instances, each of which describes an aspect of the object to be classified—but there is still only one label for the entire example. Learning from such data poses serious algorithmic challenges, and some heuristic ingenuity may be necessary to make it practical. We will look at three different approaches: converting multi-instance data to single-instance data by aggregating the information in each bag of instances into a single instance, upgrading single-instance algorithms to be able to deal with bags of data, and dedicated approaches to multi-instance learning that do not have a single-instance equivalent.

11.1 Semisupervised Learning

When introducing the machine learning process in Chapter 2, Input: concepts, instances, attributes, we drew a sharp distinction between supervised and unsupervised learning. Recently researchers have begun to explore territory between the two, sometimes called semisupervised learning, in which the goal is classification but the input contains both unlabeled and labeled data. You cannot do classification without labeled data, of course, because only the labels tell what the classes are. But it is sometimes attractive to augment a small amount of labeled data with a large pool of unlabeled data. It turns out that the unlabeled data can help you learn the classes. How can this be?

First, why would you want it? Many situations present huge volumes of raw data, but assigning classes is expensive because it requires human insight. Text mining provides some classic examples. Suppose you want to classify Web pages into predefined groups. In an academic setting you might be interested in faculty pages, graduate student pages, course information pages, research group pages, and department pages. You can easily download thousands, or millions, of relevant pages from university Web sites. But labeling the training data is a laborious manual process. Or suppose your job is to use machine learning to spot names in text, differentiating between personal names, company names, and place names. You can easily download megabytes, or gigabytes, of text, but making this into training data by picking out the names and categorizing them can only be done manually. Cataloging news articles, sorting electronic mail, learning users’ reading interests—applications are legion. Leaving text aside, suppose you want to learn to recognize certain famous people in television broadcast news. You can easily record hundreds or thousands of hours of newscasts, but again labeling is manual. In any of these scenarios it would be enormously attractive to be able to leverage a large pool of unlabeled data to obtain excellent performance from just a few labeled examples—particularly if you were the graduate student who had to do the labeling!

Clustering for Classification

How can unlabeled data be used to improve classification? Here is a simple idea. Use Naïve Bayes to learn classes from a small labeled data set and then extend it to a large unlabeled data set using the EM (expectation–maximization) iterative clustering algorithm of Section 9.3. The procedure is this. First, train a classifier using the labeled data. Second, apply it to the unlabeled data to label it with class probabilities (the “expectation” step). Third, train a new classifier using the labels for all the data (the “maximization” step). Fourth, iterate until convergence. You could think of this as iterative clustering, where starting points and cluster labels are gleaned from the labeled data. The EM procedure guarantees to find model parameters that have equal or greater likelihood at each iteration. The key question, which can only be answered empirically, is whether these higher likelihood parameter estimates will improve classification accuracy.
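
The following sketch shows one way this loop can be written, assuming scikit-learn’s MultinomialNB as the Naïve Bayes learner and dense word-count matrices; duplicating each unlabeled instance once per class with a class-probability weight is just one way of feeding soft labels to the maximization step, and the fixed iteration count stands in for a convergence test.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_iter=10):
    # Train an initial classifier on the labeled data only.
    classes = np.unique(y_lab)
    clf = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(n_iter):
        # Expectation step: label the unlabeled data with class probabilities.
        probs = clf.predict_proba(X_unlab)
        # Maximization step: retrain on all the data. Each unlabeled instance
        # appears once per class, weighted by its estimated class probability.
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] +
                               [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(y_lab))] +
                               [probs[:, i] for i in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf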

Intuitively, this might work well, particularly if the data has many attributes and there are strong relationships between them. Consider document classification. Certain phrases are indicative of the classes. Some occur in labeled documents, whereas others only occur in unlabeled ones. But there are probably some documents that contain both, and the EM procedure uses these to generalize the learned model to utilize phrases that do not appear in the labeled data set. For example, both supervisor and PhD topic might indicate a graduate student’s home page. Suppose that only the former phrase occurs in the labeled documents. EM iteratively generalizes the model to correctly classify documents that contain just the latter.

This might work with any classifier and any iterative clustering algorithm. But it is basically a bootstrapping procedure, and you must take care to ensure that the feedback loop is a positive one. Using probabilities rather than hard decisions seems beneficial because it allows the procedure to converge slowly instead of jumping to conclusions that may be wrong. Naïve Bayes together with the basic probabilistic EM-based clustering procedure described in Section 9.3 are a particularly apt choice because they share the same fundamental assumption: independence between attributes or, more precisely, conditional independence between attributes given the class.

Of course, the independence assumption is universally violated. Even our little example used the two-word phrase PhD topic, whereas actual implementations would likely use individual words as attributes—and the example would have been far less compelling if we had substituted either of the single terms PhD or topic. The phrase PhD students is probably more indicative of faculty than of graduate student home pages; the phrase research topic is probably less discriminating. It is the very fact that PhD and topic are not conditionally independent given the class that makes the example work: it is their combination that characterizes graduate student pages.

Nevertheless, coupling Naïve Bayes and EM in this manner works well in the domain of document classification. In a particular classification task it attained the performance of a traditional learner while using less than one-third of the labeled training instances, supplemented by five times as many unlabeled ones. This is a good tradeoff when labeled instances are expensive but unlabeled ones are virtually free. With a small number of labeled documents, classification accuracy can be improved dramatically by incorporating many unlabeled ones.

Two refinements to the procedure have been shown to improve performance. The first is motivated by experimental evidence that when there are many labeled documents the incorporation of unlabeled data may reduce rather than increase accuracy. Hand-labeled data is (or should be) inherently less noisy than automatically labeled data. The solution is to introduce a weighting parameter that reduces the contribution of the unlabeled data. This can be incorporated into the maximization step of EM by maximizing the weighted likelihood of the labeled and unlabeled instances. When the parameter is close to zero, unlabeled documents have little influence on the shape of EM’s hill-climbing surface; when it is close to one, the algorithm reverts to the original version in which the surface is equally affected by both kinds of document.
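
In the sketch above, this refinement amounts to scaling the weights of the unlabeled copies in the maximization step; the parameter name lam below is illustrative.

# lam close to 0: unlabeled data has little influence; lam = 1: original EM.
w_all = np.concatenate([np.ones(len(y_lab))] +
                       [lam * probs[:, i] for i in range(len(classes))])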

The second refinement is to allow each class to have several clusters. As explained in Section 9.3, the EM clustering algorithm assumes that the data is generated randomly from a mixture of different probability distributions, one per cluster. Until now, a one-to-one correspondence between mixture components and classes has been assumed. In many circumstances this is unrealistic—including document classification, because most documents address multiple topics. With several clusters per class, each labeled document is initially assigned, in a random probabilistic fashion, to the mixture components of its class. The maximization step of the EM algorithm remains as before, but the expectation step is modified to not only probabilistically label each example with the classes, but also to probabilistically assign it to the components within each class. The number of clusters per class is a parameter that depends on the domain and can be set by cross-validation.

Cotraining

Another situation in which unlabeled data can improve classification performance is when there are two different and independent perspectives on the classification task. The classic example again involves documents, this time Web documents, where the two perspectives are the content of a Web page and the links to it from other pages. These two perspectives are well known to be both useful and different: successful Web search engines capitalize on them both, using secret recipes. The text that labels a link to another Web page gives a revealing clue as to what that page is about—perhaps even more revealing than the page’s own content, particularly if the link is an independent one. Intuitively, a link labeled my advisor is strong evidence that the target page is a faculty member’s home page.

The idea, called cotraining, is this. Given a few labeled examples, first learn a different model for each perspective—in this case a content-based and a hyperlink-based model. Then use each one separately to label the unlabeled examples. For each model, select the example that it most confidently labels as positive and the one it most confidently labels as negative, and add these to the pool of labeled examples. Better yet, maintain the ratio of positive and negative examples in the labeled pool by choosing more of one kind than the other. In either case, repeat the whole procedure, training both models on the augmented pool of labeled examples, until the unlabeled pool is exhausted.
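
Here is a minimal sketch of this procedure for a two-class problem, again assuming Naïve Bayes on word-count views; the number of positives and negatives added per round (one of each per view below) is a tunable parameter in practice.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, U1, U2, rounds=30):
    # X1, X2: the two views of the labeled data; U1, U2: the two views of
    # the unlabeled pool; y: labels in {0, 1}.
    y = np.asarray(y)
    for _ in range(rounds):
        if len(U1) == 0:
            break
        chosen = {}  # unlabeled index -> label assigned by one of the views
        for Xv, Uv in ((X1, U1), (X2, U2)):
            clf = MultinomialNB().fit(Xv, y)
            p = clf.predict_proba(Uv)[:, 1]
            chosen[int(np.argmax(p))] = 1   # most confidently positive
            chosen[int(np.argmin(p))] = 0   # most confidently negative
        # (conflicts between the two views are resolved arbitrarily here)
        idx = np.array(sorted(chosen))
        labels = np.array([chosen[i] for i in idx])
        # Move the selected examples (in both views) into the labeled pool.
        X1, X2 = np.vstack([X1, U1[idx]]), np.vstack([X2, U2[idx]])
        y = np.concatenate([y, labels])
        keep = np.setdiff1d(np.arange(len(U1)), idx)
        U1, U2 = U1[keep], U2[keep]
    return MultinomialNB().fit(X1, y), MultinomialNB().fit(X2, y)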

There is some experimental evidence, using Naïve Bayes throughout as the learner, that this bootstrapping procedure outperforms one that employs all the features from both perspectives to learn a single model from the labeled data. It relies on having two different views of an instance that are redundant but not completely correlated. Various domains have been proposed, from spotting celebrities in televised newscasts using video and audio separately to mobile robots with vision, sonar, and range sensors. The independence of the views reduces the likelihood of both hypotheses agreeing on an erroneous label.

EM and Cotraining

On data sets with two feature sets that are truly independent, experiments have shown that cotraining gives better results than using EM as described previously. Even better performance, however, can be achieved by combining the two into a modified version of cotraining called co-EM. Cotraining trains two classifiers representing different perspectives A and B, and uses both to add new examples to the training pool by choosing whichever unlabeled examples they classify most positively or negatively. The new examples are few in number and deterministically labeled. Co-EM, on the other hand, trains classifier A on the labeled data and uses it to probabilistically label all the unlabeled data. Next it trains classifier B on both the labeled data and the unlabeled data with classifier A’s tentative labels, and then it probabilistically relabels all the data for use by classifier A. The process iterates until the classifiers converge. This procedure seems to perform consistently better than cotraining because it does not commit to the class labels that are generated by classifiers A and B but rather reestimates their probabilities at each iteration.
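
A sketch of co-EM under the same assumptions as the earlier examples; the helper fit_soft (a hypothetical name) retrains Naïve Bayes on the labeled data plus soft-labeled unlabeled data, exactly as in the EM sketch above.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def fit_soft(X_lab, y_lab, X_unlab, probs, classes):
    # Labeled data plus one weighted copy of each unlabeled instance per class.
    X = np.vstack([X_lab] + [X_unlab] * len(classes))
    y = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
    w = np.concatenate([np.ones(len(y_lab))] +
                       [probs[:, i] for i in range(len(classes))])
    return MultinomialNB().fit(X, y, sample_weight=w)

def co_em(X1, X2, y, U1, U2, n_iter=10):
    classes = np.unique(y)
    clf_a = MultinomialNB().fit(X1, y)             # view A, labeled data only
    for _ in range(n_iter):
        p_a = clf_a.predict_proba(U1)              # A soft-labels the pool
        clf_b = fit_soft(X2, y, U2, p_a, classes)  # B trains on A's labels
        p_b = clf_b.predict_proba(U2)              # B soft-labels the pool
        clf_a = fit_soft(X1, y, U1, p_b, classes)  # A trains on B's labels
    return clf_a, clf_b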

The range of applicability of co-EM, like cotraining, is still limited by the requirement for multiple independent perspectives. But there is some experimental evidence to suggest that even when there is no natural split of features into independent perspectives, benefits can be achieved by manufacturing such a split and using cotraining—or, better yet, co-EM—on the split data. This seems to work even when the split is made randomly; performance could surely be improved by engineering the split so that the feature sets are maximally independent. Why does this work? Researchers have hypothesized that these algorithms succeed in part because the split makes them more robust to the assumptions that their underlying classifiers make.

There is no particular reason to restrict the base classifier to Naïve Bayes. Support vector machines (SVMs) are particularly useful for text categorization. However, for the EM iteration to work the classifier must label the data probabilistically; it must also be able to use probabilistically weighted examples for training. SVMs can easily be adapted to do both. We explained how to adapt learning algorithms to deal with weighted instances in Section 7.3 under Locally weighted linear regression. One way of obtaining probability estimates from SVMs is to fit a one-dimensional logistic model to their output, effectively performing the logistic regression described in Section 4.6 on the output values. Excellent results have been reported for text classification using co-EM with the SVM classifier. It outperforms other variants of SVM and seems quite robust to varying proportions of labeled and unlabeled data.
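
As an illustration, here is one way such probability estimates could be obtained with scikit-learn: fit a linear SVM and then a one-dimensional logistic model to its decision values. This is only a sketch; in practice the logistic model should be fitted on held-out data rather than the SVM’s own training set.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def probabilistic_svm(X_train, y_train):
    svm = LinearSVC().fit(X_train, y_train)
    # One-dimensional logistic model fitted to the SVM's raw output values.
    scores = svm.decision_function(X_train).reshape(-1, 1)
    calibrator = LogisticRegression().fit(scores, y_train)
    def predict_proba(X):
        return calibrator.predict_proba(svm.decision_function(X).reshape(-1, 1))
    return predict_proba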

Neural Network Approaches

Chapter 10, Deep learning, introduced the idea of using unsupervised pretraining to initialize deep networks. For very large labeled data sets, the use of rectified linear activation functions in purely supervised models has reduced the need for unsupervised pretraining. However, when the amount of labeled data is small relative to a larger source of unlabeled data, unsupervised pretraining methods can be effective.

Chapter 10, Deep learning, also showed how a network can be trained to predict its own input—an autoencoder. When labeled data is available, autoencoders can be augmented with another branch that makes predictions using the labeled data. There is evidence that this can make it easier to learn the reconstructive autoencoding part of the network, and also boost discriminative performance. Evidence suggests that unlabeled data can serve as a form of regularization, allowing higher capacity networks to be used. It may be important to weight the relative importance of the terms in the composite loss function, and validation sets should be used to find the best model complexity (number of layers and units, etc.) to ensure that the model generalizes well to new data.
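
A sketch of such a composite network in Keras (an assumed toolchain; the layer sizes, the loss weighting, and the output names recon and pred are purely illustrative):

import tensorflow as tf
from tensorflow.keras import layers, Model

n_features, n_classes = 784, 10        # assumed input and output sizes

inputs = layers.Input(shape=(n_features,))
hidden = layers.Dense(128, activation="relu")(inputs)
code = layers.Dense(32, activation="relu")(hidden)
# Reconstruction branch (unsupervised) and prediction branch (supervised).
recon = layers.Dense(n_features, name="recon")(layers.Dense(128, activation="relu")(code))
pred = layers.Dense(n_classes, activation="softmax", name="pred")(code)

model = Model(inputs, [recon, pred])
# The loss weights control the relative importance of the two terms; unlabeled
# examples can be given zero sample weight on the "pred" output during training.
model.compile(optimizer="adam",
              loss={"recon": "mse", "pred": "sparse_categorical_crossentropy"},
              loss_weights={"recon": 1.0, "pred": 10.0})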

Designing networks that use the same representation to make multiple kinds of prediction is another way to leverage data from one task to help with another. If one of the tasks is to predict some other feature or set of features that are normally used as input, we can create network configurations that use both supervised and unsupervised learning.

11.2 Multi-instance Learning

We have already encountered another nonstandard learning scenario in Section 4.9: multi-instance learning. This can be viewed as a form of supervised learning where examples are bags of feature vectors rather than individual vectors. It can also be viewed as a form of weakly supervised learning where the “teacher” provides labels for bags of instances, rather than for each individual one. This section describes approaches to multi-instance learning that are more advanced than the simple techniques discussed earlier. First, we consider how to convert multi-instance learning to single-instance learning by transforming the data. Then we discuss how to upgrade single-instance learning algorithms to the multi-instance case. Finally we look at some methods that have no direct equivalent in single-instance learning.

Converting to Single-Instance Learning

Section 4.9 presented some ways of applying standard single-instance learning algorithms to multi-instance data by aggregating the input or the output. Despite their simplicity, these techniques often work surprisingly well in practice. Nevertheless, there are clearly situations in which they will fail. Consider the method of aggregating the input by computing the minimum and maximum values of numeric attributes present in a bag and treating the result as a single instance. This will yield a huge loss of information because attributes are condensed to summary statistics individually and independently. Can a bag be converted to a single instance without discarding quite so much information?

The answer is yes, although the number of attributes that are present in the so-called “condensed” representation may increase substantially. The basic idea is to partition the instance space into regions and create one attribute per region in the single-instance representation. In the simplest case, attributes can be Boolean: if a bag has at least one instance in the region corresponding to a particular attribute the value of the attribute is set to true, otherwise, it is set to false. However, to preserve more information the condensed representation could instead contain numeric attributes whose values are counts that indicate how many instances of the bag lie in the corresponding region.

Regardless of the exact types of attributes that are generated, the main problem is to come up with a partitioning of the input space. A simple approach is to partition it into hypercubes of equal size. Unfortunately, this only works when the space has very few dimensions (i.e., attributes): the number of cubes required to achieve a given granularity grows exponentially with the dimension of the space. One way to make this approach more practical is to use unsupervised learning. Simply take all the instances from all the bags in the training data, discard their class labels, and form a big single-instance data set; then process it with a clustering technique such as k-means. This will create regions corresponding to the different clusters (k regions, in the case of k-means). Then, for each bag, create one attribute per region in the condensed representation and use it as described previously.
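
A sketch of this k-means-based conversion using scikit-learn; each bag becomes a single vector of per-region instance counts.

import numpy as np
from sklearn.cluster import KMeans

def condense_bags(bags, k=10):
    # Cluster all instances from all bags (class labels ignored) to obtain
    # k regions, then count how many of each bag's instances fall in each.
    km = KMeans(n_clusters=k, n_init=10).fit(np.vstack(bags))
    condensed = np.zeros((len(bags), k))
    for i, bag in enumerate(bags):
        condensed[i] = np.bincount(km.predict(bag), minlength=k)
    return condensed, km

# The condensed matrix, together with the bag-level class labels, can now be
# passed to any single-instance learner.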

Clustering is a rather heavy-handed way to infer a set of regions from the training data because it ignores information about class membership. An alternative approach that often yields better results is to partition the instance space using decision tree learning. Each leaf of a tree corresponds to one region of instance space. But how can a decision tree be learned when the class labels apply to entire bags of instances, rather than to individual instances? The approach described under Aggregating the Output in Section 4.9 can be used: take the bag’s class label and attach it to each of its instances. This yields a single-instance data set, ready for decision tree learning. Many of the class labels will be incorrect—the whole point of multi-instance learning is that it is not clear how bag-level labels relate to instance-level ones. However, these class labels are only being used to obtain a partition of instance space. The next step is to transform the multi-instance data set into a single-instance one that represents how instances from each bag are distributed throughout the space. Then another single-instance learning method is applied—perhaps, again, decision tree learning—that determines the importance of individual attributes in the condensed representation, which correspond to regions in the original space.

Using decision trees and clustering yields “hard” partition boundaries, where an instance either does or does not belong to a region. Such partitions can also be obtained using a distance function, combined with some reference points, by assigning instances to their closest reference point. This implicitly divides the space into regions, each corresponding to one reference point. (In fact, this is exactly what happens in k-means clustering: the cluster centers are the reference points.) But there is no fundamental reason to restrict attention to hard boundaries: we can make the region membership function “soft” by using distance—transformed into a similarity score—to compute attribute values in the condensed representation of a bag. All that is needed is some way of aggregating the similarity scores between each bag and reference point into a single value—e.g., by taking the maximum similarity between each instance in that bag and the reference point.

In the simplest case, each instance in the training data can be used as a reference point. That creates a large number of attributes in the condensed representation, but it preserves much of the information from a bag of instances in its corresponding single-instance representation. This method has been successfully applied to multi-instance problems.
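
A sketch of this soft, similarity-based condensation, using a Gaussian-like similarity whose width sigma is an assumed parameter; taking every training instance as a reference point recovers the scheme just described.

import numpy as np

def similarity_features(bags, reference_points, sigma=1.0):
    refs = np.asarray(reference_points)
    features = np.zeros((len(bags), len(refs)))
    for i, bag in enumerate(bags):
        # Squared distances between the bag's instances and the reference
        # points, turned into similarities; keep the maximum over the bag.
        d2 = ((bag[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
        features[i] = np.exp(-d2 / sigma ** 2).max(axis=0)
    return features

# E.g., using every training instance as a reference point:
# X_condensed = similarity_features(train_bags, np.vstack(train_bags))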

Regardless of how the approach is implemented, the basic idea is to convert a bag of instances into a single one by describing the distribution of instances from this bag in instance space. Alternatively, ordinary learning methods can be applied to multi-instance data by aggregating the output rather than the input. Section 4.9 described a simple way: join instances of bags in the training data into a single data set by attaching bag-level class labels to them, perhaps weighting instances to give each bag the same total weight. A single-instance classification model can then be built. At classification time, predictions for individual instances are combined—e.g., by averaging predicted class probabilities.

Although this approach often works well in practice, attaching bag-level class labels to instances is simplistic. Generally, the assumption in multi-instance learning is that only some of the instances—perhaps just one—are responsible for the class label of the associated bag. How can the class labels be corrected to yield a more accurate representation of the true underlying situation? This is obviously a difficult problem; if it were solved, it would make little sense to investigate other approaches to multi-instance learning. One method that has been applied is iterative: start by assigning each instance its bag’s class label and learn a single-instance classification model; then replace the instances’ class labels by the predicted labels of this single-instance classification model for these instances. Repeat the whole procedure until the class labels remain unchanged from one iteration to the next.

Some care is needed to obtain sensible results. For example, suppose every instance in a bag were to receive a class label that differs from the bag’s label. Such a situation should be prevented by forcing the bag’s label on at least one instance—e.g., the one with the largest predicted probability for this class.

This iterative approach has been investigated for the original multi-instance scenario with two-class values, where a bag is positive if and only if one of its instances is positive. In that case it makes sense to assume that all instances from negative bags are truly negative and modify only the class labels of instances from positive bags. At prediction time, bags are classified as positive if one of their instances is classified as positive.
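
A sketch of this iterative relabeling scheme for the two-class case, using logistic regression as the (arbitrarily chosen) single-instance learner; instances from negative bags keep their labels, and each positive bag is forced to retain at least one positive instance.

import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_relabel(bags, bag_labels, max_iter=20):
    X = np.vstack(bags)
    bag_idx = np.concatenate([np.full(len(b), i) for i, b in enumerate(bags)])
    y = np.concatenate([np.full(len(b), bag_labels[i]) for i, b in enumerate(bags)])
    for _ in range(max_iter):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        p = clf.predict_proba(X)[:, 1]
        new_y = y.copy()
        for i, label in enumerate(bag_labels):
            if label == 1:                 # only positive bags are relabeled
                members = np.where(bag_idx == i)[0]
                new_y[members] = (p[members] >= 0.5).astype(int)
                # Force the bag's label onto its most confidently positive instance.
                new_y[members[np.argmax(p[members])]] = 1
        if np.array_equal(new_y, y):       # labels unchanged: stop iterating
            break
        y = new_y
    return clf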

Upgrading Learning Algorithms

Tackling multi-instance learning by modifying the input or output so that single-instance schemes can be applied is appealing because there is a large battery of such techniques that can then be used directly, without any modification. However, it may not be the most efficient approach. An alternative is to adapt the internals of a single-instance algorithm to the multi-instance setting. This can be done in a particularly elegant fashion if the algorithm in question only considers the data through application of a distance (or similarity) function, as with nearest-neighbor classifiers or SVMs. These can be adapted by providing a distance (or similarity) function for multi-instance data that computes a score between two bags of instances.

In the case of kernel-based methods such as SVMs, the similarity must be a proper kernel function that satisfies certain mathematical properties. One that has been used for multi-instance data is the so-called set kernel. Given a kernel function for pairs of instances that SVMs can apply to single-instance data—e.g., one of the functions considered in Section 7.2—the set kernel sums it over all pairs of instances from the two bags being compared. This idea is generic and can be applied with any single-instance kernel function.
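
A sketch of the set kernel built on an ordinary RBF instance kernel (gamma is an assumed parameter); some implementations also normalize the sum, e.g., by the bag sizes.

import numpy as np

def rbf(x, z, gamma=0.1):
    # An ordinary single-instance kernel.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def set_kernel(bag_a, bag_b, kernel=rbf):
    # Sum the instance-level kernel over all pairs of instances from the two bags.
    return sum(kernel(x, z) for x in bag_a for z in bag_b)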

Nearest-neighbor learning has been adapted to multi-instance data by applying variants of the Hausdorff distance, which is defined for sets of points. Given two bags and a distance function between pairs of instances—e.g., the Euclidean distance—the Hausdorff distance between the bags is the largest distance from any instance in one bag to its closest instance in the other bag. It can be made more robust to outliers by using the nth-largest distance rather than the maximum.
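
A sketch of this distance for two bags represented as numpy arrays; n=1 gives the classic Hausdorff distance, while larger n gives the more robust variant.

import numpy as np

def hausdorff(bag_a, bag_b, n=1):
    # Euclidean distances between all pairs of instances from the two bags.
    d = np.sqrt(((bag_a[:, None, :] - bag_b[None, :, :]) ** 2).sum(axis=2))
    # Distance from each instance to its closest instance in the other bag.
    closest = np.concatenate([d.min(axis=1), d.min(axis=0)])
    return np.sort(closest)[-n]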

For learning algorithms that are not based on similarity scores, more work is required to upgrade them to multi-instance data. There are multi-instance algorithms for rule learning and for decision tree learning, but we will not describe them here. Adapting algorithms to the multi-instance case is more straightforward if the algorithm concerned is essentially a numerical optimization strategy that is applied to the parameters of some function by minimizing a loss function on the training data. Logistic regression and multilayer perceptrons fall into this category; both have been adapted to multi-instance learning by augmenting them with a function that aggregates instance-level predictions. The so-called “soft maximum” is a differentiable function that is suitable for this purpose: it aggregates instance-level predictions by taking their (soft) maximum as the bag-level prediction.
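
One possible form of such an aggregation function is sketched below: a log-sum-exp “soft maximum” applied to instance-level log-odds, with an assumed sharpness parameter alpha; published multi-instance versions of logistic regression differ in their details.

import numpy as np

def soft_max(values, alpha=3.0):
    # Differentiable approximation to the maximum; approaches it as alpha grows.
    v = np.asarray(values, dtype=float)
    return np.log(np.sum(np.exp(alpha * v))) / alpha

def bag_probability(bag, weights, bias):
    # Aggregate the instance-level log-odds with the soft maximum, then map
    # the result to a bag-level probability with the logistic function.
    log_odds = bag @ weights + bias
    return 1.0 / (1.0 + np.exp(-soft_max(log_odds)))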

Dedicated Multi-instance Methods

Some multi-instance learning schemes are not based directly on single-instance algorithms. Here is an early technique that was specifically developed for the drug activity prediction problem mentioned in Section 2.2, in which instances are conformations—shapes—of a molecule and a molecule (i.e., a bag) is considered positive if and only if it has at least one active conformation. The basic idea is to learn a single hyperrectangle that contains at least one instance from each positive bag in the training data and no instances from any negative bags. Such a rectangle encloses an area of instance space where all positive bags overlap but contains no negative instances—an area that is common to all active molecules but not represented in any inactive ones. The particular drug activity data originally considered was high-dimensional, with 166 attributes describing each instance, which makes it computationally difficult to find a suitable hyperrectangle. Consequently a heuristic approach was developed that is tuned to this particular problem.

Other geometric shapes can be used instead of hyperrectangles. Indeed, the same basic idea has been applied using hyperspheres (balls). Training instances are treated as potential ball centers. For each one, a radius is found that yields the smallest number of errors for the bags in the training data. The original multi-instance assumption is used to make predictions: a bag is classified as positive if and only if it has at least one instance inside the ball. A single ball is generally not powerful enough to yield good classification performance. However, this method is not intended as a standalone algorithm. Rather, it is advocated as a “weak” learner to be used in conjunction with boosting algorithms (see Section 12.4) to obtain a powerful ensemble classifier—an ensemble of balls.
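
A brute-force sketch of fitting a single ball under the original multi-instance assumption: every training instance is a candidate center, and for each center the radius with the fewest misclassified bags is chosen. (Real implementations use more efficient search; this is only illustrative.)

import numpy as np

def fit_ball(bags, bag_labels):
    labels = np.asarray(bag_labels)
    best_center, best_radius, best_errors = None, None, np.inf
    for center in np.vstack(bags):
        # A bag's distance to the center is that of its closest instance;
        # the bag is classified positive if this distance is within the radius.
        dist = np.array([np.sqrt(((b - center) ** 2).sum(axis=1)).min() for b in bags])
        for radius in dist:                      # candidate radii
            errors = np.sum((dist <= radius).astype(int) != labels)
            if errors < best_errors:
                best_center, best_radius, best_errors = center, radius, errors
    return best_center, best_radius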

The dedicated multi-instance methods discussed so far have hard decision boundaries: an instance either falls inside or outside a ball or hyperrectangle. Other multi-instance algorithms use soft concept descriptions couched in terms of probability theory. The so-called diverse density method is a classic example, again designed with the original multi-instance assumption in mind. Its basic and most commonly used form learns a single reference point in instance space. The probability that an instance is positive is computed from its distance to this point: it is 1 if the instance coincides with the reference point and decreases with increasing distance from this point, usually based on a bell-shaped function.

The probability that a bag is positive is obtained by combining the individual probabilities of the instances it contains, generally using the “noisy-OR” function. This is a probabilistic version of the logical OR. If all instance-level probabilities are 0, the noisy-OR value—and thus the bag-level probability—is 0; if at least one instance-level probability is 1, the value is 1; otherwise the value falls somewhere in between.

The diverse density is defined as the probability of the class labels of the bags in the training data, computed based on this probabilistic model. It is maximized when the reference point is located in an area where positive bags overlap and no negative bags are present, just as for the two geometric methods discussed previously. A numerical optimization routine such as gradient ascent can be used to find the reference point that maximizes the diverse density measure. In addition to the location of the reference point, implementations of diverse density also optimize the scale of the distance function in each dimension, because generally not all attributes are equally important. This can improve predictive performance significantly.
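
A sketch of the diverse density objective for a candidate reference point, with a bell-shaped (Gaussian) instance-level probability, per-dimension scales, and the noisy-OR bag-level combination; maximizing it, e.g., by gradient ascent from several starting points, yields the learned reference point.

import numpy as np

def instance_prob(instance, point, scales):
    # 1 when the instance coincides with the reference point, decaying with
    # scaled squared distance from it.
    return np.exp(-np.sum(scales * (instance - point) ** 2))

def bag_prob(bag, point, scales):
    # Noisy-OR combination of the instance-level probabilities.
    return 1.0 - np.prod([1.0 - instance_prob(x, point, scales) for x in bag])

def diverse_density(point, scales, bags, bag_labels):
    # Probability of the observed bag labels under this model; in practice
    # the log of this quantity is maximized for numerical stability.
    dd = 1.0
    for bag, label in zip(bags, bag_labels):
        p = bag_prob(bag, point, scales)
        dd *= p if label == 1 else 1.0 - p
    return dd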

11.3 Further Reading and Bibliographic Notes

Nigam, McCallum, Thrun, and Mitchell (2000) thoroughly explored the idea of clustering for classification, showing how the EM clustering algorithm can use unlabeled data to improve an initial classifier built by Naïve Bayes. The idea of cotraining is older: Blum and Mitchell (1998) pioneered it and developed a theoretical model for the use of labeled and unlabeled data from different independent perspectives. Nigam and Ghani (2000) analyzed the effectiveness and applicability of cotraining, relating it to the traditional use of standard EM to fill in missing values; they also introduced the co-EM algorithm. Up to this point, cotraining and co-EM were applied mainly to small two-class problems. Ghani (2002) used error-correcting output codes to address multiclass situations with many classes. Brefeld and Scheffer (2004) extended co-EM to use an SVM rather than Naïve Bayes.

Condensing the input data by aggregating information into simple summary statistics is a well-known technique in multirelational learning, used in the RELAGGS system by Krogel and Wrobel (2002), and multi-instance learning can be viewed as a special case of this more general setting (de Raedt, 2008). The idea of replacing simple summary statistics by region-based attributes, derived from partitioning the instance space, was explored by Weidmann, Frank, and Pfahringer (2003), Zhou and Zhang (2007), and Frank and Pfahringer (2013). Using reference points to condense bags was investigated by Chen, Bi, and Wang (2006) and evaluated in a broader context by Foulds and Frank (2008). Andrews et al. (2002) proposed manipulating the class labels of individual instances using an iterative learning process for learning SVM classifiers based on the original multi-instance assumption.

Nearest-neighbor learning based on variants of the Hausdorff distance was investigated by Wang and Zucker (2000). Gärtner et al. (2002) experimented with the set kernel to learn SVM classifiers for multi-instance data. Multi-instance algorithms for rule and decision tree learning, which are not covered here, have been described by Chevaleyre and Zucker (2001), Blockeel, Page, and Srinivasan (2005), and Bjerring and Frank (2011). Logistic regression has been adapted for multi-instance learning by Xu and Frank (2004) and Ray and Craven (2005); multilayer perceptrons have been adapted by Ramon and de Raedt (2000).

Hyperrectangles and spheres were considered as concept descriptions for multi-instance learning by Dietterich et al. (1997) and Auer and Ortner (2004), respectively. The diverse density method is the subject of Maron’s (1998) PhD thesis, and is also described by Maron and Lozano-Pérez (1997). A quicker, heuristic variant is evaluated by Foulds and Frank (2010b).

The multi-instance literature makes many different assumptions regarding the type of concept to be learned, defining, e.g., how the bag-level and instance-level class labels are connected, starting with the original assumption that a bag is labeled positive if and only if one of its instances is positive. A review of assumptions in multi-instance learning can be found in Foulds and Frank (2010a).

11.4 WEKA Implementations

• Multi-instance learning methods (in the multiInstanceLearning package, unless otherwise mentioned)

TLC (creates single-instance representations using partitioning methods)

MILES (single-instance representation using soft memberships, in the multiInstanceFilters package)

MISVM (iterative method for learning an SVM by relabeling instances)

MISMO (SVM with multi-instance kernel)

CitationKNN (nearest-neighbor method with Hausdorff distance)

MITI (learns a decision tree from multi-instance data)

MIRI (learns rule sets for multi-instance data)

MILR (logistic regression for multi-instance data)

MIOptimalBall (learning balls for multi-instance classification)

MIDD (the diverse density method using the noisy-OR function)

QuickDDIterative (a heuristic, faster version of MIDD)
