Chapter 7

Extending instance-based and linear models

Abstract

We begin by revisiting the basic instance-based learning method of nearest-neighbor classification and considering how it can be made more robust and storage efficient by generalizing both exemplars and distance functions. We then discuss two well-known approaches for generalizing linear models that go beyond modeling linear relationships between the inputs and the outputs. The first is based on the so-called kernel trick, which implicitly creates a high-dimensional feature space and models linear relationships in this extended space. We discuss support vector machines for classification and regression, kernel ridge regression, and kernel perceptrons. The second approach is based on applying simple linear models in a network structure that includes nonlinear transformations. This yields neural networks, and we discuss the classical multilayer perceptron. The final part of the chapter discusses an alternative method for tackling learning problems with complex relationships: building linear models that are local in the sense that they only apply to a small part of the input space. We consider model trees, which are decision trees with linear regression models at the leaf nodes, and locally weighted linear regression, which combines instance-based learning and linear regression.

Keywords

Instance-based learning; generalized exemplars; generalized distance functions; kernel trick; support vector machines; kernel ridge regression; kernel perceptron; multilayer perceptron; model trees; locally weighted learning

Instance-based learning and fitting linear models are both classic techniques that have been used for many decades to solve prediction tasks in statistics. In this chapter, we show how these basic methods can be extended to tackle more challenging tasks.

Basic instance-based learning using the nearest-neighbor classifier is quite fickle in the presence of noise and irrelevant attributes, and its predictive performance hinges on employing a distance function that matches the task at hand. It requires the entire training data to be stored, which may not be desirable or even feasible in practice. Finally, it provides no insight into what has been “learned.” To address these deficiencies, we will show how to reduce the number of training examples, how to guard against noisy examples, how to weight attributes to take account of their importance, how to generalize examples to rules to provide insight, and how to generalize distance functions to different types of data.

For linear models, we discuss several ways of extending their applicability to situations where the output is not a linear function of the original attributes. One is to increase the mathematical complexity of the model by forming new attributes based on the original ones, or by combining the output of many linear models to form a far more complex function. The first approach, applied naïvely, greatly increases the computational demands of the learning problem. However, it turns out that there is a neat mathematical device—known as the “kernel trick”—that resolves this issue. We discuss several kernel-based learning methods: support vector machines, kernel regression, and kernel perceptrons. The second approach, based on nonlinear transformations of the outputs of linear models, yields what is known as an artificial neural network. We will discuss multilayer perceptrons, a widely used type of neural network for classification and regression. We will also explain stochastic gradient descent, a simple and fast technique for learning many of the models we discuss—both basic linear models and their extended versions.

We also discuss two other ways to extend linear models. One is to build local linear models by dividing the instance space into regions using a tree learner and fitting models to the leaves of the tree, yielding so-called model trees. Another is to combine instance-based learning with linear models, yielding locally weighted regression. The former approach produces an intelligible model, in contrast to most of the other approaches discussed in this chapter; the latter naturally accommodates incremental learning.

7.1 Instance-Based Learning

In Section 4.7 we saw how the nearest-neighbor rule can be used to implement a basic form of instance-based learning. There are several practical problems with this simple scheme. First, it tends to be slow for large training sets, because the entire set must be searched for each test instance—unless sophisticated data structures such as kD-trees or ball trees are used. Second, it performs badly with noisy data, because the class of a test instance is determined by its single nearest neighbor without any “averaging” to help eliminate noise. Third, it performs badly when different attributes affect the outcome to different extents—in the extreme case, when some attributes are completely irrelevant—because all attributes contribute equally to the distance formula. Fourth, it does not perform explicit generalization, although we intimated in Section 3.5 (and illustrated in Fig. 3.10) that some instance-based learning systems do indeed perform explicit generalization.

Reducing the Number of Exemplars

The plain nearest-neighbor rule stores a lot of redundant exemplars. Yet it is almost always completely unnecessary to save all the examples seen so far. A simple variant is to classify each example with respect to the examples already seen and to save only ones that are misclassified. We use the term exemplars to refer to the already-seen instances that are used for classification. Discarding correctly classified instances reduces the number of exemplars and proves to be an effective way to prune the exemplar database. Ideally, only a single exemplar is stored for each important region of the instance space. However, early in the learning process examples may be discarded that later turn out to be important, possibly leading to some decrease in predictive accuracy. As the number of stored instances increases, the accuracy of the model improves, the system makes fewer mistakes, and consequently fewer new exemplars are added.
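To make the idea concrete, here is a minimal sketch in Python of the store-only-misclassified strategy, using a 1-nearest-neighbor rule over the exemplars collected so far. The function name and the toy data are ours, purely for illustration; this is not the exact procedure used by any particular system.

```python
import numpy as np

def build_exemplar_set(X, y):
    """Keep a training instance as an exemplar only if the exemplars
    stored so far misclassify it (1-nearest-neighbor rule)."""
    exemplars_X, exemplars_y = [X[0]], [y[0]]        # seed with the first instance
    for xi, yi in zip(X[1:], y[1:]):
        distances = [np.linalg.norm(xi - e) for e in exemplars_X]
        predicted = exemplars_y[int(np.argmin(distances))]
        if predicted != yi:                          # store only the mistakes
            exemplars_X.append(xi)
            exemplars_y.append(yi)
    return np.array(exemplars_X), np.array(exemplars_y)

# Tiny invented data set: only the misclassified instances are retained.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1], [0.2, 0.1]])
y = np.array([0, 0, 1, 1, 0])
print(build_exemplar_set(X, y))
```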

Unfortunately, the strategy of only storing misclassified instances does not work well in the face of noise. Noisy examples are very likely to be misclassified, and so the set of stored exemplars tends to accumulate those that are least useful. This effect is easily observed experimentally. Thus this strategy is only a stepping stone on the way toward more effective instance-based learners.

Pruning Noisy Exemplars

Noisy exemplars inevitably lower the performance of any nearest-neighbor scheme that does not suppress them, because they have the effect of repeatedly misclassifying new instances. There are two ways of dealing with this. One is to locate, instead of the single nearest neighbor, the k nearest neighbors for some predetermined constant k, and assign the majority class to the unknown instance. The only problem here is determining a suitable value of k. Plain nearest-neighbor learning corresponds to k=1. The more noise, the greater the optimal value of k. One way to proceed is to perform cross-validation tests with several different values and choose the best. Although this is expensive in computation time, it often yields excellent predictive performance.
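For example, a suitable k can be chosen by cross-validation with an off-the-shelf package such as scikit-learn; the candidate values of k and the synthetic data below are arbitrary choices made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real training set.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

best_k, best_score = 1, -1.0
for k in [1, 3, 5, 7, 9, 11]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    if score > best_score:
        best_k, best_score = k, score
print("chosen k:", best_k, "cross-validated accuracy:", round(best_score, 3))
```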

A second solution is to monitor the performance of each exemplar that is stored and discard ones that do not perform well. This can be done by keeping a record of the number of correct and incorrect classification decisions that each exemplar makes. Two predetermined thresholds are set on the success ratio. When an exemplar’s performance drops below the lower one, it is deleted from the exemplar set. If its performance exceeds the upper threshold, it is used for predicting the class of new instances. If its performance lies between the two, it is not used for prediction but, whenever it is the closest exemplar to the new instance (and thus would have been used for prediction if its performance record had been good enough), its success statistics are updated as though it had been used to classify that new instance.

To accomplish this, we use the confidence limits on the success probability of a Bernoulli process that we derived in Section 5.2. Recall that we took a certain number of successes S out of a total number of trials N as evidence on which to base confidence limits on the true underlying success rate p. Given a certain confidence level of, say, 5%, we can calculate upper and lower bounds and be 95% sure that p lies between them.

To apply this to the problem of deciding when to accept a particular exemplar, suppose that it has been used n times to classify other instances and that s of these have been successes. That allows us to estimate bounds, at a particular confidence level, on the true success rate of this exemplar. Now suppose that the exemplar’s class has occurred c times out of a total number N of training instances. This allows us to estimate bounds on the default success rate, i.e., the probability of successfully classifying an instance of this class without any information about the attribute values. We insist that the lower confidence bound on an exemplar’s success rate exceeds the upper confidence bound on the default success rate. We use the same method to devise a criterion for rejecting a poorly performing exemplar, requiring that the upper confidence bound on its success rate lies below the lower confidence bound on the default success rate.

With suitable choice of thresholds, this scheme works well. In a particular implementation, called IB3 for Instance-Based learner version 3, a confidence level of 5% is used to determine acceptance, whereas a level of 12.5% is used for rejection. The lower percentage figure produces a wider confidence interval, which makes for a more stringent criterion because it is harder for the lower bound of one interval to lie above the upper bound of the other. The criterion for acceptance is more stringent than for rejection, making it more difficult for an instance to be accepted. The reason for a less stringent rejection criterion is that there is little to be lost by dropping instances with only moderately poor classification accuracies: they will probably be replaced by similar instances later. Using these thresholds has been found to improve the performance of instance-based learning and, at the same time, dramatically reduce the number of exemplars—particularly noisy exemplars—that are stored.
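The following sketch shows how such acceptance and rejection tests might be coded, using the normal-approximation (Wilson) confidence bounds on a Bernoulli success rate from Section 5.2. The z-values shown correspond roughly to one-sided 5% and 12.5% levels, but they, and the function names, are illustrative assumptions rather than the exact IB3 implementation.

```python
import math

def wilson_bounds(successes, trials, z):
    """Normal-approximation (Wilson) confidence interval for a Bernoulli
    success rate; z is the standard-normal quantile for the chosen level."""
    if trials == 0:
        return 0.0, 1.0
    p = successes / trials
    centre = p + z * z / (2 * trials)
    spread = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    denom = 1 + z * z / trials
    return (centre - spread) / denom, (centre + spread) / denom

def accept(s, n, c, N, z_accept=1.645):      # z ~ 1.645 for a 5% level (assumed)
    """Accept an exemplar if the lower bound on its success rate (s out of n)
    exceeds the upper bound on the default success rate of its class (c out of N)."""
    lower_exemplar, _ = wilson_bounds(s, n, z_accept)
    _, upper_default = wilson_bounds(c, N, z_accept)
    return lower_exemplar > upper_default

def reject(s, n, c, N, z_reject=1.15):       # z ~ 1.15 for a 12.5% level (assumed)
    """Reject an exemplar if the upper bound on its success rate lies
    below the lower bound on the default success rate."""
    _, upper_exemplar = wilson_bounds(s, n, z_reject)
    lower_default, _ = wilson_bounds(c, N, z_reject)
    return upper_exemplar < lower_default
```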

Weighting Attributes

The Euclidean distance function, modified to scale all attribute values to between 0 and 1, works well in domains in which the attributes are equally relevant to the outcome. Such domains, however, are the exception rather than the rule. In most domains some attributes are irrelevant, and some relevant ones are less important than others. The next improvement in instance-based learning is to learn the relevance of each attribute incrementally by dynamically updating feature weights.

In some schemes, the weights are class specific in that an attribute may be more important to one class than to another. To cater for this, a description is produced for each class that distinguishes its members from members of all other classes. This leads to the problem that an unknown test instance may be assigned to several different classes, or no classes at all—a problem that is all too familiar from our description of rule induction. Heuristic solutions are applied to resolve these situations.

The weighted Euclidean distance metric incorporates the feature weights w1, w2,…, wn on each dimension:

\sqrt{w_1^2 (x_1 - y_1)^2 + w_2^2 (x_2 - y_2)^2 + \cdots + w_n^2 (x_n - y_n)^2}.

In the case of class-specific feature weights, there will be a separate set of weights for each class.

All attribute weights are updated after each training instance is classified, and the most similar exemplar (or the most similar exemplar of each class) is used as the basis for updating. Call the training instance x and the most similar exemplar y. For each attribute i, the difference $|x_i - y_i|$ is a measure of the contribution of that attribute to the decision. If this difference is small then the attribute contributes positively, whereas if it is large it may contribute negatively. The basic idea is to update the ith weight on the basis of the size of this difference and whether the classification was indeed correct. If the classification is correct the associated weight is increased and if it is incorrect it is decreased, the amount of increase or decrease being governed by the size of the difference: large if the difference is small and vice versa. The weight change is generally followed by a renormalization step. A simpler strategy, which may be equally effective, is to leave the weights alone if the decision is correct and, if it is incorrect, to increase the weights of those attributes that differ most greatly, accentuating the difference.
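A rough sketch of the weighted distance and one plausible form of the weight update follows. The learning-rate parameter and the exact update rule are assumptions made for illustration, not a specification of any particular scheme; attribute values are assumed to be scaled to the range 0 to 1.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance with per-attribute weights w."""
    return np.sqrt(np.sum((w * (x - y)) ** 2))

def update_weights(w, x, y, correct, eta=0.1):
    """Illustrative update: a small attribute difference means the attribute
    supported the decision, so its weight changes by a large amount --
    upward if the decision was correct, downward otherwise."""
    delta = eta * (1.0 - np.abs(x - y))   # large change where the difference is small
    w = w + delta if correct else w - delta
    w = np.clip(w, 1e-6, None)            # keep weights positive
    return w / w.sum()                    # renormalize
```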

A good test of whether an attribute weighting scheme works is to add irrelevant attributes to all examples in a data set. Ideally, the introduction of irrelevant attributes should not affect either the quality of predictions or the number of exemplars stored.

Generalizing Exemplars

Removing training exemplars that are noisy or redundant aids understanding of the structure of the data—to some extent. To improve interpretability further, exemplars need to be generalized.

Generalized exemplars are rectangular regions of instance space, called hyperrectangles because they are high-dimensional. When classifying new instances it is necessary to modify the distance function as described below to allow the distance to a hyperrectangle to be computed. When a new exemplar is classified correctly, it is generalized by simply merging it with the nearest exemplar of the same class. The nearest exemplar may be either a single instance or a hyperrectangle. In the former case, a new hyperrectangle is created that covers the old and the new instance. In the latter, the hyperrectangle is enlarged to encompass the new instance. Finally, if the prediction is incorrect and it was a hyperrectangle that was responsible for the incorrect prediction, the hyperrectangle’s boundaries are altered so that it shrinks away from the new instance.

It is necessary to decide at the outset whether overgeneralization caused by nesting or overlapping hyperrectangles is to be permitted or not. If it is to be avoided, a check is made before generalizing a new example to see whether any regions of feature space conflict with the proposed new hyperrectangle. If they do, the generalization is aborted and the example is stored verbatim. Note that overlapping hyperrectangles are precisely analogous to situations in which the same example is covered by two or more rules in a rule set.

In some schemes generalized exemplars can be nested in that they may be completely contained within one another in the same way that, in some representations, rules may have exceptions. To do this, whenever an example is incorrectly classified, a fallback heuristic is tried: if the second nearest neighbor would have produced a correct prediction, it is used as the basis for generalization instead. This second-chance mechanism promotes nesting of hyperrectangles. If an example falls within a rectangle of the wrong class that already contains an exemplar of the same class, the two are generalized into a new “exception” hyperrectangle nested within the original one. For nested generalized exemplars, the learning process frequently begins with a small number of seed instances to prevent all examples of the same class from being generalized into a single rectangle that covers most of the problem space.

Distance Functions for Generalized Exemplars

With generalized exemplars it is necessary to generalize the distance function to compute the distance from an instance to a generalized exemplar, as well as to another instance. The distance from an instance to a hyperrectangle is defined to be zero if the point lies within the hyperrectangle. The simplest way to generalize the distance function to compute the distance from an exterior point to a hyperrectangle is to choose the closest instance within it and measure the distance to that. However, this reduces the benefit of generalization because it reintroduces dependence on a particular single example. More precisely, whereas new instances that happen to lie within a hyperrectangle continue to benefit from generalizations, ones that lie outside do not. It might be better to use the distance from the nearest part of the hyperrectangle instead.
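Computing the distance from a point to an axis-parallel hyperrectangle is straightforward: clip the point's coordinates to the rectangle's bounds and measure the distance to the result. A minimal sketch (the function name is ours):

```python
import numpy as np

def distance_to_hyperrectangle(x, lower, upper):
    """Distance from point x to an axis-parallel hyperrectangle given by its
    lower and upper corner vectors: zero if the point lies inside, otherwise
    the Euclidean distance to the nearest face, edge, or corner."""
    nearest = np.clip(x, lower, upper)   # project x onto the rectangle
    return np.linalg.norm(x - nearest)
```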

Fig. 7.1 shows the implicit boundaries that are formed between two rectangular classes if the distance metric is adjusted to measure distance to the nearest point of a rectangle. Even in two dimensions the boundary contains a total of nine regions (they are numbered for easy identification); the situation will be more complex for higher-dimensional hyperrectangles.

Figure 7.1 A boundary between two rectangular classes.

Proceeding from the lower left, the first region, in which the boundary is linear, lies outside the extent of both rectangles—to the left of both borders of the larger one and below both borders of the smaller one. The second is within the extent of one rectangle—to the right of the leftmost border of the larger rectangle—but outside that of the other—below both borders of the smaller one. In this region the boundary is parabolic, because the locus of a point that is the same distance from a given line as from a given point is a parabola. The third region is where the boundary meets the lower border of the larger rectangle when projected upward and the left border of the smaller one when projected to the right. The boundary is linear in this region, because it is equidistant from these two borders. The fourth is where the boundary lies to the right of the larger rectangle but below the bottom of that rectangle. In this case the boundary is parabolic because it is the locus of points equidistant from the lower right corner of the larger rectangle and the left side of the smaller one. The fifth region lies between the two rectangles: here the boundary is vertical. The pattern is repeated in the upper right part of the diagram: first parabolic, then linear, then parabolic (although this particular parabola is almost indistinguishable from a straight line), and finally linear as the boundary finally escapes from the scope of both rectangles.

This simple situation certainly defines a complex boundary! Of course it is not necessary to represent the boundary explicitly; it is generated implicitly by the nearest-neighbor calculation. Nevertheless, the solution is still not a very good one. Whereas taking the distance from the nearest instance within a hyperrectangle is overly dependent on the position of that particular instance, taking the distance to the nearest point of the hyperrectangle is overly dependent on that corner of the rectangle—the nearest example might be a long way from the corner.

A final problem concerns measuring the distance to hyperrectangles that overlap or are nested. This complicates the situation because an instance may fall within more than one hyperrectangle. A suitable heuristic for use in this case is to choose the class of the most specific hyperrectangle containing the instance, i.e., the one covering the smallest area of instance space.

Whether or not overlap or nesting is permitted, the distance function should be modified to take account of both the observed prediction accuracy of exemplars and the relative importance of different features, as described in the sections above on pruning noisy exemplars and attribute weighting.

Generalized Distance Functions

There are many different ways of defining a distance function, and it is hard to find rational grounds for any particular choice. An elegant solution is to consider one instance being transformed into another through a sequence of predefined elementary operations and to calculate the probability of such a sequence occurring if operations are chosen randomly. Robustness is improved if all possible transformation paths are considered, weighted by their probabilities, and the scheme generalizes naturally to the problem of calculating the distance between an instance and a set of other instances by considering transformations to all instances in the set. Through such a technique it is possible to consider each instance as exerting a “sphere of influence,” but a sphere with soft boundaries rather than the hard-edged cutoff implied by the k-nearest-neighbor rule, in which any particular example is either “in” or “out” of the decision.

With such a measure, given a test instance whose class is unknown, its distance to the set of all training instances in each class in turn is calculated, and the closest class is chosen. It turns out that nominal and numeric attributes can be treated in a uniform manner within this transformation-based approach by defining different transformation sets, and it is even possible to take account of unusual attribute types—such as degrees of arc or days of the week, which are measured on a circular scale.

Discussion

Nearest-neighbor methods gained popularity in machine learning through the work of Aha (1992), who showed that, when combined with noisy exemplar pruning and attribute weighting, instance-based learning performs well in comparison with other methods. It is worth noting that although we have described it solely in the context of classification rather than numeric prediction problems, it applies to these equally well: predictions can be obtained by combining the predicted values of the k nearest neighbors and weighting them by distance.

Viewed in instance space, the standard rule- and tree-based representations are only capable of representing class boundaries that are parallel to the axes defined by the attributes. This is not a handicap for nominal attributes, but it is for numeric ones. Nonaxis-parallel class boundaries can only be approximated by covering the region above or below the boundary with several axis-parallel rectangles, the number of rectangles determining the degree of approximation. In contrast, the instance-based method can easily represent arbitrary linear boundaries. Even with just one example of each of two classes, the boundary implied by the nearest-neighbor rule is a straight line of arbitrary orientation, namely, the perpendicular bisector of the line joining the examples.

Plain instance-based learning does not produce explicit knowledge representations except by selecting representative exemplars. However, when combined with exemplar generalization, a set of rules can be obtained that may be compared with those produced by other machine learning schemes. The rules tend to be more conservative because the distance metric, modified to incorporate generalized exemplars, can be used to process examples that do not fall within the rules. This reduces the pressure to produce rules that cover the whole example space or even all of the training examples. On the other hand, the incremental nature of the instance-based learning scheme we have described means that rules are formed eagerly, after only part of the training set has been seen; and this inevitably reduces their quality.

We have not given precise algorithms for variants of instance-based learning that involve generalization because it is not clear what the best way to do generalization is. Salzberg (1991) suggested that generalization with nested exemplars can achieve a high degree of classification accuracy on a variety of different problems, a conclusion disputed by Wettschereck and Dietterich (1994), who argued that these results were fortuitous and did not hold in other domains. Martin (1995) explored the idea that it is not generalization but the overgeneralization that occurs when hyperrectangles nest or overlap that is responsible for poor performance and demonstrated that if nesting and overlapping are avoided excellent results are achieved in a large number of domains. The generalized distance function based on transformations is described by Cleary and Trigg (1995).

Exemplar generalization is a rare example of a learning strategy in which the search proceeds from specific to general rather than from general to specific as in the case of the tree and rule induction schemes we have described. There is no particular reason why specific-to-general searching should necessarily be handicapped by forcing the examples to be considered in a strictly incremental fashion, and batch-oriented approaches exist that generate rules using a basic instance-based approach. Moreover, it seems that the idea of producing conservative generalizations and coping with instances that are not covered by choosing the “closest” generalization may be generally useful for tree and rule inducers.

7.2 Extending Linear Models

Section 4.6 described how simple linear models can be used for classification in situations where all attributes are numeric. Their biggest disadvantage is that they can only represent linear boundaries between classes, which makes them too simple for many practical applications. Support vector machines use linear models to implement nonlinear class boundaries. (Although it is a widely used term, support vector machines is something of a misnomer: these are algorithms, not machines.) How can this be possible? The trick is easy: transform the input using a nonlinear mapping. In other words, transform the instance space into a new space. With a nonlinear mapping, a straight line in the new space does not look straight in the original instance space. A linear model constructed in the new space can represent a nonlinear decision boundary in the original space.

Imagine applying this idea directly to the ordinary linear models in Section 4.6. For example, the original set of attributes could be replaced by one giving all products of n factors that can be constructed from these attributes. An example for two attributes, including all products with three factors, is

x = w_1 a_1^3 + w_2 a_1^2 a_2 + w_3 a_1 a_2^2 + w_4 a_2^3.

Here, x is the outcome, a1 and a2 are the two attribute values, and there are four weights wi to be learned. As described in Section 4.6, the result can be used for classification by training one linear system for each class and assigning an unknown instance to the class that gives the greatest output x—the standard technique of multiresponse linear regression. Then, a1 and a2 will be the attribute values for the test instance. To generate a linear model in the space spanned by these products, each training instance is mapped into the new space by computing all possible three-factor products of its two attribute values. The learning algorithm is then applied to the transformed instances. To classify an instance, it is processed by the same transformation prior to classification. There is nothing to stop us from adding in more synthetic attributes. For example, if a constant term were included, the original attributes and all two-factor products of them would yield a total of ten weights to be learned. (Alternatively, adding an additional attribute with a constant value would have the same effect.) Indeed, polynomials of sufficiently high degree can approximate arbitrary decision boundaries to any required accuracy.
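As a concrete (if naive) illustration of this explicit approach, the following sketch maps two-attribute instances into the space of all three-factor products and fits an ordinary least-squares model there; the toy data is invented purely to show the mechanics.

```python
import numpy as np

def three_factor_products(a1, a2):
    """Map a two-attribute instance to all products of three factors."""
    return np.array([a1 ** 3, a1 ** 2 * a2, a1 * a2 ** 2, a2 ** 3])

# Invented training data: four instances with two attributes and a numeric target.
X = np.array([[0.5, 1.0], [1.5, -0.5], [2.0, 0.3], [-1.0, 2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])

# Transform each instance, then fit a linear model in the expanded space.
Phi = np.array([three_factor_products(a1, a2) for a1, a2 in X])
weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(weights)   # one weight per synthetic attribute
```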

It seems too good to be true—and it is. As you will probably have guessed, problems arise with this procedure due to the large number of coefficients introduced by the transformation in any realistic setting. The first snag is computational complexity. With 10 attributes in the original data set, suppose we want to include all products with 5 factors: then the learning algorithm will have to determine more than 2000 coefficients. If its run time is cubic in the number of attributes, as it is for linear regression, training will be infeasible. That is a problem of practicality. The second problem is one of principle: overfitting. If the number of coefficients is large relative to the number of training instances, the resulting model will be “too nonlinear”—it will overfit the training data. There are just too many parameters in the model.

The Maximum Margin Hyperplane

Support vector machines address both problems. They are based on an algorithm that finds a special kind of linear model: the maximum margin hyperplane. We already know what a hyperplane is—it is just another term for a linear model. To visualize a maximum margin hyperplane, imagine a two-class data set whose classes are linearly separable; i.e., there is a hyperplane in instance space that classifies all training instances correctly. The maximum margin hyperplane is the one that gives the greatest separation between the classes—it comes no closer to either than it has to. An example is shown in Fig. 7.2, where the classes are represented by open and filled circles, respectively. Technically, the convex hull of a set of points is the tightest enclosing convex polygon: its outline emerges when you connect every point of the set to every other point. Because we have supposed that the two classes are linearly separable, their convex hulls cannot overlap. Among all hyperplanes that separate the classes, the maximum margin hyperplane is the one that is as far as possible from both convex hulls—it is the perpendicular bisector of the shortest line connecting the hulls, which is shown dashed in the figure.

Figure 7.2 A maximum margin hyperplane.

The instances that are closest to the maximum margin hyperplane—the ones with minimum distance to it—are called support vectors. There is always at least one support vector for each class, and often there are more. The important thing is that the set of support vectors uniquely defines the maximum margin hyperplane for the learning problem. Given the support vectors for the two classes, we can easily construct the maximum margin hyperplane. All other training instances are irrelevant—they can be deleted without changing the position and orientation of the hyperplane.

A hyperplane separating the two classes might be written

x = w_0 + w_1 a_1 + w_2 a_2

in the two-attribute case, where a1 and a2 are the attribute values, and there are three weights wi to be learned. However, the equation defining the maximum margin hyperplane can be written in another form, in terms of the support vectors. Write the class value y of a training instance as either 1 (for yes, it is in this class) or −1 (for no, it is not). Then the maximum margin hyperplane can be written

x = b + \sum_{i \text{ is a support vector}} \alpha_i y_i \, \mathbf{a}(i) \cdot \mathbf{a}.

Here, yi is the class value of training instance a(i), while b and αi are numeric parameters that have to be determined by the learning algorithm. Note that a(i) and a are vectors. The vector a represents a test instance—just as the vector [a1, a2] represented a test instance in the earlier formulation. The vectors a(i) are the support vectors, those circled in Fig. 7.2; they are selected members of the training set. The term $\mathbf{a}(i) \cdot \mathbf{a}$ represents the dot product of the test instance with one of the support vectors: $\mathbf{a}(i) \cdot \mathbf{a} = \sum_j a(i)_j a_j$. If you are not familiar with dot product notation, you should still be able to understand the gist of what follows: just think of a(i) as the whole set of attribute values for the i-th support vector. Finally, b and αi are parameters that determine the hyperplane, just as the weights w0, w1, and w2 are parameters that determine the hyperplane in the earlier formulation.

It turns out that finding the support vectors for the training instances and determining the parameters b and αi belong to a standard class of optimization problems known as constrained quadratic optimization problems. There are off-the-shelf software packages for solving these problems. However, the computational complexity can be reduced, and learning accelerated, if special purpose algorithms for training support vector machines are applied—but the details of these algorithms lie beyond the scope of this book.
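Given the support vectors and the parameters b and αi returned by such an optimization package, evaluating the hyperplane for a test instance is just the sum written above. A minimal sketch, with assumed argument names:

```python
import numpy as np

def svm_output(a, support_vectors, alphas, classes, b):
    """x = b + sum over support vectors of alpha_i * y_i * (a(i) . a).
    The alphas and b are assumed to come from a quadratic optimization package;
    the sign of the output determines the predicted class."""
    return b + sum(alpha_i * y_i * np.dot(a_i, a)
                   for a_i, alpha_i, y_i in zip(support_vectors, alphas, classes))
```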

Nonlinear Class Boundaries

We motivated the introduction of support vector machines by claiming that they can be used to model nonlinear class boundaries. However, so far we have only described the linear case. Consider what happens when an attribute transformation, as described above, is applied to the training data before determining the maximum margin hyperplane. Recall that there are two problems with the straightforward application of such transformations to linear models: computational complexity, on the one hand, and overfitting, on the other.

With support vectors, overfitting is reduced. The reason is that it is inevitably associated with instability: with an algorithm that overfits, changing one or two instance vectors will make sweeping changes to large sections of the decision boundary. But the maximum margin hyperplane is relatively stable: it only moves if training instances are added or deleted that are support vectors—and this is true even in the high-dimensional space spanned by the nonlinear transformation. Overfitting is caused by too much flexibility in the decision boundary. The support vectors are global representatives of the whole set of training points, and there are often relatively few of them, which gives little flexibility. Thus overfitting is less likely to occur.

What about computational complexity? This is still a problem. Suppose that the transformed space is a high-dimensional one so that the transformed support vectors and test instance have many components. According to the preceding equation, every time an instance is classified its dot product with all support vectors must be calculated. In the high-dimensional space produced by the nonlinear mapping this is rather expensive. Obtaining the dot product involves one multiplication and one addition for each attribute, and the number of attributes in the new space can be huge. This problem occurs not only during classification but also during training, because the optimization algorithms have to calculate the same dot products very frequently.

Fortunately, it turns out that it is possible to calculate the dot product before the nonlinear mapping is performed, on the original attribute set. A high-dimensional version of the preceding equation is simply

x = b + \sum_{i \text{ is a support vector}} \alpha_i y_i \, (\mathbf{a}(i) \cdot \mathbf{a})^n,

where n is chosen as the number of factors in the transformation (three in the example we used earlier). If you expand the term $(\mathbf{a}(i) \cdot \mathbf{a})^n$, you will find that it contains all the high-dimensional terms that would have been involved if the test and training vectors were first transformed by including all products of n factors and the dot product was taken of the result. (If you actually do the calculation, you will notice that some constant factors—binomial coefficients—are introduced. However, it is primarily the dimensionality of the space that concerns us; the constants merely scale the axes.) Because of this mathematical equivalence, the dot products can be computed in the original low-dimensional space, and the problem becomes feasible. In implementation terms, you take a software package for constrained quadratic optimization and every time $\mathbf{a}(i) \cdot \mathbf{a}$ is evaluated you evaluate $(\mathbf{a}(i) \cdot \mathbf{a})^n$ instead. It is as simple as that, because in both the optimization and the classification algorithms these vectors are only ever used in this dot product form. The training vectors, including the support vectors, and the test instance all remain in the original low-dimensional space throughout the calculations.

The function $(\mathbf{x} \cdot \mathbf{y})^n$, which computes the dot product of two vectors x and y and raises the result to the power n, is called a polynomial kernel. A good way of choosing the value of n is to start with 1 (a linear model) and increment it until the estimated error ceases to improve. Usually, quite small values suffice. To include lower-order terms, we can use the kernel $(\mathbf{x} \cdot \mathbf{y} + 1)^n$.
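The equivalence is easy to verify numerically. The sketch below compares the polynomial kernel computed in the original two-dimensional space with the dot product of explicitly expanded degree-3 feature vectors; the square-root-of-3 factors are the binomial coefficients mentioned above.

```python
import numpy as np

def poly_kernel(x, y, n=3):
    """Polynomial kernel: dot product raised to the power n."""
    return np.dot(x, y) ** n

def phi(v):
    """Explicit degree-3 feature map for two attributes; the sqrt(3) factors
    absorb the binomial coefficients so that phi(x).phi(y) = (x.y)^3."""
    v1, v2 = v
    return np.array([v1 ** 3, np.sqrt(3) * v1 ** 2 * v2,
                     np.sqrt(3) * v1 * v2 ** 2, v2 ** 3])

x = np.array([0.7, -1.2])
y = np.array([1.5, 0.4])
print(poly_kernel(x, y))         # kernel evaluated in the original space
print(np.dot(phi(x), phi(y)))    # dot product in the expanded space: same value
```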

Other kernel functions can be used instead to implement different nonlinear mappings. Two that are often suggested are the radial basis function (RBF) kernel and the sigmoid kernel. Which one produces the best results depends on the application, although the differences are rarely large in practice. It is interesting to note that a support vector machine with the RBF kernel corresponds to a type of neural network called an RBF network (which we describe later), and one with the sigmoid kernel implements another type of neural network, a multilayer perceptron with one hidden layer (also described later).

Mathematically, any function $K(\mathbf{x}, \mathbf{y})$ is a kernel function if it can be written $K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$, where Φ is a function that maps an instance into a (potentially high-dimensional) feature space. In other words, the kernel function represents a dot product in the feature space created by Φ. Practitioners sometimes apply functions that are not proper kernel functions (the sigmoid kernel with certain parameter settings is an example). Despite the lack of theoretical guarantees, this can nevertheless produce accurate classifiers.

Throughout this section, we have assumed that the training data is linearly separable—either in the instance space or in the new space spanned by the nonlinear mapping. It turns out that support vector machines can be generalized to the case where the training data is not separable. This is accomplished by placing an upper bound C on the coefficients αi. Unfortunately this parameter must be chosen by the user, and the best setting can only be determined by experimentation. Also, except in trivial cases, it is not possible to determine a priori whether the data is linearly separable or not.

Finally, we should mention that compared with other methods such as decision tree learners, even the fastest training algorithms for support vector machines are slow when applied in the nonlinear setting. On the other hand, they often produce very accurate classifiers because subtle and complex decision boundaries can be obtained.

Support Vector Regression

The concept of a maximum margin hyperplane only applies to classification. However, support vector machine algorithms have been developed for numeric prediction that share many of the properties encountered in the classification case: they produce a model that can usually be expressed in terms of a few support vectors and can be applied to nonlinear problems using kernel functions. As with regular support vector machines, we will describe the concepts involved but do not attempt to describe the algorithms that actually perform the work.

As with linear regression, covered in Section 4.6, the basic idea is to find a function that approximates the training points well by minimizing the prediction error. The crucial difference is that all deviations up to a user-specified parameter ε are simply discarded. Also, when minimizing the error, the risk of overfitting is reduced by simultaneously trying to maximize the flatness of the function. Another difference is that what is minimized is normally the predictions’ absolute error instead of the squared error used in linear regression. (There are, however, versions of the algorithm that use the squared error instead.)

A user-specified parameter ε defines a tube around the regression function in which errors are ignored: for linear support vector regression, the tube is a cylinder. If all training points can fit within a tube of width 2ε, the algorithm outputs the function in the middle of the flattest tube that encloses them. In this case the total perceived error is zero. Fig. 7.3A shows a regression problem with one attribute, a numeric class, and eight instances. In this case ε was set to 1, so the width of the tube around the regression function (indicated by dotted lines) is 2. Fig. 7.3B shows the outcome of the learning process when ε is set to 2. As you can see, the wider tube makes it possible to learn a flatter function.

Figure 7.3 Support vector regression: (A) ε=1; (B) ε=2; (C) ε=0.5.

The value of ε controls how closely the function will fit the training data. Too large a value will produce a meaningless predictor—in the extreme case, when 2ε exceeds the range of class values in the training data, the regression line is horizontal and the algorithm just predicts the mean class value. On the other hand, for small values of ε there may be no tube that encloses all the data. In that case some training points will have nonzero error, and there will be a tradeoff between the prediction error and the tube’s flatness. In Fig. 7.3C, ε was set to 0.5 and there is no tube of width 1 that encloses all the data.

For the linear case, the support vector regression function can be written

x = b + \sum_{i \text{ is a support vector}} \alpha_i \, \mathbf{a}(i) \cdot \mathbf{a}.

As with classification, the dot product can be replaced by a kernel function for nonlinear problems. The support vectors are all those points that do not fall strictly within the tube—i.e., the points outside the tube and on its border. As with classification, all other points have coefficient 0 and can be deleted from the training data without changing the outcome of the learning process—just as in the classification case, we obtain a so-called sparse model. In contrast to the classification case, the αi may be negative.

We have mentioned that as well as minimizing the error, the algorithm simultaneously tries to maximize the flatness of the regression function. In Fig. 7.3A and B, where there is a tube that encloses all the training data, the algorithm simply outputs the flattest tube that does so. However, in Fig. 7.3C there is no tube with error 0, and a tradeoff is struck between the prediction error and the tube’s flatness. This tradeoff is controlled by enforcing an upper limit C on the absolute value of the coefficients αi. The upper limit restricts the influence of the support vectors on the shape of the regression function and is a parameter that the user must specify in addition to ε. The larger C is, the more closely the function can fit the data. In the degenerate case ε=0 the algorithm simply performs least-absolute-error regression under the coefficient size constraint, and all training instances become support vectors. Conversely, if ε is large enough that the tube can enclose all the data, the error becomes zero, there is no tradeoff to make, and the algorithm outputs the flattest tube that encloses the data irrespective of the value of C.
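In practice one would normally use an off-the-shelf implementation and tune ε and C experimentally. The following sketch uses scikit-learn's SVR on invented one-attribute data; the particular parameter values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVR

# Invented one-attribute regression data with a little noise.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)

# C bounds the coefficients; epsilon sets the half-width of the tube.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
print("number of support vectors:", len(model.support_))  # points on or outside the tube
```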

Kernel Ridge Regression

Chapter 4, Algorithms: the basic methods, introduced classic least-squares linear regression as a technique for predicting numeric quantities. In “Nonlinear class boundaries” section we saw how the powerful idea of support vector machines can be applied to regression, and, furthermore, how nonlinear problems can be tackled by replacing the dot product in the support vector formulation by a kernel function—this is often known as the “kernel trick.” For classic linear regression using squared loss, only simple matrix operations are needed to find the model, but this is not the case for support vector regression with the user-specified loss parameter ε. It would be nice to combine the power of the kernel trick with the simplicity of standard least-squares regression. Kernel ridge regression does just that. In contrast to support vector regression, it does not ignore errors smaller than ε, and the squared error is used instead of the absolute error.

Instead of expressing the linear regression model’s predicted class value for a given test instance a as a weighted sum of the attribute values, as in Chapter 4, Algorithms: the basic methods, it can be expressed as a weighted sum over the dot products of each training instance aj and the test instance in question:

\sum_{j=1}^{n} \alpha_j \, \mathbf{a}_j \cdot \mathbf{a}

where we assume that the function goes through the origin and an intercept is not required. This involves a coefficient αj for each training instance, which resembles the situation with support vector machines—except that here j ranges over all instances in the training data, not just the support vectors. Again, the dot product can be replaced by a kernel function to yield a nonlinear model.

The sum of the squared errors of the model’s predictions on the training data is given by

\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{n} \alpha_j \, \mathbf{a}_j \cdot \mathbf{a}_i \Big)^2.

This is the squared loss, just as in Chapter 4, Algorithms: the basic methods, and again we seek to minimize it by choosing appropriate αj’s. But now there is a coefficient for each training instance, not just for each attribute, and most data sets have far more instances than attributes. This means that there is a serious risk of overfitting the training data when a kernel function is used instead of the dot product to obtain a nonlinear model.

That is where the ridge part of kernel ridge regression comes in. Instead of minimizing the squared loss, we trade closeness of fit against model complexity by introducing a penalty term:

\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{n} \alpha_j \, \mathbf{a}_j \cdot \mathbf{a}_i \Big)^2 + \lambda \sum_{i,j=1}^{n} \alpha_i \alpha_j \, \mathbf{a}_j \cdot \mathbf{a}_i.

The second sum penalizes large coefficients. This prevents the model from placing too much emphasis on individual training instances by giving them large coefficients, unless this yields a correspondingly large drop in error. The parameter λ controls the tradeoff between closeness of fit and model complexity. When matrix operations are used to solve for the coefficients of the model, the ridge penalty also has the added benefit of stabilizing degenerate cases. For this reason, it is often applied in standard least-squares linear regression as well.
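Assuming the kernel matrix of pairwise kernel values is invertible (e.g., positive definite), minimizing this penalized loss has a simple closed-form solution, α = (K + λI)⁻¹ y, which the following sketch implements directly; the function names are ours.

```python
import numpy as np

def fit_kernel_ridge(X, y, kernel, lam):
    """Solve (K + lam*I) alpha = y, where K[i, j] = kernel(X[i], X[j])."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(x, X, alpha, kernel):
    """Prediction: weighted sum of kernel values with all training instances."""
    return sum(a_j * kernel(xj, x) for a_j, xj in zip(alpha, X))

# Example with a quadratic polynomial kernel on tiny invented data.
kernel = lambda u, v: (np.dot(u, v) + 1) ** 2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 5.0, 10.0])          # roughly x^2 + 1
alpha = fit_kernel_ridge(X, y, kernel, lam=0.1)
print(predict_kernel_ridge(np.array([1.5]), X, alpha, kernel))
```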

Although kernel ridge regression has the advantage over support vector machines of computational simplicity, one disadvantage is that there is no sparseness in the vector of coefficients—in other words, no concept of “support vectors.” This makes a difference at prediction time, because support vector machines have to sum only over the set of support vectors, not the entire training set.

In a typical situation with more instances than attributes, kernel ridge regression is more computationally expensive than standard linear regression—even when using the dot product rather than a kernel. This is because of the complexity of the matrix inversion operation used to find the model’s coefficient vector. Standard linear regression requires inverting an m×m matrix, which has complexity O(m^3) where m is the number of attributes in the data. Kernel ridge regression, on the other hand, involves an n×n matrix, with complexity O(n^3) where n is the number of instances in the training data. Nevertheless, it is advantageous to use kernel ridge regression in cases where a nonlinear fit is desired, or where there are more attributes than training instances.

The Kernel Perceptron

In Section 4.6 we introduced the perceptron algorithm for learning a linear classifier. It turns out that the kernel trick can also be used to upgrade this algorithm to learn nonlinear decision boundaries. To see this, we first revisit the linear case. The perceptron algorithm repeatedly iterates through the training data instance by instance and updates the weight vector every time one of these instances is misclassified based on the weights learned so far. The weight vector is updated simply by adding or subtracting the instance’s attribute values to or from it. This means that the final weight vector is just the sum of the instances that have been misclassified. The perceptron makes its predictions based on whether

\sum_i w_i a_i

is greater or less than zero—where wi is the weight for the ith attribute and ai the corresponding attribute value of the instance that we wish to classify. Instead, we could use

\sum_i \sum_j y(j) \, a'(j)_i \, a_i.

Here, a′(j) is the jth misclassified training instance, a′(j)i its ith attribute value, and y(j) its class value (either +1 or −1). To implement this we no longer keep track of an explicit weight vector: we simply store the instances that have been misclassified so far and use the above expression to make a prediction.

It looks like we have gained nothing—in fact, the algorithm is much slower because it iterates through all misclassified training instances every time a prediction is made. However, closer inspection of this formula reveals that it can be expressed in terms of dot products between instances. First, swap the summation signs to yield

\sum_j y(j) \sum_i a'(j)_i \, a_i.

The second sum is just a dot product between two instances and can be written

\sum_j y(j) \, \mathbf{a}'(j) \cdot \mathbf{a}.

This rings a bell! A similar expression for support vector machines enabled the use of kernels. Indeed, we can apply exactly the same trick here and use a kernel function instead of the dot product. Writing this function as K(…) gives

\sum_j y(j) \, K(\mathbf{a}'(j), \mathbf{a}).

In this way the perceptron algorithm can learn a nonlinear classifier simply by keeping track of the instances that have been misclassified during the training process and using this expression to form each prediction.
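A minimal sketch of the kernel perceptron follows: it stores the misclassified instances together with their class values and predicts with the kernelized sum above. The number of passes over the data and the tie-breaking rule at zero are arbitrary choices made for illustration.

```python
import numpy as np

def train_kernel_perceptron(X, y, kernel, epochs=10):
    """Store the misclassified instances (with classes +1/-1) and use them,
    via the kernel, in place of an explicit weight vector."""
    mistakes = []                                # pairs (instance, class value)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(yj * kernel(xj, xi) for xj, yj in mistakes)
            if (1 if score > 0 else -1) != yi:   # misclassified: remember it
                mistakes.append((xi, yi))
    return mistakes

def kernel_perceptron_predict(x, mistakes, kernel):
    score = sum(yj * kernel(xj, x) for xj, yj in mistakes)
    return 1 if score > 0 else -1

# Example kernel: the polynomial kernel (x . y + 1)^2 discussed earlier.
poly2 = lambda u, v: (np.dot(u, v) + 1) ** 2
```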

If a separating hyperplane exists in the high-dimensional space implicitly created by the kernel function, this algorithm will learn one. However, it won’t learn the maximum-margin hyperplane found by a support vector machine classifier. This means that classification performance is usually worse. On the plus side, the algorithm is easy to implement and supports incremental learning.

This classifier is called the kernel perceptron. It turns out that all sorts of algorithms for learning linear models can be upgraded by applying the kernel trick in a similar fashion. For example, logistic regression can be turned into kernel logistic regression. As we saw above, the same applies to regression problems: linear regression can also be upgraded using kernels. Again, a drawback of these advanced methods for linear and logistic regression (if they are done in a straightforward manner) is that the solution is not “sparse”: every training instance contributes to the solution vector. In support vector machines and the kernel perceptron, only some of the training instances affect the solution, and this can make a big difference to computational efficiency.

The solution vector found by the perceptron algorithm depends greatly on the order in which the instances are encountered. One way to make the algorithm more stable is to use all the weight vectors encountered during learning, not just the final one, letting them vote on a prediction. Each weight vector contributes a certain number of votes. Intuitively, the “correctness” of a weight vector can be measured roughly as the number of successive trials after its inception in which it correctly classified subsequent instances and thus didn’t have to be changed. This measure can be used as the number of votes given to the weight vector, giving an algorithm known as the voted perceptron that performs almost as well as a support vector machine. (Note that, as mentioned earlier, the various weight vectors in the voted perceptron don’t need to be stored explicitly, and the kernel trick can be applied here too.)

Multilayer Perceptrons

Using a kernel is not the only way to create a nonlinear classifier based on the perceptron. In fact, kernel functions are a fairly recent development in machine learning. Previously, neural network proponents used a different approach for nonlinear classification: they connected many simple perceptron-like models in a hierarchical structure. This approach has seen a dramatic resurgence in the form of deep learning, which we cover in Chapter 10, Deep learning.

Section 4.6 explained that a perceptron represents a hyperplane in instance space. We mentioned there that it is sometimes described as an artificial “neuron.” Of course, human and animal brains successfully undertake very complex classification tasks—e.g., image recognition. The functionality of each individual neuron in a brain is certainly not sufficient to perform these feats. How can they be solved by brain-like structures? The answer must lie in the fact that the neurons in the brain are massively interconnected, allowing a problem to be decomposed into subproblems that can be solved at the neuron level. This observation inspired the development of artificial networks of neurons—neural nets.

Consider the simple data sets in Fig. 7.4. Fig. 7.4A shows a two-dimensional instance space with four instances having classes 0 and 1, represented by white and black dots, respectively. No matter how you draw a straight line through this space, you will not be able to find one that separates all the black points from all the white ones. In other words, the problem is not linearly separable, and the simple perceptron algorithm will fail to generate a separating hyperplane (in this two-dimensional instance space a hyperplane is just a straight line). The situation is different in Fig. 7.4B and C: both these problems are linearly separable. The same holds for Fig. 7.4D, which shows two points in a one-dimensional instance space (in the case of one dimension the separating hyperplane degenerates to a separating point).

Figure 7.4 Example data sets and corresponding perceptrons.

If you are familiar with propositional logic, you may have noticed that the four situations in Fig. 7.4 correspond to four types of logical connectives. Fig. 7.4A represents a logical XOR, where the class is 1 if and only if exactly one of the attributes has value 1. Fig. 7.4B represents logical AND, where the class is 1 if and only if both attributes have value 1. Fig. 7.4C represents OR, where the class is 0 only if both attributes have value 0. Fig. 7.4D represents NOT, where the class is 0 if and only if the attribute has value 1. Because the last three are linearly separable, a perceptron can represent AND, OR, and NOT. Indeed, perceptrons for the corresponding data sets are shown in Fig. 7.4F, G, and H, respectively. However, a simple perceptron cannot represent XOR, because that is not linearly separable. To build a classifier for this type of problem a single perceptron is not sufficient: we need several of them.

Fig. 7.4E shows a network with three perceptrons, or units, labeled A, B, and C. The first two are connected to what is sometimes called the input layer of the network, representing the attributes in the data. As in a simple perceptron, the input layer has an additional constant input called the bias. However, the third unit does not have any connections to the input layer. Its input consists of the output of units A and B (either 0 or 1) and another constant bias unit. These three units make up the hidden layer of the multilayer perceptron. They are called “hidden” because the units have no direct connection to the environment. This layer is what enables the system to represent XOR. You can verify this by trying all four possible combinations of input signals. For example, if attribute a1 has value 1 and a2 has value 1, then unit A will output 1 (because 1×1 + 1×1 − 0.5×1 > 0), unit B will output 0 (because −1×1 − 1×1 + 1.5×1 < 0), and unit C will output 0 (because 1×1 + 1×0 − 1.5×1 < 0). This is the correct answer. Closer inspection of the behavior of the three units reveals that the first one represents OR, the second represents NAND (NOT combined with AND), and the third represents AND. Together they represent the expression (a1 OR a2) AND (a1 NAND a2), which is precisely the definition of XOR.
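The behavior of this network is easy to check in a few lines of code; the sketch below hard-codes the weights just described and prints the XOR truth table.

```python
def step(x):
    """Step activation: output 1 for nonnegative input, 0 otherwise."""
    return 1 if x >= 0 else 0

def xor_network(a1, a2):
    """The three units of Fig. 7.4E; the last weight in each unit multiplies
    the constant bias input."""
    A = step(1 * a1 + 1 * a2 - 0.5)        # OR
    B = step(-1 * a1 - 1 * a2 + 1.5)       # NAND
    C = step(1 * A + 1 * B - 1.5)          # AND of the two hidden outputs
    return C

for a1 in (0, 1):
    for a2 in (0, 1):
        print(a1, a2, "->", xor_network(a1, a2))   # prints the XOR truth table
```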

As this example illustrates, any expression from propositional logic can be converted into a multilayer perceptron, because the three connectives AND, OR, and NOT are sufficient for this and we have seen how each can be represented using a perceptron. Individual units can be connected together to form arbitrarily complex expressions. Hence, a multilayer perceptron has the same expressive power as, say, a decision tree. In fact, it turns out that a two-layer perceptron (not counting the input layer) is sufficient. In this case, each unit in the hidden layer corresponds to a variant of AND—a variant because we assume that it may negate some of the inputs before forming the conjunction—joined by an OR that is represented by a single unit in the output layer. In other words, in this particular neural network setup each node in the hidden layer has the same role as a leaf in a decision tree or a single rule in a set of decision rules.

The big question is how to learn a multilayer perceptron. There are two aspects to the problem: learning the structure of the network and learning the connection weights. It turns out that there is a relatively simple algorithm for determining the weights given a fixed network structure. This algorithm is called backpropagation and is described in “Backpropagation” section. However, although there are many algorithms that attempt to identify network structure, this aspect of the problem is commonly solved by experimentation—perhaps combined with a healthy dose of expert knowledge. Sometimes the network can be separated into distinct modules that represent identifiable subtasks (e.g., recognizing different components of an object in an image recognition problem), which opens up a way of incorporating domain knowledge into the learning process. Often a single hidden layer is all that is necessary, and an appropriate number of units for that layer is determined by maximizing the estimated accuracy.

Backpropagation

Suppose we have some data and seek a multilayer perceptron that is an accurate predictor for the underlying classification problem. Given a fixed network structure, we must determine appropriate weights for the connections in the network. In the absence of hidden layers, the perceptron learning rule from Section 4.6 can be used to find suitable values. But suppose there are hidden units. We know what the output unit should predict, and could adjust the weights of the connections leading to that unit based on the perceptron rule. But the correct outputs for the hidden units are unknown, so the rule cannot be applied there.

It turns out that, roughly speaking, the solution is to modify the weights of the connections leading to the hidden units based on the strength of each unit’s contribution to the final prediction. There is a standard mathematical optimization algorithm, called gradient descent, which achieves exactly that. The standard gradient descent algorithm requires taking derivatives, and the step function that the simple perceptron uses to convert the weighted sum of the inputs into a 0/1 prediction is not differentiable. We need to see whether the step function can be replaced by something else.

Fig. 7.5A shows the step function: if the input is smaller than zero, it outputs zero; otherwise, it outputs one. We want a function that is similar in shape but differentiable. A commonly used replacement is shown in Fig. 7.5B. In neural networks terminology it is called the sigmoid function, and it is defined by

f(x) = \frac{1}{1 + e^{-x}}.

Figure 7.5 Step vs sigmoid: (A) step function; (B) sigmoid function.

We encountered it in Section 4.6 when we described the logit transform used in logistic regression. In fact, learning a multilayer perceptron is closely related to logistic regression.

To apply the standard gradient descent procedure, the error function—the thing that is to be minimized by adjusting the weights—must also be differentiable. The number of misclassifications—measured by the discrete 0–1 loss mentioned in Section 5.7—does not fulfill this criterion. Instead, multilayer perceptrons are usually trained by minimizing the squared error of the network’s output, essentially treating it as an estimate of the class probability. (Other loss functions are also applicable. For example, if the negative log-likelihood is used instead of the squared error, learning a sigmoid-based perceptron is identical to logistic regression.)

We work with the squared-error loss function because it is most widely used. For a single training instance, it is

E = \frac{1}{2}\,(y - f(x))^2,


where f(x) is the network’s prediction obtained from the output unit and y is the instance’s class label (in this case, it is assumed to be either 0 or 1). The factor 1/2 is included just for convenience, and will drop out when we start taking derivatives.

Gradient descent exploits information given by the derivative of the function that is to be minimized—in this case, the error function. As an example, consider a hypothetical error function that happens to be identical to w^2 + 1, shown in Fig. 7.6. The x-axis represents a hypothetical parameter w that is to be optimized. The derivative of w^2 + 1 is simply 2w. The crucial observation is that, based on the derivative, we can figure out the slope of the function at any particular point. If the derivative is negative, the function slopes downward to the right; if it is positive, it slopes downward to the left; and the size of the derivative determines how steep the decline is. Gradient descent is an iterative optimization procedure that uses this information to adjust a function’s parameters. It takes the value of the derivative, multiplies it by a small constant called the learning rate, and subtracts the result from the current parameter value. This is repeated for the new parameter value, and so on, until a minimum is reached.

Figure 7.6 Gradient descent using the error function w^2 + 1.

Returning to the example, assume that the learning rate is set to 0.1 and the current parameter value w is 4. The derivative at this point is twice this value, i.e., 8. Multiplying by the learning rate yields 0.8, and subtracting this from 4 gives 3.2, which becomes the new parameter value. Repeating the process for 3.2, we get 2.56, then 2.048, and so on. The little crosses in Fig. 7.6 show the values encountered in this process. The process stops once the change in parameter value becomes too small. In the example this happens when the value approaches 0, the location on the x-axis where the minimum of the hypothetical error function lies.
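This iteration is easy to reproduce. Here is a minimal Python sketch of gradient descent on the hypothetical error function w^2 + 1, starting from w = 4 with learning rate 0.1; the first values it prints are 3.2, 2.56 and 2.048, as above. The stopping threshold of 1e-6 is just an illustrative choice.

```python
def error(w):
    return w ** 2 + 1      # hypothetical error function from Fig. 7.6

def derivative(w):
    return 2 * w           # its derivative

w = 4.0                    # current parameter value
learning_rate = 0.1
for iteration in range(100):
    delta = learning_rate * derivative(w)
    if abs(delta) < 1e-6:  # stop once the change becomes too small
        break
    w -= delta
    print(iteration, round(w, 6), round(error(w), 6))
```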

The learning rate determines the step size and hence how quickly the search converges. If it is too large and the error function has several minima, the search may overshoot and miss a minimum entirely, or it may oscillate wildly. If it is too small, progress toward the minimum may be slow. Note that gradient descent can only find a local minimum. If the function has several minima—and error functions for multilayer perceptrons usually have many—it may not find the best one. This is a significant drawback of standard multilayer perceptrons compared with, e.g., support vector machines.

To use gradient descent to find the weights of a multilayer perceptron, the derivative of the squared error must be determined with respect to each parameter—i.e., each weight in the network. Let us start with a simple perceptron without a hidden layer. Differentiating the error function with respect to a particular weight wi yields

\frac{dE}{dw_i} = (f(x) - y)\,\frac{df(x)}{dw_i}.


Here, f(x) is the perceptron’s output and x is the weighted sum of the inputs.

To compute the second factor on the right-hand side, the derivative of the sigmoid function f(x) is needed. It turns out that this has a particularly simple form that can be written in terms of f(x) itself:

\frac{df(x)}{dx} = f(x)(1 - f(x)).


We use f′(x) to denote this derivative. But we seek the derivative with respect to wi, not x. Because

x = \sum_i w_i a_i,


the derivative of f(x) with respect to wi is

\frac{df(x)}{dw_i} = f'(x)\,a_i.


Plugging this back into the derivative of the error function yields

\frac{dE}{dw_i} = (f(x) - y)\,f'(x)\,a_i.


This expression gives all that is needed to calculate the change of weight wi caused by a particular example vector a (extended by 1 to represent the bias, as explained previously). Having repeated this computation for each training instance, we add up the changes associated with a particular weight wi, multiply by the learning rate, and subtract the result from wi’s current value.
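As a concrete illustration, the batch update just described can be written in a few lines. The sketch below (assuming NumPy, 0/1 class labels, and a data matrix whose rows are instances already extended by a constant 1 for the bias) accumulates (f(x) − y) f′(x) a_i over all training instances and then adjusts the weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(weights, data, labels, learning_rate):
    # data: one row per instance, last column fixed to 1 for the bias;
    # labels: 0/1 class values. Returns the updated weight vector.
    gradient = np.zeros_like(weights)
    for a, y in zip(data, labels):
        x = np.dot(weights, a)                   # weighted sum of the inputs
        f = sigmoid(x)                           # the unit's output f(x)
        gradient += (f - y) * f * (1 - f) * a    # (f(x) - y) f'(x) a_i
    return weights - learning_rate * gradient
```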

So far so good. But all this assumes that there is no hidden layer. With a hidden layer, things get a little trickier. Suppose f(xi) is the output of the ith hidden unit, wij is the weight of the connection from input j to the ith hidden unit, and wi is the weight of the ith hidden unit to the output unit. The situation is depicted in Fig. 7.7, where, for simplicity, we have omitted bias inputs for all units. As before, f(x) is the output of the single unit in the output layer. The update rule for the weights wi is essentially the same as above, except that ai is replaced by the output of the ith hidden unit:

\frac{dE}{dw_i} = (f(x) - y)\,f'(x)\,f(x_i).

Figure 7.7 Multilayer perceptron with a hidden layer (omitting bias inputs).

However, to update the weights wij the corresponding derivatives must be calculated. Applying the chain rule gives

\frac{dE}{dw_{ij}} = \frac{dE}{dx}\,\frac{dx}{dw_{ij}} = (f(x) - y)\,f'(x)\,\frac{dx}{dw_{ij}}.


The first two factors are the same as in the previous equation. To compute the third factor, differentiate further. Because

x = \sum_i w_i f(x_i), \qquad \frac{dx}{dw_{ij}} = w_i\,\frac{df(x_i)}{dw_{ij}}.


Furthermore,

x_i = \sum_j w_{ij} a_j,


so

\frac{df(x_i)}{dw_{ij}} = f'(x_i)\,\frac{dx_i}{dw_{ij}} = f'(x_i)\,a_j.


This means that we are finished. Putting everything together yields an equation for the derivative of the error function with respect to the weights wij:

\frac{dE}{dw_{ij}} = (f(x) - y)\,f'(x)\,w_i\,f'(x_i)\,a_j.


As before, we calculate this value for every training instance, add up the changes associated with a particular weight wij, multiply by the learning rate, and subtract the outcome from the current value of wij.

This derivation applies to a perceptron with one hidden layer. If there are two hidden layers, the same strategy can be applied a second time to update the weights pertaining to the input connections of the first hidden layer, propagating the error from the output unit through the second hidden layer to the first one. Because of this error propagation mechanism, this version of the generic gradient descent strategy is called backpropagation.
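Putting the update formulas together, the following sketch (again assuming NumPy, a single sigmoid output unit, one hidden layer, and bias inputs omitted for simplicity, as in Fig. 7.7) performs one batch backpropagation update using exactly the derivatives derived above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_batch_update(W_hidden, w_out, data, labels, learning_rate):
    # W_hidden[i, j]: weight from input j to hidden unit i (w_ij)
    # w_out[i]:       weight from hidden unit i to the output unit (w_i)
    grad_hidden = np.zeros_like(W_hidden)
    grad_out = np.zeros_like(w_out)
    for a, y in zip(data, labels):
        # Forward pass
        x_hidden = W_hidden @ a            # x_i = sum_j w_ij a_j
        f_hidden = sigmoid(x_hidden)       # f(x_i)
        x = w_out @ f_hidden               # x = sum_i w_i f(x_i)
        f = sigmoid(x)                     # network output f(x)
        # Backward pass, using f'(x) = f(x)(1 - f(x))
        delta_out = (f - y) * f * (1 - f)  # (f(x) - y) f'(x)
        grad_out += delta_out * f_hidden   # dE/dw_i = (f(x) - y) f'(x) f(x_i)
        delta_hidden = delta_out * w_out * f_hidden * (1 - f_hidden)
        grad_hidden += np.outer(delta_hidden, a)  # dE/dw_ij
    return (W_hidden - learning_rate * grad_hidden,
            w_out - learning_rate * grad_out)
```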

We have tacitly assumed that the network’s output layer has just one unit, which is appropriate for two-class problems. For more than two classes, a separate network could be learned for each class that distinguishes it from the remaining classes. A more compact classifier can be obtained from a single network by creating an output unit for each class, connecting every unit in the hidden layer to every output unit. The squared error for a particular training instance is the sum of squared errors taken over all output units. The same technique can be applied to predict several targets, or attribute values, simultaneously by creating a separate output unit for each one. Intuitively, this may give better predictive accuracy than building a separate classifier for each class attribute if the underlying learning tasks are in some way related.

We have assumed that weights are only updated after all training instances have been fed through the network and all the corresponding weight changes have been accumulated. This is batch learning, because all the training data is processed together. But exactly the same formulas can be used to update the weights incrementally after each training instance has been processed. This is called stochastic backpropagation because the overall error does not necessarily decrease after every update. It can be used for online learning, in which new data arrives in a continuous stream and every training instance is processed just once. In both variants of backpropagation, it is often helpful to standardize the attributes, e.g., to have zero mean and unit standard deviation. Before learning starts, each weight is initialized to a small, randomly chosen value based on a normal distribution with zero mean.

Like any other learning scheme, multilayer perceptrons trained with backpropagation may suffer from overfitting—especially if the network is much larger than what is actually necessary to represent the structure of the underlying learning problem. Many modifications have been proposed to alleviate this. A very simple one, called early stopping, works like reduced-error pruning in rule learners: a holdout set is used to decide when to stop performing further iterations of the backpropagation algorithm. The error on the holdout set is measured and the algorithm is terminated once the error begins to increase, because that indicates overfitting to the training data. Another method, called weight decay, adds to the error function a penalty term that consists of the squared sum of all nonbias weights in the network, as in ridge regression. This attempts to limit the influence of irrelevant connections on the network’s predictions by penalizing large weights that do not contribute a correspondingly large reduction in the error.

Although standard gradient descent is the simplest technique for learning the weights in a multilayer perceptron, it is by no means the most efficient one. In practice, it tends to be rather slow when executed on a standard personal computer. A trick that often improves performance is to include a momentum term when updating weights: add to the new weight change a small proportion of the update value from the previous iteration. This smooths the search process by making changes in direction less abrupt. More sophisticated methods make use of information obtained from the second derivative of the error function as well; they can converge much more quickly. However, even those algorithms can be very slow compared with other methods of classification learning.

A serious disadvantage of multilayer perceptrons that contain hidden units is that they are essentially opaque. There are several techniques that attempt to extract rules from trained neural networks. However, it is unclear whether they offer any advantages over standard rule learners that induce rule sets directly from data—especially considering that this can generally be done much more quickly than learning a multilayer perceptron in the first place.

Although multilayer perceptrons are the most prominent type of neural network, many others have been proposed. Multilayer perceptrons belong to a class of networks called feedforward networks because they do not contain any cycles and the network’s output depends only on the current input instance. Recurrent neural networks do have cycles. Computations derived from earlier input are fed back into the network, which gives them a kind of memory.

Radial Basis Function Networks

Another popular type of feedforward network is the radial basis function (RBF) network. It has two layers, not counting the input layer, and differs from a multilayer perceptron in the way that the hidden units perform computations. Each hidden unit essentially represents a particular point in input space, and its output, or activation, for a given instance depends on the distance between its point and the instance—which is just another point. Intuitively, the closer these two points, the stronger the activation. This is achieved by using a nonlinear transformation function to convert the distance into a similarity measure. A bell-shaped Gaussian activation function, whose width may be different for each hidden unit, is commonly used for this purpose. The hidden units are called RBFs because the points in instance space for which a given hidden unit produces the same activation form a hypersphere or hyperellipsoid. (In a multilayer perceptron, this is a hyperplane.)

The output layer of an RBF network is the same as that of a multilayer perceptron: it takes a linear combination of the outputs of the hidden units and—in classification problems—pipes it through the sigmoid function (or something with a similar shape).

The parameters that such a network learns are (1) the centers and widths of the RBFs and (2) the weights used to form the linear combination of the outputs obtained from the hidden layer. A significant advantage over multilayer perceptrons is that the first set of parameters can be determined independently of the second set and still produce fairly accurate classifiers.

One way to determine the first set of parameters is to use clustering. The simple k-means clustering algorithm described in Section 4.8 can be applied, clustering each class independently to obtain k basis functions for each class. Intuitively, the resulting RBFs represent prototype instances. The second set of parameters is then learned by keeping the first parameters fixed. This involves learning a simple linear classifier using one of the techniques we have discussed (e.g., linear or logistic regression). If there are far fewer hidden units than training instances, this can be done very quickly. Note that although this two-stage process is very quick, it is generally not as accurate as training all network parameters using a strategy such as gradient descent.
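A minimal sketch of the second stage of this two-stage procedure is shown below (assuming NumPy, Gaussian basis functions with a single shared width, and ordinary least squares for the output weights); the centers are assumed to have been obtained already, e.g., from k-means, and in practice a library routine would typically be used for both steps.

```python
import numpy as np

def rbf_features(X, centers, width):
    # Gaussian activation of each hidden unit for each instance.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(dists ** 2) / (2 * width ** 2))

def train_rbf_output_layer(X, y, centers, width):
    # Stage 2: fit a linear model to the hidden-layer activations,
    # keeping the centers and width fixed.
    H = np.hstack([rbf_features(X, centers, width), np.ones((len(X), 1))])
    weights, *_ = np.linalg.lstsq(H, y, rcond=None)
    return weights

def predict_rbf(X, centers, width, weights):
    H = np.hstack([rbf_features(X, centers, width), np.ones((len(X), 1))])
    return H @ weights
```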

A disadvantage of RBF networks is that they give every attribute the same weight because all are treated equally in the distance computation, unless attribute weight parameters are included in the overall optimization process. Support vector machines share the same problem. In fact, support vector machines with Gaussian kernels (i.e., “RBF kernels”) are a particular type of RBF network, in which one basis function is centered on every training instance, all basis functions have the same width, and the outputs are combined linearly by computing the maximum margin hyperplane. This has the effect that only some of the RBFs have a nonzero weight—the ones that represent the support vectors.

Stochastic Gradient Descent

We have introduced gradient descent and stochastic backpropagation as optimization methods for learning the weights in a neural network. Gradient descent is, in fact, a general-purpose optimization technique that can be applied whenever the objective function is differentiable. Actually, it turns out that it can even be applied in cases where the objective function is not completely differentiable through use of a device called subgradients.

One application is the use of gradient descent to learn linear models such as linear support vector machines or logistic regression. Learning such models using gradient descent is easier than optimizing nonlinear neural networks because the error function has a global minimum rather than many local minima, which is usually the case for nonlinear networks. For linear problems, a stochastic gradient descent procedure can be designed that is computationally simple and converges very rapidly, allowing models such as linear support vector machines and logistic regression to be learned from large data sets. Moreover, stochastic gradient descent allows models to be learned incrementally, in an online setting.

For support vector machines, the error function—the thing that is to be minimized—is called the “hinge loss.” Illustrated in Fig. 7.8, this is so named because it comprises a downwards sloping linear segment joined to a horizontal part at z=1—more formally, E(z)=max{0, 1−z}. For comparison, the Figure also shows the 0–1 loss, which is discontinuous, and the squared loss, which is both continuous and differentiable. These functions are plotted as a function of the margin z=yf(x), where the class y is either −1 or +1 and f(x) is the output of the linear model. Misclassification occurs when z < 0, so all loss functions incur their most serious penalties in the negative region. In the linearly separable case, the hinge loss is zero for a function that successfully separates the data. The maximum margin hyperplane is given by the smallest weight vector that achieves a zero hinge loss.

Figure 7.8 Hinge, squared and 0–1 loss functions.

The hinge loss is continuous, unlike the 0–1 loss, but is not differentiable at z=1, unlike the squared loss which is differentiable everywhere. This lack of differentiability presents a problem if gradient descent is used to update the model’s weights after a training example has been processed, because the loss function’s derivative is needed for this. That is where subgradients come in. The basic idea is that even though the gradient cannot be computed, the minimum will still be found if something resembling a gradient can be substituted. In the case of the hinge loss, the gradient is taken to be zero at the point of nondifferentiability. In fact, since the hinge loss is zero for z≥1, we can focus on that part of the function that is differentiable (z < 1) and proceed as usual.

The weight update for a linear support vector machine using the hinge loss is Δw_i = η x_i y, where η is the learning rate. For stochastic gradient descent, all that is needed to compute z for each training instance is to take the dot product between the current weight vector and the instance, multiply the result by the instance’s class value, and check to see if the resulting value is less than 1. If so, the weights are updated accordingly. As with perceptrons, a bias term can be included by extending the weight vector by one element and including an additional attribute with each training instance that always has the value 1.
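A sketch of this stochastic update (assuming NumPy, class values coded as −1/+1, and instances already extended by a constant 1 for the bias) is shown below; regularization is omitted to keep the example minimal.

```python
import numpy as np

def sgd_hinge(data, labels, learning_rate=0.01, epochs=10):
    # data: one row per instance, last column fixed to 1 for the bias;
    # labels: class values coded as -1 or +1.
    w = np.zeros(data.shape[1])
    for _ in range(epochs):
        for x, y in zip(data, labels):
            z = y * np.dot(w, x)             # the margin z = y f(x)
            if z < 1:                        # hinge loss is nonzero: update
                w += learning_rate * y * x   # subgradient step, Δw_i = η x_i y
    return w
```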

Discussion

Support vector machines originated from research in statistical learning theory (Vapnik, 1999), and a good starting point for exploration is a tutorial by Burges (1998). A general description, including generalization to the case in which the data is not linearly separable, has been published by Cortes and Vapnik (1995). We have introduced the standard version of support vector regression: Schölkopf, Bartlett, Smola, and Williamson (1999) present a different version that has one parameter instead of two. Smola and Schölkopf (2004) provide an extensive tutorial on support vector regression. Fletcher (1987) covers solution methods for constrained quadratic optimization problems, while Platt (1998) describes the sequential minimal optimization algorithm, which is specifically designed to train support vector machines.

Ridge regression was introduced in statistics by Hoerl and Kennard (1970) and can now be found in standard statistics texts. Hastie et al. (2009) give a good description of kernel ridge regression. Kernel ridge regression is equivalent to a technique called Gaussian process regression, a Bayesian approach that additionally provides estimates of predictive uncertainty. The complexity of the most efficient general matrix inversion algorithm is in fact O(n^{2.807}) rather than O(n^3).

The (voted) kernel perceptron is due to Freund and Schapire (1999). Cristianini and Shawe-Taylor (2000) provide a nice introduction to support vector machines and other kernel-based methods, including the optimization theory underlying the support vector learning algorithms. We have barely skimmed the surface of these learning schemes, mainly because advanced mathematics lies just beneath. The idea of using kernels to solve nonlinear problems has been applied to many algorithms, e.g., principal component analysis (described in Section 8.3). A kernel is essentially a similarity function with certain mathematical properties, and it is possible to define kernel functions over all sorts of structures—e.g., sets, strings, trees, and probability distributions. Shawe-Taylor and Cristianini (2004) and Schölkopf and Smola (2002) cover kernel-based learning in detail.

There is extensive literature on neural networks, and Bishop (1995) provides an excellent introduction to both multilayer perceptrons and RBF networks. Interest in neural networks initially declined after the arrival of support vector machines, perhaps because the latter often require fewer parameters to be tuned to achieve the same (or greater) accuracy. However, recent studies have shown that multilayer perceptrons achieve performance competitive with more modern learning techniques on many practical data sets, and they excel in particular when performing deep learning (see chapter: Deep learning).

Gradient methods for learning classifiers are very popular. In particular, stochastic gradient methods have been explored because they are applicable to large data sets and online learning scenarios. Kivinen, Smola, and Williamson (2002); Zhang (2004); and Shalev-Shwartz, Singer, and Srebro (2007) explore such methods when applied to learning support vector machines. Kivinen et al. (2002) and Shalev-Shwartz et al. (2007) provide heuristics for setting the learning rate for gradient descent based on the current iteration, and only require the user to provide a value for a single parameter that determines the closeness of fit to the training data (a so-called regularization parameter). In the vanilla approach, regularization is performed by limiting the number of updates that can be performed.

7.3 Numeric Prediction With Local Linear Models

Trees that are used for numeric prediction are just like ordinary decision trees except that at each leaf they store either a class value that represents the average value of instances that reach the leaf, in which case the tree is called a regression tree, or a linear regression model that predicts the class value of instances that reach the leaf, in which case it is called a model tree. In what follows we will talk about model trees because regression trees are really a special case.

Regression and model trees are constructed by first using a decision tree induction algorithm to build an initial tree. However, whereas most decision tree algorithms choose the splitting attribute to maximize the information gain, it is appropriate for numeric prediction to instead minimize the intrasubset variation in the class values down each branch. Once the basic tree has been formed, consideration is given to pruning the tree back from each leaf, just as with ordinary decision trees. The only difference between regression tree and model tree induction is that for the latter, each node is replaced by a regression plane instead of a constant value. The attributes that serve to define that plane are generally those that participate in decisions in the subtree that will be pruned, i.e., in nodes beneath the current one, and perhaps those that occur on the path to the root node.

Following an extensive description of model trees, we briefly explain how to generate rules from model trees, and then describe another approach to numeric prediction based on generating local linear models—locally weighted linear regression. Whereas model trees derive from the basic divide-and-conquer decision tree methodology, locally weighted regression is inspired by the instance-based methods for classification that we described in Section 4.3. Like instance-based learning, it performs all “learning” at prediction time. Although locally weighted regression resembles model trees in that it uses linear regression to fit models locally to particular areas of instance space, it does so in quite a different way.

Model Trees

When a model tree is used to predict the value for a test instance, the tree is followed down to a leaf in the normal way, using the instance’s attribute values to make routing decisions at each node. The leaf will contain a linear model based on some of the attribute values, and this is evaluated for the test instance to yield a raw predicted value.

Instead of using this raw value directly, however, it turns out to be beneficial to use a smoothing process to reduce the sharp discontinuities that will inevitably occur between adjacent linear models at the leaves of the pruned tree. This is a particular problem for models constructed from a small number of training instances. Smoothing can be accomplished by producing linear models for each internal node, as well as for the leaves, at the time the tree is built. Then, once the leaf model has been used to obtain the raw predicted value for a test instance, that value is filtered along the path back to the root, smoothing it at each node by combining it with the value predicted by the linear model for that node.

An appropriate smoothing calculation is

p' = \frac{np + kq}{n + k},


where p′ is the prediction passed up to the next higher node, p is the prediction passed to this node from below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant. Experiments show that smoothing substantially increases the accuracy of predictions.
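The calculation is straightforward to apply along the path from the leaf back to the root. A small Python sketch (with hypothetical node records holding n and the node model's own prediction q, and a placeholder value for the smoothing constant k) might look like this:

```python
def smoothed_prediction(leaf_value, path_to_root, k=15):
    # path_to_root: list of (n, q) pairs for the nodes above the leaf, from
    # bottom to top, where n is the number of training instances that reach
    # the node below and q is the prediction of this node's linear model.
    # k is the smoothing constant (15 here is just a placeholder value).
    p = leaf_value
    for n, q in path_to_root:
        p = (n * p + k * q) / (n + k)   # p' = (np + kq) / (n + k)
    return p
```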

However, discontinuities remain and the resulting function is not smooth. In fact, exactly the same smoothing process can be accomplished by incorporating the interior models into each leaf model after the tree has been built. Then, during the classification process, only the leaf models are used. The disadvantage is that the leaf models tend to be larger and more difficult to comprehend, because many coefficients that were previously zero become nonzero when the interior nodes’ models are incorporated.

Building the Tree

The splitting criterion is used to determine which attribute is the best to split that portion T of the training data that reaches a particular node. It is based on treating the standard deviation of the class values in T as a measure of the error at that node, and calculating the expected reduction in error as a result of testing each attribute at that node. The attribute that maximizes the expected error reduction is chosen for splitting at the node.

The expected error reduction, which we call SDR for standard deviation reduction, is calculated by

\mathrm{SDR} = \mathrm{sd}(T) - \sum_i \frac{|T_i|}{|T|} \times \mathrm{sd}(T_i),


where T1, T2,… are the sets that result from splitting the node according to the chosen attribute.

The splitting process terminates when the class values of the instances that reach a node vary very slightly, i.e., when their standard deviation is only a small fraction (say, less than 5%) of the standard deviation of the original instance set. Splitting also terminates when just a few instances remain, say four or fewer. Experiments show that the results obtained are not very sensitive to the exact choice of these parameters.
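A sketch of the SDR computation for one candidate binary split (assuming NumPy and numeric class values) follows; the attribute and split point with the largest SDR would be chosen.

```python
import numpy as np

def sdr(class_values, left_mask):
    # class_values: class values of the instances reaching the node (T);
    # left_mask: boolean array saying which instances go down the left branch.
    T = np.asarray(class_values, dtype=float)
    subsets = [T[left_mask], T[~left_mask]]
    weighted = sum(len(Ti) / len(T) * np.std(Ti) for Ti in subsets if len(Ti) > 0)
    return np.std(T) - weighted
```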

Pruning the Tree

As noted earlier, a linear model is needed for each interior node of the tree, not just at the leaves, for use in the smoothing process. Before pruning, a model is calculated for each node of the unpruned tree. The model takes the form

w_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k,


where a1, a2,…, ak are attribute values. The weights w1, w2,…, wk are calculated using standard regression. However, only a subset of the attributes are generally used here—e.g., those that are tested in the subtree below this node, and perhaps those occurring along the path to the root node. Note that we have tacitly assumed that attributes are numeric: we describe the handling of nominal attributes in “Nominal attributes” section.

The pruning procedure makes use of an estimate, at each node, of the expected error for test data. First, the absolute difference between the predicted value and the actual class value is averaged over each of the training instances that reach that node. Because the tree has been built expressly for this data set, this average will underestimate the expected error for unseen cases. To compensate, it is multiplied by the factor (n + ν)/(n − ν), where n is the number of training instances that reach the node and ν is the number of parameters in the linear model that gives the class value at that node.

The expected error for test data at a node is calculated as described previously, using the linear model for prediction. Because of the compensation factor (n + ν)/(n − ν), it may be that the linear model can be further simplified by dropping terms to minimize the estimated error. Dropping a term decreases the multiplication factor, which may be enough to offset the inevitable increase in average error over the training instances. Terms are dropped one by one, greedily, as long as the error estimate decreases.
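The following sketch illustrates this greedy simplification step, using the (n + ν)/(n − ν) factor; ordinary least squares is used as a stand-in for whatever regression routine is actually fitting the node models, and the candidate attribute subset is assumed to be supplied.

```python
import numpy as np

def estimated_error(X, y, attrs):
    # Fit a linear model on the chosen attribute subset (plus a constant term)
    # and return the compensated training error: (n + v)/(n - v) * mean abs error.
    A = np.hstack([X[:, attrs], np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    n, v = len(y), A.shape[1]
    mae = np.mean(np.abs(y - A @ w))
    return (n + v) / (n - v) * mae if n > v else np.inf

def simplify(X, y, attrs):
    # Greedily drop attributes as long as the error estimate decreases.
    attrs = list(attrs)
    best = estimated_error(X, y, attrs)
    improved = True
    while improved and attrs:
        improved = False
        for a in list(attrs):
            candidate = [b for b in attrs if b != a]
            err = estimated_error(X, y, candidate)
            if err < best:
                best, attrs, improved = err, candidate, True
                break
    return attrs, best
```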

Finally, once a linear model is in place for each interior node, the tree is pruned back from the leaves as long as the expected estimated error decreases. The expected error for the linear model at that node is compared with the expected error from the subtree below. To calculate the latter, the error from each branch is combined into a single, overall value for the node by weighting the branch by the proportion of the training instances that go down it and combining the error estimates linearly using those weights. Alternatively, one can calculate the training error of the subtree and multiply it by the above modification factor based on an ad hoc estimate of the number of parameters in the tree—perhaps adding one for each split point.

Nominal Attributes

Before constructing a model tree, all nominal attributes are transformed into binary variables that are then treated as numeric. For each nominal attribute, the average class value corresponding to each possible value in the set is calculated from the training instances, and the values are sorted according to these averages. Then, if the nominal attribute has k possible values, it is replaced by k−1 synthetic binary attributes, the ith being 0 if the value is one of the first i in the ordering and 1 otherwise. Thus all splits are binary: they involve either a numeric attribute or a synthetic binary one, treated as a numeric attribute.

It is possible to prove analytically that the best split at a node for a nominal variable with k values is one of the k−1 positions obtained by ordering the average class values for each value of the attribute. This sorting operation should really be repeated at each node; however, there is an inevitable increase in noise due to small numbers of instances at lower nodes in the tree (and in some cases nodes may not represent all values for some attributes), and not much is lost by performing the sorting just once, before starting to build a model tree.
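A sketch of this transformation (plain Python, assuming string-valued nominal attributes and numeric class values) is given below; it orders the attribute's values by their average class value and produces the k − 1 synthetic binary attributes for each instance.

```python
def nominal_to_binary(values, class_values):
    # values: the nominal attribute value of each training instance;
    # class_values: the corresponding numeric class values.
    # Returns, for each instance, its k-1 synthetic binary attribute values.
    averages = {}
    for v in set(values):
        members = [c for x, c in zip(values, class_values) if x == v]
        averages[v] = sum(members) / len(members)
    ordering = sorted(averages, key=averages.get)   # values sorted by average class
    rank = {v: i for i, v in enumerate(ordering)}
    k = len(ordering)
    # The (i+1)-th binary attribute is 0 if the value is among the first i+1
    # values in the ordering, and 1 otherwise.
    return [[0 if rank[x] <= i else 1 for i in range(k - 1)] for x in values]
```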

Missing Values

To take account of missing values, a modification is made to the SDR formula. The final formula, including the missing value compensation, is

\mathrm{SDR} = \frac{m}{|T|} \times \left[\mathrm{sd}(T) - \sum_{j \in \{L,R\}} \frac{|T_j|}{|T|} \times \mathrm{sd}(T_j)\right],


where m is the number of instances without missing values for that attribute, and T is the set of instances that reach this node. TL, TR are sets that result from splitting on this attribute—because all tests on attributes are now binary.

When processing both training and test instances, once an attribute is selected for splitting it is necessary to divide the instances into subsets according to their value for this attribute. An obvious problem arises when the value is missing. An interesting technique called surrogate splitting has been developed to handle this situation. It involves finding another attribute to split on in place of the original one and using it instead. The attribute is chosen as the one most highly correlated with the original attribute. However, this technique is both complex to implement and time consuming to execute.

A simpler heuristic is to use the class value as the surrogate attribute, in the belief that, a priori, this is the attribute most likely to be correlated with the one being used for splitting. Of course, this is only possible when processing the training set, because for test examples the class is not known. A simple solution for test examples is simply to replace the unknown attribute value by the average value of that attribute for the training examples that reach the node—which has the effect, for a binary attribute, of choosing the most populous subnode. This simple approach seems to work well in practice.

Let us consider in more detail how to use the class value as a surrogate attribute during the training process. We first deal with all instances for which the value of the splitting attribute is known. We determine a threshold for splitting in the usual way, by sorting the instances according to the splitting attribute’s value and, for each possible split point, calculating the SDR according to the preceding formula, choosing the split point that yields the greatest reduction in error. Only the instances for which the value of the splitting attribute is known are used to determine the split point.

Then we divide these instances into the two sets L and R according to the test. We determine whether the instances in L or R have the greater average class value, and we calculate the average of these two averages. Then, an instance for which this attribute value is unknown is placed into L or R according to whether its class value exceeds this overall average or not. If it does, it goes into whichever of L and R has the greater average class value; otherwise, it goes into the one with the smaller average class value. When the splitting stops, all the missing values will be replaced by the average values of the corresponding attributes of the training instances reaching the leaves.
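A sketch of this heuristic for the training set (plain Python; the split point is assumed to have been chosen already from the instances with known values) might look as follows:

```python
def assign_missing(known_L, known_R, missing_class_values):
    # known_L, known_R: class values of instances whose splitting-attribute
    # value is known and which go left/right under the chosen test.
    # missing_class_values: class values of instances whose splitting-attribute
    # value is missing. Returns 'L' or 'R' for each of them.
    avg_L = sum(known_L) / len(known_L)
    avg_R = sum(known_R) / len(known_R)
    threshold = (avg_L + avg_R) / 2          # average of the two averages
    larger = 'L' if avg_L > avg_R else 'R'   # branch with the greater average class value
    smaller = 'R' if larger == 'L' else 'L'
    return [larger if c > threshold else smaller for c in missing_class_values]
```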

Pseudocode for Model Tree Induction

Fig. 7.9 gives pseudocode for the model tree algorithm we have described. The two main parts are creating a tree by successively splitting nodes, performed by split, and pruning it from the leaves upward, performed by prune. The node data structure contains a type flag indicating whether it is an internal node or a leaf, pointers to the left and right child, the set of instances that reach that node, the attribute that is used for splitting at that node, and a structure representing the linear model for the node.

Figure 7.9 Pseudocode for model tree induction.

The sd function called at the beginning of the main program and again at the beginning of split calculates the standard deviation of the class values of a set of instances. Then follows the procedure for obtaining synthetic binary attributes that was described previously. Standard procedures for creating new nodes and printing the final tree are not shown. In split, sizeof returns the number of elements in a set. Missing attribute values are dealt with as described earlier. The SDR is calculated according to the equation at the beginning of “Missing values” section. Although not shown in the code, it is set to infinity if splitting on the attribute would create a leaf with less than two instances. In prune, the linearRegression routine recursively descends the subtree collecting attributes, performs a linear regression on the instances at that node as a function of those attributes, and then greedily drops terms if doing so improves the error estimate, as described earlier. Finally, the error function returns

\frac{n + \nu}{n - \nu} \times \frac{\sum_{\text{instances}} |\text{deviation from predicted class value}|}{n},


where n is the number of instances at the node and ν the number of parameters in the node’s linear model.

Fig. 7.10 gives an example of a model tree formed by this algorithm for a problem with two numeric and two nominal attributes. What is to be predicted is the rise time of a simulated servo system involving a servo amplifier, motor, lead screw, and sliding carriage. The nominal attributes play important roles. Four synthetic binary attributes have been created for each of the five-valued nominal attributes motor and screw, and are shown in Table 7.1 in terms of the two sets of values to which they correspond. The ordering of these values—D, E, C, B, A for motor and coincidentally D, E, C, B, A for screw also—is determined from the training data: the rise time averaged over all examples for which motor=D is less than that averaged over examples for which motor=E, which is less than when motor=C, and so on. It is apparent from the magnitude of the coefficients in Table 7.1 that motor=D versus E, C, B, A and screw=D, E, C, B versus A play leading roles in the LM2, LM3, and LM4 models (among others). Both motor and screw also play a minor role in several of the models.

Figure 7.10 Model tree for a data set with nominal attributes.

Table 7.1

Linear Models in the Model Tree

Model                      | LM1   | LM2   | LM3   | LM4   | LM5   | LM6   | LM7   | LM8   | LM9   | LM10  | LM11
Constant term              | 0.96  | 1.14  | 1.43  | 1.52  | 2.69  | 2.91  | 0.88  | 0.98  | 1.11  | 1.06  | 0.97
Pgain                      | −0.38 | −0.38 | −0.38 | −0.38 | −0.38 | −0.38 | −0.24 | −0.24 | −0.24 | −0.25 | −0.25
Vgain                      | 0.71  | 0.49  | 0.49  | 0.49  | 0.56  | 0.45  | 0.13  | 0.15  | 0.15  | 0.10  | 0.14
Motor = D vs E, C, B, A    | 0.66  | 1.14  | 1.06  | 1.06  | 0.50  | 0.50  | 0.30  | 0.40  | 0.30  | 0.14  | 0.14
Motor = D, E vs C, B, A    | 0.97  | 0.61  | 0.65  | 0.59  | 0.42  | 0.42  | −0.02 | 0.06  | 0.06  | 0.17  | 0.22
Motor = D, E, C vs B, A    | 0.32  | 0.32  | 0.32  | 0.32  | 0.41  | 0.41  | 0.05  |       |       |       |
Motor = D, E, C, B vs A    |       |       |       |       | 0.08  | 0.05  |       |       |       |       |
Screw = D vs E, C, B, A    |       |       |       |       |       |       |       |       |       |       |
Screw = D, E vs C, B, A    | 0.13  |       |       |       |       |       |       |       |       |       |
Screw = D, E, C vs B, A    | 0.49  | 0.54  | 0.54  | 0.54  | 0.39  | 0.40  | 0.30  | 0.20  | 0.16  | 0.08  | 0.08
Screw = D, E, C, B vs A    |       | 1.73  | 1.79  | 1.79  | 0.96  | 1.13  | 0.22  | 0.15  | 0.15  | 0.16  | 0.19


Rules From Model Trees

Model trees are essentially decision trees with linear models at the leaves. Like decision trees, they may suffer from the replicated subtree problem explained in Section 3.4, and sometimes the structure can be expressed much more concisely using a set of rules instead of a tree. Can we generate rules for numeric prediction? Recall the rule learner described in Section 6.2 that uses separate-and-conquer in conjunction with partial decision trees to extract decision rules from trees. The same strategy can be applied to model trees to generate decision lists for numeric prediction.

First build a partial model tree from all the data. Pick one of the leaves and make it into a rule. Remove the data covered by that leaf; then repeat the process with the remaining data. The question is, how to build the partial model tree, i.e., a tree with unexpanded nodes? This boils down to the question of how to pick which node to expand next. The algorithm of Fig. 6.5 (Section 6.2) picks the node whose entropy for the class attribute is smallest. For model trees, whose predictions are numeric, simply use the variance instead. This is based on the same rationale: the lower the variance, the shallower the subtree and the shorter the rule. The rest of the algorithm stays the same, with the model tree learner’s split selection method and pruning strategy replacing the decision tree learner’s. Because the model tree’s leaves are linear models, the corresponding rules will have linear models on the right-hand side.

There is one caveat when using model trees in this fashion to generate rule sets. It turns out that using smoothed model trees does not reduce the error in the final rule set’s predictions. This may be because smoothing works best for contiguous data, but the separate-and-conquer scheme removes data covered by previous rules, leaving holes in the distribution. Smoothing, if it is done at all, must be performed after the rule set has been generated.

Locally Weighted Linear Regression

An alternative approach to numeric prediction is the method of locally weighted linear regression. With model trees, the tree structure divides the instance space into regions, and a linear model is found for each of them. In effect, the training data determines how the instance space is partitioned. Locally weighted regression, on the other hand, generates local models at prediction time by giving higher weight to instances in the neighborhood of the particular test instance. More specifically, it weights the training instances according to their distance to the test instance and performs a linear regression on the weighted data. Training instances close to the test instance receive a high weight; those far away a low one. In other words, a linear model is tailor made for the particular test instance at hand and used to predict the instance’s class value.

To use locally weighted regression, you need to decide on a distance-based weighting scheme for the training instances. A common choice is to weight the instances according to the inverse of their Euclidean distance from the test instance. Another possibility is to use the Euclidean distance in conjunction with a Gaussian kernel function. However, there is no clear evidence that the choice of weighting function is critical. More important is the selection of a “smoothing parameter” that is used to scale the distance function—the distance is multiplied by the inverse of this parameter. If it is set to a small value, only instances very close to the test instance will receive significant weight; if it is large, more distant instances will also have a significant impact on the model. One way of choosing the smoothing parameter is to set it to the distance of the kth nearest training instance so that its value becomes smaller as the volume of training data increases. If the weighting function is linear, say max(0, 1–smoothed-distance), the weight is zero for all instances further than the kth nearest one. Then the weighting function has bounded support and only the k−1 nearest neighbors need to be considered for building the linear model. The best choice of k depends on the amount of noise in the data. The more noise there is, the more neighbors should be included in the linear model. Generally, an appropriate smoothing parameter is found using cross-validation.
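A minimal sketch of locally weighted linear regression (assuming NumPy, a Gaussian weighting function, and a user-supplied smoothing parameter) is given below; it solves a weighted least-squares problem afresh for each test instance.

```python
import numpy as np

def lwlr_predict(X, y, x_test, smoothing):
    # X: training instances (one per row), y: numeric class values,
    # x_test: the instance to predict, smoothing: the smoothing parameter.
    X1 = np.hstack([X, np.ones((len(X), 1))])        # add a constant term
    q = np.append(x_test, 1.0)
    dists = np.linalg.norm(X - x_test, axis=1) / smoothing
    w = np.exp(-dists ** 2)                          # Gaussian-kernel weights
    # Weighted least squares: minimize sum_i w_i (y_i - x_i b)^2 by scaling
    # each row (and target) by the square root of its weight.
    sw = np.sqrt(w)
    b = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)[0]
    return q @ b
```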

Like model trees, locally weighted linear regression is able to approximate nonlinear functions. One of its main advantages is that it is ideally suited for incremental learning: all training is done at prediction time, so new instances can be added to the training data at any time. However, like other instance-based methods, it is slow at deriving a prediction for a test instance. First, the training instances must be scanned to compute their weights; then, a weighted linear regression is performed on these instances. Also, like other instance-based methods, locally weighted regression provides little information about the global structure of the training data set. Note that if the smoothing parameter is based on the kth nearest neighbor and the weighting function gives zero weight to more distant instances, the kD-trees and ball trees described in Section 4.7 can be used to accelerate the process of finding the relevant neighbors.

Locally weighted learning is not restricted to linear regression: it can be applied with any learning technique that can handle weighted instances. In particular, you can use it for classification. Most algorithms can be easily adapted to deal with weights. The trick is to realize that (integer) weights can be simulated by creating several copies of the same instance. Whenever the learning algorithm uses an instance when computing a model, just pretend that it is accompanied by the appropriate number of identical shadow instances. This also works if the weight is not an integer. For example, in the Naïve Bayes algorithm described in Section 4.2, multiply the counts derived from an instance by the instance’s weight, and—voilà—you have a version of Naïve Bayes that can be used for locally weighted learning.

It turns out that locally weighted Naïve Bayes works quite well in practice, outperforming both Naïve Bayes itself and the k-nearest-neighbor technique. It also compares favorably with more sophisticated ways of enhancing Naïve Bayes by relaxing its intrinsic independence assumption. Locally weighted learning only assumes independence within a neighborhood, not globally in the whole instance space as standard Naïve Bayes does.

In principle, locally weighted learning can also be applied to decision trees and other models that are more complex than linear regression and Naïve Bayes. However, it is less beneficial here because locally weighted learning is primarily a way of allowing simple models to become more flexible by allowing them to approximate arbitrary targets. If the underlying learning algorithm can already do that, there is little point in applying locally weighted learning. Nevertheless it may improve other simple models—e.g., linear support vector machines and logistic regression.

Discussion

Regression trees were introduced in the CART system of Breiman et al. (1984). CART, for “classification and regression trees,” incorporated a decision tree inducer for discrete classes like that of C4.5, as well as a scheme for inducing regression trees. Many of the techniques described in this section, such as the method of handling nominal attributes and the surrogate device for dealing with missing values, were included in CART. However, model trees did not appear until much more recently, being first described by Quinlan (1992). Using model trees for generating rule sets (although not partial trees) has been explored by Hall, Holmes, and Frank (1999).

A comprehensive description (and implementation) of model tree induction is given by Wang and Witten (1997). Neural networks are also commonly used for predicting numeric quantities, although they suffer from the disadvantage that the structures they produce are opaque and cannot be used to help understand the nature of the solution. There are techniques for producing understandable insights from the structure of neural networks, but the arbitrary nature of the internal representation means that there may be dramatic variations between networks of identical architecture trained on the same data. By dividing the function being induced into linear patches, model trees provide a representation that is reproducible and at least somewhat comprehensible.

There are many variations of locally weighted learning. For example, statisticians have considered using locally quadratic models instead of linear ones and have applied locally weighted logistic regression to classification problems. Also, many different potential weighting and distance functions can be found in the literature. Atkeson, Schaal, and Moore (1997) have written an excellent survey on locally weighted learning, primarily in the context of regression problems. Frank, Hall, and Pfahringer (2003) evaluated the use of locally weighted learning in conjunction with Naïve Bayes.

7.4 Weka Implementations

• Instance-based learning
IBk

KStar

NNge (rectangular generalizations, in the NNge package)

• Linear models and extensions
SMO and variants

LibSVM (uses third-party libsvm library, in the LibSVM package)

LibLINEAR (uses third-party liblinear library, in the LibLINEAR package)

GaussianProcesses (kernel ridge regression, plus estimates of predictive uncertainty)

VotedPerceptron (voted kernel perceptrons)

MultiLayerPerceptron, as well as MLPClassifier and MLPRegressor in the multiLayerPerceptrons package

RBFNetwork, RBFClassifier, and RBFRegressor (all in the RBFNetwork package)

SGD (stochastic gradient descent for several loss functions)

• Numeric prediction
M5P (model trees)

M5Rules (rules from model trees)

LWL (locally weighted learning)
