Much as the discipline of digital signal processing forms the technological basis for the processing of audio signals, pattern recognition is the basic field of study that underlies application areas such as speech recognition. In this chapter and the one that follows, we present some major concepts that are essential to understanding pattern-recognition technology. We begin in this chapter with a brief survey of the basic methods of classifying simple patterns. We distinguish here between pattern recognition and pattern classification by restricting the latter to the distinction between categories for sets of observed data samples or vectors. For the more inclusive term of pattern recognition we can also mean the recognition of sequences (which may be multidimensional) that are not presegmented into the patterns to be classified. In such cases it may not even be known exactly which training examples are synchronous with the class labels. For instance, in handwriting recognition, we wish to recognize the sequence of written marks with a sequence of characters (or words), but we may not know exactly where one letter ends and another begins. Similarly, in speech recognition, we wish to associate the input speech patterns with a sequence of words. Each of these examples requires the association of an observation sequence with a labeling sequence. For now, we are going put aside the issue of temporal sequencing to concentrate on the simpler case of pattern recognition for static patterns.

A supervised pattern-classification system is trained with labeled examples; that is, each input pattern has a class label associated with it. Pattern classifiers can also be trained in an unsupervised fashion. For example, in a technique known as vector quantization, some representation of the input data is clustered by finding implicit groupings in the data. The resulting table of cluster centers is known as a codebook, which can be used to index new vectors by finding the cluster center that is closest to the new vector. This approach has several advantages for low-bit-rate coding (some of which will be discussed in later chapters), but it also can be useful for the reduction of computation in speech recognition. For the most part, the pattern classifiers referred to in this book will be trained in a supervised manner.

As noted earlier, the job of static pattern classification is to classify individuals into groups. For example, consider classifying humans with the features of height and weight. In Fig. 8.1 the circles could be Swedish basketball players, whereas the triangles might represent speech-recognition researchers. For the figure as shown, it would be very easy to devise a system to divide people into the two classes.

For the case of speech, Fig. 8.2 shows an extreme case of some vowels represented by their formant frequencies F1 and F2. The vowels represented are as pronounced in the words beet (/i/), bat (/ae/), bot (/a/), and boot (/u/). Notice that they fall into nice groupings. Unfortunately, speech data are not generally this well behaved. For instance, if more vowels were added to the chart, the classes would often intermingle in the F1–F2 space. Furthermore, there is enough variability in speech that the graphing of the F1–F2 values from a large spoken-language database into this figure would leave little white space.


FIGURE 8.1 Distinct classes of Swedish basketball players (circles) and speech-recognition researchers (triangles). The points labeled z1 and z2 are the means for these two classes. The solid line represents a linear decision boundary derived from a minimum Euclidean distance, and the thick dashed line is a linear decision boundary that would perfectly separate classes for these data.

Note that in each case the individual examples were represented by some choice of variables. In the case of speech, for instance, we did not attempt to distinguish between vowel classes by using individual waveform amplitudes as features, but rather by using variables that we believed to be relevant to classification. In general, the choice of input representation is one of the major issues in the study of pattern recognition.


FIGURE 8.2 Fl and F2 for various vowels.


As shown in Fig. 8.3, the first step in any pattern-classification system is to evaluate some representation of the input pattern. Although in some cases this is the degenerate representation that consists of the raw input data, in general this can be improved with signal processing. In the case of speech, for instance, some form of power spectrum is often evaluated. Typically, these acoustic representations (vectors of relevant variables) are extracted within successive analysis windows that overlap. For example, windows of 20–30 ms overlapped by 10 ms are often used, although step sizes between frames can range from 5 to 20 ms. The features extracted are generally spectral or cepstral coefficients that condense the information in the speech signal to a vector of numbers, typically of length 5–40. For the classification of one-dimensional time series such as monaural music or speech, the input time series is transformed into a sequence of feature vectors that are sampled at a rate that is generally much lower than the original sequence (though each sample is represented by a vector rather than the original scalar).

What are the major considerations involved in the choice of a feature representation? Since the overall goal in pattern classification is to distinguish between examples of different classes, in general the goal of feature extraction is to reduce the variability for features for examples that are associated with the same class, while increasing the distinction between features from examples that belong to different classes. In some cases this goal can be formulated as an explicit criterion that yields a procedure or even an analytical solution for optimal performance (such as the linear discriminant that will be briefly described later in this chapter). The training data, however, may not give us a complete picture of the variability that will be seen during the classification process, so it is desirable to normalize for irrelevant factors in the data wherever possible. For the case of speech recognition, a number of these examples will be given in a later chapter, but for now we note that such normalization schemes (for instance, to reduce the effects of overall energy or of some constant spectral factor) can be important for practical classification systems.

Despite these deep problems associated with normalizing for unseen conditions, it is still true that it is desirable to learn something about the capabilities of candidate feature sets for discrimination from the data. This should be the goal of feature extraction – in some sense, the pattern classification can be solved trivially if the features are good enough. In practice, for most interesting problems none of the components of the task are truly trivial.


FIGURE 8.3 Pattern classification is often divided into two major components – feature extraction and the classification of the features associated with a pattern.

Aside from this basic goal of feature extraction, another perspective is the determination of the best feature vector size. The best representations for discrimination on the training set have a large dimensionality. In fact, given enough dimensions, a group of patterns may be split up in any arbitrary way. However, the best representations for generalization to the test set are usually succinct. Too detailed a representation may also represent characteristics of the particular training set that will not be present for independent data sets. Thus, there is a trade-off between the training-set classification performance and generalization to the test set. For instance, in the extreme case, you could represent the training set by storing a million measurements for each example; this will permit the splitting of the training set into the labeled categories with a trivial classifier. However, if many of the dimensions are not reliable measures for distinguishing between examples of the classes in unseen data (as is almost certainly the case with so many measures), then classification on a test set will be difficult.

In general, it is desirable to find a good (and moderate-sized) set of generalizable features. There are several common methods of reducing the dimensionality of the features. Two direct analytical approaches are principal components analysis (PCA) and linear discriminant analysis (LDA). The object of PCA is to find a projection of the feature vector down to a lower dimension, such that the variance accounted for is maximized. This is the most common approach used, and it is reasonable if you don't know anything else about the data. The limitation is that the components are chosen to maximize the variance accounted for, which can be a problem if most of the variance isn't important for discrimination such as the variance near 0 Hz within speech signals. For pattern classification, it is often better to have components that maximally account for discriminability.

Linear discriminant analysis is an approach that tries to maximize variance between classes and minimize variance within classes, given the constraint of a linear transformation of the input features. Chapter 5 of Duda, Hart, and Stork[3] gives a good description of this technique. This has been usefully applied to speech recognition by a number of researchers since the late 1980s [5, 4]. In general, discriminant approaches to dimensionality reduction have tended to give preferable results to using variance-based approaches such as PCA, though this is a deep and tricky issue. For instance, features chosen to maximally discriminate between samples with a high signal-to-noise ratio may be poor choices for test examples that are corrupted by unexpected additive noise.

One can also use application-specific knowledge to reduce the dimensionality of the feature vectors, or just test different reductions by evaluating a pattern recognizer trained on the reduced features on a test set. This is often helpful, although testing any significant number of possible feature subsets can often be quite computationally intensive.

8.2.1. Some Opinions

Although this is a debatable point, it can be argued that it is often better to throw away bad features (or bad data) than to reduce their weight. This is true even though we can generally establish an objective criterion for error and adjust weights to minimize this criterion. However, for many applications the biggest difficulty of pattern classification can be that the training patterns are not completely representative of unseen test data, and so in practice it can be better to simply ignore dimensions or data that we believe to be unreliable.

How can one determine what data or features are bad? Often the best way to make this determination is through the analysis of errors in the pattern recognizer. We often can simplify this process by having some idea of where to look for problems. For instance, in ASR, energy can be an unreliable feature for practical usage (since talker energy can vary significantly for the same word, given different speakers, gains, etc.), and removing it has often improved results in realistic conditions. In contrast, variables that reflect the signal energy can be important for a number of discriminations, for instance between voiced and unvoiced sounds. Fortunately, the local time derivative of log energy is independent of overall (long-term) gain, since


where C is a time-invariant (constant) quantity.

Finally, in pattern-recognition experiments, it is important to avoid two common errors during system development:

  1. Testing on the training set – the training examples are used to design the classifier, but the desired system is one that will perform well on unseen examples. Therefore, it is important to assess the classifier on examples that were not seen during training. The system could implicitly or explicitly be memorizing characteristics of the specific examples in the training set that may not be present in an independent test set.
  2. Training on the testing set – even when a test set is used for evaluation, features or parameters of the classifier are often modified over the course of many experiments. Thus, the system may also become tuned to the test set if it is used a number of times. It is common in speech-recognition research, for instance, to define a development set for all such adjustments; final performance is then reported on a second evaluation set for which no new parameters have been adjusted.


Given a feature vector choice, pattern classification primarily consists of the development or training of a system for classification of a large number of examples of the different classes. As of this writing, statistical methods are the dominant approach, and as such we will devote a separate chapter to their brief explanation. First, however, we describe the major methods for the training and use of deterministic pattern classifiers. It should be noted that these same methods can also be viewed statistically as well; however, the deterministic perspective may be a more intuitive starting point.

8.3.1. Minimum Distance Classifiers

Minimum distance classifiers are based on the notion that we can learn or store pattern examples that are representative of each class, and then we can identify the class of new examples by comparison to the stored values. For instance, if we store the heights and weights of all the Swedish basketball players and speech-recognition researchers, we can compare the height and weight of a new visitor to all of these examples and choose the class of the closest one as the class of the newcomer. One could even use some kind of a vote, in which the class of the majority of the local examples is used to determine the class of the new examples. Alternatively one could store one or more prototypes to represent each class, such as the most typical (or average) for each one, and again choose the class of the closest prototype as the class of the new example. All of these cases are types of minimum distance classifiers.

In a minimum distance classifier, a key point is the definition of a distance function that accepts as input two feature vectors, one representing a new example and one a stored vector, and returns the distance between the two vectors. The choice of this distance function is very important; it is also equivalent to implicit statistical assumptions about the data. Intuitively, however, simple distance measures can be misleading, given a different scale of values or differences in importance for the discrimination of different dimensions of the feature vector. For instance, in the case of height and weight, if weight is in pounds and height is in feet, the Euclidean distance between a new and a stored example will be dominated by weight, while in fact one might expect that height might be an important predictor of success of a career in professional basketball.

A sample minimum distance classifier might have j templates zi, where 0 ≤ ij. For each input vector x, choose


(where the symbol T refers to the transpose operation). In other words, choose the i which minimizes the distance Di. Manipulating the algebra, we get


Since xTx is constant over i, we can also equivalently choose to maximize the following function:


which is a linear function of x. If each template i is normalized so that ziTzi = 1, then we can just maximize


The minimum distance classifier in this case is equivalent to a classifier based on the maximum dot product with the stored vector.

Given this distance function, as noted earlier, we can derive a decision rule for classifying new vectors by assigning the vector the class of its nearest neighbor. This is called (appropriately) nearest-neighbor classification. In the limit of an infinite number of samples, this rule will achieve at most twice the error of the optimum classifier [3]. This can also be generalized to a k-nearest neighbor rule, where the class of the new vector is decided by a vote of the class of the k-nearest neighbors; in a variant of this, the voting weight is determined from the distance of the neighbor.

This technique is potentially problematic in several ways. For the case of the direct use of all examples, it requires a great deal of storage for large problems; in addition, the search over these examples can also be time consuming. To reduce computation and storage, one can sometimes store only a small number of prototypes to represent the class; for instance, in Fig. 8.1, we indicate the mean of each class (written as zi for class i). Also, unless the feature vector is preprocessed to equalize the importance of the feature dimensions, this method is particularly prone to scaling problems (as noted earlier). Finally, for high dimensions, the space will be sparsely sampled by the training set.

8.3.2. Discriminant Functions

As noted earlier, finding the minimum distance to a prototype is closely related to finding the maximum dot product (essentially a correlation) between a new vector and the stored vectors. Finding the best prototypes for this measurement, then, can be viewed as determining weights for a linear recombination of the input features that will be maximum for the correct class. The term for this determination of a function that is maximum for the correct class is discriminant analysis. This function will determine decision surfaces between classes, where the value of the function of values along a surface is the same for the two classes that are bounded by the surface. For example, Fig. 8.4 shows that two discriminant functions will be evaluated for each input pattern. The input values for which these two functions are equal specify a decision boundary in the input feature space. In general, there can be many discriminant functions that can then be used by a decision-making process to determine the best class for the pattern (as in Fig. 8.4).

Linear discriminant functions generate linear decision surfaces; in two dimensions (as in Fig. 8.1) the surface is a line, whereas in three dimensions it is a plane. In general, a linear decision surface is called a hyperplane. Returning to the case in which we use minimum distance to prototypes as the decision criterion (see Eq. 8.4), we find that each discriminant function would correspond to a constant plus a dot product between the input feature vector and the prototype. Suppose there are two prototypes (one for each class), z1 and z2. Then, for input vector variable x, the decision boundary corresponds to


FIGURE 8.4 A system to discriminate between two classes.




If the prototypes (templates) are all normalized, as in Eq. 8.5, the right-hand side goes to zero.

Take the example of our Swedish basketball players and speech-recognition researchers. A number of methods can be used to learn the right hypersurface (which in two dimensions is a line) to best separate the two classes; probably the most common approach is to use the Fisher linear discriminant function, which maximizes the ratio of between-class and within-class scatter.1 Alternatively, iterative methods such as gradient learning can be used to find the best line to divide the classes. In both cases, some criterion (such as the sum of the linear discriminant values for misclassified examples) is maximized (or minimized), in which the changes to the linear discriminant coefficients are adjusted according to the partial derivative of the criterion with respect to each coefficient.

A particularly simple approach would be to use Eq. 8.7, which was derived by using a discriminant function based on the minimum distance from a template that was the average of all of the examples of a class. The line corresponding to this equation (the thin solid line in Fig. 8.1) is not a particularly good decision boundary between the two classes. This illustrates one of the potential pitfalls with the use of minimum Euclidean distance as a criterion: in this case, the weight is represented with much larger numbers, and so it dominates the distances; this forces the decision boundary to be nearly horizontal. The dashed line in that figure shows a more sensible decision boundary, such as one might hope to get from one of the other approaches mentioned. Since this idealized line perfectly partitions the two classes, we refer to this data set as being linearly separable.

In current approaches, most often the technique for determining the linear separator is based on a statistical perspective. We will return to this point of view in the next chapter.

8.3.3. Generalized Discriminators

Often, one may want decision surfaces that are nonlinear. Although simple cases such as quadratic surfaces can be derived by methods similar to the ones discussed above, artificial neural networks can be utilized to get quite general surfaces. McCulloch and Pitts first modeled neurons in 1943, using a threshold logic unit (TLU). In this model, an output either fires or does not; firing occurs when a linear function of the input exceeds a threshold. This model could be viewed as equivalent to a linear discriminant function. As noted in earlier chapters, Rosenblatt developed perceptrons in the late 1950s and early 1960s, based on the McCulloch–Pitts neuron. In Rosenblatt's group, systems were built based on combinations of multiple neurons and also based on a final combining layer of neurons. Originally these multilayer perceptron (MLP) systems used TLUs, and so somewhat tricky training methods had to be used since these units implemented a function that was not differentiable. In other words, for systems of this generation, it was not straightforward to adjust an internal variable by using the partial derivative of an output error criterion with respect to the variable, since the TLU implemented a binary output. Some schemes were found to work around this problem; for instance, in discriminant analysis iterative design [5], the first layer of neurons was trained by simple statistical methods (see Chapter 9), while the output neurons were trained to minimize an error criterion based on a differentiable nonlinear function of the output sum (before the threshold). In later developments [12], [9] the thresholding function was replaced with a smoother (differentiable) nonlinear function. A common choice is the sigmoidal (S-like) function


which is bounded between zero and one.

The use of the differentiable nonlinearity (as opposed to a step function) permitted uniform training procedures for all layers based on changing the weights at all layers to minimize the output error criterion, using partial differentiation and the chain rule. It can be proved that a two-layer MLP2 with enough internal nodes (or hidden nodes) can learn any arbitrary mapping from input to output; hence, MLPs make good general discriminators. Figure 8.5 shows a typical MLP with two layers. A unit in the most common kind of modern MLP computes a nonlinear function [e.g., f(y)] of the weighted sum of its inputs (as shown in Fig. 8.6).


No computation takes place at the input layer; it only takes place at the hidden and output layers.

The training of such networks is straightforward. An error criterion is established (e.g., a mean-squared error), and this is then differentiated with respect to the weights that were used to evaluate the MLP output in Eq. 8.9. The weights are updated proportionately to this partial derivative, and an error signal is also computed that can be propagated backward to parts of the network that contributed to the value that was an input to the neurons in the output layer. This process can continue through multiple layers by using the chain rule of differentiation, since the nonlinearities that are typically used in most MLPs are differentiable functions such as the one given in Eq. 8.8. The overall algorithm is referred to as error backpropagation, and it is described in detail in such texts as [9] or [8]. A brief description and derivation are given in the Appendix at the end of this chapter.


FIGURE 8.5 A typical MLP.

During training, the update of the MLP parameters can be done in two different ways.

The first approach is off-line training. In this case, we accumulate the weight updates over all the training patterns and we modify the weights only when all the training patterns have been presented to the network. The gradient is then estimated for the complete set of training patterns, which guarantees, under the usual conditions of standard gradient procedures, the convergence of the algorithm.

The second way is on-line training. In this case, the MLP parameters are updated after each training pattern according to the local gradient. However, although this does not actually minimize the error gradient directly, it can be shown [10] that this process will stochastically converge to the same solution.3 In practice, the on-line training exhibits several advantages compared with the off-line procedure: it is generally acknowledged that it converges much faster and that it can more easily avoid local minima. This can be explained by the fact that the use of local gradients introduces noise in the training process, which usually improves the behavior of gradient searches because of the lowering of the risk of getting stuck in a suboptimal local minimum. Additionally, for large and varied training sets (such as are typically used for speech recognition), on-line training implies multiple passes through similar data for each single pass through the whole set.


FIGURE 8.6 A typical unit for a MLP. The unit implements a linear or nonlinear function of a weighted sum of the input variables. A typical choice for f(yj) is given by Eq. 8.8.


FIGURE 8.7 Illustration of the “maximum margin” decision boundary in a Support Vector Machine. Solid dots are training examples of one class, hollow dots are examples of the other class. Although many linear boundaries would perfectly separate the examples of the two classes, the SVM finds the decision boundary, shown as a solid line, that stays furthest away from any example, i.e., it has the largest possible “margin of error”. The dotted lines show the outer edges of this boundary, touching the closest training patterns (the “support vectors”).


MLPs can be viewed as a way of expressing a family of nonlinear decision boundaries whose complexity can be controlled by adjusting the number of hidden units, and whose parameters are set via gradient descent. In the mid-1990s, Corinna Cortes and Vladimir Vapnik proposed a new approach to the problem, based on Vapnik's earlier work on optimal classifiers. The Support Vector Machine (SVM) [2] employs a number of innovations to find nonlinear classification boundaries, and has subsequently had enormous impact on a wide range of applications, due to its excellent classification performance and resistance to overfitting. Rather than defining directly a family of nonlinear boundaries, the approach is to define an extended feature space (for instance by concatenating nonlinear functions or combinations of the existing feature dimensions), then solving for the optimal linear classifier in this extended space.

As the number of dimensions increases, the capacity of a linear classifier to make arbitrary separations of a finite set of training data increases, so at some dimensionality perfect performance on training data is possible. There may even be infinitely many hyperplanes that achieve maximum performance on the training data, which leads to the idea of maximum-margin classifiers: the optimal classifier is the one that keeps all the training examples as far away as possible from the decision boundary, so that for instance slight perturbations of their features due to noise are least likely to change their labeling. As illustrated in Fig. 8.7, this maximum-margin approach finds a decision boundary equidistant from the nearest training examples of each class. These nearest points are known as the “support vectors”, and the classifier depends only on these points and not at all on the remaining training examples. Finding these points and hence the decision boundary is cast as a quadratic-programming optimization problem, starkly different from the least-squares, gradient-descent solutions common in other classifiers.

One further innovation in Support Vector Machines is perhaps the most powerful. Because a plane in the high-dimensional space can be defined via inner products with specific points in that space, the SVM doesn't actually require the construction of the nonlinearly-extended feature space: all it requires is an inner product function (also known as a kernel function) that returns the inner product between two points in the space. By defining a nonlinear kernel function that operates on the original, lower-dimensional data, the optimization procedure can find the maximum-margin hyperplane without the computational burden of projecting points into a high dimensional space. In fact, the effective feature space may even be infinite-dimensional, meaning that this ‘kernel trick’ approach is the only way it could be handled computationally.

Although the theory of SVMs starts from perfectly-separating hyperplanes, in many situations we do not expect our labels or our features to provide perfect separation (for instance, some examples may simply be mislabeled, or the feature space may be inevitably overlapped). To avoid overfitting such data (i.e. finding a decision boundary that managed to snake its way around the interspersed examples in the overlap region), an extension to the SVM incorporates “slack variables” that allows a certain number of training points to remain on the incorrect side of the decision boundary (i.e. to have negative margins). The optimization problem remains much the same, seeking to identify the best set of points to maximize the total boundary to all the support vectors – including those exploiting their “slack”. A single cost weight, C, governs the trade-off between margin and slack, and is typically optimized empirically on cross-validation data.


FIGURE 8.8 Unsupervised clustering of examples visualized with the dendrogram. The left pane shows data in two dimensions that appear to fall into two clusters, centered around (0.5, 0.5) and (1.5, 1). The dendrogram on the right shows the progressive grouping of nearby examples by agglomerative hierarchical clustering. The topmost division separates the two main clusters.


So far, we have assumed our training examples are accompanied by the class label corresponding to each item. In many situations, however, examples are available without any predefined labels. Although this greatly limits what can be done with the data, there are still some techniques that can be applied to this unsupervised case. In particular, the examples may fall into natural clusters – as illustrated in Figure 8.8, in which the points in the left-hand panel appear to come from two distinct modes. These modes could take on the role of labels, leading to the problem of assigning labels to all examples that reflect their natural cluster. This process is known as clustering.

There is a large and diverse literature on clustering, reflecting among other factors the difficulty in defining the actual goal of the algorithm when data does not fall into clear clusters. One intuitive algorithm is agglomerative hierarchical clustering, which is illustrated in the right panel of Figure 8.8 by a representation known as a dendrogram [3]. Items are grouped into successively larger clusters, merging at each stage the most similar (least distant) pair of items or clusters. The dendrogram shows these merges, where the horizontal line joining the two items being merged reflects the dissimilarity (distance) at which the merge was made. (The items are rearranged on the horizontal axis to avoid crossing lines.) Discrete cluster assignment can be made by choosing a distance threshold, then separating each cluster that is still distinct at that threshold. In figure 8.8, a distance threshold of 0.2 divides the two clusters, whereas a threshold of 0.15 would separate out three of the outliers of the larger mode into separate clusters. (For this illustration, the distance between two clusters is taken as the shortest distance between any pair of examples spanning the clusters.)

While agglomerative clustering can lead to valuable visualizations, a more commonly used technique is k-means clustering. This algorithm starts with a desired number of clusters, k, each represented by a centroid point in feature space. These centroids are randomly initialized, e.g., by placing them at k randomly-selected examples from the data set. Then, each sample point is compared to each centroid, and assigned to closest. Once all points have been labeled this way, the centroids are recalculated as the mean positions of all points assigned to them, and the iteration repeats. If no centroids move in an iteration (i.e. all points keep the same labels), the process is complete.

Many other clustering algorithms have been proposed, addressing different limitations of simpler schemes. Spectral clustering [7], for example, is able to propagate clusters along elongated linkages by manipulating the eigenvectors of a modified “affinity” (near-neighbor) matrix. The Gaussian Mixture Model, described in the following chapter, is sometimes viewed as a clustering scheme, since it makes a probabilistic assignment of each training point to one of its Gaussian components. However, a more detailed discussion is beyond the scope of this chapter.


The use of nonlinear discriminant functions and/or feature spaces permits the construction of classification mechanisms that can become arbitrarily complicated and that need not conform to any strong prior notion of the nature of the decision boundary. The use of very general functions, however, makes the determination of a useful error criterion even more complicated. For this reason, as well as the availability of powerful training algorithms, statistical approaches have become very popular and important, particularly for speech recognition. This is the subject of Chapter 9, but even in the statistical case it is sometimes useful for intuition's sake to map approaches back to their approximate equivalent in a deterministic framework.


  • 8.1 Suppose that you wish to classify patterns into one of two classes. (Take the case of Swedish basketball players and speech-recognition researchers, given heights and weights for each example.) Unknown to you, it turns out that the squared value of heights and of weights (in centimeters and kilograms; respectively) sum to a value greater than some constant C for all basketball players, and to under this value for all the researchers. What is the shape that represents this decision surface? How would you need to modify the input features in order to make a straight line be the optimal decision surface?
  • 8.2 George has designed a classifier that can perfectly distinguish between basketball players and researchers for every example in the training set. Martha has designed a classifier that makes some errors in the training set classification. However, Martha insists that hers is better. How can she be right? In other words, what are some potential pitfalls in the development of a classifier given a training set?
  • 8.3 Assuming that one is willing to use an infinite number of training examples (and take an infinite amount of time for classification), would it be possible to do perfect classification of speech utterances given all possible waveforms?
  • 8.4 How can one simply transform the feature values to equalize the influence that they have on a Euclidean distance? Using the data points from Fig. 8.1, do this transformation and show that the distance-based linear classifier does a better job than the original one.
  • 8.5 Give some conditions under which it might be preferable to train a linear pattern classifier or a generalized nonlinear one such as a MLP.


There is a large body of literature on training methods for multilayer perceptrons. However, here we simply define notation and derive equations for the most common form of back-propagation training. Similar methods are useful in a large number of nonlinear systems for pattern classification. The development is largely reproduced from Chapter 4 of [1].

8.8.1. Definitions

We generalize the MLP of Fig. 8.5 to an η-layered perceptron consisting of (η + 1) layers Ll(l = 0, .. η) of several units, where L0 corresponds to the input layer, Lη to the output layer, and Ll(l = 1, ....., η – 1) to the hidden layers. Hidden and output units are computational units, and their output values are determined by first summing all of their inputs (as given by Eq. 8.9) and then passing the results through the sigmoid function given by Eq. 8.8. The output values of layer Ll form a nl vector hl(xn), which is a function of the input vector xn. Here nl is the number of units in Ll. Input vector h0(xn) and output vectors hη(xn) are also denoted xn and g(xn) in the following. Vector imagel(xn) (l = 0, ...., η – 1) stands for the (nl + 1) augmented vector, where the zeroth unit will be fixed to one and will account for the biases of the following layer. As the biasing unit is irrelevant for the output layer, we have imageη = hη. Layer Ll–1 is fully connected to layer Ll by a (nl–1 + 1) × nl weight matrix imagel. Matrix Wl denotes imagel deprived of its first row (corresponding to the biases). The state propagation is thus described by


where F is a nonlinear function, typically a sigmoid function (e.g., Eq. 8.8) that operates componentwise. Finally, we may write symbolically:


where g is now a nonlinear function of xn, depending on the parameters image.

The model parameters (the weight matrices imagel) are obtained from a set of training input and associated (or desired) output pairs by minimizing, in the parameter space, the error criterion defined as


where, given each training input vector xn and d(xn) is the desired output associated with xn, g(xn) represents the output vector hη(xn) generated by the system. If there is at least one hidden layer, given Eqs. 8.10 and 8.11, g(xn) is a nonlinear function of the input xn (defining nonlinear decision surfaces) and contains the sigmoid function of Eq. 8.8. The total number of training patterns is denoted by N.

Error backpropagation training of such a system can be briefly summarized as follows. For each training iteration t (t = 1, . . ., T), there is presentation of all the training input vectors xn n = 1, . ., N, forward computation of the output vectors g(xn) (using Eq. 8.10), and calculation of the error function E. There is also backward propagation of the error (using the chain rule) to compute the partial derivative of the error criterion with respect to every weight, and update of the weights according to


in which α is usually referred to as the step-size parameter or learning rate and has to be small enough to guarantee the convergence of the process. In other words, we compute the error gradient for the weight vector and adjust the weights in the opposite direction.

8.8.2. Derivation

In the following paragraph, it is shown that the backpropagation algorithm can be derived by interpreting the problem as a constrained minimization problem. This approach, initially proposed in [6], regards MLP training as a constrained minimization problem involving the network state variables imagel(xn) and the weights. We remind the reader of a method that is often used for such problems, namely that of Lagrange multipliers. Suppose that we wish to find an extreme value (e.g., minimum) for a function f (x, y, z), with side constraints given by equations such as g(x, y, z) = 0. We then construct the auxiliary function H(x, y, z, λ) = f(x, y, z) + λg(x, y, z), and we find values of x, y, z, and λ for which all the partial derivatives of H are equal to zero. This gives the desired solution, and the approach can be generalized to multiple side conditions by using multiple constraint terms, each with its own Lagrange multiplier imagei. See standard calculus textbooks (e.g., [10]) for further explanation and derivation.

The MLP training problem may be specified by a single Lagrange function L containing the objective function E and constraint terms, each multiplied by a Lagrange multiplier imagel. In this particular case, the constraint terms describe the network architecture, that is, the forward equations of the network.

For each training pattern xn, we want hη(xn) = d(xn) under the η constraints represented by Eq. 8.10. By introducing vectorial Lagrange multipliers imagel, l = 1, ....,η, we transform the problem to one of minimizing a modified error function L versus the parameters imagel, hl(xn), and imagel, for l = 1, ....,η, with


A constraint is met when the corresponding term in L is zero. It may be shown that


corresponds to a minimum of E while meeting the constraints. We may split condition 8.15 into its constituent partials.

Condition 1:


where the derivative is applied to each component of its argument. This leads to


which comprises the forward recurrences 8.10 of the error backpropagation algorithm.

Condition 2:


Setting to zero the derivative with respect to hη leads to


Differentiation with respect to the other hl's yields


By defining


we finally obtain


and the backward recurrence


which comes from substituting the expression for imagel into the definition for bl(xn) above and then replacing the expression for bl+1(xn) in the result.

This recurrence yields a vector that is generated for each layer from the higher (later) layers. The next condition will show how this vector, which is the backward error propagation term, is used to update the weights.

Condition 3:


This leads to




The weight matrices imagel satisfying these equations can be obtained by an iterative gradient procedure making weight changes according to image where a is again the learning-rate parameter. The parameters at training step t + 1 are then calculated from their value at iteration t by


which is the standard weight update formula of the error backpropagation algorithm.

These three conditions, when met, give a complete specification of the backpropagation training of the network: optimizing with respect to the Lagrange multipliers gives the forward propagation equations; optimization with respect to the state variables gives the backward equations (the gradients); and optimization with respect to the weights gives the weight update equations.


  1. Bourlard, H., and Morgan, N., Connectionist Speech Recognition: A Hybrid Approach, Kluwer, Boston, Mass., 1994.
  2. Cortes, C., and Vapnik, V., “Support-vector networks,” Machine Learning, 20: 273–297, 1995.
  3. Duda, D., Hart, P., and Stork, D., Pattern Classification (2nd Ed.), Wiley–Interscience, New York, 2001.
  4. Haeb-Umbach, R., Geller, D., and Ney, H., “Improvements in connected digit recognition using linear discriminant analysis and mixture densities,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Adelaide, Australia, pp. II-239–242, 1994.
  5. Hunt, M. and Lefebvre, C., “A comparison of several acoustic representations for speech recognition with degraded and undegraded speech,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Glasgow, Scotland, pp. 262–265, 1989.
  6. le Cun, Y. “A theoretical framework for back-propagation,” Proc. 1988 Connectionist Models Summer School, CMU, Pittsburgh, Pa., pp. 21–28, 1988.
  7. Ng, A. Y., Jordan, M. I., and Weiss, Y., “On spectral clustering: Analysis and an algorithm,” Advances in Neural Information Processing Systems, vol. 2, pp. 849–856, MIT Press, Cambridge, Mass., 2002.
  8. Ripley, B., Pattern Recognition and Neural Networks, Cambridge Univ. Press, London/New York, 1996.
  9. Rumelhart, D. E., Hinton, G. E., and Williams, R. J., “Learning internal representations by error propagation,” in D. E. Rumelhart and J. L. McClelland, eds., Parallel Distributed Processing. Exploration of the Microstructure of Cognition. vol. 1: Foundations, MIT Press, Cambridge, Mass., 1986.
  10. Thomas, G., and Finney, R., Calculus and Analytic Geometry, 5th ed., Addison–Wesley, Reading, Mass., 1980.
  11. Viglione, S. S., “Applications of pattern recognition technology in adaptive learning and pattern recognition systems,” in J. M. Mendel and K. S. Fu, eds., Adaptive Learning and Pattern Recognition Systems, Academic Press, New York, pp. 115–161, 1970.
  12. Werbos, P. J., “Beyond regression: new tools for prediction and analysis in the behavioral sciences,” Ph.D. Thesis, Harvard University, Cambridge, Mass., 1974.
  13. Widrow, B., and Stearns, S., Adaptive Signal Processing, Prentice–Hall, Englewood Cliffs, N.J., 1985.

1The within-class scatter is proportional to the sample covariance matrix for the pooled data. The between-class scatter is the weighted sum image, where C is the number of classes, mi is the mean vector for the ith class which contains ni items, and m is the total mean vector [3].

2Unfortunately, the literature is inconsistent on the naming convention here: the MLP in the figure is sometimes called a two-layer system (since there are two layers doing the computation) and sometimes called a three-layer system (since the input features form a kind of a layer).

3Since the local gradient can be viewed as a random variable whose mean is the true gradient, such an approach is sometimes called a stochastic gradient procedure.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.