Much as the discipline of digital signal processing forms the technological basis for the processing of audio signals, pattern recognition is the basic field of study that underlies application areas such as speech recognition. In this chapter and the one that follows, we present some major concepts that are essential to understanding pattern-recognition technology. We begin in this chapter with a brief survey of the basic methods of classifying simple patterns. We distinguish here between pattern recognition and pattern classification by restricting the latter to the assignment of observed data samples or vectors to categories. The more inclusive term pattern recognition can also refer to the recognition of sequences (which may be multidimensional) that are not presegmented into the patterns to be classified. In such cases it may not even be known exactly which training examples are synchronous with the class labels. For instance, in handwriting recognition, we wish to associate the sequence of written marks with a sequence of characters (or words), but we may not know exactly where one letter ends and another begins. Similarly, in speech recognition, we wish to associate the input speech patterns with a sequence of words. Each of these examples requires the association of an observation sequence with a labeling sequence. For now, we are going to put aside the issue of temporal sequencing to concentrate on the simpler case of pattern recognition for static patterns.

A *supervised* pattern-classification system is trained with labeled examples; that is, each input pattern has a class label associated with it. Pattern classifiers can also be trained in an *unsupervised* fashion. For example, in a technique known as vector quantization, some representation of the input data is clustered by finding implicit groupings in the data. The resulting table of cluster centers is known as a codebook, which can be used to index new vectors by finding the cluster center that is closest to the new vector. This approach has several advantages for low-bit-rate coding (some of which will be discussed in later chapters), but it also can be useful for the reduction of computation in speech recognition. For the most part, the pattern classifiers referred to in this book will be trained in a supervised manner.
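The codebook lookup step described above can be sketched in a few lines; the codebook entries and query vectors here are invented purely for illustration, and a real vector quantizer would first learn its centers by clustering training data.

```python
# Hedged sketch of vector-quantization lookup: index a new vector by its
# nearest codebook entry. Codebook values are invented for illustration.

def quantize(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    best_i, best_d = 0, float("inf")
    for i, center in enumerate(codebook):
        d = sum((v - c) ** 2 for v, c in zip(vec, center))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

# three cluster centers, as might result from unsupervised training
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
```

A new vector is then represented by a single integer index, which is the source of both the bit-rate reduction and the computational savings mentioned above.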

As noted earlier, the job of static pattern classification is to classify individuals into groups. For example, consider classifying humans with the features of height and weight. In Fig. 8.1 the circles could be Swedish basketball players, whereas the triangles might represent speech-recognition researchers. For the figure as shown, it would be very easy to devise a system to divide people into the two classes.

For the case of speech, Fig. 8.2 shows an extreme case of some vowels represented by their formant frequencies F1 and F2. The vowels represented are as pronounced in the words beet (/i/), bat (/ae/), bot (/a/), and boot (/u/). Notice that they fall into nice groupings. Unfortunately, speech data are not generally this well behaved. For instance, if more vowels were added to the chart, the classes would often intermingle in the F1–F2 space. Furthermore, there is enough variability in speech that the graphing of the F1–F2 values from a large spoken-language database into this figure would leave little white space.

Note that in each case the individual examples were represented by some choice of variables. In the case of speech, for instance, we did not attempt to distinguish between vowel classes by using individual waveform amplitudes as features, but rather by using variables that we believed to be relevant to classification. In general, the choice of input representation is one of the major issues in the study of pattern recognition.

As shown in Fig. 8.3, the first step in any pattern-classification system is to evaluate some representation of the input pattern. Although in some cases this is the degenerate representation that consists of the raw input data, in general it can be improved with signal processing. In the case of speech, for instance, some form of power spectrum is often evaluated. Typically, these acoustic representations (vectors of relevant variables) are extracted within successive, overlapping analysis windows. For example, windows of 20–30 ms advanced in steps of 10 ms are often used, although step sizes between frames can range from 5 to 20 ms. The features extracted are generally spectral or cepstral coefficients that condense the information in the speech signal to a vector of numbers, typically of length 5–40. For the classification of one-dimensional time series such as monaural music or speech, the input time series is transformed into a sequence of feature vectors that are sampled at a rate that is generally much lower than that of the original sequence (though each sample is represented by a vector rather than the original scalar).
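The windowing arithmetic can be made concrete with a short sketch; the 8-kHz sample rate and the 25-ms/10-ms figures are illustrative choices within the ranges quoted above, not a prescription.

```python
# Hedged sketch: split a sampled signal into overlapping analysis windows.
# Sample rate and window/step sizes are assumed values for illustration.

def frame_signal(x, sr, win_ms=25.0, hop_ms=10.0):
    """Return a list of overlapping frames of samples x."""
    win = int(sr * win_ms / 1000)   # samples per analysis window
    hop = int(sr * hop_ms / 1000)   # samples between successive window starts
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]

sr = 8000                     # assumed sample rate (Hz)
x = [0.0] * sr                # one second of (silent) signal
frames = frame_signal(x, sr)  # 25-ms windows, 10-ms step
```

Each frame would then be passed to a spectral or cepstral analysis to produce one feature vector, so one second of signal yields roughly a hundred vectors rather than thousands of raw samples.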

What are the major considerations involved in the choice of a feature representation? Since the overall goal in pattern classification is to distinguish between examples of different classes, in general the goal of feature extraction is to reduce the variability of the features across examples that are associated with the same class, while increasing the distinction between features from examples that belong to different classes. In some cases this goal can be formulated as an explicit criterion that yields a procedure or even an analytical solution for optimal performance (such as the linear discriminant that will be briefly described later in this chapter). The training data, however, may not give us a complete picture of the variability that will be seen during the classification process, so it is desirable to normalize for irrelevant factors in the data wherever possible. For the case of speech recognition, a number of these examples will be given in a later chapter, but for now we note that such normalization schemes (for instance, to reduce the effects of overall energy or of some constant spectral factor) can be important for practical classification systems.

Despite these deep problems associated with normalizing for unseen conditions, it is still true that it is desirable to learn something about the capabilities of candidate feature sets for discrimination from the data. This should be the goal of feature extraction – in some sense, the pattern classification can be solved trivially if the features are good enough. In practice, for most interesting problems none of the components of the task are truly trivial.

Aside from this basic goal of feature extraction, another perspective is the determination of the best feature vector size. The best representations for discrimination on the training set have a large dimensionality. In fact, given enough dimensions, a group of patterns may be split up in any arbitrary way. However, the best representations for generalization to the test set are usually succinct. Too detailed a representation may also represent characteristics of the particular training set that will not be present for independent data sets. Thus, there is a trade-off between the training-set classification performance and generalization to the test set. For instance, in the extreme case, you could represent the training set by storing a million measurements for each example; this will permit the splitting of the training set into the labeled categories with a trivial classifier. However, if many of the dimensions are not reliable measures for distinguishing between examples of the classes in unseen data (as is almost certainly the case with so many measures), then classification on a test set will be difficult.

In general, it is desirable to find a good (and moderate-sized) set of generalizable features. There are several common methods of reducing the dimensionality of the features. Two direct analytical approaches are principal components analysis (PCA) and linear discriminant analysis (LDA). The object of PCA is to find a projection of the feature vector down to a lower dimension such that the variance accounted for is maximized. This is the most common approach used, and it is a reasonable one if you don't know anything else about the data. The limitation is that the components are chosen to maximize the variance accounted for, which can be a problem if most of the variance isn't important for discrimination (such as the variance near 0 Hz in speech signals). For pattern classification, it is often better to have components that maximally account for discriminability.
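The PCA criterion can be illustrated on a toy 2-D data set; here the first principal component is found by power iteration on a hand-computed covariance matrix (a stand-in for the full eigendecomposition a real PCA would use, and the data are invented).

```python
# Hedged sketch: first PCA direction for 2-D data via power iteration.
# A full PCA would extract all eigenvectors of the covariance matrix.
import math

def principal_component(data, iters=100):
    """Dominant eigenvector of the 2x2 covariance matrix of `data`."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    # covariance matrix entries [[cxx, cxy], [cxy, cyy]]
    cxx = sum((p[0] - mx) ** 2 for p in data) / n
    cyy = sum((p[1] - my) ** 2 for p in data) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n
    v = (1.0, 1.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# invented data stretched along the x-axis; the first component should be ~(1, 0)
data = [(-3.0, 0.1), (-1.0, -0.1), (1.0, 0.1), (3.0, -0.1)]
v = principal_component(data)
```

Note that the direction found is the one of maximum variance, with no guarantee that it is the most discriminative one, which is exactly the limitation discussed above.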

Linear discriminant analysis is an approach that tries to maximize the variance between classes while minimizing the variance within classes, given the constraint of a linear transformation of the input features. Chapter 5 of Duda, Hart, and Stork [3] gives a good description of this technique. It has been usefully applied to speech recognition by a number of researchers since the late 1980s [5, 4]. In general, discriminant approaches to dimensionality reduction have tended to give better results than variance-based approaches such as PCA, though this is a deep and tricky issue. For instance, features chosen to maximally discriminate between samples with a high signal-to-noise ratio may be poor choices for test examples that are corrupted by unexpected additive noise.
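For the two-class case, the Fisher direction has the closed form w = S_w^{-1}(m_1 - m_2), where S_w is the pooled within-class scatter matrix and m_1, m_2 are the class means. A 2-D sketch with invented data:

```python
# Hedged sketch: Fisher discriminant direction for two classes of 2-D data.
# Data values are invented; within-class spread is along x, class separation along y.

def fisher_direction(class1, class2):
    """Return w = Sw^-1 (m1 - m2) for 2-D point lists."""
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        return sxx, sxy, syy

    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sxx, sxy, syy = s1[0] + s2[0], s1[1] + s2[1], s1[2] + s2[2]
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # 2x2 matrix inverse applied to the mean difference
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)

class1 = [(-1.0, 2.0), (1.0, 2.0), (0.0, 2.2), (0.0, 1.8)]
class2 = [(-1.0, -2.0), (1.0, -2.0), (0.0, -2.2), (0.0, -1.8)]
w = fisher_direction(class1, class2)
```

Because the classes differ only in y while most of the variance lies along x, the Fisher direction points along y; PCA applied to the pooled data would instead pick the uninformative x-axis.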

One can also use application-specific knowledge to reduce the dimensionality of the feature vectors, or just test different reductions by evaluating a pattern recognizer trained on the reduced features on a test set. This is often helpful, although testing any significant number of possible feature subsets can often be quite computationally intensive.

Although this is a debatable point, it can be argued that it is often better to throw away bad features (or bad data) than to reduce their weight. This is true even though we can generally establish an objective criterion for error and adjust weights to minimize this criterion. However, for many applications the biggest difficulty of pattern classification can be that the training patterns are not completely representative of unseen test data, and so in practice it can be better to simply ignore dimensions or data that we believe to be unreliable.

How can one determine what data or features are bad? Often the best way to make this determination is through the analysis of errors in the pattern recognizer. We often can simplify this process by having some idea of where to look for problems. For instance, in ASR, energy can be an unreliable feature for practical usage (since talker energy can vary significantly for the same word, given different speakers, gains, etc.), and removing it has often improved results in realistic conditions. In contrast, variables that reflect the signal energy can be important for a number of discriminations, for instance between voiced and unvoiced sounds. Fortunately, the local time derivative of log energy is independent of overall (long-term) gain, since

\frac{d}{dt}\log\big(C\,E(t)\big) = \frac{d}{dt}\log C + \frac{d}{dt}\log E(t) = \frac{d}{dt}\log E(t),   (8.1)

where *C* is a time-invariant (constant) quantity and *E*(*t*) is the signal energy.
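This gain invariance is easy to check numerically; in the sketch below the frame energies are invented, and the first difference of log energy stands in for the time derivative.

```python
# Hedged sketch: the delta of log energy is unchanged by a constant gain.
# Frame energies are invented values for illustration.
import math

energies = [1.0, 2.0, 8.0, 4.0]          # toy frame energies
gain = 10.0
scaled = [gain * e for e in energies]     # same utterance, louder

def delta_log(es):
    """First difference of log energy (a discrete time derivative)."""
    return [math.log(es[t + 1]) - math.log(es[t]) for t in range(len(es) - 1)]

d1, d2 = delta_log(energies), delta_log(scaled)
```

The two delta sequences agree, even though the raw log energies differ by log *C* everywhere, which is why such difference features survive unknown recording gains.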

Finally, in pattern-recognition experiments, it is important to avoid two common errors during system development:

- Testing on the training set – the training examples are used to design the classifier, but the desired system is one that will perform well on unseen examples. Therefore, it is important to assess the classifier on examples that were not seen during training. The system could implicitly or explicitly be memorizing characteristics of the specific examples in the training set that may not be present in an independent test set.
- Training on the testing set – even when a test set is used for evaluation, features or parameters of the classifier are often modified over the course of many experiments. Thus, the system may also become tuned to the test set if it is used a number of times. It is common in speech-recognition research, for instance, to define a development set for all such adjustments; final performance is then reported on a second evaluation set for which no new parameters have been adjusted.

Given a feature vector choice, pattern classification primarily consists of the development or *training* of a system for classification of a large number of examples of the different classes. As of this writing, statistical methods are the dominant approach, and as such we will devote a separate chapter to their brief explanation. First, however, we describe the major methods for the training and use of deterministic pattern classifiers. It should be noted that these same methods can also be viewed statistically; however, the deterministic perspective may be a more intuitive starting point.

Minimum distance classifiers are based on the notion that we can learn or store pattern examples that are representative of each class, and then we can identify the class of new examples by comparison to the stored values. For instance, if we store the heights and weights of all the Swedish basketball players and speech-recognition researchers, we can compare the height and weight of a new visitor to all of these examples and choose the class of the closest one as the class of the newcomer. One could even use some kind of a vote, in which the class of the majority of the local examples is used to determine the class of the new examples. Alternatively one could store one or more prototypes to represent each class, such as the most typical (or average) for each one, and again choose the class of the closest prototype as the class of the new example. All of these cases are types of minimum distance classifiers.

In a minimum distance classifier, a key point is the definition of a distance function that accepts as input two feature vectors, one representing a new example and one a stored vector, and returns the distance between the two. The choice of this distance function is very important; it is equivalent to making implicit statistical assumptions about the data. Intuitively, simple distance measures can be misleading when the dimensions of the feature vector differ in their scale of values or in their importance for the discrimination. For instance, in the case of height and weight, if weight is in pounds and height is in feet, the Euclidean distance between a new and a stored example will be dominated by weight, whereas in fact one might expect height to be an important predictor of success in a career in professional basketball.
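A common remedy for the scale problem is to standardize each feature to zero mean and unit variance before computing distances; a sketch with invented height/weight values:

```python
# Hedged sketch: per-dimension standardization (zero mean, unit variance),
# so no single feature dominates a Euclidean distance. Data are invented.
import math

def standardize(data):
    """Return data with each dimension scaled to zero mean and unit variance."""
    dims = len(data[0])
    stats = []
    for d in range(dims):
        vals = [p[d] for p in data]
        m = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
        stats.append((m, sd))
    return [tuple((p[d] - stats[d][0]) / stats[d][1] for d in range(dims))
            for p in data]

# heights in feet, weights in pounds: raw distances would be dominated by weight
data = [(6.9, 210.0), (6.5, 195.0), (5.8, 160.0), (5.6, 150.0)]
z = standardize(data)
```

After this transformation, a unit step along the height axis matters as much as a unit step along the weight axis, which is the spirit of Exercise 8.4 below.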

A sample minimum distance classifier might have *j* templates **z**_{i}, where 1 ≤ *i* ≤ *j*. For each input vector **x**, choose the class corresponding to the minimum of

D_i = (\mathbf{x} - \mathbf{z}_i)^T (\mathbf{x} - \mathbf{z}_i)   (8.2)

(where the symbol *T* refers to the transpose operation). In other words, choose the *i* that minimizes the distance *D*_{i}. Manipulating the algebra, we get

D_i = \mathbf{x}^T\mathbf{x} - 2\,\mathbf{x}^T\mathbf{z}_i + \mathbf{z}_i^T\mathbf{z}_i.   (8.3)

Since \mathbf{x}^T\mathbf{x} is constant over *i*, we can also equivalently choose to maximize the function

g_i(\mathbf{x}) = \mathbf{x}^T\mathbf{z}_i - \frac{1}{2}\,\mathbf{z}_i^T\mathbf{z}_i,   (8.4)

which is a linear function of **x**. If each template *i* is normalized so that **z**_{i}^{T}**z**_{i} = 1, then we can just maximize

\mathbf{x}^T\mathbf{z}_i.   (8.5)

The minimum distance classifier in this case is equivalent to a classifier based on the maximum dot product with the stored vector.
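This equivalence is easy to verify numerically; the sketch below compares the two decision rules on invented templates that have been normalized to unit length.

```python
# Hedged sketch: for unit-length templates, the nearest template by Euclidean
# distance is also the one with the largest dot product. Values are invented.
import math

def nearest_by_distance(x, templates):
    return min(range(len(templates)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, templates[i])))

def nearest_by_dot(x, templates):
    # valid when every template z satisfies z . z = 1
    return max(range(len(templates)),
               key=lambda i: sum(a * b for a, b in zip(x, templates[i])))

raw = [(3.0, 1.0), (-1.0, 2.0), (0.5, -2.0)]
templates = [tuple(v / math.hypot(*z) for v in z) for z in raw]  # unit length
```

The dot-product form is often preferred in practice, since it avoids forming the difference vector and maps directly onto a correlation.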

Given this distance function, as noted earlier, we can derive a decision rule for classifying new vectors by assigning the vector the class of its nearest neighbor. This is called (appropriately) nearest-neighbor classification. In the limit of an infinite number of samples, this rule will achieve at most twice the error of the optimum classifier [3]. This can also be generalized to a *k*-nearest neighbor rule, where the class of the new vector is decided by a vote of the class of the *k*-nearest neighbors; in a variant of this, the voting weight is determined from the distance of the neighbor.
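A minimal *k*-nearest-neighbor classifier along these lines follows; all heights and weights are invented for illustration.

```python
# Hedged sketch: k-nearest-neighbor classification by majority vote.
# The example heights (cm) and weights (kg) are invented.
from collections import Counter

def knn_classify(x, examples, k=3):
    """examples: list of (vector, label) pairs. Vote among the k nearest."""
    ranked = sorted(examples,
                    key=lambda ex: sum((a - b) ** 2 for a, b in zip(x, ex[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

examples = [((200, 95), "player"), ((205, 100), "player"), ((198, 92), "player"),
            ((175, 70), "researcher"), ((168, 65), "researcher"),
            ((180, 78), "researcher")]
```

With k = 1 this is plain nearest-neighbor classification; larger k implements the voting variant described above.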

This technique is potentially problematic in several ways. For the case of the direct use of all examples, it requires a great deal of storage for large problems; in addition, the search over these examples can also be time consuming. To reduce computation and storage, one can sometimes store only a small number of prototypes to represent each class; for instance, in Fig. 8.1, we indicate the mean of each class (written as **z**_{i} for class *i*), which can then serve as the prototype for that class.

As noted earlier, finding the minimum distance to a prototype is closely related to finding the maximum dot product (essentially a correlation) between a new vector and the stored vectors. Finding the best prototypes for this measurement, then, can be viewed as determining weights for a linear recombination of the input features that will be maximum for the correct class. The term for this determination of a function that is maximum for the correct class is discriminant analysis. Such functions determine decision surfaces between classes, where the value of the function along a surface is the same for the two classes that it bounds. For example, Fig. 8.4 shows a case in which two discriminant functions are evaluated for each input pattern. The input values for which these two functions are equal specify a decision boundary in the input feature space. In general, there can be many discriminant functions, which can then be used by a decision-making process to determine the best class for the pattern (as in Fig. 8.4).

Linear discriminant functions generate linear decision surfaces; in two dimensions (as in Fig. 8.1) the surface is a line, whereas in three dimensions it is a plane. In general, a linear decision surface is called a hyperplane. Returning to the case in which we use minimum distance to prototypes as the decision criterion (see Eq. 8.4), we find that each discriminant function would correspond to a constant plus a dot product between the input feature vector and the prototype. Suppose there are two prototypes (one for each class), **z**_{1} and **z**_{2}. Then, for input vector variable **x**, the decision boundary corresponds to

\mathbf{x}^T\mathbf{z}_1 - \frac{1}{2}\,\mathbf{z}_1^T\mathbf{z}_1 = \mathbf{x}^T\mathbf{z}_2 - \frac{1}{2}\,\mathbf{z}_2^T\mathbf{z}_2   (8.6)

or

\mathbf{x}^T(\mathbf{z}_1 - \mathbf{z}_2) = \frac{1}{2}\left(\mathbf{z}_1^T\mathbf{z}_1 - \mathbf{z}_2^T\mathbf{z}_2\right).   (8.7)

If the prototypes (templates) are all normalized, as in Eq. 8.5, the right-hand side goes to zero.

Take the example of our Swedish basketball players and speech-recognition researchers. A number of methods can be used to learn the right hypersurface (which in two dimensions is a line) to best separate the two classes; probably the most common approach is to use the Fisher linear discriminant function, which maximizes the ratio of between-class to within-class scatter.^{1} Alternatively, iterative methods such as gradient learning can be used to find the best line to divide the classes. In the iterative case, some criterion (such as the sum of the linear discriminant values for misclassified examples) is maximized (or minimized) by adjusting the linear discriminant coefficients according to the partial derivative of the criterion with respect to each coefficient.

A particularly simple approach would be to use Eq. 8.7, which was derived by using a discriminant function based on the minimum distance from a template that was the average of all of the examples of a class. The line corresponding to this equation (the thin solid line in Fig. 8.1) is not a particularly good decision boundary between the two classes. This illustrates one of the potential pitfalls of using minimum Euclidean distance as a criterion: in this case, the weight feature is represented with much larger numbers, and so it dominates the distances; this forces the decision boundary to be nearly horizontal. The dashed line in that figure shows a more sensible decision boundary, such as one might hope to get from one of the other approaches mentioned. Since this idealized line perfectly partitions the two classes, we refer to this data set as being linearly separable.

In current approaches, most often the technique for determining the linear separator is based on a statistical perspective. We will return to this point of view in the next chapter.

Often, one may want decision surfaces that are nonlinear. Although simple cases such as quadratic surfaces can be derived by methods similar to the ones discussed above, artificial neural networks can be utilized to get quite general surfaces. McCulloch and Pitts first modeled neurons in 1943, using a threshold logic unit (TLU). In this model, an output either fires or does not; firing occurs when a linear function of the input exceeds a threshold. This model could be viewed as equivalent to a linear discriminant function. As noted in earlier chapters, Rosenblatt developed perceptrons in the late 1950s and early 1960s, based on the McCulloch–Pitts neuron. In Rosenblatt's group, systems were built based on combinations of multiple neurons and also based on a final combining layer of neurons. Originally these multilayer perceptron (MLP) systems used TLUs, and so somewhat tricky training methods had to be used since these units implemented a function that was not differentiable. In other words, for systems of this generation, it was not straightforward to adjust an internal variable by using the partial derivative of an output error criterion with respect to the variable, since the TLU implemented a binary output. Some schemes were found to work around this problem; for instance, in discriminant analysis iterative design [5], the first layer of neurons was trained by simple statistical methods (see Chapter 9), while the output neurons were trained to minimize an error criterion based on a differentiable nonlinear function of the output sum (before the threshold). In later developments [12], [9] the thresholding function was replaced with a smoother (differentiable) nonlinear function. A common choice is the sigmoidal (S-like) function

f(y) = \frac{1}{1 + e^{-y}},   (8.8)

which is bounded between zero and one.

The use of the differentiable nonlinearity (as opposed to a step function) permitted uniform training procedures for all layers, based on changing the weights at all layers to minimize the output error criterion by using partial differentiation and the chain rule. It can be proved that a two-layer MLP^{2} with enough internal nodes (or hidden nodes) can learn an arbitrary mapping from input to output; hence, MLPs make good general discriminators. Figure 8.5 shows a typical MLP with two layers. A unit in the most common kind of modern MLP computes a nonlinear function [e.g., *f*(*y*)] of the weighted sum of its inputs (as shown in Fig. 8.6):

x_{\mathrm{out}} = f\left(\sum_i w_i\, x_i\right).   (8.9)

No computation takes place at the input layer; it only takes place at the hidden and output layers.

The training of such networks is straightforward. An error criterion is established (e.g., a mean-squared error), and this is then differentiated with respect to the weights that were used to evaluate the MLP output in Eq. 8.9. The weights are updated proportionately to this partial derivative, and an error signal is also computed that can be propagated backward to parts of the network that contributed to the value that was an input to the neurons in the output layer. This process can continue through multiple layers by using the chain rule of differentiation, since the nonlinearities that are typically used in most MLPs are differentiable functions such as the one given in Eq. 8.8. The overall algorithm is referred to as error backpropagation, and it is described in detail in such texts as [9] or [8]. A brief description and derivation are given in the Appendix at the end of this chapter.
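The update equations can be sketched for a toy 2-2-1 sigmoidal MLP trained on the (linearly separable) OR function; the architecture, initial weights, learning rate, and task are all illustrative choices, not a prescription, and the gradient terms below drop a constant factor of 2 that is absorbed into the learning rate.

```python
# Hedged sketch of backpropagation for a 2-input, 2-hidden, 1-output sigmoidal
# MLP. All weights, the learning rate, and the toy task (OR) are invented.
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

# fixed, asymmetric initial weights; each unit stores [bias, w1, w2]
w_hid = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
w_out = [0.1, -0.2, 0.2]

def forward(x):
    h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_hid]
    o = sigmoid(w_out[0] + w_out[1] * h[0] + w_out[2] * h[1])
    return h, o

def mse(data):
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # logical OR
err_init = mse(data)

lr = 0.5
for _ in range(2000):                 # on-line (per-pattern) updates
    for x, t in data:
        h, o = forward(x)
        # output-layer error term: (o - t) times the sigmoid derivative o(1 - o)
        d_out = (o - t) * o * (1 - o)
        # backpropagate through the output weights to the hidden layer
        d_hid = [d_out * w_out[j + 1] * h[j] * (1 - h[j]) for j in range(2)]
        w_out[0] -= lr * d_out
        for j in range(2):
            w_out[j + 1] -= lr * d_out * h[j]
            w_hid[j][0] -= lr * d_hid[j]
            w_hid[j][1] -= lr * d_hid[j] * x[0]
            w_hid[j][2] -= lr * d_hid[j] * x[1]

err_final = mse(data)
```

The inner loop is exactly the chain-rule computation described above: an error term is formed at the output, scaled by the sigmoid derivative, and propagated backward through the weights that fed the output unit.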

During training, the update of the MLP parameters can be done in two different ways.

The first approach is off-line training. In this case, we accumulate the weight updates over all the training patterns and modify the weights only after all the training patterns have been presented to the network. The gradient is then estimated from the complete set of training patterns, which guarantees, under the usual conditions of standard gradient procedures, the convergence of the algorithm.

The second way is on-line training. In this case, the MLP parameters are updated after each training pattern according to the local gradient. Although this does not follow the exact gradient of the total error, it can be shown [10] that the process converges stochastically to the same solution.^{3} In practice, on-line training exhibits several advantages over the off-line procedure: it is generally acknowledged to converge much faster and to avoid local minima more easily. This can be explained by the fact that the use of local gradients introduces noise into the training process, which usually improves the behavior of gradient searches by lowering the risk of getting stuck in a suboptimal local minimum. Additionally, for large and varied training sets (such as are typically used for speech recognition), on-line training effectively provides many updates from similar data within each single pass through the whole set.
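The two regimes can be contrasted on a one-weight linear unit fit to invented data; both converge here, differing only in when the accumulated gradient is applied.

```python
# Hedged sketch: off-line (batch) vs. on-line updates for a single linear
# unit y = w * x, fit to invented data generated by y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
lr = 0.01

def batch_train(w, epochs):
    for _ in range(epochs):
        # accumulate the gradient over all patterns, then update once
        grad = sum(2 * (w * x - t) * x for x, t in data) / len(data)
        w -= lr * grad
    return w

def online_train(w, epochs):
    for _ in range(epochs):
        for x, t in data:
            w -= lr * 2 * (w * x - t) * x   # update after every pattern
    return w
```

On this trivially convex problem both recover w close to 3; the practical differences described above only show up on large, redundant, non-convex problems such as MLP training.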

MLPs can be viewed as a way of expressing a family of nonlinear decision boundaries whose complexity can be controlled by adjusting the number of hidden units and whose parameters are set via gradient descent. In the mid-1990s, Corinna Cortes and Vladimir Vapnik proposed a new approach to the problem, based on Vapnik's earlier work on optimal classifiers. The Support Vector Machine (SVM) [2] employs a number of innovations to find nonlinear classification boundaries, and it has subsequently had enormous impact on a wide range of applications because of its excellent classification performance and resistance to overfitting. Rather than directly defining a family of nonlinear boundaries, the approach is to define an extended feature space (for instance, by concatenating nonlinear functions or combinations of the existing feature dimensions) and then solve for the optimal linear classifier in this extended space.

As the number of dimensions increases, the capacity of a linear classifier to make arbitrary separations of a finite set of training data increases, so at some dimensionality perfect performance on training data is possible. There may even be infinitely many hyperplanes that achieve maximum performance on the training data, which leads to the idea of maximum-margin classifiers: the *optimal* classifier is the one that keeps all the training examples as far away as possible from the decision boundary, so that for instance slight perturbations of their features due to noise are least likely to change their labeling. As illustrated in Fig. 8.7, this maximum-margin approach finds a decision boundary equidistant from the nearest training examples of each class. These nearest points are known as the “support vectors”, and the classifier depends only on these points and not at all on the remaining training examples. Finding these points and hence the decision boundary is cast as a quadratic-programming optimization problem, starkly different from the least-squares, gradient-descent solutions common in other classifiers.

One further innovation in Support Vector Machines is perhaps the most powerful. Because a plane in the high-dimensional space can be defined via inner products with specific points in that space, the SVM doesn't actually require the construction of the nonlinearly-extended feature space: all it requires is an inner product function (also known as a kernel function) that returns the inner product between two points in the space. By defining a nonlinear kernel function that operates on the original, lower-dimensional data, the optimization procedure can find the maximum-margin hyperplane without the computational burden of projecting points into a high dimensional space. In fact, the effective feature space may even be infinite-dimensional, meaning that this ‘kernel trick’ approach is the only way it could be handled computationally.
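A concrete instance of this idea: the polynomial kernel K(**x**, **y**) = (**x** · **y**)^2 on 2-D inputs equals an ordinary inner product in an explicit 3-D feature space that the kernel never has to construct.

```python
# Hedged sketch of the kernel trick: the polynomial kernel below matches the
# inner product of an explicit (and here small) expanded feature map.
import math

def poly_kernel(x, y):
    """K(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces K."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(x, y)                              # kernel evaluation
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))     # explicit inner product
```

For this kernel the expanded space has only three dimensions, so the explicit map is still cheap; for higher-order polynomial or Gaussian kernels the expanded space grows huge or infinite, and only the kernel form remains computationally feasible.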

Although the theory of SVMs starts from perfectly separating hyperplanes, in many situations we do not expect our labels or our features to provide perfect separation (for instance, some examples may simply be mislabeled, or the classes may inevitably overlap in the feature space). To avoid overfitting such data (i.e., finding a decision boundary that manages to snake its way around the interspersed examples in the overlap region), an extension to the SVM incorporates “slack variables” that allow a certain number of training points to remain on the incorrect side of the decision boundary (i.e., to have negative margins). The optimization problem remains much the same, seeking the decision boundary that maximizes the margin to all the support vectors – including those exploiting their “slack.” A single cost weight, *C*, governs the trade-off between margin and slack, and it is typically optimized empirically on cross-validation data.

So far, we have assumed our training examples are accompanied by the class label corresponding to each item. In many situations, however, examples are available without any predefined labels. Although this greatly limits what can be done with the data, there are still some techniques that can be applied to this *unsupervised* case. In particular, the examples may fall into natural clusters – as illustrated in Figure 8.8, in which the points in the left-hand panel appear to come from two distinct modes. These modes could take on the role of labels, leading to the problem of assigning labels to all examples that reflect their natural cluster. This process is known as clustering.

There is a large and diverse literature on clustering, reflecting among other factors the difficulty of defining the actual goal of the algorithm when data do not fall into clear clusters. One intuitive algorithm is agglomerative hierarchical clustering, which is illustrated in the right panel of Figure 8.8 by a representation known as a dendrogram [3]. Items are grouped into successively larger clusters, merging at each stage the most similar (least distant) pair of items or clusters. The dendrogram shows these merges, where the height of the horizontal line joining the two items being merged reflects the dissimilarity (distance) at which the merge was made. (The items are rearranged on the horizontal axis to avoid crossing lines.) A discrete cluster assignment can be made by choosing a distance threshold and then separating each cluster that is still distinct at that threshold. In Figure 8.8, a distance threshold of 0.2 divides the two clusters, whereas a threshold of 0.15 would separate three of the outliers of the larger mode into their own clusters. (For this illustration, the distance between two clusters is taken as the shortest distance between any pair of examples spanning the clusters.)
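A bare-bones single-linkage version of this procedure can be sketched as follows, stopping at a distance threshold rather than building the full dendrogram; the points and threshold are invented for illustration.

```python
# Hedged sketch of agglomerative (single-linkage) clustering with a distance
# threshold as the stopping criterion. Point values are invented.

def single_linkage(points, threshold):
    """Merge clusters whose closest cross-pair is under threshold; return labels."""
    clusters = [[i] for i in range(len(points))]

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    def cluster_dist(c1, c2):
        # single linkage: shortest distance between any pair spanning the clusters
        return min(dist(i, j) for i in c1 for j in c2)

    while len(clusters) > 1:
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        if cluster_dist(clusters[a], clusters[b]) > threshold:
            break                     # every remaining pair is too far apart
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for lab, c in enumerate(clusters):
        for i in c:
            labels[i] = lab
    return labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
labels = single_linkage(pts, threshold=1.0)
```

Raising or lowering the threshold corresponds to cutting the dendrogram at a different height, yielding coarser or finer cluster assignments.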

While agglomerative clustering can lead to valuable visualizations, a more commonly used technique is *k*-means clustering. This algorithm starts with a desired number of clusters, *k*, each represented by a centroid point in feature space. These centroids are randomly initialized, e.g., by placing them at *k* randomly selected examples from the data set. Then each sample point is compared with each centroid and assigned to the closest one. Once all points have been labeled this way, the centroids are recalculated as the mean positions of all the points assigned to them, and the iteration repeats. If no centroids move in an iteration (i.e., all points keep the same labels), the process is complete.
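The loop just described can be sketched as follows; a simple deterministic initialization (the first *k* points) replaces random selection here so that the toy result is reproducible, and the data are invented.

```python
# Hedged sketch of k-means: assign points to nearest centroids, recompute
# centroids as cluster means, repeat until the labels stop changing.

def kmeans(points, k, iters=100):
    centroids = [points[i] for i in range(k)]   # deterministic toy init
    labels = [0] * len(points)
    for _ in range(iters):
        new_labels = [min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
                      for p in points]
        if new_labels == labels:
            break                               # converged: no point moved
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                         # leave an empty cluster alone
                centroids[c] = tuple(sum(v[d] for v in members) / len(members)
                                     for d in range(len(points[0])))
    return labels, centroids

pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
labels, centroids = kmeans(pts, 2)
```

Note the connection to the minimum distance classifiers above: each iteration is simply a minimum-distance labeling against the current centroids, followed by a prototype update.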

Many other clustering algorithms have been proposed, addressing different limitations of simpler schemes. Spectral clustering [7], for example, is able to propagate clusters along elongated linkages by manipulating the eigenvectors of a modified “affinity” (near-neighbor) matrix. The Gaussian Mixture Model, described in the following chapter, is sometimes viewed as a clustering scheme, since it makes a probabilistic assignment of each training point to one of its Gaussian components. However, a more detailed discussion is beyond the scope of this chapter.

The use of nonlinear discriminant functions and/or feature spaces permits the construction of classification mechanisms that can become arbitrarily complicated and that need not conform to any strong prior notion of the nature of the decision boundary. The use of very general functions, however, makes the determination of a useful error criterion even more complicated. For this reason, as well as the availability of powerful training algorithms, statistical approaches have become very popular and important, particularly for speech recognition. This is the subject of Chapter 9, but even in the statistical case it is sometimes useful for intuition's sake to map approaches back to their approximate equivalent in a deterministic framework.

**8.1** Suppose that you wish to classify patterns into one of two classes. (Take the case of Swedish basketball players and speech-recognition researchers, given heights and weights for each example.) Unknown to you, it turns out that the squares of the height and weight (in centimeters and kilograms, respectively) sum to a value greater than some constant *C* for all basketball players, and to under this value for all the researchers. What is the shape that represents this decision surface? How would you need to modify the input features in order to make a straight line be the optimal decision surface?

**8.2** George has designed a classifier that can perfectly distinguish between basketball players and researchers for every example in the training set. Martha has designed a classifier that makes some errors in the training-set classification. However, Martha insists that hers is better. How can she be right? In other words, what are some potential pitfalls in the development of a classifier given a training set?

**8.3** Assuming that one is willing to use an infinite number of training examples (and take an infinite amount of time for classification), would it be possible to do perfect classification of speech utterances given all possible waveforms?

**8.4** How can one simply transform the feature values to equalize the influence that they have on a Euclidean distance? Using the data points from Fig. 8.1, do this transformation and show that the distance-based linear classifier does a better job than the original one.

**8.5** Give some conditions under which it might be preferable to train a linear pattern classifier rather than a generalized nonlinear one such as an MLP.
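As background for the distance-based questions above, a common way to keep one feature from dominating a Euclidean distance is to standardize each dimension to zero mean and unit variance. A minimal sketch (the height/weight values here are made up for illustration):

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) examples; on the raw scales,
# the larger-valued feature dominates a Euclidean distance.
X = np.array([[180.0, 75.0],
              [195.0, 90.0],
              [170.0, 68.0],
              [200.0, 95.0]])

# Standardize each feature: subtract its mean and divide by its standard
# deviation, giving every dimension zero mean and unit variance so each
# contributes comparably to Euclidean distances.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
```

The same per-feature means and standard deviations computed on the training set must be reused when transforming any test point.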

There is a large body of literature on training methods for multilayer perceptrons. However, here we simply define notation and derive equations for the most common form of back-propagation training. Similar methods are useful in a large number of nonlinear systems for pattern classification. The development is largely reproduced from Chapter 4 of [1].

We generalize the MLP of Fig. 8.5 to an η-layered perceptron consisting of (η + 1) layers *L _{l}* (*l* = 0, 1, . . ., η), where *L*_{0} is the input layer. Let *h _{l}*(*x _{n}*) denote the output vector of layer *l* for training input *x _{n}*, with *h*_{0}(*x _{n}*) = *x _{n}*, and let *W _{l}* be the weight matrix feeding layer *l*. The forward equations are then

*h _{l}*(*x _{n}*) = *F*(*W _{l}*^{T} *h _{l−1}*(*x _{n}*)), *l* = 1, . . ., η,

where *F* is a nonlinear function, typically a sigmoid function (e.g., Eq. 8.8), that operates componentwise. Finally, we may write symbolically

*g*(*x _{n}*, Θ) = *h*_{η}(*x _{n}*),

where *g* is now a nonlinear function of *x _{n}*, depending on the parameters Θ = {*W*_{1}, . . ., *W*_{η}}.
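Numerically, this forward computation is just a loop over layers. A brief sketch (the choice of the logistic sigmoid for *F* and the particular weight shapes are illustrative):

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid, applied componentwise.
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, weights):
    """Forward recurrence: h_0 = x, then h_l = F(W_l^T h_{l-1})."""
    h = x
    for W in weights:
        h = sigmoid(W.T @ h)
    return h
```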

The model parameters (the weight matrices *W _{l}*) are obtained from a set of training input and associated (or desired) output pairs by minimizing, in the parameter space, the error criterion defined as

*E* = Σ_{*n*=1}^{*N*} ||*g*(*x _{n}*, Θ) − *d _{n}*||^{2},

where, given each training input vector *x _{n}* and its associated desired output vector *d _{n}*, the squared error is accumulated over all *N* training pairs.

Error backpropagation training of such a system can be briefly summarized as follows. For each training iteration *t* (*t* = 1, . . ., *T*), there is a presentation of all the training input vectors *x _{n}* (*n* = 1, . . ., *N*), after which each weight matrix is updated according to

*W _{l}*(*t* + 1) = *W _{l}*(*t*) − α ∂*E*/∂*W _{l}*,

in which α is usually referred to as the *step-size parameter* or *learning rate* and has to be small enough to guarantee the convergence of the process. In other words, we compute the error gradient for the weight vector and adjust the weights in the opposite direction.

In the following paragraphs, it is shown that the backpropagation algorithm can be derived by interpreting MLP training as a constrained minimization problem. This approach, initially proposed in [6], involves the network state variables *h _{l}*(*x _{n}*) and the weights. We remind the reader of a method that is often used for such problems, namely that of Lagrange multipliers. Suppose that we wish to find an extreme value (e.g., a minimum) of a function *f*(*x*) subject to a constraint *g*(*x*) = 0. The method forms the function *f*(*x*) + λ*g*(*x*), where the new unknown λ is the Lagrange multiplier, and seeks points at which all partial derivatives vanish.

The MLP training problem may be specified by a single Lagrange function *L* containing the objective function *E* and constraint terms, each multiplied by a Lagrange multiplier λ _{l}. In this particular case, the constraint terms describe the network architecture, that is, the forward equations of the network.

For each training pattern *x _{n}*, we want the network's forward equations to hold as constraints:

*h _{l}*(*x _{n}*) − *F*(*W _{l}*^{T} *h _{l−1}*(*x _{n}*)) = 0, *l* = 1, . . ., η.

A constraint is met when the corresponding term in *L* is zero. It may be shown that a stationary point of *L*, that is, a solution of

∇*L* = 0, (8.15)

corresponds to a minimum of *E* while meeting the constraints. We may split condition 8.15 into its constituent partials.

**Condition 1**:

where the derivative is applied to each component of its argument. This leads to

which comprises the forward recurrences 8.10 of the error backpropagation algorithm.

**Condition 2**:

Setting to zero the derivative with respect to *h _{η}* leads to

Differentiation with respect to the other *h _{l}*'s yields

By defining

we finally obtain

and the backward recurrence

which comes from substituting the expression for λ _{l} into the definition for *b _{l}*(*x _{n}*).

This recurrence yields a vector that is generated for each layer from the higher (later) layers. The next condition will show how this vector, which is the backward error propagation term, is used to update the weights.

**Condition 3**:

This leads to

or

The weight matrices *W _{l}* satisfying these equations can be obtained by an iterative gradient procedure making weight changes according to

Δ*W _{l}* = −α ∂*L*/∂*W _{l}*,

where α is again the learning-rate parameter. The parameters at training step *t* + 1 are then calculated from their value at iteration *t* by

which is the standard weight update formula of the error backpropagation algorithm.

These three conditions, when met, give a complete specification of the backpropagation training of the network: optimizing with respect to the Lagrange multipliers gives the forward propagation equations; optimization with respect to the state variables gives the backward equations (the gradients); and optimization with respect to the weights gives the weight update equations.
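Taken together, the three conditions suggest a compact implementation. The following is a hedged sketch, not the book's code: the layer sizes, learning rate, per-pattern updating, and the squared-error criterion with sigmoid units are our own choices for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, D, sizes, alpha=0.5, iters=2000, seed=0):
    """Per-pattern backpropagation for a sigmoid MLP.

    Condition 1 gives the forward recurrence h_l = F(W_l^T h_{l-1});
    Condition 2 gives the backward recurrence for the error terms;
    Condition 3 gives the weight update W_l <- W_l - alpha * h_{l-1} b_l^T.
    """
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0.0, 0.5, (sizes[l], sizes[l + 1]))
          for l in range(len(sizes) - 1)]
    for _ in range(iters):
        for x, d in zip(X, D):
            # Forward recurrence (Condition 1), starting from h_0 = x.
            hs = [x]
            for W in Ws:
                hs.append(sigmoid(W.T @ hs[-1]))
            # Output-layer error term for a squared-error criterion,
            # scaled by the sigmoid derivative h(1 - h).
            b = (hs[-1] - d) * hs[-1] * (1.0 - hs[-1])
            for l in range(len(Ws) - 1, -1, -1):
                grad = np.outer(hs[l], b)  # Condition 3: gradient for W_l
                if l > 0:
                    # Backward recurrence (Condition 2), using the
                    # pre-update weights of layer l.
                    b = (Ws[l] @ b) * hs[l] * (1.0 - hs[l])
                Ws[l] -= alpha * grad
    return Ws
```

Updating after each pattern rather than after a full presentation of the training set gives the stochastic-gradient variant mentioned in the footnotes.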

1. Bourlard, H., and Morgan, N., *Connectionist Speech Recognition: A Hybrid Approach*, Kluwer, Boston, Mass., 1994.
2. Cortes, C., and Vapnik, V., “Support-vector networks,” *Machine Learning*, **20**: 273–297, 1995.
3. Duda, R., Hart, P., and Stork, D., *Pattern Classification*, 2nd ed., Wiley–Interscience, New York, 2001.
4. Haeb-Umbach, R., Geller, D., and Ney, H., “Improvements in connected digit recognition using linear discriminant analysis and mixture densities,” *Proc. IEEE Int. Conf. Acoust. Speech Signal Process.*, Adelaide, Australia, pp. II-239–242, 1994.
5. Hunt, M., and Lefebvre, C., “A comparison of several acoustic representations for speech recognition with degraded and undegraded speech,” *Proc. IEEE Int. Conf. Acoust. Speech Signal Process.*, Glasgow, Scotland, pp. 262–265, 1989.
6. le Cun, Y., “A theoretical framework for back-propagation,” *Proc. 1988 Connectionist Models Summer School*, CMU, Pittsburgh, Pa., pp. 21–28, 1988.
7. Ng, A. Y., Jordan, M. I., and Weiss, Y., “On spectral clustering: Analysis and an algorithm,” *Advances in Neural Information Processing Systems*, vol. 14, pp. 849–856, MIT Press, Cambridge, Mass., 2002.
8. Ripley, B., *Pattern Recognition and Neural Networks*, Cambridge Univ. Press, London/New York, 1996.
9. Rumelhart, D. E., Hinton, G. E., and Williams, R. J., “Learning internal representations by error propagation,” in D. E. Rumelhart and J. L. McClelland, eds., *Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations*, MIT Press, Cambridge, Mass., 1986.
10. Thomas, G., and Finney, R., *Calculus and Analytic Geometry*, 5th ed., Addison–Wesley, Reading, Mass., 1980.
11. Viglione, S. S., “Applications of pattern recognition technology in adaptive learning and pattern recognition systems,” in J. M. Mendel and K. S. Fu, eds., *Adaptive Learning and Pattern Recognition Systems*, Academic Press, New York, pp. 115–161, 1970.
12. Werbos, P. J., “Beyond regression: New tools for prediction and analysis in the behavioral sciences,” Ph.D. thesis, Harvard University, Cambridge, Mass., 1974.
13. Widrow, B., and Stearns, S., *Adaptive Signal Processing*, Prentice–Hall, Englewood Cliffs, N.J., 1985.

^{1}The within-class scatter is proportional to the sample covariance matrix for the pooled data. The between-class scatter is the weighted sum Σ_{*i*=1}^{*C*} *n _{i}*(*m _{i}* − *m*)(*m _{i}* − *m*)^{T}, where *C* is the number of classes, *n _{i}* is the number of samples in class *i*, *m _{i}* is the mean vector for the *i*th class, and *m* is the overall sample mean.
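These scatter matrices are straightforward to compute; a brief sketch (the function name and data layout are our own), which also exhibits the standard identity that the within- and between-class scatters sum to the total scatter about the overall mean:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (S_w) and between-class (S_b) scatter.

    X is (n_samples x d); y holds the class label of each sample.
    S_w sums, over classes, the scatter of samples about their class
    mean; S_b is the weighted sum of outer products of the difference
    between each class mean and the overall mean.
    """
    m = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        S_b += len(Xc) * np.outer(mc - m, mc - m)
    return S_w, S_b
```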

^{2}Unfortunately, the literature is inconsistent on the naming convention here: the MLP in the figure is sometimes called a two-layer system (since there are two layers doing the computation) and sometimes called a three-layer system (since the input features form a kind of a layer).

^{3}Since the local gradient can be viewed as a random variable whose mean is the true gradient, such an approach is sometimes called a stochastic gradient procedure.
