• Search in book...
• Toggle Font Controls

# STATISTICAL PATTERN CLASSIFICATION

## 9.1 INTRODUCTION

Audio signals such as speech or music are produced as a result of many causes. For instance, although a speech signal originates in a vocal apparatus, the received signal is generally also affected by the frequency response of the audio channel, additive noise, and so on. The latter sources of variability are unrelated to the message that is communicated. Much of the development of speech-signal analysis is motivated by the desire to deterministically account for as much of this variance as possible. For example, it would be desirable to reduce the spectral variability among multiple examples of the same word uttered with different amounts of background noise. However, with the use of common representations of speech, signals corresponding to the same linguistic message vary significantly.

For this reason, it is often desirable to model an audio signal such as speech as a random process and then to use statistical tools to analyze it. This general class of approaches has proven to be extremely useful for a range of speech-processing tasks, but most notably for speech recognition. However, before seriously discussing the applications of statistics to speech processing, we must introduce some basic concepts. In this chapter we briefly introduce statistical pattern classification (see [5] for a much more thorough treatment). As with Chapter 8, we focus on static patterns and leave the recognition of temporal sequences for a later chapter.

## 9.2 A FEW DEFINITIONS

Here we give, without proof (or detailed explanation), a few operational definitions that should prove useful for this and later chapters. More rigorous definitions as well as explanations and definitions for basic terms such as probability and random variable can be found in sources such as Papoulis [6].

1. A discrete random variable has a range of isolated possible values. For example, the random variable corresponding to the number of dots associated with a pair of thrown dice can take only the values 2, 3, ..., 12. The set of probabilities for all possible values of such variables form a discrete probability density function, whose values sum to one.

2. A continuous random variable has a continuum of possible values. For example, the temperature at a particular point in space and time could be any value within some reasonable range. In reality our measurement accuracy is limited, so that in some sense we only know the temperature to the closest value with some finite resolution; however, for many problems the number of such possible values is huge, and so it is better to characterize the variable as being continuous. In this case, there is a corresponding continuous probability density function that must integrate to one; that is, for x, a continuous random variable, and P(x), the probability density function (or pdf) for x,

In other words, the area under the P(x) curve is one.

The integral of this function over any finite interval corresponds to a probability. For simplicity's sake, both probability densities and probabilities will be represented by P(.).

3. A joint pdf governs multiple variables. Thus, for random variables x and y,

In other words, the volume under the P(x, y) curve is one.

4. A conditional density function is equivalent to a joint density function that has been scaled down by the density function for one of the variables. In other words,

and, consequently,

5. Given Eqs. 9.4 and 9.5, we also note that

which is one form of Bayes' rule. Thus, a conditional probability density function can be expressed as a product of the opposite conditioning times a ratio of the pdf's of the individual variables. This is extremely useful for manipulating the form of pdf's.

## 9.3 CLASS-RELATED PROBABILITY FUNCTIONS

In Chapter 8 we showed that there was a close relationship between a minimum distance classifier and discriminant analysis. When speech features are viewed as random variables, their probability density functions can be used to derive distance measures, for which the Euclidean case is a special case that implies certain statistical assumptions. Additionally, the corresponding discriminant functions have a straightforward statistical interpretation. This can be used to define (in theory) an optimal decision rule, that is, one that has the minimum probability of classification error.

To begin with, let us assume that there are K classes. If we let ωk be the kth class and let x be an input feature vector, then the following statements are true:

• Pk| x) is the probability that the correct class is cok given the input feature vector. This is sometimes referred to as the posterior or a posteriori probability, since this probability can only be estimated after the data have been seen.
• P(ωk) is the probability of class ωk which is called the prior or a priori probability of class k, since this can be evaluated before x has been observed.
• P(x |ωk) is the conditional pdf of x (conditioned on a particular class k), sometimes referred to as a likelihood function.

Intuitively, we can think of a likelihood function as being a kind of a closeness measure (if a particular class-dependent density is closer to the new observation than other densities, it will tend to have a higher likelihood); we will see later on that the negative log of this function can often be interpreted directly as a distance.

Finally, using Bayes' rule, we can express the posterior in terms of the prior and the likelihood as follows:

## 9.4 MINIMUM ERROR CLASSIFICATION

We can optimally classify a feature vector into a class by using a maximum a posteriori (MAP) decision rule. This rule is optimum in the sense that it will give the minimum probability of error. The rule can be stated simply as follows. Assign x to class k if

In other words, choose the class with the maximum posterior probability. This should be an intuitive result; one should always choose the most probable class given the evidence available. Why is this also the optimum strategy in theory?

To illustrate this we take the example of a two-class problem. In this case the probability of error is

Clearly for any observed value of x, this probability of error is minimized if we choose the value of k corresponding to the maximum posterior probability. Finally, since the overall probability of error is

then if P(error | x) is the minimal value for each choice of x, the overall integral is minimized1.

This extends naturally to the case of more than two classes.

## 9.5 LIKELIHOOD-BASED MAP CLASSIFICATION

Some kinds of systems directly estimate the class posterior probabilities P(ω | x). However, often it is easier to generate estimates of likelihoods P(x | ω). For this latter case one can typically compute statistics separately for the input vectors falling into the different classes. Given the Bayes rule formulation from earlier in this chapter, there is a straightforward (in principle) transformation between the two kinds of densities. In fact, the factor P(x) typically does not need to be computed. Since P(x) is a constant for all classes, finding the maximum posterior is equivalent to determining

Decisions between candidate classes can also be made by using a ratio of posterior probabilities:

The Bayes decision rule says that if this ratio is > 1 then we pick ωk over ωj. This is equivalent to assigning x to class k if

for all j other than k.

The left-hand side of Eq. 9.12 is often referred to as the likelihood ratio. Taking the log of the likelihood ratio, we find that the rule states the ωk is chosen over ωj if

The MAP classification based on likelihoods is then equivalent to choosing a class to maximize a statistical discriminant function:

## 9.6 APPROXIMATING A BAYES CLASSIFIER

How do we find the Bayes classifier? Unfortunately, the actual density functions referred to earlier are essentially always unknown for real problems. Consequently, we must estimate the densities from training data. The training results in the learning of some set of parameters Θ, which will then be used during classification to estimate the required probabilities. For the formulation of Eq. 9.14, this would mean that we would train estimators of P(x|ωk, Θ) for each class ωk. Maximum likelihood (ML) procedures are used to learn the parameters that will give the largest possible values for these quantities. Given a good enough estimator, enough data, and perfect estimates for the prior probabilities P( ωk), parameters Θ that are learned by the ML criterion will lead to the MAP classifier.

Some specific approaches to training the parameters Θ include the following.

1. For each class, count the instances for which a feature or group of features is closest to a finite set of prototypical values. When the discretization is done for a group of features (e.g., locating the nearest spectrum among one of 256 candidate spectra), it is called vector quantization, or VQ.

2. Assume a parametric form for the density, whose parameters can be directly estimated from the data. The most common example of this approach is the Gaussian distribution.

3. Assume a parametric form for the density, in which the parameters must be estimated with an iterative solution. For example, a density can be represented as a weighted sum or mixture of Gaussian densities – a Gaussian mixture model, or GMM – with the means, covariances, and mixture weights to be learned from the data. These parameters are typically learned through some variant of the expectation maximization (EM) algorithm (Section 9.8). GMMs are particularly useful for modeling densities that are more complicated than simple Gaussians (e.g., multimodal distributions).

4. Use automatic interpolation–learning to estimate posteriors directly (e.g., with a neural network; see Chapter 27. In this case it is the posterior probabilities that are learned, rather than the likelihood densities. As with the Gaussian mixtures, complicated multimodal distributions can be represented in this way.

In practice the training of posterior estimators may lead to different results than the training of likelihood estimators. In principle direct posterior estimators will require the training of parameters that are influenced by all of the input vectors, which can greatly increase the complexity, though there can be a trade off with a smaller number of parameters. We will return to these issues in Chapter 27.

For the moment, we restrict ourselves to the case of a single Gaussian. An analysis of this case can often provide us with insight that may be useful in understanding the more complex methods of density estimation, to which we will return later in this text.

The central feature of the Gaussian assumption is that the density is completely defined by the first- and second-order statistics (means and covariances). In one dimension (one feature), the Gaussian probability density function can be expressed as

where μi is the mean and σi is the standard deviation (square root of the variance) for the ith class.

In two or more dimensions, this takes the form

where Σi is the covariance matrix of the ith class,2i| is the corresponding determinant, and d is the dimension of x.

The product (x – μi)TΣi–1(x – μi) is often called the (squared) Mahalanobis distance.Note that it is similar in form to the Euclidean distance, which would be (x – μi)T (x – μi). In the special case in which the covariance matrix is diagonal (all off-diagonal values equal to zero), the inclusion of the factor Σi–1 is equivalent to scaling all the variables to a variance of one before computing the Euclidean distance. More generally, multiplying by this factor is equivalent to rotating the feature vectors so that their covariance matrix would be the identity matrix. In other words, the effect is to decorrelate the feature vectors, in addition to the scaling of each feature to have unity variance.

Recalling expression 9.13, the optimal Bayes classifier in the case of a Gaussian density would correspond to the use of a statistical discriminant function:

Note that the first term of this function is the Mahalanobis distance, so that when the later terms can be ignored, choosing i to maximize gi(x) is equivalent to choosing the class with the minimum Mahalanobis distance between the input vector and the corresponding mean vector.

Note also that the term d/2 loge 2π is constant over all classes, and so it will be ignored in the rest of this discussion. More generally, maximizing gi (x) can be simplified by dropping terms from the discriminant function that are the same for all classes. Consider the following two cases.

1. There is no correlation between the features for any of the classes, and all features have the same variance; that is, Σi = σ2I

where the constant and covariance determinant terms have been dropped. In the case in which the prior probabilities of the different classes are all equal, this function is a Euclidean distance. Thus, minimum distance classifiers that use a Euclidean distance are equivalent to Bayes' classifiers in which we assume Gaussian densities for the likelihoods, equal prior probabilities for each class, no correlation between the different features, and features that each have the same variance.

2. If the covariance matrices are the same for all classes, i.e., Σi = Σ, then an equivalent statistical discriminant function is linear in x. (See the Exercises for more on this.) In this case, we can estimate the optimal linear discriminant function by estimating the overall covariance and the means for each class.

The Gaussian approximation can be quite poor in some cases, for instance if the pdf is not unimodal (has more than one big bump). Density functions for real-world data can often be better approximated by more complicated models, such as a mixture (weighted summation) of Gaussians. However, the analysis above is a good place to start for understanding's sake, and in many cases a model based on a single Gaussian per class can be useful for formulating a classifier.

## 9.7 STATISTICALLY BASED LINEAR DISCRIMINANTS

As in Section 9.6, consider the Bayesian classifier with an assumption of Gaussian form for the density of each class. Further assume that the features are uncorrelated with one another so that the covariance matrix consists of variances only (no nonzero off-diagonal elements). Further assume that the covariance matrix is identical for each class. Since there are no off-diagonal elements, the net effect is to scale down each feature in the x and μ vectors by the corresponding standard deviation (see Exercise 9.1). For simplicity's sake, then, we simply remove the covariance matrix from Eq. 9.17 and assume that the random variables have been scaled to have a variance of one. Expanding out Eq. 9.17 and dropping the constant and class-independent terms, we derive a discriminant function:

Comparing this expression with the discriminant function D2i (Eq. 8.4) in Chapter 8, we see the following differences:

1. The prototype or template zi has been replaced by its mean (as was also done in Fig. 8.1).
2. The features are viewed as random variables and have been scaled by the within-class standard deviation (square root of variance), assumed to be the same for each class.
3. There is an additional term that corresponds to prior knowledge about how probable each class is. If this certainty is strong enough, it can outweigh evidence from the data. For instance, if 99% of the population consists of basketball players and only 1% consists of speech-recognition researchers, on the average the classification error would only be 1% if all new cases were classified as basketball players regardless of height or weight. More generally, this factor simply applies a bias.

If there is no prior information to suggest favoring one class over the other, and the features are scaled as suggested above, then the statistical discriminant function described above is equivalent to the distance-based classifier of Chapter 8. However, the statistical formulation is very general, permitting both a principled way of normalizing the amplitudes of the features (scaling down the variances to be equal to one) and a provision for incorporating prior biases about the classification. If the statistical assumptions are justified (for instance, the Gaussian form), the resulting classifier is optimal in the Bayes sense.

### 9.7.1. Discussion

As noted previously, the Bayes classifier is the optimal one that we attempt to approximate with the methods described here. The simplifying assumptions described above permit principled approaches to the approximation of this classifier. However, the class-dependent covariance matrices are not generally equal, and the distributions are not usually well described by a single Gaussian. Fortunately, approaches exist for learning more general forms of distributions. Iterative techniques can be used to find the parameters for distributions where there is no direct analytical solution, such as the means and variances for a mixture (weighted sum) of Gaussian densities. Neural networks can also be trained to determine probability density functions, which will be discussed in Chapter 27. However, for many cases (such as Gaussian mixtures) an approach can be used that guarantees finding the best possible value of the parameters given the parameters from the previous iteration. The dominant approach for this purpose is called EM.

## 9.8 ITERATIVE TRAINING: THE EM ALGORITHM

In the earlier sections here we have shown how one can estimate the parameters for a Gaussian density corresponding to a class. Given the true densities for all classes, one may determine the optimum Bayes classifier. Given parameters trained to maximize the likelihood for the data in each class, we can approximate the Bayes classifier. However, in many problems, there is no analytical solution for the parameters of the statistical classifier given the training data. An example of such a problem is finding the means, variances, and mixture weights for a sum of Gaussian densities, given a sample of data points. We will postulate a model in which each individual Gaussian generates some of the observed data points. Given that there is no way to directly compute estimates for these quantities (since we do not know which points correspond to which Gaussian), we need an optimization technique to find the parameters for each Gaussian (and for its mixture) that will maximize the likelihood of all of the observed data.

The dominant approach to such problems is called the Expectation Maximization (EM) algorithm [4]. In this approach, the parameter-estimation problem is structured to incorporate variables representing information that is not directly observed, but that is assumed to be part of the model that generated the data. Such a variable is often called hidden or missing. For instance, in the Gaussian mixture case, a hidden variable could be the index of the Gaussian that generated a data point. The key idea of EM is to estimate the densities by taking an expectation of the logarithm of the joint density between the known and unknown components, and then to maximize this function by updating the parameters that are used in the probability estimation. This process is then iterated as required. We can show fairly easily that if this expectation is maximized, so too is the data likelihood itself, which (as noted previously) is typically the goal for statistical model training. First, we express the expectation of the joint likelihood of observed and hidden variables as a function of both old parameters Θold and new ones Θ. Then for random variable k (a missing or hidden variable), observed random variable x, and parameters Θ, let

If Θ is chosen to be equal to Θold, then

Subtracting Eq. 9.21 from Eq. 9.20 and rearranging terms, we get

However, the last term is the relative entropy or Kullback Leibler distance, which can be shown to be nonnegative [3]. Therefore, if a change to Θ increases Q, logP(x | Θ) increases. In other words, changing the parameters to increase the expectation of the log likelihood of a joint distribution between the data and a hidden variable will also increase the log data likelihood itself. In principle, we could maximize the expectation for each value of Θ and then reestimate Θ. Although there is no guarantee of how good the estimates would be, they are guaranteed to improve (or at least get no worse) with each iteration.

In many cases, it is possible to structure the problem so that we can analytically determine the choice for the parameters that will maximize the expectation in each iteration. Then a new expression for the expectation can be determined, followed by a new parameter estimation, and so on.

Figure 9.1 shows a histogram drawn from a bimodal distribution that is the sum of two Gaussian densities (renormalized to integrate to one). If data points could be unambiguously assigned to one of the two component densities, then one could estimate the means, variances, and weight for each component analytically. However, if the data assignments are unknown, we must evaluate these quantities by using expected values over the joint distributions (of mixture index and observed data value) and iteratively improve the estimates. This has proven to be a powerful concept and works well under many conditions. The figure shows quick convergence to a good fit for this simple example. There are many other useful cases for which we can find the optimum values for Θ at each stage, and in practice the optimization often converges quite quickly.

FIGURE 9.1 In the center is a histogram for 1000 data points that were sampled from a mixture of two Gaussians. The correct mixture parameters are means of –4.0 and 5.0, sigmas of 2.0 and 1.5, and equal weights of 0.5. The outermost Gaussian shapes correspond to an initial (poor) guess for the mixture parameters, consisting of means of –12 and 11, sigmas of 3.5 and 2.6, and equal weights. The log likelihood of the data for this guess was –5482. After one iteration of EM, the density in the middle is obtained, which is quite a good fit to the data. The corresponding means are –4.1 and 5.0, sigmas are 2.0 and 1.5, and weights are 0.49 and 0.51. The log likelihood for this new estimate was –2646. Further iterations did not appreciably change the parameters or the likelihood for this simple example. The vertical axis for the densities is scaled up by 1000 to match the histograms, i.e., having an integral of 1000.

To illustrate EM, let us take the specific example of optimizing the parameters for a density that is composed of a weighted sum (or mixture) of simpler distributions, Gaussians. This is a fairly general notion, since we can always decompose an unknown probability density P(x | Θ) as3

using the definitions of joint and conditional densities, and assuming K disjoint (but unknown) categories. We hope to choose these categories so that the component densities can be simply computed.

As is typical for training with a ML criterion, we will attempt to approximate the true densities by maximizing the likelihood P(x | Θ) or its logarithm. There is no analytical expression that will generally yield the Θ to maximize either of these quantities directly. Furthermore, taking the logarithm of the rightmost expression in Eq. 9.23 would require the log of a sum, which would generally be difficult to optimize. However, we can consider the variable k to be an unobserved (hidden) random variable corresponding to the mixture component from which each data sample came. Then the theorem proved in Eq. 9.22 tells us that an increase in the expectation of log P(x, k | Θ) will also increase the data likelihood P(x | Θ). Equation 9.23 also shows that we can interpret the probability of this hidden variable as the mixture weight.

Given these interpretations, we can write an expression for the log joint density for observed and hidden variables.

with an expected value over N examples and K mixture components of

Note that we have distinguished between Θold, the parameters used to generate the distribution with which we will evaluate the expectation, and Θ, which is a variable to be optimized (in the sense of maximizing Q).

Finally, we can decompose this expression into two terms by using the usual properties of the log function:

We will often assume models for which the Θ parameters of the two terms are disjointed and can be optimized separately.

One way that we can train the parameters for this density is to assume that each mixture component P(x | k) is a Gaussian with mean μk and standard deviation σk. In this case, P(k) is the weight given to Gaussian k in the model.

Recalling Eq. 9.15, we can express the Gaussian density for component k as

where μk is the mean and σk is the standard deviation for the component.

Substituting this definition into expression 9.26, we find that the expectation to be maximized becomes

where the C term refers to constants that will disappear with the differentiation to follow.

Given this choice of parametric form for the mixture density, we can use standard optimization methods to find the best value for the parameters. Let's begin by solving for the means. Setting the partial derivative to zero, we find that the first term (which is not dependent on the means) drops away, as do all the terms for which k is not equal to j; thus we get

or (multiplying through by σj2 and moving the second term to the right-hand side)

Thus

A similar procedure can be used to proceed from Eq. 9.28 to derive the optimum value for the variances (see Exercises). The result is

In the case of the mixture weights P(k | Θ), Eq. 9.28 must be supplemented with a Lagrangian term , which expresses the constraint that the mixture weights must sum to one. Then let Q* be the augmented function that includes this constraint term. Taking the partial derivative and setting it to be equal to zero, we find that the term involving means and variances may be disregarded, leaving the equations

Summing this up over all the components j yields λ = –N, so that the mixture weights can be expressed as

The results in Eqs. 9.31, 9.32, and 9.34, each include an expression P(j | xn, Θold). How do we evaluate this? This can be done using Bayes' rule, that is,

which only includes terms that we have already shown how to evaluate. That is, P (xn | j, Θold) is the value of the jth Gaussian at xn, assuming the mean and variance from the previous iteration. Similarly, P(j | Θold) is the weight of the jth Gaussian from the previous optimization step.

### 9.8.1. Discussion

Equations 9.31, 9.32, and 9.34 have an interesting form. Each is an expectation over many examples of the desired parameter, with the appropriate normalization so that the densities incorporated actually sum to one. Thus, EM has led us to a sensible and satisfying result; compute the posterior distribution of the hidden variable given the observed variables and the old parameters, and use it to compute the expectations of the desired new parameters. The mean and variance for each Gaussian are estimated by weighting each instance of the sampled variable by the probability of that sample having originated from the corresponding density, and then scaling appropriately. The mean that is computed in this way is also sometimes called a center of mass for the variable. The estimate for the mixture weight, or the prior probability for each mixture, is just the average of the posterior probabilities of the mixture over all of the training samples.

An interesting special case occurs for a single Gaussian. When K = 1, the posterior is always equal to one, and Eqs. 9.31 and 9.32 can be reduced to the standard equations for the sample mean and covariance.

We have not discussed initialization. EM is only guaranteed to maximize the data likelihood for each step, and it is not guaranteed to find the global maximum. For simple problems such as that illustrated in Fig. 9.1, this is clearly not a problem. However, in the more general case, good initializations can be important for improving the quality of the resulting solution.

The overall EM procedure can be briefly summarized as follows.

1. Choose a form for the probability estimators (e.g., Gaussian) for the densities associated with each class.
2. Choose an initial set of parameters for the estimators.
3. Given the parameters, compute posterior estimates for the hidden variables.
4. Given posterior estimates for hidden variables, find the distributional parameters that maximize the expectation of the joint density for the data and the hidden variables. These will be guaranteed to also maximize the improvement to the likelihood of the data.
5. Assess goodness of fit (e.g., log likelihood) and compare to the stopping criterion. If not satisfied, go back to step 3.

This section constitutes a very brief and light introduction to EM. For a more complete description, the reader should consult one of the many references for EM, e.g., [2] or [4]. We will also return to EM later (in Chapter 26) in the context of hidden Markov models. In that case, we will modify the parameters for estimates of densities that are hypothesized to generate sequences.

## 9.9 EXERCISES

• 9.1 Expand out expression 9.17 for the case in which only the diagonal terms of the covariance matrix are nonzero, and in which the covariance matrix is equal across classes. Reduce the expression to a simpler function by eliminating all terms that are either constant or equal for all classes.
• 9.2 The heights and weights of basketball players and speech researchers are vector quantized so that there are 16 different possible reference indices i corresponding to (height, weight) pairs. A training set is provided that has the height and weight and occupation of 1000 people, 500 from each occupation. How would you assign them to reference indices? Given a chosen approach for this, how would you estimate the discrete densities for the likelihoods P(i | occupation)? How would you estimate the posteriors P(occupation | i)? How would you estimate the priors P(occupation) and P(i)?
• 9.3 A Bayes decision rule is the optimum (minimum error) strategy. Suppose that Albert uses a Bayes classifier and estimates the probabilities by using a Gaussian parametric form for the estimator. Beth is working on the same problem and claims to get a better result without even using explicitly statistical methods. Can you think of any reason that Beth could be right?
• 9.4 Section 9.8 derived the EM update equations for the mixture weights and means of a univariate Gaussian mixture, but it only stated the result for the variances. Derive the variance update equations.

## BIBLIOGRAPHY

1. Bilmes, J., “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,” International Computer Science Institute Tech. Rep. ICSI TR-97–021; 1997; ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97–021.pdf.
2. Bishop, C., Neural Networks for Pattern Recognition, Oxford Univ. Press, London/New York, 1996.
3. Blahut, R., Principles and Practice of Information Theory, Addison–Wesley, Reading, Mass., 1987.
4. Dempster, A. P., Laird, N. M., and Rubin, D. B., “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statistical Soc. 39: 1–38, 1977.
5. Duda, D., Hart, P., and Stork, D., Pattern Classification (2nd Ed.), Wiley–Interscience, New York, 2001.
6. Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw–Hill, New York, 1965.

1For simplicity's sake, we have written the integral as if x were a scalar.

2Reminder: the element of the ith covariance matrix in the jth row and the kth column is the covariance between the jth and kth features for all observations in the ith class. The covariance between two random variables is the expected value of their product once the means have been removed.

3For simplicity's sake, we will only consider the case of a univariate density; that is, the case in which the observed variable is a scalar. The more general vector case is very similar but is more intricate because of the matrix arithmetic. See [1] for a description of the vector case.

• No Comment
..................Content has been hidden....................