CHAPTER 27


DISCRIMINANT ACOUSTIC PROBABILITY ESTIMATION

27.1 INTRODUCTION

In the previous chapters we introduced the notion of trainable statistical models for speech recognition, in particular focusing on the set of methods and constraints associated with hidden Markov models (HMMs). In both training and recognition phases, the key values that must be estimated from the acoustics are the emission probabilities, also referred to as the acoustic likelihoods. These values are used to derive likelihoods for each model of a complete utterance, in combination with statistical information about the a priori probability of word sequences. In other words, the probabilities that the local acoustic measurements were generated by each hypothesized state are ultimately integrated into a global probability that a complete utterance is generated by a complete HMM (either by considering all possible state sequences associated with a model, or by considering only the most likely).

In Chapter 26 we provided examples of two common approaches to the estimation of these acoustic probabilities: codebook tables associated with vector quantized features, giving probabilities for each feature value conditioned on the state; and Gaussians or mixtures of Gaussians associated with one or more states. For both of these examples, EM training is used to maximize the likelihood of the acoustic feature sequence's having been generated by the correct model. However, when the parameters are modified in this way, there is no guarantee that they will also reduce the likelihoods of the incorrect models. Training that explicitly guarantees the relative improvement of the likelihood for the correct versus incorrect models is referred to as being discriminant. Discriminant training for sequence-recognition systems thus has the same goal as discriminant training for static tasks; the parameters for the classifier are trained to distinguish between examples of different classes. In the limit of infinite training data and convergence to optimal parameters, maximum likelihood training is also discriminant, in that it converges to the Bayes solution that guarantees a minimum probability of error. However, given practical limitations, it is often helpful to focus more directly on discrimination during training, which is the topic discussed in this chapter. There are a number of approaches to discriminant training in ASR, including the use of neural networks for state probability estimation. We briefly survey a range of approaches, and then we focus on the use of neural networks for discriminant ASR systems. We will also return to other discriminant training methods in the second half of Chapter 28.

27.2 DISCRIMINANT TRAINING

Recall that in Chapter 25, we used the Bayes rule to describe the fundamental equation for statistical speech recognition. For convenience, we repeat this here:

P(Mj | X) = P(X | Mj) P(Mj) / P(X)

where, as before, the class Mj is the jth statistical model for a sequence, 0 ≤ j ≤ J, and X is the observable evidence of that sequence.

In real systems, the actual probabilities are unknown, and instead we have estimates that depend on parameters that we will learn during training. Again repeating from Chapter 25, we can explicitly incorporate dependence on a parameter set Θ to get

P(Mj | X, Θ) = P(X | Mj, Θ) P(Mj | Θ) / P(X | Θ)

Recall that, for example, Θ could include means and variances for Gaussian components of the density associated with each state.

During training, Θ is changing. Typically Θ is changed to maximize the likelihood P(X | Mj, Θ). However, this will also change the denominator P(X | Θ), and we cannot be assured that the quotient will increase. To illustrate this potential difficulty further, if we expand the latter probability to a sum of joint probabilities

P(X | Θ) = Σi P(X, Mi | Θ) = Σi P(X | Mi, Θ) P(Mi | Θ)

then the Bayes rule expression becomes

P(Mj | X, Θ) = P(X | Mj, Θ) P(Mj | Θ) / [ Σi P(X | Mi, Θ) P(Mi | Θ) ]    (27.4)

or

P(Mj | X, Θ) = 1 / { 1 + [ Σi≠j P(X | Mi, Θ) P(Mi | Θ) ] / [ P(X | Mj, Θ) P(Mj | Θ) ] }

Thus, given some speech acoustics and some model parameters, the probability of a particular model depends on the ratio between the likelihood of that model weighted by its prior probability and the sum of this product over all of the competing models: the posterior grows as this ratio grows. Consequently, to be sure that this probability estimate increased for a change in the parameters, we would aim to reduce the likelihood for incorrect models as well as to increase the likelihood for correct models. Training procedures that attempt to do this (or at least to increase the likelihood ratio between correct and incorrect models) are said to be discriminant (between models).
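
To make this dependence concrete, the small sketch below (plain NumPy, with invented numbers purely for illustration) evaluates the posterior of Eq. 27.4 from hypothetical per-model log likelihoods and priors. It also shows that raising the likelihoods of all models, as unconstrained maximum likelihood training may do, need not raise the posterior of the correct one.

```python
# Illustrative only: posteriors from per-model likelihoods and priors (Eq. 27.4).
import numpy as np

def model_posteriors(log_likelihoods, priors):
    """P(Mj | X, Theta) for all j, computed in the log domain for stability."""
    log_joint = np.asarray(log_likelihoods) + np.log(priors)
    log_joint -= log_joint.max()                 # guard against underflow
    joint = np.exp(log_joint)
    return joint / joint.sum()

priors = np.array([0.5, 0.3, 0.2])               # hypothetical P(Mj | Theta)
loglik = np.array([-100.0, -101.0, -103.0])      # hypothetical log P(X | Mj, Theta)
print(model_posteriors(loglik, priors))          # posterior of model 0 is largest

# If training raises the likelihood of every model, the posterior of the
# correct model (index 0) need not increase:
print(model_posteriors(loglik + np.array([1.0, 2.0, 2.0]), priors))
```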

There are several major categories of discriminant training that have been developed. We briefly mention three of them, and then we proceed in somewhat more detail on another discriminant approach that is based on neural networks.

27.2.1 Maximum Mutual Information

A quantity that is closely related to the fraction of Eq. 27.4 is the mutual information between the models and the acoustics, or

I(X, M) = E { log [ P(X, M | Θ) / ( P(X | Θ) P(M | Θ) ) ] }

where E is the expectation operator over the joint probability space for the models and the acoustics.

For a particular choice of model and acoustic pair,

I(X, Mj) = log [ P(X, Mj | Θ) / ( P(X | Θ) P(Mj | Θ) ) ]
         = log [ P(X | Mj, Θ) / P(X | Θ) ]
         = log [ P(X | Mj, Θ) / Σi P(X | Mi, Θ) P(Mi | Θ) ]    (27.7)

where the last transformation is obtained by dividing the numerator and denominator by P(Mj | Θ) and then expanding out the denominator as in the previous section.

This differs from Eq. 27.4 only in that it lacks a prior probability term in the numerator, and in that there is a log function. It is clear that this too is a discriminant formulation, in that alterations to Θ that increase the mutual information will increase the earlier quantity [ignoring the prior probability term P(Mj | Θ) in the numerator of Eq. 27.4].

In work at IBM (see, e.g., [1]), methods were developed to train parameters Θ in order to increase this criterion. These approaches have been referred to as maximum mutual information (MMI) methods. Training is done with a gradient learning approach, in which the parameters are modified in the direction that most increases the mutual information.

MMI approaches have been incorporated in a number of speech-recognition research systems. One practical problem with using MMI for continuous speech recognition is the need for probability estimates for each of the terms in the denominator of Eq. 27.7. In the general case, there are an infinite number of such terms, since there are an infinite number of possible word sequences. One solution is to approximate the denominator by estimating probabilities for a model that permits any phoneme sequence. Another approach has been to approximate the sum over all models (i.e., all possible word sequences) by just using the sum over the N-most probable models (word sequences). Specific techniques for large-scale MMI training are presented in Section 28.3.
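
As a hypothetical illustration of the criterion in Eq. 27.7, the sketch below computes the mutual-information score for the correct model of a single utterance from per-model log likelihoods and log priors, using a log-sum-exp over the denominator terms; the N-best approximation mentioned above simply restricts this sum to the N most probable competing word sequences. All scores are invented.

```python
# Illustrative sketch of the per-utterance MMI criterion (Eq. 27.7).
import numpy as np

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def mmi_score(log_likelihoods, log_priors, correct):
    """log [ P(X | Mc) / sum_i P(X | Mi) P(Mi) ], all inputs in the log domain."""
    denom = logsumexp(np.asarray(log_likelihoods) + np.asarray(log_priors))
    return log_likelihoods[correct] - denom

# Hypothetical scores for the correct model and its N-best competitors.
loglik = np.array([-250.0, -252.0, -255.0, -260.0])
logpri = np.log([0.4, 0.3, 0.2, 0.1])
print(mmi_score(loglik, logpri, correct=0))
# Gradient-based MMI training adjusts the parameters to increase this score.
```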

27.2.2 Corrective Training

Corrective training was a term applied by the IBM group to an MMI-like approach in which the parameters were modified only for those utterances in which the correct models had a lower likelihood than the best models. This can be seen as an approximation to MMI in which the fraction in Eq. 27.7 is only modified for a reduced set of examples (only the cases in which the most likely hypothesis was incorrect). For these cases, the acoustic probabilities are adapted upward for the correct model and downward for the incorrect models. In other words, if

log P(X | Mr, Θ) − log P(X | Mc, Θ) > Δ

then the parameters are adjusted from Θ to a new set Θ̂ for which

P(X | Mc, Θ̂) > P(X | Mc, Θ)

and

P(X | Mr, Θ̂) < P(X | Mr, Θ).

Here Δ is a margin that must be exceeded before an utterance is considered to be recognized so poorly as to suggest correction of the models, Mc is the correct model, and Mr is the recognized model. The method was described in greater detail in [2].

27.2.3 Generalized Probabilistic Descent

A generalization of corrective training and MMI approaches was developed by Katagiri et al. [12]. Given parameters Θ, they define a discriminant function associated with each model Mi as gi (X; Θ). This discriminant function can be any differentiable distance function or probability distribution. Often the discriminant function is defined as

gi(X; Θ) = log P(X | Mi, Θ)

Another solution could be to define gi(X; Θ) as the MMI in Eq. 27.7.

Classification will then be based on this discriminant function according to the rule

decide Mi    if    gi(X; Θ) > gj(X; Θ)   for all j ≠ i

Given this discriminant function, we can define a misclassification measure that will measure the distance between one specific class and all the others. Here again, several measures can be used, each of them leading to different interpretations. However, one of the most general ones given in [12] is

dj(X; Θ) = −gj(X; Θ) + (1/η) log [ (1/K) Σi≠j exp( η gi(X; Θ) ) ]    (27.10)

in which K represents the total number of possible reference models. It is easy to see that if η = 1, Eq. 27.10 is then equivalent to Eq. 27.7, in which all the priors are assumed equal to 1/K.

The error measure (Eq. 27.10) could be used as the criterion for optimization by a gradient-like procedure, which would result in something very similar to MMI training. However, the goal of generalized probabilistic descent is to minimize the actual misclassification rate, which can be achieved by passing dj(X; Θ) through a nonlinear, nondecreasing, differentiable function F (such as the sigmoidal function given in Chapter 8, Eq. 8.8) and then by minimizing

L(Θ) = E { F( dj(X; Θ) ) }

Other functions can be used to approximate the error rate. For example, we can also assign zero cost when an input is correctly classified and a unit cost when it is not properly classified, which is then another formulation of the minimum Bayes risk.
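
The sketch below is a hypothetical illustration (not the exact formulation of [12]): it evaluates the misclassification measure of Eq. 27.10, assuming log-likelihood discriminant functions as defined above, and passes it through a sigmoid to obtain the smoothed error count. A negative dj means that the correct model wins, and a larger η makes the soft maximum over the competitors sharper.

```python
# A minimal sketch of the GPD misclassification measure and smoothed loss,
# assuming log-likelihood discriminant functions g_i(X) = log P(X | Mi, Theta).
import numpy as np

def misclassification(g, j, eta=1.0):
    """d_j = -g_j + (1/eta) log[(1/K) sum_{i != j} exp(eta g_i)]  (cf. Eq. 27.10)."""
    g = np.asarray(g, dtype=float)
    K = len(g)
    a = eta * np.delete(g, j)              # competitors only
    m = a.max()
    lse = m + np.log(np.exp(a - m).sum())  # stable log-sum-exp
    return -g[j] + (lse - np.log(K)) / eta

def smoothed_loss(d, alpha=1.0):
    """Sigmoid of the misclassification measure: a smooth stand-in for 0/1 error."""
    return 1.0 / (1.0 + np.exp(-alpha * d))

g = np.array([-250.0, -252.0, -255.0])     # hypothetical log likelihoods; class 0 correct
d = misclassification(g, j=0, eta=1.0)
print(d, smoothed_loss(d))                 # d < 0 and loss < 0.5: correctly classified
```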

This approach is very general and includes several discriminant approaches as particular cases. In the case of continuous speech recognition (for which all incorrect models cannot typically be enumerated), this approach requires an approximation of the incorrect model scores. Approximations such as those used with MMI (e.g., N-best hypotheses) can also be used for this case.

27.2.4 Direct Estimation of Posteriors

Model probabilities are usually estimated from likelihoods using the Bayes rule. However, it is also possible to estimate the posterior probabilities directly and to incorporate the maximization of these estimates (for the correct model) directly in the training procedure. The most common structure for such an approach is a neural network, typically a multilayer perceptron (MLP) such as that described in Chapter 8; sometimes recurrent (feedback) connections are also used. Neural networks can, under some very general conditions, estimate HMM state posterior probabilities. These can then be used to estimate model probabilities.

It has been shown by a number of authors ([4], [9], [18]) that the outputs of gradient-trained classification systems can be interpreted as posterior probabilities of output classes conditioned on the input. See the Appendix at the end of this chapter for a version of the proof originally given in [18].

These proofs are valid for any neural network (or other gradient-trained system), given four conditions.

  1. The system must be trained in the classification mode; that is, for K classes (e.g., state categories), the target is one for the correct class and zero for all the others.
  2. The error criterion for gradient training is either the mean-squared difference between outputs and targets, or else the relative entropy between the outputs and targets.
  3. The system must be sufficiently complex (e.g., contain enough parameters) to be trained to a good approximation of the mapping function between input and the output class.
  4. The system must be trained to a global error minimum. This is not really achieved in practice, so the question is whether the local minimum that might actually be obtained will be good enough for our purpose.

It has been experimentally observed that, for systems trained on a large amount of speech, the outputs of a properly trained MLP or recurrent network do in fact approximate posterior probabilities, even for error values that are not precisely the global minimum. When sigmoidal functions (e.g., the function given in Eq. 8.8) are used as output nonlinearities, for instance, it is often found that the outputs roughly sum to one, at least on the average. However, since individual examples do not identically sum to one, many researchers use some form of normalized output. One of the most common approaches to this is to use a softmax function rather than a sigmoid. A common form for this nonlinearity is

gi = exp(yi) / Σk exp(yk)

where yi is the weighted sum of inputs to the ith output neuron. In other words, each such neuron value is exponentiated, and then the results are normalized so that they sum to one.

Thus, neural networks can be trained to produce state posteriors for an HMM, assuming that each output is trained to correspond to a state category (e.g., a phone). HMM emission probabilities can then be estimated by applying the Bayes rule to the ANN outputs, which estimate state posterior probabilities P(qk | xn). In practical systems, we most often actually compute

P(xn | qk) / P(xn) = P(qk | xn) / P(qk)

That is, we divide the posterior estimates from the ANN outputs by estimates of class priors. The scaled likelihood of the left-hand side can be used as an emission probability for the HMM, since, during recognition, the scaling factor P(xn) is a constant for all classes and will not change the classification.
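
A minimal sketch of this conversion is given below, assuming hypothetical pre-softmax output activations and class priors estimated by counting state occupancies in the training alignments; the resulting scaled likelihoods (or rather their negative logarithms) are what enter the decoder.

```python
# From network outputs to scaled likelihoods (softmax, then division by priors).
import numpy as np

def softmax(y):
    """Normalized exponentials of the output activations (sums to one per frame)."""
    y = y - y.max(axis=-1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical pre-softmax activations for 4 frames and 3 state classes,
# and priors P(q_k) counted from training-set alignments.
rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 3))
priors = np.array([0.5, 0.3, 0.2])

posteriors = softmax(activations)                 # estimates of P(q_k | x_n)
scaled_likelihoods = posteriors / priors          # estimates of p(x_n | q_k) / p(x_n)
local_distances = -np.log(scaled_likelihoods)     # used as local costs in Viterbi DP
print(local_distances)
```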

Figure 27.1 shows the basic hybrid scheme, in which the ANN generates posterior estimates that can be transformed into emission probabilities as described here, and then can be used in dynamic programming for recognition.
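
The dynamic-programming step can be sketched as follows for a small left-to-right model. Everything here (the transition structure and the random scaled likelihoods) is invented for illustration, but the recursion is the standard Viterbi minimization over negative-log scores described in the caption of Fig. 27.1.

```python
# Viterbi decoding over negative-log scaled likelihoods (cf. Fig. 27.1), sketch only.
import numpy as np

def viterbi(local_distances, log_transitions):
    """Minimum-cost state path; local_distances is (T, S), log_transitions is (S, S)."""
    T, S = local_distances.shape
    cost = np.full((T, S), np.inf)
    back = np.zeros((T, S), dtype=int)
    cost[0] = local_distances[0]
    cost[0, 1:] = np.inf                                 # force a start in state 0
    for t in range(1, T):
        total = cost[t - 1][:, None] - log_transitions   # element [from, to]
        back[t] = total.argmin(axis=0)
        cost[t] = total.min(axis=0) + local_distances[t]
    path = [int(cost[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
S, T = 3, 10
dists = -np.log(rng.uniform(0.1, 2.0, size=(T, S)))      # fake scaled likelihoods
trans = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.6, 0.4],
                         [0.0, 0.0, 1.0]]) + 1e-12)      # left-to-right topology
print(viterbi(dists, trans))
```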

Since posterior probabilities for an exhaustive set of state categories sum to one, the network training is discriminant at the state level; that is, changing the parameters to boost the correct state will also move the system farther away from choosing the incorrect states. In fact, the backpropagation process explicitly includes the effects of negative training from the targets associated with the incorrect states.

However, the goal of discriminant training is not to improve discrimination at the level of states, but rather at the level of complete models (i.e., words or utterances). Is there any reason to believe that this training is discriminant between models? Although there is no real proof of this, there are several types of observations that can be made.

1. Intuitive: a system that is better at distinguishing between submodels should be better at distinguishing models. Although mismatches at higher levels (e.g., poor pronunciation models) can interfere with this, it nonetheless would be hard to argue that it should be better to have poor local discrimination.

2. Empirical: when a neural network is trained, performance at the state (frame) level and at the word level can be tested after each epoch. The number of epochs that yields the best performance is often not exactly the same for the different levels; however, the general tendencies are very similar, and the optimum stopping point is typically very close for the two criteria. Therefore, to the extent that the word error rate is a measure of model discrimination, the state-based discriminant training of neural networks tends to lead to systems that are more discriminant at the model level.


FIGURE 27.1 Use of a neural network to generate HMM emission probabilities for speech recognition. At every time step n, acoustic vector xn with right and left context is presented to the net (Fig. 27.2). This generates local probabilities P(qk | xn) that are used, after division by priors P(qk), as local scaled likelihoods in a Viterbi dynamic programming algorithm. Here, the arrows coming up from each ANN output symbolize the use of these scaled likelihoods (after taking the negative logarithm) as distances from the acoustic input to their corresponding state at time n. The solid curves show the best path at each time point.

3. Theoretical: although there is no proof per se that the training described here is discriminant for words or utterances, it can actually be proved that an idealized system that incorporates dependencies on all previous states is discriminant at the complete model level. The simpler system here can be viewed as the same training regimen with some strong simplifying assumptions.

On the last point, work over the past few years has shown that relaxing these simplifying assumptions somewhat (for instance, including a dependency on the previous state explicitly in the network training) can demonstrate some improvement [5]. However, since the improvement is small, it is still likely that the simpler system is typically improving model-level discrimination.

Aside from improved discrimination, there are other reasons for researchers’ interest in the use of neural networks for probability estimation in statistical ASR. The structure of the network permits flexible inclusion of a range of features, such as acoustic inputs from long temporal contexts. This can facilitate a wide range of experiments that might otherwise be quite tricky with Gaussian mixtures or discrete densities. These approaches also tend to be unconstrained by implicit or explicit assumptions about the feature distributions (e.g., conditional independence within a state). Finally, in practice it has often been observed that fewer parameters are required for posterior-based systems, in comparison with equivalently performing systems that use likelihood estimators. This may be due to the tendency of the former systems to incorporate more parameter sharing.

The next section elaborates on some basic characteristics of hybrid HMM–ANN systems that are used for ASR.

27.3 HMM–ANN BASED ASR

27.3.1 MLP Architecture

As described earlier, scaled HMM state emission probabilities can be estimated by applying the Bayes rule to the outputs of neural networks that have been trained to classify HMM state categories. Such estimates have been used in a significant number of ASR systems, including large-vocabulary speaker-independent continuous speech-recognition systems. Sometimes these systems have used MLPs, such as those described in Chapter 8. The MLPs could consist either of a single large trained network (systems have been trained with millions of parameters) [16] or of a group of separately trained smaller networks [8]. Sometimes the systems use recurrent networks, typically with connections from a hidden layer back to the input [19], [7]. These can be trickier to train than fully feedforward systems, but they can typically achieve very good results with fewer parameters than the feedforward systems.

A typical single-net feedforward implementation is illustrated in Fig. 27.2. Acoustic vectors usually incorporate features such as those discussed in Chapter 22 (e.g., mel cepstra or PLP). A temporal context of such vectors (e.g., nine frames) is input to the network. For simple implementations, the output categories correspond to context-independent acoustic classes such as phones, and each phone uses a single density. More complex implementations can use multiple states per phone, context-dependent phones, or both (see Chapter 23 for a discussion of triphones, for instance). These more complex designs typically use multiple networks (see, for instance, [8]).
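
The input construction can be sketched as below: a hypothetical 13-dimensional feature vector per frame is stacked with its four left and four right neighbors to form the nine-frame network input, and a small randomly initialized MLP (a stand-in for a trained one) maps each stacked vector to phone-class posterior estimates.

```python
# Sketch: building a 9-frame context window and running an MLP forward pass.
import numpy as np

def stack_context(features, left=4, right=4):
    """Concatenate each frame with its neighbors; edges are padded by repetition."""
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    T = len(features)
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden sigmoid layer, softmax output (posterior estimates)."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    z = h @ W2 + b2
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, n_features, n_hidden, n_phones = 100, 13, 64, 40   # all sizes hypothetical
frames = rng.normal(size=(T, n_features))             # e.g., PLP or mel-cepstral vectors
X = stack_context(frames)                             # shape (T, 9 * 13)
W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_phones)); b2 = np.zeros(n_phones)
posteriors = mlp_forward(X, W1, b1, W2, b2)           # one row of P(q_k | x_n) per frame
print(posteriors.shape, posteriors[0].sum())          # (100, 40), each row sums to one
```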

27.3.2 MLP Training

A number of techniques have been developed to improve the performance of these networks. Some of these are as follows.

1. On-line training: neural network theory is somewhat more complete for so-called batch training than for on-line training. In the former, modification of the weights is only done once per pass through the data, whereas weights are modified for every new pattern in on-line training. However, the latter tends to be much faster for realistic data sets, since the data tend to be quite redundant, so that one on-line pass is in practice equivalent to many batch passes.


FIGURE 27.2 Single large MLP used for probability estimation in speech recognition. The acoustic input consists of feature vectors for the current frame, four previous, and four following frames. The output corresponds to phonetic categories for HMM states.

2. Cross-validation: the goodness of the network must be evaluated during training in order to determine whether sufficient learning has occurred, or sometimes even to optimize training parameters (such as the learning rate). In practice it is often useful to test the system on an independent data set after each training pass; depending on the learning algorithm, the resulting test-set performance can be used either for comparison with a stopping criterion or to assess how to change the learning rate. For many cases, failure to do this testing can result in networks that are overtrained; that is, they can overfit the training data, which can lead to poor generalization on independent test sets. A minimal sketch of such a cross-validation-based stopping procedure is given at the end of this subsection.

3. Training criterion: the relative entropy criterion referred to here tends to perform better (certainly in terms of convergence) than the mean-squared error.

See [16] for further discussion on these and other practical points.
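
As a hypothetical illustration of the cross-validation point above, the following sketch monitors classification accuracy on a held-out set after each on-line pass through the training data, keeps the best weights seen so far, and reduces the learning rate (and eventually stops) once the held-out accuracy stops improving. The toy classifier, data, and schedule are all invented; they stand in for the MLP and a real training corpus.

```python
# Sketch: cross-validation-based stopping for on-line training (toy classifier
# standing in for the MLP; data and learning-rate schedule are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 3, 10
means = rng.normal(scale=2.0, size=(n_classes, dim))

def make_data(n):
    y = rng.integers(n_classes, size=n)
    return means[y] + rng.normal(size=(n, dim)), y

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def online_epoch(W, X, y, lr):
    for x_i, y_i in zip(X, y):                    # weight update after every pattern
        p = softmax(x_i @ W)
        p[y_i] -= 1.0                             # gradient of cross-entropy at the output
        W -= lr * np.outer(x_i, p)
    return W

X_train, y_train = make_data(2000)
X_cv, y_cv = make_data(500)                       # independent cross-validation set

W, lr = np.zeros((dim, n_classes)), 0.05
best_acc, best_W = 0.0, W.copy()
for epoch in range(20):
    W = online_epoch(W, X_train, y_train, lr)
    acc = (softmax(X_cv @ W).argmax(axis=1) == y_cv).mean()
    print(f"epoch {epoch}: cross-validation accuracy {acc:.3f}")
    if acc > best_acc:
        best_acc, best_W = acc, W.copy()
    else:
        lr *= 0.5                                 # or stop: the net has begun to overfit
        if lr < 1e-3:
            break
W = best_W                                        # keep the best-validated weights
```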

27.3.3 Embedded Training

In Chapter 26 we described two principal approaches for iteratively improving the model likelihoods, both based on EM. In Viterbi training, we alternately segmented word-labeled data and updated parameters for the models; these steps were repeated until some stopping criterion was reached. In forward–backward training, explicit segments were not computed; rather, recursions were used to generate state probability estimates for each frame, and model parameters were estimated from these. These iterative or embedded procedures were necessary because of the lack of an analytical solution to the optimization of the model parameters. In particular, segment boundaries or probabilities are not typically known; rather, training is usually done with speech utterances for which the phone sequence (or sometimes only the word sequence) is known, but not the exact timing of phonetic segments.

Similarly, ANN training can be embedded in an EM-like1 process. In the Viterbi case, dynamic programming is used to segment the training data (using scaled likelihoods computed from the network outputs). The resegmented data are then used to retrain the network. There is also an approach that is quite analogous to the typical Baum–Welch procedure of Chapter 26; scaled likelihoods are used with the usual forward–backward equations to estimate posterior probabilities for each state and frame. The network is then retrained, using these probabilities as targets [10]. In some cases the recursions can be modified to accommodate dependencies on previous states for more complex models [5].
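
A minimal sketch of the Viterbi variant of this embedded loop is given below, with a toy softmax classifier standing in for the network and a single synthetic utterance whose state sequence (but not its segment boundaries) is assumed known. Real systems iterate the same two steps over the whole training corpus: forced alignment with the current scaled likelihoods, followed by retraining on the new frame labels.

```python
# A minimal sketch of embedded (Viterbi-style) training for a hybrid system,
# using a toy softmax classifier in place of the MLP and a single left-to-right
# utterance; all data, sizes, and the initial uniform segmentation are invented.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_classifier(X, labels, n_classes, steps=300, lr=0.5):
    """Stand-in for MLP training: batch gradient descent on cross-entropy."""
    W = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[labels]
    for _ in range(steps):
        W -= lr * X.T @ (softmax(X @ W) - Y) / len(X)
    return W

def align(scaled_loglik):
    """Viterbi segmentation of the frames into a left-to-right state sequence."""
    T, S = scaled_loglik.shape
    cost = np.full((T, S), np.inf)
    back = np.zeros((T, S), dtype=int)
    cost[0, 0] = -scaled_loglik[0, 0]
    for t in range(1, T):
        for s in range(S):
            prev = cost[t - 1, max(s - 1, 0):s + 1]   # stay or advance by one state
            j = int(prev.argmin())
            back[t, s] = max(s - 1, 0) + j
            cost[t, s] = prev[j] - scaled_loglik[t, s]
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return np.array(path[::-1])

# Toy utterance: three "phone states" with unequal true durations.
true_states = np.repeat([0, 1, 2], [15, 30, 15])
X = rng.normal(loc=true_states[:, None].astype(float), scale=0.8, size=(60, 2))
labels = np.repeat([0, 1, 2], 20)                    # initial uniform segmentation

for iteration in range(4):                           # embedded training loop
    W = train_classifier(X, labels, n_classes=3)             # retrain on current labels
    posteriors = softmax(X @ W)
    priors = np.bincount(labels, minlength=3) / len(labels)
    scaled = np.log(posteriors + 1e-12) - np.log(priors)     # log scaled likelihoods
    labels = align(scaled)                                   # re-segment the utterance
    print(iteration, (labels == true_states).mean())
```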

This section has focused on the use of feedforward neural networks in HMM-based speech recognition. For a brief description of a large-vocabulary system based on recurrent networks, see [7]; the system itself is described more completely in [19].

27.4 OTHER APPLICATIONS OF ANNs TO ASR

For brevity's sake, we have only discussed neural networks in the context of their application to discriminant training for HMM probability estimation. However, there are a range of other applications of neural networks to speech recognition; some examples follow.

  1. Predictive networks: networks are trained to predict each frame's acoustic vector from some number of previous vectors, assuming some state category [14], [21].
  2. ANN models of HMMs: networks can be designed to implement the Viterbi algorithm [15] or the forward recursion of the forward–backward algorithm [6].
  3. Nonlinear transformation: networks may also be used to transform the observation features for HMMs [3].
  4. Clustering: networks have been used to cluster the data prior to classification stages, for instance, using Kohonen feature maps [13].
  5. Adaptation: networks have been used to adapt to the acoustic vectors for a particular speaker [11], or to map from noisy to clean acoustic conditions [20].
  6. Postprocessing: networks trained for more complex models (e.g., statistics of a complete phonetic segment) can be used to rescore word-sequence hypotheses that are generated by a first-pass system that may use more conventional approaches [22].

We also refer the interested reader to [16] and [17] for more information on many of the topics discussed in this chapter.

27.5 EXERCISES

  27.1 Consider Eq. 27.4. Suppose that all K models have entirely disjoint parameters; that is, changing the parameters for one model has no effect on the likelihoods for another model. Does an increase in P(X | Mj, Θ) (as a result of a change in Θ) imply an increase in P(Mj | X, Θ)? What if each parameter change affects the likelihood estimates for all models?
  27.2 Briefly describe the principles that underlie the use of maximum mutual information for estimating HMM parameters. Explain how these methods differ from maximum likelihood estimation. Under what circumstances might these alternative training techniques be of benefit?
  27.3 It was stated in the chapter that HMM-based systems that use neural networks for posterior probability estimation often require fewer parameters than systems that are trained to estimate state likelihoods by using Gaussians or Gaussian mixtures. State some reasons why this might be true.
  27.4 Systems that are trained to be more discriminant should, in principle, make fewer errors than systems that have not been so trained. Describe some testing condition in which the discriminant training could hurt performance.
  27.5 It is sometimes said that neural networks do not require the selection of an intermediate representation (i.e., features) but rather can automatically extract the optimum representation. However, in practice, researchers have found that predetermined features are essential for good speech-recognition performance, even when neural networks are used. Why might someone think that the neural networks were sufficient, and why might they not be in practice?

27.6 APPENDIX: POSTERIOR PROBABILITY PROOF

Here we briefly repeat the proof originally given in [18], which gives a clear explanation of how networks trained for one-of-K classification can be used as estimators of probabilities for the K classes.

We assume that the network training criterion will be the mean-squared error (MSE) between the desired outputs of the network, which we represent as dl(x) for the lth output given an input x, and the actual outputs gl(x). In practice, other common error criteria will lead to the same result.

For continuous-valued acoustic input vectors, the MSE can be expressed as follows:

C(Θ) = Σl ∫ [ gl(x; Θ) − dl(x) ]² p(x) dx    (27.14)

Since p(x) = Σk P(qk) p(x | qk), we have

C(Θ) = Σl Σk P(qk) ∫ [ gl(x; Θ) − dl(x) ]² p(x | qk) dx

After a little more algebra, using the assumption that dl(x) = δkl if x ∈ qk, we find that adding and subtracting P(ql | x) in the previous equation leads to

C(Θ) = Σl ∫ [ gl(x; Θ) − P(ql | x) ]² p(x) dx + Σl ∫ P(ql | x) [ 1 − P(ql | x) ] p(x) dx    (27.15)

Since the second term in this final expression (Eq. 27.15) is independent of the network outputs, minimization of the squared-error cost function is achieved by choosing network parameters to minimize the first expectation term. However, the first expectation term is simply the MSE between the network output gl(x) and the posterior probability P(ql | x). Minimization of Eq. 27.14 is thus equivalent to minimization of the first term of Eq. 27.15, that is, estimation of P(ql | x) at the output of the MLP. This shows that a discriminant function obtained by minimizing the MSE retains the essential property of being the best approximation to the Bayes probabilities in the sense of mean-squared error. A similar proof was also given in [18] for the relative entropy cost function (computing the relative entropy between target and output distributions).
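
A small numerical check of this result is sketched below, under invented assumptions: two classes with unit-variance Gaussian class-conditional densities and equal priors. The unconstrained minimizer of the squared error within each input bin is simply the average of the 0/1 targets falling in that bin, and it closely tracks the true posterior P(q1 | x).

```python
# Numerical illustration of the appendix result: the MSE-optimal output for
# 0/1 targets approximates the class posterior. All settings are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 20000
labels = rng.integers(2, size=n)                         # class q0 or q1, equal priors
x = rng.normal(loc=labels.astype(float), scale=1.0)      # class-conditional Gaussians

# Unconstrained least-squares fit: within each bin of x, the MSE minimizer is
# just the mean of the binary targets that fall in that bin.
bins = np.linspace(-3.0, 4.0, 36)
idx = np.digitize(x, bins)
counts = np.array([(idx == b).sum() for b in range(len(bins) + 1)])
fit = np.array([labels[idx == b].mean() if counts[b] else np.nan
                for b in range(len(bins) + 1)])

# True posterior P(q1 | x) for these densities, evaluated at the bin centers.
centers = np.concatenate([[bins[0]], (bins[:-1] + bins[1:]) / 2, [bins[-1]]])
true_post = 1.0 / (1.0 + np.exp(0.5 - centers))

well_filled = counts > 200
print(np.max(np.abs(fit[well_filled] - true_post[well_filled])))   # small deviation
```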

BIBLIOGRAPHY

  1. Bahl, L., Brown, P., de Souza, P., and Mercer, R., “Maximum mutual information of hidden Markov model parameters,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Tokyo, pp. 49–52, 1986.
  2. Bahl, L., Brown, P., de Souza, P., and Mercer, R., “A new algorithm for the estimation of hidden Markov model parameters,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., New York, pp. 493–496, 1988.
  3. Bengio, Y., De Mori, R., Flammia, G., and Kompe, R., “Global optimization of a neural network-Hidden Markov Model hybrid,” IEEE Trans. Neural Net. 3: 252–259, 1992.
  4. Bourlard, H., and Wellekens, C., “Links between Markov models and multilayer perceptrons,” IEEE Trans. Pattern Anal. Machine Intell. 12: 1167–1178, 1990.
  5. Bourlard, H., Konig, Y., and Morgan, N., “A new training algorithm for statistical sequence recognition with applications to transition-based speech recognition,” IEEE Signal Process. Lett. 3: 203–205, 1996.
  6. Bridle, J., “Alpha-Nets: a recurrent neural network architecture with a hidden Markov model interpretation,” Speech Commun. 9: 83–92, 1990.
  7. Cook, G., and Robinson, A., “Transcribing broadcast news with the 1997 Abbot system,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Seattle, pp. 917–920, 1998.
  8. Fritsch, J., and Finke, M., “ACID/HNN: clustering hierarchies of neural networks for context-dependent connectionist acoustic modeling,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Seattle, pp. 505–508, 1998.
  9. Gish, H., “A probabilistic approach to the understanding and training of neural network classifiers,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Albuquerque, N.M., pp. 1361–1364, 1990.
  10. Hennebert, J., Ris, C., Bourlard, H., Renals, S., and Morgan, N., “Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems,” in Proc. Eurospeech '97, Greece, pp. 1951–1954, 1997.
  11. Huang, X., Lee, K., and Waibel, A., “Connectionist speaker normalization and its application to speech recognition,” in Proc. IEEE Workshop Neural Net. Signal Process. IEEE, New York, pp. 357–366, 1991.
  12. Katagiri, S., Lee, C., and Juang, B., “New discriminative training algorithm based on the generalized probabilistic descent method,” in B. H. Juang, S. Y. Kung, and C. A. Kamm, eds., IEEE Proc. NNSP, IEEE, New York, pp. 299–308, 1991.
  13. Kohonen, T., “The ‘neural’ phonetic typewriter,” IEEE Comput. 21: 11–22, 1988.
  14. Levin, E., “Speech recognition using hidden control neural network architecture,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Albuquerque, N.M., pp. 433–436, 1990.
  15. Lippmann, R., “Review of neural networks for speech recognition,” Neural Comput. 1: 1–38, 1989.
  16. Morgan, N., and Bourlard, H., “Continuous speech recognition: an introduction to the hybrid HMM/connectionist approach,” Signal Process. Mag. 12: 25–42, 1995.
  17. Morgan, N., and Bourlard, H., “Neural networks for statistical recognition of continuous speech,” Proc. IEEE 83: 742–770, 1995.
  18. Richard, M., and Lippmann, R., “Neural network classifiers estimate Bayesian a posteriori probabilities,” Neural Comput. 3: 461–483, 1991.
  19. Robinson, A., “An application of recurrent nets to phone probability estimation,” IEEE Trans. Neural Net. 5: 298–305, 1994.
  20. Sorenson, H., “A cepstral noise reduction multi-layer network,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Toronto, pp. 933–936, 1991.
  21. Tebelskis, J., and Waibel, A., “Large vocabulary recognition using linked predictive neural networks,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Albuquerque, N.M., pp. 437–440, 1990.
  22. Zavaliagkos, G., Zhao, Y., Schwartz, R., and Makhoul, J., “A hybrid segmental neural net/hidden Markov model system for continuous speech recognition,” IEEE Trans. Speech Audio Process. 2: 151–160, 1994.

1 The procedures that are actually used correspond more to what is sometimes called generalized EM. In generalized EM, the M step doesn't actually maximize the likelihood, but simply increases it, typically by a gradient procedure.
