• Search in book...
• Toggle Font Controls

# STATISTICAL MODEL TRAINING

## 26.1 INTRODUCTION

In the previous chapter, we introduced the notion of statistical models and sequence recognition; we further introduced the common assumptions of conditional independence that lead to the particular form of generative statistical model1 called a hidden Markov model (HMM). We then showed how such models could be used to compute the likelihood of the sequence of feature vectors having been produced by each hypothetical model, given some assumptions of conditional independence. This likelihood was either a total likelihood (using the forward recursion), taking into account all possible state sequences associated with the model, or a Viterbi approximation, only taking into account the most likely state sequence. Further assuming that the language model parameters were separable from the acoustic model parameters, we showed that the Bayes rule gave us the prescription for combining the two models to indicate the model (or sequence of models) that gives the minimum probability of error.

A key component in this development was the integration of local probability values over the sequence; essentially this was a local product of state emission and transition probabilities with a cumulative value computed from legal predecessor states. In other words, we derived approaches for determining complete sequence likelihoods given all the local probabilities. However, this still left a major problem: How do we get these probabilities? Even with the simplifying assumptions we have made, this is far from a trivial problem. State densities and transition probabilities are rarely known a priori, so in general they must be estimated from the training data.

The methods of Chapter 9 are directly relevant here. In that chapter, we described two parametric forms (Gaussians and mixtures of Gaussians) and introduced a general methodology for training their parameters. Given a set of labels for a sequence of speech frames, then, one could associate a model with all frames that have a particular label. For example, a feature from all speech frames labeled as the [ae] sound could be represented by a mixture of Gaussians. The parameters of each such model could then be trained to maximize the likelihood of the corresponding data, for instance by using EM.

However, speech recognition requires somewhat more complex procedures, though they are qualitatively similar to this simple scenario. For one thing, EM on framewise densities would maximize the likelihood P(xn | qj) for each j, which might not be the same thing as maximizing the likelihood of a complete model for a word or sentence. Thus, we must optimize over the space of complete sequences; that is, we must compute expectations over the space of all possible state sequences corresponding to the models for the training data. EM in this case can be shown to increase the likelihood of a complete sequence of observations given the models. Additionally, the training data are typically labeled asynchronously to the frames; that is, we ordinarily know that the training phrase “fifty-five” corresponds to a sequence of speech frames, but we don't know where the subword units start and stop. Consequently the state identity itself must be a hidden variable in the sense that we used the term in Chapter 9.

In the next section we apply EM to HMMs.

## 26.2 HMM TRAINING

The EM derivations of Chapter 9 showed that choosing parameters that maximized the expectation of the log of the joint density for observed and hidden variables would also maximize the likelihood of the observed data. When applied to HMMs, the hidden variables are the sequence of states associated with the Markov models.2 Starting with a general expression for an auxiliary function that could be maximized in order to ensure the data likelihood is optimized (adapted from Eq. 9.25):

where k is a hidden variable (for which we will sum over all examples), xn is the nth example of the observation, and Θ are the parameters to be optimized.

For the case of a HMM, we use the state sequence as the hidden variables, and we compute the expectation conditioned on the entire sequence. As in Chapter 25, let qkn be the state at time n that is of type k, let XlN refer to the complete sequence of N frames, and let Q be the corresponding sequence of hidden state variables (not to be confused with Q, the symbol we introduced in Chapter 9 to refer to the expected log joint density being optimized in EM). Then the corresponding expectation would be

This is the expectation of the joint likelihood of the observed feature vectors and the unobserved HMM state sequences.3 This expression could be maximized for each model M (which has associated permissible state sequences) by adjusting Θ.

In Chapter 25, we noted that the probability terms in Eq. 26.2 are often simplified by a series of conditional independence assumptions. Using these, we can approximate (a product of emission probabilities), and P(Q) by (an initial state prior multiplied by a sequence of transition probabilities). Using the usual properties of the logarithm and re-expressing the summation over all sequences Q as the sum for all possible values of frame index n and state type k (or state pairs k and l in the case of the third term), we get

where, as in Chapter 25, L(M) is the number of states in model M.

Although we have written this expression with a general shared notation for the parameters, typically the three terms can be optimized separately. The first term can be optimized with the same methods that were shown in Chapter 9; in particular, note the similarity to the second term of Eq. 9.26. In that chapter, we chose a reasonable parametric form for the density estimator. This permitted the computation of partial derivatives of a Q function with respect to the unknown parameters. Setting the resulting expression to zero produced linear equations that could be simply solved. This determined the parameter values that corresponded to the best possible increase in data likelihood for the given initial density estimates.

For simplicity's sake, here we assume that a single univariate Gaussian is associated with each HMM state.4 Given these assumptions, we can proceed through the same steps as were taken in Chapter 9. The resulting expression for the mean associated with state (density) j is then

and the corresponding expression for the variance estimate is

(Compare these with Eqs. 9.31 and 9.32 in Chapter 9.)

Continuing the analogy with the development in Chapter 9 (Section 9.8), we can use Lagrangian multipliers to optimize the second and third terms of Eq. 26.3. That is, we take partial derivatives of each term separately, summed with an additional term that incorporates constraints based on the fact that for these cases the parameters that we wish to optimize are probabilities. For the case of the second term, we want to estimate the prior probability of the first state in each model. The corresponding Lagrangian term will be

which expresses the constraint that the probabilities of all initial states must sum to one. Similarly, for the third term of Eq. 26.3, the Lagrangian term will be

which expresses the constraint that the probabilities of all transitions from any particular state must sum to one. In each case we take partial derivatives of the augmented term (with respect to the variable we wish to optimize), set the result to zero, solve for the Lagrangian multiplier, and resubstitute for the final expression for the optimum value of the unknown.5

Taking these steps, we end up with an expression for the optimum first frame prior probabilities:

Thus, the best estimate of this prior is just the posterior probability taken from the previous parametric representation.

Similarly, the optimum transition probabilities can be shown to be

1. The denominator can also be expressed as an outer sum over the numerator for all states j, that is,
2. The form of this equation is similar to the others we derived, in the sense that it is a kind of normalized expectation. In particular, it may be interpreted as the expected value of the number of transitions from state i to state j, normalized by the expectation of the number of transitions from state i.

Thus, for the prior probabilities for the first state and for state transitions, we can compute optimum updates from estimates of the relevant posterior probabilities. That is, we require an estimate of posterior probability of single states and of pairs of states, conditioned on the entire observation sequence. In the case of the emission probabilities, we have just given a simple example of a parametric form (that of a single univariate Gaussian). However, it is generally true that the iterative training of HMM emission parameters is crucially dependent on the estimation of the posterior probability . In the following section we show how all of these posterior probabilities can be estimated. Given these estimates, though, the training procedure is (in principle) straight forward.

1. Choose a form for the local probability estimators (e.g., Gaussian) for the densities associated with each state.
2. Choose an initial set of parameters for the estimators.
3. Given the parameters, estimate the probabilities for each state and time. Similarly, estimate the probabilities for each state transition and time. These are essential terms in the estimate of the expectation Q for the EM algorithm, as given by Eq. 26.3.
4. Given these probabilities, and the parametric form chosen in step 1, find the parameters that maximize Q. These parameters will be guaranteed to give the best possible improvement to the likelihood for each model.
5. Assess the new models according to some stopping criterion; if good enough, stop. Otherwise, return to step 3.

Although some training approaches use somewhat different criteria and probabilistic estimates, the general form of the training for all statistical sequence systems remains the same: use the current parameters to estimate posterior probabilities for the hidden variables, and then use these posteriors to determine new parameters that maximize the expectation Q (and thus the data likelihood, according to the EM proof in Chapter 9). In the remainder of this chapter we describe some further specifics for some simple HMM-based approaches.

## 26.3 FORWARD-BACKWARD TRAINING

As noted in the previous section, EM-based training of HMMs requires the estimation of the probability of each type of state occurring at a frame. These probabilities can be estimated by using a combination of the forward procedure (described in Chapter 25) and a similar one called the backward recursion (see problem 25.2 in Chapter 25). Together these recursions will yield the required probabilities, which can then be used to generate the new set of estimator parameters. This procedure then will maximize the likelihood of the correct models (given the usual Markov assumptions).

Let's start by redefining the two recursions6:

Note that the second equation has the same form as the first, but it proceeds backward in time. The second one was also chosen to have the property that

In other words, the product is the joint likelihood of the complete data sequence and a particular state at a particular time. Summed up over all the possible states at that time, this will then yield .

Given these intermediate results, we can then compute the probability of a particular state at a particular time, given the entire data sequence:

This can be used to update the parameters of the probability estimators for the emission density associated with each state. In particular, given a specific form for these estimators (e.g., Gaussian), the probabilities of Eq. 26.13 can be used to compute parameters that maximize the expectation Q and hence the data likelihood. In Section 26.4, we show this process for Gaussians and for discrete densities.

In addition to these emission densities, the state transition probabilities can also be chosen to maximize the expectation Q. As noted in Section 26.2, these probabilities are determined from the posterior acoustic probability estimate of the state transitions, or see Eq. 26.9. This requires a series of approximations based on the same conditional independence assumptions that were used in Chapter 25 to develop the α recursion and will result in a somewhat messier expression. However, in principle, we can compute the required probability with a product of an α term, a β term, and an emission and a transition probability to represent the posterior contribution of the current transition. This is illustrated in Fig. 26.1.

More formally, we wish to estimate the numerator of Eq. 26.9 (since the denominator can be obtained by summing the numerator over all possible states). Equivalently, we can estimate the quantity , since both denominator and numerator of Eq. 26.9 can be multiplied by the term We then split up the data se-quence into past, present, and future terms to express the desired quantity (suppressing Θ and M for simplicity's sake), and proceed with the derivation:

FIGURE 26.1 A state transition and the terms that are used for training of the associated probability. The total probability of all paths terminating in state k at time n – 1 is given by αn–1 (k | M). The total probability of all backward paths (ending at time N) that start with state l at time n is given by βn (l| M). Local probabilities P(ql | qk) and P(xn | ql) are multiplied in to get an estimate of the total probability for all paths that contain the transition from k to l.

Each equality in this derivation either represents a definition (for α or β), or else an application of the axiom P (ab c) = P (a | bc)P (b | c). Each use of the symbol ≈ indicates that we are making use of the conditional independence assumptions described in Chapter 25. For instance, in the expression for the joint probability of the observation and state for the nth frame (26.19), we can drop dependence on if there is also a dependence on (expression 26.20). This is the first-order Markov assumption. That is, given knowledge of the previous state, we assume that no other information about the past will provide any additional information about the current state.

Using the terms of Eq. 26.22 to estimate the transition probabilities of Eq. 26.9, we find that the optimum transition probabilities are

Similarly, for the first frame probabilities as given by Eq. 26.8, the optimum probabilities can be computed directly from Eq. 26.13 for the case of n = 1, or

In summary, α (forward) and β (backward) recursions are used to derive estimates for the probabilities of the hidden state and state transition variables, conditioned on the sequence of acoustic feature vectors. For the unconditioned (prior) probabilities, as given by Eqs. 26.23 and 26.24, the probabilities themselves are trained parameters of the model. In the case of emission probabilities, the parameters are variables associated with some particular structure for the estimator, e.g., Gaussian.

For all of these parameters, estimation and maximization steps can be repeated until some stopping criterion is reached. The overall procedure is then referred to as forward-backward or Baum–Welch training.

## 26.4 OPTIMAL PARAMETERS FOR EMISSION PROBABILITY ESTIMATORS

For many common structures of probability estimators, update equations can be found for which the partial derivative of the expectation Q (with respect to the estimator parameters) is zero, and the likelihood of the data is maximized. For any particular iteration of the Baum–Welch procedure, these update equations will be applied, incorporating probabilities that have been estimated from the previous iteration. Here we illustrate this process for the case of two simple structures: a Gaussian density, and a discrete density.

### 26.4.1 Gaussian Density Functions

Recall that a multivariate Gaussian distribution is defined by two groups of parameters: the mean vector, and the covariance matrix. In the most general case, each of these will be unique for each state category (e.g., phone), but often the nonzero part of the covariance matrix is limited to diagonal elements (variances).

As noted previously, the expected values of each mean must be computed by weighting each feature vector by the probability that it corresponds to the particular state. This is expressed in Eq. 26.4. Note that the denominator may be interpreted as the expected number of frames associated with state j. Expressing both numerator and denominator in terms of the recursion results (where the α values come from the forward recursion and the β values come from the backward recursion), we get

The variances can be computed in a similar manner.

### 26.4.2 Example: Training with Discrete Densities

For discrete densities, each xn is mapped to the nearest cluster center y j in a process called vector quantization, or VQ (to be discussed a bit more later in this chapter). A distribution is stored in a table for each yj. For each feature vector xn, the probability is then approximated by where the feature vector was closest to cluster center y j.

To derive the optimum values for the emission parameters, we begin with the same term of the auxiliary function Q that we have been differentiating, namely the first term of Eq. 26.3. However, the parameters to be estimated for the discrete case are the probability estimates P(yj | q) themselves. Thus, we need a Lagrangian term in the optimization to represent the constraint that these parameters must sum to one. Putting in this term, we get

Taking the partial derivative with respect to and setting the result to zero yields

where the notation δnj, j is used for a function that is one when xn is quantized to yj and zero when it is quantized to some other codebook entry. Here is fixed to be the same for any value of n, so we can multiply through by this value and get

Summing over all values of yj, we find that the constraint reduces the right-hand side to be –λ and removes the δ term from the left-hand side (since the sum would now include all frames), so that we get

Resubstituting into Eq. 26.28 and rearranging terms, we finally get

This new estimate of the emission probabilities associated with a VQ value and a state can be viewed as the expected value of the number of frames associated with that VQ value and state, normalized by the expected value of the total number of frames.

Finally, we can see that the probabilities required for Eq. 26.30 are just the probabilities from Eq. 26.13, and thus they can be obtained from the normalized product of forward and backward recursions as given in that latter equation.

## 26.5 VITERBI TRAINING

In Chapter 25, we noted that the full likelihood of a model can be approximated by the likelihood associated with the most likely sequence of states. This was referred to as a Viterbi approximation. The advantage of this approximation is that sums in the α recursion can be replaced with a maximum, which simplifies the numerical considerations. Similarly, in Viterbi training, we will attempt to optimize the parameters to maximize the likelihood of the best path (state sequence) in the correct model. For this case, then, the posterior probabilities used in the estimation step are assumed to either be zero or one; that is, at each stage in the EM iterations, we will assume a particular state sequence for the training data.

The EM steps then take the following form7:

1. Assume an initial set of parameters for the density estimators.
2. Determine the most likely state sequence (or assume one if initializing from segmentations rather than densities).
3. Update the parameters.
4. Assess the solution and repeat the previous two steps as necessary.

We begin by discussing the second step. Assuming that we have some probability estimators, how do we find the best segmentation of the frames into a state sequence? This question was essentially answered in a different context in the previous chapter. We know that a Viterbi decoding procedure will find the likelihood of the best path (state sequence) for each model. Further, since the most probable transition was used at each step, we will be able to backtrack and determine the corresponding state sequence. Conceptually, segmentation of training data with a known model transcription is the same as recognition, except that in the former case there are no alternate model sequences to consider.

Since we can obtain an emission probability for each frame and state category, each of these is used in a process that is often called Viterbi alignment. In this process, dynamic programming is done, essentially using the one-pass method, in which the local distances are – log P(yj | q), and where there are transition costs – log P(q | qk) for hypothesizing transitions from states k to l. Unlike the recognition scenario, the only model sequences that are considered are the ones associated with the known word sequence; all the word models together can be considered as a single model for the entire utterance in this case, and there is only one to be evaluated. Backtracking can be done since the best previous state can be preserved for each frame, and so the best state sequence can be found. Additionally, since only one model sequence need be evaluated, often it is not necessary to use elaborate data structures for this process – the distance and backtracking information can be held in complete matrices, since the storage is not prohibitive as it would be in the recognition case.

FIGURE 26.2 illustration of the iterative Viterbi alignment. Each utterance Xj is segmented into an initial estimate of state occupancies for the corresponding model Mj. For instance, this can be done with the linear match shown by the dashed line in the figure. Then the parameters are chosen to maximize the likelihood of the data. Given the new parameters, a new segmentation is found (shown here as the solid curve). This corresponds to the most likely path through the model given the new parameters. The arrows down to the X-axis show the corresponding state transition times for each of these segmentations.

Figure 26.2 illustrates Viterbi alignment. In the case shown in the figure, we begin the iterative process with an assumed segmentation, but the initial segmentation can also be derived from an assumed set of densities.

The state sequence that is found through the backtracking procedure is considered to be an alignment of the states with the feature vectors. In the next step, we estimate the transition and emission probabilities, assuming that this state sequence is correct.

Finally, the solution must be assessed. We can do this by looking at the changes in the global likelihood and setting some threshold on the improvement. Another approach is to test for convergence of the segmentation by counting the number of frames for which the state label has changed.

Although Viterbi training can be quite effective and is simple to understand and implement, it requires an additional approximation over the Markov assumptions mentioned previously, and this is in some sense a disadvantage. However, in Viterbi training, the best path through each model is reinforced, so that during recognition the best path is more likely to correspond to the correct model.

For Viterbi training, the state transition probabilities are particularly simple to evaluate. Consider Eq. 26.9, which gave an expression for the optimum transition probabilities in the general case of probabilistic state occupation. If the posterior probabilities for a transition are assumed to be one when a state pair is present in the Viterbi alignment, and zero when it is not, then that equation becomes

In other words, to estimate state transition probabilities, we simply count to find the relative frequency of each particular transition.

As with the more general case of Baum–Welch training, we consider Gaussians and discrete densities in order to derive the update equations for the emission probability estimators.

### 26.5.1 Example: Training with Gaussian Density Functions

For each Viterbi iteration, the optimum means and variances are computed from all the frames labeled with each state category. The ordinary equations for these parameters are indeed optimum. This can be shown by taking the corresponding solution without any Viterbi assumptions, and substituting one and zero appropriately as the only permitted probabilities. For example, consider Eq. 26.4. Substituting a probability of one for all cases in which a segmentation yields a label of state j, and zero elsewhere, we get

Similarly, the variance is just

### 26.5.2 Example: Training with Discrete Densities

In the case of discrete density functions, since the acoustic vector has been quantized to one of a finite number of categories (typically a few hundred), we can compute relative frequencies of each of these categories for those frames that have a particular state label. Starting from Eq. 26.30, we again substitute probabilities of one for frames within a segmentation for state and zero elsewhere. Given the δ function in that equation, which further eliminates frames from the numerator sum for VQ values other than j, we end up with

Given these new probabilities, another Viterbi alignment can be run, and so on, until some stopping criterion is met.

## 26.6 LOCAL ACOUSTIC PROBABILITY ESTIMATORS FOR ASR

Here we give a little more detail about the estimator structures used in the previous examples, as well as a few more complex types.

### 26.6.1 Discrete Probabilities

As noted previously, discrete probabilities for speech frames (given a particular state category) are estimated by counting co-occurrences between state labels and the VQ index for each frame. Prior to the training methods described above, a VQ training phase is implemented. In this phase, the training data are clustered by using one of a number of common methods, such as the K-means method. In a variant on this approach, sometimes hierarchical clustering is done – first the feature vectors are clustered into two clusters, then into four, and so forth, with each step consisting of a complete K-means clustering, and with the next step initiated by choosing a split that satisfies some reasonable criterion. A common number of cluster centers to end up with is somewhere between 128 and 512. The smaller the number the more robust the quantization will be to small variations, but the larger the number is the better match each cluster center value is to the unquantized version of each feature vector.

Thus, in this training phase, a list of prototypical feature vectors is computed; this list is commonly called a codebook. When it is used (either during training or recognition), each incoming feature vector is mapped to the closest prototype vector in the codebook and the resulting index is used to represent the frame; as noted earlier, this can be used as an index into a table of probabilities.

In practice, it is better to use separate codebooks for static features (e.g., mel cepstra), dynamic features (e.g., delta mel cepstra), and energy-related features. This leads to multiple probabilities that must be combined. Typically, they are combined by multiplication, which is tantamount to assuming conditional independence between these features. Thus even though the use of discrete distributions seems to be free of strong distributional assumptions (e.g., Gaussian), their practical use requires quite strong statistical assumptions. If a single codebook is used, then the use of Euclidean distances for the quantization is also equivalent to a strong assumption about the statistical distribution of the features. Finally, recent practical experience for many researchers has shown that continuous densities can often be used to achieve better performance.

Despite these seeming difficulties, the simplicity and low computation requirements for these methods make discrete (VQ-based) HMMs a popular methodology for local probability estimation.

### 26.6.2 Gaussian Densities

As noted earlier, feature vectors associated (deterministically or probabilistically) with each state can be assumed to be generated by a multivariate Gaussian distribution. This is a fairly strong assumption, but it is not so bad if a full covariance matrix is used (including nondiagonal elements). Unfortunately, when enough state categories are used to make this assumption as reasonable one, the number of parameters for full covariance matrices can be prohibitive.

For this reason, it is now more common that simpler Gaussians are used (variance only) and are combined in mixtures, as described below. This is often found to be a more effective use of the parameters.

### 26.6.3 Tied Mixtures of Gaussians

Each emission probability density can also be assumed to be the weighted sum of a common pool of Gaussian densities. We can express this as

which is axiomatically correct if

Another way of looking at such a system is as a discrete system for which variances are also considered, and for which the probability is not just associated with a single cluster center, but with several (or all in the unconstrained case). Thus, this method is often referred to as soft VQ. The HMMs that result are sometimes referred to as semicontinuous for similar reasons. Practically, these methods have become very popular, as they share some of the simplicity of the discrete methods and yet permit continuous representations of the densities.

### 26.6.4 Independent Mixtures of Gaussians

In the previous case, the density for each state was estimated by a different weighting on the same pool of base densities. In the more general case of mixture Gaussians, each state has its own set of mixtures densities (not just the weights, but separate Gaussians with their own means and variances). This provides a more detailed estimate of the densities, but as such generally requires more training data.

Many systems now actually use something in between tied and independent Gaussians; there is generally some kind of automatic procedure to determine to what extent the data will support an independent estimator for each state, and to what extent the data are sparse enough that tying is required.

### 26.6.5 Neural Networks

As noted in Chapter 9, neural networks can also be used to estimate probabilities. However, they are sufficiently different (different assumptions, optimization criteria, etc.) that they will be treated separately in the next chapter.

## 26.7 INITIALIZATION

Thus far, we have largely skipped over the initialization of parameters. This is required for any form of EM. There are a number of common ways in which this is done. Sometimes the models are initially estimated from a manually transcribed database such as TIMIT (briefly described in Chapter 23). For instance, if three states are used for each phone, the phone segmentations can be used as marked in TIMIT, and three segments for each phone can initially be assumed to be of equal length. Models can be trained from this, iterated on TIMIT, and then used as the initial models for the new database. Alternatively, TIMIT phone models could be used to provide an initial segmentation of the new database, and that could be iterated upon in a similar manner. Sometimes systems use a signal-processing approach to come up with preliminary segmentations, which are used to generate the initial models. One can also just divide up the sequences according to the average relative lengths of phones. Even simpler approaches have been used, but particularly for Viterbi training, the initialization can have a significant effect on the results.

## 26.8 SMOOTHING

For all of these methods, there is a fundamental difficulty that is essentially always present to some degree: we wish to capture the variability inherent in the data, which pushes us toward ever finer representations (for instance, modeling triphones, or phones with specific left and right context, rather than phones); however, these finer categories have fewer examples. Indeed, poor estimates for categories that occur relatively infrequently may also hurt recognition for other categories significantly, since recognition requires integration of the probabilistic estimates over complete utterances.

Therefore, it is common that we will require the combination of good estimates of coarse categories with noisy estimates of fine categories, in order to stabilize the estimate of the latter: this process is often called smoothing. A number of techniques have been developed for this purpose; here we describe two.

1. Backoff smoothing: this is a simple but often effective method. Thresholds are set for a minimum number of examples for each level of granularity, and when the minimum is not present, the estimator backs off to a coarser level. For instance, if there are enough examples of a particular triphone, its emission probability may be used directly; if there are not enough, a biphone might be used; if there are not enough examples of the biphone, then the phone probabilities are used. The setting of proper thresholds is obviously a tricky point, but for many purposes simple heuristics work for this.

2. Deleted interpolation: this is a much more sophisticated process, and it appears to work quite well. Instead of choosing which estimator to use, all of them are used with weighting factors that are learned from testing on data that are disjoint from the training data. Often, this is done by partitioning the training data into N pieces (e.g., 10) and then training up N different sets of parameters by using all the different choices for (N – 1)/N of the data. For the case of two estimators, the goal is to choose ε optimally to combine the estimator parameters, that is,

In one approach, ε is set to the fraction of utterances for which Θ1 was better. Another solution is to treat Ε as missing information (a hidden variable) and to use EM training.

## 26.9 CONCLUSIONS

In this chapter we have shown the basic approaches used for the training of HMMs for speech recognition. As usual we have just scratched the surface; consult [1]–[5] for more of the theory (convergence proofs, etc.). Conference papers from the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) and the Interspeech (formerly ICSLP/Eurospeech) all have many more recent examples of practical approaches that have been taken in complete systems.

In all of these approaches, however, we have made a set of assumptions that we actually know to be unrealistic; in particular, trying to represent the longer-term dependence between observations is a recurring theme in much current research. Further, the basic training approaches are entirely based on improving the likelihood scores of the correct models, and they do nothing to assure us that the likelihoods of the incorrect models will be low. This issue will lead us to a study of discriminant models and estimators.

## 26.10 EXERCISES

1. 26.1 Show all the steps that lead to the expression for optimum first frame prior probabilities as given in Eq. 26.8.
2. 26.2 Show all the steps leading to Eq. 26.25, starting from Eq. 26.4 and using the definitions for the forward and backward recursions.
3. 26.3 Show all the steps leading to Eq. 26.31 from Eq. 26.9.

## BIBLIOGRAPHY

1. Baum, L. E., and Petrie, T., “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Mathemat. Stat. 37: 1554–1563, 1966.
2. Bourlard, H., and Morgan, N., Connectionist Speech Recognition: A Hybrid Approach, Kluwer, Boston, Mass., 1993.
3. Dempster, A. P., Laird, N. M., and Rubin, D. B., “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Stat. Soc. 39: 1–38, 1977.
4. Jelinek, F., Statistical Methods for Speech Recognition, MIT Press, Cambridge, Mass., 1998.
5. Rabiner, L., and Juang, B.-H., Fundamentals of Speech Recognition, Prentice–Hall, Englewood Cliffs, N.J., 1993.

1A statistical model is called generative when its observations are assumed to be generated by a state occupancy or transition, according to some statistical distribution.

2This is the minimal set of hidden variables for the HMM training problem. Other variables may also be hidden, such as the component weightings for Gaussian mixtures.

31n some derivations, the first probability in this equation would be the joint probability of Q and X. Maximizing over either distribution will maximize the data likelihood, and it will also lead to estimation procedures that are essentially the same.

4As noted in Chapter 9, the extension to vector observations is not difficult, but it complicates the mathematics. Similarly, incorporating mixture Gaussians rather than single Gaussians requires us to have hidden variables that denote both the state and the generating Gaussian, which would complicate the notation without significantly enhancing this exposition.

5See Chapter 9 for the analogous computations for the probabilities that are used as the weights of a Gaussian mixture.

6For consistency of notation with Chapter 25, we will suppress the explicit dependence on the old parameters Om in this section. The reader should keep in mind, however, that all of the ‘local’ probabilities used in these expressions come from estimators that may have been trained by a previous step (or else that use parameters that have been chosen for initialization).

7This process is conceptually much simpler than the approaches discussed previously. We have chosen to discuss the non-Viterbi case first, however, because it is more general. As noted above, Viterbi training can be considered a special case of forward–backward training in which the posterior probabilities are either one or zero.

• No Comment
..................Content has been hidden....................