In Chapter 24 we showed how temporal integration of local distances between acoustic frames could be accomplished efficiently by dynamic programming. This approach not only integrates the matches between incoming speech and representations of the speech used in training, but it also normalizes time variations for speech sounds. In the case of continuous speech, this approach also effectively segments the speech as part of the recognition search, without the need for any explicit segmentation stage. Distances can also be modified to reflect the relative significance of different signal properties for classification.

However, there are a number of limitations to the DTW-based sequence-recognition approaches described in the last chapter. As noted previously, a comparison of templates requires end-point detection, which can be quite error prone with realistic acoustic conditions. Although in principle distances can be computed to correspond to any optimization criterion, without a strong mathematical structure it is difficult to show the effect on global error for an arbitrary local distance criterion. Since continuous speech is more than just a concatenation of individual linguistic elements (e.g., words or phones), we need a mechanism to represent the dependencies of each sound or category on the neighboring context. More generally, as noted in Chapter 9, statistical distributions are a reasonable way to formally represent the variability that is observed in real speech samples.

For all of these reasons, statistical models are extremely useful for sequence recognition, particularly for speech. The use of a statistical framework provides a powerful set of tools for density estimation, training data alignment, silence detection, and in general for the training and recognition of isolated words or continuous speech. These mathematical tools have now become so widely used that they are more or less the standard methods for nearly all speech-recognition systems. Even research systems that do not strictly follow the approach given here tend to describe their algorithms in terms of the common statistical paradigm. It could also be argued that the use of statistical pattern recognition increases the generality of the methods; since the choice of any distance is equivalent to some implicit statistical assumption, one may as well directly represent the distance in terms of statistical optimality.

In statistical speech recognition, as in the deterministic case, there is a strong temporal component; in particular, during recognition a local distance is integrated over time in a way that provides some normalization for temporal variability. For most statistical approaches, speech is represented as having been generated according to some probability distributions. Since there are different speech sounds, it is necessary to assume that there is more than one distribution; therefore, we construct models for linguistic units, each of which is associated with some statistical parameters related to the distributions. In a training process, we will estimate these parameters by choosing them to approximate a minimum probability of error (Bayes) solution. During recognition, we will search through the space of hypothesized utterances to find the hypothesis that has the maximum a posteriori probability. For practical reasons, we will need to incorporate some approximations, and this will make the results suboptimal. Furthermore, only estimates of the probabilities and density functions will be available, so we will never reach the true error minimum. In particular, any finite training set will be an imperfect representation of an independent test set; the latter might, for instance, have some form of acoustic noise that was not present in the training set. Thus, deterministic aspects (such as using features that are robust to spectral slope) are still important, even in a statistical system.


We will begin with a restatement of the Bayes rule:

   P(Mj | X) = P(X | Mj) P(Mj) / P(X)                                        (25.1)

where, in this case, the class Mj is the jth statistical model for a sequence, with 1 ≤ j ≤ J, and X is the observable evidence of that sequence. As in the preceding chapter, X will typically consist of cepstral vectors computed with auditory-warped spectral estimates from successive acoustic windows.

According to the Bayes decision rule, the minimum probability of error (of classifying X into the correct category Mc) is attained if one always assigns X to the model with the maximum P(Mj | X). If we had many examples of each model, we could learn the parameters for some pattern classifier from the examples in the training data (as in the static pattern-classification problems in Chapters 8 and 9) and apply them during recognition for each sequence. More generally, it is often necessary to break up the models into submodels so that training data can be shared across the models; for instance, rather than training models for complete sentences, we tend to learn representations of words, syllables, or phones, and we put them together to make models for the utterances. Thus, when we learn the statistics of the sound [ae], we can apply them to the representations of both pat and bat.
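As a toy illustration of this decision rule, consider choosing between two word models given one observation. The priors and likelihoods below are invented for illustration, not taken from any real system; since P(X) is common to all candidates, the argmax over posteriors reduces to an argmax over P(X | Mj) P(Mj):

```python
# Toy Bayes decision rule: pick the model with the largest posterior.
# All probability values here are invented for illustration.
priors = {"pat": 0.5, "bat": 0.5}           # P(M_j)
likelihoods = {"pat": 0.02, "bat": 0.001}   # P(X | M_j) for one observed X

def classify(priors, likelihoods):
    # P(M_j | X) is proportional to P(X | M_j) P(M_j); P(X) cancels in the argmax
    return max(priors, key=lambda m: likelihoods[m] * priors[m])

best = classify(priors, likelihoods)
```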

Let's begin with the simplest case, in which the candidate models Mj correspond to words, and in which we will only do isolated word recognition. In this case, it would be desirable to design a classifier that would look at a sequence of cepstral vectors (essentially the input template of Chapter 24) and categorize it as one of J words. If all of the examples (both in training and test) were of exactly the same length, one could concatenate all cepstral vectors into one large vector and design a static pattern classifier by any of the approaches discussed in Chapters 8 and 9. Aside from not being generalizable to more realistic cases, though, even this idealized problem would not take advantage of any structure within the concatenated vector.

When the Bayes rule is applied to the recognition problem, we should really incorporate the dependence on the parameter set Θ that was learned during training, so that we get:

   P(Mj | X, Θ) = P(X | Mj, Θ) P(Mj | Θ) / P(X | Θ)                          (25.2)
Now let us consider the speech-recognition problem per se. Ordinarily, we will have some ability to predict the probability of a word sequence even without any acoustic evidence. In the degenerate case this just consists of a uniform probability distribution over all possible utterances, but more generally there is some preference for some utterances over others. In more complex cases this preference may come from pragmatic factors (e.g., during a conversation about the weather you are less likely to discuss brain surgery), but the simplest preconceptions about the word sequence come from the statistical distributions of groups of words. In either case, we can presume that there will be some estimator available for P(Mj). If we can assume that the components in Θ used for estimating P(Mj) are independent of the components associated with estimating P(X | Mj), we can separately estimate the two. Further using the fact that P(X | Θ) is fixed for all choices of j, we can state that an optimal decision rule for choosing the stochastic model Mj that will lead to the minimum probability of error is

   jbest = argmax_j P(X | Mj, ΘA) P(Mj | ΘL)                                 (25.3)

where ΘA is a set of parameters that have been learned for the acoustic model (representing the statistical relationship between the model and the observed sequence of feature vectors), and ΘL is the set of parameters that have been learned for the language model (parametrizing the statistical distribution of all word sequences).
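In practice this decision rule is evaluated in the log domain, where the product of acoustic likelihood and language-model prior becomes a sum. The sketch below uses invented log scores standing in for trained values of P(X | Mj, ΘA) and P(Mj | ΘL):

```python
import math

# Hypothetical per-model scores: an acoustic log-likelihood log P(X | M_j, Theta_A)
# and a language-model log-prior log P(M_j | Theta_L). Values are invented.
models = {
    "yes": {"log_acoustic": -42.0, "log_prior": math.log(0.6)},
    "no":  {"log_acoustic": -41.5, "log_prior": math.log(0.4)},
}

def decide(models):
    # argmax_j P(X | M_j, Theta_A) * P(M_j | Theta_L), computed as a sum of logs
    return max(models, key=lambda m: models[m]["log_acoustic"] + models[m]["log_prior"])

j_best = decide(models)
```

Note that with these invented numbers the slightly better acoustic score of "no" outweighs the prior preference for "yes"; the two knowledge sources trade off against each other exactly as Eq. 25.3 prescribes.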

Figure 25.1 shows the general scheme, with the dependence on parameters suppressed.

If we limit our discussion to the acoustic model (and assume for the moment that the language model is perfect), then the optimum decision rule shows us that we wish to associate a large likelihood P(X | Mj, ΘA) with jbest, and smaller likelihoods with the other acoustic models. More generally, there are three problems that must be addressed with the use of statistical models for speech recognition.1

1. Parametrization and probability estimation: How should P(X | Mj, ΘA) be computed, and what are the necessary assumptions about the stochastic models to define the parameter set ΘA? In general it is not possible to estimate the probability density of a complete sequence without constraining assumptions; typically these assumptions involve temporal independence at some level so that the total estimate is broken up into constituents that can be effectively estimated. The rest of this chapter addresses the integration of these components into the likelihoods of acoustic sequences; the discussion of the estimation of the temporally local components will be postponed until the next chapter.


FIGURE 25.1 The Bayes decision rule classification. For the discussion in this chapter, the observations to be classified are a sequence of feature vectors X, corresponding to a speech utterance. The classes correspond to statistical models M, and the minimum probability of error is obtained by choosing the model with the largest product term on the right.

2. Training: Given the parametrization above, and a set of training sequences associated with stochastic models, how should ΘA be determined so that each model has the highest likelihood associated with its observed sequence of feature vectors? This will be discussed in Chapters 26 and 27.

3. Decoding: Given the set of stochastic models Mj with their trained parameters ΘA, how should the best sequence of these models be found to classify the input sequence X according to the Bayes decision rule? We give a simplified answer to this question in this chapter, but we will give a further discussion in Chapter 29, when we discuss complete systems.

See [4] or [2] for extended discussions of all three problems.


Even for the case of isolated words, the statistical models used must represent both temporal variability and temporal structure. A likely structure for this purpose is a stochastic finite-state automaton, which can be used as a model for a speech unit (e.g., a phone or word). The model will consist of some states, a topology of connection between them, and some associated parameters. The parameters will be learned from examples in a training phase and then will be incorporated during recognition. The automata most commonly used for speech recognition are generative models. That is, the states have outputs (rather than inputs), which are the observed feature vectors. The probability densities associated with each acoustic model are probability density functions for the generation of observed feature vectors in a sequence, conditioned on that model, as suggested by Fig. 25.2. For instance, one could imagine a density function for PLP cepstra produced by the model for the word “one.” However, for the many reasons mentioned earlier, it is commonly preferable to break up the density for a complete sequence into a combination of densities corresponding to subsequences. This is typically accomplished by using a Markov assumption.2


FIGURE 25.2 A generative model with its parameters produces an observation sequence according to some statistical distribution. To concatenate such models to generate concatenated sequences, we typically make Markov assumptions.

With the use of Markov assumptions, the general stochastic automata that we referred to here become Markov models. Hidden Markov models are stochastic automata that have an additional layer of indeterminacy, which we describe later in this chapter. However, we begin with an explanation of a simple Markov model (again borrowing from [4]) so that we can see the general form of such structures and how they can be used to represent the probability of sequences.

25.3.1 Markov Models

A Markov model is a finite-state automaton with stochastic transitions (that is, for which each transition has an associated probability) in which the sequence of states is a Markov chain. The states in each model M will be designated q_l [l = 1, . . ., L(M)]. Each of these states is associated with a class ω(q_l).

Consider the three-state Markov model shown in Fig. 25.3, where each state corresponds to one day's weather (assuming state residency is per day).

This model has a stochastic transition matrix A, in which the element a_ij gives the transition probability P(q_j | q_i) of moving from state i to state j; each row of the matrix therefore sums to one. (The entries used in the worked examples below include P(b | a) = 1/3 and P(c | b) = 1/4.)
By the definitions of joint and conditional probability,3 the probability of any sequence

Q = (q1, q2, q3, ..., qN) is

   P(Q) = P(q1) P(q2 | q1) P(q3 | q1, q2) · · · P(qN | q1, ..., qN−1)        (25.4)

We typically assume that the probability of any state transition is independent of any previous transitions (that is, that the probability of moving to a particular state depends only on the identity of the previous state); this is a first-order Markov assumption. In this case, the previous equation reduces to

   P(Q) = P(q1) ∏_{n=2}^{N} P(qn | qn−1)                                     (25.5)

FIGURE 25.3 Three-state Markov model for the weather (sunny, cloudy, or rainy). Transition probabilities are given as fractions. State residency corresponds to a day.


Thus, given this Markov property, the probability of any particular sequence can be found by multiplying the probability of the individual transitions in the state sequence.
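The product form of Eq. 25.5 can be sketched directly. The transition values P(b | a) = 1/3 and P(c | b) = 1/4 follow the worked weather example below, and the deterministic start at "a" means the initial probability is one:

```python
# First-order Markov chain: P(Q) = P(q1) * prod_{n>1} P(q_n | q_{n-1}).
# Only the transitions needed for the "abc" example are listed; the values
# match the text's weather example.
trans = {("a", "b"): 1/3, ("b", "c"): 1/4}

def sequence_prob(states, trans, initial_prob=1.0):
    # The text assumes every sequence starts with "a", so P(q1) = 1
    p = initial_prob
    for prev, nxt in zip(states, states[1:]):
        p *= trans[(prev, nxt)]
    return p

p_abc = sequence_prob(["a", "b", "c"], trans)  # 1/3 * 1/4 = 1/12
```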

For the weather model, we will assume (for simplicity's sake) that all sequences start with “a” (the first day considered is always sunny). Then, for example,

   P(Q = abc) = P(b | a) P(c | b) = 1/3 × 1/4 = 1/12

That is, the sequence “abc” occurs with a probability of 1/12.

Further, the probability of a set of sequences (for instance, the set of all sequences of a certain length with specified start and end states) is just the sum of probabilities of all sequences within the set.
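This sum over a set of sequences can be illustrated with the "axc" example. Only P(b | a) = 1/3 and P(c | b) = 1/4 are fixed by the text; the remaining entries of the transition matrix below are invented so that each row sums to one:

```python
# Probability of the set of length-3 sequences "axc" (start "a", end "c",
# middle state x free): sum over x of P(x | a) * P(c | x).
# Only P(b|a)=1/3 and P(c|b)=1/4 come from the text; the rest of this
# transition matrix is hypothetical (rows sum to one).
trans = {
    "a": {"a": 1/2, "b": 1/3, "c": 1/6},
    "b": {"a": 1/4, "b": 1/2, "c": 1/4},
    "c": {"a": 1/4, "b": 1/4, "c": 1/2},
}

p_axc = sum(trans["a"][x] * trans[x]["c"] for x in "abc")
```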



The probability of observing the sequence “axc,” where x could be anything, is

   P(Q = axc) = Σ_x P(x | a) P(c | x) = P(aac) + P(abc) + P(acc)

and can be calculated as shown in Table 25.1. If states “a,” “b,” and “c” corresponded to speech phones, as opposed to states of the weather, this second example could be viewed as giving the prior probability of any sequence of length three that started with phone “a” and ended with phone “c.” This property will be useful later; however, it is also necessary to associate acoustic observations with the statistical models. For this, we will need to define hidden Markov models.

TABLE 25.1 Computation of the Probability of Observing Sequence “axc”


25.3.2 Hidden Markov Model

Each output of a Markov model corresponds to a deterministic event, whereas each output of a hidden Markov model (HMM) corresponds to a probabilistic density function; thus, for any observed output sequence, the generating state sequence of a HMM is hidden. As an example, imagine that there is a line of people behind a screen and the people in it are a random mix of Swedish basketball players and speech-recognition researchers. We can't see the people in the line, but each one utters a deep sigh, presumably for being in such a long line (see Fig. 25.4). Now, since (as noted in Chapter 8) the basketball players tend to be larger, their sighs are produced by a larger vocal apparatus; the associated resonances will tend to be lower in frequency. However, there will be a lot of within-occupation variability in the spectra; that is, even when one analyzes spectra only from the basketball players, the spectra will still be highly variable from person to person. In the terminology of this chapter, then, the “state” refers to the job category of the hidden person (basketball player or researcher). The output associated with the state is an acoustic observation of a sigh. If we only hear the sighs and don't observe the speakers, then the job categories form a sequence of hidden variables, whereas the acoustic emissions (the sighs) are directly observed. The hidden sequence is presumed to be generated according to some distribution, and for each choice of state an acoustic signal is generated according to another distribution.


FIGURE 25.4 The screen hides the line of basketball players and ASR researchers, but we hear their sighs.


FIGURE 25.5 Two-state HMM, with transitions permitted from each state to itself and to the other. Observations are presumed to be generated by this model, with associated density functions P(x | q) for each state. Each transition is stochastic, with conditional probabilities P(qj | qi).

Figure 25.5 shows a two-state HMM that could represent the line-of-sighs example. State q1 could correspond to a Swedish basketball player, and state q2 could represent a speech researcher. Each state is associated with a density function for the emitted acoustics (where, for instance, the acoustic representation could be the average second-formant value for the high-energy portions of the sigh). In the limited (but common) case in which this density function is assumed to be independent of previous states or observations, it can be written as P(xn | qi). The density P(xn | q1) would then be the distribution of average formant values for a sigh from a basketball player, and P(xn | q2) would be the distribution of average formant values for ASR researchers’ sighs.

It is also commonly assumed that the sequence is a first-order Markov chain, so that the probability for a transition from state i to state j can be written as P(qj | qi), where the special case of i = j is referred to as a self-loop.
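The generative view of this two-state model can be sketched by sampling: a hidden state sequence is drawn from the Markov chain, and an observation is emitted from each state's density. The transition probabilities and Gaussian emission parameters (mean and spread of a second-formant value, in Hz) are invented for illustration:

```python
import random

# Generative sketch of the two-state "line of sighs" HMM. All numbers are
# hypothetical: state 1 is a basketball player (lower second formant),
# state 2 an ASR researcher (higher second formant).
random.seed(0)

trans = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.4, 2: 0.6}}   # P(q_j | q_i)
emit = {1: (1100.0, 100.0), 2: (1400.0, 100.0)}      # (mean, std) of F2 in Hz

def generate(n, start=1):
    states, obs = [], []
    q = start
    for _ in range(n):
        states.append(q)
        mu, sigma = emit[q]
        obs.append(random.gauss(mu, sigma))          # emitted sigh feature
        # draw the next hidden state from the current state's transition row
        q = random.choices(list(trans[q]), weights=list(trans[q].values()))[0]
    return states, obs

states, obs = generate(10)
```

Given only `obs`, the state sequence in `states` is hidden; recovering or summing over it is exactly the problem the rest of the chapter addresses.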

More formally, as with Markov models, the states in each model M will be designated q_l [l = 1, . . ., L(M)]. Each of these states is associated with a class ω(q_l). Each emission probability P(xn | q_l) will actually refer to the probability of the class ω(q_l). This distinction is important, since the ith state might not be the ith class; in particular, different states in the same model could be of the same class, and different incidences of the same class could occur in different models. For simplicity's sake we generally ignore the class designation in our probabilistic notation.

25.3.3 HMMs for Speech Recognition

As noted in Chapter 24, speech is commonly represented as a sequence of measurements that are computed over a 20-ms or 30-ms window, with a typical interframe step size of 10 ms. HMMs for speech recognition, then, are typically an interconnected group of states that are assumed to emit a new feature vector for each frame according to an emission probability density function associated with that state. Each new observation frame can in principle be associated with any state (just as any frame of an input template can be associated with any frame of a reference template, as noted in Chapter 24). However, the topology of the HMM (that is, the pattern of interstate connections) and the associated transition probabilities provide temporal constraints; for instance, a high self-loop probability for a state means that it will tend to be repeated more than once, and a directed connection between [ae] and [m] in a word model means that pronunciations with [ae] followed by [m] are permitted. These models are also typically constrained to be left to right; that is, the state transitions have some temporal order.
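A left-to-right topology can be encoded as a transition matrix whose below-diagonal entries are zero, so no probability mass ever flows backward in time. The specific probabilities in this three-state sketch are hypothetical:

```python
# Hypothetical left-to-right topology for a three-state model: each state
# may self-loop or move one state forward; backward transitions have
# probability zero. Row i holds P(q_j | q_i).
A = [
    [0.6, 0.4, 0.0],   # state 0 -> {0, 1}
    [0.0, 0.7, 0.3],   # state 1 -> {1, 2}
    [0.0, 0.0, 1.0],   # state 2 only self-loops in this sketch
]

def is_left_to_right(A):
    # no probability mass on transitions to earlier states
    return all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))

rows_sum_to_one = all(abs(sum(row) - 1.0) < 1e-12 for row in A)
```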

For simplicity's sake, in most of our discussions we assume word models that associate states with subword sounds that are roughly phonelike. However, it is common that each phone is modeled by several states with their own densities, and that each subphone is modeled in context (that is, that a state represents a subphone under some contextual constraints). For instance, a state might represent the initial part of an [ae] sound for those cases in which the previous sound was a nasal. As we shall see in later chapters, however, states will often ultimately represent self-organized classes that may have something to do with linguistic categories, but that are ultimately defined implicitly by an iterative statistical procedure.

25.3.4 Estimation of P(X | M)

Given a model M, and its definition (in terms of states, transition topology, probabilities, and parameters for the estimation of probabilities), we can then consider the estimation of the likelihood of a data sequence, assuming that it was generated by this model. Initially we make first-order Markov assumptions as suggested earlier, and we ignore any details or difficulties with the estimation of the local probabilities or likelihoods. We just assume that we are able to compute the likelihoods P(xn | qk) for each observation vector xn and each model state qk.4 Further, we assume that each transition probability P(qj | qi) is known.

Given these assumptions, we can show that we can get an estimate of the data sequence likelihood from the local likelihoods, using an efficient set of recursions that can be implemented in a way that is reminiscent of the computations performed in template matching for isolated word recognition.

Total Likelihood Estimate Let us assume that we can compute the likelihood of a data sequence, assuming a particular path through the model; let us call the state sequence associated with that path Qi, one of I paths that are of length N (the data sequence length) and that are permitted in the model M. The joint density for X and Qi could in principle be estimated by multiplying together all of the local transition probabilities and acoustic likelihoods associated with that sequence; in other words, the product would have an emission term P(xn | qk) for each frame n and another term P(qj | qi) for each transition from i to j. These values for each legal sequence Qi in M could then be summed up to get the complete data likelihood:

   P(X | M) = Σ_{i=1}^{I} P(X, Qi | M) = Σ_{i=1}^{I} P(X | Qi, M) P(Qi | M)  (25.6)

Though straightforward, this procedure requires roughly 2N × L(M)^N arithmetic steps, where L(M) is the number of emitting states of the model; there are on the order of L(M)^N possible state sequences, and each state sequence requires approximately 2N calculations.

Fortunately, a more efficient algorithm exists, and it is called the forward procedure.5 For this approach, the likelihood P(X | M) is expressed as the sum of the joint probabilities of each possible final state:

   P(X | M) = Σ_{l=1}^{L(M)} P(q_l^N, X_1^N | M)                             (25.7)

where L(M) is the number of states, N is the length of the observed sequence in frames, X_a^b is the sequence of observations from frame a through frame b, and q_l^n denotes being in state q_l at frame n.

Thus, to get the complete likelihood, we need to find the joint probability of each final state and all the data leading up to it. This may not seem like a very useful expression, but in fact we can decompose it further into the product of a local term and a cumulative term in order to build up a recursive estimation scheme. Factoring the expression P(q_l^n, X_1^n | M) into two components (again using the definitions of joint and conditional probabilities),6 we get the following relation:

   P(q_l^n, X_1^n | M) = Σ_k P(q_k^{n−1}, X_1^{n−1} | M) P(q_l^n, x_n | q_k^{n−1}, X_1^{n−1}, M)      (25.8)

We then define

   α_n(l) = P(q_l^n, X_1^n | M)                                              (25.9)

which is the probability of the joint event that the state at time n is q_l and that the sequence X_1^n (from the beginning to frame n) is observed, given the model M. Then we get the forward recurrence:

   α_n(l) = Σ_k α_{n−1}(k) P(q_l^n, x_n | q_k^{n−1}, X_1^{n−1}, M)           (25.10)
When the recurrence reaches the final frame, Eq. 25.7 can be used to obtain the complete likelihood.

The right-hand side of Eq. 25.10 is composed of two parts: the previous value of the recursion sum for each predecessor state, and a new likelihood for the state and data given all the previous values. Thus, if we could estimate the latter conditional likelihood for each time point and state, we could get the complete likelihood in an efficient number of steps. Without any simplifying assumptions, the second term can also be decomposed as follows:

   P(q_l^n, x_n | q_k^{n−1}, X_1^{n−1}, M) = P(x_n | q_l^n, q_k^{n−1}, X_1^{n−1}, M) P(q_l^n | q_k^{n−1}, X_1^{n−1}, M)      (25.11)
Unfortunately, such densities are difficult to estimate. Therefore, further assumptions are required. We typically make the Markovian assumptions that we have referred to earlier, simplifying the two terms on the right-hand side of Eq. 25.11.

First, the state chain is assumed to be first-order Markov. As with the earlier examples, this means that the probability that the Markov chain is in state q_l at time n depends only on the state of the Markov chain at time n − 1, and it is conditionally independent of the past. With this assumption,

   P(q_l^n | q_k^{n−1}, X_1^{n−1}, M) = P(q_l^n | q_k^{n−1}, M)              (25.12)
Second, observations are assumed to be independent of past features and states. This means that the probability that a particular acoustic vector will be emitted at time n depends only on the state at that time, and it is conditionally independent of the past:

   P(x_n | q_l^n, q_k^{n−1}, X_1^{n−1}, M) = P(x_n | q_l^n, M)               (25.13)
With these simplifications, Eq. 25.11 becomes

   P(q_l^n, x_n | q_k^{n−1}, X_1^{n−1}, M) = P(x_n | q_l^n, M) P(q_l^n | q_k^{n−1}, M)      (25.14)
Thus the forward recurrence becomes

   α_n(l) = P(x_n | q_l^n, M) Σ_k P(q_l^n | q_k^{n−1}, M) α_{n−1}(k)         (25.15)
Thus, given the conditional independence assumptions described here, we can compute the likelihood for a data sequence being produced by a particular model, using a recursion that depends only on local emission and transition probabilities. Each step consists of computing a sum over all possible predecessor states of the product of three components: the local acoustic likelihood (for that state and that frame); the transition probability from the previous state; and the value of the α recursion for that previous state at the previous frame. The reader may find this computation somewhat familiar – it is similar to the dynamic programming for templates discussed in the previous chapter, except that

  • each of the terms is probabilistic rather than deterministic,
  • the hypothesized predecessors are model states rather than actual frames,
  • the local constraints do not typically permit a repeat of the current frame to match the same state (all predecessors correspond to the previous frame),
  • the local and global factors are combined by multiplication rather than summation, and
  • the combination of terms over different predecessors is combined by summation rather than finding the maximum (or minimum in the case of distance).

The last two differences can actually be done away with in a further simplification that we discuss in the following paragraphs.
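The forward recurrence translates directly into a short recursion over frames. In the sketch below the model numbers are hypothetical, and the emission likelihoods P(xn | ql) are assumed to have been computed in advance for each frame:

```python
# Forward procedure: alpha_n(l) = P(x_n | q_l) * sum_k P(q_l | q_k) * alpha_{n-1}(k),
# followed by a sum over final states to get P(X | M). All numbers are hypothetical.
def forward(init, trans, emis):
    """init[l]: P(q_l) at n=1; trans[k][l]: P(q_l | q_k);
    emis[n][l]: precomputed P(x_n | q_l) for frame n."""
    L = len(init)
    alpha = [init[l] * emis[0][l] for l in range(L)]   # initialization at n=1
    for n in range(1, len(emis)):
        alpha = [emis[n][l] * sum(trans[k][l] * alpha[k] for k in range(L))
                 for l in range(L)]
    return sum(alpha)   # sum over possible final states

init = [1.0, 0.0]                      # always start in state 0
trans = [[0.6, 0.4], [0.0, 1.0]]       # left-to-right: no return to state 0
emis = [[0.9, 0.1], [0.2, 0.8]]        # two frames of emission likelihoods
likelihood = forward(init, trans, emis)
```

For this tiny example the result can be checked by hand against the brute-force sum over both legal two-frame paths (stay in state 0, or move to state 1), which gives 0.9 × 0.6 × 0.2 + 0.9 × 0.4 × 0.8 = 0.396.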

Viterbi Approximation: Estimation of Best Path Within the assumptions described here, the forward recursion yields the complete likelihood; although more efficient, it is functionally equivalent to the direct summation of the likelihoods of all possible legal paths (length N sequences of state transitions) within a model. However, the procedure can be tricky to use, since it requires both multiplication and addition of likelihoods and probabilities. If only multiplication were done, then the computation could be converted to the log domain to handle the wide range of values for the products. However, since there is also addition, the log probabilities must be exponentiated.7
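One standard way to handle the mixture of multiplication and addition without leaving the log domain (in the spirit of the interpolation schemes mentioned in the footnote) is the log-add trick: factor out the larger exponent so that only a small, well-conditioned exponential is ever computed. A minimal sketch:

```python
import math

# Numerically stable log-domain addition: computes log(e^a + e^b) without
# exponentiating the (possibly very negative) raw log-likelihoods directly.
def log_add(log_x, log_y):
    if log_x < log_y:
        log_x, log_y = log_y, log_x      # ensure log_x is the larger term
    return log_x + math.log1p(math.exp(log_y - log_x))

s = log_add(math.log(0.3), math.log(0.2))   # log(0.3 + 0.2) = log(0.5)
```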

Aside from these numerical difficulties when the complete likelihood is used, it is often useful to find the single best state sequence to explain the observation sequence. This simply requires modifying the forward recurrence by replacing all the summations with the max function. Hence, the best-path (or Viterbi) approximation to Eq. 25.8 is

   α̂_n(l) = max_k α̂_{n−1}(k) P(q_l^n, x_n | q_k^{n−1}, X_1^{n−1}, M)        (25.16)
With the independence and the first-order assumptions, it further reduces to

   α̂_n(l) = P(x_n | q_l^n, M) max_k P(q_l^n | q_k^{n−1}, M) α̂_{n−1}(k)
An equivalent form can be seen for the log domain:

   log α̂_n(l) = log P(x_n | q_l^n, M) + max_k [log P(q_l^n | q_k^{n−1}, M) + log α̂_{n−1}(k)]        (25.17)


In this form, the recursive step is extremely similar to the corresponding step in the deterministic dynamic time warp. Interpreting each negative log probability as a kind of distance (since large probabilities mean a better match), Eq. 25.17 essentially says that the global distance for a particular state and frame is the sum of a local (statistical) distance [– log P(xn | qln, M)] and the minimum over all possible predecessors of the sum of global distances and transition costs.8 Given such an interpretation, the approaches for recognition discussed in Chapter 24 (discrete word recognition, one-pass continuous recognition) also apply here, with the exception that we now have a formulation for how to combine the acoustic information with any prior statistics about the word sequences.
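The log-domain Viterbi recursion, with backpointers added so the best state sequence can be recovered, can be sketched as follows. As before, all model numbers are hypothetical and emission log-likelihoods are assumed precomputed:

```python
import math

# Log-domain Viterbi (in the style of Eq. 25.17): accumulate log emission
# plus best predecessor score plus log transition, keeping backpointers.
NEG_INF = float("-inf")

def viterbi(init, trans, emis):
    """init[l], trans[k][l], emis[n][l] all hold log-probabilities.
    Returns (best log-likelihood, best state path)."""
    L = len(init)
    score = [init[l] + emis[0][l] for l in range(L)]
    back = []
    for n in range(1, len(emis)):
        ptr, new = [], []
        for l in range(L):
            k_best = max(range(L), key=lambda k: score[k] + trans[k][l])
            new.append(score[k_best] + trans[k_best][l] + emis[n][l])
            ptr.append(k_best)
        back.append(ptr)
        score = new
    best_last = max(range(L), key=lambda l: score[l])
    path = [best_last]
    for ptr in reversed(back):              # trace backpointers to recover path
        path.append(ptr[path[-1]])
    return score[best_last], path[::-1]

def log(p):
    # map zero probabilities (forbidden transitions) to -infinity
    return math.log(p) if p > 0 else NEG_INF

init = [log(1.0), log(0.0)]
trans = [[log(0.6), log(0.4)], [log(0.0), log(1.0)]]
emis = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
best_score, best_path = viterbi(init, trans, emis)
```

With these numbers the best path moves from state 0 to state 1 (probability 0.9 × 0.4 × 0.8 = 0.288), which beats staying in state 0 (0.9 × 0.6 × 0.2 = 0.108); note that this single-path score is smaller than the total likelihood 0.396 that the forward procedure would give.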


In this chapter, we have presented the general notion of statistical models for speech recognition, and we have then specialized them to the kind of stochastic finite-state automata that we call hidden Markov models. We have further shown how simplifying assumptions can lead us to a set of recurrences that yield acoustic sequence likelihoods. These likelihoods can, in combination with prior probabilities for word sequences, lead to word sequence hypotheses that will yield the minimum probability of error (if the assumptions are correct; alas, they aren't).

Thus far, however, we have said little about how the local probabilities used in these recursions are estimated, nor how these estimators are trained. These will be the primary topics for the next chapter.

Finally, to derive efficient and simple recursions, we found it necessary to make a number of assumptions that, as noted earlier, are almost certainly not true (but that are useful nonetheless).

  1. Language model parameters and acoustic model parameters are assumed to be completely separable; that is, the language model is independent of acoustic model parameters, and the acoustic model is independent of language model parameters.
  2. The state chain is assumed to be first-order Markov. This means that the probability that the Markov chain is in state ql at time n depends only on the state of the Markov chain at time n – 1, and it is conditionally independent of the past.
  3. Observations are assumed to be conditionally independent of past observations and states. This means that the probability that a particular acoustic vector will be emitted at time n depends only on the transition taken at that time, and it is conditionally independent of the past.
  4. Recognition is often based on the best path (Viterbi), and not on all possible state sequences (total likelihood) for a model.


  1. 25.1 In this chapter, we derived a forward recursion for P(X | M) by expressing it as Σ_l P(q_l^N, X_1^N | M). Derive a similar backward recursion. That is, derive a recursion for

   β_n(l) = P(X_{n+1}^N | q_l^n, M)

  2. 25.2 The Viterbi search finds the state sequence that is the most likely match to the observed data. A search incorporating full likelihood estimates (without the Viterbi assumption) permits us to find the most likely model. How could these two yield different results? Give an example.


  1. Bourlard, H., and Morgan, N., Connectionist Speech Recognition: A Hybrid Approach, Kluwer, Boston, Mass., 1993.
  2. Jelinek, F., Statistical Methods for Speech Recognition, MIT Press, Cambridge, Mass., 1998.
  3. Rabiner, L., “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE 77: 257–286, 1989.
  4. Rabiner, L., and Juang, B.-H., Fundamentals of Speech Recognition, Prentice–Hall, Englewood Cliffs, N.J., 1993.

1This categorization of HMM problems was used in [1], which in turn was adapted from the discussion by Rabiner in such sources as [3] and [4].

2Recall that an nth-order Markov chain is a sequence of discrete random variables that depends only on the preceding n variables. The value for n is typically one (a first-order system) for speech recognition.

3See Chapter 9 for a reminder.

4Chapters 9 and 26 give a number of methods for determining these local probabilities.

5There is an analogous procedure that is called backward, which will be discussed at a later point.

6Recall from Chapter 9 that P (a, b | c) = P(a | b, c)P(b | c).

7There are, however, clever interpolation schemes for approximating z = x + y (or equivalently elog z = elog x + elog y) while keeping all values in the log domain.

8The reader should compare Eq. 25.17 with Eq. 24.2.
