CHAPTER 28

ACOUSTIC MODEL TRAINING: FURTHER TOPICS

28.1 INTRODUCTION

In the previous chapters we provided a broad introduction to speech recognition methods, including training. However, there are a number of other methods for improving the statistical modeling of speech acoustics that have proved to be advantageous. In this chapter,¹ we will discuss two of the most important of these: adaptation, and common methods of discriminative training.

28.2 ADAPTATION

28.2.1 MAP and MLLR

We begin with a brief description of the adaptation problem which, for simplicity, we will frame in terms of speaker adaptation. There are many other goals for adaptation, for example channel adaptation, but the underlying principles are shared. We have at our disposal a baseline HMM that has been trained from a large corpus consisting of many (probably thousands of) hours of data collected from many (again probably thousands of) speakers. We think of these models as being speaker-independent and denote the model parameters ΘSI. We are given a small collection (possibly minutes or at most hours) of training frames image from a single target speaker, and we would like to produce speaker-dependent models, with model parameters ΘSD, that perform better than the speaker-independent models on the target speaker's test data. In adaptation, instead of training new models from scratch, we use ΘSI and the frames image to estimate ΘSD.

Before proceeding, let us recall the notation that we used in the E-M training formalism, but now in the context of the speaker-independent models, ΘSI. If we let image denote the frames in the large corpus used to estimate ΘSI and recall that L(M) denotes the number of hidden states in the HMM, then for each state l with 1 ≤ l ≤ L(M) the model mean image is given by²

image

In Eq. 28.1 the probability image is often called state l's occupancy of frame n and we think of this probability as the fraction of frame n that is assigned to state l. From this point of view, Eq. 28.1 says that image is just the average of the frames weighted by their fractional frame assignments.

Given the collection of frames from our target speaker, image, and the speaker-independent models, ΘSI, a natural way to construct the speaker-dependent models, ΘSD, is to use an expression analogous to Eq. 28.1, namely

image

The idea behind Eq. 28.2 is that we use the well-estimated models ΘSI to obtain the state occupancies for the frames from the target speaker, and then, just as before, our estimates of the means image are simply the averages of these frames weighted by the state occupancies. Also, it is useful to define for each l the total fractional count

image

which is the sum of the fractional frames from the adaptation data that are assigned to state l.
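As an illustration of Eq. 28.2 and Eq. 28.3, the following Python sketch accumulates the fractional counts and the occupancy-weighted means for scalar features. The function name, the list-of-lists layout for the occupancies, and the use of None for a state with no data are assumptions made for this example; in practice the occupancies would come from a forward-backward pass using ΘSI.

```python
def occupancy_stats(frames, occupancies):
    """Accumulate per-state fractional counts and occupancy-weighted means.

    frames[n] is a scalar feature for frame n (an assumption made to keep
    the sketch short); occupancies[n][l] is the fraction of frame n that
    is assigned to state l.
    """
    num_states = len(occupancies[0])
    counts = [0.0] * num_states  # n_l: total fractional count per state
    sums = [0.0] * num_states    # occupancy-weighted sum of frames per state
    for x, gamma in zip(frames, occupancies):
        for l, g in enumerate(gamma):
            counts[l] += g
            sums[l] += g * x
    # The weighted mean is undefined when a state receives no adaptation data.
    means = [s / c if c > 0.0 else None for s, c in zip(sums, counts)]
    return counts, means


# Two frames, three states; the third state never sees any data.
counts, means = occupancy_stats(
    [1.0, 3.0],
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
)
```

The state with a zero count has no usable mean, which is the situation that the smoothing methods discussed in this section are meant to handle.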

In general, the estimates of the means image given by Eq. 28.2 will be unreliable because of two main problems. The first problem arises because the adaptation data image will not be uniformly distributed across the states: some states, silence for example, will have lots of adaptation data, while others may not have any adaptation data at all. Clearly using Eq. 28.2 to estimate image will be reasonable when nl is large, risky when nl is small, and nonsensical when nl = 0. The end result is that the reliability of the estimates, image, will vary considerably across the states, with the reliability for a given state depending on how large nl is. The second problem arises when the adaptation data is substantially different from the original speaker-independent training data: for example if the speaker-independent training data was collected in clean acoustic conditions while the adaptation data comes from noisy conditions. In this case the state occupancies image based on the speaker-independent model will be unreliable. Taken together these two problems lead us to regard Eq. 28.2 as giving noisy estimates of the true speaker-dependent means, and our simple analysis suggests a possible remedy: smoothing these noisy estimates, image, using the counts, nl, and the more reliable speaker-independent means, image. The two smoothing methods that we will consider here are both linear, with the first being an interpolation while the second is weighted linear regression, and both methods can be viewed as applications of empirical Bayes.

The first approach is known as maximum a posteriori (MAP) adaptation ([11]). We introduce a parameter τ and define the MAP estimate for the mean for each state l by

image

or equivalently³

image

First, we note that in Eq. 28.4 the parameter τ controls how much weight is placed on the prior, namely the speaker-independent means image, relative to the speaker-dependent counts and means nl and image. In this sense image is the posterior estimate of the speaker-dependent mean for state l from the new evidence consisting of the fractional count nl and mean image and the prior consisting of the count τ and mean image. In particular, this is the point of view that is taken in [11]. However, from our point of view the MAP estimate fits most naturally within the framework of empirical Bayes (see e.g. [7], [8], or [15]). The key point is that we do not have at our disposal a universal prior that will work in all situations. Instead, we use an empirically derived prior, image, that is estimated from a large collection of data closely related to our new evidence. We then use the additional information from the empirical prior to improve our estimate of the speaker-dependent means.

Second, we note that Eq. 28.4 results in a form of smoothing or regularization that is very similar in spirit to how we deal with rare or unseen n-grams when estimating n-grams for language modeling (e.g., see [3]). In particular, provided that we choose τ > 0, this smoothing corrects for the ill-defined behavior of image when nl = 0. However, MAP adaptation is only able to compensate for the data sparsity component of the noise in the estimates image, since it explicitly uses ΘSI to produce the counts nl.

Finally, we make some simple observations concerning the prior count τ and its relationship to the fractional counts nl based on Eq. 28.4. The first thing to note is that there is a single τ which is used for all of the states, l. Given a particular choice for τ > 0, if nl = 0, then the MAP estimate equals the speaker-independent mean, while if nl >> τ, then the MAP estimate is approximately the speaker-dependent estimate of Eq. 28.2. Hence, as nl varies from 0 to ∞, image smoothly varies between image and image. The trick to using MAP adaptation is to choose τ large enough to get sensible estimates image, but not too large. If τ is too small then it won't protect against unreliable estimates for states where nl is close to 0, while if τ is too large then image ≈ image for every l. It is useful to think of τ as being a prior count of frames and, in particular, it is useful to think of τ as being the minimum number of frames necessary for a reliable estimate of any state's mean. Consequently, a sensible choice for τ is usually somewhere between 10 and 100, depending on the particular situation.
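The limiting behavior just described can be checked numerically. The sketch below assumes the usual count-weighted interpolation between the speaker-independent mean and the adaptation-data mean, with τ acting as the count attached to the prior; the function and argument names are ours, and the features are scalar for brevity.

```python
def map_mean(mu_si, mu_sd, n_l, tau):
    """Interpolate between the prior mean and the adaptation-data mean.

    mu_si: speaker-independent (prior) mean for the state
    mu_sd: occupancy-weighted mean of the adaptation data (ignored if n_l == 0)
    n_l:   fractional count of adaptation frames assigned to the state
    tau:   prior count; must be positive
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if n_l == 0:
        return mu_si  # no adaptation data: fall back to the prior mean
    return (tau * mu_si + n_l * mu_sd) / (tau + n_l)


# With tau = 20 the estimate moves from the prior (0.0) toward the data (10.0)
# as the fractional count grows.
no_data = map_mean(0.0, 10.0, 0.0, 20.0)
balanced = map_mean(0.0, 10.0, 20.0, 20.0)
lots_of_data = map_mean(0.0, 10.0, 2.0e6, 20.0)
```

When nl equals τ the estimate sits exactly halfway between the two means, which is one way to read τ as a count of frames attached to the prior.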

The second approach is known as maximum likelihood linear regression (MLLR) (see [14]⁴). The idea is to use a linear model to predict the noisy image from the image. We use maximum likelihood to estimate the parameters of this linear model, and then use this estimate to transform the speaker-independent means. These transformed means are the MLLR estimates for the speaker-dependent means. We let a ≡ (a0, a1)ᵗ be the coefficients⁵ in our linear model given by the simultaneous system of equations, one for each l with 1 ≤ l ≤ L(M):

image

Since Eq. 28.5 is an over-determined system of linear equations (L(M) equations in the two unknowns a0 and a1), there is in general no exact solution for a. We let â denote the maximum likelihood solution to Eq. 28.5, which means that we maximize the following with respect to a:

image

Then the MLLR estimates for the means are given by

image

An equivalent but more intuitive formulation of MLLR can be given using weighted linear regression. Since we are only adapting the means, we simplify the notation for the variances (and standard deviations) by writing σ instead of image. We have L(M) pairs of observations (image, image). As before we want to predict image from our knowledge of image, but now we use the weighted regression model

image

where each error term satisfies

image

Note that under this weighted regression model, states that have smaller variance or more adaptation data will have more influence on the model.⁶ As we previously noted, scarcity of adaptation data can make the speaker-dependent means noisy or unreliable, but the weighted regression model that MLLR uses produces a smoother estimate. We use the weighted least squares solution to Eq. 28.8, which is the maximum likelihood estimate for the weighted regression model. We define the L(M)-dimensional (column) vector of speaker-dependent means Z by

image

the 2 × L(M) matrix of augmented speaker-independent means Y by

image

and the diagonal L(M) × L(M) weight matrix W by

image

Then the weighted least squares estimate for a is given by

image

In Exercise 28.1 we show that these two formulations of MLLR are in fact the same: more specifically, the maximum likelihood solution to Eq. 28.6 is given by Eq. 28.10. Also, part (c) of Exercise 28.1 recasts Eq. 28.10 into a simpler and perhaps more familiar least squares solution for â0 and â1.
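For scalar means the weighted regression has a simple closed form, which the following Python sketch computes. It assumes that the weight for state l is nl divided by the state variance, consistent with the remark that states with smaller variance or more adaptation data have more influence; the function name and argument layout are ours, and states with a zero count are simply skipped.

```python
def fit_mllr_1d(mu_si, mu_sd, counts, variances):
    """Weighted least squares fit of mu_sd[l] ~ a0 + a1 * mu_si[l].

    Each state is weighted by counts[l] / variances[l] (an assumption of
    this sketch). States with no adaptation data get zero weight, so their
    entry in mu_sd may be None.
    """
    s_w = s_y = s_z = s_yy = s_yz = 0.0
    for y, z, n, var in zip(mu_si, mu_sd, counts, variances):
        if n <= 0.0:
            continue
        w = n / var
        s_w += w
        s_y += w * y
        s_z += w * z
        s_yy += w * y * y
        s_yz += w * y * z
    det = s_w * s_yy - s_y * s_y
    if det <= 0.0:
        raise ValueError("need at least two states with data and distinct means")
    a1 = (s_w * s_yz - s_y * s_z) / det
    a0 = (s_z - a1 * s_y) / s_w
    return a0, a1


mu_si = [0.0, 1.0, 2.0, 3.0]
mu_sd = [1.0, None, 5.0, 7.0]  # the second state has no adaptation data
a0, a1 = fit_mllr_1d(mu_si, mu_sd, [5.0, 0.0, 10.0, 2.0], [1.0, 1.0, 2.0, 0.5])
adapted = [a0 + a1 * y for y in mu_si]  # transformed means for every state
```

The state without adaptation data still receives an adapted mean, because the transformation is estimated from the other states and then applied to all of the speaker-independent means.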

Note that, just like MAP, adaptation using MLLR is a form of empirical Bayes estimation. While it was not originally formulated in this way in [14], MLLR takes our noisy estimates image and smooths them with an empirical prior image to obtain image. In particular, from Eq. 28.7 we see that the estimate image is a linear function of the empirically derived prior image, and that the linear function's coefficients, â0 and â1, are functions of the prior ΘSI and the new evidence nl and image.

Unlike MAP, MLLR is able to compensate for both sources of noise in the original estimates image, namely the data sparsity issues and the problems arising from unreliable state occupancies nl. This latter problem is dealt with by the error terms in Eq. 28.9: here we balance the noisy counts by the variances in the speaker-independent model that was used to produce them.

The full MLLR formalism is more elaborate than we have described here:

  • There can be more than one linear transformation. The collection of states is partitioned into regression classes which determine the correspondence between states and transformations. These regression classes may be constructed by hand using intuition or automatically using a top-down splitting algorithm.
  • One can constrain the form of the transformation to convenient types, for example block diagonal: one block for the cepstral features, another for the differences, etc.
  • There is an extension for adapting the variances in addition to the means (see [9]). This extension no longer has a closed-form solution nor does it have an interpretation in terms of linear regression.
  • There is a closely related alternative to adapting the means and variances known as constrained maximum likelihood linear regression (CMLLR).⁷ In this framework, the linear transformation that is estimated is usually applied to the features rather than the model parameters. CMLLR does not generally have a closed-form solution nor is it related to linear regression.

28.2.2 Speaker Adaptive Training

Speaker adaptive training (SAT) uses MLLR at training time to reduce the between-speaker variability in the models (see [1]). Before explaining the idea behind SAT, we need to introduce some notation. Let X denote training data collected from R training speakers. We partition the training frames image into the R speaker-specific subsets by dividing (and rearranging if necessary) the frames 1 through N into R intervals image where a1 = 1, bR = N, and the frames image come from speaker r. If ΘSI are maximum likelihood models estimated from X, which we use to seed the SAT process, then we denote the per-state mean of the data from speaker r by image and

image

Given any state l, we would expect that there is a great deal of variability in the collection of speaker-specific means image which is due to various speaker effects, e.g., the response of the channel to the speaker in a particular recording environment. The main idea in SAT is to assume that there are underlying speaker-independent means for each state, image, and R linear transformations image that explain the variability in image via

image

Using the transformations image to account for the variability in the image makes image a more speaker-independent estimate of the state mean. Also, when we estimate the variances relative to image we end up with tighter estimates because we have eliminated at least some of the variance in the data due to speaker effects. The net result is that the models trained using SAT should be more focused on the within-speaker variability in the data than the original seed models which summarize both the within-speaker and the between-speaker variability in the data.

It is worth noting that while the transformations image are part of the model that we estimate during speaker adaptive training, they are only used during the training process and are subsequently discarded. When we use the models ΘSAT for recognition of an unknown speaker we use MLLR, either in supervised or unsupervised mode, to compute a transformation for the new speaker.

The goal of speaker adaptive training is to estimate the underlying means and standard deviations, ΘSAT = image, and the collection of transformations, image. The training procedure, which we will briefly describe, is a variant of standard HMM training using the EM algorithm. Via the transformation a(r), we introduce a speaker-specific Gaussian output log-likelihood

image

which we use to re-write the standard auxiliary function Q as

image

We want to maximize Q with respect to the variables image, and image. We initialize by using the speaker-independent models and the identity transformations, i.e., ΘSAT = ΘSI and image. We then maximize Q in three separate steps:

  1. First with respect to each of the a(r). This amounts to separately estimating R MLLR transformations, where our estimate â(r) is obtained using the data from training speaker r and an analog of Eq. 28.10.
  2. Next with respect to each of the means image. This results in the following

    image

  3. Finally with respect to each of the variances image, which results in

    image

After step 3 is complete, we repeat starting with step 1, but using our current estimate of ΘSAT instead of ΘSI as a seed. We iterate steps 1, 2, and 3 until the procedure is judged to have converged, which typically occurs in two or three cycles. It is straightforward to verify that each of the three steps increases the value of the auxiliary function Q and the log-likelihood of the training data.
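The mean step can be illustrated for scalar features. Suppose, as an assumption of this sketch, that the speaker-specific model for state l is a Gaussian with mean a0(r) + a1(r) μl and a variance shared across speakers. Setting the derivative of Q with respect to μl to zero then gives a ratio of occupancy-weighted sums. The code below is a derivation under those assumptions, with names of our own choosing, and is not a transcription of the chapter's update formula.

```python
def sat_mean_update(frames_by_speaker, occ_by_speaker, transforms, state):
    """Re-estimate one state's underlying mean given per-speaker transforms.

    frames_by_speaker[r][n]: scalar feature for frame n of speaker r
    occ_by_speaker[r][n][l]: occupancy of state l for that frame
    transforms[r]:           pair (a0, a1) for speaker r
    """
    num = 0.0
    den = 0.0
    for frames, occs, (a0, a1) in zip(frames_by_speaker, occ_by_speaker, transforms):
        for x, gamma in zip(frames, occs):
            g = gamma[state]
            # Undo the speaker's offset, then weight by the speaker's slope.
            num += g * a1 * (x - a0)
            den += g * a1 * a1
    if den == 0.0:
        raise ValueError("state has no data or all slopes are zero")
    return num / den


# Two speakers whose frames are exact transforms of an underlying mean of 2.0.
frames = [[3.0, 3.0], [4.0]]
occs = [[[1.0], [1.0]], [[1.0]]]
transforms = [(1.0, 1.0), (0.0, 2.0)]
mu = sat_mean_update(frames, occs, transforms, state=0)
```

Because the speaker effects are absorbed by the transformations, the two speakers agree on a single underlying mean even though their raw frame averages differ.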

28.2.3 Vocal tract length normalization

Vocal tract length normalization (VTLN) [4, 13, 20] is another method used to account for between-speaker variability in data. The method has its roots in the observation that if we assume a uniform tube with length L for the model of the vocal tract, then the formant frequencies in speech produced by this vocal tract will be proportional to 1/L. Suppose, then, that we have speech data collected from N speakers with vocal tract lengths L1, L2, . . ., LN. If we rescale or warp the frequency axis of the spectrum during signal processing of speaker i's data by the factor Li/L*, then, under this simple model, we will have made all of the data appear as if it were produced by the vocal tract of a single, standard length, L*.

In practice, we do not try to directly estimate a given speaker's vocal tract length.⁸ Instead we have a small – e.g., 10 – collection of predetermined frequency warpings that we select from using a maximum likelihood procedure. For each utterance, we produce 10 different versions of the corresponding feature file, the collection of which we denote by image. We select which version of the feature file to use via the rule:

image

This procedure works very well using speaker-independent models, ΘSI, to reduce the variability of test data. However, we usually also use this to reduce the variability of the training data in an iterative procedure, similar in spirit to SAT, that produces a sequence of models (image). To produce the (image), we initialize with image = ΘSI. At step k, we re-process the training data, this time selecting the appropriately warped version of each utterance via:

image

We retrain using the warped training data to produce the models image. We select ΘVTLN from the sequence (image) by running recognition on an independent, validation test set. Convergence in error rate usually occurs within two or three iterations of this procedure. When we perform recognition using ΘVTLN we need to use the VTLN procedure on the test data itself, otherwise performance will degrade; the models ΘVTLN are unaware of the speaker variability in the training data that VTLN accounts for, so we must use VTLN to account for this speaker variability in the test data too.
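The maximum likelihood selection among a fixed set of warpings amounts to an argmax over candidate feature versions. The following sketch uses a single scalar Gaussian as a toy stand-in for the likelihood that a real system would obtain from its HMMs; the warp factors, feature values, and function names are all hypothetical.

```python
import math


def gaussian_loglik(frames, mean, var):
    """Log-likelihood of scalar frames under one Gaussian (toy stand-in for an HMM score)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )


def select_warp(warped_versions, loglik):
    """Return the warp factor whose feature version scores highest.

    warped_versions maps each candidate warp factor to the features computed
    with that warp; loglik scores one feature version under the current models.
    """
    return max(warped_versions, key=lambda w: loglik(warped_versions[w]))


# Three hypothetical warps of the same utterance, scored against a model
# whose mean is 1.0.
versions = {0.9: [1.8, 2.0], 1.0: [1.0, 1.1], 1.1: [0.2, 0.1]}
best = select_warp(versions, lambda f: gaussian_loglik(f, mean=1.0, var=1.0))
```

In the iterative training procedure, the same selection would be repeated for every training utterance with the current models before retraining.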

28.3 LATTICE-BASED MMI AND MPE

The early versions of MMI training described in Section 27.2.1 only worked well on small vocabulary tasks using small models, e.g. digit recognition. Subsequent refinements to MMI training (see [21] and [17]) have resulted in two discriminative training methods for HMMs, lattice-based MMI and MPE, that work well irrespective of the size of the task or the models. There are two main aspects to these improvements that we shall discuss in turn.

The first refinement involves including what is known as the language model scale in the MMI objective function ([19]). Recall that when we perform recognition on an acoustic observation X, in principle we choose the recognition transcript image according to the rule

image

where P(X | M, Θ) is given by our acoustic model and P(M | Θ) is given by a language model (LM). As a practical matter, however, we almost always scale the LM probability by a language model scale, k, and choose image by the rule

image

or equivalently by

image

We let Mref be the true or reference model sequence for the training features X and since the LM will be unchanged by MMI we drop its dependence on Θ. Then the MMI criterion for lattice-based MMI is given by

image

Adding k makes this new MMI criterion more closely related to the criterion that is actually used during recognition. The set of possible alternative transcriptions image is obtained by running recognition on the training data X. Since J is typically very large, we store the results of this recognition in a lattice.
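To make the role of k concrete, the sketch below evaluates both the scaled recognition rule and a scaled MMI criterion over a short explicit list of hypotheses instead of a lattice. It assumes that the criterion raises the acoustic likelihood to the power 1/k, which yields the same ranking of hypotheses as raising the LM probability to the power k; the exact form of the chapter's criterion may differ in detail, and the function names and scores are ours.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def recognize(hyps, k):
    """Index of the hypothesis maximizing log P(X|M) + k * log P(M)."""
    return max(range(len(hyps)), key=lambda j: hyps[j][0] + k * hyps[j][1])


def mmi_log_criterion(hyps, ref_index, k):
    """Log of the scaled posterior of the reference among the listed hypotheses.

    hyps[j] = (acoustic log-likelihood, LM log-probability); the list must
    include the reference transcription at position ref_index.
    """
    scores = [ac / k + lm for ac, lm in hyps]
    return scores[ref_index] - log_sum_exp(scores)


hyps = [(-10.0, math.log(0.5)), (-12.0, math.log(0.5))]
crit = mmi_log_criterion(hyps, ref_index=0, k=2.0)
```

Larger values of k flatten the acoustic scores, so the competing hypotheses receive more posterior mass and contribute more to the criterion.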

The second refinement is in how we estimate parameters using the MMI criterion. As noted above, early versions of MMI used gradient descent for parameter estimation. In [12] the authors introduced a variant of the Baum-Welch algorithm, which they called the extended Baum-Welch algorithm, to estimate parameters using the MMI criterion.⁹ If we momentarily forget about the scaling, then the idea behind extended Baum-Welch is simple and worth understanding. Extended Baum-Welch is an iterative algorithm, so we shall let Θ denote our initial model parameters and let Θ′ denote the possible choices for our updated parameters. Since the logarithm of the (unscaled) MMI criterion is given by

image

the idea is to take the difference of the auxiliary functions that Baum-Welch estimation would use for log P(X | Mref, Θ′) and log P(X | Θ′) to obtain an auxiliary function suitable for estimation using log P(Mref | X, Θ′). Here is a sketch of the details, where we further simplify matters by concentrating on the means. We run a single pass of Baum-Welch as if we were estimating parameters for the term log P(X | Mref, Θ′), we denote the corresponding auxiliary function by Qnum, and we call the resulting statistics that we accumulate (using the forward-backward algorithm) the numerator occupancies. For the state means the numerator occupancies are summarized by image, which are determined by formulas analogous to Eq. 28.2 and Eq. 28.3. We also run a single pass of Baum-Welch as if we were estimating parameters for the term log P(X | Θ′), where we consider all the model sequences consistent with the set of transcriptions image; we denote the corresponding auxiliary function by Qden, and we call the resulting statistics that we accumulate the denominator occupancies. The denominator occupancies are summarized by image. We would like to define an auxiliary function, QMMI, for the MMI criterion using Qnum and Qden. The simplest approach is to just take the difference of auxiliary functions:

image

Maximizing QMMI(Θ′, Θ) with respect to Θ′ results in the following estimate for state l's mean:

image

This naive estimate suffers from two equally serious problems. First, we have no a priori reason to expect that for all l, image; in other words, Eq. 28.13 is not necessarily well-defined for all l. Second, even if Eq. 28.13 is well-defined for all l, there is no guarantee that this estimate will result in an increase in the MMI criterion in the sense of:

image

To solve both of these problems, we introduce positive constants image, a prior R, and define a smoother version of QMMI via

image

The prior is the Kullback-Leibler distance from the Gaussian with parameters (μ, σ) to the Gaussian with parameters (μ′, σ′):

image

This prior has the property image ≥ 0 with equality if and only if μ′ = μ and σ′ = σ, so its role in Eq. 28.14 is to resist Θ′ moving away from Θ. Using the smoothed QMMI results in the estimate for state l's mean that is actually used in extended Baum-Welch, namely

image

Clearly, if we choose each Dl large enough that the denominator in Eq. 28.16 is positive, then Eq. 28.16 is well defined, and it turns out¹⁰ that if Dl >> 0, then using the estimates defined in Eq. 28.16 will result in an increase in the MMI criterion.
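The effect of the constant Dl is easy to see numerically. The sketch below uses the form of the extended Baum-Welch mean update that is standard in the literature, in which the difference of the occupancy-weighted frame sums plus Dl times the current mean is divided by the difference of the counts plus Dl. We assume, without being able to confirm it from the text alone, that Eq. 28.16 has this form; the function and argument names are ours.

```python
def ebw_mean(theta_num, n_num, theta_den, n_den, mu_old, d):
    """Extended Baum-Welch style mean update for one state (scalar features).

    theta_num, theta_den: occupancy-weighted sums of frames (numerator, denominator)
    n_num, n_den:         the corresponding fractional counts
    mu_old:               the state's current mean
    d:                    the smoothing constant for the state
    """
    denom = n_num - n_den + d
    if denom <= 0.0:
        raise ValueError("d is too small: the update is not well defined")
    return (theta_num - theta_den + d * mu_old) / denom


# Numerator data centered at 1.0, denominator data centered at 0.5,
# current mean 0.9.
moderate = ebw_mean(10.0, 10.0, 4.0, 8.0, 0.9, 16.0)
huge_d = ebw_mean(10.0, 10.0, 4.0, 8.0, 0.9, 1.0e9)
```

With a moderate Dl the mean moves away from the denominator data, while a very large Dl leaves the mean essentially where it started.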

In practice, we start MMI training from a well estimated maximum likelihood model, say ΘMLE (here MLE denotes Maximum Likelihood Estimate, i.e., the model resulting from conventional, non-discriminative, training). Note that the mean update formula in Eq. 28.16 is similar in flavor to the MAP adaptation formula Eq. 28.4. In particular, if we choose each Dl >> 0, then the updated parameters Θ′ will be essentially the same as our starting point ΘMLE. Another method used to smooth the parameter estimates with the MLE is called I-smoothing [17]. Just as in MAP adaptation, we introduce a prior count τ, and our estimate of state l's mean is now given by

image

The reader should think of the prior count τ as being relative to the MLE, so if τ is chosen appropriately, Eq. 28.17 will keep the model means and variances for states with small counts image and image close to the MLE means and variances even after several iterations of extended Baum-Welch. This results in less noisy updates for rare states, thereby adding some stability to extended Baum-Welch.
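The stabilizing effect on rare states can be illustrated with a small numerical sketch. It assumes the form of I-smoothing commonly given in the literature, in which τ pseudo-frames located at the MLE mean are added to the numerator statistics of the extended Baum-Welch update; Eq. 28.17 may differ from this in detail, and all names and values below are hypothetical.

```python
def i_smoothed_mean(theta_num, n_num, theta_den, n_den, mu_old, mu_mle, d, tau):
    """Mean update with tau pseudo-frames placed at the MLE mean (scalar features)."""
    denom = n_num - n_den + d + tau
    if denom <= 0.0:
        raise ValueError("d + tau is too small: the update is not well defined")
    return (theta_num - theta_den + d * mu_old + tau * mu_mle) / denom


# A rare state: about one numerator frame near 3.0 and half a denominator
# frame near 1.0, with an MLE mean of 1.2.
unsmoothed = i_smoothed_mean(3.0, 1.0, 0.5, 0.5, 1.0, 1.2, d=1.0, tau=0.0)
smoothed = i_smoothed_mean(3.0, 1.0, 0.5, 0.5, 1.0, 1.2, d=1.0, tau=50.0)
```

Without the prior count a single frame drags the mean far from its starting point, whereas with τ = 50 the estimate stays close to the MLE mean.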

The reader should consult [17] for the many details that we have omitted, for example:

  • The estimation formulas for the variances, transition probabilities, and mixture weights.
  • How to choose the constants Dl.
  • Details on lattice generation.

Minimum phone error (MPE) training is another discriminative training method for HMMs that is closely related to MMI [17]. For any Mj, we let A(Mj, Mref) denote the phone accuracy¹¹ of Mj relative to Mref, which is the number of reference phones minus the number of phone errors in Mj. Then the MPE criterion is given by

image

When we use the MPE criterion to select model parameters, we are trying to choose parameters that maximize the expected phone accuracy. An iterative algorithm based on extended Baum-Welch is used to estimate model parameters using the MPE criterion. I-smoothing with MLE priors is an essential ingredient of this algorithm; in fact, the update formula for the state means is given by a formula very similar to Eq. 28.17 (see Eq. 28.22 below), but the occupancies are determined differently than in MMI.
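The expected phone accuracy can be computed directly when the alternatives are given as a short explicit list. In the sketch below, phone errors are counted with a standard edit distance (substitutions, insertions, and deletions), which is one reasonable reading of the definition above but an assumption on our part, as are the scaled posteriors and all of the names used.

```python
import math


def phone_errors(hyp, ref):
    """Edit distance between two phone sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # extra phone in hyp
                           cur[j - 1] + 1,           # phone missing from hyp
                           prev[j - 1] + (h != r)))  # match or substitution
        prev = cur
    return prev[-1]


def phone_accuracy(hyp, ref):
    """Number of reference phones minus the number of phone errors."""
    return len(ref) - phone_errors(hyp, ref)


def expected_phone_accuracy(hyps, ref, k):
    """Posterior-weighted average accuracy; hyps[j] = (phones, acoustic logp, LM logp)."""
    scores = [ac / k + lm for _, ac, lm in hyps]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * phone_accuracy(p, ref)
               for w, (p, _, _) in zip(weights, hyps))


ref = ["k", "ae", "t"]
hyps = [(["k", "ae", "t"], -10.0, math.log(0.5)),
        (["k", "ae", "p"], -10.0, math.log(0.5))]
value = expected_phone_accuracy(hyps, ref, k=1.0)
```

Shifting posterior mass toward the hypothesis with fewer phone errors raises the criterion, which is the sense in which MPE training targets phone accuracy.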

Generally speaking, MPE outperforms MMI when there is a large amount of training data (more than 100 hours). The MPE estimation algorithm is also more stable than extended Baum-Welch. When performing MMI training, the optimal number of iterations of extended Baum-Welch is determined by examining the performance of the resulting models on a validation test set. Optimal performance on the validation set will usually occur within a narrow range of iterations, with performance degrading significantly if extended Baum-Welch is run too many times. While the number of iterations of the MPE estimation algorithm still needs to be determined by examining the performance of the resulting models on a validation test set, optimal performance will occur within a broader range of iterations. One reason for this is that the MPE criterion is explicitly based on phone errors so it is inherently more self-regulating than the MMI criterion: if the model parameters are pushed too far away from the optimum, then this will induce errors which in turn will result in a decrease in the MPE criterion. The tradeoff for this stability is that MPE estimation takes roughly twice as many iterations as MMI estimation does. Recent work ([18]) has successfully combined some of MPE's strengths with MMI.

28.3.1 Details of mean estimation using lattice-based MMI and MPE

In this section we will fill in the details concerning the statistics that are accumulated for mean estimation using lattice-based MMI. We will also describe and motivate the MPE estimates for the means. Finally we shall describe how the lattices are used to efficiently accumulate these statistics.

The lattices that lattice-based MMI and MPE use store two compatible levels of information and are usually referred to as phone-marked lattices. The top level consists of word lattices, with the nodes storing the start and end times of the words. An arc between two nodes stores the word identity along with information about the pronunciation variant. The language model scores are also stored at this level. Thus a path through a word lattice gives a transcription along with the corresponding language model score. The lower level consists of phone lattices that are obtained from the word lattices by first expanding the pronunciations of the words, then force-aligning to get the start and end times for each phone. Just like the word lattices, the nodes in the phone lattices store the start and end times of the phones while each arc in the lattice stores the phone's identity and acoustic model score. A path through the phone lattice gives a phonetic transcription along with the corresponding acoustic model scores. MMI and MPE both use two phone-marked lattices, the numerator lattice and the denominator lattice, to accumulate the statistics used in their iterative estimation procedures. At the word level, the numerator lattice contains the correct (or reference) transcriptions for the training data while the denominator lattice consists of the correct transcription plus the result of a recognition of the training data using the MLE.¹² Forced-alignment using the MLE gives the phone level information for both the numerator and the denominator lattice.

The phone arcs in the numerator and denominator lattices are central to the lattice-based MMI and MPE formalisms, so before proceeding further we describe the necessary notation involving the phone arcs. To start, a phone arc a consists of the underlying phone (possibly with context) ap, along with start and end times for the arc, as and ae, measured in frames.¹³ We let A denote the set of all phone arcs in the denominator lattice, Anum denote the set of all phone arcs in the numerator lattice, image denote the set of all alternative transcriptions in the denominator lattice, and, given a ∈ A, we define Ma to be the set of transcriptions that pass through (or contain) the phone arc a. Finally, let image be the – scaled – joint probability of being in state l at frame n and in any transcription that contains a, given X, i.e.,

image

In lattice-based MMI, the numerator occupancies for state l relative to initial parameters Θ are given by

image

and

image

while the denominator occupancies for state l are given by

image

and

image

As we have previously described, these occupancies are used to obtain the MMI estimates of the means using Eq. 28.16 or, in the case of I-smoothing, Eq. 28.17.

MPE also uses the numerator occupancies for I-smoothing, but we need to define two additional sets of occupancies. In order to facilitate this, we define the quantity c(a) to be the average phone accuracy of all the transcriptions that pass through a via

image

We also define cavg to be the average phone accuracy of all the transcriptions in the lattice via

image

Note that since

image

we can simplify the definition of cavg:

image

In particular, this means that cavg = FMPE(Θ), or the MPE criterion relative to our initial parameters Θ. Finally we introduce two additional subsets of A, namely image

The two sets of occupancies, the positive occupancies image and the negative occupancies image are given by

image

image

and

image

We use the occupancies image together with a prior count τ and a collection of positive constants {Dl}, l = 1, ..., L(M), to define the I-smoothed MPE estimate for state l's mean:

image

While the form of the MPE estimate, Eq. 28.22, is very similar to that of the MMI estimate, Eq. 28.16, the MMI and MPE occupancies have different interpretations. In the case of MPE, the positive occupancies are computed on A+, which is precisely the set of arcs a where the expected accuracy of the transcriptions that pass through a, c(a), is greater than the expected accuracy of all the transcriptions, cavg, while the negative occupancies are computed on A−, which is precisely the set of arcs a where the expected accuracy of the transcriptions that pass through a is less than the expected accuracy of all the transcriptions.¹⁴ Thus, providing that image¹⁵ moves towards μlpos, which we can think of as a region of improved phone accuracy, and moves away from μlneg, which we can think of as a region of decreased phone accuracy, from a starting point constructed from the priors μl, μlnum, Dl, and τ. In the case of MMI, the analog of the positive occupancies are the numerator occupancies which are computed on all of the phone arcs in the correct transcription, while the analog of the negative occupancies are the denominator occupancies which are computed on all of the phone arcs whether they are correct or not. Thus, again assuming that τ – nneg + Dl > 0, the MMI estimate image moves towards μlnum, which we can think of as a region of improved sentence accuracy, and moves away from μlden, which we can think of as a region of decreased sentence accuracy, from a starting point constructed from the priors μl, μlnum, Dl, and τ.
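The partition of the arcs into A+ and A− can be made concrete on a toy example in which the transcriptions are enumerated explicitly, which is something a real system would avoid by working on the lattice. The sketch assumes that each transcription comes with a posterior probability (summing to one over the list) and a phone accuracy; the data layout, arc labels, and function name are ours.

```python
def arc_accuracies(paths):
    """Compute c(a) for every arc, c_avg, and the sets A+ and A-.

    paths is a list of (arcs, posterior, accuracy) triples, one per
    transcription, with the posteriors assumed to sum to one.
    """
    c_avg = sum(p * acc for _, p, acc in paths)
    mass = {}      # total posterior of the transcriptions through each arc
    weighted = {}  # posterior-weighted accuracy of those transcriptions
    for arcs, p, acc in paths:
        for a in set(arcs):
            mass[a] = mass.get(a, 0.0) + p
            weighted[a] = weighted.get(a, 0.0) + p * acc
    c = {a: weighted[a] / mass[a] for a in mass}
    a_plus = {a for a in c if c[a] > c_avg}
    a_minus = {a for a in c if c[a] < c_avg}
    return c, c_avg, a_plus, a_minus


paths = [
    (["a", "b"], 0.5, 3),  # the most accurate transcription
    (["a", "c"], 0.3, 2),
    (["d", "c"], 0.2, 1),
]
c, c_avg, a_plus, a_minus = arc_accuracies(paths)
```

Arcs that appear mostly in accurate transcriptions land in A+, and arcs that appear mostly in inaccurate ones land in A−; an arc whose c(a) equals cavg belongs to neither set.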

Parameter estimation using lattice-based MMI or MPE is feasible because the quantities cavg and

image

can be efficiently computed using three variants of the forward-backward algorithm over the phone arcs in the lattices. To enable this computation we need to make the simplifying assumption that the allowable hidden state sequences in the HMMs may be restricted to those that respect the phone arc start and end times. We also make an independence assumption about the distribution of frames given a sequence of phone arcs, which first requires some notation. Given any transcription M in the lattice, let a1, a2, ..., ak be the underlying sequence of non-trivial phone arcs corresponding to M. We encode the language model probability, P(M), into phone arc transition probabilities by setting P(ai | ai−1) = 1 if the phone arcs ai−1 and ai are within a word, and setting P(ai | ai−1) to the corresponding language model probability if ai−1 and ai cross a word boundary. We also define the joint probability P(a1, a2, ..., ak) to be

image

With these conventions

image

and we can decompose P(M, a1, a2, ..., ak, X1N | Θ) as follows

image

Finally our independence assumption is that

image

Also, each image is modeled using an HMM for the underlying phone (usually with triphone context) api starting from ais and ending at aie.

Note that we have intentionally left out the scaling of the acoustic probability by the factor image in Eq. 28.23. We will continue to do this in the rest of this section, mainly because of the notational difficulties that scaling introduces. In particular, the unscaled probabilities and densities that we use really are probabilities and densities, while this is not true for the scaled versions without a tedious re-normalization. It is also easy for the reader to get the correct, scaled versions of all the formulas: simply replace every occurrence of image with image.

We give a sketch of the details starting with the computation of image for each a ∈ A. For each phone arc a ∈ A we run the usual forward-backward algorithm within the boundaries as and ae. Thus for each phone arc we have a separate estimation problem for the underlying triphone starting at as and finishing at ae. The results of these computations are

image

and

image
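The within-arc step can be sketched in Python as follows. This is a minimal illustration and not the chapter's exact procedure: it computes only the forward log-likelihood of the frames inside one arc under a left-to-right HMM with one-dimensional Gaussian outputs (in line with the chapter's one-dimensional convention), and the state occupancies would additionally need the matching backward pass. The function names, the parameter layout, and the choice to ignore an exit transition are all assumptions.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def arc_log_likelihood(frames, means, variances, log_self, log_next):
    """Forward algorithm restricted to the frames of one phone arc.

    frames    : the frames between the arc's start and end times
    means, variances : per-state Gaussian parameters (hypothetical layout)
    log_self  : log self-loop probability for each state
    log_next  : log probability of moving from state s to s + 1
    The path is assumed to start in state 0 and end in the last state.
    """
    n_states = len(means)
    alpha = [-math.inf] * n_states
    alpha[0] = log_gauss(frames[0], means[0], variances[0])
    for x in frames[1:]:
        new_alpha = []
        for s in range(n_states):
            stay = alpha[s] + log_self[s]
            move = alpha[s - 1] + log_next[s - 1] if s > 0 else -math.inf
            new_alpha.append(logsumexp([stay, move])
                             + log_gauss(x, means[s], variances[s]))
        alpha = new_alpha
    return alpha[-1]


# Toy check: a single-state model with self-loop probability 1.
ll = arc_log_likelihood([0.0, 0.0], means=[0.0], variances=[1.0],
                        log_self=[0.0], log_next=[])
```

Working in the log domain is a design choice made here to avoid underflow on long arcs; the text itself states the quantities as plain probabilities.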

Since

image

to finish the computation of image it suffices to compute image for every phone arc a ∈ A.

To accomplish this, we again use the forward-backward algorithm, but this time over the phone arcs in the phone lattices. For convenience, we add to the lattice a single trivial entry node, 0, and insist that all the initial arcs in the lattice start at 0. We also add to the lattice a single trivial exit node, N + 1, and a trivial exit arc, ∊, that leads from the final frame, N, to N + 1. Thus ∊s = N, ∊e = N + 1, and M = M. Given a phone arc a ∈ A, let Ma be the set of phonetic transcriptions (or paths through the phone lattice) that start at the trivial beginning node 0 and end at ae, and let image be the set of phonetic transcriptions that start at ae and end at the trivial final node N + 1. Then the key observation is that we can decompose P(Ma, X1N) into the product of the forward probability image and the backward probability image, i.e.,

image

We compute the forward probabilities image recursively as follows:

  (a) If a is an initial arc, as = 0, then we set

    image

  (b) If a is not an initial arc, as > 1, then we set

    image

Since M = M = M the recursion terminates with the computation of

image

Likewise, we compute the backward probabilities image recursively by setting image and

image

Finally we compute P(Ma | X1N, Θ) (our goal) using the forward and backward probabilities

image
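The arc-level forward-backward pass described above can be sketched as follows. This is an illustration under stated assumptions, not the chapter's exact formulation: the lattice representation, the function name `arc_posteriors`, and the handling of final arcs through a set `finals` (instead of an explicit trivial exit arc) are all choices made for the example, and the acoustic likelihood of each arc is assumed to have been computed already by the within-arc pass.

```python
from collections import defaultdict


def arc_posteriors(arcs, trans, initial, finals):
    """Posterior probability of each phone arc given the frames.

    arcs    : list of (arc_id, acoustic_likelihood) in topological order
    trans   : dict {(prev_id, next_id): P(next | prev)}
    initial : dict {arc_id: P(arc)} for arcs leaving the entry node
    finals  : set of arc ids that end at the last frame
    """
    order = [a for a, _ in arcs]
    lik = dict(arcs)
    preds, succs = defaultdict(list), defaultdict(list)
    for (b, a), p in trans.items():
        preds[a].append((b, p))
        succs[b].append((a, p))

    # Forward pass: total probability of all partial paths ending with arc a.
    alpha = {}
    for a in order:
        inflow = initial.get(a, 0.0) + sum(alpha[b] * p for b, p in preds[a])
        alpha[a] = inflow * lik[a]

    # Backward pass: total probability of all completions that follow arc a.
    beta = {}
    for a in reversed(order):
        outflow = 1.0 if a in finals else 0.0
        outflow += sum(p * lik[b] * beta[b] for b, p in succs[a])
        beta[a] = outflow

    total = sum(alpha[a] for a in finals)  # probability of the frames
    return {a: alpha[a] * beta[a] / total for a in order}, total


# Toy lattice: two competing initial arcs that merge into one final arc.
arcs = [("a1", 0.2), ("b1", 0.1), ("c", 0.5)]
trans = {("a1", "c"): 1.0, ("b1", "c"): 1.0}
post, total = arc_posteriors(arcs, trans, initial={"a1": 0.5, "b1": 0.5},
                             finals={"c"})
```

In this toy lattice every path passes through the final arc, so its posterior is 1, while the two initial arcs split the posterior mass in proportion to their path probabilities. A practical implementation would work with log probabilities to avoid underflow.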

The quantities {c(a)}a ∈ A are computed using the final variant of the forward-backward algorithm, our account of which will assume that we have already computed the set of probabilities

image

First we give a more formal description of A(M, Mref). Given a transcription M ∈ ℳ, we regard it as a phone-level transcription, and for each phone arc a in this transcription we define A(a) by

image

Then image.16 Given a phone arc a ∈ A and a path M ∈ ℳa, we can split M into two parts: the path that starts at the beginning node and ends at ae, namely Mimage, and the path that starts at ae and ends in the terminal node, namely image. Since image, it follows that

image

Motivated by this, we define the forward average, c(a), and the backward average, image, by

image

and

image

The key observation that enables the efficient computation of c(a) is that

image

We compute the forward averages, c(a), recursively as follows:

  (a) If a is an initial arc, as = 0, then we set c(a) = A(a)
  (b) If a is not an initial arc, as > 1, then we set

image

which simplifies to

image

The terminal forward average computation gives cavg, via either image or

image

Similarly, we compute the backward averages, image, recursively by initializing with image and setting

image
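The forward and backward averages can be sketched in Python as follows. This is a hedged illustration: it uses the recursion commonly given for lattice-based MPE (each average is a probability-weighted mean over neighbouring arcs), which may differ in detail from the chapter's equations. The lattice layout and the function name `expected_accuracies` are assumptions, and every arc is assumed to be reachable from the entry node and to lead to a final arc, so that no weight sum is zero.

```python
from collections import defaultdict


def expected_accuracies(arcs, trans, initial, finals):
    """Return ({arc_id: c(a)}, c_avg) for a phone lattice.

    arcs    : list of (arc_id, acoustic_likelihood, accuracy A(a)),
              in topological order
    trans   : dict {(prev_id, next_id): P(next | prev)}
    initial : dict {arc_id: P(arc)} for arcs leaving the entry node
    finals  : set of arc ids that end at the last frame
    """
    order = [a for a, _, _ in arcs]
    lik = {a: l for a, l, _ in arcs}
    acc = {a: x for a, _, x in arcs}
    preds, succs = defaultdict(list), defaultdict(list)
    for (b, a), p in trans.items():
        preds[a].append((b, p))
        succs[b].append((a, p))

    # Forward probabilities and forward averages (accuracy up to and including a).
    alpha, fwd = {}, {}
    for a in order:
        weights = [(alpha[b] * p, fwd[b]) for b, p in preds[a]]
        inflow = initial.get(a, 0.0) + sum(w for w, _ in weights)
        alpha[a] = inflow * lik[a]
        fwd[a] = acc[a] + sum(w * f for w, f in weights) / inflow

    # Backward probabilities and backward averages (accuracy after a).
    beta, bwd = {}, {}
    for a in reversed(order):
        weights = [(p * lik[b] * beta[b], bwd[b] + acc[b]) for b, p in succs[a]]
        outflow = (1.0 if a in finals else 0.0) + sum(w for w, _ in weights)
        beta[a] = outflow
        bwd[a] = sum(w * f for w, f in weights) / outflow

    total = sum(alpha[a] for a in finals)
    c = {a: fwd[a] + bwd[a] for a in order}
    c_avg = sum(alpha[a] * fwd[a] for a in finals) / total
    return c, c_avg


# Toy lattice: path a1 -> c has accuracy 2, path b1 -> c has accuracy 1.
arcs = [("a1", 0.2, 1.0), ("b1", 0.1, 0.0), ("c", 0.5, 1.0)]
trans = {("a1", "c"): 1.0, ("b1", "c"): 1.0}
c, c_avg = expected_accuracies(arcs, trans, initial={"a1": 0.5, "b1": 0.5},
                               finals={"c"})
```

In the toy lattice the arc a1 lies only on the more accurate path, so c(a1) exceeds cavg, while c(b1) falls below it; the shared final arc has exactly average accuracy.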

28.4 CONCLUSION

This chapter has introduced some significant recent improvements to HMM-based speech recognizers. First, we looked at adaptation, in which a speaker-independent acoustic model is systematically modified to match the particular characteristics of an individual speaker, either during recognition or for both training and recognition. The second part provided details of some improvements to the discriminative training first mentioned in Chapter 27 as a way to adjust model parameters in order to minimize recognition errors, instead of pursuing the proxy goal of maximizing training data likelihood as in conventional MLE training. While the details of MMI and MPE training become increasingly involved, we hope that our examination gives some flavor and insight into how phone and word errors can be attributed to individual acoustic model components, thereby facilitating direct optimization of these parameters. Such training schemes, while considerably more computationally expensive than MLE techniques, have nonetheless become feasible even for very large-scale speech recognition systems, and have returned significant accuracy improvements in leading-edge research recognizers.

28.5 EXERCISES

  1. 28.1 For each l with 1 ≤ l ≤ L(M), define ul by

    image

    1. (a) Show that the Q defined in Eq. 28.6 when viewed as a function of a = (a0, a1) has precisely one critical point satisfying

      image

      and

      image

    2. (b) Show that the weighted least squares solution given in Eq. 28.10 also reduces to Eq. 28.24 and Eq. 28.25.
    3. (c) Let (MSD, MSI) be a pair of discrete random variables taking values in image. Show that if we define the probability distribution of (MSD, MSI) by

      image

      then we can re-interpret Eq. 28.24 and Eq. 28.25 as follows:

      image

      and

      image

  2. 28.2 Verify the Kullback-Leibler formula Eq. 28.15.
  3. 28.3 Verify that image
  4. 28.4 In our account of the computation of cavg and

    image

    we split it into three separate variants of the forward-backward algorithm. Devise a more efficient algorithm that combines these variants into one forward-backward algorithm and show how to compute the MPE occupancies

    image

  5. 28.5 In the Baum-Welch algorithm for HMM parameter estimation, the probabilities image can be thought of as fractional counts, i.e., what fraction of frame n is assigned to state l. An important property that these counts have is

    image

    which is simply a restatement of the fact that image is the probability distribution of the frame n across the states 1 through L(M). In this exercise we will verify that the probabilities image have an analogous interpretation.

    Let image, i.e., the set of arcs that contain frame n. Show that

    image

    and that given a, b ∈ An, with a ≠ b

    image

    from which it follows that

    image

    Use this to show that

    image

BIBLIOGRAPHY

  1. Anastasakos, T., McDonough, J., Schwartz, R., and Makhoul, J., “A compact model for speaker-adaptive training,” in Proc. Int. Conf. Spoken Lang. Process., Philadelphia, pp. 1137-1140, 1996.
  2. Axelrod, S., Goel, V., Gopinath, R., Olsen, P., and Visweswariah, K., “Discriminative estimation of subspace constrained Gaussian mixture models for speech recognition,” IEEE Trans. Audio, Speech, and Lang. Process. 15: 172-189, 2007.
  3. Chen, S. and Goodman, J., “An empirical study of smoothing techniques for language modeling,” Harvard University, Tech. Rep. TR-10-98, 1998.
  4. Cohen, J., Kamm, T., and Andreou, A. G., “Vocal tract normalization in speech recognition: compensating for systematic speaker variability,” J. Acoust. Soc. Am. 97: 3246-3247, 1995.
  5. Cover, T. M. and Thomas, J. A., Elements of Information Theory, Wiley & Sons, New York, 1991.
  6. Digalakis, V. V., Rtischev, D., and Neumeyer, L. G., “Speaker adaptation using constrained estimation of Gaussian mixtures,” IEEE Trans. Speech Audio Process. 3: 357-366, 1995.
  7. Efron, B., and Morris, C., “Empirical Bayes estimators on vector observations – an extension on Stein's method,” Biometrika 59: 335-347, 1972.
  8. Efron, B., and Morris, C., “Data analysis using Stein's estimator and its generalizations,” J. Am. Stat. Assoc. 70: 311-319, 1973.
  9. Gales, M. F. J., and Woodland, P. C., “Mean and variance adaptation within the MLLR framework,” Speech Commun. 10: 249-264, 1996.
  10. Gales, M. F. J., “Maximum likelihood linear transformations for HMM-based speech recognition,” Comp. Speech Lang. 12: 75-98, 1997.
  11. Gauvain, J.-L., and Lee, C., “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Trans. Speech Audio Process. 2: 291-298, 1994.
  12. Gopalakrishnan, P. S., Kanevsky, D., Nádas, A., and Nahamoo, D., “An inequality for rational functions with applications to some statistical estimation problems,” IEEE Trans. Info. Theory 37: 107-113, 1991.
  13. Lee, L., and Rose, R. C., “A frequency warping approach to speaker normalization,” IEEE Trans. Speech Audio Process. 6: 49-60, 1998.
  14. Leggetter, C. J., and Woodland, P. C., “Maximum likelihood linear regression for speaker adaptation of continuous density HMMs,” Speech Commun. 9: 171-186, 1995.
  15. Morris, C. N., “Parametric empirical Bayes inference: theory and applications,” J. Am. Stat. Assoc. 78: 47-55, 1983.
  16. Normandin, Y., and Morgera, S. D., “An improved MMIE training algorithm for speaker-independent, small vocabulary, continuous speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Toronto, pp. 537-540, 1991.
  17. Povey, D., Discriminative Training for Large Vocabulary Speech Recognition, Ph.D. Thesis, Cambridge University, 2004.
  18. Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., and Visweswariah, K., “Boosted MMI for model and feature-space discriminative training,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Las Vegas, pp. 4057-4060, 2008.
  19. Schluter, R., and Macherey, W., “Comparison of discriminative training criteria,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Seattle, pp. 493-496, 1998.
  20. Wegmann, S., McAllaster, D., Orloff, J., and Peskin, B., “Speaker normalization on conversational telephone speech,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Atlanta, pp. 339-341, 1996.
  21. Woodland, P. C., and Povey, D., “Large scale discriminative training of hidden Markov models for speech recognition,” Comp. Speech Lang. 16: 25-47, 2002.

1 This chapter was written by Steven Wegmann.

2 We are assuming that the Baum-Welch algorithm has converged.

3 Technically, for this to be true in general we need to follow the convention that nlimage = 0 whenever nl = 0.

4 We will be following the convention in this chapter that the features are one-dimensional. This means that in our account of MLLR the affine transformation will also be one-dimensional, which, in the author's opinion, simplifies the notation considerably and makes the connection between MLLR and weighted linear regression easier to see.

5 Here a0 and a1 are scalars.

6 We are ignoring the minor technicalities involving states that have no adaptation data, i.e., nl = 0.

7 This was first described in [6], where the authors also derived the closed-form solution for estimating a diagonal transformation. The problem of estimating a non-diagonal transformation does not generally admit a closed-form solution, but in [9] a practical algorithm for estimating general transformations is described.

8 While this method doesn't actually measure the vocal tract length (which has a more complicated relationship with spectral content than would be given by the uniform tube model), it does provide a framework to improve performance.

9 The algorithm described in [12] was for HMMs that used discrete output distributions. In [16] the algorithm was generalized to HMMs with continuous output distributions.

10 See [2] for details.

11 We really do mean phone accuracy even when we are, say, using triphone-based HMMs: in this case we would first strip away the triphone contexts of both the hypotheses and the reference before computing A(Mj, Mref).

12 So by construction, the numerator lattice is a sub-lattice of the denominator lattice.

13 Naively, we could set the start and end frames of the arc to be equal to the times specified by the nodes that the arc traverses. This, however, would lead to some frames being counted too many times in the occupancy statistics. Consider, for example, a node that has an arc entering the node and another arc exiting the node: the frame corresponding to this node will be assigned to both the entry and the exit arc. Instead we need to ensure that the frame is assigned to just one of these arcs, so we set as equal to the first frame after the beginning node of the arc, and ae equal to the frame specified by the ending node of the arc.

14 Note that we ignore the counts accumulated on the phone arcs that have average accuracy, i.e., the a ∈ A0 = {a ∈ A : c(a) − cavg = 0}.

15 This condition is almost always true in practice.

16 An important detail that we are omitting is how to efficiently compute A(a) over all of the phone arcs. In [17] this is handled through an approximation that takes into account the time boundaries as and ae.
