In the past few chapters, we have established the basics for understanding the static pattern-classification aspect of speech recognition.

  1. Signal representation: in most ASR systems, some function of the local short-term spectrum is used. Typically, this consists of cepstral parameters corresponding to a smoothed spectrum. These parameters are computed every 10 ms or so from a Hamming-windowed speech segment that is 20–30 ms in length. Each of these temporal steps is referred to as a frame.
  2. Classes: in most current systems, the categories that are associated with the short-term signal spectra are phones or subphones,1 as noted in Chapter 23. In some systems, though, the classes simply consist of implicit categories associated with the training data.

Given these choices, one can use any of the techniques described in Chapter 8 to train deterministic classifiers (e.g., minimum distance, linear discriminant functions, neural networks) that can classify signal segments into one of the classes. However, as noted earlier, speech recognition includes both pattern classification and sequence recognition; recognition of a string of linguistic units from the sequence of segment spectra requires finding the best match overall, not just locally. This would not be so much of a problem if the local match were always right,2 but we are not this lucky! There are always some local errors, and our real goal is (typically) to recognize the correct sequence of words, not to find the phone identity associated with each frame. Furthermore, both timing and pronunciation can vary between our stored representation and the new speech that we wish to recognize. Therefore we must somehow trade off between different choices of frame-class identification so that the global error is minimized, that is, the error over a complete choice of a transcription for an utterance. In addition, even during the training phase, the boundaries of internal sound units (e.g., phones) are unknown. For these reasons, even the simplest speech recognizer must take longer stretches of time (typically an entire utterance) into account when deciding what was said at any point. This entails doing both segmentation and classification; in many cases both are done in a single pass.

In this context, speech recognition may be seen as a particular case of sequence recognition. That is, for a sequence of frame observations (framewise signal representations or vectors) X = (x1, x2, ..., xn, ..., xN), we wish to associate X with a second sequence Q = (q1, q2, ..., qN), where each q represents some linguistic or quasi-linguistic unit, and where the sequence Q is chosen to minimize some error criterion. For simplicity's sake, in this chapter the second sequence will be restricted to another sequence of frame observations, used as a reference. In other words, we will be matching a sequence X^in with sequences X_k^ref for 1 ≤ k ≤ K, where K is the number of reference sequences. We will also confine the discussion to deterministic methods, in which we establish distance metrics and methods for time normalization. In the next chapter we will move to statistical approaches, which are in some ways more general, but the intuition for sequence-integration methods may be clearer if presented first in a deterministic framework. The limitation to frame-observation references and to deterministic methods also provides some historical background, since nearly all ASR systems used these approaches before 1980.

24.2 Isolated Word Recognition
In order to make this discussion more concrete, let us consider the example of a simple template-based isolated word-recognition system. Imagine that the speech is received over the telephone, with an 8-kHz sampling rate, and with roughly a 3.4-kHz spectrum. The speech is pre-emphasized with a single-zero filter so that the spectral slope is relatively flat for voiced sounds. Once every 10 ms we apply a 25-ms Hamming window to the data, and we compute 10 mel-cepstral coefficients as explained in Chapter 22. We weight the cepstra by their index (except for c_0, which for simplicity's sake we ignore in this example). In principle, this weighting makes the feature variances roughly equal [5].

Thus, for each 10 ms we have 10 weighted cepstral numbers. Let us further imagine that the task is digit recognition, namely, the recognition of the words zero, one, . . ., nine.3 Assuming that the digits are, on the average, roughly 300 ms long, we will have approximately 30 cepstral vectors for each word uttered. Each collection of cepstral vectors X is called a template, illustrated in Fig. 24.1. During a training phase, we will compute and store reference templates X_k^ref. During testing, we will compare each new pattern of cepstral vectors X^in with all of the reference templates and identify the new utterance as being the word associated with the template with the smallest distance to the new pattern. We will defer for the moment what “distance” means here.
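To make the front end concrete, the fragment below applies the index weighting to a matrix of cepstral vectors; this is a sketch only, in which the function name and the all-ones stand-in for real mel-cepstral analysis output are illustrative assumptions, not part of any particular system.

```python
import numpy as np

def weight_cepstra(cepstra):
    """Weight each cepstral coefficient c1..c10 by its index,
    which roughly equalizes the feature variances; c0 is excluded."""
    cepstra = np.asarray(cepstra, dtype=float)    # shape: (frames, 10)
    weights = np.arange(1, cepstra.shape[1] + 1)  # 1, 2, ..., 10
    return cepstra * weights

# A ~300-ms word at a 10-ms frame step gives roughly 30 vectors;
# all-ones cepstra stand in for real analysis output here.
template = weight_cepstra(np.ones((30, 10)))
```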

For any computations involving the distance between two sequences of feature vectors, we must handle the issue of time normalization – comparing two sequences that are of different length.


FIGURE 24.1 Template consisting of N tenth-order cepstral vectors.

24.2.1 Linear Time Warp

Suppose that we wish to compare two sequences of index-weighted cepstral vectors. Given a distance metric between any two vectors, if each template is the same length, we can compose the global distance (or distortion) between two templates as the sum of the local distance (still to be defined) between each corresponding frame of the templates being compared. In other words,

D(X^in, X^ref) = Σ_{n=1}^{N} d(x_n^in, x_n^ref),    (24.1)

where d is the local distance between cepstral vectors, D is the total distortion between the templates, and N is the template length (the same for both templates by assumption).

In general, however, reference and input templates will have a differing number of vectors, so we cannot directly compare the sequences frame by frame. In particular, we would like two examples of the utterance “two” to have a very small overall distance, even though one of them was 300 ms long and the other was 600 ms long. The most obvious way to match the two sequences is by linear time normalization – basically, either the reference or the input vector sequence is subsampled or interpolated to yield a pair of sequences that are the same length. For the case given here, this could be done by repeating each vector of the 300-ms case, or by discarding every other vector from the 600-ms utterance. This is illustrated in Fig. 24.2 (only the first 10% on each axis is shown).
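A linear time normalization of this kind can be sketched in a few lines; the nearest-source-frame selection rule and the squared-Euclidean local distance are illustrative assumptions.

```python
import numpy as np

def linear_warp(template, target_len):
    """Linearly resample a template (frames x features) to target_len
    frames by picking the nearest source frame for each target frame;
    this repeats frames when stretching and drops frames when shrinking."""
    template = np.asarray(template, dtype=float)
    src = np.round(np.linspace(0, len(template) - 1, target_len)).astype(int)
    return template[src]

def linear_distortion(a, b):
    """Warp b to the length of a, then sum the framewise local
    (squared-Euclidean) distances to get the global distortion."""
    b = linear_warp(b, len(a))
    return float(sum(np.sum((x - y) ** 2)
                     for x, y in zip(np.asarray(a, float), b)))
```

For the 300-ms versus 600-ms example, `linear_warp` simply discards every other frame of the longer template (or, in the other direction, repeats each frame of the shorter one).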

This is the approach that was used in many systems in the 1970s, and in fact it lingered in commercial systems for some time because of its simplicity. However, there are clear limitations to the idea. Generally, variations in speech duration are not spread evenly over different sounds. For instance, stop consonants vary only slightly in length, whereas diphthongs and glides (and in general vowels and vowellike sounds) vary a great deal over different utterances of the same word. For instance, in the word “two,” the initial stop consonant would probably have a very similar length for both the 300-ms and 600-ms cases, whereas the following vowel would comprise most of the time normalization. Therefore, it would be preferable to warp the stop consonant less than the following vowel. In general, we would like to have a way to optimally warp the matching process. This can be done by using a nonlinear or dynamic time warp process (illustrated for a single case in Fig. 24.3).


FIGURE 24.2 Linear time normalization for the case in which the reference template has twice as many frames as the input template.

24.2.2 Dynamic Time Warp

General Description As noted in the previous section, a linear time warping of acoustic templates does not properly compensate for the different compression–expansion factors that are observed for fast or slow instances of different phonetic classes. Therefore, it would be desirable to compensate for this variability in a dynamic manner, that is, finding the best possible warping of the time axis for one or both sequences that are being compared. Vintsyuk [8] may have been the first to observe that the dynamic programming principle of Bellman [1] could be applied to this problem; however, the idea was not widely known elsewhere until the mid-1970s.


FIGURE 24.3 Nonlinear time normalization for the case in which the reference template has twice as many frames as the input template, and the reference template is most expanded after frame 2.

Dynamic programming is an approach to the implicit storage of all possible solutions to a problem requiring the optimization of a global criterion. It is formulated as a sequential optimization strategy in which the current estimate of the global error function is updated for each possible step. At each step, enough information about the plausible hypotheses is retained so that at the end, when the best global value is found, the corresponding set of choices that led to this value can also be recovered. A classic example of this (see, for instance, [7]) is the knapsack problem. In this problem, we wish to stuff items of N different types with varying sizes and values into a knapsack that has a capacity of M (in the units of size), such that we maximize the value of the take. In this case, the dynamic programming solution is an efficient approach to computing the best combination for all knapsack sizes up to M. Thus, the problem for M = 10 is solved by considering each item i of size m_i ≤ M that can be fit into that bag, then adding on the best possible solution for a bag of the remaining space, M − m_i, which has been found at an earlier stage in the computation. This can be implemented by explicitly or implicitly building up a table of cumulative costs. In addition to the costs, the local information of how each possible solution was achieved is stored; in the case of the knapsack, for each capacity value, the identity of the item added to complete that solution is stored. Thus, once the best total value is found for M = 10, we can look up what was added to achieve that best number, and we can backtrack to find the sequence of items added to the knapsack to achieve the best total value.
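The knapsack computation just described can be sketched as follows; the function name and the choice of an unbounded item supply (any item type may be used repeatedly) are illustrative assumptions.

```python
def knapsack(sizes, values, M):
    """Dynamic programming for the (unbounded) knapsack: best[m] holds
    the maximum value for a bag of capacity m, and choice[m] records
    the item added last, so the winning combination can be recovered
    by backtracking."""
    best = [0] * (M + 1)
    choice = [None] * (M + 1)
    for m in range(1, M + 1):
        for i in range(len(sizes)):
            if sizes[i] <= m and best[m - sizes[i]] + values[i] > best[m]:
                best[m] = best[m - sizes[i]] + values[i]
                choice[m] = i
    # Backtrack through the stored choices to list the winning items.
    items, m = [], M
    while m > 0 and choice[m] is not None:
        items.append(choice[m])
        m -= sizes[choice[m]]
    return best[M], items
```

Note the two ingredients that recur in DTW: a table of cumulative costs built up from smaller subproblems, and stored local decisions that permit backtracking once the best global value is known.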

Applied to template matching for speech recognition, this algorithm can be stated fairly simply. Imagine a matrix D in which the rows correspond to frames of a reference template and the columns correspond to frames of an input template. For each matrix element in D we will define a cumulative distortion measure,

D(i, j) = d(i, j) + min_{(i', j') ∈ p(i, j)} [ D(i', j') + T(i', j' → i, j) ],    (24.2)

where d is a local distance measure between frame i of the input and frame j of the reference template, and where p(i, j) is the set of possible predecessors to i, j; in other words, the coordinates of the possible previous points on the matching trajectory between the two templates. The T[·] is a term for the cost associated with any particular transition. Thus, each matrix element is the value of the total error that arises from the best step that could lead to associating those two frames, and since this step is made after a similar optimal decision, the best cumulative distance in the final column (corresponding to the last frame in the input template) will be the distortion corresponding to the best match between reference and input. For isolated word recognition, the reference with the lowest value will be taken as the best match. The basic computational step is illustrated in Fig. 24.4.

Thus, the algorithm consists of the following steps.

  1. Compute the local distance for each element in column 1 of each distortion matrix (that is, the distance between frame 1 of the input and all the frames of each reference template). Call this the cumulative distortion value for that matrix element.


    FIGURE 24.4 Basic DTW step for the case of simple local constraints. Each i, j node is associated with a local distance d and a cumulative distortion D; the equation to the right shows the basic computational step.

  2. Starting with frame 2 of the input, and beginning with the bottom row (frame 1 of the first reference template), compute the local distance and add it to the best cumulative distortion value for all possible predecessors (that is, all possible matches between input and reference that could temporally precede either the current input frame or the current reference frame). Compute this value for each element in the column for each reference template.
  3. Continue this operation through each of the other columns.
  4. Find the best distortion number in the last column for each reference template and declare it the distortion associated with that reference.
  5. Choose the word associated with the best of the reference distortions and declare it the winner.

Since this algorithm applies a dynamic programming approach to the time warp problem, it is often referred to as a dynamic time warp, or DTW.

Global Constraints Although the algorithm just described gives the basic approach, there are modifications to the approach that are used in nearly all real implementations. For instance, it is a waste of computation to compare the first input frame with the last reference frame. Figure 24.5 shows a plausible set of constraints to reduce the search space.

Local Constraints Similarly, the possible predecessors are limited to a few nearby matrix elements. Figure 24.6 shows a common constraint for DTW problems in which the local warping is constrained to skip no frames of either the reference or input template, but to permit repeats of either one.

Thus, for this case, the minimum of Eq. 24.2 is computed over only three predecessors.
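Under this three-predecessor local constraint, the recursion of Eq. 24.2 can be sketched as follows. This is illustrative only: the local distance is taken as squared Euclidean, and the transition costs T are set to zero, both simplifying assumptions.

```python
import numpy as np

def dtw_distortion(ref, inp):
    """Cumulative distortion D(i, j) under the common three-predecessor
    local constraint (no skipped frames; repeats of either template are
    allowed). Squared-Euclidean local distances; transition costs T are
    taken as zero for simplicity."""
    ref, inp = np.asarray(ref, float), np.asarray(inp, float)
    N, M = len(inp), len(ref)
    D = np.full((N, M), np.inf)
    for i in range(N):
        for j in range(M):
            d = np.sum((inp[i] - ref[j]) ** 2)   # local distance d(i, j)
            if i == 0 and j == 0:
                D[i, j] = d
                continue
            preds = [D[i - 1, j] if i > 0 else np.inf,               # repeat ref frame
                     D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # diagonal step
                     D[i, j - 1] if j > 0 else np.inf]               # repeat input frame
            D[i, j] = d + min(preds)
    return float(D[N - 1, M - 1])
```

For isolated word recognition one would call this once per reference template and pick the word whose template yields the smallest distortion; a global slope constraint (Fig. 24.5) would simply skip the matrix elements outside the search subspace.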

Template Clustering As we have described the algorithm here, the input template is compared with each reference template. However, we have left open how many reference templates will be used for each word. In general, a single template will be insufficient, given the variability inherent to speech. For speaker-dependent recognition, it is often sufficient to store a small number of reference templates for each word in the vocabulary. For speaker-independent training, many more examples are generally required. In either case, the multiple examples are often replaced with a smaller number of representative prototypes that are obtained by clustering the raw examples. Many different clustering approaches can be used, including the K-means algorithm described in Chapter 8. In this case, however, the distance is between two sequences (templates consisting of sequences of vectors), and the distance is taken to be the DTW distortion between the two templates. With this metric, a large number of examples for each word can be clustered into one or more prototypes that can then be used to represent the word.
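With a template-to-template distortion in hand, clustering reduces to ordinary clustering under that distance. A minimal sketch picks a medoid (the example with the smallest total distance to the others) as the single prototype for a word; the toy length-difference distance in the test is a stand-in, and in practice the `distance` argument would be the DTW distortion between two templates.

```python
import numpy as np

def medoid(templates, distance):
    """Choose the template with the smallest total distance to all the
    others as the single representative prototype for a word."""
    totals = [sum(distance(t, u) for u in templates) for t in templates]
    return templates[int(np.argmin(totals))]
```

A K-means-style loop over several such medoids (assign each example to its nearest prototype, then recompute each cluster's medoid) can yield multiple prototypes per word, as is generally needed for speaker-independent recognition.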


FIGURE 24.5 Typical global slope constraints for dynamic programming. The search-subspace section is the only region searched, since matrix elements outside of this space are considered to be unlikely matches.


FIGURE 24.6 Typical local constraints for dynamic programming; m is the index for the frame of the speech input, and n is the frame for the reference template. D(i, j) is the minimum distortion value for a match going through input frame i and reference frame j.

24.2.3 Distances

In Chapters 8 and 9 we saw that a poor distance measure could lead to bad classification performance; for instance, in the example of Chapter 8, using a simple unweighted Euclidean distance for height–weight pairs (in feet and pounds, respectively) places too much emphasis on the weight. We saw that a reasonable strategy was to scale the features so that they covered roughly the same range; in Chapter 9 we looked at this problem from a statistical standpoint, and the Gaussian model normalized each feature by its standard deviation. However, we noted then that even this approach did not necessarily lead to an optimal distance measure, since it did not account for the effectiveness of each feature for classification. A linear discriminant analysis was mentioned as a solution, but even in this case there is an implicit assumption about the distance between points in feature space.

Many researchers have worked on plausible approaches to measures that corresponded well to human perceptual distance. Some of these led to the kinds of feature measures we have already discussed. For instance, we mentioned in Chapter 22 the liftering that is often applied to cepstral values that result from LPC, mel-cepstral analysis, or PLP. Viewed from the perspective of perceptual distances, the design of features such as index-weighted PLP cepstra is in fact focused on generating measures that can be effectively compared with a Euclidean distance.

Once such modifications are used in the front end, however, how can we know what distance measure would be optimal? In general we must establish some mathematical framework and allow an automatic procedure to learn the best measure for distance. Using a statistical framework gives us powerful mathematical tools and a definition for optimality that corresponds well with the goals of classification; for this reason, nearly all modern systems represent distance in probabilistic terms. This will be the focus of the chapters that follow.

24.2.4 End-Point Detection

Because accuracy is often higher for an isolated word system,4 pauses between words are often mandatory in commercial products (particularly for large-vocabulary dictation). For template-based systems such as the one described in this chapter, one of the main limitations to recognition accuracy has been the segmentation between the desired speech sequences and nonspeech segments; in other words, to locate the end points of the speech so that the templates can be formed. The reader may find this statement odd; after all, shouldn't it be easier to do a speech–nonspeech classification than to distinguish between different speech classes? However, it has often been observed that the former discrimination has been the source of most of the errors in such systems. One reason for this is that the vocabularies for artificial systems are typically chosen to be as orthogonal as possible – that is, most easily discriminable. In a case in which the words are necessarily similar, as in the alphabet, frequent users generally modify the vocabulary for better discrimination, as in “alpha bravo charlie.” In contrast, the competing nonspeech sounds have no such characteristic and can often be confused with speech; for instance, the short-term spectrum of a breath may be confused with that of a fricative sound.

The example in Fig. 24.7 shows a click preceding the beginning of a spoken word. The click has sufficient energy that an energy threshold alone would be insufficient to screen out the click while still preserving the beginning of the speech segment. It is common to include time thresholds in the determination; for instance, in this figure the length of time between the click and the start of speech is so great that the probability of the two being part of the same word is very low. Such detection schemes have been built and often have reasonable accuracies under the conditions that the thresholds have been optimized for, but they can be troublesome as conditions change.
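A minimal end-point detector combining an energy threshold with a minimum-duration (time) threshold might look like the following sketch; the function name and threshold values are hypothetical, and, as noted above, real detectors need considerably more care as conditions change.

```python
import numpy as np

def find_endpoints(frame_energy, energy_thresh, min_frames):
    """Return (start, end) frame indices (end exclusive) of the longest
    run of frames whose energy exceeds the threshold, ignoring runs
    shorter than min_frames (e.g., isolated clicks); None if no run
    qualifies."""
    active = list(np.asarray(frame_energy) > energy_thresh) + [False]  # sentinel flushes last run
    best, start = None, None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i                     # a run of putative speech begins
        elif not is_speech and start is not None:
            run = i - start
            if run >= min_frames and (best is None or run > best[1] - best[0]):
                best = (start, i)         # keep the longest qualifying run
            start = None
    return best
```

In the test below, the single high-energy frame (a click-like burst) is rejected by the duration threshold even though it exceeds the energy threshold.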


FIGURE 24.7 Bell Labs example of a mouth click preceding a spoken word. From [5].


FIGURE 24.8 Example of a DTW path for connected words. In this case, the best path was found to correspond to words k′, K, 1, K, and k″.


24.3 Connected Word Recognition

Thus far, we have restricted the discussion to the recognition of individual words, uttered in isolation. How does the problem change if the words are uttered without pauses between them? Of course the pronunciation of each word can be dramatically altered, as we have noted earlier. However, even if we ignore this phenomenon, there are strong practical differences from the isolated word case. In principle, one could still warp the input against templates for every possible word sequence. In general, however, this number is far too large – both the storage requirements and the comparison time would be too great (infinite in the general case).

Therefore, in addition to the time normalization and recognition goals of isolated word DTW, connected word recognition requires segmentation into the separate words so that they can be time normalized and recognized. Here too, dynamic programming techniques can be used to search efficiently through all possible segmentations. Vintsyuk did early work on this problem as well, as did Sakoe and Bridle. Sakoe's algorithm [6] required two passes: one for computing distances, and a second for assembling the best word sequences given these distances. Both Vintsyuk and Bridle developed one-stage approaches, and the algorithm is described in detail in [4]. Here we only briefly mention a few key features.

The basic principle of connected word DTW is (conceptually) to assemble one large matrix consisting of all the word templates, and to do dynamic programming on the whole thing a column at a time, as with the isolated word case. At word boundaries the local constraints are different (since the preceding frame can come from other words), but otherwise the basic distortion computation step is the same. However, at the end, it is necessary to backtrack from the best cumulative distortion in the last column in order to find what was said. In principle this could be done by storing a second matrix that holds the pointer to the best preceding matrix element; this can be followed back from the last column to reveal the complete warping function, even across words. Figure 24.8 shows a typical warping path that could be tracked in such a procedure.

As Ney points out in [4], the main problem with this simplified description is that it too leads to a large storage requirement, particularly for large-vocabulary cases. For instance, given a 20,000-word vocabulary, 50 frames/word, and 20 words per utterance, both the distortion and backtracking matrices would have on the order of 10^9 elements. Fortunately, it turns out that some small complications to the algorithm provide enormous savings in storage. The distortion matrix can be reduced to two columns, since for most implementations the legal predecessors in the warp come from at most one frame into the past. The backtracking matrix can be replaced with two lists that are each the length of the utterance (in frames): a “from template” list that gives the template index of the word with the lowest cost that ends at a particular frame, and a “from frame” list that points to the end frame of the previous word, given that the current frame is the end frame of the current word. These lists, which we can represent as T(n) and F(n), respectively, can be generated online as the DTW algorithm proceeds. To backtrack, the following procedure can be followed.

  1. Let j be the frame pointer, and initially let j be N, the number of frames in the utterance. Then the word with the lowest-cost match ending in the last frame has index T(N). In Fig. 24.8, this would be word k″.
  2. Point to the last frame of the previous word using the “from frame” list: this value is F(j), where j is initially equal to N. In the figure, F(j) would be the time index of the frame before the first frame of the best warp for word k″.
  3. Set j to the value of F(j). Then, looking at the “from template” list, the word ending in frame j is T(j). In the figure, this would be word K.
  4. Repeat the last two steps until the entire word sequence has been noted. This sequence will be associated with the optimal path through the match between the input template and all possible combinations of the reference templates.
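The backtracking steps above can be sketched as follows, with both lists indexed by frame number as in their definitions; dictionaries stand in for the frame-length arrays (only word-final frames on the best path need entries), and frame 0 denotes the start of the utterance. This is an assumed minimal form, not Ney's full algorithm.

```python
def backtrack(T, F, N):
    """Recover the spoken word sequence from the 'from template' list T
    (index of the lowest-cost word ending at each frame) and the 'from
    frame' list F (end frame of the previous word). Frames are 1-based;
    F entries of 0 mean the utterance start has been reached."""
    words, j = [], N
    while j > 0:
        words.append(T[j])   # word whose best warp ends at frame j
        j = F[j]             # jump to the end frame of the previous word
    return list(reversed(words))  # words were collected last-to-first
```

In the test, word 7 ends at frame 2 and word 2 ends at frame 5, so the recovered sequence in spoken order is [7, 2].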

The preceding is a simplified description of DTW. In particular, we have not discussed additive transition costs, multiplicative weightings of particular transition slopes, and word-sequence restrictions that are sometimes provided by a grammar.


24.4 Subword Units

Thus far in this chapter, the only linguistic units used have been words; dynamic programming was used to find the best match to reference templates associated with candidate words. Thus, the local (per frame) distances were not associated with any linguistically defined class other than the word. However, as noted in Chapter 23, words are typically structured items, consisting of subword units that are common across many words. In principle this structure can and should be used to improve the effectiveness of any finite amount of training data.

For framewise dynamic-programming-based speech recognition, such subword units have primarily been used for statistical systems, and so we will defer their discussion to the next two chapters. However, as noted in Chapter 4, in the 1970s and 1980s a number of systems were built that made extensive use of acoustic–phonetic knowledge, some of which used statistical classification and some of which used deterministic approaches. In either case, subword units were used.

Probably the best known systems in this class were the efforts of Ron Cole's group (first at CMU [2] and later at OGI [3]), and related work at MIT by Victor Zue and his colleagues [9]. These systems incorporated explicit acoustic–phonetic knowledge in order to segment the speech separately into phonetic segments, and then classify them. Time normalization was thus accomplished by combining costs at a segment level rather than on a framewise basis, and phones or phonemes were the classes to be identified. By incorporation of smaller units than words, learned parameters were shared across many words. By incorporating explicit decision rules for specific phonetic examples (sometimes deterministic, sometimes statistical), the designers were not limited to defining a single distance metric for all kinds of decisions.

A prototypical example of this approach was the FEATURE system developed at CMU [2]. This system used deterministic measures for the phonetic segmentation, and then it classified the segments either with a linear discriminant function or with a Gaussian classifier.


24.5 Discussion

The incorporation of dynamic programming into speech-recognition systems became standard by the mid-1980s, though it was used in many research systems in the 1970s. The notion of a time-flexible distance between sequences was a critical one, and essentially all systems now incorporate it in some fashion. The specific mechanism of acoustic template matching, however, is no longer commonly used, except perhaps in some voice dialing systems. It has proved useful to incorporate at least some lower-level structure, either from linguistic knowledge or from some self-organized process, so that some kind of subword unit is incorporated in nearly all systems.

As noted previously, a more flexible form of distance design has been developed in the form of statistical learning procedures. Distributions are estimated in a training process and then used in an optimal decision procedure during recognition. However, both training and recognition still depend on many assumptions, and the best choice of these assumptions still largely depends on designer intuition. It is often useful to consider speech recognition from the standpoint of the earlier systems described in this chapter in order to develop the insight that is necessary for these decisions.


Exercises

  24.1 As noted here and in [4], connected word recognition can be done by using only two columns of the distance–distortion matrix (in addition to the frame and template pointer lists): the current column and the previous one. Suppose that the local slope constraint prohibits entirely vertical paths; in other words, input frames cannot be repeated in the time warp. Given this restriction, how could you modify the algorithm to use only one distance–distortion column?
  24.2 Imagine an F1–F2 graph, with examples of formant values in frames for the phones /i/ and /u/ represented by points in the graph. In this framework, explain how discrimination from frames of nonspeech could be more difficult than discrimination between the two phone types.


References

  1. Bellman, R., and Dreyfus, S., Applied Dynamic Programming, Princeton Univ. Press, Princeton, N.J., 1962.
  2. Cole, R., Stern, R., Phillips, M., Brill, S., Specker, P., and Pilant, A., “Feature-based speaker-independent recognition of English letters,” in Proc. IEEE Int. Conf. on Acoust. Speech Signal Process., Minneapolis, Minn., pp. 731–733, 1993.
  3. Janssen, R. D. T., Fanty, M., and Cole, R. A., “Speaker-independent phonetic classification in continuous English letters,” in Proc. Int. Joint Conf. Neural Net., Seattle, pp. II-801–808, 1991.
  4. Ney, H., “The use of a one-stage dynamic programming algorithm for connected word recognition,” IEEE Trans. Acoust. Speech Signal Process. 32: 263–271, 1984.
  5. Rabiner, L., and Juang, B.-H., Fundamentals of Speech Recognition, Prentice–Hall, Englewood Cliffs, N.J., 1993.
  6. Sakoe, H., and Chiba, S., “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust. Speech Signal Process. 26: 43–49, 1978.
  7. Sedgewick, R., Algorithms, Addison–Wesley, Reading, Mass., 1983.
  8. Vintsyuk, T., “Speech discrimination by dynamic programming,” Kibernetika 4: 81–88, 1968.
  9. Zue, V., “The use of speech knowledge in automatic speech recognition,” Proc. IEEE 73: 1602–1615, 1985.

1These categories are often further subdivided into classes associated with a particular phonetic context; for instance, a triphone is a phone with particular phonetic categories to its left and right.

2However, even if the local match always gave the correct phone, the variation in pronunciation could still generate possible confusions between words, so that higher-level information is still often necessary to decode the linguistic meaning.

3In real applications the word oh must also be recognized, but we ignore this detail here.

4Although it is generally true that the best state-of-the-art recognizer for the same vocabulary, environment, and so on is more accurate if trained for the isolated word case, it should also be noted that systems that have been designed and trained for continuous speech input can also have difficulties with isolated words.
