Why do we study automatic speech recognition (ASR)? For one thing, there is a lot of money at stake: speech recognition is potentially a multi-billion-dollar industry in the near future. As of 2011, earnings (and savings) from simple telephone applications are reputed to be billions of dollars per year.

There are many aspects of speech recognition that are already well understood. However, it is also clear that there is much that we still don't know. We don't have human-quality speech recognition; performance degrades rapidly when small changes are made to the speech signal, such as those caused by switching microphones.

Speech recognition is potentially very useful. Sample applications include the following.

Telephone applications: For many current voice-mail systems, one has to follow a series of touch-tone button presses to navigate through a hierarchical menu. Speech recognition has the potential to cut through the menu hierarchy, although simple “press or say one” speech applications do not do this. Many “smart phones” now also incorporate speech recognition, for instance to simplify dialing a number in the phone's contact list.

Hands-free operation: There are many situations in which hands are not available to issue commands to a device. Using a car phone and controlling the microscope position in an operating room are two examples for which some limited vocabulary systems already exist.

Applications for the physically handicapped: Speech recognition is a natural alternative interface to computers for people with limited mobility in their arms and hands, or for those with sight limitations.

For some aspects of computer applications, speech may be a more natural interface than a keyboard or mouse.

Dictation: General dictation is an advanced application, requiring a much larger vocabulary than, for instance, replacing a menu system. Dictation systems currently accept continuous, large-vocabulary input, and work well for many people when trained specifically for the individual user.

Translation: Another advanced application is translation from one language to another. The Verbmobil project in Germany was both a collaborative and competitive effort to provide language-to-language translation. The goal was to facilitate a conversation between native speakers of German and Japanese, using English as an intermediate language; the system was to act as an assistant to the German participant, translating words and phrases as needed from German into English (the speaker is assumed to be moderately competent in English). A number of U.S. projects, such as Transtac, have also attempted to provide two-way spoken translation.


There are many reasons why speech recognition is often quite difficult.

First, natural speech is continuous; it often doesn't have pauses between the words. This makes it difficult to determine where the word boundaries are, among other things. Also, natural speech contains disfluencies. Speakers change their mind in midsentence about what they want to say, will often accidentally switch phones (as in the phrase “teep a kape,” which means “keep a tape”), and utter filled pauses (e.g., “uh” and “um”) while they are thinking of their next message.

Second, natural speech can also change with differences in global or local rates of speech, pronunciations of words within and across speakers, and phonemes in different contexts. As a result, we can't just say that X is the spectral representation that corresponds to “uh.” The spectrum will change, often quite dramatically, if any of these conditions are changed.

Third, large vocabularies are often confusable. A 20,000-word vocabulary is likely to contain more words that sound alike than a 10-word vocabulary. There is also the issue of out-of-vocabulary words: for some tasks, no matter what words are in the vocabulary, recognition will always encounter words that have not been seen before. How to model these unknown words is an important unsolved problem.

Fourth, as noted previously, recorded speech is variable over room acoustics, channel characteristics, microphone characteristics, and background noise. In telephone speech, the channel used by the telephone company on any particular call (especially for analog segments) will have spectral and temporal effects on the transmitted speech signal. Background noise and acoustics in the environment that a telephone speaker is in will also have tangible effects on the signal. Different handsets, or in general different microphones, have different frequency responses; tilting a microphone at different angles will also change the frequency response. Nonlinear effects are particularly significant in carbon-granule microphones, but in general they can complicate the effects of using a particular handset. Some effects will be phone dependent; for instance, nasal sounds may be louder if the microphone is closer to the nose.

All of these factors can change the characteristics of the speech signal – a difference that humans can often compensate for, but that current recognition systems often cannot.

The algorithms for training recognition systems must be chosen carefully, for large training times are not practical for research purposes. Algorithms that take a year to run on available hardware may be of great theoretical interest, but since most programs have bugs, such a choice does not really permit the development of an experimental approach.

To replace other input modes with speech recognition, a high level of performance must be obtained. This does not necessarily mean near-perfect accuracy (although certainly too many errors can be very frustrating); perhaps it is just as important that recognition systems know when they are working well and when they are not, requiring some kind of confidence factor in designing a response to an input. This is difficult to do well when the recognition technology is known to be imperfect.

Another phenomenon that makes natural speech difficult to recognize is the effect of coarticulation. The physical realization of a phone can vary significantly depending on its phonetic context. For instance, consider the phone /t/ in the following words:

  • take
  • stake
  • tray
  • straight
  • butter
  • Kate

In the case of take, the /t/ is aspirated (i.e., there is a period of unvoiced sound after the release), whereas in stake it is not. The influence of r in tray and straight makes /t/ come out more like a combination of t and ch. In butter, the /t/ is realized as a quick touch of the tongue against the alveolar ridge, also known as a flap. Finally, in Kate the /t/ is sometimes not released (especially in fast speech), so there is no large burst of noise at the end of the word (cf. take).

Besides differences in pronunciations of words within and between speakers, phonological variation often happens at the phrase level. For instance, the phrase “What are you doing?” often comes out sounding like “Whatcha dune?” In the same way, “Juwana eat?” is often the realization of “Do you want to eat?” In continuous speech, different strings of words can often sound like each other. Consider the following extreme examples:

  • It's not easy to wreck a nice beach.
    It's not easy to recognize speech.
    It's not easy to wreck an ice beach.
  • Moes beaches am big you us.
    Most speech is ambiguous.
  • sly drool
    slide rule
  • say s
    say yes

In addition, even if the word recognition is accurate, the semantic content may still not be clear. Consider these newspaper headlines1:

  • Carter Plans Swell Deficit
  • Farmer Bill Dies In House
  • Nixon To Stand Pat On Watergate Tapes
  • Stud Tires Out

We find most of these headlines funny because there is an alternate semantic interpretation one would not expect to see in the news (and sometimes it's the first interpretation we get).


Given this general description of the difficulty of speech recognition, some dimensions of this difficulty can be defined. Claims of 98% accuracy in ASR are meaningless without the specification of these task characteristics.

5.3.1 Task Parameters

One qualifier of an ASR task is whether it is speaker dependent (SD) or speaker independent (SI). An SD system is one that has been trained on one particular speaker and tested on the same speaker. An SI system is trained on many speakers and tested on a disjoint set of speakers. The Bell Labs digit recognizer discussed in [5], for instance, was trained and used in an SD manner, since it had to be adjusted for every speaker. The large ASR tasks that were tackled under the U.S. ARPA program in the 1987–1995 period (the Resource Management and Wall Street Journal dictation tasks) had both SD and SI components. Large vocabulary systems for use on personal computers have tended to be SD for greater accuracy. Although many systems have been ostensibly SI (at least they have been trained on many speakers and tested on others), many of them will perform quite poorly on speakers who are not native speakers of the target language.

Another descriptor is whether the task is to recognize isolated speech, recognize continuous speech, or spot keywords. The first type of task is to recognize words in isolation (demarcated by silence) and is in general less difficult than recognizing continuous speech, in which the word boundaries are not so apparent. A third type of task, which falls between the two earlier types, is keyword spotting. In this case the recognizer has a list of words that it tries to spot in the continuous speech input. The system must have a confidence factor about the match to prevent the false matching of words not in the keyword list. In general there is a trade-off between reducing the number of keyword occurrences that are missed and reducing the number of times that a non-keyword falsely triggers a keyword detection, and the cost of each of these kinds of errors must be considered in system optimization.

The lexicon (vocabulary) size also introduces another parameter. In general, a 20,000-word task is going to be much harder than a 10-word task. This is partly because there is a greater variability in the acoustics associated with each type of speech sound, but also because the larger task has many more words that are confusable with one another. However, even some small tasks have extremely confusable words. For instance, recognizing the E set of the alphabet (i.e., the letters b, c, d, e, g, p, t, v, z) may be harder than some tasks with a much larger vocabulary.

Vocabulary size is also not a reliable measure of task difficulty for another reason: difficulty is strongly affected by the constraints placed on the task. Systems that operate in the blocks world domain2 will have a more constrained grammar than systems that attempt to understand radio news broadcasts. As noted previously, the perplexity of a grammar is a measure of how constrained it is. Perplexity is essentially the geometric mean of the branching factor (i.e., how many words can follow another word) of the grammar. More formally, it is 2^H, where H is the entropy3 associated with each word in the recognition grammar (i.e., the amount of uncertainty about the next word given the constraints and predictions of the grammar).
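The relationship between entropy and perplexity can be made concrete with a short sketch (in Python; the example distributions are illustrative, not from the text):

```python
import math

def perplexity(probs):
    """Perplexity is 2**H, where H is the entropy (in bits) of the
    distribution over the next word."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)  # entropy H
    return 2 ** h

# A grammar in which any of 10 words may follow with equal probability
# has perplexity 10: the effective branching factor.
uniform = perplexity([0.1] * 10)            # → 10.0

# A skewed distribution over 4 words has H = 1.75 bits, so its
# perplexity is below 4: the grammar is more predictable than its raw
# vocabulary size suggests.
skewed = perplexity([0.5, 0.25, 0.125, 0.125])  # → about 3.36
```

Note that perplexity equals vocabulary size only in the worst case of a uniform distribution; any predictive constraint lowers it.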

Speaking style has a strong effect on the difficulty of speech recognition. For instance, conversational speech is extremely difficult to transcribe. Speech that is carefully read from pre-existing text is comparatively easy. Fluent, goal-directed speech for a human–machine dialog is typically intermediate in difficulty; a motivated user will tend to speak more clearly, but the use of fluent speech will still be more difficult to recognize than read speech. Some of the characteristics of a more natural speaking style include a wider variability in speaking rate, an increase in disfluencies such as filled pauses or false starts, and a greater variability in vocal effort.

The recording conditions also play a part in determining the difficulty of an ASR task. The recording may range from wideband high-quality microphones to cellular phones in a moving car. The telephone channel typically has a bandwidth of less than 4 kHz, which means that it is more difficult to distinguish high-frequency consonants, such as /f/ and /s/. For example, Fig. 5.1 shows spectrograms of recordings of the words “foo” and “sue” at a 4-kHz (telephone) bandwidth and an 8-kHz bandwidth (high-quality microphone). Here /f/ and /s/ appear as the noise before the relatively straight lines, which represent the vowel. Note that it is more difficult to distinguish /f/ and /s/ in the 4-kHz spectral pictures than in the 8-kHz pictures. This is because most of the energy in /f/ and /s/ is above 4 kHz.

Telephone speech also introduces other challenges. The range of speakers with access to a telephone is greater than that typically observed in laboratory databases (although the same is true of realistic data in many other cases). There is also larger variability in background noise, and one must account for channel distortion from echo, cross talk, the different spectral characteristics of handsets, and the communications channel in general. These sources of variability are a particular problem for cellular and cordless telephones.

5.3.2 Sample Domain: Letters of the Alphabet

As an example of a speech-recognition task, we present a classification of letters of the alphabet. This task has a vocabulary of 26 words, but many of them are confusable. For example, there are four major sets of letters that sound alike:


FIGURE 5.1 Spectrograms of foo (top) and sue (bottom) sampled with 4-kHz and 8-kHz bandwidth restrictions. Courtesy of Eric Fosler-Lussier.

  • E set: B C D E P T G V Z
  • A set: J K
  • EH set: M N F S
  • AH set: I Y R

Recognition results from the Oregon Graduate Institute report an accuracy of 89% on letters of the alphabet and 87% on spelled names, for a telephone-speech system trained on 800 speakers (12,500 letters) and tested on 400 speakers (4200 letters) [3]. A perceptual experiment was run in which 10 listeners identified 3200 letters from 100 alphabets and 100 spelled names. Human accuracy on the telephone speech database was approximately 90–95%, with an average of approximately 93%. When the experiment was repeated with high-quality microphone speech, human accuracy jumped to approximately 99%. As noted previously, even small tasks can be relatively difficult, particularly if they involve typical users from the general public, and particularly if the system operates over the public switched telephone network. Recognition can be even harder for cellular handsets and speakerphones.


Later in this book we will describe the characteristics and technological choices involved in systems for ASR. However, here we provide a preview, in which we describe very briefly the major components of these systems.

An ASR system may be described as consisting of five distinct subsystems (see Fig. 5.2): input acquisition, front end, local match, global decoder, and language model. This division is, of course, somewhat arbitrary. In particular, the first two subsystems are frequently described as one system that produces features for the classification stages.4 Here the first two functions are shown explicitly to emphasize their great significance (despite their relatively humble function). For instance, the microphone might seem to be a minor detail, one that is necessary but unworthy of discussion. However, as suggested earlier, some of the best ASR systems have been brought to their knees, so to speak, by a change in microphone. It is important to understand, then, the dependence of an ASR system on the choice and position of the microphone (which can affect the overall spectral slope of the transduced speech, as well as the overall noise level and influence of room acoustics, with the latter being more pronounced for larger microphone–talker distances). Simple preprocessing may be used to partially offset such problems (e.g., adaptively filter to flatten the spectral slope, using a time constant much longer than a speech frame).


FIGURE 5.2 Block diagram of a speech-recognition system.

Feature extraction consists of computing representations of the speech signal that are robust to acoustic variation but sensitive to linguistic content. More simply, we wish to determine from speech some values that do not vary much when the same words are spoken many times (or by different talkers, for the speaker-independent case), but that change significantly when different things are said. One could argue that this is the entire problem, since finding separable speech representations would clearly mean that recognition could be accomplished. However, the correct speech representation can differ depending on the classification technique. Nonetheless, very simple data-examination techniques (e.g., scatter plots) can often be useful for screening features that are particularly bad or good for the discrimination of similar speech sounds.

Typically, speech analysis is done over a fixed length frame, or analysis window. For instance, suppose speech is sampled at 16 kHz after being low-pass filtered with a corner frequency lower than 8 kHz (e.g., 6.4 kHz) to prevent spectral aliasing. A window of length 32 ms (512 points) might be used as the input to a spectral analysis module, with one analysis performed every 10 ms (160 points). Since the analysis windows are not chosen to be synchronous with any acoustic landmark, the resultant features will be smeared over transition regions. However, methods that rely on presegmentation to establish analysis windows are difficult to do well (although in some instances they have worked well on specific tasks).
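The frame-based analysis just described can be sketched in a few lines (Python with NumPy; the signal here is synthetic noise, purely for illustration, and the Hamming window is one common, but not the only, choice):

```python
import numpy as np

sample_rate = 16000                   # 16-kHz sampling
frame_len = int(0.032 * sample_rate)  # 32-ms window = 512 samples
hop = int(0.010 * sample_rate)        # one analysis every 10 ms = 160 samples

def frames(signal, frame_len, hop):
    """Slice a signal into overlapping fixed-length analysis windows."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

signal = np.random.randn(16000)          # one second of (synthetic) audio
windowed = frames(signal, frame_len, hop) * np.hamming(frame_len)
spectra = np.abs(np.fft.rfft(windowed))  # one magnitude spectrum per frame
# One second of speech yields 97 frames of 257 spectral magnitudes each.
```

Note that the frames advance by a fixed hop regardless of what the talker is doing, which is exactly why features get smeared across phone transitions.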

The local match module may either produce a label for a speech segment (e.g., word), or some measure of the similarity between a speech fragment and a reference speech fragment. This reference can be an explicit prototype of the same features that are extracted from speech during the recognition process (e.g., spectra). Alternatively, the input can be fit with statistical models, yielding probabilistic measures of the uncertainty of the fit.

Whatever the measure of similarity, one can imagine a matrix of distances between input features (the horizontal axis representing time), and the reference models or prototypes (the vertical axis representing a sequence of speech sounds within a reference utterance). Some form of temporal integration must be incorporated in order to find the utterance that in some sense is the minimum distance choice for what was said. This is the job of the global decoder.

It is generally insufficient to find a simple distance (e.g., Euclidean) between the reference model or prototype and the new speech features. One of the most obvious variations among multiple occurrences of the same utterance is the speed of the speech. A first-order solution for this is a linear normalization by utterance length. However, this does not compensate for the varying amount of time expansion and compression for different sounds. For instance, stop consonants such as “k” or “g” typically do not change their length much, whereas the length of sonorant sounds such as vowels tends to vary significantly with the speed of the speech.

These considerations led to the major algorithmic innovation (as noted in Chapter 4) known as dynamic time warp (DTW) [7]. For an isolated word example, for instance, given a local distance between each input frame and each reference frame, one would use the dynamic programming algorithm of Bellman [2] to determine the minimum cost match (defined as the match with the minimum sum of local distances plus any cost for permitted transitions). Pointers are retained at each step, so that the optimal path through the matrix can be backtraced once the best match is found. This path represents the best warping of the models to match the data.

The preceding section primarily addressed the case of deterministic distances between features such as spectra for reference sounds and those that are being recognized. However, a similar approach can be used for a statistical reference model. If one can estimate the probability of an observed spectrum for each hypothetical speech sound, as well as the probability of each permissible transition, the same procedure can be followed using a statistical distance measure (e.g., negative log probability). These distances are used in practical recognition systems based on hidden Markov models (HMMs). The use of these models was another fundamental advance in speech-recognition systems [1], [4], as noted in Chapter 4.

For isolated word recognition, an HMM for each vocabulary word can be used in place of the deterministic representation of a reference template for each word. For continuous speech recognition, each word may still be represented by its own HMM, but more commonly phonemic HMMs are concatenated to represent words, which in turn can be concatenated to represent complete utterances. Model variations can also be introduced to represent common effects of coarticulation between neighboring phonemes or words. Word transition costs can also be introduced to permit the use of a grammar. When dynamic programming is used to get the best match between the data and a statistical model as described earlier, the resulting best-path calculation is called a Viterbi decoding [8], [6]; this is the dominant approach used in continuous speech recognition.
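The Viterbi computation has the same dynamic-programming shape as DTW, with log probabilities playing the role of (negative) distances. A toy sketch follows (Python with NumPy; the two-state HMM and its probabilities are invented for illustration, not taken from the text):

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Best state path through an HMM.
    log_A[i, j]: log P(state j | state i)
    log_B[t, j]: log P(observation at time t | state j)
    log_pi[j]:   log P(initial state j)"""
    T, N = log_B.shape
    delta = np.zeros((T, N))           # best log score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backtrace pointers, as in DTW
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # all predecessor scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    # Backtrace from the best final state.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy example: state 0 usually emits 'a', state 1 usually emits 'b',
# and both states are "sticky" (0.9 self-transition probability).
A = np.log([[0.9, 0.1], [0.1, 0.9]])
pi = np.log([0.5, 0.5])
emit = {'a': np.log([0.9, 0.1]), 'b': np.log([0.1, 0.9])}
B = np.stack([emit[o] for o in ['a', 'a', 'b', 'b']])
best = viterbi(A, B, pi)  # → [0, 0, 1, 1]
```

In a real recognizer the states belong to concatenated phone models, and the transition structure encodes the pronunciations and the grammar, but the best-path calculation is the same.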

For most speech-recognition tasks of interest, the acoustic information by itself is insufficient to uniquely determine what was said. In fact, in the human example, what was spoken is determined by a complex combination of mechanisms, including the incorporation of knowledge about the syntax and semantics of a language, as well as the pragmatic expectations from a situation. Likewise, our speech-recognition systems in general must make use of some information about the language that is available prior to the reception of the new acoustic information. In most current systems, however, only very simple linguistic information is incorporated, such as the frequency of pairs or triplets of words. In some systems, however, particularly those that operate in some limited application domain, deeper knowledge about probable paths in the task dialog can be used, sometimes incorporating structured models for natural language.


Given the relatively long history of research into ASR, the variety of techniques alluded to in earlier sections, and the widely reported successes with fairly difficult tasks, one might wonder why ASR is considered a research topic at all. In fact, despite press reports to the contrary,5 speech recognition by machine is still a difficult and largely unsolved problem, and there are a number of areas of active research that are being explored in the attempt to conquer the remaining serious problems. In particular, although ASR is good enough to be used for many practical tasks, recognizers as of 2011 are still often brittle, providing unreliable performance under conditions that are handled quite well by human listeners. A short parable (courtesy of John Ohala) may illustrate the current state of affairs.

Stanford6 artificial intelligence researchers perfected a talking and listening handyman robot, which was then sent out to solicit research funds door to door. The robot rolled up to its first house, and rang the bell:


Robot: I am Stanford's handyman robot. Tell me a task, and I will do it for $5 per hour. This money will be applied to further research in artificial intelligence.

Homeowner: $5 an hour? Sounds great! Can you paint?

Robot: My painting is of the highest quality.

Homeowner: OK. See that paint brush and bucket of paint? Take them out back and paint the porch.

Robot: Your request will be fulfilled, courtesy of Stanford.

(The robot trundles off to do its job, and returns in an hour.)

Robot: The task is complete. Please deposit $5 to aid in further research.

Homeowner: (Handing over the cash) This was a great deal! Come back again!

Robot: (While leaving) Oh, by the way, it wasn't a Porsche. It was a BMW.

Given the human example, perhaps we can get some good ideas about how to build artificial systems that will not perform so poorly when handling situations that people find so straightforward. Although naive mimicry of the human systems is likely to be an insufficient tactic, we believe that there is much to be learned from human speech perception.

Before we proceed to further detail on either human or machine systems for audio signal processing, we must first provide some technical background. We will return at a later point to the aspects of speech-recognition technology that have been briefly alluded to in this chapter.


  • 5.1 If the acoustic environment is noisy, how could you imagine each of the blocks of Fig. 5.2 being modified to help with speech recognition?
  • 5.2 You have a recognition system that can recognize strings of up to 16 digits, in which approximately 15% of the digits are incorrect. Describe some scenarios in which the system could be useful.
  • 5.3 Describe some situations in which a five-word recognizer can accomplish a more difficult task than a 1000-word recognizer.
  • 5.4 Suppose that any one of 16 distinct symbols could occur at any point in a sequence, with equal probability:
    • (a) What is the entropy associated with an occurrence of a symbol?
    • (b) What is the perplexity?

      Now suppose that four of the symbols could occur with the same probability as before, but four were twice as probable and eight were half as probable:

    • (c) What is the entropy?
    • (d) What is the perplexity?


  1. Baker, J. K., “The DRAGON system – an overview,” IEEE Trans. Acoust. Speech Signal Process. 23: 24–29, 1975.
  2. Bellman, R., and Dreyfus, S., Applied Dynamic Programming, Princeton Univ. Press, Princeton, N.J., 1962.
  3. Cole, R. A., Fanty, M., Gopalakrishnan, M., and Janssen, R. D. T., “Speaker-independent name retrieval from spellings using a database of 50,000 names,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. Toronto, Canada, pp. 325–328, 1991.
  4. Davis, K., Biddulph, R., and Balashek, S., “Automatic recognition of spoken digits,” J. Acoust. Soc. Am. 24: 637–642, 1952.
  5. Deller, J., Proakis, J., and Hansen, J., Discrete-Time Processing of Speech Signals, Macmillan Co., New York, 1993.
  6. Jelinek, F., “Continuous recognition by statistical methods,” Proc. IEEE 64: 532–555, 1976.
  7. Sakoe, H., and Chiba, S., “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust. Speech Signal Process. 26: 43–49, 1978.
  8. Viterbi, A., “Error bounds for convolutional codes and an asymptotically optimal decoding algorithm,” IEEE Trans. Inf. Theory 13: 260–269, 1967.

1. Courtesy of Ron Cole of the Oregon Graduate Institute.

2. The blocks world domain has been a favorite domain in artificial intelligence, in which objects are geometric solids such as blocks and are to be manipulated by a robot arm.

3. The information of a random variable is just the negative log of its probability (conventionally in base 2 so that the result is in bits), and the entropy is the expectation of this quantity over the probability distribution.

4. The components are also sometimes shown in greater detail; for instance, the local match makes use of an acoustic model of some kind, and the decoder block also makes use of a pronunciation dictionary; the language model, the pronunciations, and the acoustic model could all be seen as prior information that is used by the decoder.

5. It has often been suggested that the solution to machine speech recognition is five years away; in fact, this has been suggested so repeatedly that it must be true!

6. Remember, this is a parable.
