How do people recognize and understand speech? As with other aspects of perception that we have touched on, this is a focus for many books and articles. Our task is further complicated by the fact that, despite the profusion of articles on the subject, very little is understood in this area; at least there is very little that experts agree on.

Here we can only hope to introduce a few key concepts and in particular to lay the groundwork for the reader to think about aspects of human recognition that are different from the common approaches to artificial speech recognizers. For this purpose, we focus on two particular studies: the perception of consonant–vowel–consonant (CVC) syllables in decades-long studies directed by Harvey Fletcher of Bell Labs (and later reexamined by Jont Allen [1]); and the direct comparison of human and machine “listeners” on tasks of current interest for speech-recognition research, as described by Richard Lippmann of Lincoln Labs [10].


In the 1990s, Jont Allen from AT&T revived interest in a body of work done at Bell Labs in the 1920s by a group headed by Harvey Fletcher; [1] is an insightful summary of Allen's perspective on this work. Here we describe only a few key points from that paper.

18.2.1 The Big Idea

A principal proposal of this paper is that humans do not appear to use spectral templates (e.g., one set of local spectral features every 10 ms), but rather they do partial recognition of phonetic units across time, independently in different frequency ranges. In other words, Allen's model suggests a subband analysis for speech recognition, in which partial decisions are developed independently and then combined at the level of phonetic categorization.

This suggestion is notable for a number of major reasons.

1. It is based on decades of measurements with human listeners; in other words, even if it does not turn out to be entirely correct, it is not a casual suggestion, and it is likely to at least be related to the character of human hearing.

2. It is also based on a theory of human hearing that was developed by Fletcher to model the experimental results. In other words, there is a theoretical structure that can be used. In particular, Fletcher defined a measure called the articulation index (called AI long before the term was ever used for artificial intelligence). In this context he intended the word articulation to refer to the probability of identifying nonsense speech sounds. This was in contrast to the notion of intelligibility, which refers to the identification of meaningful speech sounds such as words or sentences.

3. Finally, it is at odds with virtually every speech-recognition system that engineers have built; nearly all ASR systems have primarily depended on framewise features that are based on short-term spectral estimates.
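Allen's subband proposal can be sketched in a purely illustrative form as follows; the phone set, band edges, and posterior values below are all invented for the example, not taken from the Fletcher data.

```python
import numpy as np

# Illustrative sketch of the subband idea: each frequency band produces its
# own partial posterior over phone classes, and the band-level evidence is
# merged only at the phonetic-categorization stage. All numbers are invented.

PHONES = ["p", "b", "t"]

# One posterior distribution per subband (rows: bands, columns: phones).
band_posteriors = np.array([
    [0.70, 0.20, 0.10],   # low band   (e.g., 100-1000 Hz)
    [0.55, 0.30, 0.15],   # mid band   (e.g., 1000-2500 Hz)
    [0.40, 0.35, 0.25],   # high band  (e.g., 2500-4000 Hz)
])

# If the bands are treated as independent, a natural combination rule is to
# multiply the posteriors (sum the log-posteriors) and renormalize.
log_combined = np.sum(np.log(band_posteriors), axis=0)
combined = np.exp(log_combined - np.max(log_combined))
combined /= combined.sum()

best = PHONES[int(np.argmax(combined))]
print(best, np.round(combined, 3))
```

Note that the weak evidence in the high band does not overturn the decision; the combination sharpens the agreement between bands rather than averaging it away.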

18.2.2 The Experiments

Starting around 1918, Bell Labs researchers designed databases from CVC, consonant–vowel (CV), and vowel–consonant (VC) nonsense syllables. Their estimate at the time was that these types comprised approximately 74% of the syllables used over the telephone, and as such provided a good idealized testbed for the recognition of speech without the more complex factors of multisyllabic acoustic context or syntactic or semantic factors. Over the following years, listening tests were conducted with differing signal-to-noise ratios and frequency ranges; the latter were established by using high-pass or low-pass filters. A fundamental motivation was to determine the bandwidth that was required for the telephone system; in fact it was these experiments that led to the frequency range that is in use today.

There were many results from these experiments, but two that Allen focused on were as follows.

1. The probability of getting a CVC syllable correct (determined by counting the number of times that listeners correctly identified the syllable for a given condition) was roughly the product of the probabilities of having the initial C, the V, and the final C phone correct in the syllable identification. This meant that, as far as this measure and experiment were concerned, the phone identifications could be treated as being independent.

2. For speech low-pass filtered and high-pass filtered at the same point, the phone error probability for the total spectrum was equal to the product of the error probabilities for each of the two bands. In other words, the probability of error for the band from 100 Hz to 3000 Hz would be the product of the error probabilities for bands from 100 to 1000 Hz and from 1000 to 3000 Hz. More formally, let s(a, b) be the articulation (probability of correct phone classification) for speech with a lower spectral limit of fa and a higher spectral limit of fb. Then Fletcher's experiments seemed to show that

1 - s(a, c) = [1 - s(a, b)] [1 - s(b, c)]    (18.1)
Fletcher then defined an AI as

AI(a, b) = log10[1 - s(a, b)] / log10(1 - smax)    (18.2)
where smax is the maximum articulation (that is, the best that people could achieve with a high signal-to-noise ratio and wide bandwidth), measured by Fletcher to be 0.985.

Given the experimental result of Eq. 18.1, taking the logarithm of both sides and dividing by log10(1 - smax) as in Eq. 18.2, we see that the AI has the property that

AI(a, c) = AI(a, b) + AI(b, c)    (18.3)
Thus, a relatively simple nonlinear transformation of the probability of phone correctness converted it to a measure that would be roughly additive over frequency.
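As a quick numerical check of these relations, the following sketch verifies that multiplying band errors as in Eq. 18.1 makes the AI of Eq. 18.2 additive over bands; the per-band articulations are invented for the example, and only smax = 0.985 comes from Fletcher's measurements.

```python
import math

# Numerical illustration of the AI relations. The band articulations
# s_low and s_high are invented; s_max is Fletcher's measured maximum.
s_max = 0.985

def AI(s):
    """Articulation index for articulation s (Eq. 18.2)."""
    return math.log10(1.0 - s) / math.log10(1.0 - s_max)

# Suppose two adjacent bands (e.g., 100-1000 Hz and 1000-3000 Hz) give
# these articulations when each is presented alone:
s_low, s_high = 0.60, 0.80

# Eq. 18.1: error probabilities multiply, so the full-band articulation is
s_full = 1.0 - (1.0 - s_low) * (1.0 - s_high)   # 0.92

# Eq. 18.3: the AI is then additive over the two bands.
print(round(AI(s_full), 6), round(AI(s_low) + AI(s_high), 6))
```

The two printed values agree, since the logarithm turns the product of errors into a sum.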

18.2.3 Discussion

The articulation index measure also implied an underlying density, for which the integral over some contiguous range corresponded to the AI for that range. Allen notes that Fletcher and Stewart generalized from these results to a multi-independent channel model of phone perception. As noted from Eq. 18.1, the error for a wideband signal is equal to the product of errors for two individual bands. If correct, this would be an astonishing kind of independence – essentially it says that if any single band leads to perfect phone identification, errors in any of the other bands don't matter! Such a system is in some sense optimal, though the form of optimality is one that we currently don't know how to implement in engineering systems. However, Allen's interpretation of the Fletcher data is that this was indeed what was observed.
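The multichannel claim can be illustrated with a toy calculation; the per-band error probabilities below are invented for the example.

```python
from functools import reduce

# Toy illustration of the multi-independent-channel claim: the total
# phone-error probability is the product of the per-band error
# probabilities. All numbers are invented.
band_errors = [0.30, 0.25, 0.40, 0.20]
total_error = reduce(lambda x, y: x * y, band_errors)
print(round(total_error, 6))   # 0.006, far lower than any single band's error

# If any single band is error-free, the product, and hence the total
# error, is zero regardless of how poor the other bands are.
perfect_case = [0.90, 0.00, 0.95]
total_perfect = reduce(lambda x, y: x * y, perfect_case)
print(total_perfect)   # 0.0
```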

There have been some objections to the AI theory as a literal truth; for instance, Lippmann showed human performance for speech that had been filtered to remove essentially all components between 800 and 3000 Hz to be, in fact, significantly better than would be predicted from AI analyses [12]. The choice of phones as the fundamental unit of speech recognition is also controversial, as others have suggested transition-based units, half-syllables, or even complete syllables as being more primary (particularly for natural speech, such as human-to-human conversations) [7]. Also, while the Fletcher experiments suggest an ideal (for a high-SNR, broad-bandwidth signal) phoneme error rate of 1.5%, they were designed to test substitution errors, whereas human phoneme recognition of natural speech also includes insertions and deletions; furthermore, the phonetic classification was determined for a syllable in a known position in the carrier sentence. A more modern measure of human phone recognition errors was reported in [16], where Italian-speaking subjects were asked to phonetically transcribe Japanese and Spanish sentences from conversational telephone speech, which is known to be more difficult to transcribe than carefully read sentences (and for which presumably the cross-language property would eliminate the effects of higher-level information). In this case, the best subjects still had phone error rates in the mid-teens.

However, aside from interpretations that remain controversial, the Fletcher studies are still extremely instructive. For instance, it seems very likely that there is a significant amount of analysis that is performed in the human auditory system on limited bands of the speech over time, and that this information is later integrated into some kind of incomplete decision about sound unit identity. A number of speech researchers incorporated ideas such as this into experimental ASR systems [3, 2, 8]. More generally, it is likely that human hearing incorporates many maps for decisions about what was said; see Fig. 17.1 in the previous chapter, for instance, for the Sachs et al. perspective, in which multiple maps that are each tonotopically organized are used in order to make phonetic distinctions.


FIGURE 18.1 Six speech-recognition corpora. From [10].


Although the automatic speech recognition (ASR) research of the past few decades has resulted in great advances, much more remains to be done to achieve the oft-stated goal of devices that equal or exceed human performance. Lippmann [10] has compared recognition accuracy for machines and people on a range of tasks. Figure 18.1 shows six recognition corpora with vocabularies ranging from 10 to 20,000 words, including isolated words, read sentences, and spontaneous speech. All cases are speaker independent (trained on one set of speakers but tested on a different set). Table 18.1 lists the characteristics of these six speech corpora.

The column marked Recognition Perplexity is a measure of the average number of words that can occur at any point in an utterance, assuming a particular grammar, as defined in Chapter 5. Given the ability of many recognizers to use word-sequence constraints, perplexity is a good measure of the linguistic uncertainty in the grammar, and as such tends to correlate with the recognition difficulty for a task. It does not account for the acoustic difficulty of a task; for instance, an unconstrained digit sequence has roughly the same perplexity as an unconstrained sequence of letter names that rhyme with “e,” but the latter has words that are much more acoustically similar to one another, so error rates tend to be higher. The perplexity for the last row is left blank, as the row refers to a wordspotting task, which does not use a constraining grammar in the same sense as the other tasks.
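A minimal sketch of the perplexity computation follows; the word distributions are invented for illustration, and Chapter 5 gives the formal definition.

```python
import math

# Sketch of perplexity: for a language model assigning probability p_i to
# each possible next word, perplexity = 2 ** H, where H is the entropy in
# bits per word. The distributions below are invented.

def perplexity(probs):
    h = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** h

# Unconstrained digit task: any of 10 digits equally likely at each point.
digits = [0.1] * 10
print(round(perplexity(digits), 6))   # 10.0

# A skewed grammar over the same 10 words has fewer effective choices,
# so its perplexity is lower even though the vocabulary is unchanged.
skewed = [0.5] + [0.5 / 9] * 9
print(round(perplexity(skewed), 6))   # 6.0
```

This also makes the digits-versus-rhyming-letters point concrete: both tasks can have perplexity 10, yet the acoustic confusability, which perplexity ignores, differs greatly.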

The five boxes in Fig. 18.2 show comparisons between humans and the best ASR devices for all table entries but the Wall Street Journal case.

These results are for the cases of clean speech; in other words, there are essentially no environmental effects, such as a poor signal-to-noise ratio (SNR). As the top three boxes show, both ASR and human error rates increase with perplexity. The bottom two boxes are for very different conditions, so perplexity is no longer a reasonable measure of difficulty. These boxes indicate that for more normal (less formal) human discourse, huge increases in ASR error rates occur.

Another comparison indicator is given in Table 18.2; namely, the effect of adding noise to a test set from the Wall Street Journal corpus.

Discussing these data, Lippmann notes [11]:

... the error rate of a conventional high performance HMM (Hidden Markov Model) recognizer increases dramatically from 7.2% in quiet to 77.4% at a SNR of 10 dB at noise levels that do not affect human performance. This enormous increase in error rate occurs for all high-performance recognizers tested in this noise and trained using quiet speech. A noise adaptation algorithm reduces this dramatic drop in performance and provides an error rate of 8.4% at a SNR of 22 dB and 12.8% at a SNR of 10 dB.

We note that although the noise adaptation algorithm helps a great deal, the error rate still almost doubles between the quiet and 10-dB SNR cases, while increasing only slightly for human listeners over this range. Further, even with noise adaptation there is still an order of magnitude greater error for ASR than for the human experiment.1

TABLE 18.1 Characteristics of Six Talker-Independent Recognition Corporaa


a From [10].


FIGURE 18.2 Five comparisons between human and ASR devices. From [10].

Lippmann summarizes his results as follows.

Results comparing human and machine speech recognition demonstrate that human word error rates are roughly an order of magnitude lower than those of recognizers in quiet environments. The superiority of human performance increases in noise and for more difficult speech material such as spontaneous speech. Human listeners do not rely as heavily on grammars and speech context.... Humans do not require retraining for every new situation.... These results and other results on human perception of distorted speech suggest that humans are using a process for speech recognition that is fundamentally different from the simple types of template matching that are performed in modern hidden Markov speech recognizers.

In [10] Lippmann further suggests that examining more narrow spectral (and temporal) regions, as well as learning how to ignore or focus on different kinds of phonetic evidence, will be key problems for future ASR research.2

TABLE 18.2 Word Error Rate for a 5000-Word Wall Street Journal Task, Using Additive Automotive Noise


Recognition of large vocabulary conversational speech has improved enormously since the Lippmann studies, making use of very detailed statistical models trained on thousands of hours of telephone conversations, as noted in [14] and [4] where word error rates in the mid-teens were reported on a task for which error rates of 70 to 80% were often observed in the early 1990s. However, inter-annotator error rates for humans listening to speech from this corpus were still significantly lower (4.1 to 4.5% for careful transcriptions, as noted in [6]).

Finally, we note that in a 2008 study [15], a number of feature extraction methods (MFCC, PLP, mel filterbanks, and rate maps) were applied to the task of classifying articulatory features in VCV syllables with an SVM; in general, human performance was still found to be significantly better in every case (though human and machine performance were close for the voiced vs. unvoiced characteristic).


As we noted in Chapters 14-17, the peripheral auditory system has been explored for many years. Although it still would be presumptuous of us to assume that the physiology up to the auditory nerve is well understood, there is a moderate amount of agreement among scientists about the basics in this area. Farther up the auditory chain, our knowledge is certainly much more limited, though there have been efforts to model the functional properties of mammalian primary auditory cortex; some of these models have even inspired a number of experimental methods for ASR. Overall, though, as of this writing it would be fair to say that the internals of human speech recognition are little understood.

As noted in this chapter, people generally recognize speech very well, even under conditions that appear to pose great difficulty for our ASR systems. And though [16] appeared to show somewhat closer error rates for human and machine recognition (particularly at the phone level) than had been observed previously, it is clear from many studies that human speech recognition under even moderately noisy or reverberant conditions can degrade far more gracefully than current ASR does.

What do we know about human speech recognition that differs from our best artificial systems?

Signal processing: Although we don't know exactly what signal processing occurs in the auditory system, we do know that processing occurs with a range of time constants and bandwidths. Given the robustness of human listening to many signal degradations, each of which would severely degrade an individual representation, it is likely that many maps of the input signal are available to the brain; see for instance studies in neuroscience, e.g., [5] and [9] that have revealed that neurons in the mammalian auditory cortex are highly tuned to specific spectro-temporal modulations. ASR's current use of simple functions of short-term spectra measured every 10 ms may be a significant limitation. However, even within this constraint, a number of the characteristics of auditory perception have been incorporated into speech-processing systems, and these will be discussed in the next few chapters.
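As a purely illustrative sketch (not a model of auditory processing), the following analyzes the same signal with two different time constants, a crude stand-in for the multiple maps discussed above; the sample rate, signal, and window sizes are arbitrary choices for the example.

```python
import numpy as np

# Compute magnitude short-term spectra of one signal at two resolutions.
# Short windows favor time resolution; long windows favor frequency
# resolution. All parameters here are arbitrary example values.

fs = 8000
t = np.arange(fs) / fs                      # 1 s of signal
x = np.sin(2 * np.pi * 440 * t)             # a 440-Hz tone

def stft_mag(x, win_len, hop):
    """Magnitude short-term spectra with a Hann window."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Short window: good time resolution, coarse frequency resolution.
fast = stft_mag(x, win_len=64, hop=32)
# Long window: the reverse trade-off.
slow = stft_mag(x, win_len=512, hop=256)

print(fast.shape, slow.shape)
```

A system with access to both representations at once can, in principle, pick whichever resolution best serves the current discrimination, which is one reading of the "many maps" idea.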

Subword recognition: Humans seem to be able to adapt their use of the multiple signal representations according to the requirements of the moment. In [10] Lippmann calls for the addition of “active analysis in the front ends of speech recognizers to determine when a feature is present and when it is a component of a desired speech signal. This supplementary information can be used by classifiers that can compensate for missing features.” As noted earlier in this chapter, Allen has focused on the use of subband information over time, and he suggests some representation incorporating a correlation between subbands as a measure; combining these threads of information is an open problem, though the current leading contenders for a solution are all based on ideas from statistical pattern recognition.

Temporal integration: Humans are able to understand utterances with a wide range of speaking rates, implying some kind of time normalization. In contrast, durations are often key components in phonetic discriminations. For example, as noted in Chapter 17, the VOT (the time between the burst and the following vowel) is the key discriminating factor between “pa” and “ba.” In ASR, the most common form of temporal normalization is a crude compromise that does not sufficiently reduce variability (ASR systems often do much worse on fast speech) but also eliminates critical information about internal timing.
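The classic engineering approach to this normalization, dynamic time warping, can be sketched as follows; the two "utterances" are invented one-dimensional feature sequences differing only in rate.

```python
import numpy as np

# Minimal dynamic-time-warping sketch: find the total cost of the best
# monotonic alignment between two feature sequences, absorbing
# differences in speaking rate. The sequences below are invented.

def dtw_distance(a, b):
    """Cost of the best monotonic alignment of 1-D sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

slow_utt = np.repeat([1.0, 3.0, 2.0], 4)    # each "phone" held 4 frames
fast_utt = np.repeat([1.0, 3.0, 2.0], 2)    # same content, twice as fast

print(dtw_distance(slow_utt, fast_utt))     # 0.0: rate difference absorbed
```

Note the crude compromise mentioned above: the zero distance means the warp has absorbed the rate difference entirely, so any phonetically meaningful duration cue (such as VOT) would be discarded along with it.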

Integration of higher level information: In many cases in which the acoustic evidence is equivocal, the utterance identification can still be made based on the expectations from syntax, semantics, and pragmatics (where the latter refers to facts from the particular situation). Additionally, for many tasks it is not really necessary to recognize all the words, but only to get the relevant point. This is made obvious by examining written transcriptions of the spoken word. Particularly if there is no second corrective pass, there are often many differences between what is said and what was written. Essentially, people are trained to recognize the gist of what was said, and usually not the precise word sequence. In our ASR systems, we tend to focus equally on all words, both informative and noninformative; furthermore, our current capabilities for the integration of higher-level information are quite primitive, except in specialized systems that depend on an extremely restricted application domain.

This concludes the first half of our text. In the remaining chapters, we will focus on the engineering approaches that are currently the basis of audio processing systems, with a particular focus on speech recognition, synthesis, and vocoding.


18.1 Imagine a tonotopically organized (subband-based) phoneme recognition system such as the one discussed by Allen. What might be the potential advantages or disadvantages of such a system?

18.2 Lippmann points out many ways in which 1996 speech-recognition technology is inferior to the capabilities of human speech recognition. Suggest some situations in which human speech recognition could potentially be worse than an artificial implementation.


  1. Allen, J. B., “How do humans process and recognize speech?,” IEEE Trans. Speech Audio Proc. 2: 567-577, 1994.
  2. Bourlard, H., and Dupont, S., “ASR based on independent processing and recombination of partial frequency bands,” in Proc. Int. Conf. Spoken Lang. Process. Philadelphia, Pa., 1996.
  3. Bourlard, H., Hermansky, H., and Morgan, N., “Towards increasing speech recognition error rates,” Speech Commun. 18: 205-231, 1996.
  4. Chen, S., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Soltau, H., and Zweig, G., “Advances in Speech Transcription at IBM Under the DARPA EARS Program,” IEEE Trans. Audio, Speech, and Language Processing, Vol. 14, No. 5, Sept. 2006, pp. 1596-1608.
  5. Depireux, D.A., Simon, J.Z., Klein, D.J., and Shamma, S.A., “Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex,” J. Neurophysiology, 85:1220-1234, 2001.
  6. Glenn, M., Strassel, S., Lee, H., Maeda, K., Zakhary, R., and Li, X., “Transcription Methods for Consistency, Volume and Efficiency,” LREC 2010, Valletta, Malta, May 2010.
  7. Greenberg, S., “Understanding speech understanding: towards a unified theory of speech perception,” in W. Ainsworth and S. Greenberg, eds., Workshop Auditory Basis Speech Percept., Keele, U.K., pp. 1-8, 1996.
  8. Hermansky, H., Pavel, M., and Tibrewala, S., “Towards ASR on partially corrupted speech,” in Proc. Int. Conf. Spoken Lang. Process. Philadelphia, Pa., 1996.
  9. Klein, D.J., Depireux, D.A., Simon, J.Z., and Shamma, S.A., “Robust spectro-temporal reverse correlation for the auditory system: Optimizing stimulus design,” J. Comp. Neuroscience, 9:85-111, 2000.
  10. Lippmann, R., “Speech recognition by humans and machines,” in W. Ainsworth and S. Greenberg, eds., Workshop Auditory Basis Speech Percept., Keele, U.K., pp. 309-316, 1996.
  11. Lippmann, R., “Speech recognition by machines and humans,” Lincoln Laboratories, Lexington, MA, 1996.
  12. Lippmann, R., “Accurate consonant perception without mid-frequency speech energy,” IEEE Trans. Speech Audio Process. 4: 66-69, 1996.
  13. Lippmann, R., “Speech recognition by machines and humans,” Speech Communication, 22: 1-15, 1997.
  14. Prasad, R., Matsoukas, S., Kao, C.L., Ma, J., Xu, D.-X., Colthurst, T., Kimball, O., Schwartz, R., Gauvain, J.-L., Lamel, L., Schwenk, H., Adda, G., and Lefevre, F., “The 2004 BBN/LIMSI 20xRT English Conversational Telephone Speech Recognition System,” Proc. Interspeech 2005, pp. 1645-1648, Lisboa, Portugal.
  15. Scharenborg, O., and Cooke, M., “Comparing human and machine recognition performance on a VCV corpus,” Proc. Workshop on Speech Analysis and Processing for Knowledge Discovery, Aalborg, 2008.
  16. Shen, W., Olive, J., and Jones, D., “Two protocols comparing human and machine phonetic recognition performance in conversational speech,” Proc. Interspeech 2008, pp. 1630-1633, Brisbane, Australia.

1 While these results were generated in the 1990s, as of 2011 there is still a broad gap in word error rates between the roughly 1% for the human case and what is typically seen for ASR of noisy speech.

2 Lippmann later refined this material and described it in a journal paper [13].
