CONCEPTUALLY, the development of speech recognition is closely tied with other developments in speech science and engineering, and as such can be viewed as having roots in studies going back to the Greeks (as with synthesis). However, the history of speech recognition1 per se in the 20th Century began with the invention of a small toy, Radio Rex.


The first machine to recognize speech to any significant degree may have been a commercial toy named Radio Rex, which was manufactured in the 1920s. Here is a description from a 1962 review paper [18]:

It consisted of a celluloid dog with an iron base held within its house by an electromagnet against the force of a spring. Current energizing the magnet flowed through a metal bar which was arranged to form a bridge with 2 supporting members. This bridge was sensitive to 500 cps acoustic energy which vibrated it, interrupting the current and releasing the dog. The energy around 500 cps contained in the vowel of the word Rex was sufficient to trigger the device when the dog's name was called.

It is likely that the toy responded to many words other than “Rex,” or even to many nonspeech sounds that had sufficient 500-Hz energy. However, this inability to reject out-of-vocabulary sounds is a weakness shared by most recognizers that followed it. Furthermore, the toy was in some sense useful, since it fulfilled a practical purpose (amusing a child or playful adult), which was not often accomplished by many of the laboratory systems that followed. Although quite simple, it embodied a fundamental principle of speech recognizers for many years: store some representation of a distinguishing characteristic of the desired sound and implement a mechanism to match this characteristic to incoming speech.


FIGURE 4.1 Schematic for 1952 Bell Labs digit recognizer [19].

Radio Rex was later referred to in a famous letter to the Acoustical Society by John Pierce of Bell Labs [57], in which he strongly criticized the speech recognition research of that time (1969):

What about the possibility of directing a machine by spoken instructions? In any practical way, this art seems to have gone downhill ever since the limited commercial success of Radio Rex.

Although much of the work in vocoding and related speech analysis in the 1930s and 1940s was relevant to speech recognition, the next complete system of any significance was developed at Bell Labs in the early 1950s.


A system built at Bell Labs and described in [19] may have been the first true word recognizer, as it could be trained to recognize digits from a single speaker. It measured a simple function of the spectral energy over time in two wide bands, roughly approximating the first two resonances of the vocal tract (i.e., formants). Although the system's analysis was crude, its estimate of a word-long spectral quantity may well have been more robust to speech variability than some of the later common approaches to estimating the time-varying speech spectrum. It tracked a rough estimate of formant positions instead of the spectrum itself. This is potentially resistant to irrelevant modifications of the overall speech spectrum. For instance, a simple turn of the talker's head away from a direct path to the listener often produces marked changes in the spectrum of the received speech (in particular, a relative reduction in the amplitude of the higher spectral components). The Bell Labs system's spectral estimation technique was, however, quite crude, histogramming low- and high-frequency spectral moments over an entire utterance, and thus timing information was lost. Although the idea was good, there was insufficient technology to develop it very far by modern standards; it used analog electrical components and must have been difficult to modify. Still, the inventors claimed that it worked very well, achieving a 2% error for a single speaker uttering digits that were isolated by pauses [19].

The system (see Fig. 4.1) worked generally as follows: incoming speech was filtered into low- and high-frequency components, and each component was strongly saturated so that its amplitude was roughly independent of signal strength. The cutoff frequency in each case was roughly 900 Hz, which is a reasonable boundary between first and second formants for adult males.2 Zero crossings were counted for each of the two bands, and the system used this value to estimate a central frequency for each band. The low-frequency number was quantized to one of six 100-Hz subbands (between 200 and 800 Hz), and the high-frequency number was quantized to one of five 500-Hz subbands, beginning at 500 Hz. Together, these two quantized values corresponded to one of 30 possible frequency pairs (in practice, only 28 were used, as the other two were rarely applicable). During a training period, capacitors were used to store charges associated with the time that the signal was mapped to a particular pair of frequencies. This distribution was learned for each digit. The resulting distributions were then used to choose conductances for RC circuits that would be used during recognition. When a new digit was uttered, a new distribution was determined in a similar way and compared to the stored distributions by switching between RC circuits corresponding to all possible digits (where the conductance corresponded to the template, the capacitances and charge time were all equal, and where the charging voltage for each frequency pair was determined by the new utterance). This procedure essentially implemented correlations between each stored distribution and the new distribution. The digits had distinguishable frequency-pair distributions and so could usually be discriminated from one another (see [19], Fig. 2, p. 639).
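The analog procedure above translates naturally into modern terms. The following is a hypothetical digital sketch, not a model of the original circuit: the subband edges follow the description above, but the frame length, the normalization of dwell times, and the use of a dot product for the template correlation are assumptions made for illustration.

```python
import numpy as np

LOW_EDGES = np.arange(200, 900, 100)    # six 100-Hz subbands, 200-800 Hz
HIGH_EDGES = np.arange(500, 3500, 500)  # five 500-Hz subbands, from 500 Hz

def zero_crossing_freq(band, fs):
    """Estimate a central frequency for a band from its zero-crossing rate."""
    signs = np.signbit(band).astype(np.int8)
    crossings = np.sum(np.abs(np.diff(signs)))
    # A sinusoid at f Hz crosses zero 2f times per second.
    return crossings * fs / (2.0 * len(band))

def pair_histogram(low_band, high_band, fs, frame=400):
    """Accumulate the time spent in each quantized (low, high) frequency pair,
    analogous to the charge stored on the capacitors during training."""
    hist = np.zeros((len(LOW_EDGES) - 1, len(HIGH_EDGES) - 1))
    for start in range(0, len(low_band) - frame, frame):
        f1 = zero_crossing_freq(low_band[start:start + frame], fs)
        f2 = zero_crossing_freq(high_band[start:start + frame], fs)
        i = np.clip(np.searchsorted(LOW_EDGES, f1) - 1, 0, hist.shape[0] - 1)
        j = np.clip(np.searchsorted(HIGH_EDGES, f2) - 1, 0, hist.shape[1] - 1)
        hist[i, j] += 1
    return hist.ravel() / max(hist.sum(), 1)  # normalized dwell-time distribution

def recognize(utterance_hist, templates):
    """Choose the stored digit whose distribution best correlates with the new one."""
    return max(templates, key=lambda d: np.dot(templates[d], utterance_hist))
```

The essential point survives the translation: a word is reduced to a single utterance-long distribution over frequency pairs, so all timing information within the word is discarded, just as in the 1952 system.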

Note that even in 1952, researchers were reporting a speech recognizer that was 98% accurate! An examination of modern press releases suggests that this figure may be a constant for speech-recognition systems (those that are reported, anyway).


In 1958, Dudley made a classifier that continuously evaluated spectra, rather than approximations to formants. This new paradigm was commonly used afterward; in fact, broadly speaking, the current dominant paradigm for speech recognition uses some function of a local spectral estimate varying over time as the representation of the incoming speech.

In 1959, Denes, of University College London, incorporated grammar probabilities alongside acoustic information. In other words, he pointed out that the probability of a particular linguistic unit being uttered can depend on the previous linguistic unit, so that the probability of a word need not depend solely on the acoustic input.

In 1962, David and Selfridge put together Table 4.1, which compared a number of speech-recognition experiments in the preceding decade [18], including the two recognizers mentioned above. In general, researchers performed spectral tracking, detected a few words and sounds, and performed tests on a small number of people.

4.4 THE 1960s

Throughout much of the 1960s, automatic speech-recognition research continued along similar lines. Martin deployed neural networks for phoneme recognition in 1964. Digit recognizers became better in the 1960s, achieving good accuracy for multiple speakers. Widrow trained neural networks to recognize digits in 1963 [81]. Phonetic features were also explored by a number of researchers. However, as noted earlier, in 1969 John Pierce wrote a caustic letter entitled “Whither Speech Recognition?” In it he argued that scientists were wasting time with simple signal-processing experiments because people did not do speech recognition, but rather speech understanding. He also pointed out the lack of scientific rigor in the experimentation at that time and he suggested that arbitrary manipulation of recognizer parameters to find the best performance was like the work of a “mad scientist,” rather than that of a serious researcher. At the time, Pierce headed the Communications Sciences Division at Bell Labs, and his remarks were quite influential.

Although there may have been much that was correct about Pierce's criticism, there were a number of major breakthroughs in the 1960s that became important for speech-recognition research in the 1970s. First, as noted previously, prior to this period the primary approach to estimating the short-term spectrum was a filter bank. In the 1960s, three spectral-estimation techniques were developed that were later of great significance for recognition, although their early applications to speech were for vocoding: the fast Fourier transform (FFT), cepstral (or homomorphic) analysis, and linear predictive coding (LPC). Additionally, new methods for the pattern matching of sequences were developed: a deterministic approach called dynamic time warp (DTW), and a statistical one called the hidden Markov model (HMM).

Table 4.1 Pre-1962 Speech-Recognition Systems


4.4.1. Short-Term Spectral Analysis

As discussed in Chapter 7, Cooley and Tukey introduced the FFT [17]. This is a computationally efficient form of the discrete Fourier transform (DFT), which in turn can be interpreted as a filter bank. However, its efficiency was important for speech-recognition research, as it was for many other disciplines.

An alternative to filter banks and their equivalent FFT implementation was cepstral processing, which was originally developed by Bogert for seismic analysis [10] and applied later to speech and audio signals by Oppenheim, Schafer, and Stockham [53]. Cepstral processing will be discussed later (primarily in Chapter 20), but its significance for speech recognition is primarily as an approach to estimating a smooth spectral envelope. It ultimately became widely used for recognition, particularly in combination with other analysis techniques (see Chapter 22).

LPC is a mathematical approach to speech modeling that has a strong relation to the acoustic tube model for the vocal tract. Fundamentally, it refers to the use of an autoregressive (pole only) model to represent the generation of speech; each time point in sampled speech is predicted by a weighted linear sum of a fixed number of previous samples. In Chapter 21 we will provide a more rigorous definition, but for now the significance of LPC is that it provides an efficient way of finding a short-term spectral envelope estimate that has many desirable properties for the representation of speech, in particular the emphasis on the peak spectral values that characterize voiced sounds. Some of the early writings on this topic include [32], [2], and [44]. An excellent tutorial on the topic was written by Makhoul [42].
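The core idea of LPC, predicting each sample as a weighted sum of the previous samples, can be sketched directly. This is a bare-bones version of the autocorrelation method, solved here with a generic linear solver rather than the more efficient Levinson–Durbin recursion covered in Chapter 21; it is an illustrative assumption-laden sketch, not the formulation used in any particular system.

```python
import numpy as np

def lpc(signal, order):
    """Fit an order-p autoregressive model by the autocorrelation method."""
    # Autocorrelation sequence r[0..order]
    r = np.correlate(signal, signal, mode='full')[len(signal) - 1:][:order + 1]
    # Normal (Yule-Walker) equations: Toeplitz matrix R times a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def predict(signal, a):
    """Predict each sample as a weighted sum of the previous len(a) samples."""
    p = len(a)
    pred = np.zeros_like(signal)
    for n in range(p, len(signal)):
        pred[n] = np.dot(a, signal[n - p:n][::-1])  # a[k] weights sample n-1-k
    return pred
```

A pure sinusoid, for instance, is predicted almost perfectly by an order-2 model, which is one way to see why low-order LPC captures the resonant (formant) peaks of voiced speech so economically.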

4.4.2. Pattern Matching

Dynamic programming is a sequential optimization scheme that has been applied to many problems [9]. In the case of speech analysis for recognition, it was proposed as a method of time normalization – different utterances of the same word or sentence will have differing durations for the sounds, and this will lead to a potential mismatch with the stored representations that are developed from training materials. DTW applies dynamic programming to this problem. It was proposed by Sakoe around 1970 (but published in an English-language journal in 1978 [66]). Vintsyuk was among the first to develop the theory, and he also applied it to continuous speech [75]. DTW for connected word recognition was described by Bridle [13] and Ney [51]. Excellent review articles on the subject were written by White [80] and by Rabiner and Levinson [62].
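The dynamic-programming recurrence at the heart of DTW is compact enough to sketch. In this hypothetical illustration, the frame-level feature vectors, the Euclidean local cost, and the symmetric step pattern are all assumptions; practical systems such as [66] add slope constraints and path weights.

```python
import numpy as np

def dtw_distance(x, y):
    """Minimum cumulative cost of aligning two feature sequences (frames x dims),
    allowing each frame of one sequence to absorb extra frames of the other."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            # Best predecessor: diagonal match, or stretch either sequence
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

For isolated-word recognition, one would compute this distance between the incoming utterance and a stored template for each vocabulary word, and pick the word with the smallest distance; the warping absorbs the duration differences described above.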

DTW is a deterministic approach to the matching of the time sequence of short-term spectral estimates to stored patterns that are representative of the words that are being modeled [50]. Alternatively, one could imagine a statistical approach, in which the incoming time sequence is used to assess the likelihood of probabilistic models rather than speech examples or prototypes. The mathematical foundations for such an approach were developed in the 1960s, and they were built on the statistical characterization of the noisy communications channel as described in 1948 by Shannon [69]. Most notably, the work of Baum and colleagues at the Institute for Defense Analysis established many of the basic concepts, such as the forward–backward algorithm to compute the model parameters iteratively [8] (see Chapter 26). Briefly, hidden Markov modeling is a statistical approach that models an observed sequence as being generated by an unknown sequence of variables.

Towards the end of the 1960s, a number of researchers became interested in developing these ideas further for the case of a naturally occurring sequence, and in particular for speech recognition. Many of these ultimately joined a research group at IBM, which pioneered many aspects of HMM-based speech recognition in the 1970s. An early IBM report that influenced this work was [74], and a range of other publications followed through the early to mid-1970s, for example, [7], [3], [35], and [34]. The group developed an early HMM-based automatic speech-recognition system that was used for a continuous speech-recognition task referred to as New Raleigh Language. Baker independently developed an HMM-based system called Dragon while still a graduate student at Carnegie Mellon University (CMU) [4]. Many other researchers were working with this class of approaches by the mid-1980s (e.g., [67]).

4.5 1971–1976 ARPA PROJECT

As noted earlier, one of Pierce's criticisms of earlier efforts was that there was insufficient attention given to the study of speech understanding, as opposed to recognition. In the 1970s the Advanced Research Projects Agency (ARPA)3 funded a large speech-understanding project. The main work was done at three sites: System Development Corporation, CMU, and Bolt, Beranek & Newman (BBN). Other work was done at Lincoln Laboratory, SRI International, and the University of California at Berkeley. The goal was 1000-word automatic speech recognition with a few speakers, connected speech, and a constrained grammar, with less than 10% semantic error. The funding was reported to be $15 million. According to Klatt, who wrote an interesting critique of this program [36], only a system called Harpy, built by a CMU graduate student (Bruce Lowerre), fulfilled the goals. He used LPC segments, incorporated high-level knowledge, and modified techniques from Baker's Dragon system, as well as from another CMU system, Hearsay.

4.6 ACHIEVED BY 1976

By 1976, researchers were using spectral feature vectors, LPC, and phonetic features in their recognizers. They were incorporating syntax and semantic information. Approaches incorporating neural networks, DTW, and HMMs were developed. A number of systems were built. Efforts on reducing search cost were explored. Techniques from artificial intelligence were often used, particularly for the ARPA program. HMM theory had been applied to automatic speech recognition, and HMM-based systems had been built. In short, many of the fundamentals were in place for the systems that followed.


In the 1980s, most efforts were concentrated on scaling existing techniques (e.g., LPC and HMMs) to more difficult problems. New front-end processing techniques were also developed in this time period. For the most part, however, the structure of speech-recognition systems did not change; they were trained on a larger quantity of data and extended to more difficult tasks. This extension did require extensive engineering developments, which were made possible by a concerted effort in the community. In particular, there was a major effort to develop standard research corpora.

4.7.1. Large Corpora Collection

Prior to 1986 or so, the speech-recognition community did not have any widely accepted common databases for training recognition systems. This made comparisons between labs difficult, since few researchers trained or tested on the same acoustic data. Many speech researchers were concerned with this problem. Industrial scientists (e.g., those at Texas Instruments and Dragon Systems) worked with NIST (National Institute of Standards and Technology)4 and compiled large standard corpora.

In 1986, collection began on the TIMIT5 corpus [52], which was to become the first widely used standard corpus. A 61-phone alphabet was chosen to represent phonetic distinctions. The sentences in TIMIT were chosen to be phonetically balanced, meaning that a good representation of each phone was available within the training set. There were 630 speakers, each of whom said 10 sentences, including two that were the same for every speaker. The data were recorded at Texas Instruments and phonetically segmented at MIT, first by use of an automatic segmenter [40], followed by manual inspection and repair of the alignments by graduate students. The result was a database in which time boundaries in the speech signal are marked for every phone uttered by every speaker. Even though errors still undoubtedly exist in the TIMIT database, it remains one of the largest and most widely used hand-labeled phonetic corpora.

With the advent of the second major ARPA speech program in the mid-1980s, a new task called Resource Management (RM) was defined, with a new database [60] of speech. RM had much in common with the task from the first ARPA program in the 1970s. The major differences were that the grammar had a greater perplexity,6 and the recordings were made of read speech. Sentences were constructed from a 1000-word language model, so that no out-of-vocabulary words were encountered during testing. The corpus contained 21,000 utterances from 160 speakers. One important characteristic of the RM task was that it included speaker-independent recognition; that is, some systems were trained on many speakers, and they were tested on speakers not in the training set.

Later on in the program, the focus shifted to the Wall Street Journal Task – recognizing read speech from the Wall Street Journal.7 The first test was constrained to be a 5000-word vocabulary test with no out-of-vocabulary words; later, a 20,000-word task with out-of-vocabulary words was developed. More recent tests used an essentially unlimited vocabulary, and researchers often used 60,000-word decoders for system evaluations.

Another task that was developed in parallel with the read-speech program was the Air Travel Information System (ATIS), which was based on spontaneous queries in the airline-reservation domain. ATIS is a speech-understanding task (as opposed to a speech-recognition task). Systems not only had to produce word strings, but they also had to attempt to derive some semantic meaning from these word strings and perform an appropriate function. For instance, if the user said “show me the flights from Boston to San Francisco,” the system should respond by showing a list of flights. Interaction with the system continued in order to reach some goal; in this case, ordering airline tickets. This domain was more practical than the Wall Street Journal task, but the vocabulary size was smaller. Systems today are quite good at this task.

DARPA funded the collections of these corpora, and the collection processes were managed by NIST. NIST subcontracted much of the collection work to sites such as SRI and Texas Instruments. These and other corpora are now distributed through the Linguistic Data Consortium (LDC), which is based at the University of Pennsylvania in Philadelphia.

4.7.2. Front Ends

A number of new front ends, that is, subsystems that extract features from the speech signal, were developed in the 1980s. Of particular note are mel cepstrum [20], perceptual linear prediction [29], delta cepstral coefficients [22], and other work in auditory-inspired signal-processing techniques, for example, [68] and [27]. (See Chapter 22 for a discussion of many of these approaches.)

4.7.3. Hidden Markov Models

As noted previously, the fundamentals of HMM methodology were developed in the late 1960s, with applications to speech recognition in the 1970s. In the 1980s, interest in these approaches spread to the larger community. Research and development in this area led to system enhancements from researchers in many laboratories, for example, BBN [67] and Philips [12]. By the mid- to late 1980s, HMMs became the dominant recognition paradigm, with, for example, systems at SRI [48], MIT–Lincoln [55], and CMU. The CMU system was quite representative of the others developed at this time, and [38] provides an extended description.

Much of this activity focused on tasks defined in a new ARPA program. As in the 1970s, IBM researchers primarily worked with their own internal tasks, although ultimately they too participated in DARPA evaluations. See [61] for descriptions of the wide range of work done at Bell on HMMs for telephone speech, as well as on many other aspects of automatic speech recognition.

4.7.4. The Second (D)ARPA Speech-Recognition Program

In 1984, ARPA began funding a second program. The first major speech-recognition task in this program was the Resource Management task mentioned earlier. This task involved reading sentences derived from a 1000-word vocabulary. The sentences were questions and commands designed to manipulate a naval information database, although the systems did not actually have to interface with any database; ratings were based on word recognition. Sample sentences from the corpus [60] include the following:

  • Is Dixon's length greater than that of Ranger?
  • What is the date and hour of arrival in port for Gitaro?
  • Find Independence's alerts.
  • Never mind.

Evaluations of participating systems were held one to two times per year. Sites would receive a CD-ROM with test data and send NIST the sentences produced by their recognizers; NIST would then officially evaluate the results.

The competition tended to make the participating systems converge on similar, well-performing designs, with each lab attempting to incorporate improvements that had been noted by the others. Although this led to a rapid series of improvements, it also narrowed the diversity of approaches across many systems.

The ARPA project fueled many engineering advances. As of 1998, many research systems can recognize read speech from new speakers (without speaker-specific training) with a 60,000-word vocabulary in real time, with less than a 10% word error.8 The competition also inspired other sites that were not funded by the project, including laboratories in Europe. For example, Cambridge University in England participated in the evaluations and developed HTK, the HMM ToolKit, which has been widely distributed [82]. It is now possible to use HTK to get large-vocabulary recognition results close to those achieved by the major sites.

It could be argued that the fundamentals of speech-recognition science have not greatly changed in many years; at least it is not clear that any major mechanisms (of the significance of dynamic programming, HMMs, or LPC) were developed during the last decade or two. However, there have been many developments that may ultimately prove to have been important, particularly in the 1990s. Examples include front-end developments (mel- or bark-scaled cepstral estimates, delta features, channel-normalization schemes, and vocal-tract normalization), probabilistic estimation schemes to adapt to new speakers or acoustics (e.g., maximum likelihood linear regression (MLLR); see Chapter 28), schemes to improve discrimination with neural networks, and training paradigms that maximize the mutual information between the data and the models. Still, it is fair to say that the field has matured to the point that the efforts of many workers in the field are more oriented toward improving the engineering effectiveness of existing ideas rather than generating radically different ones. It is a matter of current controversy as to whether such an engineering orientation is sufficient to make major progress in the future, or whether radically different approaches will actually be required [11].

4.7.5. The Return of Neural Nets

The field of neural networks suffered a large blow when Minsky and Papert wrote their 1969 book Perceptrons, proving that the perceptron, which was one of the popular net architectures of the time,9 could not even represent the simple exclusive or (XOR) function.10 With the advent of backpropagation, a training technique for multilayer perceptrons (MLPs), in the early 1980s, the neural network field experienced a resurgence.

One application of neural networks to speech classification in the early 1980s was the use of a committee machine to judge whether a section of speech was voiced or unvoiced [26]. In 1983 Makino reported using a simple time-delayed neural network (a close cousin of an MLP, in which the input layer includes a delayed version of itself in order to provide a simple context-delay mechanism) to perform consonant recognition [43]. This technique was later expanded by other researchers to add these delayed versions at multiple layers in the net [78]. Other researchers in the mid-1980s used Hopfield nets to classify both vowels and consonants [41].

By the late 1980s, many labs were experimenting with neural networks, in both isolated and continuous contexts. Only a few labs attacked large problems in automatic speech recognition with neural networks during this period; discrete probability estimators and mixtures of Gaussians were used in HMM recognizers for the majority of systems. Some sites have been using hybrid HMM–artificial neural network techniques, in which the neural network is used as a phonetic probability estimator and the HMM is used to search through the space of possible word strings comprising the phones from the artificial neural network [46], [64]. In recent years, neural networks have also been used for feature transformation as part of a discriminatively trained front end for use in a Gaussian-mixture-based recognition system [30].

4.7.6. Knowledge-Based Approaches

As noted previously, much of the work in the first ARPA speech project was strongly influenced by an artificial intelligence perspective. In the late 1970s and early 1980s, approaches based on the codification of human knowledge, typically in the form of rules, became widely used in a number of disciplines. Some speech researchers developed recognition systems that used acoustic–phonetic knowledge to develop classification rules for speech sounds; for instance, in [79], the consonants “k” and “g” following a vowel were discriminated on the basis of the proximity of the second and third resonances at the end of the vowel. This style of recognition was explained very well in [84]. One of the potential advantages of such an approach was that the speech characteristics used for discrimination were not limited to the acoustics of a single frame. Some of these points were explained in [16]. This reference, which is reprinted in [77], is also interesting because it includes a commentary from two BBN researchers (Makhoul and Schwartz), who took issue with the idea of focusing on the weak knowledge that we have about the utility of features chosen by experts. In this commentary, they suggested that systems should instead be focused on representing the ignorance that we have. In this case, they were really pointing to HMM-based approaches.11 This dialog, and the personal interactions surrounding it at various meetings around this time, were extremely influential. By 1988 nearly every research site had turned to statistical methods. In the long term, however, the dichotomy might be viewed as illusory, since all of the researchers employing statistical methods continued to search for ways to include different knowledge sources, and the systems that attempted to use knowledge-based approaches also used statistical models.


Since the early 1990s, there have been many events and advances in the field of speech recognition, though, arguably, few have had the fundamental impact of such things as the use of common databases and evaluations, and the core statistical modeling approach. However, the cumulative effect of these more recent efforts has been considerable. Here is a sampling of what we view to be the more significant components of work in ASR since the early 1990s.

1. The DARPA program continued, and moved on to tasks such as Broadcast News. This is a significantly more realistic task than the Wall Street Journal transcription, since it includes a range of speaking styles (from read to spontaneous) and acoustic conditions (e.g., quiet studio to noisy street). It also is a real task, in the sense that the automatic transcription of broadcast data is closely related to several potential commercial applications.

2. The U.S. Defense Department also funded an effort to transcribe conversational speech. Two databases collected for this work were Switchboard and Call Home; in the first case, talkers were asked to converse on the telephone on a selected topic (e.g., credit cards). In the second, callers were asked to telephone family members and discuss anything they wanted. These were, and are, extremely difficult tasks, though by 2005 the best systems achieved word error rates on Switchboard (and Fisher, a related task) in the low teens (Call Home test sets remained very challenging, possibly due to the extremely relaxed and informal conversational style used in talking to family members).

3. Beginning in the mid to late 1990s, there was an increased effort by a number of American and European laboratories to study conversational speech in meetings [45], [1]. This extended both the technical problem set and the potential utility of successful systems by including scenarios with distant microphones (thus generating noisy and reverberant speech signals) and extremely natural conversational phenomena. Data sets were collected and made available from the U.S. [33] and Europe [14]. For a number of years NIST conducted evaluations of ASR and speaker diarization for such test sets.

4. Beginning in 1993, there has been an annual 6-week summer workshop that is focused on recognizing conversational speech. It was held for 2 years at Rutgers, and then each summer at Johns Hopkins, although the latter workshop has since broadened its scope, looking at many problems in speech and natural language processing.

5. Many of the first speech recognizers were segment based; that is, the recognizer hypothesized the boundaries of phone segments in the speech signal and then tried to do recognition based on this segmented speech. By the 1970s, most researchers turned to a more frame-based system, in which the base acoustic analysis regions were small, constant-duration sections, or frames, of speech. However, some researchers continue to work with segment-based systems, e.g., the MIT SUMMIT system [85], [56]. These systems developed ways of using statistical models [28], much as the frame-oriented systems had. Additionally, a number of researchers developed ways of extending HMM-based approaches to include segment statistics; see, for example, [54]. More recently, there has been increased experimentation with hybrids of HMM-based and memory-based (episodic) approaches, such as in [76].

6. Discriminative training methods are now widely used, particularly in large research systems that are trained with many hours (often more than 1000) of speech. The availability of these very large data sets also permitted much more detailed models, which were a major factor in the reduction in word error rates since the early 1990s. Discriminative approaches to model training such as those developed in [58] will be discussed in Chapters 27 and 28; similar methods have been developed for feature transformation, as described in [59], [30], [47], and different discriminative methods have also been successfully combined, for instance as reported in [83].

7. Through the 1980s, essentially every recognition system was extremely susceptible to a linear filtering operation (as one might experience from a telephone channel with a different frequency response than the one that was used to collect training data). In the 1990s there was significant work to improve recognition robustness to different channels, as well as to variability in the microphone, and to acoustic noise [31], [72], [25], [37]. This was continued in the following decade with work on the Aurora task, based on speech recognition with additive noise, and developed by a working group for the European Telecommunications Standards Institute (ETSI) as part of the selection process for a standard front end for “Distributed Speech Recognition” (DSR), which is discussed further in Chapter 22.

8. Language models have been extended by using many billions of words from available text, including language harvested from the Web. Additionally, although n-gram models remain predominant, methods that capture more linguistic structure have increasingly been used to supplement them.
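To make the n-gram idea concrete, here is a toy Python sketch (a maximum-likelihood bigram model estimated from raw counts, with none of the smoothing a real system would require):

```python
from collections import Counter

def bigram_probs(corpus_tokens):
    """Maximum-likelihood bigram language model from counts:
    P(w2 | w1) = count(w1 w2) / count(w1)."""
    # Count each word in a history position, and each adjacent pair.
    unigrams = Counter(corpus_tokens[:-1])
    bigrams = Counter(zip(corpus_tokens[:-1], corpus_tokens[1:]))
    return {(w1, w2): c / unigrams[w1]
            for (w1, w2), c in bigrams.items()}

tokens = "the dog saw the cat".split()
probs = bigram_probs(tokens)
print(probs[("the", "dog")])  # 0.5: "the" is followed by "dog" in 1 of 2 cases
```

A production language model would be trained on billions of words and smoothed so that unseen word pairs receive nonzero probability.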

9. The use of multiple systems or subsystems became widespread for large research systems, with combination at the feature, model, or system output levels, as described in [21], [71], [70], and [73].
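In the spirit of output-level combination such as ROVER [21], the voting step can be sketched as follows (a toy illustration that assumes the hypotheses are already word-aligned; the actual ROVER procedure also constructs the alignment itself and can weight votes by recognizer confidence):

```python
from collections import Counter

def vote_per_slot(aligned_hyps):
    """Word-level majority vote over pre-aligned recognizer outputs.

    aligned_hyps: a list of hypotheses, each a list of words, all of
    the same length. For each word position, the most frequent word
    across systems wins.
    """
    return [Counter(words).most_common(1)[0][0]
            for words in zip(*aligned_hyps)]

# Three (hypothetical) recognizers' outputs for the same utterance:
hyps = [["the", "cat", "sat"],
        ["the", "bat", "sat"],
        ["the", "cat", "sank"]]
print(vote_per_slot(hyps))  # ['the', 'cat', 'sat']
```

Each individual system makes an error here, yet the vote recovers the correct word sequence; this is the basic reason combination reduces word error rates when the systems' errors are not strongly correlated.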

10. There has been an increased emphasis on issues of pronunciation [63], dialog modeling [15], model adaptation schemes [39], and long-distance dependencies within word sequences [65], to mention just a few major topics.

11. There has been a rapid expansion of research in other classification tasks related to automatic speech recognition. For instance, methods and systems were developed for speaker identification and verification ([23], [24] and Chapter 41), as well as for language identification [49]. Speaker diarization (determining who spoke when) has also become a significant topic.

Some additional topics are discussed in a 2009 review paper, published in two parts [5], [6].


Researchers often return to the same themes decade after decade – frame-based measures versus segment-based ones, statistical estimation of acoustic and language probabilities, incorporation of speech knowledge, and so on. With each return, the technology is more sophisticated. For instance, consumers can now purchase a dictation system that can recognize tens of thousands of words in continuous speech with a moderate error rate (after adaptation to the speaker), and the computers that can accomplish this are widely available.

However, the problems in speech recognition remain deep. Even five-word recognizers operate with significant error rates under common natural conditions (e.g., moderate background noise, room reverberation, accents, and out-of-vocabulary words). In contrast, human performance is often far more stable under the same conditions, as discussed further in Chapter 18. We expect the recognition and interpretation of spoken language to remain a challenging problem for some time to come.


  • 4.1 How was the 1952 Bell Labs automatic-speech recognition system limited in comparison with a modern system? Is there any way in which it could potentially be better, while keeping the same basic structure?
  • 4.2 A new speech-recognition company is advertising their wonderful product. What percentage accuracy would you expect them to ascribe to their system? Describe some ways in which performance could be benchmarked in more realistic ways.
  • 4.3 Find a newspaper, magazine, or Web announcement about some speech-recognition system, either commercial or academic. Can you conclude anything about the structure and capabilities of these systems? If there is any content in the release information, try to associate your best guesses about the systems with any of the historical developments described in this chapter.
  • 4.4 In what way could Radio Rex be a better system than a recognizer trained to understand read versions of the Wall Street Journal?


  2. Atal, B., and Hanauer, S., “Speech analysis and synthesis by prediction of the speech wave,” J. Acoust. Soc. Am. 50: 637–655, 1971.
  3. Bahl, L., and Jelinek, F., “Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition,” IEEE Trans. Inform. Theory IT-21: 404–411, 1975.
  4. Baker, J., “The DRAGON system – an overview,” IEEE Trans. Acoust. Speech, Signal Process. 23: 24–29, 1975.
  5. Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C.H., Morgan, N., and O'Shaughnessy, D. “Developments and directions in speech recognition and understanding, Part 1,” IEEE Signal Process. Mag. 26(3): 75–80, May 2009.
  6. Baker, J., Deng, L., Khudanpur, S., Lee, C.H., Glass, J., Morgan, N., and O'Shaughnessy, D. “Updated MINDS report on speech recognition and understanding, Part 2,” IEEE Signal Process. Mag. 26(4): 78–85, July 2009.
  7. Bakis, R., “Continuous-speech word spotting via centisecond acoustic states,” IBM Res. Rep. RC 4788, Yorktown Heights, New York, 1974; abstract in J. Acoust. Soc. Am. 59 (Supp. 1): S 97, 1976.
  8. Baum, L. E., and Petrie, T., “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Mathemat. Stat. 37: 1554–1563, 1966.
  9. Bellman, R., “On the theory of dynamic programming,” Proc. Nat. Acad. Sci. 38: 716–719, 1952.
  10. Bogert, B., Healy, M., and Tukey, J., “The quefrency analysis of time series for echos,” in M. Rosenblatt, ed., Proc. Symp. on Time Series Analysis, Chap. 15, Wiley, New York, pp. 209–243, 1963.
  11. Bourlard, H., Hermansky, H., and Morgan, N., “Towards increasing speech recognition error rates,” Speech Commun. 18: 205–231, 1996.
  12. Bourlard, H., Kamp, Y., Ney, H., and Wellekens, C. J., “Speaker-dependent connected speech recognition via dynamic programming and statistical methods,” in M. R. Schroeder, ed., Speech and Speaker Recognition, Karger, Basel, 1985.
  13. Bridle, J., Chamberlain, R., and Brown, M., “An algorithm for connected word recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Paris, pp. 899–902, 1982.
  14. Carletta, J., “Announcing the AMI Meeting Corpus,” The ELRA Newsletter 11(1): 3–5, January–March 2006.
  15. Cohen, P., “Dialogue modeling,” in R. Cole, J. Mariani, H. Uszkoreit, G. B. Varile, A. Zaenen, A. Zampoli, and V. Zue, eds. Survey of the State of the Art in Human Language Technology, Cambridge Univ. Press, London/New York, 1997.
  16. Cole, R., Stern, R., and Lasry, M., “Performing fine phonetic distinctions: templates versus features,” in J. S. Perkell and D. M. Klatt, eds., Variability and Invariance in Speech Processes, Erlbaum, Hillsdale, N.J., 1986.
  17. Cooley, J. W., and Tukey, J. W., “An algorithm for the machine computation of complex Fourier series,” Math. Comput. 19: 297–301, 1965.
  18. David, E., and Selfridge, O., “Eyes and ears for computers,” Proc. IRE 50: 1093–1101, 1962.
  19. Davis, K., Biddulph, R., and Balashek, S., “Automatic recognition of spoken digits,” J. Acoust. Soc. Am. 24: 637–642, 1952.
  20. Davis, S., and Mermelstein, P., “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust. Speech Signal Process. 28: 357–366, 1980.
  21. Fiscus, J. G., “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),” in Proc. Auto. Speech Recog. Underst., Santa Barbara, pp. 347–354, 1997.
  22. Furui, S., “Speaker independent isolated word recognizer using dynamic features of speech spectrum,” IEEE Trans. Acoust. Speech Signal Process. 34: 52–59, 1986.
  23. Furui, S., “An overview of speaker recognition technology,” in C. H. Lee, F. K. Soong, and K. K. Paliwal, eds., Automatic Speech and Speaker Recognition, Kluwer, Boston, Mass., 1996.
  24. Furui, S., “40 Years of Progress in Automatic Speaker Recognition,” in Adv. Biometrics, pp. 1050–1059, 2009.
  25. Gales, M., and Young, S., “Robust speech recognition in additive and convolutional noise using parallel model combination,” Comput. Speech Lang. 9: 289–307, 1995.
  26. Gevins, A., and Morgan, N., “Ignorance-based systems,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., San Diego, pp. 39A.5.1–39A.5.4., 1984.
  27. Ghitza, O., “Temporal non-place information in the auditory-nerve firing patterns as a front end for speech recognition in a noisy environment,” J. Phonet. 16: 109–124, 1988.
  28. Glass, J.R., “A probabilistic framework for segment- based speech recognition,” Comput., Speech Lang. 17(2–3): 137–152, 2003.
  29. Hermansky, H., “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am. 87: 1738–52, 1990.
  30. Hermansky, H., Ellis, D., and Sharma, S., “Tandem connectionist feature stream extraction for conventional HMM systems,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Istanbul, pp. III-1635–1638, 2000.
  31. Hermansky, H., and Morgan, N., “RASTA processing of speech,” IEEE Trans. Speech Audio Process. 2: 578–589, 1994.
  32. Itakura, F., and Saito, S., “Analysis-synthesis telephone based on the maximum-likelihood method,” in Y. Konasi, ed., Proc. 6th Int. Cong. Acoust., Tokyo, Japan, 1968.
  33. Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C., “The ICSI Meeting Corpus,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Hong Kong, pp. I 364–367, 2003.
  34. Jelinek, F., “Continuous recognition by statistical methods,” Proc. IEEE 64: 532–555, 1976.
  35. Jelinek, F., Bahl, L., and Mercer, R., “The design of a linguistic statistical decoder for the recognition of continuous speech,” IEEE Trans. Inform. Theory IT-21: 250–256, 1975.
  36. Klatt, D., “Review of the ARPA speech understanding project,” J. Acoust. Soc. Am. 62:1345–1366, 1977.
  37. Lee, C.-H., “On stochastic feature and model compensation approaches to robust speech recognition,” Speech Commun. 25: 29–48, 1998.
  38. Lee, K.-F., Automatic Speech Recognition - the Development of the Sphinx System, Kluwer, Norwell, Mass., 1989.
  39. Leggetter, C., and Woodland, P., “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Comput. Speech Lang. 9: 171–185, 1995.
  40. Leung, H., and Zue, V., “A procedure for automatic alignment of phonetic transcriptions with continuous speech,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., San Diego, pp. 2.7.1–2.7.4, 1984.
  41. Lippmann, R., and Gold, B., “Neural classifiers useful for speech recognition,” in Proc. IEEE First Int. Conf. Neural Net., San Diego, pp. 417–422, 1987.
  42. Makhoul, J., “Linear prediction: a tutorial review,” Proc. IEEE 63: 561–580, 1975.
  43. Makino, S., Kawabata, T., and Kido, K., “Recognition of consonants based on the perceptron model,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Boston, Mass., pp. 738–741, 1983.
  44. Markel, J., and Gray, A., Linear Prediction of Speech, Springer-Verlag, New York/Berlin, 1976.
  45. Morgan, N., Baron, D., Bhagat, S. Carvey, H., Dhillon, R., Edwards, J., Gelbart, D., Janin, A., Krupski, A., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C., “Meetings about meetings: research at ICSI on speech in multiparty conversations,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Hong Kong, pp. IV 740–743, 2003.
  46. Morgan, N., and Bourlard, H., “Continuous speech recognition: an introduction to the hybrid HMM/connectionist approach,” IEEE Signal Process. Mag. 12: 25–42, 1995.
  47. Morgan, N., Zhu, Q., Stolcke, A., Sonmez, K., Sivadas, S., Shinozaki, T., Ostendorf, M., Jain, P., Hermansky, H., Ellis, D., Doddington, G., Chen, B., Cetin, O., Bourlard, H., and Athineos, M., “Pushing the envelope-aside,” IEEE Signal Process. Mag. 22(5): 81–88, 2005.
  48. Murveit, H., Cohen, M., Price, P., Baldwin, G., Weintraub, M., and Bernstein, J., “SRI's DECIPHER system,” in Proc. Speech Natural Lang. Workshop, Philadelphia, pp. 238–242, 1989.
  49. Muthusamy, Y. K., Barnard, E., and Cole, R. A., “Reviewing automatic language identification,” IEEE Signal Process. Mag. 11: 33–41, 1994.
  50. Myers, C., Rabiner, L., and Rosenberg, L., “Performance tradeoffs in dynamic time warping algorithms for isolated word recognition,” IEEE Trans. Acoust. Speech Signal Process. 28: 623–635, 1980.
  51. Ney, H., “The use of a one stage dynamic programming algorithm for connected word recognition,” IEEE Trans. Acoust. Speech Signal Process. 32: 263–271, 1984.
  52. National Institute of Standards and Technology, TIMIT Acoustic-Phonetic Continuous Speech Corpus, Speech Disc 1–1.1, NIST Order No. PB91–505065, 1990.
  53. Oppenheim, A. V., Schafer, R. W., and Stockham, T. G. Jr., “Nonlinear filtering of multiplied and convolved signals,” Proc. IEEE 56: 1264–1291, 1968.
  54. Ostendorf, M., Bechwati, I., and Kimball, O., “Context modeling with the stochastic segment model,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., San Francisco, pp. 389–392, 1992.
  55. Paul, D., “The Lincoln continuous speech recognition system: recent developments and results,” in Proc. Speech Natural Lang. Workshop, Philadelphia, pp. 160–165, 1989.
  56. Phillips, M., Glass, J., and Zue, V., “Automatic learning of lexical representations for sub-word unit based speech recognition systems,” Proc. Eurospeech, Genova, pp. 577–580, 1991.
  57. Pierce, J., “Whither speech recognition,” J. Acoust. Soc. Am. 46: 1049–1051, 1969.
  58. Povey, D., Discriminative Training for Large Vocabulary Speech Recognition, Ph. D. Thesis, Cambridge University, 2004.
  59. Povey, D., Kingsbury, B., Mangu, L., Saon, G., Soltau, H., and Zweig, G., “FMPE: Discriminatively trained features for speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Philadelphia, pp. 961–964, 2005.
  60. Price, P., Fisher, W., Bernstein, J., and Pallett, D., “The DARPA 1000-word resource management database for continuous speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., New York, S.13.21, pp. 651–654, 1988.
  61. Rabiner, L., and Juang, B.-H., Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
  62. Rabiner, L., and Levinson, S., “Isolated and connected word recognition: theory and selected applications,” IEEE Trans. Commun. 29: 621–659, 1981.
  63. Riley, M., and Ljolje, A., “Automatic generation of detailed pronunciation lexicons,” in C. H. Lee, F. K. Soong, and K. K. Paliwal, eds., Automatic Speech and Speaker Recognition, Kluwer, Boston, Mass., 1996.
  64. Robinson, T., Hochberg, M., and Renals, S., “The use of recurrent neural networks in continuous speech recognition,” in C. H. Lee, F. K. Soong, and K. K. Paliwal, eds., Automatic Speech and Speaker Recognition, Kluwer, Boston, Mass., 1996.
  65. Rosenfeld, R., “A maximum entropy approach to adaptive statistical language modeling,” Comput. Speech Lang. 10: 187–228, 1996.
  66. Sakoe, H., and Chiba, S., “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust. Speech Signal Process. 26: 43–49, 1978.
  67. Schwartz, R., Chow, Y., Kimball, O., Roucos S., Krasner, M., and Makhoul, J., “Context-dependent modeling for acoustic-phonetic recognition of continuous speech,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Tampa, pp. 1205–1208, 1985.
  68. Seneff, S., “A joint synchrony/mean-rate model of auditory speech processing,” J. Phonet. 16: 55–76, 1988.
  69. Shannon, C., “A mathematical theory of communication,” Bell Sys. Tech. J. 27: 379–423, 623–656, 1948.
  70. Sinha, R., Gales, M., Kim, D. Y., Liu, X. A., Sim, K. C., Woodland, P. C., “The CU-HTK Mandarin Broadcast News Transcription System,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Toulouse, pp. I–1077–1080, 2006.
  71. Siohan, O., Ramabhadran, B., and Kingsbury, B., “Constructing Ensembles of ASR Systems Using Randomized Decision Trees,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Philadelphia, pp. I–197–200, 2005.
  72. Stern, R., Acero, A., Liu, F.-H., and Ohshima, Y., “Signal processing for robust speech recognition,” in C. H. Lee, F. K. Soong, and K. K. Paliwal, eds., Automatic Speech and Speaker Recognition, Kluwer, Boston, Mass., 1996.
  73. Stolcke, A., Chen, B., Franco, H., Gadde, V. R. R., Graciarena, M., Hwang, M.-Y., Kirchhoff, K., Morgan, N., Lin, X., Ng, T., Ostendorf, M., Sönmez, K., Venkataraman, A., Vergyri, D., Wang, W., Zheng, J., and Zhu, Q., “Recent Innovations in Speech-to-Text Transcription at SRI-ICSI-UW,” IEEE Trans. Audio Speech Lang. Process. 14(5): 1729–1744, 2006.
  74. Tappert, C., Dixon, N., Rabinowitz, A., and Chapman, W., “Automatic recognition of continuous speech utilizing dynamic segmentation, dual classification, sequential decoding, and error recovery,” IBM Tech. Rep. RADC-TR-71–146, Yorktown Heights, NY, 1971.
  75. Vintsyuk, T., “Element-wise recognition of continuous speech composed of words from a specified dictionary,” Kibernetika 7: 133–143, 1971.
  76. Wachter, M., Demuynck, K., Van Compernolle, D., and Wambacq, P., “Data-driven example based continuous speech recognition,” in Proc. Eurospeech, Geneva, pp. 1133–1136, 2003.
  77. Waibel, A., and Lee, K., eds., Readings in Speech Recognition, Morgan Kaufmann, San Mateo, Calif., 1990.
  78. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K., “Phoneme recognition: neural networks vs. hidden Markov models,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., New York, pp. 107–110, 1988.
  79. Weinstein, C., McCandless, S., Mondshein, L., and Zue, V., “A system for acoustic-phonetic analysis of continuous speech,” IEEE Trans. Acoust. Speech Signal Process. 23: 54–67, 1975.
  80. White, G., “Speech classification using linear time stretching or dynamic programming,” IEEE Trans. Acoust. Speech Signal Process. 24(2): 183–188, 1976.
  81. Widrow, B., Personal Communication, Phoenix, Az., 1999.
  82. Woodland, P., Odell, J., Valtchev, V., and Young, S., “Large vocabulary continuous speech recognition using HTK,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Adelaide, pp. II–125–128, 1994.
  83. Zheng, J., Cetin, O., Huang, M.-Y., Lei, X., Stolcke, A., and Morgan, N., “Combining Discriminative Feature, Transform, and Model Training for Large Vocabulary Speech Recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Honolulu, pp. 633–636, 2007.
  84. Zue, V., “The use of speech knowledge in automatic speech recognition,” Proc. IEEE 73: 1602–1615, 1985.
  85. Zue, V., Glass, J., Phillips, M., and Seneff, S., “The MIT SUMMIT speech recognition system: a progress report,” in Proc. Speech Natural Lang. Workshop, Philadelphia, pp. 179–189, 1989.

1As with any such brief historical review, we have been limited to discussing a small fraction of the many contributions and contributors to this extremely active field.

2Children and adult women often have first formants above this frequency, and speakers of either gender can have second formants that are below 900 Hz for some sounds. Still, 900 Hz is a reasonable dividing point between major energy components in speech.

3This U.S. government agency was originally known as ARPA; it later became DARPA (the D standing for Defense), reverted to ARPA after a few years, and as of this writing is DARPA again.

4Formerly called the National Bureau of Standards (NBS).

5So called because the data were collected at Texas Instruments (TI) and annotated at MIT.

6Roughly speaking, perplexity is a measure of the uncertainty about the next word given a word history; a more precise definition will be given in Chapter 5.

7The task was later called CSRNAB (Continuous Speech Recognition of North American Business News), which included data from other news sources.

8The reader should keep in mind that this impressive performance is for read speech (that is, read from a page) in a limited domain with extensive language materials for training and relatively well-behaved acoustic input. The 1998 performance on tasks that are less constrained can be much worse.

9Although other network architectures were (and still are) available, including the perceptron's cousin, the MLP, the perceptron had properties that made it relatively easy to train.

10The XOR is a two-input logic function that returns true for inputs that are different (only one or the other is true) and false if the inputs are the same (either both true or both false).

11An earlier paper that made similar philosophical points, but that was not specifically concerned with speech recognition or HMMs, was [26].
