Many years ago, von Kempelen demonstrated that the speech-production system of the human being could be modeled. He showed this by building a mechanical contrivance that “talked.” The paper by Dudley and Tarnoczy [2] relates the history of von Kempelen's speaking machine. This device was built about 1780, at a time when the notion of building automata was quite popular. Von Kempelen also wrote a book [7] that dealt with the origin of speech, the human speech-production system, and his speaking machine. Thus, for over a century, an existence proof was established that one could indeed build a machine that spoke. (Von Kempelen's work brings to mind that of another great innovator, Babbage, who also labored for many years with mechanical contrivances to try to build a computing machine.)

Figure 2.1 shows the speaking machine built by Wheatstone that was based on von Kempelen's work. The leather resonator was manipulated by the operator to try to copy the acoustic configuration of the vocal tract during the sonorant sounds (vowels, semivowels, glides, and nasals); the bellows provided the air stream; the vibrating reed produced the periodic pressure wave; and the various small whistles and levers shown controlled most of the consonants. (Much later, Riesz [6] built a mechanical speaking machine that was more precisely modeled after the human speech-producing mechanism. This is depicted in Fig. 2.2, shown here for comparison with the von Kempelen–Wheatstone model of Fig. 2.1.)


Modern methods of speech processing really began in the U.S. with the development of two devices. Homer Dudley pioneered the development of the channel vocoder (voice coder) and the Voder (voice-operated demonstrator) [1]. We know from numerous newspaper articles that the appearance of the Voder at the 1939 World's Fairs in San Francisco and New York City was an item of intense curiosity. Figure 2.3 is a collage of some clippings from that period and reflects some of the wonder of people at the robot that spoke.

It is important to realize that the Voder did not speak without a great deal of help from a human being. The operator controlled the Voder through a console, which can be compared to a piano keyboard. In the background was the electronic device that did the speaking. Operator training proved to be a major problem. Many candidates for this job were unable to learn it, and the successful ones required training for periods of 6 months to 1 year. Figure 2.4 shows an original sketch by S. W. Watkins of the Voder console.


FIGURE 2.1 Wheatstone's speaking machine. From [2].

The keys were used to produce the various sounds; the wrist bar was a switch that determined whether the excitation function would be voiced or unvoiced, and the pitch pedal supplied intonation information. Figure 2.5 is a close-up of the controls in the console and shows how these relate to the articulators of a human vocal tract.

The keys marked 1 through 10 controlled the connection of the corresponding bandpass filters into the system. If two or three of the keys were depressed and the wrist bar was set to the buzz (voicing) condition, vowels and nasals were produced. If the wrist bar was set to hiss (voiceless), sounds such as the voiceless fricatives (e.g., f) were generated. Special keys were used to produce the plosive sounds (such as p or d) and the affricate sounds (ch as in cheese; j as in jaw).


FIGURE 2.2 Riesz's speaking machine. From [3].


FIGURE 2.3 News clippings on the Voder.


The Voder was marvelous, not only because it “talked” but also because a person could be trained to “play” it. Speech synthesis today is done by real-time computer programs or specialized hardware, and the emphasis is either on voice answer-back systems, in which the synthesizer derives information from a stored vocabulary, or on text-to-speech systems, in which text that is either typed or electronically scanned is used to control the synthesizer parameters. It is a pity that further work on real-time control by a human operator has not been seriously pursued.


FIGURE 2.4 Sketch of the Voder.


FIGURE 2.5 Voder controls. From [2].


FIGURE 2.6 Lesson 1 of the Voder instructions.

Figures 2.6, 2.7, and 2.8 describe Lessons 1, 9, and 37 of the Voder Instruction Manual.

Relatively few of the candidate operators were successful, but one young woman (Mrs. Helen Harper) was very proficient. She performed at the 1939 New York World's Fair. Many years later (in the 1960s) a highlight of Dudley's retirement party was the Voder's speaking to Mr. Dudley, with the help of Mrs. Harper.


Many speech-synthesis devices were built in the decades following the invention of the Voder, but the underlying principle, as captured in Fig. 2.5, has remained quite fixed. In many cases, there is a separation of source and filter followed by the parameterization of each. As we shall see in the following sections, the same underlying principles control the design of most music synthesizers. In later chapters, the field of speech synthesis from the past to the present is explored in some detail, including advanced systems that transform printed text into reasonable-sounding speech.
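
The separation of source and filter can be illustrated with a minimal sketch. It is not a model of the actual Voder circuitry: the two-pole resonator standing in for each bandpass channel, the 8 kHz sampling rate, the 100 Hz pitch, the 100 Hz bandwidth, and the center frequencies are all assumptions chosen only for illustration, and the function names are hypothetical. The sketch shows a buzz or hiss source (the wrist-bar choice) feeding whichever channels are "pressed."

```python
import math
import random

def make_source(voiced, n, sample_rate=8000, pitch_hz=100.0, seed=0):
    """Buzz (impulse train) when voiced, hiss (white noise) otherwise."""
    if voiced:
        period = round(sample_rate / pitch_hz)  # samples per pitch period
        return [1.0 if i % period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def resonator(x, center_hz, bandwidth_hz, sample_rate=8000):
    """Two-pole resonator, an assumed stand-in for one bandpass channel."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = sample + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(voiced, pressed_keys, n=800, sample_rate=8000):
    """Sum the outputs of the channels whose keys are pressed."""
    source = make_source(voiced, n, sample_rate)
    mix = [0.0] * n
    for center_hz in pressed_keys:
        band = resonator(source, center_hz, 100.0, sample_rate)
        mix = [m + b for m, b in zip(mix, band)]
    return mix

# Hypothetical settings: buzz through two channels, hiss through one.
vowel_like = synthesize(voiced=True, pressed_keys=[500.0, 1500.0])
fricative_like = synthesize(voiced=False, pressed_keys=[3000.0])
```

The point of the design is that the source and the filter are parameterized independently: changing `voiced` or `pitch_hz` alters only the excitation, while changing `pressed_keys` alters only the spectral shaping.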


Figure 2.9 shows a 17th Century drawing of a water-powered barrel organ. Spring-powered barrel organs may have existed as long ago as the 12th Century. Barrel organs work on the same concepts as present-day music boxes; that is, once the positions of the pins are chosen, the same music will be played for each complete rotation. Keys can be depressed or strings can be plucked, depending on the overall design of the automatic instrument.


FIGURE 2.7 Lesson 9.

The barrel organ is a form of read-only memory, and not a very compact form at that. Furthermore, barrel organs could not record music played by a performer. In the late 18th Century, both of these problems were overcome by melography, which allowed music to be both recorded and played back, using the medium of punched paper tape or cards. The idea originated for the automation of weaving and was developed fully by Joseph Marie Jacquard, who designed a device that could advance and register cards. (Punched cards were used by Babbage in the design of his computing machine and, in our time, were used by many computer manufacturers such as IBM.) Card-driven street organs made use of this technology. Card stacks were easy to duplicate; also, different stacks contained different music, so that music machines became very marketable. By the beginning of the 20th century, the concept had been applied to the player piano. A roll of paper tape could be made and the holes punched automatically while a master pianist (such as Rachmaninoff or Gershwin) played. This paper roll could then actuate the playback mechanism to produce the recorded version. Since the piano keys were air driven, extra perforations in the paper roll allowed variable amounts of air into the system, thus changing volume and attack in a way comparable to that of the human performer. Until the development of the high-fidelity microphone, player pianos offered greater reproduction fidelity than the gramophone—but of course they could only record the piano, whereas the gramophone recorded all sounds.


FIGURE 2.8 Lesson 37.

A modern example of a player piano is the solenoid-controlled Bösendorfer at the MIT Media Laboratory. Using this system, Fu [4] synthesized a Bösendorfer version from an old piano roll by Rachmaninoff.


FIGURE 2.9 17th Century drawing of a water-powered barrel organ.

At the beginning of the 20th century, a mighty device called the telharmonium was constructed by Thaddeus Cahill. Remember that this was built before the development of electronics; nevertheless, Cahill had the ingenuity to realize that any sound could be synthesized by the summation of suitably weighted sinusoids. He implemented each sinusoid by actuating a generator. To create interesting music, many such generators (plus much additional equipment) were needed, so the result was a monster, weighing many tons. Cahill's concept of additive synthesis is still an important feature of much of the work in electronic music synthesis. This is in contrast to many later music synthesizers that employ subtractive synthesis, in which adaptive filtering of a wideband excitation function generates the sound. (The additive synthesis concept was used by McAulay and Quatieri [5] to design and build a speech-analysis-synthesis system; we discuss this device in later chapters.)
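
The summation of suitably weighted sinusoids can be written down in a few lines. The following is a minimal sketch of additive synthesis in general, not a description of Cahill's generators; the function name, the 8 kHz sampling rate, and the particular frequencies and weights are assumptions made for illustration.

```python
import math

def additive_synth(partials, duration_s, sample_rate=8000):
    """Sum weighted sinusoids; partials is a list of (frequency_hz, weight)."""
    n = round(duration_s * sample_rate)
    return [
        sum(w * math.sin(2.0 * math.pi * f * i / sample_rate) for f, w in partials)
        for i in range(n)
    ]

# Hypothetical tone: a 220 Hz fundamental plus two weaker harmonics.
partials = [(220.0, 1.0), (440.0, 0.5), (660.0, 0.25)]
samples = additive_synth(partials, duration_s=0.01)
```

Each entry in `partials` plays the role of one generator; a richer timbre needs a longer list, which hints at why a purely electromechanical realization grew so large.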

The player piano is only partially a music machine, since it requires a real piano to be part of the system. The telharmonium, by contrast, is a complete synthesizer, since music is made from an abstract model, that is, sine generators. Another, although totally different, complete synthesizer is the theremin, named after its inventor, the Russian Lev Termen. In this system, an antenna is a component of an electronic oscillator circuit; moving one's arm near the antenna changes the oscillator frequency by changing the capacitance of the circuit, and this variable frequency is mixed with a fixed-frequency oscillator to produce an audio tone whose frequency can be varied by arm motion. Thus the theremin generates a nearly sinusoidal sound but with a variable frequency that can produce pitch perceptions that do not exist in any standard musical scale. In the hands of a trained performer, the theremin produces rather unearthly sounds that are nevertheless identifiable as some sort of (strange) music. A trained performer could play recognizable music (e.g., Schubert's Ave Maria). Figure 2.10 shows Clara Rockmore at a theremin. Her right hand controls the frequency by way of the straight antenna, while her left hand controls the amplitude by changing the capacitance of a different circuit.
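
The mixing of a variable oscillator with a fixed one can be made concrete with a small numerical sketch. It assumes that each oscillator behaves as an ideal LC circuit with resonant frequency 1/(2π√(LC)) and that the audible tone is the difference between the two oscillator frequencies; the component values (1 mH, 100 pF) and the function names are hypothetical and are not taken from any real instrument.

```python
import math

def lc_frequency(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC circuit, in hertz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

def theremin_pitch(hand_capacitance_f, inductance_h=1e-3, base_capacitance_f=100e-12):
    """Audio (difference) frequency produced by mixing the two oscillators."""
    fixed = lc_frequency(inductance_h, base_capacitance_f)
    # The hand near the antenna adds a small capacitance to the variable oscillator.
    variable = lc_frequency(inductance_h, base_capacitance_f + hand_capacitance_f)
    return abs(fixed - variable)

# Hypothetical hand positions: a closer hand adds more capacitance.
far_pitch = theremin_pitch(0.1e-12)
near_pitch = theremin_pitch(0.2e-12)
```

With these assumed values both oscillators run near 500 kHz, far above hearing, yet a change of a fraction of a picofarad shifts the difference frequency by hundreds of hertz. Because the pitch varies continuously with hand position, it is not confined to the notes of a scale.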

The theremin continues to fascinate. In 1994 a film called “Theremin: An Electronic Odyssey” was released, leading to the sale of more than one thousand instruments the following year. In 2004, Moog Music, the doyen of the electronic music industry, released the Etherwave Theremin Pro. This is but the latest in a long line of theremins they have marketed, and is a consistent favorite for live performances.


  • 2.1 The Voder was which of the following:
    • (a) a physical model of the human vocal apparatus,
    • (b) an early example of subtractive synthesis,
    • (c) an early example of additive synthesis, or
    • (d) a member of the electorate with a head cold.
  • 2.2 Shown in Fig. 2.11 is Dudley's speech-sound classification for use with Voder training. Find the Voder sequence for any of the practice sentences of Fig. 2.8 (Lesson 37). Break the sentence into a phoneme sequence, using the notation of Fig. 2.11. Note that the BK1, BK2, and BK3 keys in Fig. 2.11 are the kg, p-b, and t-d keys of Fig. 2.5. A sample is shown below for the sentence “The Voder can speak well.”


    FIGURE 2.10 Clara Rockmore at the theremin.


    FIGURE 2.11 Classification of speech sounds for Voder use.


    FIGURE 2.12 Spectrogram of “greetings everybody” by an announcer.


    FIGURE 2.13 Spectrogram of “greetings everybody” by the Voder.


  • 2.3 Compare von Kempelen's speaking machine with Dudley's Voder.
    • (a) What are the chief differences?
    • (b) What are the chief similarities?
    • (c) How would you build a von Kempelen machine today?
  • 2.4 Figures 2.12 and 2.13 show spectrograms of the saying “greetings everybody” by the announcer and the Voder.
    • (a) What do you perceive to be the main difference between the natural and the synthetic utterances?
    • (b) Estimate the instants when the operator changes the Voder configuration.
  • 2.5 Synthesizers can be classified as articulatory based or auditory based. The former type works by generating sounds that are based on a model of how the sound is produced. The latter type relies on the properties of the ear to perceive sounds that are synthesized by different methods than the natural sounds that they imitate.

    Categorize each of the following as an articulatory-based or auditory-based synthesizer:

    • (a) telharmonium,
    • (b) Wheatstone–von Kempelen speaking machine,
    • (c) Voder,
    • (d) theremin, and
    • (e) player piano.


  1. Dudley, H., Riesz, R., and Watkins, S., “A synthetic speaker,” J. Franklin Inst. 227: 739, 1939.
  2. Dudley, H., and Tarnoczy, T. H., “The speaking machine of Wolfgang von Kempelen,” J. Acoust. Soc. Am. 22: 151–166, 1950.
  3. Flanagan, J. L., Speech Analysis, Synthesis and Perception, 2nd ed., Springer-Verlag, New York/Berlin, 1972.
  4. Fu, A. C., “Resynthesis of acoustic piano recordings,” M.S. Thesis, Massachusetts Institute of Technology, 1996.
  5. McAulay, R. J., and Quatieri, T. F., “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Trans. Acoust. Speech Signal Process. 34(4): 744–754, Aug. 1986.
  6. Riesz, R. R., personal communication to J. L. Flanagan, 1937. (Details of this work are described by Flanagan in [3, pages 207–208].)
  7. Von Kempelen, W., Le Mechanisme de la parole, suivi de la Description d'une machine parlante. Vienna: J.V. Degen, 1791.