CHAPTER 3


SPEECH ANALYSIS AND SYNTHESIS OVERVIEW

“If I could determine what there is in the very rapidly changing complex speech wave that corresponds to the simple motion of the lips and tongue, if I could then analyze speech for these quantities, I would have a set of speech defining signals that could be handled as low frequency telegraph currents with resulting advantages of secrecy, and more telephone channels in the same frequency space as well as a basic understanding of the carrier nature of speech by which the lip reader interprets speech from simple motions.”

–Homer Dudley, 1935

3.1 BACKGROUND

If we think of speech, for the moment, as a mode of transmitting word messages, and of telegraphy as simply another mode of performing the same action, we may immediately conclude that the intrinsic information rate of speech is the same as that of a telegraph signal generating words at the same average rate. Speech, however, also conveys emphasis, emotion, personality, and so on, and we still don't know how much bandwidth is needed to transmit these kinds of information.

In the following sections, we begin with some further historical background on speech communication.

3.1.1. Transmission of Acoustic Signals

Perhaps the earliest network for speech communication at long distances was a system that we'll call the “stentorian network,” which was used by the ancient Greeks. It consisted of towers and men with very loud voices. The following excerpts were found in Homer Dudley's archives:

Homer has written that the warrior Stentor, who was at the siege of Troy, had such a loud voice that it made more noise than fifty men all shouting at once. Alexander the Great (356–325 B.C.) seems to have had a method whereby a stentor's voice could be heard by the whole army. Did it consist of acoustical signals which were repeated from one soldier crier to another, organized as a transmitting group?

We quote the following from Caesar's commentaries: when extraordinary events happened, the Gauls relayed the information by shouting from one place to another: for example, the massacre of the Romans which took place at Orleans at sunrise was known at nine o'clock the same evening at Auvergne, forty miles away.

Diodorus of Sicily, a Greek historian living in the age of Augustus, said that at the order of the King of Persia, sentinels, who shouted the news which they wished to transmit to distant places, were stationed at intervals throughout the land. The transmission time was 48 hours from Athens to Susa, over 1500 miles apart.

We also note that, in addition to speech, flare signals were used as a communications medium. The Appendix to this chapter illustrates this with an excerpt from the Greek play Agamemnon (by Aeschylus) that describes the transmission of information about the fall of Troy (also see Fig. 3.13 in the Appendix).

3.1.2. Acoustical Telegraphy before Morse Code

The Dudley archives provide some fascinating examples of pre-Morse code communications:

Later a group of inventors, among whom we find Kircher (1601–1680), Scheventer (1636) and the two Bernoulli brothers, sought to transmit news long distances by means of musical instruments, each note representing a letter. One of the Bernoullis devised an instrument, composed of five bells, which permitted the principal letters of the alphabet to be transmitted.

It is told that the King of England was able to hear news transmitted 1.5 English miles to him by means of a trumpet. He had this trumpet taken to Deal Castle, whose commander said that this instrument permitted a person to make himself understood over a distance of three nautical miles. It was invented by the “genial mechanic” of Hammersmith, Sir Samuel Morland (1626–1696). It's [sic] mouthpiece was designed so that no sound could escape from either end. Morland published a treatise on this instrument entitled “Tube Stentorophonica” and in 1666 he wrote a report on “a new cryptographic process.”

In 1762 Benjamin Franklin experimented with transmitting sound under water. In 1785 Gauthoy and Biot transmitted words through pipes for a distance of 395 meters. But at a distance of 951 meters speech was no longer intelligible.

We can also regard the ringing of bells as acoustical telegraphy or telephony, if we consider that in certain Swiss villages the inhabitants recognize from their tone whether the person who has just died is a man or a woman, a member of a religious order, etc. Moreover, every Sunday the inhabitants of these villages follow the principal passages of the divine service with the aid of the pealing of the different bells. We have seen old people, prevented from attending the service because of their infirmities, with prayer book in hand, follow at a distance the priest's various movements.

Our story would be incomplete if we did not mention the African tom-tom, which some people consider a sort of acoustical telegraphy. The African explorer, Dr. A. R. Lindt, has written a short report on the tom-tom. We quote the following from his work: “There is no key to the acoustical telegraphy of the Africans. Since they have no written language, they are unable to divide their words into letters. The tom-tom therefore does not translate letter by letter or even word by word, but translates a series of well-defined thoughts into signals. There are different signals for all acts interesting to the tribe: mobilization, death of the chief and summons to a judicial convocation. However, the tom-tom also serves to transmit an order to a definite person. Thus, when a young man enters the warrior class, he receives a kind of call signal which introduces him and enables him to be recognized at a distance.

As yet, explorers have not been able to discover how intelligible the same signals are to different tribes. It is certain, however, that friendly tribes use the same signals. A settlement receiving a signal transmits it to the next village, so that in a few minutes a communication can be sent several hundred kilometers.”

Acoustical telegraphy is still used today by certain enterprises such as railroads, boats, automobiles, fire fighting services and alarm services.

This completes our quotations from the Dudley archives. We see that the concept of long-distance communication has a long history and that there is some evidence that speech communication at a distance was practiced by the ancients.

3.1.3. The Telephone

Proceeding more or less chronologically, we come to that most important development, the invention of the telephone by Alexander Graham Bell. There is no need to chronicle the well-known events leading to this invention and the enormous consequent effect on human communication; we restrict ourselves to several comments. It is interesting that Bell's primary profession was that of a speech scientist who had a keen understanding of how the human vocal apparatus worked, and, in fact, Flanagan [5] describes Bell's “harp telephone,” which showed that Bell understood the rudiments of the speech spectral envelope. Nevertheless, telephone technology has been mostly concerned with transmission methods. Recently, however, with the growing use of cellular phones in which transmission rate is limited by nature, efficient methods of speech coding have become an increasingly important component of speech research at many laboratories.

3.1.4. The Channel Vocoder and Bandwidth Compression

In a National Geographic magazine article [2], Colton gives an engrossing account of the history of telephone transmission. Figure 3.1, taken from that article, shows the telephone wires on lower Broadway in New York City in the year 1887. It is clear that progress in telephony could easily have been brought to a halt if not for improvements such as underground cables, multiplexing techniques, and fiber-optic transmission. Dudley pondered this traffic problem in a different way, that is, through coding to reduce the intrinsic bandwidth of the source rather than increasing the capacity of the transmission medium.

Just as the Voder was the first electronic synthesizer, so the channel vocoder [3] was the first analysis-synthesis system. The vocoder analyzer derived slowly varying parameters for both the excitation function and the spectral envelope. To quote Dudley, this device could lead to the advantages of “secrecy, and more telephone channels in the same frequency space.” Both of these predictions were correct, but the precise ways in which they came to pass (or are coming to pass) probably differ somewhat from how Dudley imagined them. In 1929, there was no digital communication. When digitization became feasible, it was realized that the least vulnerable method of secrecy was digitization itself. However, digitization also meant the need for wider transmission bandwidths. For example, a 3-kHz path from a local telephone cannot transmit a pulse-code modulation (PCM) speech signal coded at 64 kbits/s (the present telephone standard). The channel vocoder was thus quickly recognized as a means of reducing the speech bit rate to a number that could be handled through the average telephone channel, and this led eventually to a standard rate of 2.4 kbits/s.

image

FIGURE 3.1 Lower Broadway in 1887.

With respect to the second prediction, given that the science of bandwidth compression is now approximately 50 years old, one might assume that “more telephone channels in the same frequency space” would by now be a completely realized concept within the public telephone system. Such, however, is not the case. Although it is our opinion that Dudley's second prediction will eventually come true, it is fair to ask why it is taking so long. With the recent boom in wireless telephony, bandwidth has become an issue of even greater importance.

We conclude this section with a reference to an informative and entertaining paper by Bennett [1]. This paper is a historical survey of the X-System of secret telephony that was used during World War II. Now totally declassified, the X-System turns out to be a quite sophisticated version of Dudley's channel vocoder! It included such features as PCM transmission, logarithmic encoding of the channel signal, and, of course, enciphered speech. Bennett has many interesting anecdotes concerning the use of the X-System during the war.

3.2 VOICE-CODING CONCEPTS

To understand why a device such as a vocoder reduces the information content of speech, we need to know enough about human speech production to be able to model it approximately. Then we must convince ourselves that the parameters of the model vary sufficiently slowly to permit efficient transmission. Finally, we must be able to separate the parameters so that each one is coded optimally. The implementation of these concepts is captured by the phrase “analysis-synthesis system.” The analysis establishes the parameters of the model; these parameters are transmitted to the receiving end of the system and used to control a synthesizer with the goal of reproducing the original utterance as faithfully as possible.

A convenient way to understand vocoders is to begin with the synthesizer. A concise statement that helps define a model of speech is given by Fant [4]: “The speech wave is the response of the vocal tract to one or more excitation signals.” This concept leads directly to engineering methods to separate the source (the excitation signal) from the filter (the time-varying vocal tract). The procedures (and there are many) for implementing this separation can be called deconvolution, thus implying that the speech wave is a linear convolution of source and filter.1 In spectral terms, this means that the speech spectrum can be treated as the product of an excitation spectrum and a vocal tract spectrum. Figure 3.2 is a simplified illustration of the spectral cross section for sustained vowels. Numerous experiments have shown that such waveforms are quite periodic; this is represented in the figure by the vertical lines. In (a) the lines are farther apart, representing a higher-pitched sound; in (b) and (c) the fundamental frequency is lower.

image

FIGURE 3.2 Fine structure and spectral envelope of sustained vowels.

The spectral envelope determines the relative magnitudes of the different harmonics, and it, in turn, is determined from the specific shape of the vocal tract during the phonation of that vowel. Deconvolution is the process of physically separating the spectral envelope from the spectral fine structure, and in later chapters we describe methods of implementing such a process. Once this separation is accomplished, we can hypothesize, with some confidence, that both the spectral envelope and spectral fine structure can be efficiently parameterized, with consequent bandwidth savings during transmission.
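The claim that the speech spectrum is the product of an excitation spectrum and a vocal-tract spectrum is simply the convolution theorem applied to the source–filter model. A toy numerical check of this spectral form (the “source” and “filter” below are arbitrary stand-ins, not real speech data):

```python
import numpy as np

# Toy check: convolution in the time domain corresponds to
# multiplication of the corresponding spectra.
rng = np.random.default_rng(1)
source = rng.standard_normal(64)             # stand-in excitation signal
filt = np.exp(-np.arange(64) / 10.0)         # stand-in vocal-tract response
speech = np.convolve(source, filt)           # time-domain "speech", length 127
S = np.fft.fft(speech, 128)                  # spectrum of the convolution
P = np.fft.fft(source, 128) * np.fft.fft(filt, 128)  # product of spectra
assert np.allclose(S, P)                     # the two agree to rounding error
```

The agreement is exact because the DFT length (128) is at least as long as the convolution result (127), so no time-domain aliasing occurs.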

The parameters, if appropriately obtained, must vary relatively slowly because ultimately they depend on the articulator motions of the speech-producing mechanisms. Since these are human motions they obey the mechanical constraints imposed by the flesh-and-blood properties of the pertinent human organs, which move relatively slowly compared to typical speech bandwidths of 5 kHz.

The human vocal tract has been represented as a time-varying filter excited by one or more sources. The mechanism for this production varies according to the type of speech sound. Air pressure is supplied by the lungs. For vowel production, the cyclic opening and closing of the glottis creates a sequence of pressure pulses that excite resonant modes of the vocal and nasal tracts; the resulting energy is radiated from the mouth and nose to the listener.

For voiceless fricatives (e.g., s, sh, f, and th), the vocal cords are kept open and the air stream is forced through a narrow orifice in the vocal tract to produce a turbulent, noiselike excitation. For example, the constriction for “th” is between tongue and teeth; for “f” it is between lips and teeth.

For voiceless plosives (e.g., p, t, and k), there is a cross section of complete closure in the vocal tract, causing a pressure buildup. The sudden release creates a transient burst followed by a lengthier period of aspiration.

A more extensive categorization of speech sounds is given in Chapter 23, including some additional material about the articulator positions (tongue, lips, jaw, etc.) corresponding to these categories.

Several basic methods of source–filter separation and subsequent parameterization of each have been developed over the past half-century or so. We limit our discussion to four such methods: (a) the channel vocoder, (b) linear prediction, (c) cepstral analysis, and (d) formant vocoding. Details of these methods will be examined in later chapters; for now we discuss the general problem of source–filter separation and the coding of the parameters.

One way to obtain an approximation of the spectral envelope is by means of a carefully chosen bank of bandpass filters. Looking at Fig. 3.2, we see that the complete spectral envelope is not available; only samples of this envelope, at the frequencies determined by the vertical lines, are available. We assume that the fundamental frequency is not known, so we have no a priori knowledge of the sample positions. However, by passing the signal through a filter bank, where each filter straddles several harmonics, one can obtain a reasonable approximation to the spectral envelope. If the filter bandwidths are wide enough to encompass several harmonics, the resulting intensity measurements from all filters will not change appreciably as the fundamental frequency varies, as long as the envelope remains constant. This is the method employed for spectral analysis in Dudley's channel vocoder. The array of (slowly varying) intensities from the filter bank can now be coded and transmitted.
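The filter-bank analysis just described can be sketched as follows; the band count, the log-spaced band edges, the filter order, and the 25-Hz smoothing cutoff are illustrative choices for this sketch, not the parameters of Dudley's actual design.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def channel_vocoder_analysis(x, fs, n_bands=16, f_lo=100.0, f_hi=3200.0):
    """Slowly varying band intensities, in the spirit of the channel
    vocoder analyzer: bandpass each channel, rectify, then smooth."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)      # log-spaced band edges
    lp = butter(2, 25.0, fs=fs, output="sos")          # ~25-Hz smoothing filter
    intensities = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        bp = butter(2, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(bp, x)                          # isolate one channel
        intensities.append(sosfilt(lp, np.abs(band)))  # rectify and smooth
    return np.vstack(intensities)                      # shape (n_bands, len(x))
```

Each row of the result varies slowly enough to be resampled and coded at a far lower rate than the waveform itself, which is the source of the bandwidth savings.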

Linear prediction is a totally different way to approximate the spectral envelope. We hypothesize that a reasonable estimate of the nth sample of a sequence of speech samples is given by

ŝ(n) = a₁s(n − 1) + a₂s(n − 2) + ⋯ + aₚs(n − p).    (3.1)

In Eq. 3.1, the aₖ's must be computed so that the error signal

e(n) = s(n) − ŝ(n) = s(n) − a₁s(n − 1) − a₂s(n − 2) − ⋯ − aₚs(n − p)    (3.2)

is as small as possible. As we will show in Chapter 21, Eq. 3.1 and the computational structure used to minimize this error lead to an all-pole digital synthesizer network with a spectrum that is a good approximation to the spectral envelope of speech.
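A minimal numerical sketch of this least-squares fit, using the autocorrelation form of the normal equations (the predictor order and the synthetic test signal are illustrative assumptions; Chapter 21 develops the method properly):

```python
import numpy as np

def lpc_autocorrelation(x, p=10):
    """Solve for the predictor coefficients a_k of Eq. 3.1 by choosing
    them to minimize the total squared prediction error of Eq. 3.2,
    via the autocorrelation (Yule-Walker) normal equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]   # r[0..p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])    # normal equations: R a = r

def prediction_error(x, a):
    """Residual e(n) = s(n) - sum_k a_k s(n-k), evaluated for n >= p."""
    p = len(a)
    pred = np.zeros_like(x)
    for k in range(1, p + 1):
        pred[p:] += a[k - 1] * x[p - k:len(x) - k]
    return x[p:] - pred[p:]
```

Fitting a low-order predictor to a signal generated by an all-pole (autoregressive) process recovers the generating coefficients, and the residual energy is far smaller than the signal energy, which is exactly the redundancy a linear-predictive vocoder exploits.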

Source–filter separation can also be implemented by cepstral analysis, as illustrated in Figure 3.3. Figure 3.3(a) shows a section of a speech signal, Fig. 3.3(b) shows the spectrum of that section, and Fig. 3.3(c) shows the logarithm of the spectrum. The logarithm transforms the multiplicative relation between the envelope and fine structure into an additive relation. By performing a Fourier transform on Fig. 3.3(c), one separates the slowly varying log spectral envelope from the more rapidly varying (in the frequency domain) spectral fine structure, as shown in Fig. 3.3(d). Source and filter may now be separately coded and transmitted.

image

FIGURE 3.3 Illustration of source–filter separation by cepstral analysis.
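The steps of Fig. 3.3 can be sketched numerically as follows; the frame length, the liftering cutoff, and the small floor added before taking the logarithm are illustrative assumptions.

```python
import numpy as np

def cepstral_separation(frame, fs, n_lifter=30):
    """Source-filter separation in the spirit of Fig. 3.3: take the
    log-magnitude spectrum, then transform again into the "quefrency"
    domain.  Low quefrencies carry the slowly varying envelope; a peak
    at higher quefrency marks the pitch period of the fine structure."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spec) + 1e-10)     # Fig. 3.3(c): log spectrum
    cep = np.fft.irfft(log_mag)                # real cepstrum, Fig. 3.3(d)
    # Keep only the low quefrencies -> smooth log spectral envelope.
    lifter = np.zeros_like(cep)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0               # symmetric counterpart
    envelope = np.fft.rfft(cep * lifter).real
    # Pitch estimate: strongest cepstral peak above the lifter cutoff.
    peak = n_lifter + np.argmax(cep[n_lifter:len(cep) // 2])
    return envelope, fs / peak
```

Applied to a sustained voiced frame, the low-quefrency part returns the smooth log spectral envelope, while the location of the strongest high-quefrency peak gives the pitch period; each can then be coded separately.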

image

FIGURE 3.4 Wideband spectrogram.

Finally, formant analysis can be used for source–filter separation. In Chapters 10 and 11 (Wave Basics and Speech Production), the theory of vocal-tract resonance modes is developed. However, we can to some extent anticipate the result by studying the speech spectrograms of Figs. 3.4 and 3.5. These figures are three-dimensional representations of time (abscissa), frequency (ordinate), and intensity (darkness). Much can be said about the interpretation of spectrograms; here we restrict our discussion to the highly visible resonances or formants and to the difference between Fig. 3.4 (wideband spectrogram) and Fig. 3.5 (narrow-band spectrogram).

We see from Fig. 3.4 that during the vowel sounds, most of the energy is concentrated in three or four formants. Thus, for vowel sounds, an analysis could entail tracking the frequency regions of these formants as they change with time. Many devices have been invented to perform this operation and also to parameterize the speech for other sounds, such as fricatives (s, th, sh, f) or plosives (p, k, t); again we defer detailed descriptions until later.

Formant tracks are also visible in Fig. 3.5, but there is a significant difference between the two figures. Whereas Fig. 3.4 displays the periodicity of the signal during vowels as vertical striations, Fig. 3.5 displays the periodicity horizontally. An explanation of this difference is left as an exercise.

image

FIGURE 3.5 Narrow-band spectrogram.
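The difference between Figs. 3.4 and 3.5 comes down to the length of the analysis window, as a simple short-time Fourier transform sketch makes clear (the window and hop sizes here are illustrative choices):

```python
import numpy as np

def spectrogram(x, fs, win_ms, hop_ms=2.0, n_fft=1024):
    """Magnitude spectrogram of x.  A short window (e.g. 4 ms) gives a
    wideband display like Fig. 3.4; a long window (e.g. 32 ms) gives a
    narrow-band display like Fig. 3.5."""
    win = int(fs * win_ms / 1000.0)
    hop = int(fs * hop_ms / 1000.0)
    w = np.hanning(win)
    frames = np.array([x[i:i + win] * w
                       for i in range(0, len(x) - win, hop)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T  # (freq, time)
```

A 4-ms window has an analysis bandwidth wider than the spacing between harmonics, so the harmonics merge into formant bands while the pitch periods show up as vertical time striations; a 32-ms window has a bandwidth narrower than the harmonic spacing, so the individual harmonics appear as horizontal lines instead.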

3.3 HOMER DUDLEY (1898–1981)

Homer Dudley's inventions of the channel vocoder and Voder triggered a scientific and engineering effort that is still in progress. On my first visit to the Bell Laboratories in 1961,2 I was hosted by Dr. Ed David, who then managed the speech-processing group. As we passed an office, David whispered to me, “That's Homer Dudley.” I was not introduced to Mr. Dudley and on subsequent visits did not see him. At that time he was near retirement age and, I suppose, not in the mainstream of Bell Laboratories' work. Quite a few years later (in the late 1960s), Lincoln Laboratory was privileged to have the then-retired inventor as a consultant. We mention several items of interest from his brief stay there.

Dudley had a strong feeling that we should study speech waveforms as much as, and perhaps more than, speech spectrograms. He felt that with practice, one could learn to read these waveforms. Dudley's speculation remains unproven. However, in an effort to support his claim, Dudley, with the help of Everett Aho, produced photographs that are very informative and aesthetically pleasing. They are reproduced here as Figs. 3.6–3.11. Observing these waveforms, one develops a good feeling for the relative duration and amplitude of the vowels versus the consonants. In addition, we see the precise timing of the burst and voice-onset time of the voiced plosive sounds, b, d, and g. An inspection of the vowel sound I, as in “thin” or “fish,” illustrates the high frequency of the second resonance and the low frequency of the first resonance. We also note that the energy of the “sh” sound in “fish” is much stronger than that of the “f” sound. Many other relationships among the acoustic properties of the phonemes can be found by careful observation of good-quality speech waveforms.

In 1967, a vocoder conference was organized under the auspices of the U.S. Air Force Cambridge Research Laboratory (AFCRL). Dudley was honored at this conference. Figure 3.12 shows Dudley displaying the plaque to the audience.

In 1969, when Dudley discontinued his consultancy at Lincoln Laboratory, he entrusted one of the authors (Gold) with two boxes filled with various technical information, plus a large number of newspaper clippings on his inventions. These have been used freely in this chapter, and they have been donated to the archives of the Massachusetts Institute of Technology.

In 1981, we received the news of his death at age 83. Tributes to him were written by Manfred Schroeder and James L. Flanagan, who worked with Dudley at Bell Laboratories and appreciated his monumental contributions.

image

FIGURE 3.6 Dudley's waveform display.

image

FIGURE 3.7 Continuation of Dudley's waveform display.

image

FIGURE 3.8 Conclusion of Dudley's waveform display.

image

FIGURE 3.9 Dudley's second waveform display.

image

FIGURE 3.10 Continuation of Dudley's second waveform display.

image

FIGURE 3.11 Conclusion of Dudley's second waveform display.

image

FIGURE 3.12 Dudley receiving an award.

3.4 EXERCISES

  • 3.1 Explain why wideband spectrograms show periodicity in time whereas narrow-band spectrograms show periodicity in frequency.
  • 3.2 Invent a display that shows periodicity in both time and frequency.
  • 3.3 Can you think of a reason why spectrograms are preferable visual displays to direct oscillographic waveforms?
  • 3.4 Which sounds are more likely to be better understood from waveforms? From spectrograms?
  • 3.5 Construct a table for the phonemes of the phrase “we pledge you some heavy treasure.” The leftmost column should list the phonemes alternating with the transition regions; the next column should list your best estimate of the beginning; and the third column should list the end of the speech section. Base your estimates on Figs. 3.4 and 3.5.
  • 3.6 Construct a syllable table in the same manner as in the previous exercise.
  • 3.7 During World War II, Roosevelt and Churchill conversed by telephone between London and Washington, using a channel vocoder. Explain why the vocoder was an important component of the communications link.
  • 3.8 The phrase “carrier nature of speech” was proposed by Dudley as a way of explaining how a vocoder could represent speech with fewer bits (or less bandwidth). Explain how channel vocoders, linear predictive vocoders, and cepstral vocoders implement this concept and, as a result, represent the speech signal more efficiently than a standard telephone or PCM system.

3.5 APPENDIX: HEARING OF THE FALL OF TROY

LEADER OF CHORUS:

I come to do you reverence, Clytemnestra.

For it is right to give the king's wife honor,

A woman on a throne a man left empty

But if you know of good or only hope

to hear of good and so do sacrifice,

I pray you speak. Yet if you will, keep silence.

CLYTEMNESTRA:

With glad good tidings, so the proverb runs,

may dawn arise from the kind mother night.

For you shall learn a joy beyond all hope:

the Trojan town has fallen to the Greeks.

LEADER:

You say? I cannot hear–I cannot trust–

CLYTEMNESTRA:

I say the Greeks hold Troy. Do I speak clear?

LEADER:

Joy that is close to tears steals over me.

CLYTEMNESTRA:

Quite right. Such tears give proof of loyalty.

LEADER:

What warrant for these words? Some surety have you?

CLYTEMNESTRA:

I have. How not–unless the gods play tricks.

LEADER:

A fair persuasive dream has won your credence?

CLYTEMNESTRA:

I am not one to trust a mind asleep.

LEADER:

A wingless rumor then has fed your fancy?

CLYTEMNESTRA:

Am I some little child that you would mock at?

LEADER:

But when, when, tell us, was the city sacked?

CLYTEMNESTRA:

This night, I say, that now gives birth to dawn.

LEADER:

And what the messenger that came so swift?

CLYTEMNESTRA:

A god! The fire-god flashing from Mount Ida.

Beacon sped beacon on, couriers of flame.

First, Ida signaled to the island peak

of Lemnos, Hermes' rock, and swift from there

Athos, God's mountain, fired the great torch.

It leaped, it skimmed the sea, a might of moving light.

joy-bringing, golden shining, like a sun,

and sent the fiery message to Macistus.

Whose towers, then, in haste, not heedlessly

or like some drowsy watchman caught by sleep,

sped on the herald's task and flashed the beacon

afar, beyond the waters of Euripus

to sentinels high on Messapius' hillside,

who fired in turn and sent the tidings onward,

touching with flame a heap of withered heather.

So, never dimmed but gathering strength, the splendor

over the levels of Asopus sprang,

lighting Cithaeron like the shining moon,

rousing a relay there of travelling flame.

Brighter beyond their orders given, the guards

kindled a blaze and flung afar the light.

It shot across the mere of Gorgopis.

It shone on Aegiplanctus' mountain height,

swift speeding on the ordinance of fire,

where watchers, heaping high the tinder wood,

sent darting onward a great beard of flame

that passed the steeps of the Saronic Gulf

and blazing leaped aloft to Arachnaeus,

the point of lookout neighbor to our town.

Whence it was flashed here to the palace roof,

a fire fathered by the flame on Ida.

Thus did they hand the torch on, one to other,

in swift succession finishing the course.

And he who ran both first and last is victor.

Such is my warrant and my proof to you:

my lord himself has sent me word from Troy.

image

FIGURE 3.13 Map, showing the communications path described in Agamemnon.

BIBLIOGRAPHY

  1. Bennett, W. R., “Secret telephony as a historical example of spread-spectrum communication,” IEEE Trans. Commun. COM-31: 98–104, 1983.
  2. Colton, F. B., “The miracle of talking by telephone,” National Geographic 70(4): 395–433, 1937.
  3. Dudley, H., “The vocoder,” Bell Labs Record 17: 122–126, 1939.
  4. Fant, G., Acoustic Theory of Speech Production, Morton, S-Gravenhage, 1960.
  5. Flanagan, J. L., Speech Analysis Synthesis and Perception, 2nd ed., Springer-Verlag, New York/Berlin, 1972.

1Acoustic speech or music production often involves varying degrees of nonlinear behavior, usually at the interface between excitation and filter. New research is now being directed at this subject. In many cases we expect that the resulting effects will be minor, but there could be surprises.

2Of course this is Gold speaking here. Morgan was 12 at the time, and Ellis had yet to be born.
