For many years, Gunnar Fant directed the Speech Transmission Laboratory in Stockholm. He performed X-ray measurements to determine the shape of the human vocal tract during phonation. His book Acoustic Theory of Speech Production [1], based to a great extent on his doctoral thesis, was published in 1970 and contained detailed information on vocal tract shapes.

To each phoneme in any spoken language there corresponds one or more sequences of vocal tract shapes. With the development of digital signal-processing concepts, these shapes can be modeled efficiently. In Chapter 10 we showed how simple acoustic tubes could be digitally modeled. In this chapter, these ideas are extended to more complicated acoustic tube structures that relate to spoken sounds.


Fant first traced area functions from the X-ray data. An example is shown in Fig. 11.1 for the vowel /i/ as in /bid/.

On the left is the tracing, and on the right we see the area of the tube as a function of the distance from the glottis. This area function is quantized as a concatenation of cylindrical tubes. This string of tubes can now be approximated by analog T networks [1] or digital waveguides [4]. For a practical system (four or more tubes), direct mathematical derivation becomes difficult, and computer simulation with digital waveguides is more effective than simulation with analog T networks. So we begin with digital waveguides and then add speech-specific attributes such as source function properties.
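The quantization step described above can be sketched in a few lines. This is only an illustration: the function name `quantize_area_function`, the choice of equal-length tubes, and the use of the mean area over each segment are assumptions, and the smooth area function in the usage lines is invented for demonstration and is not Fant's data.

```python
def quantize_area_function(area, length_cm, n_tubes, samples_per_tube=50):
    """Approximate a continuous area function by n_tubes equal-length cylinders.

    area      -- callable giving cross-sectional area at distance x from the glottis
    length_cm -- total tract length
    Returns one area per tube: the mean of `area` sampled at segment midpoints.
    """
    seg = length_cm / n_tubes
    areas = []
    for k in range(n_tubes):
        # Midpoint samples inside the k-th segment.
        xs = [(k + (i + 0.5) / samples_per_tube) * seg
              for i in range(samples_per_tube)]
        areas.append(sum(area(x) for x in xs) / samples_per_tube)
    return areas


# Illustrative usage with a made-up area function (hypothetical shape, in cm^2).
example_areas = quantize_area_function(lambda x: 1.0 + 0.2 * x, 17.5, 4)
```

Using more tubes follows the traced area function more closely, at the cost of a longer chain of sections to analyze.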

Our aim is to establish relationships between various acoustic tube structures and the resonant modes resulting from these structures. We will see that even a small number of tubes exhibit resonances that resemble formant measurements of the different phonemes.

Figure 11.2 shows a single section of a digital waveguide. This figure is a graphical representation of the equations describing the pressure and volume velocity at the two ends of a lossless, uniform acoustic tube, governed by Eqs. 10.17–10.20. From this figure we now derive the relationships between uk, pk (the inputs) and uk+1, pk+1 (the outputs).


FIGURE 11.1 X-ray tracing and area function for phoneme /i/. From [1].

For the kth section we can write the equations


where Ak is the cross-sectional area, Vk = ρc/Ak, ρ is the density of the gas in the tube, and c is the velocity of sound in the tube. We have omitted the arguments, remembering that u+ always has (t – x/c) as an argument and u− always has (t + x/c) as an argument. Notice the similarity between Eqs. 11.1 and 11.2 and Eqs. 10.17 and 10.18.
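For reference, the traveling-wave decomposition that these definitions describe is commonly written in the form below. This is a sketch in the standard form, assuming that volume velocity is the difference of the forward and backward waves and that pressure is their sum scaled by Vk; the sign convention is an assumption and should be checked against Eqs. 10.17 and 10.18.

```latex
\begin{align}
u_k &= u_k^{+} - u_k^{-} \\
p_k &= V_k \left( u_k^{+} + u_k^{-} \right), \qquad V_k = \frac{\rho c}{A_k}
\end{align}
```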

In what follows, we use the same notation for the space-time functions and their z transforms. No confusion should result, since the z-transform versions always explicitly include z.



An inspection of Fig. 11.2 yields the equations



FIGURE 11.2 Single section of a digital waveguide. M is the delay in units of the sampling period.

and this leads to the following set of two equations in the two unknowns, u+k and u−k


The solutions are


Substituting Eqs. 11.9 and 11.10 into Eqs. 11.1 and 11.2, we arrive at the basic chain relationship between the kth and (k + 1)th stage:


Since our interest is to determine the resonances in the system, and since, for a lossless tube, the poles always appear on the unit circle, we can replace z by e^{jθ} (where θ = ωT) so that Eqs. 11.11 and 11.12 become


It is useful to express Eqs. 11.13 and 11.14 in matrix form; adding additional sections results in successive matrix multiplications.
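One consistent way to write the single-section relationship in matrix form is sketched below. It assumes that M is the one-way delay through the section (so the forward wave arrives at the far end as z^{-M} u_k^{+} and the backward wave leaves it as z^{M} u_k^{-}), that z = e^{jθ}, and that the vector is ordered as (p, u). The ordering and the exact form are assumptions and may differ from the numbered equations.

```latex
\begin{equation}
\begin{pmatrix} p_k \\ u_k \end{pmatrix}
=
\begin{pmatrix}
\cos M\theta & j V_k \sin M\theta \\
\dfrac{j}{V_k} \sin M\theta & \cos M\theta
\end{pmatrix}
\begin{pmatrix} p_{k+1} \\ u_{k+1} \end{pmatrix}
\end{equation}
```

The determinant of this matrix is cos²Mθ + sin²Mθ = 1, as expected for a lossless section.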


Thus, for example, if we want the relationship between the kth section and the (k + 2)th section pictured in Fig. 11.3, we can write down the matrix result



FIGURE 11.3 Two-section digital waveguide. M and L are the delays of successive sections, in units of the sampling period.
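The two-section product can also be checked numerically. The sketch below assumes a chain matrix per section acting on the vector (p, u), with M and L taken as one-way delays in samples; the function names `section_matrix`, `matmul2`, and `two_section_matrix` are hypothetical helpers, not notation from the text.

```python
import math


def section_matrix(theta, delay, v):
    """Chain matrix of one lossless section (assumed form).

    Maps (p, u) at the far end of the section to (p, u) at the near end,
    for z = exp(j*theta), one-way delay `delay` samples, and V = rho*c/A = v.
    """
    c = math.cos(delay * theta)
    s = math.sin(delay * theta)
    return [[complex(c, 0.0), complex(0.0, v * s)],
            [complex(0.0, s / v), complex(c, 0.0)]]


def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][n] * b[n][j] for n in range(2)) for j in range(2)]
            for i in range(2)]


def two_section_matrix(theta, m, l, v_k, v_k1):
    """Matrix B relating (p, u) at section k to (p, u) at section k + 2."""
    return matmul2(section_matrix(theta, m, v_k),
                   section_matrix(theta, l, v_k1))
```

Under these assumptions the lower-right element of the product is cos Mθ cos Lθ − (Vk+1/Vk) sin Mθ sin Lθ, which is the quantity whose zeros locate the resonances when the far end is open.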



For many speech sounds, particularly vowels, the mouth is open, so that the acoustic pressure at the mouth opening is approximately zero; setting pk+2 to zero in Eq. 11.17, we get the simple relationship


Setting B22 to zero allows us to solve for the poles on the unit circle. B22 can be simplified by noting that Vk = ρc/Ak (the definition given with Eqs. 11.1 and 11.2) and using the trigonometric identities


Then, letting the factor r = (Ak+1 – Ak)/(Ak+1 + Ak), we arrive at the relationship


Given the parameters M, L, and r, we can find those values of θ that correspond to the resonances (formants) of the two-tube structure. Figure 11.4 shows a plot of the positions of formants 1 and 2.
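The step from B22 = 0 to a condition in M, L, and r can be sketched as follows. The sketch assumes that each section matrix acts on the vector (p, u), that M and L are one-way delays, and that Vk+1/Vk = Ak/Ak+1; the product-to-sum identities are the standard ones, and the final form may differ in layout from the numbered equations.

```latex
\begin{align}
B_{22} &= \cos M\theta \, \cos L\theta - \frac{A_k}{A_{k+1}} \sin M\theta \, \sin L\theta = 0 \\
\cos M\theta \, \cos L\theta &= \tfrac{1}{2}\left[ \cos (M+L)\theta + \cos (M-L)\theta \right] \\
\sin M\theta \, \sin L\theta &= \tfrac{1}{2}\left[ \cos (M-L)\theta - \cos (M+L)\theta \right] \\
\cos (M+L)\theta + r \cos (M-L)\theta &= 0, \qquad r = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}
\end{align}
```

For r = 0 (equal areas) this reduces to cos (M+L)θ = 0, the odd-quarter-wavelength resonances of a uniform tube that is closed at one end and open at the other.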

Each curve in this f1, f2 plane corresponds to specific values of M and L, and the curves trace out a trajectory that is a function of the ratio A2/A1. Also shown in the figure are the f1, f2 points for various vowels obtained from the work of Peterson and Barney [2]. Any curve passing close to a vowel implies that there exists a two-stage digital waveguide that has approximately the same f1, f2 value as that vowel. Notice that not all vowels are close to a trajectory; such vowels require a model in which the number of stages exceeds two. Also, this analysis has ignored matches to higher formants; again, a model with more stages is required. However, as is made clear in the studies by Fant [1] and Portnoff [3], an acoustic configuration can always be found to match the measured steady-state spectrum of any speech sound.
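A trajectory of this kind can be traced numerically. The sketch below assumes the resonance condition takes the form cos((M+L)θ) + r cos((M−L)θ) = 0 with M and L as one-way delays, and converts θ to hertz using a sampling rate `fs` that is a free parameter here, not a value from the text. The function names and the scan-then-bisect root search are illustrative choices.

```python
import math


def resonance_angles(m, l, r, count=2, steps=20000):
    """First `count` roots in (0, pi) of cos((m+l)t) + r*cos((m-l)t) = 0."""
    def f(t):
        return math.cos((m + l) * t) + r * math.cos((m - l) * t)

    roots = []
    prev_t, prev_f = 0.0, f(0.0)
    for i in range(1, steps + 1):
        t = math.pi * i / steps
        cur = f(t)
        if prev_f * cur < 0.0 or cur == 0.0:
            # Refine the bracketed root by bisection.
            lo, hi, f_lo = prev_t, t, prev_f
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                f_mid = f(mid)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
            if len(roots) == count:
                break
        prev_t, prev_f = t, cur
    return roots


def angles_to_hz(angles, fs):
    """Convert normalized angles theta = omega*T to frequencies in Hz."""
    return [t * fs / (2.0 * math.pi) for t in angles]


def formant_trajectory(m, l, area_ratios, fs):
    """(ratio, [f1, f2]) pairs for each area ratio A2/A1 at fixed m and l."""
    points = []
    for ratio in area_ratios:
        r = (ratio - 1.0) / (ratio + 1.0)
        points.append((ratio, angles_to_hz(resonance_angles(m, l, r), fs)))
    return points
```

As a sanity check on the assumptions, with fs = 20000 Hz, m = l = 5, and equal areas (r = 0), the first two roots fall at 500 Hz and 1500 Hz, the familiar values for a uniform tube of about 17.5 cm if c is taken as 350 m/s.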


FIGURE 11.4 Formants 1 and 2 obtained from the two-tube model.


Thus far we have shown how an acoustic tube or combinations of such tubes respond to acoustic stimuli. In the human vocal system, three types of excitation exist. The speech signal is the response of the vocal tract to some combination of the three exciting signals.

During the production of vowels and vowellike sounds, the excitation is a nearly periodic sequence of pulselike pressure changes occurring at the glottal opening. Pressure changes originating in the lungs force open the vocal cords; elastic forces then quickly close them, pressure forces them open again, and the process repeats. Neurologically controlled muscles determine the vocal cord tension and hence the degree of elasticity; thus, the frequency of this excitation signal is controlled by the speaker.

Vowels generally are excited as described above, but not always. Vowels can be whispered. In such cases the vocal cords remain open but the air stream must pass through the small glottal opening; this produces turbulence, a noiselike component in the air stream. The resonances of the vocal tract will further shape the pressure wave to produce the whispered vowel.

Turbulence can also be produced by constrictions in other parts of the vocal tract; for example, for voiceless fricatives to be generated, noise can be generated at the tongue-tip-teeth constriction (/s/ or /th/), or, further back in the vocal tract, at the tongue–upper-palate constriction (/sh/), or at the teeth–lower-lip constriction (/f/). These excitation signals are acted on by the vocal tract complex to produce the various spectra typifying the different fricative sounds.

Such excitations can take place in concert with glottis-controlled excitations during voiced fricatives. The vocal tract configurations during these sounds are the same as for the corresponding voiceless fricatives, but the vocal cords can be simultaneously vibrated, yielding sounds that contain both quasiperiodic and noise components.

Transients in the vocal tract are another source of excitation. If pressure is built up anywhere in the tract by occlusion, sudden removal of the occlusion causes a sudden pressure change that propagates throughout the vocal cavity. This occurs, for example, for /p/, /k/, and /t/.
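The three kinds of excitation described above can be caricatured as discrete-time signals. The sketch below is a deliberately crude illustration: an ideal impulse train stands in for the glottal pulses, Gaussian white noise for turbulence, and a single impulse for the release of an occlusion. All function names and parameters are hypothetical, and real glottal pulses and frication noise have considerably more structure.

```python
import random


def pulse_train(n_samples, period):
    """Periodic excitation: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def noise_source(n_samples, seed=0):
    """Turbulence-like excitation: zero-mean Gaussian white noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def transient(n_samples, onset):
    """Plosive-like excitation: a single impulse at sample `onset`."""
    return [1.0 if n == onset else 0.0 for n in range(n_samples)]


def mixed_excitation(n_samples, period, noise_gain, seed=0):
    """Voiced-fricative-like excitation: pulses plus scaled noise."""
    voiced = pulse_train(n_samples, period)
    noise = noise_source(n_samples, seed)
    return [v + noise_gain * w for v, w in zip(voiced, noise)]
```

Passing any of these signals through a tube model of the vocal tract would give the corresponding crude approximation to a vowel, a whispered or fricative sound, or a plosive burst.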


  • 11.1 In Section 11.2 it is stated that for a lossless tube, the poles always appear on the unit circle. Can you justify this claim?
  • 11.2 Derive Eqs. 11.18–11.21 given Eqs. 11.16 and 11.17.
  • 11.3 What are the boundary conditions on pk and uk for an open tube and for a closed tube?
  • 11.4 Sketch an acoustic tube model of a voiceless fricative sound such as /sh/. This sketch should be of a qualitative nature accompanied by some intuitive justification of your solution.
  • 11.5 Repeat the previous problem for the voiceless plosives.
  • 11.6 Repeat again for the voiced fricatives and plosives.
  • 11.7 Finally, repeat for the nasal sounds.
  • 11.8 If the space–time solution to the wave equation can be expressed as the separable product of a time function and a space function, derive explicit solutions for these two functions. Base your solution on the case of a single space dimension.


  1. Fant, G., Acoustic Theory of Speech Production, Mouton, The Hague, 1970.
  2. Peterson, G. E., and Barney, H. L., “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24: 175–184, 1952.
  3. Portnoff, M. R., “A quasi-one-dimensional digital simulation for the time-varying vocal tract,” M.S. Thesis, Massachusetts Institute of Technology, Cambridge, Mass., 1973.
  4. Smith, J. O. III, “Physical modelling using digital waveguides,” Comp. Mus. J. 16: 74–87, 1992.