In some of the previous chapters, we have stressed the model of speech and music production as consisting of one or more excitations that drive a time-variable filter. In this chapter we focus on the excitation model, and in particular on the extraction of pitch frequency. The time-variable filter that results in the spectral envelope can be estimated in different ways, including filter banks, cepstra, and linear prediction (see Chapters 19, 20, and 21), as well as combinations of these approaches (see Chapter 22).

Modeling of the excitation function of speech requires paying particular attention to the following components: (a) the periodic or nearly periodic opening and closing of the glottis during voicing; (b) the shape of the glottal pressure pulse; (c) the position in the vocal system of the constriction that creates turbulent flow during unvoiced sound; (d) the nature of the excitation function during stop consonant articulation; (e) how voicing and turbulence combine during articulation of the voiced fricative sounds; and (f) possible nonlinear interactions between excitation and acoustic tube response.

In many ways, accurate modeling of the excitation parameters is more complex than modeling of the time-varying linear filter that we use to represent the vocal tract. Channel vocoder researchers in the 1950s must have been somewhat aware of this when they stated that vocoders of that era lacked good pitch detectors. During this period and later (into the 1960s and 1970s), many novel methods of tracking the voice fundamental frequency were invented; these included algorithms for distinguishing buzz (quasi-periodic voicing) from hiss (turbulent air flow; therefore, noiselike excitation). Although a comprehensive model of human speech excitation remains elusive, here we outline the substantial progress that has been made.

As discussed in Chapter 16, the word pitch (in the context of speech processing), as defined operationally by psychoacousticians, is the frequency of a pure tone that is matched by the listener to a more complex (usually periodic) signal. This is a subjective definition. When engineers speak of a “pitch detector,” they usually refer to a device that measures the fundamental frequency of an incoming signal; this is an objective definition. In this chapter, pitch perception refers to the subjective result and pitch detection refers to an objective result. Pitch detection and fundamental frequency estimation are often used interchangeably.

In Chapter 16, several models of pitch perception were treated. Many ideas in pitch detection (but not all of them) are reminiscent of these models, which yielded insights that help us invent better pitch detectors. However, we gain additional insight by considering how the excitation function in speech or music is produced.

Homer Dudley's design of the original channel vocoder included a pitch detector. At that time, many psychoacousticians believed Helmholtz's assertion that the fundamental frequency component must exist at some level in order to perceive the pitch. It is interesting to speculate whether Dudley was influenced by this belief. The Dudley pitch detector was based on the articulatory premise that the voiced speech signal always included the fundamental frequency component; his design consisted of a slope filter designed to enhance this component to make it easier for the hardware to correctly measure its frequency. The slope filter is shown in Fig. 31.1.

It can be seen from the sketches of Fig. 31.2 that passage through the slope filter of a signal with almost equal first and second harmonics greatly reduces the second harmonic relative to the fundamental.

A simple way to extract the fundamental period is shown in Fig. 31.3. First, the positive peaks of the signal are found; this is followed by the detection algorithm shown in the figure. Dudley made use of the knowledge that unadulterated speech usually (perhaps always) contains a significant fundamental component. Thus, for example, Charles Vaderson successfully demonstrated a channel vocoder with Dudley's pitch detector in 1950. However, many practical communication systems (e.g., telephones) are band limited, and the fundamental component of the speech may be completely missing.^{1} Furthermore, environmental noise may completely mask the fundamental. This leads us to describe more complex signal-processing algorithms for conditioning the speech prior to detection.

In many cases, some form of an autocorrelation function (typically computed on a low-pass-filtered version of the speech signal) is used as the core method. For a periodic signal whose period lies within the passband of the filter, the autocorrelation function shows peaks at multiples of the signal period, and the influence of noise is reduced to some extent.
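The core computation can be sketched in a few lines; the function name and parameter values here are our own illustrative choices, not a prescription from the chapter:

```python
# Illustrative autocorrelation-based period estimation (names are our own).
import numpy as np

def autocorr_period(x, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch period (in samples) from the autocorrelation peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # r[tau] for tau >= 0
    lo = int(fs / fmax)          # shortest admissible period
    hi = int(fs / fmin)          # longest admissible period
    return lo + int(np.argmax(r[lo:hi + 1]))

# Example: a 200-Hz periodic signal sampled at 8 kHz.
fs = 8000
t = np.arange(0, 0.04, 1.0 / fs)
x = np.sign(np.sin(2 * np.pi * 200 * t)) + 0.5 * np.sin(2 * np.pi * 400 * t)
period = autocorr_period(x, fs)   # expected near fs/200 = 40 samples
```

Restricting the search to lags between fs/fmax and fs/fmin is what keeps the large peak at zero lag from being selected.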

Certain speech sounds, such as the voiceless fricatives /s/, /sh/, /th/, and /f/, can be modeled as the output of an acoustic tube complex when a portion of the tube has very narrow cross section (a constriction), causing the airflow to become turbulent. For our present discussion, it is sufficient to equate turbulence to the presence of a random noise source. For example, in the production of /s/, the narrow, turbulent cross section is located between the tongue tip and upper teeth. Thus, the source is close to the mouth opening and the excitation is shaped by various reflections in the vocal tract. A reasonable model of the excitation for these sounds is that of white noise, which is then shaped by the vocal tract for that sound.

The voiced fricatives /z/, /th/ (as in the), /zh/ (as in azure), and /v/ are controlled by the same vocal tract shape as their voiceless counterparts, but, in addition, the vocal cords are simultaneously vibrating. Thus, there are two sources of excitation in this case; furthermore, since the periodic excitation is formed at the glottis and the noise is formed near the lips, the two sources excite the vocal tract quite differently.

The voiceless plosives, /p/, /t/, and /k/, involve a transient burst followed by noiselike aspiration. As discussed in Chapter 17, the formant transitions at the start of voicing are auditory cues for distinction among these three sounds, so this has to be part of an articulator model. In Chapter 32, we relate how these modeling issues have been dealt with in vocoder design.

A number of features have been used to train and use a classifier for the voicing decision, such as spectral slope (often estimated by normalizing the first autocorrelation coefficient by the energy), high frequency energy vs. low frequency energy, and features related to confidence in a pitch decision. In experiments performed by one of the authors, speaker-specific neural networks incorporating feature selection were found to be very useful for this classification task [12].

Figure 31.4 illustrates some of the problems encountered in pitch detection. Figure 31.4(a) shows two speech waveforms; the bottom signal has a period approximately one-fourth of the top signal. This illustrates the large dynamic range of the voice fundamental frequency. The pitch of some male voices can be as low as 60 Hz, whereas the pitch of children's voices can be as high as 800 Hz. Figure 31.4(b) shows how the period can fluctuate drastically and almost instantaneously. The leftmost period is quite short, but the next five periods are more than twice as long before snapping back to shorter periods. This kind of behavior makes pitch tracking difficult. Figure 31.4(c) shows a rapid change in the spectrum caused, for example, by sudden closure as in a vowel-to-nasal transition. Although the fundamental frequency has not changed drastically, pitch detection based on waveform analysis can suffer. Figure 31.4(d) shows a transition region from aperiodic (hiss) excitation to quasi-periodic (buzz) excitation. For the precise transition instant to be caught, a fast-acting time-domain detector would be best. Finally, Figs. 31.4(e) and 31.4(f) show the effect of speech degradation that is caused by telephone transmission and added acoustic noise, causing extra problems in pitch extraction.

Several methods of conditioning the speech signal to improve pitch detection have proved useful. Among these, we include low-pass filtering, spectral flattening and correlation, inverse filtering, comb filtering, cepstral processing, and high-resolution spectral analysis.

**Low-pass filtering**: We know from Chapter 16 that human pitch perception pays more attention to the lower frequencies. Interestingly, estimating the pitch period by eye is typically easier with low-passed waveforms such as those shown in Fig. 31.6 than with full-band waveforms such as those shown in Fig. 31.5. It thus seems plausible that a pitch-detection device would have less trouble finding the correct period by analyzing the signal of Fig. 31.6 than that of Fig. 31.5. This has proved true in practice.
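The effect of low-pass prefiltering can be illustrated with even a crude filter; the 8-point moving average below is our own arbitrary choice, not the chapter's filter:

```python
# A minimal low-pass prefilter sketch: an m-point moving average has spectral
# nulls at multiples of fs/m, so here the 2-kHz component is nulled exactly.
import numpy as np

def lowpass_moving_average(x, m):
    """Smooth x with an m-point moving average (a crude low-pass filter)."""
    return np.convolve(x, np.ones(m) / m, mode="same")

fs = 8000
t = np.arange(0, 0.04, 1.0 / fs)
# 125-Hz fundamental plus a strong, distracting component at 2 kHz
x = np.sin(2 * np.pi * 125 * t) + 0.8 * np.sin(2 * np.pi * 2000 * t)
y = lowpass_moving_average(x, 8)   # nulls at 1, 2, 3 kHz; 125 Hz barely touched
```

After smoothing, the high-frequency ripple that confuses peak-based period measurement is largely gone while the fundamental survives almost unattenuated.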

**Spectral flattening and correlation**: A more sophisticated concept was proposed by Sondhi [18]. It is based on the observation that a Fourier series representation of harmonics of equal amplitude and zero phase results in a signal that is very much like a pulse train. Sondhi proposed that the original signal first be spectrally flattened. An approximation to this operation is shown in Fig. 31.7, where the outputs of a bank of bandpass filters (BPFs) are divided by their own energy and the components added.

The sum is now sent through an autocorrelator, which creates a zero-phase time function, thus approximating the equal harmonic–zero-phase criterion proposed by Sondhi. Figure 31.8 shows the effect of autocorrelation.
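A rough sketch of this flattening idea, substituting an FFT-domain filterbank for the analog bandpass filters of Fig. 31.7 (the band width and function names are our own choices):

```python
# Sketch of Sondhi-style spectral flattening [18]: split the spectrum into
# bands, normalize each band to unit energy, resynthesize, then autocorrelate.
import numpy as np

def spectral_flatten(x, fs, band_hz=100.0):
    """Normalize each frequency band to unit energy (FFT-domain filterbank)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    Y = np.zeros_like(X)
    edges = np.arange(0.0, fs / 2 + band_hz, band_hz)
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = (freqs >= lo) & (freqs < hi)
        energy = np.sqrt(np.sum(np.abs(X[idx]) ** 2))
        if energy > 1e-8:          # skip essentially empty bands
            Y[idx] = X[idx] / energy
    return np.fft.irfft(Y, n=len(x))

fs = 8000
t = np.arange(480) / fs
# Harmonics of 200 Hz with very unequal amplitudes
x = (np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 400 * t)
     + 0.05 * np.sin(2 * np.pi * 600 * t))
y = spectral_flatten(x, fs)
r = np.correlate(y, y, mode="full")[len(y) - 1:]
tau = 16 + int(np.argmax(r[16:161]))   # search 50-500 Hz; expect 40 samples
```

After flattening, the weak upper harmonics carry as much weight as the fundamental, and the autocorrelation supplies the zero-phase property, as described in the text.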

**Inverse filtering**: This concept begins with the hypothesis that the speech signal is the convolution of an excitation and a vocal tract filter. If one were able, in some manner, to specify the time-varying vocal tract at all times, then the speech signal could be passed through a filter with a spectrum *inverse* to that of the vocal tract filter; the output, ideally, should be the glottal waveform, again simplifying pitch tracking. Chapters 19–21 describe methods for estimating the spectral envelope; inverse filtering, at least for the relatively simple vowel sounds, consists of building a linear system having zeros where the original spectral envelope has poles. Markel [10] has implemented inverse filtering as part of his SIFT algorithm for fundamental frequency estimation.^{2}
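A much-simplified sketch in the spirit of SIFT — not Markel's exact algorithm — using a low-order LPC fit and inverse filtering on a synthetic single-resonance signal:

```python
# Simplified inverse-filtering sketch: fit an LPC envelope, inverse-filter to
# approximate the excitation, then look for periodicity in the residual.
import numpy as np

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC: x[n] ~ sum_k a[k] * x[n-1-k]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def inverse_filter(x, a):
    """Residual e[n] = x[n] - sum_k a[k] * x[n-1-k]."""
    e = np.array(x, dtype=float)
    for k, ak in enumerate(a):
        e[k + 1:] -= ak * x[:len(x) - (k + 1)]
    return e

# Synthetic "vowel": a 100-Hz pulse train through a single damped resonance.
fs = 8000
excitation = np.zeros(800)
excitation[::80] = 1.0                       # period of 80 samples
rho, theta = 0.95, 2 * np.pi * 700 / fs      # pole radius and angle (made up)
y = np.zeros(800)
for i in range(800):
    y[i] = excitation[i]
    if i >= 1:
        y[i] += 2 * rho * np.cos(theta) * y[i - 1]
    if i >= 2:
        y[i] -= rho ** 2 * y[i - 2]
a = lpc_coeffs(y, 2)                         # estimate the resonance
e = inverse_filter(y, a)                     # residual ~ the pulse train
r = np.correlate(e, e, mode="full")[len(e) - 1:]
tau = 20 + int(np.argmax(r[20:200]))         # expect the 80-sample period
```

Because the residual is spectrally much flatter than the speech itself, its autocorrelation peak is far less likely to be captured by a formant.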

**Comb filtering**: The speech signal is sent through a multitude of delays, corresponding to all possible (discrete) periods of the input. The system is shown in Fig. 31.9.

For 10-kHz sampling and a fundamental frequency range of 50–500 Hz, the number of possible periods (in samples) ranges from 20 to 200. Thus the comb filter of Fig. 31.9 must contain at least 181 taps; at each tap the signal and its delayed version are subtracted. If the signal is periodic, one of the tapped outputs should be zero, so an estimate can be made by examining the tap outputs.

Ross et al. [14] have implemented a comb-filter pitch detector.
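The subtract-and-compare structure of Fig. 31.9 can be realized as the average magnitude difference function (AMDF) of Ross et al. [14]; the sketch below uses our own function names and the 10-kHz example above:

```python
# AMDF sketch: D[tau] = mean |x[n] - x[n - tau]| dips sharply near the period.
import numpy as np

def amdf(x, tau_min, tau_max):
    x = np.asarray(x, dtype=float)
    return np.array([np.mean(np.abs(x[tau:] - x[:-tau]))
                     for tau in range(tau_min, tau_max + 1)])

def amdf_period(x, fs, fmin=50.0, fmax=500.0):
    lo, hi = int(fs / fmax), int(fs / fmin)
    d = amdf(x, lo, hi)
    return lo + int(np.argmin(d))

fs = 10000                                   # matches the 10-kHz example above
t = np.arange(0, 0.05, 1.0 / fs)
x = (np.sin(2 * np.pi * 125 * t) + 0.3 * np.sin(2 * np.pi * 250 * t)) \
    * np.exp(-5.0 * t)                       # slight decay avoids exact ties
period = amdf_period(x, fs)                  # expect fs/125 = 80 samples
```

Unlike autocorrelation, which looks for a maximum, the AMDF looks for a deep null at the true period, which requires no multiplications.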

**Cepstral pitch detection**: As elucidated in Chapter 20, cepstral analysis performs *deconvolution* of the source and filter. In Chapter 20 we stressed the application to the spectral envelope but, as shown by Noll [13], the high-time portion of the cepstrum contains a very clear indication of the fundamental frequency. Figure 31.10 shows sequences of log-spectrum cross sections and the resulting cepstra for a male (two left columns) and a female (two right columns) speaker. Note the large peak corresponding to the pitch period in the second and fourth columns.
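A minimal sketch of the cepstral approach (the window and search range are our own choices):

```python
# Cepstral pitch sketch: the log-magnitude spectrum of voiced speech ripples
# at the harmonic spacing, so its inverse transform (the cepstrum) peaks at
# the pitch period.
import numpy as np

def cepstrum_period(x, fs, fmin=50.0, fmax=500.0):
    spec = np.abs(np.fft.fft(x * np.hamming(len(x))))
    ceps = np.fft.ifft(np.log(spec + 1e-10)).real
    lo, hi = int(fs / fmax), int(fs / fmin)
    return lo + int(np.argmax(ceps[lo:hi + 1]))

fs = 8000
t = np.arange(0, 0.05, 1.0 / fs)
f0 = 160.0                                  # -> period of 50 samples
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 9))
period = cepstrum_period(x, fs)             # expect fs/160 = 50 samples
```

The low-quefrency region, which carries the spectral envelope, is excluded from the search, so only the source-related peak is examined.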

Finally, we should mention that simply measuring the spectrum with a high resolution illuminates the positions of the harmonics. In the next section, we show how this straightforward operation can lead to a powerful pitch-detection algorithm.

It has often been found that multiple sources of information (or multiple estimators of a variable) provide a more reliable estimator. For instance, it can be easily shown that the estimate of a variable's mean formed by averaging *N* independent measurements has a variance that is 1/*N* times the variance of a single measurement. More generally, the use of multiple estimators permits a secondary decision process to consider agreement among estimators. In practice, it is often difficult to determine whether measurements are independent, but even if there is *some* degree of dependence, improvement is often made.
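The 1/*N* claim is easy to verify numerically; a small simulation with made-up Gaussian measurements:

```python
# Numerical check: averaging N independent measurements reduces the variance
# of the estimate by a factor of N (here N = 10, sigma^2 = 4).
import numpy as np

rng = np.random.default_rng(0)
single = rng.normal(5.0, 2.0, size=100000)
averaged = rng.normal(5.0, 2.0, size=(100000, 10)).mean(axis=1)
var_single = single.var()        # close to sigma^2 = 4
var_avg = averaged.var()         # close to sigma^2 / N = 0.4
```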

An example of the use of parallelism for pitch detection [5] was a program consisting of four major steps:

- A low-pass filter to smooth the speech wave.
- A processor that generated six functions of the peaks of the filtered speech.
- Six identical elementary pitch-period estimators (PPE), each working on one of the functions.
- A global, statistically oriented computation based on the results of step 3.

Figure 31.11 depicts the six measurements. Each is input to a PPE; the task of the PPE is to eliminate spurious peaks and save those that are separated by the correct period. Each PPE performs the function described by Fig. 31.3.

The box labeled final pitch-period computation in Fig. 31.11 compiles a histogram of all measured intervals between peaks, as outlined in Fig. 31.12. To avoid delays, the candidates for the most probable period are chosen from among the most recent estimates (one from each PPE), and one of these six is selected, based on the histogram of all periods. This set of measurements can be repeated as often as desired; typically, a new selection is made every 5–15 ms.

The histogram obtained from this algorithm can also be used to produce a buzz-hiss decision [6].
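A much-reduced, single-measurement sketch of the histogram idea follows; the real system runs six peak measurements and six PPEs in parallel, and all names below are our own:

```python
# Toy peak-interval histogram: collect intervals between prominent positive
# peaks and pick the most common interval as the period.
import numpy as np
from collections import Counter

def peak_interval_period(x, fs, fmin=50.0, fmax=500.0):
    """Mode of the intervals between prominent positive peaks, in samples."""
    x = np.asarray(x, dtype=float)
    thresh = 0.5 * x.max()       # only peaks above half the maximum count
    peaks = [i for i in range(1, len(x) - 1)
             if x[i] > thresh and x[i] >= x[i - 1] and x[i] > x[i + 1]]
    lo, hi = int(fs / fmax), int(fs / fmin)
    intervals = [q - p for p, q in zip(peaks, peaks[1:]) if lo <= q - p <= hi]
    return Counter(intervals).most_common(1)[0][0] if intervals else None

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
x = np.sin(2 * np.pi * 100 * t) + 0.2 * np.sin(2 * np.pi * 200 * t)
period = peak_interval_period(x, fs)   # expect fs/100 = 80 samples
```

The strength of the full parallel scheme is precisely that no single measurement of this kind has to be right all the time; the histogram over many measurements does the arbitration.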

As computer processing has increased in speed, there has been an evolution toward new algorithms that require such speeds to operate in real time. One set of algorithms extends the histogram idea discussed here by including more measurements. For example, instead of preprocessing with just a single low-pass filter, the speech is passed through a bank of 19 bandpass filters covering the range 200–2000 Hz. The output of each filter is passed through an elementary pitch detector similar to the PPE described earlier. This yields a total of 38 outputs (since both positive and negative peaks are represented). Also, the histogram of times between pitch peaks is generalized somewhat to include times between peaks that are separated by several other peaks; this is equivalent to the original proposal by Licklider [8] to compute the correlation function of the spike train from simulated neurons.

Licklider's concept, and later ideas similar to it [1], [11], [9], [3], base their measurements on the speech wave directly or on a low-pass-filtered version of it. The same notion of computing a histogram, but one based on high-resolution spectral analysis, was carried out by Schroeder [15] and Seneff [17]; the latter is briefly summarized here.

Figure 31.13 shows a spectral magnitude cross section containing seven peaks. (This spectrum is based on a 20-ms windowed section of the speech.) The peaks are ordered, as shown. Then the frequencies of peaks 1 and 2 are marked. Then peak frequencies 1, 2, and 3 are marked; then 1, 2, 3, and 4 are marked, and so on, until all seven peak frequencies have been marked in this manner. A histogram is then computed (bottom right) of the intervals shown in Fig. 31.13, and the winner is picked to be the interval that occurs with the greatest probability.
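A sketch in the spirit of this harmonic histogram follows; the peak picking and rounding are our own simplifications, and the precision of the estimate is limited by the FFT bin width:

```python
# Sketch of a spectral-peak spacing histogram: locate peaks in a
# high-resolution magnitude spectrum and histogram the spacings between
# successive peaks; the most common spacing estimates the fundamental.
import numpy as np
from collections import Counter

def harmonic_spacing_f0(x, fs):
    n = len(x)
    spec = np.abs(np.fft.rfft(x * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    thresh = 0.1 * spec.max()
    peaks = [i for i in range(1, len(spec) - 1)
             if spec[i] > thresh and spec[i] >= spec[i - 1]
             and spec[i] > spec[i + 1]]
    spacings = [round(freqs[q] - freqs[p]) for p, q in zip(peaks, peaks[1:])]
    return Counter(spacings).most_common(1)[0][0]

fs = 8000
t = np.arange(1024) / fs              # 1024-point analysis: ~7.8-Hz bins
f0_true = 140.0
x = sum(np.sin(2 * np.pi * k * f0_true * t) for k in range(1, 7))
f0_est = harmonic_spacing_f0(x, fs)   # expect a value within a bin of 140 Hz
```

Because the histogram votes over many harmonic spacings, a single missed or spurious spectral peak does little damage to the final estimate.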

The above two algorithms have in common the concept that performing a collection of procedures on the conditioned speech can lead to improvement. A somewhat different statistical approach was developed by Goldstein [7] and implemented by Duifhuis [4]. In their method, a single, powerful algorithm is employed, but its parameters are adjusted so that it is successively tuned to each candidate fundamental frequency. In other words, the hypothesis is advanced that the result is, for example, *f*_{1}. This hypothesis is then tested by comparing the spectrum of the signal with the spectrum of the hypothetical signal, and a score is obtained. The procedure is repeated for *f*_{2}, *f*_{3}, and so on, and the best score determines the winner. The crucial point is that *all permissible* hypotheses go through the test. An implementation of this algorithm is shown in Fig. 31.14. This procedure exemplifies the maximum likelihood approach of testing all reasonable hypotheses and choosing the one having the greatest probability.
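A toy version of this hypothesis-testing idea is sketched below; the Gaussian-like scoring function and the miss penalty are our own simplifications, not the published likelihood computation:

```python
# Toy harmonic-matching scorer: every permissible f0 hypothesis is scored
# against the measured spectral peaks, and the best score wins.
import numpy as np

def score_hypothesis(peak_freqs, f0, sigma=10.0, miss_penalty=0.5):
    """Reward measured peaks near multiples of f0; penalize predicted
    harmonics (below the highest measured peak) that match nothing."""
    s, matched = 0.0, set()
    for f in peak_freqs:
        k = max(1, int(round(f / f0)))
        d = np.exp(-((f - k * f0) ** 2) / (2.0 * sigma ** 2))
        s += d
        if d > 0.5:
            matched.add(k)
    n_pred = int(max(peak_freqs) // f0)
    return s - miss_penalty * (n_pred - len(matched))

def best_f0(peak_freqs, fmin=50.0, fmax=500.0, step=1.0):
    cands = np.arange(fmin, fmax + step, step)
    scores = [score_hypothesis(peak_freqs, f) for f in cands]
    return float(cands[int(np.argmax(scores))])

# Measured peaks with the fundamental itself missing (telephone-like case):
f0 = best_f0([300.0, 450.0, 600.0, 750.0])   # expect 150 Hz
```

Note that the winning hypothesis is 150 Hz even though no 150-Hz component is present: the harmonic pattern alone decides, which is exactly why this family of methods is robust to a missing fundamental.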

Physiological constraints limit the rate at which the voice's fundamental frequency can vary, and this can be exploited to correct errors made by local pitch estimators. Median smoothing is a popular technique invented by Tukey [20] which looks at a sequence of final decisions and treats this sequence as a collection of points on a histogram. Thus, for example, the sequence 5, 6, 12, 7, 8 is plotted as a discrete probability density in the lower part of Fig. 31.15 and as a cumulative probability distribution in the upper part.

The median is the *x* position for which *P* is one-half. In Fig. 31.15, the median is 7; thus, the center of the sequence is replaced by 7, and the new sequence becomes 5, 6, 7, 7, 8. In this example, the outlier 12 was replaced. In this case, as in many others, median smoothing is preferable to a linear filter, for which the effect of an outlier would spread to other samples. In the case in which a one represents buzz (voiced) excitation and a zero represents hiss (unvoiced), three-point median smoothing of the sequence 1, 1, 0, 1, 1 changes the zero to a one, giving the modified sequence 1, 1, 1, 1, 1. Thus, median smoothing can also be used to fix buzz–hiss errors.

In practice, median smoothing is applied to sequences in much the same way as a symmetric FIR filter. That is, for each window of *N* points, where *N* is an odd integer, the value of point (*N* + 1)/2 in the window (for a new, smoothed sequence) is set equal to the median of the points in the window. The window then is stepped along by one sample point, and the function is recomputed. This repeated sequence of operations is then referred to as an N-point smoothing.
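The *N*-point smoothing just described fits in a few lines; the sketch below (our own function name) leaves the first and last (*N* − 1)/2 points unsmoothed, since the chapter does not specify edge handling:

```python
def median_smooth(seq, n):
    """n-point median smoothing (n odd); end points are copied unchanged."""
    assert n % 2 == 1
    half = n // 2
    out = list(seq)
    for i in range(half, len(seq) - half):
        window = sorted(seq[i - half:i + half + 1])
        out[i] = window[half]          # median of the current window
    return out
```

This reproduces the chapter's examples: `median_smooth([5, 6, 12, 7, 8], 5)` gives `[5, 6, 7, 7, 8]`, and `median_smooth([1, 1, 0, 1, 1], 3)` gives `[1, 1, 1, 1, 1]`.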

By cascading a three-point median smoother with a five-point median smoother (Fig. 31.16), one can transform the presmoothing pitch contour (top figure) into the result shown on the bottom. In this modification of the basic median smoother, two additional constraints are imposed: (a) if the low-pass signal energy is below a threshold, the result is set to hiss (zero in the figure), and (b) if the variance of three successive results is too large, the median smoother output is also set to hiss.

A second approach to smoothing pitch estimates is via dynamic programming (DP). DP was introduced in Section 24.2.2 as the principle that allowed dynamic time warping to efficiently find the best path through a matrix of local-match scores subject to constraints on local transitions. By the same token, it can be used to find the sequence of reported pitch values that optimizes a combination of consistency with an underlying period-strength feature (such as autocorrelation) and a transition penalty favoring continuity or smoothness. DP-based smoothing was incorporated in systems such as the one reported in [16]. A later application to normalized cross-correlation coefficients in the “robust algorithm for pitch tracking” (RAPT) was reported in [19]. RAPT is the basis for the get_f0 pitch extraction software that is widely used as a reference in speech processing; DP is extremely effective in evaluating each alternative in regions of locally ambiguous pitch, and finally choosing the pitch track that gives the best overall continuity through time.
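A sketch of DP smoothing over per-frame period candidates follows; the scores and costs are made up for illustration (RAPT's actual local scores are normalized cross-correlation values, and its cost structure is richer):

```python
# DP smoothing sketch: maximize the sum of local candidate scores minus a
# jump penalty proportional to the period change between adjacent frames.
import numpy as np

def dp_track(local_scores, candidates, jump_cost=0.02):
    """local_scores: (frames, K), larger = better local match.
    candidates: the K candidate periods. Returns the best period sequence."""
    scores = np.asarray(local_scores, dtype=float)
    cands = np.asarray(candidates, dtype=float)
    T, K = scores.shape
    trans = jump_cost * np.abs(cands[:, None] - cands[None, :])  # trans[j, i]
    cost = -scores[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        tot = cost[:, None] + trans        # tot[j, i]: come from j, land on i
        back[t] = np.argmin(tot, axis=0)
        cost = tot[back[t], np.arange(K)] - scores[t]
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [candidates[i] for i in path]

# Frame 3 locally (and wrongly) prefers the half-length period 40 samples,
# an octave error; the transition penalty makes continuity win.
local = [[0.2, 0.9], [0.2, 0.9], [0.2, 0.9],
         [0.55, 0.5], [0.2, 0.9], [0.2, 0.9]]
track = dp_track(local, [40, 80])          # expect 80 in every frame
```

The backtracking step is what distinguishes this from greedy frame-by-frame selection: a locally attractive but globally inconsistent candidate is overruled.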

As mentioned above, the local autocorrelation of a signal *x* over a window of length *W* starting at time *t*,

$$r_t(\tau) = \sum_{j=t+1}^{t+W} x_j \, x_{j+\tau},$$

is a natural way to detect periodic repetitions in a waveform since it will show a maximum at any period τ where the signal (approximately) repeats. Autocorrelation, however, presents a number of practical difficulties, as illustrated in Figure 31.17: for the short fragment of voiced female speech shown in part (a), the autocorrelation in part (b) shows many peaks. Although visually it is fairly easy to tell that the peak at 4 ms (corresponding to a pitch of 250 Hz) is the “right” one, it proves difficult to design a thresholding scheme that will, at the same time, ignore the large peak around zero lag, and be robust both to suboctave errors (the second-order peak at lag 8 ms) and to superoctave errors (i.e., the peak due to the strong second harmonic at 2 ms).

These problems were addressed by de Cheveigné and Kawahara in their pitch detector Yin [2]. Instead of autocorrelation, they considered the squared difference between the signal and its delayed version,

$$d_t(\tau) = \sum_{j=t+1}^{t+W} \left( x_j - x_{j+\tau} \right)^2 .$$

Note that this can be expressed in terms of the autocorrelation:

$$d_t(\tau) = r_t(0) + r_{t+\tau}(0) - 2\,r_t(\tau),$$

where $r_t(0)$ is simply the energy of the windowed signal starting at time *t*. Introducing the energy terms $r_t(0)$ and $r_{t+\tau}(0)$ makes the difference function robust to amplitude changes across the window. Yin then normalizes $d_t(\tau)$ by its average over all lags up to $\tau$, giving the cumulative mean normalized difference function

$$d'_t(\tau) = \begin{cases} 1, & \tau = 0, \\ d_t(\tau) \left/ \left[ \dfrac{1}{\tau} \displaystyle\sum_{j=1}^{\tau} d_t(j) \right] \right., & \text{otherwise}. \end{cases}$$

This function, shown in Figure 31.17(d), by definition starts at 1 and tends to remain large for $\tau$ close to zero, eliminating the spurious minimum of $d_t(\tau)$ near zero lag. Now, finding the period is a simple matter of finding the earliest minimum that falls below a fixed, relatively insensitive threshold – for example, 0.1.
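The core of this procedure can be sketched directly; the threshold and search ranges below are our own choices, and Yin's published refinements are omitted:

```python
# Sketch of the core of Yin [2]: the difference function d(tau), its
# cumulative mean normalization d'(tau), and the earliest dip below threshold.
import numpy as np

def yin_period(x, fs, fmin=50.0, fmax=500.0, threshold=0.1):
    x = np.asarray(x, dtype=float)
    tau_max = int(fs / fmin)
    w = len(x) - tau_max                       # window length W
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff = x[:w] - x[tau:tau + w]
        d[tau] = np.dot(diff, diff)            # squared-difference function
    dn = np.ones(tau_max + 1)                  # cumulative mean normalized d'
    cums = np.cumsum(d[1:])
    dn[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(cums, 1e-12)
    tau_min = int(fs / fmax)
    for tau in range(tau_min, tau_max + 1):
        if dn[tau] < threshold:
            while tau + 1 <= tau_max and dn[tau + 1] < dn[tau]:
                tau += 1                       # walk down to the dip's bottom
            return tau
    return tau_min + int(np.argmin(dn[tau_min:]))

fs = 8000
t = np.arange(0, 0.06, 1.0 / fs)
x = np.sin(2 * np.pi * 200 * t) + 0.4 * np.sin(2 * np.pi * 400 * t)
period = yin_period(x, fs)                     # expect fs/200 = 40 samples
```

Taking the *earliest* qualifying dip, rather than the deepest one, is what protects against suboctave errors at multiples of the true period.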

Yin includes several other refinements, including interpolation of the best lag and a stage that searches over a small time range around the analysis window to find the best local estimate. Beyond this, however, it does not apply any time smoothing, yet it was shown to significantly outperform a range of standard approaches. It has become a popular reference method, especially on relatively clean speech signals. Pitch detection at lower signal-to-noise ratios remains an active research area; we will return to the particular case of multi-pitch extraction for music signals in Chapter 37.

**31.1** Using Klatt's synthesizer as a model (Chapter 30), find the sequence of excitation functions for:
**(a)** Voiced fricatives.
**(b)** Voiced plosives.
**(c)** Affricates (ch, dj).
**(d)** Voiceless plosives.

**31.2** Figure 31.7 shows a way of spectrally flattening a speech signal. Another method is to pass the speech through a bank of bandpass filters and hard-limit each output, which is then bandpass filtered by an identical filter. Can you compare the two methods? How are they the same? How do they differ?

**31.3** How might the measurements described in connection with Figs. 31.11 and 31.12 be used to create a buzz–hiss decision?

**31.4** Given a sequence of detected periods of 90, 90, 94, 73, 85, 40, 78, 95, 97, 50, 100, 105, 110:
**(a)** Find the new sequence after three-point median smoothing.
**(b)** Find the resulting sequence after processing the result of (a) with a five-point median smoother.

**31.5** Build a circuit or write a program to generate a pulse train. Include the ability to vary the repetition frequency. For reference frequencies of 50, 100, 200, 400, and 800 Hz, measure the just-noticeable deviation from that reference. Design a convenient way to plot results. Discuss.

**31.6** The pitch detection circuit used by Dudley consisted of a slope filter followed by a zero-crossing meter. The filter had a log magnitude versus frequency shape that approximated a straight line (see Fig. 31.1).
**(a)** Design a digital filter to approximate the magnitude response of the slope filter.
**(b)** Consider a signal defined by the equation

as input to the slope filter. Write the equation for the output *q(t)*. Measure the zero crossings of both the input and output and discuss how they compare as pitch detectors.

**31.7** Describe the perceptual effects of the following types of errors in modeling the excitation function:
**(a)** Mistakenly changing buzz to hiss.
**(b)** Mistakenly changing hiss to buzz.
**(c)** Doubling the detected pitch.
**(d)** Halving the detected pitch.

**31.8** Write a brief essay (less than 1000 words) giving your views of the following pitch detection algorithms:
**(a)** Comb filtering.
**(b)** The Gold–Rabiner parallel-processing time-domain algorithm.
**(c)** Cepstral analysis.
**(d)** Spectral flattening and autocorrelation as described by Sondhi.
**(e)** The implementation of Goldstein's model as described by Duifhuis.
**(f)** Seneff's harmonic pitch detector.

1. Aarset, T. C., and Gold, B., “Models of pitch perception,” Tech. Rep. 964, MIT Lincoln Laboratory, Lexington, Mass., 1992.
2. de Cheveigné, A., and Kawahara, H., “YIN, a fundamental frequency estimator for speech and music,” *J. Acoust. Soc. Am.* **111**(4): 1917–1930, 2002.
3. Delgutte, B., and Cariani, P. A., “Coding of the pitch of harmonic and inharmonic complex tones in the interspike intervals of auditory-nerve fibers,” in M. E. H. Schouten, ed., *The Auditory Processing of Speech*, Mouton–De Gruyter, Berlin, pp. 37–45, 1992.
4. Duifhuis, H., Willems, L. F., and Sluyter, R. J., “An implementation of Goldstein's theory of pitch perception,” *J. Acoust. Soc. Am.* **71**: 1568–1580, 1982.
5. Gold, B., and Rabiner, L. R., “Parallel processing techniques for estimating pitch periods of speech in the time domain,” *J. Acoust. Soc. Am.* **46**: 442–448, 1969.
6. Gold, B., “A note on buzz-hiss detection,” *J. Acoust. Soc. Am.* **36**: 1659, 1964.
7. Goldstein, J. L., “An optimum processor for the central formation of pitch of complex tones,” *J. Acoust. Soc. Am.* **54**: 1496–1516, 1973.
8. Licklider, J. C. R., “A duplex theory of pitch perception,” *Experientia* **7**: 128–138, 1951.
9. Lyon, R. F., “Computational models of neural auditory processing,” in *Proc. ICASSP '84*, San Diego, 1984.
10. Markel, J. D., “The SIFT algorithm for fundamental frequency estimation,” *IEEE Trans. Audio Electroacoust.* **AU-20**: 367, 1972.
11. Meddis, R., and Hewitt, M. J., “Virtual pitch and phase sensitivity of a computer model of the auditory periphery I: pitch identification,” *J. Acoust. Soc. Am.* **89**(6), 1991.
12. Gevins, A., and Morgan, N., “‘Ignorance-based’ systems,” in *Proc. 1984 IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing*, Vol. 9, IEEE Press, Piscataway, NJ, pp. 39A.5.14, 1984.
13. Noll, A. M., “Cepstrum pitch determination,” *J. Acoust. Soc. Am.* **41**: 293, 1967.
14. Ross, M. J., Schaffer, H. L., Cohen, A., Freudberg, R., and Manley, H., “Average magnitude difference function pitch extractor,” *IEEE Trans. Acoust. Speech Signal Process.* **ASSP-22**: 353–362, 1974.
15. Schroeder, M. R., “Period histogram and product spectrum: new methods for fundamental frequency measurement,” *J. Acoust. Soc. Am.* **43**: 829–834, 1968.
16. Secrest, B., and Doddington, G., “An integrated pitch tracking algorithm for speech systems,” in *Proc. ICASSP '83*, Boston, pp. 1352–1355, 1983.
17. Seneff, S., “Real-time harmonic pitch detector,” *IEEE Trans. Acoust. Speech Signal Process.* **ASSP-26**: 358–364, 1978.
18. Sondhi, M. M., “New methods of pitch extraction,” *IEEE Trans. Audio Electroacoust.* **AU-16**: 262–266, 1968.
19. Talkin, D., “A robust algorithm for pitch tracking,” in W. B. Kleijn and K. K. Paliwal, eds., *Speech Coding and Synthesis*, Elsevier, Amsterdam/New York, pp. 495–518, 1995.
20. Tukey, J. W., “Nonlinear (nonsuperposable) methods for smoothing data,” in *Proc. EASCON '74*, Washington, D.C., p. 673, 1974.
