Human pitch perception is performed by the complete auditory system. Aside from the periphery, our knowledge of this system is still so fragmentary that the task of modeling human pitch perception depends primarily on the interpretation of many psychoacoustic experiments, abetted somewhat by continuing physiological explorations. In this chapter, we first review some proposals (and the accompanying controversies) about the nature of this remarkable facility of ours. We then elaborate on some of these ideas by comparing the performance of models with experimental results. The reader should be aware that there is a long and rich history associated with this problem. For greater detail than we can provide here, consult the excellent review by de Boer [2].

As with Chapter 15, the understanding of some of these concepts can be improved by listening to the relevant demonstrations from [10].


As noted in Chapter 14, von Helmholtz [8] conceived of the auditory system as a bank of many overlapping bandpass filters. The relationship between this model and the known physiology of the periphery can be seen from Figs. 16.1 and 16.2.

Recall also that the basilar membrane and associated hair cells respond more to high frequencies at the entrance to the cochlea. As the vibrations penetrate more deeply into the cochlea, the basilar membrane (BM) response becomes more sluggish, corresponding to filters with lower center frequencies. A pure tone would cause greatest vibration at a specific place on the BM, and this would ultimately lead to perception of that tone. An important fact to note is that the ultimate perception of the tone depends on the activity of specific cortical neurons. We will return to this point later.

An engineering model of the auditory system is shown in Fig. 16.2. Note that the components of the inner ear (BM, hair cells) are represented by a filter bank.

Although von Helmholtz's place theory supplies a credible explanation for the pitch of pure tones, it runs into some difficulty accommodating the perception of complex tones. There are many instances in which pitch can be readily identified even though the fundamental frequency is completely absent. Many years before the electronic age, Seebeck [18] demonstrated this result by using a siren, as shown in Fig. 16.3.


FIGURE 16.1 Schematic of the outer, middle, and inner ears.

The rotating disk with a single opening, subjected to a wideband acoustic field, produced pulses of sound at the frequency 1/T. When the disk had two openings at opposite ends, the repetition rate was doubled. When one of the openings was slightly displaced, the frequency was again 1/T, but the first harmonic could be made extremely small by reducing the displacement. Thus, Seebeck created the original version of the missing fundamental experiment. In such experiments, listeners are presented with complex tones with essentially no energy at the fundamental frequency in order to determine the pitch perception for such a stimulus.
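The vanishing first harmonic can be checked with a short calculation. This is only a sketch under idealizing assumptions that are not stated in the original: each opening is taken to produce an ideal impulse, and the second opening is displaced by a small time $\varepsilon$ from the half-period point.

```latex
% Idealized pulse train: one impulse at t = mT, one at t = mT + T/2 + \varepsilon
x(t) = \sum_{m} \left[ \delta(t - mT) + \delta\!\left(t - mT - \tfrac{T}{2} - \varepsilon\right) \right]

% Fourier-series coefficient of the k-th harmonic of 1/T
c_k = \frac{1}{T}\left[ 1 + e^{-j 2\pi k \left(\frac{1}{2} + \frac{\varepsilon}{T}\right)} \right],
\qquad
|c_k| = \frac{2}{T}\left| \cos\!\left( \frac{\pi k}{2} + \frac{\pi k \varepsilon}{T} \right) \right|

% First and second harmonics
|c_1| = \frac{2}{T}\left| \sin\!\left( \frac{\pi \varepsilon}{T} \right) \right| \;\to\; 0
\quad \text{as } \varepsilon \to 0,
\qquad
|c_2| = \frac{2}{T}\left| \cos\!\left( \frac{2\pi \varepsilon}{T} \right) \right| \approx \frac{2}{T}
```

Under these assumptions the waveform still repeats with period T for any nonzero displacement, yet the component at 1/T shrinks in proportion to the displacement while the component at 2/T stays near full strength.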

Figure 16.4 illustrates a modern version of Seebeck's experiment, presented in demonstration 20 of [10]. The listener hears a complex tone at a fundamental frequency of 200 Hz. Successive harmonics, beginning with the first, are removed. Many listeners perceive the same pitch as would be heard for a sequence that included the fundamental.
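A stimulus of this kind is easy to synthesize. The following Python sketch builds a 200-Hz harmonic complex with and without its lowest harmonics and compares the two. The sampling rate, the number of harmonics, and the use of a plain autocorrelation peak as a stand-in for "the repetition period of the waveform" are illustrative assumptions; the sketch is not the procedure used on the demonstration disk, and it makes no claim about how listeners arrive at the pitch.

```python
import numpy as np

def harmonic_complex(f0, harmonics, fs, dur=0.1):
    """Sum of equal-amplitude cosines at the given harmonic numbers of f0."""
    t = np.arange(int(round(fs * dur))) / fs
    x = np.zeros_like(t)
    for k in harmonics:
        x += np.cos(2 * np.pi * k * f0 * t)
    return x

def autocorr_period(x, fs, fmin=50.0, fmax=500.0):
    """Lag (seconds) of the largest autocorrelation value in the pitch range."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
    lo, hi = int(fs / fmax), int(fs / fmin)
    return (lo + int(np.argmax(r[lo:hi + 1]))) / fs

fs = 16000                                           # assumed sampling rate
f0 = 200.0
full = harmonic_complex(f0, range(1, 11), fs)        # harmonics 1..10
missing = harmonic_complex(f0, range(4, 11), fs)     # harmonics 1..3 removed

# FFT bin spacing is fs / len(x) = 10 Hz, so 200 Hz falls in bin 20.
energy_at_f0 = np.abs(np.fft.rfft(missing))[20]

period_full = autocorr_period(full, fs)
period_missing = autocorr_period(missing, fs)
print(energy_at_f0, period_full, period_missing)
```

The second signal has essentially no spectral energy at 200 Hz, yet its waveform repeats with the same 5-ms period as the first.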

How can the place explanation for pitch perception be maintained, given experiments such as Seebeck's? Von Helmholtz defended his hypothesis by arguing that an (unspecified) nonlinear operation at the basilar membrane caused the BM to vibrate at the place corresponding to the fundamental frequency, even when there was no energy at that frequency in the original signal. Many years later, Licklider [12] showed this argument to be false.


FIGURE 16.2 Engineering model of the outer, middle, and inner ears. The dashed curve encloses a model of the inner ear as a linear filter bank; this model will be specified further in Chapter 19.


FIGURE 16.3 Pulse train and resultant spectrum in Seebeck's experiment. The first column represents the physical arrangement for the production of each sound; the other columns show the corresponding time waveform and power spectrum. Note that for the sound labeled c, the spectral component corresponding to the disk rotation speed (1/T) has a magnitude very close to zero.

Licklider alternately played a pure tone (at the fundamental) and a harmonic series of the same fundamental frequency but with the actual first harmonic physically absent. The listener then perceived two sounds of equal pitch but different timbre. Then noise with a band centered at the fundamental was added to this sequence. It was found that the pure tone was completely masked, whereas the harmonic series was still heard. If perception of the harmonic series were dependent on the combination tone appearing at the fundamental frequency place in the BM, this combination tone should also have been masked. Licklider's experiment is reproduced in demonstration 22 of [10].


FIGURE 16.4 Virtual pitch demonstration.

Licklider's result reinforced the so-called periodicity model of Schouten [17], who proposed that the auditory system perceives pitch by somehow measuring the periods of the signals as they travel from the BM (by means of hair cells) onto auditory fibers. Schouten's model is depicted in Fig. 16.5.

From this figure we observe that for a 200-Hz pulse train, the lower-frequency filters (below 1000 Hz) resolve the lower harmonics of the pulse train. For the high-frequency channels, however, a carrier signal that is within band for each filter appears to be modulated by a periodic signal so that the period of the input signal is easily recognized. Schouten thus hypothesized that pitch perception would be more salient at these higher harmonics. He asked his student Ritsma to set up an experiment to try to verify this hypothesis [16]. The experimental setup is shown in Fig. 16.6. Both the low-pass and high-pass filters in the figure are driven by pulse trains whose repetition rates are set by the oscillators. Ritsma showed that the listener always chose the output of the low-pass filter as the preferred pitch. (The precise design of an experiment to arrive at the result is deferred to the exercises.)
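The modulation that Schouten had in mind for the high-frequency channels can be made explicit with a small worked identity. As an illustrative assumption (the specific harmonic numbers are not taken from the original), suppose one such filter passes three adjacent, equal-amplitude, cosine-phase harmonics of the 200-Hz pulse train, say harmonics 14, 15, and 16:

```latex
x(t) = \cos(2\pi \cdot 2800\, t) + \cos(2\pi \cdot 3000\, t) + \cos(2\pi \cdot 3200\, t)
     = \bigl[\, 1 + 2\cos(2\pi \cdot 200\, t) \,\bigr] \cos(2\pi \cdot 3000\, t)
```

The bracketed factor is an envelope that repeats every 1/200 s = 5 ms, riding on a 3000-Hz carrier. In this idealized case the period of the pulse train is visible directly in the channel output even though no individual harmonic is resolved.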


FIGURE 16.5 Schouten's depiction of BM responses to a periodic 200-Hz pulse train. From [15].


FIGURE 16.6 Ritsma's setup to determine dominant frequencies for pitch perception. From [16].

Thus, pitch perception does not appear to be primarily a high-frequency phenomenon. More specifically, the results of testing the pitch perception of many listeners showed that the dominant frequencies for perception were in the vicinity of the third, fourth, and fifth harmonics. This result refuted Schouten's hypothesis but still left open the question of a suitable model.

Other important insights were obtained from the experiments of Houtsma and Goldstein [9]. In one experiment, musically trained listeners were asked to recognize the interval between two successively played signals.¹ Each signal contained two successive harmonics of a given fundamental. When the two signals were presented to both ears, the trained subjects had no trouble identifying the intervals. They then repeated this experiment with one major modification: for each signal, only one harmonic was presented to one ear and the other harmonic to the opposite ear. Again, pitch intervals were correctly identified. This result implied that the perception of pitch is centrally located. That is, perception appears to have taken place after the auditory signals from the two ears had been combined. From Fig. 14.1, this means that processing to determine pitch must take place at the superior olivary complex or higher. Also, significantly, von Helmholtz's place theory again plays some role in these more recent perspectives on pitch. In particular, the model proposed by Goldstein [7] assumes that the BM resolves the low-frequency harmonics and that the auditory system recognizes the pattern of excitations on the BM.


Miller and Sachs [14] collected poststimulus time histograms of spiking intervals over a wide range of fiber characteristic frequencies (CFs) from cats. Results for the synthetic speech stimulus “da” are shown in Fig. 16.7.

Their results show that some fibers yield patterns that support the Goldstein place concept, whereas others support the periodicity concept. Eight fiber discharge patterns are shown in Fig. 16.7, with CFs ranging from 250 Hz to 3620 Hz. Also shown on the figure are formants 1, 2, and 3. For the fibers with CF = 250 Hz and CF = 400 Hz, responses are synchronized to individual harmonics of the fundamental frequency of approximately 120 Hz. For the fibers with CF = 330 Hz and CF = 970 Hz, responses are synchronized to the fundamental frequency. These results suggest that the auditory system may use more than a single mechanism in arriving at a pitch estimate. A more complete explanation for these results is left as an exercise for the reader (see Exercise 16.6).

In recent decades, many deaf patients have been given cochlear implants; this operation sometimes provides new auditory capabilities. In this procedure, the dysfunctional hair cells are bypassed, and implanted electrodes excite the auditory nerve bundle directly. Since the electrodes are distributed throughout a specific region of the cochlea, it is possible to observe patient responses when individual electrodes, placed at specific places on the BM, are stimulated. Results show [5] that responses to the same periodic stimuli vary as a function of the place of excitation. Figure 16.8 gives the results for a six-electrode implant as a function of the repetition rate of a pulselike stimulus. For repetition rates between 100 and 200 Hz, the perceived pitch increases almost linearly with the rate. Beyond 200 Hz, pitch is nearly constant, which is quite different from the behavior of normal ears. We also notice that when the electrode is closer to the base (i.e., electrodes 5 and 6), pitch is higher than for the electrodes closer to the apex. This observation is reminiscent of the early place hypothesis of von Helmholtz.


FIGURE 16.7 Histograms of spiking intervals. From [14].


FIGURE 16.8 Scaled pitch response of a single implanted patient. Each solid curve corresponds to the response for a different electrode. From [5].


As we noted earlier, the two major categories of pitch-perception models are those based on BM place and those based on the periodicity of the outputs from the BM. Although modern perspectives often include aspects of each theory, we have found it instructive to compare the response of these models to different stimuli.

A periodicity model is shown in Fig. 16.9. A correspondence is assumed between the basilar membrane and the filter bank. Further, the hair cell–auditory nerve complex is modeled as the elementary pitch detectors (EPDs). As in the research by Ritsma, the filters cover the low-frequency portion of the speech spectrum (100-2000 Hz). The ability of these filters to resolve harmonics is a function of the pitch and spectrum of the incoming signal.


FIGURE 16.9 Block diagram of the periodicity model.


FIGURE 16.10 Block diagram of the place model.

Neural spiking tends to follow the peaks of the signal. Once an auditory nerve fiber has produced a spike, that fiber cannot respond to further stimulation during the refractory period. Following this period, the voltage difference between the interior and exterior of the neuron gradually returns to normal, thus increasing the probability of subsequent firings. The global algorithm shown in Fig. 16.9 generates a histogram of the intervals between successive spikes, together with the intervals between spikes that are two, three, or four spikes apart. The pitch period is determined by choosing the interval corresponding to the maximum value of the histogram.
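The pooled interval histogram can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration rather than a description of the model in Fig. 16.9: the elementary pitch detector is reduced to a peak picker with a hard 1-ms refractory period (`spike_samples`, a hypothetical helper), and the filter bank is idealized as two channels that each pass one resolved harmonic (the third and fourth) of a 200-Hz fundamental.

```python
import numpy as np

def spike_samples(x, refractory):
    """Hypothetical EPD: fire at positive local maxima of x, but never
    within `refractory` samples of the previous spike."""
    spikes, last = [], -refractory - 1
    for n in range(1, len(x) - 1):
        is_peak = x[n] > 0 and x[n] > x[n - 1] and x[n] >= x[n + 1]
        if is_peak and n - last > refractory:
            spikes.append(n)
            last = n
    return np.array(spikes, dtype=np.int64)

def interval_histogram(spike_trains, max_order=4, max_lag=400):
    """Pool, over all channels, the intervals from each spike to the
    next 1..max_order spikes (first- through fourth-order intervals)."""
    hist = np.zeros(max_lag + 1, dtype=np.int64)
    for spikes in spike_trains:
        for order in range(1, max_order + 1):
            d = spikes[order:] - spikes[:-order]
            hist += np.bincount(d[d <= max_lag], minlength=max_lag + 1)
    return hist

fs = 48000                       # assumed sampling rate
f0 = 200.0
t = np.arange(fs // 10) / fs     # 100 ms of signal
# Idealized channel outputs: each filter resolves a single harmonic.
channels = [np.cos(2 * np.pi * k * f0 * t) for k in (3, 4)]
trains = [spike_samples(x, refractory=fs // 1000) for x in channels]

hist = interval_histogram(trains)
period_samples = int(np.argmax(hist))    # interval with the most entries
pitch_hz = fs / period_samples
print(period_samples, pitch_hz)
```

In this toy case neither channel fires at the fundamental period in its first-order intervals, but the third-order intervals of one channel and the fourth-order intervals of the other coincide at 5 ms, so the pooled histogram peaks at the common period.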

A place model is shown in Fig. 16.10. The underlying hypothesis of this model is the ability of the auditory system to resolve harmonic peaks of the stimulus. This resolution probably takes place at higher auditory levels above the periphery. Stage 1 of the figure is a version of the Seneff algorithm [19] that performs a statistical separation of the frequency spacing between spectral peaks. Stage 2 is related to the harmonic sieve algorithm of the Goldstein model, as implemented by Duifhuis [4]. The spectral peaks are correlated with sets of harmonically spaced narrow windows. The nominated sets are based on the winning pitch of stage 1. The final winner corresponds to the set of maximum correlation.
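The two-stage idea can be sketched in Python as follows. This is a deliberately simplified illustration and not the Seneff or Duifhuis implementation: stage 1 is replaced by the median spacing between adjacent spectral peaks, stage 2 scores each nominated fundamental by how well the peaks fall inside triangular windows centered on its harmonics, and the window width, search range, step size, and function names are all assumptions.

```python
import numpy as np

def stage1_spacing(peaks_hz):
    """Rough pitch nomination: median spacing between adjacent spectral peaks."""
    peaks = np.sort(np.asarray(peaks_hz, dtype=float))
    return float(np.median(np.diff(peaks)))

def sieve_score(peaks_hz, f0, rel_width=0.04):
    """Correlate the peaks with triangular windows centered on multiples of f0.
    A peak exactly on a harmonic scores 1; a peak mistuned by rel_width or
    more (relative to the harmonic frequency) scores 0."""
    peaks = np.asarray(peaks_hz, dtype=float)
    n = np.maximum(np.rint(peaks / f0), 1.0)    # nearest harmonic number
    err = np.abs(peaks - n * f0) / (n * f0)     # relative mistuning
    return float(np.sum(np.clip(1.0 - err / rel_width, 0.0, None)))

def place_pitch(peaks_hz, search=0.1, step=0.5):
    """Stage 1 nominates candidates near the peak spacing; stage 2 picks
    the candidate whose harmonic windows best match the peaks."""
    rough = stage1_spacing(peaks_hz)
    candidates = np.arange((1 - search) * rough, (1 + search) * rough + step, step)
    scores = [sieve_score(peaks_hz, f0) for f0 in candidates]
    return float(candidates[int(np.argmax(scores))])

# Spectral peaks at harmonics 3, 4, and 5 of 200 Hz (fundamental absent).
peaks = [600.0, 800.0, 1000.0]
estimate = place_pitch(peaks)
print(estimate)
```

Restricting the second stage to candidates near the stage-1 estimate is what keeps this sketch from preferring a subharmonic such as 100 Hz, whose windows would also catch every peak.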

We can refer to two examples from [10] to test the fidelity of each of these models to human pitch perception. In the first, listeners are presented with inharmonic signals produced by shifting each of several harmonics upward by the same amount; this is presented in demonstration 21 of [10]. Figure 16.11 shows the harmonic structure of two signals. The top signal clearly leads to a pitch of 200 Hz, both for human listeners and for either model. The bottom signal, however, is a shifted version of the top signal and leads to a perceived pitch of 210 Hz. Many experiments, using different parameters, verify that shifted versions of harmonic signals result in reliable pitch perception. At issue is how to explain the amount of pitch shift as a function of the shift of the stimulus frequencies. Figure 16.12 shows the results for both place and periodicity models for a variety of conditions. Both models respond more or less correctly to the stimuli.
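A common first-order way to reason about such shifts is to treat each shifted component as a slightly mistuned harmonic and average the fundamentals it implies. The worked example below is a sketch under stated assumptions: the component frequencies (850, 1050, and 1250 Hz, i.e., harmonics 4, 5, and 6 of 200 Hz shifted by 50 Hz) are chosen for illustration and are not read from Fig. 16.11, and the averaging rule is a simplification rather than either of the models described above.

```latex
% Components: f_n = n f_0 + \Delta for N harmonic numbers n
\hat{p} \;\approx\; \frac{1}{N} \sum_{n} \frac{f_n}{n}
        \;=\; f_0 + \frac{\Delta}{N} \sum_{n} \frac{1}{n}

% Illustrative numbers (assumed): f_0 = 200 Hz, \Delta = 50 Hz, n = 4, 5, 6
\hat{p} \;\approx\; \frac{1}{3}\left( \frac{850}{4} + \frac{1050}{5} + \frac{1250}{6} \right)
        \;\approx\; 210.3\ \text{Hz}
```

Under these assumptions a 50-Hz shift of every component raises the estimate by only about 10 Hz, which is of the same order as the 200-Hz to 210-Hz change described in the text.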


FIGURE 16.11 Shift of virtual pitch.


FIGURE 16.12 Responses of both models to virtual pitch shifts.

In another case, we can present an entirely different type of stimulus. In demonstration 26 from [10], a five-octave diatonic scale is played with pulse pairs. This is followed by a four-octave diatonic scale built from samples of a Poisson process. Finally, a four-octave scale is played with bursts of comb-filtered white noise.² Here we tested the two pitch perception models for the last of these stimuli. Figure 16.13 shows how the noise is comb filtered. By sequentially changing the delay, one can control the pitch (which varies inversely with the delay) to produce the four octaves shown in Fig. 16.14.
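Comb-filtered noise of this kind can be generated in a few lines of Python. The sketch assumes the simplest feed-forward comb, in which a delayed copy of the noise is added to itself with unit gain; the exact filter used for the demonstration is not specified here, and the sampling rate and the 5-ms delay are illustrative choices.

```python
import numpy as np

def comb_filter(x, delay_samples, gain=1.0):
    """Feed-forward comb: y[n] = x[n] + gain * x[n - delay_samples]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[delay_samples:] += gain * x[:-delay_samples]
    return y

fs = 16000                                  # assumed sampling rate
delay_ms = 5.0                              # illustrative delay
delay = int(round(fs * delay_ms / 1000))    # 80 samples
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)             # one second of white noise
y = comb_filter(noise, delay)

# Magnitude response of the filter itself, |1 + exp(-j 2 pi f delay / fs)|,
# evaluated at multiples of 100 Hz.
freqs = np.array([100.0, 200.0, 300.0, 400.0])
response = np.abs(1.0 + np.exp(-2j * np.pi * freqs * delay / fs))

# Normalized correlation of the output with itself at the comb delay.
lag_corr = np.dot(y[:-delay], y[delay:]) / np.dot(y, y)
print(response, lag_corr)
```

With a 5-ms delay the response peaks at multiples of 200 Hz (the reciprocal of the delay) and has nulls halfway between, so the noise acquires both a harmonically spaced spectrum and a repetition in the waveform at the delay.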

Figure 16.15 shows how the models respond to the comb-filtered noise. Instead of the use of the diatonic scale of Fig. 16.14, a slightly different sequence of delays was presented to the models, consisting of ten delays ranging from 12.0 to 2.1 ms. Each of the signals was processed by both models for 160 ms, followed by a pause of 55 ms. Figure 16.15 shows the resulting pitch-period estimates; the dips correspond to the off times.


FIGURE 16.13 Comb-filtered noise with different delays.


FIGURE 16.14 Diatonic scale for a comb-filtered noise demonstration.


FIGURE 16.15 Tracks of the detected periods for both place and periodicity models of pitch, given comb-filtered noise for different delays.

The place model follows the psychoacoustic results for the first six of the ten cases. The periodicity model more or less follows the same pattern but very erratically. Interestingly, the higher pitches are better represented by the periodicity model.

More details on listener response to various noise signals can be found in [1].


FIGURE 16.16 Response of periodicity model to Flanagan–Guttman pulse train.

The results from comb-filtered noise seemed to suggest that place-based models were a better match to human perception. However, in other cases the periodicity model performs quite well. A case in point is an experiment by Flanagan and Guttman [6] that demonstrated two modes of pitch perception. The listener was presented with a sequence of periodic pulses of pulse rate R. Each period consisted of three positive pulses followed by one negative pulse; thus the fundamental frequency was R/4. When R was less than approximately 150 Hz, listeners perceived pitch to be R; for R greater than 200 Hz, listeners perceived pitch to be R/4. Figure 16.16 shows the result obtained with the periodicity model. In this case, this model gives a good match to the behavior of listeners. The place model that we used gave erratic results and is not reproduced here.
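The two candidate periodicities in this stimulus can be seen directly by constructing it. The Python sketch below uses unit-sample pulses and an assumed sampling rate and pulse rate; it only shows which repetition periods are present in the waveform and does not attempt to model which one a listener hears.

```python
import numpy as np

def fg_pulse_train(rate_hz, n_pulses, fs):
    """Unit pulses at rate R with the polarity pattern +, +, +, - repeated."""
    spacing = int(round(fs / rate_hz))          # samples between pulses
    x = np.zeros(spacing * n_pulses)
    signs = np.tile([1.0, 1.0, 1.0, -1.0], n_pulses // 4 + 1)[:n_pulses]
    x[::spacing] = signs
    return x, spacing

fs = 48000                                      # assumed sampling rate
rate = 400.0                                    # illustrative pulse rate R
x, spacing = fg_pulse_train(rate, n_pulses=40, fs=fs)

# The waveform repeats after four pulse intervals, not after one ...
repeats_after_1 = np.array_equal(x[spacing:], x[:-spacing])
repeats_after_4 = np.array_equal(x[4 * spacing:], x[:-4 * spacing])
# ... but if polarity is ignored, the pulses recur at the pulse rate R.
mag = np.abs(x)
magnitude_repeats_after_1 = np.array_equal(mag[spacing:], mag[:-spacing])

fundamental_hz = fs / (4 * spacing)             # equals R / 4
print(repeats_after_1, repeats_after_4, magnitude_repeats_after_1, fundamental_hz)
```

The signed waveform has period 4/R, whereas the pulse positions alone recur every 1/R; these are the two rates between which the listeners' judgments switched.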


The reader may feel at this point that no conclusions can be drawn at all. Certainly our understanding of pitch perception is incomplete, and the many experiments on the topic often seem to point in different directions. However, the two models discussed do follow, to some extent, the psychoacoustic data. One can argue, as do Meddis et al. [13] and Delgutte and Cariani [3], that the interval histograms of fiber firings supply all the required pitch information. Yet it is still necessary to explain how this periodicity information is translated by the auditory system into the necessary place information in the cortex. One can subscribe to the modified place model of Goldstein, but we still need to understand how the various resolved harmonics get translated into a single answer. In either case, it is likely that the brain makes use of all available information, particularly since some aspects of the signal may be obscured under different acoustic conditions. In any event, these models have provided much food for thought for engineers who have designed automatic pitch-detection systems for use in speech communications or for diagnostic purposes in detecting vocal cord or vocal tract illnesses.


FIGURE 16.17 Spectra from “West End Blues” and proposed important spectral sections for pitch perception.

We close the chapter by giving a musical example to illustrate the portions of the spectrum that are the proposed critical inputs for the auditory system according to some of the model builders referenced in this chapter. Figure 16.17 shows the spectra of the first eight notes of the great trumpet cadenza by Louis Armstrong in “West End Blues.”


  16.1 Experiments have demonstrated that nerve fibers have a refractory period; that is, for several milliseconds following the production of a spike, the neuron is incapable of producing a new spike. This implies that the spike frequency of a single neuron cannot exceed several hundred hertz. Given this, how can one justify the periodicity model of perception?
  16.2 Combination tones produced by the ear's nonlinearity have been verified from psychoacoustic data. Why doesn't this fact nullify the validity of Licklider's demonstration that the missing fundamental was not re-created in the cochlea by nonlinearities?
  16.3 Design an experiment using Ritsma's setup (Fig. 16.6) to show that the auditory system perceives pitch based on the low-frequency portion of the spectrum.
  16.4 Ritsma showed that the dominant frequencies for pitch perception were harmonics 3, 4, and 5. Do you think that this relationship holds for very low (≈50 Hz) or very high (≈1000 Hz) fundamental frequencies? Discuss.
  16.5 (a) Miller and Sachs collected poststimulus time histograms to produce Fig. 16.7. Explain how this is done.
    (b) Instead of poststimulus time histograms, interval histograms are often collected. Explain the difference. Would interval histograms for the same stimuli have been as useful?
    (c) Licklider [11] and Meddis et al. [13] both presented a pitch-perception model based on a sum of autocorrelation functions across frequency channels. Derive a relationship between this function and the interval histogram.
  16.6 Explain the results of Fig. 16.7. Why are some fibers synchronized to a harmonic and others to the fundamental frequency?
  16.7 Since most of us don't have perfect pitch, we can't simply assign a number or a note to the pitch of a sound. Think of one or more ways of determining pitch perception experimentally.
  16.8 Both place and periodicity models performed quite well for the stimulus of Fig. 16.11. Explain how the models managed to emulate human performance for this task.
  16.9 What is the property of comb-filtered noise that makes it plausible for the place model to emulate human perception for at least the lower pitches?
  16.10 What is the pitch of a harmonic series with missing fundamental frequency? Is it (a) the fundamental frequency, (b) twice the fundamental frequency, (c) halfway between (a) and (b), or (d) none of the above? (In that case, what is your guess?)


  1. Bilsen, F. A., “Pitch of noise signals: evidence for a central spectrum,” J. Acoust. Soc. Am. 61: 150-161, 1977.
  2. de Boer, E., “On the ‘residue’ and auditory pitch perception,” in W. Keidel, ed., Handbook of Sensory Physiology 5, Springer-Verlag, New York/Berlin, pp. 479-583, 1976.
  3. Delgutte, B., and Cariani, P., “Coding of the pitch of harmonic and inharmonic sounds in the discharge patterns of auditory nerve fibers,” in M. E. H. Schouten, ed., The Processing of Speech, Mouton de Gruyter, 's-Gravenhage, The Netherlands, 1992.
  4. Duifhuis, H., Willems, L. F., and Sluyter, R. J., “Measurement of pitch in speech: an implementation of Goldstein's theory of pitch perception,” J. Acoust. Soc. Am. 71: 1568, 1982.
  5. Eddington, D. K., “Speech discrimination in deaf subjects with cochlear implants,” J. Acoust. Soc. Am. 68: 885-891, 1980.
  6. Flanagan, J. L., and Guttman, N., “On the pitch of periodic pulses,” J. Acoust. Soc. Am. 32: 1308-1319, 1960.
  7. Goldstein, J. L., “An optimum processor theory for the central formation of the pitch of complex tones,” J. Acoust. Soc. Am. 54: 1496-1516, 1973.
  8. von Helmholtz, H., On the Sensation of Tone as a Physiological Basis for the Study of Music, 4th ed., A. J. Ellis, trans., Dover, New York, 1954; orig. German, 1862.
  9. Houtsma, A. J. M., and Goldstein, J. L., “The central origin of the pitch of complex tones: evidence from musical interval recognition,” J. Acoust. Soc. Am. 51: 520-529, 1972.
  10. Houtsma, A. J. M., Rossing, T. D., and Wagenaars, W. M., “Auditory demonstrations,” Philips compact disk, Inst. Perceptual Research and Acoustical Soc. Am., Eindhoven, 1987.
  11. Licklider, J. C. R., “A duplex theory of pitch perception,” Experientia 7: 128-133, 1951.
  12. Licklider, J. C. R., “‘Periodicity pitch’ and ‘place pitch’,” J. Acoust. Soc. Am. 26: 945 (A), 1954.
  13. Meddis, R., and Hewitt, M. J., “Virtual pitch and phase sensitivity of a computer model of the auditory periphery,” J. Acoust. Soc. Am. 91: 233-245, 1991.
  14. Miller, M. I., and Sachs, M. B., “Representation of voice pitch in discharge patterns of auditory-nerve fibers,” Hearing Res. 14: 257-279, 1984.
  15. Moore, B. C. J., An Introduction to the Psychology of Hearing, 5th ed. Academic Press, New York/London, 2003.
  16. Ritsma, R. J., “Frequencies dominant in the perception of the pitch of complex sounds,” J. Acoust. Soc. Am. 42: 191-198, 1967.
  17. Schouten, J. F., “The residue, a new component in subjective sound analysis,” Proc. Kon. Acad. Wetensch (Neth.) 43: 356-365, 1940.
  18. Seebeck, A., “Ueber die Sirene,” Ann. Phys. Chem. 60: 449-481, 1843.
  19. Seneff, S., “Real-time harmonic pitch detector,” IEEE Trans. Acoust. Speech Signal Process. ASSP-26: 358-365, 1978.

1 An interval is the ratio between the fundamental frequencies of two signals.

2 Comb filtering refers to multiplying the input spectrum by some simple periodic function, for instance by placing equidistant zeros around the unit circle in the z plane. Since the result is a filtering out of periodic chunks of the spectrum, the filtering action appears like a comb.
