Psychoacoustics is the science in which we quantify the human perception of sounds. The ultimate aim is to derive a quantitative model that matches the results of all auditory experiments that we can contrive. This is quite a tall order, since, to a great extent, the human auditory system remains a “black box,” despite many years of physiological research. In this chapter we survey some of the results that have the most obvious relevance to speech and audio applications. Where possible, we correlate psychoacoustical phenomena with physiological measurements.

We can establish some objective variables that will be adjusted in order to assess human perception of sounds. For frequency and intensity, standardized instruments can produce outputs that are linearly proportional to the stimulus. For example, a device that counts the number of zero crossings of a sinusoid over a prescribed time interval can be calibrated to read what we can define as the frequency of the signal. A measure of the spectrum of a sound can also be defined, for instance by a particular form of spectrogram. Duration is another objective property of a sound.

Each of these sound characteristics has a corresponding perceptual variable. The perception of frequency is called pitch, the perception of intensity is called loudness, and the perception of spectrum is called timbre. These human response variables are not linearly proportional to the value of the corresponding stimulus variables. Thus, if a person hears a pure tone at some given frequency f, followed by another tone at 2f, the perception will not be that the frequency of the second tone is double that of the first tone. Similarly, if the intensity of the tone is doubled, the human subject will not describe it as twice as loud. The same argument holds for perception of duration. Furthermore, the response variables are often dependent on more than one of the stimulus variables. For instance, the subjective impression of pitch, although primarily dependent on frequency, can vary with other parameters, such as intensity or spectrum. The same holds for loudness and timbre.

Some issues of interest in this area include the following.

  1. How sensitive is human hearing? How does the ear¹ respond to different intensities?
  2. How does the ear respond to different frequencies?
  3. How well does the ear focus on a given sound of interest in the presence of interfering sounds?

Such questions can be quantitatively addressed by conducting psychoacoustic tests. In the following sections we review some of the classic experiments that have been performed to try to answer a few of these types of questions. In particular, we discuss experiments to demonstrate the dependence of perceived loudness on objective parameters of intensity, duration, and spectrum. Pitch will be discussed in Chapter 16.

The reader's understanding can be further enhanced by listening to a set of demonstrations that were released on a commercial compact disk, Auditory Demonstrations [7]. A review paper by Hartmann [6] contains evaluations based on class listening tests for these demonstrations.


According to Fechner's law (often called the Weber-Fechner law), sensations (hearing, seeing, smelling, etc.) increase logarithmically as the intensity of the stimulus increases. Many experiments have verified this law, at least approximately, and it has led to the widespread use of the decibel scale. To define a subjective measure of loudness, we introduce the sone. Based on the work of Stevens [11] and others, an empirical relation between the sound pressure p and the loudness S in sones gives the result

$$ S = C\,p^{0.6}, \qquad (15.1) $$

where C is a constant and a loudness of 1 sone is defined as the loudness of a 1000-Hz tone at 40 dB SPL. Recalling from Chapter 13 that the intensity I is proportional to the square of the pressure, we obtain

$$ S = C'\,I^{0.3}. \qquad (15.2) $$

Since 0.3 is close to 1/3, we can say, roughly speaking, that the loudness is proportional to the cube root of the intensity.² From Eq. 13.9, a 10-dB increase in sound level corresponds to an increase in intensity by a factor of 10. According to Eq. 15.2, this 10-dB increase multiplies the sone value by 10^{0.3} ≈ 2; that is, it roughly doubles the loudness.
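As a small numerical check of Eq. 15.2, the sketch below computes loudness in sones for a 1000-Hz tone, choosing the constant so that 40 dB SPL gives exactly 1 sone (the definition above). The function names are illustrative only, and the "double per 10 dB" rule is shown alongside for comparison; the two differ only because 10^0.3 is 1.995 rather than exactly 2.

```python
def loudness_sones(spl_db: float) -> float:
    """Stevens' power law S = C' * I**0.3 for a 1000-Hz tone,
    normalized so that 40 dB SPL corresponds to exactly 1 sone."""
    # Intensity relative to the 40-dB reference: I / I_40 = 10**((L - 40) / 10)
    relative_intensity = 10.0 ** ((spl_db - 40.0) / 10.0)
    return relative_intensity ** 0.3


def loudness_sones_rule_of_thumb(spl_db: float) -> float:
    """Approximation: loudness doubles for every 10-dB increase above 40 dB SPL."""
    return 2.0 ** ((spl_db - 40.0) / 10.0)


for level in (40, 50, 60, 80):
    print(level, round(loudness_sones(level), 2), round(loudness_sones_rule_of_thumb(level), 2))
```

Because the phon level of a 1000-Hz tone equals its SPL, the same formula can be read as a phon-to-sone conversion for levels around and above 40 phons.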

The constant of proportionality implied by Eq. 15.1 is frequency dependent. In general, the loudness reported by a listener is a function of the intensity, frequency, and quality of the sound. Figure 15.1 shows a standardized set of equal-loudness curves for pure tones. Each curve traces, as a function of tone frequency, the SPL at which tones are judged to be equally loud. Typically, the listener adjusts the intensity of a tone at a given frequency until it is judged to be as loud as a standard 1000-Hz tone. We see from these curves that the ear is most sensitive to sounds of approximately 4 kHz. The loudness level of each contour is expressed in units called phons; by definition, the phon value of a contour equals the SPL in decibels of the 1000-Hz tone on that contour. Thus, for example, we see that for an SPL of 40 dB, the loudness level of a tone at 100 Hz is 20 phons lower than that of a tone at 1000 Hz.

This frequency-dependent sensitivity led to the differing standards for sound-pressure-level meters that were briefly described in Chapter 13 (Section 13.2.5 in particular). The relative insensitivity of the ear to low-frequency sounds led to the development of A weighting, which de-emphasizes these sounds. Note also that this variation in sensitivity is reduced for louder sounds, which led to the development of B-weighted and C-weighted SPL measurements.
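To make the weighting idea concrete, the sketch below evaluates the analytic A- and C-weighting curves in the form given in the sound-level-meter standard IEC 61672-1 (the text itself does not state these formulas, and only A and C are shown here). The A curve strongly attenuates low frequencies, mimicking the low-level equal-loudness contours, while the C curve is nearly flat, closer to the contours for loud sounds.

```python
import math


def a_weighting_db(f_hz: float) -> float:
    """A-weighting gain in dB (analytic form from IEC 61672-1); about 0 dB at 1 kHz."""
    f2 = f_hz * f_hz
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # The +2.00 dB offset normalizes the curve to 0 dB at 1000 Hz.
    return 20.0 * math.log10(ra) + 2.00


def c_weighting_db(f_hz: float) -> float:
    """C-weighting gain in dB (analytic form from IEC 61672-1); about 0 dB at 1 kHz."""
    f2 = f_hz * f_hz
    rc = (12194.0 ** 2 * f2) / ((f2 + 20.6 ** 2) * (f2 + 12194.0 ** 2))
    return 20.0 * math.log10(rc) + 0.06


for f in (20, 100, 1000, 4000, 10000):
    print(f, round(a_weighting_db(f), 1), round(c_weighting_db(f), 1))
```

At 100 Hz the A curve is roughly 19 dB down while the C curve is within a fraction of a decibel of 0 dB, which is the low-frequency de-emphasis described above.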


FIGURE 15.1 Equal loudness curves for pure tones. From [4].

These relationships are illustrated in [7] by demonstrations 4 and 6. In demonstration 4, tones are played in intensity steps of 1, 3, and 6 dB. In demonstration 6, tones are played in intensity steps of 5 dB for frequencies ranging from 125 to 8000 Hz.

Finally, many experiments have been performed to demonstrate the dependence of loudness on the duration of a sound. Such experiments have shown that a sound shorter than approximately 200 ms is perceived as less loud than a sound of the same intensity lasting longer than 200 ms. In demonstration 8 of [7], for instance, pulses are presented with decreasing sound-pressure levels (0, –16, –20, –24, –28, –32, –36, and –40 dB). The listener responds by noting at which step the sound became inaudible. The test is repeated for durations of 1000, 300, 100, 30, 10, 3, and 1 ms. Figure 15.2 shows the result averaged over 103 listeners in a reasonably reverberant classroom, as reported in [6].

The scale on the left of Fig. 15.2 shows the pulse level, relative to the first pulse, at which audibility was lost. The dashed line is a reference with a slope of –10 dB per decade of duration. We see that the result asymptotes at approximately 100 ms, at which point further duration increases change the perceived threshold very little. This time would probably be longer if the experiment had been conducted with headphones, as the reverberation in the classroom undoubtedly extended the length of each pulse as received by the listeners.


FIGURE 15.2 Effect of duration of a pulse in noise on loudness. For each tone duration the ordinate on the solid curve gives the smallest audible tone level (on the left) and the corresponding step number (on the right). Error bars of 2 standard deviations are shown. The broken line is a reference to show a slope of –10 dB per decade of duration, which corresponds to a simple integration of signal power. From [6], Fig. 5.
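The –10 dB per decade reference line in Fig. 15.2 follows from a simple energy-integration argument: if detection depends on the energy (power times duration) collected within some integration window, then shortening a pulse by a factor of 10 requires 10 dB more power. The sketch below implements that idealized integrator. The 200-ms integration time is an assumed default taken from the approximate figure quoted above; the classroom data in Fig. 15.2 level off nearer 100 ms.

```python
import math


def threshold_shift_db(duration_s: float, integration_time_s: float = 0.2) -> float:
    """Idealized energy integrator (an assumption, not a model from the text).

    A pulse is detected when its energy reaches a fixed criterion, but only
    energy inside the integration window counts. Returns how many dB the pulse
    level must exceed the threshold for a long tone.
    """
    effective_s = min(duration_s, integration_time_s)
    return 10.0 * math.log10(integration_time_s / effective_s)


for d in (1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001):
    print(d, round(threshold_shift_db(d), 1))
```

Below the integration time, each decade of shortening costs exactly 10 dB; above it, the threshold no longer changes, which mirrors the asymptote seen in the measured curve.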


In Section 15.2, we described the frequency dependence of a listener's sensitivity to pure tones. However, most signals of interest are more complex, which motivates other types of experiments on the perception of multiple tones and of tones in noise. Some of the most important of these are based on the notion that some kind of frequency analysis is central to human hearing.

As we discussed in the previous chapter, auditory neurons are tuned to specific characteristic frequencies. Thus, we surmise that the auditory system behaves like some sort of filter bank. Experiments have been performed for many years to determine the characteristics of these auditory filters. Fletcher [2] performed some of the early experiments of this kind. In one such series of experiments, he based his measurements on the ear's response to a pure tone in band-limited white noise. Initially, the tone level was set so that the tone was clearly audible to listeners with normal hearing. The tone level was then decreased in discrete 5-dB steps until the listener no longer heard it, and the number of steps was recorded. The noise bandwidth was then decreased (keeping the same flat spectral level) and the experiment was repeated. Until the noise bandwidth was reduced to some critical value, the listener's ability to hear the tone remained the same, despite the decrease in total noise power. For noise bandwidths below this critical value, however, the tone became progressively easier to detect as the bandwidth was narrowed; that is, more attenuation steps were needed before it became inaudible.

The experiment could then be repeated with a tone of a different frequency, and in this way these critical values could be plotted over the total audible band.

The implication is this: in listening to a tone of a given frequency, the listener applies a psychological filter of width approximately equal to this critical value. The filter ignores noise outside this bandwidth. Thus, the decision as to the absence or presence of the tone is based on the signal-to-noise ratio within this band. Figure 15.3 is a caricature of the apparent and measured signal-to-noise (S/N) ratios for this type of experiment. Note that the break point in the solid line in this figure can be interpreted as showing the noise bandwidth corresponding to a critical band for the particular tone frequency used.


FIGURE 15.3 Critical ratio experiment. The solid line shows the apparent signal-to-noise ratio of the psychological filter; the dashed line shows the signal-to-noise ratio of the stimulus.
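The break point in Fig. 15.3 can be reproduced with a very simple model: treat the auditory filter as an ideal rectangular band of width equal to the critical band, centered on the tone, and count only the noise that falls inside it. The sketch below compares that in-filter S/N with the S/N of the whole stimulus. The rectangular filter shape and the 160-Hz critical band are illustrative assumptions, not values from the text.

```python
import math


def in_band_snr_db(tone_power, noise_density, noise_bandwidth_hz, critical_band_hz):
    """S/N seen by an idealized rectangular auditory filter centered on the tone.

    Only the part of the (flat, tone-centered) noise inside the filter counts.
    """
    noise_in_filter = noise_density * min(noise_bandwidth_hz, critical_band_hz)
    return 10.0 * math.log10(tone_power / noise_in_filter)


def stimulus_snr_db(tone_power, noise_density, noise_bandwidth_hz):
    """Ratio of tone power to the total noise power in the stimulus."""
    return 10.0 * math.log10(tone_power / (noise_density * noise_bandwidth_hz))


CRITICAL_BAND_HZ = 160.0  # hypothetical critical bandwidth, for illustration only

for bw in (2000, 1000, 500, 250, 160, 80, 40):
    print(bw,
          round(stimulus_snr_db(1.0, 1e-3, bw), 1),
          round(in_band_snr_db(1.0, 1e-3, bw, CRITICAL_BAND_HZ), 1))
```

The stimulus S/N rises by 3 dB for every halving of the noise bandwidth, but the in-filter S/N stays constant until the noise bandwidth falls below the assumed critical band, which is the flat-then-rising shape of the solid line in the figure.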

Fletcher called these critical filter widths the critical bands. Later researchers (see Chapter 19) have developed methods of estimating the shapes of auditory filters.

From an engineering standpoint, probably the most important result that emerges from critical band research is that auditory filters with higher center frequencies have greater bandwidths. The Bark scale and the scale proposed by Greenwood, as shown in Fig. 15.4, are reasonable approximations to critical bands obtained from psychoacoustic measurements.
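One widely used closed-form approximation to the Bark scale is due to Zwicker and Terhardt; it is shown below as an assumption, since the text does not say which formula underlies Fig. 15.4. The sketch computes the critical bandwidth and the Bark value for several center frequencies and prints the relative bandwidth, which is far from constant at low frequencies (unlike a constant-Q filter bank).

```python
import math


def critical_bandwidth_hz(f_hz: float) -> float:
    """Zwicker-Terhardt approximation to the critical bandwidth at f_hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation to critical-band rate (Bark)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


for f in (100, 500, 1000, 2000, 4000, 8000):
    cb = critical_bandwidth_hz(f)
    print(f, round(cb), round(cb / f, 2), round(hz_to_bark(f), 2))
```

The bandwidth grows with center frequency, as the text states; below roughly 500 Hz it stays near 100 Hz, so its ratio to the center frequency is large there.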

Similar results can be obtained by having listeners compare the loudness for a pair of noise bursts. In one experiment demonstrated in [7], the reference noise burst had a bandwidth that was a fixed 15% of the center frequency, whereas the test noise burst maintained a constant power by lowering the amplitude and widening the bandwidth as shown in Fig. 15.5. As long as the test-signal bandwidth was smaller than the critical band, the loudness of the two stimuli remained equal, but when the test stimulus exceeded the critical band, it was typically judged to be louder than the reference.


FIGURE 15.4 Plots of estimated bandwidth as a function of center frequency for the Bark scale, and Greenwood's cochlear frequency-position function, with two constant-Q scales also shown for comparison. From [8].


FIGURE 15.5 Critical bands by loudness comparison. Solid curves represent the spectrum of the test noise burst; dashed curves represent the spectrum of the reference noise burst. The loudnesses of reference and test bursts are compared by listeners. From [7].
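The test stimulus of Fig. 15.5 keeps total power constant while the bandwidth grows, so its spectral level must fall as the band widens. The sketch below generates such band-limited noise by setting a flat, random-phase spectrum in the passband and rescaling to a fixed mean-square value. The function name, sampling rate, and FFT-based construction are illustrative assumptions, not the procedure used to make the demonstration disk.

```python
import numpy as np


def bandlimited_noise(center_hz, bandwidth_hz, power, fs=16000, duration_s=1.0, seed=None):
    """Noise with a flat spectrum over center_hz +/- bandwidth_hz/2 and mean-square `power`."""
    rng = np.random.default_rng(seed)
    n = int(round(fs * duration_s))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    in_band = np.abs(freqs - center_hz) <= bandwidth_hz / 2.0
    if not in_band.any():
        raise ValueError("bandwidth too narrow for the frequency resolution")
    # Unit magnitude with random phase in the passband gives a flat spectrum.
    spectrum = np.zeros(freqs.shape, dtype=complex)
    spectrum[in_band] = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, in_band.sum()))
    x = np.fft.irfft(spectrum, n)
    # Fix the total power: a wider band therefore has a lower spectral level.
    return x * np.sqrt(power / np.mean(x ** 2))


reference = bandlimited_noise(1000.0, 0.15 * 1000.0, 0.01, seed=0)  # 15% of center frequency
wide_test = bandlimited_noise(1000.0, 1200.0, 0.01, seed=1)
```

Both bursts have the same mean-square value, yet each spectral component of the wide burst is smaller, which corresponds to the lowered, widened solid curves in the figure.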

These two critical band experiments are presented in demonstrations 2 and 3 in [7].


In Section 15.3, we described an experiment in which the ability to hear a tone was masked by the presence of noise within the same spectral region. This led to models for the spectral resolution of human hearing. Similarly, many experiments are based on the effect of one tone on another, which, it is hoped, will lead to a better understanding of the perception of complex sounds. When two tones are presented simultaneously, the weaker tone may, in some cases, not be heard. A number of results have commonly been observed for this type of experiment. Closer tones have a greater effect, and louder tones affect tones that are farther away in frequency. It has also been observed that a tone more easily masks a tone of higher frequency than one of lower frequency.

In masking experiments, there is typically a target signal, which is the one to be detected, and a masker signal, which is the one being manipulated to affect the listener's perception of the target. In one such experiment, a 2000-Hz target signal and a 1200-Hz masker are applied simultaneously. The masker consists of eight bursts and the signal consists of four bursts over the same interval, as shown in Fig. 15.6. The sequence is repeated 10 times, with the target tone set at a decreased intensity level for each presentation (down by 15 dB for the second sequence, and reduced by 5 dB more for each new sequence). After all presentations, the masker and target signals are reversed, as shown in the figure.
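The level schedule described above is easy to misread, so the sketch below spells it out, assuming that "repeated 10 times" means 10 presentations in total. The function and parameter names are hypothetical.

```python
def target_levels_db(n_presentations=10, first_drop_db=15.0, step_db=5.0):
    """Relative target level (dB) for each presentation of the burst sequence.

    The first presentation is at the reference level, the second is 15 dB down,
    and each later presentation is a further 5 dB down.
    """
    return [0.0] + [-(first_drop_db + step_db * k) for k in range(n_presentations - 1)]


# The whole schedule is run once with each masker/target assignment.
for masker_hz, target_hz in ((1200, 2000), (2000, 1200)):
    print(f"masker {masker_hz} Hz, target {target_hz} Hz:", target_levels_db())
```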

Some of the results quoted in [6] for experiments of this kind are ambiguous, because they are affected by other factors, such as the overall intensities and the amount of room reverberation present. However, the asymmetry of the masking effect, namely its greater spread upward in frequency than downward, is illustrated by the histogram in Fig. 15.7. The figure plots the number of listeners who heard n more pulse streams with the 2000-Hz masker than with the 1200-Hz masker. The open bars are for the 60-dB masker level and the dark bars are for the 75-dB masker level. It is clear, at least for this test, that for a sufficiently intense masker the lower-frequency masker is decidedly the more effective one.


FIGURE 15.6 Signals and maskers for a simultaneous masking experiment. The length of each line represents the duration of a tone. From [6].

Demonstration 9 of [7] presents this experiment for the listener.
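The upward spread of masking is often modeled in audio coders with an asymmetric spreading function on the Bark scale. The sketch below uses the spreading function of Schroeder, Atal, and Hall (1979), a swapped-in model that the text does not itself present, together with the Zwicker-Terhardt Bark approximation, to compare the 1200-Hz and 2000-Hz tones of this experiment. It is level independent, so it does not capture the observation that louder maskers spread farther.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation to critical-band rate (Bark)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def spreading_db(dz: float) -> float:
    """Schroeder et al. spreading function in dB.

    dz = z_target - z_masker in Bark; positive dz means the target lies above the masker.
    """
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)


dz = hz_to_bark(2000.0) - hz_to_bark(1200.0)
print("1200-Hz masker at 2000 Hz:", round(spreading_db(dz), 1), "dB")
print("2000-Hz masker at 1200 Hz:", round(spreading_db(-dz), 1), "dB")
```

The masking spread reaching 2000 Hz from the 1200-Hz masker is attenuated far less than the spread reaching 1200 Hz from the 2000-Hz masker, which is the same asymmetry shown in Fig. 15.7.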

Masking can also occur when the signal and masker are nonsimultaneous. In experiments of this sort, a short signal (often called a probe) is presented at various times as a target signal, and the effect of the masker is measured. When the masker precedes the probe, the effect on perception is referred to as forward masking. When the masker follows the probe, the effect is referred to as backward masking. Masking effects decrease as the time between masker and probe increases, but can persist for 100 ms or more [9]. Temporal (nonsimultaneous) masking experiments are presented in demonstration 10 of [7].

High-quality audio coding schemes such as MPEG audio (e.g., MP3) take advantage of simultaneous and nonsimultaneous (temporal) masking to hide the distortion that results from representing each sample with very few bits. We will describe these techniques, including a more detailed discussion of masking, in Chapter 35.


FIGURE 15.7 Asymmetry of the masking of one tone by another. For each level (upper horizontal axis), the vertical bars show the number of listeners for whom n more streams are masked by the lower-frequency masker than are masked by the higher-frequency masker, where n is marked on the lower horizontal axis. From [6].


In this chapter, we have presented a brief introduction to experiments that have demonstrated the dependence of loudness on a number of stimulus characteristics. In choosing these experiments, we have focused on those characteristics that seemed fundamental to our overall goals, namely the signal processing of speech and audio. Some of the key results of this chapter are as follows.

  1. Loudness is roughly proportional to the cube root of sound intensity, with a doubling of loudness being observed for a 10-dB increase in the SPL.
  2. Loudness is frequency dependent, and this frequency dependence is itself amplitude dependent. The sensitivity is greatest at approximately 4 kHz, with large deviations for low frequencies at low signal levels (up to 80 dB less sensitivity at 20 Hz). Smaller deviations are observed for low frequencies at high signal levels.
  3. Longer sounds of a given intensity sound louder, up to a duration of 200 ms. This implies some kind of temporal integration, as do the results of forward-masking experiments.
  4. Experiments indicate the existence of critical band filters. The bands are wider at high frequencies than at low frequencies. Further details on these filter characteristics will be described in Chapter 19 in the context of filter-bank models that can be used in speech and audio applications.
  5. Pure tones that are close in frequency mask one another, with the lower-frequency tone masking the higher-frequency tone more than vice versa. Higher-amplitude tones also mask neighboring tones more than lower-amplitude tones. This set of results will affect the filter-bank models described in Chapter 19, and is central to the perceptual audio coding techniques of Chapter 35.

As with many of the topics of this book, we can only provide an introduction to a major area of study such as psychoacoustics (for further reading, see [3], [5], [9]). In both the previous chapter and this one, for instance, we have avoided discussion of binaural (two-ear) hearing, even though this aspect of hearing is fundamental to the ability to locate sound sources spatially; the use of multiple receivers by the auditory system also improves intelligibility under noisy or reverberant conditions. Similarly, there have been many studies that explore the ability of humans to identify and separate auditory streams; a comparable visual phenomenon would be the ability to recognize a car on the other side of a white picket fence. Many experiments of this latter type are described in [1]. Although we do not go into further detail on these topics, the study of pitch perception is essential to the understanding of speech and audio signal-processing algorithms. This is the subject of Chapter 16.


15.1 Critical band experiments result in psychoacoustical tuning curves that resemble bandpass filter frequency-response curves. Theory says that these curves are related to the tuning curves of Fig. 14.10, but these latter curves seem to be the inverse of critical band filters. How do you explain this apparent discrepancy?
15.2 Having explained the previous problem, we are still faced with the fact that physiological tuning curves and psychoacoustic tuning curves are different. Can you provide a plausible explanation for this difference?
15.3 As the intensity of a pure tone is increased, it sounds louder. What do you think happens neurophysiologically to make this happen?
15.4 What reason can you give to explain why a low-frequency tone is better able to mask a tone of higher frequency than vice versa?
15.5 One of the difficulties with two-tone masking is the beating effect. Thus, although the weaker tone is effectively masked, the listener hears the beats. Devise an experiment that overcomes this difficulty.
15.6 Three tones (100 Hz, 2000 Hz, and 7000 Hz) are presented monaurally over wideband headphones (40 Hz to 16 kHz) to a young adult subject with normal hearing. In each case, the sound-pressure level at the subject's ear is 40 dB. What would be the expected ordering of the tones, from loudest to softest?


  1. Bregman, A. S., Auditory Scene Analysis, MIT Press, Cambridge, Mass., 1990.
  2. Fletcher, H., “Auditory patterns,” Rev. Mod. Phys. 12: 47-65, 1940.
  3. Fletcher, H., Speech and Hearing in Communication, Van Nostrand, Princeton, N.J., 1953.
  4. Fletcher, H., and Munson, W. A., “Loudness, its definition, measurement and calculation,” J. Acoust. Soc. Am. 5: 82-108, 1933.
  5. Green, D. M., An Introduction to Hearing, Wiley, New York, 1976.
  6. Hartmann, W. M., “Auditory demonstrations on compact disk for large N,” J. Acoust. Soc. Am. 93: 1-16, 1993.
  7. Houtsma, A. J. M., Rossing, T. D., and Wagenaars, W. M., “Auditory demonstrations,” Philips compact disk, Inst. Perceptual Research and Acoustical Soc. Am., Eindhoven, 1987.
  8. Kingsbury, B. E. D., Perceptually Inspired Signal Processing Strategies for Robust Speech Recognition in Reverberant Environments, PhD Thesis, U.C. Berkeley, 1998.
  9. Moore, B. C. J., An Introduction to the Psychology of Hearing, 5th ed., Academic Press, New York/London, 2003.
  10. Rossing, T. D., The Science of Sound, Addison–Wesley, Reading, Mass., 1990.
  11. Stevens, S. S., “The direct estimation of sensory magnitudes: loudness,” Am. J. Psych. 69: 1-25, 1956.

1 In common parlance the ear actually refers to the entire apparatus for hearing, up to and including the brain.

2 Stevens’ empirical formula is not very different from a logarithmic relation over the intensity values of interest.
