In previous chapters, we described some of the fundamentals of the acoustics of tubes and strings, using abstractions that we showed to be relevant to the production of audio signals in the human vocal tract and in some musical instruments. Once these signals leave their sources, however, they generally encounter boundaries that are at least partially reflective. Thus, a listener or a microphone receives a multipath version of the original source signal. Therefore, room acoustics are also a fundamental concern for many audio signal-processing applications. In this chapter, we discuss the effect of room boundaries on a sound wave, the resulting phenomenon of reverberation (i.e., the smearing of a source sound over time as a result of these boundary effects), and the effect of reverberation on speech intelligibility. As with many topics discussed in this book, this chapter can only serve as a brief introduction, with a practical focus on factors that affect the goals of audio signal processing.


At atmospheric pressure and standard conditions of humidity, the speed of sound is

c = 331.4 + 0.6T m/s
where T is the temperature of the air in degrees Celsius. At 20°C, the speed of sound is 343.4 m/s, or about 1127 ft/s, corresponding to a transmission time of approximately 1 ms for each foot. If a reflective boundary is 10 ft (~305 cm) from an acoustic source, it will take roughly 20 ms for the sound wave to return. This is too short for the second sound to be heard as a distinct echo. In contrast, if the boundary is 50 ft or more from the source, it will take at least 100 ms for the sound wave to return, which is long enough for the listener to hear a discrete echo. In many rooms, successive reflections are too close together to be audible as separable events.
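The temperature formula and the round-trip timing argument above can be sketched as follows (the helper names and the metric conversions are ours):

```python
# Speed of sound from the approximation above: c = 331.4 + 0.6*T (m/s),
# with T in degrees Celsius. At 20 C this gives 343.4 m/s.

def speed_of_sound(temp_c):
    """Approximate speed of sound in air (m/s) at temperature temp_c (Celsius)."""
    return 331.4 + 0.6 * temp_c

def round_trip_delay(distance_m, temp_c=20.0):
    """Time (s) for sound to reach a boundary distance_m away and return."""
    return 2.0 * distance_m / speed_of_sound(temp_c)

c20 = speed_of_sound(20.0)            # 343.4 m/s, about 1127 ft/s
delay_10ft = round_trip_delay(3.05)   # ~10 ft boundary: under 20 ms, no distinct echo
delay_50ft = round_trip_delay(15.24)  # ~50 ft boundary: ~90-100 ms, a discrete echo
```

The rule of thumb of roughly 1 ms per foot follows directly from dividing 1 ft by 1127 ft/s.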

In real acoustic environments, the speed of sound varies over time and space, as a result of factors such as temperature (as noted in the equation) and humidity. The nature of the received reflective pattern will be dependent on these sources of variability, as well as on movements of the sound source or receiver around the room. However, for a midfrequency sound wave, and for a given source and receiver position, the room acoustic effect can be seen as a linear time-invariant sum of attenuated, filtered, and delayed versions of the original signal. In other words, for this limited approximation, the room effect can be viewed as a linear convolution with an echo pattern. We will return to this perspective, but first we review acoustic wave propagation in the context of room acoustics.

13.2.1. One-Dimensional Wave Equation

In Chapter 10 we gave a one-dimensional equation relating space, time, and pressure for a progressive acoustic plane wave (repeated here in slightly different form):

∂²p/∂x² = (1/c²) ∂²p/∂t²

with a general solution of the form

p(x, t) = F(ct − x) + G(ct + x)

where F(ct − x) represents a wave moving toward increasing values of x, and G(ct + x) represents a wave moving toward decreasing values of x.

For sinusoidal functions, λ = c/f (i.e., wavelength equals the speed of sound divided by frequency). To give some physical intuition, if we assume that c ≈ 1000 ft/s, then1 a 20-Hz signal has a wavelength ≈50 ft long, a 1-kHz signal has a wavelength ≈1 ft long, and a 20-kHz signal has a wavelength ≈0.5 in. (1.27 cm) long.
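The wavelength relation is easy to tabulate; here is a small sketch using the rounded c ≈ 1000 ft/s from the text (the function name is ours):

```python
# Wavelength = speed of sound / frequency. We use the rounded c ~= 1000 ft/s
# for intuition; the actual value is closer to 1127 ft/s at 20 C.

def wavelength_ft(freq_hz, c_ft_per_s=1000.0):
    """Wavelength in feet of a sinusoid at freq_hz, given the speed of sound."""
    return c_ft_per_s / freq_hz

wavelength_ft(20)     # ~50 ft: low frequencies are room-sized
wavelength_ft(1000)   # ~1 ft
wavelength_ft(20000)  # ~0.05 ft, i.e., ~0.6 in
```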

In our previous examples (speech and music), we examined sound propagation in objects that were small in comparison to the wavelengths of low-frequency sounds (e.g., violins or human vocal tracts). In the case of room acoustics, we consider sound propagation in enclosures that are much larger. For wavelengths that are much smaller than room dimensions, ray paths from the source to the receiver (including reflective paths) can be traced to yield a reasonable approximation to the acoustic transmission characteristics.

It is often useful to think of sound in terms of wavelength, particularly for intuition concerning the extent to which a structure is a barrier to sound transmission. Noting the physical size (wavelength) of acoustic waves can also help provide intuition for another aspect of audio perception, known as head shadowing. The human head is a barrier of significant size to high-frequency sound waves. Thus, sound from the opposite side of the head is low-pass filtered.2

13.2.2. Spherical Wave Equation

Ignoring any directionality of a source, ignoring the effect of boundaries, and assuming a point source, the propagation of waves in three-dimensional space will be spherical. The wave equation in spherical coordinates is

∂²(rp)/∂r² = (1/c²) ∂²(rp)/∂t²

A solution is the complex exponential,

p(r, t) = (P0/r) e^(j(ωt − kr))

where p is the pressure, r is the distance between the source and the receiver, P0 is the amplitude of the sinusoid, t is the time, k = 2π/λ, and ω = kc. Note that the pressure is inversely proportional to r. This proportionality relationship is a reasonable approximation in free space and under many outdoor conditions.

13.2.3. Intensity

Intensity is defined as the amount of sound energy flowing across a unit area surface in a second. This is equivalent to the rms pressure times the rms velocity, or3

I = p_rms u_rms

This can also be expressed without the velocity term, since the velocity is the ratio of pressure to impedance (as noted in Chapter 10), yielding

I = p_rms²/(ρ0c)

where ρ0 is the medium density, c is the speed of sound as before, and ρ0c is the characteristic impedance. For a sinusoid with amplitude P0,

I = P0²/(2ρ0c)

So I ∝ p², and for a spherical wave, p ∝ 1/r implies I ∝ 1/r².

In other words, as with many other interesting physical phenomena, sound intensity follows an inverse square law with distance – but don't forget the assumptions of a point source and lack of reflective boundaries! In addition, one must assume that the sound medium (air in the room acoustics case) does not dissipate any energy, which is particularly incorrect for high-frequency audio signals.
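The inverse square law can be made concrete with a short sketch (the function names are ours; a point source and no boundaries are assumed, per the caveats above):

```python
import math

# Spherical spreading: a point source of power W has intensity W/(4*pi*r^2),
# so each doubling of distance drops the level by 10*log10(4) ~= 6 dB.

def intensity(power_w, r_m):
    """Intensity (W/m^2) of a point source radiating power_w spherically."""
    return power_w / (4.0 * math.pi * r_m ** 2)

def level_drop_db(r1, r2):
    """Change in level (dB) when the receiver moves from r1 to r2 (meters)."""
    return 10.0 * math.log10(intensity(1.0, r2) / intensity(1.0, r1))

level_drop_db(1.0, 2.0)  # ~ -6.02 dB per doubling of distance
```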

13.2.4. Decibel Sound Levels

Conventionally, the difference between sound energy levels is measured in decibels:

ΔL = 10 log10(I1/I2)

Since the intensity is proportional to the square of the pressure,

ΔL = 20 log10(p1/p2)

The denominator values are often chosen to be reference values that correspond to the threshold of hearing at 1 kHz, namely

p0 = 2 × 10⁻⁵ N/m² and I0 = 10⁻¹² W/m²

For these choices, the decibel levels become the sound pressure level (SPL) and the intensity level, respectively. The two measurements are roughly equivalent for plane or spherical waves measured in air.
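A minimal sketch of both level definitions, using the standard reference values given above (the function names are ours):

```python
import math

# Reference values: p0 = 2e-5 N/m^2 and I0 = 1e-12 W/m^2, corresponding
# approximately to the threshold of hearing at 1 kHz.

P_REF = 2e-5    # N/m^2
I_REF = 1e-12   # W/m^2

def spl_db(p_rms):
    """Sound pressure level (dB) re 2e-5 N/m^2."""
    return 20.0 * math.log10(p_rms / P_REF)

def intensity_level_db(i):
    """Intensity level (dB) re 1e-12 W/m^2."""
    return 10.0 * math.log10(i / I_REF)

spl_db(2e-5)             # 0 dB: the threshold of hearing
intensity_level_db(1.0)  # 120 dB: 1 W/m^2, a very loud sound
```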

13.2.5. Typical Power Sources

For speech in many natural situations, it is preferable to assume that the point source propagates hemispherically rather than spherically. Given this assumption, the following are typical power values for various speech sources. The number in parentheses indicates the SPL of the sound wave flowing across a 1-m2 area 40 cm away from the source.

  • Whispered speech: 1 nW (30-dB SPL)
  • Average for speech: 10 μW (70-dB SPL)
  • Loud speech: 200 μW (83-dB SPL)
  • Shouted speech: 1 mW (90-dB SPL)

Ignoring boundaries, we find that the SPL would be 6 dB lower for each doubling of distance. Note that shouted and whispered speech differ by 60 dB, which is a power ratio of a million! This is certainly not comparable to the human sense of relative loudness; however, the cube root of intensity is often used as such a measure. Given this approximation, a 10-dB increase in intensity corresponds roughly to a doubling of perceived loudness. However, as we discuss in Chapter 15, loudness is frequency dependent. Perceived loudness is generally weaker for lower frequencies, though the apparent frequency response is flatter for louder sounds. The loudness control found on some stereos, as opposed to the volume, does some filtering to adjust for human sensitivity characteristics; roughly speaking, these controls usually bring up the extremes of low and high frequencies at low volumes so as to compensate for the reduced perceived bandwidth at the lower volume level.
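The SPL figures in the list above can be verified directly: hemispherical radiation spreads the power over an area 2πr², which at 40 cm is almost exactly 1 m². A sketch (the function name is ours):

```python
import math

# Hemispherical radiation: power W spreads over an area 2*pi*r^2. At
# r = 0.4 m that area is ~1.005 m^2, so intensity in W/m^2 is nearly
# equal to the source power in W, matching the quoted SPL values.

def speech_spl_db(power_w, r_m=0.4):
    """SPL (dB re 1e-12 W/m^2) of a hemispherically radiating source."""
    area = 2.0 * math.pi * r_m ** 2        # hemisphere surface area (m^2)
    i = power_w / area                     # intensity (W/m^2)
    return 10.0 * math.log10(i / 1e-12)

speech_spl_db(1e-9)    # whispered speech: ~30 dB SPL
speech_spl_db(10e-6)   # average speech:   ~70 dB SPL
speech_spl_db(1e-3)    # shouted speech:   ~90 dB SPL
```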

Sound-level meters often make use of standard frequency-weighting characteristics in order to more closely approximate loudness, as opposed to acoustic energy. This certainly makes sense for measurements that will be relevant for human perception, such as the noise level in a building. One of the most common standards is the A-weighted measurement, typically used for a low to moderate loudness level, in which a large energy-loudness correction must be made for the low frequencies (reducing sensitivity to low frequencies) [6]. This is probably appropriate for applications such as describing the noise levels in a typical acoustic background, since a low-frequency rumble could have a very large amplitude before it had the same perceptual effect as a midfrequency sound. Sound-pressure-level meters commonly have an A setting for this purpose; B and C settings correspond to loudness corrections for successively higher sound pressure levels, and they apply smaller low-frequency compensations. SPL measurements for the A-weighted case are often referred to as dBA.

For machine speech-analysis purposes, these weighting curves can often be misleading, since a large low-frequency noise can sometimes have strong effects on our algorithms but will have a reduced effect on dBA measurements.


As suggested earlier, the mathematics used for acoustically modeling a room as a box is similar to that used for abstractions of strings and tubes. Solving the wave equation will result in characteristic resonances that will be instantiated as standing waves. For an ideal box, the lowest-frequency resonance will correspond to a wavelength that is twice the size of the room's longest dimension. Any room will actually have a large number of such resonances, and at higher frequencies these can be approximated by a continuum; as explained in [4], the number of modes (eigenfrequencies) below frequency f is proportional to f3.

For sound-production devices, such resonant phenomena are often the most critical aspect. However, resonances are often not the most important aspect of room acoustics for the study of how speech and music will be perceived by listeners and machines. In particular, the biggest effect on the intelligibility of speech comes from the effects of reverberation, which is more fundamentally a time-domain phenomenon.

Thus far in this chapter we have noted that a point source emanates a spherical wave-front whose intensity is inversely proportional to the square of distance. In an enclosure, however, boundaries will reflect some of the energy, causing the receiver to get a series of delayed and attenuated versions of the original signal. The pattern of these returns establishes the audio character of the room, both for intelligibility and for sound quality (e.g., for the musical quality of a concert hall). These characteristics are determined by the geometry of the room, the positions of the source and receiver of sound, and the absorption characteristics of the boundaries. We separately describe the long-term and short-term effects of the room echo response, as they are qualitatively quite different. We ignore more complicated effects (such as diffusion from complicated surfaces).

13.3.1. Acoustic Reverberation

When a wave front impinges on a boundary, part of the energy is reflected, and part is absorbed; absorption includes both energy that is transmitted and energy that is dissipated into heat. A key part of architectural acoustics, then, is the use of measurements of the fraction of energy that is absorbed, called the absorption coefficient. These coefficients are typically frequency dependent; most ordinary materials absorb better at high frequencies (or, conversely, reflect better at low frequencies). For example, Table 13.1 shows that a typical coefficient for acoustic paneling is 0.16 at 125 Hz, but it is 0.80 at 2 kHz. Some materials that are less absorptive overall show less frequency sensitivity. For instance, glass typically has a coefficient of 0.04 at 125 Hz and 0.05 at 2 kHz. For nearly any real room, the effect of this frequency dependence is to shorten the time for reflections to die down for higher-frequency energy; put another way, contributions to the current sound from previously emitted sounds are low-pass-filtered (as well as delayed and attenuated) versions of the original.

For large distances, high frequencies, or both, sound absorption by the air can cause an additional low-pass-filtering effect. In addition to the inverse square attenuation, the energy absorption of the air can be approximated by a factor of e^(−mr), where m is 0.0013/m at 1 kHz and 0.021/m at 4 kHz (for a relative humidity of 50%).
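The air-absorption factor is easy to evaluate; here is a sketch using the two m values quoted above (the function and table names are ours):

```python
import math

# Air absorption multiplies the energy by exp(-m*r) on top of inverse-square
# spreading; m values below are from the text (50% relative humidity).

M_PER_METER = {1000: 0.0013, 4000: 0.021}  # frequency (Hz) -> m (1/m)

def air_absorption_db(r_m, freq_hz):
    """Extra attenuation (dB) from air absorption over a path of r_m meters."""
    m = M_PER_METER[freq_hz]
    return 10.0 * math.log10(math.exp(-m * r_m))

air_absorption_db(100.0, 1000)  # ~ -0.56 dB over 100 m at 1 kHz
air_absorption_db(100.0, 4000)  # ~ -9.1 dB over 100 m at 4 kHz
```

The comparison makes the low-pass character explicit: over a 100-m path, 4-kHz energy loses roughly 9 dB more to the air than 1-kHz energy does.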

TABLE 13.1 Effective Absorption Coefficients of Common Building Materials


Whereas the early pattern of reflections can be an important cue as to the size and shape of the room (as well as the distance to the sound source), the reflections usually become quite dense within 100 or 200 ms of the first (direct) wave front. This dense pattern, which tends to have a roughly exponential decay, is typically characterized by the time that it takes a steady-state noise signal to decay by 60 dB, a value that is referred to as RT60.
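The RT60 definition pins down the rate of an exponential decay: if the tail decays exponentially, a 60-dB drop at t = RT60 implies the energy envelope 10^(−6t/RT60). A one-line sketch (the function name is ours):

```python
# For an exponentially decaying reverberant tail, the 60-dB-in-RT60
# definition fixes the decay rate: level(t) = -60 * t / RT60 (dB).

def decay_db(t, rt60):
    """Level drop (dB) of the reverberant tail t seconds after the source stops."""
    return -60.0 * t / rt60

decay_db(0.5, 0.5)   # -60 dB: one full RT60 has elapsed
decay_db(0.25, 0.5)  # -30 dB: halfway through the decay
```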

There are a number of ways to derive a formula for RT60, as elaborated in such sources as [3] or [4]. Some use a stochastic approach, assuming independence of reflections, whereas others begin with a first-order differential equation to approximate the balance between absorption and generation. In either case the simple approximations lead to an exponential decay, which on the average conforms to the long-term response in regular rooms. Around the turn of the twentieth century, a Harvard professor named Sabine showed empirically that a particularly simple approximation was a reasonable predictor of reverberation time (ignoring air absorption), namely

RT60 = 0.161 V/(Sᾱ)

in MKS units, and

RT60 = 0.049 V/(Sᾱ)

for English units (feet), where S, V, and ᾱ are the surface area, volume, and average absorption of the room, respectively.

Including the effect of air absorption,

RT60 = 0.049 V/(Sᾱ + 4mV)

in feet, where m is the acoustic air-absorption term given above. This air term typically dominates at very high frequencies, and it is largely irrelevant for low frequencies.
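Sabine's English-units formula can be sketched directly; the room dimensions and average absorption coefficient below are illustrative assumptions, not values from the text:

```python
# Sabine's formula in English units (feet): RT60 = 0.049*V / (S*a_bar),
# optionally including the 4*m*V air-absorption term in the denominator.

def rt60_sabine_ft(volume_ft3, surface_ft2, avg_absorption, m_air=0.0):
    """Sabine reverberation time (s); m_air is the air-absorption term (1/ft)."""
    return 0.049 * volume_ft3 / (surface_ft2 * avg_absorption
                                 + 4.0 * m_air * volume_ft3)

# An assumed 20 x 30 x 12 ft room with an assumed average absorption of 0.2:
V = 20 * 30 * 12                      # volume: 7200 ft^3
S = 2 * (20*30 + 20*12 + 30*12)       # total surface area: 2400 ft^2
rt60_sabine_ft(V, S, 0.2)             # ~0.74 s, ignoring air absorption
```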

For concert hall acoustics, a rule of thumb is that 75% of the absorption is contributed by the audience and orchestra, so that measurements taken in an empty hall must be interpreted with care.4

Reverberation has a number of effects on acoustic signals such as music and speech, most obviously to smear them out in time. When the reverberant energy is large (for instance when the distance between source and receiver is large, and when RT60 is long), syllable onsets and identities can be masked by decaying energy from previous syllables. This can hurt intelligibility, particularly when combined with noise. For music, a degree of smearing that is considered appropriate for some forms of music (such as Bach organ pieces) can cause an undesirable loss of definition for others (such as string quartets).

As noted at the start of this chapter, the net effect of reverberation on a sound signal for a particular source position and receiver (microphone or ear) position may be roughly expressed as a linear convolution of an echo response with the source signal. A simple approximation to this response is an exponentially decaying impulse response, such as that implied by the RT60. However, although the reverberation time is important, the corresponding exponential impulse response gives a poor match to many of the important features that characterize a room's acoustics. There are, however, a number of ways of more accurately estimating the impulse response in a real room (for one specific source and receiver position). One of the most popular is to emit white noise or pseudorandom sequences from a calibrated source and correlate them with the received signal at the microphone. This can easily be shown to yield an estimate of the desired impulse response. For linear time-invariant systems in general, Rxy = Rxx * h(t), where Rxy is the cross-correlation between processes x and y, Rxx is the autocorrelation of x, and h(t) is the impulse response that is convolved with x to yield y. Then if Rxx = δ(t) (an uncorrelated input sequence), the measured output correlation Rxy is equivalent to the desired impulse response.
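The correlation method can be demonstrated on a toy system: drive a known sparse echo pattern with white noise, then cross-correlate input and output. The echo pattern and signal length here are illustrative assumptions:

```python
import numpy as np

# Since Rxx ~ delta(t) for white noise, the cross-correlation Rxy between
# the emitted and received signals estimates the impulse response h.

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)              # white-noise test signal (unit variance)

h_true = np.zeros(64)                   # toy sparse echo pattern (assumed)
h_true[[0, 20, 45]] = [1.0, 0.5, 0.25]  # direct path plus two echoes

y = np.convolve(x, h_true)[:n]          # "received" signal

# Estimate Rxy(k) = E[x(t) y(t+k)] for lags 0..63; this recovers h_true:
h_est = np.array([np.dot(x[:n - k], y[k:]) / (n - k) for k in range(64)])
```

With a unit-variance input, no normalization by Rxx is needed; `h_est` matches `h_true` to within the estimation noise, which shrinks as the test signal gets longer.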

In an alternative approach, a chirp (a signal with an instantaneous frequency that rises with time) can be used as the test source. The output signal is then phase adjusted to compensate for the timing of the different sinusoidal components in the source, resulting in an estimate of the transfer function (the Fourier transform of the impulse response) of the linear model for the source–receiver relationship. Figure 13.1 shows an impulse response that was measured through the use of a chirp-based approach. The response was measured in an experimental chamber at Bell Labs that has walls with adjustable absorption characteristics. The adjustable characteristics were set to yield an RT60 of 0.5 s. Figure 13.2 shows two versions of the waveform for a sequence of spoken numbers: one that was taken directly from the amplified and digitized microphone signal, and one that was artificially reverberated by using the measured impulse response from the Bell Labs recording setup. Note that although time smearing is evident, most of the basic energy features are still intact for this case; in fact, listeners do not seem to have any trouble understanding what is said in this example. However, it is difficult to make speech-recognition models that will recognize the reverberated forms when they have been trained on the original versions.
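Artificial reverberation of the kind used for Fig. 13.2 is just a convolution with an impulse response. The sketch below uses a synthetic, exponentially decaying noise tail in place of a measured response; the sample rate, the decay shape, and the stand-in "dry" signal are all assumptions:

```python
import numpy as np

# Artificially reverberate a dry signal by convolving it with a (here,
# synthetic) room impulse response whose energy decays 60 dB over RT60.

fs = 8000                                  # assumed sample rate (Hz)
rt60 = 0.5                                 # target reverberation time (s)
t = np.arange(int(fs * rt60)) / fs         # time axis for the response

rng = np.random.default_rng(1)
# Noise-like tail: amplitude 10^(-3t/RT60) gives energy 10^(-6t/RT60),
# i.e., a 60-dB energy drop at t = RT60.
h = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / rt60)
h[0] = 1.0                                 # direct-path component

dry = rng.standard_normal(fs)              # stand-in for a 1-s speech signal
wet = np.convolve(dry, h)                  # reverberated version
```

With a real measured response (as in the Bell Labs example), `h` would simply be replaced by the measurement.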


FIGURE 13.1 Measured impulse response from the experimental varechoic chamber at Bell Labs. This was part of a collection by Jim West and Gary Elko, along with Carlos Avendano, who kindly passed it on to us. The room reverberation time was set to be roughly 0.5 s, the distance from the source to the microphone was 2 m, and the shutters in the varechoic chamber were 43.7% open. The time of flight from the sound source to the microphone was removed from the impulse response display, so the plot doesn't illustrate that sound travels at finite speed.


FIGURE 13.2 Two waveforms for the continuous phrase “two oh six,” uttered by a female speaker over the telephone. The first waveform is the original signal, and the second was derived from the first by convolution with the impulse response shown earlier. The phrase is part of the Numbers database from the Oregon Graduate Institute.

Reverberation typically increases the loudness at a given location, both because energy generated over a range of time in the past is received in the present, and also because sound that would have radiated away in an open space is instead received by a listener in a closed space. Nonzero reflectivity of boundaries is an essential part of sound-reinforcement systems – it can be difficult to provide intelligible speech to listeners in rooms that are overly absorbent. Intelligibility is particularly aided by contributions to loudness from the early components in the impulse response.

13.3.2. Early Reflections

Prior to the arrival of a significant energy density, reflections are relatively sparse, and an exponential energy decay is typically not a useful description. Apparently these early reflections (i.e., in the first 80–100 ms) provide critical cues for the listener's sense of intimacy or apparent room size and character [5]. A critical parameter characterizing these early reflections for concert halls is the initial time-delay gap, which is defined for concert halls as the delay between the time of receipt of the direct sound and the time of the first echo, as measured at a point midway between the orchestra and the back of the hall (or overhang of a balcony when there is one in the back). As noted in [1], the best concert halls rated by conductors are those that have an initial time-delay gap of 15–30 ms. In the case of speech, additional energy that is received from these early reflections is integrated by the listener into the apparently direct sound, which typically enhances intelligibility and quality.


FIGURE 13.3 Zoom of the first 200 ms of the measured impulse response, in which the initial point (the direct sound) is omitted in order to scale up the rest of the sequence.

Figure 13.3 shows an expanded version of the first 200 ms of the impulse response shown earlier. Note that the early response tends to look roughly like a series of discrete echoes, whereas by 100 ms the sequence has a significantly more continuous character.


As should be clear from the preceding sections, sounds produced in a room undergo significant modification as a result of multipath transmission from source to receiver. When the receiver is a human listener, moderate reverberation can improve intelligibility by increasing the signal-to-noise ratio: more signal gets to the listener. For a constant signal level received by the listener, however, reverberation tends to hurt intelligibility, particularly for large amounts of reverberant energy and for long reverberation times. This is particularly true when the signal-to-noise ratio is poor.

Even a small amount of reflected energy can affect the spectral or apparent spatial character of the sound for a human listener; but such small effects do not generally hurt intelligibility. Thus, the effects of the early reflections, as noted earlier, are largely to modify the listener's impression of the room's size or intimacy. However, this is often not the case for machine processing of audio signals such as speech. Particularly for the speech-recognition application, even relatively short-term echo patterns modify the signal representations so as to be inconsistent with stored representations collected in different (i.e., nonreverberant) environments, such as speech collected from a close-talking microphone. Significant additive components from previous sounds appear to hurt machine recognition systems seriously for much smaller reverberation effects than are required to hurt intelligibility for humans.

The most common approach to handling these problems is to use a directional (noise-canceling) microphone quite close to the talker's mouth. In addition to improving the signal-to-noise ratio, such an arrangement keeps the direct-to-reverberant energy ratio high, so that the room acoustic effects are minimized. Unfortunately, there are many situations in which the microphone cannot be placed in this way. There has also been a significant amount of research on microphone arrays (see, e.g., [2]). In these studies, processing techniques such as beamforming (see Chapter 39) or matched filtering have been applied to signals from as many as 200 microphones. These approaches improve spatial selectivity and thus reduce the effects of room acoustics on speech intelligibility and recognition performance.
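Beamforming itself is covered in Chapter 39, but the simplest variant, delay-and-sum, is easy to sketch: align the microphone channels by the propagation delays from the desired source and average them, reinforcing the direct sound relative to reflections arriving from other directions. Everything below (whole-sample delays, known delay values, the toy signals) is an illustrative assumption:

```python
import numpy as np

# Minimal delay-and-sum beamformer: remove each channel's (assumed known)
# propagation delay from the desired source, then average the channels.

def delay_and_sum(mic_signals, delays_samples):
    """Average the microphone channels after removing per-channel delays."""
    n = min(len(x) - d for x, d in zip(mic_signals, delays_samples))
    aligned = [x[d:d + n] for x, d in zip(mic_signals, delays_samples)]
    return np.mean(aligned, axis=0)

# Two channels receiving the same source with different whole-sample delays:
rng = np.random.default_rng(2)
s = rng.standard_normal(1000)
mics = [np.concatenate([np.zeros(3), s]),
        np.concatenate([np.zeros(7), s])]
out = delay_and_sum(mics, [3, 7])   # recovers s; off-axis energy would average down
```

Real arrays estimate the delays from geometry or from the signals themselves, and typically use fractional-sample interpolation; the averaging step is the same.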


  • 13.1 On one hand, explain how a reflective ceiling in a lecture hall could potentially improve the intelligibility of speech. On the other hand, explain how reflective surfaces could hurt speech intelligibility.
  • 13.2 For an omnidirectional point sound source, the received sound level at a microphone is an 80-dB SPL 10 ft away in an anechoic chamber (a room with essentially no reflective boundaries). What is the SPL at a 20-ft distance? If the same source and microphone are put in a highly reverberant room for which the energy from reflections in the first 40 ms is much smaller than the energy from later reflections, what is the corresponding change in decibel SPL when the microphone distance is doubled?
  • 13.3 A rectangular room has a 12-ft high ceiling and is 20 ft wide and 30 ft long. The ceiling is made of acoustic paneling and the floor is wooden. The other walls are 25% glass and 75% wood paneling.
    • (a) What is the approximate RT60 at 125, 500, and 2000 Hz, ignoring air absorption?
    • (b) Let one corner of the floor be the position of the origin, and let (x, y, z) be the position in feet along the 20-ft wall, 30-ft wall, and the height (respectively). For a sound source at (10, 10, 3) and a receiver at (10, 18, 3), find the propagation time for the direct sound. Also find the propagation time for the first reflection that is received after the direct sound.


  1. Beranek, L., Concert and Opera Halls – How They Sound, Amer. Inst. Physics, New York, 1996.
  2. Flanagan, J., Surendran, A., and Jan, E., “Selective sound capture for speech and audio processing,” Speech Commun. 13: 107–122, 1993.
  3. Kinsler, L., and Frey, A., Fundamentals of Acoustics, Wiley, New York, 1962.
  4. Kuttruff, H., Room Acoustics, Applied Science, London, 1973.
  5. Morgan, N., “Room acoustics simulation with discrete-time hardware,” Ph.D. Thesis, University of California at Berkeley, 1980.
  6. Tremaine, H., Audio Cyclopedia, Sams, Indianapolis, Ind., 1978.

1Recall that the actual speed is closer to 1130 ft/s at room temperature; we assume the lower speed above for simplicity's sake only.

2This filtering effect is used to good advantage by the auditory system, helping to determine the direction of the arrival of sound waves. Spatial location is also assisted by the relative timing of the arrival of sound waves at each ear.

3For derivations of this and other relations that are simply stated here as facts, check acoustics texts such as [3].

4For this rule of thumb and many other fascinating observations about 76 famous concert halls, see [1].
