Chapter 7

Applications: Acoustics and Psychoacoustics Combined


So far in this book, we have considered acoustics and psychoacoustics as separate topics. However, real applications often require the combination of the two, because although psychoacoustics tells us how we might perceive a sound, we need its acoustic description to help create physical or electronic solutions to the problem. The purpose of this chapter is to give the reader a flavor of the many applications that make use of acoustics and psychoacoustics in combination. Of necessity, these vignettes are brief and do not cover all the possible applications. However, we have tried to cover areas we feel are important and of interest. The level of detail also varies, but in all cases, we have tried to provide enough detail for the reader to be able to read and understand the more advanced texts and references that we provide, and any that the reader may discover themselves, for further reading. The rest of this chapter will cover listening room design, audiometry, psychoacoustic testing, filtering and equalization, public-address systems, noise-reducing headphones, acoustical social control devices and last, but by no means least, audio coding systems.

7.1 Spatial and 3D Audio Reproduction

Spatial and so-called 3D audio reproduction is an essential aspect of the enjoyment of sound, whether it is on a personal player via headphones, at home, or in the cinema. All of these systems rely on the application of acoustics and psychoacoustics to present an illusion of specific sound directions that are called “phantom images” to mimic real sound sources such as singers or musical instruments. They also attempt to give the listener an illusion of being in a space, which is often described as its “ambience.”

A variety of systems are currently in use or being introduced as technology improves. They can be broadly categorized into two separate approaches.

  1. Systems that try to provide the necessary directional cues directly via the reproduction method. This is the approach adopted by conventional stereo, surround systems and the new 3D reproduction methods.
  2. Systems that try to reproduce the acoustic sound field that the listener would experience if they were there. This method covers ambisonics, wave-field synthesis and binaural stereo.

Historically, the first category provided the earliest practical systems, with the second category following closely behind, and this section will give a brief description of how they work. To cover all the details of these systems, including the many approaches to producing and recording the necessary content, would take several additional books! The references and recommended reading at the end of the chapter can help you get additional detail.

7.1.1 Stereo

The simplest such system is stereophonic reproduction, which uses two spaced loudspeakers to provide the illusion of sound sources in front of the listener. Superficially, one might expect such a system to provide only intensity cues for the illusion of direction, due to the head shading above 1 kHz discussed in Section 2.6.2. However, this would be incorrect, as demonstrated by Blumlein in 1931.

Blumlein showed that the coherent addition of the acoustic signals arriving at the ears from both loudspeakers, as shown in Figure 7.1, produced a phase shift that mimicked the effect of the interaural time delay cue. In essence, the signals from both speakers arrive at both ears with equal amplitudes, because at low frequencies they diffract easily around the head. Due to the relative delay caused by the speaker angles, the sine waves from the contralateral (farther away) speaker are delayed with respect to those from the ipsilateral (closer) one. As we know from Chapter 1, in Section 1.3.1, adding two sine waves of different amplitudes and phases forms a single sine wave with a different amplitude and phase. The resultant phase of these combined sine waves at the two ears will therefore be a function of the relative speaker amplitudes. Blumlein showed that this mimics the effect of the ITD cue. How this works is shown in Figures 7.2(a) to 7.2(c), which show the effect of adding two sine waves with different amplitudes that emanate from left and right speakers at each ear. Three different cases are shown:

  1. Only one speaker driven: This is shown in Figure 7.2(a). In this case, only the left speaker is providing the signal to both ears. Thus the phase shift between the two ears is identical to the phase shift caused by a sound source at the left speaker's position.
  2. Both speakers driven equally: This is shown in Figure 7.2(b). In this case, both speakers are providing identical signals to both ears. The composite waves due to the sum of the two speaker signals at each ear are identical. Thus the phase shift between the two ears is zero. This is identical to the phase shift due to a sound source that is central between the two speakers' positions.
  3. Both speakers driven with different amplitudes: This is shown in Figure 7.2(c). In this case, the two speakers are providing different signals to both ears. The composite waves due to the sum of the two speaker signals at each ear are no longer identical. This makes the phase shift between the two ears somewhere between the first two cases. This phase shift will be dependent on the relative amplitudes of the two loudspeakers, because these determine the composite wave's amplitude and phase. The resulting phase shift is identical to the phase shift due to a sound source that lies between the center position and the physical loudspeaker positions.

Note that there is some amplitude variation in the composite ear signals that depends on the relative amplitudes and phases of the loudspeaker signals. This has to be compensated for in the controls of the relative amplitudes of the speakers, that is, the pan pots.
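
The phasor picture behind Figures 7.2(a) to 7.2(c) can be checked numerically. The following is a minimal sketch, assuming an arbitrary low-frequency tone and a nominal 0.25 ms path-length delay for the contralateral speaker (both illustrative values); it sums the two speaker signals at each ear and reports the resulting interaural phase difference for the three cases listed above.

```python
import numpy as np

def interaural_phase_difference(gain_left, gain_right, freq=200.0, delay=0.25e-3):
    """Phase difference (radians) between the composite signals at the two ears.

    Each ear receives both speaker signals; the contralateral speaker is
    delayed by `delay` seconds (an assumed path-length difference). Signals
    are modelled as phasors, valid at low frequencies where the head does
    not shade either speaker.
    """
    w = 2 * np.pi * freq
    lag = np.exp(-1j * w * delay)             # extra phase of the farther speaker
    left_ear = gain_left + gain_right * lag   # L is ipsilateral, R contralateral
    right_ear = gain_right + gain_left * lag  # R is ipsilateral, L contralateral
    return np.angle(left_ear) - np.angle(right_ear)

# One speaker only, equal drive, and unequal drive (cases (a) to (c) above)
for gl, gr in [(1.0, 0.0), (1.0, 1.0), (1.0, 0.5)]:
    ipd = interaural_phase_difference(gl, gr)
    print(f"L={gl:.1f}, R={gr:.1f}: interaural phase difference = {np.degrees(ipd):+.1f} deg")
```

With only one speaker driven, the full path-delay phase appears between the ears; with equal drive it collapses to zero; and intermediate amplitude ratios give intermediate values, as the figures describe.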

If one calculates the ITD based on the geometry shown in Figure 7.1, one obtains the following equation.

\sin(\theta_{\text{phantom}}) = \left(\frac{L - R}{L + R}\right) \sin(\theta_{\text{loudspeakers}}) \quad (7.1)

  • where θ_phantom is the angle of the virtual (phantom) image with respect to the normal,
  • L and R are the amplitudes of the left and right speaker signals, respectively,
  • and θ_loudspeakers is the angle of the loudspeakers with respect to the normal.

This is the well-known stereophonic law of sines.

Figure 7.3 shows the predicted angle as a function of the relative levels of the left and right loudspeakers for θloudspeakers = 30°. This corresponds to a total angle of 60° between the loudspeakers, which is the recommended angle for ideal stereophonic listening.
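
Equation 7.1 is easy to evaluate directly. The sketch below assumes speakers at ±30° and expresses the interchannel difference in dB; the specific level differences chosen are illustrative, and the result reproduces the kind of curve plotted in Figure 7.3.

```python
import numpy as np

def phantom_angle(left_gain, right_gain, speaker_angle_deg=30.0):
    """Predicted phantom-image angle (degrees) from Equation 7.1, the
    stereophonic law of sines, for speakers at +/- speaker_angle_deg."""
    ratio = (left_gain - right_gain) / (left_gain + right_gain)
    return np.degrees(np.arcsin(ratio * np.sin(np.radians(speaker_angle_deg))))

# Sweep the relative level of the two speakers (cf. Figure 7.3)
for diff_db in (0, 3, 6, 12, 30):
    left = 1.0
    right = 10 ** (-diff_db / 20.0)   # right speaker turned down by diff_db
    print(f"L-R difference {diff_db:2d} dB -> phantom image at {phantom_angle(left, right):5.1f} deg")
```

Note that a 30 dB level difference already places the phantom image almost at the loudspeaker itself.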

The net result is that amplitude stereo can give a convincing directional image of a set of sound sources because it provides both intensity and delay cues. Critically, this requires that both the loudspeakers are heard by both ears at low frequencies. This is something that does not occur when listening to conventional stereo recordings on headphones and will be discussed later.

One way of recording this acoustically is via a pair of crossed figure-of-eight microphones, as shown in Figure 7.4. As a sound source moves across the sound stage, the amplitudes of the signals in each microphone have a cosine/sine relationship, as shown in Figure 7.5. Thus at extreme left or right the signal appears only in one speaker, whereas when the source is in the center, the signal appears equally in both speakers. Incidentally, this cosine/sine law is the ideal amplitude law used in pan pots in mixing desks for the purpose of positioning monophonic sources in a stereo image. Of course, different microphone layouts may be used to introduce phase shifts. Due to the effect of coherent addition of the two speaker signals at the ears at low frequencies, there is some trading between delay and amplitude. This can be used to achieve a variety of creative effects to the stereo image.
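
The cosine/sine pan law mentioned above can be written as a small helper function. This is a minimal sketch; the mapping of the pan-pot position onto an angle between 0° and 90° is an assumed convention.

```python
import numpy as np

def pan_gains(position):
    """Constant-power (sine/cosine) pan law.

    position runs from 0.0 (fully left) to 1.0 (fully right); the two gains
    always satisfy L**2 + R**2 = 1, mimicking the crossed figure-of-eight
    microphone responses of Figure 7.5.
    """
    angle = position * np.pi / 2          # map 0..1 to 0..90 degrees
    return np.cos(angle), np.sin(angle)

for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
    left, right = pan_gains(pos)
    print(f"position {pos:.2f}: L = {left:.3f}, R = {right:.3f}")
```

At the mid position both gains are 0.707 (that is, 3 dB down), so the total radiated power stays constant as a source is panned across the image.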

7.1.2 5.1 Surround Systems

A related form of spatial reproduction is 5.1 surround, which is used in film and video presentations. Here the objective is to provide not only clear dialogue but also stereo music and sound effects, as well as a sense of ambience. The typical speaker layout is shown in Figure 7.6.

Here, in addition to the conventional stereo speakers, there are some additional ones to meet these requirements. These are as follows:

  • Center dialogue speaker: The dialogue is replayed via a single central speaker, because this has been found to give better speech intelligibility than a stereo presentation. Interestingly, the fact that the speech is not in stereo is not noticeable because the visual cue dominates so that we hear the sound coming from the person speaking on the screen even if their sound is coming from a different direction. For music, this speaker may be used to help anchor the front image or support the two stereo speakers.
  • Surround speakers: The ambient sounds and sound effects are diffused via rear-mounted speakers. Originally they were not supposed to provide directional effects and were deliberately designed and fed signals that minimized their correlation with each other and the front speakers. The effect of this was to fool the hearing system into perceiving the sound as all around with no specific direction. In modern surround systems, they are also used to provide directional effects in addition to the ambience.
  • Low-frequency effects: This is the “0.1” channel in the 5.1 system. It is required because many of the sound effects used in film and video, such as explosions and punches, have substantial low-frequency and infrasonic content. Thus, a specialized speaker is needed to reproduce these sounds properly. Note that this speaker was never originally intended to reproduce music signals. However, its presence in many surround music systems means that it is now often used to provide general low-frequency content as a “subwoofer.” This does have the advantage of providing a means of dealing with the effects of low-frequency modes on the sound, as the speaker can be placed in an “optimum” position. It also allows the use of multiple bass speakers and appropriate processing to combat the effects of modes. In addition, it can permit the use of more compact speakers for the other channels, as they do not have to handle the lowest frequencies.

All these signals can be encoded into and derived from a two-channel format using a technique called matrixing. However, this results in a considerable loss of separation between channels; it was nevertheless the only way to achieve surround reproduction when only two-channel distribution formats such as vinyl discs or stereo film tracks were available. Better performance is obtained from a multichannel format, and with the advent of digital distribution formats that can provide discrete channels, this has become the norm, although some systems also use the matrixing technique to derive additional speaker feeds for more complex reproduction layouts.
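
The sketch below illustrates the basic idea of matrixing with a deliberately simplified passive 4-2-4 matrix: the centre and surround channels are folded into a two-channel pair and then recovered by sum and difference. The gain values and the omission of the phase shifting, band-limiting and steering logic that real matrix systems (such as Dolby Surround) use are assumptions of this illustration, not a description of any particular product.

```python
import numpy as np

def matrix_encode(left, front_center, right, surround):
    """Fold four channels into a two-channel (Lt/Rt) pair.

    Simplified passive matrix: the centre is shared equally between the two
    channels at -3 dB, and the surround is added in opposite polarity.
    Real matrix systems also phase-shift and band-limit the surround."""
    g = 1.0 / np.sqrt(2.0)                      # -3 dB
    lt = left + g * front_center + g * surround
    rt = right + g * front_center - g * surround
    return lt, rt

def matrix_decode(lt, rt):
    """Recover approximate centre and surround feeds (Lt/Rt pass straight
    to the left/right speakers). Note the limited channel separation."""
    g = 1.0 / np.sqrt(2.0)
    center = g * (lt + rt)
    surround = g * (lt - rt)
    return lt, center, rt, surround

# A centre-only signal leaks equally into the decoded L and R outputs,
# illustrating the "considerable loss of separation" mentioned above.
zeros, ones = np.zeros(4), np.ones(4)
l, c, r, s = matrix_decode(*matrix_encode(zeros, ones, zeros, zeros))
print(c[0], l[0], r[0], s[0])
```

Running it with a centre-only input shows that the decoded centre is recovered, but the same signal also leaks into the left and right outputs at only 3 dB down, which is the loss of separation referred to above.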

7.1.3 Three-Dimensional Surround Systems

In recent years, additional speakers have been added to 5.1 surround systems. These systems are often designated 7.1, 9.1, 11.2 and so on to indicate the extra speakers. The first number specifies the number of full-range speakers actually used. Thus a 7.1 system would have two additional full-range speakers compared to a 5.1 system, a 9.1 system four additional full-range speakers and so on. The number after the decimal point denotes the number of low-frequency speakers used. So 7.1 would describe a system with 7 full-range speakers and 1 low-frequency speaker, whereas 11.2 describes a system with 11 full-range speakers and 2 low-frequency speakers. Note that in many cases, the extra low-frequency channel is a duplicate of the existing low-frequency channel and not a different low-frequency feed. Having two low-frequency speakers allows one to place them in a way that deals with the inevitable modal issues that occur in real rooms at low frequencies. The necessary signals for these extra speakers may be derived from an existing 5.1 system via the technique of matrixing, or they can be provided by separate dedicated channels. It is also possible to conceive of systems that use a combination of matrixing and additional channels. Although some standard arrangements have been proposed, there is still an element of flux in these systems.

The purpose of these extra channels is to improve the surround-sound experience for the listener so that they have more consistent localization all the way around themselves. In addition, some arrangements can be used to give an illusion of source height, which can give the listener a more realistic sense of the environment the reproduced sound is in. For example, a 7.1 system places two additional speakers to the left and right of the listener (called the left and right surrounds) and moves the rear speakers farther around the back (called the left and right back surrounds) as shown in Figure 7.7. By adding two more speakers, 9.1 systems allow one to have some height loudspeakers (left and right front height), and 11.1 systems would allow two additional height speakers to the rear. Clearly the situation can get very confusing! Material is now available that has been recorded for 9.1 systems with height, and the results can give a great sense of the space the recording was made in.

7.1.4 Ambisonics

Another form of spatial sound presentation is called ambisonics. This is a three-dimensional sound reproduction system that tries to simulate the sound field at a given point in the room. The original and simplest system does this by recognizing that the sound field at a given point can be described by four components: the pressure at the point, the velocity in the left-to-right direction, the velocity in the front-to-back direction and the velocity in the down-to-up direction. These components can be measured via the use of an omnidirectional pressure microphone and three figure-of-eight pressure-gradient or velocity microphones, which are oriented in the three orthogonal directions, as shown in Figure 7.8.

This gives four channels of information that represent the first-order sound field at the recording point. This sound field can be reproduced in a room via an array of speakers. The minimum recommended is eight, and a typical arrangement is shown in Figure 7.9.

Unlike stereo, however, no particular speaker is assigned to a particular direction, such as left, right and so forth. Instead, all the speakers work in concert to provide the necessary sound field. That is, all the speakers work together to provide a pressure component and, by feeding appropriate levels and phases of the other signals to the loudspeakers, all cooperate to synthesize the necessary velocity components. As an example, the front-to-back component would be synthesized by feeding equal levels of the front-to-back signal to speakers 1 and 2 at the same level but in antiphase to speakers 3 and 4. Speakers 5 to 8 would receive no signal because they fall on the null of the figure-of-eight polar pattern for front-to-back sounds. Note that these weightings are only true for the speaker arrangement shown in Figure 7.9, and if other dimension ratios are used, then the weightings required change, because they are determined by the shape of the polar patterns of the different spherical harmonic components. The system can be simplified by simply removing channels; thus a two-dimensional sound field requires only three channels, and the system becomes stereo if only two channels are used.
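
A minimal numerical sketch of these ideas is given below: a single source direction is encoded into the four first-order (B-format) signals W, X, Y and Z, and a basic decode then derives a feed for each loudspeaker from the omnidirectional component plus the velocity components weighted by that speaker's direction cosines. The evenly spaced eight-speaker ring, the -3 dB convention on W and the simple decode weights are illustrative assumptions; as noted above, the weights actually required depend on the speaker layout.

```python
import numpy as np

def encode_bformat(azimuth_deg, elevation_deg=0.0, signal=1.0):
    """First-order B-format (W, X, Y, Z) for a plane wave from the given direction."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = signal / np.sqrt(2.0)                 # pressure (omni), -3 dB convention
    x = signal * np.cos(az) * np.cos(el)      # front-back figure-of-eight
    y = signal * np.sin(az) * np.cos(el)      # left-right figure-of-eight
    z = signal * np.sin(el)                   # up-down figure-of-eight
    return np.array([w, x, y, z])

def decode_horizontal(bformat, speaker_azimuths_deg):
    """Basic first-order decode: each speaker gets the omni component plus the
    velocity components weighted by its direction cosines."""
    w, x, y, _ = bformat
    az = np.radians(np.asarray(speaker_azimuths_deg))
    return 0.5 * (np.sqrt(2.0) * w + x * np.cos(az) + y * np.sin(az))

# Eight speakers spaced evenly around the listener (an assumed layout)
speakers = np.arange(0, 360, 45)
gains = decode_horizontal(encode_bformat(azimuth_deg=0.0), speakers)
for angle, gain in zip(speakers, gains):
    print(f"speaker at {angle:3d} deg: gain {gain:+.3f}")
```

For a source straight ahead, the front speaker receives the largest feed, the side speakers about half of it and the rear speaker nothing, so the array as a whole cooperates to rebuild the wavefront.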

In principle, it is possible to extend the system to sample acceleration components and higher-order changes, in addition to the velocity components, to provide greater accuracy. The effective microphone patterns required for the different orders are shown in Figure 7.10. This results in the ability to produce more complex sound fields to give better localization accuracy. These higher-order ambisonic systems are now used for a variety of purposes and have been proposed as an intermediate format for production, because it is comparatively easy to mix down from an ambisonic format into the perplexing variety of different surround-sound formats.

7.1.5 Wavefield Synthesis

Another approach to reproducing spatial sound is to try to reproduce the radiated sound field using the Huygens-Fresnel Principle, which states that any possible wavefront shape can be made up from elemental spherical waves as shown in Figure 7.11.

As a conventional loudspeaker radiates spherical waves at wavelengths that are large compared to its size, this naturally suggests the possibility of synthesizing an arbitrary sound field using a large number of speakers in an array, as shown in Figure 7.12. If one can accurately produce the desired sound field using an array of speakers on the boundary of the reproduction area, the directional characteristics of the recorded sources and environment will be preserved. In theory, this could produce a sound field that is indistinguishable from the original sound field. However, in practice, there are some limitations, which are as follows.

  1. Speaker size and spacing: To achieve perfect wavefront reconstruction, one needs to have speakers whose spacing, and therefore size, is less than half the wavelength of the highest frequency to be reproduced. For 20 kHz, this would require a speaker size of less than 8.6 mm! If the spacing is larger than this, which in general is necessary for practical speaker sizes, then spatial aliasing occurs just as it does in array loudspeakers used for public-address systems, as described in Appendix A6.1.2 (see the short calculation after this list). This causes distortions in the reproduced wavefront that degrade the spatial accuracy of the reconstruction. Much of the work on wavefield synthesis is directed at methods of reducing these effects for practical speaker sizes and technologies.
  2. Reproduction environment: To be successful, the only sound in the space must be that which is radiated by the loudspeakers. Any reflecting boundary will result in spurious wavefronts that will degrade the accuracy of the wavefront reconstruction. Ideally, the array should be in an anechoic room to avoid this.
  3. Finite array size: The Huygens-Fresnel Principle assumes that elemental spherical waves are present on an infinitely long boundary. This means that the sharp boundaries, or changes in direction, that would be needed to form a practical array can cause problems. Figure 7.13 shows a wavefield synthesis installation in the anechoic room at the University of Salford. Note that it is an octagonal arrangement to reduce these end effects.
  4. Large numbers of loudspeaker signals: As well as a large number of loudspeakers, there is also the same number of loudspeaker signals. Each of these is unique, and recording or generating them requires a lot of effort. These limitations make it difficult to use wavefield synthesis as a recording format. However, some virtual-reality and acoustic programs provide outputs that are representations of the necessary speaker feeds and so can be used in conjunction with wavefield synthesis as an auralization method.
  5. Plane reproduction only: Because of the large number of speakers required, most wavefield synthesis setups are only in the horizontal plane. Theoretically, there is no reason for this limitation. But practically, the number of speakers and associated signals would be enormous for a three-dimensional system.
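
The half-wavelength rule in point 1 is a one-line calculation. The sketch below, assuming a speed of sound of 343 m/s, gives both the spacing needed for a chosen upper frequency and the aliasing frequency implied by a practical driver spacing (the 10 cm example spacing is an assumption).

```python
# Spatial-aliasing limit for a wavefield synthesis loudspeaker array:
# perfect reconstruction needs the spacing to be at most half a wavelength.
SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 degrees C

def max_spacing_for(freq_hz):
    """Largest speaker spacing (m) that avoids spatial aliasing up to freq_hz."""
    return SPEED_OF_SOUND / (2.0 * freq_hz)

def aliasing_frequency_for(spacing_m):
    """Frequency (Hz) above which spatial aliasing occurs for a given spacing."""
    return SPEED_OF_SOUND / (2.0 * spacing_m)

print(f"Spacing needed for 20 kHz: {max_spacing_for(20_000) * 1000:.1f} mm")            # about 8.6 mm
print(f"Aliasing frequency for 10 cm drivers: {aliasing_frequency_for(0.10):.0f} Hz")   # about 1.7 kHz
```

With 10 cm drivers, accurate reconstruction is limited to roughly the range below 1.7 kHz, which illustrates why so much effort goes into mitigating these effects.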

Wavefield synthesis can provide realistic sound fields, but the large number of speakers and the amount of processing required limit its broader application with current technology. In niche applications, though, it can provide a highly realistic sound field.

7.1.6 Binaural Stereo

For recorded sound that is to be presented on headphones, binaural stereo reproduction is more appropriate. This works by recording the sound using microphones that are located on either a real head or a “dummy” head. This recording is then presented to the listener’s ears via headphones, thus giving the listener a direct representation of the sound they would have heard in the recording venue. Because the sound is recorded at the ears, dummy or real, the full effect of head shading, interaural delay and pinna reflections is accounted for in the recording. This can result in amazingly realistic recordings, especially if your own ears are used. However, if other people’s ears are used, the effect is less realistic, because a given person learns the responses of their own head and ears. In fact, it seems that reasonably simple approximations to a head, which just handle the effect of head shading and interaural delay, provide the most benefit, probably because they feed less-confusing direction cues to the listener.

The other problem with binaural presentation is that the stereo image does not change as the listener moves their head. This is not a natural effect and is one of the reasons that binaural signals, especially those from normal-intensity stereo sources, appear within the head. In order to avoid this effect, it would be necessary to detect the head movement and adjust the signals to the two ears to take account of the change in head direction. When this is done, the sounds are properly externalized, and the effect can be quite dramatic!

This type of processing is required in virtual-reality systems, in which sound sources need to remain associated with their visual cues. One way of achieving this is to measure the effective filter response to the two ears for different directions, as shown in Figure 7.14. These are known as head-related transfer functions (HRTFs). These filter functions can then be used directly to “pan” a monaural source to a particular direction in the binaural stereo perspective. Because it is difficult to interpolate between HRTFs for different angles, these functions have to be stored as a large table of data in both azimuth and elevation. If a recorded binaural signal is to be processed, the difference between the HRTF for zero angle and the actual angle of the head must be used to correct for head movement.
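
Panning a monaural source with an HRTF pair amounts to filtering it with the measured left-ear and right-ear impulse responses for the chosen direction. The sketch below assumes the two head-related impulse responses have already been looked up from such a table; the short FIR arrays here are placeholders, not measured data.

```python
import numpy as np

def binaural_pan(mono, hrir_left, hrir_right):
    """Pan a mono signal to binaural stereo by filtering it with the
    head-related impulse responses (time-domain HRTFs) for one direction."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

# Placeholder HRIRs of equal length: a real system would look these up, per
# azimuth and elevation, from a measured table such as that of Figure 7.14.
hrir_left = np.array([1.0, 0.3, 0.1, 0.0, 0.0])    # hypothetical near-ear response
hrir_right = np.array([0.0, 0.6, 0.2, 0.1, 0.0])   # hypothetical far-ear response (delayed, shaded)

mono_source = np.random.default_rng(0).standard_normal(48_000)  # 1 s of noise at 48 kHz
binaural = binaural_pan(mono_source, hrir_left, hrir_right)
print(binaural.shape)   # (2, 48004): a left/right pair
```

A real system would hold one such pair per azimuth and elevation entry in the table and switch or crossfade between them as the head or the source moves.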

Normal stereo signals do not reproduce correctly on headphones, and binaural recordings do not reproduce correctly on normal stereo equipment. This is because in the former, the effect of the head is removed from the listening environment, whereas in the latter, the effect of the head is applied twice, once in the recording and once in the listening. It is possible to convert the signals from one format to the other by making use of the head-related transfer functions.

To convert normal stereo signals into signals suitable for binaural presentation, it is necessary to filter the left and right channels with the HRTFs associated with the positions of left and right speakers, as shown in Figure 7.15. Note that four filters are involved, because one must model not only the transfer function to each ear but also the cross-coupling of the loudspeakers between the two ears.

To convert binaural signals into a form suitable for normal stereo presentation, it is necessary to cancel out the HRTFs associated with the positions of left and right speakers, as shown in Figure 7.16. This involves using two filters that are the inverse of the HRTF to each ear and two other filters that model the cross-coupling between the two ears and subtract the effect. Clearly this is only going to be possible for a particular speaker layout and a forward-facing listener with no head movement. However, this technique is used in some proprietary spatial-enhancement systems. It has also been used in television and computer game sound reproduction to provide “virtual surround” by using the HRTFs to synthesize the surround speakers from the forward stereo ones. In this situation, not only are the speaker locations known, as they are built into the set, but the head is effectively clamped in position due to the need to watch the picture. In this sort of situation, the technique can be very effective.
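
Both conversions can be viewed as a 2x2 filter matrix per frequency: the four filters of Figure 7.15 map the two loudspeaker signals to the two ear signals, and the arrangement of Figure 7.16 applies the inverse of that matrix. The sketch below works bin by bin in the frequency domain with placeholder HRTF spectra (randomly perturbed values, purely for illustration) and simply checks that the inverse undoes the forward matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins = 257                          # e.g. one side of a 512-point FFT

# Hypothetical HRTF spectra: same-side (direct) and opposite-side (cross) paths.
H_direct = 1.0 + 0.1 * rng.standard_normal(n_bins)
H_cross = 0.4 * (1.0 + 0.1 * rng.standard_normal(n_bins)) * np.exp(-1j * np.linspace(0, 3, n_bins))

def stereo_to_binaural(L, R):
    """Ear spectra for loudspeaker listening (the four filters of Figure 7.15)."""
    ear_left = H_direct * L + H_cross * R
    ear_right = H_cross * L + H_direct * R
    return ear_left, ear_right

def binaural_to_stereo(ear_left, ear_right):
    """Invert the 2x2 HRTF matrix per frequency bin (crosstalk cancellation,
    as in Figure 7.16). Only valid for a fixed head and known speaker layout."""
    det = H_direct**2 - H_cross**2
    L = (H_direct * ear_left - H_cross * ear_right) / det
    R = (H_direct * ear_right - H_cross * ear_left) / det
    return L, R

# Round trip: feed a stereo spectrum through the head model and recover it.
L_in, R_in = rng.standard_normal(n_bins), rng.standard_normal(n_bins)
L_out, R_out = binaural_to_stereo(*stereo_to_binaural(L_in, R_in))
print(np.allclose(L_in, L_out), np.allclose(R_in, R_out))   # True True
```

In practice the inverse filters must also be constrained to remain stable and realizable, and, as noted above, the cancellation only holds for the assumed speaker positions and a fixed head.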

7.1.7 Summary

This section has described some of the methods used to produce spatial sound. Some are more appropriate for domestic use than others, but the challenge of producing a realistic spatial impression in reproduced sound is ongoing.

7.2 Critical Listening Room Design

Although designing rooms for music performance is important, we often listen to recorded sound in small spaces. We listen to music and watch television and movies in both stereo and surround in rooms that are much smaller than the recording environments. If one wishes to evaluate the sound in these environments, it is necessary to make them suitable for this purpose. In Chapter 6, we have seen how to analyze existing rooms and predict their performance. We have also examined methods for improving their acoustic characteristics. However, is there anything else we can do to make rooms better for the purpose of critically listening to music? There are a variety of approaches to achieving this, and this section examines optimal speaker placement, IEC rooms, room energy evolution, LEDE rooms, nonenvironment rooms and diffuse-reflection rooms.

7.2.1 Loudspeaker Arrangements for Critical Listening

Before we examine specific room designs, let us first examine the optimum speaker layouts for both stereo and 5.1 surround systems. The reason for doing this is that most modern critical-listening room designs depend on knowing where the speakers will be placed. It is also pretty pointless to have a wonderful room if the speakers are not in an optimum arrangement.

Figure 7.17 shows the optimum layout for stereo speakers. They should form an equilateral triangle with the center of the listening position. If one has a greater angle than this, the center phantom image becomes unstable—the so-called hole-in-the-middle effect. Clearly, having an angle of less than 60° results in a narrower stereo image.

5.1 surround systems are used in film and video presentations. Here, the objective is to provide both clear dialog and stereo music and sound effects, as well as a sense of ambience. The typical speaker layout was shown earlier in Figure 7.6. Here, in addition to the conventional stereo speakers, there are some additional ones to provide the additional requirements, as explained earlier.

More recently, systems using six or more channels have also been proposed and implemented; for more information, see Rumsey (2001).

As we shall see later, the physical arrangement of loudspeakers can significantly affect the listening room design.

7.2.2 IEC Listening Rooms

The first type of critical listening room is the IEC (International Electrotechnical Commission) listening room (IEC, 2003). This is essentially a conventional room that meets certain minimum requirements: a reverberation time that is flat and between 0.3 and 0.6 seconds above 200 Hz, a low noise level, an even mode distribution and a recommended floor area. In essence, this is a standardized living room that provides a consistent reference environment for a variety of listening tasks. It is the type of room that is often used for psychoacoustic testing, as it provides results that correlate well with that which is experienced in conventional domestic environments. This type of room can be readily designed using the techniques discussed in Chapter 6.

However, for critically listening to music mixes and the like, something more is required, and these types of rooms will now be discussed. All of them control not only the reverberation but also the time evolution and level of the early reflections. They also all take advantage of the fact that the speakers are in specific locations to do this and very often have an asymmetric acoustic that is different for the listener and the loudspeakers. Although there are many different implementations, they fall into three basic types: reflection-controlled rooms, nonenvironment rooms and diffuse-reflection rooms. As they all control the early reflections within a room, we shall first look at how the sound energy in a room evolves in time.

7.2.3 Energy–Time Considerations

A major advance in acoustical design for listening to music has arisen from the realization that, as well as the reverberation time, the time evolution of the first part of the sound energy buildup in the room matters, that is, the detailed structure and level of the early reflections, as discussed in Chapter 6. Because it is mostly the energy in the sound that matters for perception, the detailed evolution of the sound energy as a function of time in a room is what we need to examine. Also, there are now acoustic measurement systems that can measure the energy–time curve of a room directly, thus allowing a designer to see what is happening within the room at different frequencies rather than relying on a pair of “golden ears”; a simple way of deriving such a curve from a measured impulse response is sketched after the list below. An idealized energy–time curve for a typical room is shown in Figure 7.18. It has three major features:

  • A gap between the direct sound and first reflections. This happens naturally in most spaces and gives a cue as to the size of the space. The gap should not be too long—less than 30 ms—or the early reflections will be perceived as echoes. Some delay, however, is desirable, as it gives some space for the direct sound and so improves the clarity of the sound, but a shorter gap does add “intimacy” to the space.
  • The presence of high-level diffuse early reflections, which come to the listener predominantly from the side, that is, lateral early reflections. This adds spaciousness and is easier to achieve over the whole audience in a shoebox hall rather than a fan-shaped one. The first early reflections should ideally arrive at the listener within 20 ms of the direct sound. The frequency response of these early reflections should ideally be flat, and this, in conjunction with the need for a high level of lateral reflections, implies that the side walls of a hall should be diffuse reflecting surfaces with minimal absorption.
  • A smoothly decaying diffuse reverberant field that has no obvious defects and no modal behavior and whose time of decay is appropriate to the style of music being performed. This is hard to achieve in practice, so a compromise is necessary in most cases. For performing acoustic music, a gentle bass rise in the reverberant field is desirable to add “warmth” to the sound, but in studios this is less desirable.
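
As mentioned above, an energy–time curve can be derived from a measured impulse response by plotting its squared value in decibels. The following sketch builds a toy impulse response (a direct sound, a single reflection 12 ms later and a decaying tail, all with assumed values) and estimates the initial time-delay gap crudely as the time to the first return within 20 dB of the direct sound; the thresholds and the synthetic response are illustrative assumptions, not a measurement standard.

```python
import numpy as np

def energy_time_curve(impulse_response, fs):
    """Energy-time curve in dB (squared impulse response, normalised to the
    direct sound) and a crude estimate of the initial time-delay gap."""
    energy = impulse_response ** 2
    etc_db = 10 * np.log10(energy / energy.max() + 1e-12)
    direct = int(np.argmax(energy))
    # First sample, at least 1 ms after the direct sound, that comes back
    # within 20 dB of it is taken as the first significant reflection.
    guard = int(0.001 * fs)
    later = np.nonzero(etc_db[direct + guard:] > -20.0)[0]
    itdg_ms = 1000.0 * (later[0] + guard) / fs if later.size else None
    return etc_db, itdg_ms

# Toy impulse response: direct sound, one reflection 12 ms later at -14 dB,
# then an exponentially decaying noise-like tail (all values assumed).
fs = 48_000
ir = np.zeros(fs // 2)
ir[100] = 1.0
ir[100 + int(0.012 * fs)] = 10 ** (-14 / 20)
tail_start = 100 + int(0.020 * fs)
t = np.arange(ir.size - tail_start) / fs
ir[tail_start:] += 10 ** (-25 / 20) * np.exp(-t / 0.08) * np.random.default_rng(2).standard_normal(t.size)

etc, itdg = energy_time_curve(ir, fs)
print(f"Estimated initial time-delay gap: {itdg:.1f} ms")   # about 12 ms
```

Real measurement systems do essentially this on measured responses, usually band by band, so that the designer can inspect the gap, the early reflection structure and the decay separately at different frequencies.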

7.2.4 Reflection-Controlled Rooms

For the home listener or a sound engineer in the control room of a studio, the ideal would be an acoustic that allows them to “listen through” the system to the original acoustical environment that the sound was recorded in. Unfortunately, the room in which the recorded sound is being listened to is usually much smaller than the original space, and this has the effect shown in Figure 7.19. Here, the first reflection the listener hears is due to the wall in the listening room and not the acoustic space of the sound that has been recorded. Because of the precedence effect, this reflection dominates, and the replayed sound is perceived as coming from a space the size of the listening room, which is clearly undesirable. What is required is a means of making the sound from the loudspeakers appear as if it is coming from a larger space by suppressing the early reflections from the nearby walls, as shown in Figure 7.20. Examples of this approach are “live end dead end” (LEDE) (Davis and Davis, 1980), “reflection-free zone” (RFZ) (D’Antonio and Konnert, 1984) and controlled-reflection rooms (Walker, 1993, 1998).

One way of achieving this is to use absorption, as shown in Figure 7.21. The effect can also be achieved by using angled or shaped walls, as shown in Figures 7.22 and 7.23. This is known as the “controlled-reflection technique” because it relies on the suppression of early reflections in a particular area of the room to achieve a larger initial time-delay gap. This effect can only be achieved over a limited volume of the room unless the room is made anechoic, which is undesirable. The idea is simple: by absorbing or reflecting away the first reflections from all walls except the farthest one away from the speakers, the initial time delay gap is maximized. If this gap is larger than the initial time-delay gap in the original recording space, the listener will hear the original space and not the listening room.

However, this must be achieved while satisfying the need for even diffuse reverberation, and so the rear wall in such situations must have some explicit form of diffusion structure on it to assure this. The initial time-delay gap in the listening room should be as large as possible, but it is clearly limited by the time it takes sound to get to the rear wall and back to the listener. Ideally this gap should be 20 ms, but it should not be much greater, or it will be perceived as an echo. In most practical rooms, this requirement is automatically satisfied, and initial time-delay gaps in the range of 8 ms to 20 ms are achieved.
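
The gap is set by the extra path length of the rear-wall reflection relative to the direct sound, so it can be estimated from the room geometry alone. A minimal sketch, with assumed distances and a speed of sound of 343 m/s:

```python
SPEED_OF_SOUND = 343.0  # m/s

def initial_time_delay_gap(source_to_listener_m, source_to_rear_wall_m, rear_wall_to_listener_m):
    """Initial time-delay gap (ms) set by the rear-wall reflection path."""
    extra_path = source_to_rear_wall_m + rear_wall_to_listener_m - source_to_listener_m
    return 1000.0 * extra_path / SPEED_OF_SOUND

# Example: speakers 2 m from the listener, rear wall 3 m behind the listening seat.
print(f"{initial_time_delay_gap(2.0, 5.0, 3.0):.1f} ms")   # about 17.5 ms
```

With these assumed distances the gap comes out at about 17.5 ms, comfortably inside the 8 ms to 20 ms range quoted above.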

Note that if the reflections are redirected rather than being absorbed, there will be “hot areas” in the room where the level of early reflections is higher than normal. In general, it is often architecturally easier to use absorption rather than redirection, although this can sometimes result in a room with a shorter reverberation time.

7.2.5 The Absorption Level Required for Reflection-Free Zones

In order to achieve a reflection-free zone, it is necessary to suppress early reflections, but by how much? Figure 7.24 shows a graph of the average level that an early reflection has to be at in order to disturb the direction of a stereo image. From this, we can see that the reflections must be at least about 15 dB below the direct sound to be subjectively inaudible. Allowing for some reduction due to the inverse square law, this implies that there must be about 10 dB, or α = 0.9, of absorption on the surfaces contributing to the first reflections. In a domestic setting, it is possible to get close to the desired target using carpets and curtains, and bookcases can form effective diffusers, although persuading the other occupants of the house that carpets or curtains on the ceiling are chic can be difficult. In a studio, more extreme treatments can be used. However, it is important to realize that the overall acoustic must still be good and comfortable, that it is not anechoic and that, due to the wavelength range of audible sound, this technique is only applicable at mid to high frequencies, where small patches of treatment are significant with respect to the wavelength.
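
The link between an absorption coefficient and the level reduction of a reflection is a one-line calculation on the reflected energy, sketched below.

```python
import math

def reflection_attenuation_db(absorption_coefficient):
    """Level drop (dB) of a reflection from a surface with the given
    energy absorption coefficient alpha."""
    return -10.0 * math.log10(1.0 - absorption_coefficient)

for alpha in (0.5, 0.9, 0.99):
    print(f"alpha = {alpha:.2f}: reflection reduced by {reflection_attenuation_db(alpha):.1f} dB")
```

An absorption coefficient of 0.9 leaves 10% of the incident energy, that is, a 10 dB reduction, which together with the extra path loss gives the roughly 15 dB suppression needed.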

7.2.6 The Absorption Position for Reflection-Free Zones

Figure 7.25 shows one method of working out where absorption should be placed in a room to control early reflections. By imagining the relevant walls to be mirrors, it is possible to create “image rooms” that show the direction of the early reflections. By defining a reflection-free space around the listening position and by drawing “rays” from the image speaker sources, one can see which portions of the wall need to be made absorbent, as shown in Figure 7.26. This is very straightforward for rectangular rooms but a little more complicated for rooms with angled walls. Nevertheless, this technique can still be used. It is applicable for both stereo and surround systems, the only real difference being the number of sources.

In Figure 7.25, the rear wall is not treated, because normally some form of diffusing material would be placed there. However, absorbing material could be so placed, in the places determined by another image room created by the rear wall, if these reflections were to be suppressed. One advantage of this technique is that it also shows places where absorption is unnecessary. This is useful because it shows you where to place doors and windows that are difficult to make absorptive. To minimize the amount of absorption needed, one should make the listening area as small as possible, because larger reflection-free volumes require larger absorption patches. The method is equally applicable in the vertical as well as the horizontal direction.
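
The mirror construction of Figures 7.25 and 7.26 is simple to automate for a rectangular room. The sketch below mirrors the loudspeaker in a side wall and finds where the image-to-listener line crosses that wall, which is where absorption (or diffusion) is needed; the room size and the speaker and listener positions are assumed example values.

```python
def side_wall_reflection_point(speaker, listener, wall_x):
    """First-order reflection point on a side wall at x = wall_x for a
    rectangular room (plan view). The speaker is mirrored in the wall and
    the image-to-listener line is intersected with the wall."""
    sx, sy = speaker
    lx, ly = listener
    image = (2 * wall_x - sx, sy)               # image speaker behind the wall
    t = (wall_x - image[0]) / (lx - image[0])   # fraction along image -> listener
    return (wall_x, image[1] + t * (ly - image[1]))

# Example: 4 m x 6 m room, left stereo speaker and a central listening seat
# (all positions in metres from the front-left corner; assumed values).
speaker = (1.0, 1.0)
listener = (2.0, 3.5)
print("Treat left wall (x = 0) around y =", round(side_wall_reflection_point(speaker, listener, 0.0)[1], 2))
print("Treat right wall (x = 4) around y =", round(side_wall_reflection_point(speaker, listener, 4.0)[1], 2))
```

Repeating this for each loudspeaker, each wall and a few points around the listening area maps out the patches that need treatment and, just as usefully, the areas that can be left reflective.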

7.2.7 Nonenvironment Rooms

Another approach to controlling early reflections, which is used in many successful control rooms, is the “nonenvironment” room. These rooms control both the early reflections and the reverberation. Although they are quite dead acoustically, they are not anechoic: people in the room still hear some reflections from the remaining hard surfaces. However, sound that is emitted from the speakers is absorbed and is never able to contribute to a reverberant field. How this is achieved is shown in Figure 7.27.

These rooms have speakers, which are flush-mounted in a reflecting wall, and a reflecting floor. The rear wall is highly absorbent, as are the side walls and ceiling. The combined effect of these treatments is that sound from the loudspeakers is absorbed instead of being reflected so that only the direct sound is heard by the listener, except for a floor reflection. However, the presence of two reflecting surfaces does support some early reflections for sources away from the speakers. This means that the acoustic environment for people in the room, although dead, is not oppressively anechoic. Proponents of this style of room say that the lack of anything but the direct sound makes it much easier to hear low-level detail in the reproduced audio and provides excellent stereo imaging. This is almost certainly due to the removal of any conflicting cues in the sound, as the floor reflection has very little effect on the stereo image.

These rooms require wide-band absorbers, as shown in Figure 7.27. These absorbers can take up a considerable amount of space. As one can see in Figure 7.27, the absorbers can occupy more than 50% of the volume. However, it is possible to use wide-band membrane absorbers, as discussed in Chapter 6, with a structure similar to that shown in Figure 6.48 with a limp membrane in place of the perforated sheet. Using this type of absorber, it is possible to achieve sufficient wide-band absorption with a depth of 30 cm, which allows this technique to be applied in much smaller rooms whose area is approximately 15 m². Figure 7.28 shows a typical nonenvironment room implementation: “the Lab” at the Liverpool Music House.

Because nonenvironment rooms have no reverberant field, there is no reverberant room support for the loudspeaker level, as discussed in Section 6.1.7. Only the direct sound is available to provide sound level. In a normal domestic environment, as discussed in Chapter 6, the reverberant field provides most of the sound power and is often about 10 dB greater than the direct sound. Thus in a nonenvironment room, one must use either 10 times the amplifier power or specialist loudspeaker systems with greater efficiency to reproduce the necessary sound levels (Newell, 2008).

7.2.8 The Diffuse-Reflection Room

A novel approach to controlling early reflections is not to try to suppress or redirect them but instead to diffuse them. This reduces the level of each reflection without absorbing the sound energy.

In general, most surfaces absorb some of the sound energy, and so the reflection amplitude is weakened by the reflecting surface. Therefore, the level of specular reflections will be less than that predicted by the inverse square law alone, due to surface absorption. The amount of energy or power removed by a given area of absorbing material will depend on the energy or power per unit area striking it. As the sound intensity is a measure of the power per unit area, this means that the intensity of the sound reflected is reduced in proportion to the absorption coefficient. Therefore, the intensity of the early reflection is given by:

I_{\text{specular reflection}} = \frac{Q\,W_{\text{Source}}\,(1 - \alpha)}{4 \pi r^{2}} \quad (7.2)

From Equation 7.2, which is Equation 1.18 with the addition of the effect of surface absorption, it is clear that the intensity of a specular early reflection is inversely proportional to the square of the distance travelled.

Diffuse surfaces, on the other hand, scatter sound in other directions than the specular. In the case of an ideal diffuser, the scattered energy polar pattern would be in the form of a hemisphere. A simple approach to calculating the effect of this can be to model the scattered energy as a source whose initial intensity is given by the incident energy. Thus, for an ideal scatterer, the intensity of the reflection is given by the product of the equation describing the intensity from the source and the one describing the sound intensity radiated by the diffuser. For the geometry shown in Figure 7.29, this is given by:

I_{\text{diffuse reflection}} = \left(\frac{W_{\text{Source}}}{4 \pi r_{s}^{2}}\right) \times \left(\frac{2}{4 \pi r_{d}^{2}}\right) \quad (7.3)

where r_s is the distance from the source to the diffusing surface and r_d is the distance from the diffusing surface to the listening position.

The factor 2 in the second term represents the fact that the diffuser only radiates into half a sphere, a hemisphere, and therefore has a Q of 2. From Equation 7.3, one can see that the intensity of a diffuse reflection is inversely proportional to the distance to the power of 4. This means that the intensity of an individual diffuse reflection will be much smaller than that of a specular reflection from the same position.
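
The two intensity expressions above are easy to compare numerically. The sketch below uses an assumed source power, an assumed geometry (source 2 m from the wall, listener 2 m from the reflection point, so the specular path is taken as 4 m in total) and no surface absorption, purely to illustrate the difference in level.

```python
import math

def specular_reflection_intensity(power_w, total_path_m, alpha=0.0, q=1.0):
    """Intensity (W/m^2) of a specular reflection, first expression above."""
    return q * power_w * (1.0 - alpha) / (4.0 * math.pi * total_path_m ** 2)

def diffuse_reflection_intensity(power_w, source_to_wall_m, wall_to_listener_m):
    """Intensity (W/m^2) of an ideal diffuse reflection, second expression above."""
    incident = power_w / (4.0 * math.pi * source_to_wall_m ** 2)
    return incident * 2.0 / (4.0 * math.pi * wall_to_listener_m ** 2)

W = 0.01   # 10 mW source power, an assumption
i_spec = specular_reflection_intensity(W, total_path_m=4.0)
i_diff = diffuse_reflection_intensity(W, source_to_wall_m=2.0, wall_to_listener_m=2.0)
print(f"Diffuse reflection is {10 * math.log10(i_spec / i_diff):.1f} dB below the specular one")
```

Even with these modest distances the ideal diffuse reflection comes out roughly 8 dB below the specular one, before any absorption or the temporal spreading produced by the diffuser is taken into account.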

So diffusion can result in a reduction of the amplitude of the early reflection from a given point. However, there will also be more reflections due to the diffusion arriving at the listening position from other points on the wall, as shown in Figure 7.30. Surely this negates any advantage of the technique? A closer inspection of Figure 7.30 reveals that although there are many reflection paths to the listening point, they are all of different lengths and hence time delays. The extra paths are also all of a greater length than the specular path, shown dashed in Figure 7.30. Furthermore, the phase-reflection diffusion structure will add an additional temporal spread to the reflections. As a consequence, the initial time-delay gap will be filled with a dense set of low-level early reflections instead of a sparse set of higher-level ones, as shown in Figure 7.31. Of particular note is that even with no added absorption, the diffuse-reflection levels are low enough in amplitude to have no effect on the stereo image, as shown earlier in Figure 7.24.

The effect of this is a large reduction of the comb-filtering effects that high-level early reflections cause. This is due to both the reduction in amplitude due to the diffusion and the smoothing of the comb filtering caused by the multiplicity of time delays present in the sound arriving from the diffuser. As these comb-filtering effects are thought to be responsible for perturbations of the stereo image (Rodgers, 1981), one should expect improved performance even if the level of the early reflections is slightly higher than the ideal.

The fact that the reflections are diffuse also results in an absence of focusing effects away from the optimum listening position, and this should result in a more gradual degradation of the listening environment away from the optimum listening position. Figure 7.32 shows the intensity of the largest diffuse side-wall reflection relative to the largest specular side-wall reflection as a function of room position for the speaker position shown. From this figure, we can see that over a large part of the room, the reflections are more than 15 dB below the direct sound.

Figure 7.33 shows one of the few examples of such a room. The experience of this room is that one is unaware of sound reflection from the walls: it sounds almost anechoic, yet it has reverberation. Stereo and multichannel material played in this room has images that are stable over a wide listening area, as predicted by theory. The room is also good for recording in, as the high level of diffuse reflections and the acoustic mixing it engenders, as shown in Figure 7.30, help integrate the sound emitted by acoustic instruments.

7.2.9 Summary

In this section, we have examined various techniques for achieving a good acoustic environment for hearing both stereo and multichannel music. However, the design of a practical critical listening room requires many detailed considerations regarding room treatment, sound isolation, air conditioning and so on that are covered in more detail in Newell (2008).

7.3 Pure-Tone and Speech Audiometry

In this section, a number of acoustic and psychoacoustic principles are applied to the clinical measurement of hearing ability. Hearing ability is described in Chapter 2 and summarized in Figure 2.10 in terms of the frequency and amplitude range typically found. But how can these be measured in practice, particularly in the clinic, where such information can provide medical professionals with critical data for the treatment of hearing problems?

The ability to detect sound and the ability to discriminate between sounds are the two aspects of hearing that can be detrimentally affected by age, disease, trauma or noise-induced hearing loss. The clinical tests that are available for the diagnosis of these are

  • sound detection: pure-tone audiometry
  • sound discrimination: speech audiometry

Pure-tone audiometry is used to test a subject’s hearing threshold at specific frequencies approximately covering the speech hearing range (see Figure 2.10). These frequencies are spaced in octaves as follows: 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz and 8 kHz. The range of sound levels that is tested usually starts 10 dB below the average threshold of hearing and can rise to 120 dB above it; recall that the average threshold of hearing varies with frequency (see Figure 2.10).

A clinical audiometer is set up to make diagnosis straightforward and quick and easy to explain to patients. Because the threshold of hearing is a nonuniform curve and therefore not an easy reference to use on an everyday basis in practice, a straight line equating to the average threshold of hearing is used instead to display the results of a hearing test on an audiogram. A dBHL (hearing level) scale is defined for hearing testing, which is the number of dBs above the average threshold of hearing.
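
In other words, the dBHL value at a given frequency is the presented level minus the average threshold of hearing at that frequency. The sketch below illustrates this conversion; the reference threshold values in the table are rough, made-up stand-ins for the standardized calibration values a real audiometer uses, so they should not be read as clinical data.

```python
# Illustrative conversion from dB SPL to dBHL. The reference thresholds below
# are hypothetical round numbers standing in for the standardized values an
# audiometer is calibrated against; do not use them clinically.
AVERAGE_THRESHOLD_DB_SPL = {125: 45, 250: 27, 500: 13, 1000: 7, 2000: 9, 4000: 12, 8000: 16}

def spl_to_hl(freq_hz, level_db_spl):
    """dBHL = presented level in dB SPL minus the average threshold of
    hearing at that frequency."""
    return level_db_spl - AVERAGE_THRESHOLD_DB_SPL[freq_hz]

print(spl_to_hl(1000, 37))   # 30 dBHL: 30 dB above the assumed average threshold at 1 kHz
print(spl_to_hl(125, 45))    #  0 dBHL: just at the assumed average threshold at 125 Hz
```

This is why the output level control can be calibrated directly in dBHL: the instrument applies the frequency-dependent offset for the operator.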

Figure 7.34 shows a blank audiogram that plots frequency on the x-axis (the octave values between 125 Hz and 8 kHz inclusive as shown earlier) against dBHL between −10 dBHL and +120 dBHL on the y-axis. Note that the dBHL scale increases downward to indicate greater hearing loss (a higher amplitude or greater dBHL value needed for the sound to be detected). The 0 dBHL (threshold of hearing) line is thicker than the other lines to give a visual focus on the average threshold of hearing as a reference against which measurements can be compared.

A pure-tone audiometer has three main controls: (1) frequency, (2) output sound level and (3) a spring-loaded output key switch to present the sound to the subject. When the frequency is set, the level is automatically altered to take account of the average threshold of hearing, which enables the output sound level control to be calibrated in dBHL directly. The output sound level control usually works in 5 dB steps and is calibrated in dBHL. It is vitally important that the operator is aware that an audiometer can produce very high sound levels that could do permanent damage to a normal hearing system (see Section 2.5). When testing a subject’s hearing, a modest level around +30 dBHL should be used to start with, which can be increased if the subject cannot hear it.

The spring-loaded output key is used to present the sound, thereby giving the operator control of when the sound is being presented and removing any pattern of presentation that might allow the subject to predict when to expect the next sound. Such unpredictability adds to the overall power of the test. It is particularly important when hearing is being tested in connection with, at one extreme, a legal claim for damages for noise-induced hearing loss or, at the other, health screening for normal hearing as part of a job interview.

When a sound is heard, the subject is asked to press a button, which illuminates a lamp or light-emitting diode (LED) on the front panel of the audiometer. The subject should be visible to the tester, but the subject should not be able to see the controls. When carrying out an audiometric test, local sound levels should be below the levels defined in BS EN ISO 8253–1, which are shown for the test frequencies in Figure 7.35. Generally, the local level should be below 35 dBA.

During audiometry, test signals are presented in one of two ways:

  • air conduction
  • bone conduction

For air-conduction audiometry, sound is presented acoustically to the outer ear and thereby tests the complete hearing system. Three types of air-conduction transducers are available:

  1. circumaural headphones
  2. supra-aural headphones
  3. ear canal–insert earphones

Circumaural headphones surround and cover the pinna (see Figure 2.1) completely, thereby providing a degree of sound isolation. Supra-aural headphones rest on the pinna and are the more traditional type in use, but they are not particularly comfortable since they press quite heavily on the pinna in order to keep the distance between the transducer itself and the tympanic membrane constant. Both circum- and supra-aural headphones can be uncomfortable and somewhat awkward, and they can in certain circumstances deform the ear canal. As an alternative, ear canal–insert earphones that have a disposable foam tip can be used that will not distort the ear canal. They have the added advantage that less sound leaks to the other ear, which reduces the need to consider presenting a masking signal to it. There are, however, situations, such as infected or obstructed ear canals, in which the use of ear canal insert earphones is not appropriate.

For bone-conduction audiometry, sound is presented mechanically using a bone vibrator that is placed just behind but not touching the pinna on the bone protrusion known as the mastoid prominence. It is held in place with an elastic headband. The sound presented when using bone conduction bypasses the outer and middle ears, since it directly vibrates the temporal bone in which the cochlea lies. Thus it can be used to assess inner-ear function and the presence or otherwise of what is known as sensorineural hearing loss with no hindrance from any outer- or middle-ear disorder. Bone-conduction audiometry is carried out in the same way as air-conduction audiometry except that only frequencies from 500 Hz to 4 kHz are used due to the limitations of the bone-conduction transducers themselves. When a bone-conduction measurement is being made for a specific ear, it is essential that the other ear is masked using noise. Specific audiometric guidelines exist for the use of masking.

The usual audiometric procedure for air- or bone-conduction measurements (recalling that for bone conduction only the frequencies from 500 Hz to 4 kHz are used) is to test the frequencies in the following order: 1 kHz, 2 kHz, 4 kHz, 8 kHz, 500 Hz, 250 Hz, 125 Hz and then 1 kHz again as a retest. (Note that the starting frequency is 1 kHz, which is a mid frequency in the hearing range and is therefore likely to be heard by all subjects to give them confidence at the start and end of a test.) If the retest measurement at 1 kHz has changed by more than 5 dB, other frequencies should be retested and the most sensitive value (lowest dBHL value) recorded. When testing one ear, consideration will be given as to whether masking should be presented to the other ear to ensure that only the test ear is involved in the trial. This is especially important when testing the poorer ear.

Tests are started at a level that can be readily heard (usually around +30 dBHL). The tone is presented for 1 to 3 s using the output key switch, and the operator then watches for the subject to light the lamp or LED. If this does not happen, the level is increased in 5-dB steps (5 dB being a minimum practical value to enable tests to be carried out in a reasonable time), presenting the sound and awaiting a response each time. Once a starting level has been established, the sound level is changed using the “10 down, 5 up” method, a simple simulation of which is sketched after the following steps:

  1. Reduce the level in 10-dB steps until the sound is not heard.
  2. Increase the level in 5-dB steps until the sound is heard.
  3. Repeat 1 and 2 until the subject responds at the same level at least 50% of the time, defined as two out of two, two out of three or two out of four responses.
  4. Record the threshold as the lowest level heard.
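
As promised above, here is a minimal simulation of the “10 down, 5 up” procedure. It assumes an idealised listener with a fixed, hypothetical true threshold who always responds when the level is at or above it; real responses are probabilistic, which is exactly why the stopping rule in step 3 asks for repeated ascending responses.

```python
def ten_down_five_up(true_threshold_dbhl, start_level_dbhl=30):
    """Simulate the '10 down, 5 up' audiometric procedure for an idealised
    listener who responds whenever the tone is at or above their threshold."""
    def heard(level):
        return level >= true_threshold_dbhl

    level = start_level_dbhl
    while not heard(level) and level < 120:
        level += 5                          # establish an audible starting level
    ascending_responses = []
    while True:
        while heard(level):
            level -= 10                     # reduce in 10 dB steps until not heard
        while not heard(level):
            level += 5                      # increase in 5 dB steps until heard
        ascending_responses.append(level)
        if ascending_responses.count(level) >= 2:
            return level                    # same ascending level heard twice

print(ten_down_five_up(true_threshold_dbhl=42))   # prints 45, the nearest 5 dB step at or above 42
```

Because the simulated listener is deterministic, two ascents always agree; with a real subject, more ascents (and the two-out-of-three or two-out-of-four criterion) are needed before the level is recorded.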

There are a number of degrees of hearing loss, which are defined in Table 7.1. These descriptions are used to provide a general conclusion about a subject’s hearing, and they should be interpreted as such. They consist of a single value that is the average dBHL value across frequencies 250 Hz to 4 kHz. The values are used to provide a general guideline as to the state of hearing, and it must be remembered that there could be one or more frequencies for which the hearing loss is worse than the average.

Consider, for example, the audiogram for damaged hearing given in Figure 2.19. Here, the average dBHL value for frequencies 250 Hz to 4 kHz would be (10 + 5 + 5 + 15 + 60)/5 = 19 dBHL, which indicates “no hearing handicap” (see Table 7.1), which is clearly not the case.

The upper part of Figure 7.36 shows audiograms for a young adult with normal healthy hearing within both ears based on air- and bone-conduction tests in the left and right ears (a key to the symbols used on audiograms is given in the figure). Notice that the bone- and air-conduction results lie in the same region (in this case < 20 dBHL) and a summary statement of “no hearing handicap” (see Table 7.1) would be entirely appropriate in this case. Pure-tone audiometry is the technique that enables the normal deterioration of hearing with age, or presbycusis (see Section 2.3 and Figure 2.11) to be monitored.

The lower part of Figure 7.36 shows example audiograms for two hearing-loss conditions. The audiogram in the lower-left position shows a conductive hearing loss in the left ear because the bone-conduction plot is normal, but the air-conduction plot shows a significant hearing loss that would be termed a “moderate hearing loss” (see Table 7.1). This indicates a problem between the outside world and the inner ear, and a hearing aid, tailored to the audiogram, could be used to correct for the air-conduction loss.

The audiogram in the lower-right position shows the effect on hearing of congenital rubella syndrome, which can occur in a developing fetus of a pregnant woman who contracts rubella (German measles) from about 4 weeks before conception to 20 weeks into pregnancy. One possible effect on the infant is profound hearing loss (> 95 dBHL—see Table 7.1), which is sensorineural (note that both the air- and bone-conduction results lie in the same region, indicating an inner-ear hearing loss). Sadly, there is no known cure; in this example, a hearing aid would not offer much help because there is no usable residual hearing above around 500 Hz.

Table 7.1 Categories of Hearing Loss with Their Measured Detection Ranges (British Society of Audiology, 2011) in dBHL and a Qualitative Approximate Description of What This Means in Terms of Speech Perception

Hearing loss   Band in dBHL   Effect on hearing speech
Normal         below 20       Speech heard easily
Mild           20 to 40       Quiet speech difficult to hear with competing acoustic noise
Moderate       41 to 70       Moderately loud speech difficult to hear with competing acoustic noise
Severe         71 to 95       Loud speech difficult to hear with competing acoustic noise
Profound       over 95        Almost no speech understood

Pure-tone audiometry tests a subject’s ability to detect different frequencies, and the dBHL values indicate the extent to which the subject’s hearing is reduced at different frequencies. It thus indicates those frequency regions in which a subject is perhaps less sensitive than normally hearing listeners. This could, for example, be interpreted in practice in terms of timbral differences between specific musical sounds that might not be heard or vowel or other speech sounds that might be difficult to perceive. However, pure-tone audiometry does not provide a complete test of a subject’s hearing ability to discriminate between different sounds. Discrimination of sounds does start with the ability to detect the sounds, but it also requires appropriate sound processing to be available. For example, if the critical bands (see Section 2.2) are widened, they are less able to separate the components of complex sounds—the most important to us being speech. In order to test hearing discrimination, speech audiometry is employed that makes use of spoken material.

Speech audiometry is carried out for each ear separately and tests speech-discrimination performance against the pure-tone audiograms for each ear and normative data. When testing one ear, consideration will be given as to whether masking should be presented to the other ear to ensure that only the test ear is involved in the trial. This is particularly important when testing the poorer ear.

Speech audiometry involves the use of an audiometer and speech material that is usually recorded on audio compact disc (CD). Individual single-syllable words such as bus, fun, shop are played to the subject, who is asked to repeat them, providing part words if that is all they have heard. Each spoken response is scored phonetically in terms of the number of correct sounds in the response (for example, if bun or boss was the response for bus, the subject would score two out of three). Words are presented in sets of 10, and if a total phonetic score of 10% or better is achieved for a list, the level is reduced by 10 dB and a new set of 10 words is played, repeating the process until the score falls below 10%. The speech reception threshold (SRT) is the lowest level at which a 10% phonetic score can be achieved. The speech discrimination score (SDS) is the percentage of single-syllable words that can be identified at a comfortable loudness level.

The results from speech audiometry indicate something about the ability to discriminate between sounds, whereas pure-tone audiometry indicates ability to detect the presence of particular frequency components. Clearly detection ability is basic to being able to make use of frequency components in a particular sound, but how a listener might make use of those components depends on their discrimination ability. Discrimination will change if, for example, a listener’s critical bands are widened, which can result in an inability to separate individual components. This could have a direct effect on pitch, timbre and loudness perception. In addition, the ability to hear separately different instruments or voices in an ensemble might be impaired—something that could be very debilitating for a conductor, accompanist or recording engineer.

7.4 Psychoacoustic Testing

Knowledge of psychoacoustics is based on listening tests in order to find out how humans perceive sounds in terms, for example, of pitch, loudness and timbre. Direct measurements are not possible in this context, since direct connections cannot be made for ethical as well as practical reasons, and, in many cases, there is a cognitive dimension (higher-level processing) that is unique to each and every listener. Our knowledge of psychoacoustics therefore is based on listening tests, and this section presents an overview of procedures that are typically used in practice. Apart from offering this as a background to the origins of the psychoacoustic information presented in this book, it also enables readers to think through aspects of the creation of their own listening tests to progress psychoacoustic knowledge in the future.

When carrying out a psychoacoustic test, it is important to note that the responses will be the opinions of listeners; that is, they will be subjective, whereas an objective test involves a direct physical measurement such as dB SPL, Hz or spectral components. There is no right answer to a subjective test, since it is the opinion of a particular listener, and each listener's opinion is unique; the process of psychoacoustic testing is to collect these listener opinions in a nonjudgmental manner. Subjective testing is unlike objective testing, in which direct measurements can be made of physical quantities such as sound pressure level, sound intensity level or fundamental frequency; in a subjective test, a listener is asked to offer an opinion in answer to questions such as “Which sound is louder?” “Does the pitch rise or fall?” “Are these two sounds the same or different?” “Which chord is more in tune?” or “Which version do you prefer?”

Psychoacoustic testing involves careful experimental design to ensure that the results obtained can be truly attributed to whatever aspect of the signal is being used as the controlled variable. This process is called controlled experimentation. A starting point for experimental design may well be a hunch or something we believe to be the case from our own listening experience or from anecdotal evidence. A controlled experiment allows such listening experiences to be carefully explored in terms of which aspects of a sound affect them and how. Psychologists call a behavioral response, such as a listening experience, the dependent variable, and those aspects that might affect it are called the independent variables. Properly controlled psychoacoustic testing involves controlling all the independent variables so that any effects observed can be attributed to changes in the specific independent variable being tested.

7.4.1 Psychoacoustic Experimental Design Issues

One experimental example might be to explore what aspects of sound affect the perception of pitch. The main dependent variable would be f0, but other aspects of sound can affect the perceived pitch such as loudness, timbre and duration (see Section 3.2). Experimentally, it would also be very appropriate to consider other issues that might affect the results—some of which may not initially seem obvious—such as the fact that hearing abilities of the subjects can vary with age (see Section 2.3) and general health or that perhaps subjects’ hearing should be tested (see Section 7.2).

The way in which sounds are presented to subjects can also make a difference, since the use of loudspeakers would mean that the acoustics of the room will alter the signals arriving at each ear (see Chapter 6), whereas the use of headphones would not. There may be background acoustic noise in the listening room that could affect the results, and this may even be localized, perhaps to a ventilation outlet. Subjects can become tired (listener fatigue) or distracted or may perform better at different times of the day. The order in which stimuli are presented can have an effect—perhaps of alerting the listener to specific features of the signal, which prepares them better for a following stimulus. These are all potential independent variables and would need proper controlling.

Part of the process of planning a controlled experiment is thinking through such aspects (the ones given here are just examples and are not presented as a definitive list) before carrying out a full test. It is common to try a pilot test with a small number of listeners to check the test procedure and the presence of any additional independent variables. Some independent variables can be controlled by ensuring they remain constant (for example, the ventilation might be turned off, and measures could be taken to reduce background noise). Others can be controlled through the test procedure (for example, any learning effect could be explored by playing the stimuli in a different order to different subjects or asking each listener to take the test twice, with the stimulus order being reversed the second time).
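
A minimal sketch of one way the order control described above could be implemented is given below; the stimulus labels and the number of subjects are placeholders chosen purely for illustration. Each subject receives an independently shuffled presentation order and then repeats the test with that order reversed, as suggested above for exploring learning effects.

```python
import random

stimuli = ["A", "B", "C", "D", "E"]   # placeholder stimulus labels
subjects = 16                          # illustrative panel size

for subject in range(1, subjects + 1):
    order = stimuli[:]                 # copy, so each subject's shuffle is independent
    random.shuffle(order)
    first_pass = order
    second_pass = list(reversed(order))   # same stimuli, reversed order on the repeat
    print(subject, first_pass, second_pass)
```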

7.4.2 Psychoacoustic Rating Scales

For many psychoacoustic experiments, the request to be given to the listeners is straightforward. In the pitch example, one might ask listeners to indicate which of two stimuli has the highest pitch or whether the pitch of a single stimulus was changing. In experiments in which the objective is to establish the nature of change in a sound, such as whether one synthetic sound is more natural than another, it is not so easy. A simple “yes” or “no” would not be very informative since it would not indicate the nature of the difference. A number of rating scales have been produced that are commonly used in such cases in which the listener is invited to choose the point on the scale that best describes what they have heard. Some examples are given in what follows.

When speech signals are rated subjectively by listeners, perhaps for the evaluation of the signal provided by a mobile phone or the output from a speech-synthesis system, it is usually the quality of the signal that is of interest. The number of listeners is important, since each will have a personal opinion, and it is generally suggested that at least 16 are used to ensure that statistical analysis of the results is sufficiently powerful. However, the greater the number of listeners, the more reliable the results are. It is also most appropriate to use listeners who are potential users of whatever system might result from the work and listeners who are definitely not experts in the area. A number of rating scales exist for the evaluation of the quality of a speech signal, and the following are examples:

  • absolute category-rating (ACR) test
  • degradation category-rating (DCR) test
  • comparison category-rating (CCR) test

The absolute category-rating (ACR) test requires listeners to respond with a rating from the five-point ACR rating scale shown in Table 7.2. The results from all the listeners are averaged to provide a mean opinion score (MOS) for the signals being tested. Depending on the purpose of the test, it might be of more interest to present the results of the listening test in terms of the percentage of listeners that rated the presented sounds in one of the categories such as good or excellent or poor or bad.
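
The bookkeeping behind a MOS is straightforward; the sketch below uses invented ratings from 16 hypothetical listeners purely for illustration and also reports the percentage of listeners rating the signal good or excellent, as mentioned above.

```python
# One ACR rating (1-5, Table 7.2) per listener; values invented for illustration.
ratings = [5, 4, 4, 3, 4, 5, 3, 4, 2, 4, 5, 4, 3, 4, 4, 5]

mos = sum(ratings) / len(ratings)
good_or_better = 100 * sum(1 for r in ratings if r >= 4) / len(ratings)

print(f"MOS = {mos:.2f}")
print(f"Rated good or excellent: {good_or_better:.0f}%")
```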

A comparison is requested of listeners in a degradation category-rating (DCR) test, and this usually involves a comparison between a signal before and after some form of processing has been carried out. The assumption here is that the processing is going to degrade the original signal in some way, for example after some sort of coding scheme such as MP3 (see Section 7.8) has been applied, where one would never expect the signal to be improved. Listeners use the DCR rating scale (see Table 7.2) to evaluate the extent to which the processing has degraded the signal when comparing the processed version with the unprocessed original. The results are analyzed in the same way as for the ACR test, and these are sometimes referred to as the degradation mean opinion score (DMOS).

Table 7.2 Rating Scale and Descriptions for the Absolute Category-Rating (ACR) Test, Which Produces a Mean Opinion Score (MOS), and Degradation Category-Rating (DCR) Test, Which Produces a Degraded Mean Opinion Score (DMOS)

Rating | ACR description (MOS) | DCR description (DMOS)
5 | Excellent | Degradation not perceived
4 | Good | Degradation perceived but not annoying
3 | Fair | Degradation slightly annoying
2 | Poor | Degradation annoying
1 | Bad | Degradation very annoying

In situations in which the processed signal could be evaluated as being either better or worse, a comparison category-rating (CCR) test can be used. Its rating scale is shown in Table 7.3, and it can be seen that it is a symmetric two-sided scale. Listeners are asked to rate the two signals presented in terms of the quality of the second signal relative to the first. The CCR test might be used if one is interested in the effect of a signal-processing methodology being applied to an audio signal, such as noise reduction, in terms of whether it has improved the original signal.

7.4.3 Speech Intelligibility: Articulation Loss

Psychoacoustic experiments may be used to define thresholds of perception and rating scales for small degradations in the quality of the sound. However, at the other end of the quality scale is the case in which the degradation, due to noise, distortion, reverberation and so on, is so severe that it affects the intelligibility of speech. This is also measured and defined by the results of psychoacoustic experiments, but in these circumstances, instead of annoyance, the dependent variable is the proportion of the words that are actually heard correctly.

Two parameters are found to be important by those who work on speech: the “intelligibility” and the “quality” or “naturalness” of the speech. Both reflect human perception of the speech itself, and while they are most directly measured subjectively with panels of human listeners, research is being carried out to make equivalent objective measurements because of the problems of setting up listening experiments and the inherent inter- and intralistener variability. The relationship between intelligibility and naturalness is not fully understood. Speech that is unintelligible would usually be judged as being unnatural. However, muffled, fast or mumbled speech is natural but less intelligible, and speech that is highly intelligible may or may not sound natural.

Subjective measures of intelligibility are often based on the use of lists of words that rhyme, differing only in their initial consonant. In a diagnostic rhyme test (DRT), listeners fill in the leading consonants on listening to the speech, and often the possible consonants will be indicated. In a modified rhyme test (MRT), each test consists of a pair-wise comparison of acoustically close initial consonants such as feel—veal, bowl—dole, fought—thought, pot—tot. The DRT identifies quite clearly in what area a speech system is failing, giving the designers guidance on where they might make modifications. DRT tests are widely accepted for testing intelligibility, mainly because they are rigorous, accurate and repeatable. Another type of testing is “logatom” testing.

In logatom testing, nonsense words such as “shesh” and “bik” are placed into a carrier phrase such as “Can con buy < logatom, e.g., “shesh” here > also” to ensure that they are all pronounced with the same inflection. The listener then has to identify the nonsense word and write it down. Using nonsense words has the advantage of removing the higher language processing that we use to resolve words with degraded quality and so provides a less biased measure. The errors listeners make show how the system being tested damages the speech, such as particular letter confusions, and provides a measure of intelligibility. Typically lists of 50 or 100 logatoms are used as a compromise between accuracy and fatigue, as discussed earlier. Although in theory any consonant-vowel-consonant may be used, it has been the authors’ experience that rude or swear words must be excluded, because the talker usually cannot pronounce them with the same inflection as normal words.

All of these tests result in a count of correctly identified words. Expressed as a percentage of the total, this gives a measure of intelligibility; the percentage heard incorrectly gives the articulation loss. As consonants are more important in Western languages than vowels, this measure is often restricted to the consonants to form %ALcons (articulation loss of consonants), which is the percentage of consonants that are heard incorrectly. If this is greater than 15%, the intelligibility is considered to be poor. Although articulation loss is specific to speech (an important part of our auditory world), much music also relies on good articulation for its effect.

Subjective testing is a complex subject that could fill a complete book just on its own! For more details, see Bech and Zacharov (2006).

Table 7.3 Rating Scale and Descriptions for the Comparison Category-Rating (CCR) Test

Rating | Description
3 | Much better
2 | Better
1 | Slightly better
0 | About the same
−1 | Slightly worse
−2 | Worse
−3 | Much worse

7.5 Filtering and Equalization

One of the simplest forms of electronic signal processing is to filter the signal in order to remove unwanted components. For example, often low-frequency noises, such as ventilation and traffic rumble, need to be removed from the signal picked up by the microphone. A high-pass filter would accomplish this, and mixing desks often provide some form of high-pass filtering for this reason. High frequencies also often need to be removed to either ameliorate the effects of noise and distortion or remove the high-frequency components that would cause alias distortion in digital systems. This is achieved via the use of a low-pass filter. A third type of filter is the notch filter, which is often used to remove tonal interference from signals. Figure 7.37 shows the effect of these different types of filter on the spectrum of a typical music signal.

In these cases, the ideal would be to filter the signal in a way that minimized any unwanted subjective effect on the desired signal. Ideally, in these cases, the timbre of the sound being processed should not change after filtering, but in practice, there will be some effects. What are these effects and how can they be minimized in the light of acoustic and psychoacoustic knowledge?

The first way of minimizing the effect is to recognize that many musical instruments do not cover the whole of the audible frequency range. Few instruments have a fundamental frequency that extends to the lowest frequency in the audible range, and many of them do not produce harmonics or components that extend to the upper frequencies of the audible range. Therefore, in theory, one can filter these instruments such that only the frequencies present are passed with no audible effect. In practice, this is not easily achieved for two reasons:

  • Filter shape: Real filters do not suddenly stop passing signal components at a given frequency. Instead, there is a transition from passing the signal components to attenuating them, as shown in Figure 7.38. The cutoff frequency of a filter is usually expressed as the point at which it attenuates the signal by 3 dB relative to the pass-band; see Figure 7.38. Thus if a filter's cutoff is set to a given frequency, there will be a region within the pass-band in which the amplitudes of the signal's frequency components are affected. This region can extend as far as an octave away from the cutoff point. Therefore, in practice, the filter's cutoff frequency must be set beyond the pass-band that one would expect from a simple consideration of the frequency range of the instruments in order to avoid any tonal change caused by the filter's transition region. As the order of the filter increases, both the slope of the attenuation as a function of frequency and the sharpness of the cutoff increase; this reduces the transition-region effects (illustrated in the sketch after this list) but, unfortunately, increases the time domain effects.
  • Time domain effects: Filters also have a response in the time domain. Any form of filtering that reduces the bandwidth of the signal will also spread it over a longer period of time. In most practical filter circuits, these time domain effects are most pronounced near the cutoff frequency and become worse as the cutoff becomes sharper. Again, as in the case of filter shape, these effects can extend well into the pass-band of the filter. Note that even the notch filter has a time response, which gets longer as the notch bandwidth decreases. Certain digital filtering methods are especially bad in this respect because they produce time domain artifacts that precede the main signal in their output. These artifacts are easily unmasked and so become subjectively disturbing. Again, the effect is to require that the filter cutoff be set beyond the value that one would expect from a simple consideration of the frequency range of the instruments.
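
The pass-band effect of the transition region, and how it shrinks as the filter order increases, can be seen with a short SciPy sketch. The 80 Hz rumble-filter cutoff and the two orders compared are arbitrary illustration values, not recommendations.

```python
import numpy as np
from scipy import signal

fs = 48000          # sample rate in Hz
cutoff = 80.0       # high-pass cutoff for removing rumble (arbitrary example value)

for order in (2, 8):
    # Butterworth high-pass; the -3 dB point sits at `cutoff` for any order
    sos = signal.butter(order, cutoff, btype="highpass", fs=fs, output="sos")
    freqs, response = signal.sosfreqz(sos, worN=8192, fs=fs)
    gain_db = 20 * np.log10(np.maximum(np.abs(response), 1e-12))

    # How much the filter still affects the signal one octave INTO the pass-band
    idx = np.argmin(np.abs(freqs - 2 * cutoff))
    print(f"order {order}: gain at {2 * cutoff:.0f} Hz = {gain_db[idx]:+.2f} dB")
```

The low-order filter still shaves a fraction of a decibel off components an octave above its cutoff, while the high-order filter barely touches them; the price of the sharper cutoff, as noted above, is a longer time domain response.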

Because of these effects, the design of filters that achieve the required filtering effect without subjectively altering the timbre of the signal is difficult.

The second way of minimizing the subjective effects is to recognize that the ear uses the spectral shape as a cue to timbre. Therefore, the effect of removing some frequency components by filtering may be partially compensated by enhancing the amplitudes of the frequency components nearby, as discussed in Chapter 5. Note that this is a limited effect and cannot be carried too far. Figure 7.39 shows how a filter shape might be modified to provide some compensation. Here a small amount of boost, between 1 dB and 2 dB, has been added to the region just before cutoff in order to enhance the amplitude of the frequencies near those that have been removed.

7.5.1 Equalization and Tone Controls

A related and important area of signal processing is equalization. Unlike filtering, equalization is not concerned with removing frequency components but with selectively boosting or cutting them to achieve a desired effect. The process of equalization can be modeled as adding or subtracting a filtered version of the signal from the signal itself, as shown in Figure 7.40. Adding the filtered version boosts the frequencies selected by the filter, whereas subtracting the filtered output reduces the amplitudes of the frequency components in the filter's frequency range. The filter can be a simple high- or low-pass filter, which results in a treble or bass tone control, or it can be a band-pass filter to give a bell-shaped response curve. The cutoff frequencies of the filters may be either fixed or variable depending on the implementation. In addition, the bandwidths of the band-pass filters and, less commonly, the slopes of the high- and low-pass filters can be varied.
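
A minimal sketch of this add/subtract model follows, using a SciPy Butterworth band-pass as the filter. The centre frequency, bandwidth and gain are arbitrary illustration values, and in a real equalizer the filter's phase response means the net boost is not exactly 1 + gain across the whole band.

```python
import numpy as np
from scipy import signal

fs = 48000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)            # one second of noise as a stand-in signal

# Band-pass filter defining the frequency range to be boosted or cut
# (roughly 700 Hz to 1.4 kHz; arbitrary example values).
sos = signal.butter(2, [700, 1400], btype="bandpass", fs=fs, output="sos")
band = signal.sosfilt(sos, x)

gain = 0.5
boosted = x + gain * band              # adding the filtered signal boosts the band
cut = x - gain * band                  # subtracting it reduces the band
```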

An equalizer in which all the filter’s parameters can be varied is called a parametric equalizer. However, in practice, many implementations, especially those in mixing desks, only use a subset of the possible controls for both economy and simplicity of use. Typically in these cases, only the cutoff frequencies of the band-pass, and in some cases the low- and high-pass, filters are variable. There is an alternative version of the equalizer structure that uses a bank of closely spaced fixed-frequency band-pass filters to cover the audio frequency range. This approach results in a device known as the “graphic equalizer,” with typical bandwidths of the individual filters ranging from one third of an octave to 1 octave. For parametric equalizers, the bandwidths can become quite small.

Because a filter is required in an equalizer, the latter also has the same time domain effects that filters have, as discussed earlier. This is particularly noticeable when narrow-bandwidth equalization is used, as the associated filter can “ring,” as shown in Figures 1.62 and 1.63, for a considerable length of time in both boost and cut modes.

Equalizers are used in three main contexts (discussed next), which each have different acoustic and psychoacoustic rationales.

7.5.2 Correcting Frequency Response Faults Due to the Recording Process

This was one of the original functions of an equalizer in the early days of recording, which to some extent is no longer required because of the improvement in both electroacoustic and electronic technology. However, in many cases, there are effects that need correction due to the acoustic environment and the placement of microphones. There are three common acoustic contexts that often require equalization:

  • Close miking with a directional microphone: The acoustic bass response of a directional microphone increases as it is moved close to an acoustic source due to the proximity effect. This has the effect of making the recorded sound bass heavy; some vocalists often deliberately use this effect to improve their vocal sound. This effect can be compensated for by applying some bass-cut to the microphone signal, and this often has the additional benefit of further reducing low-frequency environmental noises. Note that some microphones have this equalization built in but that in general, a variable equalizer is required to fully compensate for the effect.
  • Compensating for the directional characteristics of a microphone: Most practical microphones do not have an even response at all angles as a function of frequency. In general, they become more directional as the frequency increases. As most microphones are designed to give a specified on-axis frequency response in order to capture the direct sound accurately, this results in a response to the reverberant sound that falls with frequency. For recording contexts in which the direct sound dominates, for example close miking, this effect is not important. However, in recordings in which the reverberant field dominates, for example classical music recording, the effect is significant. Applying some high-frequency boost to the microphone signal can compensate for this.
  • Compensating for the frequency characteristics of the reverberant field: In many performance spaces, the reverberant field does not have a flat frequency response, as discussed in Section 6.1.7, and therefore subjectively colors the perceived sound if distant miking is used. Typically the bass response of the reverberant field rises more than is ideal, resulting in a bass-heavy recording. Again, the use of some bass-cut can help to reduce this effect. However, if the reverberation is longer at other frequencies, for example in the midrange, then the reduction should be applied in a way that complements the increase in sound level this causes. As in these cases the bandwidth of the level rise may vary, this must also be compensated for—usually by adjusting the bandwidth, or Q, of the equalizer.

All these uses of equalization compensate for limitations imposed by the acoustics of the recording context. To make intelligent use of it in these contexts requires some idea of the likely effects of the acoustics of the space at a particular microphone location, especially in terms of the direct-to-reverberant sound balance.

7.5.3 Timbre Modification of Sound Sources

A major role for equalizers is the modification of the timbre of both acoustically and electronically generated sounds for artistic purposes. In this context, the ability to boost or cut selected frequency ranges is used to modify the sound’s spectrum to achieve a desired effect on its timbre. For example, boosting selected high-frequency components can add “sparkle” to an instrument’s sound, whereas adding a boost at low frequencies can add “weight” or “punch.” Equalizers achieve these effects through spectral modification only: they do not modify the envelope or dynamics of a music signal. Any alteration of the timbre is purely due to the modification by the equalizer of the long-term spectrum of the music signal. There is also a limit to how far these modifications can be carried before the result sounds odd, although in some cases this may be the desired effect.

When using equalizers to modify the timbre of a musical sound, it is important to be careful to avoid “psychoacoustic fatigue”—this arises because the ear and brain adapt to sounds. This has the effect of dulling the effect of a given timbre modification over a period of time. Therefore one puts in yet more boost, which one adapts to, and so on. The only remedy for this condition is to take a break from listening to that particular sound for a while and then listen to it again later. Note that this effect can happen at normal listening levels and so is different from the temporary threshold shifts that happen at excessive sound levels.

7.5.4 Altering the Balance of Sounds in Mixes

The other major role is to alter the balance of sounds in mixes—in particular the placing of sound “up front” or “back” in the mix. This is because the ability of the equalizer to modify particular frequency ranges can be used to make a particular sound become more or less masked by the sounds around it. This is similar to the way the singer’s formant is used to allow a singer to be heard above the orchestra, as mentioned in Chapter 4. For example, suppose one has a vocal line that is being buried by all the other instrumentation going on. The spectrum of such a situation is shown in Figure 7.41, and from this it is clear that the frequency components of the instruments are masking those of the vocals. By selectively reducing the frequency components of the instruments at around 1.5 kHz while simultaneously boosting the components in the vocal line over the same frequency range, the frequency components of the vocal line can become unmasked, as shown in Figure 7.42. This has the subjective effect of bringing the vocal line out from the other instruments.

Similarly, performing the process in reverse would further reduce the audibility of the vocal line in the mix. To achieve this effect successfully requires the presence of frequency components of the desired sound within the frequency range of the equalizer’s boost and cut region. Thus different instruments require different boost and cut frequencies for this effect. Again, it is important to apply the equalization gently in order to avoid substantial changes in the timbre of the sound sources.

Equalizers therefore have a broad application in the processing of sound. However, despite their utility, they must be used with caution—first to avoid extremes of sound character, unless that is desired, and second to avoid unwanted interactions between different equalizer frequency ranges. As a simple example, consider the effect of adding treble, bass and midrange boost to a given signal. Because of the inevitable interaction between the equalizer frequency responses, the net effect is to have the same spectrum as the initial one after equalization. All that has happened is that the gain is higher. Note that this can happen if a particular frequency range is boosted and then, because the result is a little excessive, other frequency ranges are adjusted to compensate.

7.6 Public-Address Systems

Sound reinforcement of speech is often taken for granted. However, as anyone who has tried to understand an announcement in a reverberant and noisy railway station knows, obtaining clear and intelligible speech reinforcement in a real acoustic environment is often difficult. The purpose of this section is to review the nature of the speech reinforcement problem from its fundamentals in order to clarify the true nature of the problem. We will examine the problem from the perspective of the sound source, the listener, and the acoustics. At the end, you should have a clear appreciation of the difficulties inherent in reinforcing one of our most important and sensitive methods of communication.

Several aspects of an acoustic space affect the intelligibility of speech within it.

7.6.1 Reverberation

As discussed in Chapter 6 (see Section 6.1.12), bigger spaces tend to have longer reverberation times, and well-furnished spaces tend to have shorter reverberation times. Reverberation time can vary from about 0.2 of a second for a small, well-furnished living room to about 10 seconds for a large glass-and-stone cathedral.

There are two main aspects of the sound to consider:

  • The direct sound: This is the sound that carries information and articulation. For speech, it is important that the listener receive a large amount of direct sound if they are to comprehend the words easily. Unfortunately, as discussed in Chapter 1, the direct sound gets weaker as it spreads out from the source. Every time you double your distance from a sound source, the intensity of the direct sound goes down by a factor of four (a drop of 6 dB); that is, it follows an inverse square law. Thus the further away you are from a sound source, the weaker the direct sound component.
  • The reverberant sound: The second main aspect of the sound is the reverberant part. This behaves differently from the direct sound, as discussed in Chapter 6. The reverberant sound is the same in all parts of the space.

The effect of these two aspects is shown in Figure 7.43. As one moves away from a source of sound in a space, the level of direct sound reduces, but the reverberant sound stays constant. This means that the ratio of direct sound to reverberant sound becomes less, so the reverberant sound becomes more dominant. The critical distance, where the reverberant sound dominates, is dependent on both the absorption of the space and the directivity of the source. As the absorption and directivity increase, so does the critical distance, but only proportionally to the square root of these factors. As discussed in Chapter 6, the critical distance is:

$D_{\mathrm{crit}} = 0.141\sqrt{RQ}$ (7.3)
  • where $R = S\bar{\alpha}/(1-\bar{\alpha})$ = the room constant (in m²)
  •   and Q = the directivity of the source (compared to a sphere)
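
Equation 7.3 is easy to evaluate numerically; the sketch below uses a purely illustrative room constant of 350 m² and also anticipates the multi-source form introduced later as Equation 7.5.

```python
import math

def critical_distance(room_constant_m2, directivity, n_sources=1):
    """Critical distance from Equation 7.3; n_sources anticipates the
    multi-source form introduced later as Equation 7.5."""
    return 0.141 * math.sqrt(room_constant_m2 * directivity / n_sources)

# Illustrative values only: a room constant of 350 m^2 and a source that is
# either omnidirectional (Q = 1) or fairly directive (Q = 10).
print(critical_distance(350, 1))     # ~2.6 m
print(critical_distance(350, 10))    # ~8.3 m, i.e. sqrt(10) times further
```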

7.6.2 The Effect of Reverberation on Intelligibility

The effect of reverberation and early reflections is to mask the stops and bursts associated with consonants. They can also blur the rapid formant transitions that are also important cues to different consonant types. Clearly the degradation will depend on both the reverberation time and the relative level of the reverberation to the direct sound. One would expect longer reverberation times to be more damaging than short ones.

Because of the importance of consonants to intelligibility, it is therefore important to maintain a high level of direct to reverberant sound; ideally one should operate a system at less than the critical distance. There is an empirical equation that links the number of consonants lost to the characteristics of the room (Peutz, 1971). As consonants occupy frequencies above 1 kHz and have very little energy above 4 kHz, this equation is based on the average reverberation time in the 1 kHz and 2 kHz octave bands.

For D up to 3.5Dc (at which point the direct-to-reverberant ratio = −11 dB):

$\%AL_{\mathrm{cons}} = \dfrac{200\,D^2\,T_{60}^2\,N}{VQ} + a$ (7.4a)
  • where D = the distance from the nearest loudspeaker (in m)
  •    T60 = the reverberation time of the room (in s)
  •    N = the number of equal-power sources in the room
  •    V = the volume of the room (in m³)
  •    Q = the directivity of the nearest loudspeaker
  •    a = a listener factor; because we always make some errors, it can range from 1.5% to 12.5%, where 1.5% is an excellent listener

For D > 3.5Dc (where the direct-to-reverberant ratio is always worse than −11 dB):

$\%AL_{\mathrm{cons}} = 9\,T_{60} + a$ (7.4b)

Note that when D is greater than 3.5Dc, the intelligibility is constant.

The %ALcons is related to intelligibility as follows (see the sketch after this list):

  • %ALcons less than 10%: the intelligibility will be very good;
  • %ALcons between 10% and 15%: the intelligibility will be acceptable;
  • %ALcons greater than 15%: the intelligibility will be poor.
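
Equations 7.4a and 7.4b, together with these criteria, can be wrapped in a small calculator; the sample values at the end are purely illustrative and do not correspond to any of the worked examples later in this section.

```python
def alcons_percent(D, T60, N, V, Q, a=1.5, Dc=None):
    """Peutz estimate of %ALcons (Equations 7.4a and 7.4b).

    D   distance from the nearest loudspeaker (m)
    T60 reverberation time (s)
    N   number of equal-power sources
    V   room volume (m^3)
    Q   directivity of the nearest loudspeaker
    a   listener factor, 1.5 % (excellent listener) to 12.5 %
    Dc  critical distance (m); beyond 3.5 * Dc the far-field form (7.4b) applies
    """
    if Dc is not None and D > 3.5 * Dc:
        return 9 * T60 + a                              # Equation 7.4b
    return (200 * D**2 * T60**2 * N) / (V * Q) + a      # Equation 7.4a

def intelligibility(alcons):
    """Map %ALcons onto the qualitative categories listed above."""
    if alcons < 10:
        return "very good"
    if alcons <= 15:
        return "acceptable"
    return "poor"

# Purely illustrative values: one directive loudspeaker (Q = 5) heard from 5 m
# in a 3240 m^3 room with T60 = 1.5 s.
score = alcons_percent(D=5, T60=1.5, N=1, V=3240, Q=5)
print(f"%ALcons = {score:.1f}% -> {intelligibility(score)}")
```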

In order to achieve good intelligibility, one might think that placing more loudspeakers in the space would be better, because this would place a loudspeaker closer to each listener. However, notice that the %ALcons in Equation 7.4a increases as the number of sources N increases. This is counterintuitive: more loudspeakers should mean they are closer to the listeners and should therefore be clearer.

7.6.3 The Effect of More Than One Loudspeaker on Intelligibility

Unfortunately, increasing the number of speakers decreases the intelligibility, because only the loudspeaker that is closest to you provides the direct sound. All the other loudspeakers contribute to the reverberant field and not to the direct sound! The net effect of this is to reduce the critical distance and make the problem worse. If one assumes that all the loudspeakers radiate the same power, the critical distance becomes:

$D_{\mathrm{crit}} = 0.141\sqrt{\dfrac{RQ}{N}}$ (7.5)

where N = the number of equal power sources in the room

If the sources do not have equal power, then N is the ratio of the total source power to the power of the source producing the direct sound. That is:

$N = \dfrac{W_{\mathrm{reverb}}}{W_{\mathrm{direct}}} = \dfrac{\sum_{\text{all sources}} W_{\text{each source}}}{W_{\mathrm{direct}}}$

where N = the ratio of all power sources in the room to the direct power

So, in this case, more is not better! Ideally one should have the minimum number of speakers, preferably one, needed to cover the space. When this is not possible, it is possible to regain the critical distance by increasing the Q of each loudspeaker in proportion to their number. This has its own problems, which will be discussed later.

The need to minimize the number of sources in the space has led to a design called the central cluster in which all the speakers required to cover an area are concentrated at one coherent point in the space. In general, such an arrangement will provide the best direct-to-reverberant ratio for a space. Unfortunately, it is not always possible, especially for large spaces.

7.6.4 The Effect of Noise on Intelligibility

The effect of noise, like reverberation, is to mask the stops and bursts associated with consonants. This is because the consonants are typically 20 dB quieter than the vowels. They can also blur the rapid formant transitions, which are also important cues to different consonant types. Because of the importance of consonants to intelligibility, it is therefore important to maintain a high signal-to-noise ratio.

Figure 7.44 shows how the intelligibility of speech varies according to the signal-to-noise ratio. From this figure, we can see that a speech to background noise ratio of greater than +7.5 dB is required for adequate intelligibility. Ideally, a signal-to-noise ratio of greater than 10 dB is required for very good intelligibility. This assumes that there is minimal degradation due to reverberation.

Different types of noise have different effects on speech. For example, background noise that is hiss-like can be spectrally very similar to the initial consonants in sip or ship; periodic sounds such as the low-frequency drone of machines or vehicle tire noise can mask sounds with predominantly low-frequency energy such as the vowels in food or fun; sounds such as motor noise that exhibit a continuous whine can mask a formant frequency region and reduce vowel intelligibility; short bursts of noise can either mask or insert plosive sounds such as the initial consonants in pin, tin or kin; and broad-band noise can contribute to the masking of all sounds, particularly those that depend on higher-frequency acoustic cues (see Howard, 1991) such as the initial consonants in fun, shun, sun and thump.

High levels of noise can mask important formant information. This is especially true of high levels of low-frequency noise that, as shown earlier in Chapter 5, can mask the important lower formants. As high levels of low-frequency and broad-band noise are often associated with transport noise, this can be a serious problem in many situations. More subtly, it is possible for speech that is produced at high levels to mask itself. That is, if the speech is too loud, then, notwithstanding the improvement in signal-to-noise ratio, the intelligibility is reduced, because the low-frequency components of the speech mask the high-frequency components due to the upward spread of masking.

There may be situations in which acoustic treatment may be essential before sound reinforcement is attempted. Interfering noises that have similar rates of variation as speech are particularly difficult to deal with, as they fool our higher-order processing centers into attending to them as if they are speech. Because of this, their effect is often more severe than a simple measurement of level would indicate.

There may also be high levels of noise that cannot be controlled. In these circumstances, it can sometimes be possible to increase intelligibility by boosting the speech spectrum in the frequency regions where the interfering sound is weakest, as discussed in Section 7.5.4, thus causing the desired speech to become unmasked in those regions and so enhancing the speech intelligibility.

7.6.5 Requirements for Good Speech Intelligibility

In general, for good intelligibility, we require the following:

  • The direct sound should be greater than or equal to the reverberant sound. This implies that the listener should be no farther away than the critical distance.
  • The speech-to-interference ratio should, ideally, be greater than 10 dB and no worse than 7.5 dB.
  • The previous two requirements have the implication that the level of the direct sound should be above a certain level, that is, at least 10 dB above the background noise and equal to the reverberant sound level. For both efficiency and the comfort of the audience, this implies that the direct sound should be constant at this level throughout the coverage area.

Usually the only way of achieving this is to make use of the directivity of the loudspeakers used. This is because any other technique, such as reducing T60, tends to require major architectural changes and therefore considerable cost. However, sometimes this may be the only way of achieving a usable system. Sometimes communication can be assisted by using speakers with good elocution, especially female ones, because their voices’ higher pitch tends to be less masked by the reverberation and noise typically present. Another possibility is to “chant” the message, which gives an exaggerated pitch contour that assists intelligibility. As a last resort, one can use the international radiotelephony-spelling alphabet (Oscar, Bravo, Charlie …) to facilitate communication. Paradoxically, electronic volume compression does not improve intelligibility; in many cases it makes it worse, because it can distort the syllabic amplitude variations that help us understand words.

These simple rules, outlined previously, must be considered in the light of the actual context of the system. Their apparent simplicity belies the care, analysis and design that must be used in order for practical systems to achieve their objectives.

7.6.6 Achieving Speaker Directivity

If the major way of achieving a good-quality public-address system is to use directional loudspeakers, it is worth considering how this might be achieved. Ideally, speakers for public-address systems should have a directivity that is constant with frequency; that is, the angles into which they radiate their sound energy remain the same over the whole audio spectrum.

There are two main ways of achieving this:

  • The array loudspeaker: One way is to use a large number of speakers together as an “array loudspeaker.” With appropriate signal processing of the audio into these systems, a good constant-directivity performance can be achieved. With the advent of technologies that make this processing much easier, such speakers are becoming more popular because of the flexibility they allow.
  • The constant directivity horn: This is the other main technique. It has a constant directivity above a specific frequency or frequencies and is simple and very efficient. It can convert about 25% of its electrical energy input into acoustic energy and is able to sustain outputs of around 10 W (130 dB) of acoustic power for long periods of time. Its main limitation is its low-frequency performance, which typically limits its frequency range to frequencies that are greater than 500 Hz.

However, irrespective of the technology used, there is a fundamental lower limit on the frequencies at which the speakers are directive that is determined by their size.

Recall that in Chapter 1, sound diffraction and scattering were discussed in Sections 1.5.9 and 1.5.10, and we saw that an object's effect on a sound wave depends on its size measured in wavelengths. That is, sound is diffracted around objects that are small with respect to the wavelength and is reflected from objects that are large with respect to the wavelength. The same applies to loudspeakers.

Although a standard loudspeaker looks like it should radiate sound in the direction that its drivers are pointing, in practice it doesn’t. This is because at many frequencies, it is small with respect to the wavelength; for example, a 200 mm (8”) loudspeaker will only start becoming directive at about 1 kHz! Note that the size of the box has very little influence; it is the size of the part that radiates the sound that matters. So we can consider any small loudspeaker to be similar to a torch or flashlight bulb without a reflector, irrespective of what it looks like!

The equation that relates the minimum size of the radiating part of a speaker to a given directivity is:

$F_{\theta\min} = \dfrac{219}{L\sin\theta}$ (7.6)
  • where Fθmin = the minimum frequency down to which the −6 dB points occur at ±θ
  •       L = the length of the radiating source (in m)
  •       θ = the −6 dB angle for the speaker, measured on either side of the center line, so the coverage is 2θ. For example, θ = 45° means 90° of coverage

As an example, if one wanted a constant-directivity horn that had a coverage of 90° × 40° from 500 Hz, the mouth would have to be at least 0.62 m for the 90° direction and 1.28 m for the 40° direction. This is not small! You can have a smaller horn, but you must recognize that it will not have this coverage angle at 500 Hz. The sine of the −6 dB θ point is inversely proportional to the frequency below Fθmin; thus at an octave below Fθmin, θ will be approximately double that desired, and so on, until θ becomes greater than 90°, which implies the speaker is omnidirectional and has no directivity. In practice, many commercially available directional speakers have to make some form of compromise in the frequency range of the desired directivity.

However, beware of specmanship. One of the authors has seen a loudspeaker advertised as having a constant directivity of 90° × 60° using a horn whose mouth dimensions were 130 × 130 mm. This gives an Fθmin of 2.4 kHz and 3.4 kHz, respectively. As most human speech energy is between 100 Hz and 5 kHz, this speaker’s directivity is going to have very little influence on the intelligibility of the speech!
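
Equation 7.6 can be rearranged to give the minimum mouth dimension for a required coverage angle; the sketch below (with hypothetical helper names) reproduces both the 500 Hz example above and the 130 mm “specmanship” figures.

```python
import math

def min_mouth_length(f_min_hz, coverage_deg):
    """Smallest radiating dimension (m) that holds a -6 dB coverage of
    `coverage_deg` down to `f_min_hz` (Equation 7.6 rearranged for L)."""
    theta = math.radians(coverage_deg / 2)     # theta is half the coverage angle
    return 219 / (f_min_hz * math.sin(theta))

def f_theta_min(length_m, coverage_deg):
    """Lowest frequency (Hz) at which a mouth of `length_m` still gives the
    stated coverage (Equation 7.6)."""
    theta = math.radians(coverage_deg / 2)
    return 219 / (length_m * math.sin(theta))

print(min_mouth_length(500, 90))   # ~0.62 m mouth for the 90 degree plane
print(min_mouth_length(500, 40))   # ~1.28 m mouth for the 40 degree plane
print(f_theta_min(0.13, 90))       # ~2.4 kHz for a 130 mm mouth, 90 degree plane
print(f_theta_min(0.13, 60))       # ~3.4 kHz for a 130 mm mouth, 60 degree plane
```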

7.6.7 A Design Example: How to Get It Right

Now that we know the requirements, let's look at how we would go about designing a PA system that has good coverage. In order to do this, we need to work out what level the sound will be at some distance away from the loudspeaker. Fortunately, this can be done very easily, as manufacturers provide a parameter called the speaker's sensitivity. This is the sound pressure level (SPL) emitted by a loudspeaker at one meter for 1 W of electrical power input. Using this measure, the SPL at a given distance from the loudspeaker is given by:

$\mathrm{SPL} = 10\log(P_{\mathrm{in}}) + \text{Sensitivity} - 20\log(r)$ (7.7)
  •  where Pin = the power sent to the speaker (in W)
  • Sensitivity = the speaker's sensitivity (in dB for 1 W at 1 m)
  •      r = the distance from the speaker (in m)
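
Equation 7.7 can be wrapped in a small helper. The optional directivity term below is not part of Equation 7.7 itself; it simply mirrors the separate “Directivity effect” row used in Table 7.5, and the sample calls reproduce two entries from the tables that follow (to within rounding).

```python
import math

def spl_at(distance_m, sensitivity_db, power_w=1.0, directivity_db=0.0):
    """Equation 7.7, with an optional off-axis directivity term (in dB)."""
    return (10 * math.log10(power_w) + sensitivity_db
            - 20 * math.log10(distance_m) + directivity_db)

print(spl_at(12, 94))                          # ~72.4 dB (Table 7.4, middle column)
print(spl_at(30.9, 113, directivity_db=-6))    # ~77.2 dB (Table 7.5, back row)
```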

Consider a room that is 30 m long by 12 m wide by 9 m high and with a T60 of 1.5 s at 1 kHz and 2 kHz. The audience's ears start at 2.2 m away from the front wall and extend all the way to the back.

Making use of Equation 7.7, let us look at an example public-address problem for several speaker arrangements.

Example 7.1

A Small Speaker Mounted at Ear Height

Figure 7.45 shows a simple full-range speaker mounted at ear height and oriented to cover the entire audience area. The loudspeaker sensitivity is 94 dB per watt at 1 m. As the entire audience is “on axis,” there will be no variation in level due to speaker directivity; therefore, at the positions shown, the levels will be as shown in Table 7.4.

There is a huge SPL variation from front to back: 22.7 dB! People in the front row are being deafened (one of the authors has seen people close to the speakers wince in a York church installation), while people in the rear row are straining to hear anything.

Unfortunately, this is not an unusual situation. Furthermore, the critical distance for this speaker is 2.63 m! So the %ALcons for this space will be 15.5%. Note that adding more speakers does not make the problem any better. Instead it makes it worse, because the extra loudspeakers further reduce the critical distance. In fact, in a typical multispeaker installation, it is possible to show that an unaided human voice can be more intelligible than the PA system, provided the speaker can project their voice sufficiently!

Table 7.4 Calculated SPL for a Small Full-Range Loudspeaker

Distance from front | 2.2 m | 12 m | 30 m
Distance from speaker | 2.2 m | 12 m | 30 m
Output for 1 watt | 94 dB | 94 dB | 94 dB
Path attenuation | −6.8 dB | −21.6 dB | −29.5 dB
SPL at the listener | 87.2 dB | 72.4 dB | 64.5 dB

The problem is that the path length variation is too large (2.2–30 m), and this results in a very high SPL variation. The arrangement would also have a very poor direct-to-reverberant ratio over most of the audience and would be prone to feedback. Let us see if using a more directive speaker mounted higher up can do any better.

Example 7.2

A single constant-directivity horn speaker 7.5 m above ear height

Mounting the speaker higher up, as shown in Figure 7.46, reduces the path variations between the front and back row. By making the speaker directive, we can ensure that more of the sound goes to the audience. However, we have to handle the low frequencies with a separate loudspeaker because of the frequency limitations of the constant-directivity horn.

Table 7.5 shows the results of doing this. Because of the lower variation in path length, the path loss variation is lower: only 12 dB. However, the directivity of the loudspeaker both helps and hinders the total SPL at the listener. For the front row, it overcompensates for the shorter path length and provides a slightly lower SPL there; unfortunately for the back row, it adds to the path loss, resulting in a lower SPL. The peak SPL variation is therefore 15.1 dB, which is better than the nondirective loudspeaker’s case but still unacceptable.

Although better, the directivity of one speaker is not precise enough to obtain an even distribution. What we need is more control over the directivity. Ideally, we want an on-axis performance farther down the audience. However, if we use a more directive horn, the front will suffer. So let’s see if by combining two horns to make a “central cluster,” we can do better.

Table 7.5 Calculated SPL for a Single 40° × 60° Constant-Directivity Horn Speaker

Distance from front | 2.2 m | 7.8 m | 30 m
Distance from speaker | 7.8 m | 10.8 m | 30.9 m
Output for 1 watt | 113 dB | 113 dB | 113 dB
Path attenuation | −17.8 dB | −20.7 dB | −29.8 dB
Directivity effect | −6 dB | 0 dB | −6 dB
SPL at the listener | 89.2 dB | 92.3 dB | 77.2 dB
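
The path-length and SPL rows of Table 7.5 follow directly from the geometry of Figure 7.46 and Equation 7.7; a minimal sketch, assuming the −6 dB directivity figures quoted in the table:

```python
import math

speaker_height = 7.5                 # m above ear height (Figure 7.46)
sensitivity = 113                    # dB (1 W @ 1 m) for the horn
directivity_db = {2.2: -6, 7.8: 0, 30.0: -6}   # off-axis effect quoted in Table 7.5

for front_m, dir_db in directivity_db.items():
    slant = math.hypot(speaker_height, front_m)      # distance from the speaker
    path_attenuation = -20 * math.log10(slant)       # spherical spreading loss
    spl = sensitivity + path_attenuation + dir_db    # 1 W input, so 10 log(1) = 0
    print(f"{front_m:4.1f} m from front: {slant:4.1f} m path, "
          f"{path_attenuation:6.1f} dB, SPL {spl:5.1f} dB")
```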

Example 7.3

Two constant-directivity horn speakers 7.5 m above ear height

We need to cover 60 degrees of angle using two constant-directivity horns. This can be achieved by using a 60° × 40° horn combined with a 40° × 20° horn, as shown in Figure 7.47; the horns are aligned so that the −6 dB points of the two horns are at the same angle. Because the signals add coherently, these −6 dB points become equivalent to an on-axis level. Therefore the combined horn has three on-axis points on the audience: one at 5.4 m due to the 60° × 40° horn; one at 16.8 m due to the 40° × 20° horn; and finally one at 11.1 m due to the two horns combined. Note that we still have only one source of sound in the room despite using two speakers. This is because they both radiate sound energy from a single point.

Table 7.6 shows the results of doing this. Note that the 40° × 20° horn has a higher sensitivity, because it is concentrating the sound into a smaller solid angle. Note that the level is now much more uniform—to about halfway down the audience, the maximum variation is 4.5 dB, or ± 2.25 dB. However, the rear row at 30 m is still 14.5 dB lower than the level at 5.4 m, and this is still too much variation. One might be able to do better by using three horn loudspeakers, but this is getting more complicated and expensive. The main problem seems to be that the levels in the audience covered by the 40° × 20° horn are lower than those in the area covered by the other horn. So as a more economical solution, let’s try raising the power fed to the 40° × 20° horn to 2.5 W, which is a +4 dB increase in level.

Example 7.4

Two constant-directivity horn speakers 7.5 m above ear height with more power to the horn covering the back

Table 7.7 shows the results of doing this. Now the sound is even to within ± 0.15 dB from 5 m to 19 m into the audience. It drops at the front and the back, and the maximum variation is 10.8 dB. In practice, the front row is likely to receive additional sound from the stage, thus boosting its level. In general, it is not a good idea to try to make the rear row the same level, because this means the rear wall must be on-axis to a loudspeaker, which means half that speaker's energy is splashing off the back wall and either contributing to the reverberant field or causing interference effects at the back of the venue. The calculated %ALcons for this design is 11%, which is close to the boundary between acceptable and very good intelligibility.

We should also consider the lateral coverage. In the final arrangement, the horn covering the front of the audience covers a wider angular width. This matches the trapezoidal audience shape seen by a high central cluster.

7.6.8 More Than One Loudspeaker and Delays

Sometimes it is necessary to use more than one loudspeaker, for example, to “fill in” a shaded under-balcony area or cover a very wide area. In these situations, the speakers should be as directive as possible so as to cover only the area required. This will minimize the amount of extra energy that is fed into the reverberant field. Electronic delay of the signal will often be required in order to match the acoustic delay. When this is used, it should be set at about 15 ms greater than the acoustic delay. This has two advantages:

  • First, it allows the near speaker to be louder than the farther one while giving the illusion that the sound is still coming from the further source via the Haas effect, as discussed in Chapter 2.
  • Second, any comb-filtering effects due to the combination of the near and far signals have a close enough frequency spacing (67 Hz) to be averaged by the ear's critical bands and so do not affect the intelligibility (the short sketch after this list puts numbers on this).
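
A minimal sketch of the delay arithmetic, assuming a purely illustrative 20 m difference between the main-cluster and fill-speaker path lengths:

```python
speed_of_sound = 343.0     # m/s at room temperature (approximate)
extra_path_m = 20.0        # how much further the main cluster is than the fill
                           # speaker for a given listener (illustrative value)
haas_offset_s = 0.015      # the roughly 15 ms extra delay recommended above

acoustic_delay_s = extra_path_m / speed_of_sound
electronic_delay_s = acoustic_delay_s + haas_offset_s

# The deliberate 15 ms misalignment sets the comb-filter notch spacing
comb_spacing_hz = 1.0 / haas_offset_s

print(f"electronic delay = {electronic_delay_s * 1000:.1f} ms")
print(f"comb spacing     = {comb_spacing_hz:.0f} Hz")   # ~67 Hz, as quoted above
```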

7.6.9 Objective Methods for Measuring Speech Quality

%ALcons is a subjective/empirical formula that can be used as a means of estimating the speech intelligibility from architectural data. However, for specifying systems it is better to have an objective measure that can be assessed by some form of acoustic instrumentation. Objective methods for measuring speech quality make use of either an auditory model or a measure that is sensitive to speech spectral variations.

The articulation index (AI), now called the speech intelligibility index (SII) (Pavlovic, 1987), objectively measures “articulation” in individual critical bands, which is defined as the fraction of the original speech energy that is perceivable (i.e., between the threshold of hearing and the threshold of pain, and above the background noise). The AI can be measured by averaging the signal-to-noise ratio across all the bands. The validity of AI depends on the noise not being signal-dependent, which may not be the case with some processing. Also used are various measurements based on comparing the smoothed spectrum of the processed version with that of the original.

Another objective measure is the speech transmission index (STI) (IEC, 2003; Houtgast and Steeneken, 1985), which uses a modulation transfer function approach to measure the effect of a given situation on speech intelligibility. It works for most forms of speech degradation and shows good correlation with subjective tests (Steeneken and Houtgast, 1994). It also has the advantage of being relatively easy to calculate, and so can be used in simulations to predict likely improvements in performance.

The criteria for STI and intelligibility are:

  • 0.0 < STI < 0.4: intelligibility is poor
  • 0.4 < STI < 0.6: intelligibility is fair
  • 0.6 < STI < 0.8: intelligibility is good
  • 0.8 < STI < 1.0: intelligibility is excellent.

STI can also be measured directly, and the simpler STIPA and RASTI methods are available as simple handheld instruments. For more details about public-address system design, see Ahnert and Steffen (1999).

7.7 Noise-Reducing Headphones

One important feature relevant to obtaining good-quality audio listening is the relative levels of the wanted sound and unwanted sound such as local background acoustic noise. In environments such as aircraft cabins where the ambient acoustic noise level is high, it is not easy to obtain good-quality audio. This is particularly the case with the headphones provided to economy-class travelers, which are generally of a low quality that distorts the sound when the volume is turned up sufficiently high to hear the music or film soundtrack. There is also the danger that the overall sound level being presented (wanted signal plus the unwanted background noise) could even cause noise-induced hearing loss (see Section 2.5) depending on the volume of the wanted signal that is set.

One way of reducing the overall sound level being presented in such situations is to reduce the level of the unwanted acoustic background noise being experienced. This has the added advantage of perhaps improving the sound quality of the wanted sound because it can then be presented at a lower level, thereby possibly avoiding any distortion issues due to a high signal presentation level.

There are two common methods used for acoustic background noise reduction:

  1. active noise cancellation
  2. passive noise cancellation

Active noise cancellation is based on the fact that if a waveform is added to an equal and opposite (antiphase) version of itself, cancellation results (see Figure 1.13). Active noise cancelation is designed into headphones that have a microphone on the outside of each earpiece to pick up the local acoustic background noise on each side of the head. This is essentially the acoustic noise that is reaching each ear. This microphone signal is phase reversed and added at the appropriate level to cancel the background noise. The wanted signal is also added in, and the result is that the background noise is significantly reduced in level, and the overall volume of the wanted signal can be reduced to a more comfortable listening level.

Passive noise cancellation can be achieved with in-ear earphones that seal well in the ear canal. The principle here is to block the ear canal to reduce the level of acoustic background noise that enters it, using the same technique as in-ear ear defenders. Usually there are a number (two or three) of soft rubber flanges that form a seal with the ear canal to attenuate the level of acoustic background noise that can enter the ear. The wanted sound is played via small drivers mounted in the body of the earphones, and its level can be reduced as with active noise cancelation. Once again, this allows a more comfortable overall acoustic level to be achieved and lessens the likelihood of sound distortion.

Both active and passive noise cancelation systems can work very effectively, and many people (including both authors!) tend to prefer the passive type, because only the wanted sound is being presented to the ear and the sound itself is not being modified in any way.

In the active case, the wanted signal is having noise added to it—not something one really wants to be doing if one can avoid it. If the phase shift is not absolutely correct, complete noise cancelation will not occur, with the result that the wanted signal is then being further contaminated. Furthermore, because of the phase shift requirement, higher frequencies are not attenuated as effectively—something the passive types do very well.
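
The frequency dependence of this effect can be seen with a short calculation. This is our own illustrative sketch, not a description of any particular product: if the anti-noise is an otherwise perfect copy of the noise but arrives late by a small timing error Δt, the residual of a sine component at frequency f has magnitude |1 − e^(−j2πfΔt)| = 2|sin(πfΔt)|, which grows with frequency.

```python
import numpy as np

def cancellation_residual_db(freq_hz, delay_error_s):
    """Residual level (dB relative to the original noise) after subtracting an
    otherwise perfect anti-noise signal that is late by delay_error_s seconds."""
    residual = 2.0 * np.abs(np.sin(np.pi * freq_hz * delay_error_s))
    return 20.0 * np.log10(np.maximum(residual, 1e-12))

# With a 20-microsecond timing error, 100 Hz is cancelled by nearly 40 dB,
# but at 4 kHz the residual is only about 6 dB below the original noise.
freqs = np.array([100.0, 500.0, 1000.0, 4000.0])
print(cancellation_residual_db(freqs, delay_error_s=20e-6))
```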

However, both kinds can sound very good in practice, and while the sound is being modified in the active version, this needn’t modify or distort the wanted sound. The passive types are somewhat intrusive, as their presence in the ear canal is felt physically, since the seal has to be complete to enable them to function well. One advantage of a good seal with the ear canal is that transmission of the low-frequency components is significantly improved.

7.8 “Mosquito” Units and “Teen Buzz” Ringtones

Sound is being used to deter young people from congregating in particular areas via devices termed “mosquito units” or “teen deterrents.” These devices play sounds at relatively high levels that can only be heard by young people, thereby making it acoustically unpleasant for them to remain in a particular area. They can be placed outside shops, restaurants and other public places to discourage young people from hanging around there. The devices exploit the natural change in hearing response that occurs with age known as presbycusis (see Section 2.3), which causes a significant reduction in the ear’s ability to hear high frequencies (see Figure 2.11). The mosquito was invented by Howard Stapleton in 2005 and first marketed in 2006.

The nominal range of human hearing is usually quoted as being from 20 Hz to 20 kHz, but during a person’s twenties, sensitivity in the region above about 16 kHz reduces greatly. The mosquito unit exploits this by emitting a sound between 16 and 19 kHz, which can only be heard by those less than about 20 to 25 years of age. The usual range over which the mosquito operates as a deterrent is around 15 to 25 m. The UK National Physical Laboratory (NPL) conducted a test of the Mosquito (http://www.europarl.europa.eu/sides/getAllAnswers.do?reference=E-2007-4089&language=LV) in which they reported that the device presented:

  • a mean f0 of 16.8 kHz;
  • a maximum f0 of 18.6 kHz;
  • an A-weighted sound pressure level of 76 dBA at 3 m;
  • no hearing hazard.

These units have proved themselves to be successful deterrents, saving considerable police time and effort in moving on gatherings of young people, and they are now used quite widely in public spaces. The unit will work in the presence of other sounds such as music, which typically contains no high-level energy at these high frequencies. Young people report these high-level high-frequency sounds to be very annoying, unpleasant and irritating.

However, young people now benefit from this basic idea because it has since been used to provide mobile ringtones, sometimes known as “teen buzz,” that cannot be heard by older people. The original teen buzz was created by recording the output from a mosquito device, but there are now plenty of synthesized downloadable teen buzz ringtones available online. In general, these ringtones are not audible to adults over 25 years of age, but this does depend on the overall amplitude of the sound and the rate of presbycusis change for given individuals.

An example mosquito tone is provided on the accompanying CD to demonstrate what it sounds like; the track has four pure tones rising in octaves to the mosquito average frequency of 16.8 kHz as follows: 2.1 kHz, 4.2 kHz, 8.4 kHz and 16.8 kHz. Bear in mind when listening to this track that most youngsters will find the 16.8 kHz tone unpleasant; be cautious with the listening level.
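
For readers without the CD, a similar demonstration track can be synthesized. This is a hedged sketch: the sample rate, tone duration, amplitude and fade times are our own choices, not those of the CD track. Keep the playback level low, particularly for the 16.8 kHz tone.

```python
import numpy as np

fs = 48000                     # sample rate high enough to carry 16.8 kHz cleanly
duration = 2.0                 # seconds per tone (arbitrary choice)
t = np.arange(int(fs * duration)) / fs

tones = []
for freq in (2100.0, 4200.0, 8400.0, 16800.0):   # octave steps up to 16.8 kHz
    tone = 0.1 * np.sin(2 * np.pi * freq * t)    # deliberately low amplitude
    fade = np.minimum(1.0, np.minimum(t, duration - t) / 0.05)  # 50 ms fades
    tones.append(tone * fade)

track = np.concatenate(tones)  # write to a WAV file with your preferred library
```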

7.9 Audio Coding Systems

Many of the advances in the distribution of audio material via film, DVD, television and the Internet, and even on DVD-audio or super audio CD (SA-CD), have been made possible because of developments in audio coding systems. Audio coding systems are methods that reduce the overall data rate of the audio signal so that it may be transmitted via a limited-data-rate channel, such as the Internet, or stored in a data-limited storage medium, like a DVD. In all cases, the ability to reduce the data rate is essential for the system. There are two types of audio coding systems:

  • Lossless audio coding systems: In these systems, the data rate, or data quantity, is reduced but in such a way that no information is lost. That is, after coding and subsequent decoding, the signal that comes out is identical to the signal that went in. This is akin to “zip” compression of computer data files. Examples of such systems are: Shorten, an early lossless format; MPEG-4 ALS, FLAC and Apple Lossless, which are newer; and Direct Stream Transfer (DST) and Meridian Lossless Packing, which are proprietary methods used in the SA-CD and DVD-audio formats, respectively. Lossless methods typically achieve a compression ratio of around 2 to 1. That is, the data size after compression is typically half the size of the input. The actual amount of compression achieved depends on the nature of the input audio signal itself (i.e., what kind of music it is or whether it is speech) and can vary from about 1.5 to 1 to 3 to 1.
  • Lossy audio coding systems: These systems make use of similar techniques to lossless coding but, in addition, reduce the data further by quantizing the signal to minimize the number of bits per sample. In order to achieve this with the minimum of perceived distortion, some form of psychoacoustic model is applied to control the amount of quantization that a particular part of the signal suffers. Because this type of coding removes, or loses, information from the audio signal, it is known as a “lossy coding system.” Unlike the lossless systems, these coders do not preserve the input signal; the output signal is not identical to the original input. The output signal is thus distorted, but hopefully in a way that does not disturb and is inaudible to the listener. Examples of such systems are MPEG-1, MPEG-2, mp3 and MPEG-4, which are used on the Internet, in broadcasting and for DVDs. In the film world, DTS, Sony-SDDS and Dolby AC3 are used to fit multiple channels onto standard film stock. The advantage of lossy coding is that it can compress the signal much more than a lossless system. For example, to achieve the 128 kbit/s stereo audio data rate used for mp3 coding of material on many music download sites, the audio signal has to be compressed by a factor of 11.025 (see the short check after this list)! This is considerably more than can be achieved by lossless compression.
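
The factor of 11.025 quoted above follows directly from the CD-quality source rate (2 channels × 16 bits × 44,100 samples per second), assuming the source material is indeed CD-quality stereo:

```python
cd_rate = 2 * 16 * 44100   # bits per second for 16-bit stereo at 44.1 kHz
mp3_rate = 128_000         # bits per second for the 128 kbit/s mp3 stream
print(cd_rate / mp3_rate)  # 11.025
```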

So how do these systems work? What aspects of the audio signal allow one to losslessly compress the signal, and how can one effectively further reduce the data rate by doing psychoacoustic quantization?

7.9.1 The Archetypical Audio Coder

Figure 7.48 shows the three archetypical stages in an audio coder. All modern audio coders perform these operations. The decoder essentially operates in reverse. The three stages are:

  1. A signal redundancy removal stage, which removes any inter-sample correlations in the signal. In order to do this, the coder may have to send additional side information to the receiving end. This stage does not remove any information from the signal; it merely represents it more efficiently and can therefore be considered to be lossless.
  2. A psychoacoustic quantization stage, which allocates bits to the various components in the audio signal in a manner that gives the minimum subjective distortion. Again, in order to do this, it has to send additional side information to the decoder, such as the number of bits allocated to each signal component. This is the only stage that removes information from the signal and is therefore the only lossy stage in the whole process. It is this stage that makes the difference between a lossy and a lossless audio coding system. Note that although the decoder stage works in reverse to reconstruct audio sample values at the levels chosen by the coder, it cannot restore the information that has been thrown away by the encoding quantizer.
  3. An entropy coding stage, which tries to use the most efficient bit arrangement to transmit both the signal information and the side information to the decoding end. This stage also does not remove any information from the signal and can also be considered to be lossless.

The purpose of these three stages is to maximize the amount of audio information transmitted to the receiving end. So in order to understand how these stages work, we need to understand what we mean by information and how it is related to the audio signal. Then we can unpack the function of the three stages in more detail.

7.9.2 What Exactly Is Information?

What characteristics of a signal or data stream indicate information? For example, you are currently reading text in this book; what is it about the text that is carrying information? The answer is that as you read it, you are seeing new combinations of words that are telling you something you didn’t know before. Another way of looking at it is that there is an element of novelty or surprise in the text. On the other hand, if, during a web chat, you got a message that said “hhhhhhhhhhhhhhhhhhhhhhhh …” it would be carrying no information other than the possibility that the other person has fallen asleep on the keyboard! So the more surprising a thing is, the more information it carries. That is, the less probable something is, the more information it carries. So:

Information ∝ Surprise ∝ 1 / (Probability of occurring) (7.8)

How does this relate to an audio signal? Well, consider a sine wave; it sounds pretty boring to listen to because it is very predictable. On the other hand, a piece of music that is jumping around or an instrument whose texture is continuously changing is much more interesting to listen to because it’s more unpredictable or surprising. If we looked at the spectrum of a sine wave, we would find that all the energy is concentrated at one frequency, whereas for the more interesting music signal, it’s spread over lots of frequencies. In fact, the audio signal that carries the most information is either random noise or a single spike that happens at a time you don’t or can’t predict. Interestingly, in both these cases, the spectrum of the signal contains an equal amount of energy at all frequencies. Again, just like text, the more surprising a signal is, the more information it carries.

So to maximize the information carried by our coded audio signal, we first need to maximize the surprise value of the audio signal. This is done by the signal redundancy removal stage.

7.9.3 The Signal Redundancy Removal Stage

In Chapter 4, we saw that all musical instruments, including the human voice, could be modeled as a sound source followed by sound modifiers, which apply a filtering function. In the limit, the source may be regularly spaced impulses for pitched instruments, a single pulse for percussion instruments, random noise for fricatives or a combination of them all. This source is then followed by a filter that combines the acoustic effects of the instrument’s body and its output coupling to shape the final sound. The effect of this filtering is to add correlation to the audio signal. Correlation implies that information from previous samples is carried over to the current sample, as shown in Figure 7.49. The basic principle behind redundancy removal is that at a given time instant, the audio signal will consist of two elements (as shown in Figure 7.49):

  • Information about the previous signals that have passed through the filter. In principle, the contribution of this to the overall signal can be calculated from knowledge of the signal that has already passed through the filter. That is, this contribution is predictable; this is shown by the different hatchings in the sample at time “t” in Figure 7.49 that correspond to the similarly hatched earlier samples.
  • Information that is purely due to the excitation of the filter by the source. The contribution of this to the overall signal cannot be calculated from knowledge of the signal that has already passed through the filter, and hence this contribution is not predictable and is shown in black in Figure 7.49.

To maximize the information content of the signal, we should aim to remove the predictable parts, because they can be recalculated at the receiving end, and only encode the unpredictable part, because this represents the new information.

This is what the correlation removal stage does. There are two different ways of doing this:

  • Time domain prediction: This is the method used by most lossless encoding schemes, which calculates an inverse to the filter that has shaped the unpredictable (source) part of the signal. It is possible, using the method of linear prediction, to calculate the necessary inverse filter from the input data (a minimal linear-prediction sketch follows this list). This filter is then used to remove the correlated components from the input signal prior to coding the signal. At the decoder, the original signal is recovered by feeding the decoded signal into the complementary filter, which puts the correlations back into the signal. To do this, the coder must send additional side information that specifies the necessary filter coefficients so that the decoder restores the correlation correctly. In addition, because the correlations within the signal vary with time due to different notes and instruments, it is necessary to recalculate the required filter coefficients and resend them periodically; typically this happens about 50 to 100 times a second. This type of system is known as a “forward adaptive predictor” because it explicitly sends the necessary reconstruction information forward to the decoder. There is an alternative known as a “backward adaptive predictor” that sends the necessary side information implicitly in the data stream, but it is seldom used.
  • Frequency domain processing: This is the method used by most of the lossy audio coding schemes, which splits the audio spectrum into many small bands. In principle, the smaller the better, but there is a limit to how far one can take this because smaller bands have a longer time response, as discussed in Chapter 1. In practice, due to the limitations imposed by temporal masking (see Chapter 5), the time extent of the signal is limited to 8 to 25 ms, which implies a minimum bandwidth of 125 to 40 Hz. This technique removes correlation from the signal, because correlated signals occupy less spectral space; for example, a sine wave occupies almost no spectrum. Thus the coder can focus more bits on higher amplitude components. It also allows the coder to flag bands that contain no energy at all and therefore need no bits to be transmitted. The narrower the bands are, the more effective this concentration can be. The used and unused band information needs to be sent to the decoder and therefore represents additional side information. The signal within the bands is also less correlated because, as the band gets narrower in frequency range, the spectrum within the band is more likely to be uniform and thus approaches the desired white-noise spectrum with zero correlation between samples. A further advantage of frequency domain processing is that it converts the audio signal into a form that makes it easy for psychoacoustic quantization to be efficiently applied.
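
The time-domain prediction item above can be illustrated with a minimal sketch of the autocorrelation method of linear prediction. This is our own illustration under simplifying assumptions (no frame handling, windowing or coefficient quantization), not the algorithm of any particular lossless codec.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Estimate forward linear-prediction coefficients by solving the normal
    equations built from the signal's autocorrelation."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])        # a[k] predicts x[n] from x[n-1-k]

def prediction_residual(x, a):
    """Remove the predictable part: e[n] = x[n] - sum_k a[k] * x[n-1-k]."""
    order = len(a)
    e = x.astype(float)
    for n in range(order, len(x)):
        e[n] = x[n] - np.dot(a, x[n - order:n][::-1])
    return e

# A correlated signal (here, filtered noise standing in for a "sound modifier")
# leaves a much smaller residual than the original, so it needs fewer bits.
rng = np.random.default_rng(0)
x = np.convolve(rng.standard_normal(8000), np.ones(8) / 8, mode="same")
a = lpc_coefficients(x, order=8)
e = prediction_residual(x, a)
print(np.var(e) / np.var(x))                # well below 1 for a correlated input
```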

The use of either a time or frequency domain method is possible, and their relative strengths and weaknesses are primarily determined by the application. For example, the frequency domain approach fits well with psychoacoustic quantization algorithms—hence its choice for lossy coding systems.

However, all signal redundancy removal schemes have to tread an uneasy balance between increasing their effectiveness, which requires more side information, and having sufficient resources to effectively encode the actual audio signal information.

7.9.4 The Entropy Coding Stage

Although the entropy stage is the final stage in the coder, it is appropriate to consider it now, because in conjunction with the redundancy removal stage, it forms the structure of a lossless encoder system. Entropy encoding works by maximizing the information carried by the bit patterns that represent the audio signal and the side information. In order to understand how it works, we need to understand a little bit about how we measure information and how entropy relates to information.

7.9.5 How Do We Measure Information?

In order to measure the information content of a signal, we need to know how likely it is to occur. However, we also want to be able to relate the information content to something real, such as the number of bits necessary to transmit that information. So how could we measure information content in such a way that it is related to the number of bits needed to transmit it?

Consider a 3-bit binary digit. It has eight possible bit patterns, or symbols, as shown in Table 7.8. Furthermore, let’s assume that each possible symbol has the same one-in-eight probability of occurring (Psymbol = 0.125). If we use the following equation:

Isymbol = log2(1 / Psymbol) (7.9)

  • where Isymbol = the self-information of the symbol (in bits)
  •     Psymbol = the probability of that symbol occurring
  •     log2 = the logarithm base 2, because we are dealing in bits

One can calculate log2(x) simply as:

log2(x) = log10(x) / log10(2)

where log10(x) is the logarithm base 10, the standard log key on most calculators.

For any of the eight symbols shown in Table 7.8, the self-information is 3 bits because they all have the same probability. However, if the probability of a particular symbol was one, that is, it was like our repeating “h” discussed earlier, the self-information would be zero. On the other hand, if the probability of one of the symbols was lower than one eighth, for example one hundredth (0.01), then that symbol’s self-information would be log2(1/0.01) = 6.64 bits, meaning that symbol carries more information than the other symbols.

Because the total probability of all the symbols must add up to one, if one symbol has a very low probability, the probabilities of the other symbols must be slightly greater to compensate. In the case of one symbol having a probability of one hundredth, each of the other seven symbols will have a probability of Psymbol = 0.99/7 ≈ 0.141. This gives a self-information for each of the other symbols of 2.83 bits per symbol. So in this case, one symbol is worth more than 3 bits of information, but the other symbols are worth less than 3 bits of information. So one of the symbols is using the bits more efficiently than the other seven symbols.
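
These self-information figures are easy to check numerically. A quick sketch using the probabilities discussed above:

```python
from math import log2

def self_information(p):
    """Self-information of a symbol with probability p, in bits (Equation 7.9)."""
    return log2(1 / p)

print(self_information(1 / 8))      # 3.0 bits: the equal-probability case
print(self_information(0.01))       # about 6.64 bits: the rare symbol
print(self_information(0.99 / 7))   # roughly 2.8 bits: each of the other seven
```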

To unravel this, we must look at more than the self-information carried by each symbol. Instead, we must look at the total information carried by all the symbols used in the data stream.

Table 7.8 Symbols Associated with a 3-bit Binary Code

  Binary code   Symbol
  000           0
  001           1
  010           2
  011           3
  100           4
  101           5
  110           6
  111           7

7.9.6 How Do We Measure the Total Information?

In the previous section, we had the case of one symbol in the whole set of possibilities having a very low probability, and as a consequence, the other seven symbols had to have a higher probability, because the probabilities of all the symbols must sum to 1. As an extreme case, consider our person asleep on the keyboard who is sending the same symbol all the time. In this case, the probability of that symbol being sent is 1, and all other symbols have 0 probability of being sent. The total information being sent by this source is 0 bits. How can this be? Surely, if the probability is 0, then the self-information of these symbols is infinite? In theory, this might be true, but as these symbols are never sent, the total information is 0.

So in order to find out the total information of the source, we need to incorporate not only the self-information of each symbol but also its probability of being sent. To do this, we simply multiply the self-information of the symbol by the probability of that symbol actually occurring. This gives the amount of information that symbol carries in proportion to the other symbols in the data source.

Esymbol = Psymbol × log2(1 / Psymbol) (7.10)

where Esymbol = the entropy of the symbol (in bits)

This is also known as the “entropy” because the equation is analogous to one used to calculate entropy in physics. In general, we are interested in the total information content, or entropy of the data source, because this gives us the minimum number of bits required to encode it. This is given by:

Esource = Σ (over all symbols) Psymbol × log2(1 / Psymbol) (7.11)

where Esource = the entropy of the source (in bits)

Equation 7.11 shows that the way to calculate the entropy of a data source is to add up all the individual symbol entropies of that source.

Table 7.9 shows the calculated entropy for four different data sources—the first three correspond to the examples that have been discussed already, and the fourth represents the output of a five-level audio signal. Notice that, except for the case of all symbols having equal probability, all the other sources have a total source information, or entropy, of less than 3 bits. This is always true: the most efficient information source is one that uses all its possible symbols with equal probabilities. If we encoded the sources shown in Table 7.9 using a 3-bit binary code, then we would be wasting bits. In principle, we could code the audio signal using only 1.7 bits, but how?
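
Equation 7.11 can be checked with a short calculation. The five probabilities below are the ones implied by the Huffman worked example later in this section (0.6, 0.15, 0.15, 0.05, 0.05); treat them as an illustrative assumption rather than the exact figures of Table 7.9, but note that they do give the 1.7 bits quoted above.

```python
from math import log2

def source_entropy(probabilities):
    """Total information (entropy) of a source in bits per symbol (Equation 7.11)."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

print(source_entropy([1 / 8] * 8))                    # 3.0 bits: uniform 3-bit source
print(source_entropy([0.6, 0.15, 0.15, 0.05, 0.05]))  # about 1.70 bits: five-level audio
```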

This is the idea behind entropy coding; what we need is a transmitted code that relates more closely to the entropy of each symbol.

7.9.7 Entropy Coding

In order to implement entropy coding, we have to more closely match the number of bits we use for each symbol to the amount of information it carries. Furthermore, each code word associated with the symbol must have an integer number of bits. The net result of this is that instead of using a fixed number of bits for each symbol, we need to use a number of bits that is related to the information carried by that symbol. For example, we know that in a real audio signal, the signal spends much more time at low amplitudes than high amplitudes, so low amplitudes carry less information and should be encoded with fewer bits than high signal levels. This way, fewer bits would be used, on average, to transmit the information. But how do we generate codes that have this desirable property? One simple way of doing it is by using a technique called Huffman coding, which is best illustrated by an example.

Consider the five-level audio signal shown in column 6 of Table 7.9. At the moment, we are using 3 bits of information to transmit an information source that only has 1.7 bits of information. How can we assign code words to the symbols such that the average data rate is closer to 1.7 bits? First, we can recognize that three of the symbols are not used and hence can be ignored.

A Huffman code is generated by starting with the least probable symbol. The list of symbols is first sorted into a list of decreasing probability, as shown in Table 7.10. Then, working up from the lowest-probability symbol, a code is constructed by combining the probabilities to form different levels that correspond to bits in the code. These bits are then used to select either a code word or the next level down, except for the longest/lowest level, corresponding to the two lowest-probability symbols in which the bit is used to select between a one and a zero.

One can calculate the average bits per symbol, after coding, by multiplying the number of bits each symbol actually uses by its probability of occurring. This gives that symbol’s average bit rate. Then, by adding up the average bit rate for all the symbols, you get the total average bit rate:

BitRateactual = Σ (over all symbols) Psymbol × Bitsactual

In our example the rate calculation becomes:

BitRateactual = 0.6 × 1 + 0.15 × 2 + 0.15 × 3 + 0.05 × 4 + 0.05 × 4

Which gives:

BitRateactual = 1.75 bits per symbol

By allocating a bit for each level, a variable-length code is built up in which a leading zero represents the beginning of a new code word up to the maximum length of the code words. This property makes the code comma free, which means it needs no additional bits to separate the variable code words from each other, thus maximizing the efficiency. A code word of length 1 is assigned to the most probable symbol, and longer-length code words are optimally assigned to the lower probability symbols.
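
The construction is easier to follow in code. Below is a minimal sketch using the standard merge procedure; the symbol names are illustrative placeholders (not the actual levels of Table 7.9), but the probabilities match the worked calculation above and reproduce its 1.75 bits per symbol.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for {symbol: probability} by repeatedly merging the
    two least-probable groups and prefixing a bit to every codeword inside them."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in probs}
    counter = len(heap)                      # unique tie-breaker for equal probabilities
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)  # least probable group
        p2, _, group2 = heapq.heappop(heap)  # next least probable group
        for s in group1:
            codes[s] = "0" + codes[s]
        for s in group2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p1 + p2, counter, group1 + group2))
        counter += 1
    return codes

# Five-level example: probabilities as used in the bit-rate calculation above.
probs = {"level A": 0.6, "level B": 0.15, "level C": 0.15,
         "level D": 0.05, "level E": 0.05}
codes = huffman_code(probs)
average_bits = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(f"average bit rate = {average_bits:.2f} bits per symbol")   # 1.75
```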

Huffman coding is optimal in that it gets the code rate within one bit of the source entropy. However, because bits only come in integer multiples, the efficiency for small symbol sets is quite low. In MPEG-1 layer 3 coders, some symbols are Huffman coded as pairs to allow a greater coding efficiency. However, there is a practical limit to the size of a Huffman code because the computation and the tables blow up exponentially in size and so become unfeasible. A more subtle limitation is that, for large symbol sets, it becomes very difficult to gather enough symbol statistics data with sufficient accuracy to generate a Huffman code.

In order to have longer symbol lengths, which allow greater coding efficiencies, other approaches are used. Golomb-Rice codes use a predetermined statistical distribution to remove the need for tables, thus allowing longer code words; they are used in some lossless encoding schemes. Another approach is arithmetic coding, which does not need predetermined distributions but, by using the statistics of the symbols’ occurrence, can encode very long symbols and so approach the source entropy much more closely. For details about these and other entropy coding methods, see Salomon (2007).

7.9.8 The Psychoacoustic Quantization Stage

This is the stage that makes the difference between a lossless and a lossy coder. Again, the actual signal processing can be carried out in either the time domain or the frequency domain. In both cases, the process of adaptive quantization and noise shaping is used.

7.9.9 Quantization and Adaptive Quantization

Quantization is the process of taking an audio or video signal and converting it to a discrete set of levels. The input to this process may be a continuous audio signal, like the one you get from a microphone, or may be an already quantized signal, for example, the signal from a compact disc. An important parameter is the number of levels in the quantizer. This is because the act of quantization is lossy: it throws away information. If an input sample does not fall exactly on one of the quantizer’s output levels, there will be an error between the input and the output, as shown in Figure 7.50.

Although the quantizer will pick the output level that causes minimum error, there will on average always be some error. This error adds noise and distortion to the signal and is often referred to as quantization noise. Ideally this noise should be random, and often dither is added to ensure this. The effect of the error is to reduce the signal-to-noise ratio of the audio signal. If a binary word of Nbits bits is used to encode the audio signal, then the maximum signal-to-noise ratio is given by:

(S/N)Max = 6 × Nbits dB (7.12)

For a 16-bit word (the CD standard), this gives a maximum signal-to-noise ratio of 96 dB. If you compare this signal-to-noise ratio with the idealized masked thresholds in Figure 5.9 in Chapter 5, you will see that the quantization noise will be masked over most of the audio band for loud signals. Although quantizers very often have a number of levels that is a power of 2, such as 16, 256 or 65,536, because this makes best use of a binary word, there is no reason why other numbers of levels cannot be used, especially if entropy coding is going to be employed. In this case, the maximum signal-to-noise ratio for an N-level quantizer will be given by:

(S/N)Max = 20 × log10(N) dB (7.13)

In either case, the quantization error is uniform over the frequency range in the ideal case.

These maximum signal-to-noise calculations assume that the signal has a uniform amplitude probability distribution; that is, all signal levels within the range of the quantizer are equally likely. If one assumes a sine wave input at maximum level, then the maximum signal-to-noise ratio is improved by +1.76 dB, because a sine wave spends more time at higher levels. However, in general we do not listen to either sine waves or uniform random noise. We listen to signals that spend more time at low signal levels, and this means that the signal-to-noise ratio is often worse than predicted by these equations.
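
These relationships are easily verified numerically; the +1.76 dB sine-wave correction appears as the final term in this quick sketch:

```python
import math

def snr_max_bits(n_bits):
    """Maximum S/N in dB for an n-bit uniform quantizer (Equation 7.12)."""
    return 6.0 * n_bits

def snr_max_levels(n_levels):
    """Maximum S/N in dB for an N-level uniform quantizer (Equation 7.13)."""
    return 20.0 * math.log10(n_levels)

print(snr_max_bits(16))          # 96 dB: the CD case
print(snr_max_levels(65536))     # about 96.3 dB: the same quantizer, counted by levels
print(snr_max_bits(16) + 1.76)   # about 97.8 dB for a full-scale sine wave
```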

Real signals often spend most of their time at low signal levels, but one has to design the quantizer to handle the maximum signal level, even if it isn’t used very often. This means that more bits are used than is strictly necessary most of the time. One way of reducing the number of bits needed to quantize an audio signal is to make the quantizer adapt to the level of the signal, because loud signals mask weaker signals, as discussed in Chapter 5. This type of quantizer is known as an “adaptive quantizer” and can save some bits. There are two main types of adaptive quantizer:

  1. Backward adaptive quantizers (Jayant, 1973): These make use of the adapted output bits to drive the adaptation and so require no additional bits to be sent to the receiving end. However, they are sensitive to errors and are not guaranteed to be overload free. Although there are ways of mitigating these problems, they are not often used in lossy compression systems.
  2. Forward adaptive quantizers: These look at a block of the input signal and then set a scale factor that makes maximum use of the quantizer (a minimal sketch of such a block-based quantizer follows this list). This type of adaptive quantizer is guaranteed to be free from overload and is more robust to errors. However, it needs some side information—the scale factor—to be sent to the receiving end, which clearly robs bits from the quantized signal samples. Therefore, there is an uncomfortable balance that must be struck between the amount of side information and the number of levels in the adaptive quantizer. In particular, if the block is made longer, fewer scale values have to be sent to the receiver. But if the block is too long, the quantization noise may become unmasked due to nonsimultaneous masking, so the block size is typically set within the range 8 to 25 ms to avoid this happening. There is also an issue about the scale-factor precision. The minimum precision is 6 dB, which allows for a very simple implementation, but it means that only just over half the quantizer’s range may be used: if the largest sample in a block is just over half the quantizer’s range, the scale factor must be set to accommodate it, because the next-lower scale value would result in an overload. Increasing the precision of the scale factor would allow a greater proportion of the quantizer to be used on average but would require the scale value to have a longer word length to handle the increased precision, which increases the amount of side information that must be sent to the receiver. MPEG uses finer scale factors of 1.5 dB to 2 dB.
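
Below is a minimal sketch of a block-based forward adaptive quantizer. It is our own illustration, not any standard’s algorithm: the scale factor is simply the block peak, and the list of scale factors stands in for the side information a real coder would transmit.

```python
import numpy as np

def forward_adaptive_quantize(x, block_len, n_bits):
    """Quantize each block with a scale factor chosen from that block's peak."""
    half_range = 2 ** (n_bits - 1) - 1
    scales, codes = [], []
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        scale = float(np.max(np.abs(block))) or 1.0   # guard against silent blocks
        codes.append(np.round(block / scale * half_range).astype(int))
        scales.append(scale)
    return scales, codes                              # side info + quantized samples

def forward_adaptive_decode(scales, codes, n_bits):
    half_range = 2 ** (n_bits - 1) - 1
    return np.concatenate([c / half_range * s for s, c in zip(scales, codes)])

# 8 ms blocks at 48 kHz = 384 samples, within the 8 to 25 ms range discussed above.
fs = 48000
x = 0.3 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
scales, codes = forward_adaptive_quantize(x, block_len=384, n_bits=6)
y = forward_adaptive_decode(scales, codes, n_bits=6)
print(np.max(np.abs(x - y)))                          # small compared with the 0.3 peak
```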

Adaptive quantizers attempt to make the input signal statistics match the ideal uniform distribution for maximum signal to noise. However, even with fine-scale steps, this is rarely achieved, and there is still a tendency for the small signal levels to be more probable. Furthermore, the higher levels may have a better signal-to-noise ratio than is strictly necessary for masking.

One solution to this is to use nonuniform quantization, sometimes called nonlinear quantization, in which the levels are not equally spaced. By having the less-probable higher levels farther apart, the increased quantization error that results is more likely to be masked by the signal, and, because those levels are used less often, the average noise level does not increase. In fact, for a given number of levels, the average quantization noise reduces, because the more likely levels are closer together and thus generate less quantization noise. MPEG-1 layer 3 uses a nonuniform quantizer to quantize the filtered signal samples.

However, even if an adaptive quantizer is used, the uniform spectrum of the quantization noise means that the overall noise becomes unmasked if the signal-to-noise ratio falls below about 60 dB, which corresponds to 10 bits of quantizer precision. In order to do better, we need to arrange for the noise to be shaped so that it is less perceptible.

7.9.10 Psychoacoustic Noise Shaping

As discussed in Chapter 2, our threshold of hearing is not constant with frequency, as shown in Figure 2.14. We are much more sensitive to sounds around 4 kHz than at the extremes of the frequency range. Therefore, a uniform quantization noise spectrum may be inaudible at low and high frequencies and yet still be heard if it is above the threshold of hearing in the most sensitive part of our hearing range. Ideally, we need to have more noise where we are less sensitive to it and less noise where we are more sensitive. This is possible via a technique known as “noise shaping.”

Figure 7.51 shows the block diagram of a noise-shaping quantizer. In it, the quantization error is extracted, fed back via a noise-shaping filter and subtracted from the input. The effect of this can be analyzed as follows, assuming that the output of a quantizer can be considered to be the sum of the input signal and the quantization error.

Quantizeroutput = Quantizerinput − N(z) × Error + Error (7.14)

Which gives:

Quantizeroutput = Quantizerinput + (1 − N(z)) × Error (7.15)

Equation 7.15 shows that the quantizer’s error is shaped by the filter function (1 − N(z)). It is possible to design this filter to reduce the noise within the most sensitive part of our hearing range. This technique is used to improve the quality of sounds on a CD. An example is Sony Super Bit Mapping (Akune et al., 1992), which uses psychoacoustic noise shaping to give an effective signal-to-noise ratio of 120 dB (20 bits) in our most sensitive frequency range.
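
A minimal first-order sketch of the structure in Figure 7.51 is shown below. It assumes N(z) is a simple one-sample delay, so the error is shaped by (1 − z⁻¹); real systems use higher-order filters matched to the shape of the hearing threshold.

```python
import numpy as np

def noise_shaped_quantize(x, step):
    """First-order noise-shaping quantizer (a sketch of Figure 7.51 with N(z) = z^-1).

    The quantization error is shaped by (1 - z^-1), i.e. pushed away from low
    frequencies towards high frequencies."""
    y = np.empty_like(x, dtype=float)
    prev_error = 0.0
    for n in range(len(x)):
        u = x[n] - prev_error               # subtract the delayed error (N(z) = z^-1)
        y[n] = step * np.round(u / step)    # uniform quantizer
        prev_error = y[n] - u               # extract this sample's quantization error
    return y

# Example: coarse (roughly 4-bit) quantization of a low-frequency sine; the
# resulting error spectrum is tilted towards high frequencies rather than flat.
fs = 48000
t = np.arange(4096) / fs
x = 0.8 * np.sin(2 * np.pi * 440 * t)
y = noise_shaped_quantize(x, step=2.0 / 16)
```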

7.9.11 Psychoacoustic Quantization

Although noise shaping provides a means of psychoacoustically shaping the quantization noise, it is difficult to achieve very low bit rates using it. In particular, one needs to be able to avoid transmitting information in frequency regions that either contain no signal or are masked by other signal components. To do this easily, one must work in the frequency domain.

Figure 7.52 shows the block diagram of a lossy audio coder. Ideally, the time-to-frequency mapping splits the signal up into bands that are equal to, or smaller than, the width of our critical bands. Unfortunately for some audio coders, this is not true at the lower frequencies.

The psychoacoustic quantization block is now replaced by a bit allocation and quantization block, which is driven by a psychoacoustic model that allocates the number of quantization levels for each frequency band, including 0 for no bits allocated. Psychoacoustic models can be quite complicated and are continuously evolving. This is because most lossy audio coding systems define how the receiver interprets the bit stream to form the audio output but leave how those bits are allocated to the encoder. This is clever because it allows the encoder to improve as technology and knowledge get better without having to alter the decoder standard. This is an important consideration for any audio delivery format.

However, regardless of the bit allocation algorithm used, all psychoacoustic models effectively convert the linear frequency scale of the discrete Fourier transform into a perceptually based frequency scale similar to the ERB scale shown in Figure 5.10. Most use the Bark scale, which is similar but is based on the earlier work of Zwicker. Both scales are quasi-logarithmic and convert simultaneous masking thresholds into approximately straight lines.

Using masking curves from some form of simultaneous masking model, which are different for tonal and nontonal sounds, a signal-to-masking ratio (SMR) is calculated, and the bits are allocated such that components that need a high SMR are given more bits than those that have a lower SMR. For components that have a negative SMR, no bits are allocated, because these components are masked and therefore need not be transmitted. The number of bits that can be allocated depends on the desired bit rate. Because the allocation process interacts with itself, it usually runs inside some form of optimization loop that minimizes the total perceptual error. For more information, the book by Marina Bosi (Bosi and Goldberg, 2003), one of the developers of current lossy coding systems, provides a wealth of information.
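
The bit-allocation idea can be illustrated with a deliberately simplified sketch. It is not the MPEG algorithm: each extra bit in a band is assumed to buy roughly 6 dB of signal-to-noise ratio, and the next bit always goes to the band whose quantization noise is currently furthest above its masking threshold.

```python
import numpy as np

def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
    """Greedy bit allocation driven by per-band signal-to-mask ratios (in dB)."""
    bits = np.zeros(len(smr_db), dtype=int)
    for _ in range(total_bits):
        nmr = smr_db - 6.0 * bits            # noise-to-mask ratio; positive = audible
        candidates = np.where((smr_db > 0) & (bits < max_bits_per_band))[0]
        if len(candidates) == 0:             # everything masked or at the bit cap
            break
        worst = candidates[np.argmax(nmr[candidates])]
        bits[worst] += 1                     # give the next bit to the worst band
    return bits

# Bands with negative SMR are fully masked and get no bits at all.
smr = np.array([25.0, 12.0, -4.0, 18.0, 3.0, -10.0])
print(allocate_bits(smr, total_bits=12))
```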

7.10 Cochlea Implants

As we get older, our hearing deteriorates from the high frequencies downward (see Section 2.3) as a function of aging. If our hearing system is exposed to loud sounds for a sufficiently long period of time, noise-induced hearing loss can occur (see Section 2.5). Hearing ability is tested in practice clinically using audiometry (see Section 7.2), which can measure the ability to detect sound (pure-tone audiometry) and to discriminate sounds (speech audiometry). Example audiograms are plotted in Figure 7.36, where hearing loss is recorded on the Y-axis.

Hearing loss is organized descriptively into different degrees as shown in Table 7.1 as follows: normal, mild, moderate, severe and profound. A hearing aid, which electronically amplifies the incoming sound before it enters the ear canal, can be very acceptable and useful in one or both ears that have moderate or severe hearing loss, and to some extent it can help with profound hearing loss. In practice, the decision as to whether a hearing aid will be fitted rests with trained medical staff, and it is based on the extent to which amplifying sound helps speech communication. Acoustic hearing aids are far less likely to help those with profound hearing loss, and where this is the case, a cochlear implant is a possibility. The decision as to whether to fit a cochlear implant rests with trained medical staff, who will consider the hearing capability of each ear; if only one has a profound loss and the other has a lesser degree of hearing loss, a hearing aid in the better ear might be sufficient.

A cochlear implant is a device that picks up incoming sound with a microphone, filters it into different frequency bands and presents the appropriate frequency band to the appropriate place along the cochlea (see Section 2.1.3). The implant consists of an electrode array with its driving electronics and a receiver coil, the latter implanted under the skin behind the ear, plus an external behind-the-ear processing unit that includes a microphone, a speech processor and a transmission coil that couples magnetically to the internal receiver coil. The electrode array is designed to be very flexible such that it can be pushed into the cochlea via the scala tympani (see Figure 2.3) close to the round window (see Figure 2.1).

The basic principle behind the operation of the cochlea implant is illustrated in Figure 7.53, which shows a block diagram of the essential elements of a cochlea implant. The speech processor captures the input from the microphone and processes it in some way to derive electrical signals to be sent to each electrode. Modern cochlea implants usually offer a choice of speech-processing algorithms that essentially provide an appropriate stimulus to electrodes based on their position within the cochlea relative to the analysis frequency at that place. Band-pass filtering with ERB or equivalent bandwidths enables the input spectrum to be partitioned appropriately, and the stimulus sent to each electrode is specific patterns of pulses with controllable amplitude. Cochlea implants have to be set up such that the minimum and maximum stimulation level for each electrode is calibrated, something that typically needs to be repeated over time to accommodate any physiological change or electrode movement.
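
A much-simplified sketch of this processing chain is given below. It is our own illustration, not a clinical coding strategy: the band edges, envelope smoothing time and logarithmic compression law are all arbitrary choices standing in for the calibrated, ERB-based processing described above.

```python
import numpy as np

def band_envelopes(x, fs, band_edges, smooth_ms=8.0):
    """Split x into bands via FFT masking and return each band's smoothed envelope."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    win = int(fs * smooth_ms / 1000)
    envelopes = []
    for lo, hi in band_edges:
        band = np.fft.irfft(np.where((freqs >= lo) & (freqs < hi), spectrum, 0), n=len(x))
        env = np.convolve(np.abs(band), np.ones(win) / win, mode="same")
        envelopes.append(env)
    return np.array(envelopes)

def to_stimulus_levels(envelopes, t_level, c_level):
    """Compress envelopes logarithmically into a threshold-to-comfort range."""
    norm = envelopes / (np.max(envelopes, axis=1, keepdims=True) + 1e-12)
    compressed = np.log10(1 + 9 * norm)          # simple log compression into 0..1
    return t_level + (c_level - t_level) * compressed

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1800 * t)
edges = [(100, 500), (500, 1500), (1500, 4000)]  # illustrative bands, not ERB-spaced
levels = to_stimulus_levels(band_envelopes(x, fs, edges), t_level=0.1, c_level=1.0)
```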

As an alternative to a cochlear implant, when the VIIIth nerve from the inner ear is not functioning normally, an auditory brainstem implant, or ABI, might be fitted. An ABI is similar to a cochlear implant in terms of its processing, but its output is connected surgically to the brainstem. Its main function is to make lip reading easier, and it is used as a treatment for neurofibromatosis type 2, which causes hearing loss and ringing in the ear, or tinnitus.

7.11 Sonification

Sound is a complex basis for communication, as speech and music demonstrate, and advantage is taken of this in the field of “sonification,” which makes use of audio as a means of communicating changes in measured data. Instead of plotting a graph of a measured variation such as rainfall over the past 12 months, the data is used to control an aspect of synthesized audio, such as pitch, loudness or timbre, enabling long durations of experimentally measured data to be explored rapidly. The timescale of the sonified version of the data, in terms of how long it lasts, can be varied as desired; it would be perfectly possible, for example, to take rainfall data for a number of years and play it back in minutes or even seconds simply by setting different data flow rates. The appropriate timescale depends on the nature of the changing features of interest. Any measurable data in any field has the potential to be sonified, such as telemetry data from space vehicles or Formula One cars, data from devices that monitor health functions, weather data, data from historical records, geological data and data relating to everyday living. This process is also known as “auditory display,” and it takes advantage of acoustics, psychoacoustics and the overall highly potent analysis capability of the human hearing system. A major resource for sonification is the downloadable Sonification Handbook (Hermann et al., 2011).

The underlying principle behind sonification is to take advantage of the perceptual processing power of the human hearing system to detect changes in, or relationships between, measured data sets. A key element of success in this process is deciding on the mapping between the measured data and the final audio signal. Ideally, the mapping should take advantage of changes in the audio signal that are going to be obvious to the listener; pitch, loudness and timbre are three potential candidates. Consideration must also be given to the signal itself. It might be the raw data itself, perhaps with the sampling rate changed to place it within the human hearing range; data can be captured at any rate, which could be well below or above what is audible, such as weather data captured once every 24 hours or engine telemetry data captured once every microsecond. Alternatively, the measured data could be used to change a feature of the sound, such as its pitch, vibrato, loudness or a filter setting.

Sound is used for signaling in everyday life, for example in fire alarms, warning signals, kitchen appliances, hospital monitors and ringtones. In terms of information conveyed, however, this tends to be a poor use of the available auditory processing, since many such sounds simply indicate a single state of the system (e.g., on or off). In most cases, better use could be made of the sound. A fire alarm could indicate which part of the building the fire is in; kitchen sounds could be arranged by the user to indicate which appliance requires attention and perhaps what stage the activity is at (e.g., up to temperature, time nearly up, water running low, add the next ingredient); and hospital monitors could be bed specific and convey more information, through changes in the sound, about the current state of the totality of data being monitored. Sound can also be an enabling route to interaction with systems and information for those with visual impairment, where advantage is taken of auditory processing capability, and again, such systems could provide a greater range and/or depth of information.

Sonification extends the use of sound from conveying information about local systems and the environment to exploring measured data sets in much more detail, most notably in terms of dynamic changes with time rather than, for example, an indication of a single state of a system. There may be a number of time-varying attributes of interest in a particular data set, and/or there may be a number of different individual data measurements of interest. Monitoring a dynamically changing aspect of data requires the use of an aspect of sound in which dynamic changes are readily perceived; changes in pitch are commonly employed, as are modifications to timbre. Working with multiple aspects of dynamic change in data can be facilitated by, for example, using different musically related timbres, perhaps based on instrument differences; placing the sounds in different spatial locations, which can be achieved using either loudspeakers or headphones; using different pitch ranges for each stream of data, which could, for convenience, be related to the size of the object of interest (a large object might be in a low pitch range and a small object in a high pitch range); or using different amplitude ranges for different data streams, bearing in mind the potential for masking (see Section 5.5.2) to disrupt a listener’s ability to hear changes. Spatial location offers three dimensions: (1) horizontal angle, (2) vertical angle and (3) distance. Other aspects of how the environment shapes what is perceived might also be used, such as the relative level of early sound and/or reverberation.

Sonification involves taking data derived from sensors or other sources, mapping these data onto a sound synthesis procedure and then diffusing the sound over one or more loudspeakers or perhaps headphones. This is illustrated in Figure 7.54. Here, data is shown in the form of a number of graphs in which the x-axis would be time. Notice that the mapping stage might make multiple connections with some input data streams, depending on what they convey and how they might affect the synthesis stage. The synthesis stage might have multiple elements, with the mapped inputs being used to control various aspects of the synthesis. Synthesizing sounds for the sonification of data therefore potentially has much in common with music analysis and synthesis techniques, as well as methods used in speech processing and synthesis, and many sonification schemes make use of such techniques to create their outputs.
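
As a concrete, hedged illustration of the data-mapping-synthesis pipeline of Figure 7.54, the short sketch below maps a data series onto the pitch of a sine oscillator. The frequency range, note duration and the rainfall figures themselves are all made-up illustrative choices.

```python
import numpy as np

def sonify_to_pitch(data, fs=48000, note_s=0.25, f_low=220.0, f_high=880.0):
    """Map each data value linearly onto a pitch between f_low and f_high and
    render it as a short sine tone; concatenating the tones 'plays' the data."""
    data = np.asarray(data, dtype=float)
    norm = (data - data.min()) / (data.max() - data.min() + 1e-12)
    t = np.arange(int(fs * note_s)) / fs
    fade = np.minimum(1.0, np.minimum(t, note_s - t) / 0.01)   # 10 ms fades per note
    notes = [np.sin(2 * np.pi * (f_low + n * (f_high - f_low)) * t) * fade
             for n in norm]
    return np.concatenate(notes)

# Example: twelve months of (made-up) rainfall figures rendered as changing pitch.
rainfall_mm = [78, 60, 55, 43, 50, 38, 41, 52, 60, 83, 90, 95]
audio = sonify_to_pitch(rainfall_mm)      # write to a WAV file to listen
```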

Clearly, attention must be paid when sonifying data to the hearing capability of the listeners who are making judgments, to ensure they can actually hear the sonified data in the various ranges being used. In addition, the mapping between the data itself and the resulting acoustic change needs to relate the minimum data variation of interest to the minimum audible change, or relevant JND (just-noticeable difference), in the particular sound attribute being employed; for loudness and pitch these are discussed in Sections 2.4.2 and 3.2.2, respectively. If loudness is being employed, it would be necessary to bear in mind the relationship between loudness and sound duration as well as loudness and critical band, to ensure that it is changes in actual loudness that are being monitored (see Sections 2.4.2 and 2.4.3, respectively). In any such work that involves human subjects as listeners, an experimenter has a duty of care to ensure their health and safety. In particular, the overall sound level needs to be controlled and assured to avoid any risk of noise-induced hearing loss (see Section 2.5), especially when aspects of the sound are being controlled by a data source whose range of variation might not be known in advance for a given experiment.

7.12 Summary

This chapter has looked at a variety of applications that combine both acoustic and psychoacoustic knowledge to achieve a specific audio objective. The need to combine both these aspects of this knowledge to achieve useful results and the diversity of applications are what make this subject so exciting! As we write, acoustics is being applied to many more areas that affect our everyday life directly, for example, the noise caused by wind farms, alarms and warning systems, acoustic screening between individual working spaces in open office areas and the design of urban public spaces that not only look nice but also sound good. Our hearing is one of our most precious senses, and it is good to see that, as we become more technologically sophisticated, we are coming to realize that, for our human existence, sound really matters!

Bibliography

Ahnert, W., and Steffen, F., 1999. Sound Reinforcement Engineering: Fundamentals and Practice. Spon Press, London.

Bech, S., and Zacharov, N., 2006. Perceptual Audio Evaluation—Theory, Method and Application. Wiley, Chichester, England.

Bosi, M., and Goldberg, R. E., 2003. Introduction to Digital Audio Coding and Standards, second edn. Springer, New York and London.

British Society of Audiology, 2011. Recommended Procedure—Pure-tone Air-conduction and Bone-conduction Threshold Audiometry With and Without Masking, 1–32. http://www.thebsa.org.uk/wp-content/uploads/2014/04/BSA_RP_PTA_FINAL_24Sept11_MinorAmend06Feb12.pdf

D’Antonio, P., and Konnert, J. H., 1984. The RFZ/RPG approach to control room monitoring. Audio Engineering Society 76th Convention, October, New York, preprint #2157.

Davies, D., and Davies, C., 1980. The LEDE concept for the control of acoustic and psychoacoustic parameters in recording control rooms. J. Audio Eng. Soc. 28 (3), 585–595 (November).

Houtgast, T., and Steeneken, H.J.M., 1985. The MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria. J. Acoust. Soc. Am. 77, 1069–1077.

Howard, D. M., 1991. Speech: Measurements. In: Payne, P. A. (ed.), Concise Encyclopaedia of Biological and Biomedical Measurement Systems. Pergamon Press, Oxford, pp. 370–376.

IEC 60268–13:1998, BS 6840–13, 1998. Sound system equipment. Listening tests on loudspeakers.

IEC 60268–16, 2003. Sound system equipment—objective rating of speech intelligibility by speech transmission index, July.

ISO/TR 4870, 1991. Acoustics—the construction and calibration of speech intelligibility tests.

Jayant, N. S., 1973. Adaptive quantization with a one-word memory. Bell Syst. Tech. J. 52, 1119–1144 (September).

Newell, P., 2008. Recording Studio Design, second edn. Focal Press, Oxford.

Pavlovic, C. V., 1987. Derivation of primary parameters and procedures for use in speech intelligibility predictions. J. Acoust. Soc. Am. 82, 413–422.

Peutz, V.M.A., 1971. Articulation loss of consonants as a criterion for speech transmission in a room. J. Audio Eng. Soc. 19 (11), 915–919 (December).

Rodgers, C.A.P., 1981. Pinna transformations and sound reproduction. J. Audio Eng. Soc. 29 (4), 226–234 (April).

Rumsey, F., 2001. Spatial Audio (Music Technology Series). Focal, Oxford.

Salomon, D., 2007. Data Compression: The Complete Reference, fourth edn. Springer, New York and London.

Toole, F. E., 1990. Loudspeakers and rooms for stereophonic sound reproduction. In: The Proceedings of the Audio Engineering Society 8th International Conference, The Sound of Audio, Washington, DC, 3–6 May, pp. 71–91.

Walker, R., 1993. A new approach to the design of control room acoustics for stereophony. Audio Engineering Society 94th Convention, preprint #3543.

Walker, R., 1998. A controlled-reflection listening room for multichannel sound. Audio Engineering Society 104th Convention, preprint #4645.

Further Reading

Akune, M., Heddle, R., and Akagiri, K., 1992. Super bit mapping: Psychoacoustically optimized digital recording. Audio Engineering Society 93rd Convention, preprint #3371.

Angus, J.A.S., 1997. Controlling early reflections using diffusion. Audio Engineering Society 102nd Convention, 22–25 March, Munich, Germany, preprint #4405.

Angus, J.A.S., 2001. The effects of specular versus diffuse reflections on the frequency response at the listener. J. Audio Eng. Soc. 49 (3), 125–133 (March).

ANSI S3.5–1997, American National Standard Methods for Calculation of the Speech Intelligibility Index. American National Standards Institute, New York.

Hermann, T., Hunt, A. D., and Neuhoff, J. G., 2011. The Sonification Handbook. Logos Verlag, Berlin. (available to download at: http://sonification.de/handbook/)

Holman, T., 1999. 5.1 Surround Sound. Focal Press, Boston.

Newell, P., 1995. Studio Monitoring Design. Focal Press, Oxford.

Newell, P., 2000. Project Studios: A More Professional Approach. Focal Press, Oxford.

Schroeder, M. R., 1975. Diffuse sound reflection by maximum-length sequences. J. Acoust. Soc. Am. 57, 149–151 (January).

Schroeder, M. R., 1984. Progress in architectural acoustics and artificial reverberation: Concert hall acoustics and number theory. J. Audio Eng. Soc. 32 (4), 194–203 (April).

Steeneken, H.J.M., and Houtgast, T., 1994. Subjective and objective speech intelligibility measures. Proc. Inst. Acoust. 16 (4), 95–112.

Toole, F., 2008. Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms. Focal Press, Oxford.

Walker, R., 1996. Optimum dimension ratios for small rooms. Audio Engineering Society 100th Convention, preprint #4191.
