2.2 Spatial audio playback systems

There has been an ongoing debate about the aesthetic aim of recording and reproducing sound. In recording of classical music or other events implying a natural environment, the goal of recording and reproduction can be to re-create as realistically as possible the illusion of ‘being there’ live. (‘Being there’ refers to the recreation of the sound scene at the place and time of the performance. The term ‘there and then’ is often used to describe the same concept, in contrast to ‘here and now’, which describes the sound scene at the place and time during playback.) In many other cases, such as movie sound tracks and pop music, sound is an entirely artificial creation and so is the corresponding spatial illusion, which is designed by the recording engineer. In such a case, the goal of recording and reproduction can be to create the illusion of the event ‘being here’, i.e. the event being in the room where playback takes place.

In any case, the requirement of a spatial audio playback system is to reproduce sound perceived as realistically as possible, either as ‘being there’ or ‘being here’. Note that in ‘being there’ one would like to create the spatial impression of the concert hall ‘there’, whereas in ‘being here’ the acoustical properties of the playback room ‘here’ play a more important role. These aesthetic issues, however, are to be addressed by the performing artists and recording engineers, given the limits of a specific target spatial audio playback system. In the following, we describe three of the most commonly used consumer spatial audio playback systems: stereo loudspeaker playback, headphone playback, and multi-channel surround loudspeaker playback. A relation to Chapter 3 is established by linking spatial hearing phenomena to the described playback systems. A more thorough overview, covering the history and a wide range of playback systems not discussed here, can be found in [250] and [227].

2.2.1 Stereo audio loudspeaker playback

The most commonly used consumer playback system for spatial audio is the stereo loudspeaker setup as shown in Figure 2.1(a). Two loudspeakers are placed in front on the left and right sides of the listener. Usually, these loudspeakers are placed on a circle at angles −30° and 30°. The width of the auditory spatial image that is perceived when listening to such a stereo playback system is limited approximately to the area between and behind the two loudspeakers.

Stereo loudspeaker playback depends on the perceptual phenomenon of summing localization. As will be described in Chapter 3, Section 3.3.4, an auditory event can be made to appear anywhere between a loudspeaker pair in front of a listener by controlling the inter-channel time difference (ICTD) and/or inter-channel level difference (ICLD). It was Blumlein [28] who recognized the power of this principle and filed his now-famous patent. Blumlein showed that introducing only amplitude differences (ICLD) between a loudspeaker pair makes it possible to create phase differences between the ears (interaural time difference, or ITD) similar to those occurring in natural listening. He proposed a number of methods for pickup of sound, leading to the now common technique of coincident-pair microphones.


Figure 2.1 (a) Standard stereo loudspeaker setup; (b) coincident-pair microphone pickup and playback of the resulting stereo signal.

However, Blumlein's work remained unimplemented in commercial products for decades. It was not until the late 1950s that stereo vinyl discs became available. These used a method for cutting two-channel stereo signals onto a single disc similar to a technique already proposed in Blumlein's patent.

Capturing natural spatial sound

Figure 2.1(b) illustrates sound pickup and playback with coincident-pair microphones. Two directional microphones at the same location are orientated such that one microphone points more to the left and the other more to the right. Since, ideally, both microphones are at the same location, there is no phase difference (ICTD) between their signals. But due to their directionality there is an intensity difference (ICLD). For example, sources located on the left side result in a stronger signal in the microphone pointing towards the left side than in the microphone pointing towards the right side. In other words, the ICLD between the two microphone signals is a function of the source angle ϕ. When these microphone signals are amplified and played back over a loudspeaker pair, an auditory event will appear at an angle ϕ′ which is related to the original source angle ϕ, as illustrated in Figure 2.1(b). If the recording system parameters are properly chosen, one can achieve ϕ ≈ ϕ′. When there are multiple concurrently active sources to be recorded (e.g. musical instruments playing together), the same recording and playback principle applies and usually results in multiple auditory events, one for each instrument. More on this subject will be outlined in Section 3.3.5.
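The dependence of ICLD on the source angle can be illustrated numerically. The following is a minimal sketch, not a description of any particular microphone setup: it assumes two ideal cardioid microphones (sensitivity 0.5(1 + cos θ)) whose axes are turned ±45° away from the forward direction, with angles counted positive towards the left. The function names and the 45° default are illustrative assumptions.

```python
import math


def cardioid_gain(source_deg, mic_axis_deg):
    """Sensitivity of an ideal cardioid microphone to a source at source_deg."""
    theta = math.radians(source_deg - mic_axis_deg)
    return 0.5 * (1.0 + math.cos(theta))


def coincident_pair_icld_db(source_deg, half_angle_deg=45.0):
    """ICLD in dB (left over right) for a coincident cardioid pair.

    Angles are measured from the forward axis, positive towards the left.
    half_angle_deg is the assumed offset of each microphone axis.
    Intended for frontal sources; a source in a microphone's null would
    give an infinite level difference and is not handled here.
    """
    left = cardioid_gain(source_deg, +half_angle_deg)
    right = cardioid_gain(source_deg, -half_angle_deg)
    return 20.0 * math.log10(left / right)


# ICLD is zero for a centred source and grows as the source moves sideways.
for angle in (-30, -15, 0, 15, 30):
    print(angle, round(coincident_pair_icld_db(angle), 2))
```

Because both microphones are assumed to be at the same point, the sketch produces a level difference only, which is exactly the property that makes the pair free of ICTD.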

Coincident-pair microphones are a commonly used technique for stereo sound pickup. But there are a number of other popular microphone techniques. As mentioned, coincident-pair microphones ideally result in a signal pair without phase differences (ICTD = 0). This has the advantage that the resulting signal pair is ‘mono compatible’, i.e. when the signal pair is summed to a single mono signal, no problems will occur due to a comb-filter effect (cancellation and amplification of signal components which are out-of-phase and in-phase, respectively).
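The comb-filter effect can be made concrete with a small calculation. As a sketch under simplifying assumptions (the two channels are identical apart from a pure delay, and the mono signal is their average), the gain of the mono sum at frequency f is |cos(π f τ)|, where τ is the ICTD. The function names below are illustrative.

```python
import math


def mono_sum_gain_db(freq_hz, ictd_s):
    """Gain in dB of (left + right) / 2 when right is left delayed by ictd_s."""
    magnitude = abs(math.cos(math.pi * freq_hz * ictd_s))
    # Floor the magnitude so that exact cancellation does not hit log10(0).
    return 20.0 * math.log10(max(magnitude, 1e-12))


def first_notch_hz(ictd_s):
    """Lowest frequency at which the two copies cancel (requires ictd_s > 0)."""
    return 1.0 / (2.0 * ictd_s)


# With ICTD = 0 the sum is flat; with a 0.5 ms delay notches appear.
for f in (250, 500, 1000, 2000):
    print(f, round(mono_sum_gain_db(f, 0.0), 1), round(mono_sum_gain_db(f, 0.0005), 1))
```

For a coincident pair (ICTD = 0) the gain is 0 dB at all frequencies, whereas any non-zero delay produces regularly spaced notches, the first at 1/(2τ).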

Early spatial audio playback experiments based on ‘spaced microphone configurations’ were carried out at Bell Laboratories [245]. In spaced microphone configurations the microphones are placed at different positions. Therefore, such techniques result not only in ICLD cues, but also in ICTD cues. When the goal is to retain mono compatibility, special care has to be taken when mixing the microphone signals to the final stereo mix. It goes beyond the scope of this overview to describe such other microphone techniques in more detail. More on this topic can be found in [250] and [227].

Artificial generation of spatial sound

Artificial auditory spatial images for stereo loudspeaker playback systems can be generated by mixing a number of separately available source signals (e.g. multitrack recording). In practice, mostly ICLD are used for mixing of sources in this way, denoted amplitude panning. The concept of amplitude panning is visualized in Figure 2.2. One sound source s(n) is reproduced using two loudspeakers with signal scale factors a1 and a2. The perceived direction of an auditory event appearing when amplitude panning is applied follows approximately the stereophonic law of sines derived by Bauer [11],


Figure 2.2 Definitions of scale factors and angles for the stereophonic law of sines (2.1).


Figure 2.3 The relation between auditory event angle ϕ and ICLD, i.e. 20 log10(a2/a1), for the stereophonic law of sines.

$$\frac{\sin\phi}{\sin\phi_0} = \frac{a_1 - a_2}{a_1 + a_2} \tag{2.1}$$

where 0° ≤ ϕ0 ≤ 90° is the angle between the forward axis and the two loudspeakers, ϕ is the corresponding angle of the auditory event, and a1 and a2 are scale factors determining ICLD. The relation between ICLD, i.e. 20 log10(a2/a1), and ϕ is shown in Figure 2.3 for a standard stereo listening setup with ϕ0 = 30°.
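As an illustration of how the law of sines can be used for amplitude panning, the sketch below computes scale factors for a desired auditory event angle. It assumes the law in the form sin ϕ / sin ϕ0 = (a1 − a2)/(a1 + a2); the sign convention (which loudspeaker is a1) and the normalization a1² + a2² = 1 are assumptions made for this example, and the function names are illustrative.

```python
import math


def sine_law_gains(phi_deg, phi0_deg=30.0):
    """Scale factors (a1, a2) placing an auditory event at phi_deg.

    Solves sin(phi) / sin(phi0) = (a1 - a2) / (a1 + a2) and normalizes
    the result so that a1**2 + a2**2 == 1 (an assumed convention).
    """
    r = math.sin(math.radians(phi_deg)) / math.sin(math.radians(phi0_deg))
    if abs(r) > 1.0:
        raise ValueError("angle lies outside the loudspeaker pair")
    a1, a2 = 1.0 + r, 1.0 - r
    norm = math.hypot(a1, a2)
    return a1 / norm, a2 / norm


def icld_db(a1, a2):
    """ICLD as defined in the text, 20 log10(a2 / a1)."""
    if a2 == 0.0:
        return float("-inf")
    if a1 == 0.0:
        return float("inf")
    return 20.0 * math.log10(a2 / a1)


for phi in (-30, -15, 0, 15, 30):
    a1, a2 = sine_law_gains(phi)
    print(phi, round(a1, 3), round(a2, 3), icld_db(a1, a2))
```

At ϕ = ±ϕ0 one scale factor becomes zero, so the ICLD is infinite in magnitude and the signal is given to a single loudspeaker only.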

Bennett et al. [15] derived a panning law based on an improved head model compared to the stereophonic law of sines. The result was a ‘stereophonic law of tangents’, which is similar to a law proposed earlier by Bernfeld [18], but for different listening conditions. Amplitude panning and the perception of auditory event direction are discussed in more detail in [210]. Note that all the mentioned panning laws are only crude approximations, since the perceived auditory event direction ϕ also depends on signal properties such as frequency and signal bandwidth.
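To give a feeling for how much the two laws differ, the following sketch predicts the auditory event angle from a given pair of scale factors. The text does not state the tangent law explicitly; the commonly quoted form tan ϕ / tan ϕ0 = (a1 − a2)/(a1 + a2) is assumed here, as is the corresponding form of the law of sines. The function name and the `law` parameter are illustrative.

```python
import math


def predicted_angle_deg(a1, a2, phi0_deg=30.0, law="sine"):
    """Auditory event angle predicted from scale factors a1 and a2.

    law="sine":    sin(phi) / sin(phi0) = (a1 - a2) / (a1 + a2)
    law="tangent": tan(phi) / tan(phi0) = (a1 - a2) / (a1 + a2)  (assumed form)
    """
    ratio = (a1 - a2) / (a1 + a2)
    phi0 = math.radians(phi0_deg)
    if law == "sine":
        return math.degrees(math.asin(ratio * math.sin(phi0)))
    if law == "tangent":
        return math.degrees(math.atan(ratio * math.tan(phi0)))
    raise ValueError("law must be 'sine' or 'tangent'")


# Same scale factors, two predictions.
for a2 in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(a2,
          round(predicted_angle_deg(1.0, a2, law="sine"), 2),
          round(predicted_angle_deg(1.0, a2, law="tangent"), 2))
```

Both laws agree at the centre and at the loudspeaker positions; in between, the predictions differ by at most a few degrees for ϕ0 = 30°, which is consistent with treating either law as a rough approximation.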

A second method to reproduce a sound source at a desired position is referred to as delay panning. Implementing delay (ICTD) panning in analog mixing consoles would have been much more difficult than implementing amplitude panning. This was surely one reason why ICTD panning was hardly used. But even today, when implementing ICTD panning in the digital domain would be simple, it is not commonly used. This may be because ICLD are somewhat more robust than ICTD when a listener is not exactly in the sweet spot (the optimal listening position). ICLD may be perceived as more robust because amplitude panning with a large-magnitude ICLD gives signal to only one loudspeaker, which results in an auditory event at that loudspeaker's location. In such a case, the auditory event is perceived at the loudspeaker location even when the listener is not in the sweet spot. This is one reason why movie theaters usually use a center loudspeaker, i.e. to keep auditory events associated with dialogue at the center of the screen for all movie viewers. With pure ICTD panning, signal is always given at the same level to more than one loudspeaker, so a situation with stable auditory events (i.e. signal from only one loudspeaker) does not occur.
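In the digital domain, delay panning in its simplest form amounts to delaying one channel by a whole number of samples. The sketch below is a minimal illustration under that assumption (no fractional delays, no level change); the function names and the sign convention for the delay are illustrative.

```python
def ictd_to_samples(ictd_s, sample_rate_hz):
    """Round a time difference in seconds to the nearest whole sample."""
    return round(ictd_s * sample_rate_hz)


def delay_pan(signal, ictd_samples):
    """Return (left, right) channels for pure ICTD panning.

    A positive ictd_samples delays the right channel, a negative value
    delays the left channel. Both channels keep the same level.
    """
    n = abs(ictd_samples)
    delayed = [0.0] * n + list(signal)
    direct = list(signal) + [0.0] * n  # zero-pad so both channels match in length
    if ictd_samples >= 0:
        return direct, delayed
    return delayed, direct


left, right = delay_pan([1.0, 0.5, 0.25], ictd_to_samples(0.0005, 48000))
print(len(left), len(right), right.index(1.0))
```

Note that both loudspeakers always receive the full signal, which is the property the text identifies as the reason for the lower robustness of ICTD panning outside the sweet spot.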

In addition to panning, artificial reverberation may be added to the stereo signal for mimicking the spatial impression of a certain room or hall. Other signal modifications may be carried out for controlling other attributes of auditory events and the auditory spatial image.

2.2.2 Headphone audio playback

Headphone stereo audio playback

Stereo audio signals are mostly produced in an optimized way for loudspeaker playback, as described in the previous section. This is reflected by the fact that during the production process the signals are usually monitored with loudspeakers by the recording engineer. The mixing parameters ICLD and ICTD result in relatively similar phenomena with respect to localization and lateralization of auditory events when a signal is presented over loudspeakers or headphones, respectively. Thus, one single stereo signal can be used for either loudspeaker or headphone playback. A major difference is that headphone listening with such stereo signals is limited to in-head localization, i.e. the width of the auditory spatial image is limited to being inside of the head of the listener as will be described in Chapter 3.

Headphone binaural audio playback

For regular audio playback (headphones, loudspeakers) the resulting interaural time difference (ITD) and interaural level difference (ILD) cues (i.e., the spatial cues at the level of the eardrums) only crudely approximate the cues evoked by sources that are physically placed at the auditory event positions. Furthermore, cues related to other attributes of the auditory events and auditory spatial image are also not entirely realistic, but determined largely by the recording engineer as a function of microphone setup parameters, mixing techniques, and sound effects processing.


Figure 2.4 A binaural recording is a two-channel audio signal recorded with microphones close to the ear entrances of a listener or dummy head (left). When these signals are played back with headphones (right), a realistic three-dimensional auditory spatial image is reproduced mimicking the natural spatial image that occurred during recording.

Binaural audio playback aims at presenting a listener with the same signals at the ear entrances as the listener would receive if he were at the original event to be reproduced. Thus, all the signal cues related to perception of the sound are realistic, enabling a three-dimensional sound experience. Note that in this case the width of the auditory spatial image is not limited to inside the head (this is often called externalization).

Figure 2.4 illustrates a system for binaural recording and binaural headphone playback. During a performance, two microphones are placed at or near the ear entrances of a listener or a dummy head and the respective signals are recorded. If these signals are played back with headphones, a listener will experience an auditory spatial image very similar to the image he would perceive if he were present at the original performance. If the binaural recording and playback are carried out with the same person, the auditory spatial image experienced during playback is very realistic. However, if a different person (or a dummy head) is used for the binaural recording, the sound experience of the listener is often limited (front/back confusion, limited externalization). Front/back confusion can be avoided by modifying the signals as a function of a listener's head movements [30].

The dependence on the individual listener is one reason why binaural recordings are hardly used commercially. Another reason is that binaural recordings do not sound very good when played over a stereo loudspeaker setup. Several approaches have been proposed for playing back binaural recordings over two loudspeakers. Crosstalk cancellation techniques [5, 10, 130] pre-process the loudspeaker signals such that the signals at the ear entrances approximate the binaural recording signals. Disadvantages of this approach are that it only works effectively when a listener's head is located exactly in the sweet spot, and that its performance at higher frequencies is limited. A technique in which a binaural recording is post-processed with a filter, with the goal of being comparable in quality to conventional stereo recordings for loudspeaker playback, was proposed in [252–254]. This can be viewed as post-processing binaural recordings so as to mimic the properties of a good stereo microphone configuration. The idea is to store the resulting signals as conventional stereo signals. These signals would play back in good quality on standard stereo loudspeaker setups. A device intended for binaural playback would incorporate a filter that undoes the post-processing applied prior to storage of the signal. The result would be a signal similar to the original binaural recording.
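The principle behind crosstalk cancellation can be sketched for a single frequency bin. Assume, for illustration only, that the paths from the two loudspeakers to the two ear entrances are described by a 2×2 matrix H of complex transfer values; driving the loudspeakers with H⁻¹ applied to the binaural signal then reproduces that signal at the ears. The matrix values below are made up, and the function names are illustrative; this is not the method of any of the cited references.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix of complex numbers given as nested tuples."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("transfer matrix is singular at this frequency")
    return ((d / det, -b / det), (-c / det, a / det))


def apply_2x2(m, v):
    """Matrix-vector product for a 2x2 matrix and a length-2 vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])


def crosstalk_cancel(h, binaural_bin):
    """Loudspeaker signals (one frequency bin) that yield binaural_bin at the ears."""
    return apply_2x2(invert_2x2(h), binaural_bin)


# Hypothetical transfer values: rows are ears, columns are loudspeakers.
H = ((1.0 + 0.0j, 0.5 - 0.2j),
     (0.5 - 0.2j, 1.0 + 0.0j))
target = (0.3 + 0.1j, -0.2 + 0.4j)

speakers = crosstalk_cancel(H, target)
print(apply_2x2(H, speakers))  # approximately equal to target
```

The inversion is only valid for the head position at which H was measured, and H becomes poorly conditioned where the direct and crosstalk paths are similar, which relates to the sweet-spot and high-frequency limitations mentioned above.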

Binaural recordings can also be created by artificially mixing a number of audio signals. Each source signal is filtered with head related transfer functions (HRTFs) or binaural room impulse responses (BRIRs) corresponding to the desired location of its corresponding auditory event. The resulting signal pairs are added resulting in one signal pair. More detailed information on the synthesis of virtual sound sources and corresponding binaural cues is given in Chapter 6.
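The mixing procedure described above can be sketched as follows. Each source is convolved with a left and a right impulse response and the results are summed per ear. The impulse responses in the example are toy placeholders rather than measured HRTFs or BRIRs, and the function names are illustrative; a practical implementation would use measured responses and a fast convolution.

```python
def convolve(x, h):
    """Direct-form linear convolution of two non-empty sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def binaural_mix(sources):
    """Mix sources into one binaural signal pair.

    sources is a list of (signal, hrir_left, hrir_right) tuples, where the
    impulse responses correspond to the desired position of each source.
    """
    length = max(len(s) + max(len(hl), len(hr)) - 1 for s, hl, hr in sources)
    left = [0.0] * length
    right = [0.0] * length
    for s, hl, hr in sources:
        for k, v in enumerate(convolve(s, hl)):
            left[k] += v
        for k, v in enumerate(convolve(s, hr)):
            right[k] += v
    return left, right


# Toy responses: the right ear receives the source later and attenuated.
source = [1.0, 0.5]
toy_left, toy_right = [1.0], [0.0, 0.5]
print(binaural_mix([(source, toy_left, toy_right)]))
```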

2.2.3 Multi-channel audio playback

Five-point-one (5.1) surround

Only in recent years have multi-channel loudspeaker playback systems become widely used in the consumer domain. Such systems are mostly installed as ‘home theater systems’ for playing back audio for movies. This is partially due to the popularity of the digital versatile disc (DVD), which usually stores five or six discrete audio channels designated for such home theater system audio playback. Figure 2.5 illustrates the standard loudspeaker setup for such a system [150], denoted 5.1 surround. For backwards compatibility with stereo in terms of loudspeaker positions, the two front loudspeakers are located at angles −30° and 30°. Additionally, there is a center loudspeaker at 0°, providing a more stable center of the auditory spatial image when listeners are not exactly in the sweet spot. The two rear loudspeakers, located at −110° and 110°, are intended to provide the important lateral signal components related to spatial impression. There is one additional channel (the .1 in 5.1) intended for low frequency effects (LFE). The LFE channel has a bandwidth of only up to 120 Hz and is used for effects for which the other loudspeakers cannot provide enough low-frequency sound pressure, e.g. explosion sounds in movies.


Figure 2.5 Standard 5.1 surround loudspeaker setup with a low-frequency effects (LFE) channel.

The angle between the two rear loudspeakers is so large (140°) that amplitude panning between those loudspeakers is problematic. Similarly, it is problematic to apply amplitude panning between the front and rear loudspeakers. Thus, the standard 5.1 system is not optimized for providing a general 360° auditory spatial image, but for providing a solid frontal auditory spatial image with lateral sound from the sides for spatial impression. In terms of 360° rendering, it is never problematic to place an auditory event at the location of a loudspeaker. In this sense, the 5.1 system also provides good possibilities for auditory events appearing at either rear left or rear right. With only five main loudspeakers, a system has to be a compromise, as reflected by the 5.1 system.

Capturing and generating sound for 5.1 systems

Different microphone configurations and mixing techniques have been proposed for generating sound for 5.1 systems, see e.g. [227]. Alternatively, techniques applied for recording or mixing two-channel stereo can be applied to a specific channel pair of the five main loudspeakers of a 5.1 setup. For example, to obtain an auditory event from a specific direction, the loudspeaker pair enclosing the desired direction is selected and the corresponding signals are recorded or generated in the same way as for the stereo case (resulting in auditory events between the two selected loudspeakers). Vector base amplitude panning (VBAP) [211, 212], when applied to two-dimensional loudspeaker setups such as 5.1, applies this principle in terms of amplitude panning (ICLD). Other techniques that feed signals to more than two loudspeakers simultaneously, e.g. three loudspeakers [93], have also been proposed.
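The pair-selection principle can be sketched for the five main loudspeakers of the 5.1 layout. The following is a simplified two-dimensional illustration in the spirit of VBAP, not a reproduction of the formulation in [211, 212]: the target direction is expressed as a non-negative combination of the unit vectors of two adjacent loudspeakers, and the resulting scale factors are normalized to unit power (an assumed convention). Function and variable names are illustrative.

```python
import math

# Main loudspeaker angles of the 5.1 layout described in the text (degrees).
SPEAKERS_DEG = [-110.0, -30.0, 0.0, 30.0, 110.0]


def unit(deg):
    """Unit vector pointing in direction deg."""
    r = math.radians(deg)
    return math.cos(r), math.sin(r)


def vbap_2d(target_deg, speakers_deg=SPEAKERS_DEG):
    """Scale factors for the adjacent loudspeaker pair enclosing target_deg.

    Returns a dict {loudspeaker angle: scale factor} with two entries.
    """
    px, py = unit(target_deg)
    ring = sorted(speakers_deg)
    for i in range(len(ring)):
        a, b = ring[i], ring[(i + 1) % len(ring)]  # last pair wraps around the rear
        ax, ay = unit(a)
        bx, by = unit(b)
        det = ax * by - ay * bx
        if abs(det) < 1e-12:
            continue
        # Solve g1 * (ax, ay) + g2 * (bx, by) = (px, py) by Cramer's rule.
        g1 = (px * by - py * bx) / det
        g2 = (ax * py - ay * px) / det
        if g1 >= -1e-9 and g2 >= -1e-9:
            g1, g2 = max(g1, 0.0), max(g2, 0.0)
            norm = math.hypot(g1, g2)
            return {a: g1 / norm, b: g2 / norm}
    raise ValueError("no loudspeaker pair encloses the target direction")


print(vbap_2d(15.0))   # between the center and the 30 degree loudspeaker
print(vbap_2d(180.0))  # between the two rear loudspeakers
```

The sketch returns scale factors for any direction, including directions between the rear loudspeakers or between front and rear. As noted earlier in this section, amplitude panning across such wide loudspeaker pairs is perceptually problematic, so these values should not be expected to yield a stable auditory event.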
