Chapter 8. Audio Displays

Although the visual display components of virtual and augmented reality systems play the dominant role in communicating information to the user, they are limited to providing stimuli corresponding to features and events occurring in the direction users are facing within the simulation. The addition of discrete audio cues or even a rich 3D acoustic soundscape can provide significant additional data about the 360-degree virtual environment surrounding the user. In this chapter we explore the various types of audio displays used in virtual and augmented reality systems, examining their functional differences and the types of application settings within which each is most beneficial.

Conventional Audio

It has been said that the ears point the eyes (Blattner et al., 1991; Broze, 2013). Just as with the real world, the use of audio in virtual and augmented reality applications can increase situational awareness (Fisher et al., 1987), improve navigation and way-finding (Ardito et al., 2007), enhance the perception of size, space, and depth (Larsson et al., 2001), and add to the overall sensation of “presence” within a virtual environment (Avanzini and Crosato, 2006).

Conversely, it is also important to understand that not all virtual and augmented reality applications benefit from the addition of sound. In fact, many targeted applications in science and engineering do not use sound at all because it is simply of no benefit. For example, applications such as those focused on the visual exploration and examination of complex geometric models and designs, or analysis of medical imaging or geophysical data, in and of themselves benefit very little from the addition of an audio component.

In the following sections we explore a variety of existing and emerging audio display solutions available for virtual and augmented reality systems, detailing their inherent strengths and weaknesses, and where possible, providing insights into what works best for different applications.

Monaural Sound

Monaural sound (also known as monophonic or simply mono) is a term used to describe a basic sound reproduction format within which all audio signals are combined into a single channel and fed to a single speaker. As illustrated in Figure 8.1, audio output is perceived as coming from one position. Monaural sound systems can also use multiple speakers (both clustered as well as separated, such as with aviation headphones), with each receiving identical output signals. An essential identifying characteristic is that monaural audio signals do not inherently contain the intensity or arrival time information (interaural time and intensity differences; see Chapter 7, “The Mechanics of Hearing”) that can be used by both ears together to simulate directional and depth cues, although Doppler and distance processing (for example, attenuation, filtering, and reverb) can still provide distance cues in mono.


Credit: Illustration by S. Aukstakalnis

Figure 8.1 Illustration depicting a basic monaural sound configuration within which audio signals are intended to be heard as if they were a single channel of sound perceived as coming from one position.

Although it is easy to dismiss monaural audio as a basic, limited sound format, there are a variety of application areas within virtual and augmented reality systems where single-channel output is an optimum solution, such as those instances in which a directional property is not required. The simplest of examples could include a warning buzzer or notification tone. In terms of computational load, monaural sound generation requires the least system resources.

Stereo Sound

Stereo (short for stereophonic, from the Greek words stereos, meaning “firm or solid” and phone meaning “sound” or “tone”) refers to an audio recording or reproduction format within which multiple audio signals are combined into two independent audio channels. As illustrated in Figure 8.2, compared to monophonic systems, stereophonic audio signals do contain intensity or arrival time information that can be used by both ears together to simulate, or reproduce, directional and depth cues. Stereo systems require two or more speakers or a pair of headphones to produce this effect.


Credit: Illustration by S. Aukstakalnis

Figure 8.2 This illustration depicts a proper basic stereo speaker configuration within which separate left and right channels are used to project sounds containing distinct intensity and arrival time relationships.

Stereophonic systems can be divided into two categories:

Artificial stereo (also known as “pan-potted” mono) refers to systems within which a single channel audio signal (mono) is reproduced across multiple speakers. Spatialization is simulated through signal processing techniques such as panning (varying the amplitude of the source signal sent to each speaker) to achieve the sensation of left and right source placement or movement, low-pass filters (attenuating a signal above a cutoff frequency, allowing lower frequencies to pass through the filter) for front-back movement with distance, and mixed reverb for ambiance effects.

True stereo (also known as “natural stereo”) refers to audio systems with two independent audio channels. Sound signals reproduced in true stereo have a distinct intensity and arrival time relationship between the two channels resulting in a compelling reproduction of the original sound field.
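The amplitude panning technique used in artificial stereo can be sketched in a few lines. This is a generic constant-power pan law for illustration only; the function name is invented for this sketch and is not tied to any particular product:

```python
import math

def constant_power_pan(sample, pan):
    """Split a mono sample across left/right channels.

    pan runs from -1.0 (hard left) to +1.0 (hard right); the
    cosine/sine gain pair keeps total power constant, so perceived
    loudness does not dip as a source sweeps across the image."""
    angle = (pan + 1.0) * math.pi / 4.0  # map pan to 0..pi/2
    return sample * math.cos(angle), sample * math.sin(angle)

# Centered source: both channels carry ~0.707 of the amplitude
left, right = constant_power_pan(1.0, 0.0)
```

Varying the `pan` value over time moves the perceived source smoothly between the two speakers, which is exactly the “pan-potted” effect described above.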

Although conventional, fixed speaker stereo systems can produce a compelling spatial sound field, the area within which the effect can be experienced is limited, regardless of whether the audio being delivered is a live recording, the result of extensive mixing in a studio, or generated in real time to supplement a visual simulation application.

As illustrated in Figure 8.3, the best illusion of sound placement and depth is achieved if the listener is positioned on the mid-line between the loudspeakers and back far enough so as to form an equilateral triangle. From this listener position (known as the “sweet spot”), interaural time and intensity differences are minimized and a soundscape can be experienced within the area in front of, but between, the two fixed speakers and spread across a common horizontal plane. If the listener moves out of the sweet spot, the illusion of sound placement and depth falls off dramatically. You will obviously continue to hear the sound, but the rich spaciousness of the sound image will be lost.


Credit: Image by iconspro, pisotskii ©

Figure 8.3 Stereo speakers should be arranged to form an equilateral triangle with your listening position. The phantom sound image will appear in front of and between the two speakers.

Under ideal listening conditions, it is possible to use frequency contouring and crosstalk elimination to move some sound images above and beyond the horizontal plane, although the effect would be perceptible by some and not others because of the differences in the manner in which our individual pinna color the sound before reaching our eardrums (Griesinger, 1990).

In the case of headphones, the spatial effect is quite apparent, although the totality of the sound image is perceived to emanate from the space between your ears.


The use of normal stereo sound within virtual and augmented reality applications requires careful consideration and experimentation given the limited area within which a sound field can be created. As a general rule, you can only deliver sounds from the front of the listener. Signal processing allows for some flexibility, but the time and effort put forth may warrant simply moving to an alternate solution. The systems in which stereo audio alone could be successfully used are those employing a fixed display where the user is facing in a single direction. Examples might include wide field of view (FOV), theater-like projection systems; hemispherical displays such as those discussed in Chapter 6, “Fully Immersive Displays”; or use within an automobile to complement an on-windscreen display or even in a standalone mode to enhance onboard navigation. In this situation, although the automobile is moving, the operator’s position in the driver’s seat, and thus, in relation to the speakers, remains relatively constant.

One such application, under development by French automobile manufacturer PSA Peugeot Citroën, optimizes navigation-system voice directions. The virtual placement of sound can create the impression that an instruction to “turn right” is coming from the windscreen area on the right side of the driver (PSA Peugeot Citroën, 2015).

Surround Sound

The phrase surround sound refers to an enhanced audio system within which a semi-immersive soundscape is created using a combination of digital signal processing techniques as well as four or more independent audio channels routed to an array of speakers strategically situated around a central listening position. Unlike stereo, which provides the sensation of a soundscape positioned within an arc extending from a central listener’s position forward to two speakers (or inside of one’s head in the case of headphones), surround sound expands the perceived soundscape to a 360° radius around the listener. If the speakers are correctly positioned, all elements in the soundscape will be perceived to originate along a common horizontal plane, such as is illustrated in Figure 8.4.


Credit: Image by iconspro ©, bogalo/Depositphotos

Figure 8.4 Surround sound provides the listener the sensation of sounds originating across a 360° radius in the horizontal plane.

Surround Sound Formats

There are numerous surround sound formats and speaker configurations available, each of which is described using a unique nomenclature. The most basic surround sound format is known as 5.1 (pronounced 5-point-1 or 5-dot-1). As shown in Figure 8.5, this format utilizes six separate channels: five normal full-range audio channels (Left Front, Center Front, Right Front, Left Surround, Right Surround) and one low-frequency effects channel (a subwoofer) for extended bass. With such systems there is considerable flexibility in the placement of the subwoofer due to the lack of directionality of these frequencies. (As we learned in Chapter 7, “The Mechanics of Hearing,” low-frequency sounds have long waveforms that bend around objects.) Additionally, most professionally mixed audio intended for surround sound systems, such as in the case of movies or soundtracks, typically takes advantage of the fact that the audience/listener will likely be facing in one direction and thus uses the front center channel for delivery of character dialogue.


Credit: Illustration by S. Aukstakalnis

Figure 8.5 Illustration showing basic 5.1 and 7.1 surround sound speaker configurations (max angles).

Building on this basic setup, surround sound formats and speaker configurations grow in complexity. For instance, a 6.1-channel system adds an additional channel targeting a single speaker (or array) behind the listener. A 7.1 system adds two additional rear channels for a total of eight full channels. Recent advances in cinematic audio have also resulted in the development of a 7.1.2-channel system, which adds two channels for speakers above the listener.
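The nomenclature above follows a simple additive pattern: main channels, then the low-frequency effects channel, then (optionally) height channels. A sketch of this convention, with an illustrative helper name:

```python
def channel_count(fmt: str) -> int:
    """Total number of speaker channels implied by surround sound
    nomenclature such as '5.1', '7.1', or '7.1.2', where the parts
    are main channels, low-frequency effects, and height channels."""
    return sum(int(part) for part in fmt.split("."))

# '5.1' -> 6 channels, '7.1' -> 8, '7.1.2' -> 10
```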


Surround sound is often employed within immersive visualization systems based on large-scale fixed displays such as computer-assisted virtual environments (CAVEs), hemispherical/domes displays, and other wide field of view (FOV) curved-screen systems (each of which is covered in Chapter 6, “Fully Immersive Displays”), as well as various full motion flight simulators. In general, those applications within which an individual or audience remain in a relatively stable position facing in one direction, although sound effects are intended to surround the participants, tend to benefit the most from surround sound solutions.

Object-Based Surround Sound

To this point, each of the fixed-speaker surround sound solutions mentioned in this section are constrained by the need to mix audio to address specific channels (or speakers) and produce the illusion of presence within a sound field. While effective, this method is limiting, particularly if you want sounds to appear well above or below the horizontal plane. An innovative new approach known as object-based surround sound now emerging from several manufacturers (Dolby Atmos, DTS:X, Auro-3D) enables audio engineers and designers to treat sound events as individual entities and precisely define their location and movement in 3D space (Dolby, 2015). Although specific methods and approaches vary, each manufacturer’s system enables sound designers to select a specific position or vector for an individual sound element, and the mixing software determines the correct speakers to target to produce the desired effect. This new approach enables a much larger number of fixed speakers to be employed to produce stunning 3D aural experiences for cinemas and home theaters. But it does not end here.

In an object-based system, the different sound elements are encoded with metadata such as position and vector information during mixing. Each manufacturer has developed proprietary methods to use this metadata in combination with HRTF filters mentioned in Chapter 7 (and discussed in more detail in this chapter) to deliver the same immersive surround sound effects over standard stereo headphones, opening the door for a variety of direct uses in virtual and augmented reality applications.
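A toy illustration of the object-based idea, in which a sound carries position metadata and per-speaker gains are derived at render time, might look as follows. This is a simple distance-based panner for illustration only; commercial systems such as Dolby Atmos use proprietary renderers, and every name here is invented for the sketch:

```python
import math

def speaker_gains(obj_pos, speakers):
    """Derive per-speaker gains for a sound object from its 3D
    position metadata: closer speakers receive more signal, and the
    gains are normalized to constant total power."""
    raw = [1.0 / (math.dist(obj_pos, s) + 1e-6) for s in speakers]
    norm = math.sqrt(sum(g * g for g in raw))
    return [g / norm for g in raw]

# Four speakers at the corners of a square listening area
speakers = [(-1, 1, 0), (1, 1, 0), (-1, -1, 0), (1, -1, 0)]
gains = speaker_gains((0.9, 0.9, 0.0), speakers)  # source near front-right
```

The key point the sketch captures is that the speaker layout is not baked into the mix: the same object metadata can drive a 5.1 array, a dense cinema array, or (via HRTF filters) a pair of headphones.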

One such demonstration application, detailed in Chapter 13, “Gaming and Entertainment,” involves combining object-based surround sound and 360-degree stereo video (also referred to as spherical video) of a live concert by the musician Paul McCartney, with the end product experienced using a stereoscopic head-mounted display and headphones. Using accelerometers within the display to monitor head orientation and appropriately adjust the visual and audio scene, the user is able to turn to the left or right, look up and down, as if physically standing on the stage.

Binaural Audio

Binaural audio, also referred to as 3D sound, spatial(ized) audio, or space-related stereophony, refers to methods of recording or synthesizing sound to simulate the way our two ears naturally perceive our acoustic surroundings. Considerably different from any of the previously discussed methods of sound reproduction, binaural audio incorporates the localization cues previously discussed in Chapter 7, including the effects resulting from our natural ear spacing (interaural time and intensity differences), as well as the shape of our upper torso, head, and pinna geometry in coloring the frequency spectrum of sound waves before entering our ear canals. The net result is that sound sources can appear anywhere in a 360-degree sphere surrounding the user (Fisher, 1991; Ferrington, 1993; Schauer, 2001; Sontacchi et al., 2002).

Noted sound engineer William B. Snow (San Francisco, May 16, 1903–Oct. 5, 1968), formerly of Bell Labs and assistant director of the U.S. Navy’s Underwater Sound Laboratory, offered one of the more eloquent explanations on the difference between stereo and binaural sound: “It has been said aptly the binaural system transports the listener to the original scene, whereas the stereophonic system transports the sound source to the listener’s room” (Snow, 1953).

The use of binaural sound within virtual and augmented reality systems and applications is enabled through two primary methods: recording and real-time synthesis.

Binaural Recording

Binaural recording is a method of capturing a soundscape—be it in a studio, a concert hall, or in a natural outdoor setting—in a manner similar to the way in which a person with healthy hearing actually perceives his or her real-world surroundings. In practical terms, this involves positioning two omnidirectional microphones facing away from each other and separated by about 7 inches (17.78 cm), thus approximating the distance between an adult’s ears. As discussed in Chapter 7, this separation allows the microphones to capture the interaural time and intensity differences that are used by the audio centers of our brain to localize the direction from which a sound originates along the horizontal plane (the azimuth).
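The interaural time difference captured by this microphone spacing can be approximated with a simple free-field model. This is a sketch that ignores head shadowing; the constants follow the figures given in the text:

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in air at room temperature
EAR_SPACING = 0.1778     # meters, the ~7-inch spacing noted above

def interaural_time_difference(azimuth_deg):
    """Approximate arrival-time difference (in seconds) between two
    spaced omnidirectional microphones for a distant source at the
    given azimuth, with 0 degrees being straight ahead."""
    return (EAR_SPACING / SPEED_OF_SOUND) * math.sin(math.radians(azimuth_deg))

# A source directly to one side (90 degrees) arrives roughly half a
# millisecond earlier at the near microphone than at the far one.
```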

Modern binaural recording systems typically consist of two microphone capsules embedded within a specialized mannequin head that includes external, contralateral (inversely symmetrical) ear forms (pinnae) to also capture the natural reflections and coloration of high-frequency (short wavelength) sound waves that our brain uses to determine the elevation of a sound source. These systems record ambient sounds on two separate and discrete audio channels to preserve these critical interaural differences and pinna spectral cues. This concept is illustrated in Figure 8.6.


Credit: Illustration by S. Aukstakalnis

Figure 8.6 Illustration depicting a basic binaural recording setup where left and right channels are kept distinctly separate to preserve binaural sound cues.

In general, mannequin head microphone systems provide the most aurally accurate binaural recordings. Upon playback, the listener is able to experience a virtual acoustic image as if one were physically present and positioned at the exact point where the original recording was made (Geil, 1979; Genuit and Bray, 1989). Just as binaural recordings capture the subtle, unique positional cues for each ear separately, the playback of binaural audio recordings over stereo headphones is the most spatially accurate method known for faithfully reproducing the original sound field (Bartlett, 2014; Lombardi, 1997).

Hardware Solution Overviews

There are a number of commercial mannequin head recording systems available. Depending on the manufacturer, some models include facial features (which typically have little impact on the overall sound spectrum arriving at the embedded microphones), whereas others dispense with the facial attributes but include neck and torso assemblies to enable the capture of the subtle sound reflections from these features. Still others forgo head and torso assemblies altogether. In general, the single common feature found in most modern binaural recording solutions is contralateral pinna pairs to capture pinna spectral cues as well as to ensure uniform left-right recordings.

The following three hardware solution overviews are presented beginning with a high-end, precision-grade system followed by a consumer-grade unit and then a DIY option. Employed correctly, each provides a compelling binaural audio experience, though accuracy in the capture of the full sound spectrum, and thus, the depth and richness of the end product is considerably different between the high and the low end. Researchers and developers considering the use of binaural recordings in their virtual and augmented reality applications should give careful consideration to the desired quality and end user needs when deciding which hardware to employ. This chapter provides specific details for three binaural recording solutions, but there are many other commercially available systems on the market. A list of these suppliers is provided within Appendix B, “Resources.”

The current state of the art in full head binaural recording systems is exemplified by HEAD Acoustics GmbH of Herzogenrath, Germany.

HEAD Acoustics HMS IV Aachen HEAD

The Head Acoustics HMS IV.1 (HEAD Measurement System IV.1) shown in Figure 8.7 is a standalone, high-precision binaural recording and measurement system that includes a mathematically definable upper torso and head. Of particular note is their use of what are referred to as simplified pinna. As opposed to other recording heads that incorporate life-like pinna forms with all the curves, ridges, and undulations, Head Acoustics takes a different approach. Based on years of research by company founder Dr. Klaus Genuit identifying the specific structures of the average human auricle demonstrated to be acoustically essential, pinna were designed to incorporate only the features and dimensions “important for the directional pattern of the external ear transfer function” (Genuit, 2005; CCITT, 1992). Among the dominant features of the simplified pinna are the cavum-conchae and the asymmetric ear canal entrance, which itself is important for localization cues (Vorländer, 2007).


Credit: Image courtesy of Prof. Dr.-Ing. Klaus Genuit, HEAD acoustics GmbH

Figure 8.7 HEAD Acoustics HMS IV.1 Artificial Head System and side detail.

The HMS IV.1 microphone capsules are positioned only slightly recessed into the side of the recording head (5 mm). In this position, sound waves do not undergo additional coloring and changes that would normally be present if the capsule were positioned deeper into the head in a simulated ear canal. Another interesting result of not embedding the microphone capsules within artificial ear canals is that captured audio is reproduced extremely well over fixed speakers (Wolfrum, 2015), although sweet spot restrictions and limitations of sound placement in 3D space still apply.

The HMS IV.1 is capable of operating in standalone mode, without tethering to external power sources, computers, or recording equipment. Used in this manner, audio is recorded to a CompactFlash card, the reader for which is built into the torso portion of the unit.

The HMS IV.1 system enables any of five different equalization types to be selected: linear (LIN—no equalization), independent-of-direction equalization (ID), free field equalization (FF), diffuse field equalization (DF), and user-defined equalization settings for customized needs.

The HMS IV.1 is renowned for a high dynamic range comparable to that of the human ear, as well as a near-zero frequency response at points along the median plane (Wolfrum, 2015).

It is important to note that the HMS IV.1 is principally intended to serve as a high-end measurement device for such applications as the examination and optimization of the sound quality of technical products, motor vehicle interiors, aircraft cabins, and telephony devices where even the most subtle of acoustic nuances must be measured or captured for study. Thus, this high level of performance can be of significant benefit to virtual and augmented reality applications developers targeting the same level of quality and fidelity.

3Dio Free Space Pro

Another manufacturer of binaural microphones of note is 3Dio, Inc. of Vancouver, Washington. As opposed to microphone capsules being built into a dummy head, the Free Space line of microphones shown in Figure 8.8 simply does away with the head form completely but retains lifelike contralateral pinnae mounted on disks separated by about 7 inches, approximating the distance between an adult’s ears. According to the manufacturer, the discs upon which the ear forms are mounted are specifically designed to provide all the necessary head-shadowing for a perfectly realistic HRTF and binaural experience (Anderson, 2015a).


Credit: Image courtesy of 3Dio, LLC

Figure 8.8 The Free Space Pro II Binaural Microphone from 3Dio, Inc.

Two highly useful features of this microphone are its compact size, allowing for ease of recording in nearly any environment, and a hot shoe adapter for mounting on cameras, grip handles, or tripods. The system utilizes XLR outputs that support phantom power, as well as a 1/8-inch stereo output jack enabling audio capture with portable handheld recorders.

The silicone rubber pinna used for these microphones began as a 3D scan of a single ear created using a medical CT scanner (X-ray computed tomography), the resulting model from which was then adjusted for aesthetic appearance and optimization of lower frequency wavelengths1 (Anderson, 2015b). The company also sells the ear forms independent from the electronics for use in DIY binaural recording projects.

1 Tests were performed in August of 2015 by the author and two professional audio engineers comparing the recording performance of the Free Space Pro II and HEAD Acoustics HMS IV.1 Artificial Head System. Although the Free Space Pro II proved to be an outstanding microphone in terms of overall performance and low noise floor, the lack of a head form clearly resulted in a weak response to lower frequencies.

In-Ear Binaural Recording Mics

If you want to completely bypass using artificial head or pinna forms for binaural recording, one of the easiest, most accurate, and least expensive routes to consider is to simply use your own head and ears. There are dozens of commercially available, consumer-level in-ear binaural microphone systems available, each with varying levels of performance and sensitivity (feedback, balance, low noise floor, and so on). One high-quality and reasonably priced option is the Roland CS-10EM shown in Figure 8.9.


Credit: Image courtesy of Roland Corporation U.S.

Figure 8.9 Roland CS-10EM in-ear binaural recording microphones and R-05 two-channel field recorder.

This product is unique in that the binaural microphones are built into the same ear buds used for listening, enabling the user to monitor while simultaneously recording. Remember that you will need, at a minimum, a two-channel handheld field recorder such as the Roland R-05 to accommodate the twin microphone inputs. As mentioned earlier, recording on dedicated left and right channels is essential to preserve the embedded localization cues.

Binaural Recording File Formats

From an engineering standpoint, a binaural recording is generally the same as a stereo audio file. It contains two tracks, and the final edit can be converted into any format. However, encoding into commonly used compressed file formats such as MP3 can significantly degrade the quality of the audio and the spatial feel of the end product. This is due, in large part, to the fact that MP3 (MPEG-1 Audio Layer-3) compression employs psychoacoustics-based algorithms that permit the codec to discard or reduce the precision of audio frequencies that are less audible to human hearing, thus contributing to overall file size reduction (Arms and Fleischhauer, 2005). For general audio applications such as the music playlist on your smartphone, the MP3 format is ideal. For binaural recordings, this compression technique can have a dramatic impact on the directional cues for sound events specifically sought when making binaural recordings.

Researchers and developers producing binaural recordings for virtual and augmented reality applications should always record and manipulate in an uncompressed file format such as WAV (Waveform Audio File Format) or AIFF (Audio Interchange File Format) to capture and preserve the entire frequency spectrum. The finished product can then be mixed down and compressed in postproduction to the desired format, although at the highest bit rate possible to maintain acoustic detail (Anderson, 2013).
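As a concrete sketch of the recommended workflow, Python’s standard wave module can write the kind of uncompressed, two-channel file described above. The file name and test tones here are placeholders:

```python
import math
import struct
import wave

RATE = 44100  # CD-quality sample rate

# Write one second of 16-bit two-channel audio, keeping the left
# and right channels discrete as binaural work requires.
with wave.open("binaural_take.wav", "wb") as wav:
    wav.setnchannels(2)    # separate left and right channels
    wav.setsampwidth(2)    # 16-bit samples
    wav.setframerate(RATE)
    frames = bytearray()
    for n in range(RATE):
        left = int(16000 * math.sin(2 * math.pi * 440 * n / RATE))
        right = int(16000 * math.sin(2 * math.pi * 445 * n / RATE))
        frames += struct.pack("<hh", left, right)
    wav.writeframes(bytes(frames))
```

Because WAV stores raw pulse-code modulation samples, nothing in the interaural timing or spectral detail is discarded, which is precisely why the format is preferred over MP3 for this work.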

Real-Time Synthesis of Binaural Sound

The second (and currently more common) method for applying binaural sound to virtual and augmented reality applications takes a completely different approach. As opposed to presenting prerecorded binaural audio to the listener (via distinct left and right channels), existing mono audio samples are instead passed through digital signal processors containing pairs of filters based on measured head-related transfer functions (HRTFs) briefly mentioned in Chapter 7. The net result is that the frequency profile of these audio samples is slightly colored to include the effects that a human torso, head, pinna, and even the acoustic characteristics of the space would have on the sounds. This modification of the audio signal, quite literally, encodes the specific cues used by the brain to help you identify, with comparable precision to real life, the azimuth and elevation from which a virtual sound originates.

Another (albeit loose) way to think of this is that standard sounds passing through these digital filters are colored by interaction with a mathematical representation of a human torso, head, and external ears. Thus, instead of recording and playing back sound that has already been colored by interaction with your physical body (or that of a mannequin head), simple monaural sounds are passed through digital filters that encode the same effects.
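The filtering step just described amounts to convolving a mono signal with a left and a right impulse response. A minimal sketch, with tiny placeholder HRIRs standing in for measured data:

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of a signal with a (short) filter."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binauralize(mono, hrir_left, hrir_right):
    """Produce a two-channel signal from a mono source by applying
    a left/right HRIR pair (placeholders here, not measured data)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Crude placeholder filters: the right-ear response is delayed and
# attenuated, loosely suggesting a source to the listener's left.
left_ch, right_ch = binauralize([1.0, 0.5, 0.0],
                                [1.0, 0.3],
                                [0.0, 0.6, 0.2])
```

Production systems perform this filtering in the frequency domain for speed and swap in a different measured HRIR pair for each source direction, but the underlying operation is the one shown.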

In addition to the methods just referenced, there are still other techniques available besides HRTFs and binaural sound used in real-time synthesis of audio. For instance, room and environmental convolution models can actually be obtained offline or can be derived from acoustical physical modeling approaches (Chandak et al., 2012; Taylor et al., 2012).

How HRTFs Filters Are Created

As discussed in Chapter 7, humans perceive the location of a sound source through the use of binaural cues (interaural time and intensity differences) for determining azimuth and monaural cues (changes in the frequency profile of sounds by the pinna) for determining elevation. These specific effects on a sound are known as a head-related transfer function (HRTF) and can be quantified by empirically measuring sounds arriving at the entrance to the ears from hundreds of locations around an actual person or binaural recording head. These individual sound measurements are referred to as a head-related impulse response (HRIR).

In precise mathematical terms, an HRTF is simply the Fourier transform of an HRIR; applying an HRTF filter to a sound is equivalent to convolving that sound with the corresponding HRIR.

Measuring Impulse Responses

An HRIR is not only different for each of our ears, but unique for every direction from which a sound might originate. For instance, sounds arriving from a point above a listener along the vertical plane are modified differently (by each ear) than those arriving from a position to the front of the listener.

Thus, it is necessary to measure the HRIR using controlled tones presented from hundreds of precisely defined positions around an individual’s head (or one of the dummy heads mentioned earlier). The more positions measured, the more precisely virtual sounds can be placed, although a multitude of techniques exist to interpolate between data points (Freeland et al., 2002; Ajdler et al., 2005; Keyrouz and Diepold, 2008; de Sousa and Queiroz, 2009).

Numerous methods have been developed for precision measurement of HRIR data. Typically each involves a structure arrayed with equally spaced loudspeakers surrounding a central listening position. One of the most accurate systems currently in operation is found at the Auditory Localization Facility (ALF), located at the Air Force Research Laboratory, Wright Patterson AFB, Ohio. As shown in Figure 8.10, the ALF is a large anechoic chamber that houses a 14-foot-diameter geodesic sphere arrayed with loudspeakers positioned at each of its 277 vertices (Romigh and Simpson, 2014).


Credit: Image courtesy of the U.S. Air Force

Figure 8.10 Measuring head-related impulse responses at the Auditory Localization Facility at the Air Force Research Laboratory, Wright Patterson AFB, Ohio.

Now imagine swapping your physical pinna with those of a friend. The chances are good your ability to accurately localize sounds in the environment around you would be seriously diminished. This should not be surprising given the differences in the geometry of the ear. It is well known that HRIRs, and thus, HRTFs, vary widely from person to person (Kistler and Wightman, 1992). As such, serious localization errors can occur if you listen to sounds filtered through nonindividualized HRTFs. These errors are most pronounced in front-back and up-down judgments (Wenzel et al., 1993; Middlebrooks, 1999; Brungart and Romigh, 2009).

Public Domain HRTF Databases

Although it is currently impractical to produce custom HRTFs for each individual, researchers have developed averaging methods to arrive at generic reference filters that can process sounds with spatial cues perceptible by large segments of the target population. These reference filters are often composed of HRTFs drawn from several high-spatial-resolution collections compiled and placed into the public domain by such institutions as Massachusetts Institute of Technology and University of California at Davis. Details for accessing each of these collections can be found in Appendix B, “Resources.”

Position Stabilized Binaural Sound

To this point we have explored the various enabling technology pathways through which binaural sound can be applied to virtual and augmented reality applications, although for the sake of simplicity, these explanations have assumed a passive listener. In reality, most applications will have users swiveling their heads to look around as well as changing their physical position.

The Need for Head Tracking

As we learned earlier in this chapter, listening to stereo audio over headphones provides the aural sensation of a soundscape being positioned between the two speakers within your head. Conversely, binaural audio delivered over headphones provides the aural sensation of your being positioned, or immersed, within the soundscape. In other words, the sound field is located outside of your head, just like the real world. For standard audio playback and listening, this is great. But when binaural audio is used in a virtual or augmented reality application, a problem emerges. That problem is head movement.

As you move your head or body, the 3D soundscape does not change; it simply moves with you. The aural sensation of a car horn to your right will always remain to your right, even if you physically turn to face the apparent source of the sound.

Thus, just as with the visual component of a virtual or augmented reality application, and as illustrated in Figure 8.11, it is necessary to track, at a minimum, the orientation (roll, pitch, and yaw) of the user’s head and to feed that information to the system delivering the audio. When implemented, the perceived effect is that the 3D soundscape remains fixed as you move within it. This is particularly important because head movements contribute significantly to overall sound localization. Have you ever turned your head to hear something better?


Credit: Image by andrewrybalko ©

Figure 8.11 Binaural audio delivered over headphones requires, at a minimum, tracking the orientation (roll, pitch, and yaw) of the listener’s head to present the sensation of a stabilized sound field within which the listener moves.

A more in-depth exploration of the various technologies used to track the position and orientation of objects (in this case the human head) is found in Chapter 11, “Sensors for Tracking Position, Orientation, and Motion.” Typically the information from one tracking unit is shared by both the visual and the audio components of the virtual or augmented reality system.
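The orientation compensation at the heart of this scheme can be sketched in a few lines. Considering only yaw for simplicity, the audio renderer subtracts the tracked head yaw from each source's world-coordinate azimuth, then selects HRTF filters for the resulting head-relative direction. The function name and angle conventions below are illustrative assumptions, not from any particular tracking or audio API.

```python
def head_relative_azimuth(source_az_deg, head_yaw_deg):
    """Convert a source azimuth in world coordinates to a head-relative
    azimuth, wrapped to [-180, 180) degrees. HRTF filters are then
    chosen for this head-relative direction, so the soundscape stays
    fixed in the world as the listener turns."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

# A horn 90 degrees to the listener's right...
print(head_relative_azimuth(90.0, 0.0))   # 90.0
# ...is dead ahead once the listener turns 90 degrees to face it.
print(head_relative_azimuth(90.0, 90.0))  # 0.0
```

A full implementation would apply the complete inverse head rotation (roll, pitch, and yaw) to each source position, but the principle is the same: render directions in the head's frame of reference, updated every time the tracker reports new data.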

What About Binaural Sound over Speakers?

While binaural audio recordings can be played back over conventional stereo speakers, a noticeable loss of spatial quality and directionality of features in the sound is experienced due to crosstalk. In other words, each ear receives the sum of two signals: the signal intended for a specific ear as well as that intended for the other—that is, the crosstalk signal (Khosrow-Pour, 2014). Although crosstalk cancellation solutions have existed for many years for general audio applications, many challenges still exist, such as the need for a listener’s physical position to remain relatively stable (that “sweet spot” limitation again).

In the case of real-time binaural synthesis, such as that found in gaming applications, the challenges of delivering binaural audio over fixed speakers are significantly greater. Not only is there a need to track the user’s head position and orientation to adaptively control the crosstalk canceller, but there is the need to process audio signals through HRTF filters to impart the directional properties to the sounds in the first place. Although significant research and development work is underway to solve these issues, the delivery of a compelling binaural audio scene over fixed conventional speaker systems remains impractical for most virtual and augmented reality applications.
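The core idea of a crosstalk canceller can be sketched as follows: if, at each frequency, the paths from the two speakers to the two ears are described by a 2×2 transfer matrix, then driving the speakers through the inverse of that matrix delivers to each ear only its intended binaural signal. This is a simplified frequency-domain sketch under idealized assumptions; practical systems must regularize the inversion to avoid excessive gain at ill-conditioned frequencies and, as noted above, adapt the matrix as the listener's head moves.

```python
import numpy as np

def crosstalk_canceller(H):
    """Given H with shape (bins, 2, 2), where H[k, i, j] is the transfer
    function from speaker j to ear i at frequency bin k, return the
    per-bin inverse filters that cancel the crosstalk paths."""
    return np.linalg.inv(H)

def drive_signals(binaural, H):
    """Compute speaker driving spectra (shape (bins, 2)) so that each
    ear receives its intended binaural spectrum."""
    C = crosstalk_canceller(H)
    # Per-bin matrix-vector product: speakers[k] = C[k] @ binaural[k]
    return np.einsum('kij,kj->ki', C, binaural)
```

Because the ears then receive H @ (inv(H) @ binaural) = binaural, the crosstalk terms cancel exactly in this idealized model; the difficulty in practice lies in knowing H, which changes whenever the listener leaves the "sweet spot."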


Applied correctly, the addition of a rich 3D acoustic soundscape to virtual and augmented reality applications can have a powerful impact on the sense of immersion, usability, and overall experience. Within this chapter we have explored a variety of the most common, as well as still emerging, audio display solutions and methodologies used to achieve this goal. Appendixes A and B provide additional resources for an even deeper understanding of this subject matter.
