In Chapters 32 to 34 we looked at a number of approaches to encoding speech signals with the goal of minimizing the size of the representation (i.e., the bitrate required for real-time transmission, or the number of bytes required for offline storage) while preserving the quality of the speech. These schemes started from the source-filter model of speech, then were able to exploit the constraints of that scheme to encode intelligible speech in drastically fewer bits than required for the original waveform. However, while fully intelligible, the reconstructed signals were usually easily distinguished from the originals.

In this chapter, we consider the situation where (a) we cannot assume that the source material is speech, or indeed any particular limited class of sound, and (b) our goal is a reconstructed signal that, for a normal listener, is indistinguishable from the original – that is, the coding scheme is ‘transparent’. The most prominent application of these techniques is for encoding music and other entertainment content such as the audio tracks of movies, and the best-known example of this family of coders is the MPEG-1 Audio Layer III standard, better known as MP3. These schemes have been quite successful: taking an uncompressed CD audio stream as a starting point (16 bits per sample × 2 samples per frame × 44,100 frames per second = 1.41 Mbps), MP3 can generally achieve transparency at around 10% of the source bitrate, i.e., 128–160 kbps, or under 2 bits per sample (in a stereo frame). More advanced coders such as the AAC used in MPEG-4 push this closer to 1 bit per sample.

Note that our goal of transparency does not imply that we aim to decode to exactly the same waveform; such schemes, termed “lossless”, can remove redundancy (inefficient encoding) in a raw audio signal represented with 16 bits per sample, but rarely achieve compression ratios much better than about 50%, or around 700 kbps for a CD-quality stereo signal. In order to achieve higher compression ratios while preserving transparency, we look for schemes where the decoded signal may differ from the original source, but where such differences are insignificant to the listener, that is, we have removed irrelevant information. Any powerful compression scheme will also eliminate redundant information by adopting an efficient encoding, but the majority of the gains come from irrelevance removal.

To achieve these dramatic reductions, it is necessary to exploit the limits of auditory perception. As discussed in Chapter 15, the complex machinery that allows us to be so exquisitely sensitive to our acoustic environment does exhibit a number of limitations, such as the phenomenon of masking, in which more intense sounds mask the perception of weaker sounds. The central ‘trick’ of the coding approaches described in this chapter is to take the quantization noise – the inevitable by-product of representing the signal in a small number of bits – and hide it below the masking threshold as efficiently as possible. This requires a great deal of dynamic control over where the quantization noise occurs, as well as good predictive models of masking phenomena. This approach is known as perceptual audio coding, since it is perceptual transparency that is being pursued. Measured by simpler, objective measures such as signal-to-noise ratio (SNR), the reconstruction is not particularly good. Indeed, one could say that the goal of perceptual audio coding is to create the most severely distorted version of the original signal that is still perceived as unmodified by a listener.

If SNR is no longer a relevant metric, we might ask how such systems can be evaluated. Although the development of automatic, objective measures is an area of research [11], the ultimate test lies with human listeners. Various formal listening test protocols have been developed to provide some quantification and repeatability in subjective evaluation: a “Mean Opinion Score” test involves presenting a panel of expert listeners with a range of audio material encoded with the scheme being tested and asking them to rate the overall impact of the processing on a scale from 1 (“very annoying”) to 5 (“imperceptible”); these ratings are then averaged to obtain a score for the scheme. Such tests are quite reliable, but expensive to conduct. An interesting phenomenon in advanced coders is the existence of “learning effects” where a given coder may introduce idiosyncratic distortion which becomes increasingly annoying as the listener learns to spot it. Listening panels may incorporate training sessions in which judges are specifically guided to listen for distortions specific to a coder.

Given the primary goal of minimizing bits used without affecting perceived quality, there are a number of secondary goals that may influence the design of these coders. These reflect the different applications in which the coders may be deployed. Most obvious is the overall computational cost of encoding and decoding, which in general is traded for compression efficiency. Another consideration is that live broadcasts or bitstreams to be read from media such as DVDs may need to maintain a constant average bitrate over some short timescale, but storing files on disk may take advantage of variations in the “compressibility” of the source material over time, e.g., by using very few bits to represent quiet or silent sections.

Given the enormous commercial impact of these schemes, questions of standardization and intellectual property ownership in the algorithms become significant: standards bodies such as MPEG conduct lengthy negotiations to decide which technologies are included in a standard, and hence which members will be able to benefit from the licensing income stream; by the same token, open-source enthusiasts may be motivated to develop entirely separate schemes purely to avoid licensing obligations. (Digital media technologies have also been shaped by questions of control of the content, e.g., to prevent unauthorized file sharing, but these questions are beyond the scope of this chapter.) Finally, certain schemes may facilitate applications beyond simple listening, such as adaptive transcoding where a single encoded representation can be “scaled” to provide the best possible transmission to a device such as a mobile phone to which the available bandwidth can vary rapidly in real time.

The remainder of this chapter looks in more detail at the main components of a “typical” perceptual coder. First, we look at the models of human psychoacoustic masking that predict where quantization noise can safely be permitted without affecting perceived quality. Then we will look at how we can in practice control the occurrence of noise in time and frequency, i.e., “noise shaping”. Finally, we look at a number of ancillary issues in the design of coders, and compare the details of several widely-used compression standards including MP3 and AAC. An excellent and more detailed account of perceptual audio coding is provided by Painter & Spanias [6]. A more detailed description specific to MPEG-1 Audio Layer III (MP3) is given by Pan [7].


FIGURE 35.1 Tone-on-tone simultaneous masking. A tonal (sinusoidal) component will “mask” the perception of weaker tones nearby in frequency, effectively elevating the threshold of audibility in that region of the spectrum (after [14]).


35.2.1. Psychoacoustic phenomena

Given the perceptual coder's goal of introducing distortion only in parts of the signal where it will be unnoticed by a human listener, we must start with a more detailed look at what can and cannot be perceived by the ear. The essential idea of psychoacoustic masking is illustrated schematically in Figure 35.1, which shows the threshold of detection by a human listener of a sinusoidal tone at different frequencies in the presence of a fixed “masker” sinusoid. The vertical scale shows the energy of tones at different frequencies, and the lower curve shows the “absolute threshold”, i.e., for each frequency, the minimum energy of a tone that can be detected in quiet; tones below this curve are simply not perceived by a listener. If, however, instead of quiet, the experiment is conducted in the presence of a clearly-audible masker tone, test tones with frequencies close to the masker need to be considerably more intense before they are detected – the masker tone has modified the limits of detectability, resulting in the upper, “masked threshold” curve. The range of frequencies around the masker tone that exhibit elevated thresholds is known as the “critical band”, introduced in Chapter 15; many hearing phenomena vary on this scale, and it appears to reflect a deep aspect of the ear's mechanics and neurophysiology.

Notice that the elevated threshold is always somewhat below the level of the masker (in practice, 5–20 dB, depending on frequency and the temporal properties of the masker and masked signals), and that the masking is asymmetric in frequency, extending significantly further for frequencies above the masker than below (the so-called ‘upward spread of masking’, related to the structure of the cochlea in which energy at a particular frequency must pass through the parts of the cochlea responsible for detecting higher frequencies before it dissipates at its ‘best place’). Masking of this kind was first noted in the late nineteenth century, systematically investigated by Wegel & Lane [13], and has been widely studied since. A comprehensive summary is provided by Zwicker & Fastl [14].


FIGURE 35.2 Sequential masking. The masking effect of a tone takes several hundred milliseconds to fully decay after the masker ceases (forward masking). Even if the low-intensity tone starts slightly earlier than the masker, its perception may still be suppressed by the masking tone (backward masking).

Masking occurs not only when the masking tone is present, but also for a certain amount of time afterwards and even before. Figure 35.2 gives a sketch of this effect. It can take 100 ms or more for thresholds to return to normal after a masker is extinguished; this decay is known as “forward masking” in contrast to the much smaller effect of “backward masking”, in which a masking tone prevents the detection of a test tone that actually begins up to a few ms before the masker. Although this reverse causality seems paradoxical, we must remember that the brain doesn't immediately “know” that a tone is present; rather, there is some processing time during which evidence is being collected before a decision can be made. Backward masking can be understood as later signals overwhelming this process before it can complete, thereby suppressing the perception of the earlier sound. Combining these descriptions of simultaneous and sequential masking, we can illustrate the effect of a short masking tone as a kind of “skirt” in the time-frequency plane, below which the listener cannot perceive the presence of energy – see Figure 35.3. For our purposes, this masked area corresponds to the range of energies in the signal into which quantization noise may safely fall without affecting perceived quality.


FIGURE 35.3 The combination of simultaneous and sequential masking results in a time-frequency “skirt” of elevated threshold around a strong, sustained tone.

35.2.2. Computational models

In perceptual coding, it is the strong and prominent components of the signal being encoded that will be used to hide the noise and distortion resulting from coding. Thus, the masking signal cannot be controlled or specified in advance; all we can do is analyze the signal and infer the tolerable level of noise at each point in the time-frequency plane (i.e., the maximum noise energy that will not be noticed by the listener). To do that, we need a quantitative model of the phenomena outlined in the previous section, able to make fine-grained predictions of the masked threshold resulting from arbitrary input signals.

There are a number of issues that complicate this process:

  • The precise masking curves as stylized in Figure 35.1 vary both in position and shape relative to the masker depending on frequency and level;
  • A tonal masker has characteristics that are qualitatively different from a masker formed from a narrow band of filtered noise that is otherwise matched in energy. This can be understood by noting that a noise masker will have a fluctuating level, making it harder to detect the small change in overall level that results from the presence of a masked tone. Thus, noise maskers raise the threshold more than tonal maskers of equal power;
  • Most experiments deal with single maskers, but a general signal will include multiple maskers that may be relevant at any given frequency.

Specific perceptual experiments can be devised to probe and measure each of these effects, e.g., the additivity of masking curves. Also, in the final analysis, it is safe to be conservative, i.e., underestimate the available masking – and limit oneself to adding quantization noise that is perhaps far below perceptible levels. Hence, a transparent audio coder does not need an extremely precise masking model, although a more accurate model will enable greater bitrate reductions while preserving transparency.

A number of perceptual model implementations exist, generally trading computational expense for accuracy of prediction. In fact, one of the dimensions in which different implementations of a single standard (such as MP3) can differentiate themselves is in the quality and efficiency of their psychoacoustic models, since the precise allocation of bit resources to different time and frequency locations is not a rigid part of the standard, but is left for individual implementations to decide. Figure 35.4 illustrates the results of the processing stages in a typical psychoacoustic model, in this case the low-complexity model (model 1) published as an example in the original MPEG Audio specification [5] and implemented in Matlab by Petitcolas [8].


FIGURE 35.4 Development of the estimated psychoacoustic masking threshold for a single frame of audio. Crosses indicate tonal peaks, and circles are non-tonal maskers. The dashed line shows the assumed absolute hearing threshold, the lower bound on all masking effects. (Figures generated with [8]).

The process starts with a short segment of the original audio file spanning the time for which the masking is to be estimated, usually 1024 samples or 23.2 ms at 44.1 kHz. This is windowed then converted to a spectrum with the Fourier transform. The solid line in the top pane of Figure 35.4 shows this spectrum, plotted on the Bark frequency axis (introduced in Chapter 15) which approximates the critical-band frequency resolution of the human ear. The dashed line shows the inferred absolute threshold on the same axis, i.e., the lower limit of energy that needs to be encoded at all. The next step is to identify prominent masking components and to classify them as “tonal” or “non-tonal” so that the appropriate masking properties may be applied. Tonal components are identified by looking for peaks in the spectrum that are significantly larger (e.g., 7 dB) than their immediate neighborhood, where the neighborhood gets proportionally wider at higher frequencies to account for the broadening auditory filters. Energy not picked up in the search for tonal peaks is summed within each critical band and represented as a set of discrete “non-tonal” components, one for each critical band. The overall set of components is pruned to keep only the strongest tonal peaks within a 1 Bark window, and to ignore any components falling below the absolute (quiet) threshold. These individual tonal and non-tonal components are shown as points at the corresponding energies and center frequencies in the top pane.
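The tonal-component search can be sketched in a few lines. The following is a simplified illustration rather than the exact MPEG tables: it scans for local spectral maxima that exceed their neighbors, at a set of frequency-dependent offsets, by at least 7 dB; the particular offset choices and bin boundaries below are illustrative assumptions.

```python
import numpy as np

def find_tonal_peaks(spl_db, min_excess_db=7.0):
    # Scan for local maxima that stand out from a frequency-dependent
    # neighborhood by at least min_excess_db (cf. the tonal search in
    # MPEG psychoacoustic model 1; offsets here are illustrative).
    peaks = []
    for k in range(3, len(spl_db) - 13):
        # must first be a simple local maximum
        if not (spl_db[k] > spl_db[k - 1] and spl_db[k] >= spl_db[k + 1]):
            continue
        if k < 63:                    # low frequencies: narrow neighborhood
            offs = (2, 3)
        elif k < 127:                 # mid frequencies
            offs = (2, 3, 4, 5, 6)
        else:                         # high frequencies: widest neighborhood
            offs = tuple(range(2, 13))
        if all(spl_db[k] - spl_db[k + d] >= min_excess_db and
               spl_db[k] - spl_db[k - d] >= min_excess_db for d in offs):
            peaks.append(k)
    return peaks
```

A flat spectrum with a single strong spike, for example, yields exactly that one bin as a tonal masker.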


FIGURE 35.5 Modeled masking thresholds as a function of masker level for a tone at 10 Bark ≈ 1.5 kHz (after [6]).

The next stage is to calculate the effective masking thresholds for each of these components, shown in the second panel. Each component results in a masking skirt that spreads to adjacent frequencies. The skirt is approximated by a piecewise linear function of the Bark-scale frequency; Figure 35.5 shows the masking curves for a tonal component at 10 Bark at a number of levels. Notice the “upward spread of masking” asymmetry, and that the upward slope becomes shallower as the masker becomes more intense. The middle pane of Figure 35.4 shows these curves for all the maskers identified in the short frame of music; you can see that the non-tonal maskers result in a higher level of masking relative to their energy when compared to the tonal components. Finally, the individual masking thresholds are combined with the threshold in quiet by summation in the power domain, giving the overall masking threshold as a function of frequency as shown in the lowest pane of Figure 35.4.
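This combination stage can be sketched as follows. The piecewise-linear skirt slopes (27 dB/Bark below the masker, 10 dB/Bark above, capturing the upward spread of masking) and the downshifts applied to tonal and non-tonal maskers are representative values for illustration, not the exact functions plotted in Figure 35.5.

```python
import numpy as np

def masked_threshold(maskers, z_grid, quiet_db):
    # Combine the threshold in quiet with each masker's "skirt" by
    # summation in the power domain (slopes/offsets are representative).
    # maskers: list of (bark_position, level_db, is_tonal) tuples.
    power = 10.0 ** (quiet_db / 10.0) * np.ones_like(z_grid)
    for z_m, level_db, tonal in maskers:
        shift = 14.0 if tonal else 6.0    # tonal maskers mask less (assumed values)
        dz = z_grid - z_m                 # Bark-scale distance from the masker
        # steep slope below the masker, shallow slope above
        # ("upward spread of masking")
        skirt_db = np.where(dz < 0.0, 27.0 * dz, -10.0 * dz)
        power += 10.0 ** ((level_db - shift + skirt_db) / 10.0)
    return 10.0 * np.log10(power)
```

For a single 60 dB tonal masker at 10 Bark, the resulting threshold sits about 14 dB below the masker at its center, decays asymmetrically to either side, and flattens out at the quiet threshold far from the masker.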


FIGURE 35.6 Quantization noise within one subband should be kept below the estimated masked threshold.

The MPEG specification includes a second, more complex psychoacoustic model (model 2) that avoids the explicit detection of masking peaks or their explicit classification as “tonal” or “noise”: instead, each spectral bin has an associated “tonality” index between zero and one that reflects how accurately it matches an extrapolation (in both magnitude and phase) from the preceding two time windows: sustained sinusoids, which are quite common in music audio, will tend to be well predicted by extrapolation. The tonality index is then used as an interpolation between masking functions fit to tone and noise data. However, in practice the difference between the psychoacoustic models is a matter of degree and precision: their overall behavior is similar.
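A sketch of the unpredictability measure underlying such a tonality index might look like the following; the normalization (0 for a perfectly predicted bin, approaching 1 for a maximally unpredictable one) follows the general form described for MPEG model 2, but treat the details as an assumption.

```python
import numpy as np

def unpredictability(X_prev2, X_prev1, X_cur):
    # Predict each spectral bin by linear extrapolation of magnitude and
    # phase from the two preceding analysis frames; sustained sinusoids
    # are well predicted, noise is not.
    r_hat = 2 * np.abs(X_prev1) - np.abs(X_prev2)      # extrapolated magnitude
    p_hat = 2 * np.angle(X_prev1) - np.angle(X_prev2)  # extrapolated phase
    X_hat = r_hat * np.exp(1j * p_hat)
    # normalized prediction error: 0 = tonal, -> 1 = noise-like
    return np.abs(X_cur - X_hat) / (np.abs(X_cur) + np.abs(X_hat) + 1e-12)
```

A steady complex exponential sampled at successive frames scores near 0, while a bin whose phase jumps unpredictably scores near 1.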


Given the predictions of the masking thresholds at different frequencies that are synchronized to the signal, we now have the possibility of maximizing the amount of quantization noise that can be added to the signal while preserving perceptual transparency – provided we can control where the noise falls. This problem, of manipulating the distribution of quantization noise in time and frequency, is called “noise shaping” and comes in many guises. For the kinds of coding considered in this chapter, the goal is to be able to independently vary the quantization noise at each point in the time-frequency plane, corresponding to the masking surface calculated by the psychoacoustic model. In practice, this is achieved by dividing the signal up on a time-frequency grid, then choosing different quantization levels within each cell of the grid according to the local capacity to tolerate quantization noise, as dictated by the local masking threshold.

This idea is illustrated in Figure 35.6: The full audio spectrum is assumed divided into a number of subbands, each broad enough to encompass some variation in the masked threshold. Linear quantization of the subband time-domain samples has the effect of adding a small error offset to each sample, and this is well modeled as independent, random, additive noise with a fixed distribution. Such a white noise sequence has a flat spectrum (equal energy at all frequencies, on average) whose amplitude is proportional to the step size of the quantizer. Assuming the signal within the subband is scaled to fully occupy the amplitude range afforded by the quantizer, the quantization noise will be approximately 6B dB below the peak signal level in the band, where B is the number of bits used in the linear quantization. (Each additional bit halves the quantizer step size, reducing the noise level by 20 log10 2 ≈ 6 dB.) Hence the signal-to-noise ratio (SNR) resulting from quantization is around 6B dB.
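The 6 dB-per-bit rule is easy to verify empirically for a uniform quantizer; the sketch below uses a simple mid-tread quantizer with the signal assumed scaled to ±1 full scale.

```python
import numpy as np

def quantize_snr(x, bits):
    # Uniformly quantize a full-scale signal and return the empirical SNR in dB.
    step = 2.0 / (2 ** bits)                  # quantizer step for range [-1, 1)
    q = np.clip(np.round(x / step) * step, -1.0, 1.0 - step)
    noise = x - q
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)
s8, s9 = quantize_snr(x, 8), quantize_snr(x, 9)   # SNR grows ~6 dB per bit
```

For a full-scale uniform signal the measured SNR lands very close to the predicted 6.02B dB, and each added bit buys almost exactly 6 dB.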

At the same time, the signal in that band and in adjacent bands gives rise to a masked threshold. The minimum value of the threshold within the frequencies included in the subband corresponds to the best opportunity for the listener to detect the flat quantization noise. By structuring our frequency decomposition so that subbands tend to be contained within the local region of large threshold elevation, we can keep this minimum threshold relatively high. Ideally, we will hide the quantization noise, which tracks the amplitude of a peak signal component, under the masking curve that similarly follows the peak. As long as the quantizer is given enough bits to keep the noise level below this minimum threshold, perceptual transparency should be preserved. The margin between the signal and the minimum masked threshold is sometimes called the Signal-to-Masker Ratio (SMR), leading to a noise-to-masker ratio NMR = SMR – SNR. Keeping the NMR negative is the criterion of perceptual transparency.
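Under the 6 dB-per-bit approximation, per-band bit allocation reduces to a one-line rule: choose the smallest B whose SNR of roughly 6.02B dB exceeds the band's SMR, so that the noise-to-masker ratio stays negative. This is a sketch that ignores the overall bit-pool constraint a real coder must also satisfy.

```python
import numpy as np

def bits_for_band(smr_db, db_per_bit=6.02, safety_db=0.0):
    # Smallest B with SNR ~ 6.02*B dB exceeding the SMR (i.e., NMR < 0).
    # safety_db adds an optional conservative margin.
    return max(0, int(np.ceil((smr_db + safety_db) / db_per_bit)))
```

A band with a 20 dB SMR thus needs 4 bits (giving about 24 dB of SNR), while a band whose signal already lies below its masked threshold needs none at all.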

35.3.1. Subband analysis

How, then, can we break up our continuous signal into little independent blocks of time-frequency for which we can make the optimal quantization choices, using just enough bits to push the MNR below zero? In principle, we can use a bank of bandpass filters to divide the signal into separate subbands, then divide each subband signal into short time sequences, and apply quantization to each sequence. At the far end, after decoding, the subband signals can be added together, and provided we are reasonably careful in designing our filters, we will reconstruct the full-band signal. If we do this naively, however, each subband signal will inherit the sampling rate of the original signal, and thus if we divide into M subbands, we end up with M times as many samples to quantize – not a good start if our goal is data reduction. If, however, we know that we have divided the signal up such that each subband represents just 1/Mth of the full spectrum, sampling theory tells us that we should be able to downsample that signal by a factor of M without losing information, even if the original bandlimited signal occupies higher frequencies. Reconstructing such a signal involves interpolating back to the full sampling frequency (i.e., inserting zeros to replace the samples discarded during decimation), then again applying the (ideal) bandpass filter to select the spectral alias that falls into the original band. Thus, our original time-domain signal is replaced by M signals, each sampled 1/Mth as often, for the same number of total samples and no “data explosion”. This is known as a maximally-decimated filter bank, or critical sampling – see Vaidyanathan [12] for a thorough discussion. This overall encode-decode loop is illustrated in Figure 35.7.

The one weakness of this approach is the assumption of ideal bandpass filters. In practice it is not possible to construct bandpass filters that exactly break up the spectrum into M disjoint regions with no overlap that can then be summed up to reconstitute the full band, and the effort to approach such ideal “brick-wall” filters would result in time-domain properties that were increasingly problematic – long ringing that would compromise the time localization we wish to achieve. Practical filters with finite-duration impulse responses will exhibit a finite transition region between the passband and stopband, and if the passbands are set up to fully capture all the original signal, this transition region will end up straying beyond the 1/Mth of the spectrum associated with the particular subband, as illustrated in Figure 35.8. Decimation, however, will alias (fold) any signal components outside the principal subband back into that band, leading to a particularly nasty kind of distortion.


FIGURE 35.7 Structure of a subband coder and corresponding decoder. The input signal is divided into M bandlimited signals, each accounting for 1/Mth of the total bandwidth, which are then downsampled by a factor of M (critically sampled), quantized based on the psychoacoustic model's estimate of the tolerable quantization noise at that moment in that band, and transmitted. At the decoding end, the dequantized samples are interpolated up to full sampling rate, again bandpass-filtered to recover the appropriate frequencies, then summed to reconstruct the full-bandwidth signal.

This problem is solved through alias cancellation [12]. Although a single decimated subband will introduce alias terms resulting from energy just beyond the ideal edge of the band, the corresponding imperfections in the reconstruction filters in the decoder will introduce similar alias terms from the neighboring channel. By careful construction, it is possible to have these two corresponding alias terms appear with opposite signs so that in the final summation they cancel out. This process is illustrated in Figure 35.8: a spectral component lies close to the subband boundary, such that it has substantial energy in two adjacent subband signals. After decimation and the subsequent upsampling (i.e., the stages in Figure 35.7, ignoring the quantization), both bands have both components at the true frequency and aliases at the frequency reflected in the band edge, fnyq/M (where fnyq is the Nyquist rate or highest representable frequency, and M is the number of subbands). However, by passing these signals through reconstruction filters that mirror the original analysis filters (giving this approach its name, quadrature-mirror filtering or QMF) we can guarantee that the complementary alias from the higher band has exactly the same amplitude as the original alias in the lower band. By negating the reconstruction filters in alternating bands prior to final reconstruction, all such aliases can be cancelled. Careful design of the original band-pass filter ensures that the magnitude response of the non-alias components remains exactly or very nearly flat.
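A minimal two-band example makes the cancellation concrete. Using the Haar pair H0(z) = (1 + z⁻¹)/√2 and H1(z) = H0(−z), with synthesis filters F0 = H0 and F1 = −H1, the aliases introduced by decimation cancel in the final sum and the input is recovered exactly, with a one-sample delay. (The Haar filters are chosen here purely for brevity; practical coders use much longer filters.)

```python
import numpy as np

def qmf_analysis_synthesis(x):
    # Haar QMF pair: h1 is h0 with alternating signs, i.e. H1(z) = H0(-z)
    h0 = np.array([1.0, 1.0]) / np.sqrt(2)
    h1 = np.array([1.0, -1.0]) / np.sqrt(2)
    # Analysis: filter, then keep every other sample (decimate by 2)
    v0 = np.convolve(x, h0)[::2]
    v1 = np.convolve(x, h1)[::2]
    # Synthesis: upsample (insert zeros), filter with F0 = H0, F1 = -H1
    u0 = np.zeros(2 * len(v0)); u0[::2] = v0
    u1 = np.zeros(2 * len(v1)); u1[::2] = v1
    # alias terms from the two bands are equal and opposite, so they cancel
    return np.convolve(u0, h0) + np.convolve(u1, -h1)
```

Running this on any input reproduces the signal delayed by one sample, even though each decimated subband on its own is badly aliased.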


FIGURE 35.8 Alias cancellation in quadrature-mirror filterbanks.


FIGURE 35.9 Castanet sound example. Top pane: original waveform. Middle pane: Reconstruction from MPEG-1 Audio Layer II (MP2) at 128 kbps. Bottom pane: Reconstruction from MPEG-1 Audio Layer III (MP3) at 128 kbps. Note the “pre-echo” noise immediately preceding the attack in the MP2 version.

A different analysis of maximally-decimated filterbanks focuses on the time domain [9]. If a signal is broken into blocks of N samples, with 50% overlap between successive windows, maximal decimation will require each block to be represented by only N/2 coefficients, meaning that a reconstruction of that block in isolation cannot fully reproduce the original block but must introduce distortion that may be considered time-domain aliasing. If, however, the alias components in successive reconstructed blocks can be made equal and opposite, then, as in the frequency domain case above, the aliases can be cancelled in the final overlap-add reconstruction. This approach is known as time-domain alias cancellation (TDAC), and by dealing directly with the time-domain form of the filters, it permits the derivation of exact perfect-reconstruction analysis-synthesis structures. The most common instance of this uses the Discrete Cosine Transform (DCT) for the core frequency transform, and is known as the modified DCT (MDCT) [10].
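The MDCT and its TDAC property can be demonstrated directly: transform 50%-overlapped, sine-windowed blocks of 2N samples to N coefficients each, invert, window again, and overlap-add; the time-domain aliases in adjacent blocks cancel and the interior of the signal is reconstructed exactly. (The 2/N inverse scaling below matches this windowing convention; normalization conventions differ between texts.)

```python
import numpy as np

def mdct_block(x2n):
    # 2N windowed samples -> N MDCT coefficients
    N = len(x2n) // 2
    n, k = np.arange(2 * N), np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ x2n

def imdct_block(X):
    # N coefficients -> 2N samples containing the signal plus its time alias
    N = len(X)
    n, k = np.arange(2 * N), np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ X)

def mdct_ola_roundtrip(x, N):
    # Sine window satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
    y = np.zeros(len(x))
    for start in range(0, len(x) - 2 * N + 1, N):   # 50%-overlapped blocks
        block = x[start:start + 2 * N] * w
        y[start:start + 2 * N] += w * imdct_block(mdct_block(block))
    return y   # interior samples (N .. len(x)-N) match x exactly
```

Each block is represented by only N coefficients, yet after overlap-add the aliases cancel: this is critical sampling with perfect reconstruction.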

35.3.2. Temporal noise shaping

In general, we want to divide the spectrum of the signal into a relatively large number of subbands so that each band corresponds to a narrow range of frequencies, enabling us to take full advantage of the masking that results from the strongest components in the band. However, due to the dual nature of time and frequency, reducing the width of the frequency bands implies lengthening the duration of the time windows, imposing a lower limit on the temporal resolution of our quantization noise control. If a sound changes amplitude suddenly, the analysis frame containing the large onset may also include signals of much lower amplitude immediately preceding the transient, which will inherit the same, coarse, quantization that is deemed suitable for the high-amplitude portion of the frame – leading to noise whose amplitude is significant when compared to the low-amplitude portion of the waveform. This situation is known as “pre-echo” and is illustrated in Figure 35.9, which shows about 50 ms of a recording of castanets and guitar containing a single, very sharp castanet transient. The top pane shows the original recording, and the second pane shows the result of encoding and decoding using MPEG-1 Audio Layer II (MP2), which uses a fixed 26.1 ms (1152 sample) frame, further spread by 11.6 ms (512 sample) analysis and synthesis filter impulse responses. The effects of this excess quantization noise are clearly visible for at least 20 ms prior to the onset at t = 0.972 s. The effect is also quite easy to hear, amounting to a softening or blurring of the transients, as if there were multiple castanets being played together rather than just one.


FIGURE 35.10 Switching windows as used in the MP3 MDCT coding. Subband signals are analyzed in 50%-overlapped blocks of 36 samples using the “long” window until the psychoacoustic model detects a region of rapid signal change. The block before the change uses the asymmetric “start” window, then the transient block is coded as three successive 12-sample subblocks each using the “short” window. Once the region of transients is complete, the analysis switches back to 36-sample blocks, first using a “stop” window, and using “long” windows from then on.

Of course, this problem could be reduced by using shorter analysis windows, but to do this uniformly would sacrifice some of the “coding gain” advantages afforded by fine frequency resolution. It is, however, feasible to vary the frame size dynamically in response to the particular signal conditions within the frame. This is the approach taken in MPEG-1 Audio Layer III (MP3), which, as can be seen in the third pane of Figure 35.9, is effective in eliminating the pre-echo (in combination with the other efficiency improvements included in MP3). The core MDCT spectral transform in MP3 usually operates on blocks of 36 samples, with 50% overlap resulting in 18 new samples represented by each block. These samples come from individual bands of the first-stage 32 subband filterbank, thus a block of 18 subband samples corresponds to 18 × 32 = 576 audio samples, or around 13.1 ms at 44.1 kHz sampling rate. When, however, the psychoacoustic model detects a portion of the signal that involves rapid change in multiple bands (e.g., large deviations from the extrapolation mentioned at the end of Section 35.2.2), one or more adjacent blocks are instead encoded as successions of three 12-point subblocks, overlapped by 6 samples. (Forcing the short windows to occur in blocks of three allows the overall framing to remain synchronized to a 576 sample frame.) These much shorter windows, illustrated in Figure 35.10, allow the quantization noise to be limited much more precisely in time, at the cost of reduced ability to exploit simultaneous masking. They do, however, allow any remaining pre-echo artifacts to be hidden more effectively within backward-masking effects. The transitions between long and short windows require special asymmetric window shapes that are applied to long blocks, but actually involve setting six samples at the short-facing end to zero.
Although this further sacrifices spectral resolution in these blocks, the combination of just four window shapes (long, short, start, and stop) plus two MDCT sizes (36- and 12-point) provides much greater flexibility in handling spectrally-stable signals interspersed with rapid transients, while preserving perceptual transparency at reasonable bit rates and with relatively minor increases in coding expense and complexity.
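The four window shapes can be constructed from short sine-window formulas; the sketch below follows the definitions published for MPEG-1 Layer III, including the six zeroed samples at the short-facing end of the “start” window (with “stop” as its time-reverse), but treat the exact sample ranges as assumptions to be checked against the specification.

```python
import numpy as np

def mp3_window(kind):
    # Window shapes for the 36-point (long) and 12-point (short) MDCTs,
    # following the MPEG-1 Layer III sine-window definitions.
    n = np.arange(36)
    long_w = np.sin(np.pi / 36 * (n + 0.5))
    if kind == "long":
        return long_w
    if kind == "short":
        return np.sin(np.pi / 12 * (np.arange(12) + 0.5))
    if kind == "start":                      # long rise, short-compatible fall
        w = long_w.copy()
        w[18:24] = 1.0                       # flat shoulder
        w[24:30] = np.sin(np.pi / 12 * (np.arange(24, 30) - 18 + 0.5))
        w[30:36] = 0.0                       # six zeroed samples at short-facing end
        return w
    if kind == "stop":                       # time-reverse of "start"
        return mp3_window("start")[::-1]
    raise ValueError(kind)
```

The long window satisfies the Princen-Bradley overlap condition (w[n]² + w[n+18]² = 1) needed for alias cancellation between successive long blocks, while the start/stop shapes bridge to the short-window grid.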


FIGURE 35.11 Temporal envelope estimation by linear prediction. The envelope of the transient speech waveform is approximated by the smooth curve, which results from linear prediction of the frequency-domain coefficients. The poles of this predictor correspond to peaks in the envelope, shown by crosses.

Rather than providing a set of discrete choices for the temporal window, a second approach is to build (and transmit) a window that follows the temporal envelope of the signal itself. This envelope could be divided out prior to encoding, producing signal blocks with approximately constant amplitude in time. Then, when the envelope is re-applied to the post-quantization signal in the decoder, the quantization noise, which was spread uniformly in time, is reshaped to track the overall amplitude of the signal: it is attenuated at times when the signal is small, and allowed to grow larger in the presence of large-amplitude target signals that can mask it. Representing and transmitting such an envelope could, however, consume considerable bits or bandwidth that might be better spent simply increasing the quantizer resolution.

However, in the case of signals with highly transient envelopes, there is a particularly data-efficient way to implement what amounts to this scheme. Chapter 21 described the techniques of linear predictive coding (LPC) in which a waveform with a highly predictable, periodic structure could be efficiently described by a low-order feedback structure that “predicted” each successive sample as a linear combination of a few preceding samples. In the frequency domain, this predictability appeared as a spectrum with a few peaks – the resonances of the all-pole filter constituted by the feedback system. The duality between time and frequency dictates that the more ‘steady’ the signal in the time domain, the more ‘peaky’ (impulse-like) its spectrum. By the same token, a signal with a highly impulsive or transient time-domain envelope is compelled to have an increasingly steady or predictable frequency-domain representation, with the limiting case being an impulse in time whose Fourier transform has a constant value and a phase that wraps around linearly with frequency.

Carrying the duality further, the same mathematical analysis can be applied to capture this predictability of the spectrum in a low-order linear predictor – for instance by predicting higher-frequency spectral values as a linear combination of a few lower-frequency bins. Just as conventional LPC allows an efficient coding scheme in terms of a set of predictor coefficients that are held fixed for a block of residual time-domain excitation, sets of spectral coefficients such as the coefficients of an MDCT transform will be more efficiently represented as predictor-plus-residual in the case where a highly transient temporal envelope induces strong correlation within a spectral region.

Because of the duality between the time and frequency domains, the frequency-domain predictor has an informative transform-domain interpretation. Recall that time-domain LPC results in a filter whose magnitude response is a smoothed approximation to the signal's spectrum. Here, with time and frequency interchanged, the frequency-domain predictor can be analyzed into a time-domain magnitude response that approximates the temporal envelope of the signal, as illustrated in Figure 35.11. The residual frequency-domain excitation, transmitted along with the predictor, corresponds to the spectrum of the time-domain signal after this envelope is divided out. It is these residual values that are quantized, and thus where the quantization noise is introduced. On reconstruction, passing the spectral values through the prediction filter is equivalent to multiplying the residual by the temporal envelope. Thus, the quantization noise has a temporal shape applied to it, and is made smaller in the critical, low-amplitude portions of the frame. Crucially, the temporal envelope is encoded as only a few coefficients – just two per "peak" in the envelope – and thus constitutes very little additional data.
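This frequency-domain linear prediction can be sketched in a few lines: fit a low-order predictor to the DCT coefficients of a transient frame, and the all-pole "magnitude response" of that predictor, evaluated on a dense grid, traces the frame's temporal envelope (time n within the frame maps to "frequency" πn/N of the spectral-domain predictor). The function names here are illustrative, not from any codec:

```python
import numpy as np

def dct_ii(x):
    """DCT-II by direct matrix multiplication (fine for short frames)."""
    N = len(x)
    n = np.arange(N)
    return np.cos(np.pi * n[:, None] * (2 * n + 1) / (2 * N)) @ x

def lpc_coeffs(y, order):
    """Autocorrelation-method linear prediction: y[k] ~ sum_i a[i] y[k-1-i]."""
    r = np.array([np.dot(y[: len(y) - m], y[m:]) for m in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Transient test signal: a noise burst decaying from the start of the frame.
rng = np.random.default_rng(0)
N = 256
x = rng.standard_normal(N) * np.exp(-np.arange(N) / 20.0)

# Predict across frequency; the resulting all-pole response approximates
# the temporal envelope of x, as in Figure 35.11.
a = lpc_coeffs(dct_ii(x), order=8)
env = 1.0 / np.abs(np.fft.rfft(np.concatenate(([1.0], -a)), 2048))
```

Because the test signal's envelope peaks at the start of the frame, `env` is largest near its first bins and decays thereafter, mirroring the smooth curve of Figure 35.11.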

This Temporal Noise Shaping (TNS) approach was proposed by Herre & Johnston [4] and is discussed in more detail in Herre [3]. The notion of exploiting these parametric representations of temporal envelopes is further explored in Athineos & Ellis [1]. The MPEG-4 Advanced Audio Coder (AAC) employs TNS, and its result when applied to the castanets signal is shown in the bottom pane of Figure 35.9.³


Many factors influence the design of coding schemes, including compromises between computational expense and bitrate efficiency, application-dependent considerations such as maximum coding delay, and less technical factors such as compatibility with earlier standards and the interests of intellectual-property owners. Here, we briefly discuss the particular combinations of the above techniques used in a few common perceptual audio coding schemes, as well as some of the other technical details they include.

35.4.1. MPEG-1 Audio layers I and II

The original MPEG Audio specification was devised in the late 1980s and early 1990s at a time when the computational requirements for real-time decoding seemed daunting for contemporary hardware. Thus, three “layers” were specified, providing increasing coding efficiency at increasing computational expense. Layers I and II were relatively straightforward subband coders using a 32-band approximate-QMF filterbank on blocks of 384 (layer I) or 1152 (layer II) samples. This results in frequency bands 690 Hz wide, too broad to take full advantage of auditory masking at the low end of the spectrum where critical bands are 100 Hz wide or smaller. Layer II is usually considered transparent at 192 kbps for a stereo signal. The standard video encoding used on DVDs includes a layer II audio stream.

MPEG Audio is designed to achieve a constant bitrate at a fine scale, i.e., each frame (384 samples in Layer I, 1152 in Layer II) is allocated a fixed number of bits. This, combined with a short synchronization word at the start of each frame, leads to streams that can be interrupted and resumed at any point. This is important for applications such as digital audio broadcast, which are essentially continuous bitstreams with no overall header.

Figure 35.12 shows the layout of the data comprising an MPEG Audio bitstream. It shows one frame of layer I coding, corresponding to 384 stereo samples, or about 8.7 ms of audio at a sampling rate of 44.1 kHz. The frame is 140 bytes long, giving a bit rate of 140 × 8 × 44100 ÷ 384 ≈ 128 kbps.
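The arithmetic behind these figures is easy to check directly (a sketch; `layer1_bitrate` is just an illustrative helper, not part of any standard):

```python
# Subband width for a 32-band filterbank at 44.1 kHz:
fs = 44100
band_hz = fs / 2 / 32  # 689.06 Hz, the "690 Hz" bands quoted above

def layer1_bitrate(frame_bytes, fs=44100, frame_samples=384):
    """Bit rate (bps) implied by a Layer I frame of a given size in bytes."""
    return frame_bytes * 8 * fs / frame_samples

rate = layer1_bitrate(140)  # 128625 bps, i.e. about 128.6 kbps
```

The 140-byte frame of Figure 35.12 thus works out to slightly more than the nominal 128 kbps, since frame sizes are rounded to whole bytes.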

35.4.2. MPEG-1 Audio Layer III (MP3)

The theory of perfect reconstruction MDCT filterbanks was emerging during the period of the MPEG-1 standardization, and it was incorporated into the most complex algorithm, layer III (better known as MP3). MP3 uses the same 32 band filterbank as layers I and II, meaning that a hardware decoder for all three layers could reuse this component. Beyond this, however, each subband signal is further subjected to an MDCT analysis, increasing the spectral resolution by a factor of 18 (for long windows) to give a total of 32 × 18 = 576 spectral bins, or about 38 Hz resolution for a 44.1 kHz sample rate. This two-level scheme is referred to as a hybrid filterbank.

MP3 includes a significant number of other technical enhancements over layer II. There is the window switching to control pre-echo as described in Section 35.3.2. Additionally, a more sophisticated quantizer is used that includes power-law quantization so that step sizes get smaller for smaller sample values. A Huffman coding stage varies the number of bits used to represent each sample (or small groups of samples) in inverse proportion to their overall prevalence, meaning that common sample values consume fewer bits. Finally, a bit “reservoir” allows bits to be shifted around over a scale larger than single frames, even for a fixed-rate stream. MP3 achieves transparent coding for most material at around 128 kbps.
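The effect of power-law quantization can be seen in a simplified sketch. (MP3's actual quantizer also involves a global gain and a small rounding offset defined in the standard [5]; this keeps only the x^(3/4) companding, and the function names are illustrative.)

```python
def quantize(x, step):
    """Companded quantizer: effective step size shrinks for small values."""
    return round((abs(x) / step) ** 0.75)  # sign is coded separately

def dequantize(q, step):
    """Inverse companding: recover an approximation of the original value."""
    return step * q ** (4.0 / 3.0)

# Reconstruction error grows with amplitude: the quantizer spends its
# precision on small values, where finer steps are needed.
small = abs(dequantize(quantize(1.0, 0.1), 0.1) - 1.0)
large = abs(dequantize(quantize(10.0, 0.1), 0.1) - 10.0)
```

With the same nominal step, the reconstruction error for the value 10.0 is larger than for 1.0, which is exactly the "smaller steps for smaller samples" behavior described above.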


FIGURE 35.12 Bit usage layout in an example MPEG-1 Audio Layer I frame encoding 384 stereo samples in 140 bytes, for a bit rate of 128 kbps.

35.4.3. MPEG-2 Advanced Audio Coding (AAC)

Advances in computational hardware, as well as theoretical developments in psychoacoustic coding, led to the specification of a new MPEG audio scheme in the mid-1990s that abandoned the hybrid filterbank and other vestiges of MPEG-1 Audio for a complete redesign based on a single MDCT stage, switching between 1024 and 128 spectral bins – similar to the switching windows of MP3, but with a greater difference between long and short windows. AAC also includes the LPC-based temporal noise shaping described in Section 35.3.2, additional backward prediction of spectral values between frames for highly stationary signals, and more flexible schemes for encoding stereo and multichannel sequences to exploit redundancies between channels. Along with differences in the coding techniques, the standard is expanded to accommodate a larger range of signals such as "5.1" surround sound (left, right, center, rear left, rear right, plus a low-frequency effects channel), and to provide a broader set of 'profiles' offering different trade-offs of bitrate and computational expense. AAC at 96 kbps is near-transparent for stereo signals. Technical details are provided in Bosi et al. [2].


FIGURE 35.13 Example of psychoacoustic coding. Top pane: original music file, including drums, guitar, and bell tree. Middle pane: reconstruction after coding in MP3 at 128 kbps. Bottom: residual difference after aligning coding delay of 2257 samples.


In this chapter we have seen how specific properties of hearing – most importantly the way that a strong tone will mask the perception of other energy nearby in time and frequency – can be exploited to achieve large data rate reductions for audio without perceptible degradation. In fact, audio coded at one or two bits per sample actually has a high level of background noise: as illustrated in Figure 35.13, the residual noise that has effectively been added through coding is only 10–20 dB below the energy of the signal. However, because it follows the time-frequency structure of the signal very closely, it remains largely imperceptible to human listeners. Achieving this kind of coding requires an accurate prediction of when and where noise will be masked, good techniques to efficiently control the quantization within such fine-scale patches of time-frequency without introducing any redundant coefficients, and solutions to a range of other problems including signals that may require particularly fine time resolution to avoid pre-echoes. We have seen the powerful and elegant solutions that have been employed to permit this quantization, and we finished with a brief summary of the collections of techniques used in some of today's dominant high-quality audio compression schemes.


  35.1 Referring to Figure 35.13:
    (a) What are the visible differences between the signal before and after coding? How did these changes arise?
    (b) What is the average SNR in the coded example?
  35.2 Based on Figure 35.5, a 100 dB SPL tone at 10 Bark provides masking for energy up to around 92 dB SPL at 10 Bark (an 8 dB target-to-masker ratio, or TMR), falling to around 45 dB SPL at 9 Bark (a 55 dB TMR).
    (a) For a 2-Bark subband centered at 10 Bark, approximately how many bits would be required for a linear quantizer to ensure the quantization noise fell below the masked threshold for this tone?
    (b) How would this value change if the tone were instead at 40 dB SPL?
  35.3 In Figure 35.8, we see how the alias just below the band edge in subband N matches the amplitude of a corresponding alias introduced in subband N + 1, such that when the subbands are combined in reconstruction, this alias cancels. However, the figure also shows a component of lower amplitude just above the band edge. Can you explain, in broad terms, what happens to this component, and why it does not result in distortion?
  35.4 Figure 35.12 shows a particular example of a frame of MPEG-1 Audio Layer I data. Given that each sample can be quantized at between 2 and 15 bits, what are the smallest and largest sizes (in bytes) such a frame could occupy? What would be the overall bitrates corresponding to these extremes?


  1. Athineos, M. and Ellis, D. P. W., "Autoregressive modeling of temporal envelopes," IEEE Transactions on Signal Processing, 55: 5237–5245, 2007.
  2. Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L., Akagiri, K., Fuchs, H., Dietz, M., Herre, J., Davidson, G., and Oikawa, Y., “ISO/IEC MPEG-2 advanced audio coding,” Journal of the Audio Engineering Society, 45: 789–814, 1997.
  3. Herre, J., “Temporal noise shaping, quantization and coding methods in perceptual audio coding: a tutorial introduction,” in AES 17th International Conference, pages 312–325, 1999.
  4. Herre, J. and Johnston, J., "Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS)," in Proc. 101st Audio Engineering Society Convention, 1996, preprint 4384.
  5. ISO/IEC 11172–3, "Information technology – coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s – Part 3: Audio," 1993.
  6. Painter, T. and Spanias, A., “Perceptual coding of digital audio,” Proceedings of the IEEE, 88: 451–513, 2000.
  7. Pan, D., “A tutorial on MPEG/audio compression,” IEEE MultiMedia, 2: 60–74, 1995.
  8. Petitcolas, F., "MPEG for Matlab," web resource, 2003, retrieved 2009-08-03.
  9. Princen, J. and Bradley, A., “Analysis/synthesis filter bank design based on time domain aliasing cancellation,” IEEE Transactions on Acoustics Speech and Signal Processing, 34: 1153–1161, 1986.
  10. Princen, J., Johnson, A., and Bradley, A., “Subband/transform coding using filter bank designs based on time domain aliasing cancellation,” in IEEE ICASSP, pages 2161–2164, 1987.
  11. Thiede, T., Treurniet, W., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J., Colomes, C., Keyhl, M., Stoll, G., Brandenburg, K., et al., "PEAQ: The ITU standard for objective measurement of perceived audio quality," Journal of the Audio Engineering Society, 48: 3–29, 2000.
  12. Vaidyanathan, P. P., Multirate systems and filter banks, Prentice-Hall, 1993.
  13. Wegel, R. L. and Lane, C. E., “The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear,” Phys. Rev., 23: 266–285, Feb 1924.
  14. Zwicker, E. and Fastl, H., Psychoacoustics: Facts and Models, Springer-Verlag, 1990.

1 This threshold will depend on the absolute level at which the signal is being played, but the curve is positioned based on the conservative assumption that the maximum amplitude possible within the soundfile encoding would correspond to 96 dB SPL – corresponding to listening to the music with the volume turned “way up”.

2 Alert readers may wonder why this negation – subtracting the reconstruction of subband N + 1 from the reconstruction of subband N in order to cancel the alias – does not end up flipping the polarity of the components in that subband. In fact, the way that the high-pass filter defining subband N + 1 is constructed from the low-pass filter for subband N results in a π/2 phase shift for the unaliased components (and –π/2 for aliased components). Specifically, if the low-pass filter is an even-length, symmetric FIR filter, then the high-pass – obtained by multiplying the low-pass impulse response by (−1)^n – will be antisymmetric. When the filter is reapplied in reconstruction, the original component thus accumulates a total phase shift of π, which is then corrected by negating its sign.
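The symmetry claim in this footnote is easy to verify numerically; the prototype taps below are arbitrary, chosen only to be even-length and symmetric:

```python
import numpy as np

# Even-length, symmetric lowpass prototype (values are arbitrary):
h = np.array([0.1, 0.3, 0.4, 0.4, 0.3, 0.1])

# Highpass counterpart by (-1)^n modulation of the impulse response:
g = ((-1.0) ** np.arange(len(h))) * h

sym = bool(np.allclose(h, h[::-1]))      # lowpass is symmetric
antisym = bool(np.allclose(g, -g[::-1])) # highpass comes out antisymmetric
```

For any even-length symmetric h, the modulated filter g satisfies g[N−1−n] = −g[n], which is the antisymmetry that produces the ±π/2 phase shifts described above.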

3 Note that the temporal envelope modification of the signal blocks affects their spectrum and hence the correct application of the psychoacoustic model. However, since the reduced spectral resolution of an extreme envelope modulation implies corresponding gains from the prediction of the spectral coefficients, it is argued that this scheme allows a kind of continuous, signal-adaptive trade-off between spectral and temporal properties [4].
