Chapter 7


Audio Coding

7.1. Principles of “perceptual coders”

We show, in Figure 7.1, the amplitude of a violin signal as a function of time (Figure 7.1(a)) and the distribution of the signal power as a function of frequency (Figure 7.1(b)).

As with speech coders, an audio coding algorithm is essentially a loop which consists of first filling a buffer with N samples as shown in Figure 7.2, processing these N samples, and then passing on to the next analysis frame. The analysis frames always overlap; the shift between two analysis frames is characterized by the parameter M < N. The vector x(m) is, therefore, of the form:

x(m) = w ⊗ [x(mM) ··· x(mM + N − 1)]^T

where the operator ⊗ represents a component-by-component multiplication of two vectors and where w is a weighting window. The analysis frames are generally around 20 ms in length; at 44.1 kHz this would correspond to 44.1 × 20 = 882 samples, and in practice the parameter N is equal to 512 (MPEG-1 and AC-3) or 2048 (MPEG-2 AAC). The value of the parameter M depends on the coder: M = 32 for the MPEG-1 coder, M = N/2 for the AC-3 and MPEG-2 AAC coders.
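To make the framing loop concrete, here is a minimal Python sketch; the sine window and the parameter values are assumptions for the example, not requirements of any particular standard:

```python
import numpy as np

def frames(x, N=2048, M=1024, window=None):
    """Yield overlapping, windowed analysis frames x(m) of length N,
    shifted by M samples (M < N), as described in the text."""
    # Sine window assumed here for illustration only
    w = np.sin(np.pi * (np.arange(N) + 0.5) / N) if window is None else window
    m = 0
    while m * M + N <= len(x):
        yield w * x[m * M : m * M + N]   # component-by-component product w ⊗ x
        m += 1

# Example: frames at fe = 44.1 kHz with 50% overlap, as in AAC
fe = 44100
x = np.random.randn(fe)              # one second of a stand-in "signal"
for xm in frames(x, N=2048, M=1024):
    pass                             # each xm would now be transformed and coded
```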

The diagram of a perceptual coder shows that it is composed of three distinct modules which are activated at each analysis frame: a time-frequency transform of the form y(m) = T x(m), a bit allocation controlled by a hearing model, and a scalar or vector quantization of the components of the vector y(m) followed by entropy coding. The bit stream transmitted over the channel is made up of code words from the entropy coding as well as side information, for example the result of the bit allocation, which the receiver needs a priori.

Figure 7.1. Violin signal in (a) the time domain and (b) the frequency domain

Figure 7.2. Diagram of a perceptual coder

At the receiver, the vector ŷ(m) must be reconstructed and brought back into the time domain by the inverse transform x̂(m) = T⁻¹ ŷ(m), then the samples x̂(n) must be delivered at the rate fe. In the MPEG-2 AAC coder, the reconstruction error q(n) = x(n) − x̂(n) is also calculated at the transmitter: we have a local copy of the receiver at the transmitter. This information is used in the bit allocation module. This is known as closed-loop coding, as for speech coders.

Recall that in section 3.2 we showed the equivalence of filter banks and transforms. The moduli of the components of the vector y(m) can be interpreted as frequency-domain quantities if the transform has been chosen correctly; in general this is the MDCT. We can say, in this introductory section, that these moduli provide a good estimate of the music signal's power spectral density SX(f) in the current analysis frame.

Perceptual coders aim to eliminate the components of the vector y(m) which are inaudible and to quantize the remaining components with the minimum number of bits. The role of the hearing model is to define the inaudibility criterion. The psychoacoustic results, presented in section 7.5, show that a pure sound (a sinusoidal waveform) or a narrow-band noise can be made inaudible (masked) by the presence of another pure sound or narrow-band noise. In the graph of Figure 7.3(a), a single component of the power spectral density SX(f) has been isolated. The roughly bell-shaped curve shows the power of the masked sinusoid just at the limit of inaudibility. Since, in reality, the N samples of a musical sound in the current analysis frame are expressed as a sum of M frequency components, the psychoacoustic results must be generalized and the different contributions added together. We obtain the masking threshold Φ(f), which can be seen in Figure 7.3(b) superposed on the power spectral density SX(f).

Figure 7.3. (a) Masking curve (in bold). (b) Masking threshold (in bold). Only contributions in the band [0–10] kHz are shown


The error from a quantizing operation results in an inaudible perturbation if its power spectral density Sq(f) satisfies the relation:

Sq(f) < Φ(f)

for any f and any analysis frame. Equation [4.17], for which we recall the expression:

b = (2/fe) ∫₀^(fe/2) max{0, (1/2) log₂[SX(f)/Sq(f)]} df

gives the bit rate which is necessary and sufficient to code a source with power spectral density SX(f) with a distortion of power spectral density Sq(f). Since the maximum admissible value of Sq(f) is Φ(f), the frequency axis must be finely partitioned, the frequency bands which satisfy SX(f) > Φ(f) must be determined, and the bits must be allocated as a function of the ratio SX(f)/Φ(f) by applying the 6 dB per bit rule. The graph in Figure 7.4 shows that, in this analysis frame, approximately 50% of the components can be eliminated and that the mean signal-to-mask ratio is of the order of 12 dB, which therefore requires 2 bits on average per coded component. The compression rate is therefore of the order of 2 × 16/2 = 16. For music signals in CD format, the lowest bit rate which respects the inaudibility constraint (in this frame) is of the order of 40 kbit/s. We can see immediately that everything depends upon the precision with which the psychoacoustic model is defined. Defining a good model is very tricky.
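To make the 6 dB per bit rule concrete, here is a small illustrative sketch; the band powers and thresholds are invented numbers, and only the rule SNR ≈ 6.02 b dB is taken from the text:

```python
import numpy as np

# Hypothetical per-band signal power and masking threshold, in dB
S_x_dB = np.array([60.0, 48.0, 35.0, 20.0, 10.0])   # signal PSD per band
phi_dB = np.array([40.0, 40.0, 38.0, 25.0, 15.0])   # masking threshold per band

smr_dB = S_x_dB - phi_dB                 # signal-to-mask ratio
bits = np.where(smr_dB > 0,              # masked bands get no bits at all
                np.ceil(smr_dB / 6.02),  # ~6 dB of SNR gained per bit
                0).astype(int)

print(bits)   # [4 2 0 0 0]: only the unmasked bands are coded
```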

Figure 7.4. Eliminating masked components: only components of SX(f) greater than the masking threshold are coded


In reality, defining an audio coder does not stop at this point. A whole series of specific problems must be studied. The two sections which follow describe the two basic perceptual coders and show that the creation of an audio coder with good performance is the result of numerous subtle compromises. If a more detailed explanation is required, see [PAI 00].

7.2. MPEG-1 layer 1 coder

A few succinct pieces of information are given here pertaining to the audio part of the international standard ISO/IEC 11172 [NOR 93]. This standard authorizes three sampling frequencies: 32, 44.1 and 48 kHz. It also authorizes a wide range of bit rates. In this section, we assume that the sampling frequency is 44.1 kHz and that the target bit rates are between 64 and 96 kbit/s per channel. All the tables provided here are therefore only relevant for this sampling frequency and these bit rates.

7.2.1. Time/frequency transform

The time/frequency transform used is a bank of M = 32 pseudo-quadrature mirror filters (PQMF) which realizes a uniform partition of the frequency axis, as shown in Figure 7.5, which gives the frequency responses of the first filters in the bank. This filter bank does not achieve perfect reconstruction but, in the absence of quantization, the SNR is greater than 90 dB, which is sufficient since the actual SNR of the CD format is itself of this order.

Figure 7.5. Frequency responses in the PQMF filter bank in the band [0–5] kHz


This filter bank produces M sub-band signals which are downsampled by a factor of 32, giving the sub-band samples yk(m) 1. In each sub-band independently, the coder constructs a vector yk which groups 12 samples for every 32 × 12 = 384 samples of x(n), which corresponds to approximately 10 ms. For each vector yk, the component with the largest absolute value among the 12 is used to define a scale factor gk, from which a normalized vector is deduced. Each scale factor is expressed in dB then quantized with the help of a codebook covering more than 100 dB in steps of 2 dB, which requires 6 bits.
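A minimal sketch of this normalization step, assuming the 2 dB step/6 bit scale-factor codebook described above (function and variable names are ours):

```python
import numpy as np

def scale_factor_and_normalize(y_block):
    """y_block: 12 consecutive samples of one sub-band signal yk(m)."""
    g = np.max(np.abs(y_block))            # largest magnitude defines the scale factor
    g_dB = 20 * np.log10(max(g, 1e-10))    # expressed in dB
    g_dB_q = 2 * np.ceil(g_dB / 2)         # 2 dB steps, rounded up so that the
    g_q = 10 ** (g_dB_q / 20)              # normalized samples stay in [-1, +1]
    return g_q, y_block / g_q

block = np.random.randn(12) * 0.3
g, y_norm = scale_factor_and_normalize(block)
```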

The choice of 12 samples is the result of a compromise. For lower values, the bit rate associated with the scale factors becomes too large. For higher values, echo phenomena become audible since there is no longer any temporal masking. A priori, this coder delivers a good temporal resolution (sub-band signals are frequently updated) but a poor frequency resolution (the filter bandwidths are of the order of 22.05/32 ≈ 0.7 kHz).

7.2.2. Psychoacoustic modeling and bit allocation

The samples of the signal x(n) contributing to the M vectors yk are also used to estimate the power spectral density ŜX(f) in the current analysis frame. From ŜX(f), the coder calculates a masking threshold Φ(f) using a psychoacoustic model, and then a signal-to-mask ratio ŜX(f)/Φ(f). From the M signal-to-mask ratios, the coder realizes a bit allocation, that is, it determines the number of bits bk with which each component of the vector yk is quantized. The algorithm used is the standard greedy algorithm.
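A sketch of such a greedy allocation: one bit at a time to the band whose noise is currently the most audible, with roughly 6 dB of SNR gained per allocated bit. The bit budget, the per-band sample count and the saturation limit are illustrative assumptions:

```python
import numpy as np

def greedy_allocation(smr_dB, budget_bits, samples_per_band=12, b_max=15):
    """Repeatedly give one more bit to the band whose noise-to-mask
    ratio is currently the worst, until the bit budget is exhausted."""
    b = np.zeros(len(smr_dB), dtype=int)
    nmr_dB = np.array(smr_dB, dtype=float)   # noise-to-mask ratio with 0 bits
    while budget_bits >= samples_per_band:
        k = int(np.argmax(nmr_dB))           # neediest band
        if nmr_dB[k] <= 0 or b[k] >= b_max:  # everything inaudible (or saturated)
            break
        b[k] += 1
        nmr_dB[k] -= 6.02                    # one more bit buys ~6 dB of SNR
        budget_bits -= samples_per_band      # each band codes 12 samples
    return b

print(greedy_allocation([25.0, 14.0, 3.0, -6.0], budget_bits=100))  # [4 3 1 0]
```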

7.2.3. Quantization

Next, the quantization of each component of these normalized vectors is realized using uniform scalar quantizers on the range [−1, +1], with quantization steps that are a function of bk.

Each sub-band signal has a certain number of possible quantizers. Each quantizer is characterized by a number L of quantization steps and a signal-to-noise ratio (a point in the rate-distortion plane). The values adopted in the ISO standard for the signal-to-noise ratio as a function of the number of quantization steps are given in Table 7.1. For the purposes of comparison, the theoretical signal-to-noise ratio for a uniform source is also given in the table. This is obtained via equation [1.5] with c(l) = 1.

Notice that the number of quantization steps is always an odd number so that the value 0 is always a possible reproduction value 2. The resulting problem is that, for low values of L, there is a significant difference between b = log2 L and the next largest integer ⌈log2 L⌉. There is therefore a risk of wasting bits. In the ISO standard,

Table 7.1. Signal-to-noise ratio as a function of the number of quantization steps


this problem is solved by grouping three samples if L ≤ 9. The number of bits used is therefore given by:

b = ⌈log₂(L³)⌉ / 3

The fourth column in Table 7.1 gives the resolution b (the equivalent number of bits per sample) as a function of L.
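For instance, this arithmetic can be checked in a few lines (nothing coder-specific is assumed here):

```python
import math

for L in (3, 5, 7, 9):
    group_bits = math.ceil(math.log2(L ** 3))   # one code word for 3 samples
    naive_bits = 3 * math.ceil(math.log2(L))    # 3 separate code words
    print(L, group_bits / 3, naive_bits / 3)
# L=3: 1.67 bits/sample instead of 2; L=5: 2.33 instead of 3;
# L=9: 3.33 instead of 4 -- the grouping recovers the fractional resolution
```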

Table 7.2 indicates the permitted quantizers in each sub-band. They are indicated by the number of quantization steps they have [NOR 93, page 52].

Notice that the sub-bands are split into five groups. The first two groups, corresponding to frequencies between 0 and 7.5 kHz, accept 16 quantizers. It is therefore necessary, in the first instance, to dedicate 4 bits to specifying the number of the selected quantizer. This number is denoted NoQk. These first two groups are distinguished by having different configurations of possible quantizers. The third group allows 8 quantizers and the fourth allows 4. Sub-bands 28–32 are not coded.

In order to reconstruct the signal, the receiver needs to know not only the code words associated with each component of the vectors yk and with the scale factors, but also the bit allocation realized at the transmitter. This last piece of information must therefore be transmitted in the bitstream. It takes 11 × 4 + 12 × 3 + 4 × 2 = 88 bits. Assuming, for example, that 20 out of the possible 27 sub-bands are transmitted (the last five are never transmitted), the side information (bit allocation + scale factors) represents 88 + 120 bits.

Table 7.2. Number of possible quantization steps per sub-band

At 96 kbit/s, the number of bits that remain for coding the normalized sub-band amplitudes is 96 × 384/44.1 − 88 − 120 ≈ 628 bits. At 64 kbit/s, only about 350 bits are available. We can see that the number of bits reserved for the "noble" part decreases considerably. We can therefore understand why this coder is efficient at reducing the bit rate until a critical bit rate is reached, below which the quality degrades rapidly.
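The frame bit budget above can be verified with a short function (all numbers come from the text):

```python
def frame_budget(bitrate_bps, frame_samples=384, fe=44100, side_info_bits=88 + 120):
    total = bitrate_bps * frame_samples / fe    # bits available per frame
    return total - side_info_bits               # bits left for sample code words

print(round(frame_budget(96_000)))   # 628 bits at 96 kbit/s
print(round(frame_budget(64_000)))   # ~350 bits at 64 kbit/s
```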

7.3. MPEG-2 AAC coder

All of the technical details of this coder can be found in [INT 97]. A more student-friendly description is available in the article [DER 00]. Here we present only the principles.

Figure 7.6 shows a piano signal in the time domain (Figure 7.6(a)) and in the frequency domain (Figure 7.6(b)). This spectral representation is obtained simply by taking the moduli of the components of the vector y(m) and expressing them in dB. The transform used in the MPEG-2 AAC coder is the MDCT with N = 2048 and M = 1024.

Figure 7.6. Piano signal (a) in the time domain, (b) in the frequency domain


The graph in Figure 7.7 again represents the moduli of the components of y(m), but on the x-axis the frequencies are expressed in Bark and the y-axis is linear. This has the effect of zooming in on the low frequencies and strongly amplifying the high values.

Firstly, we can observe that the MDCT is a real-valued transform but the values are not necessarily positive. The MPEG-2 AAC coder processes the signs separately 3. Quantizing the moduli of the components of the vector y(m) is carried out as follows. Assume that M parameters g(k), referred to as the scale factors, are known. At the transmitter, a vector of integers is calculated using the formula:

i(k) = round{[|y(k)|/g(k)]^(3/4)}

Note that at the receiver, knowing i(k) and g(k), we can reconstruct:

|ŷ(k)| = g(k) i(k)^(4/3)

Figure 7.7. Moduli of the components of the vector y(m), with frequencies in Bark on the x-axis and a linear y-axis

Then we obtain ŷ(k) by restoring the sign information, but a reconstruction error arises from the rounding operations. The graphs in Figure 7.8 show, for a given choice of scale factors, the vector g(k) (the stepped curve in the left-hand graph) and the vector of integers i(k) (on the right-hand side). From real values in the range of 0 to 10,000, we pass to integer values between 0 and 4.
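A sketch of this quantization/reconstruction pair, under the simplifying assumption of one scale factor per coefficient as in the formulas above (the real coder shares one scale factor per band of coefficients):

```python
import numpy as np

def aac_quantize(y, g):
    """i(k) = round((|y(k)|/g(k))^(3/4)); signs are handled separately."""
    return np.round((np.abs(y) / g) ** 0.75).astype(int)

def aac_reconstruct(i, g, sign):
    """|y_hat(k)| = g(k) * i(k)^(4/3), then the sign is restored."""
    return sign * g * i.astype(float) ** (4.0 / 3.0)

y = np.array([9500.0, -4200.0, 800.0, -60.0])
g = np.full(4, 2500.0)                   # one (illustrative) scale factor value
i = aac_quantize(y, g)                   # small integers, cheap to Huffman-code
y_hat = aac_reconstruct(i, g, np.sign(y))
print(i, np.abs(y - y_hat))              # rounding error grows with g
```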

Knowing the vector i = [i(0) ··· i(M − 1)], we can consider the coding problem and ask what this information costs in bits. In this example, since all the integers are lower than 8, 3 bits are enough to code each integer exactly. The total number of bits which is sufficient to represent the vector i is therefore M × 3. Is this total number of bits necessary? There is certainly a more economical solution using Huffman coding. In reality, the solution used in this coder is to partition the frequency axis into 51 bands which correspond roughly to half critical bands, to determine the maximum value taken by i(k) in each band, and then to realize a separate coding in each band using Huffman tables indexed on this maximum value and constructed from training data. The bitstream therefore carries the code words of the M integers i(k), the code words of the 51 maximum values, and the code words of the scale factors g(k).

Figure 7.8. (a) Vectors |y(k)| and g(k). (b) Vector i(k)

Now, only the scale factor vector g = [g(0) ··· g(M − 1)] needs to be determined. A priori we can see this as a simple optimization problem: it requires determining g while minimizing the reconstruction error power under the constraint that the necessary number of bits must be less than the number of bits available. This solution is not in fact realistic since, as we can observe, it only delivers low compression rates (of the order of 2) if we want the reconstructed signal to be transparent. To obtain higher compression rates (from 10 to 15), we must make use of the psychoacoustic results and noise shaping techniques. This means, in reality, that the minimization problem has two constraints: one constraint on the bit rate, that the number of bits needed must be less than the number of bits available, and the psychoacoustic constraint Sq(f) < Φ(f) for all f. The optimization algorithm proposed in the standardization document is a gradient-type algorithm based on the following property: if the scale factor g(k) is increased, the integer i(k) is reduced. The bit rate is therefore reduced, but the reconstruction noise power increases in the frequency zone characterized by the index k. Recall that the proposed algorithm is optional and does not form part of the standard.
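The property above suggests an iterative adjustment of the following kind. This is a loose illustrative rendering, not the algorithm of the standardization document; the bit-count estimate and the headroom criterion are stand-ins:

```python
import numpy as np

def adjust_scale_factors(y, phi_band, bits_available, steps=100):
    """Coarsen g(k) where the masking headroom is largest, until the
    (estimated) bit demand fits the budget: rate vs. audibility trade-off."""
    g = np.full(len(y), 1.0)                    # start with fine quantization
    for _ in range(steps):
        i = np.round((np.abs(y) / g) ** 0.75)
        bits = np.sum(np.ceil(np.log2(i + 2)))  # crude per-coefficient bit estimate
        if bits <= bits_available:
            break                               # budget met
        noise = (np.abs(y) - g * i ** (4 / 3)) ** 2
        k = int(np.argmin(noise / phi_band))    # band with most masking headroom
        g[k] *= 2 ** 0.25                       # coarser step there: fewer bits
    return g

y = np.abs(np.random.randn(16)) * 1000
g = adjust_scale_factors(y, phi_band=np.full(16, 50.0), bits_available=60)
```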

As the scale factors must be transmitted, they use up part of the available bit rate. It is unreasonable to transmit M scale factors. The previous partition of the frequency axis into 51 bands is also used for the scale factors, which explains the stepped curve in Figure 7.8. The first scale factor is coded on 8 bits; the 50 following factors are coded by applying a Huffman table to the differences Δ(k) = g(k) − g(k − 1).
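For example, a sketch of this differential scheme, with the Huffman stage replaced by a plain list of differences (the table contents are not specified in this chapter):

```python
def delta_encode(sf):
    """First scale factor sent as-is (8 bits); the rest as differences,
    which are small and peak around 0 -- ideal input for a Huffman table."""
    return sf[0], [sf[k] - sf[k - 1] for k in range(1, len(sf))]

def delta_decode(first, deltas):
    sf = [first]
    for d in deltas:
        sf.append(sf[-1] + d)
    return sf

first, deltas = delta_encode([60, 62, 62, 58, 59])
assert delta_decode(first, deltas) == [60, 62, 62, 58, 59]
```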

Note that the MPEG-2 AAC coder is a closed-loop perceptual coder. The algorithm for determining the scale factors makes use of the real value of the reconstruction error produced by the rounding operations, and not an assumed value.

This presentation highlights only the major principles of the method. A more detailed study is necessary if we wish to have a good understanding of this coder. A number of points can be explained further. Let us take just one example and examine the graph in Figure 7.9.

The music signal shown is a piano signal which clearly shows the attack of a note, that is, a significant rise in power. As long as the signal is relatively stationary within an analysis frame, we are primarily interested in a good frequency resolution, which requires a long analysis frame. In the frame(s) where the signal is not stationary, we are rather more interested in a good temporal resolution. We therefore need shorter analysis frames. In the MPEG-2 AAC coder, a frame of N = 2048 samples is divided into eight sub-frames of 256 samples. In order to respect the perfect reconstruction conditions given by equation [3.2], transition frames must be introduced upstream and downstream. Note that this requires a criterion which allows switching between long frames and short frames. The criterion proposed 4 in the standardization document is rather subtle. It makes use of the prediction gain: during a rise in power, the signal is not very predictable. This long frame/short frame switching gives good results but has a significant effect on the reconstruction delay. In effect, we can only decide on switching after having completely observed a rise in power, and in this case we must introduce a transition frame upstream of the rise in power. Recall that long reconstruction delays are incompatible with bi-directional communications, which explains the existence of a version of the AAC coder in MPEG-4 called low delay, in which the notion of short frames has been dropped.

Figure 7.9. Switching between long and short frames

7.4. Dolby AC-3 coder

This coder [ADV 95] has a classic architecture. The time-frequency transform is realized by an MDCT with overlapping frames of 512 samples. The frame length is halved if a significant change in power is observed in the second half of the frame. The novelty in this coder is the use of the floating point format (mantissa + exponent) to code the transform coefficients. A rounding operation is realized on the mantissa with a precision which is a function of the signal-to-mask ratio. The exponent plays the role of the scale factor. The specific bit allocation information is not transmitted directly: it is reconstructed at the decoder from the spectral envelope, which is coded as in a speech coder.
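A small sketch of the mantissa/exponent idea (the precision choice and the rounding are illustrative; the actual AC-3 mantissa quantizers and exponent coding are more elaborate):

```python
import math

def float_code(y, mantissa_bits):
    """Represent y as mantissa * 2**exponent; the exponent acts as the
    scale factor, and the mantissa is rounded with a precision that the
    coder would derive from the signal-to-mask ratio."""
    if y == 0:
        return 0, 0
    e = math.frexp(abs(y))[1]                  # exponent such that |y| < 2**e
    q = 2 ** (mantissa_bits - 1)
    m = round(y / 2 ** e * q)                  # rounded mantissa
    return m, e

def float_decode(m, e, mantissa_bits):
    return m / 2 ** (mantissa_bits - 1) * 2 ** e

m, e = float_code(0.8, mantissa_bits=4)
print(float_decode(m, e, 4))                   # 0.75, close to 0.8
```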

7.5. Psychoacoustic model: calculating a masking threshold

7.5.1. Introduction

In this section, some information is presented as simply as possible to allow us to understand the significance of masking and how to calculate a masking threshold. The only psychoacoustic result of interest here is how to deduce, from the spectral characteristics of a music signal, the spectral density of a noise which can be added to the original signal without being perceptible to the listener. For further information on the field of sound perception, consult, for example, [ZWI 81, MOO 82].

7.5.2. The ear

We know that the ear consists of three parts: the outer ear, formed by the pinna and the auditory canal; the middle ear, comprising the eardrum and the auditory ossicles; and the inner ear, which comprises the cochlea. The outer and middle ear carry out a bandpass filtering function for frequencies between 20 Hz and 20 kHz, whose frequency response modulus is directly linked to the absolute threshold of hearing. They also perform an impedance adaptation between the propagation of waves through air and through the liquid medium of the inner ear. The inner ear consists of the cochlea, which contains a liquid in which the basilar membrane is bathed. This membrane is stimulated by the movement of the stirrup. Once unrolled, it can be seen as a straight segment of a given length (3.5 cm) carrying a large number of nerve cells. All sound excitations make this membrane vibrate. A pure sound, a sinusoidal waveform of a given frequency f1, makes this membrane vibrate around a particular position k1 of the basilar membrane, and this vibration produces excitation in the neurons of the auditory nerve. Thus a frequency-to-position transformation occurs.

Zwicker [ZWI 81] states that the position k1 is linked to the frequency f1 by a non-linear relation. This relation becomes linear if the frequency is expressed on a new scale known as the Bark scale. This new frequency scale is directly linked to the notion of critical bands presented in section 7.5.3. If k is the position of a particular fiber along the basilar membrane and SE(k) is the amount of power measured by this fiber, the spread function SE(k), which is the starting point of the masking threshold calculation as we will see in section 7.5.4, reflects the spread and the power of the excitation created in the auditory system by the exciting sound.

Zwicker showed that the hypotheses he made lead to taking a filter bank as a representation of the auditory system (neglecting the initial bandpass filter). In effect, imagine a sinusoid x(n) = a sin(2πf1n) entering M filters with frequency responses Hk(f). The M sub-band signals are expressed as:

yk(n) = a |Hk(f1)| sin(2πf1n + φk)

They have power a²|Hk(f1)|²/2. The term |Hk(f1)|² represents the power transmitted by the pure excitation sound of frequency f1 to the kth position of the basilar membrane. If we plot |Hk(f1)|² as a function of k, we obtain the spread function SE(k) associated with f1.

7.5.3. Critical bands

Different experiments have led to the same understanding of critical bands and provide approximately the same bandwidths. Only one is described here, which is directly related to the absolute threshold of hearing. Two pure, simultaneous sounds of different frequencies can be detected at a power weaker than that required for either alone, on the condition that the frequency gap between them is not greater than a given value. More precisely, consider a sinusoidal waveform at a frequency f1. If the power a1²/2 of this sinusoid satisfies a1²/2 > Sa(f1), where Sa(f) is the absolute threshold of hearing, this sinusoidal waveform is audible. Imagine a second sinusoidal waveform with frequency f2 close to f1 and with a power identical to the first one. Experimental measurements show that both sinusoidal waveforms are audible provided that a1²/2 + a2²/2 > Sa(f1). Consider N sinusoidal waveforms with regularly spaced frequencies f1 < f2 < ··· < fN and with the same power. The preceding property (the set of N sinusoidal waveforms is audible provided that the sum of their powers exceeds Sa(f1)) holds on the condition that the bandwidth Δf = fN − f1 is less than a threshold known as the critical bandwidth in the neighborhood of the frequency f1. In conclusion, this experiment shows that we can accept that the power perceived by the ear in a critical band is the sum of all the powers of the components in this frequency band: the ear integrates powers within a critical band. We can also say that the ear behaves as a filter bank which makes an irregular partition of the frequency axis and then sums the power in each sub-band.

The audible band is divided into 24 critical bands for which the frequency partitioning is given in Table 7.3 [ZWI 81]. We can define a new frequency scale, the Bark scale 5, which corresponds simply to the critical band number.

Table 7.3. Critical bands. Frequencies are expressed in Hz


In this table, the 24 critical bands are put next to one another. In fact, the ear can create a critical band around any frequency. The relation between a frequency expressed in Hz in the range [20 Hz–20 kHz] and a frequency expressed in Bark in the range [1–24] can be approximately expressed as follows:

z = 13 arctan(0.00076 f) + 3.5 arctan[(f/7500)²]   (f in Hz, z in Bark)
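In code, this conversion reads as follows (the constants are those of the formula above):

```python
import math

def hz_to_bark(f_hz):
    """Zwicker's approximation of the Hz -> Bark mapping."""
    return 13 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500) ** 2)

for f in (100, 1000, 4000, 10000):
    print(f, round(hz_to_bark(f), 1))   # ~1.0, ~8.5, ~17.3, ~22.4 Bark
```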

7.5.4. Masking curves

The ear has the property of not perceiving a sound in the presence of another sound provided that the characteristics of the two sounds have a particular relation. This is the masking phenomenon. There are two types of masking of one sound by another. We speak of temporal masking when two sounds appear consecutively and frequency masking when two sounds appear simultaneously. The temporal masking phenomenon can be of interest in coding but is not presented here. Let us introduce frequency masking.

Recall that, in a perfectly silent environment, a sinusoidal waveform of frequency f1 is audible only under the condition that its power is greater than the absolute threshold of hearing at that frequency. Consider now the case of two sinusoidal waveforms. The first, the masking sinusoidal waveform, has a particular frequency f1 and a given power a1²/2, much higher than Sa(f1). For all frequencies f2 in the audible band, the power a2²/2 of the second sinusoidal waveform, the masked sinusoidal waveform, is measured at the limit of audibility in the presence of the first. The resulting function of f2 is called the masking curve or the mask effect curve. The form of this masking curve is given in Figure 7.10(a) for a sinusoidal masking sound.

Figure 7.10. (a) Masking curve when the masking sound is sinusoidal. (b) Influence of frequency on the shape of the masking curve


The masking curves of one sinusoidal waveform by another are not the only curves of interest. Psychoacoustics also gives us precise results for the case in which the masked signal is a narrow-band noise. Furthermore, since a music signal can be considered to be composed of a number of pure sounds, which can be modeled by sinusoidal waveforms, and of less pure sounds, which can be modeled by narrow-band noises, it is necessary to examine the following four cases: a sinusoid masked by a sinusoid, a sinusoid masked by a narrow-band noise and, more particularly, a narrow-band noise masked by a sinusoid and a narrow-band noise masked by a narrow-band noise. The curves shown in Figure 7.10 are, in reality, the masking curves of a sinusoidal waveform by a narrow-band noise; they have been obtained via simulation by applying psychoacoustic model number 1 from MPEG.

All of these curves, whatever case is considered 6, have the same triangular appearance. Let us examine these curves more closely and study the influence of the parameters f1 and a1²/2 on the masking curve. The first property of these curves is that they reach a maximum at the frequency f1. We notice in Figure 7.10 that the power of the masked sound at frequency f1 is lower than the power a1²/2 of the masking sound. The difference between the two is called the masking index. The second property is that they show a significant asymmetry. The gradient of the curves is steeper toward frequencies lower than the frequency of the masking sound than toward higher frequencies. The gradient of the curves depends on the frequency f1 of the masking sound: it becomes less steep as the frequency f1 increases. If the frequency is expressed in Bark and the power in dB, we can show that the masking curves can be modeled as straight lines whose gradients are independent of the frequency f1. Unfortunately, the gradient on the high-frequency side remains dependent on the power a1²/2: it becomes less steep as this power increases.

7.5.5. Masking threshold

The preceding exposition is not yet enough for us, since it applies only to the case where the masking and masked sounds are pure sounds or narrow-band noises with bandwidths less than or equal to that of a critical band. On the one hand, an audio signal is composed of an infinity of contributions: the power spectral density SX(f) characterizes the relative importance of each contribution. On the other hand, we are looking to determine a signal at the limit of audibility, the quantization noise, and this signal also has an infinity of contributions.

Two types of problem arise. We are in the presence of numerous masking sounds: how can we "add them up" to determine the power of the masked sound right at the limit of audibility? We are in the presence of numerous masked sounds: how can we generalize? These are difficult problems; they are addressed only briefly in the literature written by experts in acoustics since they are specific to the coding application 7. Psychoacoustic experts do not know how to describe the masking phenomenon with precision when the number of masking components is greater than 2 or 3, while we need to be able to process the case of 512 or even 1024 masking components. This is the problem of the "laws of addition" in psychoacoustics that we have just mentioned. The available coding models have been determined by engineers who have defined the numerical values of a few parameters after a very large number of listening tests with a complete coder.
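As an illustration of this pragmatic approach, here is a toy threshold computation in the spirit of MPEG psychoacoustic model 1: each spectral component spreads a triangular (in dB, on a Bark axis) masking contribution, the contributions are summed in power, and the result is floored by the absolute threshold. The slopes, the masking index and the absolute-threshold term are illustrative stand-ins, not the values of the standard:

```python
import numpy as np

def hz_to_bark(f):
    return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500) ** 2)

def masking_threshold(freqs, powers_dB,
                      slope_low=27.0, slope_high=12.0, index_dB=10.0):
    """Sum, in the power domain, one triangular masking contribution per
    component; slopes in dB/Bark, index_dB is the masking index."""
    z = hz_to_bark(freqs)
    thr_pow = np.zeros(len(freqs))
    for zm, pm in zip(z, powers_dB):
        dz = z - zm
        contrib = pm - index_dB - np.where(dz < 0, slope_low * -dz,
                                           slope_high * dz)
        thr_pow += 10 ** (contrib / 10)          # "law of addition": power sum
    thr_dB = 10 * np.log10(thr_pow)
    absolute_dB = 3.64 * (freqs / 1000) ** -0.8  # crude absolute threshold term
    return np.maximum(thr_dB, absolute_dB)

f = np.linspace(50, 10000, 512)
S_dB = np.full(512, -20.0); S_dB[100] = 70.0     # one dominant component
phi = masking_threshold(f, S_dB)
```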

An example of the masking threshold in the band [0–10] kHz, calculated from a violin signal, is given in Figure 7.3. It uses psychoacoustic model number 1 of the MPEG-1 standard, provided as an option in the standardization document.


1. To show clearly that these are signals with vector components that can be interpreted as frequencies, we have kept the notation used in section 3.2 rather than the notation introduced in this chapter.

2. This is for midtread quantizers [JAY 84, p. 117].

3. In reality, not always! We do not go into the details here. We no longer talk in terms of coding the signs.

4. NB: this is not part of the standard.

5. There is also another scale, the MEL scale, which is approximately 100 times finer than this.

6. Except for the masking of a sinusoid by a sinusoid, which introduces beating phenomena when the frequencies of the two pure sounds are close; this case is, however, of no use in coding.

7. And, more recently, in watermarking applications.
