As noted in [9], one of the key measurements used in speech processing is the short-term spectrum. In all of its many forms, this measure consists of some kind of local spectral estimate, typically measured over a relatively short region of speech (e.g., 20 or 30 ms). This measure has been shown to be useful for a range of speech applications, including speech coding and recognition. In each case, the basic notion is that of capturing the time-varying spectral envelope for the speech, and in each case it is desirable to reduce the effects of pitch on this estimate; either pitch is used separately (as with a vocoder or a tone language speech-recognition system), or it is generally discarded as irrelevant to the discrimination (as in most English language speech-recognition systems). Therefore, in speech applications, the short-term spectral algorithm is usually designed to estimate a spectral envelope that has a reduced influence from the pitch harmonics in voiced speech.

In this chapter and the following two, we will describe three basic approaches to the estimation of the short-term spectral envelope: filter banks, cepstral processing, and linear predictive coding (LPC). The first and oldest approach is that of temporally smoothed power estimates from a bank of bandpass filters. Since much of the inspiration for such an approach originates from models of the human auditory system, we will begin with a discussion of the interpretation of the auditory system as a filter bank.

In Chapter 14 we displayed tuning curves of individual auditory nerve fibers and showed that the bandwidth increased with the CF (characteristic frequency) of the fiber. In Chapter 15 we discussed psychological tuning. Here we discuss filter-bank designs that can be used to model these aspects of the human auditory system. We review Fletcher's early experiments on critical bands. We then move to more recent experiments, in particular to Patterson's results, that lead to a specification of the shape as well as the bandwidth of auditory filters. Following this we discuss versions of the so-called gammatone filters, which are attempts at physical realizations of the tuning curves of Chapter 14. We conclude with a more informal discussion of some of the filter-bank designs that have been at least partially influenced by auditory system research.

As noted in Chapter 18, Harvey Fletcher and his collaborators at Bell Laboratories experimented extensively on human hearing [1]. In addition to the experiments on CVC syllable perception, they did extensive work on masking phenomena. In a test of what is called simultaneous masking, the listener was presented with a tone plus wideband noise. Initially, the tone was of low enough intensity so that it was not perceived. The intensity of the tone was then gradually increased until it was just barely perceived; this intensity was called the threshold intensity. As the noise bandwidth was decreased, no change took place in the threshold until a *critical band* was reached. As bandwidths decreased still further, the threshold of detection decreased.

Such experiments suggested the existence of an auditory filter in the vicinity of the tone that effectively blocks extraneous information from interfering with the detection of the tone. This vicinity is called a critical band and can be viewed as the bandwidth of each auditory filter. The experimental results showed that the width of a critical band increases with the frequency of the tone being masked. Thus, the results yielded important information about the bandwidth of the auditory filter, though not about its shape.

The quantitative result is shown in Fig. 19.1 (a repeat of Fig. 15.4, added here for convenience). The Bark scale of Fig. 19.1 (a good approximation to psychoacoustic critical band measurements) yields bandwidths that are below 200 Hz until the center frequency exceeds 1000 Hz. As center frequency increases above 1000 Hz, the Bark scale adheres closely to bandwidths that are logarithmic functions of the center frequencies. Thus, for frequencies above 1000 Hz, the data are another example of Weber's law, which states that our peripheral senses tend to follow a logarithmic law of sensation as a response to a stimulus.

Below approximately 800 Hz the bandwidths measured by Fletcher were fairly constant, with a bandwidth of approximately 100 Hz. More recently, Moore et al. [4] performed a different set of psychoacoustic measurements to estimate bandwidths for these low frequencies. These measurements seemed to show that the auditory filter bandwidths for low-frequency tones increased significantly between 100 and 800 Hz. The degree to which they increase is still a matter of some controversy among experts in the field.

Another result from Fletcher, based on very different psychoacoustic data (discussed in Chapter 18), related the articulation index (AI) to auditory bandwidths. As noted previously, the AI is a simple function (given by Eq. 18.3) of the average phone accuracy associated with a listener's response to CVC (consonant–vowel–consonant) nonsense syllables. Measurements were made by gradually increasing the bandwidth of a low-pass filtered version of these spoken syllables. As the bandwidth increased, the AI increased; in this way the AI could be directly associated with the speech bandwidth. As shown in Fig. 19.2, each mark indicated an equal increment in the AI, and we see from that figure that higher frequencies require a greater bandwidth to achieve the same AI increase as that of the lower frequencies. Thus, this test also leads to a model for which the auditory filters increase in bandwidth for higher frequencies.

Figure 19.3 shows two noise bands. The low-pass noise ranges from zero to 600 Hz, and the high-pass noise has a cutoff at 1200 Hz.

The listener's task is to detect a tone as its frequency varies from 400 to 1400 Hz. The small circles in the figure show the psychoacoustic results. The three curves shown correspond to three different hypotheses as to the auditory filter shape. Presumably, the assumed auditory filter shape can be varied until the computed threshold for any frequency is equal to the measured threshold. In this experiment, curves were found for three specific filter shapes: the rectangular filter (i.e., a filter with very steep transitions, approximating a rectangular frequency response) used by Fletcher, a standard resonance (a pole pair with resonance frequency of the tone), and a symmetric filter (to be discussed later in the chapter). We note that the rectangular filter assumption results in the greatest error relative to the psychoacoustic results. The data for this experiment are from [10].

In this experiment, the noise was kept fixed while thresholds were computed for different tone frequencies. This permitted fitting of the auditory filter shape given a fixed filter paradigm, but it did not allow for an arbitrary filter transfer function to be designed directly from the psychoacoustic measurements. Patterson [5] developed a somewhat more complex method by varying the width of the rectangular noise band shown in Fig. 19.4 and keeping the tone frequency fixed. For each choice of a noise bandwidth, he measured the signal threshold or the SPL that was required for the tone to just barely be heard. He began with the following mathematical representation:

$$P = K \int_{0}^{\infty} N(f)\,|H(f)|^{2}\,df, \tag{19.1}$$

where *P* is the tone power at the threshold, *N(f)* is the power spectrum of the noise, *H(f)* is the transfer function of the auditory filter, and *K* is the proportionality constant relating tone power at the threshold to the noise leaking through the filter, which is represented by the integral of the product of the noise spectrum and the filter shape.

If the noise spectrum is very close to rectangular, it can be removed from the integral, and Eq. 19.1 reduces to

$$P(W) = K N_{o} \int_{0}^{W} |H(f)|^{2}\,df, \tag{19.2}$$

where *W* is the cutoff frequency of the low-passed noise, and *N*_{o} is the constant noise power spectral level; *P* is thus a function of *W*. By differentiating Eq. 19.2, we obtain an explicit result for the auditory filter magnitude function:

$$|H(W)|^{2} = \frac{1}{K N_{o}}\,\frac{dP}{dW}. \tag{19.3}$$

Note that this method can yield the magnitude function of the auditory filter but not its phase.
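The differentiation step can be checked numerically. The sketch below is only an illustration under stated assumptions: it picks a hypothetical "true" filter shape, simulates the threshold *P(W)* that low-pass noise of cutoff *W* would produce under the flat-noise model, and recovers the squared magnitude with a finite difference. The function names and the values of `K`, `N_O`, `F0`, and `ALPHA` are assumptions made for the example, not values from the experiments.

```python
import math

K = 1.0        # proportionality constant (assumed value)
N_O = 1.0      # flat noise spectral level (assumed value)
F0 = 1000.0    # tone frequency in Hz (assumed value)
ALPHA = 100.0  # selectivity of the illustrative filter in Hz (assumed value)


def true_filter_power(f):
    """Hypothetical |H(f)|^2 standing in for the unknown auditory filter."""
    return 1.0 / (((f - F0) / ALPHA) ** 2 + 1.0) ** 2


def threshold_power(w, step=1.0):
    """Simulated threshold P(W) = K * N_o * integral from 0 to W of |H(f)|^2 df.

    The integral is evaluated with the trapezoid rule.
    """
    n = max(1, int(round(w / step)))
    h = w / n
    total = 0.5 * (true_filter_power(0.0) + true_filter_power(w))
    for i in range(1, n):
        total += true_filter_power(i * h)
    return K * N_O * total * h


def recover_filter_power(w, dw=2.0):
    """Central-difference estimate of |H(W)|^2 = (1 / (K * N_o)) * dP/dW."""
    slope = (threshold_power(w + dw) - threshold_power(w - dw)) / (2.0 * dw)
    return slope / (K * N_O)


if __name__ == "__main__":
    for w in (700.0, 900.0, 1000.0):
        print(w, recover_filter_power(w), true_filter_power(w))
```

With measured thresholds, the differentiation would amplify measurement noise, so a practical analysis would probably smooth or fit *P(W)* before taking the slope.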

We can see from Fig. 19.4 that for a value of *W* that is appreciably lower than the tone frequency, Eq. 19.3 results in a sensitive measure of *H(f)*. However, when the noise bandwidth *W* is in the vicinity of the tone frequency, sensitivity is poor. To combat this problem, Patterson introduced a variable high-pass noise and varied the cutoff of the noise from well above the tone to just below it.

The results, normalized about the tone frequency, are displayed in Fig. 19.5; tone sensitivities are displayed in terms of signal levels that were discernible by 75% of the subjects. The top display shows the skirts of the filter at the low-frequency side; the bottom displays the high-pass noise case. The tone frequencies used are shown as parameters, ranging from 0.5 to 8 kHz. The abscissa is also in kilohertz, and we clearly see the large spread of the skirts as the tone frequency increases.

The implicit assumption in all results discussed thus far is that the auditory filter whose shape is to be found is centered around the tone to be detected. Figure 19.6 illustrates a shortcoming of this method. What if the observer's auditory filter is off center, as shown in Fig. 19.6(b)? In that case less noise passes through the filter, which should lower the resultant threshold. A similar argument holds for high-pass noise. However, if off-center listening is really taking place, it becomes very difficult to use Eqs. 19.1–19.3 to compute *H(f)*.

Patterson [6] recognized this difficulty and devised a way to avoid it; the idea is illustrated in Fig. 19.6(c). Instead of separate low-pass and high-pass noise spectra, the listener is presented with notched wideband noise. As seen in the figure, off-frequency listening simply shifts the noise from one side of the noise source to the other, leaving the total masking noise the same. Patterson [5] was able to show that the derived auditory filters that led to the responses of Fig. 19.5 could be quite accurately represented by the so-called symmetric filter:

$$|H(f)|^{2} = \frac{1}{\left[\left(\dfrac{f - f_{0}}{\alpha}\right)^{2} + 1\right]^{2}},$$

where *f*_{0} is the tone frequency.

The parameter α is a measure of the filter selectivity; 1.29α is the 3-dB filter bandwidth. The function is symmetric on a linear frequency scale. If we define the bandwidth as BW and assume the symmetric filter response, and if we also maintain the previous assumption that the noise has a constant spectrum over its bandwidth, then we can show that

$$\frac{P}{N_{o}} = K\,\frac{\pi}{2}\,\alpha \approx 1.22\,K \cdot \mathrm{BW},$$

where *K* is the constant of proportionality given in Eq. 19.1; Patterson takes *K* to be 1.0 in the discussion in [5].

Thus, the symmetric filter predicts that when a tone is masked by wideband noise, the signal-to-noise (power) ratio at the threshold will be proportional to the bandwidth of the auditory filter centered at the tone frequency.
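The proportionality can be checked numerically. The sketch below assumes the symmetric filter has the form |*H*(*f*)|² = 1 / [((*f* − *f*_{0})/α)² + 1]², which is consistent with the 1.29α bandwidth quoted above, and that the masking noise is flat and wide compared with the filter. The function names and default values are assumptions made for the example.

```python
import math


def symmetric_filter_power(f, f0, alpha):
    """Assumed symmetric-filter shape: 1 / (((f - f0) / alpha)^2 + 1)^2."""
    return 1.0 / (((f - f0) / alpha) ** 2 + 1.0) ** 2


def three_db_bandwidth(alpha):
    """Distance between the two half-power points, in closed form.

    Setting |H|^2 = 1/2 gives ((f - f0) / alpha)^2 = sqrt(2) - 1.
    """
    return 2.0 * alpha * math.sqrt(math.sqrt(2.0) - 1.0)


def threshold_to_noise_ratio(f0, alpha, k=1.0, span=50.0, steps=20000):
    """P / N_o = K * integral of |H(f)|^2 df for flat wideband noise.

    Midpoint rule over f0 +/- span * alpha; this idealized calculation
    ignores the fact that physical frequencies cannot be negative.
    """
    lower = f0 - span * alpha
    df = 2.0 * span * alpha / steps
    total = 0.0
    for i in range(steps):
        total += symmetric_filter_power(lower + (i + 0.5) * df, f0, alpha)
    return k * total * df


if __name__ == "__main__":
    alpha = 100.0
    bw = three_db_bandwidth(alpha)
    ratio = threshold_to_noise_ratio(1000.0, alpha)
    print("3-dB bandwidth / alpha:", bw / alpha)
    print("(P / N_o) / BW:", ratio / bw)
```

Because the ratio of the integral to the 3-dB bandwidth does not depend on α, doubling the filter bandwidth doubles the predicted threshold-to-noise ratio under these assumptions.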

Figure 19.7 defines the gamma-tone filter in terms of its impulse response in the analog (continuous) domain. The name derives from the form of the envelope, which is an *N*th-order gamma function. Notice that there are four parameters in this formula. In particular, when ω_{r} and *b* are varied, these impulse response functions can implement filters of different center frequencies and bandwidths.
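Since Fig. 19.7 is not reproduced here, the following sketch assumes the commonly used gammatone form *g*(*t*) = *A* *t*^(*N*−1) e^(−*bt*) cos(ω_{r}*t* + φ) for *t* ≥ 0; this form and the function and parameter names are assumptions made for illustration. It shows how *b* sets the decay of the envelope (and hence the bandwidth) while ω_{r} sets the ringing frequency.

```python
import math


def gammatone_impulse_response(t, order, b, omega_r, phase=0.0, amplitude=1.0):
    """Assumed gammatone form: A * t^(N-1) * exp(-b t) * cos(omega_r t + phase)."""
    if t < 0.0:
        return 0.0  # causal filter: no response before the impulse
    envelope = amplitude * t ** (order - 1) * math.exp(-b * t)
    return envelope * math.cos(omega_r * t + phase)


def envelope_peak_time(order, b):
    """Time at which the gamma envelope t^(N-1) * exp(-b t) is largest."""
    return (order - 1) / b


def sampled_impulse_response(order, b, omega_r, sample_rate, duration):
    """Sample the impulse response at sample_rate (Hz) for duration (s)."""
    count = int(duration * sample_rate)
    return [
        gammatone_impulse_response(n / sample_rate, order, b, omega_r)
        for n in range(count)
    ]


if __name__ == "__main__":
    omega_r = 2.0 * math.pi * 1000.0  # 1-kHz ringing frequency (assumed)
    b = 2.0 * math.pi * 125.0         # decay parameter (assumed)
    h = sampled_impulse_response(4, b, omega_r, 16000.0, 0.025)
    print(len(h), envelope_peak_time(4, b))
```

Increasing `b` makes the envelope peak earlier and decay faster, which corresponds to a wider filter; changing `omega_r` moves the center frequency without altering the envelope.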

The Laplace transform of the gamma-tone filter impulse response is a function with both poles and zeros. Its frequency response (magnitude for *s* = *j*ω) is displayed as GTF (gamma-tone filter) in Fig. 19.8 for five different ratios of *b* to ω_{r}. Also shown in the figure are results for the APGF (all-pole gamma-tone filter) and the OZGF (one-zero gamma-tone filter). All three cases are displayed for the same five values of the parameters.

By discarding the zeros of the original GTF to produce the APGF, Lyon [3] claims that an improvement in auditory modeling is obtained:

- The APGF is simpler and more well behaved.
- The APGF provides a more robust foundation for modeling auditory data.
- The low-frequency tail of the APGF is unaffected by the bandwidth parameter, unlike the awkward behavior of the GTF.
- The APGF has a very simple implementation; in the digital (or analog) domain, implementation consists of a cascade of second-order sections.

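The APGF has the following Laplace transform:

$$H(s) = \frac{K}{\left[(s + b)^{2} + \omega_{r}^{2}\right]^{N}}, \qquad K = \left(b^{2} + \omega_{r}^{2}\right)^{N},$$

where the choice of *K* gives unity gain at *s* = 0.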

As seen in Fig. 19.8, the APGF has a flat unity gain at very low frequencies. Lyon [3] states “... it is not necessarily desirable. A sloped but otherwise linear tail can be obtained by adding ... a zero at *s* = 0.” This is called the OZGF; the linear tail can be observed in the figure.

It's worth remarking that the Patterson *symmetric* filter is an APGF for *N* = 2.
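The low-frequency behavior of the two variants can be illustrated with a short sketch. It assumes that the APGF has *N* repeated complex pole pairs at *s* = −*b* ± *j*ω_{r}, normalized to unity gain at zero frequency, and that the OZGF multiplies this response by a zero at *s* = 0. The division by ω_{r} in the OZGF is an arbitrary scaling chosen here to keep the value dimensionless; it and the function names are assumptions.

```python
import math


def apgf_response(omega, order, b, omega_r):
    """Assumed APGF: K / ((s + b)^2 + omega_r^2)^N at s = j*omega, unity DC gain."""
    s = complex(0.0, omega)
    gain = (b * b + omega_r * omega_r) ** order
    return gain / ((s + b) ** 2 + omega_r ** 2) ** order


def ozgf_response(omega, order, b, omega_r):
    """Assumed OZGF: the APGF with one added zero at s = 0 (scaled by 1/omega_r)."""
    s = complex(0.0, omega)
    return (s / omega_r) * apgf_response(omega, order, b, omega_r)


def magnitude_db(value):
    """Magnitude of a complex response in decibels."""
    return 20.0 * math.log10(abs(value))


if __name__ == "__main__":
    omega_r = 2.0 * math.pi * 1000.0
    b = omega_r / 8.0
    for hz in (10.0, 100.0, 1000.0, 2000.0):
        w = 2.0 * math.pi * hz
        print(hz,
              round(magnitude_db(apgf_response(w, 4, b, omega_r)), 2),
              round(magnitude_db(ozgf_response(w, 4, b, omega_r)), 2))
```

Well below ω_{r} the APGF magnitude stays near 0 dB, whereas the OZGF magnitude rises in proportion to frequency, which is the sloped tail described above.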

Roex (rounded exponential) filters were introduced by Patterson and Nimmo-Smith [7]. One of their versions is described by the following equation:

$$|H(g)|^{2} = (1 + pg)\,e^{-pg},$$

where *p* is a parameter defining the filter and *g* is a normalized frequency value, where *g* = 0 for the center (peak) frequency.

Figure 19.9 (adapted from Lyon's paper) compares the responses of an APGF and two roex approximations to auditory filters. We can see that the roex filters are more symmetric, whereas the APGF has a slower rise and a steeper fall. Thus, a well-designed APGF tends to more closely resemble neural and psychoacoustic tuning curves.
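The symmetry of the roex shape can be seen in a small sketch. It assumes the single-parameter form (1 + *pg*) e^(−*pg*) with *g* = |*f* − *f*_{c}| / *f*_{c}, the normalized distance from the center frequency *f*_{c}; that form, the function names, and the example values are assumptions. The second function integrates the shape to obtain the width of a rectangular filter with the same peak and area, which for this form works out to 4*f*_{c}/*p*.

```python
import math


def roex_weight(f, fc, p):
    """Assumed roex(p) power weighting: (1 + p*g) * exp(-p*g), g = |f - fc| / fc."""
    g = abs(f - fc) / fc
    return (1.0 + p * g) * math.exp(-p * g)


def equivalent_rectangular_bandwidth(fc, p, span=40.0, steps=20000):
    """Area under the weighting (in Hz), by the midpoint rule in g.

    Integrates one side from g = 0 to g = span / p and doubles it, so the
    idealized result ignores the lower limit at zero frequency.
    """
    dg = (span / p) / steps
    total = 0.0
    for i in range(steps):
        g = (i + 0.5) * dg
        total += (1.0 + p * g) * math.exp(-p * g)
    return 2.0 * fc * total * dg


if __name__ == "__main__":
    fc, p = 1000.0, 25.0
    print(roex_weight(900.0, fc, p), roex_weight(1100.0, fc, p))
    print(equivalent_rectangular_bandwidth(fc, p))
```

Larger values of `p` give a sharper filter. Because the weighting depends only on |*f* − *f*_{c}|, this form is symmetric on a linear frequency scale, which is the property contrasted with the APGF above.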

Aside from the precise filter transfer function, other factors must be included in designing a set of filters that is, in some sense, a reasonable emulation of an auditory filter bank. A simple question to ask (though difficult to answer) is, How many filters should be used? Historically, this question has been answered by trial and error. In his choice of the Voder filter-bank design (see Chapter 2), Dudley was influenced by several factors. From psychoacoustic experiments by Fletcher and others, he decided that the filter bank should extend over the frequency range 300-3000 Hz. Then, since the Voder was to be controlled by a keyboard to switch on various spectral shapes, a suitable number of filters was 10, corresponding to the number of available fingers. This led to a bank of 10 bandpass filters, each with a width of 300 Hz.

At the time that Dudley conceived of and designed the first vocoder (the late 1930s), implementation required large and relatively expensive components, so there was a natural desire to keep the parts count low. Like the Voder, the first vocoder built had only 10 channels, each including a filter with 300-Hz bandwidth. Results, however, were not satisfactory. A decade later, Vaderson gave a lecture demonstration of a 30-channel vocoder that had excellent quality.

These early vocoders did not take advantage of the variable-frequency resolution of the ear. Later vocoders built at Bell Laboratories did have wider bandwidths for higher center frequencies, and this resulted in a reduction to 16 channels covering the same frequency range.

In a vocoder, the purpose of the analysis filter bank is to generate a reasonable estimate of the speech spectrum. The fulfillment of this goal is complicated by two important properties of speech: (a) its quasi-periodicity during voiced segments, and (b) its variation with time.

During voiced speech, the spectrum is very close to periodic; an idealized example is shown in Fig. 19.10. The top figure shows an example with many spectral lines. We see that the bank of narrow-band filters leads to a spectral estimate that follows the comblike properties shown, whereas a bank of wideband filters yields a spectral estimate that tends to follow the spectral envelope.

The spectrum of speech is time varying, but the *rate* of variation is based on articulator movements and is thus slow (of the order of 10-20 Hz). Thus, the spectral envelope at a given instant and frequency can be estimated from the output shown in Fig. 19.11. What should be the bandwidth of the low-pass filter? If made too narrow (e.g., 10 Hz), the output may not be able to follow spectral variations at that frequency. If made too wide, pitch ripple will appear at the low-pass filter output.
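The tradeoff in the low-pass bandwidth can be illustrated with a small simulation. The sketch assumes that a channel consists of a bandpass filter, a full-wave rectifier, and a low-pass smoothing filter (the usual channel-vocoder arrangement; Fig. 19.11 is not reproduced here). The two-pole resonator, the one-pole smoother, the synthetic "voiced" input, and all names and values are assumptions chosen for brevity rather than a recommended design.

```python
import math

FS = 10000.0  # sampling rate in Hz (assumed)


def voiced_test_signal(duration_s, pitch_hz, top_hz=4000.0):
    """Sum of equal-amplitude pitch harmonics, a crude stand-in for voiced speech."""
    harmonics = int(top_hz / pitch_hz)
    count = int(duration_s * FS)
    return [
        sum(math.cos(2.0 * math.pi * k * pitch_hz * n / FS)
            for k in range(1, harmonics + 1))
        for n in range(count)
    ]


def resonator_bandpass(x, center_hz, bandwidth_hz):
    """Two-pole resonator used as a simple bandpass channel filter."""
    r = math.exp(-math.pi * bandwidth_hz / FS)  # pole radius sets the bandwidth
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / FS)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = (1.0 - r) * sample + a1 * y1 + a2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out


def one_pole_lowpass(x, cutoff_hz):
    """First-order recursive smoother with the given approximate cutoff."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / FS)
    y = 0.0
    out = []
    for sample in x:
        y = (1.0 - a) * sample + a * y
        out.append(y)
    return out


def channel_envelope(x, center_hz, bandwidth_hz, smooth_hz):
    """Bandpass, full-wave rectify, then low-pass to estimate the channel level."""
    band = resonator_bandpass(x, center_hz, bandwidth_hz)
    rectified = [abs(v) for v in band]
    return one_pole_lowpass(rectified, smooth_hz)


def relative_ripple(envelope):
    """Peak-to-peak variation over the mean, measured after the start-up transient."""
    tail = envelope[len(envelope) // 2:]
    mean = sum(tail) / len(tail)
    return (max(tail) - min(tail)) / mean


if __name__ == "__main__":
    x = voiced_test_signal(0.4, 100.0)
    for cutoff in (20.0, 200.0):
        env = channel_envelope(x, 1000.0, 300.0, cutoff)
        print(cutoff, relative_ripple(env))
```

For this steady input, the wider smoothing filter passes more of the 100-Hz pitch component and so shows more ripple; a test signal whose spectrum changes over time would be needed to show the opposite failure, where a very narrow smoother lags behind the spectral variation.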

The incorporation of fast Fourier transform (FFT) programs as spectrum analyzers created a different set of design options. With the FFT it was easy to generate a high-resolution spectrum analysis, since the computation per spectral sample increases only logarithmically with the transform size (the total cost grows as *N* log *N*). However, the same issues that complicated filter-bank design still held for FFT analysis. For example, if we want the equivalent of 1024 bandpass filters, this can surely be implemented with a 1024-point FFT. However, if we implement this by taking 1024 samples under a rectangular window and then invoking the FFT program, the results are poor. Assume, for example, that the speech was originally sampled at 10 kHz. This means that 1024 samples correspond to approximately 100 ms of speech. During this time, the spectrum could have changed greatly, which means that this type of spectrum analysis will not track the natural spectral change in the speech. There are many tricks to overcome this problem. First, windowing only 20 ms of speech (200 samples) and then augmenting the input to the FFT with 824 zeros will still produce a spectrum with 1024 samples, but since the result is based on only 20 ms of speech, it is close to a snapshot of the 20-ms segment. Also, multiplying the 20-ms segment by a suitable window, e.g., Hamming or Kaiser, removes most of the artifacts produced by the abruptness of a rectangular window (see [8] for a more extended discussion of windowing).
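The zero-padding approach can be sketched in a few lines. The example uses a small recursive radix-2 FFT written out in full so that it is self-contained; a library FFT would normally be used instead. The function names, the 1-kHz test tone, and the choice of a Hamming window are assumptions made for illustration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddled = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + half] = even[k] - twiddled
    return out


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def short_time_spectrum(frame, fft_size=1024):
    """Window a short frame, zero-pad it to fft_size, and return the magnitudes.

    fft_size must be a power of two and at least len(frame).
    """
    window = hamming(len(frame))
    padded = [s * w for s, w in zip(frame, window)]
    padded += [0.0] * (fft_size - len(frame))
    return [abs(v) for v in fft(padded)]


if __name__ == "__main__":
    fs = 10000.0  # sampling rate in Hz
    frame = [math.cos(2.0 * math.pi * 1000.0 * n / fs) for n in range(200)]  # 20 ms
    spectrum = short_time_spectrum(frame)
    half = spectrum[:513]
    peak_bin = max(range(len(half)), key=lambda k: half[k])
    print(len(spectrum), peak_bin, peak_bin * fs / 1024.0)
```

Zero padding only interpolates the spectrum of the 20-ms frame onto a finer frequency grid; the underlying resolution is still set by the 200-sample window.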

Psychoacoustic and physiological research have made it possible to estimate the frequency resolution properties of the human auditory system. This research has helped influence the design and implementation of many kinds of spectral analysis methods that look carefully at the speech spectrum. In addition to frequency resolution issues, there are also highly significant temporal issues. Designers need to consider both time and frequency design issues in the context of the specific applications.

**19.1** Explain why the auditory bandwidths obtained by psychological measurements do not necessarily agree with the tuning curves of auditory neurons (as described in Chapter 14).

**19.2** Explain why the bandwidths determined by the equal articulation index differ from the critical bands of Fig. 19.2.

**19.3** Explain why Fletcher's critical band experiments were not able to predict the shape of the auditory filters.

**19.4** Can you present one or more physiological explanations of why auditory filter bandwidths increase with frequency? Your answer can be speculative but should be buttressed with some facts.

**19.5** High-frequency hearing loss in normal hearing adults increases with age. Give one or more explanations.

**19.6** The AI devised by Fletcher was based on listeners’ responses to nonsense CVC syllables. Why did Fletcher choose nonsense CVC's instead of CVC's from a natural language such as English?

**19.7** Develop a mathematical model of Patterson's notched noise experiment (Fig. 19.6). Derive an expression for the auditory filter as a function of the measured threshold of detection of a tone when (a) there is no off-frequency listening and (b) when the listener performs off-frequency listening with an increment δ*f*.

1. Fletcher, H., “Auditory patterns,” *Rev. Modern Phys.* **22**: 47, 1940.
2. Kingsbury, B. E. D., Perceptually Inspired Signal Processing Strategies for Robust Speech Recognition in Reverberant Environments, PhD Thesis, U.C. Berkeley, 1998.
3. Lyon, R. F., “The all-pole gammatone filter and auditory models,” from *Computational Models Signal Process. Audit. Syst.*, Forum Acusticum ‘96, Antwerp, Belgium, 1996.
4. Moore, B. C. J., Peters, R. W., and Glasberg, B. R., “Auditory filter shapes at low center frequencies,” *J. Acoust. Soc. Am.* **88**: 132-140, 1990.
5. Patterson, R. D., “Auditory filter shape,” *J. Acoust. Soc. Am.* **55**: 802-809, 1974.
6. Patterson, R. D., “Auditory filter shapes derived with noise stimuli,” *J. Acoust. Soc. Am.* **59**: 640-654, 1976.
7. Patterson, R. D., and Nimmo-Smith, I., “Off-frequency listening and auditory filter asymmetry,” *J. Acoust. Soc. Am.* **67**: 229-245, 1980.
8. Rabiner, L. R., and Gold, B., *Theory and Applications of Digital Signal Processing*, Prentice–Hall, Englewood Cliffs, N.J., 1975.
9. Schafer, R. W., and Rabiner, L. R., “Digital representations of speech signals,” *Proc. IEEE* **63**: 662-667, 1975.
10. Webster, J. C., Miller, P. H., Thompson, P. O., and Davenport, E. W., “The masking and pitch shifts of pure tones near abrupt changes in a thermal noise spectrum,” *J. Acoust. Soc. Am.* **24**: 147-152, 1952.
