CHAPTER 20


THE CEPSTRUM AS A SPECTRAL ANALYZER

20.1 INTRODUCTION

In Chapters 11 and 12, models of speech and music production were introduced. The basic structure of these models could be identified as an excitation that was input to a system of resonators; the convolution of the former with the impulse response of the latter component produced the approximation to the modeled speech or music signal. It is therefore natural to contemplate an analysis of the signal as a separation of two components corresponding to source (or excitation) and filter (or resonator) respectively. This is an example of a process that is often called deconvolution, or the separation out of a signal from an impulse response that has been convolved with it. In the channel vocoder, for example, the excitation is modeled as either a quasi-periodic pulse train (caused by vocal cord vibration) or a noise signal caused by turbulence. In Chapter 16 we studied how the auditory system perceives the pulse train component of the excitation; in Chapter 31, methods of detecting both the periodic and noisy components will be reviewed. In Chapter 19 there was a brief discussion of how a vocoder analyzer models the spectral envelope, which is a function of the vocal tract articulator positions. In summary, a channel vocoder separates excitation from the filter, and therefore goes some way towards performing deconvolution. Similarly, in speech recognition it is generally desirable to separate the filter information, which provides the major cues for phone classification, from the excitation characteristics, which, for American English at least, contain only limited phonetic information beyond the voiced–unvoiced distinction.

Cepstral analysis performs deconvolution through a mechanism that is quite different from those incorporated in a channel vocoder. In order to understand how cepstral analysis performs deconvolution, we first need to delve into a little relevant theory.

20.2 A HISTORICAL NOTE

Bogert et al. [1] may have been the first to use cepstral processing; in this case it was used for seismic analysis. While Bogert was performing his research, Alan Oppenheim, then an MIT graduate student, was working on a fairly complete mathematical theory that he called homomorphic processing. During a visit to Bell Labs by Oppenheim and one of this book's authors in the early 1960s, Bogert and Oppenheim exchanged ideas on the subject. Subsequently, Oppenheim became convinced that his concepts could be usefully applied to vocoder design. He later spent a 2-year sabbatical at the MIT Lincoln Laboratory (in the late 1960s) and developed a complete analysis–synthesis system based on what he called homomorphic (i.e., cepstral) processing. Further important work along these lines was carried out by Oppenheim et al. [3], [4] and also by Schafer [6], [5] and Stockham [7].

20.3 THE REAL CEPSTRUM

It is convenient to assume that the signal consists of a discrete time sequence, so that the spectrum consists of a z transform evaluated on the unit circle. Let us consider a speech example, with X referring to the spectrum of the observed speech signal, E to the excitation component (for instance, the glottal pulse train), and V to the vocal tract shaping of the excitation spectrum. We begin with a multiplicative model of the two spectra (the excitation and the vocal tract). Thus, the spectral magnitude of the speech signal can be written as

$$|X(\omega)| = |E(\omega)|\,|V(\omega)| \tag{20.1}$$

Taking the logarithm of Eq. 20.1 yields

$$\log|X(\omega)| = \log|E(\omega)| + \log|V(\omega)| \tag{20.2}$$

Particularly for voiced sounds, it can be observed that the E term corresponds to an event that is relatively extended in time (e.g., a pulse train with pulses every 10 ms), and thus it yields a spectrum that should be characterized by a relatively rapidly varying function of ω; in comparison, because of the relatively short impulse response of the vocal tract, the V term varies more slowly with ω. With the use of this knowledge, the left-hand side of Eq. 20.2 can be separated into the two right-hand-side components by a kind of filter that separates the log spectral components that vary rapidly with ω (the so-called high-time components) from those that vary slowly with ω (the low-time components). Such an operation would essentially perform deconvolution.

Equation 20.2 has transformed the multiplicative formula of Eq. 20.1 into an additive one, which can therefore be subjected to linear operations such as filtering. Since the independent variable is frequency rather than time, the terminology changes. Thus, for example, rather than filtering (for time), we have liftering (for frequency); instead of a frequency response, we have a quefrency response; and the DFT (or z transform or Fourier transform) of log |X(ω)| is called the cepstrum. The cepstrum is computed by taking the inverse z transform of Eq. 20.2 on the unit circle, yielding

$$c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|X(\omega)|\, e^{j\omega n}\, d\omega \tag{20.3}$$

where c(n) is called the nth cepstral coefficient. The deconvolutional properties of the cepstrum for speech can be visualized by using Fig. 20.1 depicting the sequence of operations from the speech wave to the cepstrum. Figure 20.1(d) shows the cepstrum, as defined by Eq. 20.3. The spectral envelope, which varies slowly with respect to frequency, yields large-valued cepstral coefficients for low values of n, but it dies out for high n. The spectral fine structure is more rapidly varying with ω, and it yields small-valued cepstral coefficients for small n, but large values beyond the crossover point shown in the figure. Thus, the contribution of the excitation and the vocal tract filter can (in principle) be separated in the cepstral domain. Both components can be inverted to generate the original spectral magnitudes.
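As a rough illustration of this separation, the following Python sketch computes a DFT-based approximation to the cepstrum of a synthetic voiced frame and looks for the excitation peak in the high-time region. The pulse spacing of 80 samples (10 ms at an assumed 8-kHz sampling rate), the single two-pole resonator standing in for the vocal tract, and all function names are assumptions made for this example only.

```python
import numpy as np

def real_cepstrum(frame, n_fft):
    """DFT approximation to the cepstrum: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame, n_fft)
    # The small floor guards against log(0); it is an implementation convenience.
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)
    return np.fft.ifft(log_magnitude).real

def synthetic_voiced_frame(period=80, n_periods=8, r=0.95, theta=0.2 * np.pi):
    """Impulse train (one pulse every `period` samples) through a two-pole resonator."""
    length = period * n_periods
    excitation = np.zeros(length)
    excitation[::period] = 1.0
    n = np.arange(200)
    # Truncated impulse response of 1 / (1 - 2 r cos(theta) z^-1 + r^2 z^-2).
    h = r ** n * np.sin((n + 1) * theta) / np.sin(theta)
    speech = np.convolve(excitation, h)[:length]
    return speech * np.hamming(length)

frame = synthetic_voiced_frame()
cepstrum = real_cepstrum(frame, n_fft=2048)

# Search the high-time region only; low quefrencies are dominated by the envelope.
search = np.arange(40, 200)
pitch_period = search[np.argmax(cepstrum[search])]
print("largest high-time cepstral peak at quefrency", pitch_period, "samples")
```

For this synthetic frame the largest high-time peak should fall at or very near the pulse spacing, while the resonator contributes mainly to the first few tens of coefficients.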

image

FIGURE 20.1 Cepstral analysis. The dotted lines in panes (b) and (c) indicate the inverse transform of the region of the cepstrum to the left of the dotted line in pane (d).

20.4 THE COMPLEX CEPSTRUM

Thus far, all our equations have used real functions. It is also possible to define a complex cepstrum that gives useful insight into properties of actual systems.

Let's start with a sequence x(n) that can be of finite or infinite duration. We assume that this sequence is the impulse response of a well-behaved linear system that can be described in terms of a ratio of z transforms. Then

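$$X(z) = \frac{A \prod_{k=1}^{M_i} (1 - a_k z^{-1}) \prod_{k=1}^{M_o} (1 - b_k z)}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})} \tag{20.4}$$

Given that a_k, b_k, and c_k are complex constants with magnitudes less than unity, Eq. 20.4 represents a digital network with poles inside the unit circle (to ensure stability), zeros inside the unit circle (first product in the numerator), and zeros outside the unit circle (second product in the numerator). Here M_i, M_o, and N_i denote the numbers of zeros inside, zeros outside, and poles inside the unit circle, respectively. A is simply a scaling factor.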

The logarithm of Eq. 20.4 is represented as

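$$\log X(z) = \log A + \sum_{k=1}^{M_i} \log(1 - a_k z^{-1}) + \sum_{k=1}^{M_o} \log(1 - b_k z) - \sum_{k=1}^{N_i} \log(1 - c_k z^{-1}) \tag{20.5}$$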

and the complex cepstrum x̂(n) of Eq. 20.5 is determined from

$$\hat{x}(n) = \frac{1}{2\pi j} \oint_C \log[X(z)]\, z^{n-1}\, dz \tag{20.6}$$

At this point, we state without proof (left as an exercise) that the complex cepstrum can be evaluated to be

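$$\hat{x}(n) = \begin{cases} \log A, & n = 0 \\ -\sum_{k=1}^{M_i} \dfrac{a_k^{n}}{n} + \sum_{k=1}^{N_i} \dfrac{c_k^{n}}{n}, & n > 0 \\ \sum_{k=1}^{M_o} \dfrac{b_k^{-n}}{n}, & n < 0 \end{cases} \tag{20.7}$$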

These equations lead to some interesting relations between pole-zero positions and complex cepstral values. For example,

  • If x̂(n) = 0 for n ≥ 0, this must correspond to an all-zero (FIR) filter with all the zeros outside the unit circle.
  • If x̂(n) = 0 for n < 0, this must correspond to a filter with all the poles and zeros inside the unit circle. This defines a minimum phase filter.
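These two properties can be checked numerically. The sketch below approximates the complex cepstrum with a DFT for two one-zero sequences and compares the result with the standard series values: −a^n/n for n > 0 when the zero lies inside the unit circle, and b^(−n)/n for n < 0 when it lies outside. The test values a = b = 0.5, the DFT length, and the function name are assumptions for illustration. Phase unwrapping is deliberately omitted, which is valid only because both test spectra stay in the right half plane.

```python
import numpy as np

N = 1024  # DFT length; long enough that aliasing of the cepstrum is negligible

def complex_cepstrum_no_unwrap(x):
    """Inverse DFT of the complex log spectrum.

    Valid only when the spectrum never crosses the negative real axis, as is
    the case for the two test sequences below, so the principal value of the
    complex logarithm is already continuous.
    """
    return np.fft.ifft(np.log(np.fft.fft(x))).real

a = 0.5                          # zero inside the unit circle: X(z) = 1 - a z^-1
x_min = np.zeros(N)
x_min[0], x_min[1] = 1.0, -a

b = 0.5                          # zero outside the unit circle: X(z) = 1 - b z
x_max = np.zeros(N)
x_max[0], x_max[-1] = 1.0, -b    # index -1 stands for time n = -1 in the DFT

c_min = complex_cepstrum_no_unwrap(x_min)
c_max = complex_cepstrum_no_unwrap(x_max)

n = np.arange(1, 11)
print(np.allclose(c_min[n], -a ** n / n))    # positive quefrencies: -a^n / n
print(np.allclose(c_min[-10:], 0.0))         # negative quefrencies vanish
print(np.allclose(c_max[-n], -b ** n / n))   # at n = -m the value is b^m / (-m)
print(np.allclose(c_max[1:11], 0.0))         # positive quefrencies vanish
```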

An example of a practical application of such results is the development of a filter bank based on physiological measurements of the tuning curves of cat auditory neurons. Delgutte [2] has made and documented such measurements. Only the magnitude of the cat's neural tuning curve was directly measured (not the phase). In this application (as in many others) it would be desirable to estimate the complete transfer function, including the phase.

There is some physical evidence that the basilar membrane vibrations (and therefore the auditory tuning curves) can be represented as minimum phase filters [2]. The application of Eq. 20.7 makes it possible to estimate the phase under this assumption, employing the following procedure.

  1. Measure the auditory nerve tuning curves over a variety of neurons with different CFs; these responses resemble the tuning curves shown in Chapter 14 (Fig. 14.10).
  2. These tuning curves can be inverted to produce magnitude functions of auditory bandpass filters. We now have an approximation to the function X (z) of Eq. 20.4.
  3. Using a discrete-time version of Eq. 20.6, compute the complex cepstrum. Instead of evaluating the integral, a DFT-based version gives good results.
  4. Set x̂(n) to zero for n < 0. This means that the truncated version of x̂(n), which we will denote x̂_min(n), must correspond to a log spectrum that is minimum phase.
  5. The final step is the inversion of x̂_min(n).

Step 3 can be implemented with a DFT, using the formula

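$$\hat{x}_p(n) = \frac{1}{N} \sum_{k=0}^{N-1} \log[X_p(k)]\, W^{-nk}, \qquad n = 0, 1, \ldots, N-1 \tag{20.8}$$

where, as in Chapter 7, W is a shorthand for e^{−j2π/N}, and where the subscript p is a notation to indicate that all the processing is performed in discrete time.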

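Step 5 can also be implemented with a DFT as

$$\log[X_{\min,p}(k)] = \sum_{n=0}^{N-1} \hat{x}_{\min,p}(n)\, W^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{20.9}$$

where x̂_min,p(n) is the discrete-time version of the truncated cepstrum of step 4. The result that we are seeking, namely, the value of the complex (minimum phase) spectrum X_min,p(k), can be obtained by simply exponentiating the left side of Eq. 20.9 (see footnote 1). In this way, each neural tuning curve can be well approximated by a minimum phase filter in which both the magnitude and phase of each filter are specified.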
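Steps 3 to 5 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the procedure used for the tuning-curve data: the input is assumed to be a strictly positive magnitude sampled on a full N-point DFT grid, and the function name is hypothetical. One detail is an assumption that the steps above do not spell out: because a cepstrum computed from the log magnitude alone is even, the sketch doubles the positive-quefrency terms when it zeroes the negative ones, which is the usual way to leave the magnitude unchanged.

```python
import numpy as np

def minimum_phase_spectrum(magnitude):
    """Complex minimum phase spectrum whose magnitude matches `magnitude`.

    `magnitude` holds |X_p(k)| for k = 0..N-1 (strictly positive, even in k).
    """
    n = len(magnitude)
    # Step 3: DFT-based cepstrum of the log magnitude (real and even).
    cepstrum = np.fft.ifft(np.log(magnitude)).real
    # Step 4: zero the negative quefrencies (upper half of the DFT index range).
    # Assumption: the positive quefrencies are doubled so the magnitude is kept.
    fold = np.zeros(n)
    fold[0] = 1.0
    fold[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        fold[n // 2] = 1.0
    truncated = cepstrum * fold
    # Step 5: forward DFT gives the log spectrum; exponentiate to finish.
    return np.exp(np.fft.fft(truncated))

# Check on a filter that is known to be minimum phase (zero at z = 0.5).
h = np.array([1.0, -0.5])
magnitude = np.abs(np.fft.fft(h, 512))
h_rec = np.fft.ifft(minimum_phase_spectrum(magnitude)).real
print(np.round(h_rec[:4], 6))
```

Feeding in the magnitude of the time-reversed (maximum phase) filter gives the same result, since only the magnitude is used.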

It should be mentioned that these techniques can, in exactly the same manner, be applied to obtain filters corresponding to the psychoacoustic tuning curves obtained by methods such as those described in Chapter 19. Notice, by the way, that the all-pole gamma-tone filter (APGF) of Chapter 19 is indeed a minimum phase filter. An interesting exercise would be to try to find an APGF that approximates, in both magnitude and phase, the filter response functions obtained by the method described above.

The complex cepstrum must thus be distinguished from the traditional cepstrum, which deals entirely with real functions. Figure 20.2 shows the basic difference between the complex cepstrum and the (traditional) cepstrum.

20.5 APPLICATION OF CEPSTRAL ANALYSIS TO SPEECH SIGNALS

Figure 20.3 shows the result of various operations on a windowed speech signal to produce both the complex cepstrum and the cepstrum. Figure 20.3(b) shows the result of computing the log magnitude of the DFT of the signal shown in Fig. 20.3(a). Since we also want to obtain the complex cepstrum, we need to save the phase component of the DFT. Phase computation is a tricky operation, usually producing a value between –π and π, as in Fig. 20.3(c). However, for this situation, it is necessary to unwrap the phase, as shown in Fig. 20.3(d). The complex cepstrum can be obtained by an inverse DFT (IDFT), from the log magnitude of Fig. 20.3(b) and the phase of 20.3(d). The cepstrum can be obtained by simply computing an IDFT of the function shown in Fig. 20.3(b).

image

FIGURE 20.2 Practical implementations of systems for obtaining (a) the complex cepstrum and (b) the cepstrum.

image

FIGURE 20.3 Computing the complex cepstrum and cepstrum for a voiced speech segment. From [5].

image

FIGURE 20.4 Cepstrum “liftering” of voiced speech. From [5].

Figure 20.3 shows how both the complex cepstrum and the cepstrum can be computed from the original speech segment. Figure 20.4 now shows the steps involved in cepstral filtering, also called homomorphic filtering, of the same speech segment. Begin by multiplying the complex cepstrum, Fig. 20.3(e), by a rectangular window that includes the major energy centered about zero but excludes the small peaks at ±45 samples. When this windowed complex cepstrum is inverted, the smoothed log magnitude and the smooth unwrapped phase of Figs. 20.4(a) and 20.4(b) are obtained. Exponentiating the complex log spectrum described by these two figures and transforming back to the time domain results in Fig. 20.4(c), which can be labeled as the impulse response of the vocal tract. If we now perform comparable operations on the high-time or high-quefrency part of the complex cepstrum of Fig. 20.3(e) (the part not in the rectangular window), we obtain the results shown in Figs. 20.4(d), 20.4(e), and 20.4(f). Therefore, Figs. 20.4(c) and 20.4(f) show how cepstral analysis and filtering perform deconvolution, yielding separate estimates of the vocal tract impulse response [Fig. 20.4(c)] and the excitation function [Fig. 20.4(f)].
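The liftering procedure just described can be sketched in Python as follows. The synthetic frame (pulses every 45 samples through a two-pole resonator), the cutoff of 30 samples, and all function names are assumptions for illustration. So is the handling of sign and integer delay before the logarithm, a common implementation device that the text does not discuss.

```python
import numpy as np

def complex_cepstrum(x, n_fft):
    """Return (cepstrum, delay, sign) from log magnitude and unwrapped phase."""
    spectrum = np.fft.fft(x, n_fft)
    sign = 1.0 if spectrum[0].real >= 0 else -1.0    # make the DC term positive
    spectrum = sign * spectrum
    phase = np.unwrap(np.angle(spectrum))
    # Remove the linear phase of an integer delay so the log spectrum is periodic.
    delay = int(np.round(-phase[n_fft // 2] / np.pi))
    phase = phase + 2.0 * np.pi * delay * np.arange(n_fft) / n_fft
    log_spectrum = np.log(np.abs(spectrum)) + 1j * phase
    return np.fft.ifft(log_spectrum).real, delay, sign

def invert(cepstrum_part):
    """Map a cepstrum back to a time sequence: DFT, exponentiate, inverse DFT."""
    return np.fft.ifft(np.exp(np.fft.fft(cepstrum_part))).real

def split_cepstrum(cepstrum, cutoff):
    """Rectangular lifter: |n| < cutoff is low-time, everything else high-time."""
    low = np.zeros_like(cepstrum)
    low[:cutoff] = cepstrum[:cutoff]
    if cutoff > 1:
        low[-(cutoff - 1):] = cepstrum[-(cutoff - 1):]
    return low, cepstrum - low

# Synthetic voiced frame (an assumption standing in for real speech).
period, length, n_fft = 45, 360, 2048
excitation = np.zeros(length)
excitation[::period] = 1.0
k = np.arange(150)
h = 0.95 ** k * np.sin((k + 1) * 0.2 * np.pi) / np.sin(0.2 * np.pi)
frame = np.convolve(excitation, h)[:length] * np.hamming(length)

xhat, delay, sign = complex_cepstrum(frame, n_fft)
low, high = split_cepstrum(xhat, cutoff=30)
vocal_tract_estimate = invert(low)     # low-time part
excitation_estimate = invert(high)     # high-time part

# Circularly convolving the two estimates should give back the frame,
# up to the sign and delay that were removed before taking the logarithm.
recombined = np.fft.ifft(np.fft.fft(vocal_tract_estimate)
                         * np.fft.fft(excitation_estimate)).real
restored = sign * np.roll(recombined, delay)
print(np.allclose(restored[:length], frame, atol=1e-6))
```

The final check does not measure how well the two parts match the true excitation and vocal tract response. It only confirms that the split is a decomposition by convolution: because the two cepstral parts sum to the full complex cepstrum, the corresponding sequences convolve back to the original frame.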

20.6 CONCLUDING THOUGHTS

Returning to a time waveform without making minimum phase assumptions requires the complex cepstrum. However, in many practical applications, processing of the phase is not a useful operation, and in any case adds complexity. In these cases, cepstral processing can be implemented by dealing entirely with real functions; such a situation is displayed in Fig. 20.1, where separation of the excitation and filter can be seen to occur at the cepstral level. This separation has been shown to be useful in many applications, including

  • Pitch estimation for vocoding (see Chapter 31).
  • Spectral envelope estimation for vocoding (see Chapter 32).
  • Cepstral computations for other kinds of spectral estimates; for example, see Chapter 21 for LPC, and Chapter 22 for critical band filter banks. These representations are used for a range of applications, including speech recognition.

In the latter two cases, a moderate number of cepstral coefficients (typically 10–14) are used to represent the short-term spectral envelope. The choice of a small number of coefficients provides a further smoothing of the spectral estimate beyond what might be necessary for separation of the excitation alone. This is often beneficial to pattern-recognition tasks, in which we wish to suppress minor spectral differences between examples of the same sound.
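The smoothing effect of keeping only a few coefficients can be seen in a short Python sketch. The choice of 13 coefficients (within the 10–14 range mentioned above), the DFT length, the noise frame standing in for real data, and the function name are all assumptions for illustration.

```python
import numpy as np

def cepstral_envelope(frame, n_coeffs=13, n_fft=512):
    """Smoothed log magnitude spectrum from the first `n_coeffs` cepstral coefficients."""
    log_magnitude = np.log(np.abs(np.fft.fft(frame, n_fft)) + 1e-12)
    cepstrum = np.fft.ifft(log_magnitude).real
    lifter = np.zeros(n_fft)
    lifter[:n_coeffs] = 1.0
    if n_coeffs > 1:
        lifter[-(n_coeffs - 1):] = 1.0    # mirror image keeps the result real
    smoothed = np.fft.fft(cepstrum * lifter).real
    return cepstrum[:n_coeffs], smoothed, log_magnitude

def roughness(curve):
    """Total variation: sum of absolute bin-to-bin differences."""
    return np.sum(np.abs(np.diff(curve)))

rng = np.random.default_rng(0)
frame = rng.standard_normal(400) * np.hamming(400)   # placeholder for a speech frame

coeffs, smoothed, raw = cepstral_envelope(frame)
print(len(coeffs), roughness(smoothed) < roughness(raw))
```

Keeping every coefficient (n_coeffs = n_fft // 2 + 1 in this sketch) reproduces the unsmoothed log spectrum, so the number of retained coefficients acts directly as the smoothing control.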

20.7 EXERCISES

  20.1 Prove that the cepstrum c(n) is the even part of the complex cepstrum x̂(n).
  20.2 Find the cepstrum of an all-pole model of the vocal tract. Assume that the vocal tract transfer function can be expressed as

    image

    where C_k = r_k e^{jΘ_k}.

  20.3 Consider the FIR filter

    image

    (a) Find the complex cepstrum of the filter impulse response.
    (b) Is the filter a minimum phase filter? If not, show how it can be transformed into a minimum phase filter with the same spectral magnitude.
  20.4 Assume that a speech or music signal can be represented as the product of two signals, one rapidly varying and the other slowly varying. You are asked to design a gain control to minimize the effects of the slowly varying signal on the quality of the perceived sound. Sketch the design of a homomorphic system that accomplishes this.

BIBLIOGRAPHY

  1. Bogert, B., Healy, M., and Tukey, J., “The quefrency analysis of time series for echoes,” in M. Rosenblatt, ed., Proc. Symp. on Time Series Analysis, Chap. 15, Wiley, New York, pp. 209-243, 1963.
  2. Delgutte, B., Cambridge, MA, personal communication, 1990.
  3. Oppenheim, A. V., “Generalized Linear Filtering,” Chapter 8 in B. Gold and C. M. Rader, Digital Processing of Signals, McGraw–Hill, New York, pp. 233-264, 1969.
  4. Oppenheim, A. V., Schafer, R. W., and Stockham, T. G. Jr., “Nonlinear filtering of multiplied and convolved signals,” Proc. IEEE 56: 1264-1291, 1968.
  5. Rabiner, L. R., and Schafer, R. W., Digital Processing of Speech Signals, Prentice–Hall, Englewood Cliffs, N.J., 1978.
  6. Schafer, R. W., “Echo removal by discrete, generalized linear filtering,” Tech. Rep. 466, Res. Lab. of Electronics, Massachusetts Institute of Technology, Cambridge, Mass., 1969.
  7. Stockham, T. G., “The application of generalized linearity to automatic gain control,” IEEE Trans. Audio Electroacoust. AU-16: 267-270, 1968.

1 All logs in this chapter are assumed to be natural logarithms.
