In Chapter 19, we described spectral representations that are based on the signal and (to some extent) some of the properties of human hearing, in particular the property of requiring less frequency resolution at high frequencies. In Chapter 20, we showed that cepstral processing could provide a smoothed spectral representation that is useful for many speech applications. In both cases, however, we made no explicit use of our knowledge of how the excitation spectrum is shaped by the vocal tract. As noted in Chapters 10 and 11, speech can be modeled as being produced by a periodic or noiselike source that is driving a nonuniform tube. It can be shown that basing the analysis (in a very general way) on such a production model leads to a spectral estimate that is both succinct and smooth, and for which the nature of the smoothness has a number of desirable properties. This is the main topic of this chapter.^{1}

In Chapter 10, we showed that a discrete model of a lossless uniform tube led to an input–output relationship for an excitation at one end and the other end closed (see Eqs. 10.21 and 10.22, and Figs. 10.5 and 10.6). For the case in which the far end of the tube is open, we noted that the complex poles of the tube transfer function would be on the unit circle at frequencies given by
f_{k} = (2k – 1)c/(4l),   k = 1, 2, 3, . . .   (21.1)
We further noted that, for the average-length (17 cm) vocal tract and the speed of sound at room temperature (344 m/s), this meant that such a tube would have one resonance/kHz. In the more realistic case with energy loss at the boundaries, the poles will be inside the unit circle, but still at the angles implied by the resonance frequencies.
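This rule of thumb can be checked directly from Eq. 21.1 (a minimal sketch; the function name is ours):

```python
# Resonance frequencies of a lossless uniform tube, closed at one end and
# open at the other: f_k = (2k - 1) * c / (4 * l).  The 17-cm length and
# c = 344 m/s are the values given in the text.

def tube_resonances(length_m=0.17, c=344.0, n=5):
    """Return the first n resonance frequencies (Hz) of the tube."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n + 1)]

print(tube_resonances())   # roughly 506, 1518, 2529, ... Hz: about one per kHz
```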

Of course, the real vocal tract is far more complicated than a uniform tube, often being represented by a nonuniform tube consisting of multiple shorter concatenated tubes of differing cross-sectional areas but having the same length. This could be viewed as an approximation to a continuous vocal tract shape. The resulting tube would have a set of resonances that (one would hope) would be similar to those for an actual vocal tract (in the shape required to produce a particular sound). However, for our current purposes it is sufficient to note that real vocal tracts do generate resonances whose number can be predicted reasonably well by tube models. Experiments in speech perception, such as those described in Chapter 17, have long suggested the fundamental importance of the formants for human listeners. Therefore, we will assume for now that we only need a model that can represent a sufficient number of resonances.

Suppose, then, that each formant can be represented by a pole-only transfer function of the form
H_{i}(z) = 1/[(1 – c_{i}z^{–1})(1 – c_{i}*z^{–1})]   (21.2)
(where for the moment we ignore the filter gain). A typical frequency response for such a filter is shown in Fig. 21.1. We note in passing that the magnitude of *c_{i}* must always be less than one for a stable filter (left as an exercise for the reader).

Assuming a 5-kHz bandwidth, one would typically need five such resonators in cascade to represent the five formants that would be expected on the average. Ordinarily one would also expect to require one or two more poles (possibly real) to represent the nonflat spectrum of the driving waveform, so a complete vowel spectrum could be represented reasonably well by six such sections.

Although this cascaded approach has been used in many synthesis applications, it is useful for our current purposes to imagine multiplying through all of these sections to get a direct-form implementation of the spectral model:
H(z) = G/(1 – Σ_{k=1}^{P} α_{k}z^{–k})   (21.3)
where *P* is twice the number of second-order sections going into the product (*P* = 12 in the example above), and the α coefficients are the coefficients of the resulting *P*th-order polynomial.
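The multiplying-through step can be sketched numerically: convolving the second-order denominators yields the single *P*th-order polynomial (the pole radii and angles below are purely illustrative, not values from the text):

```python
import numpy as np

# Six hypothetical second-order sections (pole radius r, pole angle theta):
# one per formant plus one for the spectral tilt of the driving waveform.
poles = [(0.95, 0.10 * np.pi), (0.95, 0.30 * np.pi), (0.90, 0.50 * np.pi),
         (0.90, 0.70 * np.pi), (0.85, 0.90 * np.pi), (0.70, 0.05 * np.pi)]

# Each section has denominator 1 - 2 r cos(theta) z^-1 + r^2 z^-2.
a = np.array([1.0])
for r, theta in poles:
    section = np.array([1.0, -2.0 * r * np.cos(theta), r * r])
    a = np.polymul(a, section)          # polynomial multiplication

print(len(a) - 1)   # P = 12: order of the combined direct-form polynomial
```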

Figure 21.2 is a diagram of the complete model. In later chapters such a system will be used as a starting point to describe linear predictive approaches to speech synthesis, but in the current context it will be used as a model to represent the signal spectrum. Thus, the short-term spectrum of a speech signal can be represented by a filter that can be specified by *P* = 2 * (BW + 1) coefficients, where BW is the speech bandwidth in kilohertz. Note that since the driving-signal spectrum is folded into the filter, the model excitations are considered to be white.

For the system shown in Fig. 21.2, the discrete-time response *y(n)* to an excitation signal *x(n)* would be
y(n) = Gx(n) + Σ_{k=1}^{P} α_{k}y(n – k)   (21.4)
The coefficients for the second term of this expression are generally computed to give an approximation to the original sequence, which will yield a spectrum for *H(z)* that is an approximation to the original speech spectrum. Thus, we attempt to predict the speech signal by a weighted sum of its previous values. That is,
ŷ(n) = Σ_{k=1}^{P} α_{k}y(n – k)   (21.5)
is the linear predictor. Note that this has the form of an FIR filter, but that when it is included in the model of Fig. 21.2 the resulting production model is IIR. The coefficients that yield the best approximation of *ŷ(n)* to *y(n)* (usually in the mean-squared sense) are called the linear prediction coefficients. In the statistical literature, the overall model is sometimes called an autoregressive (AR) model.

The difference between the predictor output and the original signal is referred to as the error signal, also sometimes called the residual error, the LPC residual, or the prediction error. When the coefficients are chosen to minimize this signal's energy, the resulting error signal can be viewed as an approximation to the excitation function. The residual signal *e(n)* = *y(n)* – *ŷ(n)* consists of the components of *y(n)* that are not linearly predictable from its own previous samples; in this model the periodic excitation pulses are such components, assuming that the number of samples between excitation pulses is much larger than the order of the filter. Figure 21.3 shows several examples of such error signals (preceded by the original waveforms) for steady-state vowels; note that the prediction error has large peaks that occur once per pitch period.
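This behavior of the residual can be illustrated with a synthetic example: a pulse train (one pulse per pitch period) drives a hypothetical all-pole filter, and the matched predictor recovers the excitation exactly. The filter coefficients and pitch period below are illustrative, not from the text:

```python
import numpy as np

period = 80                                  # samples between pitch pulses
x = np.zeros(800)
x[::period] = 1.0                            # excitation: impulse train
a1, a2 = 1.8 * np.cos(0.2 * np.pi), -0.81    # predictor coefficients (pole pair, radius 0.9)

# All-pole synthesis: y(n) = x(n) + a1*y(n-1) + a2*y(n-2)
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = x[n] + a1 * (y[n - 1] if n >= 1 else 0.0) \
                + a2 * (y[n - 2] if n >= 2 else 0.0)

# Prediction error with the matched predictor: e(n) = y(n) - yhat(n)
e = np.zeros_like(y)
for n in range(len(y)):
    yhat = a1 * (y[n - 1] if n >= 1 else 0.0) \
         + a2 * (y[n - 2] if n >= 2 else 0.0)
    e[n] = y[n] - yhat

print(np.allclose(e, x))   # True: the residual recovers the pulse train
```

In practice the predictor is estimated from the signal rather than known, so the residual only approximates the excitation, but the once-per-pitch-period peaks remain.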

We have shown that a simple model for speech production^{2} leads to a spectral representation that is the minimum required to potentially represent the vocal tract resonances that shape the speech spectrum, particularly for voiced sounds. However, this still leaves us with (at least) two remaining questions:

- What error criterion should we minimize between the model spectrum and the observed spectrum?
- What is the best (not the minimum) number of coefficients to put into the representation?

We do not know if current answers to these questions are optimal. However, in the basic form of LPC that has been traditionally incorporated in audio signal-processing systems, the coefficients have been chosen to minimize the squared error between the observed and predicted signals. As shown below, this leads to a solution that can be expressed in terms of the signal's autocorrelation function. We state without proof (see, for instance, [3]) that minimizing the squared error criterion is also equivalent to minimizing the integrated quotient between the speech power spectrum and the model power spectrum, or
(1/2π) ∫_{–π}^{π} [|Y(ω)|^{2}/|H(ω)|^{2}] dω   (21.6)
where for simplicity we have ignored a gain term corresponding to the error power. Thus, minimizing the mean squared difference between *y(n)* and its linear predictor *ŷ(n)* is equivalent to minimizing a kind of distortion between the signal spectrum and the model filter spectrum.

What are the characteristics of this particular spectral distortion criterion? Note that for this measure, the portions of the spectrum for which |*Y*(ω)|^{2} is smaller than |*H*(ω)|^{2} will make small contributions to the integral. For a harmonic signal that is being modeled with LPC, this means that the model spectrum will tend to hug the harmonic peaks, but not the valleys between. Therefore, for model orders that are not too large, the error criterion given in Eq. 21.6 will lead to a spectrum that is an estimate of the envelope of the signal spectrum. See Fig. 21.4(c) for a short-term speech spectrum and the corresponding linear predictive spectrum.

Since, as noted earlier, it is a common goal to model spectral resonances, the squared error criterion seems reasonable. Additionally, moderate amounts of additive noise will have only small effects on the estimate, since the largest change to the spectrum will occur in the spectral valleys. However, a squared error criterion is not necessarily ideal; for instance, portions of the spectrum that have large magnitudes will tend to dominate the error, which may not necessarily be those parts that are most relevant to either speech intelligibility (for coding or synthesis applications) or phone discrimination (for recognition applications). This weakness is often ameliorated by pre-emphasizing the data with a fixed first-order FIR filter to help to flatten the spectrum.
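A sketch of such a pre-emphasis filter follows; the coefficient value 0.97 is a common choice but is our assumption here, not a value from the text:

```python
import numpy as np

def preemphasize(s, mu=0.97):
    """First-order FIR pre-emphasis: y(n) = s(n) - mu * s(n-1).

    The filter attenuates low frequencies and boosts high ones, so that
    large low-frequency components do not dominate the squared-error fit.
    mu = 0.97 is a typical (assumed) value.
    """
    y = np.empty_like(s, dtype=float)
    y[0] = s[0]
    y[1:] = s[1:] - mu * s[:-1]
    return y

# A slowly varying (low-frequency) component is strongly attenuated:
s = np.cos(2 * np.pi * 0.001 * np.arange(1000))
print(np.abs(preemphasize(s)[1:]).max() < 0.1)   # True
```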

Given this model structure and error criterion, what should be the number of coefficients used? In the previous section we gave an approximation to the number of coefficients that are required in order to represent the spectral resonances. However, since the model only provides an approximate fit to the short-term signal spectrum, clearly using a greater number of coefficients (a higher-order discrete-time model) will yield model spectra that are a better match; in general, using more parameters in a least-squares fit to a sequence will provide a better fit. However, having the greater detail is not always an advantage. In particular, the typical goal for the short-term spectral analysis of speech is to compute a spectral envelope that is relatively unaffected by pitch; thus, in general the details of harmonic structure should not be modeled at this stage. Additionally, a larger number of parameters will tend to make the estimate more susceptible to errors that are due to additive noise in the observed signal.

Therefore, the model order used depends critically on the goal for the spectral analysis. In cases in which an extremely accurate spectral representation is required, higher model orders may be used. For applications such as speech recognition, though, the model order is typically kept very close to the rule of thumb described earlier (though sometimes a slightly higher model order is found to be helpful). Although this may sound somewhat ad hoc, more formal approaches do not always accurately determine the best model order for a given application. Probably the best known of these is the Akaike information criterion, or AIC [1]. In this approach, the error variance is replaced by a new distortion measure that penalizes model complexity. In particular, the AIC is given by AIC = *N* log σ^{2} + 2*p*, where σ^{2} is the prediction-error variance, *p* is the autoregressive model order, and *N* is the number of samples. This measure has the right property, namely penalizing model size in addition to the error variance; however, the weight given to the model order is based on assumptions that may not hold in practice. Makhoul [3] applied this measure to a speech sample and got a reasonable result, with tenth-order predictors having the lowest AIC; in practice, however, the best value tends to be found by trial and error.
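A minimal order-selection sketch, using the standard form AIC = *N* log σ^{2} + 2*p* and a plain least-squares AR fit to a synthetic second-order process (all specifics below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4000
y = np.zeros(N)
noise = rng.standard_normal(N)
# Generate a second-order autoregressive process (coefficients illustrative).
for n in range(2, N):
    y[n] = 1.5 * y[n - 1] - 0.7 * y[n - 2] + noise[n]

def aic(y, p):
    """AIC = N log(sigma^2) + 2p for a least-squares AR(p) fit."""
    # Each column holds one of the p previous samples for every predicted sample.
    Y = np.column_stack([y[p - k - 1:len(y) - k - 1] for k in range(p)])
    target = y[p:]
    alpha, *_ = np.linalg.lstsq(Y, target, rcond=None)
    sigma2 = np.mean((target - Y @ alpha) ** 2)
    return len(target) * np.log(sigma2) + 2 * p

scores = {p: aic(y, p) for p in range(1, 8)}
print(min(scores, key=scores.get))   # order picked by AIC (the process is AR(2))
```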

Figure 21.5 shows the spectral envelope for LPC models of different orders given the same speech sample. It is apparent that the low model orders do not adequately capture the formants, whereas the very high model orders begin to track the harmonic content.

Figure 21.6 shows the reduction in prediction error energy for increased model orders. Note that in general the error is larger for unvoiced sounds, which are inherently less predictable.

Let us assume that in the situation of interest, *x(n)* of Eq. 21.4 is unknown. In this case, as noted earlier, we choose the α parameters to minimize the squared error between *y(n)* and *ŷ(n)* over the sequence. For speech applications, the *y(n)* of interest is typically a locally windowed piece of the original speech sample sequence; that is,
y(n) = w(n)s(n),   0 ≤ n ≤ N – 1   (21.7)
where *w(n)* is a local window (e.g., a 20-ms-long Hamming window) that we assume to be *N* points long, and *s(n)* is the sampled speech data. The error signal between the model output and the signal then is
e(n) = y(n) – Σ_{k=1}^{P} α_{k}y(n – k)   (21.8)
Defining a distortion metric as the total squared error over the window, we find
E = Σ_{n} e^{2}(n)   (21.9)
If we take partial derivatives with respect to each α and set them equal to zero, we get *P* equations of the form
Σ_{j=1}^{P} α_{j}φ(i, j) = φ(i, 0),   1 ≤ i ≤ P   (21.10)
where φ*(i, j)* is a correlation sum between versions of the speech signal delayed by *i* and *j* points.

For the case of the windowed signal for which no points outside of the window are used in the estimate, φ(*i*, *j*) is only a function of the absolute value of the difference between *i* and *j*, and the resulting correlation matrix on the left-hand side of Eq. 21.10 is Toeplitz; that is, not only is it symmetric, but all the values along each left-to-right diagonal are equal. For this special case, the system of equations can be solved by efficient [O(*P*^{2})] procedures known as the Levinson or Durbin recursions. Although efficient, they do require significant numerical precision. These procedures are described in detail in a number of other sources, including [3], and are not repeated here. For what is called a covariance analysis, the correlations are computed with the use of samples from outside the windowed block, so that the resulting matrix is not Toeplitz (though it is still symmetric). The resulting system of equations is typically solved by a Cholesky decomposition (sometimes called the square-root method), which takes more computation [O(*P*^{3})] but is numerically quite stable (see, for instance, [2]).
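The autocorrelation method can be sketched as follows; this is our rendering of the standard Levinson–Durbin formulation, not code from any of the cited references, and the test signal's coefficients are illustrative:

```python
import numpy as np

def lpc_autocorr(y, P):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (alpha, E): predictor coefficients for
    yhat(n) = sum_{k=1..P} alpha[k-1] * y(n - k), plus the final
    prediction-error energy E.  A sketch of the O(P^2) recursion;
    production code would add numerical safeguards.
    """
    N = len(y)
    # Autocorrelation sums; samples outside the (windowed) frame are zero.
    r = np.array([np.dot(y[:N - i], y[i:]) for i in range(P + 1)])
    a = np.zeros(P + 1)
    a[0] = 1.0                      # error-filter (inverse-filter) form A(z)
    E = r[0]
    for i in range(1, P + 1):
        # Reflection coefficient for stage i.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / E
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        E *= 1.0 - k * k
    return -a[1:], E

# Fit a synthetic second-order autoregressive signal:
rng = np.random.default_rng(1)
y = np.zeros(4000)
noise = rng.standard_normal(4000)
for n in range(2, 4000):
    y[n] = 1.5 * y[n - 1] - 0.7 * y[n - 2] + noise[n]
alpha, E = lpc_autocorr(y, 2)
print(np.round(alpha, 2))   # close to the generating coefficients [1.5, -0.7]
```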

In practice, the prediction coefficients themselves are usually not a good representation for applications. In cases in which the digital word length is critical, the polynomial coefficients tend to be too sensitive to numerical precision. When a covariance analysis is used, the stability of the resulting filter is not guaranteed, and stability is not easily checked from the predictor polynomial. Furthermore, the coefficients are not orthogonal or normalized, which potentially creates other difficulties for classifiers that might use these features.

For all of these reasons, LPC coefficients are generally transformed into one of a number of other representations, including the following.

**1. Root pairs**: the polynomial can be factored (using commonly available iterative procedures) into complex pairs, thus finding something like the resonances discussed in this chapter. Each of these is implemented by a second-order filter, which has simple stability properties and good numerical behavior. This can be useful for synthesis but has not tended to be used for recognition.

**2. Reflection coefficients**: the polynomial can be transformed into a set of coefficients that represent the fraction of energy reflected at each section of a nonuniform tube (with as many sections as the order of the polynomial). The reflection coefficients also can be used directly as the coefficients for a lattice filter for synthesis. The first few coefficients must be represented with more precision than the later ones, and all of the values are bounded by –1 and 1 for stable filters. This is probably the most common representation for LPC synthesis.
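The conversion from predictor coefficients to reflection coefficients can be sketched with the standard step-down (backward Levinson) recursion; sign conventions vary across texts, so this is one common choice rather than the only one:

```python
import numpy as np

def lpc_to_reflection(alpha):
    """Step-down recursion: predictor coefficients -> reflection coefficients.

    alpha follows the convention yhat(n) = sum_k alpha[k-1] * y(n - k).
    For a stable predictor, every returned value lies strictly in (-1, 1).
    """
    a = np.concatenate(([1.0], -np.asarray(alpha, dtype=float)))  # A(z) form
    P = len(a) - 1
    ks = np.zeros(P)
    for i in range(P, 0, -1):
        k = a[i]
        ks[i - 1] = k
        if i > 1:
            # Remove stage i to obtain the order-(i-1) error filter.
            a[1:i] = (a[1:i] - k * a[i - 1:0:-1]) / (1.0 - k * k)
    return ks

# Stable second-order predictor (illustrative coefficients):
ks = lpc_to_reflection([1.5, -0.7])
print(np.all(np.abs(ks) < 1.0))   # True: both magnitudes are below one
```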

**3. Cepstrum**: there is a recursion that can be used to generate a set of cepstral coefficients corresponding to the LPC spectrum; it is efficient, as it does not require any explicit spectral computations. The resulting variables are orthogonal (as are the DFT-based cepstral coefficients) and well behaved numerically. They are the most common form of LPC-based variables used for speech or speaker recognition.
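The recursion mentioned here has a standard form; a sketch follows (the gain term is ignored, so the zeroth cepstral coefficient is omitted):

```python
import numpy as np

def lpc_to_cepstrum(alpha, n_ceps):
    """Cepstral coefficients of H(z) = 1 / (1 - sum_k alpha[k-1] z^-k)
    via the standard recursion; no explicit spectral computation needed.
    """
    P = len(alpha)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = alpha[n - 1] if n <= P else 0.0
        # c_n = alpha_n + sum_{k} (k/n) c_k alpha_{n-k}
        for k in range(max(1, n - P), n):
            acc += (k / n) * c[k - 1] * alpha[n - k - 1]
        c[n - 1] = acc
    return c

# Single-pole check: for H(z) = 1/(1 - a z^-1), c_n = a^n / n.
a = 0.8
c = lpc_to_cepstrum([a], 5)
print(np.allclose(c, [a**n / n for n in range(1, 6)]))   # True
```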

Figure 21.7 shows the complete LPC process, including the transformation into cepstral coefficients for speech recognition. In this case temporal derivatives are also commonly used to augment the feature vector.

We have shown that linear prediction can be used to generate estimates of the spectral envelope for the short-term spectrum of speech that is commonly of interest for applications in synthesis and recognition. It is based on a model of the vocal tract shaping of an excitation signal. When the model is a good one (for instance, for steady-state vowels), the approach yields a good match to the spectral envelope; in other words, using an eighth- or tenth-order model to represent four formants works pretty well under good conditions. When the model is a poor match to the physical generation, for instance, when the sound is nasalized (the nasal side branch introduces significant spectral zeros), the results tend not to be so good; additionally, unvoiced sounds tend to have a different (usually simpler) spectral shape and may be overparameterized by a model order that is appropriate for the vowels. Still, even in these cases LPC often provides a reasonable spectral estimate.

Another issue of concern for linear prediction modeling is the use of sharp antialiasing filters to precede sampling. In many signal-processing applications we tend to think of aliasing distortion as evil, and so we wish to prevent it at all costs – it is common to set the corner frequency for such filters at 40% or so of the sampling frequency rather than at the ideal of 50% required by the Nyquist theorem (for a filter with a perfect rectangular frequency response). However, if the analysis used is LPC, this choice will put a steep low-pass filter characteristic within the band modeled by LPC analysis. Essentially, the analysis only has a fixed number of poles to place in order to model the spectrum, and having such a steep filter will reduce the degrees of freedom available to model the speech. From another perspective, having a large range of spectral values to model tends to make the correlation matrices ill conditioned, with a corresponding large range in the matrix eigenvalues [3]. Practically speaking, when LPC is used, it is common to set the antialiasing corner frequency at roughly the half-sampling frequency, particularly for applications in speech and speaker recognition, even though the aliasing distortion becomes significant.

LPC analysis focuses on the spectrum as a product of resonances. As we have noted in previous chapters, resonant (formant) frequencies have often been associated with phonetic identity, particularly for vowels. However, there is significant variation in these frequencies among people; in particular, the vocal tract length tends to scale these frequencies, leading to very different average values for adult males, adult females, and children. Thus, there tends to be a built-in speaker dependence for LPC-based features. In contrast, filter banks and cepstral analysis are less tied to the specific resonant frequencies, but they consequently lack some of the previously noted advantages of basing the envelope estimate on these resonances. Table 21.6 summarizes a number of other comparative points between these spectral envelope estimates; note that the LPC column refers to the predictor polynomial only, as some of the weaknesses indicated there for LPC can be softened by using cepstral parameters from the LPC analysis. In Chapter 22 we will discuss approaches that have been developed to benefit from all three spectral envelope estimation methods.

We note that this chapter has skimmed over the implementation mathematics very lightly; please refer to references [3] and [4] for much more complete discussions of the autocorrelation and covariance methods, including the efficient recursions used. Lattice implementations of LPC filters will be briefly discussed in Chapter 32. Finally, reference [5] is a good source for a completely different perspective based on maximizing the entropy of the spectral estimate but leading to similar solutions.

**21.1** We have indicated that the squared error criterion leads to a spectral ratio error criterion (between the power spectra associated with the speech and the model). Use Parseval's theorem to show this.

**21.2** A signal includes a very noisy spectral slice between 500 and 600 Hz. An engineer proposes implementing a steep notch (band-reject) filter to remove this noise. If this signal is going to be analyzed with linear prediction, what is a potential difficulty with this plan?

1. Akaike, H., "A new look at the statistical model identification," *IEEE Trans. Autom. Control* **AC-19**: 716–723, 1974.
2. Golub, G., and Van Loan, C., *Matrix Computations*, Johns Hopkins Univ. Press, Baltimore, 1983.
3. Makhoul, J., "Linear prediction: a tutorial review," *Proc. IEEE* **63**: 561–580, 1975.
4. Rabiner, L., and Juang, B.-H., *Fundamentals of Speech Recognition*, Prentice–Hall, Englewood Cliffs, N.J., 1993.
5. Schroeder, M., "Linear prediction, entropy, and signal analysis," *IEEE ASSP Mag.* **1**: 3–11, 1984.
