CHAPTER 33


LOW-RATE VOCODERS

33.1 INTRODUCTION

To a first approximation, a teletype system transmitting at 75 bps can transmit textual information at almost the same rate as a person speaking the same text. Of course, the speech has much more information than the text. The speaker's identity, emotional state, and prosodic nuances are all information, though not all of this information may be necessary for speech communication per se. For this chapter, it is assumed that a good 2400-bps vocoder contains all of the relevant information. Given this assumption, we will consider low-rate vocoders to encompass bit rates between 75 and 2400 bps.

In Chapter 32 we examined two methods of bit-rate reduction: efficient quantization schemes and linear transformations. In this chapter we extend this discussion to report on several other bit-saving methods. First, we describe the benefits obtainable by taking advantage of the time correlation of the spectral and excitation components. When the sampling rates of these components are lowered, bit saving automatically takes place; the trick will be to find interpolation algorithms that do not inordinately degrade the output speech.

A different approach to bit-rate reduction comes from the original work by C. P. Smith [18], [19] on channel vocoders, which was later applied to LPC vocoders by Buzo et al. [3]. This approach was called pattern matching by Smith and vector quantization (VQ) by Buzo et al. (We shall, for the most part, stick with the more popular latter description.) The idea, discussed more generally in Chapters 9 and 26, is this: the number of perceptually distinguishable spectra is far smaller than the number that is typically generated by a speech device such as a channel vocoder. Therefore, (a) treat each spectrum as a multidimensional vector and store all of the perceptually distinguishable spectra; (b) give each stored spectrum a label; and (c) when a new spectrum arrives, match it against the stored vectors and transmit only the label of the best match, which the receiver decodes using the same codebook.

Another traditional way of reducing bit rate is to reduce the number of parameters that have to be sent. For example, in a typical channel vocoder, 15–20 channel signals are used; in LPC, 8–12 predictor coefficients are generally needed. Formant vocoders, in contrast, require as few as four spectral parameters; the problem here is the difficulty of tracking formants accurately.

Finally, technological advances in the field of ASR have led to the concept of recognition synthesis, in which longer speech segments (e.g., phonemes, diphones, and syllables [8], [21]) are identified at the analyzer and regenerated at the synthesizer. This is an interesting research topic, but as explained later in this chapter, it is unlikely to achieve general use (across languages, speakers, and acoustic conditions) in the foreseeable future.

In this chapter our goal is to study several systems that make use of one, or some combination, of the above concepts.

33.2 THE FRAME-FILL CONCEPT

The basic idea of frame fill, proposed by McLarnon [10], is to transmit from analyzer to synthesizer only every Mth frame, thereby achieving an M:1 reduction in bit rate. The savings are not quite that great, since some control information must be sent instructing the receiver how to reconstruct (or fill in) the missing frames. Thus, with the choice of M = 2, a rate close to 1200 bps is feasible if one starts with a 2400-bps system.
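
The arithmetic behind this estimate is simple bookkeeping. The sketch below (a minimal Python illustration) assumes 54 bits per frame at 44.4 frames/s, roughly the format of a 2400-bps system, and 2 control bits per transmitted frame; all of these values are assumptions chosen for illustration.

    B, F, c, M = 54, 44.4, 2, 2           # bits/frame, frames/s, control bits, keep every Mth frame
    full_rate = B * F                     # about 2400 bps before frame fill
    frame_fill_rate = (B + c) * F / M     # each surviving frame carries c extra control bits
    print(round(full_rate), round(frame_fill_rate))   # 2398 1243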

Frame fill for a channel vocoder: Let's define frame (N – 1) and frame (N + 1) as the two frames that are sent, and frame (N) as the frame computed by the analyzer but not sent. Then:

  1. Compare frame (N) of data to be omitted with frame (N – 1) and frame (N + 1).
  2. In accordance with some reasonable distance measure, decide which neighbor better matches the omitted frame.
  3. Also, consider as a match candidate some weighted combination of the information contained in the two neighboring frames.
  4. Select the option (three choices) representing the best match and append its I.D. code (two bits) to the frame that is to be transmitted.

    McLarnon chose a distance metric to be

    $$D = \sum_{k=1}^{K} \left| S_c(k) - S_r(k) \right|$$

where K is the number of channels and where Sc(k) refers to the magnitude of the kth spectral component for the candidate Nth frame and Sr(k) refers to the equivalent magnitude for either frame (N – 1) or frame (N + 1). Table 33.1 (after [2]) summarizes the coding strategy to reduce the bit rate from an original 2400-bps channel vocoder to 1200 bps and 800 bps.
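
A minimal sketch of the analyzer-side decision follows; the equal-weight average used for the combined candidate and the particular two-bit code assignments are assumptions for illustration.

    import numpy as np

    def frame_fill_code(prev_frame, omitted, next_frame):
        """Choose the two-bit I.D. code telling the receiver how to fill the omitted frame."""
        candidates = {
            0b00: prev_frame,                        # reuse frame (N - 1)
            0b01: next_frame,                        # reuse frame (N + 1)
            0b10: 0.5 * (prev_frame + next_frame),   # weighted combination of the neighbors
        }
        # McLarnon's distance: magnitude differences summed over the K channels
        dist = lambda ref: float(np.sum(np.abs(omitted - ref)))
        return min(candidates, key=lambda code: dist(candidates[code]))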

TABLE 33.1 Coding Conventions for a Low-Rate Channel Vocoder


TABLE 33.2 Coding Conventions for a Low-Rate LPC Vocoder


Frame fill for an LPC vocoder: Bit-rate saving in LPC vocoders is subtler than for channel vocoders and depends greatly on the synthesizer structure. For example, if the predictor coefficients were transmitted, assigning bit rates to each would be an arduous empirical task because of the opaque relationship between the coefficients and the more physically intuitive spectrum. Both psychoacoustic experiments [1] and engineering considerations favor the use of the reflection coefficients. Blankenship and Malpass [2] performed frame-fill experiments on a 2400-bps LPC vocoder and arrived at the coding breakdowns of Table 33.2 for both 1200 bps and 800 bps.

Notice that the reflection coefficients k4 through k9 are not sent when the excitation is hiss. This knowledge allows the analyzer to perform a fourth-order rather than a tenth-order analysis during hiss. The strategic control of frame fill is somewhat more complex for LPC and, as shown in the table, requires four bits for the 1200-bps version and three bits for the 800-bps version. Table 33.3 gives comparative DRT (diagnostic rhyme test, see Chapter 17) results for all three versions of both systems.

TABLE 33.3 Summary of Three-Speaker DRT Accuracy Results


33.3 PATTERN MATCHING OR VECTOR QUANTIZATION

Let's assume that a listener can tell any two spectral patterns apart from a total population of 2^20 (= 1,048,576) patterns. Given a strategy that permits identification of the storage location of any of these million or so patterns, we see that the transmitter needs to transmit only the storage location (a 20-bit index), with confidence that the receiver (possessing the same set of stored patterns) can generate the correct spectrum.

So far, so good; but there are difficulties. How do we determine the particular subset of 2^20 patterns out of the much larger set of possible spectra? How do we implement an efficient search procedure to find the best-matching pattern?

Two conditions for some success are clear. First, a large amount of data must be collected and analyzed, and second, a distance metric must be formulated so that most entries are perceptually distinct. An attractive geometric way of looking at these issues was proposed by Buzo et al. [3]. They defined a single frame of information as a vector. If we think of each frame as a single point in a multidimensional space, we try to fill the space with points that are as far apart from each other as possible. This line of reasoning leads straightforwardly to the term vector quantization.1
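
In code, the basic transmit/receive step is just a nearest-neighbor search over a table shared by both ends. The sketch below fills the codebook with random vectors purely for illustration; a real system would populate it with perceptually distinct spectra, as described next.

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((1024, 16))    # toy table: 1024 stored 16-channel spectra

    def vq_encode(spectrum):
        """Transmit only the index of the closest stored pattern (10 bits for 1024 entries)."""
        return int(np.argmin(np.sum((codebook - spectrum) ** 2, axis=1)))

    def vq_decode(index):
        """The receiver, holding the same codebook, regenerates the spectrum."""
        return codebook[index]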

When C. P. Smith began his work in the 1950s, bulk memory was best obtained by rotating machines, such as drums. Progress was slow, although by 1962, Smith was able to perform a feasibility experiment. Buzo, Markel, and Gray reintroduced the concept using LPC vocoders. At the Lincoln Laboratory, the same concepts were applied to a channel vocoder by Gold [6] and to the SEEVOC system of Paul [11]. Each used a different vocoder configuration, a different distance metric, and a different strategy for building a system.

Gold's pattern-matching channel vocoder: listening and visual observation were the primary vehicles; this made the procedure interactive but time consuming. The system begins with an empty table of stored patterns. The first sentence is processed by the channel vocoder analyzer, and all spectral cross sections are entered into memory. By visual inspection of these cross sections, the experimenter enters the nonredundant spectra into the pattern table. The next sentence is matched against this embryonic table. The process continues, using both visual and auditory feedback, until the experimenter decides that a sufficient number of patterns has been stored.

Paul's adaptive vector quantization: adaptation is accomplished by continual alteration of the pattern table to match the current speaker and environment. An incoming spectrum is matched against all existing reference patterns. If the best match fails to satisfy a fixed criterion, the new pattern is incorporated into the pattern table, replacing the pattern that has gone untransmitted for the longest time.
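
A sketch of this replacement rule is given below; the Euclidean match criterion and the fixed threshold are stand-ins for Paul's perceptual distance measure, which is not reproduced here.

    import numpy as np

    class AdaptivePatternTable:
        """Pattern table that evicts the least recently transmitted entry."""

        def __init__(self, patterns, threshold):
            self.patterns = np.asarray(patterns, dtype=float)   # (size, dim) reference spectra
            self.threshold = threshold                          # fixed match criterion
            self.last_sent = np.zeros(len(self.patterns))
            self.clock = 0

        def encode(self, spectrum):
            self.clock += 1
            d = np.linalg.norm(self.patterns - spectrum, axis=1)
            best = int(np.argmin(d))
            if d[best] > self.threshold:               # best match fails the criterion:
                best = int(np.argmin(self.last_sent))  # evict the stalest pattern...
                self.patterns[best] = spectrum         # ...and adapt (the receiver's copy
                                                       # must be updated as well)
            self.last_sent[best] = self.clock
            return best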

The spectral envelope estimation (SEE) vocoder algorithm was used as a base in these experiments [12]. (SEE uses cepstral processing to obtain spectral patterns.) Performance was quite impressive. When a new speaker began, a brief period of quasi-intelligible speech was followed by adaptation; the system quickly tuned in on the new speech. It should be noted that the system requires that updated reference sets be periodically transmitted. To maintain the low bit rate requires the detection of silent intervals during which new pattern sets can be sent.

33.4 THE KANG–COULTER 600-BPS VOCODER

Kang and Coulter [9] developed a 600-bps vocoder that employs LPC methods followed by formant tracking based on the LPC parameters, as well as vector quantization of the formants. Their device is a useful example of how the different data-reduction techniques can be combined to achieve significant bit-rate reduction. A block diagram of their system is shown in Fig. 33.1.

LPC vocoders have already been discussed in Chapters 21 and 32. Here, we want to refer back to the predictor coefficients and show how to manipulate them to improve formant tracking. Let's consider, as an example, an nth-order predictor. We know that the z-transform of the synthesizer can be expressed in terms of an nth-order polynomial in z.

$$H(z) = \frac{G}{1 - \sum_{i=1}^{n} a_i z^{-i}}$$

For a tenth-order system, as a10 approaches unity, the 10 poles of the system gravitate toward the unit circle, as shown in Fig. 33.2. As the poles progress, the resultant spectrum changes as indicated in Fig. 33.3. (In the figure the parameter k10 corresponds to a10 in our notation.) As k10 gets very close to unity, as in (f), the peaks of the spectrum become very obvious and relatively easy to identify.

However, these values of peak frequencies are only approximations to the formants. As seen in Fig. 33.2, the pole trajectories are not radial. Thus, it is necessary to backtrack by gradually returning kn to its original value and iteratively recomputing the formants in steps.
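
The pole behavior is easy to verify numerically. The sketch below converts reflection coefficients into the synthesizer denominator with the Levinson step-up recursion and watches the pole radii as the last coefficient is pushed toward unity; the coefficient values themselves are arbitrary illustrations.

    import numpy as np

    def step_up(k):
        """Reflection coefficients -> prediction-error polynomial
        A(z) = 1 + a_1 z^-1 + ... + a_n z^-n, returned as [1, a_1, ..., a_n]."""
        a = np.array([1.0])
        for km in k:
            a = np.concatenate([a, [0.0]]) + km * np.concatenate([[0.0], a[::-1]])
        return a

    k_base = [0.7, -0.5, 0.3]                      # arbitrary low-order reflection coefficients
    for kn in (0.2, 0.9, 0.999):                   # push the last coefficient toward unity
        poles = np.roots(step_up(k_base + [kn]))   # synthesizer poles = zeros of A(z)
        print(f"k_n = {kn:5.3f}  pole radii: {np.sort(np.abs(poles))[::-1].round(4)}")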

Vector quantization of the formants: just as perceptually distinct speech spectra are a relatively small subset of all possible speech spectra, so are the perceptually significant formant patterns a small subset of all such patterns. Kang and Coulter worked with a stored table of just 128 vector formant patterns. They also vector quantized six partial correlation coefficients for use with unvoiced sounds. Their parameter coding is shown in Table 33.4.


FIGURE 33.1 Kang–Coulter 600-bps voice digitizer. From [9].

Finally, Kang and Coulter present a summary of DRT results, using the Voiers version of feature comparison (see Chapter 17); these results are shown in Table 33.5.

33.5 SEGMENTATION METHODS FOR BANDWIDTH REDUCTION

In Section 33.1 it was noted that the lowest reasonable data rate for a speech-transmission system is of the order of 75 bps, equivalent to a teletype rate. Let us try to imagine how one would go about building such an ideal system. First, a very good automatic speech recognizer would be the front end of the system. This would mean that the transmission system has available the equivalent of printed text, which can then be sent at teletype rates. At the receiver, a text-to-speech system would be needed to reproduce the spoken version of the text.


FIGURE 33.2 Loci of the poles as k10 approaches unity. From [9].

There are two strong limitations to this scenario. First, even a very good text-to-speech system does not reproduce the characteristics of the speaker. This means that we really need more bits with which to reproduce the style of the speaker and convince the listener of the speaker's identity. Determining how many more bits this task requires would demand new methods and fairly large-scale psychoacoustic testing, both to estimate the extra bandwidth needed and to learn how to approximate the speaker's voice. To our knowledge, such research has not been done. However, as long ago as 1980, speech researchers were beginning to grapple with such problems, and work centered on these issues is currently quite active.

TABLE 33.4 Parameter Coding for a 600-bps Voice Digitizer



FIGURE 33.3 Spectra of an all-pole system as poles migrate toward the unit circle. From [9]. As the poles get closer to the unit circle, the peaks associated with the poles become more distinct and easier to identify. (See text for further explanation.)

Even if we are willing to sacrifice speaker identity and style, however, one would still need to accurately recognize phonemes or some other linguistically sufficient speech unit. This has been demonstrated for Japanese speech in [8] and [21]. In the latter reference, for instance, phoneme HMMs were used to recognize phoneme strings. Japanese phonotactics place relatively strong constraints on phoneme sequences (in comparison, say, with English), and the resulting recognition was good enough to provide 150-bps synthetic speech with subjective evaluations comparable to those of a VQ system that required 400 bps (neither figure includes pitch information). This was very impressive; however, the experiments were done for a single speaker, under pristine acoustic conditions, and in a language that is particularly conducive to such an approach. We are not aware of any current application of this approach to the general case of unconstrained acoustic conditions and speaker style, without speaker-specific training.

Nonetheless, let's look more closely at an idealized speech-transmission system, one that recognizes and transmits individual phonemes. If we assume an average speaking rate of 10 phonemes/s and a total of 64 phonemes, this idealized system would transmit the spectral components (coded as phonemes) at 60 bps. The synthesized speech quality would be determined exclusively by the phonemes stored at the receiver, independent of the specific speaker. Further, since the acoustic properties of a phoneme are strongly dependent on the surrounding sounds, the resultant synthetic speech intelligibility would almost surely be severely compromised.

TABLE 33.5 DRT Accuracy: Comparison of a 600-bps System


A somewhat better (but far from perfect) facsimile of the speech could be obtained by recognizing and storing allophones at both ends. Assuming 1000 allophonic variations of all the phonemes, and again choosing a speaking rate of 10 phonemes/s, our new system now requires 100 bps for the transmission of vocal tract information.
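
Both estimates follow from the same bookkeeping: bit rate = (units per second) × (bits needed to label one unit). A quick check of the numbers used in the text:

    import math

    units_per_second = 10                                    # assumed speaking rate
    for name, inventory in (("phonemes", 64), ("allophones", 1000)):
        bits_per_label = math.ceil(math.log2(inventory))     # 6 and 10 bits, respectively
        print(name, units_per_second * bits_per_label, "bps")    # 60 bps and 100 bps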

Examining Table 33.4, we see that the frame rate for the Kang–Coulter 600-bps vocoder is 40 Hz. We can speculate that the application of frame fill would reduce the bit rate to approximately 300 bps. With this modification we are beginning to approach the limits as defined by the idealized analysis–synthesis system. Such an extension has been tried on different basic vocoders by various researchers and is called segmental vocoding. Instead of simply omitting the transmission of alternate frames (as in frame fill), strongly correlated contiguous frames are merged into segments, and these segments are then vector quantized. If there are, on average, N frames per segment, the resultant equivalent bit rate is reduced by almost a factor of N. Note that this approach does not require explicit ASR for the segments.

It should be noted that merging of contiguous frames is an extension of the frame-fill concept discussed briefly in Section 33.2. Rather than halving the frame rate, the segmentation methods divide the frame rate by a signal-dependent integer. Since much of the speech signal consists of quasi-stationary voiced segments that change very slightly from frame to frame, segmentation algorithms can lead to greater savings in the average bit rate.
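
A minimal segmentation sketch follows: a segment grows as long as the frame-to-frame parameter change stays below a threshold, and each finished segment is then handed to the vector quantizer. The Euclidean change measure and the threshold are placeholders; Roucos et al. [15], for instance, used log-area-ratio differences.

    import numpy as np

    def segment_frames(frames, threshold):
        """Merge strongly correlated contiguous frames into variable-length segments.
        frames: (T, dim) array of per-frame parameter vectors."""
        segments, start = [], 0
        for t in range(1, len(frames)):
            if np.linalg.norm(frames[t] - frames[t - 1]) > threshold:
                segments.append(frames[start:t])    # parameters moved too far: close segment
                start = t
        segments.append(frames[start:])
        return segments                             # each segment is then vector quantized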

Segmentation algorithms lead to variable-length segments, with resultant complications. For example, the Kang–Coulter vocoder would search for a stored segment whose formant tracks most closely resembled those of the new segment.

Representation of the speech spectrum can take many forms, depending on the specific processing system. For the Kang–Coulter vocoder, the derived spectral parameters are formants, and it seems reasonable to compare adjacent segments by inspecting first-order formant differences. Roucos et al. [15] worked with the LPC-derived parameters and, for the segmentation determination, they proposed inspection of the log-area ratio differences.2

A key issue is the comparison of the derived segments with stored versions of the codebook segments. In Section 33.3 there was a brief discussion of adaptive vector quantization, in which the system goes through a transient period, entering new vectors into the codebook based on a new speaker's speech properties. An alternative strategy is the multispeaker approach, in which a fixed codebook is created from the data of a variety of speakers, in the hope that any new speaker's segments will resemble the stored segments of this variety. Still another strategy is the single-speaker approach, wherein a fixed collection of segments is computed for just one speaker – the sole user. This idea can be extended to a population of users, each having his or her own private codebook.

Obviously, the single-speaker approach requires a smaller codebook and thus uses a lower bit rate than the multispeaker system. The adaptive VQ should compare in bit rate with the single-speaker VQ at the price of more computation and greater complexity, since it has to include an algorithm that adds new code words to the codebook and an algorithm to determine which code words to eliminate.

Many strategies exist for deciding on the appropriate code to transmit, given an analyzed segment. A standard criterion is the L2 measure; that is, the mean-squared difference between the components of the new vector and those of each of the stored vectors. These components differ from system to system, whether it be LPC, a channel system, or a homomorphic system. In addition, as discussed in [7], different distance measures are available, such as log spectral distance, cepstral distance, and various criteria based on likelihood ratios.
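
For concreteness, minimal versions of two of these measures are sketched below; the RMS log-spectral distance follows the usual dB-domain definition, and both functions assume strictly positive magnitude spectra.

    import numpy as np

    def l2_distance(x, y):
        """Mean-squared difference between two parameter vectors."""
        return float(np.mean((x - y) ** 2))

    def log_spectral_distance(S1, S2):
        """RMS log-spectral distance (in dB) between two magnitude spectra."""
        diff_db = 20.0 * np.log10(S1 / S2)
        return float(np.sqrt(np.mean(diff_db ** 2)))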

A variety of methods for very low bit-rate coding have been the object of research during the past two decades; see, for example, [4], [5], [14]–[17], [20]–[22].

33.6 EXERCISES

  33.1 Prove that for the direct-form nth-order LPC synthesizer, the poles migrate toward the unit circle as the nth PARCOR coefficient kn approaches unity.
  33.2 Devise an algorithm to obtain good approximations to the formant frequencies after correctly identifying the spectral peaks when kn is very close to unity.
  33.3 Prove that the poles corresponding to the formants remain inside the unit circle as kn approaches unity and then backtracks.
  33.4 Given the formant frequencies obtained by the above methods, find the corresponding predictor coefficients for a sixth-order system.

BIBLIOGRAPHY

  1. Barnwell, T. P., and Voiers, W. D., "An analysis of objective measures for user acceptability of voice communications systems," Final Rep. DCA100-78-C-0003, Defense Communications Agency, Washington, D.C., 1979.
  2. Blankenship, P. E., and Malpass, M. L., “Frame-fill techniques for reducing vocoder data rates,” Tech. Rep. 556, MIT Lincoln Laboratory, Lexington, MA, 1981.
  3. Buzo, A., Gray, A. H., Gray, R., and Markel, J., “Speech coding based on vector quantization,” IEEE Trans. Acoust. Speech Signal Process. ASSP-28: 562–574, 1980.
  4. Cernocky, J., Baudoin, G., and Chollet, G., “Segmental vocoding – going beyond the phonetic approach,” in Proc. ICASSP '98, Seattle, pp. 605–608, 1998.
  5. Ghaemmaghami, S., and Deriche, M., “A new approach to modelling excitation in very low-rate speech coding,” in Proc. ICASSP '98, Seattle, pp. 597–600, 1998.
  6. Gold, B., "Experiments with a pattern-matching vocoder," in Proc. ICASSP '81, Atlanta, pp. 32–34, 1981.
  7. Gray, A. H., Jr., and Markel, J. D., “Distance measures for speech processing,” IEEE Trans. Acoust. Speech Signal Process. ASSP-24: 380–391, 1976.
  8. Hirata, Y., and Nakagawa, S., “A 100 bits/s speech coding using a speech recognition technique,” in Proc. Eurospeech '89, Paris, pp. 290–293, 1989.
  9. Kang, G. S., and Coulter, D. C., "600-bit-per-second voice digitizer (linear predictive formant vocoder)," NRL Rep. 8043, Naval Research Laboratory, Washington, D.C., 1976.
  10. McLarnon, E., “A method for reducing the frame rate of a channel vocoder by using frame interpolation,” in Proc. ICASSP '78, Washington, D.C., pp. 458–461, 1978.
  11. Paul, D. B., “An 800 bps adaptive vector quantization vocoder using a perceptual distance measure,” in Proc. ICASSP '83, Boston, 1983.
  12. Paul, D. B., "The spectral envelope estimation vocoder," IEEE Trans. Acoust. Speech Signal Process. ASSP-29: 786–794, 1981.
  13. Rabiner, L. R., and Schafer, R. W., Digital processing of speech signals, Prentice–Hall, Englewood Cliffs, N.J., 1978.
  14. Roucos, S., Schwartz, R. M., and Makhoul, J., “A segment vocoder at 150 b/s,” in Proc. ICASSP '83, Boston, pp. 61–64, 1983.
  15. Roucos, S., Schwartz, R. M., and Makhoul, J., “Segment quantization for very-low-rate speech coding,” in Proc. ICASSP '82, Paris, France, pp. 1565–1568, 1982.
  16. Schwartz, R. M., and Roucos, S. E., “A comparison of methods for 300–400 b/s vocoders,” in Proc. ICASSP '83, Boston, pp. 69–72, 1983.
  17. Shiraki, Y., and Honda, M., “LPC speech coding based on variable-length segment quantization,” IEEE Trans. Acoust. Speech Signal Process. 36: 1437–1444, 1988.
  18. Smith, C. P., “An approach to speech bandwidth compression,” Tech. Rep. AFCRC-TR-59–198, U.S. Air Force Cambridge Research Center, Cambridge, 1959.
  19. Smith, C. P., "Perception of vocoder speech processed by pattern matching," J. Acoust. Soc. Am. 46: 1562–1571, 1969.
  20. Soong, F. K., “A phonetically labeled acoustic segment (PLAS) approach to speech analysis-synthesis,” in Proc. ICASSP '89, Glasgow, paper S11.7, pp. 584–587, 1989.
  21. Tokuda, K., Masuko, T., Hiroi, J., Kobayashi, T., and Kitamura, T., "A very low bit rate coder using HMM-based speech recognition/synthesis techniques," in Proc. ICASSP '98, Seattle, pp. 609–612, 1998.
  22. Wong, D. Y., Juang, B. H., and Cheng, D. Y., “Very low data rate compression with LPC vector and matrix quantization,” in Proc. ICASSP '83, Boston, pp. 65–68, 1983.

1 VQ was also discussed in Chapters 9 and 26 in the context of statistical pattern recognition.

2 See Rabiner and Schafer [13] for definitions of various LPC parametric representations, including log-area ratios.
