There are a variety of techniques to modify the speed, pitch, and spectrum of a speech signal. Some methods work directly on the speech wave to modify the time scale or pitch. Other methods are based on analysis–synthesis systems (i.e., vocoders), in which the derived parameters can be adjusted to modify the synthetic output. However, some medium- and high-rate vocoder systems do not explicitly compute the fundamental frequency, which complicates their use for pitch modification.

Speech modification techniques have many applications. For instance, as noted in Chapter 30, pitch and duration must often be modified for concatenative synthesis. Speeding up a voice response system can save time for a busy, impatient user. It may also be a useful addition in speech communication channels subject to fading. Compressing the spectrum could potentially be of help to people with hearing disabilities.

The following three sections explain some of the fundamental issues in speech transformations. This is followed by a study of speech modification in analysis–synthesis systems, that is, channel vocoders, LPC vocoders, and homomorphic vocoders. The chapter concludes with a review of three specific systems: the phase vocoder [4], the Seneff system [20], and the sine-transform coder of Quatieri and McAulay [17].

A popular application of speech processing is time-scale modification. In this section, several systems are presented that perform this function while preserving, as far as possible, the original excitation and spectrum.

Schemes for time-scale compression and expansion include work by Lee [8], Garvey [5], and Fairbanks et al. [3]. These early works utilized a “sampling” method^{1}: Time was divided into segments, e.g., of 30 ms; to time-compress the speech by a factor of 2 (as shown in Fig. 40.1(b)), alternate segments were deleted and the remaining segments abutted. To time-expand by a factor of 2, each segment was repeated, as shown in Fig. 40.1(c). Previously, Miller and Licklider [11] had demonstrated that speech could be “chopped” (alternate segments zeroed out), with no appreciable intelligibility loss for small chops.

Figure 40.1 shows that the sampling method introduces artifacts, such as discontinuities at the segment boundaries.
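The sampling method is simple enough to sketch in a few lines. The following is a minimal illustration, not the implementation used in the cited work; the 8 kHz sampling rate, the test tone, and the function names are assumptions made for the example.

```python
import numpy as np

def sampling_compress(x, seg_len):
    """Delete alternate segments and abut the survivors (2:1 compression)."""
    n_seg = len(x) // seg_len
    segs = x[:n_seg * seg_len].reshape(n_seg, seg_len)
    return segs[::2].reshape(-1)

def sampling_expand(x, seg_len):
    """Repeat every segment once (1:2 expansion)."""
    n_seg = len(x) // seg_len
    segs = x[:n_seg * seg_len].reshape(n_seg, seg_len)
    return np.repeat(segs, 2, axis=0).reshape(-1)

fs = 8000                          # assumed sampling rate in Hz
seg_len = 30 * fs // 1000          # 30 ms segments -> 240 samples
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)   # 1 s test tone
y_fast = sampling_compress(x, seg_len)
y_slow = sampling_expand(x, seg_len)
```

Because the segment boundaries ignore the waveform, the abutted segments generally do not join smoothly, which is the source of the discontinuities noted above.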

Scott and Gerber [19] performed a pitch synchronous time-scale modification and reported an increase in word intelligibility from 88.1% for the sampling method to 92.1% for their method. Their experiment was restricted to words that were completely voiced, as in the example of Fig. 40.2.

A more recent method of time-scale modification described in [21] makes use of the pitch synchronous overlap and add (PSOLA) algorithm that was mentioned in Chapter 30. The overlap–add procedure can be performed on the original speech signal or on a derived excitation signal, e.g., from an LPC coder. The first step in this procedure is to compute the pitch marks. A tapered window with duration proportional to the local pitch period is centered on each pitch mark. The resultant windowed speech fragments can be overlapped to reconstruct the original signal. If, however, some fragments are selectively removed, and the gaps closed up so as to delete entire pitch cycles, the result is a sped-up version of the original signal with the same pitch and spectrum as the original. An example is shown in Fig. 40.3.
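As a rough sketch of the overlap-add idea, the code below assumes that the pitch marks are already known and that the pitch period is constant, which real speech does not satisfy. Each fragment is two periods long and weighted by a Hann window; keeping every second fragment and closing the gaps halves the duration. The function name and the synthetic test signal are illustrative assumptions.

```python
import numpy as np

def psola_compress(x, marks, period, keep_every=2):
    """Keep one windowed fragment in `keep_every` and overlap-add the
    survivors one pitch period apart (constant-period sketch)."""
    # Periodic Hann window two periods long; copies shifted by one period sum to 1.
    win = 0.5 - 0.5 * np.cos(np.pi * np.arange(2 * period) / period)
    kept = [m for m in marks[::keep_every]
            if m - period >= 0 and m + period <= len(x)]
    y = np.zeros((len(kept) + 1) * period)
    for k, m in enumerate(kept):
        y[k * period:(k + 2) * period] += win * x[m - period:m + period]
    return y

fs = 8000
period = 80                                   # 100 Hz fundamental (assumed)
n = np.arange(fs)
x = np.sin(2 * np.pi * n / period) + 0.5 * np.sin(2 * np.pi * 3 * n / period)
marks = np.arange(period, len(x) - period + 1, period)   # idealized pitch marks
y = psola_compress(x, marks, period)
```

For this strictly periodic input the interior of `y` reproduces the original waveform, only for half as many pitch cycles. With real speech the window length would follow the local pitch period, as described above.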

There are several ways to generate an excitation function that may be used to drive a vocal tract model to produce synthetic speech:

**Homomorphic analysis**: In Fig. 20.1, the high-time filtered version of the cepstrum corresponds to the excitation function. Therefore, when an inverse FFT is performed, followed by exponentiation, followed by an FFT, the appropriate excitation function is obtained.

**Inverse filtering**: By performing inverse filtering on the incoming speech, the output represents the excitation function. For an all-pole model derived in LPC, the error signal thus serves as the excitation function. Inverse filters can be constructed as the mathematical inverse of a spectral envelope derived by any other method.

**Spectral flattening**: This method removes the spectral envelope component of the speech, leaving the excitation function. Different techniques are available to perform this function.

An approximation to spectral flattening can be realized by low-pass filtering of the speech followed by downsampling. An example is shown in Fig. 40.4. Notice that the resulting excitation spectrum is not flat. Also, the frequency positions of the spectral lines are probably not harmonics of the voice fundamental frequency. Nevertheless, if the baseband signal (i.e., the low-pass filter output) is of sufficiently high bandwidth (e.g., 1500 Hz), the perceptual effect of these distortions is minimal (see Chapter 16 on Pitch Perception).
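Of the three approaches listed above, inverse filtering with an LPC model is perhaps the easiest to sketch. The example below is a minimal illustration under stated assumptions: it uses the autocorrelation method with a direct solve of the normal equations, and it tests on a synthetic two-pole "vocal tract" driven by an impulse train rather than on real speech. The function names and filter coefficients are hypothetical.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])   # x[n] ~ sum_k a[k] * x[n-1-k]

def inverse_filter(x, a):
    """Apply A(z) = 1 - sum_k a[k] z^-(k+1); the output is the prediction error."""
    A = np.concatenate(([1.0], -a))
    return np.convolve(x, A)[:len(x)]

# Synthetic test signal: impulse train (period 80 samples) through two poles.
e = np.zeros(800)
e[::80] = 1.0
x = np.zeros_like(e)
for n in range(len(e)):
    x[n] = e[n]
    if n >= 1:
        x[n] += 1.3 * x[n - 1]
    if n >= 2:
        x[n] -= 0.8 * x[n - 2]

a = lpc_coefficients(x, order=2)
residual = inverse_filter(x, a)      # approximates the impulse-train excitation
```

In this idealized case the residual is close to the original impulse train; for natural speech the error signal is noisier but plays the same role as an excitation function.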

Vocoders are analysis–synthesis systems. Thus, once the parameters of a given speech model are analyzed, it is possible to *intervene* before synthesis to produce some *transformed* version of the speech. For example, we can change the fundamental frequency from its measured value to some function of that value. The spectrum and the timing may also be altered. We will first show how such transformations can be handled in a channel vocoder. Analogous results are obtainable with LPC and cepstral vocoders.

**Speeding up the speech**: We are familiar with the result of playing a tape back at a higher speed than was used during recording, or equivalently playing out a sampled waveform at a higher sampling rate than used during capture. The pitch is increased and the formants get higher, thus distorting the spectrum. However, it is usually desirable to speed up the speech without changing the pitch or distorting the spectrum. How can this be done?

In a channel vocoder, analysis is performed on a frame basis. In each frame (typically 10–20 ms long), the energy in a frequency band is estimated (see Chapter 32). During synthesis, the number of samples synthesized is made equal to the number of samples analyzed; analysis and synthesis frames are of the same duration. Now, let's imagine that for every 100 input samples to the analyzer, only 50 samples are synthesized. This effectively shortens the duration of the output speech relative to the input speech; the result is a speedup. The fundamental frequency and the spectrum have been parametrized so they are unchanged.
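The effect of synthesizing fewer samples per frame can be seen with a toy parametric synthesizer. This is not a channel vocoder: as a simplifying assumption, each frame carries only a fundamental frequency and a gain, and the synthesizer renders a single phase-continuous sinusoid. The frame parameters and function name are made up for the illustration.

```python
import numpy as np

def synthesize(frames, fs, samples_per_frame):
    """Render one sinusoid per frame from (f0_hz, gain) pairs,
    carrying the phase across frame boundaries."""
    out = []
    phase = 0.0
    n = np.arange(samples_per_frame)
    for f0, gain in frames:
        out.append(gain * np.sin(phase + 2 * np.pi * f0 * n / fs))
        phase += 2 * np.pi * f0 * samples_per_frame / fs
    return np.concatenate(out)

fs = 8000
frames = [(120.0, 1.0)] * 80                 # 80 analysis frames of 100 samples each
normal = synthesize(frames, fs, 100)         # same duration as the input
fast = synthesize(frames, fs, 50)            # 50 samples per frame: 2x speedup
```

The sped-up output is half as long, but because the frame parameters are untouched its spectral peak stays at 120 Hz.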

It is clear that speeding up speech cannot work in real time. However, this use of the channel vocoder can be applied to a practical real-time situation [2]. Consider a long-distance speech-communication link, in which atmospheric conditions result in occasional fading of the signal. A two-way signaling path can be set up, in which the receiver notifies the transmitter that a fade has occurred. When the transmitter gets this message, it stores the analysis frames rather than transmitting; meanwhile, it continues to send a probe signal that presumably will not be received until the fade passes. When this happens, the receiver sends an all-clear signal and now the transmitted speech is sped up until the buffer is cleared, at which time normal transmission resumes.

**Pitch change**: In the early days of channel vocoders, it was demonstrated that pitch could easily be varied in real time by turning a dial. In a frame-oriented digital vocoder, there is no difficulty in combining pitch modification with slowdown (see Fig. 40.5) or speedup.

**Spectral modifications**: There are several reasons for interest in this type of transformation. Deep-sea divers speak in a helium-rich environment, and this gives the speech a Mickey Mouse effect that is due to the spectral changes caused by changes in the velocity of sound. These spectral distortions can be reversed in a channel vocoder.

Another possible application is spectral modification that compresses the speech spectrum into the frequency range that a partially deaf listener can still hear. A popular game is to change a male voice to female and vice versa; such a game could conceivably be part of a psychological gender experiment.

Voice-modification methods can also be applied to various speech recognition tasks. One such application is to modify the reference (input) voice to better resemble a target voice on which the recognizer has been trained [12], [14], [21]. Another application is, for a multispeaker environment, to generate many modified versions of one or several speakers to be used as data for training the recognizer.

To summarize: a channel vocoder, because it parametrizes excitation and spectrum separately and because the number of output samples need not be equal to the number of input samples, is capable of modifying speed, pitch, and spectrum in any combination.

The classical LPC and homomorphic vocoders (see Chapter 32) transmit the voice fundamental frequency and the voiced–unvoiced decision in ways very similar to that of a channel vocoder. Thus, pitch modifications in these systems can be the same as in the channel vocoder.

However, the parameters encoding vocal tract information are very different for these three classical algorithms. The channel vocoder estimates spectral envelope with a filterbank, and it transmits an encoded version of these estimates. In LPC, the transmitted spectral parameters are closely associated with the synthesizer, e.g., as reflection coefficients. The homomorphic vocoder typically sends an encoded version of the low-time liftered cepstrum.

In an LPC vocoder, spectral modifications can be implemented in various ways. For example, once the analyzer has determined the synthesizer parameters, the spectral envelope can be computed, either directly or by computing the DFT of the synthesizer impulse response, and then modified. A new set of autocorrelation values is then computed from the modified spectrum, and the reflection coefficients are recomputed.

Alternatively, a DFT of the computed correlation values yields the square of the spectral magnitude, which can be modified [7] before converting back to the time domain via an inverse DFT. This new correlation function can then be used to compute the modified parameters for synthesis.

In a homomorphic vocoder the cepstrum can be low-time liftered to preserve the part that pertains to the spectral envelope. A DFT will produce the log spectral envelope, and this spectrum may be modified, then exponentiated. This modified spectrum can then be converted into an impulse response via an inverse DFT.
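The liftering step can be sketched as follows. This is a minimal illustration, assuming the real cepstrum of a single windowed frame and a rectangular low-time lifter; the lifter length of 30 samples, the synthetic frame, and the function name are assumptions, and the resulting impulse response is zero phase.

```python
import numpy as np

def cepstral_envelope(frame, n_lifter):
    """Log-magnitude spectral envelope from the low-time part of the real cepstrum."""
    n_fft = len(frame)
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    cep = np.fft.ifft(log_mag).real
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    if n_lifter > 1:
        lifter[-(n_lifter - 1):] = 1.0       # matching negative quefrencies
    return np.fft.fft(cep * lifter).real

fs = 8000
n_fft = 512
t = np.arange(n_fft) / fs
# Synthetic voiced frame: harmonics of 100 Hz with 1/k amplitudes (assumed).
frame = np.hanning(n_fft) * sum(np.sin(2 * np.pi * 100 * k * t) / k
                                for k in range(1, 30))
log_env = cepstral_envelope(frame, n_lifter=30)
# The log envelope could be modified here before exponentiation.
impulse_response = np.fft.ifft(np.exp(log_env)).real
```

The lifter must be shorter than the pitch period in samples (80 here) so that the pitch-related ripple is excluded from the envelope.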

One method of performing time-scale modification in a channel vocoder is mentioned above, where speedup or slowdown is obtained by synthesizing a different number of samples than were analyzed. Another technique is to alter both the fundamental frequency parameters and the spectral parameters and then modify the ratio of the input to output sampling rates. As an example, say we wish to double the vocoded speech rate. First, we halve the fundamental frequency parameter and “scrunch” the spectrum (compress it in frequency) by a factor of 2. Then, after resynthesis based on these modified parameters, we play back the modified signal at double the original sampling rate. This speeding-up halves the duration of the utterance, but at the same time restores the pitch and spectrum to their original values. Comparable manipulations allow for time-scale modifications in LPC and homomorphic vocoders.

The phase vocoder [4] begins by performing a short-time Fourier transform (STFT) analysis of the incoming signal, yielding a complex value for each frequency bin. The correct alignment of the overlap-add portions in reconstruction relies on the phase difference between the values in each bin in successive time frames. Thus, by manipulating the phase derivative, independent of the magnitude, it is possible to modify the signal timing. As an example, consider speeding up the speech by a factor of 2. First, we scale the phase derivative of each channel by one-half, before re-integrating along time to obtain the modified STFT. If we now reconstruct with half the time hop between successive frames, the result is a signal with half the duration, but the original spectral magnitudes, and few or no artifacts resulting from the overlaps in the time-compressed synthesis. Note, however, that some care must be taken in calculating the true phase derivative (instantaneous frequency) from widely spaced time samples, since most frequencies will complete many entire cycles between the samples. Correctly accounting for these complete cycles is known as phase unwrapping.
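A compact STFT-based sketch of this procedure is given below. It is an illustration under stated assumptions, not the filter-bank implementation of [4]: the window, FFT size, analysis hop, and the floor used in the overlap-add normalization are all choices made for the example.

```python
import numpy as np

def phase_vocoder(x, rate, n_fft=1024, hop_a=256):
    """Time-scale x by `rate` (rate=2 roughly halves the duration),
    leaving the frequency content in place."""
    hop_s = max(1, int(round(hop_a / rate)))          # synthesis hop
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_a
    frames = np.stack([x[i * hop_a:i * hop_a + n_fft] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Phase advance expected per analysis hop at each bin centre frequency.
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop_a / n_fft
    # Phase unwrapping: wrap the deviation from the expected advance to [-pi, pi).
    dphi = np.diff(phase, axis=0) - omega
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
    true_advance = omega + dphi
    # Scale the phase increments by hop_s / hop_a and re-integrate along time.
    synth_phase = np.empty_like(phase)
    synth_phase[0] = phase[0]
    synth_phase[1:] = phase[0] + np.cumsum(true_advance * hop_s / hop_a, axis=0)
    out = np.fft.irfft(mag * np.exp(1j * synth_phase), n=n_fft, axis=1)
    # Weighted overlap-add at the synthesis hop.
    y = np.zeros((n_frames - 1) * hop_s + n_fft)
    norm = np.zeros_like(y)
    for i in range(n_frames):
        y[i * hop_s:i * hop_s + n_fft] += win * out[i]
        norm[i * hop_s:i * hop_s + n_fft] += win ** 2
    # The floor avoids amplifying the weakly covered samples at the two ends.
    return y / np.maximum(norm, 1e-2)

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)      # 1 s test tone
y = phase_vocoder(x, rate=2.0)
```

For this stationary tone the output is a little over half as long as the input (the window length adds a fixed overhead) and its spectral peak remains near 440 Hz.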

The same technique can be used to modify the spectrum without changing the time scale; the most convenient way to achieve this is to modify the time-base while preserving the spectrum as above, then to adjust the sampling rate to restore the original duration while at the same time stretching or compressing the spectrum.

By decomposing the signal into a set of Fourier bins with particular magnitudes and phase-derivatives, the phase vocoder is, in essence, making a sinusoidal model of the signal (see Section 40.7). From this perspective, we can see that the phase vocoder gives the best results when its frequency analysis is fine enough to contain at most a single harmonic component in each bin.

Examples of both speedup and slowdown are shown in Fig. 40.6.

The original phase vocoder [4] was implemented by using a filter bank. Portnoff [16] worked out the rate modification details with a short-time Fourier transform (STFT) analysis to effectively emulate the phase vocoder. This work served as a model for future work [13], [18] employing the STFT. Nawab et al. [15] went a step further, showing how the modified signal could be reconstructed from the *magnitude* of the STFT. Griffin and Lim [6] developed the LSEE-MSTFTM (least-squares error estimation of the modified short-time Fourier transform magnitude). Le Roux et al. [9] derived a direct expression of the constraints between nearby values in the overlapped STFT, and used this to create a much faster, progressive algorithm for accurate reconstruction from STFT magnitude.

Let us now look in detail at how these ideas can be combined into a single system for modifying voice. Seneff's approach [20] is shown in Fig. 40.7. The figure shows the steps leading to a doubling of the fundamental frequency without changing the spectrum and without parametrizing the fundamental frequency estimation.

First, the spectral envelope is measured; we know from Chapter 32 that there are several available approaches. Spectral envelope estimation is used to create a time-domain inverse filter that has the effect of deconvolving the excitation and the spectral envelope. Passage of the original signal through the inverse filter generates an approximation to the excitation (shown for voiced speech in the figure). By low-pass filtering and downsampling by 2:1, excitation pulses are generated having half the period. Meanwhile, the impulse response of the vocal tract is obtained by the inverse transform of the spectral envelope, and this function is convolved with the downsampled excitation function to produce the transformed signal, time compressed by 2:1. Finally, this signal is sent through a phase vocoder; we see that the output has twice the pitch of the input.

Figure 40.8 shows how these ideas can be incorporated into a complete system, in which phase vocoding is integrated into the overall system structure.

A high-resolution spectrum is obtained; the magnitude and phase are treated separately. Items I, II, and III illustrate how a flattened version of the spectrum can be obtained. The phase spectrum is unwrapped. Our next job is to reduce the extent of the spectrum as shown in the figure; this is done by 2:1 downsampling of the spectrum envelope to produce V. Multiplying the downsampled spectral envelope by the flattened spectrum produces VI. It is then a simple matter to combine item VI with the phase-multiplied spectrum (VII) to finally (by means of polar-to-rectangular transformation) get back the original spectrum (I) but with twice the pitch.

Phase unwrapping and multiplying are illustrated in Fig. 40.9. The original phase is essentially linear for any harmonic component of the spectrum, as shown by the solid lines that traverse zero to 2π. Unwrapping the phase yields the lower straight line, and doubling the phase yields the dotted straight line above; when it is rewrapped, we get the result shown in the figure, which corresponds to doubling the frequency of the underlying harmonic component.
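The unwrap, multiply, and rewrap sequence can be tried on a single sinusoidal component. In this sketch the wrapped phase lies in (−π, π] rather than zero to 2π, which is simply the convention of `np.angle`; the sampling rate, test frequency, and function name are assumptions.

```python
import numpy as np

def scale_phase(wrapped, factor):
    """Unwrap a phase track, multiply it by `factor`, and wrap it again."""
    unwrapped = np.unwrap(wrapped)                    # sawtooth -> straight line
    return np.angle(np.exp(1j * factor * unwrapped))  # rewrapped to (-pi, pi]

fs = 8000
n = np.arange(800)
f0 = 200.0
wrapped = np.angle(np.exp(1j * 2 * np.pi * f0 * n / fs))   # wrapped linear phase
doubled = scale_phase(wrapped, 2.0)
y = np.cos(doubled)                                   # component now at 2 * f0
```

For an integer factor such as 2 the rewrapped result would be the same without unwrapping, since multiples of 2π stay multiples of 2π; unwrapping becomes essential for non-integer factors such as 1.5.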

**Frequency Compression and Gender Transformation**: Hearing loss is a term that covers a wide range of symptoms. Here we speculate on the possibility that severe loss of high-frequency hearing (e.g., above 1 kHz) can to some degree be compensated by employing one of the transformation tricks described above. Since there is critical speech information in the spectrum above 1 kHz, some intelligibility may be returned by scrunching an entire 4 kHz spectrum into the 0 to 1 kHz band. In a channel vocoder, this can be done by keeping the pitch undisturbed and designing the synthesizer with the *same* number of filters as the analyzer, but covering only the low band. Thus, for example, the 4 kHz analyzer filter magnitude signal modulates the 1 kHz synthesis filter, the 2 kHz analyzer filter modulates the 500 Hz synthesis filter, and so on.
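The channel remapping can be sketched as below. This is a deliberately crude illustration: the 16 uniformly spaced channels are an assumption, and each synthesis channel is represented by a single sinusoidal carrier at its centre frequency rather than by a band-pass filter driven by the pitch excitation, so the excitation is not modelled at all.

```python
import numpy as np

fs = 8000
analysis_centres = np.arange(1, 17) * 250.0      # 250 Hz ... 4000 Hz (assumed layout)
synthesis_centres = analysis_centres / 4.0       # same channel count, 0-1 kHz band

def resynthesize(envelopes, centres, fs):
    """Sum one carrier per channel, each modulated by its analyzer envelope."""
    n = np.arange(envelopes.shape[1])
    carriers = np.cos(2 * np.pi * centres[:, None] * n / fs)
    return np.sum(envelopes * carriers, axis=0)

# Toy analyzer output: energy only in the 2 kHz analysis channel.
envelopes = np.zeros((len(analysis_centres), fs))
envelopes[7] = 1.0                               # analysis_centres[7] == 2000 Hz
y = resynthesize(envelopes, synthesis_centres, fs)
```

Energy measured in the 2 kHz analysis channel emerges at 500 Hz, and the 4 kHz channel maps to 1 kHz, as in the example above.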

Some experiments along these lines have been tried; unfortunately, we know of no successful results thus far [10]. An example of spectral scrunching is shown in Fig. 40.10 parts (e) and (f). Also shown in Fig. 40.10 [parts (c) and (d)] is the male-to-female transformation, in which both fundamental frequency and effective formant frequencies have been increased.

In Chapter 30 we briefly discussed the use of sinusoidal analysis for the representation of segments for concatenative synthesis. These approaches are applied to vocoding in the sine transform coder (STC) of Quatieri and McAulay [17]. In this method, the synthesizer is excited by a collection of sinusoidal signals. The frequencies and magnitudes of these signals are derived with an analysis procedure based on a high resolution, short-time DFT.^{2} The sum of these sinusoids represents the resultant synthesis. Given this model, the STC is capable of time-scale modification, pitch modification, and spectrum modification.

**Time-scale modification in the STC**: The analysis procedure computes the frequencies and magnitudes of the sinusoids at a rate corresponding to the rate at which successive DFTs are performed. When the rate of presentation of these parameters to the synthesizer is changed, the rate of the resultant synthetic speech is also changed.

**Spectral modification in the STC**: Given the high-resolution DFT, a number of options are available for finding the spectral envelope (e.g., cepstral analysis, spline interpolation, and LPC analysis). The spectral envelope is now scrunched, as illustrated in Fig. 40.11, and new magnitudes are assigned to the sinusoids based on sampling the scrunched spectrum.

**Pitch modifications in the STC**: Given the spectral envelope, pitch modification can be done by changing the derived frequencies and then sampling the spectral envelope at the new frequencies to generate new magnitudes for the shifted frequencies.
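The resampling of the envelope at shifted frequencies can be sketched as follows. The piecewise-linear envelope, the use of linear interpolation, and the function name are assumptions for illustration; [17] should be consulted for the actual procedure.

```python
import numpy as np

def shift_pitch(freqs, env_freqs, env_mags, factor, fs):
    """Scale the sinusoid frequencies by `factor` and read new magnitudes
    off the spectral envelope by linear interpolation."""
    new_freqs = freqs * factor
    new_freqs = new_freqs[new_freqs < fs / 2]        # discard anything past Nyquist
    new_mags = np.interp(new_freqs, env_freqs, env_mags)
    return new_freqs, new_mags

fs = 8000
freqs = 100.0 * np.arange(1, 11)                     # harmonics of 100 Hz
env_freqs = np.array([0.0, 1000.0, 4000.0])          # toy envelope breakpoints (Hz)
env_mags = np.array([0.0, 1.0, 0.0])                 # single broad peak at 1 kHz
new_freqs, new_mags = shift_pitch(freqs, env_freqs, env_mags, 2.0, fs)
```

The sinusoids move to harmonics of 200 Hz while the envelope, and hence the formant structure, stays where it was.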

The above methods can be combined with changes in the analysis and synthesis sampling rate ratio to produce a great variety of modifications.

In Section 40.4 we stated that in certain applications it is desirable to transform a voice to match that of a specific target voice. Valbret et al. [21] discuss this problem. To implement their scheme, it is first necessary to have sufficient data on the reference voice (the input) and the target voice. To train the system, words from each speaker are time aligned, using DTW (dynamic time warping) methods (see Chapter 24), followed by a training algorithm, using vector quantization, to set up a correspondence between target and reference vectors. When this training is complete, it is straightforward to map the input speaker into the target speaker.

The results of Valbret et al. indicate that the average value of the fundamental frequency is a more important cue than the spectrum to identify a given speaker.

Childers [1] describes a method of modeling the glottal source for voice conversion. He uses a polynomial model and enters 32 versions of the glottal source function into a VQ table.

**40.1** Show how spectral scrunching can be realized using the system of Fig. 40.8.

**40.2** In Sec. 40.5 it is stated that a phase vocoder implemented as a filter bank works best if there is a single harmonic in each filter during voicing. Explain why this is so.

**40.3** How would you modify the spectrum in a phase vocoder without affecting speed or pitch?

**40.4** Consider a high-frequency speech-communication system in which it is desired to maintain speech continuity despite fades. Imagine that during the first 10 s there are no fades; at 10 s there is a 5-s fade. Make a sketch of the resulting timing of the transmitted, buffered, and received speech, indicating the beginning and end of the fade. The end result should be that the receiver gets all of the speech, but occasionally it will receive a time-scaled version of the transmitted speech.

**40.5** It is desired to slow the output of the speech synthesized by the STC algorithm by 70%. Describe the steps needed to do this. Assume that the analyzer computes a high-resolution DFT every 10 ms.

**40.6** Using pitch synchronous speedup by 2, design an algorithm to speed up the utterance histogram of Fig. 40.1. Present your research as a computer program with audio results, if possible, or as a block diagram or flow chart.

**40.7** Given a cepstrum, how would you modify the pitch of the utterance without explicitly estimating the pitch?

**40.8** In Fig. 40.4(a), assume that the frequencies present are 300, 600, 900, and 1200 Hz, and so on. What frequencies appear in Fig. 40.4(b)? If the corresponding signal is now upsampled by 2:1 to restore the original sampling rate, what frequencies appear in the new signal?

**40.9** Using the PSOLA method, sketch a design for modifying pitch, leaving other parameters intact.

- Childers, D. G., “Glottal source modeling for voice conversion,” *Speech Commun.* **16**: 127–138, 1995.
- Gold, B., Lynch, J., and Tierney, J., “Vocoded speech through fading channels,” in *Proc. ICASSP '83*, Boston, p. 101, 1983.
- Fairbanks, G., Everitt, W. L., and Jaeger, R. P., “Method for time or frequency compression-expansion of speech,” *IRE Trans. Audio Electroacoust.* **AU-2**: 7–12, 1954.
- Flanagan, J. L., and Golden, R. M., “Phase vocoder,” *Bell Syst. Tech. J.* **45**: 1493–1509, 1966.
- Garvey, W. D., “The intelligibility of abbreviated speech patterns,” *Quart. J. Speech* **39**: 296–306, 1953.
- Griffin, D. W., and Lim, J. S., “Signal estimation from modified short-time Fourier transform,” *IEEE Trans. Acoust. Speech Signal Process.* **ASSP-32**: 236–242, 1984.
- Hermansky, H., Fujisaki, H., and Sato, Y., “Analysis and synthesis of speech based on spectral transform linear predictive method,” in *Proc. IEEE Int. Conf. Acoust. Speech Signal Process.*, Boston, pp. 777–780, 1983.
- Lee, F. F., “Time compression and expansion of speech by the sampling method,” *J. Audio Eng. Soc.* **20**: 738–742, 1972.
- Le Roux, J., Kameoka, H., Ono, N., and Sagayama, S., “Fast Signal Reconstruction from Magnitude STFT Spectrogram Based on Spectrogram Consistency,” in *Proc. Int. Conf. Digital Audio Effects*, Graz, pp. 397–403, 2010.
- Lippmann, R. P., “Experiments with frequency compression hearing aids,” personal communication, Lincoln Laboratory, Lexington, 1980.
- Miller, G. A., and Licklider, J. C. R., “The intelligibility of interrupted speech,” *J. Acoust. Soc. Am.* **22**: 167–173, 1950.
- Mizuno, H., and Abe, M., “Voice conversion algorithm based on piecewise linear conversion rules of formant frequency and spectrum tilt,” *Speech Commun.* **16**: 153–164, 1995.
- Moulines, E., and Laroche, J., “Non-parametric techniques for pitch-scale and time-scale modification of speech,” *Speech Commun.* **16**: 175–205, 1995.
- Narendranath, M., Murthy, H. A., Rajendran, S., and Yegnanarayana, B., “Transformation of formants for voice conversion using artificial neural networks,” *Speech Commun.* **16**: 207–216, 1995.
- Nawab, S. H., Quatieri, T. F., and Lim, J. S., “Signal reconstruction from short-time Fourier transform magnitude,” *IEEE Trans. Acoust. Speech Signal Process.* **ASSP-31**: 986–998, 1983.
- Portnoff, M. R., “Time-scale modification of speech based on short-time Fourier analysis,” *IEEE Trans. Acoust. Speech Signal Process.* **ASSP-29**: 374–390, 1981.
- Quatieri, T. F., and McAulay, R. J., “Speech transformations based on a sinusoidal representation,” *IEEE Trans. Acoust. Speech Signal Process.* **ASSP-34**: 1449–1464, 1986.
- Roucos, S., and Wilgus, A. M., “High quality time-scale modification for speech,” in *Proc. ICASSP '85*, Tampa, pp. 493–496, 1985.
- Scott, R. J., and Gerber, S. E., “Pitch synchronous time compression of speech,” in *Proc. Conf. Speech Commun. Process.*, Newton, Mass., pp. 63–65, 1972.
- Seneff, S., “System to independently modify excitation and/or spectrum of speech waveform without explicit pitch extraction,” *IEEE Trans. Acoust. Speech Signal Process.* **ASSP-30**: 566–578, 1982.
- Valbret, H., Moulines, E., and Tubach, J. P., “Voice transformation using PSOLA technique,” *Speech Commun.* **11**: 175–187, 1992.
