CHAPTER 30

image

SPEECH SYNTHESIS

30.1 INTRODUCTION

The goal of this chapter1 is to introduce engineering approaches for “talking” machines that can generate spoken utterances without requiring every possible utterance to be prerecorded. Generally, speech synthesis requires the use of sub-word units, in order to provide the extended or even arbitrary vocabularies required for applications such as text-to-speech (TTS); this is the most common application of speech synthesis. A TTS system operates as a pipeline of processes, taking text as input and producing a digitized speech waveform as output. The pipeline can be described in two main parts: the “front end”, which converts text into some kind of linguistic specification; and the waveform generation component, which takes that linguistic specification and creates an appropriate speech waveform.

The task of the front end is to infer useful information from the text; that is, information that will help in generating an appropriate waveform. The written form of a language does not fully specify the spoken form, so in order to correctly produce the spoken form prior knowledge must be used. Some examples of using prior knowledge to enrich the information encoded in the written form include:

1. Text preprocessing: Ambiguities in the written form, such as abbreviations and acronyms, must be resolved. An example of this is the translation of “Dr.” into either “Doctor” or “Drive,” depending on the context.

2. Pronunciation prediction: The spelling of each word must be converted into a phonetic string representing its pronunciation. For many languages, a pronunciation dictionary is required. Sometimes, additional information about the word will be required before it can be looked up in a dictionary: for example, noun and verb forms that are spelled the same (“close”, “present”, etc.). Words that are not in the dictionary must have pronunciations predicted using a letter-to-sound model. This is typically a simple statistical model learned from a set of known words and their pronunciations (i.e. the dictionary itself). Languages with regular spelling-to-sound patterns (e.g., Spanish) may only require a dictionary for exceptional words (e.g., foreign words) and can use a letter-to-sound model for all other words. Once the pronunciations of the individual words have been found, a small amount of further processing is often applied to modify some pronunciations based on the surrounding context. (A minimal sketch of this lookup-with-fallback step appears after this list.)

3. Prosody prediction: To fully specify the spoken form of the utterance, more than the phonetic pronunciation is required; there must also be a description of the prosody: the pattern of fundamental frequency, duration and amplitude. Normally, a symbolic prosodic description is first predicted from the text. An example of a system for describing prosody using a small inventory of symbols is “Tones and Break Indices” (ToBI) [45], and a common choice for predicting a ToBI description from the text (including the pronunciation, parts of speech, and any other information available) would be a decision tree that had previously been trained on data manually annotated with ToBI labels. Whatever type of symbolic representation is used, it will then be converted into continuous values of fundamental frequency, segment durations and so on. This conversion may be carried out explicitly by a separate process (for example, the conversion of ToBI symbols into values for rises and falls in the fundamental frequency), implicitly within unit selection (as described in Section 30.2 below), or by the same statistical parametric model that generates the spectral features, as in HMM-based synthesis (described in Section 30.3).
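
As a concrete illustration of the pronunciation step described in item 2 above, the following sketch looks each word up in a small dictionary and falls back to a letter-to-sound rule for out-of-vocabulary words. The dictionary entries, function names and the (deliberately naive) fallback rule are invented for illustration only; real systems use large dictionaries and statistically trained letter-to-sound models.

```python
def pronounce(word, lexicon, letter_to_sound):
    """Return a phone sequence for `word`: dictionary lookup first,
    with a letter-to-sound fallback for out-of-vocabulary words."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    return letter_to_sound(word)

# Toy data, purely for illustration.
lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "close": ["k", "l", "ow", "s"],   # the verb form would end in /z/; real
                                      # systems disambiguate using part of speech
}

def naive_letter_to_sound(word):
    # Trivially naive: one "phone" per letter.  A real system would train a
    # statistical model (e.g. a decision tree) on the dictionary itself.
    return list(word)

print(pronounce("Speech", lexicon, naive_letter_to_sound))  # dictionary hit
print(pronounce("zorp", lexicon, naive_letter_to_sound))    # fallback
```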

The resulting linguistic specification contains many layers of information: the original text, its pronunciation, prosodic pattern, and so on. From this specification, a waveform must be generated. There are two main methods in current use for doing this:

  1. Concatenative synthesis: small pre-recorded fragments of speech are automatically selected from a database, concatenated and played back. The selection of the optimal sequences of speech fragments is made according to cost functions, which evaluate how well the linguistic context of each fragment matches the context of the sentence being synthesized, and evaluate how natural the fragments will sound when concatenated together. This is the dominant technology at the moment, and is described in more detail below.
  2. Statistical parametric speech synthesis: a parametric representation of speech is generated from a statistical model, then converted to a waveform using a vocoder. This technology is receiving a significant amount of attention at the moment, because of its flexibility, as described in more detail below.

    Other methods for waveform generation include:

  3. Physical/articulatory modeling: a mathematical model of the vocal tract is used to generate the speech signal. Although such models can generate some very realistic speech sounds, particularly vowels, they typically sound less natural for complex sounds, such as consonant clusters. The early mechanical systems from von Kempelen and Wheatstone (see Chapter 2) were literal physical models. More recent examples use software simulations, as in the work of Coker and colleagues ([9], [11], or [43]) or computationally-intensive numerical methods such as finite-element or waveguide models of the vocal tract [52, 13]. All of these approaches suffer the limitation that it is currently not possible to use text to automatically predict accurate values for the (potentially large number of) parameters of a physical model. Such approaches are currently more interesting from a scientific standpoint than as practical methods for speech synthesis.
  4. Formant synthesis: a set of filters are used to simulate the resonances of the vocal tract; the input to the filters is a periodic signal for voiced speech or an aperiodic signal for unvoiced speech, or a combination of the two. Typically, the various parameters of formant synthesizers (such as the frequencies and bandwidths of the filters) are provided by a complex set of hand-crafted rules. This technology was the state of the art until the mid-1980s, and is still used in a few legacy applications, or where only very limited computational power and memory are available.

30.2 CONCATENATIVE METHODS

The current dominant approach to speech synthesis involves the concatenation of fragments of waveform, selected from a pre-recorded database of natural speech. This “cut-and-paste” approach to waveform generation requires a carefully recorded and labelled database, an automatic method for selecting fragments of speech from that database and a signal processing component to concatenate the fragments.

The earliest concatenative methods stored very small amounts of speech. For instance, in the approach developed by Forrest Mozer and used by some semiconductor manufacturers in the 1980s, simplified waveforms were stored and concatenated during playback [39]. Voiced sounds were compressed by manipulating a single pitch-period waveform so that fewer signal samples were needed to produce a power spectrum sufficiently close to the original.

For many years, the most popular method was to store one example of every required diphone (defined in Chapter 23). Diphones are a better choice than phones for concatenative synthesis, because the joins between units will then occur mid-phone. The spectrum in the middle of a phone tends to be relatively stationary and consistent, compared to at a boundary between two phones. In diphone synthesis, any required word or sentence can be constructed by concatenation of the appropriate sequence of diphones. Since only one copy of each diphone is stored, a significant amount of signal processing is required in order to achieve the required fundamental frequency contour and segment durations. In the most popular current approach, known as unit selection, many different instances of each required diphone (or other unit type) are stored, in order to maximize the chances of finding a unit from an appropriate context and to minimize the amount of signal processing required during concatenation [28].

30.2.1. Database

A typical unit selection synthesizer will use a database containing a few hours of speech from a single speaker, recorded under controlled conditions and carefully labelled with a time-aligned phonetic transcription and prosodic information such as phrase breaks, pitch accents and boundary tones. The labeling may be done manually or semi-automatically.

The contents of the database – the sentences read by the speaker and the accuracy with which the speech is labelled – are a major contributing factor to the final quality of the system. The database must contain a wide variety of types of speech units, in a wide variety of phonetic and prosodic contexts. The sentences to be read are therefore often selected algorithmically from a much larger body of text, such as web pages, newspapers or out-of-copyright books. For domain-specific applications, the inclusion of a small amount of domain-specific material can substantially improve the quality of the system.

image

FIGURE 30.1 In the unit selection method, units are automatically chosen from a large database, then concatenated in order to synthesize a new sentence. In the figure, “S12” denotes the 12th instance of unit type “s” in the database. The list of candidates for each target has already been pruned, leaving only units with a relatively low target cost (in this example, the 12th, 73rd and 62nd instances of “s” have been retained as candidates). Dynamic programming is used to find the path through the lattice of candidate units that minimizes the total sum of target and join costs (in this example, that path is “s12 p2 iy32 ch7”). The selected units are then concatenated to produce the final synthetic speech waveform. For simplicity, this example uses phone units and the concatenation takes place using waveforms, but an actual system may use diphone units and RELP-PSOLA to concatenate the units.

Although larger databases provide better coverage of the speech units, it is harder to record them consistently since the speaker must spend many hours in the recording studio over a period of weeks or months. Consistency of recording conditions, speaking rate, effort and voice quality are very important, because when new sentences are synthesized, adjacent units might be selected from parts of the database recorded far apart in time.

30.2.2. Unit selection

From the large and carefully labelled database, units are automatically selected to render any required new sentence, as illustrated in Figure 30.1. Typically, the units are diphones, but half phones and other unit types are also used. In order to choose the best sequence of units, two cost functions are employed. The target cost measures how closely the linguistic context of each candidate unit from the database matches the linguistic specification of the corresponding unit in the sentence to be rendered. This cost function generally consists of a weighted sum of factors, each of which measures the match/mismatch for a single linguistic feature (left and right phonetic context, stress, syllable position, ...). The target cost is required because the acoustic properties of units are heavily dependent on their context. The other cost function, the join cost, measures how well each possible sequence of candidate units will concatenate, by comparing the short-term spectral features, energy and fundamental frequency of each unit at the concatenation point. The join cost is required because sudden changes in acoustic properties at the concatenation points would be perceived as unnatural. It is usual to define the join cost between two contiguous units from the database as zero. As a result, the unit selection method will prefer to choose naturally-occurring sequences of units, resulting in fewer concatenation points per sentence. A dynamic programming search method (Chapter 24) is used to efficiently find the overall best sequence of units that minimizes the sum of all the target and join costs.
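
A minimal pure-Python sketch of that dynamic programming search is shown below. The function name, data layout and the `join_cost` callable are illustrative assumptions rather than the interface of any particular system; a real synthesizer would also prune the candidate lists and precompute many of the costs.

```python
def select_units(candidates, target_costs, join_cost):
    """Find the candidate sequence minimising total target + join cost.

    candidates:   list of length T; candidates[t] is the (pruned) list of
                  candidate units for target position t.
    target_costs: target_costs[t][i] is the target cost of candidates[t][i].
    join_cost:    join_cost(u, v) is the cost of concatenating u then v
                  (defined as zero if u and v were contiguous in the database).
    """
    T = len(candidates)
    # best[t][i] = (lowest total cost of any path ending in candidate i at
    #               position t, index of the best predecessor at t - 1)
    best = [[(target_costs[0][i], None) for i in range(len(candidates[0]))]]
    for t in range(1, T):
        column = []
        for i, u in enumerate(candidates[t]):
            costs = [best[t - 1][j][0] + join_cost(candidates[t - 1][j], u)
                     for j in range(len(candidates[t - 1]))]
            j = min(range(len(costs)), key=lambda k: costs[k])
            column.append((costs[j] + target_costs[t][i], j))
        best.append(column)
    # Trace back from the cheapest final candidate.
    i = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(T - 1, -1, -1):
        path.append(candidates[t][i])
        i = best[t][i][1]
    return list(reversed(path))
```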

30.2.3. Concatenation and optional modification

Once the sequence of units has been selected, they are concatenated together. In some systems, certain properties of the units may be modified in order to make them join better (e.g., small modifications to the spectrum, energy and fundamental frequency around the concatenation points) or in order to achieve a required prosodic pattern (modifications to durations and fundamental frequency). Generally, most unit selection systems aim to perform as little modification to the speech signal as possible, since it will result in audible artefacts.

Depending on the modifications required, concatenation can take place using one of a number of possible representations of the speech signal. In the simplest case, time-domain waveforms are concatenated using pitch-synchronous overlap-and-add (PSOLA) [7] at the concatenation points. If small modifications are required, then techniques such as linear predictive PSOLA (LP-PSOLA) are popular. In LP-PSOLA, rather than storing diphone waveforms per se, linear predictive coefficients (or transformed values, such as reflection coefficients, log-area ratios or line spectral frequencies) are stored. The PSOLA algorithm is then used to provide prosodic modifications by operating on the excitation signal, and spectral envelope smoothing is achieved by smoothing the LP coefficients [40] (or more commonly the line spectral frequencies derived from them). For high-quality speech synthesis, where storage space is not the main consideration, residual-excited linear predictive PSOLA (RELP-PSOLA) [18] is commonly used since it results in near-perfect reconstruction of the speech signal in cases where little or no modification has been performed. LP techniques are described in Chapters 32 and 34.
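
As a rough illustration of time-domain concatenation, the sketch below joins waveform fragments with a short linear cross-fade at each boundary. This is only a crude stand-in for PSOLA, which places its overlap-add windows pitch-synchronously around marked glottal epochs; the function name and fade length are assumptions for illustration.

```python
import numpy as np

def concatenate_with_crossfade(units, fade=80):
    """Join a list of 1-D waveform fragments with a linear cross-fade of
    `fade` samples at each boundary (every fragment is assumed to be longer
    than the fade length).  True PSOLA would instead overlap-add
    pitch-synchronous windows centred on glottal epochs."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for u in units[1:]:
        u = u.astype(float)
        out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out
```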

In the days when diphone synthesis was the dominant technique, many representations for speech signals were proposed that enabled fundamental frequency, duration and spectral envelope to be modified independently. Although less relevant to unit selection, some may still be useful, and may perhaps be adapted for use in statistical parametric synthesis. These techniques include sinusoidal (harmonic) modeling, where the speech signal is represented as a sum of sine waves along with some representation of the aperiodic energy present [36], [3] and resynthesis methods such as multiband resynthesis overlap add (MBROLA) [18] in which the database is pre-processed (using a sinusoidal model) so that every frame has the same phase, thus making subsequent concatenation and prosodic modification less sensitive to errors in epoch marking.

30.3 STATISTICAL PARAMETRIC METHODS

Dutoit [18] pointed out that approaches using corpus-based models of speech segments offered a more flexible framework than waveform concatenation and indeed this turned out to be an area of significant developments in speech synthesis. At the time Dutoit made that observation, two HMM-based speech synthesis systems had been proposed. HMMs had been in use for speech recognition for some time (see Chapter 25 – Statistical sequence recognition) and, since they are generative models2, it was natural to ask whether they could be used to generate speech.

Tokuda and colleagues [48] developed a method for generating speech features from HMMs which used the dynamic features (also known as delta and delta-delta features). The key feature of this work was the ability to generate smooth speech feature trajectories with the appropriate statistical properties. This work directly led to current HMM-based synthesis systems which offer a complete modeling framework for the spectrum, fundamental frequency and duration of speech segments.

Donovan & Woodland [17] also developed a method for synthesis using HMMs. The key feature of their method was to employ powerful data-driven decision-tree clustering techniques originally developed for speech recognition, in order to automatically find an inventory of a few thousand context-dependent sub-word units (which correspond to HMM states, each of which is one-fifth of a phoneme). The waveform generation algorithm was simple: the sequence of HMM states corresponding to the input phoneme sequence is found using the clustering decision tree, then piecewise constant linear prediction coefficients associated with each state are used to generate the waveform.

30.3.1. Vocoding: from waveforms to features and back

HMMs, as formulated for speech recognition, operate not on time-domain waveforms but on a representation of the spectral envelope of speech. One problem to overcome when using HMMs for speech synthesis is therefore how to generate speech waveforms from such representations, which we will call speech features. A typical set of speech features, used in speech recognition, would be a dozen or so MFCCs (as defined in Chapter 22). This representation does not contain sufficient information to reconstruct the original speech signal. Crucially, there is no information about the source, and the spectral envelope itself is only very crudely described by 12 coefficients. For synthesis, source information is required, plus a more detailed representation of the spectral envelope.

A useful way to understand statistical parametric speech synthesis is by comparing it to a vocoder (Chapter 32). A vocoder contains two components: an analyzer or encoder that converts speech waveforms into a parametric representation called speech features (usually for storage or transmission) and a synthesizer or decoder that reconstructs the waveform from the speech features. Statistical parametric speech synthesis operates in the speech feature domain. But, whereas a vocoder merely transmits the features unaltered, in speech synthesis a statistical model is built from them. This model can be stored and used later to generate appropriate speech features for any required sentence. Since the speech features are being modified (condensed into a statistical model comprising means and variances, then generated from that model), perfect reconstruction of the speech waveform will not generally be achieved even for sentences that are present in the training corpus.

Clearly, the quality of the vocoder will strongly affect the overall quality of the system. A better vocoder might be expected to give better quality synthetic speech. However, it must be remembered that the speech features will be modified in speech synthesis. Some features are more amenable to modifications than others, in just the same way that some representations devised for concatenative synthesis give better quality than others when modifying the fundamental frequency, duration and spectral envelope. For HMM-based speech synthesis (by far the most common statistical parametric method), the speech features must have the following properties:

  • allow reconstruction of the waveform (vocoding)
  • be amenable to modifications – separate out the fundamental frequency from the spectral envelope and also allow duration to be altered
  • be appropriate for statistical modeling – for example, being decorrelated to allow the use of diagonal variance Gaussian pdfs, use a non-linear frequency warping (e.g., the Mel scale) in order to concentrate the parameters of the model on those frequencies that matter most for speech signals

A common choice of speech features that meet these requirements is a form of frequency-warped cepstral coefficients, plus source information. The cepstral coefficients are usually extracted from speech waveforms not using the filterbank method employed in ASR (Chapter 22) but rather with the spectral representation referred to below. The frequency warping is typically on the Mel scale (as described in Chapter 22) and the resulting features are called MELCEP (whereas the cepstral features typically used in ASR are called MFCC).

In early HMM-based speech synthesizers, the speech features and the methods employed for generating speech waveforms from them resulted in rather “buzzy”–sounding speech. When compared to the much more “transparent” qualities of concatenative methods, early attempts at HMM-based speech synthesis produced far less natural results. But improvements came quickly to all aspects of the method, particularly in the following three areas:

Spectral envelope representation: Because the speech features will be statistically modeled, it is useful to have a decorrelated representation of the spectral envelope – for example, cepstral coefficients (Chapter 20). Better results are obtained when using a warped frequency scale, such as the Mel scale, because this leads to more detailed modeling at frequencies where the human auditory system is most discriminating (and which also matter most for speech signals). The Mel Generalized Cepstrum (MGC), proposed by Tokuda [47], is a generalized representation that includes (as special cases) both Cepstral Coefficients and Linear Predictive Coefficients, with a user-selectable frequency scale warping. MGC coefficients give good results for HMM-based speech synthesis and a cepstrum of order 40 is a common choice, although even larger values may give better results in some cases.
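
The sketch below illustrates the basic idea of cepstral coefficients as a compact, decorrelated description of the spectral envelope: a plain (unwarped) real cepstrum of a windowed frame, truncated to a chosen order. A real system would instead apply Mel-frequency warping and (M)GC analysis to a harmonic-free spectral envelope (for example, one estimated by STRAIGHT), so this is purely an illustration of the transform, not of any production analysis chain.

```python
import numpy as np

def truncated_real_cepstrum(frame, order=40):
    """Real cepstrum of one windowed speech frame, truncated to `order` + 1
    coefficients.  Unlike MGC analysis, no frequency warping is applied and
    the harmonics of F0 are not removed first."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    log_spectrum = np.log(np.maximum(spectrum, 1e-10))  # avoid log(0)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[: order + 1]
```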

Spectral envelope estimation: Speech synthesis benefits from a much more detailed representation of the spectral envelope than used in automatic speech recognition (Chapter 22), but standard methods for extracting spectral envelopes from the short-term spectrum of a single frame suffer from interference from the harmonics in this case. The high-quality vocoder STRAIGHT includes a method for extracting a detailed spectral envelope representation without undue influence from the fundamental frequency, namely, pitch synchronous analysis. STRAIGHT can also be used to generate the waveform from the speech features during synthesis.

Source modeling: Simple excitation – switching between a pulse train and noise – results in a buzzy quality in the synthetic speech. Current systems use mixed excitation, where the pulse train is mixed in varying proportions with aperiodic noise whose spectrum is shaped using a 5-band filter. The speech features modeled by the HMM therefore comprise a single value for the fundamental frequency (which is set to a special dummy value in unvoiced frames) and 5 filter coefficients.
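
A minimal sketch of band-wise mixed excitation follows, assuming a frame-level function driven by F0 and five band aperiodicity values. The impulse-train source, equal-width FFT bands and scaling constants are simplifications for illustration, not the excitation model of any particular system.

```python
import numpy as np

def mixed_excitation_frame(f0, band_aperiodicity, fs=16000, n=512):
    """Generate `n` samples of mixed excitation.

    f0:                fundamental frequency in Hz (0 means unvoiced; the
                       aperiodicity values would then normally all be 1).
    band_aperiodicity: five values in [0, 1], the proportion of noise in
                       each of five equal-width frequency bands.
    """
    pulses = np.zeros(n)
    if f0 > 0:
        period = int(round(fs / f0))
        pulses[::period] = 1.0                    # simple impulse train at F0
    noise = np.random.randn(n) * 0.1

    # Mix pulse train and noise band by band in the frequency domain.
    P, N = np.fft.rfft(pulses), np.fft.rfft(noise)
    edges = np.linspace(0, len(P), 6).astype(int)
    E = np.zeros_like(P)
    for b, ap in enumerate(band_aperiodicity):
        lo, hi = edges[b], edges[b + 1]
        E[lo:hi] = (1.0 - ap) * P[lo:hi] + ap * N[lo:hi]
    return np.fft.irfft(E, n)
```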

30.3.2. Statistical modeling for speech generation

The HMMs used for speech synthesis are similar to those used for ASR (Chapter 25). However, there are some important differences.

Context-dependency HMM-based speech synthesis generates not only the segmental properties of speech (those things that differentiate phonemes) but also prosody, reflected in the durations, amplitudes and fundamental frequency of speech sounds. The HMMs used in speech synthesis are generally, as in ASR, models of phone-sized units. Since the acoustic properties of speech depend not only on the current phoneme identity but on the surrounding context, the models must be context dependent. The difference between synthesis and recognition is in the context that must be taken into account. In recognition, only the segmental context is used – the one or two preceding and following phones – resulting in triphone or quinphone models (Chapter 23). For synthesis, the prosodic context must also be taken into account, including features such as the position of the current phone within its parent syllable and phrase, the position of its parent syllable in the phrase, the position of that phrase in the utterance as a whole, and so on. One HMM is required for every phoneme in every possible phonetic-prosodic context, leading to a vast number of potential models. Just as in ASR, parameter sharing amongst models is necessary (see Section 26.8).
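
To make the idea of a phonetic-prosodic context concrete, the sketch below composes a context-dependent model name from a handful of such features. The fields and label format are purely hypothetical; systems such as HTS define a much richer, fixed full-context label format that is not reproduced here.

```python
def context_label(phone, prev_phone, next_phone,
                  pos_in_syllable, pos_in_phrase, pos_in_utterance):
    """Compose an illustrative full-context model name for one phone.
    Every distinct label would, in principle, name a distinct HMM, which is
    why parameter sharing (state clustering) is essential in practice."""
    return "{}-{}+{}/syl:{}/phr:{}/utt:{}".format(
        prev_phone, phone, next_phone,
        pos_in_syllable, pos_in_phrase, pos_in_utterance)

# e.g. context_label("iy", "p", "ch", 2, 1, 1) -> "p-iy+ch/syl:2/phr:1/utt:1"
```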

Observation probability distributions The observation vectors, comprising the speech features required to drive the vocoder, are considerably larger for synthesis than recognition. For example, 40 MELCEP coefficients may be used to describe the spectral envelope, plus a value for F0 and five aperiodic energy coefficients. To these 46 numbers are appended first and second order derivatives, making a total observation vector size of nearly 140.

It is common to divide this large feature vector into several streams for statistical modeling purposes (for example, parameter tying may be performed separately for each stream). The F0 value also requires special treatment, since it is undefined in unvoiced frames, so multi-space distributions (MSD) are used [49], which can handle a combination of discrete and continuous distributions.

Duration modeling In speech recognition, the durations of individual phones are not strongly discriminative (i.e., knowing the duration of a segment does not help much in identifying which phoneme it is). However, a good model of duration is essential for high-quality speech synthesis and so HMM-based speech synthesis uses explicit parametric models of state durations. These are typically log-Gaussian distributions. Introducing explicit duration models means that the models are no longer strictly HMMs, and are in fact referred to as Hidden Semi-Markov Models (HSMMs).

Generation using the Trajectory HMM algorithm Recall that the observation vectors for HMMs, whether used for recognition or synthesis, contain so-called delta and delta-delta features (Section 22.3). These represent the rate of change of the basic speech features (commonly known as the static features in HMM recognition or synthesis) over time. Obviously, the values of the delta features at any particular frame depend on the values of the static features in that frame and in the neighboring frames.
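
A minimal sketch of appending delta and delta-delta features to a matrix of static features is shown below, assuming the simple two-point regression windows also used in the parameter-generation sketch that follows; real systems may use wider windows.

```python
import numpy as np

def add_deltas(static):
    """Append delta and delta-delta features to a (T, D) matrix of static
    features, using delta_t = 0.5 * (x[t+1] - x[t-1]) and
    deltadelta_t = x[t+1] - 2 * x[t] + x[t-1], with the edge frames padded
    by repetition."""
    x = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (x[2:] - x[:-2])
    deltadelta = x[2:] - 2.0 * x[1:-1] + x[:-2]
    return np.concatenate([static, delta, deltadelta], axis=1)
```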

If we were to generate an observation sequence from an HMM by first randomly choosing a state sequence and then randomly generating an observation from the Gaussian in each state, there is no guarantee that the generated delta features will depend on the static features in the correct way. This is because of the Markov property of the model, which means that the features at one instant in time are statistically independent of the features at all other times, given the HMM state sequence. Furthermore, if we were to generate the most likely observation from each Gaussian, that would always be its mean value, resulting in a piecewise constant sequence of values. The so-called Trajectory HMM solves these problems, and generates a smoothly-varying sequence of static, delta and delta-delta features with the correct inter-dependencies and statistical properties. The generated sequence is the one with the maximum likelihood, given the model and the constraints between static, delta and delta-delta features.
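
The core computation can be sketched for a single feature dimension: stack the per-frame Gaussian means and variances of the static, delta and delta-delta features, build the window matrix W that maps a static trajectory to its deltas, and solve the resulting normal equations WᵀΣ⁻¹W c = WᵀΣ⁻¹μ. Real implementations operate on full feature vectors, exploit the banded structure of WᵀΣ⁻¹W and handle the edge frames more carefully; the function name and data layout here are assumptions for illustration.

```python
import numpy as np

def generate_static_trajectory(means, variances):
    """Maximum-likelihood generation of one static feature trajectory.

    means, variances: arrays of shape (T, 3) giving, at each frame, the
    Gaussian mean and variance of the static, delta and delta-delta
    features (read off whichever HMM state occupies that frame).
    """
    T = means.shape[0]
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                        # static
        if 0 < t < T - 1:
            W[3 * t + 1, t - 1], W[3 * t + 1, t + 1] = -0.5, 0.5  # delta
            W[3 * t + 2, t - 1] = 1.0                             # delta-delta
            W[3 * t + 2, t] = -2.0
            W[3 * t + 2, t + 1] = 1.0
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)      # diagonal inverse covariance
    A = W.T @ (prec[:, None] * W)           # W' Sigma^-1 W
    b = W.T @ (prec * mu)                   # W' Sigma^-1 mu
    return np.linalg.solve(A, b)            # smooth static trajectory c
```

Because the constraints couple neighboring frames, the solution follows the state means while respecting the delta statistics, avoiding the piecewise-constant output described above.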

image

FIGURE 30.2 Speaker-adaptive HMM-based speech synthesis operates in three phases. Training part: an “average voice” model is trained on a multi-speaker database. Adaptation part: speech from the target speaker is used to estimate adaptation transforms. Synthesis part: the adapted model is used to generate speech. “MSD-HSMM” means Multi-space distribution Hidden Semi-Markov Model. Reproduced with permission of Junichi Yamagishi.

30.3.3. Advanced techniques

As HMM-based speech synthesis has come to maturity, an ever-increasing array of advanced methods are being developed to improve all aspects of the method.

One notable technique that has been borrowed from speech recognition, then further improved, is the adaptation of the statistical models in order to manipulate the generated speech. The original goal of adaptation was to transform a set of already-trained HMMs so that they better fitted some new data. A typical application of this is in speaker-adaptive speech recognition (Chapter 28), in which the very large number of model parameters cannot be re-trained on a small amount of data from one speaker, but where a relatively small number of adaptation transforms can be estimated. These transforms are linear and are applied to the model parameters (such as the means and covariances of the Gaussians). To understand why linear transforms work, think about the vowel spaces (F1 vs. F2) of two individual speakers: to map one to the other, we could apply a shift followed by a scaling and rotation. Automatic methods exist for estimating the transforms from either labelled or unlabeled speech data. Applied to speech synthesis, adaptation allows the creation of new voices using quite small amounts of data. These new voices may be for different speakers, or different speaking styles of the same speaker (e.g. emotional speech). Since the transforms are linear, it is also easy to interpolate between multiple transforms, thus allowing continuous modulation of emotion or speaker identity. Figure 30.2 illustrates how a speaker-adaptive HMM-based synthesizer is trained, adapted and used to synthesize speech.
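
The sketch below shows how such a transform might be applied to a set of Gaussian mean vectors, and how several transforms could be interpolated; estimating the transform itself (the usual EM-based computation) is not shown, and the function names and array shapes are illustrative assumptions.

```python
import numpy as np

def adapt_means(means, A, b):
    """Apply a single linear transform to Gaussian mean vectors (the rows of
    `means`, shape (N, D)):  mu' = A @ mu + b."""
    return means @ A.T + b

def interpolate_transforms(transforms, weights):
    """Linearly interpolate several (A, b) transforms, e.g. to blend between
    two target speakers or speaking styles; `weights` should sum to 1."""
    A = sum(w * A_i for w, (A_i, _) in zip(weights, transforms))
    b = sum(w * b_i for w, (_, b_i) in zip(weights, transforms))
    return A, b
```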

A recent comprehensive description of HMM Synthesis can be found in [53].

30.4 A HISTORICAL PERSPECTIVE

It is useful to look back at early approaches to speech synthesis. Although the methods described in this section are no longer in widespread use, links can be made between these early methods, which were generally intended to model the behavior of the vocal tract, current methods (unit selection and HMM-based synthesis) and future possibilities, such as detailed (and controllable) physical models.

As noted earlier, von Kempelen's synthesizer could be viewed as an early articulatory system. However, even the 1939 Voder of Chapter 2 already had abstracted the model of speech production from that of a direct analog of articulation to one that focused on key features that could be easily parametrized. As noted previously, this simple model was based on separately specifying the excitation from the spectral shaping that modeled the filtering action of the vocal tract. Although this characteristic of the Voder foreshadowed later developments, the Voder still required hand-control of the parameters, as did von Kempelen's synthesizer. In both cases, virtuosic performances by the human controller were required – a foreshadowing of the problems with later approaches based on analogs of the vocal tract: the limitations may not be in the modeling, but rather in the methods for controlling the values of the model parameters.

By the 1950s, the OVE [20] and PAT [34] synthesizers3 introduced an intermediate step which allowed parameters to be entered by hand after careful spectrographic study of the utterance to be synthesized. (Thus, these were not real-time devices like the Voder or von Kempelen's machine.) OVE was a serial combination of formant resonators, whereas PAT was a parallel network. In each case, though, intense human effort was involved, but not to play the synthesizer. Rather, researchers inspected spectrograms to determine formant positions for different sounds, which were catalogued and recalled.

This human labor was eliminated in the Pattern Playback of Haskins Laboratory [14], which synthesized speech directly from printed spectrograms or hand-drawn spectrographic cartoons. The device is shown in Fig. 30.3. It played an important role in early speech-perception research [35].

image

FIGURE 30.3 The Haskins Pattern Playback, consisting of an optical system for modulating the amplitudes of a set of harmonics of 120 Hz over time, depending on patterns painted on a moving transparent belt. From [32].

Fant et al. [21] developed OVE II, shown in Fig. 30.4. This system is a typical formant synthesizer and uses an arrangement of resonant filters to simulate the formants of the vocal tract. The input to the filters is provided by an excitation model, which simulates the periodic signal produced by the vocal folds and the aperiodic (noise-like) signal produced by frication. Other systems, such as that of Holmes [27], use a parallel, rather than serial, arrangement of filters, but the idea is basically the same.

The synthesizer developed by Dennis Klatt [31] was a compromise between the designs of Fant and Holmes, using both serial and parallel arrangements of filters (Fig. 30.5). There are a total of 19 parameters to vary in this system.

The 1960s witnessed the development of synthesis-by-rule programs, which took as input simply the sequence of phonemes, from which the parameters of the synthesizer were automatically predicted using complex, hand-crafted sets of rules. Klatt's formant synthesizer (Fig. 30.5), when driven by such a set of rules, provides the complete text-to-speech system called Klattalk, later commercialized under the name DECtalk.

As noted earlier, complete text-to-speech translation needs an additional major step: grapheme-to-phoneme transcription (converting text symbols into phonemes). Umeda et al. [50] demonstrated the first complete text-to-speech system, and research proceeded through the 1970s and into the 1980s, culminating in commercially available systems.

The principal connection between these early speech synthesizers and current methods such as unit selection and HMM-based speech synthesis is that they generally use some model of speech production comprising a source and a filter. Even LP-PSOLA or RELP-PSOLA unit selection involves a source-filter model of speech, although in RELP, the source “model” consists of stored samples of residual waveforms. In HMM-based speech synthesis, there is an explicit source-filter model, with the parameters of both being controlled by the HMM.

image

FIGURE 30.4 The OVE II Speech Synthesizer of Gunnar Fant. From [21].

The form of the source-filter model (whether it is LP-PSOLA, or STRAIGHT) will obviously make a significant contribution to the overall quality of the speech produced by the synthesizer, but even the best source-filter model is of no use unless it can be automatically driven – its parameter values must be provided. For example, in the case of HMM-based speech synthesis, there is still a substantial difference between STRAIGHT-vocoded speech (i.e., a source-filter model being used with exactly the right parameter values) and speech synthesized by the HMM plus STRAIGHT waveform generation. This indicates that the statistical modeling is probably the current weakest link in the chain.

30.5 SPECULATION

As noted back in 1997 by Dutoit [18], speech synthesis is not a solved problem, at least in the sense that its quality is still not as good as that of natural speech (particularly for general text-to-speech synthesis). This statement is still true today, but there has nevertheless been significant progress since 1997.

The most significant development has been HMM-based speech synthesis. This method is already being used in commercial products, particularly in applications where its low memory footprint is an advantage such as embedded systems (e.g. in-car satellite navigation).

More generally, there has been an increase in the use of statistical methods in speech synthesis and more application of data-driven learning techniques. The manual labour and high costs associated with constructing a state-of-the-art unit-selection system are a substantial barrier to the widespread use of that technique. If methods can be found to automatically construct voices, with little or no manual intervention, then the range of applications for speech synthesis can be dramatically extended. Data-driven, statistical techniques offer the most promising route to this goal.

image

FIGURE 30.5 The Klatt Synthesizer. Nineteen variable control parameters are identified, including the new voicing source parameters OQ (open quotient) and TL (spectral tilt). From [32].

Dutoit also notes that natural speech has a kind of variability that is not observed in synthetic speech, but that adding randomness to the parameter streams in current systems just makes them sound worse. Further work in this area is still required.

30.5.1. Physical models

The most successful models of speech at the moment are quite abstract: they operate in the spectral domain and do not have any realistic representation of vocal tract shape. Physically-plausible models of the vocal tract are slowly improving, although there is not yet a good enough method for automatically controlling their parameters to generate speech and they tend to have difficulties with sounds such as stops. If these problems can be solved, then physical models may one day offer the ultimate in flexibility and realism for generating artificial speech.

30.5.2. Sub-word units and the role of linguistic knowledge

Speech synthesis involves the construction of new words and sentences from sub-word units. In diphone synthesis, the unit inventory (the list of types of unit) is obvious: one of each required diphone. But in unit selection and HMM-based speech synthesis, the unit inventory is not quite so explicit.

Unit selection uses a large database of labelled speech. Since the precise linguistic context of every single unit (every token) is different (absent duplicated sentences in the database), we might say the number of unit types is the same as the number of tokens. However, this is not helpful when we wish to construct a novel sentence: we will almost never find the correct units in the database. Instead, it is essential to define classes of equivalent or interchangeable units so that we can re-use units in the database to make novel sentences. For example, if we think that the only contextual linguistic properties that influence the sound of a phone are its left and right neighbors, then the inventory we should choose is triphones. In reality, many other properties of the context influence the sound of a phoneme, and this is what the target cost attempts to model. The target cost is used to ‘score’ candidates based on their contextual match (in terms of some linguistic properties) to the target sentence. Some candidates from the speech database may receive the same score from the target cost function. We may then consider them to be a single type of unit: they are linguistically equivalent, given the current target sentence being synthesized, and cannot be differentiated based on their linguistic properties. The target cost function therefore implies a unit inventory, albeit in a rather opaque fashion that is hard to analyze.

In HMM-based speech synthesis, the large number of phonetic-prosodic contexts taken into account results in a potentially vast number of unit types, but in practice the HMM states are clustered using a decision tree. The effective unit inventory is therefore the set of leaves of the tree – the state clusters.

The state clustering decision tree in current approaches to HMM-based synthesis (as well as in Donovan's method [17]) is performing a very similar role to the target cost function in unit-selection synthesis. Both are describing which units are equivalent and interchangeable in terms of their linguistic properties. In unit selection, such units receive the same target cost. In HMM synthesis, they end up in the same state cluster.

This data-driven construction of linguistic classes – the basic building blocks of speech – is a powerful concept. With a larger database of speech, larger, more fine-grained, unit inventories are possible. Automatic methods for defining the number of unit types typically have the attractive property of scaling the complexity of the system to suit the available data.

30.5.3. Prosody matters

Everyone agrees: prosody matters for synthetic speech. Yet the field of prosody research is characterized by many competing theories, models and techniques with little agreement as to the most promising way forward and certainly no clear winner. Prosody is inherently a difficult thing to predict, for two principal reasons: 1) it depends on many other layers in the linguistic hierarchy, all the way up to the meaning of the sentence being spoken, the pragmatic context, the intent of the speaker etc., and 2) for any given sentence, there is more than one acceptable prosodic pattern.

The dependency on meaning and para-linguistic information such as speaker state (emotion, for example) means that prediction from plain text input requires many assumptions to be made. We could say that current models can predict 'inoffensive' prosody (they sound bored, or boring), but they cannot generally predict 'appropriate' prosody for arbitrary input. The fact that there is 'more than one right answer' makes learning from data hard since any real dataset will contain just one possible pattern and not represent all acceptable patterns. It also makes evaluation hard, because comparing synthetic speech with single natural examples may be misleading, and because human listeners in subjective tests may make highly variable judgements about what constitutes 'good' prosody.

The advent of unit selection caused many researchers to lose interest in prosodic modeling because the prosody produced by selecting units based on fairly simple linguistic features tended to be better than that generated by an explicit model. However, HMM-based synthesis requires more explicit modeling (generally only at the symbolic level) and this may be re-awakening interest in prosody modeling for speech synthesis. It is possible that many of the statistical methods described in Chapters 25–28 in this book will begin to see greater application to speech synthesis in general, and to prosody analysis and generation in particular.

30.6 TOOLS AND EVALUATION

Dutoit [18] noted that a critical basis for further progress in speech synthesis is the availability of a synthesis system with full control of its parameters for research purposes. More generally, research toolkits are critical to progress in fields like speech synthesis, where a wide variety of skills and considerable amounts of time are required to build even a conventional system. When such toolkits are modular, they also enable direct comparisons between competing techniques, simply by changing one module in the system. A number of speech synthesis toolkits are available and in use by the research community, including4:

  • MBROLA: A waveform generation module from the Faculté Polytechnique de Mons. http://tcts.fpms.ac.be/synthesis/mbrola.html
  • Festival: A complete and widely-used open source text-to-speech toolkit from the University of Edinburgh and Carnegie Mellon University. It includes diphone and unit-selection modules and supports HMM-based synthesis using HTS (see below). http://www.cstr.ed.ac.uk/projects/festival/
  • Mary: Another complete open source text-to-speech toolkit from the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI). http://mary.dfki.de/
  • HTS: A toolkit for HMM-based speech synthesis (training the models and waveform generation) from Tokuda at the Nagoya Institute of Technology and his collaborators, built as an extension to the popular HTK toolkit. http://hts.sp.nitech.ac.jp/

Although there are some signal-based measures for the quality of synthetic speech (e.g., Mel-cepstral distortion between synthetic and natural examples of the same sentence), there is as yet no substitute for subjective tests using human listeners. Such tests are time-consuming and expensive to conduct. Within-system tests are routinely used for determining which of a set of competing techniques (e.g., models for predicting prosody) are preferred by listeners. Comparisons between systems are harder to perform, since only a few complete systems are freely available. The annual Blizzard Challenge [4] is the only open evaluation in which many different systems are compared in a large-scale listening test with many hundreds of listeners. Although the Blizzard Challenge results provide reliable information about the naturalness and intelligibility of complete systems, they do not directly tell us why some systems are preferred by listeners.
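
For reference, the sketch below computes the Mel-cepstral distortion mentioned above, assuming the two utterances have already been converted to mel-cepstral coefficient sequences and time-aligned frame by frame (for example by dynamic time warping); conventions such as which coefficients to include vary between studies.

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Average Mel-cepstral distortion in dB between two aligned (T, D)
    sequences of mel-cepstral vectors, excluding the 0th (energy)
    coefficient as is conventional."""
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```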

30.6.1. Further reading

[51] and [46] are recommended reading for broad coverage of speech synthesis.

30.7 EXERCISES

  1. 30.1 When a speech transformation to change male-sounding speech into female-sounding speech is performed, it is necessary to raise both the fundamental frequency and the formant frequencies. Give a physiological explanation for why this strategy seems to work.
  2. 30.2 Describe at least three important steps in a modern text-to-speech synthesizer.
  3. 30.3 Give four examples (for English) where letter-to-sound conversion could get into trouble. Can you design a system to avoid or alleviate these errors?
  4. 30.4 Unit selection synthesizers typically use a join cost which measures how well two speech segments will join together. Give some examples of the acoustic properties of the speech signal that might be used to calculate the join cost.
  5. 30.5 Describe some of the factors typically used in the target cost of a unit selection synthesizer
  6. 30.6 Identify as many similarities as possible (both general concepts and specific details) and point out differences in the way speech waveforms are created from a parametric representation in the Haskins Pattern Playback system, the OVE II formant synthesizer and the STRAIGHT vocoder.

30.8 APPENDIX: SYNTHESIZER EXAMPLES

30.8.1. The Klatt Recordings

Beginning with the Voder, much research (interrupted by World War II) has been dedicated to the development of speech synthesizers. Much of this work was summarized in 1987 in a monumental article by Dennis Klatt [32]. Included with Klatt's paper was a plastic record with a large number of synthesizer recordings.5 We list a number of these recordings here with a few comments.

30.8.2. Development of Speech Synthesizers

Voder: This is a recording of the Voder that was described in Chapter 2. A block diagram of the Voder was shown in Figs. 2.4 and 2.5.

Pattern Playback: See Fig. 30.3 and the accompanying discussion.

PAT, the Parametric Artificial Talker: This device, developed at the University of Edinburgh and consisting of three resonators in parallel with three frequency controls, was introduced by Walter Lawrence to American audiences at an MIT conference in 1956. Three additional parameters were used for excitation control. A moving glass slide converted painted patterns to control parameters.

OVE or OVE I: OVE was also introduced at the 1956 MIT conference as a cascade connection of three formants, with formants 1 and 2 controlled by the movement of a mechanical arm. Hand-held potentiometers controlled the frequency and amplitude of the voicing source. Since OVE spoke only vowel-like sounds, no noise source was needed.

PAT in 1962: Updated version of PAT. Amplitude controls and a separate fricative were added. At the 1962 Stockholm Speech Communication Seminar, the synthesizer attempted to match a naturally spoken sentence.

OVE II in 1962: At the same seminar, the updated OVE II also matched, quite successfully, the same spoken utterance. See Fig. 30.4.

Holmes' Sentence on OVE: John Holmes worked very hard to try to match the sentence “I enjoy the simple life” to make it indistinguishable from the original.

The Same Sentence by Holmes: Using his own parallel synthesizer, Holmes tried again.

Male-to-Female Transformation with DecTalk: The fundamental frequency was multiplied by 1.7, and the formants and glottal wave shape were varied.

The DAVO Articulatory Synthesizer: George Rosen built the original model in 1958 [44]. It was modified by Hecker [26] to include a nasal tract.

Vocal Cord–Vocal Tract Synthesis: Flanagan and Ishizaka [22] simulated a speech synthesizer in which the vocal cord and vocal tract operations were interdependent.

The independence of source and filter in nearly all synthesizer models neglects the possible effect that the dynamics of the vocal tract can alter the dynamic properties of the excitation function. By overtly including cord–tract interaction, Flanagan et al. have edged one step closer to a true physiological model, but it is still an open question as to the perceptual effect of this step. The turbulent noise source allows generation of noise anywhere in the model for fricative sounds. Could this system be improved by using digital waveguides?

30.8.3. Segmental Synthesis by Rule

Early synthesis-by-rule programs began with a phoneme string; prosodic features were entered by hand to match the original utterance.

Kelly–Gerstman [29] Synthesis by Rule: They used a basic element that resembles a digital waveguide. A section was shown in Fig. 11.3. This system was demonstrated at the Copenhagen 1961 International Conference on Acoustics.

British Synthesis by Rule, 1964: Holmes et al. demonstrated this program at the fall meeting of the Acoustical Society in Ann Arbor, Michigan, 1964.

Diphone Concatenation Synthesis by Rule: Dixon and Maxey [16] of IBM demonstrated this program at the 1967 MIT Conference on Speech Communication and Processing.

Synthesis by Rule with an Articulatory Model: Coker [9] also demonstrated this device at the same 1967 MIT Conference.

30.8.4. Synthesis By Rule of Segments and Sentence Prosody

The previous examples did not include prosodic features. These examples, in addition to using phoneme strings as inputs, also include stress marks and some syntactic information.

First Prosodic Synthesis by Rule: Mattingly discussed this as part of his report on speech research [38].

Sentence-Level Phonology Incorporated in Rules: Klatt [30] used phonology to generate segmental durations and a fundamental frequency contour, as well as sentence-level allophonic variations. The inputs were phoneme strings plus stress and syntactic symbols.

Rules from Linear Prediction Diphones and Prosody: Olive [41] demonstrated this system at ICASSP '77.

Rules from Linear Prediction Demisyllables: Browman [5] also used prosodic rules.

30.8.5. Fully Automatic Text-To-Speech Conversion: Formants and diphones

First Text-to-Speech System: Umeda et al. [50] designed this system, based on an articulatory model.

Bell Laboratories Text-to-Speech System: Coker et al. [10] demonstrated this system at the 1972 International Conference of Speech Communication and Processing in Boston.

Haskins Text-to-Speech System: Cooper et al. [15] used the Mattingly phoneme-to-speech rules, coupled with a large dictionary [38].

Reading Machine for the Blind: Kurzweil [33] demonstrated this commercially available machine, which included an optical scanner, on the CBS Evening News.

Votrax Type-n-Talk System: Gagnon [23] demonstrated this cheap device at the 1978 ICASSP. He implemented the research by Elovitz et al. [19] that converted letters to sounds.

Echo Low-Cost Diphone Concatenation: This was demonstrated in 1982.

MITalk System: Allen et al. [1], [2] demonstrated a full-blown text-to-speech system, using complicated heuristics to translate graphemes to phonemes. The synthesizer was created by Klatt.

Multi-Language Infovox System: This was a commercial system [37] that was based on the research of Carlson et al. [6]. It was developed at the Royal Institute of Technology in Stockholm and demonstrated at ICASSP '76 and ICASSP '82.

The Prose-2000 Commercial System: Original research was done at Telesensory Systems by James Bliss and associates [24], [25].

Klattalk and DECtalk: This was Dennis Klatt's final system, which was licensed to Digital Equipment Corporation.

Bell Laboratories Text-to-Speech System: Olive and Liberman [42] used the Olive diphone synthesis strategy [41] in combination with a large morpheme dictionary [12] and letter-to-sound rules [8]. The system was demonstrated at the 1985 ASA meeting.

30.8.6. The van Santen Recordings

[51] also included a set of synthesizer demonstrations; given the move toward concatenative synthesis in the 1990s, most of the examples are of this variety. Here we briefly summarize the audio demos on the CD accompanying this reference, which are also described in the text.

DECtalk: This demo is from an updated version of the system described above.

Bell Laboratories Text-to-Speech System: Similarly, this is an updated version of the diphone-based system described above. The system has migrated to a commercial product called TrueTalk.

Orator: Bellcore has a commercial system that is based on demisyllable concatenation.

Eurovocs: This is a diphone concatenation system.

30.8.7. Fully Automatic Text-To-Speech Conversion: Unit selection and HMMs

The majority of research systems and commercially-available products now use either unit-selection or HMM-based synthesis. Diphone and formant synthesis are only used in niche applications where, for example, a very small memory footprint is necessary (e.g. embedded systems), or extreme manipulations of the speech are more important than naturalness (e.g., screen readers for the blind requiring very fast speaking rates).

Research systems Section 30.6 lists some research systems, including Festival and Mary, which are both fully automatic text-to-speech systems. These give a good indication of the state-of-the-art in academic research. They both have web sites with interactive demonstrations. Another notable research system is CHATR [28], from ATR in Japan, which was the first unit-selection system and had many characteristics still found in current systems; its successor is Ximera, in which a single-speaker database of over 100 hours of speech is employed.

Commercial products The names and availability of commercial products, and of the companies producing them, is constantly changing. Commercial products that are typical of the current state of the art (circa 2011) are made by large companies such as Acapela, AT&T, Loquendo and Nuance, although there are also a number of smaller companies operating in this market, often targeting specific applications such as in-car navigation (e.g., SVOX) or computer games (e.g., Phonetic Arts).

BIBLIOGRAPHY

  1. Allen, J., Hunnicut, S., Carlson, R., and Granstrom, B., “MITalk-79: the MIT Text-to-Speech system,” J. Acoust. Soc. Am. Suppl. 1 65: S130, 1979.
  2. Allen, J., Hunnicut, S., and Klatt, D. H., From Text to Speech; The MITalk System, Cambridge Univ. Press, London/New York, 1987.
  3. Almeida, L. B., and Silva, F. M., “Variable frequency synthesis: an improved harmonic coding scheme,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., San Diego, pp. 27.5.1–27.5.4, 1984.
  4. The Blizzard Challenge, http://www.synsig.org/index.php/Blizzard_Challenge
  5. Browman, C. P., “Rules for demisyllable synthesis using lingua, a language interpreter,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Denver, 1980.
  6. Carlson, R., Granstrom, B., and Hunnicut, S., “A multi-language text-to-speech module,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Paris, pp. 1604–1607, 1982.
  7. Charpentier, F., and Stella, M., “Diphone synthesis using an overlap-add technique for speech waveform concatenation,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Tokyo, pp. 2015–2018, 1986.
  8. Church, K. W., “Stress assignment in letter-to-sound rules for speech synthesis,” in Proc. 23rd Meet. Assoc. Comp. Ling. Chicago, pp. 246–253, 1985.
  9. Coker, C. H., “Speech synthesis with a parametric articulatory model,” reprinted in J. L. Flanagan and L. R. Rabiner, eds., Speech Synthesis, Dowden, Hutchinson and Ross, Stroudsburg, Pa., 1968.
  10. Coker, C. H., Umeda, N., and Browman, C. P., “Automatic synthesis from ordinary english text,” IEEE Trans. Audio Electroacoust. AU-21: 293–297, 1973.
  11. Coker, C. H., “A model of articulatory dynamics and control,” Proc. IEEE 64: 452–460, 1976.
  12. Coker, C. H., “A dictionary-intensive letter-to-sound program,” J. Acoust. Soc. Am. Suppl. 1 78: S7, 1985.
  13. Cook, P. R., “Identification of control parameters in an articulatory vocal tract model, with applications to the synthesis of singing,” Electrical Engineering PhD Dissertation, Stanford University, 1991.
  14. Cooper, F. S., Gaitenby, J. H., Liberman, A. M., Borst, J. M., and Gerstman, L. J., “Some experiments on the perception of synthetic speech sounds,” J. Acoust. Soc. Am. 24: 597–606, 1952.
  15. Cooper, F. S., Gaitenby, J. H., Mattingly, I. G., Nye, P. W., and Sholes, G. N., “Audible outputs of reading machines for the blind,” Stat. Rep. Speech Res. SR-35/36, Haskins Laboratory, New Haven, Conn., pp. 117–120, 1973.
  16. Dixon, N. R., and Maxey, H. D., “Terminal analog synthesis of continuous speech using the diphone method of segment assembly,” IEEE Trans. Audio Electroacoust. AU-16: 40–50, 1968.
  17. Donovan, R. E., and Woodland, P. C., “Automatic speech synthesizer parameter estimation using HMMs,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Detroit, pp. 640–643, 1995.
  18. Dutoit, T., An Introduction to Text-to-Speech Synthesis, Kluwer, Dordrecht, the Netherlands, 1997.
  19. Elovitz, H., Johnson, R., McHugh, A., and Shore, J., “Letter-to-sound rules for automatic translation of English text to phonemes,” IEEE Trans. Acoust. Speech Signal Process. ASSP-24: 446–459, 1976.
  20. Fant, G., “Speech communication research,” Ing. Vetenskaps Akad. 24: 331–337, 1953.
  21. Fant, G., Martony, J., Rengman, U., and Risberg, A., “OVE II synthesis strategy,” presented at the Stockholm Speech Communications Seminar, Stockholm, 1962.
  22. Flanagan, J. L., Ishizaka, K., and Shipley, K. L., “Synthesis of speech from a dynamic model of the vocal cords and vocal tract,” Bell Syst. Tech. J. 54: 485–506, 1975.
  23. Gagnon, R. T., “Votrax real time hardware for phoneme synthesis of speech,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Tulsa, Oklahoma, pp. 175–178, 1978.
  24. Goldhor, R. S., and Lund, R. T., “University-to-industry technology transfer: a case study,” Res. Policy 12: 121–152, 1983.
  25. Groner, G. F., Bernstein, J., Ingber, E., Pearlman, J., and Toal, T., “A real-time text to speech converter,” Speech Technol. 1: 73–76, 1982.
  26. Hecker, M. H. L., “Studies of nasal consonants with an articulatory speech synthesizer,” J. Acoust. Soc. Am. 34: 179–188, 1962.
  27. Holmes, J. N., “The influence of the glottal waveform on the naturalness of speech from a parallel formant synthesizer,” IEEE Trans. Audio Electroacoust. AU-21: 298–305, 1973.
  28. Hunt, A., Black, A. W., “Unit selection in a concatenative speech synthesis system using a large speech database” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Atlanta, Georgia, pp. 373–376, 1996.
  29. Kelly, J., and Gerstman, L., “Digital computer synthesizes human speech,” Bell Labs. Rec. 40: 216–217, 1962.
  30. Klatt, D. H., “Structure of a phonological rule component for a speech synthesis by rule program,” IEEE Trans. Acoust. Speech Signal Process. ASSP-24: 391–398, 1976.
  31. Klatt, D. H., “Software for a cascade/parallel formant synthesizer,” J. Acoust. Soc. Am. 67: 971–995, 1980.
  32. Klatt, D. H., “Review of text-to-speech conversion for English,” J. Acoust. Soc. Am. 82: 737–792, 1987.
  33. Kurzweil, R., “The Kurzweil reading machine: a technical overview,” in M. R. Redden, and W. Schwandt, eds., Science, Technology and the Handicapped, AAAS Rep. 76-R-11, Washington, D.C., pp. 3–11, 1976.
  34. Lawrence, W., “The synthesis of speech from signals which have a low information rate,” in W. Jackson, ed., Communication Theory, Butterworths, London, pp. 460–469, 1953.
  35. Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy, M., “Perception of the speech code,” Psychol. Rev. 74: 431–461, 1967.
  36. MacAulay, R. J., and Quatieri, T. F., “Magnitude only reconstruction using a sinusoidal speech model,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., San Diego, pp. 27.6.1–27.6.4, 1984.
  37. Magnusson, L., Blomberg, M., Carlson, R., Elenius, K., and Granstrom, B., “Swedish researchers team up with electronic venture capitalists,” Speech Technol. 2: 15–24, 1984.
  38. Mattingly, I. G., “Synthesis by rule of general american English,” Suppl. Stat. Rep. Speech Res., Haskins Laboratory, New Haven, Conn., pp. 1–223, 1968.
  39. Morgan, N., Talking Chips, McGraw-Hill, New York, 1984.
  40. Moulines, E., and Charpentier, F., “Diphone synthesis using multipulse LPC technique,” in Proc. FASE Int. Conf, Edinburgh, pp. 47–51, 1988.
  41. Olive, J. P., “Rule synthesis of speech from diadic units,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Hartford, Connecticut, pp. 568–570, 1977.
  42. Olive, J. P. and Liberman, M. Y., “Text-to-speech - An overview,” J. Acoust. Soc. Am. Suppl. 1 78: S6, 1985.
  43. Parthsarathy, S., and Coker, C. H., “Automatic estimation of articulatory parameters,” Comput. Speech Lang. 6: 37–75, 1992.
  44. Rosen, G., “A dynamic analog speech synthesizer,” J. Acoust. Soc. Am. 30: 201–209, 1958.
  45. Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., “TOBI: A Standard for Labeling English Prosody” in Proc. Int. Conf. Spoken Language Process., Banff, Alberta, Canada, pp. 867–870, 1992.
  46. Taylor, P. Text-to-Speech Synthesis Cambridge University Press, Cambridge, UK, 2009.
  47. Tokuda, K., Kobayashi, T., Masuko, T., Imai, S., “Mel-generalized cepstral analysis -a unified approach to speech spectral estimation,” in Proc. Int. Conf. Spoken Language Process., Yokohama, Japan, pp. 1043–1046, 1994.
  48. Tokuda, K., Kobayashi, T., Imai, S., “Speech parameter generation from HMM using dynamic features,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Detroit, Michigan, pp. 660–663, 1995.
  49. Tokuda, K., Masuko, T., Miyazaki, N., Kobayashi, T., “Multi-space probability distribution HMM,” in IEICE Trans. Inf. & Syst. E85-D (3): 455–464, 2002
  50. Umeda, N., Matsui, E., Suzuki, T., and Omura, H., “Synthesis of fairy tales using an analog vocal tract,” in Proc. 6th Int. Cong. Acoust., Tokyo, pp. B159–162, 1968.
  51. van Santen, J. P. H., Sproat, R., Olive, J., and Hirschberg, J., Progress in Speech Synthesis, Springer Pub., New York, 1997.
  52. Wilhelms-Tricarico, R., “Physiological modeling of speech production: Methods for modeling soft-tissue articulators,” J. Acoust. Soc. Am., 97: 3085, 1995.
  53. Zen, H. Tokuda, K., and Black, A., “Statistical Parametric Speech Synthesis,” Speech Communication, 51(11), pp 1039–1064, November 2009.

1 This chapter has been extensively rewritten for the 2nd edition by Simon King of Edinburgh University.

2 A generative model is a statistical model for randomly generating observable data.

3 OVE is the acronym for the Latin orator verbis electris, and PAT stands for parametric artificial talker.

4 Unfortunately, this printed page is fairly static, and the Web changes rather frequently. If the Web pages cited here do not lead to the desired systems, we apologize.

5 The soundfiles of these recordings may be heard at http://www.icsi.berkeley.edu/eecs225d/klatt.html.
