9.2. Speaker Recognition

The goal of automatic speaker recognition (Campbell [45]; Furui [109]) is to recognize a speaker from his or her voice. Speaker recognition can generally be divided into two categories: speaker identification and speaker verification. Speaker identification determines the identity of an unknown speaker from a group of known speakers; speaker verification authenticates the identity of a speaker based on his or her own voice. A speaker claiming an identity is called a claimant, and an unregistered speaker pretending to be a registered speaker is called an impostor. An ideal speaker recognition system should neither reject registered speakers (false rejections) nor accept impostors (false acceptances).

Speaker recognition can also be divided into text-dependent and text-independent approaches. In text-dependent systems, the same set of keywords is used for enrollment and recognition; in text-independent systems, on the other hand, different phrases or sentences are used. Text-dependent systems, which require the cooperation of users and typically use hidden Markov models to represent speakers' speech, usually outperform text-independent systems because they can use phonetic information to align the unknown speech with reference templates. However, text-independent systems are more appropriate for forensic and surveillance applications, where predefined keywords are not available and users are usually uncooperative or unaware of the recognition task.

9.2.1. Components of Speaker Verification Systems

Typically, a speaker verification system is composed of a front-end feature extractor, a set of client speaker models, a set of background speaker models, and a recognition unit. The feature extractor derives speaker-specific information from the speech signals. It is well known from the source-filter theory of speech production [93] that spectral envelopes implicitly encode vocal-tract shape information (e.g., length and cross-section area) of a speaker and that pitch harmonics encode the glottal source information. Because it is commonly believed that vocal-tract shape varies from speaker to speaker, spectral features, such as the LPCCs or the MFCCs (see Section 2.6), are often used. A set of speaker models is trained from the spectral features extracted from client utterances. A background model is also trained using the speech of a large number of speakers to represent speaker-independent speech. The background models are used to normalize the scores of the speaker models, minimizing nonspeaker-related variability such as acoustic noise and channel effects. To verify a claimant, speaker scores are normalized by the background scores, and the resulting normalized score is compared with a decision threshold. The claimant is accepted (rejected) if the score is larger (smaller) than the threshold.
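
A minimal sketch of this decision rule is given below (Python with NumPy). The scoring functions are hypothetical stand-ins built from single Gaussians; in a real system they would be the log-likelihoods of the claimant's features under the speaker model and the background model.

```python
import numpy as np

def verify(features, speaker_loglik, background_loglik, threshold):
    """Accept the claimant if the background-normalized score exceeds the threshold.

    features          : (T, D) array of spectral feature vectors (e.g., MFCCs)
    speaker_loglik    : callable giving the average frame log-likelihood under
                        the claimed speaker's model
    background_loglik : callable giving the average frame log-likelihood under
                        the background (speaker-independent) model
    """
    # Background normalization: nonspeaker effects that raise or lower both
    # likelihoods tend to cancel in the difference.
    score = speaker_loglik(features) - background_loglik(features)
    return score > threshold, score

# Toy usage with single-Gaussian stand-ins for the speaker and background models
def make_gaussian_loglik(mean, var):
    def loglik(x):
        return np.mean(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=1))
    return loglik

rng = np.random.default_rng(0)
claimant = rng.normal(loc=1.0, scale=1.0, size=(200, 12))   # 200 frames, 12-dim features
speaker_ll = make_gaussian_loglik(np.full(12, 1.0), np.ones(12))
background_ll = make_gaussian_loglik(np.zeros(12), np.full(12, 2.0))
accept, score = verify(claimant, speaker_ll, background_ll, threshold=0.0)
print(accept, round(score, 3))
```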

9.2.2. Speaker-Specific Features

Speech signals can be regarded as outputs of quasi-stationary processes over short time intervals. Therefore, short-term spectral analysis can be applied to short speech segments, which results in a sequence of short-time spectra. In speech and speaker recognition, the short-time spectra are further transformed into feature vectors. In addition to spectral analysis, many speech and speaker recognition systems use linear prediction (LP) analysis [228] to extract the feature vectors (known as the LP coefficients) from short segments of speech waveforms. One advantage of LP coefficients is that they can be computed efficiently. More important, the LP coefficients represent the spectral envelopes of speech signals (i.e., information about the formant frequencies and their bandwidths). The spectral envelopes are characterized by the vocal-tract resonance frequencies, vocal-tract length, and spatially varied cross-section areas. Because all of these entities are known to be speaker-dependent, the LP coefficients are one of the candidate features for speaker recognition.
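
As a rough illustration of LP analysis, the sketch below estimates the LP coefficients of a single windowed frame with the autocorrelation method; the frame length, the model order of 12, and the plain linear solve (in place of the usual Levinson-Durbin recursion) are illustrative assumptions rather than prescribed choices.

```python
import numpy as np

def lpc(frame, order=12):
    """Estimate LP coefficients a_1..a_p of one short speech frame by solving
    the autocorrelation normal equations R a = r."""
    frame = frame * np.hamming(len(frame))                    # taper the frame
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])   # autocorrelation lags 0..p
                  for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])                          # predictor coefficients

# Toy usage: a random frame standing in for a 20 ms segment sampled at 16 kHz
rng = np.random.default_rng(1)
print(lpc(rng.standard_normal(320), order=12))
```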

Several sets of features (e.g., LP coefficients, impulse responses, autocorrelation coefficients, cross-sectional areas, and cepstral coefficients) can be derived from LP analysis. Of particular interest is that a simple and unique relationship exists among these features. Despite this simple relationship, it has been shown that the cepstral coefficients are the most effective feature for speaker recognition [13] because the components of cepstral vectors are almost orthogonal.
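
One common form of that relationship is the recursion that converts LP coefficients into LP cepstral coefficients (LPCCs). The sketch below assumes the predictor convention A(z) = 1 - sum_k a_k z^{-k} and ignores the gain term, which affects only the zeroth coefficient.

```python
import numpy as np

def lpcc(a, n_cep=None):
    """LP cepstral coefficients c_1..c_N from LP coefficients a_1..a_p using
    c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}  (with a_n = 0 for n > p)."""
    p = len(a)
    n_cep = n_cep or p
    c = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:                       # only terms with a valid a_{n-k}
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Toy usage with an arbitrary 4th-order predictor
print(lpcc(np.array([1.3, -0.8, 0.25, -0.05]), n_cep=8))
```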

In the field of psychoacoustics, it is well known that the human auditory system can be approximately described by a set of overlapping bandpass filters whose center frequencies follow a scale known as the critical-band scale. To capture the phonetically important characteristics of speech using auditory-based principles, Davis and Mermelstein [69] proposed using triangular mel-scale filter banks to extract spectral features from speech signals. These filters follow the mel scale, being spaced linearly at low frequencies and logarithmically at high frequencies. The resulting coefficients are therefore called mel-frequency cepstral coefficients (MFCCs). This chapter uses both LPCCs and MFCCs as speaker features.
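
A compact sketch of MFCC extraction for one frame is shown below. The FFT size, the 26-filter bank, the mel-scale formula mel(f) = 2595 log10(1 + f/700), and the unnormalized DCT are common conventions rather than the only possible ones, so treat the specific numbers as assumptions.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Triangular filters spaced uniformly on the mel scale."""
    f_high = f_high or sample_rate / 2.0
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = np.linspace(mel(f_low), mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):                       # rising edge
            fb[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):                      # falling edge
            fb[i - 1, b] = (right - b) / max(right - center, 1)
    return fb

def mfcc(frame, sample_rate=16000, n_filters=26, n_cep=12, n_fft=512):
    """MFCCs of one windowed speech frame: power spectrum -> mel filter bank
    -> log energies -> unnormalized DCT-II, keeping the first n_cep terms."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    log_energies = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum + 1e-10)
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_cep), 2 * m + 1) / (2 * n_filters))
    return dct @ log_energies

# Toy usage on a random frame standing in for 20 ms of speech at 16 kHz
print(mfcc(np.random.default_rng(2).standard_normal(320)))
```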

9.2.3. Speaker Modeling

Over the years, a variety of speaker modeling techniques have been proposed [305]. This section describes four approaches to speaker modeling.

Template Matching

In this technique, reference templates are used as speaker models. Templates are composed of a sequence of feature vectors extracted from a set of fixed sentences uttered by a registered speaker. During recognition, an input utterance is dynamically aligned with the reference templates, and match scores are obtained by measuring the similarity between the aligned utterance and the templates [80]. The use of fixed templates, however, cannot model the wide variability present in the speech signals.
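
As a sketch of the alignment step, the classic dynamic time warping (DTW) recursion below computes a length-normalized distance between a reference template and an input utterance; the Euclidean frame distance and the normalization by n + m are illustrative choices.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Dynamic time warping distance between a template and an utterance,
    both given as (T, D) arrays of feature vectors; smaller means more similar."""
    n, m = len(template), len(utterance)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(template[i - 1] - utterance[j - 1])
            # Extend the cheapest of the three admissible warping-path moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

# Toy usage: an utterance that is a time-stretched copy of the template
rng = np.random.default_rng(3)
template = rng.standard_normal((50, 12))
utterance = np.repeat(template, 2, axis=0)[:80]
print(dtw_distance(template, utterance))
```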

Vector Quantization

Vector quantization (VQ) is a coding technique typically used for transmitting signals at low bit rates. To use VQ in speaker recognition, a personalized codebook is created for each speaker. During recognition, an unknown speaker is identified by selecting the codebook whose code vectors are closest to the input vectors. Since Soong et al. [342] introduced it to speaker recognition in 1985, VQ has been a benchmark method for speaker recognition systems [234], and improvements to the standard VQ approach have also been made [32]. The advantage of VQ is that the problem of segmenting speech into phonetic units can be avoided, and VQ is more computationally efficient than template matching. Its disadvantage, however, lies in the complexity of the codebook search during recognition.
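
The sketch below builds a personalized codebook with a plain k-means loop and scores test vectors by their average quantization distortion; the codebook size of 64 and the exhaustive nearest-codeword search (the source of the search cost mentioned above) are assumptions for illustration.

```python
import numpy as np

def train_codebook(features, n_codewords=64, n_iter=20, seed=0):
    """Build a speaker's VQ codebook with a simple k-means loop."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_codewords, replace=False)].astype(float)
    for _ in range(n_iter):
        # Exhaustive search: distance from every vector to every codeword
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

def vq_distortion(features, codebook):
    """Average distance to the nearest codeword; the unknown speaker is assigned
    to the codebook giving the smallest distortion."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```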

Hidden Markov Models

Hidden Markov models (HMMs) encode both the temporal structure of feature sequences and the statistical variation of the features. As a result, HMMs can be used as speaker models in text-dependent speaker recognition. The earliest attempt to use HMMs in speaker recognition was reported by Poritz [282]. After Poritz's work, several improved methods were proposed, including mixture autoregressive HMMs [349], subword HMMs [123], and semicontinuous HMMs [102]. HMM parameters can be estimated based on maximum-likelihood (ML) or maximum a posteriori (MAP) criteria. Criteria that use discriminative training, such as minimum classification error (MCE) [175, 238] or maximum mutual information (MMI) [264, 357], can also be adopted. Multistate, left-to-right HMMs can be used as speaker- and utterance-specific models.

The HMM approach is similar to the VQ approach in that the HMM states are found by a VQ-like procedure. However, unlike VQ, the probabilities of transition between states are encoded, and the order of presentation of the speech data is important. This may cause problems in text-independent speaker recognition, where no temporal correlation exists between the training data and the test data. On the other hand, single-state HMMs, also known as Gaussian mixture models (GMMs) [300, 307], can be applied to text-independent speaker recognition. As in VQ, the feature space of each speaker is divided into a number of clusters. However, the probability density function is continuous rather than discrete, and the cluster membership is soft rather than hard. GMMs provide a probabilistic model for each speaker, but unlike HMMs, they impose no Markov constraint among the sound classes. As a result, the order of presentation of the speech data does not affect recognition decisions.
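
A short sketch of the GMM approach to text-independent recognition follows; it assumes scikit-learn is available for the GaussianMixture estimator, and the 32 diagonal-covariance components are a typical but arbitrary choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # assumed available

def train_gmm(features, n_components=32, seed=0):
    """Fit a diagonal-covariance GMM to one speaker's feature vectors."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(features)

def identify(features, speaker_gmms):
    """Text-independent identification: pick the speaker whose GMM yields the
    highest average frame log-likelihood; frame order is irrelevant."""
    scores = {name: gmm.score(features) for name, gmm in speaker_gmms.items()}
    return max(scores, key=scores.get), scores
```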

Neural Networks

Neural networks, which have also been used for speaker recognition, can be considered supervised classifiers that learn the complex mappings between data in the input and output spaces. This capability is particularly useful when the statistical distributions of the data are not known. The speaker models can take many different forms, including multilayer perceptrons (MLPs), radial basis function (RBF) networks, hybrid MLP-RBF models [7], multi-expert connectionist models [24], and modified neural tree networks [94]. For MLP and RBF networks, each speaker has a personalized network trained to output a 1 (one) for the voices of that speaker and a 0 (zero) otherwise. One advantage of using neural networks for speaker recognition is that discriminative information can easily be incorporated by means of supervised learning. Although this information usually improves recognition performance, it requires a longer training time.
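
The per-speaker network idea can be sketched as follows, again assuming scikit-learn; the single 32-unit hidden layer and the iteration limit are arbitrary illustrative values.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier   # assumed available

def train_speaker_net(own_features, other_features, seed=0):
    """One network per registered speaker: target 1 for frames from the speaker
    and 0 for frames from other speakers (discriminative training)."""
    X = np.vstack([own_features, other_features])
    y = np.concatenate([np.ones(len(own_features)), np.zeros(len(other_features))])
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    return net.fit(X, y)

def speaker_score(net, features):
    """Average posterior probability that the frames belong to the net's speaker."""
    return net.predict_proba(features)[:, 1].mean()
```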

9.2.4. Threshold Determination

The determination of decision thresholds is a very important problem in speaker verification. A threshold that is too high makes the system annoying to users, whereas one that is too low leaves the system vulnerable to impostors. Conventional threshold determination methods [41, 107] typically compute the distributions of inter- and intraspeaker distances and then choose a threshold that equalizes the overlapping areas of the distributions, that is, a threshold that equalizes the false acceptance rate (FAR) and false rejection rate (FRR). The success of this approach, however, depends on how well the estimated distributions match the true speaker- and impostor-class distributions. Another approach derives the threshold of a speaker solely from his or her own voice and speaker model [259]. Session-to-session speaker variability, however, contributes a great deal of bias to the threshold, rendering the verification system unusable.

Because of the difficulty of determining a reliable threshold, researchers often report the equal error rate (EER) of verification systems, based on the assumption that an a posteriori threshold can be optimally adjusted during verification. Real-world applications, however, must rely on a priori thresholds determined before verification.

In recent years, research effort has focused on the normalization of speaker scores both to minimize error rates and to determine a reliable threshold. This includes the likelihood ratio scoring proposed by Higgins et al. [136], where verification decisions are based on the ratio of the likelihood that the observed speech is uttered by the true speaker to the likelihood that it is spoken by an impostor. The a priori threshold is then set to 1.0, with the claimant being accepted (rejected) if the ratio is greater (less) than 1.0. Subsequent work based on likelihood normalization [214, 235], cohort normalized scoring [318], and minimum verification error training [320] also shows that including an impostor model during verification not only improves speaker separability but also allows decision thresholds to be set easily. Rosenberg and Parthasarathy [319] established some principles for constructing impostor models and showed that those with speech closest to the reference speaker's model performed the best. Their result, however, differs from that of Reynolds [308], who found that a gender-balanced, randomly selected impostor model performs better, which suggests that more work in this area is required.

Although these previous approaches help select an appropriate threshold, they may cause the system to favor rejecting true speakers, resulting in a high FRR. For example, Higgins et al. [136] reported an FRR more than 10 times larger than the FAR. A report by Pierrot et al. [276], based on a similar normalization technique but with a different threshold-setting procedure, also found the average of the FAR and FRR to be approximately three to five times larger than the EER, suggesting that the EER could be an overly optimistic estimate of true system performance.

9.2.5. Performance Evaluation

Performance of speaker verification systems is usually specified by two types of errors:

  1. Miss rate (P_{miss|target}): the chance of misclassifying a true speaker as an impostor.

  2. False alarm rate (P_{fa|nontarget}): the chance of falsely identifying an impostor as a true speaker.

The miss rate and false alarm rate are also known as the FRR and the FAR, respectively. In addition to these two error rates, it is also common to report the equal error rate (EER), that is, the error rate at which P_{miss|target} = P_{fa|nontarget}.
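
Assuming a set of verification scores is available for true-speaker (target) and impostor (nontarget) trials, the two error rates and the EER can be estimated as in the sketch below; sweeping the threshold over the observed scores is one of several reasonable ways to locate the equal-error point.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores, threshold):
    """Miss (false rejection) and false alarm (false acceptance) rates at one
    operating point defined by the decision threshold."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)
    p_fa = np.mean(np.asarray(nontarget_scores) >= threshold)
    return p_miss, p_fa

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep candidate thresholds and return the (approximate) EER and the
    threshold at which the miss and false alarm rates are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best = min(thresholds,
               key=lambda t: abs(np.subtract(*error_rates(target_scores, nontarget_scores, t))))
    p_miss, p_fa = error_rates(target_scores, nontarget_scores, best)
    return 0.5 * (p_miss + p_fa), best

# Toy usage: impostor scores centered below true-speaker scores
rng = np.random.default_rng(4)
eer, thr = equal_error_rate(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))
print(round(eer, 3), round(thr, 3))
```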

Because the miss rate and false alarm rate depend on the decision threshold, a {P_{miss|target}, P_{fa|nontarget}} pair represents one operating point of the system under evaluation. To provide more information about system performance, it is necessary to evaluate the system over a range of thresholds. This results in a receiver operating characteristic (ROC) curve, in which the miss probability is plotted against the false alarm probability, similar to the curves used by the face recognition community. However, the speaker verification community has chosen to use a variant of the ROC plot called the detection error tradeoff (DET) plot [233]. In a DET plot, the axes follow a normal-deviate (probit) scale so that Gaussian-distributed scores result in straight lines; the advantage is that systems with almost perfect performance can still be compared easily.
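
A DET curve can be produced from the same trial scores by mapping both error probabilities through the inverse of the standard normal CDF (the normal-deviate, or probit, transform). The sketch below assumes SciPy is available for norm.ppf and simply returns the transformed coordinates that would be plotted.

```python
import numpy as np
from scipy.stats import norm   # assumed available for the probit transform

def det_points(target_scores, nontarget_scores, eps=1e-6):
    """DET-plot coordinates probit(P_fa) vs. probit(P_miss) over a threshold
    sweep; Gaussian score distributions trace an approximately straight line."""
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([np.mean(target_scores < t) for t in thresholds])
    p_fa = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
    # Clip away 0 and 1, where the probit transform is unbounded
    return (norm.ppf(np.clip(p_fa, eps, 1 - eps)),
            norm.ppf(np.clip(p_miss, eps, 1 - eps)))
```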

In addition to DET curves, speaker verification systems are also compared based on the detection cost:

$$C_{det} = C_{miss} \cdot P_{miss|target} \cdot P_{target} + C_{fa} \cdot P_{fa|nontarget} \cdot P_{nontarget}$$

where C_{miss} and C_{fa} are the costs of making a false rejection error and a false acceptance error, respectively, and where P_{target} and P_{nontarget} are, respectively, the prior probabilities of encountering a true speaker and an impostor. Typical values of these figures are C_{miss} = 10, C_{fa} = 1, P_{target} = 0.01, and P_{nontarget} = 0.99 [288]. These values give an expected detection cost of approximately 1.0 for a system without any knowledge of the speakers. The operating point at which the detection cost C_{det} is at a minimum can be plotted on top of the DET curve.
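
The detection cost at a given operating point is a weighted sum of the two error probabilities. The sketch below plugs in the typical parameter values quoted above and reproduces the roughly 1.0 cost of a knowledge-free system that simply accepts every claimant.

```python
def detection_cost(p_miss, p_fa,
                   c_miss=10.0, c_fa=1.0, p_target=0.01, p_nontarget=0.99):
    """C_det = C_miss * P_miss|target * P_target + C_fa * P_fa|nontarget * P_nontarget."""
    return c_miss * p_miss * p_target + c_fa * p_fa * p_nontarget

# A system that accepts every claimant: P_miss|target = 0, P_fa|nontarget = 1
print(detection_cost(0.0, 1.0))   # 0.99, i.e., approximately 1.0
```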

Because the performance of speaker verification systems depends on the amount of training data, the acoustic environment, and the length of the test segments, it is very important to report this information in any performance evaluation so that different systems and techniques can be compared. To this end, NIST established a common set of evaluation data and protocols [152] in 1996. Although they focus only on conversational speech, the NIST speaker recognition evaluations are among the most important benchmark tests for speaker verification techniques.

9.2.6. Speaker Recognition in Adverse Environments

It is well known that acoustic variation in the environment seriously degrades the performance of speaker recognition systems. In particular, the performance of most systems degrades rapidly under adverse conditions such as in the presence of background noise, channel interference, handset variation, intersession variability, and long-term variability of speakers' voices. Many sources of distortion can affect speech signals; additive noise and convolutive distortion are the most common.

Additive noise can be classified into different categories according to its properties. For example, stationary noise, such as electric fans and air conditioners, has a time-invariant power spectral density, but nonstationary noise created by passing cars, keyboard clicks, and slamming doors has time-varying properties. Additive noise can also be divided into continuous and short-lived, depending on the noise duration with respect to the speech duration.

In addition to additive noise, distortion can also be convolutive. For example, microphones, transmission channels, and speech codecs can be modeled as filters with which the speech signals are convolved. In particular, typical telephone channels exhibit a bandpass filtering effect on speech signals, exerting different degrees of attenuation on different spectral bands. Reverberation is another source of convolutive distortion; in the log-spectral domain it results in the addition of a noise-like component to the speech signals.

Speakers may also alter the way they speak under high levels of background noise (the Lombard effect) or when under stress; during articulation, speakers may also produce breathing noises and lip-smacks. All of these distortions cause serious performance degradation in speaker recognition systems.

This chapter considers only additive noise and convolutive distortion, which can be combined into a composite source of distortion. Specifically, the acquired signal y(t) is expressed as

Equation 9.2.1

$$y(t) = s(t) * h(t) + n(t)$$

where s(t) is the clean speech signal, and h(t), n(t), and * (an asterisk) denote the channel's impulse response, the additive noise, and the convolution operator, respectively.
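
For completeness, a minimal simulation of Equation 9.2.1 is sketched below; the SNR-based noise scaling and the truncation of the convolution to the original signal length are added conveniences for experimentation, not part of the equation itself.

```python
import numpy as np

def distort(clean, channel_ir, noise, snr_db=10.0):
    """Simulate y(t) = s(t) * h(t) + n(t): convolve the clean speech with a
    channel impulse response and add noise scaled to a target SNR (in dB).
    The noise array is assumed to be at least as long as the clean signal."""
    convolved = np.convolve(clean, channel_ir)[:len(clean)]
    sig_power = np.mean(convolved ** 2)
    noise = noise[:len(convolved)]
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return convolved + scale * noise

# Toy usage: white noise standing in for speech, a short hypothetical channel
rng = np.random.default_rng(5)
clean = rng.standard_normal(16000)                 # about 1 s at 16 kHz
h = np.array([1.0, 0.4, 0.1])                      # toy channel impulse response
noisy = distort(clean, h, rng.standard_normal(16000), snr_db=10.0)
```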
