CHAPTER 41


SPEAKER VERIFICATION

41.1 INTRODUCTION

In Chapters 22–29, we introduced the basics of automatic speech-recognition systems. However, there are a number of related application areas that use many of the same tools and perspectives. One such class of applications is speaker recognition, for which speaker verification is a particularly important example. Here¹ we describe some of the basic principles of this application.

Speech contains many characteristics that are specific to each individual, many of which are independent of the linguistic message of an utterance. In Chapter 29 we discussed some of these characteristics from the perspective of speech recognition, for which they generally are a source of degradation. For instance, each utterance from an individual is produced by the same vocal tract, tends to have a typical pitch range (particularly for each gender), and has characteristic articulator movements that are associated with the speaker, dialect, or gender. All of these factors have a strong effect on the speech signal and are highly correlated with the particular individual who is speaking. For this reason, listeners are often able to recognize the speaker identity fairly quickly, even over the telephone. Artificial systems recognizing speakers rather than speech have been the subject of much research over the past 30 years, and multiple commercial systems are currently in use.

Speaker recognition is a generic term for the classification of a speaker's identity from an acoustic signal. In the case of speaker identification, the speaker is classified as being one of a finite set of speakers. As in the case of speech recognition, this will require the comparison of a speech utterance with a set of reference models for each potential speaker. For the case of speaker verification, the speaker is classified as having the purported identity or not. That is, the goal is to automatically accept or reject an identity that is claimed by the speaker.

Traditionally, a distinction has been made between text-dependent and text-independent modes of operation, depending on whether or not the recognition process is constrained to a predefined text. Text-dependent recognition is the easier task, because there is less variability between speech utterances. It is probably better to call this mode word conditioned, because there are many ways to know what has been said: it could be a fixed or prompted passphrase, or words can be extracted automatically by word spotting or ASR techniques.

Speaker recognition has many potential applications, including authentication (e.g., in telephone and banking applications), access control, parole monitoring, fraud detection, and intelligence.

Speaker recognition generally requires the calculation of a score reflecting the similarity between two speech segments: a test segment and a training (or enrollment) segment. The basic task is that of detection: to tell whether the segments were spoken by the same or by different speakers. This task can be directly used for verification or authentication purposes. The detection score can also be used as the basis for speaker identification, but here we will concentrate on the more general detection task.

The main challenge in speaker recognition is to distinguish the variability due to the difference in speakers from the variability due to other factors. These confounding variabilities can be intrinsic, such as the physical, medical or emotional state of the speaker, the content, the language spoken, or the vocal effort with which speech is produced; or extrinsic, such as the recording conditions, including acoustics, transducers, recording equipment, transmission channel and noise.

For better performance, there can be multiple training segments, preferably recorded in different sessions.

For a good introduction to speaker recognition, we refer the reader to [6, 8, 2, 1].

41.2 GENERAL DESIGN OF A SPEAKER RECOGNITION SYSTEM

Since the basic task of speaker detection is a two-way classification problem, the core approach is to try to estimate the likelihoods P(X|S), that the speech X is produced by speaker S, and P(X|¬S), that the speech is produced by someone else. The speech data X here is usually represented by a sequence of features extracted from the speech signal, and the likelihood functions for S and ¬S are formed by some mathematical model M. A basic similarity measure is then formed by the likelihood ratio

$$ s = \frac{P(X \mid S)}{P(X \mid \lnot S)} \tag{41.1} $$

Note that it is also possible to model the likelihood ratio directly in a single model. The model parameters for M(S) and M(¬S) are estimated from the training speech segment and, typically, from a large set of “background” speech from many speakers that are known to be different from S.

The score s can be used directly for decisions, i.e., whether or not the speech X was uttered by the speaker S, by thresholding the score. However, in practice the likelihood functions tend to depend on the particular sample of the test data X, and similarly the model parameters are sensitive to the particular sample of training data. This influence is therefore reduced by normalizing the scores s with scores computed over a cohort of non-target models or non-target test segments, respectively.
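To make this concrete, below is a minimal sketch of likelihood-ratio scoring with Gaussian mixture models (anticipating Sect. 41.3.2). The feature matrices, model sizes and threshold are illustrative assumptions; since scikit-learn's score() returns the average per-frame log-likelihood, the difference of the two scores is a length-normalized logarithm of the ratio s in Eq. 41.1.

```python
# Sketch of likelihood-ratio scoring (Eq. 41.1); data and model sizes are
# placeholder assumptions, with random numbers standing in for real features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_feats = rng.normal(0.5, 1.0, (2000, 13))    # enrollment frames x cepstral dims
background  = rng.normal(0.0, 1.0, (20000, 13))   # speech from many other speakers
test_feats  = rng.normal(0.5, 1.0, (500, 13))     # test segment to be classified

m_bkg = GaussianMixture(64, covariance_type="diag", random_state=0).fit(background)
m_spk = GaussianMixture(64, covariance_type="diag", random_state=0).fit(train_feats)

# score() gives the mean per-frame log-likelihood, so this difference is a
# length-normalized log-likelihood ratio: positive favors the target speaker
log_lr = m_spk.score(test_feats) - m_bkg.score(test_feats)
accept = log_lr > 0.0   # the choice of threshold is discussed under calibration
```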

Multiple systems can be fused by computing a weighted sum of the individual normalized system scores, leading to better performance. As a final step, a threshold must be chosen that will minimize the expected cost of decision errors. This is a process known as calibration and is governed by the relative costs of false positives and false negatives, and the prior probability of the target speaker.

At various points in this design, a collection of “background” speech is required, for instance for modeling the feature space, the non-target speakers –S, for score normalization and for fusion and calibration. A proper choice of data for these steps is an essential part of the design of a speaker recognition system.

41.3 EXAMPLE SYSTEM COMPONENTS

In this section we will briefly discuss some of the more popular system design choices.

41.3.1. Features

Many types of features have been proposed and used in speaker recognition. The lowest-level features are known as ‘acoustic’ features, typically frame-based spectral representations. Perhaps surprisingly, the most successful features are the same as those used in speaker-independent ASR, namely MFCC and PLP coefficients (see Chapter 22). Although LPC features (Chapter 21) typically do not perform as well on their own, they tend to lead to better performance when fused with other acoustic systems. When using acoustic features, it is essential to first remove silence frames by some form of speech activity detection. The classifier is also helped by normalizing individual feature streams, via short-time Gaussianization (known as ‘feature warping’) [15] or via utterance-based normalization to a zero-mean, unit-standard-deviation distribution. Higher-level features include prosody (pitch contour and timing), idiolect (word usage) and turn-taking behavior. Generally, one can say that higher-level features represent more behavioral traits and are more difficult to extract automatically, whereas lower-level features represent more physical traits and tend to be easier to extract automatically.
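As an illustration of the two normalizations just mentioned, the following sketch implements utterance-level mean/variance normalization and a simplified, rank-based version of feature warping [15]; the window length and the handling of utterance edges are assumptions of this sketch.

```python
# Sketch of two acoustic feature normalizations: utterance-level mean/variance
# normalization, and a simplified rank-based take on short-time
# Gaussianization ('feature warping') [15].
import numpy as np
from scipy.stats import norm

def mvn(feats):
    """Normalize each dimension to zero mean, unit std over the utterance."""
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)

def feature_warp(feats, win=301):
    """Map each value to the Gaussian quantile of its rank in a local window."""
    n_frames, _ = feats.shape
    warped = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(n_frames):
        window = feats[max(0, t - half):min(n_frames, t + half + 1)]
        # relative rank of the current frame within the window, per dimension
        rank = (window < feats[t]).sum(axis=0) + 0.5
        warped[t] = norm.ppf(rank / len(window))
    return warped
```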

41.3.2. Models

In the enrollment process a speaker model needs to be formed, in order to be able to compute likelihood ratios for test segments. For simplicity, we will concentrate on acoustic features. Since the 1990s a very popular modeling technique has been the Gaussian Mixture Model (GMM, Chapter 9), which we have encountered as the output probabilities in HMMs for ASR in Chapter 26. Specifically, the UBM-GMM approach introduced by Doug Reynolds [16] represents P(X|M, ¬S) by a GMM trained on many speakers (e.g., thousands). This is called the Universal Background Model (UBM), and is trained via Maximum Likelihood using the Expectation Maximization algorithm (see Section 9.8). Rather than training a separate model for the target speaker S from scratch, the UBM is used as a prior distribution for a maximum a posteriori (MAP) [9] estimate of the model parameters given the training data, as described in Section 28.2.1. Typically, only the means of the Gaussians are adapted in this way. This gives a nice interpretation of the GMMs: the UBM represents the probability density function (PDF) of all possible speech from all possible speakers, and the characteristics of the target speaker S are represented by the shifts in the means. The UBM-GMM approach further has many other advantages, including computational efficiency.
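The following sketch shows mean-only MAP adaptation of a fitted UBM along the lines of [16]; the relevance factor r = 16 is a commonly quoted value, and the UBM is assumed to be a scikit-learn GaussianMixture.

```python
# Sketch of mean-only relevance-MAP adaptation of a UBM, in the spirit of
# [16]; the relevance factor r = 16 is a typical (assumed) value.
import numpy as np

def map_adapt_means(ubm, enroll_feats, r=16.0):
    post = ubm.predict_proba(enroll_feats)     # responsibilities: frames x comps
    n_c = post.sum(axis=0)                     # zeroth-order statistics
    f_c = post.T @ enroll_feats                # first-order statistics
    alpha = (n_c / (n_c + r))[:, None]         # data-dependent adaptation weight
    ml_means = f_c / np.maximum(n_c, 1e-10)[:, None]
    # interpolate between the UBM (prior) means and the ML estimate;
    # components with little enrollment data stay close to the UBM
    return alpha * ml_means + (1.0 - alpha) * ubm.means_
```

Note that the adaptation weight approaches 1 for components well supported by enrollment data and 0 for components that are not, so unseen regions of the feature space simply retain the UBM means.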

The shifts can be used directly in the calculation of the GMM likelihood function, but they can also be treated as features in their own right. Of particular interest is the use of these shifts as inputs to a Support Vector Machine classifier (see Section 8.4), which is a discriminative modeling technique, in contrast to the generative GMM. The shifts of the means can, after some scaling [4], be concatenated into a fixed-size ‘supervector’ containing all the information extracted from the utterance by the model. By performing this operation not only for the target speaker but also for a large collection of background speakers, we obtain many points in a high-dimensional space. An SVM can be trained to separate the one positive example of the target speaker from the background data points by constructing a hyperplane that maximizes the margin between the two classes. The points on the margin are called the support vectors, and they determine the model. The concept of distance between points x and y in an SVM is governed by a kernel function k(x, y). For the GMM-SVM approach [4] this is simply a linear kernel, i.e., the kernel is the inner product x^T y. This makes it possible to represent the model by the position and direction of the hyperplane alone, i.e., by the normal to the plane n and an offset. This is an efficient way of storing the model, and leads to very efficient computation of a score at test time by means of an inner product. This way of modeling is an example where the likelihood-ratio function is computed in a single model.
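A minimal sketch of this GMM-SVM construction is shown below with random stand-ins for the supervectors; the dimensions, the class weighting and the single-positive-example setup follow the description above, but all data are placeholders (real supervectors would be scaled mean shifts as in [4]).

```python
# Sketch of the GMM-SVM idea: one target supervector versus many background
# supervectors, separated by a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
dim = 64 * 13                                      # e.g., 64 mixtures x 13 cepstra
target_sv = rng.normal(0.3, 1.0, (1, dim))         # one enrollment supervector
background_svs = rng.normal(0.0, 1.0, (500, dim))  # many background speakers
test_sv = rng.normal(0.3, 1.0, dim)

X = np.vstack([target_sv, background_svs])
y = np.array([1] + [0] * 500)
# balanced class weights compensate for having a single positive example
svm = LinearSVC(C=1.0, class_weight="balanced").fit(X, y)

# with a linear kernel the model is just a hyperplane (normal + offset),
# and scoring a test supervector is a single inner product
score = test_sv @ svm.coef_[0] + svm.intercept_[0]
```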

Another way to arrive at an efficient computation of the likelihood function in the UBM-GMM model is to view the MAP-adapted speaker model as a perturbation of the UBM. The effect of this perturbation on the value of the likelihood function can then be approximated by a Taylor expansion of the UBM w.r.t. its parameters, i.e., the means of the Gaussians. In the difference of the log-likelihood functions of the Taylor expansion and the UBM, many terms cancel, and the speaker model is then computed from the zeroth- and first-order statistics of the training data: the statistics that are normally computed in MAP adaptation. The model is represented by a single supervector, and computation of a test score involves a single inner product (dot product) with the test supervector. This approach is known as dot-scoring [10].
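A rough sketch of dot-scoring along these lines follows; the precision scaling and the relevance factor are simplifying assumptions of this version, and 'ubm' is again assumed to be a fitted scikit-learn GaussianMixture with diagonal covariances.

```python
# Rough sketch of dot-scoring [10]: the speaker is summarized by one
# supervector derived from the training statistics, and a test score is a
# single inner product with UBM-centered test statistics.
import numpy as np

def bw_stats(ubm, feats):
    """Zeroth- and first-order Baum-Welch statistics against the UBM."""
    post = ubm.predict_proba(feats)
    return post.sum(axis=0), post.T @ feats

def dot_score(ubm, train_feats, test_feats, r=16.0):
    n_tr, f_tr = bw_stats(ubm, train_feats)
    n_te, f_te = bw_stats(ubm, test_feats)
    # speaker supervector: MAP-style mean shifts from the training statistics
    shift = (f_tr - n_tr[:, None] * ubm.means_) / (n_tr[:, None] + r)
    # UBM-centered first-order statistics of the test segment
    centered = f_te - n_te[:, None] * ubm.means_
    return np.sum(shift / ubm.covariances_ * centered)
```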

There are many more possible models for representing a likelihood function. The Generalized Linear Discriminant Sequence (GLDS) kernel for SVMs [3] uses a monomial expansion of the acoustic features (i.e., powers and products of individual feature dimensions), averaged over the whole utterance, as another large vector representing a speech utterance as a point in a high-dimensional space. A standard linear SVM in this space has all the computational advantages mentioned for the GMM-SVM approach. This approach is quite symmetric in character: there is no preference for using either the “train” or the “test” segment to train the model (by contrasting that utterance with a large background of non-target speaker utterances), and it performs relatively well if either of the segments is of short duration. Some feature types require different models. For instance, a speech recognition system can be used to decode the words spoken in the train and test utterances, and the sequence of words can then be used as features [7]. One model for such features is an n-gram language model. Again, a background of non-target speakers can be used to train a baseline model, and this can be adapted to a speaker-specific model by interpolating the background model with a model formed from the training words. The likelihood function can be formed, e.g., by computing the negative entropy of the test words given the speaker model, and comparing this to that of the background model. Other models have also been proposed for word-based features, such as sequence kernels in SVMs. Word features require considerable computational effort to produce, but in situations with longer training material (~15 minutes) this approach pays off in terms of speaker recognition performance.
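As a toy illustration of the word-based approach just described, the sketch below interpolates a background unigram model with enrollment word counts and compares the average log-probability of the test words under the speaker-adapted and background models; the use of unigrams and the fixed interpolation weight are simplifying assumptions.

```python
# Toy sketch of word-based (idiolect) scoring in the spirit of [7]; real
# systems use higher-order n-grams and more careful smoothing.
from collections import Counter
import math

def speaker_model(train_words, background, lam=0.5):
    """Interpolate enrollment word frequencies with a background model.
    'background' is a function mapping a word to a (non-zero) probability."""
    counts, total = Counter(train_words), len(train_words)
    return lambda w: lam * counts[w] / total + (1.0 - lam) * background(w)

def avg_logprob(words, model):
    return sum(math.log(model(w)) for w in words) / len(words)

def word_score(train_words, test_words, background):
    # how much better the speaker-adapted model predicts the test words
    # than the background model alone
    spk = speaker_model(train_words, background)
    return avg_logprob(test_words, spk) - avg_logprob(test_words, background)
```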

A last approach worth mentioning here aims at modeling the acoustic characteristics of the speaker by the parameters of the transform that is required to make the speaker look more like the average speaker [17]. To this end, the (constrained) maximum likelihood linear regression ((C)MLLR) technique (briefly described in Chapter 28) is borrowed from speech recognition, where it is used to make the acoustic features more “speaker independent”. Here, it is the transform parameters that are used, and these are further modeled using an SVM classifier. Specifically, the transform parameters need a special normalization, rank normalization, in order to obtain good speaker recognition performance. In rank normalization, each parameter value is replaced by its rank in a large background of transforms from different speakers.
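Rank normalization itself is simple to express; in the sketch below each parameter is replaced by the fraction of a background collection of transforms that lies below it, which is one possible (assumed) convention for the relative rank.

```python
# Sketch of rank normalization: each transform parameter is replaced by its
# relative rank within a large background collection of transforms.
import numpy as np

def rank_normalize(params, background):
    """params: 1-D parameter vector; background: n_speakers x dim matrix."""
    return (background < params[None, :]).mean(axis=0)  # values in [0, 1]
```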

41.3.3. Score normalization

Speaker detection performance can depend critically on the normalization of scores. One can imagine that the value of the likelihood function of an utterance spoken by a particular speaker actually depends on what has been said. If the content contains more “likely” sounds, the likelihood score will be higher than when it contains relatively many “unlikely” sounds. Another effect that influences the overall likelihood is the spectral shaping of the signal. Although the denominator in the likelihood ratio compensates for such effects to some extent, better compensation strategies are needed. One such strategy is score normalization. This can be carried out to compensate for variation in the test segment (t-norm), in the training segment (z-norm), or in both (zt-norm).

The idea is to compute the score of a test segment not only with the target model, but also with a set of known non-target models, known as the t-norm cohort. A score is then normalized by the transformation

$$ s' = \frac{s - \mu_t}{\sigma_t} \tag{41.2} $$

where μt and σt are the mean and standard deviation of the scores over the t-norm cohort. Equivalently, variation due to training segment variability can be compensated for by normalizing scores with a cohort of non-target test segments (the z-norm cohort) applied to the target model.
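In code, t-norm is a one-liner once the cohort scores are available; the sketch below assumes the cohort scores have already been computed by scoring the test segment against the non-target cohort models.

```python
# Sketch of t-norm (Eq. 41.2); 'cohort_scores' is assumed to hold the scores
# of the same test segment against the t-norm cohort of non-target models.
import numpy as np

def t_norm(raw_score, cohort_scores):
    mu_t, sigma_t = np.mean(cohort_scores), np.std(cohort_scores)
    return (raw_score - mu_t) / sigma_t

# z-norm is the mirror image: the cohort scores would instead come from
# scoring the target model against a set of non-target test segments.
```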

The importance of t-norm or z-norm depends on the modeling technique, on the application of the speaker recognition system, and on other compensation schemes that are used. t-norming appears to be quite effective for general discrimination performance [14], and is also helpful in the fusion of multiple speaker recognition systems. t-norming has the effect that the expected distribution of non-target scores has zero mean and unit standard deviation, which gives some control over the false alarm rate. For instance, a threshold of 3.0 for t-normed scores would give an expected false alarm rate of 0.13%, assuming a normal distribution of non-target scores. The expected miss rate at this threshold, however, is determined by the overall discriminability of the detection task.

Some techniques require z-normalization, such as the dot-scoring approach. Here, the z-norm cohort plays a similar role as the negative examples in the SVM approaches. Advanced channel compensation techniques such as Joint Factor Analysis (JFA) also rely on the application of z-norm [21].

41.3.4. Fusion and calibration

At the end of the computation chain lies the possible combination of multiple systems, and the presentation of the score to the user. In practice, these two steps can be combined, sharing the same training data. When two or more systems have the same score ranges (e.g., as a result of applying t-norming to each individual system), their scores can be combined into a detector with better performance. The easiest model for such system fusion is a weighted sum of the scores si

$$ s = w_0 + \sum_i w_i s_i \tag{41.3} $$

where wi are the weights and w0 is an overall offset. One can choose equal weights, but this is not likely to be the optimal combination. Rather, the weights must be chosen to optimize some expected performance measure. This requires a development set of test trials and a performance objective. A natural objective may be to minimize the number of errors, but we will see that different types of errors (false alarms, misses) can have different costs associated with them. The ratio of target trials to non-target trials in the development set will influence the optimum settings, so care must be taken to ensure that this ratio is set to give the intended results.
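A sketch of such a fusion, trained on development trials, is given below. Logistic regression is used here as one plausible way to obtain the weights, because minimizing its loss also calibrates the fused score (a connection made precise in Sect. 41.4); the class weighting that emulates an intended effective target prior is an assumption of this sketch.

```python
# Sketch of linear score fusion (Eq. 41.3) trained on development trials.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels, p_eff=0.5):
    """dev_scores: trials x systems matrix of normalized scores; labels 1/0."""
    pi = np.mean(dev_labels)   # target fraction among the development trials
    # class weights emulate the intended target prior, so the development
    # trial ratio does not silently set the operating point
    weights = {1: p_eff / pi, 0: (1.0 - p_eff) / (1.0 - pi)}
    return LogisticRegression(class_weight=weights).fit(dev_scores, dev_labels)

def fuse(model, scores):
    # the fused score s = w_0 + sum_i w_i s_i of Eq. 41.3
    return scores @ model.coef_[0] + model.intercept_[0]
```

We will now address the important subject of evaluation.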

41.4 EVALUATION

As for many fields in speech technology, evaluation plays an important role in the development of speaker recognition systems. In the first place, it gives the developers of systems a means to check their performance and make design decisions about techniques, usage of data, and fusion strategies. Secondly, when different research groups use the same data and evaluation protocol, results can be compared at a detailed level and more is learned from the differences in approach between groups. The latter motivation has been taken to the level of a formal paradigm in the series of NIST speaker recognition evaluations (SRE), which have been organized at regular intervals since 1996. These evaluations draw participants from all over the world, and have a somewhat competitive nature, providing extra motivation to researchers.

Important ingredients of evaluation are the definition of a task and an evaluation metric. Speaker detection presents two possible decision errors, false negatives and false positives, or misses and false alarms. Depending on the application, the error of one type may be considered worse than the error of the other type. The performance metric is therefore defined in terms of a cost function Cdet

$$ C_{\mathrm{det}} = C_{\mathrm{miss}} P_{\mathrm{miss}} P_{\mathrm{targ}} + C_{\mathrm{FA}} P_{\mathrm{FA}} (1 - P_{\mathrm{targ}}) \tag{41.4} $$

where Cmiss and CFA are the costs of misses and false alarms, and Ptarg is the prior probability of a target speaker. These three parameters define the application of the speaker recognition system. The values of Pmiss and PFA are determined in the evaluation, and are computed as the fraction of target and non-target trials in error, respectively. Note that one can compare the performance to a trivial detector that always makes the least-cost decision based on prior and costs only,

$$ C_{\mathrm{default}} = \min\bigl(C_{\mathrm{miss}} P_{\mathrm{targ}},\; C_{\mathrm{FA}} (1 - P_{\mathrm{targ}})\bigr) \tag{41.5} $$

where

$$ \theta = \log \frac{P_{\mathrm{targ}}\, C_{\mathrm{miss}}}{(1 - P_{\mathrm{targ}})\, C_{\mathrm{FA}}} \tag{41.6} $$

may be called the ‘effective prior log odds’: the log ratio of the two terms in (41.5), and the parameter that governs the decision trade-off.
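These quantities are straightforward to compute; the sketch below uses Cmiss = 10, CFA = 1 and Ptarg = 0.01, the operating point used in many NIST SREs, purely as example values.

```python
# Sketch of the detection cost (Eq. 41.4), the trivial-detector baseline
# (Eq. 41.5) and the effective prior log odds (Eq. 41.6).
import numpy as np

def c_det(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_targ=0.01):
    return c_miss * p_miss * p_targ + c_fa * p_fa * (1.0 - p_targ)

def c_default(c_miss=10.0, c_fa=1.0, p_targ=0.01):
    # cost of always making whichever decision is cheaper a priori
    return min(c_miss * p_targ, c_fa * (1.0 - p_targ))

def theta(c_miss=10.0, c_fa=1.0, p_targ=0.01):
    return np.log((p_targ * c_miss) / ((1.0 - p_targ) * c_fa))
```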

As indicated in Sect. 41.2, a system can make decisions by applying a threshold t to the (normalized) score s, i.e., deciding that X is uttered by speaker S if s > t. It is instructive to look at the distribution of scores for target and non-target trials for a typical speaker recognition system. By applying a threshold, one can see the probabilities of misses and false alarms as the light and dark gray areas in Fig. 41.1a, in proportion to the total area of the PDF for targets and non-targets, respectively. Immediately visible is the trade-off between the two error types as the threshold is varied. This trade-off is nicely shown in a DET plot (Detection Error Trade-off) [13]. Here the axes are warped according to the probit function, i.e., the inverse cumulative normal distribution, which can be expressed in terms of the inverse error function

$$ \mathrm{probit}(p) = \sqrt{2}\, \operatorname{erf}^{-1}(2p - 1) \tag{41.7} $$

This warping has the property that a threshold swept across overlapping Gaussian distributions will result in a straight line. The DET plot for the data in Fig. 41.1a is shown in Fig. 41.1b. The warping of the axes makes the trade-off nicely linear, which has several advantages. Many different systems or conditions can be shown in the plot without too much clutter. The decision point is indicated by the rectangle, with its width and height indicating the 95% confidence interval of the values for (PFA, Pmiss), assuming trial independence and a binomial distribution. Indicated by the circle is Cdet^min, the minimum value of Cdet obtained by sweeping the threshold. Finally, the crossing with the diagonal Pmiss = PFA gives the Equal Error Rate (EER), the performance measure most often reported in the literature as an indication of the general discrimination performance of a speaker recognition system.
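A DET plot can be produced directly from the two score sets by sweeping a threshold and warping the error rates with the probit function of Eq. 41.7 (available as scipy's norm.ppf); the synthetic Gaussian scores below are assumptions chosen so that the curve comes out as a straight line.

```python
# Sketch of a DET plot: sweep a threshold over the scores and plot the error
# rates on probit-warped axes (Eq. 41.7).
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_points(tar, non, n_points=500):
    lo, hi = min(tar.min(), non.min()), max(tar.max(), non.max())
    thresholds = np.linspace(lo, hi, n_points)
    p_miss = np.array([(tar < t).mean() for t in thresholds])
    p_fa = np.array([(non >= t).mean() for t in thresholds])
    return p_miss, p_fa

rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 1000)      # synthetic target scores
non = rng.normal(0.0, 1.0, 10000)     # synthetic non-target scores
p_miss, p_fa = det_points(tar, non)
ok = (p_miss > 0) & (p_miss < 1) & (p_fa > 0) & (p_fa < 1)  # probit needs (0, 1)
plt.plot(norm.ppf(p_fa[ok]), norm.ppf(p_miss[ok]))  # Gaussians give a straight line
plt.xlabel("probit(P_FA)")
plt.ylabel("probit(P_miss)")
plt.show()
```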

The DET plot and the performance measures Cdet^min and EER are used extensively in the development of speaker recognition systems. When actual decisions have to be made (as for submitting results to a NIST SRE), the threshold that yields Cdet^min on a development trial set is often used for making decisions on the evaluation trial set. As indicated in Sect. 41.3.4, decisions can be made in the fusion process, where a development set of trials is needed to optimize the fusion parameters. Rather than optimizing for one particular cost function Cdet(Cmiss, CFA, Ptarg), it is possible to calibrate scores over a range of cost functions, and thereby applications, by minimizing a ‘soft’ version of the error counts

$$ C_{\mathrm{llr}} = \frac{1}{2 \log 2}\left[\frac{1}{N_{\mathrm{targ}}}\sum_{t \in \mathrm{targ}} \log\left(1 + e^{-s(t)}\right) + \frac{1}{N_{\mathrm{non}}}\sum_{t \in \mathrm{non}} \log\left(1 + e^{s(t)}\right)\right] \tag{41.8} $$

where s(t) indicates the score for trial t, Ntarg and Nnon are the numbers of target and non-target trials in the test, and the summations run over those trial sets, respectively. The metric Cllr measures to what extent the scores s can be considered calibrated log-likelihood ratios, which have the property [19]

$$ \log \frac{p(s \mid \mathrm{targ})}{p(s \mid \mathrm{non})} = s \tag{41.9} $$

FIGURE 41.1 (a, left) Probability density functions for target and non-target scores. (b, right) DET plot, showing the trade-off between misses and false alarms obtained from the same scores.

By using Cllr as the optimization criterion in fusion, the final score has log-likelihood-ratio character. One of the consequences is that the optimal Bayes decision threshold for these calibrated scores is –θ (cf. (41.6)), which can be computed from the application's cost parameters. If the prior or the costs of the application change, no re-calibration is necessary; the only modification required is to shift the threshold according to (41.6).
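The sketch below computes Cllr according to Eq. 41.8, in a numerically stable logaddexp form, and makes the Bayes decision at the threshold –θ of Eq. 41.6.

```python
# Sketch of Cllr (Eq. 41.8) and the Bayes decision for calibrated LLR scores.
import numpy as np

def c_llr(target_llrs, nontarget_llrs):
    c_tar = np.mean(np.logaddexp(0.0, -target_llrs))    # log(1 + e^{-s})
    c_non = np.mean(np.logaddexp(0.0, nontarget_llrs))  # log(1 + e^{s})
    return (c_tar + c_non) / (2.0 * np.log(2.0))        # in bits

def bayes_decision(llr, c_miss=10.0, c_fa=1.0, p_targ=0.01):
    theta = np.log((p_targ * c_miss) / ((1.0 - p_targ) * c_fa))
    return llr > -theta   # accept the target hypothesis above the threshold
```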

The evaluation metric Cllr is dimensioned such that it measures the average information loss in bits due to discrimination and calibration errors of the speaker detector. There also exists an algorithm to compute Cllr under optimal calibration conditions, Cllr^min, which can be considered the application-independent counterpart of Cdet^min. For a deeper discussion of these various evaluation measures we refer to [19].

41.5 MODERN RESEARCH CHALLENGES

Research in text-independent speaker recognition has in recent years been driven by the tasks and conditions defined in the regular NIST SREs. In addition, these evaluations make speech data available to the research community. The research focus has been on channel variability (train and test segments recorded through different telephone handsets, networks, encodings, acoustics and transducers), language variability (train and test segments recorded in possibly different languages), and speaking-style variability (varying levels of vocal effort). The channel (or session) variability challenge has been approached quite successfully with the (Joint) Factor Analysis (JFA) model of Patrick Kenny [11], in which the shifts of the GMM means in supervector space are seen as the sum of a shift due to the channel and a shift due to the speaker. The two shifts lie in different subspaces, which have to be found from training data. Here, the intricate relation between evaluations (providing the data) and algorithm development becomes clear. Apart from the JFA model, other techniques have been applied with more or less success, such as Probabilistic Subspace Adaptation (PSA) [12], the very elegant and efficient Nuisance Attribute Projection (NAP) for the GMM-SVM [5], and feature-domain channel factor compensation [18], which can be used in front of any further classifier.

In the SRE series, some parameters that influence performance have been varied, such as training duration (from 10 s to multiple sessions of 5 minutes) or test segment duration (10 s to 5 minutes), and here the different research groups can choose their own focus. For instance, commercial systems with applications in banking have more interest in shorter duration conditions, and for this reason concentrate on text-dependent implementations.

Some issues in speaker recognition have hardly been addressed, for instance aging of the voice, and different physical, emotional or health conditions of the speakers. One reason for this is the lack of data available to characterize such variations.

Forensic applications of speaker recognition pose stringent demands on the robustness of calibration to adverse recording and speech production conditions of the samples. The challenge is to understand the effects of these conditions such that speaker recognition results can be used not only for investigatory purposes, but ultimately for presenting evidence using a Bayesian interpretation for the truth-finding process in court.

41.6 EXERCISES

  41.1 Give some of the specific properties of mel cepstral or PLP analysis, as described in Chapter 22, that could be a poor match to the goals of speaker recognition.
  41.2 Show that Eq. 41.1 is the probabilistically correct basis for decisions, and explain how prior probabilities for different speakers give the optimal threshold for the score s.
  41.3 Suppose that you have already trained a large-vocabulary speaker-independent recognizer on many speakers. Propose some ways that such a system could be used as the basis for a speaker-verification system.

BIBLIOGRAPHY

  1. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacrétaz, and D.A. Reynolds. A tutorial on text-independent speaker verification. EURASIP Journal on Applied Signal Processing, 2004(4):430–451, 2004.
  2. J. P. Campbell. Speaker recognition: A tutorial. Proceedings of the IEEE, 85(9):1437–1462, 1997.
  3. Joseph P. Campbell, Douglas A. Reynolds, and Robert B. Dunn. Fusing high- and low-level features for speaker recognition. In Proc. Eurospeech, pages 2665–2669. ISCA, 2003.
  4. W. M. Campbell, K. J. Brady, J. P. Campbell, R. Granville, and D. A. Reynolds. Understanding scores in forensic speaker recognition. In Proc. Odyssey 2006 Speaker and Language Recognition Workshop, San Juan, June 2006.
  5. William Campbell, Douglas Sturim, Douglas Reynolds, and Alex Solomonoff. SVM based speaker verification using a GMM supervector kernel and NAP variability compensation. In Proc. ICASSP, pages 97–100, Toulouse, 2006. IEEE.
  6. G. Doddington. Speaker recognition—Identifying people by their voices. Proceedings of the IEEE, 73(11):1651–1664, 1985.
  7. George Doddington. Speaker recognition based on idiolectal differences between speakers. In Proc. Eurospeech, pages 2521–2524. ISCA, 2001.
  8. S. Furui. Recent advances in speaker recognition. Pattern Recognition Letters, 18(9):859–872, 1997.
  9. J. L. Gauvain, L. F. Lamel, G. Adda, and M. Adda-Decker. Speaker-independent continuous speech dictation. Speech Comm., 15:21–37, 1994.
  10. O. Glembek, L. Burget, N. Dehak, N. Brümmer, and P. Kenny. Comparison of scoring methods used in speaker recognition with joint factor analysis. In Proc. ICASSP 2009, Taipei, Taiwan, April 2009.
  11. Patrick Kenny and Pierre Dumouchel. Disentangling speaker and channel effects in speaker verification. In Proc. ICASSP, pages 37–40, 2004.
  12. Simon Lucey and Tsuhan Chen. Improved speaker verification through probabilistic subspace adaptation. In Proc. Interspeech, pages 2021–2024, Geneva, 2003. ISCA.
  13. Alvin Martin, George Doddington, Terri Kamm, Mark Ordowski, and Mark Przybocki. The DET curve in assessment of detection task performance. In Proc. Eurospeech 1997, pages 1895–1898, Rhodes, Greece, 1997.
  14. Jiří Navrátil and Ganesh N. Ramaswamy. The awe and mystery of t-norm. In Proc. Eurospeech, pages 2009–2012, 2003.
  15. Jason Pelecanos and Sridha Sridharan. Feature warping for robust speaker verification. In Proc. Speaker Odyssey, pages 213–218. Crete, Greece, 2001.
  16. D.A. Reynolds, T.F. Quatieri, and R.B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10:19–41, 2000.
  17. A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg, and A. Venkataraman. MLLR transforms as features in speaker recognition. In Proc. Eurospeech, pages 2425–2428, 2005.
  18. Claudio Vair, Daniele Colibro, Fabio Castaldo, Emanuele Dalmasso, and Pietro Laface. Channel factors compensation in model and feature domain for speaker recognition. In Proc. Odyssey 2006 Speaker and Language Recognition Workshop, San Juan, June 2006.
  19. David A. van Leeuwen and Niko Brümmer. An introduction to application-independent evaluation of speaker recognition systems. In Christian Müller, editor, Speaker Classification, volume 4343 of Lecture Notes in Computer Science / Artificial Intelligence. Springer, Heidelberg - New York - Berlin, 2007.
  20. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
  21. Robbie Vogt, Brendan Baker, and Sridha Sridharan. Modelling session variability in text-independent speaker verification. In Proc. Interspeech, pages 3117–3120, 2005.

¹This chapter was originally written by Hervé Bourlard, and later substantially updated and expanded by David van Leeuwen.
