CHAPTER 42


SPEAKER DIARIZATION

42.1 INTRODUCTION

As discussed in Chapter 8, for some applications it is useful to develop a classifier even without any labels, the so-called ‘unsupervised’ clustering task. For time series data, it is often useful to both segment and cluster the segments, for instance to associate each time segment with a particular source, even if that source is unknown. In the case of speech, this operation is known as speaker diarization, namely, the determination of who spoke when [25]. In its typical instantiation, there are no pre-existing models for any of the speakers; models are learned on the fly, with no supervisory information. No information about the underlying language, spoken text, amount of speech, number of speakers, or the placement of microphones need be given. As with nearly all modern speech applications, the dominant underlying model is a statistical one; and as in speaker verification, the basic representation is a Gaussian mixture model for each speaker, as described in Chapter 41. However, also like speaker verification, state-of-the-art implementations are relatively complex. In this chapter we1 will present the major methods in current use.

Unlike verification, speaker diarization does not require the recognition of particular speakers, i.e., labeling speech with real names. It does, however, have its own challenges. In particular, diarization requires segmentation, which is not usually required for verification. Furthermore, for some applications, the clustering and segmentation must be done without having a large amount of audio material – it can sometimes be required to determine the speaker segments with only a few minutes of audio. Unsurprisingly, this task is also subject to the same problems common to other forms of speech processing: in particular, the conflation of speaker identity with transducers, room acoustics, noise, channel properties, and voice variability, e.g., due to emotion or health; and of course with linguistic content, which may vary from segment to segment for the same speaker, and may be similar for segments from different speakers. As with many speech tasks, speaker diarization is also made more difficult by speech overlap, which is quite common in conversational speech.

Speaker diarization is a useful first step in virtually any task that involves more than one talking person when there is no speaker-specific training data. Tasks to which speaker diarization has been successfully applied include speaker-adaptive speech recognition, video and audio retrieval and navigation, copyright infringement detection (by extracting speaker patterns to identify identical dialog parts), and certain tasks in automated human-behavior analysis (e.g., dominance detection based on analyses such as who spoke most/least and who got interrupted most often).


FIGURE 42.1 Conceptual overview of a typical speaker diarization system. In single-microphone conditions, beamforming is omitted.

42.2 GENERAL DESIGN OF A SPEAKER DIARIZATION SYSTEM

As already introduced above, the goal of speaker diarization is to segment a single- or multichannel audio recording into speaker-homogeneous regions, answering the question “who spoke when?” using little or no prior knowledge. When multiple microphones are used, the task can become much simpler, e.g., when every speaker has his or her own dedicated line (such as in telephone conferences) or individual headset (though cross-talk can still be a problem). However, the task of using microphone arrays at one or more fixed locations in the room can be as much of a challenge as the single-microphone case. In practice, a speaker diarization system has to answer not one but two questions:

  • What are the speech regions?
  • Which speech regions belong to the same speaker?

Conceptually, a speaker diarization system performs three tasks: first, discriminate between speech and non-speech regions; second, detect speaker changes to segment the audio data; third, group the segmented regions into speaker-homogeneous clusters. While this could in theory be achieved in a single clustering pass, as a practical matter many speaker diarization systems use a speech activity detector as a first processing step and then perform speaker segmentation and clustering in a single pass as a second step. Other pieces of information, such as the number of speakers in the recording and a quantitative ranking of each talker's speech time, are extracted implicitly. When arrays of microphones are used for recording, most current speaker diarization systems apply beamforming and then process the signal as if it were one audio stream. A side effect of beamforming is a higher signal-to-noise ratio compared to the individual microphone channels and the possibility to compute time-delay-of-arrival features. The time-delay-of-arrival features are the estimated differences in sound-wave traveling times caused by the speakers being located at different distances from each microphone in the array, as discussed in Section 39.4. Figure 42.1 shows the conceptual architecture of a typical speaker diarization system.

The output of a speaker diarization system consists of labels describing speech segments in terms of start time, end time, and speaker cluster name. The US National Institute of Standards and Technology (NIST) has defined a standard metric to measure the accuracy of speaker diarization systems by evaluating the output against manually-annotated ground-truth segments, usually with timings refined through ASR forced-alignment. The two segmentations are compared by using a dynamic programming procedure to find the optimal one-to-one mapping between the hypothesis and the ground truth segments so that the total overlap between the reference speaker and the corresponding mapped hypothesized speaker cluster is maximized. The difference is expressed as Diarization Error Rate (DER) which is defined as follows:

$$\mathrm{DER} = \frac{\displaystyle\sum_{s=1}^{S} \mathrm{dur}(s)\,\bigl(\max\bigl(N_{\mathrm{ref}}(s),\,N_{\mathrm{sys}}(s)\bigr) - N_{\mathrm{correct}}(s)\bigr)}{\displaystyle\sum_{s=1}^{S} \mathrm{dur}(s)\,N_{\mathrm{ref}}(s)} \qquad (42.1)$$

with S being the total number of time segments, defined by merging all the boundaries in both the reference and hypothesized segmentations, and dur(s) being the duration of segment s. The terms Nref(s) and Nsys(s) indicate the number of speakers speaking in segment s according to the reference and the hypothesis, respectively, and Ncorrect(s) indicates the number of speakers that speak in segment s and have been correctly matched between reference and hypothesis. Segments labelled as non-speech are considered to contain zero speakers. DER is usually expressed in percent; when all speakers and the non-speech in a file are correctly matched, the error is 0%. Consequently, a formal definition of the task of speaker diarization is the minimization of Equation 42.1.

For practical reasons, DER is often decomposed into the sum of three components: misses (speaker in the reference, but not in the hypothesis), false alarms (speaker in the hypothesis, but not in the reference), and speaker errors (the mapped reference speaker is not the same as the hypothesized speaker). Wrongly assigned labels in overlapping speech regions are handled as either misses or false alarms, depending on whether it is the reference or the hypothesis that contains the non-assigned speakers. If multiple unmatched speakers appear in either or both of the reference and hypothesis, the error counts as multiple speaker errors. When measuring performance, NIST uses a collar of 250 ms around every reference speaker segment boundary, which absorbs inaccuracies in the ground-truth annotation.
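To make the metric concrete, the following is a minimal sketch of the DER computation of Equation 42.1, under the simplifying assumptions that the optimal reference-to-hypothesis speaker mapping has already been found (e.g., by the dynamic programming procedure mentioned above) and that no scoring collar is applied. The segment representation and function names are illustrative, not a prescribed interface.

```python
# Minimal DER sketch: segments are (start, end, speaker) tuples, with the
# hypothesis speaker labels already mapped onto the reference label set.

def speakers_at(segments, t0, t1):
    """Return the set of speakers active anywhere in [t0, t1)."""
    return {spk for (start, end, spk) in segments if start < t1 and end > t0}

def diarization_error_rate(reference, hypothesis):
    # Merge all boundaries from both segmentations into elementary time slices.
    bounds = sorted({t for seg in reference + hypothesis for t in seg[:2]})
    num, den = 0.0, 0.0
    for t0, t1 in zip(bounds[:-1], bounds[1:]):
        dur = t1 - t0
        ref = speakers_at(reference, t0, t1)
        hyp = speakers_at(hypothesis, t0, t1)
        n_correct = len(ref & hyp)
        num += dur * (max(len(ref), len(hyp)) - n_correct)
        den += dur * len(ref)
    return num / den if den > 0 else 0.0

# Example: the hypothesis attributes 2 s of speaker B's speech to speaker A.
ref = [(0.0, 10.0, "A"), (10.0, 20.0, "B")]
hyp = [(0.0, 12.0, "A"), (12.0, 20.0, "B")]
print(diarization_error_rate(ref, hyp))  # 2 s of speaker error over 20 s -> 0.1
```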

At the time of writing, typical speaker diarization systems achieve DERs of roughly 10–15%. High signal-to-noise ratio recordings of 4-speaker meetings with very limited overlap, recorded with a microphone array, can sometimes be diarized with less than 2% DER. A larger number of speakers, speakers with easily-confusable voices, non-speech sounds such as music, speakers talking simultaneously, emotional variance, laughter, coughs, and many other factors can increase the DER. The greatest challenge is to robustly achieve low error rates despite these factors, which can vary significantly from recording to recording. The next section provides an overview of some popular system design choices.


FIGURE 42.2 A conceptual visualization of two iterations of the agglomerative hierarchical clustering algorithm explained in Section 42.3.2.

42.3 EXAMPLE SYSTEM COMPONENTS

Over the years, many different speaker diarization algorithms have been developed in the speech community. The following is an overview of some design choices.

42.3.1. Features

The speech signal is usually parametrized in frames with a window size of 20–30 ms and a step size of 10 ms, typically computing 13–19 parameters per frame. Popular features are Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction (PLP) features, sometimes augmented with log energy. Some implementations also use Linear Prediction Coefficient (LPC) features. Note that these are the same features, introduced in Chapter 22, that are commonly used in automatic speech recognition, among other speech processing technologies. Additional features are obtained by computing first- and second-order derivatives, often estimated by applying linear regression over 3–7 consecutive frames. Thus, a common dimensionality of the feature space is 26 or 39. Recent research has also explored the use of spectro-temporal features [27] and long-term features (around 500 ms), including prosodic features [11]. The performance of features depends on many factors, especially the chosen metric and statistical model for the segmentation and clustering algorithm. As a result, beyond the common usage of MFCC and PLP cepstra (and time-delay features from microphone arrays, as described in Section 42.3.3), there is no predominant feature choice in the speaker diarization research community.
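As an illustration of such a front end, the following sketch computes 13 MFCCs plus first- and second-order derivatives (39 dimensions in total). It assumes the librosa library purely for convenience; the file name, window and step sizes, and regression width are illustrative choices, not a prescribed configuration.

```python
# Minimal MFCC + delta feature extraction sketch (librosa is an assumption;
# any MFCC implementation would serve the same purpose).
import numpy as np
import librosa

y, sr = librosa.load("meeting.wav", sr=16000)        # hypothetical input file
win, hop = int(0.025 * sr), int(0.010 * sr)          # 25 ms window, 10 ms step
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=win, hop_length=hop)
d1 = librosa.feature.delta(mfcc, width=5, order=1)   # slope over 5 frames
d2 = librosa.feature.delta(mfcc, width=5, order=2)   # curvature
features = np.vstack([mfcc, d1, d2]).T               # shape: (frames, 39)
```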

42.3.2. Segmentation and clustering

There are fundamentally two styles of approach to the core of a speaker diarization system, the segmentation and clustering step, which can be referred to as “bottom-up” and “top-down”. The former uses agglomerative hierarchical clustering, in which an initially large number of clusters is gradually merged to improve some chosen metric, with a stopping criterion to determine when to discontinue the merging process. The latter uses divisive clustering, which starts with a small number of initial clusters (sometimes one) and performs splits according to a metric, again stopping when some criterion is reached. In both cases, the goal is to arrive at an optimal number of clusters (ideally corresponding to the number of speakers), and also to determine the start and end points of each segment in each cluster.

The most common basis for the choice of merging two segments, or splitting a single one into two, is the so-called Bayesian Information Criterion (BIC) [5]. For both alternatives of merged and split assignments, the following measure is computed:

$$\mathrm{BIC}(X,\Theta) = \log p(X \mid \Theta) - \lambda\,\frac{K}{2}\,\log N \qquad (42.2)$$

where X is the sequence of speech features in the segment (such as MFCCs), Θ are the parameters of the statistical model(s) for the segment, K is the number of parameters for the model(s), N is the number of speech feature vectors (e.g., frames) in the segment, and λ is an optimization parameter – ideally 1.0, but in reality tuned empirically. Other “hyperparameters” to the process include the number of initial segments (for the bottom-up approach) and the initial number of Gaussians per model, both of which can be critical choices, as can the type of initialization used for the first segmentation.

Note that the first term is simply the log likelihood of the segment, while the second term accounts for model complexity. Without the second term, the optimum would simply be the configuration with the largest number of segments (and parameters), since this would best fit the data. However, the presence of the tuning term λ is a bothersome limitation, since it requires a well-matched development data set for tuning. Ajmera and Wooters [1] proposed keeping the number of parameters constant between the split-or-merge alternatives, for instance by using more Gaussian components for the single-model case. This makes the second term irrelevant: as long as the number of parameters is kept the same, comparison based on BIC reduces to comparing log-likelihoods alone.
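The following sketch illustrates the resulting merge test, using a single full-covariance Gaussian per segment (a common simplification of the GMM case) and the penalty of Equation 42.2. The function names, the regularization constant, and the use of plain numpy are assumptions made for illustration only.

```python
# Minimal BIC merge-test sketch: positive delta_bic favors merging the two
# segments, i.e., treating them as the same speaker.
import numpy as np

def gaussian_loglik(X):
    """Log likelihood of X under a full-covariance Gaussian fit to X itself."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # regularize
    sign, logdet = np.linalg.slogdet(cov)
    # For the maximum-likelihood Gaussian this reduces to a closed form.
    return -0.5 * n * (logdet + d * np.log(2 * np.pi) + d)

def delta_bic(X1, X2, lambda_=1.0):
    """BIC(merged) - BIC(separate) for two (frames x dims) feature arrays."""
    d = X1.shape[1]
    k = d + d * (d + 1) / 2                  # parameters of one Gaussian
    n = len(X1) + len(X2)
    merged = gaussian_loglik(np.vstack([X1, X2]))
    separate = gaussian_loglik(X1) + gaussian_loglik(X2)
    # Separate modeling uses one extra Gaussian, hence the extra penalty term
    # credited to the merged hypothesis.
    return merged - separate + lambda_ * 0.5 * k * np.log(n)
```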

Most commonly, the statistical model used to represent each cluster is a Gaussian Mixture Model (GMM). As described in Chapter 9, this is a weighted sum of Gaussian distributions, where most commonly each Gaussian is parameterized by a mean vector and a diagonal covariance matrix. The underlying model for the entire speech sample is typically a Hidden Markov Model, where each state corresponds to a cluster (represented by a GMM), and the actual segmentation at each iteration is determined by a Viterbi realignment. Since speaker turns usually last many frames, a minimum-duration constraint is enforced for each speech segment (with a typical value of 2.5 seconds) to make sure that it is speakers that are clustered, not phones or other units. The following outlines a typical bottom-up diarization algorithm, visualized in Figure 42.2:

  1. Generate an initial segmentation by uniformly partitioning the audio file into k segments of the same length, where k is chosen to be several times larger than the assumed number of speakers in the audio track. Train a GMM for each initial segment.
  2. Re-segmentation: Run a Viterbi decoder using the current set of GMMs to segment the audio track.
  3. Re-training: Retrain the models using the current segmentation as input.
  4. Select the closest pair of clusters and merge them. This is done by going over all possible pairs of clusters and computing, for each pair, the difference between the BIC score of a new GMM trained on the merged cluster pair and the sum of the BIC scores of the two individual models. The pair with the largest positive difference is merged, the new GMM replaces the two old models, and the algorithm repeats from step 2 unless the stopping criterion (e.g., a target number of clusters, or no pair with a positive difference) is reached.

The result of the algorithm is a segmentation of the audio into k′ clusters and a GMM for each cluster, where k′ is assumed to be the number of speakers.
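A compact sketch of this bottom-up loop is given below. It assumes scikit-learn's GaussianMixture for the per-cluster models, approximates the HMM/Viterbi realignment by reassigning fixed-length blocks (the minimum-duration constraint) to their best-scoring cluster, and uses total log-likelihood differences as the merge score (the constant-parameter BIC variant discussed above). All parameter values and names are illustrative; a production system would run a proper Viterbi pass and keep the merged GMM with its combined number of components.

```python
# Simplified bottom-up diarization sketch following steps 1-4 above.
import numpy as np
from sklearn.mixture import GaussianMixture

def diarize(features, init_clusters=16, gaussians=5, min_dur_frames=250):
    n_frames = features.shape[0]
    # Step 1: uniform initial segmentation into init_clusters equal pieces.
    labels = np.repeat(np.arange(init_clusters),
                       int(np.ceil(n_frames / init_clusters)))[:n_frames]

    def fit(data, components=gaussians):
        return GaussianMixture(n_components=components,
                               covariance_type='diag').fit(data)

    models = {c: fit(features[labels == c]) for c in np.unique(labels)}

    while len(models) > 1:
        # Steps 2-3: block-wise reassignment (approximating the Viterbi pass
        # with a 2.5 s minimum duration at 100 frames/s) and retraining.
        for start in range(0, n_frames, min_dur_frames):
            block = features[start:start + min_dur_frames]
            scores = {c: m.score(block) for c, m in models.items()}
            labels[start:start + len(block)] = max(scores, key=scores.get)
        models = {c: fit(features[labels == c]) for c in np.unique(labels)}

        # Step 4: merge the pair whose combined model scores best. With the
        # parameter count held constant, the BIC comparison reduces to a
        # total log-likelihood difference.
        best_pair, best_gain = None, 0.0
        clusters = list(models)
        for i, a in enumerate(clusters):
            for b in clusters[i + 1:]:
                both = features[(labels == a) | (labels == b)]
                merged = fit(both, components=2 * gaussians)
                gain = (merged.score(both) * len(both)
                        - models[a].score(features[labels == a]) * np.sum(labels == a)
                        - models[b].score(features[labels == b]) * np.sum(labels == b))
                if gain > best_gain:
                    best_pair, best_gain = (a, b), gain
        if best_pair is None:            # stopping criterion: no positive gain
            break
        a, b = best_pair
        labels[labels == b] = a
        del models[b]
        # For simplicity the merged cluster is retrained with the default
        # number of components; the original algorithm keeps the larger
        # merged GMM instead.
        models[a] = fit(features[labels == a])
    return labels
```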

42.3.3. Acoustic beamforming

As discussed in Section 39.4, microphone arrays are often used to enhance the audio signal captured by far-field microphones. The redundancy among the multiple channels can be exploited to enhance the signal, even if some of the channels have a very poor SNR. When speaker diarization is to be performed on data that has been collected by a microphone array and enhanced by beamforming, it is natural to exploit the availability of spatial information for speaker segmentation and clustering. By correlating the individual microphone signals, estimates of inter-channel delay may be used not only for delay-and-sum beamforming of the multiple microphone channels but also for speaker localization. One can obtain information on the location of the audio source (i.e., the speaker) by calculating the so-called time delay of arrival (TDOA). This is the relative difference in propagation delay caused by the varying distances of the microphones from the speaker. NIST evaluates diarization on microphone arrays as the so-called multiple distant microphone (MDM) condition, which is contrasted with the single distant microphone (SDM) condition. It is preferable for the beamforming algorithm not to require knowledge of the placement of the microphones, as this information may not be available.

The standard algorithm for calculating TDOA features is GCC-PHAT (Generalized Cross Correlation with Phase Transform) [15]. The algorithm selects a reference channel and aligns the other channels using a standard delay-and-sum algorithm. The contribution of each signal channel to the output is dynamically weighted using cross-correlation. The TDOA features are estimated by measuring the resulting time shifts of each channel after alignment. Current research focuses on finding methods to select the optimum reference channel and to better stabilize the time-delay-of-arrival (TDOA) values between channels before the signals are summed together. Using TDOA features alone, error rates are considerably higher than those achieved with acoustic features. Therefore, a combination of TDOA and acoustic features is used by integrating the feature streams at some stage of the algorithm, e.g., by using them for initialization or integrating them using weighted log-likelihood combination.
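As an illustration, the following is a minimal numpy sketch of GCC-PHAT delay estimation between one microphone channel and a chosen reference channel. The maximum expected delay and the variable names are illustrative assumptions; a full system would also perform the channel weighting, reference selection, and delay smoothing described above.

```python
# Minimal GCC-PHAT sketch for a single channel pair.
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=0.02):
    """Return the estimated delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: keep only phase information, discarding magnitude.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau)
    # Rearrange so that the center index corresponds to zero delay.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)
```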

Unfortunately, there are also some disadvantages to obtaining spatial information purely from the audio signal. First, it is very hard to detect when a person moves or walks around, so the method can mistakenly report multiple speakers in this case. Second, this method requires significantly more computational effort since as many as eight or more data streams have to be processed. Third, and most importantly, a microphone array is required, which limits the generality of this approach.

42.3.4. Speech activity detection

If the end goal is diarizing speakers, and not extraneous nonspeech sounds, speech/nonspeech discrimination is an essential prerequisite for diarization. Many of the errors in current systems are due to mistakes in this stage of the processing. While it may seem surprising that this two-class problem is still not fully solved, in reality one of the classes (nonspeech) can consist of anything, and perhaps it should really be thought of as a problem of detecting speech (often called Speech Activity Detection, or SAD) in an environment that can contain many signals for which one has no models.

SAD involves the labeling of speech and non-speech segments, and it can have a significant impact on speaker diarization performance for two main reasons. The first stems directly from the speaker diarization performance metric, the DER, which takes into account both the false alarm and missed speaker error rates; consequently, poor SAD performance leads to an increased DER. Second, non-speech segments can disturb the speaker diarization process by introducing irrelevant data into the acoustic models.

SAD is a fundamental task in almost all fields of speech processing (coding, enhancement, and recognition) and many different approaches and studies have been reported in the literature [20]. Non-speech segments may include silence, but also ambient noise such as paper shuffling, door knocks, or voiced noise such as breathing or coughing, or other background noise such as an ambulance going by outside. TV content often contains sound effects, music, laughter, and applause. Highly variable energy levels can be observed in the non-speech parts of the signal. Moreover, differences in microphones or room configurations may result in variable SNRs from one recording to another. These factors make SAD far from trivial, and simple, threshold-based techniques using features of energy, spectral divergence between speech and noise, and pitch estimation, have proven to be relatively ineffective.

Supervised approaches tend to give better performance. They rely on a two-class detector, with models pre-trained on speech and non-speech data [28, 2, 8, 16, 30]. Speech and non-speech models (e.g., GMMs) may optionally be adapted to specific recording conditions [9]. As with other speech processing tasks, MFCCs have been used for this purpose. Discriminant transformations such as Linear Discriminant Analysis (LDA), sometimes coupled with Support Vector Machines (SVMs), have also been proposed. The main drawback of model-based approaches is their reliance on external data for training the initial models, which makes them less robust to changes in acoustic conditions. Hybrid approaches have also been proposed: in most cases, an energy-based detector is first applied in order to label a limited amount of speech and non-speech data for which there is high confidence in the classification. In a second step, the labeled data are used to train session-specific speech and non-speech models, which are subsequently used in a model-based detector to obtain the final speech/non-speech segmentation [29, 24, 18].
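The hybrid scheme can be sketched as follows, assuming scikit-learn GMMs. The energy percentiles and model sizes are illustrative assumptions, and a real system would smooth the frame-level decisions, e.g., with an HMM enforcing minimum segment durations.

```python
# Minimal hybrid SAD sketch: energy-based bootstrap labels, then
# session-specific GMM classification of every frame.
import numpy as np
from sklearn.mixture import GaussianMixture

def hybrid_sad(features, log_energy):
    """features: (frames, dims) array; log_energy: (frames,) array."""
    lo, hi = np.percentile(log_energy, [20, 60])
    speech_boot = features[log_energy > hi]       # confident speech frames
    nonspeech_boot = features[log_energy < lo]    # confident non-speech frames

    speech_gmm = GaussianMixture(8, covariance_type='diag').fit(speech_boot)
    nonspeech_gmm = GaussianMixture(4, covariance_type='diag').fit(nonspeech_boot)

    # Frame-wise likelihood-ratio decision (unsmoothed).
    return speech_gmm.score_samples(features) > nonspeech_gmm.score_samples(features)
```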

42.4 RESEARCH CHALLENGES

In this section we briefly review some of the research topics in this area that remain unsolved as of this writing.

42.4.1. Overlap resolution

A fundamental limitation of most current speaker diarization systems is that only one speaker is assigned to each segment. The presence of overlapped speech, though, is common in multiparty conversations and consequently presents a significant challenge to speaker diarization. Specifically, in regions where more than one speaker is active, missed speech errors may occur. Even when classified as speech, overlapped speech segments should not be assigned to only a single speaker cluster nor included in any individual speaker model. Doing so not only results in a higher speaker error when scoring but also adversely affects the purity of speaker models, which ultimately reduces diarization performance. Approaches to at least detect overlap for ASR were assessed in [22, 7]. However, only a small number of systems detect overlapping speech well enough to improve diarization error rates [4, 26, 3].

Initially, the authors of [19] demonstrated a theoretical improvement in diarization performance by adding a second speaker during overlap regions, using a simple strategy of assigning speaker labels according to the labels of the neighboring segments, as well as by excluding overlap regions from the input to the diarization system. However, this initial study assumed oracle (ideal) overlap detection. In [26], a real overlap detection system was developed, together with a heuristic that uses posterior probabilities from diarization to post-process the output and include a second speaker in overlap regions. The main performance bottleneck is due to errors in overlap detection; work on enhancing its precision and recall is reported in [4, 3]. This approach consists of a three-state HMM-GMM system (non-speech, non-overlapped speech, and overlapped speech), and the best feature combination is MFCCs and modulation spectrogram features [14], although comparable results were achieved with other features such as RMS energy, spectral flatness, and harmonic energy ratio. The reported performance yielded a relative DER improvement of about 10%. With ideal overlap detection, however, the relative DER improvement rises to 37%, indicating that this area has great potential for future improvement.
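To illustrate the post-processing step, the sketch below adds a second speaker wherever an overlap detector has fired, based on the per-frame cluster posteriors from diarization. The array layout and names are assumptions for illustration; this is a simplified variant of the heuristic, not the exact method of [26].

```python
# Minimal overlap post-processing sketch: emit the two most probable clusters
# in frames flagged as overlapped, and only the best cluster elsewhere.
import numpy as np

def assign_speakers(posteriors, overlap):
    """posteriors: (frames, clusters) array; overlap: (frames,) boolean mask."""
    primary = posteriors.argmax(axis=1)
    secondary = np.argsort(posteriors, axis=1)[:, -2]  # second-best cluster
    return [[p, s] if ov else [p]
            for p, s, ov in zip(primary, secondary, overlap)]
```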

42.4.2. Multimodal diarization

As discussed above, speaker localization information can successfully be used in speaker diarization by incorporating TDOA features. With the recent ubiquitous availability of cameras, it is therefore no surprise that researchers have started tackling speaker diarization as a combined audio-visual problem. Initial approaches to audio-visual speaker identification involved identifying lip motion from frontal faces, e.g., [6], [12], [13], [21], [23]. The underlying assumption was that motion from a person comes predominantly from the motion of the lower half of the face. In a real scenario, however, the subject's behavior is not controlled and, consequently, correct detection of the mouth is not always feasible. More recent approaches to multimodal speaker diarization have therefore tried to exploit other forms of body behavior, e.g., head and hand gestures, which are also visible manifestations of speech [17]. The work presented in [10], for example, treats audio-visual diarization as a single, unsupervised joint-optimization problem, using the basic assumption that people who are currently speaking also show greater visual activity than people who are not talking. Audio-visual speaker diarization has the potential to overcome the shortcomings of audio-only approaches, such as overlap resolution.

42.4.3. Further challenges

This chapter provides a basic introduction to the current state-of-the-art in speaker diarization research. Still, many challenges remain. Most importantly, as of 2011, speaker diarization systems are not yet robust enough to be easily ported across different task and data domains. Often parameters of systems are tuned to a particular set of data such as broadcast news or meetings. In a new domain, tuning of parameters often starts from scratch. Even variations inside one domain, e.g., meeting data recorded at different sites, can lead to large variations in performance. Speaker variations caused by emotions or very short interruptions (e.g., below the minimum duration constraint) pose challenges that are yet to be addressed, possibly by audio-visual approaches. The greatest challenge remains the handling of overlapped speech.

42.5 EXERCISES

  42.1 The Bayesian Information Criterion (BIC) is related to the principle of Minimum Description Length (MDL). Explain how.
  42.2 In the segmentation/clustering algorithm presented in this chapter, the clusters are said to be “purified” in each step by merging two clusters according to the BIC. Provide a colloquial explanation of how this “purification” works. Explain possible problems. (Hint: Doing exercise 42.1 first will help.)
  42.3 Show how the Diarization Error Rate (Equation 42.1) can be split up into the sum of false alarm speech, missed speech, and speaker error.
  42.4 Propose a way to measure the fitness of a feature for speaker diarization and explain why pitch alone is not good enough. You may use real data to support your results.
  42.5 Estimate an upper bound for the runtime of the speaker diarization algorithm as presented in this chapter. What is the bottleneck?
  42.6 Propose a divisive segmentation/clustering approach in pseudo code. Briefly discuss the differences between a top-down approach and a bottom-up approach in terms of runtime.
  42.7 Explain typical problems to be expected when performing speaker diarization as presented here in the following data domains: a recorded voice-over-IP phone conference, a board meeting recorded with a microphone array, a conversation recorded with a cell phone in a car, a recorded theater performance, broadcast news, an air-traffic control session, a microphone mounted on a surveillance camera.
  42.8 Perform the following experiment: Ask a co-student/co-worker to find a video on the Internet in a language that you do not speak and where you do not know the participants. It should contain a conversation of several minutes with at least 4 speakers (a foreign talk show might be a good choice). Do not watch the video; only listen to the audio and perform manual online speaker diarization by saying “speaker 1”, “speaker 2”, and so on. Let your co-worker/co-student rate you: How good are you at assigning the right speakers in a normal and in an overlap situation? How much does the situation improve once you look at the video?

BIBLIOGRAPHY

  1. J. Ajmera and C. Wooters, “A Robust Speaker Clustering Algorithm,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 411–416, 2003.
  2. X. Anguera, C. Wooters, B. Peskin, and M. Aguilo, “Robust speaker segmentation for meetings: The ICSI-SRI spring 2005 diarization system,” in Proc. NIST MLMI Meeting Recognition Workshop. Edinburgh: Springer, 2005.
  3. K. Boakye, “Audio Segmentation for Meetings Speech Processing,” Ph.D. dissertation, University of California at Berkeley, 2008.
  4. K. Boakye, B. Trueba-Hornero, O. Vinyals, and G. Friedland, “Overlapped speech detection for improved speaker diarization in multiparty meetings,” Proc. ICASSP, pp. 4353–4356, 2008.
  5. S. S. Chen and P. Gopalakrishnan, “Clustering via the Bayesian information criterion with applications in speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, Seattle, USA, pp. 645–648, 1998.
  6. T. Chen and R. Rao, “Cross-modal Prediction in Audio-visual Communication,” in Proc. ICASSP, vol. 4, 1996, pp. 2056–2059.
  7. O. Çetin and E. Shriberg, “Speaker overlaps and ASR errors in meetings: Effects before, during, and after the overlap,” in Proc. ICASSP, 2006, pp. 357–360, Toulouse, France.
  8. C. Fredouille and G. Senay, “Technical Improvements of the E-HMM Based Speaker Diarization System for Meeting Records,” in Proc. MLMI Third International Workshop, Bethesda, MD, USA, revised selected paper. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 359–370.
  9. C. Fredouille and N. Evans, “The LIA RT'07 speaker diarization system,” in Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8–11, 2007, Revised Selected Papers. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 520–532.
  10. G. Friedland, C. Yeo, and H. Hung, “Visual speaker localization aided by acoustic models,” in MM '09: Proceedings of the Seventeenth ACM International Conference on Multimedia. New York, NY, USA: ACM, 2009, pp. 195–202.
  11. G. Friedland, O. Vinyals, Y. Huang, and C. Muller, “Prosodic and Other Long-term Features for Speaker Diarization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 5, pp. 985–993, July 2009.
  12. J. W. Fisher, T. Darrell, W. T. Freeman, and P. A. Viola, “Learning joint statistical models for audio-visual fusion and segregation,” in Proc. NIPS, 2000, pp. 772–778.
  13. J. W. Fisher and T. Darrell, “Speaker association with signal-level audiovisual fusion.” IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 406–413, 2004.
  14. B. E. D. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Commun., vol. 25, no. 1–3, pp. 117–132, 1998.
  15. C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-24, no. 4, pp. 320–327, 1976.
  16. D. A. van Leeuwen and M. Konečný, “Progress in the AMIDA Speaker Diarization System for Meeting Data,” in Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8–11, 2007, Revised Selected Papers. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 475–483.
  17. D. McNeill, Language and Gesture. Cambridge University Press New York, 2000.
  18. T. L. Nwe, H. Sun, H. Li, and S. Rahardja, “Speaker diarization in meeting audio,” in Proc. ICASSP, Taipei, Taiwan, 2009.
  19. S. Otterson and M. Ostendorf, “Efficient use of overlap information in speaker diarization,” in Proc. ASRU, Kyoto, Japan, 2007, pp. 686–6.
  20. J. Ramirez, J. M. Górriz, and J. C. Segura, “Voice activity detection: Fundamentals and speech recognition system robustness,” in Robust Speech Recognition and Understanding, M. Grimm and K. Kroschel, Eds., Vienna, Austria, June 2007, p. 460.
  21. R. Rao and T. Chen, “Exploiting audio-visual correlation in coding of talking head sequences,” International Picture Coding Symposium, March 1996.
  22. E. Shriberg, A. Stolcke, and D. Baron, “Observations on overlap: Findings and implications for automatic processing of multi-party conversations,” in Proc. Eurospeech, Aalborg, Denmark, 2001, pp. 1359–1362.
  23. M. Siracusa and J. Fisher, “Dynamic dependency tests for audio-visual speaker association,” in Proc. ICASSP, April 2007.
  24. H. Sun, T. L. Nwe, B. Ma, and H. Li, “Speaker diarization for meeting room audio,” in Proc. Interspeech'09, September 2009.
  25. S. Tranter and D. Reynolds, “An overview of automatic speaker diarization systems,” IEEE TASLP, vol. 14, no. 5, pp. 1557–1565, 2006.
  26. B. Trueba-Hornero, “Handling overlapped speech in speaker diarization,” Master's thesis, Universitat Politecnica de Catalunya, May 2008.
  27. O. Vinyals and G. Friedland, “Modulation Spectrogram Features for Speaker Diarization,” in Proceedings of Interspeech 2008, Brisbane, Australia, September 2008, pp. 630–633.
  28. C. Wooters, J. Fung, B. Peskin, and X. Anguera, “Towards robust speaker segmentation: The ICSI-SRI fall 2004 diarization system,” in Fall 2004 Rich Transcription Workshop (RT04), Palisades, NY, November 2004.
  29. C. Wooters and M. Huijbregts, “The ICSI RT07s Speaker Diarization System,” in Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8–11, 2007, Revised Selected Papers. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 509–519.
  30. X. Zhu, C. Barras, L. Lamel, and J.-L. Gauvain, “Multi-stage Speaker Diarization for Conference and Lecture Meetings,” in Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8–11, 2007, Revised Selected Papers. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 533–542.

1This chapter was written by Gerald Friedland.
