Sound is a remarkably linear medium, which is to say that in the presence of two distinct sound-producing sources such as two people speaking, the pressure waveform at our ears is essentially the sum of the individual pressure waveforms we would experience from each of the speakers individually. This makes hearing useful because it means that acoustic information is not easily obscured – unlike the visual domain in which a nearer object can block the view of a more distant one. By the same token, however, it means that every acoustic “scene” we experience is the sum of all the acoustic sources within audible range, which can become a complicated mess of energy.

Most of the recognition problems we have considered so far have made the assumption that the source of interest – speech, musical instrument, or something else – dominates the received sound. Speech against a noisy background has received a fair amount of attention in the speech recognition community, but most often it is handled with simple feature-domain compensation (such as the approaches discussed in Chapter 22) that try to make the features resemble the noise-free case, and/or by use of noisy training examples so that the variations due to the background noise are absorbed by the same statistical models used to accommodate other variations (speaker, style, etc.). Such approaches have clear limits, however: consider the scenario we started with, illustrated in Figure 39.1, of two people speaking at the same time. Viewed as a collection of low-level features, there is no intrinsic way to distinguish between the properties of the “target” and “interfering” speaker – they are both voices. Thus, a more sophisticated approach is required.

An ideal solution would be to take the sound consisting of the mixture of different sources, and somehow separate out the individual source signals. This is known as the source separation problem, and is the subject of this chapter. This intuitively natural idea, however, hides a number of subtle problems: How can we define an acoustic source? Many sounds can themselves be further broken down into components with distinct originating mechanisms, such as the pitched voicing and the sibilant hiss of speech, so a unique definition of a source is elusive. At what level of detail do we need to separate the sources, for instance do we need the waveform, or just higher-level features? Do we need to completely remove interference and perfectly recover the original signal, or is enhancement sufficient? What is the ideal source signal anyway, given that even in the absence of interference a single acoustic source will result in different sounds at different locations thanks to reverberation and other channel effects?

In the following sections we look at a number of approaches to this problem. When multiple observations are available, for instance if we have multiple microphones, several approaches can enhance one source relative to others with minimal assumptions by using acoustic beamforming or other blind source separation techniques. Examination of the behavior of human listeners when confronted with real or artificial mixture signals – known as Auditory Scene Analysis [5] – can inspire biomimetic processing that earns the name Computational Auditory Scene Analysis (CASA). Finally, more detailed prior knowledge (or assumptions) about the nature of the target signals can accomplish model-based separation even in the case when only a single recording channel is available.


FIGURE 39.1 An example source separation problem. Top row: Spectrograms of individual male and female voices. Bottom pane: Spectrogram of the mixture of the two voices (as might occur at a cocktail party). It is difficult to discern the details of each component in the mixture.


Before looking at different techniques, it is worth spending a few moments considering what we wish to achieve, and how we can measure our success. One obvious application of successful acoustic signal separation would be for hearing instruments, i.e., to pick out and amplify a single voice from a competing background in situations where a human listener has difficulty understanding the voice unaided – the scenario dubbed the Cocktail Party Problem by Cherry [7]. We could gauge performance by some measure of the distortion between the isolated source signal and the signal recovered from the mixture (e.g., for artificially-mixed test cases, where the true source signals are thus available).

Signal-to-noise ratio (SNR), i.e., the energy of the original target divided by the energy of the difference between target and processed output, is clearly a sufficient condition, in the sense that achieving high SNR is sufficient to guarantee high separation quality with output signals that sound very much like the original. However, it may not be a necessary condition, since in many cases it can be too restrictive. For instance, modifying a signal with a simple linear filter, or a small delay, can result in large differences as measured by SNR, yet the signal is barely altered from the listener's point of view. A set of measures that preserve the equivalence of such simple modifications has been proposed by Vincent et al. [35], which divides the recovered signal into components that can be obtained by fixed filtering of the target source, filtering of any known competing sources (interference), and energy that cannot be produced by fixed filtering (distortion). This leads to additional measures such as signal-to-interference ratio (SIR) and signal-to-distortion ratio (SDR), which give more useful measures of the performance of complex source separation systems.
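The sensitivity of SNR to perceptually minor changes is easy to check numerically. The following is a minimal sketch, assuming a 16 kHz sample rate and a 1 kHz tone as a stand-in for the target (both are illustrative choices, as is the helper name `snr_db`). It compares a version with low-level additive noise against a version that is merely delayed by half a millisecond.

```python
import numpy as np

def snr_db(target, estimate):
    """Energy of the target over the energy of (target - estimate), in dB."""
    error = target - estimate
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(error ** 2))

sr = 16000                                # assumed sample rate (Hz)
t = np.arange(sr) / sr                    # one second of sample times
target = np.sin(2 * np.pi * 1000 * t)     # 1 kHz tone as a stand-in source

# Case 1: low-level additive noise, which stays very close to the target.
rng = np.random.default_rng(0)
noisy = target + 0.01 * rng.standard_normal(sr)
snr_noisy = snr_db(target, noisy)

# Case 2: a 0.5 ms delay (8 samples). The tone is periodic over the buffer,
# so a circular shift acts as an exact delay in this toy case.
delayed = np.roll(target, 8)
snr_delayed = snr_db(target, delayed)

print(f"additive noise: {snr_noisy:.1f} dB, pure delay: {snr_delayed:.1f} dB")
```

The delayed tone is an unaltered copy of the target, yet its SNR is negative because the half-period shift inverts the waveform; this is the kind of case that motivates the filter-invariant measures described above.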

If the goal of separation is to support ASR for noise-corrupted signals, we might prefer to measure performance with the standard metrics of speech recognition – e.g., word error rate – and see how the inclusion of source separation techniques can reduce errors compared to running the unmodified mixture into the recognizer. Source separation for other kinds of automatic signal analysis (such as the melody transcription discussed in Chapter 37) can similarly be evaluated by the corresponding metrics developed for scenarios without separation.

Finally, if the goal of source separation is to improve a sound for presentation to a human listener, the ultimate evaluation must be by tests with real listeners. These could measure intelligibility, i.e., whether the processing improves the ability of a listener to correctly identify the words in an utterance. Very often, intelligibility has a rather abrupt variation with SNR, with a transition from near-perfect recognition to near-guessing over just a few dB change in interference level (although this depends greatly on the conditions, including factors such as the predictability of the speech material). A different facet of listeners' judgments is sound quality, as determined by mean opinion score tests, in which a panel of listeners rate the quality of a processed sound on a scale of 1 (awful) to 5 (excellent). These two attributes do not necessarily correlate: high intelligibility is still possible with low quality, and improvements in quality do not necessarily result in improvements in intelligibility [22].

Since tests with listeners are arduous and expensive to conduct, there has been a significant effort to produce automatic systems to predict the results, at least within certain domains. Examples include Perceptual Evaluation of Speech Quality (PESQ) and Perceptual Evaluation of Audio Quality (PEAQ) [32], complex systems incorporating auditory models that are tuned to match subjective judgments of quality as closely as possible over a range of material. Similarly, Ma et al. [26] review a number of models including the Articulation Index and Speech Transmission Index that have been proposed to predict intelligibility based on more or less detailed descriptions of the signal-to-noise conditions.


FIGURE 39.2 Two simultaneous speakers s1 and s2 being recorded by two microphones x1 and x2. Each microphone records a mixture of both voices, but the mixtures are slightly different due to the different coupling channels aij between mic i and speaker j.


Consider the situation illustrated in Figure 39.2, where two people speaking simultaneously are being recorded by two microphones. (The signal shown in Figure 39.1 could be from one of these microphones.) Although both microphones will, in general, record a mixture of both voices, the precise combination between the voices in each mixture will be different. We can express this situation in matrix form,

\[
\begin{bmatrix} x_1(t) \\ x_2(t) \end{bmatrix} =
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} s_1(t) \\ s_2(t) \end{bmatrix}
\tag{39.1}
\]
\[
\mathbf{x} = \mathbf{A}\mathbf{s}
\tag{39.2}
\]

where xi(t) is the mixture signal recorded by microphone i, sj(t) is the source signal from speaker j, aij represents the coupling between mic i and source j, and the bold symbols indicate matrices. aij could consist of simple direction-dependent gains, or, in general, frequency-dependent gain and time (or phase) modification – in which case Eq. 39.1 should more properly be written in the Fourier transform domain.

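The matrix formulation of Eq. 39.2 immediately suggests an approach to solving the problem via an unmixing matrix W, with estimates of the original sources ŝ given by ŝ = Wx. Making W = A⁻¹ will exactly undo the mixing observed by the microphones, so that ŝ = s. The main difficulty with this, however, is that the mixing matrix A is usually not known, and thus finding (or approximating) its inverse is a challenge. It is instructive, however, to follow through this simple example: Applying both mixing and unmixing matrices, we get:

\[
\hat{\mathbf{s}} = \mathbf{W}\mathbf{x} = \mathbf{W}\mathbf{A}\mathbf{s}
\]
\[
\begin{bmatrix} \hat{s}_1 \\ \hat{s}_2 \end{bmatrix} =
\begin{bmatrix}
w_{11}a_{11} + w_{12}a_{21} & w_{11}a_{12} + w_{12}a_{22} \\
w_{21}a_{11} + w_{22}a_{21} & w_{21}a_{12} + w_{22}a_{22}
\end{bmatrix}
\begin{bmatrix} s_1 \\ s_2 \end{bmatrix}
\tag{39.6}
\]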

Separation is achieved when the “cross terms” in Eq. 39.6 disappear, i.e.,

\[
w_{11}a_{12} + w_{12}a_{22} = 0, \qquad w_{21}a_{11} + w_{22}a_{21} = 0
\tag{39.7}
\]

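This of course occurs when

\[
\mathbf{W} = \mathbf{A}^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}}
\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}
\]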

The purpose of running through this familiar inverse of a 2 × 2 matrix is to highlight exactly how the sources are separated: to remove the contribution of s2 in the reconstruction ŝ1, the weights used to combine x1 and x2 (namely w11 and w12) are set in the precise ratio that will balance the different proportions of s2 in each microphone signal, so that on summing the two weighted microphone components together the contributions of s2 will cancel out leaving only s1.
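The cancellation can be verified with a few lines of NumPy. This is a minimal sketch under illustrative assumptions: the mixing gains in `A` are hypothetical, and the sources are random stand-ins rather than speech.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))           # two stand-in sources, one per row

A = np.array([[1.0, 0.6],                 # hypothetical coupling gains a_ij
              [0.4, 1.0]])
x = A @ s                                 # the two microphone signals

# Closed-form inverse of the 2 x 2 mixing matrix.
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
W = np.array([[ A[1, 1], -A[0, 1]],
              [-A[1, 0],  A[0, 0]]]) / det
s_hat = W @ x

# Contribution of s2 to the first output: w11*a12 + w12*a22.
cross_term = W[0, 0] * A[0, 1] + W[0, 1] * A[1, 1]
print("s2 leakage into first output:", cross_term)
print("max reconstruction error:", np.max(np.abs(s_hat - s)))
```

The first row of `W` weights the two microphones so that their s2 components are equal and opposite, which is why `cross_term` vanishes.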

This cancellation (or “nulling-out”) provides the basis for rejection of point-source interference in multichannel source separation techniques. It is important because, in ideal circumstances, it can provide perfect removal of unwanted sources. However, notice that it is very sensitive to the exact balance between the contributions of the interfering source in each microphone: small errors in the corresponding weights will result in a failure to completely cancel and rapid growth of the residual error. Moreover, if the proportions of both sources are similar in the two microphones, then cancelling the interfering source may nearly cancel the target source too, so any other components (such as independent noise from the microphone preamplifiers) will become increasingly dominant. This is in fact equivalent to an ill-conditioned mixing matrix, indicating likely numerical problems in calculating its inverse.
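To illustrate the conditioning issue, the sketch below compares two hypothetical mixing matrices: one in which each source clearly dominates a different microphone, and one in which both sources appear in nearly the same proportions at both microphones. It assumes independent, equal-power noise at each microphone, so the noise power at an output is proportional to the sum of squared unmixing weights in the corresponding row; the function name and matrix values are illustrative.

```python
import numpy as np

def residual_noise_gain(A):
    """Power gain applied to independent unit-variance sensor noise
    in the first output when unmixing with W = inv(A)."""
    W = np.linalg.inv(A)
    return np.sum(W[0] ** 2)

well_separated = np.array([[1.0, 0.2],
                           [0.2, 1.0]])
nearly_equal = np.array([[1.0, 0.95],
                         [0.95, 1.0]])

for name, A in [("well separated", well_separated),
                ("nearly equal", nearly_equal)]:
    print(f"{name}: condition number = {np.linalg.cond(A):.1f}, "
          f"noise power gain = {residual_noise_gain(A):.1f}")

gain_well = residual_noise_gain(well_separated)
gain_ill = residual_noise_gain(nearly_equal)
```

In the second case the unmixing weights must be large to recover the target after cancelling the interferer, so preamplifier noise is boosted by roughly two orders of magnitude relative to the first case.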


As a result of the finite speed of sound, sound emanating from a single point source will arrive at different points in space at different times. A typical speed of sound (which varies slightly with temperature, pressure, humidity etc.) is around 340 m/s, or around 1 foot per millisecond. Thus, two microphones a foot apart will record the waveform of a source located along the extension of their common axis with around 1 ms relative delay, whereas sound from a source lying on their equidistant plane will reach both microphones at the same time. (This analysis extends easily to any number of microphones). By delaying the various microphone signals so as to align sound arriving from a particular direction then summing the results, sound from that direction sums up coherently (in-phase), whereas sounds from any other directions will generally not be in phase and will thus be attenuated; the precise attenuation depends on the phase difference corresponding to the delay, and thus varies with frequency. This simple approach, known as delay-and-sum, has a wide range of applications, and is in fact the principle behind the elongated, highly directional “shotgun mics” used on film sets and by birdwatchers – although in that case, the delaying and summing is done in the acoustic domain (within a waveguide formed by the mic body), rather than electronically.
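A minimal delay-and-sum sketch follows. It assumes integer-sample delays, white-noise stand-ins for both target and interferer, and circular shifts in place of true delays; the delay values are hypothetical. The target arrives with a different delay at each of four microphones, while the interferer arrives at all of them simultaneously.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 16000, 4
target = rng.standard_normal(n)
interf = rng.standard_normal(n)

target_delays = np.array([0, 2, 4, 6])    # samples; hypothetical geometry
interf_delays = np.array([0, 0, 0, 0])    # interferer on the equidistant plane

# Each microphone records the sum of the two delayed sources.
mics = np.stack([np.roll(target, dt) + np.roll(interf, di)
                 for dt, di in zip(target_delays, interf_delays)])

# Steer towards the target: undo its delay on each channel, then average.
aligned = np.stack([np.roll(mics[m], -target_delays[m]) for m in range(M)])
output = aligned.mean(axis=0)

residual = output - target                # what is left of the interferer
print("interference power in:", np.var(interf))
print("interference power out:", np.var(residual))
```

The target copies add coherently and are recovered exactly, while the misaligned interferer copies add incoherently. For this white-noise interferer the residual power is close to 1/M of the input; for narrowband interference the attenuation would depend on frequency, as described above.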


FIGURE 39.3 Directional gain responses for a five-element linear mic array operating in broadside (zero relative delay) and endfire (maximum inter-mic delay) modes. Each plot shows the response for 250 Hz, 1 kHz, and 4 kHz. Note how the spatial response becomes sharper for higher frequencies.

Instead of precisely canceling sound from a single, interfering direction, a delay-and-sum system has a gain that varies with direction but is maximized in the target direction; thus, it is more appropriate for reducing noise that comes from many directions at once, such as isotropic noise or late reverberation. Figure 39.3 shows the directional response of a simulated five-element linear array of omnidirectional mics with uniform inter-mic spacing of 10 cm. Gain is shown as a function of angle on a polar plot; thus the ‘lobes’ point in the direction of greatest sensitivity. The upper plot shows the response when the channels are summed with zero relative delay, which favors wavefronts parallel to the mic axis (broadside). The plot to the right shows the response when each mic signal is delayed by 0.1/340 sec so as to align the response to wavefronts normal to the mic axis (endfire). Note that the mic array, and thus its directional response, is symmetric for any rotation around the inter-mic axis. We see multiple lobes and nulls arising from the different phase cancellations between the different mic signals. The precise pattern depends on frequency, becoming increasingly detailed at high frequencies when the relative phase of the signals received by the mics changes most rapidly with angle. Large sidelobes can occur due to spatial aliasing when the inter-mic spacing is greater than the sound wavelength, leading to incident angles where, at that frequency, the phases observed by each microphone are identical to those expected from the target direction.
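The directional responses described here can be approximated with a short simulation. The sketch below assumes ideal plane waves, omnidirectional mics, and a speed of sound of 340 m/s; the function `array_gain` and the half-power width measure are illustrative choices, not the procedure used to produce the figure.

```python
import numpy as np

C = 340.0          # assumed speed of sound (m/s)
M = 5              # number of microphones
SPACING = 0.10     # inter-mic spacing (m)

def array_gain(freq_hz, angles_rad, steer_delay_s=0.0):
    """Magnitude response of a uniform linear delay-and-sum array.

    angles_rad is measured from the array axis (0 = endfire, pi/2 = broadside);
    steer_delay_s is the extra delay applied per microphone step.
    """
    m = np.arange(M)
    # A plane wave from angle theta reaches mic m earlier than mic 0
    # by m * SPACING * cos(theta) / C seconds.
    arrival = -np.outer(np.cos(angles_rad), m) * SPACING / C
    total_delay = arrival + m * steer_delay_s
    return np.abs(np.exp(-2j * np.pi * freq_hz * total_delay).mean(axis=1))

angles = np.linspace(0.0, np.pi, 181)
freqs = (250.0, 1000.0, 4000.0)
broadside = {f: array_gain(f, angles) for f in freqs}
endfire = {f: array_gain(f, angles, SPACING / C) for f in freqs}

def half_power_fraction(gain):
    """Fraction of angles at which the gain is within 3 dB of unity."""
    return np.mean(gain > np.sqrt(0.5))

for f in freqs:
    print(f"{f:6.0f} Hz: broadside fraction within 3 dB = "
          f"{half_power_fraction(broadside[f]):.2f}")
```

At 250 Hz the broadside array is nearly omnidirectional, while at 4 kHz the main lobe is narrow. At 4 kHz the wavelength (8.5 cm) is also shorter than the mic spacing, so additional full-gain lobes appear away from the look direction.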

Because the spatial response of the delay-and-sum beamformer varies with frequency, interfering sounds are not uniformly attenuated but are subject to considerable spectral coloration, which can be disturbing – particularly if the interfering source moves. This can be moderated by using more complex weighting schemes that vary with frequency, so the spatial response is more uniform across the spectrum – which normally means broadening the spatial response at high frequencies. This generalization is known as filter-and-sum, since each mic channel is now subject to a complex, frequency-dependent modification of timing (phase) and gain before being summed together. The range of possible filter-and-sum systems is evidently very large, and design schemes can be devised to optimize different objective criteria. We will show how a fairly simple multi-channel signal model can be used to derive many of the most widely used classical beamforming algorithms.

39.4.1. A multi-channel signal model

Given an array of M microphones, it is assumed that each microphone receives a delayed and attenuated version of the original source signal plus some additive noise. This noise is assumed to have zero mean and can be attributed to the environment and/or the electrical circuitry of the sound capture hardware. The amount of attenuation and delay is dependent on the relative positions of the sound source and the microphones. Thus, at the mth microphone, the received signal can be written as

\[
x_m(t) = a_m\, s(t - \tau_m) + n_m(t)
\]

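where s(t) is the source signal, nm(t) is the additive noise, and am and τm are the gain and delay associated with the mth microphone. In the frequency domain, this can be represented as

\[
X_m(\omega) = d_m(\omega)\, S(\omega) + N_m(\omega)
\]

where dm(ω) = am e^{−jωτm}. We can then represent the set of signals received by all microphones as an M-dimensional vector

\[
\mathbf{x}(\omega) = [X_1(\omega), \ldots, X_M(\omega)]^T
= S(\omega)\,[d_1(\omega), \ldots, d_M(\omega)]^T + [N_1(\omega), \ldots, N_M(\omega)]^T
\tag{39.12}
\]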

In beamforming, the vector of M observed signals is linearly combined to generate a single channel output signal.

\[
Y(\omega) = \sum_{m=1}^{M} W_m^{*}(\omega)\, X_m(\omega)
\tag{39.13}
\]

Because we typically process all frequency bands independently, the variable representing frequency will be dropped from subsequent notation. Furthermore, we can simplify the equations by representing the multi-channel processing using matrix-vector notation. Thus, the signal model in Eq. 39.12 can be compactly represented as

\[
\mathbf{x} = \mathbf{d}\, S + \mathbf{n}
\tag{39.14}
\]

and the beamforming operation in Eq. 39.13 represented as a simple inner product

\[
y = \mathbf{w}^H \mathbf{x}
\tag{39.15}
\]

The manner in which the beamformer parameters w are chosen is the subject of much research. In the following sections, we describe several classical approaches. These can be divided into two basic categories: time-invariant beamformers and adaptive beamformers.

39.4.2. Time-invariant Beamformers

A time-invariant beamformer is one whose weights are pre-computed and held fixed during deployment. The weights are independent of the observed target and/or interference signals, and depend only on the assumed location of the source and/or interference. In time-invariant beamforming, because we do not know anything about the nature of the desired signal other than its direction of arrival, a reasonable goal of processing would be to minimize the power of the output signal y subject to the so-called distortionless constraint, i.e., there should be no distortion in gain or phase of any signal that arrives from the direction of arrival of the target signal, called the look direction of the array. The power of the array output signal can be expressed as

\[
E\left[|y|^2\right] = E\left[\mathbf{w}^H \mathbf{x}\mathbf{x}^H \mathbf{w}\right]
= \mathbf{w}^H \boldsymbol{\Phi}_{xx} \mathbf{w}
\tag{39.16}
\]

where Φxx is the M × M power spectral density (PSD) matrix of the observed array signals. It is clear from Eq. 39.14 that Φxx is composed of the PSD of the source signal and the PSD of the noise. Because we cannot assume knowledge of the characteristics of the source signal, we can only minimize the noise power Φnn. Substituting Φnn for Φxx into Eq. 39.16, we can write a suitable beamforming objective function

\[
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}}\; \mathbf{w}^H \boldsymbol{\Phi}_{nn} \mathbf{w}
\quad \text{subject to} \quad \mathbf{w}^H \mathbf{d} = 1
\tag{39.17}
\]

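where w^H d = 1 represents the distortionless constraint. This constrained optimization problem can be solved using the method of Lagrange multipliers to obtain the following solution

\[
\mathbf{w} = \frac{\boldsymbol{\Phi}_{nn}^{-1}\mathbf{d}}{\mathbf{d}^H \boldsymbol{\Phi}_{nn}^{-1}\mathbf{d}}
\tag{39.18}
\]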

This solution is called the Minimum Variance Distortionless Response (MVDR) beamformer. In practice, we do not know the PSD of the noise Φnn ahead of time. As a result, the noise PSD is replaced by a model of the noise, called the coherence matrix Γnn. The coherence matrix represents the normalized cross correlation between the noise at the different pairs of microphones. The most commonly used coherence matrix models noise that has equal power at all microphones but is uncorrelated across all pairs of microphones, i.e., Γnn = αI. Substituting I for Φnn in Eq. 39.18 leads to the aforementioned delay-and-sum beamformer:

\[
\mathbf{w} = \frac{\mathbf{d}}{\mathbf{d}^H \mathbf{d}}
\]

Using instead a coherence matrix that models spherically or cylindrically isotropic noise [14] results in the superdirective beamformer [12]. Such coherence models are good where ambient stationary noise, e.g., air conditioning, is present.
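The MVDR weights and the two coherence models can be compared in a few lines. The sketch below makes several assumptions that are not taken from the text: a five-element linear array with 10 cm spacing, a single frequency of 500 Hz, an endfire look direction, the sinc-shaped coherence of spherically isotropic noise, and a small diagonal loading term (`1e-3`) added for numerical stability. The helper names are illustrative.

```python
import numpy as np

C, SPACING, M = 340.0, 0.10, 5            # assumed array parameters
FREQ = 500.0                              # analysis frequency (Hz)

def steering_vector(freq_hz, angle_rad):
    """Unit-gain steering vector d for a plane wave from angle_rad
    (measured from the array axis)."""
    m = np.arange(M)
    tau = -m * SPACING * np.cos(angle_rad) / C
    return np.exp(-2j * np.pi * freq_hz * tau)

def mvdr_weights(noise_matrix, d):
    """w = inv(Phi) d / (d^H inv(Phi) d)."""
    num = np.linalg.solve(noise_matrix, d)
    return num / np.vdot(d, num)          # np.vdot conjugates its first argument

d = steering_vector(FREQ, 0.0)            # endfire look direction

# Uncorrelated-noise model: identity coherence gives delay-and-sum.
w_ds = mvdr_weights(np.eye(M), d)

# Spherically isotropic model: coherence sin(kr)/(kr) between mic pairs,
# with a small diagonal loading term (an added assumption).
dist = SPACING * np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
gamma = np.sinc(2.0 * FREQ * dist / C) + 1e-3 * np.eye(M)
w_sd = mvdr_weights(gamma, d)

def noise_power(w):
    """Output noise power w^H Gamma w under the isotropic model."""
    return np.real(np.vdot(w, gamma @ w))

print("delay-and-sum noise power :", noise_power(w_ds))
print("superdirective noise power:", noise_power(w_sd))
```

Both weight vectors satisfy the distortionless constraint; the second one trades larger weight magnitudes for lower output power under the isotropic noise model.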

In addition to approaches to beamformer design that use signal and noise models to obtain the parameters, another class of beamformer design algorithms use a technique called pattern synthesis. In these methods, the shape of the resulting beampattern is specified during the design and beamformer parameters are chosen to best match the desired beam shape. Examples of such methods include modeling the beampattern using Dolph-Chebyshev polynomials [33] or cosine functions [31].

It should be noted that the name “time-invariant beamforming” does not imply that such beamformers cannot be used to track moving targets. For moving sound sources or sources with unknown position, a set of time-invariant beamformers, chosen to cover the expected sound field, is first created offline and stored. During deployment, a sound source localization algorithm, e.g., [24, 3, 39], is used to estimate the location (or direction of arrival) of the desired sound source. This information is then used to select the most appropriate beamformer from the collection.


FIGURE 39.4 The Generalized Sidelobe Canceller.

39.4.3. Adaptive beamformers

If the environment contains discrete noise sources with unknown location, e.g., a radio or interfering talker, it may be advantageous to use an adaptive beamformer. As the name implies, adaptive beamformers update their parameters in an online manner as input samples are received. The Frost beamformer [15] is arguably the best-known adaptive beamforming algorithm. In this algorithm, the output power of the array is minimized for the current noise conditions, while maintaining the same distortionless constraint described earlier. The Frost beamformer is in essence an online implementation of the time-invariant beamformer of Eq. 39.18 in which the power spectral density of the observed signals is used directly. However, to make the beamformer adaptive, the PSD of the received signals Φxx = E[xx^H] is replaced by an instantaneous estimate based solely on the current frame, x_t x_t^H. The beamformer weights are incrementally updated at each frame by the LMS algorithm, to minimize the instantaneous output power without distorting the signal from the look direction.
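A constrained LMS update of this kind can be sketched for a single frequency bin. The simulation below is a toy set-up under stated assumptions: a four-element array already steered so that the look-direction vector `d` is all ones, one interferer with a hypothetical steering vector `v`, complex Gaussian stand-ins for all signals, and a hand-picked step size `mu`. The projection-plus-offset form of the update is the standard way to keep the constraint satisfied at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_frames, mu = 4, 20000, 5e-4

def cgauss(size, scale=1.0):
    """Circular complex Gaussian samples with standard deviation `scale`."""
    return scale * (rng.standard_normal(size)
                    + 1j * rng.standard_normal(size)) / np.sqrt(2)

d = np.ones(M, dtype=complex)                 # look direction (pre-steered)
v = np.exp(-1j * 1.848 * np.arange(M))        # hypothetical interferer direction

s = cgauss(n_frames, 0.5)                     # target
i_sig = cgauss(n_frames, 1.0)                 # interferer
noise = cgauss((n_frames, M), 0.1)            # sensor noise
X = np.outer(s, d) + np.outer(i_sig, v) + noise   # one row per frame

f = d / np.vdot(d, d)                         # weights satisfying w^H d = 1
P = np.eye(M) - np.outer(d, d.conj()) / np.vdot(d, d)   # projector orthogonal to d

w = f.copy()
for x in X:
    y = np.vdot(w, x)                         # y = w^H x
    # Gradient step on |y|^2, projected back onto the constraint plane.
    w = P @ (w - mu * x * np.conj(y)) + f

leak_fixed = abs(np.vdot(f, v))               # delay-and-sum response to interferer
leak_frost = abs(np.vdot(w, v))               # adapted response to interferer
print("w^H d =", np.vdot(w, d))
print("interferer gain: fixed", leak_fixed, "adapted", leak_frost)
```

Because the correction is projected onto the subspace orthogonal to `d` and the fixed vector `f` is added back at every step, numerical errors in the constraint do not accumulate.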

The Generalized Sidelobe Canceller (GSC) was proposed as an alternative architecture to the Frost beamformer [18]. The GSC consists of two structures, a fixed beamformer that produces a non-adaptive output, and an adaptive structure for sidelobe cancellation. The adaptive structure of the GSC is preceded by a blocking matrix that blocks signals coming from the desired look direction. The weights of the adaptive structure are then adjusted to cancel any signal common to both structures, assumed to be noise. The architecture of the GSC is shown in Figure 39.4. The advantage of the GSC is that it turns the constrained optimization problem solved by Frost into a simpler unconstrained optimization problem.
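The two-branch structure can be sketched in the same single-bin setting. The assumptions here are illustrative: a pre-steered four-element array (look-direction vector of all ones), a blocking matrix built from differences of adjacent channels, a hypothetical interferer steering vector `v`, and a plain LMS update with a hand-picked step size `mu`.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_frames, mu = 4, 20000, 5e-4

def cgauss(size, scale=1.0):
    """Circular complex Gaussian samples with standard deviation `scale`."""
    return scale * (rng.standard_normal(size)
                    + 1j * rng.standard_normal(size)) / np.sqrt(2)

d = np.ones(M, dtype=complex)                 # look direction (pre-steered)
v = np.exp(-1j * 1.848 * np.arange(M))        # hypothetical interferer direction
X = (np.outer(cgauss(n_frames, 0.5), d)       # target
     + np.outer(cgauss(n_frames, 1.0), v)     # interferer
     + cgauss((n_frames, M), 0.1))            # sensor noise

w_q = d / M                                   # fixed (delay-and-sum) branch

# Blocking matrix: adjacent-channel differences, so that B^H d = 0.
B = np.zeros((M, M - 1), dtype=complex)
for m in range(M - 1):
    B[m, m], B[m + 1, m] = 1.0, -1.0

w_a = np.zeros(M - 1, dtype=complex)          # adaptive branch weights
for x in X:
    y_fixed = np.vdot(w_q, x)                 # fixed beamformer output
    z = B.conj().T @ x                        # target-free reference signals
    e = y_fixed - np.vdot(w_a, z)             # GSC output
    w_a = w_a + mu * z * np.conj(e)           # unconstrained LMS step

w_eq = w_q - B @ w_a                          # equivalent overall weight vector
print("w^H d =", np.vdot(w_eq, d))
print("interferer gain: fixed", abs(np.vdot(w_q, v)),
      "GSC", abs(np.vdot(w_eq, v)))
```

Since the blocking matrix removes anything arriving from the look direction, the adaptive weights can be updated without any constraint, yet the equivalent weight vector `w_eq` still has unit response to `d`.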

While adaptive beamforming algorithms can be quite effective at preserving the source signal and cancelling point sources of interference, they are similar to time-invariant beamformers in that they are highly dependent on accurate sound source localization. If the estimate of the talker's position (direction of arrival) is incorrect, even by a few degrees, some cancellation of the desired signal will occur. This is easy to understand by examining the structure of the GSC. If the sound source localization is not accurate, the target signal will pass through the blocking matrix and then be cancelled by the adaptive filters. As a result, improvements to the GSC have been proposed that make the algorithm more robust to localization errors [20].

39.4.4. Alternative Objective Criteria

The classical time-invariant and adaptive beamformer algorithms described in the previous sections worked on the same basic objective criterion, namely minimizing the output power of the array. More recently, researchers have proposed alternative objective criteria to derive beamformer parameters in either a time-invariant or adaptive manner. For example, in Gillespie et al. [17], a beamformer was proposed in which parameters were chosen to maximize the kurtosis of the LPC residual. This beamformer was specifically targeted at reducing the amount of reverberation in the output signal. In Kumatani et al. [25], a GSC that was adapted according to a negentropy criterion was proposed. Both kurtosis and negentropy have also been proposed as objective criteria for independent component analysis [23], to be discussed in the next section. Finally, in Seltzer et al. [30] a beamformer was proposed for speech recognition applications in which the parameters were adapted to maximize the likelihood of the output signals as measured by the speech recognizer.

The information presented in this section represents only a brief introduction to the beamforming approaches that are possible with an array of microphones. Additional information about microphone arrays can be found in Benesty et al. [2] and Brandstein & Ward [4], and a thorough treatment of array processing in general can be found in Van Trees [33].


The microphone array processing techniques of the previous section made few assumptions about the nature of the signals being processed, but they did assume knowledge of the microphone array geometry, and the direction of the target source. A different scenario could be an environment with a number of point sources, all of which are potentially targets, measured by a number of sensors whose relative location, and even individual characteristics, are unknown. If the number of sources is less than or equal to the number of sensors, then a mixing expression similar to Eq. 39.1 could hold, and a solution that approximated the inverse of the mixing matrix might exist. But in the absence of structural constraints on the mixing matrix, how can we approximate its inverse?

One possibility is to look at the outputs of the candidate unmixing process, ŝ. When we have the “right” unmixing process, each channel will contain contributions from only a single source; for all other settings, one or more of the reconstructed channels will consist of mixtures of multiple sources. By making the simple but critical assumption that the different sources are emitting unrelated sound waveforms, we can search for the unmixing parameters that maximize the statistical independence of the separated outputs. A pair of outputs that both contain contributions from the same source will exhibit a degree of statistical dependency, thus independence is maximized when each source appears in only a single output channel – i.e., complete separation. The family of techniques based on this separation principle is known as Independent Component Analysis (ICA). It was initially proposed by Comon [8] and widely popularized by Bell & Sejnowski [1].

One route to evaluating statistical independence is to look at the higher-order moments of each output signal individually. Many audio signals have heavy-tailed distributions (leptokurtic), which is to say that very large amplitude values are more likely than would be expected for a Gaussian-distributed signal of the same variance. As a corollary, the signal also spends more time close to zero than a comparable Gaussian distribution. This tendency can be measured by the normalized fourth moment, also known as kurtosis, commonly defined as kurt(x) = E[(x − μ)^4]/σ^4 − 3, where x is the random variable, and μ and σ are its mean and standard deviation respectively. Note that this definition assigns a kurtosis of zero to a Gaussian distribution, negative values for platykurtic distributions, and positive values for leptokurtic distributions. The central limit theorem dictates that the sum of independent random variables with non-Gaussian distributions will have a distribution that tends towards Gaussianity as the number of components increases, so an output channel that comprises a sum of two independent, leptokurtic signals will have a smaller kurtosis than one source alone. Given a parameter space that allows us to vary the relative proportions of the two sources, we expect a peak in the kurtosis for settings that result in a pure, single source in the output.
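These properties are easy to confirm on synthetic data. The sketch below uses Laplacian samples as a stand-in for a heavy-tailed audio signal and uniform samples as a platykurtic example; both choices, and the helper name, are illustrative.

```python
import numpy as np

def excess_kurtosis(x):
    """E[(x - mu)^4] / sigma^4 - 3, which is zero for a Gaussian."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
n = 200_000
gauss = rng.standard_normal(n)
lap1 = rng.laplace(size=n)                # heavy-tailed (leptokurtic)
lap2 = rng.laplace(size=n)
uniform = rng.uniform(-1.0, 1.0, size=n)  # light-tailed (platykurtic)

k_gauss = excess_kurtosis(gauss)
k_lap = excess_kurtosis(lap1)
k_sum = excess_kurtosis(lap1 + lap2)      # equal mixture of two sources
k_uniform = excess_kurtosis(uniform)

print(f"Gaussian {k_gauss:.2f}, Laplacian {k_lap:.2f}, "
      f"sum of two Laplacians {k_sum:.2f}, uniform {k_uniform:.2f}")
```

The sum of the two independent Laplacian signals has roughly half the kurtosis of either one alone, in line with the tendency towards Gaussianity described above.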

This is the case illustrated in Figure 39.5. The left pane shows the joint distribution of a simulated two-channel recording of two sources, i.e., the situation from Figure 39.2 described by Eq. 39.1. The underlying signals are leptokurtic, leading to visible “rays” in the joint distribution that result from instants when one source has a large amplitude but the other source is close to zero. The right pane shows the kurtosis evaluated for weighted combinations of the two mixture signals. In general, this weighted combination is parameterized by two coefficients (i.e., one row of the unmixing matrix W), but we can normalize the magnitude of this two-element vector without changing the net balance of the two original sources. Thus, all possible mixtures can be parameterized by a single coefficient, which corresponds to the angle of the unit vector in the joint distribution plane onto which the mixture is being projected. The right pane shows the kurtosis of the projection as a function of this angle; we see two local maxima at the values marked θ1 and θ2. These values correspond to projections onto the corresponding vectors in the left pane, which can be seen to be orthogonal to the “rays” of the joint distribution. Thus, these are the projections that collapse one of the two input sources onto zero, leaving only the other source in the reconstruction – i.e., the rows of unmixing matrix W that would achieve perfect separation.
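A simulation along these lines can be sketched as follows. The mixing matrix `A` and the Laplacian sources are hypothetical stand-ins, not the data behind the figure. The code sweeps the projection angle in half-degree steps, locates the local maxima of the kurtosis, and compares them with the directions of the rows of the true inverse.

```python
import numpy as np

def excess_kurtosis(y):
    centered = y - y.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 50000))          # two heavy-tailed sources
A = np.array([[1.0, 0.6],                 # hypothetical mixing matrix
              [0.4, 1.0]])
x = A @ s                                 # two mixture signals

# Kurtosis of the projection onto the unit vector at each angle.
angles_deg = np.arange(0.0, 180.0, 0.5)
kurt = np.array([excess_kurtosis(np.cos(t) * x[0] + np.sin(t) * x[1])
                 for t in np.radians(angles_deg)])

# Local maxima on the circular grid (the curve repeats every 180 degrees).
is_peak = (kurt > np.roll(kurt, 1)) & (kurt > np.roll(kurt, -1))
peak_angles = angles_deg[is_peak]

# Directions of the rows of the true unmixing matrix, folded into [0, 180).
W = np.linalg.inv(A)
expected = np.sort(np.degrees(np.arctan2(W[:, 1], W[:, 0])) % 180.0)

print("kurtosis peaks at (deg):", peak_angles)
print("rows of inv(A) at (deg):", expected)
```

The sign and scale of each row are not recovered, only its direction, which is why the angles are compared modulo 180 degrees.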

Note that if the original source signals had been Gaussian distributed, this approach would not have worked. The sum of Gaussian distributed random variables is itself Gaussian distributed, and thus the mixture scatter would have been elliptical with no features to indicate the separate sources, and the kurtosis would have been zero for all projections. Fortunately, as mentioned above, in almost all cases real-world signals have non-Gaussian amplitude distributions and so the ICA approach can be applied.
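The Gaussian case can be checked with the same kind of sweep. In this sketch the mixing matrix is again a hypothetical example and the sources are independent Gaussian samples.

```python
import numpy as np

def excess_kurtosis(y):
    centered = y - y.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 200_000))     # Gaussian sources
A = np.array([[1.0, 0.6],                 # hypothetical mixing matrix
              [0.4, 1.0]])
x = A @ s

# Kurtosis of the projection at one-degree steps.
kurt = np.array([excess_kurtosis(np.cos(t) * x[0] + np.sin(t) * x[1])
                 for t in np.radians(np.arange(0.0, 180.0, 1.0))])

print("largest |kurtosis| over all angles:", np.max(np.abs(kurt)))
```

The kurtosis stays within sampling error of zero at every angle, so there are no peaks to indicate the unmixing directions.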


FIGURE 39.5 Kurtosis as a function of projection “angle” for unmixing a mixture of two heavy-tailed signals. Left pane shows the joint distribution of the two mixture signals, with clear dominant directions for the two underlying distributions. Right pane shows the kurtosis of a linear combination of the two sensor signals (i.e., one row of a candidate unmixing matrix) as a function of the equivalent angle θ they are being projected onto in the left pane. Kurtosis reaches local maximal values for projections that completely eliminate one or other of the underlying sources.

Despite a very wide range of formulations and approaches, ICA techniques all rely on maximizing an independence measure of this kind for the reconstructed source outputs, implying that each output consists of a single, distinct source. Optimization is generally done via gradient ascent, i.e., unmixing parameters are progressively updated in the direction that increases the measure of independence until a local maximum is reached. While the example above was for a simple, instantaneous mixture with no frequency dependence, the more complex case in which different sources experience different filterings prior to mixing can in principle be solved by the same techniques. One approach is to break each signal into a large number of frequency bands via the DFT, then solve a scalar complex-valued ICA problem separately in each frequency band. As long as the individual source components found for each frequency can be correctly collected together into complete sources (the “permutation problem”), this can successfully accomplish blind source separation.
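For the instantaneous two-channel case, a gradient-based search of this kind can be sketched in a few lines. This is one simple member of the ICA family and not a specific published algorithm. The assumptions are: Laplacian stand-in sources, a hypothetical mixing matrix, a whitening step applied first (so that the remaining unmixing is a pure rotation), kurtosis as the independence measure, and a hand-picked step size of 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000
s = rng.laplace(size=(2, n))              # heavy-tailed stand-in sources
A = np.array([[1.0, 0.6],                 # hypothetical, unknown to the algorithm
              [0.4, 1.0]])
x = A @ s

# Whiten the mixtures (zero mean, identity covariance).
x0 = x - x.mean(axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(x0 @ x0.T / n)
V = evecs @ np.diag(evals ** -0.5) @ evecs.T
z = V @ x0

# Gradient ascent on the kurtosis of y = w^T z, keeping w on the unit circle.
w = np.array([1.0, 0.0])
for _ in range(200):
    y = w @ z
    grad = 4.0 * (z * y ** 3).mean(axis=1) - 12.0 * w   # d/dw of E[y^4] - 3|w|^4
    w = w + 0.1 * grad
    w /= np.linalg.norm(w)

# In two dimensions the second unmixing direction is the perpendicular one.
W_white = np.array([w, [-w[1], w[0]]])
s_hat = W_white @ z

# Correlation of each estimate with each true source.
corr = np.abs(np.corrcoef(np.vstack([s_hat, s]))[:2, 2:])
print(np.round(corr, 3))
```

Each estimated channel correlates almost perfectly with exactly one true source; as usual with ICA, the ordering, sign, and scale of the recovered sources are arbitrary.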

Clear and thorough reviews of ICA are given by Hyvarinen et al. [23] and Pedersen et al. [27].


The beamforming and ICA approaches presented above rely on the idea that an inverse can be found to a matrix-based mixing in the style of Eq. 39.1. In fact, despite their very different formulations, it is worth noting that, in the case of two point sources, both techniques will converge towards equivalent ideal solutions. If however the number of sources is larger than the number of microphones, then the mixing matrix is nonsquare, the unmixing process is underdetermined, and no unmixing matrix can be found. That is to say, fixed linear combinations of N microphone signals can give us at most N linearly-independent output signals; mixtures of more than N sources cannot be inverted this way. Another way of stating this is that each null of the form of Eq. 39.7 consumes one degree of freedom, so N channels allows us to place at most N – 1 independent spatial nulls, to completely remove N – 1 spatially-compact interfering sources. If the interference arises from more than this many directions, or if its structure is not spatially compact, fixed linear filtering and cancellation will not be able to remove it completely.
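The degrees-of-freedom argument can be illustrated for a single frequency. The sketch below assumes a three-element linear array with 10 cm spacing at 1 kHz and hypothetical arrival angles. It solves for weights with unit gain towards the target and exact nulls towards two interferers, then evaluates the response towards a third interferer that the weights were not designed for.

```python
import numpy as np

C, SPACING, FREQ, M = 340.0, 0.10, 1000.0, 3   # assumed array parameters

def steering_vector(angle_deg):
    """Plane-wave steering vector for an angle measured from the array axis."""
    m = np.arange(M)
    tau = -m * SPACING * np.cos(np.radians(angle_deg)) / C
    return np.exp(-2j * np.pi * FREQ * tau)

d = steering_vector(90.0)      # target (broadside)
v1 = steering_vector(0.0)      # first interferer
v2 = steering_vector(135.0)    # second interferer
v3 = steering_vector(60.0)     # third interferer, not accounted for

# Three weights, three linear conditions: w^H d = 1, w^H v1 = 0, w^H v2 = 0.
D = np.column_stack([d, v1, v2])
w = np.linalg.solve(D.conj().T, np.array([1.0, 0.0, 0.0], dtype=complex))

gains = {name: abs(np.vdot(w, vec))
         for name, vec in [("target", d), ("v1", v1), ("v2", v2), ("v3", v3)]}
print(gains)
```

With three channels, the unit-gain condition and two nulls use up all the available weights, so the response towards the third interferer is whatever those conditions leave behind and is not small.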

In the limiting case only a single channel is available, leaving no opportunity for cancellation through summing. However, when presented with such monaural sources – for instance, a recorded sound mixture played back through a single speaker – human listeners are often able to “hear out” individual sources, be they voices, musical instruments, or other sound events. This process of organizing complex acoustic signals was dubbed “Auditory Scene Analysis” (ASA) by Bregman [5] by analogy with the process by which visual percepts are organized into objects and other structures.

Extensive perceptual experiments by Bregman and many others (reviewed by Darwin [13]) have led to an account of how listeners achieve ASA: the received sound mixture is broken up into distinct components delimited in time and frequency (for instance, individual harmonics, or bands of noisy energy, presumably broken into separate frequency regions by the spectral analysis performed by the cochlea). Each of these components has a range of attributes such as onset and offset times, frequency range, modulation characteristics etc. Then, perceived sound sources are constructed by grouping together these individual components on the basis of their shared characteristics. This is in keeping with the Gestalt school of psychology from the early twentieth century, which accounted for perceptual organization in terms of abstract principles such as well-formedness and common-fate: energy in different frequency bands that starts or stops at the same time is showing evidence of common fate, and thus it will tend to be perceived as arising from a single source even if the energy is widely spaced and discontiguous.

An alternative interpretation in ecological terms (i.e., relating to the properties of the environment) is that each independent sound-producing source will have some kind of time-varying state that will be reflected in co-ordinated ways in the sound energy it emits, regardless of frequency band. It is intrinsically unlikely that energy in two different frequency ranges will appear or disappear at the same moment simply by coincidence. If this happens repeatedly, the simplest and most likely explanation is that the energy all arises from a single source, and thus is most appropriately considered to be a single source's sound.

This description, consisting of an initial analysis into small energy fragments local in time and frequency, then grouping these fragments into larger-scale sources on the basis of “grouping cues” such as common onset time or harmonicity (consistency with a single fundamental frequency), has provided inspiration for a family of loosely-related source separation algorithms gathered under the title of Computational Auditory Scene Analysis (CASA). A schematic of a typical system described by Brown & Cooke [6] is shown in Figure 39.6, and follows the ASA account from psychology quite closely: The input sound mixture is first analyzed into different frequency bands by a simulation of the cochlear filterbank, then a variety of cues are calculated on this time-frequency representation to extract points of energy onset, spectral regions consistent with the same fundamental frequency, regions exhibiting similar rates of frequency modulation, etc. The entire time-frequency plane is carved up into small regions such that these properties are consistent for each region, then a set of grouping rules is applied to form sources, favoring the integration of energy that has synchronized onset, can be regarded as harmonics of a single fundamental frequency, etc. This results in a ‘labeling’ of the time-frequency plane that indicates which source is considered dominant at each point.


FIGURE 39.6 Schematic of a Computational Auditory Scene Analysis (CASA) system [6]. Sound mixtures are broken up into time-frequency fragments, each associated with a range of attributes such as onset time or compatible fundamental frequencies. Fragments are then grouped into perceived sources on the basis of these “cues”, leading to a segmentation of the time-frequency plane according to different sources.

This labeling leads naturally to an approach to separating the sources: Starting from an invertible decomposition into time and frequency such as the short-time Fourier transform, simply zero-out all energy outside the regions deemed belonging to a particular source, then invert the transformation. This time-frequency “masking” has proven to be a remarkably effective method for separating sources even when only a single channel recording is available. In contrast to the stationary (or at best slowly-varying) processing of beamforming or ICA, masking can be viewed as a time-varying filter that follows and extracts the rapidly time-varying spectral structure of the target source.
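The masking recipe can be sketched in a few lines. The example below is an illustration under stated assumptions rather than a CASA system: the mask is an "oracle" computed from the known individual sources (a real system would have to estimate it from grouping cues), the short-time Fourier transform is a bare-bones square-root-Hann, 50%-overlap implementation written out so that the example is self-contained, and the two sources are synthetic sinusoids.

```python
import numpy as np

N_FFT = 512
HOP = N_FFT // 2
# Periodic square-root Hann window: applied at analysis and synthesis, the
# overlapping squared windows sum to one, so the transform is invertible.
WINDOW = np.sqrt(np.hanning(N_FFT + 1)[:-1])

def stft(x):
    """Short-time Fourier transform, frames x frequency bins."""
    padded = np.concatenate([np.zeros(HOP), x, np.zeros(N_FFT)])
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)

def istft(X, length):
    """Inverse transform by windowed overlap-add."""
    frames = np.fft.irfft(X, n=N_FFT, axis=1) * WINDOW
    out = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
    return out[HOP:HOP + length]

def separate_with_mask(mixture, mask):
    """Zero out every time-frequency cell outside the mask, then invert."""
    return istft(stft(mixture) * mask, len(mixture))

# Synthetic single-channel mixture of two sources.
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
interferer = np.sin(2 * np.pi * 2000 * t)
mixture = target + interferer

# Oracle binary mask (assumption): cells where the target is the larger source.
mask = np.abs(stft(target)) > np.abs(stft(interferer))
estimate = separate_with_mask(mixture, mask)
snr_db = 10 * np.log10(np.sum(target ** 2) / np.sum((target - estimate) ** 2))
```

Because the mask is recomputed for every frame, the same code acts as a time-varying filter when the sources change over time, in contrast to the fixed combinations used in beamforming.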

The weakness of time-frequency masking is that energy cannot be separated below the level of the individual time-frequency cells used in the initial analysis, since any mixed energy within those cells is not modified beyond a simple scaling. ICA and similar techniques are not affected by such signal “collisions” in time and frequency, since their weighted combinations can exactly cancel interference based on its spatial properties regardless of its spectral characteristics. However, masking has no intrinsic dimensionality limits on the number of channels needed to separate a certain number of sources. Moreover, it can be startlingly successful, perhaps because of the sparsity of many natural sounds when distributed on a time-frequency plane, which is to say that source signal energy is often highly concentrated in a subset of the cells (e.g., the frequencies of the harmonics) with little or no energy in the remaining cells. This makes it relatively rare that two signals will have similar energy in a single time-frequency cell, the situation that masking cannot satisfactorily handle. Figure 39.7 shows an example of separating one voice from a mixture of two on the basis of local harmonicity cues, and reveals the “holes” left in the reconstruction in regions where target voice energy was not easily identified, leading to muffled or distorted reconstructions. Statistical models, however, can be used to infer the most likely values for these cells based on marginalizing joint distributions of present and missing values [10].


FIGURE 39.7 Example of CASA signal separation via time-frequency masking. Left pane is a spectrogram of a two-voice mixture. Middle pane shows the mask indicating cells dominated by the target voice on the basis of detected harmonicity cues by Hu & Wang [21]. Right pane shows reconstructed target voice. Although the extracted energy is successfully dominated by one voice, many regions contain no energy, corresponding to deleted cells.

A review-style comparison of human source separation with efforts to reproduce this capacity by computer is presented by Cooke & Ellis [9], and a set of articles providing a comprehensive view of CASA has been collected in Wang & Brown [36].


The missing energy in Figure 39.7 highlights a problem with the CASA approach: it is based more or less entirely on local signal features, yet sources often show structure at much larger scales, and this structure is potentially useful for separation. To develop this idea, let us pose signal separation as a probabilistic inference problem of identifying the set of individual source reconstructions $\{\hat{s}_i\}$ that have the greatest posterior probability given the observed signals $x$, i.e.,

$$\{\hat{s}_i\} = \arg\max_{\{s_i\}} \Pr(\{s_i\} \mid x) = \arg\max_{\{s_i\}} \Pr(x \mid \{s_i\}) \, \Pr(\{s_i\})$$

Using Bayes rule we turn this into the product of two pieces: The first part is the likelihood of the observations given the source signals, $\Pr(x \mid \{s_i\})$, which is typically a simple forward problem, e.g., adding the signals together for a single-channel, filtering-free scenario. The second part is the prior likelihood of that particular set of source signals; if we assume the sources are behaving independently, this is just the product of prior likelihoods over each source $s_i$ individually. These source likelihoods, $\Pr(s_i)$, amount to models of source behavior, and are the vehicle through which any kind of constraints about the structure of source signals – both low-level, local structure, and more high-level, top-down behavior – may be brought to bear on the signal separation problem.

These models may take a very wide range of forms, but some of the richest and most useful models of acoustic signals are found in speech recognition. The acoustic and language models in a speech recognizer comprise a highly complex description of the possible, expected behaviors of a voice signal. As we saw in Chapter 25, the process of speech recognition itself is nothing more than an inference problem of the form of Eq. 39.22, except with the word sequence, instead of the signal, as the desired output, and, most often, only a single source signal being considered. The framework, however, does not change profoundly if multiple sources (characterized, for instance, by mutually-independent hidden state sequences) are recognized at once. This idea was first proposed as HMM decomposition by Varga & Moore [34], and has been extensively studied under the title of Factorial HMMs (FHMMs) [16].

A successful recent example is the so-called “super-human” multi-talker speech recognition system developed by Hershey et al. [19]. This system was developed for a formal Speech Separation Challenge in which the task was to correctly transcribe particular keywords in simultaneous mixtures of two grammatically-constrained utterances. The appellation “super-human” reflected the result that this system actually exceeded the performance of human listeners in certain conditions such as when both speakers had similar voices and were presented at a similar level. The approach was to treat this as a factorial HMM problem, and recover the pair of hidden state sequences corresponding to the two utterances. This idea is illustrated in Figure 39.8, which shows how the best (Viterbi) path for such a two-chain FHMM can be visualized as a 3-dimensional trajectory through a series of planes formed as the outer product of the set of states of the two models.

Although it is computationally challenging, in principle such a factorial model is easily reduced to a familiar single-chain HMM: if the two models have state inventories $Q_1$ (with $N_1$ distinct states) and $Q_2$ ($N_2$ states), then the factorial model will have a state space $Q_1 \times Q_2$ ($N_1 \times N_2$ states), consisting of every possible combination of a state from model 1 and a state from model 2. Each entry of the transition matrix (which now has $(N_1 \times N_2)^2$ entries) is simply the product of the probabilities of the transitions involved in each separate chain, which are assumed to occur independently.
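Under the independence assumption, the composed transition matrix is exactly the Kronecker product of the two single-chain matrices. The sketch below illustrates this with small made-up transition matrices; the convention of indexing the combined state (i, j) as i·N2 + j is a choice made for this example.

```python
import numpy as np

def compose_transitions(A1, A2):
    """Transition matrix over the product state space.

    Combined state (i, j) is stored at index i * N2 + j, so the entry for
    (i, j) -> (k, l) is A1[i, k] * A2[j, l].
    """
    return np.kron(A1, A2)

# Illustrative single-chain transition matrices (rows sum to one).
A1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
A2 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.3, 0.3, 0.4]])

A = compose_transitions(A1, A2)
N2 = A2.shape[0]
# Probability of (i=0, j=2) -> (k=1, l=1) in the combined chain.
p = A[0 * N2 + 2, 1 * N2 + 1]
```

Here the combined model has 2 × 3 = 6 states and a 6 × 6 transition matrix, and each row is still a valid probability distribution.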

The per-state acoustic observation models, $\Pr(x \mid q_i^1, q_j^2)$, are also much more numerous, but again are relatively simply related to the two original source models, being the expected observations given the combination of the two model states. If the original single-source models provide observation distributions in terms of means and variances of log-spectra, then these models can be systematically combined in the original signal domain to create the appropriate combined observation distribution, even taking into account the unknown relative phase of the two components (which will increase magnitude uncertainty). However, as noted above, in a sufficiently fine time-frequency representation, the sparsity of the speech signal will ensure that most cells encounter a large imbalance between predicted source magnitudes, with one model predicting a signal much larger than the other and thus dominating the combination in that dimension. This inspires the “max-approximation” [34, 29], in which the combined observation model is built up as the larger magnitude of the two component states in each dimension.
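The quality of the max-approximation is easy to quantify for the state means. The sketch below compares it with a power-sum combination that ignores the phase-dependent cross term (a simplifying assumption of this example, as are the made-up log-spectral values in dB): the two differ by at most 10 log10 2 ≈ 3 dB, reached only when the two states predict equal energy, and by very little once one state dominates.

```python
import numpy as np

def combine_power_sum_db(a_db, b_db):
    """Combined level if the two sources' powers simply add (cross term ignored)."""
    return 10.0 * np.log10(10.0 ** (a_db / 10.0) + 10.0 ** (b_db / 10.0))

def combine_max_db(a_db, b_db):
    """Max-approximation: keep the larger state mean in each dimension."""
    return np.maximum(a_db, b_db)

# Hypothetical per-state mean log-spectra (dB) over four frequency channels.
state_1_db = np.array([60.0, 20.0, 45.0, 30.0])
state_2_db = np.array([25.0, 55.0, 45.0, 10.0])

gap_db = (combine_power_sum_db(state_1_db, state_2_db)
          - combine_max_db(state_1_db, state_2_db))
```

In the third channel the two states predict the same level and the gap is the full 3 dB; in the other channels, where the levels differ by 20 dB or more, the gap is a small fraction of a decibel.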


FIGURE 39.8 Illustration of a factorial HMM. The observed mixture is modeled as the combination of two, independent hidden Markov models; the best state sequence is thus a trajectory in a 3-dimensional volume with axes model 1 state, model 2 state, and time. (Figure drawn by Ron Weiss.)

Given this new, larger, composed set of states, comprising observation distributions and the full transition matrix, factorial decoding can now proceed exactly as for any other HMM. However, because the state space is exponential in the number of models, various computational tricks and approximations are generally employed.
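To make the procedure concrete, the following sketch performs exact factorial decoding on a toy problem by building the full product state space and running ordinary Viterbi over it. It is not the decoder of any cited system: it assumes the max-approximation for the combined state means, unit-variance Gaussian observation models, and synthetic four-channel "spectra"; all names and values are illustrative.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Best state path of a single-chain HMM; log_B has shape (frames, states)."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A            # scores[from_state, to_state]
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def factorial_decode(obs, mu1, mu2, A1, A2, pi1, pi2, var=1.0):
    """Decode two chains jointly by composing them into one HMM."""
    n2 = len(mu2)
    # Combined means via the max-approximation; state (i, j) -> index i * n2 + j.
    means = np.maximum(mu1[:, None, :], mu2[None, :, :]).reshape(len(mu1) * n2, -1)
    # Gaussian log-likelihoods up to an additive constant.
    log_B = -0.5 * ((obs[:, None, :] - means[None]) ** 2).sum(axis=-1) / var
    path = viterbi(np.log(np.kron(pi1, pi2)), np.log(np.kron(A1, A2)), log_B)
    return path // n2, path % n2

# Toy models: each source has two states with distinct spectral peaks.
mu1 = np.array([[10.0, 0.0, 0.0, 0.0], [0.0, 10.0, 0.0, 0.0]])
mu2 = np.array([[0.0, 0.0, 8.0, 0.0], [0.0, 0.0, 0.0, 8.0]])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
pi = np.array([0.5, 0.5])

rng = np.random.default_rng(0)
q1_true = rng.integers(0, 2, size=20)
q2_true = rng.integers(0, 2, size=20)
obs = np.maximum(mu1[q1_true], mu2[q2_true]) + 0.1 * rng.standard_normal((20, 4))

q1_hat, q2_hat = factorial_decode(obs, mu1, mu2, A, A, pi, pi)
```

The inner loop touches every pair of combined states, so the cost per frame grows as the square of N1 × N2; this is the expense that the approximations mentioned above are designed to avoid.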

In the “super-human” system, a couple of additional considerations were included. Firstly, since the relative level of the two voices was not constant but varied over a 15 dB range, it was necessary to estimate this relative level for each test mixture, for instance by trying a variety of composed models to see which gave the best match to the observations. In single-voice recognition, overall variations in level are easily removed in signal preprocessing, but the combined state observations obviously depend on the relative level of the two sources that can no longer be normalized away. Secondly, the system attempted to identify exactly which pair of speakers was present in the mixture, then used individual acoustic models specifically trained on those speakers. The particular database used in the evaluation was constructed to consist of only 34 different speakers, all of whom appeared both in the (single speaker) training data and in the (mixed speaker) test data. This allowed the “super-human” system to build 34 individual per-speaker models, and to identify from among the closed set the pair of speaker models best matching the mixed observation, which could then be used in the factorial recognition. Although this seems to be exploiting an artificial aspect of the evaluation, a similar system based on parametric “eigenvoice” speaker models, and hence able to adapt to a much wider range of speakers, has been proposed by Weiss & Ellis [38]. A detailed comparison of a wide range of approaches, all evaluated on this same two-speaker task, is presented in Cooke et al. [11].

If the goal of processing the speech mixtures is simply to identify the words, then recovering the most likely state sequences can give all the desired information. If, however, a reconstructed audio signal is required, some additional processing is needed. (In fact, the “super-human” system used the outputs of its factorial HMM processing to reconstruct audio, which was then fed to a second, conventional speech recognizer to make the final word transcription; this permitted the use of a much more complex final recognizer, trained on a wider range of speech data.) Typically, the feature distributions used in an HMM do not contain sufficient detail to permit waveform reconstruction, but the approximate spectra of each inferred source, for instance as represented by the means of the per-state observation distributions, can be compared between the different state sequences to create a mask similar to the one in Figure 39.7. This can then be used in the same way to separate out the energy from time-frequency cells believed to be dominated by the target. Moreover, since the state sequence includes estimates for the whole spectrum, the “holes” left by masking can be filled in to provide energy at the estimated level – for instance, by scaling the original energy in the mixture. Even when this energy is predominantly non-target, it can improve the quality of the reconstruction to have something rather than nothing in these holes.
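One possible reading of this reconstruction step is sketched below. It assumes that the per-state means are log-magnitude spectra in dB (20 log10 of amplitude), that the decoded state paths are already available, and that holes are filled by attenuating the mixture cell to the target's estimated level while keeping the mixture phase. The function name and the tiny two-frame example are hypothetical.

```python
import numpy as np

def mask_and_fill(mixture_stft, target_db, interferer_db):
    """Build a mask from model-predicted spectra and fill the holes it leaves.

    target_db / interferer_db: per-frame predicted levels (dB), e.g. the means
    of the states on each decoded path, same shape as mixture_stft.
    """
    mask = target_db > interferer_db                 # cells the target dominates
    masked = np.where(mask, mixture_stft, 0.0)       # plain binary masking
    # In the holes, scale the mixture down to the target's estimated level
    # (never up), keeping the mixture phase.
    target_amp = 10.0 ** (target_db / 20.0)
    gain = np.minimum(1.0, target_amp / np.maximum(np.abs(mixture_stft), 1e-12))
    filled = np.where(mask, mixture_stft, gain * mixture_stft)
    return mask, masked, filled

# Hypothetical state means (dB) and decoded paths for a two-frame, two-bin case.
mu1_db = np.array([[0.0, -40.0], [-40.0, 0.0]])      # target model, 2 states
mu2_db = np.array([[-20.0, -10.0]])                  # interferer model, 1 state
q1 = np.array([0, 1])
q2 = np.array([0, 0])

mixture_stft = np.array([[1.0 + 0.0j, 0.3j],
                         [0.1 + 0.0j, 1.0 + 0.0j]])
mask, masked, filled = mask_and_fill(mixture_stft, mu1_db[q1], mu2_db[q2])
```

The `masked` output has exact zeros wherever the interferer is predicted to dominate, whereas `filled` carries a low level of energy there, at the amplitude the target model predicts.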

Although we have focused in this section on using speech models to provide the constraints to permit inference of the individual source signals, the same principles can be applied to any sound source. The individual state-based acoustic distributions in a hidden Markov model can encapsulate a very wide range of structures and regularities, including attributes like harmonicity and common onset that were used explicitly in CASA systems. However, the computational efficiency of such a generic approach may suffer in comparison to a more special-purpose representation.


In this chapter we have seen a range of approaches to the ubiquitous problem of making sense of sounds that consist of the combined effects of multiple sound-producing sources. This problem admits a very wide range of approaches, from potentially exact, source-independent, directional filtering in situations where sources have compact spatial characteristics and multiple microphones are available, through to probabilistic inference exploiting detailed prior knowledge of source behavior for mixtures of structured sources in monaural recordings. Because of the multiple principles at play, there are many opportunities for combining approaches, for example to use ICA-style techniques to cancel signals between multiple microphones, then time-frequency masking to further reduce interference in the best resulting single-channel signal [28], or combinations of human-inspired binaural source formation with speech source models derived from speech recognizers [37]. Such combinations are attractive because the approaches are complementary: spatial cancellation works regardless of the spectral characteristics of the interference but is limited by the number of microphones, whereas masking and model-based inference have no such dimensionality limit but depend on the time-frequency structure of the sources.


  39.1 The broad idea of signal-to-noise ratio (SNR) is to measure how much signal energy there is compared to the amount of noise energy (usually expressed in logarithmic units, i.e., dB). But the actual value depends on the definitions of “signal” and “noise”.
    (a) One definition, applicable to a signal separation system with a clearly identifiable signal input $s(t)$ that has been corrupted by the addition of interference, is to stipulate that the ideal output should exactly match the input; any difference between them counts as noise. Thus if the separation system output is $\hat{s}(t)$, the noise in that output is, by definition, $n(t) = s(t) - \hat{s}(t)$. Under this definition, what is the SNR of a system whose output is identically zero, $\hat{s}(t) = 0$? What about if the system output is a white noise sequence with the same energy as the input, but no discernible evidence of the input?
    (b) Another definition, particularly applicable to systems that enhance or separate by applying a time-varying filter to the signal as illustrated in Figure 39.7, is to consider the input of the system as consisting of two components, signal and noise, then to calculate SNR as the ratio of the energy of the two components at the output, i.e., the energy of the output of the filtering process if the input was signal alone compared to the output energy if the input was the noise alone. Under this measure, what is the SNR if the output is identically zero? What strategy delivers the best possible SNR?
  39.2 The Generalized Sidelobe Canceller of Fig. 39.4 uses a blocking matrix to convert the M input signals into M – 1 channels that contain only energy from non-target directions. Anything in these signals can then be removed from the system output, via conventional adaptive filtering, to improve separation. For a two-input system with a straight-ahead (broadside) target look direction, what would be a suitable blocking matrix?
  39.3 Two signals $s_1(t)$ and $s_2(t)$ both have the same kurtosis of 4. A combination is formed as $x(t) = \alpha s_1(t) + (1 - \alpha) s_2(t)$. Sketch the kurtosis of $x$ over the range $\alpha = 0 \ldots 1$. What additional factors influence the precise shape of this curve?
  39.4 A 2-chain factorial HMM, as illustrated in Figure 39.8, can be constructed as a conventional, single-chain HMM whose state space consists of every possible combination of states from each of the two component source models. If both source HMMs have N states, how big is the transition matrix for this combined model? How can these values be obtained?


  1. Bell, A. J. and Sejnowski, T. J., “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, 7: 1129–1159, 1995.
  2. Benesty, J., Chen, J., and Huang, Y., Microphone array signal processing, Springer Topics in Signal Processing, Vol. 1. Springer, 2008.
  3. Brandstein, M., Adcock, J., and Silverman, H., “A practical time-delay estimator for localizing speech sources with a microphone array,” Computer Speech and Language, 9: 153–169, April 1995.
  4. Brandstein, M. and Ward, D., editors, Microphone Arrays - Signal Processing Techniques and Applications, Springer-Verlag, New York, 2001.
  5. Bregman, A. S., Auditory Scene Analysis, Bradford Books, MIT Press, 1990.
  6. Brown, G. and Cooke, M., “Computational auditory scene analysis,” Computer Speech and Language, 8: 297–336, 1994.
  7. Cherry, E. C., “Some experiments on the recognition of speech with one and two ears,” J. Acoust. Soc. Am., 25: 975–979, 1953.
  8. Comon, P., “Independent component analysis, a new concept?,” Signal processing, 36: 287–314, 1994.
  9. Cooke, M. and Ellis, D., “The auditory organization of speech and other sources in listeners and computational models,” Speech Communication, 35: 141–177, 2001.
  10. Cooke, M., Green, P., Josifovski, L., and Vizinho, A., “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Communication, 34: 267–285, 2001.
  11. Cooke, M., Hershey, J. R., and Rennie, S. J., “Monaural speech separation and recognition challenge,” Comput. Speech Lang., 24: 1–15, 2010.
  12. Cox, H., Zeskind, R. M., and Owen, M. M., “Robust adaptive beamforming,” IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35: 1365–1376, October 1987.
  13. Darwin, C. J., “Listening to speech in the presence of other sounds,” Philosophical Transactions of the Royal Society B: Biological Sciences, 363: 1011–1021, 2008.
  14. Elko, G. W., “Spatial coherence functions for differential microphones in isotropic noise fields,” in Brandstein, M. and Ward, D., editors, Microphone Arrays, pages 61–85. Springer Verlag, 2001.
  15. Frost, O. L., “An algorithm for linearly constrained adaptive array processing,” Proceedings of the IEEE, 60: 926–935, August 1972.
  16. Ghahramani, Z. and Jordan, M., “Factorial hidden Markov models,” Machine Learning, 29: 245–273, 1997.
  17. Gillespie, B., Malvar, H., and Florencio, D., “Speech dereverberation via maximum kurtosis subband adaptive filtering,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 6, pages 3701–3704, Salt Lake City, UT, May 2001.
  18. Griffiths, L. J. and Jim, C. W., “An alternative approach to linearly constrained adaptive beamforming,” IEEE Transactions on Antennas and Propagation, AP-30: 27–34, January 1982.
  19. Hershey, J. R., Rennie, S. J., Olsen, P. A., and Kristjansson, T. T., “Super-human multi-talker speech recognition: A graphical modeling approach,” Computer Speech and Language, 24: 45–66, 2010.
  20. Hoshuyama, O., Sugiyama, A., and Hirano, A., “A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters,” IEEE Transactions on Signal Processing, 47: 2677–2684, October 1999.
  21. Hu, G. and Wang, D. L., “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Transactions on Neural Networks, 15: 1135–1150, 2004.
  22. Hu, Y. and Loizou, P. C., “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, 16: 229–238, 2008.
  23. Hyvärinen, A., Karhunen, J., and Oja, E., Independent Component Analysis, John Wiley & Sons, 2001.
  24. Knapp, C. H. and Carter, C., “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-24: 320–327, August 1976.
  25. Kumatani, K., McDonough, J., Klakow, D., Garner, P., and Li, W., “Adaptive beamforming with a maximum negentropy criterion,” in Proc. Hands-Free Speech Communication and Microphone Arrays, pages 180–183, Trento, Italy, May 2008.
  26. Ma, J., Hu, Y., and Loizou, P. C., “Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions,” J. Acoust. Soc. Am., 125: 3387–3405, 2009.
  27. Pedersen, M., Larsen, J., Kjems, U., and Parra, L., “A survey of convolutive blind source separation methods,” in Benesty, J., Sondhi, M. M., and Huang, Y., editors, Springer Handbook of Speech Processing, chapter 52, pages 1065–1094. Springer, New York, 2008.
  28. Pedersen, M., Wang, D., Larsen, J., and Kjems, U., “Two-microphone separation of speech mixtures,” IEEE Transactions on Neural Networks, 19: 475–492, 2008.
  29. Roweis, S., “Factorial models and refiltering for speech separation and denoising,” in Proc. Eurospeech, pages 1009–1012, Geneva, 2003.
  30. Seltzer, M. L., Raj, B., and Stern, R. M., “Likelihood maximizing beamforming for robust hands-free speech recognition,” IEEE Transactions on Speech and Audio Processing, 12: 489–498, September 2004.
  31. Tashev, I. and Malvar, H. S., “A new beamformer design algorithm for microphone arrays,” in Proc. ICASSP, Philadelphia, Pennsylvania, March 2005.
  32. Thiede, T., Treurniet, W., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J., Colomes, C., Keyhl, M., Stoll, G., Brandenburg, K., and Feiten, B., “PEAQ - the ITU standard for objective measurement of perceived audio quality,” J. Audio Eng. Soc., 48, Jan/Feb 2000.
  33. Van Trees, H. L., Optimum Array Processing, Part IV of Detection, Estimation and Modulation Theory. John Wiley & Sons, New York, 2002.
  34. Varga, A. and Moore, R., “Hidden Markov model decomposition of speech and noise,” in Proceedings of the 1990 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 845–848, 1990.
  35. Vincent, E., Fevotte, C., and Gribonval, R., “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, 14: 1462–1469, 2006.
  36. Wang, D. L. and Brown, G. J., editors, Computational auditory scene analysis: Principles, algorithms and applications, Wiley-IEEE Press, 2006.
  37. Weiss, R., Mandel, M., and Ellis, D., “Source separation based on binaural cues and source model constraints,” in Proc. Interspeech, pages 419–422, Brisbane, Sep 2008.
  38. Weiss, R. and Ellis, D., “Speech separation using speaker-adapted eigenvoice speech models,” Computer Speech and Language, 24: 16–29, 2010.
  39. Zhang, C., Zhang, Z., and Florencio, D., “Maximum likelihood sound source localization for multiple directional microphones,” in Proc. ICASSP, Honolulu, HI, April 2007.

1 This section was written by Michael Seltzer.
