11
Bayesian Estimation

In several applications there is enough (statistical) information on the most likely values of the parameters θ to be estimated even before making any experiment, or before any data collection. The information is encoded in terms of the a‐priori pdf p(θ) that accounts for the properties of θ before any observation (Chapter 6). Bayesian methods make efficient use of the a‐priori pdf to yield the best estimate given both the observation x and the a‐priori knowledge on the admissible values from p(θ).

Recall that the MLE is based on the conditional pdf p(x|θ), which sets the probability of the specific observation x_k for any choice of θ, but these choices are not all equally likely. In the Bayesian approach, the outcome of the k-th experiment is part of a set of (real or conceptual) experiments with two random sets θ and x, which in the case of an additive noise model x = s(θ) + w are θ and w. The joint pdf is

p(x, θ) = p(x|θ) p(θ)

but it is meaningful to consider for each experiment (or realization of the rv x) the pdf of the parameter θ conditioned on the k-th observation x = x_k (here deterministic) according to Bayes' rule, which gives the pdf of the unobserved θ

p(θ|x_k) = p(x_k|θ) p(θ) / p(x_k) = γ · p(x_k|θ) p(θ)

with γ as a scale factor to normalize the pdf. The a-posteriori pdf p(θ|x_k) is the pdf of the unknown parameter θ after the observation x = x_k, and it reflects the statistical properties of θ after the specific experiment.

To further stress the role of the a-priori pdf in Bayesian methods, we can refer to Figure 11.1 to derive the a-posteriori pdf p(θ|x) (as routinely employed in the notation, the specific observation x = x_k is understood), which is obtained by multiplying the a-priori pdf p(θ) and the conditional pdf p(x|θ). In Figure 11.1, θ is scalar (number of parameters p = 1) and the a-priori pdf shows some values that are more or less likely than others. After the observation, the conditional pdf p(x|θ) is the one used in the ML method, and it has a different behavior from the a-priori one. More specifically, the maximum of the conditional pdf provides the MLE θ̂_ML without any knowledge of the probability of the values of θ. After multiplying the conditional pdf by the a-priori pdf, the a-posteriori pdf p(θ|x) compounds the likelihood of the parameters (which disregards the probability of their values) with the probability associated to each value of the parameters by the a-priori pdf p(θ).

Figure 11.1 Bayesian estimation (p = 1): a-priori pdf p(θ), conditional pdf p(x|θ), and a-posteriori pdf p(θ|x) with the corresponding θ̂_MMSE and θ̂_MAP estimates.

Based on the a‐posteriori pdf, Bayesian estimation can follow two different strategies:

  • Maximum a-posteriori (MAP) estimation: the estimate is the value that maximizes the a-posteriori probability
    θ̂_MAP = arg max_θ p(θ|x)
  • Minimum mean square error (MMSE) estimation: the estimate is the barycenter of the a-posteriori pdf:
    θ̂_MMSE = E[θ|x] = ∫ θ p(θ|x) dθ
    This is the mean value of the conditional pdf p(θ|x), and it is commonly referred to as the MMSE estimate as clarified below; a short numerical sketch of the two estimates follows this list.
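As a minimal numerical sketch (not from the text; the bimodal a-priori pdf, noise level, and observation below are arbitrary choices), the a-posteriori pdf of a scalar θ can be evaluated on a grid and the two estimates extracted directly as the argmax (MAP) and the mean (MMSE) of p(θ|x):

```python
# Minimal sketch: a-posteriori pdf on a grid, with MAP (argmax) and
# MMSE (posterior mean) estimates for a scalar parameter theta.
import numpy as np

theta = np.linspace(-5, 5, 2001)                 # grid for the scalar parameter
d_theta = theta[1] - theta[0]

# Hypothetical bimodal a-priori pdf p(theta): mixture of two Gaussians
prior = 0.7 * np.exp(-0.5 * ((theta + 2) / 0.6) ** 2) \
      + 0.3 * np.exp(-0.5 * ((theta - 2) / 0.6) ** 2)
prior /= prior.sum() * d_theta

# Gaussian likelihood p(x|theta) for the additive model x = theta + w
sigma_w, x_obs = 1.0, 0.8
likelihood = np.exp(-0.5 * ((x_obs - theta) / sigma_w) ** 2)

# A-posteriori pdf p(theta|x) = gamma * p(x|theta) p(theta)
posterior = prior * likelihood
posterior /= posterior.sum() * d_theta

theta_map = theta[np.argmax(posterior)]           # maximum of p(theta|x)
theta_mmse = np.sum(theta * posterior) * d_theta  # mean of p(theta|x)
print(f"MAP = {theta_map:.3f}, MMSE = {theta_mmse:.3f}")
```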

Any estimator θ̂(x) depends on the observations, but under the Bayesian assumptions the error ε = θ̂(x) − θ is an rv that depends on both the data x and the rv θ with pdf p(θ). The corresponding Bayesian MSE (Section 6.3.2):

MSE = E_{x,θ}[ ||θ̂(x) − θ||² ] = ∫∫ ||θ̂(x) − θ||² p(x, θ) dx dθ

is evaluated wrt the randomness of the data and θ. The MSE is thus the "mean" in the sense that it accounts for all possible choices of the rv θ weighted by the corresponding occurrence probability p(θ), as p(x, θ) = p(x|θ) p(θ). The minimization of the MSE seeks the estimator that minimizes the MSE as a function of θ̂(x). From

MSE = ∫ [ ∫ ||θ̂(x) − θ||² p(θ|x) dθ ] p(x) dx

since p(x) ≥ 0 and is independent of θ̂(x), it is enough to minimize the term within the brackets (i.e., the MSE conditioned on a specific observation x) by nulling the gradient:

∇_θ̂ ∫ ||θ̂ − θ||² p(θ|x) dθ = 2 ∫ (θ̂ − θ) p(θ|x) dθ = 0

Rearranging terms yields the MMSE estimator

θ̂_MMSE(x) = ∫ θ p(θ|x) dθ = E[θ|x]

as the estimator that minimizes the MSE for all possible (and random) outcomes of the parameters θ.

Remark: The illustrative example at hand helps to raise some issues on which Bayesian estimator is to be preferred. There is no unique answer; it depends on the application. The example shows a multimodal a-posteriori pdf, and MAP selects the one value that occurs with the greatest probability. On the other hand, the MAP choice could experience large errors in the case that some outcomes fall in other modes of the pdf. The MMSE is the best "mean solution", which could coincide with a value that is unlikely, if not even inadmissible. In general, for a multimodal a-posteriori pdf, the maximum (MAP) and the mean (MMSE) of the a-posteriori pdf do not coincide, and choosing one or the other depends on the specific application problem. Some care should be taken to avoid the Buridan's ass paradox and estimate a value that is nonsense for the problem at hand, but still compliant with the estimator's metric, as for the MMSE estimate of binary valued parameters (Section 11.1.3). On the other hand, if the a-posteriori pdf is unimodal and even, MAP and MMSE coincide, as in the case of Gaussian rvs, a special and favorable case as detailed below.

11.1 Additive Linear Model with Gaussian Noise

The observation is modeled as x = θ + w, in which the scalar parameter θ is random and should be estimated from a single observation with additive Gaussian noise w ∼ N(0, σ_w²). The Gaussian model is discussed first with θ ∼ N(μ_θ, σ_θ²), and then extended to the case of a non-Gaussian rv θ. It will be shown that Bayesian estimators become non-linear for an arbitrary pdf, and this motivates the interest in linear estimators even when non-linear estimators should be adopted, at the price of some degree of sub-optimality.

11.1.1 Gaussian A-priori: θ ∼ N(μ_θ, σ_θ²)

Single Observation (N = 1)

In MLE the parameter θ is deterministic, and its value does not change when considering multiple experiments. For a single observation, the MLE is

θ̂_ML = x

Assuming now that for each observation (composed of a single sample) the parameter value θ is different and drawn from a Gaussian distribution θ ∼ N(μ_θ, σ_θ²), where μ_θ and σ_θ² are known, the joint pdf is p(x, θ) = p(x|θ) p(θ), while the pdf of θ conditioned on the specific observation (a-posteriori pdf) is

p(θ|x) = Γ · exp( −(x − θ)²/(2σ_w²) ) · exp( −(θ − μ_θ)²/(2σ_θ²) )

up to a scale Γ. Rearranging the exponentials, the a-posteriori pdf is

p(θ|x) ∼ N( E[θ|x], var[θ|x] )

This shows the general property (see Section 11.2) that for additive linear models, the a-posteriori pdf is Gaussian if all terms are Gaussian. Returning to the specific example, the terms involved are

E[θ|x] = σ_θ²/(σ_θ² + σ_w²) · x + σ_w²/(σ_θ² + σ_w²) · μ_θ
var[θ|x] = σ_θ² σ_w²/(σ_θ² + σ_w²)

The second term shows that the variance of the a-posteriori pdf is always smaller than the a-priori one (var[θ|x] ≤ σ_θ²), except when noise dominates the measurements (σ_w² ≫ σ_θ²) and var[θ|x] → σ_θ². The maximum of the a-posteriori pdf p(θ|x), when Gaussian, is for

θ̂_MAP = E[θ|x] = (σ_θ² x + σ_w² μ_θ)/(σ_θ² + σ_w²)

Considering the symmetry of the Gaussian pdf, it is straightforward to assert that θ̂_MAP = θ̂_MMSE. It is interesting to note that for σ_θ² ≪ σ_w² the a-priori information dominates and θ̂_MAP ≃ μ_θ regardless of the specific observation, while for σ_θ² ≫ σ_w² the a-priori information becomes irrelevant (in this case the a-priori information is diffuse or non-informative) and θ̂_MAP ≃ x as in MLE. This is another general property: when the a-priori pdf p(θ) is uniform, there is no difference between one choice and the other, and the Bayesian estimator cannot rely on any a-priori pdf; in this case the MAP coincides with the MLE.

The Bayesian MSE should take into account both the parameter and the noise as rvs:

MSE = E_{x,θ}[ (θ̂_MMSE(x) − θ)² ] = var[θ|x] = σ_θ² σ_w²/(σ_θ² + σ_w²)

and MSE → σ_w² for a diffuse a-priori pdf (σ_θ² ≫ σ_w²).

The Bayesian MSE can be evaluated in the following alternative way:

MSE = E_θ[ MSE(θ_o) ] = ∫ E_x[ (θ̂(x) − θ_o)² | θ = θ_o ] p(θ_o) dθ_o

Note that MSE_ML(θ_o) is the MSE for a deterministic θ = θ_o as in MLE. Since for the example here θ̂_ML = x, but θ̂_MMSE optimizes the Bayesian MSE and not MSE_ML(θ_o), the estimate θ̂_MAP cannot be optimal for the deterministic value θ = θ_o (incidentally, it is biased). Nevertheless, when the MSE is evaluated by taking into account the variability of θ_o, Bayesian estimation shows its superiority with respect to ML estimation for any arbitrary value of θ distributed according to p(θ).

To summarize, the Bayesian MSE of the MMSE estimator is always lower than the MSE of the MLE. This is due to the fact that the MMSE estimator exploits the available a-priori information on the parameter value.
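A short Monte Carlo sketch (with assumed values for μ_θ, σ_θ², and σ_w²) can be used to verify this claim for the single-observation Gaussian case, comparing the empirical Bayesian MSE of θ̂_ML = x and of θ̂_MMSE against the theoretical value σ_θ²σ_w²/(σ_θ² + σ_w²):

```python
# Monte Carlo sketch: Bayesian MSE of the MMSE estimate vs. the MLE
# theta_ML = x for the scalar Gaussian model x = theta + w.
import numpy as np

rng = np.random.default_rng(0)
mu_t, var_t, var_w = 1.0, 0.5, 1.0               # assumed a-priori mean/variance, noise variance
M = 200_000                                      # number of Monte Carlo trials

theta = rng.normal(mu_t, np.sqrt(var_t), M)      # theta ~ N(mu_t, var_t)
x = theta + rng.normal(0.0, np.sqrt(var_w), M)   # single observation per trial

theta_ml = x                                                  # MLE
theta_mmse = (var_t * x + var_w * mu_t) / (var_t + var_w)     # MMSE = MAP (Gaussian case)

print("MSE(ML)   =", np.mean((theta_ml - theta) ** 2))        # ~ var_w
print("MSE(MMSE) =", np.mean((theta_mmse - theta) ** 2))      # ~ var_t*var_w/(var_t+var_w)
print("theory    =", var_t * var_w / (var_t + var_w))
```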

Multiple Observations (N > 1)

We can consider a more complex (but somewhat more realistic) example with N observations characterized by independent noise components and the same value of the parameter θ, which in turn can change from one set of N observations to another. The model can be written as:

x[n] = θ + w[n]   for n = 1, 2, ..., N,   with w[n] ∼ N(0, σ_w²) independent and θ ∼ N(μ_θ, σ_θ²)

The ML estimate of the parameter is the sample mean (see, e.g., Section 6.7)

θ̂_ML = x̄ = (1/N) Σ_{n=1}^{N} x[n]

while the MMSE estimator is linear (it will be derived in the following section) and it is given by

θ̂_MMSE = σ_θ²/(σ_θ² + σ_w²/N) · x̄ + (σ_w²/N)/(σ_θ² + σ_w²/N) · μ_θ

It is interesting to remark that the MMSE estimate is a linear combination of the mean parameter value μ_θ (known a-priori) and the sample mean x̄. For σ_θ² ≫ σ_w²/N (e.g., for N → ∞) the a-priori information becomes irrelevant and θ̂_MMSE ≃ x̄, while for σ_θ² ≪ σ_w²/N we have θ̂_MMSE ≃ μ_θ. From the symmetry of the Gaussian a-posteriori pdf, θ̂_MAP = θ̂_MMSE.

11.1.2 Non‐Gaussian A‐Priori

The observation x = θ + w is the sum of two random variables, only one of which (here the noise) is Gaussian, say w ∼ N(0, σ_w²). The MMSE estimator follows from the definition:

θ̂_MMSE = E[θ|x] = ∫ θ p(θ|x) dθ = (1/p_x(x)) ∫ θ p_w(x − θ) p_θ(θ) dθ = x + σ_w² · d ln p_x(x)/dx

where the following two equalities have been used to derive the last term: p_x(x) = ∫ p_w(x − θ) p_θ(θ) dθ and dp_w(x − θ)/dx = −((x − θ)/σ_w²) p_w(x − θ). The general relationship of the MMSE estimator is

(11.1)  θ̂_MMSE(x) = x + σ_w² · d ln p_x(x)/dx

In other words, the MMSE estimator is a non-linear transformation of the observation x and its shape depends on the pdf of the observation p_x(x). This example can be generalized: the MMSE estimator is non-linear except in Gaussian contexts (the proof is simple: just set p_x(x) to a Gaussian pdf with mean μ_θ and variance σ_θ² + σ_w²; this gives d ln p_x(x)/dx = −(x − μ_θ)/(σ_θ² + σ_w²) and thus θ̂_MMSE = (σ_θ² x + σ_w² μ_θ)/(σ_θ² + σ_w²); therefore, in this case, the MMSE estimator is linear and it coincides with the one reported previously in Section 11.1.1).


Figure 11.2 Binary communication with noise.

The additive model has the nice property that the complementary MMSE estimator (here for the noise w) follows from simple reasoning. Since x = θ + w, expanding the identity E[x|x] = E[θ|x] + E[w|x] = x it follows that

ŵ_MMSE = E[w|x] = x − θ̂_MMSE(x) = −σ_w² · d ln p_x(x)/dx

This means that the MMSE estimation of the noise is complementary to the estimator of θ.

Remark: The MMSE estimator depends on the pdf of the observations p_x(x), but when this pdf is unknown and a large set of independent observations is available, one can use the histogram of the observations, say p̂_x(x), in place of p_x(x), and then evaluate d ln p̂_x(x)/dx in a numerical (or analytical) way, recalling that the histogram bins the values of x over a finite set (see Section 9.7 for the design of histogram binning). In this way an approximate MMSE estimator can be obtained. Alternatively, it is possible to fit a pdf model over the observations by matching the sample moments with the analytical ones (method of moments) and compute the MMSE estimator from the pdf model derived so far.
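The following sketch illustrates the histogram-based approximation suggested in the remark, applying (11.1) with an empirical estimate of p_x(x); the Laplacian pdf chosen for θ and all numerical values are arbitrary assumptions for the example:

```python
# Sketch: approximate MMSE estimator theta_hat(x) = x + sigma_w^2 * d/dx ln p_x(x),
# with p_x replaced by a histogram-based estimate from a large record of observations.
import numpy as np

rng = np.random.default_rng(1)
sigma_w = 0.5
N = 200_000

# Hypothetical non-Gaussian parameter (Laplacian) observed in Gaussian noise
theta = rng.laplace(0.0, 1.0, N)
x = theta + rng.normal(0.0, sigma_w, N)

# Histogram estimate of p_x(x) and numerical derivative of its logarithm
counts, edges = np.histogram(x, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
log_px = np.log(counts + 1e-12)                  # avoid log(0) in empty bins
dlog_px = np.gradient(log_px, centers)           # d ln p_x(x)/dx on the bin grid

def theta_mmse_approx(x_query):
    """Approximate MMSE estimate via (11.1) with the empirical pdf."""
    return x_query + sigma_w**2 * np.interp(x_query, centers, dlog_px)

print(theta_mmse_approx(np.array([-2.0, 0.0, 2.0])))
```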

11.1.3 Binary Signals: MMSE vs. MAP Estimators

In the digital communication system sketched in Figure 11.2, a binary stream θ(t) (black line) is transmitted over a noisy communication link and the received signal (gray line) can be modeled as impaired by Gaussian noise

x(t) = θ(t) + w(t),   with w(t) ∼ N(0, σ_w²)

The transmitted signal θ(t) is binary and its value encodes the values of a binary source to be delivered to the destination. The source values are unknown, but can be modeled as a random stationary process (i.e., time t is irrelevant) with independent samples:

p_θ(θ) = p δ(θ − 1) + (1 − p) δ(θ + 1),   with P[θ = +1] = p and P[θ = −1] = 1 − p

Figure 11.3 Pdf p_x(x) for binary communication (σ_w = 0.5).

This represents the a-priori pdf. The unconditional pdf of the received signal x = θ + w is

p_x(x) = p · G(x − 1; σ_w²) + (1 − p) · G(x + 1; σ_w²),   with G(u; σ_w²) = exp(−u²/(2σ_w²))/√(2πσ_w²)

It is bimodal (or multimodal in the case of a multi-level signal) with the maxima positioned at x ≃ ±1, as shown in Figure 11.3 for p = 1/2 and σ_w = 0.5.

The MMSE approach estimates the value θ as a continuous-valued (not binary) variable by computing the derivative of the closed form (11.1). A fairly compact relationship of the MMSE estimator can be derived after some algebra for equiprobable symbols (p = 1/2), recalling that tanh(a) = (e^a − e^{−a})/(e^a + e^{−a}):

θ̂_MMSE = E[θ|x] = tanh( x/σ_w² )

In information theory this is referred to as soft decoding. Figure 11.4 shows the non-linearity that maps the observation x into the MMSE estimate θ̂_MMSE. For a low noise level (σ_w ≪ 1), the non-linearity has the shape of a binary detector with threshold, while for a high noise level (σ_w ≫ 1), the non-linearity attains a linear behavior with a slope of 1/σ_w². In fact, for σ_w → ∞ the noise dominates and the binary level that minimizes the MSE is 0 (i.e., halfway between the two choices). The MMSE estimation of w is based on the complementary non-linearity ŵ_MMSE = x − θ̂_MMSE(x), which for σ_w ≫ 1 approximately coincides with the observation (as it is then nearly a 1:1 mapping).
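A small simulation sketch (assuming equiprobable ±1 symbols and an arbitrary σ_w) compares the soft MMSE decoding tanh(x/σ_w²) with the hard decision sign(x) in terms of MSE:

```python
# Sketch: MMSE "soft" decoding E[theta|x] = tanh(x/sigma_w^2) for equiprobable
# +/-1 symbols, compared with the hard MAP decision sign(x).
import numpy as np

rng = np.random.default_rng(2)
sigma_w, N = 0.5, 10_000

theta = rng.choice([-1.0, 1.0], size=N)          # binary symbols, P = 1/2 each
x = theta + rng.normal(0.0, sigma_w, N)          # received samples

theta_mmse = np.tanh(x / sigma_w**2)             # soft estimate (continuous valued)
theta_map = np.sign(x)                           # hard decision (binary valued)

print("MSE soft (MMSE):", np.mean((theta_mmse - theta) ** 2))
print("MSE hard (MAP) :", np.mean((theta_map - theta) ** 2))
```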

The MAP estimator should decide between two source values, and the decision among a limited set is part of detection theory, discussed later in Section 23.2. However, it is educational to use Bayesian inference to highlight the difference between MMSE and MAP in a non‐Gaussian context. The a‐posteriori pdf is:

P[θ = +1 | x] = p · G(x − 1; σ_w²)/p_x(x),   P[θ = −1 | x] = (1 − p) · G(x + 1; σ_w²)/p_x(x)

so the MAP is

θ̂_MAP = arg max_{θ ∈ {−1, +1}} p(θ|x)
Figure 11.4 MMSE estimator E[θ|x] for binary valued signals (σ_w = 0.1, 0.5, 1, 3).

Since

P[θ = +1 | x] / P[θ = −1 | x] = (p/(1 − p)) · exp( 2x/σ_w² )

the MAP estimator reduces to

θ̂_MAP = +1 if x > (σ_w²/2) ln((1 − p)/p), and θ̂_MAP = −1 otherwise

which for p = 1/2 (the most common case in digital communications) is θ̂_MAP = sign(x). MMSE and MAP are clearly different from one another: the MMSE estimate returns a value that is not compatible with the generation mechanism but still has minimum MSE; the MAP estimate is simply the decision between the two alternative hypotheses θ = +1 or θ = −1 (see Chapter 23).

11.1.4 Example: Impulse Noise Mitigation

In real life, the stationarity assumption of Gaussian noise (or of any other stochastic process) is violated by the presence of another phenomenon that occurs unexpectedly with low probability but large amplitude: this is impulse noise. Gaussian mixtures can be used to model impulse noise, which is characterized by long tails in the pdf (like those generated by lightning, power on/off, or other electromagnetic events), and it was modeled this way by Middleton in 1973. Still referring to the observation model x = θ + w, the rv labeled as "noise" is θ, while w is the signal of interest (just to use a notation congruent with all the above) with a Gaussian pdf.

The value of the random variable θ is randomly chosen between two (or more) values associated to two (or more) different random variables. Therefore the random variable θ can be expressed as

θ = α with probability 1 − p,   θ = β with probability p

and its pdf is the mixture

p_θ(θ) = (1 − p) · p_α(θ) + p · p_β(θ)

that is, a linear combination of the pdfs associated to the two random variables α and β. The pdf of the observation is the convolution of the θ and w pdfs:

p_x(x) = p_θ(x) * p_w(x) = ∫ p_θ(u) p_w(x − u) du

and this can be used to obtain (from (11.1)) the MMSE estimator of θ or w in many situations.

The Gaussian mixture with α ∼ N(0, σ_α²) and β ∼ N(0, σ_β²) can be shaped to model the tails of the pdf; for zero-mean (μ_α = μ_β = 0) Gaussians, the pdf is:

p_θ(θ) = (1 − p) · G(θ; σ_α²) + p · G(θ; σ_β²)

Figure 11.5 Example of data affected by impulse noise (upper figure), and after removal of impulse noise (lower figure) by the MMSE estimator of the Gaussian samples (black line) compared to the true samples (gray line).

The MMSE estimator of w from x is

ŵ_MMSE(x) = E[w|x] = x − θ̂_MMSE(x) = −σ_w² · d ln p_x(x)/dx,   with p_x(x) = (1 − p) · G(x; σ_α² + σ_w²) + p · G(x; σ_β² + σ_w²)

The effect of "spiky" data is illustrated in Figure 11.5 with x affected by Gaussian noise and some impulse noise. The MMSE estimate ŵ_MMSE compared to the true w (gray line) shows a strong reduction of the non-Gaussian (impulse) noise, but it still leaves the Gaussian noise, which cannot be mitigated by a non-linear MMSE estimator other than by filtering, as is discussed later (simulated experiment with given values of p, σ_α², and σ_β²). Namely, the noise with large amplitude, present infrequently (with probability p), is characterized by a Gaussian pdf with variance σ_β², while normally there is little, or even no noise at all, as σ_α² ≪ σ_β² (or σ_α² ≃ 0). In this case the estimator has a non-linearity with saturation that thresholds the observed values when too large compared to the background.

Inspection of the MMSE estimator for varying parameters in Figure 11.6 shows that the threshold effect is more sensitive to the probability p of the large amplitudes than to their variance σ_β² (solid vs. dashed curves).


Figure 11.6 MMSE estimator for impulse noise modeled as Gaussian mixture.
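The saturating non-linearity of Figures 11.5 and 11.6 can be reproduced with a short sketch; the mixture parameters below (p, σ_α, σ_β, σ_w) are arbitrary assumptions, and E[w|x] is computed by weighting the two conditional Gaussian estimates with the a-posteriori probabilities of the quiet/impulse states (equivalent to the derivative form given above):

```python
# Sketch: MMSE estimate of the Gaussian signal w from x = theta + w, where
# theta is impulse noise modeled as a zero-mean Gaussian mixture
# (variance sigma_a^2 with probability 1-p, sigma_b^2 with probability p).
import numpy as np

p, sigma_a, sigma_b, sigma_w = 0.05, 0.01, 3.0, 0.5    # assumed values

def gauss(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def w_mmse(x):
    """E[w|x] for the Gaussian-mixture impulse-noise model."""
    va = sigma_a**2 + sigma_w**2                 # variance of x in the "quiet" state
    vb = sigma_b**2 + sigma_w**2                 # variance of x in the "impulse" state
    num = (1 - p) * gauss(x, va) * (sigma_w**2 / va) * x \
        +       p * gauss(x, vb) * (sigma_w**2 / vb) * x
    den = (1 - p) * gauss(x, va) + p * gauss(x, vb)
    return num / den

x = np.linspace(-10, 10, 9)
print(np.round(w_mmse(x), 3))                    # saturating non-linearity vs. x
```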

11.2 Bayesian Estimation in Gaussian Settings

In general, Bayesian estimators are non-linear and their derivation can be cumbersome, or prohibitively complex. When parameters and data are jointly Gaussian, the conditional pdf p(θ|x) is Gaussian, and the MMSE estimator is a linear transformation of the observations. Furthermore, when the model is linear, the MMSE estimator is very compact and in closed form. Even if the pdfs are non-Gaussian and the MMSE estimator would be non-linear, the use of a linear estimator in place of the true MMSE is very common and can be designed by minimizing the MSE. This is called a linear MMSE (LMMSE) estimator; in these situations the LMMSE is not optimal, but it has the benefit of being compact and fairly straightforward, as it only requires knowledge of the first and second order central moments. For a Gaussian context the LMMSE coincides with the MMSE estimator, and it is sub-optimal otherwise.

Tight lower bounds on the attainable MSE of any optimal or sub-optimal estimation scheme are useful tools for performance analysis and comparison. The Bayesian CRB in Section 11.4 is the most straightforward extension of the CRB to the case when the parameters are random.

11.2.1 MMSE Estimator

The use of a Gaussian model for parameters and observations is very common and this yields a closed form MMSE estimator. Let p(x, θ) be joint Gaussian; then

[ x; θ ] ∼ N( [ μ_x; μ_θ ],  [ C_xx  C_xθ ;  C_θx  C_θθ ] )

The conditional pdf p(θ|x) is also Gaussian (see Section 3.5 for details and derivation):

p(θ|x) ∼ N( μ_θ + C_θx C_xx^{-1}(x − μ_x),  C_θθ − C_θx C_xx^{-1} C_xθ )

This condition is enough to prove that the MMSE estimator is

θ̂_MMSE = E[θ|x] = μ_θ + C_θx C_xx^{-1} (x − μ_x)

which is linear after removing the a-priori mean value μ_x from the observations, and it has an elegant closed form.

In the additive noise model

x = s(θ) + w

the covariance matrices and the mean values lead to the following results (under the hypothesis that w and θ are mutually uncorrelated):

μ_x = E[s(θ)]
C_xx = cov[s(θ)] + C_w
C_θx = E[ (θ − μ_θ)(s(θ) − μ_x)ᵀ ]

Note that the moments wrt the parameters (E[s(θ)], cov[s(θ)], and C_θx) are the only new quantities to be evaluated to derive the MMSE estimator. When the model s(θ) is linear, the central moments are further simplified as shown below.

11.2.2 MMSE Estimator for Linear Models

The linear model with additive Gaussian noise is

x = Hθ + w,   with θ ∼ N(μ_θ, C_θ) and w ∼ N(0, C_w)

the moments can be related to the linear model H:

μ_x = H μ_θ,   C_xx = H C_θ Hᵀ + C_w
C_θx = C_θ Hᵀ

and the MMSE estimator is now

(11.2)  θ̂_MMSE = μ_θ + C_θ Hᵀ (H C_θ Hᵀ + C_w)^{-1} (x − H μ_θ)

The covariance

(11.3)  cov[θ|x] = C_θ − C_θ Hᵀ (H C_θ Hᵀ + C_w)^{-1} H C_θ

is also useful to assess the performance of the a-posteriori method. Note that the a-posteriori covariance is always an improvement over the a-priori one, and this results from C_θ Hᵀ (H C_θ Hᵀ + C_w)^{-1} H C_θ ⪰ 0, and thus

cov[θ|x] ⪯ C_θ

where the equality holds when C_θx = C_θ Hᵀ = 0 (i.e., the data is not correlated with the parameters, and estimation of θ from x is meaningless).

Computationally Efficient MMSE

There is a computational drawback in the MMSE method, as the matrix H C_θ Hᵀ + C_w to be inverted has dimension N × N, which depends on the size of the observations (N), and this could be prohibitively expensive since matrix inversion costs O(N³). One efficient formulation follows from the algebraic properties of matrices. Namely, the Woodbury identity (A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1} (see Section 1.1.1) can be applied to the linear transformation in the MMSE expression:

C_θ Hᵀ (H C_θ Hᵀ + C_w)^{-1} = (C_θ^{-1} + Hᵀ C_w^{-1} H)^{-1} Hᵀ C_w^{-1}

This leads to an alternative formulation of the MMSE estimator:

(11.4)  θ̂_MMSE = μ_θ + (C_θ^{-1} + Hᵀ C_w^{-1} H)^{-1} Hᵀ C_w^{-1} (x − H μ_θ)

This expression is computationally far more efficient than (11.2), as the dimension of C_θ^{-1} + Hᵀ C_w^{-1} H is p × p and the number of parameters is always smaller than the number of measurements (p ≤ N); hence the MMSE estimator (11.4) has a cost O(p³) ≪ O(N³).

Moreover, the linear transformation C_w^{-1/2} can be evaluated in advance from a Cholesky factorization of C_w^{-1} and applied both to the regressor, as H̃ = C_w^{-1/2} H, and to the observation x, with the result of decorrelating the noise components (whitening filter, see Section 7.8.2) as x̃ = C_w^{-1/2} x with cov[C_w^{-1/2} w] = I:

θ̂_MMSE = μ_θ + (C_θ^{-1} + H̃ᵀ H̃)^{-1} H̃ᵀ (x̃ − H̃ μ_θ)
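The equivalence between the N × N form (11.2) and the p × p Woodbury form (11.4) can be checked numerically with a short sketch (random H and arbitrary C_θ, C_w, μ_θ):

```python
# Sketch: equivalence of the two MMSE forms for the linear Gaussian model
# x = H theta + w: the N x N inversion of (11.2) vs. the p x p inversion of (11.4).
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 3
H = rng.normal(size=(N, p))
C_t = np.diag([2.0, 1.0, 0.5])                   # a-priori covariance of theta
C_w = 0.2 * np.eye(N)                            # noise covariance
mu_t = np.array([1.0, -1.0, 0.5])

theta = rng.multivariate_normal(mu_t, C_t)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

# (11.2): gain with an N x N inversion
G1 = C_t @ H.T @ np.linalg.inv(H @ C_t @ H.T + C_w)
theta_1 = mu_t + G1 @ (x - H @ mu_t)

# (11.4): gain with a p x p inversion after the Woodbury identity
G2 = np.linalg.inv(np.linalg.inv(C_t) + H.T @ np.linalg.inv(C_w) @ H) @ H.T @ np.linalg.inv(C_w)
theta_2 = mu_t + G2 @ (x - H @ mu_t)

print(np.allclose(theta_1, theta_2))             # True: the two forms coincide
print(theta_1)
```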

Example

The MMSE estimator provides a straightforward solution to the example of mean value estimate from multiple observations in Section 11.1.1 with model:

x = 1θ + w,   with 1 = [1, 1, ..., 1]ᵀ (N × 1), θ ∼ N(μ_θ, σ_θ²), w ∼ N(0, σ_w² I)

The general expression for the linear MMSE estimator is

(11.5)  θ̂_MMSE = μ_θ + σ_θ² 1ᵀ (σ_θ² 1 1ᵀ + σ_w² I)^{-1} (x − 1 μ_θ)

After the Woodbury identity (Section 1.1.1) it is more manageable:

θ̂_MMSE = μ_θ + σ_θ²/(σ_θ² + σ_w²/N) · (x̄ − μ_θ) = (1 − α) μ_θ + α x̄

thus highlighting that the MMSE estimate is the combination of the a-priori mean value μ_θ and the sample mean of the observations x̄ = (1/N) Σ_{n=1}^{N} x[n], with

α = σ_θ²/(σ_θ² + σ_w²/N)

For large N, the influence of the noise decreases as σ_w²/N, and for N → ∞ the importance of the a-priori information decreases as α → 1 and θ̂_MMSE → x̄. Using the same procedure it is possible to evaluate

var[θ|x] = (1/σ_θ² + N/σ_w²)^{-1} = (1 − α) σ_θ²

that is dependent on the weighting term α previously defined.

11.3 LMMSE Estimation and Orthogonality

The MMSE estimator θ̂_MMSE = E[θ|x] can be complex to derive, as it requires a detailed statistical model, and the evaluation of the conditional mean E[θ|x] can be quite difficult, especially when non-Gaussian rvs are involved. A pragmatic approach is to derive a sub-optimal estimator that is linear with respect to the observations and is based only on the knowledge of the first and second order moments of the involved rvs. This is the linear MMSE (LMMSE) estimator, where the prefix "linear" denotes the structure of the estimator itself and highlights its sub-optimality in non-Gaussian contexts.

The LMMSE is based on the linear combination of N observations

θ̂_k = a_kᵀ x + b_k

while for the ensemble of p parameters it is

(11.6)  θ̂ = A x + b

The estimator is based on the p × N coefficients of the matrix A (and on the p × 1 vector b), which are obtained from the two constraints:

  • Unbiasedness: from the Bayesian definition of bias (Section 6.3)
    E[θ̂] = E[θ] = μ_θ
    this provides a relationship for the scaling term
    b = μ_θ − A μ_x

    This term can be obtained by rewriting the estimator after the subtraction of the mean from both the parameters and the observations:

    θ̂ − μ_θ = A (x − μ_x)
  • MSE minimization: given the error
    ε_k = θ_k − θ̂_k = θ_k − a_kᵀ x
    that corresponds to the k-th parameter, one should minimize its mean square value
    E[ε_k²] = E[ (θ_k − a_kᵀ x)² ]

    To simplify the notation it is assumed that the mean is subtracted from all the involved variables (in other words, the assignment is x ← x − μ_x and θ ← θ − μ_θ), or equivalently E[x] = 0 and E[θ] = 0.

    Setting the gradients to zero:

    ∂E[ε_k²]/∂a_k = −2 E[ (θ_k − a_kᵀ x) x ] = 0

    This latter relationship is defined as the (statistical) orthogonality condition between the estimation error ε_k = θ_k − a_kᵀ x and the observations

    E[ε_k x] = 0

    or equivalently

    E[θ_k xᵀ] = a_kᵀ E[x xᵀ]

    By transposing both sides of the previous equation it is possible to write:

    E[x xᵀ] a_k = E[x θ_k]

    which shows how the orthogonality conditions lead, for each parameter θ_k, to a set of N linear equations with N unknowns (the elements of the vector a_k). The solution of this set of equations is given by (recall the assumptions here: E[x] = 0 and E[θ] = 0, so that E[x xᵀ] = C_xx and E[x θ_k] = c_{xθ_k})

    (11.7)  a_k = E[x xᵀ]^{-1} E[x θ_k] = C_xx^{-1} c_{xθ_k}

    Generalizing the orthogonality conditions between observations and estimation error:

    E[ (θ − A x) xᵀ ] = 0

    which corresponds to pN equations in the pN unknown entries of A. Substituting the estimator relationship θ̂ = A x into the previous equation yields the LMMSE estimator:

    (11.8)  A = E[θ xᵀ] E[x xᵀ]^{-1} = C_θx C_xx^{-1}

    Considering now also the relation that zeroes the bias, it is possible to write the general expression for the LMMSE estimator:

    θ̂_LMMSE = μ_θ + C_θx C_xx^{-1} (x − μ_x)

    which obviously gives E[θ̂_LMMSE] = μ_θ.

    The dispersion of the estimate is given by the covariance:

    cov[θ − θ̂_LMMSE] = C_θθ − C_θx C_xx^{-1} C_xθ

    and the fundamental relationships of the LMMSE are the same as for the MMSE for Gaussian models in Section 11.2.1.


Figure 11.7 Geometric view of LMMSE orthogonality.

Compared to the general MMSE, which in general is neither simple nor linear, any LMMSE estimator is based on the moments, which should be known, and on the cross-covariance matrix C_θx between the parameters and the observations. This implies that the LMMSE estimator can always be derived in closed form, but it has no claim of optimality. However, it is worthwhile to remark that the LMMSE coincides with the MMSE estimator for Gaussian variables, and it is the optimum MMSE estimator in this situation. The linear MMSE (LMMSE) estimator provides a sub-optimal estimate in the case of random variables with arbitrary pdfs, even if it provides the estimate with the minimum MSE among all linear estimators. The degree of sub-optimality follows from the analysis of the error covariance matrix cov[θ − θ̂_LMMSE], which for the sub-optimal case (non-Gaussian pdfs) has a larger value with respect to the "true" (optimal) MMSE estimation. In summary:

cov[θ − θ̂_LMMSE] ⪰ cov[θ − θ̂_MMSE]

(the difference is semipositive definite). The simplicity of the LMMSE estimator justifies its wide adoption in science and engineering, and sometimes the optimal MMSE is so complex to derive that the LMMSE is referred to (erroneously!) as the MMSE estimator.

Orthogonality

The orthogonality condition between the observations x and the estimation error ε = θ − θ̂_LMMSE states that these are mutually orthogonal

E[ (θ − θ̂_LMMSE) xᵀ ] = E[ε xᵀ] = 0

This is the general way to derive the LMMSE estimator as the set of Np equations that are enough to derive the p × N entries of A. The representation

θ = θ̂_LMMSE(x) + ε

decomposes the vector θ into two orthogonal components as

E[ ||θ||² ] = E[ ||θ̂_LMMSE(x)||² ] + E[ ||ε||² ]

This relation is also called the statistical Pythagorean theorem, illustrated geometrically in Figure 11.7.
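A short sketch (synthetic non-Gaussian data, arbitrary dimensions) shows the LMMSE estimator built from sample moments and verifies the orthogonality condition numerically:

```python
# Sketch: LMMSE from first/second order sample moments, with a numerical check
# of the orthogonality E[(theta - theta_hat) x^T] ~ 0.
import numpy as np

rng = np.random.default_rng(4)
M, N, p = 100_000, 4, 2

# Hypothetical generation: theta uniform (non-Gaussian), x linear in theta + noise
theta = rng.uniform(-1, 1, size=(M, p))
H = rng.normal(size=(N, p))
x = theta @ H.T + 0.3 * rng.normal(size=(M, N))

mu_t, mu_x = theta.mean(0), x.mean(0)
C_tx = (theta - mu_t).T @ (x - mu_x) / M         # cross-covariance C_theta_x
C_xx = (x - mu_x).T @ (x - mu_x) / M             # covariance C_xx

A = C_tx @ np.linalg.inv(C_xx)                   # (11.8): A = C_theta_x C_xx^{-1}
theta_hat = mu_t + (x - mu_x) @ A.T              # LMMSE estimate

err = theta - theta_hat
print("E[err x^T] ~ 0:\n", err.T @ (x - mu_x) / M)   # orthogonality (near zero)
print("MSE per parameter:", np.mean(err**2, axis=0))
```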

11.4 Bayesian CRB

When MMSE estimators are not feasible or practical, the comparison of any estimator with lower bounds is a useful performance index. Given any Bayesian estimator g(x), the Bayesian MSE is

E_{x,θ}[ (g(x) − θ)(g(x) − θ)ᵀ ] ⪰ E_{x,θ}[ (E[θ|x] − θ)(E[θ|x] − θ)ᵀ ]

where the right hand side is the conditional mean estimator that has the minimum MSE over all possible estimators. The Bayesian CRB was introduced in the mid‐1960s by Van Trees [37,42] and it states that under some regularity conditions

E_{x,θ}[ (g(x) − θ)(g(x) − θ)ᵀ ] ⪰ J_B^{-1}

where

J_B = E_{x,θ}[ −∂² ln p(x, θ)/∂θ ∂θᵀ ]

Since p(x, θ) = p(x|θ) p(θ), the Bayesian CRB is the sum of two terms:

J_B = J_D + J_P

where

J_D = E_{x,θ}[ −∂² ln p(x|θ)/∂θ ∂θᵀ ] = E_θ[ J(θ) ]

is the contribution of data that can be interpreted as the mean of the FIM for non‐Bayesian estimators (Section 8.1) over the distribution of parameters, and

J_P = E_θ[ −∂² ln p(θ)/∂θ ∂θᵀ ]

is the contribution of the a‐priori information.

In detail, the MSE bound of any Bayesian estimator can be stated as follows:

E_{x,θ}[ (g(x) − θ)(g(x) − θ)ᵀ ] ⪰ (J_D + J_P)^{-1}

with equality only if the a-posteriori pdf p(θ|x) is Gaussian. The condition for the Bayesian CRB to hold requires that the estimate is unbiased according to the Bayesian context

E_{x,θ}[ g(x) − θ ] = 0

which is evaluated for an average choice of parameters according to their pdf p(θ). This unbiasedness condition for a random set of parameters is weaker than the one for the CRB for deterministic parameters, where the estimator needs to be unbiased for any choice of θ. Further inspection shows that (Section 1.1.1)

J_B^{-1} = ( E_θ[J(θ)] + J_P )^{-1} ⪯ ( E_θ[J(θ)] )^{-1}

where J(θ) is the FIM for non‐Bayesian case (Chapter 8).
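For the linear Gaussian model of Section 11.2.2 the two contributions are J_D = Hᵀ C_w^{-1} H and J_P = C_θ^{-1}, and the bound coincides with the a-posteriori covariance; a short numerical sketch (arbitrary H, C_θ, C_w) illustrates this:

```python
# Sketch: Bayesian CRB for the linear Gaussian model x = H theta + w.
# Here J_D = H^T C_w^{-1} H (constant over theta) and J_P = C_theta^{-1}, so the
# bound (J_D + J_P)^{-1} coincides with the MMSE a-posteriori covariance.
import numpy as np

rng = np.random.default_rng(5)
N, p = 20, 2
H = rng.normal(size=(N, p))
C_t = np.diag([1.0, 0.25])                       # a-priori covariance (prior information)
C_w = 0.5 * np.eye(N)                            # noise covariance

J_D = H.T @ np.linalg.inv(C_w) @ H               # mean FIM of the data
J_P = np.linalg.inv(C_t)                         # a-priori contribution
bcrb = np.linalg.inv(J_D + J_P)                  # Bayesian CRB

# MMSE a-posteriori covariance for the same model (Woodbury form of (11.3))
C_post = np.linalg.inv(np.linalg.inv(C_t) + H.T @ np.linalg.inv(C_w) @ H)

print(np.allclose(bcrb, C_post))                 # True: bound attained (Gaussian case)
print(bcrb)
```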

11.5 Mixing Bayesian and Non‐Bayesian

The Bayesian vs. non-Bayesian debate is a long-standing one. However, complex engineering problems are not always so clear-cut: the a-priori pdf may not be known, a-priori information may not be given as a pdf, and the parameters could mix nuisance, random, and deterministic ones. Making a list of cases, exceptions, and examples with special settings could take several pages, and the reader would feel disoriented when faced with such a taxonomy. Other than warning the reader that one can be faced with a complex combination of parameters, the recommendation is to tackle these problems with a cool head using all the analytical tools discussed so far, and mix them as appropriate.

11.5.1 Linear Model with Mixed Random/Deterministic Parameters

Let the linear model with mixed parameters be:

x = H_d θ_d + H_r θ_r + w = [H_d  H_r] [θ_d; θ_r] + w

where θ_d is deterministic, and the other set θ_r is random and Gaussian

θ_r ∼ N(μ_r, C_r)

This is called the hybrid estimation problem. Assuming that the noise is w ∼ N(0, C_w), one can write the joint pdf p(x, θ_r; θ_d) = p(x|θ_r; θ_d) p(θ_r); under the Gaussian assumption on all terms, the logarithm of the pdf is

ln p(x, θ_r; θ_d) = −(1/2)(x − H_d θ_d − H_r θ_r)ᵀ C_w^{-1} (x − H_d θ_d − H_r θ_r) − (1/2)(θ_r − μ_r)ᵀ C_r^{-1} (θ_r − μ_r) + const

and the MAP/ML estimator is obtained by setting

∂ ln p(x, θ_r; θ_d)/∂θ_r = 0,   ∂ ln p(x, θ_r; θ_d)/∂θ_d = 0

for the random (θ_r) and deterministic (θ_d) parameters, and for every x. The log-pdf can be rewritten more compactly as

images

where for analytical convenience it is defined that

images

The gradient becomes

images

where the joint estimate is

images

The CRB for this hybrid system is [43] (Section 11.5.2)

images

where

images

Example: A sinusoid has random amplitude sine/cosine components (this generalizes the case of deterministic amplitude in Section 9.2). From Section 5.2, the model is

images

with

images

The log‐pdf for random α is

images

which, optimized wrt α for a given ω, yields:

images

Since images:

images

This estimate α̂(ω), substituted into the log-pdf (neglecting scale factors),

images

proves that, regardless of whether the amplitude is deterministic or modeled as stochastic Gaussian, the frequency estimate is the value that maximizes the amplitude of the Fourier transform as for the MLE in Section 9.2, and Section 9.3.

11.5.2 Hybrid CRB

The case when the parameters are mixed non‐random and random needs a separate discussion. Let

θ = [θ_dᵀ, θ_rᵀ]ᵀ

be the p = p_d + p_r parameters grouped into the p_d non-random θ_d and the p_r random θ_r; the MSE hybrid bound is obtained from the pdf [37]

p(x, θ_r; θ_d) = p(x|θ_r; θ_d) p(θ_r)

that decouples random and non‐random parameters. The Hybrid CRB

images

where

images

is now partitioned with expectations over θr

images
images

The random parameters θ_r can be nuisance parameters, and the hybrid CRB on the deterministic parameters θ_d is less tight than the CRB for the same parameters [43].

Figure 11.8 Mapping between the complete set y and the incomplete (or data) set x in the EM method.

11.6 Expectation‐Maximization (EM)

Making reference to non‐Bayesian methods in Chapter 7, the MLE of the parameters θ according to the set of observations x is

θ̂_ML = arg max_θ ln p(x|θ)

However, it can happen that the optimization of the log-likelihood function is complex and/or not easily tractable, and/or it depends on too many parameters to be easily handled. The expectation maximization algorithm [45] solves the MLE by postulating that there exists a set of alternative observations y, called complete data, such that it is always possible to get the "true" observations x (here referred to as incomplete data) from a non-invertible (i.e., many-to-one) transformation x = γ(y). The new set y is made in such a way that the MLE wrt the complete set, based on ln p(y|θ), is better constrained and much easier to use, and/or computationally more efficient, or its numerical optimization converges more rapidly.

The choice of the set y (and of the transformation γ(.)) is the degree of freedom of the method and is arbitrary, but in any case y cannot be inferred from the incomplete set as the transformation is not invertible. The EM algorithm alternates the estimation (expectation, or E-step) of the log-likelihood ln p(y|θ) from the observations x (available) and the current estimate θ^(n) (at the n-th iteration), and the maximization (M-step) of this estimated log-likelihood. More specifically, the log-likelihood ln p(y|θ) is estimated using the MMSE approach given (x, θ^(n)); this is the conditional (expected) mean estimate, and the two steps of the EM algorithm are

E-step:  Q(θ, θ^(n)) = E_y[ ln p(y|θ) | x, θ^(n) ]
M-step:  θ^(n+1) = arg max_θ Q(θ, θ^(n))

The E-step is quite simple when the model has additive Gaussian noise and the transformation γ(.) is linear but still not invertible; this case is discussed in detail below.

The EM method converges to a maximum, but not necessarily the global maximum. In any case, it can be proved that, for any choice θ^(n) corresponding to the solution at the n-th iteration, the inequality Q(θ, θ^(n)) ≥ Q(θ^(n), θ^(n)) implies ln p(x|θ) ≥ ln p(x|θ^(n)) for any choice of θ, and the corresponding likelihood over the EM iterations is non-decreasing: ln p(x|θ^(n+1)) ≥ ln p(x|θ^(n)).

11.6.1 EM of the Sum of Signals in Gaussian Noise

The derivation here has been adapted from the original contribution of Feder and Weinstein, 1988 [46] for superimposed signals in Gaussian noise. Let the observation x be the sum of K signals, each dependent on a set of parameters:

x = Σ_{k=1}^{K} s_k(θ_k) + w

where s_k(θ_k) is the k-th signal that depends on the k-th set of parameters θ_k, and w ∼ N(0, C_w). The objective is to estimate the whole set of parameters θ = [θ_1ᵀ, θ_2ᵀ, ..., θ_Kᵀ]ᵀ from the observations by decomposing the superposition of the K signals s(θ) = Σ_k s_k(θ_k) into K simple signals, all by using the EM algorithm.

The complete set from the incomplete x is based on the structure of the problem at hand. More specifically, one can choose as a complete set the collection of each individual signal:

y_k = s_k(θ_k) + w_k   for k = 1, 2, ..., K,   stacked into y = [y_1ᵀ, y_2ᵀ, ..., y_Kᵀ]ᵀ

where the incomplete set is the superposition of the signals

x = Σ_{k=1}^{K} y_k = γ(y)

where x = Σ_k y_k is the non-invertible transformation γ(y). The following condition should hold to preserve the original MLE:

Σ_{k=1}^{K} w_k = w

To simplify the reasoning, each noise term in the complete set is assumed to be uncorrelated wrt the others (E[w_k w_lᵀ] = 0 for k ≠ l), but with the same correlation properties of w except for an arbitrary scaling

E[w_k w_kᵀ] = β_k C_w

so that

Σ_{k=1}^{K} β_k = 1

To summarize, the noise of the complete set is

w̃ = [w_1ᵀ, w_2ᵀ, ..., w_Kᵀ]ᵀ

and the NK × NK covariance matrix is block-diagonal

cov[w̃] = diag(β_1 C_w, β_2 C_w, ..., β_K C_w) = B ⊗ C_w,   with B = diag(β_1, β_2, ..., β_K)

so it is convenient and compact (but not strictly necessary) to use the Kronecker product notation (Section 1.1.1). The inverse (B ⊗ C_w)^{-1} = B^{-1} ⊗ C_w^{-1} is still block-diagonal. From the pdf of y for the complete set:

p(y|θ) ∝ exp( −(1/2) Σ_{k=1}^{K} (1/β_k) (y_k − s_k(θ_k))ᵀ C_w^{-1} (y_k − s_k(θ_k)) )

and the log-likelihood is (up to a scaling factor)

ln p(y|θ) = −(1/2) Σ_{k=1}^{K} (1/β_k) (y_k − s_k(θ_k))ᵀ C_w^{-1} (y_k − s_k(θ_k)) + const

Computing the conditional mean E[ln p(y|θ) | x, θ^(n)] gives (apart from a scaling term that is independent of θ)

images

where ŷ_k(θ^(n)) = E[y_k | x, θ^(n)] is the Bayesian estimate of the complete set from the incomplete x and the current estimate θ^(n). For the optimization wrt θ (M-step), it is convenient to group the terms of the estimated log-likelihood after adding/removing terms that are independent of θ:

images

where c, d, e are scaling terms independent of θ.

The estimation of the complete set y is based on the linear model

x = γ(y) = (1_Kᵀ ⊗ I_N) y

and thus on the basis of the transformation G = 1_Kᵀ ⊗ I_N, where 1_K is the K × 1 vector of ones. The MMSE estimate of y from the incomplete observation set (the only one available) x, for Gaussian terms (with no additional noise in the mapping), is

ŷ(θ^(n)) = s(θ^(n)) + (B ⊗ C_w) Gᵀ ( G (B ⊗ C_w) Gᵀ )^{-1} ( x − G s(θ^(n)) ),   with s(θ) = [s_1(θ_1)ᵀ, ..., s_K(θ_K)ᵀ]ᵀ

The terms of the estimate ŷ(θ(n)) can be expanded by using the property of Kronecker products (Section 1.1.1)

(B ⊗ C_w)(1_K ⊗ I_N) = (B 1_K) ⊗ C_w,   (1_Kᵀ ⊗ I_N)(B ⊗ C_w)(1_K ⊗ I_N) = (1_Kᵀ B 1_K) ⊗ C_w = C_w

substituting and simplifying, it follows that the MMSE estimate of the k th signal of the complete set at the n‐th iteration of the EM method is

ŷ_k(θ^(n)) = s_k(θ_k^(n)) + β_k ( x − Σ_{l=1}^{K} s_l(θ_l^(n)) )

which is the sum of the k-th signal estimated from the parameters θ_k^(n) at the n-th iteration and the original observation x after stripping all the signals estimated up to the n-th iteration; the residual is interpreted as noise and thus weighted by the corresponding scaling term β_k. Re-interpreting the estimated log-likelihood now, it can be seen that

Q(θ, θ^(n)) = −(1/2) Σ_{k=1}^{K} (1/β_k) ( ŷ_k(θ^(n)) − s_k(θ_k) )ᵀ C_w^{-1} ( ŷ_k(θ^(n)) − s_k(θ_k) ) + terms independent of θ

or equivalently, the optimization is fragmented into a set of K local optimizations, one for each set of parameters (M-step):

θ_k^(n+1) = arg min_{θ_k} ( ŷ_k(θ^(n)) − s_k(θ_k) )ᵀ C_w^{-1} ( ŷ_k(θ^(n)) − s_k(θ_k) )   for k = 1, 2, ..., K

The EM method for the superposition of signals in Gaussian noise is general enough to find several applications, such as the estimation of the parameters of a sum of sinusoids as discussed in Section 10.4 for a Doppler-radar system, of a sum of waveforms (in radar or remote sensing systems), or of a sum of pseudo-orthogonal modulations (in spread-spectrum and code division multiple access (CDMA) systems). An example of time of delay estimation is discussed below.

11.6.2 EM Method for the Time of Delay Estimation of Multiple Waveforms

Let the signal be the sum of K delayed echoes as in a radar system (Section 5.6); the waveform of each echo is known but not its amplitude and delay, and the ensemble of delays should be estimated from their superposition

x(t) = Σ_{k=1}^{K} α_k g(t − τ_k) + w(t)

Each waveform can be modeled as

s_k(θ_k) = α_k g(t − τ_k),   with θ_k = [α_k, τ_k]ᵀ

and the complete set of the EM method coincides with the generation model for each individual echo as if it were non-overlapped with the others.

Given the estimates of amplitude α_k^(n) and delay τ_k^(n) at the n-th iteration, the two steps of the EM method are as follows.

E-step (for k = 1, 2, ..., K):

ŷ_k(t; θ^(n)) = α_k^(n) g(t − τ_k^(n)) + β_k ( x(t) − Σ_{l=1}^{K} α_l^(n) g(t − τ_l^(n)) )

M-step (for k = 1, 2, ..., K):

(α_k^(n+1), τ_k^(n+1)) = arg min_{α_k, τ_k} ∫ ( ŷ_k(t; θ^(n)) − α_k g(t − τ_k) )² dt

The M-step is equivalent to the estimation of the delay of one isolated waveform from the signal ŷ_k(θ^(n)), where the effect of the interference with the other waveforms has been (approximately) removed in the E-step (see Figure 11.9 with K = 2 radar echoes). In other words, at every step the superposition of all the other waveforms (in Figure 11.9, of the complementary waveform) is removed, until (at convergence) the estimator of amplitude and delay acts on one isolated waveform and thus it is optimal. Of course, any imperfect cancellation of the other waveforms could cause the EM method to converge to some local maximum. This problem is not avoided at all by the EM method and should be verified case by case.

Figure 11.9 ToD estimation of multiple waveforms by the EM method (K = 2): observation x(t) and the decoupled signals ŷ_1(θ^(n)) and ŷ_2(θ^(n)).
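A compact simulation sketch of the two EM steps for K = 2 overlapping echoes follows; the pulse shape, noise level, amplitudes, delays, and initialization are arbitrary choices, and the delay search is restricted to integer lags (and positive amplitudes) for simplicity:

```python
# Sketch: EM estimation of the delays and amplitudes of K = 2 known pulses in
# white Gaussian noise, following the E-step / M-step of Section 11.6.2.
import numpy as np

rng = np.random.default_rng(6)
N, K = 200, 2
g = np.exp(-0.5 * (np.arange(-25, 26) / 4.0) ** 2)     # known pulse shape

def shifted(tau):
    """Pulse g centered at integer delay tau, zero-padded to length N."""
    s = np.zeros(N)
    idx = np.arange(len(g)) + tau - len(g) // 2
    ok = (idx >= 0) & (idx < N)
    s[idx[ok]] = g[ok]
    return s

a_true, tau_true = np.array([1.0, 0.7]), np.array([60, 140])
x = sum(a_true[k] * shifted(tau_true[k]) for k in range(K))
x = x + 0.1 * rng.normal(size=N)                        # noisy superposition

a, tau = np.array([1.0, 1.0]), np.array([50, 150])      # rough initialization
beta = np.full(K, 1.0 / K)                              # noise split, sums to 1

for it in range(30):
    s_all = [a[k] * shifted(tau[k]) for k in range(K)]
    resid = x - sum(s_all)
    for k in range(K):
        y_k = s_all[k] + beta[k] * resid                # E-step: decoupled signal
        corr = np.array([shifted(t) @ y_k for t in range(N)])
        tau[k] = int(np.argmax(corr))                   # M-step: delay (positive amp.)
        a[k] = corr[tau[k]] / (g @ g)                   # M-step: LS amplitude
print("estimated delays:", tau, " amplitudes:", np.round(a, 2))
```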

This example could be adapted, with minor changes, to the estimation of the frequencies of sinusoids with different frequencies and amplitudes, and the numerical example in Section 10.4 can suggest other applications of the EM method when trying to reduce the estimation of multiple parameters to the estimation of a few parameters carried out in parallel.

11.6.3 Remarks

The EM algorithm alternates the E-step and the M-step with the property that at every step the log-likelihood function does not decrease. However, there is no guarantee that convergence to the global maximum will be obtained, and the ability to converge to the true MLE depends on the initialization θ^(0), as for any iterative optimization method. Also, in some situations the EM algorithm can be painfully slow to converge.

The mapping γ(.) that obtains the incomplete set x from the complete (but not available) data y assumes that there are some missing measurements that complete the incomplete observations x. In some cases, adding these missing data can simplify the maximization of the log-likelihood, but it could slow down the convergence of the EM method. Since the choice of the complete set is a free design aspect of the EM method, the convergence speed depends on how much information on the parameters is contained in the complete set.

Space‐alternating EM (SAGE) has been proposed in [47] to cope with the problem of convergence speed and proved to be successful in a wide class of application problems. A detailed discussion on this matter is left to publications specialized to solve specific applications by EM method.


Figure 11.10 Mixture model of non‐Gaussian pdfs.

Appendix Gaussian Mixture pdf

A non-Gaussian pdf can be represented or approximated as a mixture of pdfs, mostly when multimodal, with a long (or short) tail compared to a Gaussian pdf, or any combination thereof. Given some nice properties of estimators involving Gaussian pdfs, it can be convenient to rewrite (or approximate) a non-Gaussian pdf as the weighted sum of Gaussian pdfs (Gaussian mixture) as follows (Figure 11.10):

(11.9)  p_x(x) = Σ_{k=1}^{K} α_k G(x; μ_k, σ_k²)

Parameters {μ_k, σ_k²} characterize each Gaussian function used as the basis of the Gaussian mixture, and {α_k} are the mixing proportions constrained by Σ_k α_k = 1. The problem of finding the set of parameters {α_k, μ_k, σ_k²} that best fits the non-Gaussian pdf is similar to interpolation over non-uniform sampling, with one additional constraint on pdf normalization

∫ p_x(x) dx = Σ_{k=1}^{K} α_k = 1

Methods to get the approximating parameters {α_k, μ_k, σ_k²} are many, and the method of moments (MoM) is a simple choice if either p_x(x) is analytically known or a set of samples {x} is available. Alternatively, one can assume that the Gaussian mixture is the data generated from a set of independent and disjoint experiments, and one can estimate the parameters by assigning each sample to each of the experiments, as detailed in Section 23.6.

To generate a non-Gaussian process x[n] with Gaussian mixture pdf based on K Gaussian functions, one can assume that there are K independent Gaussian generators (y_1, y_2, ..., y_K), each with y_k ∼ N(μ_k, σ_k²), and the assignment x[n] = y_k[n] is random with mixing probability α_k and mutually exclusive wrt the others. A simple example can help to illustrate. Let K = 3 be the number of normal rvs; the Gaussian mixture is obtained by selecting each sample out of three Gaussian sources, with selection probabilities {α_1, α_2, α_3} collected into the 3 × 1 vector alfa:
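The original listing is not reproduced here; the following sketch (hypothetical means, standard deviations, and mixing probabilities) generates the mixture by drawing, for each sample, the index of the Gaussian source with probabilities alfa:

```python
# Sketch: generation of a Gaussian-mixture process by selecting each sample from
# one of K = 3 Gaussian generators with probabilities collected in alfa.
import numpy as np

rng = np.random.default_rng(7)
alfa = np.array([0.5, 0.3, 0.2])                 # mixing probabilities, sum to 1
mu = np.array([-2.0, 0.0, 3.0])                  # component means
sigma = np.array([0.5, 1.0, 0.8])                # component std deviations
N = 100_000

k = rng.choice(len(alfa), size=N, p=alfa)        # mutually exclusive assignment
x = rng.normal(mu[k], sigma[k])                  # x[n] drawn from the selected source

print(np.mean(x), np.var(x))                     # sample moments of the mixture
```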

The generalization to N‐dimensional Gaussian mixture pdf is straightforward:

p_x(x) = Σ_{k=1}^{K} α_k G(x; μ_k, C_k),   with G(x; μ, C) = exp( −(1/2)(x − μ)ᵀ C^{-1} (x − μ) ) / ( (2π)^{N/2} |C|^{1/2} )

The central moments {μ_k, C_k} are the generalization to N rvs of the Gaussian mixture (11.9).
