Chapter 12  Monte Carlo Methods for Statistical Signal Processing

Xiaodong Wang

Columbia University, New York, USA

12.1    Introduction

In many problems encountered in signal processing, it is possible to describe accurately the underlying statistical model using probability distributions. Statistical inference can then theoretically be performed based on the relevant likelihood function or posterior distribution in a Bayesian framework. However, most problems encountered in applied research require non-Gaussian and/or nonlinear models in order to correctly account for the observed data. In these cases, it is typically impossible to obtain the required statistical estimates of interest, e.g., maximum likelihood, conditional expectation, in closed form as it requires integration and/or maximization of complex multidimensional functions. A standard approach consists of making model simplifications or crude analytic approximations in order to obtain algorithms that can be easily implemented. With the recent availability of high-powered computers, numerical simulation based approaches can now be considered and the full complexity of real problems can be addressed.

These integration and/or optimization problems could be tackled using analytic approximation techniques or deterministic numerical integration/optimization methods. These classical methods are often either not precise and robust enough or too complex to implement. An attractive alternative consists of Monte Carlo algorithms. These algorithms are remarkably flexible and extremely powerful. The basic idea is to draw a large number of samples distributed according to some probability distribution(s) of interest so as to obtain simulation-based consistent estimates. These methods first became popular in physics [1] before literally revolutionizing applied statistics and related fields such as bioinformatics and econometrics in the 1990s [2-5].

Despite their ability to allow statistical inference to be performed for highly complex models, these flexible and powerful methods are not yet well-known in signal processing. This chapter provides a simple yet complete review of these methods in a signal processing context. We describe generic Monte Carlo methods which can be used to perform statistical inference in both batch and sequential contexts. We illustrate their applications in solving inference problems found in digital communications and bioinformatics.

12.1.1    Model-Based Signal Processing

In statistical signal processing, many problems can be formulated as follows. One is interested in obtaining an estimate of an unobserved random variable $X$ taking values in $\mathcal{X}$ given the realization of some statistically related observations $Y = y$. In a model-based context, one has access to the likelihood function giving the probability or PDF $p(y \mid x)$ of $Y = y$ given $X = x$. In this case a standard point estimate of $X$ is given by the maximum likelihood (ML) estimate

$$x_{\text{ML}} = \arg\max_{x \in \mathcal{X}} p(y \mid x).$$

For simple models, it is possible to compute $p(y \mid x)$ in closed form, and the maximization of the probability distribution/PDF can be performed easily. However, when the model includes latent variables or non-Gaussian and/or nonlinear elements, it is often impossible to compute the likelihood in closed form, and/or it is difficult to maximize because it is a multimodal and potentially high-dimensional function. This severely limits the applications of maximum likelihood approaches for complex models.

The problem appears even more clearly when one is interested in performing Bayesian inference [6, 7]. In this context, one sets a prior distribution on X, say p (x), and all (Bayesian) inference relies on the posterior distribution given by Bayes’ theorem

$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$

where

$$p(y) = \int p(y \mid x)\, p(x)\, dx.$$

For example, the minimum mean-square error (MMSE) estimate of $X$ given $Y = y$ is defined by

$$x_{\text{MMSE}} = \int x\, p(x \mid y)\, dx.$$

To be able to compute this estimate, it is necessary to compute two integrals. It is only feasible to perform these calculations analytically for simple statistical models.

12.1.2    Examples

To illustrate these problems, we discuss a few standard signal processing applications here. For the sake of simplicity, we do not distinguish random variables and their realizations from now on. We will use the notation $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)^T$ for any sequence $\{z_n\}$.

Spectral Analysis

Consider the problem of estimating some sinusoids in noise. Let $y_{1:T}$ be an observed vector of $T$ real data samples. The elements of $y_{1:T}$ may be represented by different models $\mathcal{M}_k$, corresponding either to samples of noise only ($k = 0$) or to the superposition of $k$ ($k \geq 1$) sinusoids corrupted by noise; more precisely

$$\mathcal{M}_0: \; y_n = v_{n,k}, \quad k = 0$$
$$\mathcal{M}_k: \; y_n = \sum_{j=1}^{k} \left( a_{c_{j,k}} \cos[\omega_{j,k} n] + a_{s_{j,k}} \sin[\omega_{j,k} n] \right) + v_{n,k}, \quad k \geq 1$$

where $\omega_{j_1,k} \neq \omega_{j_2,k}$ for $j_1 \neq j_2$, and $a_{c_{j,k}}, a_{s_{j,k}}, \omega_{j,k}$ are respectively the amplitudes and the radial frequency of the $j$th sinusoid for the model with $k$ sinusoids. The noise sequence $v_{1:T,k}$ is assumed zero-mean white Gaussian with variance $\sigma_k^2$. In vector-matrix form, we have

$$y_{1:T} = D(\omega_k)\, a_k + v_{k,1:T}$$

where $a_k = (a_{c_{1,k}}, a_{s_{1,k}}, \ldots, a_{c_{k,k}}, a_{s_{k,k}})^T$ and $\omega_k = (\omega_{1,k}, \ldots, \omega_{k,k})^T$. The $T \times 2k$ matrix $D(\omega_k)$ is defined as

$$[D(\omega_k)]_{i,2j-1} = \cos[\omega_{j,k}\, i], \qquad [D(\omega_k)]_{i,2j} = \sin[\omega_{j,k}\, i], \qquad (i = 1, \ldots, T;\; j = 1, \ldots, k).$$

We assume here that the number $k$ of sinusoids and their parameters $(a_k, \omega_k, \sigma_k^2)$ are unknown. Given $y_{1:T}$, our objective is to estimate $(k, a_k, \omega_k, \sigma_k^2)$. It is standard in signal processing to perform parameter estimation and model selection using a (penalized) ML approach. First, an approximate ML estimate of the parameters is found; we emphasize that unfortunately the likelihood is highly nonlinear in the parameters $\omega_k$ and typically admits severe local maxima. Model selection is then performed by maximizing an information criterion (IC) such as AIC (Akaike), BIC (Bayes) or MDL (Minimum Description Length). Note that when the number of observations is small, these criteria can perform poorly. Here, a Bayesian approach is considered instead; see [8] for a motivation of this model. One has

$$a_k \mid \sigma_k^2 \sim \mathcal{N}\!\left(0,\, \sigma_k^2 \delta^2 \left(D^T(\omega_k) D(\omega_k)\right)^{-1}\right), \qquad \sigma_k^2 \sim \mathcal{IG}\!\left(\tfrac{\nu_0}{2}, \tfrac{\gamma_0}{2}\right)$$

(12.1)

and the frequencies $\omega_k$ are independent and uniformly distributed over $(0, \pi)$. Finally, we assume that the prior distribution $p(k)$ is a Poisson distribution of intensity $\Lambda$ truncated at $k_{\max} \triangleq \lfloor (T-1)/2 \rfloor$ (this constraint is added because otherwise the columns of $D(\omega_k)$ would be linearly dependent). The terms $\delta^2$ and $\Lambda$ can be interpreted respectively as an expected signal-to-noise ratio and the expected number of sinusoids.

In this case, it can easily be established that the marginal posterior distribution of the frequencies $\omega_k$ is proportional on $\Omega = \{0, 1, \ldots, k_{\max}\} \times (0, \pi)^k$ to

$$p(\omega_k, k \mid y_{1:T}) \propto \left(\gamma_0 + y_{1:T}^T P_k\, y_{1:T}\right)^{-\frac{T + \nu_0}{2}} \frac{\left(\Lambda / \left((\delta^2 + 1)\pi\right)\right)^k}{k!}$$

(12.2)

where

$$M_k^{-1} = (1 + \delta^{-2})\, D^T(\omega_k) D(\omega_k), \qquad m_k = M_k D^T(\omega_k)\, y_{1:T}, \qquad P_k = I_T - D(\omega_k) M_k D^T(\omega_k).$$

This posterior distribution is highly nonlinear in the parameters ωk. Moreover, one cannot compute explicitly its normalizing constant p (y1:Tk) so it is impossible to compute the Bayes factors to perform model selection. Standard numerical integration techniques could be used but they are typically inefficient as soon as the dimension of the space of interest is high.
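To make the setup concrete, the following minimal Python sketch (ours, not taken from [8]; the prior values, function names, and test signal are illustrative assumptions) evaluates the log of the unnormalized posterior (12.2) for a candidate set of frequencies. Such a routine is the basic building block that any Monte Carlo scheme for this model must call repeatedly.

```python
import numpy as np

def log_posterior(omega, y, delta2=100.0, nu0=0.0, gamma0=0.0, Lam=1.0):
    """Log of the unnormalized posterior (12.2) at frequencies omega (length k)."""
    T, k = len(y), len(omega)
    if k == 0:
        P = np.eye(T)
    else:
        i = np.arange(1, T + 1)[:, None]             # time indices i = 1..T
        D = np.empty((T, 2 * k))
        D[:, 0::2] = np.cos(omega[None, :] * i)      # columns 2j - 1
        D[:, 1::2] = np.sin(omega[None, :] * i)      # columns 2j
        M = np.linalg.inv((1.0 + 1.0 / delta2) * D.T @ D)
        P = np.eye(T) - D @ M @ D.T                  # P_k as defined above
    quad = gamma0 + y @ P @ y
    log_k_fact = sum(np.log(j) for j in range(1, k + 1))
    return (-(T + nu0) / 2.0) * np.log(quad) \
        + k * np.log(Lam / ((delta2 + 1.0) * np.pi)) - log_k_fact

# Two noisy sinusoids: the true frequencies should beat the noise-only model.
rng = np.random.default_rng(0)
t = np.arange(1, 65)
y = np.cos(0.6 * t) + 0.5 * np.sin(1.3 * t) + 0.3 * rng.standard_normal(64)
print(log_posterior(np.array([]), y), log_posterior(np.array([0.6, 1.3]), y))
```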

Optimal Filtering in State-Space Models

Consider an unobserved Markov process $\{x_n\}_{n \geq 1}$ of initial density $\mu$ and transition density $x_n \mid x_{n-1} \sim f(\cdot \mid x_{n-1})$. The observations $\{y_n\}_{n \geq 1}$ are conditionally independent given $\{x_n\}_{n \geq 1}$ with marginal density $y_n \mid x_n \sim g(\cdot \mid x_n)$. This class of models is extremely wide. For example, it includes

$$x_n = \varphi(x_{n-1}, v_n), \qquad y_n = \psi(x_n, w_n)$$

where $\varphi$ and $\psi$ are two nonlinear deterministic mappings and $\{v_n\}_{n \geq 2}$ and $\{w_n\}_{n \geq 1}$ are two mutually independent sequences of independent random variables.

All inference on x1:n based on y1:n is based on the posterior distribution

$$p(x_{1:n} \mid y_{1:n}) = \frac{p(y_{1:n} \mid x_{1:n})\, p(x_{1:n})}{\int p(y_{1:n} \mid x_{1:n})\, p(x_{1:n})\, dx_{1:n}}$$

where

$$p(x_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}), \qquad p(y_{1:n} \mid x_{1:n}) = \prod_{k=1}^{n} g(y_k \mid x_k).$$

This posterior distribution satisfies the following recursion

$$p(x_{1:n} \mid y_{1:n}) = \frac{f(x_n \mid x_{n-1})\, g(y_n \mid x_n)}{p(y_n \mid y_{1:n-1})}\, p(x_{1:n-1} \mid y_{1:n-1}).$$

Unfortunately, except when $\{x_n\}_{n \geq 1}$ takes values in a finite state space (hidden Markov model techniques) or the model is linear Gaussian (Kalman filtering techniques), it is impossible to come up with a closed-form expression for this sequence of posterior distributions. Many suboptimal methods have been proposed to approximate it, e.g., the extended Kalman filter and Gaussian sum approximations. However, these methods tend to be unreliable as soon as the model includes strong nonlinearities and/or non-Gaussianity. Deterministic numerical integration methods have also been proposed, but they are complex to implement, inflexible, and realistically can only be applied to models where $\{x_n\}_{n \geq 1}$ takes values in $\mathbb{R}$ or $\mathbb{R}^2$.

DNA Sequence Motif Discovery

Efforts by various genomic projects have steadily expanded the pool of sequenced deoxyribonucleic acid (DNA) data. Motifs, or DNA patterns found in different locations within the genome, are often of interest to biologists. By seeking out these similarities exhibited in sequences, we can further our knowledge of the functions and evolution of these sequences [9]. Let $S = \{s_1, s_2, \ldots, s_T\}$, with $s_t = [s_{t1}, \ldots, s_{tL}]$, be the set of DNA sequences of length $L$ in which we wish to find a common motif. Let us assume that a motif of length $w$ is present in each one of the sequences. The distribution of the motif is described by the $4 \times w$ position weight matrix (PWM) $\Theta = [\theta_1, \theta_2, \ldots, \theta_w]$, where the column vector $\theta_j = [\theta_{j1}, \ldots, \theta_{j4}]^T$, $j = 1, \ldots, w$, is the probability distribution of the nucleotides $\{A, C, G, T\}$ at the $j$-th position of the PWM. The remaining non-motif nucleotides are assumed to follow a Markovian distribution with probabilities given by $\Theta_0$.

To formulate the motif-finding problem we use a state-space model, where the states represent the locations of the first nucleotides of the different occurrences of the motif in the sequences, whereas the observation for the state at step $t$ is the entire nucleotide sequence $s_t$. Since the last $w - 1$ nucleotides in a sequence are not valid locations for the beginning of a motif of length $w$, at step $t$, $t = 1, \ldots, T$, the state, denoted as $x_t$, takes values in the set $\mathcal{X} = \{1, 2, \ldots, L_m\}$, where $L_m = L - w + 1$.

Let $a_{t,x_t}$ be a sequence fragment of length $w$ from $s_t$ starting at position $x_t$ in $s_t$, and denote $a^c_{t,x_t}$ as the remaining fragment of $s_t$ with $a_{t,x_t}$ removed. For example, for $s_t = [\text{AAAAGGGGAAAA}]$ and $x_t = 5$ with $w = 4$, $a_{t,x_t} = [\text{GGGG}]$ and $a^c_{t,x_t} = [\text{AAAAAAAA}]$. Let us further define a vector $\mathbf{n}(a) = [n_1, n_2, n_3, n_4]$, where $n_i$, $i = 1, \ldots, 4$, denotes the count of the $i$-th nucleotide type in the sequence fragment $a$. Given the vectors $\theta = [\theta_1, \ldots, \theta_4]$ and $\mathbf{n} = [n_1, \ldots, n_4]$, we define

$$\theta^{\mathbf{n}} \triangleq \prod_{j=1}^{4} \theta_j^{n_j}.$$

(12.3)

In DNA sequences, a nucleotide is often influenced by the surrounding nucleotides. We assume for our system model a third-order Markov model for the non-motif nucleotides in the sequence. Let us denote $P3_{t,x_t}$ as the probability of $a^c_{t,x_t}$. For example, if $a^c_{t,x_t} = [\text{ATAAG}]$, the probability of $a^c_{t,x_t}$ is given by

$$P3_{t,x_t} = p(A)\, p(T \mid A)\, p(A \mid A, T)\, p(A \mid A, T, A)\, p(G \mid T, A, A).$$

(12.4)

In general, the zeroth- to third-order Markov chain probabilities for the background non-motif nucleotides can be averaged over a large genomic region and are assumed known; we denote them by $\Theta_0$. To perform motif discovery, $\Theta_0$ can be given as a known parameter by the user, or default values can be used. Since the nucleotides located in the motif are mutually independent and independent of the non-motif nucleotides, given the PWM $\Theta$, the background distribution $\Theta_0$, and the state at time $t$, the distribution of the observed sequence $s_t$ is given as follows:

$$p(s_t \mid x_t = i, \Theta) = P3_{t,i} \prod_{k=1}^{w} \theta_k^{\mathbf{n}(a_{t,i}(k))} \triangleq \ell(s_t; i, \Theta),$$

(12.5)

where $a_{t,i}(k)$ is the $k$-th element of the sequence fragment $a_{t,i}$, and $\mathbf{n}(a_{t,i}(k))$ is a $1 \times 4$ vector of zeros except at the position corresponding to the nucleotide $a_{t,i}(k)$, where it is one.

From the discussion above, we formulate the inference problem as follows. Let us denote the state realizations up to time T as x ≜ [x1, x2, ⋯, xT] and similarly the sequences up to time T as S ≜ [s1, s2, ⋯, sT], with the unknown parameter Θ, the position weight matrix. Given the sequences S and the Markovian non-motif nucleotide distribution Θ0, we wish to estimate the state realizations x, which are the starting locations of the motif in each sequence, and the position weight matrix Θ, which describes the statistics of the motif.

Remark: All the problems described above require computing and/or maximizing high-dimensional probability distributions. It is possible to come up with deterministic techniques to approximate these distributions. However, as soon as the problems become very complex, the performance of these methods typically deteriorates quickly. In this chapter, we advocate Monte Carlo methods as a powerful set of techniques that can provide satisfactory answers to all these problems.

12.2    Monte Carlo Methods

Let us consider the probability distribution or PDF $\pi(x)$, where $x \in \mathcal{X}$. We will assume from now on that $\pi(x)$ is known pointwise up to a normalizing constant, i.e.,

$$\pi(x) = Z^{-1} \tilde{\pi}(x)$$

where $\tilde{\pi}(x)$ is known pointwise but the normalizing constant

$$Z = \int_{\mathcal{X}} \tilde{\pi}(x)\, dx$$

is unknown. Note this assumption is satisfied in all the examples discussed in the previous section if x corresponds to all the unknown variables/parameters.

In most applications of interest, the space $\mathcal{X}$ is typically high-dimensional; say $\mathcal{X} = \mathbb{R}^{1000}$ or $\mathcal{X} = \{0, 1\}^{1000}$. We are interested in the following generic problems.

•  Computing integrals. For any test function $\varphi : \mathcal{X} \to \mathbb{R}$, we want to compute

$$E_\pi(\varphi) = \int_{\mathcal{X}} \varphi(x)\, \pi(x)\, dx.$$

(12.6)

•  Marginal distributions. Assume $x = (x_1, x_2) \in \mathcal{X}_1 \times \mathcal{X}_2$; then we want to compute the marginal distribution

$$\pi(x_1) = \int_{\mathcal{X}_2} \pi(x_1, x_2)\, dx_2.$$

(12.7)

•  Optimization. Given π (x), we are interested in finding

$$\arg\max_{x \in \mathcal{X}} \pi(x) = \arg\max_{x \in \mathcal{X}} \tilde{\pi}(x).$$

(12.8)

•  Integration/Optimization. Given the marginal distribution (12.7), we want to compute

$$\arg\max_{x_1 \in \mathcal{X}_1} \pi(x_1) = \arg\max_{x_1 \in \mathcal{X}_1} \tilde{\pi}(x_1)$$

(12.9)

Assume it is possible to obtain a large number $N$ of independent random samples $\{x^{(i)}\}_{i=1}^{N}$ distributed according to $\pi$. The Monte Carlo method approximates $\pi$ by the following point-mass measure

$$\hat{\pi}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta\left(x - x^{(i)}\right).$$

(12.10)

It follows that an estimate of (12.6) is given by

$$\hat{E}_\pi(\varphi) = \int_{\mathcal{X}} \varphi(x)\, \hat{\pi}(x)\, dx = \frac{1}{N} \sum_{i=1}^{N} \varphi\left(x^{(i)}\right).$$

(12.11)
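As a concrete illustration, here is a minimal Python sketch of the estimate (12.11); the choice of $\pi$ as a standard Gaussian and of the test function are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.standard_normal(N)          # x^(i) ~ pi = N(0, 1)
phi = lambda x: x**2                # test function; E_pi(phi) = 1
est = phi(x).mean()                 # (1/N) sum_i phi(x^(i)), eq. (12.11)
print(est)                          # close to 1, with O(1/sqrt(N)) error
```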

Marginal distributions can also be estimated straightforwardly as

$$\hat{\pi}(x_1) = \int_{\mathcal{X}_2} \hat{\pi}(x_1, x_2)\, dx_2 = \int_{\mathcal{X}_2} \frac{1}{N} \sum_{i=1}^{N} \delta\left(x_1 - x_1^{(i)}, x_2 - x_2^{(i)}\right) dx_2 = \frac{1}{N} \sum_{i=1}^{N} \delta\left(x_1 - x_1^{(i)}\right).$$

(12.12)

Since the samples $\{x^{(i)}\}$ are distributed according to $\pi$, a significant proportion of them will lie in the vicinity of the modes of $\pi$, so a reasonable estimate of (12.8) is

$$\arg\max_{\{x^{(i)}\}} \tilde{\pi}\left(x^{(i)}\right).$$

(12.13)

Optimizing a marginal distribution is more difficult: one cannot use $\arg\max_{\{x_1^{(i)}\}} \tilde{\pi}(x_1^{(i)})$, as the marginal distribution cannot be computed even up to a normalizing constant. In the scenario where $\pi(x_1 \mid x_2)$ is known analytically, an alternative to (12.12) is

$$\hat{\pi}(x_1) = \int_{\mathcal{X}_2} \pi(x_1 \mid x_2)\, \hat{\pi}(x_2)\, dx_2 = \int_{\mathcal{X}_2} \pi(x_1 \mid x_2)\left(\frac{1}{N} \sum_{i=1}^{N} \delta\left(x_2 - x_2^{(i)}\right)\right) dx_2 = \frac{1}{N} \sum_{i=1}^{N} \pi\left(x_1 \mid x_2^{(i)}\right).$$

(12.14)

It is then possible to estimate (12.9) by $\arg\max_{\{x_1^{(i)}\}} \hat{\pi}(x_1^{(i)})$. Note, however, that the computational complexity of this approach is unfortunately very high, as evaluating (12.14) pointwise involves $N \gg 1$ terms. Alternative techniques will be discussed later.

A natural question to ask is why the Monte Carlo method is attractive. The typical answer is that if one considers (12.11), then this estimate has good properties; i.e., it is clearly unbiased and one can easily show that its variance satisfies

$$\text{var}\left\{\hat{E}_\pi(\varphi)\right\} = \frac{\int \varphi^2(x)\, \pi(x)\, dx - E_\pi^2(\varphi)}{N}.$$

(12.15)

The truly remarkable property of this estimate is that the rate of convergence to zero of its variance is independent of the space $\mathcal{X}$ (i.e., it can be $\mathbb{R}$ or $\mathbb{R}^{10000}$), whereas the approximation error of all deterministic integration methods decreases at a rate that degrades severely as the dimension of the space increases. Note, however, that this does not imply that Monte Carlo methods will always outperform deterministic methods, as the numerator of (12.15) can be huge. Still, Monte Carlo tends to be much more flexible and powerful.

Nevertheless, these methods rely on the assumption that we are able to simulate samples $\{x^{(i)}\}$ from $\pi$. The next question is how to obtain such samples.

12.3    Markov Chain Monte Carlo (MCMC) Methods

12.3.1    General MCMC Algorithms

MCMC is a class of algorithms that allow one to draw (pseudo) random samples from an arbitrary target probability distribution $p(x)$ known up to a normalizing constant. The basic idea behind these algorithms is that one can achieve sampling from $p$ by running a Markov chain whose equilibrium distribution is exactly $p$. Two basic types of MCMC algorithms, the Metropolis algorithm and the Gibbs sampler, have been widely used in diverse fields. The validity of both algorithms can be established using basic Markov chain theory.

Metropolis-Hastings Algorithm

Let p(x) = c exp{−f (x)} be the target probability distribution from which we want to simulate random draws. The normalizing constant c may be unknown to us. Metropolis et al. [1] first introduced the fundamental idea of evolving a Markov process in Monte Carlo sampling, which was later generalized by Hastings [10]. Starting with any configuration x(0), the algorithm evolves from the current state x(t) = x to the next state x(t+1) as follows:

Algorithm 12.3.1 [Metropolis-Hastings algorithm]

•  Propose a random “perturbation” of the current state, i.e., $x \to x'$, where $x'$ is generated from a transition function $T(x \to x')$, which is nearly arbitrary (of course, some are better than others in terms of efficiency) and is completely specified by the user.

•  Compute the Metropolis ratio

$$r(x, x') = \frac{p(x')\, T(x' \to x)}{p(x)\, T(x \to x')}.$$

(12.16)

•  Generate a random number $u \sim \text{uniform}(0, 1)$. Let $x^{(t+1)} = x'$ if $u \leq r(x, x')$, and let $x^{(t+1)} = x^{(t)}$ otherwise. (A minimal simulation sketch of this procedure is given below.)
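The following minimal Python sketch (ours; the random-walk proposal, step size, and target density are illustrative assumptions) implements Algorithm 12.3.1 for a one-dimensional target known up to a constant. Because the Gaussian random-walk proposal is symmetric, $T(x \to x') = T(x' \to x)$ and the ratio (12.16) reduces to $p(x')/p(x)$; working in the log domain avoids numerical underflow.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_iter, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, lp = x0, log_p(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        x_new = x + step * rng.standard_normal()     # propose x -> x'
        lp_new = log_p(x_new)
        if np.log(rng.uniform()) <= lp_new - lp:     # accept if u <= r(x, x')
            x, lp = x_new, lp_new
        chain[t] = x                                 # on rejection keep x^(t)
    return chain

# Bimodal target known up to a constant: p(x) proportional to exp(-(x^2-4)^2/4).
chain = metropolis_hastings(lambda x: -(x**2 - 4.0)**2 / 4.0, 0.0, 50_000)
print(chain.mean(), chain.std())                     # mean near 0 by symmetry
```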

It is easy to prove that the M-H transition rule results in an “actual” transition function $A(x, y)$ (it is different from $T$ because an acceptance/rejection step is involved) that satisfies the detailed balance condition

$$p(x)\, A(x, y) = p(y)\, A(y, x),$$

(12.17)

which necessarily leads to a reversible Markov chain with p(x) as its invariant distribution.

The Metropolis algorithm has been extensively used in statistical physics over the past forty years and is the cornerstone of all MCMC techniques recently adopted and generalized in the statistics community. Another class of MCMC algorithms, the Gibbs sampler [11], differs from the Metropolis algorithm in that it uses conditional distributions based on p(x) to construct Markov chain moves.

Gibbs Sampler

Suppose $\mathbf{x} = (x_1, \ldots, x_d)$, where $x_i$ is either a scalar or a vector. In the Gibbs sampler, one systematically or randomly chooses a coordinate, say $x_i$, and then updates its value with a new sample $x_i'$ drawn from the conditional distribution $p(\cdot \mid \mathbf{x}_{[-i]})$. Algorithmically, the Gibbs sampler can be implemented as follows:

Algorithm 12.3.2 [Gibbs sampler]

Let the current state be $\mathbf{x}^{(t)} = (x_1^{(t)}, \ldots, x_d^{(t)})$.

For $i = 1, \ldots, d$, we draw $x_i^{(t+1)}$ from the conditional distribution

$$p\left(x_i \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_d^{(t)}\right).$$

Alternatively, one can randomly scan the coordinate to be updated. Suppose currently $\mathbf{x}^{(t)} = (x_1^{(t)}, \ldots, x_d^{(t)})$. Then one can randomly select $i$ from the index set $\{1, \ldots, d\}$ according to a given probability vector $(\pi_1, \ldots, \pi_d)$, draw $x_i^{(t+1)}$ from the conditional distribution $p(\cdot \mid \mathbf{x}_{[-i]}^{(t)})$, and let $\mathbf{x}_{[-i]}^{(t+1)} = \mathbf{x}_{[-i]}^{(t)}$. A minimal sketch for a case where both conditionals are available in closed form follows.
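The following minimal Python sketch (ours; the bivariate Gaussian target with correlation $\rho$ is an illustrative assumption) runs the systematic-scan Gibbs sampler of Algorithm 12.3.2 in two dimensions, where each conditional $p(x_i \mid x_{[-i]})$ is a one-dimensional Gaussian.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    s = np.sqrt(1.0 - rho**2)                      # conditional std deviation
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()  # draw x1 | x2
        x2 = rho * x1 + s * rng.standard_normal()  # draw x2 | x1 (just updated)
        out[t] = x1, x2
    return out

samples = gibbs_bivariate_normal(rho=0.9, n_iter=20_000)
print(np.corrcoef(samples.T)[0, 1])                # close to 0.9
```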

It is easy to check that every individual conditional update leaves $p$ invariant. Suppose currently $\mathbf{x}^{(t)} \sim p$. Then $\mathbf{x}_{[-i]}^{(t)}$ follows its marginal distribution under $p$. Thus,

$$p\left(x_i^{(t+1)} \mid \mathbf{x}_{[-i]}^{(t)}\right) p\left(\mathbf{x}_{[-i]}^{(t)}\right) = p\left(x_i^{(t+1)}, \mathbf{x}_{[-i]}^{(t)}\right),$$

(12.18)

which implies that the joint distribution of $(\mathbf{x}_{[-i]}^{(t)}, x_i^{(t+1)})$ is unchanged at $p$ after one update.

The Gibbs sampler’s popularity in the statistics community stems from its extensive use of conditional distributions in each iteration. The data augmentation method [12] first linked the Gibbs sampling structure with missing-data problems and EM-type algorithms. The Gibbs sampler was further popularized by [13], where it was pointed out that the conditionals needed in Gibbs iterations are commonly available in many Bayesian and likelihood computations. Under regularity conditions, one can show that the Gibbs sampler chain converges geometrically, with a convergence rate related to how the variables correlate with each other [14]. Therefore, grouping highly correlated variables together in the Gibbs update can greatly speed up the sampler.

Other techniques - A main problem with all MCMC algorithms is that they may, for some problems, move very slowly in the configuration space or may be trapped in a local mode. This phenomenon is generally called slow mixing of the chain. When the chain is slow-mixing, estimation based on the resulting Monte Carlo samples becomes very inaccurate. Some recent techniques suitable for designing more efficient MCMC samplers include parallel tempering [15], the multiple-try method [16], and evolutionary Monte Carlo [17].

12.3.2    Applications of MCMC in Digital Communications

In this section, we discuss MCMC-based receiver signal processing algorithms for several typical communication channels, when the channel conditions are unknown a priori.

MCMC Detectors in AWGN Channels

We start with the simplest channel model in digital communications – the additive white Gaussian noise (AWGN) channel. After filtering and sampling of the continuous-time received waveform, the discrete-time received signal in such a channel is given by

$$y_t = \phi x_t + v_t, \qquad t = 1, 2, \ldots, n,$$

(12.19)

where $y_t$ is the received signal at time $t$; $x_t \in \{+1, -1\}$ is the transmitted binary symbol at time $t$; $\phi$ is the received signal amplitude; and $v_t$ is an independent Gaussian noise sample with zero mean and variance $\sigma^2$, i.e., $v_t \sim \mathcal{N}(0, \sigma^2)$. Denote $X \triangleq [x_1, \ldots, x_n]$ and $Y \triangleq [y_1, \ldots, y_n]$. Our problem is to estimate the a posteriori probability distribution of each symbol based on the received signal $Y$, without knowing the channel parameters $(\phi, \sigma^2)$. The solution to this problem based on the Gibbs sampler is as follows. Assuming a uniform prior for $\phi$, a uniform prior for $X$ (on $\{-1, +1\}^n$) and an inverse chi-square prior for $\sigma^2$, $\sigma^2 \sim \chi^{-2}(\nu, \lambda)$, the complete posterior distribution is given by

$$p(X, \phi, \sigma^2 \mid Y) \propto p(Y \mid X, \phi, \sigma^2)\, p(\phi)\, p(\sigma^2)\, p(X).$$

(12.20)

The Gibbs sampler starts with arbitrary initial values of $X^{(0)}$ and, for $k = 0, 1, \ldots$, iterates between the following two steps.

Algorithm 12.3.3 [Two-component Gibbs detector in AWGN channel]

•  Draw a sample $(\phi^{(k+1)}, \sigma^{2(k+1)})$ from the conditional distribution (given $X^{(k)}$)

$$p(\phi, \sigma^2 \mid X^{(k)}, Y) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{t=1}^{n}\left(y_t - \phi x_t^{(k)}\right)^2\right] (\sigma^2)^{-\frac{\nu+2}{2}} \exp\left(-\frac{\nu\lambda}{2\sigma^2}\right) \triangleq \pi^{(k+1)}(\phi \mid \sigma^2)\, \pi^{(k+1)}(\sigma^2),$$

(12.21)

where

$$\pi^{(k+1)}(\sigma^2) \sim \chi^{-2}\left(\nu + n - 1,\; \frac{1}{\nu + n - 1}\left[\nu\lambda + \sum_{t=1}^{n} y_t^2 - \frac{1}{n}\left(\sum_{t=1}^{n} y_t x_t^{(k)}\right)^2\right]\right),$$

(12.22)

and

$$\pi^{(k+1)}(\phi \mid \sigma^2) \sim \mathcal{N}\left(\frac{1}{n} \sum_{t=1}^{n} y_t x_t^{(k)},\; \frac{\sigma^2}{n}\right).$$

(12.23)

•  Draw a sample $X^{(k+1)}$ from the following conditional distribution, given $(\phi^{(k+1)}, \sigma^{2(k+1)})$,

$$p\left(X \mid \phi^{(k+1)}, \sigma^{2(k+1)}, Y\right) = \prod_{t=1}^{n} p\left(x_t \mid y_t, \phi^{(k+1)}, \sigma^{2(k+1)}\right) \propto \prod_{t=1}^{n} \exp\left[-\frac{1}{2\sigma^{2(k+1)}}\left(y_t - \phi^{(k+1)} x_t\right)^2\right].$$

(12.24)

That is, for $t = 1, \ldots, n$ and $b \in \{+1, -1\}$, draw $x_t^{(k+1)}$ from

$$P\left(x_t^{(k+1)} = b\right) = \left[1 + \exp\left(-\frac{2 b\, \phi^{(k+1)} y_t}{\sigma^{2(k+1)}}\right)\right]^{-1}.$$

(12.25)
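The following minimal Python sketch (ours; the prior parameters $\nu$, $\lambda$, the SNR, and the data length are illustrative assumptions) implements the two steps of Algorithm 12.3.3. The inverse chi-square draw in (12.22) is realized as $\sigma^2 = S/\chi^2_{\nu+n-1}$, where $S$ is the bracketed term. Because of the phase ambiguity discussed below, the chain may lock onto $-X$ instead of $X$, so the final check compares up to a global sign flip.

```python
import numpy as np

def gibbs_detector_awgn(y, n_iter=200, nu=2.0, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    x = rng.choice([-1.0, 1.0], size=n)              # arbitrary X^(0)
    draws = []
    for _ in range(n_iter):
        # Step 1: draw (phi, sigma^2) | X, Y via (12.22)-(12.23).
        c = y @ x
        S = nu * lam + y @ y - c**2 / n              # bracket in (12.22)
        sigma2 = S / rng.chisquare(nu + n - 1)       # inverse chi-square draw
        phi = c / n + np.sqrt(sigma2 / n) * rng.standard_normal()
        # Step 2: draw X | phi, sigma^2, Y symbol-by-symbol via (12.25).
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * phi * y / sigma2))
        x = np.where(rng.uniform(size=n) < p_plus, 1.0, -1.0)
        draws.append(x.copy())
    return np.array(draws)

# Illustrative run: n = 200 BPSK symbols at moderate SNR.
rng = np.random.default_rng(1)
x_true = rng.choice([-1.0, 1.0], size=200)
y = 1.0 * x_true + 0.5 * rng.standard_normal(200)
draws = gibbs_detector_awgn(y)
x_hat = np.sign(draws[100:].mean(axis=0))            # discard burn-in
m = int(np.sum(x_hat == x_true))
print("matches (up to the sign ambiguity):", max(m, 200 - m))
```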

It is worthwhile to note that one can integrate out $\phi$ and $\sigma^2$ in (12.20) analytically to get the marginal target distribution of $X$, which can provide some further insight. More precisely, we have

$$\pi(X) \propto \left[\nu\lambda + \sum_{t=1}^{n} y_t^2 - \frac{1}{n}\left(\sum_{t=1}^{n} x_t y_t\right)^2\right]^{-(n+\nu)/2}.$$

(12.26)

This defines a distribution on the vertices of an $n$-dimensional cube. The mode of this distribution is clearly at $\hat{X}$ and $-\hat{X}$, where $\hat{X} = \operatorname{sign}(Y)$. Intuitively, this is the “obvious solution” in this simple setting, but it is not easy to generalize. Based on (12.26), we can derive another Gibbs sampling algorithm as follows.

Algorithm 12.3.4 [One-component Gibbs detector in AWGN channel]

•  Choose $t$ from $1, \ldots, n$ by either the random scan (i.e., $t$ is chosen at random) or the deterministic scan (i.e., one cycles $t$ from 1 to $n$ systematically). Update $X^{(k)}$ to $X^{(k+1)}$, where $x_s^{(k+1)} = x_s^{(k)}$ for $s \neq t$ and $x_t^{(k+1)}$ is drawn from the conditional distribution

$$\pi\left(x_t = b \mid X_{[-t]}^{(k)}\right) = \frac{\pi\left(x_t = b, X_{[-t]}^{(k)}\right)}{\pi\left(x_t = b, X_{[-t]}^{(k)}\right) + \pi\left(x_t = -b, X_{[-t]}^{(k)}\right)},$$

(12.27)

where $\pi(X)$ is as in (12.26). When the variance $\sigma^2$ is known,

$$\pi(X) \propto \exp\left\{\frac{1}{2n\sigma^2}\left(\sum_{t=1}^{n} x_t y_t\right)^2\right\}.$$

(12.28)

Besides the two Gibbs samplers just described, an attractive alternative is the Metropolis algorithm applied directly to (12.26). Suppose $X^{(k)} = (x_1^{(k)}, \ldots, x_n^{(k)})$. At step $k + 1$, the Metropolis algorithm proceeds as follows:

Algorithm 12.3.5 [Metropolis detector in AWGN channel]

•  Choose $t \in \{1, \ldots, n\}$ either by the random scan or by the deterministic scan. Define $Z = (z_1, \ldots, z_n)$, where $z_t = -x_t^{(k)}$ and $z_s = x_s^{(k)}$ for $s \neq t$. Generate independently $U \sim \text{uniform}(0, 1)$. Let $X^{(k+1)} = Z$ if

$$U \leq \min\left\{1,\; \frac{\pi(Z)}{\pi(X^{(k)})}\right\},$$

(12.29)

and let $X^{(k+1)} = X^{(k)}$ otherwise. (A minimal sketch appears below.)
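A minimal Python sketch of the Metropolis detector (ours; the prior parameters and test data are illustrative assumptions) based directly on (12.26) follows; each step proposes flipping one symbol and accepts with probability (12.29), computed in the log domain.

```python
import numpy as np

def log_pi(x, y, nu=2.0, lam=0.1):
    n = len(y)
    S = nu * lam + y @ y - (x @ y)**2 / n          # bracketed term in (12.26)
    return -(n + nu) / 2.0 * np.log(S)

def metropolis_detector(y, sweeps=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    x = rng.choice([-1.0, 1.0], size=n)
    lp = log_pi(x, y)
    for _ in range(sweeps):
        for t in range(n):                         # deterministic scan
            x[t] = -x[t]                           # proposal Z: flip symbol t
            lp_new = log_pi(x, y)
            if np.log(rng.uniform()) <= lp_new - lp:
                lp = lp_new                        # accept the flip
            else:
                x[t] = -x[t]                       # reject: restore symbol
    return x

rng = np.random.default_rng(2)
x_true = rng.choice([-1.0, 1.0], size=100)
y = x_true + 0.5 * rng.standard_normal(100)
print(np.mean(metropolis_detector(y) == x_true))   # near 0 or 1 (sign ambiguity)
```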

This Metropolis algorithm differs from the one-component Gibbs detector only slightly, in the way of updating $x_t^{(k)}$ to $x_t^{(k+1)}$. That is, the Metropolis algorithm always forces the change (to $-x_t^{(k)}$) unless it is rejected, whereas the Gibbs sampler “voluntarily” selects whether to make the change so that no rejection is incurred. It is known that when the random scan is used, the Metropolis rule always results in a smaller second-largest eigenvalue (not in absolute value) than the corresponding Gibbs sampler [4]. Thus, when the target distribution is relatively peaked (high signal-to-noise ratio (SNR)) the Metropolis algorithm is slightly preferable. However, the Metropolis algorithm may have a large (in absolute value) negative eigenvalue when the target distribution is flatter (low SNR). In practice, however, the large negative eigenvalue is not a serious concern. No clear theory is available when a deterministic scan is used for updating. Simulations suggest that a similar result to that of the random scan samplers seems to hold well.

To overcome the phase ambiguity, one can either restrict $\phi$ to be positive or, alternatively, use differential encoding. Let the information sequence be $s_t \in \{+1, -1\}$, $t = 2, \ldots, n$. In differential coding, we construct the transmitted sequence $x_t \in \{+1, -1\}$, $t = 1, \ldots, n$, such that $x_t = x_{t-1} s_t$. To obtain Monte Carlo draws from the posterior distribution $p(S, \phi, \sigma^2 \mid Y)$, we use one of the MCMC algorithms to generate a Markov chain on $(X, \phi, \sigma^2)$ and then convert the samples of $X$ to $S$ using $s_t^{(k)} = x_t^{(k)} x_{t-1}^{(k)}$, $t = 2, \ldots, n$. Note that in this way $X$ and $-X$ result in the same $S$. Since $\{X^{(k)}\}$ is a Markov chain, so is $\{S^{(k)}\}$. The transition probability from $S^{(k)}$ to $S^{(k+1)}$ is given by

$$P\left(S^{(k+1)} \mid S^{(k)}\right) = P\left(X^{(k+1)} \mid X^{(k)}\right) + P\left(-X^{(k+1)} \mid X^{(k)}\right),$$

(12.30)

where both $X^{(k+1)}$ and $-X^{(k+1)}$ result in $S^{(k+1)}$, and $X^{(k)}$ results in $S^{(k)}$. Note that both $X^{(k)}$ and $-X^{(k)}$ result in $S^{(k)}$, but since $P(-X^{(k+1)} \mid -X^{(k)}) = P(X^{(k+1)} \mid X^{(k)})$, either one can be used.

By denoting $s_1 = x_1$ and $S \triangleq [s_1, s_2, \ldots, s_n]$, we can modify (12.26) to give the marginal target distribution of the $s_t$:

$$\pi(s_1, \ldots, s_n) \propto \left[\nu\lambda + \sum_{t=1}^{n} y_t^2 - \frac{1}{n}\left(\sum_{t=1}^{n} y_t \prod_{i=1}^{t} s_i\right)^2\right]^{-(n+\nu)/2}.$$

(12.31)

Clearly, $s_1$ is independent of all the other $s_t$ and has a uniform marginal distribution.

It is trickier to implement an efficient Gibbs sampler or Metropolis algorithm based on (12.31). For example, the single-site update method (i.e., changing one $s_t$ at a time) may be inefficient: when we propose to change $s_t$ to $-s_t$, the signs of $x_t, x_{t+1}, \ldots$ all have to change. This may result in a very small acceptance rate. Since a single update from $x_t$ to $-x_t$ corresponds to changing $(s_t, s_{t+1})$ to $(-s_t, -s_{t+1})$, we can instead employ the proposals

$$(s_t, s_{t+1}) \to (-s_t, -s_{t+1}), \qquad t < n,$$

and $s_n \to -s_n$ for the distribution (12.31).

MCMC Equalizers in ISI Channels

Next we consider the Gibbs sampler for blind equalization in an intersymbol interference (ISI) channel [18, 19]. After filtering and sampling the continuous-time received waveform, the discrete-time received signal in such a channel is given by

$$y_t = \sum_{s=0}^{q} \phi_s x_{t-s} + v_t, \qquad t = 1, 2, \ldots, n,$$

(12.32)

where (q + 1) is the channel order; ϕi is the value of the i-th channel tap, i = 0, …, q; xt ∈ {+1, −1} is the transmitted binary symbol at time t; and vt ~ N(0, σ2) is an independent Gaussian noise sample at time t.

Let $X \triangleq [x_{1-q}, \ldots, x_n]$, $Y \triangleq [y_1, \ldots, y_n]$, and $\phi \triangleq [\phi_0, \ldots, \phi_q]^T$. With a uniform prior for $\phi$, a uniform prior for $X$, and an inverse chi-square prior for $\sigma^2$ (e.g., $\sigma^2 \sim \chi^{-2}(\nu, \lambda)$), the complete posterior distribution is

$$p(X, \phi, \sigma^2 \mid Y) \propto p(Y \mid X, \phi, \sigma^2)\, p(\phi)\, p(\sigma^2)\, p(X).$$

(12.33)

The Gibbs sampler approach to this problem starts with an arbitrary initial value of X(0) and iterates between the following two steps:

Algorithm 12.3.6 [Two-component Gibbs equalizer in ISI channel]

•  Draw a sample $(\phi^{(k+1)}, \sigma^{2(k+1)})$ from the conditional distribution (given $X^{(k)}$)

$$p(\phi, \sigma^2 \mid X^{(k)}, Y) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{t=1}^{n}\left(y_t - \phi^T \mathbf{x}_t^{(k)}\right)^2\right] (\sigma^2)^{-\frac{\nu+2}{2}} \exp\left(-\frac{\nu\lambda}{2\sigma^2}\right) \triangleq \pi^{(k+1)}(\phi \mid \sigma^2)\, \pi^{(k+1)}(\sigma^2),$$

(12.34)

where $\mathbf{x}_t^{(k)} \triangleq [x_t^{(k)}, \ldots, x_{t-q}^{(k)}]^T$ for $k = 0, 1, \ldots$, and

$$\pi^{(k+1)}(\sigma^2) \sim \chi^{-2}\left(\nu + n - 1,\; \frac{\nu\lambda + W^{(k+1)}}{\nu + n - 1}\right),$$

(12.35)

$$\pi^{(k+1)}(\phi \mid \sigma^2) \sim \mathcal{N}\left(\mu^{(k+1)}, \Sigma^{(k+1)}\right),$$

(12.36)

$$W^{(k+1)} = \sum_{t=1}^{n} y_t^2 - \left[\sum_{t=1}^{n} \mathbf{x}_t^{(k)} y_t\right]^T \left[\sum_{t=1}^{n} \mathbf{x}_t^{(k)} \mathbf{x}_t^{(k)T}\right]^{-1} \left[\sum_{t=1}^{n} \mathbf{x}_t^{(k)} y_t\right],$$

(12.37)

$$\Sigma^{(k+1)} = \left[\frac{1}{\sigma^2} \sum_{t=1}^{n} \mathbf{x}_t^{(k)} \mathbf{x}_t^{(k)T}\right]^{-1},$$

(12.38)

$$\mu^{(k+1)} = \Sigma^{(k+1)}\left(\frac{1}{\sigma^2} \sum_{t=1}^{n} \mathbf{x}_t^{(k)} y_t\right).$$

(12.39)

•  Draw a sample $X^{(k+1)}$ from the conditional distribution, given $(\phi^{(k+1)}, \sigma^{2(k+1)})$, through the following iterations. For $t = 1 - q, \ldots, n$, generate $x_t^{(k+1)}$ from

$$p\left(x_t \mid \phi^{(k+1)}, \sigma^{2(k+1)}, Y, X_{[-t]}^{(k)}\right) \propto \exp\left[-\frac{1}{2\sigma^{2(k+1)}} \sum_{j=1}^{n}\left(y_j - \phi^{(k+1)T} \mathbf{x}_j^{(k)}\right)^2\right],$$

(12.40)

where $X_{[-t]}^{(k)} \triangleq [x_{1-q}^{(k+1)}, \ldots, x_{t-1}^{(k+1)}, x_{t+1}^{(k)}, \ldots, x_n^{(k)}]$ and $\mathbf{x}_j^{(k)} \triangleq [X_{[-t]}^{(k)}]_{j-q:j}$.

Another interesting Gibbs sampling scheme is based on the “grouping” idea [20]. In particular, a forward-backward algorithm can be employed to sample $X$ jointly, conditional on $Y$ and the parameters. This scheme is effective when $X$ forms a Gaussian Markov model or a Markov chain whose state variable takes on only a few values. In the ISI channel equalization problem, the $x_t$ are i.i.d. symbols a priori, but they are correlated a posteriori because of the observed signal $Y$ and the relationship (12.32). The induced correlation among the $x_t$ vanishes after lag $q$. More precisely, instead of using formula (12.40) to sample $X$ iteratively, one can draw $X$ altogether:

Algorithm 12.3.7 [Grouping-Gibbs equalizer in ISI channel]

•  The first few steps are identical to the previous Gibbs equalizer.

•  The last step is replaced by the forward-backward scheme. Conditional on ϕ and σ (we suppress the superscript for iteration numbers), we have the joint distribution of X:

$$p(X \mid \phi, \sigma, Y) \propto \exp\left[-\frac{1}{2\sigma^2} \sum_{j=1}^{n}\left(y_j - \phi^T \mathbf{x}_j\right)^2\right] \triangleq \exp\left\{g_1(\mathbf{x}_1) + \cdots + g_n(\mathbf{x}_n)\right\},$$

(12.41)

where $\mathbf{x}_j = (x_{j-q}, \ldots, x_j)$. Thus, each $\mathbf{x}_j$ can take $2^{q+1}$ possible values. The following two steps produce a sample $X$ from $p(X \mid \phi, \sigma, Y)$.

–  Forward summation. Define $f_1(\mathbf{x}_1) = \exp\{g_1(\mathbf{x}_1)\}$ and compute recursively

$$f_{j+1}(\mathbf{x}_{j+1}) = \sum_{x_{j-q} \in \{-1, +1\}} \left[f_j(\mathbf{x}_j)\, \exp\{g_{j+1}(\mathbf{x}_{j+1})\}\right].$$

(12.42)

–  Backward sampling. First draw $\mathbf{x}_n = (x_{n-q}, \ldots, x_n)$ from the distribution $P(\mathbf{x}_n) \propto f_n(\mathbf{x}_n)$. Then, for $j = n - q - 1, \ldots, 1$, draw $x_j$ from $P(x_j \mid x_{j+1}, \ldots, x_n) \propto f_{j+q}(x_j, \ldots, x_{j+q})$.

Although the grouping idea is attractive for overcoming the channel memory problem, the additional computational cost may offset its advantages. More precisely, the forward-backward procedure needs about $2^q$ times more memory and about $2^q$ times more basic operations. A minimal sketch of the forward-backward scheme follows.
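To make the forward-backward idea concrete, the following minimal Python sketch (ours) treats the first-order special case: a binary chain with pairwise log-potentials $g_t(x_{t-1}, x_t)$, so that $p(x) \propto \exp\{\sum_t g_t(x_{t-1}, x_t)\}$. The scheme above for channel order $q$ works in exactly the same way on the $2^{q+1}$-valued blocks $\mathbf{x}_j = (x_{j-q}, \ldots, x_j)$; the potentials used in the example are illustrative.

```python
import numpy as np

def forward_backward_sample(g, n, rng):
    """g[t] is a 2x2 array of log-potentials g_t(x_{t-1}, x_t); variables 0-indexed."""
    f = [np.zeros(2)]                                # f_1(x_1): flat start
    for t in range(1, n):                            # forward summation, cf. (12.42)
        w = f[-1][:, None] + g[t]                    # message for x_{t-1} plus potential
        f.append(np.logaddexp(w[0], w[1]))           # sum out x_{t-1} (log domain)
    x = np.empty(n, dtype=int)
    p = np.exp(f[-1] - f[-1].max())
    x[-1] = rng.choice(2, p=p / p.sum())             # draw the last variable from f_n
    for t in range(n - 2, -1, -1):                   # backward sampling
        w = f[t] + g[t + 1][:, x[t + 1]]
        p = np.exp(w - w.max())
        x[t] = rng.choice(2, p=p / p.sum())
    return x

rng = np.random.default_rng(0)
n = 10
g = [np.array([[1.0, -1.0], [-1.0, 1.0]])] * n       # potentials favoring runs
print(forward_backward_sample(g, n, rng))
```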

Similar to the previous section, we can integrate out the continuous parameters and write down the marginal target distribution of X:

$$\pi(X) \propto \left[\nu\lambda + W\right]^{-(n+\nu)/2}$$

(12.43)

where

$$W = \sum_{t} y_t^2 - \left[\sum_{t} \mathbf{x}_t y_t\right]^T \left[\sum_{t} \mathbf{x}_t \mathbf{x}_t^T\right]^{-1} \left[\sum_{t} \mathbf{x}_t y_t\right].$$

(12.44)

We can then derive the one-component Gibbs and Metropolis algorithms accordingly. The phase ambiguity (i.e., the likelihood is unchanged when $X$ is changed to $-X$) can be clearly seen from this joint distribution.

Algorithm 12.3.8 [One-component Gibbs/Metropolis equalizer in ISI channel]

•  Choose $t$ from $1, \ldots, n$ by either the random scan or the systematic scan. Let $X^{(k+1)} = Z$, where $z_s = x_s^{(k)}$ for $s \neq t$ and $z_t = -x_t^{(k)}$, with probability

$$\frac{\pi(Z)}{\pi(X^{(k)}) + \pi(Z)},$$

(12.45)

for the Gibbs equalizer, or with probability

$$\min\left\{1,\; \frac{\pi(Z)}{\pi(X^{(k)})}\right\}$$

(12.46)

for the Metropolis equalizer, where $\pi(X)$ is as in (12.43). Otherwise let $X^{(k+1)} = X^{(k)}$. When the variance $\sigma^2$ is known,

$$\pi(X) \propto \left|\sum_{t} \mathbf{x}_t \mathbf{x}_t^T\right|^{-q/2} \exp\left(\frac{1}{2\sigma^2}\left[\sum_{t} \mathbf{x}_t y_t\right]^T \left[\sum_{t} \mathbf{x}_t \mathbf{x}_t^T\right]^{-1} \left[\sum_{t} \mathbf{x}_t y_t\right]\right).$$

To overcome the phase ambiguity, we use differential coding in all of our algorithms. Denote $S \triangleq [s_2, \ldots, s_n]$ as the information bits, and let $s_t^{(k)} = x_t^{(k)} x_{t-1}^{(k)}$, $t = 2, \ldots, n$. Since $X^{(k)}$ forms a Markov chain, $S^{(k)}$ is a Markov chain too. The transition probability from $S^{(k)}$ to $S^{(k+1)}$ is

$$P\left(S^{(k+1)} \mid S^{(k)}\right) = P\left(X^{(k+1)} \mid X^{(k)}\right) + P\left(-X^{(k+1)} \mid X^{(k)}\right),$$

(12.47)

where both X(k+1) and −X(k+1) result in S(k+1) and X(k) results in S(k).

12.4    Sequential Monte Carlo (SMC) Methods

12.4.1    General SMC Algorithms

Sequential Importance Sampling

Importance sampling is perhaps one of the most elementary, well-known, and versatile Monte Carlo techniques. Suppose we want to estimate $E\{h(\mathbf{x})\}$ (with respect to $p$) using the Monte Carlo method. Since directly sampling from $p(\mathbf{x})$ is difficult, we want to find a trial distribution $q(\mathbf{x})$ that is reasonably close to $p$ but easy to draw samples from. Because of the simple identity

$$E\{h(\mathbf{x})\} = \int h(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \int h(\mathbf{x})\, w(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x},$$

(12.48)

where

$$w(\mathbf{x}) \triangleq \frac{p(\mathbf{x})}{q(\mathbf{x})},$$

(12.49)

is the importance weight, we can approximate (12.48) by

$$E\{h(\mathbf{x})\} \approx \frac{1}{W} \sum_{j=1}^{\nu} h\left(\mathbf{x}^{(j)}\right) w\left(\mathbf{x}^{(j)}\right),$$

(12.50)

where $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(\nu)}$ are random samples from $q$, and $W = \sum_{j=1}^{\nu} w(\mathbf{x}^{(j)})$. In using this method, we only need to know the expression of $p(\mathbf{x})$ up to a normalizing constant, which is the case for many processing problems found in digital communications. Each $\mathbf{x}^{(j)}$ is said to be properly weighted by $w(\mathbf{x}^{(j)})$ with respect to $p$. (A minimal sketch of this estimate follows.)
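A minimal Python sketch of the importance sampling estimate (12.50) follows (ours; the target, trial density, and test function are illustrative assumptions). Note that $p$ enters only through its unnormalized log density, and the self-normalization by $W$ makes any constant factors in $p$ and $q$ irrelevant.

```python
import numpy as np

rng = np.random.default_rng(2)
nu = 100_000
x = 2.0 * rng.standard_normal(nu)                # x^(j) ~ q = N(0, 4)
log_p_tilde = -0.5 * x**2                        # target p = N(0,1), unnormalized
log_q = -0.5 * (x / 2.0)**2 - np.log(2.0)        # trial density up to a constant
w = np.exp(log_p_tilde - log_q)                  # importance weights (12.49)
h = x**2                                         # test function; E_p{h} = 1
print((h * w).sum() / w.sum())                   # self-normalized estimate (12.50)
```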

However, it is usually difficult to design a good trial density function in high-dimensional problems. One of the most useful strategies in these problems is to build up the trial density sequentially. Suppose we can decompose x as (x1, ⋯, xd) where each of the xj may be multidimensional. Then our trial density can be constructed as

$$q(\mathbf{x}) = q_1(x_1)\, q_2(x_2 \mid x_1) \cdots q_d(x_d \mid x_1, \ldots, x_{d-1}),$$

(12.51)

by which we hope to obtain some guidance from the target density while building up the trial density. Corresponding to the decomposition of x, we can rewrite the target density as

$$p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_d \mid x_1, \ldots, x_{d-1}),$$

(12.52)

and the importance weight as

$$w(\mathbf{x}) = \frac{p(x_1)\, p(x_2 \mid x_1) \cdots p(x_d \mid x_1, \ldots, x_{d-1})}{q_1(x_1)\, q_2(x_2 \mid x_1) \cdots q_d(x_d \mid x_1, \ldots, x_{d-1})}.$$

(12.53)

Equation (12.53) suggests a recursive way of computing and monitoring the importance weight. That is, by denoting $\mathbf{x}_t = (x_1, \ldots, x_t)$ (thus, $\mathbf{x}_d \equiv \mathbf{x}$), we have

$$w_t(\mathbf{x}_t) = w_{t-1}(\mathbf{x}_{t-1})\, \frac{p(x_t \mid \mathbf{x}_{t-1})}{q_t(x_t \mid \mathbf{x}_{t-1})}.$$

(12.54)

Then wd is equal to w(x) in (12.53). Potential advantages of this recursion and (12.52) are: (a) We can stop generating further components of x if the partial weight derived from the sequentially generated partial sample is too small; and (b) we can take advantage of p(xtxt−1) in designing qt(xtxt−1). In other words, the marginal distribution p(xt) can be used to guide the generation of x.

Although the “idea” sounds interesting, the trouble is that expressions (12.52) and (12.53) are not useful at all! The reason is that in order to get (12.52), one needs to have the marginal distribution

$$p(\mathbf{x}_t) = \int p(x_1, \ldots, x_d)\, dx_{t+1} \cdots dx_d,$$

(12.55)

which is perhaps more difficult than the original problem.

In order to carry out the sequential sampling idea, we need to find a sequence of “auxiliary distributions,” $\pi_1(\mathbf{x}_1), \pi_2(\mathbf{x}_2), \ldots, \pi_d(\mathbf{x})$, so that $\pi_t(\mathbf{x}_t)$ is a reasonable approximation to the marginal distribution $p(\mathbf{x}_t)$, for $t = 1, \ldots, d - 1$, and $\pi_d = p$. We want to emphasize that the $\pi_t$ are only required to be known up to a normalizing constant, and they serve only as “guides” for the construction of the whole sample $\mathbf{x} = (x_1, \ldots, x_d)$. The sequential importance sampling (SIS) method can then be defined as the following recursive procedure.

Algorithm 12.4.1 [Sequential importance sampling (SIS)]

For t = 2, ⋯, d:

•  Draw $x_t$ from $q_t(x_t \mid \mathbf{x}_{t-1})$, and let $\mathbf{x}_t = (\mathbf{x}_{t-1}, x_t)$.

•  Compute

$$u_t = \frac{\pi_t(\mathbf{x}_t)}{\pi_{t-1}(\mathbf{x}_{t-1})\, q_t(x_t \mid \mathbf{x}_{t-1})},$$

(12.56)

and let wt = wt−1ut. Here ut is called an incremental weight.

It is easy to show that xt is properly weighted by wt with respect to πt provided that xt−1 is properly weighted by wt−1 with respect to πt−1. Thus, the whole sample x obtained by SIS is properly weighted by wd with respect to the target density p(x). The “auxiliary distributions” can also be used to help construct a more efficient trial distribution:

•  We can build qt in light of πt. For example, one can choose (if possible)

$$q_t(x_t \mid \mathbf{x}_{t-1}) = \pi_t(x_t \mid \mathbf{x}_{t-1}).$$

(12.57)

Then the incremental weight becomes

$$u_t = \frac{\pi_t(\mathbf{x}_{t-1})}{\pi_{t-1}(\mathbf{x}_{t-1})}.$$

(12.58)

By the same token, we may also want $q_t$ to be $\pi_{t+1}(x_t \mid \mathbf{x}_{t-1})$, where the latter involves integrating out $x_{t+1}$.

•  When we observe that wt is getting too small, we may want to reject the sample half way and restart. In this way, we avoid wasting time on generating samples that are deemed to have little effect in the final estimation. However, as an outright rejection incurs bias, techniques such as the rejection control are needed [21].

•  Another problem with the SIS is that the resulting importance weights are often very skewed, especially when d is large. An important recent advance in sequential Monte Carlo to address this problem is the resampling technique [21, 22, 23].

SMC for Dynamic Systems

Consider the following dynamic system modeled in a state-space form as

$$\text{state equation:} \quad z_t = f_t(z_{t-1}, u_t), \qquad \text{observation equation:} \quad y_t = g_t(z_t, v_t),$$

(12.59)

where zt, yt, ut and vt are, respectively, the state variable, the observation, the state noise, and the observation noise at time t. They can be either scalars or vectors.

Let $Z_t = (z_0, z_1, \ldots, z_t)$ and $Y_t = (y_0, y_1, \ldots, y_t)$. Suppose an on-line inference of $Z_t$ is of interest; that is, at the current time $t$ we wish to make a timely estimate of a function of the state variable $Z_t$, say $h(Z_t)$, based on the currently available observations $Y_t$. By Bayes' theorem, the optimal solution to this problem is $E\{h(Z_t) \mid Y_t\} = \int h(Z_t)\, p(Z_t \mid Y_t)\, dZ_t$. In most cases an exact evaluation of this expectation is analytically intractable because of the complexity of such a dynamic system. Monte Carlo methods provide us with a viable alternative to the required computation. Specifically, suppose a set of random samples $\{Z_t^{(j)}\}_{j=1}^{\nu}$ is generated from the trial distribution $q(Z_t \mid Y_t)$. By associating the weight

$$w_t^{(j)} = \frac{p\left(Z_t^{(j)} \mid Y_t\right)}{q\left(Z_t^{(j)} \mid Y_t\right)}$$

(12.60)

to the sample $Z_t^{(j)}$, we can approximate the quantity of interest, $E\{h(Z_t) \mid Y_t\}$, as

$$E\{h(Z_t) \mid Y_t\} \approx \frac{1}{W_t} \sum_{j=1}^{\nu} h\left(Z_t^{(j)}\right) w_t^{(j)},$$

(12.61)

where $W_t = \sum_{j=1}^{\nu} w_t^{(j)}$. The pair $(Z_t^{(j)}, w_t^{(j)})$ is a properly weighted sample with respect to the distribution $p(Z_t \mid Y_t)$. A trivial but important observation is that $z_t^{(j)}$ (one of the components of $Z_t^{(j)}$) is also properly weighted by $w_t^{(j)}$ with respect to the marginal distribution $p(z_t \mid Y_t)$.

To implement Monte Carlo techniques for a dynamic system, a set of random samples properly weighted with respect to $p(Z_t \mid Y_t)$ is needed for any time $t$. Because the state equation in system (12.59) possesses a Markovian structure, we can implement an SMC strategy [21]. Suppose a set of properly weighted samples $\{(Z_{t-1}^{(j)}, w_{t-1}^{(j)})\}_{j=1}^{\nu}$ (with respect to $p(Z_{t-1} \mid Y_{t-1})$) is given at time $t - 1$. A sequential Monte Carlo filter generates from this set a new one, $\{(Z_t^{(j)}, w_t^{(j)})\}_{j=1}^{\nu}$, which is properly weighted at time $t$ with respect to $p(Z_t \mid Y_t)$, according to the following algorithm.

Algorithm 12.4.2 [Sequential Monte Carlo filter for dynamic systems]

For j = 1, ⋯, ν:

•  Draw a sample $z_t^{(j)}$ from a trial distribution $q(z_t \mid Z_{t-1}^{(j)}, Y_t)$ and let $Z_t^{(j)} = (Z_{t-1}^{(j)}, z_t^{(j)})$;

•  Compute the importance weight

$$w_t^{(j)} = w_{t-1}^{(j)}\, \frac{p\left(Z_t^{(j)} \mid Y_t\right)}{p\left(Z_{t-1}^{(j)} \mid Y_{t-1}\right)\, q\left(z_t^{(j)} \mid Z_{t-1}^{(j)}, Y_t\right)}.$$

(12.62)

The algorithm is initialized by drawing a set of i.i.d. samples $z_0^{(1)}, \ldots, z_0^{(\nu)}$ from $p(z_0 \mid y_0)$. When $y_0$ represents the “null” information, $p(z_0 \mid y_0)$ corresponds to the prior of $z_0$.

A useful choice of the trial distribution $q(z_t \mid Z_{t-1}^{(j)}, Y_t)$ for the state-space model (12.59) is of the form

$$q\left(z_t \mid Z_{t-1}^{(j)}, Y_t\right) = p\left(z_t \mid Z_{t-1}^{(j)}, Y_t\right) = \frac{p(y_t \mid z_t)\, p\left(z_t \mid z_{t-1}^{(j)}\right)}{p\left(y_t \mid z_{t-1}^{(j)}\right)}.$$

(12.63)

For this trial distribution, the importance weight is updated according to

$$w_t^{(j)} \propto w_{t-1}^{(j)}\, p\left(y_t \mid z_{t-1}^{(j)}\right).$$

(12.64)
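A minimal Python sketch of Algorithm 12.4.2 follows (ours; the scalar linear-Gaussian model and all numerical values are illustrative assumptions). It uses the state-transition prior $f(z_t \mid z_{t-1})$ as the trial distribution, for which the weight update simplifies to $w_t \propto w_{t-1}\, g(y_t \mid z_t)$; choosing instead the optimal trial distribution (12.63) would give the update (12.64). The resampling step anticipates Section 12.4.2.

```python
import numpy as np

def particle_filter(y, n_part=1000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_part)                  # draws from p(z_0)
    w = np.ones(n_part) / n_part
    means = []
    for yt in y:
        z = 0.9 * z + rng.standard_normal(n_part)    # z_t^(j) ~ f(. | z_{t-1}^(j))
        w = w * np.exp(-0.5 * (yt - z)**2)           # w_t = w_{t-1} g(y_t | z_t)
        w = w / w.sum()
        means.append(w @ z)                          # estimate of E{z_t | Y_t}
        if 1.0 / np.sum(w**2) < n_part / 10:         # effective sample size rule
            idx = rng.choice(n_part, size=n_part, p=w)
            z, w = z[idx], np.ones(n_part) / n_part  # resample, reset weights
    return np.array(means)

# Simulate the illustrative model z_t = 0.9 z_{t-1} + u_t, y_t = z_t + v_t.
rng = np.random.default_rng(1)
z_true, ys = 0.0, []
for _ in range(100):
    z_true = 0.9 * z_true + rng.standard_normal()
    ys.append(z_true + rng.standard_normal())
print(particle_filter(np.array(ys))[-5:])            # filtered means, last 5 steps
```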

Mixture Kalman Filter

Many dynamic system models belong to the class of conditional dynamic linear models (CDLM) of the form

$$\mathbf{x}_t = F_{\lambda_t} \mathbf{x}_{t-1} + G_{\lambda_t} \mathbf{u}_t, \qquad \mathbf{y}_t = H_{\lambda_t} \mathbf{x}_t + K_{\lambda_t} \mathbf{v}_t,$$

(12.65)

where $\mathbf{u}_t \sim \mathcal{N}_c(0, I)$, $\mathbf{v}_t \sim \mathcal{N}_c(0, I)$ (here $I$ denotes an identity matrix), and $\lambda_t$ is a random indicator variable. The matrices $F_{\lambda_t}, G_{\lambda_t}, H_{\lambda_t}$ and $K_{\lambda_t}$ are known given $\lambda_t$. In this model, the “state variable” $z_t$ corresponds to $(\mathbf{x}_t, \lambda_t)$.

We observe that for a given trajectory of the indicator $\lambda_t$ in a CDLM, the system is both linear and Gaussian, for which the Kalman filter provides the complete statistical characterization of the system dynamics. The mixture Kalman filter (MKF) [24] can be employed for on-line filtering and prediction of CDLMs. It exploits the conditional Gaussian property and utilizes a marginalization operation to improve algorithmic efficiency. Instead of dealing with both $\mathbf{x}_t$ and $\lambda_t$, the MKF draws Monte Carlo samples only in the indicator space and uses a mixture of Gaussian distributions to approximate the target distribution. Compared with the generic SMC method, the MKF is substantially more efficient (e.g., giving more accurate results with the same computing resources).

Let $Y_t = (y_0, y_1, \ldots, y_t)$ and $\Lambda_t = (\lambda_0, \lambda_1, \ldots, \lambda_t)$. By recursively generating a set of properly weighted random samples $\{(\Lambda_t^{(j)}, w_t^{(j)})\}_{j=1}^{\nu}$ to represent $p(\Lambda_t \mid Y_t)$, the MKF approximates the target distribution $p(\mathbf{x}_t \mid Y_t)$ by a random mixture of Gaussian distributions

$$\frac{1}{W_t} \sum_{j=1}^{\nu} w_t^{(j)}\, \mathcal{N}_c\left(\mu_t^{(j)}, \Sigma_t^{(j)}\right),$$

(12.66)

where $\kappa_t^{(j)} \triangleq [\mu_t^{(j)}, \Sigma_t^{(j)}]$ is obtained by running a Kalman filter for the given indicator trajectory $\Lambda_t^{(j)}$, and $W_t = \sum_{j=1}^{\nu} w_t^{(j)}$. A key step in the MKF is the production at time $t$ of a weighted sample of indicators, $\{(\Lambda_t^{(j)}, \kappa_t^{(j)}, w_t^{(j)})\}_{j=1}^{\nu}$, based on the set of samples $\{(\Lambda_{t-1}^{(j)}, \kappa_{t-1}^{(j)}, w_{t-1}^{(j)})\}_{j=1}^{\nu}$ at the previous time $t - 1$, according to the following algorithm.

Algorithm 12.4.3 [Mixture Kalman filter]

For j = 1, ⋯, ν:

•  Draw a sample $\lambda_t^{(j)}$ from a trial distribution $q(\lambda_t \mid \Lambda_{t-1}^{(j)}, \kappa_{t-1}^{(j)}, Y_t)$.

•  Run a one-step Kalman filter based on $\lambda_t^{(j)}$, $\kappa_{t-1}^{(j)}$, and $y_t$ to obtain $\kappa_t^{(j)}$.

•  Compute the weight

$$w_t^{(j)} \propto w_{t-1}^{(j)}\, \frac{p\left(\Lambda_{t-1}^{(j)}, \lambda_t^{(j)} \mid Y_t\right)}{p\left(\Lambda_{t-1}^{(j)} \mid Y_{t-1}\right)\, q\left(\lambda_t^{(j)} \mid \Lambda_{t-1}^{(j)}, \kappa_{t-1}^{(j)}, Y_t\right)}.$$

(12.67)
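The following minimal Python sketch (ours; the scalar model and its coefficients are illustrative assumptions) shows the one-step Kalman update at the core of the MKF: given an indicator value, it propagates the Gaussian sufficient statistics $\kappa = (\mu, P)$ and returns the predictive likelihood $p(y_t \mid \lambda_t, \kappa_{t-1})$ needed to evaluate the weight (12.67).

```python
import numpy as np

def kalman_one_step(mu, P, y, F, G, H, K):
    """One-step Kalman update for x_t = F x_{t-1} + G u_t, y_t = H x_t + K v_t."""
    mu_pred = F * mu                         # predicted mean
    P_pred = F * P * F + G * G               # predicted variance
    S = H * P_pred * H + K * K               # innovation variance
    resid = y - H * mu_pred                  # innovation
    gain = P_pred * H / S                    # Kalman gain
    mu_new = mu_pred + gain * resid
    P_new = (1.0 - gain * H) * P_pred
    like = np.exp(-0.5 * resid**2 / S) / np.sqrt(2.0 * np.pi * S)
    return mu_new, P_new, like               # like = p(y_t | lambda_t, kappa_{t-1})

# Two hypothetical indicator values, e.g., "signal present/absent":
params = {0: dict(F=0.9, G=1.0, H=1.0, K=0.5),
          1: dict(F=0.9, G=1.0, H=0.0, K=0.5)}
print(kalman_one_step(0.0, 1.0, 0.7, **params[0]))
print(kalman_one_step(0.0, 1.0, 0.7, **params[1]))
```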

12.4.2    Resampling Procedures

The importance sampling weight $w_t^{(j)}$ measures the “quality” of the corresponding imputed signal sequence $Z_t^{(j)}$. A relatively small weight implies that the sample is drawn far from the main body of the posterior distribution and makes a small contribution to the final estimation. Such a sample is said to be ineffective. If there are too many ineffective samples, the Monte Carlo procedure becomes inefficient. This can be detected by observing a large coefficient of variation in the importance weights. Suppose $\{w_t^{(j)}\}_{j=1}^{m}$ is a sequence of importance weights. Then the coefficient of variation $v_t$ is defined as

$$v_t^2 = \frac{\sum_{j=1}^{m}\left(w_t^{(j)} - \bar{w}_t\right)^2 / m}{\bar{w}_t^2} = \frac{1}{m}\sum_{j=1}^{m}\left(\frac{w_t^{(j)}}{\bar{w}_t} - 1\right)^2,$$

(12.68)

where $\bar{w}_t = \sum_{j=1}^{m} w_t^{(j)} / m$. Note that if the samples are drawn exactly from the target distribution, then all the weights are equal, implying that $v_t = 0$. It is shown in [25] that the importance weights resulting from a sequential Monte Carlo filter form a martingale sequence. As more and more data are processed, the coefficient of variation of the weights, and hence the number of ineffective samples, increases rapidly.

A useful method for reducing the number of ineffective samples and enhancing the effective ones is resampling [23]. Roughly speaking, resampling allows “bad” samples (with small importance weights) to be discarded and “good” ones (with large importance weights) to replicate so as to accommodate the dynamic change of the system. Specifically, let $\{(Z_t^{(j)}, w_t^{(j)})\}_{j=1}^{m}$ be the original properly weighted samples at time $t$. A residual resampling strategy forms a new set of weighted samples $\{(\tilde{Z}_t^{(j)}, \tilde{w}_t^{(j)})\}_{j=1}^{m}$ according to the following algorithm (assume that $\sum_{j=1}^{m} w_t^{(j)} = m$):

Algorithm 12.4.4 [Resampling algorithm]

•  For $j = 1, \ldots, m$, retain $k_j = \lfloor w_t^{(j)} \rfloor$ copies of the sample $Z_t^{(j)}$. Denote $K_r = m - \sum_{j=1}^{m} k_j$.

•  Obtain $K_r$ i.i.d. draws from the original sample set $\{Z_t^{(j)}\}_{j=1}^{m}$, with probabilities proportional to $(w_t^{(j)} - k_j)$, $j = 1, \ldots, m$.

•  Assign equal weight, i.e., set $\tilde{w}_t^{(j)} = 1$, to each new sample.

The samples drawn by the above residual resampling procedure are properly weighted with respect to $p(Z_t \mid Y_t)$, provided that $m$ is sufficiently large. In practice, when a small to modest $m$ is used, the resampling procedure can be seen as trading off between bias and variance: the new samples with their weights resulting from the resampling procedure are only approximately proper, which introduces a small bias into the Monte Carlo estimation; on the other hand, resampling greatly reduces the Monte Carlo variance of the future samples.

Resampling can be done at any time. However, resampling too often adds computational burden and decreases the “diversity” of the Monte Carlo filter (i.e., it decreases the number of distinct samples and loses information). On the other hand, resampling too rarely may result in a loss of efficiency. It is thus desirable to give guidance on when to resample. A measure of the efficiency of an importance sampling scheme is the effective sample size $\bar{m}_t$, defined as

$$\bar{m}_t \triangleq \frac{m}{1 + v_t^2}.$$

(12.69)

Heuristically, $\bar{m}_t$ reflects the equivalent size of a set of i.i.d. samples for the set of $m$ weighted ones. It is suggested in [21] that resampling should be performed when the effective sample size becomes small, e.g., $\bar{m}_t \leq m/10$. Alternatively, one can conduct resampling at fixed-length time intervals (say, every five steps). (A minimal sketch combining this rule with residual resampling is given below.)
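A minimal Python sketch of residual resampling (Algorithm 12.4.4) combined with the effective-sample-size rule (12.69) follows (ours; the synthetic weights are an illustrative assumption). The weights are assumed normalized so that they sum to $m$.

```python
import numpy as np

def residual_resample(z, w, rng):
    m = len(w)                                   # weights assumed to sum to m
    k = np.floor(w).astype(int)                  # retain k_j = floor(w_j) copies
    idx = np.repeat(np.arange(m), k)
    K_r = m - k.sum()                            # remaining draws
    resid = w - k                                # residual weights
    extra = rng.choice(m, size=K_r, p=resid / resid.sum())
    idx = np.concatenate([idx, extra])
    return z[idx], np.ones(m)                    # equal weights after resampling

rng = np.random.default_rng(0)
m = 1000
w = rng.exponential(size=m)
w *= m / w.sum()                                 # normalize so weights sum to m
v2 = np.var(w / w.mean())                        # squared coefficient of variation
ess = m / (1.0 + v2)                             # effective sample size (12.69)
print("ESS:", ess)
if ess < m / 10:                                 # resampling rule from [21]
    z, w = residual_resample(rng.standard_normal(m), w, rng)
```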

Instead of the resampling scheme suggested above, we may implement a more flexible resampling scheme as follows (assume that $\sum_{j=1}^{m} w_t^{(j)} = m$):

For j = 1, ⋯, m,

(a)  If $w_t^{(j)} \geq 1$:

•  Retain $k_j$ copies of the sample $Z_t^{(j)}$, where $k_j$ is given in advance (see below);

•  Assign weight $\tilde{w}_t^{(j)} = w_t^{(j)} / k_j$ to each copy.

(b)  If $w_t^{(j)} < 1$:

•  Kill the sample with probability $1 - f_j$;

•  Assign weight $w_t^{(j)} / f_j$ to the surviving sample.

The advantage of this new resampling method is that we have the flexibility of choosing a proper resampling size $k_j$ as we wish. On the one hand, we want to eliminate hopeless samples and emphasize the “promising” ones. On the other hand, we do not want to throw away the mediocre ones, which may prove important later on (as the dynamic system moves their way). An empirical choice of the resampling size is $k_j = \lceil w_t^{(j)} \rceil$ and $f_j = w_t^{(j)}$. The intuition behind this choice is that it effectively removes hopeless samples with small weights while maintaining the diversity of the Monte Carlo sample.

12.4.3    Applications of SMC in Bioinformatics

In this section we illustrate the application of SMC in solving the DNA sequence motif discovery problem described in Section 12.1.2.

SMC Motif Discovery Algorithm

For the system states up to time t, xt = [x1, ⋯, xt], and the corresponding sequences St = [s1, ⋯, st], we will first present their prior distributions and their conditional posterior distributions, and then describe the steps of the SMC motif discovery algorithm.

Prior Distributions: Denote $\theta_j \triangleq [\theta_{j1}, \ldots, \theta_{j4}]^T$, $j = 1, \ldots, w$, as the $j$-th column of the position weight matrix $\Theta$. In Monte Carlo methods, the prior distribution is often chosen so that the posterior and the prior form a conjugate pair, i.e., they belong to the same functional family. It can be seen that for all of the motifs in the dataset $S$, the nucleotide counts at each motif position are drawn from multinomial distributions. It is well known that the Dirichlet distribution is the conjugate prior for the multinomial distribution. Therefore, we use a multivariate Dirichlet distribution as the prior for $\theta$. The prior distribution for the $i$-th column of the PWM is then given by

$$\theta_i \sim \mathcal{D}(\rho_{i1}, \ldots, \rho_{i4}), \qquad i = 1, 2, \ldots, w.$$

(12.70)

Denote $\rho_i \triangleq [\rho_{i1}, \ldots, \rho_{i4}]$. Assuming independent priors, the prior distribution for the PWM $\Theta$ is the product Dirichlet distribution

$$\Theta \sim \prod_{i=1}^{w} \mathcal{D}(\rho_i).$$

(12.71)

Conditional Posterior Distributions: Here we describe the conditional posterior distributions that are used in the SMC algorithm:

1.  The conditional posterior distribution of the PWM Θ:

$$p(\Theta \mid S_t, \mathbf{x}_{t-1}, x_t = i) \propto p(s_t \mid \Theta, \mathbf{x}_{t-1}, x_t = i, S_{t-1})\, p(\Theta \mid \mathbf{x}_{t-1}, S_{t-1}) \propto \prod_{j=1}^{w} \theta_j^{\mathbf{n}(a_{t,i}(j))} \prod_{\ell=1}^{w} \theta_\ell^{\rho_\ell(t-1)-1} \propto \Lambda_w\!\left(\Theta;\, \rho_1(t-1) + \mathbf{n}(a_{t,i}(1)), \ldots, \rho_w(t-1) + \mathbf{n}(a_{t,i}(w))\right),$$

(12.72)

where we denote $\Lambda_w(\Theta; \rho_1, \ldots, \rho_w)$ as the product Dirichlet PDF for $\Theta$; $\rho_i(t) \triangleq [\rho_{i1}(t), \ldots, \rho_{i4}(t)]$, $i = 1, \ldots, w$, as the parameters of the distribution of $\Theta$ at time $t$; and $\theta_k^{\rho_k(t)-1} \triangleq \prod_{j=1}^{4} \theta_{kj}^{\rho_{kj}(t)-1}$. Note that the posterior distribution of $\Theta$ depends only on the sufficient statistics $\mathcal{T}_t \triangleq \{\rho_{ij}(t),\, 1 \leq i \leq w,\, 1 \leq j \leq 4\}$, which are easily updated based on $\mathcal{T}_{t-1}$, $x_t$, and $s_t$ as given by (12.72), i.e., $\mathcal{T}_t = \mathcal{T}_t(\mathcal{T}_{t-1}, x_t, s_t)$.

2.  The conditional posterior distribution of state xt:

$$p(x_t = i \mid S_t, \Theta) = p(x_t = i \mid s_t, \Theta) \propto \ell(s_t; i, \Theta), \qquad i = 1, 2, \ldots, L_m.$$

(12.73)

SMC Estimator: We now outline the SMC algorithm for motif discovery when the PWM is unknown, assuming that there is only one motif of length $w$ and that it is present in each of the sequences in the dataset. At time $t$, to draw random samples of $x_t^{(k)}$ we use the optimal proposal distribution

$$q_2\left(x_t = i \mid \mathbf{x}_{t-1}^{(k)}, S_t, \Theta\right) = p\left(x_t = i \mid \mathbf{x}_{t-1}^{(k)}, S_t, \Theta\right) \propto \ell(s_t; i, \Theta).$$

(12.74)

To sample Θ, we use the following proposal distribution

$$q_1\left(\Theta \mid \mathbf{x}_{t-1}^{(k)}, S_t\right) \propto \sum_{i=1}^{L_m} p(s_t \mid x_t = i, \Theta, \mathbf{x}_{t-1}, S_{t-1})\, p(\Theta \mid \mathbf{x}_{t-1}, S_{t-1}) \propto \sum_{i=1}^{L_m} P3_{t,i} \prod_{k=1}^{w} \theta_k^{\rho_k(t-1) + \mathbf{n}(a_{t,i}(k)) - 1} \propto \sum_{i=1}^{L_m} \lambda_{i,t}\, \Lambda_w\!\left(\Theta;\, \rho_1(t-1) + \mathbf{n}(a_{t,i}(1)), \ldots, \rho_w(t-1) + \mathbf{n}(a_{t,i}(w))\right),$$

(12.75)

where

$$\lambda_{i,t} \triangleq P3_{t,i} \prod_{\ell=1}^{w} \rho_\ell(t-1)^{\mathbf{n}(a_{t,i}(\ell))},$$

(12.76)

with $\rho_\ell(t)^{\mathbf{n}(a_{t,i}(\ell))} \triangleq \prod_{j=1}^{4} \rho_{\ell j}(t)^{I(s_{t,\, i+\ell-1}\, =\, j)}$, where $I(\cdot)$ denotes the indicator function. The weight update is given by

$$w_t \propto w_{t-1}\, \frac{\sum_{i=1}^{L_m} \lambda_{i,t}}{\prod_{k=1}^{w} \sum_{j=1}^{4} \rho_{kj}(t-1)}.$$

(12.77)

We are now ready to give the SMC motif discovery algorithm:

Algorithm 12.4.5 [SMC motif discovery algorithm for single motif present in all sequences]

•  For k = 1, ⋯, K

–  Sample Θ(k) from the mixture Dirichlet distribution given by (12.75).

–  Sample $x_t^{(k)}$ from (12.74).

–  Update the sufficient statistics $\mathcal{T}_t^{(k)} = \mathcal{T}_t(\mathcal{T}_{t-1}^{(k)}, x_t^{(k)}, s_t)$ using (12.72).

•  Compute the new weights according to (12.77).

•  Compute $\hat{K}_{\text{eff}} = \left(\sum_{k=1}^{K} \left(w_t^{(k)}\right)^2\right)^{-1}$. If $\hat{K}_{\text{eff}} \leq K/10$, perform resampling.

Motif Scores: When searching for motifs in a dataset, it is often necessary to assign confidence scores to the estimated motif locations. A natural choice in this case would be to use the a posteriori probability

$$p(x_t \mid s_t) \propto p(s_t \mid x_t)\, p(x_t),$$

(12.78)

as the confidence score for our estimate, where $p(x_t)$, the prior probability of the starting location of the motif in sequence $t$, is assumed to be uniformly distributed. Note that

$$p(s_t \mid x_t) = \int p(s_t \mid x_t, \Theta)\, p(\Theta)\, d\Theta.$$

(12.79)

From [26], [27], (12.79) can be approximated by

$$p(s_t \mid x_t) \approx p(s_t \mid x_t, \hat{\Theta})\, p(\hat{\Theta}) = \ell(s_t; x_t, \hat{\Theta})\, \Lambda_w(\hat{\Theta}; \rho_{1,t}, \ldots, \rho_{w,t}),$$

(12.80)

and we denote (12.80) as the Bayesian score. Extensions of the above basic SMC motif discovery algorithm can be found in [9].

12.5    Conclusions and Further Readings

Monte Carlo techniques rely on random number generation to calculate deterministic variables and functions more efficiently, to solve complicated optimization and estimation problems, and to simulate complex phenomena and systems. They have found applicability in a wide variety of fields including engineering, bioinformatics, statistics, and the physical sciences (physics, astronomy, chemistry, etc.). In the areas of signal processing, communications and networking, Monte Carlo techniques combined with Bayesian statistics have proved to be very powerful tools for solving complex estimation, detection, optimization and simulation problems (see, e.g., [28, 29, 30]). Recently, the class of sequential Monte Carlo techniques has helped to design efficient recursive algorithms for diverse estimation and detection applications (see, e.g., [4, 31, 32, 33], as well as the tutorials [34, 35, 36, 37]). For a more comprehensive treatment of Monte Carlo techniques and Bayesian statistics, we recommend the excellent references [3, 4, 5, 6, 7].

12.6    Exercises

Exercise 12.6.1 (Variance of MCMC sampler). Suppose the samples from a Markov chain Monte Carlo sampler are given by $\{x^{(i)}\}_{i=1}^{N}$, distributed according to $\pi$. Let us further assume that the chain has been run long enough to reach its equilibrium distribution. Show that:

$$N \operatorname{var}\left\{\frac{1}{N}\sum_{i=1}^{N} \phi\left(x^{(i)}\right)\right\} = \sigma^2\left[1 + 2\sum_{i=1}^{N-1}\left(1 - \frac{i}{N}\right)\rho_i\right]$$

where $\sigma^2 = \operatorname{var}[\phi(x)]$ and $\rho_i = \operatorname{corr}\left[\phi\left(x^{(j)}\right), \phi\left(x^{(j+i)}\right)\right]$ is the lag-$i$ autocorrelation of the stationary chain.

Exercise 12.6.2 (Two-component Gibbs sampler). Consider a Markov chain resulting from a two-component Gibbs sampler which is in stationarity. Prove that

$$\operatorname{cov}\left\{\phi\left(x_1^{(0)}\right), \phi\left(x_1^{(1)}\right)\right\} = \operatorname{var}\left\{E\left\{\phi(x_1) \mid x_2\right\}\right\}$$

holds for any function ϕ.

Exercise 12.6.3 (Computational efficiency of MC estimates). Suppose we have Monte Carlo samples from data augmentation and we are estimating $I = E[\phi(x_1)]$. Which one of the following estimators should be preferred?

$$\hat{I} = \frac{1}{m}\left\{\phi\left(x_1^{(1)}\right) + \cdots + \phi\left(x_1^{(m)}\right)\right\}, \qquad \tilde{I} = \frac{1}{m}\left\{E\left[\phi\left(x_1^{(1)}\right) \mid x_2^{(1)}\right] + \cdots + E\left[\phi\left(x_1^{(m)}\right) \mid x_2^{(m)}\right]\right\}$$

Justify your answer by finding the variances of the two estimators.

Exercise 12.6.4 (Mean of the importance weights). Suppose that the target density is $p(x)$ and the trial density is $q(x)$. We draw random samples $x^{(1)}, x^{(2)}, \ldots, x^{(\nu)}$ from $q$, and the sum of the weights is given by $W = \sum_{j=1}^{\nu} w(x^{(j)})$. Prove that the expectation of $W$ is equal to $\nu$.

Exercise 12.6.5 (Normalizing the importance weights). Let us assume that the weights have been normalized to sum to one and that the $\hat{x}^{(j)}$ are resampled from the $\{x^{(j)}\}$ according to the normalized weights $w_j$. Prove the following relation:

$$E\left\{\frac{1}{\nu}\sum_{j=1}^{\nu} \phi\left(\hat{x}^{(j)}\right)\right\} = E\left\{\sum_{j=1}^{\nu} w_j\, \phi\left(x^{(j)}\right)\right\}$$

Exercise 12.6.6 (Importance sampling estimators). Find the MSE for the importance sampling estimator given by

$$\frac{1}{W}\sum_{j=1}^{\nu} h\left(x^{(j)}\right) w\left(x^{(j)}\right)$$

in terms of K(x) = h(x)w(x), where W is the sum of weights.

Exercise 12.6.7 (More analytical work is good in importance sampling). Prove that

$$\operatorname{var}\left\{\frac{f_{X_1 X_2}(x_1, x_2)}{g_{X_1 X_2}(x_1, x_2)}\right\} \geq \operatorname{var}\left\{\frac{f_{X_1}(x_1)}{g_{X_1}(x_1)}\right\}$$

where the variance is calculated with respect to the density g.

References

[1]  N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chemical Physics, vol. 21, pp. 1087–1091, 1953.

[2]  J. Besag, P. Green, D. Higdon, and K. Mengersen, “Bayesian computation and stochastic systems (with discussion),” Statist. Sci., vol. 10, pp. 3–66, 1995.

[3]  W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman & Hall, New York, 1995.

[4]  J. Liu, Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York, 2001.

[5]  C. Robert and G. Casella, Monte Carlo Statistical Methods. Springer-Verlag, New York, 1999.

[6]  J. Bernardo and A. Smith, Bayesian Theory. Wiley, New York, 1995.

[7]  C. P. Robert, “Mixtures of distributions: inference and estimation,” in Markov Chain Monte Carlo in Practice. Chapman & Hall, New York, 1996, ch. 24, pp. 441–464.

[8]  C. Andrieu and A. Doucet, “Joint Bayesian detection and estimation of noisy sinusoids via reversible jump MCMC,” IEEE Transactions on Signal Processing, vol. 47, pp. 2667–2676, 1999.

[9]  K.-C. Liang, X. Wang, and D. Anastassiou, “A sequential Monte Carlo method for motif discovery,” IEEE Transactions on Signal Processing, vol. 56, no. 9, pp. 4486–4495, September 2008.

[10]  W. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, pp. 97–109, 1970.

[11]  S. Geman and D. Geman, “Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images,” IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, no. 11, pp. 721–741, Nov. 1984.

[12]  M. Tanner and W. Wong, “The calculation of posterior distribution by data augmentation (with discussion),” J. Amer. Statist. Assoc., vol. 82, pp. 528–550, 1987.

[13]  A. Gelfand and A. Smith, “Sampling-based approaches to calculating marginal densities,” J. Amer. Stat. Assoc., vol. 85, pp. 398–409, 1990.

[14]  J. Liu, “The collapsed Gibbs sampler with applications to a gene regulation problem,” J. Amer. Statist. Assoc., vol. 89, pp. 958–966, 1994.

[15]  C. Geyer, “Markov chain Monte Carlo maximum likelihood,” in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, E. Keramigas, Ed. Fairfax: Interface Foundation, 1991, pp. 156–163.

[16]  J. Liu, F. Liang, and W. Wong, “The use of multiple-try method and local optimization in Metropolis sampling,” J. Amer. Statist. Assoc., vol. 95, pp. 121–134, 2000.

[17]  F. Liang and W. Wong, “Evolutionary Monte Carlo: applications to cp model sampling and change point problem,” Statistica Sinica, vol. 10, pp. 317–342, 2000.

[18]  R. Chen and T. Li, “Blind restoration of linearly degraded discrete signals by Gibbs sampling,” IEEE Trans. Sig. Proc., vol. 43, no. 10, pp. 2410–2413, Oct. 1995.

[19]  X. Wang and R. Chen, “Blind turbo equalization in Gaussian and impulsive noise,” IEEE Transactions Vehicular Technology, vol. 50, no. 4, pp. 1092–1105, July 2001.

[20]  C. Carter and R. Kohn, “On Gibbs sampling for state space models,” Biometrika, vol. 81, pp. 541–553, 1994.

[21]  J. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic systems,” Journal of the American Statistical Association, vol. 93, pp. 1032–1044, 1998.

[22]  N. Gordon, D. Salmond, and A. Smith, “A novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proc. Radar Sig. Proc., vol. 140, pp. 107–113, 1993.

[23]  J. Liu and R. Chen, “Blind deconvolution via sequential imputations,” Journal of the American Statistical Association, vol. 90, pp. 567–576, 1995.

[24]  R. Chen and J. Liu, “Mixture Kalman filters,” J. Roy. Statist. Soc. B, vol. 62, no. 3, pp. 493–509, 2000.

[25]  A. Kong, J. Liu, and W. Wong, “Sequential imputations and Bayesian missing data problems,” J. Amer. Statist. Assoc., vol. 89, pp. 278–288, 1994.

[26]  X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R. Dougherty, “A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks,” Bioinformatics, vol. 20, no. 17, pp. 2918–2927, 2004.

[27]  C. Andrieu, J. Freitas, and A. Doucet, “Robust full Bayesian learning from neural networks,” Neural Computation, vol. 13, pp. 2359–2407, 2001.

[28]  X. Wang and V. Poor, Wireless Communication Systems: Advanced Techniques for Signal Reception. US: Prentice Hall, 2003.

[29]  X. Wang and A. Doucet, “Monte Carlo methods for signal processing: a review in the statistical signal processing context,” IEEE Signal Processing Magazine, vol. 22, no. 6, pp. 152–170, November 2005.

[30]  X. Wang, R. Chen, and J. Liu, “Monte Carlo signal processing for wireless communications,” J. VLSI Sig. Proc., vol. 30, no. 1–3, pp. 89–105, Jan.-Mar. 2002.

[31]  O. Cappe, E. Moulines, and T. Ryden, Inference in Hidden Markov Models. Berlin: Springer, 2005.

[32]  A. Doucet, N. D. Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. Berlin: Springer, 2001.

[33]  B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, Norwood, MA, 2004.

[34]  M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, February 2002.

[35]  O. Cappe, S. Godsill, and E. Moulines, “An overview of existing methods and recent advances in sequential Monte Carlo,” Proceedings of the IEEE, vol. 95, no. 5, pp. 899–924, April 2007.

[36]  P. M. Djuric, J. H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. F. Bugallo, and J. Miguez, “Particle Filtering,” IEEE Signal Processing Magazine, vol. 20, no. 5, pp. 19–38, September 2003.

[37]  A. Doucet, S. Godsill, and C. Andrieu, “On Sequential Monte Carlo Methods for Bayesian Filtering,” Statistics and Computing, vol. 10, no. 3, pp. 197–208, 2000.
