Chapter 12  Monte Carlo Methods for Statistical Signal Processing

Xiaodong Wang

Columbia University, New York, USA

12.1    Introduction

In many problems encountered in signal processing, it is possible to describe accurately the underlying statistical model using probability distributions. Statistical inference can then theoretically be performed based on the relevant likelihood function or posterior distribution in a Bayesian framework. However, most problems encountered in applied research require non-Gaussian and/or nonlinear models in order to correctly account for the observed data. In these cases, it is typically impossible to obtain the required statistical estimates of interest, e.g., maximum likelihood, conditional expectation, in closed form as it requires integration and/or maximization of complex multidimensional functions. A standard approach consists of making model simplifications or crude analytic approximations in order to obtain algorithms that can be easily implemented. With the recent availability of high-powered computers, numerical simulation based approaches can now be considered and the full complexity of real problems can be addressed.

These integration and/or optimization problems could be tackled using analytic approximation techniques or deterministic numerical integration/optimization methods. These classical methods are often either not precise and robust enough or too complex to implement. An attractive alternative consists of Monte Carlo algorithms. These algorithms are remarkably flexible and extremely powerful. The basic idea is to draw a large number of samples distributed according to some probability distribution(s) of interest so as to obtain simulation-based consistent estimates. These methods first became popular in physics [1] before literally revolutionizing applied statistics and related fields such as bioinformatics and econometrics in the 1990s [2-5].

Despite their ability to allow statistical inference to be performed for highly complex models, these flexible and powerful methods are not yet well-known in signal processing. This chapter provides a simple yet complete review of these methods in a signal processing context. We describe generic Monte Carlo methods which can be used to perform statistical inference in both batch and sequential contexts. We illustrate their applications in solving inference problems found in digital communications and bioinformatics.

12.1.1    Model-Based Signal Processing

In statistical signal processing, many problems can be formulated as follows. One is interested in obtaining an estimate of an unobserved random variable $X$ taking values in $\mathcal{X}$ given the realization of some statistically related observations $Y = y$. In a model-based context, one has access to the likelihood function giving the probability or PDF $p(y \mid x)$ of $Y = y$ given $X = x$. In this case a standard point estimate of $X$ is given by the maximum likelihood (ML) estimate

$$x_{\text{ML}} = \arg\max_{x \in \mathcal{X}} p(y \mid x).$$

For simple models, it is possible to compute $p(y \mid x)$ in closed form, and the maximization of the probability distribution/PDF can be performed easily. However, when the model includes latent variables or non-Gaussian and/or nonlinear elements, it is often impossible to compute the likelihood in closed form, and/or it is difficult to maximize because it is a multimodal and potentially high-dimensional function. This severely limits the applications of maximum likelihood approaches for complex models.

The problem appears even more clearly when one is interested in performing Bayesian inference [6, 7]. In this context, one sets a prior distribution on X, say p (x), and all (Bayesian) inference relies on the posterior distribution given by Bayes’ theorem

$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$

where

$$p(y) = \int p(y \mid x)\, p(x)\, dx.$$

For example, the minimum mean-square error (MMSE) estimate of $X$ given $Y = y$ is defined by

$$x_{\text{MMSE}} = \int x\, p(x \mid y)\, dx.$$

To be able to compute this estimate, it is necessary to compute two integrals. It is only feasible to perform these calculations analytically for simple statistical models.

12.1.2    Examples

To illustrate these problems, we discuss a few standard signal processing applications here. For the sake of simplicity, we do not distinguish random variables and their realizations from now on. We will use the notation $z_{i:j} = (z_i, z_{i+1}, \ldots, z_j)^T$ for any sequence $\{z_n\}$.

Spectral Analysis

Consider the problem of estimating some sinusoids in noise. Let $y_{1:T}$ be an observed vector of $T$ real data samples. The elements of $y_{1:T}$ may be represented by different models $\mathcal{M}_k$, corresponding either to samples of noise only ($k = 0$) or to the superposition of $k$ ($k \geq 1$) sinusoids corrupted by noise; more precisely

$$\mathcal{M}_0: \; y_n = v_{n,k}, \quad k = 0$$
$$\mathcal{M}_k: \; y_n = \sum_{j=1}^{k} \left( a_{c_{j,k}} \cos[\omega_{j,k} n] + a_{s_{j,k}} \sin[\omega_{j,k} n] \right) + v_{n,k}, \quad k \geq 1$$

where $\omega_{j_1,k} \neq \omega_{j_2,k}$ for $j_1 \neq j_2$, and $a_{c_{j,k}}, a_{s_{j,k}}, \omega_{j,k}$ are respectively the amplitudes and the radial frequency of the $j$th sinusoid for the model with $k$ sinusoids. The noise sequence $v_{1:T,k}$ is assumed zero-mean white Gaussian with variance $\sigma_k^2$. In vector-matrix form, we have

$$y_{1:T} = D(\omega_k)\, a_k + v_{k,1:T}$$

where $a_k = (a_{c_{1,k}}, a_{s_{1,k}}, \ldots, a_{c_{k,k}}, a_{s_{k,k}})^T$ and $\omega_k = (\omega_{1,k}, \ldots, \omega_{k,k})^T$. The $T \times 2k$ matrix $D(\omega_k)$ is defined as

$$[D(\omega_k)]_{i,2j-1} = \cos[\omega_{j,k}\, i], \qquad [D(\omega_k)]_{i,2j} = \sin[\omega_{j,k}\, i], \qquad (i = 1, \ldots, T;\; j = 1, \ldots, k).$$

We assume here that the number $k$ of sinusoids and their parameters $(a_k, \omega_k, \sigma_k^2)$ are unknown. Given $y_{1:T}$, our objective is to estimate $(k, a_k, \omega_k, \sigma_k^2)$. It is standard in signal processing to perform parameter estimation and model selection using a (penalized) ML approach. First, an approximate ML estimate of the parameters is found; we emphasize that unfortunately the likelihood is highly nonlinear in the parameters $\omega_k$ and typically admits severe local maxima. Model selection is then performed by maximizing an information criterion (IC) such as AIC (Akaike), BIC (Bayes) or MDL (Minimum Description Length). Note that when the number of observations is small, these criteria can perform poorly. Here, a Bayesian approach is considered instead; see [8] for a motivation of this model. One has

$$a_k \mid \sigma_k^2 \sim \mathcal{N}\!\left(0,\, \sigma_k^2 \delta^2 \left(D^T(\omega_k) D(\omega_k)\right)^{-1}\right), \qquad \sigma_k^2 \sim \mathcal{IG}\!\left(\tfrac{\nu_0}{2}, \tfrac{\gamma_0}{2}\right)$$

(12.1)

and the frequencies $\omega_k$ are independent and uniformly distributed over $(0, \pi)$. Finally, we assume that the prior distribution $p(k)$ is a Poisson distribution of intensity $\Lambda$ truncated at $k_{\max} \triangleq \lfloor (T-1)/2 \rfloor$ (this constraint is added because otherwise the columns of $D(\omega_k)$ would be linearly dependent). The terms $\delta^2$ and $\Lambda$ can be interpreted respectively as an expected signal-to-noise ratio and the expected number of sinusoids.

In this case, it can easily be established that the marginal posterior distribution of the frequencies $\omega_k$ is proportional on $\Omega = \{0, 1, \ldots, k_{\max}\} \times (0, \pi)^k$ to

$$p(\omega_k, k \mid y_{1:T}) \propto \left(\gamma_0 + y_{1:T}^T P_k\, y_{1:T}\right)^{-\frac{T + \nu_0}{2}} \frac{\left(\Lambda / \left((\delta^2 + 1)\pi\right)\right)^k}{k!}$$

(12.2)

where

$$M_k^{-1} = (1 + \delta^{-2})\, D^T(\omega_k) D(\omega_k), \qquad m_k = M_k D^T(\omega_k)\, y_{1:T}, \qquad P_k = I_T - D(\omega_k) M_k D^T(\omega_k).$$

This posterior distribution is highly nonlinear in the parameters ωk. Moreover, one cannot compute explicitly its normalizing constant p (y1:Tk) so it is impossible to compute the Bayes factors to perform model selection. Standard numerical integration techniques could be used but they are typically inefficient as soon as the dimension of the space of interest is high.
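To make the setup concrete, the following minimal Python sketch (ours, not taken from [8]; the prior values, function names, and test signal are illustrative assumptions) evaluates the log of the unnormalized posterior (12.2) for a candidate set of frequencies. Such a routine is the basic building block that any Monte Carlo scheme for this model must call repeatedly.

```python
import numpy as np

def log_posterior(omega, y, delta2=100.0, nu0=0.0, gamma0=0.0, Lam=1.0):
    """Log of the unnormalized posterior (12.2) at frequencies omega (length k)."""
    T, k = len(y), len(omega)
    if k == 0:
        P = np.eye(T)
    else:
        i = np.arange(1, T + 1)[:, None]             # time indices i = 1..T
        D = np.empty((T, 2 * k))
        D[:, 0::2] = np.cos(omega[None, :] * i)      # columns 2j - 1
        D[:, 1::2] = np.sin(omega[None, :] * i)      # columns 2j
        M = np.linalg.inv((1.0 + 1.0 / delta2) * D.T @ D)
        P = np.eye(T) - D @ M @ D.T                  # P_k as defined above
    quad = gamma0 + y @ P @ y
    log_k_fact = sum(np.log(j) for j in range(1, k + 1))
    return (-(T + nu0) / 2.0) * np.log(quad) \
        + k * np.log(Lam / ((delta2 + 1.0) * np.pi)) - log_k_fact

# Two noisy sinusoids: the true frequencies should beat the noise-only model.
rng = np.random.default_rng(0)
t = np.arange(1, 65)
y = np.cos(0.6 * t) + 0.5 * np.sin(1.3 * t) + 0.3 * rng.standard_normal(64)
print(log_posterior(np.array([]), y), log_posterior(np.array([0.6, 1.3]), y))
```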

Optimal Filtering in State-Space Models

Consider an unobserved Markov process $\{x_n\}_{n \geq 1}$ of initial density $\mu$ and transition density $x_n \mid x_{n-1} \sim f(\cdot \mid x_{n-1})$. The observations $\{y_n\}_{n \geq 1}$ are conditionally independent given $\{x_n\}_{n \geq 1}$ with marginal density $y_n \mid x_n \sim g(\cdot \mid x_n)$. This class of models is extremely wide. For example, it includes

$$x_n = \varphi(x_{n-1}, v_n), \qquad y_n = \psi(x_n, w_n)$$

where $\varphi$ and $\psi$ are two nonlinear deterministic mappings and $\{v_n\}_{n \geq 2}$ and $\{w_n\}_{n \geq 1}$ are two mutually independent sequences of independent random variables.

All inference on x1:n based on y1:n is based on the posterior distribution

$$p(x_{1:n} \mid y_{1:n}) = \frac{p(y_{1:n} \mid x_{1:n})\, p(x_{1:n})}{\int p(y_{1:n} \mid x_{1:n})\, p(x_{1:n})\, dx_{1:n}}$$

where

$$p(x_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}), \qquad p(y_{1:n} \mid x_{1:n}) = \prod_{k=1}^{n} g(y_k \mid x_k).$$

This posterior distribution satisfies the following recursion

$$p(x_{1:n} \mid y_{1:n}) = \frac{f(x_n \mid x_{n-1})\, g(y_n \mid x_n)}{p(y_n \mid y_{1:n-1})}\, p(x_{1:n-1} \mid y_{1:n-1}).$$

Unfortunately, except when $\{x_n\}_{n \geq 1}$ takes values in a finite state space (hidden Markov model techniques) or the model is linear Gaussian (Kalman filtering techniques), it is impossible to come up with a closed-form expression for this sequence of posterior distributions. Many suboptimal methods have been proposed to approximate it, e.g., the extended Kalman filter and Gaussian sum approximations. However, these methods tend to be unreliable as soon as the model includes strong nonlinearities and/or non-Gaussianity. Deterministic numerical integration methods have also been proposed, but they are complex to implement, inflexible, and realistically can only be applied to models where $\{x_n\}_{n \geq 1}$ takes values in $\mathbb{R}$ or $\mathbb{R}^2$.

DNA Sequence Motif Discovery

Efforts by various genomic projects have steadily expanded the pool of sequenced deoxyribonucleic acid (DNA) data. Motifs, or DNA patterns found in different locations within the genome, are often of interest to biologists. By seeking out these similarities exhibited in sequences, we can further our knowledge of the functions and evolution of these sequences [9]. Let $S = \{s_1, s_2, \ldots, s_T\}$, with $s_t = [s_{t1}, \ldots, s_{tL}]$, be the set of DNA sequences of length $L$ in which we wish to find a common motif. Let us assume that a motif of length $w$ is present in each one of the sequences. The distribution of the motif is described by the $4 \times w$ position weight matrix (PWM) $\Theta = [\theta_1, \theta_2, \ldots, \theta_w]$, where the column vector $\theta_j = [\theta_{j1}, \ldots, \theta_{j4}]^T$, $j = 1, \ldots, w$, is the probability distribution of the nucleotides $\{A, C, G, T\}$ at the $j$-th position of the PWM. The remaining non-motif nucleotides are assumed to follow a Markovian distribution with probabilities given by $\Theta_0$.

To formulate the motif-finding problem we use a state-space model, where the states represent the locations of the first nucleotides of the different occurrences of the motif in the sequences, whereas the observation for the state at step $t$ is the entire nucleotide sequence $s_t$. Since the last $w - 1$ nucleotides in a sequence are not valid locations for the beginning of a motif of length $w$, at step $t$, $t = 1, \ldots, T$, the state, denoted as $x_t$, takes values in the set $\mathcal{X} = \{1, 2, \ldots, L_m\}$, where $L_m = L - w + 1$.

Let $a_{t,x_t}$ be a sequence fragment of length $w$ from $s_t$ starting at position $x_t$ in $s_t$, and denote $a^c_{t,x_t}$ as the remaining fragment of $s_t$ with $a_{t,x_t}$ removed. For example, for $s_t = [\text{AAAAGGGGAAAA}]$ and $x_t = 5$ with $w = 4$, $a_{t,x_t} = [\text{GGGG}]$ and $a^c_{t,x_t} = [\text{AAAAAAAA}]$. Let us further define a vector $\mathbf{n}(a) = [n_1, n_2, n_3, n_4]$, where $n_i$, $i = 1, \ldots, 4$, denotes the count of the $i$-th nucleotide type in the sequence fragment $a$. Given the vectors $\theta = [\theta_1, \ldots, \theta_4]$ and $\mathbf{n} = [n_1, \ldots, n_4]$, we define

$$\theta^{\mathbf{n}} \triangleq \prod_{j=1}^{4} \theta_j^{n_j}.$$

(12.3)

In DNA sequences, a nucleotide is often influenced by the surrounding nucleotides. We assume for our system model a third-order Markov model for the non-motif nucleotides in the sequence. Let us denote $P3_{t,x_t}$ as the probability of $a^c_{t,x_t}$. For example, if $a^c_{t,x_t} = [\text{ATAAG}]$, the probability of $a^c_{t,x_t}$ is given by

$$P3_{t,x_t} = p(A)\, p(T \mid A)\, p(A \mid A, T)\, p(A \mid A, T, A)\, p(G \mid T, A, A).$$

(12.4)

In general, the zeroth- to third-order Markov chain probabilities for the background non-motif nucleotides can be averaged over a large genomic region and are assumed known; we denote them by $\Theta_0$. To perform motif discovery, $\Theta_0$ can be given as a known parameter by the user, or default values can be used. Since the nucleotides located in the motif are mutually independent and independent of the non-motif nucleotides, given the PWM $\Theta$, the background distribution $\Theta_0$, and the state at time $t$, the distribution of the observed sequence $s_t$ is given as follows:

$$p(s_t \mid x_t = i, \Theta) = P3_{t,i} \prod_{k=1}^{w} \theta_k^{\mathbf{n}(a_{t,i}(k))} \triangleq \ell(s_t; i, \Theta),$$

(12.5)

where $a_{t,i}(k)$ is the $k$-th element of the sequence fragment $a_{t,i}$, and $\mathbf{n}(a_{t,i}(k))$ is a $1 \times 4$ vector of zeros except at the position corresponding to the nucleotide $a_{t,i}(k)$, where it is one.

From the discussion above, we formulate the inference problem as follows. Let us denote the state realizations up to time T as x ≜ [x1, x2, ⋯, xT] and similarly the sequences up to time T as S ≜ [s1, s2, ⋯, sT], with the unknown parameter Θ, the position weight matrix. Given the sequences S and the Markovian non-motif nucleotide distribution Θ0, we wish to estimate the state realizations x, which are the starting locations of the motif in each sequence, and the position weight matrix Θ, which describes the statistics of the motif.

Remark: All the problems described above require computing and/or maximizing high-dimensional probability distributions. It is possible to come up with deterministic techniques to approximate these distributions. However, as soon as the problems become very complex, the performance of these methods typically deteriorates quickly. In this chapter, we advocate Monte Carlo methods as a powerful set of techniques that can provide satisfactory answers to all these problems.

12.2    Monte Carlo Methods

Let us consider the probability distribution or PDF $\pi(x)$, where $x \in \mathcal{X}$. We will assume from now on that $\pi(x)$ is known pointwise up to a normalizing constant, i.e.,

$$\pi(x) = Z^{-1} \tilde{\pi}(x)$$

where $\tilde{\pi}(x)$ is known pointwise but the normalizing constant

$$Z = \int_{\mathcal{X}} \tilde{\pi}(x)\, dx$$

is unknown. Note this assumption is satisfied in all the examples discussed in the previous section if x corresponds to all the unknown variables/parameters.

In most applications of interest, the space $\mathcal{X}$ is typically high-dimensional; say $\mathcal{X} = \mathbb{R}^{1000}$ or $\mathcal{X} = \{0, 1\}^{1000}$. We are interested in the following generic problems.

•  Computing integrals. For any test function $\varphi : \mathcal{X} \to \mathbb{R}$, we want to compute

$$E_\pi(\varphi) = \int_{\mathcal{X}} \varphi(x)\, \pi(x)\, dx.$$

(12.6)

•  Marginal distributions. Assume $x = (x_1, x_2) \in \mathcal{X}_1 \times \mathcal{X}_2$; then we want to compute the marginal distribution

$$\pi(x_1) = \int_{\mathcal{X}_2} \pi(x_1, x_2)\, dx_2.$$

(12.7)

•  Optimization. Given π (x), we are interested in finding

$$\arg\max_{x \in \mathcal{X}} \pi(x) = \arg\max_{x \in \mathcal{X}} \tilde{\pi}(x).$$

(12.8)

•  Integration/Optimization. Given the marginal distribution (12.7), we want to compute

$$\arg\max_{x_1 \in \mathcal{X}_1} \pi(x_1) = \arg\max_{x_1 \in \mathcal{X}_1} \tilde{\pi}(x_1)$$

(12.9)

Assume it is possible to obtain a large number $N$ of independent random samples $\{x^{(i)}\}_{i=1}^{N}$ distributed according to $\pi$. The Monte Carlo method approximates $\pi$ by the following point-mass measure

$$\hat{\pi}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta\left(x - x^{(i)}\right).$$

(12.10)

It follows that an estimate of (12.6) is given by

$$\hat{E}_\pi(\varphi) = \int_{\mathcal{X}} \varphi(x)\, \hat{\pi}(x)\, dx = \frac{1}{N} \sum_{i=1}^{N} \varphi\left(x^{(i)}\right).$$

(12.11)
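As a concrete illustration, here is a minimal Python sketch of the estimate (12.11); the choice of $\pi$ as a standard Gaussian and of the test function are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.standard_normal(N)          # x^(i) ~ pi = N(0, 1)
phi = lambda x: x**2                # test function; E_pi(phi) = 1
est = phi(x).mean()                 # (1/N) sum_i phi(x^(i)), eq. (12.11)
print(est)                          # close to 1, with O(1/sqrt(N)) error
```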

Marginal distributions can also be estimated straightforwardly as

$$\hat{\pi}(x_1) = \int_{\mathcal{X}_2} \hat{\pi}(x_1, x_2)\, dx_2 = \int_{\mathcal{X}_2} \frac{1}{N} \sum_{i=1}^{N} \delta\left(x_1 - x_1^{(i)}, x_2 - x_2^{(i)}\right) dx_2 = \frac{1}{N} \sum_{i=1}^{N} \delta\left(x_1 - x_1^{(i)}\right).$$

(12.12)

Since the samples $\{x^{(i)}\}$ are distributed according to $\pi$, a significant proportion of them will lie in the vicinity of the modes of $\pi$, so a reasonable estimate of (12.8) is

$$\arg\max_{\{x^{(i)}\}} \tilde{\pi}\left(x^{(i)}\right).$$

(12.13)

Optimizing a marginal distribution is more difficult: one cannot use $\arg\max_{\{x_1^{(i)}\}} \tilde{\pi}(x_1^{(i)})$, as the marginal distribution cannot be computed even up to a normalizing constant. In the scenario where $\pi(x_1 \mid x_2)$ is known analytically, an alternative to (12.12) is

$$\hat{\pi}(x_1) = \int_{\mathcal{X}_2} \pi(x_1 \mid x_2)\, \hat{\pi}(x_2)\, dx_2 = \int_{\mathcal{X}_2} \pi(x_1 \mid x_2)\left(\frac{1}{N} \sum_{i=1}^{N} \delta\left(x_2 - x_2^{(i)}\right)\right) dx_2 = \frac{1}{N} \sum_{i=1}^{N} \pi\left(x_1 \mid x_2^{(i)}\right).$$

(12.14)

It is then possible to estimate (12.9) by $\arg\max_{\{x_1^{(i)}\}} \hat{\pi}(x_1^{(i)})$. Note, however, that the computational complexity of this approach is unfortunately very high, as evaluating (12.14) pointwise involves $N \gg 1$ terms. Alternative techniques will be discussed later.

A natural question to ask is why the Monte Carlo method is attractive. The typical answer is that if one considers (12.11), then this estimate has good properties; i.e., it is clearly unbiased and one can easily show that its variance satisfies

$$\text{var}\left\{\hat{E}_\pi(\varphi)\right\} = \frac{\int \varphi^2(x)\, \pi(x)\, dx - E_\pi^2(\varphi)}{N}.$$

(12.15)

The truly remarkable property of this estimate is that the rate of convergence to zero of its variance is independent of the space $\mathcal{X}$ (i.e., it can be $\mathbb{R}$ or $\mathbb{R}^{10000}$), whereas the approximation error of all deterministic integration methods decreases at a rate that degrades severely as the dimension of the space increases. Note, however, that this does not imply that Monte Carlo methods will always outperform deterministic methods, as the numerator of (12.15) can be huge. Still, Monte Carlo tends to be much more flexible and powerful.

Nevertheless, these methods rely on the assumption that we are able to simulate samples $\{x^{(i)}\}$ from $\pi$. The next question is how to obtain such samples.

12.3    Markov Chain Monte Carlo (MCMC) Methods

12.3.1    General MCMC Algorithms

MCMC is a class of algorithms that allow one to draw (pseudo) random samples from an arbitrary target probability distribution $p(x)$ known up to a normalizing constant. The basic idea behind these algorithms is that one can achieve sampling from $p$ by running a Markov chain whose equilibrium distribution is exactly $p$. Two basic types of MCMC algorithms, the Metropolis algorithm and the Gibbs sampler, have been widely used in diverse fields. The validity of both algorithms can be established using basic Markov chain theory.

Metropolis-Hastings Algorithm

Let p(x) = c exp{−f (x)} be the target probability distribution from which we want to simulate random draws. The normalizing constant c may be unknown to us. Metropolis et al. [1] first introduced the fundamental idea of evolving a Markov process in Monte Carlo sampling, which was later generalized by Hastings [10]. Starting with any configuration x(0), the algorithm evolves from the current state x(t) = x to the next state x(t+1) as follows:

Algorithm 12.3.1 [Metropolis-Hastings algorithm]

•  Propose a random “perturbation” of the current state, i.e., $x \to x'$, where $x'$ is generated from a transition function $T(x \to x')$, which is nearly arbitrary (of course, some are better than others in terms of efficiency) and is completely specified by the user.

•  Compute the Metropolis ratio

$$r(x, x') = \frac{p(x')\, T(x' \to x)}{p(x)\, T(x \to x')}.$$

(12.16)

•  Generate a random number $u \sim \text{uniform}(0, 1)$. Let $x^{(t+1)} = x'$ if $u \leq r(x, x')$, and let $x^{(t+1)} = x^{(t)}$ otherwise. (A minimal simulation sketch of this procedure is given below.)
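The following minimal Python sketch (ours; the random-walk proposal, step size, and target density are illustrative assumptions) implements Algorithm 12.3.1 for a one-dimensional target known up to a constant. Because the Gaussian random-walk proposal is symmetric, $T(x \to x') = T(x' \to x)$ and the ratio (12.16) reduces to $p(x')/p(x)$; working in the log domain avoids numerical underflow.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_iter, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, lp = x0, log_p(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        x_new = x + step * rng.standard_normal()     # propose x -> x'
        lp_new = log_p(x_new)
        if np.log(rng.uniform()) <= lp_new - lp:     # accept if u <= r(x, x')
            x, lp = x_new, lp_new
        chain[t] = x                                 # on rejection keep x^(t)
    return chain

# Bimodal target known up to a constant: p(x) proportional to exp(-(x^2-4)^2/4).
chain = metropolis_hastings(lambda x: -(x**2 - 4.0)**2 / 4.0, 0.0, 50_000)
print(chain.mean(), chain.std())                     # mean near 0 by symmetry
```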

It is easy to prove that the M-H transition rule results in an “actual” transition function $A(x, y)$ (it is different from $T$ because an acceptance/rejection step is involved) that satisfies the detailed balance condition

$$p(x)\, A(x, y) = p(y)\, A(y, x),$$

(12.17)

which necessarily leads to a reversible Markov chain with p(x) as its invariant distribution.

The Metropolis algorithm has been extensively used in statistical physics over the past forty years and is the cornerstone of all MCMC techniques recently adopted and generalized in the statistics community. Another class of MCMC algorithms, the Gibbs sampler [11], differs from the Metropolis algorithm in that it uses conditional distributions based on p(x) to construct Markov chain moves.

Gibbs Sampler

Suppose $\mathbf{x} = (x_1, \ldots, x_d)$, where $x_i$ is either a scalar or a vector. In the Gibbs sampler, one systematically or randomly chooses a coordinate, say $x_i$, and then updates its value with a new sample $x_i'$ drawn from the conditional distribution $p(\cdot \mid \mathbf{x}_{[-i]})$. Algorithmically, the Gibbs sampler can be implemented as follows:

Algorithm 12.3.2 [Gibbs sampler]

Let the current state be $\mathbf{x}^{(t)} = (x_1^{(t)}, \ldots, x_d^{(t)})$.

For $i = 1, \ldots, d$, we draw $x_i^{(t+1)}$ from the conditional distribution

$$p\left(x_i \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_d^{(t)}\right).$$

Alternatively, one can randomly scan the coordinate to be updated. Suppose currently $\mathbf{x}^{(t)} = (x_1^{(t)}, \ldots, x_d^{(t)})$. Then one can randomly select $i$ from the index set $\{1, \ldots, d\}$ according to a given probability vector $(\pi_1, \ldots, \pi_d)$, draw $x_i^{(t+1)}$ from the conditional distribution $p(\cdot \mid \mathbf{x}_{[-i]}^{(t)})$, and let $\mathbf{x}_{[-i]}^{(t+1)} = \mathbf{x}_{[-i]}^{(t)}$. A minimal sketch for a case where both conditionals are available in closed form follows.
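The following minimal Python sketch (ours; the bivariate Gaussian target with correlation $\rho$ is an illustrative assumption) runs the systematic-scan Gibbs sampler of Algorithm 12.3.2 in two dimensions, where each conditional $p(x_i \mid x_{[-i]})$ is a one-dimensional Gaussian.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    s = np.sqrt(1.0 - rho**2)                      # conditional std deviation
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()  # draw x1 | x2
        x2 = rho * x1 + s * rng.standard_normal()  # draw x2 | x1 (just updated)
        out[t] = x1, x2
    return out

samples = gibbs_bivariate_normal(rho=0.9, n_iter=20_000)
print(np.corrcoef(samples.T)[0, 1])                # close to 0.9
```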

It is easy to check that every individual conditional update leaves $p$ invariant. Suppose currently $\mathbf{x}^{(t)} \sim p$. Then $\mathbf{x}_{[-i]}^{(t)}$ follows its marginal distribution under $p$. Thus,

$$p\left(x_i^{(t+1)} \mid \mathbf{x}_{[-i]}^{(t)}\right) p\left(\mathbf{x}_{[-i]}^{(t)}\right) = p\left(x_i^{(t+1)}, \mathbf{x}_{[-i]}^{(t)}\right),$$

(12.18)

which implies that the joint distribution of $(\mathbf{x}_{[-i]}^{(t)}, x_i^{(t+1)})$ is unchanged at $p$ after one update.

The Gibbs sampler’s popularity in the statistics community stems from its extensive use of conditional distributions in each iteration. The data augmentation method [12] first linked the Gibbs sampling structure with missing-data problems and EM-type algorithms. The Gibbs sampler was further popularized by [13], where it was pointed out that the conditionals needed in Gibbs iterations are commonly available in many Bayesian and likelihood computations. Under regularity conditions, one can show that the Gibbs sampler chain converges geometrically, with a convergence rate related to how the variables correlate with each other [14]. Therefore, grouping highly correlated variables together in the Gibbs update can greatly speed up the sampler.

Other techniques - A main problem with all MCMC algorithms is that they may, for some problems, move very slowly in the configuration space or may be trapped in a local mode. This phenomenon is generally called slow mixing of the chain. When the chain is slow-mixing, estimation based on the resulting Monte Carlo samples becomes very inaccurate. Some recent techniques suitable for designing more efficient MCMC samplers include parallel tempering [15], the multiple-try method [16], and evolutionary Monte Carlo [17].

12.3.2    Applications of MCMC in Digital Communications

In this section, we discuss MCMC-based receiver signal processing algorithms for several typical communication channels, when the channel conditions are unknown a priori.

MCMC Detectors in AWGN Channels

We start with the simplest channel model in digital communications – the additive white Gaussian noise (AWGN) channel. After filtering and sampling of the continuous-time received waveform, the discrete-time received signal in such a channel is given by

$$y_t = \phi x_t + v_t, \qquad t = 1, 2, \ldots, n,$$

(12.19)

where $y_t$ is the received signal at time $t$; $x_t \in \{+1, -1\}$ is the transmitted binary symbol at time $t$; $\phi$ is the received signal amplitude; and $v_t$ is an independent Gaussian noise sample with zero mean and variance $\sigma^2$, i.e., $v_t \sim \mathcal{N}(0, \sigma^2)$. Denote $X \triangleq [x_1, \ldots, x_n]$ and $Y \triangleq [y_1, \ldots, y_n]$. Our problem is to estimate the a posteriori probability distribution of each symbol based on the received signal $Y$, without knowing the channel parameters $(\phi, \sigma^2)$. The solution to this problem based on the Gibbs sampler is as follows. Assuming a uniform prior for $\phi$, a uniform prior for $X$ (on $\{-1, +1\}^n$) and an inverse chi-square prior for $\sigma^2$, $\sigma^2 \sim \chi^{-2}(\nu, \lambda)$, the complete posterior distribution is given by

$$p(X, \phi, \sigma^2 \mid Y) \propto p(Y \mid X, \phi, \sigma^2)\, p(\phi)\, p(\sigma^2)\, p(X).$$

(12.20)

The Gibbs sampler starts with arbitrary initial values of $X^{(0)}$ and, for $k = 0, 1, \ldots$, iterates between the following two steps.

Algorithm 12.3.3 [Two-component Gibbs detector in AWGN channel]

•  Draw a sample $(\phi^{(k+1)}, \sigma^{2(k+1)})$ from the conditional distribution (given $X^{(k)}$)

$$p(\phi, \sigma^2 \mid X^{(k)}, Y) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{t=1}^{n}\left(y_t - \phi x_t^{(k)}\right)^2\right] (\sigma^2)^{-\frac{\nu+2}{2}} \exp\left(-\frac{\nu\lambda}{2\sigma^2}\right) \triangleq \pi^{(k+1)}(\phi \mid \sigma^2)\, \pi^{(k+1)}(\sigma^2),$$

(12.21)

where

$$\pi^{(k+1)}(\sigma^2) \sim \chi^{-2}\left(\nu + n - 1,\; \frac{1}{\nu + n - 1}\left[\nu\lambda + \sum_{t=1}^{n} y_t^2 - \frac{1}{n}\left(\sum_{t=1}^{n} y_t x_t^{(k)}\right)^2\right]\right),$$

(12.22)

and

$$\pi^{(k+1)}(\phi \mid \sigma^2) \sim \mathcal{N}\left(\frac{1}{n} \sum_{t=1}^{n} y_t x_t^{(k)},\; \frac{\sigma^2}{n}\right).$$

(12.23)

•  Draw a sample $X^{(k+1)}$ from the following conditional distribution, given $(\phi^{(k+1)}, \sigma^{2(k+1)})$,

$$p\left(X \mid \phi^{(k+1)}, \sigma^{2(k+1)}, Y\right) = \prod_{t=1}^{n} p\left(x_t \mid y_t, \phi^{(k+1)}, \sigma^{2(k+1)}\right) \propto \prod_{t=1}^{n} \exp\left[-\frac{1}{2\sigma^{2(k+1)}}\left(y_t - \phi^{(k+1)} x_t\right)^2\right].$$

(12.24)

That is, for $t = 1, \ldots, n$ and $b \in \{+1, -1\}$, draw $x_t^{(k+1)}$ from

$$P\left(x_t^{(k+1)} = b\right) = \left[1 + \exp\left(-\frac{2 b\, \phi^{(k+1)} y_t}{\sigma^{2(k+1)}}\right)\right]^{-1}.$$

(12.25)
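The following minimal Python sketch (ours; the prior parameters $\nu$, $\lambda$, the SNR, and the data length are illustrative assumptions) implements the two steps of Algorithm 12.3.3. The inverse chi-square draw in (12.22) is realized as $\sigma^2 = S/\chi^2_{\nu+n-1}$, where $S$ is the bracketed term. Because of the phase ambiguity discussed below, the chain may lock onto $-X$ instead of $X$, so the final check compares up to a global sign flip.

```python
import numpy as np

def gibbs_detector_awgn(y, n_iter=200, nu=2.0, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    x = rng.choice([-1.0, 1.0], size=n)              # arbitrary X^(0)
    draws = []
    for _ in range(n_iter):
        # Step 1: draw (phi, sigma^2) | X, Y via (12.22)-(12.23).
        c = y @ x
        S = nu * lam + y @ y - c**2 / n              # bracket in (12.22)
        sigma2 = S / rng.chisquare(nu + n - 1)       # inverse chi-square draw
        phi = c / n + np.sqrt(sigma2 / n) * rng.standard_normal()
        # Step 2: draw X | phi, sigma^2, Y symbol-by-symbol via (12.25).
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * phi * y / sigma2))
        x = np.where(rng.uniform(size=n) < p_plus, 1.0, -1.0)
        draws.append(x.copy())
    return np.array(draws)

# Illustrative run: n = 200 BPSK symbols at moderate SNR.
rng = np.random.default_rng(1)
x_true = rng.choice([-1.0, 1.0], size=200)
y = 1.0 * x_true + 0.5 * rng.standard_normal(200)
draws = gibbs_detector_awgn(y)
x_hat = np.sign(draws[100:].mean(axis=0))            # discard burn-in
m = int(np.sum(x_hat == x_true))
print("matches (up to the sign ambiguity):", max(m, 200 - m))
```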

It is worthwhile to note that one can integrate out $\phi$ and $\sigma^2$ in (12.20) analytically to get the marginal target distribution of $X$, which can provide some further insight. More precisely, we have

$$\pi(X) \propto \left[\nu\lambda + \sum_{t=1}^{n} y_t^2 - \frac{1}{n}\left(\sum_{t=1}^{n} x_t y_t\right)^2\right]^{-(n+\nu)/2}.$$

(12.26)

This defines a distribution on the vertices of an $n$-dimensional cube. The mode of this distribution is clearly at $\hat{X}$ and $-\hat{X}$, where $\hat{X} = \operatorname{sign}(Y)$. Intuitively, this is the “obvious solution” in this simple setting, but it is not easy to generalize. Based on (12.26), we can derive another Gibbs sampling algorithm as follows.

Algorithm 12.3.4 [One-component Gibbs detector in AWGN channel]

•  Choose $t$ from $1, \ldots, n$ by either the random scan (i.e., $t$ is chosen at random) or the deterministic scan (i.e., one cycles $t$ from 1 to $n$ systematically). Update $X^{(k)}$ to $X^{(k+1)}$, where $x_s^{(k+1)} = x_s^{(k)}$ for $s \neq t$ and $x_t^{(k+1)}$ is drawn from the conditional distribution

$$\pi\left(x_t = b \mid X_{[-t]}^{(k)}\right) = \frac{\pi\left(x_t = b, X_{[-t]}^{(k)}\right)}{\pi\left(x_t = b, X_{[-t]}^{(k)}\right) + \pi\left(x_t = -b, X_{[-t]}^{(k)}\right)},$$

(12.27)

where $\pi(X)$ is as in (12.26). When the variance $\sigma^2$ is known,

$$\pi(X) \propto \exp\left\{\frac{1}{2n\sigma^2}\left(\sum_{t=1}^{n} x_t y_t\right)^2\right\}.$$

(12.28)

Besides the two Gibbs samplers just described, an attractive alternative is the Metropolis algorithm applied directly to (12.26). Suppose $X^{(k)} = (x_1^{(k)}, \ldots, x_n^{(k)})$. At step $k + 1$, the Metropolis algorithm proceeds as follows:

Algorithm 12.3.5 [Metropolis detector in AWGN channel]

•  Choose $t \in \{1, \ldots, n\}$ either by the random scan or by the deterministic scan. Define $Z = (z_1, \ldots, z_n)$, where $z_t = -x_t^{(k)}$ and $z_s = x_s^{(k)}$ for $s \neq t$. Generate independently $U \sim \text{uniform}(0, 1)$. Let $X^{(k+1)} = Z$ if

$$U \leq \min\left\{1,\; \frac{\pi(Z)}{\pi(X^{(k)})}\right\},$$

(12.29)

and let $X^{(k+1)} = X^{(k)}$ otherwise. (A minimal sketch appears below.)
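A minimal Python sketch of the Metropolis detector (ours; the prior parameters and test data are illustrative assumptions) based directly on (12.26) follows; each step proposes flipping one symbol and accepts with probability (12.29), computed in the log domain.

```python
import numpy as np

def log_pi(x, y, nu=2.0, lam=0.1):
    n = len(y)
    S = nu * lam + y @ y - (x @ y)**2 / n          # bracketed term in (12.26)
    return -(n + nu) / 2.0 * np.log(S)

def metropolis_detector(y, sweeps=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    x = rng.choice([-1.0, 1.0], size=n)
    lp = log_pi(x, y)
    for _ in range(sweeps):
        for t in range(n):                         # deterministic scan
            x[t] = -x[t]                           # proposal Z: flip symbol t
            lp_new = log_pi(x, y)
            if np.log(rng.uniform()) <= lp_new - lp:
                lp = lp_new                        # accept the flip
            else:
                x[t] = -x[t]                       # reject: restore symbol
    return x

rng = np.random.default_rng(2)
x_true = rng.choice([-1.0, 1.0], size=100)
y = x_true + 0.5 * rng.standard_normal(100)
print(np.mean(metropolis_detector(y) == x_true))   # near 0 or 1 (sign ambiguity)
```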

This Metropolis algorithm differs from the one-component Gibbs detector only slightly, in the way of updating $x_t^{(k)}$ to $x_t^{(k+1)}$. That is, the Metropolis algorithm always forces the change (to $-x_t^{(k)}$) unless it is rejected, whereas the Gibbs sampler “voluntarily” selects whether to make the change so that no rejection is incurred. It is known that when the random scan is used, the Metropolis rule always results in a smaller second-largest eigenvalue (not in absolute value) than the corresponding Gibbs sampler [4]. Thus, when the target distribution is relatively peaked (high signal-to-noise ratio (SNR)) the Metropolis algorithm is slightly preferable. However, the Metropolis algorithm may have a large (in absolute value) negative eigenvalue when the target distribution is flatter (low SNR). In practice, however, the large negative eigenvalue is not a serious concern. No clear theory is available when a deterministic scan is used for updating. Simulations suggest that a similar result to that of the random scan samplers seems to hold well.

To overcome the phase ambiguity, one can either restrict $\phi$ to be positive or, alternatively, use differential encoding. Let the information sequence be $s_t \in \{+1, -1\}$, $t = 2, \ldots, n$. In differential coding, we construct the transmitted sequence $x_t \in \{+1, -1\}$, $t = 1, \ldots, n$, such that $x_t = x_{t-1} s_t$. To obtain Monte Carlo draws from the posterior distribution $p(S, \phi, \sigma^2 \mid Y)$, we use one of the MCMC algorithms to generate a Markov chain on $(X, \phi, \sigma^2)$ and then convert the samples of $X$ to $S$ using $s_t^{(k)} = x_t^{(k)} x_{t-1}^{(k)}$, $t = 2, \ldots, n$. Note that in this way $X$ and $-X$ result in the same $S$. Since $\{X^{(k)}\}$ is a Markov chain, so is $\{S^{(k)}\}$. The transition probability from $S^{(k)}$ to $S^{(k+1)}$ is given by

$$P\left(S^{(k+1)} \mid S^{(k)}\right) = P\left(X^{(k+1)} \mid X^{(k)}\right) + P\left(-X^{(k+1)} \mid X^{(k)}\right),$$

(12.30)

where both $X^{(k+1)}$ and $-X^{(k+1)}$ result in $S^{(k+1)}$, and $X^{(k)}$ results in $S^{(k)}$. Note that both $X^{(k)}$ and $-X^{(k)}$ result in $S^{(k)}$, but since $P(-X^{(k+1)} \mid -X^{(k)}) = P(X^{(k+1)} \mid X^{(k)})$, either one can be used.

By denoting $s_1 = x_1$ and $S \triangleq [s_1, s_2, \ldots, s_n]$, we can modify (12.26) to give the marginal target distribution of the $s_t$:

$$\pi(s_1, \ldots, s_n) \propto \left[\nu\lambda + \sum_{t=1}^{n} y_t^2 - \frac{1}{n}\left(\sum_{t=1}^{n} y_t \prod_{i=1}^{t} s_i\right)^2\right]^{-(n+\nu)/2}.$$

(12.31)

Clearly, $s_1$ is independent of all the other $s_t$ and has a uniform marginal distribution.

It is trickier to implement an efficient Gibbs sampler or Metropolis algorithm based on (12.31). For example, the single-site update method (i.e., changing one $s_t$ at a time) may be inefficient: when we propose to change $s_t$ to $-s_t$, the signs of $x_t, x_{t+1}, \ldots$ all have to change. This may result in a very small acceptance rate. Since a single update from $x_t$ to $-x_t$ corresponds to changing $(s_t, s_{t+1})$ to $(-s_t, -s_{t+1})$, we can instead employ the proposals

$$(s_t, s_{t+1}) \to (-s_t, -s_{t+1}), \qquad t < n,$$

and $s_n \to -s_n$ for the distribution (12.31).

MCMC Equalizers in ISI Channels

Next we consider the Gibbs sampler for blind equalization in an intersymbol interference (ISI) channel [18, 19]. After filtering and sampling the continuous-time received waveform, the discrete-time received signal in such a channel is given by

$$y_t = \sum_{s=0}^{q} \phi_s x_{t-s} + v_t, \qquad t = 1, 2, \ldots, n,$$

(12.32)

where (q + 1) is the channel order; ϕi is the value of the i-th channel tap, i = 0, …, q; xt ∈ {+1, −1} is the transmitted binary symbol at time t; and vt ~ N(0, σ2) is an independent Gaussian noise sample at time t.

Let $X \triangleq [x_{1-q}, \ldots, x_n]$, $Y \triangleq [y_1, \ldots, y_n]$, and $\phi \triangleq [\phi_0, \ldots, \phi_q]^T$. With a uniform prior for $\phi$, a uniform prior for $X$, and an inverse chi-square prior for $\sigma^2$ (e.g., $\sigma^2 \sim \chi^{-2}(\nu, \lambda)$), the complete posterior distribution is

$$p(X, \phi, \sigma^2 \mid Y) \propto p(Y \mid X, \phi, \sigma^2)\, p(\phi)\, p(\sigma^2)\, p(X).$$

(12.33)

The Gibbs sampler approach to this problem starts with an arbitrary initial value of X(0) and iterates between the following two steps:

Algorithm 12.3.6 [Two-component Gibbs equalizer in ISI channel]

•  Draw a sample $(\phi^{(k+1)}, \sigma^{2(k+1)})$ from the conditional distribution (given $X^{(k)}$)

$$p(\phi, \sigma^2 \mid X^{(k)}, Y) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{t=1}^{n}\left(y_t - \phi^T \mathbf{x}_t^{(k)}\right)^2\right] (\sigma^2)^{-\frac{\nu+2}{2}} \exp\left(-\frac{\nu\lambda}{2\sigma^2}\right) \triangleq \pi^{(k+1)}(\phi \mid \sigma^2)\, \pi^{(k+1)}(\sigma^2),$$

(12.34)

where $\mathbf{x}_t^{(k)} \triangleq [x_t^{(k)}, \ldots, x_{t-q}^{(k)}]^T$ for $k = 0, 1, \ldots$, and

$$\pi^{(k+1)}(\sigma^2) \sim \chi^{-2}\left(\nu + n - 1,\; \frac{\nu\lambda + W^{(k+1)}}{\nu + n - 1}\right),$$

(12.35)

$$\pi^{(k+1)}(\phi \mid \sigma^2) \sim \mathcal{N}\left(\mu^{(k+1)}, \Sigma^{(k+1)}\right),$$

(12.36)

$$W^{(k+1)} = \sum_{t=1}^{n} y_t^2 - \left[\sum_{t=1}^{n} \mathbf{x}_t^{(k)} y_t\right]^T \left[\sum_{t=1}^{n} \mathbf{x}_t^{(k)} \mathbf{x}_t^{(k)T}\right]^{-1} \left[\sum_{t=1}^{n} \mathbf{x}_t^{(k)} y_t\right],$$

(12.37)

$$\Sigma^{(k+1)} = \left[\frac{1}{\sigma^2} \sum_{t=1}^{n} \mathbf{x}_t^{(k)} \mathbf{x}_t^{(k)T}\right]^{-1},$$

(12.38)

$$\mu^{(k+1)} = \Sigma^{(k+1)}\left(\frac{1}{\sigma^2} \sum_{t=1}^{n} \mathbf{x}_t^{(k)} y_t\right).$$

(12.39)

•  Draw a sample $X^{(k+1)}$ from the conditional distribution, given $(\phi^{(k+1)}, \sigma^{2(k+1)})$, through the following iterations. For $t = 1 - q, \ldots, n$, generate $x_t^{(k+1)}$ from

$$p\left(x_t \mid \phi^{(k+1)}, \sigma^{2(k+1)}, Y, X_{[-t]}^{(k)}\right) \propto \exp\left[-\frac{1}{2\sigma^{2(k+1)}} \sum_{j=1}^{n}\left(y_j - \phi^{(k+1)T} \mathbf{x}_j^{(k)}\right)^2\right],$$

(12.40)

where $X_{[-t]}^{(k)} \triangleq [x_{1-q}^{(k+1)}, \ldots, x_{t-1}^{(k+1)}, x_{t+1}^{(k)}, \ldots, x_n^{(k)}]$ and $\mathbf{x}_j^{(k)} \triangleq [X_{[-t]}^{(k)}]_{j-q:j}$.

Another interesting Gibbs sampling scheme is based on the “grouping” idea [20]. In particular, a forward-backward algorithm can be employed to sample $X$ jointly, conditional on $Y$ and the parameters. This scheme is effective when $X$ forms a Gaussian Markov model or a Markov chain whose state variable takes on only a few values. In the ISI channel equalization problem, the $x_t$ are i.i.d. symbols a priori, but they are correlated a posteriori because of the observed signal $Y$ and the relationship (12.32). The induced correlation among the $x_t$ vanishes after lag $q$. More precisely, instead of using formula (12.40) to sample $X$ iteratively, one can draw $X$ altogether:

Algorithm 12.3.7 [Grouping-Gibbs equalizer in ISI channel]

•  The first few steps are identical to the previous Gibbs equalizer.

•  The last step is replaced by the forward-backward scheme. Conditional on ϕ and σ (we suppress the superscript for iteration numbers), we have the joint distribution of X:

$$p(X \mid \phi, \sigma, Y) \propto \exp\left[-\frac{1}{2\sigma^2} \sum_{j=1}^{n}\left(y_j - \phi^T \mathbf{x}_j\right)^2\right] \triangleq \exp\left\{g_1(\mathbf{x}_1) + \cdots + g_n(\mathbf{x}_n)\right\},$$

(12.41)

where $\mathbf{x}_j = (x_{j-q}, \ldots, x_j)$. Thus, each $\mathbf{x}_j$ can take $2^{q+1}$ possible values. The following two steps produce a sample $X$ from $p(X \mid \phi, \sigma, Y)$.

–  Forward summation. Define $f_1(\mathbf{x}_1) = \exp\{g_1(\mathbf{x}_1)\}$ and compute recursively

$$f_{j+1}(\mathbf{x}_{j+1}) = \sum_{x_{j-q} \in \{-1, +1\}} \left[f_j(\mathbf{x}_j)\, \exp\{g_{j+1}(\mathbf{x}_{j+1})\}\right].$$

(12.42)

–  Backward sampling. First draw $\mathbf{x}_n = (x_{n-q}, \ldots, x_n)$ from the distribution $P(\mathbf{x}_n) \propto f_n(\mathbf{x}_n)$. Then, for $j = n - q - 1, \ldots, 1$, draw $x_j$ from $P(x_j \mid x_{j+1}, \ldots, x_n) \propto f_{j+q}(x_j, \ldots, x_{j+q})$.

Although the grouping idea is attractive for overcoming the channel memory problem, the additional computational cost may offset its advantages. More precisely, the forward-backward procedure needs about $2^q$ times more memory and about $2^q$ times more basic operations. A minimal sketch of the forward-backward scheme follows.
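To make the forward-backward idea concrete, the following minimal Python sketch (ours) treats the first-order special case: a binary chain with pairwise log-potentials $g_t(x_{t-1}, x_t)$, so that $p(x) \propto \exp\{\sum_t g_t(x_{t-1}, x_t)\}$. The scheme above for channel order $q$ works in exactly the same way on the $2^{q+1}$-valued blocks $\mathbf{x}_j = (x_{j-q}, \ldots, x_j)$; the potentials used in the example are illustrative.

```python
import numpy as np

def forward_backward_sample(g, n, rng):
    """g[t] is a 2x2 array of log-potentials g_t(x_{t-1}, x_t); variables 0-indexed."""
    f = [np.zeros(2)]                                # f_1(x_1): flat start
    for t in range(1, n):                            # forward summation, cf. (12.42)
        w = f[-1][:, None] + g[t]                    # message for x_{t-1} plus potential
        f.append(np.logaddexp(w[0], w[1]))           # sum out x_{t-1} (log domain)
    x = np.empty(n, dtype=int)
    p = np.exp(f[-1] - f[-1].max())
    x[-1] = rng.choice(2, p=p / p.sum())             # draw the last variable from f_n
    for t in range(n - 2, -1, -1):                   # backward sampling
        w = f[t] + g[t + 1][:, x[t + 1]]
        p = np.exp(w - w.max())
        x[t] = rng.choice(2, p=p / p.sum())
    return x

rng = np.random.default_rng(0)
n = 10
g = [np.array([[1.0, -1.0], [-1.0, 1.0]])] * n       # potentials favoring runs
print(forward_backward_sample(g, n, rng))
```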

Similar to the previous section, we can integrate out the continuous parameters and write down the marginal target distribution of X:

$$\pi(X) \propto \left[\nu\lambda + W\right]^{-(n+\nu)/2}$$

(12.43)

where

$$W = \sum_{t} y_t^2 - \left[\sum_{t} \mathbf{x}_t y_t\right]^T \left[\sum_{t} \mathbf{x}_t \mathbf{x}_t^T\right]^{-1} \left[\sum_{t} \mathbf{x}_t y_t\right].$$

(12.44)

We can then derive the one-component Gibbs and Metropolis algorithms accordingly. The phase ambiguity (i.e., the likelihood is unchanged when $X$ is changed to $-X$) can be clearly seen from this joint distribution.

Algorithm 12.3.8 [One-component Gibbs/Metropolis equalizer in ISI channel]

•  Choose $t$ from $1, \ldots, n$ by either the random scan or the systematic scan. Let $X^{(k+1)} = Z$, where $z_s = x_s^{(k)}$ for $s \neq t$ and $z_t = -x_t^{(k)}$, with probability

$$\frac{\pi(Z)}{\pi(X^{(k)}) + \pi(Z)},$$

(12.45)

for the Gibbs equalizer, or with probability

$$\min\left\{1,\; \frac{\pi(Z)}{\pi(X^{(k)})}\right\}$$

(12.46)

for the Metropolis equalizer, where $\pi(X)$ is as in (12.43). Otherwise let $X^{(k+1)} = X^{(k)}$. When the variance $\sigma^2$ is known,

$$\pi(X) \propto \left|\sum_{t} \mathbf{x}_t \mathbf{x}_t^T\right|^{-q/2} \exp\left(\frac{1}{2\sigma^2}\left[\sum_{t} \mathbf{x}_t y_t\right]^T \left[\sum_{t} \mathbf{x}_t \mathbf{x}_t^T\right]^{-1} \left[\sum_{t} \mathbf{x}_t y_t\right]\right).$$

To overcome the phase ambiguity, we use differential coding in all of our algorithms. Denote $S \triangleq [s_2, \ldots, s_n]$ as the information bits, and let $s_t^{(k)} = x_t^{(k)} x_{t-1}^{(k)}$, $t = 2, \ldots, n$. Since $X^{(k)}$ forms a Markov chain, $S^{(k)}$ is a Markov chain too. The transition probability from $S^{(k)}$ to $S^{(k+1)}$ is

$$P\left(S^{(k+1)} \mid S^{(k)}\right) = P\left(X^{(k+1)} \mid X^{(k)}\right) + P\left(-X^{(k+1)} \mid X^{(k)}\right),$$

(12.47)

where both X(k+1) and −X(k+1) result in S(k+1) and X(k) results in S(k).

12.4    Sequential Monte Carlo (SMC) Methods

12.4.1    General SMC Algorithms

Sequential Importance Sampling

Importance sampling is perhaps one of the most elementary, well-known, and versatile Monte Carlo techniques. Suppose we want to estimate $E\{h(\mathbf{x})\}$ (with respect to $p$) using the Monte Carlo method. Since directly sampling from $p(\mathbf{x})$ is difficult, we want to find a trial distribution $q(\mathbf{x})$ that is reasonably close to $p$ but easy to draw samples from. Because of the simple identity

$$E\{h(\mathbf{x})\} = \int h(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \int h(\mathbf{x})\, w(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x},$$

(12.48)

where

$$w(\mathbf{x}) \triangleq \frac{p(\mathbf{x})}{q(\mathbf{x})},$$

(12.49)

is the importance weight, we can approximate (12.48) by

$$E\{h(\mathbf{x})\} \approx \frac{1}{W} \sum_{j=1}^{\nu} h\left(\mathbf{x}^{(j)}\right) w\left(\mathbf{x}^{(j)}\right),$$

(12.50)

where $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(\nu)}$ are random samples from $q$, and $W = \sum_{j=1}^{\nu} w(\mathbf{x}^{(j)})$. In using this method, we only need to know the expression of $p(\mathbf{x})$ up to a normalizing constant, which is the case for many processing problems found in digital communications. Each $\mathbf{x}^{(j)}$ is said to be properly weighted by $w(\mathbf{x}^{(j)})$ with respect to $p$. (A minimal sketch of this estimate follows.)
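A minimal Python sketch of the importance sampling estimate (12.50) follows (ours; the target, trial density, and test function are illustrative assumptions). Note that $p$ enters only through its unnormalized log density, and the self-normalization by $W$ makes any constant factors in $p$ and $q$ irrelevant.

```python
import numpy as np

rng = np.random.default_rng(2)
nu = 100_000
x = 2.0 * rng.standard_normal(nu)                # x^(j) ~ q = N(0, 4)
log_p_tilde = -0.5 * x**2                        # target p = N(0,1), unnormalized
log_q = -0.5 * (x / 2.0)**2 - np.log(2.0)        # trial density up to a constant
w = np.exp(log_p_tilde - log_q)                  # importance weights (12.49)
h = x**2                                         # test function; E_p{h} = 1
print((h * w).sum() / w.sum())                   # self-normalized estimate (12.50)
```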

However, it is usually difficult to design a good trial density function in high-dimensional problems. One of the most useful strategies in these problems is to build up the trial density sequentially. Suppose we can decompose x as (x1, ⋯, xd) where each of the xj may be multidimensional. Then our trial density can be constructed as

$$q(\mathbf{x}) = q_1(x_1)\, q_2(x_2 \mid x_1) \cdots q_d(x_d \mid x_1, \ldots, x_{d-1}),$$

(12.51)

by which we hope to obtain some guidance from the target density while building up the trial density. Corresponding to the decomposition of x, we can rewrite the target density as

$$p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_d \mid x_1, \ldots, x_{d-1}),$$

(12.52)

and the importance weight as

$$w(\mathbf{x}) = \frac{p(x_1)\, p(x_2 \mid x_1) \cdots p(x_d \mid x_1, \ldots, x_{d-1})}{q_1(x_1)\, q_2(x_2 \mid x_1) \cdots q_d(x_d \mid x_1, \ldots, x_{d-1})}.$$

(12.53)

Equation (12.53) suggests a recursive way of computing and monitoring the importance weight. That is, by denoting $\mathbf{x}_t = (x_1, \ldots, x_t)$ (thus, $\mathbf{x}_d \equiv \mathbf{x}$), we have

$$w_t(\mathbf{x}_t) = w_{t-1}(\mathbf{x}_{t-1})\, \frac{p(x_t \mid \mathbf{x}_{t-1})}{q_t(x_t \mid \mathbf{x}_{t-1})}.$$

(12.54)

Then wd is equal to w(x) in (12.53). Potential advantages of this recursion and (12.52) are: (a) We can stop generating further components of x if the partial weight derived from the sequentially generated partial sample is too small; and (b) we can take advantage of p(xtxt−1) in designing qt(xtxt−1). In other words, the marginal distribution p(xt) can be used to guide the generation of x.

Although the “idea” sounds interesting, the trouble is that expressions (12.52) and (12.53) are not useful at all! The reason is that in order to get (12.52), one needs to have the marginal distribution

$$p(\mathbf{x}_t) = \int p(x_1, \ldots, x_d)\, dx_{t+1} \cdots dx_d,$$

(12.55)

which is perhaps more difficult than the original problem.

In order to carry out the sequential sampling idea, we need to find a sequence of “auxiliary distributions,” $\pi_1(\mathbf{x}_1), \pi_2(\mathbf{x}_2), \ldots, \pi_d(\mathbf{x})$, so that $\pi_t(\mathbf{x}_t)$ is a reasonable approximation to the marginal distribution $p(\mathbf{x}_t)$, for $t = 1, \ldots, d - 1$, and $\pi_d = p$. We want to emphasize that the $\pi_t$ are only required to be known up to a normalizing constant, and they serve only as “guides” for the construction of the whole sample $\mathbf{x} = (x_1, \ldots, x_d)$. The sequential importance sampling (SIS) method can then be defined as the following recursive procedure.

Algorithm 12.4.1 [Sequential importance sampling (SIS)]

For t = 2, ⋯, d:

•  Draw $x_t$ from $q_t(x_t \mid \mathbf{x}_{t-1})$, and let $\mathbf{x}_t = (\mathbf{x}_{t-1}, x_t)$.

•  Compute

$$u_t = \frac{\pi_t(\mathbf{x}_t)}{\pi_{t-1}(\mathbf{x}_{t-1})\, q_t(x_t \mid \mathbf{x}_{t-1})},$$

(12.56)

and let wt = wt−1ut. Here ut is called an incremental weight.

It is easy to show that xt is properly weighted by wt with respect to πt provided that xt−1 is properly weighted by wt−1 with respect to πt−1. Thus, the whole sample x obtained by SIS is properly weighted by wd with respect to the target density p(x). The “auxiliary distributions” can also be used to help construct a more efficient trial distribution:

•  We can build qt in light of πt. For example, one can choose (if possible)

$$q_t(x_t \mid \mathbf{x}_{t-1}) = \pi_t(x_t \mid \mathbf{x}_{t-1}).$$

(12.57)

Then the incremental weight becomes

$$u_t = \frac{\pi_t(\mathbf{x}_{t-1})}{\pi_{t-1}(\mathbf{x}_{t-1})}.$$

(12.58)

By the same token, we may also want $q_t$ to be $\pi_{t+1}(x_t \mid \mathbf{x}_{t-1})$, where the latter involves integrating out $x_{t+1}$.

•  When we observe that wt is getting too small, we may want to reject the sample half way and restart. In this way, we avoid wasting time on generating samples that are deemed to have little effect in the final estimation. However, as an outright rejection incurs bias, techniques such as the rejection control are needed [21].

•  Another problem with the SIS is that the resulting importance weights are often very skewed, especially when d is large. An important recent advance in sequential Monte Carlo to address this problem is the resampling technique [21, 22, 23].

SMC for Dynamic Systems

Consider the following dynamic system modeled in a state-space form as

$$\text{state equation:} \quad z_t = f_t(z_{t-1}, u_t), \qquad \text{observation equation:} \quad y_t = g_t(z_t, v_t),$$

(12.59)

where zt, yt, ut and vt are, respectively, the state variable, the observation, the state noise, and the observation noise at time t. They can be either scalars or vectors.

Let $Z_t = (z_0, z_1, \ldots, z_t)$ and $Y_t = (y_0, y_1, \ldots, y_t)$. Suppose an on-line inference of $Z_t$ is of interest; that is, at the current time $t$ we wish to make a timely estimate of a function of the state variable $Z_t$, say $h(Z_t)$, based on the currently available observations $Y_t$. By Bayes' theorem, the optimal solution to this problem is $E\{h(Z_t) \mid Y_t\} = \int h(Z_t)\, p(Z_t \mid Y_t)\, dZ_t$. In most cases an exact evaluation of this expectation is analytically intractable because of the complexity of such a dynamic system. Monte Carlo methods provide us with a viable alternative to the required computation. Specifically, suppose a set of random samples $\{Z_t^{(j)}\}_{j=1}^{\nu}$ is generated from the trial distribution $q(Z_t \mid Y_t)$. By associating the weight

$$w_t^{(j)} = \frac{p\left(Z_t^{(j)} \mid Y_t\right)}{q\left(Z_t^{(j)} \mid Y_t\right)}$$

(12.60)

to the sample $Z_t^{(j)}$, we can approximate the quantity of interest, $E\{h(Z_t) \mid Y_t\}$, as

$$E\{h(Z_t) \mid Y_t\} \approx \frac{1}{W_t} \sum_{j=1}^{\nu} h\left(Z_t^{(j)}\right) w_t^{(j)},$$

(12.61)

where $W_t = \sum_{j=1}^{\nu} w_t^{(j)}$. The pair $(Z_t^{(j)}, w_t^{(j)})$ is a properly weighted sample with respect to the distribution $p(Z_t \mid Y_t)$. A trivial but important observation is that $z_t^{(j)}$ (one of the components of $Z_t^{(j)}$) is also properly weighted by $w_t^{(j)}$ with respect to the marginal distribution $p(z_t \mid Y_t)$.

To implement Monte Carlo techniques for a dynamic system, a set of random samples properly weighted with respect to $p(Z_t \mid Y_t)$ is needed for any time $t$. Because the state equation in system (12.59) possesses a Markovian structure, we can implement an SMC strategy [21]. Suppose a set of properly weighted samples $\{(Z_{t-1}^{(j)}, w_{t-1}^{(j)})\}_{j=1}^{\nu}$ (with respect to $p(Z_{t-1} \mid Y_{t-1})$) is given at time $t - 1$. A sequential Monte Carlo filter generates from this set a new one, $\{(Z_t^{(j)}, w_t^{(j)})\}_{j=1}^{\nu}$, which is properly weighted at time $t$ with respect to $p(Z_t \mid Y_t)$, according to the following algorithm.

Algorithm 12.4.2 [Sequential Monte Carlo filter for dynamic systems]

For j = 1, ⋯, ν:

•  Draw a sample $z_t^{(j)}$ from a trial distribution $q(z_t \mid Z_{t-1}^{(j)}, Y_t)$ and let $Z_t^{(j)} = (Z_{t-1}^{(j)}, z_t^{(j)})$;

•  Compute the importance weight

$$w_t^{(j)} = w_{t-1}^{(j)}\, \frac{p\left(Z_t^{(j)} \mid Y_t\right)}{p\left(Z_{t-1}^{(j)} \mid Y_{t-1}\right)\, q\left(z_t^{(j)} \mid Z_{t-1}^{(j)}, Y_t\right)}.$$

(12.62)

The algorithm is initialized by drawing a set of i.i.d. samples $z_0^{(1)}, \ldots, z_0^{(\nu)}$ from $p(z_0 \mid y_0)$. When $y_0$ represents the “null” information, $p(z_0 \mid y_0)$ corresponds to the prior of $z_0$.

A useful choice of the trial distribution $q(z_t \mid Z_{t-1}^{(j)}, Y_t)$ for the state-space model (12.59) is of the form

$$q\left(z_t \mid Z_{t-1}^{(j)}, Y_t\right) = p\left(z_t \mid Z_{t-1}^{(j)}, Y_t\right) = \frac{p(y_t \mid z_t)\, p\left(z_t \mid z_{t-1}^{(j)}\right)}{p\left(y_t \mid z_{t-1}^{(j)}\right)}.$$

(12.63)

For this trial distribution, the importance weight is updated according to

$$w_t^{(j)} \propto w_{t-1}^{(j)}\, p\left(y_t \mid z_{t-1}^{(j)}\right).$$

(12.64)
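A minimal Python sketch of Algorithm 12.4.2 follows (ours; the scalar linear-Gaussian model and all numerical values are illustrative assumptions). It uses the state-transition prior $f(z_t \mid z_{t-1})$ as the trial distribution, for which the weight update simplifies to $w_t \propto w_{t-1}\, g(y_t \mid z_t)$; choosing instead the optimal trial distribution (12.63) would give the update (12.64). The resampling step anticipates Section 12.4.2.

```python
import numpy as np

def particle_filter(y, n_part=1000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_part)                  # draws from p(z_0)
    w = np.ones(n_part) / n_part
    means = []
    for yt in y:
        z = 0.9 * z + rng.standard_normal(n_part)    # z_t^(j) ~ f(. | z_{t-1}^(j))
        w = w * np.exp(-0.5 * (yt - z)**2)           # w_t = w_{t-1} g(y_t | z_t)
        w = w / w.sum()
        means.append(w @ z)                          # estimate of E{z_t | Y_t}
        if 1.0 / np.sum(w**2) < n_part / 10:         # effective sample size rule
            idx = rng.choice(n_part, size=n_part, p=w)
            z, w = z[idx], np.ones(n_part) / n_part  # resample, reset weights
    return np.array(means)

# Simulate the illustrative model z_t = 0.9 z_{t-1} + u_t, y_t = z_t + v_t.
rng = np.random.default_rng(1)
z_true, ys = 0.0, []
for _ in range(100):
    z_true = 0.9 * z_true + rng.standard_normal()
    ys.append(z_true + rng.standard_normal())
print(particle_filter(np.array(ys))[-5:])            # filtered means, last 5 steps
```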

Mixture Kalman Filter

Many dynamic system models belong to the class of conditional dynamic linear models (CDLM) of the form

$$\mathbf{x}_t = F_{\lambda_t} \mathbf{x}_{t-1} + G_{\lambda_t} \mathbf{u}_t, \qquad \mathbf{y}_t = H_{\lambda_t} \mathbf{x}_t + K_{\lambda_t} \mathbf{v}_t,$$

(12.65)

where $\mathbf{u}_t \sim \mathcal{N}_c(0, I)$, $\mathbf{v}_t \sim \mathcal{N}_c(0, I)$ (here $I$ denotes an identity matrix), and $\lambda_t$ is a random indicator variable. The matrices $F_{\lambda_t}, G_{\lambda_t}, H_{\lambda_t}$ and $K_{\lambda_t}$ are known given $\lambda_t$. In this model, the “state variable” $z_t$ corresponds to $(\mathbf{x}_t, \lambda_t)$.

We observe that for a given trajectory of the indicator $\lambda_t$ in a CDLM, the system is both linear and Gaussian, for which the Kalman filter provides the complete statistical characterization of the system dynamics. The mixture Kalman filter (MKF) [24] can be employed for on-line filtering and prediction of CDLMs. It exploits the conditional Gaussian property and utilizes a marginalization operation to improve algorithmic efficiency. Instead of dealing with both $\mathbf{x}_t$ and $\lambda_t$, the MKF draws Monte Carlo samples only in the indicator space and uses a mixture of Gaussian distributions to approximate the target distribution. Compared with the generic SMC method, the MKF is substantially more efficient (e.g., giving more accurate results with the same computing resources).

Let $Y_t = (y_0, y_1, \ldots, y_t)$ and $\Lambda_t = (\lambda_0, \lambda_1, \ldots, \lambda_t)$. By recursively generating a set of properly weighted random samples $\{(\Lambda_t^{(j)}, w_t^{(j)})\}_{j=1}^{\nu}$ to represent $p(\Lambda_t \mid Y_t)$, the MKF approximates the target distribution $p(\mathbf{x}_t \mid Y_t)$ by a random mixture of Gaussian distributions

$$\frac{1}{W_t} \sum_{j=1}^{\nu} w_t^{(j)}\, \mathcal{N}_c\left(\mu_t^{(j)}, \Sigma_t^{(j)}\right),$$

(12.66)

where $\kappa_t^{(j)} \triangleq [\mu_t^{(j)}, \Sigma_t^{(j)}]$ is obtained by running a Kalman filter for the given indicator trajectory $\Lambda_t^{(j)}$, and $W_t = \sum_{j=1}^{\nu} w_t^{(j)}$. A key step in the MKF is the production at time $t$ of a weighted sample of indicators, $\{(\Lambda_t^{(j)}, \kappa_t^{(j)}, w_t^{(j)})\}_{j=1}^{\nu}$, based on the set of samples $\{(\Lambda_{t-1}^{(j)}, \kappa_{t-1}^{(j)}, w_{t-1}^{(j)})\}_{j=1}^{\nu}$ at the previous time $t - 1$, according to the following algorithm.

Algorithm 12.4.3 [Mixture Kalman filter]

For j = 1, ⋯, ν:

•  Draw a sample $\lambda_t^{(j)}$ from a trial distribution $q(\lambda_t \mid \Lambda_{t-1}^{(j)}, \kappa_{t-1}^{(j)}, Y_t)$.

•  Run a one-step Kalman filter based on $\lambda_t^{(j)}$, $\kappa_{t-1}^{(j)}$, and $y_t$ to obtain $\kappa_t^{(j)}$.

•  Compute the weight

$$w_t^{(j)} \propto w_{t-1}^{(j)}\, \frac{p\left(\Lambda_{t-1}^{(j)}, \lambda_t^{(j)} \mid Y_t\right)}{p\left(\Lambda_{t-1}^{(j)} \mid Y_{t-1}\right)\, q\left(\lambda_t^{(j)} \mid \Lambda_{t-1}^{(j)}, \kappa_{t-1}^{(j)}, Y_t\right)}.$$

(12.67)
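The following minimal Python sketch (ours; the scalar model and its coefficients are illustrative assumptions) shows the one-step Kalman update at the core of the MKF: given an indicator value, it propagates the Gaussian sufficient statistics $\kappa = (\mu, P)$ and returns the predictive likelihood $p(y_t \mid \lambda_t, \kappa_{t-1})$ needed to evaluate the weight (12.67).

```python
import numpy as np

def kalman_one_step(mu, P, y, F, G, H, K):
    """One-step Kalman update for x_t = F x_{t-1} + G u_t, y_t = H x_t + K v_t."""
    mu_pred = F * mu                         # predicted mean
    P_pred = F * P * F + G * G               # predicted variance
    S = H * P_pred * H + K * K               # innovation variance
    resid = y - H * mu_pred                  # innovation
    gain = P_pred * H / S                    # Kalman gain
    mu_new = mu_pred + gain * resid
    P_new = (1.0 - gain * H) * P_pred
    like = np.exp(-0.5 * resid**2 / S) / np.sqrt(2.0 * np.pi * S)
    return mu_new, P_new, like               # like = p(y_t | lambda_t, kappa_{t-1})

# Two hypothetical indicator values, e.g., "signal present/absent":
params = {0: dict(F=0.9, G=1.0, H=1.0, K=0.5),
          1: dict(F=0.9, G=1.0, H=0.0, K=0.5)}
print(kalman_one_step(0.0, 1.0, 0.7, **params[0]))
print(kalman_one_step(0.0, 1.0, 0.7, **params[1]))
```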

12.4.2    Resampling Procedures

The importance sampling weight $w_t^{(j)}$ measures the “quality” of the corresponding imputed signal sequence $Z_t^{(j)}$. A relatively small weight implies that the sample is drawn far from the main body of the posterior distribution and makes a small contribution to the final estimation. Such a sample is said to be ineffective. If there are too many ineffective samples, the Monte Carlo procedure becomes inefficient. This can be detected by observing a large coefficient of variation in the importance weights. Suppose $\{w_t^{(j)}\}_{j=1}^{m}$ is a sequence of importance weights. Then the coefficient of variation $v_t$ is defined as

$$v_t^2 = \frac{\sum_{j=1}^{m}\left(w_t^{(j)} - \bar{w}_t\right)^2 / m}{\bar{w}_t^2} = \frac{1}{m}\sum_{j=1}^{m}\left(\frac{w_t^{(j)}}{\bar{w}_t} - 1\right)^2,$$

(12.68)

where $\bar{w}_t = \sum_{j=1}^{m} w_t^{(j)} / m$. Note that if the samples are drawn exactly from the target distribution, then all the weights are equal, implying that $v_t = 0$. It is shown in [25] that the importance weights resulting from a sequential Monte Carlo filter form a martingale sequence. As more and more data are processed, the coefficient of variation of the weights, and hence the number of ineffective samples, increases rapidly.

A useful method for reducing the number of ineffective samples and enhancing the effective ones is resampling [23]. Roughly speaking, resampling allows “bad” samples (with small importance weights) to be discarded and “good” ones (with large importance weights) to replicate so as to accommodate the dynamic change of the system. Specifically, let $\{(Z_t^{(j)}, w_t^{(j)})\}_{j=1}^{m}$ be the original properly weighted samples at time $t$. A residual resampling strategy forms a new set of weighted samples $\{(\tilde{Z}_t^{(j)}, \tilde{w}_t^{(j)})\}_{j=1}^{m}$ according to the following algorithm (assume that $\sum_{j=1}^{m} w_t^{(j)} = m$):

Algorithm 12.4.4 [Resampling algorithm]

•  For $j = 1, \ldots, m$, retain $k_j = \lfloor w_t^{(j)} \rfloor$ copies of the sample $Z_t^{(j)}$. Denote $K_r = m - \sum_{j=1}^{m} k_j$.

•  Obtain $K_r$ i.i.d. draws from the original sample set $\{Z_t^{(j)}\}_{j=1}^{m}$, with probabilities proportional to $(w_t^{(j)} - k_j)$, $j = 1, \ldots, m$.

•  Assign equal weight, i.e., set $\tilde{w}_t^{(j)} = 1$, to each new sample.

The samples drawn by the above residual resampling procedure are properly weighted with respect to $p(Z_t \mid Y_t)$, provided that $m$ is sufficiently large. In practice, when a small to modest $m$ is used, the resampling procedure can be seen as trading off between bias and variance: the new samples with their weights resulting from the resampling procedure are only approximately proper, which introduces a small bias into the Monte Carlo estimation; on the other hand, resampling greatly reduces the Monte Carlo variance of the future samples.

Resampling can be done at any time. However, resampling too often adds computational burden and decreases the “diversity” of the Monte Carlo filter (i.e., it decreases the number of distinct samples and loses information). On the other hand, resampling too rarely may result in a loss of efficiency. It is thus desirable to give guidance on when to resample. A measure of the efficiency of an importance sampling scheme is the effective sample size $\bar{m}_t$, defined as

$$\bar{m}_t \triangleq \frac{m}{1 + v_t^2}.$$

(12.69)

Heuristically, $\bar{m}_t$ reflects the equivalent size of a set of i.i.d. samples for the set of $m$ weighted ones. It is suggested in [21] that resampling should be performed when the effective sample size becomes small, e.g., $\bar{m}_t \leq m/10$. Alternatively, one can conduct resampling at fixed-length time intervals (say, every five steps). (A minimal sketch combining this rule with residual resampling is given below.)
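A minimal Python sketch of residual resampling (Algorithm 12.4.4) combined with the effective-sample-size rule (12.69) follows (ours; the synthetic weights are an illustrative assumption). The weights are assumed normalized so that they sum to $m$.

```python
import numpy as np

def residual_resample(z, w, rng):
    m = len(w)                                   # weights assumed to sum to m
    k = np.floor(w).astype(int)                  # retain k_j = floor(w_j) copies
    idx = np.repeat(np.arange(m), k)
    K_r = m - k.sum()                            # remaining draws
    resid = w - k                                # residual weights
    extra = rng.choice(m, size=K_r, p=resid / resid.sum())
    idx = np.concatenate([idx, extra])
    return z[idx], np.ones(m)                    # equal weights after resampling

rng = np.random.default_rng(0)
m = 1000
w = rng.exponential(size=m)
w *= m / w.sum()                                 # normalize so weights sum to m
v2 = np.var(w / w.mean())                        # squared coefficient of variation
ess = m / (1.0 + v2)                             # effective sample size (12.69)
print("ESS:", ess)
if ess < m / 10:                                 # resampling rule from [21]
    z, w = residual_resample(rng.standard_normal(m), w, rng)
```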

Instead of the resampling scheme suggested above, we may implement a more flexible resampling scheme as follows (assume that $\sum_{j=1}^{m} w_t^{(j)} = m$):

For j = 1, ⋯, m,

(a)  If $w_t^{(j)} \geq 1$:

•  Retain $k_j$ copies of the sample $Z_t^{(j)}$, where $k_j$ is given in advance (see below);

•  Assign weight $\tilde{w}_t^{(j)} = w_t^{(j)} / k_j$ to each copy.

(b)  If $w_t^{(j)} < 1$:

•  Kill the sample with probability $1 - f_j$;

•  Assign weight $w_t^{(j)} / f_j$ to the surviving sample.

The advantage of this new resampling method is that we have the flexibility of choosing a proper resampling size $k_j$ as we wish. On the one hand, we want to eliminate hopeless samples and emphasize the “promising” ones. On the other hand, we do not want to throw away the mediocre ones, which may prove important later on (as the dynamic system moves their way). An empirical choice of the resampling size is $k_j = \lceil w_t^{(j)} \rceil$ and $f_j = w_t^{(j)}$. The intuition behind this choice is that it effectively removes hopeless samples with small weights while maintaining the diversity of the Monte Carlo sample.

12.4.3    Applications of SMC in Bioinformatics

In this section we illustrate the application of SMC in solving the DNA sequence motif discovery problem described in Section 12.1.2.

SMC Motif Discovery Algorithm

For the system states up to time t, xt = [x1, ⋯, xt], and the corresponding sequences St = [s1, ⋯, st], we will first present their prior distributions and their conditional posterior distributions, and then describe the steps of the SMC motif discovery algorithm.

Prior Distributions: Denote $\theta_j \triangleq [\theta_{j1}, \ldots, \theta_{j4}]^T$, $j = 1, \ldots, w$, as the $j$-th column of the position weight matrix $\Theta$. In Monte Carlo methods, the prior distribution is often chosen so that the posterior and the prior form a conjugate pair, i.e., they belong to the same functional family. It can be seen that for all of the motifs in the dataset $S$, the nucleotide counts at each motif position are drawn from multinomial distributions. It is well known that the Dirichlet distribution is the conjugate prior for the multinomial distribution. Therefore, we use a multivariate Dirichlet distribution as the prior for $\theta$. The prior distribution for the $i$-th column of the PWM is then given by

$$\theta_i \sim \mathcal{D}(\rho_{i1}, \ldots, \rho_{i4}), \qquad i = 1, 2, \ldots, w.$$

(12.70)

Denote $\rho_i \triangleq [\rho_{i1}, \ldots, \rho_{i4}]$. Assuming independent priors, the prior distribution for the PWM $\Theta$ is the product Dirichlet distribution

$$\Theta \sim \prod_{i=1}^{w} \mathcal{D}(\rho_i).$$

(12.71)

Conditional Posterior Distributions: Here we describe the conditional posterior distributions that are used in the SMC algorithm:

1.  The conditional posterior distribution of the PWM Θ:

$$p(\Theta \mid S_t, \mathbf{x}_{t-1}, x_t = i) \propto p(s_t \mid \Theta, \mathbf{x}_{t-1}, x_t = i, S_{t-1})\, p(\Theta \mid \mathbf{x}_{t-1}, S_{t-1}) \propto \prod_{j=1}^{w} \theta_j^{\mathbf{n}(a_{t,i}(j))} \prod_{\ell=1}^{w} \theta_\ell^{\rho_\ell(t-1)-1} \propto \Lambda_w\!\left(\Theta;\, \rho_1(t-1) + \mathbf{n}(a_{t,i}(1)), \ldots, \rho_w(t-1) + \mathbf{n}(a_{t,i}(w))\right),$$

(12.72)

where we denote $\Lambda_w(\Theta; \rho_1, \ldots, \rho_w)$ as the product Dirichlet PDF for $\Theta$; $\rho_i(t) \triangleq [\rho_{i1}(t), \ldots, \rho_{i4}(t)]$, $i = 1, \ldots, w$, as the parameters of the distribution of $\Theta$ at time $t$; and $\theta_k^{\rho_k(t)-1} \triangleq \prod_{j=1}^{4} \theta_{kj}^{\rho_{kj}(t)-1}$. Note that the posterior distribution of $\Theta$ depends only on the sufficient statistics $\mathcal{T}_t \triangleq \{\rho_{ij}(t),\, 1 \leq i \leq w,\, 1 \leq j \leq 4\}$, which are easily updated based on $\mathcal{T}_{t-1}$, $x_t$, and $s_t$ as given by (12.72), i.e., $\mathcal{T}_t = \mathcal{T}_t(\mathcal{T}_{t-1}, x_t, s_t)$.

2.  The conditional posterior distribution of state xt:

$$p(x_t = i \mid S_t, \Theta) = p(x_t = i \mid s_t, \Theta) \propto \ell(s_t; i, \Theta), \qquad i = 1, 2, \ldots, L_m.$$

(12.73)

SMC Estimator: We now outline the SMC algorithm for motif discovery when the PWM is unknown, assuming that there is only one motif of length $w$ and that it is present in each of the sequences in the dataset. At time $t$, to draw random samples of $x_t^{(k)}$ we use the optimal proposal distribution

$$q_2\left(x_t = i \mid \mathbf{x}_{t-1}^{(k)}, S_t, \Theta\right) = p\left(x_t = i \mid \mathbf{x}_{t-1}^{(k)}, S_t, \Theta\right) \propto \ell(s_t; i, \Theta).$$

(12.74)

To sample Θ, we use the following proposal distribution

$$q_1\left(\Theta \mid \mathbf{x}_{t-1}^{(k)}, S_t\right) \propto \sum_{i=1}^{L_m} p(s_t \mid x_t = i, \Theta, \mathbf{x}_{t-1}, S_{t-1})\, p(\Theta \mid \mathbf{x}_{t-1}, S_{t-1}) \propto \sum_{i=1}^{L_m} P3_{t,i} \prod_{k=1}^{w} \theta_k^{\rho_k(t-1) + \mathbf{n}(a_{t,i}(k)) - 1} \propto \sum_{i=1}^{L_m} \lambda_{i,t}\, \Lambda_w\!\left(\Theta;\, \rho_1(t-1) + \mathbf{n}(a_{t,i}(1)), \ldots, \rho_w(t-1) + \mathbf{n}(a_{t,i}(w))\right),$$

(12.75)

where

$$\lambda_{i,t} \triangleq P3_{t,i} \prod_{\ell=1}^{w} \rho_\ell(t-1)^{\mathbf{n}(a_{t,i}(\ell))},$$

(12.76)

with $\rho_\ell(t)^{\mathbf{n}(a_{t,i}(\ell))} \triangleq \prod_{j=1}^{4} \rho_{\ell j}(t)^{I(s_{t,\, i+\ell-1}\, =\, j)}$, where $I(\cdot)$ denotes the indicator function. The weight update is given by

$$w_t \propto w_{t-1}\, \frac{\sum_{i=1}^{L_m} \lambda_{i,t}}{\prod_{k=1}^{w} \sum_{j=1}^{4} \rho_{kj}(t-1)}.$$

(12.77)

We are now ready to give the SMC motif discovery algorithm:

Algorithm 12.4.5 [SMC motif discovery algorithm for single motif present in all sequences]

•  For k = 1, ⋯, K

–  Sample Θ(k) from the mixture Dirichlet distribution given by (12.75).

–  Sample $x_t^{(k)}$ from (12.74).

–  Update the sufficient statistics $\mathcal{T}_t^{(k)} = \mathcal{T}_t(\mathcal{T}_{t-1}^{(k)}, x_t^{(k)}, s_t)$ using (12.72).

•  Compute the new weights according to (12.77).

•  Compute $\hat{K}_{\text{eff}} = \left(\sum_{k=1}^{K} \left(w_t^{(k)}\right)^2\right)^{-1}$. If $\hat{K}_{\text{eff}} \leq K/10$, perform resampling.

Motif Scores: When searching for motifs in a dataset, it is often necessary to assign confidence scores to the estimated motif locations. A natural choice in this case would be to use the a posteriori probability

$$p(x_t \mid s_t) \propto p(s_t \mid x_t)\, p(x_t),$$

(12.78)

as the confidence score for our estimate, where $p(x_t)$, the prior probability of the starting location of the motif in sequence $t$, is assumed to be uniformly distributed. Note that

$$p(s_t \mid x_t) = \int p(s_t \mid x_t, \Theta)\, p(\Theta)\, d\Theta.$$

(12.79)

From [26], [27], (12.79) can be approximated by

$$p(s_t \mid x_t) \approx p(s_t \mid x_t, \hat{\Theta})\, p(\hat{\Theta}) = \ell(s_t; x_t, \hat{\Theta})\, \Lambda_w(\hat{\Theta}; \rho_{1,t}, \ldots, \rho_{w,t}),$$

(12.80)

and we denote (12.80) as the Bayesian score. Extensions of the above basic SMC motif discovery algorithm can be found in [9].

12.5    Conclusions and Further Readings

Monte Carlo techniques rely on random number generation to calculate deterministic variables and functions more efficiently, to solve complicated optimization and estimation problems, and to simulate complex phenomena and systems. They have found applicability in a wide variety of fields including engineering, bioinformatics, statistics, and the physical sciences (physics, astronomy, chemistry, etc.). In the areas of signal processing, communications and networking, Monte Carlo techniques combined with Bayesian statistics have proved to be very powerful tools for solving complex estimation, detection, optimization and simulation problems (see, e.g., [28, 29, 30]). Recently, the class of sequential Monte Carlo techniques has helped to design efficient recursive algorithms for diverse estimation and detection applications (see, e.g., [4, 31, 32, 33], as well as the tutorials [34, 35, 36, 37]). For a more comprehensive treatment of Monte Carlo techniques and Bayesian statistics, we recommend the excellent references [3, 4, 5, 6, 7].

12.6    Exercises

Exercise 12.6.1 (Variance of MCMC sampler). Suppose the samples from a Markov chain Monte Carlo sampler are given by $\{x^{(i)}\}_{i=1}^{N}$, distributed according to $\pi$. Let us further assume that the chain has been run long enough to reach its equilibrium distribution. Show that:

$$N \operatorname{var}\left\{\frac{1}{N}\sum_{i=1}^{N} \phi\left(x^{(i)}\right)\right\} = \sigma^2\left[1 + 2\sum_{i=1}^{N-1}\left(1 - \frac{i}{N}\right)\rho_i\right]$$

where $\sigma^2 = \operatorname{var}[\phi(x)]$ and $\rho_i = \operatorname{corr}\left[\phi\left(x^{(j)}\right), \phi\left(x^{(j+i)}\right)\right]$ is the lag-$i$ autocorrelation of the stationary chain.

Exercise 12.6.2 (Two-component Gibbs sampler). Consider a Markov chain resulting from a two-component Gibbs sampler which is in stationarity. Prove that

$$\operatorname{cov}\left\{\phi\left(x_1^{(0)}\right), \phi\left(x_1^{(1)}\right)\right\} = \operatorname{var}\left\{E\left\{\phi(x_1) \mid x_2\right\}\right\}$$

holds for any function ϕ.

Exercise 12.6.3 (Computational efficiency of MC estimates). Suppose we have Monte Carlo samples from data augmentation and we are estimating $I = E[\phi(x_1)]$. Which one of the following estimators should be preferred?

$$\hat{I} = \frac{1}{m}\left\{\phi\left(x_1^{(1)}\right) + \cdots + \phi\left(x_1^{(m)}\right)\right\}, \qquad \tilde{I} = \frac{1}{m}\left\{E\left[\phi\left(x_1^{(1)}\right) \mid x_2^{(1)}\right] + \cdots + E\left[\phi\left(x_1^{(m)}\right) \mid x_2^{(m)}\right]\right\}$$

Justify your answer by finding the variances of the two estimators.

Exercise 12.6.4 (Mean of the importance weights). Suppose that the target density is $p(x)$ and the trial density is $q(x)$. We draw random samples $x^{(1)}, x^{(2)}, \ldots, x^{(\nu)}$ from $q$, and the sum of the weights is given by $W = \sum_{j=1}^{\nu} w(x^{(j)})$. Prove that the expectation of $W$ is equal to $\nu$.

Exercise 12.6.5 (Normalizing the importance weights). Let us assume that the weights have been normalized to sum to one and that the $\hat{x}^{(j)}$ are resampled from the $\{x^{(j)}\}$ according to the normalized weights $w_j$. Prove the following relation:

$$E\left\{\frac{1}{\nu}\sum_{j=1}^{\nu} \phi\left(\hat{x}^{(j)}\right)\right\} = E\left\{\sum_{j=1}^{\nu} w_j\, \phi\left(x^{(j)}\right)\right\}$$

Exercise 12.6.6 (Importance sampling estimators). Find the MSE for the importance sampling estimator given by

$$\frac{1}{W}\sum_{j=1}^{\nu} h\left(x^{(j)}\right) w\left(x^{(j)}\right)$$

in terms of K(x) = h(x)w(x), where W is the sum of weights.

Exercise 12.6.7 (More analytical work is good in importance sampling). Prove that

$$\operatorname{var}\left\{\frac{f_{X_1 X_2}(x_1, x_2)}{g_{X_1 X_2}(x_1, x_2)}\right\} \geq \operatorname{var}\left\{\frac{f_{X_1}(x_1)}{g_{X_1}(x_1)}\right\}$$

where the variance is calculated with respect to the density g.

References

[1]  N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chemical Physics, vol. 21, pp. 1087–1091, 1953.

[2]  J. Besag, P. Green, D. Higdon, and K. Mengersen, “Bayesian computation and stochastic systems (with discussion),” Statist. Sci., vol. 10, pp. 3–66, 1995.

[3]  W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman & Hall, New York, 1995.

[4]  J. Liu, Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York, 2001.

[5]  C. Robert and G. Casella, Monte Carlo Statistical Methods. Springer-Verlag, New York, 1999.

[6]  J. Bernardo and A. Smith, Bayesian Theory. Wiley, New York, 1995.

[7]  C. P. Robert, “Mixtures of distributions: inference and estimation,” in Markov Chain Monte Carlo in Practice. Chapman & Hall, New York, 1996, ch. 24, pp. 441–464.

[8]  C. Andrieu and A. Doucet, “Joint Bayesian detection and estimation of noisy sinusoids via reversible jump MCMC,” IEEE Transactions on Signal Processing, vol. 47, pp. 2667–2676, 1999.

[9]  K.-C. Liang, X. Wang, and D. Anastassiou, “A sequential Monte Carlo method for motif discovery,” IEEE Transactions on Signal Processing, vol. 56, no. 9, pp. 4486–4495, September 2008.

[10]  W. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, pp. 97–109, 1970.

[11]  S. Geman and D. Geman, “Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images,” IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, no. 11, pp. 721–741, Nov. 1984.

[12]  M. Tanner and W. Wong, “The calculation of posterior distribution by data augmentation (with discussion),” J. Amer. Statist. Assoc., vol. 82, pp. 528–550, 1987.

[13]  A. Gelfand and A. Smith, “Sampling-based approaches to calculating marginal densities,” J. Amer. Stat. Assoc., vol. 85, pp. 398–409, 1990.

[14]  J. Liu, “The collapsed Gibbs sampler with applications to a gene regulation problem,” J. Amer. Statist. Assoc., vol. 89, pp. 958–966, 1994.

[15]  C. Geyer, “Markov chain Monte Carlo maximum likelihood,” in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, E. Keramigas, Ed. Fairfax: Interface Foundation, 1991, pp. 156–163.

[16]  J. Liu, F. Liang, and W. Wong, “The use of multiple-try method and local optimization in Metropolis sampling,” J. Amer. Statist. Assoc., vol. 95, pp. 121–134, 2000.

[17]  F. Liang and W. Wong, “Evolutionary Monte Carlo: applications to cp model sampling and change point problem,” Statistica Sinica, vol. 10, pp. 317–342, 2000.

[18]  R. Chen and T. Li, “Blind restoration of linearly degraded discrete signals by Gibbs sampling,” IEEE Trans. Sig. Proc., vol. 43, no. 10, pp. 2410–2413, Oct. 1995.

[19]  X. Wang and R. Chen, “Blind turbo equalization in Gaussian and impulsive noise,” IEEE Transactions Vehicular Technology, vol. 50, no. 4, pp. 1092–1105, July 2001.

[20]  C. Carter and R. Kohn, “On Gibbs sampling for state space models,” Biometrika, vol. 81, pp. 541–553, 1994.

[21]  J. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic systems,” Journal of the American Statistical Association, vol. 93, pp. 1032–1044, 1998.

[22]  N. Gordon, D. Salmond, and A. Smith, “A novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proc. Radar Sig. Proc., vol. 140, pp. 107–113, 1993.

[23]  J. Liu and R. Chen, “Blind deconvolution via sequential imputations,” Journal of the American Statistical Association, vol. 90, pp. 567–576, 1995.

[24]  R. Chen and J. Liu, “Mixture Kalman filters,” J. Roy. Statist. Soc. B, vol. 62, no. 3, pp. 493–509, 2000.

[25]  A. Kong, J. Liu, and W. Wong, “Sequential imputations and Bayesian missing data problems,” J. Amer. Statist. Assoc., vol. 89, pp. 278–288, 1994.

[26]  X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R. Dougherty, “A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks,” Bioinformatics, vol. 20, no. 17, pp. 2918–2927, 2004.

[27]  C. Andrieu, J. Freitas, and A. Doucet, “Robust full Bayesian learning from neural networks,” Neural Computation, vol. 13, pp. 2359–2407, 2001.

[28]  X. Wang and V. Poor, Wireless Communication Systems: Advanced Techniques for Signal Reception. US: Prentice Hall, 2003.

[29]  X. Wang and A. Doucet, “Monte Carlo methods for signal processing: a review in the statistical signal processing context,” IEEE Signal Processing Magazine, vol. 22, no. 6, pp. 152–170, November 2005.

[30]  X. Wang, R. Chen, and J. Liu, “Monte Carlo signal processing for wireless communications,” J. VLSI Sig. Proc., vol. 30, no. 1–3, pp. 89–105, Jan.-Mar. 2002.

[31]  O. Cappe, E. Moulines, and T. Ryden, Inference in Hidden Markov Models. Berlin: Springer, 2005.

[32]  A. Doucet, N. D. Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice. Berlin: Springer, 2001.

[33]  B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, Norwood, MA, 2004.

[34]  M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, February 2002.

[35]  O. Cappe, S. Godsill, and E. Moulines, “An overview of existing methods and recent advances in sequential Monte Carlo,” Proceedings of the IEEE, vol. 95, no. 5, pp. 899–924, April 2007.

[36]  P. M. Djuric, J. H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. F. Bugallo, and J. Miguez, “Particle Filtering,” IEEE Signal Processing Magazine, vol. 20, no. 5, pp. 19–38, September 2003.

[37]  A. Doucet, S. Godsill, and C. Andrieu, “On Sequential Monte Carlo Methods for Bayesian Filtering,” Statistics and Computing, vol. 10, no. 3, pp. 197–208, 2000.
