For many complex systems, analyzing the exact performance can be challenging and even analytically intractable. In many such situations, inequalities are used to bound the system performance. Such bounding techniques are frequently used in communications to derive closed form approximations for the performance of various systems. The closed form approximations clearly demonstrate the effect that changes in a certain parameter have on performance. These inequalities are also used in learning theory to study the problem of optimal model selection for given training data. A detailed discussion on using inequalities and large deviation theory is given in Chapter 9. In this section, we discuss in brief some of the frequently used inequalities.
For a non-negative random variable X, the Markov inequality states that
Pr(X ≥ a) ≤ μX/a, ∀a > 0 |
(7.59) |
Clearly, this inequality provides meaningful results only when a > μX. The Chebyshev inequality is given by
Pr(|X − μX| ≥ a) ≤ σ²X/a² |
(7.60) |
The Chebyshev inequality is obtained by applying the Markov inequality to (X − μX)².
Jensen’s inequality states that for a convex function g(·), the expected value of the function of a random variable is no smaller than the function of the expected value of that random variable, i.e.,
E{g(X)} ≥ g(E{X}) |
(7.61) |
The Union bound simply states that the probability of the union of a set of events is no greater than the sum of the probabilities of the individual events,
Pr(∪i Ai) ≤ Σi Pr(Ai) |
(7.62) |
The Schwarz inequality is given by
(E{XY})² ≤ E{X²} E{Y²} |
(7.63) |
The Chernoff bound is given by
Pr(X > a) = Pr(e^{tX} > e^{ta}) ≤ E{e^{tX}}/e^{ta}, ∀t > 0 (using the Markov inequality)
⇒ Pr(X > a) ≤ min_{t>0} E{e^{tX}} e^{−ta} |
(7.64) |
Recognize that the tail probability falls off exponentially with the Chernoff bound but only polynomially with the Chebyshev inequality. On the other hand, while the Chebyshev inequality is general enough to apply to the sum of dependent random variables, the Chernoff bound is only valid for the sum of independent random variables. Hoeffding's inequality generalizes the Chernoff bound to the case of arbitrary bounded random variables [7].
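To make the comparison concrete, the following Python sketch (the values of n and t are illustrative choices, not from the text) evaluates the exact tail probability of the sample mean of n fair coin flips against the Chebyshev bound and the Chernoff bound in Hoeffding's form.

```python
import math

# Tail of M_n = (1/n) * Binomial(n, 1/2): exact Pr(M_n >= 0.5 + t)
# versus the Chebyshev bound and the Chernoff/Hoeffding bound.
# n and t below are illustrative, hypothetical choices.
n, t = 100, 0.1
threshold = 60                       # smallest count with M_n >= 0.5 + t
exact = sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2**n
chebyshev = 0.25 / (n * t * t)       # Var(X_i) = 1/4 for a fair coin
chernoff = math.exp(-2 * n * t * t)  # decays exponentially in n

print(f"exact={exact:.4f}  chebyshev={chebyshev:.4f}  chernoff={chernoff:.4f}")
```

Here the Chernoff-type bound (about 0.135) is already much tighter than the Chebyshev bound (0.25), and the gap grows rapidly as n increases.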
7.3.3 Functions of Multiple Random Variables
Let random variable Z = g(X1, X2, …, Xn) be a function of n random variables. The distribution of Z can be computed as follows:
• Random variables X1, X2, …, Xn are jointly continuous: the CDF of Z is given by
FZ(z) = ∫⋯∫_{(x1,…,xn): g(x1,…,xn) ≤ z} fX1,X2,…,Xn(x1, x2, …, xn) dx1 dx2 … dxn, |
(7.65) |
where the integral is over the n-tuples (x1, …, xn) satisfying g(x1, …, xn) ≤ z.
• Random variables X1, X2, …, Xn are discrete: the PMF of Z is given by
PZ(z) = Σ_{(x1,…,xn): g(x1,…,xn) = z} PX1,X2,…,Xn(x1, x2, …, xn), |
(7.66) |
where the summation is over the n-tuples (x1, …, xn) satisfying g(x1, …, xn) = z.
• Random variables X1, X2, …, Xn are jointly continuous but Z is discrete: the PMF of Z is given by
PZ(z) = ∫⋯∫_{(x1,…,xn): g(x1,…,xn) = z} fX1,X2,…,Xn(x1, x2, …, xn) dx1 … dxn, |
(7.67) |
where the integral is over the n-tuples (x1, …, xn) for which g(x1, …, xn) = z.
Example 7.3.9. Let the joint PDF of random variables X1 and X2 be given by
fX1,X2(x1, x2) = 1, 1 ≤ x1 ≤ 2, 1 ≤ x2 ≤ 2 |
(7.68) |
Now let Z = X1X2. The CDF of Z is computed separately over two ranges of z.
• For the values 1 ≤ z ≤ 2, the CDF is given by
FZ(z) = ∫_1^z ∫_1^{z/x1} 1 dx2 dx1 |
(7.69) |
= ∫_1^z (z/x1 − 1) dx1 |
(7.70) |
= z log(z) − z + 1 |
(7.71) |
• For the values 2 < z ≤ 4, the CDF is given by
FZ(z) = ∫_1^{z/2} ∫_1^2 1 dx2 dx1 + ∫_{z/2}^2 ∫_1^{z/x1} 1 dx2 dx1 |
(7.72) |
= z/2 − 1 + ∫_{z/2}^2 (z/x1 − 1) dx1 |
(7.73) |
= z/2 − 1 + z(log(2) − log(z/2)) − 2 + z/2 |
(7.74) |
= z(1 + log(2) − log(z/2)) − 3 |
(7.75) |
The corresponding PDF of Z is obtained by differentiating the CDF,
fZ(z) = { log(z), 1 ≤ z ≤ 2
        { 2 log(2) − log(z), 2 < z ≤ 4 |
(7.76) |
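The piecewise CDF derived above can be sanity-checked by simulation; the Python sketch below (the sample size is an arbitrary choice) compares the empirical CDF of Z = X1X2 against the closed form.

```python
import math
import random

random.seed(0)
# Monte Carlo check of the CDF of Z = X1 * X2 with X1, X2 ~ Uniform(1, 2)
samples = [random.uniform(1, 2) * random.uniform(1, 2) for _ in range(200_000)]

def cdf(z):
    if z <= 2:
        return z * math.log(z) - z + 1                       # 1 <= z <= 2
    return z * (1 + math.log(2) - math.log(z / 2)) - 3       # 2 < z <= 4

for z in (1.5, 2.5, 3.5):
    empirical = sum(s <= z for s in samples) / len(samples)
    print(z, round(empirical, 3), round(cdf(z), 3))
```

The two columns agree to within the Monte Carlo error, and the closed form evaluates to exactly 1 at z = 4, as it must.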
Now suppose Y = ⌊2X1 + 2X2⌋
□
In many applications it turns out that we are interested in computing the distribution of the sum of multiple independent random variables. In this case, the CDF of Z = X1 + X2 can be written as
FZ(z) = ∫∫_{x1 + x2 ≤ z} fX1,X2(x1, x2) dx1 dx2, |
(7.78) |
= ∫_{−∞}^{∞} ∫_{−∞}^{z − x2} fX1(x1) fX2(x2) dx1 dx2 (due to independence of X1 and X2) |
(7.79) |
The PDF of Z is obtained by differentiating with respect to z,
fZ(z) = ∫_{−∞}^{∞} fX1(x1) fX2(z − x1) dx1, |
(7.80) |
which can be recognized as the convolution of the marginal PDFs of X1 and X2, i.e.,
fZ(z) = fX1(z) ∗ fX2(z). |
(7.81) |
Equivalently, the characteristic function of Z, ΦZ(ω) = E{e^{jωZ}}, is the product of the individual characteristic functions,
ΦZ(ω) = ΦX1(ω) ΦX2(ω) |
(7.84) |
This product relationship is similar to the analysis of linear time-invariant systems, in which the output of the system equals the convolution of the input and the impulse response; equivalently, the Fourier transform of the output equals the product of the Fourier transform of the input and the frequency response of the system.
Example 7.3.10. Let X1 and X2 be independent random variables uniformly distributed on (a, b) and (c, d), respectively. The characteristic function of X1 is given by
ΦX1(ω) = (e^{jωb} − e^{jωa}) / (jω(b − a)) |
(7.85) |
Using (7.84), the characteristic function of Y = X1 + X2 is given by
ΦY(ω) = (e^{jωb} − e^{jωa}) / (jω(b − a)) · (e^{jωd} − e^{jωc}) / (jω(d − c)) |
(7.86) |
By taking the inverse Fourier transform of (7.86), it can be shown that the PDF of Y is piecewise linear.
For (a + c) ≤ y ≤ (a + b + c + d)/2, the PDF is given by
(7.88) |
(7.89) |
For (a + b + c + d)/2 < y ≤ (b + d), the PDF is given by
(7.90) |
(7.91) |
It can be recognized that this PDF corresponds to a triangular PDF.
□
Example 7.3.11. Let X1 and X2 be independent Gaussian random variables with means μX1, μX2 and variances σ²X1, σ²X2, and let Y = aX1 + bX2. The characteristic function of a Gaussian random variable Xi is given by
ΦXi(ω) = e^{jωμXi − σ²Xi ω²/2} |
(7.92) |
Now, the mean and variance of aX1 are aμX1 and a²σ²X1, respectively. Thus, the characteristic function of Y is given by
ΦY(ω) = e^{jωaμX1 − a²σ²X1ω²/2} e^{jωbμX2 − b²σ²X2ω²/2} |
(7.93) |
= e^{jω(aμX1 + bμX2) − ω²(a²σ²X1 + b²σ²X2)/2} |
(7.94) |
From (7.94), it can be clearly seen that Y is also a Gaussian random variable, with mean aμX1 + bμX2 and variance a²σ²X1 + b²σ²X2.
□
Example 7.3.12. Let Xi, for i = 1, 2, …, be i.i.d. exponential random variables with parameter α, and let Yn = X1 + X2 + ⋯ + Xn. Then Yn has the Erlang PDF
fYn(x) = α e^{−αx} (αx)^{n−1} / (n − 1)! |
(7.95) |
FYn(x) = 1 − e^{−αx} (1 + αx/1! + ⋯ + (αx)^{n−1}/(n − 1)!) |
(7.96) |
Now, define a new family of discrete random variables N(t) as the number of the sums Yn that fall in [0, t], i.e.,
N(t) = max{n : Yn ≤ t}. |
(7.97) |
In other words, N(t) = n if and only if Yn ≤ t < Yn+1. Thus,
Pr(N(t) = n) = FYn(t) − FYn+1(t) = e^{−αt} (αt)^n / n! |
(7.98) |
This demonstrates the relationship between the discrete Poisson random variable and the continuous exponential random variables.
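This relationship is easy to check by simulation. The Python sketch below (α, t, and the trial count are illustrative choices) builds N(t) from exponential interarrival times and compares its empirical PMF with the Poisson PMF.

```python
import math
import random

random.seed(1)
alpha, t, trials = 2.0, 3.0, 50_000
counts = []
for _ in range(trials):
    total, n = 0.0, 0
    while True:
        total += random.expovariate(alpha)   # i.i.d. Exponential(alpha) gaps
        if total > t:
            break
        n += 1
    counts.append(n)

# Empirical Pr(N(t) = n) versus e^{-alpha t} (alpha t)^n / n!
for n in range(4, 9):
    emp = counts.count(n) / trials
    pois = math.exp(-alpha * t) * (alpha * t) ** n / math.factorial(n)
    print(n, round(emp, 3), round(pois, 3))
```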
□
7.3.4 Convergence of Random Variables
Before formally introducing the law of large numbers, in this section we introduce the various types of convergence of random variables.
First, consider a sequence a1, a2,… an of real numbers. This sequence is said to converge to a real number a if for any given value of ϵ > 0, there exists an integer N such that ∣an − a∣ < ϵ, ∀n ≥ N.
As noted before, a random variable is a mapping from the set of outcomes to real numbers. Consider a sequence of random variables X1, X2, …, Xn, … defined on the same sample space S. Several notions of convergence of this sequence to a random variable X are used.
Convergence everywhere: If for every value of ω ∈ S, the sequence X1(ω), X2(ω), … converges to X(ω), the sequence of random variables is said to converge everywhere.
Convergence almost everywhere: If the set of outcomes ω for which the sequence X1(ω), X2(ω), … converges to X(ω) has probability 1, the sequence is said to converge almost everywhere (or almost surely), i.e.,
Pr(limn→∞ Xn = X) = 1 |
(7.99) |
Convergence in probability or stochastic convergence: For a given value of ϵ > 0, consider the sequence of real numbers given by Pr(|Xn − X| > ϵ). If this sequence converges to 0 for every ϵ > 0, the sequence Xn is said to converge to X in probability.
It should be noted that convergence almost everywhere implies convergence in probability, while the reverse is not always true. Further, several other forms of convergence are defined in the literature [8].
7.3.5 Law of Large Numbers (LLN) and Central Limit Theorem (CLT)
The LLN allows the approximation of the average of a finite set of independent random variables by a single number. The LLN essentially states that certain aspects of the behavior of populations become more predictable as the population size increases. In its simple form, the weak law of large numbers (WLLN) is stated as follows:
Theorem 7.3.1 (Weak Law of Large Numbers). Let Mn = (1/n)(X1 + X2 + ⋯ + Xn), where the random variables Xi are i.i.d. with mean μX and finite variance σ²X. Then, for any ϵ > 0,
Pr(|Mn − μX| ≥ ϵ) ≤ σ²X/(nϵ²) |
(7.100) |
In the limit of large n, the RHS of (7.100) equals 0. In other words, Mn converges to μX in probability.
In simple terms, the WLLN states that the sample mean Mn is close to the true mean μX with high probability when n is large.
Theorem 7.3.2 (Strong Law of Large Numbers). Let Mn = (1/n)(X1 + X2 + ⋯ + Xn), where the random variables Xi are i.i.d. with finite mean μX. Then,
Pr(limn→∞ Mn = μX) = 1 |
(7.101) |
Theorem 7.3.3 (Central Limit Theorem). Let Mn = (1/n)(X1 + X2 + ⋯ + Xn), where the random variables Xi are all independent and have identical means μX and a finite variance σ²X. Then,
limn→∞ Pr( √n (Mn − μX)/σX ≤ z ) = (1/√(2π)) ∫_{−∞}^{z} e^{−u²/2} du |
(7.102) |
The CLT allows the sum or mean of a finite set of independent random variables to be approximated by a Gaussian distribution and the approximation becomes better for large population sizes.
The LLN can be used to determine the probability of an event that is repeatable and has independent probabilities in each trial. For instance, let random variable Xi = 1 if event A occurs in the ith trial and Xi = 0 otherwise. Then Pr(A) = E{Xi}. The sample mean μ̂X = (1/n) Σ_{k=1}^{n} Xk is an estimate of the probability Pr(A). By the LLN, as the sample size gets large, the estimate converges to the true probability. Other applications of the laws of large numbers in estimation theory are given in Chapter 8.
The Monte Carlo simulation method is based on this principle and is commonly used to evaluate the performance of a variety of systems. For instance, to estimate the probability of bit error of a communication system, a large number of bits are transmitted and the fraction of bits in error is calculated. In information theory, the asymptotic equipartition property (AEP) [9] is a consequence of the LLN.
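A minimal Python sketch of this principle (the event A and its probability are hypothetical): the relative frequency of A converges to Pr(A) as the number of trials grows.

```python
import random

random.seed(2)
# Monte Carlo / WLLN: the sample mean of the indicators 1{A occurs}
# estimates Pr(A). Here A = {U < 0.3} with U ~ Uniform(0, 1).
def estimate(n):
    return sum(random.random() < 0.3 for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, estimate(n))
```

The printed estimates settle around the true value 0.3, with the fluctuation shrinking roughly as 1/√n.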
A random process X (t, ω) is a mapping from each outcome ω to functions of time. A random process can also be considered as an indexed collection of random variables. The index set can be continuous or discrete. The values taken by the process can also belong to a discrete set or continuous set. Thus, there are 4 possible types of random processes:
• Continuous time, continuous valued processes such as seismic measurements and Brownian motion.
• Continuous time, discrete valued processes such as population size and stock prices.
• Discrete time, continuous valued processes such as sampled speech and temperature signals.
• Discrete time, discrete valued processes such as digitized speech and video signals.
The collection of temporal functions over all values of ω is sometimes referred to as the ensemble of functions. For simplicity of notation, we do not explicitly show the dependence of the random process X(t) on the outcome ω. To completely define the random process, the joint distribution of the set of random variables X(t1), X(t2), …, X(tK) needs to be specified for all values of K and t1, t2, …, tK. Clearly, this is a daunting task and practically impossible except in some very narrow and specialized cases, such as a process defined by an i.i.d. sequence of random variables. Consequently, most analysis of random processes is restricted to understanding some statistical properties.
The mean function μX(t) of a random process is defined as
μX(t) = E{X(t)} |
(7.103) |
The autocorrelation RX(t, τ) of the random process is defined as
RX(t, τ) = E{X(t) X*(t + τ)} |
(7.104) |
The autocovariance of the process is defined as
CX(t, τ) = E{(X(t) − μX(t)) (X(t + τ) − μX(t + τ))*} = RX(t, τ) − μX(t) μ*X(t + τ) |
(7.105) |
The crosscorrelation RX,Y(t, τ) of two random processes X(t) and Y(t) is defined as
RX,Y(t, τ) = E{X(t) Y*(t + τ)} |
(7.106) |
Properties of autocorrelation: The autocorrelation RX(t, τ) is a positive definite function. Further, it can be shown that given any positive definite function f(t, τ), we can construct a random process X(t) for which this function is the autocorrelation. It can also be easily shown that RX(t + τ, −τ) = R*X(t, τ).
Random process X(t) is said to be strictly stationary if the joint PDF of (X(t1), X(t2), …, X(tK)) is the same as the joint PDF of (X(t1 + T), X(t2 + T), …, X(tK + T)) for all values of K, t1, t2, …, tK and T. Two random processes X(t) and Y(t) are jointly stationary if the joint PDF of (X(t1), X(t2), …, X(tk), Y(tk+1), Y(tk+2), …, Y(tK)) is the same as the joint PDF of (X(t1 + T), X(t2 + T), …, X(tk + T), Y(tk+1 + T), Y(tk+2 + T), …, Y(tK + T)) for all values of k, K, t1, t2, …, tK and T.
Random process X(t) is said to be wide-sense stationary (WSS) if and only if its mean is independent of time (μX(t) = μX) and its autocorrelation depends only on the time difference, i.e., RX(t, τ) = RX(τ). Two random processes X(t) and Y(t) are jointly WSS if both X(t) and Y(t) are WSS and their crosscorrelation RX,Y(t, τ) depends only on the time difference.
Random process X(t) is said to be cyclostationary if the joint PDF of (X(t1), X(t2), …, X(tK)) is the same as the joint PDF of (X(t1 + T), X(t2 + T), …, X(tK + T)) for all values of K, t1, t2, …, tK and for a fixed value of T called the time period of the random process.
The changes in the process between successive sampling instants, X(t2) − X(t1), X(t3) − X(t2), X(t4) − X(t3), …, are called the increments of the process. A process in which the increments are all independent is said to be an independent increments process. Similarly, a process with stationary increments is said to be a stationary increments process. The stationarity properties of discrete time random processes are similarly defined.
A random process is said to be ergodic if the properties calculated from a temporal sequence of the process are the same as the properties of the ensemble of functions.
Random Walk: The random walk process Dn is defined as the sum of a sequence of independent random variables Li that take values +1 and −1 with probabilities α and 1 − α, respectively. Thus, Dn = Σ_{i=1}^{n} Li. In other words, at each time-instant the process either increases by 1 or decreases by 1. The random walk process is clearly memoryless in the sense that given the current state of the process, the future state is independent of the past. It can also be easily seen that the increments of the process are independent and stationary.
ARMA Process: Consider a random process Yn which is obtained by passing a white noise process Xn through a linear filter H(f). If the transfer function of the filter is of the form
H(f) = (Σ_{k=0}^{q} bk e^{−j2πfk}) / (1 − Σ_{k=1}^{p} ak e^{−j2πfk}), |
(7.107) |
then the evolution of the process can be represented as
Yn = Σ_{k=1}^{p} ak Yn−k + Σ_{k=0}^{q} bk Xn−k |
(7.108) |
Such a process is referred to as an ARMA process and is frequently used in several signal processing applications and for time-series analysis.
Wiener random process and Brownian motion: A Wiener random process is constructed as the limit of a random walk process. Specifically, consider the symmetric random walk process which has equal probability to increase or decrease by 1. Now let the time increments be denoted by δ and let the step size at each time instant be set to √(αδ). In the limit of δ → 0, the continuous time random process X(t) becomes a process with zero mean and a variance that increases linearly with time. Further, since each X(t) is now the sum of an infinite number of independent random variables, it has a Gaussian distribution. Such a process is referred to as the Wiener random process and is commonly used to model Brownian motion. Clearly, by construction, it can be seen that the Wiener process has independent and stationary increments. The autocorrelation of this process can be shown to equal RX(t, τ) = α min{t, t + τ}.
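The limiting construction can be imitated numerically. In the Python sketch below (α, δ, and the number of paths are arbitrary illustrative choices), the sample variance of the discretized process at time t comes out close to αt, as the construction predicts.

```python
import random

random.seed(3)
# Discretized Wiener process: symmetric steps of size sqrt(alpha * delta).
alpha, delta, t_end, paths = 2.0, 0.01, 1.0, 20_000
steps = int(t_end / delta)
step = (alpha * delta) ** 0.5
finals = []
for _ in range(paths):
    x = 0.0
    for _ in range(steps):
        x += step if random.random() < 0.5 else -step
    finals.append(x)

mean = sum(finals) / paths
var = sum((x - mean) ** 2 for x in finals) / paths
print(round(mean, 3), round(var, 3))  # variance should be near alpha * t_end
```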
7.4.2 Random Process as the Input to a Linear System
When a random process X(t) is the input to an LTI system, then the output Y(t) is also a random process. Further, if the input is WSS then the output is also WSS, as will be clear from (7.111) and (7.116). The mean and autocorrelation of the output can be derived in terms of the mean and autocorrelation of the input to the system and the impulse response of the LTI system. Let h(t) and H(f) denote, respectively, the impulse response and frequency response of the LTI system.
Now, the mean of the output process Y(t) can be derived as
μY(t) = E{∫_{−∞}^{∞} h(s) X(t − s) ds} |
(7.109) |
= ∫_{−∞}^{∞} h(s) E{X(t − s)} ds = μX ∫_{−∞}^{∞} h(s) ds |
(7.110) |
= μX H(0) |
(7.111) |
Thus, the mean of the output process Y(t) is independent of time. The autocorrelation of the output is given by,
RY(t, τ) = E{Y(t) Y*(t + τ)} |
(7.112) |
= E{∫_{−∞}^{∞} h(s) X(t − s) ds ∫_{−∞}^{∞} h*(r) X*(t + τ − r) dr} |
(7.113) |
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h*(r) RX(t − s, τ + s − r) ds dr |
(7.114) |
Since, X(t) is WSS, (7.114) can be rewritten as
RY(τ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h*(r) RX(τ + s − r) ds dr |
(7.115) |
= RX(τ) ∗ h*(τ) ∗ h(−τ) |
(7.116) |
Thus, the autocorrelation of the output process Y(t) depends only on time-difference τ.
The cross-correlation is given by
RX,Y(t, τ) = E{X(t) Y*(t + τ)} |
(7.117) |
= E{X(t) ∫_{−∞}^{∞} h*(r) X*(t + τ − r) dr} |
(7.118) |
= ∫_{−∞}^{∞} h*(r) RX(t, τ − r) dr |
(7.119) |
= ∫_{−∞}^{∞} h*(r) RX(τ − r) dr = RX(τ) ∗ h*(τ) (when X(t) is WSS) |
(7.120) |
If random processes X(t) and Y(t) are jointly wide sense stationary, then
RX,Y(t, τ) = RX,Y(τ) |
(7.121) |
The power spectral density (PSD), denoted WX(f), of random process X(t) is defined as the Fourier transform of the autocorrelation RX(τ) and is given by
WX(f) = ∫_{−∞}^{∞} RX(τ) e^{−j2πfτ} dτ |
(7.122) |
Similarly, the cross-power spectral density, denoted WX,Y(f), of jointly WSS random processes X(t) and Y(t) is defined as
WX,Y(f) = ∫_{−∞}^{∞} RX,Y(τ) e^{−j2πfτ} dτ |
(7.123) |
For this linear time invariant system with input X(t) the PSD of the output can be computed using the frequency transfer function of the system and the PSD of the input as,
WY(f) = |H(f)|² WX(f) |
(7.124) |
The cross-power spectral density is related to the PSD of X(t) and Y(t) as
(7.125) |
(7.126) |
Example 7.4.1. Consider a WSS random process X(t) = A cos(2πf0t + θ), where A is a constant and θ is uniformly distributed between (0, 2π). It can be easily shown that the autocorrelation RX(τ) = (A²/2) cos(2πf0τ). Let this process be input to a differentiator system with frequency response H(f) = j2πf. The output Y(t) has spectral density given by
WY(f) = |j2πf|² WX(f) = π²f0²A² (δ(f − f0) + δ(f + f0)) |
(7.127) |
□
Discrete Time Processes and Systems
For a discrete time random process, RX(n, m) = E{Xn X*n+m}. If the process is wide sense stationary, the autocorrelation depends only on the lag m and is written RX(m). Since RX(m) is a discrete function, the PSD is a periodic function. Thus, it is sufficient to consider one period of the PSD. The relationship between the autocorrelation and PSD is now given by,
WX(f) = Σ_{m=−∞}^{∞} RX(m) e^{−j2πfm} |
(7.128) |
RX(m) = ∫_{−1/2}^{1/2} WX(f) e^{j2πfm} df |
(7.129) |
Example 7.4.2. Let the input to a simple averaging filter h(n) be a WSS white process X(n) with autocorrelation
RX(n) = { 1, n = 0
        { 0, otherwise |
(7.130) |
Let h(n) = 1/(2M + 1) for n = −M, …, −1, 0, 1, …, M and h(n) = 0 for all other values of n. The autocorrelation of the output process Y(n) can be calculated as
RY(n) = Σ_k h(k) h(k + n) |
(7.131) |
= { (1/(2M + 1)²)(2M + 1 − |n|), n = −2M, …, −1, 0, 1, …, 2M
  { 0, otherwise |
(7.132) |
□
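The closed form in (7.132) can be verified directly: for a white, unit-variance input, RY(n) reduces to the deterministic autocorrelation of h(n). A short Python sketch (M = 3 is an arbitrary choice):

```python
# Output autocorrelation of the averaging filter for white, unit-variance
# input: R_Y(n) = sum_k h(k) h(k + n), compared with the closed form.
M = 3
L = 2 * M + 1
h = [1 / L] * L

for n in range(L):
    r = sum(h[k] * h[k + n] for k in range(L - n))   # lag-n correlation
    closed = (L - n) / L**2                          # (2M + 1 - |n|) / (2M + 1)^2
    assert abs(r - closed) < 1e-12
    print(n, round(r, 4))
```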
A random process with discrete states is said to be a Markov process if it is memoryless, i.e., it satisfies the following property for all t1 < t2 < ⋯ < tn+1:
Pr(X(tn+1) = xn+1 | X(tn) = xn, X(tn−1) = xn−1, …, X(t1) = x1) = Pr(X(tn+1) = xn+1 | X(tn) = xn) |
(7.133) |
Without loss of generality, the states of the random process are denoted by {0,1,…}.
A discrete time Markov process is also referred to as a Markov Chain. In this case,
Pr(Xn+1 = j | Xn = i, Xn−1 = in−1, …, X0 = i0) = Pr(Xn+1 = j | Xn = i) |
(7.134) |
A Markov chain is completely defined by its one step state transition probabilities, denoted by pij, and the initial starting condition. For convenience, these transition probabilities are arranged in matrix form, represented by P, with pij being the ith row and jth column element such that pij = Pr(Xn+1 = j | Xn = i). The n-step transition probabilities can be derived from pij using the Chapman-Kolmogorov equations as follows:
p(n)ij = Σk p(m)ik p(n−m)kj, for any 0 < m < n |
(7.135) |
To intuitively understand (7.135), we consider the transition from state i to state j in exactly n steps. After m < n steps, let the Markov chain be in state k. The probability of all paths that start at state i and are in state k after m steps is given by p(m)ik. Similarly, the probability of going from state k to state j in n − m steps equals p(n−m)kj. Thus, the probability of all paths that start at state i, pass through state k at step m, and end at state j is given by p(m)ik p(n−m)kj. The total probability is then the summation over all possible intermediate states k.
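The Chapman-Kolmogorov relation is easy to verify numerically. The Python sketch below uses a hypothetical 2-state transition matrix P and checks that the 3-step transition probability obtained by summing over the intermediate state after m = 1 step equals the corresponding entry of P³.

```python
# Chapman-Kolmogorov check: p_ij^(n) = sum_k p_ik^(m) p_kj^(n-m).
# P is a hypothetical 2-state transition matrix.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1],
     [0.4, 0.6]]
P2 = matmul(P, P)          # 2-step transition probabilities
P3 = matmul(P2, P)         # 3-step transition probabilities

# p_00^(3) via the intermediate state k reached after m = 1 step
p00_split = sum(P[0][k] * P2[k][0] for k in range(2))
print(round(P3[0][0], 6), round(p00_split, 6))
```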
A Markov chain is sometimes represented pictorially (as shown in Figure 7.3) as a group of states with arcs representing the transitions between the various states. In a discrete time MC, there is a transition at every time instant. In a continuous time MC, transitions between states occur at random time instants.
The states of a Markov chain are classified as follows. State i can reach state j if there is a non-zero probability of reaching state j in n-steps after starting from state i. States i and j are said to communicate if one can reach state i starting from state j and vice versa. A communicating class consists of a set of states that can communicate with each other. Clearly, all the states of a Markov chain can be partitioned into disjoint communicating classes. An MC is said to be irreducible if it contains only one class, i.e., all states communicate with each other.
The first passage time or hitting time Tij is the random time it takes to first reach state j starting from state i. Let f(n)jj = Pr(Tjj = n) denote the probability that the MC comes back to state j in exactly n steps. Let fjj = Σ_{n=1}^{∞} f(n)jj.
State j is transient if fjj < 1. State j is recurrent if fjj = 1. The recurrent states are positive recurrent if E{Tjj} < ∞ and null recurrent if E{Tjj} = ∞. The positive recurrent states are said to be periodic if revisits must occur in multiples of a certain period d, and aperiodic if revisits have period 1.
If the Markov chain is irreducible, and all its states are aperiodic and positive recurrent, then the Markov chain is said to be ergodic. In this case, as time continues the chain approaches a steady state or equilibrium condition. The probability distribution of the states in equilibrium is unique and can be computed from the transition probability matrix of the MC. Let πi denote the probability of being in state i and let Π = [π1 π2 …]. These probabilities πi also represent a time average, i.e., the fraction of time spent in state i. The steady-state probabilities do not depend on the initial state probabilities and can be calculated as the solution to the balance equations, which are given in matrix form as,
Π = Π P |
(7.136) |
with the additional constraint that the sum of the probabilities of the states should equal 1, i.e.,
Σi πi = 1 |
(7.137) |
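The balance equations can be solved numerically by repeatedly multiplying a starting distribution by P (the power method); the Python sketch below uses a hypothetical 3-state transition matrix.

```python
# Power-method solution of Pi = Pi * P for a hypothetical 3-state chain.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
pi = [1.0, 0.0, 0.0]                  # any starting distribution works
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

# Residual of the balance equations should be ~0 at convergence
residual = max(abs(pi[j] - sum(pi[i] * P[i][j] for i in range(3)))
               for j in range(3))
print([round(p, 4) for p in pi], residual)
```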
Example 7.5.1. Consider a node in an ad-hoc network that is in one of three possible states during each time slot: transmitting its own data (T), receiving data (R), or in sleep mode (S). The transition probabilities from the various states are given in Figure 7.3. Compute the steady-state probabilities of the various states. If the energy consumed in the T, R, and S states are respectively 100 mJ, 30 mJ, and 1 mJ, compute the average energy consumed by the node.
The transition probability matrix is given by
(7.138) |
The steady-state probability distribution for the transmit, receive, and sleep states can be computed respectively as 2/9,1/3, and 4/9. The average energy consumption equals 32.67 mJ.
□
Time reversibility. Let Yn = X−n represent the time-reversed process of the Markov chain Xn. Clearly, the time reversed MC has the same states as the original MC. Also, the time spent in each state is the same whether it is the original MC or the time-reversed MC, so the steady-state probabilities are also identical. Further, the transition probabilities of the time-reversed MC are related to those of the original MC as
Pr(Yn+1 = j | Yn = i) = πj pji / πi |
(7.139) |
7.5.2 Continuous Time Markov Chains
As noted before, a continuous time Markov chain also has discrete states and is characterized by arcs representing state transitions. However, transitions between states can happen at any time. The state transition probability matrix for the continuous time Markov chains can also be defined in a manner similar to the discrete time Markov chain. However, the transition probabilities now also depend on time as follows
pij(s, t) = Pr(X(s + t) = j | X(s) = i) |
(7.140) |
For simplicity, it is sometimes assumed that pij(s,t) is independent of starting time s and only depends on time difference t. Even in this case, the analysis is complicated since the transition probabilities must be specified for all values of t.
An alternate approach to the analysis of such systems is to consider the continuous time MC at the time instants when a state transition occurs. Specifically, Yn = X(tn) is an embedded discrete time Markov chain, where the transition times are t1, t2, …. Further, for a given value of Yn, the time intervals τn = tn+1 − tn are independent exponential random variables with a parameter ai that depends only on the current state.
The infinitesimal rate aij is defined as
aij = limδ→0 Pr(X(t + δ) = j | X(t) = i) / δ, j ≠ i |
(7.141) |
The time spent by the MC in state i before a transition to state j is exponentially distributed with parameter aij. Clearly, ai = Σj aij, ∀i. These rates are arranged in matrix form for convenience and denoted by matrix A with the ith row and jth column entry being aij.
The steady-state distribution for the continuous time MC can be computed using the same principle as for its discrete time counterpart. Recall that πi denotes the probability of being in state i. Now the probability of transition from state i to state j is given by aij/ai. Thus,
(7.142) |
(7.143) |
πj aj = Σi≠j πi aij, ∀j |
(7.144) |
with the additional constraint that ∑iπi = 1. Grouping (7.144) for all i together i into matrix form results in the following balance equation
Π Q = 0 |
(7.145) |
where
Q is the matrix with entries qij = aij for i ≠ j and qii = −ai |
(7.146) |
Example 7.5.2. Each node in an ad-hoc network is either in transmit or receive mode. The time spent in the transmit mode is exponentially distributed with parameter α and the time spent in the receive mode is exponentially distributed with parameter β. Assume that all transmit and receive durations are independent. In this case, the Q matrix is given by
Q = [ −α    α
       β   −β ] |
(7.147) |
The steady-state probabilities can easily be computed as β/(α + β) and α/(α + β).
□
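The result can be verified directly from the balance equation; the Python sketch below plugs in arbitrary illustrative rates.

```python
# Two-state continuous-time chain: with rates alpha (leave transmit) and
# beta (leave receive), pi = (beta/(alpha+beta), alpha/(alpha+beta))
# should satisfy pi * Q = 0. The rate values are arbitrary choices.
alpha, beta = 3.0, 1.5
Q = [[-alpha, alpha],
     [beta, -beta]]
pi = [beta / (alpha + beta), alpha / (alpha + beta)]

residual = [sum(pi[i] * Q[i][j] for i in range(2)) for j in range(2)]
print(pi, residual)   # residual entries should be 0 up to rounding
```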
In many applications the output variables being observed do not directly have the Markov property. However, there is an underlying, dynamically changing state variable which exhibits the Markov property. Further, corresponding to each state, there is a probability distribution that governs the output variable that is observed. Such a model is referred to as a hidden Markov model (HMM), and such models are often used in signal modeling and coding applications as illustrated by the examples in this section. An excellent introduction to HMMs is given in [10].
Let random variables Xn, for n = 1, 2, …, represent the underlying state of the system. Let the observed random variables be represented by Yn, for n = 1, 2, …. The hidden Markov model is characterized by two conditional distribution matrices, A and B. Matrix A with elements aij = Pr(Xn+1 = j | Xn = i) determines the state transition probabilities for the underlying state variable Xn. Matrix B with elements bik = Pr(Yn = k | Xn = i) determines the conditional probabilities of the output variable for each state. In some applications the initial distribution, πi = Pr(X1 = i), of the states of the system is also specified. The structure of the hidden Markov model is depicted in Figure 7.4.
There are essentially three main problems in HMMs.
1. Enumeration/Evaluation Problem: Given a certain model (A, B, Π) for the HMM and a sequence of observations, y1, y2,… yN, calculate the probability Pr (y1, y2,… yN ∣ A, B, Π) that the model caused these observations.
2. State Estimation: Given a certain model (A, B, Π) and observation sequence y1, y2,… yN, find the best sequences x̂1, x̂2,… x̂N of state variables that could result in the observation.
3. Model Parameter Optimization: Given a certain sequence of observations y1, y2,… yN, find the model probabilities (A, B, Π) that best fit the given data. This problem is typically solved during the model training phase and is critical in the practical use of HMMs.
Solution to the Enumeration problem
The problem of enumeration can be evaluated using elementary tools of conditional probability. Formally, given a sequence of observations y1, y2,…, yN, this sequence can be obtained from a given state sequence as
Pr(y1, y2, …, yN | s1, s2, …, sN) = Π_{k=1}^{N} Pr(yk | sk), |
(7.148) |
assuming the observations are independent. The probability of this state sequence is given by
Pr(s1, s2, …, sN) = Pr(X1 = s1) Π_{k=2}^{N} Pr(Xk = sk | Xk−1 = sk−1). |
(7.149) |
The desired probability of the observations given the model can then be calculated as
Pr(y1, y2, …, yN | A, B, Π) = Σ_{s1,s2,…,sN} Pr(X1 = s1) Pr(y1 | s1) Π_{k=2}^{N} Pr(Xk = sk | Xk−1 = sk−1) Pr(yk | sk) |
(7.150) |
The difficulty with this direct evaluation (7.150) of Pr(y1, y2, …, yN | A, B, Π) is that the number of computations required is of the order of N M^N, where M is the number of possible states of the underlying Markov chain. Clearly, this poses a significant computational challenge even for reasonable lengths N of the observed sequence. In practice, a computationally efficient algorithm known as the Forward-Backward procedure is used to reduce the computational complexity. It should be mentioned that this Forward-Backward procedure can be derived as a special case of a message passing algorithm in graphs, as described in Chapter 13.
The main steps in the Forward algorithm are as follows. Define αi(m) as the probability of the partial observation sequence y1,y2,…,yi up to time i and having state xi = m. Now these probabilities αi(m) can be calculated using the following recursive approach:
1. The Initialization Step: First set
α1(m) = πm Pr(y1 | x1 = m), ∀m = 1, …, M |
(7.151) |
2. The Induction Step:
αi+1(k) = [Σ_{m=1}^{M} αi(m) Pr(xi+1 = k | xi = m)] Pr(yi+1 | xi+1 = k), ∀k = 1, …, M and i = 1, …, N − 1. |
(7.152) |
3. The Termination Step:
Pr(y1, y2, …, yN | A, B, Π) = Σ_{m=1}^{M} αN(m) |
(7.153) |
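The three steps above can be sketched compactly in Python. The HMM below is hypothetical (2 states, 2 output symbols, made-up probabilities); a brute-force sum over all state sequences serves as a cross-check of the recursion.

```python
from itertools import product

# Hypothetical 2-state, 2-symbol HMM (all numbers are illustrative)
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = Pr(x_{n+1} = j | x_n = i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][k] = Pr(y_n = k | x_n = i)
Pi = [0.5, 0.5]
obs = [0, 1, 0]
M = len(Pi)

# Initialization: alpha_1(m) = Pi_m * Pr(y_1 | x_1 = m)
alpha = [Pi[m] * B[m][obs[0]] for m in range(M)]
# Induction over the remaining observations
for y in obs[1:]:
    alpha = [sum(alpha[m] * A[m][k] for m in range(M)) * B[k][y]
             for k in range(M)]
# Termination: Pr(y_1 .. y_N | A, B, Pi)
forward_prob = sum(alpha)

# Brute-force enumeration over all M^N state sequences
brute = 0.0
for s in product(range(M), repeat=len(obs)):
    p = Pi[s[0]] * B[s[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= A[s[k - 1]][s[k]] * B[s[k]][obs[k]]
    brute += p

print(forward_prob, brute)   # the two values agree
```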
The Backward algorithm proceeds in a similar manner to the Forward algorithm. Define βi(m) as the probability of the partial sequence yi+1, yi+2, …, yN given that the state of the Markov chain at time i is xi = m. Now βi(m) is calculated using the following recursive approach.
1. The Initialization Step: First set βN(m) = 1, ∀1 ≤ m ≤ M
2. The Induction Step:
βi(k) = Σ_{m=1}^{M} Pr(xi+1 = m | xi = k) Pr(yi+1 | xi+1 = m) βi+1(m), ∀k = 1, …, M and i = N − 1, N − 2, …, 1. |
(7.154) |
3. The Termination Step:
Pr(y1, y2, …, yN | A, B, Π) = Σ_{m=1}^{M} πm Pr(y1 | x1 = m) β1(m) |
(7.155) |
Solution to the State Estimation Problem
Unlike the enumeration problem, the optimal solution to the state estimation problem depends on the definition of optimality. One possible optimality measure is to compute the state at each step that maximizes the probability of the outcome at that step. Clearly this approach could result in state sequences that may be infeasible depending on the state transition probabilities. An alternate definition of optimality that is commonly used is to find the state sequence that has highest probability. The well-known Viterbi algorithm provides a computationally efficient method to compute this optimal state sequence. In Chapter 13, the Viterbi algorithm is also shown to be a special case of a generalized message passing algorithm on graphs.
Similar to the forward algorithm, in this case, we define the probabilities δi(m) of the partial state sequence and observations up to time i and having state xi = m, as
δi(m) = max_{x1,x2,…,xi−1} Pr(x1, x2, …, xi−1, xi = m, y1, y2, …, yi) |
(7.156) |
Using the principle of induction, it can be easily shown that these probabilities δi(m) can be computed as
δi+1(m) = [max_k δi(k) Pr(xi+1 = m | xi = k)] Pr(yi+1 | xi+1 = m) |
(7.157) |
It turns out that to find the optimal state sequence, we also need to keep track of the maximizing argument at each step in (7.157). This maximizing argument is thus stored in the ψi(m) variables as outlined in the formal Viterbi procedure:
1. The initialization step
δ1(m) = πm Pr(y1 | x1 = m), ψ1(m) = 0, ∀m = 1, …, M |
(7.158) |
2. The recursive step
δi+1(m) = max_{1≤k≤M} [δi(k) Pr(xi+1 = m | xi = k)] Pr(yi+1 | xi+1 = m) |
(7.159) |
ψi+1(m) = arg max_{1≤k≤M} δi(k) Pr(xi+1 = m | xi = k) |
(7.160) |
3. The termination step
P* = max_{1≤k≤M} δN(k) |
(7.161) |
x̂N = arg max_{1≤k≤M} δN(k) |
(7.162) |
4. Backtracking step
x̂i = ψi+1(x̂i+1), i = N − 1, N − 2, …, 1 |
(7.163) |
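The full procedure can be sketched in a few lines of Python on a hypothetical 2-state, 2-symbol HMM (all probabilities are made-up numbers); a brute-force search over all state sequences confirms that the recursion finds the most likely path.

```python
from itertools import product

# Hypothetical HMM for illustration
A = [[0.7, 0.3], [0.4, 0.6]]   # state transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][k] = Pr(y = k | x = i)
Pi = [0.5, 0.5]
obs = [0, 1, 1]
M = len(Pi)

# Initialization
delta = [Pi[m] * B[m][obs[0]] for m in range(M)]
psi = []
# Recursion: keep the best predecessor of each state
for y in obs[1:]:
    new_delta, back = [], []
    for m in range(M):
        k_best = max(range(M), key=lambda k: delta[k] * A[k][m])
        back.append(k_best)
        new_delta.append(delta[k_best] * A[k_best][m] * B[m][y])
    delta, psi = new_delta, psi + [back]

# Termination and backtracking
state = max(range(M), key=lambda k: delta[k])
path = [state]
for back in reversed(psi):
    state = back[state]
    path.append(state)
path.reverse()

# Brute-force check: probability of the single best state sequence
brute = 0.0
for s in product(range(M), repeat=len(obs)):
    p = Pi[s[0]] * B[s[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= A[s[k - 1]][s[k]] * B[s[k]][obs[k]]
    brute = max(brute, p)

print(path, max(delta), brute)
```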
Solution to the Model Parameter Estimation Problem
This problem is the most challenging of all problems in HMMs, and there is no analytical solution for finding the optimal model parameters given a finite observation sequence. However, the model parameters can be selected as a local optimum that maximizes the probability of the observation sequence for that model, using the iterative Baum-Welch algorithm.
Define ϵi(m, n) as the conditional probability of being in state xi = m and xi+1 = n given the observation. This conditional probability can be expressed in terms of the αi(m) and βi(n) variables as,
ϵi(m, n) = αi(m) Pr(xi+1 = n | xi = m) Pr(yi+1 | xi+1 = n) βi+1(n) / [Σ_{k=1}^{M} Σ_{l=1}^{M} αi(k) Pr(xi+1 = l | xi = k) Pr(yi+1 | xi+1 = l) βi+1(l)] |
(7.164) |
The iterative procedure is as follows:
1. Begin with an initial estimate for the parameters of the HMM.
2. Calculate a new estimate for the parameters as
π̂i = Pr(x1 = i) = Σ_{j=1}^{M} ϵ1(i, j) |
(7.165) |
âij = Pr(xn+1 = j | xn = i) = Σ_{n=1}^{N−1} ϵn(i, j) / Σ_{k=1}^{M} Σ_{n=1}^{N−1} ϵn(i, k) |
(7.166) |
b̂ik = [Σ_{n: yn = k} Σ_{j=1}^{M} ϵn(i, j)] / [Σ_{n=1}^{N} Σ_{j=1}^{M} ϵn(i, j)] |
(7.167) |
3. Using the new calculated parameters recalculate ϵi using (7.164) and then repeat step 2 above, i.e., evaluate (7.165)–(7.167).
This iterative process is repeated until convergence; as noted before, the limit is a local optimum rather than the global optimum. It should also be noted that this algorithm can be interpreted as being similar to the Expectation-Maximization (EM) algorithm [11, 12].
7.6 Summary and Further Reading
In this chapter, we provided a brief introduction to probability theory and provided several simple examples to understand the main results. These results form the basis of several chapters in this text. For instance, Markov models and Poisson processes are frequently used in Chapter 17 on queueing theory. Hidden Markov models are considered as a special case of a generalized factor graph-based formulation in Chapter 13. The ideas on conditional probabilities and Bayes result are extensively used in Detection (Chapter 11) and Estimation Theory (Chapter 10). As noted before, several excellent textbooks and papers discuss various aspects of this chapter in great detail [1, 2, 3, 4, 5].
Acknowledgment: This work is supported in part by NSF through grant CCF-0546519.
Exercise 7.7.1. A box contains 100 identical balls numbered from 1 through 100. Let there be 10 balls picked from the box without replacement. What is the probability that the largest numbered ball selected equals 20? What is the probability that the smallest numbered ball selected equals 20?
Exercise 7.7.2. A biased coin has a probability of head equal to p. This biased coin is flipped 5 times. The outcome of each flip is independent of the outcome of the other flips. Let discrete random variable X denote the number of heads in the first 3 flips and discrete random variable Y denote the number of tails in the last 3 flips. Find the joint PMF PX,Y(x,y) . Are X and Y independent?
Exercise 7.7.3. There are two different types of radioactive elements available to a scientist. Type 1 emits radioactive particles with a Poisson distribution and mean of α. Type 2 emits radioactive particles with a Poisson distribution and a mean of β. Let α > β. The scientist does not know a priori which of the two materials he is given, i.e., both materials have a probability of 0.5 of being selected. The scientist can only observe the number of emissions from the source in a given period of time. How should the scientist decide whether the material is type 1 or type 2? In other words, what is the decision rule? What is the probability of making an error using this decision rule?
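With equal priors, the maximum-likelihood rule applies, and comparing the two Poisson likelihoods reduces to a threshold on the observed count. The sketch below works this out numerically; the function names and the truncation point nmax are illustrative choices, not the text's.

```python
from math import exp, log

def poisson_pmf(lam, nmax):
    """P(N = n) for n = 0, ..., nmax - 1, built up iteratively for stability."""
    pmf = [exp(-lam)]
    for n in range(1, nmax):
        pmf.append(pmf[-1] * lam / n)
    return pmf

def decide(n, alpha, beta):
    """ML rule (equal priors): choose type 1 iff the Poisson(alpha)
    likelihood of n is at least the Poisson(beta) likelihood, which
    reduces to n >= (alpha - beta) / log(alpha / beta)."""
    return 1 if n >= (alpha - beta) / log(alpha / beta) else 2

def error_prob(alpha, beta, nmax=100):
    """P(error) = 0.5 P(n >= thr | type 2) + 0.5 P(n < thr | type 1),
    truncating the Poisson tails at nmax."""
    pa, pb = poisson_pmf(alpha, nmax), poisson_pmf(beta, nmax)
    thr = (alpha - beta) / log(alpha / beta)
    return (sum(0.5 * pb[n] for n in range(nmax) if n >= thr)
            + sum(0.5 * pa[n] for n in range(nmax) if n < thr))
```

For example, with α = 10 and β = 2 the threshold is (10 − 2)/ln 5 ≈ 4.97, so five or more counts favor type 1.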
Exercise 7.7.4. We have a set of M coins, numbered 1, 2,…, M. The probability of getting a head when coin k is tossed equals 1/k. We select one of the coins at random and toss it N times. The sequence of N observations is then given to you. What is your decision rule to determine which of the coins was selected based on your observations?
Exercise 7.7.5. Let X(t) be a random process defined by $X(t) = \sum_{i=1}^{2} N_i \sin(2\pi f_0 t + \theta_i)$, where f0 is a known frequency and all Ni and θi are independent random variables. The characteristic function for Ni is $\Phi_{N_i}(\omega) = e^{\lambda_i[e^{j\omega} - 1]}$, where λi is a positive constant. Random variable θi is uniformly distributed on [−π, +π]. Find the mean and covariance of the process X(t). Is the process X(t) wide sense stationary?
Exercise 7.7.6. Consider the motion of a Queen piece on a 3 × 3 chessboard, in which the piece moves at every integer time t > 0. For simplicity assume that the squares are numbered sequentially from 1 through 9, with the top left square being 1 and the bottom right square being 9. Let Xt denote which square the queen occupies at time t. Assume Xt, t > 0 is a discrete time Markov process, such that given Xt, the next state Xt + 1 is equally likely to be any one of the legal squares to which the queen may move. (a) Sketch the one step transition probability diagram for Xt. (b) Compute the steady-state probabilities.
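The steady-state probabilities of such a chain can be verified numerically. The sketch below numbers the squares 0–8 (rather than 1–9) for coding convenience, builds the transition matrix, and power-iterates; for a random walk on a graph the stationary probability of a square is proportional to its number of legal moves.

```python
import numpy as np

# Squares numbered 0..8, laid out row-major on the 3x3 board.
def queen_moves(s):
    """All squares reachable by a queen from square s."""
    r, c = divmod(s, 3)
    moves = []
    for dr, dc in [(-1,0),(1,0),(0,-1),(0,1),(-1,-1),(-1,1),(1,-1),(1,1)]:
        nr, nc = r + dr, c + dc
        while 0 <= nr < 3 and 0 <= nc < 3:
            moves.append(nr * 3 + nc)
            nr, nc = nr + dr, nc + dc
    return moves

P = np.zeros((9, 9))
for s in range(9):
    for t in queen_moves(s):
        P[s, t] = 1 / len(queen_moves(s))

# Power iteration; the chain is irreducible and aperiodic, so this converges.
pi = np.full(9, 1 / 9)
for _ in range(200):
    pi = pi @ P
```

On the 3 × 3 board the center square has 8 legal moves and every other square has 6, so the stationary probabilities are 8/56 for the center and 6/56 for the rest.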
Exercise 7.7.7. Let X and Y be continuous random variables with a joint PDF given by
(7.168)
Let A = min{X,Y} and B = max{X,Y}. Compute the joint PDF fA,B(a,b) and the marginal PDFs fA(a) and fB(b).
Exercise 7.7.8. The value of a particular stock equals $100 on the first of January of a particular year. Every day the price of the stock increases by 2% with probability 0.1 and decreases by 1% with probability 0.2. With a probability of 0.7 the value of the stock remains the same for that day. Let random variable X represent the value of the stock after 1 year of trading. Compute the probability that the price of the stock exceeds $105 after 1 year.
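The daily price evolution described here is simple to simulate, and a Monte Carlo estimate gives a quick sanity check on any analytical answer. The sketch below assumes "1 year of trading" means 365 days; that interpretation, and the function names, are illustrative.

```python
import random

def stock_after(days=365, s0=100.0, rng=random):
    """Evolve the price: +2% w.p. 0.1, -1% w.p. 0.2, unchanged w.p. 0.7."""
    s = s0
    for _ in range(days):
        u = rng.random()
        if u < 0.1:
            s *= 1.02
        elif u < 0.3:
            s *= 0.99
    return s

def p_exceeds(level=105.0, trials=20000, seed=1):
    """Monte Carlo estimate of P(price after one year > level)."""
    rng = random.Random(seed)
    hits = sum(stock_after(rng=rng) > level for _ in range(trials))
    return hits / trials
```

Since the daily log-return has nearly zero mean, the estimate should land somewhat below one half; the exercise can also be attacked with the Central Limit Theorem applied to the sum of daily log-returns.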
Exercise 7.7.9. Let continuous random variables Xi , i ≥ 1 be a sequence of i.i.d. random variables with uniform distribution over the interval (0,2). Let discrete random variable N be independent of Xi and have PMF
(7.169)
Let random variable $M = \sum_{i=1}^{N} X_i$. Find the PDF of M.
Exercise 7.7.10. Let Y(t) = X(t) − a1X(t − d1) − a2X(t − d2) where ai and di are real constants. Find the autocorrelation and spectral density of Y(t) .
Exercise 7.7.11. Consider a random walk over the set {0, 1, 2, …, M} where the probability of transitioning from state i to state i + 1 equals q and the probability of transitioning from state i to state i − 1 equals 1 − q. From state 0 the probability of transitioning to state 1 equals 1. The probability of transitioning from state M to state M equals q. Find the long-term probability of being in each state i.
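This is a birth-death chain, so its stationary distribution follows from detailed balance: πi q = πi+1(1 − q) in the interior, with π0 · 1 = π1(1 − q) at the reflecting boundary. The sketch below computes the distribution from these relations and verifies it against the transition matrix; the function names are illustrative.

```python
import numpy as np

def stationary(M, q):
    """Stationary distribution on {0,...,M} by detailed balance:
    pi_0 * 1 = pi_1 * (1 - q), and pi_i * q = pi_{i+1} * (1 - q) for i >= 1."""
    rho = q / (1 - q)
    pi = np.empty(M + 1)
    pi[0] = 1.0
    pi[1] = 1.0 / (1 - q)
    for i in range(2, M + 1):
        pi[i] = pi[i - 1] * rho
    return pi / pi.sum()

def transition_matrix(M, q):
    """Transition matrix of the walk as stated in the exercise."""
    P = np.zeros((M + 1, M + 1))
    P[0, 1] = 1.0               # from 0, always move up
    for i in range(1, M):
        P[i, i + 1] = q
        P[i, i - 1] = 1 - q
    P[M, M] = q                 # from M, stay with probability q
    P[M, M - 1] = 1 - q
    return P
```

Verifying that pi @ P equals pi confirms the detailed-balance solution, including at both boundary states.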
Exercise 7.7.12. Compute the autocorrelation of a first order MA process of the form
(7.170)
where {Xn } are i.i.d. with mean μ and variance σ2.
Exercise 7.7.13. Consider a packet retransmission system in which the same packet is repeatedly sent by the transmitter until the receiver gets N consecutive correct versions of the packet. Assume that each transmission of the packet has a probability of error of 1 − q and is independent of the other transmissions. What is the expected number of packet transmissions?
Exercise 7.7.14. In source coding, a prefix code or an instantaneous code is one in which no codeword is a prefix of any other codeword. Consider the following prefix code: Symbols {a, b, c, d} are mapped to codewords 0, 10, 110, 111, respectively. Consider an i.i.d. sequence of symbols with probabilities of the symbols as Pr(a) = 0.5, Pr(b) = 0.3, Pr(c) = Pr(d) = 0.1. Find the average length of the binary representation of the code. Let Bn represent the sequence of binary symbols that result from the encoding. Find the PMF of Bn.
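The average codeword length is a direct expectation, and the statistics of the encoded bit stream can be checked empirically by encoding a long simulated symbol sequence, as sketched below (the simulation size and variable names are illustrative).

```python
import random

code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
p = {'a': 0.5, 'b': 0.3, 'c': 0.1, 'd': 0.1}

# Average codeword length: E[len] = sum_s p(s) * len(code(s)).
avg_len = sum(p[s] * len(code[s]) for s in code)

# Empirical check of the bit-stream statistics by simulation.
rng = random.Random(0)
symbols = rng.choices(list(p), weights=list(p.values()), k=50000)
bits = ''.join(code[s] for s in symbols)
frac_ones = bits.count('1') / len(bits)
```

Here avg_len evaluates to 0.5·1 + 0.3·2 + 0.1·3 + 0.1·3 = 1.7 bits per symbol, and the empirical fraction of ones should approach (0.3·1 + 0.1·2 + 0.1·3)/1.7.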
Exercise 7.7.15. In error control coding, one metric used to measure the performance of a block code is the number of bit errors that it can correct. For instance, in a 2-error correcting code, any bit error combination up to 2 errors can be corrected by the decoder at the receiver. Consider a 5-error correcting code with a resulting codeword of length 100. This length 100 code is transmitted over a binary symmetric channel with crossover probability p = 0.05. Compute the exact probability that all the errors made by the channel can be corrected by the code. Now, estimate this probability using the Central Limit Theorem.
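The two quantities asked for here can be compared directly in code: the exact probability is a binomial tail, and the CLT estimate replaces the binomial with a Normal of matched mean and variance. The continuity correction used below is a standard refinement, not something the exercise mandates.

```python
from math import comb, erf, sqrt

n, p, t = 100, 0.05, 5

# Exact: decoding succeeds iff at most t = 5 of the n = 100 bits are flipped.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

# CLT estimate: number of errors ~ approx Normal(np, np(1-p)),
# with a continuity correction at t + 0.5.
mu, sigma = n * p, sqrt(n * p * (1 - p))
def Phi(x):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))
clt = Phi((t + 0.5 - mu) / sigma)
```

Since np = 5 errors are expected on average, both values come out close to (and slightly above) one half, and the Normal approximation tracks the exact tail to within a few percent.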
[1] W. Feller, An Introduction to Probability Theory and Its Applications. Wiley, New York, 1968.
[2] S. Kay, Intuitive Probability and Random Processes Using MATLAB, 8th ed. Springer, New York, 2005.
[3] A. Leon-Garcia, Probability, Statistics and Random Processes for Electrical Engineering, 3rd ed. Prentice Hall, Upper Saddle River, NJ, 2008.
[4] S. Ross, A First Course in Probability. Prentice Hall, Upper Saddle River, NJ, 2009.
[5] R. Yates and D. J. Goodman, Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers, 2nd ed. John Wiley and Sons, Hoboken, NJ, 2004.
[6] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Application, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2004.
[7] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, March 1963.
[8] A. Papoulis and U. S. Pillai, Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York, NY, 2002.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley, New York, 2006.
[10] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, February 1989.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
[12] M. Gupta and Y. Chen, “Theory and use of EM algorithm,” Foundations and Trends in Signal Processing, vol. 4, pp. 223–296, 2010.