7.3.2      Inequalities

For many complex systems, analyzing the exact performance can be challenging and even analytically intractable. In many such situations, inequalities are used to bound the system performance. Such bounding techniques are frequently used in communications to derive closed-form approximations for the performance of various systems. These closed-form approximations clearly demonstrate the effect that changes in a certain parameter have on performance. These inequalities are also used in learning theory to study the problem of optimal model selection for given training data. A detailed discussion on using inequalities and large deviation theory is given in Chapter 9. In this section, we briefly discuss some of the frequently used inequalities.

For a non-negative random variable X with mean μ_X = E{X}, the Markov inequality provides a bound on the tail probability and is given by

\Pr(X \geq a) \leq \frac{E\{X\}}{a}.

(7.59)

Clearly, this inequality provides meaningful results only when a > μ_X. Even in that case, the bound is quite weak since it uses only the mean of the random variable to bound the tail probability. A tighter bound on the tail probability is given by the Chebychev inequality, which uses both the mean μ_X and the variance σ_X^2:

\Pr(|X - \mu_X| \geq a) \leq \frac{\sigma_X^2}{a^2}

(7.60)

The Chebychev inequality is obtained by applying the Markov inequality to (X - \mu_X)^2.

Jensen’s inequality states that for a convex function g(·), the expected value of the function of a random variable is no smaller than the function of the expected value of that random variable, i.e.,

g(E\{X\}) \leq E\{g(X)\}

(7.61)

The union bound simply states that the probability of the union of a set of events is no greater than the sum of the probabilities of the individual events,

\Pr\left(\bigcup_i A_i\right) \leq \sum_i \Pr(A_i)

(7.62)

The Schwarz inequality is given by

|E\{XY\}| \leq \sqrt{E\{X^2\}}\sqrt{E\{Y^2\}}

(7.63)

The Chernoff bound is given by

\Pr(X > a) = \Pr(e^{tX} > e^{ta}) \leq \frac{E\{e^{tX}\}}{e^{ta}}, \quad t > 0 \quad \text{(using the Markov inequality)}
\Pr(X > a) \leq \min_{t > 0} E\{e^{tX}\} e^{-ta}

(7.64)

Recognize that the tail probability falls off exponentially with the Chernoff bound but only polynomially with the Chebychev inequality. On the other hand, while the Chebychev inequality is general enough to apply to the sum of dependent random variables, the Chernoff bound is only valid for the sum of independent random variables. Hoeffding's inequality generalizes the Chernoff bound to the case of arbitrary bounded random variables [7].
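As a concrete illustration, the short sketch below compares these bounds against the empirical tail probability of an exponential random variable with unit rate (mean and variance both equal to 1). The choice of distribution, the thresholds, and the sample size are illustrative assumptions; the Chernoff minimization over t is carried out analytically for this particular distribution.

```python
import numpy as np

# Sketch: compare the Markov, Chebychev, and Chernoff bounds on the tail of
# an exponential random variable X with rate 1 (mean 1, variance 1).
# The thresholds and sample size are illustrative assumptions.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

for a in [2.0, 4.0, 8.0]:
    empirical = np.mean(x > a)          # empirical tail Pr(X > a) = e^{-a}
    markov = 1.0 / a                    # E{X}/a
    chebychev = 1.0 / (a - 1.0) ** 2    # sigma^2 / (a - mu)^2
    # Chernoff: min over 0 < t < 1 of E{e^{tX}} e^{-ta} = e^{-ta}/(1 - t);
    # the minimizer is t = 1 - 1/a, giving the bound a*e^{1-a}.
    chernoff = a * np.exp(1.0 - a)
    print(f"a={a}: empirical={empirical:.1e}, Markov={markov:.3f}, "
          f"Chebychev={chebychev:.3f}, Chernoff={chernoff:.1e}")
```

Note how the Chernoff bound eventually dominates the polynomial bounds as the threshold grows, consistent with the discussion above.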

7.3.3      Functions of Multiple Random Variables

Let random variable Z = g(X_1, X_2, \ldots, X_n). Then, the distribution of Z can be computed from the joint distribution of the random variables X_i, i = 1, 2, \ldots, n. Depending on whether the random variables are continuous or discrete, several cases arise.

•  Random variables X_1, X_2, \ldots, X_n and Z are continuous. In this case, the CDF of Z can be computed as

F_Z(z) = \int \cdots \int_{(x_1, \ldots, x_n):\, g(x_1, \ldots, x_n) \leq z} f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n,

(7.65)

where the integral is over the n-tuples of (X_1, X_2, \ldots, X_n) that are mapped by the function g to a value no greater than z. The PDF of Z can be computed by differentiating the CDF obtained in (7.65). While this same principle can be used when the random variables are not continuous, it is also possible to directly compute the PMF of random variable Z if it is discrete, as discussed in the next two cases.

•  Random variables X_1, X_2, \ldots, X_n and Z are discrete. In this case the PMF of Z can be computed as

P_Z(z) = \sum_{(x_1, \ldots, x_n):\, g(x_1, \ldots, x_n) = z} P_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n),

(7.66)

where the summation is over the n-tuples of (X_1, X_2, \ldots, X_n) that are mapped by the function g to the value z.

•  Random variables X_1, X_2, \ldots, X_n are continuous and random variable Z is discrete. In this case the PMF of Z can be computed as

P_Z(z) = \int \cdots \int_{(x_1, \ldots, x_n):\, g(x_1, \ldots, x_n) = z} f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n,

(7.67)

where the integral is over the n-tuples of (X_1, X_2, \ldots, X_n) that are mapped by the function g to the value z.

Example 7.3.9. Let the joint PDF of random variables X_1 and X_2 be given by

f_{X_1, X_2}(x_1, x_2) = 1, \quad 1 \leq x_1 \leq 2, \quad 1 \leq x_2 \leq 2

(7.68)

Now let Z = X_1 X_2. The CDF of Z can be calculated as follows:

•  For the values 1 ≤ z ≤ 2,

F_Z(z) = \int_1^z \int_1^{z/x_1} 1\, dx_2\, dx_1

(7.69)

= \int_1^z \left(\frac{z}{x_1} - 1\right) dx_1

(7.70)

= z\log(z) - z + 1

(7.71)

•  For the values 2 < z ≤ 4,

F_Z(z) = \int_1^{z/2} \int_1^2 1\, dx_2\, dx_1 + \int_{z/2}^2 \int_1^{z/x_1} 1\, dx_2\, dx_1

(7.72)

= z/2 - 1 + \int_{z/2}^2 \left(\frac{z}{x_1} - 1\right) dx_1

(7.73)

= z/2 - 1 + z(\log(2) - \log(z/2)) - 2 + z/2

(7.74)

= z(1 + \log(2) - \log(z/2)) - 3

(7.75)

The corresponding PDF of Z is given by

f_Z(z) = \begin{cases} \log(z) & 1 \leq z \leq 2 \\ 2\log(2) - \log(z) & 2 < z \leq 4 \end{cases}

(7.76)

Now suppose Y = \lfloor 2X_1 + 2X_2 \rfloor, which is a discrete random variable. Then the PMF of Y can be calculated as

P_Y(y) = \begin{cases} \int_1^{1.5} \int_1^{2.5 - x_1} 1\, dx_2\, dx_1 = 1/8 & y = 4 \\ \int_1^{1.5} \int_{2.5 - x_1}^{3 - x_1} 1\, dx_2\, dx_1 + \int_{1.5}^2 \int_1^{3 - x_1} 1\, dx_2\, dx_1 = 3/8 & y = 5 \\ \int_1^{1.5} \int_{3 - x_1}^{2} 1\, dx_2\, dx_1 + \int_{1.5}^2 \int_{3 - x_1}^{3.5 - x_1} 1\, dx_2\, dx_1 = 3/8 & y = 6 \\ \int_{1.5}^2 \int_{3.5 - x_1}^{2} 1\, dx_2\, dx_1 = 1/8 & y = 7 \end{cases}

(7.77)
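The following sketch is a Monte Carlo check of both results in this example; the sample size and seed are arbitrary choices.

```python
import numpy as np

# Sketch: Monte Carlo check of Example 7.3.9, with X1, X2 ~ Uniform(1, 2),
# Z = X1*X2, and Y = floor(2*X1 + 2*X2). Sample size and seed are arbitrary.
rng = np.random.default_rng(1)
x1 = rng.uniform(1.0, 2.0, size=1_000_000)
x2 = rng.uniform(1.0, 2.0, size=1_000_000)

z = x1 * x2
print("F_Z(2) empirical:", np.mean(z <= 2.0))       # ~0.386
print("F_Z(2) analytic: ", 2 * np.log(2) - 2 + 1)   # z*log(z) - z + 1 at z = 2

y = np.floor(2 * x1 + 2 * x2).astype(int)
for k in range(4, 8):
    print(f"P_Y({k}) empirical: {np.mean(y == k):.3f}")   # 1/8, 3/8, 3/8, 1/8
```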

In many applications it turns out that we are interested in computing the distribution of the sum of multiple independent random variables. In this case, the PDF of Z can be computed simply as the convolution of the marginal PDFs of the random variables X_1, X_2, \ldots, X_n. For simplicity, we prove this result for the case Z = X_1 + X_2. The CDF of Z can be calculated as

F_Z(z) = \iint_{x_1 + x_2 \leq z} f_{X_1, X_2}(x_1, x_2)\, dx_1\, dx_2,

(7.78)

= \int_{-\infty}^{\infty} \int_{-\infty}^{z - x_2} f_{X_1}(x_1) f_{X_2}(x_2)\, dx_1\, dx_2 \quad \text{(due to independence of } X_1 \text{ and } X_2\text{)}

(7.79)

The PDF of ZZ can now be computed by taking the derivative of the CDF as

f_Z(z) = \int_{-\infty}^{\infty} f_{X_1}(x_1) f_{X_2}(z - x_1)\, dx_1,

(7.80)

which can be recognized as the convolution of the marginal PDFs of X_1 and X_2. In the case of discrete random variables, the corresponding convolution sum is given by

P_Z(z) = \sum_k P_{X_1}(k) P_{X_2}(z - k).

(7.81)

The PDF of Z can also be computed using the properties of characteristic functions as

\phi_Z(\omega) = E\{e^{j\omega(X_1 + X_2)}\}

(7.82)

= E\{e^{j\omega X_1}\} E\{e^{j\omega X_2}\}

(7.83)

= \phi_{X_1}(\omega)\, \phi_{X_2}(\omega)

(7.84)

This product relationship is similar to the analysis of linear time-invariant systems, in which the output of the system equals the convolution of the input and the impulse response; equivalently, the Fourier transform of the output equals the product of the Fourier transform of the input and the frequency response of the system.
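Continuing the analogy, the sketch below approximates (7.80) by discretizing two marginal PDFs on a grid and convolving them numerically; the uniform densities and the grid spacing are illustrative assumptions.

```python
import numpy as np

# Sketch: the PDF of Z = X1 + X2 for independent X1, X2 ~ Uniform(0, 1),
# obtained as a numerical convolution of the marginal PDFs, per (7.80).
# The grid spacing is an illustrative assumption.
dx = 0.001
x = np.arange(0.0, 1.0, dx)
f1 = np.ones_like(x)              # Uniform(0, 1) PDF sampled on the grid
f2 = np.ones_like(x)

fz = np.convolve(f1, f2) * dx     # discrete approximation of (7.80)
z = np.arange(len(fz)) * dx
print(fz[np.searchsorted(z, 1.0)])  # triangle peak, ~1.0 at z = 1
```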

Example 7.3.10. Let X_1 \sim Uniform(a, b) and X_2 \sim Uniform(c, d) be independent uniformly distributed continuous random variables. For simplicity, assume that d - c = b - a. Let Y = X_1 + X_2. The characteristic function of X_1 equals

\phi_{X_1}(\omega) = \frac{e^{j\omega b} - e^{j\omega a}}{j\omega(b - a)}

(7.85)

Using (7.84), the characteristic function of Y can be evaluated as

\phi_Y(\omega) = \frac{e^{j\omega b} - e^{j\omega a}}{j\omega(b - a)} \cdot \frac{e^{j\omega d} - e^{j\omega c}}{j\omega(d - c)}

(7.86)

By taking the inverse Fourier transform of (7.86), it can be shown that the PDF of Y equals

f_Y(y) = \begin{cases} \frac{4(y - a - c)}{(b + d - a - c)^2} = \frac{y - a - c}{(d - c)(b - a)} & (a + c) \leq y \leq (a + b + c + d)/2 \\ \frac{4(b + d - y)}{(b + d - a - c)^2} = \frac{b + d - y}{(d - c)(b - a)} & (a + b + c + d)/2 < y \leq (b + d) \end{cases}

(7.87)

The PDF of Y can also be derived by convolving the PDFs of X_1 and X_2 as shown below. For (a + c) ≤ y ≤ (a + b + c + d)/2,

f_Y(y) = \int_{a + c}^{y} \frac{1}{b - a} \cdot \frac{1}{d - c}\, dx

(7.88)

= \frac{y - a - c}{(b - a)(d - c)}

(7.89)

For (a + b + c + d)/2 < y ≤ (b + d), the PDF is given by

f_Y(y) = \int_{y}^{b + d} \frac{1}{b - a} \cdot \frac{1}{d - c}\, dx

(7.90)

= \frac{b + d - y}{(b - a)(d - c)}

(7.91)

It can be recognized that this PDF corresponds to a triangular PDF.

Example 7.3.11. Let X_1 and X_2 be independent Gaussian random variables with means \mu_{X_1} and \mu_{X_2} and variances \sigma_{X_1}^2 and \sigma_{X_2}^2, respectively. The distribution of Y = aX_1 + bX_2 can be computed using characteristic functions as follows. The characteristic function of X_1 equals

\phi_{X_1}(\omega) = e^{j\mu_{X_1}\omega - \sigma_{X_1}^2 \omega^2 / 2}

(7.92)

Now, the mean and variance of aX_1 are a\mu_{X_1} and a^2\sigma_{X_1}^2, respectively. Using (7.84), the characteristic function of Y can be evaluated as

\phi_Y(\omega) = e^{j\omega a\mu_{X_1} - a^2\sigma_{X_1}^2\omega^2/2}\, e^{j\omega b\mu_{X_2} - b^2\sigma_{X_2}^2\omega^2/2}

(7.93)

= e^{j\omega(a\mu_{X_1} + b\mu_{X_2}) - \omega^2(a^2\sigma_{X_1}^2 + b^2\sigma_{X_2}^2)/2}

(7.94)

From (7.94), it can be clearly seen that Y is also a Gaussian random variable with mean a\mu_{X_1} + b\mu_{X_2} and variance a^2\sigma_{X_1}^2 + b^2\sigma_{X_2}^2. Thus, linear processing of Gaussian random variables results in another Gaussian random variable.

Example 7.3.12. Let X_i be a sequence of i.i.d. exponential random variables with PDF f_X(x) = \alpha e^{-\alpha x}, x \geq 0. Let Y_n = \sum_{i=1}^{n} X_i represent the sum of the first n random variables X_i. It can be shown using the principle of mathematical induction that the PDF and CDF of Y_n are given by

f_{Y_n}(x) = \frac{\alpha e^{-\alpha x} (\alpha x)^{n-1}}{(n - 1)!}

(7.95)

F_{Y_n}(x) = 1 - e^{-\alpha x}\left(1 + \frac{\alpha x}{1!} + \cdots + \frac{(\alpha x)^{n-1}}{(n - 1)!}\right)

(7.96)

Now, define a new family of discrete random variables N(t) as follows:

N(t) = \max\{k : Y_k \leq t\}

(7.97)

In other words, N(t) takes on the value n if and only if Y_n \leq t and Y_{n+1} > t. The probability of this event can be calculated as

\Pr(N(t) = n) = F_{Y_n}(t) - F_{Y_{n+1}}(t) = \frac{e^{-\alpha t}(\alpha t)^n}{n!}

(7.98)

This demonstrates the relationship between the discrete Poisson random variable and the continuous exponential random variables.
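A simulation sketch of this relationship is given below: cumulative sums of i.i.d. exponential inter-arrival times generate the arrival epochs Y_k, and the count N(t) is compared against the Poisson PMF of (7.98). The rate, horizon, and trial count are illustrative assumptions.

```python
import numpy as np
from math import exp, factorial

# Sketch: checking the exponential/Poisson relationship by simulation.
# The rate alpha, horizon t, and trial count are illustrative assumptions.
rng = np.random.default_rng(2)
alpha, t, trials, n = 2.0, 3.0, 200_000, 6

# Arrival epochs Y_k as cumulative sums of i.i.d. Exp(alpha) inter-arrival
# times; N(t) counts the arrivals in [0, t]. Forty terms amply cover t = 3.
arrivals = np.cumsum(rng.exponential(1.0 / alpha, size=(trials, 40)), axis=1)
counts = (arrivals <= t).sum(axis=1)

print(np.mean(counts == n))                               # empirical Pr(N(t) = n)
print(exp(-alpha * t) * (alpha * t) ** n / factorial(n))  # Poisson PMF (7.98), ~0.161
```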

7.3.4      Convergence of Random Variables

Before formally introducing the law of large numbers, in this section, we introduce the various types of convergence of random variables.

First, consider a sequence a_1, a_2, \ldots, a_n, \ldots of real numbers. This sequence is said to converge to a real number a if for any given value of \epsilon > 0, there exists an integer N such that |a_n - a| < \epsilon, \forall n \geq N.

As noted before, a random variable is a mapping from the set of outcomes to real numbers. Consider a sequence of random variables X_1, X_2, \ldots.

Convergence everywhere: If for every value of \omega \in S, the sequence X_1(\omega), X_2(\omega), \ldots converges to a value that could depend on \omega, then the sequence is said to converge everywhere.

Convergence almost everywhere: If the set of outcomes \omega for which the sequence X_1(\omega), X_2(\omega), \ldots converges has a probability equal to 1, then the sequence is said to converge almost everywhere. In other words,

\Pr(X_n \to X) = 1, \quad \text{as } n \to \infty.

(7.99)

Convergence in probability or stochastic convergence: For a given value of \epsilon > 0, consider the sequence of real numbers given by \Pr(|X_n - X| > \epsilon). If this sequence of real numbers converges to 0 for all values of \epsilon > 0, then the sequence X_1, X_2, \ldots is said to converge in probability to the random variable X.

It should be noted that convergence almost everywhere implies convergence in probability, while the reverse is not always true. Further, several other forms of convergence are defined in the literature [8].

7.3.5      Law of Large Numbers (LLN) and Central Limit Theorem (CLT)

The LLN allows the approximation of the average of a finite set of independent random variables by a single number. The LLN essentially states that certain aspects of the behavior of populations become more predictable as the population size increases. In its simple form, the weak law of large numbers (WLLN) is stated as follows:

Theorem 7.3.1 (Weak Law of Large Numbers). Let M_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n), where the random variables X_i are all independent and have identical means \mu_X and a finite variance no greater than \sigma_X^2. Then,

\Pr(|M_n - \mu_X| \geq \delta) \leq \frac{\sigma_X^2}{n\delta^2}

(7.100)

In the limit of large n, the RHS of (7.100) tends to 0. In other words, M_n converges to \mu_X in probability.

In simple terms, the WLLN states that M_n can be approximated by \mu_X. It should be noted that the finite variance requirement in the theorem is not strictly needed; it only makes the proof easier.

Theorem 7.3.2 (Strong Law of Large Numbers). Let M_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n), where the random variables X_i are all independent and have identical means \mu_X. Then, M_n converges to \mu_X almost everywhere, i.e.,

\Pr(M_n \to \mu_X) = 1 \quad \text{as } n \to \infty

(7.101)

Theorem 7.3.3 (Central Limit Theorem). Let M_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n), where the random variables X_i are all independent and have identical means \mu_X and a finite variance \sigma_X^2. Then,

\lim_{n \to \infty} \Pr\left(\frac{M_n - \mu_X}{\sqrt{\sigma_X^2 / n}} \leq c\right) = \Phi(c)

(7.102)

The CLT allows the sum or mean of a finite set of independent random variables to be approximated by a Gaussian distribution and the approximation becomes better for large population sizes.

The LLN can be used to determine the probability of an event that is repeatable and has independent probabilities in each trial. For instance, let random variable X_i = 1 if event A occurs in the ith trial and X_i = 0 otherwise. Then \Pr(A) = E\{X_i\}. The sample mean \hat{\mu}_X = \frac{1}{n}\sum_{k=1}^{n} X_k is an estimate of the probability \Pr(A). By the LLN, as the sample size gets large, the estimate converges to the true probability. Other applications of the laws of large numbers in estimation theory are given in Chapter 8.

The Monte Carlo simulation method is based on this principle and is commonly used to evaluate the performance of a variety of systems. For instance, to estimate the probability of bit error of a communication system, a large number of bits are transmitted and the average number of errors is calculated. In information theory, the asymptotic equipartition property (AEP) [9] is a consequence of the LLN.
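As a sketch of this idea, the following Monte Carlo simulation estimates the bit error rate of BPSK signaling over an AWGN channel and compares it with the known closed-form expression; the modulation format, SNR, and bit count are illustrative assumptions, not from the text.

```python
import numpy as np
from math import erfc, sqrt

# Sketch: Monte Carlo estimation of the bit error rate of BPSK over an AWGN
# channel, an instance of the LLN. The SNR and bit count are illustrative.
rng = np.random.default_rng(3)
snr_db, n_bits = 6.0, 1_000_000
snr = 10 ** (snr_db / 10)

bits = rng.integers(0, 2, size=n_bits)
symbols = 2 * bits - 1                            # BPSK mapping {0,1} -> {-1,+1}
noise = rng.normal(0.0, sqrt(1 / (2 * snr)), size=n_bits)
errors = ((symbols + noise) > 0) != bits          # threshold detector vs. truth

print(np.mean(errors))                            # sample mean of error events
print(0.5 * erfc(sqrt(snr)))                      # closed-form BPSK error rate
```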

7.4      Random Processes

A random process X (t, ω) is a mapping from each outcome ω to functions of time. A random process can also be considered as an indexed collection of random variables. The index set can be continuous or discrete. The values taken by the process can also belong to a discrete set or continuous set. Thus, there are 4 possible types of random processes:

•  Continuous time, continuous valued processes such as seismic measurements and Brownian motion.

•  Continuous time, discrete valued processes such as population size and stock prices.

•  Discrete time, continuous valued processes such as sampled speech and temperature signals.

•  Discrete time, discrete valued processes such as digitized speech and video signals.

The collection of temporal functions over all values of \omega is sometimes referred to as the ensemble of functions. For simplicity of notation, we do not explicitly show the dependence of the random process X(t) on the outcome \omega. To completely define the random process, the joint distribution of the set of random variables X(t_1), X(t_2), \ldots, X(t_K) needs to be specified for all values of K and t_1, t_2, \ldots, t_K. Clearly, this is a daunting task and practically impossible except in some very narrow and specialized cases, such as a process defined by an i.i.d. sequence of random variables. Consequently, most analysis of random processes is restricted to understanding some statistical properties.

The mean function \mu_X(t) of a random process is defined as

\mu_X(t) = \int_{-\infty}^{\infty} x f_{X(t)}(x)\, dx

(7.103)

The autocorrelation R_X(t, \tau) of the random process is defined as

R_X(t, \tau) = E\{X(t) X^*(t + \tau)\}

(7.104)

The autocovariance of the process is defined as

C_X(t, \tau) = E\{(X(t) - \mu_X(t))(X(t + \tau) - \mu_X(t + \tau))^*\} = R_X(t, \tau) - \mu_X(t)\mu_X^*(t + \tau)

(7.105)

The crosscorrelation R_{X,Y}(t, \tau) of two random processes X(t) and Y(t) is defined as

R_{X,Y}(t, \tau) = E\{X(t) Y^*(t + \tau)\}

(7.106)

Properties of autocorrelation: The autocorrelation R_X(t, \tau) is a positive definite function. Further, it can be shown that given any positive definite function f(t, \tau), we can construct a random process X(t) for which the function is the autocorrelation. It can also be easily shown that R_X(t + \tau, -\tau) = R_X^*(t, \tau).

7.4.1      Stationary Process

Random process X(t) is said to be strictly stationary if the joint PDF of (X(t_1), X(t_2), \ldots, X(t_K)) is the same as the joint PDF of (X(t_1 + T), X(t_2 + T), \ldots, X(t_K + T)) for all values of K, t_1, t_2, \ldots, t_K, and T. Two random processes X(t) and Y(t) are jointly stationary if the joint PDF of (X(t_1), X(t_2), \ldots, X(t_k), Y(t_{k+1}), Y(t_{k+2}), \ldots, Y(t_K)) is the same as the joint PDF of (X(t_1 + T), X(t_2 + T), \ldots, X(t_k + T), Y(t_{k+1} + T), Y(t_{k+2} + T), \ldots, Y(t_K + T)) for all values of k, K, t_1, t_2, \ldots, t_K, and T.

Random process X(t) is said to be wide-sense stationary (WSS) if and only if its mean is independent of time (\mu_X(t) = \mu_X) and its autocorrelation depends only on the time difference, i.e., R_X(t, \tau) = R_X(\tau). Two random processes X(t) and Y(t) are jointly WSS if both X(t) and Y(t) are WSS and their crosscorrelation R_{X,Y}(t, \tau) depends only on the time difference.

Random process X(t) is said to be cyclostationary if the joint PDF of (X(t_1), X(t_2), \ldots, X(t_K)) is the same as the joint PDF of (X(t_1 + T), X(t_2 + T), \ldots, X(t_K + T)) for all values of K, t_1, t_2, \ldots, t_K and for a fixed value of T called the time period of the random process.

The changes in the process between successive sampling instants, X(t_2) - X(t_1), X(t_3) - X(t_2), X(t_4) - X(t_3), \ldots, are called the increments of the process. A process in which the increments are all independent is said to be an independent increments process. Similarly, a process with stationary increments is said to be a stationary increments process. The stationarity properties of discrete time random processes are similarly defined.

A random process is said to be ergodic if the properties calculated from a temporal sequence of the process are the same as the properties of the ensemble of functions.

Random Walk: The random walk process D_n is defined as the sum of a sequence of independent random variables L_i that take values 1 and -1 with probabilities \alpha and 1 - \alpha, respectively. Thus, D_n = \sum_{i=1}^{n} L_i. In other words, at each time instant the process either increases by 1 or decreases by 1. The random walk process is clearly memoryless in the sense that given the current state of the process, the future state is independent of the past. It can also be easily seen that the increments of the process are independent and stationary.

ARMA Process: Consider a random process Y_n which is obtained by passing the white noise process X_n through a linear filter H(f). If the transfer function of the filter is of the form

H(f) = \frac{\sum_{k=0}^{B} b_k e^{-j2\pi f k}}{1 - \sum_{k=1}^{A} a_k e^{-j2\pi f k}}

(7.107)

then the evolution of the process can be represented as

Y_n = \sum_{k=1}^{A} a_k Y_{n-k} + \sum_{k=0}^{B} b_k X_{n-k}

(7.108)

Such a process is referred to as an ARMA process and is frequently used in several signal processing applications and for time-series analysis.
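A minimal sketch of generating an ARMA process by filtering white noise, per (7.108), is shown below using scipy.signal.lfilter; the filter coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

# Sketch: generating an ARMA process by filtering white noise, per (7.108).
# The coefficient values are illustrative assumptions.
rng = np.random.default_rng(4)
x = rng.normal(size=10_000)        # white noise X_n

b = [1.0, 0.5]                     # MA coefficients b_0, b_1
a = [1.0, -0.7]                    # denominator; realizes the AR recursion
y = lfilter(b, a, x)               # Y_n = 0.7 Y_{n-1} + X_n + 0.5 X_{n-1}

print(np.var(y))                   # sample variance of the ARMA output
```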

Wiener random process and Brownian motion: A Wiener random process is constructed as the limit of a random walk process. Specifically, consider the symmetric random walk process which has equal probability of increasing or decreasing by 1. Now let the time increments be denoted by \delta and let the change in the process at each time instant be \pm\sqrt{\alpha\delta}. In the limit of \delta \to 0, the continuous time random process X(t) becomes a process with zero mean and a variance that increases linearly with time. Further, since each X(t) is now the sum of an infinite number of independent random variables, it has a Gaussian distribution. Such a process is referred to as the Wiener random process and is commonly used to model Brownian motion. Clearly, by construction, it can be seen that the Wiener process has independent and stationary increments. The autocorrelation of this process can be shown to equal R_X(t, \tau) = \alpha \min\{t, t + \tau\}.
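The sketch below approximates a Wiener process by a scaled symmetric random walk with step size \sqrt{\alpha\delta}, confirming the linear growth of the variance; all parameter values are illustrative assumptions.

```python
import numpy as np

# Sketch: a Wiener process approximated by a scaled symmetric random walk
# with step size sqrt(alpha*delta), so that Var{X(t)} = alpha*t.
# All parameter values are illustrative.
rng = np.random.default_rng(5)
alpha, delta, t_end, paths = 1.0, 1e-3, 1.0, 10_000

steps = rng.choice([-1.0, 1.0], size=(paths, int(t_end / delta)))
x = np.cumsum(np.sqrt(alpha * delta) * steps, axis=1)

print(np.var(x[:, -1]))            # ~ alpha * t_end, variance grows linearly
```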

7.4.2      Random Process as the Input to a Linear System

When a random process X(t) is the input to an LTI system, the output Y(t) is also a random process. Further, if the input is WSS then the output is also WSS, as will be clear from (7.111) and (7.116). The mean and autocorrelation of the output can be derived in terms of the mean and autocorrelation of the input to the system and the impulse response of the LTI system. Let h(t) and H(f) denote, respectively, the impulse response and frequency response of the LTI system.

Now, the mean of the output process Y(t) can be derived as

E\{Y(t)\} = E\left\{\int_{-\infty}^{\infty} h(t - u) X(u)\, du\right\}

(7.109)

= \int_{-\infty}^{\infty} h(t - u) E\{X(u)\}\, du

(7.110)

= \mu_X H(0)

(7.111)

Thus, the mean of the output process Y(t) is independent of time. The autocorrelation of the output is given by,

R_Y(t, \tau) = E\{Y(t) Y^*(t + \tau)\}

(7.112)

= E\left\{\int_{-\infty}^{\infty} h(u) X(t - u)\, du \int_{-\infty}^{\infty} h^*(v) X^*(t + \tau - v)\, dv\right\}

(7.113)

= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(u) h^*(v) E\{X(t - u) X^*(t + \tau - v)\}\, dv\, du

(7.114)

Since X(t) is WSS, (7.114) can be rewritten as

R_Y(t, \tau) = \int_{-\infty}^{\infty} h(u) \int_{-\infty}^{\infty} h^*(v) R_X(\tau - v + u)\, dv\, du

(7.115)

= R_X(\tau) * h^*(\tau) * h(-\tau)

(7.116)

Thus, the autocorrelation of the output process Y(t) depends only on time-difference τ.

The cross-correlation is given by

R_{X,Y}(t, \tau) = E\left\{\int_{-\infty}^{\infty} X(t) h^*(u) X^*(t + \tau - u)\, du\right\}

(7.117)

= \int_{-\infty}^{\infty} h^*(u) E\{X(t) X^*(t + \tau - u)\}\, du

(7.118)

= \int_{-\infty}^{\infty} h^*(u) R_X(\tau - u)\, du

(7.119)

= R_X(\tau) * h^*(\tau)

(7.120)

If random processes X(t) and Y(t) are jointly wide sense stationary, then

R_{X,Y}(\tau) = R_{Y,X}^*(-\tau).

(7.121)

The power spectral density (PSD), denoted W_X(f), of random process X(t) is defined as the Fourier transform of the autocorrelation R_X(\tau) and is given by

W_X(f) = \int_{-\infty}^{\infty} R_X(\tau) e^{-j2\pi f \tau}\, d\tau

(7.122)

Similarly, the cross-power spectral density, denoted W_{X,Y}(f), of jointly WSS random processes X(t) and Y(t) is defined as

W_{X,Y}(f) = \int_{-\infty}^{\infty} R_{X,Y}(\tau) e^{-j2\pi f \tau}\, d\tau

(7.123)

For an LTI system with input X(t), the PSD of the output can be computed using the frequency transfer function of the system and the PSD of the input as

W_Y(f) = W_X(f) |H(f)|^2

(7.124)

The cross-power spectral density is related to the PSD of X(t) and Y(t) as

W_{X,Y}(f) = H(f) W_X(f)

(7.125)

W_Y(f) = H^*(f) W_{X,Y}(f)

(7.126)

Example 7.4.1. Consider a WSS random process X(t) = A\cos(2\pi f_0 t + \theta), where A is a constant and \theta is uniformly distributed on (0, 2\pi). It can be easily shown that the autocorrelation is R_X(\tau) = \frac{A^2}{2}\cos(2\pi f_0 \tau). Let this process be the input to a differentiator with frequency response H(f) = j2\pi f. The output Y(t) has PSD given by

W_Y(f) = 4\pi^2 f^2 W_X(f) = A^2 \pi^2 f_0^2 \left(\delta(f - f_0) + \delta(f + f_0)\right).

(7.127)

Discrete Time Processes and Systems

For a discrete time random process, R_X(m, n) = E\{X_n X_{n+m}\}. If the process is wide-sense stationary, the autocorrelation depends only on the lag m, i.e., R_X(m, n) = R_X(m). Since R_X(m) is a discrete function, the PSD is a periodic function. Thus, it is sufficient to consider one period of the PSD. The relationship between the autocorrelation and PSD is now given by

W_X(f) = \sum_k R_X(k) e^{-j2\pi k f}

(7.128)

R_X(k) = \int_{-1/2}^{1/2} W_X(f) e^{j2\pi f k}\, df

(7.129)

Example 7.4.2. Let the input to a simple averaging filter h(n) be a WSS process X(n) with autocorrelation

R_X(n) = \delta(n)

(7.130)

Let h(n) = 1/(2M + 1) for n = -M, \ldots, -1, 0, 1, \ldots, M and h(n) = 0 for all other values of n. The autocorrelation of the output process Y(n) can be calculated as

R_Y(n) = \sum_{i=-M}^{M} \sum_{j=-M}^{M} h_i h_j R_X(n + i - j)

(7.131)

= \begin{cases} \frac{2M + 1 - |n|}{(2M + 1)^2} & |n| \leq 2M \\ 0 & \text{otherwise} \end{cases}

(7.132)
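This result can be checked numerically: for white input, R_Y(n) reduces to the deterministic correlation of h with itself. The sketch below performs this check for an illustrative choice of M.

```python
import numpy as np

# Sketch: numerical check of Example 7.4.2. For white input, R_Y(n) reduces
# to the deterministic correlation of h with itself. M is illustrative.
M = 2
h = np.ones(2 * M + 1) / (2 * M + 1)

r_y = np.correlate(h, h, mode='full')      # lags n = -2M, ..., 2M
for n, r in zip(range(-2 * M, 2 * M + 1), r_y):
    analytic = (2 * M + 1 - abs(n)) / (2 * M + 1) ** 2
    print(n, r, analytic)                  # the two columns agree
```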

7.5      Markov Process

A random process with discrete states is said to be a Markov process if it is memoryless, i.e., it satisfies the following property

f_{X_t \mid \{X_\tau,\, \tau \leq s\}} = f_{X_t \mid X_s}, \quad \text{for all } s < t

(7.133)

Without loss of generality, the states of the random process are denoted by {0,1,…}.

7.5.1      Markov Chains

A discrete time Markov process is also referred to as a Markov Chain. In this case,

\Pr(X_{n+1} \mid X_n, X_{n-1}, \ldots) = \Pr(X_{n+1} \mid X_n)

(7.134)

A Markov chain is completely defined by its one-step state transition probabilities, denoted by p_{ij}, and the initial starting condition. For convenience, these transition probabilities are arranged in matrix form, represented by P, with p_{ij} = \Pr(X_{n+1} = j \mid X_n = i) being the element in the ith row and jth column. The n-step transition probabilities can be derived from the p_{ij} using the Chapman-Kolmogorov equations as follows:

p_{ij}^{(n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n-m)}, \quad 0 \leq m \leq n

(7.135)

To intuitively understand (7.135), we consider the transition from state i to state j in exactly n steps. After m < n steps, let the Markov chain be in state k. Consider all the paths that start at state i and are in state k after m steps. The probability of all such paths is given by p_{ik}^{(m)}. Similarly, the probability of going from state k to state j in n - m steps equals p_{kj}^{(n-m)}. Thus the probability of all paths that start at state i, end at state j, and pass through state k at step m is given by p_{ik}^{(m)} p_{kj}^{(n-m)}. The total probability is then the summation over all possible intermediate states k.

A Markov chain is sometimes represented pictorially (as shown in Figure 7.3) as a group of states with arcs representing the transitions between the various states. In a discrete time MC, there is a transition at every time instant. In a continuous time MC, the transitions between states occur at random time instants.

The states of a Markov chain are classified as follows. State i can reach state j if there is a non-zero probability of reaching state j in n steps, for some n, after starting from state i. States i and j are said to communicate if one can reach state i starting from state j and vice versa. A communicating class consists of a set of states that can communicate with each other. Clearly, all the states of a Markov chain can be partitioned into disjoint communicating classes. An MC is said to be irreducible if it contains only one class, i.e., all states communicate with each other.


Figure 7.3:  Illustration of the transitions between various states of a discrete Markov chain.

The first passage time or hitting time T_{ij} is the random time it takes to first reach state j starting from state i. Let f_{jj}^{(n)} = \Pr(T_{jj} = n) denote the probability that the MC comes back to state j in exactly n steps. Let f_{jj} = \sum_{n=1}^{\infty} f_{jj}^{(n)}.

State j is transient if f_{jj} < 1. State j is recurrent if f_{jj} = 1. The recurrent states are positive recurrent if E\{T_{jj}\} < \infty and null recurrent if E\{T_{jj}\} = \infty. The positive recurrent states are said to be periodic if revisits must occur in multiples of a certain period d > 1, and aperiodic if revisits have period 1.

If the Markov chain is irreducible, and all its states are aperiodic and positive recurrent, then the Markov chain is said to be ergodic. In this case, as time continues the chain approaches a steady-state or equilibrium condition. The probability distribution of the states in equilibrium is unique and can be computed from the transition probability matrix of the MC. Let \pi_i denote the probability of being in state i and let \Pi = [\pi_1\ \pi_2\ \ldots]. These probabilities \pi_i also represent a time average, i.e., the fraction of time spent in state i. The steady-state probabilities do not depend on the initial state probabilities and can be calculated as the solution to the balance equations, which are given in matrix form as

Π = ΠP

(7.136)

with the additional constraint that the sum of the probabilities of the states should equal 1, i.e.,

\sum_i \pi_i = 1

(7.137)

Example 7.5.1. Consider a node in an ad-hoc network that is in one of three possible states during each time slot: transmitting its own data (T), receiving data (R), or in sleep mode (S). The transition probabilities from the various states are given in Figure 7.3. Compute the steady-state probabilities of the various states. If the energy consumed in the T, R, and S states are respectively 100 mJ, 30 mJ, and 1 mJ, compute the average energy consumed by the node.

The transition probability matrix is given by

P = \begin{bmatrix} 0.1 & 0.7 & 0.2 \\ 0.2 & 0.4 & 0.4 \\ 0.3 & 0.1 & 0.6 \end{bmatrix}

(7.138)

The steady-state probability distribution for the transmit, receive, and sleep states can be computed respectively as 2/9, 1/3, and 4/9. The average energy consumption equals 32.67 mJ.
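The computation can be reproduced with a few lines of linear algebra, solving \Pi = \Pi P as a left-eigenvector problem; this sketch is one of several equivalent ways to solve the balance equations.

```python
import numpy as np

# Sketch: solving the balance equations Pi = Pi*P of Example 7.5.1 as a
# left-eigenvector problem, then computing the average energy.
P = np.array([[0.1, 0.7, 0.2],
              [0.2, 0.4, 0.4],
              [0.3, 0.1, 0.6]])

w, v = np.linalg.eig(P.T)                      # left eigenvectors of P
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))]) # eigenvector for eigenvalue 1
pi /= pi.sum()                                 # normalize to sum to one

print(pi)                                      # ~ [2/9, 1/3, 4/9]
print(pi @ [100.0, 30.0, 1.0])                 # average energy, ~32.67 mJ
```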

Time reversibility. Let Y_n = X_{-n} represent the time-reversed version of the Markov chain X_n. Clearly, the time-reversed MC has the same states as the original MC. Also, the time spent in each state is the same whether it is the original MC or the time-reversed MC, so the steady-state probabilities are also identical. Further, the transition probabilities of the time-reversed MC are related to those of the original MC as

\Pr(Y_{n+1} = j \mid Y_n = i) = \frac{\pi_j}{\pi_i} p_{ji}

(7.139)

7.5.2      Continuous Time Markov Chains

As noted before, a continuous time Markov chain also has discrete states and is characterized by arcs representing state transitions. However, transitions between states can happen at any time. The state transition probability matrix for the continuous time Markov chains can also be defined in a manner similar to the discrete time Markov chain. However, the transition probabilities now also depend on time as follows

p_{ij}(s, t) = \Pr(X(s + t) = j \mid X(s) = i)

(7.140)

For simplicity, it is sometimes assumed that pij(s,t) is independent of starting time s and only depends on time difference t. Even in this case, the analysis is complicated since the transition probabilities must be specified for all values of t.

An alternate approach to the analysis of such systems is to consider the continuous time MC at the time instants when a state transition occurs. Specifically, Y_n = X(t_n) is an embedded discrete time Markov chain, where the transition times are t_1, t_2, \ldots. Further, for a given value of Y_n, the time intervals \tau_n = t_{n+1} - t_n are independent exponential random variables with a rate a_i that depends only on the current state.

The infinitesimal rate aij is defined as

a_{ij} = \lim_{t \to 0} \frac{p_{ij}(t)}{t} = a_i \Pr(Y_{n+1} = j \mid Y_n = i)

(7.141)

The time spent by the MC in state i before a transition to state j is exponentially distributed with parameter a_{ij}. Clearly, a_i = \sum_{j \neq i} a_{ij}. These rates are arranged in matrix form for convenience and denoted by matrix A, with the entry in the ith row and jth column being a_{ij}.

The steady-state distribution for the continuous time MC can be computed using the same principle as for its discrete time counterpart. Recall that πi denotes the probability of being in state i. Now the probability of transition from state i to state j is given by aij/ai. Thus,

\pi_j = \sum_{i \neq j} \pi_i \frac{a_{ij}}{a_j}

(7.142)

\pi_j a_j = \sum_{i \neq j} \pi_i a_{ij}

(7.143)

\sum_{i \neq j} \pi_i a_{ij} - \pi_j a_j = 0

(7.144)

with the additional constraint that \sum_i \pi_i = 1. Grouping (7.144) for all j into matrix form results in the following balance equation

ΠQ = 0

(7.145)

where

Q = \begin{bmatrix} -a_0 & a_{01} & a_{02} & \cdots \\ a_{10} & -a_1 & a_{12} & \cdots \\ a_{20} & a_{21} & -a_2 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}

(7.146)

Example 7.5.2. Each node in an ad-hoc network is either in transmit or receive mode. The time spent in the transmit mode is exponentially distributed with parameter α and the time spent in the receive mode is exponentially distributed with parameter β. Assume that all transmit and receive durations are independent. In this case, the Q matrix is given by

Q = \begin{bmatrix} -\alpha & \alpha \\ \beta & -\beta \end{bmatrix}

(7.147)

The steady-state probabilities can easily be computed as \frac{\beta}{\alpha + \beta} and \frac{\alpha}{\alpha + \beta}.
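A small sketch solving \Pi Q = 0 together with the normalization constraint, using least squares, is shown below; the values of \alpha and \beta are illustrative assumptions.

```python
import numpy as np

# Sketch: solving Pi*Q = 0 with sum(Pi) = 1 for Example 7.5.2 by appending
# the normalization constraint and using least squares. The values of
# alpha and beta are illustrative.
alpha, beta = 2.0, 3.0
Q = np.array([[-alpha, alpha],
              [beta, -beta]])

A = np.vstack([Q.T, np.ones(2)])        # rows encode Q^T pi = 0 and 1^T pi = 1
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)                               # [beta/(alpha+beta), alpha/(alpha+beta)]
```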

7.5.3      Hidden Markov Model

In many applications the output variables being observed do not directly have the Markov property. However, there is an underlying, dynamically changing state variable which exhibits the Markov property. Further, corresponding to each state, there is a probability distribution that governs the output variable that is observed. Such a model is referred to as an HMM, and such models are often used in signal modeling and coding applications, as illustrated by the examples in this section. An excellent introduction to HMMs is given in [10].

Let random variables X_n, for n = 1, 2, \ldots, represent the underlying state of the system. Let the observed random variables be represented by Y_n, for n = 1, 2, \ldots. The hidden Markov model is characterized by two conditional distribution matrices, A and B. Matrix A with elements a_{ij} = \Pr(X_{n+1} = j \mid X_n = i) determines the state transition probabilities for the underlying state variable X_n. Matrix B with elements b_{ik} = \Pr(Y_n = k \mid X_n = i) determines the conditional probabilities of the output variable for each state. In some applications the initial distribution, \pi_i = \Pr(X_1 = i), of the states of the system is also specified. The structure of the hidden Markov model is depicted in Figure 7.4.


Figure 7.4:  Structure of a general Hidden Markov Model.

There are essentially three main problems in HMMs.

1.  Enumeration/Evaluation Problem: Given a certain model (A, B, Π) for the HMM and a sequence of observations y_1, y_2, \ldots, y_N, calculate the probability \Pr(y_1, y_2, \ldots, y_N \mid A, B, \Pi) that the model produced these observations.

2.  State Estimation: Given a certain model (A, B, Π) and observation sequence y_1, y_2, \ldots, y_N, find the best sequence \hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N of state variables that could result in the observation.

3.  Model Parameter Optimization: Given a certain sequence of observations y_1, y_2, \ldots, y_N, find the model probabilities (A, B, Π) that best fit the given data. This problem is typically solved during the model training phase and is critical in the practical use of HMMs.

Solution to the Enumeration Problem

The enumeration problem can be solved using elementary tools of conditional probability. Formally, the probability that a sequence of observations y_1, y_2, \ldots, y_N is obtained from a given state sequence s_1, s_2, \ldots, s_N is

\Pr(y_1, y_2, \ldots, y_N \mid s_1, s_2, \ldots, s_N) = \prod_{k=1}^{N} \Pr(y_k \mid s_k),

(7.148)

assuming the observations are conditionally independent given the state sequence. The probability of this state sequence is given by

\Pr(s_1, s_2, \ldots, s_N) = \Pr(X_1 = s_1) \prod_{k=2}^{N} \Pr(X_k = s_k \mid X_{k-1} = s_{k-1}).

(7.149)

The desired probability of the observations given the model can then be calculated as

\Pr(y_1, y_2, \ldots, y_N \mid A, B, \Pi) = \sum_{s_1, s_2, \ldots, s_N} \Pr(X_1 = s_1) \Pr(y_1 \mid s_1) \prod_{k=2}^{N} \Pr(X_k = s_k \mid X_{k-1} = s_{k-1}) \Pr(y_k \mid s_k)

(7.150)

The difficulty with this direct evaluation (7.150) of \Pr(y_1, y_2, \ldots, y_N \mid A, B, \Pi) is that the number of computations required is of the order of N M^N, where M is the number of possible states of the underlying Markov chain. Clearly, this poses a significant computational challenge even for reasonable lengths N of the observed sequence. In practice, a computationally efficient algorithm known as the Forward-Backward procedure is used to reduce the computational complexity. It should be mentioned that this Forward-Backward procedure can be derived as a special case of a message passing algorithm in graphs, as described in Chapter 13.

The main steps in the Forward algorithm are as follows. Define \alpha_i(m) as the probability of the partial observation sequence y_1, y_2, \ldots, y_i up to time i and having state x_i = m. These probabilities \alpha_i(m) can be calculated using the following recursive approach:

1.  The Initialization Step: First set

\alpha_1(m) = \pi_m \Pr(y_1 \mid x_1 = m), \quad 1 \leq m \leq M

(7.151)

2.  The Induction Step:

\alpha_{i+1}(k) = \left[\sum_{m=1}^{M} \alpha_i(m) \Pr(x_{i+1} = k \mid x_i = m)\right] \Pr(y_{i+1} \mid x_{i+1} = k), \quad k = 1, \ldots, M, \quad i = 1, \ldots, N - 1.

(7.152)

3.  The Termination Step:

\Pr(y_1, y_2, \ldots, y_N \mid A, B, \Pi) = \sum_{m=1}^{M} \alpha_N(m)

(7.153)
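The three steps above translate directly into a few lines of code. The following sketch implements the Forward recursion (7.151)-(7.153); the model matrices and the observation sequence are illustrative assumptions.

```python
import numpy as np

# Sketch: the Forward recursion (7.151)-(7.153). The matrices A, B, the
# initial distribution pi, and the observations are illustrative; states
# and output symbols are indexed from 0.
A = np.array([[0.7, 0.3],          # a_ij = Pr(x_{i+1} = j | x_i = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],          # b_ik = Pr(y = k | x = i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])
y = [0, 1, 1, 0]                   # observation sequence

alpha = pi * B[:, y[0]]                      # initialization (7.151)
for obs in y[1:]:
    alpha = (alpha @ A) * B[:, obs]          # induction (7.152)
print(alpha.sum())                           # Pr(y | A, B, pi), (7.153)
```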

The Backward algorithm proceeds in a similar manner to the Forward algorithm. Define \beta_i(m) as the probability of the partial sequence y_{i+1}, y_{i+2}, \ldots, y_N given that the state of the Markov chain at time i is x_i = m. Now \beta_i(m) is calculated using the following recursive approach.

1.  The Initialization Step: First set βN(m) = 1, ∀1 ≤ mM

2.  The Induction Step:

\beta_i(k) = \sum_{m=1}^{M} \Pr(x_{i+1} = m \mid x_i = k) \Pr(y_{i+1} \mid x_{i+1} = m) \beta_{i+1}(m), \quad k = 1, \ldots, M, \quad i = N - 1, N - 2, \ldots, 1.

(7.154)

3.  The Termination Step:

\Pr(y_1, y_2, \ldots, y_N \mid A, B, \Pi) = \sum_{m=1}^{M} \beta_1(m) \pi_m \Pr(y_1 \mid x_1 = m)

(7.155)

Solution to the State Estimation Problem

Unlike the enumeration problem, the optimal solution to the state estimation problem depends on the definition of optimality. One possible optimality measure is to compute, at each step, the state that maximizes the probability of the observation at that step. Clearly, this approach could result in state sequences that may be infeasible given the state transition probabilities. An alternate definition of optimality that is commonly used is to find the state sequence that has the highest probability. The well-known Viterbi algorithm provides a computationally efficient method to compute this optimal state sequence. In Chapter 13, the Viterbi algorithm is also shown to be a special case of a generalized message passing algorithm on graphs.

Similar to the Forward algorithm, in this case we define the probabilities \delta_i(m) of the most likely partial state sequence and observations up to time i, ending in state x_i = m, as

\delta_i(m) = \max_{x_1, x_2, \ldots, x_{i-1}} \Pr(x_1, x_2, \ldots, x_{i-1}, x_i = m, y_1, y_2, \ldots, y_i)

(7.156)

Using the principle of induction, it can be easily shown that these probabilities δi(m) can be computed as

\delta_{i+1}(m) = \left[\max_k \delta_i(k) \Pr(x_{i+1} = m \mid x_i = k)\right] \Pr(y_{i+1} \mid x_{i+1} = m)

(7.157)

It turns out that to find the optimal state sequence, we also need to keep track of the maximizing argument at each step in (7.157). This maximizing argument is stored in the \psi_i(m) variables, as outlined in the formal Viterbi procedure:

1.  The initialization step

\delta_1(m) = \pi_m \Pr(y_1 \mid x_1 = m), \quad \psi_1(m) = 0, \quad 1 \leq m \leq M

(7.158)

2.  The recursive step

\delta_{i+1}(m) = \max_{1 \leq k \leq M} \left[\delta_i(k) \Pr(x_{i+1} = m \mid x_i = k)\right] \Pr(y_{i+1} \mid x_{i+1} = m)

(7.159)

\psi_{i+1}(m) = \arg\max_{1 \leq k \leq M} \left[\delta_i(k) \Pr(x_{i+1} = m \mid x_i = k)\right]

(7.160)

3.  The termination step

P^* = \max_{1 \leq m \leq M} \delta_N(m)

(7.161)

\hat{x}_N = \arg\max_{1 \leq m \leq M} \delta_N(m)

(7.162)

4.  Backtracking step

\hat{x}_n = \psi_{n+1}(\hat{x}_{n+1}), \quad n = N - 1, N - 2, \ldots, 1.

(7.163)
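The following sketch implements the Viterbi recursion (7.158)-(7.163), reusing the illustrative model from the Forward-algorithm sketch above.

```python
import numpy as np

# Sketch: the Viterbi recursion (7.158)-(7.163), reusing the illustrative
# model of the Forward-algorithm sketch.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
y = [0, 1, 1, 0]

delta = pi * B[:, y[0]]                      # initialization (7.158)
psi = []
for obs in y[1:]:
    trans = delta[:, None] * A               # delta_i(k) Pr(x_{i+1}=m | x_i=k)
    psi.append(trans.argmax(axis=0))         # backpointers psi (7.160)
    delta = trans.max(axis=0) * B[:, obs]    # recursion (7.159)

path = [int(delta.argmax())]                 # termination (7.161)-(7.162)
for bp in reversed(psi):
    path.append(int(bp[path[-1]]))           # backtracking (7.163)
path.reverse()
print(path, delta.max())                     # most likely state sequence and P*
```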

Solution to the Model Parameter Estimation Problem

This problem is the most challenging of the three HMM problems, and there exists no analytical solution for the optimal model parameters given a finite observation sequence. However, the model parameters can be selected as a local optimum that maximizes the probability of the observation sequence for that model, using the iterative Baum-Welch algorithm.

Define \epsilon_i(m, n) as the conditional probability of being in state x_i = m and x_{i+1} = n given the observations. This conditional probability can be expressed in terms of the \alpha_i(m) and \beta_i(n) variables as

\epsilon_i(m, n) = \frac{\alpha_i(m) \beta_{i+1}(n) \Pr(x_{i+1} = n \mid x_i = m) \Pr(y_{i+1} \mid x_{i+1} = n)}{\sum_{k=1}^{M} \sum_{l=1}^{M} \alpha_i(k) \beta_{i+1}(l) \Pr(x_{i+1} = l \mid x_i = k) \Pr(y_{i+1} \mid x_{i+1} = l)}

(7.164)

The iterative procedure is as follows:

1.  Begin with an initial estimate for the parameters of the HMM.

2.  Calculate a new estimate for the parameters as

\pi_i = \sum_{j=1}^{M} \epsilon_1(i, j)

(7.165)

a_{ij} = \Pr(x_{n+1} = j \mid x_n = i) = \frac{\sum_{n=1}^{N-1} \epsilon_n(i, j)}{\sum_{k=1}^{M} \sum_{n=1}^{N-1} \epsilon_n(i, k)}

(7.166)

b_{ik} = \frac{\sum_{n:\, y_n = k} \sum_{j=1}^{M} \epsilon_n(i, j)}{\sum_{n=1}^{N} \sum_{j=1}^{M} \epsilon_n(i, j)}

(7.167)

3.  Using the newly calculated parameters, recalculate \epsilon_i using (7.164) and then repeat step 2 above, i.e., evaluate (7.165)-(7.167).

This iterative process is repeated until convergence; as noted before, the limit is not the global optimum but a local optimum. It should also be noted that this algorithm can be interpreted as being similar to the Expectation-Maximization (EM) algorithm [11, 12].

7.6      Summary and Further Reading

In this chapter, we provided a brief introduction to probability theory, along with several simple examples to illustrate the main results. These results form the basis of several chapters in this text. For instance, Markov models and Poisson processes are frequently used in Chapter 17 on queueing theory. Hidden Markov models are considered as a special case of a generalized factor graph-based formulation in Chapter 13. The ideas on conditional probabilities and Bayes' theorem are extensively used in Detection (Chapter 11) and Estimation Theory (Chapter 10). As noted before, several excellent textbooks and papers discuss various aspects of this chapter in great detail [1, 2, 3, 4, 5].

Acknowledgment: This work is supported in part by NSF through grant CCF-0546519.

7.7      Exercises

Exercise 7.7.1. A box contains 100 identical balls numbered from 1 through 100. Ten balls are picked from the box without replacement. What is the probability that the largest numbered ball selected equals 20? What is the probability that the smallest numbered ball selected equals 20?

Exercise 7.7.2. A biased coin has a probability of head equal to p. This biased coin is flipped 5 times. The outcome of each flip is independent of the outcomes of the other flips. Let discrete random variable X denote the number of heads in the first 3 flips and discrete random variable Y denote the number of tails in the last 3 flips. Find the joint PMF P_{X,Y}(x, y). Are X and Y independent?

Exercise 7.7.3. There are two different types of radioactive elements available to a scientist. Type 1 emits radioactive particles with a Poisson distribution and mean of α. Type 2 emits radioactive particles with a Poisson distribution and a mean of β. Let α > β. The scientist does not know apriori which of the two materials he is given, i.e., both materials have a probability of 0.5 of being selected. The scientist can only observe the number of emissions from the source in a given period of time. How should the scientist decide whether the material is type 1 or type 2? In other words, what is the decision rule? What is the probability of making an error using this decision rule?

Exercise 7.7.4. We have a set of M coins, numbered 1, 2,…, M. The probability of getting a head when coin k is tossed equals 1/k. We select one of the coins at random and toss it N times. The sequence of N observations is then given to you. What is your decision rule to determine which of the coins was selected based on your observations?

Exercise 7.7.5. Let X(t) be a random process defined by X(t) = \sum_{i=1}^{2} N_i \sin(2\pi f_0 t + \theta_i), where f_0 is a known frequency and all N_i and \theta_i are independent random variables. The characteristic function of N_i is \Phi_{N_i}(\omega) = e^{\lambda_i[e^{j\omega} - 1]}, where \lambda_i is a positive constant. Random variable \theta_i is uniformly distributed on [-\pi, \pi]. Find the mean and covariance of the process X(t). Is the process X(t) wide-sense stationary?

Exercise 7.7.6. Consider the motion of a queen piece on a 3 × 3 chessboard, in which the piece moves at every integer time t > 0. For simplicity, assume that the squares are numbered sequentially from 1 through 9, with the top left square being 1 and the bottom right square being 9. Let X_t denote the square the queen occupies at time t. Assume X_t, t > 0 is a discrete time Markov process, such that given X_t, the next state X_{t+1} is equally likely to be any one of the legal squares to which the queen may move. (a) Sketch the one-step transition probability diagram for X_t. (b) Compute the steady-state probabilities.

Exercise 7.7.7. Let X and Y be continuous random variables with a joint PDF given by

f_{X,Y}(x, y) = \begin{cases} cxy & 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}

(7.168)

Let A = \min\{X, Y\} and B = \max\{X, Y\}. Compute the joint CDF F_{A,B}(a, b) and the marginal PDFs f_A(a) and f_B(b).

Exercise 7.7.8. The value of a particular stock equals $100 on the first of January of a particular year. Every day the price of the stock increases by 2% with probability 0.1 and decreases by 1% with probability 0.2. With a probability of 0.7 the value of the stock remains the same for that day. Let random variable X represent the value of the stock after 1 year of trading. Compute the probability that the price of the stock exceeds $105 after 1 year.

Exercise 7.7.9. Let continuous random variables X_i, i ≥ 1, be a sequence of i.i.d. random variables with uniform distribution over the interval (0, 2). Let discrete random variable N be independent of the X_i and have PMF

P_N(n) = \begin{cases} 1/2 & n = 1, 2 \\ 0 & \text{otherwise} \end{cases}

(7.169)

Let random variable M = \sum_{i=1}^{N} X_i. Find the PDF of M.

Exercise 7.7.10. Let Y(t) = X(t) - a_1 X(t - d_1) - a_2 X(t - d_2), where a_i and d_i are real constants. Find the autocorrelation and spectral density of Y(t).

Exercise 7.7.11. Consider a random walk over the set {0,1, 2,… M} where the probability of transitioning from state i to state i + 1 equals q and the probability of transitioning from state i to state i − 1 equals 1 − q. From state 0 the probability of transitioning to state 1 equals 1. The probability of transitioning from state M to state M equals q. Find the long-term probability of being in each state i.

Exercise 7.7.12. Compute the autocorrelation of a first order MA process of the form

Y_n = X_n + \beta X_{n-1}

(7.170)

where {X_n} are i.i.d. with mean \mu and variance \sigma^2.

Exercise 7.7.13. Consider a packet retransmission system in which the same packet is repeatedly sent by the transmitter until the receiver gets N consecutive correct versions of the packet. Assume that each transmission of the packet has a probability of error of 1 − q and is independent of the other transmissions. What is the expected number of packet transmissions?

Exercise 7.7.14. In source coding, a prefix code or an instantaneous code is one in which no codeword is a prefix of any other codeword. Consider the following prefix code: symbols {a, b, c, d} are mapped to codewords 0, 10, 110, 111, respectively. Consider an i.i.d. sequence of symbols with probabilities Pr(a) = 0.5, Pr(b) = 0.3, Pr(c) = Pr(d) = 0.1. Find the average length of the binary representation of the code. Let B_n represent the sequence of binary symbols that result from the encoding. Find the PMF of B_n.

Exercise 7.7.15. In error control coding, one metric used to measure the performance of a block code is the number of bit errors that it can correct. For instance, in a 2-error correcting code, any bit error combination up to 2 errors can be corrected by the decoder at the receiver. Consider a 5-error correcting code with a resulting codeword of length 100. This length 100 code is transmitted over a binary symmetric channel with crossover probability p = 0.05. Compute the exact probability that all the errors made by the channel can be corrected by the code. Now, estimate this probability using the Central Limit Theorem.

References

[1]  W. Feller, An Introduction to Probability Theory and Its Applications. New York, Wiley, 1968.

[2]  S. Kay, Intuitive Probability and Random Processes Using MATLAB, 8th ed. Springer, New York, 2005.

[3]  A. Leon-Garcia, Probability, Statistics and Random Processes for Electrical Engineering, 3rd ed. Prentice Hall, Upper Saddle River, NJ, 2008.

[4]  S. Ross, A First Course in Probability. Prentice Hall, Upper Saddle River, NJ, 2009.

[5]  R. Yates and D. J. Goodman, Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers, 2nd ed. Hoboken, NJ: John Wiley and Sons Inc., 2004.

[6]  S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Application, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2004.

[7]  W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, March 1963.

[8]  A. Papoulis and U. S. Pillai, Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York, NY, 2002.

[9]  T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley, New York, 2006.

[10]  L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, February 1989.

[11]  A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.

[12]  M. Gupta and Y. Chen, “Theory and use of EM algorithm,” Foundations and Trends in Signal Processing, vol. 4, pp. 223–296, 2010.
