For many complex systems, analyzing the exact performance can be challenging and even analytically intractable. In many such situations, inequalities are used to bound the system performance. Such bounding techniques are frequently used in communications to derive closed form approximations for the performance of various systems. The closed form approximations clearly demonstrate the effect that changes in a certain parameter have on performance. These inequalities are also used in learning theory to study the problem of optimal model selection for given training data. A detailed discussion on using inequalities and large deviation theory is given in Chapter 9. In this section, we discuss in brief some of the frequently used inequalities.
For a non-negative random variable X, the Markov inequality states that
Pr(X ≥ a) ≤ μX/a, ∀a > 0 |
(7.59) |
Clearly, this inequality provides meaningful results only when a > μX. The Chebyshev inequality is given by
Pr(|X − μX| ≥ a) ≤ σ²X/a² |
(7.60) |
The Chebyshev inequality is obtained by applying the Markov inequality to (X − μX)².
Jensen’s inequality states that for a convex function g(·), the expected value of the function of a random variable is no smaller than the function of the expected value of that random variable, i.e.,
E{g(X)} ≥ g(E{X}) |
(7.61) |
The Union bound simply states that the probability of the union of a set of events is no greater than the sum of the probabilities of the individual events,
Pr(∪i Ai) ≤ Σi Pr(Ai) |
(7.62) |
The Schwarz inequality is given by
(E{XY})² ≤ E{X²} E{Y²} |
(7.63) |
The Chernoff bound is given by
Pr(X > a) = Pr(e^{tX} > e^{ta}) ≤ E{e^{tX}}/e^{ta}, ∀t > 0 (using the Markov inequality)
⇒ Pr(X > a) ≤ min_{t>0} E{e^{tX}} e^{−ta} |
(7.64) |
Recognize that the tail probability falls off exponentially with the Chernoff bound but only polynomially with the Chebyshev inequality. On the other hand, while the Chebyshev inequality is general enough to apply to the sum of dependent random variables, the Chernoff bound is only valid for the sum of independent random variables. Hoeffding's inequality generalizes the Chernoff bound to the case of arbitrary bounded random variables [7].
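To make the comparison concrete, the following Python sketch (the values of n and t are illustrative choices, not from the text) evaluates the exact tail probability of the sample mean of n fair coin flips against the Chebyshev bound and the Chernoff bound in Hoeffding's form.

```python
import math

# Tail of M_n = (1/n) * Binomial(n, 1/2): exact Pr(M_n >= 0.5 + t)
# versus the Chebyshev bound and the Chernoff/Hoeffding bound.
# n and t below are illustrative, hypothetical choices.
n, t = 100, 0.1
threshold = 60                       # smallest count with M_n >= 0.5 + t
exact = sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2**n
chebyshev = 0.25 / (n * t * t)       # Var(X_i) = 1/4 for a fair coin
chernoff = math.exp(-2 * n * t * t)  # decays exponentially in n

print(f"exact={exact:.4f}  chebyshev={chebyshev:.4f}  chernoff={chernoff:.4f}")
```

Here the Chernoff-type bound (about 0.135) is already much tighter than the Chebyshev bound (0.25), and the gap grows rapidly as n increases.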
7.3.3 Functions of Multiple Random Variables
Let random variable Z = g(X1, X2, …, Xn) be a function of n random variables. The distribution of Z can be computed as follows:
• Random variables X1, X2, …, Xn are jointly continuous: the CDF of Z is given by
FZ(z) = ∫⋯∫_{(x1,…,xn): g(x1,…,xn) ≤ z} fX1,X2,…,Xn(x1, x2, …, xn) dx1 dx2 … dxn, |
(7.65) |
where the integral is over the n-tuples (x1, …, xn) satisfying g(x1, …, xn) ≤ z.
• Random variables X1, X2, …, Xn are discrete: the PMF of Z is given by
PZ(z) = Σ_{(x1,…,xn): g(x1,…,xn) = z} PX1,X2,…,Xn(x1, x2, …, xn), |
(7.66) |
where the summation is over the n-tuples (x1, …, xn) satisfying g(x1, …, xn) = z.
• Random variables X1, X2, …, Xn are jointly continuous but Z is discrete: the PMF of Z is given by
PZ(z) = ∫⋯∫_{(x1,…,xn): g(x1,…,xn) = z} fX1,X2,…,Xn(x1, x2, …, xn) dx1 … dxn, |
(7.67) |
where the integral is over the n-tuples (x1, …, xn) for which g(x1, …, xn) = z.
Example 7.3.9. Let the joint PDF of random variables X1 and X2 be given by
fX1,X2(x1, x2) = 1, 1 ≤ x1 ≤ 2, 1 ≤ x2 ≤ 2 |
(7.68) |
Now let Z = X1X2. The CDF of Z is computed separately over two ranges of z.
• For the values 1 ≤ z ≤ 2, the CDF is given by
FZ(z) = ∫_1^z ∫_1^{z/x1} 1 dx2 dx1 |
(7.69) |
= ∫_1^z (z/x1 − 1) dx1 |
(7.70) |
= z log(z) − z + 1 |
(7.71) |
• For the values 2 < z ≤ 4, the CDF is given by
FZ(z) = ∫_1^{z/2} ∫_1^2 1 dx2 dx1 + ∫_{z/2}^2 ∫_1^{z/x1} 1 dx2 dx1 |
(7.72) |
= z/2 − 1 + ∫_{z/2}^2 (z/x1 − 1) dx1 |
(7.73) |
= z/2 − 1 + z(log(2) − log(z/2)) − 2 + z/2 |
(7.74) |
= z(1 + log(2) − log(z/2)) − 3 |
(7.75) |
The corresponding PDF of Z is obtained by differentiating the CDF,
fZ(z) = { log(z), 1 ≤ z ≤ 2
        { 2 log(2) − log(z), 2 < z ≤ 4 |
(7.76) |
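The piecewise CDF derived above can be sanity-checked by simulation; the Python sketch below (the sample size is an arbitrary choice) compares the empirical CDF of Z = X1X2 against the closed form.

```python
import math
import random

random.seed(0)
# Monte Carlo check of the CDF of Z = X1 * X2 with X1, X2 ~ Uniform(1, 2)
samples = [random.uniform(1, 2) * random.uniform(1, 2) for _ in range(200_000)]

def cdf(z):
    if z <= 2:
        return z * math.log(z) - z + 1                       # 1 <= z <= 2
    return z * (1 + math.log(2) - math.log(z / 2)) - 3       # 2 < z <= 4

for z in (1.5, 2.5, 3.5):
    empirical = sum(s <= z for s in samples) / len(samples)
    print(z, round(empirical, 3), round(cdf(z), 3))
```

The two columns agree to within the Monte Carlo error, and the closed form evaluates to exactly 1 at z = 4, as it must.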
Now suppose Y = ⌊2X1 + 2X2⌋
□
In many applications it turns out that we are interested in computing the distribution of the sum of multiple independent random variables. In this case, the CDF of Z = X1 + X2 can be written as
FZ(z) = ∫∫_{x1 + x2 ≤ z} fX1,X2(x1, x2) dx1 dx2, |
(7.78) |
= ∫_{−∞}^{∞} ∫_{−∞}^{z − x2} fX1(x1) fX2(x2) dx1 dx2 (due to independence of X1 and X2) |
(7.79) |
The PDF of Z is obtained by differentiating with respect to z,
fZ(z) = ∫_{−∞}^{∞} fX1(x1) fX2(z − x1) dx1, |
(7.80) |
which can be recognized as the convolution of the marginal PDFs of X1 and X2, i.e.,
fZ(z) = fX1(z) ∗ fX2(z). |
(7.81) |
Equivalently, the characteristic function of Z, ΦZ(ω) = E{e^{jωZ}}, is the product of the individual characteristic functions,
ΦZ(ω) = ΦX1(ω) ΦX2(ω) |
(7.84) |
This product relationship is similar to the analysis of linear time-invariant systems, in which the output of the system equals the convolution of the input and the impulse response; equivalently, the Fourier transform of the output equals the product of the Fourier transform of the input and the frequency response of the system.
Example 7.3.10. Let X1 and X2 be independent random variables uniformly distributed on (a, b) and (c, d), respectively. The characteristic function of X1 is given by
ΦX1(ω) = (e^{jωb} − e^{jωa}) / (jω(b − a)) |
(7.85) |
Using (7.84), the characteristic function of Y = X1 + X2 is given by
ΦY(ω) = (e^{jωb} − e^{jωa}) / (jω(b − a)) · (e^{jωd} − e^{jωc}) / (jω(d − c)) |
(7.86) |
By taking the inverse Fourier transform of (7.86), it can be shown that the PDF of Y is piecewise linear.
For (a + c) ≤ y ≤ (a + b + c + d)/2, the PDF is given by
(7.88) |
(7.89) |
For (a + b + c + d)/2 < y ≤ (b + d), the PDF is given by
(7.90) |
(7.91) |
It can be recognized that this PDF corresponds to a triangular PDF.
□
Example 7.3.11. Let X1 and X2 be independent Gaussian random variables with means μX1, μX2 and variances σ²X1, σ²X2, and let Y = aX1 + bX2. The characteristic function of a Gaussian random variable Xi is given by
ΦXi(ω) = e^{jωμXi − σ²Xi ω²/2} |
(7.92) |
Now, the mean and variance of aX1 are aμX1 and a²σ²X1, respectively. Thus, the characteristic function of Y is given by
ΦY(ω) = e^{jωaμX1 − a²σ²X1ω²/2} e^{jωbμX2 − b²σ²X2ω²/2} |
(7.93) |
= e^{jω(aμX1 + bμX2) − ω²(a²σ²X1 + b²σ²X2)/2} |
(7.94) |
From (7.94), it can be clearly seen that Y is also a Gaussian random variable, with mean aμX1 + bμX2 and variance a²σ²X1 + b²σ²X2.
□
Example 7.3.12. Let Xi, for i = 1, 2, …, be i.i.d. exponential random variables with parameter α, and let Yn = X1 + X2 + ⋯ + Xn. Then Yn has the Erlang PDF
fYn(x) = α e^{−αx} (αx)^{n−1} / (n − 1)! |
(7.95) |
FYn(x) = 1 − e^{−αx} (1 + αx/1! + ⋯ + (αx)^{n−1}/(n − 1)!) |
(7.96) |
Now, define a new family of discrete random variables N(t) as the number of the sums Yn that fall in [0, t], i.e.,
N(t) = max{n : Yn ≤ t}. |
(7.97) |
In other words, N(t) = n if and only if Yn ≤ t < Yn+1. Thus,
Pr(N(t) = n) = FYn(t) − FYn+1(t) = e^{−αt} (αt)^n / n! |
(7.98) |
This demonstrates the relationship between the discrete Poisson random variable and the continuous exponential random variables.
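This relationship is easy to check by simulation. The Python sketch below (α, t, and the trial count are illustrative choices) builds N(t) from exponential interarrival times and compares its empirical PMF with the Poisson PMF.

```python
import math
import random

random.seed(1)
alpha, t, trials = 2.0, 3.0, 50_000
counts = []
for _ in range(trials):
    total, n = 0.0, 0
    while True:
        total += random.expovariate(alpha)   # i.i.d. Exponential(alpha) gaps
        if total > t:
            break
        n += 1
    counts.append(n)

# Empirical Pr(N(t) = n) versus e^{-alpha t} (alpha t)^n / n!
for n in range(4, 9):
    emp = counts.count(n) / trials
    pois = math.exp(-alpha * t) * (alpha * t) ** n / math.factorial(n)
    print(n, round(emp, 3), round(pois, 3))
```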
□
7.3.4 Convergence of Random Variables
Before formally introducing the law of large numbers, in this section we introduce the various types of convergence of random variables.
First, consider a sequence a1, a2,… an of real numbers. This sequence is said to converge to a real number a if for any given value of ϵ > 0, there exists an integer N such that ∣an − a∣ < ϵ, ∀n ≥ N.
As noted before, a random variable is a mapping from the set of outcomes to real numbers. Consider a sequence of random variables X1, X2, …, Xn, … defined on the same sample space S. Several notions of convergence of this sequence to a random variable X are used.
Convergence everywhere: If for every value of ω ∈ S, the sequence X1(ω), X2(ω), … converges to X(ω), the sequence of random variables is said to converge everywhere.
Convergence almost everywhere: If the set of outcomes ω for which the sequence X1(ω), X2(ω), … converges to X(ω) has probability 1, the sequence is said to converge almost everywhere (or almost surely), i.e.,
Pr(limn→∞ Xn = X) = 1 |
(7.99) |
Convergence in probability or stochastic convergence: For a given value of ϵ > 0, consider the sequence of real numbers given by Pr(|Xn − X| > ϵ). If this sequence converges to 0 for every ϵ > 0, the sequence Xn is said to converge to X in probability.
It should be noted that convergence almost everywhere implies convergence in probability, while the reverse is not always true. Further, several other forms of convergence are defined in the literature [8].
7.3.5 Law of Large Numbers (LLN) and Central Limit Theorem (CLT)
The LLN allows the approximation of the average of a finite set of independent random variables by a single number. The LLN essentially states that certain aspects of the behavior of populations become more predictable as the population size increases. In its simple form, the weak law of large numbers (WLLN) is stated as follows:
Theorem 7.3.1 (Weak Law of Large Numbers). Let Mn = (1/n)(X1 + X2 + ⋯ + Xn), where the random variables Xi are i.i.d. with mean μX and finite variance σ²X. Then, for any ϵ > 0,
Pr(|Mn − μX| ≥ ϵ) ≤ σ²X/(nϵ²) |
(7.100) |
In the limit of large n, the RHS of (7.100) equals 0. In other words, Mn converges to μX in probability.
In simple terms, the WLLN states that the sample mean Mn is close to the true mean μX with high probability when n is large.
Theorem 7.3.2 (Strong Law of Large Numbers). Let Mn = (1/n)(X1 + X2 + ⋯ + Xn), where the random variables Xi are i.i.d. with finite mean μX. Then,
Pr(limn→∞ Mn = μX) = 1 |
(7.101) |
Theorem 7.3.3 (Central Limit Theorem). Let Mn = (1/n)(X1 + X2 + ⋯ + Xn), where the random variables Xi are all independent and have identical means μX and a finite variance σ²X. Then,
limn→∞ Pr( √n (Mn − μX)/σX ≤ z ) = (1/√(2π)) ∫_{−∞}^{z} e^{−u²/2} du |
(7.102) |
The CLT allows the sum or mean of a finite set of independent random variables to be approximated by a Gaussian distribution and the approximation becomes better for large population sizes.
The LLN can be used to determine the probability of an event that is repeatable and has independent probabilities in each trial. For instance, let random variable Xi = 1 if event A occurs in the ith trial and Xi = 0 otherwise. Then Pr(A) = E{Xi}. The sample mean μ̂X = (1/n) Σ_{k=1}^{n} Xk is an estimate of the probability Pr(A). By the LLN, as the sample size gets large, the estimate converges to the true probability. Other applications of the laws of large numbers in estimation theory are given in Chapter 8.
The Monte Carlo simulation method is based on this principle and is commonly used to evaluate the performance of a variety of systems. For instance, to estimate the probability of bit error of a communication system, a large number of bits are transmitted and the fraction of bits in error is calculated. In information theory, the asymptotic equipartition property (AEP) [9] is a consequence of the LLN.
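A minimal Python sketch of this principle (the event A and its probability are hypothetical): the relative frequency of A converges to Pr(A) as the number of trials grows.

```python
import random

random.seed(2)
# Monte Carlo / WLLN: the sample mean of the indicators 1{A occurs}
# estimates Pr(A). Here A = {U < 0.3} with U ~ Uniform(0, 1).
def estimate(n):
    return sum(random.random() < 0.3 for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, estimate(n))
```

The printed estimates settle around the true value 0.3, with the fluctuation shrinking roughly as 1/√n.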
A random process X (t, ω) is a mapping from each outcome ω to functions of time. A random process can also be considered as an indexed collection of random variables. The index set can be continuous or discrete. The values taken by the process can also belong to a discrete set or continuous set. Thus, there are 4 possible types of random processes:
• Continuous time, continuous valued processes such as seismic measurements and Brownian motion.
• Continuous time, discrete valued processes such as population size and stock prices.
• Discrete time, continuous valued processes such as sampled speech and temperature signals.
• Discrete time, discrete valued processes such as digitized speech and video signals.
The collection of temporal functions over all values of ω is sometimes referred to as the ensemble of functions. For simplicity of notation, we do not explicitly show the dependence of the random process X(t) on the outcome ω. To completely define the random process, the joint distribution of the set of random variables X(t1), X(t2), …, X(tK) needs to be specified for all values of K and t1, t2, …, tK. Clearly, this is a daunting task and practically impossible except in some very narrow and specialized cases, such as a process defined by an i.i.d. sequence of random variables. Consequently, most analysis of random processes is restricted to understanding some statistical properties.
The mean function μX(t) of a random process is defined as
μX(t) = E{X(t)} |
(7.103) |
The autocorrelation RX(t, τ) of the random process is defined as
RX(t, τ) = E{X(t) X*(t + τ)} |
(7.104) |
The autocovariance of the process is defined as
CX(t, τ) = E{(X(t) − μX(t)) (X(t + τ) − μX(t + τ))*} = RX(t, τ) − μX(t) μ*X(t + τ) |
(7.105) |
The crosscorrelation RX,Y(t, τ) of two random processes X(t) and Y(t) is defined as
RX,Y(t, τ) = E{X(t) Y*(t + τ)} |
(7.106) |
Properties of autocorrelation: The autocorrelation RX(t, τ) is a positive definite function. Further, it can be shown that given any positive definite function f(t, τ), we can construct a random process X(t) for which this function is the autocorrelation. It can also be easily shown that RX(t + τ, −τ) = R*X(t, τ).
Random process X(t) is said to be strictly stationary if the joint PDF of (X(t1), X(t2), …, X(tK)) is the same as the joint PDF of (X(t1 + T), X(t2 + T), …, X(tK + T)) for all values of K, t1, t2, …, tK and T. Two random processes X(t) and Y(t) are jointly stationary if the joint PDF of (X(t1), X(t2), …, X(tk), Y(tk+1), Y(tk+2), …, Y(tK)) is the same as the joint PDF of (X(t1 + T), X(t2 + T), …, X(tk + T), Y(tk+1 + T), Y(tk+2 + T), …, Y(tK + T)) for all values of k, K, t1, t2, …, tK and T.
Random process X(t) is said to be wide-sense stationary (WSS) if and only if its mean is independent of time (μX(t) = μX) and its autocorrelation depends only on the time difference, i.e., RX(t, τ) = RX(τ). Two random processes X(t) and Y(t) are jointly WSS if both X(t) and Y(t) are WSS and their crosscorrelation RX,Y(t, τ) depends only on the time difference.
Random process X(t) is said to be cyclostationary if the joint PDF of (X(t1), X(t2), …, X(tK)) is the same as the joint PDF of (X(t1 + T), X(t2 + T), …, X(tK + T)) for all values of K, t1, t2, …, tK and for a fixed value of T called the time period of the random process.
The changes in the process between successive sampling instants, X(t2) − X(t1), X(t3) − X(t2), X(t4) − X(t3), …, are called the increments of the process. A process in which the increments are all independent is said to be an independent increments process. Similarly, a process with stationary increments is said to be a stationary increments process. The stationarity properties of discrete time random processes are similarly defined.
A random process is said to be ergodic if the properties calculated from a temporal sequence of the process are the same as the properties of the ensemble of functions.
Random Walk: The random walk process Dn is defined as the sum of a sequence of independent random variables Li that take values +1 and −1 with probabilities α and 1 − α, respectively. Thus, Dn = Σ_{i=1}^{n} Li. In other words, at each time-instant the process either increases by 1 or decreases by 1. The random walk process is clearly memoryless in the sense that given the current state of the process, the future state is independent of the past. It can also be easily seen that the increments of the process are independent and stationary.
ARMA Process: Consider a random process Yn which is obtained by passing a white noise process Xn through a linear filter H(f). If the transfer function of the filter is of the form
H(f) = (Σ_{k=0}^{q} bk e^{−j2πfk}) / (1 − Σ_{k=1}^{p} ak e^{−j2πfk}), |
(7.107) |
then the evolution of the process can be represented as
Yn = Σ_{k=1}^{p} ak Yn−k + Σ_{k=0}^{q} bk Xn−k |
(7.108) |
Such a process is referred to as an ARMA process and is frequently used in several signal processing applications and for time-series analysis.
Wiener random process and Brownian motion: A Wiener random process is constructed as the limit of a random walk process. Specifically, consider the symmetric random walk process which has equal probability to increase or decrease by 1. Now let the time increments be denoted by δ and let the step size at each time instant be set to √(αδ). In the limit of δ → 0, the continuous time random process X(t) becomes a process with zero mean and a variance that increases linearly with time. Further, since each X(t) is now the sum of an infinite number of independent random variables, it has a Gaussian distribution. Such a process is referred to as the Wiener random process and is commonly used to model Brownian motion. Clearly, by construction, it can be seen that the Wiener process has independent and stationary increments. The autocorrelation of this process can be shown to equal RX(t, τ) = α min{t, t + τ}.
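The limiting construction can be imitated numerically. In the Python sketch below (α, δ, and the number of paths are arbitrary illustrative choices), the sample variance of the discretized process at time t comes out close to αt, as the construction predicts.

```python
import random

random.seed(3)
# Discretized Wiener process: symmetric steps of size sqrt(alpha * delta).
alpha, delta, t_end, paths = 2.0, 0.01, 1.0, 20_000
steps = int(t_end / delta)
step = (alpha * delta) ** 0.5
finals = []
for _ in range(paths):
    x = 0.0
    for _ in range(steps):
        x += step if random.random() < 0.5 else -step
    finals.append(x)

mean = sum(finals) / paths
var = sum((x - mean) ** 2 for x in finals) / paths
print(round(mean, 3), round(var, 3))  # variance should be near alpha * t_end
```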
7.4.2 Random Process as the Input to a Linear System
When a random process X(t) is the input to an LTI system, then the output Y(t) is also a random process. Further, if the input is WSS then the output is also WSS, as will be clear from (7.111) and (7.116). The mean and autocorrelation of the output can be derived in terms of the mean and autocorrelation of the input to the system and the impulse response of the LTI system. Let h(t) and H(f) denote, respectively, the impulse response and frequency response of the LTI system.
Now, the mean of the output process Y(t) can be derived as
μY(t) = E{∫_{−∞}^{∞} h(s) X(t − s) ds} |
(7.109) |
= ∫_{−∞}^{∞} h(s) E{X(t − s)} ds = μX ∫_{−∞}^{∞} h(s) ds |
(7.110) |
= μX H(0) |
(7.111) |
Thus, the mean of the output process Y(t) is independent of time. The autocorrelation of the output is given by,
RY(t, τ) = E{Y(t) Y*(t + τ)} |
(7.112) |
= E{∫_{−∞}^{∞} h(s) X(t − s) ds ∫_{−∞}^{∞} h*(r) X*(t + τ − r) dr} |
(7.113) |
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h*(r) RX(t − s, τ + s − r) ds dr |
(7.114) |
Since, X(t) is WSS, (7.114) can be rewritten as
RY(τ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h*(r) RX(τ + s − r) ds dr |
(7.115) |
= RX(τ) ∗ h*(τ) ∗ h(−τ) |
(7.116) |
Thus, the autocorrelation of the output process Y(t) depends only on time-difference τ.
The cross-correlation is given by
RX,Y(t, τ) = E{X(t) Y*(t + τ)} |
(7.117) |
= E{X(t) ∫_{−∞}^{∞} h*(r) X*(t + τ − r) dr} |
(7.118) |
= ∫_{−∞}^{∞} h*(r) RX(t, τ − r) dr |
(7.119) |
= ∫_{−∞}^{∞} h*(r) RX(τ − r) dr = RX(τ) ∗ h*(τ) (when X(t) is WSS) |
(7.120) |
If random processes X(t) and Y(t) are jointly wide sense stationary, then
RX,Y(t, τ) = RX,Y(τ) |
(7.121) |
The power spectral density (PSD), denoted WX(f), of random process X(t) is defined as the Fourier transform of the autocorrelation RX(τ) and is given by
WX(f) = ∫_{−∞}^{∞} RX(τ) e^{−j2πfτ} dτ |
(7.122) |
Similarly, the cross-power spectral density, denoted WX,Y(f), of jointly WSS random processes X(t) and Y(t) is defined as
WX,Y(f) = ∫_{−∞}^{∞} RX,Y(τ) e^{−j2πfτ} dτ |
(7.123) |
For this linear time invariant system with input X(t) the PSD of the output can be computed using the frequency transfer function of the system and the PSD of the input as,
WY(f) = |H(f)|² WX(f) |
(7.124) |
The cross-power spectral density is related to the PSD of X(t) and Y(t) as
(7.125) |
(7.126) |
Example 7.4.1. Consider a WSS random process X(t) = A cos(2πf0t + θ), where A is a constant and θ is uniformly distributed between (0, 2π). It can be easily shown that the autocorrelation RX(τ) = (A²/2) cos(2πf0τ). Let this process be input to a differentiator system with frequency response H(f) = j2πf. The output Y(t) has spectral density given by
WY(f) = |j2πf|² WX(f) = π²f0²A² (δ(f − f0) + δ(f + f0)) |
(7.127) |
□
Discrete Time Processes and Systems
For a discrete time random process, RX(n, m) = E{Xn X*n+m}. If the process is wide sense stationary, the autocorrelation depends only on the lag m and is written RX(m). Since RX(m) is a discrete function, the PSD is a periodic function. Thus, it is sufficient to consider one period of the PSD. The relationship between the autocorrelation and PSD is now given by,
WX(f) = Σ_{m=−∞}^{∞} RX(m) e^{−j2πfm} |
(7.128) |
RX(m) = ∫_{−1/2}^{1/2} WX(f) e^{j2πfm} df |
(7.129) |
Example 7.4.2. Let the input to a simple averaging filter h(n) be a WSS white process X(n) with autocorrelation
RX(n) = { 1, n = 0
        { 0, otherwise |
(7.130) |
Let h(n) = 1/(2M + 1) for n = −M, …, −1, 0, 1, …, M and h(n) = 0 for all other values of n. The autocorrelation of the output process Y(n) can be calculated as
RY(n) = Σ_k h(k) h(k + n) |
(7.131) |
= { (1/(2M + 1)²)(2M + 1 − |n|), n = −2M, …, −1, 0, 1, …, 2M
  { 0, otherwise |
(7.132) |
□
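The closed form in (7.132) can be verified directly: for a white, unit-variance input, RY(n) reduces to the deterministic autocorrelation of h(n). A short Python sketch (M = 3 is an arbitrary choice):

```python
# Output autocorrelation of the averaging filter for white, unit-variance
# input: R_Y(n) = sum_k h(k) h(k + n), compared with the closed form.
M = 3
L = 2 * M + 1
h = [1 / L] * L

for n in range(L):
    r = sum(h[k] * h[k + n] for k in range(L - n))   # lag-n correlation
    closed = (L - n) / L**2                          # (2M + 1 - |n|) / (2M + 1)^2
    assert abs(r - closed) < 1e-12
    print(n, round(r, 4))
```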
A random process with discrete states is said to be a Markov process if it is memoryless, i.e., it satisfies the following property for all t1 < t2 < ⋯ < tn+1:
Pr(X(tn+1) = xn+1 | X(tn) = xn, X(tn−1) = xn−1, …, X(t1) = x1) = Pr(X(tn+1) = xn+1 | X(tn) = xn) |
(7.133) |
Without loss of generality, the states of the random process are denoted by {0,1,…}.
A discrete time Markov process is also referred to as a Markov Chain. In this case,
Pr(Xn+1 = j | Xn = i, Xn−1 = in−1, …, X0 = i0) = Pr(Xn+1 = j | Xn = i) |
(7.134) |
A Markov chain is completely defined by its one step state transition probabilities, denoted by pij, and the initial starting condition. For convenience, these transition probabilities are arranged in matrix form, represented by P, with pij being the ith row and jth column element such that pij = Pr(Xn+1 = j | Xn = i). The n-step transition probabilities can be derived from pij using the Chapman-Kolmogorov equations as follows:
p(n)ij = Σk p(m)ik p(n−m)kj, for any 0 < m < n |
(7.135) |
To intuitively understand (7.135), we consider the transition from state i to state j in exactly n steps. After m < n steps, let the Markov chain be in state k. The probability of all paths that start at state i and are in state k after m steps is given by p(m)ik. Similarly, the probability of going from state k to state j in n − m steps equals p(n−m)kj. Thus, the probability of all paths that start at state i, pass through state k at step m, and end at state j is given by p(m)ik p(n−m)kj. The total probability is then the summation over all possible intermediate states k.
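The Chapman-Kolmogorov relation is easy to verify numerically. The Python sketch below uses a hypothetical 2-state transition matrix P and checks that the 3-step transition probability obtained by summing over the intermediate state after m = 1 step equals the corresponding entry of P³.

```python
# Chapman-Kolmogorov check: p_ij^(n) = sum_k p_ik^(m) p_kj^(n-m).
# P is a hypothetical 2-state transition matrix.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1],
     [0.4, 0.6]]
P2 = matmul(P, P)          # 2-step transition probabilities
P3 = matmul(P2, P)         # 3-step transition probabilities

# p_00^(3) via the intermediate state k reached after m = 1 step
p00_split = sum(P[0][k] * P2[k][0] for k in range(2))
print(round(P3[0][0], 6), round(p00_split, 6))
```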
A Markov chain is sometimes represented pictorially (as shown in Figure 7.3) as a group of states with arcs representing the transitions between the various states. In a discrete time MC, there is a transition at every time instant. In a continuous time MC, transitions between states occur at random time instants.
The states of a Markov chain are classified as follows. State i can reach state j if there is a non-zero probability of reaching state j in n-steps after starting from state i. States i and j are said to communicate if one can reach state i starting from state j and vice versa. A communicating class consists of a set of states that can communicate with each other. Clearly, all the states of a Markov chain can be partitioned into disjoint communicating classes. An MC is said to be irreducible if it contains only one class, i.e., all states communicate with each other.
The first passage time or hitting time Tij is the random time it takes to first reach state j starting from state i. Let f(n)jj = Pr(Tjj = n) denote the probability that the MC comes back to state j in exactly n steps. Let fjj = Σ_{n=1}^{∞} f(n)jj.
State j is transient if fjj < 1. State j is recurrent if fjj = 1. The recurrent states are positive recurrent if E{Tjj} < ∞ and null recurrent if E{Tjj} = ∞. The positive recurrent states are said to be periodic if revisits must occur in multiples of a certain period d, and aperiodic if revisits have period 1.
If the Markov chain is irreducible, and all its states are aperiodic and positive recurrent, then the Markov chain is said to be ergodic. In this case, as time continues the chain approaches a steady state or equilibrium condition. The probability distribution of the states in equilibrium is unique and can be computed from the transition probability matrix of the MC. Let πi denote the probability of being in state i and let Π = [π1 π2 …]. These probabilities πi also represent a time average, i.e., the fraction of time spent in state i. The steady-state probabilities do not depend on the initial state probabilities and can be calculated as the solution to the balance equations, which are given in matrix form as,
Π = Π P |
(7.136) |
with the additional constraint that the sum of the probabilities of the states should equal 1, i.e.,
Σi πi = 1 |
(7.137) |
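The balance equations can be solved numerically by repeatedly multiplying a starting distribution by P (the power method); the Python sketch below uses a hypothetical 3-state transition matrix.

```python
# Power-method solution of Pi = Pi * P for a hypothetical 3-state chain.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
pi = [1.0, 0.0, 0.0]                  # any starting distribution works
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

# Residual of the balance equations should be ~0 at convergence
residual = max(abs(pi[j] - sum(pi[i] * P[i][j] for i in range(3)))
               for j in range(3))
print([round(p, 4) for p in pi], residual)
```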
Example 7.5.1. Consider a node in an ad-hoc network that is in one of three possible states during each time slot: transmitting its own data (T), receiving data (R), or in sleep mode (S). The transition probabilities from the various states are given in Figure 7.3. Compute the steady-state probabilities of the various states. If the energy consumed in the T, R, and S states are respectively 100 mJ, 30 mJ, and 1 mJ, compute the average energy consumed by the node.
The transition probability matrix is given by
(7.138) |
The steady-state probability distribution for the transmit, receive, and sleep states can be computed respectively as 2/9,1/3, and 4/9. The average energy consumption equals 32.67 mJ.
□
Time reversibility. Let Yn = X−n represent the time-reversed process of the Markov chain Xn. Clearly, the time reversed MC has the same states as the original MC. Also, the time spent in each state is the same whether it is the original MC or the time-reversed MC, so the steady-state probabilities are also identical. Further, the transition probabilities of the time-reversed MC are related to those of the original MC as
Pr(Yn+1 = j | Yn = i) = πj pji / πi |
(7.139) |
7.5.2 Continuous Time Markov Chains
As noted before, a continuous time Markov chain also has discrete states and is characterized by arcs representing state transitions. However, transitions between states can happen at any time. The state transition probability matrix for the continuous time Markov chains can also be defined in a manner similar to the discrete time Markov chain. However, the transition probabilities now also depend on time as follows
pij(s, t) = Pr(X(s + t) = j | X(s) = i) |
(7.140) |
For simplicity, it is sometimes assumed that pij(s,t) is independent of starting time s and only depends on time difference t. Even in this case, the analysis is complicated since the transition probabilities must be specified for all values of t.
An alternate approach to the analysis of such systems is to consider the continuous time MC at the time instants when a state transition occurs. Specifically, Yn = X(tn) is an embedded discrete time Markov chain, where the transition times are t1, t2, …. Further, for a given value of Yn, the time intervals τn = tn+1 − tn are independent exponential random variables with a parameter ai that depends only on the current state.
The infinitesimal rate aij is defined as
aij = limδ→0 Pr(X(t + δ) = j | X(t) = i) / δ, j ≠ i |
(7.141) |
The time spent by the MC in state i before a transition to state j is exponentially distributed with parameter aij. Clearly, ai = Σj aij, ∀i. These rates are arranged in matrix form for convenience and denoted by matrix A with the ith row and jth column entry being aij.
The steady-state distribution for the continuous time MC can be computed using the same principle as for its discrete time counterpart. Recall that πi denotes the probability of being in state i. Now the probability of transition from state i to state j is given by aij/ai. Thus,
(7.142) |
(7.143) |
πj aj = Σi≠j πi aij, ∀j |
(7.144) |
with the additional constraint that ∑iπi = 1. Grouping (7.144) for all i together i into matrix form results in the following balance equation
Π Q = 0 |
(7.145) |
where
Q is the matrix with entries qij = aij for i ≠ j and qii = −ai |
(7.146) |
Example 7.5.2. Each node in an ad-hoc network is either in transmit or receive mode. The time spent in the transmit mode is exponentially distributed with parameter α and the time spent in the receive mode is exponentially distributed with parameter β. Assume that all transmit and receive durations are independent. In this case, the Q matrix is given by
Q = [ −α    α
       β   −β ] |
(7.147) |
The steady-state probabilities can easily be computed as β/(α + β) and α/(α + β).
□
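The result can be verified directly from the balance equation; the Python sketch below plugs in arbitrary illustrative rates.

```python
# Two-state continuous-time chain: with rates alpha (leave transmit) and
# beta (leave receive), pi = (beta/(alpha+beta), alpha/(alpha+beta))
# should satisfy pi * Q = 0. The rate values are arbitrary choices.
alpha, beta = 3.0, 1.5
Q = [[-alpha, alpha],
     [beta, -beta]]
pi = [beta / (alpha + beta), alpha / (alpha + beta)]

residual = [sum(pi[i] * Q[i][j] for i in range(2)) for j in range(2)]
print(pi, residual)   # residual entries should be 0 up to rounding
```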
In many applications the output variables being observed do not directly have the Markov property. However, there is an underlying, dynamically changing state variable which exhibits the Markov property. Further, corresponding to each state, there is a probability distribution that governs the output variable that is observed. Such a model is referred to as a hidden Markov model (HMM), and such models are often used in signal modeling and coding applications as illustrated by the examples in this section. An excellent introduction to HMMs is given in [10].
Let random variables Xn, for n = 1, 2, …, represent the underlying state of the system. Let the observed random variables be represented by Yn, for n = 1, 2, …. The hidden Markov model is characterized by two conditional distribution matrices, A and B. Matrix A with elements aij = Pr(Xn+1 = j | Xn = i) determines the state transition probabilities for the underlying state variable Xn. Matrix B with elements bik = Pr(Yn = k | Xn = i) determines the conditional probabilities of the output variable for each state. In some applications the initial distribution, πi = Pr(X1 = i), of the states of the system is also specified. The structure of the hidden Markov model is depicted in Figure 7.4.
There are essentially three main problems in HMMs.
1. Enumeration/Evaluation Problem: Given a certain model (A, B, Π) for the HMM and a sequence of observations, y1, y2,… yN, calculate the probability Pr (y1, y2,… yN ∣ A, B, Π) that the model caused these observations.
2. State Estimation: Given a certain model (A, B, Π) and observation sequence y1, y2,… yN, find the best sequences x̂1, x̂2,… x̂N of state variables that could result in the observation.
3. Model Parameter Optimization: Given a certain sequence of observations y1, y2,… yN, find the model probabilities (A, B, Π) that best fit the given data. This problem is typically solved during the model training phase and is critical in the practical use of HMMs.
Solution to the Enumeration problem
The problem of enumeration can be evaluated using elementary tools of conditional probability. Formally, given a sequence of observations y1, y2,…, yN, this sequence can be obtained from a given state sequence as
Pr(y1, y2, …, yN | s1, s2, …, sN) = Π_{k=1}^{N} Pr(yk | sk), |
(7.148) |
assuming the observations are independent. The probability of this state sequence is given by
Pr(s1, s2, …, sN) = Pr(X1 = s1) Π_{k=2}^{N} Pr(Xk = sk | Xk−1 = sk−1). |
(7.149) |
The desired probability of the observations given the model can then be calculated as
Pr(y1, y2, …, yN | A, B, Π) = Σ_{s1,s2,…,sN} Pr(X1 = s1) Pr(y1 | s1) Π_{k=2}^{N} Pr(Xk = sk | Xk−1 = sk−1) Pr(yk | sk) |
(7.150) |
The difficulty with this direct evaluation (7.150) of Pr(y1, y2, …, yN | A, B, Π) is that the number of computations required is of the order of N M^N, where M is the number of possible states of the underlying Markov chain. Clearly, this poses a significant computational challenge even for reasonable lengths N of the observed sequence. In practice, a computationally efficient algorithm known as the Forward-Backward procedure is used to reduce the computational complexity. It should be mentioned that this Forward-Backward procedure can be derived as a special case of a message passing algorithm in graphs, as described in Chapter 13.
The main steps in the Forward algorithm are as follows. Define αi(m) as the probability of the partial observation sequence y1,y2,…,yi up to time i and having state xi = m. Now these probabilities αi(m) can be calculated using the following recursive approach:
1. The Initialization Step: First set
α1(m) = πm Pr(y1 | x1 = m), ∀m = 1, …, M |
(7.151) |
2. The Induction Step:
αi+1(k) = [Σ_{m=1}^{M} αi(m) Pr(xi+1 = k | xi = m)] Pr(yi+1 | xi+1 = k), ∀k = 1, …, M and i = 1, …, N − 1. |
(7.152) |
3. The Termination Step:
Pr(y1, y2, …, yN | A, B, Π) = Σ_{m=1}^{M} αN(m) |
(7.153) |
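The three steps above can be sketched compactly in Python. The HMM below is hypothetical (2 states, 2 output symbols, made-up probabilities); a brute-force sum over all state sequences serves as a cross-check of the recursion.

```python
from itertools import product

# Hypothetical 2-state, 2-symbol HMM (all numbers are illustrative)
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = Pr(x_{n+1} = j | x_n = i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][k] = Pr(y_n = k | x_n = i)
Pi = [0.5, 0.5]
obs = [0, 1, 0]
M = len(Pi)

# Initialization: alpha_1(m) = Pi_m * Pr(y_1 | x_1 = m)
alpha = [Pi[m] * B[m][obs[0]] for m in range(M)]
# Induction over the remaining observations
for y in obs[1:]:
    alpha = [sum(alpha[m] * A[m][k] for m in range(M)) * B[k][y]
             for k in range(M)]
# Termination: Pr(y_1 .. y_N | A, B, Pi)
forward_prob = sum(alpha)

# Brute-force enumeration over all M^N state sequences
brute = 0.0
for s in product(range(M), repeat=len(obs)):
    p = Pi[s[0]] * B[s[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= A[s[k - 1]][s[k]] * B[s[k]][obs[k]]
    brute += p

print(forward_prob, brute)   # the two values agree
```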
The Backward algorithm proceeds in a similar manner to the Forward algorithm. Define βi(m) as the probability of the partial sequence yi+1, yi+2, …, yN given that the state of the Markov chain at time i is xi = m. Now βi(m) is calculated using the following recursive approach.
1. The Initialization Step: First set βN(m) = 1, ∀1 ≤ m ≤ M
2. The Induction Step:
βi(k) = Σ_{m=1}^{M} Pr(xi+1 = m | xi = k) Pr(yi+1 | xi+1 = m) βi+1(m), ∀k = 1, …, M and i = N − 1, N − 2, …, 1. |
(7.154) |
3. The Termination Step:
Pr(y1, y2, …, yN | A, B, Π) = Σ_{m=1}^{M} πm Pr(y1 | x1 = m) β1(m) |
(7.155) |
Solution to the State Estimation Problem
Unlike the enumeration problem, the optimal solution to the state estimation problem depends on the definition of optimality. One possible optimality measure is to compute the state at each step that maximizes the probability of the outcome at that step. Clearly this approach could result in state sequences that may be infeasible depending on the state transition probabilities. An alternate definition of optimality that is commonly used is to find the state sequence that has highest probability. The well-known Viterbi algorithm provides a computationally efficient method to compute this optimal state sequence. In Chapter 13, the Viterbi algorithm is also shown to be a special case of a generalized message passing algorithm on graphs.
Similar to the forward algorithm, in this case, we define the probabilities δi(m) of the partial state sequence and observations up to time i and having state xi = m, as
δi(m) = max_{x1,x2,…,xi−1} Pr(x1, x2, …, xi−1, xi = m, y1, y2, …, yi) |
(7.156) |
Using the principle of induction, it can be easily shown that these probabilities δi(m) can be computed as
δi+1(m) = [max_k δi(k) Pr(xi+1 = m | xi = k)] Pr(yi+1 | xi+1 = m) |
(7.157) |
It turns out that to find the optimal state sequence, we also need to keep track of the maximizing argument at each step in (7.157). This maximizing argument is thus stored in the ψi(m) variables as outlined in the formal Viterbi procedure:
1. The initialization step
δ1(m) = πm Pr(y1 | x1 = m), ψ1(m) = 0, ∀m = 1, …, M |
(7.158) |
2. The recursive step
δi+1(m) = max_{1≤k≤M} [δi(k) Pr(xi+1 = m | xi = k)] Pr(yi+1 | xi+1 = m) |
(7.159) |
ψi+1(m) = arg max_{1≤k≤M} δi(k) Pr(xi+1 = m | xi = k) |
(7.160) |
3. The termination step
P* = max_{1≤k≤M} δN(k) |
(7.161) |
x̂N = arg max_{1≤k≤M} δN(k) |
(7.162) |
4. Backtracking step
x̂i = ψi+1(x̂i+1), i = N − 1, N − 2, …, 1 |
(7.163) |
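The full procedure can be sketched in a few lines of Python on a hypothetical 2-state, 2-symbol HMM (all probabilities are made-up numbers); a brute-force search over all state sequences confirms that the recursion finds the most likely path.

```python
from itertools import product

# Hypothetical HMM for illustration
A = [[0.7, 0.3], [0.4, 0.6]]   # state transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][k] = Pr(y = k | x = i)
Pi = [0.5, 0.5]
obs = [0, 1, 1]
M = len(Pi)

# Initialization
delta = [Pi[m] * B[m][obs[0]] for m in range(M)]
psi = []
# Recursion: keep the best predecessor of each state
for y in obs[1:]:
    new_delta, back = [], []
    for m in range(M):
        k_best = max(range(M), key=lambda k: delta[k] * A[k][m])
        back.append(k_best)
        new_delta.append(delta[k_best] * A[k_best][m] * B[m][y])
    delta, psi = new_delta, psi + [back]

# Termination and backtracking
state = max(range(M), key=lambda k: delta[k])
path = [state]
for back in reversed(psi):
    state = back[state]
    path.append(state)
path.reverse()

# Brute-force check: probability of the single best state sequence
brute = 0.0
for s in product(range(M), repeat=len(obs)):
    p = Pi[s[0]] * B[s[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= A[s[k - 1]][s[k]] * B[s[k]][obs[k]]
    brute = max(brute, p)

print(path, max(delta), brute)
```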
Solution to the Model Parameter Estimation Problem
This problem is the most challenging of all problems in HMMs, and there is no analytical solution for finding the optimal model parameters given a finite observation sequence. However, the model parameters can be selected as a local optimum that maximizes the probability of the observation sequence for that model, using the iterative Baum-Welch algorithm.
Define ϵi(m, n) as the conditional probability of being in state xi = m and xi+1 = n given the observation. This conditional probability can be expressed in terms of the αi(m) and βi(n) variables as,
ϵi(m, n) = αi(m) Pr(xi+1 = n | xi = m) Pr(yi+1 | xi+1 = n) βi+1(n) / [Σ_{k=1}^{M} Σ_{l=1}^{M} αi(k) Pr(xi+1 = l | xi = k) Pr(yi+1 | xi+1 = l) βi+1(l)] |
(7.164) |
The iterative procedure is as follows:
1. Begin with an initial estimate for the parameters of the HMM.
2. Calculate a new estimate for the parameters as
π̂i = Pr(x1 = i) = Σ_{j=1}^{M} ϵ1(i, j) |
(7.165) |
âij = Pr(xn+1 = j | xn = i) = Σ_{n=1}^{N−1} ϵn(i, j) / Σ_{k=1}^{M} Σ_{n=1}^{N−1} ϵn(i, k) |
(7.166) |
b̂ik = [Σ_{n: yn = k} Σ_{j=1}^{M} ϵn(i, j)] / [Σ_{n=1}^{N} Σ_{j=1}^{M} ϵn(i, j)] |
(7.167) |
3. Using the new calculated parameters recalculate ϵi using (7.164) and then repeat step 2 above, i.e., evaluate (7.165)–(7.167).
This iterative process is repeated until convergence; as noted before, the limit is a local optimum rather than the global optimum. It should also be noted that this algorithm can be interpreted as being similar to the Expectation-Maximization (EM) algorithm [11, 12].
7.6 Summary and Further Reading
In this chapter, we provided a brief introduction to probability theory and provided several simple examples to understand the main results. These results form the basis of several chapters in this text. For instance, Markov models and Poisson processes are frequently used in Chapter 17 on queueing theory. Hidden Markov models are considered as a special case of a generalized factor graph-based formulation in Chapter 13. The ideas on conditional probabilities and Bayes result are extensively used in Detection (Chapter 11) and Estimation Theory (Chapter 10). As noted before, several excellent textbooks and papers discuss various aspects of this chapter in great detail [1, 2, 3, 4, 5].
Acknowledgment: This work is supported in part by NSF through grant CCF-0546519.
Exercise 7.7.1. A box contains 100 identical balls numbered from 1 through 100. Let there be 10 balls picked from the box without replacement. What is the probability that the largest numbered ball selected equals 20? What is the probability that the smallest numbered ball selected equals 20?
Exercise 7.7.2. A biased coin has a probability of head equal to p. This biased coin is flipped 5 times. The outcome of each flip is independent of the outcome of the other flips. Let discrete random variable X denote the number of heads in the first 3 flips and discrete random variable Y denote the number of tails in the last 3 flips. Find the joint PMF PX,Y(x,y) . Are X and Y independent?
Exercise 7.7.3. There are two different types of radioactive elements available to a scientist. Type 1 emits radioactive particles with a Poisson distribution and mean of α. Type 2 emits radioactive particles with a Poisson distribution and a mean of β. Let α > β. The scientist does not know a priori which of the two materials he is given, i.e., both materials have a probability of 0.5 of being selected. The scientist can only observe the number of emissions from the source in a given period of time. How should the scientist decide whether the material is type 1 or type 2? In other words, what is the decision rule? What is the probability of making an error using this decision rule?
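With equal priors, the maximum-likelihood rule applies, and comparing the two Poisson likelihoods reduces to a threshold on the observed count. The sketch below works this out numerically; the function names and the truncation point nmax are illustrative choices, not the text's.

```python
from math import exp, log

def poisson_pmf(lam, nmax):
    """P(N = n) for n = 0, ..., nmax - 1, built up iteratively for stability."""
    pmf = [exp(-lam)]
    for n in range(1, nmax):
        pmf.append(pmf[-1] * lam / n)
    return pmf

def decide(n, alpha, beta):
    """ML rule (equal priors): choose type 1 iff the Poisson(alpha)
    likelihood of n is at least the Poisson(beta) likelihood, which
    reduces to n >= (alpha - beta) / log(alpha / beta)."""
    return 1 if n >= (alpha - beta) / log(alpha / beta) else 2

def error_prob(alpha, beta, nmax=100):
    """P(error) = 0.5 P(n >= thr | type 2) + 0.5 P(n < thr | type 1),
    truncating the Poisson tails at nmax."""
    pa, pb = poisson_pmf(alpha, nmax), poisson_pmf(beta, nmax)
    thr = (alpha - beta) / log(alpha / beta)
    return (sum(0.5 * pb[n] for n in range(nmax) if n >= thr)
            + sum(0.5 * pa[n] for n in range(nmax) if n < thr))
```

For example, with α = 10 and β = 2 the threshold is (10 − 2)/ln 5 ≈ 4.97, so five or more counts favor type 1.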
Exercise 7.7.4. We have a set of M coins, numbered 1, 2,…, M. The probability of getting a head when coin k is tossed equals 1/k. We select one of the coins at random and toss it N times. The sequence of N observations is then given to you. What is your decision rule to determine which of the coins was selected based on your observations?
Exercise 7.7.5. Let X(t) be a random process defined by $X(t) = \sum_{i=1}^{2} N_i \sin(2\pi f_0 t + \theta_i)$, where f0 is a known frequency and all Ni and θi are independent random variables. The characteristic function for Ni is $\Phi_{N_i}(\omega) = e^{\lambda_i[e^{j\omega} - 1]}$, where λi is a positive constant. Random variable θi is uniformly distributed on [−π, +π]. Find the mean and covariance of the process X(t). Is the process X(t) wide sense stationary?
Exercise 7.7.6. Consider the motion of a Queen piece on a 3 × 3 chessboard, in which the piece moves at every integer time t > 0. For simplicity assume that the squares are numbered sequentially from 1 through 9, with the top left square being 1 and the bottom right square being 9. Let Xt denote which square the queen occupies at time t. Assume Xt, t > 0 is a discrete time Markov process, such that given Xt, the next state Xt + 1 is equally likely to be any one of the legal squares to which the queen may move. (a) Sketch the one step transition probability diagram for Xt. (b) Compute the steady-state probabilities.
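The steady-state probabilities of such a chain can be verified numerically. The sketch below numbers the squares 0–8 (rather than 1–9) for coding convenience, builds the transition matrix, and power-iterates; for a random walk on a graph the stationary probability of a square is proportional to its number of legal moves.

```python
import numpy as np

# Squares numbered 0..8, laid out row-major on the 3x3 board.
def queen_moves(s):
    """All squares reachable by a queen from square s."""
    r, c = divmod(s, 3)
    moves = []
    for dr, dc in [(-1,0),(1,0),(0,-1),(0,1),(-1,-1),(-1,1),(1,-1),(1,1)]:
        nr, nc = r + dr, c + dc
        while 0 <= nr < 3 and 0 <= nc < 3:
            moves.append(nr * 3 + nc)
            nr, nc = nr + dr, nc + dc
    return moves

P = np.zeros((9, 9))
for s in range(9):
    for t in queen_moves(s):
        P[s, t] = 1 / len(queen_moves(s))

# Power iteration; the chain is irreducible and aperiodic, so this converges.
pi = np.full(9, 1 / 9)
for _ in range(200):
    pi = pi @ P
```

On the 3 × 3 board the center square has 8 legal moves and every other square has 6, so the stationary probabilities are 8/56 for the center and 6/56 for the rest.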
Exercise 7.7.7. Let X and Y be continuous random variables with a joint PDF given by
(7.168)
Let A = min{X,Y} and B = max{X,Y}. Compute the joint PDF fA,B(a,b) and the marginal PDFs fA(a) and fB(b).
Exercise 7.7.8. The value of a particular stock equals $100 on the first of January of a particular year. Every day the price of the stock increases by 2% with probability 0.1 and decreases by 1% with probability 0.2. With a probability of 0.7 the value of the stock remains the same for that day. Let random variable X represent the value of the stock after 1 year of trading. Compute the probability that the price of the stock exceeds $105 after 1 year.
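The daily price evolution described here is simple to simulate, and a Monte Carlo estimate gives a quick sanity check on any analytical answer. The sketch below assumes "1 year of trading" means 365 days; that interpretation, and the function names, are illustrative.

```python
import random

def stock_after(days=365, s0=100.0, rng=random):
    """Evolve the price: +2% w.p. 0.1, -1% w.p. 0.2, unchanged w.p. 0.7."""
    s = s0
    for _ in range(days):
        u = rng.random()
        if u < 0.1:
            s *= 1.02
        elif u < 0.3:
            s *= 0.99
    return s

def p_exceeds(level=105.0, trials=20000, seed=1):
    """Monte Carlo estimate of P(price after one year > level)."""
    rng = random.Random(seed)
    hits = sum(stock_after(rng=rng) > level for _ in range(trials))
    return hits / trials
```

Since the daily log-return has nearly zero mean, the estimate should land somewhat below one half; the exercise can also be attacked with the Central Limit Theorem applied to the sum of daily log-returns.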
Exercise 7.7.9. Let continuous random variables Xi , i ≥ 1 be a sequence of i.i.d. random variables with uniform distribution over the interval (0,2). Let discrete random variable N be independent of Xi and have PMF
(7.169)
Let random variable $M = \sum_{i=1}^{N} X_i$. Find the PDF of M.
Exercise 7.7.10. Let Y(t) = X(t) − a1X(t − d1) − a2X(t − d2) where ai and di are real constants. Find the autocorrelation and spectral density of Y(t) .
Exercise 7.7.11. Consider a random walk over the set {0, 1, 2, …, M} where the probability of transitioning from state i to state i + 1 equals q and the probability of transitioning from state i to state i − 1 equals 1 − q. From state 0 the probability of transitioning to state 1 equals 1. The probability of transitioning from state M to state M equals q. Find the long-term probability of being in each state i.
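This is a birth-death chain, so its stationary distribution follows from detailed balance: πi q = πi+1(1 − q) in the interior, with π0 · 1 = π1(1 − q) at the reflecting boundary. The sketch below computes the distribution from these relations and verifies it against the transition matrix; the function names are illustrative.

```python
import numpy as np

def stationary(M, q):
    """Stationary distribution on {0,...,M} by detailed balance:
    pi_0 * 1 = pi_1 * (1 - q), and pi_i * q = pi_{i+1} * (1 - q) for i >= 1."""
    rho = q / (1 - q)
    pi = np.empty(M + 1)
    pi[0] = 1.0
    pi[1] = 1.0 / (1 - q)
    for i in range(2, M + 1):
        pi[i] = pi[i - 1] * rho
    return pi / pi.sum()

def transition_matrix(M, q):
    """Transition matrix of the walk as stated in the exercise."""
    P = np.zeros((M + 1, M + 1))
    P[0, 1] = 1.0               # from 0, always move up
    for i in range(1, M):
        P[i, i + 1] = q
        P[i, i - 1] = 1 - q
    P[M, M] = q                 # from M, stay with probability q
    P[M, M - 1] = 1 - q
    return P
```

Verifying that pi @ P equals pi confirms the detailed-balance solution, including at both boundary states.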
Exercise 7.7.12. Compute the autocorrelation of a first order MA process of the form
(7.170)
where {Xn } are i.i.d. with mean μ and variance σ2.
Exercise 7.7.13. Consider a packet retransmission system in which the same packet is repeatedly sent by the transmitter until the receiver gets N consecutive correct versions of the packet. Assume that each transmission of the packet has a probability of error of 1 − q and is independent of the other transmissions. What is the expected number of packet transmissions?
Exercise 7.7.14. In source coding, a prefix code or an instantaneous code is one in which no codeword is a prefix of any other codeword. Consider the following prefix code: Symbols {a, b, c, d} are mapped to codewords 0, 10, 110, 111, respectively. Consider an i.i.d. sequence of symbols with probabilities of the symbols as Pr(a) = 0.5, Pr(b) = 0.3, Pr(c) = Pr(d) = 0.1. Find the average length of the binary representation of the code. Let Bn represent the sequence of binary symbols that result from the encoding. Find the PMF of Bn.
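The average codeword length is a direct expectation, and the statistics of the encoded bit stream can be checked empirically by encoding a long simulated symbol sequence, as sketched below (the simulation size and variable names are illustrative).

```python
import random

code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
p = {'a': 0.5, 'b': 0.3, 'c': 0.1, 'd': 0.1}

# Average codeword length: E[len] = sum_s p(s) * len(code(s)).
avg_len = sum(p[s] * len(code[s]) for s in code)

# Empirical check of the bit-stream statistics by simulation.
rng = random.Random(0)
symbols = rng.choices(list(p), weights=list(p.values()), k=50000)
bits = ''.join(code[s] for s in symbols)
frac_ones = bits.count('1') / len(bits)
```

Here avg_len evaluates to 0.5·1 + 0.3·2 + 0.1·3 + 0.1·3 = 1.7 bits per symbol, and the empirical fraction of ones should approach (0.3·1 + 0.1·2 + 0.1·3)/1.7.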
Exercise 7.7.15. In error control coding, one metric used to measure the performance of a block code is the number of bit errors that it can correct. For instance, in a 2-error correcting code, any bit error combination up to 2 errors can be corrected by the decoder at the receiver. Consider a 5-error correcting code with a resulting codeword of length 100. This length 100 code is transmitted over a binary symmetric channel with crossover probability p = 0.05. Compute the exact probability that all the errors made by the channel can be corrected by the code. Now, estimate this probability using the Central Limit Theorem.
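The two quantities asked for here can be compared directly in code: the exact probability is a binomial tail, and the CLT estimate replaces the binomial with a Normal of matched mean and variance. The continuity correction used below is a standard refinement, not something the exercise mandates.

```python
from math import comb, erf, sqrt

n, p, t = 100, 0.05, 5

# Exact: decoding succeeds iff at most t = 5 of the n = 100 bits are flipped.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

# CLT estimate: number of errors ~ approx Normal(np, np(1-p)),
# with a continuity correction at t + 0.5.
mu, sigma = n * p, sqrt(n * p * (1 - p))
def Phi(x):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))
clt = Phi((t + 0.5 - mu) / sigma)
```

Since np = 5 errors are expected on average, both values come out close to (and slightly above) one half, and the Normal approximation tracks the exact tail to within a few percent.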
[1] W. Feller, An Introduction to Probability Theory and Its Applications. Wiley, New York, 1968.
[2] S. Kay, Intuitive Probability and Random Processes Using MATLAB, 8th ed. Springer, New York, 2005.
[3] A. Leon-Garcia, Probability, Statistics and Random Processes for Electrical Engineering, 3rd ed. Prentice Hall, Upper Saddle River, NJ, 2008.
[4] S. Ross, A First Course in Probability. Prentice Hall, Upper Saddle River, NJ, 2009.
[5] R. Yates and D. J. Goodman, Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers, 2nd ed. John Wiley and Sons, Hoboken, NJ, 2004.
[6] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Application, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2004.
[7] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, March 1963.
[8] A. Papoulis and U. S. Pillai, Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York, NY, 2002.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley, New York, 2006.
[10] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, February 1989.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
[12] M. Gupta and Y. Chen, “Theory and use of EM algorithm,” Foundations and Trends in Signal Processing, vol. 4, pp. 223–296, 2010.