10 Fundamentals of Estimation Theory

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Yik-Chung Wu^‡

^‡ University of Hong Kong

Parameter estimation is prevalent in communications and signal processing applications, e.g., in channel estimation, synchronization, parametric spectral estimation, direction-of-arrival estimation, etc. Estimation theory is extensively used in modern research in various fields related to signal processing. This chapter presents an overview of basic parameter estimation techniques and discusses the relationships among them. Applications in problems involving recent research are provided as examples.

10.1 Introduction

In an estimation problem, we have some observations which depend on the parameters we want to estimate. For example, the observations can be written as

x = f(θ) + w, $x = f (θ) + w,$

(10.1)

where x stands for the observations, θ is the parameter we want to estimate, f is a known function, and w denotes the random noise corrupting the observations. All the above quantities can be in vector form. In this chapter, it is assumed that the dimension of x is N × 1 while that of θ is p × 1 with N > p. In general, x can depend on θ in a linear or nonlinear way. It is even possible that some of the elements of θ are related to x in a linear way, while others are related to x in a nonlinear way. The task of parameter estimation is to construct a function (called estimator) such that θˆ: = g(x) $\hat{θ} : = g (x)$ is a good estimate of θ. The first question we encounter is how to define a “good” estimate. One natural criterion is that the average squared difference between θˆi ${\hat{θ}}_{i}$ and θ_i should be the smallest, where θ_i is the i^th element of θ. Such squared difference is referred to as the MSE:

MSE(θˆi) = E{(θˆi − θi)2}, $MSE ({\hat{θ}}_{i}) = E {{({\hat{θ}}_{i} - θ_{i})}^{2}},$

where expectation is taken with respect to the random noise w. If we rewrite the above expression as

MSE(θˆi) = E{[(θˆi − E{θˆi}) + (E{θˆi} − θi)]2} = [θˆi − E{θˆi}]variance2 + [E{θˆi} − θˆi](bias)22, $\begin{array}{l} MSE ({\hat{θ}}_{i}) = E {{[({\hat{θ}}_{i} - E {{\hat{θ}}_{i}}) + (E {{\hat{θ}}_{i}} - θ_{i})]}^{2}} \\ = {\underset{variance}{\underset{︸}{[{\hat{θ}}_{i} - E {{\hat{θ}}_{i}}]}}}^{2} + {\underset{{(bias)}^{2}}{\underset{︸}{[E {{\hat{θ}}_{i}} - {\hat{θ}}_{i}]}}}^{2}, \end{array}$

(10.2)

it can be seen that the mean-square error (MSE) can be decomposed into two terms. The first term, variance, measures the variation of the estimate θˆi ${\hat{θ}}_{i}$ from its average. The second term, squared bias, represents the difference between the average estimate and the true value θ_i. Unfortunately, the bias term in general depends on the true value θ_i. Therefore, in the classical deterministic approach, where θ_i is deterministic but unknown, the estimator minimizing the MSE is in general not realizable. In view of this, one “good” estimator aims to minimize the variance of estimate while constraining the average estimate equal to the true value (i.e., E{θˆi} = θi $E {{\hat{θ}}_{i}} = θ_{i}$ ). Such estimator is referred to as the minimum variance unbiased estimator (MVUE).

Example 10.1.1. [1] Consider a very simple example with observations

x[n] = A + ω[n] n = 0,1…,N − 1 $x [n] = A + ω [n] n = 0, 1 \dots, N - 1$

where x[n] represent the observations, A is the parameter we want to estimate, and w[n] denote the independent and identically distributed (i.i.d.) observation noise with variance σ². A reasonable estimator for A is the sample mean estimator:

Aˆ = 1N∑n = 0N − 1x[n]. $\hat{A} = \frac{1}{N} \sum_{n = 0}^{N - 1} x [n] .$

If we compute the mean of the estimate, we have

E{Aˆ} = 1N∑n = 0N − 1E{x[n]} = 1N∑n = 0N − 1A = A. $E {\hat{A}} = \frac{1}{N} \sum_{n = 0}^{N - 1} E {x [n]} = \frac{1}{N} \sum_{n = 0}^{N - 1} A = A .$

Therefore, the sample mean estimator is unbiased. Now, consider another estimator

A⌣ = aN∑n = 0N − 1x[n], $\overset{⌣}{A} = \frac{a}{N} \sum_{n = 0}^{N - 1} x [n],$

where a is an adjustable constant such that the MSE is minimized. It can be easily shown that E {Ǎ} = aA. Furthermore, the variance of Ǎ can be computed as

$\begin{array}{l} ver (Ă) = a^{2} E {{[\frac{1}{N} \sum_{n = 0}^{N - 1} x [n] - A]}^{2}} \\ = a^{2} E {\frac{1}{N^{2}} \sum_{n = 0}^{N - 1} \sum_{m = 0}^{N - 1} (A + ω [n]) (A + ω [m]) - \frac{2 A}{N} \sum_{n = 0}^{N - 1} (A + ω [n] + A^{2})} \\ = a^{2} [A^{2} + \frac{1}{N} σ^{2} - 2 A^{2} + A^{2}] \\ = \frac{a^{2} σ^{2}}{N} . \end{array}$

Putting the mean and variance expressions into the bias-variance decomposition of MSE expression (10.2), the MSE can be shown to be

$MSE (Ă) = \frac{a^{2} σ^{2}}{N} + {(a - 1)}^{2} A^{2} .$

Differentiating the MSE with respect to a and setting the result to zero yields

$a_{o p t} = \frac{A^{2}}{A^{2} + σ^{2} / N} .$

It is obvious that the optimal a depends on the unknown parameter A, and therefore, the optimal estimator might not be realizable.

10.2 Bound on Minimum Variance – Cramér-Rao Lower Bound

Before we discuss a general method of finding MVUE, we first introduce the Cramér-Rao lower bound (CRLB) in this section. CRLB is a lower bound on the variance of any unbiased estimator. This bound is important, as it provides a benchmark performance for any unbiased estimator. If we have an unbiased estimator that can reach the CRLB, then we know that it is the MVUE, and there is no other unbiased estimator that can perform better (at most another estimator would have equal performance). Even though a given unbiased estimator cannot touch the CRLB, we still have an idea of how far away the performance of this estimator is from the theoretical limit.

10.2.1 Computation of CRLB

The CRLB is derived from the likelihood function, which is the probability density function (PDF) of the observations viewed as a function of the unknown parameter.

Figure 10.1: Sharpness of likelihood function and estimation accuracy.

In (10.1), if the noise w is Gaussian with covariance matrix C, then the likelihood function is

$p (x | θ) = \frac{1}{{(2 π)}^{N / 2} \det^{1 / 2} [C]} \exp [- \frac{1}{2} {(x - f (θ))}^{T} C^{- 1} (x - f (θ))] .$

(10.3)

In likelihood function, x is considered to be fixed, and for different θ, it represents the relative probability that θ being the true parameter (thus the probability that θ generates the given observation x). With this interpretation, one popular method for estimation is the maximum likelihood (ML) estimation approach, which searches for θ that maximizes p(x|θ). We will discuss ML estimation in more detail later in this chapter.

The “sharpness” of the likelihood function contains information about the ultimate accuracy any estimator can achieve. For example, Figure 10.1 shows two likelihood functions for the simple model x = A + w with w being Gaussian with two different variances. It can be seen that the likelihood function with smaller noise variance has a much sharper peak, and the values of A with significant likelihood lie within a small range. Thus, the variance of estimate in this scenario will be smaller. Mathematically, the “sharpness” is quantified by the second order derivative of the likelihood function. This is exactly the idea behind CRLB.

Theorem 10.2.1. (Cramér-Rao Lower Bound) [1] Assume the likelihood function p(x|θ) satisfies the regularity condition

$E {[\frac{\partial \ln p (x | θ)}{\partial θ}]} = 0$

where the expectation is taken with respect to p(x|θ). Then, the covariance matrix of any unbiased estimator $\hat{θ}$ satisfies

$C_{\hat{θ}} \geq I^{- 1} (θ)$

(10.4)

where A ≥ B means A − B is positive semidefinite. Matrix I(θ) is the Fisher information matrix with the (i,j)^th element given by

${[I (θ)]}_{i j} = - E {[\frac{\partial^{2} \ln p (x | θ)}{\partial θ_{i} \partial θ_{j}}]}$

(10.5)

where the expectation is taken with respect to p(x|θ), and θ_i stands for the i^th element of θ.

Equation (10.4) is an inequality on the covariance matrix, which includes as a corollary the lower bound on the variance of individual element-wise estimates: var $(θ_{i}) = {[C_{\hat{θ}}]}_{i i} \geq {[I^{- 1} (θ)]}_{i i}$ . Also notice that the discussion here is about the potential accuracy, and is irrelevant to the specific estimation method being used. Of course, although the potential ultimate accuracy might be high, using a bad estimator might yield poor performance.

Example 10.2.1. [3] In clock synchronization in a wireless sensor node, the observations can be modeled as

$x_{1} [n] = β_{1} t_{1} [n] + β_{0} + β_{1} (d + ω [n]), n = 0, \dots, N - 1,$

(10.6)

$x_{2} [n] = t_{2} [n] / + β_{1} - β_{0} / β_{1} + (d + z [n]), n = 0, \dots, N - 1,$

(10.7)

where ${x_{_{1}} [n], x_{2} [n]}_{n = 0}^{N - 1}$ are the observations, ${t_{_{1}} [n], t_{2} [n]}_{n = 0}^{N - 1}$ are known quantities and can be considered as the training data, β₁, β₀ and d are the unknowns to be estimated, and w[n] and z[n] are i.i.d. observation noise sequences with zero mean and variance σ².

Since the random delays w[n] and z[n] follow i.i.d. Gaussian distributions, the PDF of the observations conditioned on β₁, β₀ and d can be expressed from (10.6)–(10.7) as ln $p ({x_{_{1}} [n], x_{2} [n]}_{n = 0}^{N - 1} | β_{1}, β_{0}, d)$

$\begin{array}{l} \ln p ({x_{1} [n], x_{2} [n]}_{n = 0}^{N - 1} | β_{1}, β_{0}, d) \\ = \ln \frac{N}{1 π σ^{2}} - \frac{1}{2 σ^{2}} \times \sum_{n = 0}^{N - 1} [{(\frac{x_{1} [n]}{β_{1}} - t_{1} [n] \frac{β_{0}}{β_{1}} - d)}^{2} + {(x_{2} [n] - \frac{t_{2} [n]}{β_{1}} + \frac{β_{0}}{β_{1}} - d)}^{2}] . \end{array}$

(10.8)

To derive the CRLB, we first check the regularity condition. Taking β₀ as an example, it can be shown that

$\frac{\partial \ln p}{\partial β_{0}} = - \frac{1}{σ^{2}} \sum_{n = 0}^{N - 1} [- \frac{1}{β_{1}} (\frac{x_{1} [n] - β_{0}}{β_{1}} - t_{1} [n] - d) + \frac{1}{β_{1}} (x_{2} [n] - d - \frac{t_{2} [n] - β_{0}}{β_{1}})] .$

(10.9)

Putting (10.6) and (10.7) into (10.9), it yields:

$\frac{\partial \ln p}{\partial β_{0}} = - \frac{1}{σ^{2}} \sum_{i = 0}^{N - 1} \frac{z [n] - ω [n]}{β_{1}} .$

(10.10)

Since w[n] and z[n] have zero means, it is easy to see that the expectation of (10.10) is zero (notice that taking the expectation with respect to x₁[n], x₂[n] is equivalent to taking expectation with respect to w[n], z[n]) and the PDF in (10.8) satisfies the regularity condition for β₀. The regularity conditions for β₁ and d can be checked in a similar way.

From (10.5), the Fisher information matrix is given by

$I (β_{1}, β_{0}, d) = [\begin{array}{l} - E {\frac{\partial^{2} \ln p}{\partial β_{1}^{2}}} - E {\frac{\partial^{2} \ln p}{\partial β_{1} \partial β_{0}}} - E {\frac{\partial^{2} \ln p}{\partial β_{1} \partial d}} \\ - E {\frac{\partial^{2} \ln p}{\partial β_{0}^{2}}} - E {\frac{\partial^{2} \ln p}{\partial β_{0} \partial β_{1}}} - E {\frac{\partial^{2} \ln p}{\partial β_{0} \partial d}} \\ - E {\frac{\partial^{2} \ln p}{\partial d^{2}}} - E {\frac{\partial^{2} \ln p}{\partial d \partial β_{1}}} - E {\frac{\partial^{2} \ln p}{\partial d \partial β_{0}}} \end{array}],$

and each element in the matrix can be obtained separately. For example for $- E {\frac{\partial \ln p}{\partial β_{1} \partial β_{0}}}$ , based on (10.9), we obtain

$\frac{\partial^{2} \ln p}{\partial β_{1} \partial β_{0}} = - \frac{1}{σ^{2}} \sum_{n = 0}^{N - 1} [\frac{2 (x_{1} [n] - 2 β_{0})}{β_{1}^{3}} - \frac{t_{1} [n] + x_{2} [n]}{β_{1}^{2}}] .$

(10.11)

Putting (10.6) and (10.7) into (10.11), we have

$\frac{\partial^{2} \ln p}{\partial β_{1} \partial β_{0}} = - \frac{1}{σ^{2}} \sum_{n = 0}^{N - 1} (\frac{t_{1} [n] + d + 2 w [n] - z [n]}{β_{1}^{2}} + \frac{t_{2} [n] - β_{0}}{β_{1}^{3}})$

(10.12)

Taking expectation of (10.12) with respect to random delays w[n] and z[n] gives

$- E {\frac{\partial^{2} \ln p}{\partial β_{1} \partial β_{0}}} = \frac{1}{σ^{2} β_{1}^{3}} \sum_{n = 0}^{N - 1} [β_{1} (t_{1} [n] + d) + (t_{2} [n] - β_{0})],$

where we have used the fact that w[n] and z[n] have zero means. Similarly, the other elements of I(β₁,β₀,d) can be obtained and the Fisher information matrix can be expressed as

$I (β_{1}, β_{0}, d) = \frac{1}{σ^{2}} [\begin{matrix} A & ℬ & C \\ ℬ & \frac{2 N}{β_{1}^{2}} & 0 \\ C & 0 & 2 N \end{matrix}],$

(10.13)

where

$A : = β_{1}^{- 4} \sum_{n = 0}^{N - 1} [β_{1}^{2} {(t_{1} [n] + d)}^{2} + β_{1}^{2} σ^{2} + {(t_{2} [n] - β_{0})}^{2}]$ ,

$ℬ : = β_{1}^{- 3} \sum_{n = 0}^{N - 1} [β_{1} (t_{1} [n] + d) + (t_{2} [n] - β_{0})]$ , and

$C : = β_{1}^{- 2} \sum_{n = 0}^{N - 1} [β_{1} (t_{1} [n] + d) - (t_{2} [n] - β_{0})]$ . By inverting the matrix (10.13), it can be shown that the CRLB for each parameter (i.e., the diagonal elements of I⁻¹(β₁,β₀, d)) is given, respectively, by:

$\begin{array}{l} CRLB (β_{1}) = \frac{1 N σ^{2}}{2 N A - β_{1}^{2} ℬ^{2} - C^{2}}, \\ CRLB (β_{0}) = \frac{σ^{2} β_{1}^{2} (2 N A - C^{2})}{2 N (2 N A - β_{1}^{2} ℬ^{2} - C^{2})}, \\ CRLB (d) = \frac{σ^{2} (2 N A - β_{1}^{2} ℬ^{2})}{2 N (2 N A - β_{1}^{2} ℬ^{2} - C^{2})} . \end{array}$

As can be seen from this example, CRLB in general depends on the true parameters.

10.2.2 Finding MVUE Attaining the CRLB

With the lower bound on minimum variance, one may wonder if we can construct an unbiased estimator that can reach the CRLB. The answer is yes. Such an estimator is called efficient and might be obtained during the evaluation of the CRLB.

Theorem 10.2.2. [1] [2] An estimator $\hat{θ} = g (x)$ is the MVUE if and only if

$\frac{\partial \ln p (x | θ)}{\partial θ} = I (θ) (g (x) - θ)$

(10.14)

for some matrix I(θ). Furthermore, the covariance matrix of the MVUE is I(θ)⁻¹, which is also the CRLB.

Example 10.2.2. [1] Let us consider a linear model

$x = H θ + s + w$

where x is the observation vector, H represents a known matrix, θ denotes the parameter vector to be estimated, s is a known vector, and w stands for a noise vector with PDF $N$ (0, C). Subtracting s on both sides, and applying the whitening matrix D with D^TD = C⁻¹, we have

$\begin{matrix} D (x - s) = D H θ + Dw \\ x^{'} = H^{'} θ + w^{'} . \end{matrix}$

It can be easily shown that w′ is Gaussian with zero mean and the covariance matrix is an identity matrix. Taking the first order derivative of logarithm of the likelihood function in the form of (10.3), it follows that:

$\frac{\partial \ln p (x^{'} | θ)}{\partial θ} = - \frac{1}{2} \frac{\partial}{\partial θ} [{x^{'}}^{T} x^{'} - 2 {x^{'}}^{T} H^{'} θ + θ^{T} {H^{'}}^{T} H θ] .$

Since

$\begin{matrix} \frac{\partial {x^{'}}^{T} H^{'} θ}{\partial θ} = {H^{'}}^{T} x^{'}, \\ \frac{\partial θ^{T} H^{'} θ}{\partial θ} = 2 {H^{'}}^{T} H^{'} θ, \end{matrix}$

we have

$\begin{array}{l} \frac{\partial \ln p (x^{'} | θ)}{\partial θ} = {H^{'}}^{T} x^{'} - {H^{'}}^{T} H^{'} θ \\ = {H^{'}}^{T} H^{'} [{({H^{'}}^{T} H^{'})}^{- 1} {H^{'}}^{T} x^{'} - θ], \end{array}$

(10.15)

where in the last step, it is assumed that H^′^TH′ is invertible. Comparing the above equation with (10.14), it is obvious that we obtain

$\hat{θ} = {({H^{'}}^{T} H^{'})}^{- 1} {H^{'}}^{T} x^{'} = {(H^{T} C^{- 1} H)}^{- 1} H^{T} C^{- 1} (x - s),$

(10.16)

$C_{\hat{θ}} = I {(θ)}^{- 1} = (H C^{- 1} H^{- 1}),$

(10.17)

the MVUE and its corresponding covariance, respectively. Notice that this result does not require the computation of second order derivative in order to obtain the CRLB.

10.3 MVUE Using Rao-Blackwell-Lehmann-Scheffe Theorem

In the previous section, we introduced a method for finding MVUE that can reach the CRLB. However, even if there is no estimator that can reach the CRLB, there may still exist a MVUE (the MVUE in such case is not an efficient estimator). In this section, we introduce another technique that can determine the MVUE from the likelihood function, and this technique requires the concept of sufficient statistics.

10.3.1 Sufficient Statistics

In a general estimation problem, we have access to a bunch of observations or data x[0], x[1], …, x[N − 1]. These observations contain information about the parameter we want to estimate. In general, making use of all available data would achieve the best estimation performance (how to use them is another problem). One would wonder: Is there any way to compress the data while the useful information for parameter estimation is not lost? If so, we can compress the original data set into sufficient statistics and use them for parameter estimation instead. Of course, the original data set is a sufficient statistic itself. Furthermore, for given original observations, there are many sufficient statistics. The one with the minimum number of elements is called the minimal sufficient statistic.

Formally, the concept of sufficient statistic is defined as follows. Let x be a vector of observations, and θ be the parameter of interest, with the likelihood function given by p(x|θ). A statistic T(x) is sufficient for θ if p(x|T(x), θ) is independent of θ. Intuitively, if T(x) is observed, and if p(x|T(x), θ) still depends on θ, there must be some information contain in x but not in T(x), and therefore T(x) cannot be a sufficient statistic.

Example 10.3.1. [2] Consider a sequence of independent Bernoulli trial x[0], x[1], …, x[N − 1] ∊ {0,1}. The probability for x[n] = 1 is θ. Then the likelihood function for estimation of θ is

$p (x | θ) = \prod_{n = 0}^{N - 1} θ^{x [n]} {(1 - θ)}^{1 - x [n]} = θ^{k} {(1 - θ)}^{N - k}$

(10.18)

where $k : \sum_{n = 0}^{N - 1} x [n]$ . Now consider

$p (x | k, θ) = \frac{p (x, k | θ)}{p (k | θ)} .$

Since κ completely depends on x, p(x, κ|θ) is zero except when κ equals the number of ones in x. When it does, p(x, κ|θ) = p(x|θ). On the other hand, p(κ|θ) is the distribution of number of ones in N independent Bernoulli trials, and it is given by

$p (k | θ) = (\begin{matrix} N \\ k \end{matrix}) θ^{k} {(1 - θ)}^{N - k} .$

Therefore,

$p (x | k, θ) = \frac{θ^{k} {(1 - θ)}^{N - k}}{(\begin{matrix} N \\ k \end{matrix}) θ^{k} {(1 - θ)}^{N - k}} = \frac{1}{(\begin{matrix} N \\ k \end{matrix})} .$

It is obvious that p(x|κ, θ) is independent of θ, and $k = \sum_{n = 0}^{N - 1} x [n]$ is a sufficient statistic. That is, for estimation of θ, we only need to store the number of ones in the observations rather than the original sequence of ones and zeros.

Finding sufficient statistic using the basic definition involves guessing the sufficient statistic and verifying the conditional PDF being independent of the parameter of interest. Both steps can be challenging. Fortunately, we have the following theorem to help us in practice.

Theorem 10.3.1. (Neyman-Fisher Factorization) [1] [2] Let x be a vector of observations with the likelihood function p(x|θ). The statistic T(x) is sufficient if and only if the likelihood function can be factorized as follows

$p (x | θ) = g (T (x), θ) h (x)$

where g is a function depending on x only through T(x) and h is a function depending on x only.

Now, let us revisit the previous example of Bernoulli trials. The likelihood function in (10.18) can be expressed as

$p (x | θ) = \underset{g (T (x) = k, θ)}{\underset{︸}{θ^{k} {(1 - θ)}^{N - k}}} . \underset{h (x)}{\underset{︸}{1}} .$

The first term g depends on x through κ only, while h is independent of θ. Therefore, from the Neyman-Fisher factorization Theorem, κ is a sufficient statistic. Obviously, identifying the sufficient statistic using this theorem is much simpler. Below is another example.

Example 10.3.2. [1] For the mean estimation from observations x[n] = A + w[n] for n = 0,1,…, N − 1, with w[n] being i.i.d. zero mean Gaussian distributed with variance σ², the likelihood function is given by

$p (x | A) = \frac{1}{{(2 π σ^{2})}^{N / 2}} \exp [- \frac{1}{2 σ^{2}} \sum_{n = 0}^{N - 1} {(x [n] - A)}^{2}]$

(10.19)

$= \underset{g (T (x) = \sum_{n = 0}^{N - 1} x [n], A)}{\underset{︸}{\frac{1}{{(2 π σ^{2})}^{N / 2}} \exp [- \frac{1}{2 σ^{2}} (N A^{2} - 2 A \sum_{n = 0}^{N - 1} x [n])]}} \underset{h (x)}{\underset{︸}{\exp [- \frac{1}{2 σ^{2}} \sum_{n = 0}^{N - 1} x^{2} [n]]}} .$

(10.20)

Now, g depends on x through $\sum_{n = 0}^{N - 1} x [n]$ only, while h is a function of x only and is independent of A. Clearly, $T (x) = \sum_{n = 0}^{N - 1} x [n]$ is a sufficient statistic for estimating A. Notice that $T^{'} (x) = 2 \sum_{n = 0}^{N - 1} x [n]$ is also a sufficient statistic for A. In fact, sufficient statistics are unique only to within a one-to-one transformation.

□

Another concept that is important to the determination of MVUE is the concept of complete sufficient statistic. In essence, a statistic is complete if there is only one function of the statistic that forms an unbiased estimator. Suppose T(x) is a sufficient statistic for θ. Furthermore, suppose there are two function g₁ and g₂ of T (x) that are unbiased estimators of θ. Equivalently, we have

$E {g_{1} (T (x)) - g_{2} (T (x))} = 0$

(10.21)

for all θ, and the expectation is taken with respect to the observations (or equivalently the sufficient statistics). In order to prove T(x) is a complete sufficient statistic, we must prove that if (10.21) holds, then g₁(T(x)) = g₂(T(x)).

Example 10.3.3. [1] In the estimation of A from observations x[n] = A + w[n] with w[n] being i.i.d. Gaussian noise with zero mean and variance σ², we have shown in the previous example that a sufficient statistic for A is $T (x) = \sum_{n = 0}^{N - 1} x [n]$ . Now suppose there are two unbiased estimators g₁(T(x)) and g₂(T(x)), that is

$E {g_{1} (T (x)) - g_{2} (T (x))} = 0$

for all A. Since τ ≔ T(x) ~ $N$ (NA, Nσ²), and defining v(τ) ≔ g₁(τ) − g₂(τ), the above equation is equivalent to

$\int_{- \infty}^{\infty} v (τ) \frac{1}{\sqrt{2 π N σ^{2}}} \exp [- \frac{1}{2 N σ^{2}} {(N A - τ)}^{2}] d r = 0.$

It can be recognized that the above equation is the convolution of v(τ) and a Gaussian pulse. Since the Gaussian pulse is non-negative, in order to have the convolution equal to zero for all A, v(τ) must be identically zero. Therefore, g₁(T(x)) = g₂(T(x)), implying the unbiased estimator is unique, and the sufficient statistic $T (x) = \sum_{n = 0}^{N - 1} x [n]$ is complete.

□

10.3.2 Finding MVUE from Sufficient Statistics

With the background of sufficient statistics, we now present a theorem that helps us to identify the MVUE.

Theorem 10.3.2. (Rao-Blackwell-Lehmann-Scheffe) [1] [2] Let $\hat{θ}$ be an unbiased estimator of θ, and T(x) be a sufficient statistic for θ. Then $\hat{θ} = E {\hat{θ} | Τ (x)}$ improves on $\overset{⌣}{θ}$ in the following ways:

1. $\hat{θ}$ is a valid unbiased estimator for θ;

2. var $(\hat{θ}) \leq var (\overset{⌣}{θ})$ .

The result of this theorem is important in the sense that it provides a way to improve the variance of any unbiased estimator using sufficient statistics. However, this theorem does not immediately offer the MVUE, since it is not guaranteed that the improved estimator is of minimum variance within the class of unbiased estimators. But if we have the additional knowledge that the sufficient statistic is complete, the estimator $\hat{θ} = E {\overset{⌣}{θ} | T (x)}$ is the only unbiased estimator for θ while making use of the sufficient statistic. Therefore, it must be the MVUE. In fact, if the estimator we start with $\overset{⌣}{θ}$ is also an unbiased estimator using complete sufficient statistics, it is automatically the MVUE, since there is only one unbiased estimator if the sufficient statistic is complete. There is even no need to compute the expectation $E {\overset{⌣}{θ} | T (x)}$ . In summary, the procedure of determining MVUE using Rao-Blackwell-Lehmann-Scheffe theorem is as follows.

1. Find a sufficient statistic for θ, T(x), using Neyman-Fisher factorization Theorem.

2. Determine if the sufficient statistic is complete.

3. If so, find a function $\hat{θ} = g (T (x))$ which is an unbiased estimator of θ. The MVUE will be $\hat{θ}$ .

Example 10.3.4. [4] In the clock offset estimation in a wireless sensor node, the observation equations can be written as

$\begin{matrix} U_{n} & = & δ + ϕ + w_{n}, & n = 1, 2, \dots, N, \\ V_{n} & = & δ - ϕ + z_{n}, & n = 1, 2, \dots N, \end{matrix}$

where δ symbolizes the fixed portions of the transmission delays, w_n and z_n denote the variable portions of delays and assume i.i.d. exponential distributions with means α and β, respectively, and φ stands for the clock phase offset from the reference time. We want to determine the MVUE of θ ≔ [δ, ϕ, α, β]^T using the procedure introduced above.

First, we need to determine a sufficient statistic. The likelihood of θ is given by

$\begin{array}{l} L (θ) = \underset{: = g_{1} (\sum_{n = 1}^{N} U_{n}, δ, ϕ, α)}{\underset{︸}{α^{- N} e^{- \frac{1}{α} \sum_{n = 1}^{N} (U_{n} - δ - ϕ)}}} \underset{: = g_{3} (U_{(1)}, δ, ϕ)}{\underset{︸}{u [U_{(1)} - δ - ϕ]}} . \\ \underset{: = g_{2} (\sum_{n = 1}^{N} V_{n}, δ, ϕ, α)}{\underset{︸}{β^{- N} e^{- \frac{1}{β} \sum_{k = 1}^{N} (V_{k} - δ + ϕ)}}} \underset{: = g_{4} (V_{(1)}, δ, ϕ)}{\underset{︸}{u [V_{(1)} - δ + ϕ]}} . \underset{: = h (U_{n}, V_{n})}{\underset{︸}{1}}, \end{array}$

where u[·] denotes the unit step function, and U₍₁₎ and V₍₁₎ are the minimum order statistics of U_κ and V_κ, respectively. In the above expression, h(U_n, V_n) is independent of the unknown parameter vector θ, whereas g₁, g₂, g₃ and g₄ are functions depending on the data through $T = {\sum_{n = 1}^{N} U_{n}, U_{(1)}, \sum_{n = 1}^{N} V_{n}, V_{(1)}}$ . Therefore, according to Neyman-Fisher factorization Theorem 10.3.1, T is a sufficient statistic for θ.

Next, we need to prove that T is complete, a step which requires the PDF of T. Unfortunately, since $\sum_{n = 1}^{N} U_{n}$ and U₍₁₎, and similarly $\sum_{n = 1}^{N} V_{n}$ and V₍₁₎, are not independent, the PDF of T is difficult to be obtained. But if we consider the statistic $T^{'} = {\sum_{n = 1}^{N} (U_{n} - U_{(1)}), U_{(1)}, \sum_{n = 1}^{N} (V_{n} - V_{(1)}), V_{(1)}}$ , it turns out that it is also a sufficient statistic because it is obtained using a one-to-one transformation from T. It can be proved that the elements of T′ are independent, and each of them obeys the three-parameter Gamma distribution:

$\begin{array}{r} r & : = & \sum_{n = 1}^{N} (U_{n} - U_{(1)}) & \sim & Γ (N - 1, α, 0), \\ s & : = & \sum_{n = 1}^{N} (V_{n} - V_{(1)}) & \sim & Γ (N - 1, β, 0), \\ U_{(1)} & \sim & Γ (1, α / N, δ + ϕ), \\ V_{(1)} & \sim & Γ (1, β / N, δ - ϕ) . \end{array}$

Now, suppose there are two functions g(T′) and h(T′) that are unbiased estimators of θ. Then we have

$\begin{matrix} E {g (T^{'}) - h (T^{'})} = E {π (T^{'})} = 0 & \forall θ \end{matrix}$

where π(T′) ≔ g(T′) − h(T′). As a result,

$\begin{array}{l} \int_{δ - ϕ}^{\infty} \int_{0}^{\infty} \int_{δ + ϕ}^{\infty} \int_{0}^{\infty} π (r, U_{(1)}, s, V_{(1)}) . \frac{{(α β)}^{- (N - 1)}}{{Γ (N - 1)}^{2}} {(r s)}^{N - 2} e^{- \frac{r}{α} - \frac{s}{β}} \\ \times \frac{N^{2}}{α β} e^{- \frac{N}{α} {U_{(1)} - δ - ϕ} - \frac{N}{β} {V_{(1)} - δ + ϕ}} d r d U_{(1)} d s d V_{(1)} = 0 \end{array}$

for all θ. The above relation can also be expressed as

$\begin{array}{l} \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} [π (r, U_{(1)}, s, V_{(1)}) {(r s)}^{N - 2} u (s) u (r) u (U_{(1)} - δ - ϕ) u (V_{(1)} - δ + ϕ)] \\ \times e^{- {\frac{r}{α} + \frac{N U_{(1)}}{α} + \frac{s}{β} + \frac{N V_{(1)}}{β}}} d r d U_{(1)} d s d V_{(1)} = 0. \end{array}$

The expression on the left above is the four-dimensional Laplace transform of the function within the square brackets. It follows from the uniqueness theorem for the two-sided Laplace transform that π(T′)(rs)^N⁻²u(s)u(r)u(U₍₁₎ − δ − ϕ)u(V₍₁₎ − δ + ϕ) = 0 almost everywhere. Since r, s, u(s), u(r), u(U₍₁₎ − δ − ϕ), u(V₍₁₎ − δ + ϕ) are not identically zero, we can conclude that π(T′) = 0, resulting in g(T′) = h(T′). This proves that the statistic T′, or equivalently T, is complete for estimating θ.

Finally, we have to find an unbiased estimator for θ as a function of T. A careful inspection of the sufficient statistics reveals that

$\hat{θ} = \frac{1}{2 (N - 1)} [\begin{matrix} N (U_{(1)} + V_{(1)}) - \frac{1}{N} (\sum_{n = 1}^{N} U_{n} + \sum_{n = 1}^{N} V_{n}) \\ N (U_{(1)} - V_{(1)}) - \frac{1}{N} (\sum_{n = 1}^{N} U_{n} - \sum_{n = 1}^{N} V_{n}) \\ 2 N (\frac{1}{N} \sum_{n = 1}^{N} U_{n} - U_{(1)}) \\ 2 N (\frac{1}{N} \sum_{n = 1}^{N} V_{n} - V_{(1)}) \end{matrix}]$

is an unbiased estimator of θ, which is also the MVUE according to the Rao-Blackwell-Lehmann-Scheffe Theorem 10.3.2.

10.4 Maximum Likelihood Estimation

In general, determining MVUE is difficult due to the fact that we do not know whether the MVUE for a particular problem exists. Even if it does exist, the procedure for determining the MVUE is overwhelmingly challenging for most problems. Therefore, in practice, MVUEs are only reported in isolated cases. In this section, we introduce the maximum likelihood (ML) estimation, a popular technique that usually results in practical estimators.

10.4.1 ML Estimation Principle

As it was introduced previously in Section 10.2 in the context of Cramér-Rao lower bound, the likelihood function can be interpreted as a PDF of the unknown parameter θ with fixed observation x. The maximum likelihood estimation approach resumes to finding the parameter vector that maximizes the likelihood function:

$\hat{θ} = \arg [\max_{\tilde{θ}} p (x | \tilde{θ})],$

where $\tilde{θ}$ is the trial value of the unknown parameter. In fact, the ML estimate $\hat{θ}$ is the parameter that is the most probable for generating the observations.

Example 10.4.1. [1] Consider the received data x[n] = A + w[n] for n = 0,1,…, N − 1, with A unknown, and w[n] is i.i.d. Gaussian distributed with zero mean and variance σ². We want to derive the ML estimator for A. First, the likelihood function of A is given by (10.19). Taking logarithm on the likelihood function gives

$\ln p (x | A) = - \frac{N}{2} \ln 2 π - N \ln σ - \frac{1}{2 σ^{2}} \sum_{n = 0}^{N - 1} {(x [n] - A)}^{2} .$

Since the value of A that maximizes p(x|A) is the same as that maximizing ln p(x|A), we can proceed to maximize the log-likelihood function ln p(x|A) instead. Differentiating ln p(x|A) with respect to A, we have

$\frac{\partial \ln p (x | A)}{\partial A} = \frac{1}{σ^{2}} \sum_{n = 0}^{N - 1} (x [n] - A) .$

(10.22)

Setting (10.22) to zero, the ML estimator of A is given by

$\hat{A} = \frac{1}{N} \sum_{n = 0}^{N - 1} x [n],$

which can be interpreted as the sample mean estimator.

Notice that the first derivative of the log-likelihood function (10.22) of the above example can also be written as

$\frac{\partial \ln p (x | A)}{\partial A} = \frac{N}{σ^{2}} (\frac{1}{N} \sum_{n = 0}^{N - 1} x [n] - A) .$

(10.23)

Comparing (10.23) to (10.14), it is obvious that the ML estimator of A is also the MVUE. Furthermore, in the general linear model considered in Example 10.2.2, the first derivative of the log-likelihood function is given by (10.15). By setting (10.15) to zero and solving for θ, the ML estimator of θ can be easily shown to be identical to the MVUE in (10.16). This can be viewed as a special case of following important result.

Theorem 10.4.1. (Optimality of the MLE for the Linear Model) [1] [2] For the data x described by the general linear model

$x = H θ + w$

where H stands for a known N × p matrix with N > p and of full rank, θ represents a p × 1 vector of parameters to be estimated, and w denotes the Gaussian noise vector with PDF $N$ (0, C), the ML estimate of θ is given by

$\hat{θ} = {(H^{T} C^{- 1} H)}^{- 1} H^{T} C^{- 1} x .$

(10.24)

Furthermore, $\hat{θ}$ is also the MVUE and its variance (H^TC⁻¹H)⁻¹ attains the CRLB.

The above two examples are not pure coincidence. In fact, if an unbiased estimator attains the CRLB (i.e., an efficient estimator exists), the ML procedure will produce it. This can be easily explained by the fact that if an efficient estimator exists, (10.14) holds. On the other hand, ML estimate is obtained by setting the first derivative of the log-likelihood function, i.e., (10.14), to zero. This automatically produces $\hat{θ} = g (x)$ , which is also the MVUE.

10.4.2 Properties of the ML Estimator

However, being identical to the MVUE in certain situations is not the main reason for the popularity of the ML estimation, as these situations do not frequently occur. On the other hand, the following property itself is strong enough for the ML estimation to be useful in practice.

Theorem 10.4.2. (Asymptotic Property of the ML estimator) [1] [2] If the likelihood function p(x|θ) of data x satisfies certain “regularity” conditions (existence of the derivatives of the log-likelihood function, as well as the Fisher information being nonzero), then the ML estimate of θ is asymptotically distributed (for large data records) according to

$\hat{θ} \underset{˜}{a} N (θ, I^{- 1} (θ))$

where I(θ) is the Fisher information matrix evaluated at the true value of the unknown parameter.

This theorem states that when the observation data length tends to infinity, the ML estimate is asymptotically unbiased and asymptotically attains the CRLB. The ML estimate is therefore asymptotically efficient.

Example 10.4.2. [1] Consider x[n] = A + w[n] for n = 0,1,…,N − 1, with A > 0 is the parameter of interest, and w[n] is i.i.d. Gaussian noise with unknown variance A. The likelihood function of A is

$p (x | A) = \frac{1}{{(2 π A)}^{N / 2}} \exp [- \frac{1}{2 A} \sum_{n = 0}^{N - 1} {(x [n] - A)}^{2}] .$

Notice that this problem is not the same as that in the last example, as A appears as both the mean of received data and the variance of the noise. Differentiating the log-likelihood function with respect to A, it follows that

$\frac{\partial \ln p (x | A)}{\partial A} = - \frac{N}{2 A} + \frac{1}{A} \sum_{n = 0}^{N - 1} (x [n] - A) + \frac{1}{2 A^{2}} \sum_{n = 0}^{N - 1} {(x [n] - A)}^{2} .$

(10.25)

Setting the result to zero produces

$A^{2} + A - \frac{1}{N} \sum_{n = 0}^{N - 1} x^{2} [n] = 0,$

which yields the ML estimate of A:

$\hat{A} = - \frac{1}{2} + \sqrt{\frac{1}{N} \sum_{n = 0}^{N - 1} x^{2} [n] + \frac{1}{4},}$

and the other solution is dropped to guarantee A > 0.

Now, we will examine the asymptotic mean and variance of the above estimator. To simply the notation, we let $u = \frac{1}{N} \sum_{n = 0}^{N - 1} x^{2} [n]$ , and g(u) be the function:

$\hat{A} = g (u) = - \frac{1}{2} + \sqrt{u + \frac{1}{4}} .$

As N → ∞, u → E {x²[n]} = A² + A := u_o. Using a first order Taylor series approximation for g(u) around u_o yields

$\begin{matrix} \hat{A} = g (u) \approx g (u_{o}) + \frac{d g (u)}{d u} |_{u = u_{o}} (u - u_{o}) \\ = A + \frac{1 / 2}{A + 1 / 2} [\frac{1}{N} \sum_{n = 0}^{N - 1} x^{2} [n] - (A + A^{2})] . \end{matrix}$

Therefore, E {Â} = A, so Â is asymptotically unbiased. Furthermore, the asymptotic variance is computed as

$\begin{matrix} var (\hat{A}) = {(\frac{1 / 2}{A + 1 / 2})}^{2} var (\frac{1}{N} \sum_{n = 0}^{N - 1} x^{2} [n]) \\ = \frac{1 / 4}{N {(A + 1 / 2)}^{2}} var (x^{2} [n]) . \end{matrix}$

Since x[n] is Gaussian distributed with mean A and variance A, it can be shown that var (x²[n]) = 4A³ + 2A², and therefore

$var (\hat{A}) = \frac{A^{2}}{N (A + 1 / 2)} .$

(10.26)

To check whether this asymptotic variance equals the CRLB, we differentiate (10.25) with respect to A again, and obtain

$\frac{\partial \ln p (x | A)}{\partial A^{2}} = \frac{N}{2 A^{2}} - \frac{1}{A^{2}} \sum_{n = 0}^{N - 1} (x [n] - A) - \frac{N}{A} - \frac{1}{A^{3}} \sum_{n = 0}^{N - 1} {(x [n] - A)}^{2} \sum_{n = 0}^{N - 1} (x [n] - A) .$

The CRLB of A is then given by

$\begin{matrix} {[- E {\frac{\partial^{2} \ln p (x | A)}{\partial A^{2}}}]}^{- 1} = - {(\frac{N}{2 A^{2}} - \frac{N}{A} - \frac{N}{A^{2}})}^{- 1} \\ = \frac{A^{2}}{N (A + 1 / 2)} . \end{matrix}$

The asymptotic variance (10.26) coincides with the CRLB. Therefore, the ML estimator for A is asymptotic efficient.

□

Example 10.4.3. [5] In orthogonal frequency division multiple access (OFDMA) uplink, multiple signals from different users arrive at the base station at the same time. However, their carrier frequency offsets (CFO) and channels are different and unknown. Before the data can be detected, we need to estimate all these parameters. In general, the received signal model in such systems can be expressed in the form of matrix equation:

$x = Q (ω) h + w,$

(10.27)

where x stands for the received data from one OFDM symbol, h denotes an unknown vector containing the channel coefficients of different users, Q(ω) is a matrix with known structure but parameterized nonlinearly by the unknown vector ω = [ω₁ ω₂ …ω_K]^T of frequency offsets, w represents the i.i.d. complex Gaussian noise with zero mean and variance σ². The ML estimate of parameters {h, ω} is obtained by maximizing the likelihood function

$p (x | \tilde{h}, \tilde{ω}) = \frac{1}{{(π σ^{2})}^{N}} \cdot \exp {- \frac{1}{σ^{2}} {[x - Q (\tilde{ω}) \tilde{h}]}^{H} [x - Q (\tilde{ω}) \tilde{h}]},$

(10.28)

where h̃ and $\tilde{ω}$ are trial values of h and ω, respectively. Notice that since w is complex, the PDF expression (10.28) assumes a slightly different form than that when w is real-valued in (10.3). Taking the logarithm of (10.28), and ignoring the constant irrelevant terms, the ML estimate can be equivalently found by minimizing

$Λ (x | \tilde{h}, \tilde{ω}) = {[x - Q (\tilde{ω}) \tilde{h}]}^{H} [x - Q (\tilde{ω}) \tilde{h}] .$

(10.29)

Due to the linear dependence of parameter h in (10.27), the ML estimate for the channel vector h (when $\tilde{ω}$ is fixed) is given by (see (10.24))

$\hat{h} = (Q^{H} (\tilde{ω}) Q {(\tilde{ω}))}^{- 1} Q^{H} (\tilde{ω}) x .$

Plugging ĥ into (10.29), and retaining only those terms related to $\tilde{ω}$ , the ML estimate of ω can be expressed as

$\hat{ω} = \arg \max_{\tilde{ω}} {J (\tilde{ω}) : = x^{H} Q (\tilde{ω}) {(Q^{H} (\tilde{ω}) Q (\tilde{ω}))}^{- 1} Q^{H} (\tilde{ω}) x} .$

(10.30)

Now the question resumes to how to maximize the nonlinear function $J (\tilde{ω})$ . One straightforward way is to use a grid search, in which the feasible region of the parameter is divided into many subregions and the potential points are checked one by one. For example, if the parameter ω is one-dimensional and lies within [a, b], we can divide this range into 10,000 equally spaced intervals, and compute the cost function for 10,000 different values of ω. This approach may be acceptable in the presence of a single unknown parameter. However, when the dimension of parameter vector increases, the number of points to be searched increases exponentially. For example, if there are four users in the OFDMA system, the unknown vector ω is four dimensional. If we divide each dimension into 10,000 subintervals, the number of points to be searched is (10,000)⁴. This is unacceptable. Furthermore, if the parameter of interest is not bounded to a finite-length interval, this approach is also practically infeasible.

Another possible solution is to use iterative numerical search methods, such as the Newton-Raphson or steepest descent method. The idea is to approximate the optimization of a nonlinear function by a series of linearized optimizations with respect to a previously obtained solution. In the Newton-Raphson method, the second order Taylor series expansion is applied to the nonlinear function J(ω) around a previous estimate ${\tilde{ω}}^{(i)}$ :

$J (ω) \approx J ({\hat{ω}}^{(i)}) + (ω - {\hat{ω}}^{(i)}) \frac{\partial J (ω)}{\partial ω} |_{ω = {\hat{ω}}^{(i)}} + \frac{1}{2} {(ω - {\hat{ω}}^{(i)})}^{H} {[\frac{\partial^{2} J (ω)}{\partial ω \partial ω^{H}}]}_{ω = {\hat{ω}}^{(i)}} (ω - {\hat{ω}}^{(i)})$

(10.31)

where

$\frac{\partial J (ω)}{\partial ω} = {[\frac{\partial J}{\partial ω_{1}}, \frac{\partial J}{\partial ω_{2}}, \dots . \frac{\partial J}{\partial ω_{K}}]}^{T}$

is the gradient vector, and

$\frac{\partial^{2} J (ω)}{\partial ω \partial ω^{H}} = (\begin{matrix} \frac{\partial^{2} J}{\partial ω_{1} \partial ω_{1}} & \frac{\partial^{2} J}{\partial ω_{1} \partial ω_{2}} & \dots & \frac{\partial^{2} J}{\partial ω_{1} \partial ω_{K}} \\ \frac{\partial^{2} J}{\partial ω_{2} \partial ω_{1}} & \frac{\partial^{2} J}{\partial ω_{2} \partial ω_{2}} & \dots & \frac{\partial^{2} J}{\partial ω_{2} \partial ω_{K}} \\ ⋮ \\ \frac{\partial^{2} J}{\partial ω_{K} \partial ω_{1}} & \frac{\partial^{2} J}{\partial ω_{K} \partial ω_{2}} & \dots & \frac{\partial^{2} J}{\partial ω_{K} \partial ω_{K}} \end{matrix})$

is the Hessian of J(ω). Differentiating (10.31) with respect to $(ω - {\hat{ω}}^{(i)})$ and setting the result to zero leads to

$\frac{\partial J (ω)}{\partial ω} |_{ω = {\hat{ω}}^{(i)}} + {[\frac{\partial 2 J (ω)}{\partial ω \partial ω^{H}}]}_{ω = {\hat{ω}}^{(i)}} (ω = {\hat{ω}}^{(i)}) = 0.$

Solving the above equation gives the value of ω that maximizes the second order approximated Taylor series expansion:

${\hat{ω}}^{(i + 1)} = {\hat{ω}}^{(i)} - {[\frac{\partial^{2} J (ω)}{\partial ω \partial ω^{H}}]}_{ω = {\hat{ω}}^{(i)}}^{- 1} \frac{\partial J (ω)}{\partial ω} |_{ω = {\hat{ω}}^{(i)}} .$

On the other hand, in the steepest descent method, the update is given by the equation:

${\hat{ω}}^{(i + 1)} = {\hat{ω}}^{(i)} - α \frac{\partial J (ω)}{\partial ω} |_{ω = {\hat{ω}}^{(i)}},$

where ${\hat{ω}}^{(i)}$ is the estimate of ω at the i^th iteration, the gradient vector ∂J(ω)/∂ω points in the direction of the maximum increase of J(ω) at the point ω, and α is a parameter that determines how far away each update moves along the direction opposite to the gradient. Notice that the steepest descent method is derived to minimize a function, but minimizing a function is equivalent to maximizing the negative of the same function. Therefore, we do not explicitly describe the maximization algorithm.

Both methods share some similarities. For example, both methods require the initial estimate near the global optimal solution. Otherwise, in general, in the presence of multi-modal objective functions, only convergence to a local optimum can be guaranteed. Furthermore, both methods require the use of the derivative information. However, the Newton-Raphson method requires the additional computation of the Hessian, which may be difficult or impossible to obtain. Notice that even if the Hessian can be obtained, it may not be invertible. However, the use of the Hessian information (if available) ensures a faster convergence rate for the Newton-Raphson method relative to the steepest descent method.

Newton-Raphson and steepest descent methods may indeed reduce the computational complexity compared to the grid search. However, they are still multidimensional optimization approaches. To further reduce the computational burden brought by the multidimensional searches in the ML estimator, the alternating projection algorithm can be exploited. The alternating projection method reduces a K-dimensional maximization problem (10.30) into a series of one-dimensional maximization problems, by updating one parameter at a time, while keeping the other parameters fixed at the previous estimated values. Let ${\hat{ω}}_{k}^{(i)}$ be the estimate of ω_κ at the i^th iteration. Further, let

${\hat{ω}}_{- k}^{(i)} = {[{\hat{ω}}_{1}^{(i + 1)} \dots {\hat{ω}}_{k - 1}^{(i + 1)} {\hat{ω}}_{k + 1}^{(i)} \dots {\hat{ω}}_{k + 1}^{(i)}]}^{T} .$

Given the initial estimates ${{\hat{ω}}_{k}^{(0)}}_{k = 0}^{K}$ , the i^th (i ≥ 1) iteration of the alternating projection algorithm for maximizing (10.30) assumes the following form:

${\hat{ω}}_{k}^{(i)} = \arg \max_{{\tilde{ω}}_{k}} {J ({\tilde{ω}}_{k}, {\hat{ω}}_{- k}^{(i - 1)})},$

for κ = 1,…,K. Grid search, steepest descent or Newton-Raphson method can be used in each of the one-dimensional search. Multiple iterations on i are performed until the estimates of ω_κ converge to a stable solution. Similar to other iterative algorithms, in general, the alternating projection can only be guaranteed to converge to at least a local optimum. Therefore, a good initialization is important.

Figure 10.2(a) shows the MSE performance of the ML estimator of ω with alternating projection and grid search in each dimension, as a function of observation length N. There are two users (K=2), the signal–to–noise ratio (SNR) is set to 10dB, and the number of iterations for alternating projection is 3. CRLB is also shown in the figure. It can be seen that the performance of the ML estimator approaches the CRLB when N increases, confirming the ML estimator is asymptotic efficient. On the other hand, Figure 10.2(b) shows the MSE performance of the ML estimator as a function of SNR. The OFDMA system has N = 64 subcarriers and K = 4 users. It is obvious that the MSE coincides with the CRLB in medium to high SNRs. Notice that increasing the number of observations has the same effect as decreasing the noise variance in observations. Thus, the ML estimator is also asymptotically efficient at high SNRs.

The ML estimate of a parameter θ is obtained by maximizing the likelihood function p(x|θ). But suppose we are interested not in θ itself, but α = g(θ). The question is if we can obtain the ML estimate of α from the ML estimate of θ. Such a situation occurs when the ML estimate of θ is relatively easy to obtain, but the direct derivation of ML estimate of α is difficult. The following theorem answers this question and is another major factor in making the ML estimation so popular.

Theorem 10.4.3. (Invariance property of the ML estimator) [1] Suppose α = g(θ); where θ is a p × 1 vector with likelihood function p(x|θ) and g is an r-dimensional function of θ. The ML estimator of α is given by

$\hat{α} = g (\hat{θ})$

where $\hat{θ}$ is the ML estimate of θ.

Example 10.4.4. In this example, we consider again the clock synchronization of wireless sensor node discussed in Example 10.2.1, with observations given by (10.6) and (10.7). However, in this example, the observation noise w[n] and z[n] are modeled as i.i.d. exponential random variables with an unknown common rate parameter λ. The goal is to estimate β₀, β₁, λ and d based on the observations ${x_{1} [n], x_{2} [n]}_{n = 0}^{N - 1}$ .

To derive the ML estimator, we rewrite the observation equations (10.6) and (10.7) as

$w [n] = \frac{1}{β_{1}} \cdot x_{1} [n] - t_{1} [n] - \frac{β_{0}}{β_{1}} - d,$

(10.32)

$z [n] = - \frac{1}{β_{1}} \cdot t_{2} [n] + x_{2} [n] + \frac{β_{0}}{β_{1}} - d .$

(10.33)

Figure 10.2: Asymptotic property of the ML estimator.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 10 Fundamentals of Estimation Theory

Create new playlist

Sign In

Sign Up

Table of Contents for
10 Fundamentals of Estimation Theory