Chapter 8    Random Matrix Theory

Romain Couillet and Merouane Debbah

L’École Supérieure D’Electricité (SUPELEC), France

Random matrix theory deals with the study of matrix-valued random variables. It is conventionally considered that random matrix theory dates back to the work of Wishart in 1928 [1] on the properties of matrices of the type XX†, with X ∈ ℂ^{N×n} a random matrix with independent Gaussian entries of zero mean and equal variance. Wishart and his followers were primarily interested in the joint distribution of the entries of such matrices and then in their eigenvalue distribution. It then dawned on mathematicians that, as the matrix dimensions N and n grow large with ratio converging to a positive value, the eigenvalue distribution converges weakly and almost surely to some deterministic distribution, a result somewhat similar to a law of large numbers for random matrices. This triggered a growing interest, in particular among the signal processing community, as it is usually difficult to deal efficiently with large dimensional data because of the so-called curse of dimensionality. Other fields of research have taken an interest in large dimensional random matrices, among which the field of wireless communications, as the eigenvalue distribution of some random matrices is often a sufficient statistic for the performance evaluation of multidimensional wireless communication systems.

In the following, we introduce the main notions, results and details of classical as well as recent techniques to deal with large random matrices.

8.1      Probability Notations

In this chapter, an event will be an element ω of some set Ω. Based on Ω, we will consider the probability space (Ω, 𝓕, P), with 𝓕 a σ-field on Ω and P a probability measure on 𝓕. If X is a random variable on Ω, we will denote

μ_X(A) ≜ P({ω : X(ω) ∈ A})

the probability distribution of X.

When μ_X has a PDF, it will be denoted P_X, i.e., for X with image in ℝ equipped with the Lebesgue measure and for all measurable f,

∫ f(x) P_X(x) dx = ∫ f(x) μ_X(dx).

To differentiate between multidimensional random variables and scalar random variables, we may denote p_X(x) ≜ P_X(x), in lowercase character, if X is scalar. The CDF of a real random variable will often be denoted by the letter F, e.g., for x ∈ ℝ,

F(x) ≜ μ_X((−∞, x])

denotes the CDF of X.

We further denote, for X, Y two random variables with density, and for y such that P_Y(y) > 0,

P_{X|Y}(x, y) ≜ P_{X,Y}(x, y) / P_Y(y)

the conditional probability density of X given Y.

8.2      Spectral Distribution of Random Matrices

We start this section with a formal definition of a random matrix and the introduction of necessary notations.

Definition 8.2.1. An N × n matrix X is said to be a random matrix if it is a matrix-valued random variable on some probability space (Ω, 𝓕, P), with entries in some measurable space (𝓡, 𝓖), where 𝓕 is a σ-field on Ω with probability measure P and 𝓖 is a σ-field on 𝓡. As per conventional notations, we denote X(ω) the realization of the variable X at point ω ∈ Ω.

We shall in particular often consider the marginal probability distribution function of the eigenvalues of random Hermitian matrices X. Unless otherwise stated, the distribution function (d.f.) of the real eigenvalues of X will be denoted F^X.

We now discuss the properties of the so-called Wishart matrices and some known results on unitarily invariant random matrices. These properties are useful to the characterization, e.g., of Neyman-Pearson tests for signal sensing procedures [2] , [3].

8.2.1      Wishart Matrices

We start with the definition of a Wishart matrix.

Definition 8.2.2. The N × N random matrix XX† is a (real or complex) central Wishart matrix with n degrees of freedom and covariance matrix R if the columns of the N × n matrix X are zero mean independent (real or complex) Gaussian vectors with covariance matrix R. This is denoted

XX† ∼ 𝒲_N(n, R).

Defining the Gram matrix associated to any matrix X as the matrix XX†, XX† ∼ 𝒲_N(n, R) is by definition the Gram matrix of a matrix with i.i.d. Gaussian columns with zero mean and covariance R. When R = I_N, it is usual to refer to X as a standard Gaussian matrix.

One interest of Wishart matrices in signal processing applications lies in the following remark.

Remark 8.2.1. Let x_1, …, x_n ∈ ℂ^N be n independent samples of the random process x_1 ∼ 𝒞𝒩(0, R). Then, denoting X = [x_1, …, x_n],

Σ_{i=1}^n x_i x_i† = XX†.

For this reason, the random matrix R̂_n = (1/n) XX† is often referred to as the (empirical) sample covariance matrix associated to the random process x_1. This is to be contrasted with the population covariance matrix E{x_1 x_1†} = R. Of particular importance is the case when R = I_N. In this situation, XX†, sometimes referred to as a zero (or null) Wishart matrix, is proportional to the sample covariance matrix of a white Gaussian process. The zero (or null) terminology is due to the signal processing problem of hypothesis testing, in which one has to decide whether the observed X emerges from a white noise process or from an information plus noise process.
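Remark 8.2.1 is immediate to check numerically. The following minimal sketch (assuming NumPy; the dimensions and seed are arbitrary) verifies that the sum of outer products of the columns equals the Gram matrix XX†.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 4, 10
# n independent CN(0, I_N)-like samples stacked as the columns of X
X = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2)

# Sum of outer products x_i x_i^dagger ...
outer_sum = sum(np.outer(X[:, i], X[:, i].conj()) for i in range(n))
# ... equals the Gram matrix X X^dagger
gram = X @ X.conj().T
max_diff = float(np.max(np.abs(outer_sum - gram)))
```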

Wishart provides us with the joint probability density function of the entries of Wishart matrices, as follows:

Theorem 8.2.1 ([1]). The PDF of the complex Wishart matrix XX† ∼ 𝒲_N(n, R), X ∈ ℂ^{N×n}, for n ≥ N is

P_{XX†}(B) = e^{−tr(R^{−1}B)} det(B)^{n−N} / (π^{N(N−1)/2} det(R)^n Π_{i=1}^N (n − i)!).

(8.1)

Note in particular that for N = 1, this is a conventional chi-square distribution with n degrees of freedom.

For null Wishart matrices, notice that P_{XX†}(B) = P_{XX†}(UBU†) for any unitary N × N matrix U.1 Otherwise stated, the eigenvectors of the random variable XX† are uniformly distributed over the space 𝒰(N) of unitary N × N matrices. As such, the eigenvectors do not carry relevant information, and P_{XX†}(B) is only a function of the eigenvalues of B. This property will turn out to be essential to the derivation of further properties of Wishart matrices.

The joint PDF of the eigenvalues of zero Wishart matrices was studied simultaneously in 1939 by different authors [4, 5, 6, 7]. The two main results are summarized in the following.

Theorem 8.2.2. Let the entries of X ∈ ℂ^{N×n}, n > N, be independent and identically distributed (i.i.d.) Gaussian with zero mean and unit variance. The joint PDF P_{(λ_i)} of the ordered eigenvalues λ_1 ≥ … ≥ λ_N of the zero Wishart matrix XX† is given by

P_{(λ_i)}(λ_1, …, λ_N) = e^{−Σ_{i=1}^N λ_i} Π_{i=1}^N [λ_i^{n−N} / ((n − i)!(N − i)!)] Δ(Λ)²,

where, for a Hermitian non-negative N × N matrix Λ,2 Δ(Λ) denotes the Vandermonde determinant of its eigenvalues λ_1, …, λ_N,

Δ(Λ) ≜ Π_{1≤i<j≤N} (λ_j − λ_i).

The marginal PDF p_λ (≜ P_λ) of the unordered eigenvalues is

p_λ(λ) = (1/N) Σ_{k=0}^{N−1} [k! / (k + n − N)!] [L_k^{n−N}(λ)]² λ^{n−N} e^{−λ},

where the L_k^n(λ) are the Laguerre polynomials, defined as

L_k^n(λ) = (e^λ / (k! λ^n)) (d^k/dλ^k)(e^{−λ} λ^{n+k}).
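As a sanity check, the marginal density above can be evaluated for small N, n; it integrates to one and has mean n, since E[(1/N) tr XX†] = n for unit-variance complex entries. The sketch below (assuming NumPy; the grid is arbitrary) evaluates the generalized Laguerre polynomials by their standard three-term recurrence.

```python
import numpy as np
from math import factorial

def genlaguerre_vals(k, a, x):
    """Evaluate the generalized Laguerre polynomial L_k^a at x by recurrence."""
    L_prev, L = np.ones_like(x), 1.0 + a - x          # L_0 and L_1
    if k == 0:
        return L_prev
    for j in range(1, k):
        # (j+1) L_{j+1} = (2j + 1 + a - x) L_j - (j + a) L_{j-1}
        L_prev, L = L, ((2 * j + 1 + a - x) * L - (j + a) * L_prev) / (j + 1)
    return L

def null_wishart_marginal(lam, N, n):
    """Marginal PDF of an unordered eigenvalue of the zero Wishart matrix XX†, n >= N."""
    out = np.zeros_like(lam)
    for k in range(N):
        coef = factorial(k) / factorial(k + n - N)
        out += coef * genlaguerre_vals(k, n - N, lam)**2 * lam**(n - N) * np.exp(-lam)
    return out / N

N, n = 3, 5
grid = np.linspace(0.0, 40.0, 20001)
p = null_wishart_marginal(grid, N, n)
dx = grid[1] - grid[0]
mass = float(np.sum(p) * dx)          # total probability, should be close to 1
mean = float(np.sum(grid * p) * dx)   # mean eigenvalue, should be close to n
```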

The generalized case of (non-zero) central Wishart matrices is more involved since it requires advanced tools of multivariate analysis, such as the fundamental Harish-Chandra integral [8]. We will mention the result of Harish-Chandra, which is at the core of the results in signal sensing presented later in Section 8.5.1.

Theorem 8.2.3. For nonsingular N × N positive definite Hermitian matrices A and B of respective eigenvalues a_1, …, a_N and b_1, …, b_N, such that a_i ≠ a_j and b_i ≠ b_j for all i ≠ j, we have

∫_{𝒰(N)} e^{κ tr(AUBU†)} dU = (Π_{i=1}^{N−1} i!) κ^{−N(N−1)/2} det({e^{κ a_i b_j}}_{1≤i,j≤N}) / (Δ(A)Δ(B)),

where, for any bivariate function f, {f(i, j)}_{1≤i,j≤N} denotes the N × N matrix with (i, j) entry f(i, j), and 𝒰(N) is the space of N × N unitary matrices.

This result enables the calculation of the marginal joint-eigenvalue distribution of (non-zero) central Wishart matrices [9], given as follows:

Theorem 8.2.4. Let the columns of X ∈ ℂ^{N×n} be independent and identically distributed (i.i.d.) zero mean Gaussian with positive definite covariance R. The joint PDF P_{(λ_i)} of the ordered positive eigenvalues λ_1 ≥ … ≥ λ_N of the central Wishart matrix XX† reads

P_{(λ_i)}(λ_1, …, λ_N) = det({e^{−r_j^{−1} λ_i}}_{1≤i,j≤N}) [Δ(Λ)/Δ(R^{−1})] Π_{j=1}^N λ_j^{n−N} / (r_j^n (n − j)!),

where r_1 ≥ … ≥ r_N denote the ordered eigenvalues of R and Λ = diag(λ_1, …, λ_N).

This is obtained from the joint distribution of the Wishart matrix XX† which, up to a change of variables, leads to the joint distribution of the pairs (U, L) of unitary matrices and diagonal eigenvalue matrices such that XX† = ULU†. In performing this change of variables, the Jacobian Δ(L)² arises. Integrating over U to obtain the marginal distribution of L, we recognize the Harish-Chandra integral, which finally leads to the result.

These are the tools we need for the study of Wishart matrices. As it appears, the above properties hold due to the rotational invariance of Gaussian matrices. For more involved random matrix models, e.g., when the entries of the random matrices under study are no longer Gaussian, the study of the eigenvalue distribution is much more involved, if not infeasible.

However, it turns out that, as the matrix dimensions grow large, nice properties arise that can be studied much more efficiently than when the matrix sizes are kept fixed. A short introduction to these large matrix considerations is given hereafter.

8.2.2      Limiting Spectral Distribution

Consider an N × N (not necessarily random) Hermitian matrix X_N. Define its empirical spectral distribution (e.s.d.) F^{X_N} to be the d.f. of the eigenvalues of X_N, i.e., for x ∈ ℝ,

F^{X_N}(x) = (1/N) Σ_{j=1}^N 1_{λ_j ≤ x},

where λ_1 ≥ … ≥ λ_N are the eigenvalues of X_N.3

The relevant aspect of large N × N Hermitian matrices X_N is that their (random) e.s.d. F^{X_N} often converges, as N → ∞, towards a nonrandom distribution F. This function F, if it exists, will be called the limit spectral distribution (l.s.d.) of X_N. Weak convergence [11] of F^{X_N} to F, i.e., F^{X_N}(x) − F(x) → 0 for all x where F is continuous, is often sufficient to obtain relevant results; this is denoted

F^{X_N} ⇒ F.

In most cases though, the weak convergence of F^{X_N} to F will only be true on a set of matrices X_N = X_N(ω) of measure one. This will be mentioned with the phrase "F^{X_N} ⇒ F almost surely."

The Marc̆enko-Pastur Law

In signal processing, one is often interested in sample covariance matrices or even more general matrices such as independent and identically distributed (i.i.d.) matrices with left and right correlation, or i.i.d. matrices with a variance profile [12]. One of the best known results with a large range of applications in signal processing is the convergence of the empirical spectral distribution (e.s.d.) of the Gram matrix of a random matrix with i.i.d. entries of zero mean and normalized variance (not necessarily a Wishart matrix). This result is due to Marc̆enko and Pastur [13], so that the limiting e.s.d. of the Gram matrix is called the Marc̆enko-Pastur law. The result unfolds as follows.

Theorem 8.2.5. Consider a matrix X ∈ ℂ^{N×n} with i.i.d. entries ((1/√n) X_ij^{(N)}) such that X_ij^{(N)} has zero mean and unit variance. As n, N → ∞ with N/n → c ∈ (0, ∞), the e.s.d. of R_n = XX† converges almost surely to a nonrandom d.f. F_c with density f_c given by

f_c(x) = (1 − c^{−1})^+ δ(x) + (1/(2πcx)) √((x − a)^+ (b − x)^+),

(8.2)

where a = (1 − √c)², b = (1 + √c)², and δ(x) = 1_{{0}}(x) (δ(x) = 1 if x = 0 and δ(x) = 0 otherwise).

The d.f. F_c is named the Marc̆enko-Pastur law with limiting ratio c. This is depicted in Figure 8.1 for different values of the limiting ratio c. Notice in particular that, as c approaches zero, the Marc̆enko-Pastur law reduces to a single mass at 1, as the law of large numbers in classical probability theory requires.
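The Marc̆enko-Pastur law is straightforward to observe numerically. The following sketch (assuming NumPy; dimensions and seed are arbitrary) checks that the eigenvalues of XX† essentially fall inside [a, b] of (8.2), and that the density integrates to one when c < 1 (no atom at zero).

```python
import numpy as np

def mp_density(x, c):
    """Continuous part of the Marcenko-Pastur density (8.2), for c <= 1."""
    a, b = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
    f = np.zeros_like(x)
    inside = (x > a) & (x < b)
    f[inside] = np.sqrt((x[inside] - a) * (b - x[inside])) / (2 * np.pi * c * x[inside])
    return f

rng = np.random.default_rng(0)
N, n = 500, 1500                                  # ratio c = N/n = 1/3
X = rng.standard_normal((N, n)) / np.sqrt(n)      # i.i.d. entries of variance 1/n
eig = np.linalg.eigvalsh(X @ X.T)                 # spectrum of the Gram matrix XX^T

c = N / n
a, b = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
frac_inside = float(np.mean((eig > a - 0.1) & (eig < b + 0.1)))

grid = np.linspace(a, b, 4001)
mass = float(np.sum(mp_density(grid, c)) * (grid[1] - grid[0]))  # total mass of F_c
```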

Several approaches can be used to derive the Marc̆enko-Pastur law. However, the original technique proposed by Marc̆enko and Pastur is based on a fundamental tool, the Stieltjes transform, which will be constantly used in this chapter. In the following we present the Stieltjes transform, along with a few important lemmas, before we introduce several applications based on the Stieltjes transform method.


Figure 8.1:  Marc̆enko-Pastur law for different limit ratios c = lim N/n.

The Stieltjes Transform and Associated Lemmas

Definition 8.2.3. Let F be a real-valued bounded measurable function over ℝ. Then the Stieltjes transform m_F(z),4 for z ∈ Supp(F)^c, the complement in ℂ of the support of F,5 is defined as

m_F(z) ≜ ∫ 1/(λ − z) dF(λ).

(8.3)

For all F that admit a Stieltjes transform, the inverse transformation exists and is given by [10, Theorem B.8]:

Theorem 8.2.6. If x is a continuity point of F, then

F(x) = (1/π) lim_{y→0⁺} ∫_{−∞}^x ℑ[m_F(x′ + iy)] dx′.

(8.4)

In practice here, F will be a distribution function. Therefore, there exists an intimate link between distribution functions and their Stieltjes transforms. More precisely, if F_1 and F_2 are two distribution functions (therefore right-continuous by definition [16, see §14]) that have the same Stieltjes transform, then F_1 and F_2 coincide everywhere, and conversely. As a consequence, m_F uniquely determines F and vice versa. It will turn out that, while working on the distribution functions of the empirical eigenvalues of large random matrices is often a tedious task, the approach via Stieltjes transforms greatly simplifies the study. The initial intuition behind the Stieltjes transform approach for random matrices lies in the following remark: for a Hermitian matrix X ∈ ℂ^{N×N},

m_{F^X}(z) = ∫ 1/(λ − z) dF^X(λ) = (1/N) tr(Λ − z I_N)^{−1} = (1/N) tr(X − z I_N)^{−1},

in which we denoted Λ the diagonal matrix of the eigenvalues of X. Working with the Stieltjes transform of F^X therefore boils down to working with the matrix (X − zI_N)^{−1}, and more specifically with the sum of its diagonal entries. From matrix inversion lemmas and several fundamental matrix identities, it is then rather simple to derive limits of traces (1/N) tr(X − zI_N)^{−1} as N grows large, hence the Stieltjes transform of the weak limit of F^X. For notational simplicity, we may denote m_X ≜ m_{F^X} the Stieltjes transform of the e.s.d. of the Hermitian matrix X, and call m_X the Stieltjes transform of X.
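This identity is easy to check numerically. The sketch below (assuming NumPy; the Hermitian matrix and the point z are arbitrary) evaluates m_{F^X}(z) both from the eigenvalues and from the normalized trace of (X − zI_N)^{−1}.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
G = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
X = (G + G.conj().T) / 2                 # an arbitrary Hermitian matrix
z = 0.5 + 1.0j                           # any z off the real axis

# Stieltjes transform from the eigenvalue distribution ...
lam = np.linalg.eigvalsh(X)
m_from_eig = complex(np.mean(1.0 / (lam - z)))

# ... and from the normalized trace of the matrix (X - zI)^{-1}
resolvent = np.linalg.inv(X - z * np.eye(N))
m_from_trace = complex(np.trace(resolvent) / N)
```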

An identity of particular interest is the relation between the Stieltjes transforms of XX† and X†X, for X ∈ ℂ^{N×n}. Note that both matrices are Hermitian, and actually non-negative definite, so that both Stieltjes transforms are well defined.

Lemma 8.2.1. For z ∈ ℂ ∖ ℝ⁺, we have

(n/N) m_{F^{X†X}}(z) = m_{F^{XX†}}(z) + ((N − n)/N)(1/z).
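The lemma simply accounts for the |n − N| additional zero eigenvalues by which the spectra of X†X and XX† differ. A quick numerical check (assuming NumPy; dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 100, 300
X = rng.standard_normal((N, n)) / np.sqrt(n)
z = -1.0 + 0.5j                        # a point in C \ R+

# Stieltjes transforms of the e.s.d. of XX† (N x N) and X†X (n x n, n - N extra zeros)
m_XXh = complex(np.mean(1.0 / (np.linalg.eigvalsh(X @ X.T) - z)))
m_XhX = complex(np.mean(1.0 / (np.linalg.eigvalsh(X.T @ X) - z)))

lhs = (n / N) * m_XhX
rhs = m_XXh + (N - n) / N * (1.0 / z)
gap = abs(lhs - rhs)
```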

On the wireless communication side, it turns out that the Stieltjes transform is directly connected to the expression of the mutual information, through the so-called Shannon transform, initially coined by Tulino and Verdú [17, see §2.3.3].

Definition 8.2.4. Let F be a probability distribution defined on ℝ⁺. The Shannon transform 𝒱_F of F is defined, for x ∈ ℝ⁺, as

𝒱_F(x) ≜ ∫_0^∞ log(1 + xλ) dF(λ).

(8.5)

The Shannon transform of F is related to its Stieltjes transform m_F through the expression

𝒱_F(x) = ∫_{1/x}^∞ (1/t − m_F(−t)) dt.

(8.6)

This last relation is fundamental to derive a link between the limit spectral distribution (l.s.d.) of a random matrix and the mutual information of a multidimensional channel, whose model is based on this random matrix.
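Relation (8.6) can be verified numerically on the e.s.d. of a Gram matrix. The sketch below (assuming NumPy; dimensions and grid are arbitrary) computes 𝒱_F(x) both directly and through m_F, truncating the integral at a large T; the neglected tail is of order E[λ]/T.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 200, 400
X = rng.standard_normal((N, n)) / np.sqrt(n)
lam = np.linalg.eigvalsh(X @ X.T)          # eigenvalues defining the e.s.d. F
x = 2.0

# Direct evaluation: V_F(x) = mean of log(1 + x * lambda) over the e.s.d.
V_direct = float(np.mean(np.log1p(x * lam)))

# Evaluation through the Stieltjes transform, (8.6), on a log-spaced grid [1/x, 1e6]
t = np.logspace(np.log10(1.0 / x), 6, 20001)
m_F_neg = np.mean(1.0 / (lam[:, None] + t[None, :]), axis=0)   # m_F(-t)
integrand = 1.0 / t - m_F_neg
V_stieltjes = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))
```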

We complete this section with the introduction of fundamental lemmas, required to derive the l.s.d. of random matrix models with independent entries, among which the Marčenko-Pastur law, and which will be necessary for the derivation of deterministic equivalents. These are recalled briefly below.

The first lemma is called the trace lemma, introduced in [15] (and extended in [18] in the form of a central limit theorem); we formulate it in the following theorem.

Theorem 8.2.7. Let A_1, A_2, …, with A_N ∈ ℂ^{N×N}, be a series of matrices with uniformly bounded spectral norm. Let x_1, x_2, … be random vectors, x_N ∈ ℂ^N with i.i.d. entries of zero mean, variance 1/N and finite eighth order moment, independent of A_N. Then

x_N† A_N x_N − (1/N) tr(A_N) → 0 almost surely,

(8.7)

as N → ∞.

Several alternative versions of this result exist in the literature, which can be adapted to different application needs, see e.g., [12, 14].
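The trace lemma can be illustrated in a few lines (assuming NumPy; the deterministic A_N below is an arbitrary bounded-norm choice): the quadratic form concentrates around the normalized trace as N grows.

```python
import numpy as np

rng = np.random.default_rng(4)
errors = {}
for N in (100, 400, 1600):
    A = np.diag(np.linspace(0.5, 1.5, N))        # deterministic, spectral norm <= 1.5
    # complex i.i.d. entries of zero mean and variance 1/N
    x = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2 * N)
    quad = (x.conj() @ A @ x).real               # the quadratic form x† A x
    errors[N] = float(abs(quad - np.trace(A).real / N))
```

The deviation has standard deviation of order 1/√N, so the error shrinks as N grows.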

The second important ingredient is the rank-1 perturbation lemma, given below [14, Lemma 2.6]:

Theorem 8.2.8. (i) Let z ∈ ℂ ∖ ℝ, A ∈ ℂ^{N×N}, B ∈ ℂ^{N×N} with B Hermitian, and v ∈ ℂ^N. Then

|(1/N) tr(A((B − zI_N)^{−1} − (B + vv† − zI_N)^{−1}))| ≤ ‖A‖ / (N |ℑ(z)|),

with ‖A‖ the spectral norm of A.

(ii) Moreover, if B is non-negative definite, for z < 0,

|(1/N) tr(A((B − zI_N)^{−1} − (B + vv† − zI_N)^{−1}))| ≤ ‖A‖ / (N |z|).

Generalizations of the above result can be found e.g., in [12].

Based on the above ingredients and classical results from probability theory, it is possible to prove the almost sure weak convergence to the Marc̆enko-Pastur law of the e.s.d. of XX†, where X ∈ ℂ^{N×n} has i.i.d. entries of zero mean and variance 1/n, as well as the convergence of the e.s.d. of more involved random matrix models based on matrices with independent entries. In particular, we will be interested in Section 8.5.2 in limiting results on the e.s.d. of sample covariance matrices.

Limiting Spectrum of Sample Covariance Matrices

The limiting spectral distribution of the sample covariance matrix unfolds from the following result, originally provided by Bai and Silverstein in [14], and further extended in, e.g., [10]:

Theorem 8.2.9. Consider the matrix B_N = A_N + X_N† T_N X_N ∈ ℂ^{n×n}, where X_N = ((1/√N) X_ij^N) ∈ ℂ^{N×n} with entries X_ij^N independent with zero mean, variance 1 and finite moment of order 2 + ε for some ε > 0 (ε independent of N, i, j); the e.s.d. F^{T_N} of T_N = diag(t_1^N, …, t_N^N) ∈ ℝ^{N×N} converges weakly and almost surely to F^T; A_N is n × n Hermitian whose e.s.d. converges weakly and almost surely to F^A; and N/n tends to c, 0 < c < ∞, as n, N grow large. Then the e.s.d. F^{B_N} of B_N converges weakly and almost surely to F^B such that, for z ∈ ℂ⁺, m_{F^B}(z) satisfies

m_{F^B}(z) = m_{F^A}(z − c ∫ t/(1 + t m_{F^B}(z)) dF^T(t)).

(8.8)

The solution of the implicit equation (8.8) in the dummy variable m_{F^B}(z) is unique on the set {z ∈ ℂ⁺ : m_{F^B}(z) ∈ ℂ⁺}. Moreover, if X_N has identically distributed entries, then the result holds without requiring that a moment of order 2 + ε exists.

In the following, using the tools from the previous sections, we give a sketch of the proof of Theorem 8.2.9.

Proof. The fundamental idea to infer the final formula of Theorem 8.2.9 is to first guess the form it should take. For this, write

m_{F^{B_N}}(z) ≜ (1/n) tr(A_N + X_N† T_N X_N − z I_n)^{−1},

and take D_N ∈ ℂ^{n×n} to be some deterministic matrix such that

m_{F^{B_N}}(z) − m_N(z) → 0 almost surely

with

m_N(z) ≜ (1/n) tr(A_N + D_N − z I_n)^{−1}

as N, n → ∞ with N/n → c. We then have, from the identity A^{−1} − B^{−1} = A^{−1}(B − A)B^{−1},

m_{F^{B_N}}(z) − m_N(z) = (1/n) tr((B_N − zI_n)^{−1}(D_N − X_N† T_N X_N)(A_N + D_N − zI_n)^{−1}).

Taking D_N = a_N I_n, and writing

X_N† T_N X_N = Σ_{k=1}^N t_k^N x_k x_k†

with x_k ∈ ℂ^n the kth column of X_N†, we further have

m_{F^{B_N}}(z) − m_N(z) = (a_N/n) tr((B_N − zI_n)^{−1}(A_N + D_N − zI_n)^{−1}) − (1/n) Σ_{k=1}^N t_k^N x_k†(A_N + D_N − zI_n)^{−1}(B_N − zI_n)^{−1} x_k.

Using the matrix inversion identity

(A + vv† − zI_N)^{−1} v = (A − zI_N)^{−1} v / (1 + v†(A − zI_N)^{−1} v),

each term in the sum of the right-hand side can further be expressed as

t_k^N x_k†(A_N + D_N − zI_n)^{−1}(B_N − zI_n)^{−1} x_k = t_k^N x_k†(A_N + D_N − zI_n)^{−1}(B_{(k)} − zI_n)^{−1} x_k / (1 + t_k^N x_k†(B_{(k)} − zI_n)^{−1} x_k),

where B_{(k)} = B_N − t_k^N x_k x_k† and where now x_k and (A_N + D_N − zI_n)^{−1}(B_{(k)} − zI_n)^{−1} are independent. But then, using the trace lemma, Theorem 8.2.7, we have that

x_k†(A_N + D_N − zI_n)^{−1}(B_{(k)} − zI_n)^{−1} x_k − (1/n) tr((A_N + D_N − zI_n)^{−1}(B_{(k)} − zI_n)^{−1}) → 0 almost surely.

Replacing the quadratic form by the trace in the Stieltjes transform difference, we then have for all large N,

m_{F^{B_N}}(z) − m_N(z) ≈ (a_N/n) tr((B_N − zI_n)^{−1}(A_N + D_N − zI_n)^{−1}) − (1/n) Σ_{k=1}^N t_k^N [(1/n) tr((A_N + D_N − zI_n)^{−1}(B_{(k)} − zI_n)^{−1})] / (1 + t_k^N (1/n) tr((B_{(k)} − zI_n)^{−1})).

But then, from the rank-1 perturbation lemma, Theorem 8.2.8, this is further approximated, for all large N, by

m_{F^{B_N}}(z) − m_N(z) ≈ (a_N/n) tr((B_N − zI_n)^{−1}(A_N + D_N − zI_n)^{−1}) − (1/n) Σ_{k=1}^N t_k^N [(1/n) tr((A_N + D_N − zI_n)^{−1}(B_N − zI_n)^{−1})] / (1 + t_k^N (1/n) tr((B_N − zI_n)^{−1})),

where we recognize in the right-hand side the Stieltjes transform m_{F^{B_N}}(z) = (1/n) tr((B_N − zI_n)^{−1}). Taking

a_N = (1/n) Σ_{k=1}^N t_k^N / (1 + t_k^N m_{F^{B_N}}(z)) ≈ c ∫ t/(1 + t m_{F^{B_N}}(z)) dF^T(t),

it is clear that the difference m_{F^{B_N}}(z) − m_N(z) becomes increasingly small for large N, and therefore m_{F^{B_N}}(z) is asymptotically close to

(1/n) tr((A_N + c ∫ t/(1 + t m_{F^{B_N}}(z)) dF^T(t) I_n − z I_n)^{−1}),

which is exactly

m_{F^{A_N}}(z − c ∫ t/(1 + t m_{F^{B_N}}(z)) dF^T(t)).

Hence the result.

The sample covariance matrix model corresponds to the particular case where A_N = 0. In that case, (8.8) becomes

m_{F̲}(z) = −(z − c ∫ t/(1 + t m_{F̲}(z)) dF^T(t))^{−1},

(8.9)

where we denoted F̲ the l.s.d. F^B in this special case. This special notation will often be used to differentiate the l.s.d. F of the matrix T_N^{1/2} X_N X_N† T_N^{1/2} from the l.s.d. F̲ of the reversed Gram matrix X_N† T_N X_N. Remark indeed from Lemma 8.2.1 that the Stieltjes transform m_{F̲} of the l.s.d. of X_N† T_N X_N is linked to the Stieltjes transform m_F of the l.s.d. F of T_N^{1/2} X_N X_N† T_N^{1/2} through

m_{F̲}(z) = c m_F(z) + (c − 1)(1/z)

(8.10)

and then we also have access to a characterization of F, which is exactly the asymptotic eigenvalue distribution of the sample covariance matrix model, when the denormalized columns √n x_1, …, √n x_n of √n X_N form a sequence of independent vectors with zero mean and covariance matrix n E{x_1 x_1†} = T_N. An illustrative simulation example is given in Figure 8.2, where T_N is composed of three distinct eigenvalues.
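The implicit equation (8.9) can be solved by simple fixed-point iteration for z ∈ ℂ⁺, and its solution matches the empirical Stieltjes transform of a simulated sample covariance matrix. The sketch below (assuming NumPy; the population spectrum, dimensions, test point z and the variance-1/n normalization of Theorem 8.3.1 are illustrative choices) also uses (8.10) to pass between F and F̲.

```python
import numpy as np

def m_Fbar(z, c, t_vals, t_weights, n_iter=500):
    """Fixed-point iteration for (8.9): m = -1/(z - c * sum_r w_r t_r/(1 + t_r m))."""
    m = -1.0 / z
    for _ in range(n_iter):
        m = -1.0 / (z - c * np.sum(t_weights * t_vals / (1.0 + t_vals * m)))
    return m

# Population spectrum F^T: three evenly weighted masses, as in Figure 8.2
t_vals = np.array([1.0, 3.0, 7.0])
t_weights = np.array([1.0, 1.0, 1.0]) / 3
c = 0.1

rng = np.random.default_rng(5)
N, n = 300, 3000
T_diag = np.repeat(t_vals, N // 3)
X = rng.standard_normal((N, n)) / np.sqrt(n)
S = np.sqrt(T_diag)[:, None] * X                     # T^{1/2} X
lam = np.linalg.eigvalsh(S @ S.T)                    # spectrum of T^{1/2} X X† T^{1/2}

z = 2.0 + 0.3j
m_F_emp = complex(np.mean(1.0 / (lam - z)))
m_Fbar_emp = c * m_F_emp + (c - 1) * (1.0 / z)       # relation (8.10)
m_Fbar_th = m_Fbar(z, c, t_vals, t_weights)
gap = abs(m_Fbar_emp - m_Fbar_th)
```

The iteration maps ℂ⁺ to ℂ⁺ and converges geometrically for any z bounded away from the real axis.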

Secondly, in addition to the uniqueness of the pair (z, m_{F̲}(z)) in the set {z ∈ ℂ⁺ : m_{F̲}(z) ∈ ℂ⁺} solution of (8.9), an inverse formula for the Stieltjes transform can be written in closed form, i.e., we can define a function x_{F̲}(m̲) on {m̲ ∈ ℂ⁺ : x_{F̲}(m̲) ∈ ℂ⁺}, such that

x_{F̲}(m̲) = −1/m̲ + c ∫ t/(1 + t m̲) dF^T(t).

(8.11)

This will turn out to be extremely useful to characterize the spectrum of F. More on this topic is discussed in Section 8.3.

8.3      Spectral Analysis

In this section, we summarize some important results regarding (i) the characterization of the support of the eigenvalues of a sample covariance matrix and (ii) the position of the individual eigenvalues of a sample covariance matrix. Point (i) is obviously a must-have from a purely mathematical viewpoint, but it is also fundamental to the study of estimators based on large dimensional random matrices. We will provide in Section 8.4 and in Section 8.5.2 estimators of functionals of the eigenvalues of a population covariance matrix based on the observation of a sample covariance matrix. We will in particular investigate large dimensional sample covariance matrix models whose population covariance matrix is composed of a few eigenvalues with large multiplicities. The validity of these estimators relies crucially on the fact that the support of the l.s.d. of the sample covariance matrix is formed of disjoint so-called clusters; each cluster is associated with one of the few eigenvalues of the population covariance matrix. Characterizing the limiting support is therefore paramount to the study of the estimator performance. Point (ii) is even more important for the estimators described above, as knowing the position of the individual eigenvalues allows one to derive such estimators. This second point is also fundamental to the derivation of hypothesis tests based on large dimensional matrix analysis, which will be introduced in Section 8.5.1. What we will show in particular is that, under mild assumptions on the random matrix model, all eigenvalues are asymptotically contained within the limiting support. Also, when the limiting support is divided into disjoint clusters, the number of sample eigenvalues in each cluster corresponds exactly to the multiplicity of the population eigenvalue attached to this cluster.
For signal sensing, this is fundamental, as the observation of a sample eigenvalue outside the expected limiting support of the pure noise hypothesis (called hypothesis H0) suggests that a signal is present in the observed data.


Figure 8.2:  Histogram of the eigenvalues of B_N = T_N^{1/2} X_N X_N† T_N^{1/2}, n = 3000, with T_N diagonal composed of three evenly weighted masses in (a) 1, 3, and 7, (b) 1, 3, and 4.

We start with the point (ii).

8.3.1      Exact Eigenvalue Separation

The results of interest here are due to Bai and Silverstein and are summarized in the following theorems.

Theorem 8.3.1 ([15]). Let X_N = ((1/√n) X_ij^N) ∈ ℂ^{N×n} have i.i.d. entries such that X_ij^N has zero mean, variance 1 and finite fourth order moment. Let T_N ∈ ℂ^{N×N} be nonrandom, whose e.s.d. F^{T_N} converges weakly to H. From Theorem 8.2.9, the e.s.d. of B_N = T_N^{1/2} X_N X_N† T_N^{1/2} ∈ ℂ^{N×N} converges weakly and almost surely towards some distribution function F, as N, n go to infinity with ratio c_N = N/n → c, 0 < c < ∞. Similarly, the e.s.d. of B̲_N = X_N† T_N X_N ∈ ℂ^{n×n} converges towards F̲ given by

F̲(x) = c F(x) + (1 − c) 1_{[0,∞)}(x).

Denote F̲_N the d.f. whose Stieltjes transform m_{F̲_N}(z) is the solution, for z ∈ ℂ⁺, of the following equation in m,

m = −(z − (N/n) ∫ τ/(1 + τm) dF^{T_N}(τ))^{−1},

and define F_N the d.f. such that

F̲_N(x) = (N/n) F_N(x) + (1 − N/n) 1_{[0,∞)}(x).

Let N_0 ∈ ℕ, and choose an interval [a, b], a > 0, outside the union of the supports of F and F_N for all N ≥ N_0. For ω ∈ Ω, the random space generating the series X_1, X_2, …, denote ℒ_N(ω) the set of eigenvalues of B_N(ω). Then,

P(ω : ℒ_N(ω) ∩ [a, b] ≠ ∅, i.o.) = 0.

This means concretely that, given a segment [a, b] outside the union of the supports of F and F_{N_0}, F_{N_0+1}, …, for all series B_1(ω), B_2(ω), …, with ω in some set of probability one, there exists M(ω) such that, for all N ≥ M(ω), no eigenvalue of B_N(ω) lies in [a, b].

As an immediate corollary of Theorem 8.2.5 and Theorem 8.3.1, we have the following results on the extreme eigenvalues of BN, with TN = IN.

Corollary 8.3.2. Let B_N ∈ ℂ^{N×N} be defined as B_N = X_N X_N†, with X_N ∈ ℂ^{N×n} with i.i.d. entries of zero mean, variance 1/n and finite fourth order moment. Then, denoting λ_min^N and λ_max^N the smallest and largest eigenvalues of B_N, respectively, we have

λ_min^N → (1 − √c)²,   λ_max^N → (1 + √c)²   almost surely,

as N, n → ∞ with N/n → c.
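A quick simulation (assuming NumPy; dimensions arbitrary) illustrates Corollary 8.3.2: for large N, the extreme eigenvalues stick to the edges (1 ∓ √c)² of the Marc̆enko-Pastur support, up to fluctuations of order N^{−2/3}.

```python
import numpy as np

rng = np.random.default_rng(6)
N, n = 1000, 4000                      # ratio c = 1/4
X = rng.standard_normal((N, n)) / np.sqrt(n)
lam = np.linalg.eigvalsh(X @ X.T)      # eigenvalues in increasing order

c = N / n
err_min = float(abs(lam[0] - (1 - np.sqrt(c))**2))    # distance to the edge a = 0.25
err_max = float(abs(lam[-1] - (1 + np.sqrt(c))**2))   # distance to the edge b = 2.25
```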

This result further extends to the case where B_N = X_N† T_N X_N, with T_N diagonal with ones on the diagonal but for a few entries different from one. This model, often referred to as the spiked model, lets some eigenvalues escape the limiting support of B_N (which is still the support of the Marc̆enko-Pastur law). Note that this is not inconsistent with Theorem 8.3.1 since here, for all finite N_0, the distribution functions F_{N_0}, F_{N_0+1}, … may all have a non-zero mass outside the support of the Marc̆enko-Pastur law. The segments [a, b] where no eigenvalues are found asymptotically must be away from these potential masses. The theorem, due to Baik, is given precisely as follows.

Theorem 8.3.3 ([19]). Let B̄_N = T̄_N^{1/2} X_N X_N† T̄_N^{1/2}, where X_N ∈ ℂ^{N×n} has i.i.d. entries of zero mean and variance 1/n, and T̄_N ∈ ℝ^{N×N} is diagonal, given by

T̄_N = diag(α_1, …, α_1, …, α_M, …, α_M, 1, …, 1),

where each α_j appears with multiplicity k_j and the value 1 appears N − Σ_{i=1}^M k_i times, with α_1 > … > α_M > 0 for some positive integer M. We denote here c = lim_N N/n. Call M_0 = #{j : α_j > 1 + √c}. For c < 1, take also M_1 to be such that M − M_1 = #{j : α_j < 1 − √c}. Denote additionally λ_1 ≥ … ≥ λ_N the ordered eigenvalues of B̄_N. We then have

•  for 1 ≤ j ≤ M_0, 1 ≤ i ≤ k_j,

λ_{k_1 + … + k_{j−1} + i} → α_j + c α_j/(α_j − 1)   almost surely,

•  for the other eigenvalues, we must discriminate upon c,

–  if c < 1,

*  for M_1 + 1 ≤ j ≤ M, 1 ≤ i ≤ k_j,

λ_{N − k_j − … − k_M + i} → α_j + c α_j/(α_j − 1)   almost surely,

*  for the indexes of eigenvalues of T̄_N inside [1 − √c, 1 + √c],

λ_{k_1 + … + k_{M_0} + 1} → (1 + √c)²,   λ_{N − k_{M_1+1} − … − k_M} → (1 − √c)²   almost surely,

–  if c > 1,

λ_n → (1 − √c)²   almost surely,   λ_{n+1} = … = λ_N = 0,

–  if c = 1,

λ_{min(n,N)} → 0   almost surely.

The important part of this result is that every α_j such that α_j > 1 + √c produces an eigenvalue of B̄_N outside the support of the Marc̆enko-Pastur law, found asymptotically at the position α_j + c α_j/(α_j − 1).
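A spiked-model simulation (assuming NumPy; the spike value α and the dimensions are arbitrary) recovers both effects: the largest eigenvalue separates from the bulk edge (1 + √c)² and lands near α + cα/(α − 1).

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 1000, 4000
c = N / n                                  # c = 1/4, so 1 + sqrt(c) = 1.5
alpha = 3.0                                # a single spike, well above 1 + sqrt(c)

T_diag = np.ones(N)
T_diag[0] = alpha                          # identity population covariance plus one spike
X = rng.standard_normal((N, n)) / np.sqrt(n)
S = np.sqrt(T_diag)[:, None] * X           # T^{1/2} X
lam_max = float(np.linalg.eigvalsh(S @ S.T)[-1])

predicted = alpha + c * alpha / (alpha - 1)     # asymptotic spike location, 3.375
bulk_edge = (1 + np.sqrt(c))**2                 # Marcenko-Pastur right edge, 2.25
```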

Now Theorem 8.3.1 and Theorem 8.3.3 ensure that, for a given N_0, no eigenvalue of B_N is found outside the support of F_{N_0}, F_{N_0+1}, … for all large N, but they do not say where the eigenvalues of B_N are approximately positioned. The answer to this question is provided by Bai and Silverstein in [20], in which the exact separation properties of the l.s.d. of such matrices B_N are discussed.

Theorem 8.3.4 ([20]). Assume BN is as in Theorem 8.3.1 with TN non-negative definite and FTN converging weakly to the distribution function H, and cN = N/n converging to c. Consider also 0 < a < b < ∞ such that [a, b] lies outside the support of F, the l.s.d. of BN. Denote additionally λk and τk the kth eigenvalues of BN and TN in decreasing order, respectively. Then we have

1.  If c(1 − H(0)) > 1, then the smallest eigenvalue x0 of the support of F is positive and λN → x0 almost surely, as N → ∞.

2.  If c(1 − H(0)) ≤ 1, or c(1 − H(0)) > 1 but [a,b] is not contained in [0, x0], then

P(ω : λ_{i_N} > b, λ_{i_N + 1} < a) = 1

for all N large, where i_N is the unique integer such that

τ_{i_N} > −1/m_F(b),   τ_{i_N + 1} < −1/m_F(a).

Theorem 8.3.4 states in particular that, when the limiting spectrum can be divided into disjoint clusters, the index of the sample eigenvalue at which a jump from one cluster (to the right of b) to the subsequent cluster (to the left of a) occurs corresponds exactly to the index of the population eigenvalue where a jump arises in the population eigenvalue spectrum (from −1/m_F(b) to −1/m_F(a)). Therefore, the sample eigenvalues distribute as one would expect between the consecutive clusters. This result will be used in Section 8.4 and Section 8.5.2 to find which sample eigenvalues are present in which cluster. This is necessary because we will perform complex integration on contours surrounding specific clusters, and residue calculus will demand that we know exactly which eigenvalues are found inside these contours.

Nonetheless, this still does not exactly answer the question of the exact characterization of the limiting support, which we treat in the following.

8.3.2      Support of l.s.d.

Remember from the inverse Stieltjes transform formula (8.4) that it is possible to determine the support of the l.s.d. F of a random matrix once we know its limiting Stieltjes transform m_F(z) for all z ∈ ℂ⁺. Thanks to Theorem 8.2.9, we know in particular that we can determine the support of the l.s.d. of a sample covariance matrix. Nonetheless, (8.4) features a limit as the imaginary part y of the argument z = x + iy of m_F(z) goes to zero, which has not been characterized to this point (even its existence everywhere is not ensured). Choi and Silverstein proved in [21] that this limit does exist in the case of sample covariance matrices, and went even further in characterizing exactly what this limit is. This uses the important Stieltjes transform inverse formula (8.11) and is summarized as follows.

Theorem 8.3.5 ([21]). Denote S_X^c the complement of S_X, the support of some d.f. X. Let B̲_N = X_N† T_N X_N ∈ ℂ^{n×n} have l.s.d. F̲, where X_N ∈ ℂ^{N×n} has i.i.d. entries of zero mean and variance 1/n, T_N has l.s.d. H and N/n → c. Let B = {m̲ : m̲ ≠ 0, −1/m̲ ∈ S_H^c} and x_{F̲} be the function defined on B by

x_{F̲}(m̲) = −1/m̲ + c ∫ t/(1 + t m̲) dH(t).

(8.12)

For x_0 ∈ ℝ*, we can then determine the limit of m_{F̲}(z) as z → x_0, z ∈ ℂ⁺, along the following rules.

(R.I)

If x_0 ∈ S_{F̲}^c, then the equation x_0 = x_{F̲}(m̲) in the dummy variable m̲ has a unique real solution m_0 ∈ B such that x_{F̲}′(m_0) > 0; this m_0 is the limit of m_{F̲}(z) when z → x_0, z ∈ ℂ⁺. Conversely, for m_0 ∈ B such that x_{F̲}′(m_0) > 0, x_0 = x_{F̲}(m_0) ∈ S_{F̲}^c.

(R.II)

If x_0 ∈ S_{F̲}, then the equation x_0 = x_{F̲}(m̲) in the dummy variable m̲ has a unique complex solution m_0 ∈ B with positive imaginary part; this m_0 is the limit of m_{F̲}(z) when z → x_0, z ∈ ℂ⁺.

From Theorem 8.3.5.(R.I), it is possible to determine the exact support of F̲. It indeed suffices to draw x_F̲(m̲) for −1/m̲ ∈ ℝ* ∖ S_H. Whenever x_F̲ is increasing on an interval I, x_F̲(I) lies outside S_F̲. The support S_F̲ of F̲, and therefore of F (modulo the mass in 0), is then defined exactly through its complement

$$S_{\underline{F}}^c = \bigcup_{a<b} \left\{ x_{\underline{F}}\big((a,b)\big) \;:\; \forall\, \underline{m} \in (a,b),\ x_{\underline{F}}'(\underline{m}) > 0 \right\}.$$

This is depicted in Figure 8.3 in the case where H is composed of three evenly weighted masses t1, t2, t3 placed in {1, 3, 5} or {1, 3, 10}, and c = 1/10. Notice that, in the case where t3 = 10, F is divided into three clusters, while when t3 = 5, F is divided into only two clusters; this is due to the fact that x_F̲ is nonincreasing on the interval (−1/3, −1/5).

From Figure 8.3 and Theorem 8.3.5, we now observe that x_F̲'(m̲) has exactly 2K_F roots, with K_F the number of clusters in F. Denote these roots m̲_1^− < m̲_1^+ ≤ m̲_2^− < m̲_2^+ ≤ … ≤ m̲_{K_F}^− < m̲_{K_F}^+. Each pair (m̲_j^−, m̲_j^+) is such that x_F̲([m̲_j^−, m̲_j^+]) is the jth cluster in F. We therefore have a way to determine the support of the asymptotic spectrum through the function x_F̲'. This is presented in the following result.

Theorem 8.3.6 ([22, 23]). Let B_N ∈ ℂ^{N×N} be defined as in Theorem 8.3.1. Then the support S_F of the l.s.d. F of B_N is given by

$$S_F = \bigcup_{j=1}^{K_F} [x_j^-, x_j^+],$$

where x_1^-, x_1^+, …, x_{K_F}^-, x_{K_F}^+ are defined as

$$x_j^- = -\frac{1}{\underline{m}_j^-} + \sum_{r=1}^{K} \frac{c_r t_r}{1 + t_r \underline{m}_j^-}, \qquad x_j^+ = -\frac{1}{\underline{m}_j^+} + \sum_{r=1}^{K} \frac{c_r t_r}{1 + t_r \underline{m}_j^+},$$


Figure 8.3:  x_F̲(m̲) for T_N diagonal composed of three evenly weighted masses in 1, 3, and 10 (a) and in 1, 3, and 5 (b), with c = 1/10 in both cases. Local extrema are marked with circles, inflexion points with squares. The support of F can be read on the right vertical axes.

with m̲_1^− < m̲_1^+ ≤ m̲_2^− < m̲_2^+ ≤ … ≤ m̲_{K_F}^− < m̲_{K_F}^+ the 2K_F (possibly counted with multiplicity) real roots of the equation in m̲,

$$\sum_{r=1}^{K} \frac{c_r t_r^2 \underline{m}^2}{(1 + t_r \underline{m})^2} = 1.$$
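The characterization of Theorem 8.3.6 lends itself to a direct numerical procedure: locate the 2K_F real roots of the equation above by a sign-change scan and bisection, then map them through x_F̲ to obtain the cluster edges. A minimal sketch for the two scenarios of Figure 8.3 (assuming numpy; grid resolution and scan range are arbitrary accuracy choices):

```python
import numpy as np

def cluster_edges(t, c_r):
    """Support edges [x_j^-, x_j^+] of the l.s.d. via Theorem 8.3.6 (numerical sketch)."""
    t, c_r = np.asarray(t, float), np.asarray(c_r, float)

    def g(m):   # g(m) = sum_r c_r t_r^2 m^2/(1 + t_r m)^2 - 1; its roots are those of x'
        return float(np.sum(c_r * t**2 * m**2 / (1.0 + t*m)**2) - 1.0)

    def x(m):   # x_F(m) = -1/m + sum_r c_r t_r/(1 + t_r m)
        return float(-1.0/m + np.sum(c_r * t / (1.0 + t*m)))

    grid = np.linspace(-3.0, -1e-4, 200_000)
    grid = grid[np.all(np.abs(grid[:, None] + 1.0/t) > 1e-3, axis=1)]  # avoid poles
    M = grid[:, None]
    v = np.sum(c_r * t**2 * M**2 / (1.0 + t*M)**2, axis=1) - 1.0
    roots = []
    for i in np.flatnonzero(np.sign(v[:-1]) != np.sign(v[1:])):
        lo, hi = grid[i], grid[i+1]
        for _ in range(60):                       # bisection refinement
            mid = 0.5*(lo + hi)
            if g(lo)*g(mid) > 0:
                lo = mid
            else:
                hi = mid
        roots.append(0.5*(lo + hi))
    edges = sorted(x(m) for m in roots)
    return [(edges[2*j], edges[2*j+1]) for j in range(len(edges)//2)]

print(cluster_edges([1, 3, 10], [0.1/3]*3))   # three clusters
print(cluster_edges([1, 3, 5],  [0.1/3]*3))   # two clusters: the 2nd and 3rd merge
```

The cluster counts reproduce the behavior described around Figure 8.3: for masses {1, 3, 10} with c = 1/10 the support splits into three clusters, while for {1, 3, 5} only two clusters remain.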

Notice further from Figure 8.3 that, while x_F̲'(m̲) might not have roots in some intervals (−1/t_{k−1}, −1/t_k), it always has a unique inflexion point there. This is proved in [23] by observing that x_F̲''(m̲) = 0 is equivalent to

$$\sum_{r=1}^{K} \frac{c_r t_r^3 \underline{m}^3}{(1 + t_r \underline{m})^3} - 1 = 0,$$

the left-hand side of which always has a positive derivative and shows vertical asymptotes in the neighborhood of the points −1/t_r; hence the existence of a unique inflexion point on every interval (−1/t_{k−1}, −1/t_k), for 1 ≤ k ≤ K, with the convention t_0 = 0+. When x_F̲ increases on an interval (−1/t_{k−1}, −1/t_k), it must have its inflexion point at a point of positive derivative (from the concavity change induced by the asymptotes). Therefore, to verify that cluster k_F is disjoint from clusters (k − 1)_F and (k + 1)_F (when they exist), it suffices to verify that the two roots of x_F̲''(m̲) adjacent to −1/t_k are such that x_F̲' is positive there. This is exactly what the following result states for the case of a sample covariance matrix whose population covariance matrix has few distinct eigenvalues, each with a large multiplicity.

Theorem 8.3.7 ([23, 24]). Let B_N be defined as in Theorem 8.3.1, with T_N = diag(τ_1, …, τ_N) ∈ ℝ^{N×N} diagonal containing K distinct eigenvalues 0 < t_1 < … < t_K, for some fixed K. Denote N_k the multiplicity of t_k (assuming the τ_i are ordered, we then have τ_1 = … = τ_{N_1} = t_1, …, τ_{N−N_K+1} = … = τ_N = t_K). Assume also that, for all 1 ≤ r ≤ K, N_r/n → c_r > 0, and N/n → c, with 0 < c < ∞. Then the cluster k_F associated with the eigenvalue t_k in the l.s.d. F of B_N is distinct from the clusters (k − 1)_F and (k + 1)_F (when they exist), associated with t_{k−1} and t_{k+1} in F, respectively, if and only if

$$\sum_{r=1}^{K} \frac{c_r t_r^2 \underline{m}_k^2}{(1 + t_r \underline{m}_k)^2} < 1, \qquad \sum_{r=1}^{K} \frac{c_r t_r^2 \underline{m}_{k+1}^2}{(1 + t_r \underline{m}_{k+1})^2} < 1,$$

(8.13)

where m̲_1, …, m̲_{K+1} are such that m̲_{K+1} = 0 and m̲_1 < m̲_2 < … < m̲_K are the K real solutions of the equation in m̲,

$$\sum_{r=1}^{K} \frac{c_r t_r^3 \underline{m}^3}{(1 + t_r \underline{m})^3} = 1.$$

For k = 1, this condition ensures 1_F = 2_F − 1; for k = K, it ensures K_F = (K − 1)_F + 1; and for 1 < k < K, it ensures (k − 1)_F + 1 = k_F = (k + 1)_F − 1.
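Condition (8.13) can be checked numerically: find the K inflexion-point roots, one per interval (−1/t_{k−1}, −1/t_k), by bisection, then evaluate the left-hand sides of (8.13). A minimal sketch (assuming numpy; the interval-indexing convention for the roots is the one reconstructed above):

```python
import numpy as np

def separable(t, c_r):
    """For each k, is cluster k_F distinct from its neighbors (Theorem 8.3.7 sketch)?"""
    t, c_r = np.asarray(t, float), np.asarray(c_r, float)
    K = t.size
    q = lambda m: float(np.sum(c_r*t**3*m**3/(1.0 + t*m)**3) - 1.0)  # x'' = 0 equation
    h = lambda m: float(np.sum(c_r*t**2*m**2/(1.0 + t*m)**2))        # condition: h(m) < 1

    # one root of q per interval (-1/t_{k-1}, -1/t_k), convention t_0 = 0+
    eps, roots = 1e-9, []
    for k in range(K):
        lo = -1.0/t[k-1] + eps if k > 0 else -1e6   # q(lo) < 0 near the left end
        hi = -1.0/t[k] - eps                        # q(hi) -> +infinity near the pole
        for _ in range(100):                        # bisection
            mid = 0.5*(lo + hi)
            if q(mid) < 0:
                lo = mid
            else:
                hi = mid
        roots.append(0.5*(lo + hi))
    roots.append(0.0)                               # convention m_{K+1} = 0
    return [h(roots[k]) < 1.0 and h(roots[k+1]) < 1.0 for k in range(K)]

print(separable([1, 3, 10], [0.1/3]*3))   # all clusters separated
print(separable([1, 3, 5],  [0.1/3]*3))   # clusters 2 and 3 merge
```

For the Figure 8.3 scenarios this recovers the stated behavior: all three clusters are separated when t3 = 10, while t3 = 5 makes the condition fail for k = 2 and k = 3.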

This result is again fundamental in the sense that the separability of subsequent clusters in the support of the l.s.d. of BN will play a fundamental role in the validity of statistical inference methods. In the subsequent section, we introduce the key ideas that allow statistical inference for sample covariance matrices.

8.4      Statistical Inference

Statistical inference allows for the estimation of deterministic parameters present in a stochastic model based on observations of random realizations of the model. In the context of sample covariance matrices, statistical inference methods consist in providing estimates of functionals of the eigenvalue distribution of the population covariance matrix T_N ∈ ℂ^{N×N} based on the observation Y_N = T_N^{1/2} X_N, with X_N ∈ ℂ^{N×n} a random matrix with independent and identically distributed entries. Different methods exist that allow for statistical inference, most of which rely on the study of the l.s.d. of the sample covariance matrix B_N = (1/n) Y_N Y_N^*. One of these methods relates to free probability theory [25], and more specifically to free deconvolution approaches, see e.g., [26], [27]. The idea behind free deconvolution is that the moments of the l.s.d. of some random matrix models can be written as polynomial functions of the moments of the l.s.d. of another (random) matrix in the model, under proper conditions. Typically, the moments of the l.s.d. of T_N can be written as polynomials in the moments of the (almost sure) l.s.d. of B_N, if X_N has Gaussian entries and the e.s.d. of T_N has uniformly bounded support. Therefore, to put it simply, one can obtain all moments of the l.s.d. of T_N based on a sufficiently large observation of B_N; this allows one to recover the l.s.d. of T_N (since the Carleman condition is satisfied) and therefore any functional of this l.s.d. However natural, this method has some major drawbacks. From a practical point of view, a reliable estimation of moments of high order requires extremely large dimensional matrix observations. This is due to the fact that the estimate of the moment of order k of the l.s.d. is based on polynomial expressions of the estimates of the moments of lower orders. A small error in the estimate of a low-order moment therefore propagates as a large error on the higher moments; it is therefore compelling to obtain accurate first-order estimates, hence large dimensional observations.
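As a quick illustration of such moment relations, the two lowest-order identities for this model read b_1 = a_1 and b_2 = a_2 + c a_1², where a_k and b_k denote the kth moments of the l.s.d. of T_N and B_N, respectively. These identities are standard but not derived in the text; the sketch below (assuming numpy; dimensions and population eigenvalues are arbitrary choices) checks them by simulation:

```python
import numpy as np

# Check b_1 = a_1 and b_2 = a_2 + c*a_1^2 for B_N = (1/n) Y Y^*, Y = T^{1/2} X,
# with X having i.i.d. standard complex Gaussian entries (illustrative identities,
# not derived in the text).
rng = np.random.default_rng(0)
N, n = 300, 900
c = N / n
tau = np.repeat([1.0, 3.0, 10.0], N // 3)          # population eigenvalues of T
X = (rng.standard_normal((N, n)) + 1j*rng.standard_normal((N, n))) / np.sqrt(2)
Y = np.sqrt(tau)[:, None] * X                      # T^{1/2} X for diagonal T
B = (Y @ Y.conj().T) / n
lam = np.linalg.eigvalsh(B)

a1, a2 = tau.mean(), (tau**2).mean()
b1, b2 = lam.mean(), (lam**2).mean()
print(abs(b1 - a1), abs(b2 - (a2 + c*a1**2)))      # both small
```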

We will not further investigate the moment-based approach above, which we discuss in more detail, with a proper introduction to free probability theory, in [28]. Instead, we introduce methods based on the Stieltjes transform, which rely strongly on the results described in the previous section. We will introduce this approach for the sample covariance matrix model discussed so far, because it will be instrumental in understanding the power estimator introduced in Section 8.5.2. Similar results have been provided for other models of interest in telecommunications, for instance the so-called information-plus-noise model, studied in [29].

The central idea is based on a trivial application of the Cauchy complex integration formula [30]. Consider f some complex holomorphic function on U ⊂ ℂ, H a distribution function, and denote G the functional

$$G(f) = \int f(z)\, dH(z).$$

From the Cauchy integration formula, we have, for a negatively oriented closed path γ enclosing the support of H and with winding number one,

$$G(f) = \int \frac{1}{2\pi i}\oint_{\gamma} \frac{f(\omega)}{z - \omega}\, d\omega\, dH(z) = \frac{1}{2\pi i}\oint_{\gamma} f(\omega) \int \frac{dH(z)}{z - \omega}\, d\omega = \frac{1}{2\pi i}\oint_{\gamma} f(\omega)\, m_H(\omega)\, d\omega,$$

(8.14)

the integral inversion being valid since f(ω)/(z − ω) is bounded for ω ∈ γ. Note that the sign inversion due to the negative contour orientation is compensated by the sign reversal of (ω − z) in the denominator.

If dH is a finite or countable sum of masses and one is interested in evaluating f(λ_k), with λ_k the position of the kth mass, of weight l_k, then on a negatively oriented contour γ_k enclosing λ_k and excluding the λ_j, j ≠ k,

$$l_k f(\lambda_k) = \frac{1}{2\pi i}\oint_{\gamma_k} f(\omega)\, m_H(\omega)\, d\omega.$$

(8.15)

This last expression is particularly convenient when one has access to H only through an expression of its Stieltjes transform.
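Relation (8.15) can be checked numerically on a toy discrete distribution. The sketch below (assuming numpy; the masses, weights, and contour radius are arbitrary choices) integrates f(ω)m_H(ω) on a negatively oriented circle around one mass and recovers l_k f(λ_k):

```python
import numpy as np

# Toy discrete d.f. H = 0.4*delta_1 + 0.6*delta_3 (masses and weights arbitrary)
masses = np.array([1.0, 3.0])
weights = np.array([0.4, 0.6])

def m_H(w):
    # Stieltjes transform m_H(w) = sum_k l_k / (lambda_k - w)
    return np.sum(weights / (masses - w))

f = lambda w: w**2                          # any function holomorphic near the contour

# negatively oriented (clockwise) circle of radius 1/2 around lambda_2 = 3
theta = np.linspace(0.0, 2.0*np.pi, 4001)
omega = 3.0 + 0.5*np.exp(-1j*theta)
domega = -0.5j*np.exp(-1j*theta)            # d(omega)/d(theta)

vals = np.array([f(w)*m_H(w)*dw for w, dw in zip(omega, domega)])
dth = theta[1] - theta[0]
integral = np.sum(vals[1:] + vals[:-1]) * dth / (2.0 * 2j*np.pi)  # trapezoid / (2*pi*i)
print(integral.real)                        # ≈ l_2 * f(3) = 0.6 * 9 = 5.4
```

The trapezoidal rule on a closed contour converges very fast here, since the integrand is analytic in a neighborhood of the circle.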

Now, in terms of random matrices, for the sample covariance matrix B_N = T_N^{1/2} X_N X_N^* T_N^{1/2}, we already noticed that the Stieltjes transform of the l.s.d. F of B_N (or equivalently of the l.s.d. F̲ of B̲_N = X_N^* T_N X_N) satisfies (8.9), which can further be rewritten

$$\frac{c}{m_{\underline{F}}(z)}\, m_H\!\left(-\frac{1}{m_{\underline{F}}(z)}\right) = -z\, m_{\underline{F}}(z) + (c-1),$$

(8.16)

where H is the l.s.d. of T_N. Note that it is allowed to evaluate m_H at −1/m_F̲(z) for z ∈ ℂ+, since −1/m_F̲(z) ∈ ℂ+.

As a consequence, if one only has access to F^{B_N} (from the observation B_N), then the only link from the observation to H is obtained by (i) the fact that F^{B̲_N} ⇒ F̲ almost surely, and (ii) the fact that F̲ and H are related through (8.16). Evaluating a functional f at an eigenvalue t_k of T_N is then made possible by (8.15). The relations (8.15) and (8.16) are the essential ingredients behind the derivation of a consistent estimator for f(t_k).

We now concentrate specifically on the sample covariance matrix B_N = T_N^{1/2} X_N X_N^* T_N^{1/2} defined as in Theorem 8.3.1, with T_N composed of K distinct eigenvalues t_1, …, t_K of multiplicities N_1, …, N_K, respectively. We further denote c_k ≜ lim_n N_k/n and will discuss the question of estimating t_k itself. What follows summarizes the original ideas of Mestre in [22] and [24]. We have from (8.15) that, for any continuous f and for any negatively oriented contour C_k that encloses t_k and t_k only, f(t_k) can be written under the form

$$\frac{N_k}{N} f(t_k) = \frac{1}{2\pi i}\oint_{C_k} f(\omega)\, m_H(\omega)\, d\omega = \frac{1}{2\pi i}\oint_{C_k} \frac{1}{N} \sum_{r=1}^{K} \frac{N_r\, f(\omega)}{t_r - \omega}\, d\omega$$

with H the limit of F^{T_N}: F^{T_N} ⇒ H. This provides a link between f(t_k), for all continuous f, and the Stieltjes transform m_H(z).

Letting f(x) = x and taking the limit N → ∞, with N_k/N → c_k/c and c ≜ c_1 + … + c_K the limit of N/n, we have

$$\frac{c_k}{c}\, t_k = \frac{1}{2\pi i}\oint_{C_k} \omega\, m_H(\omega)\, d\omega.$$

(8.17)

We now want to express mH as a function of mF, the Stieltjes transform of the l.s.d. F of BN. For this, we have the two relations (8.10), i.e.,

$$m_{\underline{F}}(z) = c\, m_F(z) + (c-1)\frac{1}{z}$$

and (8.16) with F^{T_N} ⇒ H, i.e.,

$$\frac{c}{m_{\underline{F}}(z)}\, m_H\!\left(-\frac{1}{m_{\underline{F}}(z)}\right) = -z\, m_{\underline{F}}(z) + (c-1).$$

Together, those two equations give the simpler expression

$$m_H\!\left(-\frac{1}{m_{\underline{F}}(z)}\right) = -z\, m_{\underline{F}}(z)\, m_F(z).$$

Applying the variable change ω = −1/m_F̲(z) in (8.17), we obtain

$$\frac{c_k}{c} t_k = \frac{1}{2\pi i}\oint_{C_{\underline{F},k}} \left[\frac{z}{c}\,\frac{m_{\underline{F}}'(z)}{m_{\underline{F}}(z)} + \frac{1-c}{c}\,\frac{m_{\underline{F}}'(z)}{m_{\underline{F}}(z)^2}\right] dz = \frac{1}{c}\,\frac{1}{2\pi i}\oint_{C_{\underline{F},k}} z\,\frac{m_{\underline{F}}'(z)}{m_{\underline{F}}(z)}\, dz,$$

(8.18)

where C_{F̲,k} is the preimage of C_k under z ↦ −1/m_F̲(z). The second equality in (8.18) comes from the fact that the second term in the integrand is the derivative of (c − 1)/(c m_F̲(z)), which therefore integrates to zero on a closed path, from classical complex integration rules [30]. Obviously, since z ∈ ℂ+ is equivalent to −1/m_F̲(z) ∈ ℂ+ (the same being true if ℂ+ is replaced by ℂ−), C_{F̲,k} is clearly continuous and of non-zero imaginary part whenever ℑ(z) ≠ 0. Now, one must be careful about the exact choice of C_{F̲,k}.

We make the important assumption that the index k satisfies the separability conditions of Theorem 8.3.7. That is, the cluster k_F associated with t_k in F is distinct from (k − 1)_F and (k + 1)_F (whenever they exist). Let us then pick two real values x_F^{(l)} and x_F^{(r)} such that

$$x_{(k-1)_F}^+ < x_F^{(l)} < x_{k_F}^- < x_{k_F}^+ < x_F^{(r)} < x_{(k+1)_F}^-$$

with {x_1^-, x_1^+, …, x_{K_F}^-, x_{K_F}^+} the support boundaries of F, as defined in Theorem 8.3.6. Now remember Theorem 8.3.5 and Figure 8.3; for x_F^{(l)} as defined previously, m_F̲(z) has a limit m̲^{(l)} ∈ ℝ as z → x_F^{(l)}, z ∈ ℂ+, and a limit m̲^{(r)} ∈ ℝ as z → x_F^{(r)}, z ∈ ℂ+, these two limits verifying

$$t_{k-1} < x^{(l)} < t_k < x^{(r)} < t_{k+1},$$

(8.19)

with x^{(l)} ≜ −1/m̲^{(l)} and x^{(r)} ≜ −1/m̲^{(r)}.

This is the most important outcome of the integration process. Let us define C_{F̲,k} to be any continuous contour surrounding cluster k_F such that C_{F̲,k} crosses the real axis in only two points, namely x_F^{(l)} and x_F^{(r)}. Since −1/m_F̲(ℂ+) ⊂ ℂ+ and −1/m_F̲(ℂ−) ⊂ ℂ−, the image C_k of C_{F̲,k} under z ↦ −1/m_F̲(z) does not cross the real axis whenever ℑ(z) ≠ 0, and is obviously continuously differentiable there; now C_k crosses the real axis at x^{(l)} and x^{(r)}, and is in fact continuous there. Because of (8.19), we then have that C_k is (at least) continuous and piecewise continuously differentiable and encloses only t_k. This is what is required to ensure the validity of (8.18).

The difficult part of the proof is completed. The rest will unfold more naturally. We start by considering the following expression,

$$\hat{t}_k \triangleq \frac{1}{2\pi i}\,\frac{n}{N_k}\oint_{C_{\underline{F},k}} z\,\frac{m_{F^{\underline{B}_N}}'(z)}{m_{F^{\underline{B}_N}}(z)}\, dz = \frac{1}{2\pi i}\,\frac{n}{N_k}\oint_{C_{\underline{F},k}} z\,\frac{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{(\lambda_i - z)^2}}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{\lambda_i - z}}\, dz,$$

(8.20)

where we remind that B̲_N ≜ X_N^* T_N X_N and where, if n ≥ N, we defined λ_{N+1} = … = λ_n = 0.

The value t̂_k can be viewed as the empirical counterpart of t_k. Now, we know from Theorem 8.2.9 that m_{F^{B_N}}(z) → m_F(z) and m_{F^{B̲_N}}(z) → m_F̲(z) almost surely. It is not difficult to verify, from the fact that m_F̲ is holomorphic, that the same convergence holds for the successive derivatives.

At this point, we need the two fundamental results that are Theorem 8.3.1 and Theorem 8.3.4. We know that, for all matrices B_N in a set of probability one, all the eigenvalues of B_N are contained in the support of F for all large N, and that the eigenvalues of B_N contained in cluster k_F are exactly {λ_i, i ∈ 𝒩_k} for these large N, with 𝒩_k ≜ {∑_{j=1}^{k−1} N_j + 1, …, ∑_{j=1}^{k} N_j}. Take such a B_N. For all large N, m_{F^{B̲_N}}(z) is uniformly bounded over N and z ∈ C_{F̲,k}, since C_{F̲,k} is away from the support of F. The integrand on the right-hand side of (8.20) is then uniformly bounded for all large N and for all z ∈ C_{F̲,k}. By the dominated convergence theorem, Theorem 16.4 in [16], we then have that t̂_k − t_k → 0 almost surely.

It then remains to evaluate t̂_k explicitly. This is performed by residue calculus [30], i.e., by determining the poles in the expanded expression of t̂_k (when developing m_{F^{B̲_N}}(z) in its full expression). Those poles are found to be λ_1, …, λ_N (indeed, the integrand of (8.20) behaves like O(1/(λ_i − z)) for z ∼ λ_i) and μ_1, …, μ_N, the N real roots of the equation in μ, m_{F^{B̲_N}}(μ) = 0 (indeed, the denominator of the integrand cancels for z = μ_i while the numerator is non-zero). Since C_{F̲,k} encloses only those values λ_i such that i ∈ 𝒩_k, the other poles are discarded. Noticing now that m_{F^{B̲_N}}(μ) → ±∞ as μ → λ_i, we deduce that μ_1 < λ_1 < μ_2 < … < μ_N < λ_N, and therefore the μ_i, i ∈ 𝒩_k, are all inside C_{F̲,k}, except maybe μ_j, j = min 𝒩_k. It can in fact be shown that this μ_j is also inside C_{F̲,k}. To notice this last remaining fact, observe simply that

$$\frac{1}{2\pi i}\oint_{C_k} \frac{1}{\omega}\, d\omega = 0,$$

since 0 is not contained inside the contour C_k. Applying the variable change ω = −1/m_F̲(z) as previously, this gives

$$\frac{1}{2\pi i}\oint_{C_{\underline{F},k}} \frac{m_{\underline{F}}'(z)}{m_{\underline{F}}(z)}\, dz = 0.$$

(8.21)

From the same reasoning as above, with the dominated convergence theorem argument, we have that for sufficiently large N and almost surely,

$$\left|\frac{1}{2\pi i}\oint_{C_{\underline{F},k}} \frac{m_{F^{\underline{B}_N}}'(z)}{m_{F^{\underline{B}_N}}(z)}\, dz\right| < \frac{1}{2}.$$

(8.22)

At this point, we need to proceed to residue calculus in order to compute the integral on the left-hand side of (8.22). We will in fact prove that the value of this integral is an integer, hence necessarily equal to zero from the inequality (8.22). Notice indeed that the poles of the integrand are the λ_i and the μ_i that lie inside the integration contour C_{F̲,k}, all of order one, with residues equal to −1 and 1, respectively. Therefore, the integral equals the number of such λ_i minus the number of such μ_i (remember that the integration contour is negatively oriented, so we need to reverse the signs). We, however, already know that this difference, for large N, equals either 0 or 1, since only the position of the leftmost μ_i is yet unknown. But since the integral is asymptotically less than 1/2 in absolute value, this implies that it is identically zero, and therefore the leftmost μ_i (indexed by min 𝒩_k) also lies inside the integration contour.

From this point on, we can evaluate (8.20), which is clearly determined since we know exactly which eigenvalues of BN are contained (with probability one for all large N) within the integration contour. This calls again for residue calculus, the steps of which are detailed below. Denoting

$$f(z) = z\,\frac{m_{F^{\underline{B}_N}}'(z)}{m_{F^{\underline{B}_N}}(z)},$$

we find that λ_i (inside C_{F̲,k}) is a pole of order 1 with residue

$$\lim_{z\to\lambda_i} (z - \lambda_i)\, f(z) = -\lambda_i,$$

which is straightforwardly obtained from the fact that m_{F^{B̲_N}}'(z)/m_{F^{B̲_N}}(z) ∼ 1/(λ_i − z) as z → λ_i. Also, μ_i (inside C_{F̲,k}) is a pole of order 1 with residue

$$\lim_{z\to\mu_i} (z - \mu_i)\, f(z) = \mu_i.$$

Since the integration contour is chosen to be negatively oriented, it must be kept in mind that the signs of the residues need be inverted in the final relation.

Noticing finally that μ_1, …, μ_N are also the eigenvalues of diag(λ) − (1/n)√λ√λ^T, with λ ≜ (λ_1, …, λ_N)^T and √λ ≜ (√λ_1, …, √λ_N)^T, from a lemma provided in [23, Lemma 1] and [31], we finally have the following statistical inference result for sample covariance matrices.
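This lemma is easy to verify numerically: the zeros of m_{F^{B̲_N}} coincide with the eigenvalues of diag(λ) − (1/n)√λ√λ^T. A minimal sketch (assuming numpy; the values of λ and n are arbitrary):

```python
import numpy as np

# Toy spectrum: N = 3 sample eigenvalues, n = 6 observations
lam = np.array([1.0, 2.0, 5.0])
N, n = lam.size, 6

# Candidate mu's: eigenvalues of diag(lambda) - (1/n) sqrt(lambda) sqrt(lambda)^T
s = np.sqrt(lam)
M = np.diag(lam) - np.outer(s, s) / n
mu = np.linalg.eigvalsh(M)

# m_{F^{B_N underline}}(z) = (1/n) [ sum_i 1/(lambda_i - z) + (n - N)/(0 - z) ]
def m_under(z):
    return (np.sum(1.0 / (lam - z)) - (n - N) / z) / n

residuals = np.array([abs(m_under(m)) for m in mu])
print(mu, residuals.max())   # residuals ≈ 0
```

The interlacing μ_1 < λ_1 < μ_2 < … < μ_N < λ_N claimed above can be read off the sorted outputs.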

Theorem 8.4.1 ([24]). Let B_N = T_N^{1/2} X_N X_N^* T_N^{1/2} ∈ ℂ^{N×N} be defined as in Theorem 8.3.7, i.e., T_N has K distinct eigenvalues t_1 < … < t_K with multiplicities N_1, …, N_K, respectively, for all r, N_r/n → c_r, 0 < c_r < ∞, and the separability conditions (8.13) are satisfied. Further denote λ_1 ≤ … ≤ λ_N the eigenvalues of B_N and λ = (λ_1, …, λ_N)^T. Let k ∈ {1, …, K}, and define

$$\hat{t}_k = \frac{n}{N_k} \sum_{m \in \mathcal{N}_k} (\lambda_m - \mu_m)$$

(8.23)

with 𝒩_k ≜ {∑_{j=1}^{k−1} N_j + 1, …, ∑_{j=1}^{k} N_j} and μ_1 ≤ … ≤ μ_N the ordered eigenvalues of the matrix diag(λ) − (1/n)√λ√λ^T.

Then, if condition (8.13) is fulfilled, we have

$$\hat{t}_k - t_k \to 0$$

almost surely as N,n → ∞, N/n → c, 0 < c < ∞.
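Theorem 8.4.1 translates directly into a short numerical recipe. The sketch below (assuming numpy; the population eigenvalues {1, 3, 10} with equal multiplicities and n = 10N are arbitrary choices, matching the separated-clusters scenario of Figure 8.3) implements (8.23):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, K = 90, 900, 3
t_true = np.array([1.0, 3.0, 10.0])
Nk = N // K                                    # equal multiplicities N_1 = N_2 = N_3

# Sample covariance B_N = T^{1/2} X X^* T^{1/2}, X with i.i.d. CN(0, 1/n) entries
tau = np.repeat(t_true, Nk)
X = (rng.standard_normal((N, n)) + 1j*rng.standard_normal((N, n))) / np.sqrt(2*n)
Y = np.sqrt(tau)[:, None] * X                  # T^{1/2} X for diagonal T
B = Y @ Y.conj().T
lam = np.linalg.eigvalsh(B)                    # ascending eigenvalues

# mu_1 <= ... <= mu_N: eigenvalues of diag(lambda) - (1/n) sqrt(lambda) sqrt(lambda)^T
s = np.sqrt(lam)
mu = np.linalg.eigvalsh(np.diag(lam) - np.outer(s, s)/n)

# t_hat_k = (n/N_k) * sum over the k-th block of (lambda_m - mu_m)
t_hat = np.array([(n/Nk) * np.sum(lam[k*Nk:(k+1)*Nk] - mu[k*Nk:(k+1)*Nk])
                  for k in range(K)])
print(t_hat)                                   # close to [1, 3, 10]
```

Exact separation (Theorem 8.3.4) is what justifies assigning the kth block of sorted sample eigenvalues to cluster k here.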

Similarly, for the quadratic form, the following holds.


Figure 8.4:  Estimation of t_1, t_2, t_3 in the model B_N = T_N^{1/2} X_N X_N^* T_N^{1/2} based on the first three empirical moments of B_N and Newton-Girard inversion, see [32], for N_1/N = N_2/N = N_3/N = 1/3, N/n = 1/10, over 100,000 simulation runs; (a) N = 30, n = 90; (b) N = 90, n = 270. Comparison is made against the Stieltjes transform estimator of Theorem 8.4.1.

Theorem 8.4.2 ([24]). Let B_N be defined as in Theorem 8.4.1, and denote B_N = ∑_{k=1}^{N} λ_k b_k b_k^*, with b_k^* b_i = δ_{ik}, the spectral decomposition of B_N. Similarly, denote T_N = ∑_{k=1}^{K} t_k U_k U_k^*, with U_k^* U_k = I_{N_k} and U_k ∈ ℂ^{N×N_k} the eigenspace associated with t_k. For given vectors x, y ∈ ℂ^N, denote

$$u(k; x, y) \triangleq x^* U_k U_k^* y.$$

Then we have

$$\hat{u}(k; x, y) - u(k; x, y) \xrightarrow{\text{a.s.}} 0$$

as N,n → ∞ with ratio cN = N/n → c, where

$$\hat{u}(k; x, y) \triangleq \sum_{i=1}^{N} \theta_k(i)\, x^* b_i b_i^* y$$

where θ_k(i) is defined by

$$\theta_k(i) = \begin{cases} -\phi_k(i), & i \notin \mathcal{N}_k \\ 1 + \psi_k(i), & i \in \mathcal{N}_k, \end{cases}$$

with

$$\phi_k(i) = \sum_{r \in \mathcal{N}_k} \left( \frac{\lambda_r}{\lambda_i - \lambda_r} - \frac{\mu_r}{\lambda_i - \mu_r} \right), \qquad \psi_k(i) = \sum_{r \notin \mathcal{N}_k} \left( \frac{\lambda_r}{\lambda_i - \lambda_r} - \frac{\mu_r}{\lambda_i - \mu_r} \right)$$

and Nk , μ1,…, μN defined as in Theorem 8.4.1.
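As a sanity check on these weights (whose index-set conventions are reconstructed above), note that in the many-samples regime one has μ_r ≈ λ_r, in which case θ_k(i) should reduce to the indicator of i ∈ 𝒩_k, recovering the naive projector estimate over the cluster's eigenvectors. A minimal sketch of the degenerate case μ = λ (assuming numpy; the spectrum and cluster are arbitrary):

```python
import numpy as np

def theta(cluster, i, lam, mu):
    """Weights theta_k(i) of Theorem 8.4.2 (index conventions as reconstructed above)."""
    inside = np.zeros(lam.size, bool)
    inside[list(cluster)] = True
    if inside[i]:
        r = ~inside          # psi_k(i): sum over r outside the cluster
        return 1.0 + np.sum(lam[r]/(lam[i]-lam[r]) - mu[r]/(lam[i]-mu[r]))
    r = inside               # phi_k(i): sum over r inside the cluster
    return -np.sum(lam[r]/(lam[i]-lam[r]) - mu[r]/(lam[i]-mu[r]))

lam = np.array([1.0, 1.1, 3.0, 3.2])
cluster = {2, 3}             # indices of the cluster associated with t_k

# Degenerate check mu = lambda: weights reduce to the indicator of the cluster
w = [theta(cluster, i, lam, lam.copy()) for i in range(4)]
print(w)                     # zeros outside the cluster, ones inside
```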

The estimator proposed in Theorem 8.4.1 is extremely accurate and is in fact much more flexible and precise than free deconvolution approaches. A visual comparison is proposed in Figure 8.4 for the same scenario as in Figure 8.3, where the free deconvolution (also called moment-based) method relies on the inference techniques proposed in, e.g., [26, 32]. Nonetheless, it must be stressed that the cluster separability condition, necessary for the validity of the Stieltjes transform approach, is mandatory and sometimes a rather strong assumption. Typically, the number of observations must be rather large compared to the number of sensors in order to be able to resolve close values of t_k.

8.5      Applications

In this section, we apply the random matrix methods developed above to the problems of multidimensional binary hypothesis testing and parameter estimation. More details on these applications as well as a more exhaustive list of applications, notably in the field of wireless communications, are provided in [28].

8.5.1      Binary Hypothesis Testing

We first consider the problem of detecting the presence of a signal source impaired by white Gaussian noise. The question is therefore to decide whether only noise is being sensed or if some data plus noise are sensed.

Precisely, we consider a signal source or transmitter of dimension K and a sink or receiver composed of N sensors. The linear filter between the transmitter and the receiver is modelled by the matrix H ∈ ℂ^{N×K}, with (i, j)th entry h_ij. If at time l the transmitter emits data, those are denoted by the K-dimensional vector x^{(l)} = (x_1^{(l)}, …, x_K^{(l)})^T ∈ ℂ^K. The additive white Gaussian noise at the receiver is modelled, at time l, by the vector σw^{(l)} = σ(w_1^{(l)}, …, w_N^{(l)})^T ∈ ℂ^N, where σ² denotes the variance of the noise vector entries. Without restriction of generality, we consider in the following zero mean and unit variance entries for both w^{(l)} and x^{(l)}, i.e., E{|w_i^{(l)}|²} = 1 and E{|x_i^{(l)}|²} = 1 for all i. We then denote y^{(l)} = (y_1^{(l)}, …, y_N^{(l)})^T the N-dimensional data received at time l. Assuming the filter is static during at least M sampling periods, we finally denote Y = [y^{(1)}, …, y^{(M)}] ∈ ℂ^{N×M} the matrix of the concatenated received vectors.

Depending on whether the transmitter emits data, we consider the following hypotheses

•  H0. Only background noise is received.

•  H1. Data plus background noise are received.

Therefore, under condition H0, we have the model

Y = σW

with W = [w^{(1)}, …, w^{(M)}] ∈ ℂ^{N×M}, and under condition H1

$$Y = \begin{pmatrix} H & \sigma I_N \end{pmatrix} \begin{pmatrix} X \\ W \end{pmatrix}$$

(8.24)

with X = [x^{(1)}, …, x^{(M)}] ∈ ℂ^{K×M}.

Under this hypothesis, we further denote Σ the covariance matrix of y^{(1)},

$$\Sigma = E\{y^{(1)} (y^{(1)})^*\} = HH^* + \sigma^2 I_N = U G U^*$$

where G = diag(ν_1 + σ², …, ν_N + σ²) ∈ ℝ^{N×N}, with {ν_1, …, ν_N} the eigenvalues of HH^* and U ∈ ℂ^{N×N} a certain unitary matrix.

The receiver is entitled to decide whether data were transmitted or not. It is a common assumption that σ² is known in advance, although it is uncommon to know the transfer matrix H. This is true in particular of the wireless signal sensing scenario, where H is the wireless fading channel matrix between two antenna arrays. We consider specifically the scenario where the probability distribution of H is unitarily invariant, which is consistent in wireless communications with channel models that do not a priori exhibit specific directions of energy propagation. This is in particular relevant when the filter H presents rotational invariance properties. For simplicity, we take the entries of H and x^{(l)} to be i.i.d. Gaussian with zero mean and E{|h_ij|²} = 1/K, although our study could go well beyond the Gaussian case.

For simplicity, we consider in the following K = 1, although a generalized result exists for K > 1 [2]. The Neyman-Pearson criterion for the receiver to establish whether data were transmitted is based on the ratio

$$C(Y) = \frac{P_{\mathcal{H}_1|Y}(Y)}{P_{\mathcal{H}_0|Y}(Y)},$$

(8.25)

where P_{H_i|Y}(Y) is the probability of the event H_i conditioned on the observation Y. For a given received space-time matrix Y, if C(Y) > 1, then the odds are that an informative signal was transmitted, while if C(Y) < 1, it is more likely that only background noise was captured. To ensure a low probability of false alarm (or false positive), i.e., the probability of declaring that a pure noise sample carries an informative signal, a certain threshold ξ is generally set such that, when C(Y) > ξ, the receiver declares that data were transmitted, while when C(Y) < ξ, the receiver declares that no data were sent. The question of the value at which ξ should be set to ensure a given maximally acceptable false alarm rate will not be treated here. We will however provide an explicit expression of (8.25) for the aforementioned model, and shall compare its performance to that achieved by the classical energy detector. The results provided in this section are taken from [2].

Applying Bayes’ rule, (8.25) becomes

$$C(Y) = \frac{P_{\mathcal{H}_1}\, P_{Y|\mathcal{H}_1}(Y)}{P_{\mathcal{H}_0}\, P_{Y|\mathcal{H}_0}(Y)},$$

with P_{H_i} the a priori probability of hypothesis H_i. We suppose that no side information allows the receiver to consider H_1 more or less probable than H_0, and therefore set P_{H_0} = P_{H_1} = 1/2, so that

$$C(Y) = \frac{P_{Y|\mathcal{H}_1}(Y)}{P_{Y|\mathcal{H}_0}(Y)}$$

(8.26)

reduces to a maximum likelihood ratio.

Likelihood under H0. In this first scenario, the noise entries w_i^{(l)} are Gaussian and independent. The probability density of Y, which can be seen as a random vector with NM entries, is then that of an NM-variate uncorrelated complex Gaussian with covariance matrix σ²I_{NM},

$$P_{Y|\mathcal{H}_0}(Y) = \frac{1}{(\pi\sigma^2)^{NM}}\, e^{-\frac{1}{\sigma^2}\mathrm{tr}(YY^*)}.$$

(8.27)

Denoting λ = (λ_1, …, λ_N)^T the eigenvalues of YY^*, (8.27) only depends on ∑_{i=1}^{N} λ_i, as follows:

$$P_{Y|\mathcal{H}_0}(Y) = \frac{1}{(\pi\sigma^2)^{NM}}\, e^{-\frac{1}{\sigma^2}\sum_{i=1}^{N}\lambda_i}.$$

Likelihood under H1. Under the data-plus-noise hypothesis H1, the problem is more involved. The entries of the channel matrix H are modeled as jointly uncorrelated Gaussian, with E{|h_ij|²} = 1/K. Therefore, since here K = 1, H ∈ ℂ^{N×1} and Σ = HH^* + σ²I_N has N − 1 eigenvalues g_2 = … = g_N equal to σ² and another distinct eigenvalue g_1 = ν_1 + σ² = (∑_{i=1}^{N} |h_{i1}|²) + σ². The density of g_1 − σ² is a complex chi-square distribution with N degrees of freedom (denoted χ_N²), which up to a scaling factor 2 is equivalent to a real χ_{2N}² distribution. Hence, the eigenvalue distribution of Σ, defined on [0, ∞)^N, reads

$$P_G(G) = \frac{1}{N}\,\frac{\left((g_1 - \sigma^2)^+\right)^{N-1} e^{-(g_1 - \sigma^2)}}{(N-1)!}\, \prod_{i=2}^{N} \delta(g_i - \sigma^2).$$

From the model H1, Y is then distributed as a correlated Gaussian, as follows:

$$P_{Y|\Sigma,\mathcal{I}_1}(Y, \Sigma) = \frac{1}{\pi^{MN} \det(G)^M}\, e^{-\mathrm{tr}(YY^* U G^{-1} U^*)},$$

where Ik denotes the prior information at the receiver “H1 and K = k.”

Since H is unknown, we need to integrate out all possible linear filters for the transmission model under H1 over the probability space of N × K matrices with Gaussian i.i.d. entries. From the invariance of Gaussian i.i.d. random matrices under left and right products with unitary matrices, this is equivalent to integrating out all possible covariance matrices Σ over the space of such nonnegative definite Hermitian matrices, as follows:

$$P_{Y|\mathcal{I}_1}(Y) = \int P_{Y|\Sigma,\mathcal{I}_1}(Y, \Sigma)\, P_\Sigma(\Sigma)\, d\Sigma.$$

Eventually, after the complete integration calculus given in the proof below, the Neyman-Pearson decision ratio (8.25) for this single-input multiple-output channel takes an explicit expression, given by the following theorem.

Theorem 8.5.1. The Neyman-Pearson test ratio C(Y) for the presence of data reads

$$C(Y) = \frac{1}{N} \sum_{l=1}^{N} \frac{\sigma^{2(N+M-1)}\, e^{\sigma^2 + \frac{\lambda_l}{\sigma^2}}}{\prod_{i=1, i\neq l}^{N} (\lambda_l - \lambda_i)}\, J_{N-M-1}(\sigma^2, \lambda_l)$$

(8.28)

with λ_1, …, λ_N the eigenvalues of YY^* and where

$$J_k(x, y) \triangleq \int_x^{+\infty} t^k\, e^{-t - \frac{y}{t}}\, dt.$$
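Since J_k has no closed form in general (see the remark below Theorem 8.5.1), it can be evaluated by direct numerical quadrature. A minimal sketch (assuming numpy; the truncation point and grid size are arbitrary accuracy choices):

```python
import numpy as np

def J(k, x, y, upper=80.0, num=200_001):
    """Numerically evaluate J_k(x, y) = int_x^inf t^k e^{-t - y/t} dt.

    The integrand decays like e^{-t}, so the integral is truncated at x + upper
    and computed with the trapezoidal rule (accuracy choices, not from the text).
    """
    lo = max(x, 1e-12)                   # avoid t = 0 when x = 0
    t = np.linspace(lo, lo + upper, num)
    f = t**k * np.exp(-t - y/t)
    dt = t[1] - t[0]
    return float(np.sum(f[1:] + f[:-1]) * dt / 2.0)

# Sanity check against the one available closed form family at y = 0:
# J_0(x, 0) = int_x^inf e^{-t} dt = e^{-x}
print(J(0, 1.0, 0.0), np.exp(-1.0))
```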

The proof of Theorem 8.5.1 is provided below. Among the interesting features of (8.28), note that the Neyman-Pearson test depends only on the eigenvalues of YY^*. This suggests that the eigenvectors of YY^* do not provide any information regarding the presence of data. The essential reason is that, both under H0 and H1, the eigenvectors of YY^* are isotropically distributed on the unit N-dimensional complex sphere, due to the Gaussian assumptions made here. As such, a given realization of the eigenvectors does indeed not carry any information relevant to the hypothesis test. The Gaussian assumption for H, brought by the maximum entropy principle, is in fact essential here. Note however that (8.28) does not reduce to a function of the sum ∑_i λ_i of the eigenvalues, as the classical energy detector suggests.

On the practical side, note that the integral J_k(x, y) does not take a closed-form expression, except for x = 0, [33, see e.g., p. 561]. This is rather inconvenient for practical purposes, since J_k(x, y) must either be evaluated numerically every time or be tabulated. It is also difficult to get any insight into the performance of such a detector for different values of σ², N and K. We provide hereafter a sketch of the proof of Theorem 8.5.1, in which classical multidimensional integration techniques are introduced. In particular, the tools introduced in Section 8.2 will be shown to be key ingredients of the derivation.

Proof. We start by noticing that H is Gaussian, and therefore that the joint density of its entries is invariant under left and right unitary products. As a consequence, the distribution of the matrix Σ = HH^* + σ²I_N is unitarily invariant. This allows us to write

$$P_{Y|\mathcal{I}_1}(Y) = \int P_{Y|\Sigma,\mathcal{I}_1}(Y,\Sigma)\, P_\Sigma(\Sigma)\, d\Sigma = \int_{U(N)\times(\mathbb{R}^+)^N} P_{Y|\Sigma,\mathcal{I}_1}(Y,\Sigma)\, P_G(G)\, dU\, dG = \int_{U(N)\times\mathbb{R}^+} P_{Y|\Sigma,\mathcal{I}_1}(Y,\Sigma)\, P_{g_1}(g_1)\, dU\, dg_1$$

with U(N) the space of N × N unitary matrices and Σ = U G U^*.

The latter can further be equated to

$$P_{Y|\mathcal{I}_1}(Y) = \int_{U(N)\times\mathbb{R}^+} \frac{e^{-\mathrm{tr}(YY^* U G^{-1} U^*)}}{\pi^{NM} \det(G)^M}\, \frac{\left((g_1-\sigma^2)^+\right)^{N-1} e^{-(g_1-\sigma^2)}}{N!}\, dU\, dg_1$$

with (x)+ ≜ max(x, 0) here.

To go further, we use the Harish-Chandra identity provided in Theorem 8.2.3. Denoting Δ(Z) the Vandermonde determinant of the matrix Z ∈ ℂ^{N×N} with eigenvalues z_1 ≤ … ≤ z_N,

$$\Delta(Z) \triangleq \prod_{i>j} (z_i - z_j),$$

(8.29)

the likelihood P_{Y|I_1}(Y) reads

$$P_{Y|\mathcal{I}_1}(Y) = \lim_{g_2,\ldots,g_N \to \sigma^2} \frac{e^{\sigma^2} (-1)^{\frac{N(N-1)}{2}} \prod_{j=1}^{N-1} j!}{\pi^{MN} \sigma^{2M(N-1)}\, N!} \int_{\sigma^2}^{+\infty} \frac{1}{g_1^M}\, (g_1 - \sigma^2)^{N-1} e^{-g_1}\, \frac{\det\left(e^{-\frac{\lambda_i}{g_j}}\right)}{\Delta(YY^*)\, \Delta(G^{-1})}\, dg_1$$

(8.30)

in which we remind that λ_1, …, λ_N are the eigenvalues of YY^*. Note the trick of replacing the known values of g_2, …, g_N by limits of scalars converging to these known values, which allows us to use the Harish-Chandra formula correctly. The remainder of the proof consists of deriving the explicit limits, which in particular relies on the following result [34, Lemma 6].

Theorem 8.5.2. Let f1,…,fN be a family of infinitely differentiable functions and let x1,…, xN ∈ ℝ. Denote

$$R(x_1, \ldots, x_N) \triangleq \frac{\det\left(\{f_i(x_j)\}_{i,j}\right)}{\prod_{i>j} (x_i - x_j)}.$$

Then, for p ≤ N and for x0 ∈ ℝ,

$$\lim_{x_1,\ldots,x_p \to x_0} R(x_1,\ldots,x_N) = \frac{\det\left(f_i(x_0), f_i'(x_0), \ldots, f_i^{(p-1)}(x_0), f_i(x_{p+1}), \ldots, f_i(x_N)\right)}{\prod_{p<j<i} (x_i - x_j)\, \prod_{i=p+1}^{N} (x_i - x_0)^p\, \prod_{j=1}^{p-1} j!}.$$
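Theorem 8.5.2 is easy to verify numerically on a small example. The sketch below (assuming numpy; the functions f_i(x) = e^{a_i x}, the point x_0, and N = 3, p = 2 are arbitrary choices) compares R at nearly coincident points with the closed-form limit:

```python
import numpy as np

a = np.array([0.5, 1.0, 1.5])            # f_i(x) = exp(a_i x), arbitrary
f  = lambda x: np.exp(a * x)             # column (f_1(x), f_2(x), f_3(x))
fp = lambda x: a * np.exp(a * x)         # derivatives f_i'(x)

x0, x3 = 0.3, 1.0                        # N = 3 points; the first p = 2 merge at x0

def R(x):
    # R(x_1, x_2, x_3) = det({f_i(x_j)}) / prod_{i>j} (x_i - x_j)
    M = np.column_stack([f(xj) for xj in x])
    vdm = (x[1]-x[0]) * (x[2]-x[0]) * (x[2]-x[1])
    return np.linalg.det(M) / vdm

eps = 1e-5
lhs = R([x0 - eps, x0 + eps, x3])        # nearly coincident x_1, x_2
# Limit formula for p = 2, N = 3: det(f_i(x0), f_i'(x0), f_i(x3)) / (x3 - x0)^2
rhs = np.linalg.det(np.column_stack([f(x0), fp(x0), f(x3)])) / (x3 - x0)**2
print(lhs, rhs)                          # nearly equal
```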
