3.12    Best Approximations and Orthogonal Projections

The following result plays an important role in many signal processing applications involving estimation and optimization tasks such as solving an overdetermined system of equations in the least-squares (LS) sense [10, 11].

Theorem 3.12.1. (Orthogonal Projection Theorem [6]) Let V be an inner-product vector space and let S be a subspace of it. For every vector v ∈ V there may exist a best approximation s ∈ S such that

\|v - s\| \le \|v - s'\|, \quad \forall s' \in S.

(3.100)

This condition holds if and only if

(v - s) \perp s', \quad \forall s' \in S.

(3.101)

Moreover, if this best approximation exists, then it is unique. Now let an orthogonal basis for S be denoted by {s1, s2,…, sm}. It can be shown that the best approximation of v by vectors in S can be computed as

s = \sum_{k=1}^{m} \frac{\langle v | s_k \rangle}{\|s_k\|^2}\, s_k.

(3.102)

In other words, the best approximation of v by vectors in S is given by the orthogonal projection s of vector v onto the subspace S. Notice further that in (3.102), the kth term in the sum is the projection of v on the direction of vector sk. Consider now that for any vector v ∈ V, there exists an orthogonal projection onto subspace S. Thus, a mapping P : V → S can be defined that associates to each v ∈ V its orthogonal projection onto S. Mapping P will be referred to as the orthogonal projection of V onto subspace S, and some of its properties will be considered in Exercise 3.23.19.
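
As a concrete illustration, the projection formula (3.102) is easy to check numerically. The following minimal sketch (Python/NumPy, with an arbitrarily chosen matrix whose columns span S; all names are illustrative) builds an orthonormal basis for S via QR and verifies that the residual v − s is orthogonal to the subspace, as stated in (3.101).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 2))      # columns span the subspace S of R^5
v = rng.standard_normal(5)           # vector to be approximated

# Orthonormal basis {s_1, s_2} for S obtained via reduced QR
Q, _ = np.linalg.qr(M)

# Best approximation (3.102): sum of the projections onto the basis vectors
s = sum((v @ Q[:, k]) * Q[:, k] for k in range(Q.shape[1]))

# The error v - s is orthogonal to every basis vector of S, cf. (3.101)
print(np.allclose(Q.T @ (v - s), 0))   # True
```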

3.13    Least Squares Approximations

Suppose T : V → W is a linear transformation such that for some v ∈ V and w ∈ W, the equation

T(v) = w

(3.103)

does not admit an exact solution; i.e., w does not lie in the range space of T. In practice, such equations are often encountered and solutions that satisfy (3.103) in an approximate sense must be determined. One of the most common approaches adopted for solving (3.103) is to find a vector v such that the least-squares (LS) criterion:

e = \|T(v) - w\|^2,

is minimized. This problem reduces to finding the vector w′ ∈ R(T) such that

T(v) = w',

(3.104)

and ‖w′ - w‖ is minimized. The corresponding solution v of (3.104) is called the least squares solution. The following result characterizes the LS solution of (3.103) (see e.g., [6, 8]).

Theorem 3.13.1. Vector v ∈ V is the minimizer of ‖T(v) − w‖ if and only if

T^*T(v) = T^*(w).

(3.105)

If T*T is invertible, then the LS solution is unique and given by:

v = (T^*T)^{-1}T^*(w),

(3.106)

and

w' = T(v) = T(T^*T)^{-1}T^*(w).

(3.107)

A justification of this result is delegated to Exercise 3.23.20.

Now consider the inconsistent overdetermined m × n (m > n) system of linear equations Av = w. If w does not lie in the column space R(A), there is no exact solution. Using the LS method, a unique vector ŵ ∈ R(A) that is closest to w can be found; or, equivalently, we can choose v̂ to minimize the norm of the error ‖Av̂ − w‖. Finding the least-squares solution v̂ is equivalent to finding

\hat{w} = A\hat{v},

(3.108)

that is closer to w than any other vector in the column space R(A). Any solution to the new system of equations:

Av = \hat{w}

(3.109)

is referred to as a least-squares solution for Av = w. Finally, note that the least-squares solution to Av = w is any vector v that minimizes ‖Av − w‖. The following result characterizes the LS solution of minimum norm for an overdetermined system of linear equations [3, 6, 8].

Theorem 3.13.2. If A is an m × n matrix (m ≥ n) and A*A is invertible, then among all the least-squares solutions of the system of equations Av = w, there is a unique LS solution of minimum norm given by:

v = (A^*A)^{-1}A^*w.

(3.110)

The matrix (A*A)−1A* is called the Moore-Penrose pseudoinverse of A. The approximation ŵ of w by the vectors in R(A) is given by:

\hat{w} = Av = A(A^*A)^{-1}A^*w,

(3.111)

and it is obtained by projecting w onto R(A). As a corollary, the error term w − ŵ is orthogonal to R(A). Reference [3] presents several extensions of these results and a characterization of all the LS solutions.
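
To make the matrix case concrete, the short sketch below (NumPy; the matrix and data are arbitrary examples) solves an overdetermined system in the LS sense and checks that the normal-equations solution (3.110), NumPy's lstsq routine, and the pseudoinverse all agree, and that the residual is orthogonal to R(A).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))        # m > n, generically full column rank
w = rng.standard_normal(8)

v_normal = np.linalg.solve(A.T @ A, A.T @ w)      # (A^T A)^{-1} A^T w
v_lstsq, *_ = np.linalg.lstsq(A, w, rcond=None)   # library LS solver
v_pinv = np.linalg.pinv(A) @ w                    # Moore-Penrose pseudoinverse

print(np.allclose(v_normal, v_lstsq), np.allclose(v_normal, v_pinv))
# The residual w - A v is orthogonal to the column space R(A)
print(np.allclose(A.T @ (w - A @ v_normal), 0))
```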

3.14    Angles Between Subspaces

3.14.1    Principal Angles Between Subspaces

Recall that given two non-zero vectors v, w ∈ ℝn, the angle between them is computed as

\theta = \cos^{-1}\frac{\langle v | w \rangle}{\|v\|\,\|w\|}, \qquad \left( 0 \le \theta \le \frac{\pi}{2} \right).

(3.112)

Now suppose that we are given two subspaces represented in terms of the range spaces of two matrices A and B. A set of angles between them can be defined recursively. These angles will be referred to as principal (canonical) angles. Matrices A and B are assumed real-valued with the same number n of rows, dim R(A) = p, and dim R(B) = q (q ≥ p). The principal angles θ1, θ2, …, θp ∈ [0, π/2] between the two column spaces R(A) and R(B) are recursively defined via:

\cos(\theta_k) = \max_{v \in \mathcal{R}(A)} \max_{w \in \mathcal{R}(B)} v^T w = v_k^T w_k, \qquad k = 1, \ldots, p,

subject to the conditions ‖v‖ = ‖w‖ = 1, vTvi = 0, wTwi = 0, i = 1,…, k − 1. Notice that the notation ‖·‖ stands for the standard Euclidean norm of a vector, and the resulting vectors vi and wi are called the principal vectors. Notice also that during step k, the algorithm determines θk and the associated principal vectors vk and wk by searching for vectors in the two subspaces that are orthogonal to vi and wi, i = 1,…, k − 1, respectively. Therefore, by searching for the pair of vectors with the smallest angle between them (largest cosine) that are orthogonal to the already found principal vectors, the complete sets of principal angles and principal vectors are obtained.
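
In practice, the recursion above is usually evaluated via orthonormal bases and an SVD (the Björck-Golub approach): if Q_A and Q_B are orthonormal bases for R(A) and R(B), the cosines of the principal angles are the singular values of Q_A^T Q_B. A minimal NumPy sketch with arbitrary test matrices:

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)                  # orthonormal basis for R(A)
    Qb, _ = np.linalg.qr(B)                  # orthonormal basis for R(B)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))   # angles in ascending order

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))
B = rng.standard_normal((10, 4))
print(principal_angles(A, B))                # min(3, 4) = 3 angles in [0, pi/2]
```

SciPy also ships a scipy.linalg.subspace_angles routine based on a numerically more careful variant of this computation.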

The concept of principal angles between two vector spaces has found applications in assessing the degree of correlation between two sets of random variables and has been used in canonical analysis [13, 14]. Applications of principal angles also include random processes [15, 16], stochastic realization of random processes and system identification [17, 18, 19], pattern recognition and machine learning (see e.g., [20, 21]), and signal processing and communications (see e.g., the signal estimation and detection applications presented in [22]).

3.15    Eigenvalues and Eigenvectors

Let T be a linear operator on a vector space V, T : V → V. An eigenvalue of T is a scalar λ such that

T(v) = \lambda v,

(3.113)

for a non-zero vector v ∈ V. The vector v is called an eigenvector of T. Similarly, for an n × n matrix A that is the matrix representation of the linear transformation T, an eigenvalue is a scalar λ that makes the matrix A − λI singular. The eigenvector of matrix A corresponding to eigenvalue λ is any non-zero vector v that satisfies the equation Av = λv. The polynomial in λ expressed as:

\chi_A(\lambda) = \det(A - \lambda I),

(3.114)

is called the characteristic polynomial of A and the equation

\det(A - \lambda I) = 0,

(3.115)

is referred to as the characteristic equation of A. The set of roots of the characteristic equation provides the set of eigenvalues, and therefore it is also referred to as the spectrum of A and denoted by σ(A). Thus, every eigenvalue λ is a root of the characteristic polynomial. The eigenspace corresponding to an eigenvalue λ of A is defined as the subspace

E_\lambda = \{ v \in V \mid Av = \lambda v \}.

(3.116)

If A has distinct eigenvalues, the eigenspaces are all one-dimensional. The following general results, whose proofs are deferred to Exercises 3.23.21 and 3.23.22, hold.

Theorem 3.15.1. Let A be an n × n matrix with distinct eigenvalues. Then, the eigenvectors of A form a linearly independent set.

Theorem 3.15.2. The eigenvalues of a Hermitian matrix are real and the eigenvectors corresponding to distinct eigenvalues are orthogonal.

The concepts of eigenvalue and eigenvector play an important role in building high-resolution spectral estimation techniques, designing efficient algorithms for localization and tracking in antenna array processing [12], checking stability of digital filters, system identification, designing efficient signaling schemes in multiantenna (multi-input multi-output (MIMO)) based communication systems [23], etc.

3.15.1    Diagonalization

A square matrix A is diagonalizable if it can be factored as follows:

A = S\Lambda S^{-1},

(3.117)

for some non-singular matrix S and diagonal matrix Λ. Notice that Λ has the same eigenvalues as A. In general, two matrices A and Λ that are related by means of an equation of the form (3.117) are referred to as similar matrices, and they have the same set of eigenvalues. The following result provides an alternative characterization for diagonalizable matrices.

Theorem 3.15.3. If A is an n × n matrix, then it is diagonalizable if it admits n linearly independent eigenvectors or n distinct eigenvalues.

A justification of this result is left to Exercise 3.23.23.

Using the characteristic polynomial, it can be checked that the eigenvalues of a diagonal matrix coincide with its diagonal entries. Also, it is not difficult to show that the ith column of S is an eigenvector of A corresponding to the ith diagonal entry of Λ (the ith eigenvalue of A), for any i = 1,…, n. The following result shows a simple way to calculate the higher-order powers of a matrix in terms of the higher-order powers of its eigenvalues.

Theorem 3.15.4. Let A be an n × n matrix with eigenvalues λ1,…, λn. Then the eigenvalues of A^k (k ∈ ℤ; for negative k, A is assumed non-singular) are λ_1^k,…, λ_n^k. Moreover, if A is diagonalizable as in (3.117), it follows that

A^k = (S\Lambda S^{-1})^k = S\Lambda^k S^{-1}.

(3.118)
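
A quick numerical check of the diagonalization (3.117) and of the power formula (3.118), using NumPy on an arbitrary matrix whose eigenvalues are (generically) distinct:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

lam, S = np.linalg.eig(A)          # eigenvalues and eigenvectors (columns of S)
Lam = np.diag(lam)

print(np.allclose(S @ Lam @ np.linalg.inv(S), A))          # A = S Lambda S^{-1}
print(np.allclose(np.linalg.matrix_power(A, 5),
                  S @ np.diag(lam**5) @ np.linalg.inv(S)))  # A^5 = S Lambda^5 S^{-1}
```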

3.16    Schur Factorization and Spectral Theorem

Theorem 3.16.1. Let A be an n × n matrix. There exists a unitary matrix Q such that the matrix

T = Q^*AQ

(3.119)

is upper triangular, or equivalently A can be factorized as

A = QTQ^*,

(3.120)

where T is an upper triangular matrix.

The decomposition (3.120) is referred to as Schur’s factorization. Notice that the diagonal elements of T consist of the eigenvalues of A. The proof of this theorem can be done using the induction principle with respect to n. Alternative justifications for Schur factorization can be found in [3, 5, 24].

Theorem 3.16.2. (Spectral Theorem) For any Hermitian matrix A, there is a unitary diagonalizing matrix Q.

Stated differently, if A is a Hermitian matrix, the matrix T in Schur's factorization must be diagonal. Indeed, assume there exists a unitary matrix Q such that (3.119) holds, where T is an upper triangular matrix. Since A is Hermitian, A* = A. Hence, T* = Q*A*Q = Q*AQ = T; thus, T is also Hermitian. Being also upper triangular, T must be diagonal. As an immediate corollary, if A is a real-valued symmetric matrix, then it can be diagonalized via an orthogonal matrix Q, i.e., the equality Q^TAQ = T holds. Since Q diagonalizes A to T, the diagonal elements of T are the eigenvalues of A, and the columns of Q consist of eigenvectors of A, i.e., A = QTQ^T represents the spectral decomposition of a symmetric matrix.
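
Both factorizations can be checked numerically. The sketch below (SciPy/NumPy, arbitrary test matrices) computes a complex Schur form A = QTQ* and, for a symmetric matrix, the spectral decomposition via an eigendecomposition:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))

# Schur factorization: A = Q T Q^*, with T upper triangular (complex form)
T, Q = schur(A, output='complex')
print(np.allclose(Q @ T @ Q.conj().T, A))    # reconstruction of A
print(np.allclose(np.tril(T, -1), 0))        # T is indeed upper triangular

# Spectral theorem for a symmetric matrix: A_s = Q diag(lam) Q^T
A_s = A + A.T
lam, Qs = np.linalg.eigh(A_s)
print(np.allclose(Qs @ np.diag(lam) @ Qs.T, A_s))
```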

3.17    Singular Value Decomposition (SVD)

Let A be an m × n real-valued matrix. Then the Singular Value Decomposition (SVD) of matrix A is given by the factorization [24]:

A = U\Sigma V^T,

(3.121)

where Σ is an m × n diagonal matrix, and U and V are two orthogonal matrices of dimensions m × m and n × n, respectively. In general, Σ can be expressed as

\Sigma = \begin{bmatrix} \tilde{\Sigma} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{bmatrix},

(3.122)

where

\tilde{\Sigma} = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_r \end{bmatrix} \in \mathbb{R}^{r \times r}, \qquad \sigma_i > 0, \quad 1 \le i \le r,

(3.123)

and r stands for the rank of matrix A. The entries σi, i = 1,…, r, on the main diagonal of Σ are positive and are referred to as the (non-zero) singular values of A. Notice that using (3.121), it follows that AA^T = UΣΣ^TU^T and A^TA = VΣ^TΣV^T. Because matrices U and V are orthogonal, and ΣΣ^T and Σ^TΣ are diagonal, it follows that the singular values of matrix A are equal to the square roots of the non-zero eigenvalues of AA^T and A^TA. Furthermore, the columns of U and V are the eigenvectors of AA^T and A^TA, respectively. The SVD holds also for complex-valued matrices. If A is a complex-valued matrix, then its SVD assumes the factorization A = UΣV^*, where U and V are unitary matrices and Σ is a diagonal matrix with non-negative entries. Although in the SVD factorization σ1,…, σr are unique, U and V are not.

The SVD (3.121) can be re-expressed in these two alternative ways:

AV = U\Sigma,

(3.124)

and

A^T U = V\Sigma^T.

(3.125)

Equating the ith columns in the left-hand-side (LHS) and right-hand side (RHS) of (3.124) and (3.125), it follows that:

A v_i = \sigma_i u_i, \qquad i = 1, 2, \ldots, r,

(3.126)

and

A^T u_i = \begin{cases} \sigma_i v_i, & 1 \le i \le r \\ 0, & r+1 \le i \le m, \end{cases}

(3.127)

where vectors ui and vi stand for the ith columns of matrices U and V, respectively. The vectors vi and ui are called right singular vectors and left singular vectors of A, respectively. It follows also that {v1, v2,…, vr} and {u1, u2,…, ur} represent orthonormal bases for R(A^T) and R(A), respectively. Likewise, {v_{r+1}, v_{r+2},…, v_n} and {u_{r+1}, u_{r+2},…, u_m} are orthonormal bases for Null(A) and Null(A^T), respectively. Thus, the SVD provides bases for all four fundamental subspaces of a matrix: R(A), Null(A), R(A^T), and Null(A^T). Notice also that the rank r of A is obtained immediately from its SVD, and it is equal to the number of non-zero entries located on the main diagonal of Σ. The SVD also enables the construction of the orthogonal projections onto the four fundamental subspaces of a matrix. For example, if U_{1:r} and V_{1:r} stand for the sub-matrices formed by columns 1 through r of U and V, respectively, then U_{1:r}U_{1:r}^T and V_{1:r}V_{1:r}^T represent the orthogonal projection matrices onto R(A) and R(A^T), respectively [3, 24].
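
The following NumPy sketch (arbitrary rank-deficient test matrix) illustrates how the SVD exposes the rank and yields the orthogonal projector onto R(A):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))   # 6x5, rank <= 4

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10 * s[0])                 # numerical rank from the singular values
print(r, np.linalg.matrix_rank(A))

U1 = U[:, :r]                                # orthonormal basis for R(A)
P_range = U1 @ U1.T                          # orthogonal projector onto R(A)
x = rng.standard_normal(5)
print(np.allclose(P_range @ (A @ x), A @ x)) # A x already lies in R(A)
```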

Reference [24] presents a number of applications of the SVD in calculating low-rank matrix approximations, principal angles between subspaces, generalized inverses, solving systems of equations in the standard or total least-squares sense, orthogonal projections, etc. During the past decades, the SVD has proved to be a powerful tool in solving numerous problems in statistics, signal processing, wireless communications, information theory, pattern recognition, machine learning, and other fields. Also, the SVD played an important role in enabling the calculation of the capacity of a MIMO channel by reducing the MIMO channel to a number of independent single-input single-output (SISO) channels [23], developing efficient high-resolution spectral estimation techniques [12], and designing optimal channel estimation, signaling, and equalization techniques.

3.18    Rayleigh Quotient

Suppose A is an n × n Hermitian matrix, and denote by λ1 ≥ λ2 ≥ … ≥ λn and u1, u2,…, un its eigenvalues and corresponding orthonormal eigenvectors, respectively. Notice also that the Spectral Theorem 3.16.2 guarantees the existence of a factorization of the form A = UΛU^H, with U unitary and Λ a diagonal matrix containing on its main diagonal the eigenvalues of A. One can observe that the orthonormal vectors u1, u2,…, un, representing the columns of U, stand also for the eigenvectors of A corresponding to eigenvalues λ1,…, λn. If x is a non-zero vector in ℂ^n, then the Rayleigh quotient of x with respect to the Hermitian matrix A is defined as:

R(x) = \frac{\langle Ax | x \rangle}{\langle x | x \rangle} = \frac{x^*Ax}{x^*x}.

(3.128)

It can be shown that the minimum and the maximum values of the Rayleigh quotient are λn and λ1, respectively, i.e.,

\lambda_n \le R(x) \le \lambda_1.

(3.129)

A quick justification of (3.129) can be obtained by expressing the arbitrary vector x ∈ ℂ^n as a linear combination of the orthonormal eigenvectors ui, i = 1,…, n, of A (or, equivalently, in terms of the singular vectors of A obtained via the SVD),

x = \sum_{i=1}^{n} a_i u_i.

(3.130)

Plugging (3.130) into (3.128), it follows that:

R(x) = \frac{\sum_{i=1}^{n} \lambda_i |a_i|^2}{\sum_{i=1}^{n} |a_i|^2} = \frac{\sum_{i=1}^{n} \lambda_i |a_i|^2}{\|a\|_2^2},

(3.131)

where a = [a1, a2,…,an]T. Because of the ordering λ1 ≥ λ2 ≥…≥ λn, it follows that the numerator in (3.131) can be lower- and upper-bounded, respectively, by:

\lambda_n \|a\|_2^2 \le \sum_{i=1}^{n} \lambda_i |a_i|^2 \le \lambda_1 \|a\|_2^2,

from which (3.129) follows immediately. The Rayleigh quotient has found applications in signal processing and wireless communications, e.g., in the design of filters and equalization schemes with improved performance.
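
A small numerical illustration of the bounds (3.129), with an arbitrary Hermitian (here real symmetric) test matrix and random test vectors:

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((5, 5))
A = B + B.T                                   # real symmetric (Hermitian) test matrix
lam = np.linalg.eigvalsh(A)                   # eigenvalues in ascending order

def rayleigh(A, x):
    return (x.conj() @ A @ x) / (x.conj() @ x)

R = [rayleigh(A, rng.standard_normal(5)) for _ in range(1000)]
print(lam[0] <= min(R) and max(R) <= lam[-1])  # lambda_min <= R(x) <= lambda_max
```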

3.19    Application of SVD and Rayleigh Quotient: Principal Component Analysis

Suppose we have an m × n (m ≥ n) matrix X of rank r whose column vectors are possibly correlated (i.e., not orthogonal). Principal component analysis (PCA) is a method that transforms a set of correlated variables (in this case, the column vectors of matrix X) into a possibly smaller set of uncorrelated variables {y1,…, yr}, called principal component vectors, that present the largest possible norm (variance, power, or variability) and span the same vector space as the column space of matrix X. In this application, the "uncorrelatedness" condition is achieved by imposing that the vectors yi, i = 1, 2,…, r, be orthogonal.

Consider now the SVD (or spectral) decomposition of the correlation matrix C = X^TX/(n − 1) under the form C = UΛU^T, where Λ is a diagonal matrix with the diagonal entries λ1 ≥ λ2 ≥ … ≥ λr > 0, and U = [u1, u2,…, un] is an orthogonal matrix. The facts that the yi must lie in the range space of X and must be uncorrelated imply that at most r principal component vectors can be defined. Next we will determine the first principal component y1 as the vector of largest variance or norm that lies in the range space of matrix X. This reduces to finding the vector y1 = Xv1, where v1 is an arbitrary vector of unit norm, such that

\mathrm{Var}(y_1) = \frac{y_1^T y_1}{n-1} = \frac{(Xv_1)^T X v_1}{n-1} = v_1^T C v_1,

(3.132)

is maximized. The Rayleigh quotient implies immediately that v1 should coincide with the eigenvector u1 corresponding to the largest eigenvalue λ1 of the correlation matrix C. Now, using the induction principle, we will construct the remaining principal directions. Assuming that the first k − 1 principal component vectors y1,…, y_{k−1} are obtained from the eigenvectors u1,…, u_{k−1} corresponding to the k − 1 largest eigenvalues (i.e., yl = Xul), we want to prove that the kth principal component vector is given by yk = Xuk, where uk is the eigenvector corresponding to the kth largest eigenvalue λk. Consider that yk = Xvk, where vk is an arbitrary vector of unit norm. Because yk should be orthogonal to y1,…, y_{k−1}, it follows that

y_k^T y_l = 0, \qquad l = 1, \ldots, k-1.

(3.133)

Equation (3.133) leads further to

(v_k^T X^T)(X u_l) = (n-1)\, v_k^T C u_l = (n-1)\, \lambda_l\, v_k^T u_l = 0, \qquad l = 1, \ldots, k-1.

(3.134)

Thus, vk must necessarily lie in the span of the remaining orthogonal eigenvectors uk,…, un. Consequently, there exist scalars αl, l = k,…, n, such that:

v_k = \sum_{l=k}^{n} \alpha_l u_l,

(3.135)

and the constraint \sum_{l=k}^{n} \alpha_l^2 = 1 is imposed to enforce the unit norm of vk. Notice that the maximization of the kth principal component norm or variance is equivalent to the maximization of

y_k^T y_k = \Big( \sum_{j=k}^{n} \alpha_j u_j \Big)^T X^T X \Big( \sum_{l=k}^{n} \alpha_l u_l \Big) = (n-1) \sum_{l=k}^{n} \alpha_l^2 \lambda_l,

(3.136)

with respect to the scalars αl, l = k,…, n, under the constraint \sum_{l=k}^{n} \alpha_l^2 = 1. Obviously, the maximum of (3.136) is achieved when αl = 0 for l > k and |αk| = 1. This concludes the proof. In other words, the eigenvectors of the correlation matrix C yield the principal component directions. Depending on the application in which the PCA method is employed, PCA goes by different names, such as the Karhunen-Loève transform, SVD, or Hotelling transform. For more details on PCA, reference [25] presents a comprehensive overview of PCA and its applications.
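
A compact sketch of the construction just described (NumPy; the data matrix is an arbitrary example, the columns are mean-centered, and the normalization constant is taken as the usual "number of observations minus one", which plays the same role as the chapter's 1/(n − 1) factor and does not affect the principal directions):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))   # 100 x 5 data matrix
Xc = X - X.mean(axis=0)                                           # center the columns

n_obs = Xc.shape[0]
C = Xc.T @ Xc / (n_obs - 1)                   # sample covariance of the columns
lam, U = np.linalg.eigh(C)                    # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]                # reorder: lambda_1 >= lambda_2 >= ...

Y = Xc @ U                                    # principal component vectors y_k = X u_k
# The y_k are mutually orthogonal and their variances equal the eigenvalues of C
print(np.allclose(Y.T @ Y / (n_obs - 1), np.diag(lam)))
```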

3.20    Special Matrices

In this section several special matrices such as block, circulant, Toeplitz, Vandermonde, stochastic, generalized inverse and positive definite matrices will be introduced.

3.20.1    Block Matrices

A matrix A is referred to as a block matrix if it can be represented in terms of sub-matrices Ai,j of appropriate sizes as follows:

A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\ \vdots & \vdots & & \vdots \\ A_{m,1} & A_{m,2} & \cdots & A_{m,n} \end{bmatrix}.

(3.137)

A block matrix with non-zero blocks on its main (block) diagonal and zero blocks elsewhere is called a block diagonal matrix, and assumes the expression:

A = \begin{bmatrix} A_1 & & 0 \\ & \ddots & \\ 0 & & A_n \end{bmatrix}.

Some properties pertaining to the multiplication of two block matrices A and B of appropriate sizes are listed below.

•  If A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}, then AB = \begin{bmatrix} A_1 B \\ A_2 B \end{bmatrix}.

•  If B = [B_1 \; B_2], then AB = A[B_1 \; B_2] = [AB_1 \; AB_2].

•  If A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} and B = [B_1 \; B_2], then AB = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}[B_1 \; B_2] = \begin{bmatrix} A_1 B_1 & A_1 B_2 \\ A_2 B_1 & A_2 B_2 \end{bmatrix}.

•  If A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix} and B = \begin{bmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{bmatrix}, then AB = \begin{bmatrix} A_{1,1}B_{1,1} + A_{1,2}B_{2,1} & A_{1,1}B_{1,2} + A_{1,2}B_{2,2} \\ A_{2,1}B_{1,1} + A_{2,2}B_{2,1} & A_{2,1}B_{1,2} + A_{2,2}B_{2,2} \end{bmatrix}.

The proof of the last property is delegated to Exercise 3.23.25. Reference [7] presents additional properties of block matrices.

3.20.2    Circulant Matrices

An n × n matrix A is called circulant if each row of A is obtained by right-shifting the previous row by one position, with the element in the last column wrapping around to the first column. In other words, a circulant matrix is completely defined by the elements located on its first row, as depicted below:

A = \mathrm{circ}(a_1, \ldots, a_n) = \begin{bmatrix} a_1 & a_2 & \cdots & a_{n-1} & a_n \\ a_n & a_1 & \cdots & a_{n-2} & a_{n-1} \\ \vdots & \vdots & & \vdots & \vdots \\ a_2 & a_3 & \cdots & a_n & a_1 \end{bmatrix}.

(3.138)

The circulant structure of matrix A = [ak,l] can be described alternatively through the condition:

a_{k,l} = \begin{cases} a_{k-1,\,l-1}, & k = 2, \ldots, n, \quad l = 2, \ldots, n \\ a_{k-1,\,n}, & k = 2, \ldots, n, \quad l = 1. \end{cases}

Consider now two circulant matrices A and B of the same size and arbitrary scalars α and β. Then the following statements hold:

•  AT is circulant.

•  αA + βB is circulant.

•  Ak is circulant, for any positive integer k.

•  If A is nonsingular, then A−1 is circulant.

•  AB is circulant.

The proofs of the last two properties are delegated to Exercise 3.23.27.

An important feature of circulant matrices is that they can be diagonalized using the Discrete Fourier Transform (DFT). In fact, for an arbitrary n × n circulant matrix A the following spectral decomposition holds:

A = F_n^* \Lambda_n F_n,

(3.139)

where Fn is the DFT matrix with entries

[F_n]_{k,l} = \frac{1}{\sqrt{n}}\, e^{-2\pi \sqrt{-1}\, kl/n}, \qquad 0 \le k, l \le n-1,

and Λn is a diagonal matrix with its diagonal entries equal to the eigenvalues of matrix A. The diagonal entries of Λn, i.e., the eigenvalues of the circulant matrix A, can be calculated in O(n log n) flops (floating-point operations) by taking the fast Fourier transform (FFT) of the first column of A. Indeed, using (3.139) and denoting by 1 the vector of all ones and e1 = (1, 0,…, 0)^T, the following equations hold:

\Lambda_n F_n = F_n A, \qquad \Lambda_n F_n e_1 = F_n A e_1, \qquad \frac{1}{\sqrt{n}}\, \Lambda_n \mathbf{1} = F_n A e_1,

(3.140)

where we used the fact that F_n e_1 = (1/√n)1. Equating the kth entries in the LHS and RHS of (3.140), it follows that the diagonal entries λk of Λn are given by

\lambda_k = \sum_{j=0}^{n-1} a_j\, e^{-2\pi i j k/n}, \qquad k = 0, \ldots, n-1, \quad i = \sqrt{-1},

where a_0, a_1, \ldots, a_{n-1} here denote the entries of the first column of A (i.e., of the vector Ae_1).

Exercise 3.23.28 describes how the eigenvectors of a circulant matrix A can be efficiently determined. Circulant matrices are important in signal processing applications because they arise in the matrix representation of the circular convolution between two discrete-time sequences (see e.g., [26]). Since the inverse DFT of the product of two DFTs is equal to the circular convolution of the two sequences whose DFTs are multiplied, circulant matrices also play a fundamental role in performing the linear convolution between two sequences [26]. Reference [27] presents a nice introduction to circulant matrices. Additional properties of circulant matrices can be found in [7].
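
The FFT-based eigenvalue computation is easy to verify numerically. The sketch below assumes SciPy's circulant constructor, which builds the matrix from its first column, so the eigenvalues are simply the DFT of that column; the k-th DFT vector is the corresponding eigenvector:

```python
import numpy as np
from scipy.linalg import circulant

c = np.array([4.0, 1.0, 2.0, 3.0])        # first column of the circulant matrix
A = circulant(c)                           # SciPy convention: c is the first column
n = len(c)

lam = np.fft.fft(c)                        # eigenvalues via an O(n log n) FFT

# The k-th DFT vector is the corresponding eigenvector: A v_k = lam[k] v_k
for k in range(n):
    v = np.exp(2j * np.pi * k * np.arange(n) / n)
    assert np.allclose(A @ v, lam[k] * v)
print(lam)
```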

3.20.3    Toeplitz Matrices

An n × n matrix A = [ak,l] whose entries satisfy the property ak,l = a_{k−l} is called a Toeplitz matrix. In other words, the (k, l) entry of a Toeplitz matrix depends only on the difference between its row and column indices. Due to this fact, the following well-known characterization of Toeplitz matrices is used in the literature: all matrix entries located on diagonals parallel to the main diagonal are equal. As a corollary, a Toeplitz matrix is completely determined by its first column and first row. Notice also that circulant matrices are a special class of Toeplitz matrices. Toeplitz matrices appear in the matrix representation of discrete-time linear convolutions between two sequences and in the autocorrelation (auto-covariance) matrices of stationary stochastic processes. An interesting and quite useful property of Toeplitz matrices is that Toeplitz systems of equations of the form Ax = b can be solved efficiently in O(n^2) flops using the Levinson-Durbin algorithm. This result plays an important role in designing parametric autoregressive moving-average (ARMA) or AR power spectral density estimators (solving the Yule-Walker system of equations) [12], and in the calculation of Wiener (minimum mean-square error (MMSE)) equalizers and predictors. For additional information, reference [27] represents an excellent elementary introduction to Toeplitz and circulant matrices.
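
For instance, SciPy exposes a Levinson-recursion-based Toeplitz solver; the sketch below (arbitrary Toeplitz data) compares it against a generic dense solver:

```python
import numpy as np
from scipy.linalg import toeplitz, solve_toeplitz

c = np.array([4.0, 1.0, 0.5, 0.2])        # first column
r = np.array([4.0, 0.8, 0.3, 0.1])        # first row (r[0] must equal c[0])
T = toeplitz(c, r)
b = np.array([1.0, 2.0, 3.0, 4.0])

x_lev = solve_toeplitz((c, r), b)          # Levinson-type O(n^2) solver
x_ref = np.linalg.solve(T, b)              # generic O(n^3) solver
print(np.allclose(x_lev, x_ref))
```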

3.20.4    Hankel Matrices

A matrix A = [ak,l] whose entries satisfy the condition ak,l = a_{k+l} is referred to as a Hankel matrix. Hankel matrices enjoy the property that the entries located on anti-diagonals are equal. Therefore, the entries of a Hankel matrix are completely described by its first row and last column. Hankel matrices have found application in the design of state-space based system identification and channel estimation algorithms [28]. Hankel matrices enjoy properties similar to those of Toeplitz matrices in terms of computing their inverses. This can be understood by noticing the following result: if J stands for the exchange matrix, i.e., the matrix with all the entries on the main counter-diagonal equal to 1 and with zeros elsewhere, then JA is Toeplitz for a square Hankel matrix A [7].

3.20.5    Vandermonde Matrices

A Vandermonde matrix A admits the following structure:

A = \begin{bmatrix} 1 & a_1 & a_1^2 & \cdots & a_1^{m-1} \\ 1 & a_2 & a_2^2 & \cdots & a_2^{m-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & a_m & a_m^2 & \cdots & a_m^{m-1} \end{bmatrix}

(3.141)

= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ a_1 & a_2 & \cdots & a_m \\ a_1^2 & a_2^2 & \cdots & a_m^2 \\ \vdots & \vdots & & \vdots \\ a_1^{m-1} & a_2^{m-1} & \cdots & a_m^{m-1} \end{bmatrix}^T.

(3.142)

In other words, each row of a Vandermonde matrix consists of the terms of a geometric progression. A square Vandermonde matrix A has the property that its determinant admits the closed-form expression:

\det(A) = \prod_{1 \le i < j \le m} (a_j - a_i).

(3.143)

Vandermonde matrices arise in applications involving evaluations of polynomials at a specific set of values, polynomial least-squares fitting and polynomial interpolation designs.
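
Polynomial least-squares fitting is a typical use: evaluating a degree-(k − 1) polynomial at the points a_i is just a Vandermonde matrix-vector product. A NumPy sketch with synthetic, illustrative data:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)              # sample points a_1, ..., a_m
y = 1.0 - 2.0 * x + 0.5 * x**2             # values of a quadratic at those points

V = np.vander(x, 3, increasing=True)       # columns: 1, x, x^2 (Vandermonde rows)
coef, *_ = np.linalg.lstsq(V, y, rcond=None)
print(coef)                                # approximately [1.0, -2.0, 0.5]
```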

3.20.6    Normal Matrices

A complex square matrix A is said to be normal if

A^*A = AA^*,

where we recall that the superscript * stands for the conjugate transposition (Hermitian transposition) operation. A real-valued normal matrix satisfies the condition:

A^TA = AA^T.

One of the most important features of normal matrices is the fact that they are unitarily diagonalizable, i.e., there exists a unitary matrix U such that U^HAU = diag(λ1,…, λn) [5]. In particular, notice that a Hermitian (or skew-Hermitian) matrix is a normal matrix.

3.20.7    Stochastic Matrices

An n × n matrix A = [ak,l] is called a stochastic (probability or transition) matrix if all its entries are non-negative (ak,l ≥ 0) and the entries in each row add up to one, \sum_{l=1}^{n} a_{k,l} = 1. A stochastic matrix can be interpreted as the transition probability matrix of a Markov chain, and its use arises in many fields such as probability, statistics, computer science, and optimization (majorization theory). A matrix with non-negative entries and with each column summing up to one is also referred to as a stochastic matrix. A doubly stochastic matrix is a matrix with non-negative entries in which each row and each column add up to one.

3.20.8    Positive and Negative Definite Matrices

Assume x is an n × 1 vector in ℝn and n × n matrix A is symmetric. The quadratic form f (x) = xTAx is said to be positive (negative) definite if f (x) > 0 (f (x) < 0) for any non-zero vector x. By definition, the real symmetric matrix A is called:

•  Positive Definite iff xTAx > 0 , ∀x ∈ ℝn − {0}.

•  Positive Semi-Definite iff xTAx ≥ 0 , ∀x ∈ ℝn − {0}.

•  Negative Definite iff xTAx < 0 , ∀x ∈ ℝn − {0}.

•  Negative Semi-Definite iff xT Ax ≤ 0 , ∀x ∈ ℝn − {0}.

The following result provides an alternative way to assess when a matrix is positive definite.

Theorem 3.20.1. A real symmetric matrix A is positive definite if and only if all of its eigenvalues are positive.

The proof of this result is left to Exercise 3.23.30. An alternative method to check the positive definiteness of a matrix is to verify that the determinants of all leading principal submatrices are positive (Sylvester's criterion; see e.g., [8]). Additional information about the properties and applications of positive definite matrices can be found in the book [29].
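
Numerically, positive definiteness of a symmetric matrix is usually tested via its eigenvalues or, more cheaply, by attempting a Cholesky factorization; a short sketch with an arbitrary test matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((4, 4))
A = B @ B.T + 1e-3 * np.eye(4)             # symmetric positive definite by construction

print(np.all(np.linalg.eigvalsh(A) > 0))   # Theorem 3.20.1: all eigenvalues positive

def is_pos_def(M):
    """Cheap test: Cholesky succeeds iff the symmetric matrix is positive definite."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_pos_def(A), is_pos_def(-A))       # True False
```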

3.20.9    Matrix Condition Number

Consider the system of equations Ax = b. Matrix A is said to be ill-conditioned if a relatively small change in b (Δb) causes relatively large changes in the solution vector x (Δx). In contrast, matrix A is called well-conditioned if a relatively small change in b causes relatively small changes in the solution x. Assume A is a square nonsingular matrix and

Ax' = A(x + \Delta x) = b + \Delta b,

(3.144)

where x′ = x + Δx represents the perturbed solution caused by an error Δb in b. Considering Ax = b, it follows that AΔx = Δb and hence:

\Delta x = A^{-1} \Delta b.

(3.145)

Therefore,

\|\Delta x\| \le \|A^{-1}\| \, \|\Delta b\|.

(3.146)

On the other hand,

\|\Delta b\| = \|A \Delta x\| \le \|A\| \, \|\Delta x\|.

(3.147)

Therefore,

\frac{\|\Delta b\|}{\|A\|} \le \|\Delta x\| \le \|A^{-1}\| \, \|\Delta b\|.

(3.148)

Considering x = A−1 b, and following the same reasoning it is concluded that:

\frac{\|b\|}{\|A\|} \le \|x\| \le \|A^{-1}\| \, \|b\|.

(3.149)

From (3.148) and (3.149), it follows that:

\frac{1}{\|A\|\,\|A^{-1}\|} \frac{\|\Delta b\|}{\|b\|} \le \frac{\|\Delta x\|}{\|x\|} \le \|A\|\,\|A^{-1}\| \frac{\|\Delta b\|}{\|b\|}.

(3.150)

The number ‖A‖ ‖A−1‖ in (3.150) is referred to as the condition number of matrix A and is denoted by cond(A). Therefore, (3.150) can be expressed as:

\frac{1}{\mathrm{cond}(A)} \frac{\|\Delta b\|}{\|b\|} \le \frac{\|\Delta x\|}{\|x\|} \le \mathrm{cond}(A)\, \frac{\|\Delta b\|}{\|b\|}.

(3.151)

This equation relates the relative solution error (‖Δx‖/‖x‖) to the relative error in b (‖Δb‖/‖b‖). If cond(A) is close to 1, then the relative errors in x and b are comparable. However, if cond(A) is a large number, then the relative error in x can be significantly larger than the relative error in b.
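
The effect of cond(A) is easy to observe numerically; the sketch below perturbs b slightly and compares the relative-error amplification for a well-conditioned matrix and an ill-conditioned (Hilbert) matrix (arbitrary test data):

```python
import numpy as np

def error_amplification(A, rng):
    x = np.ones(A.shape[1])
    b = A @ x
    db = 1e-8 * rng.standard_normal(b.shape) * np.linalg.norm(b)
    x_pert = np.linalg.solve(A, b + db)
    rel_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
    rel_b = np.linalg.norm(db) / np.linalg.norm(b)
    return rel_x / rel_b                        # observed ||dx||/||x|| over ||db||/||b||

rng = np.random.default_rng(9)
n = 8
well = np.eye(n) + 0.1 * rng.standard_normal((n, n))                   # cond close to 1
hilb = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n)[None, :])    # Hilbert matrix

for A in (well, hilb):
    print(np.linalg.cond(A), error_amplification(A, rng))
```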

3.20.10  Sherman-Morrison-Woodbury Identity

Consider that the system of linear equations Ax = b admits the solution vector x0. Consider also a slightly changed version of this system of equations of the form Ãx = b, whose solution we want to determine under the assumption that Ã represents a perturbed version of A. The problem that we would like to address is finding a computationally efficient solution for the perturbed system of equations assuming knowledge of x0 and that Ã represents a rank-one perturbation of A, i.e.,

\tilde{A} = A - uv^T,

(3.152)

in which u and v are two arbitrary vectors. Because of (3.152), the inverse of Ã can be expressed in terms of the inverse of the original matrix A as follows:

(\tilde{A})^{-1} = A^{-1} + \frac{(A^{-1}u)(v^T A^{-1})}{1 - v^T A^{-1} u}.

(3.153)

Identity (3.153) is referred to as the Sherman-Morrison identity. Right-multiplying (3.153) with b, the solution of the perturbed system x̃ = (Ã)−1b can be expressed as:

\tilde{x} = x_0 + \Delta x = x_0 + \frac{(A^{-1}u)(v^T x_0)}{1 - v^T A^{-1} u},

(3.154)

where we took into account that A−1b = x0. The Sherman-Morrison formula can be extended to situations where matrix A is subject to a higher-rank perturbation of the form:

\tilde{A} = A - UV^T,

where U and V are n × m matrices and n > m. In this case, the inverse of the perturbed matrix Ã is calculated by means of a generalization of the Sherman-Morrison formula, referred to as the Woodbury identity:

(\tilde{A})^{-1} = A^{-1} + A^{-1}U(I_m - V^T A^{-1} U)^{-1} V^T A^{-1}.

(3.155)

Equation (3.155) is often referred to as the Sherman-Morrison-Woodbury identity. Right-multiplying (3.155) with vector b, it turns out that the solution of the perturbed system can be expressed in terms of x0 as follows:

\tilde{x} = x_0 + \Delta x

(3.156)

= x_0 + A^{-1}U(I_m - V^T A^{-1} U)^{-1} V^T x_0.

(3.157)

The Sherman-Morrison-Woodbury identities (3.153) and (3.155) play an important role in the development of recursive (iterative or online) estimation and filtering techniques. The RLS algorithm and Kalman filter are two important examples in this regard [10, 11, 30, 31].
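
A direct numerical check of the Sherman-Morrison identity (3.153) and of the rank-one solution update (3.154), with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # comfortably invertible
u, v, b = rng.standard_normal((3, n))

A_inv = np.linalg.inv(A)
x0 = A_inv @ b

A_tilde = A - np.outer(u, v)                      # rank-one perturbation (3.152)
A_tilde_inv_sm = A_inv + np.outer(A_inv @ u, v @ A_inv) / (1.0 - v @ A_inv @ u)
print(np.allclose(A_tilde_inv_sm, np.linalg.inv(A_tilde)))     # identity (3.153)

x_tilde = x0 + (A_inv @ u) * (v @ x0) / (1.0 - v @ A_inv @ u)  # update (3.154)
print(np.allclose(A_tilde @ x_tilde, b))
```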

3.20.11  Schur Complement

Many practical applications require the calculation of the inverse of a block-partitioned matrix. Assume that the matrix E is block-partitioned as:

E = \begin{bmatrix} A & B \\ C & D \end{bmatrix},

(3.158)

where submatrix A is invertible. The expression F = D − CA−1B is called the Schur complement of submatrix A in E and represents an important ingredient in computing the inverse of E in terms of the inverse of A. It turns out that the inverse of the partitioned matrix E in (3.158) can be expressed in terms of the inverses of submatrix A and its Schur complement F as follows:

E^{-1} = \begin{bmatrix} A^{-1} + A^{-1}BF^{-1}CA^{-1} & -A^{-1}BF^{-1} \\ -F^{-1}CA^{-1} & F^{-1} \end{bmatrix}.

(3.159)

Furthermore, the determinant of matrix E can be calculated using determinants of submatrix A and its Schur complement:

|E| = |A||F|.

(3.160)

The proof of (3.160) is deferred to Exercise 3.23.32. Additional properties and applications of Schur complement can be found in [7, 32].

3.20.12  Generalized Inverses

Often it is necessary to extend the concept of matrix invertibility to the wider class of rank-deficient or rectangular matrices by means of concepts such as the generalized inverse or pseudoinverse of a matrix. One of the major applications of the concept of generalized inverse is in solving systems of equations of the form Ax = b, where matrix A is either not square or singular. Such a system of equations may not admit any solution or may admit an infinite number of solutions.

Basic Definition of Generalized Inverse

The generalized inverse of a matrix is not unique. Any matrix that satisfies the following equation:

A A^{-} A = A,

(3.161)

is referred to as a generalized inverse of matrix A and is denoted by A^−. In general, the generalized inverse of an m × n matrix is an n × m matrix. Some of the main properties of generalized inverses are enumerated below.

1.  If A is a square and nonsingular matrix, then A^− = A^{−1}, i.e., the generalized inverse coincides with the regular inverse.

2.  Transposition: (A^T)^− = (A^−)^T.

3.  Assume that matrix A is of size m × n, then

(a)  rank(A^−A) = rank(A).

(b)  rank(I − A^−A) = n − rank(A).

4.  Consider m × n matrices A and B and their n × m generalized inverses, A^− and B^−, respectively.

(a)  A(I + A)^− = (I + A)^−A.

(b)  A^− + B^− = A^−(A + B)B^−.

(c)  A − A(A + B)^−A = B − B(A + B)^−B.

(d)  (I + AB)^− = I − A(I + BA)^−B.

5.  Generalized inverse of block matrices. Consider the block matrix

E = \begin{bmatrix} A & B \\ C & D \end{bmatrix},

where A, B, C, and D are submatrices of E, and A^−, B^−, C^−, and D^− stand for their generalized inverses, respectively. The following formula enables the calculation of a generalized inverse of E:

E^{-} = \begin{bmatrix} A^{-} + A^{-}BF^{-}CA^{-} & -A^{-}BF^{-} \\ -F^{-}CA^{-} & F^{-} \end{bmatrix},

where F = D − CA^−B.

An excellent reference dealing with the properties and applications of generalized inverses is [33]. More recently, [7] provides a good list of the major properties exhibited by generalized (weak) inverses.

Generalized Inverses to Solve Consistent Systems of Equations

Assume Ax = b is a consistent under-determined system of equations, i.e., the system of equations admits more than one solution. The generalized inverse A^− of matrix A is assumed to satisfy (3.161). Right-multiplying (3.161) by x, it follows that:

A A^{-} A x = A x.

(3.162)

Since Ax = b, (3.162) simplifies further to:

A(A^{-} b) = b.

(3.163)

Therefore, x = A^−b can be interpreted as a solution of the original system of equations Ax = b. Notice that, in general, any vector of the form x = A^−b + (I − A^−A)y, where y is any n × 1 vector, is a solution of the original system of equations (and in fact all the solutions are of this type [3]). Indeed, one can check directly by performing the required calculations that

Ax = A\big(A^{-}b + (I - A^{-}A)y\big) = AA^{-}b + (A - AA^{-}A)y = b + 0 = b.

The reverse implication can be proved by using the fact that any solution of the inhomogeneous system of equations Ax = b can be expressed as the sum of a particular solution (A^−b) of the original system Ax = b and a solution of the homogeneous system of equations Ax = 0, whose solution set coincides with the null space of A. It is not difficult to check that I − A^−A is a projector onto the null space of A [3].

Moore-Penrose Inverse

As mentioned before, the definition of a generalized inverse does not yield a unique generalized inverse, unless it is restricted to satisfy some additional requirements. By imposing several additional constraints on the definition (3.161), a unique generalized inverse can be obtained for any arbitrary matrix A. The resulting unique generalized inverse is referred to as the Moore-Penrose inverse of A and is denoted by A+. The additional constraints satisfied by the Moore-Penrose inverse are listed below:

A^+ A A^+ = A^+,

(3.164)

(A^+ A)^T = A^+ A,

(3.165)

(A A^+)^T = A A^+.

(3.166)

The Moore-Penrose inverse A+ exists for any arbitrary matrix A and is unique. There are several methods to calculate the Moore-Penrose inverse of a matrix. One of the most common methods is based on the singular value decomposition of the matrix, explained in Section 3.17. Let the SVD of matrix A be given by A = UΣV^T, where Σ is the same as in (3.122) and matrices U, V are orthogonal. The pseudoinverse of Σ, denoted Σ+, is the n × m diagonal matrix with 1/σi, i = 1,…, r, on its main diagonal and 0 elsewhere. The Moore-Penrose inverse of A is given by (see e.g., [3, 7, 24]):

A^+ = V \Sigma^+ U^T.

(3.167)
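
The SVD construction (3.167) can be compared against NumPy's built-in pseudoinverse, and the defining Moore-Penrose conditions checked directly (arbitrary rank-deficient test matrix):

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # 6x4, rank 3

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10 * s[0])
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))                 # n x m
Sigma_plus[:r, :r] = np.diag(1.0 / s[:r])
A_plus = Vt.T @ Sigma_plus @ U.T                 # A^+ = V Sigma^+ U^T, cf. (3.167)

print(np.allclose(A_plus, np.linalg.pinv(A)))
# Defining conditions (3.161) and (3.164)-(3.166)
print(np.allclose(A @ A_plus @ A, A), np.allclose(A_plus @ A @ A_plus, A_plus))
print(np.allclose((A @ A_plus).T, A @ A_plus), np.allclose((A_plus @ A).T, A_plus @ A))
```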

3.21    Matrix Operations

Several useful matrix operations such as Kronecker product, Hadamard product, dot product, direct sum, and matrix differentiation will be introduced in this section.

3.21.1    Kronecker Product

The Kronecker product of the m × n matrix A = [aij] and the p × q matrix B = [bij] is denoted by ⊗, and it is given by the mp × nq matrix C:

C = A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{bmatrix}.

(3.168)

This Kronecker product is also referred to as direct multiplication.

Suppose that A, B, C, and D are matrices with appropriate sizes. Consider also that a and b are vectors of appropriate sizes, and α and β are scalars. The following properties hold for the Kronecker product (see e.g., [7] for additional properties):

•  (A + B) ⊗ C = A ⊗ C + B ⊗ C.

•  A ⊗ (B + C) = A ⊗ B + A ⊗ C.

•  (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

•  (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).

•  (A ⊗ B)^T = A^T ⊗ B^T.

•  a ⊗ b^T = b^T ⊗ a = ab^T.

•  (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}.

•  rank(A ⊗ B) = rank(A) rank(B).
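
A quick numerical confirmation of the mixed-product and inverse properties listed above, using NumPy's kron on arbitrary small matrices:

```python
import numpy as np

rng = np.random.default_rng(12)
A, C = rng.standard_normal((2, 3, 3))          # two 3x3 matrices
B, D = rng.standard_normal((2, 2, 2))          # two 2x2 matrices

# Mixed-product property: (A (x) B)(C (x) D) = (AC) (x) (BD)
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

# Inverse of a Kronecker product: (A (x) B)^{-1} = A^{-1} (x) B^{-1}
print(np.allclose(np.linalg.inv(np.kron(A, B)),
                  np.kron(np.linalg.inv(A), np.linalg.inv(B))))
```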

3.21.2    Hadamard Product

The Hadamard product, also called the element-wise or array product, is represented in terms of notation ⊙, and it is defined for two m × n matrices A = [ak,l] and B = [bk,l] as follows:

A \odot B = \begin{bmatrix} a_{1,1}b_{1,1} & a_{1,2}b_{1,2} & \cdots & a_{1,n}b_{1,n} \\ a_{2,1}b_{2,1} & a_{2,2}b_{2,2} & \cdots & a_{2,n}b_{2,n} \\ \vdots & \vdots & & \vdots \\ a_{m,1}b_{m,1} & a_{m,2}b_{m,2} & \cdots & a_{m,n}b_{m,n} \end{bmatrix}.

(3.169)

For arbitrary matrices A, B, and C, assumed of appropriate sizes, the Hadamard product exhibits the following properties:

•  A ⊙ B = B ⊙ A

•  (A ⊙ B) ⊙ C = A ⊙ (B ⊙ C)

•  (A + B) ⊙ C = A ⊙ C + B ⊙ C

•  (A ⊙ B)^T = A^T ⊙ B^T

•  C ⊙ (A ⊙ B) = (C ⊙ A) ⊙ B = A ⊙ (C ⊙ B)

3.21.3    Dot Product

The dot product, also called the inner product, of two m × n matrices A = [ai] and B = [bi] is defined in terms of their column vectors ai and bi, respectively, as follows:

\langle A | B \rangle = \sum_{i=1}^{n} a_i^T b_i.

(3.170)

The dot product satisfies the following properties:

•  〈A|B〉 = tr (ATB)

•  〈A|B〉 = 〈AT|BT〉

3.21.4    Direct Sum

The direct sum of k square matrices A1, A2, …, Ak with sizes n1 × n1, n2 × n2, …, nk × nk, respectively, is a block diagonal square matrix of size (n1 + n2 + … + nk) × (n1 + n2 + … + nk) defined by

A_1 \oplus A_2 \oplus \cdots \oplus A_k = \mathrm{diag}(A_1, A_2, \ldots, A_k) = \begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_k \end{bmatrix}.

(3.171)

The following properties of direct sum hold:

•  tr(A1 ⊕ A2 ⊕ … ⊕ Ak) = tr(A1) + tr(A2) + … + tr(Ak)

•  |A1 ⊕ A2 ⊕ … ⊕ Ak| = |A1| |A2| … |Ak|

•  rank(A1 ⊕ A2 ⊕ … ⊕ Ak) = rank(A1) + rank(A2) + … + rank(Ak)

3.21.5    Differentiation of Matrix and Vector Functions

This subsection focuses on presenting some basic conventions and results regarding the calculation of derivatives for matrix- and vector-valued functions with respect to a scalar, vector, or matrix-valued independent variable.

Differentiation with Respect to a Scalar

When the differentiation variable is a scalar, the derivative has the same structure as the operand and its elements are simply derivatives of the elements of the operand. Assuming that the m × 1 vector x(t) = (x1(t),…,xm(t))T and m × n matrix Y(t) = [yi,j (t)] are functions of the independent scalar variable t, then their derivatives with respect to t are given by:

\frac{\partial x(t)}{\partial t} = \left( \frac{\partial x_1(t)}{\partial t}, \ldots, \frac{\partial x_m(t)}{\partial t} \right)^T,

(3.172)

\frac{\partial Y(t)}{\partial t} = \begin{bmatrix} \frac{\partial y_{1,1}(t)}{\partial t} & \cdots & \frac{\partial y_{1,n}(t)}{\partial t} \\ \vdots & & \vdots \\ \frac{\partial y_{m,1}(t)}{\partial t} & \cdots & \frac{\partial y_{m,n}(t)}{\partial t} \end{bmatrix} = \left[ \frac{\partial y_{i,j}(t)}{\partial t} \right].

(3.173)

Similarly, the higher-order derivative of a vector (matrix) with respect to (wrt) a scalar is a vector (matrix) of the same size whose entries consist of the higher-order derivatives of the vector (matrix) entries.

Differentiation with Respect to a Vector

The derivative of a scalar function x(t) with respect to an n × 1 vector t = (t1, …,tn)T is also referred to as the gradient, and it is defined as the vector:

\frac{\partial x(t)}{\partial t^T} = \left( \frac{\partial x(t)}{\partial t_1}, \ldots, \frac{\partial x(t)}{\partial t_n} \right).

(3.174)

An alternative notation for the gradient of x is:

\nabla x = \frac{\partial x(t)}{\partial t^T}.

(3.175)

The derivative of any operand with respect to a vector yields a vector for each element of the operand. Therefore, the derivative of a vector with respect to a vector is a matrix. Moreover, the derivative of a matrix-valued function with respect to a vector is a three-dimensional array.

Assume the m × 1 vector x(t) = (x1(t), …,xm(t))T and p × q matrix Y(t) = [yi,j(t)]. Then the derivatives of x(t) and Y(t) with respect to the n × 1 vector t = (t1, …,tn)T are given by:

\frac{\partial x(t)}{\partial t^T} = \begin{bmatrix} \frac{\partial x_1(t)}{\partial t_1} & \frac{\partial x_1(t)}{\partial t_2} & \cdots & \frac{\partial x_1(t)}{\partial t_n} \\ \frac{\partial x_2(t)}{\partial t_1} & \frac{\partial x_2(t)}{\partial t_2} & \cdots & \frac{\partial x_2(t)}{\partial t_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial x_m(t)}{\partial t_1} & \frac{\partial x_m(t)}{\partial t_2} & \cdots & \frac{\partial x_m(t)}{\partial t_n} \end{bmatrix},

(3.176)

\frac{\partial Y}{\partial t} = [y_{i,j,k}], \qquad y_{i,j,k} = \frac{\partial y_{i,j}}{\partial t_k}, \qquad i = 1, \ldots, p, \quad j = 1, \ldots, q, \quad k = 1, \ldots, n.

(3.177)

The m × n matrix ∂x(t)/∂t^T is referred to as the Jacobian of x(t) and is usually represented in terms of the notation:

J_x \triangleq \frac{\partial x}{\partial t^T}.

(3.178)

Higher-order derivatives can be calculated by applying first-order derivatives successively several times; e.g., the second-order derivative of a scalar x(t) with respect to the n × 1 vector t = (t1,…, tn)^T is the first-order derivative of the row vector ∂x(t)/∂t^T = (∂x(t)/∂t1,…, ∂x(t)/∂tn) with respect to the vector t, and it is equal to the matrix:

\frac{\partial^2 x(t)}{\partial t\,\partial t^T} = \frac{\partial}{\partial t}\left( \frac{\partial x(t)}{\partial t^T} \right) = \frac{\partial}{\partial t}\left( \frac{\partial x(t)}{\partial t_1}, \ldots, \frac{\partial x(t)}{\partial t_n} \right) = \begin{bmatrix} \frac{\partial^2 x}{\partial t_1^2} & \frac{\partial^2 x}{\partial t_1 \partial t_2} & \cdots & \frac{\partial^2 x}{\partial t_1 \partial t_n} \\ \frac{\partial^2 x}{\partial t_2 \partial t_1} & \frac{\partial^2 x}{\partial t_2^2} & \cdots & \frac{\partial^2 x}{\partial t_2 \partial t_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 x}{\partial t_n \partial t_1} & \frac{\partial^2 x}{\partial t_n \partial t_2} & \cdots & \frac{\partial^2 x}{\partial t_n^2} \end{bmatrix}.

(3.179)

The matrix in (3.179) is called the Hessian of x(t) and is represented in general in terms of one of these notations:

\nabla \nabla x = \nabla^2 x = \frac{\partial^2 x}{\partial t^2}.

(3.180)

Derivatives of some commonly used expressions with respect to the n × 1 vector x are shown in Table 3.1. In this table, A stands for an n × n matrix, and x and a are n × 1 vectors. Generally, the following conventions are adopted: ∂f(x)/∂x = (∂f(x)/∂x^T)^T and ∂f(x)/∂x^T = (∂f(x)/∂x)^T.

Table 3.1:  Derivatives of commonly used expressions with respect to vectors.

f(x)          ∂f(x)/∂x          ∂f(x)/∂x^T
x^T a         a                 a^T
a^T Ax        A^T a             a^T A
Ax            A^T               A
x^T x         2x                2x^T
x^T Ax        Ax + A^T x        x^T A^T + x^T A

Table 3.2:  Derivatives of commonly used expressions with respect to matrices.

f(X)          ∂f(X)/∂X
a^T Xb        ab^T
a^T X^T b     ba^T
tr(AX)        A^T
tr(XX^T)      2X
tr(AXB)       A^T B^T
|X|           |X| (X^{-1})^T

Differentiation with Respect to a Matrix

The derivative of each element of an object with respect to a matrix is the matrix of derivatives of that element with respect to the matrix entries. Therefore, the derivatives of a scalar, vector, or matrix with respect to a matrix are a matrix, a three-dimensional array, and a four-dimensional array, respectively. Derivatives of some commonly used expressions with respect to an n × n matrix X are depicted in Table 3.2. In this table, A, B, and X are n × n matrices, and a and b are n × 1 vectors. Finally, we would like to mention that reference [7] presents a much more comprehensive description of the results pertaining to the computation of derivatives for vector- and matrix-valued functions.
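
The table entries can be spot-checked with finite differences; for example, for f(x) = x^T Ax the gradient listed in Table 3.1 is Ax + A^Tx. A small NumPy sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(13)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda z: z @ A @ z                     # f(x) = x^T A x
grad_table = A @ x + A.T @ x                # gradient entry from Table 3.1

eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])    # central finite differences
print(np.allclose(grad_table, grad_fd, atol=1e-5))
```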

3.22    References and Further Studies

This chapter presented a short summary of the main concepts and results from linear algebra that are used most frequently in the fields of signal processing, wireless communications, and networking. For more detailed and comprehensive treatment of the basic concepts in linear algebra and matrix theory, the readers are directed to the excellent references [3, 5, 8, 34]. For numerical stable implementations of various matrix factorizations and linear algebra algorithms, [24] represents still one of the state-of-the-art references in this area. The handbook [7] presents an encyclopedic coverage of matrix theory from the viewpoint of a statistician. Another excellent reference that provides an in-depth treatment of basic properties of matrices is [5]. Finally, we end this chapter with the remark that linear algebra represents a very important tool for designing efficient algorithms for signal processing, communications, and networking applications (see e.g., [10, 12, 30, 35]).

Acknowledgment: This work was supported by NSF Grant 0915444.

3.23    Exercises

Exercise 3.23.1. Show that the subset S ⊆ V (S ≠ {0}), where V is a vector space over field F, is a subspace if and only if

\forall s_1, s_2 \in S, \ \forall r_1, r_2 \in F: \quad r_1 s_1 + r_2 s_2 \in S.

Exercise 3.23.2. Let B = {b1, b2,…, bn} be an ordered basis for vector space V.

•  Show that every inner-product on V can be computed using the values

a_{i,j} = \langle b_j | b_i \rangle, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, n.

Hint: Take x, y ∈ V and write them as linear combinations of the bi (i = 1,…, n). Form the inner product 〈x|y〉 and the coordinate matrices of x and y in the ordered basis B, [x]B and [y]B. Show that

\langle x | y \rangle = [y]_B^* A\, [x]_B,

where n × n matrix A = [ai,j] consists of the inner-products of the vectors in ordered basis B.

•  Show that A is a Hermitian positive definite matrix.

Exercise 3.23.3. This problem establishes the Cauchy-Schwarz inequality for the induced norm. Let v be a non-zero vector and define:

v' = w - \frac{\langle w | v \rangle}{\|v\|^2}\, v.

Show that 〈v′|v 〉 = 0 and from the fact 0 ≤ ‖v′‖2 conclude that

|\langle v | w \rangle|^2 \le \|v\|^2 \|w\|^2.

Using this result, prove the triangle inequality for the induced norm.

Exercise 3.23.4. Show that every orthogonal subset B (with non-zero elements) of a vector space V is linearly independent.

Exercise 3.23.5. Prove that in the Gram-Schmidt orthogonalization process introduced in Theorem 3.1.5, the following relationship holds:

\langle w_{k+1} | w_i \rangle = 0, \qquad i = 1, \ldots, k.

Thus, {w1, w2,…, wk+1} is an orthogonal set. Does the set {w1, w2,…, wk+1} represent a basis for span(v1, v2,…, vk+1)? Justify.

Exercise 3.23.6. Using the fact that for an m × n matrix A, elementary row (column) operations do not affect the column (row) rank, prove that

row rank(A) = column rank(A).

Exercise 3.23.7. If A is an m × n matrix over field F, show that the corresponding linear transformation TA : F^n → F^m is injective if and only if rank(A) = n and is surjective if and only if rank(A) = m.

Exercise 3.23.8. Show that for a projection operator T : V → V,

T(v) = v, \qquad \forall v \in \mathcal{R}(T).

Use this statement to prove Theorem 3.2.3.

Exercise 3.23.9. By computing 〈Pu|(I − P)v〉 show that a Hermitian idempotent matrix P represents an orthogonal projection. Use this fact to prove Theorem 3.2.4.

Exercise 3.23.10. Let B = {v1, v2,…, vn} be a basis for vector space V.

•  Show that there is a unique linear functional vi* on V, such that

v_i^*(v_j) = \delta_{ij}.

•  Show that the set of n distinct linear functionals on V obtained from B (as mentioned above) is linearly independent.

•  From the dimension of V* what can we conclude about the dual basis of B?

•  Let f be a linear functional on V. Write f as a linear combination of vi*’s (i = 1, 2,… ,n). Moreover, let v be a vector in V. So we can write it as a linear combination of vi’s. Write the expression for vi*(v) and find the unique expression for v as a linear combination of vi’s (i = 1,…, n).

Exercise 3.23.11. Let T : V → W be a linear transformation between two finite-dimensional inner-product vector spaces V and W.

•  Let w ∈ R(T) and n ∈ Null(T*). By computing 〈w|n〉 show that

\mathrm{Null}(T^*) \subseteq \mathcal{R}(T)^{\perp}.

•  Now, let w ∈ R(T)^⊥. Using the definition of the adjoint of T, show that

w \in \mathrm{Null}(T^*),

and

\mathcal{R}(T)^{\perp} \subseteq \mathrm{Null}(T^*).

•  Using the previous results, show that

\mathcal{R}(T)^{\perp} = \mathrm{Null}(T^*).

•  Show that R(T*)^⊥ = Null(T) using a similar argument.

Exercise 3.23.12. Using the properties of the four fundamental subspaces of linear transformation T, show that T is surjective if and only if T* is injective, and T is injective if and only if T* is surjective.

Exercise 3.23.13. Show the following equalities:

\mathrm{Null}(T^*T) = \mathrm{Null}(T), \quad \mathrm{Null}(TT^*) = \mathrm{Null}(T^*), \quad \mathcal{R}(T^*T) = \mathcal{R}(T^*), \quad \mathcal{R}(TT^*) = \mathcal{R}(T).

Hint: First, show that Null (T) ⊂ Null (T*T). Then using the definition of the adjoint, show that if T*T(v) = 0, then T(v) = 0; and hence, Null (T*T) ⊂ Null (T).

Exercise 3.23.14. Show that the two definitions of the operator norm are equivalent, i.e.,

\|T\| = \sup_{v \in V \setminus \{0\}} \frac{\|T(v)\|}{\|v\|} = \sup_{v \in V, \|v\| = 1} \|T(v)\|.

Exercise 3.23.15. This problem derives several matrix norms.

•  Using the definition of ‖x‖1, show that:

\|A\|_1 = \max_{\|x\|_1 = 1} \|Ax\|_1 = \max_j \sum_i |a_{ij}|.

•  Using the definition of ‖x‖∞, show that

\|A\|_\infty = \max_{\|x\|_\infty = 1} \|Ax\|_\infty = \max_i \sum_j |a_{ij}|.

•  In order to find ‖A‖2, use Lagrange's multiplier technique to maximize

f(x) = x^*A^*Ax - \lambda(x^*x - 1).

Hint: Take the gradient of f(x) with respect to [x λ] and equate it to zero. Then exploit the eigenvalues and eigenvectors of the matrix A*A.

•  Show that

\|A\|_F^2 = \mathrm{tr}(A^*A) = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2.

Exercise 3.23.16. Let A be an m × n matrix and U be an m × m unitary matrix. Show that

\|UA\|_F = \|A\|_F.

Exercise 3.23.17. Let A and B be matrices of dimension m × n and n × m, respectively. Show that

\det(I_m + AB) = \det(I_n + BA).

Exercise 3.23.18. Let S be a subspace of an inner-product vector space V, and let v ∈ V. Follow the steps below to prove Theorem 3.12.1.

•  Suppose s ∈ S and v − s is orthogonal to all s* ∈ S; i.e.,

\langle v - s | s^* \rangle = 0, \qquad \forall s^* \in S.

Let s′ ∈ S and s′ ≠ s. By rewriting the difference v − s′ as

v - s' = (v - s) + (s - s'),

show that

\|v - s'\|^2 \ge \|v - s\|^2.

•  Conversely, now let ‖v − s‖² ≤ ‖v − s′‖², ∀s′ ∈ S. Show that

-2\,\mathrm{Re}\langle v - s | s^* \rangle + \|s^*\|^2 \ge 0, \qquad \forall s^* \in S.

Then, by a suitable choice of s*, show that 〈v − s|s*〉 = 0 and conclude that v − s is orthogonal to every vector in S.

•  Show that the best approximation is unique.

•  Let s = \sum_{k=1}^{m} \frac{\langle v | s_k \rangle}{\|s_k\|^2} s_k, where {s1,…, sm} is an orthogonal basis for S. Show that v − s is orthogonal to sj, j = 1,…, m, and hence to every vector in S. Therefore, s is the best approximation of v by vectors in S.

Exercise 3.23.19. Let S be a finite-dimensional subspace of inner-product vector space V. Suppose P is an orthogonal projection of V onto S.

•  Prove that the mapping defined by the matrix I − P, which maps any vector v ∈ V onto v − Pv, is actually the orthogonal projection of V onto S^⊥.

•  Show that P is an idempotent linear mapping from V to S.

•  Show that

Ps = 0, \qquad \forall s \in S^{\perp}.

•  Show that

V = S \oplus S^{\perp}.

Exercise 3.23.20. This problem pertains to the proof of Theorem 3.13.1.

•  Using the best approximation theorem and the four fundamental subspaces related to T, show that the minimum error (T(v) − w) should be in the nullspace of T*.

•  Conclude that

T*T(v) = T*(w).

Exercise 3.23.21. Show that if an n × n matrix A has distinct eigenvalues, the corresponding eigenvectors should be linearly independent.

Exercise 3.23.22. Let vi and vj be two eigenvectors of an n × n Hermitian matrix A, corresponding to different eigenvalues λi and λj, respectively. Compute vi*Avj and show that the eigenvectors are orthogonal. Moreover, by computing vi*Avi, show that A's eigenvalues are real. Use these results to prove the claims of Theorem 3.15.2.

Exercise 3.23.23. This problem is related to diagonalizing a matrix (Theorem 3.15.3). Suppose v1, v2,…, vn are n linearly independent eigenvectors of an n × n matrix A, corresponding to eigenvalues λ1, λ2,… ,λn, respectively.

•  Show that

A[v_1, v_2, \ldots, v_n] = [v_1, v_2, \ldots, v_n] \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix}.

Define S = [v1, v2,…, vn] and Λ = diag (λ1,…, λn), where diag (·) is a diagonal matrix. Note that S is invertible. Conclude that A is diagonalizable.

•  Conversely, let S be an invertible matrix such that

S^{-1}AS = \Lambda,

where Λ is diagonal. Show that the ith column of S is an eigenvector of A, corresponding to the ith diagonal entry of Λ.

Exercise 3.23.24. Using the SVD of matrix A, show that

\|A\|_F = \left( \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2 \right)^{1/2},

where n is the rank of A and σi (i = 1,…, n) are the singular values of A.

Exercise 3.23.25. Show that the product of the block matrices A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} and B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} can be expressed as:

AB = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}.

Exercise 3.23.26. Let the permutation matrix Πm be defined as follows:

\Pi_m = \mathrm{circ}(0, 1, 0, \ldots, 0) = [e_m, e_1, e_2, \ldots, e_{m-1}],

with

e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T, with the single 1 located in the ith position.

Prove that the square matrix A is circulant if and only if A = \Pi_m^T A \Pi_m.

Exercise 3.23.27. Using the result proved in Exercise 3.23.26, justify the following properties.

•  If A and B are circulant matrices, then AB is a circulant matrix.

•  If A is circulant, then A−1 is a circulant matrix.

Exercise 3.23.28. Show that:

•  The eigenvalues of a circulant matrix circ(c0, c1,…, c_{n−1}) are

\lambda_k = \sum_{i=0}^{n-1} c_i\, e^{-j 2\pi i k/n}.

•  The corresponding eigenvectors are

x_k = \frac{1}{\sqrt{n}} \left[ 1, \ e^{-j 2\pi k/n}, \ e^{-j 2\pi (2k)/n}, \ \ldots, \ e^{-j 2\pi (n-1)k/n} \right]^T.

Exercise 3.23.29. Using the expression for the determinant of a Vandermonde matrix, what conditions must the scalars a1, a2,…, am satisfy in order for the Vandermonde matrix A to be invertible?

A = \begin{bmatrix} 1 & a_1 & a_1^2 & \cdots & a_1^{m-1} \\ 1 & a_2 & a_2^2 & \cdots & a_2^{m-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & a_m & a_m^2 & \cdots & a_m^{m-1} \end{bmatrix}.

Exercise 3.23.30. Consider the square matrix A. Prove that:

1.  If A is positive definite, then all its eigenvalues are positive.

2.  If all eigenvalues of A are positive, A is positive definite.

Exercise 3.23.31. Prove that the following factorization is valid for the block-partitioned matrix M:

M = \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} A & 0 \\ C & E \end{bmatrix} \begin{bmatrix} I & A^{-1}B \\ 0 & I \end{bmatrix},

where E is the Schur complement of submatrix A in matrix M.

Exercise 3.23.32. Consider the block matrix

M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}.

Prove that the determinant of matrix M is equal to the product of determinants of A and its Schur complement.

Exercise 3.23.33. Show that the generalized inverse of the block matrix

M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}

is given by:

M^{-} = \begin{bmatrix} A^{-} + A^{-}BE^{-}CA^{-} & -A^{-}BE^{-} \\ -E^{-}CA^{-} & E^{-} \end{bmatrix},

where E = D − CA^−B.

Exercise 3.23.34. Prove the following properties of the Kronecker product. Matrices A, B, C and D are assumed of appropriate sizes.

1.  (A + B) ⊗ C = A ⊗ C + B ⊗ C.

2.  (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).

3.  (αA) ⊗ (βB) = αβ (A ⊗ B).

4.  (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).

Exercise 3.23.35. Prove the vector and matrix derivative results presented in Tables 3.1 and 3.2, respectively.

References

[1]  J. E. Gentle, Matrix Algebra: Theory, Computations, and Applications in Statistics. NY: Springer, 2010.

[2]  S. Leon, Linear Algebra with Applications. 9th ed., NY: Prentice Hall, 2009.

[3]  A. J. Laub, Matrix Analysis for Scientists and Engineers, Philadephia: SIAM, 2004.

[4]  P. Lax, Linear Algebra and Its Applications. 2nd ed., NY: Wiley Interscience, 2007.

[5]  C. Meyer, Matrix Analysis and Applied Linear Algebra Book. Philadephia: SIAM, 2001.

[6]  S. Roman, Advanced Linear Algebra. 3rd ed., NY: Springer, 2010.

[7]  G. A. F. Seber, A Matrix Handbook for Statisticians. NY: Wiley, 2008.

[8]  G. Strang, Linear Algebra and Its Applications. Cengage Learning, Florence, KY: Brooks Cole, 2005.

[9]  R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge: Cambridge University Press, 1985.

[10]  S. Kay, Fundamentals of Statistical Signal Processing, vol. I: Estimation Theory, vol. II: Detection Theory. 1st ed., NY: Prentice Hall, 1993.

[11]  A. H. Sayed, Adaptive Filters. 1st ed., NY: Wiley-IEEE Press, 2008.

[12]  P. Stoica and R. L. Moses, Spectral Analysis of Signals. NY: Prentice Hall, 2005.

[13]  J. Dauxois and G. M. Nkiet, “Canonical analysis of two Euclidean subspaces and its applications,” Linear Algebra Appl., vol. 264, no. 10, pp. 355–388, October 1997.

[14]  R. Gittins, Canonical Analysis: A Review with Applications in Ecology, Biomathematic, Berlin: Springer-Verlag, 1985.

[15]  T. Kailath, “A view of three decades of linear filtering theory,” IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 146–181, February 1974.

[16]  I. M. Gelfand and A. M. Yaglom, “Calculation of the amount of information about a random function contained in another such function,” American Mathematical Society Translations, series vol. 12, pp. 199–246, 1959.

[17]  H. Akaike, “Stochastic theory of minimal realization,” IEEE Transactions on Automatic Control, pp. 667–674, 1974.

[18]  P. E. Caines, Linear Stochastic Systems. NY: Wiley, 1988.

[19]  P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory, Implementation and Applications. Boston: Kluwer Academic Publishers, 1996.

[20]  T. Kim, O. Arandjelovic, and R. Cipolla, “Boosted manifold principal angles for image set-based recognition,” Pattern Recognition, vol. 40, issue 9, pp. 2475–2484, September 2007.

[21]  L. Wolf and A. Shashua, “Learning over Sets using Kernel Principal Angles,” Journal of Machine Learning Research, vol. 4, pp. 913–931, 2003.

[22]  R. T. Behrens and L. Scharf, “Signal Processing Applications of Oblique Projection Operators,” IEEE Transactions on Signal Processing, vol. 42, no. 6, pp. 1413–1414, June 1994.

[23]  I. E. Telatar, “Capacity of multi-antenna Gaussian channels,” Eur. Trans. Telecommun., vol. 10, no. 6, pp. 585–595, November 1999.

[24]  G. Golub and C. F. van Loan, Matrix Computations. 3rd. ed., Baltimore: John Hopkins University Press, 1996.

[25]  I. Jolliffe, Principal Component Analysis. Berlin: Springer, 2002.

[26]  S. K. Mitra, Digital Signal Processing. NY: McGraw-Hill, 2005.

[27]  R. M. Gray, Toeplitz and Circulant Matrices: A Review. Boston-Delft: Now Publishers Inc., 2005.

[28]  M. Verhaegen, "Identification of the Deterministic Part of MIMO State Space Models given in Innovations Form from Input-Output Data," Automatica, vol. 30, no. 1, pp. 61–74, January 1994.

[29]  R. Bhatia, Positive Definite Matrices. Princeton: Princeton University Press, 2006.

[30]  S. Haykin, Adaptive Filter Theory. NY: Prentice Hall, 2002.

[31]  T. Soderstrom and P. Stoica, System Identification. NY: Prentice Hall, 1994.

[32]  F. Zhang, The Schur Complement and Its Applications. Berlin: Springer, 2010.

[33]  C. R. Rao and S. K. Mitra, Generalized Inverse of Matrices and Its Applications. NY: Wiley, 1972.

[34]  T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 2000.

[35]  S. B. Wicker, Error Control Systems for Digital Communication and Storage. Upper Saddle River, NJ: Prentice-Hall, 1995.
