7.1 Introduction and Use Cases
This chapter is concerned with the adaptive solution of the generalized eigenvalue problems AΦ=BΦΛ, ABΦ=ΦΛ, and BAΦ=ΦΛ, where A and B are real, symmetric, n×n matrices and B is positive definite. In particular, we consider the problem AΦ=BΦΛ; the remaining two problems are similar. The matrix pair (pencil) (A,B) is commonly referred to as a symmetric-definite pencil [Golub and VanLoan 83].
As seen before, the conventional (numerical analysis) method for evaluating Φ and Λ requires the computation of (A,B) after collecting all of the samples, followed by the application of a numerical procedure [Golub and VanLoan 83]; in other words, it works in a batch fashion. In the online case, by contrast, the matrices (A,B) are unknown. Instead, two sequences of random matrices {Ak,Bk} are available, with limk→∞E[Ak]=A and limk→∞E[Bk]=B. For every sample (Ak,Bk), we need to obtain current estimates (Φk,Λk) of (Φ,Λ), such that (Φk,Λk) converge strongly to (Φ,Λ).
Application of GEVD in Pattern Recognition
In pattern recognition, there are problems where we are given samples (x∈ℜn) from different populations or pattern classes. The well-known problem of linear discriminant analysis (LDA) [Chatterjee et al. Nov 97, May 97, Mar 97] seeks a transform W∈ℜn×p (p≤n) such that the interclass distance (measured by the scatter of the patterns around their mixture mean) is maximized, while at the same time the intraclass distance (measured by the scatter of the patterns around their respective class means) is as small as possible. The objective of this transform is to group the classes into well-separated clusters. The former scatter matrix, known as the mixture scatter matrix, is denoted by A, and the latter, known as the within-class scatter matrix, is denoted by B [Fukunaga 90]. When only the first column w of W is needed (i.e., p=1), the problem can be formulated in the constrained optimization framework as
Maximize wTAw subject to wTBw = 1. (7.1)
A twin problem to (7.1) is to maximize the Rayleigh quotient criterion [Golub and VanLoan 83] with respect to w:
r(w) = (wTAw)/(wTBw). (7.2)
A solution to (7.1) or (7.2) leads to the generalized eigen-decomposition problem Aw=λBw, where λ is the largest generalized eigenvalue of A with respect to B. In general, the p columns of W are the p≤n orthogonal unit generalized eigenvectors ϕ1,...,ϕp of A with respect to B where
ϕiTBϕj = δij for 1≤i,j≤p, and Aϕi = λiBϕi for i=1,…,p, (7.3)
where λ1 > … > λp > λp+1 ≥ … ≥ λn > 0 are the generalized eigenvalues of A with respect to B in descending order of magnitude, with the p largest assumed distinct. In summary, LDA is a powerful feature extraction tool for the class separability feature [Chatterjee May 97], and our adaptive algorithms are well suited to it.
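As a batch point of reference (my illustration; the matrices below are small stand-ins, not the chapter's data), the solution of (7.1)-(7.3) can be computed directly with SciPy, whose scipy.linalg.eigh accepts a symmetric-definite pencil:

```python
import numpy as np
from scipy.linalg import eigh

# A small symmetric-definite pencil (A, B); B must be positive definite.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
B = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# eigh solves A phi = lambda B phi; eigenvalues are returned in ascending
# order, so reverse both to match the descending order used in (7.3).
lam, Phi = eigh(A, B)
lam, Phi = lam[::-1], Phi[:, ::-1]

# The columns of Phi are B-orthonormal generalized eigenvectors.
print(np.allclose(Phi.T @ B @ Phi, np.eye(2)))       # True
print(np.allclose(A @ Phi, B @ Phi @ np.diag(lam)))  # True
```

For LDA, A and B would be the mixture and within-class scatter matrices; the first p columns of Phi then form the transform W.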
Application of GEVD in Signal Processing
Next, let’s discuss an analogous problem of detecting a desired signal in the presence of interference. Here, we seek the optimum linear transform W for weighting the signal plus interference such that the desired signal is detected with maximum power and minimum interference. Given the matrix pair (A,B), where A is the correlation matrix of the signal plus interference plus noise and B is the correlation matrix of interference plus noise, we can formulate the signal detection problem as the constrained maximization problem in (7.1). Here, we maximize the signal power and minimize the power of the interference. The solution for W consists of the p≤n largest generalized eigenvectors of the matrix pencil (A,B). Adaptive generalized eigen-decomposition algorithms also allow the tracking of slow changes in the incoming data [Chatterjee et al. Nov 97, Mar 97; Chen et al. 2000].
Methods for Generalized Eigen-Decomposition
We first define the problem for the non-adaptive case. Each of the three generalized eigenvalue problems (AΦ=BΦΛ, ABΦ=ΦΛ, and BAΦ=ΦΛ, where A and B are real, symmetric, n×n matrices and B is positive definite) can be reduced to a standard symmetric eigenvalue problem using a Cholesky factorization of B as either B=LLT or B=UTU. With B = LLT, we can write AΦ=BΦΛ as
(L−1AL−T)(LTΦ) = (LTΦ)Λ or CΨ = ΨΛ. (7.4)
Types of Generalized Eigen-Decomposition problems and their solutions
Type of Problem | Factorization of B | Reduction | Generalized Eigenvectors |
---|---|---|---|
AΦ=BΦΛ | B = LLT | C = L–1 A L–T | Φ = L–T Ψ |
AΦ=BΦΛ | B = UTU | C = U–T A U–1 | Φ = U–1 Ψ |
ABΦ=ΦΛ | B = LLT | C = LT A L | Φ = L–T Ψ |
ABΦ=ΦΛ | B = UTU | C = U A UT | Φ = U–1 Ψ |
BAΦ=ΦΛ | B = LLT | C = LT A L | Φ = L Ψ |
BAΦ=ΦΛ | B = UTU | C = U A UT | Φ = UT Ψ |
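The reduction (7.4) can be verified numerically with NumPy alone; this sketch uses a randomly generated symmetric-definite pencil rather than any specific data from the chapter:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
M = rng.standard_normal((n, n))
A = (M + M.T) / 2.0                  # real symmetric A
R = rng.standard_normal((n, n))
B = R @ R.T + n * np.eye(n)          # symmetric positive definite B

# Reduction (7.4): B = L L^T, C = L^{-1} A L^{-T}, solve C Psi = Psi Lambda.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
C = Linv @ A @ Linv.T
lam, Psi = np.linalg.eigh(C)         # standard symmetric eigenproblem

# Recover the generalized eigenvectors: Phi = L^{-T} Psi.
Phi = np.linalg.solve(L.T, Psi)

print(np.allclose(A @ Phi, B @ Phi @ np.diag(lam)))  # True
print(np.allclose(Phi.T @ B @ Phi, np.eye(n)))       # True
```

The same pattern applies to the other rows of the table, with the reduction and back-substitution changed accordingly.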
In the adaptive case, we can extend these techniques by first (adaptively) computing a matrix Wk for each sample Bk, where Wk tends to the inverse Cholesky factor L–1 of B with probability one (w.p.1) as k→∞. Any of the algorithms in Chapter 3 can be considered here. Next, we consider the sequence {Ck = WkAkWkT}, which is used to adaptively compute a matrix Vk, where Vk tends to the eigenvector matrix of limk→∞E[Ck] w.p.1 as k→∞. Any of the algorithms in Chapters 5 and 6 can be considered for this purpose. In conjunction, the two steps yield WkVk, which is proven to converge w.p.1 to Φ as k→∞. Thus, the two steps can proceed simultaneously and converge strongly to the eigenvector matrix Φ. A full description of this method is given in [Chatterjee et al. May 97, Mar 97]. In this chapter, I offer a variety of new techniques to solve the generalized eigen-decomposition problem for the adaptive case.
Outline of This Chapter
In Section 7.2, I list the objective functions from which I derive the adaptive generalized eigen-decomposition algorithms. In Section 7.3, I present adaptive algorithms for the homogeneous, deflation, and weighted variations of the OJA objective function. In Section 7.4, I present the same three variations for the mean squared error (XU) objective function, along with convergence proofs for the deflation case. In Section 7.5, I discuss algorithms derived from the penalty function (PF) objective function. In Sections 7.6 and 7.7, I consider the augmented Lagrangian 1 (AL1) and augmented Lagrangian 2 (AL2) objective functions, respectively. In Section 7.8, I present the information theory (IT) criterion, and in Section 7.9, the Rayleigh quotient (RQ) criterion. In Section 7.10, I discuss the experimental results, and in Section 7.11, I present conclusions.
7.2 Algorithms and Objective Functions
Similar to the PCA algorithms (Chapter 5), in this chapter I present several adaptive algorithms for generalized eigenvector computation. I consider two asymptotically stationary sequences {xk∈ℜn} and {yk∈ℜn} that have been centered to zero mean. We can represent the corresponding online correlation matrices {Ak,Bk} of {xk,yk} either by their instantaneous values {xkxkT, ykykT} or by their running averages by (2.3). If, however, {xk,yk} are non-stationary, we can construct the correlation matrices {Ak,Bk} from the data samples {xk,yk} by (2.5).
Summary of Objective Functions for Adaptive GEVD Algorithms
- Oja's objective function (OJA)
- Xu's mean squared error objective function (XU)
- Penalty function method (PF)
- Augmented Lagrangian Method 1 (AL1)
- Augmented Lagrangian Method 2 (AL2)
- Information theory criterion (IT)
- Rayleigh quotient criterion (RQ)
1. Homogeneous Adaptive Rule: These algorithms do not compute the true normalized generalized eigenvectors in decreasing order of eigenvalues; they converge to an arbitrary rotation of them.
2. Deflation Adaptive Rule: Here, we produce unit generalized eigenvectors in decreasing order of eigenvalues. However, the training is sequential, which makes parallel implementation harder.
3. Weighted Adaptive Rule: These algorithms use a different scalar weight for each generalized eigenvector, making the eigenvectors normalized and ordered by decreasing eigenvalues.
Summary of Generalized Eigenvector Algorithms
List of Adaptive Generalized Eigen-Decomposition Algorithms
Alg. | Type | Adaptive Algorithm h(Wk,Ak) |
---|---|---|
OJA | Homogeneous | see (7.7) |
OJA | Deflation | see (7.10) |
OJA | Weighted | see (7.13) |
XU | Homogeneous | see (7.16) |
XU | Deflation | see (7.18) |
XU | Weighted | see (7.20) |
PF | Homogeneous | see (7.22) |
PF | Deflation | see (7.24) |
PF | Weighted | see (7.26) |
AL1 | Homogeneous | see (7.30) |
AL1 | Deflation | see (7.32) |
AL1 | Weighted | see (7.34) |
AL2 | Homogeneous | see (7.37) |
AL2 | Deflation | see (7.39) |
AL2 | Weighted | see (7.41) |
IT | Homogeneous | see (7.45) |
IT | Deflation | see (7.47) |
IT | Weighted | see (7.49) |
RQ | Homogeneous | see (7.53) |
RQ | Deflation | see (7.55) |
RQ | Weighted | see (7.57) |
In the following discussions, I denote Φ=[ϕ1 ... ϕn]∈ℜn×n as the B-orthonormal generalized eigenvector matrix of A with respect to B, and Λ = diag(λ1,...,λn) as the generalized eigenvalue matrix, such that λ1>λ2>...>λp>λp+1≥...≥λn>0. I use the subscript (i) to denote the ith permutation of the indices {1,2,…,n}.
7.3 OJA GEVD Algorithms
OJA Homogeneous Algorithm
The objective function for the OJA homogeneous algorithm can be written as
, (7.5)
for i=1,…,p (p≤n). From the gradient of (7.5) with respect to wi, we obtain the following adaptive algorithm:
for i=1,…,p, (7.6)
where ηk is a decreasing gain sequence. Defining Wk = [w1k ⋯ wpk], from (7.6) we get
. (7.7)
OJA Deflation Algorithm
The objective function for the OJA deflation adaptive GEVD algorithm is
, (7.8)
for i=1,…,p. From the gradient of (7.8) with respect to wi, we obtain the OJA deflation adaptive gradient descent algorithm as
, (7.9)
for i=1,…,p (p≤n). The matrix form of the algorithm is
, (7.10)
where UT[⋅] sets all elements below the diagonal of its matrix argument to zero.
OJA Weighted Algorithm
The objective function for the OJA weighted adaptive GEVD algorithm is
(7.11)
for i=1,…,p, where c1,…,cp (p≤n) are small positive numbers satisfying
c1 > c2 > … > cp > 0, p ≤ n. (7.12)
Given a diagonal matrix C = diag (c1, …, cp), p ≤ n, the OJA weighted adaptive algorithm is
. (7.13)
OJA Algorithm Python Code
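The listing below is a minimal sketch, not the chapter's original code. It assumes the widely used Oja-type GEVD deflation update Wk+1 = Wk + ηk(AkWk − BkWk UT[WkTAkWk]); for clarity it iterates with a fixed, known pencil (A,B) instead of stochastic samples (Ak,Bk), and the exact form of (7.9)-(7.10) should be checked against the text.

```python
import numpy as np

def ut(M):
    """UT[.]: zero all elements below the diagonal."""
    return np.triu(M)

# Diagonal test pencil: generalized eigenvalues 5, 1, 0.25 with
# generalized eigenvectors along the coordinate axes (easy to check).
A = np.diag([5.0, 2.0, 1.0])
B = np.diag([1.0, 2.0, 4.0])

rng = np.random.default_rng(3)
W = rng.uniform(0.1, 0.5, size=(3, 2))   # estimate the first p=2 eigenvectors
eta = 0.01                               # small constant gain for the sketch
for _ in range(4000):
    W = W + eta * (A @ W - B @ W @ ut(W.T @ A @ W))

# Columns align with e1, e2 and become B-normalized (w^T B w = 1).
for i in range(2):
    w = W[:, i]
    print(abs(w[i]) / np.linalg.norm(w), w @ B @ w)
```

In the online setting, A and B inside the loop are replaced by the sample matrices Ak and Bk (e.g., xk xkT and yk ykT) and eta by the decreasing gain ηk.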
7.4 XU GEVD Algorithms
XU Homogeneous Algorithm
The objective function for the XU homogeneous adaptive GEVD algorithm is
(7.14)
for i=1,…,p (p≤n). From the gradient of (7.14) with respect to wi, we obtain the XU homogeneous adaptive gradient descent algorithm as
(7.15)
for i=1,…,p, whose matrix form is
. (7.16)
XU Deflation Algorithm
The objective function for the XU deflation adaptive GEVD algorithm is
(7.17)
for i=1,…,p (p≤n). From the gradient of (7.17) with respect to wi, we obtain
, (7.18)
where UT[⋅] sets all elements below the diagonal of its matrix argument to zero. Chatterjee et al. [Mar 00, Thms 1, 2] proved that Wk converges with probability one to [±ϕ1 ±ϕ2 … ±ϕp] as k→∞.
XU Weighted Algorithm
The objective function for the XU weighted adaptive GEVD algorithm is
(7.19)
for i=1,…,p (p≤n), where c1,…,cp are small positive numbers satisfying (7.12). The adaptive algorithm is
, (7.20)
where C = diag(c1,…,cp).
XU Algorithm Python Code
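Again a sketch rather than the chapter's original listing: it assumes the standard Xu (LMSER-style) GEVD deflation update Wk+1 = Wk + ηk(2AkWk − AkWk UT[WkTBkWk] − BkWk UT[WkTAkWk]), whose deflation case is the one Chatterjee et al. [Mar 00] prove convergent; verify the exact form against (7.17)-(7.18).

```python
import numpy as np

def ut(M):
    """UT[.]: zero all elements below the diagonal."""
    return np.triu(M)

A = np.diag([5.0, 2.0, 1.0])   # test pencil with well-separated
B = np.diag([1.0, 2.0, 4.0])   # generalized eigenvalues 5, 1, 0.25

rng = np.random.default_rng(5)
W = rng.uniform(0.1, 0.5, size=(3, 2))
eta = 0.01
for _ in range(4000):
    AW, BW = A @ W, B @ W
    W = W + eta * (2.0 * AW - AW @ ut(W.T @ BW) - BW @ ut(W.T @ AW))

# Columns align with e1, e2 and satisfy w^T B w = 1.
for i in range(2):
    w = W[:, i]
    print(abs(w[i]) / np.linalg.norm(w), w @ B @ w)
```

Note the extra AkWk UT[WkTBkWk] term relative to OJA; it is what yields the mean-squared-error interpretation of the XU family.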
7.5 PF GEVD Algorithms
PF Homogeneous Algorithm
We obtain the objective function for the PF homogeneous generalized eigenvector algorithm by writing the Rayleigh quotient criterion (7.2) as the following penalty function:
, (7.21)
where μ > 0 and i=1,…,p (p≤n). From the gradient of (7.21) with respect to wi, we obtain the PF homogeneous adaptive algorithm:
, (7.22)
where Ip is a p×p identity matrix.
PF Deflation Algorithm
The objective function for the PF deflation GEVD algorithm is
, (7.23)
where μ > 0 and i=1,…,p. The adaptive algorithm is
, (7.24)
where UT[⋅] sets all elements below the diagonal of its matrix argument to zero.
PF Weighted Algorithm
The objective function for the PF weighted GEVD algorithm is
, (7.25)
where c1 > c2 > … > cp > 0 (p ≤ n), μ > 0, and i=1,…,p. The corresponding adaptive algorithm is
, (7.26)
where C = diag (c1, …, cp).
PF Algorithm Python Code
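A minimal single-vector sketch of the penalty idea, under an assumed objective J(w) = −wTAw + (μ/2)(wTBw − 1)² (not necessarily the chapter's exact (7.21)): gradient descent gives w ← w + η(Aw − μ(wTBw − 1)Bw). A known property of penalty methods, visible below, is that the constraint is met only approximately: at the fixed point wTBw = 1 + λ1/μ.

```python
import numpy as np

A = np.diag([5.0, 2.0, 1.0])   # generalized eigenvalues 5, 1, 0.25
B = np.diag([1.0, 2.0, 4.0])
mu, eta = 50.0, 0.005          # penalty constant and gain

w = np.array([0.3, 0.3, 0.3])
for _ in range(5000):
    w = w + eta * (A @ w - mu * (w @ B @ w - 1.0) * (B @ w))

r = (w @ A @ w) / (w @ B @ w)  # Rayleigh quotient, approaches lambda_1 = 5
print(r, w @ B @ w)            # w^T B w settles near 1 + lambda_1/mu = 1.1
```

Increasing mu tightens the constraint at the cost of a stiffer (harder to step) iteration; the AL1/AL2 families below remove this bias.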
7.6 AL1 GEVD Algorithms
AL1 Homogeneous Algorithm
We apply the augmented Lagrangian method of nonlinear optimization to the Rayleigh quotient criterion (7.2) to obtain the objective function for the AL1 homogeneous GEVD algorithm as
,
, (7.27)
for i=1,…,p (p≤n), where (α, β1, β2, …, βp) are Lagrange multipliers and μ is a positive penalty constant. Taking the gradient of (7.27) with respect to wi, equating it to 0, and using the constraint wiTBwi = 1, we obtain
and for j=1,…,p. (7.28)
Replacing (α, β1, β2, …, βp) in the gradient of (7.27), we obtain the AL1 homogeneous adaptive gradient descent generalized eigenvector algorithm:
(7.29)
where μ > 0. Defining Wk = [w1k ⋯ wpk], we obtain
, (7.30)
where Ip is a p×p identity matrix.
AL1 Deflation Algorithm
The objective function for the AL1 deflation GEVD algorithm is
, (7.31)
for i=1,…,p (p≤n). Following the steps used for the AL1 homogeneous algorithm, we obtain the adaptive algorithm:
. (7.32)
AL1 Weighted Algorithm
The objective function for the AL1 weighted GEVD algorithm is
,
, (7.33)
for i=1,…,p (p≤n), where (α, β1, β2, …, βp) are Lagrange multipliers, μ is a positive penalty constant, and c1 > c2 > … > cp > 0. The adaptive algorithm is
, (7.34)
where C = diag (c1, …, cp).
AL1 Algorithm Python Code
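A single-vector sketch of the augmented Lagrangian idea, assuming one plausible realization in which the multipliers are eliminated (in the spirit of (7.28)) to give w ← w + η(Aw − (wTAw/wTBw)Bw − μ(wTBw − 1)Bw); this is an assumed form, not necessarily the chapter's exact (7.29)-(7.30). Unlike the pure penalty method, the fixed point here satisfies wTBw = 1 exactly.

```python
import numpy as np

A = np.diag([5.0, 2.0, 1.0])   # generalized eigenvalues 5, 1, 0.25
B = np.diag([1.0, 2.0, 4.0])
mu, eta = 10.0, 0.005

w = np.array([0.3, 0.3, 0.3])
for _ in range(6000):
    Bw = B @ w
    r = (w @ A @ w) / (w @ Bw)           # Rayleigh quotient term
    w = w + eta * (A @ w - r * Bw - mu * (w @ Bw - 1.0) * Bw)

r = (w @ A @ w) / (w @ B @ w)
print(r)           # approaches lambda_1 = 5
print(w @ B @ w)   # approaches 1: the constraint holds without bias
```

The multiplier-like term r plays the role of α at the solution, which is why the penalty bias of the PF method disappears.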
7.7 AL2 GEVD Algorithms
AL2 Homogeneous Algorithm
The unconstrained objective function for the AL2 homogeneous GEVD algorithm is
, (7.35)
for i=1,…,p, where μ is a positive penalty constant. From (7.35), we obtain the adaptive gradient descent algorithm:
, (7.36)
for i=1,…,p, the matrix version of which is
. (7.37)
AL2 Deflation Algorithm
The objective function for the AL2 deflation GEVD algorithm is
, (7.38)
for i=1,…,p. The adaptive GEVD algorithm is
. (7.39)
AL2 Weighted Algorithm
The objective function for the AL2 weighted generalized eigenvector algorithm is
, (7.40)
for i=1,…,p, where c1 > c2 > … > cp > 0 (p ≤ n). The adaptive algorithm is
, (7.41)
where C = diag (c1, …, cp), p ≤ n.
AL2 Algorithm Python Code
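A deflation-style matrix sketch of the augmented-Lagrangian idea, under the assumed update W ← W + η(AW − BW UT[WTAW] − μ BW UT[WTBW − I]); treat this as illustrative and check it against the chapter's actual (7.38)-(7.39). Its fixed point is the B-orthonormal eigenvector matrix, with the constraint WTBW = I enforced exactly.

```python
import numpy as np

def ut(M):
    """UT[.]: zero all elements below the diagonal."""
    return np.triu(M)

A = np.diag([5.0, 2.0, 1.0])   # generalized eigenvalues 5, 1, 0.25
B = np.diag([1.0, 2.0, 4.0])
mu, eta = 5.0, 0.005

rng = np.random.default_rng(11)
W = rng.uniform(0.1, 0.5, size=(3, 2))   # first p=2 eigenvectors
for _ in range(8000):
    BW = B @ W
    W = W + eta * (A @ W - BW @ ut(W.T @ A @ W)
                   - mu * BW @ ut(W.T @ BW - np.eye(2)))

print(W.T @ B @ W)   # approaches the 2x2 identity (B-orthonormal columns)
```

The UT[WTBW − I] term penalizes both the column norms and the pairwise B-inner products, which is what drives WTBW to the identity.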
7.8 IT GEVD Algorithms
IT Homogeneous Algorithm
The objective function for the information theory homogeneous GEVD algorithm is
(7.42)
for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers and ln(⋅) is the natural logarithm. By equating the gradient of (7.42) with respect to wi to 0 and using the constraint wiTBwi = 1, we obtain
α = 0 and , (7.43)
for j=1,…,p. Replacing (α, β1, β2, …, βp) in the gradient of (7.42), we obtain the IT homogeneous adaptive gradient descent algorithm for the generalized eigenvector:
, (7.44)
for i=1,…,p, whose matrix version is
, (7.45)
where DIAG[⋅] sets all elements except the diagonal of its matrix argument to zero.
IT Deflation Algorithm
The objective function for the IT deflation GEVD algorithm is
(7.46)
for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. From (7.43), we obtain the adaptive gradient algorithm:
(7.47)
IT Weighted Algorithm
The objective function for the IT weighted GEVD algorithm is
(7.48)
for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By solving for (α, β1, β2, …, βp) and replacing them in the gradient of (7.48), we obtain the adaptive algorithm:
. (7.49)
IT Algorithm Python Code
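Since the exact objective (7.42) involves ln(⋅), the sketch below assumes the scale-invariant log-quotient criterion J(w) = ln(wTAw) − ln(wTBw), whose gradient ascent is w ← w + η(Aw/(wTAw) − Bw/(wTBw)); this is an assumed stand-in, not necessarily the chapter's exact update. A convenient property is that the step is Euclidean-orthogonal to w, so the norm of w stays essentially constant while the direction converges.

```python
import numpy as np

A = np.diag([5.0, 2.0, 1.0])   # generalized eigenvalues 5, 1, 0.25
B = np.diag([1.0, 2.0, 4.0])
eta = 0.01

w = np.array([0.3, 0.3, 0.3])
for _ in range(6000):
    w = w + eta * (A @ w / (w @ A @ w) - B @ w / (w @ B @ w))

r = (w @ A @ w) / (w @ B @ w)
print(r)                              # approaches lambda_1 = 5
print(abs(w[0]) / np.linalg.norm(w))  # direction cosine to phi_1, near 1
```

Because the criterion is scale invariant, only the direction of w is meaningful; a final B-normalization step recovers a unit generalized eigenvector.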
7.9 RQ GEVD Algorithms
RQ Homogeneous Algorithm
We obtain the objective function for the Rayleigh quotient homogeneous GEVD algorithm from the Rayleigh quotient criterion (7.2) as follows:
(7.50)
for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By equating the gradient of (7.50) with respect to wi to 0 and using the constraint wiTBwi = 1, we obtain
α = 0 and for j=1,…,p. (7.51)
Replacing (α, β1, β2, …, βp) in the gradient of (7.50) and making a small approximation, we obtain the RQ homogeneous adaptive gradient descent algorithm for the generalized eigenvector:
, (7.52)
for i=1,…,p, whose matrix version is
. (7.53)
RQ Deflation Algorithm
The objective function for the RQ deflation GEVD algorithm is
, (7.54)
for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By solving for (α, β1, β2, …, βp) and replacing them in the gradient of (7.54), we obtain the adaptive algorithm:
. (7.55)
RQ Weighted Algorithm
The objective function for the RQ weighted GEVD algorithm is
, (7.56)
for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By solving for (α, β1, β2, …, βp) and replacing them in the gradient of (7.56), we obtain the adaptive algorithm:
. (7.57)
RQ Algorithm Python Code
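The gradient of the Rayleigh quotient (7.2) is the standard ∇r(w) = 2(Aw − r(w)Bw)/(wTBw), so a single-vector gradient-ascent sketch of the RQ family (my illustration, not the chapter's exact listing) is:

```python
import numpy as np

A = np.diag([5.0, 2.0, 1.0])   # generalized eigenvalues 5, 1, 0.25
B = np.diag([1.0, 2.0, 4.0])
eta = 0.01

w = np.array([0.3, 0.3, 0.3])
for _ in range(4000):
    q = w @ B @ w
    r = (w @ A @ w) / q
    # Ascend the Rayleigh quotient: step along (A w - r B w) / q.
    w = w + eta * (A @ w - r * (B @ w)) / q

r = (w @ A @ w) / (w @ B @ w)
print(r)   # approaches the largest generalized eigenvalue, 5
```

As with the IT sketch, the step is orthogonal to w, so the iteration converges in direction; the adaptive versions replace (A,B) with the samples (Ak,Bk), at the cost of the extra divisions noted in Section 7.11.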
7.10 Experimental Results
I generated 1,000 samples (of {xk} and {yk}) from 10-dimensional Gaussian data (i.e., n=10) with zero mean and the covariances given below. The covariance matrix A for {xk} is obtained from the second covariance matrix in [Okada and Tomita 85] multiplied by 3 as follows:
The covariance matrix B for {yk} is obtained from the third covariance matrix in [Okada and Tomita 85] multiplied by 2 as follows:
The generalized eigenvalues of (A,B) are
107.9186, 49.0448, 8.3176, 5.1564, 2.8814, 2.3958, 1.9872, 1.2371, 0.9371, 0.1096.
I computed the first four principal generalized eigenvectors (i.e., the eigenvectors corresponding to the four largest generalized eigenvalues; p=4) by the adaptive algorithms described before. In order to compute the online data sequence {Ak}, I generated random data vectors {xk} from the above covariance matrix A. I generated {Ak} from {xk} by using algorithm (2.21) in Chapter 2. Similarly, I generated random data vectors {yk} from the covariance matrix B and then generated {Bk} from {yk}. I computed the correlation matrices Acomputed and Bcomputed after collecting all samples xk and yk respectively as
Acomputed = (1/N) Σk=1..N xkxkT and Bcomputed = (1/N) Σk=1..N ykykT, where N is the number of collected samples.
I referred to the generalized eigenvectors and eigenvalues computed from this Acomputed and Bcomputed by a standard numerical analysis method [Golub and VanLoan 83] as the actual values.
I started all algorithms with W0 = 0.1*ONE, where ONE is a 10×4 matrix all of whose elements are ones. In order to measure the convergence and accuracy of the algorithms, I computed the direction cosine at the kth update of each adaptive algorithm as
Direction cosine (k) = |wikTϕi| / (‖wik‖ ‖ϕi‖), (7.58)
where wik is the estimated ith generalized eigenvector of (Ak,Bk) at the kth update and ϕi is the actual ith generalized eigenvector computed from all collected samples by a conventional numerical analysis method.
I used ηk=1/(140+k) for the deflation algorithms and ηk=1/(500+k) for the weighted algorithms. The diagonal weight matrix C used for the weighted algorithms is diag(2.6, 2.3, 2.0, 1.7). I ran all algorithms for three epochs of the data, where one epoch means presenting all training data once in random order. I do not show results for the homogeneous algorithms, since the homogeneous method produces a linear combination of the actual generalized eigenvectors of (A,B); the direction cosines are therefore not indicative of performance in the homogeneous case.
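The direction cosine of (7.58) can be computed with a small helper, assuming the standard definition |wTϕ|/(‖w‖‖ϕ‖):

```python
import numpy as np

def direction_cosine(w, phi):
    """|w^T phi| / (||w|| * ||phi||): equals 1.0 when the estimate is
    aligned with the actual generalized eigenvector up to sign and scale."""
    return abs(w @ phi) / (np.linalg.norm(w) * np.linalg.norm(phi))

phi = np.array([1.0, 2.0, -1.0])

# A perfect estimate differs from phi only by scale and sign.
print(direction_cosine(-3.0 * phi, phi))                # 1.0
# An orthogonal estimate scores 0.
print(direction_cosine(np.array([2.0, -1.0, 0.0]), phi))  # 0.0
```

Tracking this quantity per update and per eigenvector index i gives the convergence curves discussed in the text.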
7.11 Concluding Remarks
Observe that the convergence of all algorithms degrades progressively for the minor generalized eigenvectors and is best for the principal generalized eigenvector. This is expected, since the convergence of these adaptive algorithms depends on the relative magnitudes of the generalized eigenvalues. Furthermore, the weighted algorithms, which can be implemented in parallel hardware, performed very similarly to the deflation algorithms.
1. For each algorithm, I rated the compute and convergence performance.
2. I skipped the homogeneous algorithms because they are not useful in practice; they produce arbitrary rotations of the eigenvectors.
3. Note that Ak∈ℜn×n, Bk∈ℜn×n, and Wk∈ℜn×p. I present the computational complexity of each algorithm in terms of the matrix dimensions n and p.
4. The convergence performance is based on the speed of convergence of the principal and minor components. I rated convergence on a scale of 1-10, where 10 is the fastest converging algorithm.
5. I skipped the IT and RQ algorithms because they did not perform well compared to the remaining algorithms, and their matrix inversions increase computational complexity. See Table 7-3.
List of Adaptive GEVD Algorithms, Complexity, and Performance
Alg | Type | Adaptive Algorithm h(Wk,Ak) | Complexity, Rating |
---|---|---|---|
OJA | Deflation | see (7.10) | n3p6, 6 |
OJA | Weighted | see (7.13) | n4p6, 6 |
XU | Deflation | see (7.18) | 2n3p6, 8 |
XU | Weighted | see (7.20) | 2n4p6, 8 |
PF | Deflation | see (7.24) | n2p4, 7 |
PF | Weighted | see (7.26) | n3p4, 7 |
AL1 | Deflation | see (7.32) | n3p6+ n2p4, 9 |
AL1 | Weighted | see (7.34) | n4p6+ n3p4, 9 |
AL2 | Deflation | see (7.39) | 2n3p6+ n2p4, 10 |
AL2 | Weighted | see (7.41) | 2n4p6+ n3p4, 10 |
IT | Deflation | see (7.47) | Not applicable |
IT | Weighted | see (7.49) | Not applicable |
RQ | Deflation | see (7.55) | Not applicable |
RQ | Weighted | see (7.57) | Not applicable |
1. The OJA algorithm has the least complexity and good performance.
2. The AL2 algorithm has the most complexity and the best performance.
3. The AL1 algorithm is the next best after AL2, followed by PF and XU.
The complexity and accuracy tradeoffs will determine the algorithm to use in real-world scenarios. If you can afford the computation, the AL2 algorithm is the best; the XU algorithm offers a good balance of complexity and performance.
In summary, I showed 21 algorithms, many of them new, from a common framework with an objective function for each. Note that although I applied the gradient descent technique on these objective functions, I could have applied any other technique of nonlinear optimization such as steepest descent, conjugate direction, Newton-Raphson, or recursive least squares. The availability of the objective functions allows us to derive new algorithms by using new optimization techniques on them and also to perform convergence analyses of the adaptive algorithms.