© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Chatterjee, Adaptive Machine Learning Algorithms with Python, https://doi.org/10.1007/978-1-4842-8017-1_7

7. Generalized Eigenvectors

Chanchal Chatterjee
San Jose, CA, USA

7.1 Introduction and Use Cases

This chapter is concerned with the adaptive solution of the generalized eigenvalue problems AΦ=BΦΛ, ABΦ=ΦΛ, and BAΦ=ΦΛ, where A and B are real, symmetric n×n matrices and B is positive definite. In particular, we shall consider the problem AΦ=BΦΛ, although the remaining two problems are similar. The matrix pair (pencil) (A,B) is commonly referred to as a symmetric-definite pencil [Golub and VanLoan 83].

As seen before, the conventional (numerical analysis) method for evaluating Φ and Λ requires computing (A,B) after collecting all of the samples and then applying a numerical procedure [Golub and VanLoan 83]; in other words, the approach works in a batch fashion. In contrast, in the online case the matrices (A,B) are unknown. Instead, two sequences of random matrices {Ak,Bk} are available with limk→∞E[Ak]=A and limk→∞E[Bk]=B. For every sample (Ak,Bk), we need to obtain the current estimates (Φk,Λk) of (Φ,Λ), such that (Φk,Λk) converge strongly to (Φ,Λ).
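For reference, the batch solution that the adaptive algorithms approximate can be obtained directly once A and B are in hand. The following minimal sketch (assuming A and B are NumPy arrays forming a symmetric-definite pencil) uses scipy.linalg.eigh and reorders the output so that the generalized eigenvalues appear in descending order; the leading p columns of Phi are the vectors that the adaptive algorithms in this chapter estimate online:
import numpy as np
from scipy.linalg import eigh

def batch_gevd(A, B):
    # Solve A phi = lambda B phi for the symmetric-definite pencil (A, B)
    lam, Phi = eigh(A, B)           # eigenvalues ascending; Phi satisfies Phi.T @ B @ Phi = I
    order = np.argsort(lam)[::-1]   # reorder to descending generalized eigenvalues
    return lam[order], Phi[:, order]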

Application of GEVD in Pattern Recognition

In pattern recognition, there are problems where we are given samples (x∈ℜn) from different populations or pattern classes. The well-known problem of linear discriminant analysis (LDA) [Chatterjee et al. Nov 97, May 97, Mar 97] seeks a transform W∈ℜn×p (p≤n), such that the interclass distance (measured by the scatter of the patterns around their mixture mean) is maximized, while at the same time the intraclass distance (measured by the scatter of the patterns around their respective class means) is as small as possible. The objective of this transform is to group the classes into well-separated clusters. The former scatter matrix, known as the mixture scatter matrix, is denoted by A, and the latter matrix, known as the within-class scatter matrix, is denoted by B [Fukunaga 90]. When only the first column w of W is needed (i.e., p=1), the problem can be formulated in the constrained optimization framework as

Maximize wTAw subject to wTBw = 1.     (7.1)

A twin problem to (7.1) is to maximize the Rayleigh quotient criterion [Golub and VanLoan 83] with respect to w:

$$ J\left(\mathbf{w};A,B\right)=\frac{{\mathbf{w}}^TA\mathbf{w}}{{\mathbf{w}}^TB\mathbf{w}} $$.     (7.2)

A solution to (7.1) or (7.2) leads to the generalized eigen-decomposition problem Aw=λBw, where λ is the largest generalized eigenvalue of A with respect to B. In general, the p columns of W are the p (p ≤ n) orthogonal unit generalized eigenvectors ϕ1,...,ϕp of A with respect to B, where

$$ {\boldsymbol{\upvarphi}}_i^TA{\boldsymbol{\upvarphi}}_j={\lambda}_i{\delta}_{ij} $$ and $$ {\boldsymbol{\upvarphi}}_i^TB{\boldsymbol{\upvarphi}}_j={\delta}_{ij} $$ for i,j=1,…,p,     (7.3)

where λ1 > ... > λp > λp+1 ≥ ... ≥ λn > 0, and λ1,...,λp are the p largest generalized eigenvalues of A with respect to B in descending order of magnitude. In summary, LDA is a powerful feature extraction tool for the class separability feature [Chatterjee May 97], and the adaptive algorithms in this chapter are well suited to it.
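As a concrete sketch of this use case, the following code (the data layout and names are illustrative, not from the text) builds the mixture scatter matrix A and the within-class scatter matrix B from labeled samples and keeps the p leading generalized eigenvectors as the LDA transform W; it assumes enough samples per class for B to be positive definite:
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, p):
    # X: [nSamples, nDim] data matrix; labels: one class label per sample (illustrative layout)
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    m = X.mean(axis=0)                        # mixture (global) mean
    A = (X - m).T @ (X - m) / len(X)          # mixture scatter matrix
    B = np.zeros_like(A)                      # within-class scatter matrix
    for cls in np.unique(labels):
        Xc = X[labels == cls]
        B += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0)) / len(X)
    lam, Phi = eigh(A, B)                     # generalized eigenvectors of A with respect to B
    return Phi[:, np.argsort(lam)[::-1][:p]]  # W: p leading columns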

Application of GEVD in Signal Processing

Next, let’s discuss an analogous problem of detecting a desired signal in the presence of interference. Here, we seek the optimum linear transform W for weighting the signal plus interference such that the desired signal is detected with maximum power and minimum interference. Given the matrix pair (A,B), where A is the correlation matrix of the signal plus interference plus noise and B is the correlation matrix of interference plus noise, we can formulate the signal detection problem as the constrained maximization problem in (7.1). Here, we maximize the signal power and minimize the power of the interference. The solution for W consists of the p (p ≤ n) principal generalized eigenvectors of the matrix pencil (A,B). Adaptive generalized eigen-decomposition algorithms also allow the tracking of slow changes in the incoming data [Chatterjee et al. Nov 97, Mar 97; Chen et al. 2000].

Methods for Generalized Eigen-Decomposition

We first define the problem for the non-adaptive case. Each of the three generalized eigenvalue problems (AΦ=BΦΛ, ABΦ=ΦΛ, and BAΦ=ΦΛ, where A and B are real, symmetric n×n matrices and B is positive definite) can be reduced to a standard symmetric eigenvalue problem using a Cholesky factorization of B as either B=LLT or B=UTU. With B = LLT, we can write AΦ=BΦΛ as

(L–1AL–T)(LTΦ) = (LTΦ)Λ or CΨ = ΨΛ.     (7.4)

Here C is the symmetric matrix C = L–1 A L–T and Ψ = LT Φ. Table 7-1 summarizes how each of the three types of problems can be reduced to the standard form CΨ=ΨΛ, and how the eigenvectors Φ of the original problem may be recovered from the eigenvectors Ψ of the reduced problem; a short batch-mode sketch of the AΦ=BΦΛ reduction follows the table.
Table 7-1

Types of Generalized Eigen-Decomposition problems and their solutions

Type of Problem | Factorization of B | Reduction | Generalized Eigenvectors
AΦ=BΦΛ | B = LLT | C = L–1 A L–T | Φ = L–T Ψ
AΦ=BΦΛ | B = UTU | C = U–T A U–1 | Φ = U–1 Ψ
ABΦ=ΦΛ | B = LLT | C = LT A L | Φ = L–T Ψ
ABΦ=ΦΛ | B = UTU | C = U A UT | Φ = U–1 Ψ
BAΦ=ΦΛ | B = LLT | C = LT A L | Φ = L Ψ
BAΦ=ΦΛ | B = UTU | C = U A UT | Φ = UT Ψ
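As a batch-mode check of the AΦ=BΦΛ row of Table 7-1 (with B = LLT), the following sketch reduces the pencil to the standard problem CΨ=ΨΛ and recovers Φ = L–TΨ; it assumes A is symmetric and B is symmetric positive definite:
import numpy as np

def gevd_via_cholesky(A, B):
    L = np.linalg.cholesky(B)          # B = L @ L.T
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T              # C = L^-1 A L^-T (symmetric)
    lam, Psi = np.linalg.eigh(C)       # standard problem C Psi = Psi Lambda
    Phi = np.linalg.solve(L.T, Psi)    # Phi = L^-T Psi
    return lam[::-1], Phi[:, ::-1]     # descending eigenvalue order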

In the adaptive case, we can extend these techniques by first (adaptively) computing a matrix Wk for each sample Bk, where Wk tends to the inverse Cholesky factor L–1 of B with probability one (w.p.1) as k→∞. Any of the algorithms in Chapter 3 can be considered here. Next, we consider the sequence {Ck = Wk−1AkWk−1T}, which is used to adaptively compute a matrix Vk, where Vk tends to the eigenvector matrix of limk→∞E[Ck] w.p.1 as k→∞. Any of the algorithms in Chapters 5 and 6 can be considered for this purpose. Together, the two steps yield WkTVk, which is proven to converge w.p.1 to Φ as k→∞. Thus, the two steps can proceed simultaneously and converge strongly to the generalized eigenvector matrix Φ. A full description of this method is given in [Chatterjee et al. May 97, Mar 97]. In this chapter, I offer a variety of new techniques to solve the generalized eigen-decomposition problem for the adaptive case.

Outline of This Chapter

In Section 7.2, I list the objective functions from which I derive the adaptive generalized eigen-decomposition algorithms. In Section 7.3, I present adaptive algorithms for the homogeneous, deflation, and weighted variations of the OJA objective function. In Section 7.4, I present the same three variations for the mean squared error (XU) objective function and cite convergence results for the deflation case. In Section 7.5, I discuss algorithms derived from the penalty function (PF) objective function. In Section 7.6, I consider the augmented Lagrangian 1 (AL1) objective function, and in Section 7.7, the augmented Lagrangian 2 (AL2) objective function. In Section 7.8, I present the information theory (IT) criterion, and in Section 7.9, the Rayleigh quotient (RQ) criterion. In Section 7.10, I discuss the experimental results, and in Section 7.11, I present conclusions.

7.2 Algorithms and Objective Functions

Similar to the PCA algorithms (Chapter 5), in this chapter, I present several adaptive algorithms for generalized eigenvector computation. I consider two asymptotically stationary sequences {xk∈ℜn} and {yk∈ℜn} that have been centered to zero mean. We can represent the corresponding online correlation matrices {Ak,Bk} of {xk,yk} either by their instantaneous values {$$ {\mathbf{x}}_k{\mathbf{x}}_k^T $$, $$ {\mathbf{y}}_k{\mathbf{y}}_k^T $$} or by their running averages as in (2.3). If, however, {xk,yk} are non-stationary, we can construct the correlation matrices {Ak,Bk} from the data samples {xk,yk} by (2.5).
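A minimal sketch of the rank-one correlation updates used throughout this chapter follows; the running-average branch matches the updates in the code listings below, while the forgetting-factor branch is only an assumed form of the non-stationary construction (2.5), which is defined in Chapter 2:
import numpy as np

def update_corr(A_prev, x, k, beta=None):
    # Rank-one update of a correlation-matrix estimate with the k-th sample x
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    if beta is None:
        return A_prev + (1.0 / (1 + k)) * (x @ x.T - A_prev)  # running average (stationary data)
    return beta * A_prev + (1 - beta) * (x @ x.T)             # exponential window (assumed non-stationary form)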

Summary of Objective Functions for Adaptive GEVD Algorithms

Conforming to the methodology in Section 2.3, for each algorithm, I describe objective functions and derive the adaptive algorithms from them. The objective functions are
  • Oja’s objective function (OJA),

  • Xu’s mean squared error objective function (XU),

  • Penalty function method (PF),

  • Augmented Lagrangian Method 1 (AL1),

  • Augmented Lagrangian Method 2 (AL2),

  • Information theory criterion (IT), and

  • Rayleigh quotient criterion (RQ).

As in the PCA case, there are three variations of algorithms derived from each objective function. They are
  1. Homogeneous Adaptive Rule: These algorithms do not compute the true normalized generalized eigenvectors in the order of decreasing eigenvalues; they converge to an arbitrary rotation (linear combination) of them.

  2. Deflation Adaptive Rule: Here, we produce unit generalized eigenvectors with decreasing eigenvalues. However, the training is sequential, thereby making the training process harder for parallel implementations.

  3. Weighted Adaptive Rule: These algorithms use a different scalar weight for each generalized eigenvector, so the estimates are normalized and ordered by decreasing eigenvalues.

Summary of Generalized Eigenvector Algorithms

For all algorithms, I describe an objective function J(wi; A, B) and an update rule of the form
$$ {W}_{k+1}={W}_k+{\eta}_kh\left({W}_k,{A}_k,{B}_k\right), $$
where h(Wk,Ak,Bk) satisfies certain continuity and regularity properties [Ljung 77, 92]; the update functions h for all algorithms are given in Table 7-2.
Table 7-2

List of Adaptive Generalized Eigen-Decomposition Algorithms

Alg. | Type | Adaptive Algorithm h(Wk,Ak,Bk)
OJA | Homogeneous | $$ {A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k $$
OJA | Deflation | $$ {A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right) $$
OJA | Weighted | $$ {A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k $$
XU | Homogeneous | $$ 2{A}_k{W}_k-{A}_k{W}_k{W}_k^T{B}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k $$
XU | Deflation | $$ 2{A}_k{W}_k-{A}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k\right)-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right) $$
XU | Weighted | $$ 2{A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-{A}_k{W}_kC{W}_k^T{B}_k{W}_k $$
PF | Homogeneous | $$ {A}_k{W}_k-\mu {B}_k{W}_k\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
PF | Deflation | $$ {A}_k{W}_k-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
PF | Weighted | $$ {A}_k{W}_kC-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
AL1 | Homogeneous | $$ {A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k-\mu {B}_k{W}_k\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
AL1 | Deflation | $$ {A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
AL1 | Weighted | $$ {A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
AL2 | Homogeneous | $$ 2{A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k-{A}_k{W}_k{W}_k^T{B}_k{W}_k-\mu {B}_k{W}_k\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
AL2 | Deflation | $$ 2{A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)-{A}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k\right)-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
AL2 | Weighted | $$ 2{A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-{A}_k{W}_kC{W}_k^T{B}_k{W}_k-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$
IT | Homogeneous | $$ \left({A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$
IT | Deflation | $$ \left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$
IT | Weighted | $$ \left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$
RQ | Homogeneous | $$ \left({A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$
RQ | Deflation | $$ \left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$
RQ | Weighted | $$ \left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$

In the following discussions, I denote Φ=[ϕ1 ... ϕn]∈ℜn×n as the orthonormal generalized eigenvector matrix of A with respect to B, and Λ=diag(λ1,...,λn) as the generalized eigenvalue matrix, such that λ1>λ2>...>λp>λp+1≥...≥λn>0. I use the subscript (i) to denote the ith permutation of the indices {1,2,…,n}.
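All of the updates in Table 7-2 plug into the same stochastic-approximation loop. The following minimal sketch (function names and gain schedule are illustrative) accepts any right-hand side h(W,A,B) from the table; the OJA deflation rule is shown as an example:
import numpy as np

def adaptive_gevd(X, Y, p, h, eta0=150.0, n_epochs=3):
    # X, Y: data arrays of shape [nDim, nSamples]; h(W, A, B) is a row of Table 7-2
    nDim, nSamples = X.shape
    A = np.zeros((nDim, nDim))
    B = np.zeros((nDim, nDim))
    W = 0.1 * np.ones((nDim, p))
    cnt = 0
    for epoch in range(n_epochs):
        for t in range(nSamples):
            x, y = X[:, t:t+1], Y[:, t:t+1]
            A += (1.0 / (1 + cnt)) * (x @ x.T - A)   # adaptive correlation estimates
            B += (1.0 / (1 + cnt)) * (y @ y.T - B)
            W += (1.0 / (eta0 + cnt)) * h(W, A, B)   # generic adaptive step
            cnt += 1
    return W

# Example: OJA deflation update (7.10)
oja_deflation = lambda W, A, B: A @ W - B @ W @ np.triu(W.T @ A @ W)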

7.3 OJA GEVD Algorithms

OJA Homogeneous Algorithm

The objective function for the OJA homogeneous algorithm can be written as

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{{\mathbf{w}}_k^i}^T{A}_k{B}_k^{-1}{A}_k{\mathbf{w}}_k^i+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)}^2+\sum \limits_{j=1,j\ne i}^p{\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j\right)}^2 $$,     (7.5)

for i=1,…,p (p ≤ n). From the gradient of (7.5) with respect to $$ {\mathbf{w}}_k^i $$, we obtain the following adaptive algorithm:

$$ {\mathbf{w}}_{k+1}^i={\mathbf{w}}_k^i-{\eta}_k{B}_k{A}_k^{-1}{\nabla}_{{\mathbf{w}}_k^i}J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right) $$ for i=1,…,p,     (7.6)

where ηk is a decreasing gain constant. Defining $$ {W}_k=\left[{\mathbf{w}}_k^1\dots {\mathbf{w}}_k^p\right] $$, from (7.6) we get

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k\right) $$.     (7.7)

OJA Deflation Algorithm

The objective function for the OJA deflation adaptive GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{{\mathbf{w}}_k^i}^T{A}_k{B}_k^{-1}{A}_k{\mathbf{w}}_k^i+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)}^2+\sum \limits_{j=1}^{i-1}{\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j\right)}^2 $$,     (7.8)

for i=1,…,p. From the gradient of (7.8) with respect to $$ {\mathbf{w}}_k^i $$, we obtain the OJA deflation adaptive gradient descent algorithm as

$$ {\mathbf{w}}_{k+1}^i={\mathbf{w}}_k^i+{\eta}_k\left({A}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^i{B}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i\right) $$,     (7.9)

for i=1,…,p (p ≤ n). The matrix form of the algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right) $$,     (7.10)

where UT[⋅] sets all elements below the diagonal of its matrix argument to zero.
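In NumPy, UT[⋅] is simply np.triu; a small check with a hypothetical 3×3 matrix:
import numpy as np
M = np.arange(9.0).reshape(3, 3)
print(np.triu(M))  # keeps the diagonal and the entries above it; zeroes the rest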

OJA Weighted Algorithm

The objective function for the OJA weighted adaptive GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{c}_i{{\mathbf{w}}_k^i}^T{A}_k{B}_k^{-1}{A}_k{\mathbf{w}}_k^i+\frac{c_i}{2}{\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)}^2+\sum \limits_{j=1,j\ne i}^p{c}_j{\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j\right)}^2 $$     (7.11)

for i=1,…,p, where c1,…,cp (p ≤ n) are small positive numbers satisfying

c1 > c2 > … > cp > 0, p ≤ n.     (7.12)

Given a diagonal matrix C =  diag (c1, …, cp), p ≤ n, the OJA weighted adaptive algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k\right) $$.     (7.13)

OJA Algorithm Python Code

The following Python code works with multidimensional data X[nDim,nSamples] and Y[nDim,nSamples]:
import numpy as np
A  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for x
B  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for y
W2 = 0.1 * np.ones(shape=(nDim,nEA)) # weight matrix for the deflation algorithm
W3 = W2.copy()                        # weight matrix for the weighted algorithm
c = [2.6-0.3*k for k in range(nEA)]
C = np.diag(c)
I  = np.identity(nDim)
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrices A,B with current data vectors x,y
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        y = Y[:,iter]
        y = y.reshape(nDim,1)
        B = B + (1.0/(1 + cnt))*((np.dot(y, y.T)) - B)
        # Deflated Gradient Descent
        W2 = W2 + (1/(150 + cnt))*(A @ W2 - B @ W2 @ np.triu(W2.T @ A @ W2))
        # Weighted Gradient Descent
        W3 = W3 + (1/(500 + cnt))*(A @ W3 @ C - B @ W3 @ C @ (W3.T @ A @ W3))

7.4 XU GEVD Algorithms

XU Homogeneous Algorithm

The objective function for the XU homogeneous adaptive GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-2{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i\right)+2\sum \limits_{j=1,j\ne i}^p{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$     (7.14)

for i=1,…,p (p ≤ n). From the gradient of (7.14) with respect to $$ {\mathbf{w}}_k^i $$, we obtain the XU homogeneous adaptive gradient descent algorithm as

$$ {\mathbf{w}}_{k+1}^i={\mathbf{w}}_k^i+{\eta}_k\left(2{A}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^p{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^p{B}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i\right) $$     (7.15)

for i=1,…,p, whose matrix form is

$$ {W}_{k+1}={W}_k+{\eta}_k\left(2{A}_k{W}_k-{A}_k{W}_k{W}_k^T{B}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k\right) $$.     (7.16)

XU Deflation Algorithm

The objective function for the XU deflation adaptive GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-2{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i\right)+2\sum \limits_{j=1}^{i-1}{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$     (7.17)

for i=1,…,p (p ≤ n). From the gradient of (7.17) with respect to $$ {\mathbf{w}}_k^i $$, we obtain

$$ {W}_{k+1}={W}_k+{\eta}_k\left(2{A}_k{W}_k-{A}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k\right)-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right) $$,     (7.18)

where UT[⋅] sets all elements below the diagonal of its matrix argument to zero. Chatterjee et al. [Mar 00, Thms 1, 2] proved that Wk converges with probability one to [±ϕ1 ±ϕ2 … ±ϕp] as k→∞.

XU Weighted Algorithm

The objective function for the XU weighted adaptive GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-2{c}_i{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+{c}_i\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i\right)+2\sum \limits_{j=1,j\ne i}^p{c}_j{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$     (7.19)

for i=1,…,p (p ≤ n), where c1,…,cp are small positive numbers satisfying (7.12). The adaptive algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left(2{A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-{A}_k{W}_kC{W}_k^T{B}_k{W}_k\right) $$,     (7.20)

where C = diag(c1,…,cp).

XU Algorithm Python Code

The following Python code works with multidimensional data X[nDim,nSamples] and Y[nDim,nSamples]:
import numpy as np
A  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for x
B  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for y
W2 = 0.1 * np.ones(shape=(nDim,nEA)) # weight matrix for the deflation algorithm
W3 = W2.copy()                        # weight matrix for the weighted algorithm
c = [2.6-0.3*k for k in range(nEA)]
C = np.diag(c)
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrices A,B with current data vectors x,y
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        y = Y[:,iter]
        y = y.reshape(nDim,1)
        B = B + (1.0/(1 + cnt))*((np.dot(y, y.T)) - B)
        # Deflated Gradient Descent
        W2 = W2 + (1/(100 + cnt))*(A @ W2 - 0.5 * B @ W2 @
                                   np.triu(W2.T @ A @ W2)- 0.5* A@ W2 @ np.triu(W2.T@ B @ W2))
        # Weighted Gradient Descent
        W3 = W3 + (1/(300 + cnt))*(A @ W3 @ C - 0.5 * B @ W3 @ C @ (W3.T @ A @ W3) - 0.5 *
                                   A @ W3 @ C @ (W3.T @ B @ W3))

7.5 PF GEVD Algorithms

PF Homogeneous Algorithm

We obtain the objective function for the PF homogeneous generalized eigenvector algorithm by writing the Rayleigh quotient criterion (7.2) as the following penalty function:

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\mu \left(\sum \limits_{j=1,j\ne i}^p{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.21)

where μ > 0 and i=1,…,p (p ≤ n). From the gradient of (7.21) with respect to $$ {\mathbf{w}}_k^i $$, we obtain the PF homogeneous adaptive algorithm:

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-\mu {B}_k{W}_k\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$,     (7.22)

where Ip is a p×p identity matrix.

PF Deflation Algorithm

The objective function for the PF deflation GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\mu \left(\sum \limits_{j=1}^{i-1}{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.23)

where μ > 0 and i=1,…,p. The adaptive algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$,     (7.24)

where UT[⋅] sets all elements below the diagonal of its matrix argument to zero.

PF Weighted Algorithm

The objective function for the PF weighted GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{c}_i{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\mu \left(\sum \limits_{j=1,j\ne i}^p{c}_j{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{c_i}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.25)

where c1 > c2 > … > cp > 0 (p ≤ n), μ > 0, and i=1,…,p. The corresponding adaptive algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_kC-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$,     (7.26)

where C =  diag (c1, …, cp).

PF Algorithm Python Code

The following Python code works with multidimensional data X[nDim,nSamples] and Y[nDim,nSamples]:
import numpy as np
A  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for x
B  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for y
W2 = 0.1 * np.ones(shape=(nDim,nEA)) # weight matrix for the deflation algorithm
W3 = W2.copy()                        # weight matrix for the weighted algorithm
c = [2.6-0.3*k for k in range(nEA)]
C = np.diag(c)
Ip = np.identity(nEA)                 # p x p identity used in the penalty term
mu = 1                                # penalty constant (illustrative value)
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrices A,B with current data vectors x,y
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        y = Y[:,iter]
        y = y.reshape(nDim,1)
        B = B + (1.0/(1 + cnt))*((np.dot(y, y.T)) - B)
        # Deflated Gradient Descent
        W2 = W2 + (1/(100 + cnt))*(A @ W2 - mu * B @ W2 @
                                   np.triu((W2.T @ B @ W2) - Ip))
        # Weighted Gradient Descent
        W3 = W3 + (1/(500 + cnt))*(A @ W3 @ C - mu * B @ W3 @ C @
                                   ((W3.T @ B @ W3) - Ip))

7.6 AL1 GEVD Algorithms

AL1 Homogeneous Algorithm

We apply the augmented Lagrangian method of nonlinear optimization to the constrained maximization problem (7.1) to obtain the objective function for the AL1 homogeneous GEVD algorithm as

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\alpha \left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1,j\ne i}^p{\beta}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$

$$ +\mu \left(\sum \limits_{j=1,j\ne i}^p{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.27)

for i=1,…,p (p ≤ n), where (α, β1, β2, …, βp) are Lagrange multipliers and μ is a positive penalty constant. Taking the gradient of $$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right) $$ with respect to $$ {\mathbf{w}}_k^i $$, equating the gradient to 0, and using the constraint $$ {{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i={\delta}_{ij} $$, we obtain

$$ \alpha ={{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i $$ and $$ {\beta}_j={{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i $$ for j=1,…,p.     (7.28)

Replacing (α, β1, β2, …, βp) in the gradient of (7.27), we obtain the AL1 homogeneous adaptive gradient descent generalized eigenvector algorithm:

$$ {\mathbf{w}}_{k+1}^i={\mathbf{w}}_k^i+{\eta}_k\left({A}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^p{B}_k{\mathbf{w}}_k^j\left({{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i\right)-\mu \sum \limits_{j=1}^p{B}_k{\mathbf{w}}_k^j\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i-{\delta}_{ij}\right)\right) $$,     (7.29)

where μ > 0. Defining $$ {W}_k=\left[{\mathbf{w}}_k^1\dots {\mathbf{w}}_k^p\right] $$, we obtain

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k-\mu {B}_k{W}_k\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$,     (7.30)

where Ip is a p×p identity matrix.

AL1 Deflation Algorithm

The objective function for the AL1 deflation GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\alpha \left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1}^{i-1}{\beta}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$

$$ +\mu \left(\sum \limits_{j=1}^{i-1}{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.31)

for i=1,…,p (p ≤ n). Following the steps for the AL1 homogeneous algorithm, we obtain the adaptive algorithm:

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$.     (7.32)

AL1 Weighted Algorithm

The objective function for the AL1 weighted GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{c}_i{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\alpha {c}_i\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1,j\ne i}^p{\beta}_j{c}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$

$$ +\mu \left(\sum \limits_{j=1,j\ne i}^p{c}_j{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{c_i}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.33)

for i=1,…,p (p ≤ n), where (α, β1, β2, …, βp) are Lagrange multipliers, μ is a positive penalty constant, and c1 > c2 > … > cp > 0. The adaptive algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$,     (7.34)

where C =  diag (c1, …, cp).

AL1 Algorithm Python Code

The following Python code works with multidimensional data X[nDim,nSamples] and Y[nDim,nSamples]:
import numpy as np
A  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for x
B  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for y
W2 = 0.1 * np.ones(shape=(nDim,nEA)) # weight matrix for the deflation algorithm
W3 = W2.copy()                        # weight matrix for the weighted algorithm
c = [2.6-0.3*k for k in range(nEA)]
C = np.diag(c)
Ip = np.identity(nEA)                 # p x p identity used in the penalty term
mu = 2                                # penalty constant
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrices A,B with current data vectors x,y
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        y = Y[:,iter]
        y = y.reshape(nDim,1)
        B = B + (1.0/(1 + cnt))*((np.dot(y, y.T)) - B)
        # Deflated Gradient Descent
        W2 = W2 + (1/(500 + cnt))*(A @ W2 - B @ W2 @ np.triu(W2.T @ A @ W2)
                   - mu * B @ W2 @ np.triu((W2.T @ B @ W2) - Ip))
        # Weighted Gradient Descent
        W3 = W3 + (1/(1000 + cnt))*(A @ W3 @ C - B @ W3 @ C @ (W3.T @ A @ W3)
                   - mu * B @ W3 @ C @ ((W3.T @ B @ W3) - Ip))

7.7 AL2 GEVD Algorithms

AL2 Homogeneous Algorithm

The unconstrained objective function for the AL2 homogeneous GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-2{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i\right)+2\sum \limits_{j=1,j\ne i}^p{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$

$$ +\mu \left(\sum \limits_{j=1,j\ne i}^p{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.35)

for i=1,…,p, where μ is a positive penalty constant. From (7.35), we obtain the adaptive gradient descent algorithm:

$$ {\mathbf{w}}_{k+1}^i={\mathbf{w}}_k^i+{\eta}_k\left(2{A}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^p{B}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^p{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i-\mu \sum \limits_{j=1}^p{B}_k{\mathbf{w}}_k^j\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i-{\delta}_{ij}\right)\right) $$,     (7.36)

for i=1,…,p, the matrix version of which is

$$ {W}_{k+1}={W}_k+{\eta}_k\left(2{A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k-{A}_k{W}_k{W}_k^T{B}_k{W}_k-\mu {B}_k{W}_k\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$.     (7.37)

AL2 Deflation Algorithm

The objective function for the AL2 deflation GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-2{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i\right)+2\sum \limits_{j=1}^{i-1}{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$

$$ +\mu \left(\sum \limits_{j=1}^{i-1}{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{1}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.38)

for i=1,…,p. The adaptive GEVD algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left(2{A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)-{A}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k\right)-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$.     (7.39)

AL2 Weighted Algorithm

The objective function for the AL2 weighted generalized eigenvector algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-2{c}_i{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i+{c}_i\left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i\right)+2\sum \limits_{j=1}^{i-1}{c}_j{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$

$$ +\mu \left(\sum \limits_{j=1}^{i-1}{c}_j{\left({{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i\right)}^2+\frac{c_i}{2}{\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)}^2\right) $$,     (7.40)

for i=1,…,p, where c1 > c2 > … > cp > 0 (p ≤ n). The adaptive algorithm is

$$ {W}_{k+1}={W}_k+{\eta}_k\left(2{A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-{A}_k{W}_kC{W}_k^T{B}_k{W}_k-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right)\right) $$,     (7.41)

where C =  diag (c1, …, cp), p ≤ n.

AL2 Algorithm Python Code

The following Python code works with data X[nDim,nSamples] and Y[nDim,nSamples]:
import numpy as np
A  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for x
B  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for y
W2 = 0.1 * np.ones(shape=(nDim,nEA)) # weight matrix for the deflation algorithm
W3 = W2.copy()                        # weight matrix for the weighted algorithm
c = [2.6-0.3*k for k in range(nEA)]
C = np.diag(c)
Ip = np.identity(nEA)                 # p x p identity used in the penalty term
mu = 1                                # penalty constant
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrices A,B with current data vectors x,y
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        y = Y[:,iter]
        y = y.reshape(nDim,1)
        B = B + (1.0/(1 + cnt))*((np.dot(y, y.T)) - B)
        # Deflated Gradient Descent
        W2 = W2 + (1/(100 + cnt))*(A @ W2 - 0.5 * B @ W2 @ np.triu(W2.T @ A
                            @ W2) - 0.5 * A @ W2 @ np.triu(W2.T @ B @ W2) -
                            0.5 * mu * B @ W2 @ np.triu((W2.T @ B @ W2) - Ip))
        # Weighted Gradient Descent
        W3 = W3 + (1/(300 + cnt))*(A @ W3 @ C - 0.5 * B @ W3 @ C @ (W3.T @ A
                            @ W3) - 0.5 * A @ W3 @ C @ (W3.T @ B @ W3) -
                            0.5 * mu * B @ W3 @ C @ ((W3.T @ B @ W3) - Ip))

7.8 IT GEVD Algorithms

IT Homogeneous Algorithm

The objective function for the information theory homogeneous GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)={{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-\ln \left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)+\alpha \left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1,j\ne i}^p{\beta}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$     (7.42)

for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers and ln(⋅) is the natural logarithm. By equating the gradient of (7.42) with respect to $$ {\mathbf{w}}_k^i $$ to 0 and using the constraint $$ {{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i={\delta}_{ij} $$, we obtain

α = 0 and $$ {\beta}_j=\frac{{{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i}{{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i} $$,     (7.43)

for j=1,…,p. Replacing (α, β1, β2, …, βp) in the gradient of (7.42), we obtain the IT homogeneous adaptive gradient descent algorithm for the generalized eigenvector:

$$ {\mathbf{w}}_{k+1}^i={\mathbf{w}}_k^i+{\eta}_k\left({A}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^p{B}_k{\mathbf{w}}_k^j\left({{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i\right)\right)/{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i $$,     (7.44)

for i=1,…,p, whose matrix version is

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k\right)\kern0.5em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$,     (7.45)

where DIAG[⋅] sets all elements except the diagonal of its matrix argument to zero.
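In NumPy, DIAG[⋅] can be written as np.diag(np.diagonal(M)), and the inverse needed in (7.45) is then a diagonal matrix of reciprocals; a small check with a hypothetical matrix:
import numpy as np
from numpy.linalg import inv
M = np.array([[4.0, 1.0],
              [2.0, 5.0]])
D = np.diag(np.diagonal(M))  # DIAG[M]: off-diagonal entries set to zero
print(inv(D))                # diagonal matrix with entries 1/4 and 1/5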

IT Deflation Algorithm

The objective function for the IT deflation GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)={{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-\ln \left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)+\alpha \left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1}^{i-1}{\beta}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$     (7.46)

for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. From (7.43), we obtain the adaptive gradient algorithm:

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right)\kern0.5em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$.     (7.47)

IT Weighted Algorithm

The objective function for the IT weighted GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i\right)={c}_i{{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-{c}_i\ln \left({{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i\right)+\alpha {c}_i\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1,j\ne i}^p{\beta}_j{c}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$     (7.48)

for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By solving for (α, β1, β2, …, βp) and replacing them in the gradient of (7.48), we obtain the adaptive algorithm:

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k\right)\kern0.5em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$.     (7.49)

IT Algorithm Python Code

The following Python code works with data X[nDim,nSamples] and Y[nDim,nSamples]:
import numpy as np
from numpy.linalg import inv
A  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for x
B  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for y
W2 = 0.1 * np.ones(shape=(nDim,nEA)) # weight matrix for the deflation algorithm
W3 = W2.copy()                        # weight matrix for the weighted algorithm
c = [2.6-0.3*k for k in range(nEA)]
C = np.diag(c)
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrices A,B with current data vectors x,y
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        y = Y[:,iter]
        y = y.reshape(nDim,1)
        B = B + (1.0/(1 + cnt))*((np.dot(y, y.T)) - B)
        # Deflated Gradient Descent
        W2 = W2 + (1/(50 + cnt))*(A @ W2 - B @ W2 @ np.triu(W2.T @ A @ W2)) \
                                  @ inv(np.diag(np.diagonal(W2.T @ A @ W2)))
        # Weighted Gradient Descent
        W3 = W3 + (1/(100 + cnt))*(A @ W3 @ C - B @ W3 @ C @ (W3.T @ A @ W3)) \
                                   @ inv(np.diag(np.diagonal(W3.T @ A @ W3)))

7.9 RQ GEVD Algorithms

RQ Homogeneous Algorithm

We obtain the objective function for the Rayleigh quotient homogeneous GEVD algorithm from the Rayleigh quotient criterion (7.2) as follows:

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-\frac{{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i}{{{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i}+\alpha \left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1,j\ne i}^p{\beta}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$     (7.50)

for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By equating the gradient of (7.50) with respect to $$ {\mathbf{w}}_k^i $$ to 0, and using the constraint $$ {{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i={\delta}_{ij} $$, we obtain

α = 0 and $$ {\beta}_j=\frac{{{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i}{{{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i} $$ for j=1,…,p.     (7.51)

Replacing (α, β1, β2, …, βp) in the gradient of (7.50) and making a small approximation, we obtain the RQ homogeneous adaptive gradient descent algorithm for the generalized eigenvector:

$$ {\mathbf{w}}_{k+1}^i={\mathbf{w}}_k^i+{\eta}_k\left({A}_k{\mathbf{w}}_k^i-\sum \limits_{j=1}^p{B}_k{\mathbf{w}}_k^j\left({{\mathbf{w}}_k^j}^T{A}_k{\mathbf{w}}_k^i\right)\right)/{{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i $$,     (7.52)

for i=1,…,p, whose matrix version is

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k{W}_k^T{A}_k{W}_k\right)\kern0.5em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$.     (7.53)

RQ Deflation Algorithm

The objective function for the RQ deflation GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-\frac{{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i}{{{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i}+\alpha \left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1}^i{\beta}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$,     (7.54)

for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By solving for (α, β1, β2, …, βp) and replacing them in the gradient of (7.54), we obtain the adaptive algorithm:

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right)\kern0.5em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$.     (7.55)

RQ Weighted Algorithm

The objective function for the RQ weighted GEVD algorithm is

$$ J\left({\mathbf{w}}_k^i;{A}_k,{B}_k\right)=-{c}_i\frac{{{\mathbf{w}}_k^i}^T{A}_k{\mathbf{w}}_k^i}{{{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i}+\alpha {c}_i\left({{\mathbf{w}}_k^i}^T{B}_k{\mathbf{w}}_k^i-1\right)+2\sum \limits_{j=1}^i{\beta}_j{c}_j{{\mathbf{w}}_k^j}^T{B}_k{\mathbf{w}}_k^i $$,     (7.56)

for i=1,…,p, where (α, β1, β2, …, βp) are Lagrange multipliers. By solving for (α, β1, β2, …, βp) and replacing them in the gradient of (7.56), we obtain the adaptive algorithm:

$$ {W}_{k+1}={W}_k+{\eta}_k\left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k\right)\kern0.5em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$.     (7.57)

RQ Algorithm Python Code

The following Python code works with data X[nDim,nSamples] and Y[nDim,nSamples]:
import numpy as np
from numpy.linalg import inv
A  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for x
B  = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix for y
W2 = 0.1 * np.ones(shape=(nDim,nEA)) # weight matrix for the deflation algorithm
W3 = W2.copy()                        # weight matrix for the weighted algorithm
c = [2.6-0.3*k for k in range(nEA)]
C = np.diag(c)
I  = np.identity(nDim)
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrices A,B with current data vectors x,y
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        y = Y[:,iter]
        y = y.reshape(nDim,1)
        B = B + (1.0/(1 + cnt))*((np.dot(y, y.T)) - B)
        # Deflated Gradient Descent
        W2 = W2 + (1/(20 + cnt))*(A @ W2 - B @ W2 @ np.triu(W2.T @ A @ W2)) \
                                  @ inv(np.diag(np.diagonal(W2.T @ B @ W2)))
        # Weighted Gradient Descent
        W3 = W3 + (1/(300 + cnt))*(A @ W3 @ C - B @ W3 @ C @ (W3.T @ A @ W3)) \
                                   @ inv(np.diag(np.diagonal(W3.T @ B @ W3)))

7.10 Experimental Results

I generated 1,000 samples (of {xk} and {yk}) from 10-dimensional Gaussian data (i.e., n=10) with the mean zero and covariance given below. The covariance matrix A for {xk} is obtained from the second covariance matrix in [Okada and Tomita 85] multiplied by 3 as follows:

The covariance matrix B for {yk} is obtained from the third covariance matrix in [Okada and Tomita 85] multiplied by 2 as follows:

The generalized eigenvalues of (A,B) are

107.9186, 49.0448, 8.3176, 5.1564, 2.8814, 2.3958, 1.9872, 1.2371, 0.9371, 0.1096.

I computed the first four principal generalized eigenvectors (i.e., the eigenvectors corresponding to the largest four eigenvalues, so p=4) by the adaptive algorithms described before. In order to compute the online data sequence {Ak}, I generated random data vectors {xk} from the above covariance matrix A. I generated {Ak} from {xk} by using algorithm (2.21) in Chapter 2. Similarly, I generated random data vectors {yk} from the covariance matrix B and then generated {Bk} from {yk}. I computed the correlation matrices Acomputed and Bcomputed after collecting all 1,000 samples xk and yk respectively as

$$ {A}_{computed}=\frac{1}{1000}\sum \limits_{i=1}^{1000}{\mathbf{x}}_i{\mathbf{x}}_i^T $$ and $$ {B}_{computed}=\frac{1}{1000}\sum \limits_{i=1}^{1000}{\mathbf{y}}_i{\mathbf{y}}_i^T $$.

I referred to the generalized eigenvectors and eigenvalues computed from Acomputed and Bcomputed by a standard numerical analysis method [Golub and VanLoan 83] as the actual values.

I started all algorithms with W0 = 0.1*ONE, where ONE is a 10×4 matrix all of whose elements are ones. In order to measure the convergence and accuracy of the algorithms, I computed the direction cosine at the kth update of each adaptive algorithm as

Direction cosine (k) = $$ \frac{\left|{{\mathbf{w}}_k^i}^T{\boldsymbol{\upvarphi}}_i\right|}{\left\Vert {\mathbf{w}}_k^i\right\Vert \left\Vert {\boldsymbol{\upvarphi}}_i\right\Vert } $$,     (7.58)

where $$ {\mathbf{w}}_k^i $$ is the estimated ith generalized eigenvector of (Ak,Bk) at the kth update and ϕi is the actual ith generalized eigenvector computed from all collected samples by a conventional numerical analysis method.
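The direction cosine (7.58) is straightforward to compute once the actual generalized eigenvectors are available from a batch solve; a sketch, assuming Wk holds the current estimates column by column:
import numpy as np
from scipy.linalg import eigh

def direction_cosines(Wk, A, B, p):
    lam, Phi = eigh(A, B)                # actual generalized eigenvectors of (A, B)
    Phi = Phi[:, np.argsort(lam)[::-1]]  # descending eigenvalue order
    cos = np.empty(p)
    for i in range(p):
        w, phi = Wk[:, i], Phi[:, i]
        cos[i] = abs(w @ phi) / (np.linalg.norm(w) * np.linalg.norm(phi))  # eq. (7.58)
    return cos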

Figure 7-1 shows the iterates of the OJA algorithms (deflated and weighted) to compute the first two principal generalized eigenvectors of (Ak,Bk). Figure 7-2 shows the same for the XU algorithms, Figure 7-3 for the PF algorithms, Figure 7-4 for the AL1 algorithms, Figure 7-5 for the AL2 algorithms, Figure 7-6 for the IT algorithms, and Figure 7-7 for the RQ algorithms.
Figure 7-1

Convergence of the first two principal generalized eigenvectors of (A,B) by the OJA deflation (7.10) and OJA weighted (7.13) adaptive algorithms

Figure 7-2

Convergence of the first two principal generalized eigenvectors of (A,B) by the XU deflation (7.18) and XU weighted (7.20) adaptive algorithms

Figure 7-3

Convergence of the first two principal generalized eigenvectors of (A,B) by the PF deflation (7.24) and PF weighted (7.26) adaptive algorithms

Figure 7-4

Convergence of the first two principal generalized eigenvectors of (A,B) by the AL1 deflation (7.32) and AL1 weighted (7.34) adaptive algorithms

Figure 7-5

Convergence of the first two principal generalized eigenvectors of (A,B) by the AL2 deflation (7.39) and AL2 weighted (7.41) adaptive algorithms

Figure 7-6

Convergence of the first two principal generalized eigenvectors of (A,B) by the IT deflation (7.47) and IT weighted (7.49) adaptive algorithms

Figure 7-7

Convergence of the first two principal generalized eigenvectors of (A,B) by the RQ deflation (7.55) and RQ weighted (7.57) adaptive algorithms

For all algorithms, I used ηk=1/(140+k) for the deflation algorithms and ηk=1/(500+k) for the weighted algorithms. The diagonal weight matrix C used for the weighted algorithms is DIAG(2.6,2.3,2.0,1.7). I ran all algorithms for three epochs of the data, where one epoch means presenting all training data once in random order. I did not show the results for the homogeneous algorithms since the homogeneous method produces a linear combination of the actual generalized eigenvectors of (A,B). Thus, the direction cosines are not indicative of the performance of the algorithms for the homogeneous case.

7.11 Concluding Remarks

Observe that the convergence of all algorithms degrades progressively for the minor generalized eigenvectors and is best for the principal generalized eigenvector. This is expected since the convergence of these adaptive algorithms is a function of the relative generalized eigenvalues. Furthermore, the weighted algorithms, which can be implemented in parallel hardware, performed very similarly to the deflation algorithms.

I did the following:
  1. For each algorithm, I rated the compute and convergence performance.

  2. I skipped the homogeneous algorithms because they are not useful for practical applications, since they produce arbitrary rotations of the eigenvectors.

  3. Note that Ak∈ℜn×n, Bk∈ℜn×n, and Wk∈ℜn×p. I presented the computational complexity of each algorithm in terms of the matrix dimensions n and p.

  4. The convergence performance is determined by the speed of convergence of the principal and minor components. I rated convergence on a scale of 1-10, where 10 is the fastest converging algorithm.

  5. I skipped the IT and RQ algorithms because they did not perform well compared to the remaining algorithms, and their matrix inversion increases computational complexity. See Table 7-3.
Table 7-3

List of Adaptive GEVD Algorithms, Complexity, and Performance

Alg. | Type | Adaptive Algorithm h(Wk,Ak,Bk) | Complexity, Convergence Rating
OJA | Deflation | $$ {A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right) $$ | n3p6, 6
OJA | Weighted | $$ {A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k $$ | n4p6, 6
XU | Deflation | $$ 2{A}_k{W}_k-{A}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k\right)-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right) $$ | 2n3p6, 8
XU | Weighted | $$ 2{A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-{A}_k{W}_kC{W}_k^T{B}_k{W}_k $$ | 2n4p6, 8
PF | Deflation | $$ {A}_k{W}_k-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$ | n2p4, 7
PF | Weighted | $$ {A}_k{W}_kC-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$ | n3p4, 7
AL1 | Deflation | $$ {A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$ | n3p6 + n2p4, 9
AL1 | Weighted | $$ {A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$ | n4p6 + n3p4, 9
AL2 | Deflation | $$ 2{A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)-{A}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k\right)-\mu {B}_k{W}_k\ UT\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$ | 2n3p6 + n2p4, 10
AL2 | Weighted | $$ 2{A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k-{A}_k{W}_kC{W}_k^T{B}_k{W}_k-\mu {B}_k{W}_kC\left({W}_k^T{B}_k{W}_k-{I}_p\right) $$ | 2n4p6 + n3p4, 10
IT | Deflation | $$ \left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$ | Not applicable
IT | Weighted | $$ \left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{A}_k{W}_k\right)}^{-1} $$ | Not applicable
RQ | Deflation | $$ \left({A}_k{W}_k-{B}_k{W}_k\ UT\left({W}_k^T{A}_k{W}_k\right)\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$ | Not applicable
RQ | Weighted | $$ \left({A}_k{W}_kC-{B}_k{W}_kC{W}_k^T{A}_k{W}_k\right)\kern1em \mathrm{DIAG}{\left({W}_k^T{B}_k{W}_k\right)}^{-1} $$ | Not applicable

Observe the following:
  1. The OJA algorithm has the least complexity and good performance.

  2. The AL2 algorithm has the most complexity and the best performance.

  3. The AL1 algorithm is the next best after AL2, and PF and XU follow.

The complexity and accuracy tradeoffs will determine the algorithm to use in real-world scenarios. If you can afford the computation, the AL2 algorithm is the best. The XU algorithm is a good balance of complexity and performance.

In summary, I showed 21 algorithms, many of them new, from a common framework with an objective function for each. Note that although I applied the gradient descent technique on these objective functions, I could have applied any other technique of nonlinear optimization such as steepest descent, conjugate direction, Newton-Raphson, or recursive least squares. The availability of the objective functions allows us to derive new algorithms by using new optimization techniques on them and also to perform convergence analyses of the adaptive algorithms.
