© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Chatterjee, Adaptive Machine Learning Algorithms with Python, https://doi.org/10.1007/978-1-4842-8017-1_4

4. First Principal Eigenvector

Chanchal Chatterjee1  
(1)
San Jose, CA, USA
 

4.1 Introduction and Use Cases

In this chapter, I present a unified framework to derive and discuss ten adaptive algorithms (some well-known) for principal eigenvector computation, which is also known as principal component analysis (PCA) or the Karhunen-Loève [Karhunen–Loève theorem, Wikipedia] transform. The first principal eigenvector of a symmetric positive definite matrix A∈ℜn×n is the eigenvector ϕ1 corresponding to the largest eigenvalue λ1 of A. Here Aϕi = λiϕi for i=1,…,n, where λ1 ≥ λ2 ≥ … ≥ λn > 0 are the n eigenvalues of A corresponding to the eigenvectors ϕ1,…,ϕn.
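For reference, the batch (non-adaptive) computation of ϕ1 from a known matrix A is a standard eigen-decomposition; the adaptive algorithms in this chapter estimate the same vector from streaming data without ever solving the full eigen-problem. A minimal sketch of this batch reference, assuming NumPy and a symmetric positive definite matrix A, is
import numpy as np

def first_principal_eigenvector(A):
    # np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order
    eig_vals, eig_vecs = np.linalg.eigh(A)
    # return the largest eigenvalue and its (unit-norm) eigenvector
    return eig_vals[-1], eig_vecs[:, -1]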

An important problem in machine learning is to extract the most significant feature that represents the variations in the multi-dimensional data. This reduces the multi-dimensional data into one dimension that can be easily modeled. However, in real-world applications, the data statistics change over time (non-stationary). Hence it is challenging to design a solution that adapts to changing data on a low-memory and low-computation edge device.

Figure 4-1 shows an example of streaming 10-dimensional non-stationary data that abruptly changes statistical properties after 500 samples. The overlaid red curve shows the principal eigenvector estimated by the adaptive algorithm. The adaptive estimate of the principal eigenvector converges to its true value within 50 samples. As the data changes abruptly after 500 samples, it readapts to the changed data and converges back to its true value within 100 samples. All of this is achieved with low computation, low memory, and low latency.
Figure 4-1

Rapid convergence of the first principal eigenvector computed by an adaptive algorithm in spite of abrupt changes in the data

Besides this example, there are several applications in machine learning, pattern analysis, signal processing, cellular communications, and automatic control [Haykin 94, Owsley 78, Pisarenko 73, Chatterjee et al. 97-99, Chen et al. 99, Diamantaras and Strintzis 97], where an online (i.e., real-time) solution of principal eigen-decomposition is desired. As discussed in Chapter 2, in these real-time situations, the underlying correlation matrix A is unknown. Instead, we have a sequence of random vectors {xk∈ℜn} from which we obtain an instantaneous matrix sequence {Ak∈ℜn×n}, such that A = limk→∞ E[Ak]. For every incoming sample xk, we need to obtain the current estimate wk of the principal eigenvector ϕ1, such that wk converges strongly to its true value ϕ1.

A common method of computing the online estimate wk of ϕ1 is to maximize the Rayleigh quotient (RQ) [Golub and VanLoan 83] criterion J(wk;Ak), where

$$ J(\mathbf{w}_k; A_k) = \frac{\mathbf{w}_k^T A_k \mathbf{w}_k}{\mathbf{w}_k^T \mathbf{w}_k} $$.     (4.1)

The signal xk can be compressed to a single value by projecting it onto wk as $$ \mathbf{w}_k^T \mathbf{x}_k $$.
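As a small illustrative sketch (not from the original text), the Rayleigh quotient (4.1) and the one-dimensional compression wkTxk can be evaluated directly with NumPy; here w, A, and x are assumed to be a column vector, a symmetric matrix, and a data sample of matching dimension.
def rayleigh_quotient(w, A):
    # J(w; A) = (w^T A w) / (w^T w); maximized when w is the first principal eigenvector
    return float(w.T @ A @ w) / float(w.T @ w)

def compress(w, x):
    # project the sample x onto the current eigenvector estimate w
    return float(w.T @ x)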

The literature for PCA algorithms is very diverse and practitioners have approached the problem from a variety of backgrounds including signal processing, neural learning, and statistical pattern recognition. Within each discipline, adaptive PCA algorithms are derived from their own perspectives, which may include ad hoc methods. Since the approaches and solutions to PCA algorithms are distributed along disciplinary lines, a unified framework for deriving and analyzing these algorithms is necessary.

In this chapter, I offer a common framework for derivation, convergence, and rate analyses for the ten adaptive algorithms in four steps outlined in Section 1.​4. For each algorithm, I present the results for each of these steps. The unified framework helps in conducting a comparative study of the ten algorithms. In the process, I offer fresh perspectives on known algorithms and present two new adaptive algorithms for PCA. For known algorithms, if results exist from prior implementations, I state them; otherwise, I provide the new results. For the new algorithms, I prove my results.

Outline of This Chapter

In Section 4.2, I list the adaptive PCA algorithms that I derive and discuss in this chapter. I also list the objective functions from which I derive these algorithms and the necessary assumptions. Section 4.3 presents the Oja PCA algorithm and describes its convergence properties. In Section 4.4, I analyze three algorithms based on the Rayleigh quotient criterion (4.1). In Section 4.5, I discuss PCA algorithms based on the information theoretic criterion. Section 4.6 describes the mean squared error objective function and algorithms. In Section 4.7, I discuss penalty function-based algorithms. Sections 4.8 and 4.9 present new PCA algorithms based on the augmented Lagrangian criteria. Section 4.10 presents the summary of convergence results. Section 4.11 discusses the experimental results. Finally, Section 4.12 concludes the chapter.

4.2 Algorithms and Objective Functions

Adaptive Algorithms

[Chatterjee, Neural Networks, Vol. 18, No. 2, pp. 145-149, March 2005].

I have itemized the algorithms based on their inventors or on the objective functions from which they are derived. All algorithms are of the form

wk+1 = wk + ηk h(wk, Ak),     (4.2)

where the function h(wk, Ak) follows certain continuity and regularity properties [Ljung 77, 92], and ηk is a decreasing gain sequence. The terms h(wk, Ak) for the various adaptive algorithms are listed below (a generic implementation sketch of update (4.2) appears after this list):
  • OJA: $$ A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k $$.

  • RQ: $$ \frac{1}{\mathbf{w}_k^T\mathbf{w}_k}\left(A_k\mathbf{w}_k - \mathbf{w}_k\left(\frac{\mathbf{w}_k^T A_k\mathbf{w}_k}{\mathbf{w}_k^T\mathbf{w}_k}\right)\right) $$.

  • OJAN: $$ A_k\mathbf{w}_k - \mathbf{w}_k\left(\frac{\mathbf{w}_k^T A_k\mathbf{w}_k}{\mathbf{w}_k^T\mathbf{w}_k}\right) = \mathbf{w}_k^T\mathbf{w}_k \cdot RQ $$.

  • LUO: $$ \mathbf{w}_k^T\mathbf{w}_k\left(A_k\mathbf{w}_k - \mathbf{w}_k\left(\frac{\mathbf{w}_k^T A_k\mathbf{w}_k}{\mathbf{w}_k^T\mathbf{w}_k}\right)\right) = \left(\mathbf{w}_k^T\mathbf{w}_k\right)^2 \cdot RQ $$.

  • IT: $$ \frac{A_k\mathbf{w}_k}{\mathbf{w}_k^T A_k\mathbf{w}_k} - \mathbf{w}_k = \frac{1}{\mathbf{w}_k^T A_k\mathbf{w}_k} \cdot OJA $$.

  • XU: $$ 2A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k = OJA - A_k\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) $$.

  • PF: $$ A_k\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) $$.

  • OJA+: $$ A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - \mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) = OJA - \mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) $$.

  • AL1: $$ A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) $$.

  • AL2: $$ 2A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) $$.

Here IT denotes information theory, and AL denotes augmented Lagrangian. Although most of these algorithms are known, the new AL1 and AL2 algorithms are derived from an augmented Lagrangian objective function discussed later in this chapter.
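To make the common structure of (4.2) explicit, the following sketch (an illustration under assumed names, not code from the original text) implements the generic update once and plugs in the term h(wk, Ak); the OJA and PF terms are shown as examples, and the gain schedule mirrors the 1/(c+k) form used in the code listings later in this chapter.
import numpy as np

def h_oja(w, A):
    # OJA term: A w - w (w^T A w)
    return A @ w - w * float(w.T @ A @ w)

def h_pf(w, A, mu=10.0):
    # PF term: A w - mu w (w^T w - 1)
    return A @ w - mu * w * (float(w.T @ w) - 1.0)

def adaptive_first_pc(X, h, gain_offset=100.0):
    # X has shape (nDim, nSamples); returns the final estimate w_k of the first
    # principal eigenvector using the generic update w <- w + eta_k * h(w, A_k)
    nDim, nSamples = X.shape
    A = np.zeros((nDim, nDim))                    # adaptive correlation matrix A_k
    w = 0.1 * np.ones((nDim, 1))                  # initial weight vector w_0
    for k in range(nSamples):
        x = X[:, k].reshape(nDim, 1)
        A += (1.0 / (1 + k)) * (x @ x.T - A)      # running estimate of A_k
        w += (1.0 / (gain_offset + k)) * h(w, A)  # generic update (4.2)
    return w
For example, adaptive_first_pc(X, h_oja) corresponds to the OJA iteration of Section 4.3, and adaptive_first_pc(X, h_pf) to the penalty function iteration of Section 4.7.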

Objective Functions

Conforming to my proposed methodology in Section 2.2, all of the algorithms mentioned before are derived from objective functions. Some of these objective functions are
  • Objective function for the OJA algorithm,

  • Least mean squared error criterion,

  • Rayleigh quotient criterion,

  • Penalty function method,

  • Information theory criterion, and

  • Augmented Lagrangian method.

4.3 OJA Algorithm

This algorithm was given by Oja et al. [Oja 85, 89, 92]. Intuitively, the OJA algorithm is derived from the Rayleigh quotient criterion by representing it as a Lagrange function, which minimizes $$ -\mathbf{w}_k^T A_k \mathbf{w}_k $$ under the constraint $$ \mathbf{w}_k^T \mathbf{w}_k = 1 $$.

Objective Function

In terms of the data samples xk, the objective function for the OJA algorithm can be written as

$$ J(\mathbf{w}_k; \mathbf{x}_k) = -\left\Vert \mathbf{x}_k^T\left(\mathbf{x}_k - \mathbf{w}_k\mathbf{w}_k^T\mathbf{x}_k\right) \right\Vert^2 $$     (4.3)

If we represent the data correlation matrix Ak by its instantaneous value $$ \mathbf{x}_k\mathbf{x}_k^T $$, then (4.3) is equivalent to the following objective function:

$$ J(\mathbf{w}_k; A_k) = -\left\Vert A_k^{1/2}\left(I - \mathbf{w}_k\mathbf{w}_k^T\right) A_k^{1/2} \right\Vert_F^2 $$.     (4.4)

We see from (4.3) that the objective function J(wk; xk) represents the difference between the sample xk and its transformation due to the matrix $$ \mathbf{w}_k\mathbf{w}_k^T $$. In neural networks, this transform is called auto-association1 [Haykin 94]. Figure 4-2 shows a two-layer auto-associative network.
Figure 4-2

Two-layer linear auto-associative neural network for the first principal eigenvector

Adaptive Algorithm

The gradient of (4.4) with respect to wk is
$$ \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = -4 A_k\left(A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k\right). $$

The adaptive gradient descent OJA algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k A_k^{-1} \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k + \eta_k\left(A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k\right) $$,     (4.5)

where ηk is a small decreasing gain sequence.

The Python code for this algorithm with multidimensional data X[nDim,nSamples] is
A = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix
w = 0.1 * np.ones(shape=(nDim,11)) # weight vectors of all algorithms
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # OJA Algorithm
        v = w[:,0].reshape(nDim, 1)
        v = v + (1/(100+cnt))*(A @ v - v @ (v.T @ A @ v))
        w[:,0] = v.reshape(nDim)

Rate of Convergence

The convergence time constant for the principal eigenvector ϕ1 is 1/λ1 and for the minor eigenvectors ϕi it is 1/(λ1−λi) for i=2,…,n. The time constants are dependent on the eigen-structure of the data correlation matrix A.

4.4 RQ, OJAN, and LUO Algorithms

Objective Function

These three algorithms are different derivations of the following Rayleigh quotient objective function:

$$ J(\mathbf{w}_k; A_k) = -\left(\frac{\mathbf{w}_k^T A_k \mathbf{w}_k}{\mathbf{w}_k^T \mathbf{w}_k}\right) $$.     (4.6)

These algorithms were initially presented by Luo et al. [Luo et al. 97; Taleb et al. 99; Cirrincione et al. 00] and Oja et al. [Oja et al. 92]. Variations of the RQ algorithm have been presented by many practitioners [Chauvin 89; Sarkar et al. 89; Yang et al. 89; Fu and Dowling 95; Taleb et al. 99; Cirrincione et al. 00].

Adaptive Algorithms

The gradient of (4.6) with respect to wk is
$$ \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \frac{-1}{\mathbf{w}_k^T\mathbf{w}_k}\left(A_k\mathbf{w}_k - \mathbf{w}_k\frac{\mathbf{w}_k^T A_k\mathbf{w}_k}{\mathbf{w}_k^T\mathbf{w}_k}\right). $$

The adaptive gradient descent RQ algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k + \eta_k \frac{1}{\mathbf{w}_k^T\mathbf{w}_k}\left(A_k\mathbf{w}_k - \mathbf{w}_k\frac{\mathbf{w}_k^T A_k\mathbf{w}_k}{\mathbf{w}_k^T\mathbf{w}_k}\right) $$.     (4.7)

The adaptive gradient descent OJAN algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k\left(\mathbf{w}_k^T\mathbf{w}_k\right) \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k + \eta_k\left(A_k\mathbf{w}_k - \mathbf{w}_k\frac{\mathbf{w}_k^T A_k\mathbf{w}_k}{\mathbf{w}_k^T\mathbf{w}_k}\right) $$.     (4.8)

The adaptive gradient descent LUO algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k\left(\mathbf{w}_k^T\mathbf{w}_k\right)^2 \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k + \eta_k\left(\mathbf{w}_k^T\mathbf{w}_k\right)\left(A_k\mathbf{w}_k - \mathbf{w}_k\frac{\mathbf{w}_k^T A_k\mathbf{w}_k}{\mathbf{w}_k^T\mathbf{w}_k}\right) $$.     (4.9)

The Python code for these algorithms with multidimensional data X[nDim,nSamples] is
A = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix
w = 0.1 * np.ones(shape=(nDim,11)) # weight vectors of all algorithms
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # OJAN Algorithm
        v = w[:,1].reshape(nDim, 1)
        v = v + (1/(10+cnt))*(A @ v - v @ ((v.T @ A @ v) / (v.T @ v)) )
        w[:,1] = v.reshape(nDim)
        # LUO Algorithm
        v = w[:,2].reshape(nDim, 1)
        v = v + (1/(20+cnt))*(A @ v * (v.T @ v) - v @ (v.T @ A @ v))
        w[:,2] = v.reshape(nDim)
        # RQ Algorithm (includes the 1/(w^T w) scaling from (4.7))
        v = w[:,3].reshape(nDim, 1)
        v = v + (1/(100+cnt)) * (1.0/(v.T @ v)) * (A @ v - v @ ((v.T @ A @ v) / (v.T @ v)))
        w[:,3] = v.reshape(nDim)

Rate of Convergence

The convergence time constants for principal eigenvector ϕ1 are

RQ: ‖w0‖²/λ1.

OJAN: 1/λ1.

LUO: ‖w0‖⁻²/λ1.

The convergence time constants for the minor eigenvectors ϕi (i=2,…,n) are

RQ: ‖w0‖²/(λ1−λi) for i=2,…,n.

OJAN: 1/(λ1−λi) for i=2,…,n.

LUO: ‖w0‖⁻²/(λ1−λi) for i=2,…,n.

The time constants are dependent on the eigen-structure of A.

4.5 IT Algorithm

Objective Function

The objective function for the information theory (IT) criterion is

$$ J(\mathbf{w}_k; A_k) = \mathbf{w}_k^T\mathbf{w}_k - \ln\left(\mathbf{w}_k^T A_k\mathbf{w}_k\right) $$.     (4.10)

Plumbley [Plumbley 95] and Miao and Hua [Miao and Hua 98] have studied this objective function.

Adaptive Algorithm

The gradient of (4.10) with respect to wk is
$$ \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k - \frac{A_k\mathbf{w}_k}{\mathbf{w}_k^T A_k\mathbf{w}_k}. $$

The adaptive gradient descent IT algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k + \eta_k\left(\frac{A_k\mathbf{w}_k}{\mathbf{w}_k^T A_k\mathbf{w}_k} - \mathbf{w}_k\right) $$.     (4.11)

The Python code for this algorithm with multidimensional data X[nDim,nSamples] is
A = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix
w = 0.1 * np.ones(shape=(nDim,11)) # weight vectors of all algorithms
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # IT Algorithm
        v = w[:,5].reshape(nDim, 1)
        v = v + (4/(1+cnt))*((A @ v / (v.T @ A @ v)) - v)
        w[:,5] = v.reshape(nDim)

Rate of Convergence

A unique feature of this algorithm is that the time constant for ‖w(t)‖ is 1 and it is independent of the eigen-structure of A.

Upper Bound of ηk

I have proven that there exists a uniform upper bound for ηk such that wk is uniformly bounded. Furthermore, if ‖wk‖² ≤ α+1, then ‖wk+1‖² ≤ ‖wk‖² if
$$ \eta_k < \frac{2\left(\alpha + 1\right)}{\alpha}. $$

4.6 XU Algorithm

Objective Function

Originally presented by Xu [Xu 91, 93], the objective function for the XU algorithm is

$$ J(\mathbf{w}_k; A_k) = -\mathbf{w}_k^T A_k\mathbf{w}_k + \mathbf{w}_k^T A_k\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) = -2\mathbf{w}_k^T A_k\mathbf{w}_k + \mathbf{w}_k^T A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k $$     (4.12)

The objective function J(wk; Ak) represents the mean squared error between the sample xk and its transformation due to the matrix $$ \mathbf{w}_k\mathbf{w}_k^T $$. This transform, also known as auto-association, is shown in Figure 4-2. We define $$ A_k = \left(1/k\right)\sum_{t=1}^k \mathbf{x}_t\mathbf{x}_t^T $$. Then, the mean squared error objective function is
$$ J(\mathbf{w}_k; A_k) = \frac{1}{k}\sum_{t=1}^k \left\Vert \mathbf{x}_t - \mathbf{w}_k\mathbf{w}_k^T\mathbf{x}_t \right\Vert^2 = \operatorname{tr} A_k - 2\mathbf{w}_k^T A_k\mathbf{w}_k + \mathbf{w}_k^T A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k, $$

which is the same as (4.12).
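The equivalence between the sample-averaged reconstruction error and the matrix form above is easy to check numerically. The following is a small verification sketch with randomly generated data; the variable names are assumptions for this illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 500
X = rng.standard_normal((n, k))              # k samples of dimension n, one per column
w = rng.standard_normal((n, 1))
A = (X @ X.T) / k                            # A_k = (1/k) sum_t x_t x_t^T

mse = np.mean(np.sum((X - w @ (w.T @ X))**2, axis=0))
matrix_form = (np.trace(A) - 2 * float(w.T @ A @ w)
               + float(w.T @ A @ w) * float(w.T @ w))
# mse and matrix_form agree up to floating-point round-off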

Adaptive Algorithm

The gradient of (4.12) with respect to wk is
$$ \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = -\left(2A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k\right). $$

The adaptive gradient descent XU algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k + \eta_k\left(2A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k\right) $$.     (4.13)

The Python code for this algorithm with multidimensional data X[nDim,nSamples] is
A = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix
w = 0.1 * np.ones(shape=(nDim,11)) # weight vectors of all algorithms
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # XU Algorithm
        v = w[:,6].reshape(nDim, 1)
        v = v + (1/(50+cnt))*(2*A@ v - v@(v.T @ A @ v) - A@ v@ (v.T @ v))
        w[:,6] = v.reshape(nDim)

Rate of Convergence

The convergence time constant for the principal eigenvector ϕ1 is 1/λ1 and for the minor eigenvectors ϕi it is 1/(λ1−λi) for i=2,…,n. The time constants are dependent on the eigen-structure of the data correlation matrix A.

Upper Bound of ηk

There exists a uniform upper bound for ηk such that wk is uniformly bounded w.p.1. If ‖wk‖² ≤ α+1 and θ is the largest eigenvalue of Ak, then ‖wk+1‖² ≤ ‖wk‖² if
$$ \eta_k < \frac{1}{\theta\alpha}. $$

4.7 Penalty Function Algorithm

Objective Function

Originally given by Chauvin [Chauvin 89], the objective function for the penalty function (PF) algorithm is

$$ J(\mathbf{w}_k; A_k) = -\mathbf{w}_k^T A_k\mathbf{w}_k + \frac{\mu}{2}\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)^2 $$, μ > 0.     (4.14)

The objective function J(wk; Ak) is an implementation of the Rayleigh quotient criterion (4.1), where the constraint $$ \mathbf{w}_k^T\mathbf{w}_k = 1 $$ is enforced by the penalty function method of nonlinear optimization, and μ is a positive penalty constant.

Adaptive Algorithm

The gradient of (4.14) with respect to wk is
$$ \left(1/2\right)\nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = -\left(A_k\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)\right). $$

The adaptive gradient descent PF algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = \mathbf{w}_k + \eta_k\left(A_k\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)\right) $$,     (4.15)

where μ > 0.

The Python code for this algorithm with multidimensional data X[nDim,nSamples] is
mu = 10
A = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix
w = 0.1 * np.ones(shape=(nDim,11)) # weight vectors of all algorithms
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # PF Algorithm
        v = w[:,7].reshape(nDim, 1)
        v = v + (1/(50+cnt)) * (A @ v - mu * v @ (v.T @ v - 1))
        w[:,7] = v.reshape(nDim)

Rate of Convergence

The convergence time constant for the principal eigenvector ϕ1 is 1/(λ1 + μ) and for the minor eigenvectors ϕi it is 1/(λ1−λi) for i=2,…,n. The time constants are dependent on the eigen-structure of the data correlation matrix A.

Upper Bound of ηk

There exists a uniform upper bound for ηk such that wk is uniformly bounded. If ‖wk‖² ≤ α+1 and θ is the largest eigenvalue of Ak, then ‖wk+1‖² ≤ ‖wk‖² if

$$ \eta_k < \frac{1}{\mu\alpha - \theta} $$, assuming μα > θ.

4.8 Augmented Lagrangian 1 Algorithm

Objective Function and Adaptive Algorithm

The objective function for the augmented Lagrangian 1 (AL1) algorithm is obtained by applying the augmented Lagrangian method of nonlinear optimization to minimize $$ -\mathbf{w}_k^T A_k\mathbf{w}_k $$ under the constraint $$ \mathbf{w}_k^T\mathbf{w}_k = 1 $$:

$$ J(\mathbf{w}_k; A_k) = -\mathbf{w}_k^T A_k\mathbf{w}_k + \alpha_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) + \frac{\mu}{2}\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)^2 $$,     (4.16)

where αk is a Lagrange multiplier and μ is a positive penalty constant. The gradient of J(wk; Ak) with respect to wk is
$$ \nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = -2\left(A_k\mathbf{w}_k - \alpha_k\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)\right). $$

By equating the gradient to 0 and using the constraint $$ \mathbf{w}_k^T\mathbf{w}_k = 1 $$, we obtain αk = wkT Ak wk. Replacing this αk in the gradient, we obtain the AL1 algorithm

$$ \mathbf{w}_{k+1} = \mathbf{w}_k + \eta_k\left(A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)\right) $$,     (4.17)

where μ > 0. Note that (4.17) is the same as the OJA+ algorithm for μ = 1.

The Python code for this algorithm with multidimensional data X[nDim,nSamples] is
mu = 10
A = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix
w = 0.1 * np.ones(shape=(nDim,11)) # weight vectors of all algorithms
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # AL1 Algorithm
        v = w[:,8].reshape(nDim, 1)
        v = v + (1/(50+cnt))*(A@v - v@(v.T @A @ v) - mu* v@ (v.T@v - 1))
        w[:,8] = v.reshape(nDim)

Rate of Convergence

The convergence time constant for the principal eigenvector ϕ1 is 1/(λ1 + μ) and for the minor eigenvectors ϕi it is 1/(λ1−λi) for i=2,…,n. The time constants are dependent on the eigen-structure of the data correlation matrix A.

Upper Bound of ηk

There exists a uniform upper bound for ηk such that wk is uniformly bounded. If ‖wk‖² ≤ α+1 and θ is the largest eigenvalue of Ak, then ‖wk+1‖² ≤ ‖wk‖² if
$$ \eta_k < \frac{1}{\left(\mu + \theta\right)\alpha}. $$

4.9 Augmented Lagrangian 2 Algorithm

Objective Function

The objective function for the augmented Lagrangian 2 (AL2) algorithm is

$$ J(\mathbf{w}_k; A_k) = -\mathbf{w}_k^T A_k\mathbf{w}_k + \mathbf{w}_k^T A_k\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right) + \frac{\mu}{2}\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)^2 $$     (4.18)

The objective function J(wk;Ak) is an application of the augmented Lagrangian method on the Rayleigh quotient criterion (4.1). It uses the XU objective function and also uses the penalty function method (4.14), where μ is a positive penalty constant.

Adaptive Algorithm

The gradient of (4.18) with respect to wk is
$$ \left(1/2\right)\nabla_{\mathbf{w}_k} J(\mathbf{w}_k; A_k) = -\left(2A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)\right). $$

The adaptive gradient descent AL2 algorithm for PCA is

$$ \mathbf{w}_{k+1} = \mathbf{w}_k + \eta_k\left(2A_k\mathbf{w}_k - \mathbf{w}_k\mathbf{w}_k^T A_k\mathbf{w}_k - A_k\mathbf{w}_k\mathbf{w}_k^T\mathbf{w}_k - \mu\mathbf{w}_k\left(\mathbf{w}_k^T\mathbf{w}_k - 1\right)\right) $$,     (4.19)

where μ > 0.

The Python code for this algorithm with multidimensional data X[nDim,nSamples] is
mu = 10
A = np.zeros(shape=(nDim,nDim)) # stores adaptive correlation matrix
w = 0.1 * np.ones(shape=(nDim,11)) # weight vectors of all algorithms
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # AL2 Algorithm
        v = w[:,9].reshape(nDim, 1)
        v = v + (1/(50+cnt))*(2* A @ v - v @ (v.T @ A @ v) -
                              A @ v @ (v.T @ v) - mu* v @ (v.T @ v - 1))
        w[:,9] = v.reshape(nDim)

Rate of Convergence

The convergence time constant for the principal eigenvector ϕ1 is 1/(λ1 + (μ/2)) and for the minor eigenvectors ϕi it is 1/(λ1−λi) for i=2,…,n. The time constants are dependent on the eigen-structure of the data correlation matrix A.

Upper Bound of ηk

There exists a uniform upper bound for ηk such that wk is uniformly bounded. Furthermore, if ‖wk‖² ≤ α+1 and θ is the largest eigenvalue of Ak, then ‖wk+1‖² ≤ ‖wk‖² if
$$ \eta_k < \frac{2}{\left(2\theta + \mu\right)\alpha}. $$

4.10 Summary of Algorithms

Table 4-1 summarizes the convergence results of the algorithms. It also shows the upper bounds of ηk, when available. Here τ denotes the time constant, w0 denotes the initial value of wk, α+1 is the upper bound of ‖wk‖² (i.e., ‖wk‖² ≤ α+1), and θ denotes the first principal eigenvalue of Ak.
Table 4-1

Summary of Convergence Results

Algorithm   Convergence Time Constant   Upper Bound of ηk
OJA         1/λ1                        2/(αθ)
OJAN        1/λ1                        Not available
LUO         ‖w0‖⁻²/λ1                   Not available
RQ          ‖w0‖²/λ1                    Not available
OJA+        1/(λ1+1)                    1/(αθ)
IT          1                           2(α+1)/α
XU          1/λ1                        1/(αθ)
PF          1/(λ1+μ)                    1/(μα-θ)
AL1         1/(λ1+μ)                    1/((μ+θ)α)
AL2         1/(λ1+(μ/2))                2/((μ+2θ)α)

Note that a smaller time constant yields faster convergence. The conclusions are
  1. For all algorithms except IT, convergence of ϕ1 improves for larger values of λ1.

  2. For LUO, the time constant decreases for larger ‖w0‖, which implies that convergence improves for larger initial weights.

  3. For RQ, convergence deteriorates for larger ‖w0‖.

  4. For the PF, AL1, and AL2 algorithms, the time constant decreases for larger μ, although very large values of μ will make the algorithm perform poorly due to excessive emphasis on the constraints.

4.11 Experimental Results

I did three sets of experiments.
  1. In the first experiment, I used the adaptive algorithms described before on a single data set with various starting vectors w0 [Chatterjee, Neural Networks, Vol. 18, No. 2, pp. 145-149, March 2005].

  2. In the second experiment, I generated several data sets and used the adaptive algorithms with the same starting vector w0 [Chatterjee, Neural Networks, Vol. 18, No. 2, pp. 145-149, March 2005].

  3. In the third experiment, I used a real-world non-stationary data set from a public dataset [V. Souza et al. 2020] to demonstrate the fast convergence of the adaptive algorithms to the first principal eigenvector of the ensemble correlation matrix.

Experiments with Various Starting Vectors w0

[Chatterjee, Neural Networks, Vol. 18, No. 2, pp. 145-149, March 2005].

I generated 1,000 samples of 10-dimensional Gaussian data (i.e., n=10) with zero mean and the covariance given below. The covariance matrix is obtained by multiplying the second covariance matrix in [Okada and Tomita 85] by 3, which is

     (4.20)

The eigenvalues of the covariance matrix are

17.9013, 10.2212, 8.6078, 6.5361, 2.2396, 1.8369, 1.1361, 0.7693, 0.2245, 0.1503.

I computed the principal eigenvector (i.e., the eigenvector corresponding to the largest eigenvalue = 17.9013) by the adaptive algorithms described before from different starting vectors w0. I set w0 = c·r, where r∈ℜ10 is an N(0,1) random vector and c∈[0.05,2.0]. This causes a variation in ‖w0‖ from 0.1 to 5.0.

In order to compute the online data sequence {Ak}, I generated random data vectors {xk} from the covariance matrix (4.20). I generated {Ak} from {xk} by using the algorithm (2.5) with β=1. I compute the correlation matrix A after collecting all 500 samples xk as
$$ A = \frac{1}{500}\sum_{i=1}^{500} \mathbf{x}_i\mathbf{x}_i^T. $$

I refer to the eigenvectors and eigenvalues computed from this A by a standard numerical analysis method [Golub and VanLoan 83] as the actual values.

In order to measure the convergence and accuracy of the algorithms, I computed the percentage direction cosine at the kth update of each adaptive algorithm as

Percentage Direction Cosine (k) = $$ \frac{100\left|\mathbf{w}_k^T\boldsymbol{\upvarphi}_1\right|}{\left\Vert \mathbf{w}_k\right\Vert} $$,     (4.21)

where wk is the estimated first principal eigenvector of Ak at the kth update and ϕ1 is the actual first principal eigenvector computed from all collected samples by a conventional numerical analysis method. For all algorithms, I used ηk=1/(200+k). For the PF, AL1, and AL2 algorithms, I used μ=10. The results are summarized in Table 4-2. I reported the percentage direction cosines after sample values k=N/2 and N (i.e., k=250 and 500) for each algorithm.
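A direct way to evaluate the metric (4.21) is sketched below; phi1 is the unit-norm principal eigenvector obtained from the batch correlation matrix A, and w is the current adaptive estimate (the variable names are assumptions for this illustration).
import numpy as np

def pct_direction_cosine(w, phi1):
    # 100 |w^T phi_1| / ||w||, with phi_1 of unit norm; 100 means perfect alignment
    return 100.0 * abs(float(w.T @ phi1)) / np.linalg.norm(w)

# batch ("actual") principal eigenvector from the full-sample correlation matrix A
# eig_vals, eig_vecs = np.linalg.eigh(A)
# phi1 = eig_vecs[:, -1].reshape(-1, 1)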
Table 4-2

Convergence of the Principal Eigenvector of A by Adaptive Algorithms at Sample Values k={250,500} for Different Initial Values w0

w0      k    OJA    OJAN   LUO    RQ     OJA+   IT     XU     PF     AL1    AL2
0.1355  250  97.18  97.18  60.78  98.44  97.18  84.53  97.22  97.17  97.18  97.20
        500  99.58  99.58  63.15  99.96  99.58  89.67  99.58  99.58  99.58  99.58
0.4065  250  97.18  97.18  82.44  98.54  97.18  84.96  97.18  97.18  97.18  97.16
        500  99.58  99.58  90.88  99.96  99.58  90.35  99.58  99.58  99.58  99.58
0.6776  250  97.18  97.18  94.63  97.85  97.18  82.55  97.17  97.18  97.18  97.15
        500  99.58  99.58  98.50  99.88  99.58  88.85  99.58  99.58  99.58  99.58
0.9486  250  97.18  97.18  97.05  97.28  97.18  79.60  97.18  97.18  97.18  97.17
        500  99.58  99.58  99.52  99.63  99.58  86.90  99.58  99.58  99.58  99.58
1.2196  250  97.18  97.18  97.60  96.35  97.18  76.67  97.21  97.18  97.18  97.21
        500  99.58  99.58  99.80  99.19  99.58  84.80  99.58  99.58  99.58  99.58
1.4906  250  97.18  97.18  97.97  94.43  97.18  73.99  97.26  97.18  97.17  97.27
        500  99.58  99.58  99.90  98.41  99.58  82.68  99.59  99.58  99.58  99.59
1.7617  250  97.17  97.18  98.31  91.53  97.18  71.63  97.33  97.18  97.16  97.35
        500  99.58  99.58  99.95  97.08  99.58  80.61  99.59  99.58  99.58  99.59
2.0327  250  97.17  97.18  98.57  88.04  97.17  69.61  97.44  97.17  97.15  97.51
        500  99.58  99.58  99.96  95.08  99.58  78.63  99.60  99.58  99.58  99.60
2.3037  250  97.17  97.18  98.75  84.43  97.17  67.90  97.62  97.17  97.14  97.89
        500  99.58  99.58  99.97  92.51  99.58  76.78  99.61  99.58  99.58  99.63
2.5748  250  97.16  97.18  98.89  81.00  97.16  66.46  97.96  97.16  97.11  98.55
        500  99.58  99.58  99.98  89.59  99.58  75.07  99.63  99.58  99.58  99.77
2.8458  250  97.15  97.18  99.00  77.92  97.16  65.26  98.64  97.15  97.06  94.08
        500  99.58  99.58  99.99  86.56  99.58  73.50  99.70  99.58  99.57  99.42
3.1168  250  97.14  97.18  99.06  75.25  97.15  64.24  16.90  97.14  96.91  95.92
        500  99.58  99.58  99.99  83.61  99.58  72.08  60.47  99.58  99.57  99.51

Table 4-2 demonstrates the following:
  • Convergence of all adaptive algorithms is similar except for the RQ and IT algorithms.

  • Other than IT, all algorithms converge with a time constant ∝ 1/λ1.

  • For the IT algorithm, the time constant of the principal eigenvector is 1. Since λ1=17.9, the convergence of all other algorithms is faster than IT.

  • For the RQ and LUO algorithms, the time constants are ‖w0‖²/λ1 and ‖w0‖⁻²/λ1, respectively.

  • Clearly, for larger ‖w0‖, RQ converges at a slower rate than the other algorithms, and LUO converges faster than the other algorithms whose time constants are 1/λ1.

  • For very large ‖w0‖, such as ‖w0‖=10.0, the LUO algorithm fails to converge for ηk=1/(200+k) because the iteration becomes unstable.

  • For smaller ‖w0‖, the convergence of RQ is better than that of the other algorithms, since its time constant ‖w0‖²/λ1 is smaller than the 1/λ1 time constant of the other algorithms.

Experiments with Various Data Sets: Set 1

[Chatterjee, Neural Networks, Vol. 18, No. 2, pp. 145-149, March 2005].

Here I used the covariance matrix (4.20) and added to it a symmetric matrix c·R, where R is a uniform (0,1) random symmetric matrix and c∈[0.05,2.0]. I generated 12 sets of 1,000 samples of 10-dimensional Gaussian data with zero mean and the random covariance matrices described above. I chose the starting vector w0 = 0.5·r, where r∈ℜ10 is an N(0,1) random vector. I used ηk=1/(200+k), and for the PF, AL1, and AL2 algorithms, I chose μ=10. I generated the percentage direction cosines (4.21) for all algorithms on each data set and reported the results in Table 4-3. For each data set, I stated the two largest eigenvalues λ1 and λ2.
Table 4-3

Convergence of the Principal Eigenvector of A by Adaptive Algorithms at Sample Values k={250,500} for Different Data Sets

λ1, λ2        k    OJA    OJAN   LUO    RQ     OJA+   IT     XU     PF     AL1    AL2
11.58,  6.32  250  95.66  95.67  97.61  91.70  95.66  72.19  95.97  95.66  95.65  95.82
              500  99.50  99.50  99.86  98.14  99.50  80.97  99.51  99.50  99.50  99.51
11.63,  6.49  250  95.50  95.54  96.93  91.65  95.51  70.57  96.34  95.51  95.46  96.18
              500  99.51  99.51  99.87  98.28  99.51  80.39  99.54  99.51  99.51  99.54
11.73,  6.92  250  87.62  86.61  97.10  56.91  87.54  47.80  46.00  86.78  88.08  36.39
              500  98.94  98.89  99.80  91.38  98.93  60.20  96.82  98.90  98.96  98.06
11.84,  7.18  250  96.72  96.71  96.53  95.83  96.72  73.84  96.54  96.72  96.73  96.60
              500  99.30  99.30  99.75  98.69  99.30  82.76  99.28  99.30  99.30  99.29
12.14,  7.64  250  96.23  96.22  95.66  95.62  96.23  71.62  96.01  96.22  96.24  96.06
              500  99.10  99.10  99.67  98.54  99.10  81.39  99.08  99.10  99.10  99.08
12.52,  8.08  250  95.10  95.13  95.63  94.45  95.10  60.71  95.78  95.10  95.06  95.67
              500  98.91  98.91  99.62  99.23  98.91  73.59  98.96  98.91  98.91  98.95
12.87,  8.67  250  95.38  95.37  93.65  96.51  95.38  68.84  95.08  95.37  95.39  95.14
              500  98.57  98.57  99.39  98.60  98.57  79.91  98.53  98.57  98.57  98.54
13.57,  9.33  250  94.83  94.82  92.88  97.00  94.82  64.05  94.56  94.82  94.84  94.60
              500  98.35  98.35  99.30  98.66  98.35  76.67  98.32  98.35  98.35  98.32
14.09,  9.88  250  95.95  95.97  92.21  94.23  95.94  40.40  96.05  95.99  95.97  95.86
              500  98.33  98.33  99.17  99.82  98.33  50.26  98.34  98.33  98.33  98.31
17.97, 11.66  250  67.18  72.82  94.87  58.69  67.39   2.78  88.60  74.24  66.40  86.49
              500  98.20  98.45  99.83  90.90  98.21   1.22  99.05  98.50  98.16  98.97
21.54, 12.72  250  95.14  95.19  97.25  84.30  95.14   0.35  95.66  95.19  95.12  95.57
              500  99.79  99.79  99.96  98.49  99.79   3.94  99.80  99.79  99.79  99.79
25.66, 12.92  250  98.23  98.26  99.04  93.90  98.23   2.53  98.42  98.26  98.23  98.37
              500  99.96  99.96  99.99  99.79  99.96   7.62  99.96  99.96  99.96  99.96

Once again, we observe that all algorithms converge in a similar manner except for the RQ and IT algorithms. Of these two, RQ converges much better than IT, which fails to converge for some data sets. Of the remaining algorithms, LUO converges better than the rest for all data sets.

Experiments with Various Data Sets: Set 2

[Chatterjee, Neural Networks, Vol. 18, No. 2, pp. 145-149, March 2005].

I further generated 12 sets of 500 samples of 10-dimensional Gaussian data with zero mean and the random covariance matrix (4.20). Here I computed the eigenvectors and eigenvalues of the covariance matrix (4.20). Next, I changed the first two principal eigenvalues of (4.20) to λ1=25 and λ2=λ1/c, where c∈[1.1,10.0], and generated the data sets with the new eigenvalues and the eigenvectors computed before. For all adaptive algorithms, I used w0, ηk, and μ as in the previous experiments. See Table 4-4.
Table 4-4

Convergence of the Principal Eigenvector of A by Adaptive Algorithms at Sample Values k={50,100} for Different Data Sets with Varying λ1/λ2

λ1/λ2  k    OJA    OJAN   LUO    RQ     OJA+   IT     XU     PF     AL1    AL2
1.1    50   87.75  87.64  94.14  67.29  87.75   7.60  85.95  87.52  87.78  86.37
       100  96.65  96.64  97.40  90.38  96.65   5.57  96.53  96.64  96.65  96.55
1.5    50   86.07  86.06  91.10  75.70  86.08  39.67  86.29  85.99  86.03  86.37
       100  96.29  96.29  97.53  92.37  96.29  47.73  96.30  96.28  96.28  96.31
2.0    50   92.43  92.39  94.56  83.52  92.43  48.30  91.99  92.34  92.42  92.19
       100  98.04  98.04  98.60  96.00  98.04  55.39  98.00  98.04  98.04  98.01
2.5    50   93.28  93.24  95.03  86.76  93.29  55.77  93.02  93.16  93.23  93.53
       100  98.49  98.49  98.95  96.85  98.49  62.59  98.49  98.49  98.49  98.50
3.0    50   94.39  94.37  96.02  89.18  94.39  60.94  94.50  94.34  94.36  94.53
       100  98.78  98.78  99.08  97.69  98.78  67.92  98.77  98.78  98.78  98.78
4.0    50   96.00  95.99  96.66  90.91  96.01  62.67  95.87  95.96  95.99  96.03
       100  98.96  98.96  99.15  98.40  98.96  69.62  98.97  98.96  98.96  98.96
5.0    50   94.55  94.55  96.64  89.99  94.55  65.33  94.89  94.52  94.52  94.84
       100  98.93  98.93  99.16  97.85  98.93  71.30  98.94  98.93  98.93  98.93
6.0    50   98.73  98.74  97.62  96.32  98.73  65.37  98.75  98.76  98.74  98.69
       100  99.19  99.19  99.26  99.38  99.19  72.96  99.19  99.19  99.19  99.19
7.0    50   99.36  99.37  98.00  96.29  99.36  64.25  99.41  99.39  99.37  99.33
       100  99.26  99.26  99.29  99.71  99.26  73.32  99.27  99.26  99.26  99.26
8.0    50   97.12  97.11  97.97  92.96  97.12  62.88  97.12  97.09  97.11  97.17
       100  99.25  99.25  99.34  98.71  99.25  70.05  99.25  99.25  99.25  99.25
9.0    50   97.32  97.31  98.18  92.85  97.32  63.17  97.27  97.28  97.31  97.34
       100  99.33  99.33  99.41  98.87  99.33  70.33  99.33  99.33  99.33  99.33
10.0   50   97.82  97.81  98.43  94.38  97.82  66.24  97.70  97.79  97.81  97.79
       100  99.43  99.43  99.49  99.04  99.43  73.12  99.43  99.43  99.43  99.43

Observe the following:

  • The convergence behavior is similar to that for data set 1.

  • The convergence improves for larger values of k and for larger ratios λ1/λ2, as supported by Table 4-1 and the experimental results in Table 4-4.

Experiments with Real-World Non-Stationary Data

In these experiments, I use real-world non-stationary data from a public collection of real-world datasets for evaluating stream learning algorithms [V. Souza et al. 2020], specifically the file INSECTS-incremental-abrupt_balanced_norm.arff. It is important to demonstrate the performance of these algorithms on real non-stationary data, since in practical edge applications the data is usually time-varying.

The data has 33 components and 80,000 samples. It contains periodic abrupt changes. Figure 4-3 shows the components.
Figure 4-3

Non-stationary real-world data with abrupt periodic changes

The first eight eigenvalues of the correlation matrix of all samples are

[18.704, 10.473, 8.994, 7.862, 7.276, 6.636, 5.565, 4.894, …].

I used the adaptive algorithms discussed in this chapter and plotted the percentage direction cosine (4.21). Figure 4-4 shows that all algorithms converged well in spite of the non-stationarity in the data.
Figure 4-4

Convergence of the adaptive algorithms for first principal eigenvector on real-world non-stationary data (ideal value=100)

4.12 Concluding Remarks

The RQ, OJAN, and LUO algorithms require ‖w0‖=1 for the principal eigenvector estimate wk to converge to unit norm (i.e., ‖wk‖→1). All other algorithms (except PF) converged to unit norm for various non-zero values of w0. As per Theorem 4.8, the PF algorithm successfully converged to $$ \left\Vert \mathbf{w}_k\right\Vert \to \sqrt{1 + \lambda_1/\mu} $$.
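Since the PF estimate converges to a vector of norm sqrt(1 + λ1/μ) rather than to unit norm, a post-hoc normalization of the converged weight recovers a unit-norm direction estimate. This is a minimal illustrative sketch, not part of the original text.
import numpy as np

def unit_direction(w):
    # normalize a converged PF weight vector to unit norm
    return w / np.linalg.norm(w)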

The unified framework and the availability of the objective functions allow me to derive and analyze each algorithm and its convergence performance. The results for each algorithm are summarized in Table 4-5. For each method, I computed the approximate MATLAB flop2 count for one iteration of the adaptive algorithm for n=10 and show it in Table 4-5.
Table 4-5

Comparison of Different Adaptive Algorithms [Chatterjee, Neural Networks 2005]

OJA
  Pros: Convergence increases with larger λ1 and λ1/λ2. Upper bound of ηk can be determined. Fewer computations per iteration (Flops=460).
  Cons: Convergence cannot be improved by larger ‖w0‖.

OJAN
  Pros: Convergence increases with larger λ1 and λ1/λ2. Fewer computations per iteration (Flops=481).
  Cons: Upper bound of ηk not available. Convergence cannot be improved by larger ‖w0‖. Requires ‖w0‖=1 for wk to converge to unit norm.

LUO
  Pros: Convergence increases with larger λ1 and λ1/λ2. Convergence increases for larger ‖w0‖. Fewer computations per iteration (Flops=502).
  Cons: Upper bound of ηk not available. Convergence decreases for smaller ‖w0‖. Requires ‖w0‖=1 for wk to converge to unit norm.

RQ
  Pros: Convergence increases with larger λ1 and λ1/λ2. Convergence increases for smaller ‖w0‖. Fewer computations per iteration (Flops=503).
  Cons: Upper bound of ηk not available. Convergence decreases for larger ‖w0‖. Requires ‖w0‖=1 for wk to converge to unit norm.

OJA+
  Pros: Convergence increases with larger λ1 and λ1/λ2. Upper bound of ηk can be determined. Fewer computations per iteration (Flops=501).
  Cons: Convergence cannot be improved by larger ‖w0‖.

IT
  Pros: Upper bound of ηk can be determined. Fewer computations per iteration (Flops=460).
  Cons: Convergence independent of λ1. Experimental results show poor convergence.

XU
  Pros: Convergence increases with larger λ1 and λ1/λ2. Upper bound of ηk can be determined.
  Cons: Convergence cannot be improved by larger ‖w0‖. More computations per iteration (Flops=800).

PF
  Pros: Convergence increases with larger λ1 and λ1/λ2. Convergence increases with μ. Upper bound of ηk can be determined. Smallest computations per iteration (Flops=271).
  Cons: Convergence cannot be improved by larger ‖w0‖. wk does not converge to unit norm.

AL1
  Pros: Convergence increases with larger λ1 and λ1/λ2. Convergence increases with μ. Upper bound of ηk can be determined. Fewer computations per iteration (Flops=511).
  Cons: Convergence cannot be improved by larger ‖w0‖.

AL2
  Pros: Convergence increases with larger λ1 and λ1/λ2. Convergence increases with μ. Upper bound of ηk can be determined.
  Cons: Convergence cannot be improved by larger ‖w0‖. Largest computations per iteration (Flops=851).

In summary, I discussed ten adaptive algorithms for PCA, some of them new, from a common framework with an objective function for each. Note that although I applied the gradient descent technique on these objective functions, I could have applied any other technique of nonlinear optimization such as steepest descent, conjugate direction, Newton-Raphson, or recursive least squares.
