© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Chatterjee, Adaptive Machine Learning Algorithms with Python, https://doi.org/10.1007/978-1-4842-8017-1_3

3. Square Root and Inverse Square Root

Chanchal Chatterjee, San Jose, CA, USA
 

3.1 Introduction and Use Cases

Adaptive computation of the square root and inverse square root of the real-time correlation matrix of a streaming sequence {xk∈ℜn} has numerous applications in machine learning, data analysis, and image/signal processing. They include data whitening, classifier design, and data normalization [Foley and Sammon 75; Fukunaga 90].

Data whitening is a process of decorrelating the data such that all components have unit variance. It is a data preprocessing step in machine learning and data analysis to “normalize” the data so that it is easier to model. Prominent applications are
  • To transform correlated noise in a signal to independent and identically distributed (iid) noise, which is easier to classify

  • Generalized eigenvector computation [Chatterjee et al. Mar 97] (see Chapter 7)

  • Linear discriminant analysis computation [linear discriminant analysis, Wikipedia]

  • Gaussian classifier design and computation of distance measures [Chatterjee et al. May 97]

Figure 3-1 shows the correlation matrices of the original and whitened data. The original data is highly correlated, as shown by the colors spread across the whole matrix. The whitened data is fully uncorrelated: only the diagonal entries of its correlation matrix are nonzero.
Figure 3-1

Original correlated data on the left and the uncorrelated “whitened” data on the right

Figure 3-2 shows a handwritten number 0 obtained from the Keras MNIST dataset [Keras, MNIST]. The correlation matrix on the right shows that the pixel values are highly correlated across the entire image.
Figure 3-2

Handwritten MNIST number 0 and the correlation matrix of all characters

The following Python code whitens the data samples X[nDim,nSamples]:
import numpy as np
from scipy.linalg import eigh

nDim, nSamples = X.shape
corX = (X @ X.T) / nSamples              # sample correlation matrix of X
eigvals, eigvecs = eigh(corX)            # eigh returns eigenvalues in ascending order
V = np.fliplr(eigvecs)                   # eigenvectors in descending eigenvalue order
D = np.zeros((nDim, nDim))
for i in range(nDim):
    if eigvals[::-1][i] < 10:            # discard components with small eigenvalues
        D[i, i] = 0
    else:
        D[i, i] = 1.0 / np.sqrt(eigvals[::-1][i])
Z = V @ D @ V.T @ X                      # whitened data
Next, let's look at the transformed data and its new correlation matrix. Figure 3-3 shows that the distinguishing features of the character are accentuated and that the correlation matrix is diagonal rather than spread across all pixels, confirming that the data has been whitened to an (approximately) identity correlation matrix.
Figure 3-3

Handwritten MNIST number 0 after whitening. The correlation matrix of all characters is diagonal

We define the data correlation matrix A as
$$ A_k=\frac{1}{k}\sum\limits_{i=1}^k\mathbf{x}_i\mathbf{x}_i^T=A_{k-1}+\frac{1}{k}\left(\mathbf{x}_k\mathbf{x}_k^T-A_{k-1}\right). $$
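As a quick sanity check of this recursion, the short sketch below (with illustrative dimensions and synthetic data) confirms that the rank-one update reproduces the batch average of the xixiT terms.

import numpy as np

rng = np.random.default_rng(0)
n, nSamples = 5, 200
X = rng.standard_normal((n, nSamples))    # columns are the samples x_k

A = np.zeros((n, n))
for k in range(nSamples):
    x = X[:, k].reshape(n, 1)
    # A_k = A_{k-1} + (1/k)(x_k x_k^T - A_{k-1}), with k starting at 1
    A = A + (1.0/(k + 1)) * (x @ x.T - A)

A_batch = (X @ X.T) / nSamples            # (1/k) sum_i x_i x_i^T
print(np.allclose(A, A_batch))            # True, up to floating-point error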

The square root of A is denoted by A½; the Cholesky factor [Cholesky decomposition, Wikipedia] is one well-known example of such a factorization. Similarly, A–½ denotes the inverse square root of A. However, as explained below, neither of these matrix functions has a unique solution, and different adaptive algorithms converge to different solutions. Under some restrictions, however, the solutions become unique; in particular, there is a unique symmetric positive definite solution for each.

In this chapter, I present three adaptive algorithms for each matrix function A½ and A–½. Out of the three, one adaptive algorithm leads to the symmetric positive definite solution and two lead to more general solutions.

Various Solutions for A½ and A–½

Let A=ΦΛΦT be the eigen-decomposition of the real, symmetric, positive definite n×n matrix A, where Φ and Λ are respectively the eigenvector and eigenvalue matrices of A. Here Λ=diag(λ1,…,λn) is the diagonal eigenvalue matrix with λ1≥…≥λn>0, and Φ∈ℜn×n is orthonormal. A solution for A½ is ΦD, where D=diag(±λ1½,…,±λn½). However, in general this is not a symmetric solution, and for any orthonormal matrix U, ΦDU is also a solution. We can show that A½ is symmetric if, and only if, it is of the form ΦDΦT, and there are 2^n symmetric solutions for A½. When D is positive definite, we obtain the unique symmetric positive definite solution for A½ as ΦΛ½ΦT, where Λ½=diag(λ1½,…,λn½). Similarly, a general solution for the inverse square root A–½ of A is ΦD–1U, where D is as defined above and U is any orthonormal matrix. The unique symmetric positive definite solution for A–½ is ΦΛ–½ΦT, where Λ–½=diag(λ1–½,…,λn–½).
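To make this concrete, the sketch below (an illustrative example on a random symmetric positive definite matrix) builds the unique symmetric positive definite A½ and A–½ from the eigen-decomposition and also shows that ΦDU, with a random orthonormal U, is a valid but non-symmetric square root in the sense that (ΦDU)(ΦDU)T = A.

import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                     # a symmetric positive definite test matrix

lam, Phi = np.linalg.eigh(A)                    # A = Phi diag(lam) Phi^T

# Unique symmetric positive definite square root and inverse square root
A_half     = Phi @ np.diag(np.sqrt(lam)) @ Phi.T
A_inv_half = Phi @ np.diag(1.0/np.sqrt(lam)) @ Phi.T
print(np.allclose(A_half @ A_half, A))                        # True
print(np.allclose(A_inv_half @ A @ A_inv_half, np.eye(n)))    # True

# A non-symmetric square root: Phi D U with any orthonormal U
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
M = Phi @ np.diag(np.sqrt(lam)) @ U
print(np.allclose(M @ M.T, A))                  # True, although M is generally not symmetric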

Outline of This Chapter

In Sections 3.2, 3.3, and 3.4, I discuss three algorithms for the adaptive computation of A½. Section 3.4 discusses the unique symmetric positive definite solution for A½. In Sections 3.5, 3.6, and 3.7, I discuss three algorithms for the adaptive computation of A–½. Section 3.7 describes the unique symmetric positive definite solution for A–½. Section 3.8 presents experimental results for the six algorithms with 10-dimensional Gaussian data. Section 3.9 concludes the chapter.

3.2 Adaptive Square Root Algorithm: Method 1

Let {xk∈ℜn} be a sequence of data vectors whose online data correlation matrix Ak∈ℜn×n is given by

$$ A_k=\frac{1}{k}\sum\limits_{i=1}^k\beta^{k-i}\mathbf{x}_i\mathbf{x}_i^T $$.     (3.1)

Here xk is an observation vector at time k and 0<β≤1 is a forgetting factor used for non-stationary sequences. If the data is stationary, the asymptotic correlation matrix A is

$$ A=\lim_{k\to\infty}E\left[A_k\right] $$.     (3.2)

Objective Function

Following the methodology described in Section 1.4, I present the algorithm by first showing an objective function J, whose minimum with respect to the matrix W gives us the square root of the asymptotic data correlation matrix A. The objective function is

$$ J(W)={\left\Vert A-W^TW\right\Vert}_F^2 $$.     (3.3)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4W\left(A-W^TW\right) $$.     (3.4)
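As a sanity check on (3.4), the sketch below (small random matrices, illustrative names) compares the analytical gradient −4W(A − WTW) with a central finite-difference approximation of J.

import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # symmetric positive definite A
W = rng.standard_normal((n, n))

def J(W):
    return np.linalg.norm(A - W.T @ W, 'fro')**2

grad = -4.0 * W @ (A - W.T @ W)      # analytical gradient (3.4)

# Central finite differences, entry by entry
num = np.zeros_like(W)
eps = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.max(np.abs(grad - num)))    # small; only finite-difference error remains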

Adaptive Algorithm

From the gradient in (3.4), we obtain the following adaptive gradient descent algorithm:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)\nabla_WJ\left(W_k;A_k\right)=W_k+\eta_k\left(W_kA_k-W_kW_k^TW_k\right) $$,     (3.5)

where ηk is a small decreasing constant that satisfies assumption A1.2 in the Proofs of Convergence document on GitHub [Chatterjee GitHub].

The following Python code implements this algorithm with data X[nDim,nSamples]:
import numpy as np

nDim, nSamples = X.shape
nEpochs = 1                        # number of passes over the data
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W1 = np.eye(nDim)                  # square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        # Correlation matrix update: A_k = A_{k-1} + (1/k)(x_k x_k^T - A_{k-1})
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(50 + cnt)     # decreasing step size eta_k
        # Algorithm (3.5): W_{k+1} = W_k + eta_k (W_k A_k - W_k W_k^T W_k)
        W1 = W1 + etat1 * (W1 @ A - W1 @ W1.T @ W1)
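After the loop finishes, it is easy to check how closely W1 approximates a square root of the current correlation estimate. The lines below are a minimal follow-up to the code above (they reuse A and W1 from it) and use the same Frobenius error that Section 3.8 reports for Method 1.

# Frobenius error for Method 1: small if the step sizes and number of epochs are adequate
err1 = np.linalg.norm(A - W1.T @ W1, 'fro')
print('Method 1 error ||A - W1^T W1||_F =', err1)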

3.3 Adaptive Square Root Algorithm: Method 2

Objective Function

The objective function J(W), whose minimum with respect to W gives us the square root of A, is

$$ J(W)={\left\Vert A-WW^T\right\Vert}_F^2 $$.     (3.6)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4\left(A-WW^T\right)W $$.     (3.7)

Adaptive Algorithm

We obtain the following adaptive gradient descent algorithm for the square root of A:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)\nabla_WJ\left(W_k;A_k\right)=W_k+\eta_k\left(A_kW_k-W_kW_k^TW_k\right) $$,     (3.8)

where ηk is a small decreasing constant.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W2 = np.eye(nDim)                  # square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(50 + cnt)     # decreasing step size eta_k
        # Algorithm (3.8): W_{k+1} = W_k + eta_k (A_k W_k - W_k W_k^T W_k)
        W2 = W2 + etat1 * (A @ W2 - W2 @ W2.T @ W2)

3.4 Adaptive Square Root Algorithm: Method 3

Adaptive Algorithm

Following the adaptive algorithms (3.5) and (3.8), I now present an algorithm for the computation of a symmetric positive definite square root of A:

$$ W_{k+1}=W_k+\eta_k\left(A_k-W_k^2\right) $$,     (3.9)

where ηk is a small decreasing constant and Wk is symmetric.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W3 = np.eye(nDim)                  # symmetric square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat2 = 1.0/(50 + cnt)     # decreasing step size eta_k
        # Algorithm (3.9): W_{k+1} = W_k + eta_k (A_k - W_k^2)
        W3 = W3 + etat2 * (A - W3 @ W3)
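Note that (3.9) preserves symmetry: if Wk is symmetric, then A − Wk² is also symmetric, so Wk+1 remains symmetric. The short check below reuses A and W3 from the block above to verify both the symmetry and the square root property.

# W3 should remain (numerically) symmetric and satisfy W3 @ W3 ~ A
print('symmetry error  ||W3 - W3^T||_F :', np.linalg.norm(W3 - W3.T, 'fro'))
print('square root error ||A - W3^2||_F:', np.linalg.norm(A - W3 @ W3, 'fro'))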

3.5 Adaptive Inverse Square Root Algorithm: Method 1

Objective Function

The objective function J(W), whose minimizer W* gives us the inverse square root of A, is

$$ J(W)={\left\Vert I-W^TAW\right\Vert}_F^2 $$.     (3.10)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4AW\left(I-W^TAW\right) $$.     (3.11)

Adaptive Algorithm

From the gradient in (3.11), we obtain the following adaptive gradient descent algorithm:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)A_k^{-1}\nabla_WJ\left(W_k;A_k\right)=W_k+\eta_k\left(W_k-W_kW_k^TA_kW_k\right) $$     (3.12)

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W1 = np.eye(nDim)                  # inverse square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(100 + cnt)    # decreasing step size eta_k
        # Algorithm (3.12): W_{k+1} = W_k + eta_k (W_k - W_k W_k^T A_k W_k)
        W1 = W1 + etat1 * (W1 - W1 @ W1.T @ A @ W1)
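As a quick follow-up (reusing A and W1 from the block above): at convergence, W1^T A W1 should be close to the identity, which also means W1 W1^T approximates the inverse of A.

nDim = A.shape[0]
print('||I - W1^T A W1||_F  =', np.linalg.norm(np.eye(nDim) - W1.T @ A @ W1, 'fro'))
print('||A^-1 - W1 W1^T||_F =', np.linalg.norm(np.linalg.inv(A) - W1 @ W1.T, 'fro'))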

3.6 Adaptive Inverse Square Root Algorithm: Method 2

Objective Function

The objective function J(W), whose minimum with respect to W gives us the inverse square root of A, is

$$ J(W)={\left\Vert I-WAW^T\right\Vert}_F^2 $$.     (3.13)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4\left(I-WAW^T\right)WA $$.     (3.14)

Adaptive Algorithm

We obtain the following adaptive algorithm for the inverse square root of A:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)\nabla_WJ\left(W_k;A_k\right)A_k^{-1}=W_k+\eta_k\left(W_k-W_kA_kW_k^TW_k\right) $$,     (3.15)

where ηk is a small decreasing constant.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W2 = np.eye(nDim)                  # inverse square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(100 + cnt)    # decreasing step size eta_k
        # Algorithm (3.15): W_{k+1} = W_k + eta_k (W_k - W_k A_k W_k^T W_k)
        W2 = W2 + etat1 * (W2 - W2 @ A @ W2.T @ W2)

3.7 Adaptive Inverse Square Root Algorithm: Method 3

Adaptive Algorithm

By extending the adaptive algorithms (3.12) and (3.15), I now present an adaptive algorithm for the computation of a symmetric positive definite inverse square root of A:

$$ W_{k+1}=W_k+\eta_k\left(I-W_kA_kW_k\right) $$,     (3.16)

where ηk is a small decreasing constant and Wk is symmetric.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
I  = np.eye(nDim)                  # identity matrix
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W3 = np.eye(nDim)                  # symmetric inverse square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat2 = 1.0/(100 + cnt)    # decreasing step size eta_k
        # Algorithm (3.16): W_{k+1} = W_k + eta_k (I - W_k A_k W_k)
        W3 = W3 + etat2 * (I - W3 @ A @ W3)
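Because I − Wk A Wk is symmetric whenever Wk is, the iterate W3 stays symmetric and should approach the unique symmetric positive definite solution ΦΛ–½ΦT. The sketch below reuses A and W3 from the block above and compares the adaptive estimate with that eigen-decomposition-based solution (assuming A is positive definite, which holds once enough samples have been processed).

# Unique symmetric positive definite inverse square root of A via eigen-decomposition
lam, Phi = np.linalg.eigh(A)
A_inv_half = Phi @ np.diag(1.0/np.sqrt(lam)) @ Phi.T
print('||A^-1/2 - W3||_F =', np.linalg.norm(A_inv_half - W3, 'fro'))
print('||I - W3 A W3||_F =', np.linalg.norm(np.eye(A.shape[0]) - W3 @ A @ W3, 'fro'))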

3.8 Experimental Results

I generated 500 samples {xk} of 10-dimensional (i.e., n=10) Gaussian data with mean zero. The covariance matrix is the first covariance matrix in [Okada and Tomita 85] multiplied by 3.

The eigenvalues of the covariance matrix are

[17.699, 8.347, 5.126, 3.088, 1.181, 0.882, 0.261, 0.213, 0.182, 0.151].
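For readers who want to reproduce this setup, the sketch below generates zero-mean Gaussian samples with a prescribed covariance; covMat is a placeholder for the 10×10 covariance described above (the scaled matrix from [Okada and Tomita 85]), which is not reproduced here.

import numpy as np

def generate_gaussian_data(covMat, nSamples, seed=0):
    """Draw zero-mean Gaussian samples with covariance covMat.
    Returns X of shape [nDim, nSamples] with samples in the columns."""
    rng = np.random.default_rng(seed)
    nDim = covMat.shape[0]
    return rng.multivariate_normal(np.zeros(nDim), covMat, size=nSamples).T

# covMat = 3 * (first covariance matrix in [Okada and Tomita 85]), a 10x10 array
# X = generate_gaussian_data(covMat, nSamples=500)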

Experiments for Adaptive Square Root Algorithms

I used adaptive algorithms (3.5), (3.8), and (3.9) for Methods 1, 2, and 3, respectively, to compute the square root of A, where A is the correlation matrix computed from all collected samples as
$$ A=\frac{1}{500}\sum\limits_{i=1}^{500}\mathbf{x}_i\mathbf{x}_i^T. $$

I started the algorithms with W0=I (the 10×10 identity matrix). At the kth update of each algorithm, I computed the Frobenius norm [Frobenius norm, Wikipedia] of the error between the actual correlation matrix A and the appropriate square of Wk for each method. I denote this error by ek as follows:

$$ e_k^{\mathrm{Method}\ 1}={\left\Vert A-W_k^TW_k\right\Vert}_F $$,     (3.17)

$$ e_k^{\mathrm{Method}\ 2}={\left\Vert A-W_kW_k^T\right\Vert}_F $$,     (3.18)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}1}={\left\Vert A-W_k^2\right\Vert}_F $$,     (3.19)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}2}={\left\Vert \Phi\Lambda^{\frac{1}{2}}\Phi^T-W_k\right\Vert}_F $$.     (3.20)

For each algorithm, I generated Ak from xk by (2.5) with β=1. In (3.20), I computed the error between the unique positive definite solution of A½ and its estimate Wk. Figure 3-4 shows the convergence plots for all three methods by plotting the four errors ek against k. The final values of ek after 500 samples were e500 = 0.451 for Methods 1 and 2, e500 = 0.720 for Method 3-1 (3.19), and e500 = 0.250 for Method 3-2 (3.20).
Figure 3-4

Convergence of the A½ algorithms (3.5), (3.8), and (3.9)

It is clear from Figure 3-4 that the errors are all close to zero. The small residual errors are due to random fluctuations in the elements of Wk caused by the varying input data.
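The error curves in Figure 3-4 can be reproduced by logging the quantities (3.17) through (3.20) at every update inside the loops of Sections 3.2, 3.3, and 3.4. A minimal helper is sketched below (illustrative function name; A is the batch correlation matrix and W1, W2, W3 are the current estimates).

import numpy as np

def sqrt_errors(A, W1, W2, W3):
    """Frobenius errors (3.17)-(3.20) for the three square root methods."""
    lam, Phi = np.linalg.eigh(A)
    A_half = Phi @ np.diag(np.sqrt(lam)) @ Phi.T    # unique symmetric PD A^(1/2)
    e1  = np.linalg.norm(A - W1.T @ W1, 'fro')      # (3.17), Method 1
    e2  = np.linalg.norm(A - W2 @ W2.T, 'fro')      # (3.18), Method 2
    e31 = np.linalg.norm(A - W3 @ W3, 'fro')        # (3.19), Method 3-1
    e32 = np.linalg.norm(A_half - W3, 'fro')        # (3.20), Method 3-2
    return e1, e2, e31, e32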

Experiments for Adaptive Inverse Square Root Algorithms

I used adaptive algorithms (3.12), (3.15), and (3.16) for Methods 1, 2, and 3, respectively, to compute the inverse square root of A. I started the algorithms with W0=I (the 10×10 identity matrix). At the kth update of each algorithm, I computed the Frobenius norm of the error appropriate for each method, comparing the identity matrix with the corresponding product of Wk and A. I denote these errors by ek as shown:

$$ e_k^{\mathrm{Method}\ 1}={\left\Vert I-W_k^TAW_k\right\Vert}_F $$,     (3.21)

$$ e_k^{\mathrm{Method}\ 2}={\left\Vert I-W_kAW_k^T\right\Vert}_F $$,     (3.22)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}1}={\left\Vert I-W_kAW_k\right\Vert}_F $$,     (3.23)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}2}={\left\Vert \Phi\Lambda^{-\frac{1}{2}}\Phi^T-W_k\right\Vert}_F $$.     (3.24)

In (3.24) I computed the error between the unique positive definite solution of A–½ and its estimate Wk. Figure 3-5 shows the convergence plot for all three methods by plotting the four errors ek against k. The final values of ek after 500 samples were e500 = 0.419 for Methods 1 and 2, e500 = 0.630 for Method 3-1 (3.23), and e500 = 0.823 for Method 3-2 (3.24).
Figure 3-5

Convergence of the A–½ algorithms (3.12), (3.15), and (3.16)

It is clear from Figure 3-5 that the errors are all close to zero. As before, running more training epochs further improves the estimation accuracy.

3.9 Concluding Remarks

I presented six adaptive algorithms for the computation of the square root and inverse square root of the correlation matrix of a random vector sequence. In four cases, I presented an objective function and, in all cases, I discussed the convergence properties of the algorithms. Note that although I applied the gradient descent technique on these objective functions, I could have applied any other technique of nonlinear optimization such as steepest descent, conjugate direction, Newton-Raphson, or recursive least squares. The availability of the objective functions allows us to derive new algorithms by using new optimization techniques on them, and also to perform convergence analyses of the adaptive algorithms.
