© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Chatterjee, Adaptive Machine Learning Algorithms with Python, https://doi.org/10.1007/978-1-4842-8017-1_3

3. Square Root and Inverse Square Root

Chanchal Chatterjee, San Jose, CA, USA
 

3.1 Introduction and Use Cases

Adaptive computation of the square root and inverse square root of the real-time correlation matrix of a streaming sequence {xk∈ℜn} has numerous applications in machine learning, data analysis, and image/signal processing. They include data whitening, classifier design, and data normalization [Foley and Sammon 75; Fukunaga 90].

Data whitening is a process of decorrelating the data such that all components have unit variance. It is a data preprocessing step in machine learning and data analysis to “normalize” the data so that it is easier to model. Prominent applications are
  • To transform correlated noise in a signal to independent and identically distributed (iid) noise, which is easier to classify

  • Generalized eigenvector computation [Chatterjee et al. Mar 97] (see Chapter 7)

  • Linear discriminant analysis computation [linear discriminant analysis, Wikipedia]

  • Gaussian classifier design and computation of distance measures [Chatterjee et al. May 97]

Figure 3-1 shows the correlation matrices of the original and whitened data. The original data is highly correlated, as shown by the colors spread across the whole matrix. The whitened data is fully uncorrelated: only the diagonal entries of its correlation matrix are nonzero.
Figure 3-1

Original correlated data on the left and the uncorrelated “whitened” data on the right

Figure 3-2 shows a handwritten number 0 obtained from the Keras MNIST dataset [Keras, MNIST]. The correlation matrix on the right shows that the pixel values are highly correlated across the entire image.
Figure 3-2

Handwritten MNIST number 0 and the correlation matrix of all characters

The following Python code whitens the data samples X[nDim,nSamples]:
import numpy as np
from scipy.linalg import eigh

nDim, nSamples = X.shape
corX = (X @ X.T) / nSamples              # sample correlation matrix of X
eigvals, eigvecs = eigh(corX)            # eigh returns eigenvalues in ascending order
V = np.fliplr(eigvecs)                   # eigenvectors in descending eigenvalue order
D = np.zeros((nDim, nDim))
for i in range(nDim):
    if eigvals[::-1][i] < 10:            # discard components with small eigenvalues
        D[i, i] = 0
    else:
        D[i, i] = 1.0 / np.sqrt(eigvals[::-1][i])
Z = V @ D @ V.T @ X                      # whitened data
Next, let's look at the transformed data and its new correlation matrix. Figure 3-3 shows that the distinguishing features of the character are accentuated and that the correlation matrix is diagonal rather than spread across all pixels, confirming that the data has been whitened to an (approximately) identity correlation matrix.
Figure 3-3

Handwritten MNIST number 0 after whitening. The correlation matrix of all characters is diagonal

We define the data correlation matrix A as
$$ A_k=\frac{1}{k}\sum\limits_{i=1}^k\mathbf{x}_i\mathbf{x}_i^T=A_{k-1}+\frac{1}{k}\left(\mathbf{x}_k\mathbf{x}_k^T-A_{k-1}\right). $$
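As a quick sanity check of this recursion, the short sketch below (with illustrative dimensions and synthetic data) confirms that the rank-one update reproduces the batch average of the xixiT terms.

import numpy as np

rng = np.random.default_rng(0)
n, nSamples = 5, 200
X = rng.standard_normal((n, nSamples))    # columns are the samples x_k

A = np.zeros((n, n))
for k in range(nSamples):
    x = X[:, k].reshape(n, 1)
    # A_k = A_{k-1} + (1/k)(x_k x_k^T - A_{k-1}), with k starting at 1
    A = A + (1.0/(k + 1)) * (x @ x.T - A)

A_batch = (X @ X.T) / nSamples            # (1/k) sum_i x_i x_i^T
print(np.allclose(A, A_batch))            # True, up to floating-point error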

The square root of A is denoted by A½; the Cholesky factor [Cholesky decomposition, Wikipedia] is one well-known example of such a factorization. Similarly, A–½ denotes the inverse square root of A. However, as explained below, neither of these matrix functions has a unique solution, and different adaptive algorithms converge to different solutions. Under some restrictions, however, the solutions become unique; in particular, there is a unique symmetric positive definite solution for each.

In this chapter, I present three adaptive algorithms for each matrix function A½ and A–½. Out of the three, one adaptive algorithm leads to the symmetric positive definite solution and two lead to more general solutions.

Various Solutions for A½ and A–½

Let A=ΦΛΦT be the eigen-decomposition of the real, symmetric, positive definite n×n matrix A, where Φ and Λ are respectively the eigenvector and eigenvalue matrices of A. Here Λ=diag(λ1,…,λn) is the diagonal eigenvalue matrix with λ1≥…≥λn>0, and Φ∈ℜn×n is orthonormal. A solution for A½ is ΦD, where D=diag(±λ1½,…,±λn½). However, in general this is not a symmetric solution, and for any orthonormal matrix U, ΦDU is also a solution. We can show that A½ is symmetric if, and only if, it is of the form ΦDΦT, and there are 2^n symmetric solutions for A½. When D is positive definite, we obtain the unique symmetric positive definite solution for A½ as ΦΛ½ΦT, where Λ½=diag(λ1½,…,λn½). Similarly, a general solution for the inverse square root A–½ of A is ΦD–1U, where D is as defined above and U is any orthonormal matrix. The unique symmetric positive definite solution for A–½ is ΦΛ–½ΦT, where Λ–½=diag(λ1–½,…,λn–½).
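To make this concrete, the sketch below (an illustrative example on a random symmetric positive definite matrix) builds the unique symmetric positive definite A½ and A–½ from the eigen-decomposition and also shows that ΦDU, with a random orthonormal U, is a valid but non-symmetric square root in the sense that (ΦDU)(ΦDU)T = A.

import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                     # a symmetric positive definite test matrix

lam, Phi = np.linalg.eigh(A)                    # A = Phi diag(lam) Phi^T

# Unique symmetric positive definite square root and inverse square root
A_half     = Phi @ np.diag(np.sqrt(lam)) @ Phi.T
A_inv_half = Phi @ np.diag(1.0/np.sqrt(lam)) @ Phi.T
print(np.allclose(A_half @ A_half, A))                        # True
print(np.allclose(A_inv_half @ A @ A_inv_half, np.eye(n)))    # True

# A non-symmetric square root: Phi D U with any orthonormal U
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
M = Phi @ np.diag(np.sqrt(lam)) @ U
print(np.allclose(M @ M.T, A))                  # True, although M is generally not symmetric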

Outline of This Chapter

In Sections 3.2, 3.3, and 3.4, I discuss three algorithms for the adaptive computation of A½. Section 3.4 discusses the unique symmetric positive definite solution for A½. In Sections 3.5, 3.6, and 3.7, I discuss three algorithms for the adaptive computation of A–½. Section 3.7 describes the unique symmetric positive definite solution for A–½. Section 3.8 presents experimental results for the six algorithms with 10-dimensional Gaussian data. Section 3.9 concludes the chapter.

3.2 Adaptive Square Root Algorithm: Method 1

Let {xk∈ℜn} be a sequence of data vectors whose online data correlation matrix Ak∈ℜn×n is given by

$$ A_k=\frac{1}{k}\sum\limits_{i=1}^k\beta^{k-i}\mathbf{x}_i\mathbf{x}_i^T $$.     (3.1)

Here xk is an observation vector at time k and 0<β≤1 is a forgetting factor used for non-stationary sequences. If the data is stationary, the asymptotic correlation matrix A is

$$ A=\lim_{k\to\infty}E\left[A_k\right] $$.     (3.2)

Objective Function

Following the methodology described in Section 1.4, I present the algorithm by first showing an objective function J, whose minimum with respect to the matrix W gives us the square root of the asymptotic data correlation matrix A. The objective function is

$$ J(W)={\left\Vert A-W^TW\right\Vert}_F^2 $$.     (3.3)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4W\left(A-W^TW\right) $$.     (3.4)
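As a sanity check on (3.4), the sketch below (small random matrices, illustrative names) compares the analytical gradient −4W(A − WTW) with a central finite-difference approximation of J.

import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # symmetric positive definite A
W = rng.standard_normal((n, n))

def J(W):
    return np.linalg.norm(A - W.T @ W, 'fro')**2

grad = -4.0 * W @ (A - W.T @ W)      # analytical gradient (3.4)

# Central finite differences, entry by entry
num = np.zeros_like(W)
eps = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.max(np.abs(grad - num)))    # small; only finite-difference error remains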

Adaptive Algorithm

From the gradient in (3.4), we obtain the following adaptive gradient descent algorithm:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)\nabla_WJ\left(W_k;A_k\right)=W_k+\eta_k\left(W_kA_k-W_kW_k^TW_k\right) $$,     (3.5)

where ηk is a small decreasing constant that satisfies assumption A1.2 in the Proofs of Convergence document on GitHub [Chatterjee GitHub].

The following Python code implements this algorithm with data X[nDim,nSamples]:
import numpy as np

nDim, nSamples = X.shape
nEpochs = 1                        # number of passes over the data
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W1 = np.eye(nDim)                  # square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        # Correlation matrix update: A_k = A_{k-1} + (1/k)(x_k x_k^T - A_{k-1})
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(50 + cnt)     # decreasing step size eta_k
        # Algorithm (3.5): W_{k+1} = W_k + eta_k (W_k A_k - W_k W_k^T W_k)
        W1 = W1 + etat1 * (W1 @ A - W1 @ W1.T @ W1)
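After the loop finishes, it is easy to check how closely W1 approximates a square root of the current correlation estimate. The lines below are a minimal follow-up to the code above (they reuse A and W1 from it) and use the same Frobenius error that Section 3.8 reports for Method 1.

# Frobenius error for Method 1: small if the step sizes and number of epochs are adequate
err1 = np.linalg.norm(A - W1.T @ W1, 'fro')
print('Method 1 error ||A - W1^T W1||_F =', err1)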

3.3 Adaptive Square Root Algorithm: Method 2

Objective Function

The objective function J(W), whose minimum with respect to W gives us the square root of A, is

$$ J(W)={\left\Vert A-WW^T\right\Vert}_F^2 $$.     (3.6)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4\left(A-WW^T\right)W $$.     (3.7)

Adaptive Algorithm

We obtain the following adaptive gradient descent algorithm for the square root of A:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)\nabla_WJ\left(W_k;A_k\right)=W_k+\eta_k\left(A_kW_k-W_kW_k^TW_k\right) $$,     (3.8)

where ηk is a small decreasing constant.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W2 = np.eye(nDim)                  # square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(50 + cnt)     # decreasing step size eta_k
        # Algorithm (3.8): W_{k+1} = W_k + eta_k (A_k W_k - W_k W_k^T W_k)
        W2 = W2 + etat1 * (A @ W2 - W2 @ W2.T @ W2)

3.4 Adaptive Square Root Algorithm: Method 3

Adaptive Algorithm

Following the adaptive algorithms (3.5) and (3.8), I now present an algorithm for the computation of a symmetric positive definite square root of A:

$$ W_{k+1}=W_k+\eta_k\left(A_k-W_k^2\right) $$,     (3.9)

where ηk is a small decreasing constant and Wk is symmetric.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W3 = np.eye(nDim)                  # symmetric square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat2 = 1.0/(50 + cnt)     # decreasing step size eta_k
        # Algorithm (3.9): W_{k+1} = W_k + eta_k (A_k - W_k^2)
        W3 = W3 + etat2 * (A - W3 @ W3)
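Note that (3.9) preserves symmetry: if Wk is symmetric, then A − Wk² is also symmetric, so Wk+1 remains symmetric. The short check below reuses A and W3 from the block above to verify both the symmetry and the square root property.

# W3 should remain (numerically) symmetric and satisfy W3 @ W3 ~ A
print('symmetry error  ||W3 - W3^T||_F :', np.linalg.norm(W3 - W3.T, 'fro'))
print('square root error ||A - W3^2||_F:', np.linalg.norm(A - W3 @ W3, 'fro'))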

3.5 Adaptive Inverse Square Root Algorithm: Method 1

Objective Function

The objective function J(W), whose minimizer W* gives us the inverse square root of A, is

$$ J(W)={\left\Vert I-W^TAW\right\Vert}_F^2 $$.     (3.10)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4AW\left(I-W^TAW\right) $$.     (3.11)

Adaptive Algorithm

From the gradient in (3.11), we obtain the following adaptive gradient descent algorithm:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)A_k^{-1}\nabla_WJ\left(W_k;A_k\right)=W_k+\eta_k\left(W_k-W_kW_k^TA_kW_k\right) $$     (3.12)

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W1 = np.eye(nDim)                  # inverse square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(100 + cnt)    # decreasing step size eta_k
        # Algorithm (3.12): W_{k+1} = W_k + eta_k (W_k - W_k W_k^T A_k W_k)
        W1 = W1 + etat1 * (W1 - W1 @ W1.T @ A @ W1)
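As a quick follow-up (reusing A and W1 from the block above): at convergence, W1^T A W1 should be close to the identity, which also means W1 W1^T approximates the inverse of A.

nDim = A.shape[0]
print('||I - W1^T A W1||_F  =', np.linalg.norm(np.eye(nDim) - W1.T @ A @ W1, 'fro'))
print('||A^-1 - W1 W1^T||_F =', np.linalg.norm(np.linalg.inv(A) - W1 @ W1.T, 'fro'))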

3.6 Adaptive Inverse Square Root Algorithm: Method 2

Objective Function

The objective function J(W), whose minimum with respect to W gives us the inverse square root of A, is

$$ J(W)={\left\Vert I-WAW^T\right\Vert}_F^2 $$.     (3.13)

The gradient of J(W) with respect to W is

$$ \nabla_WJ(W)=-4\left(I-WAW^T\right)WA $$.     (3.14)

Adaptive Algorithm

We obtain the following adaptive algorithm for the inverse square root of A:

$$ W_{k+1}=W_k-\eta_k\left(1/4\right)\nabla_WJ\left(W_k;A_k\right)A_k^{-1}=W_k+\eta_k\left(W_k-W_kA_kW_k^TW_k\right) $$,     (3.15)

where ηk is a small decreasing constant.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W2 = np.eye(nDim)                  # inverse square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat1 = 1.0/(100 + cnt)    # decreasing step size eta_k
        # Algorithm (3.15): W_{k+1} = W_k + eta_k (W_k - W_k A_k W_k^T W_k)
        W2 = W2 + etat1 * (W2 - W2 @ A @ W2.T @ W2)

3.7 Adaptive Inverse Square Root Algorithm: Method 3

Adaptive Algorithm

By extending the adaptive algorithms (3.12) and (3.15), I now present an adaptive algorithm for the computation of a symmetric positive definite inverse square root of A:

$$ W_{k+1}=W_k+\eta_k\left(I-W_kA_kW_k\right) $$,     (3.16)

where ηk is a small decreasing constant and Wk is symmetric.

The following Python code implements this algorithm with data X[nDim,nSamples]:
nDim, nSamples = X.shape
nEpochs = 1
I  = np.eye(nDim)                  # identity matrix
A  = np.zeros((nDim, nDim))        # adaptive estimate of the correlation matrix A_k
W3 = np.eye(nDim)                  # symmetric inverse square root estimate, initialized to the identity
for epoch in range(nEpochs):
    for it in range(nSamples):
        cnt = nSamples*epoch + it
        x = X[:, it].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*(x @ x.T - A)
        etat2 = 1.0/(100 + cnt)    # decreasing step size eta_k
        # Algorithm (3.16): W_{k+1} = W_k + eta_k (I - W_k A_k W_k)
        W3 = W3 + etat2 * (I - W3 @ A @ W3)
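Because I − Wk A Wk is symmetric whenever Wk is, the iterate W3 stays symmetric and should approach the unique symmetric positive definite solution ΦΛ–½ΦT. The sketch below reuses A and W3 from the block above and compares the adaptive estimate with that eigen-decomposition-based solution (assuming A is positive definite, which holds once enough samples have been processed).

# Unique symmetric positive definite inverse square root of A via eigen-decomposition
lam, Phi = np.linalg.eigh(A)
A_inv_half = Phi @ np.diag(1.0/np.sqrt(lam)) @ Phi.T
print('||A^-1/2 - W3||_F =', np.linalg.norm(A_inv_half - W3, 'fro'))
print('||I - W3 A W3||_F =', np.linalg.norm(np.eye(A.shape[0]) - W3 @ A @ W3, 'fro'))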

3.8 Experimental Results

I generated 500 samples {xk} of 10-dimensional (i.e., n=10) Gaussian data with mean zero. The covariance matrix is the first covariance matrix in [Okada and Tomita 85] multiplied by 3.

The eigenvalues of the covariance matrix are

[17.699, 8.347, 5.126, 3.088, 1.181, 0.882, 0.261, 0.213, 0.182, 0.151].
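For readers who want to reproduce this setup, the sketch below generates zero-mean Gaussian samples with a prescribed covariance; covMat is a placeholder for the 10×10 covariance described above (the scaled matrix from [Okada and Tomita 85]), which is not reproduced here.

import numpy as np

def generate_gaussian_data(covMat, nSamples, seed=0):
    """Draw zero-mean Gaussian samples with covariance covMat.
    Returns X of shape [nDim, nSamples] with samples in the columns."""
    rng = np.random.default_rng(seed)
    nDim = covMat.shape[0]
    return rng.multivariate_normal(np.zeros(nDim), covMat, size=nSamples).T

# covMat = 3 * (first covariance matrix in [Okada and Tomita 85]), a 10x10 array
# X = generate_gaussian_data(covMat, nSamples=500)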

Experiments for Adaptive Square Root Algorithms

I used adaptive algorithms (3.5), (3.8), and (3.9) for Methods 1, 2, and 3, respectively, to compute the square root of A, where A is the correlation matrix computed from all collected samples as
$$ A=\frac{1}{500}\sum\limits_{i=1}^{500}\mathbf{x}_i\mathbf{x}_i^T. $$

I started the algorithms with W0=I (the 10×10 identity matrix). At the kth update of each algorithm, I computed the Frobenius norm [Frobenius norm, Wikipedia] of the error between the actual correlation matrix A and the appropriate square of Wk for each method. I denote this error by ek as follows:

$$ e_k^{\mathrm{Method}\ 1}={\left\Vert A-W_k^TW_k\right\Vert}_F $$,     (3.17)

$$ e_k^{\mathrm{Method}\ 2}={\left\Vert A-W_kW_k^T\right\Vert}_F $$,     (3.18)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}1}={\left\Vert A-W_k^2\right\Vert}_F $$,     (3.19)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}2}={\left\Vert \Phi\Lambda^{\frac{1}{2}}\Phi^T-W_k\right\Vert}_F $$.     (3.20)

For each algorithm, I generated Ak from xk by (2.5) with β=1. In (3.20), I computed the error between the unique positive definite solution of A½ and its estimate Wk. Figure 3-4 shows the convergence plots for all three methods by plotting the four errors ek against k. The final values of ek after 500 samples were e500 = 0.451 for Methods 1 and 2, e500 = 0.720 for Method 3-1 (3.19), and e500 = 0.250 for Method 3-2 (3.20).
Figure 3-4

Convergence of the A½ algorithms (3.5), (3.8), and (3.9)

It is clear from Figure 3-4 that the errors are all close to zero. The small residual errors are due to random fluctuations in the elements of Wk caused by the varying input data.
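The error curves in Figure 3-4 can be reproduced by logging the quantities (3.17) through (3.20) at every update inside the loops of Sections 3.2, 3.3, and 3.4. A minimal helper is sketched below (illustrative function name; A is the batch correlation matrix and W1, W2, W3 are the current estimates).

import numpy as np

def sqrt_errors(A, W1, W2, W3):
    """Frobenius errors (3.17)-(3.20) for the three square root methods."""
    lam, Phi = np.linalg.eigh(A)
    A_half = Phi @ np.diag(np.sqrt(lam)) @ Phi.T    # unique symmetric PD A^(1/2)
    e1  = np.linalg.norm(A - W1.T @ W1, 'fro')      # (3.17), Method 1
    e2  = np.linalg.norm(A - W2 @ W2.T, 'fro')      # (3.18), Method 2
    e31 = np.linalg.norm(A - W3 @ W3, 'fro')        # (3.19), Method 3-1
    e32 = np.linalg.norm(A_half - W3, 'fro')        # (3.20), Method 3-2
    return e1, e2, e31, e32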

Experiments for Adaptive Inverse Square Root Algorithms

I used adaptive algorithms (3.12), (3.15), and (3.16) for Methods 1, 2, and 3, respectively, to compute the inverse square root of A. I started the algorithms with W0=I (the 10×10 identity matrix). At the kth update of each algorithm, I computed the Frobenius norm of the error appropriate for each method, comparing the identity matrix with the corresponding product of Wk and A. I denote these errors by ek as shown:

$$ e_k^{\mathrm{Method}\ 1}={\left\Vert I-W_k^TAW_k\right\Vert}_F $$,     (3.21)

$$ e_k^{\mathrm{Method}\ 2}={\left\Vert I-W_kAW_k^T\right\Vert}_F $$,     (3.22)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}1}={\left\Vert I-W_kAW_k\right\Vert}_F $$,     (3.23)

$$ e_k^{\mathrm{Method}\ 3\hbox{-}2}={\left\Vert \Phi\Lambda^{-\frac{1}{2}}\Phi^T-W_k\right\Vert}_F $$.     (3.24)

In (3.24) I computed the error between the unique positive definite solution of A–½ and its estimate Wk. Figure 3-5 shows the convergence plot for all three methods by plotting the four errors ek against k. The final values of ek after 500 samples were e500 = 0.419 for Methods 1 and 2, e500 = 0.630 for Method 3-1 (3.23), and e500 = 0.823 for Method 3-2 (3.24).
Figure 3-5

Convergence of the A–½ algorithms (3.12), (3.15), and (3.16)

It is clear from Figure 3-5 that the errors are all close to zero. As before, running more training epochs further improves the estimation accuracy.

3.9 Concluding Remarks

I presented six adaptive algorithms for the computation of the square root and inverse square root of the correlation matrix of a random vector sequence. In four cases, I presented an objective function and, in all cases, I discussed the convergence properties of the algorithms. Note that although I applied the gradient descent technique on these objective functions, I could have applied any other technique of nonlinear optimization such as steepest descent, conjugate direction, Newton-Raphson, or recursive least squares. The availability of the objective functions allows us to derive new algorithms by using new optimization techniques on them, and also to perform convergence analyses of the adaptive algorithms.
