© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Chatterjee, Adaptive Machine Learning Algorithms with Python, https://doi.org/10.1007/978-1-4842-8017-1_2

2. General Theories and Notations

Chanchal Chatterjee
San Jose, CA, USA

2.1 Introduction

In this chapter, I present algorithms for the adaptive solutions of matrix algebra problems from a sequence of matrices. The streams or sequences can be random matrices {Ak} or {Bk}, or the correlation matrices of random vector sequences {xk} or {yk}. Examples of matrix algebra are matrix inversion, square root, inverse square root, eigenvectors, generalized eigenvectors, singular vectors, and generalized singular vectors.

This chapter additionally covers the basic terminologies and methods used throughout the remaining chapters. I also present well-known adaptive algorithms to compute the mean, median, covariance, inverse covariance, and correlation matrices from random matrix or vector sequences. Furthermore, I present a new algorithm to compute the normalized mean of a random vector sequence.

For the sake of simplicity, let’s assume that the multidimensional data {xk ∈ ℜn} arrives as a sequence. From this data sequence, we can derive a matrix sequence {$$ {A}_k={\mathbf{x}}_k{\mathbf{x}}_k^T $$}. We define the data correlation matrix A as follows:

$$ A=\underset{k\to \infty }{\lim }E\left[{\mathbf{x}}_k{\mathbf{x}}_k^T\right] $$.     (2.1)
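For a finite data set, this limit is commonly approximated by the sample correlation matrix. The following is a minimal sketch of that approximation, assuming the samples are stored as columns of a NumPy array X[nDim,nSamples]:
import numpy as np

# Batch (sample) estimate of the correlation matrix A in Eq. (2.1)
# X has shape (nDim, nSamples); each column is one sample x_k
A_hat = (X @ X.T) / X.shape[1]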

2.2 Stationary and Non-Stationary Sequences

In practical implementations, we face two types of sequences: stationary and non-stationary. A sequence {xk} is considered asymptotically (weak) stationary if $$ {\lim}_{k\to \infty }E\left[{\mathbf{x}}_k{\mathbf{x}}_k^T\right] $$ is a constant. For a non-stationary sequence {xk}, $$ E\left[{\mathbf{x}}_k{\mathbf{x}}_{k+m}^T\right] $$ remains a function of both k and m, and $$ E\left[{\mathbf{x}}_k{\mathbf{x}}_k^T\right] $$ is a function of k. Examples of non-stationary data are given in the publicly available real-world datasets to evaluate stream learning algorithms [Vinicius Souza et al. 20]. Figure 2-1 shows examples of stationary and non-stationary data.
Figure 2-1

Examples of stationary and non-stationary data

Non-stationarity in data can be detected by well-known techniques such as those described in [Shay Palachy 19].
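For example, one widely used check (not specific to the adaptive algorithms in this book) is the augmented Dickey-Fuller test from the statsmodels package. The following is a minimal sketch, assuming series is a one-dimensional array of observations:
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test: small p-values suggest the series is stationary
stat, p_value = adfuller(series)[:2]
print('ADF statistic = %.3f, p-value = %.3f' % (stat, p_value))
if p_value < 0.05:
    print('Likely stationary')
else:
    print('Likely non-stationary')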

2.3 Use Cases for Adaptive Mean, Median, and Covariances

Adaptive mean computation is important in real-world applications even though it is one of the simplest algorithms.

Handwritten Character Recognition

Consider the problem of detecting the handwritten number 8. Figure 2-2 shows six instances of the number 8 from the Keras MNIST dataset [Keras, MNIST].
Figure 2-2

Examples of variations in handwritten number 8

One algorithm to recognize the number 8 with all its variations is to find the mean set of pixels that represent it. Due to random variations in handwriting, it is difficult to compile all variations of each character ahead of time. It is important to design an adaptive learning scheme whereby instantaneous variations in these characters are immediately represented in the machine learning algorithm.

The adaptive mean detection algorithm is helpful in finding the common patterns that represent the number 8. Figure 2-3 shows a template for the number 8 obtained by the adaptive mean algorithm given in Eq. (2.2); a minimal sketch of this computation follows the figure.
Figure 2-3

Template for number 8. 2D representation on the left and 3D on the right
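The following is a minimal sketch of how such a template can be built, assuming the Keras MNIST loader is available and using the adaptive mean update of Eq. (2.2):
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()
eights = x_train[y_train == 8].astype(np.float64) / 255.0   # all images of the digit 8

# Adaptive mean template via Eq. (2.2)
template = np.zeros_like(eights[0])
for k, img in enumerate(eights, start=1):
    template = template + (1.0 / k) * (img - template)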

Anomaly Detection of Streaming Data

A simple yet powerful way to detect anomalies in data is to calculate the median and compare each incoming value against it. Figure 2-4 shows streaming data [Yahoo Research Webscope S5 Data] containing occasional anomalous values. We adaptively computed the median with algorithm (2.20) and compared it against the data to detect anomalies; a minimal sketch of this thresholding follows the figure. In the figure, the data samples are in blue, the adaptive median is in green, and the anomalies are in red.
Figure 2-4

Anomalies detected with an adaptive median algorithm on time series data
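The following is a minimal sketch of this thresholding, assuming a one-dimensional stream data and an illustrative fixed threshold; the adaptive median update is the one given later in Eq. (2.20):
import numpy as np

md, threshold = 0.0, 5.0      # illustrative starting value and anomaly threshold
anomalies = []
for k, x in enumerate(data):
    md = md + (3.0 / (1 + k)) * np.sign(x - md)   # adaptive median, Eq. (2.20)
    if abs(x - md) > threshold:
        anomalies.append(k)                       # index of a suspected anomaly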

More on this topic is discussed in Chapter 8.

2.4 Adaptive Mean and Covariance of Non-Stationary Sequences

In the stationary case, given a sequence {xk∈ℜn}, we can compute the adaptive mean mk as follows:

$$ {\mathbf{m}}_k=\frac{1}{k}\sum \limits_{i=1}^k{\mathbf{x}}_i={\mathbf{m}}_{k-1}+\frac{1}{k}\left({\mathbf{x}}_k-{\mathbf{m}}_{k-1}\right) $$.     (2.2)

Similarly, the adaptive correlation Ak is

$$ {A}_k=\frac{1}{k}\sum \limits_{i=1}^k{\mathbf{x}}_i{\mathbf{x}}_i^T={A}_{k-1}+\frac{1}{k}\left({\mathbf{x}}_k{\mathbf{x}}_k^T-{A}_{k-1}\right) $$.     (2.3)

Here we use all samples up to time instant k.
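The following is a minimal sketch of these stationary-case recursions (2.2) and (2.3), using the same data layout X[nDim,nSamples] as the code later in this chapter:
import numpy as np

m = np.zeros((nDim, 1))      # adaptive mean, Eq. (2.2)
A = np.zeros((nDim, nDim))   # adaptive correlation, Eq. (2.3)
for k in range(nSamples):
    x = X[:, k].reshape(nDim, 1)
    m = m + (1.0 / (k + 1)) * (x - m)
    A = A + (1.0 / (k + 1)) * (x @ x.T - A)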

If, however, the data is non-stationary, we use a forgetting factor 0<β≤1 to implement an effective window of size 1/(1–β) as

$$ {\mathbf{m}}_k=\frac{1}{k}\sum \limits_{i=1}^k{\beta}^{k-i}{\mathbf{x}}_i=\beta {\mathbf{m}}_{k-1}+\frac{1}{k}\left({\mathbf{x}}_k-\beta {\mathbf{m}}_{k-1}\right) $$     (2.4)

and

$$ {A}_k=\frac{1}{k}\sum \limits_{i=1}^k{\beta}^{k-i}{\mathbf{x}}_i{\mathbf{x}}_i^T=\beta {A}_{k-1}+\frac{1}{k}\left({\mathbf{x}}_k{\mathbf{x}}_k^T-\beta {A}_{k-1}\right) $$.     (2.5)

This forgetting factor downweights past samples with an exponentially fading window relative to recent ones, which gives the adaptive algorithm its tracking capability. The exact value of β depends on the specific application. Generally speaking, for slowly time-varying {xk}, β is chosen close to 1 to implement a large effective window, whereas for rapidly time-varying {xk}, β is chosen closer to zero for a small effective window [Benveniste et al. 90].

The following is Python code to adaptively compute a mean vector and correlation matrix with data X[nDim,nSamples]:
import numpy as np

# Illustrative initializations; beta is the forgetting factor (0 < beta <= 1)
m = np.zeros((nDim, 1))     # adaptive mean estimate
A = np.zeros((nDim, nDim))  # adaptive correlation estimate
beta = 0.998                # example value; application-dependent

for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        # Eq.2.4
        m = beta * m + (1.0/(1 + cnt)) * (x - beta * m)
        # Eq.2.5
        A = beta * A + (1.0/(1 + cnt)) * (x @ x.T - beta * A)

2.5 Adaptive Covariance and Inverses

Given the mean and correlations discussed before, the adaptive covariance matrix Bk can also be computed as follows:

$$ {B}_k=\frac{1}{k}\sum \limits_{i=1}^k{\beta}^{k-i}\left({\mathbf{x}}_i-{\mathbf{m}}_i\right){\left({\mathbf{x}}_i-{\mathbf{m}}_i\right)}^T=\beta {B}_{k-1}+\frac{1}{k}\left(\left({\mathbf{x}}_k-{\mathbf{m}}_k\right){\left({\mathbf{x}}_k-{\mathbf{m}}_k\right)}^T-\beta {B}_{k-1}\right) $$     (2.6)

From the adaptive correlation matrix Ak in (2.5), the inverse correlation matrix $$ {A}_k^{-1} $$ can be obtained adaptively by the Sherman-Morrison formula [Sherman–Morrison, Wikipedia] as

$$ {A}_k^{-1}=\frac{k}{\beta \left(k-1\right)}\left({A}_{k-1}^{-1}-\frac{{A}_{k-1}^{-1}{\mathbf{x}}_k{\mathbf{x}}_k^T{A}_{k-1}^{-1}}{\beta \left(k-1\right)+{\mathbf{x}}_k^T{A}_{k-1}^{-1}{\mathbf{x}}_k}\right) $$.     (2.7)

Similarly, the inverse covariance matrix $$ {B}_k^{-1} $$ can be obtained adaptively as

$$ {B}_k^{-1}=\frac{k}{\beta \left(k-1\right)}\left({B}_{k-1}^{-1}-\frac{{B}_{k-1}^{-1}\left({\mathbf{x}}_k-{\mathbf{m}}_k\right){\left({\mathbf{x}}_k-{\mathbf{m}}_k\right)}^T{B}_{k-1}^{-1}}{\beta \left(k-1\right)+{\left({\mathbf{x}}_k-{\mathbf{m}}_k\right)}^T{B}_{k-1}^{-1}\left({\mathbf{x}}_k-{\mathbf{m}}_k\right)}\right) $$.     (2.8)

The following is Python code to adaptively compute inverse correlation and inverse covariance matrices with data X[nDim,nSamples]; here Y[nDim,nSamples] denotes the mean-centered samples (xk − mk) used in the covariance update of Eq. (2.8):
# Illustrative initializations; the true inverses are unknown at start-up
AW = np.eye(nDim)   # estimate of the inverse correlation matrix
BW = np.eye(nDim)   # estimate of the inverse covariance matrix

for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        y = Y[:,iter]            # mean-centered sample (x_k - m_k)
        y = y.reshape(nDim,1)
        k = cnt + 2
        # Eq.2.7
        AW = (k/(beta*(k-1))) * (AW - (AW @ x @ x.T @ AW)
                                 / (beta*(k-1) + x.T @ AW @ x))
        # Eq.2.8
        BW = (k/(beta*(k-1))) * (BW - (BW @ y @ y.T @ BW)
                                 / (beta*(k-1) + y.T @ BW @ y))
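As a quick sanity check (not part of the adaptive algorithms themselves), the estimates can be compared against batch inverses computed from all the samples; this sketch assumes X holds the raw samples, Y the mean-centered samples, and weakly stationary data with β close to 1:
# Batch inverses for comparison
A_batch = (X @ X.T) / nSamples                 # sample correlation matrix
B_batch = (Y @ Y.T) / nSamples                 # sample covariance matrix
print(np.linalg.norm(AW - np.linalg.inv(A_batch), 'fro'))
print(np.linalg.norm(BW - np.linalg.inv(B_batch), 'fro'))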

2.6 Adaptive Normalized Mean Algorithm

The most obvious choice for an adaptive normalized mean algorithm is to use (2.2) and normalize each mk. However, a more efficient algorithm can be obtained from the following cost function whose minimizer w* is the asymptotic normalized mean m/‖m‖, where $$ \mathbf{m}={\lim}_{k\to \infty }E\left[{\mathbf{x}}_k\right] $$:

$$ J\left({\mathbf{w}}_k;{\mathbf{x}}_k\right)={\left\Vert {\mathbf{x}}_k-{\mathbf{w}}_k\right\Vert}^2+\alpha \left({\mathbf{w}}_k^T{\mathbf{w}}_k-1\right) $$,     (2.9)

where α is a Lagrange multiplier that enforces the constraint that the mean is normalized. The gradient of J(wk;xk) with respect to wk is

$$ \left(1/2\right){\nabla}_{{\mathbf{w}}_k}J\left({\mathbf{w}}_k;{\mathbf{x}}_k\right)=-\left({\mathbf{x}}_k-{\mathbf{w}}_k\right)+\alpha {\mathbf{w}}_k $$.     (2.10)

Setting the gradient (2.10) to zero, multiplying by $$ {\mathbf{w}}_k^T $$, and applying the constraint $$ {\mathbf{w}}_k^T{\mathbf{w}}_k=1 $$, we obtain

$$ \alpha ={\mathbf{w}}_k^T{\mathbf{x}}_k-1 $$.     (2.11)

Substituting this α into (2.10), we obtain the adaptive gradient descent algorithm for the normalized mean:

$$ {\mathbf{w}}_{k+1}={\mathbf{w}}_k+{\eta}_k\left({\mathbf{x}}_k-{\mathbf{w}}_k^T{\mathbf{x}}_k{\mathbf{w}}_k\right) $$,     (2.12)

where ηk is a small decreasing gain that follows assumption A1.2 in the Proofs of Convergence in the GitHub repository [Chatterjee GitHub].

Variations of the Adaptive Normalized Mean Algorithm

There are several variations of the objective function (2.9) that lead to other adaptive algorithms for normalized mean computation. One variation is to substitute the value of α from (2.11) into (2.9), which gives the following objective function:

$$ J\left({\mathbf{w}}_k;{\mathbf{x}}_k\right)={\left\Vert {\mathbf{x}}_k-{\mathbf{w}}_k\right\Vert}^2+\left({\mathbf{w}}_k^T{\mathbf{x}}_k-1\right)\left({\mathbf{w}}_k^T{\mathbf{w}}_k-1\right) $$.     (2.13)

Unlike (2.9), this objective function is unconstrained and has the constraint $$ {\mathbf{w}}_k^T{\mathbf{w}}_k=1 $$ built into it. It leads to the following adaptive algorithm:

$$ {\mathbf{w}}_{k+1}={\mathbf{w}}_k+{\eta}_k\left(2{\mathbf{x}}_k-{\mathbf{w}}_k^T{\mathbf{x}}_k{\mathbf{w}}_k-{\mathbf{w}}_k^T{\mathbf{w}}_k{\mathbf{x}}_k\right) $$.     (2.14)

Another variation uses the penalty function method of nonlinear optimization to enforce the constraint $$ {\mathbf{w}}_k^T{\mathbf{w}}_k=1 $$. This objective function is

$$ J\left({\mathbf{w}}_k;{\mathbf{x}}_k\right)={\left\Vert {\mathbf{x}}_k-{\mathbf{w}}_k\right\Vert}^2+\frac{\mu }{2}{\left({\mathbf{w}}_k^T{\mathbf{w}}_k-1\right)}^2 $$,     (2.15)

where μ is a positive penalty constant. This objective function is also unconstrained and leads to the following adaptive algorithm:

$$ {\mathbf{w}}_{k+1}={\mathbf{w}}_k+{\eta}_k\left({\mathbf{x}}_k-{\mathbf{w}}_k-\mu {\mathbf{w}}_k\left({\mathbf{w}}_k^T{\mathbf{w}}_k-1\right)\right) $$.     (2.16)

The following is the Python code to adaptively compute a normalized mean by algorithms (2.12), (2.14), and (2.16) with data X[nDim,nSamples]:
# Starting values: zero vectors (see Section 2.8); mu is an illustrative penalty constant
w1 = np.zeros((nDim, 1))
w2 = np.zeros((nDim, 1))
w3 = np.zeros((nDim, 1))
mu = 1.0

for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        # Eq.2.12
        w1 = w1 + (1.0/(100+cnt))*(x - (w1.T @ x)*w1)
        # Eq.2.14
        w2 = w2 + (1.0/(100+cnt))*(2*x - (w2.T @ x)*w2 - (w2.T @ w2)*x)
        # Eq.2.16
        w3 = w3 + (1.0/(100+cnt))*(x - w3 - mu*((w3.T @ w3) - 1)*w3)
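As a quick check, the adaptive estimates can be compared against the batch normalized mean m/‖m‖ computed directly from the data; this sketch assumes X holds the full sample set:
m_batch = X.mean(axis=1).reshape(nDim, 1)      # batch sample mean
m_norm = m_batch / np.linalg.norm(m_batch)     # batch normalized mean
for w in (w1, w2, w3):
    print(np.linalg.norm(w - m_norm))          # error of each adaptive estimate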

2.7 Adaptive Median Algorithm

Given a sequence {xk}, its asymptotic median μ satisfies the following:

$$ \underset{k\to \infty }{\lim}\mathrm{P}\left({\mathbf{x}}_k\ge \mu \right)=\underset{k\to \infty }{\lim}\mathrm{P}\left({\mathbf{x}}_k<\mu \right)=0.5 $$     (2.17)

where P(E) is the probability measure of event E and 0 ≤ P(E) ≤ 1. The objective function J(wk;xk) whose minimizer w* is the asymptotic median μ is

J(wk; xk) = ‖xk − wk‖.     (2.18)

The gradient of J(wk;xk) with respect to wk is

$$ {\nabla}_{{\mathbf{w}}_k}J\left({\mathbf{w}}_k;{\mathbf{x}}_k\right)=-\operatorname{sgn}\left({\mathbf{x}}_k-{\mathbf{w}}_k\right) $$,     (2.19)

where sgn(.) is the sign operator (sgn(x)=1 if x≥0 and –1 if x<0), applied elementwise when xk is a vector. From the gradient in (2.19), we obtain the adaptive gradient descent algorithm:

wk + 1 = wk + ηk sgn (xk − wk).     (2.20)

The following is the Python code to adaptively compute the median by algorithm (2.20) with data X[nDim,nSamples]:
# Illustrative starting value for the adaptive median estimate
md = np.zeros((nDim, 1))

for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        x = X[:,iter]
        x = x.reshape(nDim,1)
        # Eq.2.20 with a decreasing gain of 3.0/(1 + cnt)
        md = md + (3.0/(1 + cnt)) * np.sign(x - md)

2.8 Experimental Results

The purpose of these experiments is to demonstrate the performance and accuracies of the adaptive algorithms for mean, correlation, inverse correlation, inverse covariance, normalized mean, and median.

I generated 1,000 sample vectors {xk} of five-dimensional Gaussian data (i.e., n=5) with the following mean and covariance:

Mean = [10 7 6 5 1],

Covariance = $$ \left[\begin{array}{ccccc}2.091& 0.038& -0.053& -0.005& 0.010\\ 0.038& 1.373& 0.018& -0.028& -0.011\\ -0.053& 0.018& 1.430& 0.017& 0.055\\ -0.005& -0.028& 0.017& 1.084& -0.005\\ 0.010& -0.011& 0.055& -0.005& 1.071\end{array}\right] $$.

For each algorithm, I computed the error as the Frobenius norm [Frobenius norm, Wikipedia] of the difference between the estimated value at each iteration of the algorithm and the actual value computed from all of the 1,000 samples:
$$ \mathrm{error}(k)={\left\Vert \text{Estimated Value}(k)-\text{Actual Value}\right\Vert}_F. $$
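A minimal sketch of this error computation with NumPy follows; the names estimated and actual are placeholders for any adaptive estimate (a matrix or a column vector) and its batch counterpart:
def frobenius_error(estimated, actual):
    # Frobenius norm of the difference between the adaptive estimate
    # and the batch value; inputs are 2-D arrays (matrices or column vectors)
    return np.linalg.norm(estimated - actual, 'fro')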
I computed the mean vector and the correlation matrix with the adaptive algorithms (2.4) and (2.5), respectively. Figure 2-5 shows the results.
Figure 2-5

Convergence of the mean vector and correlation matrices with algorithms (2.4) and (2.5), respectively

I computed the inverses of the correlation and covariance matrices with algorithms (2.7) and (2.8), respectively. Figure 2-6 shows the results.
Figure 2-6

Convergence of the inverse correlation and inverse covariance matrices with algorithms (2.7) and (2.8), respectively

I computed the normalized mean by the adaptive algorithms (2.12), (2.14), and (2.16), and the median by algorithm (2.20). For each value of k in these algorithms, I computed the errors between wk (the estimate) and the actual values of the normalized mean and median obtained from the entire 1,000-sample data set. I used the 5×1 zero vector as the starting value (w0) for all algorithms. The results are shown in Figure 2-7.
Figure 2-7

Convergence of the normalized mean of {xk} by adaptive algorithms (2.12), (2.14), and (2.16) and the median by adaptive algorithm (2.20)

All adaptive algorithms converged rapidly. After the 1,000 samples were processed, the errors were 0.0043 for algorithms (2.14) and (2.16), and 0.0408 for algorithm (2.20). See the code in the GitHub repository [Chatterjee GitHub].
