© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Chatterjee, Adaptive Machine Learning Algorithms with Python, https://doi.org/10.1007/978-1-4842-8017-1_1

1. Introduction

Chanchal Chatterjee
San Jose, CA, USA

In this chapter, I present the adaptive computation of important features for data representation and classification. I demonstrate the importance of these features in machine learning, data visualization, and data analytics. I also show the importance of these algorithms in multiple disciplines and how each discipline derives them. Finally, I present a common methodology to derive these algorithms. This methodology is of high practical value since practitioners can use it to derive their own features and algorithms for their own use cases.

For these data features, I assume that the data arrives as a sequence, has to be used instantaneously, and the entire batch of data cannot be stored in memory.

In machine learning and data analysis problems such as regression, classification, enhancement, or visualization, effective representation of data is key. When this data is multi-dimensional and time varying, the computational challenges are more formidable. Here we not only need to compute the represented data in a timely manner, but also adapt to the changing input in a fast, efficient, and robust manner.

A well-known method of data compression/representation is the Karhunen-Loeve theorem [Karhunen–Loève theorem, Wikipedia] or eigenvector orthonormal expansion [Fukunaga 90]. This method is also known as principal component analysis (PCA) [principal component analysis, Wikipedia]. Since each eigenvector can be ranked by its corresponding eigenvalue, a subset of the “best” eigenvectors can be chosen as the most relevant features.

Figure 1-1 shows two-class, two-dimensional data in which the best feature for data representation is the projection of the data on vector e1, which captures the most significant properties of the data from the two classes.
Figure 1-1

Data representation feature e1 that best represents the data from the two classes [Source: Chatterjee et al. IEEE Transactions on Neural Networks, Vol. 8, No. 3, pp. 663-678, May 1997]

In classification, however, you generally want to extract features that are effective for preserving class separability . Simply stated, the goal of classification feature extraction is to find a transform that maps the raw measurements into a smaller set of features, which contain all the discriminatory information needed to solve the overall pattern recognition problem.

Figure 1-2 shows the same two-class, two-dimensional data in which the best feature for classification is vector e2, whereby projection of the data on e2 leads to best class separability.
Figure 1-2

Data classification feature e2 that leads to best class separability [Source: Chatterjee et al. IEEE Transactions on Neural Networks, Vol. 8, No. 3, pp. 663-678, May 1997]

Before I present the adaptive algorithms for streaming data in the following chapters, I need to discuss the following:
  1. Commonly used features that are obtained by a linear transform of the data and are used with streaming data for edge applications
  2. Historical relevance of these features and how we derive them from different disciplines
  3. Why we want to use adaptive algorithms to compute these features
  4. How to create a common mathematical framework to derive adaptive algorithms for these features and many more

1.1 Commonly Used Features Obtained by Linear Transform

In this section, I discuss four commonly used features for data analytics and machine learning. These features are effective in data classification and representation, and can be easily obtained by a simple linear transform of the data. The simplicity and effectiveness of these features makes them useful for streaming data and edge applications.

In mathematical terms, let {xk} be an n-dimensional (zero mean) sequence that represents the data. We are seeking a matrix sequence {Wk} and a transform:

$$ \mathbf{y}_k = W_k^T \mathbf{x}_k $$,     (1.1)

such that the linear transform yk has properties of data representation and is our desirable feature. I discuss a few of these features later.

Definition: Define the data correlation matrix A of {xk} as

$$ A = \underset{k \to \infty}{\lim} E\left[\mathbf{x}_k \mathbf{x}_k^T\right] $$.     (1.2)
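In the streaming setting assumed in this book, A is not known in advance. A minimal sketch of the running estimate used throughout the code listings in this chapter (with placeholder dimensions and a synthetic zero-mean stream) is

import numpy as np

nDim, nSamples = 5, 1000                              # placeholder dimensions
rng = np.random.default_rng(0)
samples = rng.standard_normal((nSamples, nDim, 1))    # synthetic zero-mean stream {x_k}
A = np.zeros((nDim, nDim))                            # running estimate of E[x_k x_k^T]
for k, x in enumerate(samples, start=1):
    A = A + (1.0 / k) * (x @ x.T - A)                 # A_k = A_{k-1} + (1/k)(x_k x_k^T - A_{k-1})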

Data Whitening

Data whitening is a process of decorrelating the data such that all components have unit variance. It is a data preprocessing step in machine learning and data analysis to “normalize” the data so that it is easier to model. Here the linear transform yk of the data has the property $$ E\left[\mathbf{y}_k \mathbf{y}_k^T\right] = I_n $$ (the identity matrix). I show in Chapter 3 that the optimal value of Wk is A–½.

Figure 1-3 shows the correlation matrices of the original and whitened data. The original random normal data is highly correlated, as shown by the colors on all axes. The whitened data is fully uncorrelated: only the diagonal values are nonzero.
Figure 1-3

Original correlated data is uncorrelated by the data whitening algorithm

The Python code to generate the whitened data from original dataset X[nDim, nSamples] is
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.linalg import eigh
# Compute correlation matrix and eigenvectors of the original data
corX = (X @ X.T) / nSamples
eigvals, eigvecs = eigh(corX)                 # eigh returns eigenvalues in ascending order
# Generate the whitened data: Y = A^(-1/2) X, where A^(-1/2) = V D V^T
V = np.fliplr(eigvecs)                        # eigenvectors in descending eigenvalue order
D = np.diag(np.sqrt(1.0 / eigvals[::-1]))     # Lambda^(-1/2) in descending order
Y = V @ D @ V.T @ X
corY = (Y @ Y.T) / nSamples
# Plot the original and whitened correlation matrices
plt.figure(figsize=(10, 4))
plt.rcParams.update({'font.size': 16})
plt.subplot(1, 2, 1)
sns.heatmap(corX, linewidth=0.5, linecolor="green", cmap='RdBu', cbar=False)
plt.title("Original data")
plt.subplot(1, 2, 2)
sns.heatmap(corY, linewidth=0.5, linecolor="green", cmap='hot', cbar=False)
plt.title("Whitened data")
plt.show()

Principal Components

Principal component analysis (PCA) is a well-studied example of the data representation model. From the perspective of classical statistics, PCA is an analysis of the covariance structure of multivariate data {xk}. Let yk=[yk1,…,ykp]T be the components of the PCA-transformed data. In this representation, the first principal component yk1 is the projection onto the one-dimensional linear subspace in which the variance of the projected data is maximal. The second principal component yk2 is the direction of maximal variance in the space orthogonal to yk1, and so on.

It has been shown that the optimal weight matrix Wk is the eigenvector matrix of the correlation of the zero-mean input process {xk}. Let AΦ=ΦΛ be the eigenvector decomposition (EVD) of A, where Φ and Λ are respectively the eigenvector and eigenvalue matrices. Here Λ=diag(λ1,…,λn) is the diagonal eigenvalue matrix with λ1≥…≥λn>0 and Φ is orthonormal. We denote Φp ∈ ℜn×p as the matrix whose columns are the first p principal eigenvectors. Then the optimal Wk = Φp.

There are three variations of PCA that are useful in applications.
  1. When p=n, the n components of yk are ordered according to maximal to minimal variance. This is the component analyzer that is used for data analysis.
  2. When p<n, the p components of yk have maximal information for data compression.
  3. For p<n, the n-p components of yk with minimal variance can be regarded as abnormal signals and reconstructed as (In–ΦpΦpT)xk to obtain a novelty filter (a short sketch follows the PCA code listing below).
Figure 1-4 shows the correlation matrices of the original and PCA-transformed random normal data. The original data is highly correlated, as shown by the colors on all axes. The PCA-transformed data is uncorrelated: only diagonal entries remain, with the largest variance in component 1 (top-left corner) and decreasing variances thereafter.
Figure 1-4

Original correlated data is uncorrelated by PCA projection

The Python code for the PCA projected data from original dataset X[nDim, nSamples] is
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.linalg import eigh
# Compute data correlation matrix
corX = (X @ X.T) / nSamples
# Generate the PCA-transformed data
eigvals, eigvecs = eigh(corX)
EstV = np.fliplr(eigvecs)          # eigenvectors in descending eigenvalue order
Y = EstV.T @ X
corY = (Y @ Y.T) / nSamples
# Plot the original and PCA-transformed correlation matrices
plt.figure(figsize=(10, 5))
plt.rcParams.update({'font.size': 16})
plt.subplot(1, 2, 1)
sns.heatmap(corX, linewidth=0.5, linecolor="green", cmap='RdBu', cbar=False)
plt.title("Original data")
plt.subplot(1, 2, 2)
sns.heatmap(corY, linewidth=0.5, linecolor="green", cmap='hot', cbar=False)
plt.title("PCA Transformed")
plt.show()
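For the novelty filter mentioned in item 3 of the list above, a minimal sketch is shown below. It reuses X, nDim, and nSamples from the PCA listing; p, the number of retained principal components, is a placeholder.

import numpy as np
from scipy.linalg import eigh

p = 2                                             # placeholder: number of principal components kept
corX = (X @ X.T) / nSamples                       # data correlation, as in the PCA listing above
eigvals, eigvecs = eigh(corX)
Phi_p = np.fliplr(eigvecs)[:, :p]                 # first p principal eigenvectors
# Novelty filter: remove the principal subspace and keep the residual (abnormal) signal
novelty = (np.eye(nDim) - Phi_p @ Phi_p.T) @ X
score = np.sum(novelty**2, axis=0)                # large values flag unusual samples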

Linear Discriminant Features

Linear discriminant analysis (LDA) [linear discriminant analysis, Wikipedia] creates features from multi-class data such that the transformed representation yk has the greatest class separability. Given data xk and the corresponding classes ck, we can calculate the following matrices:
  • data correlation matrix B=E(xkxkT),

  • cross correlation matrix M=E(xkckT),

  • and A=MMT.

It is well known that the linear transform Wk (Eq. 1.1) is given by the generalized eigen-decomposition (GEVD) [generalized eigenvector, Wikipedia] of A with respect to B. Here AΨ=BΨΔ, where Ψ and Δ are respectively the generalized eigenvector and eigenvalue matrices. Furthermore, Ψp ∈ ℜn×p is the matrix whose columns are the first p≤n principal generalized eigenvectors.

Figure 1-5 shows a two-class classification problem where the real-world data correlation matrix on the left has values on all axes and it is hard to distinguish the two classes. On the right is the LDA-transformed data, which clearly shows the two classes and is easy to classify.
Figure 1-5

Original correlated data is uncorrelated by linear discriminant analysis

The Python code to adaptively generate the LDA transformed correlation matrix from a two-class, multi-dimensional dataset [nDim, nSamples] is
# Adaptively compute matrices A and B
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.linalg import eigh

# Separate the features from the class labels
dataset2 = dataset.drop(['Class'], axis=1)
nSamples = dataset2.shape[0]
nDim = dataset2.shape[1]
classes = np.array(dataset['Class'] - 1)
classes_categorical = tf.keras.utils.to_categorical(classes, num_classes=2)
M = np.zeros(shape=(nDim, 2))      # adaptive cross-correlation matrix E[x c^T]
B = np.zeros(shape=(nDim, nDim))   # adaptive data correlation matrix E[x x^T]
for iter in range(nSamples):
    cnt = iter + 1
    # Update matrices B, M, and A from the current sample x
    x = np.array(dataset2.iloc[iter]).reshape(nDim, 1)
    B = B + (1.0/cnt)*((np.dot(x, x.T)) - B)
    y = classes_categorical[iter].reshape(2, 1)
    M = M + (1.0/cnt)*((np.dot(x, y.T)) - M)
    A = M @ M.T
# Generate the LDA transform: generalized eigenvectors of (A, B)
eigvals, eigvecs = eigh(A, B)
V = np.fliplr(eigvecs)
VTAV = np.around(V.T @ A @ V, 2)
VTBV = np.around(V.T @ B @ V, 2)
# Plot the original and LDA-transformed correlation matrices
plt.figure(figsize=(8, 8))
plt.rcParams.update({'font.size': 16})
plt.subplot(2, 2, 1)
sns.heatmap(A, linewidth=0.5, linecolor="green", cmap='RdBu', cbar=False)
plt.title("Original data")
plt.subplot(2, 2, 2)
sns.heatmap(VTBV, linewidth=0.5, linecolor="green", cmap='hot', cbar=False)
plt.title("LDA Transformed")
plt.subplot(2, 2, 3)
sns.heatmap(A, linewidth=0.5, linecolor="green", cmap='RdBu', cbar=False)
plt.subplot(2, 2, 4)
sns.heatmap(VTAV, linewidth=0.5, linecolor="green", cmap='hot', cbar=False)
plt.show()

Singular Value Features

Singular value decomposition (SVD) [singular value decomposition, Wikipedia] is a special case of an EVD problem, as follows. Given the cross-correlation (n-by-m real) matrix M=E(xkckT) ∈ ℜn×m, SVD computes two matrices Uk ∈ ℜn×n and Vk ∈ ℜm×m such that UkTMVk = S ∈ ℜn×m, where Uk and Vk are orthonormal and S=diag(s1,...,sr), r=min(m, n), with s1 ≥ ... ≥ sr ≥ 0. By rearranging the vectors xk and ck, we can convert an n×m-dimensional SVD problem into an (n+m)×(n+m)-dimensional EVD problem.
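One common way to do this rearrangement is the symmetric embedding sketched below. This is only a minimal numeric check with a random placeholder matrix M, not the adaptive formulation developed later in the book.

import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
M = rng.standard_normal((n, m))              # placeholder cross-correlation matrix

# Embed M in a symmetric (n+m) x (n+m) matrix; its nonzero eigenvalues come in +/- pairs
H = np.block([[np.zeros((n, n)), M],
              [M.T, np.zeros((m, m))]])
eigvals = np.linalg.eigvalsh(H)

# The largest min(n, m) eigenvalues of H equal the singular values of M
sv_from_evd = np.sort(eigvals)[::-1][:min(n, m)]
sv_direct = np.linalg.svd(M, compute_uv=False)
print(np.allclose(sv_from_evd, sv_direct))   # True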

Summary

Table 1-1 summarizes the discussion in this section. Note that given a sequence of vectors {xk}, we are seeking a matrix sequence {Wk} and a linear transform $$ \mathbf{y}_k = W_k^T \mathbf{x}_k $$.
Table 1-1. Statistical Property of yk and Matrix Property of Wk

Computation | Statistical Property of yk           | Matrix Property of Wk
Whitening   | E[ykykT] = In                        | Wk = A–½ = ΦΛ–½ΦT
PCA/EVD     | Max E[ykTyk] subject to WkTWk = Ip   | Wk = Φp and E[ykykT] = Λp
LDA/GEVD    | Max E[ykTyk] subject to WkTBWk = Ip  | Wk = Ψp and E[ykykT] = Ip

1.2 Multi-Disciplinary Origin of Linear Features

In this section, I further discuss the importance of data representation and classification features by showing how multiple disciplines derive these features and compute them adaptively for streaming data. For each discipline, I demonstrate the use of these features on real data.

Hebbian Learning or Neural Biology

Hebb’s postulate of learning is the oldest and most famous of all learning rules. Hebb proposed that when an axon of cell A excites cell B, the synaptic weight W is adjusted based on f(x,y), where f(⋅) is a function of presynaptic activity x and postsynaptic activity y. As a special case, we may write the weight adjustment ΔW = ηxyT, where η>0 is a small constant.

Given an input sequence {xk}, we can construct its linear transform $$ y_k = \mathbf{w}_k^T \mathbf{x}_k $$, where wk ∈ ℜn is the weight vector. We can describe a Hebbian rule [Haykin 94] for adjusting wk as wk+1 = wk + ηxkyk. Since this rule leads to exponential growth, Oja [Oja 89] introduced a rule to limit the growth of wk by normalizing wk as follows:

$$ \mathbf{w}_{k+1} = \frac{\mathbf{w}_k + \eta \mathbf{x}_k y_k}{\left\Vert \mathbf{w}_k + \eta \mathbf{x}_k y_k \right\Vert}, $$     (1.3)

Assuming small η and ‖wk‖=1, (1.3) can be expanded as a power series in η, yielding

$$ \mathbf{w}_{k+1} = \mathbf{w}_k + \eta\left(\mathbf{x}_k y_k - y_k^2 \mathbf{w}_k\right) + O\left(\eta^2\right), $$     (1.4)

Eliminating the O(η2) term, we get the constrained Hebbian algorithm (CHA) or Oja’s one-unit rule (Chapter 4, Sec 4.3). Indeed, this algorithm converges to the first principal eigenvector of A.

Note that adaptive PCA algorithms are commonly used on streaming data, and several GitHub repositories for them exist. In this book and the associated GitHub, I create a comprehensive repository of these algorithms and also present new ones.

For example, algorithm (1.4) has been widely used on streaming data to derive the strongest representation feature from the data for analytics and machine learning. Figure 1-6 shows multivariate non-stationary data with seasonality. Using the adaptive update rule (1.4), we derive the first principal component effectively within the first 0.2% of the streaming data. The figure shows the seasonal time-varying streaming data on the left and, on the right, the rapid convergence (ideal value is 1) of the algorithm to the first principal eigenvector, leading to the strongest data representation feature.
Figure 1-6

Adaptive PCA algorithm used on seasonal streaming data

The Python code to implement the Hebbian adaptive algorithm on real 33-dimensional dataset [nDim][nSamples] is
# Adaptive Hebbian/Oja algorithm
import numpy as np

A = np.zeros(shape=(nDim, nDim))   # adaptive data correlation matrix
w = 0.1 * np.ones(shape=(nDim))    # weight vector (estimate of the first principal eigenvector)
for iter in range(nSamples):
    # Update the data correlation matrix with the latest data vector x
    x = np.array(dataset1.iloc[iter]).reshape(nDim, 1)
    A = A + (1.0/(1 + iter))*((np.dot(x, x.T)) - A)
    # Hebbian/Oja update with a decreasing gain: w <- w + eta * (A w - w (w^T A w))
    v = w[:].reshape(nDim, 1)
    v = v + (1/(100 + iter))*(A @ v - v @ (v.T @ A @ v))
    w[:] = v.reshape(nDim)

Auto-Associative Networks

Auto-association is a neural network structure in which the desired output is the same as the network input xk. This is also known as the linear autoencoder [autoencoder, Wikipedia]. Let’s consider a two-layer linear network with weight matrices W1 and W2 for the input and output layers, respectively, and p (≤n) nodes in the hidden layer. The mean square error (MSE) at the network output is given by

$$ e = E\left[\left\Vert \mathbf{x}_k - W_2^T W_1^T \mathbf{x}_k \right\Vert^2\right] $$.     (1.5)

Due to p nodes in the hidden layer, a minimum MSE produces outputs that represent the best estimates of xk ∈ ℜn in the ℜp subspace. Since projection onto Φp minimizes the error in the ℜp subspace, we expect the first layer weight matrix W1 to be rotations of Φp. The second layer weight matrix W2 is the inverse of W1 to finally represent the “best” identity transform. This intuitive argument has been proven [Baldi and Hornik 89,95, Bourland and Kamp 88], where the optimum weight matrices are

W1 = ΦpR and $$ W_2 = R^{-1}\Phi_p^T $$,     (1.6)

where R is a non-singular p×p matrix. Note that if we further impose the constraint $$ W_2 = W_1^T $$, then R is a unitary matrix and the input layer weight matrix W1 is orthonormal and spans the space defined by Φp, the p principal eigenvectors of the input correlation matrix A. See Figure 1-7.
Figure 1-7

Auto-associative neural network

Note that if we have a single node in the hidden layer (i.e., p=1), then we obtain e as the output sum squared error for a two-layer linear auto-associative network with input layer weight vector w and output layer weight vector wT. The optimal value of w is the first principal eigenvector of the input correlation matrix Ak.

The result in (1.6) suggests the possibility of a PCA algorithm by using a gradient correction only to the input layer weights, while the output layer weights are modified in a symmetric fashion, thus avoiding the backpropagation of errors in one of the layers. One possible version of this idea is

$$ W_1\left(k+1\right) = W_1(k) - \eta\, \frac{\partial e}{\partial W_1} $$ and W2(k + 1) = W1(k + 1)T.     (1.7)

Denoting W1(k) by Wk, we obtain an algorithm that is the same as Oja’s subspace learning algorithm (SLA).
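For reference, a standard instantaneous form of this symmetric update is Wk+1 = Wk + η(xkykT − WkykykT) with yk = WkTxk. The sketch below implements one such step; the function name and shapes are placeholders.

import numpy as np

def sla_step(W, x, eta):
    """One step of Oja's subspace learning algorithm (instantaneous form).
    W: (n, p) weight matrix, x: (n, 1) data sample, eta: small gain constant."""
    y = W.T @ x                                 # project the sample onto the current subspace
    return W + eta * (x @ y.T - W @ (y @ y.T))  # symmetric Hebbian update of both layers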

Algorithm (1.7) has been used widely on multivariate streaming data to extract the significant principal components so that we can instantaneously find the important data representation features. Figure 1-8 shows multidimensional streaming data on the left and the rapid convergence of the first two principal eigenvectors on the right by the adaptive algorithm (1.7), which is further described in Chapter 5.
Figure 1-8

Convergence of the first two principal eigenvectors computed from multidimensional streaming data. Data is on the left and feature convergence (ideal value = 1) is on the right

The Python code to generate the first 4 principal eigenvectors from 10-dimensional synthetic dataset is
import numpy as np

A = np.zeros(shape=(nDim, nDim))        # adaptive data correlation matrix
W2 = 0.1 * np.ones(shape=(nDim, nEA))   # weights for the deflated algorithm
W3 = W2.copy()                          # weights for the weighted algorithm
c = [2 - 0.3*k for k in range(nEA)]     # decreasing weights for the weighted variant
C = np.diag(c)
for epoch in range(nEpochs):
    for iter in range(nSamples):
        cnt = nSamples*epoch + iter
        # Update data correlation matrix A with the current data sample x
        x = X[:, iter].reshape(nDim, 1)
        A = A + (1.0/(1 + cnt))*((np.dot(x, x.T)) - A)
        # Deflated gradient descent (upper-triangular deflation)
        W2 = W2 + (1/(100 + cnt))*(A @ W2 - W2 @ np.triu(W2.T @ A @ W2))
        # Weighted gradient descent
        W3 = W3 + (1/(220 + cnt))*(A @ W3 @ C - W3 @ C @ (W3.T @ A @ W3))

Hetero-Associative Networks

Let’s consider a hetero-associative network, which differs from the auto-associative case in the output layer, whose target is d instead of x. Here d denotes the categorical class that the data belongs to. One example is d=ei, where ei is the ith standard basis vector [standard basis, Wikipedia] for class i. In a two-class problem, d=[1 0]T for class 1 and d=[0 1]T for class 2. Let’s denote B=E(xxT), M=E(xdT), and A=MMT. See Figure 1-9.
Figure 1-9

Hetero-associative neural network

We consider a two-layer linear hetero-associative neural network with just a single neuron in the hidden layer and m≤n output units. Let w ∈ ℜn be the weight vector for the input layer and v ∈ ℜm be the weight vector for the output layer. The MSE at the network output is

e = E{‖d − vwTx‖2}.     (1.8)

We further assume that the network has limited power, so wTBw=1. Hence, we impose this constraint on the MSE criterion in (1.8) as

J(w, v) = E{‖d − vwTx‖2} + μ(wTBw − 1),     (1.9)

where μ is the Lagrange multiplier. This objective has a unique global minimum where w is the first principal generalized eigenvector of the matrix pencil (A,B) and v=MTw. Furthermore, from the gradient of (1.9) with respect to w, we obtain the update equation for w as

$$ \mathbf{w}_{k+1} = \mathbf{w}_k - \eta\, \frac{\partial J}{\partial \mathbf{w}}\left(\mathbf{w}_k, \mathbf{v}_k\right) = \mathbf{w}_k + \eta\left(I - B_k \mathbf{w}_k \mathbf{w}_k^T\right) M_k \mathbf{v}_k $$.     (1.10)

We can substitute the convergence value of v into (1.10) and avoid the back-propagation of errors in the second layer to obtain

$$ \mathbf{w}_{k+1} = \mathbf{w}_k + \eta\left(A_k \mathbf{w}_k - B_k \mathbf{w}_k\left(\mathbf{w}_k^T A_k \mathbf{w}_k\right)\right). $$     (1.11)

This algorithm can be used effectively to adaptively compute classification features from streaming data.
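For illustration, here is a minimal sketch of update (1.11) on a stream, with the instantaneous estimates Ak = MkMkT and Bk built from running averages in the same way as in the listing that follows. The function and variable names are placeholders.

import numpy as np

def adaptive_class_feature(samples, labels, nDim, nClasses, eta=0.001):
    """Sketch of update (1.11). samples yields (nDim, 1) vectors;
    labels yields (nClasses, 1) one-hot class vectors."""
    B = np.zeros((nDim, nDim))        # running estimate of E[x x^T]
    M = np.zeros((nDim, nClasses))    # running estimate of E[x d^T]
    w = 0.1 * np.ones((nDim, 1))
    for k, (x, d) in enumerate(zip(samples, labels), start=1):
        B += (1.0 / k) * (x @ x.T - B)
        M += (1.0 / k) * (x @ d.T - M)
        A = M @ M.T
        # Update (1.11): w <- w + eta * (A w - B w (w^T A w))
        w = w + eta * (A @ w - B @ w @ (w.T @ A @ w))
    return w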

Figure 1-10 shows the multivariate e-shopping clickstream dataset [Apczynski M., et al.] belonging to two classes determining buyer’s pricing sentiments. We use the adaptive algorithm (1.11) to compute the class separability feature w. Figure 1-10 shows the following:
  • The original multi-dimensional e-shopping data on the left 2 figures.

  • The original data correlation on the right (3rd figure). The original data correlation matrix shows that the classes are indistinguishable in the original data.

  • The correlation matrix of the data transformed by algorithm (1.11) on the far right. The white and red blocks on the far right matrix show the two classes are clearly separated in the transformed data by the algorithm (1.11).

Figure 1-10

e-Shopping clickstream data on the left and uncorrelated class separable data on the right

The Python code to generate the class separable transformed correlation matrix from two-class multi-dimensional dataset[nDim, nSamples] is
# Adaptively compute matrices A and B
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.linalg import eigh

nSamples = dataset1.shape[0]
nDim = dataset1.shape[1]
classes = np.array(dataset['price2'] - 1)
classes_categorical = tf.keras.utils.to_categorical(classes, num_classes=2)
M = np.zeros(shape=(nDim, 2))      # adaptive cross-correlation matrix E[x c^T]
B = np.zeros(shape=(nDim, nDim))   # adaptive data correlation matrix E[x x^T]
for iter in range(nSamples):
    cnt = iter + 1
    # Update matrices B, M, and A from the current sample x
    x = np.array(dataset1.iloc[iter]).reshape(nDim, 1)
    B = B + (1.0/cnt)*((np.dot(x, x.T)) - B)
    y = classes_categorical[iter].reshape(2, 1)
    M = M + (1.0/cnt)*((np.dot(x, y.T)) - M)
    A = M @ M.T
# Generate the transformed data: generalized eigenvectors of (A, B)
eigvals, eigvecs = eigh(A, B)
V = np.fliplr(eigvecs)
VTAV = np.around(V.T @ A @ V, 2)
VTBV = np.around(V.T @ B @ V, 2)
# Plot the original and class-separable transformed correlation matrices
plt.figure(figsize=(12, 12))
plt.rcParams.update({'font.size': 16})
plt.subplot(2, 2, 1)
sns.heatmap(A, linewidth=0.5, linecolor="green", cmap='RdBu', cbar=False)
plt.title("Original Correlated Data")
plt.subplot(2, 2, 2)
sns.heatmap(VTBV, linewidth=0.5, linecolor="green", cmap='hot', cbar=False)
plt.title("Transformed Class Separable Data")
plt.subplot(2, 2, 3)
sns.heatmap(A, linewidth=0.5, linecolor="green", cmap='RdBu', cbar=False)
plt.title("Original Correlated Data")
plt.subplot(2, 2, 4)
sns.heatmap(VTAV, linewidth=0.5, linecolor="green", cmap='hot', cbar=False)
plt.title("Transformed Class Separable Data")
plt.show()

Statistical Pattern Recognition

One special case of linear hetero-association is a network performing one-from-m classification, where input x to the network is classified into one out of m classes ω1,...,ωm. If x∈ωi then d=ei where ei is the ith standard basis vector. Unlike auto-associative learning, which is unsupervised, this network is supervised. In this case, A and B are scatter matrices. Here A is the between-class scatter matrix Sb, the scatter of the class means around the mixture mean, and B is the mixture scatter matrix Sm, the covariance of all samples regardless of class assignments. The generalized eigenvector decomposition of (A,B) is known as linear discriminant analysis (LDA), which was discussed in Section 1.1.

Information Theory

Another viewpoint of the data model (1.1) is due to Linsker [1988] and Plumbley [1993]. According to Linsker’s Infomax principle, the optimum value of the weight matrix W is when the information I(x,y) transmitted to its output y about its input x is maximized. This is equivalent to information in input x about output y since I(x,y)=I(y,x). However, a noiseless process like (1.1) has infinite information about input x in y and vice versa since y perfectly represents x. In order to proceed, we assume that input x contains some noise n, which prevents x from being measured accurately by y. There are two variations of this model, both inspired by Plumbley [1993].

In the first model, we assume that the output y is corrupted by noise due to the transform W. We further assume that average power available for transmission is limited, the input x is zero-mean Gaussian with covariance A, the noise n is zero-mean uncorrelated Gaussian with covariance N, and the transform noise n is independent of x. The noisy data model is

y = WTx + n.     (1.12)

The mutual information I(y,x) (with Gaussian assumptions) is

I(y, x) = 0.5 log  det(WTAW + N) − 0.5 log  det(N).     (1.13)

We define an objective function with power constraints as

J(W) = I(y, x) − 0.5tr(Λ(WTAW + N)),     (1.14)

where Λ is a diagonal matrix of Lagrange multipliers. This function is maximized when W = ΦR, where Φ is the principal eigenvector matrix of A and R is a non-singular rotation matrix.
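To make (1.13) concrete, the sketch below evaluates the mutual information for given W, A, and N; the dimensions and values are arbitrary placeholders.

import numpy as np

def mutual_information(W, A, N):
    """I(y, x) from Eq. (1.13): 0.5 log det(W^T A W + N) - 0.5 log det(N)."""
    _, logdet_signal = np.linalg.slogdet(W.T @ A @ W + N)
    _, logdet_noise = np.linalg.slogdet(N)
    return 0.5 * logdet_signal - 0.5 * logdet_noise

# Example with arbitrary dimensions: n inputs, p outputs
rng = np.random.default_rng(1)
n, p = 6, 2
X = rng.standard_normal((n, 1000))
A = (X @ X.T) / 1000                  # input covariance estimate
N = 0.1 * np.eye(p)                   # output noise covariance
W = rng.standard_normal((n, p))
print(mutual_information(W, A, N))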

Optimization Theory

In optimization theory, various matrix functions are computed by evaluating the maxima and minima of objective functions. Given a symmetric positive definite matrix pencil (A,B), the first principal generalized eigenvector is obtained by maximizing the well-known Rayleigh quotient:

$$ J\left(\mathbf{w}\right) = \frac{\mathbf{w}^T A \mathbf{w}}{\mathbf{w}^T B \mathbf{w}} $$.     (1.15)

There are three common modifications of this objective function based on the method of optimization [Luenberger]. They are
  • Lagrange multiplier:

J(w) =  − wTAw + α(wTBw − 1),     (1.16)

where α is a Lagrange multiplier.
  • Penalty function:

J(w) =  − wTAw + μ(wTBw − 1)2,     (1.17)

where μ is a non-negative scalar constant.
  • Augmented Lagrangian:

J(w) =  − wTAw + α(wTBw − 1) + μ(wTBw − 1)2,     (1.18)

where α is a Lagrange multiplier and μ > 0 is the penalty constant. Several algorithms based on this objective function are given in Chapters 4-7.

We obtain adaptive algorithms from these objective functions by using instantaneous values of the gradients and a gradient ascent technique, as discussed in the following chapters. One advantage of these objective functions is that we can use accelerated convergence methods such as steepest descent, conjugate direction, and Newton-Raphson. Chapter 6 discusses several such methods. The recursive least squares method has also been applied to adaptive matrix computations by taking various approximations of the inverse Hessian of the objective functions.
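As a small illustration of this recipe, the sketch below takes the penalty objective (1.17), replaces A and B with the instantaneous estimates xkxkT and zkzkT, and performs one gradient descent step. The names, shapes, and gains are placeholders rather than an algorithm from a later chapter.

import numpy as np

def penalty_gradient_step(w, x, z, eta, mu):
    """One adaptive step for J(w) = -w^T A w + mu (w^T B w - 1)^2,
    using instantaneous estimates A ~ x x^T and B ~ z z^T.
    w, x, z are (n, 1) column vectors; eta and mu are small positive constants."""
    Aw = x * (x.T @ w)                                   # (x x^T) w
    Bw = z * (z.T @ w)                                   # (z z^T) w
    grad = -2.0 * Aw + 4.0 * mu * (w.T @ Bw - 1.0) * Bw  # gradient of J at w
    return w - eta * grad                                # gradient descent on J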

1.3 Why Adaptive Algorithms?

We observed that data representation and classification problems lead to matrix algebra problems, which have two solutions depending on the nature of the inputs.
  1. The first set of solutions requires the relevant matrices to be known in advance.
  2. The second set of solutions requires a sequence of samples from which the matrix can be computed.

In this section, I explain the difference between batch and adaptive processing, benefits of adaptive processing on streaming data, and the requirements for adaptive algorithms.

Iterative or Batch Processing of Static Data

When the data is available in advance and the underlying matrices are known, we can use the batch processing approach. Such algorithms have been used to solve a large variety of matrix algebra problems such as matrix inversion, EVD, SVD, and PCA [Cichocki et al. 92, 93]. These problems are solved in two steps: (1) using the pooled data to estimate the required matrices, and (2) using a numerical matrix algebra procedure to compute the necessary matrix functions.

For example, consider a simple matrix inversion problem for the correlation matrix A of a data sequence {xk}. We can calculate the matrix inverse A-1 after all of the data have been collected and A has been calculated. This approach works in a batch fashion. When a new sample x is added, it is not difficult to get the inverse of the new matrix Anew=(nA+xxT)/(n+1), where n is the total number of samples used to compute A. Although the computation for A is simple, all the computations for solving $$ {A}_{new}^{-1} $$ need to be repeated.
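A minimal sketch of this batch pattern is shown below (placeholder names, not the adaptive method developed later): the correlation update is a cheap rank-one correction, but the inverse is recomputed from scratch for every new sample.

import numpy as np

def batch_inverse_after_new_sample(A, x, n):
    """Batch approach: update the correlation matrix A with sample x,
    then recompute the inverse in full. A is (nDim, nDim), x is (nDim, 1)."""
    A_new = (n * A + x @ x.T) / (n + 1)   # cheap rank-one update of the correlation
    return A_new, np.linalg.inv(A_new)    # O(nDim^3) inversion repeated for each sample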

There are two problems with this batch processing approach.
  1. First, the dimension of the samples may be large so that even if all the samples are available, performing the matrix algebra may be difficult or may take a prohibitively large amount of computational time. For example, eigenvector evaluation requires O(n3) computation, which is infeasible for samples of a large dimension (say 1,000), which occurs commonly in image processing and automatic control applications.
  2. Second, the matrix functions evaluated by conventional schemes cannot adapt to small changes in the data (e.g., a few incoming samples). If the matrix functions are estimated by conventional methods from K samples, then for each additional sample all of the computation has to be repeated.

These deficiencies make the batch schemes inefficient for real-time applications where the data arrives incrementally or in an online fashion.

My Approach: Adaptive Processing of Streaming Data

When we have a continuous sequence of data samples, batch algorithms are no longer useful. Instead, adaptive algorithms are used to solve matrix algebra problems. An immediate advantage of these algorithms is that they can be used on real-time problems such as edge computation, adaptive data compression [Le Gall 91], antenna array processing for noise analysis and source location [Owsley 78], and adaptive spectral analysis for frequency estimation [Pisarenko 73].

My approach is to offer computationally simple adaptive algorithms for matrix computations from streaming data. Note that adaptive algorithms are critical in environments where the data volume is large, the data has high dimensions, the data is time varying and has changing underlying statistics, and we do not have sufficient storage, compute, and bandwidth to process the data with low latency. One such environment is edge devices and computation.

For example, given streaming samples {xk} of customer e-shopping data, we need to calculate a key customer sentiment wk from the data stream. For that, we need a simple update rule that changes the sentiment from its last value wk to its latest value wk+1 as new data xk becomes available. It is of the format wk+1 = wk + f(wk,xk), where
  • wk is last value of the sentiment,

  • wk+1 is the latest updated value of the sentiment,

  • xk is the newest data used to calculate wk+1, and

  • f(.) is a simple function of wk and xk.

An example of this update rule is the well-known algorithm [Oja 82] to compute the first principal eigenvector of a streaming sequence {xk}:

$$ \mathbf{w}_{k+1} = \mathbf{w}_k + \eta\left(\mathbf{x}_k \mathbf{x}_k^T - \mathbf{w}_k \mathbf{w}_k^T \mathbf{x}_k \mathbf{x}_k^T\right)\mathbf{w}_k $$,     (1.19)

where η>0 is a small gain constant. In this algorithm, for each sample xk the update procedure requires simple matrix-vector multiplications, and the vector wk converges to the principal eigenvector of the data correlation matrix A. Clearly, this can be easily implemented in small CPUs.
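A minimal sketch of (1.19) on a stream of samples is shown below; the function and variable names are placeholders.

import numpy as np

def oja_first_pc(samples, nDim, eta=0.01):
    """Sketch of update (1.19): w converges toward the first principal eigenvector."""
    w = 0.1 * np.ones((nDim, 1))
    for x in samples:                        # x is an (nDim, 1) streaming sample
        y = (w.T @ x).item()                 # scalar projection of the sample on w
        w = w + eta * (x * y - w * (y * y))  # (x x^T - w w^T x x^T) w written with the scalar y
    return w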

Figure 1-11 shows a multivariate e-shopping clickstream dataset [Apczynski M. et al.]. The adaptive update rule (1.19) is used to compute buyer pricing sentiments. The data is shown on the left and sentiments computed adaptively are shown on the right (ideal value is 1). The sentiments are updated adaptively as new data arrives and the sentiment value converges quickly to its ideal value of 1.
Figure 1-11

e-Shopping clickstream data on the left and buyer sentiments computed by the update rule (1.19) on the right (ideal value = 1)

There are several advantages and disadvantages of adaptive algorithms over conventional iterative batch solutions.

The advantages are
  • While conventional solutions find all eigenvectors and eigenvalues, adaptive solutions compute the desired eigenvectors only. In many applications, we do not need eigenvalues. Hence, they should be more efficient.

  • Furthermore, computational complexity of adaptive algorithms depends on the efficiency of the matrix-vector product Ax. These methods can be even more efficient when the matrix A has a certain structure such as sparse, Hankel, or Toeplitz for which FFT can be used to speed up the computation.

  • Adaptive algorithms easily fit the framework of time-varying multidimensional processes such as adaptive signal processing, where the input process is continuously updated.

The disadvantages are
  • Adaptive algorithms produce an approximate value of the features for the current dataset whereas batch algorithms provide an exact value.

The adaptive approach requires a “ramp up” time to reach high accuracy, as evidenced by the convergence curves in Figure 1-11.

Requirements of Adaptive Algorithms

It is clear from the previous discussion that adaptive algorithms for matrix computation need to be inexpensive for the computational process to keep pace with the input data stream. We also require that the estimates converge strongly to their actual values. We therefore expect our adaptive algorithms to satisfy the following constraints:
  • The algorithms should adapt to small changes in data, which is useful for real-time applications.

  • The estimates obtained from the algorithms should have known statistical properties.

  • The network architectures associated with the adaptive algorithms consist of linear units in one- or two-layer networks, such that the networks can be easily implemented with simple CPUs and no special hardware.

The computation involved in the algorithms is inexpensive such that the statistical procedure can process every data sample as it arrives.

  • The estimates obtained from the algorithms converge strongly to their actual values.

With these requirements, I proceed to design a variety of neuromorphic adaptive algorithms that solve the matrix algebra problems considered in this book.

Although I will discuss many adaptive algorithms to solve the matrix functions, most of the algorithms are of the following general form:

Wk+1 = Wk + ηH(xk, Wk),     (1.20)

where H(xk,Wk) follows certain continuity and regularity properties [Ljung 77,78,84,92], and η > 0 is a small gain constant. These algorithms satisfy the requirements outlined before.

Real-World Use of Adaptive Matrix Computation Algorithms and GitHub

In Chapter 8, I discuss several real-world applications of these adaptive matrix computation algorithms. I also published the code for these applications in a public GitHub [Chanchal Chatterjee GitHub].

1.4 Common Methodology for Derivations of Algorithms

My contributions in this book are two-fold:
  1. I present a common methodology to derive and analyze each adaptive algorithm.
  2. I present adaptive algorithms for a number of matrix algebra problems.

The literature for adaptive algorithms for matrix computation offers a wide range of techniques (including ad hoc methods) and various types of convergence procedures. In this book, I present a common methodology to derive and prove the convergence of the adaptive algorithms (all proofs are in the GitHub).

The advantage of adopting this approach is that readers can follow the methodology and derive new adaptive algorithms for their own use cases.

In the following chapters, I follow these steps to derive each algorithm:

  1. Objective function

I first present an objective function J(W;Ak) such that the minimizer W* of J is the desired matrix function of the data matrix A.

  2. Derive the adaptive algorithm

I derive an adaptive update rule for matrix W by applying the gradient descent technique on the objective function J(W;Ak). The adaptive gradient descent update rule is

Wk+1 = Wk − ηk∇WJ(Wk, Ak) = Wk + ηkh(Wk, Ak),     (1.21)

where the function h(Wk,Ak) follows certain continuity and regularity properties and ηk is a decreasing gain sequence.

  3. Speed up the adaptive algorithm

The availability of the objective function J(W;A) allows us to speed up the adaptive algorithm by applying speedup techniques from optimization theory such as steepest descent, conjugate direction, Newton-Raphson, and recursive least squares. Details of these methods for principal component analysis are given in Chapter 6.

  4. Show that the algorithms converge to the matrix functions

In this book, I provide numerical experiments to demonstrate the convergence of these algorithms. The mathematical proofs are provided in a separate document in the GitHub [Chatterjee Github].

An important benefit of following this methodology is to allow practitioners to derive new algorithms for their own use cases.

Matrix Algebra Problems Solved Here

In the applications, I consider the data as arriving in temporal succession, such as in vector sequences {xk} or {yk}, or in matrix sequences {Ak} or {Bk}. Note that the algorithms given here can be used for non-stationary input streams.

In the following chapters, I present novel adaptive algorithms to estimate the following matrix functions:
  • Mean, correlation, and normalized mean of {xk}

  • Square root (A½) and inverse of the square root (A–½) of A from {xk} or {Ak}

  • Principal eigenvector of A from {xk} or {Ak}

  • Principal and minor eigenvector of A from {xk} or {Ak}

  • Generalized eigenvectors of A with respect to B from {xk} and {zk} or {Ak} and {Bk}

  • Singular value decomposition (SVD) of C from {xk} and {zk} or {Ck}

Besides these algorithms, I also discuss
  • Adaptive computation of mean, covariance, and matrix inversion

  • Methods to accelerate the adaptive algorithms by techniques of nonlinear optimization

1.5 Outline of the Book

In the following chapters, I discuss in detail the formulation, derivation, convergence, and experimental results of many adaptive algorithms for various matrix algebra problems.

In Chapter 2, I discuss basic terminologies and methods used throughout the remaining chapters. This chapter also covers adaptive algorithms for mean, median, correlation, covariance, and inverse correlation/covariance computation for both stationary and non-stationary data. I further discuss novel adaptive algorithms for normalized mean computation.

In Chapter 3, I discuss three adaptive algorithms for the computation of the square root of a matrix sequence. I next discuss three algorithms for the inverse square root of the same. I offer objective functions and convergence proofs for these algorithms.

In Chapter 4, I discuss 11 algorithms, some of them new algorithms, for the adaptive computation of the first principal eigenvector of a matrix sequence or the online correlation matrix of a vector sequence. I offer best practices to choose the algorithm for a given application. I follow the common methodology of deriving and analyzing the convergence of each algorithm, supported by experimental results.

In Chapter 5, I present 21 adaptive algorithms for the computation of principal and minor eigenvectors of a matrix sequence or the online correlation matrix of a vector sequence. These algorithms are derived from 7 different types of objective functions, each under 3 different conditions. Each algorithm is derived, discussed, and shown to converge analytically and experimentally. I offer best practices to choose the algorithm for a given application.

Since I have objective functions for all of the adaptive algorithms, in Chapter 6, I deviate from the traditional gradient descent method of deriving the algorithms. Here I derive new computationally faster algorithms by using steepest descent, conjugate direction, Newton-Raphson, and recursive least squares on the objective functions. Experimental results and comparison with state-of-the-art algorithms show the faster convergence of these adaptive algorithms.

In Chapter 7, I discuss 21 adaptive algorithms for generalized eigen-decomposition from two matrix or vector sequences. Once again, I follow the common methodology and derive all algorithms from objective functions, followed by experimental results.

In Chapter 8, I present real-world applications of these algorithms with examples and code.

The bibliography is in Chapter 9.
