Sanger's network

Sanger's network is a neural network model for online Principal Component extraction, proposed by T. D. Sanger in Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network (Sanger T. D., Neural Networks, 1989/2). The author started with the standard version of Hebb's rule and modified it to be able to extract a variable number of principal components (v1, v2, ..., vm) in descending order (λ1 > λ2 > ... > λm). The resulting approach, which is a natural extension of Oja's rule, is called the Generalized Hebbian Algorithm (GHA), also known as Generalized Hebbian Learning. The structure of the network is represented in the following diagram:

The network is fed with samples extracted from an n-dimensional dataset:

X = {x1, x2, ..., xM}, with xi ∈ ℝⁿ for every i

The m output neurons are connected to the input through a weight matrix, W = {wij}, where the first index refers to the output neuron (post-synaptic unit) and the second one to the input component (pre-synaptic unit). The output of the network can be easily computed with a scalar product (y = Wx, where each yi is the dot product between the ith row of W and x); however, in this case, we are not interested in it, because, just like for the covariance (and Oja's) rules, the principal components are extracted through the weight updates.
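
Even though the output itself isn't the quantity of interest, y is needed inside the weight updates, so it's worth fixing the shape conventions with a minimal NumPy sketch (the dataset X, the sizes n and m, and the weight matrix W are illustrative assumptions, not the values of the actual example):

import numpy as np

rng = np.random.default_rng(1000)
n, m = 5, 3                      # input dimensionality and number of components to extract (illustrative)
X = rng.normal(size=(100, n))    # dataset: 100 samples, each one an n-dimensional vector
W = rng.normal(size=(m, n))      # weight matrix: one row per output neuron

x = X[0]
y = W @ x                        # y[i] is the scalar product between the i-th row of W and x
print(y.shape)                   # (3,)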

The problem that arose after the formulation of Oja's rule was about the extraction of multiple components. In fact, if we applied the original rule to the previous network, all the weight vectors (the rows of W) would converge to the first principal component. The main idea for overcoming this limitation (based on the Gram-Schmidt orthonormalization method) comes from the observation that once we have extracted the first component w1, the second one, w2, can be forced to be orthogonal to w1, the third one, w3, can be forced to be orthogonal to w1 and w2, and so on. Consider the following representation:

Orthogonalization of two weight vectors

In this case, we are assuming that w1 is stable and w20 is another weight vector that is converging to w1. The projection of w20 onto w1 is as follows:

proj(w20) = ((w20 · w1) / ||w1||²) w1

In the previous formula, the norm can be omitted when w1 is kept normalized (in the network, this normalization is performed after a complete weight update). The orthogonal component of w20 is simply obtained with a difference:

w20⊥ = w20 - proj(w20)
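
A minimal NumPy sketch of this orthogonalization step follows (w1 and w20 contain arbitrary illustrative values):

import numpy as np

w1 = np.array([1.0, 2.0, 0.5])     # stable weight vector (illustrative values)
w20 = np.array([0.8, 1.5, -1.0])   # second weight vector, still converging (illustrative values)

# Projection of w20 onto w1 (the norm factor disappears if w1 is kept normalized)
proj = (np.dot(w20, w1) / np.dot(w1, w1)) * w1

# Orthogonal component obtained with a simple difference
w20_orth = w20 - proj
print(np.dot(w20_orth, w1))        # ~0.0: the residual is orthogonal to w1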

Applying this method to the original Oja's rule, we obtain a new expression for the weight update (called Sanger's rule):

Δwij = η(yi xj - yi Σ(k=1..i) wkj yk)

The rule refers to a single input vector x; hence, xj is the jth component of x. The first term is the classic Hebb's rule, which forces the weight vector to become parallel to the first principal component, while the second one acts in a way similar to the Gram-Schmidt orthogonalization, by subtracting a term proportional to the projections of w onto the weights connected to the previous post-synaptic units, while also enforcing the normalization constraint provided by Oja's rule (the term proportional to the square of the output yi).

In fact, expanding the last term, we get the following:

Δwij = η(yi xj - w1j y1 yi - w2j y2 yi - ... - wij yi²)

The term subtracted from each component wij is proportional to all the weights wkj where the index j is fixed and the first index k is equal to 1, 2, ..., i. This procedure doesn't produce an immediate orthogonalization, but requires several iterations to converge. The proof is non-trivial, involving convex optimization and dynamic systems methods, but it can be found in the aforementioned paper. Sanger showed that the algorithm always converges to the first m principal components, sorted from the largest eigenvalue to the smallest one, if the learning_rate η(t) decreases monotonically and converges to zero when t → ∞. Even if it is necessary for the formal proof, this condition can be relaxed in practice (a stable η < 1 is normally sufficient). In our implementation, the matrix W is normalized after each iteration, so that, at the end of the process, its rows (the weight vectors) are orthonormal and constitute a basis for the eigenvector subspace.
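
As a concrete (and deliberately non-vectorized) illustration, the following NumPy sketch applies the element-wise rule to a single sample; the names W, x, and eta, as well as their values, are assumptions made only for this example:

import numpy as np

rng = np.random.default_rng(1000)
n, m = 5, 3
W = rng.normal(size=(m, n), scale=0.1)   # current weights, one row per output neuron (illustrative)
x = rng.normal(size=n)                   # single input sample (illustrative)
eta = 0.01

y = W @ x
dW = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        # Hebbian term minus the projection onto the current and previous neurons
        dW[i, j] = eta * (y[i] * x[j] - y[i] * np.sum(W[:i + 1, j] * y[:i + 1]))
W += dW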

In matrix form, the rule becomes as follows:

ΔW = η(yxT - Tril(yyT)W), with y = Wx

Tril(•) is a matrix function that transforms its argument into a lower-triangular matrix (setting to zero all the elements above the main diagonal), and the term yyT is equal to WxxTWT.
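
In NumPy, the Tril(•) operator corresponds to np.tril, so the whole update collapses into a couple of matrix products. The sketch below (same illustrative names and values as before) also verifies that the matrix form matches the element-wise rule:

import numpy as np

rng = np.random.default_rng(1000)
n, m = 5, 3
W = rng.normal(size=(m, n), scale=0.1)
x = rng.normal(size=n)
eta = 0.01

y = W @ x
# Matrix form: dW = eta * (y xT - Tril(y yT) W)
dW = eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Element-wise version of the same rule, for comparison
dW_check = np.array([[eta * (y[i] * x[j] - y[i] * np.dot(W[:i + 1, j], y[:i + 1]))
                      for j in range(n)] for i in range(m)])
print(np.allclose(dW, dW_check))   # True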

The algorithm for a Sanger's network is as follows:

  1. Initialize W(0) with random values. If the input dimensionality is n and m principal components must be extracted, the shape will be (m × n).
  2. Set a learning_rate η (for example, 0.01).
  3. Set a threshold Thr (for example, 0.001).
  4. Set a counter T = 0.
  5. While ||W(t) - W(t-1)||F > Thr:
    1. Set ΔW = 0 (same shape as W)
    2. For each x in X:
      1. Set T = T + 1 
      2. Compute y = W(t)x 
      3. Compute and accumulate ΔW += yxT - Tril(yyT)W(t) (the learning rate is applied in the next step)
    3.  Update W(t+1) = W(t) + (η / T)ΔW
    4. Set W(t+1) = W(t+1) / ||W(t+1)||(rows) (the norm must be computed row-wise)

The algorithm can also be iterated for a fixed number of epochs (as in our example), or the two stopping criteria can be combined, as in the sketch that follows.
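
Putting all the steps together, a possible self-contained NumPy sketch of the procedure is shown below; the function name sanger_network, the default values, and the synthetic dataset are illustrative assumptions, not the book's actual example:

import numpy as np

def sanger_network(X, m, eta=0.01, thr=0.001, max_epochs=1000):
    # Illustrative sketch of the algorithm described above
    n = X.shape[1]
    rng = np.random.default_rng(1000)
    W = rng.normal(size=(m, n), scale=0.1)        # step 1: random initialization, shape (m, n)
    T = 0                                         # step 4: sample counter
    for _ in range(max_epochs):
        W_prev = W.copy()
        dW = np.zeros_like(W)                     # step 5.1: reset the accumulator
        for x in X:                               # step 5.2: loop over the samples
            T += 1
            y = W @ x                             # step 5.2.2
            dW += np.outer(y, x) - np.tril(np.outer(y, y)) @ W   # step 5.2.3
        W = W + (eta / T) * dW                    # step 5.3: decaying learning rate
        W /= np.linalg.norm(W, axis=1, keepdims=True)            # step 5.4: row-wise normalization
        if np.linalg.norm(W - W_prev) < thr:      # Frobenius norm below the threshold: stop
            break
    return W

# Usage example on a synthetic zero-centered dataset (illustrative)
rng = np.random.default_rng(1000)
X = rng.multivariate_normal(np.zeros(3),
                            [[3.0, 1.0, 0.2], [1.0, 2.0, 0.5], [0.2, 0.5, 1.0]],
                            size=500)
X = X - X.mean(axis=0)
W = sanger_network(X, m=2, eta=0.1)   # a larger (but still stable) η speeds up convergence
print(W)   # each row is an estimate of one of the first two principal components (up to sign)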
