Sanger's network

Sanger's network is a neural network model for online Principal Component extraction, proposed by T. D. Sanger in Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network (Sanger T. D., Neural Networks, 1989/2). The author started with the standard version of Hebb's rule and modified it to be able to extract a variable number of principal components (v1, v2, ..., vm) in descending order (λ1 > λ2 > ... > λm). The resulting approach, which is a natural extension of Oja's rule, is called the Generalized Hebbian Algorithm (GHA), also known as Generalized Hebbian Learning. The structure of the network is represented in the following diagram:

The network is fed with samples extracted from an n-dimensional dataset:

X = {x1, x2, ..., xM}, with xi ∈ ℝⁿ for every i

The m output neurons are connected to the input through a weight matrix, W = {wij}, where the first index refers to the output neuron (post-synaptic unit) and the second one to the input component (pre-synaptic unit). The output of the network can be easily computed with a scalar product (y = Wx, where each yi is the dot product between the ith row of W and x); however, in this case, we are not interested in it, because, just like for the covariance (and Oja's) rules, the principal components are extracted through the weight updates.
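
Even though the output itself isn't the quantity of interest, y is needed inside the weight updates, so it's worth fixing the shape conventions with a minimal NumPy sketch (the dataset X, the sizes n and m, and the weight matrix W are illustrative assumptions, not the values of the actual example):

import numpy as np

rng = np.random.default_rng(1000)
n, m = 5, 3                      # input dimensionality and number of components to extract (illustrative)
X = rng.normal(size=(100, n))    # dataset: 100 samples, each one an n-dimensional vector
W = rng.normal(size=(m, n))      # weight matrix: one row per output neuron

x = X[0]
y = W @ x                        # y[i] is the scalar product between the i-th row of W and x
print(y.shape)                   # (3,)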

The problem that arose after the formulation of Oja's rule was about the extraction of multiple components. In fact, if we applied the original rule to the previous network, all the weight vectors (the rows of W) would converge to the first principal component. The main idea for overcoming this limitation (based on the Gram-Schmidt orthonormalization method) comes from the observation that once we have extracted the first component w1, the second one, w2, can be forced to be orthogonal to w1, the third one, w3, can be forced to be orthogonal to w1 and w2, and so on. Consider the following representation:

Orthogonalization of two weight vectors

In this case, we are assuming that w1 is stable and w20 is another weight vector that is converging to w1. The projection of w20 onto w1 is as follows:

proj(w20) = ((w20 · w1) / ||w1||²) w1

In the previous formula, the norm can be omitted when w1 is kept normalized (in the network, this normalization is performed after a complete weight update). The orthogonal component of w20 is simply obtained with a difference:

w20⊥ = w20 - proj(w20)
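
A minimal NumPy sketch of this orthogonalization step follows (w1 and w20 contain arbitrary illustrative values):

import numpy as np

w1 = np.array([1.0, 2.0, 0.5])     # stable weight vector (illustrative values)
w20 = np.array([0.8, 1.5, -1.0])   # second weight vector, still converging (illustrative values)

# Projection of w20 onto w1 (the norm factor disappears if w1 is kept normalized)
proj = (np.dot(w20, w1) / np.dot(w1, w1)) * w1

# Orthogonal component obtained with a simple difference
w20_orth = w20 - proj
print(np.dot(w20_orth, w1))        # ~0.0: the residual is orthogonal to w1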

Applying this method to the original Oja's rule, we obtain a new expression for the weight update (called Sanger's rule):

Δwij = η(yi xj - yi Σ(k=1..i) wkj yk)

The rule refers to a single input vector x; hence, xj is the jth component of x. The first term is the classic Hebb's rule, which forces the weight vector to become parallel to the first principal component, while the second one acts in a way similar to the Gram-Schmidt orthogonalization, by subtracting a term proportional to the projections of w onto the weights connected to the previous post-synaptic units, while also enforcing the normalization constraint provided by Oja's rule (the term proportional to the square of the output yi).

In fact, expanding the last term, we get the following:

Δwij = η(yi xj - w1j y1 yi - w2j y2 yi - ... - wij yi²)

The term subtracted from each component wij is proportional to all the weights wkj where the index j is fixed and the first index k is equal to 1, 2, ..., i. This procedure doesn't produce an immediate orthogonalization, but requires several iterations to converge. The proof is non-trivial, involving convex optimization and dynamic systems methods, but it can be found in the aforementioned paper. Sanger showed that the algorithm always converges to the first m principal components, sorted from the largest eigenvalue to the smallest one, if the learning_rate η(t) decreases monotonically and converges to zero when t → ∞. Even if it is necessary for the formal proof, this condition can be relaxed in practice (a stable η < 1 is normally sufficient). In our implementation, the matrix W is normalized after each iteration, so that, at the end of the process, its rows (the weight vectors) are orthonormal and constitute a basis for the eigenvector subspace.
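
As a concrete (and deliberately non-vectorized) illustration, the following NumPy sketch applies the element-wise rule to a single sample; the names W, x, and eta, as well as their values, are assumptions made only for this example:

import numpy as np

rng = np.random.default_rng(1000)
n, m = 5, 3
W = rng.normal(size=(m, n), scale=0.1)   # current weights, one row per output neuron (illustrative)
x = rng.normal(size=n)                   # single input sample (illustrative)
eta = 0.01

y = W @ x
dW = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        # Hebbian term minus the projection onto the current and previous neurons
        dW[i, j] = eta * (y[i] * x[j] - y[i] * np.sum(W[:i + 1, j] * y[:i + 1]))
W += dW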

In matrix form, the rule becomes as follows:

ΔW = η(yxT - Tril(yyT)W), with y = Wx

Tril(•) is a matrix function that transforms its argument into a lower-triangular matrix (setting to zero all the elements above the main diagonal), and the term yyT is equal to WxxTWT.
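
In NumPy, the Tril(•) operator corresponds to np.tril, so the whole update collapses into a couple of matrix products. The sketch below (same illustrative names and values as before) also verifies that the matrix form matches the element-wise rule:

import numpy as np

rng = np.random.default_rng(1000)
n, m = 5, 3
W = rng.normal(size=(m, n), scale=0.1)
x = rng.normal(size=n)
eta = 0.01

y = W @ x
# Matrix form: dW = eta * (y xT - Tril(y yT) W)
dW = eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Element-wise version of the same rule, for comparison
dW_check = np.array([[eta * (y[i] * x[j] - y[i] * np.dot(W[:i + 1, j], y[:i + 1]))
                      for j in range(n)] for i in range(m)])
print(np.allclose(dW, dW_check))   # True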

The algorithm for a Sanger's network is as follows:

  1. Initialize W(0) with random values. If the input dimensionality is n and m principal components must be extracted, the shape will be (m × n).
  2. Set a learning_rate η (for example, 0.01).
  3. Set a threshold Thr (for example, 0.001).
  4. Set a counter T = 0.
  5. While ||W(t) - W(t-1)||F > Thr:
    1. Set ΔW = 0 (same shape as W)
    2. For each x in X:
      1. Set T = T + 1 
      2. Compute y = W(t)x 
      3. Compute and accumulate ΔW += yxT - Tril(yyT)W(t) (the learning rate is applied in the next step)
    3.  Update W(t+1) = W(t) + (η / T)ΔW
    4. Set W(t+1) = W(t+1) / ||W(t+1)||(rows) (the norm must be computed row-wise)

The algorithm can also be iterated for a fixed number of epochs (as in our example), or the two stopping criteria can be combined, as in the sketch that follows.
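
Putting all the steps together, a possible self-contained NumPy sketch of the procedure is shown below; the function name sanger_network, the default values, and the synthetic dataset are illustrative assumptions, not the book's actual example:

import numpy as np

def sanger_network(X, m, eta=0.01, thr=0.001, max_epochs=1000):
    # Illustrative sketch of the algorithm described above
    n = X.shape[1]
    rng = np.random.default_rng(1000)
    W = rng.normal(size=(m, n), scale=0.1)        # step 1: random initialization, shape (m, n)
    T = 0                                         # step 4: sample counter
    for _ in range(max_epochs):
        W_prev = W.copy()
        dW = np.zeros_like(W)                     # step 5.1: reset the accumulator
        for x in X:                               # step 5.2: loop over the samples
            T += 1
            y = W @ x                             # step 5.2.2
            dW += np.outer(y, x) - np.tril(np.outer(y, y)) @ W   # step 5.2.3
        W = W + (eta / T) * dW                    # step 5.3: decaying learning rate
        W /= np.linalg.norm(W, axis=1, keepdims=True)            # step 5.4: row-wise normalization
        if np.linalg.norm(W - W_prev) < thr:      # Frobenius norm below the threshold: stop
            break
    return W

# Usage example on a synthetic zero-centered dataset (illustrative)
rng = np.random.default_rng(1000)
X = rng.multivariate_normal(np.zeros(3),
                            [[3.0, 1.0, 0.2], [1.0, 2.0, 0.5], [0.2, 0.5, 1.0]],
                            size=500)
X = X - X.mean(axis=0)
W = sanger_network(X, m=2, eta=0.1)   # a larger (but still stable) η speeds up convergence
print(W)   # each row is an estimate of one of the first two principal components (up to sign)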
