Example of SOM

We can now implement an SOM using the Olivetti faces dataset. As the process can be very long, in this example we limit the number of input patterns to 100 (with a 5 × 5 matrix). The reader can try with the whole dataset and a larger map.

The first step is loading the data, normalizing it so that all values are bounded between 0.0 and 1.0, and setting the constants:

import numpy as np

from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces(shuffle=True)

Xcomplete = faces['data'].astype(np.float64) / np.max(faces['data'])

nb_iterations = 5000
nb_startup_iterations = 500
pattern_length = 64 * 64
pattern_width = pattern_height = 64
eta0 = 1.0
sigma0 = 3.0
tau = 100.0

X = Xcomplete[0:100]
matrix_side = 5

At this point, we can initialize the weight matrix using a normal distribution with a small standard deviation:

W = np.random.normal(0, 0.1, size=(matrix_side, matrix_side, pattern_length))

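Now, we need to define the function that determines the winning unit, that is, the unit whose weight vector has the least distance from the input pattern:

def winning_unit(xt):
    # Euclidean distance between the input pattern and every weight vector
    distances = np.linalg.norm(W - xt, ord=2, axis=2)
    # The winner is the unit with the smallest distance
    min_distance_unit = np.argmin(distances)
    # Convert the flat index into (row, column) coordinates on the map
    return int(min_distance_unit // matrix_side), int(min_distance_unit % matrix_side)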
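The conversion from a flat index to map coordinates can also be delegated to NumPy. The following is only a sketch of an equivalent formulation: the name winning_unit_unravel and the toy 3 × 3 map are hypothetical and are not part of the original example. The weights are passed as an argument so that the snippet is self-contained.

```python
import numpy as np

def winning_unit_unravel(W, xt):
    """Return the (row, column) of the unit whose weights are closest to xt."""
    # Euclidean distance between xt and every weight vector of the map
    distances = np.linalg.norm(W - xt, ord=2, axis=2)
    # argmin works on the flattened array; unravel_index maps it back to 2D
    row, col = np.unravel_index(np.argmin(distances), distances.shape)
    return int(row), int(col)

# Toy 3 x 3 map with 4-dimensional weights (hypothetical data)
rng = np.random.RandomState(1000)
W_toy = rng.normal(0.0, 0.1, size=(3, 3, 4))

# A pattern almost identical to the weights of unit (2, 1)
x_toy = W_toy[2, 1] + 1e-3
winner = winning_unit_unravel(W_toy, x_toy)
```

In this toy case, the pattern is built from the weights of unit (2, 1), so that unit is expected to win.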

It's also useful to define the functions η(t) and σ(t):

def eta(t):
    return eta0 * np.exp(-float(t) / tau)

def sigma(t):
    return float(sigma0) * np.exp(-float(t) / tau)
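To get a feeling for how fast the two schedules decay, it can help to evaluate them at a few time steps. The sketch below repeats the definitions with the constants used in this example (eta0 = 1.0, sigma0 = 3.0, tau = 100.0) so that it runs on its own; the dictionary name schedule is an illustrative choice.

```python
import numpy as np

eta0, sigma0, tau = 1.0, 3.0, 100.0

def eta(t):
    return eta0 * np.exp(-float(t) / tau)

def sigma(t):
    return float(sigma0) * np.exp(-float(t) / tau)

# Evaluate both schedules at the start, after one time constant,
# and at the end of the 500 startup iterations
schedule = {t: (eta(t), sigma(t)) for t in (1, 100, 500)}

for t, (eta_t, sigma_t) in schedule.items():
    print('t = {}: eta = {:.4f}, sigma = {:.4f}'.format(t, eta_t, sigma_t))
```

After one time constant (t = 100), both values have dropped to about 37% of their initial value (η ≈ 0.37, σ ≈ 1.10), and at t = 500 they are close to zero (η ≈ 0.007, σ ≈ 0.02).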

As explained before, instead of computing the radial basis function for each unit, it's preferable to use a precomputed distance matrix (in this case, 5 × 5 × 5 × 5) containing all the possible squared distances between pairs of units. In this way, NumPy allows a faster calculation thanks to its vectorization features:

precomputed_distances = np.zeros((matrix_side, matrix_side, matrix_side, matrix_side))

for i in range(matrix_side):
    for j in range(matrix_side):
        for k in range(matrix_side):
            for t in range(matrix_side):
                precomputed_distances[i, j, k, t] = \
                    np.power(float(i) - float(k), 2) + np.power(float(j) - float(t), 2)

def distance_matrix(xt, yt, sigmat):
    dm = precomputed_distances[xt, yt, :, :]
    de = 2.0 * np.power(sigmat, 2)
    return np.exp(-dm / de)
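For larger maps, the four nested loops can become slow. As a sketch of an alternative (not the approach used in the original example), the same tensor of squared distances can be built with broadcasting. The names loop_distances, vectorized_distances, and rbf are illustrative; the snippet also checks that the two versions agree.

```python
import numpy as np

matrix_side = 5

# Loop-based version, equivalent to the one used in the text
loop_distances = np.zeros((matrix_side, matrix_side, matrix_side, matrix_side))

for i in range(matrix_side):
    for j in range(matrix_side):
        for k in range(matrix_side):
            for t in range(matrix_side):
                loop_distances[i, j, k, t] = (i - k) ** 2 + (j - t) ** 2

# Broadcasting version: sq_diff[a, b] = (a - b)^2 for every pair of indices
idx = np.arange(matrix_side, dtype=np.float64)
sq_diff = (idx[:, None] - idx[None, :]) ** 2

# D[i, j, k, t] = (i - k)^2 + (j - t)^2
vectorized_distances = sq_diff[:, None, :, None] + sq_diff[None, :, None, :]

# Radial basis function centered on unit (2, 2) with sigma = 1.0
rbf = np.exp(-vectorized_distances[2, 2] / (2.0 * 1.0 ** 2))
```

The radial basis function is equal to 1.0 at the center and decreases with the squared distance from it, so the neighbors of the winning unit receive smaller updates.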

The distance_matrix function returns the value of the radial basis function for the whole map, given the center point (the winning unit) xt, yt and the current value of σ (sigmat). Now, it's possible to start the training process (in order to avoid correlations, it's preferable to shuffle the input sequence at the beginning of each iteration):

sequence = np.arange(0, X.shape[0])
t = 0

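for e in range(nb_iterations):
    # Shuffle the input sequence to avoid correlations between iterations
    np.random.shuffle(sequence)
    t += 1

    if e < nb_startup_iterations:
        # Startup phase: exponentially decaying learning rate and radius
        etat = eta(t)
        sigmat = sigma(t)
    else:
        # After the startup phase: constant learning rate and radius
        etat = 0.2
        sigmat = 1.0

    for n in sequence:
        x_sample = X[n]

        xw, yw = winning_unit(x_sample)
        dm = distance_matrix(xw, yw, sigmat)

        # Move every weight vector toward the sample, scaled by the neighborhood
        dW = etat * np.expand_dims(dm, axis=2) * (x_sample - W)
        W += dW

    # Normalize each weight vector at the end of the iteration
    W /= np.linalg.norm(W, axis=2).reshape((matrix_side, matrix_side, 1))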

In this case, after the startup phase, we have set η = 0.2 and σ = 1.0, but I invite the reader to try different values and evaluate the final result. After training for 5000 epochs, we got the following weight matrix (each weight is plotted as a bidimensional array):
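A plot of this kind can be obtained by reshaping each weight vector into a 64 × 64 image and tiling the images into a single mosaic. The following is a minimal sketch: the helper weights_to_mosaic is hypothetical, and random values stand in for the trained W so that the snippet runs on its own.

```python
import numpy as np

def weights_to_mosaic(W, height, width):
    """Tile the weight vectors of a 2D map into a single 2D image."""
    rows, cols, _ = W.shape
    # One (height x width) image per unit
    tiles = W.reshape(rows, cols, height, width)
    # (rows, height, cols, width) -> (rows * height, cols * width)
    return tiles.transpose(0, 2, 1, 3).reshape(rows * height, cols * width)

# Random stand-in for the trained weights (assumption: replace with the actual W)
rng = np.random.RandomState(1000)
W_demo = rng.uniform(size=(5, 5, 64 * 64))

mosaic = weights_to_mosaic(W_demo, 64, 64)

# With Matplotlib installed, the mosaic can be shown with:
#   import matplotlib.pyplot as plt
#   plt.imshow(mosaic, cmap='gray'); plt.axis('off'); plt.show()
```

Each 64 × 64 block of the mosaic corresponds to one unit, in the same row and column position it has on the map.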

As can be seen, the weights have converged to faces with slightly different features. In particular, looking at the shapes of the faces and the expressions, it's easy to notice the transition between different attractors (some faces are smiling, while others are more serious; some have glasses, mustaches, and beards, and so on).

It's also important to consider the size of the matrix with respect to the number of attractors (the complete Olivetti dataset contains ten pictures for each of 40 different individuals). A matrix larger than the minimum capacity allows mapping more patterns that cannot be easily attracted by the right neuron. For example, an individual can have pictures with and without a beard, and this can lead to confusion. If the matrix is too small, it's possible to observe an instability in the convergence process, while if it's too large, it's easy to see redundancies. The right choice depends on each dataset and on its internal variance, and there's no way to define a standard criterion. A good starting point is picking a matrix whose capacity is two to three times the number of desired attractors and then increasing or reducing its size until the accuracy reaches a maximum.

The last element to consider is the labeling phase. At the end of the training process, we have no knowledge about the weight distribution in terms of winning neurons, so it's necessary to process the dataset and annotate the winning unit for each pattern. In this way, it's possible to submit new patterns to get the most likely label. This process is not part of the example above, but it's straightforward and the reader can easily implement it for every different scenario.
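One possible way to implement the labeling phase is sketched here. It is an illustration under explicit assumptions, not the author's implementation: the helpers label_map and predict are hypothetical, each unit receives the most frequent label among the patterns it wins (majority vote), and a tiny hand-made map replaces the trained one. With the Olivetti dataset, the labels could be taken from faces['target'].

```python
import numpy as np
from collections import Counter, defaultdict

def winning_unit(W, xt):
    """Return the (row, column) of the unit whose weights are closest to xt."""
    distances = np.linalg.norm(W - xt, ord=2, axis=2)
    row, col = np.unravel_index(np.argmin(distances), distances.shape)
    return int(row), int(col)

def label_map(W, X, y):
    """Assign to each winning unit the most frequent label among its patterns."""
    votes = defaultdict(Counter)
    for x_sample, label in zip(X, y):
        votes[winning_unit(W, x_sample)][label] += 1
    return {unit: counts.most_common(1)[0][0] for unit, counts in votes.items()}

def predict(W, unit_labels, x_new):
    """Return the label of the winning unit, or None if it never won a pattern."""
    return unit_labels.get(winning_unit(W, x_new))

# Hand-made 2 x 2 map with 3-dimensional weights (hypothetical data)
W_toy = np.array([[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                  [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]])

X_toy = np.array([[0.9, 0.1, 0.0],
                  [1.0, 0.0, 0.1],
                  [0.0, 0.9, 0.1]])
y_toy = [7, 7, 3]

unit_labels = label_map(W_toy, X_toy, y_toy)

label_known = predict(W_toy, unit_labels, np.array([0.8, 0.2, 0.0]))
label_unknown = predict(W_toy, unit_labels, np.array([0.0, 0.0, 1.0]))
```

Units that never win a training pattern remain unlabeled, so predict returns None for them. In a real scenario, a fallback (for example, the label of the closest labeled unit) may be preferable.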

