Differentiating inputs with Siamese networks

Let's see how the similarity function is implemented through Siamese networks. The idea was popularized by a paper published by Taigman et al. in 2014, DeepFace: Closing the Gap to Human-Level Performance in Face Verification. Then we will see how Siamese networks learn, by giving a slightly more formal definition.

First, we will continue to use convolutional architectures with many convolutional layers:

These are followed by fully connected layers of neurons, and a softmax layer for the prediction.

Let's feed in the first image we want to compare, X1:

What we will do is, through a forward pass, grab the activation values of the last fully connected layer, and we will refer to those values as F(x1), or sometimes the encoded values of the image, because the forward pass transforms the image into another set of values, namely the activations of the last fully connected layer:

These values will be saved in memory. We will repeat the same for the second image we want to compare, X2:

We'll do a forward pass, and then we will obtain F(x2), the encoded values for the second image:

Notice that the network stays the same for both of the images, and that is where the Siamese name comes from: we are using the same network, with the same weights, for both images. In practice, these forward passes happen in parallel.
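
To make this concrete, here is a minimal sketch of the two forward passes through a single shared encoder, written in PyTorch. It is not the book's code: the layer sizes, the 96 x 96 input resolution, and the 128-dimensional encoding are illustrative assumptions.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """A small convolutional network whose last fully connected
    layer produces the encoding F(x)."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 96 x 96 input -> 24 x 24 feature maps after two 2 x 2 poolings
        self.fc = nn.Linear(64 * 24 * 24, embedding_dim)

    def forward(self, x):
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))   # F(x): the encoded values

encoder = Encoder()                     # one network, one set of weights

x1 = torch.randn(1, 3, 96, 96)          # stands in for the first image, X1
x2 = torch.randn(1, 3, 96, 96)          # stands in for the second image, X2

f_x1 = encoder(x1)                      # forward pass -> F(x1)
f_x2 = encoder(x2)                      # same network -> F(x2)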

By now, it is clear that the softmax layer is redundant, so we didn't use it at all. Let's remove it and, instead, replace it with the difference of the encoded values between the two images:

Only when the encoded values are similar, which means that this difference is close to zero, will we predict that the two images are the same. Otherwise, if the difference between the encoded values is large, it means that the images are different as well. Since, by means of the forward pass, we obtain a transformation that describes the image itself, different encoded values indirectly mean that the images themselves are different.
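
Continuing the sketch above, the comparison itself can be as simple as the squared Euclidean distance between the two encodings, with a threshold (an illustrative value here, tuned on validation data in practice) deciding whether the prediction is "same" or "different":

# Squared Euclidean distance between the encoded values
d = torch.sum((f_x1 - f_x2) ** 2)

threshold = 1.0                     # illustrative value, not from the book
same_image = d.item() < threshold   # close to zero -> predict "same"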

As you will recall from the previous section, this is exactly what we referred to as the similarity function d:

Here, d denotes the distance between the encoded activations of the last layer of some deep convolutional network, such as an Inception network. Since this is just a comparison, a question may arise: why don't we compare the image pixels directly, instead of using a forward pass to obtain the activation values of the last layer? The reason that would not work well is that even a slight change in image lighting makes the pixel distance very large, while the encoded values offer a much more robust representation of the image, since the neural network has already learned, through intensive training on a lot of data, which differences to pay attention to.
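
Written out, the similarity function is commonly taken to be the squared norm of the difference between the two encodings (one standard choice; other distance measures are also used in practice):

d(x1, x2) = || F(x1) - F(x2) ||^2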
