Style cost function

Let's see how to get the correlations once we have the activations. Correlation is calculated simply as the multiplication of the activation values: if the multiplication result is high, we say that the two activations are correlated; if they are different and the multiplication result is low, they aren't correlated. Since channels are feature detectors, we are more interested in channel correlation than in specific activations. So we are going to multiply all the activations between two channels, and then sum them up to get a single value, as shown in the following image; that value will tell us the degree to which those two channels are correlated:

For example, consider the first channel. We multiply all of its activations with the second channel's activations, and then total them. That gives us g12, which is simply the degree of correlation between the first channel and the second channel. We continue in the same way with the third channel and the fourth channel, so g13 and g14 give us, respectively, the degree of correlation between the first channel and the third, and between the first channel and the fourth. Then the process starts all over again, but now from the second channel. We multiply all of its activation values with the first channel and total them to produce g21, which is simply the correlation between the second channel and the first one. We do the same for the others, so g24 is the correlation between the second and the fourth channel.
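As a minimal sketch of this multiply-and-sum operation (the values and the 3x3 channel shape below are made up for illustration), g12 can be computed like this:

```python
import numpy as np

# Hypothetical 3x3 activation maps for two channels of the same layer
channel_1 = np.array([[0.9, 0.8, 0.1],
                      [0.7, 0.9, 0.2],
                      [0.8, 0.6, 0.1]])
channel_2 = np.array([[0.8, 0.9, 0.0],
                      [0.6, 0.7, 0.1],
                      [0.9, 0.5, 0.2]])

# g12: multiply the activations element-wise and total them up;
# a high value means the channels tend to fire together (correlated)
g12 = np.sum(channel_1 * channel_2)
print(g12)
```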

Then the process starts from the third channel, and the corresponding row gives us the correlation of the third channel with all the other channels; the fourth channel's row does the same for the fourth channel. Now observe the diagonal: g11, g22, g33, and g44. Here we are calculating the correlation of a channel with itself. The reason we do that is that the activation correlation within a single channel measures how prominent that channel's feature is. It tells us how much the feature the channel is capturing impacts the image as a whole, in other words, how widespread the feature detected by the channel is across the image.

For instance, if a channel's correlation with itself is high, for example if g33 is high and the channel captures horizontal lines, this means that the image has a lot of horizontal lines. This operation is quite useful because it captures the style of the image as a whole, in other words, the prevalence of a feature detected by a channel throughout the entire image. As we know, for performance reasons, neural network operations are done through matrices. Therefore the correlation is computed as a matrix, called the gram matrix, or the style matrix.

Let's see how the gram matrix is calculated:

First, we need to turn the preceding three-dimensional structure into a two-dimensional one. We do that by simply putting every value of the first channel, which is two-dimensional, into a row, and then doing the same for the other channels. So, each row represents the activation values of one channel. Next, we simply multiply the matrix with itself, but in its transposed form.

So we turn rows into columns. Then, according to the rules of matrix multiplication, the first row's activations are multiplied element by element with the first column and summed up to give the first cell value, which is g11, the correlation of the first channel with itself. For the next cell, we multiply the first row with the second column and sum up, which gives us g12; the first row represents the first channel and the second column represents the second channel, so g12 is exactly the correlation between the first channel and the second one. We then do the same for the other rows and columns. For example, the third row is at some point multiplied with the second column, which gives us g32, the correlation between the third channel and the second channel, because the second column is just the second channel's activations. In the end, we obtain the gram matrix in a very performant way; it is really fast, because each of these correlations can be computed in parallel, thanks to the matrix form.
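Here is a hedged numpy sketch of the whole computation, assuming a hypothetical layer with 4 channels of 3x3 activations (random values stand in for real activations):

```python
import numpy as np

n_channels, height, width = 4, 3, 3
activations = np.random.rand(n_channels, height, width)

# Unroll the 3D volume: one row per channel
unrolled = activations.reshape(n_channels, height * width)  # shape (4, 9)

# Multiply the matrix by its own transpose: cell (i, j) is the
# multiply-and-sum correlation between channel i and channel j
gram = unrolled @ unrolled.T                                # shape (4, 4)

# Sanity check: g12 equals the element-wise product of channels 1 and 2, summed
assert np.isclose(gram[0, 1], np.sum(activations[0] * activations[1]))
print(gram)
```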

Ultimately, what we will do is simply compare the gram matrices of the style image and the generated image, and what we want is for those gram matrices to be as similar as possible; since they capture the style, close similarity means that the generated image and the style image share almost the same style:

Let's see more formally how that is done. First, we have the general cost function, which we have already seen:
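The original shows this formula as an image; assuming the standard formulation for neural style transfer, it combines the content and style costs through two weighting coefficients:

$$J(G) = \alpha \, J_{content}(C, G) + \beta \, J_{style}(S, G)$$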

Now let's see the style cost function. It is defined like this: 1/4, multiplied by the squared dimensions of the layer we picked, and in a moment we will see why the dimensions are squared in the style cost function. Then we multiply by the squared difference of the gram matrices, as shown in the following formula:
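The formula image is not reproduced here; reconstructing it from the description above (1/4 times the squared layer dimensions, times the squared difference of the gram matrices), the style cost for a chosen layer l should look like this, where n_H, n_W, and n_C are the layer's height, width, and number of channels, and G^(S) and G^(G) are the gram matrices of the style and generated images:

$$J_{style}^{[l]}(S, G) = \frac{1}{4 \left( n_C^{[l]} n_H^{[l]} n_W^{[l]} \right)^2} \sum_{i=1}^{n_C} \sum_{j=1}^{n_C} \left( G_{ij}^{[l](S)} - G_{ij}^{[l](G)} \right)^2$$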

This is quite similar to what we saw in the content cost function, but instead of the activations, we simply have the gram matrices:

As you will recall from the previous section, the gram matrix is defined as the activation multiplied by the same activation, transposed. So if we think of this as just the square of the activation, then the derivative will be two multiplied by the transposed activation. This also explains why we squared the layer's dimensions: we don't have just one activation here, but two, the same activation multiplied by itself. We therefore need to multiply the dimensions by themselves as well, to keep everything consistent.
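In symbols, with A the unrolled activation matrix (a loose sketch of the intuition in the text, not a rigorous matrix derivative):

$$G = A A^{T} \approx A^{2} \quad \Rightarrow \quad \frac{\partial G}{\partial A} \approx 2 A^{T}$$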

Since the activation is multiplied by itself, we also need to multiply the dimensions. Now, the cost function only gives us a hint of how similar the two gram matrices are, or how similar the styles of the two images are. In order to turn that into feedback that brings the images closer together, we need the derivative of the cost function. The derivative is calculated as follows:
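The formula appears as an image in the original; reconstructing it from the description below (the 1/4 cancelled by the 2 x 2 that comes from differentiation, the simple difference of the gram matrices, and an extra activation term from the derivative of the gram matrix), the derivative with respect to the generated image's unrolled activations should be:

$$\frac{\partial J_{style}^{[l]}}{\partial A^{[l](G)}} = \frac{1}{\left( n_C^{[l]} n_H^{[l]} n_W^{[l]} \right)^2} \left( G^{[l](G)} - G^{[l](S)} \right) A^{[l](G)}$$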

As we can see in the preceding formula, one divided by the squared dimensions does not change, because it is just a constant. The four goes away because the two that comes from the derivative of the gram matrix is multiplied by the two that comes out in front, and four divided by four gives us one; what remains is the multiplication by the simple difference between the gram matrices. In comparison to the content cost function, we have an additional term, which is basically the transposed activation that comes from the derivative of the gram matrix, where two is multiplied by the transposed activation. There is one final detail pertaining to the style cost function: we are going to use several layers' feedback instead of just one, as we will see in the following section, when we build a neural network that produces art.
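Putting it together, here is a hedged numpy sketch of the style cost for a single layer and its analytic gradient, following the formulas reconstructed above (the function names and the stand-in random activations are illustrative, not from the book):

```python
import numpy as np

def gram(activations):
    """Unroll (channels, height, width) activations and return A @ A.T."""
    n_c = activations.shape[0]
    a = activations.reshape(n_c, -1)
    return a @ a.T

def style_cost_and_grad(a_style, a_gen):
    """Style cost for one layer and its gradient w.r.t. the generated activations."""
    n_c, n_h, n_w = a_gen.shape
    norm = (n_c * n_h * n_w) ** 2
    g_s, g_g = gram(a_style), gram(a_gen)
    diff = g_g - g_s
    cost = np.sum(diff ** 2) / (4.0 * norm)
    # The 2 from the squared difference and the 2 from the gram derivative
    # cancel the 1/4, leaving (G_gen - G_style) @ A divided by the constant
    grad = (diff @ a_gen.reshape(n_c, -1)) / norm
    return cost, grad.reshape(n_c, n_h, n_w)

a_style = np.random.rand(4, 3, 3)  # stand-in activations of the style image
a_gen = np.random.rand(4, 3, 3)    # stand-in activations of the generated image
cost, grad = style_cost_and_grad(a_style, a_gen)
print(cost, grad.shape)
```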
