Transferring style from one image to another 

Imagine being able to have one of the greatest painters in history, such as Vincent van Gogh or Pablo Picasso, recreate a photo of your choosing in their own unique style. In a nutshell, this is what style transfer allows us to do. Quite simply, it's the process of generating an image that combines the content of one image with the style of another, as shown here:

In this section, we will describe, albeit at a high level, how this works and then move on to an alternative that allows us to perform a similar process in significantly less time.

I encourage you to read the original paper, A Neural Algorithm of Artistic Style, by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, for a more comprehensive overview. This paper is available at https://arxiv.org/abs/1508.06576.

At this stage, we have learned that neural networks learn by iteratively reducing a loss, calculated using a specified cost function that indicates how well the neural network did with respect to the expected output. The difference between the predicted output and the expected output is then used to adjust the model's weights, through a process known as backpropagation, so as to minimize this loss.

The preceding description (intentionally) skips the details of this process as our goal here is to provide an intuitive understanding, rather than the granular details. I recommend reading Andrew Trask's Grokking Deep Learning for a gentle introduction to the underlying details of neural networks.
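If you would like a small, concrete picture of this loop anyway, the following is a minimal, purely illustrative sketch (a toy linear model in NumPy, not code from this chapter) of computing a loss and nudging a weight to reduce it:

```python
import numpy as np

# Toy example: a "model" that predicts x * w, a mean squared error loss,
# and a gradient descent loop that adjusts w to reduce that loss.
def mse_loss(predicted, expected):
    return np.mean((predicted - expected) ** 2)

x, expected = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
w = 0.5
for _ in range(100):
    predicted = x * w
    grad = np.mean(2 * (predicted - expected) * x)  # d(loss)/d(w)
    w -= 0.1 * grad                                  # gradient descent step
print(w)  # converges towards 2.0
```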

Unlike the classification models we have worked with thus far, where the output is a probability distribution across some set of labels, we are instead interested in the model's generative abilities. That is, instead of adjusting the model's weights, we want to adjust the generated image's pixel values so as to reduce some defined cost function.
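To sketch what this looks like in practice, here is a hedged example using TensorFlow (my own illustration, not the chapter's implementation; `some_loss` is a hypothetical stand-in for a real loss computed from network activations). The key point is that the image, not the model, is the variable being optimized:

```python
import tensorflow as tf

# The generated image itself is the trainable variable.
generated = tf.Variable(tf.random.uniform((1, 224, 224, 3)))

def some_loss(image):
    # Placeholder for a real content/style loss derived from network activations.
    return tf.reduce_mean(tf.square(image - 0.5))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)
for _ in range(100):
    with tf.GradientTape() as tape:
        loss = some_loss(generated)
    grads = tape.gradient(loss, generated)
    optimizer.apply_gradients([(grads, generated)])  # update pixels, not weights
```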

So, if we were to define a cost function that could measure the loss between the generated image and the content image, and another to measure the loss between the generated image and the style image, we could simply combine them. This gives us the overall loss, which we use to adjust the generated image's pixel values, creating something that has the target content in the target style, as illustrated in the following image:
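In code, the combination is no more than a weighted sum. The following minimal sketch assumes `content_loss` and `style_loss` have already been computed; `alpha` and `beta` are illustrative weighting hyperparameters, not values taken from the paper:

```python
# Hedged sketch: the overall loss is a weighted sum of the two individual losses.
def total_loss(content_loss, style_loss, alpha=1.0, beta=1e-3):
    return alpha * content_loss + beta * style_loss
```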

At this point, we have a general idea of the required process; what is left is building some intuition behind these cost functions. That is, how do you determine how well your generated image captures the content of the content image and the style of the style image? For this, we will backtrack a little and review what the layers of a CNN learn by inspecting their activations.

The details and images demonstrating what convolutional neural networks (CNNs) learn have been taken from the paper Visualizing and Understanding Convolutional Networks, by Matthew D. Zeiler and Rob Fergus, which is available at https://arxiv.org/abs/1311.2901.

A typical architecture of a CNN consists of a series of convolutional and pooling layers, which are then fed into a fully connected network (in the case of classification), as illustrated in this image:

This flat representation misses an important property of a CNN, which is how, after each subsequent pair of convolution and pooling layers, the input's width and height reduce in size. The consequence of this is that the receptive field increases with depth into the network; that is, deeper layers have a larger receptive field and thus capture higher-level features than shallower layers.
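As a hedged illustration of this kind of architecture (illustrative layer sizes, not a specific network from this chapter), a typical stack of convolution and pooling layers followed by a fully connected classifier might look like this in tf.keras:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Alternating convolution and pooling layers; each pooling step halves the
# width and height, which is what grows the receptive field with depth.
model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # fully connected classifier head
    layers.Dense(10, activation="softmax"),
])
model.summary()
```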

To better illustrate what each layer learns, we will again refer to the paper by Matthew D. Zeiler and Rob Fergus referenced previously. In their paper, they pass images from their training set through the network to identify the image patches that maximize each layer's activations; by visualizing these patches, we get a sense of what each neuron (hidden unit) at each of the layers learns. Here is a screenshot showing some of these patches across a CNN:

Source: Visualizing and Understanding Convolutional Networks, Matthew D. Zeiler and Rob Fergus

What you can see in the preceding figure are nine image patches that maximize an individual hidden unit at each of the layers of this particular network. What has been omitted from the preceding figure is the variation in size; that is, the deeper you go, the larger the image patches become.

What is hopefully obvious from the preceding image is that the shallower layers extract simple features; for example, we can see that a single hidden unit at Layer 1 is activated by a diagonal edge, and a single hidden unit at Layer 2 is activated by a vertically striped patch. The deeper layers, in contrast, extract higher-level, more complex features; again, in the preceding figure, we can see that a single hidden unit at Layer 4 is activated by patches of dog faces.

Let's return to our task of defining a cost function for content and style, starting with the cost function for content. Given a content image and a generated image, we want to measure how close they are so that we can minimize the difference and retain the content. We can achieve this by selecting one of the deeper layers of our CNN, which, as we saw earlier, has a large receptive field and captures complex features. We pass both the content image and the generated image through the network and measure the distance between their activations at this layer. This will hopefully seem logical given that the deeper layers learn complex features, such as a dog's face or a car, but decouple them from lower-level features such as edges, color, and textures. The following figure depicts this process:
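As a minimal sketch of this idea (assuming a hypothetical `deep_layer_activations` helper that runs an image through a pre-trained CNN and returns the chosen layer's output), the content loss can be expressed as the mean squared difference between the two sets of activations:

```python
import numpy as np

# Hedged sketch: compare the chosen deep layer's activations for the
# content image and the generated image.
def content_loss(content_activations, generated_activations):
    return np.mean((content_activations - generated_activations) ** 2)

# content_acts = deep_layer_activations(content_image)      # hypothetical helper
# generated_acts = deep_layer_activations(generated_image)  # hypothetical helper
# loss = content_loss(content_acts, generated_acts)
```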

This takes care of our cost function for content, which can be easily tested by running a network that implements it. If implemented correctly, it should result in a generated image that looks similar to the input (content image). Let's now turn our attention to measuring style.

We saw in the preceding figure that the shallower layers of a network learn simple features such as edges, textures, and color combinations. This gives us a clue as to which layers would be useful when trying to measure style, but we still need a way of extracting and measuring it. However, before we start, what exactly is style?

A quick search on http://www.dictionary.com/ reveals that style is defined as a distinctive appearance, typically determined by the principles according to which something is designed. Let's take Katsushika Hokusai's The Great Wave off Kanagawa as an example:

The Great Wave off Kanagawa is the output of a process known as woodblock printing, where an artist's sketch is broken down into layers (carved wooden blocks), with each layer (usually one for each color) used to reproduce the art piece. It's similar to a manual printing press, and the process produces a distinctively flat and simplistic style. Another dominant stylistic trait (and possibly a side effect of the process) that can be seen in the preceding image is the limited range of colors used; for example, the water consists of no more than four colors.

We can capture style in the way defined in the paper A Neural Algorithm of Artistic Style, by L. Gatys, A. Ecker, and M. Bethge: using a style matrix (also known as a Gram matrix) to find the correlations between the activations across the different channels of a given layer. It is these correlations that define the style, and it is these that we can use to measure the difference between our style image and generated image, and thereby influence the style of the generated image.
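A minimal sketch of the style matrix for a single layer might look like the following (assuming the layer's activations are given as a NumPy array of shape height x width x channels):

```python
import numpy as np

# Hedged sketch of the style (Gram) matrix: flatten the spatial dimensions
# and compute the correlations between every pair of channels.
def gram_matrix(activations):
    h, w, c = activations.shape
    features = activations.reshape(h * w, c)  # one row per spatial position
    return features.T @ features              # (channels x channels) correlations
```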

To make this more concrete, borrowing from an example used by Andrew Ng in his Coursera course on deep learning, let's take Layer 2 from the earlier example. What the style matrix calculates is the correlation across all channels for a given layer. If we use the following illustration, showing nine activations from two channels, we can see that a correlation exists between vertical textures in the first channel and orange patches in the second channel. That is, when we see a vertical texture in the first channel, we would expect the image patches that maximize the second channel's activations to have an orange tint:

This style matrix is calculated for both the style image and the generated image, with our optimization forcing the generated image to adopt these correlations. With both style matrices calculated, we can then calculate the loss by simply finding the sum of squared differences between the two matrices. The following figure illustrates this process, as we did previously when describing the content loss function:
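Continuing the earlier sketch (and reusing the `gram_matrix` function from it), the per-layer style loss can then be expressed as:

```python
import numpy as np

# Hedged sketch of the style loss for one layer: the sum of squared
# differences between the Gram matrices of the style image and the
# generated image. Assumes gram_matrix from the earlier sketch.
def style_loss(style_activations, generated_activations):
    diff = gram_matrix(style_activations) - gram_matrix(generated_activations)
    return np.sum(diff ** 2)
```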

With that, we have concluded our introduction to style transfer; hopefully it has given you some intuition about how we can use a network's perceptual understanding of images to extract content and style. This approach works well, but it has one drawback, which we will address in the next section.
