Building a neural network that produces art

In this section, we are going to build a Java application for creating art. Before jumping into the code, let's look at a high-level description of how it is implemented. We will use transfer learning with the pre-trained VGG-16 architecture, trained on the ImageNet dataset, whose stack of convolution layers is shown in the following diagram:

Basically, as we go deeper into the network, the third dimension (the depth) increases, while the first two dimensions (the width and height) shrink.

First, we are going to feed a content image as input through a forward pass, as shown in the following diagram. Then, we will obtain the values of all the activation layers along with the prediction. But for neural style transfer, we are not interested in the prediction, only in the activation layers, because those layers capture the image features. The following image depicts this clearly:

In the case of the content image, we are going to select one layer, which is shown in the following diagram, and record its activation values. The layer we pick belongs to the fourth convolution group; in total, VGG-16 has five such groups, as shown in the following image:

So there are five groups, which are highlighted in the following diagram. We will select the second layer from the fourth convolution group. Then, we are going to feed in the style image and, similarly, obtain all the activation layers through a forward pass. But instead of selecting only one of them, we will select several:

Another difference is that, instead of using the activations directly, we will transform those activations into gram matrices. The reason we do that, as we learned in the previous section, is that the gram matrices capture the style better than the raw activations. Those gram matrices are labeled g11s, g21s, g31s, and g51s in the following diagram. Finally, the generated image, which at the beginning looks a bit noisy, is fed through the network as well, as shown in the following diagram:
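As a quick reminder of the previous section, the gram matrix of a layer's activations can be written as follows (this is the standard definition; the index notation here is mine, not taken from the diagrams):

G^{[l]}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a^{[l]}_{ijk} \, a^{[l]}_{ijk'}

Here, a^{[l]}_{ijk} is the activation at spatial position (i, j) in channel k of layer l, so G^{[l]} measures how strongly pairs of channels fire together, which is what makes it a good summary of style.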

Similarly, through a forward pass, we are going to get all the gram matrices for the style layers and the activations for the selected content layer:

Now, we will calculate the difference by means of the cost function, then take its derivative and backpropagate that feedback all the way to the generated image, changing it to look a bit more like the content or the style.

So basically, if we take this g51 here, first we calculate the difference, then the derivative, and with that feedback we go back, step by step, to the generated image and change it a bit to look more like the style image. Similarly, for the content layer, we compute the cost function as we saw, then its derivative, and once we have the derivative, we backpropagate all the way to the generated image to change it to look a bit more like the content image.

We repeat the same for the three other style layers. After that, the generated image may look a bit more like the style and the content image. If we are not satisfied with the result, we can simply repeat the same process. A forward pass recalculates only the parts marked by the red arrows in the preceding diagram, that is, the activations and gram matrices of the generated image. The green and the blue parts are not calculated again, because the content image and the style image did not change; those values are stored in memory and retrieved relatively quickly.

We repeat this process until we are satisfied with the result. Each iteration, we have a new generated image, and therefore, through a new forward pass, different activations and gram matrices. This is basically the fundamental concept behind the working code. Let's put it all together in a slightly more formal manner.

We have the total cost function, defined as a weighted sum of the content cost function and multiple style cost functions. We also have alpha and beta, which simply control how much you want the generated image to look like the content image or the style image. The following formula shows the total cost function:
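In the standard neural style transfer notation, with C, S, and G denoting the content, style, and generated images, and with one weight per selected style layer, the total cost can be written as:

J(G) = \alpha \, J_{content}(C, G) + \beta \sum_{l} \lambda^{[l]} \, J_{style}^{[l]}(S, G)

The \lambda^{[l]} values correspond to the per-layer coefficients we will see in the STYLE_LAYERS array later in this section.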

The content cost function is simply the squared difference between the activations of the content image and the generated image in the selected layer, while the style cost function is the squared difference of the gram matrices rather than of the activations directly. The first formula is the content cost function, while the second is the style cost function:
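Written out in the same notation (the normalization constants vary between formulations; the ones below are a common choice), the two cost functions are:

J_{content}(C, G) = \frac{1}{4 \, n_H n_W n_C} \sum_{i,j,k} \left( a^{(C)}_{ijk} - a^{(G)}_{ijk} \right)^2

J_{style}^{[l]}(S, G) = \frac{1}{4 \, n_C^2 (n_H n_W)^2} \sum_{k,k'} \left( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \right)^2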

Let's see how the code looks.

First, we need to select the style layers; here, we have five of them. On the left of each entry is the name of the layer in the VGG-16 architecture, while on the right is its coefficient:

public class NeuralStyleTransfer {

    public static final String[] STYLE_LAYERS = new String[]{
            "block1_conv1,0.5",
            "block2_conv1,1.0",
            "block3_conv1,1.5",
            "block4_conv2,3.0",
            "block5_conv1,4.0"
    };

Here, in line with what we mentioned previously, the deeper layers contribute more to the style than the lower-level layers, but we also want the early layers in play, because they capture low-level details, such as color, that are quite interesting for the style. That's why we use multiple layers here. It is also quite interesting to change those weights and see how the generated image differs. Then we have the content layer name, which is block4_conv2. Once we get all the feedback from the style layers and this content layer, we need to update the image:

    private static final String CONTENT_LAYER_NAME = "block4_conv2";
    /**
     * ADAM first-moment decay rate; typical values are 0.8-0.95.
     */
    private static final double BETA_MOMENTUM = 0.9;
    /**
     * ADAM second-moment decay rate and epsilon; these values rarely change.
     */
    private static final double BETA2_MOMENTUM = 0.999;
    private static final double EPSILON = 0.00000008;

The way we update the generated image each time is by means of an ADAM updater, because it is quite well optimized, and it has the three parameters shown in the preceding code. Then we have alpha and beta; alpha represents the degree to which we want the generated image to look like the content image, while beta represents the degree to which we want it to look like the style image:

    public static final double ALPHA = 5;
    public static final double BETA = 100;

    private static final double LEARNING_RATE = 2;
    private static final int ITERATIONS = 1000;
    /**
     * Higher resolution gives better results, but going beyond 300x400
     * on a CPU becomes almost prohibitively slow to compute.
     */
    public static final int HEIGHT = 224;
    public static final int WIDTH = 224;
    public static final int CHANNELS = 3;

    private ExecutorService executorService;

Beta is much bigger than the content coefficient because, if you remember, we start with a noisy image that already looks a bit like the content; the larger beta makes the style's impact greater so that it can catch up with the content. Then we have the learning rate and the number of iterations, which are already familiar, followed by the resolution of the image: the height, width, and channels. This is quite important; increasing the height and the width definitely produces a better generated image, so the quality is higher, but on a CPU, I would suggest not going beyond 300 x 400, or even 300 x 300, because it then becomes difficult to compute. With a GPU, of course, we can go to higher numbers, such as 800 or 900, with no concern.

Then we have the classes that handle the low-level details of the derivatives, the cost functions, and the image utilities, which we'll see in the following code:

    final ImageUtilities imageUtilities = new ImageUtilities();
    private final ContentCostFunction contentCostFunction = new ContentCostFunction();
    private final StyleCostFunction styleCostFunction = new StyleCostFunction();

    public static void main(String[] args) throws Exception {
        new NeuralStyleTransfer().transferStyle();
    }

    public void transferStyle() throws Exception {

        ComputationGraph vgg16FineTune = ramo.klevis.transfer.style.neural.VGG16.loadModel();

        INDArray content = imageUtilities.loadImage(ImageUtilities.CONTENT_FILE);
        INDArray style = imageUtilities.loadImage(ImageUtilities.STYLE_FILE);
        INDArray generated = imageUtilities.createGeneratedImage();

        // The content and style activations (and the style gram matrices) are
        // computed only once and reused in every iteration
        Map<String, INDArray> contentActivationsMap = vgg16FineTune.feedForward(content, true);
        Map<String, INDArray> styleActivationsMap = vgg16FineTune.feedForward(style, true);
        HashMap<String, INDArray> styleActivationsGramMap = buildStyleGramValues(styleActivationsMap);

        AdamUpdater adamUpdater = createADAMUpdater();
        executorService = Executors.newCachedThreadPool();

        for (int iteration = 0; iteration < ITERATIONS; iteration++) {
            long start = System.currentTimeMillis();
            //log.info("iteration " + iteration);

            CountDownLatch countDownLatch = new CountDownLatch(2);

            // Activations of the generated image are recomputed in every iteration
            Map<String, INDArray> generatedActivationsMap = vgg16FineTune.feedForward(generated, true);

            // Backpropagate the style and the content feedback in parallel
            final INDArray[] styleBackProp = new INDArray[1];
            executorService.execute(() -> {
                try {
                    styleBackProp[0] = backPropagateStyles(vgg16FineTune, styleActivationsGramMap, generatedActivationsMap);
                    countDownLatch.countDown();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            final INDArray[] backPropContent = new INDArray[1];
            executorService.execute(() -> {
                backPropContent[0] = backPropagateContent(vgg16FineTune, contentActivationsMap, generatedActivationsMap);
                countDownLatch.countDown();
            });

            countDownLatch.await();
            // Combine both gradients, weighted by ALPHA and BETA, and update the generated image
            INDArray backPropAllValues = backPropContent[0].muli(ALPHA).addi(styleBackProp[0].muli(BETA));
            adamUpdater.applyUpdater(backPropAllValues, iteration);
            generated.subi(backPropAllValues);

            System.out.println(System.currentTimeMillis() - start);
            //log.info("Total Loss: " + totalLoss(styleActivationsMap, generatedActivationsMap, contentActivationsMap));
            if (iteration % ImageUtilities.SAVE_IMAGE_CHECKPOINT == 0) {
                // The saved images can be found at target/classes/styletransfer/out
                imageUtilities.saveImage(generated.dup(), iteration);
            }
        }
    }

The following are the steps for obtaining the generated output image:

  1. First, we load the VGG-16 model, pre-trained on ImageNet, as we saw when utilizing transfer learning
  2. Then, we load the content file through an image preprocessor, which handles low-level details such as scaling the image down to the width and height we chose, and normalizing it (see the sketch after this list)
  3. We do the same for the style image, and then we create the generated image
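The ImageUtilities class itself is not listed in this chapter. The following is only a minimal sketch of what its loadImage() method might look like, assuming DataVec's NativeImageLoader is used for loading and scaling and ND4J's VGG16ImagePreProcessor for the ImageNet mean subtraction; the file paths are hypothetical placeholders:

import java.io.File;
import java.io.IOException;

import org.datavec.image.loader.NativeImageLoader;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.preprocessor.VGG16ImagePreProcessor;

public class ImageUtilities {

    // Hypothetical file locations; the book's resources are not shown here
    public static final String CONTENT_FILE = "src/main/resources/styletransfer/content.jpg";
    public static final String STYLE_FILE = "src/main/resources/styletransfer/style.jpg";

    public INDArray loadImage(String filePath) throws IOException {
        // Scale the image to the network input size (HEIGHT x WIDTH x CHANNELS)
        NativeImageLoader loader = new NativeImageLoader(
                NeuralStyleTransfer.HEIGHT, NeuralStyleTransfer.WIDTH, NeuralStyleTransfer.CHANNELS);
        INDArray image = loader.asMatrix(new File(filePath));
        // Subtract the ImageNet mean values, as VGG-16 expects
        new VGG16ImagePreProcessor().transform(image);
        return image;
    }
}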

Here is the output:

As we mentioned, the generated image is partially similar to the content image and partially noise. In this example, we are using only 20% noise. The reason is that, if you make your generated image look like the content image, you will get better and faster results. Even if you use 90% noise and 10% content image, you eventually get the same result, but it will take longer to see it.

That's why I suggest starting with lower noise if you first want to see how the algorithm works.
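Likewise, createGeneratedImage() is not listed in the chapter; a minimal sketch of the blending idea described above, assuming a 20% noise ratio and continuing the hypothetical ImageUtilities sketch, could look like this:

    // Inside the hypothetical ImageUtilities class sketched earlier
    // (additionally requires org.nd4j.linalg.factory.Nd4j)
    public INDArray createGeneratedImage() throws IOException {
        double noiseRatio = 0.2; // 20% noise, 80% content, as described in the text
        INDArray content = loadImage(CONTENT_FILE);
        // Uniform noise centered around zero; the scale here is an arbitrary choice for the sketch
        INDArray noise = Nd4j.rand(content.shape()).subi(0.5).muli(40);
        return noise.muli(noiseRatio).addi(content.muli(1 - noiseRatio));
    }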

Then, we need to calculate the activations for the content and the style. So we do the feed-forward process for the content image, and then for the style image.

For the style, we have another step; we need to obtain the gram matrices, because we are not using its activations directly. That is what buildStyleGramValues() does. Notice that these values are kept in memory and are not recomputed inside the iteration loop; they are calculated once and then reused. Then we create the ADAM updater and the executor service that allows the code to run in parallel. Basically, this code has already been contributed to the deep learning community, but the version presented here is optimized to run in parallel, and we also have some modifications to the VGG-16 model to produce better results.
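buildStyleGramValues() is also not listed in the chapter; the following is a sketch of what it could do, given the feed-forward activations map and the STYLE_LAYERS array, and assuming the flattening follows the gram matrix definition from earlier:

    // Sketch: build one gram matrix per style layer from the style activations
    private HashMap<String, INDArray> buildStyleGramValues(Map<String, INDArray> styleActivationsMap) {
        HashMap<String, INDArray> gramMap = new HashMap<>();
        for (String styleLayer : STYLE_LAYERS) {
            String layerName = styleLayer.split(",")[0];
            INDArray activation = styleActivationsMap.get(layerName); // shape [1, channels, height, width]
            long channels = activation.shape()[1];
            long positions = activation.shape()[2] * activation.shape()[3];
            // Flatten the spatial dimensions so that each row holds one channel's activations
            INDArray flattened = activation.reshape(channels, positions);
            // Gram matrix: the [channels x channels] matrix of inner products between channels
            gramMap.put(layerName, flattened.mmul(flattened.transpose()));
        }
        return gramMap;
    }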

So we do a forward pass to obtain all the activations for the generated image, and then it's time to calculate the derivatives and backpropagate. The backpropagation from the style layers and from the content layer is done in parallel: for the style, we start another thread, and then for each style layer we start yet another thread. As we can see, the first step is to calculate the derivative of the cost function; this feedback is then backpropagated all the way to the image. The same is done for the content layer, so we backpropagate to the first layer. The feedback that comes back is multiplied by its coefficient, passed through the ADAM updater, and finally the generated image is altered.

In the next iteration, with the new generated image, we go through the same process: a feed-forward pass to acquire all the activations, and then the parallel processing that gathers all the feedback from the backpropagation.
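For completeness, here is a guess at what the createADAMUpdater() helper called in transferStyle() might look like, assuming ND4J's AdamUpdater and Adam configuration classes; treat it as a sketch rather than the book's exact implementation:

    // Sketch: configure an ND4J ADAM updater with the constants defined earlier
    // (requires org.nd4j.linalg.learning.AdamUpdater, org.nd4j.linalg.learning.config.Adam,
    // and org.nd4j.linalg.factory.Nd4j)
    private AdamUpdater createADAMUpdater() {
        AdamUpdater adamUpdater = new AdamUpdater(
                new Adam(LEARNING_RATE, BETA_MOMENTUM, BETA2_MOMENTUM, EPSILON));
        // ADAM keeps two moment estimates per pixel, hence the factor of 2 in the state view
        adamUpdater.setStateViewArray(
                Nd4j.zeros(1, 2 * CHANNELS * HEIGHT * WIDTH),
                new long[]{1, CHANNELS, HEIGHT, WIDTH}, 'c', true);
        return adamUpdater;
    }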

I would suggest turning on the commented-out total-loss logging if you want to debug how the cost function goes down or up step by step. Running the code takes a really long time; good results start from about 500 iterations, and 500 iterations on a CPU take something like three or four hours. Since it would take too long to show those results here, I pre-ran the application for just 45 iterations to see how that looks. So let's suppose that we use the following image as the style image:

And the following as the content image:

Then, let's see what the algorithm produces. The outputs for iterations 10, 15, 20, 25, 30, 35, 40, and 45 are shown in the following image. After 45 iterations, the output still largely resembles the content image; basically, by the 45th iteration, the generated image is starting to obtain some color similar to the content image:

We can see those colors here, and after 500 and 1,000 iterations, the result will look more similar to the style and the content image at the same time. This covers the application; in the next section, we are going to tackle the face recognition problem.
