8 Generative adversarial networks (GANs)

This chapter covers

  • Understanding the basic components of GANs: generative and discriminative models
  • Evaluating generative models
  • Learning about popular vision applications of GANs
  • Building a GAN model

Generative adversarial networks (GANs) are a new type of neural architecture introduced by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014.1 GANs have been called “the most interesting idea in the last 10 years in ML” by Yann LeCun, Facebook’s AI research director. The excitement is well justified. The most notable feature of GANs is their capacity to create hyperrealistic images, videos, music, and text. For example, except for the far-right column, none of the faces shown on the right side of figure 8.1 belong to real humans; they are all fake. The same is true for the handwritten digits on the left side of the figure. This shows a GAN’s ability to learn features from the training images and imagine its own new images using the patterns it has learned.

Figure 8.1 Illustration of GANs’ abilities by Goodfellow and co-authors. These are samples generated by GANs after training on two datasets: MNIST and the Toronto Faces Dataset (TFD). In both cases, the right-most column contains true data. This shows that the produced data is really generated and not only memorized by the network. (Source: Goodfellow et al., 2014.)

We’ve learned in the past chapters how deep neural networks can be used to understand image features and perform deterministic tasks on them like object classification and detection. In this part of the book, we will talk about a different type of application for deep learning in the computer vision world: generative models. These are neural network models that are able to imagine and produce new content that hasn’t been created before. They can imagine new worlds, new people, and new realities in a seemingly magical way. We train generative models by providing a training dataset in a specific domain; their job is to create images that have new objects from the same domain that look like the real data.

For a long time, humans have had an advantage over computers: the ability to imagine and create. Computers have excelled in solving problems like regression, classification, and clustering. But with the introduction of generative networks, researchers can make computers generate content of the same or higher quality compared to that created by their human counterparts. By learning to mimic any distribution of data, computers can be taught to create worlds that are similar to our own in any domain: images, music, speech, prose. They are robot artists, in a sense, and their output is impressive. GANs are also seen as an important stepping stone toward achieving artificial general intelligence (AGI), an artificial system capable of matching human cognitive capacity to acquire expertise in virtually any domain--from images, to language, to creative skills needed to compose sonnets.

Naturally, this ability to generate new content makes GANs look a little bit like magic, at least at first sight. In this chapter, we will only attempt to scratch the surface of what is possible with GANs. We will look past the apparent magic of GANs and dive into the architectural ideas and math behind these models, giving you the theoretical knowledge and practical skills to continue exploring whichever facet of this field you find most interesting. Not only will we discuss the fundamental notions that GANs rely on, but we will also implement and train an end-to-end GAN and go through it step by step. Let’s get started!

8.1 GAN architecture

GANs are based on the idea of adversarial training. The GAN architecture basically consists of two neural networks that compete against each other:

  • The generator tries to convert random noise into observations that look as if they have been sampled from the original dataset.

  • The discriminator tries to predict whether an observation comes from the original dataset or is one of the generator’s forgeries.

This competitiveness helps them to mimic any distribution of data. I like to think of the GAN architecture as two boxers fighting (figure 8.2): in their quest to win the bout, both are learning each other’s moves and techniques. They start with little knowledge of their opponent, and as the match goes on, they learn and become better.

Figure 8.2 A fight between two adversarial networks: generative and discriminative

Another analogy will help drive home the idea: think of a GAN as the opposition of a counterfeiter and a cop in a game of cat and mouse, where the counterfeiter is learning to pass false notes, and the cop is learning to detect them (figure 8.3). Both are dynamic: as the counterfeiter learns to perfect creating false notes, the cop is in training and getting better at detecting the fakes. Each side learns the other’s methods in a constant escalation.

Figure 8.3 The GAN’s generator and discriminator models are like a counterfeiter and a police officer.

As you can see in the architecture diagram in figure 8.4, a GAN takes the following steps:

  1. The generator takes in random numbers and returns an image.

  2. This generated image is fed into the discriminator alongside a stream of images taken from the actual, ground-truth dataset.

  3. The discriminator takes in both real and fake images and returns probabilities: numbers between 0 and 1, with 1 representing a prediction of authenticity and 0 representing a prediction of fake.

Figure 8.4 The GAN architecture is composed of generator and discriminator networks. Note that the discriminator network is a typical CNN where the convolutional layers reduce in size until they get to the flattened layer. The generator network, on the other hand, is an inverted CNN that starts with the flattened vector: the convolutional layers increase in size until they form the dimension of the input images.

If you take a close look at the generator and discriminator networks, you will notice that the generator network is an inverted ConvNet that starts with the flattened vector. The images are upscaled until they are similar in size to the images in the training dataset. We will dive deeper into the generator architecture later in this chapter--I just wanted you to notice this phenomenon now.

8.1.1 Deep convolutional GANs (DCGANs)

In the original GAN paper in 2014, multi-layer perceptron (MLP) networks were used to build the generator and discriminator networks. However, since then, it has been proven that convolutional layers give greater predictive power to the discriminator, which in turn enhances the accuracy of the generator and the overall model. This type of GAN is called a deep convolutional GAN (DCGAN) and was developed by Alec Radford et al. in 2016.2 Nearly all GAN architectures today contain convolutional layers, so the “DC” is implied when we talk about GANs; for the rest of this chapter, the terms GAN and DCGAN are used interchangeably. You can also go back to chapters 2 and 3 to learn more about the differences between MLP and CNN networks and why CNNs are preferred for image problems. Next, let’s dive deeper into the architecture of the discriminator and generator networks.

8.1.2 The discriminator model

As explained earlier, the goal of the discriminator is to predict whether an image is real or fake. This is a typical supervised classification problem, so we can use the traditional classifier network that we learned about in the previous chapters. The network consists of stacked convolutional layers, followed by a dense output layer with a sigmoid activation function. We use a sigmoid activation function because this is a binary classification problem: the goal of the network is to output prediction probability values between 0 and 1, where 0 means the image is fake (produced by the generator) and 1 means it is 100% real.

The discriminator is a normal, well understood classification model. As you can see in figure 8.5, training the discriminator is pretty straightforward. We feed the discriminator labeled images: fake (or generated) and real images. The real images come from the training dataset, and the fake images are the output of the generator model.

Figure 8.5 The discriminator for the GAN

Now, let’s implement the discriminator network in Keras. At the end of this chapter, we will compile all the code snippets together to build an end-to-end GAN. We will first implement a discriminator_model function. In this code snippet, the shape of the image input is 28 × 28; you can change it as needed for your problem:

def discriminator_model():
       discriminator = Sequential()                                             
 
       discriminator.add(Conv2D(32, kernel_size=3, strides=2,
                         input_shape=(28,28,1),padding="same"))                 
 
       discriminator.add(LeakyReLU(alpha=0.2))                                  
 
       discriminator.add(Dropout(0.25))                                         
 
       discriminator.add(Conv2D(64, kernel_size=3, strides=2, padding="same"))  
       discriminator.add(ZeroPadding2D(padding=((0,1),(0,1))))                  
 
       discriminator.add(BatchNormalization(momentum=0.8))                      
       discriminator.add(LeakyReLU(alpha=0.2))                                  
       discriminator.add(Dropout(0.25))                                         
  
       discriminator.add(Conv2D(128, kernel_size=3, strides=2, padding="same")) 
       discriminator.add(BatchNormalization(momentum=0.8))                      
       discriminator.add(LeakyReLU(alpha=0.2))                                  
       discriminator.add(Dropout(0.25))                                         
  
       discriminator.add(Conv2D(256, kernel_size=3, strides=1, padding="same")) 
       discriminator.add(BatchNormalization(momentum=0.8))                      
       discriminator.add(LeakyReLU(alpha=0.2))                                  
       discriminator.add(Dropout(0.25))                                         
  
       discriminator.add(Flatten())                                             
       discriminator.add(Dense(1, activation='sigmoid'))                        
  
       discriminator.summary()                                                  
  
       img_shape = (28,28,1)                                                    
       img = Input(shape=img_shape)   
 
       probability = discriminator(img)                                         
  
       return Model(img, probability)                                           

Instantiates a sequential model and names it discriminator

Adds a convolutional layer to the discriminator model

Adds a leaky ReLU activation function

Adds a dropout layer with a 25% dropout probability

Adds a second convolutional layer with zero padding

Adds a batch normalization layer for faster learning and higher accuracy

Adds a third convolutional layer with batch normalization, leaky ReLU, and a dropout

Adds the fourth convolutional layer with batch normalization, leaky ReLU, and a dropout

Flattens the network and adds the output dense layer with sigmoid activation function

Prints the model summary

Sets the input image shape

Runs the discriminator model to get the output probability

Returns a model that takes the image as input and produces the probability output

The output summary of the discriminator model is shown in figure 8.6. As you might have noticed, there is nothing new: the discriminator model follows the regular pattern of the traditional CNN networks that we learned about in chapters 3, 4, and 5. We stack convolutional, batch normalization, activation, and dropout layers to create our model. All of these layers have hyperparameters that we tune when we are training the network. For your own implementation, you can tune these hyperparameters and add or remove layers as you see fit. Tuning CNN hyperparameters is explained in detail in chapters 3 and 4.

Figure 8.6 The output summary for the discriminator model

In the output summary in figure 8.6, note that the width and height of the output feature maps decrease in size, whereas the depth increases in size. This is the expected behavior for traditional CNN networks as we’ve seen in previous chapters. Let’s see what happens to the feature maps’ size in the generator network in the next section.

8.1.3 The generator model

The generator takes in some random data and tries to mimic the training dataset to generate fake images. Its goal is to trick the discriminator by trying to generate images that are perfect replicas of the training dataset. As it is trained, it gets better and better after each iteration. But the discriminator is being trained at the same time, so the generator has to keep improving as the discriminator learns its tricks.

Figure 8.7 The generator model of the GAN

As you can see in figure 8.7, the generator model looks like an inverted ConvNet. The generator takes a vector input with some random noise data and reshapes it into a cube volume that has a width, height, and depth. This volume is meant to be treated as a feature map that will be fed to several convolutional layers that will create the final image.

Upsampling to scale feature maps

Traditional convolutional neural networks use pooling layers to downsample input images. In the generator, we need the opposite: to scale up the feature maps, we use upsampling layers that grow the image dimensions by repeating each row and column of the input pixels.

Keras has an upsampling layer (Upsampling2D) that scales the image dimensions by taking a scaling factor (size) as an argument:

keras.layers.UpSampling2D(size=(2, 2))

This line of code repeats every row and column of the image matrix two times, because the size of the scaling factor is set to (2, 2); see figure 8.8. If the scaling factor is (3, 3), the upsampling layer repeats each row and column of the input matrix three times, as shown in figure 8.9.

Figure 8.8 Upsampling example when the scaling size is (2, 2)

Figure 8.9 Upsampling example when scaling size is (3, 3)
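To see this repetition concretely, here is a minimal sketch (assuming a TensorFlow 2.x Keras installation) that passes a tiny 2 × 2 matrix through an UpSampling2D layer:

import numpy as np
from tensorflow.keras.layers import UpSampling2D

# A 2 x 2 single-channel "image" shaped as (batch, height, width, channels)
x = np.array([[1, 2],
              [3, 4]], dtype="float32").reshape(1, 2, 2, 1)

y = UpSampling2D(size=(2, 2))(x)      # repeats every row and column twice
print(y.numpy().reshape(4, 4))
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]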

When we build the generator model, we keep adding upsampling layers until the size of the feature maps is similar to the training dataset. You will see how this is implemented in Keras in the next section.

Now, let’s build the generator_model function that builds the generator network:

def generator_model():
       generator = Sequential()                                              
       generator.add(Dense(128 * 7 * 7, activation="relu", input_dim=100))   
       generator.add(Reshape((7, 7, 128)))                                   
       generator.add(UpSampling2D(size=(2,2)))                               
  
       generator.add(Conv2D(128, kernel_size=3, padding="same"))             
       generator.add(BatchNormalization(momentum=0.8))                       
       generator.add(Activation("relu"))
       generator.add(UpSampling2D(size=(2,2)))                               
  
# convolutional + batch normalization layers
       generator.add(Conv2D(64, kernel_size=3, padding="same"))              
       generator.add(BatchNormalization(momentum=0.8))                       
       generator.add(Activation("relu"))
  
# convolutional layer with filters = 1
       generator.add(Conv2D(1, kernel_size=3, padding="same"))
       generator.add(Activation("tanh"))
       generator.summary()                                                   
  
       noise = Input(shape=(100,))                                           
       fake_image = generator(noise)                                         
       return Model(noise, fake_image)                                       

Instantiates a sequential model and names it generator

Adds a dense layer that has a number of neurons = 128 × 7 × 7

Reshapes the image dimensions to 7 × 7 × 128

Upsampling layer to double the size of the image dimensions to 14 × 14

Adds a convolutional layer to run the convolutional process and batch normalization

Upsamples the image dimensions to 28 × 28

We don’t add upsampling here because the image size of 28 × 28 is equal to the image size in the MNIST dataset. You can adjust this for your own problem.

Prints the model summary

Generates the input noise vector of length = 100. We use 100 here to create a simple network.

Runs the generator model to create the fake image

Returns a model that takes the noise vector as input and outputs the fake image

The output summary of the generator model is shown in figure 8.10. In the code snippet, the only new component is the UpSampling2D layer, which doubles its input dimensions by repeating pixels. Similar to the discriminator, we stack convolutional layers on top of each other and add other optimization layers like BatchNormalization. The key difference in the generator model is that it starts with the flattened vector; images are upsampled until they have dimensions similar to the training dataset. All of these layers have hyperparameters that we tune when we are training the network. For your own implementation, you can tune these hyperparameters and add or remove layers as you see fit.

Figure 8.10 The output summary of the generator model

Notice the change in the output shape after each layer. It starts from a 1D vector of 6,272 neurons. We reshaped it to a 7 × 7 × 128 volume, and then the width and height were upsampled twice to 14 × 14 followed by 28 × 28. The depth decreased from 128 to 64 to 1 because this network is built to deal with the grayscale MNIST dataset project that we will implement later in this chapter. If you are building a generator model to generate color images, then you should set the filters in the last convolutional layer to 3.

8.1.4 Training the GAN

Now that we’ve learned the discriminator and generator models separately, let’s put them together to train an end-to-end generative adversarial network. The discriminator is being trained to become a better classifier to maximize the probability of assigning the correct label to both training examples (real) and images generated by the generator (fake): for example, the police officer becomes better at differentiating between fakes and real currency. The generator, on the other hand, is being trained to become a better forger, to maximize its chances of fooling the discriminator. Both networks are getting better at what they do.

The process of training GAN models involves two processes:

  1. Train the discriminator. This is a straightforward supervised training process. The network is given labeled images coming from the generator (fake) and the training data (real), and it learns to classify between real and fake images with a sigmoid prediction output. Nothing new here.

  2. Train the generator. This process is a little tricky. The generator model cannot be trained alone like the discriminator. It needs the discriminator model to tell it whether it did a good job of faking images. So, we create a combined network to train the generator, composed of both discriminator and generator models.

Think of the training processes as two parallel lanes. One lane trains the discriminator alone, and the other lane is the combined model that trains the generator. The GAN training process is illustrated in figure 8.11.

Figure 8.11 The process flow to train GANs

As you can see in figure 8.11, when training the combined model, we freeze the weights of the discriminator because this model focuses only on training the generator. We will discuss the intuition behind this idea when we explain the generator training process. For now, just know that we need to build and train two models: one for the discriminator alone and the other for both discriminator and generator models.

Both processes follow the traditional neural network training process explained in chapter 2. It starts with the feedforward process and then makes predictions and calculates and backpropagates the error. When training the discriminator, the error is backpropagated back to the discriminator model to update its weights; in the combined model, the error is backpropagated back to the generator to update its weights.

During the training iterations, we follow the same neural network training procedure to observe the network’s performance and tune its hyperparameters until we see that the generator is achieving satisfying results for our problem. This is when we can stop the training and deploy the generator model. Now, let’s see how we compile the discriminator and the combined networks to train the GAN model.

Training the discriminator

As we said before, this is a straightforward process. First, we build the model from the discriminator_model method that we created earlier in this chapter. Then we compile the model and use the binary_crossentropy loss function and an optimizer of your choice (we use Adam in this example).

Let’s see the Keras implementation that builds and compiles the discriminator. Please note that this code snippet is not meant to be runnable on its own--it is here for illustration. At the end of this chapter, you can find the full code of this project:

discriminator = discriminator_model()
discriminator.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

We can train the model by creating random training batches using Keras’ train_on_batch method to run a single gradient update on a single batch of data:

noise = np.random.normal(0, 1, (batch_size, 100))        
gen_imgs = generator.predict(noise)                      
  
# Train the discriminator (real classified as ones and generated as zeros)
d_loss_real = discriminator.train_on_batch(imgs, valid)
d_loss_fake = discriminator.train_on_batch(gen_imgs, fake)

Sample noise

Generates a batch of new images
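For context, here is a minimal sketch of the other pieces the snippet above assumes; the full project in section 8.4 defines them the same way, with X_train being the array of real training images:

batch_size = 32                                            # example batch size
valid = np.ones((batch_size, 1))                           # target label 1 for real images
fake = np.zeros((batch_size, 1))                           # target label 0 for generated images

idx = np.random.randint(0, X_train.shape[0], batch_size)   # random indices into the training images
imgs = X_train[idx]                                        # a batch of real images from the dataset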

Training the generator (combined model)

Here is the one tricky part in training GANs: training the generator. While the discriminator can be trained in isolation from the generator model, the generator needs the discriminator in order to be trained. For this, we build a combined model that contains both the generator and the discriminator, as shown in figure 8.12.

Figure 8.12 Illustration of the combined model that contains both the generator and discriminator models

When we want to train the generator, we freeze the weights of the discriminator model because the generator and discriminator have objectives that pull in opposite directions. If we don’t freeze the discriminator’s weights, it will be pushed toward predicting generated images as real while the generator is being updated, which is not the desired outcome. Freezing the weights of the discriminator model doesn’t affect the existing discriminator model that we compiled earlier when training the discriminator alone. Think of it as having two discriminator models--this is not the case, but it is easier to imagine.

Now, let’s build the combined model:

generator = generator_model()

z = Input(shape=(100,))
img = generator(z)

discriminator.trainable = False

valid = discriminator(img)

combined = Model(z, valid)

Builds the generator

The generator takes noise as input and generates an image.

Freezes the weights of the discriminator model

The discriminator takes generated images as input and determines their validity.

The combined model (stacked generator and discriminator) trains the generator to fool the discriminator.

Now that we have built the combined model, we can proceed with the training process as normal. We compile the combined model with a binary_crossentropy loss function and an Adam optimizer:

combined.compile(loss='binary_crossentropy', optimizer=optimizer)
g_loss = combined.train_on_batch(noise, valid)

Trains the generator (wants the discriminator to mistake images for being real)

Training epochs

In the project at the end of the chapter, you will see that the previous code snippet is put inside a loop function to perform the training for a certain number of epochs. For each epoch, the two compiled models (discriminator and combined) are trained simultaneously. During the training process, both the generator and discriminator improve. You can observe the performance of your GAN by printing out the results after each epoch (or a set of epochs) to see how the generator is doing at generating synthetic images. Figure 8.13 shows an example of the evolution of the generator’s performance throughout its training process on the MNIST dataset.

Figure 8.13 The generator gets better at mimicking the handwritten digits of the MNIST dataset throughout its training from epoch 0 to epoch 9,500.

In the example, epoch 0 starts with random noise data that doesn’t yet represent the features in the training dataset. As the GAN model goes through the training, its generator gets better and better at creating high-quality imitations of the training dataset that can fool the discriminator. Manually observing the generator’s performance is a good way to evaluate system performance to decide on the number of epochs and when to stop training. We’ll look more at GAN evaluation techniques in section 8.2.

8.1.5 GAN minimax function

GAN training is more of a zero-sum game than an optimization problem. In zero-sum games, the total utility score is divided among the players. An increase in one player’s score results in a decrease in another player’s score. In AI, this is called minimax game theory. Minimax is a decision-making algorithm, typically used in turn-based, two-player games. The goal of the algorithm is to find the optimal next move. One player, called the maximizer, works to get the maximum possible score; the other player, called the minimizer, tries to get the lowest score by counter-moving against the maximizer.

GANs play a minimax game where the entire network attempts to optimize the function V(D,G) in the following equation:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
The goal of the discriminator (D) is to maximize the probability of getting the correct label of the image. The generator’s (G) goal, on the other hand, is to minimize the chances of getting caught. So, we train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 - D(G(z))). In other words, D and G play a two-player minimax game with the value function V(D,G).

Minimax game theory

In a two-person, zero-sum game, a person can win only if the other player loses. No cooperation is possible. This game theory is widely used in games such as tic-tac-toe, backgammon, mancala, chess, and so on. The maximizer player tries to get the highest score possible, while the minimizer player tries to do the opposite and get the lowest score possible.

In a given game state, if the maximizer has the upper hand, then the score will tend to be a positive value. If the minimizer has the upper hand in that state, then the score will tend to be a negative value. The values are calculated by heuristics that are unique for every type of game.
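To make the maximizer and minimizer roles concrete, here is a tiny, self-contained sketch (not tied to any particular game) of the minimax rule applied to a toy game tree whose leaves are heuristic scores:

def minimax(node, maximizing):
    if isinstance(node, (int, float)):       # leaf node: a heuristic score of the game state
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Depth-2 toy tree: the maximizer chooses a branch, then the minimizer replies.
game_tree = [[3, 5], [2, 9]]
print(minimax(game_tree, maximizing=True))   # 3 = max(min(3, 5), min(2, 9))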

Like any other mathematical equation, the preceding one looks terrifying to anyone who isn’t well versed in the math behind it, but the idea it represents is simple yet powerful. It’s just a mathematical representation of the two competing objectives of the discriminator and the generator models. Let’s go through the symbols first (table 8.1) and then explain it.

Table 8.1 Symbols used in the minimax equation

Symbol      Explanation
G           Generator.
D           Discriminator.
z           Random noise fed to the generator (G).
G(z)        The fake image the generator produces from the random noise z.
D(G(z))     The discriminator’s output probability for a generated image.
log D(x)    The log of the discriminator’s output probability for real data x.

The discriminator takes its input from two sources:

  • Fake data from the generator, G(z)--The generator turns the random noise z into a fake image, and the discriminator’s output for it is denoted D(G(z)).

  • Real data from the training set, x--The discriminator’s output for real data is denoted D(x), and its log, log D(x), appears in the objective.

To simplify the minimax equation, the best way to look at it is to break it down into two components: the discriminator training function and the generator training (combined model) function. During the training process, we created two training flows, and each has its own error function:

  • One for the discriminator alone, represented by the following term. The discriminator tries to maximize it by pushing its predictions D(x) for real images as close as possible to 1:

    $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]$

  • One for the combined model that trains the generator, represented by the following term. The generator tries to minimize it by pushing the discriminator’s predictions D(G(z)) for its fake images as close as possible to 1:

    $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

Now that we understand the equation’s symbols and how the minimax function works, let’s look at the function again:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
From the discriminator’s point of view, the goal of the minimax objective function V(D, G) is to maximize D(x) for samples from the true data distribution and minimize D(G(z)) for samples from the fake data distribution. To achieve this, we use the log-likelihoods log D(x) and log(1 - D(G(z))) in the objective function. Taking the log just makes sure that the closer a prediction is to the incorrect value, the more heavily it is penalized.

Early in the GAN training process, the discriminator will reject fake data from the generator with high confidence, because the fake images are very different from the real training data--the generator hasn’t learned yet. As we train the discriminator to maximize the probability of assigning the correct labels to both real examples and fake images from the generator, we simultaneously train the generator to minimize the discriminator’s classification error on the generated fake data. The discriminator wants D(x) to be close to 1 for real data and D(G(z)) to be close to 0 for fake data. The generator, on the other hand, wants D(G(z)) to be close to 1, so that the discriminator is fooled into thinking the generated G(z) is real. Training can stop when the discriminator can no longer reliably tell the generated data apart from the real data.
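The Keras code in this chapter implements these objectives through the binary_crossentropy loss. The following sketch (written from scratch for illustration, not taken from the chapter’s project code) shows the correspondence: labeling real images 1 and fakes 0 for the discriminator maximizes the value function, while labeling fakes 1 in the combined model pushes D(G(z)) toward 1, the practical, non-saturating form of the generator’s objective:

import numpy as np

def bce(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy, averaged over the batch
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

d_real = np.array([0.9, 0.8])   # example D(x) outputs on real images
d_fake = np.array([0.2, 0.1])   # example D(G(z)) outputs on generated images

# Discriminator: minimizing BCE with label 1 for real and 0 for fake
# is equivalent to maximizing E[log D(x)] + E[log(1 - D(G(z)))].
d_loss = bce(np.ones_like(d_real), d_real) + bce(np.zeros_like(d_fake), d_fake)

# Generator (combined model): labeling the fakes as 1 means that minimizing BCE
# pushes D(G(z)) toward 1, fooling the discriminator.
g_loss = bce(np.ones_like(d_fake), d_fake)
print(d_loss, g_loss)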

8.2 Evaluating GAN models

Deep learning neural network models that are used for classification and detection problems are trained with a loss function until convergence. A GAN generator model, on the other hand, is trained using a discriminator that learns to classify images as real or generated. As we learned in the previous section, both the generator and discriminator models are trained together to maintain an equilibrium. As such, there is no objective loss function used to train the GAN generator model, and no way to objectively assess the progress of the training or the relative or absolute quality of the model from loss alone. This means models must be judged by the quality of the generated synthetic images, typically by inspecting them manually.

A good way to identify evaluation techniques is to review research papers and the techniques the authors used to evaluate their GANs. Tim Salimans et al. (2016) evaluated their GAN performance by having human annotators manually judge the visual quality of the synthesized samples.3 They created a web interface and hired annotators on Amazon Mechanical Turk (MTurk) to distinguish between generated data and real data.

One downside of using human annotators is that the metric varies depending on the setup of the task and the motivation of the annotators. The team also found that results changed drastically when they gave annotators feedback about their mistakes: by learning from such feedback, annotators are better able to point out the flaws in generated images, giving a more pessimistic quality assessment.

Other non-manual approaches were used by Salimans et al. and by other researchers we will discuss in this section. In general, there is no consensus about a correct way to evaluate a given GAN generator model. This makes it challenging for researchers and practitioners to do the following:

  • Select the best GAN generator model during a training run--in other words, decide when to stop training.

  • Choose generated images to demonstrate the capability of a GAN generator model.

  • Compare and benchmark GAN model architectures.

  • Tune the model hyperparameters and configuration and compare results.

Finding quantifiable ways to understand a GAN’s progress and output quality is still an active area of research. A suite of qualitative and quantitative techniques has been developed to assess the performance of a GAN model based on the quality and diversity of the generated synthetic images. Two commonly used evaluation metrics for image quality and diversity are the inception score and the Fréchet inception distance (FID). In this section, you will discover techniques for evaluating GAN models based on generated synthetic images.

8.2.1 Inception score

The inception score is based on a heuristic that realistic samples should be able to be classified when passed through a pretrained network such as Inception on ImageNet (hence the name inception score). The idea is really simple. The heuristic relies on two values:

  • High predictability of the generated image--We apply a pretrained Inception classifier model to every generated image and get its softmax prediction. If the generated image is good enough, the classifier should assign it a confident, high-probability prediction.

  • Diverse generated samples--No single class should dominate the distribution of the generated images.

A large number of generated images are classified using the model. Specifically, the probability of the image belonging to each class is predicted. The probabilities are then summarized in the score to capture both how much each image looks like a known class and how diverse the set of images is across the known classes. If both these traits are satisfied, there should be a large inception score. A higher inception score indicates better-quality generated images.
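As a rough illustration of the computation (a simplified sketch, not the reference implementation), assume you already have softmax predictions from a pretrained classifier such as Inception for a batch of generated images; the score is the exponential of the average KL divergence between each image’s class distribution and the marginal class distribution:

import numpy as np

def inception_score(preds, eps=1e-16):
    # preds: array of shape (num_images, num_classes) with softmax probabilities
    p_y = np.mean(preds, axis=0, keepdims=True)                # marginal class distribution p(y)
    kl = preds * (np.log(preds + eps) - np.log(p_y + eps))     # KL(p(y|x) || p(y)) per image
    return float(np.exp(np.mean(np.sum(kl, axis=1))))          # higher is better

# Example with random predictions; in practice these come from Inception run on generated images
preds = np.random.dirichlet(np.ones(10), size=500)
print(inception_score(preds))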

8.2.2 Fréchet inception distance (FID)

The FID score was proposed and used by Martin Heusel et al. in 2017.4 It was designed as an improvement over the existing inception score.

Like the inception score, the FID score uses the Inception model to capture specific features of an input image. These activations are calculated for a collection of real and generated images. The activations for each real and generated image are summarized as a multivariate Gaussian, and the distance between these two distributions is then calculated using the Fréchet distance, also called the Wasserstein-2 distance.

An important note is that the FID needs a decent sample size to give good results (the suggested size is 50,000 samples). If you use too few samples, you will end up overestimating your actual FID, and the estimates will have a large variance. A lower FID score indicates more realistic images that match the statistical properties of real images.
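Here is a simplified sketch of the FID computation (not the reference implementation). It assumes act_real and act_fake are Inception activations, for example 2,048-dimensional feature vectors, computed for batches of real and generated images:

import numpy as np
from scipy.linalg import sqrtm

def fid_score(act_real, act_fake):
    mu_r, cov_r = act_real.mean(axis=0), np.cov(act_real, rowvar=False)
    mu_f, cov_f = act_fake.mean(axis=0), np.cov(act_fake, rowvar=False)
    covmean = sqrtm(cov_r.dot(cov_f))            # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                 # discard tiny imaginary parts from numerical noise
        covmean = covmean.real
    # FID = ||mu_r - mu_f||^2 + Tr(cov_r + cov_f - 2 * sqrt(cov_r cov_f)); lower is better
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))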

8.2.3 Which evaluation scheme to use

Both measures (inception score and FID) are easy to implement and calculate on batches of generated images. As such, the practice of systematically generating images and saving models during training can and should continue to be used to allow post hoc model selection. Diving deep into the inception score and FID is out of the scope of this book. As mentioned earlier, this is an active area of research, and there is no consensus in the industry as of the time of writing about the one best approach to evaluate GAN performance. Different scores assess various aspects of the image-generation process, and it is unlikely that a single score can cover all aspects. The goal of this section is to expose you to some techniques that have been developed in recent years to automate the GAN evaluation process, but manual evaluation is still widely used.

When you are getting started, it is a good idea to begin with manual inspection of generated images in order to evaluate and select generator models. Developing GAN models is complex enough for both beginners and experts; manual inspection can get you a long way while refining your model implementation and testing model configurations.

Other researchers are taking different approaches by using domain-specific evaluation metrics. For example, Konstantin Shmelkov and his team (2018) used two measures based on image classification, GAN-train and GAN-test, which approximated the recall (diversity) and precision (quality of the image) of GANs, respectively.5

8.3 Popular GAN applications

Generative modeling has come a long way in the last five years. The field has developed to the point where generative models can produce content that rivals work created by humans in some domains. GANs are now being applied to problems in industries such as healthcare, automotive, and fine arts, among many others. In this section, we will learn about some of the use cases of adversarial networks and which GAN architecture is used for each application. The goal of this section is not to implement the variations of the GAN network, but to provide some exposure to potential applications of GAN models and resources for further reading.

8.3.1 Text-to-photo synthesis

Synthesis of high-quality images from text descriptions is a challenging problem in CV. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts.

The GAN network that was built for this application is the stacked generative adversarial network (StackGAN).6 Zhang et al. were able to generate 256 × 256 photorealistic images conditioned on text descriptions.

StackGANs work in two stages (figure 8.14):

  • Stage-I : StackGAN sketches the primitive shape and colors of the object based on the given text description, yielding low-resolution images.

  • Stage-II : StackGAN takes the output of stage-I and a text description as input and generates high-resolution images with photorealistic details. It is able to rectify defects in the images created in stage-I and add compelling details with the refinement process.

Figure 8.14 (a) Stage-I: Given text descriptions, StackGAN sketches rough shapes and basic colors of objects, yielding low-resolution images. (b) Stage-II takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photorealistic details. (Source: Zhang et al., 2016.)

8.3.2 Image-to-image translation (Pix2Pix GAN)

Image-to-image translation is defined as translating one representation of a scene into another, given sufficient training data. It is inspired by the language translation analogy: just as an idea can be expressed by many different languages, a scene may be rendered by a grayscale image, RGB image, semantic label maps, edge sketches, and so on. In figure 8.15, image-to-image translation tasks are demonstrated on a range of applications such as converting street scene segmentation labels to real images, grayscale to color images, sketches of products to product photographs, and day photographs to night ones.

Pix2Pix is a member of the GAN family designed by Phillip Isola et al. in 2016 for general-purpose image-to-image translation.7 The Pix2Pix network architecture is similar to the GAN concept: it consists of a generator model for outputting new synthetic images that look realistic, and a discriminator model that classifies images as real (from the dataset) or fake (generated). The training process is also similar to that used for GANs: the discriminator model is updated directly, whereas the generator model is updated via the discriminator model. As such, the two models are trained simultaneously in an adversarial process where the generator seeks to better fool the discriminator and the discriminator seeks to better identify the counterfeit images.

Figure 8.15 Examples of Pix2Pix applications taken from the original paper.

The novel idea of Pix2Pix networks is that they learn a loss function adapted to the task and data at hand, which makes them applicable in a wide variety of settings. They are a type of conditional GAN (cGAN) where the generation of the output image is conditional on an input source image. The discriminator is provided with both a source image and the target image and must determine whether the target is a plausible transformation of the source image.

The results of the Pix2Pix network are really promising for many image-to-image translation tasks. Visit https://affinelayer.com/pixsrv to play more with the Pix2Pix network; this site has an interactive demo created by Isola and team in which you can convert sketch edges of cats or products to photos and façades to real images.

8.3.3 Image super-resolution GAN (SRGAN)

A certain type of GAN model can be used to convert low-resolution images into high-resolution images. This type is called a super-resolution generative adversarial network (SRGAN) and was introduced by Christian Ledig et al. in 2016.8 Figure 8.16 shows how SRGAN is able to create a very high-resolution image from a low-resolution input.

Figure 8.16 SRGAN converting a low-resolution image to a high-resolution image. (Source: Ledig et al., 2016.)

8.3.4 Ready to get your hands dirty?

GAN models have huge potential for creating and imagining new realities that have never existed before. The applications mentioned in this chapter are just a few examples to give you an idea of what GANs can do today. Such applications come out every few weeks and are worth trying. If you are interested in getting your hands dirty with more GAN applications, visit the amazing Keras-GAN repository at https://github.com/eriklindernoren/Keras-GAN, maintained by Erik Linder-Norén. It includes many GAN models created using Keras and is an excellent resource for Keras examples. Much of the code in this chapter was inspired by and adapted from this repository.

8.4 Project: Building your own GAN

In this project, you’ll build a GAN using convolutional layers in the generator and discriminator. This is called a deep convolutional GAN (DCGAN) for short. The DCGAN architecture was first explored by Alec Radford et al. (2016), as discussed in section 8.1.1, and has seen impressive results in generating new images. You can follow along with the implementation in this chapter or run code in the project notebook available with this book’s downloadable code.

In this project, you’ll be training DCGAN on the Fashion-MNIST dataset (https://github.com/zalandoresearch/fashion-mnist). Fashion-MNIST consists of 60,000 grayscale images for training and a test set of 10,000 images (figure 8.17). Each 28 × 28 grayscale image is associated with a label from 10 classes. Fashion-MNIST is intended to serve as a direct replacement for the original MNIST dataset for benchmarking machine learning algorithms. I chose grayscale images for this project because it requires less computational power to train convolutional networks on one-channel grayscale images compared to three-channel colored images, which makes it easier for you to train on a personal computer without a GPU.

Figure 8.17 Fashion-MNIST dataset examples

The dataset is broken into 10 fashion categories. The class labels are as follows:

Label    Description
0        T-shirt/top
1        Trouser
2        Pullover
3        Dress
4        Coat
5        Sandal
6        Shirt
7        Sneaker
8        Bag
9        Ankle boot

Step 1: Import libraries

As always, the first thing to do is to import all the libraries we use in this project:

from __future__ import print_function, division
 
from keras.datasets import fashion_mnist                                 
 
from keras.layers import Input, Dense, Reshape, Flatten, Dropout         
from keras.layers import BatchNormalization, Activation, ZeroPadding2D   
from keras.layers.advanced_activations import LeakyReLU                  
from keras.layers.convolutional import UpSampling2D, Conv2D              
from keras.models import Sequential, Model                               
from keras.optimizers import Adam                                        
 
import numpy as np                                                       
import matplotlib.pyplot as plt                                          

Imports the fashion_mnist dataset from Keras

Imports Keras layers and models

Imports numpy and matplotlib

Step 2: Download and visualize the dataset

Keras makes the Fashion-MNIST dataset available for us to download with just one command: fashion_mnist.load_data(). Here, we download the dataset and rescale the training set to the range -1 to 1 to allow the model to converge faster (see the “Data normalization” section in chapter 4 for more details on image scaling):

(training_data, _), (_, _) = fashion_mnist.load_data()     
 
X_train = training_data / 127.5 - 1.                       
X_train = np.expand_dims(X_train, axis=3)                  

Loads the dataset

Rescales the training data to the range -1 to 1 and adds the single-channel dimension
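A quick optional check confirms the shape and range we expect after these two lines:

print(X_train.shape)                   # (60000, 28, 28, 1)
print(X_train.min(), X_train.max())    # -1.0 and 1.0 after rescaling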

Just for the fun of it, let’s visualize the image matrix (figure 8.18):

def visualize_input(img, ax):
    ax.imshow(img, cmap='gray')
    width, height = img.shape
    thresh = img.max()/2.5
    for x in range(width):
        for y in range(height):
            ax.annotate(str(round(img[x][y],2)), xy=(y,x),
                        horizontalalignment='center',
                        verticalalignment='center',
                        color='white' if img[x][y]<thresh else 'black')
 
fig = plt.figure(figsize = (12,12)) 
ax = fig.add_subplot(111)
visualize_input(training_data[3343], ax)

Figure 8.18 A visualized example of the Fashion-MNIST dataset

Step 3: Build the generator

Now, let’s build the generator model. The input will be our noise vector (z) as explained in section 8.1.5. The generator architecture is shown in figure 8.19.

Figure 8.19 Architecture of the generator model

The first layer is a fully connected layer that is then reshaped into a deep, narrow volume, something like 7 × 7 × 128 (in the original DCGAN paper, the team reshaped the input to 4 × 4 × 1024). Then we use the upsampling layer to double the feature map dimensions from 7 × 7 to 14 × 14 and then again to 28 × 28. In this network, we use three convolutional layers. We also use batch normalization and a ReLU activation. For each of these layers, the general scheme is convolution ⇒ batch normalization ⇒ ReLU. We keep stacking up layers like this until we get to the final convolutional layer, whose output has the shape 28 × 28 × 1 and uses a tanh activation to match the [-1, 1] range of the rescaled training data:

def build_generator():
       generator = Sequential()                                             
 
       generator.add(Dense(128 * 7 * 7, activation="relu", input_dim=100))  
 
       generator.add(Reshape((7, 7, 128)))                                  
 
       generator.add(UpSampling2D())                                        
 
       generator.add(Conv2D(128, kernel_size=3, padding="same",             
                     activation="relu"))                                    
       generator.add(BatchNormalization(momentum=0.8))                      
       generator.add(UpSampling2D())                                        
  
# convolutional + batch normalization layers
       generator.add(Conv2D(64, kernel_size=3, padding="same",              
                     activation="relu"))                                    
       generator.add(BatchNormalization(momentum=0.8))                      
  
       # convolutional layer with filters = 1
       generator.add(Conv2D(1, kernel_size=3, padding="same"))
       generator.add(Activation("tanh"))     # tanh output matches the [-1, 1] range of the rescaled data
  
       generator.summary()                                                  
  
       noise = Input(shape=(100,))                                          
 
       fake_image = generator(noise)                                        
 
       return Model(inputs=noise, outputs=fake_image)                       

Instantiates a sequential model and names it generator

Adds the dense layer that has a number of neurons = 128 × 7 × 7

Reshapes the image dimensions to 7 × 7 × 128

Upsampling layer to double the size of the image dimensions to 14 × 14

Adds a convolutional layer to run the convolutional process and batch normalization

Upsamples the image dimensions to 28 × 28

We don’t add upsampling here because the image size of 28 × 28 is equal to the image size in the MNIST dataset. You can adjust this for your own problem.

Prints the model summary

Generates the input noise vector of length = 100. We chose 100 here to create a simple network.

Runs the generator model to create the fake image

Returns a model that takes the noise vector as an input and outputs the fake image

Step 4: Build the discriminator

The discriminator is just a convolutional classifier like the ones we have built before (figure 8.20). The inputs to the discriminator are 28 × 28 × 1 images. We want a few convolutional layers and then a fully connected output layer with a sigmoid activation. For the depths of the convolutional layers, I suggest starting with 32 or 64 filters in the first layer and then doubling the depth as you add layers. In this implementation, we start with 32 filters, then 64, 128, and finally 256. For downsampling, we do not use pooling layers. Instead, we use only strided convolutional layers, similar to Radford et al.’s implementation.

Figure 8.20 Architecture of the discriminator model

We also use batch normalization and dropout to optimize training, as we learned in chapter 4. For the convolutional layers, the general scheme is convolution ⇒ batch normalization ⇒ leaky ReLU ⇒ dropout (the first layer skips batch normalization). Now, let’s build the build_discriminator function:

def build_discriminator():
       discriminator = Sequential()                                             
 
       discriminator.add(Conv2D(32, kernel_size=3, strides=2, 
                         input_shape=(28,28,1), padding="same"))                
 
       discriminator.add(LeakyReLU(alpha=0.2))                                  
 
       discriminator.add(Dropout(0.25))                                         
 
       discriminator.add(Conv2D(64, kernel_size=3, strides=2, 
                         padding="same"))                                       
 
       discriminator.add(ZeroPadding2D(padding=((0,1),(0,1))))                  
 
       discriminator.add(BatchNormalization(momentum=0.8))                      
 
       discriminator.add(LeakyReLU(alpha=0.2))
       discriminator.add(Dropout(0.25))
  
       discriminator.add(Conv2D(128, kernel_size=3, strides=2, padding="same")) 
       discriminator.add(BatchNormalization(momentum=0.8))                      
       discriminator.add(LeakyReLU(alpha=0.2))                                  
       discriminator.add(Dropout(0.25))                                         
  
       discriminator.add(Conv2D(256, kernel_size=3, strides=1, padding="same")) 
       discriminator.add(BatchNormalization(momentum=0.8))                      
       discriminator.add(LeakyReLU(alpha=0.2))                                  
       discriminator.add(Dropout(0.25))                                         
  
       discriminator.add(Flatten())                                             
       discriminator.add(Dense(1, activation='sigmoid'))                        
  
       img = Input(shape=(28,28,1))                                             
       probability = discriminator(img)                                         
  
       return Model(inputs=img, outputs=probability)                            

Instantiates a sequential model and names it discriminator

Adds a convolutional layer to the discriminator model

Adds a leaky ReLU activation function

Adds a dropout layer with a 25% dropout probability

Adds a second convolutional layer with zero padding

Adds a zero-padding layer to change the dimension from 7 × 7 to 8 × 8

Adds a batch normalization layer for faster learning and higher accuracy

Adds a third convolutional layer with batch normalization, leaky ReLU, and a dropout

Adds the fourth convolutional layer with batch normalization, leaky ReLU, and a dropout

Flattens the network and adds the output dense layer with sigmoid activation function

Sets the input image shape

Runs the discriminator model to get the output probability

Returns a model that takes the image as input and produces the probability output

Step 5: Build the combined model

As explained in section 8.1.3, to train the generator, we need to build a combined network that contains both the generator and the discriminator (figure 8.21). The combined model takes the noise signal as input (z) and outputs the discriminator’s prediction output as fake or real.

Figure 8.21 Architecture of the combined model

Remember that we want to disable discriminator training for the combined model, as explained in detail in section 8.1.3. When training the generator, we don’t want the discriminator to update weights as well, but we still want to include the discriminator model in the generator training. So, we create a combined network that includes both models but freeze the weights of the discriminator model in the combined network:

optimizer = Adam(learning_rate=0.0002, beta_1=0.5)                  
 
discriminator = build_discriminator()                               
discriminator.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
 
discriminator.trainable = False                                     
 
# Build the generator
generator = build_generator()                                       
 
z = Input(shape=(100,))                                             
img = generator(z)                                                  
 
valid = discriminator(img)                                          
 
combined = Model(inputs=z, outputs=valid)                           
combined.compile(loss='binary_crossentropy', optimizer=optimizer)   

Defines the optimizer

Builds and compiles the discriminator

Freezes the discriminator weights because we don’t want to train it during generator training

Builds the generator

The generator takes noise as input with latent_dim = 100 and generates images.

The discriminator takes generated images as input and determines their validity.

The combined model (stacked generator and discriminator) trains the generator to fool the discriminator.

Step 6: Build the training function

When training the GAN model, we train two networks: the discriminator and the combined network that we created in the previous section. Let’s build the train function, which takes the following arguments:

  • The number of epochs

  • The batch size

  • save_interval to state how often we want to save the results

def train(epochs, batch_size=128, save_interval=50):
 
    valid = np.ones((batch_size, 1))                                
    fake = np.zeros((batch_size, 1))                                
 
    for epoch in range(epochs):
 
        ## Train Discriminator network
 
        idx = np.random.randint(0, X_train.shape[0], batch_size)    
        imgs = X_train[idx]                                         
 
        noise = np.random.normal(0, 1, (batch_size, 100))           
        gen_imgs = generator.predict(noise)                         
 
        d_loss_real = discriminator.train_on_batch(imgs, valid)     
        d_loss_fake = discriminator.train_on_batch(gen_imgs, fake)  
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)             
 
        ## Train the combined network (Generator)
 
        g_loss = combined.train_on_batch(noise, valid)              
 
print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % 
      (epoch, d_loss[0], 100*d_loss[1], g_loss))                    
 
        if epoch % save_interval == 0:                              
            plot_generated_images(epoch, generator)                 

Adversarial ground truths

Selects a random half of images

Sample noise, and generates a batch of new images

Trains the discriminator (real classified as 1s and generated as 0s)

Trains the generator (wants the discriminator to mistake images for real ones)

Prints progress

Saves generated image samples if at save_interval

Before you run the train() function, you need to define the following plot_generated_images() function:

def plot_generated_images(epoch, generator, examples=100, dim=(10, 10), 
                          figsize=(10, 10)):
    noise = np.random.normal(0, 1, size=[examples, 100])    # 100 matches the length of the generator's input noise vector
    generated_images = generator.predict(noise)
    generated_images = generated_images.reshape(examples, 28, 28)

    plt.figure(figsize=figsize)
    for i in range(generated_images.shape[0]):
        plt.subplot(dim[0], dim[1], i+1)
        plt.imshow(generated_images[i], interpolation='nearest', cmap='gray_r')
        plt.axis('off')
    plt.tight_layout()
    plt.savefig('gan_generated_image_epoch_%d.png' % epoch)

Step 7: Train and observe results

Now that the code implementation is complete, we are ready to start the DCGAN training. To train the model, run the following code snippet:

train(epochs=1000, batch_size=32, save_interval=50)

This will run the training for 1,000 epochs and save generated images every 50 epochs. When you run the train() function, the training progress prints as shown in figure 8.22.

Figure 8.22 Training progress for the first 16 epochs

I ran this training myself for 10,000 epochs. Figure 8.23 shows my results after 0, 50, 1,000, and 10,000 epochs.

Figure 8.23 Output of the GAN generator after 0, 50, 1,000, and 10,000 epochs

As you can see in figure 8.23, at epoch 0, the images are just random noise--no patterns or meaningful data. At epoch 50, patterns have started to form. One very apparent pattern is the brighter pixels forming at the center of the image while the surrounding pixels stay dark. This happens because in the training data, all of the shapes are located at the center of the image. Later in the training process, at epoch 1,000, you can see clear shapes, and you can probably guess the type of training data fed to the GAN model. Fast-forward to epoch 10,000, and you can see that the generator has become very good at creating new images that are not present in the training dataset. For example, pick any of the objects created at this epoch: let’s say the top-left image (a dress). This is a totally new dress design that is not present in the training dataset: the GAN model created it after learning the dress patterns from the training set. You can run the training longer or make the generator network even deeper to get more refined results.
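Once you are happy with the results, you can sample from the trained generator directly and persist it for later use. This is a minimal sketch; the file name is just an example:

noise = np.random.normal(0, 1, (16, 100))   # 16 noise vectors matching the generator's input length
new_images = generator.predict(noise)       # shape (16, 28, 28, 1), pixel values in [-1, 1]
generator.save('fashion_generator.h5')      # saves the trained generator model to disk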

In closing

For this project, I used the Fashion-MNIST dataset because the images are very small and are in grayscale (one-channel), which makes it computationally inexpensive for you to train on your local computer with no GPU. Fashion-MNIST is also very clean data: all of the images are centered and have less noise so they don’t require much preprocessing before you kick off your GAN training. This makes it a good toy dataset to jumpstart your first GAN project.

If you are excited to get your hands dirty with more advanced datasets, you can try CIFAR as your next step (https://www.cs.toronto.edu/~kriz/cifar.html) or Google’s Quick, Draw! dataset (https://quickdraw.withgoogle.com), which is considered the world’s largest doodle dataset at the time of writing. Another, more serious, dataset is Stanford’s Cars Dataset (https://ai.stanford.edu/~jkrause/cars/car_dataset.html), which contains more than 16,000 images of 196 classes of cars. You can try to train your GAN model to design a completely new design for your dream car!

Summary

  • GANs learn patterns from the training dataset and create new images that follow a distribution similar to that of the training set.

  • The GAN architecture consists of two deep neural networks that compete with each other.

  • The generator tries to convert random noise into observations that look as if they have been sampled from the original dataset.

  • The discriminator tries to predict whether an observation comes from the original dataset or is one of the generator’s forgeries.

  • The discriminator’s model is a typical classification neural network that aims to classify images generated by the generator as real or fake.

  • The generator’s architecture looks like an inverted CNN that starts with a narrow input and is upsampled a few times until it reaches the desired size.

  • The upsampling layer scales the image dimensions by repeating each row and column of its input pixels.

  • To train the GAN, we alternate on each batch between two training flows: the discriminator alone, and a combined network in which we freeze the weights of the discriminator and update only the generator’s weights.

  • To evaluate the GAN, we mostly rely on our observation of the quality of images created by the generator. Other evaluation metrics are the inception score and Fréchet inception distance (FID).

  • In addition to generating new images, GANs can be used in applications such as text-to-photo synthesis, image-to-image translation, image super-resolution, and many other applications.


1.Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative Adversarial Networks,” 2014, http://arxiv.org/abs/1406.2661.

2.Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” 2016, http://arxiv.org/abs/1511.06434.

3.Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved Techniques for Training GANs,” 2016, http://arxiv.org/abs/1606.03498.

4.Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” 2017, http://arxiv.org/abs/1706.08500.

5.Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari, “How Good Is My GAN?” 2018, http://arxiv.org/abs/1807.09499.

6.Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas, “StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks,” 2016, http://arxiv.org/abs/1612.03242.

7.Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” 2016, http://arxiv.org/abs/1611.07004.

8.Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” 2016, http://arxiv.org/abs/1609.04802.
