Generative adversarial networks (GANs) are a new type of neural architecture introduced by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio, in 2014.1 GANs have been called “the most interesting idea in the last 10 years in ML” by Yann LeCun, Facebook’s AI research director. The excitement is well justified. The most notable feature of GANs is their capacity to create hyperrealistic images, videos, music, and text. For example, except for the far-right column, none of the faces shown on the right side of figure 8.1 belong to real humans; they are all fake. The same is true for the handwritten digits on the left side of the figure. This shows a GAN’s ability to learn features from the training images and imagine its own new images using the patterns it has learned.
We’ve learned in the past chapters how deep neural networks can be used to understand image features and perform deterministic tasks on them like object classification and detection. In this part of the book, we will talk about a different type of application for deep learning in the computer vision world: generative models. These are neural network models that are able to imagine and produce new content that hasn’t been created before. They can imagine new worlds, new people, and new realities in a seemingly magical way. We train generative models by providing a training dataset in a specific domain; their job is to create images that have new objects from the same domain that look like the real data.
For a long time, humans have had an advantage over computers: the ability to imagine and create. Computers have excelled in solving problems like regression, classification, and clustering. But with the introduction of generative networks, researchers can make computers generate content of the same or higher quality compared to that created by their human counterparts. By learning to mimic any distribution of data, computers can be taught to create worlds that are similar to our own in any domain: images, music, speech, prose. They are robot artists, in a sense, and their output is impressive. GANs are also seen as an important stepping stone toward achieving artificial general intelligence (AGI), an artificial system capable of matching human cognitive capacity to acquire expertise in virtually any domain--from images, to language, to creative skills needed to compose sonnets.
Naturally, this ability to generate new content makes GANs look a little bit like magic, at least at first sight. In this chapter, we will only scratch the surface of what is possible with GANs. We will look past the apparent magic and dive into the architectural ideas and math behind these models, so that you have the theoretical knowledge and practical skills to continue exploring whichever facet of this field you find most interesting. Not only will we discuss the fundamental notions that GANs rely on, but we will also implement and train an end-to-end GAN and go through it step by step. Let’s get started!
GANs are based on the idea of adversarial training. The GAN architecture basically consists of two neural networks that compete against each other:
The generator tries to convert random noise into observations that look as if they have been sampled from the original dataset.
The discriminator tries to predict whether an observation comes from the original dataset or is one of the generator’s forgeries.
This competition helps them learn to mimic any distribution of data. I like to think of the GAN architecture as two boxers fighting (figure 8.2): in their quest to win the bout, both are learning each other’s moves and techniques. They start out knowing little about their opponent, and as the match goes on, they learn and become better.
Another analogy will help drive home the idea: think of a GAN as the opposition of a counterfeiter and a cop in a game of cat and mouse, where the counterfeiter is learning to pass false notes, and the cop is learning to detect them (figure 8.3). Both are dynamic: as the counterfeiter learns to perfect creating false notes, the cop is in training and getting better at detecting the fakes. Each side learns the other’s methods in a constant escalation.
As you can see in the architecture diagram in figure 8.4, a GAN takes the following steps:
The generator takes in random noise and returns an image.
This generated image is fed into the discriminator alongside a stream of images taken from the actual, ground-truth dataset.
The discriminator takes in both real and fake images and returns probabilities: numbers between 0 and 1, with 1 representing a prediction of authenticity and 0 representing a prediction of fake.
If you take a close look at the generator and discriminator networks, you will notice that the generator network is an inverted ConvNet that starts with the flattened vector. The images are upscaled until they are similar in size to the images in the training dataset. We will dive deeper into the generator architecture later in this chapter--I just wanted you to notice this phenomenon now.
In the original GAN paper in 2014, multi-layer perceptron (MLP) networks were used to build the generator and discriminator networks. Since then, however, it has been shown that convolutional layers give greater predictive power to the discriminator, which in turn enhances the accuracy of the generator and the overall model. This type of GAN is called a deep convolutional GAN (DCGAN) and was developed by Alec Radford et al. in 2016.2 Today, most GAN architectures for images contain convolutional layers, so the “DC” is implied when we talk about GANs; for the rest of this chapter, we use the terms GAN and DCGAN interchangeably. You can also go back to chapters 2 and 3 to learn more about the differences between MLPs and CNNs and why CNNs are preferred for image problems. Next, let’s dive deeper into the architecture of the discriminator and generator networks.
As explained earlier, the goal of the discriminator is to predict whether an image is real or fake. This is a typical supervised classification problem, so we can use the traditional classifier network that we learned about in the previous chapters. The network consists of stacked convolutional layers, followed by a dense output layer with a sigmoid activation function. We use a sigmoid activation function because this is a binary classification problem: the network outputs a probability between 0 and 1, where a value close to 0 means the discriminator believes the image is fake (produced by the generator) and a value close to 1 means it believes the image is real.
The discriminator is a normal, well understood classification model. As you can see in figure 8.5, training the discriminator is pretty straightforward. We feed the discriminator labeled images: fake (or generated) and real images. The real images come from the training dataset, and the fake images are the output of the generator model.
Now, let’s implement the discriminator network in Keras. At the end of this chapter, we will compile all the code snippets together to build an end-to-end GAN. We will first implement a discriminator_model function. In this code snippet, the shape of the image input is 28 × 28; you can change it as needed for your problem:
from keras.models import Sequential, Model
from keras.layers import (Input, Dense, Flatten, Dropout, Conv2D,
                          ZeroPadding2D, BatchNormalization, LeakyReLU)

def discriminator_model():
    discriminator = Sequential()                                             # ❶
    discriminator.add(Conv2D(32, kernel_size=3, strides=2,
                             input_shape=(28, 28, 1), padding="same"))       # ❷
    discriminator.add(LeakyReLU(alpha=0.2))                                  # ❸
    discriminator.add(Dropout(0.25))                                         # ❹

    discriminator.add(Conv2D(64, kernel_size=3, strides=2, padding="same"))  # ❺
    discriminator.add(ZeroPadding2D(padding=((0, 1), (0, 1))))               # ❺
    discriminator.add(BatchNormalization(momentum=0.8))                      # ❻
    discriminator.add(LeakyReLU(alpha=0.2))                                  # ❻
    discriminator.add(Dropout(0.25))                                         # ❻

    discriminator.add(Conv2D(128, kernel_size=3, strides=2, padding="same")) # ❼
    discriminator.add(BatchNormalization(momentum=0.8))                      # ❼
    discriminator.add(LeakyReLU(alpha=0.2))                                  # ❼
    discriminator.add(Dropout(0.25))                                         # ❼

    discriminator.add(Conv2D(256, kernel_size=3, strides=1, padding="same")) # ❽
    discriminator.add(BatchNormalization(momentum=0.8))                      # ❽
    discriminator.add(LeakyReLU(alpha=0.2))                                  # ❽
    discriminator.add(Dropout(0.25))                                         # ❽

    discriminator.add(Flatten())                                             # ❾
    discriminator.add(Dense(1, activation='sigmoid'))                        # ❾

    discriminator.summary()                                                  # ❿

    img_shape = (28, 28, 1)                                                  # ⓫
    img = Input(shape=img_shape)
    probability = discriminator(img)                                         # ⓬
    return Model(img, probability)                                           # ⓭
❶ Instantiates a sequential model and names it discriminator
❷ Adds a convolutional layer to the discriminator model
❸ Adds a leaky ReLU activation function
❹ Adds a dropout layer with a 25% dropout probability
❺ Adds a second convolutional layer with zero padding
❻ Adds a batch normalization layer for faster learning and higher accuracy
❼ Adds a third convolutional layer with batch normalization, leaky ReLU, and a dropout
❽ Adds the fourth convolutional layer with batch normalization, leaky ReLU, and a dropout
❾ Flattens the network and adds the output dense layer with sigmoid activation function
❿ Prints the model summary
⓫ Sets the input image shape
⓬ Runs the discriminator model to get the output probability
⓭ Returns a model that takes the image as input and produces the probability output
The output summary of the discriminator model is shown in figure 8.6. As you might have noticed, there is nothing new: the discriminator model follows the regular pattern of the traditional CNN networks that we learned about in chapters 3, 4, and 5. We stack convolutional, batch normalization, activation, and dropout layers to create our model. All of these layers have hyperparameters that we tune when we are training the network. For your own implementation, you can tune these hyperparameters and add or remove layers as you see fit. Tuning CNN hyperparameters is explained in detail in chapters 3 and 4.
In the output summary in figure 8.6, note that the width and height of the output feature maps decrease in size, whereas the depth increases in size. This is the expected behavior for traditional CNN networks as we’ve seen in previous chapters. Let’s see what happens to the feature maps’ size in the generator network in the next section.
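The shrinking width and height can be checked by hand before looking at the summary. The following is a minimal sketch in plain Python, not part of the chapter's project; it assumes the usual rule that a convolution with "same" padding and stride s produces an output of size ceil(input / s), and the helper name conv_same is made up for illustration:

```python
import math

def conv_same(size, stride):
    # Assumed rule: with padding="same", output size = ceil(input size / stride)
    return math.ceil(size / stride)

size = 28                                   # input images are 28 x 28 x 1
trace = [("input", size, 1)]
size = conv_same(size, 2); trace.append(("conv 32, stride 2", size, 32))
size = conv_same(size, 2); trace.append(("conv 64, stride 2", size, 64))
size = size + 1;           trace.append(("zero padding (0, 1)", size, 64))
size = conv_same(size, 2); trace.append(("conv 128, stride 2", size, 128))
size = conv_same(size, 1); trace.append(("conv 256, stride 1", size, 256))
flattened = size * size * 256               # length of the vector fed to the dense layer

for name, s, depth in trace:
    print(f"{name:22s} -> {s} x {s} x {depth}")
print("flatten ->", flattened)
```

Under that assumption, the width and height go 28, 14, 7, 8 (after the extra row and column of zero padding), 4, and 4, while the depth grows from 1 to 256, so the flattened vector has 4 × 4 × 256 = 4,096 values.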
The generator takes in some random data and tries to mimic the training dataset to generate fake images. Its goal is to trick the discriminator by trying to generate images that are perfect replicas of the training dataset. As it is trained, it gets better and better after each iteration. But the discriminator is being trained at the same time, so the generator has to keep improving as the discriminator learns its tricks.
As you can see in figure 8.7, the generator model looks like an inverted ConvNet. The generator takes a vector input with some random noise data and reshapes it into a cube volume that has a width, height, and depth. This volume is meant to be treated as a feature map that will be fed to several convolutional layers that will create the final image.
Traditional convolutional neural networks use pooling layers to downsample input images. The generator has to do the opposite and scale the feature maps up, so we use upsampling layers that enlarge the image dimensions by repeating each row and column of the input pixels.
Keras has an upsampling layer (UpSampling2D) that scales the image dimensions by taking a scaling factor (size) as an argument:
keras.layers.UpSampling2D(size=(2, 2))
This line of code repeats every row and column of the image matrix two times, because the size of the scaling factor is set to (2, 2); see figure 8.8. If the scaling factor is (3, 3), the upsampling layer repeats each row and column of the input matrix three times, as shown in figure 8.9.
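To make the repetition concrete, here is a small NumPy sketch of the same idea. It is not the Keras implementation; the helper upsample2d is a hypothetical stand-in that works on a single 2D matrix rather than a batch of feature maps:

```python
import numpy as np

def upsample2d(image, size=(2, 2)):
    # Repeat each row size[0] times, then each column size[1] times
    rows_repeated = np.repeat(image, size[0], axis=0)
    return np.repeat(rows_repeated, size[1], axis=1)

image = np.array([[1, 2],
                  [3, 4]])

doubled = upsample2d(image, size=(2, 2))
print(doubled)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]

tripled = upsample2d(image, size=(3, 3))
print(tripled.shape)   # (6, 6)
```

A 2 × 2 input becomes 4 × 4 with a factor of (2, 2) and 6 × 6 with a factor of (3, 3); no new pixel values are invented, which is why the convolutional layers that follow are needed to refine the enlarged feature maps.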
When we build the generator model, we keep adding upsampling layers until the size of the feature maps is similar to the training dataset. You will see how this is implemented in Keras in the next section.
Now, let’s build the generator_model function that builds the generator network:
from keras.models import Sequential, Model
from keras.layers import (Input, Dense, Reshape, UpSampling2D, Conv2D,
                          BatchNormalization, Activation)

def generator_model():
    generator = Sequential()                                          # ❶
    generator.add(Dense(128 * 7 * 7, activation="relu", input_dim=100))  # ❷
    generator.add(Reshape((7, 7, 128)))                               # ❸
    generator.add(UpSampling2D(size=(2, 2)))                          # ❹

    generator.add(Conv2D(128, kernel_size=3, padding="same"))         # ❺
    generator.add(BatchNormalization(momentum=0.8))                   # ❺
    generator.add(Activation("relu"))
    generator.add(UpSampling2D(size=(2, 2)))                          # ❻

    # convolutional + batch normalization layers
    generator.add(Conv2D(64, kernel_size=3, padding="same"))          # ❼
    generator.add(BatchNormalization(momentum=0.8))                   # ❼
    generator.add(Activation("relu"))

    # convolutional layer with filters = 1
    generator.add(Conv2D(1, kernel_size=3, padding="same"))
    generator.add(Activation("tanh"))

    generator.summary()                                               # ❽

    noise = Input(shape=(100,))                                       # ❾
    fake_image = generator(noise)                                     # ❿
    return Model(noise, fake_image)                                   # ⓫
❶ Instantiates a sequential model and names it generator
❷ Adds a dense layer that has a number of neurons = 128 × 7 × 7
❸ Reshapes the image dimensions to 7 × 7 × 128
❹ Upsampling layer to double the size of the image dimensions to 14 × 14
❺ Adds a convolutional layer to run the convolutional process and batch normalization
❻ Upsamples the image dimensions to 28 × 28
❼ We don’t add upsampling here because the image size of 28 × 28 is equal to the image size in the MNIST dataset. You can adjust this for your own problem.
❽ Prints the model summary
❾ Generates the input noise vector of length = 100. We use 100 here to create a simple network.
❿ Runs the generator model to create the fake image
⓫ Returns a model that takes the noise vector as input and outputs the fake image
The output summary of the generator model is shown in figure 8.10. In the code snippet, the only new component is the UpSampling2D layer, which doubles its input dimensions by repeating pixels. Similar to the discriminator, we stack convolutional layers on top of each other and add other optimization layers like BatchNormalization. The key difference in the generator model is that it starts with the flattened vector; images are upsampled until they have dimensions similar to the training dataset. All of these layers have hyperparameters that we tune when we are training the network. For your own implementation, you can tune these hyperparameters and add or remove layers as you see fit.
Notice the change in the output shape after each layer. It starts from a 1D vector of 6,272 neurons. We reshaped it to a 7 × 7 × 128 volume, and then the width and height were upsampled twice to 14 × 14 followed by 28 × 28. The depth decreased from 128 to 64 to 1 because this network is built to deal with the grayscale MNIST dataset project that we will implement later in this chapter. If you are building a generator model to generate color images, then you should set the filters in the last convolutional layer to 3.
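The shape changes described here can be reproduced with a few lines of arithmetic. This is only a bookkeeping sketch in plain Python (the variable names are illustrative), assuming that each upsampling layer doubles the width and height and that each "same"-padded convolution changes only the depth:

```python
dense_units = 128 * 7 * 7            # length of the 1D vector produced by the dense layer
height = width = 7                   # after reshaping to 7 x 7 x 128
shapes = [(height, width, 128)]

for filters in (128, 64):
    # Each upsampling layer doubles width and height;
    # the convolution that follows sets the depth.
    height, width = height * 2, width * 2
    shapes.append((height, width, filters))

shapes.append((height, width, 1))    # final convolution with one filter (grayscale)

print(dense_units)                   # 6272
for shape in shapes:
    print(shape)
```

The trace ends at 28 × 28 × 1, which matches the MNIST image size; for color images, the last entry would have a depth of 3 instead of 1.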
Now that we’ve learned the discriminator and generator models separately, let’s put them together to train an end-to-end generative adversarial network. The discriminator is being trained to become a better classifier to maximize the probability of assigning the correct label to both training examples (real) and images generated by the generator (fake): for example, the police officer becomes better at differentiating between fakes and real currency. The generator, on the other hand, is being trained to become a better forger, to maximize its chances of fooling the discriminator. Both networks are getting better at what they do.
The process of training GAN models involves two processes:
Train the discriminator. This is a straightforward supervised training process. The network is given labeled images coming from the generator (fake) and the training data (real), and it learns to classify between real and fake images with a sigmoid prediction output. Nothing new here.
Train the generator. This process is a little tricky. The generator model cannot be trained alone like the discriminator. It needs the discriminator model to tell it whether it did a good job of faking images. So, we create a combined network to train the generator, composed of both discriminator and generator models.
Think of the training processes as two parallel lanes. One lane trains the discriminator alone, and the other lane is the combined model that trains the generator. The GAN training process is illustrated in figure 8.11.
As you can see in figure 8.11, when training the combined model, we freeze the weights of the discriminator because this model focuses only on training the generator. We will discuss the intuition behind this idea when we explain the generator training process. For now, just know that we need to build and train two models: one for the discriminator alone and the other for both discriminator and generator models.
Both processes follow the traditional neural network training process explained in chapter 2. It starts with the feedforward process and then makes predictions and calculates and backpropagates the error. When training the discriminator, the error is backpropagated back to the discriminator model to update its weights; in the combined model, the error is backpropagated back to the generator to update its weights.
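To see the two lanes without any deep learning machinery, consider a deliberately tiny GAN written in NumPy. Everything here is a hypothetical toy, not the chapter's model: the generator is the line G(z) = a·z + b, the discriminator is D(x) = sigmoid(w·x + c), the "real" data is drawn from a normal distribution centered at 4, and the gradients of the binary cross-entropy loss are written out by hand. The generator lane uses labels of 1 for generated samples, which mirrors training the combined model to make the discriminator answer "real":

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy "networks": generator G(z) = a*z + b, discriminator D(x) = sigmoid(w*x + c)
gen = {"a": 1.0, "b": 0.0}
disc = {"w": 0.1, "c": 0.0}

def generate(z):
    return gen["a"] * z + gen["b"]

def discriminate(x):
    return sigmoid(disc["w"] * x + disc["c"])

def train_discriminator(x_real, x_fake, lr=0.05):
    """Lane 1: update only the discriminator (real -> 1, fake -> 0)."""
    d_real, d_fake = discriminate(x_real), discriminate(x_fake)
    # Gradient of binary cross-entropy with respect to the logit is (prediction - label)
    grad_w = np.mean((d_real - 1.0) * x_real) + np.mean(d_fake * x_fake)
    grad_c = np.mean(d_real - 1.0) + np.mean(d_fake)
    disc["w"] -= lr * grad_w
    disc["c"] -= lr * grad_c

def train_generator(z, lr=0.05):
    """Lane 2: combined model; discriminator weights are read but never updated."""
    d_fake = discriminate(generate(z))
    # Labels are 1 here: the generator is rewarded when D calls its output real
    grad_a = np.mean((d_fake - 1.0) * disc["w"] * z)
    grad_b = np.mean((d_fake - 1.0) * disc["w"])
    gen["a"] -= lr * grad_a
    gen["b"] -= lr * grad_b

for step in range(500):
    x_real = rng.normal(4.0, 1.0, size=64)      # "real" data
    z = rng.normal(0.0, 1.0, size=64)           # random noise
    train_discriminator(x_real, generate(z))    # error flows back to the discriminator
    train_generator(rng.normal(0.0, 1.0, size=64))  # error flows back to the generator
```

The point of the sketch is the division of labor: train_discriminator touches only w and c, and train_generator touches only a and b, even though the generator's gradient is computed through the discriminator.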
During the training iterations, we follow the same neural network training procedure to observe the network’s performance and tune its hyperparameters until we see that the generator is achieving satisfying results for our problem. This is when we can stop the training and deploy the generator model. Now, let’s see how we compile the discriminator and the combined networks to train the GAN model.
As we said before, this is a straightforward process. First, we build the model from the discriminator_model function that we created earlier in this chapter. Then we compile the model and use the binary_crossentropy loss function and an optimizer of your choice (we use Adam in this example).
Let’s see the Keras implementation that builds and compiles the discriminator. Please note that this code snippet is not meant to be runnable on its own--it is here for illustration. At the end of this chapter, you can find the full code of this project:
discriminator = discriminator_model()
discriminator.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
We can train the model by creating random training batches using Keras’ train_on_batch method to run a single gradient update on a single batch of data:
noise = np.random.normal(0, 1, (batch_size, 100))    # ❶
gen_imgs = generator.predict(noise)                  # ❷

# Train the discriminator (real classified as ones and generated as zeros)
d_loss_real = discriminator.train_on_batch(imgs, valid)
d_loss_fake = discriminator.train_on_batch(gen_imgs, fake)
❶ Samples a batch of random noise vectors
❷ Generates a batch of new images
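The snippet uses imgs, valid, and fake without showing where they come from. One plausible way to prepare them is sketched here with NumPy only; the random X_train array is a placeholder standing in for the real training images, and scaling it to the range [-1, 1] is an assumption chosen to match the generator's tanh output:

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size = 32

# Placeholder training set: 1,000 random "images" scaled to [-1, 1].
# In a real project this would be the actual dataset (for example, MNIST digits).
X_train = rng.uniform(-1.0, 1.0, size=(1000, 28, 28, 1))

# Target labels for the discriminator: 1 for real images, 0 for generated ones
valid = np.ones((batch_size, 1))
fake = np.zeros((batch_size, 1))

# Pick a random batch of real images
idx = rng.integers(0, X_train.shape[0], batch_size)
imgs = X_train[idx]

print(imgs.shape, valid.shape, fake.shape)   # (32, 28, 28, 1) (32, 1) (32, 1)
```

Each call to train_on_batch then pairs one batch of images with one of these label arrays: real images with valid, generated images with fake.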
Here is the one tricky part in training GANs: training the generator. While the discriminator can be trained in isolation from the generator model, the generator needs the discriminator in order to be trained. For this, we build a combined model that contains both the generator and the discriminator, as shown in figure 8.12.
When we want to train the generator, we freeze the weights of the discriminator model because the generator and discriminator have different loss functions pulling in different directions. If we don’t freeze the discriminator weights, they will be updated in the same direction the generator is learning, making the discriminator more likely to predict generated images as real, which is not the desired outcome. Freezing the weights of the discriminator inside the combined model doesn’t affect the standalone discriminator that we compiled earlier when we were training the discriminator: Keras takes the trainable setting into account when compile is called, so the discriminator that was compiled before the flag was changed still updates its weights when trained on its own. Think of it as having two discriminator models--this is not the case, but it is easier to imagine.
Now, let’s build the combined model:
generator = generator_model()            # ❶
z = Input(shape=(100,))                  # ❷
image = generator(z)                     # ❷
discriminator.trainable = False          # ❸
valid = discriminator(image)             # ❹
combined = Model(z, valid)               # ❺
❶ Builds the generator
❷ The generator takes noise as input and generates an image.
❸ Freezes the weights of the discriminator model
❹ The discriminator takes generated images as input and determines their validity.
❺ The combined model (stacked generator and discriminator) trains the generator to fool the discriminator.
Now that we have built the combined model, we can proceed with the training process as normal. We compile the combined model with a binary_crossentropy loss function and an Adam optimizer:
combined.compile(loss='binary_crossentropy', optimizer='adam')

# Here valid is the array of ones used as "real" labels for the batch
g_loss = combined.train_on_batch(noise, valid)    # ❶
❶ Trains the generator (wants the discriminator to mistake images for being real)
In the project at the end of the chapter, you will see that the previous code snippet is placed inside a training loop that runs for a certain number of epochs. In each epoch, the two compiled models (discriminator and combined) are trained in alternation: a discriminator update followed by a generator update. During the training process, both the generator and discriminator improve. You can observe the performance of your GAN by printing out the results after each epoch (or a set of epochs) to see how the generator is doing at generating synthetic images. Figure 8.13 shows an example of the evolution of the generator’s performance throughout its training process on the MNIST dataset.
In the example, epoch 0 starts with random noise data that doesn’t yet represent the features in the training dataset. As the GAN model goes through the training, its generator gets better and better at creating high-quality imitations of the training dataset that can fool the discriminator. Manually observing the generator’s performance is a good way to evaluate system performance to decide on the number of epochs and when to stop training. We’ll look more at GAN evaluation techniques in section 8.2.
GAN training is more of a zero-sum game than an optimization problem. In zero-sum games, the total utility score is divided among the players. An increase in one player’s score results in a decrease in another player’s score. In AI, this is called minimax game theory. Minimax is a decision-making algorithm, typically used in turn-based, two-player games. The goal of the algorithm is to find the optimal next move. One player, called the maximizer, works to get the maximum possible score; the other player, called the minimizer, tries to get the lowest score by counter-moving against the maximizer.
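GANs play a minimax game where the entire network attempts to optimize the function V(D,G) in the following equation:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$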
The goal of the discriminator (D) is to maximize the probability of getting the correct label of the image. The generator’s (G) goal, on the other hand, is to minimize the chances of getting caught. So, we train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 - D(G(z))). In other words, D and G play a two-player minimax game with the value function V(D,G).
Like any other mathematical equation, the preceding one looks terrifying to anyone who isn’t well versed in the math behind it, but the idea it represents is simple yet powerful. It’s just a mathematical representation of the two competing objectives of the discriminator and the generator models. Let’s go through the symbols first (table 8.1) and then explain it.
G(z): The generator takes the random noise data (z) and tries to reconstruct the real images.
The discriminator takes its input from two sources:
Data from the generator, G(z)--This is fake data generated from the noise z. The discriminator output for data from the generator is denoted as D(G(z)).
Real input from the real training data (x)--The discriminator output for the real data is denoted as D(x); it appears in the equation as log D(x).
To simplify the minimax equation, the best way to look at it is to break it down into two components: the discriminator training function and the generator training (combined model) function. During the training process, we created two training flows, and each has its own error function:
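One for the discriminator alone, represented by the following function, which aims to maximize the minimax function by pushing D(x) as close as possible to 1 for real images and D(G(z)) as close as possible to 0 for generated images:
$$\max_D V(D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
One for the combined model to train the generator, represented by the following function, which aims to minimize the minimax function by pushing D(G(z)) as close as possible to 1, so that 1 - D(G(z)) is as close as possible to 0:
$$\min_G V(G) = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
Now that we understand the equation symbols and have a better understanding of how the minimax function works, let’s look at the function again:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$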
The goal of the minimax objective function V(D, G) is to maximize D(x) from the true data distribution and minimize D(G(z)) from the fake data distribution. To achieve this, we use the log-likelihood of D(x) and 1 - D(G(z)) in the objective function. The log of a value just makes sure that the closer we are to an incorrect value, the more we are penalized.
Early in the GAN training process, the discriminator will reject fake data from the generator with high confidence, because the fake images are very different from the real training data--the generator hasn’t learned yet. As we train the discriminator to maximize the probability of assigning the correct labels to both real examples and fake images from the generator, we simultaneously train the generator to minimize the discriminator classification error for the generated fake data. The discriminator wants to maximize objectives such that D(x) is close to 1 for real data and D(G(z)) is close to 0 for fake data. On the other hand, the generator wants to minimize objectives such that D(G(z)) is close to 1 so that the discriminator is fooled into thinking the generated G(z) is real. We stop the training when the fake data generated by the generator is recognized as real data.
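A few numbers make the tug-of-war tangible. The sketch below estimates V(D, G) on a batch as the mean of log D(x) plus the mean of log(1 - D(G(z))); the discriminator outputs are made-up values chosen for illustration, not results from a trained model:

```python
import numpy as np

def value_function(d_real, d_fake):
    # Batch estimate of V(D, G): mean log D(x) + mean log(1 - D(G(z)))
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Early in training: the discriminator is confident and correct
early = value_function(d_real=np.array([0.90, 0.95]),
                       d_fake=np.array([0.05, 0.10]))

# Later: the generator fools the discriminator, which outputs 0.5 for everything
fooled = value_function(d_real=np.array([0.5, 0.5]),
                        d_fake=np.array([0.5, 0.5]))

print(round(early, 3))    # -0.157
print(round(fooled, 3))   # -1.386
```

The discriminator pushes V toward 0 (its maximum), while a successful generator drags it down. When the discriminator can do no better than output 0.5 for every image, V equals -log 4, about -1.386.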
Deep learning neural network models that are used for classification and detection problems are trained with a loss function until convergence. A GAN generator model, on the other hand, is trained using a discriminator that learns to classify images as real or generated. As we learned in the previous section, both the generator and discriminator models are trained together to maintain an equilibrium. As such, no objective loss function is used to train the GAN generator models, and there is no way to objectively assess the progress of the training and the relative or absolute quality of the model from loss alone. This means models must be evaluated using the quality of the generated synthetic images and by manually inspecting the generated images.
A good way to identify evaluation techniques is to review research papers and the techniques the authors used to evaluate their GANs. Tim Salimans et al. (2016) evaluated their GAN performance by having human annotators manually judge the visual quality of the synthesized samples.3 They created a web interface and hired annotators on Amazon Mechanical Turk (MTurk) to distinguish between generated data and real data.
One downside of using human annotators is that the metric varies depending on the setup of the task and the motivation of the annotators. The team also found that results changed drastically when they gave annotators feedback about their mistakes: by learning from such feedback, annotators are better able to point out the flaws in generated images, giving a more pessimistic quality assessment.
Other non-manual approaches were used by Salimans et al. and by other researchers we will discuss in this section. In general, there is no consensus about a correct way to evaluate a given GAN generator model. This makes it challenging for researchers and practitioners to do the following:
Select the best GAN generator model during a training run--in other words, decide when to stop training.
Choose generated images to demonstrate the capability of a GAN generator model.
Tune the model hyperparameters and configuration and compare results.
Finding quantifiable ways to understand a GAN’s progress and output quality is still an active area of research. A suite of qualitative and quantitative techniques has been developed to assess the performance of a GAN model based on the quality and diversity of the generated synthetic images. Two commonly used evaluation metrics for image quality and diversity are the inception score and the Fréchet inception distance (FID). In this section, you will discover techniques for evaluating GAN models based on generated synthetic images.
The inception score is based on a heuristic that realistic samples should be able to be classified when passed through a pretrained network such as Inception on ImageNet (hence the name inception score). The idea is really simple. The heuristic relies on two values:
High predictability of the generated image--We apply a pretrained inception classifier model to every generated image and get its softmax prediction. If the generated image is good enough, then it should give us a high predictability score.
Diverse generated samples--No classes should dominate the distribution of the generated images.
A large number of generated images are classified using the model. Specifically, the probability of the image belonging to each class is predicted. The probabilities are then summarized in the score to capture both how much each image looks like a known class and how diverse the set of images is across the known classes. If both these traits are satisfied, there should be a large inception score. A higher inception score indicates better-quality generated images.
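The usual way to combine the two traits is to take the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal distribution p(y) over all generated images. The following NumPy sketch assumes the softmax predictions are already available as a matrix; the three tiny matrices are hand-made stand-ins for classifier output, not real Inception predictions:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: shape (num_images, num_classes); each row is p(y|x) for one image
    marginal = probs.mean(axis=0, keepdims=True)             # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

confident_diverse = np.eye(4)                  # each image clearly one class, all classes used
unsure = np.full((4, 4), 0.25)                 # every image could be any class
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))  # confident, but always the same class

print(round(inception_score(confident_diverse), 3))   # 4.0
print(round(inception_score(unsure), 3))              # 1.0
print(round(inception_score(collapsed), 3))           # 1.0
```

Only the set that is both confidently classified and spread across classes scores high; with four classes, the best possible score is 4. Blurry predictions and a generator that produces a single class both fall to the minimum of 1.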
The FID score was proposed and used by Martin Heusel et al. in 2017.4 The score was proposed as an improvement over the existing inception score.
Like the inception score, the FID score uses the Inception model to capture specific features of an input image. These activations are calculated for a collection of real and generated images. The activations of the real images and of the generated images are each summarized as a multivariate Gaussian (a mean vector and a covariance matrix), and the distance between these two distributions is then calculated using the Fréchet distance, also called the Wasserstein-2 distance.
An important note is that the FID needs a decent sample size to give good results (the suggested size is 50,000 samples). If you use too few samples, you will end up overestimating your actual FID, and the estimates will have a large variance. A lower FID score indicates more realistic images that match the statistical properties of real images.
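For two Gaussians with means μ_r, μ_g and covariances Σ_r, Σ_g, the Fréchet distance is ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^(1/2)). Here is a NumPy sketch of that formula, assuming the activations have already been extracted; the random 4-dimensional arrays stand in for Inception features, and the trace of the matrix square root is computed from eigenvalues as an implementation shortcut (many implementations call scipy.linalg.sqrtm instead):

```python
import numpy as np

def frechet_distance(act_real, act_fake):
    # act_*: arrays of shape (num_images, num_features) holding feature activations
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    # Tr(sqrt(cov_r @ cov_f)) equals the sum of the square roots of the product's eigenvalues
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_r - mu_f) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))          # stand-in activations, not real Inception features

same = frechet_distance(real, real)           # identical sets: distance of about 0
shifted = frechet_distance(real, real + 2.0)  # same spread, every feature shifted by 2

print(round(same, 6), round(shifted, 6))
```

Shifting every one of the 4 features by 2 leaves the covariance unchanged, so the distance comes entirely from the mean term: 4 × 2² = 16. Identical sets give a distance of zero up to floating-point error.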
Both measures (inception score and FID) are easy to implement and calculate on batches of generated images. As such, the practice of systematically generating images and saving models during training can and should continue to be used to allow post hoc model selection. Diving deep into the inception score and FID is out of the scope of this book. As mentioned earlier, this is an active area of research, and there is no consensus in the industry as of the time of writing about the one best approach to evaluate GAN performance. Different scores assess various aspects of the image-generation process, and it is unlikely that a single score can cover all aspects. The goal of this section is to expose you to some techniques that have been developed in recent years to automate the GAN evaluation process, but manual evaluation is still widely used.
When you are getting started, it is a good idea to begin with manual inspection of generated images in order to evaluate and select generator models. Developing GAN models is complex enough for both beginners and experts; manual inspection can get you a long way while refining your model implementation and testing model configurations.
Other researchers are taking different approaches by using domain-specific evaluation metrics. For example, Konstantin Shmelkov and his team (2018) used two measures based on image classification, GAN-train and GAN-test, which approximated the recall (diversity) and precision (quality of the image) of GANs, respectively.5
Generative modeling has come a long way in the last five years. The field has developed to the point where some expect the next generation of generative models to rival human artists. GANs are now being applied to problems in industries like healthcare, automotive, fine arts, and many others. In this section, we will learn about some of the use cases of adversarial networks and which GAN architecture is used for each application. The goal of this section is not to implement the variations of the GAN network, but to provide some exposure to potential applications of GAN models and resources for further reading.
Synthesis of high-quality images from text descriptions is a challenging problem in CV. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts.
The GAN network that was built for this application is the stacked generative adversarial network (StackGAN).6 Zhang et al. were able to generate 256 × 256 photorealistic images conditioned on text descriptions.
StackGANs work in two stages (figure 8.14):
Stage-I: StackGAN sketches the primitive shape and colors of the object based on the given text description, yielding low-resolution images.
Stage-II: StackGAN takes the output of Stage-I and a text description as input and generates high-resolution images with photorealistic details. It is able to rectify defects in the images created in Stage-I and add compelling details through the refinement process.
Image-to-image translation is defined as translating one representation of a scene into another, given sufficient training data. It is inspired by the language translation analogy: just as an idea can be expressed by many different languages, a scene may be rendered by a grayscale image, RGB image, semantic label maps, edge sketches, and so on. In figure 8.15, image-to-image translation tasks are demonstrated on a range of applications such as converting street scene segmentation labels to real images, grayscale to color images, sketches of products to product photographs, and day photographs to night ones.
Pix2Pix is a member of the GAN family designed by Phillip Isola et al. in 2016 for general-purpose image-to-image translation.7 The Pix2Pix network architecture is similar to the GAN concept: it consists of a generator model for outputting new synthetic images that look realistic, and a discriminator model that classifies images as real (from the dataset) or fake (generated). The training process is also similar to that used for GANs: the discriminator model is updated directly, whereas the generator model is updated via the discriminator model. As such, the two models are trained simultaneously in an adversarial process where the generator seeks to better fool the discriminator and the discriminator seeks to better identify the counterfeit images.
The novel idea of Pix2Pix networks is that they learn a loss function adapted to the task and data at hand, which makes them applicable in a wide variety of settings. They are a type of conditional GAN (cGAN) where the generation of the output image is conditional on an input source image. The discriminator is provided with both a source image and the target image and must determine whether the target is a plausible transformation of the source image.
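To make the conditioning idea concrete, here is a minimal NumPy sketch of how source/target pairs could be assembled for such a discriminator. Stacking the two images along the channel axis is one common way to present a pair to a convolutional network and is an assumption here; the array names, image sizes, and random data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batches in (batch, height, width, channels) layout.
source = rng.random((2, 8, 8, 1))        # e.g. edge sketches
real_target = rng.random((2, 8, 8, 3))   # matching photos from the dataset
fake_target = rng.random((2, 8, 8, 3))   # stand-in for generator output

# Pair every target with the source image it is supposed to translate.
real_pair = np.concatenate([source, real_target], axis=-1)
fake_pair = np.concatenate([source, fake_target], axis=-1)

# The discriminator sees pairs, labeled 1 (plausible) or 0 (generated).
disc_inputs = np.concatenate([real_pair, fake_pair], axis=0)
disc_labels = np.concatenate([np.ones((2, 1)), np.zeros((2, 1))], axis=0)
print(disc_inputs.shape, disc_labels.ravel())
```

Because the source image is always part of the input, the discriminator can penalize a realistic-looking output that does not correspond to its source.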
The results of the Pix2Pix network are really promising for many image-to-image translation tasks. Visit https://affinelayer.com/pixsrv to play more with the Pix2Pix network; this site has an interactive demo created by Isola and team in which you can convert sketch edges of cats or products to photos and façades to real images.
A certain type of GAN model can be used to convert low-resolution images into high-resolution images. This type is called a super-resolution generative adversarial network (SRGAN) and was introduced by Christian Ledig et al. in 2016.8 Figure 8.16 shows how SRGAN was able to create a very high-resolution image.
GAN models have huge potential for creating and imagining new realities that have never existed before. The applications mentioned in this chapter are just a few examples to give you an idea of what GANs can do today. Such applications come out every few weeks and are worth trying. If you are interested in getting your hands dirty with more GAN applications, visit the amazing Keras-GAN repository at https://github.com/eriklindernoren/Keras-GAN, maintained by Erik Linder-Norén. It includes many GAN models created using Keras and is an excellent resource for Keras examples. Much of the code in this chapter was inspired by and adapted from this repository.
In this project, you’ll build a GAN using convolutional layers in the generator and discriminator. This is called a deep convolutional GAN (DCGAN) for short. The DCGAN architecture was first explored by Alec Radford et al. (2016), as discussed in section 8.1.1, and has seen impressive results in generating new images. You can follow along with the implementation in this chapter or run code in the project notebook available with this book’s downloadable code.
In this project, you’ll be training DCGAN on the Fashion-MNIST dataset (https://github.com/zalandoresearch/fashion-mnist). Fashion-MNIST consists of 60,000 grayscale images for training and a test set of 10,000 images (figure 8.17). Each 28 × 28 grayscale image is associated with a label from 10 classes. Fashion-MNIST is intended to serve as a direct replacement for the original MNIST dataset for benchmarking machine learning algorithms. I chose grayscale images for this project because it requires less computational power to train convolutional networks on one-channel grayscale images compared to three-channel colored images, which makes it easier for you to train on a personal computer without a GPU.
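The dataset is broken into 10 fashion categories. The class labels are as follows:
0: T-shirt/top
1: Trouser
2: Pullover
3: Dress
4: Coat
5: Sandal
6: Shirt
7: Sneaker
8: Bag
9: Ankle boot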
As always, the first thing to do is to import all the libraries we use in this project:
from __future__ import print_function, division

from keras.datasets import fashion_mnist  # ❶
from keras.layers import Input, Dense, Reshape, Flatten, Dropout  # ❷
from keras.layers import BatchNormalization, Activation, ZeroPadding2D  # ❷
from keras.layers import LeakyReLU, UpSampling2D, Conv2D  # ❷
from keras.models import Sequential, Model  # ❷
from keras.optimizers import Adam  # ❷

import numpy as np  # ❸
import matplotlib.pyplot as plt  # ❸
❶ Imports the fashion_mnist dataset from Keras
❷ Imports Keras layers and models
❸ Imports numpy and matplotlib
Keras makes the Fashion-MNIST dataset available for us to download with just one command: fashion_mnist.load_data(). Here, we download the dataset and rescale the training set to the range -1 to 1 to allow the model to converge faster (see the “Data normalization” section in chapter 4 for more details on image scaling):
(training_data, _), (_, _) = fashion_mnist.load_data()  # ❶
X_train = training_data / 127.5 - 1.  # ❷
X_train = np.expand_dims(X_train, axis=3)  # ❷
❶ Loads the dataset
❷ Rescales the training data to the range -1 to 1 and adds a channel dimension
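To see what this rescaling does to individual pixel values, here is a small NumPy sketch on a toy 2 × 2 array instead of the real dataset. The variable names are illustrative; the arithmetic is the same as in the listing above, plus the inverse mapping you would use to turn generator outputs back into displayable pixel values.

```python
import numpy as np

# Toy uint8 "image" covering the extremes of the pixel range.
pixels = np.array([[0, 127], [128, 255]], dtype=np.uint8)

scaled = pixels / 127.5 - 1.0        # maps [0, 255] to [-1, 1]
restored = (scaled + 1.0) * 127.5    # inverse mapping, back to [0, 255]

# Convolutional layers expect a channel axis: (2, 2) -> (2, 2, 1).
batch = np.expand_dims(scaled, axis=-1)

print(scaled)
print(batch.shape)
```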
Just for the fun of it, let’s visualize the image matrix (figure 8.18):
def visualize_input(img, ax):
    ax.imshow(img, cmap='gray')
    width, height = img.shape
    thresh = img.max()/2.5
    for x in range(width):
        for y in range(height):
            ax.annotate(str(round(img[x][y], 2)), xy=(y, x),
                        horizontalalignment='center',
                        verticalalignment='center',
                        color='white' if img[x][y] < thresh else 'black')

fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
visualize_input(training_data[3343], ax)
Now, let’s build the generator model. The input will be our noise vector (z) as explained in section 8.1.5. The generator architecture is shown in figure 8.19.
The first layer is a fully connected layer whose output is reshaped into a deep, narrow volume of 7 × 7 × 128 (in the original DCGAN paper, the team reshaped the input to 4 × 4 × 1024). Then we use an upsampling layer to double the feature map dimensions from 7 × 7 to 14 × 14, and then again to 28 × 28. In this network, we use three convolutional layers. The first two each follow an upsampling layer and apply a ReLU activation followed by batch normalization, so the general scheme is upsampling ⇒ convolution (with ReLU) ⇒ batch normalization. We keep stacking layers like this until we reach the final convolutional layer, which has a single filter and produces an output of shape 28 × 28 × 1:
def build_generator():
    generator = Sequential()  # ❶
    generator.add(Dense(128 * 7 * 7, activation="relu", input_dim=100))  # ❷
    generator.add(Reshape((7, 7, 128)))  # ❸
    generator.add(UpSampling2D())  # ❹
    generator.add(Conv2D(128, kernel_size=3, padding="same",
                         activation="relu"))  # ❺
    generator.add(BatchNormalization(momentum=0.8))  # ❺
    generator.add(UpSampling2D())  # ❻
    # convolutional + batch normalization layers
    generator.add(Conv2D(64, kernel_size=3, padding="same",
                         activation="relu"))  # ❼
    generator.add(BatchNormalization(momentum=0.8))  # ❼
    # convolutional layer with filters = 1
    # tanh keeps the output in [-1, 1], matching the rescaled training data
    generator.add(Conv2D(1, kernel_size=3, padding="same",
                         activation="tanh"))
    generator.summary()  # ❽
    noise = Input(shape=(100,))  # ❾
    fake_image = generator(noise)  # ❿
    return Model(inputs=noise, outputs=fake_image)  # ⓫
❶ Instantiates a sequential model and names it generator
❷ Adds the dense layer that has a number of neurons = 128 × 7 × 7
❸ Reshapes the image dimensions to 7 × 7 × 128
❹ Upsampling layer to double the size of the image dimensions to 14 × 14
❺ Adds a convolutional layer to run the convolutional process and batch normalization
❻ Upsamples the image dimensions to 28 × 28
❼ We don’t add upsampling here because the image size of 28 × 28 is equal to the image size in the Fashion-MNIST dataset. You can adjust this for your own problem.
❽ Prints the model summary
❾ Generates the input noise vector of length = 100. We chose 100 here to create a simple network.
❿ Runs the generator model to create the fake image
⓫ Returns a model that takes the noise vector as an input and outputs the fake image
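The upsampling step is easy to picture with a small example. With its default settings, Keras’s UpSampling2D doubles the height and width by repeating each row and column. The NumPy sketch below mimics that behavior on a single-channel feature map; the helper name `upsample_nearest` is hypothetical, and this is an illustration of the idea rather than Keras’s actual implementation.

```python
import numpy as np

def upsample_nearest(feature_map, factor=2):
    """Repeat every row and column `factor` times (nearest-neighbor upsampling)."""
    rows_repeated = np.repeat(feature_map, factor, axis=0)
    return np.repeat(rows_repeated, factor, axis=1)

small = np.array([[1, 2],
                  [3, 4]])
big = upsample_nearest(small)
print(big)

# Shape progression used by the generator: 7 x 7 -> 14 x 14 -> 28 x 28
x = np.zeros((7, 7))
x = upsample_nearest(x)
shape_after_first = x.shape
x = upsample_nearest(x)
shape_after_second = x.shape
print(shape_after_first, shape_after_second)
```

Because upsampling only copies values, the convolutional layer that follows each upsampling step is what learns to turn the enlarged, blocky feature map into finer detail.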
The discriminator is just a convolutional classifier like the ones we have built before (figure 8.20). The inputs to the discriminator are 28 × 28 × 1 images. We want a few convolutional layers and then a fully connected layer for the output. As before, we want a sigmoid output that gives the probability that the input image is real. For the depths of the convolutional layers, I suggest starting with 32 or 64 filters in the first layer and then doubling the depth as you add layers. In this implementation, we start with 32 filters, then 64, 128, and 256. For downsampling, we do not use pooling layers; instead, we use only strided convolutional layers, similar to Radford et al.’s implementation.
We also use batch normalization and dropout to optimize training, as we learned in chapter 4. For each of the four convolutional layers, the general scheme is convolution ⇒ batch normalization ⇒ leaky ReLU ⇒ dropout; the first convolutional layer skips batch normalization. Now, let’s write the build_discriminator function:
def build_discriminator():
    discriminator = Sequential()  # ❶
    discriminator.add(Conv2D(32, kernel_size=3, strides=2,
                             input_shape=(28, 28, 1), padding="same"))  # ❷
    discriminator.add(LeakyReLU(alpha=0.2))  # ❸
    discriminator.add(Dropout(0.25))  # ❹
    discriminator.add(Conv2D(64, kernel_size=3, strides=2,
                             padding="same"))  # ❺
    discriminator.add(ZeroPadding2D(padding=((0, 1), (0, 1))))  # ❻
    discriminator.add(BatchNormalization(momentum=0.8))  # ❼
    discriminator.add(LeakyReLU(alpha=0.2))
    discriminator.add(Dropout(0.25))
    discriminator.add(Conv2D(128, kernel_size=3, strides=2,
                             padding="same"))  # ❽
    discriminator.add(BatchNormalization(momentum=0.8))  # ❽
    discriminator.add(LeakyReLU(alpha=0.2))  # ❽
    discriminator.add(Dropout(0.25))  # ❽
    discriminator.add(Conv2D(256, kernel_size=3, strides=1,
                             padding="same"))  # ❾
    discriminator.add(BatchNormalization(momentum=0.8))  # ❾
    discriminator.add(LeakyReLU(alpha=0.2))  # ❾
    discriminator.add(Dropout(0.25))  # ❾
    discriminator.add(Flatten())  # ❿
    discriminator.add(Dense(1, activation='sigmoid'))  # ❿
    img = Input(shape=(28, 28, 1))  # ⓫
    probability = discriminator(img)  # ⓬
    return Model(inputs=img, outputs=probability)  # ⓭
❶ Instantiates a sequential model and names it discriminator
❷ Adds a convolutional layer to the discriminator model
❸ Adds a leaky ReLU activation function
❹ Adds a dropout layer with a 25% dropout probability
❺ Adds a second convolutional layer with zero padding
❻ Adds a zero-padding layer to change the dimension from 7 × 7 to 8 × 8
❼ Adds a batch normalization layer for faster learning and higher accuracy
❽ Adds a third convolutional layer with batch normalization, leaky ReLU, and a dropout
❾ Adds the fourth convolutional layer with batch normalization, leaky ReLU, and a dropout
❿ Flattens the network and adds the output dense layer with sigmoid activation function
⓫ Defines the input image shape
⓬ Runs the discriminator model to get the output probability
⓭ Returns a model that takes the image as input and produces the probability output
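It can help to trace how the spatial size shrinks through the discriminator without running Keras at all. The plain-Python sketch below assumes the standard rule that a convolution with padding="same" and stride s turns a side of length n into ceil(n / s); the helper name `conv_same` is hypothetical. Layers that do not change the spatial size (batch normalization, leaky ReLU, dropout) are left out.

```python
import math

def conv_same(size, stride):
    """Spatial output size of a convolution that uses padding='same'."""
    return math.ceil(size / stride)

# (description, function mapping input size -> output size)
layers = [
    ("Conv2D(32), strides=2", lambda s: conv_same(s, 2)),
    ("Conv2D(64), strides=2", lambda s: conv_same(s, 2)),
    ("ZeroPadding2D(((0, 1), (0, 1)))", lambda s: s + 1),
    ("Conv2D(128), strides=2", lambda s: conv_same(s, 2)),
    ("Conv2D(256), strides=1", lambda s: conv_same(s, 1)),
]

size = 28
trace = [size]
for name, step in layers:
    size = step(size)
    trace.append(size)
    print(f"{name:34s} -> {size} x {size}")

flattened_units = size * size * 256  # input width of the final Dense(1) layer
print("Flatten ->", flattened_units)
```

The trace also shows why the zero-padding layer is there: padding 7 × 7 to 8 × 8 gives the next strided convolution an even-sized input.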
As explained in section 8.1.3, to train the generator, we need to build a combined network that contains both the generator and the discriminator (figure 8.21). The combined model takes the noise signal as input (z) and outputs the discriminator’s prediction output as fake or real.
Remember that we want to disable discriminator training for the combined model, as explained in detail in section 8.1.3. When training the generator, we don’t want the discriminator to update weights as well, but we still want to include the discriminator model in the generator training. So, we create a combined network that includes both models but freeze the weights of the discriminator model in the combined network:
optimizer = Adam(learning_rate=0.0002, beta_1=0.5)  # ❶

discriminator = build_discriminator()  # ❷
discriminator.compile(loss='binary_crossentropy', optimizer=optimizer,
                      metrics=['accuracy'])
discriminator.trainable = False  # ❸

# Build the generator
generator = build_generator()  # ❹

z = Input(shape=(100,))  # ❺
img = generator(z)  # ❺
valid = discriminator(img)  # ❻

combined = Model(inputs=z, outputs=valid)  # ❼
combined.compile(loss='binary_crossentropy', optimizer=optimizer)  # ❼
❶ Defines the optimizer
❷ Builds and compiles the discriminator
❸ Freezes the discriminator weights because we don’t want to train it during generator training
❹ Builds the generator
❺ The generator takes noise as input with latent_dim = 100 and generates images.
❻ The discriminator takes generated images as input and determines their validity.
❼ The combined model (stacked generator and discriminator) trains the generator to fool the discriminator.
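When training the GAN model, we train two networks: the discriminator and the combined network that we created in the previous section. Let’s build the train function, which takes the following arguments: epochs (the number of training iterations to run), batch_size, and save_interval (how often, in iterations, to save generated image samples):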
def train(epochs, batch_size=128, save_interval=50):
    valid = np.ones((batch_size, 1))  # ❶
    fake = np.zeros((batch_size, 1))  # ❶

    for epoch in range(epochs):
        ## Train Discriminator network
        idx = np.random.randint(0, X_train.shape[0], batch_size)  # ❷
        imgs = X_train[idx]  # ❷
        noise = np.random.normal(0, 1, (batch_size, 100))  # ❸
        gen_imgs = generator.predict(noise)  # ❸
        d_loss_real = discriminator.train_on_batch(imgs, valid)  # ❹
        d_loss_fake = discriminator.train_on_batch(gen_imgs, fake)  # ❹
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)  # ❹

        ## Train the combined network (Generator)
        g_loss = combined.train_on_batch(noise, valid)  # ❺

        print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" %
              (epoch, d_loss[0], 100*d_loss[1], g_loss))  # ❻

        if epoch % save_interval == 0:  # ❼
            plot_generated_images(epoch, generator)  # ❼
❶ Ground-truth labels: 1s for real images and 0s for generated images
❷ Selects a random batch of real images
❸ Samples noise and generates a batch of new images
❹ Trains the discriminator (real classified as 1s and generated as 0s)
❺ Trains the generator (wants the discriminator to mistake images for real ones)
❻ Prints the training progress
❼ Saves generated image samples at every save_interval
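The three train_on_batch calls all minimize binary cross-entropy; what differs is which labels are paired with which images. The NumPy sketch below works through the loss values for a handful of made-up discriminator outputs. The probabilities and the helper `binary_crossentropy` are illustrative assumptions; Keras computes the real values internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, clipping predictions for numerical safety."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

batch_size = 4
valid = np.ones((batch_size, 1))   # labels for real images
fake = np.zeros((batch_size, 1))   # labels for generated images

# Hypothetical discriminator outputs (probability that an image is real).
d_out_real = np.array([[0.9], [0.8], [0.95], [0.7]])
d_out_fake = np.array([[0.1], [0.2], [0.3], [0.05]])

# Discriminator: real images should score near 1, generated ones near 0.
d_loss = 0.5 * (binary_crossentropy(valid, d_out_real)
                + binary_crossentropy(fake, d_out_fake))

# Generator: the same generated images are scored against the "valid" labels.
g_loss = binary_crossentropy(valid, d_out_fake)
print(round(d_loss, 4), round(g_loss, 4))
```

In this toy case the discriminator is doing well, so its loss is small while the generator’s loss is large. During training, the two values push against each other in exactly this way.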
Before you run the train() function, you need to define the following plot_generated_images() function:
def plot_generated_images(epoch, generator, examples=100, dim=(10, 10),
                          figsize=(10, 10)):
    # 100 is the length of the noise vector used throughout this project
    noise = np.random.normal(0, 1, size=[examples, 100])
    generated_images = generator.predict(noise)
    generated_images = generated_images.reshape(examples, 28, 28)

    plt.figure(figsize=figsize)
    for i in range(generated_images.shape[0]):
        plt.subplot(dim[0], dim[1], i+1)
        plt.imshow(generated_images[i], interpolation='nearest',
                   cmap='gray_r')
        plt.axis('off')
    plt.tight_layout()
    plt.savefig('gan_generated_image_epoch_%d.png' % epoch)
Now that the code implementation is complete, we are ready to start the DCGAN training. To train the model, run the following code snippet:
train(epochs=1000, batch_size=32, save_interval=50)
This runs the training for 1,000 epochs and saves images every 50 epochs. When you run the train() function, the training progress prints as shown in figure 8.22.
I ran this training myself for 10,000 epochs. Figure 8.23 shows my results after 0, 50, 1,000, and 10,000 epochs.
As you can see in figure 8.23, at epoch 0, the images are just random noise, with no patterns or meaningful data. At epoch 50, patterns have started to form. One very apparent pattern is the bright pixels beginning to form at the center of the image, surrounded by darker pixels. This happens because in the training data, all of the shapes are located at the center of the image. Later in the training process, at epoch 1,000, you can see clear shapes and can probably guess the type of training data fed to the GAN model. Fast-forward to epoch 10,000, and you can see that the generator has become very good at creating new images that are not present in the training dataset. For example, pick any of the objects created at this epoch: let’s say the top-left image (dress). This is a new dress design that the GAN model produced after learning the dress patterns from the training set. You can run the training longer or make the generator network even deeper to get more refined results.
For this project, I used the Fashion-MNIST dataset because the images are very small and are in grayscale (one-channel), which makes it computationally inexpensive for you to train on your local computer with no GPU. Fashion-MNIST is also very clean data: all of the images are centered and contain little noise, so they don’t require much preprocessing before you kick off your GAN training. This makes it a good toy dataset to jumpstart your first GAN project.
If you are excited to get your hands dirty with more advanced datasets, you can try CIFAR as your next step (https://www.cs.toronto.edu/~kriz/cifar.html) or Google’s Quick, Draw! dataset (https://quickdraw.withgoogle.com), which is considered the world’s largest doodle dataset at the time of writing. Another, more serious, dataset is Stanford’s Cars Dataset (https://ai.stanford.edu/~jkrause/cars/car_dataset.html), which contains more than 16,000 images of 196 classes of cars. You can try to train your GAN model to create a completely new design for your dream car!
GANs learn patterns from the training dataset and create new images whose distribution is similar to that of the training set.
The GAN architecture consists of two deep neural networks that compete with each other.
The generator tries to convert random noise into observations that look as if they have been sampled from the original dataset.
The discriminator tries to predict whether an observation comes from the original dataset or is one of the generator’s forgeries.
The discriminator’s model is a typical classification neural network that aims to classify images generated by the generator as real or fake.
The generator’s architecture looks like an inverted CNN that starts with a narrow input and is upsampled a few times until it reaches the desired size.
The upsampling layer scales the image dimensions by repeating each row and column of its input pixels.
To train the GAN, we train two networks in batches: the discriminator, and a combined network in which we freeze the weights of the discriminator and update only the generator’s weights.
To evaluate the GAN, we mostly rely on our observation of the quality of images created by the generator. Other evaluation metrics are the inception score and Fréchet inception distance (FID).
In addition to generating new images, GANs can be used in applications such as text-to-photo synthesis, image-to-image translation, image super-resolution, and many other applications.
1.Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative Adversarial Networks,” 2014, http://arxiv.org/abs/1406.2661.
2.Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” 2016, http://arxiv.org/abs/1511.06434.
3.Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved Techniques for Training GANs,” 2016, http://arxiv.org/abs/1606.03498.
4.Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” 2017, http://arxiv.org/abs/1706.08500.
5.Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari, “How Good Is My GAN?” 2018, http://arxiv.org/abs/1807.09499.
6.Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas, “StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks,” 2016, http://arxiv.org/abs/1612.03242.
7.Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” 2016, http://arxiv.org/abs/1611.07004.
8.Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” 2016, http://arxiv.org/abs/1609.04802.