22

Introduction to Generative Adversarial Networks

In this chapter, we're going to provide a brief introduction to a family of generative models based on some game theory concepts. Their main peculiarity is an adversarial training procedure in which one component learns to distinguish between true and fake samples while, at the same time, driving another component to generate samples that are more and more similar to the training examples.

In particular, we will be discussing:

  • Adversarial training and standard Generative Adversarial Networks (GANs)
  • Deep Convolutional GANs (DCGANs)
  • Wasserstein GANs (WGANs)

We can now introduce the concept of adversarial training of neural models, its connection to game theory and its applications to GANs.

Adversarial training

The brilliant idea of adversarial training, proposed by Goodfellow et al. (in Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., Generative Adversarial Networks, arXiv:1406.2661 [stat.ML] – although this idea has been, at least in theory, discussed earlier by other authors), ushered in a new generation of generative models that immediately outperformed the majority of existing algorithms. All of the derived models are based on the same fundamental concept of adversarial training, which is an approach partially inspired by game theory.

Let's suppose that we have a data-generating process, $p_{data}$, that represents an actual data distribution, and a finite number of data points that we suppose are drawn from $p_{data}$:

$$X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M\} \quad \text{where } \bar{x}_i \sim p_{data}$$

Our goal is to train a model called a generator, whose distribution must be as close as possible to pdata. This is the trickiest part of the algorithm because, unlike standard methods (for example, variational autoencoders), adversarial training is based on a minimax game between two players (we can simply say that, given an objective, the goal of both players is to minimize the maximum possible loss, but in this case, each of them works on different parameters). One player is the generator, which we can define as a parameterized function of a noise sample:

$$x = G(\bar{z}; \bar{\theta}_g)$$

The generator is fed with a noise vector $\bar{z}$ (in this case, we have employed a uniform distribution, but there are no particular restrictions; therefore, we are simply going to say that $\bar{z}$ is drawn from a random noise distribution pnoise), and outputs a value that has the same dimensionality as the samples drawn from pdata. Without any further control, the generator distribution will be completely different from the data-generating process, but this is the moment for the other player to enter the scene. The second model is called the discriminator (or critic), and it is responsible for evaluating the samples drawn from pdata and the ones produced by the generator:

$$D(\bar{x}; \bar{\theta}_d) \in (0, 1)$$

The role of this model is to output a probability that must reflect the fact that the sample is drawn from pdata, instead of being generated by $G$. What happens is very simple: the first player (the generator) outputs a sample, $x_g$. If the sample actually belongs to pdata, the discriminator will output a value close to 1, while if it's very different from the other true samples, $D$ will output a very low probability. The real structure of the game is based on the idea of training the generator to deceive the discriminator by producing samples that could potentially be drawn from pdata. This result can be achieved by trying to maximize the log-probability, $\log D(\bar{x})$, when $\bar{x}$ is a true sample (drawn from pdata), while minimizing the log-probability, $\log D(G(\bar{z}))$, with $\bar{z}$ sampled from a noise distribution.

The first operation forces the discriminator to become more and more aware of the true samples (this condition is necessary to avoid being deceived too easily).

The second objective is a little bit more complex because the discriminator has to evaluate a sample that can be acceptable or not. Let's suppose that the generator is not smart enough and outputs a sample that cannot belong to pdata. As the discriminator is learning how pdata is structured, it will very soon distinguish the wrong sample, outputting a low probability. Hence, by minimizing $\log D(G(\bar{z}))$, we are forcing the discriminator to become more and more critical when the samples are quite different from the ones drawn from pdata, and the generator becomes more and more able to produce acceptable samples. On the other hand, if the generator outputs a sample that belongs to the data-generating process, the discriminator will output a high probability, and the minimization falls back into the previous case.

The authors expressed this minimax game using a shared value function, V(G, D), that must be minimized by the generator and maximized by the discriminator:

$$\min_G \max_D V(G, D) = E_{\bar{x} \sim p_{data}}\left[\log D(\bar{x})\right] + E_{\bar{z} \sim p_{noise}}\left[\log\left(1 - D(G(\bar{z}))\right)\right]$$

This formula represents the dynamics of a non-cooperative game between two players (for further information, refer to Tadelis S., Game Theory, Princeton University Press, 2013) that theoretically admits a special configuration, called a Nash equilibrium, that can be described by saying that if the two players know each other's strategy, they have no reason to change their own strategy if the other player doesn't.

In this case, both the discriminator and generator will pursue their strategies until no change is needed, reaching a final, stable configuration, which is potentially a Nash equilibrium (even if there are many factors that can prevent reaching this goal). A common problem is the premature convergence of the discriminator, which forces the gradients to vanish because the loss function becomes flat in a region close to 0. As this is a game, a fundamental condition is the possibility of providing information to allow the player to make corrections. If the discriminator learns how to separate true samples from fake ones too quickly, the generator convergence slows down, and the player can remain trapped in a sub-optimal configuration.

In general, when the distributions are rather complex, the discriminator is slower than the generator; but in some cases, it is necessary to update the generator more times after each single discriminator update. Unfortunately, there is no rule of thumb; but, for example, when working with images, it's possible to observe the samples generated after a sufficiently large number of iterations. If the discriminator loss has become very small and the samples appear corrupted or incoherent, it means that the generator did not have enough time to learn the distribution, and it's necessary to slow down the discriminator.

The authors (in the aforementioned paper) showed that, given a generator characterized by a distribution $p_g$, the optimal discriminator is:

$$D^*(\bar{x}) = \frac{p_{data}(\bar{x})}{p_{data}(\bar{x}) + p_g(\bar{x})}$$

At this point, considering the previous value function V(G, D) and using the optimal discriminator, we can rewrite it in a single objective (as a function of G) that must be minimized by the generator:

$$C(G) = E_{\bar{x} \sim p_{data}}\left[\log D^*(\bar{x})\right] + E_{\bar{x} \sim p_g}\left[\log\left(1 - D^*(\bar{x})\right)\right]$$

To better understand how a GAN works, we need to expand the previous expression:

$$C(G) = E_{\bar{x} \sim p_{data}}\left[\log \frac{p_{data}(\bar{x})}{p_{data}(\bar{x}) + p_g(\bar{x})}\right] + E_{\bar{x} \sim p_g}\left[\log \frac{p_g(\bar{x})}{p_{data}(\bar{x}) + p_g(\bar{x})}\right]$$

Applying some simple manipulations, we get the following:

$$C(G) = -\log 4 + D_{KL}\left(p_{data} \,\Big\|\, \frac{p_{data} + p_g}{2}\right) + D_{KL}\left(p_g \,\Big\|\, \frac{p_{data} + p_g}{2}\right)$$

The sum of the last two terms is equal to twice the Jensen-Shannon divergence between pdata and pg. This measure is similar to the Kullback-Leibler divergence, but it's symmetric and bounded between 0 and log 2. When the two distributions are identical, $D_{JS} = 0$, but if their supports (the value sets where $p_{data}(\bar{x}) > 0$ and $p_g(\bar{x}) > 0$) are disjoint, $D_{JS} = \log 2$ (while $D_{KL} \to \infty$). Therefore, the value function can be expressed as:

$$V(G, D^*) = -\log 4 + 2 D_{JS}(p_{data} \,\|\, p_g)$$

Now, it should be clearer that a GAN tries to minimize the Jensen-Shannon divergence between the data-generating process and the generator distribution. In general, this procedure is quite effective; however, when the supports are disjoint, a GAN has no information about the true distance.
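
As a minimal numerical check of this behavior (a sketch assuming only NumPy), the following snippet computes the Jensen-Shannon divergence for two discrete distributions, showing that it vanishes for identical distributions and saturates at log 2 ≈ 0.693 when the supports are disjoint:

import numpy as np

def js_divergence(p, q, eps=1e-12):
    # D_JS(p || q) = 0.5 * D_KL(p || m) + 0.5 * D_KL(q || m), with m = (p + q) / 2
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(js_divergence(p, p))  # 0.0 (identical distributions)
print(js_divergence(p, q))  # ~0.693 = log 2 (disjoint supports)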

This consideration (analyzed with more mathematical rigor in Salimans T., Goodfellow I., Zaremba W., Cheung V., Radford A., and Chen X., Improved Techniques for Training GANs, arXiv:1606.03498 [cs.LG]) explains why training a GAN can become quite difficult and, consequently, why the Nash equilibrium cannot be found in many cases. For these reasons, we are going to analyze an alternative approach in the next section.

The complete GAN algorithm (as proposed by the authors) is:

  1. Set the number of epochs, Nepochs.
  2. Set the number of discriminator iterations, Niter (in most cases, Niter = 1).
  3. Set the batch size, k.
  4. Define a noise-generating process, N (for example, N = U(-1,1)).
  5. For e = 1 to Nepochs:
    1. Sample k values from X.
    2. Sample k values from N.
    3. For i = 1 to Niter:
      1. Compute the gradients, $\nabla_{\bar{\theta}_d} \frac{1}{k} \sum_{i=1}^{k} \left[\log D(\bar{x}_i) + \log\left(1 - D(G(\bar{z}_i))\right)\right]$ (only with respect to the discriminator variables). The expected value is approximated with a sample mean.
      2. Update the discriminator parameters by stochastic gradient ascent (as we are working with logarithms, it's possible to minimize the negative loss).
    4. Sample k values from N.
    5. Compute the gradients, $\nabla_{\bar{\theta}_g} \frac{1}{k} \sum_{i=1}^{k} \log\left(1 - D(G(\bar{z}_i))\right)$ (only with respect to the generator variables).
    6. Update the generator parameters by stochastic gradient descent.

As these models need to sample noise vectors, in order to guarantee reproducibility, I suggest setting the random seed in both NumPy (np.random.seed(...)) and TensorFlow (tf.random.set_seed(...)). The default value chosen for all of these experiments is 1,000.
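
For example, the following minimal snippet (using the value 1,000 mentioned above) can be executed before building the models:

import numpy as np
import tensorflow as tf

# Fix both seeds before building the models, so that the sampled
# noise vectors (and the initial weights) are reproducible
np.random.seed(1000)
tf.random.set_seed(1000)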

Deep Convolutional GANs

After discussing the basic concepts of adversarial training, we can apply them to a practical example of a DCGAN. In fact, even though it's possible to use only dense layers (as in an MLP), since we want to work with images, it's preferable to employ convolutions and transposed convolutions to obtain the best results.

Example of DCGAN with TensorFlow

In this example, we want to build a DCGAN (proposed in Radford A., Metz L., Chintala S., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv:1511.06434 [cs.LG]) with the Fashion-MNIST dataset (obtained through the TensorFlow/Keras helper function). As the training speed is not very high, we limit the number of samples to 5,000, but I suggest repeating the experiment with larger values. The first step is loading and normalizing (between -1 and 1) the dataset:

import tensorflow as tf
import numpy as np

nb_samples = 5000

# Load Fashion-MNIST and keep the first 5,000 training samples
(X_train, _), (_, _) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Normalize the samples in the range (-1, 1)
X_train = X_train.astype(np.float32)[0:nb_samples] / 255.0
X_train = (2.0 * X_train) - 1.0

width = X_train.shape[1]
height = X_train.shape[2]
code_length = 100

According to the original paper, the generator is based on an initial transposed convolution that expands the single multi-channel input pixel (1 × 1 × code_length) to 4 × 4, followed by four transposed convolutions with kernel sizes equal to (4, 4) and strides equal to (2, 2). The numbers of filters are 1024, 512, 256, 128, and 1 (we are working with grayscale images). The authors suggest employing a symmetric-valued dataset (that's why we have normalized between -1 and 1), batch normalization after each layer, and leaky ReLU activations (with the default negative slope, equal to 0.3):

generator = tf.keras.models.Sequential([
    tf.keras.layers.Conv2DTranspose(
        input_shape=(1, 1, code_length),
        filters=1024,
        kernel_size=(4, 4),
        padding='valid'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2DTranspose(
        filters=512,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2DTranspose(
        filters=256,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2DTranspose(
        filters=128,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2DTranspose(
        filters=1,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same',
        activation='tanh')
])

The strides are set to work with 64 × 64 images; as the Fashion-MNIST samples are 28 × 28, a size that cannot be obtained through this chain of doubling steps, we are going to resize them to 64 × 64 while training. Contrary to older TensorFlow versions, in this case, we don't need to declare any variable scope, because the training will be managed using GradientTape contexts.

Moreover, all Keras-derived models expose the training parameter, which enables/disables dropout and switches batch normalization between training and inference modes. The output of the generator is already normalized in the range (-1, 1), thanks to the hyperbolic tangent activation.
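
A quick, purely illustrative sanity check confirms both properties, the 64 × 64 output shape and the (-1, 1) value range:

# Feed a small batch of latent vectors through the generator
test_z = tf.random.uniform((4, 1, 1, code_length), -1.0, 1.0)
test_x = generator(test_z, training=False)
print(test_x.shape)  # expected: (4, 64, 64, 1)
print(float(tf.reduce_min(test_x)),
      float(tf.reduce_max(test_x)))  # both strictly inside (-1, 1)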

The discriminator is almost symmetric to the generator (the main differences are the reversed convolution sequence and the absence of batch normalization after the first layer):

discriminator = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(
        input_shape=(64, 64, 1),
        filters=128,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same'),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2D(
        filters=256,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2D(
        filters=512,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2D(
        filters=1024,
        kernel_size=(4, 4),
        strides=(2, 2),
        padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Conv2D(
        filters=1,
        kernel_size=(4, 4),
        padding='valid')
])

The discriminator is still a fully convolutional network, even if the output (with a single filter) is a vector of values representing the logits of the samples. As explained in the chapter about regression models, the logit can be immediately transformed into an actual probability by using the sigmoid function, but in this case, we prefer to output the original value, letting TensorFlow perform the transformation in a more robust way when computing the loss function. Of course, if it's necessary to obtain the probability, all we need to do is use the appropriate function:

p = tf.math.sigmoid(discriminator(x, training=False))

We can also create a couple of helper functions that run the generator and the discriminator, taking care of reshaping the noise vectors and resizing the images every time:

def run_generator(z, training=False):
    # Reshape the noise batch into 1x1 multi-channel pixels
    zg = tf.reshape(z, (-1, 1, 1, code_length))
    return generator(zg, training=training)

def run_discriminator(x, training=False):
    # Upscale the 28x28 samples to the 64x64 input size of the model
    xd = tf.image.resize(x, (64, 64))
    return discriminator(xd, training=training)

At this point, we need to define the optimizers and the loss meters:

optimizer_generator = \
    tf.keras.optimizers.Adam(0.0002, beta_1=0.5)
optimizer_discriminator = \
    tf.keras.optimizers.Adam(0.0002, beta_1=0.5)

train_loss_generator = \
    tf.keras.metrics.Mean(name='train_loss')
train_loss_discriminator = \
    tf.keras.metrics.Mean(name='train_loss')

Both networks will be trained using the Adam optimizer, with a learning rate $\eta = 0.0002$ and $\beta_1 = 0.5$. This choice has been suggested by the authors after testing different configurations, and results in fast convergence and average-to-good generative quality.

At this point, we can define the training function using two different GradientTape contexts (for the generator and the discriminator):

@tf.function
def train(xi):
    # Sample a batch of noise vectors zn ~ U(-1, 1)
    zn = tf.random.uniform(
        (batch_size, code_length), -1.0, 1.0)
    with tf.GradientTape() as tape_generator, \
            tf.GradientTape() as tape_discriminator:
        xg = run_generator(zn, training=True)
        zd1 = run_discriminator(xi, training=True)
        zd2 = run_discriminator(xg, training=True)
        bce = tf.keras.losses.BinaryCrossentropy(
            from_logits=True)
        # True samples are labeled 1, generated ones 0
        loss_d1 = bce(tf.ones_like(zd1), zd1)
        loss_d2 = bce(tf.zeros_like(zd2), zd2)
        loss_discriminator = loss_d1 + loss_d2
        # The generator tries to make the discriminator output 1
        loss_generator = bce(tf.ones_like(zd2), zd2)
    gradients_generator = tape_generator.gradient(
        loss_generator,
        generator.trainable_variables)
    gradients_discriminator = tape_discriminator.gradient(
        loss_discriminator,
        discriminator.trainable_variables)
    optimizer_discriminator.apply_gradients(
        zip(gradients_discriminator,
            discriminator.trainable_variables))
    optimizer_generator.apply_gradients(
        zip(gradients_generator,
            generator.trainable_variables))
    train_loss_discriminator(loss_discriminator)
    train_loss_generator(loss_generator)

After generating the noise ($\bar{z}_n \sim U(-1, 1)$), the generator is invoked, followed by a double call to the discriminator to obtain the evaluation for a true image batch and an equal number of generated samples. The next step is defining the loss functions. As we are working with logarithms, there can be stability problems when the values become close to 0. For this reason, it's preferable to employ the built-in TensorFlow class tf.keras.losses.BinaryCrossentropy, which guarantees numerical stability in every case. This class must be initialized by selecting whether the input is either a probability (bounded between 0 and 1) or a logit (unbounded). As we are working with the linear output of the final 2D convolution, we are imposing from_logits=True in order to ask the algorithm to apply the sigmoid transformation internally. In general, the output (given a logit z and a label y) is:

$$L = -\left[y \log \sigma(z) + (1 - y) \log\left(1 - \sigma(z)\right)\right]$$

Therefore, setting the label equal to 1 forces the second term to be null, and vice versa. The training step is split into two parts, which act on the discriminator and generator variables separately. Contrary to older versions of TensorFlow, we don't need to worry about the reuse of discriminator variables, because every time we call the model, the same instance is used. As the training procedure is split, we can compute the gradients of both the generator and the discriminator, and apply the corrections only to the respective models (even if the discriminator is also fed with the output of the generator, which becomes part of the same computational graph). Therefore, when the discriminator is updated, the generator variables remain unchanged, even though the generator produced part of the batch that contributed to the discriminator loss.
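
The following minimal check (illustrative, with arbitrary logits and labels) shows that the built-in class and the manual formula produce the same value:

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
z = tf.constant([[2.5], [-1.0]])  # arbitrary logits
y = tf.constant([[1.0], [0.0]])   # arbitrary labels
manual = -tf.reduce_mean(
    y * tf.math.log(tf.math.sigmoid(z)) +
    (1.0 - y) * tf.math.log(1.0 - tf.math.sigmoid(z)))
print(bce(y, z).numpy(), manual.numpy())  # the two values match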

We can now implement the training cycle, with 100 epochs and a batch size of 128 (the reader is free to change these values and observe the effects as an exercise):

nb_epochs = 100
batch_size = 128

x_train_g = tf.data.Dataset.from_tensor_slices(
    np.expand_dims(X_train, axis=3)).\
    shuffle(1000).batch(batch_size)

for e in range(nb_epochs):
    for xi in x_train_g:
        train(xi)
    print("Epoch {}: "
          "Discriminator Loss: {:.3f}, "
          "Generator Loss: {:.3f}".
          format(e + 1,
                 train_loss_discriminator.result(),
                 train_loss_generator.result()))
    train_loss_discriminator.reset_states()
    train_loss_generator.reset_states()

Once the training process has finished, we can generate some images (50) by executing the generator with a matrix of noise samples:

Z = np.random.uniform(-1.0, 1.0,
                      size=(50, code_length)).astype(np.float32)

Ys = run_generator(Z, training=False)
# Map the tanh output from (-1, 1) back to the (0, 255) pixel range
Ys = np.squeeze((Ys + 1.0) * 0.5 * 255.0).astype(np.uint8)

The result (which depends on the random seed) is shown in the following screenshot:

Samples generated by a DCGAN trained with the Fashion-MNIST dataset

As an exercise, I invite the reader to employ more complex convolutional architectures and an RGB dataset such as CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html).

Mode collapse

We have seen that a GAN is a generative model that learns to reproduce a data-generating process, pdata. In the best cases, the artificial distribution is close enough to pdata according to a predefined metric (for example, the Kullback-Leibler divergence). Unfortunately, this goal is often impossible to achieve, and the distribution learned by the GAN only partially overlaps the data-generating process. From a generic viewpoint, the discrepancy can take two different forms:

  • The two distributions differ in many regions; therefore, the GAN isn't able to output any correct examples.
  • The two distributions have a strong overlap limited to a single region.

In the first case, the model is clearly underfitted, and it's necessary to increase its capacity and tune up the learning algorithm in order to achieve better performance. In the second case, the GAN has instead remained stuck in a high-probability region and discarded all the remaining ones. This particular phenomenon is called mode collapse, and it's a common problem that affects these models. Given a distribution $p(x)$, the mode is the value $\bar{x}$ corresponding to $\max_x p(x)$. For example, a normal distribution centered at 0 is monomodal, and the mode is clearly x = 0. Conversely, a mixture of Gaussians is a multi-modal distribution where all the local maxima are associated with different modes, as shown in the following figure:


Mono-modal distribution (left). Multi-modal distribution (right)
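
The following sketch (assuming SciPy is available) locates the modes of such a bimodal Gaussian mixture numerically, by searching for the local maxima of the density on a grid:

import numpy as np
from scipy.stats import norm

# Bimodal mixture of two Gaussians centered at -2 and 2
x = np.linspace(-6.0, 6.0, 2001)
p = 0.5 * norm.pdf(x, loc=-2.0) + 0.5 * norm.pdf(x, loc=2.0)

# The modes are the local maxima of the density
is_mode = (p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])
print(x[1:-1][is_mode])  # expected: values close to -2.0 and 2.0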

From a statistical viewpoint, a mode is very likely to be a data point; therefore, it's not surprising that a GAN learns to output it (and all its neighbors) with high probability. However, real-world data distributions are multi-modal, and it's also extremely difficult (or impossible) to know where the modes are located. Therefore, a GAN that learns to reproduce only a region of pdata collapses into a small subspace and loses the ability to output other samples. Even though mode collapse has been discovered and studied, unfortunately, there are no explicit solutions. Models with a more flexible distance function (such as the one that we are going to study in the next section) can mitigate the problem and reduce its probability. However, the usage of GANs should always include a massive test phase, to check whether any regions of the data-generating process are completely missing.

The test is not simple, but in some cases (for example, with images), it's possible to sample many values from the GAN, measure their frequencies, and compare them with the expected ones. For example, we know that the Fashion-MNIST dataset has 10 different classes; after training the GAN and sampling 1,000 images, we should expect about 100 images for each class. If, for instance, all the images are shoes, or shoes are completely missing, it means that the GAN has collapsed. In the first case, the effect is dramatic, and it's probably due to bad shuffling, class imbalance, or very low capacity. Hence, the simplest solution is to check the dataset and, if it's perfectly balanced, to increase the capacity of the model. In the second case, the problem is tougher because a specific class is completely missing. If all the other images are correctly reproduced, the problem may depend on the overspecialization of the units.
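
A sketch of such a check is shown below; it assumes a hypothetical classifier model previously trained on the 10 Fashion-MNIST classes (this model is not part of the chapter's code):

# 'classifier' is a hypothetical model trained on 28x28 Fashion-MNIST
# images normalized in (0, 1)
Z = np.random.uniform(-1.0, 1.0,
                      size=(1000, code_length)).astype(np.float32)
Ys = run_generator(Z, training=False)
Ys = tf.image.resize((Ys + 1.0) * 0.5, (28, 28))
predictions = np.argmax(classifier.predict(Ys), axis=1)
counts = np.bincount(predictions, minlength=10)
print(counts)  # a balanced GAN should yield about 100 images per class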

For example, a convolutional generator can become more and more specialized and output only shirts and other similar shapes. This is a sort of overfitting (even if the accuracy with respect to the training set is not saturated), and one potential mitigating strategy is based on the usage of dropout layers or other regularization techniques. In particular, dropout is able to limit the overspecialization even when the capacity is very large and should be employed as a first choice. Layer regularization is also a reasonable approach, but it increases the computational complexity and might yield only a sub-optimal result.

On the other hand, when the generator collapses around a mode, the information provided to the discriminator becomes very limited, and it consequently loses the ability to discriminate between noise and the other valid classes. The usage of dropout (also in the discriminator) may help to leave some free capacity that can be used to limit the overfitting. In this way, the gradients are forced to vanish more slowly, and the double feedback generator → discriminator (and vice versa) can remain active for a longer time. This is clearly not a general-purpose solution (the problem is extremely complex), but it's a strategy that should be kept in mind when working with GANs because, contrary to other models, they can fail in a way that is not easy to verify immediately.
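
As a reference, a discriminator block extended with dropout could look like the following sketch (the rate of 0.3 is an arbitrary starting point that must be tuned):

# A convolutional block with dropout to limit unit overspecialization
block = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(256, kernel_size=(4, 4),
                           strides=(2, 2), padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dropout(0.3)
])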

Wasserstein GAN

As explained in the previous section, one of the most difficult problems with standard GANs is caused by the loss function based on the Jensen-Shannon divergence, whose value becomes constant when two distributions have disjoint supports. This situation is quite common with high-dimensional, semantically structured datasets. For example, images are constrained to having particular features in order to represent a specific subject (this is a consequence of the manifold assumption discussed in Chapter 3, Introduction to Semi-Supervised Learning). The initial generator distribution is very unlikely to overlap the true data distribution and, in many cases, the two are also very far from each other. This condition increases the risk of learning a wrong representation (a problem known as mode collapse), even when the discriminator is able to distinguish between true and generated samples (such a condition arises when the discriminator learns too quickly with respect to the generator). Moreover, the Nash equilibrium becomes harder to achieve, and the GAN can easily remain blocked in a sub-optimal configuration.

In order to mitigate this problem, Arjovsky, Chintala, and Bottou (in Arjovsky M., Chintala S., Bottou L., Wasserstein GAN, arXiv:1701.07875 [stat.ML]) proposed employing a different divergence, called the Wasserstein distance (or Earth Mover's distance), which is formally defined as follows:

$$D_W(p_{data} \,\|\, p_g) = \inf_{\mu \in \Pi(p_{data}, p_g)} E_{(x, y) \sim \mu}\left[\|x - y\|\right]$$

The term $\Pi(p_{data}, p_g)$ represents the set of all possible joint probability distributions between pdata and pg. Hence, the Wasserstein distance is the infimum (considering all joint distributions) of the set of expected values of $\|x - y\|$, where x and y are sampled from a joint distribution $\mu$.
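
A simple empirical illustration (assuming SciPy is available) shows the key difference with respect to the Jensen-Shannon divergence: for one-dimensional distributions with disjoint supports, the Wasserstein distance remains proportional to the actual gap between them:

import numpy as np
from scipy.stats import wasserstein_distance

samples_a = np.random.uniform(0.0, 1.0, size=10000)
# Shifting the samples yields disjoint supports;
# the distance keeps tracking the size of the shift
print(wasserstein_distance(samples_a, samples_a + 2.0))  # ~2.0
print(wasserstein_distance(samples_a, samples_a + 5.0))  # ~5.0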

The Wasserstein distance can also be directly employed when the couple (x, y) represents, for example, word embeddings obtained from algorithms like Word2Vec/Doc2Vec (for further details, see Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J., Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, arXiv:1310.4546) or fastText (Bojanowski P., Grave E., Joulin A., Mikolov T., Enriching Word Vectors with Subword Information, arXiv:1607.04606 [cs.CL]). Using these algorithms, the words (or n-grams) of a text are transformed into high-dimensional vectors, whose distance is proportional to the actual semantic distance of the words/sentences. Therefore, a GAN can be trained to generate sequences of words sampled from a semantically acceptable distribution (for example, "An apple is a fruit" and "A car is a fruit" should be considered as drawn from different distributions, even if their composition is very similar).

This topic is very interesting and quite complex at the same time. If, in fact, a slightly corrupted image can go undetected by the human eye (or simply be considered a normal image), a sentence with a semantic mistake is almost always immediately identified as flawed. Therefore, these models must be trained with very large corpora (even when using pretrained vectors, like fastText based on Wikipedia) in order to guarantee reliable results.

The main property of the Wasserstein distance is that, even when two distributions have disjoint supports, its value is proportional to the actual distributional distance. The formal proof is not very complex, but it's easier to understand the concept intuitively. In fact, given two distributions with disjoint supports, the infimum operator forces taking the shortest distance between each possible couple of samples. Clearly, this measure is more robust than the Jensen-Shannon divergence, but there's a practical drawback: it's extremely difficult to compute. As we cannot work with all possible joint distributions (nor with an approximation), a further step is necessary to employ this loss function. In the aforementioned paper, the authors proved that it's possible to apply a transformation, thanks to the Kantorovich-Rubinstein theorem (the topic is quite complex, but the reader can find further information in Edwards D. A., On the Kantorovich-Rubinstein Theorem, Expositiones Mathematicae, 2011):

$$D_W(p_{data} \,\|\, p_g) = \frac{1}{L} \sup_{\|f\|_L \leq L} \left( E_{x \sim p_{data}}\left[f(x)\right] - E_{x \sim p_g}\left[f(x)\right] \right)$$

The first element to consider is the nature of the function f. The theorem requires considering only L-Lipschitz functions, which means that f (assuming a real-valued function of a single variable defined over a set D) must obey:

$$|f(x_1) - f(x_2)| \leq L \, |x_1 - x_2|, \quad \forall \, x_1, x_2 \in D$$

At this point, the Wasserstein distance is proportional to the supremum (with respect to all L-Lipschitz functions) of the difference between two expected values, which are extremely easy to compute. In a WGAN, the function f is represented by a neural network; therefore, we have no guarantees about the Lipschitz condition. To solve this problem, the authors suggested a very simple procedure: clipping the variables of the discriminator (which is normally called the Critic, and whose responsibility is to represent the parameterized function $f(x; \bar{\theta}_c)$) after applying the corrections. If the input is bounded, all of the transformations will yield a bounded output; however, the clipping factor must be small enough (0.01, or even smaller) to avoid the additive effect of multiple operations leading to a violation of the Lipschitz condition.

This is not an efficient solution (because it slows down the training process when it's not necessary), but it permits the exploitation of the Kantorovich-Rubinstein theorem even when there are no formal constraints imposed on the function family.

Using a parameterized function (such as a deep convolutional network), the Wasserstein distance becomes as follows (omitting the term L, which is constant):

$$D_W(p_{data} \,\|\, p_g) \propto \sup_{\bar{\theta}_c} \left( E_{x \sim p_{data}}\left[f(x; \bar{\theta}_c)\right] - E_{\bar{z} \sim p_{noise}}\left[f(G(\bar{z}; \bar{\theta}_g); \bar{\theta}_c)\right] \right)$$

In the previous expression, we explicitly extracted the generator output and, in the last step, separated the term that will be optimized separately. The reader has probably noticed that the computation is simpler than for a standard GAN, because in this case, we only have to average the Critic values over a batch (there's no more need for a logarithm). However, as the Critic variables are clipped, the number of required iterations is normally larger and, in order to compensate for the difference between the training speeds of the Critic and the generator, it's often necessary to set Ncritic > 1 (the authors suggest a value equal to 5, but this is a hyperparameter that must be tuned in every specific context).

The complete WGAN algorithm is:

  1. Set the number of epochs, Nepochs.
  2. Set the number of Critic iterations, Ncritic (in most cases, Ncritic = 5).
  3. Set the batch size, k.
  4. Set a clipping constant c (for example, c = 0.01).
  5. Define a noise-generating process N (for example, N = U(-1,1)).
  6. For e = 1 to Nepochs:
    1. Sample k values from X.
    2. Sample k values from N.
    3. For i = 1 to Ncritic:
      1. Compute the gradients, $\nabla_{\bar{\theta}_c} \frac{1}{k} \sum_{i=1}^{k} \left[f(\bar{x}_i; \bar{\theta}_c) - f(G(\bar{z}_i); \bar{\theta}_c)\right]$ (only with respect to the Critic variables). The expected values are approximated by sample means.
      2. Update the Critic parameters by stochastic gradient ascent.
      3. Clip the Critic parameters in the range (-c, c).
    4. Sample k values from N.
    5. Compute the gradients, $\nabla_{\bar{\theta}_g} \left( -\frac{1}{k} \sum_{i=1}^{k} f(G(\bar{z}_i); \bar{\theta}_c) \right)$ (only with respect to the generator variables).
    6. Update the generator parameters by stochastic gradient descent.

We can now implement a WGAN using TensorFlow. As we are going to see, the loss function is now much simpler, but it's important to clip the variables in order to guarantee the L-Lipschitz condition.

Example of WGAN with TensorFlow

This example can be considered a variant of the previous one, because it uses the same dataset, generator, and discriminator structures. The only main difference is that in this case, the discriminator has been renamed critic(), and the corresponding helper function is run_critic(). Moreover, to simplify the training process, we have also introduced another helper function that runs the whole model and computes the simplified loss functions, which are:

$$L_{critic} = \frac{1}{k} \sum_{i=1}^{k} \left[ f(G(\bar{z}_i)) - f(\bar{x}_i) \right] \qquad L_{generator} = -\frac{1}{k} \sum_{i=1}^{k} f(G(\bar{z}_i))$$

The snippet to run the model is:

def run_model(xi, zn, training=True):
    xg = run_generator(zn, training=training)
    zc1 = run_critic(xi, training=training)
    zc2 = run_critic(xg, training=training)
    loss_critic = tf.reduce_mean(zc2 - zc1)
    loss_generator = tf.reduce_mean(-zc2)
    return loss_critic, loss_generator

The two loss functions are simpler than in a standard GAN, as they work directly with the Critic outputs, computing the sample means over a batch. In the original paper, the authors suggest using RMSProp as the standard optimizer, in order to avoid the instabilities that a momentum-based algorithm can produce. However, Adam with lower forgetting factors ($\beta_1 = 0.5$ and $\beta_2 = 0.9$) and a learning rate $\eta = 0.00005$ is faster than RMSProp, and doesn't lead to instabilities. I suggest testing both options, trying to maximize the training speed while preventing mode collapse:

import tensorflow as tf

optimizer_generator = \
    tf.keras.optimizers.Adam(
        0.00005, beta_1=0.5, beta_2=0.9)
optimizer_critic = \
    tf.keras.optimizers.Adam(
        0.00005, beta_1=0.5, beta_2=0.9)

train_loss_generator = \
    tf.keras.metrics.Mean(name='train_loss')
train_loss_critic = \
    tf.keras.metrics.Mean(name='train_loss')

We can now define the training functions, which, for simplicity, are now separated. The main reason is that we need to perform more Critic iterations for each generator step. Moreover, the Critic variables must be clipped in the range (-0.01, 0.01) (as suggested by the authors) to meet the requirements of the Kantorovich-Rubinstein theorem and, consequently, to use the simplified loss function:

@tf.function
def train_critic(xi):
    zn = tf.random.uniform(
        (batch_size, code_length), -1.0, 1.0)
    with tf.GradientTape() as tape:
        loss_critic, _ = run_model(xi, zn,
                                   training=True)
    gradients_critic = tape.gradient(
        loss_critic,
        critic.trainable_variables)
    optimizer_critic.apply_gradients(
        zip(gradients_critic,
            critic.trainable_variables))
    # Clip the Critic variables after the update to enforce
    # the L-Lipschitz condition
    for v in critic.trainable_variables:
        v.assign(tf.clip_by_value(v, -0.01, 0.01))
    train_loss_critic(loss_critic)

@tf.function
def train_generator():
    zn = tf.random.uniform(
        (batch_size, code_length), -1.0, 1.0)
    # True samples are not needed for the generator loss; a zero
    # batch is passed only to comply with run_model's signature
    xg = tf.zeros((batch_size, width, height, 1))
    with tf.GradientTape() as tape:
        _, loss_generator = run_model(xg, zn,
                                      training=True)
    gradients_generator = tape.gradient(
        loss_generator,
        generator.trainable_variables)
    optimizer_generator.apply_gradients(
        zip(gradients_generator,
            generator.trainable_variables))
    train_loss_generator(loss_generator)

The structure of each function is straightforward and doesn't need a detailed explanation. However, it's very important to notice that the Critic variables are clipped after the gradients have been applied. Carrying out this operation before applying the gradients leads to an inconsistency, because the updates can push the values outside the predefined range, and the Critic can lose the property of being Lipschitz continuous. In that case, when running the generator training step, the loss function might not be accurate anymore.

The complete training procedure is shown in the following snippet:

nb_samples = 10240
nb_epochs = 100
nb_critic = 5
batch_size = 64
code_length = 256

x_train = tf.data.Dataset.from_tensor_slices(
    np.expand_dims(X_train, axis=3)).\
    shuffle(1000).batch(nb_critic * batch_size)

for e in range(nb_epochs):
    for xi in x_train:
        # Perform nb_critic Critic steps for each generator step
        for i in range(nb_critic):
            train_critic(xi[i * batch_size:
                            (i + 1) * batch_size])
        train_generator()
    print("Epoch {}: "
          "Critic Loss: {:.3f}, "
          "Generator Loss: {:.3f}".
          format(e + 1,
                 train_loss_critic.result(),
                 train_loss_generator.result()))
    train_loss_critic.reset_states()
    train_loss_generator.reset_states()

In this example, we have decided to employ a larger training set (10,240 images), a batch size equal to 64, and 5 Critic steps per generator iteration. I invite the reader to employ a larger training set (of course, the computational cost will grow proportionally), and also to test a different number of Critic steps. The choice made in this case is based on the original paper. However, a simple way to find a suitable value is to monitor both losses during the training. If the generator converges much faster than the Critic (that is, it stabilizes very quickly to a steady value), Ncritic must be increased.

Ideally, both components should have the same training speed, in order to guarantee a constant flow of information (depending on the magnitude of the gradients) from the Critic to the generator and vice versa. If the Critic stops modifying its variables very early, the generator stops receiving the information needed to improve the quality of the reproduction of pdata, and the GAN will likely reach a mode collapse. On the other hand, a very large Ncritic value can force the Critic to hyperspecialize before the generator has reached a satisfactory accuracy, leading to an underfitted GAN with very poor performance.

The result of the generation of 50 random samples is shown in the following screenshot:

Samples generated by a WGAN trained with the Fashion-MNIST dataset

As we can see, the quality is slightly higher than that of the DCGAN, and the samples are smoother and better defined. I invite the reader to also test this model with an RGB dataset, because the final quality is normally excellent (with a proportionally longer training time).

When working with these models, the training time can be very long. To avoid waiting for the initial results (and to perform the required tuning), I suggest using Jupyter. In this way, it's possible to stop the learning process, check the generator's ability, and restart it without any problem. Of course, the models must remain the same, and the variable initialization (which, in TensorFlow 2, happens when the models are defined) must be performed only at the beginning.

Summary

In this chapter, we discussed the main principles of adversarial training and explained the roles of two players: the generator and discriminator. We described how to model and train them using a minimax approach whose double goal is to force the generator to learn the true data distribution pdata and get the discriminator to distinguish perfectly between true samples (belonging to pdata) and unacceptable ones. In the same section, we analyzed the inner dynamics of a GAN and some common problems that can slow down the training process and lead to a sub-optimal final configuration.

One of the most difficult problems experienced with standard GANs arises when the data-generating process and the generator distribution have disjoint supports. In this case, the Jensen-Shannon divergence becomes constant and doesn't provide precise information about the distance. An excellent alternative is provided by the Wasserstein measure, which is employed in a more efficient model called a WGAN. This method can efficiently manage disjoint distributions, but it's necessary to enforce the L-Lipschitz condition on the Critic. The standard approach is based on clipping the parameters after each gradient ascent update. This simple technique guarantees the L-Lipschitz condition, but it's necessary to use very small clipping factors, and this can lead to slower convergence. For this reason, it's normally necessary to repeat the training of the Critic a fixed number of times (such as five) before each single generator training step.

In the next chapter, we are going to introduce another probabilistic generative neural model, based on a particular kind of neural network, called the Restricted Boltzmann Machine.

Further reading

  • Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., Generative Adversarial Networks, arXiv:1406.2661 [stat.ML]
  • Tadelis S., Game Theory, Princeton University Press, 2013
  • Radford A., Metz L., Chintala S., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv:1511.06434 [cs.LG]
  • Salimans T., Goodfellow I., Zaremba W., Cheung V., Radford A., and Chen X., Improved Techniques for Training GANs, arXiv:1606.03498 [cs.LG]
  • Arjovsky M., Chintala S., Bottou L., Wasserstein GAN, arXiv:1701.07875 [stat.ML]
  • Edwards D. A., On the Kantorovich-Rubinstein Theorem, Expositiones Mathematicae, 2011
  • Holdroyd T., TensorFlow 2.0 Quick Start Guide, Packt Publishing, 2019
  • Goodfellow I., Bengio Y., Courville A., Deep Learning, The MIT Press, 2016
  • Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J., Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, arXiv:1310.4546
  • Bojanowski P., Grave E., Joulin A., Mikolov T., Enriching Word Vectors with Subword Information, arXiv:1607.04606 [cs.CL]
  • Bonaccorso G., Hands-On Unsupervised Learning with Python, Packt Publishing, 2019