Jyoti Arora1*, Meena Tushir2, Pooja Kherwa3 and Sonia Rathee3
1Department of Information Technology, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India
2Department of Electronics and Electrical Engineering, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India
3Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India
Generative Adversarial Networks (GANs) have gained immense popularity since their introduction in 2014. They are currently one of the most popular research areas in computer science. GANs are arguably one of the newest yet most powerful deep learning techniques, with applications in several fields. GANs can be applied to areas ranging from image generation to synthetic drug synthesis. They also find use in video generation, music generation, and the production of novel works of art. In this chapter, we present a detailed study of GANs and aim to make the topic understandable to readers. The chapter presents an extensive review of GANs, their anatomy, their types, and several applications. We also discuss the shortcomings of GANs.
Keywords: Generative adversarial networks, learning process, computer vision, deep learning, machine learning
Abbreviation | Full Form |
GAN | Generative Adversarial Network |
DBM | Deep Boltzmann Machine |
DBN | Deep Belief Network |
VAE | Variational Autoencoder |
DCGAN | Deep Convolutional GAN |
cGAN | conditional GAN |
WGAN | Wasserstein GAN |
LSGAN | Least Square GAN |
InfoGAN | Information Maximizing Generative Adversarial Network |
ReLU | Rectified Linear Unit |
GPU | Graphics Processing Unit |
Generative Adversarial Networks (GANs) are an emerging topic of interest among today’s researchers. A large proportion of research is being done on GANs, as can be seen from the number of research articles on GANs on Google Scholar. The term “Generative Adversarial Networks” yielded more than 3200 search results for the year 2021 alone (up to 20 March 2021). GANs have also been described as the most interesting innovation in machine learning in the past 10 years by Yann LeCun, who has made major contributions to deep learning. The major applications of GANs lie in computer vision [1–5]. GANs are extensively used in the generation of images from text [6, 7], image-to-image translation [8, 9], and image completion [10, 11].
Ian Goodfellow et al. introduced the concept of GANs in their research paper “Generative Adversarial Nets” [12]. In the simplest terms, GANs are machine learning systems made up of two neural networks, the generator and the discriminator, that together generate realistic-looking images, video, etc. The generator produces new content, which is then evaluated by the discriminator network. In a typical GAN, the objective of the generator network is to “fool” the discriminator by producing new content that the discriminator cannot identify as synthesized. Such a network can be thought of as analogous to a two-player zero-sum game (i.e. the total gain of the two players is zero [13]) in which the players compete to win. GANs are an adversarial game setting where the generator is pitted against the discriminator [14]. For GANs, the optimisation process is a minimax game and the goal is to reach a Nash equilibrium [15].
Nowadays, GANs are one of the most commonly used deep learning networks. They fall into the class of deep generative networks, which also includes the Deep Belief Network (DBN), Deep Boltzmann Machine (DBM) and Variational Autoencoder (VAE) [16]. Recently, GANs and VAEs have become popular techniques for unsupervised learning. Though originally intended for unsupervised learning [17–19], GANs offer several advantages over other deep generative networks like VAEs, such as the ability to handle missing data and to model high-dimensional data. GANs can also deliver multimodal outputs (multiple feasible answers) [20]. In general, GANs are known to generate fine-grained and realistic data, whereas images generated by VAEs tend to be blurred. Even though GANs offer several advantages, they have some shortcomings as well. Two of their major limitations are that they are difficult to train and not easy to evaluate. It is difficult for the generator and the discriminator to attain the Nash equilibrium during training [21], and difficult for the generator to learn the full distribution of a dataset, which leads to mode collapse. The term mode collapse describes a condition in which the generator produces only a limited variety of samples regardless of the input.
In this paper, we extensively review Generative Adversarial Networks and discuss the anatomy of GANs, the types of GANs, their areas of application, and their shortcomings.
To understand GANs, it is important to have some background in supervised and unsupervised learning. It is also necessary to understand generative modelling and how it differs from discriminative modelling. This section discusses these topics.
A supervised learning process is carried out by training a model on a training dataset that consists of several samples, each with an input and a corresponding output label. The model is trained using these samples, and the end goal is for the model to predict the output label for an unseen input [22]. The objective is essentially to learn a mapping from inputs x to outputs y, given multiple labeled input-output pairs [23].
The other type of learning is unsupervised learning, in which the data consists only of input variables (x) and carries no labels [23]. The model is built by extracting patterns from the input data. Since the model does not predict anything, no corrections take place as they do in supervised learning. Generative modelling is a notable unsupervised learning problem. GANs are an example of unsupervised learning algorithms [12].
Deep learning models can be divided into two types: generative models and discriminative models. Discriminative modelling is the same as classification, in which we focus on developing a model that predicts a class label, given a set of input-output pairs (supervised learning). The terminology reflects the fact that the model must discriminate between inputs across classes and decide which class a given input belongs to. Generative models, in contrast, are unsupervised models that summarise the distribution of the inputs and generate new examples [24]. Really good generative models produce samples that are not only plausible but also indistinguishable from the real examples supplied to the model.
In the past few years, generative models have seen a significant rise in popularity, especially Generative Adversarial Networks (GANs), which have produced very realistic results (Figure 10.1). The major difference between generative and discriminative models is that a discriminative model aims to learn the conditional probability distribution P(y|x), whereas a generative model aims to learn the joint probability distribution P(x,y) [25]. In contrast to discriminative models, generative models can use this joint probability distribution to generate likely (x,y) samples. One might assume that there is no need to generate new data samples, owing to the abundance of data already available. In reality, however, generative modelling has several important uses. Generative models can be used for text-to-image translation [6, 7] as well as for applications like generating a text sample in a particular handwriting fed to the system. Generative models, specifically GANs, can also be used in reinforcement learning to generate artificial environments [26].
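The distinction between the two distributions can be made concrete with a toy example. The sketch below assumes a small, made-up joint table P(x,y) over two discrete variables; the probabilities and the names `joint`, `conditional` and `sample_pair` are illustrative and do not come from the chapter. It derives the conditional P(y|x) that a discriminative model targets, and draws an (x,y) pair from the joint, as a generative model can.

```python
import random

# Hypothetical joint distribution P(x, y) over x in {0, 1} and y in {"cat", "dog"}.
joint = {
    (0, "cat"): 0.30,
    (0, "dog"): 0.10,
    (1, "cat"): 0.15,
    (1, "dog"): 0.45,
}

def conditional(joint, x):
    """Return P(y | x), the quantity a discriminative model learns."""
    marginal_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / marginal_x for (xi, y), p in joint.items() if xi == x}

def sample_pair(joint, rng):
    """Draw one (x, y) pair from the joint, as a generative model can."""
    pairs = list(joint)
    weights = [joint[pair] for pair in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]

p_y_given_x0 = conditional(joint, 0)              # roughly {"cat": 0.75, "dog": 0.25}
new_sample = sample_pair(joint, random.Random(0))  # one (x, y) pair
```

The conditional can always be recovered from the joint, but not the other way round, which is why only the generative view supports sampling new (x,y) pairs.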
A GAN is a bipartite model consisting of two neural networks: (i) a generator and (ii) a discriminator (Figure 10.2). The task of the generator network is to produce a set of synthetic data when fed with a random noise vector. This fixed-length vector is drawn randomly from a Gaussian distribution and is used to seed the generative process. After training, points in this vector space form a compressed representation of the original data distribution. The generator model acts on these points and applies meaning to them.
The task of the discriminator model is to distinguish real data from data generated by the generator. To do this, it takes two inputs, an instance from the real domain and one from the set of examples produced by the generator, and labels them as fake or real, i.e. 0 or 1 respectively.
These two networks are trained together. The generator produces a collection of samples, which are fed to the discriminator along with real examples, and the discriminator classifies each as real or synthetic. With every successful classification, the discriminator is rewarded while the generator is penalized and uses that penalty to adjust its weights. Conversely, when the discriminator misclassifies, the generator is rewarded and its parameters are not changed, while the discriminator is penalized and its parameters are revised. This process continues until the generator becomes skilled enough at synthesizing data to fool the discriminator, that is, until the discriminator’s confidence in its classification drops to 50%.
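The alternating updates described above can be sketched on a deliberately tiny model. The code below is only an illustration under stated assumptions: the generator is a 1-D linear map G(z) = a*z + b, the discriminator is a logistic unit D(x) = sigmoid(w*x + c), the losses are the usual log losses, and the gradients are written out by hand. None of these choices, nor the function names, come from the chapter.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator_step(w, c, real, fake, lr):
    """One gradient-descent step on L_D = -(E[log D(x)] + E[log(1 - D(G(z)))])."""
    gw = gc = 0.0
    for x in real:                      # push D(x) toward 1 on real samples
        d = sigmoid(w * x + c)
        gw += -(1.0 - d) * x / len(real)
        gc += -(1.0 - d) / len(real)
    for x in fake:                      # push D(x) toward 0 on generated samples
        d = sigmoid(w * x + c)
        gw += d * x / len(fake)
        gc += d / len(fake)
    return w - lr * gw, c - lr * gc

def generator_step(a, b, w, c, noise, lr):
    """One gradient-descent step on L_G = E[log(1 - D(G(z)))] with G(z) = a*z + b."""
    ga = gb = 0.0
    for z in noise:
        d = sigmoid(w * (a * z + b) + c)
        ga += -d * w * z / len(noise)   # d/da of log(1 - D(a*z + b))
        gb += -d * w / len(noise)       # d/db of log(1 - D(a*z + b))
    return a - lr * ga, b - lr * gb

# Real data sits at 4.0; the untrained generator (a=1, b=0) produces samples around 0.
w, c = discriminator_step(0.0, 0.0, real=[4.0, 4.0], fake=[0.0, 0.0], lr=0.1)
a, b = generator_step(1.0, 0.0, w, c, noise=[-1.0, 1.0], lr=0.1)
```

After one round, the discriminator has learned that larger values look more real (w becomes positive), and the generator responds by shifting its output toward the real data (b becomes positive). A practical GAN repeats these two steps many times with neural networks and automatic differentiation.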
This adversarial training of the two networks is what makes the generative adversarial network interesting, with the discriminator keen on maximizing the loss function while the generator tries to minimize it. The loss function is given below:
$$\min_G \max_D V(D, G) = E_x[\log D(x)] + E_z[\log(1 - D(G(z)))]$$
where D(x) is the discriminator’s confidence that the real sample x is real, E_x is the expected value over all real data samples, G(z) is the sample generated by the generator when fed with noise z, D(G(z)) is the discriminator’s confidence, expressed as a probability, that a fake data sample is real, and E_z is the expected value over all generated fake instances G(z).
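As a quick numerical check of these terms, the sketch below estimates the standard minimax value E_x[log D(x)] + E_z[log(1 - D(G(z)))] from lists of discriminator outputs. The function name `gan_value` and the sample confidences are illustrative assumptions.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    d_real: discriminator confidences D(x) on real samples.
    d_fake: discriminator confidences D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
fooled = gan_value([0.5, 0.5], [0.5, 0.5])       # equals -log(4)
# A confident, correct discriminator pushes the value up toward 0.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
```

The value -log 4 at D = 0.5 corresponds to the 50% confidence point at which the discriminator can no longer separate real from generated data.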
This section discusses several types of GANs. Many types have been proposed to date. These include Deep Convolutional GANs (DCGAN), conditional GANs (cGAN), InfoGANs, StackGANs, Wasserstein GANs (WGAN), Discover Cross Domain Relations with GANs (DiscoGAN), CycleGANs, Least Square GANs (LSGAN), etc.
Conditional GANs (cGANs) were developed by Mirza et al. [28] with the idea that a plain GAN can be extended to a conditional network by feeding some supplementary information, anything from class labels to data from other modalities, to both the generator and the discriminator as an additional input layer, as shown in Figure 10.3. Class labels, for example, control the generation of data of a particular class. Furthermore, input data with correlated information allows for improved GAN training. In the generator, the conditional information Y is fed in along with the random noise Z and merged into a hidden representation, while in the discriminator this information is provided along with the data instances.
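One simple way to picture the conditioning is to append a one-hot class label to the inputs of both networks. The sketch below is a simplification under stated assumptions: it concatenates at the input rather than merging in a hidden layer, and the helper names, the noise length of 100 and the 10 classes are illustrative choices, not details from [28].

```python
import random

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def generator_input(noise, label, num_classes):
    """Noise Z followed by the condition Y (simplified: plain concatenation)."""
    return noise + one_hot(label, num_classes)

def discriminator_input(sample, label, num_classes):
    """A data instance followed by the same condition Y."""
    return sample + one_hot(label, num_classes)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100)]          # assumed noise length
g_in = generator_input(z, label=3, num_classes=10)      # ask for class 3
d_in = discriminator_input([0.0] * 784, label=3, num_classes=10)
```

Because the label is part of the input, changing `label` while keeping `z` fixed requests a different class from the same noise vector.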
The authors trained the network on the MNIST dataset [29], conditioned on class labels encoded as one-hot vectors. Building on this, they demonstrated automated image tagging with multi-label predictions, using the conditional adversarial network to produce a set of tag vectors conditioned on image features. For the image features, a convolutional model inspired by [30] was pretrained on the full ImageNet dataset, and for the word representation a corpus of text was gathered from the YFCC100M [31] dataset metadata, to which proper preprocessing was applied. Finally, the model was trained on the MIR Flickr dataset [32] to generate automated image tags (refer Figure 10.3).
Deep Convolutional GANs (DCGANs) were introduced by Radford et al. [33] in late 2015 as a strong candidate for unsupervised learning using CNNs in computer vision tasks. The authors mention three major ideas that helped them arrive at a class of architectures that overcomes the problems faced by prior efforts to build CNN-based GANs, which suffered from training instability when working with high-resolution data (refer Figure 10.4).
The first idea was to replace any pooling layers with strided convolutional layers in both the discriminator and the generator, taking motivation from the all-convolutional network [34]. This allows the network to learn its own spatial downsampling. The second was to remove fully connected layers from deeper architectures. The third was to use Batch Normalization [35], which transforms the input to each unit to have zero mean and unit variance and stabilizes learning by helping the gradient flow in deeper models. The technique, however, is not applied to the output layer of the generator or the input layer of the discriminator, as applying it directly to all layers leads to training instability and sample oscillation. Additionally, the ReLU [36] activation function is used in the generator, except for the output layer, which uses the TanH activation function, while the discriminator employs leaky rectified activation [37, 38], which works well with higher-resolution images.
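The building blocks named above are small enough to write out directly. The following sketch is illustrative only: `batch_norm` omits the learned scale and shift of full Batch Normalization, and the kernel size, stride, padding and leaky slope used in the example are assumed values rather than settings quoted from this chapter.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a strided convolution (no pooling layer needed)."""
    return (size + 2 * padding - kernel) // stride + 1

def leaky_relu(x, slope=0.2):
    """Leaky rectified activation: small negative slope instead of zero."""
    return x if x > 0 else slope * x

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
halved = conv_output_size(64, kernel=4, stride=2, padding=1)   # 64 -> 32
```

With a stride of 2, each convolution halves the spatial resolution, which is the learned downsampling that replaces pooling.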
DCGAN was trained on three datasets: Large-scale Scene Understanding (LSUN) [39], Imagenet-1k [40] and a then newly assembled faces dataset containing 3M images of 10K people. The main idea behind training DCGAN in this way is to use the features learned by the model’s discriminator as a feature extractor for a classification model. Radford et al. in particular combined these features with an L2-SVM classifier, which achieved 82.8% accuracy when tested on the CIFAR-10 dataset.
Wasserstein GANs (WGANs) were introduced in 2017 by Martin Arjovsky et al. [41] as an alternative to the traditional GAN training methods, which had proven to be quite delicate and unstable. WGAN is an extension to GANs that improves stability during training and helps in assessing the quality of the generated images by relating it to the value of the loss function. The characteristic feature of this model is that it replaces the basic discriminator with a critic that can be trained to optimality, because the Wasserstein distance [42] is continuous and differentiable. The Wasserstein distance is preferable to the Kullback-Leibler [43] or Jensen-Shannon [44] divergences because it provides a smooth and meaningful measure of the distance between two probability distributions even when they lie on lower-dimensional manifolds without overlap.
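The advantage over the Jensen-Shannon divergence for non-overlapping distributions can be seen on a toy case. The sketch below uses two assumptions for illustration: the 1-D Wasserstein-1 distance is computed between equal-sized samples by matching sorted values, and the Jensen-Shannon divergence is computed between discrete histograms. The function names are hypothetical.

```python
import math

def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two point masses that never overlap, first close together, then far apart.
w_near = wasserstein_1d([0.0, 0.0], [1.0, 1.0])    # 1.0
w_far = wasserstein_1d([0.0, 0.0], [5.0, 5.0])     # 5.0
js_near = js_divergence([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # log 2
js_far = js_divergence([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])    # log 2
```

The Wasserstein distance shrinks as the two distributions move closer, so it can guide the generator, whereas the Jensen-Shannon divergence stays at log 2 for any pair of disjoint distributions and carries no information about how far apart they are.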
The most compelling feature of WGAN is the drastic reduction of the mode dropping phenomenon that is commonly found in GANs. Its loss metric is correlated with the generator’s convergence and is backed by a strong mathematical motivation and theoretical argument. In simpler terms, a reliable gradient for the Wasserstein GAN can be obtained by training the critic extensively. However, training might become unstable when a momentum-based optimiser, such as the Adam optimizer [45], is used on the critic. Moreover, when the generator is built without batch normalization and with a constant number of filters, WGAN still produces samples while the standard GAN fails to learn. WGAN does not show mode collapse when trained with an MLP generator with 4 layers and 512 units with ReLU nonlinearities, whereas mode collapse is clearly visible in the standard GAN. The benefit of WGAN is that, while being less sensitive to model architecture, it can still learn when the critic performs well. WGAN promises better convergence and training stability while generating high-quality images.
Stacked Generative Adversarial Networks (StackGANs) with Conditioning Augmentation, for synthesizing 256×256 photorealistic images conditioned on text descriptions, were introduced by Han Zhang et al. [46]. Generating high-quality images from text is of immense importance in applications like computer-aided design or photo editing. However, simply adding upsampling layers to the current state-of-the-art GAN results in training instability. Several techniques such as energy-based GAN [47] or super-resolution methods [48, 49] may provide stability, but they add only limited detail to low-resolution images such as the 64×64 images generated by Reed et al. [50].
StackGANs overcame this challenge by decomposing text-to-image synthesis into a two-stage problem. The Stage I GAN sketches the primitive shape and basic colours of the object as constrained by the given text description and yields a low-resolution image. The Stage II GAN corrects the defects in the Stage I result by reading the text description again and completes the image by adding compelling details. A new augmentation technique with proper conditioning encourages stable training of the conditional GAN. StackGAN thus generates images with more photorealistic detail and greater diversity.
Least Square GANs (LSGANs) were proposed by Xudong Mao et al. in 2016 [51]. LSGANs are built on the idea of using a least squares loss function in the discriminator, which provides a nonsaturating gradient, in contrast to the sigmoid cross entropy loss used by regular GANs. The least squares loss penalizes fake samples that lie far from the decision boundary and pulls them toward it, even when those samples are correctly separated by the boundary. Because the generator is thereby pushed to produce samples closer to the decision boundary, the generated samples resemble the real data more closely. LSGANs also converge relatively well even without batch normalization [6].
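The saturation argument can be checked numerically. The sketch below makes two assumptions for illustration: the cross entropy case uses the generator loss -log(sigmoid(t)) on a discriminator logit t, and the least squares case uses 0.5*(t - 1)^2 on the raw discriminator output t with target 1. It compares the gradients for a fake sample that the discriminator already scores far on the "real" side.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_sigmoid_ce(t):
    """d/dt of -log(sigmoid(t)): vanishes once the sample is scored as real."""
    return -(1.0 - sigmoid(t))

def grad_least_squares(t, target=1.0):
    """d/dt of 0.5 * (t - target)^2: grows with distance from the target."""
    return t - target

far_from_boundary = 10.0
ce_signal = grad_sigmoid_ce(far_from_boundary)       # about -4.5e-5
ls_signal = grad_least_squares(far_from_boundary)    # 9.0
```

With the cross entropy loss the generator receives almost no gradient for such a sample, while the least squares loss still produces a strong pull back toward the target value.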
Various quantitative and qualitative results have demonstrated the stability of LSGANs along with their power to generate realistic images [52]. Recent studies [53] have shown that a gradient penalty improves the stability of GAN training. LSGANs with Gradient Penalty (LSGANs-GP) have been successfully trained on difficult architectures, including a 101-layer ResNet, using complex datasets such as ImageNet [40].
Information Maximizing GANs (InfoGAN) were introduced by Xi Chen et al. [54] as an information-theoretic extension to regular GANs that can learn disentangled representations in an unsupervised manner.
InfoGAN provides a disentangled representation that captures the salient attributes of a data instance, which is helpful for tasks like face and object recognition. Maximizing mutual information is a simple and effective modification to traditional GANs. The core concept of InfoGAN is that the single unstructured noise vector is decomposed into two parts: a source of incompressible noise (z) and a latent code (c). To discover highly semantic and meaningful representations, the mutual information between the generated samples and the latent code is maximised using a variational lower bound. Although there have been previous works on learning disentangled representations, such as bilinear models [55], multiview perceptron [56] and disBM [57], they all rely on supervised grouping of data. InfoGAN does not require supervision of any kind and can disentangle both discrete and continuous latent factors, unlike hossRBM [58], which handles only discrete latent variables and has an exponentially increasing computational cost.
InfoGAN can successfully disentangle writing styles from the shapes of digits on the MNIST dataset. The latent codes (c) are modelled with one categorical code (c1) that switches between digits and models discontinuous variation in the data. The continuous codes (c2 and c3) model the rotation of digits and control their width, respectively. Details like stroke style and thickness are adjusted in such a way that the resulting images look natural and a meaningful generalisation is obtained.
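The split of the generator input into noise and latent codes can be sketched as follows. The layout is an assumption made for illustration: a Gaussian noise part z of length 62, a 10-way one-hot categorical code c1, and two continuous codes c2 and c3 drawn uniformly from [-1, 1]. The function name and default sizes are hypothetical.

```python
import random

def sample_latent(rng, noise_dim=62, num_categories=10, num_continuous=2):
    """Sample incompressible noise z plus structured latent codes c."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    category = rng.randrange(num_categories)
    c1 = [1.0 if i == category else 0.0 for i in range(num_categories)]
    c_cont = [rng.uniform(-1.0, 1.0) for _ in range(num_continuous)]
    return z, c1, c_cont

z, c1, c_cont = sample_latent(random.Random(0))
generator_input = z + c1 + c_cont   # the generator sees one concatenated vector
```

Holding z and c1 fixed while sweeping one entry of `c_cont` is how a single factor, such as rotation or width, would be varied in the generated digits.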
Semantic variations such as pose versus lighting in 3D images, the absence or presence of glasses, hairstyles and emotions can also be successfully disentangled with the help of InfoGAN. The model demonstrates a high level of visual understanding without any supervision. Hence, InfoGAN can learn complex representations on complex datasets with superior image quality compared to previous unsupervised approaches. Moreover, the use of the latent code adds only negligible computational cost on top of a regular GAN and introduces no training difficulty.
The idea of using mutual information can be further applied to other methods like VAE [59] and to semi-supervised learning with better codes [60], and InfoGAN can be used as a tool for high-dimensional data discovery.
As captivating as training a generative adversarial network may sound, it has its share of practical shortcomings, the most significant of which are as follows:
A frequently encountered problem while training a GAN is the enormous computational cost it requires. A GAN might run for hours on a single GPU, while on a CPU it may continue to run for more than a day. Various researchers have proposed strategies to reduce this problem, one being the idea of building an architecture with efficient memory utilization. Shuanglong Liu et al. centered their architecture around a parameterized deconvolution, an FPGA-friendly method [61-63]. Based on a similar approach, A. Yazdanbakhsh et al. devised FlexiGan [64], an end-to-end solution that produces a highly optimized FPGA-based accelerator from a high-level GAN specification.
The loss function is calculated from the output of the discriminator, so the discriminator’s parameters are updated quickly. As a result, the discriminator converges faster, which impairs the generator because its parameters stop being updated. Consequently, the generator does not converge, and generative adversarial networks suffer from partial or total mode collapse, a state in which the generator produces almost indistinguishable outputs for different latent encodings. To address this, Srivastava et al. suggested VEEGAN [65], which contains a reconstructor network that maps the data to noise by reversing the action of the generator. Elsewhere, Kanglin Liu et al. proposed a spectral regularization technique (SR-GAN) [66], which balances the spectral distributions of the weight matrices, keeping them from collapsing and consequently preventing mode collapse in GANs.
Another difficulty experienced while developing a generative adversarial network is the inherent instability caused by training the generator and the discriminator concurrently. Sometimes the parameters oscillate or destabilize and never seem to converge. Mescheder et al. [67] showed that GAN training is locally convergent when the data and generator distributions are absolutely continuous, whereas unregularized training is not always convergent in the more realistic case of distributions that are not absolutely continuous. Furthermore, analyzing some of the regularization techniques that have been put forward, they show that GAN training with instance noise or zero-centered gradient penalties leads to convergence. Another technique that can fix the instability problems of GANs is Spectral Normalization, a particular kind of normalization applied to the convolutional kernels, which can greatly improve the training dynamics, as shown by Zhang et al. through their model SAGAN [68].
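Spectral Normalization divides a weight matrix by its largest singular value. A minimal sketch of that idea is given below, assuming the weights are already arranged as a 2-D matrix (a convolutional kernel would first be reshaped) and using plain power iteration with a fixed iteration count. The function names are illustrative, and the code does not guard against an all-zero matrix.

```python
import math

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def transpose(w):
    return [list(col) for col in zip(*w)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(w, iterations=50):
    """Estimate the largest singular value of w by power iteration."""
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))
    for _ in range(iterations):
        u = normalize(matvec(w, v))     # left singular vector estimate
        v = normalize(matvec(wt, u))    # right singular vector estimate
    return math.sqrt(sum(x * x for x in matvec(w, v)))

def spectrally_normalize(w, iterations=50):
    """Rescale w so that its largest singular value is approximately 1."""
    sigma = spectral_norm(w, iterations)
    return [[wij / sigma for wij in row] for row in w]

weights = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(weights)               # close to 3.0
normalized = spectrally_normalize(weights)   # close to [[1.0, 0.0], [0.0, 1/3]]
```

Bounding the largest singular value limits how strongly any layer can stretch its input, which is the mechanism by which this normalization steadies the training dynamics.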
An important point to consider is the influence that a dataset may have on the GAN being trained on it. Ilya Kamenshchikov and Matthias Krauledat [69] demonstrate that datasets also play a key role in the successful training of a GAN by examining the influence of datasets like Fashion MNIST [70], CIFAR-10 [71] and ImageNet [40]. In addition, building a GAN model requires a large training dataset; otherwise its progress in the semantic domain is hampered.
Adding further to the list is the problem of the vanishing gradient, which crops up during training if the discriminator is highly accurate and therefore does not provide enough information for the generator to make progress. To solve this problem, a new loss function, the Wasserstein loss, was proposed in the W-GAN model [41] by Arjovsky et al. This loss relies on a modification of the GAN scheme in which the discriminator does not actually classify instances. Instead, it outputs a number for each sample. This number need not lie between 0 and 1, so 0.5 cannot be used as a threshold to decide whether a sample is real or fake. Training simply tries to make the discriminator’s output larger for real instances than for fake instances. Working toward a similar goal, Salimans et al. in 2016 [72] proposed a set of heuristics to address vanishing gradients and mode collapse, among other problems, by introducing the concept of feature matching. Other efforts worth highlighting include the improved WGAN [42] by Gulrajani et al., which addresses the problems arising from weight clipping; Fisher GAN [73] by Mroueh and Sercu, which introduces a data-dependent constraint to maintain the capacity of the critic and ensure training stability; and Improving Training of WGANs [74] by Wei et al.
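The Wasserstein loss described above reduces to differences of mean scores. The sketch below is a simplified illustration: the critic's scores are given as plain lists, the sign convention treats both losses as quantities to minimize, and the clipping limit of 0.01 is an assumed default. The function names are hypothetical.

```python
def critic_loss(real_scores, fake_scores):
    """Loss minimized by the critic: mean fake score minus mean real score."""
    return sum(fake_scores) / len(fake_scores) - sum(real_scores) / len(real_scores)

def generator_loss(fake_scores):
    """Loss minimized by the generator: negative mean critic score on fakes."""
    return -sum(fake_scores) / len(fake_scores)

def clip_weights(weights, limit=0.01):
    """Weight clipping, used to keep the critic's outputs from growing without bound."""
    return [max(-limit, min(limit, w)) for w in weights]

# Scores are unbounded real numbers, not probabilities.
loss_c = critic_loss(real_scores=[2.0, 4.0], fake_scores=[-1.0, 1.0])   # -3.0
loss_g = generator_loss([-1.0, 1.0])                                     # 0.0
clipped = clip_weights([0.5, -0.5, 0.005])       # [0.01, -0.01, 0.005]
```

Minimizing `critic_loss` makes the scores for real instances larger than those for fake instances, and there is no 0.5 threshold anywhere in the computation.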
Known for revolutionizing the realm of machine learning ever since their introduction, GANs have found their way into a plethora of applications ranging from image synthesis to synthetic drug discovery. This section highlights some of the most important areas of application of GANs, each discussed in detail below:
Perhaps some of the most impressive achievements of GANs have surfaced in the field of image synthesis and manipulation. A major advance in image synthesis came in late 2015 with the introduction of DCGANs by Radford et al. [33], capable of generating random images from scratch. In 2017, Liqian Ma et al. [75] proposed a GAN-based architecture that, when supplied with an input image, could generate variants showing the subject of the input image in different postures. Other notable applications of GANs in the domain of image synthesis and manipulation include Recycle GAN [76], a data-driven approach used for transferring the content of one video or photo to another; ObjGAN [77], a novel GAN architecture developed by a team of scientists at Microsoft that understands sketch layouts and captions and refines details based on the wording; and StyleGAN [78], a model developed by Nvidia that is capable of synthesizing high-resolution images of fictional people by learning attributes like facial pose, freckles, and hair.
Since a video can be described as a series of images in motion, the involvement of various state-of-the-art GAN approaches in the domain of video synthesis is no surprise. With DeepMind’s proposal of DVDGAN [79], generating realistic-looking videos from a custom-tailored dataset is a matter of just a few lines of code and some patience. Another noteworthy contribution of GANs in this sector is DeepRay, created by Cambridge Consultants. It helps to generate sharper, less distorted images from pictures that have been damaged or have obscured elements. It can also be used to remove noise from videos.
GANs can generate more than images and video footage. They are capable of producing novel works of art, provided they are supplied with the right dataset. Art-GAN [80], a conditional GAN-based network, generates images with abstract characteristics, such as images in a certain art style, after being trained on the Wikiart dataset. GauGAN [81], a deep learning model developed by NVIDIA Research as part of its investigation of AI-based art, can turn rough doodles into photorealistic masterpieces with breathtaking ease.
After giving astonishing results when applied to images and videos, GANs are being applied to music generation too. MidiNet [82], a CNN-inspired GAN model proposed by Yang et al., is one such attempt, aiming to produce realistic melodies from random noise as input. Another effort worth mentioning is the Conditional-LSTM GAN [83], presented by researchers based at the National Institute of Informatics in Tokyo, which learns the latent relationship between lyrics and their corresponding melodies and then applies it to generate lyrics-conditioned melodies.
Owing to their adversarial training and their ability to synthesize images with an unmatched degree of realism, GANs are a boon for the medical industry. They are frequently used in image analysis, anomaly detection and even the discovery of new drugs. More recently, researchers from Imperial College London, the University of Augsburg, and the Technical University of Munich developed a model dubbed Snore-GAN [84], which is used to synthesize data to fill in gaps in real data. Meanwhile, Schlegl et al. suggested an unsupervised approach to detecting anomalies relevant for disease progression and treatment monitoring through their model AnoGAN [85]. On the drug synthesis side, LatentGAN [86], an effort by Prykhodko et al., integrates a generative adversarial neural network with an autoencoder for de novo molecular design. It can be used in many other applications [89, 90].
With GANs being applied to various domains, the field of security has a lot to gain from them as well. PassGAN [87], a recently developed machine learning approach to password cracking, generates password guesses by training a GAN on a list of leaked passwords. Given their potential to synthesize plausible instances of data, GANs are being used to make the existing deep learning networks used in cybersecurity more robust by manufacturing more fake data and training those networks on it. In a similar vein, Haichao et al. have come up with SSGAN [88], a new strategy that generates more suitable and secure covers for steganography with an adversarial learning scheme.
This paper provides a comprehensive review of generative adversarial networks. We have discussed the basic anatomy of GANs and the various kinds of GANs that are widely used today. The paper also discusses the various application areas of GANs. Despite their extensive potential, GANs have several shortcomings, which have also been discussed. This review covers the fundamentals of GANs and will help readers gain a good understanding of this well-known deep learning network, which has gained immense popularity recently.