© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Ye, Modern Deep Learning Design and Application Development, https://doi.org/10.1007/978-1-4842-7413-2_3

3. The Versatility of Autoencoders

Andre Ye
Redmond, WA, USA

One word is worth a thousand pictures, if it’s the right word.

—Edward Abbey, American author and essayist1

The concept of encoding and decoding entities – ideas, images, physical material, information, and so on – is a particularly profound and important one, because it is so deeply embedded into how we experience and understand the environment around us. Encoding and decoding information is a key process in communication and learning. Every time you communicate with someone, observe the weather, read from a book – like this one – or in some way interact with information, you are engaging in a process of encoding and decoding, observing and interpreting. We can make use of this idea with the deep learning concept of an autoencoder.

This chapter will begin by exploring the intuition and theory behind autoencoders and how they can be used. It will also cover Keras neural network manipulation methods and neural network design principles that are integral to implementing not only autoencoders but also other, more complex neural network designs in later chapters. The other half of the chapter will be dedicated to exploring the versatility of autoencoders through five applications. While the autoencoder is a simple concept, its incorporation into efforts tackling deep learning problems can make the difference between a mediocre and a successful solution.

Autoencoder Intuition and Theory

Although we will clarify what encoding and decoding refer to more specifically with relevance to the intuition behind autoencoders, for now they can be thought of as abstract operations tied inextricably to one another. To encode an object, for our purposes, is to represent it in a more compact form at the cost of some quantifiable loss of information. For instance, we could encode a book by summarizing it, or encode an experience with key sensory aspects, like notable senses of touch or hearing. To decode is to take in the encoded representation of the object and to reconstruct the object. It is the objective of encoding and decoding to “work together” such that the reconstruction is as accurate as possible. An effective summary (encoding) of a book, for instance, is one from which a reader could more or less reconstruct the main ideas of the book with high accuracy.

Suppose someone shows you an image of a dog. They allow you to look at that image for a few seconds, take it away, and ask you to reconstruct the image by drawing it. Within those few seconds, ideally you would have extracted the key features of that image so that you could draw it as accurately as possible through some efficient method. Perhaps you remember that the dog was facing right, its head upright, that it was tall, and that it looked like it was standing still.

What is interesting is that these are all high-level abstract concepts – to deduce that an object is facing a certain direction, for instance, you must know the relationship between the directions of each of the object’s parts (for instance, if the head is facing one way, the tail should be facing the other) and be able to represent them in a spatial environment. Alternatively, knowing that a dog is tall requires knowledge of what a “standard” or “short” dog looks like. Identifying the absence of motion requires knowledge of what motion for this particular class of objects looks like.

The key idea here is that for effective compression and reconstruction of an object in the context of complex objects and relationships, the object must be compressed with respect to other objects. Compressing an object in isolation is more difficult, and yields worse reconstructions, than compressing it with access to additional related knowledge. This is why recovering an image of a dog is easier than recovering an image of an unfamiliar object. With the image of the dog, you first identify that the object is a dog, recognize that knowledge relating to the “dog” entity is relevant, and then observe how the image deviates from the template of a dog. When decoding, if you already know that the image was of a dog (a technical detail: this information can be passed through the encoding), you can do most of the heavy lifting by first initializing the knowledge related to a dog – its standard anatomy, its behavior, and its character. Then, you can add deviations and specifics as necessary.

Because efficient compression of complex entities requires the construction of highly effective representations of knowledge and their quick retrieval, this encoding-decoding pair of operations is extremely useful in deep learning. This pair of operations is referred to generally as the autoencoder structure, although we will explore more concretely what it entails. Because the processes of encoding and decoding are so fundamental to learning and conducive to the development of effective representations of knowledge, autoencoders have found many applications in deep learning.

Likewise, it should be said that this sort of dependence on context often makes autoencoders a bad approach for tasks like image compression that require a more universal method of information extraction. As tools of data compression, autoencoders are limited to the context of the data they are trained on, and even within that context they perform at best equivalently to existing, more universal compression algorithms. For instance, you could only train a standard autoencoder to effectively encode a few pineapple images if you fed it thousands of other images of pineapples; that autoencoder would fail to reconstruct an image of an X-ray, for instance. This is a good example of critically evaluating deep learning’s viability as an approach to certain problems and a much needed reminder that deep learning is not a universal solution and needs to work with other components to produce a complete product.

Consider the following reconstructions of images in the MNIST digit dataset by an autoencoder that was forced to encode each image into only four numbers and then to decode the image from those four numbers (Figure 3-1). We can observe that the network relies upon “templates” of digits. For instance, the original digit “5” has a very distinct shape, with sharp edges and joints, but the network’s reconstruction looks like a different image – although it expresses the same concept of “5,” it is much more rounded and less stylistic. Similar patterns can be observed for the digits “0,” “4,” “1,” and “9.” Here, we see that the autoencoder is not merely finding an efficient compression algorithm for the images but is performing clustering and a bit of under-the-hood “soft” classification to help reconstruct the general concepts of the input image.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig1_HTML.jpg
Figure 3-1

Top row: inputs/original images. Bottom row: reconstructed inputs

Despite the autoencoder being an unsupervised task – there are no labels – it is able to perform highly effective clustering and learn key features, attributes, and structures of the data.

Formally, the autoencoder consists of three general components: the encoder, the bottleneck, and the decoder (Figure 3-2). These components are not exclusive from one another; they overlap. The encoder takes in the input and outputs the encoded representation, and the decoder takes in the encoded representation and decodes it into the output. The “bottleneck” refers to the nodes shared between the encoder and the decoder: it is the output of the encoder and the input to the decoder. It can be thought of as holding the encoded representations of the input. It is also known as the latent space, “latent” coming from the Latin word for “to be hidden” – it is a space containing “hidden” representations for compressed/encoded data decipherable only by the decoder.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig2_HTML.jpg
Figure 3-2

Components of an autoencoder diagram

The input and output of autoencoders have the same size, because autoencoders are trained on the task of reconstruction – given an input item, the objective is to reconstruct the output with as high accuracy as possible. This is a crucial feature of autoencoders that will be core to how we go about designing these structures.

Training a neural network to “do nothing” – to output whatever the input is – is a trivial task for many architectures. What makes the autoencoder so meaningful and useful for a wide variety of applications, though, is the bottleneck. The decrease in the number of nodes in the bottleneck layer compared to the number of nodes needed to represent the input data in its original form (or substitute nodes with any other unit of data storage) makes the task of reconstruction difficult and forces the autoencoder to develop efficient encoding and decoding methods. Figures 3-3 and 3-4 show the reconstructed representations for two autoencoders trained on the MNIST digit dataset, one with a larger bottleneck than the other. While the first autoencoder clearly had too small of a bottleneck to develop reasonably accurate reconstructions, the other autoencoder is able to develop such a level of detail – the pointiness of the five, the slight curl of the nine’s stem – that it may warrant concern as to whether the task is too trivial. This depends on the application you are using the autoencoder for, though.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig3_HTML.jpg
Figure 3-3

Original images and their reconstructions by an autoencoder with a small bottleneck size

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig4_HTML.jpg
Figure 3-4

Original images and their reconstructions by an autoencoder with a large bottleneck size

Most of the “meat” within autoencoder usage resides in their applications. We’ll begin by discussing how autoencoders are designed and implemented for tabular data and image data. Then, we’ll explore the intuition and implementation for five applications of autoencoders.

The Design of Autoencoder Implementation

There are many considerations to be made in the implementation of autoencoders, depending on the form of their application. In this section, we will broadly discuss, by example, considerations for designing autoencoders for tabular (vector-based) data and for image data.

Autoencoders for Tabular Data

Tabular data , for our purposes, refers to data that can be put in the form (n, s), where n indicates the number of data points and s indicates the number of elements in each data point. The entire dataset can be put into a two-dimensional table, where each row indicates a specific data point and each column indicates a feature of that data point.

Building autoencoders for tabular data is an appropriate place to begin, because the shape of data at any one layer can be easily manipulated simply by changing the number of nodes at that layer. The shape of other forms of data, like images, is more difficult to handle. Being able to manipulate shape is crucial to autoencoder design because the concept of encoding and decoding entities comes with relatively strict instructions about what the shape of the data should look like before encoding, after encoding, and before decoding.

The difference between a standard autoencoder (often abbreviated simply as “AE”) and a “deep autoencoder” should also be noted here. Although the exact terminology has not yet stabilized, generally an autoencoder refers to a shallow autoencoder structure, whereas a deep autoencoder contains more depth. Often, this is determined by the complexity of the input data, since more complex forms of data usually necessitate greater depth. Because data that can be arranged in a tabular format is generally less complex than image or text data (or other sorts of highly specialized and complex data that neural networks have become good at modeling), it’s generally safe to refer to autoencoder structures used on high-resolution sets of images or text as deep autoencoders and autoencoder structures used on tabular data as simply autoencoders.

In this example, we will be building an autoencoder with 784-dimensional data. That is, each data point consists of 784 features. Let’s begin by importing necessary models and modules. For convenience, we’ll use import keras.layers as L instead of importing each layer individually by name (Listing 3-1). Additionally, because there is no need to use the Functional API in this case – we are not building a nonlinear topology and do not need such strong access to any one layer as to warrant assigning it its own variable – we will opt for the simpler Sequential model structure.
import keras.layers as L
from keras.models import Sequential
Listing 3-1

Importing important layers and models

Once we initialize the Sequential model structure, we can add the Input layer, which takes in the 784-dimensional input data.

Now, we will build the encoding component of the autoencoder (Listing 3-2). The encoder should successively decrease the amount of space the data occupies so as to guide the process of encoding. In this case, the first decrease in size is from 784 nodes to 256 nodes; the second is from 256 nodes to 64 nodes; the third is from 64 nodes to 32 nodes. By decreasing the number of nodes in each Dense layer, we are decreasing how much space we give the autoencoder to represent the input data.
model = Sequential()
model.add(L.Input((784,)))
model.add(L.Dense(256, activation='relu'))
model.add(L.Dense(64, activation='relu'))
model.add(L.Dense(32, activation='relu'))
Listing 3-2

Building a simple encoder using Dense layers for tabular data

The last layer of the encoder contains 32 nodes, indicating that the bottleneck will be 32 nodes wide. This means that, at its most extreme, the autoencoder must find a representation of 784 features with 32 values.

It should also be noted that it is convention for the number of nodes in each layer to be a power of two. You will see this pattern both throughout examples in this book and in the architectures of neural networks designed by researchers (some of these are presented in case studies). It’s thought by many to be convenient for memory and a good way to scale the number of nodes meaningfully (when the number of nodes is high, meaningful change is proportional rather than additive). This convention is by no means required, though, if your design calls for node quantities that cannot accommodate it.

We can then add the decoder layers, which should successively expand the space the neural network has to decode data from its encoded representation (Listing 3-3). Although it’s not strictly necessary for the success of an autoencoder, decoders are often built symmetrically to the encoder out of convenience. Here, the decoder performs the same steps as the encoder, but “in reverse.” The first expansion is from the encoded representation of 32 nodes into 64 nodes; the second is from 64 nodes to 256 nodes; the last is from 256 nodes to 784 nodes, the shape of the original input data.
model.add(L.Dense(64, activation='relu'))
model.add(L.Dense(256, activation='relu'))
model.add(L.Dense(784, activation='sigmoid'))
Listing 3-3

Building a simple decoder using Dense layers for tabular data

Note that, in this example, the activation for the last layer of the model is the sigmoid activation function. We traditionally put the sigmoid activation (or other related curved and bounded functions and adaptations) on the last layer to bound the neural network’s output, as in classification problems. This may or may not be suitable for your particular task.

If your input data consists of entirely binary data or can be put into that form appropriately, sigmoid may be an appropriate activation. It is important to make sure that your input data is scaled properly; for instance, if a feature is binary in that it contains only the values 10 and 20, you would need to adjust the feature such that it consists only of the values 0 and 1. Alternatively, if your feature is not strictly binary but tends to cluster around two bounds, sigmoid may also be an appropriate choice.

On the other hand, if a feature is spread relatively uniformly across a wide range, it may not be appropriate to use the sigmoid activation. The sigmoid function is sloped such that it is more “tricky” to output an intermediate value near 0.5 than a value near 0 or 1; if this does not adequately represent the distribution of a feature, there are other options available. An activation like ReLU (Rectified Linear Unit, defined as y = max(x, 0)) may be more appropriate. If your feature ranges across negative values as well, use the linear activation (simply y = x). Note that depending on the character of the features in tabular data, you will need to choose different losses and metrics – the primary consideration being regression or classification.
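As a brief, hedged sketch of this consideration (assuming the model built in Listings 3-2 and 3-3 and features already scaled appropriately), the reconstruction loss can be matched to the feature distribution when compiling:
# For features scaled to [0, 1] with a sigmoid output layer, binary cross-entropy is a common choice
model.compile(optimizer='adam', loss='binary_crossentropy')
# For continuous, unbounded features with a linear output layer, mean squared error is usually a better fit
# model.compile(optimizer='adam', loss='mse')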

A challenge of autoencoders for tabular data is that often tabular data is not held together by a unifying factor of context in the same way that image or text is. One feature may be derived from a completely different context than another feature, and thus you may simultaneously have a continuous feature and a categorical feature that a tabular autoencoder must both accommodate. Here, feature engineering (e.g., encoding categorical features to continuous values) is needed to “unify” the problem types of the features.

Plotting the model with plot_model() shows us a successive decrease in size during the encoding step and a successive increase in size during the decoding step, as expected (Figure 3-5).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig5_HTML.jpg
Figure 3-5

Sample autoencoder using only Dense layers for tabular data

When building autoencoders, however, generally a compartmentalized design is preferred over sequentially stacking layers. A compartmentalized design refers to implementing a model as a relationship between several sub-models. By using a compartmentalized design, we can easily access and manipulate the sub-models. Moreover, it is more clear how the model functions and what its components are. In an autoencoder, the two sub-models are the encoder and the decoder.

To build with a compartmentalized design, first define the architectures of each of the sub-models. Then, use the Functional API to call each sub-model on the output of the previous object, and aggregate the sub-models into an overarching model using keras.models.Model.

Let’s build the same architecture as we did before, but with a compartmentalized design (Listing 3-4). Note that you can pass the name='name' argument into a model to attach a name to it for later reference.
encoder = Sequential(name='encoder')
encoder.add(L.Input((784,)))
encoder.add(L.Dense(256, activation='relu'))
encoder.add(L.Dense(64, activation='relu'))
encoder.add(L.Dense(32, activation='relu'))
decoder = Sequential(name='decoder')
decoder.add(L.Input((32,)))
decoder.add(L.Dense(64, activation='relu'))
decoder.add(L.Dense(256, activation='relu'))
decoder.add(L.Dense(784, activation='sigmoid'))
Listing 3-4

Building a simple encoder and decoder using Dense layers for tabular data with compartmentalized design

Once we’ve defined the sub-models (being encoder and decoder), we can treat them as layers and use the Functional API to aggregate them into an overarching model (Listing 3-5). Recall that we can treat models just like layers in the Functional API by writing them as functions of the previous object.
ae_input = L.Input((784,), name='input')
ae_encoder = encoder(ae_input)
ae_decoder = decoder(ae_encoder)
from keras.models import Model
ae = Model(inputs = ae_input,
           outputs = ae_decoder)
Listing 3-5

Compiling sub-models into an overarching model – the autoencoder – with compartmentalized design and the Functional API

When plotting the architecture of the overarching model (Figure 3-6), Keras now displays only the direct layers and sub-models that compose it. This can be helpful for understanding the general architecture and data flow of an autoencoder with many layers without needing to see the specifics of what each individual layer does.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig6_HTML.jpg
Figure 3-6

Visualization of an autoencoder architecture built with compartmentalized design

The primary practical benefit of using compartmentalized design for autoencoders is that after compiling and fitting, we can call encoder.predict(input_data) to obtain the learned encoding. If you do not use compartmentalized design, you can also use layer-retrieving methods discussed in Chapter 2 (e.g., get_layer()) to create a model object consisting of the encoding layers, but it’s more work than is necessary and is less portable. Accessing the encoder’s encoding of data is necessary for many of the autoencoder applications we will discuss in the second half of this chapter.
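As a minimal, hedged sketch of this workflow – x_train here is a hypothetical array of shape (n, 784) scaled to [0, 1] – training the compartmentalized autoencoder and retrieving the encodings might look like the following:
# Compile and fit the overarching autoencoder on the reconstruction task
# (the inputs and the targets are the same data)
ae.compile(optimizer='adam', loss='binary_crossentropy')
ae.fit(x_train, x_train, epochs=10, batch_size=32)
# Because of the compartmentalized design, the trained encoder can be called
# directly to obtain the 32-dimensional learned encodings
encodings = encoder.predict(x_train)
print(encodings.shape)  # (n, 32)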

As mentioned in Chapter 2, you can use this implementation design method with all sorts of other structures beyond autoencoders to easily freeze entire groups of layers, or for the other benefits of compartmentalized design mentioned here, like better organization or easier referencing of models. For instance, this sort of compartmentalized design can be used to separate a model using transfer learning into two sub-models: the pretrained convolutional component and the custom fully connected component. Freezing the pretrained convolutional component can be easily done by setting submodel.trainable = False.
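As a hedged illustration of this idea (VGG16 and the layer sizes here are assumptions chosen only for demonstration, not a prescribed architecture), the transfer learning setup might look like the following:
import keras.layers as L
from keras.models import Model, Sequential
from keras.applications import VGG16
# Pretrained convolutional sub-model, frozen so its weights are not updated
conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))
conv_base.trainable = False
# Custom fully connected sub-model
head = Sequential(name='head')
head.add(L.Input(conv_base.output_shape[1:]))
head.add(L.Flatten())
head.add(L.Dense(64, activation='relu'))
head.add(L.Dense(1, activation='sigmoid'))
# Aggregate the two sub-models with the Functional API
inp = L.Input((64, 64, 3))
features = conv_base(inp)
out = head(features)
transfer_model = Model(inputs=inp, outputs=out)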

Autoencoders for Image Data

Building autoencoders for image data follows the same logic as building one for tabular data: the encoder should condense the input into an encoded representation by using “reducing” operations, and the decoder should expand the encoded representation into the output by using “enlarging” operations. However, we need to make additional considerations to adapt to an increased complexity of the data shape.

The “enlarging” operation needs to be some sort of an “inverse” of the “reducing” operation. This was not much of a concern with Dense layers, because both an enlarging and a reducing operation could be performed simply by increasing or decreasing the number of nodes in a following layer. However, because common image-based layers like the convolutional layer and the pooling layer can only preserve or reduce the spatial dimensions, we need to explicitly note that the decoding component is not only the encoding component “in reverse” (as was described in building autoencoders for tabular data) but is inverting – step by step – the encoding operations. This poses complications to building the encoder and decoder.

Although there are developments on using deep autoencoders for language and advanced tabular data, autoencoders have primarily been used for image data. Because of this, an extensive knowledge of autoencoders is necessary to successfully deal with most image-related deep learning tasks.

Image Data Shape Structure and Transformations

Because shape is so important to convolutional autoencoder design, first, we must briefly discuss image shape and methods of transforming it.

An image has the shape (a, b, c). Using the commonly used channels-last notation, a and b represent the spatial dimensions of the image and c represents the number of channels or the depth (Figure 3-7). For instance, a standard color image has three channels, corresponding to the red, green, and blue values in RGB images. Image-based layers can generally manipulate an image by altering all three elements of its shape.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig7_HTML.jpg
Figure 3-7

Illustration of the three dimensions of an image

Table 3-1 enumerates four key layers used in convolutional neural networks. Note that usually the height and the width of the image are the same (the image is square), which makes exploring transformations simpler. This table assumes a square shape, although you can apply similar logic to rectangular inputs. Moreover, the table assumes a stride of 1 is used and that the inputs are spatially two-dimensional, although again the logic applies to other forms.
Table 3-1

Important layers in a convolutional neural network and their transformations to the image shape

Convolution

Parameters: kernel shape = (x, y); number of filters = n; padding = “same” or “valid”

Output given input shape (a, a, c): if padding='valid': (a-(x-1), a-(y-1), n); if padding='same': (a, a, n)

Description: The two-dimensional convolution slides a kernel of shape (x, y) across the image. Hence, it reduces the image’s spatial dimensions by x-1 and y-1 pixels. Generally x=y (thus the kernel is square-shaped), but some architectures exploit a rectangular kernel shape, which can be successful in certain domains. See the Chapter 6 case study on the InceptionV3 architecture to explore convolution factorization and nonsquare kernel shapes. However, if padding is set to “same,” the image is padded (extra “blank” pixels are added around its borders) such that when the convolution is performed, the resulting image has the same spatial shape as the input. We’ll see why this is helpful later.

Pooling

Parameters: pooling size = (x, x); padding = “same” or “valid”

Output given input shape (a, a, c): if a is divisible by x (either padding mode): (a/x, a/x, c); if a is not divisible by x and padding='valid': (floor(a/x), floor(a/x), c); if a is not divisible by x and padding='same': (ceil(a/x), ceil(a/x), c)

Description: The two-dimensional pooling operation offers a faster way to reduce the size of an image by placing non-overlapping windows of size (x, x) across the image (unlike convolutions, whose windows do overlap) to “summarize” the important findings. Pooling is usually used either in the form of average pooling (all elements in the pooling window are averaged) or max pooling (the maximum element in the pooling window is passed on). Pooling divides each spatial dimension of the image by the pooling window’s respective dimension. However, because image sizes may not be exact multiples of the window’s dimensions, you can use padding modes to help determine the exact shape of the output.

Transpose convolution

Parameters: kernel shape = (x, y); number of filters = n; padding = “same” or “valid”

Output given input shape (a, a, c): if padding='valid': (a+(x-1), a+(y-1), n); if padding='same': (a, a, n)

Description: The transpose convolution can be thought of as the “inverse” of the convolution. If you passed an input through a convolutional layer and then a transpose convolutional layer (with the same kernel shape and padding), you’d end up with the original shape. When you’re building the decoder, use transpose convolutional layers in lieu of the convolutional layers in the encoder to increase the size of the image shape. Like the convolutional layer, you can also choose padding modes.

Upsampling

Parameters: upsampling factor = (x, y)

Output given input shape (a, a, c): (a*x, a*y, c)

Description: The upsampling layer simply “magnifies” an image by a certain factor without changing any of the image’s actual values. For instance, the array [[1, 2], [3, 4]] would be upsampled to [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]] with an upsampling factor of (2, 2). The upsampling layer can be thought of as the inverse of the pooling operation – while pooling divides the spatial dimensions of the image by a certain quantity (assuming no padding is being used), upsampling multiplies them by that quantity. You cannot use padding with upsampling. When you’re building the decoder, use upsampling layers in lieu of pooling layers in the encoder to increase the size of the image shape.

Note that only convolutional and transpose convolutional layers contain weights; pooling and upsampling are simple ways to aggregate extracted features without any particular learnable parameters. Additionally, note that the default padding method is “valid.”
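If you'd like to verify these transformations yourself, one hedged way (the specific sizes below are arbitrary) is to apply each layer to a symbolic input and inspect the resulting shapes:
import keras.layers as L
x = L.Input((64, 64, 3))
print(L.Conv2D(8, (3, 3), padding='valid')(x).shape)           # (None, 62, 62, 8)
print(L.Conv2D(8, (3, 3), padding='same')(x).shape)            # (None, 64, 64, 8)
print(L.MaxPooling2D((2, 2))(x).shape)                         # (None, 32, 32, 3)
print(L.Conv2DTranspose(8, (3, 3), padding='valid')(x).shape)  # (None, 66, 66, 8)
print(L.UpSampling2D((2, 2))(x).shape)                         # (None, 128, 128, 3)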

There are many approaches to building convolutional autoencoders. We’ll cover many approaches, beginning with a simple convolutional autoencoder without pooling to introduce the concept.

Convolutional Autoencoder Without Pooling

As noted before, autoencoders are generally built symmetrically, but this is even more true with image-based autoencoders. In this context, not building the autoencoder symmetrically requires a lot of arduous shape tracking and manipulation.

Let’s begin with building an encoder that takes in data with shape (64, 64, 3) and successively encodes it with convolution layers (Listing 3-6). In many convolutional autoencoder designs, the number of filters increases as the image size decreases. Make sure that the increase in the number of filters does not outweigh the decrease in image size in terms of the amount of storage the network has available, such that the storage capacity for data decreases throughout the encoder (and correspondingly increases throughout the decoder).
encoder = Sequential()
encoder.add(L.Input((64, 64, 3)))
encoder.add(L.Conv2D(8, (3, 3)))
encoder.add(L.Conv2D(8, (3, 3)))
encoder.add(L.Conv2D(16, (3, 3)))
encoder.add(L.Conv2D(16, (3, 3)))
encoder.add(L.Conv2D(32, (3, 3)))
encoder.add(L.Conv2D(32, (3, 3)))
Listing 3-6

Building an encoder for image data using convolutions without pooling

In this case, we are successively increasing the number of filters (from 3 channels initially to 8, 16, and 32) while keeping the filter size at (3,3).

By visualizing with plot_model (Figure 3-8), we see how the shape of the image changes in accordance with the convolutional layers. The output of the encoder is a compressed image of shape (52, 52, 32). However, you’ll notice that this model violates the guideline that the increase in the number of filters shouldn’t outweigh the decrease in image size – without pooling, the “encoded representation” is actually larger than the input (64·64·3 = 12,288 < 52·52·32 = 86,528)! This is because convolutions alone are simply not a good way to reduce the image size quickly. We need pooling to address this problem. For now, to keep things simple, we will ignore this issue.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig8_HTML.jpg
Figure 3-8

The architecture of an example convolutional encoder without pooling and only using convolutional layers

We can build our decoder to “mirror” the encoder by reversing the layer order and using the inverse layers to “undo” the encoding (Listing 3-7, Figure 3-9). We know that the decoder will take in an encoded representation of size (52, 52, 32).
decoder = Sequential()
decoder.add(L.Input((52, 52, 32)))
decoder.add(L.Conv2DTranspose(32, (3, 3)))
decoder.add(L.Conv2DTranspose(32, (3, 3)))
decoder.add(L.Conv2DTranspose(16, (3, 3)))
decoder.add(L.Conv2DTranspose(16, (3, 3)))
decoder.add(L.Conv2DTranspose(8, (3, 3)))
decoder.add(L.Conv2DTranspose(8, (3, 3)))
Listing 3-7

Building a decoder for image data using convolutions without pooling

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig9_HTML.jpg
Figure 3-9

The architecture of an example convolutional decoder without pooling and only using convolutional layers

There’s one problem – the output of the decoder has shape (64, 64, 8), whereas the input has shape (64, 64, 3). There are two ways of addressing this. You could change the last layer to L.Conv2DTranspose(3, (3, 3)) such that it has three channels. Alternatively, you could add another layer to the end of the decoder: L.Conv2DTranspose(3, (1, 1)). Because it has a filter size of (1,1), the image width and height are not changed, but the number of channels is collapsed from 8 into 3.
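As a small sketch of the second option (continuing from the decoder defined in Listing 3-7), the appended layer collapses the channels while leaving the spatial dimensions untouched:
# The (1, 1) kernel leaves the (64, 64) spatial dimensions unchanged
# and collapses the 8 channels into 3
decoder.add(L.Conv2DTranspose(3, (1, 1)))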

Because convolutional autoencoders (and networks in general) rely on repeated convolutions and other operations, it’s generally good practice to build them using a for loop and index lists of parameters as necessary. For instance, we could rewrite the encoder more efficiently as follows (Listing 3-8).
num_filters = [8, 16, 32]
encoder = Sequential()
encoder.add(L.Input((64, 64, 3)))
for i in range(3):
    encoder.add(L.Conv2D(num_filters[i], (3,3)))
    encoder.add(L.Conv2D(num_filters[i], (3,3)))
Listing 3-8

Building long, repeated architectures using for loops and lists of parameters accordingly referenced within the loop

In this case, since there is a clear pattern with the number of filters in each layer, you could even write it without indexing a parameter list (Listing 3-9).
encoder = Sequential()
encoder.add(L.Input((64, 64, 3)))
for i in range(3):
    encoder.add(L.Conv2D(2**(i+3), (3,3)))
    encoder.add(L.Conv2D(2**(i+3), (3,3)))
Listing 3-9

Building long, repeated architectures using for loops without lists of parameters accordingly referenced within the loop

A primary benefit of this sort of design is that you can easily extend the depth of the network simply by increasing the number of iterations the layer-adding code is looped through, saving you from needing to type a lot of code manually.

Convolutional Autoencoder Vector Bottleneck Design

Often, the bottleneck (the output of the encoder and input to the decoder) is not left as an image – it’s usually flattened into a vector and later reshaped back into an image. The primary benefit of this is that we are able to obtain vector representations of images that are independent from any spatial dimensions, which makes them “cleaner” and easier to work with. Moreover, they can be more easily used with applications like pretraining (more on this later).

To do this, we need to add a Flatten layer at the end of the encoder (Listing 3-10, Figure 3-10). Since flattening an image usually leads to a very large vector, it’s common to apply some Dense layers to further process and reduce its size. Another alternative is the GlobalAveragePooling2D or GlobalMaxPooling2D layer, which produces a much smaller output vector (it averages or takes the maximum of each filter’s feature map); this layer follows the same syntax as the flattening layer (a brief sketch of this alternative follows Listing 3-10).
encoder = Sequential()
encoder.add(L.Input((64, 64, 3)))
#convolutional component
encoder.add(L.Conv2D(8, (3,3)))
encoder.add(L.Conv2D(16, (3,3)))
#flattening and processing
encoder.add(L.Flatten())
encoder.add(L.Dense(256, activation='relu'))
encoder.add(L.Dense(32))
Listing 3-10

The architecture of an example convolutional encoder without pooling and only using convolutional layers with a vector-based bottleneck. For simplicity, the convolutional component has been reduced to two convolutional layers
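As mentioned above, a global pooling layer can stand in for Flatten. The following is a minimal sketch of that variant (the layer sizes are illustrative assumptions, not a recommendation):
encoder_gap = Sequential()
encoder_gap.add(L.Input((64, 64, 3)))
encoder_gap.add(L.Conv2D(8, (3, 3)))
encoder_gap.add(L.Conv2D(16, (3, 3)))
# Global average pooling yields one value per filter – a 16-dimensional vector here –
# instead of the 57,600-dimensional vector produced by Flatten
encoder_gap.add(L.GlobalAveragePooling2D())
encoder_gap.add(L.Dense(32))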

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig10_HTML.jpg
Figure 3-10

The architecture of an example convolutional encoder without pooling and only using convolutional layers, using a vector bottleneck design

Visualizing is especially helpful for understanding transformations to the shape. We see that before flattening, the encoder had encoded the image into an image of shape (60, 60, 16), which was flattened into a vector of dimension 57,600. The output of the encoder is a vector of dimension 32.

We need this information to construct the decoder (Listing 3-11, Figure 3-11). The decoder takes in the encoded representation, which is a vector of dimension 32. It then uses Dense layers to increase its size up to the same size as the data right after the encoder had flattened it, which is (57600,). From there, we can reshape the vector into an image of shape (60, 60, 16) such that inverse convolutional operations can be applied to it.
decoder = Sequential()
decoder.add(L.Input((32,)))
#processing and reshaping
decoder.add(L.Dense(256, activation='relu'))
decoder.add(L.Dense(57_600, activation='relu'))
decoder.add(L.Reshape((60, 60, 16)))
#applying transpose-convolutional layers
decoder.add(L.Conv2DTranspose(16, (3,3)))
decoder.add(L.Conv2DTranspose(8, (3,3)))
decoder.add(L.Conv2D(3, (1,1)))
Listing 3-11

The architecture of an example convolutional decoder without pooling and only using convolutional layers with a vector-based bottleneck

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig11_HTML.jpg
Figure 3-11

The architecture of an example convolutional encoder without pooling and only using convolutional layers, using a vector bottleneck design

Recall that to address the number of channels in the input image, in this case we put an extra convolutional layer with filter size (1,1) to maintain the image size but to collapse the number of channels.

Convolutional Autoencoder with Pooling and Padding

While we were technically successful in building a convolutional autoencoder in that the inputs and outputs were identical in shape, we failed to adhere to a fundamental principle of autoencoder design: the encoder should progressively decrease the size of the data. We need pooling in order to cut down on the image size quickly.

Convolutional neural networks are generally constructed in modules (Figure 3-12) of convolutional layer – convolutional layer – pooling layer (with the number of convolutional layers varying). These modules can be repeated over and over again. By following several convolutional layers of feature extraction with a pooling layer, these extracted features can be aggregated, and the key highlights can be passed onto the next convolutional layers for further feature extraction and processing. (You can find a more detailed discussion of module/cell-based architectural design in Chapter 6.)
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig12_HTML.jpg
Figure 3-12

Example repeated module/cell-based design in convolutional autoencoders

However, when using convolutional layers in conjunction with pooling, we run into the problem of image sizes not divisible by the pooling factor. For instance, consider an image of shape (64, 64, 3). After a convolutional layer with filter size (2, 2) is applied to it, the image has a new shape of (63, 63, 3). If we want to apply pooling with size (2, 2) to it, we need to use padding in order to determine if the output of pooling will be (31, 31, 3) or (32, 32, 3). This is hardly a concern with standard convolutional neural networks. However, in autoencoders, we need to consider not only operations in the encoder but also the corresponding inverse “undoing” operations in the decoder. The upsampling layer has no padding parameter. Thus, if we applied upsampling with size (2, 2) to an image with size (31, 31, 3), we would obtain an image of size (62, 62, 3); if we applied it to an image with size (32, 32, 3), we would obtain an image of size (64, 64, 3). In this case, there is no easy way in which we can obtain the original image size of (63, 63, 3).

You could attempt to work with specific padding and add padding layers manually, but it’s a lot of work and difficult to manipulate and organize systematically.

To address this problem, one approach to building convolutional autoencoders with pooling and convolutional layers is to use padding='same' on all convolutional layers. This means that convolutional layers have no effect on the shape of the image – images are padded on the sides before the convolution is performed such that the input and output images have identical shapes. The convolution still changes the content of the image, but the image size remains constant. Removing the shape effect of convolutional layers significantly simplifies the management of image shape. Beyond this simplification, padding adds buffer room around the borders so that features on the edge of the image – which might otherwise be passed over – can be processed by the center of the kernel.

Let’s build an encoder for an autoencoder with padding='same' for all convolutional layers (Listing 3-12).
encoder = Sequential()
encoder.add(L.Input((64, 64, 3)))
#convolutional component
for i in range(3):
    encoder.add(L.Conv2D(2**(i+3), (3,3),
                padding='same'))
    encoder.add(L.Conv2D(2**(i+3), (3,3),
                padding='same'))
    encoder.add(L.MaxPooling2D((2,2)))
#flattening and processing
encoder.add(L.Flatten())
encoder.add(L.Dense(256, activation='relu'))
encoder.add(L.Dense(32))
Listing 3-12

The architecture of an example convolutional encoder with pooling and padding, using a vector-based bottleneck

From visualizing (Figure 3-13), we can see that even though we prevented the convolutional layers from decreasing the size, with the addition of pooling, we were able to decrease the feature map resolution significantly. The flattened vector (before any processing) is 2048-dimensional – compare this to the 57600-dimensional post-flattening vector without pooling!
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig13_HTML.jpg
Figure 3-13

Complete convolutional encoder with pooling and padding

We can correspondingly construct the decoder (Listing 3-13, Figure 3-14). Like the convolutional layers in the encoder, the transpose convolutional layers in the decoder must use padding='same' to ensure symmetry.
decoder = Sequential()
decoder.add(L.Input((32,)))
#processing and reshaping
decoder.add(L.Dense(256, activation='relu'))
decoder.add(L.Dense(2048, activation='relu'))
decoder.add(L.Reshape((8, 8, 32)))
#applying transpose-convolutional layers
for i in range(3):
    decoder.add(L.UpSampling2D((2,2)))
    decoder.add(L.Conv2DTranspose(2**(3-i), (3,3),
                padding='same'))
    decoder.add(L.Conv2DTranspose(2**(3-i), (3,3),
                padding='same'))
#adding additional layer to collapse channels
decoder.add(L.Conv2D(3, (1,1)))
Listing 3-13

The architecture of an example convolutional decoder with pooling and padding, using a vector-based bottleneck

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig14_HTML.jpg
Figure 3-14

Complete convolutional decoder with pooling and padding

This method, of course, relies upon your input having a certain size. In this case, the input’s spatial dimensions must be divisible by 2 for each of the three pooling layers (i.e., divisible by 2³ = 8), since each pooling operation decreases the image’s spatial dimensions by a factor of 2. You can insert a reshaping layer right after the input or reshape your dataset to accommodate this. The primary advantage of this method is that it makes organizing the symmetry of shape transformation much simpler. If you know you want the shape of the encoded image right before flattening to be (16, 16, x), for instance, and that you want to have three pooling layers with size (2,2) and three pooling layers with size (3,3), you can calculate the corresponding input shape to be 16 · 2³ · 3³ pixels in width and height.
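A quick, purely illustrative check of that arithmetic:
# Three (2, 2) pooling layers and three (3, 3) pooling layers, with a target
# pre-flattening spatial size of 16, require the following input width/height
required_size = 16 * (2 ** 3) * (3 ** 3)
print(required_size)  # 3456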

Autoencoders for Other Data Forms

Using this logic, you can build autoencoders for data of all forms. For instance, you can use recurrent layers to encode and decode text and other sequence data, as long as each encoding operation can be undone by a corresponding decoding operation. Although much of autoencoder work has been based around images, recent work is exploring the many applications of autoencoders for non-image-based data types. See the third case study in this chapter for an example of autoencoder applications in non-image data.
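As a hedged illustration (the timesteps, feature count, and layer sizes below are hypothetical), a simple recurrent autoencoder for sequence data might look like the following, with RepeatVector and a sequence-returning LSTM playing the “decoding” roles:
import keras.layers as L
from keras.models import Sequential
timesteps, num_features = 20, 8  # hypothetical sequence length and feature count
seq_ae = Sequential()
seq_ae.add(L.Input((timesteps, num_features)))
# Encoder: collapse the sequence into a single 32-dimensional vector (the bottleneck)
seq_ae.add(L.LSTM(32))
# Decoder: repeat the bottleneck across time and expand back into a sequence
seq_ae.add(L.RepeatVector(timesteps))
seq_ae.add(L.LSTM(32, return_sequences=True))
seq_ae.add(L.TimeDistributed(L.Dense(num_features)))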

Autoencoder Applications

As we’ve seen, the concept of the autoencoder is relatively simple. Because of the need to keep the input and output the same size, however, we’ve seen that implementing autoencoder architectures for complex data forms can require a significant amount of forethought and pre-planning. The good news, though, is that implementing the autoencoder architecture is – in most autoencoders – the most time-intensive step. Once the autoencoder structure has been built (with the preferred compartmentalized design), you can easily adapt it for several applications to suit your purposes.

Using Autoencoders for Denoising

The purpose of the denoising autoencoder is largely implied by its name: the “de” prefix in this context means “away” or “opposite,” and thus to denoise is to move “opposite of,” or to remove, noise. Denoising is simple to implement and can be used for many purposes.

Intuition and Theory

In a standard autoencoder, the model is trained to reconstruct whatever input it is given. A denoising autoencoder is the same, but the model must reconstruct a denoised version of a noisy input (Figure 3-15). Whereas the encoder in a standard autoencoder only needs to develop a representation of the input image that can be decoded with low reconstruction error (which can be very difficult as is), the encoder in a denoising autoencoder must also develop a representation that is robust to any noise. Denoising autoencoders can be applied to denoise messy signals, images, text, tabular data, and other forms of data.

Note

Denoising autoencoders are often abbreviated as DAE. You may notice that this is in conflict with the abbreviation of “deep autoencoder.” Because the term “denoising autoencoders” is relatively more established and clearly defined than “deep autoencoder,” when you see the abbreviation “DAE” in most contexts, it should be safe to assume that it refers to a denoising autoencoder. For the sake of clarity, in this book, we will favor not using the abbreviation “DAE”; if it is used, it will refer to the denoising autoencoder rather than the deep autoencoder.

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig15_HTML.jpg
Figure 3-15

Conceptual diagram of the components of a denoising autoencoder

Of course, the concept of the denoising autoencoder assumes that your data is relatively free from noise in the first place. Because the denoising autoencoder relies upon the original data as the ground truth against which the noisy version is reconstructed, if the original data is heavily noisy itself, the autoencoder will learn largely arbitrary and noisy representation procedures. If the data you are using to train the denoising autoencoder has a high degree of noise, in many cases, there is little difference between using a denoising autoencoder and a standard autoencoder. In fact, using the former may be more damaging to your workflow because you may be operating under the assumption that the denoising autoencoder is learning meaningful representations of the data robust to noise when this is not true.

The noise to the input can be constructed through three methods (Figure 3-16): by inserting noise as a layer, into the dataset directly, or as a data generator. Which method to use depends on your problem type:
  • Apply noise as a layer: Insert a random noise-adding layer directly after the input of the autoencoder such that the noise is applied to the input before any encoding and decoding is performed on it. The primary advantage of this method is that the model learns to be robust to multiple noisy versions of the same original image, since each time an input is passed through the model, different noise is applied. However, the noise-adding layer needs to be removed before the denoising autoencoder is used in application; it only serves as an artificial instrument during training, and in application, we expect the input to already be “naturally” noisy. When using this method, you can create the dataset like you would create it for a standard autoencoder – the input and output are the same.

  • Apply noise to the dataset directly: Before training, apply noise to the dataset directly during its construction such that the data contains x as noisy data and y as the original data. The primary advantage of using this method is customizability: you can use any functions you would like to construct the dataset, since it is outside the scope of the neural network and therefore not subject to the restrictions of Keras and TensorFlow. You may want to add complex forms of noise that are not available as default layers or generators in Keras/TensorFlow. Moreover, there is no need to manipulate individual noise-adding layers of the autoencoder. However, you run the risk of overfitting (especially with small datasets and large architectures), because the autoencoder only sees one noisy form of each original input instance. Of course, you could manually produce several noisy forms of each instance, although it may take more time and be less efficient.

  • Apply noise through a data generator: Keras/TensorFlow contains an ImageDataGenerator object that can apply a variety of augmentations and other forms of noise to the image, like adjusting the brightness, making small rotations and shears, or distorting the hue. Moreover, the image data generator is similar to the layer-based method in that the network is exposed to many different noisy representations of the input data – data is passed through the random generator in each feed-forward motion and distorted before it is formally processed by the network. The primary advantage of using the data generator is that you can apply forms of noise that are more natural or expected to occur in the data than with layers, which implement more “artificial” forms of noise like adding Gaussian noise to the image. Moreover, there is no need to manipulate noise-adding layers after training the denoising autoencoder. However, image data generators are limited only to images, which means you will need to use another method for other forms of data.

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig16_HTML.jpg
Figure 3-16

Three methods of inducing noise in a denoising autoencoder

Denoising autoencoders are primarily used, as expected, to denoise input data we expect will be noisy. You can insert the denoising autoencoder directly after the input such that the neural network performing some supervised task receives a denoised version of the input (Figure 3-17).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig17_HTML.jpg
Figure 3-17

One method of using denoising autoencoders to perform denoising on input data before the denoised input is processed. You can also unfreeze the denoising autoencoder to fine-tune afterward

Consider a task in which a model must classify the sound of an audio input. For example, the model should be able to output a representation of the label “bird” if the audio input is of a bird song, or it should be able to output a representation of the label “lion” if the audio input is of a lion’s roar. As one may imagine, a challenge in this dataset of real-life audio is that many of the audio inputs will contain overlapping sounds. For instance, if the audio clip of the bird song was from a park surrounded by the metropolitan bustle, the background would contain sounds of cars driving and people talking. To be successful, a model will need to be able to remove these sources of background noise and isolate the core, primary sound.

To do this, say you obtain another dataset of real-life sounds without background noise. You artificially add some form of noise that resembles the noise you would encounter in the noisy dataset. For instance, you can use audio processing libraries in Python to overlay a sound without background noise with quieter, dimmed-down background noise. It is key for the success of the denoising autoencoder that the noise you artificially generate resembles the expected form of noise in the data. Adding Gaussian noise to images, for instance, may not help much as a denoising task unless the dataset you would like to denoise contains Gaussian noise. (You may find that adding some form of noise is better than adding no noise at all in terms of creating a self-supervised pretraining task, but this will be discussed more later.) You train a denoising autoencoder to reconstruct the original clean sound signal from the noise-overlaid signal and then use the denoising autoencoder to denoise real-life sounds before they are classified.

Denoising can occur in other forms, though, too. For instance, you can denoise the entire dataset with the denoising autoencoder before training a model on the ultimate task instead of architecturally inserting the denoising autoencoder into the model operating on the ultimate task. This could be successful if you expect the “ultimate” model to be applied to relatively clean data in deployment, but know that the training data available is noisy. It’s important to understand your particular problem well so you can successfully implement and manipulate your denoising autoencoder design.

Implementation

In our discussion of the implementation of denoising autoencoders, we will assume the autoencoder is being used for tabular data. The logic and syntax still can be applied to the use of denoising autoencoders for other data, like image or sequence data, with the necessary considerations for that particular data format.

Inducing Noise

As discussed, there are three practical methods of inducing noise.

One method is to insert a noising layer directly after the input (Listing 3-14, Figure 3-18). The most useful layer for this method will likely be the keras.layers.GaussianNoise layer, which adds Gaussian noise with a specified standard deviation. For a very heavy form of noise, you can also use layers like dropout or other modifications (although this is stretching the limits of what we would consider to strictly be a denoising autoencoder). You can also write your own custom layer for more complex noise-adding operations.
ae_input = L.Input((784,), name='input')
ae_noising = L.GaussianNoise(0.1, name='noising')(ae_input)
ae_encoder = encoder(ae_noising)
ae_decoder = decoder(ae_encoder)
ae = Model(inputs = ae_input,
           outputs = ae_decoder)
Listing 3-14

Layer-based method of inducing noise into a denoising autoencoder

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig18_HTML.jpg
Figure 3-18

Example implementation of inducing noise via layer method
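As noted above, you can also write your own noise-adding layer. The following is a hedged sketch of one possibility – a hypothetical layer that randomly zeroes a fraction of input values during training only – rather than a definitive implementation:
import tensorflow as tf
import keras.layers as L
class RandomZeroNoise(L.Layer):
    """Hypothetical noise layer: randomly zeroes a fraction of inputs during training."""
    def __init__(self, rate=0.05, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate
    def call(self, inputs, training=None):
        if training:
            # Zero out roughly `rate` of the values by elementwise masking
            keep_mask = tf.cast(tf.random.uniform(tf.shape(inputs)) > self.rate,
                                inputs.dtype)
            return inputs * keep_mask
        # At inference time, pass inputs through unchanged
        return inputs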

Another method of inducing noise is to apply it to the dataset directly, ideally during parsing when creating a TensorFlow dataset (Listing 3-15). When creating the parsing function, you can use a variety of TensorFlow image operations to induce random forms of noise. In this example, the unparsed TensorFlow dataset consists of a filename and a label for each instance (although you may find it helpful to arrange the unparsed dataset in another format if you know you will only use it for unsupervised learning).

In this example, we are using two helpful tf.image.random_x operations to induce random noise into the image: random brightness shifts and random JPEG quality degradation. (The latter refers to a phenomenon in JPEG images in which visual artifacts distort the image due to the compression algorithm. This is especially helpful if your dataset consists of images in JPEG format or contains images transferred electronically that likely were once in JPEG format.) It should be noted that when specifying the range of noise – for instance, in JPEG quality, “80” and “100” in the example are the minimum and maximum JPEG quality – you should leave open the possibility of cases in which no noise is applied to the image. This allows the model not only to denoise a noisy image but also to recognize the presence of noise in the first place. We build in this possibility by including the upper JPEG quality bound of 100%, at which the output of the random_jpeg_quality function is identical to the input. Functions like random_brightness require only a maximum noise bound, and thus you can assume that there will be some examples with minimal levels of noise that satisfy this good practice.
def parse_file(filename, label):
    # load and preprocess the clean image (placeholder preprocessing function)
    orig_img = do_preprocessing_stuff(filename)
    # apply random noise to create the noisy model input
    rand_img = tf.image.random_brightness(orig_img, 0.01)
    rand_img = tf.image.random_jpeg_quality(rand_img, 80, 100)
    return rand_img, orig_img
Listing 3-15

Example function to pre-alter the dataset with noise using .map() on a TensorFlow dataset
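For context, here is a hedged sketch of how such a parsing function might be attached to a TensorFlow dataset (the filenames and labels are hypothetical placeholders):
import tensorflow as tf
# Hypothetical lists of image file paths and labels
filenames = ['img_0.jpg', 'img_1.jpg']
labels = [0, 1]
# Build the unparsed dataset of (filename, label) pairs, then map the noise-inducing parser
ds = tf.data.Dataset.from_tensor_slices((filenames, labels))
ds = ds.map(parse_file).batch(32)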

A third method to induce certain noise into image data is to use the Keras/TensorFlow ImageDataGenerator (Listing 3-16). We begin by instantiating an ImageDataGenerator object with augmentation parameters for noise (the range to randomly shift brightness, small levels of shearing, etc.). To provide the image data generator object with data, use .flow or .flow_from_x. If using .flow_from_directory, be sure to set class_mode='input' so the image data generator does not assume labels and arrange the data generator as a classification problem.
from keras.preprocessing.image import ImageDataGenerator
idg = ImageDataGenerator(rotation_range=30,
                         width_shift_range=5,
                         ...)
idg_flow = idg.flow_from_directory(
    directory = '/directory/data',
    class_mode = 'input'
)
Listing 3-16

Using the Image Data Generator method of inducing noise into an image dataset. Substitute your own augmentation parameters as appropriate. See Chapter 1 for a detailed discussion of Image Data Generator usage

You can also use ImageDataGenerator with class_mode='input' as an alternative data source for autoencoders generally (not just for denoising) instead of using TensorFlow datasets. If you do decide to use image data generators for autoencoders, be sure to be careful with how you control your augmentation parameters for your particular purpose. If you are training a standard autoencoder, for instance, in which the input is identical to the ideal output, make sure to eliminate all sources of artificial noise by adjusting the augmentation parameters accordingly.

Using Denoising Autoencoders
If you are using the layer method of inducing noise into the denoising autoencoder, you will need to remove the noise-adding layer when using the denoising autoencoder in another application (Listing 3-17). Assume that the variable ae refers to the denoising autoencoder with a noise-adding layer, compartmentalized design, and proper naming. We can use the .get_layer(name) method to retrieve a layer or groups of layers by name and build them into a “final denoising autoencoder” model without the noise-adding layer. The example retrieves only the encoding and decoding layers, which store the weights that perform the key steps of encoding while denoising and decoding, thus eliminating the noise-adding layer in the final model. Here, you can see the many benefits that naming, compartmentalized design, and other organizational good practices can offer in manipulating the network structure.
final_dae_input = L.Input((784,), name='input')
final_dae_encoder = ae.get_layer('encoder')(final_dae_input)
final_dae_decoder = ae.get_layer('decoder')(final_dae_encoder)
final_dae = Model(inputs=final_dae_input,
                  outputs=final_dae_decoder)
Listing 3-17

Removing the noise-inducing layer in a denoising autoencoder

If you are dealing with a model that is not compartmentalized or that has many layers that need to be transferred, you can also refer to each layer by its index (Listing 3-18). Begin by defining the input layer and the layer after the input layer. In this case, the layer/group of layers we would like to follow the input layer in the final denoising autoencoder model is the encoder, which we can reference as ae.layers[2] because it is the third layer/component in the original denoising autoencoder architecture containing the noise-adding layer. From there, we can iteratively loop through the remaining layers and attach each one to the previous using Functional API syntax.
inp = L.Input((784,), name='input')
x = ae.layers[2](inp)
for layer in ae.layers[3:]:
    x = layer(x)
final_dae = Model(inputs=inp, outputs=x)
Listing 3-18

Alternate method of removing the noise-inducing layer in a denoising autoencoder

Keep in mind that there are complications with this sort of method when using nonlinear topologies.

By using either of these two methods, you can transfer the relevant weights into a final model without the intermediate layer. You will find this method of “surgically” removing an unwanted layer from a model helpful in other applications, especially in transfer learning. The resulting model with the removed noise layer (Figure 3-19) can be used for validation testing.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig19_HTML.jpg
Figure 3-19

The denoising autoencoder after removing the Gaussian noise layer

Once you have obtained a “cleaned” autoencoder, you can insert it into another model as a preprocessing segment to denoise the input data (Figure 3-20, Listing 3-19).
process = Sequential(name='processing')
process.add(L.Input((784,)))
process.add(L.Dense(256, activation='relu'))
process.add(L.Dense(1, activation='sigmoid'))
inp = L.Input((784,), name='input')
denoising = final_dae(inp)
processing = process(denoising)
ult_model = Model(inputs=inp,
                  outputs=processing)
Listing 3-19

Using a denoising autoencoder to denoise inputs before they are passed on to another model for further processing

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig20_HTML.jpg
Figure 3-20

Using the denoising autoencoder to denoise the input before it is passed to the following layers for processing

You can set your own layer freezing strategy (see Chapter 2) as necessary to suit your own purposes. For instance, you may decide to freeze the weights in the denoising autoencoder for the majority of training and then to unfreeze the entire network and perform some fine-tuning afterward.
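As a sketch of one such strategy (assuming ult_model and final_dae from the previous listings, and hypothetical training arrays X_train and y_train), the denoising autoencoder can be frozen for most of training and then unfrozen for a short fine-tuning phase at a lower learning rate.
from keras.optimizers import Adam
# Phase 1: freeze the denoising autoencoder and train only the processing layers
final_dae.trainable = False
ult_model.compile(optimizer='adam', loss='binary_crossentropy')
ult_model.fit(X_train, y_train, epochs=20)
# Phase 2: unfreeze everything and fine-tune the whole network at a lower learning rate
final_dae.trainable = True
ult_model.compile(optimizer=Adam(learning_rate=1e-4),
                  loss='binary_crossentropy')
ult_model.fit(X_train, y_train, epochs=5)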

Using Autoencoders for Pretraining

Another application of autoencoders is pretraining. Recall that pretraining provides a model with “context” such that it develops certain skills or representations that allow it to succeed in its ultimate task. In Chapter 2, we discussed various pretraining methods and strategies. We will build upon that discussion to demonstrate when and how to use autoencoders in the context of pretraining. With extensive knowledge of both autoencoders and pretraining at this point, you will find that the intuition and implementation of autoencoders for pretraining is quite straightforward.

Intuition

The use of autoencoders in pretraining falls under the category of self-supervised learning. You could think of autoencoders as the simplest form of self-supervised learning. Recall that in self-supervised learning, a model is trained on an altered dataset, which is constructed only on the input data, not the labels, of the task dataset. Some self-supervised learning tasks, for instance, involve predicting the degree by which an image was rotated or the degree of noise that was added to some set of data. In a standard autoencoder, however, no alterations to the data are needed beyond moving data instances into a dataset such that the input and output are the same for each instance.

Generally, when autoencoders are used for pretraining, the entire autoencoder is trained on the pretraining dataset, the encoder is extracted, more layers are appended to the encoder, and the newly formed model is trained on the task dataset (Figure 3-21). Because the encoder extracts important representations of the data from the unsupervised pretraining task, the hope of using the encoder for pretraining is that it begins with the basic ability to represent and encode key features of the input. The appended layers can then process the encoded key features into an output for the task dataset.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig21_HTML.jpg
Figure 3-21

Conceptual map of using autoencoders for pretraining. The input and output data are the same for the autoencoder. “Network” refers to the layers appended to the encoder; this isn’t an entirely precise term, but it represents the idea that another “mini-network” processes the output of the encoder, mapping the encoded representation to the task output

If you plan to use an autoencoder for pretraining, it’s important that you consider a key factor when deciding how large to build the bottleneck (and, correspondingly, the widths of the encoder and decoder components): the amount of information the layers appended after the encoder have to work with. If the bottleneck is too large, the reconstruction task may be trivial, and the network may not develop meaningful representations and processes for understanding the input. In this case, the processing layers after the encoder would receive a high number of features, but each feature would not contain much valuable information. On the other hand, if the bottleneck is too small, the model may develop meaningful representations, but the following processing layers may not have enough features to work with. Striking this balance takes experience and experimentation.

As with other pretraining approaches, it’s best to freeze components strategically to make the most of the encoder weights derived from pretraining (Figure 3-22). Generally, it’s good practice to first freeze the pretrained component – in this case, the encoder – and train the following component(s) to interpret the encoder’s output, and then to unfreeze the pretrained component so that the entire network can be fine-tuned.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig22_HTML.jpg
Figure 3-22

How components of the autoencoder for pretraining are frozen or unfrozen throughout training and fine-tuning

If you are using this method of autoencoder pretraining with image data and you follow the design of flattening and reshaping the data around the bottleneck (such that the bottleneck is a vector rather than an image), you will find that the structure of data transformation is especially clean. The encoder converts an image into a vector, and the following processing component processes that vector (which contains the encoded representation of the input) into the desired output. Thus, in this context, the encoder functions as the convolutional component and the following processing layers function as the fully connected component of an image-based deep learning model. This not only increases conceptual and organizational clarity but also allows you to further manipulate these sorts of autoencoder-for-pretraining designs for greater performance with the tools of transfer learning (see Chapter 2).
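As a rough sketch of this pattern (the layer sizes and shapes here are illustrative assumptions, not an architecture given earlier in the chapter), the encoder acts as a convolutional feature extractor ending in a flattened vector bottleneck, and a small dense head maps that vector to the task output.
from keras import layers as L
from keras.models import Sequential, Model
# Encoder: convolutional component ending in a vector bottleneck
conv_encoder = Sequential(name='encoder')
conv_encoder.add(L.Input((28, 28, 1)))
conv_encoder.add(L.Conv2D(16, (3, 3), activation='relu', padding='same'))
conv_encoder.add(L.MaxPooling2D((2, 2)))
conv_encoder.add(L.Conv2D(32, (3, 3), activation='relu', padding='same'))
conv_encoder.add(L.MaxPooling2D((2, 2)))
conv_encoder.add(L.Flatten())
conv_encoder.add(L.Dense(32, activation='relu'))   # vector bottleneck
# Processing head: fully connected component mapping the vector to the output
head = Sequential(name='processing')
head.add(L.Input((32,)))
head.add(L.Dense(16, activation='relu'))
head.add(L.Dense(1, activation='sigmoid'))
# Assemble: the encoder plays the role of the convolutional component
img_inp = L.Input((28, 28, 1), name='input')
task_model = Model(inputs=img_inp, outputs=head(conv_encoder(img_inp)))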

Note that while using autoencoders for pretraining, you can use a wide variety of autoencoder training structures beyond the standard autoencoder, in which the input is equivalent to the desired output.

For instance, denoising is a powerful self-supervised task that can be addressed with a denoising autoencoder (Figure 3-23).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig23_HTML.jpg
Figure 3-23

Example of using non-standard autoencoder tasks for pretraining, such as the denoising autoencoder

However, when you are choosing the self-supervised task to train your autoencoder, it’s important to have a strong conceptual understanding of what that self-supervised task will accomplish. In this self-supervised context, the denoising autoencoder offers primarily a strong understanding of key features of the data, developed by identifying and correcting artificially inserted noise, and only secondarily actual denoising capabilities. This means that you do not need to be as careful about ensuring that the artificial noise resembles the true noise in the dataset when developing denoising autoencoders for pretraining. Of course, being conscious about the setup of the denoising autoencoder can allow you to maximize both the benefit of denoising (i.e., the encoder develops representations more or less robust to noise) and self-supervised learning (i.e., the encoder develops abstract representations of key features and ideas by learning to separate noisy artifacts from true meaningful objects).

Note

It should be noted that the concepts of denoising and self-supervised learning are not completely independent from one another. To properly denoise an input, a model must develop representations of key features and concepts within the input, which is the goal of self-supervised learning.

This sort of simplicity and conceptual ease in manipulating autoencoders for pretraining makes them exceptionally popular in modern deep learning design.

Implementation

Implementing autoencoders for pretraining is simple, given that you have used compartmentalized design. Recall the autoencoder structure built to demonstrate the construction of autoencoders for tabular data, which took in data with 784 features and compressed it into 32 neurons in the bottleneck layer before reconstructing the 784 features.

In the spirit of compartmentalized design, let’s build a component that takes in the output of the encoder and processes it through a series of Dense layers to derive the output (in this case, for a binary classification problem, Listing 3-20). Note that this component takes in 32-dimensional data because the encoder outputs encoded representations that are 32-dimensional.
process = Sequential(name='processing')
process.add(L.Input((32,)))
process.add(L.Dense(16, activation='relu'))
process.add(L.Dense(1, activation='sigmoid'))
Listing 3-20

Building a sub-model to process the encoded features output by the encoder

Even though we compiled the encoder and decoder models into the autoencoder, we can still reference the encoder and decoder individually, with their weights retained. The final model with only the pretrained encoder and the processing component can be built with the Functional API (Listing 3-21).
inp = L.Input((784,), name='input')
encoding = encoder(inp)
processing = process(encoding)
ult_model = Model(inputs=inp,
                  outputs=processing)
Listing 3-21

Using the processing sub-model with the encoder in an overarching model for a supervised task

Make sure to freeze layers as appropriate.
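For instance, a minimal sketch of the first freezing step (the compile settings here are illustrative assumptions):
# Freeze the pretrained encoder before compiling; unfreeze it later for fine-tuning
# (the same freeze-then-unfreeze schedule sketched in the denoising section applies)
encoder.trainable = False
ult_model.compile(optimizer='adam', loss='binary_crossentropy')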

Using Autoencoders for Dimensionality Reduction

The concept of the autoencoder was initially presented as a method of dimensionality reduction. Because dimensionality reduction is almost “built into” the design of the autoencoder, you will find that using autoencoders for dimensionality reduction is very simple to implement. However, there’s still much to consider in performing dimensionality reduction with autoencoders; with the right design, autoencoders can offer a unique method of dimensionality reduction that is more powerful and versatile than other existing methods.

Intuition

Dimensionality reduction is generally performed as an unsupervised task, in which data must be represented with a smaller number of dimensions than it currently occupies. Many dimensionality reduction algorithms like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) attempt to project data into lower-dimensional spaces according to certain mathematical articulations of which features of the data should be valued most in a reduction. PCA attempts to preserve the global variance of the data, for instance, while t-SNE instead seeks to preserve local structure. Because different dimensionality reduction algorithms are built to prioritize the preservation of different features of the data, they are fundamentally different in character, and thus each is limited in its effectiveness across a wide variety of datasets.

However, in an autoencoder approach to dimensionality reduction, the autoencoder is trained to reconstruct the input after passing it through a bottleneck. After the autoencoder is fitted, the encoder is detached and the dimensionality reduction for any input can be obtained by passing it through the encoder and receiving its output (Figure 3-24).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig24_HTML.jpg
Figure 3-24

Conceptual map of using autoencoders for dimensionality reduction

Thus, autoencoders are distinctly different from other dimensionality reduction algorithms in two key ways: adaptability and articulation of valued features.
  • Adaptability: Algorithms like PCA and t-SNE allow the user to adapt the algorithm to their dataset by manipulating a few parameters, but that quantity is vastly outnumbered by the adaptability of the autoencoder. Because autoencoders are more of a concept than an algorithm, they can be built with a much greater degree of adaptation to your particular problem. You can change the depth and width of the autoencoder structure, the activations in each layer, the loss function, regularization, and many other features of neural network architectural and training design to change how an autoencoder behaves in performing dimensionality reduction. This also means that using autoencoders for dimensionality reduction is likely to be successful only if you are aware of how different manipulations to the autoencoder structure translate to changes in its behavior and the behavior of a dimensionality reduction algorithm necessary to handle dimensionality reduction on your dataset.

  • Articulation of valued features: Autoencoders prioritize certain features of the data and optimize the reduction in a way that differs in character from algorithms like PCA and t-SNE. Autoencoders attempt to minimize the reconstruction loss, while PCA and t-SNE attempt to optimize some relatively explicit mathematical formulation of what to prioritize, like the preservation of local or global structure (e.g., variance). These formulations attempt to capture what “information” entails in the context of dimensionality reduction. On the other hand, autoencoders do not seem, at least on the surface level, to have these prioritizations built explicitly into their design – they simply use whatever reduction allows for the most reliable reconstruction of the original input. Perhaps autoencoders are one of the most faithful approaches to capturing “information” in a broad, conceptual sense – rather than being tied to any particular explicit assumption of what constitutes a preservation of information (i.e., the means of preservation), they adopt whatever procedures and assumptions are necessary for the original item to be reconstructed (i.e., using whatever means are necessary to obtain the optimal ends of preservation).

These two features can be both advantages and disadvantages. For instance, adaptability can be a curse rather than a tool if your dataset is too complex or difficult to understand. Moreover, increased adaptability does not necessarily mean increased interpretability of that adaptation; that is, while the autoencoder possesses a much wider range of possible behaviors, it is not necessarily simple to identify which changes to the architecture will correspond to which changes (or absence of changes) in behavior. Note that the autoencoder’s articulation of valued features is determined by its loss function, which uses as one component the model prediction, which in turn depends on the model architecture (among other modeling details). Finding your way through this chain of considerations is likely a more arduous task than adjusting the relatively interpretable parameters of other dimensionality reduction methods.

Modern autoencoders for dimensionality reduction are most often used on very high-dimensional data since deep learning has evolved to be most successful on complex forms of data. More traditional algorithms like PCA developed for the reduction of lower-dimensional data are unlikely to be suited toward data like word embeddings in NLP-based models and high-resolution images. t-SNE is a popular choice for high-dimensional data, but primarily for the purpose of visualization. If you are looking to maximize the information richness of a dimensionality reduction and are willing to sacrifice some interpretability, autoencoders are generally the way to go.

Implementation

Using autoencoders for dimensionality reduction requires no further code from building and training the original autoencoder (see previous sections on building autoencoders for tabular and two-dimensional data). Assuming the autoencoder was built with compartmentalized design, you can simply call encoder.predict(input_data), where encoder corresponds to the encoder architecture and input_data represents the data you would like encoded.
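For instance, a minimal sketch (X is a hypothetical array of shape (n, 784), and encoder is the 32-dimensional-bottleneck encoder built earlier in the chapter):
# Each 784-dimensional row of X is reduced to a 32-dimensional encoded vector
reduced = encoder.predict(X)
print(reduced.shape)   # (n, 32)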

Using Autoencoders for Feature Generation

Feature generation and feature engineering are often thought of as relics of classical machine learning, in which engineering a handful of new features could boost the performance of machine learning models. In deep learning applications involving more complex forms of data, however, standard feature engineering methods like taking the row-wise mean of a group of columns or binning are in many cases unsuccessful or provide minimal improvement.

With autoencoders, however, we are able to perform feature generation for deep learning using an entity with the power and depth of deep learning methods.

Intuition

The encoding component of the autoencoder can be used to generate new features for a model to take in and process when performing a task. Because the encoder has learned to take in a standard input and compress it such that each feature in the encoded representation contains the most important information from the standard input, these encoded features can be exploited to aid the prediction of another model.

Functionally, the idea is almost identical to autoencoders for dimensionality reduction. However, using autoencoders for feature generation requires the additional step of generating new features and feeding the new features into the model.

We’ve seen earlier that a similar concept is employed in using autoencoders for pretraining, in which the encoder is detached from a trained autoencoder and inserted directly after the input of another network, such that the component(s) of the model after the encoder receive enriched, key features and representations of the input. However, the purpose of feature generation is to generate, or add, features rather than to replace them. Thus, when using autoencoders for feature generation, the encoder provides one set of encoded, information-rich features that are considered alongside the original set of features.

Consider, as an example, these two small datasets, consisting of the original features and the generated features of a hypothetical autoencoder fitted on the original features (Figure 3-25).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig25_HTML.jpg
Figure 3-25

Hypothetical datasets: original features and the generated features (produced by the output of an encoder in a trained autoencoder). The exact numbers in these tables are hypothetical (randomly generated)

When using autoencoders for pretraining, the autoencoder is trained to reconstruct the original features and the encoder is transferred to the new model to provide a feature-extracting mechanism directly after the input (Figure 3-26). Thus, the remainder of the network, which performs much of the interpretation of these features, cannot access the original features directly. Rather, it understands the original features through the compression and “interpretation” of the encoder model.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig26_HTML.jpg
Figure 3-26

Using only generated features from the encoder as inputs to the network

However, a concern arises from this particular approach: even though the encoder from the autoencoder will almost always give the model a boost (if designed and trained properly), does the encoder impose a limit on the performance of the model by forcing it to take information strictly through the lenses of the autoencoder? Sometimes, a bit of fine-tuning after training (in which the encoder weights are unfrozen and the entire model is trained) is enough to resolve this concern.

Even fine-tuning may not adequately address this concern in several circumstances, though. If the autoencoder does not attain relatively high reconstruction performance (i.e., it has middling, mediocre performance), forcing the model to take in only mediocre features could limit its performance. Alternatively, if the data is not of extremely high complexity, like tabular data or lower-resolution images, having an encoder compress the original inputs may be valuable, but not completely necessary. In many cases, models intended for less complex data benefit from processing both the original input and the encoded features.

When using autoencoders for feature generation, the encoder’s output is concatenated (or merged through some other mechanism) with the original input data, such that the remainder of the network can consider and process both sets of features (Figure 3-27). This method of feature generation can be thought of giving the remainder of the network the “best of both worlds” – access to the original, raw, unchanged data and a developed interpretation of that original data. Here, the network attempts to make the decision as to how each set of features is to be weighted and processed to optimize its performance.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig27_HTML.jpg
Figure 3-27

Using both generated features from the encoder and original features as inputs to the network

Architecturally, we can represent this with a nonlinear topology, in which the input layer feeds both into the encoder and directly into the layer that merges it with the encoder’s output (Figure 3-28).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig28_HTML.jpg
Figure 3-28

Conceptual map of how fitted autoencoders can be used in a prediction task with feature generation in which both the original features and the generated features are inputted to the remaining network layers for processing

This is similar in character to the layer-based method of inducing noise when training a denoising autoencoder: the operation is incorporated into the network rather than applied to the data outside of it. You could instead pass data through the encoder, concatenate the encoded features to the original dataset, and then feed the merged dataset into a standard network architecture. However, such an approach, in which changes to the data are performed outside of the neural network, does more work than necessary – it’s much easier to use the Functional API to add the encoder than to get messy with predictions and data organization, especially if your dataset is on the larger side. Adding the encoder model directly into the new model automatically puts these relationships and data flows in place.

Implementation

Like many autoencoder applications, implementing autoencoders for feature generation is straightforward with the tools we’ve developed prior. Three lines of code using the Functional API allow us to define the input layer, the encoder, and the junction at which the original data and the generated features are merged (Listing 3-22).
inp = L.Input((784,))
encoding = encoder(inp)
merge = L.Concatenate()([inp, encoding])
Listing 3-22

Creating the feature generation component of the new model, in which the input is passed through the encoder and then concatenated with the encoder’s output

Afterward, we can add more Dense layers to process the merged features (Figure 3-29, Listing 3-23).
dense1 = L.Dense(256, activation='relu')(merge)
dense2 = L.Dense(64, activation='relu')(dense1)
output = L.Dense(1, activation='sigmoid')(dense2)
model = Model(inputs=inp, outputs=output)
Listing 3-23

Processing the concatenated features

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig29_HTML.jpg
Figure 3-29

Hypothetical map of the architecture of an autoencoder used for feature generation

Be sure to set encoder.trainable = False before compiling, as the encoder’s weights – the basis for the method by which it extracts core features and representations – should be frozen during training. The need for fine-tuning is less significant here than if the autoencoder were used for pretraining.

You can also build more complex topologies to better take advantage of the encoder’s encoded features by first processing the original features and the generated features independently before merging and following with further processing (Figure 3-30, Listing 3-24).
inp = L.Input((784,))
encoding = encoder(inp)
# processing the input independently
p_inp_1 = L.Dense(256, activation='relu')(inp)
p_inp_2 = L.Dense(128, activation='relu')(p_inp_1)
# processing the encoder output independently
p_encoding_1 = L.Dense(32, activation='relu')(encoding)
p_encoding_2 = L.Dense(32, activation='relu')(p_encoding_1)
# merge and process
merge = L.Concatenate()([p_inp_2,
                         p_encoding_2])
dense1 = L.Dense(256, activation='relu')(merge)
dense2 = L.Dense(64, activation='relu')(dense1)
output = L.Dense(1, activation='sigmoid')(dense2)
# aggregate into model
model = Model(inputs=inp, outputs=output)
# freeze encoder weights (set before compiling)
encoder.trainable = False
Listing 3-24

Processing the encoded representation and the original input separately before they are concatenated

../images/516104_1_En_3_Chapter/516104_1_En_3_Fig30_HTML.jpg
Figure 3-30

Hypothetical map of the architecture of an autoencoder used for feature generation with further processing on the output of the encoder and the original features before concatenation

Broadly, you can use these sorts of architectural manipulations to build all sorts of complex nonlinear topologies to take advantage of pretraining methods.

Using Variational Autoencoders for Data Generation

The variational autoencoder is one of the more modern conceptions of the autoencoder. It serves a relatively young and fast-developing subfield of deep learning: data generation. While variational autoencoders are most often employed to generate images, they also have applications in generating language and tabular data. Although image generation can be used to produce photorealistic images, in practice variational autoencoders are more often used to generate additional data on which to train another model, which can be useful for small datasets. Because variational autoencoders rely heavily upon the notion of a latent space, they allow us to manipulate their output by traversing the latent space in certain ways. This offers more control and stability over the generated outputs than other data generation methods, like Generative Adversarial Networks (GANs).

Intuition

The goal of an autoencoder is to reconstruct the original input data as identically as possible. On the other hand, the goal of the variational autoencoder is to produce a similar image with reasonable variations – hence, the name “variational” autoencoder.

The fundamental idea behind generation using the variational autoencoder isn’t too complex: not only are the individual points within the latent space (corresponding to data points in the dataset) meaningfully arranged so that they can be decoded into their original form, but the latent space itself – consisting of all the space in and around existing encoded data points – is structured by the encoder such that it captures relationships relevant to the dataset. Therefore, we can sample points from regions of the latent space not occupied by existing encoded representations of dataset items and decode them to generate new data that wasn’t part of the original dataset.

For instance, consider data in two classes (perhaps images of dogs and cats or of the handwritten digits “0” and “1”). Of course, the autoencoder is not given these labels when trained, but it is likely able to place these images in the latent space such that the two classes are generally separated from one another (Figure 3-31). Thus, if we want an image resembling a dog, we can pass in an actual image of a dog from the dataset, make some changes to the corresponding location in the latent space to introduce variation, and decode the deviated/varied point to output a slightly different image of a dog.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig31_HTML.jpg
Figure 3-31

Decoding regions of the latent space near the latent space point of a data point to generate variations of the data – data that is different, but that shares similar features. The box in the center represents the latent space – the points in the bottleneck region to which input data has been mapped and from which it will be mapped back into a reconstructed version

Alternatively, you could randomly select several points from the latent space and decode them to generate corresponding images (Figure 3-32). This method is more commonly used to generate a large and diverse array of data, since making small deviations or variations to an existing point still restricts the generated image to be relatively similar to the original image. By completely randomly selecting several points from within the latent space, you can produce a much more diverse generated dataset.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig32_HTML.jpg
Figure 3-32

Sampling randomly from the latent space to produce a wide, diverse dataset of generated data

We can write the general process of generating data using an autoencoder at this point as follows:
  1. Train an autoencoder on a reconstruction task.

  2. Randomly select several points in the latent space, either by sampling within a reasonable domain or by making small changes to existing encoded data points.

  3. Decode each of those points into a corresponding generated image.

While this general logic and intuition is valid, there is a problem with an assumption we have made: with standard autoencoders, the latent space is not continuous; rather, it’s much more likely to be heavily discrete, separated into clusters. Continuity is unlikely to be helpful in the task of reconstructing inputs, because it magnifies the effect of small “slips” and “mistakes”; rather, having a discrete design allows for some minimum “guarantee” of success (Figure 3-33).

For instance, consider an autoencoder attempting to reconstruct images of the digits “0” and “1.” Assume the latent space of the autoencoder is discrete, meaning that there are purposeful gaps separating clusters of images (in this case, images appearing to contain the digit “0” vs. images appearing to contain the digit “1”). Even if the encoder did not encode an image containing the digit “0” well (it was not positioned in the optimal location in the latent space), the encoding would still likely fall within the cluster of “0” images, because the large gaps in the latent space separate images containing “0” from images containing “1.” This means that the decoder would still reconstruct an image with a shape similar to “0” – even if it didn’t capture the specifics of the image, it captures the main idea.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig33_HTML.jpg
Figure 3-33

A discrete vs. continuous latent space in autoencoders

On the other hand, assume the latent space is continuous, meaning that there are no purposeful gaps separating clusters of images. If there is a similar deviation in the positioning of an encoded image containing the digit “0,” it may be reconstructed by the decoder as “1” – there is no gap, or barrier, that separates the concepts of “0” and “1” from each other.

Thus, discreteness within the latent space is a useful tool for autoencoders to improve their reconstruction performance by sectioning off main ideas. It has been empirically observed that successful autoencoders tend to produce discrete latent spaces.

However, this becomes a challenge for generating data, because when we randomly sample from the latent space, we assume that the space is continuous. What would happen, for instance, if we sampled a point in the latent space that happened to reside in one of the “gaps” between discrete clusters (Figure 3-34)?
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig34_HTML.jpg
Figure 3-34

Sampling from “gap” regions of the latent space that the decoder has not learned to decode leads to bizarre and wildly unrealistic outputs

You may hypothesize that decoding a randomly sampled point in this “gap” would result in some sort of image that is “in between” both of the classes. However, the decoder has never been exposed to points in the latent space from that region and therefore does not have the experience and knowledge to interpret it. Therefore, the decoder will likely output some sort of gibberish that is in no way related to the original dataset.

Thus, variational autoencoders employ two key changes to force the autoencoder to develop continuous latent spaces, such that the latent space can both be sampled randomly (enabling a diverse generated dataset) and that the decoded latent space points will produce realistic/reasonable outputs (enabling a plausible generated dataset). Fundamentally, these changes force the autoencoder to learn continuous distributions rather than discrete locations in the latent space:
  • The encoder learns the optimal mean and standard deviation of variables in the latent space, from which an encoded representation is sampled and passed on to the decoder for decoding.

  • The variational autoencoder is optimized on a custom loss function: a blend of reconstruction loss (cross-entropy, mean squared error, or any other standard autoencoder loss function) and KL divergence to prevent the model from “cheating” (more on this “cheating” later).

After the input is passed through several processing layers to encode the data, it is separated into two vectors, representing a vector of means and a vector of standard deviations. Assuming the latent variables are drawn from a normal distribution, we can use the mean and standard deviation to randomly sample an encoded representation. This means that the same input can be encoded twice with different encoded representations. This encoded representation – now with some variation – is decoded and trained as an output (Figure 3-35).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig35_HTML.jpg
Figure 3-35

Architecture of a variational autoencoder

The architecture of the variational autoencoder exposes a certain beauty in deep learning: we can profoundly shape the neural network’s thinking processes to our own desires with a few well-placed nudges. For instance, you may be wondering: what if certain latent variables aren’t normally distributed? Over time, the network will adapt the encoding and decoding processes to accommodate the normal distribution, because we use the normal distribution assumption to choose the encoded representation; if it did not adapt, it would not attain optimal performance. We can expect the variational autoencoder to learn this adaptation because modern neural networks have more than enough representational capacity to do so. This, in turn, allows us to more freely impose (reasonable) expectations within models as a means to accomplish some end.

Similarly, even though we would like the branches after encoding to represent the means and standard deviations of the latent variables, we don’t build any hands-on mechanisms before these two branches to manually tell the network what the mean or standard deviation is. Rather, we assume that each of the branches takes the role of the mean and the standard deviation of a distribution when generating the encoded representation (the expectation) and allow the network to adapt to that expectation. This sort of expectation-based design allows you to implement deep learning ideas faster and more easily.

With enough training, the autoencoder gradually develops an understanding of the probability distribution of each latent variable, allowing the latent space to be, by design, continuous (Figure 3-36). Instead of learning the absolute location of points, the autoencoder learns probabilistic “clouds” around each original input to gain experience with variations on the original input. This allows the autoencoder to interpolate in the latent space more smoothly.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig36_HTML.jpg
Figure 3-36

Standard autoencoders learn the absolute locations of data in the latent space, whereas variational autoencoders learn probability distributions

However, there’s still one way the network can cheat: by setting the branch representing the standard deviation to be arbitrarily small, the network can reduce the variation around each learned point in the latent space such that it is functionally identical to learning the location of absolute points (Figure 3-37). Since we know that autoencoders will tend toward building discrete spaces, variational autoencoders implement another mechanism to force the autoencoder to learn a continuous space: KL divergence.
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig37_HTML.jpg
Figure 3-37

In practice, variational autoencoders can “cheat” and replicate the absolute point learning of standard autoencoders by forming clustering patterns and reducing the standard deviation

KL divergence, or Kullback-Leibler divergence, measures the level of divergence, or difference, between two probability distributions (roughly speaking). For the purposes of variational autoencoders, KL divergence is minimized when the mean is 0 and the standard deviation is 1 and thus acts as a regularization term, punishing the network for clustering or reducing the standard deviation.
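For reference (a standard closed form, not derived in the text), when the encoder outputs a mean μᵢ and standard deviation σᵢ for each latent variable and the latent variables are assumed to be independent and normally distributed, the KL divergence from the standard normal distribution is

KL(N(μ, σ²) ‖ N(0, 1)) = ½ · Σ (μᵢ² + σᵢ² − 1 − ln σᵢ²), summed over the latent dimensions.

Each term is zero exactly when μᵢ = 0 and σᵢ = 1, which is why the divergence acts as a regularizer pulling the learned distributions toward the standard normal; this is also the expression, written in terms of the log standard deviations, that the loss code later in this section implements.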

If the variational autoencoder is optimized only on KL divergence, it will tend to cluster all the points toward the center without any regard for decoding ability, because KL divergence is concerned with the distribution parameters outputted by the encoder, not the decoder’s ability. Hence, like only using reconstruction loss, the latent space cannot be interpolated (Figure 3-38).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig38_HTML.jpg
Figure 3-38

Results of latent space representations when only using reconstruction loss vs. only using KL loss

To address this problem, variational autoencoders are trained on the sum of reconstruction loss and KL loss. Thus, the variational autoencoder must simultaneously develop encoded representations that can be decoded into the original representation but will be punished if the representations make use of clustering, low standard deviations, and other marks of a discrete space. The variational autoencoder is a unique model in that its objective cannot be clearly articulated by any one existing loss function and thus is formulated as a combination of two functions.

Now, you can uniformly sample the latent space at random within the domain of learned representations and produce a generated set of images that is both diverse and realistic. Moreover, our intuitive logic of interpolation applies: if you want to visualize what lies “in between” two images (e.g., a cat and a dog, or the digits “1” and “9”), you can find the point in between the corresponding two points in the latent space and decode it.

Implementation

Implementing variational autoencoders is more difficult than implementing other autoencoder applications, because we need to access and manipulate components within the autoencoder itself (the latent space), whereas previously the only significant step was to change the input and output flow. Luckily, this serves as a useful exercise in the construction of more complex neural network structures in Keras.

Tip

Recall that in the Functional API, each layer is assigned to a variable. Although you should always be careful and specific with your variable naming, this is especially true with complex and nonlinear neural network architectures like the variational autoencoder. You’ll need to keep track of the relationships between several layers across several components, so establish a naming convention and code organization that works best for you. If you make mistakes, you’ll run into graph construction errors; use plot_model with show_shapes=True frequently to visualize the architecture and see what went wrong.

Let’s begin by implementing the encoder. Previously, we used the Sequential model-building structure to construct the encoder and decoder, because neither was topologically nonlinear and we did not need to reference any intermediate layers. In the variational autoencoder, however, neither of these conditions holds; thus, we will build the encoder using the Functional API (Listing 3-25).
enc_inputs = L.Input((784,), name='input')
enc_dense1 = L.Dense(256, activation='relu',
                     name='dense1')(enc_inputs)
enc_dense2 = L.Dense(128, activation='relu',
                     name='dense2')(enc_dense1)
Listing 3-25

Implementing the first layers of the encoder for a variational autoencoder

After the inputs are processed by several Dense layers (or convolutional or any other appropriate layers, given your data type), the network splits into two branches, representing the mean and the standard deviation of the latent variable distribution (Listing 3-26). Note that in practice the network learns the log standard deviation rather than the raw standard deviation to prevent negative solutions. You may also see approaches that use the log variance, which is functionally no different from using the log standard deviation (variance is just the standard deviation squared). Recall that with expectation-based design, we can expect the network to learn expectations as long as we set them (and the expectation is not too difficult to reach), so you can more or less direct the model to learn whatever attribute you would like it to.
means = L.Dense(32, name='means')(enc_dense2)
log_stds = L.Dense(32, name='log-stds')(enc_dense2)
Listing 3-26

Creating the branches to represent the mean and log standard deviations of the latent variable distributions

We’ve built two branches, representing the mean and log standard deviation of the latent variable distributions. In this case, our latent space has 32 dimensions (like a standard autoencoder with 32 bottleneck nodes), with one mean value and one log standard deviation value to describe the distribution. To attain the output of the encoder – the encoded representation of the input – we need to take in both the mean and the standard deviation, randomly sample from that distribution, and output that random sample (Listing 3-27). This sort of operation isn’t built as a default layer in Keras/TensorFlow, so we’re going to have to build it ourselves.

To build a custom layer, we can take advantage of the keras.layers.Lambda() method of constructing layers that perform operations. It can be used as follows.
def custom_func(args):
    param1, param2 = args
    result = perform_tf_operation(param1, param2)
    ...
    return result
custom_layer = L.Lambda(custom_func)([inp1, inp2])
Listing 3-27

Structure of creating a custom layer that performs TensorFlow operations on inputs using Lambda layers

In this example, custom_layer takes in the outputs of two previous layers. However, Keras delivers this data together in one argument variable containing all of the arguments to the custom function, which can be unpacked within the custom function. Additionally, all operations must be Keras/TensorFlow operations (see Keras backend for Keras’ low-level operations). Note that you can use methods like TensorFlow’s py_func (see Chapter 1) to convert Python functions into TensorFlow operations, if necessary.

The custom sampling layer for the variational autoencoder takes in the means and the log standard deviations. In the custom sampling function, we take in the argument variable and unpack it into the tensors containing the means and log standard deviations for each latent variable. Then, we generate a random, normally distributed tensor with the same shape as the means and log_stds objects, named eps, or epsilon. We can use the formula μ + e^(ln σ) · ϵ, which simplifies cleanly and intuitively to μ + σ · ϵ (Listing 3-28).
def sampling(args):
    means, log_stds = args
    eps = tf.random.normal(shape=(tf.shape(means)[0], 32),
                           mean=0, stddev=0.15)
    return means + tf.exp(log_stds) * eps
x = L.Lambda(sampling, name='sampling')([means, log_stds])
Listing 3-28

Creating a custom layer to sample encoded representations from the mean and log standard deviations of latent variables

The shape parameter used in creating the epsilon variable is expressed in this example as (tf.shape(means)[0], 32). Alternative expressions include (tf.shape(log_stds)[0], 32) and (tf.shape(means)[0], tf.shape(means)[1]); all capture the same idea. In order for the actual sampling (μ + σ · ϵ) to be successful, the tensors for the mean, the standard deviation/log standard deviation, and the normally distributed epsilon need to be the same size. Because this data is one-dimensional, we know that the inputs to the sampling layer will have the shape (batch_size, latent_space_size). We haven’t specified what the batch size is, so we simply use tf.shape(means)[0]. However, we do know the latent space size – 32 dimensions. With more complex data types, it’s important to understand the shapes of the inputs to this sampling layer for the data you’re dealing with.

It should also be noted that while the mean with which the epsilon tensor is generated should remain at 0, the standard deviation can be changed depending on how much variation you’re willing to introduce. More variation could allow the model to better interpolate and produce more realistic and diverse results, but too much could risk introducing too much noise to possibly be plausibly decoded. On the other hand, less variation allows the decoder to better decode the image but may result in less exploration and smooth interpolation of “gaps” in the latent space.

We can now aggregate these layers into the autoencoder (Listing 3-29).
encoder = keras.Model(inputs=enc_inputs,
                      outputs=[means, log_stds, x],
                      name='encoder')
Listing 3-29

Aggregating the encoder model

Note that we output the means, log_stds, and x, whereas in the original design, the encoder only truly outputs x, the sampled encoded representation. Recall that the loss function for variational autoencoders is specialized: a hybrid of reconstruction loss and KL divergence. In order to calculate KL divergence loss, the loss function requires access to the means and standard deviations. Therefore, in order to provide this access, we will list means and log_stds as outputs to the encoder model. While we won’t use them in the decoder, this allows our loss function to work. The encoder architecture is as follows (Figure 3-39).
../images/516104_1_En_3_Chapter/516104_1_En_3_Fig39_HTML.jpg
Figure 3-39

Building the encoder of the variational autoencoder

Building the decoder is relatively straightforward: like the standard autoencoder, it takes in an encoded representation (in this case, a 32-dimensional vector) and progressively decodes it (Listing 3-30).
dec_inputs = L.Input((32,), name='input')
dec_dense1 = L.Dense(128, activation='relu',
                     name='dense1')(dec_inputs)
dec_dense2 = L.Dense(256, activation='relu',
                     name='dense2')(dec_dense1)
output = L.Dense(784, activation='sigmoid',
                 name='output')(dec_dense2)
decoder = keras.Model(inputs=dec_inputs,
                      outputs=output,
                      name='decoder')
Listing 3-30

Creating and aggregating the decoder model

We can join these two components together into a variational autoencoder model (Listing 3-31).
vae_inputs = enc_inputs
encoded = encoder(vae_inputs)
decoded = decoder(encoded[2])
vae = keras.Model(inputs=vae_inputs,
                  outputs=decoded,
                  name='vae')
Listing 3-31

Aggregating the encoder and decoder sub-models into an overarching variational autoencoder model

This method of combining components into a model is similar to the one used for standard autoencoders, with two key differences. Firstly, the input to the variational autoencoder is not a unique keras.layers.Input or keras.Input input mechanism, but the input to the encoder (enc_inputs). This is because of the design of the variational autoencoder’s loss function: because the model must have access to the output of the encoder, the encoder’s input must also be on the same “level” as the variational autoencoder itself. Note that this is a valid operation because we’re not making any actual architectural changes (i.e., rerouting layers or adding multiple branches to one layer), only simplifying a technically redundant (but perhaps organizationally helpful) process of passing data through two input layers. Secondly, the decoder does not take in the entire encoder output, but instead indexes it (decoded = decoder(encoded[2])). Recall that for the KL divergence component of the loss function to have access to the distribution parameters, the encoder outputs the tensors containing the mean and log standard deviation, as well as the actual encoded representation. The decoder only takes in the encoded representation, which is the third output of the encoder and is thus indexed accordingly.

All that is left before fitting is to construct the loss function. Recall that our loss function is a combination of the reconstruction loss (we’ll use binary cross-entropy in this case) and KL divergence loss. Creating the reconstruction loss is easy because it’s already implemented in Keras (Listing 3-32).
from keras.losses import binary_crossentropy
reconst_loss = binary_crossentropy(vae_inputs, decoded)
Listing 3-32

Building the reconstruction loss component of the variational autoencoder’s custom loss function

Building the KL divergence component is more difficult. Although KL divergence is already implemented in Keras/TensorFlow, in our case we need to build a custom simplified KL divergence that does not take in predictions and ground truth labels, but instead the means and log standard deviations, measuring how far each learned distribution is from a distribution with mean 0 and standard deviation 1 (Listing 3-33).
# KL divergence from a standard normal, written in terms of the log standard deviations
kl_loss = -0.5 * tf.reduce_sum(
    1 + 2*log_stds - tf.square(means) - tf.exp(2*log_stds),
    axis=-1)
Listing 3-33

Building the KL divergence loss component of the variational autoencoder’s custom loss function

Note that tf.reduce_sum allows you to sum the losses for each item across a specified axis; this is necessary to deal with batching, in which there is more than one data point per batch.

The final variational autoencoder loss can be expressed as a sum of the reconstruction loss and the KL divergence loss (Listing 3-34). tf.reduce_mean averages the losses for each of the data points to produce the final loss value.
vae_loss = tf.reduce_mean(reconst_loss + kl_loss)
Listing 3-34

Combining reconstruction and KL divergence loss into the VAE loss

We need to make one more accommodation for the variational autoencoder’s unique loss function: usually, the loss function is passed into the model’s .compile() method. We can’t do that here, because our loss function does not take in the (prediction, ground truth) pair that Keras assumes a loss function takes in when passed through compiling – instead, it takes in the outputs of specific layers. Because of this, we use vae.add_loss(vae_loss) to attach the loss function to the variational autoencoder.

The model can then be compiled (without the loss function, which has already been attached) and fitted on the data. Since there are so many crucial parameters that determine the behavior of the variational autoencoder (architecture size, sampling standard deviation, etc.), it’s likely that you’ll need to tweak the model parameters a few times before you reach relatively optimal performance.
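As a minimal sketch of those final steps (x_train is a hypothetical array of shape (num_samples, 784) scaled to the range [0, 1], and the epoch and batch-size values are arbitrary), the model can be compiled, fitted, and then used to generate new data by sampling and decoding the latent space.
import numpy as np
vae.add_loss(vae_loss)
vae.compile(optimizer='adam')          # no loss argument - the loss is already attached
vae.fit(x_train, epochs=30, batch_size=64)
# Generate new data by sampling the 32-dimensional latent space and decoding
latent_samples = np.random.normal(size=(10, 32))
generated = decoder.predict(latent_samples)
# Interpolate between two inputs by decoding the midpoint of their encoded means
m1 = encoder.predict(x_train[:1])[0]   # encoder outputs [means, log_stds, sample]
m2 = encoder.predict(x_train[1:2])[0]
midpoint = decoder.predict((m1 + m2) / 2)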

We’ve taken several measures to adapt our model structure and implementation to work with the unique nature of the variational autoencoder’s loss function. While this book – and many tutorials – will present the building of these sorts of architectures in a linear format, the actual process of constructing these models is anything but sequential. The only way to truly know if the unique situation you are working with requires a certain workaround or needs to be rewritten or re-expressed somehow is to experiment. In this case, it is likely helpful to first write down the basic architectures of the encoder and the decoder, implement the custom loss function, and then test and experiment which changes to the architecture and/or loss function are needed to make it all work, consulting documentation and online forums as necessary.

Case Studies

In these case studies, we will discuss the usage of autoencoders for a wide variety of applications in recent research. Some can be accessed via existing implementations, whereas others may be out of the reach of our current implementation skills – nevertheless, these case studies provide rich and interesting examples of the versatility of autoencoders and put forth empirical evidence of their relevance in modern deep learning.

Autoencoders for Pretraining Case Study: TabNet

Deep learning is quite famously seldom used for tabular data. A common critique is that deep learning is “overkill” for simpler tabular datasets, in which the massive fitting power of a neural network overfits on a dataset without developing any meaningful representations. Often, tree-based machine learning models and frameworks like XGBoost, LightGBM, and CatBoost are employed instead to handle tabular data.

However, especially with larger tabular datasets, deep learning has a lot to offer. Its complexity allows it to use tabular data in a multimodal fashion (considering tabular data alongside image, text, and other data), engineer features optimally on its own, and use tabular data in deep learning domains like generative modeling and semi-supervised learning (see Chapter 7).

In the 2020 paper “TabNet: Attentive Interpretable Tabular Learning,”4 Sercan O. Arik and Tomas Pfister propose a deep learning architecture for tabular data named TabNet. Arik and Pfister highlight three key aspects of TabNet’s architecture and design:
  • End-to-end design: TabNet does not need feature processing and other preprocessing methods required when using tabular data for other approaches.

  • Sequential attention: Attention is a mechanism that is commonly used in language models to allow the model to develop and understand relationships between various entities in a sequence. In TabNet, sequential attention allows the model to choose, at each step, which features it should consider. This, moreover, allows for greater interpretability of the model’s decision-making processes.

  • Unsupervised pretraining: TabNet is pretrained using an autoencoder-based architecture to predict masked column values. This leads to a significant performance improvement. TabNet demonstrated, for the first time, a large improvement from self-supervised pretraining on tabular data.

For the purposes of this case study, the unsupervised pretraining via autoencoder structure is the focus.

Using an encoder and a decoder, TabNet was trained via self-supervised learning to reconstruct masked column values (Figure 3-40). After pretraining is completed, the encoder is detached and altered to be fitted for a supervised task. By pretraining such that the decoder can reconstruct missing inputs based on the TabNet encoder's encoded representations, the TabNet encoder develops efficient processes for extracting and aggregating key information from the inputs that are present. Moreover, it must develop a strong understanding of the relationships between each of the tabular dataset's columns. These representations and this knowledge allow TabNet to be more easily adapted toward some sort of decision-making, in which the information-rich encoded representations need only to be further interpreted and adapted for the particular classification task.
Figure 3-40

TabNet method of unsupervised pretraining before supervised fine-tuning. Produced by Arik and Pfister

When tested on a wide range of problems, TabNet performs as well as or better than standard decision tree-based approaches (Table 3-2). The Forest Cover Type dataset is a classification task; TabNet outperforms XGBoost, LightGBM, and CatBoost by a significant margin. It also improves upon the AutoML Tables automated search framework without requiring as thorough a hyperparameter search as AutoML Tables does.
Table 3-2

Performance on Forest Cover Type dataset

Model            Test Accuracy (%)
XGBoost          89.34
LightGBM         89.28
CatBoost         85.14
AutoML Tables    94.95
TabNet           96.99

The Sarcos dataset is a problem relating to the dynamics of an anthropomorphic robot arm, in which 7 joint positions, 7 joint velocities, and 7 joint accelerations (a 21-dimensional input space) must be mapped to 7 joint torques. A TabNet model with only 6300 parameters achieves a test MSE comparable to the best non-TabNet model while using only a small fraction of the parameters (Table 3-3). A large TabNet model with 1.75 million parameters achieves a test error far below that of any other model.
Table 3-3

Performance on the Sarcos dataset

Model                      Test MSE    Model Size
Random forest              2.39        16.7 K
Stochastic decision tree   2.11        28 K
MLP                        2.13        0.14 M
Adaptive neural tree       1.23        0.60 M
Gradient boosted tree      1.44        0.99 M
Small TabNet               1.25        6.3 K
Medium TabNet              0.28        0.59 M
Large TabNet               0.14        1.75 M

Moreover, experiments comparing the performance of TabNet trained with and without pretraining show that pretraining significantly improves performance, especially in cases in which the unlabeled dataset is much larger than the labeled dataset (Table 3-4). This makes sense – self-supervised pretraining allows the model to take advantage of unlabeled data that a purely supervised training process would not be able to exploit. Pretraining both increases test accuracy and speeds up convergence.
Table 3-4

Performance on the Higgs Boson dataset with the medium-sized TabNet model. Similar improvements with pretraining are observed with other datasets

Training Dataset Size    Test Accuracy Without Pretraining (%)    Test Accuracy With Pretraining (%)
1 K                      57.47 ± 1.78                             61.37 ± 0.88
10 K                     66.66 ± 0.88                             68.06 ± 0.39
100 K                    72.92 ± 0.21                             73.19 ± 0.15

To work with the TabNet model, install the tabnet library with pip install tabnet. Two relevant ready-to-go TabNet models are implemented: tabnet.TabNetClassifier and tabnet.TabNetRegressor, for classification and regression tasks, respectively (Listing 3-35). These models can be compiled and fitted like standard Keras models.

There are several key parameters in the TabNet classifier model to be aware of: feature_columns accepts a list or tuple of feature columns (alternatively, the number of features can be specified directly via num_features); num_classes indicates the number of output classes; feature_dim determines the dimensionality of the hidden representation in the feature transformation block; and output_dim determines the dimensionality of the outputs of each decision step.
from tabnet import TabNetClassifier
# Build a TabNet classifier: num_features specifies the input width directly
# (so feature_columns can be left as None); feature_dim and output_dim size
# the feature transformer's hidden representation and each decision step's
# output, respectively.
model = TabNetClassifier(feature_columns=None,
                         num_classes=10,
                         num_features=784,
                         feature_dim=32,
                         output_dim=16)
# Compile and fit like any standard Keras model.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=100)
Listing 3-35

Instantiating a TabNet classifier from the tabnet library

It is important to fit the model with a high initial learning rate that gradually decays and a large batch size, anywhere between 1% and 10% of the training dataset size, memory permitting. See the documentation for syntax covering more specific modeling needs.
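As a rough illustration of this advice (the schedule and batch size values below are assumptions for demonstration, not recommendations from the TabNet authors), an exponentially decaying learning rate and a large batch size could be set up as follows:
import tensorflow as tf
# Start with a high learning rate and decay it gradually over training steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.02, decay_steps=500, decay_rate=0.9)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Use a large batch size, here roughly 5% of the training set (memory permitting).
batch_size = max(32, int(0.05 * len(x_train)))
model.fit(x_train, y_train, epochs=100, batch_size=batch_size)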

Denoising Autoencoders Case Study: Chinese Spelling Checker

English, like many other European languages, relies upon piecing together letter-like entities sequentially to form words. Misspelling in English can be addressed relatively well with simple rule-based methods, in which a rulebook maps commonly misspelled words to their correct spellings. Deep learning models, however, are able to consider the context of the surrounding text to make more informed decisions about which word the user intended to use, as well as to fix grammar mistakes and other language errors.

However, Chinese “words” are characters rather than strings of letters. Mistakes in Chinese commonly stem from two sources of error: visual similarity (two characters share similar visual features) and phonological similarity (two characters are pronounced similarly or identically). Because Chinese and English have different structures and therefore different requirements for correcting spelling, many of the deep learning approaches developed for English and other European languages cannot be readily transferred to Chinese.

In “FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based on DAE-Decoder Paradigm,” Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, and Junhui Liu5 propose a deep learning approach to correcting Chinese spelling based upon the denoising autoencoder architecture.

The approach consists of two components: a masked language model and a confidence-similarity decoder. The masked language model takes in sample text and outputs several candidates for replacement. The confidence-similarity decoder selects replacements from these candidates to form the completed, spell-checked sentence.

The masked language model, which functions as the denoising autoencoder (removing the “noise” – the spelling errors – from the input), uses a BERT-like structure. In models like BERT, pretraining is often performed by randomly masking certain words and training the model to predict the masked word. However, the errors induced by using random masks are not representative of the real-world errors seen in Chinese spelling. The pretrained masked language model in FASPell is therefore fine-tuned on a composite dataset constructed with three processes (a rough sketch of this data construction follows the list):
  • Standard masking: Some fraction of the dataset uses the same random masking as BERT: one or more characters are masked with a [MASK] token, and the model predicts each masked character based on the context. This is performed on texts that have no errors.

  • Error self-masking: Errors in the text are “masked” as themselves. The target label corresponds to the correct character. The model must identify that there is an error and perform a correction to obtain perfect performance.

  • Non-error self-masking: Correct characters in the text are also “masked” as themselves, and the target label is that character. This is to prevent overfitting. In this case, the model must identify that there is no error and allow the character to pass through the model unchanged.
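The following is a rough Python sketch of how a single fine-tuning example might be assembled from these three processes. The function and its parameters are hypothetical illustrations rather than the authors' implementation.
import random
MASK = "[MASK]"
def make_finetuning_example(tokens, corrections, mask_prob=0.15):
    """Hypothetical sketch of FASPell-style fine-tuning data construction.
    tokens:      list of characters in the sentence
    corrections: dict mapping the index of each erroneous character to its
                 correct character (empty for error-free text)
    Returns (inputs, targets); a target of None means that position is not
    trained on."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, ch in enumerate(tokens):
        if i in corrections:
            # Error self-masking: the error stays in place ("masked" as itself)
            # and the target is the correct character.
            targets[i] = corrections[i]
        elif random.random() < mask_prob:
            # Standard BERT-style masking of a correct character.
            inputs[i] = MASK
            targets[i] = ch
        elif random.random() < mask_prob:
            # Non-error self-masking: the correct character stays and is its own
            # target, teaching the model to leave correct text unchanged.
            targets[i] = ch
    return inputs, targets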

While the masked language model provides the basis for the spelling correction, on its own it is a weak Chinese spell checker – hence the addition of the confidence-similarity decoder to improve performance.

The decoder attempts to process and understand the denoised candidates generated by the masked language model. The decoder takes in both the confidence of the masked language model in each character and the similarity between characters. The similarity between two characters is calculated as a combination of their visual and phonological similarity.

The logic of comparing the masked language model’s confidence and the character similarity is as follows (Figure 3-41). Say that the masked language model proposes a character replacement that has a high confidence and is very similar to the original erroneous character. The high confidence indicates that the masked language model is sure the correction is right. Because the original erroneous character is highly similar to the proposed replacement, we gain additional confidence in the correction: a spelling mistake is more likely to resemble the intended character than to differ from it entirely.
Figure 3-41

Architecture and example data flow throughout the FASPell model, including the masked language model and the confidence-similarity decoder. Produced by Hong et al

On the other hand, if the masked language model has a high confidence in a correction, but the correction and the original character have little similarity, these two signals must be closely compared. If the masked language model’s confidence is high enough as to outweigh the low similarity between the original character and the correction, the correction is accepted. For each character in the sentence, the correction that maximizes the model confidence and the similarity with the original character is accepted and inserted into the final output.
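To make this decision rule concrete, here is a hypothetical Python sketch of weighing the two signals for a single character. FASPell's actual decoder uses empirically tuned filtering curves rather than a single weighted score, and the similarity function here is assumed to be supplied by the caller.
def choose_correction(original_char, candidates, similarity_fn,
                      confidence_weight=0.6, threshold=0.8):
    """candidates:    list of (candidate_char, mlm_confidence) pairs in [0, 1]
    similarity_fn: callable returning a combined visual/phonological
                   similarity in [0, 1] for two characters"""
    best_char, best_score = original_char, 0.0
    for candidate, confidence in candidates:
        if candidate == original_char:
            continue  # the model already considers this character correct
        score = (confidence_weight * confidence
                 + (1 - confidence_weight) * similarity_fn(original_char, candidate))
        if score > best_score:
            best_char, best_score = candidate, score
    # Replace the original character only if some candidate is convincing enough.
    return best_char if best_score >= threshold else original_char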

The FASPell design achieves state-of-the-art performance in both error detection and correction (Table 3-5), outperforming prior Chinese spelling correction systems proposed by Wang et al. (2018) and Yeh et al. (2013).
Table 3-5

Performance of FASPell in various spelling tasks, measured with three different metrics

               Detection                                    Correction
Model          Precision      Recall         F1             Precision      Recall         F1
Wang et al.    54.0           69.3           60.7           Not reported   Not reported   52.1
Yeh et al.     Not reported   Not reported   Not reported   70.3           62.5           66.2
FASPell        76.2           63.2           69.1           73.1           60.5           66.2

Moreover, removing the fine-tuning of the masked language model and removing the confidence-similarity decoder each significantly decrease performance, demonstrating the important role both elements play in adapting the traditional denoising autoencoder-style masked language model to the unique challenges of Chinese text (Table 3-6).
Table 3-6

Performance of versions of FASPell in various spelling tasks, measured with three different metrics

                                                               Detection                                    Correction
Model                                                          Precision      Recall         F1             Precision      Recall         F1
FASPell                                                        76.2           63.2           69.1           73.1           60.5           66.2
FASPell without fine-tuning                                    75.5           40.9           53.0           73.2           39.6           51.4
FASPell without confidence-similarity decoder                  42.3           41.1           41.6           32.2           31.3           31.8
FASPell without fine-tuning or confidence-similarity decoder   65.2           47.8           55.2           48.4           35.4           40.9

FASPell shows the importance of adapting autoencoders toward your particular task. By changing the content and method by which the denoising autoencoder was trained and adding a confidence-similarity decoder to incorporate knowledge of character similarity into understanding the denoising autoencoder’s output, FASPell was able to address a key shortcoming in standard pretrained autoencoder designs.

Variational Autoencoders Case Study: Text Generation

Earlier, we looked at variational autoencoders for image data. Recall that variational autoencoders are especially valuable because we can control the generated data by changing the sampled location in the latent space. This allows for fine control over attributes like the text’s sentiment and writing style that is difficult to achieve with other generation methods.

With the right design, variational autoencoders generally work well on images. However, applying variational autoencoders to text generation, in which convolutional layers are replaced with recurrent layers, is more difficult. Recall that the loss for variational autoencoders is a sum of the reconstruction loss and the KL divergence. The KL divergence acts as a regularizing term that can be optimized by clustering the latent variable distributions closer together.

When training variational autoencoders for natural language tasks with recurrence-based structures (Figure 3-42), the KL divergence term of the loss function tends to collapse, dropping to 0. This means that the latent variable distributions are essentially all on top of each other, clustered as closely together as possible. Because these overlapping distributions are very difficult to decode and make sense of, the decoder that generates the text ignores these latent representations entirely. Instead, because RNN (recurrent neural network)-based decoders have significant modeling power, the decoder can produce text that scores a relatively low reconstruction error without relying on the encoded representations provided by the encoder.
Figure 3-42

The traditional LSTM-based variational autoencoder for text generation, which suffers from KL divergence term collapse

This results in repetitive and unoriginal outputs that “cheat” the loss function, like these sample generated tweets:
@userid @userid @userid @userid @userid
@userid thanks for the follow
@userid @userid @userid @userid @userid @userid @userid @userid @userid
@userid @userid thanks for the follow
In “A Hybrid Convolutional Variational Autoencoder for Text Generation,” Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth6 propose a variational autoencoder-based structure to address this problem. Using their approach, the generated output is both more realistic and diverse:
@userid All the best!!
@userid you should come to my house tomorrow
I wanna go to the gym and I want to go to the beach
@userid and it's a great place
@userid I hope you're feeling better

The architecture is a hybrid of convolutional and recurrent layers: the encoder is composed entirely of convolutional layers, whereas the decoder combines deconvolutions (i.e., transpose convolutions) with a traditional RNN. Deconvolutional layers were chosen over recurrent layers for two key reasons: deconvolutional layers have more efficient GPU implementations, and feed-forward architectures are easier to optimize than recurrent ones because they require a smaller, more consistent number of sequential steps.

By using standard convolutional and deconvolutional layers in a variational autoencoder format, key features from the sequence can be learned and mapped to a continuous latent space (Figure 3-43). The sampled encoded representation is decoded via a series of deconvolutions, and the output of this convolutional variational autoencoder is passed through a conventional LSTM network for further processing (Figure 3-44). The recurrent component specializes in modeling the sequential dependencies between entities of a text, relieving the convolutional/deconvolutional layers of a task they are not built to perform well.
Figure 3-43

The feed-forward component of the variational autoencoder. Inputs enter through the bottom and are processed through convolutions and deconvolutions to produce a generated set of features

Figure 3-44

The LSTM component of the hybrid approach, in which the generated features are further decoded by an LSTM-style model into text to capture complex sequence-based dependencies between entities of a sentence
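To make the structure more concrete, the following is a rough Keras sketch of such a hybrid decoder. It is not the authors' exact architecture; the layer sizes, sequence length, and vocabulary size are assumed purely for illustration.
from tensorflow import keras
from tensorflow.keras import layers
latent_dim, vocab_size = 64, 5000  # assumed sizes
# Deconvolutions expand the sampled latent vector into a sequence of feature
# vectors; an LSTM then refines them into per-timestep token probabilities.
latent_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(8 * 128, activation='relu')(latent_inputs)
x = layers.Reshape((8, 128))(x)  # seed sequence of length 8
x = layers.Conv1DTranspose(128, 3, strides=2, padding='same', activation='relu')(x)  # length 16
x = layers.Conv1DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)   # length 32
x = layers.LSTM(256, return_sequences=True)(x)  # recurrent refinement of the generated features
outputs = layers.Dense(vocab_size, activation='softmax')(x)  # token distribution per timestep
decoder = keras.Model(latent_inputs, outputs, name='hybrid_decoder')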

This architecture allows for meaningful text generation without employing overpowering stacks of recurrent layers. However, three other mechanisms are required to fully address the KL divergence collapse problem:
  • Input dropout: A dropout layer is applied to the input such that the input, even before encoding, is already corrupted in a random fashion.

  • KL term annealing: By gradually increasing the weight of the KL divergence term in the loss function over the course of training, the neural network transitions from a standard autoencoder into a variational autoencoder (a minimal sketch of such an annealing schedule follows this list).

  • Auxiliary reconstruction term: A specialized term is added to the loss, which can be optimized only through meaningful reconstruction and encoding. This emphasizes the importance of the reconstruction component of the loss function.
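As a concrete illustration of the KL term annealing mechanism, here is a minimal Keras sketch (not the authors' implementation). It assumes loss tensors named reconstruction_loss and kl_loss – names chosen here only for illustration – are combined via add_loss(), as with the variational autoencoder built earlier.
from tensorflow import keras
from tensorflow.keras import backend as K
# Weight on the KL term; starting at 0 means training begins as a plain autoencoder.
kl_weight = K.variable(0.0)
class KLAnnealing(keras.callbacks.Callback):
    """Linearly increase the KL weight from 0 to 1 over anneal_epochs epochs."""
    def __init__(self, anneal_epochs=10):
        super().__init__()
        self.anneal_epochs = anneal_epochs
    def on_epoch_begin(self, epoch, logs=None):
        K.set_value(kl_weight, min(1.0, epoch / self.anneal_epochs))
# Inside the model definition, the annealed loss would be attached as, for example,
#   model.add_loss(reconstruction_loss + kl_weight * kl_loss)
# and the callback passed to training:
#   model.fit(x_train, epochs=30, callbacks=[KLAnnealing(anneal_epochs=10)])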

Using a convolutional-recurrent hybrid variational autoencoder with appropriate modifications to the optimization strategy yields a much more successful model that produces outputs that are both diverse and realistic. By linearly interpolating between two points in the latent space, we’re able to transition from one tweet to a completely different tweet:
@userid I do that too. I have to wait a lot of people too
@userid I do not know about it and then I can find a contest
@userid I am so excited for this summer I hope you are well
@userid i don't know what to do in the morning i love you so much for the shoutout
@userid i don't know what to do if you don't mind follow me i love you so much xx
@userid it would be awesome to hear it's a great place to see you around the weekends
@userid it would be awesome to hear I'm so excited for the summer I'm going to see them
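A minimal sketch of this kind of latent-space interpolation, assuming a trained decoder model and two latent vectors z_start and z_end (1-D NumPy arrays produced by the encoder), might look like the following:
import numpy as np
def interpolate_latent(decoder, z_start, z_end, steps=7):
    """Decode evenly spaced points on the line between two latent vectors."""
    outputs = []
    for alpha in np.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_start + alpha * z_end  # linear interpolation
        outputs.append(decoder.predict(z[np.newaxis, :])[0])
    return outputs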

Semeniuta, Severyn, and Barth’s work shows how creativity with combinations of layers, a willingness to transgress prepackaged boundaries for what certain model types should look like, and well-targeted mechanisms aimed at particular challenges can together lead to significantly greater performance.

Key Points

In this chapter, we discussed the intuition and implementation of autoencoders for tabular and image data, as well as for five key autoencoder applications: denoising via denoising autoencoders, pretraining, dimensionality reduction, feature generation, and data generation via variational autoencoders.
  • Autoencoders encode and decode data, in the process developing important representations of the dataset and efficient processes for extracting key information and features from data. Autoencoders rely upon the context of the dataset to support their efficient representations. Therefore, while they are not good universal compression algorithms, they have many applications that exploit the autoencoder’s unique ability to learn complex relationships and representations of data in an unsupervised fashion.

  • A key trade-off in autoencoder design is the size of the bottleneck: a bottleneck that is too small may make any encoding task too difficult to be practical, whereas a bottleneck that is too large can be solved trivially without developing meaningful representations.

  • Use compartmentalized design when implementing autoencoders and other complex neural network structures by defining models as a series of relationships between sub-models. This makes implementing autoencoder applications significantly easier and is also conceptually clearer to understand and manipulate.

  • To build an autoencoder for tabular data, simply stack Dense layers with a progressively decreasing number of neurons in the encoder and a progressively increasing number of neurons in the decoder. To build an autoencoder for image data, use convolutional layers and pooling layers in the encoder, and transpose convolutional layers and upsampling layers in the decoder. Convolutional autoencoders often flatten and reshape in the region surrounding the bottleneck such that the output of the encoder (the encoded representation) is a vector. Use padding='same' in convolutional layers for easy manipulation of the shape size. Autoencoders can also be built for other data types, as long as there exists an inverse decoding layer for every encoding layer.

  • Denoising autoencoders are autoencoders trained to reconstruct the original data from a corrupted, noisy input. If you are using denoising autoencoders to denoise a dataset, ensure that the form of noise is realistic and actually exists within the dataset. When implementing denoising autoencoders, noise can be induced either as a layer, applied directly to the dataset, or via an ImageDataGenerator (in the case of images).

  • When using autoencoders for pretraining, the autoencoder is first trained on unsupervised data to extract key ideas and representations. Then, the encoder is detached and custom processing layers are added after it to alter the model structure such that it can perform a supervised task. The encoder weights are usually frozen such that the following layers can be trained to interpret the output of the encoded representations; this can be followed by some fine-tuning to optimize performance.

  • Autoencoders are different from other dimensionality reduction algorithms in two key ways: higher adaptability and a model-centric articulation of valued features. These can be implemented simply by having the encoder predict encoded representations for certain data inputs.

  • Using autoencoders for feature generation is like using them for dimensionality reduction, but the encoded features are concatenated to the original dataset rather than replacing the original dataset. This can be implemented with a nonlinear architecture, in which the input is both passed through the encoder and concatenated to the output of the encoder; this enriched data can then be passed through a series of processing layers to be interpreted and used to produce an output.

  • Variational autoencoders allow for the generation of new data by choosing locations in the latent space and decoding them. However, because standard autoencoders tend to produce discrete latent spaces that cannot be sampled randomly, variational autoencoders implement two key changes to force the development of a continuous latent space: learning the probability distributions of latent variables rather than their absolute positions and adding KL divergence to the loss function to regularize the formation of the latent space. Variational autoencoders rely upon relatively low-level operations but can be implemented with good organization and active experimentation.

Autoencoders are extraordinarily versatile structures that you can work into almost any deep learning design. Keep them in mind when you approach a new deep learning problem and need somewhere to start – autoencoders provide a good starting point for developing knowledge and building important representations.

In the next chapter, we’ll discuss model compression and the exciting implications it holds for the future of deep learning.
