© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Ye, Z. Wang, Modern Deep Learning for Tabular Data, https://doi.org/10.1007/978-1-4842-8692-0_8

8. Autoencoders

Andre Ye1   and Zian Wang2
(1)
Seattle, WA, USA
(2)
Redmond, WA, USA
 

Weak encoding means mistakes and weak decoding means illiteracy.

—Rajesh Walecha, Author

An autoencoder is a very simple model: a model that predicts its own input. In fact, it may seem deceptively simple to the point of being worthless. (After all, what use is a model that predicts what we already know?) Autoencoders are extraordinarily valuable and versatile architectures not because of their ability to reproduce the input, but because of the internal capabilities they develop in order to obtain that ability. Autoencoders can be chopped up into desirable parts and stuck onto other neural networks, like playing with Legos or performing surgery (take your pick of analogy), with incredible success, or can be used to perform other useful tasks, such as denoising.

This chapter begins by explaining the intuition of the autoencoder concept, followed by a demonstration of how one would implement a simple "vanilla" autoencoder. Afterward, four applications of autoencoders – pretraining, multitask learning, sparse learning for robust representations, and denoising – are discussed and implemented.

The Concept of the Autoencoder

The operations of encoding and decoding are fundamental to information. Some hypothesize that all transformation and evolution of information results from these two abstracted actions of encoding and decoding (Figures 8-1 and 8-2). Say Alice sees Humpty Dumpty hit his head on the ground after some precarious wall-sitting and tells Bob, “Humpty Dumpty hit his head badly on the ground!” Upon hearing this information, Bob encodes the information from a language representation into thoughts and opinions – what we might call latent representations.

Say Bob is a chef, and so his encoding “specializes” in features relating to food. Bob then decodes the latent representations back into a language representation when he tells Carol, “Humpty Dumpty cracked his shell! We can use the innards to make an omelet.” Carol, in turn, encodes the information.

Say Carol is an egg activist and cares deeply about the well-being of Humpty Dumpty. Her latent representations will encode the information in a way that reflects her priorities and interests as a thinker. When she decodes her latent representations into language, she tells Drew that “People are trying to eat Humpty Dumpty after he has suffered a serious injury! It is horrible.”

So on and so forth. The conversation continues and evolves, information passed and transformed from thinker to thinker. Because each thinker encodes the information represented in language into a semantic system relevant to their experiences, priorities, and interests, they correspondingly decode information in a fashion colored through these lenses.

A flow diagram depicts the abstract actions of encoding and decoding. The labels are input, encoder, latent space, decoder and output.

Figure 8-1

A high-level autoencoder architecture

A flow diagram highlights the encoding and decoding transformation results. It exhibits evolution of information from the abstracted actions.

Figure 8-2

Transformation of information as a series of encoding and decoding operations

Of course, this interpretation of encoding and decoding is very broad and more psychological than anything else. In the strict context of computer science, encoding is an operation to represent some information in another form, usually with smaller information content (there are few applications for encoding techniques that make the storage size larger). Decoding, in turn, “undoes” the encoding operation to recover the original information. Encoding and decoding are commonly used terms in the context of compression (Figure 8-3). Various computer scientists throughout the decades have proposed very clever algorithms to map information to smaller storage sizes with lossless and lossy reconstruction of the original information, making the transmission of large data like long text, images, and videos across limited information transfer connections feasible.

A flow diagram depicts the sender and receiver. The sender labels are data, encoding scheme and encrypted while that of receiver are encrypted, decoding scheme and reconstructed data.

Figure 8-3

Interpretation of autoencoders as sending and receiving encrypted data

Encoding and decoding in deep learning are a bit of a fusion of these two understandings. Autoencoders are versatile neural network structures consisting of an encoder and a decoder component. The encoder maps the input into a smaller latent/encoded space, and the decoder maps the encoded representation back to the original input. The goal of the autoencoder is to reconstruct the original input as faithfully as possible, that is, to minimize the reconstruction loss. In order to do so, the encoder and decoder need to “work together” to develop an encoding and decoding scheme.

Autoencoders reduce the representation size of the original input information, which can be thought of as a form of lossy compression. However, numerous studies have demonstrated that autoencoders are generally quite bad at compression when compared with human-designed compression schemes. Rather, when we build autoencoders, it is almost always to extract meaningful features at the "core" of the data. The smaller representation size of the latent space compared with the original input emerges only because we need to impose an information bottleneck to force the network to learn meaningful latent features. The architectures in Figures 8-4 and 8-5 maintain constant and expanded information representations compared with the input; such an autoencoder can trivially learn weights that simply pass/carry the input from the input to the output. On the other hand, the architecture in Figure 8-6 must learn nontrivial patterns to compress and reconstruct the original input. The information bottleneck and information compression, therefore, are the means, not the end, of the autoencoder.

A network architecture highlights constant data representation related with the input.

Figure 8-4

A bad autoencoder architecture (latent space representation size is equal to input representation size)

A network architecture represents expanded data in comparison with the input.

Figure 8-5

An even worse autoencoder architecture (latent space representation size is larger than input representation size)

A network diagram depicts the compression and reconstruction of the original input.

Figure 8-6

A good autoencoder architecture (latent space representation size is smaller than input representation size)

Autoencoders are very good at finding higher-level abstractions and features. In order to reliably reconstruct the original input from a much smaller latent space representation, the encoder and decoder must develop a system of mapping that most meaningfully characterizes and distinguishes each input or set of inputs from others. This is no trivial task!

Consider the following autoencoder design scheme adapted for humans (Figure 8-7): person A is the encoder and attempts to “encode” a high-resolution image of a sketch as a natural language description, restricted to N words or less; person B is the decoder and attempts to “decode” the original image person A was looking at by drawing a reconstructed image based on person A’s natural language description. Person A and person B must work together to develop a system to reliably reconstruct the original image.

A flow diagram depicts two images, a photograph of a cat on the left and a diagram on the right.

Figure 8-7

Image-to-text encoding guessing game

Say that you are person B and you are given the following natural language description by person A: “a black pug dressed in a black and white scarf looks at the upper-left region of the camera among an orange background.” For the sake of intuition, it is a worthwhile exercise to try actually playing the role of person B in this game by sketching out/”decoding” the original input.

Figure 8-8 shows the (hypothetical) image that person A encoded into the given natural language description. Chances are that your sketch is very different from the actual image. By performing this exercise, you will have experienced first-hand two key low-level challenges in autoencoding: reconstructing a complex output from a comparatively simpler encoding requires a lot of thinking and conceptual reasoning about the encoding, and the encoding scheme itself needs to effectively communicate both key concepts and precision/positioning information.

A photograph of a dog.

Figure 8-8

What person A was hypothetically looking at when they provided you the natural language encoding. Taken by Charles Deluvio

In this example, the latent space is in the form of language – which is discrete, sequential, and variable-length. Most autoencoders for tabular data use latent spaces that satisfy none of these attributes: they are (quasi-)continuous, read and generated all at once rather than sequentially, and fixed-length. With these restrictions lifted, general autoencoders can reliably find effective encoding and decoding schemes, but the two-player game is still good intuition for thinking through challenges associated with autoencoder training.

Although autoencoders are relatively simple neural network architectures, they are incredibly versatile. In this section, we will begin with the plain “vanilla” autoencoder and move to more complex forms and applications of autoencoders.

Vanilla Autoencoders

Let’s begin with the traditional understanding of an autoencoder, which merely consists of an encoder and a decoder component working together to translate the input into a latent representation and then back into original form. The value of autoencoders will become clearer in following sections, in which we will use autoencoders to substantively improve model training.

The goal of this subsection is not only to demonstrate and implement autoencoder architectures but also to understand implementation best practices and to perform technical investigations and explorations into how and why autoencoders work.

Autoencoders are traditionally applied to image and text-based datasets, because this sort of data often features semantic concepts that should take a smaller amount of space to represent than is used in raw form. For instance, consider the following approximately 3000-by-3000 pixel image of a line (Figure 8-9).

An image of a line embedded in a square.

Figure 8-9

An image of a line

This image contains nine million pixels, meaning we are representing the concept of this line with nine million data values. However, in actuality we can express any line with just four numbers: a slope, a y-intercept, a lower x bound, and a higher x bound (or a starting x point, a starting y point, an ending x point, and an ending y point). If we were to design an encoding and decoding scheme set, the encoder would identify these four parameters – yielding a very compact four-dimensional latent space – and the decoder would redraw the line given those four parameters. By collecting higher-level abstract latent features from the semantics represented in the images, we are able to represent the dataset more compactly. We’ll revisit this example later in the subsection.

Notice, however, that the autoencoder’s reconstruction capability is conditional on the existence of structural similarities (and differences) within the dataset. An autoencoder cannot reliably reconstruct an image of random noise, for instance.

The MNIST dataset is a particularly useful demonstration of autoencoders. It is technically visual/image-based, which is useful for understanding various autoencoder forms and applications (given that autoencoders are most well developed for images). However, it spans a small enough number of features and is structurally simple enough such that we can model it without any convolutional layers. Thus, the MNIST dataset serves as a nice link between the image and tabular data worlds. Throughout this section, we’ll use the MNIST dataset as an introduction to autoencoder techniques before demonstrating applications to “real” tabular/structured datasets.

Let’s begin by loading the MNIST dataset from Keras datasets (Listing 8-1).
from keras.datasets.mnist import load_data
(x_train, y_train), (x_valid, y_valid) = load_data()
x_train = x_train.reshape(len(x_train),784)/255
x_valid = x_valid.reshape(len(x_valid),784)/255
Listing 8-1

Loading the MNIST dataset

Recall that the key feature of an autoencoder is an information bottleneck. We want to begin from the original representation size, progressively force the information flow into smaller vector sizes, and then progressively force the information back into the original size. Such a design is simple to quickly implement in Keras, where we can successively decrease and increase the number of nodes in a sequence of fully connected layers (Listing 8-2).
import keras.layers as L
from keras.models import Sequential
# define architecture
model = Sequential()
model.add(L.Input((784,)))
model.add(L.Dense(256, activation='relu'))
model.add(L.Dense(64, activation='relu'))
model.add(L.Dense(32, activation='relu'))
model.add(L.Dense(64, activation='relu'))
model.add(L.Dense(256, activation='relu'))
model.add(L.Dense(784, activation='sigmoid'))
# compile
model.compile(optimizer='adam',
              loss='binary_crossentropy')
# fit
model.fit(x_train, x_train, epochs=1,
          validation_data=(x_valid, x_valid))
Listing 8-2

Building an autoencoder sequentially

The architecture is visualized in Figure 8-10.

A framework architecture visualises features and layers of an autoencoder.

Figure 8-10

A sequential autoencoder architecture

There are a few features of this autoencoder architecture to note. Firstly, the output activation of the autoencoder is a sigmoid function, but this is only because the input vector has values ranging from 0 to 1 (recall that we scaled the dataset upon loading in Listing 8-1). If we had not scaled the dataset as such, we would need to change the activation function such that the network could feasibly predict in the entire domain of possible values. If the input values are all nonnegative but not bounded above by 1, ReLU may be a good choice of output activation. If the inputs contain both positive and negative values, using a plain linear activation may be the easiest possible option. Moreover, the loss function chosen must reflect the output activation. Since our particular example contains outputs between 0 and 1 and the distribution of values is more or less binary (i.e., most values are very close to 0 or 1, as shown in Figure 8-11), binary cross-entropy is a suitable loss to apply. We can treat reconstruction as a series of binary classification problems, one for each pixel in the original input.

A bar graph represents the construction of the binary classification of the pixel in its original input. It depicts that maximum values are near 0 or 1.

Figure 8-11

Distribution of pixel values (scaled between 0 and 1) in the MNIST dataset

However, in other cases, reconstruction is more of a regression problem in which the distribution of possible values is not binarized toward the ends of the domains but rather more spread out. This is common in more complex image datasets (Figure 8-12) and in many tabular datasets (Figure 8-13).

A frequency distribution represents the possible values that are not binarised in a complex image datasets.

Figure 8-12

Distribution of pixel values (scaled between 0 and 1) from a set of images in CIFAR-10

A bar graph represents the distribution of values in tabular datasets.

Figure 8-13

Distribution of values for a feature in the Higgs Boson dataset (we will work with this dataset later in the chapter)

In these cases, it is more suitable to use a regression loss, like the generic Mean Squared Error or a more specialized alternative (e.g., Huber). Refer to Chapter 1 for a review of regression losses.
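
As a sketch of what that might look like for unscaled inputs (assuming the model's output layer uses a linear activation), one could compile with a regression loss such as Huber; the delta value shown is an arbitrary choice, not a recommendation.
from keras.losses import Huber
# pair a linear output activation with a regression-style reconstruction loss
model.compile(optimizer='adam', loss=Huber(delta=1.0))  # or simply loss='mse'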

Autoencoders are generally easier to work with when implemented in compartmentalized form. Rather than simply constructing the autoencoder as a continuous stack of layers with a bottleneck, we can build encoder and decoder models/components and chain them together to form a complete autoencoder (Listing 8-3).
from keras.models import Model
# define architecture components
encoder = Sequential(name='encoder')
encoder.add(L.Input((784,)))
encoder.add(L.Dense(256, activation='relu'))
encoder.add(L.Dense(64, activation='relu'))
encoder.add(L.Dense(32, activation='relu'))
decoder = Sequential(name='decoder')
decoder.add(L.Input((32,)))
decoder.add(L.Dense(64, activation='relu'))
decoder.add(L.Dense(256, activation='relu'))
decoder.add(L.Dense(784, activation='sigmoid'))
# define model architecture from components
ae_input = L.Input((784,), name='input')
ae_encoder = encoder(ae_input)
ae_decoder = decoder(ae_encoder)
ae = Model(inputs = ae_input,
           outputs = ae_decoder)
# compile
ae.compile(optimizer='adam',
           loss='binary_crossentropy') # note that in other situations other losses may be more suitable
Listing 8-3

Building an autoencoder with compartmentalized design

This method of construction is philosophically more desirable because it reflects our understanding of the autoencoder structure as meaningfully composed of a separate encoding and decoding component. When we visualize our architecture, we obtain a much cleaner high-level breakdown of the autoencoder model (Figure 8-14).

A visualised architecture model diagram represents the advanced level breakdown of the autoencoder model. It has labels namely the input, encoder sequential and decoder sequential layers.

Figure 8-14

Visualization of the compartmentalized model

However, using compartmentalized design is incredibly helpful because we can reference the encoder and decoder components separately from the autoencoder. For instance, if we desire to obtain the encoded representation for an input, we can simply call encoder.predict(…) on our input. The encoder and decoder are used to build the autoencoder; after the autoencoder is trained, the encoder and decoder still exist as references to components of that (now trained) autoencoder. The alternative would be to go searching for the latent space layer of the model and create a temporary model to run predictions, in a similar approach to the demonstration in Chapter 4 used to visualize learned convolutional transformations in CNNs. Similarly, if we desire to decode a latent space vector, we can simply call decoder.predict(…) on our sample latent vector.
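
For example, after training the compartmentalized autoencoder from Listing 8-3, one could round-trip a validation sample through the two components directly:
# encode one validation sample into the 32-dimensional latent space, then decode it back
latent = encoder.predict(x_valid[:1])       # shape (1, 32)
reconstruction = decoder.predict(latent)    # shape (1, 784)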

For instance, Listing 8-4 demonstrates visualization (Figures 8-15 through 8-18) of the internal state and reconstruction of the autoencoder created in Listing 8-3 after training.
import matplotlib.pyplot as plt
for i in range(10):
    plt.figure(figsize=(10, 5), dpi=400)
    plt.subplot(1, 3, 1)
    plt.imshow(x_valid[i].reshape((28, 28)))
    plt.axis('off')
    plt.title('Original Input')
    plt.subplot(1, 3, 2)
    plt.imshow(encoder.predict(x_valid[i:i+1]).reshape((8, 4)))
    plt.axis('off')
    plt.title('Latent Space (Reshaped)')
    plt.subplot(1, 3, 3)
    plt.imshow(ae.predict(x_valid[i:i+1]).reshape((28, 28)))
    plt.axis('off')
    plt.title('Reconstructed')
    plt.show()
Listing 8-4

Visualizing the input, latent space, and reconstruction of an autoencoder

A visualisation of images represents the shape of the internal state of encoder. It exhibits the results from original input, latent space and reconstructed.

Figure 8-15

Sample latent shape and reconstruction for the digit “7”

A visual representation of reconstruction for the digit 1 created after the training stage.

Figure 8-16

Sample latent shape and reconstruction for the digit “1”

An image represents the decoding of sample shapes in the latent space and reconstruction stage of the number two. It exhibits the results from original input, latent space, and reconstructed.

Figure 8-17

Sample latent shape and reconstruction for the digit “2”

An image depicts the reconstruction of the encoder formed after training according to the listing. It exhibits the results from original input, latent space and reconstructed.

Figure 8-18

Sample latent shape and reconstruction for the digit “5”

When we build standard neural networks of which we may want multiple variants with small differences, it is often useful to create a "builder" or "constructor." The two key parameters of an autoencoder are the input size and the latent space size. Given these two "determining" parameters, we can infer how we generally want information to flow. For instance, halving the information space in each subsequent layer in the encoder (and doubling in the decoder) is a good generic update rule.

Let the input size be I, and let the latent space size be L. In order to maintain this rule, we want all intermediate layers to use nodes as multiples of L. Consider the case in which I = 4L, for instance (Figure 8-19).

An infographic image represents the intermediate layers that is used to denote nodes as multiples of L.

Figure 8-19

Visualization of a “halving” autoencoder architecture logic

We see that the number of layers needed to either reduce the input to the latent space or to expand the latent space to the output is
$$ \log_2 \frac{I}{L} $$

This simple expression measures how many times we need to multiply L by 2 in order to reach I.

However, it will often be the case that $$ \frac{I}{L} \notin \mathbb{Z} $$ (i.e., L does not divide evenly into I), in which case our earlier logarithmic expression will not be an integer. In these cases, we have a simple fix: we can cast the input to a layer with N nodes, where N = 2^k · L for the largest integer k such that N < I. For instance, if I = 4L + 8, we first "cast" down to 4L and execute our standard halving policy from that point (Figure 8-20).

A flow diagram depicts execution of halving policy from the point 4 L.

Figure 8-20

Adapting the halving autoencoder logic to inputs that are not powers of 2

To accommodate cases in which $$ \log_2 \frac{I}{L} \notin \mathbb{Z} $$ (i.e., the input size is not a power-of-2 multiple of the latent size), we can modify our expression for the number of layers required by wrapping it with the floor function:
$$ \left\lfloor \log_2 \frac{I}{L} \right\rfloor $$
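
As a quick sanity check of this rule with the MNIST-sized values used throughout this section (I = 784, L = 32), a two-line calculation confirms the layer count:
import numpy as np
I, latent = 784, 32
print(int(np.floor(np.log2(I / latent))))  # 4 -> four halving (or doubling) steps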
Using this halving/doubling information flow logic, we can create a generalized buildAutoencoder function that constructs a feed-forward autoencoder given an input size and a latent size (Listing 8-5).
import numpy as np

def buildAutoencoder(inputSize=784, latentSize=32,
                     outActivation='sigmoid'):
    # define architecture components
    encoder = Sequential(name='encoder')
    encoder.add(L.Input((inputSize,)))
    for i in range(int(np.floor(np.log2(inputSize/latentSize))), -1, -1):
        encoder.add(L.Dense(latentSize * 2**i, activation='relu'))
    decoder = Sequential(name='decoder')
    decoder.add(L.Input((latentSize,)))
    for i in range(1, int(np.floor(np.log2(inputSize/latentSize))) + 1):
        decoder.add(L.Dense(latentSize * 2**i, activation='relu'))
    decoder.add(L.Dense(inputSize, activation=outActivation))
    # define model architecture from components
    ae_input = L.Input((inputSize,), name='input')
    ae_encoder = encoder(ae_input)
    ae_decoder = decoder(ae_encoder)
    ae = Model(inputs=ae_input, outputs=ae_decoder)
    return {'model': ae, 'encoder': encoder, 'decoder': decoder}
Listing 8-5

A general function to construct an autoencoder architecture given an input size and a desired latent space, constructed using halving/doubling architectural logic. Note this implementation also has an outActivation parameter in cases where our output is not between 0 and 1

Rather than just returning the model, we also return the encoder and decoder. Recall from earlier discussion of compartmentalized design that retaining a reference to the encoder and decoder components of the autoencoder can be helpful. If not returned, these references – created internally inside the function – will be lost and irretrievable.

Having a generalized autoencoder creation function allows us to perform larger-scale autoencoder experiments. One particularly important phenomenon to understand is the trade-off between model performance and the latent size. As previously mentioned, the latent size must be configured properly such that the task is challenging enough to force the autoencoder to develop meaningful and nontrivial representations, but also feasible enough such that the autoencoder can gain traction at solving the problem (rather than stagnating and not learning anything at all due to the difficulty of the reconstruction problem). Let's train several autoencoders on the MNIST dataset with bottleneck sizes 2^n where n ∈ {1, 2, …, ⌊log2 I⌋} (the last bottleneck size being the largest power of 2 less than the original input size) and obtain each one's validation performance (Listing 8-6, Figure 8-21).
import keras
from tqdm import tqdm

inputSize = 784
earlyStopping = keras.callbacks.EarlyStopping(monitor='loss',
                                              patience=5)
latentSizes = list(range(1, int(np.floor(np.log2(inputSize)))))
validPerf = []
for latentSize in tqdm(latentSizes):
    model = buildAutoencoder(inputSize, 2**latentSize)['model']
    model.compile(optimizer='adam', loss='binary_crossentropy')
    history = model.fit(x_train, x_train, epochs=50,
                        callbacks=[earlyStopping], verbose=0)
    score = keras.metrics.MeanAbsoluteError()
    score.update_state(model.predict(x_valid), x_valid)
    validPerf.append(score.result().numpy())
plt.figure(figsize=(15, 7.5), dpi=400)
plt.plot(latentSizes, validPerf, color='red')
plt.ylabel('Validation Performance')
plt.xlabel('Latent Size (power of 2)')
plt.grid()
plt.show()
Listing 8-6

Training autoencoders with varying latent space sizes and observing the performance trend

A line graph of validation performance against Latent size represents a sloping curve.

Figure 8-21

Relationship between the latent size of a tabular autoencoder (2^x neurons) and the validation performance. Note the diminishing returns

The diminishing returns for larger latent sizes are readily apparent. As the latent size increases, the benefit we reap from each additional increase shrinks. This phenomenon holds generally for deep learning models (recall "Deep Double Descent" from Chapter 1, which similarly compared model size vs. performance in a supervised domain with CNNs).

We can do one better and visualize the differences in the learned latent representations for different bottleneck sizes. The latent representations for the training set after the autoencoder has been trained can be obtained via encoder.predict(x_train). Of course, the latent representations will be in different dimensions for each autoencoder. We can use the t-SNE method (introduced in Chapter 2) to visualize these latent spaces (Listing 8-7, Figures 8-22 through 8-30).
from sklearn.manifold import TSNE
inputSize = 784
earlyStopping = keras.callbacks.EarlyStopping(monitor='loss',
                                              patience=5)
latentSizes = list(range(1, int(np.floor(np.log2(inputSize))) + 1))
for latentSize in tqdm(latentSizes):
    modelSet = buildAutoencoder(inputSize, 2**latentSize)
    model = modelSet['model']
    encoder = modelSet['encoder']
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x_train, x_train, epochs=50,
              callbacks=[earlyStopping], verbose=0)
    transformed = encoder.predict(x_train)
    tsne_ = TSNE(n_components=2).fit_transform(transformed)
    plt.figure(figsize=(10, 10), dpi=400)
    plt.scatter(tsne_[:,0], tsne_[:,1], c=y_train)
    plt.show()
    plt.close()
Listing 8-7

Plotting a t-SNE representation of the latent space of autoencoders with varying latent space sizes

A visualisation of variation of latent representation for the given training set.

Figure 8-22

t-SNE projection of a latent space for an autoencoder with a bottleneck size of two nodes trained on MNIST. Note that in this case, we are projecting into a number of dimensions (2) equal to the dimensionality of the latent space itself (2), hence the pretty snake-like arrangements

An image represents size of the bottleneck in relation with the algorithm.

Figure 8-23

t-SNE projection of a latent space for an autoencoder with a bottleneck size of four nodes trained on MNIST

An image plotted against t S N E and latent space of autoencoders with size of eight nodes.

Figure 8-24

t-SNE projection of a latent space for an autoencoder with a bottleneck size of eight nodes trained on MNIST

An image within a square represents the bottleneck size of sixteen nodes.

Figure 8-25

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 16 nodes trained on MNIST

An image represents latent space with the bottleneck size of 32 nodes.

Figure 8-26

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 32 nodes trained on MNIST

A t S N E projection image depicts the bottleneck size of 64 nodes after training set.

Figure 8-27

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 64 nodes trained on MNIST

A projection image displays 128 nodes training set that results from an autoencoder.

Figure 8-28

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 128 nodes trained on MNIST

An image depicts varying shapes resulting from t S N E project of size of 256 nodes.

Figure 8-29

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 256 nodes trained on MNIST

A visual representation of S N E projection of an autoencoder with bottleneck size of 516 nodes.

Figure 8-30

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 512 nodes trained on MNIST

Note

If we had loaded the model as model = buildAutoencoder(784, 32)['model'] and the encoder as encoder = buildAutoencoder(784, 32)['encoder'], we indeed would obtain a model architecture and an encoder architecture – but they wouldn't be "linked." Each call to buildAutoencoder constructs a fresh, independent set of components, so the stored model would be associated with an encoder that we haven't captured, and the stored encoder would be part of an overarching model that we haven't captured. Thus, we make sure to store the entire set of model components into modelSet first.

Each individual point is colored by the target label (i.e., the digit associated with the data point) for the purpose of exploring the autoencoder’s ability to implicitly “cluster” points of the same digit together or separate them, even though the autoencoder was never exposed to the labels. Observe that as the dimensionality of the latent space increases, the overlap between data samples of different digits decreases until there is functionally complete separation between digits of different classes.

If we build an architecture in which the input is expanded rather than compressed and visualize a dimensionality reduction of the latent space (Listing 8-8), we find that the learned representations are significantly less meaningful (Figure 8-31) – despite this architecture obtaining very high performance (i.e., low training error).
# build the overcomplete architecture in compartmentalized form so that
# we can reference its (expanded) latent space directly
encoder = Sequential(name='encoder')
encoder.add(L.Input((784,)))
encoder.add(L.Dense(1024, activation='relu'))
encoder.add(L.Dense(2048, activation='relu'))
decoder = Sequential(name='decoder')
decoder.add(L.Input((2048,)))
decoder.add(L.Dense(1024, activation='relu'))
decoder.add(L.Dense(784, activation='sigmoid'))
inp = L.Input((784,))
model = Model(inputs=inp, outputs=decoder(encoder(inp)))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_train, x_train, epochs=50)
transformed = encoder.predict(x_train)
tsne_ = TSNE(n_components=2).fit_transform(transformed)
plt.figure(figsize=(10, 10), dpi=400)
plt.scatter(tsne_[:,0], tsne_[:,1], c=y_train)
plt.show()
Listing 8-8

Training and visualizing the latent space of an overcomplete, architecturally redundant autoencoder architecture. This particular architecture has slightly over 5.8 million parameters!

An image visualises compressed latent space reduction. It indicates high performance with low training errors.

Figure 8-31

t-SNE projection of a latent space for an overcomplete autoencoder with a bottleneck size of 2048 trained on MNIST

Let's revisit the example given at the beginning of this subsection: reconstruction of an image of a line. Listing 8-9 generates a dataset of 50-by-50 images with randomly placed line segments using the image processing library cv2.
import cv2
x = np.zeros((1024, 50, 50))
for i in range(1024):
    start = [np.random.randint(0, 50), np.random.randint(0, 50)]
    end = [np.random.randint(0, 50), np.random.randint(0, 50)]
    x[i,:,:] = cv2.line(x[i,:,:], start, end, color=1, thickness=4)
x = x.reshape((1024, 50 * 50))
Listing 8-9

Generating a dataset of 50-by-50 images of lines

Since, in principle, we can represent each line segment with just four values, we'll build and train an autoencoder with four neurons in the latent space on this dataset (Listing 8-10).
modelSet = buildAutoencoder(50 * 50, 4)
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, x, epochs=400, validation_split=0.2)
Listing 8-10

Fitting a simple autoencoder on the synthetic toy line dataset

The model reaches near 0.03 binary cross-entropy, which is quite good. Its reconstructions are very accurate (Figure 8-32).

An image represents binary cross-entropy that results in accurate reconstructions. It depicts a line in different dimensions.

Figure 8-32

Left column: original input images of lines. Right column: reconstructions via an autoencoder with a latent space dimensionality of 4

In fact, an autoencoder trained with only two latent neurons does a decent job at identifying the general shape of the line marked in the input (Figure 8-33). If you look closely, you will notice the silhouettes of other lines. There are many hypotheses to explain their presence. One possibility is that the autoencoder has "memorized"/"internalized" a set of generally useful "landmark" samples that it maps inputs to during prediction, and that a larger latent space would allow more precise placement information to pass through.
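
For reference, this two-neuron variant can be trained in exactly the same way as in Listing 8-10; a brief sketch (the model2/encoder2 names are our own, not from the earlier listings):
modelSet2 = buildAutoencoder(50 * 50, 2)   # same synthetic line dataset, two latent neurons
model2, encoder2 = modelSet2['model'], modelSet2['encoder']
model2.compile(optimizer='adam', loss='binary_crossentropy')
model2.fit(x, x, epochs=400, validation_split=0.2)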

An image represents input received as a straight line from an autoencoder trained set with two neurons.

Figure 8-33

Left column: original input images of lines. Right column: reconstructions via an autoencoder with a latent space dimensionality of 2

Finally, let’s explore how we can apply autoencoders to a strictly tabular dataset – the Mice Protein Expression dataset, used in previous chapters (Listing 8-11).
from sklearn.model_selection import train_test_split as tts
mpe_x = df.drop('class', axis=1)
mpe_y = df['class']
mpe_x_train, mpe_x_valid, mpe_y_train, mpe_y_valid = tts(mpe_x, mpe_y,
                                                train_size=0.8,
                                                random_state=42)
Listing 8-11

Splitting the dataset into training and validation sets

Recall that we need to look at the input data in order to gauge how to deal with the model output in autoencoders. If we call mpe_x_train.min(), Pandas returns a series with the minimum value per column.
DYRK1A_N     0.156849
ITSN1_N      0.261185
BDNF_N       0.115181
NR1_N        1.330831
NR2A_N       1.737540
               ...
H3MeK4_N     0.101787
CaNA_N       0.586479
Genotype     1.000000
Treatment    1.000000
Behavior     1.000000
Length: 80, dtype: float64
Calling .min() again takes the minimum of the minimums across columns. We find that the smallest value across the entire dataset is –0.062007874, whereas the maximum is 8.482553422. Since values can theoretically be negative, we’ll use a linear output activation instead of a ReLU and optimize using the standard Mean Squared Error loss for regression problems (Listing 8-12).
modelSet = buildAutoencoder(len(mpe_x.columns), 8,
                       outActivation='linear')
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
history = model.fit(mpe_x_train, mpe_x_train, epochs=150)
Listing 8-12

Fitting an autoencoder on the Mice Protein Expression dataset

After 150 epochs of training, which progresses very quickly (this is a comparatively small dataset), the autoencoder obtains good training and validation performance (Table 8-1, Figure 8-34).
Table 8-1

Performance of the autoencoder trained on the Mice Protein Expression dataset

 

                      Train     Validation
Mean Squared Error    0.0117    0.0118
Mean Absolute Error   0.0626    0.0625

A graph represents validation and training performance received after 150 epochs of training sets. It exhibits an L shaped curve.

Figure 8-34

Training history of an autoencoder trained on the Mice Protein Expression dataset

Figure 8-35 demonstrates some sample latent vectors and reconstructions made by our autoencoder, with the input and reconstructed vectors reshaped into 8-by-10 grids for more convenient viewing.
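
A rough sketch of how panels like those in Figure 8-35 could be drawn, assuming the trained model and encoder from Listing 8-12 (the reshapes map the 80 features into an 8-by-10 grid and the 8-dimensional latent vector into a 2-by-4 grid):
sample = mpe_x_valid.values[:1]                         # one validation row, shape (1, 80)
plt.figure(figsize=(10, 5), dpi=400)
plt.subplot(1, 3, 1)
plt.imshow(sample.reshape((8, 10))); plt.axis('off'); plt.title('Original (Reshaped)')
plt.subplot(1, 3, 2)
plt.imshow(encoder.predict(sample).reshape((2, 4))); plt.axis('off'); plt.title('Latent Space')
plt.subplot(1, 3, 3)
plt.imshow(model.predict(sample).reshape((8, 10))); plt.axis('off'); plt.title('Reconstructed')
plt.show()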

An image pattern highlights the latent vectors samples and reconstruction created by the autoencoder. It exhibits the vectors into 8 by 10 grids.

Figure 8-35

Samples and the associated latent vector and reconstruction by an autoencoder trained on the Mice Protein Expression dataset. Samples and reconstructions are represented in two spatial dimensions for convenience of viewing

We can employ a similar technique as previously employed on the MNIST dataset – visualizing the latent space of an autoencoder using t-SNE. Each data point in Figure 8-36 is colored by one of the eight classes each row in the Mice Protein Expression dataset falls into. This tabular autoencoder obtains pretty good separation between classes without any exposure to the labels.
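
A sketch of how this projection could be produced, assuming the encoder from Listing 8-12 and numerically encoded class labels:
from sklearn.manifold import TSNE
latent = encoder.predict(mpe_x_train)
proj = TSNE(n_components=2).fit_transform(latent)
plt.figure(figsize=(10, 10), dpi=400)
plt.scatter(proj[:, 0], proj[:, 1], c=mpe_y_train)      # color each point by its class
plt.show()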

A scatter diagram visualises data points obtained as a result of tabular autoencoder.

Figure 8-36

t-SNE projection of a latent space for an autoencoder trained on the Mice Protein Expression dataset

Note that a more formal/rigorous tabular autoencoder design would require us to standardize or normalize all columns to within the same domain. Tabular datasets often contain features that operate on different scales; for instance, say feature A represents a proportion (i.e., between 0 and 1, inclusive), whereas feature B measures years (i.e., likely larger than 1000). Regression losses simply take the mean error across all columns, which means that the reward for correctly reconstructing A is negligible compared to reconstructing feature B. In this case, however, all columns are in roughly the same range, so skipping this step is tolerable.
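
A minimal sketch of what that preprocessing might look like with scikit-learn (the scaler choice is an assumption; StandardScaler is an equally reasonable alternative):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
mpe_x_train_scaled = scaler.fit_transform(mpe_x_train)  # fit scaling statistics on training data only
mpe_x_valid_scaled = scaler.transform(mpe_x_valid)      # reuse the training statistics
With all features mapped into [0, 1], a sigmoid output activation (and, if the features are near-binary, binary cross-entropy) becomes a natural pairing again.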

In the next subsection, we will explore a direct application of autoencoders to concretely improve the performance of supervised models.

Autoencoders for Pretraining

Vanilla autoencoders, as we have already seen, can do some pretty cool things. We saw that a vanilla autoencoder trained on various datasets can perform implicit clustering and separation of digits without ever being exposed to the labels themselves. Rather, natural differences in the input resulting from differences in labels are independently observed and implicitly recognized by the autoencoder.

This sort of impressive feature extraction capability is valuable in the context of training neural networks to perform supervised tasks. Say we want a neural network to classify digits from the MNIST dataset. If we start from scratch, we are asking the neural network to learn both how to extract the optimal set of features and how to interpret them – all at once, with no prior information. However, we see that the encoder of an autoencoder trained on the MNIST dataset has developed an impressive feature extraction and class separation scheme. We can use the encoder of the autoencoder as a pretraining instrument; rather than building and training a new network that learns both extraction and interpretation from scratch, we can simply append a model component to the output of the encoder to interpret the already-learned feature extractor (i.e., the encoder) (Figure 8-37).

An infographic image illustrates autoencoder and task training sets. It describes the features of the extractor.

Figure 8-37

Schematic of multistage pretraining

In the first stage of training, we train the autoencoder on the standard input reconstruction task. After sufficient training, we can extract the encoder and append an “interpretation”-focused model component that assembles and arranges the features extracted by the encoder into the desired output.

During stage 2, we impose layer freezing upon the encoder, meaning that we prevent its weights from being trained. This is to retain the learned structures of the encoder. We spent a significant amount of effort obtaining a good feature extractor; if we do not impose layer freezing, we will find that optimizing a good feature extractor connected to a very poor (randomly initialized) feature interpreter degrades the feature extractor.

However, once good performance is obtained on training with a frozen feature extractor and a trainable feature interpreter, the entire model can be trained for a few epochs for the purposes of fine-tuning (Figure 8-38). The idea here is that the feature interpreter has developed a good relationship with the static feature extractor, but now both can be jointly optimized to improve the relationship. (Just like couples in relationships, it’s not healthy if one partner is always static!)

A flow diagram depicts the performance achieved as a result of the feature interpreter. It has labels namely primary training and fine-tuning.

Figure 8-38

Freezing followed by fine-tuning can be an effective way to perform autoencoder pretraining.

Let’s begin by demonstrating autoencoder pretraining on MNIST. We’ll use the buildAutoencoder function defined previously to fit an autoencoder, making sure to retain references to both the original model and the encoder (Listing 8-13).
modelSet = buildAutoencoder(784, 32)
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_train, x_train, epochs=20)
Listing 8-13

Training an autoencoder on MNIST

After the model has been sufficiently trained, we can extract the encoder and stack it as the feature extraction unit/component of our task model (Listing 8-14). The outputs of the encoder (named encoded in the following script) are further interpreted via several fully connected layers. The encoder is set not to be trainable (i.e., layer freezing). The task model is trained on the original supervised task.
inp = L.Input((784,))
encoded = encoder(inp)
dense1 = L.Dense(16, activation='relu')(encoded)
dense2 = L.Dense(16, activation='relu')(dense1)
dense3 = L.Dense(10, activation='softmax')(dense2)
encoder.trainable = False  # freeze the pretrained encoder
task_model = Model(inputs=inp, outputs=dense3)
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy')
history = task_model.fit(x_train, y_train, epochs=50)
Listing 8-14

Repurposing the encoder of the autoencoder as the frozen encoder/feature extractor of a supervised network

After sufficient training, it is common practice to make the encoder trainable again and fine-tune the entire architecture in an end-to-end fashion (Listing 8-15).
encoder.trainable = True
task_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')  # recompile to apply unfreezing
task_model.fit(x_train, y_train, epochs=5)
Listing 8-15

Fine-tuning the whole supervised network by unfreezing the encoder

We often reduce the learning rate on fine-tuning tasks to prevent destruction/”overwriting” of information learned during the pretraining process. This can be accomplished by recompiling the model after pretraining with an optimizer configured with a different initial learning rate.
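
Concretely, the fine-tuning step in Listing 8-15 could be recompiled with a smaller learning rate along these lines (the specific value of 1e-4 is an arbitrary assumption):
from keras.optimizers import Adam
# recompile with a reduced learning rate before fine-tuning the unfrozen model
task_model.compile(optimizer=Adam(learning_rate=1e-4),
                   loss='sparse_categorical_crossentropy')
task_model.fit(x_train, y_train, epochs=5)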

We can compare the performance of this model to one with no pretraining (i.e., begins learning in a supervised fashion from scratch) (Listing 8-16, Figure 8-39).
modelSet = buildAutoencoder(784, 32)
model = modelSet['model']
encoder = modelSet['encoder']  # untrained encoder: no pretraining this time
inp = L.Input((784,))
encoded = encoder(inp)
dense1 = L.Dense(16, activation='relu')(encoded)
dense2 = L.Dense(16, activation='relu')(dense1)
dense3 = L.Dense(10, activation='softmax')(dense2)
task_model = Model(inputs=inp, outputs=dense3)
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy')
history2 = task_model.fit(x_train, y_train, epochs=20)
plt.figure(figsize=(15, 7.5), dpi=400)
plt.plot(history.history['loss'], color='red',
         label='With AE Pretraining')
plt.plot(history2.history['loss'], color='blue',
         label='Without AE Pretraining')
plt.grid()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Listing 8-16

Training a supervised model with the same architecture as the model with pretraining, but without pretraining the encoder via an autoencoding task

A line graph represents the comparison of the given model during the pretraining stage.

Figure 8-39

Comparing the training curves for a classifier trained on the MNIST dataset with and without autoencoder pretraining

The MNIST dataset is relatively simple, so both models converge relatively quickly to good weights. However, the model with pretraining is noticeably "ahead" of the other. By taking the difference between the epochs at which the models with and without pretraining reach some loss value, we can estimate how "far ahead" a model with autoencoder pretraining is. For any loss value p reached after at least one epoch of training, the model with pretraining reaches p two to four epochs before the model without pretraining.
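
One rough way to quantify this gap from the two training histories, assuming history and history2 were captured as in Listings 8-14 and 8-16 (the loss thresholds below are arbitrary examples):
import numpy as np
pre = np.array(history.history['loss'])    # with autoencoder pretraining
base = np.array(history2.history['loss'])  # without pretraining
for p in [0.3, 0.2, 0.1]:                  # example loss thresholds
    if (pre <= p).any() and (base <= p).any():
        print(p, int(np.argmax(base <= p)) - int(np.argmax(pre <= p)), 'epochs ahead')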

This process is admittedly superfluous on the MNIST dataset, which has a comparatively simple set of rules in a comparatively small number of dimensions. However, the advantage manifests more significantly for more complex datasets, as has been shown with more advanced computer vision and natural language processing tasks. Neural networks trained to perform large-scale image classification (e.g., ImageNet), for instance, benefit significantly from performing an autoencoder pretraining task that learns useful latent features that are later interpreted and fine-tuned. Similarly, it has been shown that language models learn important fundamental structures of language by performing reconstruction tasks, which can later be used as the basis for a supervised task like text classification or generation (Figure 8-40).

A model diagram represents reconstruction tasks that can be used as a basis for generation and text classification.

Figure 8-40

General transfer learning/pretraining design used dominantly in computer vision

Recall, for instance, the Inception and EfficientNet models discussed in Chapter 4. Keras allows users to load weights from a model trained on ImageNet because the feature extraction “skills” required to perform well on a wide-ranging task like ImageNet are valuable or can be adapted to become valuable in most computer vision tasks.

However, as we have previously seen in Chapters 4 and 5, the success of a deep learning method on complex image and natural language data does not necessarily bar it from being useful to tabular data applications too.

Let’s consider the Mice Protein Expression dataset. We can begin by instantiating and training a sample autoencoder (Listing 8-17).
modelSet = buildAutoencoder(len(mpe_x_train.columns), 32,
                            outActivation='linear')
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='mse')
history = model.fit(mpe_x_train, mpe_x_train, epochs=50)
Listing 8-17

Building and training an autoencoder on the Mice Protein Expression dataset

We can now create and fit a task model using the trained encoder in two phases, the first in which the encoder is frozen and the second in which the encoder is trainable (Listing 8-18, Figure 8-41).
inp = L.Input((len(mpe_x_train.columns),))
encoded = encoder(inp)
dense1 = L.Dense(32, activation='relu')(encoded)
dense2 = L.Dense(32, activation='relu')(dense1)
dense3 = L.Dense(32, activation='relu')(dense2)
dense4 = L.Dense(8, activation='softmax')(dense3)
encoder.trainable = False  # stage 1: freeze the pretrained encoder
task_model = Model(inputs=inp, outputs=dense4)
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
history_i = task_model.fit(mpe_x_train, mpe_y_train-1, epochs=30,
                           validation_data=(mpe_x_valid, mpe_y_valid-1))
encoder.trainable = True   # stage 2: unfreeze and fine-tune
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
history_ii = task_model.fit(mpe_x_train, mpe_y_train-1, epochs=10,
                            validation_data=(mpe_x_valid,
                                             mpe_y_valid-1))
Listing 8-18

Using the pretrained encoder in a supervised task

A line graph represents a task model utilising the trained encoder in the frozen and training phases.

Figure 8-41

Validation and training curves for stages 1 and 2

Alternatively, consider the Higgs Boson dataset. This dataset only has 28 features. If we use our standard autoencoder logic, which halves the number of nodes in each encoder layer and doubles the number of nodes in each decoder layer, we will either need a very small number of layers to use a reasonable latent space size or a very small latent space to use a reasonable number of layers. For instance, if our latent space has only eight features, the autoencoder logic would build only two layers (28 → 16 → 8). On the other hand, if we want a larger number of layers (e.g., five), we would need a very small latent space (e.g., an autoencoder with 28 → 16 → 8 → 4 → 2 → 1). In this case, it's most beneficial to design a custom autoencoder with a sufficiently large latent space and a sufficient number of layers. We could design an autoencoder, for instance, with six layers in the encoder and decoder each and a latent space of 16 dimensions (Listing 8-19).
encoder = Sequential()
encoder.add(L.Input((len(X_train.columns),)))
encoder.add(L.Dense(28, activation='relu'))
encoder.add(L.Dense(28, activation='relu'))
encoder.add(L.Dense(28, activation='relu'))
encoder.add(L.Dense(16, activation='relu'))
encoder.add(L.Dense(16, activation='relu'))
encoder.add(L.Dense(16, activation='relu'))
decoder = Sequential()
decoder.add(L.Input((16,)))
decoder.add(L.Dense(16, activation='relu'))
decoder.add(L.Dense(16, activation='relu'))
decoder.add(L.Dense(16, activation='relu'))
decoder.add(L.Dense(28, activation='relu'))
decoder.add(L.Dense(28, activation='relu'))
decoder.add(L.Dense(28, activation='linear'))
inp = L.Input((28,))
encoded = encoder(inp)
decoded = decoder(encoded)
ae = keras.models.Model(inputs=inp, outputs=decoded)
ae.compile(optimizer='adam', loss='mse', metrics=['mae'])
history = ae.fit(X_train, X_train, epochs=100,
                 validation_data=(X_valid, X_valid))
Listing 8-19

Defining a custom autoencoder architecture for the Higgs Boson dataset

We can treat a static encoder as a feature extractor for our task model (Listing 8-20, Figures 8-42 and Figure 8-43).
inp = L.Input((len(X_train.columns),))
encoded = encoder(inp)
dense1 = L.Dense(16, activation='relu')(encoded)
dense2 = L.Dense(16, activation='relu')(dense1)
dense3 = L.Dense(16, activation='relu')(dense2)
dense4 = L.Dense(1, activation='sigmoid')(dense3)
encoder.trainable = False  # stage 1: freeze the pretrained encoder
task_model = keras.models.Model(inputs=inp, outputs=dense4)
task_model.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
history_i = task_model.fit(X_train, y_train, epochs=70,
                           validation_data=(X_valid, y_valid))
encoder.trainable = True   # stage 2: unfreeze and fine-tune
task_model.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
history_ii = task_model.fit(X_train, y_train, epochs=30,
                            validation_data=(X_valid, y_valid))
Listing 8-20

Using the pretrained encoder as a feature extractor for a supervised task

A line graph represents downward-sloping and erratic curves. It exhibits pre-trained encoder functioning as a feature extractor.

Figure 8-42

Validation and training loss curves for stages 1 and 2

A line graph represents a rising and erratic curve which makes the static encoder a feature extractor.

Figure 8-43

Validation and training accuracy curves for stages 1 and 2

We can observe a significant amount of overfitting in this particular case. We can attempt to improve generalization by employing best practices such as adding dropout or batch normalization.
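
For instance, the task head from Listing 8-20 might be rebuilt with dropout and batch normalization; the rates and placement below are assumptions rather than a prescription:
inp = L.Input((len(X_train.columns),))
encoded = encoder(inp)
x = L.Dense(16, activation='relu')(encoded)
x = L.BatchNormalization()(x)
x = L.Dropout(0.3)(x)                      # 0.3 is an assumed dropout rate
x = L.Dense(16, activation='relu')(x)
x = L.Dropout(0.3)(x)
out = L.Dense(1, activation='sigmoid')(x)
task_model = keras.models.Model(inputs=inp, outputs=out)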

Lastly, it should be noted that using autoencoders for pretraining is a great semi-supervised method. Semi-supervised methods make use of data with and without labels (and are used most often in cases where labeled data is scarce and unlabeled data is abundant). Say you possess three sets of data: X_unlabeled, X_labeled, and y (which corresponds to X_labeled). You can train an autoencoder to reconstruct X_unlabeled and then use the frozen encoder as the feature extractor in a task model to predict y from X_labeled. This technique generally works well even when the size of X_unlabeled is significantly larger than the size of X_labeled; the autoencoding task learns meaningful representations that should be significantly easier to associate with a supervised target than beginning from random initialization.
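
A minimal sketch of that recipe, assuming x_unlabeled, x_labeled, and y_labeled are prepared arrays and that the target happens to be binary:
modelSet = buildAutoencoder(x_unlabeled.shape[1], 16)
ae, enc = modelSet['model'], modelSet['encoder']
ae.compile(optimizer='adam', loss='mse')
ae.fit(x_unlabeled, x_unlabeled, epochs=50)          # pretrain on unlabeled data
enc.trainable = False                                # freeze the pretrained encoder
inp = L.Input((x_labeled.shape[1],))
head = L.Dense(16, activation='relu')(enc(inp))
out = L.Dense(1, activation='sigmoid')(head)
clf = Model(inputs=inp, outputs=out)
clf.compile(optimizer='adam', loss='binary_crossentropy')
clf.fit(x_labeled, y_labeled, epochs=30)             # train the task head on labeled data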

Multitask Autoencoders

Pretraining with autoencoders is often an effective strategy to take advantage of quality learned latent features. However, one criticism of the system is that it proceeds sequentially – autoencoder training takes place at a separate stage than the task training. Multitask autoencoders train the network on the autoencoder task and the intended task simultaneously (hence the name multitask). These autoencoders accept one input that is encoded by the encoder into a latent space. This one set of latent features is decoded separately by two “decoders” into two outputs; one output is dedicated to the autoencoder task, while the other is dedicated to the intended task. The network learns both of these tasks at the same time during training (Figures 8-44 and 8-45).

A model represents output dedicated to the task of the autoencoder. It has labels namely the input data, model and task output.

Figure 8-44

Original task model

A flow diagram depicts the latent features dedicated to the intended task. It has labels namely the input data, encoding layers, and latent space.

Figure 8-45

Multitask learning

By training the autoencoder simultaneously along the task network, we can theoretically experience the benefits of the autoencoder in a dynamic fashion. Say the encoder has “difficulty” encoding features in a way relevant to the task output, which can be difficult. However, the encoder component of the model can still decrease the overall loss by learning features relevant to the autoencoder reconstruction task. These features may provide continuous support for the task output by providing the optimizer a viable path to loss minimization – it is “another way out,” so to speak. Using multitask autoencoders is often an effective technique to avoid or minimize difficult local minimum problems, in which the model makes mediocre to negligible progress in the first few moments of training and then plateaus (i.e., is stuck in a poor local minimum).

In order to construct a multitask autoencoder, we begin by initializing an autoencoder and extracting the encoder and decoder components. We create a “tasker” model that accepts latent features (i.e., data of the shape of the encoder output) and processes them into the task output (i.e., one of ten digits, in the case of MNIST). Each of these components can be linked using functional API syntax to form a complete multitask autoencoder architecture (Listing 8-21, Figure 8-46).
modelSet = buildAutoencoder(784, 32)
model = modelSet['model']
encoder = modelSet['encoder']
decoder = modelSet['decoder']
tasker = keras.models.Sequential(name='taskOut')
tasker.add(L.Input((32,)))
for i in range(3):
    tasker.add(L.Dense(16, activation='relu'))
tasker.add(L.Dense(10, activation='softmax'))
inp = L.Input((784,), name='input')
encoded = encoder(inp)
decoded = decoder(encoded)
taskOut = tasker(encoded)
taskModel = Model(inputs=inp, outputs=[decoded, taskOut])
Listing 8-21

Building a multitask autoencoder for the MNIST dataset

A network architecture depicts the interlinking of functional A P I syntax to create a complete multitask autoencoder.

Figure 8-46

Visualization of a multitask autoencoder architecture

Because the multitask autoencoder has multiple outputs, we need to specify losses and labels for each of the outputs by referencing a particular output's name. In this case, the two outputs have been named "decoder" and "taskOut." The decoder output will be given the original input (i.e., x_train) and optimized with binary cross-entropy, since its objective is to perform pixel-wise reconstruction. The task output will be given the image labels (i.e., y_train) and optimized with sparse categorical cross-entropy, since its objective is to perform multiclass classification (Listing 8-22).
taskModel.compile(optimizer='adam',
                  loss = {'decoder':'binary_crossentropy',
                          'taskOut':'sparse_categorical_crossentropy'})
history = taskModel.fit(x_train, {'decoder':x_train,
                                  'taskOut': y_train},
                        epochs=100)
Listing 8-22

Compiling and fitting the task model

We can observe from the training history that the model is able to reach both a fairly good task loss and a reconstruction loss within just a few dozen epochs (Listing 8-23, Figure 8-47).
plt.figure(figsize=(15, 7.5), dpi=400)
plt.plot(history.history['decoder_loss'], color='red', linestyle='--', label='Reconstruction Loss')
plt.plot(history.history['taskOut_loss'], color='blue', label='Task Loss')
plt.plot(history.history['loss'], color='green', linestyle='-.', label='Overall Loss')
plt.grid()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Listing 8-23

Plotting out different dimensions of the performance over time

A line graph depicts three different curves. It exhibits a similar pattern.

Figure 8-47

Different dimensions of performance (reconstruction loss, task loss, overall loss)

Figures 8-48 to 8-51 visualize how the state of the multitask autoencoder progresses throughout each epoch.

An image interprets the performance of the encoder at every epoch.

Figure 8-48

Multitask autoencoder at zero epochs

An image in a square represents an autoencoder at one epoch.

Figure 8-49

Multitask autoencoder at one epoch

An image in the shape of the number seven at two epochs.

Figure 8-50

Multitask autoencoder at two epochs

An output image represents the progression of multitask encoder at several more epochs.

Figure 8-51

Several more epochs

From these visualizations and the training history, we see that the multitask autoencoder learns the task more quickly than the autoencoding objective that was intended to assist it! For MNIST, the classification task is simply more straightforward than the autoencoding task, which makes sense. In such cases, using a multitask autoencoder is not beneficial; it is probably more effective to train the task network directly or to use an autoencoder for pretraining instead.

We can use an adapted approach on the Mice Protein Expression dataset, in which we see that autoencoding is a more approachable problem than the classification task itself, both from the training history (Figure 8-52) and from the output state progression visualizations (Figures 8-53 through 8-56).

A line graph exhibits the training history of the multitask approach on the Mice Protein Expression dataset.

Figure 8-52

Different dimensions of performance on the Mice Protein Expression dataset

Four images. 3 grid image visualizes various output state progression values of the set of 80 features and one graph plots the Absolute error, truth and predicted values.

Figure 8-53

The state of the multitask autoencoder after zero epochs (i.e., upon initialization). Top: displays the original set of 80 features in the Mice Protein Expression dataset (arranged in a grid for more convenient visual viewing), the output of the decoder (of which the goal is to reconstruct the input), and the absolute error of the reconstruction. Bottom: the predicted and true classes (eight in total) and the absolute probability error

Four images. 3 grid images visualize the various outputs obtained after one epoch and one graph plots the Absolute error, truth, and predicted values.

Figure 8-54

The state of the multitask autoencoder after one epoch

Four images. 3 grid images illustrate the variation in output in a multitask encoder and one graph plots the Absolute error, truth, and predicted values.

Figure 8-55

The state of the multitask autoencoder after five epochs

Four images. 3 grid images display various encoding features at the original input, decoder output, and other stages and one graph plots the Absolute error, truth, and predicted values.

Figure 8-56

The state of the multitask autoencoder after 50 epochs

Figures 8-53 through 8-56 demonstrate the performance of the reconstruction task alongside the classification task at various stages of training. Notice that the reconstruction error converges to near zero quickly and helps “pull,” or guide, the task error toward zero over time.

In many cases, simultaneous execution of the autoencoder task and the original desired task can help provide stimulus to “push” progress on the desired task. However, you may make the valid objection that once the desired task reaches sufficiently good performance, it becomes limited by the autoencoding task.

One method to address this is simply to detach the autoencoder output from the model by creating a new model connecting the input to the task output and fine-tuning it on the dataset.
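As a minimal sketch (assuming the inp and taskOut tensors from Listing 8-21 are still in scope), detaching the decoder output might look like the following; because the new model shares the encoder’s and tasker’s weights, fitting it fine-tunes them directly:
taskOnly = Model(inputs=inp, outputs=taskOut)   # drop the decoder output entirely
taskOnly.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy')
taskOnly.fit(x_train, y_train, epochs=10)       # fine-tune on the task alone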

Another more sophisticated technique is to change the loss weights between the original desired task and the autoencoding task. While Keras weighs multiple losses equally by default, we can provide different weights to reflect different levels of priority or importance delegated to each of the tasks. At the beginning of training, we can give a high weight to the autoencoding task, since we want the model to develop useful representations through a (ideally somewhat easier) task of autoencoding. Throughout the training duration, the weight on the original task model loss can be successively increased and the weight on the autoencoder model loss decreased. To formalize this, let α be the weight on the task output loss, and let 1 − α be the weight on the decoder output loss (with 0 < α < 1).

The sigmoid function $$ \sigma(x)=\frac{1}{1+e^{-x}} $$ is a convenient way to transition smoothly from a value very close to a lower bound to a value very close to an upper bound. Over the span of 100 epochs, we can employ a simple (arbitrarily set but functional) transformation of the sigmoid function to obtain a smooth transition from a low to a high value of α (visualized by Listing 8-24 in Figure 8-57), where t represents the epoch number:
$$ \alpha =\sigma \left(\frac{t-50}{10}\right)=\frac{1}{1+e^{-\left(\frac{t-50}{10}\right)}} $$

A line graph represents a smooth transition from a low to a high value of α under the sigmoid schedule. It plots the task output weight and the decoder output weight.

Figure 8-57

Plot of the task output loss weight and the decoder weight across each epoch

plt.figure(figsize=(15, 7.5), dpi=400)
epochs = np.linspace(1, 100, 100)
# sigmoid schedule: alpha rises smoothly from ~0 to ~1 around epoch 50
alpha = 1/(1 + np.exp(-(epochs-50)/10))
plt.plot(epochs, alpha, color='red', label='Task Output Weight')
plt.plot(epochs, 1-alpha, color='blue', label='Decoder Output Weight')
plt.xlabel('Epochs')
plt.legend()
plt.show()
Listing 8-24

Plotting out our custom α-adjusting curve

A generalized equation to scale α across tmax total epochs using a transformation of the sigmoid function is as follows:
$$ \alpha =\sigma \left(\frac{t-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)=\frac{1}{1+e^{-\left(\frac{t-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)}} $$
At initial conditions, α is very small. (For simplicity, we approximate the first epoch as t ≈ 0 rather than t = 1.)
$$ \alpha \Big|_{t\approx 0}\to \frac{1}{1+e^{-\left(\frac{-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)}}=\frac{1}{1+e^{5}}\approx 0.0067 $$
The training regime completes at t = tmax, at which α is very close to 1:
$$ \alpha \Big|_{t={t}_{\max}}\to \frac{1}{1+e^{-\left(\frac{t_{\max}-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)}}=\frac{e^{5}}{1+e^{5}}\approx 0.993307 $$

Moreover, by taking the derivative of this schedule with respect to t, we find that the largest per-epoch change in α for some tmax is $$ \frac{5}{2{t}_{\max}} $$, attained at the midpoint of training. As tmax increases, analysis of the derivative reveals that the overall change becomes more uniformly spread out; for large values of tmax, the per-epoch change is nearly constant and small (i.e., the derivative is near 0 at any individual epoch). A simple linear transformation of α also suffices in most cases in which tmax is reasonably large.
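As a quick sketch of this schedule in plain NumPy (the printed values match the endpoint calculations above):
import numpy as np

def alpha_schedule(t, t_max):
    # sigmoid-shaped task-loss weight: ~0 at the start of training, ~1 at the end
    return 1 / (1 + np.exp(-(t - t_max / 2) / (t_max / 10)))

print(alpha_schedule(0, 100))    # ~0.0067
print(alpha_schedule(50, 100))   # 0.5 at the midpoint
print(alpha_schedule(100, 100))  # ~0.9933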

Loss weighting is conveyed in the compiling stage. This means that we’ll have to recompile and fit every epoch. This is not difficult to do; we can write a for loop that loops through every epoch, calculates the α value for that epoch, compiles the model with that loss weighting, and fits for one epoch. Collecting the training history is slightly more manual; we need to collect the metrics for the single epoch and append them to user-created lists (Listing 8-25).
total_epochs = 100
lossParams = {'decoder':'binary_crossentropy',
              'taskOut':'sparse_categorical_crossentropy'}
loss, decoderLoss, taskOutLoss = [], [], []
for epoch in range(1, total_epochs+1):
    # compute the task-loss weight for this epoch from the sigmoid schedule
    alpha = 1/(1 + np.exp(-(epoch-50)/10))
    # recompile with the updated loss weighting
    taskModel.compile(optimizer='adam',
                      loss = lossParams,
                      loss_weights = {'taskOut': alpha,
                                      'decoder': 1-alpha})
    # fit for a single epoch and collect the metrics manually
    history = taskModel.fit(x_train, {'decoder':x_train,
                                      'taskOut': y_train},
                            epochs = 1)
    loss.extend(history.history['loss'])
    decoderLoss.extend(history.history['decoder_loss'])
    taskOutLoss.extend(history.history['taskOut_loss'])
Listing 8-25

Recompiling and fitting a multitask autoencoder with varied loss weighting

For another higher-code but perhaps smoother approach to dynamically adjusting the loss weights of multi-output models, which does not require repeated refitting, see Anuj Arora’s well-written post on adaptive loss weighting in Keras using callbacks: https://medium.com/dive-into-ml-ai/adaptive-weighing-of-loss-functions-for-multiple-output-keras-models-71a1b0aca66e.
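As a minimal sketch of that callback-based idea (an illustrative pattern, not the post’s code): hold the task-loss weight in a tf.Variable that custom loss functions read at call time, and update it from a callback so no recompiling is required. The names below assume the taskModel, x_train, and y_train from the earlier listings.
import numpy as np
import tensorflow as tf
from tensorflow import keras

# The weight lives in a tf.Variable so it can be changed between epochs
# without recompiling the model.
alpha = tf.Variable(0.0, trainable=False, dtype=tf.float32)

bce = keras.losses.BinaryCrossentropy()
scce = keras.losses.SparseCategoricalCrossentropy()

def weighted_reconstruction_loss(y_true, y_pred):
    return (1.0 - alpha) * bce(y_true, y_pred)   # weight read at call time

def weighted_task_loss(y_true, y_pred):
    return alpha * scce(y_true, y_pred)

class AlphaScheduler(keras.callbacks.Callback):
    def __init__(self, total_epochs):
        super().__init__()
        self.total_epochs = total_epochs
    def on_epoch_begin(self, epoch, logs=None):
        t = epoch + 1
        alpha.assign(1 / (1 + np.exp(-(t - self.total_epochs / 2)
                                     / (self.total_epochs / 10))))

taskModel.compile(optimizer='adam',
                  loss={'decoder': weighted_reconstruction_loss,
                        'taskOut': weighted_task_loss})
taskModel.fit(x_train, {'decoder': x_train, 'taskOut': y_train},
              epochs=100, callbacks=[AlphaScheduler(100)])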

Figure 8-58 demonstrates the history of the reconstruction, task, and overall losses throughout training of the multitask autoencoder, with the background shaded by the value of α used at each epoch. Note that the reconstruction task is easier than the original intended task (hence its faster decline in loss) and that the overall loss takes on a logistic shape: as α changes most rapidly in epochs 40–60, the overall loss “switches” from tracking the reconstruction loss to tracking the original task loss.

A diagram interprets the reconstruction task and the total losses incurred through the entire training of the multitask autoencoder.

Figure 8-58

Diagram of reconstruction loss, task loss, and overall loss (now a dynamically weighted sum) with the weighting gradient shaded in the background

Multitask autoencoders perform best in difficult supervised classification tasks that benefit from rich latent features, which can be learned well by autoencoders.

Sparse Autoencoders

Standard autoencoders are constrained by representation size – autoencoder architectures are built with a “physical” bottleneck through which information must be compressed. The autoencoder attempts to maximize the amount of information it can squeeze through a significantly compressed latent space such that the information can reliably be decoded into the original output (Figure 8-59).

A network architecture exhibits compressed latent space. It indicates data is decoded reliably into the original output.

Figure 8-59

A standard autoencoder, which encodes information into a densely packed and quasi-continuous latent space

However, this is not the only limit we can impose. Another information bottleneck tool is sparsity. We can make the bottleneck layer very large but allow only a few nodes to be active on any one pass. While this still limits the amount of information that can pass through the bottleneck layer, the network is given more freedom and control to “choose” which nodes information passes through, and that choice is itself an additional medium of information expression (Figure 8-60).

A network diagram exhibits a free pattern that enables the selection of the data of nodes.

Figure 8-60

A sparse autoencoder, in which a much larger latent size is accessible but only a few nodes can be used at any one time

To maintain sparsity, we generally impose L1 regularization on the layer’s activity. (Recall the discussion of regularization in Chapter 3, “Regularization Learning Networks.”) L1 regularization penalizes the bottleneck layer for producing output activity that is too large. Assuming a network uses binary cross-entropy to measure the task loss and λ represents the overall activity/output of the bottleneck layer, the joint loss of an L1-regularized network is as follows:
$$ \textrm{loss}=\textrm{BCE}\left({y}_{\textrm{pred}},{y}_{\textrm{true}}\right)+\alpha \cdot \left|\lambda \right| $$

The parameter α is user-defined and controls the “importance” of the L1 regularization term relative to the task loss. Setting the correct value of α is important for correct behavior. If α is too small, the network ignores the sparsity restriction in favor of completing the task, which is now made quasi-trivial by the overcomplete bottleneck layer. If α is too large, the network ignores the task by learning the “ultimate sparsity” – predicting all zeros in the bottleneck layer, which entirely minimizes λ but performs poorly on the actual task we want it to learn.

An alternative commonly used penalty is L2 regularization, in which the square rather than the absolute value is penalized:
$$ \textrm{loss}=\textrm{BCE}\left({y}_{\textrm{pred}},{y}_{\textrm{true}}\right)+\alpha \cdot {\lambda}^2 $$

This is a common machine learning paradigm. L2 regularization tends to produce sets of values generally near zero but not exactly at zero, whereas L1 regularization tends to produce values solidly at zero. An intuitive explanation is that L2 regularization significantly discounts the incentive to decrease values that are already somewhat near zero. The decrease from 3 to 2, for instance, is rewarded with a penalty decrease of 3² − 2² = 5. The decrease from 1 to 0, on the other hand, is rewarded with a measly penalty decrease of 1² − 0² = 1. L1 regularization, by contrast, rewards the decrease from 3 to 2 identically to the decrease from 1 to 0. We generally use L1 regularization to impose sparsity constraints because of this property.
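As a tiny numerical illustration of this property (a standalone sketch, not part of any model):
# Penalty reduction earned by shrinking an activation a by one unit
# under an L1 penalty (|a|) versus an L2 penalty (a**2).
for a in [3, 2, 1]:
    l1_gain = abs(a) - abs(a - 1)    # always 1
    l2_gain = a**2 - (a - 1)**2      # 2a - 1: large far from zero, tiny near zero
    print(f"a={a}: L1 gain={l1_gain}, L2 gain={l2_gain}")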

To implement this, we need to make a slight modification to our original buildAutoencoder function. We can build the autoencoder as if we were leading up to and from a certain implicit latent size, but replace the implicit latent size with the real (expanded) latent size. For instance, consider an autoencoder built with an input of 64 dimensions and an implicit latent space of 8 dimensions. The node count progression in each layer of a standard autoencoder using our prebuilt autoencoder logic would be 64 → 32 → 16 → 8 → 16 → 32 → 64. However, because we are planning to impose a sparsity constraint on the bottleneck layer, we need to provide an expanded set of nodes to pass information through. Say the real bottleneck size is 128 nodes. The node count progression in each layer of this sparse autoencoder would be 64 → 32 → 16 → 128 → 16 → 32 → 64.

To actually implement the sparsity constraint, note that almost all layers in Keras have an activity_regularizer parameter, set upon initialization. This parameter penalizes the activity, or output, of the layer (Listing 8-26). Note that you can also set the kernel_regularizer or bias_regularizer parameter if you desire to penalize the learned weights or biases. In this case, we don’t care about how the encoder arrives at a sparse encoding, only that the encoder creates a sparse encoding; hence, we regularize the layer activity. These arguments accept a keras.regularizers object. We will use the L1 regularization object, which accepts the specific weighting of the penalty as a parameter. Setting the weight is important and should be given thought and experimentation, considering the model’s power, the difficulty of autoencoding, and the latent space size. As discussed previously, setting an improper weight in either direction (too large or too small) yields adverse outcomes.
from keras.regularizers import L1
def buildSparseAutoencoder(inputSize=784,
                           impLatentSize=32,
                           realLatentSize=128,
                           outActivation='sigmoid'):
    # define architecture components
    encoder = Sequential(name='encoder')
    encoder.add(L.Input((inputSize,)))
    # funnel down toward the implicit latent size
    for i in range(int(np.floor(np.log2(inputSize/impLatentSize))), -1, -1):
        encoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
        encoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
    # expanded bottleneck with an L1 activity penalty to encourage sparsity
    encoder.add(L.Dense(realLatentSize, activation='relu',
                        activity_regularizer = L1(0.001)))
    decoder = Sequential(name='decoder')
    decoder.add(L.Input((realLatentSize,)))
    # funnel back up toward the original input size
    for i in range(1,int(np.floor(np.log2(inputSize/impLatentSize)))+1):
        decoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
        decoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
    decoder.add(L.Dense(inputSize, activation=outActivation))
    # define model architecture from components
    ae_input = L.Input((inputSize,), name='input')
    ae_encoder = encoder(ae_input)
    ae_decoder = decoder(ae_encoder)
    ae = Model(inputs = ae_input,
               outputs = ae_decoder)
    return {'model': ae, 'encoder': encoder, 'decoder': decoder}
Listing 8-26

Defining a sparse autoencoder with L1 regularization

Figure 8-61 demonstrates the performance of the sparse autoencoder on the MNIST dataset, in which the expanded latent space vector is reshaped into a square grid for convenient viewing. The reconstruction is not visibly worse than that of a standard autoencoder without a sparsity constraint. Notice that only two to five of the latent nodes are active on any one pass (and which nodes are active varies from image to image). A standard autoencoder trained with only five nodes in the bottleneck layer (and no sparsity requirement) would obtain poor reconstruction performance, demonstrating the informational richness of “choosing” which nodes are active.

An interpreter image represents sparse encoder performance on the M N I S T dataset.

Figure 8-61

Sampled original inputs (left), latent space (middle), and reconstruction (right) for a sparse autoencoder trained on MNIST. The latent space is 256 neurons reshaped into a 16-by-16 grid for viewing. The actual latent space is not arranged in two spatial directions

If we decreased the regularization alpha value (i.e., the L1 penalty would be weighted less relative to the loss), the network would obtain better overall loss at the cost of decreased sparsity (i.e., more nodes would be active on any one pass). If we increased the regularization alpha, the network would obtain worse overall loss in exchange for increased sparsity (i.e., even fewer nodes would be active on any one pass).
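To verify the degree of sparsity empirically, one might count the number of “active” latent nodes per validation sample; a quick sketch (assuming the encoder returned by buildSparseAutoencoder, with the activity threshold chosen purely for illustration):
latent = encoder.predict(x_valid)             # shape: (num_samples, realLatentSize)
active_counts = (latent > 1e-3).sum(axis=1)   # nodes above a small activity threshold
print(active_counts.mean(), active_counts.min(), active_counts.max())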

We can apply the same sparse autoencoding scheme to the Higgs Boson dataset, encoding a 28-dimensional input vector into a much larger sparse latent space. On each pass, roughly one-fourth to one-third of the latent nodes are active, although many bottleneck nodes are “quasi-active” – not zero, but very close to it. Figure 8-62 demonstrates the internal state and reconstruction of the sparse autoencoder on different inputs, with the 28-dimensional input vectors reshaped into 7-by-4 grids for more convenient viewing.

12 Images depict the reshaping of a sparse autoencoder with a latent space of a 16 by 16 grid.

Figure 8-62

Sampled original inputs (left), latent space (middle), and reconstruction (right) for a sparse autoencoder trained on the Higgs Boson dataset. The latent space is 256 neurons reshaped into a 16-by-16 grid for viewing; the input and reconstruction are 28 dimensions arranged into 7-by-4 grids

Similarly, Figure 8-63 shows the application of a trained sparse autoencoder to various elements of the Mice Protein Expression dataset.

12 images represent the utilization of sparse autoencoder to different elements pertaining to the given dataset.

Figure 8-63

Sampled original inputs (left), latent space (middle), and reconstruction (right) for a sparse autoencoder trained on the Mice Protein Expression dataset. The latent space is 256 neurons reshaped into a 16-by-16 grid for viewing; the input and reconstruction are 80 dimensions arranged into 8-by-10 grids

Why would you want to use sparse autoencoders? The primary reason is to take advantage of sparse encodings’ robustness properties. Adversarial examples are inputs deliberately generated to fool a neural network: an image originally correctly classified as class A is nudged into being classified as class B with high confidence simply by making minuscule, barely visible changes to the input. The canonical example in the field is a diagram created by Ian Goodfellow et al. in the paper “Explaining and Harnessing Adversarial Examples.” The Fast Gradient Sign Method (FGSM) generates a perturbation that adjusts every pixel in the input in a way that significantly changes the network’s final prediction (Figure 8-64).

A mathematical expression. The first image is a photograph of a panda, the second is the added adversarial perturbation, and the output is the same panda image now classified as a gibbon.

Figure 8-64

Demonstration of the FGSM method. From “Explaining and Harnessing Adversarial Examples,” Goodfellow et al.

Adversarial example finders profit from continuity and gradients. Because neural networks operate in very large continuous spaces, adversarial examples can be found by “sneaking” through smooth channels and ridges in the surface of the loss landscape. Adversarial examples can be security threats (consider real-world adversarial examples, like tape placed onto a traffic sign in a particular orientation causing egregious misidentification), as well as potential symptoms of poor generalization.1 Sparse encoders, however, impose a discreteness upon the encoded space. It becomes significantly more difficult to generate successful adversarial examples when a frozen sparse encoder is used as the feature extractor for a network.
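For concreteness, a minimal FGSM-style sketch in TensorFlow (here classifier stands for any Keras image classifier, (x, y) is a batch of inputs and integer labels, and epsilon is chosen for illustration; this is a sketch of the general method, not the paper’s code):
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_perturb(classifier, x, y, epsilon=0.1):
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, classifier(x))
    grad = tape.gradient(loss, x)
    # step each pixel in the direction of the sign of the loss gradient
    return tf.clip_by_value(x + epsilon * tf.sign(grad), 0.0, 1.0)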

Sparse autoencoders can also be useful for the purposes of interpretability. We’ll talk more about specialized interpretability techniques later in the book, but sparse autoencoders can be interpreted without additional complex theoretical tools. Because only a few neurons are active at any one time, understanding which neurons activate for any given input is relatively simple, especially compared with the dense latent vectors generated by standard autoencoders.

Denoising and Reparative Autoencoders

So far, we’ve only considered applications of autoencoder training in which the desired output is identical to the input. However, autoencoders can perform another function: to repair or restore a damaged or noisy input.

Here’s the clever way we go about it – we artificially add realistic noise or corruption to a “pure”/”clean” dataset and then train the model to recover the cleaned image from its artificially corrupted version (Figure 8-65).

A flow diagram represents the training dataset to recover a clean image from the artificially corrupted image.

Figure 8-65

Deriving a noisy image as input and the original clean image as the desired output of a denoising autoencoder

There are many applications for such a model. We can use it, most obviously, to denoise a noisy input; the “cleaned” input can then be used for other purposes. Alternatively, if we are developing a model that we know will operate in a domain with lots of noisy data, we can use the encoder of a denoising autoencoder as a robust or resilient feature extractor (similarly to in autoencoder pretraining), exploiting the encoder’s “denoised” latent representations (Figure 8-66).

A flow diagram depicts the functioning of the denoising autoencoder. It has the labels input data, denoising autoencoder, model and task output.

Figure 8-66

A potential application of denoising autoencoders as a structure that learns to clean up the input before it is actually used in a model for a task
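A minimal sketch of the arrangement in Figure 8-66 (assuming an encoder taken from a denoising autoencoder trained on 784-dimensional inputs, as constructed later in this section, and a ten-class task; the head layer sizes are illustrative):
from tensorflow.keras import layers as L
from tensorflow.keras.models import Model

encoder.trainable = False                      # keep the learned "denoised" features fixed
inp = L.Input((784,))
features = encoder(inp)
hidden = L.Dense(16, activation='relu')(features)
out = L.Dense(10, activation='softmax')(hidden)
clf = Model(inputs=inp, outputs=out)
clf.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
# clf.fit(x_train, y_train, epochs=10)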

These reparative models have particularly exciting applications for intelligent or deep graphics processing. Many graphics operations are not trivially invertible: it is easy to go from one state to another but not in the reverse direction. For instance, if I convert a color image or movie into grayscale (for instance, using the pixel-wise methods covered in the image case study in Chapter 2), there is no simple way to invert it back to color. Alternatively, if you spill coffee on an old family photo, there is no trivial process to “erase” the stain.

Autoencoders, however, exploit the triviality of going from the “pure” to the “corrupted” state by artificially imposing corruption upon pure data and forcing powerful autoencoder architectures to learn the “undoing.” Researchers have used denoising autoencoder architectures to generate color versions of historical black-and-white film and to repair photos that have been ripped, stained, or streaked. Another application is in biological/medical imaging, where an imaging operation can be disrupted by environmental conditions; replicating this noise/image damage artificially and training an autoencoder to become robust to it can make the model more resilient to noise.

We will begin with demonstrating the application of a denoising autoencoder to the MNIST dataset by successively increasing the amount of noise in the image and observing how well the denoising autoencoder performs (similarly to exercises in Chapter 4).

We can use a simple but effective technique to introduce noise into an image: adding random noise sampled from a normal distribution with mean 0 and a specified standard deviation. The result is clipped to ensure that the resulting value is still between 0 and 1, the feasible domain of pixel values. Listing 8-27 implements and visualizes artificial noise for a given standard deviation std.
# add zero-mean Gaussian noise and clip back to the valid pixel range [0, 1]
modified = x_train + np.random.normal(0, std, size=x_train.shape)
modified_clipped = np.clip(modified, 0, 1)
plt.set_cmap('gray')
plt.figure(figsize=(20, 20), dpi=400)
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.imshow(modified_clipped[i].reshape((28, 28)))
    plt.axis('off')
plt.show()
Listing 8-27

Displaying data corrupted by random noise

Figure 8-67 demonstrates a grid of sample images with no artificial noise added as reference for comparison.

A numerical grid represents various sample images that have no artificial noise added as a comparison reference.

Figure 8-67

A grid of untampered clean images from MNIST for reference

Figure 8-68 demonstrates the same set of images with random noise sampled from a normal distribution with standard deviation of 0.1. We can observe marginal noise, especially in affecting the consistency of the digit outlines.

A 6 by 6 numerical grid illustrates the images acquired as a result of random noise sampled from a normal distribution with standard deviation.

Figure 8-68

A sample of MNIST images with added normally distributed random noise using standard deviation 0.1

Let’s build an autoencoder to denoise this data (Listing 8-28). There is no difference between the architecture of the autoencoder used here and in previous applications; the difference lies rather in the data we pass in (namely, that the input has artificial noise applied). In this implementation, we compute new noise in each epoch. This is desirable because it provides “fresh” noise that the denoising autoencoder must learn to remove rather than “accept”/“memorize.”
from tqdm import tqdm
models = buildAutoencoder(784, 32)
model = models['model']
encoder = models['encoder']
model.compile(optimizer='adam', loss='mse')
TOTAL_EPOCHS = 100
loss = []
for i in tqdm(range(TOTAL_EPOCHS)):
    # draw fresh noise each epoch so the model cannot memorize a fixed corruption
    modified = x_train + np.random.normal(0, std, size=x_train.shape)
    modified_clipped = np.clip(modified, 0, 1)
    history = model.fit(modified_clipped, x_train, epochs=1, verbose=0)
    loss.append(history.history['loss'])
Listing 8-28

Training the denoising autoencoder on novel corrupted MNIST data each epoch

After training, we can evaluate the Mean Absolute Error on a fresh validation set of noisy images (Listing 8-29).
modified = x_valid + np.random.normal(0, std, size=x_valid.shape)
modified_clipped = np.clip(modified, 0, 1)
from sklearn.metrics import mean_absolute_error as mae
mae(model.predict(modified_clipped), x_valid)
Listing 8-29

Evaluating the performance of the denoising autoencoder on a fresh set of noisy images

Listing 8-30 and Figure 8-69, respectively, implement and demonstrate a sampling of images with normally distributed random noise, using a standard deviation of 0.1. A denoising autoencoder trained to recover the original version of a noisy image generated using this procedure obtains a validation Mean Absolute Error of 0.0266.
plt.set_cmap('gray')
for i in range(3):
    plt.figure(figsize=(15, 5), dpi=400)
    plt.subplot(1, 3, 1)
    plt.imshow(modified_clipped[i].reshape((28, 28)))
    plt.axis('off')
    plt.title('Noisy Input')
    plt.subplot(1, 3, 2)
    plt.imshow(x_valid[i].reshape((28, 28)))
    plt.axis('off')
    plt.title('True Denoised')
    plt.subplot(1, 3, 3)
    # predict from the noisy input, not the clean image
    plt.imshow(model.predict(modified_clipped[i:i+1]).reshape((28, 28)))
    plt.axis('off')
    plt.title('Predicted Denoised')
    plt.show()
Listing 8-30

Displaying the corrupted image, the reconstruction, and the desired reconstruction (i.e., the original uncorrupted image)

A 3 by 3 grid represents the corrupted images with reference to random noise distributed using a standard deviation.

Figure 8-69

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.1

Let’s increase the standard deviation to 0.2. Figure 8-70 demonstrates the noise effect on the images, and Figure 8-71 demonstrates the reconstruction performance on a set of images. The denoising autoencoder obtains a validation Mean Absolute Error of 0.0289, slightly more than that of the denoising autoencoder trained on noise drawn from a normal distribution with standard deviation 0.1.

A 6 by 6 grid highlights the noise impact on the images with the increase of standard deviation of zero point two.

Figure 8-70

A sample of MNIST images with added normally distributed random noise using standard deviation 0.2

A 3 by 3 grid represents a set of images displaying reconstruction performance.

Figure 8-71

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.2

Figures 8-72 and 8-73 demonstrate a sample of images and the denoising autoencoder performance on images corrupted using noise drawn from a normal distribution with standard deviation 0.3. The denoising autoencoder obtains a validation Mean Absolute Error of about 0.0343.

A 6 by 6 numerical grid network exhibits a sample of images that interprets normal distribution.

Figure 8-72

A sample of MNIST images with added normally distributed random noise using standard deviation 0.3

A 3 by 3 numerical grid interprets images achieved through the performance of a denoising autoencoder that obtains a validation mean absolute error.

Figure 8-73

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.3

Figures 8-74 and 8-75 demonstrate a sample of images and the denoising autoencoder performance on images corrupted using noise drawn from a normal distribution with standard deviation 0.5. The denoising autoencoder obtains a validation Mean Absolute Error of about 0.0427.

A 6 by 6 grid framework represents images drawn from the standard deviation of zero point five.

Figure 8-74

A sample of MNIST images with added normally distributed random noise using standard deviation 0.5

A 3 by 3 number grid framework represents noisy input and desired output using denoising autoencoder performance.

Figure 8-75

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.5

Figures 8-76 and 8-77 demonstrate a sample of images and the denoising autoencoder performance on images corrupted using noise drawn from a normal distribution with standard deviation 0.9. The denoising autoencoder obtains a validation Mean Absolute Error of about 0.0683. Note that this is an exceedingly nontrivial task – even humans would have some difficulty denoising many of the shown samples! The autoencoder reconstructions are more abstracted – there is no physical way to exactly reconstruct all the details due to the significant amount of information corruption, so the autoencoder instead performs implicit digit recognition and reconstructs the image as a “generalized” digit with specific positional and orientational characteristics.

A 6 by 6 number grid of various image patterns acquired as a result of the performance of autoencoder having the denoising feature.

Figure 8-76

A sample of MNIST images with added normally distributed random noise using standard deviation 0.9

A grid of images represents noisy input, true and predicted denoised output. It depicts the performance of the denoising autoencoder with a standard deviation of 0.9.

Figure 8-77

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.9

We can see that denoising autoencoders can perform reconstruction to a pretty impressive degree. In practice, however, we want to keep the noise level somewhat low; increasing the noise level too far destroys information and can cause the network to develop incorrect and/or overly simplified representations.

A similar logic can be applied to tabular data. There are many situations in which a tabular dataset is particularly noisy. This is especially common in scientific datasets recording highly variable physical processes, like low-level physics dynamics or biological system measurements.

Let’s build a denoising autoencoder for the Mice Protein Expression dataset. Listing 8-31 loads the dataset and splits it into a training and a validation set.
# drop the index and label columns; sample 80% of rows for training
data = pd.read_csv('../input/mpempe/mouse-protein-expression.csv').drop(['Unnamed: 0', 'class'], axis=1)
train_indices = np.random.choice(data.index, replace=False,
                                 size = round(0.8 * len(data)))
valid_indices = np.array([ind for ind in data.index if ind
                          not in train_indices])
x_train, x_valid = data.loc[train_indices], data.loc[valid_indices]
Listing 8-31

Loading and splitting the Mice Protein Expression dataset

Listing 8-32 builds a standard autoencoder architecture.
models = buildAutoencoder(len(data.columns), 16)
model = models['model']
encoder = models['encoder']
model.compile(optimizer='adam', loss='mse')
Listing 8-32

Building an autoencoder architecture to fit the Mice Protein Expression dataset

To train, we generate noise to the input and train the model to reconstruct the original input from the noisy input. In tabular datasets, we generally cannot add randomly distributed noise to the entire set of data in blanket fashion, because different features operate on different scales. Instead, the noise should be dependent on the standard deviation of each feature itself. In this implementation, we add noise randomly sampled from a normal distribution with a standard deviation equal to one-fifth of the actual feature’s standard deviation (Listing 8-33).
TOTAL_EPOCHS = 100
loss = []
stds = x_train.std()
for i in tqdm(range(TOTAL_EPOCHS)):
    # per-feature noise: zero mean, standard deviation of one-fifth of each column's own std
    noise = pd.DataFrame(index=x_train.index, columns=x_train.columns)
    for col in noise.columns:
        noise[col] = np.random.normal(0, stds[col]/5,
                                      size=(len(x_train),))
    history = model.fit(x_train + noise, x_train, epochs=1, verbose=0)
    loss.append(history.history['loss'])
Listing 8-33

Adding noise to each column of the Mice Protein Expression dataset with a reflective standard deviation

Listing 8-34 demonstrates the evaluation of such a model on novel validation noisy data.
# evaluate on fresh noise drawn with the same per-feature scale used in training
noise = pd.DataFrame(index=x_valid.index, columns=x_valid.columns)
for col in noise.columns:
    noise[col] = np.random.normal(0, stds[col]/5,
                                  size=(len(x_valid),))
from sklearn.metrics import mean_absolute_error as mae
mae(model.predict(x_valid + noise), x_valid)
Listing 8-34

Evaluating the performance of the denoising tabular autoencoder on novel noisy data

After training, the encoder of the denoising autoencoder can be used for pretraining or other previously described applications.

Key Points

In this chapter, we discussed the autoencoder architecture and how it can be used in four different contexts – pretraining, multitask training, sparse autoencoders, and denoising autoencoders.
  • Autoencoders are neural network architectures trained to encode an input into a latent space with a representation size smaller than the original input and then to reconstruct the input from the latent space. Autoencoders are forced to learn meaningful latent representations of the data because of this imposed information bottleneck.

  • The encoder of a trained autoencoder can be detached and built as the feature extractor of a supervised network; that is, the autoencoder serves the purpose of pretraining.

  • In cases where supervised learning is difficult to get started with, creating a multitask autoencoder that can optimize its loss by performing both the supervised task and an auxiliary autoencoding task can help overcome initial learning hurdles.

  • Sparse autoencoders use a significantly expanded latent space size, but are trained with restrictions on latent space activity, such that only a few nodes/neurons can be active at any one pass. Sparse autoencoders are thought to be more robust.

  • Denoising autoencoders are trained to reconstruct clean data from an artificially corrupted, noisy version of that data. In the process, the encoder learns to look for key patterns and “denoise” the data, which can be a useful component for supervised models.

In the next chapter, we will look into deep generative models – including a particular type of autoencoder, the Variational Autoencoder (VAE) – which can notably be used to reconcile unbalanced datasets, improve model robustness, and train models on sensitive/private data, in addition to other applications.
