© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Ye, Z. Wang, Modern Deep Learning for Tabular Data, https://doi.org/10.1007/978-1-4842-8692-0_8

8. Autoencoders

Andre Ye1   and Zian Wang2
(1)
Seattle, WA, USA
(2)
Redmond, WA, USA
 

Weak encoding means mistakes and weak decoding means illiteracy.

—Rajesh Walecha, Author

An autoencoder is a very simple model: a model that predicts its own input. In fact, it may seem deceptively simple to the point of being worthless. (After all, what use is a model that predicts what we already know?) Autoencoders are extraordinarily valuable and versatile architectures not because of their ability to reproduce the input, but because of the internal capabilities they develop in order to obtain that ability. Autoencoders can be chopped up into desirable parts and stuck onto other neural networks, like playing with Legos or performing surgery (take your pick of analogy), with incredible success, or can be used to perform other useful tasks, such as denoising.

This chapter begins by explaining the intuition of the autoencoder concept, followed by a demonstration of how one would implement a simple "vanilla" autoencoder. Afterward, four applications of autoencoders – pretraining, multitask learning, sparse learning for robust representations, and denoising – are discussed and implemented.

The Concept of the Autoencoder

The operations of encoding and decoding are fundamental to information. Some hypothesize that all transformation and evolution of information results from these two abstracted actions of encoding and decoding (Figures 8-1 and 8-2). Say Alice sees Humpty Dumpty hit his head on the ground after some precarious wall-sitting and tells Bob, “Humpty Dumpty hit his head badly on the ground!” Upon hearing this information, Bob encodes the information from a language representation into thoughts and opinions – what we might call latent representations.

Say Bob is a chef, and so his encoding “specializes” in features relating to food. Bob then decodes the latent representations back into a language representation when he tells Carol, “Humpty Dumpty cracked his shell! We can use the innards to make an omelet.” Carol, in turn, encodes the information.

Say Carol is an egg activist and cares deeply about the well-being of Humpty Dumpty. Her latent representations will encode the information in a way that reflects her priorities and interests as a thinker. When she decodes her latent representations into language, she tells Drew that “People are trying to eat Humpty Dumpty after he has suffered a serious injury! It is horrible.”

So on and so forth. The conversation continues and evolves, information passed and transformed from thinker to thinker. Because each thinker encodes the information represented in language into a semantic system relevant to their experiences, priorities, and interests, they correspondingly decode information in a fashion colored through these lenses.

A flow diagram depicts the abstract actions of encoding and decoding. The labels are input, encoder, latent space, decoder and output.

Figure 8-1

A high-level autoencoder architecture

A flow diagram highlights the encoding and decoding transformation results. It exhibits evolution of information from the abstracted actions.

Figure 8-2

Transformation of information as a series of encoding and decoding operations

Of course, this interpretation of encoding and decoding is very broad and more psychological than anything else. In the strict context of computer science, encoding is an operation to represent some information in another form, usually with smaller information content (there are few applications for encoding techniques that make the storage size larger). Decoding, in turn, “undoes” the encoding operation to recover the original information. Encoding and decoding are commonly used terms in the context of compression (Figure 8-3). Various computer scientists throughout the decades have proposed very clever algorithms to map information to smaller storage sizes with lossless and lossy reconstruction of the original information, making the transmission of large data like long text, images, and videos across limited information transfer connections feasible.

A flow diagram depicts the sender and receiver. The sender labels are data, encoding scheme and encrypted while that of receiver are encrypted, decoding scheme and reconstructed data.

Figure 8-3

Interpretation of autoencoders as sending and receiving encrypted data

Encoding and decoding in deep learning are a bit of a fusion of these two understandings. Autoencoders are versatile neural network structures consisting of an encoder and a decoder component. The encoder maps the input into a smaller latent/encoded space, and the decoder maps the encoded representation back to the original input. The goal of the autoencoder is to reconstruct the original input as faithfully as possible, that is, to minimize the reconstruction loss. In order to do so, the encoder and decoder need to “work together” to develop an encoding and decoding scheme.

Autoencoders reduce the representation size of the original input information, which can be thought of as a form of lossy compression. However, numerous studies have demonstrated that autoencoders are generally quite bad at compression when compared with human-designed compression schemes. Rather, when we build autoencoders, it is almost always to extract meaningful features at the "core" of the data. The smaller representation size of the latent space compared with the original input emerges only because we need to impose an information bottleneck to force the network to learn meaningful latent features. The architectures in Figures 8-4 and 8-5 maintain constant and expanded information representations compared with the input; such an autoencoder can trivially learn weights that simply pass/carry the input from the input to the output. On the other hand, the architecture in Figure 8-6 must learn nontrivial patterns to compress and reconstruct the original input. The information bottleneck and information compression, therefore, are the means, not the end, of the autoencoder.

A network architecture highlights constant data representation related with the input.

Figure 8-4

A bad autoencoder architecture (latent space representation size is equal to input representation size)

A network architecture represents expanded data in comparison with the input.

Figure 8-5

An even worse autoencoder architecture (latent space representation size is larger than input representation size)

A network diagram depicts the compression and reconstruction of the original input.

Figure 8-6

A good autoencoder architecture (latent space representation size is smaller than input representation size)

Autoencoders are very good at finding higher-level abstractions and features. In order to reliably reconstruct the original input from a much smaller latent space representation, the encoder and decoder must develop a system of mapping that most meaningfully characterizes and distinguishes each input or set of inputs from others. This is no trivial task!

Consider the following autoencoder design scheme adapted for humans (Figure 8-7): person A is the encoder and attempts to “encode” a high-resolution image of a sketch as a natural language description, restricted to N words or less; person B is the decoder and attempts to “decode” the original image person A was looking at by drawing a reconstructed image based on person A’s natural language description. Person A and person B must work together to develop a system to reliably reconstruct the original image.

A flow diagram depicts two images, a photograph of a cat on the left and a diagram on the right.

Figure 8-7

Image-to-text encoding guessing game

Say that you are person B and you are given the following natural language description by person A: “a black pug dressed in a black and white scarf looks at the upper-left region of the camera among an orange background.” For the sake of intuition, it is a worthwhile exercise to try actually playing the role of person B in this game by sketching out/”decoding” the original input.

Figure 8-8 shows the (hypothetical) image that person A encoded into the given natural language description. Chances are that your sketch is very different from the actual image. By performing this exercise, you will have experienced first-hand two key low-level challenges in autoencoding: reconstructing a complex output from a comparatively simpler encoding requires a lot of thinking and conceptual reasoning about the encoding, and the encoding scheme itself needs to effectively communicate both key concepts and precision/positioning information.

A photograph of a dog.

Figure 8-8

What person A was hypothetically looking at when they provided you the natural language encoding. Taken by Charles Deluvio

In this example, the latent space is in the form of language – which is discrete, sequential, and variable-length. Most autoencoders for tabular data use latent spaces that satisfy none of these attributes: they are (quasi-)continuous, read and generated all at once rather than sequentially, and fixed-length. With these restrictions lifted, general autoencoders can reliably find effective encoding and decoding schemes, but the two-player game is still good intuition for thinking through challenges associated with autoencoder training.

Although autoencoders are relatively simple neural network architectures, they are incredibly versatile. In this section, we will begin with the plain “vanilla” autoencoder and move to more complex forms and applications of autoencoders.

Vanilla Autoencoders

Let’s begin with the traditional understanding of an autoencoder, which merely consists of an encoder and a decoder component working together to translate the input into a latent representation and then back into original form. The value of autoencoders will become clearer in following sections, in which we will use autoencoders to substantively improve model training.

The goal of this subsection is not only to demonstrate and implement autoencoder architectures but also to understand implementation best practices and to perform technical investigations and explorations into how and why autoencoders work.

Autoencoders are traditionally applied to image and text-based datasets, because this sort of data often features semantic concepts that should take a smaller amount of space to represent than is used in raw form. For instance, consider the following approximately 3000-by-3000 pixel image of a line (Figure 8-9).

An image of a line embedded in a square.

Figure 8-9

An image of a line

This image contains nine million pixels, meaning we are representing the concept of this line with nine million data values. However, in actuality we can express any line with just four numbers: a slope, a y-intercept, a lower x bound, and a higher x bound (or a starting x point, a starting y point, an ending x point, and an ending y point). If we were to design an encoding and decoding scheme set, the encoder would identify these four parameters – yielding a very compact four-dimensional latent space – and the decoder would redraw the line given those four parameters. By collecting higher-level abstract latent features from the semantics represented in the images, we are able to represent the dataset more compactly. We’ll revisit this example later in the subsection.

Notice, however, that the autoencoder’s reconstruction capability is conditional on the existence of structural similarities (and differences) within the dataset. An autoencoder cannot reliably reconstruct an image of random noise, for instance.

The MNIST dataset is a particularly useful demonstration of autoencoders. It is technically visual/image-based, which is useful for understanding various autoencoder forms and applications (given that autoencoders are most well developed for images). However, it spans a small enough number of features and is structurally simple enough such that we can model it without any convolutional layers. Thus, the MNIST dataset serves as a nice link between the image and tabular data worlds. Throughout this section, we’ll use the MNIST dataset as an introduction to autoencoder techniques before demonstrating applications to “real” tabular/structured datasets.

Let’s begin by loading the MNIST dataset from Keras datasets (Listing 8-1).
from keras.datasets.mnist import load_data
(x_train, y_train), (x_valid, y_valid) = load_data()
x_train = x_train.reshape(len(x_train),784)/255
x_valid = x_valid.reshape(len(x_valid),784)/255
Listing 8-1

Loading the MNIST dataset

Recall that the key feature of an autoencoder is an information bottleneck. We want to begin from the original representation size, progressively force the information flow into smaller vector sizes, and then progressively force the information back into the original size. Such a design is simple to quickly implement in Keras, where we can successively decrease and increase the number of nodes in a sequence of fully connected layers (Listing 8-2).
import keras.layers as L
from keras.models import Sequential
# define architecture
model = Sequential()
model.add(L.Input((784,)))
model.add(L.Dense(256, activation='relu'))
model.add(L.Dense(64, activation='relu'))
model.add(L.Dense(32, activation='relu'))
model.add(L.Dense(64, activation='relu'))
model.add(L.Dense(256, activation='relu'))
model.add(L.Dense(784, activation='sigmoid'))
# compile
model.compile(optimizer='adam',
              loss='binary_crossentropy')
# fit
model.fit(x_train, x_train, epochs=1,
          validation_data=(x_valid, x_valid))
Listing 8-2

Building an autoencoder sequentially

The architecture is visualized in Figure 8-10.

A framework architecture visualises features and layers of an autoencoder.

Figure 8-10

A sequential autoencoder architecture

There are a few features of this autoencoder architecture to note. Firstly, the output activation of the autoencoder is a sigmoid function, but this is only because the input vector has values ranging from 0 to 1 (recall that we scaled the dataset upon loading in Listing 8-1). If we had not scaled the dataset as such, we would need to change the activation function such that the network could feasibly predict in the entire domain of possible values. If the input values are all nonnegative but not bounded above by 1, ReLU may be a good choice of output activation. If the inputs contain both positive and negative values, using a plain linear activation may be the easiest possible option. Moreover, the loss function chosen must reflect the output activation. Since our particular example contains outputs between 0 and 1 and the distribution of values is more or less binary (i.e., most values are very close to 0 or 1, as shown in Figure 8-11), binary cross-entropy is a suitable loss to apply. We can treat reconstruction as a series of binary classification problems, one for each pixel in the original input.

A bar graph represents the construction of the binary classification of the pixel in its original input. It depicts that maximum values are near 0 or 1.

Figure 8-11

Distribution of pixel values (scaled between 0 and 1) in the MNIST dataset

However, in other cases, reconstruction is more of a regression problem in which the distribution of possible values is not binarized toward the ends of the domains but rather more spread out. This is common in more complex image datasets (Figure 8-12) and in many tabular datasets (Figure 8-13).

A frequency distribution represents the possible values that are not binarised in a complex image datasets.

Figure 8-12

Distribution of pixel values (scaled between 0 and 1) from a set of images in CIFAR-10

A bar graph represents the distribution of values in tabular datasets.

Figure 8-13

Distribution of values for a feature in the Higgs Boson dataset (we will work with this dataset later in the chapter)

In these cases, it is more suitable to use a regression loss, like the generic Mean Squared Error or a more specialized alternative (e.g., Huber). Refer to Chapter 1 for a review of regression losses.
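
As a sketch of what that might look like for unscaled inputs (assuming the model's output layer uses a linear activation), one could compile with a regression loss such as Huber; the delta value shown is an arbitrary choice, not a recommendation.
from keras.losses import Huber
# pair a linear output activation with a regression-style reconstruction loss
model.compile(optimizer='adam', loss=Huber(delta=1.0))  # or simply loss='mse'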

Autoencoders are generally easier to work with when implemented in compartmentalized form. Rather than simply constructing the autoencoder as a continuous stack of layers with a bottleneck, we can build encoder and decoder models/components and chain them together to form a complete autoencoder (Listing 8-3).
from keras.models import Model
# define architecture components
encoder = Sequential(name='encoder')
encoder.add(L.Input((784,)))
encoder.add(L.Dense(256, activation='relu'))
encoder.add(L.Dense(64, activation='relu'))
encoder.add(L.Dense(32, activation='relu'))
decoder = Sequential(name='decoder')
decoder.add(L.Input((32,)))
decoder.add(L.Dense(64, activation='relu'))
decoder.add(L.Dense(256, activation='relu'))
decoder.add(L.Dense(784, activation='sigmoid'))
# define model architecture from components
ae_input = L.Input((784,), name='input')
ae_encoder = encoder(ae_input)
ae_decoder = decoder(ae_encoder)
ae = Model(inputs = ae_input,
           outputs = ae_decoder)
# compile
ae.compile(optimizer='adam',
           loss='binary_crossentropy') # note that in other situations other losses may be more suitable
Listing 8-3

Building an autoencoder with compartmentalized design

This method of construction is philosophically more desirable because it reflects our understanding of the autoencoder structure as meaningfully composed of a separate encoding and decoding component. When we visualize our architecture, we obtain a much cleaner high-level breakdown of the autoencoder model (Figure 8-14).

A visualised architecture model diagram represents the advanced level breakdown of the autoencoder model. It has labels namely the input, encoder sequential and decoder sequential layers.

Figure 8-14

Visualization of the compartmentalized model

However, using compartmentalized design is incredibly helpful because we can reference the encoder and decoder components separately from the autoencoder. For instance, if we desire to obtain the encoded representation for an input, we can simply call encoder.predict(…) on our input. The encoder and decoder are used to build the autoencoder; after the autoencoder is trained, the encoder and decoder still exist as references to components of that (now trained) autoencoder. The alternative would be to go searching for the latent space layer of the model and create a temporary model to run predictions, in a similar approach to the demonstration in Chapter 4 used to visualize learned convolutional transformations in CNNs. Similarly, if we desire to decode a latent space vector, we can simply call decoder.predict(…) on our sample latent vector.
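
For example, after training the compartmentalized autoencoder from Listing 8-3, one could round-trip a validation sample through the two components directly:
# encode one validation sample into the 32-dimensional latent space, then decode it back
latent = encoder.predict(x_valid[:1])       # shape (1, 32)
reconstruction = decoder.predict(latent)    # shape (1, 784)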

For instance, Listing 8-4 demonstrates visualization (Figures 8-15 through 8-18) of the internal state and reconstruction of the autoencoder created in Listing 8-3 after training.
import matplotlib.pyplot as plt
for i in range(10):
    plt.figure(figsize=(10, 5), dpi=400)
    plt.subplot(1, 3, 1)
    plt.imshow(x_valid[i].reshape((28, 28)))
    plt.axis('off')
    plt.title('Original Input')
    plt.subplot(1, 3, 2)
    plt.imshow(encoder.predict(x_valid[i:i+1]).reshape((8, 4)))
    plt.axis('off')
    plt.title('Latent Space (Reshaped)')
    plt.subplot(1, 3, 3)
    plt.imshow(ae.predict(x_valid[i:i+1]).reshape((28, 28)))
    plt.axis('off')
    plt.title('Reconstructed')
    plt.show()
Listing 8-4

Visualizing the input, latent space, and reconstruction of an autoencoder

A visualisation of images represents the shape of the internal state of encoder. It exhibits the results from original input, latent space and reconstructed.

Figure 8-15

Sample latent shape and reconstruction for the digit “7”

A visual representation of reconstruction for the digit 1 created after the training stage.

Figure 8-16

Sample latent shape and reconstruction for the digit “1”

An image represents the decoding of sample shapes in the latent space and reconstruction stage of the number two. It exhibits the results from original input, latent space, and reconstructed.

Figure 8-17

Sample latent shape and reconstruction for the digit “2”

An image depicts the reconstruction of the encoder formed after training according to the listing. It exhibits the results from original input, latent space and reconstructed.

Figure 8-18

Sample latent shape and reconstruction for the digit “5”

When we build standard neural networks of which we may want multiple variants with small differences, it is often useful to create a "builder" or "constructor." The two key parameters of an autoencoder are the input size and the latent space size. Given these two "determining" parameters, we can infer how we generally want information to flow. For instance, halving the information space in each subsequent layer in the encoder (and doubling in the decoder) is a good generic update rule.

Let the input size be I, and let the latent space size be L. In order to maintain this rule, we want all intermediate layers to use nodes as multiples of L. Consider the case in which I = 4L, for instance (Figure 8-19).

An infographic image represents the intermediate layers that is used to denote nodes as multiples of L.

Figure 8-19

Visualization of a “halving” autoencoder architecture logic

We see that the number of layers needed to either reduce the input to the latent space or to expand the latent space to the output is
$$ \log_2 \frac{I}{L} $$

This simple expression measures how many times we need to multiply L by 2 in order to reach I.

However, it will often be the case that $$ \frac{I}{L} \notin \mathbb{Z} $$ (i.e., L does not divide evenly into I), in which case our earlier logarithmic expression will not be an integer. In these cases, we have a simple fix: we can cast the input to a layer with N nodes, where N = 2^k · L for the largest integer k such that N < I. For instance, if I = 4L + 8, we first "cast" down to 4L and execute our standard halving policy from that point (Figure 8-20).

A flow diagram depicts execution of halving policy from the point 4 L.

Figure 8-20

Adapting the halving autoencoder logic to inputs that are not powers of 2

To accommodate cases in which $$ \log_2 \frac{I}{L} \notin \mathbb{Z} $$ (i.e., the input size is not a power-of-2 multiple of the latent size), we can modify our expression for the number of layers required by wrapping it with the floor function:
$$ \left\lfloor \log_2 \frac{I}{L} \right\rfloor $$
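
As a quick sanity check of this rule with the MNIST-sized values used throughout this section (I = 784, L = 32), a two-line calculation confirms the layer count:
import numpy as np
I, latent = 784, 32
print(int(np.floor(np.log2(I / latent))))  # 4 -> four halving (or doubling) steps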
Using this halving/doubling information flow logic, we can create a generalized buildAutoencoder function that constructs a feed-forward autoencoder given an input size and a latent size (Listing 8-5).
import numpy as np

def buildAutoencoder(inputSize=784, latentSize=32,
                     outActivation='sigmoid'):
    # define architecture components
    encoder = Sequential(name='encoder')
    encoder.add(L.Input((inputSize,)))
    for i in range(int(np.floor(np.log2(inputSize/latentSize))), -1, -1):
        encoder.add(L.Dense(latentSize * 2**i, activation='relu'))
    decoder = Sequential(name='decoder')
    decoder.add(L.Input((latentSize,)))
    for i in range(1, int(np.floor(np.log2(inputSize/latentSize))) + 1):
        decoder.add(L.Dense(latentSize * 2**i, activation='relu'))
    decoder.add(L.Dense(inputSize, activation=outActivation))
    # define model architecture from components
    ae_input = L.Input((inputSize,), name='input')
    ae_encoder = encoder(ae_input)
    ae_decoder = decoder(ae_encoder)
    ae = Model(inputs=ae_input, outputs=ae_decoder)
    return {'model': ae, 'encoder': encoder, 'decoder': decoder}
Listing 8-5

A general function to construct an autoencoder architecture given an input size and a desired latent space, constructed using halving/doubling architectural logic. Note this implementation also has an outActivation parameter in cases where our output is not between 0 and 1

Rather than just returning the model, we also return the encoder and decoder. Recall from earlier discussion of compartmentalized design that retaining a reference to the encoder and decoder components of the autoencoder can be helpful. If not returned, these references – created internally inside the function – will be lost and irretrievable.

Having a generalized autoencoder creation function allows us to perform larger-scale autoencoder experiments. One particularly important phenomenon to understand is the trade-off between model performance and the latent size. As previously mentioned, the latent size must be configured properly such that the task is challenging enough to force the autoencoder to develop meaningful and nontrivial representations, but also feasible enough such that the autoencoder can gain traction at solving the problem (rather than stagnating and not learning anything at all due to the difficulty of the reconstruction problem). Let's train several autoencoders on the MNIST dataset with bottleneck sizes 2^n where n ∈ {1, 2, …, ⌊log2 I⌋} (the last bottleneck size being the largest power of 2 less than the original input size) and obtain each one's validation performance (Listing 8-6, Figure 8-21).
import keras
from tqdm import tqdm

inputSize = 784
earlyStopping = keras.callbacks.EarlyStopping(monitor='loss',
                                              patience=5)
latentSizes = list(range(1, int(np.floor(np.log2(inputSize)))))
validPerf = []
for latentSize in tqdm(latentSizes):
    model = buildAutoencoder(inputSize, 2**latentSize)['model']
    model.compile(optimizer='adam', loss='binary_crossentropy')
    history = model.fit(x_train, x_train, epochs=50,
                        callbacks=[earlyStopping], verbose=0)
    score = keras.metrics.MeanAbsoluteError()
    score.update_state(model.predict(x_valid), x_valid)
    validPerf.append(score.result().numpy())
plt.figure(figsize=(15, 7.5), dpi=400)
plt.plot(latentSizes, validPerf, color='red')
plt.ylabel('Validation Performance')
plt.xlabel('Latent Size (power of 2)')
plt.grid()
plt.show()
Listing 8-6

Training autoencoders with varying latent space sizes and observing the performance trend

A line graph of validation performance against Latent size represents a sloping curve.

Figure 8-21

Relationship between the latent size of a tabular autoencoder (2^x neurons) and the validation performance. Note the diminishing returns

The diminishing returns for larger latent sizes are readily apparent. As the latent size increases, the benefit we reap from each additional increase shrinks. This phenomenon holds generally for deep learning models (recall "Deep Double Descent" from Chapter 1, which similarly compared model size vs. performance in a supervised domain with CNNs).

We can do one better and visualize the differences in the learned latent representations for different bottleneck sizes. The latent representations for the training set after the autoencoder has been trained can be obtained via encoder.predict(x_train). Of course, the latent representations will be in different dimensions for each autoencoder. We can use the t-SNE method (introduced in Chapter 2) to visualize these latent spaces (Listing 8-7, Figures 8-22 through 8-30).
from sklearn.manifold import TSNE
inputSize = 784
earlyStopping = keras.callbacks.EarlyStopping(monitor='loss',
                                              patience=5)
latentSizes = list(range(1, int(np.floor(np.log2(inputSize))) + 1))
for latentSize in tqdm(latentSizes):
    modelSet = buildAutoencoder(inputSize, 2**latentSize)
    model = modelSet['model']
    encoder = modelSet['encoder']
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x_train, x_train, epochs=50,
              callbacks=[earlyStopping], verbose=0)
    transformed = encoder.predict(x_train)
    tsne_ = TSNE(n_components=2).fit_transform(transformed)
    plt.figure(figsize=(10, 10), dpi=400)
    plt.scatter(tsne_[:,0], tsne_[:,1], c=y_train)
    plt.show()
    plt.close()
Listing 8-7

Plotting a t-SNE representation of the latent space of autoencoders with varying latent space sizes

A visualisation of variation of latent representation for the given training set.

Figure 8-22

t-SNE projection of a latent space for an autoencoder with a bottleneck size of two nodes trained on MNIST. Note that in this case, we are projecting into a number of dimensions (2) equal to the dimensionality of the latent space itself (2), hence the pretty snake-like arrangements

An image represents size of the bottleneck in relation with the algorithm.

Figure 8-23

t-SNE projection of a latent space for an autoencoder with a bottleneck size of four nodes trained on MNIST

An image plotted against t S N E and latent space of autoencoders with size of eight nodes.

Figure 8-24

t-SNE projection of a latent space for an autoencoder with a bottleneck size of eight nodes trained on MNIST

An image within a square represents the bottleneck size of sixteen nodes.

Figure 8-25

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 16 nodes trained on MNIST

An image represents latent space with the bottleneck size of 32 nodes.

Figure 8-26

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 32 nodes trained on MNIST

A t S N E projection image depicts the bottleneck size of 64 nodes after training set.

Figure 8-27

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 64 nodes trained on MNIST

A projection image displays 128 nodes training set that results from an autoencoder.

Figure 8-28

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 128 nodes trained on MNIST

An image depicts varying shapes resulting from t S N E project of size of 256 nodes.

Figure 8-29

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 256 nodes trained on MNIST

A visual representation of S N E projection of an autoencoder with bottleneck size of 516 nodes.

Figure 8-30

t-SNE projection of a latent space for an autoencoder with a bottleneck size of 512 nodes trained on MNIST

Note

If we had loaded the model as model = buildAutoencoder(784, 32)['model'] and the encoder as encoder = buildAutoencoder(784, 32)['encoder'], we indeed would obtain a model architecture and an encoder architecture – but they wouldn't be "linked." Each call to buildAutoencoder constructs a fresh, independent set of components, so the stored model would be associated with an encoder that we haven't captured, and the stored encoder would be part of an overarching model that we haven't captured. Thus, we make sure to store the entire set of model components into modelSet first.

Each individual point is colored by the target label (i.e., the digit associated with the data point) for the purpose of exploring the autoencoder’s ability to implicitly “cluster” points of the same digit together or separate them, even though the autoencoder was never exposed to the labels. Observe that as the dimensionality of the latent space increases, the overlap between data samples of different digits decreases until there is functionally complete separation between digits of different classes.

If we build an architecture in which the input is expanded rather than compressed and visualize a dimensionality reduction of the latent space (Listing 8-8), we find that the learned representations are significantly less meaningful (Figure 8-31) – despite this architecture obtaining very high performance (i.e., low training error).
# build the overcomplete architecture in compartmentalized form so that
# we can reference its (expanded) latent space directly
encoder = Sequential(name='encoder')
encoder.add(L.Input((784,)))
encoder.add(L.Dense(1024, activation='relu'))
encoder.add(L.Dense(2048, activation='relu'))
decoder = Sequential(name='decoder')
decoder.add(L.Input((2048,)))
decoder.add(L.Dense(1024, activation='relu'))
decoder.add(L.Dense(784, activation='sigmoid'))
inp = L.Input((784,))
model = Model(inputs=inp, outputs=decoder(encoder(inp)))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_train, x_train, epochs=50)
transformed = encoder.predict(x_train)
tsne_ = TSNE(n_components=2).fit_transform(transformed)
plt.figure(figsize=(10, 10), dpi=400)
plt.scatter(tsne_[:,0], tsne_[:,1], c=y_train)
plt.show()
Listing 8-8

Training and visualizing the latent space of an overcomplete, architecturally redundant autoencoder architecture. This particular architecture has slightly over 5.8 million parameters!

An image visualises compressed latent space reduction. It indicates high performance with low training errors.

Figure 8-31

t-SNE projection of a latent space for an overcomplete autoencoder with a bottleneck size of 2048 trained on MNIST

Let's revisit the example given at the beginning of this subsection: reconstruction of an image of a line. Listing 8-9 generates a dataset of 50-by-50 images with randomly placed line segments using the image processing library cv2.
import cv2
x = np.zeros((1024, 50, 50))
for i in range(1024):
    start = [np.random.randint(0, 50), np.random.randint(0, 50)]
    end = [np.random.randint(0, 50), np.random.randint(0, 50)]
    x[i,:,:] = cv2.line(x[i,:,:], start, end, color=1, thickness=4)
x = x.reshape((1024, 50 * 50))
Listing 8-9

Generating a dataset of 50-by-50 images of lines

Since, in principle, we can represent each line segment with just four values, we'll build and train an autoencoder with four neurons in the latent space on this dataset (Listing 8-10).
modelSet = buildAutoencoder(50 * 50, 4)
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, x, epochs=400, validation_split=0.2)
Listing 8-10

Fitting a simple autoencoder on the synthetic toy line dataset

The model reaches near 0.03 binary cross-entropy, which is quite good. Its reconstructions are very accurate (Figure 8-32).

An image represents binary cross-entropy that results in accurate reconstructions. It depicts a line in different dimensions.

Figure 8-32

Left column: original input images of lines. Right column: reconstructions via an autoencoder with a latent space dimensionality of 4

In fact, an autoencoder trained with only two latent neurons does a decent job at identifying the general shape of the line marked in the input (Figure 8-33). If you look closely, you will notice the silhouettes of other lines. There are many hypotheses to explain their presence. One possibility is that the autoencoder has "memorized"/"internalized" a set of generally useful "landmark" samples that it maps inputs to during prediction, and that a larger latent space would allow more precise placement information to pass through.
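
For reference, this two-neuron variant can be trained in exactly the same way as in Listing 8-10; a brief sketch (the model2/encoder2 names are our own, not from the earlier listings):
modelSet2 = buildAutoencoder(50 * 50, 2)   # same synthetic line dataset, two latent neurons
model2, encoder2 = modelSet2['model'], modelSet2['encoder']
model2.compile(optimizer='adam', loss='binary_crossentropy')
model2.fit(x, x, epochs=400, validation_split=0.2)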

An image represents input received as a straight line from an autoencoder trained set with two neurons.

Figure 8-33

Left column: original input images of lines. Right column: reconstructions via an autoencoder with a latent space dimensionality of 2

Finally, let’s explore how we can apply autoencoders to a strictly tabular dataset – the Mice Protein Expression dataset, used in previous chapters (Listing 8-11).
from sklearn.model_selection import train_test_split as tts
mpe_x = df.drop('class', axis=1)
mpe_y = df['class']
mpe_x_train, mpe_x_valid, mpe_y_train, mpe_y_valid = tts(mpe_x, mpe_y,
                                                train_size=0.8,
                                                random_state=42)
Listing 8-11

Splitting the dataset into training and validation sets

Recall that we need to look at the input data in order to gauge how to deal with the model output in autoencoders. If we call mpe_x_train.min(), Pandas returns a series with the minimum value per column.
DYRK1A_N     0.156849
ITSN1_N      0.261185
BDNF_N       0.115181
NR1_N        1.330831
NR2A_N       1.737540
               ...
H3MeK4_N     0.101787
CaNA_N       0.586479
Genotype     1.000000
Treatment    1.000000
Behavior     1.000000
Length: 80, dtype: float64
Calling .min() again takes the minimum of the minimums across columns. We find that the smallest value across the entire dataset is –0.062007874, whereas the maximum is 8.482553422. Since values can theoretically be negative, we’ll use a linear output activation instead of a ReLU and optimize using the standard Mean Squared Error loss for regression problems (Listing 8-12).
modelSet = buildAutoencoder(len(mpe_x.columns), 8,
                       outActivation='linear')
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
history = model.fit(mpe_x_train, mpe_x_train, epochs=150)
Listing 8-12

Fitting an autoencoder on the Mice Protein Expression dataset

After 150 epochs of training, which progresses very quickly (this is a comparatively small dataset), the autoencoder obtains good training and validation performance (Table 8-1, Figure 8-34).
Table 8-1

Performance of the autoencoder trained on the Mice Protein Expression dataset

 

                      Train     Validation
Mean Squared Error    0.0117    0.0118
Mean Absolute Error   0.0626    0.0625

A graph represents validation and training performance received after 150 epochs of training sets. It exhibits an L shaped curve.

Figure 8-34

Training history of an autoencoder trained on the Mice Protein Expression dataset

Figure 8-35 demonstrates some sample latent vectors and reconstructions made by our autoencoder, with the input and reconstructed vectors reshaped into 8-by-10 grids for more convenient viewing.
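
A rough sketch of how panels like those in Figure 8-35 could be drawn, assuming the trained model and encoder from Listing 8-12 (the reshapes map the 80 features into an 8-by-10 grid and the 8-dimensional latent vector into a 2-by-4 grid):
sample = mpe_x_valid.values[:1]                         # one validation row, shape (1, 80)
plt.figure(figsize=(10, 5), dpi=400)
plt.subplot(1, 3, 1)
plt.imshow(sample.reshape((8, 10))); plt.axis('off'); plt.title('Original (Reshaped)')
plt.subplot(1, 3, 2)
plt.imshow(encoder.predict(sample).reshape((2, 4))); plt.axis('off'); plt.title('Latent Space')
plt.subplot(1, 3, 3)
plt.imshow(model.predict(sample).reshape((8, 10))); plt.axis('off'); plt.title('Reconstructed')
plt.show()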

An image pattern highlights the latent vectors samples and reconstruction created by the autoencoder. It exhibits the vectors into 8 by 10 grids.

Figure 8-35

Samples and the associated latent vector and reconstruction by an autoencoder trained on the Mice Protein Expression dataset. Samples and reconstructions are represented in two spatial dimensions for convenience of viewing

We can employ a similar technique as previously employed on the MNIST dataset – visualizing the latent space of an autoencoder using t-SNE. Each data point in Figure 8-36 is colored by one of the eight classes each row in the Mice Protein Expression dataset falls into. This tabular autoencoder obtains pretty good separation between classes without any exposure to the labels.
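
A sketch of how this projection could be produced, assuming the encoder from Listing 8-12 and numerically encoded class labels:
from sklearn.manifold import TSNE
latent = encoder.predict(mpe_x_train)
proj = TSNE(n_components=2).fit_transform(latent)
plt.figure(figsize=(10, 10), dpi=400)
plt.scatter(proj[:, 0], proj[:, 1], c=mpe_y_train)      # color each point by its class
plt.show()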

A scatter diagram visualises data points obtained as a result of tabular autoencoder.

Figure 8-36

t-SNE projection of a latent space for an autoencoder trained on the Mice Protein Expression dataset

Note that a more formal/rigorous tabular autoencoder design would require us to standardize or normalize all columns to within the same domain. Tabular datasets often contain features that operate on different scales; for instance, say feature A represents a proportion (i.e., between 0 and 1, inclusive), whereas feature B measures years (i.e., likely larger than 1000). Regression losses simply take the mean error across all columns, which means that the reward for correctly reconstructing A is negligible compared to reconstructing feature B. In this case, however, all columns are in roughly the same range, so skipping this step is tolerable.
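
A minimal sketch of what that preprocessing might look like with scikit-learn (the scaler choice is an assumption; StandardScaler is an equally reasonable alternative):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
mpe_x_train_scaled = scaler.fit_transform(mpe_x_train)  # fit scaling statistics on training data only
mpe_x_valid_scaled = scaler.transform(mpe_x_valid)      # reuse the training statistics
With all features mapped into [0, 1], a sigmoid output activation (and, if the features are near-binary, binary cross-entropy) becomes a natural pairing again.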

In the next subsection, we will explore a direct application of autoencoders to concretely improve the performance of supervised models.

Autoencoders for Pretraining

Vanilla autoencoders, as we have already seen, can do some pretty cool things. We saw that a vanilla autoencoder trained on various datasets can perform implicit clustering and separation of digits without ever being exposed to the labels themselves. Rather, natural differences in the input resulting from differences in labels are independently observed and implicitly recognized by the autoencoder.

This sort of impressive feature extraction capability is valuable in the context of training neural networks to perform supervised tasks. Say we want a neural network to classify digits from the MNIST dataset. If we start from scratch, we are asking the neural network to learn both how to extract the optimal set of features and how to interpret them – all at once, with no prior information. However, we see that the encoder of an autoencoder trained on the MNIST dataset has developed an impressive feature extraction and class separation scheme. We can use the encoder of the autoencoder as a pretraining instrument; rather than building and training a new network that learns both extraction and interpretation from scratch, we can simply append a model component to the output of the encoder to interpret the already-learned feature extractor (i.e., the encoder) (Figure 8-37).

An infographic image illustrates autoencoder and task training sets. It describes the features of the extractor.

Figure 8-37

Schematic of multistage pretraining

In the first stage of training, we train the autoencoder on the standard input reconstruction task. After sufficient training, we can extract the encoder and append an “interpretation”-focused model component that assembles and arranges the features extracted by the encoder into the desired output.

During stage 2, we impose layer freezing upon the encoder, meaning that we prevent its weights from being trained. This is to retain the learned structures of the encoder. We spent a significant amount of effort obtaining a good feature extractor; if we do not impose layer freezing, we will find that optimizing a good feature extractor connected to a very poor (randomly initialized) feature interpreter degrades the feature extractor.

However, once good performance is obtained on training with a frozen feature extractor and a trainable feature interpreter, the entire model can be trained for a few epochs for the purposes of fine-tuning (Figure 8-38). The idea here is that the feature interpreter has developed a good relationship with the static feature extractor, but now both can be jointly optimized to improve the relationship. (Just like couples in relationships, it’s not healthy if one partner is always static!)

A flow diagram depicts the performance achieved as a result of the feature interpreter. It has labels namely primary training and fine-tuning.

Figure 8-38

Freezing followed by fine-tuning can be an effective way to perform autoencoder pretraining.

Let’s begin by demonstrating autoencoder pretraining on MNIST. We’ll use the buildAutoencoder function defined previously to fit an autoencoder, making sure to retain references to both the original model and the encoder (Listing 8-13).
modelSet = buildAutoencoder(784, 32)
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_train, x_train, epochs=20)
Listing 8-13

Training an autoencoder on MNIST

After the model has been sufficiently trained, we can extract the encoder and stack it as the feature extraction unit/component of our task model (Listing 8-14). The outputs of the encoder (named encoded in the following script) are further interpreted via several fully connected layers. The encoder is set not to be trainable (i.e., layer freezing). The task model is trained on the original supervised task.
inp = L.Input((784,))
encoded = encoder(inp)
dense1 = L.Dense(16, activation='relu')(encoded)
dense2 = L.Dense(16, activation='relu')(dense1)
dense3 = L.Dense(10, activation='softmax')(dense2)
encoder.trainable = False  # freeze the pretrained encoder
task_model = Model(inputs=inp, outputs=dense3)
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy')
history = task_model.fit(x_train, y_train, epochs=50)
Listing 8-14

Repurposing the encoder of the autoencoder as the frozen encoder/feature extractor of a supervised network

After sufficient training, it is common practice to make the encoder trainable again and fine-tune the entire architecture in an end-to-end fashion (Listing 8-15).
encoder.trainable = True
task_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')  # recompile to apply unfreezing
task_model.fit(x_train, y_train, epochs=5)
Listing 8-15

Fine-tuning the whole supervised network by unfreezing the encoder

We often reduce the learning rate on fine-tuning tasks to prevent destruction/”overwriting” of information learned during the pretraining process. This can be accomplished by recompiling the model after pretraining with an optimizer configured with a different initial learning rate.
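
Concretely, the fine-tuning step in Listing 8-15 could be recompiled with a smaller learning rate along these lines (the specific value of 1e-4 is an arbitrary assumption):
from keras.optimizers import Adam
# recompile with a reduced learning rate before fine-tuning the unfrozen model
task_model.compile(optimizer=Adam(learning_rate=1e-4),
                   loss='sparse_categorical_crossentropy')
task_model.fit(x_train, y_train, epochs=5)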

We can compare the performance of this model to one with no pretraining (i.e., begins learning in a supervised fashion from scratch) (Listing 8-16, Figure 8-39).
modelSet = buildAutoencoder(784, 32)
model = modelSet['model']
encoder = modelSet['encoder']  # untrained encoder: no pretraining this time
inp = L.Input((784,))
encoded = encoder(inp)
dense1 = L.Dense(16, activation='relu')(encoded)
dense2 = L.Dense(16, activation='relu')(dense1)
dense3 = L.Dense(10, activation='softmax')(dense2)
task_model = Model(inputs=inp, outputs=dense3)
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy')
history2 = task_model.fit(x_train, y_train, epochs=20)
plt.figure(figsize=(15, 7.5), dpi=400)
plt.plot(history.history['loss'], color='red',
         label='With AE Pretraining')
plt.plot(history2.history['loss'], color='blue',
         label='Without AE Pretraining')
plt.grid()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Listing 8-16

Training a supervised model with the same architecture as the model with pretraining, but without pretraining the encoder via an autoencoding task

A line graph represents the comparison of the given model during the pretraining stage.

Figure 8-39

Comparing the training curves for a classifier trained on the MNIST dataset with and without autoencoder pretraining

The MNIST dataset is relatively simple, so both models converge relatively quickly to good weights. However, the model with pretraining is noticeably "ahead" of the other. By taking the difference between the epochs at which the models with and without pretraining reach some loss value, we can estimate how "far ahead" a model with autoencoder pretraining is. For any loss value p reached after at least one epoch of training, the model with pretraining reaches p two to four epochs before the model without pretraining.
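
One rough way to quantify this gap from the two training histories, assuming history and history2 were captured as in Listings 8-14 and 8-16 (the loss thresholds below are arbitrary examples):
import numpy as np
pre = np.array(history.history['loss'])    # with autoencoder pretraining
base = np.array(history2.history['loss'])  # without pretraining
for p in [0.3, 0.2, 0.1]:                  # example loss thresholds
    if (pre <= p).any() and (base <= p).any():
        print(p, int(np.argmax(base <= p)) - int(np.argmax(pre <= p)), 'epochs ahead')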

This process is admittedly superfluous on the MNIST dataset, which has a comparatively simple set of rules in a comparatively small number of dimensions. However, the advantage manifests more significantly for more complex datasets, as has been shown with more advanced computer vision and natural language processing tasks. Neural networks trained to perform large-scale image classification (e.g., ImageNet), for instance, benefit significantly from performing an autoencoder pretraining task that learns useful latent features that are later interpreted and fine-tuned. Similarly, it has been shown that language models learn important fundamental structures of language by performing reconstruction tasks, which can later be used as the basis for a supervised task like text classification or generation (Figure 8-40).

A model diagram represents reconstruction tasks that can be used as a basis for generation and text classification.

Figure 8-40

General transfer learning/pretraining design used dominantly in computer vision

Recall, for instance, the Inception and EfficientNet models discussed in Chapter 4. Keras allows users to load weights from a model trained on ImageNet because the feature extraction “skills” required to perform well on a wide-ranging task like ImageNet are valuable or can be adapted to become valuable in most computer vision tasks.

However, as we have previously seen in Chapters 4 and 5, the success of a deep learning method on complex image and natural language data does not necessarily bar it from being useful to tabular data applications too.

Let’s consider the Mice Protein Expression dataset. We can begin by instantiating and training a sample autoencoder (Listing 8-17).
modelSet = buildAutoencoder(len(mpe_x_train.columns), 32,
                            outActivation='linear')
model = modelSet['model']
encoder = modelSet['encoder']
model.compile(optimizer='adam', loss='mse')
history = model.fit(mpe_x_train, mpe_x_train, epochs=50)
Listing 8-17

Building and training an autoencoder on the Mice Protein Expression dataset

We can now create and fit a task model using the trained encoder in two phases, the first in which the encoder is frozen and the second in which the encoder is trainable (Listing 8-18, Figure 8-41).
inp = L.Input((len(mpe_x_train.columns),))
encoded = encoder(inp)
dense1 = L.Dense(32, activation='relu')(encoded)
dense2 = L.Dense(32, activation='relu')(dense1)
dense3 = L.Dense(32, activation='relu')(dense2)
dense4 = L.Dense(8, activation='softmax')(dense3)
encoder.trainable = False  # stage 1: freeze the pretrained encoder
task_model = Model(inputs=inp, outputs=dense4)
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
history_i = task_model.fit(mpe_x_train, mpe_y_train-1, epochs=30,
                           validation_data=(mpe_x_valid, mpe_y_valid-1))
encoder.trainable = True   # stage 2: unfreeze and fine-tune
task_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
history_ii = task_model.fit(mpe_x_train, mpe_y_train-1, epochs=10,
                            validation_data=(mpe_x_valid,
                                             mpe_y_valid-1))
Listing 8-18

Using the pretrained encoder in a supervised task

A line graph represents a task model utilising the trained encoder in the frozen and training phases.

Figure 8-41

Validation and training curves for stages 1 and 2

Alternatively, consider the Higgs Boson dataset. This dataset only has 28 features. If we use our standard autoencoder logic, which halves the number of nodes in each encoder layer and doubles the number of nodes in each decoder layer, we will either need a very small number of layers to use a reasonable latent space size or a very small latent space to use a reasonable number of layers. For instance, if our latent space has only eight features, the autoencoder logic would build only two layers (28 → 16 → 8). On the other hand, if we want a larger number of layers (e.g., five), we would need a very small latent space (e.g., an autoencoder with 28 → 16 → 8 → 4 → 2 → 1). In this case, it's most beneficial to design a custom autoencoder with a sufficiently large latent space and a sufficient number of layers. We could design an autoencoder, for instance, with six layers in the encoder and decoder each and a latent space of 16 dimensions (Listing 8-19).
encoder = Sequential()
encoder.add(L.Input((len(X_train.columns),)))
encoder.add(L.Dense(28, activation='relu'))
encoder.add(L.Dense(28, activation='relu'))
encoder.add(L.Dense(28, activation='relu'))
encoder.add(L.Dense(16, activation='relu'))
encoder.add(L.Dense(16, activation='relu'))
encoder.add(L.Dense(16, activation='relu'))
decoder = Sequential()
decoder.add(L.Input((16,)))
decoder.add(L.Dense(16, activation='relu'))
decoder.add(L.Dense(16, activation='relu'))
decoder.add(L.Dense(16, activation='relu'))
decoder.add(L.Dense(28, activation='relu'))
decoder.add(L.Dense(28, activation='relu'))
decoder.add(L.Dense(28, activation='linear'))
inp = L.Input((28,))
encoded = encoder(inp)
decoded = decoder(encoded)
ae = keras.models.Model(inputs=inp, outputs=decoded)
ae.compile(optimizer='adam', loss='mse', metrics=['mae'])
history = ae.fit(X_train, X_train, epochs=100,
                 validation_data=(X_valid, X_valid))
Listing 8-19

Defining a custom autoencoder architecture for the Higgs Boson dataset

We can treat a static encoder as a feature extractor for our task model (Listing 8-20, Figures 8-42 and Figure 8-43).
inp = L.Input((len(X_train.columns),))
encoded = encoder(inp)
dense1 = L.Dense(16, activation='relu')(encoded)
dense2 = L.Dense(16, activation='relu')(dense1)
dense3 = L.Dense(16, activation='relu')(dense2)
dense4 = L.Dense(1, activation='sigmoid')(dense3)
encoder.trainable = False  # stage 1: freeze the pretrained encoder
task_model = keras.models.Model(inputs=inp, outputs=dense4)
task_model.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
history_i = task_model.fit(X_train, y_train, epochs=70,
                           validation_data=(X_valid, y_valid))
encoder.trainable = True   # stage 2: unfreeze and fine-tune
task_model.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
history_ii = task_model.fit(X_train, y_train, epochs=30,
                            validation_data=(X_valid, y_valid))
Listing 8-20

Using the pretrained encoder as a feature extractor for a supervised task

A line graph represents downward-sloping and erratic curves. It exhibits pre-trained encoder functioning as a feature extractor.

Figure 8-42

Validation and training loss curves for stages 1 and 2

A line graph represents a rising and erratic curve which makes the static encoder a feature extractor.

Figure 8-43

Validation and training accuracy curves for stages 1 and 2

We can observe a significant amount of overfitting in this particular case. We can attempt to improve generalization by employing best practices such as adding dropout or batch normalization.
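
For instance, the task head from Listing 8-20 might be rebuilt with dropout and batch normalization; the rates and placement below are assumptions rather than a prescription:
inp = L.Input((len(X_train.columns),))
encoded = encoder(inp)
x = L.Dense(16, activation='relu')(encoded)
x = L.BatchNormalization()(x)
x = L.Dropout(0.3)(x)                      # 0.3 is an assumed dropout rate
x = L.Dense(16, activation='relu')(x)
x = L.Dropout(0.3)(x)
out = L.Dense(1, activation='sigmoid')(x)
task_model = keras.models.Model(inputs=inp, outputs=out)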

Lastly, it should be noted that using autoencoders for pretraining is a great semi-supervised method. Semi-supervised methods make use of data with and without labels (and are used most often in cases where labeled data is scarce and unlabeled data is abundant). Say you possess three sets of data: X_unlabeled, X_labeled, and y (which corresponds to X_labeled). You can train an autoencoder to reconstruct X_unlabeled and then use the frozen encoder as the feature extractor in a task model to predict y from X_labeled. This technique generally works well even when the size of X_unlabeled is significantly larger than the size of X_labeled; the autoencoding task learns meaningful representations that should be significantly easier to associate with a supervised target than beginning from random initialization.
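
A minimal sketch of that recipe, assuming x_unlabeled, x_labeled, and y_labeled are prepared arrays and that the target happens to be binary:
modelSet = buildAutoencoder(x_unlabeled.shape[1], 16)
ae, enc = modelSet['model'], modelSet['encoder']
ae.compile(optimizer='adam', loss='mse')
ae.fit(x_unlabeled, x_unlabeled, epochs=50)          # pretrain on unlabeled data
enc.trainable = False                                # freeze the pretrained encoder
inp = L.Input((x_labeled.shape[1],))
head = L.Dense(16, activation='relu')(enc(inp))
out = L.Dense(1, activation='sigmoid')(head)
clf = Model(inputs=inp, outputs=out)
clf.compile(optimizer='adam', loss='binary_crossentropy')
clf.fit(x_labeled, y_labeled, epochs=30)             # train the task head on labeled data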

Multitask Autoencoders

Pretraining with autoencoders is often an effective strategy to take advantage of quality learned latent features. However, one criticism of the system is that it proceeds sequentially – autoencoder training takes place at a separate stage than the task training. Multitask autoencoders train the network on the autoencoder task and the intended task simultaneously (hence the name multitask). These autoencoders accept one input that is encoded by the encoder into a latent space. This one set of latent features is decoded separately by two “decoders” into two outputs; one output is dedicated to the autoencoder task, while the other is dedicated to the intended task. The network learns both of these tasks at the same time during training (Figures 8-44 and 8-45).

A model represents output dedicated to the task of the autoencoder. It has labels namely the input data, model and task output.

Figure 8-44

Original task model

A flow diagram depicts the latent features dedicated to the intended task. It has labels namely the input data, encoding layers, and latent space.

Figure 8-45

Multitask learning

By training the autoencoder simultaneously along the task network, we can theoretically experience the benefits of the autoencoder in a dynamic fashion. Say the encoder has “difficulty” encoding features in a way relevant to the task output, which can be difficult. However, the encoder component of the model can still decrease the overall loss by learning features relevant to the autoencoder reconstruction task. These features may provide continuous support for the task output by providing the optimizer a viable path to loss minimization – it is “another way out,” so to speak. Using multitask autoencoders is often an effective technique to avoid or minimize difficult local minimum problems, in which the model makes mediocre to negligible progress in the first few moments of training and then plateaus (i.e., is stuck in a poor local minimum).

In order to construct a multitask autoencoder, we begin by initializing an autoencoder and extracting the encoder and decoder components. We create a “tasker” model that accepts latent features (i.e., data of the shape of the encoder output) and processes them into the task output (i.e., one of ten digits, in the case of MNIST). Each of these components can be linked using functional API syntax to form a complete multitask autoencoder architecture (Listing 8-21, Figure 8-46).
modelSet = buildAutoencoder(784, 32)
model = modelSet['model']
encoder = modelSet['encoder']
decoder = modelSet['decoder']
tasker = keras.models.Sequential(name='taskOut')
tasker.add(L.Input((32,)))
for i in range(3):
    tasker.add(L.Dense(16, activation='relu'))
tasker.add(L.Dense(10, activation='softmax'))
inp = L.Input((784,), name='input')
encoded = encoder(inp)
decoded = decoder(encoded)
taskOut = tasker(encoded)
taskModel = Model(inputs=inp, outputs=[decoded, taskOut])
Listing 8-21

Building a multitask autoencoder for the MNIST dataset

A network architecture depicts the interlinking of functional A P I syntax to create a complete multitask autoencoder.

Figure 8-46

Visualization of a multitask autoencoder architecture

Because the multitask autoencoder has multiple outputs, we need to specify losses and labels for each of the outputs by referencing a particular output's name. In this case, the two outputs have been named "decoder" and "taskOut." The decoder output will be given the original input (i.e., x_train) and optimized with binary cross-entropy, since its objective is to perform pixel-wise reconstruction. The task output will be given the image labels (i.e., y_train) and optimized with sparse categorical cross-entropy, since its objective is to perform multiclass classification (Listing 8-22).
taskModel.compile(optimizer='adam',
                  loss = {'decoder':'binary_crossentropy',
                          'taskOut':'sparse_categorical_crossentropy'})
history = taskModel.fit(x_train, {'decoder':x_train,
                                  'taskOut': y_train},
                        epochs=100)
Listing 8-22

Compiling and fitting the task model

We can observe from the training history that the model is able to reach both a fairly good task loss and a reconstruction loss within just a few dozen epochs (Listing 8-23, Figure 8-47).
plt.figure(figsize=(15, 7.5), dpi=400)
plt.plot(history.history['decoder_loss'], color='red', linestyle='--', label='Reconstruction Loss')
plt.plot(history.history['taskOut_loss'], color='blue', label='Task Loss')
plt.plot(history.history['loss'], color='green', linestyle='-.', label='Overall Loss')
plt.grid()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Listing 8-23

Plotting out different dimensions of the performance over time

A line graph depicts three different curves. It exhibits a similar pattern.

Figure 8-47

Different dimensions of performance (reconstruction loss, task loss, overall loss)

Figures 8-48 to 8-51 visualize how the state of the multitask autoencoder progresses throughout each epoch.

An image interprets the performance of the encoder at every epoch.

Figure 8-48

Multitask autoencoder at zero epochs

An image in a square represents an autoencoder at one epoch.

Figure 8-49

Multitask autoencoder at one epoch

An image in the shape of the number seven at two epochs.

Figure 8-50

Multitask autoencoder at two epochs

An output image represents the progression of multitask encoder at several more epochs.

Figure 8-51

Several more epochs

From these visualizations and the training history, we see that the multitask autoencoder learns the task more quickly than the autoencoding objective that was intended to assist it! For MNIST, the classification task is simply more straightforward than the autoencoding task, which makes sense. In such cases, using a multitask autoencoder is not beneficial; it is probably more effective to train the task network directly or to use an autoencoder for pretraining instead.

We can use an adapted approach on the Mice Protein Expression dataset, in which we see that autoencoding is a more approachable problem than the classification task itself, both from the training history (Figure 8-52) and from the output state progression visualizations (Figures 8-53 through 8-56).

A line graph exhibits the training history of the multitask approach on the Mice Protein Expression dataset.

Figure 8-52

Different dimensions of performance on the Mice Protein Expression dataset

Four images. 3 grid image visualizes various output state progression values of the set of 80 features and one graph plots the Absolute error, truth and predicted values.

Figure 8-53

The state of the multitask autoencoder after zero epochs (i.e., upon initialization). Top: displays the original set of 80 features in the Mice Protein Expression dataset (arranged in a grid for more convenient visual viewing), the output of the decoder (of which the goal is to reconstruct the input), and the absolute error of the reconstruction. Bottom: the predicted and true classes (eight in total) and the absolute probability error

Four images. 3 grid images visualize the various outputs obtained after one epoch and one graph plots the Absolute error, truth, and predicted values.

Figure 8-54

The state of the multitask autoencoder after one epoch

Four images. 3 grid images illustrate the variation in output in a multitask encoder and one graph plots the Absolute error, truth, and predicted values.

Figure 8-55

The state of the multitask autoencoder after five epochs

Four images. 3 grid images display various encoding features at the original input, decoder output, and other stages and one graph plots the Absolute error, truth, and predicted values.

Figure 8-56

The state of the multitask autoencoder after 50 epochs

Figures 8-53 through 8-56 demonstrate the performance of the reconstruction task alongside the classification task at various stages of training. Notice that the reconstruction error converges to near zero quickly and helps “pull,” or guide, the task error toward zero over time.

In many cases, simultaneous execution of the autoencoder task and the original desired task can help provide stimulus to “push” progress on the desired task. However, you may make the valid objection that once the desired task reaches sufficiently good performance, it becomes limited by the autoencoding task.

One method to address this is simply to detach the autoencoder output from the model by creating a new model connecting the input to the task output and fine-tuning it on the dataset.
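As a minimal sketch (assuming the inp and taskOut tensors from Listing 8-21 are still in scope), detaching the decoder output might look like the following; because the new model shares the encoder’s and tasker’s weights, fitting it fine-tunes them directly:
taskOnly = Model(inputs=inp, outputs=taskOut)   # drop the decoder output entirely
taskOnly.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy')
taskOnly.fit(x_train, y_train, epochs=10)       # fine-tune on the task alone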

Another more sophisticated technique is to change the loss weights between the original desired task and the autoencoding task. While Keras weighs multiple losses equally by default, we can provide different weights to reflect different levels of priority or importance delegated to each of the tasks. At the beginning of training, we can give a high weight to the autoencoding task, since we want the model to develop useful representations through a (ideally somewhat easier) task of autoencoding. Throughout the training duration, the weight on the original task model loss can be successively increased and the weight on the autoencoder model loss decreased. To formalize this, let α be the weight on the task output loss, and let 1 − α be the weight on the decoder output loss (with 0 < α < 1).

The sigmoid function $$ \sigma(x)=\frac{1}{1+e^{-x}} $$ is a convenient way to transition smoothly from a value very close to a lower bound to a value very close to an upper bound. Over the span of 100 epochs, we can employ a simple (arbitrarily set but functional) transformation of the sigmoid function to obtain a smooth transition from a low to a high value of α (visualized by Listing 8-24 in Figure 8-57), where t represents the epoch number:
$$ \alpha =\sigma \left(\frac{t-50}{10}\right)=\frac{1}{1+e^{-\left(\frac{t-50}{10}\right)}} $$

A line graph represents a smooth transition from a low to a high value of α under the sigmoid schedule. It plots the task output weight and the decoder output weight.

Figure 8-57

Plot of the task output loss weight and the decoder weight across each epoch

plt.figure(figsize=(15, 7.5), dpi=400)
epochs = np.linspace(1, 100, 100)
# sigmoid schedule: alpha rises smoothly from ~0 to ~1 around epoch 50
alpha = 1/(1 + np.exp(-(epochs-50)/10))
plt.plot(epochs, alpha, color='red', label='Task Output Weight')
plt.plot(epochs, 1-alpha, color='blue', label='Decoder Output Weight')
plt.xlabel('Epochs')
plt.legend()
plt.show()
Listing 8-24

Plotting out our custom α-adjusting curve

A generalized equation to scale α across tmax total epochs using a transformation of the sigmoid function is as follows:
$$ \alpha =\sigma \left(\frac{t-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)=\frac{1}{1+e^{-\left(\frac{t-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)}} $$
At initial conditions, α is very small. (For simplicity, we approximate the first epoch as t ≈ 0 rather than t = 1.)
$$ \alpha \Big|_{t\approx 0}\to \frac{1}{1+e^{-\left(\frac{-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)}}=\frac{1}{1+e^{5}}\approx 0.0067 $$
The training regime completes at t = tmax, at which α is very close to 1:
$$ \alpha \Big|_{t={t}_{\max}}\to \frac{1}{1+e^{-\left(\frac{t_{\max}-\frac{t_{\max}}{2}}{\frac{t_{\max}}{10}}\right)}}=\frac{e^{5}}{1+e^{5}}\approx 0.993307 $$

Moreover, by taking the derivative of this schedule with respect to t, we find that the largest per-epoch change in α for some tmax is $$ \frac{5}{2{t}_{\max}} $$, attained at the midpoint of training. As tmax increases, analysis of the derivative reveals that the overall change becomes more uniformly spread out; for large values of tmax, the per-epoch change is nearly constant and small (i.e., the derivative is near 0 at any individual epoch). A simple linear transformation of α also suffices in most cases in which tmax is reasonably large.
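As a quick sketch of this schedule in plain NumPy (the printed values match the endpoint calculations above):
import numpy as np

def alpha_schedule(t, t_max):
    # sigmoid-shaped task-loss weight: ~0 at the start of training, ~1 at the end
    return 1 / (1 + np.exp(-(t - t_max / 2) / (t_max / 10)))

print(alpha_schedule(0, 100))    # ~0.0067
print(alpha_schedule(50, 100))   # 0.5 at the midpoint
print(alpha_schedule(100, 100))  # ~0.9933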

Loss weighting is conveyed in the compiling stage. This means that we’ll have to recompile and fit every epoch. This is not difficult to do; we can write a for loop that loops through every epoch, calculates the α value for that epoch, compiles the model with that loss weighting, and fits for one epoch. Collecting the training history is slightly more manual; we need to collect the metrics for the single epoch and append them to user-created lists (Listing 8-25).
total_epochs = 100
lossParams = {'decoder':'binary_crossentropy',
              'taskOut':'sparse_categorical_crossentropy'}
loss, decoderLoss, taskOutLoss = [], [], []
for epoch in range(1, total_epochs+1):
    # compute the task-loss weight for this epoch from the sigmoid schedule
    alpha = 1/(1 + np.exp(-(epoch-50)/10))
    # recompile with the updated loss weighting
    taskModel.compile(optimizer='adam',
                      loss = lossParams,
                      loss_weights = {'taskOut': alpha,
                                      'decoder': 1-alpha})
    # fit for a single epoch and collect the metrics manually
    history = taskModel.fit(x_train, {'decoder':x_train,
                                      'taskOut': y_train},
                            epochs = 1)
    loss.extend(history.history['loss'])
    decoderLoss.extend(history.history['decoder_loss'])
    taskOutLoss.extend(history.history['taskOut_loss'])
Listing 8-25

Recompiling and fitting a multitask autoencoder with varied loss weighting

For another higher-code but perhaps smoother approach to dynamically adjusting the loss weights of multi-output models, which does not require repeated refitting, see Anuj Arora’s well-written post on adaptive loss weighting in Keras using callbacks: https://medium.com/dive-into-ml-ai/adaptive-weighing-of-loss-functions-for-multiple-output-keras-models-71a1b0aca66e.
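As a minimal sketch of that callback-based idea (an illustrative pattern, not the post’s code): hold the task-loss weight in a tf.Variable that custom loss functions read at call time, and update it from a callback so no recompiling is required. The names below assume the taskModel, x_train, and y_train from the earlier listings.
import numpy as np
import tensorflow as tf
from tensorflow import keras

# The weight lives in a tf.Variable so it can be changed between epochs
# without recompiling the model.
alpha = tf.Variable(0.0, trainable=False, dtype=tf.float32)

bce = keras.losses.BinaryCrossentropy()
scce = keras.losses.SparseCategoricalCrossentropy()

def weighted_reconstruction_loss(y_true, y_pred):
    return (1.0 - alpha) * bce(y_true, y_pred)   # weight read at call time

def weighted_task_loss(y_true, y_pred):
    return alpha * scce(y_true, y_pred)

class AlphaScheduler(keras.callbacks.Callback):
    def __init__(self, total_epochs):
        super().__init__()
        self.total_epochs = total_epochs
    def on_epoch_begin(self, epoch, logs=None):
        t = epoch + 1
        alpha.assign(1 / (1 + np.exp(-(t - self.total_epochs / 2)
                                     / (self.total_epochs / 10))))

taskModel.compile(optimizer='adam',
                  loss={'decoder': weighted_reconstruction_loss,
                        'taskOut': weighted_task_loss})
taskModel.fit(x_train, {'decoder': x_train, 'taskOut': y_train},
              epochs=100, callbacks=[AlphaScheduler(100)])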

Figure 8-58 demonstrates the history of the reconstruction, task, and overall losses throughout training of the multitask autoencoder, with the background shaded by the value of α used at each epoch. Note that the reconstruction task is easier than the original intended task (hence its faster decline in loss) and that the overall loss takes on a logistic shape: as α changes most rapidly in epochs 40–60, the overall loss “switches” from tracking the reconstruction loss to tracking the original task loss.

A diagram interprets the reconstruction task and the total losses incurred through the entire training of the multitask autoencoder.

Figure 8-58

Diagram of reconstruction loss, task loss, and overall loss (now a dynamically weighted sum) with the weighting gradient shaded in the background

Multitask autoencoders perform best in difficult supervised classification tasks that benefit from rich latent features, which can be learned well by autoencoders.

Sparse Autoencoders

Standard autoencoders are constrained by representation size – autoencoder architectures are built with a “physical” bottleneck through which information must be compressed. The autoencoder attempts to maximize the amount of information it can squeeze through a significantly compressed latent space such that the information can reliably be decoded into the original output (Figure 8-59).

A network architecture exhibits compressed latent space. It indicates data is decoded reliably into the original output.

Figure 8-59

A standard autoencoder, which encodes information into a densely packed and quasi-continuous latent space

However, this is not the only limit we can impose. Another information bottleneck tool is sparsity. We can make the bottleneck layer very large but allow only a few nodes to be active on any one pass. While this still limits the amount of information that can pass through the bottleneck layer, the network is given more freedom and control to “choose” which nodes information passes through, and that choice is itself an additional medium of information expression (Figure 8-60).

A network diagram exhibits a free pattern that enables the selection of the data of nodes.

Figure 8-60

A sparse autoencoder, in which a much larger latent size is accessible but only a few nodes can be used at any one time

To maintain sparsity, we generally impose L1 regularization on the layer’s activity. (Recall the discussion of regularization in Chapter 3, “Regularization Learning Networks.”) L1 regularization penalizes the bottleneck layer for producing output activity that is too large. Assuming a network uses binary cross-entropy to measure the task loss and λ represents the overall activity/output of the bottleneck layer, the joint loss of an L1-regularized network is as follows:
$$ \textrm{loss}=\textrm{BCE}\left({y}_{\textrm{pred}},{y}_{\textrm{true}}\right)+\alpha \cdot \left|\lambda \right| $$

The parameter α is user-defined and controls the “importance” of the L1 regularization term relative to the task loss. Setting the correct value of α is important for correct behavior. If α is too small, the network ignores the sparsity restriction in favor of completing the task, which is now made quasi-trivial by the overcomplete bottleneck layer. If α is too large, the network ignores the task by learning the “ultimate sparsity” – predicting all zeros in the bottleneck layer, which entirely minimizes λ but performs poorly on the actual task we want it to learn.

An alternative commonly used penalty is L2 regularization, in which the square rather than the absolute value is penalized:
$$ \textrm{loss}=\textrm{BCE}\left({y}_{\textrm{pred}},{y}_{\textrm{true}}\right)+\alpha \cdot {\lambda}^2 $$

This is a common machine learning paradigm. L2 regularization tends to produce sets of values generally near zero but not exactly at zero, whereas L1 regularization tends to produce values solidly at zero. An intuitive explanation is that L2 regularization significantly discounts the incentive to decrease values that are already somewhat near zero. The decrease from 3 to 2, for instance, is rewarded with a penalty decrease of 3² − 2² = 5. The decrease from 1 to 0, on the other hand, is rewarded with a measly penalty decrease of 1² − 0² = 1. L1 regularization, by contrast, rewards the decrease from 3 to 2 identically to the decrease from 1 to 0. We generally use L1 regularization to impose sparsity constraints because of this property.
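As a tiny numerical illustration of this property (a standalone sketch, not part of any model):
# Penalty reduction earned by shrinking an activation a by one unit
# under an L1 penalty (|a|) versus an L2 penalty (a**2).
for a in [3, 2, 1]:
    l1_gain = abs(a) - abs(a - 1)    # always 1
    l2_gain = a**2 - (a - 1)**2      # 2a - 1: large far from zero, tiny near zero
    print(f"a={a}: L1 gain={l1_gain}, L2 gain={l2_gain}")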

To implement this, we need to make a slight modification to our original buildAutoencoder function. We can build the autoencoder as if we were leading up to and from a certain implicit latent size, but replace the implicit latent size with the real (expanded) latent size. For instance, consider an autoencoder built with an input of 64 dimensions and an implicit latent space of 8 dimensions. The node count progression in each layer of a standard autoencoder using our prebuilt autoencoder logic would be 64 → 32 → 16 → 8 → 16 → 32 → 64. However, because we are planning to impose a sparsity constraint on the bottleneck layer, we need to provide an expanded set of nodes to pass information through. Say the real bottleneck size is 128 nodes. The node count progression in each layer of this sparse autoencoder would be 64 → 32 → 16 → 128 → 16 → 32 → 64.

To actually implement the sparsity constraint, note that almost all layers in Keras have an activity_regularizer parameter, set upon initialization. This parameter penalizes the activity, or output, of the layer (Listing 8-26). Note that you can also set the kernel_regularizer or bias_regularizer parameter if you desire to penalize the learned weights or biases. In this case, we don’t care about how the encoder arrives at a sparse encoding, only that the encoder creates a sparse encoding; hence, we regularize the layer activity. These arguments accept a keras.regularizers object. We will use the L1 regularization object, which accepts the specific weighting of the penalty as a parameter. Setting the weight is important and should be given thought and experimentation, considering the model’s power, the difficulty of autoencoding, and the latent space size. As discussed previously, setting an improper weight in either direction (too large or too small) yields adverse outcomes.
from keras.regularizers import L1
def buildSparseAutoencoder(inputSize=784,
                           impLatentSize=32,
                           realLatentSize=128,
                           outActivation='sigmoid'):
    # define architecture components
    encoder = Sequential(name='encoder')
    encoder.add(L.Input((inputSize,)))
    # funnel down toward the implicit latent size
    for i in range(int(np.floor(np.log2(inputSize/impLatentSize))), -1, -1):
        encoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
        encoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
    # expanded bottleneck with an L1 activity penalty to encourage sparsity
    encoder.add(L.Dense(realLatentSize, activation='relu',
                        activity_regularizer = L1(0.001)))
    decoder = Sequential(name='decoder')
    decoder.add(L.Input((realLatentSize,)))
    # funnel back up toward the original input size
    for i in range(1,int(np.floor(np.log2(inputSize/impLatentSize)))+1):
        decoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
        decoder.add(L.Dense(impLatentSize * 2**i, activation='relu'))
    decoder.add(L.Dense(inputSize, activation=outActivation))
    # define model architecture from components
    ae_input = L.Input((inputSize,), name='input')
    ae_encoder = encoder(ae_input)
    ae_decoder = decoder(ae_encoder)
    ae = Model(inputs = ae_input,
               outputs = ae_decoder)
    return {'model': ae, 'encoder': encoder, 'decoder': decoder}
Listing 8-26

Defining a sparse autoencoder with L1 regularization

Figure 8-61 demonstrates the performance of the sparse autoencoder on the MNIST dataset, in which the expanded latent space vector is reshaped into a square grid for convenient viewing. The reconstruction is not visibly worse than that of a standard autoencoder without a sparsity constraint. Notice that only two to five of the latent nodes are active on any one pass (and which nodes are active varies from image to image). A standard autoencoder trained with only five nodes in the bottleneck layer (and no sparsity requirement) would obtain poor reconstruction performance, demonstrating the informational richness of “choosing” which nodes are active.

An interpreter image represents sparse encoder performance on the M N I S T dataset.

Figure 8-61

Sampled original inputs (left), latent space (middle), and reconstruction (right) for a sparse autoencoder trained on MNIST. The latent space is 256 neurons reshaped into a 16-by-16 grid for viewing. The actual latent space is not arranged in two spatial directions

If we decreased the regularization alpha value (i.e., the L1 penalty would be weighted less relative to the loss), the network would obtain better overall loss at the cost of decreased sparsity (i.e., more nodes would be active on any one pass). If we increased the regularization alpha, the network would obtain worse overall loss in exchange for increased sparsity (i.e., even fewer nodes would be active on any one pass).
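To verify the degree of sparsity empirically, one might count the number of “active” latent nodes per validation sample; a quick sketch (assuming the encoder returned by buildSparseAutoencoder, with the activity threshold chosen purely for illustration):
latent = encoder.predict(x_valid)             # shape: (num_samples, realLatentSize)
active_counts = (latent > 1e-3).sum(axis=1)   # nodes above a small activity threshold
print(active_counts.mean(), active_counts.min(), active_counts.max())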

We can apply the same sparse autoencoding scheme to the Higgs Boson dataset, encoding a 28-dimensional input vector into a much larger sparse latent space. On each pass, roughly one-fourth to one-third of the latent nodes are active, although many bottleneck nodes are “quasi-active” – not zero, but very close to it. Figure 8-62 demonstrates the internal state and reconstruction of the sparse autoencoder on different inputs, with the 28-dimensional input vectors reshaped into 7-by-4 grids for more convenient viewing.

12 Images depict the reshaping of a sparse autoencoder with a latent space of a 16 by 16 grid.

Figure 8-62

Sampled original inputs (left), latent space (middle), and reconstruction (right) for a sparse autoencoder trained on the Higgs Boson dataset. The latent space is 256 neurons reshaped into a 16-by-16 grid for viewing; the input and reconstruction are 28 dimensions arranged into 7-by-4 grids

Similarly, Figure 8-63 shows the application of a trained sparse autoencoder to various elements of the Mice Protein Expression dataset.

12 images represent the utilization of sparse autoencoder to different elements pertaining to the given dataset.

Figure 8-63

Sampled original inputs (left), latent space (middle), and reconstruction (right) for a sparse autoencoder trained on the Mice Protein Expression dataset. The latent space is 256 neurons reshaped into a 16-by-16 grid for viewing; the input and reconstruction are 80 dimensions arranged into 8-by-10 grids

Why would you want to use sparse autoencoders? The primary reason is to take advantage of sparse encodings’ robustness properties. Adversarial examples are inputs deliberately generated to fool a neural network: an image originally correctly classified as class A is nudged into being classified as class B with high confidence simply by making minuscule, barely visible changes to the input. The canonical example in the field is a diagram created by Ian Goodfellow et al. in the paper “Explaining and Harnessing Adversarial Examples.” The Fast Gradient Sign Method (FGSM) generates a perturbation that adjusts every pixel in the input in a way that significantly changes the network’s final prediction (Figure 8-64).

A mathematical expression. The first image is a photograph of a panda, the second is the added adversarial perturbation, and the output is the same panda image now classified as a gibbon.

Figure 8-64

Demonstration of the FGSM method. From “Explaining and Harnessing Adversarial Examples,” Goodfellow et al.

Adversarial example finders profit from continuity and gradients. Because neural networks operate in very large continuous spaces, adversarial examples can be found by “sneaking” through smooth channels and ridges in the surface of the loss landscape. Adversarial examples can be security threats (consider real-world adversarial examples, like tape placed onto a traffic sign in a particular orientation causing egregious misidentification), as well as potential symptoms of poor generalization.1 Sparse encoders, however, impose a discreteness upon the encoded space. It becomes significantly more difficult to generate successful adversarial examples when a frozen sparse encoder is used as the feature extractor for a network.
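For concreteness, a minimal FGSM-style sketch in TensorFlow (here classifier stands for any Keras image classifier, (x, y) is a batch of inputs and integer labels, and epsilon is chosen for illustration; this is a sketch of the general method, not the paper’s code):
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_perturb(classifier, x, y, epsilon=0.1):
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, classifier(x))
    grad = tape.gradient(loss, x)
    # step each pixel in the direction of the sign of the loss gradient
    return tf.clip_by_value(x + epsilon * tf.sign(grad), 0.0, 1.0)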

Sparse autoencoders can also be useful for the purposes of interpretability. We’ll talk more about specialized interpretability techniques later in the book, but sparse autoencoders can be interpreted without additional complex theoretical tools. Because only a few neurons are active at any one time, understanding which neurons activate for any given input is relatively simple, especially compared with the dense latent vectors generated by standard autoencoders.

Denoising and Reparative Autoencoders

So far, we’ve only considered applications of autoencoder training in which the desired output is identical to the input. However, autoencoders can perform another function: to repair or restore a damaged or noisy input.

Here’s the clever way we go about it – we artificially add realistic noise or corruption to a “pure”/”clean” dataset and then train the model to recover the cleaned image from its artificially corrupted version (Figure 8-65).

A flow diagram represents the training dataset to recover a clean image from the artificially corrupted image.

Figure 8-65

Deriving a noisy image as input and the original clean image as the desired output of a denoising autoencoder

There are many applications for such a model. We can use it, most obviously, to denoise a noisy input; the “cleaned” input can then be used for other purposes. Alternatively, if we are developing a model that we know will operate in a domain with lots of noisy data, we can use the encoder of a denoising autoencoder as a robust or resilient feature extractor (similarly to in autoencoder pretraining), exploiting the encoder’s “denoised” latent representations (Figure 8-66).

A flow diagram depicts the functioning of the denoising autoencoder. It has the labels input data, denoising autoencoder, model and task output.

Figure 8-66

A potential application of denoising autoencoders as a structure that learns to clean up the input before it is actually used in a model for a task
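A minimal sketch of the arrangement in Figure 8-66 (assuming an encoder taken from a denoising autoencoder trained on 784-dimensional inputs, as constructed later in this section, and a ten-class task; the head layer sizes are illustrative):
from tensorflow.keras import layers as L
from tensorflow.keras.models import Model

encoder.trainable = False                      # keep the learned "denoised" features fixed
inp = L.Input((784,))
features = encoder(inp)
hidden = L.Dense(16, activation='relu')(features)
out = L.Dense(10, activation='softmax')(hidden)
clf = Model(inputs=inp, outputs=out)
clf.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
# clf.fit(x_train, y_train, epochs=10)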

These reparative models have particularly exciting applications for intelligent or deep graphics processing. Many graphics operations are not trivially invertible: it is easy to go from one state to another but not in the reverse direction. For instance, if I convert a color image or movie into grayscale (for instance, using the pixel-wise methods covered in the image case study in Chapter 2), there is no simple way to invert it back to color. Alternatively, if you spill coffee on an old family photo, there is no trivial process to “erase” the stain.

Autoencoders, however, exploit the triviality of going from the “pure” to the “corrupted” state by artificially imposing corruption upon pure data and forcing powerful autoencoder architectures to learn the “undoing.” Researchers have used denoising autoencoder architectures to generate color versions of historical black-and-white film and to repair photos that have been ripped, stained, or streaked. Another application is in biological/medical imaging, where an imaging operation can be disrupted by environmental conditions; replicating this noise/image damage artificially and training an autoencoder to become robust to it can make the model more resilient to noise.

We will begin with demonstrating the application of a denoising autoencoder to the MNIST dataset by successively increasing the amount of noise in the image and observing how well the denoising autoencoder performs (similarly to exercises in Chapter 4).

We can use a simple but effective technique to introduce noise into an image: adding random noise sampled from a normal distribution with mean 0 and a specified standard deviation. The result is clipped to ensure that the resulting value is still between 0 and 1, the feasible domain of pixel values. Listing 8-27 implements and visualizes artificial noise for a given standard deviation std.
# add zero-mean Gaussian noise and clip back to the valid pixel range [0, 1]
modified = x_train + np.random.normal(0, std, size=x_train.shape)
modified_clipped = np.clip(modified, 0, 1)
plt.set_cmap('gray')
plt.figure(figsize=(20, 20), dpi=400)
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.imshow(modified_clipped[i].reshape((28, 28)))
    plt.axis('off')
plt.show()
Listing 8-27

Displaying data corrupted by random noise

Figure 8-67 demonstrates a grid of sample images with no artificial noise added as reference for comparison.

A numerical grid represents various sample images that have no artificial noise added as a comparison reference.

Figure 8-67

A grid of untampered clean images from MNIST for reference

Figure 8-68 demonstrates the same set of images with random noise sampled from a normal distribution with standard deviation of 0.1. We can observe marginal noise, especially in affecting the consistency of the digit outlines.

A 6 by 6 numerical grid illustrates the images acquired as a result of random noise sampled from a normal distribution with standard deviation.

Figure 8-68

A sample of MNIST images with added normally distributed random noise using standard deviation 0.1

Let’s build an autoencoder to denoise this data (Listing 8-28). There is no difference between the architecture of the autoencoder used here and in previous applications; the difference lies rather in the data we pass in (namely, that the input has artificial noise applied). In this implementation, we compute new noise in each epoch. This is desirable because it provides “fresh” noise that the denoising autoencoder must learn to remove rather than “accept”/“memorize.”
from tqdm import tqdm
models = buildAutoencoder(784, 32)
model = models['model']
encoder = models['encoder']
model.compile(optimizer='adam', loss='mse')
TOTAL_EPOCHS = 100
loss = []
for i in tqdm(range(TOTAL_EPOCHS)):
    # draw fresh noise each epoch so the model cannot memorize a fixed corruption
    modified = x_train + np.random.normal(0, std, size=x_train.shape)
    modified_clipped = np.clip(modified, 0, 1)
    history = model.fit(modified_clipped, x_train, epochs=1, verbose=0)
    loss.append(history.history['loss'])
Listing 8-28

Training the denoising autoencoder on novel corrupted MNIST data each epoch

After training, we can evaluate the Mean Absolute Error on a fresh validation set of noisy images (Listing 8-29).
modified = x_valid + np.random.normal(0, std, size=x_valid.shape)
modified_clipped = np.clip(modified, 0, 1)
from sklearn.metrics import mean_absolute_error as mae
mae(model.predict(modified_clipped), x_valid)
Listing 8-29

Evaluating the performance of the denoising autoencoder on a fresh set of noisy images

Listing 8-30 and Figure 8-69, respectively, implement and demonstrate a sampling of images with normally distributed random noise, using a standard deviation of 0.1. A denoising autoencoder trained to recover the original version of a noisy image generated using this procedure obtains a validation Mean Absolute Error of 0.0266.
plt.set_cmap('gray')
for i in range(3):
    plt.figure(figsize=(15, 5), dpi=400)
    plt.subplot(1, 3, 1)
    plt.imshow(modified_clipped[i].reshape((28, 28)))
    plt.axis('off')
    plt.title('Noisy Input')
    plt.subplot(1, 3, 2)
    plt.imshow(x_valid[i].reshape((28, 28)))
    plt.axis('off')
    plt.title('True Denoised')
    plt.subplot(1, 3, 3)
    # predict from the noisy input, not the clean image
    plt.imshow(model.predict(modified_clipped[i:i+1]).reshape((28, 28)))
    plt.axis('off')
    plt.title('Predicted Denoised')
    plt.show()
Listing 8-30

Displaying the corrupted image, the reconstruction, and the desired reconstruction (i.e., the original uncorrupted image)

A 3 by 3 grid represents the corrupted images with reference to random noise distributed using a standard deviation.

Figure 8-69

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.1

Let’s increase the standard deviation to 0.2. Figure 8-70 demonstrates the noise effect on the images, and Figure 8-71 demonstrates the reconstruction performance on a set of images. The denoising autoencoder obtains a validation Mean Absolute Error of 0.0289, slightly more than that of the denoising autoencoder trained on noise drawn from a normal distribution with standard deviation 0.1.

A 6 by 6 grid highlights the noise impact on the images with the increase of standard deviation of zero point two.

Figure 8-70

A sample of MNIST images with added normally distributed random noise using standard deviation 0.2

A 3 by 3 grid represents a set of images displaying reconstruction performance.

Figure 8-71

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.2

Figures 8-72 and 8-73 demonstrate a sample of images and the denoising autoencoder performance on images corrupted using noise drawn from a normal distribution with standard deviation 0.3. The denoising autoencoder obtains a validation Mean Absolute Error of about 0.0343.

A 6 by 6 numerical grid network exhibits a sample of images that interprets normal distribution.

Figure 8-72

A sample of MNIST images with added normally distributed random noise using standard deviation 0.3

A 3 by 3 numerical grid interprets images achieved through the performance of a denoising autoencoder that obtains a validation mean absolute error.

Figure 8-73

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.3

Figures 8-74 and 8-75 demonstrate a sample of images and the denoising autoencoder performance on images corrupted using noise drawn from a normal distribution with standard deviation 0.5. The denoising autoencoder obtains a validation Mean Absolute Error of about 0.0427.

A 6 by 6 grid framework represents images drawn from the standard deviation of zero point five.

Figure 8-74

A sample of MNIST images with added normally distributed random noise using standard deviation 0.5

A 3 by 3 number grid framework represents noisy input and desired output using denoising autoencoder performance.

Figure 8-75

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.5

Figures 8-76 and 8-77 demonstrate a sample of images and the denoising autoencoder performance on images corrupted using noise drawn from a normal distribution with standard deviation 0.9. The denoising autoencoder obtains a validation Mean Absolute Error of about 0.0683. Note that this is an exceedingly nontrivial task – even humans would have some difficulty denoising many of the shown samples! The autoencoder reconstructions are more abstracted – there is no physical way to exactly reconstruct all the details due to the significant amount of information corruption, so the autoencoder instead performs implicit digit recognition and reconstructs the image as a “generalized” digit with specific positional and orientational characteristics.

A 6 by 6 number grid of various image patterns acquired as a result of the performance of autoencoder having the denoising feature.

Figure 8-76

A sample of MNIST images with added normally distributed random noise using standard deviation 0.9

A grid of images represents noisy input, true and predicted denoised output. It depicts the performance of the denoising autoencoder with a standard deviation of 0.9.

Figure 8-77

The noisy/perturbed input (left), the unperturbed desired output (middle), and the predicted output (right) for a denoising autoencoder trained on MNIST with a noisy normal distribution of standard deviation 0.9

We can see that denoising autoencoders can perform reconstruction to a pretty impressive degree. In practice, however, we want to keep the noise level somewhat low; increasing the noise level too far destroys information and can cause the network to develop incorrect and/or overly simplified representations.

A similar logic can be applied to tabular data. There are many situations in which a tabular dataset is particularly noisy. This is especially common in scientific datasets recording highly variable physical processes, like low-level physics dynamics or biological system measurements.

Let’s build a denoising autoencoder for the Mice Protein Expression dataset. Listing 8-31 loads the dataset and splits it into a training and a validation set.
# drop the index and label columns; sample 80% of rows for training
data = pd.read_csv('../input/mpempe/mouse-protein-expression.csv').drop(['Unnamed: 0', 'class'], axis=1)
train_indices = np.random.choice(data.index, replace=False,
                                 size = round(0.8 * len(data)))
valid_indices = np.array([ind for ind in data.index if ind
                          not in train_indices])
x_train, x_valid = data.loc[train_indices], data.loc[valid_indices]
Listing 8-31

Loading and splitting the Mice Protein Expression dataset

Listing 8-32 builds a standard autoencoder architecture.
models = buildAutoencoder(len(data.columns), 16)
model = models['model']
encoder = models['encoder']
model.compile(optimizer='adam', loss='mse')
Listing 8-32

Building an autoencoder architecture to fit the Mice Protein Expression dataset

To train, we generate noise to the input and train the model to reconstruct the original input from the noisy input. In tabular datasets, we generally cannot add randomly distributed noise to the entire set of data in blanket fashion, because different features operate on different scales. Instead, the noise should be dependent on the standard deviation of each feature itself. In this implementation, we add noise randomly sampled from a normal distribution with a standard deviation equal to one-fifth of the actual feature’s standard deviation (Listing 8-33).
TOTAL_EPOCHS = 100
loss = []
stds = x_train.std()
for i in tqdm(range(TOTAL_EPOCHS)):
    # per-feature noise: zero mean, standard deviation of one-fifth of each column's own std
    noise = pd.DataFrame(index=x_train.index, columns=x_train.columns)
    for col in noise.columns:
        noise[col] = np.random.normal(0, stds[col]/5,
                                      size=(len(x_train),))
    history = model.fit(x_train + noise, x_train, epochs=1, verbose=0)
    loss.append(history.history['loss'])
Listing 8-33

Adding noise to each column of the Mice Protein Expression dataset with a reflective standard deviation

Listing 8-34 demonstrates the evaluation of such a model on novel validation noisy data.
# evaluate on fresh noise drawn with the same per-feature scale used in training
noise = pd.DataFrame(index=x_valid.index, columns=x_valid.columns)
for col in noise.columns:
    noise[col] = np.random.normal(0, stds[col]/5,
                                  size=(len(x_valid),))
from sklearn.metrics import mean_absolute_error as mae
mae(model.predict(x_valid + noise), x_valid)
Listing 8-34

Evaluating the performance of the denoising tabular autoencoder on novel noisy data

After training, the encoder of the denoising autoencoder can be used for pretraining or other previously described applications.

Key Points

In this chapter, we discussed the autoencoder architecture and how it can be used in four different contexts – pretraining, multitask training, sparse autoencoders, and denoising autoencoders.
  • Autoencoders are neural network architectures trained to encode an input into a latent space with a representation size smaller than the original input and then to reconstruct the input from the latent space. Autoencoders are forced to learn meaningful latent representations of the data because of this imposed information bottleneck.

  • The encoder of a trained autoencoder can be detached and built as the feature extractor of a supervised network; that is, the autoencoder serves the purpose of pretraining.

  • In cases where supervised learning is difficult to get started with, creating a multitask autoencoder that can optimize its loss by performing both the supervised task and an auxiliary autoencoding task can help overcome initial learning hurdles.

  • Sparse autoencoders use a significantly expanded latent space size, but are trained with restrictions on latent space activity, such that only a few nodes/neurons can be active at any one pass. Sparse autoencoders are thought to be more robust.

  • Denoising autoencoders are trained to reconstruct clean data from an artificially corrupted, noisy version of that data. In the process, the encoder learns to look for key patterns and “denoise” the data, which can be a useful component for supervised models.

In the next chapter, we will look into deep generative models – including a particular type of autoencoder, the Variational Autoencoder (VAE) – which can notably be used to reconcile unbalanced datasets, improve model robustness, and train models on sensitive/private data, in addition to other applications.
