Introduction to VAEs

A VAE is extremely similar in nature to the more basic autoencoder: it learns how to encode the data it is fed into a simplified representation, and it can then recreate the input on the other side from that encoding. Unfortunately, standard autoencoders are usually limited to tasks such as denoising. Using standard autoencoders for generation is problematic, as their latent space does not lend itself to this purpose: the encodings they produce may not be continuous, and may cluster around very specific regions of the latent space, making interpolation between them difficult.

However, since we want to build a generative model rather than simply replicate the image we put in, we need variations on the input. If we attempt this with a standard autoencoder, there is a good chance that the end result will be rather absurd, especially if the input differs a fair amount from the training set.

The standard autoencoder structure looks a little like this:

We've already built this standard autoencoder; however, a VAE has a slightly different way of encoding, which makes it look more like the following diagram:

A VAE differs from the standard autoencoder in that it has a continuous latent space by design, making it easier for us to do random sampling and interpolation. It does this by encoding its data into two vectors: one to store its estimate of the means, and another to store its estimate of the standard deviations.

Using these means and standard deviations, we then sample an encoding that we pass on to the decoder. The decoder then works off the sampled encoding to generate a result. Because we are inserting an amount of random noise during sampling, the actual encoding will vary slightly every time.
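This sampling step is commonly implemented with what is known as the reparameterization trick. A minimal NumPy sketch, assuming the encoder outputs a mean vector and a log-variance vector (all names here are illustrative, not from the book's own code):

```python
import numpy as np

def sample_latent(mean, log_var, rng=np.random.default_rng()):
    """Reparameterization trick: z = mean + sigma * epsilon.

    mean and log_var are the two vectors produced by the encoder;
    epsilon is fresh standard-normal noise drawn on every call,
    which is why the sampled encoding varies slightly each time.
    """
    epsilon = rng.standard_normal(mean.shape)
    sigma = np.exp(0.5 * log_var)  # convert log-variance to standard deviation
    return mean + sigma * epsilon

# Two calls with the same mean/log_var give nearby, but different, encodings.
mean = np.array([0.2, -1.0])
log_var = np.array([-2.0, -2.0])
z1 = sample_latent(mean, log_var)
z2 = sample_latent(mean, log_var)
```

Because the noise is injected as a separate input rather than inside the network, gradients can still flow back through `mean` and `log_var` during training.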

By allowing this variation to occur, the decoder isn't limited to specific encodings; instead, it can function across a much larger area in the latent space, as it is exposed to not just variations in the data but to variations in the encoding as well, during the training process.

In order to ensure that the encodings are close to each other in the latent space, we include a measure called the Kullback-Leibler (KL) divergence in our loss function during training. KL divergence measures the difference between two probability distributions. In this case, by minimizing this divergence, we reward the model for keeping the encodings close together, and penalize it when it attempts to cheat by creating more distance between the encodings.

In VAEs, we measure KL divergence against the standard normal (which is a Gaussian distribution with a mean of 0 and a standard deviation of 1). We can calculate this using the following formula:

klLoss = 0.5 * sum(mean^2 + exp(sd) - (sd + 1))
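Read with the convention that sd holds the log of the variance (so exp(sd) recovers the variance), this is the closed-form KL divergence between the encoder's Gaussian and the standard normal. A small NumPy sketch of the same formula:

```python
import numpy as np

def kl_loss(mean, sd):
    """KL divergence between N(mean, exp(sd)) and N(0, 1),
    where sd is the encoder's log-variance vector."""
    return 0.5 * np.sum(mean**2 + np.exp(sd) - (sd + 1))

# When mean = 0 and sd = 0 (i.e. variance = 1), the encoding already
# matches the standard normal, so the loss is zero.
print(kl_loss(np.zeros(2), np.zeros(2)))  # 0.0
```

Any deviation of the mean from 0, or of the variance from 1, makes the sum strictly positive, which is what pushes the encodings toward the standard normal during training.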

Unfortunately, using KL divergence alone is insufficient, as all it does is ensure that the encodings are not spread too far apart; we still need to ensure that the encodings are meaningful, and not just mixed in with one another. As such, when optimizing a VAE, we also add another loss function that compares the input with the output. This causes the encodings for similar objects (or, in the case of MNIST, handwritten digits) to cluster closer together. This enables the decoder to reconstruct the input better and allows us, via manipulation of the input, to produce different results along the continuous axis.
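Putting the two terms together, the training objective can be sketched as a reconstruction loss plus the KL loss. In this sketch, mean squared error stands in for the reconstruction term (binary cross-entropy is also common for MNIST), and the function names are illustrative:

```python
import numpy as np

def vae_loss(x, x_reconstructed, mean, sd):
    """Total VAE loss: reconstruction term + KL term.

    The reconstruction term rewards the decoder for rebuilding the
    input faithfully; the KL term (sd holding the log-variance) keeps
    the encodings clustered near the standard normal.
    """
    reconstruction = np.sum((x - x_reconstructed) ** 2)
    kl = 0.5 * np.sum(mean**2 + np.exp(sd) - (sd + 1))
    return reconstruction + kl
```

The balance between the two terms matters: too much KL weight and the encodings collapse into indistinguishable noise; too little and the latent space loses the continuity we wanted in the first place.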
