Variational autoencoders

A variational autoencoder (VAE) is a generative model proposed by Kingma and Welling (in their work Auto-Encoding Variational Bayes, arXiv:1312.6114 [stat.ML]) that partially resembles a standard autoencoder, but it has some fundamental internal differences. The goal, in fact, is not finding an encoded representation of a dataset, but determining the parameters of a generative process that is able to yield samples distributed like those produced by the input data-generating process.

Let's take the example of a model based on a learnable parameter vector θ and a set of latent variables z that have a probability density function p(z;θ). Our goal can therefore be expressed as the search for the θ parameters that maximize the likelihood of the marginal distribution p(x;θ) (obtained through the integration of the joint probability p(x,z;θ)):

p(x;θ) = ∫ p(x,z;θ) dz = ∫ p(x|z;θ) p(z;θ) dz

If this problem could be easily solved in closed form, a large set of samples drawn from the p(x) data-generating process would be enough to find a good approximation of p(x;θ). Unfortunately, the previous expression is intractable in the majority of cases because the true prior p(z) is unknown (this is a secondary issue, as we can easily make some helpful assumptions) and the conditional likelihood p(x|z;θ) is almost always close to zero. The first problem can be solved by selecting a simple prior (the most common choice is z ∼ N(0, I)), but the second one is still very hard because only a few z values can lead to the generation of acceptable samples. This is particularly true when the dataset is very high-dimensional and complex (for example, images). Even if there are millions of combinations, only a small number of them can yield realistic samples (if the images are photos of cars, we expect four wheels in the lower part of the picture, but it's still possible to generate samples where the wheels are on the top). For this reason, we need to exploit a method to reduce the sample space. Variational Bayesian methods (read C. Fox and S. Roberts's work A Tutorial on Variational Bayesian Inference from Orchid for further information) are based on the idea of employing proxy distributions, which are easy to sample from and, in this case, assign high probability to the z values that actually generate reasonable outputs, mimicking the behavior of the true posterior.

In this case, we define an approximate posterior, considering the architecture of a standard autoencoder. In particular, we can introduce a q(z|x;θq) distribution that acts as an encoder (one that no longer behaves deterministically), which can be easily modeled with a neural network. Our goal, of course, is to find the best θq parameter set that maximizes the similarity between q and the true posterior distribution p(z|x;θ). This result can be achieved by minimizing the Kullback–Leibler divergence:

D_KL(q(z|x;θq) ∥ p(z|x;θ)) = E_z∼q[log q(z|x;θq) − log p(z|x;θ)] = E_z∼q[log q(z|x;θq) − log p(x|z;θ) − log p(z;θ) + log p(x;θ)]

(the second equality follows from Bayes' theorem, p(z|x;θ) = p(x|z;θ)p(z;θ)/p(x;θ))

In the last formula, the term log p(x;θ) doesn't depend on z, and therefore it can be extracted from the expected value operator and the expression can be manipulated to simplify it:

D_KL(q(z|x;θq) ∥ p(z|x;θ)) = log p(x;θ) + E_z∼q[log q(z|x;θq) − log p(z;θ)] − E_z∼q[log p(x|z;θ)] = log p(x;θ) + D_KL(q(z|x;θq) ∥ p(z;θ)) − E_z∼q[log p(x|z;θ)]

The equation can also be rewritten as follows:

log p(x;θ) = E_z∼q[log p(x|z;θ)] − D_KL(q(z|x;θq) ∥ p(z;θ)) + D_KL(q(z|x;θq) ∥ p(z|x;θ)) = ELBO + D_KL(q(z|x;θq) ∥ p(z|x;θ))

On the right-hand side, we now have the ELBO term (short for evidence lower bound) and the Kullback–Leibler divergence between the probabilistic encoder q(z|x;θq) and the true posterior distribution p(z|x;θ). We want to maximize the log-probability of a sample under the θ parametrization, but the latter divergence is intractable; however, as the KL divergence is always non-negative, the ELBO is a lower bound on log p(x;θ), so we can work with the ELBO alone (which is also a lot easier to manage than the other term). Indeed, the loss function that we are going to optimize is the negative ELBO. To achieve this goal, we need two more important steps.
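Making both contributions explicit, the per-sample loss that will be minimized can be written as:

L(θ, θq) = −ELBO = −E_z∼q[log p(x|z;θ)] + D_KL(q(z|x;θq) ∥ p(z;θ))

that is, a reconstruction term plus a regularization term that keeps the approximate posterior close to the prior.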

The first one is choosing an appropriate structure for q(z|x;θq). As p(z;θ) is assumed to be normal, it's natural to model q(z|x;θq) as a multivariate Gaussian distribution, splitting the probabilistic encoder into two blocks fed by the same lower layers:

  • A mean generator μ(z|x;θq) that outputs a vector μi ∈ ℜp
  • A covariance generator Σ(z|x;θq) (assuming a diagonal matrix) that outputs a vector σi ∈ ℜp so that Σi = diag(σi)

In this way, q(z|x;θq) = N(μ(z|x;θq), Σ(z|x;θq)), and therefore the second term on the right-hand side, D_KL(q(z|x;θq) ∥ p(z;θ)), is the Kullback–Leibler divergence between two Gaussian distributions, which can be easily expressed as follows (p is the dimension of both the mean and covariance vectors):

D_KL(q(z|x;θq) ∥ p(z;θ)) = D_KL(N(μ, Σ) ∥ N(0, I)) = ½ [tr(Σ) + μᵀμ − p − log|Σ|]

This operation is simpler than expected because, as Σ is diagonal, the trace corresponds to the sum of the elements Σ1 + Σ2 + ... + Σp and log|Σ| = log(Σ1Σ2...Σp) = log Σ1 + log Σ2 + ... + log Σp.
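As a minimal sketch of this term (using PyTorch; the names gaussian_kl, mu, and log_var are illustrative assumptions, with the two encoder heads producing the mean and the log of the diagonal covariance so that positivity is guaranteed), the closed-form divergence could be computed as follows:

import torch

def gaussian_kl(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # D_KL(N(mu, diag(exp(log_var))) || N(0, I)) for each sample in the batch:
    # 0.5 * (tr(Sigma) + mu^T mu - p - log|Sigma|), where, with a diagonal Sigma,
    # tr(Sigma) = sum(exp(log_var)) and log|Sigma| = sum(log_var)
    return 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var, dim=-1)

With mu = 0 and log_var = 0 (that is, Σ = I), the function returns zero, which is a quick sanity check.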

At this point, maximizing the ELBO is equivalent to maximizing the expected log-probability of generating acceptable samples and minimizing the discrepancy between the normal prior and the Gaussian distribution synthesized by the encoder. Everything seems much simpler now, but there is still a problem to solve. We want to use neural networks and the stochastic gradient descent algorithm, and therefore we need differentiable functions. As the expected value can only be approximated using minibatches with n elements (the approximation becomes close to the true value after a sufficient number of iterations), it's necessary to sample n values from the distribution N(μ(z|x;θq), Σ(z|x;θq)) and, unfortunately, this operation is not differentiable. To solve this problem, the authors suggested a reparameterization trick: instead of sampling directly from q(z|x;θq), we can sample from a normal distribution, ε ∼ N(0, I), and build the actual samples as μ(z|x;θq) + ε · Σ(z|x;θq)^(1/2) (where the square root is applied element-wise to the diagonal covariance). Considering that ε is a constant vector during a batch (both the forward and backward phases) and doesn't depend on the model parameters, it's easy to compute the gradient of the previous expression and optimize both the decoder and the encoder.
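A minimal sketch of the trick, again in PyTorch and assuming the same hypothetical mu/log_var parameterization used above, could look like this:

import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # z = mu + eps * Sigma^(1/2), with eps ~ N(0, I).
    # eps carries all the randomness and doesn't depend on the parameters,
    # so gradients can flow through mu and log_var during backpropagation.
    std = torch.exp(0.5 * log_var)   # element-wise square root of the diagonal covariance
    eps = torch.randn_like(std)
    return mu + eps * std

Sampling ε with torch.randn_like keeps the stochastic node outside of the computational path that depends on θq, which is exactly what makes the estimator differentiable.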

The last element to consider is the first term on the right-hand side of the expression that we want to maximize:

E_z∼q[log p(x|z;θ)]

This term represents the negative cross-entropy between the actual distribution and the reconstructed one. As discussed in the first section, there are two feasible choices: Gaussian or Bernoulli distributions. In general, variational autoencoders employ a Bernoulli distribution with input samples and reconstruction values constrained between 0 and 1. However, many experiments have confirmed that the mean squared error can speed up the training process, and therefore I suggest that the reader test both methods and pick the one that guarantees the best performance (both in terms of accuracy and training speed).
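Putting the pieces together, a sketch of the complete per-batch loss (the negative ELBO) might be the following; x_rec denotes the hypothetical decoder output, and both the Bernoulli (binary cross-entropy) and the Gaussian (mean squared error) reconstruction terms are shown so that the two options can be compared:

import torch
import torch.nn.functional as F

def vae_loss(x, x_rec, mu, log_var, reconstruction="bernoulli"):
    # Negative ELBO = reconstruction term + Kullback-Leibler term
    if reconstruction == "bernoulli":
        # Requires x and x_rec to be constrained in [0, 1]
        rec = F.binary_cross_entropy(x_rec, x, reduction="sum")
    else:
        # Gaussian assumption: reduces to the mean squared error (up to constants)
        rec = F.mse_loss(x_rec, x, reduction="sum")
    kl = 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var)
    return (rec + kl) / x.size(0)   # average over the minibatch

Minimizing this quantity with stochastic gradient descent, and switching the reconstruction argument between the two choices, is one way to run the comparison suggested above.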
