Autoencoders

In the previous chapters, we discussed how real datasets are very often high-dimensional representations of samples that lie on low-dimensional manifolds (this is one of the assumptions underlying semi-supervised learning, but it holds quite generally). As the complexity of a model grows with the dimensionality of the input data, many techniques have been analyzed and optimized in order to reduce the effective number of components. For example, PCA selects features according to their relative explained variance, while ICA and generic dictionary learning techniques look for basic atoms that can be combined to rebuild the original samples. In this chapter, we are going to analyze a family of models based on a slightly different approach, whose capabilities, however, are dramatically enhanced by the employment of deep learning methods.

A generic autoencoder is a model that is split into two separate (but not completely autonomous) components called an Encoder and a Decoder. The task of the encoder is to transform an input sample into an encoded feature vector, while the task of the decoder is the opposite: rebuilding the original sample using the feature vector as input. The following diagram shows a schematic representation of a generic model:

Schema of a generic autoencoder

More formally, we can describe the encoder as a parametrized function:
$$z_i = e(x_i; \theta_e)$$
The output z_i is a code vector whose dimensionality is normally much lower than that of the input. Analogously, the decoder is described as follows:
$$\tilde{x}_i = d(z_i; \theta_d)$$
The goal of a standard algorithm is to minimize a cost function that is proportional to the reconstruction error. A classic method is based on the mean squared error (working on a dataset with M samples):
$$L(\theta_e, \theta_d) = \frac{1}{M} \sum_{i=1}^{M} \left\| x_i - d(e(x_i; \theta_e); \theta_d) \right\|^2$$
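To make this concrete, the following minimal sketch (written with Keras; the dimensions, layer sizes, and synthetic dataset are hypothetical choices, not taken from the text) implements both components with a single dense layer each and trains them jointly by minimizing the MSE between each sample and its reconstruction:

```python
import numpy as np
import tensorflow as tf

# Hypothetical dimensions: 64-dimensional inputs compressed into a 16-dimensional code
input_dim, code_dim = 64, 16

# Encoder e(x; theta_e): maps the input sample to the low-dimensional code z
x_in = tf.keras.Input(shape=(input_dim,))
z = tf.keras.layers.Dense(code_dim, activation='relu')(x_in)

# Decoder d(z; theta_d): rebuilds the original sample from the code
x_tilde = tf.keras.layers.Dense(input_dim, activation='linear')(z)

autoencoder = tf.keras.Model(x_in, x_tilde)
autoencoder.compile(optimizer='adam', loss='mse')

# Synthetic dataset with M = 1000 samples; the input is also the target,
# so the reconstruction error is minimized without any labels
X = np.random.normal(size=(1000, input_dim)).astype(np.float32)
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)
```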
The MSE cost function depends only on the input samples (which are constant) and on the parameter vectors; therefore, this is de facto an unsupervised method where we can control the internal structure and the constraints imposed on the code z. From a probabilistic viewpoint, if the input samples x_i are drawn from a data-generating process p(X), our goal is to find a parametric distribution q(•) that minimizes the Kullback–Leibler divergence with p(X). Considering the previous definitions, we can define q(•) as follows:
$$q(\tilde{x}; \theta_e, \theta_d) \quad \text{where} \quad \tilde{x}_i = d(e(x_i; \theta_e); \theta_d)$$
Therefore, the Kullback–Leibler divergence becomes the following:
$$D_{KL}(p \parallel q) = \mathbb{E}_{x \sim p}\left[\log p(x)\right] - \mathbb{E}_{x \sim p}\left[\log q(x; \theta_e, \theta_d)\right] = -H(p) + H(p, q)$$
The first term represents the negative entropy of the original distribution, which is constant and isn't involved in the optimization process. The other term is the cross-entropy between p and q. If we assume Gaussian distributions for p and q, the mean squared error is proportional to the cross-entropy (for a Gaussian with fixed variance, the negative log-likelihood reduces to the squared error up to additive and multiplicative constants), and therefore this cost function is still valid under a probabilistic approach. Alternatively, it's possible to consider Bernoulli distributions for p and q, and the cross-entropy becomes the following:
$$H(p, q) = -\sum_{i=1}^{M} \left[ x_i \log \tilde{x}_i + (1 - x_i) \log \left(1 - \tilde{x}_i\right) \right]$$
The main difference between the two approaches is that while the mean squared error can be applied to x_i ∈ ℝ^q (or to multidimensional matrices), Bernoulli distributions need x_i ∈ [0, 1]^q (formally, this condition should be x_i ∈ {0, 1}^q; however, the optimization can also be successfully performed when the values are not binary). The same constraint is necessary for the reconstructions; therefore, when using neural networks, the most common choice is to employ a sigmoid output layer.
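As an illustration of the Bernoulli case, the following variant of the previous sketch (again with hypothetical dimensions and synthetic data) constrains the reconstructions to [0, 1] with a sigmoid output layer and minimizes the binary cross-entropy:

```python
import numpy as np
import tensorflow as tf

# Hypothetical dimensions, as in the previous sketch
input_dim, code_dim = 64, 16

x_in = tf.keras.Input(shape=(input_dim,))
z = tf.keras.layers.Dense(code_dim, activation='relu')(x_in)

# Sigmoid output layer keeps each reconstructed component in [0, 1]
x_tilde = tf.keras.layers.Dense(input_dim, activation='sigmoid')(z)

autoencoder = tf.keras.Model(x_in, x_tilde)

# binary_crossentropy corresponds to the Bernoulli cross-entropy shown above
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Synthetic data constrained to [0, 1] (for example, normalized pixel intensities)
X = np.random.uniform(0.0, 1.0, size=(1000, input_dim)).astype(np.float32)
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)
```

In practice, this setup is typical when the inputs are normalized pixel intensities.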
