The main limitation of a perceptron is its linearity. How can this kind of architecture be extended to remove such a constraint? The solution is simpler than it might seem: adding at least one non-linear hidden layer between input and output yields a highly non-linear combination, parameterized by a larger number of variables. The resulting architecture, called a Multilayer Perceptron (MLP) and containing (only for simplicity) a single hidden layer, is shown in the following diagram:
This is a so-called feed-forward network, meaning that the flow of information begins in the first layer, always proceeds in the same direction, and ends at the output layer. Architectures that allow partial feedback (for example, in order to implement a local memory) are called recurrent networks and will be analyzed in the next chapter.
In this case, there are two weight matrices, W and H, and two corresponding bias vectors, b and c. If there are m hidden neurons, xi ∈ ℜn × 1 (column vector), and yi ∈ ℜk × 1, the dynamics are defined by the following transformations:

hi = fh(W · xi + b), with W ∈ ℜm × n, b ∈ ℜm × 1, and hi ∈ ℜm × 1
yi = fo(H · hi + c), with H ∈ ℜk × m and c ∈ ℜk × 1
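These transformations can be sketched with NumPy. The dimensions (n = 4, m = 8, k = 3), the tanh hidden activation, and the linear output activation are illustrative assumptions, not choices made by the text:

```python
import numpy as np

# Hypothetical dimensions: n input features, m hidden neurons, k outputs
n, m, k = 4, 8, 3

rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))   # input-to-hidden weight matrix
b = rng.normal(size=(m, 1))   # hidden bias vector
H = rng.normal(size=(k, m))   # hidden-to-output weight matrix
c = rng.normal(size=(k, 1))   # output bias vector

def forward(x):
    """Forward pass of a single-hidden-layer MLP on a column vector x.

    tanh is used here as an example non-linear hidden activation;
    the output is left linear, as in a regression setting.
    """
    h = np.tanh(W @ x + b)    # hidden layer: non-linear activation
    y = H @ h + c             # output layer: linear activation
    return y

x = rng.normal(size=(n, 1))
y = forward(x)
print(y.shape)  # (3, 1)
```

With real data, W, H, b, and c would of course be learned (for example, via backpropagation) rather than sampled at random.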
A fundamental condition for any MLP is that at least one hidden-layer activation function fh(•) is non-linear. It's straightforward to prove that any stack of linear layers is equivalent to a single linear transformation and, hence, such an MLP falls back into the case of a standard perceptron. Conventionally, the activation function is fixed for a given layer, but there are no limitations on how the activations of different layers can be combined. In particular, the output activation is normally chosen to meet a precise requirement (such as multi-label classification, regression, image reconstruction, and so on). That's why the first step of this analysis concerns the most common activation functions and their features.
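The collapse of stacked linear layers can be checked numerically. In this sketch (all dimensions and values are arbitrary), two bias-augmented linear layers are composed without any activation, and the result coincides with a single linear layer whose parameters are W₂W₁ and W₂b₁ + b₂:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear "hidden" layers (no activation function)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=(8, 1))
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=(3, 1))

x = rng.normal(size=(4, 1))

# Composing the two linear layers...
y_deep = W2 @ (W1 @ x + b1) + b2

# ...gives exactly one linear layer with W = W2 W1 and b = W2 b1 + b2
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
y_shallow = W_eq @ x + b_eq

print(np.allclose(y_deep, y_shallow))  # True
```

Inserting any non-linear fh(•) between the two layers breaks this identity, which is precisely why the hidden activation must be non-linear.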