Example of MLP with Keras

Keras (https://keras.io) is a powerful Python toolkit that allows modeling and training complex deep learning architectures with minimum effort. It relies on low-level frameworks, such as TensorFlow, Theano, or CNTK, and provides high-level building blocks for the individual layers of a model. In this book, we need to be very pragmatic because there's no room for a complete explanation; however, all the examples are structured so that the reader can try different configurations and options without a full knowledge of the framework (for further details, I suggest the book Deep Learning with Keras, Gulli A., Pal S., Packt Publishing).

In this example, we want to build a small MLP with a single hidden layer to solve the XOR problem (the dataset is the same as the one created in the previous example). The simplest and most common way is to instantiate the Sequential class, which defines an empty container for a model whose layers are stacked one after the other. In this initial part, the fundamental method is add(), which appends a layer to the model. For our example, we want to employ a hidden layer with four units and hyperbolic tangent activation, followed by an output layer with two softmax units. The following snippet defines the MLP:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()

# Hidden layer: 4 fully connected units fed with 2-dimensional samples
model.add(Dense(4, input_dim=2))
model.add(Activation('tanh'))

# Output layer: 2 units with softmax activation (one per class)
model.add(Dense(2))
model.add(Activation('softmax'))

The Dense class defines a fully connected layer (a classical MLP layer), and the first parameter declares the number of desired units. The first layer must declare the input_shape or input_dim, which specify the dimensions (or the shape) of a single sample (the batch size is omitted, as it's dynamically set by the framework). All the subsequent layers compute their dimensions automatically. One of the strengths of Keras is the possibility of avoiding the explicit setting of many parameters (such as the weight initializers), as they are automatically configured using the most appropriate default values (for example, the default weight initializer is Xavier). In the next examples, we are going to explicitly set some of them, but I suggest that the reader check the official documentation to get acquainted with all the possibilities and features. The other layer involved in this experiment is Activation, which specifies the desired activation function (it's also possible to declare it using the activation parameter implemented by almost all layers, but I prefer to decouple the operations to emphasize their single roles, and also because some techniques, such as batch normalization, are normally applied to the linear output, before the activation).
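For reference, the same architecture can also be expressed more compactly by passing the activation parameter directly to Dense and, if desired, by setting the initializer explicitly. The following is only a sketch of this alternative form (the initializer shown is the Keras default, so the resulting model is equivalent):

from keras.models import Sequential
from keras.layers import Dense

# Equivalent compact definition: activation declared inside each Dense layer
compact_model = Sequential()
compact_model.add(Dense(4, input_dim=2,
                        activation='tanh',
                        kernel_initializer='glorot_uniform'))
compact_model.add(Dense(2, activation='softmax'))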

At this point, we must ask Keras to compile the model (using the preferred backend):

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

The parameter optimizer defines the stochastic gradient descent algorithm that we want to employ. Using optimizer='sgd', it's possible to implement a standard version (as described in the previous paragraph). In this case, we are employing Adam (with the default parameters), which is a much more performant variant that will be discussed in the next section. The parameter loss is used to define the cost function (in this case, categorical cross-entropy) and metrics is a list of all the evaluation scores we want to be computed ('accuracy' is enough for many classification tasks).
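If finer control over the optimizer is needed, it can also be instantiated explicitly instead of being referenced by its name. The following is only a minimal sketch (the learning rate shown is the Keras default for Adam, so the behavior is unchanged):

from keras.optimizers import Adam

# Equivalent compilation with an explicit optimizer instance
model.compile(optimizer=Adam(lr=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Once the model is compiled, it's possible to train it: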

from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1000)

model.fit(X_train,
          to_categorical(Y_train, num_classes=2),
          epochs=100,
          batch_size=32,
          validation_data=(X_test, to_categorical(Y_test, num_classes=2)))

Train on 700 samples, validate on 300 samples
Epoch 1/100
700/700 [==============================] - 1s 2ms/step - loss: 0.7227 - acc: 0.4929 - val_loss: 0.6943 - val_acc: 0.5933
Epoch 2/100
700/700 [==============================] - 0s 267us/step - loss: 0.7037 - acc: 0.5371 - val_loss: 0.6801 - val_acc: 0.6100
Epoch 3/100
700/700 [==============================] - 0s 247us/step - loss: 0.6875 - acc: 0.5871 - val_loss: 0.6675 - val_acc: 0.6733

...

Epoch 98/100
700/700 [==============================] - 0s 236us/step - loss: 0.0385 - acc: 0.9986 - val_loss: 0.0361 - val_acc: 1.0000
Epoch 99/100
700/700 [==============================] - 0s 261us/step - loss: 0.0378 - acc: 0.9986 - val_loss: 0.0355 - val_acc: 1.0000
Epoch 100/100
700/700 [==============================] - 0s 250us/step - loss: 0.0371 - acc: 0.9986 - val_loss: 0.0347 - val_acc: 1.0000

The operations are quite simple. We have split the dataset into training and test/validation sets (in deep learning, cross-validation is seldom employed) and, then, we have trained the model setting batch_size=32 and epochs=100. The dataset is automatically shuffled at the beginning of each epoch, unless shuffle=False is set. In order to convert the discrete labels into a one-hot encoding, we have used the utility function to_categorical. In this case, the label 0 becomes (1, 0) and the label 1 becomes (0, 1). The model converges before reaching 100 epochs; therefore, I invite the reader to optimize the parameters as an exercise. In any case, at the end of the process, the training accuracy is about 0.999 and the validation accuracy is 1.0.
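To see the effect of the one-hot encoding in isolation, the following short sketch applies to_categorical() to a few sample labels:

import numpy as np
from keras.utils import to_categorical

# Binary labels converted into one-hot vectors: 0 -> (1, 0), 1 -> (0, 1)
print(to_categorical(np.array([0, 1, 1, 0]), num_classes=2))

# Output:
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]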

The final classification plot is shown in the following diagram:

MLP classification of the XOR dataset

Only three points have been misclassified, but it's clear that the MLP successfully separated the XOR dataset. To confirm the generalization ability, we've plotted the decision surfaces obtained with a hyperbolic tangent hidden layer and with a ReLU one:

MLP decision surfaces with Tanh (left) and ReLU (right) hidden layer
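For readers who want to reproduce these plots, a decision surface can be drawn by evaluating the trained model on a dense grid of points. The following is only a sketch (it assumes the arrays X and Y created in the previous example, the trained model, matplotlib, and an arbitrary grid step):

import numpy as np
import matplotlib.pyplot as plt

# Dense grid covering the input space
xx, yy = np.meshgrid(np.arange(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 0.02),
                     np.arange(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 0.02))

# Predicted class for each grid point (argmax over the softmax outputs)
Z = np.argmax(model.predict(np.c_[xx.ravel(), yy.ravel()]), axis=1)

# Decision regions plus the original samples
plt.contourf(xx, yy, Z.reshape(xx.shape), alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=Y, s=10)
plt.show()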

In both cases, the MLPs delimited the areas in a reasonable way. However, while the tanh hidden layer seems to overfit (this is not true in our case, as the dataset represents exactly the data generating process), the ReLU layer generates less smooth boundaries with an apparently lower variance (in particular when considering the outliers of a class). We know that the final validation accuracies confirm an almost perfect fit, and the decision plots (which are easy to create with two-dimensional data) show acceptable boundaries in both cases, but this simple exercise is useful to understand the complexity and the sensitivity of a deep model. For this reason, it's absolutely necessary to select a valid training set (representing the ground truth) and employ all possible techniques to avoid overfitting (as we're going to discuss later). The easiest way to detect such a situation is to check the validation loss. A good model should reduce both training and validation loss after each epoch, with the latter reaching a plateau. If, after n epochs, the validation loss begins to increase (and, consequently, the validation accuracy decreases), while the training loss keeps decreasing, it means that the model is overfitting the training set.
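As a practical aid, the object returned by fit() stores the per-epoch training and validation losses, and Keras also provides the EarlyStopping callback to halt the training automatically when the validation loss stops improving. The following is only a sketch of both options, applied to a freshly built (or re-initialized) model; the patience value is an arbitrary choice:

from keras.callbacks import EarlyStopping

# Stop the training when the validation loss hasn't improved for 10 consecutive epochs
es = EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(X_train,
                    to_categorical(Y_train, num_classes=2),
                    epochs=100,
                    batch_size=32,
                    validation_data=(X_test, to_categorical(Y_test, num_classes=2)),
                    callbacks=[es])

# Per-epoch curves that can be plotted to spot a diverging validation loss
print(history.history['loss'])
print(history.history['val_loss'])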

Another empirical indicator that the training process is evolving correctly is that, at least at the beginning, the validation accuracy should be higher than the training one. This can seem strange, but we need to consider that the validation set is slightly smaller and less complex than the training set; therefore, if the capacity of the model is not yet saturated by the training samples, the probability of misclassification is higher for the training set than for the validation set. When this trend is inverted, the model is very likely to overfit after a few epochs. To verify these concepts, I invite the reader to repeat the exercise using a large number of hidden neurons (so as to dramatically increase the capacity), as in the sketch below, but these effects will become clearer when working with much more complex and unstructured datasets.
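As a starting point for this exercise, the same architecture can be rebuilt with a deliberately oversized hidden layer (the number of units is an arbitrary choice); monitoring the training and validation losses during the process will show whether the extra capacity leads to overfitting:

from keras.models import Sequential
from keras.layers import Dense, Activation

# Same architecture, but with a much wider hidden layer
wide_model = Sequential()
wide_model.add(Dense(512, input_dim=2))
wide_model.add(Activation('tanh'))
wide_model.add(Dense(2))
wide_model.add(Activation('softmax'))

wide_model.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])

wide_model.fit(X_train,
               to_categorical(Y_train, num_classes=2),
               epochs=100,
               batch_size=32,
               validation_data=(X_test, to_categorical(Y_test, num_classes=2)))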

Keras can be installed using the command pip install -U keras. Depending on the version and on the local configuration, the default backend can be Theano or TensorFlow, with CPU support only. In order to switch backend or to enable GPU support (for example, TensorFlow GPU), I suggest reading the instructions reported on the home page, https://keras.io. As also suggested by the author, the best backend is TensorFlow, which is available for Linux, macOS, and Windows. To install it (together with all its dependencies), please follow the instructions on the following page: https://www.tensorflow.org/install/