Facial expressions

Our face is one of the strongest indicators of emotion; as we laugh or cry, we put our emotions on display, allowing others to glimpse into our minds. It's a form of nonverbal communication that, by some estimates, accounts for over 50% of our communication with others. Some forty independently controlled muscles make the face one of the most complex systems we possess, which could be why we use it as a medium for communicating something as important as our current emotional state. But can we classify it?

In 2013, the International Conference on Machine Learning (ICML) ran a competition inviting contestants to build a facial expression classifier using a training dataset of over 28,000 grayscale images, each labeled as one of anger, disgust, fear, happiness, sadness, surprise, or neutral. The following are a few samples of this training data (available at https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge):

As previously mentioned, the training dataset consists of 28,709 grayscale images of faces at 48 x 48 pixels, where each face is centered and associated with a label indicating its emotion. This emotion can be one of the following seven labels (a textual description has been added for legibility):
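If you want to poke around the data yourself, the Kaggle download ships as a CSV in which each row holds an integer emotion label and the 48 x 48 pixel intensities as a space-separated string. The following is a minimal sketch of loading it in Python; the file name and the index-to-name mapping follow the common FER2013 convention and are assumptions on our part, not part of the competition material reproduced here:

    import numpy as np
    import pandas as pd

    # Conventional FER2013 encoding of the seven expressions (assumed mapping).
    EMOTIONS = {0: "angry", 1: "disgust", 2: "fear", 3: "happy",
                4: "sad", 5: "surprise", 6: "neutral"}

    # Each row: an integer label and 48 x 48 = 2,304 pixel values in one string.
    df = pd.read_csv("fer2013.csv")  # hypothetical local path

    def row_to_image(pixel_string):
        # Convert the space-separated pixel string into a 48 x 48 uint8 array.
        return np.array(pixel_string.split(), dtype=np.uint8).reshape(48, 48)

    images = np.stack([row_to_image(p) for p in df["pixels"]])
    labels = df["emotion"].to_numpy()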

A neural network (or any other machine learning algorithm) can't really do anything by itself. All a neural network does is find a direct or indirect correlation between two datasets (the inputs and their corresponding outputs). In order for a neural network to learn, we need to present it with two meaningful datasets where some true correlation exists between the inputs and outputs. A good practice when tackling any new data problem is to come up with a predictive theory of how you might approach it, or to search for correlation using techniques such as data visualization or some other exploratory data analysis technique. In doing so, we also gain a better understanding of how we need to prepare our data to align it with the training data.

Let's look at the results of a data visualization technique that can be performed on the training data; here, our assumption is that some distinguishing pattern exists for each expression (happy, sad, angry, and so on). One way of visually inspecting this is by averaging the images of each expression and looking at the associated variance. This can be achieved simply by finding the mean and standard deviation across all images of a given class (for example, happy, angry, and so on), as sketched below. The results for some of the expressions can be seen in the following image:
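As a rough sketch of how such averaged images could be produced (reusing the hypothetical images and labels arrays from the loading snippet above; this is our illustration, not the author's exact plotting code):

    import numpy as np

    def class_statistics(images, labels, class_index):
        # Mean and standard-deviation images for a single expression class.
        subset = images[labels == class_index].astype(np.float32)
        return subset.mean(axis=0), subset.std(axis=0)

    # For example, the average and variance of the 'happy' faces (index 3).
    mean_happy, std_happy = class_statistics(images, labels, class_index=3)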

After you get over the creepiness of the images, you get a sense that a pattern does exist, and you understand what our model needs to learn to be able to recognize facial expressions. Some other notable, and fairly visible, takeaways from this exercise include the amount of variance in the disgust expression, which hints that our model might find it difficult to effectively learn to recognize this expression. The other observation, and the one more applicable to our task in this chapter, is that the training data consists of forward-facing faces with little padding beyond the face, which highlights what the model expects for its input. Now that we have a better sense of our data, let's move on and introduce the model we will be using in this chapter.

In Chapter 3, Recognising Objects in the World, we presented the intuition behind CNNs (or ConvNets). Given that we won't be introducing any new concepts in this chapter, we will omit any discussion of the details of the model and just present it here for reference, with some commentary on its architecture and the format of the data it expects as input:

The preceding figure is a visualization of the architecture of the model; it's a typical CNN, with a stack of convolutional and pooling layers whose output is flattened and fed into a series of fully connected layers, ending in a softmax activation layer for multi-class classification. As mentioned earlier, the model expects a 3D tensor with the dimensions 48 x 48 x 1 (width, height, channels). To avoid feeding our model large numbers (0 - 255), the input has been normalized by dividing each pixel by 255, which gives us a range of 0.0 - 1.0. The model outputs a probability for each class, that is, seven outputs, each representing how likely it is that the given input belongs to that class. To make a prediction, we simply take the class with the largest probability.
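The exact layer configuration is given in the preceding figure; purely as an illustrative sketch (not the author's precise architecture), a network of this general shape could be written in Keras as follows:

    from tensorflow.keras import layers, models

    # Illustrative only: convolution/pooling blocks, flatten, dense layers,
    # and a 7-way softmax, matching the 48 x 48 x 1 input described above.
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(7, activation="softmax"),  # one probability per class
    ])

    # Normalize inputs to 0.0 - 1.0 and add the channel dimension.
    x = images.reshape(-1, 48, 48, 1).astype("float32") / 255.0

    # A prediction is simply the class with the largest probability.
    probabilities = model.predict(x[:1])
    predicted_class = probabilities.argmax(axis=-1)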

This model was trained on 22,967 samples, reserving the other 5,742 samples for validation. After 15 epochs, the model achieved approximately 59% accuracy on the validation set, enough to squeeze into 13th place in the Kaggle competition (at the time of writing this chapter). The following graphs show the accuracy and loss during training:
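A training call that mirrors the split and epoch count quoted above might look like the following. This is again a sketch under our assumptions, reusing x, labels, and model from the earlier snippets; the actual split, optimizer, and batch size used for the original results are not shown here:

    from tensorflow.keras.utils import to_categorical

    # One-hot encode the labels and reproduce the quoted 22,967 / 5,742 split.
    y = to_categorical(labels, num_classes=7)
    x_train, y_train = x[:22967], y[:22967]
    x_val, y_val = x[22967:], y[22967:]

    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=15, batch_size=64)  # batch size is assumed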

This concludes our brief introduction to the data and model we will be using in this chapter. The two main takeaways are an appreciation of what data the model was fed during training, and the fact that our model achieved just 59% accuracy.

The former dictates how we approach obtaining and processing the data before feeding it into the model. The latter poses an opportunity for further investigation to better understand what is pulling the accuracy down and how to improve it; it can also be seen as a design challenge, something to be designed around within this constraint.

In this chapter, we are mainly concerned with the former, so in the next section, we will explore how to obtain and preprocess the data before feeding it to the model. Let's get started.
