Chapter 11. Face Detection and Recognition

Not long ago, I boarded a flight to Europe and was surprised that I didn’t have to show my passport. I passed in front of a camera and was promptly welcomed aboard the flight. It was part of an early pilot for Delta Air Lines’ effort to push forward with facial recognition and offer a touchless curb-to-gate travel experience.

Facial recognition is everywhere. It’s one of the most common, and sometimes controversial, applications for AI. Meta, formerly known as Facebook, used it to tag friends in photos until it retired the feature over privacy concerns. Apple uses it to allow users to unlock their iPhones, while Microsoft uses it to unlock Windows PCs. Uber uses it to confirm the identity of its drivers. Used properly, facial recognition has vast potential to make the world a better, safer, and more secure place.

Suppose you want to build a system that identifies people in photos or video frames. Perhaps it’s part of a security system that restricts access to college dorms to students and staff who are authorized to enter. Or perhaps you’re writing an app that searches your hard disk for photos of people you know. (“Show me all the photos of me and my daughter.”) Building systems such as these requires algorithms or models capable of:

  • Finding faces in photos or video frames, a process known as face detection

  • Identifying the faces detected, a process known as facial recognition or face identification

Numerous well-known algorithms exist for finding and identifying faces in photos. Some rely on deep learning—in particular, convolutional neural networks—while some do not. Facial recognition, after all, predated the explosion of deep learning by decades. But deep learning has supercharged the science of facial recognition and made it more practical than ever before.

This chapter begins by introducing two popular face detection methods. Then it moves on to facial recognition and introduces transfer learning as a means for recognizing faces. It concludes with a tutorial in which you put the pieces together and build a facial recognition system of your own. Sound like fun? Then let’s get started.

Face Detection

The sections that follow introduce two widely used algorithms for face detection—one that relies on machine learning and another that uses deep learning—as well as libraries that implement them. The goal is to be able to find all the faces in a photo or video frame like the one in Figure 11-1. Afterward, I’ll present an easy-to-use function that you can call to extract all the facial images from a photo and save them to disk or submit them to a facial recognition model.

Figure 11-1. Face detection

Face Detection with Viola-Jones

One of the fastest and most popular algorithms for detecting faces in photos stems from a paper published in 2001 titled “Rapid Object Detection Using a Boosted Cascade of Simple Features”. Commonly known as Viola-Jones after the paper’s authors, the algorithm keys on the relative intensities of adjacent blocks of pixels. For example, the average pixel intensity in a rectangle around the eyes is typically darker than the average pixel intensity in a rectangle immediately below that area. Similarly, the bridge of the nose is usually lighter than the region around the eyes, so two dark rectangles with a bright rectangle in the middle might represent two eyes and a nose. The presence of many such Haar-like features in a frame at the right locations is an indicator that the frame contains a face (Figure 11-2).

Figure 11-2. Face detection using Haar-like features

Viola-Jones works by sliding windows of various sizes over an image looking for frames with Haar-like features in the right places. At each stop, the pixels in the window are scaled to a specified size (typically 24 × 24), and features are extracted and fed into a binary classifier that returns positive, indicating that the frame contains a face, or negative, indicating that it doesn’t. Then the window slides to the next location and the detection regimen begins again.

The key to Viola-Jones’s performance is the binary classifier. A frame that is 24 pixels wide and 24 pixels high contains more than 160,000 combinations of rectangles representing potential Haar-like features. Rather than compute values for every combination, Viola-Jones computes only those that the classifier requires. Furthermore, how many features the classifier requires depends on the content of the frame. The classifier is actually several binary classifiers arranged in stages. The first stage might require just one feature. The second stage might require 10, the third might require 20, and so on. Features are extracted and passed to stage n only if stage n – 1 returns positive, giving rise to the term cascade classifier.

Figure 11-3 depicts a three-stage cascade classifier. Each stage is carefully tuned to achieve a 100% detection rate using a limited number of features even if the false-positive rate is high. In the first stage, one feature determines whether the frame contains a face. A positive response means the frame might contain a face; a negative response means that it most certainly doesn’t, in which case no further checks are performed. If stage 1 returns positive, however, 10 other features are extracted and passed to stage 2. A frame is judged to contain a face only if all stages return positive, yielding a cumulative false-positive rate near zero. In machine learning, this is a design pattern known as high recall then precision because while individual stages are tuned for high recall, the cumulative effect is one of high precision.

Figure 11-3. Face detection using a cascade classifier

One benefit of this architecture is that frames lacking faces tend to fall out fast because they evoke a negative response early in the cascade. Because most frames don’t contain faces, the algorithm runs very quickly until it encounters a frame that does. In testing with a 38-stage classifier trained on 6,061 features from 4,916 facial images, Viola and Jones found that, on average, just 10 features were extracted from each frame.
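
To make the cascade idea concrete, here’s a minimal sketch of the control flow in Python. It’s illustrative only, not OpenCV’s implementation; each stage is assumed to be a callable that extracts whatever features it needs and returns True or False:

def cascade_contains_face(frame, stages):
    # Each stage examines the frame and returns True ("might contain a face")
    # or False ("definitely doesn't"). Early stages use very few features,
    # so frames without faces are rejected cheaply.
    for stage in stages:
        if not stage(frame):
            return False   # reject immediately; later stages never run
    return True            # every stage returned positive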

The efficacy of Viola-Jones depends on the cascade classifier, which is essentially a machine learning model trained with facial and nonfacial images. Training is slow, but predictions are fast. In some respects, Viola-Jones acts like a CNN handcrafted to extract the minimum number of features needed to determine whether a frame contains a face. To speed feature extraction, Viola-Jones uses a clever mathematical trick called integral images to rapidly compute the difference in intensity between two blocks of pixels. The result is a system that can identify bounding boxes surrounding faces in an image with a relatively high degree of accuracy, and it can do so quickly enough to detect faces in live video frames.
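
The integral image trick is easy to illustrate with NumPy. In this minimal sketch (not OpenCV’s code), the cumulative sum is computed once per image, after which the sum of the pixels in any rectangle requires at most four lookups:

import numpy as np

def integral_image(gray):
    # ii[y, x] = sum of all pixels above and to the left of (y, x), inclusive
    return gray.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    # Sum of the pixels in a rectangle via inclusion-exclusion on the integral image
    bottom, right = top + height - 1, left + width - 1
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

Differencing rect_sum values for adjacent rectangles yields the intensity comparisons that Haar-like features are built from.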

Using the OpenCV Implementation of Viola-Jones

OpenCV is a popular open source computer-vision library that’s free for commercial use. It provides an implementation of Viola-Jones in its CascadeClassifier class, along with an XML file containing a cascade classifier trained to detect faces. The following statements use CascadeClassifier in a Jupyter notebook to detect faces in an image and draw rectangles around the faces. You can use an image of your own or download the one featured in my example from GitHub:

import cv2
from cv2 import CascadeClassifier
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
%matplotlib inline

image = plt.imread('Data/Amsterdam.jpg')
fig, ax = plt.subplots(figsize=(12, 8), subplot_kw={'xticks': [], 'yticks': []})
ax.imshow(image)

model = CascadeClassifier(cv2.data.haarcascades +
                          'haarcascade_frontalface_default.xml')
faces = model.detectMultiScale(image)

for face in faces:
    x, y, w, h = face
    rect = Rectangle((x, y), w, h, color='red', fill=False, lw=2)
    ax.add_patch(rect)

Here’s the output with a photo of a mother and her daughter taken in Amsterdam a few years ago:

CascadeClassifier detected the two faces in the photo, but it also suffered a number of false positives. One way to mitigate that is to use the minNeighbors parameter. It defaults to 3, but higher values make CascadeClassifier more selective. With minNeighbors=20, detectMultiScale finds just the faces of the two people:

faces = model.detectMultiScale(image, minNeighbors=20)

Here is the output:

As detectMultiScale analyzes an image, it typically detects a face multiple times, each defined by a bounding box that’s aligned slightly differently. The minNeighbors parameter specifies the minimum number of times a face must be detected to be reported as a face. Higher values deliver higher precision (fewer false positives), but at the cost of lower recall, which means some faces might not be detected.

CascadeClassifier frequently requires tuning in this manner to strike the right balance between finding too many faces and finding too few. Even so, it is among the fastest face detection algorithms in existence. It can also be used to detect objects other than faces by loading XML files containing other pretrained classifiers. In OpenCV’s GitHub repo, you’ll find XML files for detecting eyes, cat faces, license plates, and other objects using Haar-like features, as well as XML files that detect objects using a different type of feature called local binary patterns.
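
For example, here’s a sketch that applies the same pattern to OpenCV’s pretrained eye detector, which ships in the same haarcascades directory as the face cascade. As with faces, you’ll probably need to experiment with minNeighbors (and perhaps minSize) to get clean results:

# Detect eyes instead of faces using another cascade that ships with OpenCV
eye_model = CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye.xml')
eyes = eye_model.detectMultiScale(image, minNeighbors=20)

fig, ax = plt.subplots(figsize=(12, 8), subplot_kw={'xticks': [], 'yticks': []})
ax.imshow(image)

for (x, y, w, h) in eyes:
    ax.add_patch(Rectangle((x, y), w, h, color='yellow', fill=False, lw=2))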

Face Detection with Convolutional Neural Networks

While more computationally expensive, deep-learning methods often do a better job of detecting faces in images than Viola-Jones. In particular, multitask cascaded convolutional neural networks, or MTCNNs, have proven adept at face detection on a variety of benchmarks. They also identify facial landmarks such as the eyes, the nose, and the mouth.

Figure 11-4 is adapted from a diagram in the 2016 paper titled “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks” that proposed MTCNNs. An MTCNN uses three CNNs arranged in a series to detect faces. The first one, called the Proposal Network, or P-Net, is a shallow CNN that searches the image at various resolutions looking for features indicative of faces. Rectangles identified by P-Net are combined to form candidate face rectangles and are input to the Refine Network, or R-Net, which is a deeper CNN that examines each rectangle more closely and rejects those that lack faces. Finally, output from R-Net is input to the Output Network (O-Net), which further filters candidate rectangles and identifies facial landmarks. MTCNNs are multitask CNNs because they produce three outputs each—a classification output indicating the confidence level that the rectangle contains a face, and two regression outputs locating the face and facial landmarks—rather than just one. And they’re cascaded like Viola-Jones classifiers to quickly rule out frames that don’t contain faces.

Figure 11-4. Multitask cascaded convolutional neural network

A handy MTCNN implementation is available in the Python package named MTCNN. The following statements use it to detect faces in the same photo featured in the previous example:

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from mtcnn.mtcnn import MTCNN
%matplotlib inline

image = plt.imread('Data/Amsterdam.jpg')
fig, ax = plt.subplots(figsize=(12, 8), subplot_kw={'xticks': [], 'yticks': []})
ax.imshow(image)

detector = MTCNN()
faces = detector.detect_faces(image)

for face in faces:
    x, y, w, h = face['box']
    rect = Rectangle((x, y), w, h, color='red', fill=False, lw=2)
    ax.add_patch(rect)

Here’s the result:

The MTCNN detected not only the faces of the two people but also the face of a statue reflected in the door behind them. Here’s what detect_faces actually returned—a list containing three dictionaries, each corresponding to one of the faces in the photo:

[
  {
    'box': [723, 248, 204, 258],
    'confidence': 0.9997798800468445,
    'keypoints': {
      'left_eye': (765, 341),
      'right_eye': (858, 343),
      'nose': (800, 408),
      'mouth_left': (770, 432),
      'mouth_right': (864, 433)
    }
  },
  {
    'box': [538, 258, 183, 232],
    'confidence': 0.9997591376304626,
    'keypoints': {
      'left_eye': (601, 353),
      'right_eye': (685, 344),
      'nose': (662, 394),
      'mouth_left': (614, 433),
      'mouth_right': (689, 424)
    }
  },
  {
    'box': [1099, 84, 40, 41],
    'confidence': 0.8863282203674316,
    'keypoints': {
      'left_eye': (1108, 101),
      'right_eye': (1123, 96),
      'nose': (1116, 102),
      'mouth_left': (1114, 115),
      'mouth_right': (1127, 111)
    }
  }
]

You can eliminate the face in the reflection in either of two ways: by ignoring faces with a confidence level below a certain threshold, or by passing a min_face_size parameter to the MTCNN constructor so that detect_faces ignores faces smaller than a specified size. Here’s a modified for loop that does the former:

for face in faces:
    if face['confidence'] > 0.9:
        x, y, w, h = face['box']
        rect = Rectangle((x, y), w, h, color='red', fill=False, lw=2)
        ax.add_patch(rect)

And here’s the result:

The facial rectangles in Figure 11-1 were generated using MTCNN’s default settings—that is, without any filtering based on confidence levels or face sizes. Generally speaking, it does a better job out of the box than CascadeClassifier at detecting faces.
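
Incidentally, if you’d rather filter by size than by confidence, the min_face_size approach mentioned earlier is a one-line change. The value of 80 used here is an arbitrary cutoff chosen for this particular photo; tune it for your own images:

detector = MTCNN(min_face_size=80)  # ignore faces smaller than 80 x 80 pixels
faces = detector.detect_faces(image)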

Extracting Faces from Photos

Once you know how to find faces in photos, it’s a simple matter to extract facial images in order to train a model or submit them to a trained model for identification. Example 11-1 presents a Python function that accepts a path to an image file and returns a list of facial images. By default, it crops facial images so that they’re square (perfect for passing them to a CNN), but you can disable cropping by passing the function a crop=False parameter. You can also specify a minimum confidence level with a min_confidence parameter, which defaults to 0.9.

Example 11-1. Function for extracting facial images from a photo
import numpy as np
from PIL import Image, ImageOps
from mtcnn.mtcnn import MTCNN

def extract_faces(input_file, min_confidence=0.9, crop=True):
    # Load the image and orient it correctly
    pil_image = Image.open(input_file)
    exif = pil_image.getexif()
    
    # Keep only the orientation tag (0x0112); discard all other EXIF metadata
    for k in list(exif.keys()):
        if k != 0x0112:
            del exif[k]

    pil_image.info["exif"] = exif.tobytes()
    pil_image = ImageOps.exif_transpose(pil_image)
    image = np.array(pil_image)

    # Find the faces in the image
    detector = MTCNN()
    faces = detector.detect_faces(image)
    faces = [face for face in faces if face['confidence'] >= min_confidence]
    results = []

    for face in faces:
        x1, y1, w, h = face['box']

        if crop:
            # Compute crop coordinates
            if w > h:
                x1 = x1 + ((w - h) // 2)
                w = h
            elif h > w:
                y1 = y1 + ((h - w) // 2)
                h = w

        # Extract the facial image and add it to the list
        x2 = x1 + w
        y2 = y1 + h
        results.append(Image.fromarray(image[y1:y2, x1:x2]))

    # Return all the facial images
    return results

I passed the photo in Figure 11-5 to the function, and it returned the faces underneath. The items returned from extract_faces are Python Imaging Library (PIL) images, so you can resize them or save them to disk with a single line of code. Here’s a code snippet that extracts all the faces from a photo, resizes them to 224 × 224 pixels, and saves the resized images:

faces = extract_faces('PATH_TO_IMAGE_FILE')

for i, face in enumerate(faces):
    face.resize((224, 224)).save(f'face{i}.jpg')

Figure 11-5. Facial images extracted from a photo

With extract_faces to lend a hand, it’s a relatively simple matter to generate a set of facial images for training a CNN from a batch of photos on your hard disk, or to extract faces from a photo and submit them to a CNN for identification.
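
Here’s a sketch of what that might look like. The Photos and ExtractedFaces directory names are placeholders for your own, and the code assumes the extract_faces function from Example 11-1 has already been defined:

import os

input_dir = 'Photos'           # folder of source photos (placeholder name)
output_dir = 'ExtractedFaces'  # folder to receive cropped faces (placeholder name)
os.makedirs(output_dir, exist_ok=True)

count = 0
for file_name in os.listdir(input_dir):
    if file_name.lower().endswith(('.jpg', '.jpeg', '.png')):
        for face in extract_faces(os.path.join(input_dir, file_name)):
            face.resize((224, 224)).save(os.path.join(output_dir, f'face{count}.jpg'))
            count += 1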

Facial Recognition

Now that you know how to detect faces in photos, the next step is to learn how to identify them. Several algorithms for recognizing faces in photos have been developed over the years. Some rely on biometrics, such as the distance between the eyes or the texture of the skin, while others take a more holistic approach by treating facial identification as a pattern recognition problem. State-of-the-art models today frequently rely on deep convolutional neural networks. One of the primary benchmarks for facial recognition models is the Labeled Faces in the Wild (LFW) dataset pictured in Figure 11-6, which contains more than 13,000 facial images of more than 5,000 people collected from the web. Deep-learning models such as MobiFace and FaceNet routinely achieve greater than 99% accuracy on the dataset. This equals or exceeds a human’s ability to identify faces in LFW photos.

Figure 11-6. The Labeled Faces in the Wild dataset

Chapter 5 presented a support vector machine (SVM) that achieved 85% accuracy using a subset of 500 images—100 each of five famous people—from the dataset. Chapter 9 tackled the same problem with a neural network, with similar results. These models merely scratch the surface of what modern facial recognition can accomplish. Let’s apply CNNs and transfer learning to the same LFW subset and see if they can do better at recognizing faces in photos. Along the way, you’ll learn a valuable lesson about pretrained CNNs and the specificity of the weights that are generated when those CNNs are trained.

Applying Transfer Learning to Facial Recognition

The first step in exploring CNN-based facial recognition is to load the LFW dataset. This time, we’ll load full-size color images and crop them to 128 × 128 pixels. Here’s the code:

import pandas as pd
from sklearn.datasets import fetch_lfw_people

faces = fetch_lfw_people(min_faces_per_person=100, slice_=None, resize=1.0,
                         color=True)
faces.images = faces.images[:, 60:188, 60:188]
faces.data = faces.images.reshape(faces.images.shape[0], faces.images.shape[1] *
                                  faces.images.shape[2], faces.images.shape[3])
class_count = len(faces.target_names)

print(faces.target_names)
print(faces.images.shape)

Because we set min_faces_per_person to 100, a total of 1,140 facial images corresponding to five people were loaded. Use the following statements to show the first several images and the labels that go with them:

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(3, 6, figsize=(18, 10))

for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i])
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

The dataset is imbalanced, containing almost as many photos of George W. Bush as of everyone else combined. Use the following code to reduce the dataset to 100 images of each person, for a total of 500 facial images:

import numpy as np

mask = np.zeros(faces.target.shape, dtype=bool)

for target in np.unique(faces.target):
    mask[np.where(faces.target == target)[0][:100]] = 1

x_faces = faces.data[mask]
y_faces = faces.target[mask]
x_faces = np.reshape(x_faces, (x_faces.shape[0], faces.images.shape[1],
                               faces.images.shape[2], faces.images.shape[3]))
x_faces.shape

Now preprocess the pixel values for input to a pretrained ResNet50 CNN and use Scikit-Learn’s train_test_split function to split the dataset, yielding 400 training samples and 100 test samples:

from sklearn.model_selection import train_test_split
from tensorflow.keras.applications.resnet50 import preprocess_input

face_images = preprocess_input(np.array(x_faces * 255))

x_train, x_test, y_train, y_test = train_test_split(
    face_images, y_faces, train_size=0.8, stratify=y_faces,
    random_state=0)

If you wanted, you could divide the preprocessed pixel values by 255 and train a CNN from scratch right now with this data. Here’s how you’d go about it (in case you care to give it a try):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(x_train.shape[1:])))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(class_count, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train / 255, y_train, validation_data=(x_test / 255, y_test),
          epochs=20, batch_size=10)

I did it and then plotted the training and validation accuracy:

The validation accuracy is better than that of an SVM or a conventional neural network, but it’s nowhere near what modern CNNs achieve on the LFW dataset. So clearly there is a better way.

That better way, of course, is transfer learning, which we covered in Chapter 10. ResNet50 was trained with more than 1 million images from the ImageNet dataset, so it should be adept at extracting features from photos—more so than a handcrafted CNN trained with 400 images. Let’s see if that’s the case. Use the following statements to load ResNet50’s feature extraction layers, initialize them with the ImageNet weights, and freeze them so that the weights aren’t adjusted during training:

from tensorflow.keras.applications import ResNet50

base_model = ResNet50(weights='imagenet', include_top=False)
base_model.trainable = False

Now add classification layers to the base model and include a Resizing layer to resize images input to the network to the size that ResNet50 expects:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Resizing

model = Sequential()
model.add(Resizing(224, 224))
model.add(base_model)
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(class_count, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Train the model and plot the training and validation accuracy:

import seaborn as sns
sns.set()

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                 batch_size=10, epochs=10)

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

Results will vary, but my run produced a validation accuracy around 94%:

This is an improvement over a CNN trained from scratch, and it’s an indication that ResNet50 does a better job of extracting features from facial images. But it’s still not state of the art. Is it possible to do even better?

Boosting Transfer Learning with Task-Specific Weights

Initialized with ImageNet weights, ResNet50 does a credible job of feature extraction. Those weights were arrived at when ResNet50 was trained on more than 1 million photos of objects ranging from basketballs to butterflies. It was not, however, trained with facial images. Would it be better at extracting features from facial images if it were trained with facial images?

In 2017, a group of researchers at the University of Oxford’s Visual Geometry Group published a paper titled “VGGFace2: A Dataset for Recognising Faces Across Pose and Age”. After assembling a dataset comprising several million facial images, they trained two variations of ResNet50 with it and published the results. They also published the weights, which are wrapped in a handy Python library named Keras-vggface. That library includes a class named VGGFace that encapsulates ResNet50 with TensorFlow-compatible weights. Out of the box, the VGGFace model is capable of recognizing the faces of thousands of celebrities ranging from Brie Larson to Jennifer Aniston. But its real value lies in using transfer learning to repurpose it to recognize faces it wasn’t trained to recognize before.

To simplify matters, I installed Keras-vggface, created an instance of VGGFace without the classification layers, initialized the weights, and saved the model to an H5 file named vggface.h5. Download that file and drop it into your notebooks’ Data subdirectory. Then use the following code to create an instance of VGGFace built on top of ResNet50, and add custom classification layers:

from tensorflow.keras.models import load_model

base_model = load_model('Data/vggface.h5')
base_model.trainable = False

model = Sequential()
model.add(Resizing(224, 224))
model.add(base_model)
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(class_count, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
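
If you’d like to generate vggface.h5 yourself rather than download it, a minimal sketch using the Keras-vggface package looks something like this. (The package must be installed and be compatible with your version of TensorFlow, which may require pinning older package versions.)

from keras_vggface.vggface import VGGFace

# Load the ResNet50 flavor of VGGFace without its classification layers and
# save it as an H5 file for later use
vggface_base = VGGFace(model='resnet50', include_top=False,
                       input_shape=(224, 224, 3))
vggface_base.save('Data/vggface.h5')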

Next, train the model and plot the training and validation accuracy:

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                 batch_size=10, epochs=10)

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

The results are spectacular:

To be sure, run the test data through the network and use a confusion matrix to assess the results:

from sklearn.metrics import ConfusionMatrixDisplay as cmd

sns.reset_orig()
y_pred = model.predict(x_test)
fig, ax = plt.subplots(figsize=(5, 5))
ax.grid(False)

cmd.from_predictions(y_test, y_pred.argmax(axis=1),
                     display_labels=faces.target_names, colorbar=False,
                     cmap='Blues', xticks_rotation='vertical', ax=ax)

Because VGGFace was tuned to extract features from facial images, it achieves a perfect score on the 100 test images. That’s not to say that it will never fail to recognize a face. It does indicate that, on the dataset you trained it with, it is remarkably adept at extracting features from facial images:

And therein lies an important lesson. CNNs that are trained in task-specific ways frequently provide a better base for transfer learning than CNNs trained in a more generic fashion. If the goal is to perform facial recognition, you’ll almost always do better with a CNN trained with facial images than a CNN trained with photos of thousands of dissimilar objects. For a neural network, it’s all about the weights.

ArcFace

VGGFace isn’t the only pretrained CNN that excels at extracting features from facial images. Another is ArcFace, which was introduced in a 2019 paper titled “ArcFace: Additive Angular Margin Loss for Deep Face Recognition”. A handy implementation is available in a Python package named Arcface.

Each facial image submitted to ArcFace is transformed into a dense vector of 512 values known as a face embedding. The code for creating an embedding is simple:

from arcface import ArcFace

af = ArcFace.ArcFace()
embedding = af.calc_emb(image)

Embeddings can be used both to train machine learning models and to make predictions with those models. Thanks to the loss function named in the title of the paper, embeddings created by ArcFace often do a better job of capturing the uniqueness of a face than embeddings generated by conventional CNNs.
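
For instance, here’s a sketch that pairs ArcFace embeddings with a classical Scikit-Learn classifier. The face_images, face_labels, and new_face variables are hypothetical placeholders: facial images in whatever form calc_emb accepts, plus the corresponding identities:

from arcface import ArcFace
from sklearn.svm import SVC

af = ArcFace.ArcFace()

# Turn each facial image into a 512-value embedding and use the embeddings
# as feature vectors for a support vector machine
embeddings = [af.calc_emb(face) for face in face_images]
classifier = SVC(kernel='linear', probability=True)
classifier.fit(embeddings, face_labels)

# Identify a new face by classifying its embedding
prediction = classifier.predict([af.calc_emb(new_face)])[0]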

Another use for the embeddings created by ArcFace is face verification, which compares two facial images and computes the probability that they represent the same person. The following statements generate embeddings for two facial images and use the cosine_similarity function introduced in Chapter 4 to quantify the similarity between the two:

from sklearn.metrics.pairwise import cosine_similarity

af = ArcFace.ArcFace()
face_emb1 = af.calc_emb(image1)
face_emb2 = af.calc_emb(image2)
sim = cosine_similarity([face_emb1, face_emb2])[0][1]

The result is a value from 0.0 to 1.0, with higher values reflecting greater similarity between the faces.

Putting It All Together: Detecting and Recognizing Faces in Photos

Any time a model scores perfectly in testing, you should be skeptical. No model is perfect, and even if it achieves 100% accuracy against a test dataset, it won’t duplicate that in the wild. Given that VGGFace was trained with images of some of the same famous people found in the LFW dataset, is it possible that it’s biased toward those people? That transfer learning with VGGFace wouldn’t do as well if trained with images of ordinary people? And how would it perform with just a handful of training images?

To answer these questions, let’s build a notebook that trains a facial recognition model based on VGGFace, uses an MTCNN to detect faces in photos, and uses the model to identify the faces it detects. The dataset you’ll use contains eight pictures each of three ordinary people in slightly different poses, at ages up to 20 years apart, with and without glasses (Figure 11-7). These images were extracted from photos using the extract_faces function in Example 11-1 and resized to 224 × 224.

Figure 11-7. Photos for training a facial recognition model

Begin by downloading a ZIP file containing the facial images and copying the contents of the ZIP file into a subdirectory named Faces where your notebooks are hosted. The ZIP file contains four folders: one named Jeff, one named Lori, one named Abby, and one named Samples that contains uncropped photos for testing.

Now create a new notebook and run the following code in the first cell to define helper functions for loading and displaying facial images from the subdirectories you copied them to, and declare a pair of Python lists to hold the images and labels:

import os
from tensorflow.keras.preprocessing import image
import matplotlib.pyplot as plt
%matplotlib inline

def load_images_from_path(path, label):
    images, labels = [], []

    for file in os.listdir(path):
        images.append(image.img_to_array(image.load_img(os.path.join(path, file),
                      target_size=(224, 224, 3))))
        labels.append(label)
        
    return images, labels

def show_images(images):
    fig, axes = plt.subplots(1, 8, figsize=(20, 20),
                             subplot_kw={'xticks': [], 'yticks': []})
 
    for i, ax in enumerate(axes.flat):
        ax.imshow(images[i] / 255)

x, y = [], []

Next, load the images of Jeff and label them with 0s:

images, labels = load_images_from_path('Faces/Jeff', 0)
show_images(images)
    
x += images
y += labels

Load the images of Lori and label them with 1s:

images, labels = load_images_from_path('Faces/Lori', 1)
show_images(images)
    
x += images
y += labels

Load the images of Abby and label them with 2s:

images, labels = load_images_from_path('Faces/Abby', 2)
show_images(images)
    
x += images
y += labels

Finally, preprocess the pixels for the ResNet50 version of VGGFace and split the data fifty-fifty so that the network will be trained with four randomly selected images of each person and validated with the same number of images:

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications.resnet50 import preprocess_input

faces = preprocess_input(np.array(x))
labels = np.array(y)

x_train, x_test, y_train, y_test = train_test_split(
    faces, labels, train_size=0.5, stratify=labels,
    random_state=0)

The next step is to load the saved VGGFace model and freeze the bottleneck layers. If you didn’t download vggface.h5 earlier, download it now and drop it into your notebooks’ Data subdirectory. Then execute the following code:

from tensorflow.keras.models import load_model

base_model = load_model('Data/vggface.h5')
base_model.trainable = False

Now define a network that uses transfer learning with VGGFace to identify faces. The Resizing layer ensures that each image measures exactly 224 × 224 pixels. The Dense layer contains just eight neurons because the training dataset is small and we don’t want the network to fit too tightly to it:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Resizing

model = Sequential()
model.add(Resizing(224, 224))
model.add(base_model)
model.add(Flatten())
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Train the model:

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                 batch_size=2, epochs=10)

Plot the training and validation accuracy:

import seaborn as sns
sns.set()

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training Accuracy')
plt.plot(epochs, val_acc, ':', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.plot()

Hopefully you got something like this:

Now comes the fun part: using an MTCNN to detect the faces in a photo and the trained model to identify those faces. First make sure the MTCNN package is installed in your environment. Then define a pair of helper functions—one that retrieves a face from a specified location in an image (get_face), and another that loads a photo and annotates faces in the photo with names and confidence levels (label_faces):

from mtcnn.mtcnn import MTCNN
from PIL import Image, ImageOps
from tensorflow.keras.preprocessing import image
from matplotlib.patches import Rectangle

def get_face(image, face):
    x1, y1, w, h = face['box']
    
    if w > h:
        x1 = x1 + ((w - h) // 2)
        w = h
    elif h > w:
        y1 = y1 + ((h - w) // 2)
        h = w
    
    x2 = x1 + w
    y2 = y1 + h
    
    return image[y1:y2, x1:x2]

def label_faces(path, model, names, face_threshold=0.9, prediction_threshold=0.9,
                show_outline=True, size=(12, 8)):
    # Load the image and orient it correctly
    pil_image = Image.open(path)
    exif = pil_image.getexif()
    
    # Keep only the orientation tag (0x0112); discard all other EXIF metadata
    for k in list(exif.keys()):
        if k != 0x0112:
            del exif[k]

    pil_image.info["exif"] = exif.tobytes()
    pil_image = ImageOps.exif_transpose(pil_image)
    np_image = np.array(pil_image)

    fig, ax = plt.subplots(figsize=size, subplot_kw={'xticks': [], 'yticks': []})
    ax.imshow(np_image)

    detector = MTCNN()
    faces = detector.detect_faces(np_image)
    faces = [face for face in faces if face['confidence'] > face_threshold]

    for face in faces:
        x, y, w, h = face['box']
        
        # Use the model to identify the face
        face_image = get_face(np_image, face)
        face_image = image.array_to_img(face_image)
        face_image = preprocess_input(np.array(face_image))
        predictions = model.predict(np.expand_dims(face_image, axis=0))
        confidence = np.max(predictions)

        if confidence > prediction_threshold:
            # Optionally draw a box around the face
            if show_outline:
                rect = Rectangle((x, y), w, h, color='red', fill=False, lw=2)
                ax.add_patch(rect)
            
            # Label the face
            index = int(np.argmax(predictions))
            text = f'{names[index]} ({confidence:.1%})'
            ax.text(x + (w / 2), y, text, color='white', backgroundcolor='red',
                    ha='center', va='bottom', fontweight='bold',
                    bbox=dict(color='red'))

Now pass the first sample image in the Samples folder to label_faces:

labels = ['Jeff', 'Lori', 'Abby']
label_faces('Faces/Samples/Sample-1.jpg', model, labels)

The output should look like this, although your percentages might be different:

Try it again, but this time with a different photo:

label_faces('Faces/Samples/Sample-2.jpg', model, labels)

Here’s the output:

Finally, submit a photo containing all three individuals that the model was trained with:

label_faces('Faces/Samples/Sample-3.jpg', model, labels)

Trained with just 12 facial images—four of each person—the model does a credible job of identifying faces in photos. Of course, you could generate a dataset of your own by passing photos of friends and family members through the function in Example 11-1 and training the model with the resulting images.

Handling Unknown Faces: Closed-Set Versus Open-Set Classification

Now for some bad news. A VGGFace facial recognition model is adept at identifying faces it was trained with, but it doesn’t know what to do when it encounters a face it wasn’t trained with. Try it: pass in a photo of yourself. The model will probably identify you as Jeff, Lori, or Abby, and it might do so with a high level of confidence. It literally doesn’t know what it doesn’t know. This is especially true when the dataset is small and the network is given room to overfit.

The reason why has nothing to do with VGGFace. It has everything to do with the fact that a neural network with a softmax output layer is a closed-set classifier, meaning it classifies any sample presented to it for predictions as one of the classes it was trained with. (Remember that softmax ensures that the sum of the probabilities for all classes is 1.0.) The alternative is an open-set classifier (Figure 11-8), which has the ability to say “this sample doesn’t belong to any of the classes I was trained with.”

Figure 11-8. Closed-set versus open-set classification
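
A quick NumPy illustration shows why a softmax output layer can’t abstain. Even when the logits carry almost no information, the probabilities still sum to 1.0 and one of the known classes still wins:

import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Nearly uniform logits: the network is thoroughly unsure, yet softmax still
# spreads 100% of the probability across the known classes
probs = softmax(np.array([0.1, 0.2, 0.1]))
print(probs, probs.sum())  # roughly [0.32 0.36 0.32], sums to 1.0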

There is not a one-size-fits-all solution for building open-set classifiers in the deep-learning community today. A 2016 paper titled “Towards Open Set Deep Networks” proposed one solution in the form of openmax output layers, which replace softmax output layers and “estimate the probability of an input being from an unknown class.” Essentially, if the network is trained with 10 classes, the openmax output layer adds an 11th output representing the unknown class. It works by taking the activations from the final classification layer and adjusting them using a Weibull distribution rather than normalizing the probabilities as softmax does.

Another potential solution was put forth in a 2018 paper titled “Reducing Network Agnostophobia” from researchers at the University of Colorado. It proposed replacing cross-entropy loss with a new loss function called entropic open-set loss that drives softmax scores for unknown classes toward a uniform probability distribution. Using this technique, you could more reliably detect samples belonging to an unknown class using probability thresholds paired with conventional softmax output layers. For a great summary of the problems posed by open-set classification in deep learning and an overview of openmax and entropic open-set loss, see “Does a Neural Network Know When It Doesn’t Know?” by Tivadar Danka.

Yet another solution is to use ArcFace to verify each face the model identifies by comparing an embedding generated from that face to a reference embedding for the same person. You could reject the model’s conclusion if cosine similarity falls below a predetermined threshold.
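
Here’s a sketch of that idea. The verify_prediction function and the 0.5 threshold are purely illustrative (not part of the arcface package), and the images are assumed to be in whatever form calc_emb accepts:

from arcface import ArcFace
from sklearn.metrics.pairwise import cosine_similarity

af = ArcFace.ArcFace()

def verify_prediction(candidate_image, reference_image, threshold=0.5):
    # Compare the face the model just identified against a known reference
    # photo of the predicted person; accept the prediction only if the
    # embeddings are sufficiently similar
    candidate_emb = af.calc_emb(candidate_image)
    reference_emb = af.calc_emb(reference_image)
    similarity = cosine_similarity([candidate_emb], [reference_emb])[0][0]
    return similarity >= threshold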

A more naive approach is to prevent the network from learning the training data too well in hopes that unknown classes will yield lower softmax probabilities. That’s why I included just eight neurons in the classification layer in the previous example. (You could go even further by introducing a dropout layer.) It works up to a point, but it isn’t perfect. The label_faces function has a default prediction threshold of 0.9, meaning it labels a face only if the model classifies it with at least 90% confidence. You could set prediction_threshold to 0.99 to rule out more unknown faces, but at the cost of failing to identify more known faces. Tuning in this manner to strike the right balance between recognizing known faces while ignoring unknowns is an inevitable part of readying a facial recognition model for production.
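
For example, with the notebook from the previous section still loaded, raising the bar is a one-line change:

label_faces('Faces/Samples/Sample-1.jpg', model, labels, prediction_threshold=0.99)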

Summary

Building an end-to-end facial recognition system requires a means for detecting faces in photos as well as a means for classifying (identifying) those faces. One way to detect faces is the Viola-Jones algorithm, for which the OpenCV library provides a convenient implementation. An alternative that relies on deep learning is a multitask cascaded convolutional neural network, or MTCNN. The Python package named MTCNN contains a ready-to-use MTCNN implementation. Viola-Jones is faster and more suitable for real-time applications (for example, identifying faces in a live webcam feed), but MTCNNs are generally more accurate and incur fewer false positives.

Deep learning can be applied to the task of facial recognition by employing convolutional neural networks. Transfer learning with a pretrained CNN such as ResNet50 can identify faces with a relatively high degree of accuracy, but transfer learning with a CNN that was trained with millions of facial images delivers unparalleled accuracy. The primary reason is that the CNN’s bottleneck layers are optimized for extracting features from facial images.

With a highly optimized set of weights, a neural network can do almost anything. Generic weights will suffice when nothing better is available, but given a task-specific set of weights to start with, facial recognition via transfer learning can achieve human-like accuracy.
