Bringing it all together

If you haven't already done so, pull down the latest code from the accompanying repository: https://github.com/packtpublishing/machine-learning-with-core-ml. Once downloaded, navigate to the directory Chapter4/Start/FacialEmotionDetection and open the project FacialEmotionDetection.xcodeproj. Once loaded, you will hopefully recognize the project structure, as it closely resembles our first example. For this reason, we will concentrate only on the components that are unique to this project; I suggest reviewing previous chapters for clarification on anything that is unclear.

Let's start by reviewing our project and its main components; your project should look similar to what is shown in the following screenshot: 

As shown in the preceding screenshot, the project looks a lot like our previous projects. I will assume that the classes VideoCapture, CapturePreviewView, and UIColorExtension look familiar and that you are comfortable with their contents. CIImageExtension is what we implemented in the previous section, so we won't be covering it here. The EmotionVisualizerView class is a custom view that visualizes the outputs from our model. And, finally, we have the bundled ExpressionRecognitionModelRaw.mlmodel. Our main focus in this section will be on wrapping the preprocessing functionality we implemented in the previous section and hooking it up within the ViewController class. Before we start, let's quickly review what we are doing and consider some real-life applications for expression/emotion recognition.

In this section, we are building a simple visualization of the detected faces; we will pass our camera feed to our preprocessor, hand the result over to our model to perform inference, and finally feed the results to our EmotionVisualizerView to render the output as an overlay on the screen. It's a simple example, but it implements all the mechanics you would need to embed this functionality in your own creations. So, what are some of its practical uses?

In a broad sense, there are three main uses: analytical, reactive, and anticipatory. Analytical applications are the ones you are most likely to encounter. These applications typically observe the user's reactions to the content being presented; for example, you might measure arousal from the content the user is viewing, which is then used to drive future decisions.

While analytical experiences remain mostly passive, reactive applications proactively adjust the experience based on live feedback. One example that illustrates this well is DragonBot, a research project from the Social Robotics Group at MIT exploring intelligent tutoring systems.

DragonBot uses emotional awareness to adapt to the student; for example, one of its applications is a reading game that adapts the words based on the recognized emotion. That is, the system can adjust the difficulty of the task (words in this case) based on the user's ability, determined by the recognized emotion.

Finally, we have anticipatory applications. Anticipatory applications are semi-autonomous; they proactively try to infer the user's context and predict a likely action, adjusting their state or triggering an action accordingly. A fictional example could be an email client that delays sending messages the user composed while angry.

Hopefully, this highlights some of the opportunities, but for now, let's return to our example and start building out the class that will be responsible for handling the preprocessing. Start off by creating a new Swift file called ImageProcessor.swift and, within it, add the following code:

import UIKit
import Vision

protocol ImageProcessorDelegate : class {
    func onImageProcessorCompleted(status: Int, faces: [MLMultiArray]?)
}

class ImageProcessor {

    weak var delegate: ImageProcessorDelegate?

    init() {

    }

    public func getFaces(pixelBuffer: CVPixelBuffer) {
        DispatchQueue.global(qos: .background).async {
            // Face detection and preprocessing will be added here
        }
    }
}

Here, we have defined the protocol for the delegate to handle the result once the preprocessing has completed, as well as the main class that exposes the method for initiating the task. Most of the code we will be using is what we have written in the playground; start off by declaring the request and request handler at the class level: 

// Vision request that detects face bounding boxes
let faceDetection = VNDetectFaceRectanglesRequest()

// Sequence request handler that will execute the preceding request on each frame
let faceDetectionRequest = VNSequenceRequestHandler()

Let's now make use of the request by having our handler execute it within the body of the getFaces method's background queue dispatch block:

let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
let width = ciImage.extent.width
let height = ciImage.extent.height

// Perform face detection
try? self.faceDetectionRequest.perform(
    [self.faceDetection],
    on: ciImage)

var facesData = [MLMultiArray]()

if let faceDetectionResults = self.faceDetection.results as? [VNFaceObservation] {
    for face in faceDetectionResults {

    }
}

This should all look familiar to you. We pass our request and image to the request handler, then instantiate an array to hold the data for each face detected in the image. Finally, we obtain the observations and iterate through each of them. It's within this loop that we will perform the preprocessing and populate our facesData array, as we did in the playground. Add the following code within the loop:

let bbox = face.boundingBox

let imageSize = CGSize(width: width,
                       height: height)

let w = bbox.width * imageSize.width
let h = bbox.height * imageSize.height
let x = bbox.origin.x * imageSize.width
let y = bbox.origin.y * imageSize.height

let paddingTop = h * 0.2
let paddingBottom = h * 0.55
let paddingWidth = w * 0.15

let faceRect = CGRect(x: max(x - paddingWidth, 0),
                      y: max(0, y - paddingTop),
                      width: min(w + (paddingWidth * 2), imageSize.width),
                      height: min(h + paddingBottom, imageSize.height))

In the preceding block, we obtain the detected face's bounding box and create the cropping bounds, including padding. Our next task is to crop the face from the image, resize it to our target size of 48 x 48, extract the raw pixel data while normalizing the values, and finally populate an MLMultiArray. This is then added to our facesData array to be returned to the delegate; appending the following code (still within the loop) does just that:

if let pixelData = ciImage.crop(rect: faceRect)?
    .resize(size: CGSize(width: 48, height: 48))
    .getGrayscalePixelData()?.map({ (pixel) -> Double in
        return Double(pixel) / 255.0
    }) {
    if let array = try? MLMultiArray(shape: [1, 48, 48], dataType: .double) {
        for (index, element) in pixelData.enumerated() {
            array[index] = NSNumber(value: element)
        }
        facesData.append(array)
    }
}

Nothing new has been introduced here apart from chaining the methods to make it more legible (at least for me). Our final task is to notify the delegate once we have finished; add the following just outside the observations loop block: 

DispatchQueue.main.async {
    self.delegate?.onImageProcessorCompleted(status: 1, faces: facesData)
}
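
For reference, here is how the preceding fragments fit together inside getFaces once everything is in place; nothing new is introduced here, it is simply the snippets above assembled into one listing so you can verify the placement of each block:

public func getFaces(pixelBuffer: CVPixelBuffer) {
    DispatchQueue.global(qos: .background).async {
        let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
        let width = ciImage.extent.width
        let height = ciImage.extent.height

        // Perform face detection
        try? self.faceDetectionRequest.perform(
            [self.faceDetection],
            on: ciImage)

        var facesData = [MLMultiArray]()

        if let faceDetectionResults = self.faceDetection.results as? [VNFaceObservation] {
            for face in faceDetectionResults {
                // Map the normalized bounding box into image coordinates and pad it
                let bbox = face.boundingBox
                let imageSize = CGSize(width: width, height: height)

                let w = bbox.width * imageSize.width
                let h = bbox.height * imageSize.height
                let x = bbox.origin.x * imageSize.width
                let y = bbox.origin.y * imageSize.height

                let paddingTop = h * 0.2
                let paddingBottom = h * 0.55
                let paddingWidth = w * 0.15

                let faceRect = CGRect(x: max(x - paddingWidth, 0),
                                      y: max(0, y - paddingTop),
                                      width: min(w + (paddingWidth * 2), imageSize.width),
                                      height: min(h + paddingBottom, imageSize.height))

                // Crop, resize to 48 x 48, grayscale, normalize, and copy into an MLMultiArray
                if let pixelData = ciImage.crop(rect: faceRect)?
                    .resize(size: CGSize(width: 48, height: 48))
                    .getGrayscalePixelData()?.map({ (pixel) -> Double in
                        return Double(pixel) / 255.0
                    }) {
                    if let array = try? MLMultiArray(shape: [1, 48, 48], dataType: .double) {
                        for (index, element) in pixelData.enumerated() {
                            array[index] = NSNumber(value: element)
                        }
                        facesData.append(array)
                    }
                }
            }
        }

        // Hand the results back to the delegate on the main thread
        DispatchQueue.main.async {
            self.delegate?.onImageProcessorCompleted(status: 1, faces: facesData)
        }
    }
}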

Now, with that complete, our ImageProcessor is ready to be used, so let's hook everything up. Jump into the ViewController class, where we will wire up our ImageProcessor: we will pass its results to our model and, finally, pass the output from our model to the EmotionVisualizerView to present the results to the user. Let's start by reviewing what currently exists:

import UIKit
import Vision
import AVFoundation

class ViewController: UIViewController {

    @IBOutlet weak var previewView: CapturePreviewView!

    @IBOutlet weak var viewVisualizer: EmotionVisualizerView!

    @IBOutlet weak var statusLabel: UILabel!

    let videoCapture: VideoCapture = VideoCapture()

    override func viewDidLoad() {
        super.viewDidLoad()

        videoCapture.delegate = self

        videoCapture.asyncInit { (success) in
            if success {

                (self.previewView.layer as! AVCaptureVideoPreviewLayer).session = self.videoCapture.captureSession

                (self.previewView.layer as! AVCaptureVideoPreviewLayer).videoGravity = AVLayerVideoGravity.resizeAspectFill

                self.videoCapture.startCapturing()
            } else {
                fatalError("Failed to init VideoCapture")
            }
        }
    }
}

extension ViewController : VideoCaptureDelegate{

    func onFrameCaptured(
        videoCapture: VideoCapture,
        pixelBuffer: CVPixelBuffer?,
        timestamp: CMTime){
        // Unwrap the parameter pixelBuffer; exit early if nil
        guard let pixelBuffer = pixelBuffer else {
            print("WARNING: onFrameCaptured; null pixelBuffer")
            return
        }
    }
}

Our ViewController has references to its Interface Builder counterparts, most notably previewView and viewVisualizer. The former will render the captured camera frames, while viewVisualizer will be responsible for visualizing the output of our model. We then have videoCapture, a utility class that encapsulates setting up, capturing from, and tearing down the camera. We get access to the captured frames by assigning ourselves as its delegate and implementing the appropriate protocol, as we have done in the extension at the bottom.
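
As an aside, the force cast of previewView.layer to AVCaptureVideoPreviewLayer in viewDidLoad only works because the bundled CapturePreviewView is backed by that layer type. The bundled class presumably looks something like the following sketch (an educated guess, not the book's listing):

import UIKit
import AVFoundation

class CapturePreviewView: UIView {

    // Back this view with an AVCaptureVideoPreviewLayer so that assigning a
    // capture session to its layer renders the camera feed
    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }
}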

Let's begin by declaring the model and ImageProcessor variables required for our task; add the following at the class level of your ViewController:

let imageProcessor : ImageProcessor = ImageProcessor()

let model = ExpressionRecognitionModelRaw()

Next, we need to assign ourselves as the delegate of ImageProcessor in order to receive the results once the processing has completed. Add the following statement to the bottom of your viewDidLoad method:

imageProcessor.delegate = self

We will return shortly to implement the required protocol; for now, let's make use of our ImageProcessor by passing in the frames we receive from the camera. Within the onFrameCaptured method, add the following statement, which will pass each frame to our ImageProcessor instance; it is the final statement in the following code block:

extension ViewController : VideoCaptureDelegate{

    func onFrameCaptured(
        videoCapture: VideoCapture,
        pixelBuffer: CVPixelBuffer?,
        timestamp: CMTime){

        guard let pixelBuffer = pixelBuffer else {
            print("WARNING: onFrameCaptured; null pixelBuffer")
            return
        }

        self.imageProcessor.getFaces(
            pixelBuffer: pixelBuffer)
    }
}

Our final task is to implement the ImageProcessorDelegate protocol; its method will be called once our ImageProcessor has identified and extracted each face from a given camera frame and performed the preprocessing necessary for our model. When it is called, we will pass the data to our model to perform inference and finally pass the results on to our EmotionVisualizerView. Because nothing new is being introduced here, let's go ahead and add the block in its entirety:

extension ViewController : ImageProcessorDelegate{

    func onImageProcessorCompleted(
        status: Int,
        faces: [MLMultiArray]?){
        guard let faces = faces else { return }

        self.statusLabel.isHidden = faces.count > 0

        guard faces.count > 0 else {
            return
        }

        DispatchQueue.global(qos: .background).async {
            for faceData in faces {

                let prediction = try? self.model
                    .prediction(image: faceData)

                if let classPredictions =
                    prediction?.classLabelProbs {
                    DispatchQueue.main.sync {
                        self.viewVisualizer.update(
                            labelConference: classPredictions
                        )
                    }
                }
            }
        }
    }
}

The only notable thing to point out is that our model needs to perform inference on a background thread, while our ImageProcessor calls its delegate on the main thread. For this reason, we dispatch inference to a background queue and then return to the main thread to deliver the results, which is necessary whenever you want to update the user interface.

With that complete, we are now in a good place to build and deploy to test; if all goes well, you should see something like the following:

Let's wrap up the chapter by reviewing what we have covered and point out some interesting areas to explore before moving on to the next chapter.

In this chapter, we have taken a naive approach to processing the captured frames; in a commercial application, you would want to optimize this process, for example by using object tracking from the Vision framework in place of running face detection on every frame, as tracking is computationally cheaper.
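
As a rough illustration of that idea, the following sketch (not part of the book's project; FaceTracker and its methods are hypothetical names) shows how Vision's VNTrackObjectRequest could be used to follow a face that has already been detected, so the full detector only needs to run occasionally:

import UIKit
import Vision

class FaceTracker {

    // The sequence handler must be reused across frames for tracking to work
    private let sequenceHandler = VNSequenceRequestHandler()
    private var trackingRequest: VNTrackObjectRequest?

    // Call once with an observation returned by VNDetectFaceRectanglesRequest
    func startTracking(face: VNFaceObservation) {
        trackingRequest = VNTrackObjectRequest(detectedObjectObservation: face)
    }

    // Call for each subsequent frame; returns the tracked (normalized)
    // bounding box, or nil if tracking hasn't started or has been lost
    func track(in pixelBuffer: CVPixelBuffer) -> CGRect? {
        guard let request = trackingRequest else { return nil }

        try? sequenceHandler.perform([request], on: pixelBuffer)

        guard let observation = request.results?.first as? VNDetectedObjectObservation,
            observation.confidence > 0.3 else {
            // Tracking lost; the caller should fall back to running face detection again
            trackingRequest = nil
            return nil
        }

        // Feed the latest observation back in so tracking continues on the next frame
        request.inputObservation = observation
        return observation.boundingBox
    }
}

With something like this in place, you would only run VNDetectFaceRectanglesRequest when track(in:) returns nil (or every N frames to pick up new faces), and reuse the tracked bounding box to drive the cropping and preprocessing in between.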