Input data and preprocessing 

In this section, we will implement the preprocessing functionality required to transform images into something the model is expecting. We will build up this functionality in a playground project before migrating it across to our project in the next section.

If you haven't done so already, pull down the latest code from the accompanying repository: https://github.com/packtpublishing/machine-learning-with-core-ml. Once downloaded, navigate to the directory Chapter4/Start/ and open the Playground project ExploringExpressionRecognition.playground. Once loaded, you will see the playground for this chapter, as shown in the following screenshot:

Before starting, to avoid looking at images of me, please replace the test images with either personal photos of your own or royalty free images from the internet, ideally a set expressing a range of emotions.

Along with the test images, this playground includes the compiled Core ML model we introduced earlier, with its generated set of wrappers for the inputs, outputs, and the model itself. Also included are some extensions for UIImage, UIImageView, CGImagePropertyOrientation, and an empty CIImage extension, which we will return to later in the chapter. The others provide utility functions to help us visualize the images as we work through this playground.

Before jumping into the code, let's quickly discuss the approach we will take in order to determine what we actually need to implement.

Up to this point, our process of performing machine learning has been fairly straightforward; apart from some formatting of the input data, our models haven't required much work. That is not the case here. A typical photo of someone doesn't normally contain just a face, nor is the face neatly aligned to the frame, unless you're processing passport photos. When developing machine learning applications, you have two broad paths.

The first, which is becoming increasingly popular, is to use an end-to-end machine learning model capable of being fed the raw input and producing adequate results. One particular field that has had great success with end-to-end models is speech recognition. Prior to end-to-end deep learning, speech recognition systems were made up of many smaller modules, each one typically manually engineered and focused on extracting specific pieces of data to feed into the next module. Modern speech recognition systems use end-to-end models that take the raw input and output the result. Both of the described approaches can be seen in the following diagram: 

Obviously, this approach is not constrained to speech recognition, and we have seen it applied to image recognition tasks, too, along with many others. But there are two things that make this particular case different. The first is that we can simplify the problem by first extracting the face; this means our model has fewer features to learn, giving us a smaller, more specialized model that we can tune. The second, which is no doubt obvious, is that our training data consisted of only faces and not natural images. So, we have no choice but to run our data through two models: the first to extract faces and the second to perform expression recognition on the extracted faces, as shown in this diagram: 

Luckily for us, Apple has mostly taken care of our first task of detecting faces through the Vision framework it released with iOS 11. The Vision framework provides performant image analysis and computer vision tools, exposing them through a simple API. This allows for face detection, feature detection and tracking, and classification of scenes in images and video. The latter (expression recognition) is something we will take care of using the Core ML model introduced earlier.

Prior to the introduction of the Vision framework, face detection would typically be performed using Core Image's CIDetector class; going back further, you had to use something like OpenCV. You can learn more about Core Image here: https://developer.apple.com/library/content/documentation/GraphicsImaging/Conceptual/CoreImaging/ci_detect_faces/ci_detect_faces.html.
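
For reference, a minimal sketch of that older Core Image approach might look like the following; we won't be using it in this chapter, and the helper function name is just for illustration:

import CoreImage

// Hypothetical helper illustrating the pre-Vision approach: detect face
// bounds in a CIImage using CIDetector.
func detectFaceBounds(in image: CIImage) -> [CGRect] {
    let detector = CIDetector(
        ofType: CIDetectorTypeFace,
        context: nil,
        options: [CIDetectorAccuracy: CIDetectorAccuracyHigh])
    let features = detector?.features(in: image) ?? []
    // CIDetector reports bounds in the image's own (pixel) coordinate space.
    return features.compactMap { ($0 as? CIFaceFeature)?.bounds }
}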

Now that we have got a bird's-eye view of the work that needs to be done, let's turn our attention to the editor and start putting all of this together. Start by loading the images; add the following snippet to your playground:

var images = [UIImage]()
for i in 1...3 {
    guard let image = UIImage(named: "images/joshua_newnham_\(i).jpg") else {
        fatalError("Failed to load image")
    }

    images.append(image)
}

let faceIdx = 0
let imageView = UIImageView(image: images[faceIdx])
imageView.contentMode = .scaleAspectFit

In the preceding snippet, we are simply loading each of the images we have included in the resources' Images folder and adding them to an array we can conveniently access throughout the playground. Once all the images are loaded, we set the constant faceIdx, which ensures we access the same image throughout our experiments. Finally, we create a UIImageView so that we can easily preview the image. Once the playground has finished running, click on the eye icon in the right-hand panel to preview the loaded image, as shown in the following screenshot:

Next, we will take advantage of the functionality available in the Vision framework to detect faces. The typical flow when working with the Vision framework is defining a request, which determines what analysis you want to perform, and defining a handler, which is responsible for executing the request and providing the means of obtaining the results (either through a delegate or by querying the request explicitly). The result of the analysis is a collection of observations that you need to cast to the appropriate observation type; concrete examples of each of these can be seen here: 

As illustrated in the preceding diagram, the request determines what type of image analysis will be performed; the handler, given one or more requests and an image, performs the actual analysis and generates the results (also known as observations), which are accessible via a property or a delegate, if one has been assigned. The type of observation depends on the request performed. It's worth highlighting that the Vision framework is tightly integrated with Core ML and provides another layer of abstraction and uniformity between you, your data, and the underlying models. For example, using a classification Core ML model through Vision would return an observation of type VNClassificationObservation. This layer of abstraction not only simplifies things but also provides a consistent way of working with machine learning models.
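
To make this concrete, here is a minimal sketch of wrapping a Core ML classifier in a Vision request; SomeClassifier is a hypothetical generated model class, not the model we use in this chapter:

import Vision
import CoreML

// Sketch only: SomeClassifier is a placeholder for a generated Core ML
// classification model class.
func classify(_ cgImage: CGImage) {
    guard let visionModel = try? VNCoreMLModel(for: SomeClassifier().model) else { return }

    let request = VNCoreMLRequest(model: visionModel) { request, error in
        // For classification models, the results are VNClassificationObservation instances.
        guard let observations = request.results as? [VNClassificationObservation],
            let best = observations.first else { return }
        print("\(best.identifier) \(best.confidence)")
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}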

In the previous figure, we showed a request handler specifically for static images. Vision also provides a specialized request handler for handling sequences of images, which is more appropriate when dealing with requests such as tracking. The following diagram illustrates some concrete examples of the types of requests and observations applicable to this use case:

So, when do you use VNImageRequestHandler and VNSequenceRequestHandler? Though the names provide clues as to when one should be used over the other, it's worth outlining some differences. 

The image request handler is for interactive exploration of an image; it holds a reference to the image for its life cycle and allows optimizations of various request types. The sequence request handler is more appropriate for performing tasks such as tracking and does not optimize for multiple requests on an image.
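
To make the distinction concrete, here is a rough sketch of the two handler styles side by side; the CGImage and face observation parameters are placeholders, and the tracking request is not something we'll need in this chapter:

import Vision

// Sketch: one-off analysis with VNImageRequestHandler versus tracking
// across frames with a reused VNSequenceRequestHandler.
func compareHandlers(stillImage: CGImage,
                     nextFrame: CGImage,
                     trackedFace: VNFaceObservation) {
    // Interactive, one-off analysis of a single image.
    let stillHandler = VNImageRequestHandler(cgImage: stillImage, options: [:])
    try? stillHandler.perform([VNDetectFaceRectanglesRequest()])

    // Tracking an observation across a sequence of frames; the same
    // handler instance is reused as each new frame arrives.
    let sequenceHandler = VNSequenceRequestHandler()
    let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: trackedFace)
    try? sequenceHandler.perform([trackingRequest], on: nextFrame)
}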

Let's see how this all looks in code; add the following snippet to your playground:

let faceDetectionRequest = VNDetectFaceRectanglesRequest()
let faceDetectionRequestHandler = VNSequenceRequestHandler()

Here, we are simply creating the request and handler; as discussed previously, the request encapsulates the type of image analysis, while the handler is responsible for executing the request. Next, we will get faceDetectionRequestHandler to run faceDetectionRequest; add the following code:

try? faceDetectionRequestHandler.perform(
    [faceDetectionRequest],
    on: images[faceIdx].cgImage!,
    orientation: CGImagePropertyOrientation(images[faceIdx].imageOrientation))

The perform function of the handler can throw an error if it fails; for this reason, we prefix the call with try?, which simply discards any error. If you need to know why a request failed, you can instead wrap the call in a do-catch block and inspect the thrown error, as sketched below. We pass the handler a list of requests (in this case, only our faceDetectionRequest), the image we want to perform the analysis on, and, finally, the orientation of the image, which can be used by the request during analysis.
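
Here is what the do-catch variant of the same call might look like:

do {
    try faceDetectionRequestHandler.perform(
        [faceDetectionRequest],
        on: images[faceIdx].cgImage!,
        orientation: CGImagePropertyOrientation(images[faceIdx].imageOrientation))
} catch {
    // The thrown error describes why the analysis failed.
    print("Face detection failed: \(error)")
}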

Once the analysis is done, we can inspect the observation obtained through the results property of the request itself, as shown in the following code: 

if let faceDetectionResults = faceDetectionRequest.results as? [VNFaceObservation] {
    for face in faceDetectionResults {
        // ADD THE NEXT SNIPPET OF CODE HERE
    }
}

The type of observation is dependent on the analysis; in this case, we're expecting a VNFaceObservation. Hence, we cast it to the appropriate type and then iterate through all the observations.

Next, we will take each recognized face and extract the bounding box. Then, we'll proceed to draw it in the image (using an extension method of UIImageView found within the UIImageViewExtension.swift file). Add the following block within the for loop shown in the preceding code:

if let currentImage = imageView.image {
    let bbox = face.boundingBox

    let imageSize = CGSize(
        width: currentImage.size.width,
        height: currentImage.size.height)

    let w = bbox.width * imageSize.width
    let h = bbox.height * imageSize.height
    let x = bbox.origin.x * imageSize.width
    let y = bbox.origin.y * imageSize.height

    let faceRect = CGRect(
        x: x,
        y: y,
        width: w,
        height: h)

    let invertedY = imageSize.height - (faceRect.origin.y + faceRect.height)
    let invertedFaceRect = CGRect(
        x: x,
        y: invertedY,
        width: w,
        height: h)

    imageView.drawRect(rect: invertedFaceRect)
}

We obtain the bounding box of each face via its boundingBox property; the result is normalized, so we then need to scale it based on the dimensions of the image. For example, you obtain the width by multiplying the normalized width by the width of the image: bbox.width * imageSize.width.

Next, we invert the axis, as Quartz 2D's coordinate system is inverted with respect to UIKit's coordinate system, as shown in this diagram:

 

We invert our coordinates by subtracting the bounding box's origin and height from the height of the image, and then pass this to our UIImageView to render the rectangle. Click on the eye icon in the right-hand panel, in line with the statement imageView.drawRect(rect: invertedFaceRect), to preview the results; if successful, you should see something like the following:

An alternative to manually inverting the face rectangle would be to use a CGAffineTransform, such as:

var transform = CGAffineTransform(scaleX: 1, y: -1)
transform = transform.translatedBy(x: 0, y: -imageSize.height)
let invertedFaceRect = faceRect.applying(transform)

This approach requires less code and therefore leaves fewer chances for error, so it is the recommended approach; the longer approach was taken previously to help illuminate the details.

Let's now take a quick detour and experiment with another type of request; this time, we will analyze our image using VNDetectFaceLandmarksRequest. It is similar to VNDetectFaceRectanglesRequest in that this request will detect faces and expose their bounding boxes; but, unlike VNDetectFaceRectanglesRequest, VNDetectFaceLandmarksRequest also provides detected facial landmarks. A landmark is a prominent facial feature such as your eyes, nose, eyebrow, face contour, or any other feature that can be detected and describes a significant attribute of a face. Each detected facial landmark consists of a set of points that describe its contour (outline). Let's see how this looks; add a new request as shown in the following code:

imageView.image = images[faceIdx]

let faceLandmarksRequest = VNDetectFaceLandmarksRequest()

try? faceDetectionRequestHandler.perform(
    [faceLandmarksRequest],
    on: images[faceIdx].cgImage!,
    orientation: CGImagePropertyOrientation(images[faceIdx].imageOrientation))

The preceding snippet should look familiar to you; it's almost the same as what we did previously, but this time we replace VNDetectFaceRectanglesRequest with VNDetectFaceLandmarksRequest. We have also refreshed the image in our image view with the statement imageView.image = images[faceIdx]. As we did before, let's iterate through each of the detected observations and extract some of the common landmarks. Start off by creating the outer loop, as shown in this code:

if let faceLandmarkDetectionResults = faceLandmarksRequest.results as? [VNFaceObservation] {
    for face in faceLandmarkDetectionResults {
        if let currentImage = imageView.image {
            let bbox = face.boundingBox

            let imageSize = CGSize(width: currentImage.size.width,
                                   height: currentImage.size.height)

            let w = bbox.width * imageSize.width
            let h = bbox.height * imageSize.height
            let x = bbox.origin.x * imageSize.width
            let y = bbox.origin.y * imageSize.height

            let faceRect = CGRect(x: x,
                                  y: y,
                                  width: w,
                                  height: h)
        }
    }
}

Up to this point, the code will look familiar; next, we will look at each of the landmarks. But first, let's create a function to handle the transformation of our points from the Quartz 2D coordinate system to UIKit's coordinate system. Add the following function within the same block as our faceRect declaration:

func getTransformedPoints(
    landmark: VNFaceLandmarkRegion2D,
    faceRect: CGRect,
    imageSize: CGSize) -> [CGPoint] {

    return landmark.normalizedPoints.map({ (np) -> CGPoint in
        return CGPoint(
            x: faceRect.origin.x + np.x * faceRect.size.width,
            y: imageSize.height - (np.y * faceRect.size.height + faceRect.origin.y))
    })
}

As mentioned before, each landmark consists of a set of points that describe the contour of that particular landmark and, like our previous feature, the points are normalized between 0.0 and 1.0 (relative to the face's bounding box). Therefore, we need to scale them based on the associated face rectangle, which is exactly what we did in the preceding example: for each point, we scale and translate it into the appropriate coordinate space, and then return the mapped array to the caller.

Let's now define some constants that we will use to visualize each landmark; add the following two constants just after the getTransformedPoints function we just implemented:

let landmarkWidth : CGFloat = 1.5
let landmarkColor : UIColor = UIColor.red

We will now step through a few of the landmarks, showing how we extract the features and occasionally showing the result. Let's start with the left eye and right eye; continue adding the following code just after the constants you just defined:

if let landmarks = face.landmarks?.leftEye {
    let transformedPoints = getTransformedPoints(
        landmark: landmarks,
        faceRect: faceRect,
        imageSize: imageSize)

    imageView.drawPath(pathPoints: transformedPoints,
                       closePath: true,
                       color: landmarkColor,
                       lineWidth: landmarkWidth,
                       vFlip: false)

    var center = transformedPoints
        .reduce(CGPoint.zero, { (result, point) -> CGPoint in
            return CGPoint(
                x: result.x + point.x,
                y: result.y + point.y)
        })

    center.x /= CGFloat(transformedPoints.count)
    center.y /= CGFloat(transformedPoints.count)
    imageView.drawCircle(center: center,
                         radius: 2,
                         color: landmarkColor,
                         lineWidth: landmarkWidth,
                         vFlip: false)
}

if let landmarks = face.landmarks?.rightEye {
    let transformedPoints = getTransformedPoints(
        landmark: landmarks,
        faceRect: faceRect,
        imageSize: imageSize)

    imageView.drawPath(pathPoints: transformedPoints,
                       closePath: true,
                       color: landmarkColor,
                       lineWidth: landmarkWidth,
                       vFlip: false)

    var center = transformedPoints.reduce(CGPoint.zero, { (result, point) -> CGPoint in
        return CGPoint(
            x: result.x + point.x,
            y: result.y + point.y)
    })

    center.x /= CGFloat(transformedPoints.count)
    center.y /= CGFloat(transformedPoints.count)
    imageView.drawCircle(center: center,
                         radius: 2,
                         color: landmarkColor,
                         lineWidth: landmarkWidth,
                         vFlip: false)
}

Hopefully, as is apparent from the preceding code snippet, we get a reference to each of the landmarks by interrogating the face observation's landmarks property. In the preceding code, we get references to the leftEye and rightEye landmarks and, for each, we first render the contour of the eye, as shown in this screenshot: 

Next, we iterate through each of the points to find the center of the eye and render a circle using the following code:

var center = transformedPoints
    .reduce(CGPoint.zero, { (result, point) -> CGPoint in
        return CGPoint(
            x: result.x + point.x,
            y: result.y + point.y)
    })

center.x /= CGFloat(transformedPoints.count)
center.y /= CGFloat(transformedPoints.count)
imageView.drawCircle(center: center,
                     radius: 2,
                     color: landmarkColor,
                     lineWidth: landmarkWidth,
                     vFlip: false)

This is slightly unnecessary as one of the landmarks available is leftPupil, but I wanted to use this instance to highlight the importance of inspecting the available landmarks. The next half of the block is concerned with performing the same tasks for the right eye; by the end of it, you should have an image resembling something like the following, with both eyes drawn:

 

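For comparison, a minimal sketch that uses the leftPupil landmark directly, added within the same block and reusing the playground's drawCircle extension, might look like this:

if let leftPupil = face.landmarks?.leftPupil {
    let pupilPoints = getTransformedPoints(
        landmark: leftPupil,
        faceRect: faceRect,
        imageSize: imageSize)

    // The pupil is typically described by a single point (its centre),
    // but we guard against an empty region all the same.
    if let center = pupilPoints.first {
        imageView.drawCircle(center: center,
                             radius: 2,
                             color: landmarkColor,
                             lineWidth: landmarkWidth,
                             vFlip: false)
    }
}
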
Let's continue highlighting some of the landmarks available. Next, we will inspect the face contour and nose; add the following code:

if let landmarks = face.landmarks?.faceContour {
    let transformedPoints = getTransformedPoints(
        landmark: landmarks,
        faceRect: faceRect,
        imageSize: imageSize)

    imageView.drawPath(pathPoints: transformedPoints,
                       closePath: false,
                       color: landmarkColor,
                       lineWidth: landmarkWidth,
                       vFlip: false)
}

if let landmarks = face.landmarks?.nose {
    let transformedPoints = getTransformedPoints(
        landmark: landmarks,
        faceRect: faceRect,
        imageSize: imageSize)

    imageView.drawPath(pathPoints: transformedPoints,
                       closePath: false,
                       color: landmarkColor,
                       lineWidth: landmarkWidth,
                       vFlip: false)
}

if let landmarks = face.landmarks?.noseCrest {
    let transformedPoints = getTransformedPoints(
        landmark: landmarks,
        faceRect: faceRect,
        imageSize: imageSize)

    imageView.drawPath(pathPoints: transformedPoints,
                       closePath: false,
                       color: landmarkColor,
                       lineWidth: landmarkWidth,
                       vFlip: false)
}

The pattern should be obvious by now; here, we draw the faceContour, nose, and noseCrest landmarks. With that done, your image should look something like the following:

As an exercise, draw the lips (and any other facial landmark) using the landmarks innerLips and outerLips. With that implemented, you should end up with something like this:

Before returning to our task of classifying facial expressions, let's quickly finish our detour with some practical uses for landmark detection (other than drawing or placing glasses on a face).

As highlighted earlier, our training set consists of images that are predominantly forward-facing and orientated fairly straight. With this in mind, one practical use of knowing the position of each eye is being able to qualify an image; that is, is the face sufficiently in view and orientated correctly? Another use would be to slightly realign the face so that it fits in better with your training set (keeping in mind that our images are reduced to 48 x 48, so some detriment to quality can be ignored).

For now, I'll leave the implementation of these to you, but by using the angle between the two eyes, you can apply an affine transformation to correct the orientation, that is, rotate the image; a rough sketch of the idea follows.
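
Here is a minimal sketch of that idea, assuming you have already computed the centre of each eye in the image's coordinate space, as we did earlier; the function name and parameters are illustrative only:

import UIKit

// Sketch: estimate the roll angle from the eye centres and rotate the
// image so that the eyes sit on a horizontal line. The eye centres are
// assumed to be in the same coordinate space as the CIImage.
func alignFace(image: CIImage,
               leftEyeCenter: CGPoint,
               rightEyeCenter: CGPoint) -> CIImage {
    let angle = atan2(rightEyeCenter.y - leftEyeCenter.y,
                      rightEyeCenter.x - leftEyeCenter.x)

    // Rotate by the negative of the angle to level the eyes; the
    // magnitude of the angle could also be used to reject (qualify)
    // images where the face is tilted too far.
    return image.transformed(by: CGAffineTransform(rotationAngle: -angle))
}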

Let's now return to our main task of classification. As before, we will use face detection to identify each face within a given image and, for each face, we will perform some preprocessing before feeding it into our model. If you recall our discussion of the model, it expects a single-channel (grayscale) image of a face of size 48 x 48, with its values normalized between 0.0 and 1.0. Let's walk through the task piece by piece; we will reuse the results of the faceDetectionRequest we performed earlier and start by instantiating our model:

imageView.image = images[faceIdx]
let model = ExpressionRecognitionModelRaw()

if let faceDetectionResults = faceDetectionRequest.results as? [VNFaceObservation] {
    for face in faceDetectionResults {
        if let currentImage = imageView.image {
            let bbox = face.boundingBox

            let imageSize = CGSize(width: currentImage.size.width,
                                   height: currentImage.size.height)

            let w = bbox.width * imageSize.width
            let h = bbox.height * imageSize.height
            let x = bbox.origin.x * imageSize.width
            let y = bbox.origin.y * imageSize.height

            let faceRect = CGRect(x: x,
                                  y: y,
                                  width: w,
                                  height: h)
        }
    }
}

The preceding code should look familiar to you now, with the only difference being the instantiation of our model: let model = ExpressionRecognitionModelRaw(). Next, we want to crop the face out of the image; in order to do this, we will need to write a utility function. Since we want to carry this over to our application, let's write it as an extension of the CIImage class. Click on the CIImageExtension.swift file within the Sources folder in the left-hand panel to open the relevant file; currently, this file is just an empty extension body, as shown in the following code: 

extension CIImage{
}

Go ahead and add the following snippet of code within the body of CIImage to implement the functionality of cropping:

public func crop(rect: CGRect) -> CIImage? {
    let context = CIContext()
    guard let img = context.createCGImage(self, from: rect) else {
        return nil
    }
    return CIImage(cgImage: img)
}

In the preceding code, we simply create a new image from the current one, constrained to the region passed in; the context.createCGImage method returns a CGImage, which we then wrap in a CIImage before returning it to the caller. With our crop method taken care of, we return to our main playground source and add the following snippet after the face rectangle declared previously to crop a face from our image:

let ciImage = CIImage(cgImage: images[faceIdx].cgImage!)

let cropRect = CGRect(
    x: max(x - (faceRect.width * 0.15), 0),
    y: max(y - (faceRect.height * 0.1), 0),
    width: min(w + (faceRect.width * 0.3), imageSize.width),
    height: min(h + (faceRect.height * 0.6), imageSize.height))

guard let croppedCIImage = ciImage.crop(rect: cropRect) else {
    fatalError("Failed to crop image")
}

We first create an instance of CIImage from the CGImage (referenced by the UIImage instance); we then pad out our face rectangle. The reason for doing this is to better match our training data; if you refer to our previous experiments, the detected bounds fit tightly around the eyes and chin, while our training data encompasses a more holistic view of the face. The numbers were selected through trial and error, but I imagine there is some statistically relevant ratio between the distance between the eyes and the height of the face (maybe). We finally crop our image using the crop method we implemented earlier.

Next, we will resize the image (to the size the model is expecting) but, once again, this functionality is not yet available. So, our next task! Jump back into the CIImageExtension.swift file and add the following method to handle resizing:

public func resize(size: CGSize) -> CIImage {
    let scale = min(size.width, size.height) / min(self.extent.size.width, self.extent.size.height)

    let resizedImage = self.transformed(
        by: CGAffineTransform(
            scaleX: scale,
            y: scale))

    let width = resizedImage.extent.width
    let height = resizedImage.extent.height
    let xOffset = (CGFloat(width) - size.width) / 2.0
    let yOffset = (CGFloat(height) - size.height) / 2.0
    let rect = CGRect(x: xOffset,
                      y: yOffset,
                      width: size.width,
                      height: size.height)

    return resizedImage
        .clamped(to: rect)
        .cropped(to: CGRect(
            x: 0, y: 0,
            width: size.width,
            height: size.height))
}

You may notice that we are not inverting the face rectangle here as we did before; the reason is that we were only required to do this to transform from the Quartz 2D coordinate system to UIKit's coordinate system, which we are not doing here.

Despite the number of lines, the majority of the code is concerned with calculating the scale and translation required to center it. Once we have calculated these, we simply pass in a CGAffineTransform, with our scale, to the transformed method and then our centrally aligned rectangle to the clamped method. With this now implemented, let's return to our main playground code and make use of it by resizing our cropped image, as shown in the following lines:

let resizedCroppedCIImage = croppedCIImage.resize(
    size: CGSize(width: 48, height: 48))

Three more steps are required before we can pass our data to our model for inference: the first is to convert the image to a single channel, the second is to rescale the pixel values so that they are between 0.0 and 1.0, and the final step is to wrap the data in an MLMultiArray, which we can then feed into our model's prediction method. To achieve this, we will add another method to our CIImage extension. It will render the image using a single channel, extract the pixel data, and return it in an array, which we can then easily access for rescaling. Jump back into the CIImageExtension.swift file and add the following method:

public func getGrayscalePixelData() -> [UInt8]? {
    var pixelData: [UInt8]?

    let context = CIContext()

    let attributes = [
        kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
        kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue
        ] as CFDictionary

    var nullablePixelBuffer: CVPixelBuffer? = nil
    let status = CVPixelBufferCreate(
        kCFAllocatorDefault,
        Int(self.extent.size.width),
        Int(self.extent.size.height),
        kCVPixelFormatType_OneComponent8,
        attributes,
        &nullablePixelBuffer)

    guard status == kCVReturnSuccess, let pixelBuffer = nullablePixelBuffer
        else { return nil }

    CVPixelBufferLockBaseAddress(
        pixelBuffer,
        CVPixelBufferLockFlags(rawValue: 0))

    context.render(
        self,
        to: pixelBuffer,
        bounds: CGRect(x: 0,
                       y: 0,
                       width: self.extent.size.width,
                       height: self.extent.size.height),
        colorSpace: CGColorSpaceCreateDeviceGray())

    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)

    if let baseAddress = CVPixelBufferGetBaseAddress(pixelBuffer) {
        pixelData = Array<UInt8>(repeating: 0, count: width * height)
        let buf = baseAddress.assumingMemoryBound(to: UInt8.self)
        for i in 0..<width * height {
            pixelData![i] = buf[i]
        }
    }

    CVPixelBufferUnlockBaseAddress(
        pixelBuffer,
        CVPixelBufferLockFlags(rawValue: 0))

    return pixelData
}

Once again, don't be intimidated by the amount of code; this method performs two main tasks. The first is rendering the image to a CVPixelBuffer using a single channel (grayscale). To highlight this, the code responsible is shown in the following block: 

public func getGrayscalePixelData() -> [UInt8]? {
    let context = CIContext()

    let attributes = [
        kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
        kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue
        ] as CFDictionary

    var nullablePixelBuffer: CVPixelBuffer? = nil
    let status = CVPixelBufferCreate(
        kCFAllocatorDefault,
        Int(self.extent.size.width),
        Int(self.extent.size.height),
        kCVPixelFormatType_OneComponent8,
        attributes,
        &nullablePixelBuffer)

    guard status == kCVReturnSuccess, let pixelBuffer = nullablePixelBuffer
        else { return nil }

    // Render the CIImage to our CVPixelBuffer and return it
    CVPixelBufferLockBaseAddress(
        pixelBuffer,
        CVPixelBufferLockFlags(rawValue: 0))

    context.render(
        self,
        to: pixelBuffer,
        bounds: CGRect(x: 0,
                       y: 0,
                       width: self.extent.size.width,
                       height: self.extent.size.height),
        colorSpace: CGColorSpaceCreateDeviceGray())

    CVPixelBufferUnlockBaseAddress(
        pixelBuffer,
        CVPixelBufferLockFlags(rawValue: 0))

    // ... (pixel extraction and return statement omitted; shown in the next block)
}

We render the image to a CVPixelBuffer to provide a convenient way for us to access the raw pixels that we can then use to populate our array. We then return this to the caller. The main chunk of code that is responsible for this is shown here: 

let width = CVPixelBufferGetWidth(pixelBuffer)
let height = CVPixelBufferGetHeight(pixelBuffer)

if let baseAddress = CVPixelBufferGetBaseAddress(pixelBuffer) {
    pixelData = Array<UInt8>(repeating: 0, count: width * height)
    let buf = baseAddress.assumingMemoryBound(to: UInt8.self)
    for i in 0..<width * height {
        pixelData![i] = buf[i]
    }
}

Here, we first determine the dimensions by obtaining the width and height of our image, using CVPixelBufferGetWidth and CVPixelBufferGetHeight respectively. Then we use these to create an appropriately sized array to hold the pixel data. We then obtain the base address of our CVPixelBuffer and call its assumingMemoryBound method to give us a typed pointer. We can use this to access each pixel, which we do to populate our pixelData array before returning it.

With your getGrayscalePixelData method now implemented, return to the main source of the playground and resume where you left off by adding the following code: 

guard let resizedCroppedCIImageData =
    resizedCroppedCIImage.getGrayscalePixelData() else {
    fatalError("Failed to get (grayscale) pixel data from image")
}

let scaledImageData = resizedCroppedCIImageData.map({ (pixel) -> Double in
    return Double(pixel) / 255.0
})

In the preceding snippet, we obtain the raw pixels of our cropped image using our getGrayscalePixelData method, before rescaling them by dividing each pixel by 255.0 (the maximum value). Our final preparation task is putting our data into a data structure that our model will accept, an MLMultiArray. Add the following code to do just this:

guard let array = try? MLMultiArray(shape: [1, 48, 48], dataType: .double) else {
    fatalError("Unable to create MLMultiArray")
}

for (index, element) in scaledImageData.enumerated() {
    array[index] = NSNumber(value: element)
}

We start by creating an instance of MLMultiArray with the shape of our input data and then proceed to copy across our rescaled pixel data. 

With our model instantiated and data prepared, we can now perform inference using the following code:

DispatchQueue.global(qos: .background).async {
    let prediction = try? model.prediction(
        image: array)

    if let classPredictions = prediction?.classLabelProbs {
        DispatchQueue.main.sync {
            for (k, v) in classPredictions {
                print("\(k) \(v)")
            }
        }
    }
}

In the preceding code, we dispatch inference onto a background thread and then print the probability of each class to the console. With that now complete, run your playground, and if everything is working, you should get something like the following:

Angry 0.0341557003557682
Happy 0.594196200370789
Disgust 2.19011440094619e-06
Sad 0.260873317718506
Fear 0.013140731491148
Surprise 0.000694742717314512
Neutral 0.0969370529055595

As a designer and builder of intelligent systems, it is your task to interpret these results and present them to the user. Some questions you'll want to ask yourself are as follows (a minimal thresholding sketch follows the list):

  • What is an acceptable threshold of a probability before setting the class as true?
  • Can this threshold be dependent on the probabilities of other classes to remove ambiguity? That is, if Sad and Happy both have a probability of 0.3, you can infer that the prediction is inaccurate, or at least not useful.
  • Is there a way to accept multiple probabilities?
  • Is it useful to expose the threshold to the user and let them manually set and/or tune it?
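
For illustration, here is a minimal sketch of the first two ideas, applying a probability threshold together with an ambiguity margin to the classLabelProbs dictionary we printed earlier; the function name and the threshold and margin values are arbitrary, not recommendations:

// Sketch only: returns a label only when the top class is confident
// enough and sufficiently ahead of the runner-up; otherwise nil.
func label(for classPredictions: [String: Double],
           threshold: Double = 0.5,
           margin: Double = 0.2) -> String? {
    let sorted = classPredictions.sorted { $0.value > $1.value }
    guard let best = sorted.first, best.value >= threshold else { return nil }
    if sorted.count > 1, best.value - sorted[1].value < margin { return nil }
    return best.key
}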

These are only a few questions you should ask. The specific questions, and their answers, will depend on your use case and users. At this point, we have everything we need to preprocess and perform inference; let's now turn our attention to the application for this chapter. 

If you find that you are not getting any output, it could be that you need to flag the playground as running indefinitely so that it doesn't exit before running the background thread. You can do this by adding the following statement in your playground: PlaygroundPage.current.needsIndefiniteExecution = true
When this is set to true, you will need to explicitly stop the playground.