Building the photo effects application

In this section, we will look briefly at the application and highlight some of the interesting pieces of the code, omitting most of it as it has already been discussed in previous chapters. As mentioned in the introduction, this example provides a case study for a later section, where we will discuss some broad strategies to use when building intelligent interfaces and services.

If you haven't already, pull down the latest code from the accompanying repository at https://github.com/packtpublishing/machine-learning-with-core-ml. Once downloaded, navigate to the Chapter9/Start/ directory and open the project ActionShot.xcodeproj.

As mentioned in the previous section, the example for this chapter is a photo effects application. In it, the user is able to take an action shot, have the application extract each person from the frames, and compose them onto the final frame, as illustrated in the following figure:

The application consists of two view controllers; one is responsible for capturing the frames and the other for presenting the composite image. The workhorse for the processing has, once again, been delegated to the ImageProcessor class, and it is from its perspective that we will review this project.

ImageProcessor acts as both sink and processor; by sink, I mean that it is the class the CameraViewController passes captured frames to, and it holds them in memory for processing. Let's see what the code looks like; select ImageProcessor.swift from the left panel to bring the source code into focus. We will start by paying particular attention to the properties and methods responsible for handling received frames, and then move on to their processing.

At the top of the file, you will notice that a protocol has been declared, which is implemented by the EffectsViewController; it is used to broadcast the progress of the tasks:

protocol ImageProcessorDelegate : class{

    func onImageProcessorFinishedProcessingFrame(
        status:Int, processedFrames:Int, framesRemaining:Int)

    func onImageProcessorFinishedComposition(
        status:Int, image:CIImage?)
}

The first callback, onImageProcessorFinishedProcessingFrame, is used to notify the delegate of frame-by-frame processing progress, while the other, onImageProcessorFinishedComposition, is used to notify the delegate once the final image has been created. These callbacks are intentionally split because the processing has been broken down into segmentation and composition. Segmentation is responsible for segmenting each of the frames using our model, and composition is responsible for generating the final image using the processed (segmented) frames. This structure is also mirrored in the layout of the class, which is broken down into four parts; it is the flow we will follow in this section.
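To make the delegate concrete, the following is a minimal sketch of how a view controller might adopt it; it is not the book's actual EffectsViewController, and the progressView, imageView, and imageProcessor properties are assumptions for illustration only. It updates a progress indicator during segmentation and triggers composition once no frames remain:

import UIKit

extension EffectsViewController : ImageProcessorDelegate{

    func onImageProcessorFinishedProcessingFrame(
        status:Int, processedFrames:Int, framesRemaining:Int){
        // ImageProcessor invokes these callbacks on the main queue,
        // so it's safe to update the UI here
        let total = processedFrames + framesRemaining
        progressView.progress = total > 0
            ? Float(processedFrames) / Float(total) : 0

        // Once every frame has been segmented, kick off composition
        if framesRemaining == 0{
            imageProcessor.compositeFrames()
        }
    }

    func onImageProcessorFinishedComposition(
        status:Int, image:CIImage?){
        // A negative status signals failure; otherwise present the result
        guard status > 0, let ciImage = image else{ return }
        imageView.image = UIImage(ciImage: ciImage)
    }
}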

The first part declares all the variables. The second implements the properties and methods responsible for retrieving the frames while they're being captured. The third contains all the methods for processing the frames, whereby the delegate is notified using the onImageProcessorFinishedProcessingFrame callback. The final part, and the one we will focus on the most, contains the methods responsible for generating the final image, that is, it composites the frames. Let's peek at the first part to get a sense of what variables are available, which are shown in the following code snippet: 

class ImageProcessor{

    weak var delegate : ImageProcessorDelegate?

    lazy var model : VNCoreMLModel = {
        do{
            let model = try VNCoreMLModel(
                for: small_unet().model
            )
            return model
        } catch{
            fatalError("Failed to create VNCoreMLModel")
        }
    }()

    var minMaskArea:CGFloat = 0.005
    var targetSize = CGSize(width: 448, height: 448)
    let lock = NSLock()
    var frames = [CIImage]()
    var processedImages = [CIImage]()
    var processedMasks = [CIImage]()
    private var _processingImage = false

    init(){

    }
}

Nothing extraordinary. We first declare a property that wraps our model in an instance of VNCoreMLModel so that we can take advantage of the Vision framework's preprocessing functionality. We then declare a series of variables to deal with storing the frames and handling the processing; we make use of an NSLock instance to synchronize access to these properties across threads, since frames are added while capturing and consumed on a background queue during processing.

The following code snippet, also part of the ImageProcessor class, includes the computed properties and methods for retrieving and releasing the captured frames:

extension ImageProcessor{

    var isProcessingImage : Bool{
        get{
            self.lock.lock()
            defer {
                self.lock.unlock()
            }
            return _processingImage
        }
        set(value){
            self.lock.lock()
            _processingImage = value
            self.lock.unlock()
        }
    }

    var isFrameAvailable : Bool{
        get{
            self.lock.lock()
            let frameAvailable = self.frames.count > 0
            self.lock.unlock()
            return frameAvailable
        }
    }

    public func addFrame(frame:CIImage){
        self.lock.lock()
        self.frames.append(frame)
        self.lock.unlock()
    }

    public func getNextFrame() -> CIImage?{
        self.lock.lock()
        defer {
            self.lock.unlock()
        }
        // Guard against an empty queue; removeFirst() would otherwise crash
        guard self.frames.count > 0 else{
            return nil
        }
        return self.frames.removeFirst()
    }

    public func reset(){
        self.lock.lock()
        self.frames.removeAll()
        self.processedImages.removeAll()
        self.processedMasks.removeAll()
        self.lock.unlock()
    }
}

Although fairly verbose, it should all be self-explanatory; probably the only method worth calling out is addFrame, which is called each time a frame is captured by the camera. To give you a bearing on how everything is tied together, the following diagram illustrates the general flow while capturing frames:

The details of the flow are covered in the following points: 

  1. Although frames are captured throughout the lifetime of the CameraViewController, they are only passed to the ImageProcessor once the user taps (and holds) their finger on the action button
  2. During this time, each frame that is captured (at the throttled rate, currently 10 frames per second) is passed to the CameraViewController
  3. The CameraViewController subsequently passes it to the ImageProcessor using the addFrame method shown earlier (a sketch of this capture-and-throttle flow follows this list)
  4. Capturing stops when the user lifts their finger from the action button; once finished, the EffectsViewController is instantiated and presented, and is passed a reference to the ImageProcessor, which holds the captured frames
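To give a concrete sense of steps 2 and 3, here is a hypothetical sketch of how captured frames might be throttled and handed to the ImageProcessor via addFrame. It is not the project's CameraViewController; the class name, properties, and the particular throttling approach are assumptions made for illustration:

import AVFoundation
import CoreImage

class FrameRelay : NSObject, AVCaptureVideoDataOutputSampleBufferDelegate{

    let imageProcessor = ImageProcessor()
    var isCapturing = false // true while the action button is held
    private var lastFrameTime = CFAbsoluteTimeGetCurrent()
    private let targetInterval : CFTimeInterval = 1.0 / 10.0 // ~10 frames per second

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection){
        guard isCapturing else{ return }

        // Throttle; drop frames that arrive sooner than the target interval
        let now = CFAbsoluteTimeGetCurrent()
        guard now - lastFrameTime >= targetInterval else{ return }
        lastFrameTime = now

        // Wrap the pixel buffer in a CIImage and hand it to the processor
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else{ return }
        imageProcessor.addFrame(frame: CIImage(cvPixelBuffer: pixelBuffer))
    }
}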

The next part of the ImageProcessor class is responsible for processing each of these images; this is initiated using the processFrames method, which is called by the EffectsViewController once it has loaded. This part has a lot more code, but most of it should be familiar to you as it's the boilerplate code we've used in many of the projects during the course of this book. Let's start by inspecting the processFrames method, as shown in the following snippet:

All of the remaining code is assumed to be inside the ImageProcessor class for the rest of this chapter unless stated otherwise; that is, the class and class extension declarations will be omitted to make the code easier to read.

public func processFrames(){
    if !self.isProcessingImage{
        DispatchQueue.global(qos: .background).async {
            self.processesingNextFrame()
        }
    }
}

This method simply dispatches a call to processesingNextFrame onto a background queue. Keeping inference off the main thread is strongly recommended when working with Core ML, as it is for any compute-intensive task, to avoid locking up the user interface. Let's continue the trail by inspecting the processesingNextFrame method along with the method responsible for returning an instance of VNCoreMLRequest, as shown in the following code snippet:

func getRequest() -> VNCoreMLRequest{
    let request = VNCoreMLRequest(
        model: self.model,
        completionHandler: { [weak self] request, error in
            self?.processRequest(for: request, error: error)
    })
    request.imageCropAndScaleOption = .centerCrop
    return request
}

func processesingNextFrame(){
    self.isProcessingImage = true

    guard let nextFrame = self.getNextFrame() else{
        self.isProcessingImage = false
        return
    }

    // Crop the frame to a centered square and resize it to the model's input size
    var ox : CGFloat = 0
    var oy : CGFloat = 0
    let frameSize = min(nextFrame.extent.width, nextFrame.extent.height)
    if nextFrame.extent.width > nextFrame.extent.height{
        ox = (nextFrame.extent.width - nextFrame.extent.height)/2
    } else if nextFrame.extent.width < nextFrame.extent.height{
        oy = (nextFrame.extent.height - nextFrame.extent.width)/2
    }

    guard let frame = nextFrame
        .crop(rect: CGRect(x: ox,
                           y: oy,
                           width: frameSize,
                           height: frameSize))?
        .resize(size: targetSize) else{
            self.isProcessingImage = false
            return
    }

    self.processedImages.append(frame)
    let handler = VNImageRequestHandler(ciImage: frame)

    do {
        try handler.perform([self.getRequest()])
    } catch {
        print("Failed to perform classification. \(error.localizedDescription)")
        self.isProcessingImage = false
        return
    }
}

We start off by setting the property isProcessingImage to true and checking that we have a frame to process, otherwise exiting early from the method. 

The following might seem a little counter-intuitive (because it is); we have seen from previous chapters that VNCoreMLRequest handles the preprocessing task of resizing and cropping our images. So, why are we doing it manually here? The reason has more to do with keeping the code simpler and meeting publishing deadlines. In this example, the final image is composited from the resized frames, which saves us from having to scale and offset the model's output back to the original frame size (something I'll leave as an exercise for you). So here, we perform that operation ourselves and persist the result in the processedImages array to be used in the final stage. Finally, we execute the request, passing in the image; once finished, this calls our processRequest method with the results from the model.
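Note that the crop(rect:) and resize(size:) methods used in the preceding snippet are helpers from the project's CIImage+Extension.swift file rather than standard API. As a rough sketch only (the project's implementations may differ, and the names cropSketch and resizeSketch are hypothetical stand-ins), they could be built on CIImage's cropped(to:) and transformed(by:) methods like this:

import CoreImage

extension CIImage{

    // Crop to the given rect and shift the origin back to (0, 0)
    func cropSketch(rect: CGRect) -> CIImage?{
        return self.cropped(to: rect)
            .transformed(by: CGAffineTransform(
                translationX: -rect.origin.x,
                y: -rect.origin.y))
    }

    // Scale to the target size (non-uniformly if the aspect ratios differ)
    func resizeSketch(size: CGSize) -> CIImage?{
        let scaleX = size.width / self.extent.width
        let scaleY = size.height / self.extent.height
        return self.transformed(by: CGAffineTransform(
            scaleX: scaleX, y: scaleY))
    }
}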

Continuing on our trail, we will now inspect the processRequest method; as this method is quite long, we will break it down into chunks, working top to bottom:

func processRequest(for request:VNRequest, error: Error?){
    self.lock.lock()
    let framesReaminingCount = self.frames.count
    let processedFramesCount = self.processedImages.count
    self.lock.unlock()

    ...
}

We start off by getting the latest counts, which will be broadcast to the delegate when this method finishes or fails. Speaking of which, the following block verifies that a result of type [VNPixelBufferObservation] was returned; otherwise, it notifies the delegate and returns, as shown in the following snippet:

func processRequest(for request:VNRequest, error: Error?){
    ...

    guard let results = request.results,
        let pixelBufferObservations = results as? [VNPixelBufferObservation],
        pixelBufferObservations.count > 0 else {
            print("ImageProcessor", #function, "ERROR:",
                  String(describing: error?.localizedDescription))

            self.isProcessingImage = false

            DispatchQueue.main.async {
                self.delegate?.onImageProcessorFinishedProcessingFrame(
                    status: -1,
                    processedFrames: processedFramesCount,
                    framesRemaining: framesReaminingCount)
            }
            return
    }

    ...
}

With a reference to our result (a CVPixelBuffer), our next task is to create an instance of CIImage, passing in the buffer and requesting a grayscale color space to ensure that a single-channel image is created. Then, we add it to our processedMasks array, as shown in the following snippet:

func processRequest(for request:VNRequest, error: Error?){
    ...

    let options = [
        kCIImageColorSpace:CGColorSpaceCreateDeviceGray()
        ] as [String:Any]

    let ciImage = CIImage(
        cvPixelBuffer: pixelBufferObservations[0].pixelBuffer,
        options: options)

    self.processedMasks.append(ciImage)

    ...
}

Only two more things left to do! We notify the delegate that we have finished a frame and proceed to process the next frame, if available:

func processRequest(for request:VNRequest, error: Error?){
    ...

    DispatchQueue.main.async {
        self.delegate?.onImageProcessorFinishedProcessingFrame(
            status: 1,
            processedFrames: processedFramesCount,
            framesRemaining: framesReaminingCount)
    }

    if self.isFrameAvailable{
        self.processesingNextFrame()
    } else{
        self.isProcessingImage = false
    }
}

This concludes the third part of our ImageProcessor; at this point, we have two arrays containing the resized captured frames and the segmented images from our model. Before moving on to the final part of this class, let's get a bird's-eye view of what we just did, illustrated in this flow diagram:

The details of the flow are shown in the following points: 

  1. As shown in the preceding diagram, processing is initiated once the EffectsViewController has loaded, which kicks off a background thread to process each of the captured frames
  2. Each frame is first cropped and resized so that it matches the size of the model's input (and, therefore, of its output mask)
  3. Then, it is added to the processedImages array and passed to our model for inference (segmentation)
  4. Once the model returns a result, we instantiate a single-channel (grayscale) CIImage from the returned pixel buffer
  5. This instance is stored in the array processedMasks and the delegate is notified of the progress

What happens when all frames have been processed? This is what we plan on answering in the next part, where we will discuss the details of how to create the effect. To start with, let's discuss how the process is initiated. 

Once the delegate (EffectsViewController) receives an onImageProcessorFinishedProcessingFrame callback indicating that all of the frames have been processed, it calls the ImageProcessor's compositeFrames method to start creating the effect. Let's review the existing code within this part of the ImageProcessor class:

func compositeFrames(){

    var selectedIndicies = self.getIndiciesOfBestFrames()
    if selectedIndicies.count == 0{
        DispatchQueue.main.async {
            self.delegate?.onImageProcessorFinishedComposition(
                status: -1,
                image: self.processedImages.last!)
        }
        return
    }

    var finalImage = self.processedImages[selectedIndicies.last!]
    selectedIndicies.removeLast()

    // TODO Composite final image using segments from intermediate frames

    DispatchQueue.main.async {
        self.delegate?.onImageProcessorFinishedComposition(
            status: 1,
            image: finalImage)
    }
}

func getIndiciesOfBestFrames() -> [Int]{
    // TODO: find the best frames for the sequence, i.e. avoid excessive overlapping
    return (0..<self.processedMasks.count).map({ (i) -> Int in
        return i
    })
}

func getDominantDirection() -> CGPoint{
    var dir = CGPoint(x: 0, y: 0)
    // TODO: detect the dominant direction
    return dir
}

The important/interesting parts, essentially the parts we will be implementing, are the TODO comments in the preceding code. But before writing any more code, let's review what we currently have (in terms of processed images) and an approach to creating our effect.

At this stage, we have an array, processedImages, that contains the resized and cropped versions of the captured images, and another array, processedMasks, containing the single-channel images from our segmentation model. Examples of these are shown in the following figure:

If we were to composite each of the frames as they are, we would end up with a lot of unwanted artifacts and excessive overlapping. One approach could be to thin out the frames that are processed (and possibly captured), that is, skip every n frames to spread them out. The problem with this approach is that it assumes all subjects move at the same speed; to account for this, you would need to expose the tuning to the user for manual adjustment (which is a reasonable approach). The approach we will take here is to extract the bounding box of the segmented person in each frame, and use the displacement and relative overlap of these boxes to determine when to include a frame and when to skip it.

To calculate the bounding box, we simply scan the image line by line from each edge: from top to bottom to find the top of the object, then from bottom to top to find its bottom, and similarly along the horizontal axis, as illustrated in the following figure:

Even with bounding boxes, we still need to determine how far the object should move before inserting a frame. To determine this, we first find the dominant direction, which is calculated by finding the direction between the first and last frames of the segmented object. This is then used to determine which axis to compare displacement on; that is, if the dominant direction is along the horizontal axis (as shown in the preceding figure), then we measure the displacement along that axis, ignoring the other. We then simply compare the distance between the frames' bounding boxes against some predetermined threshold to decide whether to composite the frame or ignore it. This is illustrated in the following figure:

Let's see what this looks like in code, starting from determining the dominant direction. Add the following code to the getDominantDirection method:

var dir = CGPoint(x: 0, y: 0)

var startCenter : CGPoint?
var endCenter : CGPoint?

// Find startCenter
for i in 0..<self.processedMasks.count{
    let mask = self.processedMasks[i]

    guard let maskBB = mask.getContentBoundingBox(),
        (maskBB.width * maskBB.height) >=
            (mask.extent.width * mask.extent.height) * self.minMaskArea
        else {
            continue
    }

    startCenter = maskBB.center
    break
}

// Find endCenter
for i in (0..<self.processedMasks.count).reversed(){
    let mask = self.processedMasks[i]

    guard let maskBB = mask.getContentBoundingBox(),
        (maskBB.width * maskBB.height) >=
            (mask.extent.width * mask.extent.height) * self.minMaskArea
        else {
            continue
    }

    endCenter = maskBB.center
    break
}

if let startCenter = startCenter, let endCenter = endCenter, startCenter != endCenter{
    dir = (startCenter - endCenter).normalised
}

return dir

As described earlier, we find the bounding boxes at the start and end of our sequence of frames and use their centers to calculate the dominant direction; masks whose area falls below minMaskArea (relative to the frame) are skipped, as they are most likely noise rather than a subject.
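The center and normalised helpers (and the - operator for CGPoint) used here come from the project's extension files as well; the following is only a minimal sketch of what they might look like (the project's versions may differ):

import CoreGraphics

extension CGRect{
    var center : CGPoint{
        return CGPoint(x: self.midX, y: self.midY)
    }
}

extension CGPoint{
    static func -(lhs: CGPoint, rhs: CGPoint) -> CGPoint{
        return CGPoint(x: lhs.x - rhs.x, y: lhs.y - rhs.y)
    }

    // Unit-length vector pointing in the same direction
    var normalised : CGPoint{
        let length = (self.x * self.x + self.y * self.y).squareRoot()
        guard length > 0 else{ return CGPoint.zero }
        return CGPoint(x: self.x / length, y: self.y / length)
    }
}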

The implementation of the CIImage method getContentBoundingBox is omitted here, but it can be found in the accompanying source code, within the CIImage+Extension.swift file.
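For illustration only, here is one way such a bounding-box scan could be implemented. This is a sketch under the assumption that the mask can be rendered into an 8-bit, single-channel bitmap, and it is not necessarily how the project's getContentBoundingBox works; a full scan of the pixels gives the same result as scanning inward from each edge, just less efficiently:

import CoreImage

extension CIImage{

    // Hypothetical sketch: render the (grayscale) mask to an 8-bit buffer and
    // record the extremes of the non-zero pixels
    func contentBoundingBoxSketch(context: CIContext = CIContext()) -> CGRect?{
        let width = Int(self.extent.width)
        let height = Int(self.extent.height)
        guard width > 0, height > 0 else{ return nil }

        var pixels = [UInt8](repeating: 0, count: width * height)
        context.render(self,
                       toBitmap: &pixels,
                       rowBytes: width,
                       bounds: self.extent,
                       format: .R8,
                       colorSpace: CGColorSpaceCreateDeviceGray())

        // Track the first/last row and column containing any non-zero pixel
        var minX = width, maxX = -1, minY = height, maxY = -1
        for y in 0..<height{
            for x in 0..<width where pixels[y * width + x] > 0{
                minX = min(minX, x); maxX = max(maxX, x)
                minY = min(minY, y); maxY = max(maxY, y)
            }
        }

        guard maxX >= minX, maxY >= minY else{ return nil }
        return CGRect(x: minX, y: minY,
                      width: maxX - minX + 1, height: maxY - minY + 1)
    }
}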

Armed with the dominant direction, we can now proceed with determining what frames to include and what frames to ignore. We will implement this in the method getIndiciesOfBestFrames of the ImageProcessor class, which iterates over all frames, measuring the overlap and ignoring those that don't meet a specific threshold. The method returns an array of indices that satisfy this threshold to be composited onto the final image. Add the following code to the getIndiciesOfBestFrames method:

var selectedIndicies = [Int]()
var previousBoundingBox : CGRect?
let dir = self.getDominantDirection()

for i in (0..<self.processedMasks.count).reversed(){
    let mask = self.processedMasks[i]

    guard let maskBB = mask.getContentBoundingBox(),
        maskBB.width < mask.extent.width * 0.7,
        maskBB.height < mask.extent.height * 0.7 else {
            continue
    }

    if previousBoundingBox == nil{
        previousBoundingBox = maskBB
        selectedIndicies.append(i)
    } else{
        let distance = abs(dir.x) >= abs(dir.y)
            ? abs(previousBoundingBox!.center.x - maskBB.center.x)
            : abs(previousBoundingBox!.center.y - maskBB.center.y)
        let bounds = abs(dir.x) >= abs(dir.y)
            ? (previousBoundingBox!.width + maskBB.width) / 2.0
            : (previousBoundingBox!.height + maskBB.height) / 2.0

        if distance > bounds * 0.15{
            previousBoundingBox = maskBB
            selectedIndicies.append(i)
        }
    }
}

return selectedIndicies.reversed()

We begin by getting the dominant direction, as discussed earlier, and then proceed to iterate through our sequence of frames in reverse order (in reverse because it is assumed that the user's hero shot is the last frame). For each frame, we obtain the bounding box, skipping any mask whose box covers most of the frame (most likely a poor segmentation). If it's the first frame to be checked, we assign its bounding box to the variable previousBoundingBox; this is used to compare subsequent bounding boxes (and is updated to the latest included frame). If previousBoundingBox is not nil, then we calculate the displacement between the two boxes along the dominant direction, as shown in the following snippet:

let distance = abs(dir.x) >= abs(dir.y)
    ? abs(previousBoundingBox!.center.x - maskBB.center.x)
    : abs(previousBoundingBox!.center.y - maskBB.center.y)

We then calculate the length needed to separate the two objects: the combined size of the two bounding boxes along the relevant axis, divided by 2, that is, the sum of their half-widths (or half-heights), as shown in the following snippet:

let bounds = abs(dir.x) >= abs(dir.y)
    ? (previousBoundingBox!.width + maskBB.width) / 2.0
    : (previousBoundingBox!.height + maskBB.height) / 2.0

We then compare the distance against bounds scaled by a threshold, and add the current index to selectedIndicies if the distance exceeds it:

if distance > bounds * 0.15{
    previousBoundingBox = maskBB
    selectedIndicies.append(i)
}

Returning to the compositeFrames method, we are now ready to composite the selected frames. To achieve this, we will leverage Core Image filters; but before doing so, let's quickly review exactly what we want to achieve.

For each selected (processed) image and mask pair, we want to cut out the masked region of the image and overlay it onto the final image. To improve the effect, we will apply a progressively increasing alpha so that frames closer to the final frame have an opacity closer to 1.0, while frames further away are progressively more transparent; this gives us a faded trailing effect. This process is summarized in the following figure:

Let's turn this into code by first implementing the filter. As shown earlier, we will be passing the kernel the output image, the overlay and its corresponding mask, and an alpha. Near the top of the ImageProcessor class, add the following code:

lazy var compositeKernel : CIColorKernel? = {
    let kernelString = """
    kernel vec4 compositeFilter(
        __sample image,
        __sample overlay,
        __sample overlay_mask,
        float alpha){
        float overlayStrength = 0.0;

        if(overlay_mask.r > 0.0){
            overlayStrength = 1.0;
        }

        overlayStrength *= alpha;

        return vec4(image.rgb * (1.0-overlayStrength), 1.0)
            + vec4(overlay.rgb * (overlayStrength), 1.0);
    }
    """
    return CIColorKernel(source:kernelString)
}()

In the preceding code, we implemented the CIColorKernel that is responsible for compositing each of our frames onto the final image, as discussed. We start by testing the mask's value: if it is greater than 0.0, we set overlayStrength to 1.0 (meaning we want to replace the color at that location of the final image with that of the overlay); otherwise, we leave it at 0.0, ignoring it. Then, we multiply the strength by the alpha argument passed to our kernel. Finally, we calculate and return the final color with the statement vec4(image.rgb * (1.0-overlayStrength), 1.0) + vec4(overlay.rgb * (overlayStrength), 1.0). With our filter now implemented, let's return to the compositeFrames method and put it to use. Within compositeFrames, replace the comment // TODO Composite final image using segments from intermediate frames with the following code:

let alphaStep : CGFloat = 1.0 / CGFloat(selectedIndicies.count)

for i in selectedIndicies{
    let image = self.processedImages[i]
    let mask = self.processedMasks[i]

    let extent = image.extent
    let alpha = CGFloat(i + 1) * alphaStep
    let arguments = [finalImage, image, mask, min(alpha, 1.0)] as [Any]
    if let compositeFrame = self.compositeKernel?.apply(extent: extent, arguments: arguments){
        finalImage = compositeFrame
    }
}

Most of this should be self-explanatory; we start by calculating an alpha stride that will be used to progressively increase opacity as we get closer to the final frame. We then iterate through all the selected frames, applying the filter we just implemented in the preceding snippet, compositing our final image.

With that done, we have now finished this method and the coding for this chapter. Well done! It's time to test it out; build and run the project to see your hard work in action. The following is a result from a weekend park visit:

Before wrapping up this chapter, let's briefly discuss some strategies when working with machine learning models. 
