Making it easier to find photos

In this section, we will put our model to work in an intelligent search application. We'll start off by quickly introducing the application, giving us a clear vision of what we intend to build. Then, we'll work through implementing the functionality for interpreting the model's output and the search heuristics that deliver the desired functionality. We will be omitting a lot of the usual iOS plumbing so that we can stay focused on the intelligent aspects of the application.

Over the past few years, we have seen a surge of intelligence being embedded in photo gallery applications, providing us with efficient ways of surfacing those cat photos hidden deep in the hundreds (if not thousands) of photos we have accumulated over the years. In this section, we want to continue with this theme, but push that level of intelligence a little further by taking advantage of the semantic information gained through object detection. Our users will be able to search not only for specific objects within an image, but also for photos based on the objects and their relative positioning. For example, they can search for an image, or images, with two people standing side by side in front of a car.

The user interface allows the user to draw the objects as they would like them positioned, along with their relative sizes. It will be our job in this section to implement the intelligence that returns relevant images based on these search criteria.

The next figure shows the user interface; the first two screenshots show the search screen, where the user can visually articulate what they are looking for. Using labeled bounding boxes, the user is able to describe what they are looking for, how they would like these objects arranged, and the objects' relative sizes. The last two screenshots show the result of a search; when expanded (last screenshot), the image is overlaid with the detected objects and their associated bounding boxes:

Let's start by taking a tour of the existing project before importing the model we have just converted and downloaded.

If you haven't already, pull down the latest code from the accompanying repository:  https://github.com/packtpublishing/machine-learning-with-core-ml. Once downloaded, navigate to the directory Chapter5/Start/ and open the project ObjectDetection.xcodeproj. Once loaded, you will see the project for this chapter, as shown in the following screenshot:

I will leave exploring the full project as an exercise for you, and I'll just concentrate on the files PhotoSearcher.swift and YOLOFacade.swift for this section. PhotoSearcher.swift is where we will implement the cost functions responsible for filtering and sorting the photos based on the search criteria and detected objects from YOLOFacade.swift, whose sole purpose is to wrap the Tiny YOLO model and implement the functionality to interpret its output. But before jumping into the code, let's quickly review the flow and data structures we will be working with.

The following diagram illustrates the general flow of the application; the user first defines the search criteria via SearchViewController, which is described as an array of normalized ObjectBounds (we'll cover these in more detail shortly). When the user initiates the search (via the top-right search icon), these are passed to SearchResultsViewController, which delegates the task of finding suitable images to PhotoSearcher.

PhotoSearcher proceeds to iterate through all of our photos, passing each of them through to YOLOFacade to perform object detection using the model we converted in the previous section. The results of these are passed back to PhotoSearcher, which evaluates the cost of each with respect to the search criteria and then filters and orders the results, before passing them back to SearchResultsViewController to be displayed:

Each component communicates with the others using either the ObjectBounds or SearchResult data object. Because we will be working with them throughout the rest of this chapter, let's quickly introduce them here; both are defined in the DataObjects.swift file. Let's start with ObjectBounds, the structure shown in the following snippet:

struct ObjectBounds {
public var object : DetectableObject
public var origin : CGPoint
public var size : CGSize

var bounds : CGRect{
return CGRect(origin: self.origin, size: self.size)
}
}

As the name suggests, ObjectBounds is just that—it encapsulates the boundary of an object using the variables origin and size. The object itself is of type DetectableObject, which provides a structure to store both the class index and its associated label. It also provides a static array of objects that are available in our search, as follows:

struct DetectableObject{
public var classIndex : Int
public var label : String

static let objects = [
DetectableObject(classIndex:19, label:"tvmonitor"),
DetectableObject(classIndex:18, label:"train"),
DetectableObject(classIndex:17, label:"sofa"),
DetectableObject(classIndex:14, label:"person"),
DetectableObject(classIndex:11, label:"dog"),
DetectableObject(classIndex:7, label:"cat"),
DetectableObject(classIndex:6, label:"car"),
DetectableObject(classIndex:5, label:"bus"),
DetectableObject(classIndex:4, label:"bottle"),
DetectableObject(classIndex:3, label:"boat"),
DetectableObject(classIndex:2, label:"bird"),
DetectableObject(classIndex:1, label:"bicycle")
]
}

ObjectBounds is used for both the search criteria defined by the user and the search results returned by YOLOFacade; in the former, it describes where and which objects the user is interested in finding (the search criteria), while in the latter it encapsulates the results from object detection.
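To make this concrete, here is a small, purely illustrative example of how a search criterion could be constructed with these types; the variable names and values are hypothetical and not part of the project:

// Hypothetical example: search for a person occupying roughly the left half
// of the image. All values are normalized, that is, fractions of the image size.
let person = DetectableObject(classIndex: 14, label: "person")

let searchCriteria = [
    ObjectBounds(
        object: person,
        origin: CGPoint(x: 0.1, y: 0.25),
        size: CGSize(width: 0.4, height: 0.5))
]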

SearchResult isn't any more complex; it encapsulates the result of a search, adding the image and a cost, which is set during the cost evaluation stage (step 8) shown in the previous diagram. The complete structure is as follows:

struct SearchResult{
public var image : UIImage
public var detectedObjects : [ObjectBounds]
public var cost : Float
}

It's worth noting that the ObjectBounds messages in the previous diagram annotated with the word Normalized refer to values expressed in unit terms relative to the source or target size; that is, an origin of x = 0.5 and y = 0.5 defines the center of the source image it was defined on. The reason for this is to ensure that the bounds are invariant to changes in the images they are operating on. You will soon see that, before passing images to our model, we need to resize and crop them to a size of 416 x 416 (the expected input to our model), but we need to transform the resulting bounds back to the original image's dimensions when rendering the results.
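As a minimal sketch of what normalized means here, the following hypothetical helper (not part of the project) scales a normalized ObjectBounds into the pixel space of a given image size:

// Hypothetical helper, for illustration only; it is not part of the project.
// A normalized origin of (0.5, 0.5) maps to the center of the given image size.
func denormalize(_ bounds: ObjectBounds, to imageSize: CGSize) -> CGRect {
    return CGRect(
        x: bounds.origin.x * imageSize.width,
        y: bounds.origin.y * imageSize.height,
        width: bounds.size.width * imageSize.width,
        height: bounds.size.height * imageSize.height)
}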

Now that we have a better idea of the objects we will be consuming and generating, let's proceed with implementing YOLOFacade and work our way up the stack.

Let's start by importing the model we have just converted in the previous section; locate the downloaded .mlmodel file and drag it onto Xcode. Once imported, select it from the left-hand panel to inspect the metadata to remind ourselves what we need to implement. It should resemble this screenshot:

With our model now imported, let's walk through implementing the functionality YOLOFacade is responsible for; this includes preprocessing the image, passing it to our model for inference, and then parsing the model's output, including performing non-max suppression. Select YOLOFacade.swift from the left-hand panel to bring up the code in the main window.

The class is broken into three parts, via an extension, with the first including the variables and entry point; the second including the functionality for performing inference and parsing the model's outputs; and the third including the non-max suppression algorithm we discussed at the start of this chapter. Let's start at the beginning, which currently looks like this:

class YOLOFacade{

// TODO add input size (of image)
// TODO add grid size
// TODO add number of classes
// TODO add number of anchor boxes
// TODO add anchor shapes (describing aspect ratio)

lazy var model : VNCoreMLModel? = {
do{
// TODO add model
return nil
} catch{
fatalError("Failed to obtain tinyyolo_voc2007")
}
}()

func asyncDetectObjects(
photo:UIImage,
completionHandler:@escaping (_ result:[ObjectBounds]?) -> Void){

DispatchQueue.global(qos: .background).sync {

self.detectObjects(photo: photo, completionHandler: { (result) -> Void in
DispatchQueue.main.async {
completionHandler(result)
}
})
}
}

}

The asyncDetectObjects method is the entry point of the class and is called by PhotoSearcher for each image it receives from the Photos framework; when called, this method simply delegates the task to the detectObjects method on a background queue and waits for the results, before passing them back to the caller on the main thread. I have annotated the class with TODO comments to help you stay focused.

Let's start by declaring the target size required by our model; this will be used when preprocessing the input to our model and when transforming the normalized bounds back to those of the source image. Add the following code:

// TODO add input size (of image)
var targetSize = CGSize(width: 416, height: 416)

Next, we define properties of our model that are used during parsing of the output; these include grid size, number of classes, number of anchor boxes, and finally, the dimensions for each of the anchor boxes (each pair describes the width and height, respectively). Make the following amendments to your YOLOFacade class:

// TODO add grid size
let gridSize = CGSize(width: 13, height: 13)
// TODO add number of classes
let numberOfClasses = 20
// TODO add number of anchor boxes
let numberOfAnchorBoxes = 5
// TODO add anchor shapes (describing aspect ratio)
let anchors : [Float] = [1.08, 1.19, 3.42, 4.41, 6.63, 11.38, 9.42, 5.11, 16.62, 10.52]

Let's now implement the model property; in this example, we will take advantage of the Vision framework for handling the preprocessing. For this, we will need to wrap our model in an instance of VNCoreMLModel so that we can pass it into a VNCoreMLRequest; make the following amendments, as shown in bold:

lazy var model : VNCoreMLModel = {
do{
// TODO add model
let model = try VNCoreMLModel(
for: tinyyolo_voc2007().model)
return model
} catch{
fatalError("Failed to obtain tinyyolo_voc2007")
}
}()

Let's now turn our attention to the detectObjects method. It will be responsible for performing inference via VNCoreMLRequest and VNImageRequestHandler, passing the model's output to the detectObjectsBounds method (which we will come to next), and finally transforming the normalized bounds to the dimensions of the original (source) image.

In this chapter, we will postpone the discussion around the Vision framework classes (VNCoreMLModel, VNCoreMLRequest, and VNImageRequestHandler) until the next chapter, where we will elaborate a little on what each does and how they work together.

Within the detectObjects method of YOLOFacade, replace the comment // TODO preprocess image and pass to model with the following code:

let request = VNCoreMLRequest(model: self.model)
request.imageCropAndScaleOption = .centerCrop

let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])

do {
try handler.perform([request])
} catch {
print("Failed to perform classification. (error.localizedDescription)")
completionHandler(nil)
return
}

In the preceding snippet, we start by creating an instance of VNCoreMLRequest, passing in our model, which itself has been wrapped with an instance of VNCoreMLModel. This request performs the heavy lifting, including preprocessing (inferred by the model's metadata) and performing inference. We set its imageCropAndScaleOption property to centerCrop, which determines, as you might expect, how the image is resized to fit into the model's input. The request itself doesn't actually execute the task; this is the responsibility of VNImageRequestHandler, which we declare next by passing in our source image and then executing the request via the handler's perform method.

If all goes to plan, we should expect to have the model's output available via the request's results property. Let's move on to the last snippet for this method; replace the comment // TODO pass models results to detectObjectsBounds(::) and the following statement, completionHandler(nil), with this code:

guard let observations = request.results as? [VNCoreMLFeatureValueObservation] else{
completionHandler(nil)
return
}

var detectedObjects = [ObjectBounds]()

for observation in observations{
guard let multiArray = observation.featureValue.multiArrayValue else{
continue
}

if let observationDetectedObjects = self.detectObjectsBounds(array: multiArray){

for detectedObject in observationDetectedObjects.map(
{$0.transformFromCenteredCropping(from: photo.size, to: self.targetSize)}){
detectedObjects.append(detectedObject)
}
}
}

completionHandler(detectedObjects)

We begin by trying to cast the results to an array of VNCoreMLFeatureValueObservation, a type of image analysis observation that wraps an MLFeatureValue; from each observation, we take its featureValue.multiArrayValue and pass it to the detectObjectsBounds method to parse the output and return the detected objects and their bounding boxes. Once detectObjectsBounds returns, we map each of the results with the ObjectBounds method transformFromCenteredCropping, which is responsible for transforming the normalized bounds into the space of the source image. Once each of the bounds has been transformed, we call the completion handler, passing in the detected bounds.
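The project ships transformFromCenteredCropping for this; purely as a hedged sketch of the idea (assuming the centerCrop scale option corresponds to a centered square crop of the source image), the inverse mapping could look roughly like the following. The function name and signature here are illustrative, and the project's actual implementation may differ:

// Illustrative sketch only; the project's transformFromCenteredCropping may differ.
// Maps bounds normalized to the model's square center crop back into the
// coordinate space of the source image.
func transformFromCenteredCroppingSketch(
    _ bounds: ObjectBounds, sourceSize: CGSize) -> ObjectBounds {
    // The center crop corresponds to a square region of the source image whose
    // side equals the smaller of the two source dimensions.
    let side = min(sourceSize.width, sourceSize.height)
    let offsetX = (sourceSize.width - side) / 2.0
    let offsetY = (sourceSize.height - side) / 2.0

    var transformed = bounds
    transformed.origin = CGPoint(
        x: offsetX + bounds.origin.x * side,
        y: offsetY + bounds.origin.y * side)
    transformed.size = CGSize(
        width: bounds.size.width * side,
        height: bounds.size.height * side)
    return transformed
}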

The next two methods encapsulate the bulk of the YOLO algorithm and the bulk of the code for this class. Let's start with the detectObjectsBounds method, making our way through it in small chunks.

This method will receive an MLMultiArray with the shape of (125, 13, 13); this will hopefully look familiar to you (although reversed) where the (13, 13) is the size of our grid and the 125 encodes five blocks (coinciding with our five anchor boxes) each containing the bounding box, the probability of an object being present (or not), and the probability distribution across 20 classes. For your convenience, I have again added the diagram illustrating this structure:


To improve performance, we will access the MLMultiArray's raw data directly, rather than through the MLMultiArray subscript. Although having direct access gives us a performance boost, it does have a trade-off of requiring us to correctly calculate the index for each value. Let's define the constants that we will use when calculating these indexes, as well as obtaining access to the raw data buffer and some arrays to store the intermediate results; add the following code within your detectObjectsBounds method:

let gridStride = array.strides[0].intValue
let rowStride = array.strides[1].intValue
let colStride = array.strides[2].intValue

let arrayPointer = UnsafeMutablePointer<Double>(OpaquePointer(array.dataPointer))

var objectsBounds = [ObjectBounds]()
var objectConfidences = [Float]()

As mentioned before, we start by defining constants of stride values for the grid, row, and column, each used to calculate the index of the current value. These values are obtained through the strides property of MLMultiArray, which gives us the number of data elements to skip in order to advance the corresponding dimension's index by one; for our (125, 13, 13) output, these are typically 169, 13, and 1, respectively. Next, we get a reference to the underlying buffer of the MLMultiArray and, finally, we create two arrays to store the bounds and the associated confidence values.
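To illustrate the arithmetic we will be relying on, a three-dimensional index of (channel, row, col) maps to a flat offset into the buffer as shown in the following tiny helper; it is for illustration only and is not part of the project:

// Illustration only: the flat offset into the MLMultiArray's buffer for a
// given (channel, row, col) index, using the strides defined previously.
func flatOffset(channel: Int, row: Int, col: Int) -> Int {
    return channel * gridStride + row * rowStride + col * colStride
}
// A value would then be read with arrayPointer[flatOffset(...)]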

Next, we want to iterate through the model's output and process each of the grid cells and their subsequent anchor boxes independently; we do this by using three nested loops and then calculating the relevant index. Let's do that by adding the following snippet:

for row in 0..<Int(gridSize.height) {
for col in 0..<Int(gridSize.width) {
for b in 0..<numberOfAnchorBoxes {

let gridOffset = row * rowStride + col * colStride
let anchorBoxOffset = b * (numberOfClasses + numberOfAnchorBoxes)
// TODO calculate the confidence of each class, ignoring if under threshold
}
}
}

The important values here are gridOffset and anchorBoxOffset; gridOffset gives us the relevant offset for the specific grid cell (as the name implies), while anchorBoxOffset gives us the index of the associated anchor box. Now that we have these values, we can access each of the elements using the index (anchorBoxOffset + INDEX_TO_VALUE) * gridStride + gridOffset, where INDEX_TO_VALUE is the relevant value within the anchor box vector we want to access, as illustrated in this diagram:

Now that we know how to access each bounding box for each grid cell in our buffer, let's use this to find the most probable class and add our first test: ignoring any prediction that doesn't meet our threshold (defined as a method parameter with a default value of 0.3: objectThreshold: Float = 0.3). Add the following code, replacing the comment // TODO calculate the confidence of each class, ignoring if under threshold, as seen previously:

let confidence = sigmoid(x: Float(arrayPointer[(anchorBoxOffset + 4) * gridStride + gridOffset]))

var classes = Array<Float>(repeating: 0.0, count: numberOfClasses)
for c in 0..<numberOfClasses{
classes[c] = Float(arrayPointer[(anchorBoxOffset + 5 + c) * gridStride + gridOffset])
}
classes = softmax(z: classes)

let classIdx = classes.argmax
let classScore = classes[classIdx]
let classConfidence = classScore * confidence

if classConfidence < objectThreshold{
continue
}

// TODO obtain bounding box and transform to image dimensions

In the preceding code snippet, we first obtain the probability of an object being present and store it in the constant confidence. Then, we populate an array with the probabilities of all the classes, before applying a softmax across them. This squashes the values so that they sum to 1.0, essentially giving us a probability distribution across all classes.

We then find the class with the largest probability and multiply its score by our confidence constant, which gives us the class confidence we will threshold against and use during non-max suppression, ignoring the prediction if it doesn't meet our threshold.

Before continuing with the procedure, I want to take a quick detour to highlight a couple of the methods used in the preceding snippet, namely the softmax method and the argmax property of the classes array. Softmax is a generalization of the logistic function that squashes a vector of numbers so that its values add up to 1; it's an activation function commonly used in multi-class classification problems, where the result is interpreted as the likelihood of each class, typically taking the class with the largest value as the predicted class (within a threshold). The implementation can be found in the Math.swift file, which makes use of the Accelerate framework to improve performance. The equation and implementation are shown here for completeness, but the details are left for you to explore:
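For a vector z with K elements, the softmax of the i-th element is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$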

Here, we use a slightly modified version of the equation shown previously; in practice, calculating softmax can be problematic if any of the values are very large: applying the exponential can cause them to explode, and dividing by a huge value can cause arithmetic problems. To avoid this, it is best practice to subtract the maximum value from all the elements first.
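That is, rather than exponentiating the raw values, we compute the equivalent, numerically safer form:

$$\text{softmax}(z)_i = \frac{e^{z_i - \max(z)}}{\sum_{j=1}^{K} e^{z_j - \max(z)}}$$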

Because there are quite a few functions for this operation, let's build it up piece by piece, from the inside out. The following is the function that performs element-wise subtraction:

/**
Subtract a scalar c from a vector x
@param x Vector x.
@param c Scalar c.
@return A vector containing the difference of the vector and the scalar
*/
public func sub(x: [Float], c: Float) -> [Float] {
var result = (1...x.count).map{_ in c}
catlas_saxpby(Int32(x.count), 1.0, x, 1, -1.0, &result, 1)
return result
}

Next, the function that computes the element-wise exponential for an array:

/**
Perform an elementwise exponentiation on a vector
@param x Vector x.
@returns A vector containing x exponentiated elementwise.
*/
func exp(x: [Float]) -> [Float] {
var results = [Float](repeating: 0.0, count: x.count)
vvexpf(&results, x, [Int32(x.count)])
return results
}

Now, the function to perform summation on an array, as follows:

/**
Compute the vector sum of a vector
@param x Vector.
@returns A single precision vector sum.
*/
public func sum(x: [Float]) -> Float {
return cblas_sasum(Int32(x.count), x, 1)
}

This is the last function used by the softmax function! This will be responsible for performing element-wise division for a given scalar, as follows:

 /**
Divide a vector x by a scalar c
@param x Vector x.
@param c Scalar c.
@return A vector containing x divided elementwise by the scalar c.
*/
public func div(x: [Float], c: Float) -> [Float] {
let divisor = [Float](repeating: c, count: x.count)
var result = [Float](repeating: 0.0, count: x.count)
vvdivf(&result, x, divisor, [Int32(x.count)])
return result
}

Finally, we have the softmax function itself (using the max trick described previously):

/**
Softmax function
@param z A vector z.
@return A vector y = (e^z / sum(e^z))
*/
func softmax(z: [Float]) -> [Float] {
let x = exp(x:sub(x:z, c: z.maxValue))
return div(x:x, c: sum(x:x))
}

In addition to the preceding functions, it uses an extension property, maxValue, of Swift's array class; this extension also includes the argmax property alluded to previously. So, we will present both together in the following snippet, found in the Array+Extension.swift file. Before presenting the code, just a reminder about the function of the argmax property—its purpose is to return the index of the largest value within the array, a common method available in the Python package NumPy:

extension Array where Element == Float{

/**
@return index of the largest element in the array
**/
var argmax : Int {
get{
precondition(self.count > 0)

let maxValue = self.maxValue
for i in 0..<self.count{
if self[i] == maxValue{
return i
}
}
return -1
}
}

/**
Find the maximum value in array
*/
var maxValue : Float{
get{
let len = vDSP_Length(self.count)

var max: Float = 0
vDSP_maxv(self, 1, &max, len)

return max
}
}
}

Let's now turn our attention back to the parsing of the model's output and extracting the detected objects and associated bounding boxes. Within the loop, we now have a prediction we are somewhat confident with, having passed our threshold filter. The next task is to extract and transform the bounding box of the predicted object. Add the following code, replacing the line // TODO obtain bounding box and transform to image dimensions:

let tx = CGFloat(arrayPointer[anchorBoxOffset * gridStride + gridOffset])
let ty = CGFloat(arrayPointer[(anchorBoxOffset + 1) * gridStride + gridOffset])
let tw = CGFloat(arrayPointer[(anchorBoxOffset + 2) * gridStride + gridOffset])
let th = CGFloat(arrayPointer[(anchorBoxOffset + 3) * gridStride + gridOffset])

let cx = (sigmoid(x: tx) + CGFloat(col)) / gridSize.width
let cy = (sigmoid(x: ty) + CGFloat(row)) / gridSize.height
let w = CGFloat(anchors[2 * b + 0]) * exp(tw) / gridSize.width
let h = CGFloat(anchors[2 * b + 1]) * exp(th) / gridSize.height

// TODO create a ObjectBounds instance and store it in our array of candidates

We start by getting the first four values from the grid cell's anchor box segment; these give us the center position and size relative to the grid. The next block is responsible for transforming these values from the grid coordinate system to the image coordinate system. For the center position, we pass the predicted value through a sigmoid function, keeping it between 0.0 and 1.0, offset it by the relevant column (or row), and finally divide by the grid size (13). Similarly for the dimensions, we take the associated anchor box dimensions, multiply them by the exponential of the predicted values, and then divide by the grid size.
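Written out, the decoding performed by the preceding snippet is as follows, where c_x and c_y are the cell's column and row, p_w and p_h are the anchor's width and height, and G = 13 is the grid size:

$$b_x = \frac{\sigma(t_x) + c_x}{G}, \quad b_y = \frac{\sigma(t_y) + c_y}{G}, \quad b_w = \frac{p_w \, e^{t_w}}{G}, \quad b_h = \frac{p_h \, e^{t_h}}{G}$$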

As we have done previously, I now present the implementation of the sigmoid function for reference, which can be found in the Math.swift file. The equation is shown as follows:
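$$\sigma(x) = \frac{1}{1 + e^{-x}}$$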

/**
A sigmoid function
@param x Scalar
@return 1 / (1 + exp(-x))
*/
public func sigmoid(x: CGFloat) -> CGFloat {
return 1 / (1 + exp(-x))
}

The final chunk of code simply creates an instance of ObjectBounds, passing in the transformed bounding box and the associated DetectableObject (found by filtering on the class index). Add the following code, replacing the comment // TODO create a ObjectBounds instance and store it in our array of candidates:

guard let detectableObject = DetectableObject.objects.filter(
{$0.classIndex == classIdx}).first else{
continue
}

let objectBounds = ObjectBounds(
object: detectableObject,
origin: CGPoint(x: cx - w/2, y: cy - h/2),
size: CGSize(width: w, height: h))

objectsBounds.append(objectBounds)
objectConfidences.append(classConfidence)

In addition to storing the ObjectBounds, we also store the class confidence, which will be used when we come to implementing non-max suppression.

This completes the functionality required within the nested loops; by the end of this process, we have an array populated with our candidate detected objects. Our next task will be to filter them. Near the end of the detectObjectsBounds method, add the following statement (outside any loops):

return self.filterDetectedObjects(
objectsBounds: objectsBounds,
objectsConfidence: objectConfidences)

Here, we are simply returning the results from the filterDetectedObjects method, which we will now turn our attention to. The method has been stubbed out but is currently empty of functionality, as follows:

func filterDetectedObjects(
objectsBounds:[ObjectBounds],
objectsConfidence:[Float],
nmsThreshold : Float = 0.3) -> [ObjectBounds]?{

// If there are no bounding boxes do nothing
guard objectsBounds.count > 0 else{
return []
}
// TODO implement Non-Max Suppression

return nil
}

Our job will be to implement the non-max suppression algorithm; just to recap, the algorithm can be described as follows:

  1. Order the detected boxes from most confident to least
  2. While valid boxes remain, do the following:
    1. Pick the box with the highest confidence value (the top of our ordered array)
    2. Iterate through all the remaining boxes, discarding any with an IoU value greater than a predefined threshold

Let's start by creating a clone of the confidence array passed into the method; we will use this to obtain an array of sorted indices, as well as to flag any boxes that are sufficiently overlapped by a preceding box, which is done by simply setting the box's confidence value to 0. Add the following statements to do just this, along with creating the sorted array of indices, replacing the comment // TODO implement Non-Max Suppression:

var detectionConfidence = objectsConfidence.map{
(confidence) -> Float in
return confidence
}

let sortedIndices = detectionConfidence.indices.sorted {
detectionConfidence[$0] > detectionConfidence[$1]
}

var bestObjectsBounds = [ObjectBounds]()

// TODO iterate through each box

As mentioned previously, we start by cloning the confidence array, assigning it to the variable detectionConfidence. Then, we sort the indices in descending order and, finally, create an array to store the boxes we want to keep and return.

Next, we will create the loops that embody the bulk of the algorithm, including picking the next box with the highest confidence and storing it in our bestObjectsBounds array. Add the following code, replacing the comment // TODO iterate through each box:

for i in 0..<sortedIndices.count{
let objectBounds = objectsBounds[sortedIndices[i]]

guard detectionConfidence[sortedIndices[i]] > 0 else{
continue
}

bestObjectsBounds.append(objectBounds)

for j in (i+1)..<sortedIndices.count{
guard detectionConfidence[sortedIndices[j]] > 0 else {
continue
}
let otherObjectBounds = objectsBounds[sortedIndices[j]]

// TODO calculate IoU and compare against our threshold
}
}

Most of the code should be self-explanatory; what's worth noting is that within each loop, we test that the associated box's confidence is greater than 0. As mentioned before, we use this to indicate that an object has been discarded due to being sufficiently overlapped by a box with higher confidence.

What remains is calculating the IoU between objectBounds and otherObjectBounds, and invalidating otherObjectBounds if that IoU exceeds our threshold, nmsThreshold. Replace the comment // TODO calculate IoU and compare against our threshold, with this:

if Float(objectBounds.bounds.computeIOU(
other: otherObjectBounds.bounds)) > nmsThreshold{
detectionConfidence[sortedIndices[j]] = 0.0
}

Here, we are using a CGRect extension method, computeIOU, to handle the calculation. Let's have a peek at this, implemented in the file CGRect+Extension.swift:

extension CGRect{

...

var area : CGFloat{
get{
return self.size.width * self.size.height
}
}

func computeIOU(other:CGRect) -> CGFloat{
return self.intersection(other).area / self.union(other).area
}
}

Thanks to the existing intersection and union of the CGRect structure, this method is nice and concise.
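For reference, the intersection over union of two rectangles A and B is simply:

$$IoU(A, B) = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$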

One final thing to do before we finish with the YOLOFacade class as well as the YOLO algorithm is to return the results. At the bottom of the filterDetectedObjects method, return the array bestObjectsBounds; with that done, we can now turn our attention to the last piece of functionality before implementing our intelligent search photo application.

This chapter does a good job of highlighting that most of the effort of integrating ML into your applications goes into preprocessing the data before feeding it to the model, and interpreting the model's output. The Vision framework does a good job of alleviating the preprocessing tasks, but there is still significant effort in handling the output. Fortunately, no doubt because object detection is compelling for many applications, Apple has added a new observation type explicitly for object detection called VNRecognizedObjectObservation. Although we don't cover it here, I encourage you to review the official documentation at https://developer.apple.com/documentation/vision/vnrecognizedobjectobservation.

The next piece of functionality is concerned with evaluating a cost for each set of detected objects with respect to the user's search criteria; by this, I mean filtering and sorting the photos so that the results are relevant to what the user sought. As a reminder, the search criteria is defined by an array of ObjectBounds, collectively describing the objects the user wants within an image, their relative positions, and their sizes relative to each other and to the image itself. The following figure shows how the user defines their search within our application:

Here, we will implement only two of the four evaluations, but it should provide a sufficient base for you to implement the remaining two yourself.

The cost evaluation is performed within the PhotoSearcher class once the YOLOFacade has returned the detected objects for all of the images. This code resides in the asyncSearch method (within the PhotoSearcher.swift file), highlighted in the following code snippet:

public func asyncSearch(
searchCriteria : [ObjectBounds]?,
costThreshold : Float = 5){
DispatchQueue.global(qos: .background).async {
let photos = self.getPhotosFromPhotosLibrary()

let unscoredSearchResults = self.detectObjects(photos: photos)

var sortedSearchResults : [SearchResult]?

if let unscoredSearchResults = unscoredSearchResults{
sortedSearchResults = self.calculateCostForObjects(
detectedObjects:unscoredSearchResults,
searchCriteria: searchCriteria).filter({
(searchResult) -> Bool in
return searchResult.cost < costThreshold
}).sorted(by: { (a, b) -> Bool in
return a.cost < b.cost
})
}

DispatchQueue.main.sync {
self.delegate?.onPhotoSearcherCompleted(
status: 1,
result: sortedSearchResults)
}
}
}

The calculateCostForObjects method takes in the search criteria and the results from YOLOFacade, and returns an array of SearchResult instances with their cost properties set; these are then filtered and sorted before being returned to the delegate.

Let's jump into the calculateCostForObjects method and discuss what we mean by cost; the code of the method calculateCostForObjects is as follows:

 private func calculateCostForObjects(
detectedObjects:[SearchResult],
searchCriteria:[ObjectBounds]?) -> [SearchResult]{

guard let searchCriteria = searchCriteria else{
return detectedObjects
}

var result = [SearchResult]()

for searchResult in detectedObjects{
let cost = self.costForObjectPresences(
detectedObject: searchResult,
searchCriteria: searchCriteria) +
self.costForObjectRelativePositioning(
detectedObject: searchResult,
searchCriteria: searchCriteria) +
self.costForObjectSizeRelativeToImageSize(
detectedObject: searchResult,
searchCriteria: searchCriteria) +
self.costForObjectSizeRelativeToOtherObjects(
detectedObject: searchResult,
searchCriteria: searchCriteria)

let searchResult = SearchResult(
image: searchResult.image,
detectedObjects:searchResult.detectedObjects,
cost: cost)

result.append(searchResult)
}

return result
}

A SearchResult incurs a cost each time it differs from the user's search criteria, meaning that the results with the lowest cost are those that best match the search criteria. We perform cost evaluation using four different heuristics; each method is responsible for adding its calculated cost to each result. Here, we will only implement costForObjectPresences and costForObjectRelativePositioning, leaving the remaining two as an exercise for you.

Let's jump straight in and start implementing the costForObjectPresences method; at the moment, it's nothing more than a stub, as follows:

private func costForObjectPresences(
detectedObject:SearchResult,
searchCriteria:[ObjectBounds],
weight:Float=2.0) -> Float{

var cost : Float = 0.0

// TODO implement cost function for object presence

return cost * weight
}

Before writing the code, let's quickly discuss what we are evaluating for. Perhaps a better name for this function would have been costForDifference, as we not only want to check that the image contains the objects declared in the search criteria, but we equally want to increase the cost for any additional objects. That is, if the user searches for just two dogs but a photo contains three dogs, or two dogs and a cat, we want to increase the cost for these additional objects so that we favor the photo most similar to the search criteria (just two dogs).

To calculate this, we simply need to find the absolute difference between the two arrays; to do this, we first create a dictionary of counts for all classes in both detectedObject and searchCriteria. The dictionary's key will be the object's label and the corresponding value will be the count of that object within the array. The following figure illustrates these arrays and the formula used to calculate the cost:
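In terms of these counts, the cost computed below is the weighted sum of the per-class absolute differences (with w being the method's weight parameter):

$$cost = w \sum_{c \,\in\, \text{classes}} \left| n_{\text{search}}(c) - n_{\text{detected}}(c) \right|$$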

Let's now implement it; add the following code to do this, replacing the comment // TODO implement cost function for object presence:

var searchObjectCounts = searchCriteria.map {
(detectedObject) -> String in
return detectedObject.object.label
}.reduce([:]) {
(counter:[String:Float], label) -> [String:Float] in
var counter = counter
counter[label] = counter[label]?.advanced(by: 1) ?? 1
return counter
}

var detectedObjectCounts = detectedObject.detectedObjects.map {
(detectedObject) -> String in
return detectedObject.object.label
}.reduce([:]) {
(counter:[String:Float], label) -> [String:Float] in
var counter = counter
counter[label] = counter[label]?.advanced(by: 1) ?? 1
return counter
}

// TODO accumulate cost based on the difference

Now, with our count dictionaries created and populated, it's simply a matter of iterating over all available classes (using the items in DetectableObject.objects) and calculating the cost based on the absolute difference between the two. Add the following code, which does this, by replacing the comment // TODO accumulate cost based on the difference:

for detectableObject in DetectableObject.objects{
let label = detectableObject.label

let searchCount = searchObjectCounts[label] ?? 0
let detectedCount = detectedObjectCounts[label] ?? 0

cost += abs(searchCount - detectedCount)
}

The result of this is a cost that is larger for images that differ the most from the search criteria; the last thing worth noting is that the cost is multiplied by a weight (a function parameter) before being returned. Each evaluation method has a weight parameter, which allows for easy tuning of the search (at either design time or runtime), giving preference to one evaluation over another.

The next, and last, cost evaluation function we are going to implement is the method costForObjectRelativePositioning; the stub of this method is as follows:

 private func costForObjectRelativePositioning(
detectedObject:SearchResult,
searchCriteria:[ObjectBounds],
weight:Float=1.5) -> Float{

var cost : Float = 0.0

// TODO implement cost function for relative positioning

return cost * weight
}

As we did before, let's quickly discuss the motivation behind this evaluation and how we plan to implement it. This method is used to favor items that match the composition of the user's search; this allows our search to surface images that closely resemble the arrangement the user is searching for. For example, the user may be looking for an image or images where two dogs are sitting next to each other, side by side, or they may want an image with two dogs sitting next to each other on a sofa.

There are no doubt many approaches you could take for this, and it's perhaps a use case for a neural network, but the approach taken here is the simplest I could think of to avoid having to explain complicated code; the algorithm used is described as follows:

  1. For each object (a) of type ObjectBounds within searchCriteria
    1. Find the closest object (b) in proximity (still within searchCriteria)
    2. Create a normalized direction vector from a to b
    3. Find the matching object a' (the same class) within the detectedObject
      1. Search all other objects (b') in detectedObject that have the same class as b
        1. Create a normalized direction vector from a' to b'
        2. Calculate the dot product between the two vectors (angle); in this case, our vectors are a->b and a'->b'
    4. Using the pair a' and b' with the lowest dot product, increment the cost by how much its angle differs from that of the search criteria pair

Essentially, what we are doing is finding two matching pairs from the searchCriteria and detectedObject arrays, and calculating the cost based on the difference in the angles.

A direction vector of two objects is calculated by subtracting one's position from the other and then normalizing it. The dot product can then be used on two (normalized) vectors to find their angle, where 1.0 would be returned if the vectors are pointing in the same direction, 0.0 if they are perpendicular, and -1.0 if pointing in opposite directions.
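For two normalized (unit-length) 2D vectors a and b, the dot product reduces to the cosine of the angle between them:

$$a \cdot b = a_x b_x + a_y b_y = \cos(\theta)$$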

The following figure presents part of this process; we first find an object pair in close proximity within the search criteria. After calculating the dot product, we iterate over all the objects detected in the image and find the most suitable pair; "suitable" here means the same object type and the closest angle to the search criteria within the possible matching pairs:

Once comparable pairs are found, we calculate the cost based on the difference in angle, as we will soon see. But we are getting a little ahead of ourselves; we first need a way to find the closest object. Let's do this using a nested function we can call within our costForObjectRelativePositioning method. Add the following code, replacing the comment // TODO implement cost function for relative positioning:

func indexOfClosestObject(
objects:[ObjectBounds],
forObjectAtIndex i:Int) -> Int{

let searchACenter = objects[i].bounds.center

var closestDistance = Float.greatestFiniteMagnitude
var closestObjectIndex : Int = -1

for j in 0..<objects.count{
guard i != j else{
continue
}

let searchBCenter = objects[j].bounds.center
let distance = Float(searchACenter.distance(other: searchBCenter))
if distance < closestDistance{
closestObjectIndex = j
closestDistance = distance
}
}

return closestObjectIndex
}

// TODO Iterate over all items in the searchCriteria array

The preceding function will be used to find the closest object, given an array of ObjectBounds and the index of the object we are searching against. From there, it simply iterates over all of the items in the array, returning the one that is, well, closest.

With our helper function now implemented, let's create the loop that will inspect the search item pair from the user's search criteria. Append the following code to the costForObjectRelativePositioning method, replacing the comment // TODO Iterate over all items in the searchCriteria array:

for si in 0..<searchCriteria.count{
let closestObjectIndex = indexOfClosestObject(
objects: searchCriteria,
forObjectAtIndex: si)

if closestObjectIndex < 0{
continue
}

// Get object types
let searchAClassIndex = searchCriteria[si].object.classIndex
let searchBClassIndex = searchCriteria[closestObjectIndex].object.classIndex

// Get centers of objects
let searchACenter = searchCriteria[si].bounds.center
let searchBCenter = searchCriteria[closestObjectIndex].bounds.center

// Calculate the normalised vector from A -> B
let searchDirection = (searchACenter - searchBCenter).normalised

// TODO Find matching pair
}

We start by searching for the closest object to the current object, jumping to the next item if nothing is found. Once we have our search pair, we proceed to calculate the direction by subtracting the pair's center from the first bound's center and normalizing the result.

We now need to find all detected objects of both classes, so that we can evaluate each combination to find the best match. Before that, let's get all the detected objects whose class index matches searchAClassIndex or searchBClassIndex; add the following code, replacing the comment // TODO Find matching pair:

// Find comparable objects in detected objects
let detectedA = detectedObject.detectedObjects.filter {
(objectBounds) -> Bool in
objectBounds.object.classIndex == searchAClassIndex
}

let detectedB = detectedObject.detectedObjects.filter {
(objectBounds) -> Bool in
objectBounds.object.classIndex == searchBClassIndex
}

// Check that we have matching pairs
guard detectedA.count > 0, detectedB.count > 0 else{
continue
}

// TODO Search for the most suitable pair

If we are unable to find a matching pair, we continue to the next item, knowing that a cost has already been added for the mismatch in objects of both arrays. Next, we iterate over all pairs. For each pair, we calculate the normalized direction vector and then the dot product against our searchDirection vector, taking the one that has the closest dot product (closest in angle). Add the following code in place of the comment // TODO Search for the most suitable pair:

var closestDotProduct : Float = Float.greatestFiniteMagnitude
for i in 0..<detectedA.count{
for j in 0..<detectedB.count{
if detectedA[i] == detectedB[j]{
continue
}

let detectedDirection = (detectedA[i].bounds.center - detectedB[j].bounds.center).normalised
let dotProduct = Float(searchDirection.dot(other: detectedDirection))
if closestDotProduct > 10 ||
(dotProduct < closestDotProduct &&
dotProduct >= 0) {
closestDotProduct = dotProduct
}
}
}

// TODO Add cost

Similar to what we did with our search pair, we calculate the direction vector by subtracting the pair's center positions and then normalize the result. Then, with the two vectors searchDirection and detectedDirection, we calculate the dot product, keeping reference to it if it is the first or lowest dot product so far.

There is just one last thing we need to do for this method, and this project. But before doing so, let's take a little detour and look at a couple of extensions made to CGPoint, specifically the dot method and normalised property used previously. You can find these extensions in the CGPoint+Extension.swift file. As I did previously, I will list the code for reference rather than describing the details, most of which we have already touched upon:

extension CGPoint{

var length : CGFloat{
get{
return sqrt(
self.x * self.x + self.y * self.y
)
}
}

var normalised : CGPoint{
get{
return CGPoint(
x: self.x/self.length,
y: self.y/self.length)
}
}

func distance(other:CGPoint) -> CGFloat{
let dx = (self.x - other.x)
let dy = (self.y - other.y)

return sqrt(dx*dx + dy*dy)
}

func dot(other:CGPoint) -> CGFloat{
return (self.x * other.x) + (self.y * other.y)
}

static func -(left: CGPoint, right: CGPoint) -> CGPoint{
return CGPoint(
x: left.x - right.x,
y: left.y - right.y)
}
}

Now, back to the costForObjectRelativePositioning method to finish it off, and with it, the project. Our final task is to add to the cost; this is done simply by subtracting the stored closestDotProduct from 1.0 (remembering that we want to increase the cost for larger differences, and that the dot product of two normalized vectors pointing in the same direction is 1.0) and ensuring that the value is positive by wrapping it in an abs function. Let's do that now; add the following code, replacing the comment // TODO Add cost:

cost += abs((1.0-closestDotProduct))

With that done, we have finished this method, and the coding for this chapter. Well done! It's time to test it out; build and run the project to see your hard work in action. Shown here are a few searches and their results:

Although the YOLO algorithm is performant and feasible for near real-time use, our example is far from optimized and unlikely to perform well on large sets of photos. With the release of Core ML 2, Apple provides one avenue we can use to make our process more efficient. This will be the topic of the next section before wrapping up.
