© Özgür Sahin 2021
Ö. Sahin, Develop Intelligent iOS Apps with Swift, https://doi.org/10.1007/978-1-4842-6421-8_2

2. Introduction to Apple ML Tools

Özgür Sahin, Feneryolu Mh. Goztepe, Istanbul, Turkey

In the software and machine learning world, it’s very important to learn and try the latest tools. If you don’t know how to use these productivity tools, you may waste a lot of time. This chapter will introduce the tools Apple provides to build ML applications easily for iOS developers. The frameworks and tools introduced in this chapter are Vision, VisionKit, Natural Language, Speech, Core ML, Create ML, and Turi Create. We will learn what capabilities these tools have to offer and what kind of applications we can build using them.

Vision

The Vision framework deals with images and videos. It offers a variety of computer vision and machine learning capabilities to apply to visual data. Some capabilities of the Vision framework include face detection, body detection, animal detection, text detection and recognition, barcode recognition, object tracking, image alignment, and so on. I will cover the main features and methods, including some hidden gems of iOS that you may not have heard of. As this book focuses on text processing, it won't cover the details of image processing. If you need more information about computer vision algorithms, you can find details and sample projects on the Apple Developer website.

Face and Body Detection

Vision has several request types for detecting faces and humans in images. I will mention some of these requests here to recall what Apple provides with built-in APIs. VNDetectFaceRectanglesRequest is used for face detection; it returns the rectangles of the faces detected in a given image and also provides each face's yaw and roll angles. VNDetectFaceLandmarksRequest gives you the locations of facial landmarks such as the mouth, eyes, face contour, eyebrows, nose, and lips. VNDetectFaceCaptureQualityRequest measures the capture quality of a face in an image, which you can use in selfie editing applications. There is a sample project, namely, "Selecting a Selfie Based on Capture Quality," which compares face quality across images.

VNDetectHumanRectanglesRequest detects humans and returns the rectangles that locate humans in images.

To use these requests, you create a VNImageRequestHandler and a specific request type. Pass the request to the handler's perform method as shown in Listing 2-1. This executes the request on an image buffer and returns the results. The sample runs a face capture quality request on a given image.
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                    orientation: .leftMirrored,
                                    options: requestOptions)
let faceDetectionRequest = VNDetectFaceCaptureQualityRequest()
do {
    try handler.perform([faceDetectionRequest])
    guard let faceObservations = faceDetectionRequest.results as? [VNFaceObservation] else { return }
    // Use faceObservations (each includes a faceCaptureQuality value) here.
} catch {
    print("Vision error: \(error.localizedDescription)")
}
Listing 2-1

Face Detection Request
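The same handler-and-request pattern applies to human detection. Here is a minimal sketch, assuming a pixelBuffer (CVPixelBuffer) is available as in Listing 2-1; the results can be read as VNDetectedObjectObservation values, which carry the bounding box of each detected human.

import Vision

let humanHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
let humanDetectionRequest = VNDetectHumanRectanglesRequest()
do {
    try humanHandler.perform([humanDetectionRequest])
    let humanObservations = humanDetectionRequest.results as? [VNDetectedObjectObservation]
    // Each boundingBox is a normalized rectangle (0...1) in image coordinates.
    humanObservations?.forEach { print($0.boundingBox) }
} catch {
    print("Vision error: \(error.localizedDescription)")
}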

Image Analysis

With the built-in image analysis capabilities, you can create applications that understand what is in the image. You can detect and locate rectangles, faces, barcodes, and text in images using the Vision framework. If you want to dig deeper, Apple offers a sample project where they show how to detect text and QR codes in images.
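Barcode and QR code detection follows the same request pattern. Here is a minimal sketch, assuming cgImage is a CGImage you already have:

import Vision

let barcodeHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])
let barcodeRequest = VNDetectBarcodesRequest()
do {
    try barcodeHandler.perform([barcodeRequest])
    let barcodes = barcodeRequest.results as? [VNBarcodeObservation]
    for barcode in barcodes ?? [] {
        // payloadStringValue holds the decoded content (e.g., a QR code's URL).
        print(barcode.symbology, barcode.payloadStringValue ?? "")
    }
} catch {
    print("Vision error: \(error)")
}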

Apple also offers a built-in ML model that can classify 1303 classes. It has many classes from vehicles to animals and objects. Some examples are acrobat, airplane, biscuit, bear, bed, kitchen sink, tuna, volcano, zebra, and so on.

You can get the list of these classes by calling the knownClassifications method as shown in Listing 2-2.
let handler = VNImageRequestHandler(cgImage: image.cgImage!, options: [:])
// Pass the classify request's own revision constant.
let classes = try VNClassifyImageRequest.knownClassifications(forRevision: VNClassifyImageRequestRevision1)
let classIdentifiers = classes.map { $0.identifier }
Listing 2-2

Built-in Image Classes

I created a Swift playground showing how to use the built-in classifier.1 Apple made it super-simple to classify images. The sample code in Listing 2-3 is all you need to classify images.
import Vision

let handler = VNImageRequestHandler(cgImage: image, options: [:])
let request = VNClassifyImageRequest()
try? handler.perform([request])
let observations = request.results as? [VNClassificationObservation]
Listing 2-3

Image Classification
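Each VNClassificationObservation carries an identifier and a confidence score. As a short follow-up sketch continuing from Listing 2-3, you can filter and sort the observations to keep only the most likely labels; the 0.3 threshold here is just an illustrative value:

// Keep reasonably confident labels, sort by confidence, and print the top five.
let topResults = (observations ?? [])
    .filter { $0.confidence > 0.3 }
    .sorted { $0.confidence > $1.confidence }
    .prefix(5)
topResults.forEach { print($0.identifier, $0.confidence) }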

Another capability of the Vision framework is image similarity detection. This can be achieved using VNGenerateImageFeaturePrintRequest. It creates a feature print of the image, and you can then compare feature prints using the computeDistance method. The code sample in Listing 2-4 shows how to use this method. Again, we create a VNImageRequestHandler and a request and then call perform to execute the request.
func featureprintObservationForImage(atURL url: URL) -> VNFeaturePrintObservation? {
    let requestHandler = VNImageRequestHandler(url: url, options: [:])
    let request = VNGenerateImageFeaturePrintRequest()
    do {
        try requestHandler.perform([request])
        return request.results?.first as? VNFeaturePrintObservation
    } catch {
        print("Vision error: \(error)")
        return nil
    }
}
Listing 2-4

Create a Feature Print of an Image

This function creates a feature print of an image. A feature print is a mathematical representation of an image that we can use to compare it with other images. Listing 2-5 shows how to use feature prints to compare images.
let apple1 = featureprintObservationForImage(atURL: Bundle.main.url(forResource: "apple1", withExtension: "jpg")!)
let apple2 = featureprintObservationForImage(atURL: Bundle.main.url(forResource: "apple2", withExtension: "jpg")!)
let pear = featureprintObservationForImage(atURL: Bundle.main.url(forResource: "pear", withExtension: "jpg")!)
var distance = Float(0)
try apple1!.computeDistance(&distance, to: apple2!)
var distance2 = Float(0)
try apple1!.computeDistance(&distance2, to: pear!)
Listing 2-5

Feature Print

Here, I compare the two apple images with each other and with the pear image. The image distance results are shown in Figure 2-1.
Figure 2-1. Comparing Image Distances

You can find the full code sample of the Swift playground in the link found in the corresponding footnote.2

Text Detection and Recognition

To detect and recognize text in images, you don’t need any third-party framework. Apple offers these capabilities with the Vision framework.

You can use VNDetectTextRectanglesRequest to detect text areas in an image. It returns rectangular bounding boxes with an origin and size. If you want a separate box for each character, set the request's reportCharacterBoxes property to true, as shown in the sketch below.
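Here is a minimal sketch of this request, assuming cgImage is the image to analyze; each VNTextObservation contains the bounding box of a text area and, optionally, the individual character boxes:

import Vision

let textRectanglesRequest = VNDetectTextRectanglesRequest()
textRectanglesRequest.reportCharacterBoxes = true  // also return a box for each character
let textHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])
do {
    try textHandler.perform([textRectanglesRequest])
    let textObservations = textRectanglesRequest.results as? [VNTextObservation]
    for observation in textObservations ?? [] {
        print(observation.boundingBox)                                 // the text area
        observation.characterBoxes?.forEach { print($0.boundingBox) }  // individual characters
    }
} catch {
    print("Vision error: \(error)")
}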

The Vision framework also provides text recognition (optical character recognition) capability which you can use to process text from scanned documents or business cards.

Figure 2-2 shows the text recognition that runs on a playground.
Figure 2-2. Text Recognition Results

Similar to other Vision functions, to recognize text in images, we create a VNRecognizeTextRequest as shown in Listing 2-6 and perform the request using a VNImageRequestHandler. The text request has a completion closure that is called when processing finishes; it returns an observation for each text rectangle it detects.
var textResults = ""  // accumulates the recognized text
let textRecognitionRequest = VNRecognizeTextRequest { (request, error) in
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        print("The observations are of an unexpected type.")
        return
    }
    let maximumCandidates = 1
    for observation in observations {
        guard let candidate = observation.topCandidates(maximumCandidates).first else { continue }
        textResults += candidate.string + " "
    }
}
let requestHandler = VNImageRequestHandler(cgImage: image, options: [:])
do {
    try requestHandler.perform([textRecognitionRequest])
} catch {
    print(error)
}
Listing 2-6

Text Recognition

The text recognition request has a recognitionLevel property that trades off accuracy against speed. You can set it to .accurate or .fast.
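A minimal sketch of tuning the request from Listing 2-6:

textRecognitionRequest.recognitionLevel = .accurate    // or .fast for lower latency
textRecognitionRequest.usesLanguageCorrection = true   // apply language-based correction to the results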

Other Capabilities of Vision

The Vision framework provides other capabilities like image saliency analysis, horizon detection, and object recognition. With image saliency analysis, iOS lets us detect which parts of the image draw people’s attention. It also offers object-based attention saliency which detects foreground objects. You can use these features to crop images automatically or generate heat maps. These two types of requests are VNGenerateAttentionBasedSaliencyImageRequest (attention based) and VNGenerateObjectnessBasedSaliencyImageRequest (object based). Similar to other Vision APIs, you create a request and perform it using the image request handler as shown in Listing 2-7.
let request = VNGenerateAttentionBasedSaliencyImageRequest()
try? requestHandler.perform([request])
Listing 2-7

Image Saliency

Horizon detection lets us determine the horizon angle in the image. With this request (VNDetectHorizonRequest), you can get the image angle and the CGAffineTransform required to fix the image orientation. You can also use VNHomographicImageRegistrationRequest to determine the perspective warp matrix needed to align two images.
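Here is a minimal sketch of horizon detection, assuming cgImage is the photo to analyze:

import Vision

let horizonRequest = VNDetectHorizonRequest()
let horizonHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])
try? horizonHandler.perform([horizonRequest])
if let horizon = horizonRequest.results?.first as? VNHorizonObservation {
    print(horizon.angle)      // the detected horizon angle
    print(horizon.transform)  // a CGAffineTransform that can straighten the image
}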

Another capability of Vision is object recognition. You can use the built-in VNClassifyImageRequest to detect objects, or you can create a custom model using Create ML or Turi Create if you want to train on your own image dataset.

VisionKit

If you have ever used the Notes app on iOS, you may have used the built-in document scanner shown in Figure 2-3. VisionKit lets us use this powerful document scanner in our apps. Implementation is very simple:
  1. Present the document camera as shown in Listing 2-8.
let vc = VNDocumentCameraViewController()
vc.delegate = self
present(vc, animated: true)
Listing 2-8

Instantiate Document Camera

  2. Implement the VNDocumentCameraViewControllerDelegate to receive callbacks as shown in Listing 2-9. It returns an image of each page with the following function.
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                  didFinishWith scan: VNDocumentCameraScan) {
    var scannedImageList = [UIImage]()
    for pageNumber in 0 ..< scan.pageCount {
        let image = scan.imageOfPage(at: pageNumber)
        scannedImageList.append(image)
    }
    // Store or process scannedImageList (e.g., assign it to a property) here.
}
Listing 2-9

Capture Scanned Document Images
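Besides the scan callback, the delegate protocol also has cancellation and failure callbacks, which you will typically implement to dismiss the camera. A short sketch:

func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
    controller.dismiss(animated: true)
}

func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                  didFailWithError error: Error) {
    print("Document scanning failed: \(error.localizedDescription)")
    controller.dismiss(animated: true)
}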

Figure 2-3. Built-in Document Scanner

Natural Language

The Natural Language framework lets you analyze text data and extract knowledge. It provides functions like language identification, tokenization (enumerating words in a string), lemmatization, part-of-speech tagging, and named entity recognition.

Language Identification

Language identification lets you determine the language of the text. We can detect the language of a given text by using the NLLanguageRecognizer class. It supports 57 languages. Check the code in Listing 2-10 to detect the language of a given string.
import NaturalLanguage
let recognizer = NLLanguageRecognizer()
recognizer.processString("hello")
let lang = recognizer.dominantLanguage
Listing 2-10

Language Recognition
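If you need more than the single dominant language, NLLanguageRecognizer can also return language hypotheses with probabilities. A minimal sketch:

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.processString("Guten Morgen, wie geht es dir?")
// Ask for the three most likely languages and their probabilities.
let hypotheses = recognizer.languageHypotheses(withMaximum: 3)
for (language, confidence) in hypotheses {
    print(language.rawValue, confidence)
}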

Tokenization

Before we can perform natural language processing on text, we need to apply some preprocessing to make the data easier for computers to work with. Usually, we need to split the text into words and remove punctuation marks. Apple provides NLTokenizer to enumerate the words, so there's no need to manually parse the spaces between them. Also, some languages like Chinese and Japanese don't use spaces to delimit words; luckily, NLTokenizer handles these cases for you. The code sample in Listing 2-11 shows how to enumerate the words in a given string.
import NaturalLanguage

let text = "A colourful image of blood vessel cells has won this year's Reflections of Research competition, run by the British Heart Foundation"
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text
tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { tokenRange, _ in
    print(text[tokenRange])
    return true
}
Listing 2-11

Enumerating Words

As you see, we import the NaturalLanguage framework and create NLTokenizer by specifying the unit type. This allows us to choose the enumeration unit; we can enumerate documents, words, paragraphs, or sentences. The enumerateTokens function enumerates tokens of the selected type (words, in this case) and calls the closure for each one. In the closure, we print each enumerated word; the result is shown in Figure 2-4.
Figure 2-4. Tokenization
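The same API works at other unit levels. For example, here is a minimal sketch of sentence-level tokenization, reusing the text constant from Listing 2-11:

import NaturalLanguage

let sentenceTokenizer = NLTokenizer(unit: .sentence)
sentenceTokenizer.string = text
sentenceTokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { tokenRange, _ in
    print(text[tokenRange])  // prints each sentence
    return true
}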

Part-of-Speech Tagging

To understand language better, we need to identify words and their functions in a given sentence. Part-of-speech tagging allows us to classify nouns, verbs, adjectives, and other parts of speech in a string. Apple provides NLTagger, a linguistic tagger that analyzes natural language text.

The code sample in Listing 2-12 shows how to detect the tags of the words by using NLTagger. Lexical class is a scheme that classifies tokens according to class: part of speech, type of punctuation, or whitespace. We use this scheme and print each word’s type.
import NaturalLanguage

let text = "The ripe taste of cheese improves with age."
let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .lexicalClass,
                     options: options) { tag, tokenRange in
    if let tag = tag {
        print("\(text[tokenRange]): \(tag.rawValue)")
    }
    return true
}
Listing 2-12

Word Tagging

As you can see in Figure 2-5, it successfully determines the types of words.
Figure 2-5. Determining Word Types

When using NLTagger, depending on what you want to detect, you can specify one or more tag schemes (NLTagScheme) as a parameter. For example, the tokenType scheme classifies words, punctuation, and whitespace, while the lexicalClass scheme classifies word types, punctuation types, and whitespace.

While enumerating the tags, you can skip specific types by setting the options parameter. In the preceding code, punctuation and whitespace are skipped by setting the options to [.omitPunctuation, .omitWhitespace].

NLTagger can detect all of these lexical classes: noun, verb, adjective, adverb, pronoun, determiner, particle, preposition, number, conjunction, interjection, classifier, idiom, otherWord, sentenceTerminator, openQuote, closeQuote, openParenthesis, closeParenthesis, wordJoiner, dash, otherPunctuation, paragraphBreak, and otherWhitespace.

Identifying People, Places, and Organizations

NLTagger also makes it very easy to detect people’s names, places, and organization names in a given text.

Finding this type of data in text-based apps opens new ways to deliver information to users. For example, you could build an app that summarizes a blog post or news article by showing how many times each person, place, and organization is mentioned in that text.

Take a look at Listing 2-13 to see how we can detect these names in a sample sentence.
import NaturalLanguage

let text = "Prime Minister Boris Johnson has urged the EU to re-open the withdrawal deal reached with Theresa May, and to make key changes that would allow it to be passed by Parliament."
let tagger = NLTagger(tagSchemes: [.nameType])
tagger.string = text
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .joinNames]
let tags: [NLTag] = [.personalName, .placeName, .organizationName, .adverb,
                     .pronoun, .determiner, .noun, .interjection]
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .nameType,
                     options: options) { tag, tokenRange in
    if let tag = tag, tags.contains(tag) {
        print("\(text[tokenRange]): \(tag.rawValue)")
    }
    return true
}
Listing 2-13

Identify People and Places

Here we use NLTagger again, but this time we set another option called joinNames, which concatenates first names and surnames. To filter for the tags we care about, such as personal names, places, and organizations, we create an NLTag array and print only the tokens whose tag is in it.

The tags of the words that NLTagger can find are shown in Figure 2-6.
Figure 2-6. Identifying People and Places

As you can see, we can deduce specific knowledge from text using iOS’s Natural Language framework.

NLEmbedding

In ML, an embedding is a mathematical representation of data. In the Natural Language framework, it is a vector representation of a word. After you convert a word to a vector, you can do arithmetic on it: for example, you can calculate the distance between words or sum their vectors. Calculating the distance between words lets you find similar words.

NLEmbedding lets us determine the distance between two strings or find the nearest neighbors of a string in a word set. The higher the similarity of any two words, the smaller the distance between them. The code sample in Listing 2-14 shows how to calculate the distance between words.
import NaturalLanguage

// Calculate the distance between words
let embedding = NLEmbedding.wordEmbedding(for: .english)
let distance1 = embedding?.distance(between: "movie", and: "film")
let distance2 = embedding?.distance(between: "movie", and: "car")
Listing 2-14

Measuring Distance Between Words

In the preceding code, we use wordEmbedding and specify its language. The distance calculation is done with the distance function. The distance between "movie" and "film" is 0.64, while the distance between "movie" and "car" is 1.21. As you see, similar words have a smaller distance between them. With this distance calculation, you can create apps that cluster words by similarity or recommendation apps that detect similar texts or titles. You can even create custom embeddings for any kind of string. For example, you could build embeddings of news titles and recommend new articles based on your users' previous interests. To create custom embeddings, you can use Create ML's MLWordEmbedding and export it as a file to use in your Xcode project. This will be covered later in the book, after we learn Create ML.
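NLEmbedding can also return the nearest neighbors of a word directly, which is handy for the similarity use cases mentioned earlier. A minimal sketch:

import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // Find the five words whose vectors are closest to "movie".
    let neighbors = embedding.neighbors(for: "movie", maximumCount: 5)
    for (word, distance) in neighbors {
        print(word, distance)
    }
}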

Speech

The Speech framework provides speech recognition on live or prerecorded audio data. Using this framework, you can create transcriptions of spoken words in your apps. iOS built-in dictation support also uses speech recognition to convert audio data into text.

With this framework, you can create applications that understand verbal commands, like Siri or Alexa. Apple says on-device speech recognition is available for some languages, but you should always assume that speech recognition requires a network connection, because the framework relies on Apple's servers.

To transcribe audio, you create an SFSpeechRecognizer instance for each language you want to support. Create an SFSpeechRecognizer and an SFSpeechAudioBufferRecognitionRequest, and then call the recognitionTask function, which starts the speech recognition process and returns the results. Here we get the transcription with result.bestTranscription.formattedString, as shown in the code sample in Listing 2-15.
let recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
let recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        self.textView.text = result.bestTranscription.formattedString
    }
}
Listing 2-15

Speech Recognition

If you want the result block to be called with partial transcription results, you can set recognitionRequest.shouldReportPartialResults to true.
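Before starting recognition, your app also needs the user's permission. Here is a minimal sketch of the authorization request, assuming the NSSpeechRecognitionUsageDescription key is set in Info.plist:

import Speech

SFSpeechRecognizer.requestAuthorization { status in
    switch status {
    case .authorized:
        print("Speech recognition authorized")
    default:
        print("Speech recognition not available: \(status)")
    }
}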

Core ML

Apple announced the Core ML framework at WWDC 2017. The framework was Apple's answer to the fast-developing machine learning world. With Core ML, developers can integrate third-party machine learning models into their apps. The Core ML APIs let us train and fine-tune ML models and make predictions, all on the user's device.
Figure 2-7. Core ML

As shown in Figure 2-7, Core ML is the underlying framework that powers Vision, Natural Language, Speech, and Sound Analysis frameworks.
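To give an idea of what using a Core ML model directly looks like, here is a minimal sketch that runs an image classification model through Vision. It assumes you have added a model such as MobileNetV2.mlmodel to your project (Xcode then generates a MobileNetV2 class) and that cgImage is the input image.

import CoreML
import Vision

if let mlModel = try? MobileNetV2(configuration: MLModelConfiguration()).model,
   let visionModel = try? VNCoreMLModel(for: mlModel) {
    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        if let results = request.results as? [VNClassificationObservation] {
            print(results.first?.identifier ?? "no result")
        }
    }
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}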

A Python framework called coremltools was also made available to convert machine learning models from popular frameworks like Keras, Caffe, and scikit-learn to the Core ML format.

To use coremltools, one needs to know Python, which created a learning barrier for iOS developers. To lower this barrier, Apple announced a simpler machine learning tool called Create ML at WWDC 2018.

Create ML

Create ML is a separate developer application like Xcode. It lets us create ML models easily, be it image classification, text classification, or sound classification. Thanks to this tool, iOS developers have fewer excuses for not developing smart iOS apps right now. Create ML and Xcode provide an end-to-end machine learning solution so developers can create their solutions all in Apple’s ecosystem.

Create ML makes it easy to train models with image, text, or sound datasets and then test those models. When you finish training and testing, you can drag and drop your trained model from Create ML into your Xcode project.
Figure 2-8. Create ML

Create ML has ready-to-use templates to make training custom models easier. These templates include image classifier, object detector, sound classifier, activity classifier (motion classifier), text classifier, word tagger, tabular regressor, tabular classifier, and recommender. You can use Create ML as a separate application or as a framework in Swift Playgrounds.
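As a taste of the framework form, here is a minimal sketch of training a text classifier with Create ML in a macOS playground; the CSV path and the "text"/"label" column names are assumptions about your dataset:

import CreateML
import Foundation

let dataURL = URL(fileURLWithPath: "/path/to/reviews.csv")
let data = try MLDataTable(contentsOf: dataURL)
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 5)
let classifier = try MLTextClassifier(trainingData: trainingData,
                                      textColumn: "text",
                                      labelColumn: "label")
// Evaluate on the held-out rows and save the trained model.
let evaluation = classifier.evaluation(on: testingData, textColumn: "text", labelColumn: "label")
try classifier.write(to: URL(fileURLWithPath: "/path/to/ReviewClassifier.mlmodel"))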

Turi Create

To simplify ML model training for developers, Create ML makes many decisions behind the scenes. ML models have parameters that you can fine-tune to achieve better accuracy. If you are not satisfied with the choices Create ML offers and want more control over your ML models, you can use Turi Create.

In August 2016, Apple acquired Turi, a machine learning software startup, and later open sourced and continued developing its library, Turi Create. Turi Create is a Python framework that simplifies the development of custom machine learning models. You can export models from Turi Create for use in iOS, macOS, watchOS, and tvOS apps.

It supports a variety of data types: text, image, audio, video, and sensor data. You can create many different types of custom ML models using Turi Create. Some of them are text classification, image classification, object detection, regression (prediction of numeric values), clustering, activity classification, style transfer, and so on.

For text classification, it offers preprocessing methods to clean the text data before training. For example, you can remove infrequent words or common words such as "and" and "the" (generally called stop words in the ML world).

When we say text classification, sentiment analysis often comes to mind, but there are many other use cases. For instance, you can train a model on App Store reviews and categorize each review as a feature request, complaint, compliment, and so on. Or you can determine the author of a piece of writing by training a model on writing samples. If you can imagine it and have enough text samples, you can train a classifier for it.

Another text processing capability of Turi Create is text analysis, which lets us understand a large collection of documents. We can create "topic models," which are statistical models of text data. They represent documents with a small set of topics and give the probability of any word occurring in a given topic. This way, we can represent large documents with five to ten words or find words that are likely to occur together. We will learn how to use Turi Create to train text classification models and create topic models in the next chapters.

In this chapter, we covered the tools and frameworks Apple provides for ML. We looked at how to use the Vision framework to recognize text, VisionKit to scan documents, Natural Language to understand text, Core ML and Create ML to train custom models, and finally Turi Create to train models with more advanced techniques. In the next chapters, we will dive deeper and create intelligent applications using Natural Language.
