Taking advantage of the GPU 

You may recall that when we introduced the MLCustomLayer protocol, there was an optional method, encode(commandBuffer, inputs, outputs), reserved for performing the evaluation on the GPU when the hosting device supports it. This flexibility is one of the advantages Core ML has over other machine learning frameworks; it allows layers that run on the CPU to be mixed with layers that run on the GPU, with the two working coherently together.
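As a refresher, here is a minimal sketch of what such a conformance looks like (the class name is a placeholder and the bodies are stubs; only the method signatures come from the Core ML API):

import CoreML
import Metal

// A minimal sketch of an MLCustomLayer conformance; the class name is a
// placeholder and the bodies are stubs, for orientation only.
@objc(SketchLayer) class SketchLayer: NSObject, MLCustomLayer {

    required init(parameters: [String : Any]) throws {
        super.init()
    }

    func setWeightData(_ weights: [Data]) throws {
        // This layer has no learnable weights.
    }

    func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws -> [[NSNumber]] {
        return inputShapes // The output shape matches the input shape.
    }

    // Required CPU path.
    func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
    }

    // Optional GPU path; this is the method we implement in this section.
    func encode(commandBuffer: MTLCommandBuffer,
                inputs: [MTLTexture],
                outputs: [MTLTexture]) throws {
    }
}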

To use the GPU, we will be using Apple's Metal framework, a graphics framework equivalent to OpenGL and DirectX (and now Vulkan), for those who are familiar with 3D graphics. Unlike our previous solutions, which included all the code in a single method, we need to write the code that performs the computation in an external file, called a Metal shader file. Within this file, we will define a kernel, which will be compiled and loaded onto the GPU, allowing the work to be fanned out across the GPU in parallel. Let's create this kernel now; create a new Metal file called rescale.metal and add the following code:

#include <metal_stdlib>
using namespace metal;

kernel void rescale(
    texture2d_array<half, access::read> inTexture [[texture(0)]],
    texture2d_array<half, access::write> outTexture [[texture(1)]],
    ushort3 gid [[thread_position_in_grid]])
{
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height())
    {
        return;
    }

    const float4 x = float4(inTexture.read(gid.xy, gid.z));
    const float4 y = (1.0f + x) * 127.5f;

    outTexture.write(half4(y), gid.xy, gid.z);
}

It is beyond the scope of this chapter to discuss the details of Metal, so instead we'll just highlight some of the key differences and commonalities between this and the previous approaches. First, it's worth recognizing why GPUs have been a major catalyst for the resurgence of neural networks: the GPU architecture allows a kernel (such as the one shown earlier) to be spawned for each element in our array, giving us massive parallelism.

Because GPU frameworks were traditionally built with graphics manipulation in mind, there are some nuances in how we operate on data and what we operate on. The most notable of these is that we have swapped MLMultiArray for texture2d_array (textures), and we access them by reading at the thread's position in the grid (thread_position_in_grid). Nonetheless, the actual computation should look familiar from the original Python code: const float4 y = (1.0f + x) * 127.5f. Once calculated, we cast the result to a half-precision float (half4) and write it to the output texture.
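As a quick sanity check on the formula (a standalone sketch, not part of the project), you can confirm that it maps values in the [-1, 1] range back to the [0, 255] pixel range:

// Standalone sketch: verify that the rescale formula maps [-1, 1] to [0, 255].
func rescale(_ x: Float) -> Float {
    return (1.0 + x) * 127.5
}

print(rescale(-1.0)) // 0.0
print(rescale( 0.0)) // 127.5
print(rescale( 1.0)) // 255.0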

Our next step is to configure our RescaleOutputLambda class to use Metal and the GPU, rather than the CPU. Return to the RescaleOutputLambda.swift file and make the following amendments. 

Start by importing the Metal framework by adding the following statement at the top of your file:

import Metal

Next, we define a class variable of the type MTLComputePipelineState as a handle to the kernel we have just created, and set it up within the constructor of the RescaleOutputLambda class. Make the following amendments to the class and its constructor, as shown in the following snippet:

@objc(RescaleOutputLambda) class RescaleOutputLambda: NSObject, MLCustomLayer {

    let computePipeline: MTLComputePipelineState

    required init(parameters: [String : Any]) throws {
        let device = MTLCreateSystemDefaultDevice()!
        let library = device.makeDefaultLibrary()!
        let rescaleFunction = library.makeFunction(name: "rescale")!
        self.computePipeline = try! device.makeComputePipelineState(function: rescaleFunction)

        super.init()
    }
    ...
}
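Note that, for brevity, the constructor force-unwraps the Metal device, default library, and kernel function. If you would rather fail gracefully on devices without Metal support, one possible variation is to throw instead; the error type below is our own and not part of Core ML or Metal:

// A possible variation of the constructor that throws instead of force-unwrapping.
enum MetalSetupError: Error {
    case noDevice, noLibrary, noFunction
}

required init(parameters: [String : Any]) throws {
    guard let device = MTLCreateSystemDefaultDevice() else {
        throw MetalSetupError.noDevice
    }
    guard let library = device.makeDefaultLibrary() else {
        throw MetalSetupError.noLibrary
    }
    guard let rescaleFunction = library.makeFunction(name: "rescale") else {
        throw MetalSetupError.noFunction
    }
    self.computePipeline = try device.makeComputePipelineState(function: rescaleFunction)

    super.init()
}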

If no errors are thrown, we will have a reference to a compiled version of our rescale kernel; the final step is making use of it. Within the RescaleOutputLambda class, add the following method:

func encode(commandBuffer: MTLCommandBuffer,
            inputs: [MTLTexture],
            outputs: [MTLTexture]) throws {

    guard let encoder = commandBuffer.makeComputeCommandEncoder() else {
        return
    }

    let w = computePipeline.threadExecutionWidth
    let h = computePipeline.maxTotalThreadsPerThreadgroup / w
    let threadGroupSize = MTLSizeMake(w, h, 1)

    for i in 0..<inputs.count {
        let threadGroups = MTLSizeMake(
            (inputs[i].width + threadGroupSize.width - 1)
                / threadGroupSize.width,
            (inputs[i].height + threadGroupSize.height - 1)
                / threadGroupSize.height,
            (inputs[i].arrayLength + threadGroupSize.depth - 1)
                / threadGroupSize.depth)

        encoder.setTexture(inputs[i], index: 0)
        encoder.setTexture(outputs[i], index: 1)
        encoder.setComputePipelineState(computePipeline)
        encoder.dispatchThreadgroups(
            threadGroups,
            threadsPerThreadgroup: threadGroupSize)
    }

    encoder.endEncoding()
}

As mentioned before, we will omit the finer details of Metal here and only highlight the key parts of this method.

In short, the bulk of this method is responsible for passing data through to the compute kernel via the encoder and then dispatching it across the GPU. We first pass the input and output textures, as shown in the following snippet: 

encoder.setTexture(inputs[i], index: 0)
encoder.setTexture(outputs[i], index: 1)

Then we set the compute pipeline state, which points to the compiled rescale kernel we obtained in the constructor:

encoder.setComputePipelineState(computePipeline)

Finally, we dispatch the job to the GPU; in this instance, our compute kernel is invoked for every pixel in every slice of the input texture:

encoder.dispatchThreadgroups(
    threadGroups,
    threadsPerThreadgroup: threadGroupSize)

Once the loop has finished, we call encoder.endEncoding() to finish encoding the work onto the command buffer that Core ML passed in.
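To make the threadgroup arithmetic concrete, here is a standalone sketch (the texture dimensions and pipeline limits are made-up, illustrative values):

// Standalone sketch: how the threadgroup counts work out for an illustrative
// 300 x 300 texture with a single array slice, assuming a thread execution
// width of 32 and a maximum of 1,024 threads per threadgroup.
let w = 32                // computePipeline.threadExecutionWidth
let h = 1024 / w          // computePipeline.maxTotalThreadsPerThreadgroup / w = 32

let width = 300, height = 300, arrayLength = 1
let groupsX = (width + w - 1) / w        // 10
let groupsY = (height + h - 1) / h       // 10
let groupsZ = (arrayLength + 1 - 1) / 1  // 1

// 10 x 10 x 1 threadgroups of 32 x 32 x 1 threads cover 320 x 320 positions;
// the bounds check at the top of the kernel discards the threads that fall
// outside the 300 x 300 texture.
print(groupsX, groupsY, groupsZ) // 10 10 1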

If you build and run again, you will hopefully get the same result but in less time. We have now seen two approaches to optimizing our network; I leave optimizing ResCropBlockLambda as an exercise for you. For now, let's shift our focus to talking about your model's weight before we wrap up this chapter. 
