Positioning our label

Before jumping into the code, let's review what we will be doing over the next few blocks of code. As we saw earlier in the FrameGrabber's SetFrame method, the MediaFrameReference contains a reference to a SpatialCoordinateSystem. As discussed previously, a SpatialCoordinateSystem is a representation of a coordinate system that can be used to reason about the user's surroundings. Each SpatialCoordinateSystem has a relationship with other coordinate systems, which means that, given the user's current coordinate system, we can work out where the frame was captured relative to the user's current position. We make use of this to determine where to place the label, which is the work of this section.
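
As a reminder of where the worldCoordinateSystem parameter passed into ProcessFrame can come from, the following minimal sketch obtains one from a stationary frame of reference; this is illustrative only, and your application may already hold a coordinate system from its holographic space setup:

using Windows.Perception.Spatial;

// Illustrative only: create a coordinate system fixed at the user's
// current location and use it as the "world" space for ProcessFrame.
SpatialLocator locator = SpatialLocator.GetDefault();
SpatialStationaryFrameOfReference stationaryFrame =
    locator.CreateStationaryFrameOfReferenceAtCurrentLocation();
SpatialCoordinateSystem worldCoordinateSystem = stationaryFrame.CoordinateSystem;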

Jump back into the ProcessFrame method of the FaceTagMain class and make the following amendments:

void ProcessFrame(SpatialCoordinateSystem worldCoordinateSystem)
{
    if (!IsInValidateStateToProcessFrame())
    {
        return;
    }

    // obtain the details of the last frame captured
    FrameGrabber.Frame frame = frameGrabber.LastFrame;

    if (frame.mediaFrameReference == null)
    {
        return;
    }

    MediaFrameReference mediaFrameReference = frame.mediaFrameReference;

    SpatialCoordinateSystem cameraCoordinateSystem =
        mediaFrameReference.CoordinateSystem;

    CameraIntrinsics cameraIntrinsics =
        mediaFrameReference.VideoMediaFrame.CameraIntrinsics;

    Matrix4x4? cameraToWorld =
        cameraCoordinateSystem.TryGetTransformTo(worldCoordinateSystem);

    if (!cameraToWorld.HasValue)
    {
        return;
    }

    frameAnalyzer.AnalyzeFrame(frame.mediaFrameReference, (status, detectedPersons) =>
    {
        if (status > 0 && detectedPersons.Count > 0)
        {
            FrameAnalyzer.Bounds? bestRect = null;
            Vector3 bestRectPositionInCameraSpace = Vector3.Zero;
            float bestDotProduct = -1.0f;
            FrameAnalyzer.DetectedPerson bestPerson = null;

            foreach (var dp in detectedPersons)
            {
                Debug.WriteLine($"Detected person: {dp.ToString()}");
            }
        }
    });
}

In the preceding code snippet, we are getting a reference to the SpatialCoordinateSystem and CameraIntrinsics of the MediaFrameReference; we then try to obtain the transform from the frame's coordinate space into the user's current coordinate space, which is done in the following statement:

Matrix4x4? cameraToWorld = cameraCoordinateSystem.TryGetTransformTo(worldCoordinateSystem);

As mentioned earlier, each coordinate system has a dynamic relationship with the other coordinate systems; this relationship can be encoded as a 3D transformation matrix, which the preceding statement returns if successful. With this matrix, we can transform points and directions from one space (the frame's coordinate system) into another (the user's coordinate system).
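
To make this concrete, here is a minimal, illustrative sketch (the point values are made up and not part of the sample): a point in the camera's space is mapped into world space with Vector3.Transform, which applies the full matrix including its translation, whereas a pure direction such as the camera's forward vector would use Vector3.TransformNormal so that only the rotation is applied:

// Illustrative values only: a hypothetical point one meter in front
// of the camera (camera space looks down the negative z axis).
Vector3 pointInCameraSpace = new Vector3(0.0f, 0.0f, -1.0f);
Vector3 pointInWorldSpace =
    Vector3.Transform(pointInCameraSpace, cameraToWorld.Value);

// Directions ignore the matrix's translation component.
Vector3 cameraForwardInWorldSpace =
    Vector3.TransformNormal(-Vector3.UnitZ, cameraToWorld.Value);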

Along with the transformation, we add some variables that we will use over the next few paragraphs to find the most relevant detected face and to work out where it is located in world space.

Next, we will define some variables that we will use to position the label; add the following lines to the code you added earlier (before passing the frame to the FrameAnalyzer):

 float averageFaceWidthInMeters = 0.15f;

float pixelsPerMeterAlongX = cameraIntrinsics.FocalLength.X;
float averagePixelsForFaceAt1Meter = pixelsPerMeterAlongX *
averageFaceWidthInMeters;

Vector3 labelOffsetInWorldSpace = new Vector3(0.0f, 0.25f, 0.0f);

We first calculate the average number of pixels a face would span at 1 meter away; we will use this shortly to infer the depth of any person we identify. We then define a vector that will be used to offset the label from the center of the detected face; here, we position the label 0.25 meters directly above the face.
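
To make the depth inference concrete, here is a small worked example; the focal length below is an illustrative value only, as the real one comes from cameraIntrinsics at runtime:

// Illustrative numbers only; FocalLength.X depends on the device and capture mode.
float focalLengthX = 1500.0f; // pixels
float averagePixelsForFaceAt1Meter = focalLengthX * 0.15f; // 225 pixels

// A detected face 112.5 pixels wide is therefore estimated to be
// 225 / 112.5 = 2 meters away from the camera.
float estimatedFaceDepth = averagePixelsForFaceAt1Meter / 112.5f; // 2.0 meters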

Next, we will amend the body of the detected-face loop; we restrict this example to displaying a single person at a time. To select the most relevant person, we calculate the dot product for each face, keeping a reference to the largest value (which gives us the detected face that is most aligned with the user's current gaze direction); we step through the following amendments:

 FrameAnalyzer.Bounds? bestRect = null;
Vector3 bestRectPositionInCameraSpace = Vector3.Zero;
float bestDotProduct = -1.0f;
FrameAnalyzer.DetectedPerson bestPerson = null;

We start by reconstructing the ray from the camera to the face at the time the frame was captured; we will use this to determine the difference in angle (via the dot product) with respect to the user's current gaze, and thus how relevant this person is. Add the following code to create a point representing the center of the currently detected face:

Point faceRectCenterPoint = new Point(
    dp.bounds.left + dp.bounds.width / 2,
    dp.bounds.top + dp.bounds.height / 2
);

Next, we unproject this point from pixel coordinates into a camera-space ray from the camera origin, expressed as x and y coordinates on the plane at unit depth; we then turn it into a normalized direction pointing along the camera's forward (negative z) axis:

Vector2 centerOfFace = cameraIntrinsics.UnprojectAtUnitDepth(faceRectCenterPoint);
Vector3 vectorTowardsFace = Vector3.Normalize(new Vector3(centerOfFace.X, centerOfFace.Y, -1.0f));

Next, we calculate the dot product that will be used to determine how relevant this face is compared to any other detected faces:

float dotFaceWithGaze = Vector3.Dot(vectorTowardsFace, -Vector3.UnitZ);

This gives us the dot product between the user's gaze and the detected face; we compare it with our current best and, if it is better, update the variables. Check the following code snippet for this:

if (dotFaceWithGaze > bestDotProduct)
{
    float estimatedFaceDepth = averagePixelsForFaceAt1Meter /
        (float)dp.bounds.width;
    Vector3 targetPositionInCameraSpace = vectorTowardsFace *
        estimatedFaceDepth;

    bestDotProduct = dotFaceWithGaze;
    bestRect = dp.bounds;
    bestRectPositionInCameraSpace = targetPositionInCameraSpace;
    bestPerson = dp;
}

Along with updating the local variables, we also calculate the face's estimated position in camera space, shown in the following lines:

 float estimatedFaceDepth = averagePixelsForFaceAt1Meter / 
(float)dp.bounds.width;
Vector3 targetPositionInCameraSpace = vectorTowardsFace *
estimatedFaceDepth;

Once we have iterated through all detected faces, we test whether a face was detected and, if so, update our QuadRenderer and TextRenderer (which together make up our label), as follows:

if (bestRect.HasValue)
{
    Vector3 bestRectPositionInWorldspace =
        Vector3.Transform(bestRectPositionInCameraSpace, cameraToWorld.Value);
    Vector3 labelPosition = bestRectPositionInWorldspace +
        labelOffsetInWorldSpace;

    quadRenderer.TargetPosition = labelPosition;
    textRenderer.RenderTextOffscreen(
        $"{bestPerson.name}, {bestPerson.gender}, Age: {bestPerson.age}");

    lastFaceDetectedTimestamp = Utils.GetCurrentUnixTimestampMillis();
}

Most notable is how we transform the position into the user's space, using the 3D transformation matrix we obtained earlier between the frame's coordinate system and the user's current coordinate system, which is what we do in the following code snippet:

 Vector3 bestRectPositionInWorldspace = 
Vector3.Transform(bestRectPositionInCameraSpace, cameraToWorld.Value);
Vector3 labelPosition = bestRectPositionInWorldspace +
labelOffsetInWorldSpace;

With that, we conclude the example; the full listing of the ProcessFrame method is shown here:

void ProcessFrame(SpatialCoordinateSystem worldCoordinateSystem)
{
    if (!IsInValidateStateToProcessFrame())
    {
        return;
    }

    FrameGrabber.Frame frame = frameGrabber.LastFrame;

    if (frame.mediaFrameReference == null)
    {
        return;
    }

    MediaFrameReference mediaFrameReference = frame.mediaFrameReference;

    SpatialCoordinateSystem cameraCoordinateSystem =
        mediaFrameReference.CoordinateSystem;
    CameraIntrinsics cameraIntrinsics =
        mediaFrameReference.VideoMediaFrame.CameraIntrinsics;

    Matrix4x4? cameraToWorld =
        cameraCoordinateSystem.TryGetTransformTo(worldCoordinateSystem);

    if (!cameraToWorld.HasValue)
    {
        return;
    }

    float averageFaceWidthInMeters = 0.15f;

    float pixelsPerMeterAlongX = cameraIntrinsics.FocalLength.X;
    float averagePixelsForFaceAt1Meter = pixelsPerMeterAlongX *
        averageFaceWidthInMeters;

    Vector3 labelOffsetInWorldSpace = new Vector3(0.0f, 0.25f, 0.0f);

    frameAnalyzer.AnalyzeFrame(frame.mediaFrameReference, (status, detectedPersons) =>
    {
        if (status > 0 && detectedPersons.Count > 0)
        {
            FrameAnalyzer.Bounds? bestRect = null;
            Vector3 bestRectPositionInCameraSpace = Vector3.Zero;
            float bestDotProduct = -1.0f;
            FrameAnalyzer.DetectedPerson bestPerson = null;

            foreach (var dp in detectedPersons)
            {
                Point faceRectCenterPoint = new Point(
                    dp.bounds.left + dp.bounds.width / 2,
                    dp.bounds.top + dp.bounds.height / 2
                );

                Vector2 centerOfFace =
                    cameraIntrinsics.UnprojectAtUnitDepth(faceRectCenterPoint);

                Vector3 vectorTowardsFace = Vector3.Normalize(
                    new Vector3(centerOfFace.X, centerOfFace.Y, -1.0f));

                float dotFaceWithGaze = Vector3.Dot(vectorTowardsFace, -Vector3.UnitZ);

                if (dotFaceWithGaze > bestDotProduct)
                {
                    float estimatedFaceDepth = averagePixelsForFaceAt1Meter /
                        (float)dp.bounds.width;

                    Vector3 targetPositionInCameraSpace = vectorTowardsFace *
                        estimatedFaceDepth;

                    bestDotProduct = dotFaceWithGaze;
                    bestRect = dp.bounds;
                    bestRectPositionInCameraSpace = targetPositionInCameraSpace;
                    bestPerson = dp;
                }
            }

            if (bestRect.HasValue)
            {
                Vector3 bestRectPositionInWorldspace =
                    Vector3.Transform(bestRectPositionInCameraSpace, cameraToWorld.Value);
                Vector3 labelPosition = bestRectPositionInWorldspace +
                    labelOffsetInWorldSpace;

                quadRenderer.TargetPosition = labelPosition;
                textRenderer.RenderTextOffscreen(
                    $"{bestPerson.name}, {bestPerson.gender}, Age: {bestPerson.age}");

                lastFaceDetectedTimestamp = Utils.GetCurrentUnixTimestampMillis();
            }
        }
    });
}

As we did before, turn on your HoloLens device and click on the Remote Machine deployment button to deploy and run the application; you should now be able to see metadata floating above the heads of the people you have registered, as shown in the following image (with a flattering age prediction from Microsoft Cognitive Services):

Let's conclude this chapter with a quick summary of what we covered before moving on to the next example.
