Keyword recognizer

Keywords are akin to commands typed into a CLI: in our case, a word or short phrase that encapsulates a specific action. Some examples (all of which are HoloLens system commands) include:

  • Select selects the currently focused object
  • Go back returns to the previous screen
  • Hide hides the currently focused object

For each utterance, there is no need for conversation or dialog with the user; the intent is obvious. What is notable is that these keywords are contextually dependent and multimodal, meaning they rely on other forms of input to interpret the user's intent correctly. For example, Select will select the item the user is currently gazing at, where gaze is the other input providing context. Multimodal interfaces provide a more natural form of interaction, more closely aligned with how we interact with other people.

HoloLens makes it easy to add voice keywords to your application, and in this section, we will do just that. We will start by adding the necessary code to listen for and handle keywords, and then we will explore how we can create a more natural experience by better understanding the user.

Before committing to any code, let's quickly discuss what we are trying to achieve. First, a disclaimer: the examples presented here do not adhere to best practices for voice interfaces; they are for demonstration purposes only, showing how to integrate voice functionality into your application. For a more comprehensive guide to designing voice interfaces, I recommend Designing Voice User Interfaces by Cathy Pearl.

As mentioned earlier, keywords are single words or short phrases that denote a specific action. Given that our goal for this example is to allow the user to program (or control) the robot arm, we will simply extend this capability to voice. The following is a list of the phrases, and their associated actions, that we will implement:

  • rotate left, rotate right: When recognized, start rotating the base of our robot in the specified direction
  • rotate up, rotate down: Start rotating the currently focused arm, if any, in the specified direction; for example, if the user is currently gazing at arm 1, we will infer that this is the arm being referred to
  • rotate arm 1 up, rotate arm 1 down: Start rotating arm 1 in the specified direction
  • rotate arm 2 up, rotate arm 2 down: Start rotating arm 2 in the specified direction
  • move up, move down, move forward, move backwards, move left, move right: Start moving the inverse kinematics target (IK handle) in the specified direction
  • stop: Stop the current action

With our vocabulary now defined, let's turn our attention to putting these concepts into practice. Jump back into the Unity editor, and we will start by defining an abstract class that will act as a contract for the two approaches to integrating voice into our application. This will allow us to easily swap one behavior for the other with minimal changes to our code, a technique known in software engineering as the strategy pattern.
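To make the pattern concrete, here is a minimal sketch of the shape we are aiming for. The VoiceHandlerDriver class and its voiceHandler field are illustrative only; in our project, the equivalent wiring to PlayStateManager happens in the editor at the end of this section:

using UnityEngine;

// Sketch only: the driver depends solely on the abstract PlayStateVoiceHandler,
// so either concrete handler can be swapped in without touching this code.
public class VoiceHandlerDriver : MonoBehaviour
{
    // Assigned in the Inspector; any subclass of PlayStateVoiceHandler will do
    public PlayStateVoiceHandler voiceHandler;

    void OnEnable()
    {
        if (voiceHandler != null)
        {
            voiceHandler.StartHandler();
        }
    }

    void OnDisable()
    {
        if (voiceHandler != null)
        {
            voiceHandler.StopHandler();
        }
    }
}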

Click on the Create button within the Project panel and select C# Script; enter the name PlayStateVoiceHandler. We will make this class abstract; it will declare abstract methods for starting and stopping the handler, and implement common functionality that is likely to be shared between the two approaches: constants for the part names and helper methods to translate directions into usable vectors. Double-click on the PlayStateVoiceHandler script to open it in Visual Studio and make the following amendments:

using UnityEngine;

public abstract class PlayStateVoiceHandler : MonoBehaviour
{
    // String constants for each direction we support
    public sealed class Direction
    {
        public const string Left = "Left";
        public const string Right = "Right";
        public const string Up = "Up";
        public const string Down = "Down";
        public const string Forward = "Forward";
        public const string Back = "Back";
    }

    // Names of the robot parts we can select
    protected const string PART_BASE = "Base";
    protected const string PART_ARM_1 = "Arm 1";
    protected const string PART_ARM_2 = "Arm 2";
    protected const string PART_HANDLE = "Handle";

    public abstract void StartHandler();

    public abstract void StopHandler();

    // Translates a direction into a rotation vector of the given magnitude
    protected Vector3 GetRotationVector(string direction, float magnitude = 1f)
    {
        switch (direction)
        {
            case Direction.Up:
                return new Vector3(1f * magnitude, 0, 0);
            case Direction.Down:
                return new Vector3(-1f * magnitude, 0, 0);
            case Direction.Left:
                return new Vector3(0, 0, -1 * magnitude);
            case Direction.Right:
                return new Vector3(0, 0, 1 * magnitude);
        }

        return Vector3.zero;
    }

    // Translates a direction into a translation vector of the given magnitude
    protected Vector3 GetTranslationVector(string direction, float magnitude = 1f)
    {
        switch (direction)
        {
            case Direction.Up:
                return new Vector3(0, 1f * magnitude, 0);
            case Direction.Down:
                return new Vector3(0, -1f * magnitude, 0);
            case Direction.Left:
                return new Vector3(-1 * magnitude, 0f, 0);
            case Direction.Right:
                return new Vector3(1 * magnitude, 0f, 0);
            case Direction.Forward:
                return new Vector3(0, 0, 1f * magnitude);
            case Direction.Back:
                return new Vector3(0, 0, -1f * magnitude);
        }

        return Vector3.zero;
    }
}
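As a quick sanity check, the helpers behave as follows; the values fall directly out of the switch statements above (the calls assume you are inside a derived class, since the methods are protected):

Debug.Log(GetRotationVector(Direction.Up, 2f));           // (2.0, 0.0, 0.0) -- rotation about the X axis
Debug.Log(GetTranslationVector(Direction.Forward, 0.5f)); // (0.0, 0.0, 0.5)
Debug.Log(GetTranslationVector("sideways"));              // (0.0, 0.0, 0.0) -- unknown directions fall back to Vector3.zero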

With our base class defined, let's jump back to the Unity editor and implement the class responsible for listening for and handling the phrases defined earlier. Click on the Create button within the Project panel, select C# Script, enter the name PSKeywordHandler, and double-click on it to open it in Visual Studio.

Let's first add the necessary namespaces; add the following lines at the top of the script:

using UnityEngine.Windows.Speech;
using System;                        // for StringComparison
using System.Collections.Generic;    // for List and Dictionary
using System.Linq;
using System.Text.RegularExpressions;
using HoloToolkit.Unity;

The UnityEngine.Windows.Speech namespace, as the name suggests, gives us access to the Windows Speech API, which we will make extensive use of in the remainder of this chapter. Next, extend the PSKeywordHandler class with PlayStateVoiceHandler, for the reasons described earlier. With the housekeeping out of the way, we can now concentrate on implementing the necessary functionality, starting with listening for a user's utterances and handling the specified phrases.
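Your class declaration should now look like this:

public class PSKeywordHandler : PlayStateVoiceHandler
{
    // implementation to follow
}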

The class that will perform all the heavy lifting is KeywordRecognizer; we can integrate voice capabilities into our application simply by passing in an array of phrases we're interested in (and, optionally, a minimum confidence level) and assigning a delegate to handle the results. Let's do this now; add the following code to your class:

KeywordRecognizer keywordRecognizer;

public override void StartHandler()
{
    // The fixed set of phrases we want the recognizer to listen for
    var keywordCollection = new List<string>();

    keywordCollection.Add("rotate left");
    keywordCollection.Add("rotate right");

    keywordCollection.Add("rotate up");
    keywordCollection.Add("rotate down");

    keywordCollection.Add("rotate one up");
    keywordCollection.Add("rotate one down");

    keywordCollection.Add("rotate two up");
    keywordCollection.Add("rotate two down");

    keywordCollection.Add("move up");
    keywordCollection.Add("move down");
    keywordCollection.Add("move left");
    keywordCollection.Add("move right");
    keywordCollection.Add("move forward");
    keywordCollection.Add("move back");

    keywordCollection.Add("stop");

    // Only accept matches the recognizer is highly confident about
    keywordRecognizer = new KeywordRecognizer(keywordCollection.ToArray(), ConfidenceLevel.High);
    keywordRecognizer.OnPhraseRecognized += KeywordRecognizer_OnPhraseRecognized;
    keywordRecognizer.Start();
}

Being good citizens, we'll also add code to unregister the delegate and to stop and dispose of KeywordRecognizer, both when the handler is stopped and when the component is destroyed; add the following code:

public override void StopHandler()
{
    if (keywordRecognizer == null)
    {
        return;
    }

    keywordRecognizer.OnPhraseRecognized -= KeywordRecognizer_OnPhraseRecognized;
    keywordRecognizer.Stop();
    keywordRecognizer.Dispose();
    // Clear the reference so a subsequent call (for example, from OnDestroy) is a no-op
    keywordRecognizer = null;
}

void OnDestroy()
{
    StopHandler();
}

In the StartHandler method, we simply create a list of the phrases we are interested in capturing and pass them to the constructor of KeywordRecognizer. When KeywordRecognizer recognizes any of these phrases, it will pass the result back to the yet-to-be-defined delegate, along with a PhraseRecognizedEventArgs argument. This object includes details such as the following:

  • text: The interpreted utterance of the user (in most instances, matching one of your phrases)
  • confidence: The level of confidence KeywordRecognizer has in its interpretation of the text
  • phraseStartTime: The time at which the phrase was first detected
  • phraseDuration: The time it took for the phrase to be spoken
  • semanticMeanings: An array of SemanticMeaning, the semantic properties specified in an SRGS grammar file; this is metadata associated with a recognized utterance (something we will cover in the next section)

Finally, we call Start to activate KeywordRecognizer. Let's now implement the OnPhraseRecognized delegate by adding the following code:

void KeywordRecognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
{
    Debug.LogFormat("Heard {0} ({1})", args.text, args.confidence);
}

As mentioned, this delegate is called whenever a specified phrase is recognized by KeywordRecognizer (at or above the given confidence threshold). Now is a good time to build and deploy so that you can try each of the phrases and ensure that everything is working.
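If you also want to inspect the timing details listed earlier, you could replace the log line with something like this (a small sketch using the phraseStartTime and phraseDuration properties):

Debug.LogFormat("Heard {0} ({1}) starting at {2}, spoken over {3:0.00}s",
    args.text, args.confidence, args.phraseStartTime, args.phraseDuration.TotalSeconds);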

Our next task is to perform some action when we recognize a phrase. For each phrase, we want to change the state of the robot; this state will be determined either by the phrase in isolation or by the phrase within the context of the user's gaze. For this example, we will manage state simply by tracking the currently selected direction and part. For example, if the phrase rotate left is recognized, we will set the part to the base and the direction to left, and continue updating until the state changes.

Let's go ahead and define the types, variables, and properties to manage this state; add the following code to your PSKeywordHandler class:

public float rotationSpeed = 5.0f;

public float moveSpeed = 5.0f;

public string CurrentDirection { get; set; }

private string _currentPart = null;

public string CurrentPart
{
    get
    {
        return _currentPart;
    }
    set
    {
        // When switching away from the handle, deactivate the IK solver
        // (unless the user is directly manipulating the current Interactible)
        if (_currentPart != null)
        {
            if (_currentPart.Equals(PART_HANDLE) &&
                (PlayStateManager.Instance.CurrentInteractible != null &&
                 PlayStateManager.Instance.CurrentInteractible.interactionType != Interactible.InteractionTypes.Manipulation))
            {
                PlayStateManager.Instance.Robot.solverActive = false;
            }
        }

        _currentPart = value;

        // Selecting the handle activates the IK solver
        if (_currentPart != null)
        {
            if (_currentPart.Equals(PART_HANDLE))
            {
                PlayStateManager.Instance.Robot.solverActive = true;
            }
        }
    }
}

First, we define a property to track the currently selected part, CurrentPart, along with the currently assigned direction, CurrentDirection; both are set by parsing the captured keywords. The CurrentPart property is also responsible for toggling the robot's solverActive variable based on the selected part: it is set to true when the Handle is selected and to false otherwise. Lastly, we define translation and rotation speeds, in the moveSpeed and rotationSpeed variables respectively, which provide some control over how quickly the selected part updates.

Once a part is selected, it will be translated or rotated, depending on the type of part selected and CurrentDirection. We will use the two helper methods we defined in the base class to translate the direction into a vector we can pass to RobotController. This transformation is applied in the Update method; make this change now by amending the Update method with the following code:

void Update()
{
    if (CurrentPart == null)
    {
        return;
    }

    // If the user is interacting with something directly, clear the voice-selected part
    if (PlayStateManager.Instance.CurrentInteractible != null)
    {
        CurrentPart = null;
        return;
    }

    if (CurrentPart.Equals(PART_HANDLE))
    {
        PlayStateManager.Instance.Robot.MoveIKHandle(
            GetTranslationVector(CurrentDirection, moveSpeed * Time.deltaTime));
    }
    else
    {
        PlayStateManager.Instance.Robot.Rotate(
            CurrentPart, GetRotationVector(CurrentDirection, rotationSpeed * Time.deltaTime));
    }
}

If we have a current part, we determine the type of transformation required from its name: we either call the MoveIKHandle method, passing it the result of GetTranslationVector, or call the Rotate method of RobotController, passing it the result of GetRotationVector. Most of this code should look familiar, as we are using the same methods we used earlier to control the robot with gestures.

The last task is to handle each distinct phrase recognized. We will use a Dictionary as a delegate lookup, with the phrase as the key and the delegate as the value. Make the following amendments to the code:

delegate void KeywordAction(PhraseRecognizedEventArgs args);
Dictionary<string, KeywordAction> keywordCollection;

// Minimum confidence threshold; this field replaces the ConfidenceLevel.High
// value we hard-coded earlier so that it can be tweaked in the Inspector
public ConfidenceLevel confidenceLevel = ConfidenceLevel.High;

public override void StartHandler()
{
    keywordCollection = new Dictionary<string, KeywordAction>();

    keywordCollection.Add("rotate left", StartRotatingLeft);
    keywordCollection.Add("rotate right", StartRotatingRight);

    keywordCollection.Add("rotate up", StartRotatingUp);
    keywordCollection.Add("rotate down", StartRotatingDown);

    keywordCollection.Add("rotate one up", StartRotatingArm1Up);
    keywordCollection.Add("rotate one down", StartRotatingArm1Down);

    keywordCollection.Add("rotate two up", StartRotatingArm2Up);
    keywordCollection.Add("rotate two down", StartRotatingArm2Down);

    keywordCollection.Add("move up", StartMovingIKHandle);
    keywordCollection.Add("move down", StartMovingIKHandle);
    keywordCollection.Add("move left", StartMovingIKHandle);
    keywordCollection.Add("move right", StartMovingIKHandle);
    keywordCollection.Add("move forward", StartMovingIKHandle);
    keywordCollection.Add("move back", StartMovingIKHandle);

    keywordCollection.Add("stop", Stop);

    keywordRecognizer = new KeywordRecognizer(keywordCollection.Keys.ToArray(), confidenceLevel);
    keywordRecognizer.OnPhraseRecognized += KeywordRecognizer_OnPhraseRecognized;
    keywordRecognizer.Start();
}

Next, we need to perform the delegation; we do this in the KeywordRecognizer_OnPhraseRecognized method, using the recognized phrase as the key into the lookup we just created and invoking the associated delegate. Make the following amendments to the KeywordRecognizer_OnPhraseRecognized method:

void KeywordRecognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
{
    KeywordAction keywordAction;

    if (keywordCollection.TryGetValue(args.text, out keywordAction))
    {
        keywordAction.Invoke(args);
    }
}

With this implemented, we now have the laborious task of writing out all the delegates:

void StartRotatingLeft(PhraseRecognizedEventArgs args)
{
    CurrentPart = PART_BASE;
    CurrentDirection = Direction.Left;
}

void StartRotatingRight(PhraseRecognizedEventArgs args)
{
    CurrentPart = PART_BASE;
    CurrentDirection = Direction.Right;
}

void StartRotatingArm1Up(PhraseRecognizedEventArgs args)
{
    CurrentPart = PART_ARM_1;
    CurrentDirection = Direction.Up;
}

void StartRotatingArm1Down(PhraseRecognizedEventArgs args)
{
    CurrentPart = PART_ARM_1;
    CurrentDirection = Direction.Down;
}

void StartRotatingArm2Up(PhraseRecognizedEventArgs args)
{
    CurrentPart = PART_ARM_2;
    CurrentDirection = Direction.Up;
}

void StartRotatingArm2Down(PhraseRecognizedEventArgs args)
{
    CurrentPart = PART_ARM_2;
    CurrentDirection = Direction.Down;
}
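The Stop delegate we registered against the stop phrase needs no context at all; a minimal implementation, consistent with the state model above, simply clears the current part (recall that Update returns early when CurrentPart is null):

void Stop(PhraseRecognizedEventArgs args)
{
    // Clearing the part halts the rotation/translation applied in Update
    CurrentPart = null;
    CurrentDirection = null;
}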

There's little ambiguity in the phrases handled so far. Our next set of code handles phrases where the part is not obvious; to infer what the user means, we will make use of the user's gaze, specifically whether the user is gazing at arm 1 or arm 2, ignoring the request if the user is gazing at neither:

void StartRotatingUp(PhraseRecognizedEventArgs args)
{
    if (!GazeManager.Instance.Hit)
    {
        return;
    }

    if (GazeManager.Instance.FocusedObject.name.Equals("arm 1", StringComparison.OrdinalIgnoreCase))
    {
        CurrentPart = PART_ARM_1;
        CurrentDirection = Direction.Up;
    }
    else if (GazeManager.Instance.FocusedObject.name.Equals("arm 2", StringComparison.OrdinalIgnoreCase))
    {
        CurrentPart = PART_ARM_2;
        CurrentDirection = Direction.Up;
    }
}

void StartRotatingDown(PhraseRecognizedEventArgs args)
{
    if (!GazeManager.Instance.Hit)
    {
        return;
    }

    if (GazeManager.Instance.FocusedObject.name.Equals("arm 1", StringComparison.OrdinalIgnoreCase))
    {
        CurrentPart = PART_ARM_1;
        CurrentDirection = Direction.Down;
    }
    else if (GazeManager.Instance.FocusedObject.name.Equals("arm 2", StringComparison.OrdinalIgnoreCase))
    {
        CurrentPart = PART_ARM_2;
        CurrentDirection = Direction.Down;
    }
}
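Notice that StartRotatingUp and StartRotatingDown differ only in the direction they assign; if you prefer, you can factor the shared gaze check into a single helper. A sketch of one way to do it:

void StartRotatingFocusedArm(string direction)
{
    if (!GazeManager.Instance.Hit)
    {
        return;
    }

    var focusedName = GazeManager.Instance.FocusedObject.name;

    if (focusedName.Equals("arm 1", StringComparison.OrdinalIgnoreCase))
    {
        CurrentPart = PART_ARM_1;
        CurrentDirection = direction;
    }
    else if (focusedName.Equals("arm 2", StringComparison.OrdinalIgnoreCase))
    {
        CurrentPart = PART_ARM_2;
        CurrentDirection = direction;
    }
}

// The two delegates then reduce to:
void StartRotatingUp(PhraseRecognizedEventArgs args)   { StartRotatingFocusedArm(Direction.Up); }
void StartRotatingDown(PhraseRecognizedEventArgs args) { StartRotatingFocusedArm(Direction.Down); }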

Our last method handles the move X keywords, where X is a direction. This method differs from the previous ones in that we are handling multiple keywords with a single delegate; we will use a regular expression to extract the direction:

void StartMovingIKHandle(PhraseRecognizedEventArgs args)
{
    if (Regex.IsMatch(args.text, @"\b(up|higher)\b"))
    {
        CurrentPart = PART_HANDLE;
        CurrentDirection = Direction.Up;
    }
    else if (Regex.IsMatch(args.text, @"\b(down|lower)\b"))
    {
        CurrentPart = PART_HANDLE;
        CurrentDirection = Direction.Down;
    }
    else if (Regex.IsMatch(args.text, @"\b(forward|away)\b"))
    {
        CurrentPart = PART_HANDLE;
        CurrentDirection = Direction.Forward;
    }
    else if (Regex.IsMatch(args.text, @"\b(back|backwards)\b"))
    {
        CurrentPart = PART_HANDLE;
        CurrentDirection = Direction.Back;
    }
    else if (Regex.IsMatch(args.text, @"\b(left)\b"))
    {
        CurrentPart = PART_HANDLE;
        CurrentDirection = Direction.Left;
    }
    else if (Regex.IsMatch(args.text, @"\b(right)\b"))
    {
        CurrentPart = PART_HANDLE;
        CurrentDirection = Direction.Right;
    }
}
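As an aside, if this if/else chain keeps growing, one option (a sketch, not part of the project code) is to drive it from a table of patterns:

// Maps a word-boundary pattern to the direction it implies
static readonly Dictionary<string, string> moveDirections = new Dictionary<string, string>
{
    { @"\b(up|higher)\b",      Direction.Up },
    { @"\b(down|lower)\b",     Direction.Down },
    { @"\b(forward|away)\b",   Direction.Forward },
    { @"\b(back|backwards)\b", Direction.Back },
    { @"\b(left)\b",           Direction.Left },
    { @"\b(right)\b",          Direction.Right }
};

void StartMovingIKHandle(PhraseRecognizedEventArgs args)
{
    foreach (var entry in moveDirections)
    {
        if (Regex.IsMatch(args.text, entry.Key))
        {
            CurrentPart = PART_HANDLE;
            CurrentDirection = entry.Value;
            return;
        }
    }
}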

With this method now implemented, we have finished the code for our first voice handler. The final task is to wire it up in the editor. Head back to the Unity editor and expand the Managers GameObject in the Hierarchy panel (if not already done). Add a new empty GameObject named PSKeywordHandler by clicking on the Create dropdown, selecting Create Empty, and entering the appropriate name. Next, we need to add our script: select the newly created PSKeywordHandler GameObject, click on the Add Component button within the Inspector panel, type the name PSKeywordHandler, and select it when it becomes visible. Once attached, select the Managers GameObject and assign the PSKeywordHandler GameObject to the PlayStateManager script by dragging PSKeywordHandler onto the Voice Handler field.

Now we are ready to take it for a test run; build and deploy the application to the device, and return here once you have finished, where we will move on to looking at how we can handle more complex (and natural) phrases.
