Chapter 5. Speaking with Your Application

In the previous chapter, we learned how to discover and understand a user's intent based on utterances. In this chapter, we will learn how to add audio capabilities to our applications by converting text to speech and speech to text, and how to identify the person speaking. Throughout the chapter, we will also see how spoken audio can be used to verify a person. Finally, we will briefly touch on how to customize speech recognition to make it unique to your application's usage.

By the end of this chapter, we will have covered the following topics:

  • Converting spoken audio to text and text to spoken audio
  • Recognizing intent from spoken audio by utilizing LUIS
  • Verifying that the speaker is who they claim to be
  • Identifying the speaker
  • Tailoring the Speaker Recognition API to recognize custom speaking styles and environments

Converting text to audio and vice versa

In Chapter 1, Getting Started with Microsoft Cognitive Services, we utilized a part of the Bing Speech API. We gave the example application the ability to say sentences to us. We will use the code that we created in that example now, but we will dive a bit deeper into the details.

We will also go through the other feature of the Bing Speech API: converting spoken audio to text. The idea is that we can speak to the smart-house application, which will recognize what we are saying. Using the textual output, the application will use LUIS to gather the intent of our sentence. If LUIS needs more information, the application will politely ask us for more via audio.

To get started, we want to modify the build definition of the smart-house application so that it explicitly targets a 32-bit or 64-bit platform, matching the OS we are running on. To utilize speech-to-text conversion, we then want to install the Bing Speech NuGet client package. Search for Microsoft.ProjectOxford.SpeechRecognition and install either the 32-bit or the 64-bit version, depending on your system.

Further on, we need to add references to System.Runtime.Serialization and System.Web. These are needed so that we are able to make web requests and deserialize response data from the APIs.

Speaking to the application

Add a new file to the Model folder, called SpeechToText.cs. Beneath the automatically created SpeechToText class, we want to add an enum called SttStatus. It should have two values, Success and Error.
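
The enum is as simple as it sounds; a minimal sketch could look like this:

    public enum SttStatus
    {
        Success,
        Error
    }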

In addition, we want to define an EventArgs class for events that we will raise during execution. Add the following class at the bottom of the file:

    public class SpeechToTextEventArgs : EventArgs
    {
        public SttStatus Status { get; private set; }
        public string Message { get; private set; }
        public List<string> Results { get; private set; }

        public SpeechToTextEventArgs(SttStatus status, 
        string message, List<string> results = null)
        {
            Status = status;
            Message = message;
            Results = results;
        }
    }

As you can see, the event argument will hold the operation status, a message of any kind, and a list of strings. This will be a list with potential speech-to-text conversions.

The SpeechToText class needs to implement IDisposable. This is done so that we can clean up the resources used for recording spoken audio and shut down the application properly. We will add the details presently, so for now, just make sure to add the Dispose function.
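
As a starting point, a minimal skeleton of the class could look like the following sketch (the Dispose body will be filled in at the end of this section):

    public class SpeechToText : IDisposable
    {
        public void Dispose()
        {
            // Clean-up of the API clients is added at the end of this section
        }
    }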

Now, we need to define a few private members in the class, as well as an event:

    public event EventHandler<SpeechToTextEventArgs> OnSttStatusUpdated;

    private DataRecognitionClient _dataRecClient;
    private MicrophoneRecognitionClient _micRecClient;
    private SpeechRecognitionMode _speechMode = SpeechRecognitionMode.ShortPhrase;

    private string _language = "en-US";
    private bool _isMicRecording = false;

The OnSttStatusUpdated event will be triggered whenever we have a new operation status. DataRecognitionClient and MicrophoneRecognitionClient are the two objects that we can use to call the Bing Speech API. We will look at how they are created presently.

We define SpeechRecognitionMode as ShortPhrase. This means that we do not expect any spoken sentences longer than 15 seconds. The alternative is LongDictation, which allows us to convert spoken sentences of up to 2 minutes in length.

Finally, we specify the language to be English, and define a bool type variable, which indicates whether or not we are currently recording anything.

In our constructor, we accept the Bing Speech API key as a parameter. We will use this in the creation of our API clients:

    public SpeechToText(string bingApiKey)
    {
        _dataRecClient = SpeechRecognitionServiceFactory.CreateDataClientWithIntentUsingEndpointUrl(_language, bingApiKey, "LUIS_ROOT_URI");

        _micRecClient = SpeechRecognitionServiceFactory.CreateMicrophoneClient(_speechMode, _language, bingApiKey);

        Initialize();
    }

As you can see, we create both _dataRecClient and _micRecClient by calling SpeechRecognitionServiceFactory. For the first client, we state that we want to use intent recognition as well. The parameters required are the language, the Bing Speech API key, and the LUIS endpoint URL (which identifies the LUIS app and contains the LUIS API key). By using a DataRecognitionClient object, we can upload audio files containing speech.

By using MicrophoneRecognitionClient, we can use a microphone for real-time conversion. For this, we do not want intent detection, so we call CreateMicrophoneClient. In this case, we only need to specify the speech mode, the language, and the Bing Speech API key.

Before leaving the constructor, we call the Initialize function. In this, we subscribe to certain events on each of the clients:

    private void Initialize()
    {
        _micRecClient.OnMicrophoneStatus += OnMicrophoneStatus;
        _micRecClient.OnPartialResponseReceived += OnPartialResponseReceived;
        _micRecClient.OnResponseReceived += OnResponseReceived;
        _micRecClient.OnConversationError += OnConversationErrorReceived;

        _dataRecClient.OnIntent += OnIntentReceived;
        _dataRecClient.OnPartialResponseReceived +=
        OnPartialResponseReceived;
        _dataRecClient.OnConversationError += OnConversationErrorReceived;
        _dataRecClient.OnResponseReceived += OnResponseReceived;
    }

As you can see, there are quite a few similarities between the two clients. The two differences are that _dataRecClient will get intents through the OnIntent event, and _micRecClient will get the microphone status through the OnMicrophoneStatus event.

We do not really care about partial responses. However, they may be useful in some cases, as they continuously provide the part of the conversion that has been completed so far:

    private void OnPartialResponseReceived(object sender, PartialSpeechResponseEventArgs e)
    {
        Debug.WriteLine($"Partial response received:{e.PartialResult}");
    } 

For our application, we will choose to output it to the debug console window. In this case, PartialResult is a string with the partially converted text:

    private void OnMicrophoneStatus(object sender, MicrophoneEventArgs e)
    {
        Debug.WriteLine($"Microphone status changed to recording: {e.Recording}");
    }

We do not care about the current microphone status, either. Again, we output the status to the debug console window.

Before moving on, add a helper function, called RaiseSttStatusUpdated. This should raise OnSttStatusUpdated when called.
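
A minimal sketch of this helper could look as follows:

    private void RaiseSttStatusUpdated(SpeechToTextEventArgs args)
    {
        // Notify any subscribers about the new status
        OnSttStatusUpdated?.Invoke(this, args);
    }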

When we are calling _dataRecClient, we may recognize intents from LUIS. In these cases, we want to raise an event, where we output the recognized intent. This is done with the following code:

    private void OnIntentReceived(object sender, SpeechIntentEventArgs e)
    {
        SpeechToTextEventArgs args = new SpeechToTextEventArgs(SttStatus.Success, $"Intent received: {e.Intent.ToString()}.\nPayload: {e.Payload}");

        RaiseSttStatusUpdated(args);
    }

We choose to print out intent information and the Payload. This is a string containing recognized entities, intents, and actions that are triggered from LUIS.

If any errors occur during the conversion, there are several things we will want to do. First and foremost, we want to stop any microphone recordings that may be running. There is really no point in trying to convert more in the current operation if it has failed:

    private void OnConversationErrorReceived(object sender, SpeechErrorEventArgs e)
    {
        if (_isMicRecording) StopMicRecording();

We will create StopMicRecording presently.

In addition, we want to notify any subscribers that the conversion failed. In such cases, we want to give details about error codes and error messages:

        string message = $"Speech to text failed with status code:{e.SpeechErrorCode.ToString()}, and error message: {e.SpeechErrorText}";
  
        SpeechToTextEventArgs args = new SpeechToTextEventArgs(SttStatus.Error, message);

        RaiseSttStatusUpdated(args);
    }

The OnConversationError event does, fortunately, provide us with detailed information about any errors.

Now, let's look at the StopMicRecording method:

    private void StopMicRecording()
    {
        _micRecClient.EndMicAndRecognition();
        _isMicRecording = false;
    }

This is a simple function that calls EndMicAndRecognition on the _micRecClient MicrophoneRecognitionClient object. When this is called, we stop the client from recording.

The final event handler that we need to create is the OnResponseReceived handler. This will be triggered whenever we receive a complete, converted response from the service.

Again, we want to make sure we do not record any more if we are currently recording:

    private void OnResponseReceived(object sender, SpeechResponseEventArgs e)
    {
        if (_isMicRecording) StopMicRecording();

The SpeechResponseEventArgs argument contains a PhraseResponse object. This contains an array of RecognizedPhrase, which we want to access. Each item in this array contains the confidence of correct conversion. It also contains the converted phrases as DisplayText. This uses inverse text normalization, proper capitalization, and punctuation, and it masks profanities with asterisks:

        RecognizedPhrase[] recognizedPhrases = e.PhraseResponse.Results;
        List<string> phrasesToDisplay = new List<string>();

        foreach (RecognizedPhrase phrase in recognizedPhrases)
        {
            phrasesToDisplay.Add(phrase.DisplayText);
        }

We may also get the converted phrases in other formats, as described in the following list:

  • LexicalForm: The raw, unprocessed recognition result.
  • InverseTextNormalizationResult: Displays phrases such as "one two three four" as 1234, so it is ideal for usages such as "go to second street".
  • MaskedInverseTextNormalizationResult: Applies inverse text normalization and the profanity mask. No capitalization or punctuation is applied.

For our use, we are just interested in the DisplayText. With a populated list of recognized phrases, we raise the status update event:

        SpeechToTextEventArgs args = new SpeechToTextEventArgs(SttStatus.Success, $"STT completed with status: {e.PhraseResponse.RecognitionStatus.ToString()}", phrasesToDisplay);

        RaiseSttStatusUpdated(args);
    }

To be able to use this class, we need a couple of public functions so that we can start speech recognition:

    public void StartMicToText()
    {
        _micRecClient.StartMicAndRecognition();
        _isMicRecording = true;
    }

The StartMicToText method will call the StartMicAndRecognition method on the _micRecClient object. This will allow us to use the microphone to convert spoken audio. This function will be our main way of accessing this API:

    public void StartAudioFileToText(string audioFileName) {
        using (FileStream fileStream = new FileStream(audioFileName, FileMode.Open, FileAccess.Read))
        {
            int bytesRead = 0;
            byte[] buffer = new byte[1024];

The second function requires the name of the audio file containing the audio we want to convert. We open the file with read access and are ready to read it:

    try {
        do {
            bytesRead = fileStream.Read(buffer, 0, buffer.Length);
            _dataRecClient.SendAudio(buffer, bytesRead);
        } while (bytesRead > 0);
    }

As long as we have data available, we read from the file. We will fill up the buffer, and call the SendAudio method. This will then trigger a recognition operation in the service.

If any exceptions occur, we make sure to output the exception message to a debug window. Finally, we need to call the EndAudio method so that the service does not wait for any more data:

    catch(Exception ex) {
        Debug.WriteLine($"Exception caught: {ex.Message}");
    }
    finally {
        _dataRecClient.EndAudio();
    }
    }    // Closes the using block
    }    // Closes StartAudioFileToText

Before leaving this class, we need to dispose of our API clients. Add the following in the Dispose function:

    if (_micRecClient != null) {
        _micRecClient.EndMicAndRecognition();
        _micRecClient.OnMicrophoneStatus -= OnMicrophoneStatus;
        _micRecClient.OnPartialResponseReceived -= OnPartialResponseReceived;
        _micRecClient.OnResponseReceived -= OnResponseReceived;
        _micRecClient.OnConversationError -= OnConversationErrorReceived;

        _micRecClient.Dispose();
        _micRecClient = null;
    }

    if(_dataRecClient != null) {
        _dataRecClient.OnIntent -= OnIntentReceived;
        _dataRecClient.OnPartialResponseReceived -= OnPartialResponseReceived;
        _dataRecClient.OnConversationError -= OnConversationErrorReceived;
        _dataRecClient.OnResponseReceived -= OnResponseReceived;

        _dataRecClient.Dispose();
        _dataRecClient = null;
    }

We stop microphone recording, unsubscribe from all events, and dispose and clear the client objects.

Make sure that the application compiles before moving on. We will look at how to use this class presently.

Letting the application speak back

We have already seen how to make the application speak back to us. We are going to use the same classes we created in Chapter 1, Getting Started with Microsoft Cognitive Services. Copy Authentication.cs and TextToSpeech.cs from the example project from Chapter 1, Getting Started with Microsoft Cognitive Services, into the Model folder. Make sure that the namespaces are changed accordingly.

As we have been through the code already, we will not go through it again. We will instead look at some of the details left out in Chapter 1, Getting Started with Microsoft Cognitive Services.

Audio output format

The audio output format can be one of the following formats:

  • raw-8khz-8bit-mono-mulaw
  • raw-16khz-16bit-mono-pcm
  • riff-8khz-8bit-mono-mulaw
  • riff-16khz-16bit-mono-pcm
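
The format is selected per request when calling the REST API. As a hedged sketch (the X-Microsoft-OutputFormat header name comes from the Bing Speech documentation; the endpoint URL and the surrounding code are illustrative only), the choice typically ends up as a request header:

    // Requires: using System.Net.Http;
    // Illustrative sketch: selecting the audio output format on a raw REST request
    var request = new HttpRequestMessage(HttpMethod.Post, "https://speech.platform.bing.com/synthesize");
    request.Headers.Add("X-Microsoft-OutputFormat", "riff-16khz-16bit-mono-pcm");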

Error codes

There are four possible error codes that can occur in calls to the API. These are described in the following list:

  • 400 / BadRequest: A required parameter is missing, empty, or null, or a parameter is invalid. An example may be a string that is longer than the allowed length.
  • 401 / Unauthorized: The request is not authorized.
  • 413 / RequestEntityTooLarge: The SSML input is larger than what is supported.
  • 502 / BadGateway: A network-related or server-related issue.

Supported languages

The following languages are supported:

English (Australia), English (United Kingdom), English (United States), English (Canada), English (India), Spanish, Mexican Spanish, German, Arabic (Egypt), French, Canadian French, Italian, Japanese, Portuguese, Russian, Chinese (Simplified), Chinese (Hong Kong), and Chinese (Traditional).

Utilizing LUIS based on spoken commands

To utilize the features that we have just added, we are going to modify LuisView and LuisViewModel. Add a new Button in the View, which we will use to record spoken commands. Add a corresponding ICommand in the ViewModel, as sketched below.
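
A minimal sketch of the ViewModel side, assuming a DelegateCommand helper like the one used for the other commands in this project (use whichever ICommand implementation you already have), could look like this:

    // Requires: using System.Windows.Input;
    public ICommand RecordUtteranceCommand { get; private set; }

    // In the constructor, after the members have been initialized:
    RecordUtteranceCommand = new DelegateCommand(RecordUtterance);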

We also need to add a few more members to the class:

    private SpeechToText _sttClient;
    private TextToSpeech _ttsClient;
    private string _bingApiKey = "BING_SPEECH_API_KEY";

The first two will be used to convert between spoken audio and text. The third is the API key for the Bing Speech API.

Make the ViewModel implement IDisposable, and explicitly dispose the SpeechToText object.
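
A minimal sketch of that Dispose implementation could look like this:

    public void Dispose()
    {
        if (_sttClient != null)
        {
            _sttClient.Dispose();
            _sttClient = null;
        }
    }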

Create the objects by adding the following in the constructor:

    _sttClient = new SpeechToText(_bingApiKey);
    _sttClient.OnSttStatusUpdated += OnSttStatusUpdated;

    _ttsClient = new TextToSpeech();
    _ttsClient.OnAudioAvailable += OnTtsAudioAvailable;
    _ttsClient.OnError += OnTtsError;

    GenerateHeaders();

This will create the client objects and subscribe to the required events. Finally, it will call a function to generate authentication tokens for the REST API calls. This function should look like this:

    private async void GenerateHeaders()
    {
        if (await _ttsClient.GenerateAuthenticationToken(_bingApiKey))
            _ttsClient.GenerateHeaders();
    }

If we receive any errors from _ttsClient, we want to output it to the debug console:

    private void OnTtsError(object sender, AudioErrorEventArgs e)
    {
        Debug.WriteLine($"Status: Audio service failed - {e.ErrorMessage}");
    }

We do not need to output this to the UI, as this is a nice-to-have feature.

If we have audio available, we want to make sure that we play it. We do so by creating a SoundPlayer object:

    private void OnTtsAudioAvailable(object sender, AudioEventArgs e)
    {
        SoundPlayer player = new SoundPlayer(e.EventData);
        player.Play();
        e.EventData.Dispose();
    }

Using the audio stream we got from the event arguments, we can play the audio to the user.

If we have a status update from _sttClient, we want to display this in the textbox.

If we have successfully recognized spoken audio, we want to show the Message string if it is available:

    private void OnSttStatusUpdated(object sender, SpeechToTextEventArgs e) {
        Application.Current.Dispatcher.Invoke(() => {
            StringBuilder sb = new StringBuilder();

            if (e.Status == SttStatus.Success) {
                if (!string.IsNullOrEmpty(e.Message)) {
                    sb.AppendFormat("Result message: {0}\n\n", e.Message);
                }

We also want to show all recognized phrases. Using the first available phrase, we make a call to LUIS:

                if (e.Results != null && e.Results.Count != 0) {
                    sb.Append("Retrieved the following results:\n");

                    foreach (string sentence in e.Results) {
                        sb.AppendFormat("{0}\n\n", sentence);
                    }

                    sb.Append("Calling LUIS with the top result\n");
                    CallLuis(e.Results.FirstOrDefault());
                }
            }

If the recognition failed, we print out any error messages that we may have. Finally, we make sure that the ResultText is updated with the new data:

            else {
                sb.AppendFormat("Could not convert speech to text: {0}\n", e.Message);
            }

            sb.Append("\n");
            ResultText = sb.ToString();
        });
    }

The newly created ICommand needs to have a function to start the recognition process:

    private void RecordUtterance(object obj) {
        _sttClient.StartMicToText();
    }

The function starts the microphone recording.

Finally, we need to make some modifications to OnLuisUtteranceResultUpdated, so that we output any DialogResponse:

    if (e.RequiresReply && !string.IsNullOrEmpty(e.DialogResponse))
    {
        await _ttsClient.SpeakAsync(e.DialogResponse, CancellationToken.None);

        sb.AppendFormat("Response: {0}\n", e.DialogResponse);
        sb.Append("Reply in the left textfield");

        RecordUtterance(sender);
    }
    else
    {
        await _ttsClient.SpeakAsync($"Summary: {e.Message}", CancellationToken.None);
    }

This will play the DialogResponse if it exists. The application will ask you for more information if required. It will then start the recording, so we can answer without clicking any buttons.

If no DialogResponse exists, we simply make the application say the summary to us. This will contain data on intents, entities, and actions from LUIS.
