© Nishith Pathak and Anurag Bhandari 2018
Nishith Pathak and Anurag Bhandari, IoT, AI, and Blockchain for .NET, https://doi.org/10.1007/978-1-4842-3709-0_6

6. Building Smarter Applications Using Cognitive APIs

Nishith Pathak (1) and Anurag Bhandari (2)
(1) Kotdwara, Dist. Pauri Garhwal, India
(2) Jalandhar, Punjab, India
 

The famous saying “Rome wasn’t built in a day” holds true for Microsoft Cognitive Services. Each service is exposed as a RESTful API that abstracts decades of experience, complex algorithms, deep neural networks, fuzzy logic, and research from the Microsoft Research team. In Chapter 4, you were introduced to Microsoft Cognitive Services and learned how cognitive services differ from traditional programming systems. Later in Chapter 4, you got a sneak preview of all the Cognitive Services APIs that Microsoft has produced.

Chapter 5 continued the journey of cognitive services by covering the prerequisites for creating Cognitive Services and setting up a development environment. You also got familiar with building your first cognitive application in Visual Studio and saw the power of the Computer Vision API.

Welcome to this chapter. Here, we extend Cognitive Services by applying them to the smart hospital use case introduced in Chapter 1. As you now know, Microsoft Cognitive Services are broadly classified into six main categories, and each category has four to five services, which adds up to about 29 services at the time of writing. Each of these Cognitive Services can be used in our smart hospital, the Asclepius Consortium. We will go through some of the most powerful ways of using cognitive services, especially around natural language understanding (NLU), speech, and the Face API, and apply them to the Asclepius hospital example.

At the end of this chapter, you'll be able to:
  • Understand some of the powerful ways to apply cognitive services

  • Understand natural language understanding (NLU) and LUIS

  • Develop, test, host, and manage the LUIS application for the smart hospital

  • Interact with the Speech API

  • Use the Speech API to convert speech to text and text to speech

  • Use face detection and recognition in hospital scenarios

The Asclepius Consortium is a chain of hospitals that deals with almost every kind of disease. As part of its offering, it provides the Dr. Checkup mobile application, which brings natural language functionality to a series of applications. The mobile application not only maintains history, records, and payment transfers, but also helps users perform a basic diagnosis by providing their symptoms. This helps patients take quick action and understand the severity of their condition, as shown in Figure 6-1.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig1_HTML.jpg
Figure 6-1

The Dr. Checkup app provides a basic diagnosis

Once the disease is identified, patients can schedule an appointment with a specific specialist/doctor, as shown in Figure 6-2.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig2_HTML.jpg
Figure 6-2

Dr. Checkup is scheduling an appointment with a doctor after chatting with a patient

The Asclepius Consortium also uses the same NLU techniques for scheduling appointments in its frontend teller machines.

Microsoft’s Mission and NLU

Microsoft’s earlier mission was to put a computer on every desk and in every home, and it has been very successful in achieving that in most of the Western world. Recently, Microsoft revised its mission to “empowering every person and every organization on the planet to achieve more.” This new mission is more human in nature, where the earlier one was more technological. The supreme goal of artificial intelligence has always been to serve humanity, and in order for AI to serve humanity, it must be able to understand humans. As a first step, AI should understand and process the languages that humans speak.

As more and more intelligent systems are devised, it is important for these systems to understand human languages, interpret the meanings, and then take appropriate actions wherever necessary. This domain of interpreting human languages is called Natural Language Understanding (NLU). NLU is the ability of a machine to convert natural language text to a form that the computer can understand. In other words, it’s the ability to extract the meaning from the sentence. This enables the system to understand the “what” question of the language.

Take the use case in Chapter 4 where a person is asking an intelligent system questions like the following:

“How does my schedule look today?”

Now this question can be asked in a variety of ways. Here are some of them:

What’s my schedule for the day?

Are there a lot of meetings in the office?

How busy is my calendar for today?

To humans, it is immediately clear that these sentences are a person’s way of inquiring about her schedule for the day. For a machine, it may not be so clear. In fact, creating an NLU engine has been one of the more complex problems to solve. There are a variety of factors, but the main issue is that there is no single algorithm to solve it. Moreover, training a machine to understand grammar, rules, and slang is difficult. Thanks to the advent of deep learning, we can now achieve great precision in language understanding by training the engine with thousands of utterances, creating pattern recognition based on those examples, and then constantly improving it to identify formal and informal rules of language understanding.

Based on this NLU theory, various commercial offerings exist that are built on the intent-entity schema we briefly discussed in Chapter 4. An intent captures the intention of the user behind the question being asked. If you look at these examples, there were multiple ways of asking the same question, but the intention was the same: get me the schedule for the day. Once we know the right intent, it is straightforward for the system to respond with a correct answer.

To identify the right intention, NLU systems are also trained to recognize certain keywords. These keywords are called entities. In our previous examples, to get a schedule from a calendar, you often require a person’s name and a date. Each of these can be marked as an entity. It is important to understand that entities are optional; not all sentences are accompanied by them. There is also a good chance that a sentence has more than one entity. We discuss identifying intents and entities later in the chapter.

Language Understanding Intelligent Service (LUIS)

The Language Understanding Intelligent Service, aka LUIS, is Microsoft’s NLU cloud service and is part of the cognitive suite discussed in Chapter 4. It is important to understand that LUIS shouldn’t be treated as a full-fledged software application for your end user. LUIS only replaces the NLU component of the overall stack. It’s an NLU engine that handles the NLU implementation by abstracting away the inner machine learning model’s complexity. Your frontend can still be a website, a chatbot, or any application ranging from a graphical to a conversational user interface. In our Asclepius use case, the frontend is a mobile application (Dr. Checkup). Before we open LUIS and start working, it is important to understand the underpinnings of LUIS, which require careful design. Let’s look at the behind-the-scenes aspects of LUIS.

Designing on LUIS

The LUIS framework is like a clean slate that has been trained on a few built-in entities. As a first step, it is important to design the LUIS application in an effective manner. Designing a LUIS application requires a profound understanding of the problems your software is trying to solve. For example, our smart hospital use case requires LUIS to converse with patients to book an appointment with a doctor, or to chat with the hospital bot to detect whether a doctor’s attention is really required. There are no strict guidelines or rules to follow while implementing LUIS. However, based on our experience interacting with LUIS and other NLU systems, we came up with some high-level guidelines.

Design Guidelines for Using LUIS

The LUIS application should have the right intents and entities for this smart healthcare use case. There are many ways to identify them. We propose the following design principles for achieving effective identification:
  • Plan your scope first. It is very important to plan the scope and narrow it down to what’s inside the scope of LUIS. Clearly identifying the scope is the first step toward successfully delivering a LUIS implementation. We recommend you keep the scope of applying LUIS limited to a few tasks or goals.

  • Use a prebuilt domain (if possible). Once the scope is finalized, you know the use case or domain that LUIS is going to address. LUIS comes with some prebuilt domains that have a set of entities and intents already defined. These prebuilt domains are also pre-trained and ready to use. Use a prebuilt domain whenever you can. If your requirement can be met with prebuilt intents or entities, choose them before creating new ones. You can always customize the prebuilt domains, and you can pick a prebuilt domain and its entities even if you also have to create a few new ones to suit your requirements.

  • Identify the tasks that your application is going to perform. Once a domain is identified, list the tasks that your application will perform. These tasks represent the intentions of the end user when accessing your application. As you may have guessed, you create each task as an intent. It is important to break tasks down appropriately so that you don’t end up with non-specific tasks. The more tasks you have, the more rigorous the training needs to be.

  • Identify additional information required for tasks. Not all tasks are straightforward. Most of them require some additional information. This additional information is captured as entities. Entities complement intents in identifying specific tasks. You can think of entities as variables of your application that are required to store and pass information to your client application.

  • Training is key. At its core, LUIS accepts textual inputs called utterances. LUIS needs to be trained extensively on these utterances before its model will really understand future utterances. The more example utterances you provide, the better the model. To gather various utterances, collect phrases that you expect users to type and identify the different ways the same question can be asked. If your application must deal with multiple similar use cases, there’s a good chance LUIS will get confused unless each of the similar use cases receives sufficient training. Even so, it is not safe to assume that LUIS will respond with absolute accuracy. LUIS is based on active machine learning, which ensures that the model keeps learning and improving over time. In fact, LUIS keeps track of all utterances for which it was unable to predict an intent with high confidence. You will find such utterances under the “Suggested Utterances” section on an intent’s page. Use this option to appropriately label utterances and confirm to LUIS whether it was right or wrong. LUIS training is a continuous process that goes on until the Suggested Utterances section shows no suggestions.

Plan Your Scope First

The high-level sequence diagram for this mobile application and frontend teller is shown in Figure 6-3.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig3_HTML.jpg
Figure 6-3

The high-level interactions between various actors while booking appointments or during initial diagnosis

Here are the stepwise interactions:
  1. The patient asks, either through the mobile application or the frontend teller machine (by message or by voice), to talk with Dr. Checkup.

  2. If the patient submits the query through voice, the mobile app/frontend teller machine uses the Microsoft Speech API to convert the voice into text.

  3. The application then passes the text to the LUIS app to determine the intent of the question asked.

  4. LUIS performs its NLU analysis and returns its predictions for intent and entity. The result is returned as JSON back to the application.

  5. The application checks the top scoring intent in the returned JSON and decides on the desired action. The action can be scheduling an appointment or providing a basic diagnosis. If the action is scheduling an appointment, it checks the doctor’s calendar and sets the appointment. For a basic diagnosis, it talks to the database by passing the intent and entity combination received.

  6. The database has a predefined set of answers stored for each combination of intent and entity. The answer is retrieved and sent back to the mobile application.

  7. The application displays the desired result to the patient, either as text or as voice.

In these scenarios and applications, NLU plays a vital role in the implementation. Before we delve more into other aspects, let us first look at how to identify intents and entities.

Identifying Intents and Entities

Once the scope is finalized, the next step is to identify the intents and entities. Let’s take the case of early diagnosis. What do you expect from the end user? Apart from the name, age, and gender details, you expect the user to list his symptoms so that, based on some general set of rules, an early diagnosis can be provided. Now the user can provide these symptoms in a simple sentence or in a couple of sentences. Figure 6-4 shows some of the examples of how someone can provide symptoms.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig4_HTML.jpg
Figure 6-4

Some of the ways that users can share their symptoms

If you look at these examples, they all map to the early diagnosis intent. Most of these early diagnosis sentences contain various entities, like body part and symptom. Each body part and symptom placeholder can be replaced with an actual body part or symptom. For the first example, you can have the following:
../images/458845_1_En_6_Chapter/458845_1_En_6_Figa_HTML.jpg
Of course, this list is not exhaustive and you may need to work on creating a more complete list. Let’s take another use case of scheduling an appointment with the doctor. What do you need to schedule a doctor appointment? Essentially, you need the type of doctor (gynecologist, dentist, cardiologist, and so on), the potential date, and optionally a specific time range, before blocking the time of the doctor. Figure 6-5 shows some of the ways you can schedule your appointment.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig5_HTML.jpg
Figure 6-5

Some ways to schedule an appointment with a doctor

These examples are for setting an appointment with the doctor and have the entity doctortype associated with them. There are times when the user specifies the intent but doesn’t provide an entity. Then it’s up to the application logic to decide on the next step. You can check back with the user or even proceed by assuming some default value. Consider this example:

I want to see a doctor

The user wants to schedule an appointment with a doctor, but doesn’t specify the type of doctor or even the date and time. In this scenario, you can set a default value for the entity. In this case, it can be a physician, for example, and the appointment time can be the next available slot.
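As a rough illustration, the client application could apply such defaults as shown below. The LuisEntity type is hypothetical (not part of an official LUIS SDK), and DoctorType is simply the entity name we use for the doctor's specialty:

using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shape populated by the client after parsing the LUIS JSON.
public class LuisEntity
{
    public string Type { get; set; }
    public string Value { get; set; }
}

public static class AppointmentDefaults
{
    // Falls back to "physician" when LUIS returns the scheduleAppointment intent
    // without a DoctorType entity.
    public static string ResolveDoctorType(IEnumerable<LuisEntity> entities)
    {
        return entities.FirstOrDefault(e => e.Type == "DoctorType")?.Value ?? "physician";
    }
}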

Creating a Data Dictionary for LUIS

As you know by now, LUIS doesn’t perform any calculations or logic. It only extracts meaning from utterances by identifying their intents and entities. As part of a good design process, it helps to create a data dictionary before you open LUIS and start working on it. The data dictionary is like a bible of metadata that stores the utterances along with their associated intents and entities. You can use it as a design document for your NLU. It is also helpful because, through this dictionary, you can determine the number of intents and entities to be created. Table 6-1 shows a good three-column data dictionary for the LUIS application.
Table 6-1

The Data Dictionary Sample for Dr. Checkup

Utterances | Intent | Entities
Schedule an appointment with the physician tomorrow before 5 PM | scheduleAppointment | DoctorType = physician, datetime = tomorrow, AppointmentTime::StartTime = AppointmentTime::EndTime = 5 PM
Pain in abdomen | checkCondition | Symptom = pain, Body part = abdomen
Feverish with loud cough | checkCondition | Symptom = feverish, Body part = loud cough
Set an appointment with orthopedist | scheduleAppointment | Doctor type = orthopedist

This table is just an example, but it would be wise to create it as an Excel sheet and use it as a design document for your LUIS application.

Tip

The data dictionary shown in Table 6-1 is a basic dictionary to get started. You can extend it and make it more usable. For instance, add an Answer column that provides a message about what answer would be returned to the user. You can also use this data dictionary to create test use cases.
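As a rough illustration of that last point, the dictionary rows can be captured as simple test records and later replayed against the published LUIS endpoint. The LuisTestCase shape below is hypothetical, not an official LUIS format:

// Hypothetical test-case shape built from the data dictionary rows in Table 6-1.
public class LuisTestCase
{
    public string Utterance { get; set; }
    public string ExpectedIntent { get; set; }
    public string[] ExpectedEntities { get; set; }
}

public static class DrCheckupTestData
{
    public static readonly LuisTestCase[] Cases =
    {
        new LuisTestCase
        {
            Utterance = "Schedule an appointment with the physician tomorrow before 5 PM",
            ExpectedIntent = "scheduleAppointment",
            ExpectedEntities = new[] { "DoctorType", "datetime", "AppointmentTime" }
        },
        new LuisTestCase
        {
            Utterance = "Pain in abdomen",
            ExpectedIntent = "checkCondition",
            ExpectedEntities = new[] { "Symptom", "Body part" }
        },
        new LuisTestCase
        {
            Utterance = "Set an appointment with orthopedist",
            ExpectedIntent = "scheduleAppointment",
            ExpectedEntities = new[] { "Doctor type" }
        }
    };
}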

Identifying intents and entities is one of the most important tasks. Once the intents and entities are identified, the next task is to create them in LUIS. Before you can do that, you need to grab a subscription key for LUIS.

Getting a Subscription Key for LUIS

Chapter 5 described in detail how to create an account in Azure Portal and get the subscription key for the Vision API. You can use the same account to get a subscription key for LUIS. Open Azure Portal and, from the left side menu, choose New ➤ AI + Cognitive Services. Click on Language Understanding, as shown in Figure 6-6.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig6_HTML.jpg
Figure 6-6

A screenshot of the Azure Service Portal after selecting the new option from the menu

Fill out the form with an appropriate subscription, pricing tier, and resource, as shown in Figure 6-7. You may want to get started with the free tier (F0), as we discussed in the previous chapter. Click on Create.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig7_HTML.jpg
Figure 6-7

Filling in the form for creating a LUIS subscription key

Once the form is submitted, it takes some time for the account to be validated and created. You can go back to the dashboard after receiving a notification. Click on your LUIS account in the dashboard and grab the keys, as shown in Figure 6-8.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig8_HTML.jpg
Figure 6-8

The LUIS subscription keys are created in the Azure Portal

Apply the Subscription

Now you have everything ready, so the first task is to add the newly created subscription keys to LUIS.
  1. Open the LUIS site at https://luis.ai.

  2. Log in with the same Microsoft account that you used in the Azure Portal. At the time of writing, LUIS asks you to accept the license and to grant LUIS permission to access your Microsoft account.

  3. Click OK. You are now on the LUIS dashboard page. If this is the first time you have logged in, the MyApps section will be empty.

  4. Click on Create New App to create a new app, and specify a name, culture, and optional description, as shown in Figure 6-9. Click Done. You can provide your own name and culture as required.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig9_HTML.jpg
Figure 6-9

Creating a new application in LUIS

Congratulations, you have quickly created an application. Now it’s time to add your subscription keys to the application before creating intents and entities.

Applying the Subscription Key in LUIS

When you clicked Done previously, you were redirected to the application’s page. If not, click on Asclepius NLU in the MyApps section.

Click on Publish and scroll down to the “Resource and Keys” section. Click on Add Keys and select the tenant, the Azure subscription, and then the LUIS account, as shown in Figure 6-10. The tenant represents the Azure Active Directory tenant of your client or organization.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig10_HTML.jpg
Figure 6-10

Assigning a LUIS key to the app

It’s time now to add intents and entities.

Adding Intent and Entities

Adding intents and entities is very straightforward. To add a new intent, go to the Intents page from the left sidebar. If this is the first time you have visited this page, you won’t have any intents created. Keep your data dictionary handy. Now click on Create New Intent, specify the first intent from the data dictionary as the intent name, and click Done. The first column of your data dictionary also lists the sample utterances; start adding them as shown previously. After adding the initial utterances, click the Save button to save your changes. Hover over the word Psychiatrist in the just-added utterance to see it surrounded by square brackets. Clicking on the word gives you the option to label it either as an existing entity or as a new entity. Create an entity called Doctor type. Figure 6-11 shows how the scheduleAppointment intent looks once you have committed the initial utterances.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig11_HTML.jpg
Figure 6-11

The ScheduleAppointment intent screenshot with a few utterances

Repeat this process for all the intents, entities, and utterances in the data dictionary. You should now train, test, and publish the LUIS app as an endpoint that the mobile application can use.

Training and Testing LUIS

One of the reasons we followed this process was to ensure that all the intents have some sample utterances. If an intent doesn’t have any utterances, you can’t train the application. Hover over the Train button and you will see the message “App has untrained changes”. Click Train at the top right to train the app. By training, you ensure that LUIS creates a generalized model from the utterances and will be able to identify intents and entities. Once the training process completes successfully, your Train button shows a green notification bar mentioning this fact. If you now hover over the Train button, you will see the message “App up to date”.

LUIS also provides a testing interface so the user can test the application. Click on the Test button to open the Test slide-out page and start interactive testing. Provide some utterances other than what you have used in training to see whether the results come in as expected or not. Your testing slide also contains the inspect panel to inspect LUIS results and change the top scoring intent, as shown in Figure 6-12.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig12_HTML.jpg
Figure 6-12

The LUIS test and inspect screen to inspect the LUIS result and change the top scoring intent

You will soon realize that not all tests yield the desired result. Initially, apart from getting positive output from LUIS, you may also end up getting different intents for your test utterances, or even getting the right intent without an entity. Both cases require more rigorous training, which can come either through your training utterances or through the production application. For your application to use the model, you need to publish it to a production endpoint; this gives you an HTTP endpoint that can be used to call your LUIS app.

Publishing LUIS App

The endpoint is nothing but a web service to access a LUIS app. The endpoint receives an utterance via a query string parameter and returns a JSON output with a corresponding intent-entities breakdown. When you are publishing for the first time, an endpoint URL is generated for you based on your application’s ID and the subscription key. This URL remains constant throughout the lifetime of your LUIS app.

For creating a production endpoint, click on Publish on the top left. Select Production from the Publish to dropdown. If you want your JSON endpoint to include all predicted scores, check the Include All Predicted Intent Scores option. Click on Publish to Production Slot as shown in Figure 6-13. After publishing, you will get the endpoint.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig13_HTML.jpg
Figure 6-13

Publishing the LUIS app to get an endpoint

Using a LUIS Endpoint

Using a LUIS endpoint is easy. You just need to call the HTTP endpoint created in the earlier section and pass your query in the q query string parameter. If the endpoint is correct, you get a JSON response like the following:

{
  "query": "can you set appointment for dentist",
  "topScoringIntent": {
    "intent": "ScheduleAppointment",
    "score": 0.999998569
  },
  "intents": [
    {
      "intent": "ScheduleAppointment",
      "score": 0.999998569
    },
    {
      "intent": "GetCondition",
      "score": 0.0250619985
    }
],
  "entities": []
}
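From C#, the call itself can be a simple HTTP GET. Here's a minimal sketch using HttpClient and Json.NET; the endpoint URL shape, region, app ID, and key below are placeholders, so copy the exact URL that LUIS shows on its Publish page:

using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public class LuisClient
{
    // Placeholder values; use the endpoint URL shown on your app's Publish page.
    private const string EndpointBase =
        "https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/<your-app-id>";
    private const string SubscriptionKey = "<your-luis-key>";

    public async Task<string> GetTopIntentAsync(string utterance)
    {
        using (var client = new HttpClient())
        {
            var uri = EndpointBase + "?subscription-key=" + SubscriptionKey +
                      "&q=" + WebUtility.UrlEncode(utterance);

            var json = await client.GetStringAsync(uri);
            var result = JObject.Parse(json);

            // For example, "ScheduleAppointment" with a score close to 1.0 for a well-trained model
            return (string)result["topScoringIntent"]["intent"];
        }
    }
}

The client application can then branch on the returned intent (ScheduleAppointment, GetCondition, and so on) to decide whether to book a slot or run the basic diagnosis lookup.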

You now understand the essence of training: the more training there is, the higher the confidence score. Our frontend teller and mobile application use this endpoint and take the top scoring intent to talk to our database, which in turn returns the actual answer to be displayed to the user. Apart from rendering the UI, the mobile device and the frontend teller also have the additional work of maintaining the user’s state. LUIS supports a limited form of dialog management but, at the time this book was written, that support wasn’t very flexible, so we urge you to maintain state in your own application. Our frontend application also caters to voice interaction, so let’s look at how to implement speech in our application.

Interaction with Speech

New devices are coming up almost every day. Gone are the days when we just expected our devices to understand a fixed set of commands and act on them. Interaction with these devices has changed drastically in the last decade. Among all interaction modes, speech has been one of the most powerful, and it is of course a natural way for users to interact. Consider personal assistants like Cortana and Siri, or even the voice interfaces controlling cars: almost all of them rely on natural speech interaction. The Asclepius chain of hospitals has patients coming from all walks of life, and it wants its frontend teller machines, along with its mobile app, to have speech interaction capabilities.

Broadly speaking, the speech interactions used in the Asclepius mobile and frontend teller applications must convert speech into text or vice versa. Traditionally, implementing speech in an application has always been a difficult job. Thankfully, the Microsoft Cognitive Services Bing Speech API provides an easy way to enhance your application with speech-driven scenarios. The Speech API abstracts all the complexity of the speech algorithms and presents an easy-to-use REST API.

Getting Started with Bing Speech API

The Bing Speech API has various capabilities to suit customer requirements, which are accomplished using speech recognition and speech synthesis. Speech recognition, aka speech to text, allows you to handle spoken words from your users in your application using a recognition engine and convert the words to text. Speech synthesis, also known as Text to Speech (TTS), allows you to speak words or phrases back to the users through a speech synthesis engine.

Speech to Text

The Bing Speech API provides these features through an easy-to-use REST API. Like all other Microsoft Cognitive APIs, all interactions are done through HTTP POST. All calls go through these APIs, which are hosted on an Azure cloud server. You need to follow these steps to use the Bing Speech API for speech to text:
  1. Get a JSON Web Token (JWT) by calling the token service.

  2. Put the JWT token in the header and call the Bing Speech API.

  3. Parse the text.

Getting the JWT Token

All calls to the Microsoft Cognitive Services API require authentication before actually using the service. Unlike other Cognitive Services, which just require a subscription key to be passed, the Bing Speech API also requires an access token, also called a JWT token, before it’s called. access_token is the JWT token passed as a base64 string in the speech request header. To obtain the JWT token, users need to send a POST message to the token service and pass the subscription key as shown here:

POST https://api.cognitive.microsoft.com/sts/v1.0/issueToken HTTP/1.1
Host: api.cognitive.microsoft.com
Ocp-Apim-Subscription-Key: 000000000000000000000000000000000

The following code shows you how to use this in the frontend teller ASP.NET application:

private Guid instanceID = Guid.NewGuid();
private string token;
private Timer timer;
private const int TokenExpiryInSeconds = 600;
private const string ApiKey = "<your-bing-speech-subscription-key>"; // placeholder

public async Task<string> GetTextFromAudio(Stream audiostream)
{
    var requestUri = @"https://speech.platform.bing.com/recognize?scenarios=smd&appid=D4D52672-91D7-4C74-8AD8-42B1D98141A5&locale=en-US&device.os=WindowsOS&version=3.0&format=json&instanceid=" + instanceID + "&requestid=" + Guid.NewGuid();
    using (var client = new HttpClient())
    {
        var token = this.getValidToken();
        client.DefaultRequestHeaders.Add("Authorization", "Bearer " + token);
        using (var binaryContent = new ByteArrayContent(StreamToBytes(audiostream)))
        {
            binaryContent.Headers.TryAddWithoutValidation("content-type", "audio/wav; codec=\"audio/pcm\"; samplerate=18000");
            var response = await client.PostAsync(requestUri, binaryContent);
            var responseString = await response.Content.ReadAsStringAsync();
            dynamic data = JsonConvert.DeserializeObject(responseString);
            return data.header.name;
        }
    }
}

// Converts the incoming audio stream to a byte array for the POST body
private static byte[] StreamToBytes(Stream input)
{
    using (var memoryStream = new MemoryStream())
    {
        input.CopyTo(memoryStream);
        return memoryStream.ToArray();
    }
}

private string getValidToken()
{
    this.token = GetNewToken();
    this.timer?.Dispose();
    this.timer = new Timer(
        x => this.getValidToken(),
        null,
        // Refresh the token after 9 minutes (1 minute before it expires)
        TimeSpan.FromSeconds(TokenExpiryInSeconds).Subtract(TimeSpan.FromMinutes(1)),
        TimeSpan.FromMilliseconds(-1)); // Each scheduled callback runs only once
    return this.token;
}

private static string GetNewToken()
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", ApiKey);
        var response = client.PostAsync("https://api.cognitive.microsoft.com/sts/v1.0/issueToken", null).Result;
        return response.Content.ReadAsStringAsync().Result;
    }
}

Code Walkthrough

As an experienced developer, you might have immediately observed that the essence of the code for calling speech to text lies in the GetTextFromAudio method. Here is the step-by-step procedure for this code:
  1. Create a URL pointing to the speech endpoint with the necessary parameters. Ensure each parameter is used only once; otherwise, you end up getting an HTTP 400 error.

  2. Call the getValidToken method to get the JWT token. To ensure security, each JWT token is valid for only 10 minutes, so tokens need to be refreshed on or before the 10-minute mark to ensure they are always valid. Calling the Speech API with an invalid token results in an error. getValidToken shows one mechanism to achieve this; you are free to use your own. Internally, getValidToken calls the GetNewToken method to get the JWT token.

  3. Pass the valid JWT token as an authorization header, prefixed with the string Bearer.

  4. Convert the audio being passed from analog to digital by using codecs.

  5. Call the Speech API asynchronously and get the JSON data.

  6. Deserialize the JSON into a .NET object for further use.

Congratulations! You now know how to convert speech to text with just a few lines of code. A quick usage sketch follows, and then we’ll learn how to convert text to speech as well.
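For instance, the frontend teller could call GetTextFromAudio on a recorded clip like this (the WAV file path is a hypothetical placeholder; add using System.IO for File):

// Hypothetical caller: read a recorded WAV clip and hand the recognized text to LUIS.
public async Task HandlePatientAudioAsync(string wavFilePath)
{
    using (var audio = File.OpenRead(wavFilePath))
    {
        string recognizedText = await GetTextFromAudio(audio);
        Console.WriteLine("Patient said: " + recognizedText);
        // Next step: pass recognizedText to the LUIS endpoint shown earlier.
    }
}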

Text to Speech

Text-to-speech conversion follows nearly the same pattern as speech-to-text conversion. Let’s look at the code first and then walk through it:

private string GenerateSsml(string locale, string gender, string name, string text)
{
    var ssmlDoc = new XDocument(
                        new XElement("speak",
                            new XAttribute("version", "1.0"),
                            new XAttribute(XNamespace.Xml + "lang", "en-US"),
                            new XElement("voice",
                                new XAttribute(XNamespace.Xml + "lang", locale),
                                new XAttribute(XNamespace.Xml + "gender", gender),
                                new XAttribute("name", name),
                                text)));
    return ssmlDoc.ToString();
}

public void ConvertTexttoSpeech(string text)
{
    string ssml = this.GenerateSsml("en-US", "Female", "TexttoSpeech", text);
    byte[] TTSAudio = this.convertTextToSpeech(ssml);
    SoundPlayer player = new SoundPlayer(new MemoryStream(TTSAudio));
    player.PlaySync();
}

private byte[] convertTextToSpeech(string ssml)
{
    var token = this.getValidToken();
    var client = new RestClient("https://speech.platform.bing.com/synthesize");
    var request = new RestRequest(Method.POST);
    request.AddHeader("authorization", "Bearer " + token);
    request.AddHeader("x-search-clientid", "8ae9b9546ebb49c98c1b8816b85779a1");
    request.AddHeader("x-search-appid", "1d51d9fa3c1d4aa7bd4421a5d974aff9");
    request.AddHeader("x-microsoft-outputformat", "riff-16khz-16bit-mono-pcm");
    request.AddHeader("user-agent", "MyCoolApp");
    request.AddHeader("content-type", "application/ssml+xml");
    request.AddParameter("application/ssml+xml", ssml, ParameterType.RequestBody);
    IRestResponse response = client.Execute(request);
    return response.RawBytes;
}

Code Walkthrough

Here are the steps to convert text to speech:
  1. Create the SSML markup for the text to be converted into speech by calling the GenerateSsml method (a sample of the generated markup follows this list).

     Note Speech Synthesis Markup Language (SSML) is a standard way to represent speech in XML format and is part of a W3C specification. SSML provides a uniform way of creating speech-based markup text. Check the official spec for the complete SSML syntax at https://www.w3.org/TR/speech-synthesis .

  2. Call the Text to Speech API. You first need a valid token; for that, you can reuse the getValidToken() method shown in the speech-to-text code.

  3. Make a POST request to https://speech.platform.bing.com/synthesize to get a byte array of the audio sent back as a response by the Text to Speech API. There are various ways to make POST requests; we use a popular third-party HTTP library called RestSharp, which is easily installed in Visual Studio via NuGet.

  4. Use the SoundPlayer class, a built-in .NET class that plays audio files and streams, to turn the returned byte array into audible speech. The format of the audio is determined by the value of the x-microsoft-outputformat header. Because SoundPlayer only supports WAV audio, use riff-16khz-16bit-mono-pcm as the value of the output format header.
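For reference, the SSML that GenerateSsml produces for a short confirmation message looks roughly like this (the sentence is just an illustrative example, and attribute ordering may differ slightly in the XDocument output):

<speak version="1.0" xml:lang="en-US">
  <voice xml:lang="en-US" xml:gender="Female" name="TexttoSpeech">Your appointment with the orthopedist is confirmed.</voice>
</speak>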

Identifying and Recognizing Faces

Like any smart hospital, Asclepius has a very tight surveillance system. The Asclepius Consortium is also very careful about the number of people visiting patients. The surveillance system monitors the inventory warehouse, restricts access to security and warehouse zones to a limited set of doctors and nurses, and does the same for attending patients. This smart digital system keeps unwanted people from entering by raising an alarm when a security breach is attempted. Asclepius also has an automated attendance system, which doesn’t require the employee to carry an identity card or any device. As soon as an employee enters the hospital, she is identified from the CCTV camera and her attendance is registered. Her time in and out is also monitored. All rooms and bays have a digital identification mechanism that allows only authorized people to enter. All of this has been achieved through the use of various cognitive technologies, especially the Face API.

How Does the Face API Work?

At a high level, the Face API helps in detecting, identifying, verifying, and recognizing faces. It can also be used to get more insights about a face. Face detection and identification is not a new concept; it has been used across academia, government institutions, and industries for decades in one form or another. The Face API extends this by packaging years of Microsoft research into face detection and recognition in a simple-to-use API.

How Does Asclepius Achieve Strong Surveillance?

Asclepius has created various security groups to handle its security. Each hospital employee and visitor is added to one of the known groups (Visitors, Regular, Patients, Doctors, Vendors, Admin). By default, any unknown person is tagged as part of the Visitors group. All patients are tagged under the Patients group. All doctors belong to the Doctors group, and some of the doctors and hospital employees belong to the Admin group, which has access to all the bays. With the help of smart CCTV cameras, images are captured live as frames. Only authorized people from certain person groups get access to certain rooms, as shown in Figure 6-14. The same process is followed to identify employees among the people entering the building.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig14_HTML.jpg
Figure 6-14

How Asclepius consortium is using face recognition to recognize people and open the door for them

Getting Keys for the Face API

As with all cognitive services, you must first get a subscription key for the Face API via the Azure Portal.
  1. Go to the Azure Portal ➤ AI + Cognitive Services blade.

  2. Select the API type as Face API.

  3. Fill out the location, pricing tier, and resource group, and then click Create to create a subscription.

Make sure the subscription key for the Face API is handy. You’ll need it soon.

Creating a Person and Person Group

Follow these steps to set up face identification and verification:
  1. Create a Person group.

  2. Create a person.

  3. Add faces to the person.

  4. Train the Person group.
Once these tasks are achieved, the Person group can be used to authenticate and authorize people. While doing any face identification, you need to set the scope. The Person group determines the overall scope of the face identification. Each Person group contains one or more people. For example, only the Admin Person group has authority to access all security inventory. The Person group contains many people and each person internally contains multiple face images to get it trained, as shown in Figure 6-15.
../images/458845_1_En_6_Chapter/458845_1_En_6_Fig15_HTML.png
Figure 6-15

The correlation between the Person group and a person

As shown in Figure 6-15, John, Marc, and Ken are part of the Admin Person group. Likewise, Asclepius has other person groups that handle authentication and authorization for other tasks. For example, the Regular Person group is for all the regular employees in the hospital. The Regular Person group is used to handle the automated attendance process by monitoring and validating all people entering the hospital using a secured bay.

Creating a Person group is a straightforward and easy process. You call an HTTP PUT API, available at https://[location].api.cognitive.microsoft.com/face/v1.0/persongroups/[personGroupId], passing the Person group name and the Face API subscription key that you created earlier.

PUT https://westcentralus.api.cognitive.microsoft.com/face/v1.0/persongroups/Admin HTTP/1.1
Host: westcentralus.api.cognitive.microsoft.com
Content-Type: application/json
Ocp-Apim-Subscription-Key: 000000000000000000000000000000000000000000000000

If you call the API with a valid subscription key and your Person group name is unique, you get a successful 200 response with an empty response body. Note that all the Face APIs only accept application/json.

Once all the groups are added, the next step is to add people to these groups. Adding a person is also an easy task. You make a POST call to https://[location].api.cognitive.microsoft.com/face/v1.0/persongroups/{personGroupId}/persons, where location is replaced with the location used while creating the Face API subscription and personGroupId is replaced with the Person group to which the person needs to be added. The name of the person to be created goes in the JSON request body. For example, the following call adds Nishith to the Admin group created previously:

POST https://westus.api.cognitive.microsoft.com/face/v1.0/persongroups/admin/persons HTTP/1.1
Host: westus.api.cognitive.microsoft.com
Content-Type: application/json
Ocp-Apim-Subscription-Key: 00000000000000000000000000000000
{
    "name":"Nishith",
    "userData":"User-provided data attached to the person"
}

A successful call creates a unique person ID and returns the JSON response shown here:

Pragma: no-cache
apim-request-id: 845a9a5f-9c4f-4f96-a126-04f2916a602d
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
x-content-type-options: nosniff
Cache-Control: no-cache
Date: Sun, 31 Dec 2017 11:26:35 GMT
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Content-Length: 51
Content-Type: application/json; charset=utf-8
Expires: -1
{
  "personId": "91ec6844-2be9-46d9-bc7f-c4d2deab166e"
}
Each person is identified by his or her unique person ID. From now on, Nishith is referred to by person ID 91ec6844-2be9-46d9-bc7f-c4d2deab166e. For a person creation call to succeed, you also need to stay within the person and Person group limits of your subscription. The following table shows the limits for each tier.

Tier | Persons per Person Group | Persons per Subscription
Free Tier | 1K | 1K
S0 | 10K | 1M

If you exceed the limit, you will get a QuotaExceeded error. Note that while creating the person, you didn’t provide any faces; that’s the next step.

Add Faces

In order to add faces, call https://[location].api.cognitive.microsoft.com/face/v1.0/persongroups/{personGroupId}/persons/{personId}/persistedFaces via an HTTP POST. Be sure to replace location, personGroupId, and personId with the appropriate values. Pass the image as a URL in the request body. For example, use the following call to add Nishith’s image to the Nishith person in the Admin group:

POST https://westus.api.cognitive.microsoft.com/face/v1.0/persongroups/admin/persons/91ec6844-2be9-46d9-bc7f-c4d2deab166e/persistedFaces HTTP/1.1
Host: westus.api.cognitive.microsoft.com
Content-Type: application/json
Ocp-Apim-Subscription-Key: 00000000000000000000000000000000
{
    "url":" https://pbs.twimg.com/media/DReABwzW4AAv7hL.jpg"
}

A successful call with this URL will result in returning a persisted face ID:

Pragma: no-cache
apim-request-id: 7e5f7e9d-4ad1-4cd4-a1df-ec81f4f37d33
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
x-content-type-options: nosniff
Cache-Control: no-cache
Date: Mon, 01 Jan 2018 09:55:08 GMT
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Content-Length: 58
Content-Type: application/json; charset=utf-8
Expires: -1
{
  "persistedFaceId": "185e9d95-d52b-4bd9-b6a9-f512c4dbd5a2"
}

Make sure the image is specified as a URL in the request body and that the URL is accessible from the Internet. Only JPEG, PNG, GIF, and BMP files are supported, and each image should be no more than 4MB. Each person can have multiple faces so that efficient training can be done; you can tag up to 248 faces per person.

Training Is the Key

More training results in better accuracy, so we must now train the Person group. Any change to the faces requires training the Person group again before using it to identify faces. To train the Person group, make an HTTP POST call to https://[location].api.cognitive.microsoft.com/face/v1.0/persongroups/[persongroupname]/train and pass the subscription key. Replace persongroupname and location appropriately. For example, to train the Admin Person group, the HTTP request would be:

POST https://westus.api.cognitive.microsoft.com/face/v1.0/persongroups/admin/train HTTP/1.1
Host: westus.api.cognitive.microsoft.com
Ocp-Apim-Subscription-Key: 00000000000000000000000000000000

A successful call to the API returns an empty JSON body.
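Putting these REST calls together, here's a minimal C# sketch of the enrollment flow (create a Person group, add a person, attach a face, and trigger training). It uses HttpClient and Json.NET directly; the region, key, and image URL are placeholders, and error handling is omitted for brevity:

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public class FaceEnrollmentClient
{
    private const string BaseUrl = "https://westus.api.cognitive.microsoft.com/face/v1.0";
    private const string SubscriptionKey = "<your-face-api-key>"; // placeholder

    private readonly HttpClient client;

    public FaceEnrollmentClient()
    {
        client = new HttpClient();
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", SubscriptionKey);
    }

    // Create the Person group (e.g. "admin"), add one person with one face, then train.
    public async Task<string> EnrollAsync(string groupId, string personName, string faceImageUrl)
    {
        // 1. Create the Person group
        await SendAsync(HttpMethod.Put, BaseUrl + "/persongroups/" + groupId,
            new JObject { ["name"] = groupId });

        // 2. Create the person and capture the returned personId
        var personJson = await SendAsync(HttpMethod.Post,
            BaseUrl + "/persongroups/" + groupId + "/persons",
            new JObject { ["name"] = personName });
        string personId = (string)JObject.Parse(personJson)["personId"];

        // 3. Add a face to the person
        await SendAsync(HttpMethod.Post,
            BaseUrl + "/persongroups/" + groupId + "/persons/" + personId + "/persistedFaces",
            new JObject { ["url"] = faceImageUrl });

        // 4. Kick off training for the group
        await SendAsync(HttpMethod.Post, BaseUrl + "/persongroups/" + groupId + "/train", null);

        return personId;
    }

    private async Task<string> SendAsync(HttpMethod method, string url, JObject body)
    {
        var request = new HttpRequestMessage(method, url);
        if (body != null)
            request.Content = new StringContent(body.ToString(), Encoding.UTF8, "application/json");
        var response = await client.SendAsync(request);
        return await response.Content.ReadAsStringAsync();
    }
}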

Using the Face API for Authentication

Once the Person group is trained, the Face API can be used to authenticate or identify an unknown face against that group. As mentioned, all the bays and rooms of Asclepius have digital entrance mechanisms supported by smart CCTV cameras. The digital locks are tied to one or more Person groups. A door only opens when the system authenticates and authorizes a person as being part of the associated Person group.

Whenever a person tries to enter a room, the system first calls the Face Detect API, passing the image captured by the CCTV camera; the API detects the human faces in the image and returns a face ID for each of them. Those face IDs are then passed to the Face Identify API to test them against the Person group. When the Face Identify API returns a confidence score of more than 0.9, the door opens. The system also tracks the time spent in that room.

To call the Face Detect API for the captured image, use the URL https://westus.api.cognitive.microsoft.com/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=false. Pass the subscription key as a header and the image captured by the CCTV camera as part of the JSON request body. For example:

POST https://westus.api.cognitive.microsoft.com/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=false HTTP/1.1
Host: westus.api.cognitive.microsoft.com
Content-Type: application/json
Ocp-Apim-Subscription-Key: 00000000000000000000000000000000
{
    "url":"https://asclepius.com/media/DReABwzW4AAv7hL.jpg"
}

The response returns a JSON array of face entries, each with a face ID, as shown here:

Pragma: no-cache
apim-request-id: b6a209c3-fd26-422c-8baa-99b54e827d86
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
x-content-type-options: nosniff
Cache-Control: no-cache
Date: Mon, 01 Jan 2018 10:35:37 GMT
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Content-Length: 113
Content-Type: application/json; charset=utf-8
Expires: -1
[{
  "faceId": "1e685d67-43b9-470f-8c06-e9b0cd5d6584",
  "faceRectangle": {
    "top": 134,
    "left": 525,
    "width": 74,
    "height": 74
  }
}]

In this code, a single face is returned along with its face ID. Pass the face ID to the Face Identify API, as shown in the following code, to validate whether the captured face belongs to the Person group that’s authorized to enter the bay:

POST https://westus.api.cognitive.microsoft.com/face/v1.0/identify HTTP/1.1
Host: westus.api.cognitive.microsoft.com
Content-Type: application/json
Ocp-Apim-Subscription-Key: 00000000000000000000000000000000
{    
    "personGroupId":"admin",
    "faceIds":[
        "1e685d67-43b9-470f-8c06-e9b0cd5d6584"
        ],
    "maxNumOfCandidatesReturned":4,
    "confidenceThreshold": 0.9
}

In this code, the confidenceThreshold parameter is set to 0.9; it acts as the cutoff confidence score for determining the identity of a person. confidenceThreshold is an optional parameter and should have a value between 0 and 1. With a valid subscription, the API returns the result as JSON:

Pragma: no-cache
apim-request-id: de275faf-1cdf-4476-b19f-6466ebdbcbea
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
x-content-type-options: nosniff
Cache-Control: no-cache
Date: Mon, 01 Jan 2018 10:45:48 GMT
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Content-Length: 135
Content-Type: application/json; charset=utf-8
Expires: -1
[{
  "faceId": "1e685d67-43b9-470f-8c06-e9b0cd5d6584",
  "candidates": [{
    "personId": "91ec6844-2be9-46d9-bc7f-c4d2deab166e",
    "confidence": 1.0
  }]
}]

In this response, the face is identified with a confidence of 1.0, so the digital door opens.
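Here's a minimal C# sketch of that detect-then-identify flow, again using HttpClient and Json.NET; the region, key, and image URL are placeholders:

using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public class DoorAccessChecker
{
    private const string BaseUrl = "https://westus.api.cognitive.microsoft.com/face/v1.0";
    private const string SubscriptionKey = "<your-face-api-key>"; // placeholder

    // Returns true when at least one detected face is identified in the given
    // Person group with a confidence of 0.9 or more.
    public async Task<bool> IsAuthorizedAsync(string imageUrl, string personGroupId)
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", SubscriptionKey);

            // Step 1: detect faces in the CCTV frame and collect their face IDs
            var detectBody = new JObject { ["url"] = imageUrl };
            var detectResponse = await client.PostAsync(
                BaseUrl + "/detect?returnFaceId=true&returnFaceLandmarks=false",
                new StringContent(detectBody.ToString(), Encoding.UTF8, "application/json"));
            var faces = JArray.Parse(await detectResponse.Content.ReadAsStringAsync());
            if (!faces.Any()) return false;

            // Step 2: identify the detected faces against the Person group
            var identifyBody = new JObject
            {
                ["personGroupId"] = personGroupId,
                ["faceIds"] = new JArray(faces.Select(f => f["faceId"])),
                ["maxNumOfCandidatesReturned"] = 4,
                ["confidenceThreshold"] = 0.9
            };
            var identifyResponse = await client.PostAsync(
                BaseUrl + "/identify",
                new StringContent(identifyBody.ToString(), Encoding.UTF8, "application/json"));
            var results = JArray.Parse(await identifyResponse.Content.ReadAsStringAsync());

            // Open the door only if any face has a candidate at or above the threshold
            return results.Any(r => r["candidates"].Any(
                c => (double)c["confidence"] >= 0.9));
        }
    }
}

Because confidenceThreshold is already passed to the identify call, any returned candidate normally meets the 0.9 cutoff; the final check simply makes the door-opening decision explicit.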

Your Assignment

As part of its digital strategy to make video processing more powerful, Asclepius plans to use its surveillance footage to extract more useful insights, which is possible through the use of Video AI. Video content is growing at such a rapid pace that manual processing, or building a manual video surveillance system, doesn’t scale. The need of the hour is a surveillance system that monitors the video, takes immediate action, and generates insights on demand. The Asclepius Consortium plans to handle queries like the following:
  • How many patients and people came into the hospital yesterday?

  • Were the patients happy after consulting with the doctor?

  • Who entered or attempted to enter the secured inventory warehouse?

  • What are all the products and tools taken out of inventory?

  • How do we respond to critical and severe patients 24X7?

  • Can we monitor people activity and ensure that people get proper support whenever and wherever required?

  • When did Dr. John (for example) enter the hospital and where is he currently? Or when was the patient named Mike last attended to?

  • How can we identify instruments, doctors, and other inventory objects?

Hint

We discussed Video AI briefly in Chapter 4. Essentially, Video AI helps in processing videos and generating insights such as face recognition and tracking, voice activity detection, sentiment analysis, scene detection, and much more. You need a couple of Video AI APIs to do this: an Index API to index, a Search API to search, a Visual Insights API to get insights, and the Streaming API to do the actual streaming. This is achieved by discovering content in the video and generating insights. To learn more about it, visit https://Vi.microsoft.com. At the time this book was written, Video Indexer was in preview mode.

Recap

In this chapter, you learned about some of the powerful ways to use Microsoft Cognitive Services and apply them to the Asclepius hospital example. You got an in-depth understanding about LUIS and learned how to create, train, test, and publish a LUIS application. You also learned how to convert text to speech and speech to text by calling the Speech API. Later in the chapter, you learned how to use the Face API to identify and recognize faces and to create a strong surveillance system.
