Customizing speech recognition

A speech recognition system consists of several components working together. Two of the most important are the acoustic model and the language model. The acoustic model classifies short fragments of audio as sound units. The language model helps the system decide which words were spoken, based on how likely a given word is to appear in a particular sequence.

Although Microsoft has done a great job of creating comprehensive acoustic and language models, there may still be times when you need to customize these models.

Imagine that you have an application that is meant to be used in a factory environment. Speech recognition there requires acoustic training on that environment, so that the recognizer can separate speech from the usual factory noise.

Another example is an application used by a specific group of people, say, a search application where programming is the main topic. Users would typically say words such as object-oriented, dot net, or debugging. Such terms can be recognized more reliably by customizing the language model.

Creating a custom acoustic model

To create custom acoustic models, you will need audio files and transcripts. Each audio file must be stored as a WAV file and be between 100 ms and 1 minute in length. It is recommended that there is at least 100 ms of silence at the start and end of the file; typically, this silence will be between 500 ms and 1 second. If there is a lot of background noise, it is also recommended to have silences in between content.

Each file should contain one sentence or utterance. Files should be uniquely named, and an entire set of files can be up to 2 GB. This translates to about 17 to 34 hours of audio, depending on the sampling rate. All files in one set should be placed in a zipped folder, which then can be uploaded.
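The 17 to 34 hour range can be checked with a little arithmetic. The following is a rough sketch, assuming uncompressed 16-bit mono PCM audio, sampling rates of 16 kHz and 8 kHz, and 2 GB counted as 2 × 10^9 bytes; none of these assumptions is stated in the text, and the function name is made up for illustration.

```python
# Rough capacity estimate for a 2 GB acoustic dataset.
# Assumptions (not from the text): 16-bit mono PCM, 2 GB = 2 * 10**9 bytes,
# and WAV header overhead is ignored.

def hours_of_audio(total_bytes: int, sample_rate_hz: int,
                   bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Return how many hours of raw PCM audio fit into total_bytes."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


LIMIT_BYTES = 2 * 10**9

for rate in (16_000, 8_000):
    print(f"{rate} Hz: {hours_of_audio(LIMIT_BYTES, rate):.1f} hours")
# 16000 Hz: 17.4 hours
# 8000 Hz: 34.7 hours
```

Under these assumptions, halving the sampling rate doubles the amount of audio that fits, which is consistent with the range given above.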

Accompanying the audio files is a single transcript file. Each line of it should give the name of an audio file followed by the sentence spoken in that file, with the filename and sentence separated by a tab.
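Preparing such a dataset by hand is error-prone, so it can help to script it. The following is a minimal sketch using only the Python standard library. It checks each WAV file against the length limits above, zips the audio, and writes a tab-separated transcript. The function names, the output filenames (audio.zip, transcript.txt), and the one-entry-per-line transcript layout are assumptions made for this illustration, not something CRIS prescribes.

```python
import wave
import zipfile
from pathlib import Path

MIN_SECONDS = 0.1   # 100 ms lower bound from the text
MAX_SECONDS = 60.0  # 1 minute upper bound from the text


def wav_duration_seconds(path: Path) -> float:
    """Return the playing time of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_dataset(utterances: dict[str, str], audio_dir: Path,
                  out_dir: Path) -> tuple[Path, Path]:
    """Validate the audio files, zip them, and write the transcript.

    utterances maps each WAV filename to its sentence; using a dict
    guarantees that the filenames are unique.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / "audio.zip"            # hypothetical output name
    transcript_path = out_dir / "transcript.txt"  # hypothetical output name
    lines = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, sentence in utterances.items():
            wav_path = audio_dir / name
            duration = wav_duration_seconds(wav_path)
            if not MIN_SECONDS <= duration <= MAX_SECONDS:
                raise ValueError(
                    f"{name}: {duration:.2f} s is outside the allowed range")
            archive.write(wav_path, arcname=name)
            # Filename and sentence are separated by a tab.
            lines.append(f"{name}\t{sentence}")
    transcript_path.write_text("".join(line + "\n" for line in lines),
                               encoding="utf-8")
    return zip_path, transcript_path
```

Calling build_dataset({"order-001.wav": "start the conveyor belt"}, Path("recordings"), Path("upload")) would produce the two files to upload, provided the recording exists and has a valid length.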

When you upload the audio files and transcript, CRIS will process them. When this process is done, you will get a report stating which sentences have succeeded and which have failed. If anything fails, you will get the reason for the failure.

When the dataset has been uploaded, you can create the acoustic model. This will be associated with the dataset you select. When the model has been created, you can start the process to train it. Once the training is completed, you can deploy the model.

Creating a custom language model

Creating custom language models will also require a dataset. This set is a single plain text file containing sentences or utterances unique to your model. Each new line marks a new utterance. The maximum file size is 2 GB.
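The language dataset can be generated in a similar way. The sketch below writes one utterance per line and enforces the size limit. The function name is hypothetical, treating 2 GB as 2 × 10^9 bytes is an assumption, and dropping blank lines and exact duplicates is a choice made here rather than a stated requirement.

```python
from pathlib import Path

MAX_BYTES = 2 * 10**9  # 2 GB limit from the text, assumed to mean decimal gigabytes


def write_language_dataset(utterances, out_path: Path) -> int:
    """Write one utterance per line and return the number of lines written."""
    seen = set()
    cleaned = []
    for utterance in utterances:
        # Collapse whitespace so an embedded newline cannot split an utterance.
        text = " ".join(utterance.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    data = "".join(line + "\n" for line in cleaned).encode("utf-8")
    if len(data) > MAX_BYTES:
        raise ValueError("dataset exceeds the 2 GB limit")
    out_path.write_bytes(data)
    return len(cleaned)
```

For the programming search example, the input might be a list such as ["object-oriented programming", "debugging dot net applications"].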

Uploading the file will make CRIS process it. Once the processing is done, you will get a report listing any errors, along with the reason for each failure.

With the processing done, you can create a custom language model. Each model will be associated with a dataset of your choice. Once the model is created, you can train it, and when the training is complete, you can deploy it.

Deploying the application

To deploy and use the custom models, you will need to create a deployment. Here, you will name and describe the application. You can select acoustic models and language models. Be aware that you can only select one of each per deployed application.

Once created, the deployment will start. This process can take up to 30 minutes to complete, so be patient. When the deployment completes, you can get the required information by clicking on the application name. You will be given the URLs and subscription keys to use.

To use the custom models with the Bing Speech API, you can use overloads of CreateDataClientWithIntent and CreateMicrophoneClient. The overloads you will want are the ones that take both the primary and secondary API keys. For these, you need to use the keys supplied by CRIS. Additionally, you need to specify the supplied URL as the last parameter.

Once this is done, you are able to use customized recognition models.
