Understanding speech recognition

Converting speech to text is an important application feature, and it is being used in an increasingly wide variety of contexts. Voice input is used to control smartphones, to automatically handle input in help desk applications, and to assist people with disabilities, to mention a few examples.

Speech consists of a complex audio stream. Sounds can be split into phones, the distinct speech sounds that make up words. Pairs of adjacent phones are called diphones. Utterances consist of words and the various types of pauses between them.

The essence of the conversion process is to split the audio stream at the silences between utterances and then match each utterance to the words it most closely sounds like. This matching can be difficult for many reasons: words are pronounced differently depending on their context, regional dialects vary, the sound quality may be poor, and other factors.

The matching process is quite involved and often uses multiple models. An acoustic model is used to match acoustic features to phones. A phonetic dictionary maps phones to words. A language model restricts the word search to sequences likely in a given language. These models are never entirely accurate and contribute to the inaccuracies found in the recognition process.

We will be using CMUSphinx 4 to illustrate this process.

Using CMUSphinx to convert speech to text

Audio processed by CMUSphinx must be in Pulse Code Modulation (PCM) format. PCM is a technique that samples analog data, such as an analog wave representing speech, and produces a digital version of the signal. FFmpeg (https://ffmpeg.org/) is a free tool that can convert between audio formats if needed.
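
For example, a command along the following lines uses FFmpeg to convert a recording to 16 kHz, 16-bit, mono PCM in a WAV container, the format expected by the default CMUSphinx English models. The file names input.mp3 and speech.wav are placeholders:

ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 speech.wav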

You will need to create sample audio files using the PCM format. These files should be fairly short and can contain numbers or words. It is recommended that you run the examples with different files to see how well the speech recognition works.

First, we set up the basic framework for the conversion by creating a try-catch block to handle exceptions. Within the block, we create an instance of the Configuration class, which is used to configure the recognizer to recognize standard English. The configuration models and dictionary need to be changed to handle other languages:

try { 
    Configuration configuration = new Configuration(); 
    String prefix = "resource:/edu/cmu/sphinx/models/en-us/"; 
    configuration.setAcousticModelPath(prefix + "en-us"); 
    configuration.setDictionaryPath(prefix + "cmudict-en-us.dict"); 
    configuration.setLanguageModelPath(prefix + "en-us.lm.bin"); 
    ... 
} catch (IOException ex) { 
    // Handle exceptions 
} 

The StreamSpeechRecognizer class is then created using the configuration. This class processes speech based on an input stream. In the following code, we create an instance of the StreamSpeechRecognizer class and an InputStream from the speech file:

StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer( 
        configuration); 
InputStream stream = new FileInputStream(new File("filename")); 

To start speech processing, the startRecognition method is invoked. The getResult method returns a SpeechResult instance that holds a result of the processing. We then use SpeechResult's getHypothesis method to get the best result. We stop the processing using the stopRecognition method:

recognizer.startRecognition(stream); 
SpeechResult result; 
while ((result = recognizer.getResult()) != null) { 
    out.println("Hypothesis: " + result.getHypothesis());
} 
recognizer.stopRecognition(); 

When this is executed against a speech file containing the sentence "Mary had a little lamb," we get the following output:

Hypothesis: mary had a little lamb

When speech is interpreted, there may be more than one possible word sequence. We can obtain the best candidates using the getNbest method, whose argument specifies how many possibilities should be returned. The following demonstrates this method:

Collection<String> results = result.getNbest(3); 
for (String sentence : results) { 
    out.println(sentence); 
} 

One possible output follows:

<s> mary had a little lamb </s>
<s> marry had a little lamb </s>
<s> mary had a a little lamb </s>

This gives us the basic results. However, we will probably want to do something with the actual words. The technique for getting the words is explained next.

Obtaining more detail about the words

The individual words of the results can be extracted using the getWords method, as shown next. The method returns a list of WordResult instances, each of which represents one word:

List<WordResult> words = result.getWords(); 
for (WordResult wordResult : words) { 
    out.print(wordResult.getWord() + " "); 
} 

The output for this code sequence follows, where <sil> reflects a silence found at the beginning of the speech:

<sil> mary had a little lamb

We can extract more information about the words using various methods of the WordResult class. In the sequence that follows, we return the confidence and time frame associated with each word.

The getConfidence method returns the confidence expressed as a log value. We use the SpeechResult class's getResult method to get an instance of the Result class. Its getLogMath method is then used to get a LogMath instance. The confidence value is passed to the logToLinear method, which returns a real number between 0 and 1.0 inclusive. Greater confidence is reflected by a larger value.

The getTimeFrame method returns a TimeFrame instance. Its toString method returns two integer values, separated by a colon, reflecting the beginning and end times of the word:

for (WordResult wordResult : words) { 
    out.printf("%s\n\tConfidence: %.3f\n\tTime Frame: %s\n", 
            wordResult.getWord(), 
            result.getResult() 
                    .getLogMath() 
                    .logToLinear((float) wordResult.getConfidence()), 
            wordResult.getTimeFrame()); 
} 

One possible output follows:

<sil>
    Confidence: 0.998
    Time Frame: 0:430
mary
    Confidence: 0.998
    Time Frame: 440:900
had
    Confidence: 0.998
    Time Frame: 910:1200
a
    Confidence: 0.998
    Time Frame: 1210:1340
little
    Confidence: 0.998
    Time Frame: 1350:1680
lamb
    Confidence: 0.997
    Time Frame: 1690:2170
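
Putting the pieces together, the following is a minimal, self-contained sketch of the complete conversion process. It assumes the sphinx4-core and sphinx4-data libraries are on the classpath; the class name SpeechToTextDemo and the file name speech.wav are placeholders:

import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import edu.cmu.sphinx.api.Configuration; 
import edu.cmu.sphinx.api.SpeechResult; 
import edu.cmu.sphinx.api.StreamSpeechRecognizer; 
import edu.cmu.sphinx.result.WordResult; 

public class SpeechToTextDemo { 
    public static void main(String[] args) { 
        try { 
            // Configure the recognizer for standard English 
            Configuration configuration = new Configuration(); 
            String prefix = "resource:/edu/cmu/sphinx/models/en-us/"; 
            configuration.setAcousticModelPath(prefix + "en-us"); 
            configuration.setDictionaryPath(prefix + "cmudict-en-us.dict"); 
            configuration.setLanguageModelPath(prefix + "en-us.lm.bin"); 

            StreamSpeechRecognizer recognizer = 
                    new StreamSpeechRecognizer(configuration); 
            InputStream stream = 
                    new FileInputStream(new File("speech.wav")); 

            // Process the stream, printing each hypothesis and its words 
            recognizer.startRecognition(stream); 
            SpeechResult result; 
            while ((result = recognizer.getResult()) != null) { 
                System.out.println("Hypothesis: " + result.getHypothesis()); 
                for (WordResult wordResult : result.getWords()) { 
                    System.out.printf("%s\tConfidence: %.3f\tTime Frame: %s%n", 
                            wordResult.getWord(), 
                            result.getResult() 
                                    .getLogMath() 
                                    .logToLinear((float) wordResult.getConfidence()), 
                            wordResult.getTimeFrame()); 
                } 
            } 
            recognizer.stopRecognition(); 
        } catch (IOException ex) { 
            // Handle exceptions 
            ex.printStackTrace(); 
        } 
    } 
} 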

Now that we have examined how sound can be processed, we will turn our attention to image processing.
