Streaming requests

Depending on the application's requirements, a two-way streaming capability may be needed for real-time, continuous transcription of audio. The Speech-to-Text API provides a method for real-time transcription within a bidirectional stream. The client application sends a continuous stream of audio to the API and receives both interim and complete transcription results from the service. Interim (just-in-time) results represent the transcription of the audio received so far, while the final response contains the entire transcription (analogous to the synchronous and asynchronous responses discussed earlier). 

In terms of the API, streaming requests are sent to the StreamingRecognize method as the endpoint. Because the API operates on a continuous stream, multiple requests are sent, each carrying a different window of audio. However, the first message must contain the configuration of the streaming request. The configuration is defined by the StreamingRecognitionConfig object, which provides hints to the API for processing the specific streaming audio signal. The StreamingRecognitionConfig object contains the following fields: 

  • config: This is the RecognitionConfig object that we discussed earlier in this chapter. 
  • single_utterance: This is an optional boolean flag. When it is set to false, the streaming recognition API continues to transcribe the input signal despite long pauses within the speech. The stream remains open until it is explicitly closed by the calling process or until a certain time threshold has elapsed; in this case, the API may return multiple StreamingRecognitionResult objects. If the flag is set to true, the model detects a pause within the audio signal, and the API returns an END_OF_SINGLE_UTTERANCE event and completes the recognition process. In this case, the API returns only one occurrence of StreamingRecognitionResult. 
  • interim_results: This is an optional boolean flag. If set to true, the API returns interim results as the transcription progresses; if set to false, the API returns results only once the transcription is complete. 
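The first-message-is-config pattern described above can be sketched as a Python generator that yields the configuration message before any audio. This is a minimal stdlib-only sketch: the dict layout and the function name `make_request_stream` are illustrative placeholders, not the actual client-library API.

```python
# Sketch: a request generator for a StreamingRecognize-style call.
# The first message carries the StreamingRecognitionConfig (with the
# nested RecognitionConfig plus the single_utterance and interim_results
# flags); every later message carries only a window of raw audio.
# All field names here are illustrative stand-ins, not real client types.

def make_request_stream(audio_chunks):
    yield {
        "streaming_config": {
            "config": {  # the RecognitionConfig discussed earlier
                "encoding": "LINEAR16",
                "sample_rate_hertz": 16000,
                "language_code": "en-US",
            },
            "single_utterance": False,  # keep transcribing across pauses
            "interim_results": True,    # receive partial transcriptions
        }
    }
    # Subsequent messages: audio content only, one window at a time.
    for chunk in audio_chunks:
        yield {"audio_content": chunk}

requests = list(make_request_stream([b"\x00\x01", b"\x02\x03"]))
```

Ordering matters in the real API as well: sending audio before the configuration message is rejected, which is why the generator pattern is convenient.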

The API returns a response in the form of a StreamingRecognizeResponse message object. This is the only object returned by the streaming Speech-to-Text service API. The response object contains the following significant fields: 

  • speechEventType: This represents a pause in the audio conversation as detected by the underlying model. There are two event types recognized by the API: SPEECH_EVENT_UNSPECIFIED indicates that no event is specified, and END_OF_SINGLE_UTTERANCE indicates that the model has detected a pause within the audio signal and the API does not expect any additional audio data in the stream. This event is sent only when the single_utterance request parameter is set to true. 
  • results: This is the main wrapper object that contains the transcription results as a collection:
    • alternatives: Similar to synchronous and asynchronous transcription requests, this collection provides various transcription alternatives with varying confidence levels. 
    • isFinal: This flag is set to true when the result represents a finalized transcription that will not change; when it is false, the result is interim and may be revised by subsequent responses. 
    • stability: In streaming speech recognition, overlapping parts of the speech are transcribed over a moving time window, so a given position within the audio signal may be transcribed more than once in subsequent frames. The speech-to-text model generates a stability score that indicates how likely an interim transcription is to change: a score near 0 indicates an unstable transcription that will probably change, and a score near 1 indicates that the transcription is unlikely to change. 
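Putting the response fields together, a consumer loop typically ignores unstable interim results and assembles the transcript from results whose isFinal flag is true. The sketch below uses hand-built response dicts as stand-ins for StreamingRecognizeResponse messages; the helper name `collect_transcript` is illustrative.

```python
# Sketch: consuming StreamingRecognizeResponse-like messages.
# Interim results (isFinal == False) may still be revised, so only
# finalized results contribute to the assembled transcript.

def collect_transcript(responses):
    final_parts = []
    for response in responses:
        for result in response.get("results", []):
            # "alternatives" is ranked by confidence; take the top one.
            top = result["alternatives"][0]["transcript"]
            if result.get("isFinal"):
                final_parts.append(top)
            # Otherwise this is an interim result; a low "stability"
            # score warns that it is likely to change later.
    return " ".join(final_parts)

responses = [
    {"results": [{"alternatives": [{"transcript": "hello wor"}],
                  "isFinal": False, "stability": 0.1}]},
    {"results": [{"alternatives": [{"transcript": "hello world"}],
                  "isFinal": True}]},
]
print(collect_transcript(responses))  # hello world
```

A UI that displays live captions would instead render the interim results as they arrive and replace them in place once the finalized result for that segment appears.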