Speech To Text


This feature is part of OpenVidu PRO and ENTERPRISE editions.
The Speech To Text module uses port 4000/TCP. Open this port in your Media Nodes so that Master Nodes can communicate with them.
WARNING: OpenVidu Speech To Text is a beta feature. Unexpected bugs may arise, and the API may change in the near future.

How Speech To Text works

OpenVidu provides a Speech To Text module that transcribes the audio tracks of an OpenVidu Session in real time.

  • OpenVidu delivers events to the client side containing the text transcription of any Stream that has audio.
  • Clients can receive events for one or multiple Streams of an OpenVidu Session, including their own Stream.
  • Events are delivered in real time, following a recognizing-to-recognized strategy: while a transcribed speaker is talking, a series of events flagged as recognizing is generated as the sentence progresses. The text of successive recognizing events may change as the engine gathers more information about the sentence. When the engine considers that the speaker has completed a full sentence, it triggers a recognized event with the final result.
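To illustrate the recognizing-to-recognized strategy, here is a minimal sketch of how a client might turn the event flow into captions. It is not part of openvidu-browser: createTranscript, caption and the connection id "con_A" are hypothetical names, and the events are assumed to carry the same fields used in the listener example of "Receiving Speech To Text events" (connection.connectionId, reason and text). The key point is that recognizing text replaces the previous interim text, while recognized text is committed.

```javascript
// Minimal sketch (not part of openvidu-browser): fold Speech To Text events
// into one caption per speaker. Events are assumed to look like
// { connection: { connectionId }, reason: "recognizing" | "recognized", text }.
function createTranscript() {
  // connectionId -> { final: string[], interim: string }
  const speakers = new Map();

  const entry = (connectionId) => {
    if (!speakers.has(connectionId)) {
      speakers.set(connectionId, { final: [], interim: "" });
    }
    return speakers.get(connectionId);
  };

  return {
    handle(event) {
      const e = entry(event.connection.connectionId);
      if (event.reason === "recognizing") {
        // Interim hypothesis: replace it rather than append, because the
        // engine may revise earlier words in later recognizing events.
        e.interim = event.text;
      } else if (event.reason === "recognized") {
        // Final sentence: commit it and clear the interim text.
        e.final.push(event.text);
        e.interim = "";
      }
    },
    // Committed sentences followed by the sentence still being recognized.
    caption(connectionId) {
      const e = entry(connectionId);
      return [...e.final, e.interim].filter((s) => s.length > 0).join(" ");
    },
  };
}

// Simulated event sequence for one speaker (hypothetical connection id)
const transcript = createTranscript();
const speaker = { connectionId: "con_A" };
[
  { connection: speaker, reason: "recognizing", text: "hello" },
  { connection: speaker, reason: "recognizing", text: "hello word how" },
  { connection: speaker, reason: "recognized", text: "Hello world, how are you?" },
  { connection: speaker, reason: "recognizing", text: "I am" },
].forEach((event) => transcript.handle(event));

console.log(transcript.caption("con_A")); // prints "Hello world, how are you? I am"
```

In an application, the same object could be fed from the Session listener, for example session.on("speechToTextMessage", (event) => transcript.handle(event)).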



Speech To Text engines

Azure

See the Azure website.

Microsoft provides an Azure service called Speech To Text that transcribes spoken audio to text. OpenVidu integrates the audio streams of OpenVidu Sessions with this Azure service. All you need is a key for the Azure Cognitive Services API and the Azure region where that key was created.

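Enabling the Speech To Text module

To enable the module with Azure, set the following configuration properties in your OpenVidu deployment: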

OPENVIDU_PRO_SPEECH_TO_TEXT=azure
OPENVIDU_PRO_SPEECH_TO_TEXT_AZURE_KEY=<AzureKey>        ## e.g. rywfyDIAL5BM70ErU9O1XSIFzWk2QQhP
OPENVIDU_PRO_SPEECH_TO_TEXT_AZURE_REGION=<AzureRegion>  ## e.g. westeurope

Available languages

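Azure supports dozens of languages; the complete list is available in the Azure Speech service documentation. The language code you choose (for example en-US) is the one passed to Session.subscribeToSpeechToText.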

AWS

Coming soon...

Vosk

Coming soon...



Receiving Speech To Text events

To receive Speech To Text events on your application's client side, set up a listener for the speechToTextMessage event in the Session object. The listener receives SpeechToTextEvent objects whenever a targeted participant speaks. Use the event property reason to differentiate between sentences under construction (recognizing) and final sentences (recognized):

session.on("speechToTextMessage", event => {
    if (event.reason === "recognizing") {
        console.log("User " + event.connection.connectionId + " is speaking: " + event.text);
    } else if (event.reason === "recognized") {
        console.log("User " + event.connection.connectionId + " spoke: " + event.text);
    }
});

Then subscribe to the transcription of the desired Stream with method Session.subscribeToSpeechToText, passing the Stream object whose Speech To Text events you want to receive and the language being spoken (for example en-US):

await session.subscribeToSpeechToText(stream, "en-US");
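Since clients can receive transcriptions for several Streams, including their own, an application may want to subscribe to all of them. The sketch below assumes openvidu-browser Session and Publisher objects: it relies on the streamCreated event (which carries event.stream) for remote Streams and on the Publisher's stream property for the local one. The helper names subscribeSafely, subscribeRemoteStreams and subscribeOwnStream are hypothetical, and so is the token in the usage comment. It also assumes, as in the OpenVidu tutorials, that the streamCreated listener is registered before session.connect, and that the local Stream is subscribed only after session.publish has resolved.

```javascript
// Sketch: transcribe every Stream of the Session, including your own.
// `session` and `publisher` are assumed to be openvidu-browser Session and
// Publisher objects; `lang` is a language code such as "en-US".

async function subscribeSafely(session, stream, lang) {
  try {
    await session.subscribeToSpeechToText(stream, lang);
  } catch (error) {
    // A failed subscription should not break the rest of the application
    console.warn("Speech To Text subscription failed for stream " + stream.streamId, error);
  }
}

// Remote Streams: subscribe as each one is announced by a "streamCreated" event.
function subscribeRemoteStreams(session, lang) {
  session.on("streamCreated", (event) => subscribeSafely(session, event.stream, lang));
}

// Local Stream: exposed by the Publisher through its `stream` property.
function subscribeOwnStream(session, publisher, lang) {
  return subscribeSafely(session, publisher.stream, lang);
}

// Possible usage (inside an async function; `token` is hypothetical):
//   subscribeRemoteStreams(session, "en-US");
//   await session.connect(token);
//   await session.publish(publisher);
//   await subscribeOwnStream(session, publisher, "en-US");
```

Catching errors per Stream is a design choice: one Stream that cannot be transcribed does not prevent the others from being subscribed.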

Check out the openvidu-speech-to-text tutorial to test a real sample application.



Reconnecting to the Speech To Text module after a crash

Speech To Text is a beta feature that may crash unexpectedly on rare occasions. The openvidu-browser SDK provides an event that signals such a crash, so that the application can re-establish its transcription subscriptions once the service is available again (the Speech To Text module restarts on its own after a crash). To do so, listen to the ExceptionEvent in your Session object and filter by the name SPEECH_TO_TEXT_DISCONNECTED. See the code snippet below:

session.on("exception", async (event) => {

    if (event.name === "SPEECH_TO_TEXT_DISCONNECTED") {

        console.warn("Speech to Text service has disconnected. Retrying the subscription...");
        let speechToTextReconnected = false;

        // `stream` is the Stream object whose transcription was previously subscribed
        while (!speechToTextReconnected) {
            await new Promise(r => setTimeout(r, 1000)); // Wait one second between attempts
            try {
                await session.subscribeToSpeechToText(stream, "en-US");
                console.log("Speech to Text service has recovered");
                speechToTextReconnected = true;
            } catch (error) {
                console.warn("Speech to Text service still unavailable. Retrying...");
            }
        }

    } else {
        // Other types of ExceptionEvents
    }

});
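The snippet above retries once per second, indefinitely, for a single Stream. A possible variation, sketched below under our own assumptions, waits progressively longer between attempts (capped exponential backoff) and can optionally give up after a number of attempts. The function retryWithBackoff, its option names and its default delays are illustrative choices, not part of openvidu-browser and not values recommended by OpenVidu; transcribedStreams in the usage comment is a hypothetical Set of Streams maintained by the application.

```javascript
// Sketch: retry an async operation with capped exponential backoff.
// The default delays (1 s, doubling up to 16 s) are illustrative only.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(
  operation,
  { initialDelayMs = 1000, maxDelayMs = 16000, maxAttempts = Infinity } = {}
) {
  let delayMs = initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    await sleep(delayMs); // give the module time to restart before each attempt
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxAttempts) {
        throw error; // give up and let the caller decide what to do
      }
      delayMs = Math.min(delayMs * 2, maxDelayMs);
      console.warn(`Attempt ${attempt} failed, next attempt in ${delayMs} ms`);
    }
  }
}

// Possible usage: re-subscribe every Stream the application was transcribing.
//
// session.on("exception", async (event) => {
//   if (event.name === "SPEECH_TO_TEXT_DISCONNECTED") {
//     for (const stream of transcribedStreams) {
//       await retryWithBackoff(() => session.subscribeToSpeechToText(stream, "en-US"));
//     }
//   }
// });
```

Backing off avoids hammering the module with requests while it restarts, at the cost of a slightly longer recovery time once it is back.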