Fault tolerance



How OpenVidu Pro provides fault tolerance

Fault tolerance in OpenVidu Pro is provided through the presence of multiple Media Nodes (see OpenVidu Pro architecture). An OpenVidu Pro cluster with at least 2 Media Nodes ensures that if one Media Node goes down for any reason, the other can take over the affected OpenVidu sessions. This is represented in the image below.

The points below summarize the functioning of OpenVidu Pro in terms of fault tolerance upon a Media Node crash:

  1. The Master Node keeps a persistent, full-duplex connection with Media Nodes through WebSocket.
  2. Whenever the Master Node detects that a Media Node has disconnected, it starts a reconnection process that allows 3 seconds for the reconnection to succeed. During this interval, videos on the client side may freeze if the disconnection was caused by an actual crash of the media routing process.
  3. If the Master Node succeeds in reconnecting to the Media Node within the allowed time interval, it is considered a one-time issue and no further action is taken. But if the reconnection is not possible, then every OpenVidu session hosted in the crashed Media Node is closed and every participant will receive the proper event to allow the application to rebuild the session. This is further explained in the next section Making your OpenVidu app fault tolerant.



Media Node reconnection configuration

When a Media Node disconnects from the OpenVidu cluster, a sequence of processes and events is triggered to try to recover the connection. The diagram below provides a temporal view of this sequence.

  • The Master Node detects a disconnection of a Media Node.
  • It tries to silently reconnect to it for 3 seconds. If that succeeds, the Media Node is considered healthy and no further action is taken.
  • If a silent reconnection is not possible, a WebHook event nodeCrashed is triggered and an active reconnection process starts.
  • The active reconnection process depends on the configuration property OPENVIDU_PRO_CLUSTER_RECONNECTION_TIMEOUT. You can set this property to a custom value, though the default offers the behavior considered most reasonable for most use cases. The default value depends on the deployment environment of the OpenVidu cluster:
    • For On Premises deployments with autoscaling disabled, the default value is infinite. This means that by default a mediaNodeStatusChanged event (status to terminating) will never be triggered. The Media Node will keep trying to reconnect to the cluster indefinitely, and a nodeRecovered event will always be possible. This behavior is a safe default because Media Nodes in On Premises deployments generally have fixed IPs.
    • For any other type of deployment, the default value is 3 seconds. This means that there will be no time for a nodeRecovered event: as soon as a nodeCrashed event is triggered, a mediaNodeStatusChanged event (status to terminating) is triggered too.
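
The sequence above can be modeled with a few lines of JavaScript. This is only an illustrative sketch, not OpenVidu code: the helper expectedEvents and its parameters are hypothetical, and it assumes that the reconnection timeout is counted from the moment of the disconnection (which is what the 3 second default suggests) and that the interval limits are inclusive.

```javascript
// Silent reconnection window described above, in seconds.
const SILENT_RECONNECTION_SECONDS = 3;

/**
 * Hypothetical helper (not part of OpenVidu): lists the events expected for a
 * Media Node that becomes reachable again `recoveredAfter` seconds after
 * disconnecting. Pass Infinity if the node never comes back.
 * Assumption: `reconnectionTimeout` (seconds, or Infinity) is counted from
 * the moment of the disconnection.
 */
function expectedEvents(recoveredAfter, reconnectionTimeout) {
  if (recoveredAfter <= SILENT_RECONNECTION_SECONDS) {
    // The silent reconnection succeeded: no event is triggered.
    return [];
  }
  const events = ['nodeCrashed'];
  if (Number.isFinite(recoveredAfter) && recoveredAfter <= reconnectionTimeout) {
    // The active reconnection succeeded before the timeout elapsed.
    events.push('nodeRecovered');
  } else if (Number.isFinite(reconnectionTimeout)) {
    // The timeout elapsed first: the Media Node is removed from the cluster.
    events.push('mediaNodeStatusChanged (terminating)');
  }
  return events;
}

console.log(expectedEvents(2, 3));         // []
console.log(expectedEvents(10, Infinity)); // [ 'nodeCrashed', 'nodeRecovered' ]
console.log(expectedEvents(10, 3));        // [ 'nodeCrashed', 'mediaNodeStatusChanged (terminating)' ]
```

With an infinite timeout and a node that never returns, the model yields only nodeCrashed, which matches the On Premises default where the terminating status is never reached.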



Making your OpenVidu app fault tolerant

OpenVidu Pro delegates the recovery of sessions to the application in the event of a Media Node crash. The application should simply re-create the crashed session, which amounts to nothing more than repeating the normal process of joining users to a session:

  • Initialize the Session in OpenVidu Server from your application's backend.
  • Create a new Connection for the Session from your application's backend.
  • Return the Connection's token to your application's frontend so it can use it to call Session.connect.
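
On a Node.js backend, the three steps above could look like the following minimal sketch. It assumes the backend uses the openvidu-node-client library and receives an already configured OpenVidu object; the function name createTokenForSession is hypothetical, and error handling is left out.

```javascript
/**
 * Hypothetical backend helper covering the three steps above.
 * `openvidu` is assumed to be an OpenVidu object from openvidu-node-client,
 * created elsewhere with the URL and secret of the deployment.
 */
async function createTokenForSession(openvidu, sessionId) {
  // 1. Initialize the Session. Passing the identifier as customSessionId
  //    lets a re-created session keep the identifier of the crashed one.
  const session = await openvidu.createSession({ customSessionId: sessionId });

  // 2. Create a new Connection for the Session.
  const connection = await session.createConnection();

  // 3. Return the token so the frontend can pass it to Session.connect.
  return connection.token;
}
```

Because the same function serves both the first join and a rejoin after a crash, the backend does not need a separate recovery path.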

The key part is letting the application's frontend know when to ask the application's backend for a new token to reconnect to a recently crashed session. The application's frontend must listen to the sessionDisconnected event and check its reason. If the reason is nodeCrashed, the application's frontend just needs to ask the application's backend for a new token for a session with the same identifier as the previous one. The snippet below shows this in JavaScript, using the openvidu-browser library.

var OV = new OpenVidu();
var session = OV.initSession();

session.on('sessionDisconnected', event => {

  if (event.reason === 'nodeCrashed') {

    // User was evicted from the session upon a node crash
    console.warn('Your session has been closed due to a node crash!');

    // HERE THE CLIENT SHOULD RE-RUN THE PROCESS OF CONNECTING TO THE SESSION AS NORMAL.
    // THE SESSION SHOULD KEEP THE SAME IDENTIFIER.

  } else {

    // User left the session for any conventional reason
    console.log('You left the session!');

    // HERE THE CLIENT SHOULD DO WHATEVER IS NECESSARY UPON A NORMAL SESSION CLOSURE.

  }
});
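
One possible way to fill in the nodeCrashed branch is to wrap the whole join process in a function that calls itself again after a crash. This is a sketch under assumptions: joinWithRecovery is a hypothetical name, and fetchToken stands for an application-specific function that asks the backend for a new token for the given session identifier and returns a Promise.

```javascript
/**
 * Hypothetical frontend helper.
 * `OV` is an OpenVidu object from openvidu-browser.
 * `fetchToken(sessionId)` is an application-specific function, assumed to
 * request a new token from the backend and resolve to it.
 */
async function joinWithRecovery(OV, sessionId, fetchToken) {
  const session = OV.initSession();

  session.on('sessionDisconnected', event => {
    if (event.reason === 'nodeCrashed') {
      // Re-run the normal join process, keeping the same session identifier.
      joinWithRecovery(OV, sessionId, fetchToken)
        .catch(error => console.error('Could not rejoin the session', error));
    }
  });

  const token = await fetchToken(sessionId);
  await session.connect(token);
  return session;
}
```

The sketch only reconnects the user. Anything else the application does after joining, such as publishing a stream or subscribing to remote ones, has to be repeated on the new Session object as well.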

Your application's backend can also receive the nodeCrashed CDR event. Listening to this event is not necessary for achieving fault tolerance and rebuilding sessions after a Media Node crash, but you can still use it for custom logic and monitoring purposes.
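
As an illustration of such monitoring, the sketch below reacts to nodeCrashed notifications in a Node.js backend. It assumes the event reaches the backend as a WebHook POST request with a JSON body that names the event in an `event` field; the helper names handleWebhookBody and createWebhookServer are hypothetical.

```javascript
const http = require('node:http');

/**
 * Hypothetical helper: parses a WebHook request body and calls
 * `onNodeCrashed` when it describes a nodeCrashed event.
 * Assumption: the JSON payload names the event in an `event` field.
 * Returns true if the callback was invoked, false otherwise.
 */
function handleWebhookBody(rawBody, onNodeCrashed) {
  let payload;
  try {
    payload = JSON.parse(rawBody);
  } catch (error) {
    return false; // Not valid JSON: ignore the request body
  }
  if (payload && payload.event === 'nodeCrashed') {
    onNodeCrashed(payload);
    return true;
  }
  return false;
}

/**
 * Builds (but does not start) an HTTP server that passes the body of every
 * POST request to handleWebhookBody.
 */
function createWebhookServer(onNodeCrashed) {
  return http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.statusCode = 405;
      res.end();
      return;
    }
    let body = '';
    req.setEncoding('utf8');
    req.on('data', chunk => { body += chunk; });
    req.on('end', () => {
      handleWebhookBody(body, onNodeCrashed);
      res.statusCode = 200;
      res.end();
    });
  });
}
```

Calling listen on the server returned by createWebhookServer, with a callback that logs the payload or raises an alert, would be enough for basic monitoring. The port and path must match the WebHook endpoint configured in the deployment.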

If you want to see an example of an application that automatically reconnects users after a node crash, take a look at the simple openvidu-fault-tolerance tutorial.



Recordings and fault tolerance

For the time being, the recordings hosted by a crashed Media Node should be considered lost and are not recoverable in a standardized way. Qualifying this statement:

  • For OpenVidu Pro AWS deployments the Media Node is immediately terminated after a crash, so there is no possibility of recovering any recording file.
  • For OpenVidu Pro On Premises deployments a Media Node cannot be automatically terminated after a crash. It will be removed from the cluster so that it is no longer charged, but OpenVidu will not terminate the machine (the WebHook event mediaNodeStatusChanged may be used to perform the termination task from outside OpenVidu). For this reason, recording files may be recoverable while the machine previously provisioned with the OpenVidu Media Node is still accessible. Inside the recording folder (by default /opt/openvidu/recordings) there should be a subfolder for each recording that was ongoing in the crashed Media Node. INDIVIDUAL recording files should be healthy. COMPOSED recording files may be corrupted and may require a repair process to make them playable.

In any case, there is no guarantee that recordings can be recovered after a Media Node crash. Future versions of OpenVidu will address this limitation.