Failover

17.05.2023

Failover is a backup mode of operation where the functions of a primary system component, such as a processor, server, network, or database, are switched to a secondary system component when the primary one becomes unavailable, whether due to a failure or for scheduled maintenance.

It provides better fault tolerance and is often used as an integral part of mission-critical systems that must always remain available.

The main purpose of failover is to allow processes that normally run on one node to be switched over to another when the primary service provider fails. Among other functions, this method can be used to perform maintenance or replace hardware without affecting specific services provided to the end user.

For better failover effectiveness, it is recommended to place the nodes in different data centers and, if possible, in different geographic locations.

The procedure automatically offloads tasks to a system component in standby mode, so that switching to the secondary system when the main one fails is as seamless as possible for the end user.

The recovery capability means that normal functions can be maintained despite unavoidable interruptions caused by issues with the equipment or service.

Strategies

The level of fault tolerance depends on the methods used. However, there is no such thing as total fault tolerance: some failure will always be severe enough to cause a fatal error.

For each critical system, fault tolerance must be designed so that the effort to mitigate certain failure types is justified considering the damage they would cause if not tolerated by the system.

In this regard, there are different strategies to make a system as fault tolerant as possible. The most important and well-known ones are:

Redundancy: passive modules that perform exactly the same function as the active ones and can take their place, so that the system does not break down when an element fails.
Replication: to prevent loss of stored information in case of failure, the information is usually replicated on several physical media or on an external computer or device as a backup. This way, if a failure that may cause data loss occurs, the system should be able to restore all the information by recovering the necessary data from an available backup device.
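
The replication strategy can be illustrated with a minimal Python sketch. It assumes in-memory dictionaries standing in for physical media or backup devices; the class `ReplicatedStore` and its methods are hypothetical names chosen for this example, not part of any particular product.

```python
class ReplicatedStore:
    """Writes every value to all available replicas and reads from the first one that has it."""

    def __init__(self, replica_count=3):
        # Each replica is a plain dict, a stand-in for a disk or remote backup device.
        self.replicas = [{} for _ in range(replica_count)]
        self.available = [True] * replica_count

    def write(self, key, value):
        # Copy the value to every replica that is currently available.
        for index, replica in enumerate(self.replicas):
            if self.available[index]:
                replica[key] = value

    def fail(self, index):
        """Simulate the loss of one replica together with its data."""
        self.available[index] = False
        self.replicas[index].clear()

    def read(self, key):
        # Recover the value from any replica that survived.
        for index, replica in enumerate(self.replicas):
            if self.available[index] and key in replica:
                return replica[key]
        raise KeyError(key)


store = ReplicatedStore(replica_count=3)
store.write("config", "v1")
store.fail(0)
print(store.read("config"))  # still readable after replica 0 is lost
```

The data is lost only if every replica fails, which is why the copies should sit on separate media or devices.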

Implementation

There are two types of such operations: failover and switchover. In practice, they are essentially the same, except that failover is automatic and generally happens without warning, while switchover requires human intervention.

For systems supporting servers or networks that require almost continuous availability and high reliability, automatic failover is used.

Failover automation generally uses a “heartbeat” system that connects two servers, either over a separate cable (such as an RS-232 serial cable) or over a network connection.

As long as there is a regular “pulse” or “heartbeat” between the primary server and the secondary server, the latter will not bring its systems online. There may also be a third “spare parts” server with running backup components for “hot” switching to prevent downtime.

The second server takes over from the first one as soon as it detects that the “heartbeat” of the primary server has stopped or become irregular. Some systems can send a failover notification.
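
The heartbeat logic described above can be sketched in Python from the point of view of the secondary server. The class name `StandbyMonitor`, the 3-second timeout, and the injectable clock are assumptions made for illustration. The sketch covers only the timeout decision, not the transport (serial cable or network) that delivers the heartbeats.

```python
import time


class StandbyMonitor:
    """Runs on the secondary server and tracks heartbeats from the primary."""

    def __init__(self, timeout_seconds=3.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # assumed value, tune per system
        self.clock = clock
        self.last_heartbeat = clock()
        self.active = False  # the standby keeps its services offline by default

    def record_heartbeat(self):
        """Call whenever a heartbeat message arrives from the primary."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Bring the standby online if no heartbeat arrived within the timeout."""
        silence = self.clock() - self.last_heartbeat
        if not self.active and silence > self.timeout_seconds:
            self.active = True  # take over from the primary
        return self.active


# Usage with a fake clock so the example is deterministic.
now = [0.0]
monitor = StandbyMonitor(timeout_seconds=3.0, clock=lambda: now[0])
now[0] = 2.0
monitor.record_heartbeat()   # primary is alive at t=2
now[0] = 4.0
print(monitor.check())       # 2 s of silence: standby stays passive
now[0] = 6.0
print(monitor.check())       # 4 s of silence: standby takes over
```

A timeout that is too short causes needless failovers on a brief network hiccup, while one that is too long extends the outage, so the value is a design trade-off.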

Others are designed not to fail over fully automatically, but to require human intervention instead. Once a person has approved the failover, this “automated with manual approval” configuration runs automatically.
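
One way to picture the “automated with manual approval” configuration is as a gate in front of the automated steps. The sketch below is a hypothetical illustration: the function name, the `approve` callback, and the step names are assumptions, not a specific product's interface.

```python
def failover_with_approval(failure_detected, approve, steps):
    """Run the automated failover steps only after an operator approves them."""
    if not failure_detected:
        return "no action"
    if not approve():
        # The failure is known, but nothing changes until a person confirms.
        return "awaiting approval"
    for step in steps:
        step()  # from here on, everything runs without further intervention
    return "failed over"


actions = []
steps = [
    lambda: actions.append("promote standby"),
    lambda: actions.append("redirect traffic"),
]

print(failover_with_approval(True, lambda: False, steps))  # operator has not approved yet
print(failover_with_approval(True, lambda: True, steps))   # operator approved
print(actions)
```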

The use of virtualization software has made failover operations less dependent on physical hardware due to a process called migration, where a running virtual machine moves from one physical host to another with little or no disruption in service.

Failover Clustering

A failover cluster is a group of computer servers working together to provide high or continuous availability.

If one of the servers fails, another node in the cluster can take over its workload with little or no downtime. Some failover clusters use only physical servers, while others contain virtual machines.
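
As a rough illustration of how a surviving node takes over a failed node's workload, the following Python sketch moves each workload to the least-loaded healthy node. The placement rule, the function name `reassign_workloads`, and the node names are assumptions for this example; real cluster software uses its own placement policies.

```python
def reassign_workloads(assignments, failed_node, healthy_nodes):
    """Return a new assignment map with the failed node's workloads moved to healthy nodes."""
    if not healthy_nodes:
        raise RuntimeError("no healthy node available to take over")
    # Copy the current assignments, leaving out the failed node.
    updated = {
        node: list(workloads)
        for node, workloads in assignments.items()
        if node != failed_node
    }
    for node in healthy_nodes:
        updated.setdefault(node, [])
    # Place each orphaned workload on the healthy node with the fewest workloads.
    for workload in assignments.get(failed_node, []):
        target = min(healthy_nodes, key=lambda node: len(updated[node]))
        updated[target].append(workload)
    return updated


cluster = {"node-a": ["web", "db"], "node-b": ["cache"], "node-c": []}
print(reassign_workloads(cluster, "node-a", ["node-b", "node-c"]))
```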

The primary goal of a failover cluster is to maintain continuous availability or high availability of applications and services.

Continuous availability clusters allow end users to continue using applications and services without downtime if a server fails.

With high availability clusters, by contrast, a user may experience a brief service interruption, but the system recovers automatically with no data loss and minimal downtime.

While continuous availability failover clusters are designed for 100% availability, high availability clusters aim for 99.999% availability (also known as “five nines”).

In exchange for slightly lower availability, high availability clusters are less expensive to implement than continuous availability clusters, which have higher hardware requirements.
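
To put the “five nines” target in concrete terms, the short Python calculation below converts an availability percentage into the downtime it permits per year. It assumes a 365-day year; the function name is chosen for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Downtime per year that is compatible with the given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

At 99.999%, the budget is a little over five minutes per year, which leaves almost no room for manual recovery.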

Logging and Monitoring

Other important functions associated with failover include logging the history of link state changes and notifying the administrator of such changes. The team can use this information to take necessary measures and mitigate failures.
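
A minimal Python sketch of this logging and notification function is shown below. The class `LinkStateLog`, the link name, and the `notify` callback (which could send an e-mail or a chat message in practice) are assumptions for illustration.

```python
from datetime import datetime, timezone


class LinkStateLog:
    """Keeps a history of link state changes and notifies the administrator of each one."""

    def __init__(self, notify):
        self.notify = notify   # hypothetical callback that delivers the message
        self.history = []      # list of (timestamp, link, old_state, new_state)
        self.current = {}      # latest known state per link

    def update(self, link, state):
        previous = self.current.get(link)
        if previous == state:
            return False       # no change, so nothing to log or report
        self.current[link] = state
        self.history.append((datetime.now(timezone.utc), link, previous, state))
        self.notify(f"{link}: {previous} -> {state}")
        return True


messages = []
log = LinkStateLog(notify=messages.append)
log.update("wan-1", "up")
log.update("wan-1", "up")    # repeated state is ignored
log.update("wan-1", "down")
print(messages)
```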

After a failure has occurred and the contingency configuration has taken effect, these solutions are also expected to let the team keep monitoring the failed links and, once they are restored, transparently make the changes needed to return to the original pre-failure scenario.
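
The return to the pre-failure scenario can be sketched as follows. This Python example adds one assumption that is not stated above: the primary link must pass several consecutive health checks before traffic moves back, which avoids switching back and forth on an unstable link. The class name and the threshold of three checks are hypothetical.

```python
class FailbackController:
    """Monitors a failed primary link and switches back once it looks stable again."""

    def __init__(self, required_healthy_checks=3):
        self.required = required_healthy_checks  # assumed stability threshold
        self.healthy_streak = 0
        self.active_link = "backup"  # state right after a failover

    def observe(self, primary_healthy):
        """Record one health check of the primary link and return the active link."""
        if self.active_link == "primary":
            return self.active_link
        # Any failed check resets the streak of healthy checks.
        self.healthy_streak = self.healthy_streak + 1 if primary_healthy else 0
        if self.healthy_streak >= self.required:
            self.active_link = "primary"  # return to the original configuration
        return self.active_link


controller = FailbackController(required_healthy_checks=3)
for healthy in (True, False, True, True, True):
    print(controller.observe(healthy))
```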

Failover configurations should be tested regularly, on a schedule, to make sure they function as expected and are up to date with changes in the infrastructure.

An effective contingency plan and a team capable of configuring automatic link redundancy are essential when it comes to guaranteeing continuous availability of a service and even more so with cybersecurity systems.

In its Cluster Security Gateways for protected networks, Protelion uses hot backup for systems that are important enough to require continuous operation, with server switchover times of less than a second.