PCE High Availability and Disaster Recovery Concepts

Learn how the PCE provides high availability (HA) and disaster recovery (DR).

Goals for PCE High Availability

The PCE is designed to handle system or network failures based on the following goals:

Elimination of single points of failure: A failure of one component (PCE node or service) does not mean failure of the entire PCE cluster.
Detection of failures as they occur: The PCE detects failures without human intervention.
Reliable recovery: Recovery from failure is done with zero or minimal loss of data.

Three conditions determine whether the PCE can survive a failure and remain available:

Quorum
Service availability
Cluster capacity

All three conditions must be met for the PCE to be available and provide acceptable performance.

Quorum

A PCE cluster relies on quorum: a sufficient number of servers to ensure consistent operation. Quorum prevents the so-called “split-brain” case, where two parts of the cluster operate autonomously. Any node disconnected from the quorum is automatically isolated or “fenced” by shutting down most of its services.

All core nodes and the data0 node are voting members of the quorum, which gives the cluster an odd number of voters. The data1 node is not a voting member. A majority of the voting nodes must be available to maintain quorum and elect a cluster leader.

When a cluster experiences a failure and no longer has a majority of voting nodes functioning to maintain quorum, the cluster becomes unavailable until it recovers the minimum number of nodes.
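The majority rule described above can be illustrated with a short sketch. This is not PCE code: the function names are hypothetical, and the only facts taken from this section are that every core node plus data0 votes, data1 does not, and a majority of voters is required.

```python
def voting_members(core_nodes: int) -> int:
    """Count quorum voters: every core node plus data0 (data1 does not vote)."""
    return core_nodes + 1


def has_quorum(core_nodes: int, voters_up: int) -> bool:
    """Return True when a strict majority of voting members is available.

    Hypothetical helper for illustration only; it models the majority
    rule, not the PCE's actual leader-election logic.
    """
    total = voting_members(core_nodes)
    return voters_up > total // 2


# A cluster with 2 core nodes has 3 voters, so 2 must be up.
print(has_quorum(core_nodes=2, voters_up=2))
# A cluster with 4 core nodes has 5 voters, so 2 are not enough.
print(has_quorum(core_nodes=4, voters_up=2))
```

Because the voter count is odd, a network partition can never leave two halves that each hold a majority.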

In practice, as long as at least one core node and one data node are available, the PCE remains operational but with restricted functionality.

Service Availability

Another key requirement of PCE high availability is service availability, which means that at least one instance of all required PCE services is available.

The Service Discovery Service (SDS) monitors all services running on each node in the cluster. The SDS itself must also be monitored for failure. See Monitor PCE Health for information.

For a PCE cluster to provide all its necessary services, even in a partial cluster failure, it must contain at least one functioning data node and at least one core node, with all services fully available on each node.

Node Type	Service Tiers
Core	Front end; Processing; Service and caching
Data	Service and caching; Data persistence (database)
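The service availability requirement can be expressed as a simple check. The sketch below is an assumption-laden illustration: the Node class, its fields, and the services_available function are hypothetical and are not part of any PCE API. It only encodes the rule that at least one core node and at least one data node must be running all of their services.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Node:
    """Hypothetical record of a node's health, for illustration only."""
    name: str
    node_type: str          # "core" or "data"
    all_services_up: bool   # True when every service on the node is running


def services_available(nodes: Iterable[Node]) -> bool:
    """True when at least one fully healthy core node and data node exist."""
    healthy_types = {n.node_type for n in nodes if n.all_services_up}
    return {"core", "data"} <= healthy_types


cluster = [
    Node("core0", "core", True),
    Node("core1", "core", False),
    Node("data0", "data", True),
    Node("data1", "data", False),
]
print(services_available(cluster))
```

In a real deployment the health information would come from the PCE's own monitoring, as described in Monitor PCE Health.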

Cluster Capacity

Cluster capacity means that the PCE can provide sufficient compute resources to meet the demands required by the number of workloads deployed at any given time.

PCE 2x2 and 4x2 clusters are sized to support the loss of one data node plus half the total number of core nodes and still operate with degraded performance (1+1 redundancy). When more than one data node plus half the total number of core nodes in the cluster is lost, the cluster might not have sufficient capacity to meet demands.
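The sizing rule above can be sketched as a small check. The function name and parameters are hypothetical, and the sketch assumes a literal reading of the rule: capacity is treated as sufficient while no more than one data node and no more than half of the core nodes are lost. It does not model actual workload demand.

```python
def within_sized_capacity(total_core: int, lost_core: int, lost_data: int) -> bool:
    """Return True while node losses stay inside the sized tolerance.

    Illustrative only: tolerates the loss of at most one data node
    and at most half of the total core nodes.
    """
    return lost_data <= 1 and lost_core <= total_core // 2


# Assuming a 4x2 cluster has 4 core nodes: losing 2 core nodes and
# 1 data node is inside the tolerance.
print(within_sized_capacity(total_core=4, lost_core=2, lost_data=1))
# Losing a third core node exceeds it.
print(within_sized_capacity(total_core=4, lost_core=3, lost_data=1))
```

A False result here corresponds to the case in which the cluster might not have sufficient capacity to meet demand.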

Illumio Administration Guide 25.4