PCE High Availability and Disaster Recovery Concepts

Learn how the PCE provides high availability (HA) and disaster recovery (DR).

Goals for PCE High Availability

The PCE is designed to handle system or network failures based on the following goals:

Elimination of single points of failure: A failure of one component (PCE node or service) does not mean failure of the entire PCE cluster.
Detection of failures as they occur: The PCE detects failures without human intervention.
Reliable recovery: Recovery from failure is achieved with zero or minimal data loss.

Three conditions determine whether the PCE can survive a failure and remain available:

Quorum
Service availability
Cluster capacity

All these conditions must be met for the PCE to be available and provide acceptable performance.

Quorum

A PCE cluster relies on quorum: a sufficient number of servers to ensure consistent operation. Quorum prevents the so-called “split-brain” case, in which two parts of the cluster operate autonomously. Any node disconnected from the quorum is automatically isolated, or “fenced,” by shutting down most of its services.

All core nodes and the data0 node, which together form an odd number of nodes, are voting members of the quorum. The data1 node is not a voting member. A majority of the voting nodes must be available to maintain quorum and elect a cluster leader.

When a cluster experiences a failure and no longer has a majority of voting nodes functioning, it loses quorum and becomes unavailable until the minimum number of nodes is restored.
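
The majority rule described above can be illustrated with a short Python sketch. It only shows the arithmetic of a strict majority among voting members and is not the PCE's actual election mechanism. The function name `has_quorum` and the node names in the example lists are hypothetical, chosen to match the core and data0 naming used in this section.

```python
def has_quorum(voting_nodes, available_nodes):
    """Return True when a strict majority of the voting members is available."""
    voters = set(voting_nodes)
    # Non-voting nodes (for example data1) are ignored by the intersection.
    reachable = voters & set(available_nodes)
    return len(reachable) > len(voters) // 2


# Hypothetical 2x2 cluster: two core nodes plus data0 vote; data1 does not.
VOTERS_2X2 = ["core0", "core1", "data0"]

# Hypothetical 4x2 cluster: four core nodes plus data0 vote.
VOTERS_4X2 = ["core0", "core1", "core2", "core3", "data0"]
```

In this sketch, a 2x2 cluster with only core0 and data1 reachable has one of three voters available, so `has_quorum(VOTERS_2X2, ["core0", "data1"])` is false, while core0 plus data0 is enough for a majority.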

In practice, as long as at least one core node and one data node are available, the PCE remains operational but with restricted functionality.

Service Availability

Another key requirement for PCE high availability is service availability, meaning that at least one instance of each required PCE service is available.

The Service Discovery Service (SDS) monitors all services running on each node in the cluster. This service must be monitored for failure. See Monitor PCE Health for information.

For a PCE cluster to provide all necessary services, even in the event of a partial cluster failure, it must contain at least one functioning data node and at least one core node, with all services fully available on each node.

Node Type	Service Tiers
Core	Front end; Processing; Service and caching
Data	Service and caching; Data persistence (database)
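
As a rough illustration of why at least one core node and one data node are needed, the following Python sketch maps each node type to the tiers listed in the table above and reports which tiers have no available node. It is a simplification: the mapping `TIERS_BY_NODE_TYPE` and the function `missing_tiers` are hypothetical names, and the sketch models tiers rather than the individual PCE services that SDS tracks.

```python
# Tier names follow the table above; the dictionary itself is illustrative.
TIERS_BY_NODE_TYPE = {
    "core": {"front end", "processing", "service and caching"},
    "data": {"service and caching", "data persistence"},
}

# Every tier that appears for any node type must be covered somewhere.
REQUIRED_TIERS = set().union(*TIERS_BY_NODE_TYPE.values())


def missing_tiers(available_node_types):
    """Return the set of tiers that no available node can provide."""
    covered = set()
    for node_type in available_node_types:
        covered |= TIERS_BY_NODE_TYPE[node_type]
    return REQUIRED_TIERS - covered
```

With this mapping, a cluster that has lost both data nodes still covers the front end, processing, and service and caching tiers, but `missing_tiers(["core", "core"])` reports that data persistence is unavailable.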

Cluster Capacity

Cluster capacity means the PCE can provide sufficient compute resources to meet the demands of the number of workloads deployed at any given time.

PCE 2x2 and 4x2 clusters are sized to support the loss of one data node plus half the total number of core nodes, while still operating with degraded performance (1+1 redundancy). When more than half the total number of core nodes in the cluster is lost, the cluster might not have sufficient capacity to meet demand.
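
The sizing rule above can be expressed as a small check. This Python sketch assumes the rule is applied literally: at least one data node remains, and no more than half of the core nodes are lost. The function name `within_sized_tolerance` and its parameters are hypothetical, and the result says only whether a failure stays inside the sized tolerance, not whether actual demand is being met.

```python
def within_sized_tolerance(total_core, available_core, available_data):
    """Return True if the losses stay within the stated sizing rule.

    The rule modeled here: the cluster tolerates losing one data node
    plus up to half of its core nodes, at degraded performance.
    """
    lost_core = total_core - available_core
    # Compare 2 * lost against the total to avoid fractional halves.
    core_ok = 2 * lost_core <= total_core
    data_ok = available_data >= 1
    return core_ok and data_ok
```

For example, a 4x2 cluster with two core nodes and one data node remaining is still within tolerance, whereas the same cluster with only one core node left has lost more than half of its core nodes and might not have sufficient capacity.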

Illumio Administration Guide 26.x (SaaS)