PCE High Availability and Disaster Recovery Concepts
Learn how the PCE provides high availability (HA) and disaster recovery (DR).
Goals for PCE High Availability
The PCE is designed to handle system or network failures based on the following goals:
Elimination of single points of failure: A failure of one component (PCE node or service) does not mean failure of the entire PCE cluster.
Detection of failures as they occur: The PCE detects failure without human intervention.
Reliable recovery: Recovery from failure is done with zero or minimal loss of data.
Three conditions determine whether the PCE can survive a failure and remain available: quorum, service availability, and cluster capacity.
All these conditions must be met for the PCE to be available and provide acceptable performance.
Quorum
A PCE cluster relies on quorum, which is a sufficient number of servers to ensure consistent operation. Quorum prevents the so-called “split brain” case where two parts of the cluster are operating autonomously. Any node that becomes disconnected from the quorum is automatically isolated or “fenced” by shutting down most of its services.
All core nodes and the data0 node (an odd number) are voting members of the quorum. The data1 node is not a voting member. A majority of these nodes must be available to maintain quorum and elect a cluster leader.
When a cluster experiences a failure and no longer has a majority of voting nodes functioning, it loses quorum and becomes unavailable until the minimum number of nodes has recovered.
In practice, this means that as long as at least one core node and one data node are available, the PCE remains operational but with restricted functionality.
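The majority rule described above can be illustrated with a minimal Python sketch. It assumes node labels such as `core0` and `core1` (hypothetical names; only `data0` and `data1` appear in the text) and is not the PCE's actual leader-election logic.

```python
def voting_members(core_count):
    """Return the voting nodes: every core node plus data0 (data1 does not vote)."""
    return {f"core{i}" for i in range(core_count)} | {"data0"}


def has_quorum(available_nodes, core_count):
    """True when a strict majority of the voting members is available."""
    voters = voting_members(core_count)
    majority = len(voters) // 2 + 1
    return len(voters & set(available_nodes)) >= majority


# 2x2 cluster: 3 voters (core0, core1, data0), so 2 must be available.
print(has_quorum({"core0", "data0"}, core_count=2))  # True
# data1 is not a voting member, so only data0 counts here.
print(has_quorum({"data0", "data1"}, core_count=2))  # False
```

Because the voting set always contains an odd number of nodes, a network partition can never leave two halves that both hold a majority.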
Service Availability
Another key requirement of PCE high availability is service availability, which means that at least one instance of every required PCE service is available.
The Service Discovery Service (SDS) monitors all services running on each node in the cluster. The SDS itself must be monitored for failure. See Monitor PCE Health for information.
For a PCE cluster to provide all its necessary services, even in the event of a partial cluster failure, it must contain at least one functioning data node and at least one core node, with all services fully available on each node.
Node Type | Service Tiers |
---|---|
Core | |
Data | |
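As a rough illustration of the service availability condition, the following Python sketch checks that every required service has at least one running instance somewhere in the cluster. The service names and node states are hypothetical placeholders, not actual PCE service names, and the SDS is not queried this way.

```python
def missing_services(required, running_by_node):
    """Return the required services that have no running instance on any node."""
    running = set()
    for services in running_by_node.values():
        running |= set(services)
    return set(required) - running


# Hypothetical service names and node states, for illustration only.
required = {"frontend", "processing", "database"}
running_by_node = {
    "core0": {"frontend", "processing"},
    "core1": set(),  # failed node: nothing running
    "data0": {"database"},
}

# An empty result means every required service has at least one instance.
print(missing_services(required, running_by_node))  # set()
```

Here the cluster still satisfies the condition after losing `core1`, because the surviving core node and data node together run every required service.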
Cluster Capacity
Cluster capacity means that, at any given time, the PCE can provide sufficient compute resources to meet the demands of the deployed workloads.
PCE 2x2 and 4x2 clusters are sized to support the loss of one data node plus half the total number of core nodes and still operate with degraded performance (1+1 redundancy). When more than one data node plus half the total number of core nodes in the cluster is lost, the cluster might not have sufficient capacity to meet demands.
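The sizing rule can be expressed as a small Python check. This is one reading of the rule under stated assumptions: the function name and parameters are hypothetical, and the result only indicates whether losses stay within the sized tolerance, not whether quorum or service availability is maintained.

```python
def within_sized_capacity(core_total, core_lost, data_lost):
    """True when losses stay within one data node plus half of the core nodes."""
    return data_lost <= 1 and core_lost <= core_total // 2


# 2x2 cluster: losing one core node and one data node is within tolerance.
print(within_sized_capacity(core_total=2, core_lost=1, data_lost=1))  # True
# 4x2 cluster: losing three of four core nodes exceeds the tolerance.
print(within_sized_capacity(core_total=4, core_lost=3, data_lost=0))  # False
```

A `False` result corresponds to the case where the cluster might not have sufficient capacity; a `True` result still implies degraded performance rather than full capacity.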