
Illumio Core 25.2.10 Administration Guide

Site Failure (Split Clusters)

In this failure type, one data node and half the total number of core nodes fail, while the surviving data node and the remaining core nodes continue to function.

In a 2x2 deployment, a split cluster failure means the loss of one of these node combinations:

  • Data0 and one core node

  • Data1 and one core node

In a 4x2 deployment, a split cluster failure means the loss of one of these node combinations:

  • Data0 and two core nodes

  • Data1 and two core nodes

This type of failure can occur when the PCE cluster is split across two separate physical sites or availability zones with network latency greater than 10ms, and a site failure causes half the nodes in the cluster to fail. A site failure is one cause of this failure type; however, split cluster failures can also occur in a single-site deployment when multiple nodes fail simultaneously for any reason.

Split Cluster Failure Involving Data1

In this failure case, data1 and half the core nodes completely fail.

[Figure: split cluster failure involving data1 (pce-ha-dr-3.png)]


Preconditions

None.

Failure Behavior

PCE

  • The PCE is temporarily unavailable.

  • Users might be unable to log into the PCE web console.

  • The PCE might return an HTTP 502 response, and the /node_available API request might return an HTTP 404 error (see the check sketched after this list).

  • Other services that are dependent on the failed services might be restarted within the cluster.
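If you need to confirm this condition from the command line, you can query the PCE node availability endpoint mentioned above. The following is a minimal sketch; the PCE FQDN, the /api/v2 prefix, and front-end port 8443 are assumptions for a typical deployment and might differ in your environment.

    # Query node availability (sketch). A healthy node returns HTTP 200; during this
    # failure you might see a 502 from the front end or a 404 from /node_available.
    curl -k -s -o /dev/null -w "%{http_code}\n" https://<pce-fqdn>:8443/api/v2/node_available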

VENs

  • VENs are not affected.

  • VENs continue to enforce the current policy.

  • When a VEN misses a heartbeat to the PCE, it retries in 5 minutes.

Recovery

  • Recovery type: Automatic. Because quorum is maintained, the data0 half of the cluster can operate as a standalone cluster. When data1 was the primary database, the PCE automatically promotes data0 to be the new primary database (see the status check after this list).

  • Recovery procedure: None.

  • RTO: 5 minutes.

  • RPO: Service specific based on which data services were running on data1 at the time of the failure:

    • database_service: Data1 node was the primary database. All database data committed on data1 and not replicated to data0 is lost (typically less than one second of data).

    • database_slave_service: Data1 node was the replica database. No database data is lost.

    • agent_traffic_redis_server: All traffic data is lost.

    • fileserver_service: All asynchronous query requests and Support Reports are lost.
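To confirm that the automatic promotion has completed, you can check the cluster status from any surviving node. This is a sketch using the same control command shown later in this guide; the exact output varies by release.

    # Run on any surviving node; returns once the cluster reports RUNNING.
    sudo -u ilo-pce illumio-pce-ctl cluster-status --wait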

Full Recovery

Either recover the failed nodes or provision new nodes and join them to the cluster.

For recovery information, see Replace a Failed Node.

Split Cluster Failure Involving Data0

In this failure case, data0 and half of the total number of core nodes completely fail.

[Figure: split cluster failure involving data0 (pce-ha-dr-4.png)]


Preconditions

Caution

When reverting the standalone cluster back to a full cluster, you must be able to control the recovery process so that each recovered node is powered on and re-joined to the cluster one node at a time (while the other recovered nodes are powered off). Otherwise, the cluster could become corrupted and need to be fully rebuilt.

Failure Behavior

PCE

  • The PCE is unavailable because it does not have the minimum number of nodes to maintain quorum.

VENs

  • The VEN continues to enforce its last known good policy.

  • The VEN's state and flow updates are cached locally on the workload where the VEN is installed. The VEN stores up to 24 hours of flow data; during an extended outage, it purges the oldest data first.

  • After missing 3 heartbeats (approximately 15 minutes), the VEN enters a degraded state. While in the degraded state, the VEN ignores all asynchronous commands received as lightning bolts from the PCE, except the commands that initiate software upgrades and Support Reports (a local status check is sketched after this list).
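To check the state of a VEN locally while the PCE is unreachable, you can run the VEN control command on the workload. This is a sketch for a Linux workload; the default installation path /opt/illumio_ven is an assumption and might differ in your environment.

    # Run on the workload; reports the VEN's local service state.
    /opt/illumio_ven/illumio-ven-ctl status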

Recovery

  • Recovery type: Manual intervention is required to recover from this failure case.

  • Recovery procedure: Configure the surviving data1 node and core nodes as a standalone cluster. See Configure Data1 and Core Nodes as Standalone Cluster.

  • RTO: Customer dependent based on how long it takes you to detect this failure and perform the manual recovery procedures.

  • RPO: Service specific based on which data services were running on data0 at the time of the failure:

    • database_service: Data0 node was the primary database. All database data committed on data0 and not replicated to data1 is lost (typically less than one second of data).

    • database_slave_service: Data0 node was the replica database. No database data is lost.

    • agent_traffic_redis_server: All traffic data is lost.

    • fileserver_service: All asynchronous query requests and Support Reports are lost.

Full Recovery

See Revert Standalone Cluster Back to a Full Cluster for information.

Configure Data1 and Core Nodes as Standalone Cluster

To enable the surviving data1 and core nodes to operate as a standalone 2x2 or 4x2 cluster, follow these steps in this exact order.

  1. On the surviving data1 node and all surviving core nodes, stop the PCE software:

    sudo -u ilo-pce illumio-pce-ctl stop
  2. On any surviving core node, promote the core node to be a standalone cluster leader:

    sudo -u ilo-pce illumio-pce-ctl promote-cluster-leader
  3. On the surviving data1 node, promote the data1 node to be the primary database for the new standalone cluster:

    sudo -u ilo-pce illumio-pce-ctl promote-data-node <promoted-core-node-ip-address>

    For the IP address, enter the IP address of the promoted core node from step 2.

  4. (4x2 clusters only) On the other surviving core node, join it to the new standalone cluster:

    sudo -u ilo-pce illumio-pce-ctl cluster-join <promoted-core-node-ip-address> --split-cluster

    For the IP address, enter the IP address of the promoted core node from step 2.

  5. Back up the surviving data1 node.
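
    Step 5 does not prescribe a specific backup method. One approach is to dump the PCE database from the data1 node, which is now the primary database; the following is a sketch, and the output file path is an example you should replace to match your backup policy.

    # Run on the data1 node (now the primary database).
    sudo -u ilo-pce illumio-pce-db-management dump --file /var/tmp/pce-db-standalone-backup.tar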

Revert Standalone Cluster Back to a Full Cluster

To revert back to a 2x2 or 4x2 cluster, follow these steps in this exact order:

Important

If you plan to recover the failed nodes and the PCE software is configured to start automatically on boot (the default behavior for a PCE RPM installation), you must power on and re-join the nodes to the cluster one at a time, keeping the other recovered nodes powered off so that the PCE is not running on them. Otherwise, your cluster might become corrupted and need to be fully rebuilt.
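
Before joining each recovered node, you can confirm that the PCE software is not running on any other recovered node. This is a sketch using the node-level status command already used elsewhere in this guide.

    # Run on each recovered node; the PCE software should be stopped
    # everywhere except the node you are currently joining.
    sudo -u ilo-pce illumio-pce-ctl status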

  1. Recover one of the failed core nodes or provision a new core node.

  2. If you provisioned a new core node, run the following command on any existing node in the cluster (not the new node you are about to add). For ip_address, substitute the IP address of the new node.

    sudo -u ilo-pce illumio-pce-ctl cluster-nodes allow ip_address
  3. On the recovered or new core node, start the PCE software and enable the node to join the cluster:

    sudo -u ilo-pce illumio-pce-ctl cluster-join <promoted-core-node-ip-address>

    For the IP address, enter the IP address of the promoted core node.

  4. (4x2 clusters only) Repeat steps 1-3 for the other recovered or new core node.

  5. Recover the failed data0 node or provision a new data0 node.

  6. If you provisioned a new data node, run the following command on any existing node in the cluster (not the new node you are about to add). For ip_address, substitute the IP address of the new node.

    sudo -u ilo-pce illumio-pce-ctl cluster-nodes allow ip_address
  7. On the recovered data0 or new data0 node, start the PCE software and enable the node to join the cluster:

    sudo -u ilo-pce illumio-pce-ctl cluster-join <promoted-core-node-ip-address>

    For the IP address, enter the IP address of the promoted core node.

  8. On the surviving data1 node and all core nodes, remove the standalone configuration for the nodes that you previously promoted during the failure:

    sudo -u ilo-pce illumio-pce-ctl revert-node-config

    Note

    Run this command so that the nodes that you previously promoted during the failure no longer operate as a standalone cluster.

  9. Verify that the cluster is in the RUNNING state:

    sudo -u ilo-pce illumio-pce-ctl cluster-status --wait
  10. Verify that you can log into the PCE web console.

    Note

    In rare cases, you might receive an error when attempting to log into the PCE web console. When this happens, restart all nodes and try logging in again:

    sudo -u ilo-pce illumio-pce-ctl restart