PCE Failures and Recoveries
This section describes how the PCE handles various types of possible failures. It explains whether the PCE can recover from a failure automatically and, if not, what manual intervention is required to remedy the situation.
PCE Core Deployments and High Availability (HA)
The most common PCE Core deployments are 2x2 and 4x2 setups.
For High Availability (HA), the PCE nodes can be deployed as two separate halves (1 core + 1 data node for a 2x2 cluster, or 2 core + 1 data node for a 4x2 cluster) in separate data centers.
For high availability, the database services run in primary-replica mode, with the primary service running on either of the data nodes.
Note
Both data nodes (data0 and data1) always operate as active members of the cluster. Data1 is not a "warm" standby that becomes active only after the primary data node fails.
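For example, to check which data node is currently running the primary database, you can run the show-master command that also appears in the recovery procedures later in this section:

```bash
# Report which data node currently hosts the primary database.
sudo -u ilo-pce illumio-pce-db-management show-master
```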
Types of PCE Failures
These are the general kinds of failures that can occur with a PCE deployment:

- PCE-VEN network partition: A network partition occurs that cuts off communication between the PCE and VENs.
- PCE service failure: One or more of the PCE's services fail on a node.
- PCE node failure: One of the PCE's core or data nodes fails.
- PCE split cluster failure (site failure): One data node plus half the total number of core nodes fail.
- PCE cluster network partition: A network partition occurs between the two halves of a PCE cluster, but all nodes are still functioning.
- Multi-node traffic database failure: If the traffic database uses the optional multi-node configuration, the coordinator and worker nodes can fail.
- Complete PCE failure: The entire PCE cluster fails or is destroyed and must be rebuilt.
Failure-to-Recovery Stages
For each failure case, this document provides the following information (when applicable):
| Stage | Details |
|---|---|
| Preconditions | Any required or recommended preconditions that you are responsible for in order to recover from the failure. For example, in some failure cases, Illumio assumes you regularly export a copy of the primary database to an external system in case you need to recover the database. |
| Failure behavior | The behavior of the PCE and VENs from the time the failure occurs until recovery. It can be caused by the failure itself or by the execution of recovery procedures. |
| Recovery | A description of how the system recovers from the failure to resume operations, which might be automatic or might require manual intervention on the PCE or VEN. When intervention is required, the steps are provided. |
| Full Recovery (not always applicable) | In some cases, additional steps might be required to revert the PCE to its normal, pre-failure operating state. This is usually a planned activity that can be scheduled. |
Legend for PCE Failure Diagrams
The following diagram symbols illustrate the affected parts of the PCE in a failure:
- Dotted red line: Loss of network connectivity, but all nodes are still functioning
- Dotted red X: Failure or loss of one or more nodes, such as when a node is shut down or stops functioning
PCE-VEN Network Partition
In this failure case, a network partition occurs between the PCE and VENs, cutting off communication between the PCE and all or some of its VENs. However, the PCE and VENs are still functioning.

| Stage | Details |
|---|---|
| Preconditions | None |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
Service Failure
In this failure case, one of the PCE's services fails on a node.

| Stage | Details |
|---|---|
| Preconditions | None. |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
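To investigate a suspected service failure on a node, a minimal first step is to check service health with the PCE control command (the exact output format varies by PCE version):

```bash
# Show the status of the PCE services on this node; any service not in
# a RUNNING state may need attention or a restart.
sudo -u ilo-pce illumio-pce-ctl status
```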
Core Node Failure
In this failure case, one of the core nodes completely fails. This situation occurs anytime a node is not communicating with any of the other nodes in the cluster; for example, the node is destroyed, the node's SSD fails, or the node is powered off or disconnected from the cluster.

| Stage | Details |
|---|---|
| Preconditions | The load balancer must be able to run application-level health checks on each of the core nodes in the PCE cluster, so that it is aware at all times whether a node is available (see the health-check sketch after this table). Important: When you use a DNS load balancer and need to provision a new core node to recover from this failure, the |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
| Full Recovery | Either recover the failed node or provision a new node and join it to the cluster. |
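For the load-balancer precondition above, the sketch below shows an application-level health check. The /node_available endpoint and port 8443 are assumptions based on common PCE health-check configurations; verify them against your PCE version's documentation:

```bash
# Query a core node's availability endpoint the way a load balancer
# health check would. An HTTP 200 response indicates the node is
# available; the hostname is illustrative.
curl -ik https://core0.pce.example.com:8443/node_available
```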
Data Node Failure
In this failure case, one of the data nodes completely fails.

| Stage | Details |
|---|---|
| Preconditions | You should continually monitor the replication lag of the replica database to make sure it stays in sync with the primary database. You can do this by monitoring the output of the following command (see the monitoring sketch after this table): `sudo -u ilo-pce illumio-pce-db-management show-replication-info` |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
| Full Recovery | When the failed data node is recovered or a new node is provisioned, it registers with the PCE and is added as an active member of the cluster. This node is designated the replica database and replicates all the data from the primary database. |
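To satisfy the monitoring precondition above, here is a minimal sketch that polls replication status on a schedule. The log path and interval are illustrative, and any parsing or alerting should be adapted to the actual output of your PCE version:

```bash
#!/bin/bash
# Append replication info to a log every 5 minutes so that replication
# lag can be reviewed or alerted on.
while true; do
  date >> /var/log/pce-replication-check.log
  sudo -u ilo-pce illumio-pce-db-management show-replication-info \
    >> /var/log/pce-replication-check.log 2>&1
  sleep 300
done
```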
Primary Database Doesn't Start
In this failure case, the database node fails to start.

| Stage | Details |
|---|---|
| Preconditions | The primary database node does not start. |
| Failure behavior | The database cannot be started. Therefore, the entire PCE cluster cannot be started. |
| Full Recovery | Recovery type: Manual. You have two recovery options: find the root cause of the failure and fix it (contact Illumio Support if needed), or promote the replica database node to primary (see the sketch after this table). Warning: Promoting a replica to primary risks data loss. Illumio strongly recommends that you use promotion only as a last resort because of the potential for data loss. When the PCE Supercluster is affected by this problem, you must also restore data on the promoted primary database. |
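If you choose the promotion option, the command is the promote-data-node command shown in the next failure case; a minimal sketch with an illustrative address:

```bash
# Run on the replica database node. 192.0.2.10 is a hypothetical IP
# address of the cluster leader (a core node); substitute your own.
sudo -u ilo-pce illumio-pce-ctl promote-data-node 192.0.2.10
```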
Primary Database Doesn't Start When PCE Starts
In this failure case, the database node fails to start when the PCE starts or restarts.
The following recovery information applies only when the PCE starts or restarts. When the PCE is already running and the primary database node fails, database failover will occur normally and automatically, and the replica database node will become the primary node.
| Stage | Details |
|---|---|
| Preconditions | The primary database node does not start during PCE startup. This issue could occur because of an error on the primary node. Even when no error occurred, you might start the replica node first and then be interrupted, causing a delay in starting the primary node that exceeds the timeout. |
| Failure behavior | The database cannot be started. Therefore, the entire PCE cluster cannot be started. |
| Full Recovery | Recovery type: Manual. You have two recovery options: find the root cause of the failure and fix it (contact Illumio Support if needed), or promote the replica database node to primary. Warning: Promoting a replica to primary risks data loss. Consider promotion only as a last resort because the amount of data lost depends on the replication lag. If you decide to promote, run the following command on the replica database node: `sudo -u ilo-pce illumio-pce-ctl promote-data-node <core-node-ip-address>` This command promotes the node to be the primary database for the cluster whose leader is at the specified IP address. |
Site Failure (Split Clusters)
In this failure type, one of the data nodes plus half the total number of core nodes fail, while the surviving data node and remaining core nodes are still functioning.
In a 2x2 deployment, a split cluster failure means the loss of one of these node combinations:

- Data0 and one core node
- Data1 and one core node

In a 4x2 deployment, a split cluster failure means the loss of one of these node combinations:

- Data0 and two core nodes
- Data1 and two core nodes

This type of failure can occur when the PCE cluster is split across two separate physical sites or availability zones with network latency greater than 10ms, and a site failure causes half the nodes in the cluster to fail. A site failure is only one cause of this failure type; split cluster failures can also occur in a single-site deployment when multiple nodes fail simultaneously for any reason.
Split Cluster Failure Involving Data1
In this failure case, data1 and half the core nodes completely fail.

| Stage | Details |
|---|---|
| Preconditions | None. |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
| Full Recovery | Either recover the failed nodes or provision new nodes and join them to the cluster. For recovery information, see Replace a Failed Node. |
Split Cluster Failure Involving Data0
In this failure case, data0 and half of the total number of core nodes completely fail.

| Stage | Details |
|---|---|
| Preconditions | Caution: When reverting the standalone cluster back to a full cluster, you must be able to control the recovery process so that each recovered node is powered on and re-joined to the cluster one node at a time (while the other recovered nodes are powered off). Otherwise, the cluster could become corrupted and need to be fully rebuilt. |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
| Full Recovery | See Revert Standalone Cluster Back to a Full Cluster for information. |
Configure Data1 and Core Nodes as Standalone Cluster
To enable the surviving data1 and core nodes to operate as a standalone 2x2 or 4x2 cluster, follow these steps in this exact order (a worked example follows the steps).
1. On the surviving data1 node and all surviving core nodes, stop the PCE software:

   sudo -u ilo-pce illumio-pce-ctl stop

2. On any surviving core node, promote the core node to be the standalone cluster leader:

   sudo -u ilo-pce illumio-pce-ctl promote-cluster-leader

3. On the surviving data1 node, promote the data1 node to be the primary database for the new standalone cluster:

   sudo -u ilo-pce illumio-pce-ctl promote-data-node <promoted-core-node-ip-address>

   For the IP address, enter the IP address of the promoted core node from step 2.

4. (4x2 clusters only) On the other surviving core node, join the surviving core node to the new standalone cluster:

   sudo -u ilo-pce illumio-pce-ctl cluster-join <promoted-core-node-ip-address> --split-cluster

   For the IP address, enter the IP address of the promoted core node from step 2.

5. Back up the surviving data1 node.
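As a worked example for a 2x2 cluster, assume the surviving core node is 192.0.2.10 and the surviving data1 node is 192.0.2.20 (both addresses are illustrative):

```bash
# Step 1: on 192.0.2.10 and 192.0.2.20, stop the PCE software.
sudo -u ilo-pce illumio-pce-ctl stop

# Step 2: on the surviving core node (192.0.2.10), promote it to be
# the standalone cluster leader.
sudo -u ilo-pce illumio-pce-ctl promote-cluster-leader

# Step 3: on the surviving data1 node (192.0.2.20), promote it to be
# the primary database, passing the promoted core node's IP address.
sudo -u ilo-pce illumio-pce-ctl promote-data-node 192.0.2.10
```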
Revert Standalone Cluster Back to a Full Cluster
To revert to a 2x2 or 4x2 cluster, follow these steps in this exact order:
Important
If you plan to recover the failed nodes and the PCE software is configured to auto-start when powered on (the default behavior for a PCE RPM installation), you must power on every node and re-join the nodes to the cluster one at a time, keeping the other nodes powered off and the PCE software stopped on them. Otherwise, your cluster might become corrupted and need to be fully rebuilt.
1. Recover one of the failed core nodes or provision a new core node.

2. If you provisioned a new core node, run the following command on any existing node in the cluster (not the new node you are about to add). For ip_address, substitute the IP address of the new node:

   sudo -u ilo-pce illumio-pce-ctl cluster-nodes allow ip_address

3. On the recovered or new core node, start the PCE software and enable the node to join the cluster:

   sudo -u ilo-pce illumio-pce-ctl cluster-join <promoted-core-node-ip-address>

   For the IP address, enter the IP address of the promoted core node.

4. (4x2 clusters only) For the other recovered or new core node, repeat steps 1-3.

5. Recover the failed data0 node or provision a new data0 node.

6. If you provisioned a new data node, run the following command on any existing node in the cluster (not the new node you are about to add). For ip_address, substitute the IP address of the new node:

   sudo -u ilo-pce illumio-pce-ctl cluster-nodes allow ip_address

7. On the recovered or new data0 node, start the PCE software and enable the node to join the cluster:

   sudo -u ilo-pce illumio-pce-ctl cluster-join <promoted-core-node-ip-address>

   For the IP address, enter the IP address of the promoted core node.

8. On the surviving data1 node and all core nodes, remove the standalone configuration that you applied during the failure:

   sudo -u ilo-pce illumio-pce-ctl revert-node-config

   Note: Run this command so that the nodes you promoted during the failure no longer operate as a standalone cluster.

9. Verify that the cluster is in the RUNNING state:

   sudo -u ilo-pce illumio-pce-ctl cluster-status --wait

10. Verify that you can log into the PCE web console.

    Note: In rare cases, you might receive an error when attempting to log into the PCE web console. If this happens, restart all nodes and try logging in again:

    sudo -u ilo-pce illumio-pce-ctl restart
Cluster Network Partition
In this failure case, the network connection between the two halves of your PCE cluster is severed, cutting off all communication between them. However, all nodes in the cluster are still functioning.
Illumio defines “half a cluster” as one data node plus half the total number of core nodes in the cluster.

| Stage | Details |
|---|---|
| Preconditions | None. |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
| Full Recovery | No additional steps are required to revert the PCE to its normal, pre-failure operating state. When network connectivity is restored, the data1 half of the cluster automatically reconnects to the data0 half. The PCE then restarts all services on the data1 half of the cluster. |
Multi-Node Traffic Database Failure
If the traffic database uses the optional multi-node configuration, the coordinator and worker nodes can fail.
For information about multi-node traffic database configuration, see "Scale Traffic Database to Multiple Nodes" in the PCE Installation and Upgrade Guide.
Coordinator primary node failure
If the coordinator primary completely fails, all the data-related PCE applications might be unavailable for a brief period. All other PCE services should be operational.
Recovery is automatic after the failover timeout. The coordinator replica will be promoted to the primary, and all data-related applications should work as usual when the recovery is done.
Warning
Any unprocessed traffic flow data on the coordinator primary will be lost until the coordinator primary is back to normal.
Coordinator primary does not start
If the coordinator primary does not start, the PCE will not function as usual.
There are two options for recovery:

- Find the root cause of the failure and fix it. Contact Illumio Support if needed.
- Promote a replica coordinator node to primary.
Warning
Promoting a replica coordinator to a primary can result in data loss. Use this recovery procedure only as a last resort.
To promote a replica coordinator node to primary:
sudo -u ilo-pce illumio-pce-ctl promote-coordinator-node <cluster-leader-address>
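A usage sketch with an illustrative address; per the command above, the argument is the address of the cluster leader:

```bash
# Run on the replica coordinator node. 192.0.2.10 is a hypothetical
# cluster leader address; substitute your own.
sudo -u ilo-pce illumio-pce-ctl promote-coordinator-node 192.0.2.10
```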
Worker primary node failure
If a worker primary node fails, all data-related applications might be unavailable briefly. All other PCE services should be operational.
Recovery is automatic after the failover timeout. The worker replica will be promoted to the primary. All data-related applications should work as usual once the recovery is done.
Warning
Any data not replicated to the replica worker node before the failure will be lost.
Worker primary does not start
If the worker primary does not start, the PCE will not function as usual.
There are two options for recovery:

- Find the root cause of the failure and fix it. Contact Illumio Support if needed.
- Promote the corresponding replica worker node to primary.
Warning
Promoting a replica worker to primary can result in data loss. Use this recovery procedure only as a last resort.
To promote a replica worker node to primary, first identify the replica that corresponds to the failed primary node. Run the following command to list the metadata for all the workers, and note the IP address of the replica for the failed primary:
sudo -u ilo-pce illumio-pce-db-management traffic citus-worker-metadata
Promote the replica worker node to primary:
sudo -u ilo-pce illumio-pce-ctl promote-worker-node <core-node-ip>
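A hedged end-to-end sketch of the two commands, with an illustrative address:

```bash
# List Citus worker metadata and identify the replica that corresponds
# to the failed worker primary.
sudo -u ilo-pce illumio-pce-db-management traffic citus-worker-metadata

# Promote that replica to primary. 192.0.2.31 is a hypothetical address
# taken from the metadata listing above; substitute your own.
sudo -u ilo-pce illumio-pce-ctl promote-worker-node 192.0.2.31
```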
Complete Cluster Failure
In this rare failure case, the entire PCE cluster has failed.

| Stage | Details |
|---|---|
| Preconditions | Illumio assumes that you have met the following conditions before the failure occurs. Important: You must consistently and frequently back up the PCE primary database to an external storage system that can be used to restore the primary database after this type of failure (see the backup sketch after this table). You need access to this backup database file to recover from this failure case. The |
| Failure behavior | PCE:<br>VENs: |
| Recovery | |
| Full Recovery | See Complete Cluster Recovery for full recovery information; perform all the listed steps on the restored primary cluster. |
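For the backup precondition, a minimal sketch of a scheduled database dump. The dump subcommand is assumed to pair with the restore --file command used in the recovery steps below; the schedule, paths, and destination host are illustrative:

```bash
#!/bin/bash
# Hypothetical nightly backup script (run from root's cron): dump the
# PCE database and copy the dump file to external storage.
DUMP=/var/backups/pce/pce-db-$(date +%F).dump
sudo -u ilo-pce illumio-pce-db-management dump --file "$DUMP"
scp "$DUMP" backup@storage.example.com:/pce-backups/
```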
Complete Cluster Recovery
Recovering from this failure case requires performing the following tasks:

1. Power on all nodes in the secondary PCE cluster.
2. Use the database backup file from your most recent backup and restore the backup on the primary database node.

To restore the PCE database from backup, perform the following steps (a sketch of the file-copy step follows the procedure):

1. On all nodes in the PCE cluster, stop the PCE software:

   sudo -u ilo-pce illumio-pce-ctl stop

2. On all nodes in the PCE cluster, start the PCE software at runlevel 1:

   sudo -u ilo-pce illumio-pce-ctl start --runlevel 1

3. Determine the primary database node:

   sudo -u ilo-pce illumio-pce-db-management show-master

4. On the primary database node, restore the database:

   sudo -u ilo-pce illumio-pce-db-management restore --file <location of prior db dump file>

5. Migrate the database:

   sudo -u ilo-pce illumio-pce-db-management migrate

6. Copy the Illumination data file from the primary database node to the other data node. The file is located in the following directory on both nodes:

   <persistent_data_root>/redis/redis_traffic_0_master.rdb

7. Bring the PCE cluster to runlevel 5:

   sudo -u ilo-pce illumio-pce-ctl set-runlevel 5

8. Verify that you can log into the PCE web console.
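For step 6, a hedged example of copying the Illumination data file between the data nodes; the hostname is illustrative, and <persistent_data_root> depends on your runtime_env.yml configuration:

```bash
# Run on the primary database node: copy the Redis dump that backs
# Illumination to the other data node.
scp <persistent_data_root>/redis/redis_traffic_0_master.rdb \
    root@data1.example.com:<persistent_data_root>/redis/
```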
PCE-Based VEN Distribution Recovery
When you rely on PCE-based distribution of VEN software, you need to reload or redeploy the PCE VEN library after you recover from a PCE cluster failure.
When at least one PCE core node is unaffected by the failure, you can redeploy the VEN library to the other nodes.
When the failure is catastrophic and you have to replace the entire PCE cluster, you need to reload the PCE's VEN library. See the VEN Administration Guide for information.
Restore VENs Paired to Failed PCE
A failed PCE does not receive information from the VENs paired with it. This lack of connectivity can result in stale IP addresses and other outdated information being recorded for those VENs. Additionally, other PCEs might also have this stale information. When the PCE regains connectivity, it eventually marks the uncommunicative VENs "offline" and removes them from policy.
To resolve this situation, delete the "offline" workloads from the PCE by using the PCE web console or the REST API (see the sketch below). After deleting the VENs, you can re-install and re-activate the VENs on the affected workloads.
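A hedged sketch of the REST API approach, assuming API key credentials and org ID 1; the endpoint paths, the online=false query parameter, and the workload UUID are illustrative, so confirm them against the REST API guide for your PCE version:

```bash
# List workloads reported as offline.
curl -u "$API_KEY:$API_SECRET" \
  "https://pce.example.com:8443/api/v2/orgs/1/workloads?online=false"

# Delete a stale workload by its href.
curl -u "$API_KEY:$API_SECRET" -X DELETE \
  "https://pce.example.com:8443/api/v2/orgs/1/workloads/f0f12345-abcd-4ef0-9876-0123456789ab"
```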