
Illumio Core 25.1 Administration Guide

Data Node Failure

In this failure case, one of the data nodes completely fails.


Stage

Details

Preconditions

You should continually monitor the replication lag of the replica database to make sure it is in sync with the primary database.

You can meet this precondition by monitoring the illumio_pce/system_health syslog messages or by running the following command on one of the data nodes:

sudo -u ilo-pce illumio-pce-db-management show-replication-info
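
Because this check needs to run continually, it can help to wrap the command in a small script. The following is a minimal sketch, not an Illumio-provided tool: the function names, the REPL_CMD variable, the interval, and the run count are all illustrative assumptions. It makes no assumption about the command's output format; it only timestamps the output and reports a non-zero exit status.

```shell
# REPL_CMD defaults to the command shown in this guide.
# Override it (for example, in a test) to avoid calling the real tool.
REPL_CMD=${REPL_CMD:-"sudo -u ilo-pce illumio-pce-db-management show-replication-info"}

# Run the command once and prefix every output line with a UTC timestamp.
# Returns the exit status of the command itself.
log_replication_info() {
  _ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  _out=$($REPL_CMD 2>&1)
  _status=$?
  printf '%s\n' "$_out" | sed "s/^/[$_ts] /"
  return "$_status"
}

# Run the check COUNT times, sleeping INTERVAL seconds between runs.
# Both values are hypothetical defaults, not Illumio recommendations.
watch_replication() {
  _interval=${1:-60}
  _count=${2:-5}
  _i=1
  while [ "$_i" -le "$_count" ]; do
    log_replication_info || echo "replication check failed (run $_i)" >&2
    if [ "$_i" -lt "$_count" ]; then
      sleep "$_interval"
    fi
    _i=$((_i + 1))
  done
}
```

For example, `watch_replication 300 12` would log the replication information every five minutes for an hour. Redirect the output to a file or to your log collector to review the lag over time.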

Failure Behavior

PCE

  • The PCE is temporarily unavailable.

  • Users might be unable to log into the PCE web console.

  • The PCE might return an HTTP 502 response and the /node_available API call might return an HTTP 404 error.

  • Other services that are dependent on the failed services might be restarted within the cluster.

  • When the set_server_redis_server service is running on the failed data node, the VENs go into the syncing state and policy is re-computed for each VEN, even when no new policy has been provisioned. The CPU usage on the PCE core nodes might spike and stay at very high levels until policy computation is completed.
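
The HTTP responses listed above can be told apart in a health probe. The following is a minimal sketch under stated assumptions: the function names are hypothetical, treating 200 as the healthy response is an assumption not stated in this section, and the caller must supply the full URL because this section does not give the complete API path.

```shell
# Map an HTTP status code from a health probe to a short description.
# 404 and 502 follow the failure behavior described above;
# treating 200 as healthy is an assumption.
classify_pce_status() {
  case "$1" in
    200) echo "available" ;;
    404) echo "node not available" ;;
    502) echo "PCE unavailable (bad gateway)" ;;
    000) echo "no response" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Probe a URL and print only its HTTP status code.
# curl prints 000 when no HTTP response is received.
probe_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}
```

With the URL of the /node_available call stored in a variable of your choosing, such as NODE_AVAILABLE_URL, the two functions combine as `classify_pce_status "$(probe_status "$NODE_AVAILABLE_URL")"`.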

VENs

  • VENs are not affected and continue to enforce the current policy.

  • When a VEN misses a heartbeat to the PCE, it retries in 5 minutes.

Recovery

  • Recovery type: Automatic. The PCE detects this failure and automatically migrates any required data services to the surviving data node. When the failed node is the primary database, the PCE automatically promotes the replica database to be the new primary database.

  • Recovery procedure: None required.

  • RTO: 5 minutes, with the following caveats for specific PCE services:

    • set_server_redis_server: Additional time is required for all VENs to synchronize. This time is variable based on the number of VENs and complexity of the policy.

  • RPO: Service-specific based on the data services that were running on the failed data node.

    • database_service: Implies the failed data node was the primary database. All data committed to the primary database but not yet replicated to the replica is lost. This is typically less than one second of data.

    • database_slave_service: Implies the failed data node was the replica database. No data is lost.

    • agent_traffic_redis_server: All traffic data is lost.

    • fileserver_service: All asynchronous query requests and Support Reports are lost.

Full Recovery

When the failed data node is recovered or a new node is provisioned, it registers with the PCE and is added as an active member of the cluster. This node is designated as the replica database and replicates all the data from the primary database.

Primary Database Doesn't Start

In this failure case, the database node fails to start.


Stage

Details

Preconditions

The primary database node does not start.

Failure Behavior

The database cannot be started. Therefore, the entire PCE cluster cannot be started.

Full Recovery

Recovery type: Manual. You have two recovery options:

  • Find the root cause of the primary database failure and correct it. Contact Illumio Customer Support for assistance if needed.

  • Promote the replica data node to be the primary data node.

Warning

Promoting a replica to primary risks data loss

Illumio strongly recommends that this option be a last resort because of the potential for data loss.

When the PCE Supercluster is affected by this problem, you must also restore data on the promoted primary database.

Primary Database Doesn't Start When PCE Starts

In this failure case, the database node fails to start when the PCE starts or restarts.

The following recovery information applies only when the PCE starts or restarts. When the PCE is already running and the primary database node fails, database failover will occur normally and automatically, and the replica database node will become the primary node.

Stage

Details

Preconditions

The primary database node does not start during PCE startup. An error on the primary node can cause this issue. It can also occur without any error: for example, if you start the replica node first and are then interrupted, the delay in starting the primary node can exceed the timeout.

Failure Behavior

The database cannot be started. Therefore, the entire PCE cluster cannot be started.

Full Recovery

Recovery type: Manual. You have two recovery options:

  • Find and correct the root cause of the primary database failure. Contact Illumio Customer Support for help if needed.

  • Promote the replica data node to be the primary data node.

Warning

Promoting a replica to primary risks data loss

Consider this option as a last resort because of the potential for data loss, depending on the replication lag.

If you choose the second option, run the following command on the replica database node:

sudo -u ilo-pce illumio-pce-ctl promote-data-node <core-node-ip-address>

This command promotes the node to be the primary database for the cluster whose leader is at the specified IP address.