Illumio Administration Guide 25.4

Data Node Failure

In this failure scenario, one of the data nodes fails completely.

Stage

Details

Preconditions

You should continually monitor the replica database's replication lag to ensure it is in sync with the primary database.

To meet this precondition, monitor the illumio_pce/system_health syslog messages or run the following command on one of the data nodes:

sudo -u ilo-pce illumio-pce-db-management show-replication-info
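One way to make this check continuous is to capture the command's output on a schedule so that you can review how replication behaved before a failure. The following is a minimal sketch, not an Illumio-provided tool. The log path, the `REPL_CMD` and `LOG_FILE` variable names, and the `log_replication_info` function are hypothetical; the sketch does not parse the output because its format is not described here.

```shell
#!/bin/sh
# Sketch: append timestamped replication info to a log file.
# REPL_CMD defaults to the command shown above. LOG_FILE is a
# hypothetical location chosen for illustration only.
REPL_CMD=${REPL_CMD:-"sudo -u ilo-pce illumio-pce-db-management show-replication-info"}
LOG_FILE=${LOG_FILE:-/var/tmp/pce-replication-info.log}

# Run the command once and append its raw output under a UTC timestamp.
# If the command fails, record its exit status instead of stopping.
log_replication_info() {
  {
    printf '=== %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    $REPL_CMD 2>&1 || printf 'command failed with exit status %s\n' "$?"
  } >> "$LOG_FILE"
}
```

You could call `log_replication_info` from a scheduler such as cron, or from a loop like `while true; do log_replication_info; sleep 60; done`. The 60-second interval is an arbitrary example value.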

Failure Behavior

PCE

  • The PCE is temporarily unavailable.

  • Users may be unable to log in to the PCE web console.

  • The PCE might return an HTTP 502 response, and the /node_available API call might return an HTTP 404 error.

  • Other services that depend on the failed services may be restarted within the cluster.

  • When the set_server_redis_server service is running on the failed data node, the VENs go into the syncing state, and the policy is re-computed for each VEN, even when no new policy has been provisioned. The CPU usage on the PCE core nodes might spike and stay at very high levels until policy computation is completed.
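The HTTP responses described above can be observed from outside the cluster with a simple probe. The following is a hedged sketch. The host name, port, and `/api/v2` prefix in `PCE_URL` are assumptions for illustration, as is the mapping of a 200 response to "available"; `classify_status` and `probe_node` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: probe the node availability endpoint and classify the result.
# PCE_URL is a placeholder; replace it with the address of your own PCE.
PCE_URL=${PCE_URL:-"https://pce.example.com:8443/api/v2/node_available"}

# Map an HTTP status code to a coarse state.
# 404 and 502 are the failure responses named in this section.
# Treating 200 as healthy is an assumption of this sketch.
classify_status() {
  case "$1" in
    200)     echo "available" ;;
    404|502) echo "unavailable" ;;
    000)     echo "unreachable" ;;
    *)       echo "unknown" ;;
  esac
}

# Request the endpoint and keep only the status code.
# curl reports 000 when it receives no HTTP response at all.
probe_node() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$PCE_URL")
  classify_status "$code"
}
```

Running `probe_node` during a data node failure would be expected to print `unavailable` while the PCE is recovering, under the assumptions stated above.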

VENs

  • VENs are not affected, and the current policy continues to be enforced.

  • When a VEN misses a heartbeat to the PCE, it retries in 5 minutes.

Recovery

  • Recovery type: Automatic. The PCE detects this failure and automatically migrates any required data services to the surviving data node. When the failed node is the primary database, the PCE automatically promotes the replica database to be the new primary database.

  • Recovery procedure: None required.

  • RTO: 5 minutes, with the following caveats for specific PCE services:

    • set_server_redis_server: All VENs must resynchronize, which takes additional time that varies with the number of VENs and the complexity of the policy.

  • RPO: Service-specific, based on the data services running on the failed data node.

    • database_service: The failed data node was the primary database. Any data committed to the primary database but not yet replicated to the replica is lost; this is typically less than one second of data.

    • database_slave_service: The failed data node was the replica database. No data is lost.

    • agent_traffic_redis_server: All traffic data is lost.

    • fileserver_service: All asynchronous query requests and Support Reports are lost.

Full Recovery

When the failed data node is recovered or a new node is provisioned, it registers with the PCE and is added as an active member of the cluster. This node is designated as the replica database and replicates all the data from the primary database.

Primary Database Doesn't Start

In this failure case, the primary database node fails to start.

Stage

Details

Preconditions

The primary database node does not start.

Failure Behavior

The database cannot be started. Therefore, the entire PCE cluster cannot be started.

Full Recovery

Recovery type: Manual. You have two recovery options:

  • Find and correct the root cause of the primary database failure. If necessary, contact Illumio Customer Support for assistance.

  • Promote the replica data node to be the primary data node.

Warning

Promoting a replica to primary risks data loss

Illumio strongly recommends that this option be a last resort because of the potential for data loss.

When this problem affects the PCE Supercluster, you must also restore data on the promoted primary database.

Primary Database Doesn't Start When PCE Starts

In this failure case, the primary database node fails to start when the PCE starts or restarts.

The following recovery information applies only when the PCE starts or restarts. When the PCE is already running and the primary database node fails, database failover will occur normally and automatically, and the replica database node will become the primary node.

Stage

Details

Preconditions

The primary database node does not start during PCE startup. This can happen because of an error on the primary node. It can also happen without any error: for example, if you start the replica node first and are then interrupted, the delay in starting the primary node can exceed the timeout.

Failure Behavior

The database cannot be started. Therefore, the entire PCE cluster cannot be started.

Full Recovery

Recovery type: Manual. You have two recovery options:

  • Find and correct the root cause of the primary database failure. If needed, contact Illumio Customer Support for help.

  • Promote the replica data node to be the primary data node.

Warning

Promoting a replica to primary risks data loss

Consider this option as a last resort because, depending on the replication lag, it could result in data loss.

If you choose the second option, run the following command on the replica database node:

sudo -u ilo-pce illumio-pce-ctl promote-data-node <core-node-ip-address>

This command promotes the node to be the primary database for the cluster whose leader is at the specified IP address.
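Because promotion can lose data, you might wrap the command in a guard that checks the address and asks for explicit confirmation. The following is a minimal sketch under stated assumptions: `is_ipv4` and `promote_replica` are hypothetical helpers, the guard accepts IPv4 addresses only, and the command is invoked with the `sudo -u ilo-pce` form used elsewhere in this guide.

```shell
#!/bin/sh
# Sketch: guard the promotion command with validation and confirmation.

# Return 0 if $1 is a dotted-quad IPv4 address with octets 0-255.
is_ipv4() {
  echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Validate the core node address, ask for confirmation, then promote.
# Returns 2 for a bad address and 1 if the operator declines.
promote_replica() {
  leader_ip=$1
  if ! is_ipv4 "$leader_ip"; then
    echo "error: '$leader_ip' is not a valid IPv4 address" >&2
    return 2
  fi
  printf 'Promote this replica for the cluster led by %s? Type "promote": ' "$leader_ip"
  read -r answer
  if [ "$answer" != "promote" ]; then
    echo "aborted"
    return 1
  fi
  sudo -u ilo-pce illumio-pce-ctl promote-data-node "$leader_ip"
}
```

For example, `promote_replica 10.0.0.5` (a placeholder address) would prompt before running the command on the replica database node. Requiring a typed word rather than a single keypress is a design choice intended to reduce accidental promotions.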