
Illumio Core 25.1 Administration Guide

Core Node Failure

In this failure case, one of the core nodes fails completely. This situation occurs whenever a node cannot communicate with any of the other nodes in the cluster; for example, when a node is destroyed, the node's SDS fails, or the node is powered off or disconnected from the cluster.

[Figure: pce-ha-dr-1.png]

Stage

Details

Preconditions

The load balancer must be able to run application-level health checks on each core node in the PCE cluster so that it always knows whether a node is available.
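The following Python sketch illustrates the kind of application-level health check described above. It is not Illumio's implementation or a load balancer configuration. It assumes that a healthy core node answers the /node_available call with HTTP 200 and that any other status, or a connection failure, means the node is unavailable. The function names, the port 8443, and the exact URL layout are assumptions made for illustration; substitute the values used in your deployment.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_probe(node: str, port: int = 8443, timeout: float = 5.0) -> int:
    """Return the HTTP status of the node's /node_available call.

    The port and URL layout are assumptions for this sketch.
    """
    url = f"https://{node}:{port}/node_available"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (for example 404 or 502) arrive as HTTPError.
        return err.code


def check_core_nodes(
    nodes: Iterable[str],
    probe: Callable[[str], int] = http_probe,
) -> Dict[str, bool]:
    """Map each core node to True (available) or False (unavailable)."""
    results: Dict[str, bool] = {}
    for node in nodes:
        try:
            status = probe(node)
        except OSError:
            # Connection refused, timeout, DNS failure: treat as unavailable.
            results[node] = False
            continue
        # Assumption: only HTTP 200 counts as available.
        results[node] = status == 200
    return results
```

For example, `check_core_nodes(["core0.example.com", "core1.example.com"])` would return a dictionary that a monitoring script could use to decide which core nodes should receive traffic. The hostnames here are placeholders.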

Important

When you use a DNS load balancer and need to provision a new core node to recover from this failure, the cluster_public_ips parameter in the runtime_env.yml file must include the IP addresses of your existing core nodes and the IP addresses of the replacement nodes. If this parameter is not configured correctly, VENs will not have outbound rules programmed to allow them to connect to the IP address of the replacement node. Illumio recommends that you preallocate these IP addresses so that, in the event of a failure, you can restore the cluster and the VENs can communicate with the replacement node.
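As a pre-failure sanity check, a short script can confirm that the addresses listed under cluster_public_ips cover both the existing core nodes and the preallocated replacement addresses. The Python sketch below is a hypothetical helper, not an Illumio tool. It assumes you have already read the configured addresses out of runtime_env.yml into a list, and it does not parse the file itself.

```python
import ipaddress
from typing import Iterable, List


def missing_public_ips(
    configured: Iterable[str],
    existing_nodes: Iterable[str],
    replacement_nodes: Iterable[str],
) -> List[str]:
    """Return the required IP addresses that are absent from the configured list.

    `configured` is assumed to hold the addresses already listed under
    cluster_public_ips; loading them from runtime_env.yml is left to the caller.
    """
    # Parse addresses so that equivalent spellings compare equal.
    have = {ipaddress.ip_address(ip) for ip in configured}
    missing: List[str] = []
    for ip in [*existing_nodes, *replacement_nodes]:
        if ipaddress.ip_address(ip) not in have and ip not in missing:
            missing.append(ip)
    return missing
```

An empty result means every existing and replacement address is covered. Any returned address would need to be added to cluster_public_ips before VENs could be programmed with outbound rules for it.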

Failure Behavior

PCE

  • The PCE is temporarily unavailable.

  • Users might be unable to log in to the PCE web console.

  • The PCE might return an HTTP 502 response and the /node_available API call might return an HTTP 404 error.

  • Other services in the cluster that depend on the failed services might be restarted.

VENs

  • VENs are not affected.

  • VENs continue to enforce the current policy.

  • When a VEN misses a heartbeat to the PCE, it retries in 5 minutes.

Recovery

  • Recovery type: Automatic. The cluster has multiple active core nodes for redundancy.

  • Recovery procedure: None required.

  • RTO: 5 minutes.

  • RPO: Zero. No data loss occurs because the core nodes are stateless.

Full Recovery

Either recover the failed node or provision a new node and join it to the cluster.