Illumio Core 25.1 Administration Guide

PCE Replication and Failover

To increase reliability, you can set up replication and failover for PCEs. Having a PCE on "warm standby," ready to take over if the active PCE fails, contributes to a resilient disaster recovery (DR) plan.

For PCE replication and failover, set up PCEs in pairs. Each pair consists of an active PCE and a standby PCE. A combination of continuous real-time replication and periodic synchronization keeps the standby PCE's data up to date with the active PCE. If the active PCE fails, the standby PCE can take over and become the new active PCE.

The data from the following services are replicated:

  • database_service

  • citus_coordinator_service

  • reporting_database_service

  • agent_traffic_redis_server

  • data_job_queue_redis_service

  • fileserver

Standby PCE Prerequisites

Warning

Active Standby assumes the same certificate is used for all nodes of the cluster. You cannot use a unique certificate per Core node.

Warning

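The ILO_ACTIVE_PCE_USER_NAME and ILO_ACTIVE_PCE_USER_PASSWORD variables (see Set Up a Standby PCE) must be set as the ilo-pce user. Alternatively, run the setup command with sudo -E -u ilo-pce so that the variables set in your own shell are preserved.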

Before designating a standby PCE, perform the following preparation steps.

Set Up Two PCEs

Install PCE software on two machines or find two machines where it is already installed. Be sure the following are true:

  • Hardware configuration and capacity are as near identical as possible on the two PCEs.

  • PCE software version is the same on both PCEs.

Reset Any Repurposed PCE

If you are repurposing an existing PCE to be the standby, be sure the existing PCE is completely reset.

  1. On all nodes of the existing PCE, run the following command to reset the PCE:

    sudo -u ilo-pce illumio-pce-ctl reset
  2. On all nodes of the existing PCE, run the following command to start the PCE and set it to runlevel 1:

    sudo -u ilo-pce illumio-pce-ctl start --runlevel 1
  3. On any one data node of the existing PCE, run the following command to set up the database:

    sudo -u ilo-pce illumio-pce-db-management setup

Open Ports Between Active and Standby PCEs

Be sure the required ports are open on both PCEs so that replication traffic can flow between the active PCE and the standby PCE. The same service ports must be open on the standby PCE as on the active PCE. For a list of the required ports, see Port Ranges for Cluster Communication in the PCE Installation and Upgrade Guide.

Set Up FQDNs

Set up the FQDNs that are required when using active and standby PCEs:

  • FQDN of the active PCE.

  • FQDN of the standby PCE.

  • FQDN of the front-end load balancer.

  • In the runtime_env.yml file, active_standby_replication:active_pce_fqdn is always the FQDN of the currently active PCE.

Add active_standby_replication:active_pce_fqdn to the runtime_env.yml file on both PCEs, active and standby. Example:

pce_fqdn: FQDN of the active PCE
 
active_standby_replication:
  active_pce_fqdn: active-pce-fqdn.com

Warning

Whether the PCE runs in a standalone or active-standby mode, never remove the setting active_pce_fqdn from runtime_env.yml. VENs are paired using this FQDN. Removing this entry will break VEN communications.

There are two options for setting up these FQDNs.

Option 1: Use a new FQDN for active_standby_replication:active_pce_fqdn.

You can use an FQDN that is not currently assigned to either the active PCE or the standby PCE. Use this option if you do not want to update the FQDN of the currently active PCE. The FQDN assigned to active_pce_fqdn should resolve to the currently active PCE. For example:

Existing Setup
  Active PCE:
    pce_fqdn: active-pce.com

  Standby PCE:
    pce_fqdn: standby-pce.com
 
Before Standby is Set Up

  Active PCE:
    pce_fqdn: active-pce.com
    active_standby_replication:
      active_pce_fqdn: active-pce-global.com

  Standby PCE:
    pce_fqdn: standby-pce.com
    active_standby_replication:
      active_pce_fqdn: active-pce-global.com

The active_pce_fqdn always contains the FQDN of the PCE that is currently active in the active-standby pair. When a standby PCE is set up, the VEN master configuration is updated if needed so that it contains the active_pce_fqdn FQDN. After the standby PCE is set up, VENs paired to the active PCE contain the active_pce_fqdn in their master configuration. If the standby PCE is promoted, reconfigure the load balancer or GTM so that active_pce_fqdn resolves to the promoted (new active) PCE.

Option 2: Use the FQDN of the active PCE for active_standby_replication:active_pce_fqdn.

You might have scripts that use the pce_fqdn of the active PCE. In this case, it is easier to set active_pce_fqdn to the same value. Before you set up the standby PCE, change the pce_fqdn of the active PCE to something other than the active_pce_fqdn.

If necessary, reconfigure your load balancer or global traffic manager (GTM) so that active_pce_fqdn and the new pce_fqdn of the active PCE resolve to the active PCE. For example:

Existing Setup

  Active PCE:
    pce_fqdn: active-pce.com

  Standby PCE:
    pce_fqdn: standby-pce.com
 
Before Standby is Set Up
 
  Active PCE:
    pce_fqdn: active-pce-updated.com
    active_standby_replication:
      active_pce_fqdn: active-pce.com
   
  Standby PCE:
    pce_fqdn: standby-pce.com
    active_standby_replication:
      active_pce_fqdn: active-pce.com
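
Because VEN communication depends on this setting, it can be worth confirming that it is present on both PCEs before continuing. The following is a minimal shell sketch, not an Illumio-provided tool. The function name get_active_pce_fqdn is hypothetical, the path to runtime_env.yml is passed in as an argument (supply the location used on your nodes), and the parsing assumes the simple two-level layout shown in the examples above.

```shell
# Hypothetical helper: print the value of
# active_standby_replication:active_pce_fqdn from the file given as $1.
# Returns non-zero if the entry is missing or empty.
# Assumes the key is indented under a top-level active_standby_replication line.
get_active_pce_fqdn() {
  awk '
    /^active_standby_replication:[ \t]*$/ { in_section = 1; next }
    /^[^ \t#]/                            { in_section = 0 }
    in_section && /^[ \t]+active_pce_fqdn:/ {
      sub(/^[ \t]+active_pce_fqdn:[ \t]*/, "")
      if ($0 != "") { print; found = 1 }
      exit
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

For example, running get_active_pce_fqdn with the path to runtime_env.yml on each PCE should print the same FQDN on both; a non-zero exit status indicates that the entry is missing.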

(Optional) Set DNS TTL Value

The DNS TTL (time to live) setting affects how long it takes for a new active PCE to take over in a failover situation. Consider adjusting the DNS TTL to avoid any delay. A shorter value, such as 30 minutes, is recommended.

Set Up PCE Certificates

The SSL certificate must include all three FQDNs that are described in Set Up FQDNs.
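
One way to check this requirement is to compare the certificate's subject alternative names against the three FQDNs. The sketch below is an illustration under stated assumptions: san_covers_fqdns is a hypothetical helper name, the certificate path and FQDNs in the usage comment are placeholders, and the match is a literal comparison that does not expand wildcard entries. The subjectAltName text is expected in the form printed by openssl x509 with the -ext subjectAltName option (OpenSSL 1.1.1 or later).

```shell
# Hypothetical helper: return 0 only if every FQDN listed after the first
# argument appears as a DNS entry in the subjectAltName text passed as $1.
# Literal, case-insensitive match; wildcard entries are not expanded.
san_covers_fqdns() {
  san_text=$1
  shift
  for fqdn in "$@"; do
    printf '%s\n' "$san_text" | tr ',' '\n' | tr -d ' \t' \
      | grep -Fxqi -- "DNS:$fqdn" || return 1
  done
  return 0
}

# Usage (certificate path and FQDNs are placeholders):
#   san=$(openssl x509 -in /path/to/cert.pem -noout -ext subjectAltName)
#   san_covers_fqdns "$san" active-pce.com standby-pce.com lb.example.com
```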

Set Up VEN Library

The PCE acts as a repository for distributing, installing and upgrading the VEN software. Install or update the VEN library on both the active and standby PCEs. See the VEN Installation and Upgrade Guide.

Note

Be sure the VEN versions in the library are supported by the PCE version that is installed.

Set Up a Standby PCE

To set up a standby PCE and associate it with its active PCE partner, use the following steps.

  1. Complete the prerequisite steps in Standby PCE Prerequisites.

  2. On the active PCE, generate an API key. This API key is used only while setting up the standby PCE.

  3. Bring the standby PCE to runlevel 2. On any node of the standby PCE, run the following command:

    sudo -u ilo-pce illumio-pce-ctl set-runlevel 2

    The active PCE can remain at runlevel 5.

  4. On the standby PCE, run the following commands to set up authentication. In username, give the active PCE's API key authentication username. In secret, give the API key secret.

    $ export ILO_ACTIVE_PCE_USER_NAME=username
    $ export ILO_ACTIVE_PCE_USER_PASSWORD=secret
  5. Link the standby PCE to its active PCE. On the standby PCE, run the following command. For active_pce_fqdn:front_end_management_https_port, give the FQDN and port of the current active PCE. The value in --active-pce is not the same as active_pce_fqdn in the configuration file runtime_env.yml.

    sudo -u ilo-pce --preserve-env illumio-pce-ctl setup-standby-pce \
      --active-pce active_pce_fqdn:front_end_management_https_port

    Warning

    Do not bring the standby PCE to runlevel 5.

  6. After replication is set up for the first time, the status of some services, such as the citus_coordinator_service, might be NOT RUNNING for a long time, and the cluster status can remain in PARTIAL. This is usually because the service is performing a database backup, which can take time depending on network latency, disk IOPS, traffic flow, and traffic data size. To check whether the backup process is running, use the following command:

    ps -ef | grep pg

    Example output:

    pce      84742 73150 18 16:25 ?        00:04:42 
      /var/illumio_pce/external/bin/pg_basebackup -d host=10.31.2.172
      port=5532 -D /var/traff_dir/traffic_datastore -v -P -X stream -c fast
    pce      84747 84742  7 16:25 ?       00:01:54 
      /var/illumio_pce/external/bin/pg_basebackup -d host=10.31.2.172 
      port=5532 -D /var/traff_dir/traffic_datastore -v -P -X stream -c fast

    Warning

    If the citus coordinator service is busy with a backup, do not restart services yet. Wait until this operation is complete and the service status changes to RUNNING.

  7. Restart services on the active PCE. On any node of the active PCE, run the following command:

    sudo -u ilo-pce illumio-pce-ctl cluster-restart

For example:

$ export ILO_ACTIVE_PCE_USER_NAME=api_17abrwerwe
$ export ILO_ACTIVE_PCE_USER_PASSWORD=6efefeafe34ewrooppll494934kdf
sudo -u ilo-pce --preserve-env illumio-pce-ctl setup-standby-pce \
  --active-pce active.pce.com:8443
sudo -u ilo-pce illumio-pce-ctl cluster-restart
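
Two points in the procedure above lend themselves to small guard functions: confirming that the API key variables are set before linking the standby PCE, and waiting for the initial base backup to finish before restarting services. The following is a sketch only. Both function names are hypothetical, the wait loop assumes the backup runs as a process named pg_basebackup (as in the example output above) and that pgrep is available, and it does not replace checking that the service status has changed to RUNNING.

```shell
# Hypothetical helper: fail early if the API key variables from step 4 are
# not set in the current shell.
check_active_pce_credentials() {
  if [ -z "${ILO_ACTIVE_PCE_USER_NAME:-}" ] || [ -z "${ILO_ACTIVE_PCE_USER_PASSWORD:-}" ]; then
    echo "ILO_ACTIVE_PCE_USER_NAME and ILO_ACTIVE_PCE_USER_PASSWORD must both be set" >&2
    return 1
  fi
}

# Hypothetical helper: poll until no process named pg_basebackup is running.
# $1 = seconds between polls, $2 = maximum number of polls before giving up.
# Returns 0 when no backup process is found, 1 if the limit is reached.
wait_for_basebackup() {
  interval=$1
  max_polls=$2
  polls=0
  while pgrep -x pg_basebackup >/dev/null 2>&1; do
    polls=$((polls + 1))
    if [ "$polls" -ge "$max_polls" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

A typical use would be to call check_active_pce_credentials just before the setup-standby-pce command, and wait_for_basebackup 60 120 on the standby data node before running cluster-restart on the active PCE.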

Failover to Standby PCE

This section tells how to perform a PCE failover for disaster recovery (DR). The active PCE has failed, and you need to promote the standby PCE so it can take over as the active PCE. Follow these steps.

  1. Check to be sure the PCE you are about to promote is actually a standby PCE and that it is at runlevel 2.

    sudo -u ilo-pce illumio-pce-ctl active-standby?

    The output should say "standby."

  2. Check to be sure the active PCE has failed and is offline. There must not be any data replicating to the standby PCE. On every node of the active PCE, run the following command:

    sudo -u ilo-pce illumio-pce-ctl cluster-status

    The output should contain STOPPED. Be sure to repeat this command on every node of the PCE.

  3. On the standby PCE, run the following command to promote the standby PCE.

    sudo -u ilo-pce illumio-pce-ctl promote-standby-pce

    When the active PCE is down, this command promotes this PCE to be the new active PCE. If the active PCE is not down, the standby PCE is not promoted, and a message like "Active PCE is still reachable" is generated.

  4. Make sure that DNS resolves the active PCE FQDN to the new active PCE so that devices in your network can find it. For example, reconfigure the PCE FQDN on load balancers; the steps depend on your devices and configuration. Also make sure that active_standby_replication:active_pce_fqdn in the configuration file runtime_env.yml resolves to the former standby (new active) PCE. For more information about the PCE FQDN, see Standby PCE Prerequisites.

  5. Check the VEN synchronization status by running the following command:

    sudo -u ilo-pce illumio-pce-ctl promote-standby-check

    Run the command repeatedly and watch the output to make sure the VEN sync count increases. This indicates that the DNS change is in effect and the new active PCE has been promoted successfully.

    The DNS update for the new PCE FQDN can take some time, depending on the DNS TTL value.

  6. When you are ready, connect a new standby PCE to the new active PCE. Repeat the steps in Standby PCE Prerequisites and Set Up a Standby PCE.
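
Step 5 asks you to run the check repeatedly and compare the output by eye. A small polling helper can make that less tedious. The sketch below is generic and makes no assumption about the output format of promote-standby-check; repeat_command is a hypothetical name, and the interval and count in the usage comment are arbitrary examples.

```shell
# Hypothetical helper: run a command a fixed number of times with a pause
# between runs, so that successive outputs can be compared.
# $1 = seconds between runs, $2 = number of runs, remaining args = command.
repeat_command() {
  interval=$1
  count=$2
  shift 2
  i=0
  while [ "$i" -lt "$count" ]; do
    # Keep going even if one run fails, but report the failure.
    "$@" || echo "command exited with status $?" >&2
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
  return 0
}

# Usage (interval and count are examples only):
#   repeat_command 30 10 sudo -u ilo-pce illumio-pce-ctl promote-standby-check
```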

Monitoring Replication

In the Health page of the PCE web console, use the Standby Replication tab to monitor replication between the active PCE and standby PCE. The Standby Replication tab shows the replication lag on the active and standby PCEs for the traffic database, the policy database, the reporting database, the job queue redis, and traffic data redis. (The fileserver is not shown.)

[Figure: Standby Replication tab on the Health page]

Another way that the PCE administrator can monitor replication is by watching the service discovery log for "WAL segment is missing" errors. This error can appear when the standby traffic database service cannot keep up with synchronization from the active traffic database service. When this error occurs, the log looks like the following:

2022-06-30T15:43:19.556560+00:00 level=warning host=db0-4x2systest50 
ip=127.0.0.1 program=illumio_pce/service_discovery| sec=603799.555 
sev=ERROR pid=12416 tid=2440 rid=0 [citus_coordinator_service] 
Health Check: WAL segment 105/2B95FD98 is missing. Full base backup marker 
file set.

When this situation arises, the citus_coordinator_service restarts and performs the full database backup again. Network latency, disk IOPS, traffic flow, and traffic data size all affect replication latency. If you experience this issue, make any improvements you can to these factors.
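
To see how often this error has occurred, you can count matching lines in the service discovery log. The following is a sketch: count_wal_missing is a hypothetical helper name, the log file location is not specified here and is passed in as an argument, and the pattern is based only on the sample message shown above (assuming each message occupies a single line in the actual log).

```shell
# Hypothetical helper: print the number of "WAL segment ... is missing"
# messages in the log file given as $1.
# Note: grep exits with status 1 when the count is 0, but still prints 0.
count_wal_missing() {
  grep -c -E 'WAL segment [0-9A-Fa-f]+/[0-9A-Fa-f]+ is missing' "$1"
}
```

A count that keeps growing between checks suggests that the standby is repeatedly falling behind and restarting the full base backup.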

For example, you can increase the value of the wal_keep_segments setting in the traffic_datastore section of the runtime_env.yml configuration file. Increasing this value costs additional disk space. Each WAL segment is 16 MB, so 5120 WAL segments would use about 82 GB of extra space.

traffic_datastore:
   wal_keep_segments: 5120
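
The disk space figure follows directly from the segment size: 5120 segments at 16 MB each is 81,920 MB, or about 82 GB. To estimate the cost of a different value, the same arithmetic can be scripted. This is a minimal sketch; wal_keep_space_mb is a hypothetical helper name, and the 16 MB segment size is taken from the text above.

```shell
# Hypothetical helper: print the approximate extra disk space, in MB, used
# by keeping the given number of WAL segments at 16 MB per segment.
wal_keep_space_mb() {
  segments=$1
  segment_mb=16
  echo $((segments * segment_mb))
}

# wal_keep_space_mb 5120 prints 81920
```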

Limitations and Constraints

When using active and standby PCEs for replication, be aware of the following limitations and constraints:

  • Fileserver replication lag is not shown in the Standby Replication tab of the Health page.

  • Support reports are replicated, but support bundles are not replicated.

  • In an active-standby PCE pair, it is not necessary to perform database backups in the same way you would with a standalone PCE. However, if you wish to do so, take the backups from the active PCE. It is also not normally necessary to restore a database backup on the active PCE or the standby PCE. If one of the PCEs fails, the other takes over as active PCE, and it already has an up-to-date copy of the data because of the ongoing replication between the two PCEs.

    Warning

    If it becomes necessary to restore data from a backup (for example, if both PCEs fail), you must restore the same backup to both the active PCE and the standby PCE.