PCE Replication and Failover
To increase reliability, you can set up replication and failover for PCEs. Having a PCE on "warm standby," ready to take over if the active PCE fails, contributes to a resilient disaster recovery (DR) plan.
For PCE replication and failover, set up PCEs in pairs. Each pair consists of an active PCE and a standby PCE. A combination of continuous real-time replication and periodic synchronization keeps the standby PCE's data up to date with the active PCE. If the active PCE fails, the standby PCE can take over and become the new active PCE.
The data from the following services are replicated:
database_service
citus_coordinator_service
reporting_database_service
agent_traffic_redis_server
data_job_queue_redis_service
fileserver
Standby PCE Prerequisites
Warning
Active Standby assumes the same certificate is used for all nodes of the cluster. You cannot use a unique certificate per Core node.
Warning
The user name and secret environment variables must be set as the ilo-pce user. Alternatively, run the command with sudo -E -u ilo-pce so that the variables are preserved.
Before designating a standby PCE, perform the following preparation steps.
Set Up Two PCEs
Install PCE software on two machines or find two machines where it is already installed. Be sure the following are true:
Hardware configuration and capacity are as near identical as possible on the two PCEs.
PCE software version is the same on both PCEs.
Reset Any Repurposed PCE
If you are repurposing an existing PCE to be the standby, be sure the existing PCE is completely reset.
On all nodes of the existing PCE, run the following command to reset the PCE:
sudo -u ilo-pce illumio-pce-ctl reset
On all nodes of the existing PCE, run the following command to start the PCE and set it to runlevel 1:
sudo -u ilo-pce illumio-pce-ctl start --runlevel 1
On any one data node of the existing PCE, run the following command to set up the database:
sudo -u ilo-pce illumio-pce-db-management setup
Open Ports Between Active and Standby PCEs
Be sure the required ports are open on both PCEs to allow network traffic between the active PCE and the standby PCE so data replication can occur. Make sure that all the same service ports are opened on the standby PCE and the active PCE. For a list of the required ports, see Port Ranges for Cluster Communication in PCE Installation and Upgrade Guide.
Set Up FQDNs
Set up the FQDNs that are required when using active and standby PCEs:
FQDN of the active PCE.
FQDN of the standby PCE.
FQDN of the front-end load balancer.
In the runtime_env.yml file, active_standby_replication:active_pce_fqdn is always the FQDN of the currently active PCE.
Add active_standby_replication:active_pce_fqdn to the runtime_env.yml file on both PCEs, active and standby. Example:
pce_fqdn: FQDN of the active PCE
active_standby_replication:
  active_pce_fqdn: active-pce-fqdn.com
Warning
Whether the PCE runs in standalone or active-standby mode, never remove the active_pce_fqdn setting from runtime_env.yml. VENs are paired using this FQDN. Removing this entry will break VEN communications.
There are two options for setting up these FQDNs.
Option 1: Use a new FQDN for active_standby_replication:active_pce_fqdn.
You can use an FQDN that is not currently assigned to either the active PCE or the standby PCE. Use this option if you do not want to update the FQDN of the currently active PCE. The FQDN assigned to active_pce_fqdn should resolve to the currently active PCE. For example:
Existing Setup
Active PCE:
  pce_fqdn: active-pce.com
Standby PCE:
  pce_fqdn: standby-pce.com

Before Standby is Set Up
Active PCE:
  pce_fqdn: active-pce.com
  active_standby_replication:
    active_pce_fqdn: active-pce-global.com
Standby PCE:
  pce_fqdn: standby-pce.com
  active_standby_replication:
    active_pce_fqdn: active-pce-global.com
The active_pce_fqdn always contains the FQDN of the PCE that is currently active in the active-standby pair. When a standby PCE is set up, the VEN master configuration is updated if needed so that it contains the active_pce_fqdn FQDN. After the standby PCE is set up, VENs paired to the active PCE contain the active_pce_fqdn in their master configuration. If the standby PCE is promoted, reconfigure the load balancer or GTM so that active_pce_fqdn resolves to the promoted (new active) PCE.
Option 2: Use the FQDN of the active PCE for active_standby_replication:active_pce_fqdn.
You might have scripts that use the pce_fqdn of the active PCE. In this case, it is easier to set active_pce_fqdn to the same value. Before you set up the standby PCE, change the pce_fqdn of the active PCE to something other than the active_pce_fqdn.
If necessary, reconfigure your load balancer or global traffic manager (GTM) so that active_pce_fqdn and the new pce_fqdn of the active PCE resolve to the active PCE. For example:
Existing Setup
Active PCE:
  pce_fqdn: active-pce.com
Standby PCE:
  pce_fqdn: standby-pce.com

Before Standby is Set Up
Active PCE:
  pce_fqdn: active-pce-updated.com
  active_standby_replication:
    active_pce_fqdn: active-pce.com
Standby PCE:
  pce_fqdn: standby-pce.com
  active_standby_replication:
    active_pce_fqdn: active-pce.com
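Because removing active_pce_fqdn breaks VEN communications, it can be worth confirming that the setting is present on every node after editing the file. The following is a minimal sketch, not an Illumio-provided tool. The function name get_active_pce_fqdn is hypothetical, the file path in the usage comment is an assumption, and the parser assumes the simple two-level layout shown in the examples in this section.

```shell
# Print the value of active_standby_replication:active_pce_fqdn from the
# runtime_env.yml file given as $1. Prints nothing if the key is absent.
# Assumes the key is indented under a top-level active_standby_replication block.
get_active_pce_fqdn() {
  awk '
    /^active_standby_replication:/        { in_block = 1; next }
    in_block && /^[^ #]/                  { in_block = 0 }   # next top-level key ends the block
    in_block && $1 == "active_pce_fqdn:"  { print $2; exit }
  ' "$1"
}

# Usage (the path is an assumption; use the location of your own file):
#   fqdn=$(get_active_pce_fqdn /etc/illumio-pce/runtime_env.yml)
#   [ -n "$fqdn" ] || echo "active_pce_fqdn is missing" >&2
```

Running the same check on both the active and the standby PCE, and comparing the two values, helps catch a mismatch before the standby is linked.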
(Optional) Set DNS TTL Value
The DNS TTL (time to live) setting affects how long it takes for a new active PCE to take over in a failover situation. Consider adjusting the DNS TTL to avoid any delay. A shorter value, such as 30 minutes, is recommended.
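To see the TTL that resolvers currently receive for the active PCE FQDN, you can inspect the answer section of a DNS query. The sketch below assumes the dig utility is available and that each answer line has the usual form of name, TTL, class, type, and value. The function name min_ttl is hypothetical.

```shell
# Read `dig +noall +answer` output on stdin and print the smallest TTL (seconds).
# The TTL is the second whitespace-separated field of each answer line.
min_ttl() {
  awk 'NF >= 5 { if (min == "" || $2 + 0 < min) min = $2 + 0 }
       END     { if (min != "") print min }'
}

# Usage (substitute your own active_pce_fqdn value):
#   dig +noall +answer active-pce-global.com | min_ttl
```

A caching resolver reports the remaining TTL rather than the configured one, so query the authoritative name server if you need the configured value.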
Set Up PCE Certificates
The SSL certificate must include all three FQDNs that are described in Set Up FQDNs.
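One way to confirm that a certificate covers all three FQDNs is to compare its subject alternative names (SANs) with the names you expect. The following sketch assumes OpenSSL 1.1.1 or later (for the -ext option). The function name missing_sans and the certificate path are hypothetical, and the check does exact matching only, so it does not account for wildcard entries.

```shell
# Read SAN text on stdin (as printed by `openssl x509 -noout -ext subjectAltName`)
# and print every FQDN given as an argument that is not listed as a DNS entry.
missing_sans() {
  san_text=$(cat)
  for name in "$@"; do
    printf '%s\n' "$san_text" | tr ',' '\n' | sed 's/^[[:space:]]*//' \
      | grep -Fxq "DNS:$name" || printf '%s\n' "$name"
  done
}

# Usage (certificate path and FQDNs are placeholders):
#   openssl x509 -in /path/to/server.crt -noout -ext subjectAltName \
#     | missing_sans active-pce.com standby-pce.com active-pce-global.com
```

Empty output means every listed FQDN appears in the certificate.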
Set Up VEN Library
The PCE acts as a repository for distributing, installing and upgrading the VEN software. Install or update the VEN library on both the active and standby PCEs. See the VEN Installation and Upgrade Guide.
Note
Be sure the VEN versions in the library are supported by the PCE version that is installed.
Set Up a Standby PCE
To set up a standby PCE and associate it with its active PCE partner, use the following steps.
Complete the prerequisite steps in Standby PCE Prerequisites.
On the active PCE, generate an API key. This API key is used only while setting up the standby PCE.
Bring the standby PCE to runlevel 2. On any node of the standby PCE, run the following command:
sudo -u ilo-pce illumio-pce-ctl set-runlevel 2
The active PCE can remain at runlevel 5.
On the standby PCE, run the following commands to set up authentication. In username, give the active PCE's API key authentication username. In secret, give the API key secret.
export ILO_ACTIVE_PCE_USER_NAME=username
export ILO_ACTIVE_PCE_USER_PASSWORD=secret
Link the standby PCE to its active PCE. On the standby PCE, run the following command. For active_pce_fqdn:front_end_management_https_port, give the FQDN and port of the current active PCE. The value in --active-pce is not the same as active_pce_fqdn in the configuration file runtime_env.yml.
sudo -u ilo-pce --preserve-env illumio-pce-ctl setup-standby-pce --active-pce active_pce_fqdn:front_end_management_https_port
Warning
Do not bring the standby PCE to runlevel 5.
After replication is set up for the first time, the status of some services, such as the citus_coordinator_service, might be NOT RUNNING for a long time, and the cluster status might remain PARTIAL. This is usually because the service is performing a database backup, which can take time depending on network latency, disk IOPS, traffic flow, and traffic data size. To check whether the backup process is running, use the following command:
ps -ef | grep pg
Example output:
pce 84742 73150 18 16:25 ? 00:04:42 /var/illumio_pce/external/bin/pg_basebackup -d host=10.31.2.172 port=5532 -D /var/traff_dir/traffic_datastore -v -P -X stream -c fast
pce 84747 84742 7 16:25 ? 00:01:54 /var/illumio_pce/external/bin/pg_basebackup -d host=10.31.2.172 port=5532 -D /var/traff_dir/traffic_datastore -v -P -X stream -c fast
Warning
If the citus coordinator service is busy with a backup, do not restart services yet. Wait until this operation is complete and the service status changes to RUNNING.
Restart services on the active PCE. On any node of the active PCE, run the following command:
sudo -u ilo-pce illumio-pce-ctl cluster-restart
For example:
export ILO_ACTIVE_PCE_USER_NAME=api_17abrwerwe
export ILO_ACTIVE_PCE_USER_PASSWORD=6efefeafe34ewrooppll494934kdf
sudo -u ilo-pce --preserve-env illumio-pce-ctl setup-standby-pce --active-pce active.pce.com:8443
sudo -u ilo-pce illumio-pce-ctl cluster-restart
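The wait for the initial database backup, described in the steps above, can be scripted so that services are not restarted too early. The sketch below only looks for a pg_basebackup process in ps output, as the manual check does. The function names and the polling interval are hypothetical, and the sketch does not confirm that the service status has returned to RUNNING.

```shell
# Succeeds if the `ps -ef` output on stdin lists a pg_basebackup process.
# The [p] bracket keeps grep from matching its own entry in the process list.
basebackup_running() {
  grep -q '[p]g_basebackup'
}

# Block until no pg_basebackup process remains, checking every $1 seconds
# (default 60).
wait_for_basebackup() {
  interval=${1:-60}
  while ps -ef | basebackup_running; do
    sleep "$interval"
  done
}

# Usage on the standby PCE node that runs the backup:
#   wait_for_basebackup 60 && echo "no pg_basebackup process found"
```

After the loop exits, check the cluster status before restarting services on the active PCE.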
Failover to Standby PCE
This section tells how to perform a PCE failover for disaster recovery (DR). The active PCE has failed, and you need to promote the standby PCE so it can take over as the active PCE. Follow these steps.
Check to be sure the PCE you are about to promote is actually a standby PCE and that it is at runlevel 2.
sudo -u ilo-pce illumio-pce-ctl active-standby?
The output should say "standby."
Check to be sure the active PCE has failed and is offline. There must not be any data replicating to the standby PCE. On every node of the active PCE, run the following command:
sudo -u ilo-pce illumio-pce-ctl cluster-status
The output should contain STOPPED. Be sure to repeat this command on every node of the PCE.
On the standby PCE, run the following command to promote the standby PCE.
sudo -u ilo-pce illumio-pce-ctl promote-standby-pce
When the active PCE is down, this command promotes the standby PCE to be the new active PCE. If the active PCE is not down, the standby PCE is not promoted, and a message like "Active PCE is still reachable" is generated.
Make sure that DNS recognizes this as the new active PCE FQDN so devices in your network can find the PCE. Make sure that the values for both active_standby_replication and active_pce_fqdn in the configuration file runtime_env.yml are the PCE FQDN of the former standby (new active) PCE. For example, reconfigure the PCE FQDN on load balancers. The steps depend on your devices and configuration. For more information about the PCE FQDN, see Standby PCE Prerequisites.
Check the VEN synchronization status by running the following command:
sudo -u ilo-pce illumio-pce-ctl promote-standby-check
Run the command repeatedly and watch the output to make sure the VEN sync count increases. This indicates that the DNS change is in effect and the new active PCE has been promoted successfully.
The DNS update for the new PCE FQDN can take some time, depending on the DNS TTL value.
When you are ready, connect a new standby PCE to the new active PCE. Repeat the steps in Standby PCE Prerequisites and Set Up a Standby PCE.
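The role check, promotion, and repeated synchronization check from the steps above can be wrapped in small helper functions. This is a sketch under stated assumptions rather than a supported procedure. The function names and the PCE_CTL variable are hypothetical, and the role check relies only on the documented behavior that the output contains "standby". It does not verify the runlevel or that the active PCE is offline, so perform those checks manually first.

```shell
# Command prefix used in this guide. PCE_CTL is a hypothetical override hook.
PCE_CTL="${PCE_CTL:-sudo -u ilo-pce illumio-pce-ctl}"

# Promote this PCE only if it reports itself as a standby.
promote_if_standby() {
  role=$($PCE_CTL 'active-standby?')
  case "$role" in
    *standby*) $PCE_CTL promote-standby-pce ;;
    *) echo "refusing to promote: reported role is '$role'" >&2
       return 1 ;;
  esac
}

# Run promote-standby-check $1 times (default 10), $2 seconds apart (default 30),
# so you can watch whether the VEN sync count increases.
watch_ven_sync() {
  checks=${1:-10}
  interval=${2:-30}
  i=0
  while [ "$i" -lt "$checks" ]; do
    $PCE_CTL promote-standby-check
    i=$((i + 1))
    sleep "$interval"
  done
}

# Usage on the standby PCE:
#   promote_if_standby && watch_ven_sync 10 30
```

The output of promote-standby-check is printed unmodified because its format is not described here, so the sketch leaves interpreting the sync count to the operator.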
Monitoring Replication
In the Health page of the PCE web console, use the Standby Replication tab to monitor replication between the active PCE and standby PCE. The Standby Replication tab shows the replication lag on the active and standby PCEs for the traffic database, the policy database, the reporting database, the job queue redis, and traffic data redis. (The fileserver is not shown.)

The PCE administrator can also monitor replication by watching the service discovery log for "WAL segment missing" errors. This error can appear when the standby traffic database service cannot keep up with synchronization from the active traffic database service. When this error occurs, the log looks like the following:
2022-06-30T15:43:19.556560+00:00 level=warning host=db0-4x2systest50 ip=127.0.0.1 program=illumio_pce/service_discovery| sec=603799.555 sev=ERROR pid=12416 tid=2440 rid=0 [citus_coordinator_service] Health Check: WAL segment 105/2B95FD98 is missing. Full base backup marker file set.
When this situation arises, the citus_coordinator_service restarts and performs the full database backup again. Network latency, disk IOPS, traffic flow, and traffic data size all affect the replication latency. If you experience this issue, make any improvements you can to these factors.
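To spot this condition without reading the log by hand, you can count the matching health check messages. The sketch below matches the message text from the sample log line above. The function name count_wal_missing is hypothetical, and the log file location is not specified here, so pass the path used in your deployment.

```shell
# Print how many "WAL segment ... is missing" health check messages appear
# in the log file given as $1.
count_wal_missing() {
  grep -c 'Health Check: WAL segment .* is missing' "$1"
}

# Usage (the log path is a placeholder):
#   count_wal_missing /path/to/service_discovery.log
```

A count that keeps growing suggests the standby is repeatedly falling behind and restarting the full backup.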
For example, you can increase the value of the wal_keep_segments setting in the traffic_datastore section of the runtime_env.yml configuration file. Increasing this value costs additional disk space. Each WAL segment is 16 MB, so 5120 WAL segments would use about 82 GB of extra space.
traffic_datastore:
  wal_keep_segments: 5120
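The disk space estimate is the segment count multiplied by the segment size. The helper below is a hypothetical convenience for trying other values. It assumes the 16 MB segment size stated above unless a second argument overrides it.

```shell
# Print the disk space in MB needed to retain $1 WAL segments.
# $2 is the segment size in MB (default 16).
wal_space_mb() {
  echo $(( $1 * ${2:-16} ))
}

# Usage:
#   wal_space_mb 5120    # prints 81920, which is about 82 GB
```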
Limitations and Constraints
When using active and standby PCEs for replication, be aware of the following limitations and constraints:
Fileserver replication lag is not shown in the Standby Replication tab of the Health page.
Support reports are replicated, but support bundles are not replicated.
In an active-standby PCE pair, it is not necessary to perform database backups in the same way you would with a standalone PCE. However, if you wish to do so, take the backups from the active PCE. It is also not normally necessary to restore a database backup on the active PCE or the standby PCE. If one of the PCEs fails, the other takes over as active PCE, and it already has an up-to-date copy of the data because of the ongoing replication between the two PCEs.
Warning
If it becomes necessary to restore data from a backup (for example, if both PCEs fail), you must restore the same backup to both the active PCE and the standby PCE.