Monitor Supercluster Health

You can use these two general methods for monitoring the health of your PCE Supercluster:

REST API calls to determine the Supercluster leader and a PCE member's health
The PCE web console to view the health of the entire Supercluster from the leader or for the member you are logged into

This section discusses health monitoring specifically for a PCE Supercluster. Additionally, follow the PCE health monitoring guidelines in the PCE Administration Guide.

REST API for Supercluster Health

You can monitor Supercluster health using the following REST API mechanisms.

REST API `/health`

Using the PCE Health API, you can get current health information about all PCEs in your Supercluster, including the leader and members.

GET [api_version]/health
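
As a minimal sketch, the request could be issued from Python's standard library as shown here. The base URL, the API version string, and the use of HTTP Basic authentication with an API key are assumptions for illustration, not values taken from this guide. The helper names `build_url` and `get_health` are hypothetical. The response body is returned as parsed JSON without assuming any particular fields.

```python
import base64
import json
import urllib.request


def build_url(base_url: str, api_version: str, resource: str) -> str:
    """Join the PCE base URL, the API version prefix, and a resource path."""
    return "/".join(
        [base_url.rstrip("/"), api_version.strip("/"), resource.strip("/")]
    )


def get_health(base_url: str, api_version: str,
               api_user: str, api_secret: str, timeout: float = 10.0):
    """Send GET [api_version]/health and return the decoded JSON body.

    Assumption: the PCE accepts HTTP Basic authentication with an API key.
    """
    url = build_url(base_url, api_version, "health")
    token = base64.b64encode(f"{api_user}:{api_secret}".encode()).decode()
    request = urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like `get_health("https://pce.example.com:8443", "api/v2", user, secret)`, where the host, port, and version are placeholders for your own deployment.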

REST API to Determine Supercluster Leader

Use this Public Stable REST API request to determine whether a PCE in a Supercluster is the leader or a member.

GET [api_version]/supercluster/leader

HTTP Response Code from /supercluster/leader

Response	Meaning
202	The PCE is the leader.
404	The PCE is a member.
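
The response codes in the table above can be mapped to a role in a monitoring script. The following is a sketch only: the function names are hypothetical, the caller supplies the full URL and any authentication headers, and any code other than those in the table is reported as "unknown" rather than guessed.

```python
import urllib.error
import urllib.request


def role_from_status(status_code: int) -> str:
    """Translate the /supercluster/leader response code into a role."""
    if status_code == 202:
        return "leader"
    if status_code == 404:
        return "member"
    # Codes outside the documented table are not interpreted here.
    return "unknown"


def query_role(url: str, headers=None, timeout: float = 10.0) -> str:
    """Issue GET on the given /supercluster/leader URL and return the role."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return role_from_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx and 5xx codes, including 404.
        return role_from_status(err.code)
```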

REST API `/node_available`

After you determine the Supercluster leader, issue the following REST API request to monitor the leader's availability:

GET [api_version]/node_available

The Health REST API can take up to 30 seconds to reflect the actual status of the node.

HTTP Response Code from /node_available

Response	Meaning
202	The node is healthy and is connected to the rest of the cluster.
404 or no response	The node is unhealthy and cannot accept requests. Such a node should be removed from the load balancing pool.
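
The table above can be turned into a load balancer health check along the following lines. This is a sketch under stated assumptions: the function names and the node names in the usage note are hypothetical, and any response code not listed in the table is treated as unavailable, which is a conservative choice made here and not something this guide specifies.

```python
import urllib.error
import urllib.request
from typing import Dict, List, Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for the URL, or None if there is no response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: no response.
        return None


def node_is_available(status_code: Optional[int]) -> bool:
    """202 means healthy; 404 or no response (None) means unhealthy."""
    return status_code == 202


def healthy_pool(statuses: Dict[str, Optional[int]]) -> List[str]:
    """Return the sorted names of the nodes that should stay in the pool."""
    return sorted(
        node for node, code in statuses.items() if node_is_available(code)
    )
```

For example, `healthy_pool({"core0": 202, "core1": 404, "core2": None})` keeps only `core0` in the pool.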

PCE Web Console for Supercluster Health

The Health page in the PCE web console provides health information about your on-premises PCE, whether you deployed an SNC, a 2x2 or 4x2 cluster, or a Supercluster.

General PCE Health: Shows general health information for each PCE in your Supercluster, such as health status, node status and uptime, and system health information for each node (CPU, memory, and disk usage). When you have deployed a PCE Supercluster, the Health page lists all PCEs in the Supercluster with individual health information for each PCE.
Supercluster Leader Health: Displays the health status of the leader PCE in the Supercluster. You can view the health of each PCE in the Supercluster.
Supercluster Member Health: Shows health information about the member you are logged into, including a timer that indicates how much time has passed since Illumination data was last synced across the Supercluster. The Health page shows the database replication lag for each PCE relative to all other PCEs in the Supercluster, indicating how long it took for data to be replicated from one PCE to another.

The PCE Health page indicates the current state of database replication across the Supercluster and how recently each member PCE's Illumination data has been synced with the leader.

Supercluster Replication (Lag): Indicates how long it took for one PCE to receive replicated data from another PCE in the Supercluster. For example, a user creates a new IP list on the leader and saves it. If the change takes 4 seconds to replicate to Member1, the Health page on Member1 shows that its replication lag is 4 seconds behind the leader. The PCE web console shows replication lag for each PCE in the Supercluster.
Supercluster Illumination Sync (Members only): Shows the time elapsed since a member PCE last replicated its Illumination traffic data to the Supercluster leader. This information appears only for members, which periodically send traffic data to the leader. The synced data provides a full picture of Illumination traffic for your entire Supercluster. You can initiate a sync of Illumination data on demand by clicking the link in the lower right of the Illumination map.

Supercluster PCE Health Icon

When the PCE Health button has a badge with a number, one or more of the PCEs in your Supercluster have a health status that is not “Normal.” The badge color indicates the type of warning.

For example, a yellow warning badge with the number 1 indicates that one of the PCEs in the Supercluster has a health warning status.

When the badge is red and shows the number 1, one of the Supercluster PCEs has failed or is down.

Supercluster Web Console Health Page

The Supercluster Health page on the leader displays a high-level view of each PCE's health. You can click a PCE to view individual health information. The information on this page is refreshed every 60 seconds.

Individual PCE Health Status

The following table lists the possible health statuses for a PCE: Normal, Warning, or Critical.

Status	Color	Definition
Normal (healthy)	Green	A PCE is considered to be in a normal state when: All required services are running. All nodes are running. CPU usage of all nodes is less than 95%. Memory usage of all nodes is less than 95%. Disk usage of all nodes is less than 95%. Database replication lag is less than or equal to 30 seconds. Supercluster replication lag is less than or equal to 120 seconds.
Warning	Yellow	A PCE is considered to be in a warning state when: One or more nodes are unreachable. One or more optional services are missing, or one or more required services are degraded. The CPU usage of any node is greater than or equal to 95%. Memory usage of any node is greater than or equal to 95%. Disk usage of any node is greater than or equal to 95%. Database replication lag is greater than 30 seconds. Supercluster replication lag is greater than 120 seconds.
Critical	Red	A PCE is considered to be in a critical state when one or more required services are missing. In this scenario, it might not be possible to authenticate to the PCE or get a REST API response depending on which services are missing from the PCE.
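
The rules in the table above can be expressed as a small classifier, which may be useful when reproducing the status outside the web console. This is a sketch only: the `NodeUsage` and `PceSnapshot` types and their field names are hypothetical inputs invented for illustration, not structures returned by the PCE. The thresholds are the ones listed in the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeUsage:
    cpu_percent: float
    memory_percent: float
    disk_percent: float
    reachable: bool = True


@dataclass
class PceSnapshot:
    nodes: List[NodeUsage]
    required_services_missing: int = 0
    required_services_degraded: int = 0
    optional_services_missing: int = 0
    db_replication_lag_s: float = 0.0
    supercluster_replication_lag_s: float = 0.0


def classify(pce: PceSnapshot) -> str:
    """Return "Normal", "Warning", or "Critical" per the table's rules."""
    # Critical: one or more required services are missing.
    if pce.required_services_missing > 0:
        return "Critical"
    # Warning: any node at or above 95% CPU, memory, or disk usage.
    resource_pressure = any(
        n.cpu_percent >= 95 or n.memory_percent >= 95 or n.disk_percent >= 95
        for n in pce.nodes
    )
    warning = (
        any(not n.reachable for n in pce.nodes)
        or pce.optional_services_missing > 0
        or pce.required_services_degraded > 0
        or resource_pressure
        or pce.db_replication_lag_s > 30
        or pce.supercluster_replication_lag_s > 120
    )
    return "Warning" if warning else "Normal"
```

Note the boundary behavior: a database replication lag of exactly 30 seconds is still Normal, while a CPU usage of exactly 95% is already a Warning.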

PCE Health on Workload Details

When your workloads have been paired with a Supercluster leader or member, you can view PCE health on the Summary tab of the Workload details page. This page includes the PCE section, which lists the hostname and health of the PCE that this workload is paired with.

PCE Health on Illumination Command Panel

When you select a workload in the Illumination map in a Supercluster, the command panel that displays workload details includes the health of the PCE that the workload is paired with. The health status appears in the PCE Health field.

Command to Show All Supercluster Members

On any core node or the data0 node in a cluster, run the following command to display the leader and all member PCEs of the Supercluster.

sudo -u ilo-pce illumio-pce-ctl supercluster-members

Illumio Install, Configure, and Upgrade Guide 24.2.20