Illumio Core 25.2.10 Administration Guide

Events Monitoring Best Practices

The Illumio Core generates a rich stream of structured messages that provide the following information:

  • Illumio PCE system health

  • Illumio PCE notable activity

  • Illumio VEN notable activity

Illumio Core events are structured and actionable. Using the event data, you can identify the severity, affected systems, and what triggered the event. Illumio Core sends the structured messages using the syslog protocol to remote systems, such as Splunk and QRadar. You can set up your remote systems to automatically process the messages and alert you.

Monitoring Operational Practices

In addition to setting up an automated system, Illumio recommends implementing the following operational practices:

  1. Determine the normal quantity of events from the Illumio Core and monitor the trend for changes; investigate spikes or reductions in the event generation rate.

  2. Implement good operational practices to troubleshoot and investigate alerts and to recover from events.

  3. Do not monitor Illumio Core events in isolation. Monitor them as part of your overall system. Understanding the events in the context of your overall system activity can provide as much information as the events themselves.
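The first practice above, tracking the normal event rate and investigating spikes or reductions, can be sketched as a simple baseline comparison. This Python sketch is illustrative only: the function name, the per-interval counts, and the tolerance of three standard deviations are assumptions, not Illumio Core features or recommended values.

```python
from statistics import mean, stdev


def rate_anomaly(baseline_counts, current_count, tolerance=3.0):
    """Classify the current event count against a baseline.

    baseline_counts: events per interval observed during normal operation
    (hypothetical input, for example counts exported from your SIEM).
    Returns "spike", "drop", or "normal".
    """
    if len(baseline_counts) < 2:
        raise ValueError("need at least two baseline intervals")
    avg = mean(baseline_counts)
    spread = stdev(baseline_counts)
    # A perfectly flat baseline has a standard deviation of 0, so fall
    # back to a margin of 10% of the average (at least 1 event).
    margin = tolerance * spread if spread > 0 else max(1.0, 0.1 * avg)
    if current_count > avg + margin:
        return "spike"
    if current_count < avg - margin:
        return "drop"
    return "normal"
```

Both directions matter: a spike can indicate a problem on the workloads, while a drop can indicate that events are no longer reaching the remote system.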

Recommended Events to Monitor

As a best practice, Illumio recommends you monitor the following events at a minimum.

Events

Description

Program name = illumio_pce/system_health

Severity = Warning, Error, or Fatal

Provides multiple system metrics, such as CPU and memory data, for each node in a PCE cluster. The PCE generates these events every minute. The Severity field is particularly important: when system metrics exceed thresholds, the severity changes to warning, error, or fatal.

For more information about the metrics and thresholds, see the PCE Administration Guide.

Recommendation: Monitor system_health messages with a severity of warning or higher and correlate the event with other operational monitoring tools to determine if administrative intervention is required.
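As a minimal sketch of this recommendation, the following Python function flags system_health records with a severity of warning or higher. It assumes the syslog message has already been parsed into a dictionary; the key names "program" and "severity" and the "info" level are assumptions for illustration, not the documented message format.

```python
# Severity ordering used for comparison. "info" is an assumed name for
# the normal, below-threshold level.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "fatal": 3}


def needs_attention(record, minimum="warning"):
    """Return True for system_health records at or above `minimum`.

    record: hypothetical dict produced by your own syslog parser.
    Records with an unrecognized severity are not flagged.
    """
    program = str(record.get("program", "")).lower()
    severity = str(record.get("severity", "")).lower()
    if not program.endswith("/system_health"):
        return False
    return SEVERITY_RANK.get(severity, -1) >= SEVERITY_RANK[minimum]
```

A flagged record is a prompt to check your other operational monitoring tools, not necessarily a reason to intervene on its own.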

event_type="lost_agent.found"

Contains the information necessary to identify workloads with lost agents. A lost agent occurs when the PCE deletes a workload from its database, but that workload still has a VEN running on it.

Recommendation: Monitor lost_agent.found events and send alerts in case you need to pair the workloads' VENs with the PCE again.

event_type="system_task.agent_missed_heartbeats_check"

Lists the VENs that missed three heartbeats (default: total of 15 minutes). Typically, this event precedes the PCE taking the VENs offline to perform internal maintenance.

For Server VENs, this event triggers an alert to be sent at 25% of the time configured in the offline timer. For example, if the offline timer is configured to 1 hour, an alert is sent after the Server VEN has not sent a heartbeat for 15 minutes; if the offline timer is configured to 4 hours, an alert is sent after the Server VEN hasn't sent a heartbeat for 1 hour. Alerts are disabled by default for Endpoint VENs.

Recommendation: Monitor these events for high-value workloads. The PCE can take these workloads offline when the VENs miss 12 heartbeats (usually 60 minutes).
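The timing figures in this entry can be checked with a short calculation. The sketch below assumes a 5-minute heartbeat interval, which is implied by "three heartbeats" totaling 15 minutes; the function names are hypothetical.

```python
from datetime import timedelta

# Implied by the defaults above: 3 missed heartbeats = 15 minutes.
HEARTBEAT_INTERVAL = timedelta(minutes=5)


def missed_heartbeat_window(missed):
    """Time elapsed after `missed` consecutive missed heartbeats."""
    return missed * HEARTBEAT_INTERVAL


def server_ven_alert_delay(offline_timer):
    """Server VEN alerts are sent at 25% of the offline timer."""
    return offline_timer / 4


print(missed_heartbeat_window(3))                  # 0:15:00
print(missed_heartbeat_window(12))                 # 1:00:00
print(server_ven_alert_delay(timedelta(hours=1)))  # 0:15:00
print(server_ven_alert_delay(timedelta(hours=4)))  # 1:00:00
```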

event_type="system_task.agent_offline_check"

This event lists VENs that the PCE has marked offline, usually because they missed 12 heartbeats. The VENs on these workloads haven't communicated with the PCE for an hour, and the PCE has removed the workloads from policy.

Recommendation: Monitor these events for high-value workloads because they indicate a change in the affected workloads' security posture.

event_type="agent.suspend"

This event indicates that the VEN is suspended and no longer protecting the workload. If you did not intentionally run the VEN suspend command on the workload, this event can indicate the workload is under attack.

Recommendation: Monitor these events for high-value workloads.

event_type="agent.tampering"

This event indicates tampering with the workload's Illumio-managed firewall and that the VEN recovered the firewall. Firewall tampering is one of the first signs that a workload is compromised. During a tampering attempt, the VEN and PCE continue to protect the workload; however, you should investigate the event's cause.

Recommendation: Monitor these events for high-value workloads.

event_type="agent.update"

Contains the state data that the VEN regularly sends to the PCE. Typically, these events contain routine information; however, the VEN can attach a notice indicating the following issues:

  • Processes not running

  • Policy deployment failure

Recommendation: Monitor agent.update events that include notifications because they indicate workloads that might require administrative intervention.

event_type="rule_set.create"

event_type="rule_set.update"

event_type="rule_sets.delete"

Contains the labels indicating the scope of a draft ruleset. Illumio Core generates these events when you create, update, or delete a draft ruleset. When you include “All Applications,” “All Environments,” or “All Locations” in a ruleset scope, the PCE represents that label type as a null HREF. Ruleset scopes that are overly broad affect a large number of workloads. Draft rulesets do not take effect until they are provisioned.

Recommendation: Monitor these events to pinpoint ruleset scopes that are unintentionally overly broad.
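One way to act on this recommendation is to count how many label types in a scope are null. The sketch below uses a deliberately simplified, hypothetical scope representation (a dictionary that maps each label type to an HREF string or None); the actual event payload is structured differently, so adapt the extraction step to the messages you receive.

```python
# Label types named in the description: application, environment, location.
LABEL_TYPES = ("app", "env", "loc")


def broad_label_types(scope):
    """Return the label types whose HREF is null, meaning "All ...".

    scope: hypothetical dict such as {"app": "<href>", "env": None, ...}.
    A missing key is treated the same as a null HREF (an assumption).
    """
    return [t for t in LABEL_TYPES if scope.get(t) is None]


def is_overly_broad(scope, max_null=1):
    """Flag scopes with more than `max_null` unrestricted label types.

    The default threshold of 1 is an arbitrary example value.
    """
    return len(broad_label_types(scope)) > max_null
```

A scope flagged this way is not necessarily wrong; the check only surfaces rulesets worth reviewing before they are provisioned.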

event_type="sec_rule.create"

event_type="sec_rule.update"

event_type="sec_rule.delete"

These events contain labels indicating when all workloads, all services, or a label or label group are used as a rule source or destination. Illumio Core generates these events when you create, update, or delete a rule in a draft ruleset. The removed or added labels could represent high-value applications or environments.

Recommendation: Monitor these events for high-value labels.

event_type="sec_policy.create"

[NEW in Illumio Core 19.3.0] Contains the workloads_affected field, which reports the number of workloads affected by a policy. Illumio Core generates this event when you provision a draft policy that updates the policy on affected workloads. The number of affected workloads could be high in absolute terms or represent a significant percentage of your managed workloads.

Recommendation: Monitor the workloads_affected field for a high number of affected workloads. If the number exceeds an acceptable threshold, investigate the associated policy.
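A threshold check on workloads_affected might look like the following sketch. The function name and both thresholds (500 workloads, 25% of managed workloads) are placeholder assumptions; choose values that fit your environment.

```python
def policy_change_is_large(workloads_affected, managed_workloads,
                           max_count=500, max_fraction=0.25):
    """Return True when a provisioning event affects too many workloads.

    workloads_affected: value of the workloads_affected field.
    managed_workloads: total managed workloads, supplied by you.
    The change is flagged if it exceeds either the absolute count or
    the fraction of managed workloads.
    """
    if workloads_affected > max_count:
        return True
    if managed_workloads <= 0:
        # No meaningful percentage can be computed.
        return False
    return workloads_affected / managed_workloads > max_fraction
```

Checking both an absolute count and a percentage covers the two cases called out above: a change that is large in raw numbers, and one that is small in raw numbers but touches a large share of a small deployment.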

event_type="agent.clone_detected"

The PCE detects cloned VENs based on a clone token mismatch. This event has been a special alert since release 19.3.2 because clones have become a higher priority. Because these events can occur in high volume, the severity level matters more than the fact that an event occurred.

Recommendation: If the severity is 1 or ‘error’, intervention might be needed.