Events Monitoring Best Practices
The Illumio Core generates a rich stream of structured messages that provide the following information:
Illumio PCE system health
Illumio PCE notable activity
Illumio VEN notable activity
Illumio Core events are structured and actionable. Using the event data, you can identify the severity, affected systems, and what triggered the event. Illumio Core sends the structured messages using the syslog protocol to remote systems, such as Splunk and QRadar. You can set up your remote systems to automatically process the messages and alert you.
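As a rough illustration of automated processing on the receiving system, the sketch below triages already-parsed events by severity. The dictionary key `severity`, the sample payloads, and the function names are assumptions made for this example, not a documented message schema; adapt them to whatever your syslog collector actually emits. The severity names mirror the Warning, Error, and Fatal levels that this page recommends watching.

```python
# Hypothetical triage of parsed events. The "severity" key is an assumption
# for illustration, not a documented Illumio field name.
ALERT_SEVERITIES = {"warning", "error", "fatal"}


def needs_alert(event: dict) -> bool:
    """Return True when the event's severity calls for attention."""
    severity = str(event.get("severity", "")).strip().lower()
    return severity in ALERT_SEVERITIES


def triage(events):
    """Split events into (alerts, routine) lists, preserving input order."""
    alerts, routine = [], []
    for event in events:
        (alerts if needs_alert(event) else routine).append(event)
    return alerts, routine
```

Normalizing the severity string before comparing keeps the check tolerant of differences in capitalization between message sources.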
Monitoring Operational Practices
In addition to setting up an automated system, Illumio recommends implementing the following operational practices:
Determine the normal quantity of events from the Illumio Core and monitor the trend for changes; investigate spikes or reductions in the event generation rate.
Implement good operational practices to troubleshoot and investigate alerts and to recover from events.
Do not monitor Illumio Core events in isolation. Monitor them as part of your overall system. Understanding the events in the context of your overall system activity can provide as much information as the events themselves.
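The first practice above, tracking the normal event rate and investigating spikes or reductions, can be sketched as a simple baseline check. This is a minimal sketch, assuming you already collect event counts per fixed interval (for example, per hour); the function name, the z-score approach, and the threshold of 3.0 are illustrative choices, not an Illumio recommendation.

```python
from statistics import mean, stdev


def rate_anomaly(history, current, z_threshold=3.0):
    """Classify the current per-interval event count against a baseline.

    history: past event counts for equal-length intervals.
    Returns "spike", "drop", or "normal".
    """
    if len(history) < 2:
        return "normal"  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any deviation at all is notable.
        if current == baseline:
            return "normal"
        return "spike" if current > baseline else "drop"
    z = (current - baseline) / spread
    if z > z_threshold:
        return "spike"
    if z < -z_threshold:
        return "drop"
    return "normal"
```

For example, with a history of `[100, 102, 98, 101, 99]`, a count of 150 is classified as a spike and a count of 10 as a drop. A sudden drop deserves the same attention as a spike, because it can mean that events have stopped reaching the remote system.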
Recommended Events to Monitor
As a best practice, Illumio recommends you monitor the following events at a minimum.
Events | Description |
---|---|
Program name = Severity = Warning, Error, or Fatal | Provides multiple system metrics, such as CPU and memory data, for each node in a PCE cluster. The PCE generates these events every minute. The Severity field is particularly important: when system metrics exceed thresholds, the severity changes to warning, error, or fatal. For more information about the metrics and thresholds, see the PCE Administration Guide. Recommendation: Monitor |
| Contains the information necessary to identify workloads with lost agents. A lost agent occurs when the PCE deletes a workload from its database, but that workload still has a VEN running on it. Recommendation: Monitor |
| Lists the VENs that missed three heartbeats (default: total of 15 minutes). Typically, this event precedes the PCE taking the VENs offline to perform internal maintenance. For Server VENs, this event triggers an alert to be sent at 25% of the time configured in the offline timer. For example, if the offline timer is configured to 1 hour, an alert is sent after the Server VEN has not sent a heartbeat for 15 minutes; if the offline timer is configured to 4 hours, an alert is sent after the Server VEN hasn't sent a heartbeat for 1 hour. Alerts are disabled by default for Endpoint VENs. Recommendation: Monitor these events for high-value workloads. The PCE can take these workloads offline when the VENs miss 12 heartbeats (usually 60 minutes). |
| This event lists VENs that the PCE has marked offline, usually because they missed 12 heartbeats. The VENs on these workloads have not communicated with the PCE for an hour, and the PCE has removed the workloads from policy. Recommendation: Monitor these events for high-value workloads because they indicate a change in the affected workloads' security posture. |
| This event indicates that the VEN is suspended and no longer protecting the workload. If you did not intentionally run the VEN suspend command on the workload, this event can indicate the workload is under attack. Recommendation: Monitor these events for high-value workloads. |
| This event indicates tampering with the workload's Illumio-managed firewall and that the VEN recovered the firewall. Firewall tampering is one of the first signs that a workload is compromised. During a tampering attempt, the VEN and PCE continue to protect the workload; however, you should investigate the event's cause. Recommendation: Monitor these events for high-value workloads. |
| Contains the state data that the VEN regularly sends to the PCE. Typically, these events contain routine information; however, the VEN can attach a notice indicating the following issues: Recommendation: Monitor |
| Contains the labels indicating the scope of a draft ruleset. Illumio Core generates these events when you create, update, or delete a draft ruleset. When you include “All Applications,” “All Environments,” or “All Locations” in a ruleset scope, the PCE represents that label type as a null HREF. Ruleset scopes that are overly broad affect a large number of workloads. Draft rulesets do not take effect until they are provisioned. Recommendation: Monitor these events to pinpoint ruleset scopes that are unintentionally overly broad. |
| Contains the labels indicating when all workloads, all services, or a label or label group is used as a rule source or destination. Illumio Core generates these events when you create, update, or delete a draft ruleset. The removed or added labels could represent high-value applications or environments. Recommendation: Monitor these events for high-value labels. |
| [NEW in Illumio Core 19.3.0] It contains the Recommendation: Monitor the |
| The PCE detects cloned VENs based on a clone token mismatch. This is a special alert from release 19.3.2 onwards, as clones have become a higher priority. Because these events can occur in volume, the severity level matters more than the fact that an event occurred. Recommendation: If the severity is 1 or 'error', some intervention may be needed. |
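
The heartbeat timing described in the table above can be checked with a little arithmetic. The sketch below assumes a 5-minute heartbeat interval, which is implied by the table (three missed heartbeats total 15 minutes by default), and applies the 25% rule for Server VEN alerts. The constant and function names are hypothetical and exist only for this example.

```python
from datetime import timedelta

# Implied by the table: 3 missed heartbeats = 15 minutes (default).
HEARTBEAT_INTERVAL = timedelta(minutes=5)
# Server VEN alerts fire at 25% of the configured offline timer.
ALERT_FRACTION = 0.25


def missed_heartbeat_alert_delay(offline_timer: timedelta) -> timedelta:
    """Time without a heartbeat before a Server VEN alert is sent."""
    return offline_timer * ALERT_FRACTION


def heartbeats_missed(silence: timedelta) -> int:
    """Whole heartbeats missed after a given period of silence."""
    return silence // HEARTBEAT_INTERVAL
```

With a 1-hour offline timer, `missed_heartbeat_alert_delay(timedelta(hours=1))` yields 15 minutes, and with a 4-hour timer it yields 1 hour, matching the examples in the table. Likewise, 60 minutes of silence corresponds to 12 missed heartbeats, the point at which the PCE can take a workload offline.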