Monitoring Detection Rule Latency in Chronicle SIEM

Chris Martin (@thatsiemguy)
7 min read · Mar 9, 2023

How can I monitor Detection Rule latency in Chronicle SIEM? It's a question I've received a few times recently, so in this post I'm going to cover a few options available in Chronicle SIEM for doing just that: specifically, using Chronicle SIEM's Data Lake (BigQuery), the Detection Engine itself (how meta), and Chronicle SIEM's Looker Dashboards.

Fancy and over the top visual dashboards? Check

There are several factors involved in calculating and understanding Detection Rule latency, so let's cover some of those (which is a disclaimer to say there may be more I'm not aware of or have not covered).

Match Window

This is the first thing I consider when understanding the latency of an alert: what is my Rule's match window? A rule with a 1 to 10 minute window is going to generate a Detection Alert with far lower latency than one with a match window of 24 to 48 hours, which is going to have a detection latency of 1 to 2 days accordingly.

The optimal match window, I find, is between 1 and 10 minutes, as this keeps the run frequency as low as possible.

Processing for each run frequency is spread over the given time period. For example, for the 1 hour run frequency, detections might be generated at any point in a given hour. For the 24 hour run frequency, detections may be generated up to 24 hours after the event occurs.
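
If you already have the Data Lake (BigQuery) export enabled (more on that later in this post), a quick way to see the effective match window per rule is to look at the detection time window recorded with each detection. This is only a sketch, using the same datalake.rule_detections table and time_window fields as the SQL further below:

-- Sketch: approximate each rule's effective match window (in minutes) from the
-- detection time window recorded in the Data Lake, alongside how often it fires.
SELECT
  rule_name,
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(
      TIMESTAMP_SECONDS(detection.time_window.end_time.seconds),
      TIMESTAMP_SECONDS(detection.time_window.start_time.seconds),
      MINUTE), 2)[OFFSET(1)] AS median_match_window_minutes,
  COUNT(*) AS detections
FROM
  `datalake.rule_detections`
GROUP BY
  1
ORDER BY
  median_match_window_minutes DESC

Rules at the top of that result set are the ones where the match window alone will account for hours of latency.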

Log Latency

The next step I consider is understanding the original log source's latency. Not all log sources generate data in near real-time: some have a high variance in the range of late arriving data, and environmental factors, such as devices not always being online to transmit data, can add further delay.

I wrote previously on how to understand log sources in your environment that may have late arriving data, i.e., logs that can cause detection latency, which is a useful primer for this topic.

There's no public documentation on the exact mechanics of late arriving data and how it impacts Rule Detections, but my observation is that if event data has several hours of ingestion latency you are going to get a delayed detection. I suspect this is because the data is no longer used in the more real-time pipelines and is instead picked up in scheduled batch pipelines (which is still pretty neat, in that late arriving data is handled gracefully). Hence the importance of understanding which of your log sources are consistently near real-time, which are consistently late, and which sit somewhere in the middle with a large range of latency.
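
One way to get a feel for this per log source is to measure the delta between event time and ingest time in the Data Lake. The sketch below assumes UDM events are exported to a datalake.udm_events table with the metadata.log_type, metadata.event_timestamp, and metadata.ingested_timestamp fields; the table name in your dataset may differ:

-- Sketch: ingestion latency (ingested time minus event time) per log type,
-- for events seen in the last day. Table and field names are assumptions.
SELECT
  metadata.log_type,
  APPROX_QUANTILES(
    metadata.ingested_timestamp.seconds - metadata.event_timestamp.seconds,
    100)[OFFSET(50)] AS p50_latency_seconds,
  APPROX_QUANTILES(
    metadata.ingested_timestamp.seconds - metadata.event_timestamp.seconds,
    100)[OFFSET(95)] AS p95_latency_seconds
FROM
  `datalake.udm_events`
WHERE
  TIMESTAMP_SECONDS(metadata.event_timestamp.seconds) >
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY
  1
ORDER BY
  p95_latency_seconds DESC

Log types with a large p95 are the ones most likely to produce late Detections, regardless of how the rule itself is written.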

Use the Chronicle UI

A simple option to understand Detection Latency is to use the Chronicle Alert Detection page. For any given Detection Alert you'll see the Detection time (roughly the log event time) and the Creation time, e.g., in the example below you can see:

  • there was a 1 hour match window, via the Detection window
  • the first event occurred at 22:24 UTC
  • the Detection was generated at 23:45 UTC

Given all that, we can see this rule had a latency of 1 hour and 21 minutes which, with a 1 hour match window, is expected, i.e., 1 hour of latency for the window itself, plus the rule appearing to run at a random point within the next hour interval.
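
If you'd rather not click through Detection Alerts one by one, the same Creation time minus Detection time delta can be pulled from the Data Lake, which is covered in more detail below; here's a minimal sketch, with a placeholder rule name:

-- Sketch: the "Creation time minus Detection time" delta per detection,
-- pulled from the Data Lake rather than the UI. The rule name is a placeholder.
SELECT
  rule_name,
  TIMESTAMP_SECONDS(detection.detection_timestamp.seconds) AS detection_time,
  TIMESTAMP_SECONDS(detection.commit_timestamp.seconds) AS creation_time,
  TIMESTAMP_DIFF(
    TIMESTAMP_SECONDS(detection.commit_timestamp.seconds),
    TIMESTAMP_SECONDS(detection.detection_timestamp.seconds),
    MINUTE) AS latency_minutes
FROM
  `datalake.rule_detections`
WHERE
  rule_name = 'my_example_rule'
ORDER BY
  detection_time DESC
LIMIT 10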

Using YARA-L Outcomes to display log latency

Chronicle SIEM’s UDM includes several timestamps, including:

i) the event timestamp, as parsed from the original log

ii) the ingested timestamp, when the log was received into Chronicle

We can use a custom Outcome variable to generate a delta in seconds, which quickly tells us how long the latency was, and can even be used for reporting in Looker. Here's an example from a multi-event rule; if it were a single event rule, remove the max aggregation function.

outcome:
    $late_arriving_data_delta_seconds = max($event.metadata.ingested_timestamp.seconds - $event.metadata.event_timestamp.seconds)

And putting that into a YARA-L rule, you can see the delta in the output. This quickly shows me that the log source in question for this YARA-L rule has a latency of several hours, between 6 and 8 hours.

Using Chronicle Data Lake to analyze Detection Latency

Another option is using Chronicle’s Data Lake, aka BigQuery, to analyze Detection Latency.

Here's an example of using the Google Sheets Connected Sheets feature to build a chart directly from BigQuery, but you can always run the SQL in the BigQuery console and export the results.

The graph shows recent rules for the configured interval, the last two weeks in this case, and then plots the various percentiles to give an idea of the latency range we can expect for a given rule.

Understanding the range of latency for your YARA-L Detections using Google Sheets & Chronicle Data Lake

You can clearly see a correlation between a YARA-L rule's match window and the latency: in this case, three rules with a small match window and one rule with a 24 hour match window.

It's worth noting the range of latency for the rule with the larger match window, workspace_SuperUserLogin, which is between 7 and 10 hours. Given this information we can break that latency down as follows:

  • the match window spans 24 hours
  • the run frequency is 1 hour
  • the original log itself had a latency of 6 to 8 hours (remember, we discovered that from the YARA-L outcome above)

Here’s the SQL statement I used:

🐉 Here be dragons warning: I'm not a DBA, nor a math major, but it appears to do the trick.

-- Detection latency (commit timestamp minus detection timestamp, in minutes)
-- percentiles per rule, for detections from the last 7 days.
SELECT
  rule_name,
  percentiles[OFFSET(10)] AS p10,
  percentiles[OFFSET(25)] AS p25,
  percentiles[OFFSET(50)] AS p50,
  percentiles[OFFSET(75)] AS p75,
  percentiles[OFFSET(90)] AS p90,
  SUM( percentiles[OFFSET(100)] - percentiles[OFFSET(0)] ) AS percentile_range
FROM (
  SELECT
    rule_name,
    APPROX_QUANTILES(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(detection.commit_timestamp.seconds), TIMESTAMP_SECONDS(detection.detection_timestamp.seconds), MINUTE), 100) AS percentiles
  FROM
    `datalake.rule_detections`
  WHERE
    -- limit to detections from the last 7 days
    TIMESTAMP_SECONDS(detection.detection_timestamp.seconds) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  GROUP BY
    1 )
GROUP BY
  1, 2, 3, 4, 5, 6
HAVING
  percentile_range > 0

⚠️ Retro Hunts skew BigQuery stats

YARA-L rules can operate in two modes, i) Live, or ii) Batch, and within a given YARA-L rule the match window time range impacts the run frequency; however, when running a Retro Hunt there is no way of telling it apart from a Live Rule in the BQ stats, which can cause a major stats discrepancy.

Take the chart below: we see a few rules with large, unexpected latency, and this is because they're historical Retro Hunts, which means the gap between the event and commit timestamps is very large.

An example of skewed stats via Rule Detections Data Lake table

There's no easy solution for this via SQL alone until the rule_detections table can differentiate between a live rule and a retro hunt, but options include i) knowing which rules may have been run as a Retro Hunt, ii) viewing in the Chronicle UI which rules have been run as a Retro Hunt, or iii) getting a little more advanced and using the Detection Engine API, which can retrieve Retro Hunts, and then merging the two datasets (which is way beyond what I'm covering today).
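
For option i), the simplest workaround is to exclude the rules you know were run as a Retro Hunt before aggregating stats, e.g., with a filter along these lines (the rule names are placeholders):

-- Sketch: exclude rules known to have been run as a Retro Hunt before
-- aggregating latency stats. The rule names below are placeholders.
SELECT
  rule_name,
  TIMESTAMP_DIFF(
    TIMESTAMP_SECONDS(detection.commit_timestamp.seconds),
    TIMESTAMP_SECONDS(detection.detection_timestamp.seconds),
    MINUTE) AS latency_minutes
FROM
  `datalake.rule_detections`
WHERE
  rule_name NOT IN ('retro_hunted_rule_1', 'retro_hunted_rule_2')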

And here's the full SQL statement behind the chart, for posterity.

-- Per rule and severity: detection counts, first/last observed, match window
-- duration, rule version churn, and commit latency (minutes) stats.
SELECT
  rule_name,
  severity,
  -- DISTINCT detection.id avoids double counting from the UNNEST of detection.outcomes
  COUNT(DISTINCT detection.id) AS unique_count,
  TIMESTAMP_SECONDS(MIN(detection.detection_timestamp.seconds)) AS first_observed,
  DATETIME_DIFF(CURRENT_DATE(), DATETIME(TIMESTAMP_SECONDS(MIN(detection.detection_timestamp.seconds))), DAY) AS first_observed_days_ago,
  TIMESTAMP_SECONDS(MAX(detection.detection_timestamp.seconds)) AS last_observed,
  DATETIME_DIFF(CURRENT_DATE(), DATETIME(TIMESTAMP_SECONDS(MAX(detection.detection_timestamp.seconds))), DAY) AS last_observed_days_ago,
  DATETIME_DIFF(DATETIME(TIMESTAMP_SECONDS(MAX(detection.detection_timestamp.seconds))), DATETIME(TIMESTAMP_SECONDS(MIN(detection.detection_timestamp.seconds))), DAY) AS interval_duration,
  ROUND(SAFE_DIVIDE(COUNT(DISTINCT detection.id), DATETIME_DIFF(DATETIME(TIMESTAMP_SECONDS(MAX(detection.detection_timestamp.seconds))), DATETIME(TIMESTAMP_SECONDS(MIN(detection.detection_timestamp.seconds))), DAY)), 0) AS average_per_day_count,
  DATETIME_DIFF(DATETIME(TIMESTAMP_SECONDS(MAX(detection.time_window.end_time.seconds))), DATETIME(TIMESTAMP_SECONDS(MAX(detection.time_window.start_time.seconds))), MINUTE) AS rule_match_duration,
  COUNT(DISTINCT version_timestamp.seconds) AS rule_versions_count,
  DATETIME_DIFF(CURRENT_DATE(), DATETIME(TIMESTAMP_SECONDS(MIN(version_timestamp.seconds))), DAY) AS rule_version_last_change,
  ROUND(MIN(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(detection.commit_timestamp.seconds), TIMESTAMP_SECONDS(detection.detection_timestamp.seconds), MINUTE)), 0) AS min_commit_latency,
  ROUND(MAX(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(detection.commit_timestamp.seconds), TIMESTAMP_SECONDS(detection.detection_timestamp.seconds), MINUTE)), 0) AS max_commit_latency,
  ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP_SECONDS(detection.commit_timestamp.seconds), TIMESTAMP_SECONDS(detection.detection_timestamp.seconds), MINUTE)), 0) AS avg_commit_latency,
  ARRAY_AGG(DISTINCT outcome.name) AS outcomes
FROM
  `datalake.rule_detections`,
  UNNEST(detection.outcomes) AS outcome
GROUP BY
  1,
  2

And here are example results from the above SQL statement, used to generate this chart.

Imagine a spreadsheet with the power of BigQuery, that’s Connected Sheets

Using Looker Dashboards

You can also use native Looker dashboards to plot the same results visually. The Dashboard below quickly shows the Commit time (the time the Detection was created) against the Detection time (the time the single event happened, or the end of the multi-event window range).

Via this Dashboard, and the analysis steps above, I can quickly understand that the 24 hour latencies are expected, as those rules use a 1 day match window and/or have significantly late arriving data.

Understanding Detection Latency using Chronicle Dashboards

I'm not going to go through all the steps to create such a Dashboard, but the brief summary is: add a Visualization from the Rule Detections Explore, then configure the following fields:

The custom table calculations are as follows:

diff_hours(${rule_detections.detection__detection_timestamp_date},${rule_detections.detection__commit_timestamp_date})

And for the Rule Type:

if(${rule_detections.detection__rule_type}=1,"SINGLE","MULTI")

📝 Note, I'm not sure this is 100% accurate, as the Chronicle UI shows the rule types differently, which makes me think there is a difference in the BQ export, so this is one you could probably leave out.
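
One way to sanity check the mapping in your own tenant is to look at the raw rule_type values in BigQuery; a quick sketch, assuming the same rule_detections table as earlier:

-- Sketch: see which numeric rule_type values actually appear per rule,
-- to verify the SINGLE/MULTI mapping used in the table calculation above.
SELECT
  detection.rule_type,
  COUNT(DISTINCT rule_name) AS rules,
  COUNT(*) AS detections
FROM
  `datalake.rule_detections`
GROUP BY
  1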

Summary

Hopefully the above helps give you a better understanding of the considerations that can impact latency when generating a YARA-L rule. In most cases where I help troubleshoot this, it comes down to original log data arriving late and/or large match windows. Happy Detecting!
