Telemetry Framework

The telemetry framework can be used to record metrics. Such metrics can either revolve around Zeek’s operational behavior, or describe characteristics of the monitored traffic.

The telemetry framework is modeled on Prometheus. It supports the same metric types as most Prometheus client libraries, with the exception of the Summary type.

The actual implementation of the metrics and the registry is provided by Broker and, internally, by CAF.

This document outlines usage examples. Head to the Telemetry API documentation for more details.

Metric Types

The following metric types are supported.

Counter

A counter increases continuously and resets when the process restarts. Examples are the number of log writes since process start, packets processed, or process_seconds representing CPU usage.

Gauge

A gauge can both increase and decrease. Examples are table sizes or the val_footprint of Zeek script values over the lifetime of the process. Temperature and memory usage are other examples.

Histogram

A histogram counts observations in pre-configured buckets. Examples are connection durations, delays, and transfer sizes. Because the buckets are pre-configured, it helps to know the expected range and distribution of the observed values in advance.

A good reference to consult for more details is the official Prometheus Metric Types documentation. The next section provides examples using each of these types.

Examples

Counting Log Writes per Stream

In combination with the Log::log_stream_policy hook, it is straightforward to record Log::write invocations over the dimension of the Log::ID value.

This section shows three different approaches. Which approach is most applicable depends mostly on the expected script layer performance overhead for updating the metric. For example, calling Telemetry::counter_with and Telemetry::counter_inc within a handler of a high-frequency event may be prohibitive, while for a low-frequency event it’s unlikely to be performance impacting.

Assuming Zeek was started with BROKER_METRICS_PORT=4242 set in the environment, querying the Prometheus endpoint using curl provides the following metrics data for each of the three approaches.

$ curl -s localhost:4242/metrics | grep log_writes
# HELP zeek_log_writes_total Number of log writes per stream
# TYPE zeek_log_writes_total counter
zeek_log_writes_total{endpoint="zeek",log_id="packetfilter_log"} 1.000000 1658924926624
zeek_log_writes_total{endpoint="zeek",log_id="loadedscripts_log"} 477.000000 1658924926624
zeek_log_writes_total{endpoint="zeek",log_id="stats_log"} 1.000000 1658924926624
zeek_log_writes_total{endpoint="zeek",log_id="dns_log"} 200.000000 1658924926624
zeek_log_writes_total{endpoint="zeek",log_id="ssl_log"} 9.000000 1658924926624
zeek_log_writes_total{endpoint="zeek",log_id="conn_log"} 215.000000 1658924926624
zeek_log_writes_total{endpoint="zeek",log_id="captureloss_log"} 1.000000 1658924926624

Immediate

The following example creates a global counter family object and uses the Telemetry::counter_family_inc helper to increment the counter metric associated with a string representation of the Log::ID value.

log-writes-immediate.zeek
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $helptext="Number of log writes per stream",
    $labels=vector("log_id")
]);

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    local log_id = to_lower(gsub(cat(id), /:+/, "_"));
    Telemetry::counter_family_inc(log_writes_cf, vector(log_id));
    }

With a few lines of scripting code, Zeek now tracks log writes per stream, ready to be scraped by a Prometheus server.
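To see where the log_id label values in the curl output come from, the label transformation can be tried on its own. The sketch below assumes the default scripts are loaded so that Conn::LOG and DNS::LOG exist. These two stream IDs are picked only for illustration.

```zeek
event zeek_init()
    {
    # cat() renders the enum value including its module, e.g. "Conn::LOG".
    # gsub() replaces each run of ":" with a single "_", and
    # to_lower() lowercases the result.
    print to_lower(gsub(cat(Conn::LOG), /:+/, "_"));
    print to_lower(gsub(cat(DNS::LOG), /:+/, "_"));
    }
```

This should print conn_log and dns_log, matching the label values shown in the curl output above.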

Cached

For cases where creating the label value (stringification, gsub and to_lower), instantiating the label vector, and invoking Telemetry::counter_family_inc cause too much performance overhead, the counter instances can be cached in a lookup table. The counters can then be incremented with Telemetry::counter_inc directly.

log-writes-cached.zeek
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $helptext="Number of log writes per stream",
    $labels=vector("log_id")
]);

# Cache for the Telemetry::Counter instances.
global log_write_counters: table[Log::ID] of Telemetry::Counter;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    if ( id !in log_write_counters )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        log_write_counters[id] = Telemetry::counter_with(log_writes_cf,
                                                         vector(log_id));
        }

    Telemetry::counter_inc(log_write_counters[id]);
    }

For metrics without labels, the metric instances can also be cached directly as global variables. The following example counts the number of HTTP requests.

global-http-counter.zeek
global http_counter_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="monitored_http_requests",
    $unit="1",
    $helptext="Number of http requests observed"
]);

global http_counter = Telemetry::counter_with(http_counter_cf);

event http_request(c: connection, method: string, original_URI: string,
                   unescaped_URI: string, version: string)
    {
    Telemetry::counter_inc(http_counter);
    }

Sync

In cases where the scripting overhead of the cached approach is still too high, the individual writes (or events) can be tracked in a table and then synchronized (mirrored) into the metrics during execution of the Telemetry::sync hook.

log-writes-sync.zeek
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $helptext="Number of log writes per stream",
    $labels=vector("log_id")
]);

# Cheap per-stream tally, updated on every log write.
global log_writes: table[Log::ID] of count &default=0;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    ++log_writes[id];
    }

hook Telemetry::sync()
    {
    for ( id, v in log_writes )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        # Mirror the tallied value into the counter rather than
        # incrementing by one per sync.
        Telemetry::counter_family_set(log_writes_cf, vector(log_id), v);
        }
    }

For the use case of tracking log writes, this is unlikely to be required, but for updating metrics within high-frequency events that otherwise have very low processing overhead, it is a valuable approach. Note that with this method, metrics can be stale for up to one Telemetry::sync_interval.
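If that staleness matters, the interval can be shortened. A minimal sketch, assuming Telemetry::sync_interval can be redefined in your Zeek version; the value of 5sec is purely illustrative:

```zeek
# Run the Telemetry::sync hooks more often, trading additional
# script execution for fresher metrics. The value is illustrative.
redef Telemetry::sync_interval = 5sec;
```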

Table sizes

It can be useful to expose the size of state-holding tables as metrics. As table sizes may increase and decrease, a Telemetry::Gauge is used for this purpose.

The following example records the size of the Tunnel::active table and its footprint with two gauges. The gauges are updated during the Telemetry::sync hook. Note that no labels are in use; both gauge instances are simple globals.

table-size-tracking.zeek
module Tunnel;

global tunnels_active_size_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active",
    $unit="1",
    $helptext="Number of currently active tunnels as tracked in Tunnel::active"
]);

global tunnels_active_size_gauge = Telemetry::gauge_with(tunnels_active_size_gf);

global tunnels_active_footprint_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active_footprint",
    $unit="1",
    $helptext="Footprint of the Tunnel::active table"
]);

global tunnels_active_footprint_gauge = Telemetry::gauge_with(tunnels_active_footprint_gf);

hook Telemetry::sync()
    {
    Telemetry::gauge_set(tunnels_active_size_gauge, |Tunnel::active|);
    Telemetry::gauge_set(tunnels_active_footprint_gauge, val_footprint(Tunnel::active));
    }

Example representation of these metrics when querying the Prometheus endpoint:

$ curl -s localhost:4242/metrics | grep tunnel
# HELP zeek_monitored_tunnels_active_footprint Footprint of the Tunnel::active table
# TYPE zeek_monitored_tunnels_active_footprint gauge
zeek_monitored_tunnels_active_footprint{endpoint="zeek"} 324.000000 1658929821941
# HELP zeek_monitored_tunnels_active Number of currently active tunnels as tracked in Tunnel::active
# TYPE zeek_monitored_tunnels_active gauge
zeek_monitored_tunnels_active{endpoint="zeek"} 12.000000 1658929821941

Instead of tracking footprints per variable, global_container_footprints could be leveraged to track all global containers at once, using the variable name as a label.
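A minimal sketch of that idea follows. It assumes that global_container_footprints() returns a table mapping each global container's name to its footprint. The metric name and the "name" label are hypothetical choices made for this illustration.

```zeek
global container_footprints_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_global_container_footprint",
    $unit="1",
    $helptext="Footprint of global container variables",
    $labels=vector("name")
]);

hook Telemetry::sync()
    {
    # Assumption: the result is a table indexed by variable name
    # that yields the footprint of each global container.
    local footprints = global_container_footprints();

    for ( name, footprint in footprints )
        Telemetry::gauge_family_set(container_footprints_gf, vector(name), footprint);
    }
```

Every global container becomes its own time series, so the number of series grows with the number of loaded scripts. Whether that is acceptable depends on the Prometheus setup.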

Connection Durations as Histogram

To track the distribution of certain measurements, a Telemetry::Histogram can be used. The histogram’s buckets have to be preconfigured.

The example below observes the duration of each connection that Zeek has monitored.

connection-durations.zeek
global conn_durations_hf = Telemetry::register_histogram_family([
    $prefix="zeek",
    $name="monitored_connection_duration",
    $unit="seconds",
    $helptext="Duration of monitored connections",
    $bounds=vector(0.1, 1.0, 10.0, 30.0, 60.0),
    $labels=vector("proto", "service")
]);

event connection_state_remove(c: connection)
    {
    local proto = cat(c$conn$proto);
    local service: set[string] = {"unknown"};

    if ( |c$service| != 0 )
        service = c$service;

    for ( s in service )
        {
        local h = Telemetry::histogram_with(conn_durations_hf, vector(proto, to_lower(s)));
        Telemetry::histogram_observe(h, interval_to_double(c$duration));
        }
    }

Due to the way Prometheus represents histograms and the fact that durations are broken down by protocol and service in the given example, the resulting output is rather verbose.

$ curl -s localhost:4242/metrics | grep monitored_connection_duration
# HELP zeek_monitored_connection_duration_seconds Duration of monitored connections
# TYPE zeek_monitored_connection_duration_seconds histogram
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="0.100000"} 970.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="1.000000"} 998.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="10.000000"} 1067.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="30.000000"} 1108.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="60.000000"} 1109.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="+Inf"} 1109.000000 1658931613557
zeek_monitored_connection_duration_seconds_sum{endpoint="zeek",proto="udp",service="dns"} 1263.085691 1658931613557
zeek_monitored_connection_duration_seconds_count{endpoint="zeek",proto="udp",service="dns"} 1109.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="0.100000"} 16.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="1.000000"} 54.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="10.000000"} 56.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="30.000000"} 57.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="60.000000"} 57.000000 1658931613557
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="+Inf"} 57.000000 1658931613557

To work with histogram data, Prometheus provides specialized query functions, for example histogram_quantile().
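As a rough illustration of what histogram_quantile() computes, the 95th percentile of the udp/dns series above can be worked by hand. This assumes the function is applied to the raw cumulative bucket counts (in practice it is usually applied to a rate() over the bucket series) and that it interpolates linearly within the bucket that contains the target rank:

```latex
\begin{aligned}
\text{rank}_{0.95} &= 0.95 \times 1109 = 1053.55 \\
998 < 1053.55 \le 1067 \;&\Rightarrow\; \text{bucket } (1.0,\ 10.0] \text{ with } 1067 - 998 = 69 \text{ observations} \\
q_{0.95} &\approx 1.0 + (10.0 - 1.0) \times \frac{1053.55 - 998}{69} \approx 8.25 \text{ seconds}
\end{aligned}
```

The estimate is only as precise as the bucket bounds allow, which is why bounds that match the expected distribution matter.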

Note that when post-processing data from conn.log, a proper histogram of connection durations can be calculated, which may be preferable. The above example is meant for demonstration purposes. Histograms are likely most useful for Zeek operational metrics such as processing times, queueing delays, or response times of external systems.

Exporting the Zeek Version

A common pattern in the Prometheus ecosystem is to expose the version information of the running process as a gauge metric with a value of 1.

The following example does just that with a Zeek script:

version.zeek
global version_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="version_info",
    $unit="1",
    $helptext="The Zeek version",
    $labels=vector("version_number", "major", "minor", "patch", "commit", "beta", "debug", "version_string")
]);

event zeek_init()
    {
    local v = Version::info;
    local labels = vector(cat(v$version_number),
                          cat(v$major), cat(v$minor), cat(v$patch),
                          cat(v$commit),
                          v$beta ? "true" : "false",
                          v$debug ? "true" : "false",
                          v$version_string);
    Telemetry::gauge_family_set(version_gf, labels, 1.0);
    }

This is exposed in Prometheus format as follows:

$ curl -s localhost:4242/metrics | grep version
# HELP zeek_version_info The Zeek version
# TYPE zeek_version_info gauge
zeek_version_info{beta="false",commit="289",debug="true",endpoint="zeek",major="5",minor="1",patch="0",version_number="50100",version_string="5.1.0-dev.289-debug"} 1.000000 1658936589580

Note that the zeek_version_info gauge is created by default in base/frameworks/telemetry/main.zeek. There is no need to add the above snippet to your site.

Metrics Export

Cluster Considerations

In a Zeek cluster, every node has its own metric registry independent of the other nodes.

As noted below in the Prometheus section, the Broker subsystem can be configured such that metrics from all nodes are imported to a single node for exposure via the Prometheus HTTP endpoint. Concretely, the manager process can be configured to import metrics from workers, proxies and loggers.

No aggregation of metrics happens during the import process. Rather, the centralized metrics receive an additional “endpoint” label that can be used to identify the originating node.

The Telemetry::collect_metrics and Telemetry::collect_histogram_metrics functions only return node local metrics. A node importing metrics will not expose metrics from other nodes to the scripting layer.
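The following sketch prints the node-local metrics at shutdown. The argument order (prefix, then name, both accepting glob-style patterns) and the record fields opts, labels and value are assumptions about the Telemetry API that should be checked against the API documentation for your Zeek version.

```zeek
event zeek_done()
    {
    # Assumed signature: collect_metrics(prefix, name).
    # Only metrics of the local node are returned.
    local metrics = Telemetry::collect_metrics("zeek", "*");

    for ( i in metrics )
        {
        local m = metrics[i];

        # The field names opts, labels and value are assumptions
        # about the returned record; verify before relying on them.
        if ( m?$value )
            print m$opts$name, m$labels, m$value;
        }
    }
```

On a manager that imports metrics from other nodes, this would still list only the manager's own metrics.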

When telemetry.log and telemetry_histogram.log are enabled, each node in a cluster logs its own metrics. The logs contain a peer field that identifies the node from which the metrics originated.

Zeek Log

The metrics created using the telemetry module can be exported as telemetry.log and telemetry_histogram.log by loading the policy script frameworks/telemetry/log on the command line, or via local.zeek.

The logs are documented through the Telemetry::Info and Telemetry::HistogramInfo records, respectively.

By default, only metrics with the prefix (namespace) zeek or process are included in the above logs. If you add new metrics with your own prefix and expect them to be included, redefine the Telemetry::log_prefixes option:

@load frameworks/telemetry/log

redef Telemetry::log_prefixes += { "my_prefix" };

Native Prometheus Export

To enable the Prometheus endpoint for a Zeek process, set the BROKER_METRICS_PORT variable in its environment. As shown with the curl examples in the previous section, a Prometheus server can be configured to scrape the Zeek process directly. See also the Prometheus Getting Started Guide.

In a cluster setup, there are two configuration options. The first is to configure a unique BROKER_METRICS_PORT and BROKER_ENDPOINT_NAME for each Zeek process and set up a Prometheus server to scrape each of these endpoints individually. The second is to set the BROKER_METRICS_IMPORT_TOPICS and BROKER_METRICS_EXPORT_TOPIC environment variables so that a single process, presumably the Zeek manager, imports all metrics from the other Zeek processes. In that case, set BROKER_METRICS_PORT only for the Zeek manager and configure the Prometheus server to scrape only the manager.

See also the Broker options in Zeek that relate to metrics, for example Broker::metrics_export_topic, Broker::metrics_port and Broker::metrics_export_endpoint_name.
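As a sketch, a standalone Zeek process could enable the endpoint from a script instead of the environment. This assumes Broker::metrics_port is a redef-able value of type port and Broker::metrics_export_endpoint_name a redef-able string. The port and name mirror the curl examples in this document and are otherwise arbitrary.

```zeek
# Expose the Prometheus endpoint on TCP port 4242 (assumed option type: port).
redef Broker::metrics_port = 4242/tcp;

# Value of the "endpoint" label attached to this process's metrics.
redef Broker::metrics_export_endpoint_name = "zeek";
```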