Telemetry Framework

Note

This framework changed considerably with Zeek 7, and is not API-compatible with earlier versions. While earlier versions relied on an implementation in Broker, Zeek now maintains its own implementation, building on prometheus-cpp, with Broker adding its telemetry to Zeek’s internal registry of metrics.

The telemetry framework continuously collects metrics during Zeek’s operation, and provides ways to export this telemetry to third-party consumers. Zeek ships with a pre-defined set of metrics and allows you to add your own, via script-layer and in-core APIs you use to instrument relevant parts of the code. Metrics target Zeek’s operational behavior, or track characteristics of monitored traffic. Metrics are not an additional export vehicle for Zeek’s various regular logs. Zeek’s telemetry data model closely resembles that of Prometheus, and supports its text-based exposition format for scraping by third-party collectors.

This document outlines usage examples, and gives brief API examples for composing your own metrics. Head to the Telemetry API documentation for more details.

Metric Types

Zeek supports the following metric types:

Counter

A continuously increasing value, resetting on process restart. Examples of counters are the number of log writes since process start, packets processed, or process_seconds representing CPU usage.

Gauge

A gauge metric is a numerical value that can increase and decrease over time. Examples are table sizes or the val_footprint of Zeek script values over the lifetime of the process. More general examples include a temperature or memory usage.

Histogram

Pre-configured buckets of observations with corresponding counts. Examples of histograms are connection durations, delays, or transfer sizes. Generally, it is useful to know the expected range and distribution of such values, as the bounds of a histogram’s buckets are defined when this metric gets created.

Zeek uses double throughout to track metric values. Since terminology around telemetry can be complex, it helps to know a few additional terms:

Labels

A given metric sometimes doesn’t exist in isolation, but comes with additional labeling to disambiguate related observations. For example, Zeek ships with a gauge called zeek_active_sessions that tracks counts for TCP, UDP, and other transport protocols separately. A label has a name (for example, “protocol”) referring to a value (such as “tcp”). A metric can have multiple labels. Labels are thus a way to associate textual information with the numerical values of metrics.

Family

The set of such metrics, differing only by their labeling, is known as a Family. Zeek’s script-layer metrics API lets you operate on individual metrics and families.

Zeek has no equivalent to Prometheus’s Summary type. A good reference to consult for more details is the official Prometheus Metric Types documentation.

Cluster Considerations

When running Zeek as a cluster, every node maintains its own metrics registry, independently of the other nodes. Zeek does not automatically synchronize, centralize, or aggregate metrics across the cluster. Instead, it adds the name of the node a particular metric originated from at collection time, leaving any aggregation to post-processing where desired.

Accordingly, the Telemetry::collect_metrics and Telemetry::collect_histogram_metrics functions only return node-local metrics.
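
A minimal sketch of querying these node-local metrics from script land follows, assuming the Telemetry::Metric record exposes the opts, label_values, and value fields as described in the Telemetry API documentation:

```zeek
# Dump all node-local metrics in the "zeek" prefix at shutdown.
event zeek_done()
    {
    local metrics = Telemetry::collect_metrics("zeek", "*");

    for ( i in metrics )
        {
        local m = metrics[i];
        print fmt("%s_%s %s = %f", m$opts$prefix, m$opts$name,
                  m$label_values, m$value);
        }
    }
```

On a cluster node this reflects only that node’s registry; gathering a cluster-wide view remains a post-processing task.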

Metrics Export

Zeek supports two mechanisms for exporting telemetry: traditional logs, and Prometheus-compatible endpoints for scraping by a third-party service. We cover them in turn.

Zeek Logs

Zeek can export current metrics continuously via telemetry.log and telemetry_histogram.log. It does not do so by default. To enable, load the policy script frameworks/telemetry/log on the command line, or via local.zeek.

The Telemetry::Info and Telemetry::HistogramInfo records define the logs. Both records include a peer field that conveys the cluster node the metric originated from.

By default, Zeek reports current telemetry every 60 seconds, as defined by the Telemetry::log_interval, which you’re free to adjust.
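
For example, to report telemetry every 30 seconds instead:

```zeek
redef Telemetry::log_interval = 30sec;
```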

Also, by default only metrics with the prefix (namespace) zeek or process are included in the above logs. If you add new metrics with your own prefix and want them included, redefine the Telemetry::log_prefixes option:

@load frameworks/telemetry/log

redef Telemetry::log_prefixes += { "my_prefix" };

Clearing the set will cause all metrics to be logged. As with any logs, you may employ policy hooks, Telemetry::log_policy and Telemetry::log_policy_histogram, to define potentially more granular filtering.
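
As a sketch, a policy hook vetoing individual telemetry.log entries might look as follows (the metric name filtered here is purely illustrative):

```zeek
# Drop a hypothetical noisy metric from telemetry.log.
hook Telemetry::log_policy(rec: Telemetry::Info, id: Log::ID, filter: Log::Filter)
    {
    if ( rec$name == "event_handler_invocations" )
        break;
    }
```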

Native Prometheus Export

Every Zeek process, regardless of whether it’s running long-term standalone or as part of a cluster, can run an HTTP server that renders current telemetry in Prometheus’s text-based exposition format.

The Telemetry::metrics_port variable controls this behavior. Its default of 0/unknown disables exposing the port; setting it to another TCP port will enable it. In clusterized operation, the cluster topology can specify each node’s metrics port via the corresponding Cluster::Node field, and the framework will adjust Telemetry::metrics_port accordingly. Both zeekctl and the management framework let you define specific ports and can also auto-populate their values, similarly to Broker’s listening ports.
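
For a standalone process, enabling the endpoint is a one-line redefinition, for example in local.zeek (the port number is arbitrary):

```zeek
redef Telemetry::metrics_port = 9911/tcp;
```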

To query a node’s telemetry, point an HTTP client or Prometheus scraper at the node’s metrics port:

$ curl -s http://<node>:<node-metrics-port>/metrics
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 0
...
# HELP zeek_event_handler_invocations_total Number of times the given event handler was called
# TYPE zeek_event_handler_invocations_total counter
zeek_event_handler_invocations_total{endpoint="manager",name="run_sync_hook"} 2
...

To simplify telemetry collection from all nodes in a cluster, Zeek supports Prometheus HTTP Service Discovery on the manager node. In this approach, the endpoint http://<manager>:<manager-metrics-port>/services.json returns a JSON data structure that itemizes all metrics endpoints in the cluster. Prometheus scrapers supporting service discovery then proceed to collect telemetry from the listed endpoints in turn. See the Prometheus Getting Started Guide for additional information.
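
A Prometheus scrape configuration using this discovery endpoint might look as follows; the hostname and port are placeholders for your manager’s address and metrics port:

```yaml
# prometheus.yml fragment; manager.example.com:9911 is a placeholder.
scrape_configs:
  - job_name: zeek
    http_sd_configs:
      - url: http://manager.example.com:9911/services.json
```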

Note

Changed in version 7.0.

The built-in aggregation for Zeek telemetry to the manager node has been removed, in favor of the Prometheus-compatible service discovery endpoint. The new approach requires cluster administrators to manage access to the additional ports. However, it allows Prometheus to conduct the aggregation, instead of burdening the Zeek manager with it, which has historically proved expensive.

If these setups aren’t right for your environment, you can redefine the options in local.zeek to something more suitable. For example, the following snippet opens an individual Prometheus port for each Zeek process (derived from the port used in cluster-layout.zeek):

@load base/frameworks/cluster

global my_node = Cluster::nodes[Cluster::node];
global my_metrics_port = count_to_port(port_to_count(my_node$p) - 1000, tcp);

redef Telemetry::metrics_port = my_metrics_port;

As a different example, to change the metrics port from 9911 to 1234 on the manager process only, use the following snippet:

@load base/frameworks/cluster

@if ( Cluster::local_node_type() == Cluster::MANAGER )
redef Telemetry::metrics_port = 1234/tcp;
@endif

Examples of Metrics Application

Counting Log Writes per Stream

In combination with the Log::log_stream_policy hook, it is straightforward to record Log::write invocations over the dimension of the Log::ID value. This section shows three different approaches to do this. Which approach is most applicable depends mostly on the expected script layer performance overhead for updating the metric. For example, calling Telemetry::counter_with and Telemetry::counter_inc within a handler of a high-frequency event may be prohibitive, while for a low-frequency event it’s unlikely to matter.

Assuming a Telemetry::metrics_port of 9090, querying the Prometheus endpoint using curl provides output resembling the following for each of the three approaches.

$ curl -s localhost:9090/metrics | grep log_writes
# HELP zeek_log_writes_total Number of log writes per stream
# TYPE zeek_log_writes_total counter
zeek_log_writes_total{endpoint="zeek",log_id="packetfilter_log"} 1
zeek_log_writes_total{endpoint="zeek",log_id="loadedscripts_log"} 477
zeek_log_writes_total{endpoint="zeek",log_id="stats_log"} 1
zeek_log_writes_total{endpoint="zeek",log_id="dns_log"} 200
zeek_log_writes_total{endpoint="zeek",log_id="ssl_log"} 9
zeek_log_writes_total{endpoint="zeek",log_id="conn_log"} 215
zeek_log_writes_total{endpoint="zeek",log_id="captureloss_log"} 1

The above shows a family of 7 zeek_log_writes_total metrics, each with an endpoint label (here, zeek, which would be a cluster node name if scraped from a Zeek cluster) and a log_id one.

Immediate

The following example creates a global counter family object and uses the Telemetry::counter_family_inc helper to increment the counter metric associated with a string representation of the Log::ID value.

log-writes-immediate.zeek
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    local log_id = to_lower(gsub(cat(id), /:+/, "_"));
    Telemetry::counter_family_inc(log_writes_cf, vector(log_id));
    }

With a few lines of scripting code, Zeek now tracks log writes per stream, ready to be scraped by a Prometheus server.

Cached

For cases where creating the label value (stringification, gsub, and to_lower), instantiating the label vector, and invoking Telemetry::counter_family_inc cause too much performance overhead, the counter instances can also be cached in a lookup table. The counters can then be incremented with Telemetry::counter_inc directly.

log-writes-cached.zeek
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

# Cache for the Telemetry::Counter instances.
global log_write_counters: table[Log::ID] of Telemetry::Counter;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    if ( id !in log_write_counters )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        log_write_counters[id] = Telemetry::counter_with(log_writes_cf,
                                                         vector(log_id));
        }

    Telemetry::counter_inc(log_write_counters[id]);
    }

For metrics without labels, the metric instances can also be cached in global variables directly. The following example counts the number of HTTP requests.

global-http-counter.zeek
global http_counter_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="monitored_http_requests",
    $unit="1",
    $help_text="Number of http requests observed"
]);

global http_counter = Telemetry::counter_with(http_counter_cf);

event http_request(c: connection, method: string, original_URI: string,
                   unescaped_URI: string, version: string)
    {
    Telemetry::counter_inc(http_counter);
    }

Sync

In case the scripting overhead of the previous approach is still too high, individual writes (or events) can be tracked in a table and then synchronized / mirrored during execution of the Telemetry::sync hook.

log-writes-sync.zeek
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

global log_writes: table[Log::ID] of count &default=0;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    ++log_writes[id];
    }

hook Telemetry::sync()
    {
    for ( id, v in log_writes )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        Telemetry::counter_family_set(log_writes_cf, vector(log_id), v);
        }
    }

For the use-case of tracking log writes, this is unlikely to be required, but for updating metrics within high-frequency events that otherwise have very low processing overhead it’s a valuable approach. Note that with this method, metrics can be stale for up to Telemetry::sync_interval.

Table sizes

It can be useful to expose the size of tables as metrics, as they often indicate the approximate amount of state maintained in memory. As table sizes may increase and decrease, a Telemetry::Gauge is appropriate for this purpose.

The following example records the size of the Tunnel::active table and its footprint with two gauges. The gauges are updated during the Telemetry::sync hook. Note that no labels are in use here; both gauge instances are simple globals.

tunnels-active.zeek
module Tunnel;

global tunnels_active_size_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active",
    $unit="1",
    $help_text="Number of currently active tunnels as tracked in Tunnel::active"
]);

global tunnels_active_size_gauge = Telemetry::gauge_with(tunnels_active_size_gf);

global tunnels_active_footprint_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active_footprint",
    $unit="1",
    $help_text="Footprint of the Tunnel::active table"
]);

global tunnels_active_footprint_gauge = Telemetry::gauge_with(tunnels_active_footprint_gf);

hook Telemetry::sync()
    {
    Telemetry::gauge_set(tunnels_active_size_gauge, |Tunnel::active|);
    Telemetry::gauge_set(tunnels_active_footprint_gauge, val_footprint(Tunnel::active));
    }

Example representation of these metrics when querying the Prometheus endpoint:

$ curl -s localhost:9090/metrics | grep tunnel
# HELP zeek_monitored_tunnels_active_footprint Footprint of the Tunnel::active table
# TYPE zeek_monitored_tunnels_active_footprint gauge
zeek_monitored_tunnels_active_footprint{endpoint="zeek"} 324
# HELP zeek_monitored_tunnels_active Number of currently active tunnels as tracked in Tunnel::active
# TYPE zeek_monitored_tunnels_active gauge
zeek_monitored_tunnels_active{endpoint="zeek"} 12

Instead of tracking footprints per variable, global_container_footprints could be leveraged to track the footprints of all global containers at once, using the variable name as label.

Connection Durations as Histogram

To track the distribution of certain measurements, a Telemetry::Histogram can be used. The histogram’s buckets have to be preconfigured.

The following example observes the duration of each connection that Zeek has monitored.

connection-durations.zeek
global conn_durations_hf = Telemetry::register_histogram_family([
    $prefix="zeek",
    $name="monitored_connection_duration",
    $unit="seconds",
    $help_text="Duration of monitored connections",
    $bounds=vector(0.1, 1.0, 10.0, 30.0, 60.0),
    $label_names=vector("proto", "service")
]);

event connection_state_remove(c: connection)
    {
    local proto = cat(c$conn$proto);
    local service: set[string] = {"unknown"};

    if ( |c$service| != 0 )
        service = c$service;

    for ( s in service )
        {
        local h = Telemetry::histogram_with(conn_durations_hf, vector(proto, to_lower(s)));
        Telemetry::histogram_observe(h, interval_to_double(c$duration));
        }
    }

Due to the way Prometheus represents histograms and the fact that durations are broken down by protocol and service in the given example, the resulting representation becomes rather verbose.

$ curl -s localhost:9090/metrics | grep monitored_connection_duration
# HELP zeek_monitored_connection_duration_seconds Duration of monitored connections
# TYPE zeek_monitored_connection_duration_seconds histogram
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="0.1"} 970
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="1"} 998
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="10"} 1067
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="30"} 1108
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="60"} 1109
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="+Inf"} 1109
zeek_monitored_connection_duration_seconds_sum{endpoint="zeek",proto="udp",service="dns"} 1263.085691
zeek_monitored_connection_duration_seconds_count{endpoint="zeek",proto="udp",service="dns"} 1109
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="0.1"} 16
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="1"} 54
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="10"} 56
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="30"} 57
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="60"} 57
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="+Inf"} 57

To work with histogram data, Prometheus provides specialized query functions, for example histogram_quantile().
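
As a sketch, assuming the histogram from the example above is being scraped, a PromQL query along these lines computes an approximate 90th percentile of connection durations over the past five minutes:

```
histogram_quantile(0.9,
  sum by (le) (rate(zeek_monitored_connection_duration_seconds_bucket[5m])))
```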

Note that a precise histogram of connection durations can be computed by post-processing conn.log, which may well be preferable. The above example is meant for demonstration purposes. Histograms are primarily useful for Zeek operational metrics, such as processing times, queueing delays, or response times to external systems.

Exporting the Zeek Version

A common pattern in the Prometheus ecosystem is to expose the version information of the running process as a gauge metric with a value of 1.

The following example does just that with a Zeek script:

version.zeek
global version_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="version_info",
    $unit="1",
    $help_text="The Zeek version",
    $label_names=vector("version_number", "major", "minor", "patch",
                        "commit", "beta", "debug", "version_string")
]);

event zeek_init()
    {
    local v = Version::info;
    local labels = vector(cat(v$version_number),
                          cat(v$major), cat(v$minor), cat(v$patch),
                          cat(v$commit),
                          v$beta ? "true" : "false",
                          v$debug ? "true" : "false",
                          v$version_string);
    Telemetry::gauge_family_set(version_gf, labels, 1.0);
    }

In Prometheus’s exposition format, this turns into the following:

$ curl -s localhost:9090/metrics | grep version
# HELP zeek_version_info The Zeek version
# TYPE zeek_version_info gauge
zeek_version_info{beta="true",commit="0",debug="true",major="7",minor="0",patch="0",version_number="70000",version_string="7.0.0-rc4-debug"} 1

Zeek already ships with this gauge, via base/frameworks/telemetry/main.zeek. There is no need to add the above snippet to your site.