Telemetry Framework

Note

This framework changed considerably with Zeek 7, and is not API-compatible with earlier versions. While earlier versions relied on an implementation in Broker, Zeek now maintains its own implementation, building on prometheus-cpp, with Broker adding its telemetry to Zeek’s internal registry of metrics.

The telemetry framework continuously collects metrics during Zeek’s operation, and provides ways to export this telemetry to third-party consumers. Zeek ships with a pre-defined set of metrics and allows you to add your own, via script-layer and in-core APIs you use to instrument relevant parts of the code. Metrics target Zeek’s operational behavior, or track characteristics of monitored traffic. Metrics are not an additional export vehicle for Zeek’s various regular logs. Zeek’s telemetry data model closely resembles that of Prometheus, and supports its text-based exposition format for scraping by third-party collectors.

This section outlines usage examples, and gives brief API examples for composing your own metrics. Head to the Telemetry API documentation for more details.

Metric Types

Zeek supports the following metric types:

Counter
A continuously increasing value, resetting on process restart. Examples of counters are the number of log writes since process start, packets processed, or process_seconds representing CPU usage.

Gauge
A gauge metric is a numerical value that can increase and decrease over time. Examples are table sizes or the val_footprint of Zeek script values over the lifetime of the process. More general examples include a temperature or memory usage.

Histogram
Pre-configured buckets of observations with corresponding counts. Examples of histograms are connection durations, delays, or transfer sizes. Generally, it is useful to know the expected range and distribution of such values, as the bounds of a histogram’s buckets are defined when this metric gets created.

Zeek uses double throughout to track metric values. Since terminology around telemetry can be complex, it helps to know a few additional terms:

Labels
A given metric sometimes doesn’t exist in isolation, but comes with additional labeling to disambiguate related observations. For example, Zeek ships with gauge called zeek_active_sessions that labels counts for TCP, UDP, and other transport protocols separately. Labels have a name (for example, “protocol”) to refer to value (such as “tcp”). A metric can have multiple labels. Labels are thus a way to associate textual information with the numerical values of metrics.

Family
The set of such metrics, differing only by their labeling, is a known as a Family. Zeek’s script-layer metrics API lets you operate on individual metrics and families.

Zeek has no equivalent to Prometheus’s Summary type. A good reference to consult for more details is the official Prometheus Metric Types documentation.

Cluster Considerations

When running Zeek as a cluster, every node maintains its own metrics registry, independently of the other nodes. Zeek does not automatically synchronize, centralize, or aggregate metrics across the cluster. Instead, it adds the name of the node a particular metric originated from at collection time, leaving any aggregation to post-processing where desired.

Note

This is a departure from the design in earlier versions of Zeek, which could (either by default, or after activation) centralize metrics in the cluster’s manager node.

Accordingly, the Telemetry::collect_metrics and Telemetry::collect_histogram_metrics functions only return node-local metrics.

Metrics Export

Zeek supports two mechanisms for exporting telemetry: traditional logs, and Prometheus-compatible endpoints for scraping by a third-party service. We cover them in turn.

Zeek Logs

Zeek can export current metrics continuously via telemetry.log and telemetry_histogram.log. It does not do so by default. To enable, load the policy script frameworks/telemetry/log on the command line, or via local.zeek.

The Telemetry::Info and Telemetry::HistogramInfo records define the logs. Both records include a peer field that conveys the cluster node the metric originated from.

By default, Zeek reports current telemetry every 60 seconds, as defined by the Telemetry::log_interval, which you’re free to adjust.

Also, by default only metrics with the prefix (namespace) zeek and process are included in above logs. If you add new metrics with your own prefix and expect these to be included, redefine the Telemetry::log_prefixes option:

@load frameworks/telemetry/log

redef Telemetry::log_prefixes += { "my_prefix" };

Clearing the set will cause all metrics to be logged. As with any logs, you may employ policy hooks, Telemetry::log_policy and Telemetry::log_policy_histogram, to define potentially more granular filtering.

Native Prometheus Export

Every Zeek process, regardless of whether it’s running long-term standalone or as part of a cluster, can run an HTTP server that renders current telemetry in Prometheus’s text-based exposition format.

The Telemetry::metrics_port variable controls this behavior. Its default of 0/unknown disables exposing the port; setting it to another TCP port will enable it. In clusterized operation, the cluster topology can specify each node’s metrics port via the corresponding Cluster::Node field, and the framework will adjust Telemetry::metrics_port accordingly. Both zeekctl and the management framework let you define specific ports and can also auto-populate their values, similarly to Broker’s listening ports.

To query a node’s telemetry, point an HTTP client or Prometheus scraper at the node’s metrics port:

$ curl -s http://<node>:<node-metrics-port>/metrics
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 0
...
# HELP zeek_event_handler_invocations_total Number of times the given event handler was called
# TYPE zeek_event_handler_invocations_total counter
zeek_event_handler_invocations_total{endpoint="manager",name="run_sync_hook"} 2
...

To simplify telemetry collection from all nodes in a cluster, Zeek supports Prometheus HTTP Service Discovery on the manager node. Using this approach, the endpoint http://<manager>:<manager-metrics-port>/services.json returns a JSON data structure that itemizes all metrics endpoints in the cluster. Prometheus scrapers supporting service discovery then proceed to collect telemetry from the listed endpoints in turn.

The following is an example service discovery scrape config entry within Prometheus server’s prometheus.yml configuration file:

...
scrape_configs:
  - job_name: zeek-discovery
    scrape_interval: 5s
    http_sd_configs:
      - url: http://localhost:9991/services.json
        refresh_interval: 10s

See the Prometheus Getting Started Guide for additional information.

Note

Changed in version 7.0.

The built-in aggregation for Zeek telemetry to the manager node has been removed, in favor of the Prometheus-compatible service discovery endpoint. The new approach requires cluster administrators to manage access to the additional ports. However, it allows Prometheus to conduct the aggregation, instead of burdening the Zeek manager with it, which has historically proved expensive.

If these setups aren’t right for your environment, there’s the possibility to redefine the options in local.zeek to something more suitable. For example, the following snippet selects the metrics port of each Zeek process relative to the cluster port used in cluster-layout.zeek:

@load base/frameworks/cluster

global my_node = Cluster::nodes[Cluster::node];
global my_metrics_port = count_to_port(port_to_count(my_node$p) - 1000, tcp);

redef Telemetry::metrics_port = my_metrics_port;

Examples of Metrics Application

Counting Log Writes per Stream

In combination with the Log::log_stream_policy hook, it is straightforward to record Log::write invocations over the dimension of the Log::ID value. This section shows three different approaches to do this. Which approach is most applicable depends mostly on the expected script layer performance overhead for updating the metric. For example, calling Telemetry::counter_with and Telemetry::counter_inc within a handler of a high-frequency event may be prohibitive, while for a low-frequency event it’s unlikely to matter.

Assuming a Telemetry::metrics_port of 9090, querying the Prometheus endpoint using curl provides output resembling the following for each of the three approaches.

$ curl -s localhost:9090/metrics | grep log_writes
# HELP zeek_log_writes_total Number of log writes per stream
# TYPE zeek_log_writes_total counter
zeek_log_writes_total{endpoint="zeek",log_id="packetfilter_log"} 1
zeek_log_writes_total{endpoint="zeek",log_id="loadedscripts_log"} 477
zeek_log_writes_total{endpoint="zeek",log_id="stats_log"} 1
zeek_log_writes_total{endpoint="zeek",log_id="dns_log"} 200
zeek_log_writes_total{endpoint="zeek",log_id="ssl_log"} 9
zeek_log_writes_total{endpoint="zeek",log_id="conn_log"} 215
zeek_log_writes_total{endpoint="zeek",log_id="captureloss_log"} 1

The above shows a family of 7 zeek_log_writes_total metrics, each with an endpoint label (here, zeek, which would be a cluster node name if scraped from a Zeek cluster) and a log_id one.

Immediate

The following example creates a global counter family object and uses the Telemetry::counter_family_inc helper to increment the counter metric associated with a string representation of the Log::ID value.

log-writes-immediate.zeek

global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    local log_id = to_lower(gsub(cat(id), /:+/, "_"));
    Telemetry::counter_family_inc(log_writes_cf, vector(log_id));
    }

With a few lines of scripting code, Zeek now track log writes per stream ready to be scraped by a Prometheus server.

Cached

For cases where creating the label value (stringification, gsub and to_lower) and instantiating the label vector as well as invoking the Telemetry::counter_family_inc methods cause too much performance overhead, the counter instances can also be cached in a lookup table. The counters can then be incremented with Telemetry::counter_inc directly.

log-writes-cached.zeek

global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

# Cache for the Telemetry::Counter instances.
global log_write_counters: table[Log::ID] of Telemetry::Counter;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    if ( id !in log_write_counters )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        log_write_counters[id] = Telemetry::counter_with(log_writes_cf,
                                                         vector(log_id));
        }

    Telemetry::counter_inc(log_write_counters[id]);
    }

For metrics without labels, the metric instances can also be cached as global variables directly. The following example counts the number of http requests.

global-http-counter.zeek

global http_counter_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="monitored_http_requests",
    $unit="1",
    $help_text="Number of http requests observed"
]);

global http_counter = Telemetry::counter_with(http_counter_cf);

event http_request(c: connection, method: string, original_URI: string,
                   unescaped_URI: string, version: string)
    {
    Telemetry::counter_inc(http_counter);
    }

Sync

In case the scripting overhead of the previous approach is still too high, individual writes (or events) can be tracked in a table or global variable and then synchronized / mirrored to concrete counter and gauge instances during execution of the Telemetry::sync hook.

log-writes-sync.zeek

global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

global log_writes: table[Log::ID] of count &default=0;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    ++log_writes[id];
    }

hook Telemetry::sync()
    {
    for ( id, v in log_writes )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        Telemetry::counter_family_inc(log_writes_cf, vector(log_id));
        }
    }

For tracking log writes, this is unlikely to be required (and Zeek exposes various logging natively through the framework already), but for updating metrics within high frequency events that otherwise have low script processing overhead, it’s a valuable approach.

Changed in version 7.1.

The Telemetry::sync hook is invoked on-demand only. Either when one of the Telemetry::collect_metrics or Telemetry::collect_histogram_metrics functions is invoked, or when querying Prometheus endpoint. It’s an error to call either of the collection BiFs within the Telemetry::sync hook and results in a reporter warning.

Note

In versions before Zeek 7.1, Telemetry::sync was invoked on a fixed schedule, potentially resulting in stale metrics at collection time, as well as generating small runtime overhead when metrics are not collected.

Table Sizes

It can be useful to expose the size of tables as metrics, as they often indicate the approximate amount of state maintained in memory. As table sizes may increase and decrease, a Telemetry::Gauge is appropriate for this purpose.

The following example records the size of the Tunnel::active table and its footprint with two gauges. The gauges are updated during the Telemetry::sync hook. Note, there are no labels in use, both gauge instances are simple globals.

log-writes-sync.zeek

module Tunnel;

global tunnels_active_size_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active",
    $unit="1",
    $help_text="Number of currently active tunnels as tracked in Tunnel::active"
]);

global tunnels_active_size_gauge = Telemetry::gauge_with(tunnels_active_size_gf);

global tunnels_active_footprint_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active_footprint",
    $unit="1",
    $help_text="Footprint of the Tunnel::active table"
]);

global tunnels_active_footprint_gauge = Telemetry::gauge_with(tunnels_active_footprint_gf);

hook Telemetry::sync() {

    Telemetry::gauge_set(tunnels_active_size_gauge, |Tunnel::active|);
    Telemetry::gauge_set(tunnels_active_footprint_gauge, val_footprint(Tunnel::active));
}

Example representation of these metrics when querying the Prometheus endpoint:

$ curl -s localhost:9090/metrics | grep tunnel
# HELP zeek_monitored_tunnels_active_footprint Footprint of the Tunnel::active table
# TYPE zeek_monitored_tunnels_active_footprint gauge
zeek_monitored_tunnels_active_footprint{endpoint="zeek"} 324
# HELP zeek_monitored_tunnels_active Number of currently active tunnels as tracked in Tunnel::active
# TYPE zeek_monitored_tunnels_active gauge
zeek_monitored_tunnels_active{endpoint="zeek"} 12

Instead of tracking footprints per variable, global_container_footprints, could be leveraged to track all global containers at once, using the variable name as label.

Connection Durations as Histogram

To track the distribution of certain measurements, a Telemetry::Histogram can be used. The histogram’s buckets have to be preconfigured.

The following example observes the duration of each connection that Zeek has monitored.

connection-durations.zeek

global conn_durations_hf = Telemetry::register_histogram_family([
    $prefix="zeek",
    $name="monitored_connection_duration",
    $unit="seconds",
    $help_text="Duration of monitored connections",
    $bounds=vector(0.1, 1.0, 10.0, 30.0, 60.0),
    $label_names=vector("proto", "service")
]);

event connection_state_remove(c: connection)
    {
    local proto = cat(c$conn$proto);
    local service: set[string] = {"unknown"};

    if ( |c$service| != 0 )
        service = c$service;

    for (s in service )
        {
        local h = Telemetry::histogram_with(conn_durations_hf, vector(proto, to_lower(s)));
        Telemetry::histogram_observe(h, interval_to_double(c$duration));
        }
    }

Due to the way Prometheus represents histograms and the fact that durations are broken down by protocol and service in the given example, the resulting representation becomes rather verbose.

$ curl -s localhost:9090/metrics | grep monitored_connection_duration
# HELP zeek_monitored_connection_duration_seconds Duration of monitored connections
# TYPE zeek_monitored_connection_duration_seconds histogram
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="0.1"} 970
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="1"} 998
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="10"} 1067
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="30"} 1108
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="60"} 1109
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="+Inf"} 1109
zeek_monitored_connection_duration_seconds_sum{endpoint="zeek",proto="udp",service="dns"} 1263.085691
zeek_monitored_connection_duration_seconds_count{endpoint="zeek",proto="udp",service="dns"} 1109
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="0.1"} 16
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="1"} 54
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="10"} 56
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="30"} 57
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="60"} 57
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="+Inf"} 57

To work with histogram data, Prometheus provides specialized query functions. For example histogram_quantile().

Note, when using data from conn.log and post-processing, a proper histogram of connection durations can be calculated and possibly preferred. The above example is meant for demonstration purposes. Histograms may be primarily be useful for Zeek operational metrics such as processing times or queueing delays, response times to external systems, etc.

Exporting the Zeek Version

A common pattern in the Prometheus ecosystem is to expose the version information of the running process as gauge metric with a value of 1.

The following example does just that with a Zeek script:

version.zeek

global version_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="version_info",
    $unit="1",
    $help_text="The Zeek version",
    $label_names=vector("version_number", "major", "minor", "patch", "commit", "beta", "debug","version_string")
]);

event zeek_init()
    {
    local v = Version::info;
    local labels = vector(cat(v$version_number),
                          cat(v$major), cat(v$minor), cat (v$patch),
                          cat(v$commit),
                          v$beta ? "true" : "false",
                          v$debug ? "true" : "false",
                          v$version_string);
    Telemetry::gauge_family_set(version_gf, labels, 1.0);
    }

In Prometheus’s exposition format, this turns into the following:

$ curl -s localhost:9090/metrics | grep version
# HELP zeek_version_info The Zeek version
# TYPE zeek_version_info gauge
zeek_version_info{beta="true",commit="0",debug="true",major="7",minor="0",patch="0",version_number="70000",version_string="7.0.0-rc4-debug"} 1
zeek_version_info{beta="false",commit="289",debug="true",endpoint="zeek",major="5",minor="1",patch="0",version_number="50100",version_string="5.1.0-dev.289-debug"} 1.000000

Zeek already ships with this gauge, via base/frameworks/telemetry/main.zeek. There is no need to add above snippet to your site.