Telemetry Framework
Note
This framework changed considerably with Zeek 7, and is not API-compatible with earlier versions. While earlier versions relied on an implementation in Broker, Zeek now maintains its own implementation, building on prometheus-cpp, with Broker adding its telemetry to Zeek’s internal registry of metrics.
The telemetry framework continuously collects metrics during Zeek’s operation, and provides ways to export this telemetry to third-party consumers. Zeek ships with a pre-defined set of metrics and allows you to add your own, via script-layer and in-core APIs you use to instrument relevant parts of the code. Metrics target Zeek’s operational behavior, or track characteristics of monitored traffic. Metrics are not an additional export vehicle for Zeek’s various regular logs. Zeek’s telemetry data model closely resembles that of Prometheus, and supports its text-based exposition format for scraping by third-party collectors.
This document outlines usage examples, and gives brief API examples for
composing your own metrics. Head to the Telemetry
API documentation
for more details.
Metric Types
Zeek supports the following metric types:
- Counter
A continuously increasing value, resetting on process restart. Examples of counters are the number of log writes since process start, packets processed, or process_seconds representing CPU usage.
- Gauge
A gauge metric is a numerical value that can increase and decrease over time. Examples are table sizes or the val_footprint of Zeek script values over the lifetime of the process. More general examples include a temperature or memory usage.
- Histogram
Pre-configured buckets of observations with corresponding counts. Examples of histograms are connection durations, delays, or transfer sizes. Generally, it is useful to know the expected range and distribution of such values, as the bounds of a histogram’s buckets are defined when this metric gets created.
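The bucketing semantics can be sketched in a few lines of Python (illustrative only; Zeek's actual implementation builds on prometheus-cpp): each observation lands in the first bucket whose upper bound covers it, and buckets are exposed cumulatively alongside a running sum and count.

```python
# Illustrative sketch of Prometheus-style histogram semantics, not Zeek's
# actual implementation: "le" buckets plus a sum and a total count.
import bisect
import math

class Histogram:
    def __init__(self, bounds):
        self.bounds = list(bounds) + [math.inf]   # implicit +Inf bucket
        self.counts = [0] * len(self.bounds)      # per-bucket, non-cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" semantics).
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        # Prometheus exposes buckets cumulatively.
        out, running = [], 0
        for bound, n in zip(self.bounds, self.counts):
            running += n
            out.append((bound, running))
        return out

h = Histogram([0.1, 1.0, 10.0])
for v in [0.05, 0.5, 2.0, 50.0]:
    h.observe(v)
print(h.cumulative())  # [(0.1, 1), (1.0, 2), (10.0, 3), (inf, 4)]
```

Note how the observation 50.0 exceeds every configured bound and is only counted in the implicit +Inf bucket, which is why choosing bounds that match the expected value range matters.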
Zeek uses double
throughout to track metric values. Since
terminology around telemetry can be complex, it helps to know a few additional
terms:
- Labels
A given metric sometimes doesn’t exist in isolation, but comes with additional labeling to disambiguate related observations. For example, Zeek ships with a gauge called zeek_active_sessions that labels counts for TCP, UDP, and other transport protocols separately. Labels have a name (for example, “protocol”) that refers to a value (such as “tcp”). A metric can have multiple labels. Labels are thus a way to associate textual information with the numerical values of metrics.
- Family
The set of such metrics, differing only by their labeling, is known as a Family. Zeek’s script-layer metrics API lets you operate on individual metrics and families.
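The relationship between a family and its labeled metrics can be sketched in Python (illustrative only, not Zeek's actual API): a family maps tuples of label values to individual metric instances.

```python
# Sketch of the Family concept: one family, many metrics distinguished only
# by their label values. Illustrative Python, not Zeek's API; the family
# name and label below are made up for the example.
from collections import defaultdict

class CounterFamily:
    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        # One counter value per distinct tuple of label values.
        self.metrics = defaultdict(float)

    def inc(self, *label_values, amount=1.0):
        assert len(label_values) == len(self.label_names)
        self.metrics[label_values] += amount

sessions = CounterFamily("example_requests", ["protocol"])
sessions.inc("tcp")
sessions.inc("tcp")
sessions.inc("udp")
print(dict(sessions.metrics))  # {('tcp',): 2.0, ('udp',): 1.0}
```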
Zeek has no equivalent to Prometheus’s Summary type. A good reference to consult for more details is the official Prometheus Metric Types documentation.
Cluster Considerations
When running Zeek as a cluster, every node maintains its own metrics registry, independently of the other nodes. Zeek does not automatically synchronize, centralize, or aggregate metrics across the cluster. Instead, it adds the name of the node a particular metric originated from at collection time, leaving any aggregation to post-processing where desired.
Accordingly, the Telemetry::collect_metrics
and
Telemetry::collect_histogram_metrics
functions only return
node-local metrics.
Metrics Export
Zeek supports two mechanisms for exporting telemetry: traditional logs, and Prometheus-compatible endpoints for scraping by a third-party service. We cover them in turn.
Zeek Logs
Zeek can export current metrics continuously via telemetry.log and
telemetry_histogram.log. It does not do so by default. To enable, load the
policy script frameworks/telemetry/log
on the command line, or via
local.zeek
.
The Telemetry::Info
and Telemetry::HistogramInfo
records
define the logs. Both records include a peer
field that conveys the
cluster node the metric originated from.
By default, Zeek reports current telemetry every 60 seconds, as defined by the
Telemetry::log_interval
option, which you’re free to adjust.
Also, by default only metrics with the prefixes (namespaces) zeek
and
process
are included in the above logs. If you add new metrics with your own
prefix and expect these to be included, redefine the
Telemetry::log_prefixes
option:
@load frameworks/telemetry/log
redef Telemetry::log_prefixes += { "my_prefix" };
Clearing the set will cause all metrics to be logged. As with any logs, you may
employ policy hooks,
Telemetry::log_policy
and
Telemetry::log_policy_histogram
, to define potentially more granular
filtering.
Native Prometheus Export
Every Zeek process, regardless of whether it’s running long-term standalone or as part of a cluster, can run an HTTP server that renders current telemetry in Prometheus’s text-based exposition format.
The Telemetry::metrics_port
variable controls this behavior. Its
default of 0/unknown
disables exposing the port; setting it to another TCP
port will enable it. In clusterized operation, the cluster topology can specify
each node’s metrics port via the corresponding Cluster::Node
field,
and the framework will adjust Telemetry::metrics_port
accordingly. Both
zeekctl and the management framework let you define specific ports and can also
auto-populate their values, similarly to Broker’s listening ports.
To query a node’s telemetry, point an HTTP client or Prometheus scraper at the node’s metrics port:
$ curl -s http://<node>:<node-metrics-port>/metrics
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 0
...
# HELP zeek_event_handler_invocations_total Number of times the given event handler was called
# TYPE zeek_event_handler_invocations_total counter
zeek_event_handler_invocations_total{endpoint="manager",name="run_sync_hook"} 2
...
To simplify telemetry collection from all nodes in a cluster, Zeek supports
Prometheus HTTP Service Discovery on the manager node. In this approach, the
endpoint http://<manager>:<manager-metrics-port>/services.json
returns a
JSON data structure that itemizes all metrics endpoints in the
cluster. Prometheus scrapers supporting service discovery then proceed to
collect telemetry from the listed endpoints in turn. See the Prometheus Getting
Started Guide for additional information.
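The shape of that JSON document follows Prometheus's HTTP service discovery format: an array of objects, each with a list of targets and optional labels. A consumer can expand it into scrape URLs, as sketched below in Python (the sample document and its port numbers are hypothetical, not verbatim Zeek output):

```python
# Sketch of consuming a Prometheus HTTP SD document as served at
# /services.json. Per the Prometheus HTTP SD format, the document is a JSON
# array of {"targets": [...], "labels": {...}} objects. The sample below is
# hypothetical, not actual Zeek output.
import json

services_json = """
[
  {"targets": ["manager:9991"], "labels": {"job": "zeek"}},
  {"targets": ["worker-1:9992", "worker-2:9993"], "labels": {"job": "zeek"}}
]
"""

def scrape_urls(doc):
    # Expand every target of every target group into a /metrics URL.
    return ["http://%s/metrics" % t
            for group in json.loads(doc)
            for t in group.get("targets", [])]

print(scrape_urls(services_json))
# ['http://manager:9991/metrics', 'http://worker-1:9992/metrics',
#  'http://worker-2:9993/metrics']
```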
Note
Changed in version 7.0.
The built-in aggregation for Zeek telemetry to the manager node has been removed, in favor of the Prometheus-compatible service discovery endpoint. The new approach requires cluster administrators to manage access to the additional ports. However, it allows Prometheus to conduct the aggregation, instead of burdening the Zeek manager with it, which has historically proved expensive.
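With aggregation left to post-processing, a consumer can, for example, sum a counter family across all nodes after scraping. The Python sketch below uses a deliberately minimal parser that handles only the simple label syntax shown in the curl output above, not the full exposition grammar; the sample lines and node names are hypothetical.

```python
# Minimal sketch: summing one counter family across cluster nodes from
# Prometheus text-exposition lines. Only handles the simple label syntax
# seen above (no escaped or comma-containing label values); sample lines
# are hypothetical.
import re
from collections import defaultdict

lines = [
    'zeek_log_writes_total{endpoint="worker-1",log_id="conn_log"} 215',
    'zeek_log_writes_total{endpoint="worker-2",log_id="conn_log"} 187',
    'zeek_log_writes_total{endpoint="worker-1",log_id="dns_log"} 912',
]

sample_re = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')

def sum_by_label(samples, label):
    # Aggregate sample values grouped by one label (e.g. log_id),
    # collapsing the per-node "endpoint" dimension.
    totals = defaultdict(float)
    for line in samples:
        m = sample_re.match(line)
        if not m:
            continue
        labels = dict(kv.split("=", 1) for kv in m.group(2).split(","))
        totals[labels[label].strip('"')] += float(m.group(3))
    return dict(totals)

print(sum_by_label(lines, "log_id"))  # {'conn_log': 402.0, 'dns_log': 912.0}
```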
If these setups aren’t right for your environment, you can redefine the
options in local.zeek
to something more suitable. For example,
the following snippet opens an individual Prometheus port for each Zeek process
(relative to the port used in cluster-layout.zeek
):
@load base/frameworks/cluster
global my_node = Cluster::nodes[Cluster::node];
global my_metrics_port = count_to_port(port_to_count(my_node$p) - 1000, tcp);
redef Telemetry::metrics_port = my_metrics_port;
As a different example, to change the metrics port only on the manager process, say from 9911 to 1234, use the following snippet:
@load base/frameworks/cluster

@if ( Cluster::local_node_type() == Cluster::MANAGER )
redef Telemetry::metrics_port = 1234/tcp;
@endif
Examples of Metrics Application
Counting Log Writes per Stream
In combination with the Log::log_stream_policy
hook, it is
straightforward to record Log::write
invocations over the dimension
of the Log::ID
value. This section shows three different approaches
to do this. Which approach is most applicable depends mostly on the expected
script layer performance overhead for updating the metric. For example, calling
Telemetry::counter_with
and Telemetry::counter_inc
within a handler of a high-frequency event may be prohibitive, while for a
low-frequency event it’s unlikely to matter.
Assuming a Telemetry::metrics_port
of 9090, querying the Prometheus
endpoint using curl
provides output resembling the following for each of
the three approaches.
$ curl -s localhost:9090/metrics | grep log_writes
# HELP zeek_log_writes_total Number of log writes per stream
# TYPE zeek_log_writes_total counter
zeek_log_writes_total{endpoint="zeek",log_id="packetfilter_log"} 1
zeek_log_writes_total{endpoint="zeek",log_id="loadedscripts_log"} 477
zeek_log_writes_total{endpoint="zeek",log_id="stats_log"} 1
zeek_log_writes_total{endpoint="zeek",log_id="dns_log"} 200
zeek_log_writes_total{endpoint="zeek",log_id="ssl_log"} 9
zeek_log_writes_total{endpoint="zeek",log_id="conn_log"} 215
zeek_log_writes_total{endpoint="zeek",log_id="captureloss_log"} 1
The above shows a family of 7 zeek_log_writes_total
metrics, each with an
endpoint
label (here, zeek
, which would be a cluster node name if
scraped from a Zeek cluster) and a log_id
one.
Immediate
The following example creates a global counter family object and uses
the Telemetry::counter_family_inc
helper to increment the
counter metric associated with a string representation of the Log::ID
value.
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    local log_id = to_lower(gsub(cat(id), /:+/, "_"));
    Telemetry::counter_family_inc(log_writes_cf, vector(log_id));
    }
With a few lines of script code, Zeek now tracks log writes per stream, ready to be scraped by a Prometheus server.
Cached
For cases where creating the label value (stringification, gsub
and to_lower
)
and instantiating the label vector, as well as invoking the
Telemetry::counter_family_inc
method, cause too much
performance overhead, the counter instances can also be cached in a lookup table.
The counters can then be incremented directly with Telemetry::counter_inc
.
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

# Cache for the Telemetry::Counter instances.
global log_write_counters: table[Log::ID] of Telemetry::Counter;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    if ( id !in log_write_counters )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        log_write_counters[id] = Telemetry::counter_with(log_writes_cf,
                                                         vector(log_id));
        }

    Telemetry::counter_inc(log_write_counters[id]);
    }
For metrics without labels, the metric instances can also be cached as global variables directly. The following example counts the number of HTTP requests.
global http_counter_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="monitored_http_requests",
    $unit="1",
    $help_text="Number of http requests observed"
]);

global http_counter = Telemetry::counter_with(http_counter_cf);

event http_request(c: connection, method: string, original_URI: string,
                   unescaped_URI: string, version: string)
    {
    Telemetry::counter_inc(http_counter);
    }
Sync
In case the scripting overhead of the previous approach is still too high,
individual writes (or events) can be tracked in a table and then
synchronized / mirrored during execution of the Telemetry::sync
hook.
global log_writes_cf = Telemetry::register_counter_family([
    $prefix="zeek",
    $name="log_writes",
    $unit="1",
    $help_text="Number of log writes per stream",
    $label_names=vector("log_id")
]);

global log_writes: table[Log::ID] of count &default=0;

hook Log::log_stream_policy(rec: any, id: Log::ID)
    {
    ++log_writes[id];
    }

hook Telemetry::sync()
    {
    for ( id, v in log_writes )
        {
        local log_id = to_lower(gsub(cat(id), /:+/, "_"));
        # Increment by the number of writes observed since the last sync.
        Telemetry::counter_family_inc(log_writes_cf, vector(log_id), v);
        }

    # Reset the interim table; the counters now carry the totals.
    log_writes = table();
    }
For the use case of tracking log writes, this is unlikely to be required, but
for updating metrics within high-frequency events that otherwise have very
low processing overhead, it’s a valuable approach. Note that with this method,
metrics will be stale for up to Telemetry::sync_interval
.
Table sizes
It can be useful to expose the size of tables as metrics, as they often
indicate the approximate amount of state maintained in memory.
As table sizes may increase and decrease, a Telemetry::Gauge
is appropriate for this purpose.
The following example records the size of the Tunnel::active
table
and its footprint with two gauges. The gauges are updated during the
Telemetry::sync
hook. Note that no labels are in use; both
gauge instances are simple globals.
module Tunnel;

global tunnels_active_size_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active",
    $unit="1",
    $help_text="Number of currently active tunnels as tracked in Tunnel::active"
]);

global tunnels_active_size_gauge = Telemetry::gauge_with(tunnels_active_size_gf);

global tunnels_active_footprint_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="monitored_tunnels_active_footprint",
    $unit="1",
    $help_text="Footprint of the Tunnel::active table"
]);

global tunnels_active_footprint_gauge = Telemetry::gauge_with(tunnels_active_footprint_gf);

hook Telemetry::sync()
    {
    Telemetry::gauge_set(tunnels_active_size_gauge, |Tunnel::active|);
    Telemetry::gauge_set(tunnels_active_footprint_gauge, val_footprint(Tunnel::active));
    }
Example representation of these metrics when querying the Prometheus endpoint:
$ curl -s localhost:9090/metrics | grep tunnel
# HELP zeek_monitored_tunnels_active_footprint Footprint of the Tunnel::active table
# TYPE zeek_monitored_tunnels_active_footprint gauge
zeek_monitored_tunnels_active_footprint{endpoint="zeek"} 324
# HELP zeek_monitored_tunnels_active Number of currently active tunnels as tracked in Tunnel::active
# TYPE zeek_monitored_tunnels_active gauge
zeek_monitored_tunnels_active{endpoint="zeek"} 12
Instead of tracking footprints per variable, global_container_footprints
could be leveraged to track all global containers at once, using the variable
name as label.
Connection Durations as Histogram
To track the distribution of certain measurements, a Telemetry::Histogram
can be used. The histogram’s buckets have to be preconfigured.
The following example observes the duration of each connection that Zeek has monitored.
global conn_durations_hf = Telemetry::register_histogram_family([
    $prefix="zeek",
    $name="monitored_connection_duration",
    $unit="seconds",
    $help_text="Duration of monitored connections",
    $bounds=vector(0.1, 1.0, 10.0, 30.0, 60.0),
    $label_names=vector("proto", "service")
]);

event connection_state_remove(c: connection)
    {
    local proto = cat(c$conn$proto);
    local service: set[string] = {"unknown"};

    if ( |c$service| != 0 )
        service = c$service;

    for ( s in service )
        {
        local h = Telemetry::histogram_with(conn_durations_hf, vector(proto, to_lower(s)));
        Telemetry::histogram_observe(h, interval_to_double(c$duration));
        }
    }
Due to the way Prometheus represents histograms and the fact that durations are broken down by protocol and service in the given example, the resulting representation becomes rather verbose.
$ curl -s localhost:9090/metrics | grep monitored_connection_duration
# HELP zeek_monitored_connection_duration_seconds Duration of monitored connections
# TYPE zeek_monitored_connection_duration_seconds histogram
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="0.1"} 970
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="1"} 998
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="10"} 1067
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="30"} 1108
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="60"} 1109
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="udp",service="dns",le="+Inf"} 1109
zeek_monitored_connection_duration_seconds_sum{endpoint="zeek",proto="udp",service="dns"} 1263.085691
zeek_monitored_connection_duration_seconds_count{endpoint="zeek",proto="udp",service="dns"} 1109
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="0.1"} 16
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="1"} 54
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="10"} 56
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="30"} 57
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="60"} 57
zeek_monitored_connection_duration_seconds_bucket{endpoint="zeek",proto="tcp",service="http",le="+Inf"} 57
To work with histogram data, Prometheus provides specialized query functions, for example histogram_quantile().
Note that when using data from conn.log in post-processing, a precise histogram of connection durations can be calculated, which may be preferable. The above example is meant for demonstration purposes. Histograms are primarily useful for Zeek operational metrics such as processing times, queueing delays, or response times to external systems.
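The interpolation behind histogram_quantile() for classic histograms can be sketched in Python: find the bucket containing the target rank, then assume observations are uniformly distributed within it. The bucket data below is taken from the DNS rows of the curl output above; this is a simplified sketch of the documented behavior, not Prometheus's actual implementation.

```python
# Sketch of histogram_quantile()-style estimation on cumulative "le"
# buckets. Simplified relative to Prometheus's implementation; bucket data
# copied from the zeek_monitored_connection_duration_seconds example.
import math

def histogram_quantile(q, buckets):
    # buckets: ascending (le, cumulative_count) pairs, ending with (inf, total).
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # fell into +Inf: report last finite bound
            if count == prev_count:
                return le
            # Linear interpolation inside the bucket holding the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

dns_buckets = [(0.1, 970), (1.0, 998), (10.0, 1067),
               (30.0, 1108), (60.0, 1109), (math.inf, 1109)]

print(histogram_quantile(0.5, dns_buckets))   # ~0.0572: most DNS flows are short
print(histogram_quantile(0.99, dns_buckets))  # ~25.08
```

The coarse bounds make the 99th percentile estimate land anywhere inside the wide 10-30 second bucket, which illustrates why bucket bounds should be chosen around the value range you actually care about.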
Exporting the Zeek Version
A common pattern in the Prometheus ecosystem is to expose the version information of the running process as a gauge metric with a value of 1.
The following example does just that with a Zeek script:
global version_gf = Telemetry::register_gauge_family([
    $prefix="zeek",
    $name="version_info",
    $unit="1",
    $help_text="The Zeek version",
    $label_names=vector("version_number", "major", "minor", "patch", "commit", "beta", "debug", "version_string")
]);

event zeek_init()
    {
    local v = Version::info;
    local labels = vector(cat(v$version_number),
                          cat(v$major), cat(v$minor), cat(v$patch),
                          cat(v$commit),
                          v$beta ? "true" : "false",
                          v$debug ? "true" : "false",
                          v$version_string);
    Telemetry::gauge_family_set(version_gf, labels, 1.0);
    }
In Prometheus’s exposition format, this turns into the following:
$ curl -s localhost:9090/metrics | grep version
# HELP zeek_version_info The Zeek version
# TYPE zeek_version_info gauge
zeek_version_info{beta="true",commit="0",debug="true",major="7",minor="0",patch="0",version_number="70000",version_string="7.0.0-rc4-debug"} 1
Zeek already ships with this gauge, via base/frameworks/telemetry/main.zeek. There is no need to add the above snippet to your site.