Event Metadata Plugin

Added in version 8.0.

Zeek’s plugin API allows adding metadata to Zeek events. In the Zeek-script layer, the EventMetadata::current and EventMetadata::current_all functions can be used to introspect metadata attached to events. In a Zeek cluster, metadata is transported via remote events for consumption by other Zeek nodes. This section describes the functionality in form of a tutorial. We’ll be using custom event metadata to track the latency of Zeek events in a cluster and expose them as a Prometheus histogram.

If you’re unfamiliar with plugin development, head over to the Writing Plugins section. For more information about telemetry and Prometheus, see also the Telemetry framework’s documentation.

Registering Metadata

Initially, we make Zeek’s core aware of the metadata to attach to events. This requires two steps. First, redefining the EventMetadata::ID enumeration with our custom enumeration value WALLCLOCK_TIMESTAMP. This is our metadata identifier. Its value represents the Unix timestamps when an event was published. Second, registering the metadata identifier with Zeek’s time type by calling EventMetadata::register in a zeek_init handler. This instructs Zeek to convert metadata items in received remote events with identifier 10001000 to a time value.

For simplicity, the second step is done in the plugin’s scripts/__init__.zeek file that’s loaded automatically when Zeek loads the plugin.

main.zeek
 1module EventLatency;
 2
 3redef enum EventMetadata::ID += {
 4    ## Identifier for the absolute time at which Zeek published this event.
 5    WALLCLOCK_TIMESTAMP = 10001000,
 6};
 7
 8event zeek_init()
 9    {
10    assert EventMetadata::register(WALLCLOCK_TIMESTAMP, time);
11    }

The 10001000 represents the metadata identifier for serialization purposes. It needs to be unique and have a defined meaning and consistent type for a given Zeek deployment. Metadata identifiers below 200 are reserved for Zeek’s internal use. Users are free to choose any other value. Zeek will fail to start or fail to register the type in the case of conflicting identifiers in third-party packages.

Implementing the Plugin

Next, we implement the InitPostScript(), HookPublishEvent() and HookQueueEvent() methods in our plugin. In the InitPostScript() method, a histogram instance is initialized using Zeek’s telemetry manager with hard-coded bounds. These define buckets for latency monitoring. The HookPublishEvent() method adds WALLCLOCK_TIMESTAMP metadata with the current time to the event, while the HookQueueEvent() method extracts the sender’s timestamp and computes the latency based on its own local time. Finally, the latency is recorded with the histogram by calling Observe().

main.zeek
 1void Plugin::InitPostScript() {
 2    double bounds[] = {0.0002, 0.0004, 0.0006, 0.0008, 0.0010, 0.0012, 0.0014, 0.0016, 0.0018, 0.0020};
 3    histogram =
 4        zeek::telemetry_mgr->HistogramInstance("zeek", "cluster_event_latency_seconds", {}, bounds, "event latency");
 5}
 6
 7bool Plugin::HookPublishEvent(zeek::cluster::Backend& backend, const std::string& topic,
 8                              zeek::cluster::detail::Event& event) {
 9    static const auto& wallclock_id = zeek::id::find_val<zeek::EnumVal>("EventLatency::WALLCLOCK_TIMESTAMP");
10
11    auto now_val = zeek::make_intrusive<zeek::TimeVal>(zeek::util::current_time(/*real=*/true));
12
13    if ( ! event.AddMetadata(wallclock_id, now_val) )
14        zeek::reporter->FatalError("failed to add wallclock timestamp metadata");
15
16    return true;
17}
18
19bool Plugin::HookQueueEvent(zeek::Event* event) {
20    static const auto& wallclock_id = zeek::id::find_val<zeek::EnumVal>("EventLatency::WALLCLOCK_TIMESTAMP");
21
22    if ( event->Source() == zeek::util::detail::SOURCE_LOCAL )
23        return false;
24
25    auto timestamps = event->MetadataValues(wallclock_id);
26
27    if ( timestamps->Size() > 0 ) {
28        double remote_ts = timestamps->ValAt(0)->AsTime();
29        auto now = zeek::util::current_time(/*real=*/true);
30        auto latency = std::max(0.0, now - remote_ts);
31
32        histogram->Observe(latency);
33    }
34    else
35        zeek::reporter->Warning("missing wallclock timestamp metadata");
36
37    return false;
38}

Resulting Prometheus Metrics

Deploying the plugin outlined above in a cluster and querying the manager’s metrics endpoint presents the following result:

$ curl -s localhost:10001/metrics | grep '^zeek_cluster_event_latency'
zeek_cluster_event_latency_seconds_count{endpoint="manager"} 11281
zeek_cluster_event_latency_seconds_sum{endpoint="manager"} 7.960928916931152
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0002"} 37
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0004"} 583
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0005999999999999999"} 3858
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0008"} 7960
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.001"} 10185
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0012"} 10957
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0014"} 11239
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0016"} 11269
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.0018"} 11279
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="0.002"} 11281
zeek_cluster_event_latency_seconds_bucket{endpoint="manager",le="+Inf"} 11281

This example indicates that there were a total of 11281 latencies observed, the summed up latency was around 8 seconds, 37 events had a latency less or equal to 0.2 milliseconds, 583 with less or equal than 0.4 milliseconds and none that took more than 2 milliseconds.

This sort of data is usually scraped and ingested by a Prometheus server and then visualized using Grafana.