Supervisor Framework

The Supervisor framework enables an entirely new mode for Zeek, one that supervises a set of Zeek processes that are meant to be persistent. A Supervisor automatically revives any process that dies or exits prematurely and also arranges for an ordered shutdown of the entire process tree upon its own termination. This Supervisor mode for Zeek provides the basic foundation for process configuration/management that could be used to deploy a Zeek cluster similar to what ZeekControl does, but is also simpler to integrate as a standard system service.

Simple Example

A simple example of using the Supervisor to monitor one Zeek process sniffing packets from an interface looks like the following:

$ zeek -j simple-supervisor.zeek
simple-supervisor.zeek
event zeek_init()
    {
    if ( Supervisor::is_supervisor() )
        {
        local sn = Supervisor::NodeConfig($name="foo", $interface="en0");
        local res = Supervisor::create(sn);

        if ( res == "" )
            print "supervisor created a new node";
        else
            print "supervisor failed to create node", res;
        }
    else
        print fmt("supervised node '%s' zeek_init()", Supervisor::node()$name);
    }

event zeek_done()
    {
    if ( Supervisor::is_supervised() )
        print fmt("supervised node '%s' zeek_done()", Supervisor::node()$name);
    else
        print "supervisor zeek_done()";
    }

The -j command-line argument tells Zeek to run in “Supervisor mode”, which allows it to create and manage child processes. If you’re going to test this locally, be sure to change en0 to a real interface name you can sniff.

Notice that the simple-supervisor.zeek script is loaded and executed by both the main Supervisor process and the child Zeek process that it spawns via Supervisor::create. The Supervisor::is_supervisor and Supervisor::is_supervised functions distinguish the Supervisor process from a supervised child process, respectively. You can also distinguish between multiple supervised child processes by inspecting the contents of Supervisor::node (e.g. by comparing node names).

If you happen to be running this locally on an interface with checksum offloading and want Zeek to ignore checksums, simply add the -C command-line argument:

$ zeek -j -C simple-supervisor.zeek

Most command-line arguments to Zeek are automatically inherited by any supervised child processes that get created. The notable exceptions are the options for reading pcap files and sniffing live interfaces, -r and -i, respectively, which are not inherited.

For node-specific configuration options, see Supervisor::NodeConfig which gets passed as argument to Supervisor::create.
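As a sketch of what node-specific configuration can look like, the following assigns a node its own working directory, redirects its output into files, and loads an additional script in that node only. The field names used here ($directory, $stdout_file, $stderr_file, $scripts) and the hypothetical script name extra-policy.zeek are illustrative; verify the exact fields against the Supervisor::NodeConfig record in your Zeek version.

event zeek_init()
    {
    if ( ! Supervisor::is_supervisor() )
        return;

    local sn = Supervisor::NodeConfig($name="foo", $interface="en0");
    # Run the node in its own working directory.
    sn$directory = "foo";
    # Capture the node's stdout/stderr in files instead of (or in addition
    # to) forwarding it through the Supervisor process.
    sn$stdout_file = "stdout.log";
    sn$stderr_file = "stderr.log";
    # Load an additional script in this node only (hypothetical script name).
    sn$scripts = vector("extra-policy.zeek");

    local res = Supervisor::create(sn);

    if ( res != "" )
        print fmt("supervisor failed to create node: %s", res);
    }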

Supervised Cluster Example

To run a full Zeek cluster similar to what you may already know, try the following script:

$ zeek -j cluster-supervisor.zeek
cluster-supervisor.zeek
event zeek_init()
    {
    if ( ! Supervisor::is_supervisor() )
        return;

    Broker::listen("127.0.0.1", 9999/tcp);

    local cluster: table[string] of Supervisor::ClusterEndpoint;
    cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
    cluster["logger"] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=10001/tcp];
    cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10002/tcp];
    cluster["worker"] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=10003/tcp, $interface="en0"];

    for ( n, ep in cluster )
        {
        local sn = Supervisor::NodeConfig($name=n);
        sn$cluster = cluster;
        sn$directory = n;

        if ( ep?$interface )
            sn$interface = ep$interface;

        local res = Supervisor::create(sn);

        if ( res != "" )
            print fmt("supervisor failed to create node '%s': %s", n, res);
        }
    }

This script spawns four nodes: a cluster manager, logger, proxy, and worker. It also configures each node to use a separate working directory, named after the node, within the current working directory of the Supervisor process. Any stdout/stderr output of the nodes is automatically redirected through the Supervisor process and prefixed with relevant information, such as the name of the node the output came from.

The Supervisor process also listens on a port of its own for further instructions from other external/remote processes via Broker::listen. For example, you could use this other script to tell the Supervisor to restart all processes, perhaps to re-load Zeek scripts you’ve changed in the meantime:

$ zeek supervisor-control.zeek
supervisor-control.zeek
event zeek_init()
    {
    Broker::peer("127.0.0.1", 9999/tcp, 1sec);
    }

event Broker::peer_added(endpoint: Broker::EndpointInfo, msg: string)
    {
    Broker::publish(SupervisorControl::topic_prefix, SupervisorControl::restart_request, "", "");
    }

event SupervisorControl::restart_response(reqid: string, result: bool)
    {
    print fmt("got result of supervisor restart request: %s", result);
    terminate();
    }

Any Supervisor instruction you can perform via an API call in a local script can also be triggered via an associated external event.
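For instance, alongside the restart events used above, a remote script could query the status of supervised nodes. The sketch below assumes SupervisorControl::status_request and SupervisorControl::status_response event signatures as found in the SupervisorControl API; verify them against base/frameworks/supervisor/control.zeek for your Zeek version.

event zeek_init()
    {
    Broker::peer("127.0.0.1", 9999/tcp, 1sec);
    }

event Broker::peer_added(endpoint: Broker::EndpointInfo, msg: string)
    {
    # An empty node name requests status for all supervised nodes.
    Broker::publish(SupervisorControl::topic_prefix, SupervisorControl::status_request, "my-request", "");
    }

event SupervisorControl::status_response(reqid: string, result: Supervisor::Status)
    {
    for ( name in result$nodes )
        print fmt("supervisor manages a node named '%s'", name);

    terminate();
    }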

For further details, consult the Supervisor API at base/frameworks/supervisor/api.zeek and SupervisorControl API (for remote management) at base/frameworks/supervisor/control.zeek.

Internal Architecture

The following details aren’t necessarily important for most users, but instead aim to give developers a high-level overview of how the process supervision framework is implemented. The process tree in “supervisor” mode looks like:

../_images/zeek-supervisor-architecture.png

The top-level “Supervisor” process does not directly manage any of the supervised nodes that are created. Instead, it spawns an intermediate process, called the “Stem”, to manage the lifetime of supervised nodes. This is done for two reasons:

  1. It avoids the need to exec() the supervised processes. An exec() would run whatever version of the zeek binary happens to exist on the filesystem at the time of the call, and that binary may have changed since the Supervisor started. Avoiding this sidesteps potential incompatibility or race-condition pitfalls associated with system maintenance/upgrades. The one situation that still requires an exec() is the Stem process dying prematurely, but that is expected to be a rare scenario.

  2. Zeek run-time operation generally taints global state, so creating an early fork() for use as the Stem process provides a pure baseline image to use for supervised processes.

Ultimately, there are two tiers of process supervision happening: the Supervisor will revive the Stem process if needed and the Stem process will revive any of its children when needed.

Also, both the Stem and any of its supervised child processes automatically detect if they are orphaned from their parent process and self-terminate. The Stem checks for orphaning simply by waking up every second from its poll() loop and checking whether its parent PID has changed. A supervised node checks for orphaning similarly, but does so from a recurring Timer. Other than the orphaning check, and the way it establishes its desired configuration from a combination of inherited command-line arguments and Supervisor-specific options, a supervised node does not operate differently at run-time from a traditional Zeek process.
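As a sketch of how a script might inspect those Supervisor-specific options from inside a supervised node, the following reads its own Supervisor::NodeConfig via Supervisor::node (field names as in the earlier examples; check the Supervisor API for the full record definition):

event zeek_init()
    {
    if ( ! Supervisor::is_supervised() )
        return;

    # Supervisor::node() returns this node's own NodeConfig record.
    local cfg = Supervisor::node();
    print fmt("node '%s' starting up", cfg$name);

    # Optional fields must be checked before use.
    if ( cfg?$interface )
        print fmt("node '%s' sniffs interface %s", cfg$name, cfg$interface);
    }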

Node Revival

The Supervisor framework assumes that supervised nodes run until something asks the Supervisor to stop them. When a supervised node exits unexpectedly, the Stem attempts to revive it during its periodic polling routine. This revival procedure uses exponential backoff: starting from a delay of one second, the Stem revives the node up to 3 times. If the node keeps exiting, the Stem doubles the revival delay and again tries up to 3 times. This continues indefinitely, so the Stem never gives up on a node while the revival delay keeps growing (1 second for the first 3 attempts, 2 seconds for the next 3, then 4 seconds, and so on). Once a supervised node has remained up for at least 30 seconds, its revival state resets, and any future exits start the procedure from scratch as just described. The Supervisor codebase currently hard-wires these thresholds and delays.