File Analysis Framework

In the past, writing Zeek scripts to analyze file content could be cumbersome because the content was presented in different ways, via events, at the script layer, depending on which network protocol was involved in the file transfer. Scripts written to analyze files over one protocol had to be copied and modified to fit other protocols. The file analysis framework (FAF) instead provides a generalized presentation of file-related information. Information about the protocol transporting a file over the network is still available, but it no longer has to dictate how you organize your scripting logic. A goal of the FAF is to provide analysis for files that is analogous to the analysis Zeek provides for network connections.

Supported Protocols

Zeek ships with file analysis for the following protocols: FTP, HTTP, IRC, Kerberos, MIME, RDP, SMTP, and SSL/TLS/DTLS. Protocol analyzers are regular Zeek plugins, so users are welcome to provide additional ones in separate Zeek packages.

File Lifecycle Events

The key events that may occur during the lifetime of a file are: file_new, file_over_new_connection, file_sniff, file_timeout, file_gap, and file_state_remove. Handling any of these events provides some information about the file such as which network connection and protocol are transporting the file, how many bytes have been transferred so far, and its MIME type.

Here’s a simple example:

event connection_state_remove(c: connection)
    {
    print "connection_state_remove";
    print c$uid;
    print c$id;
    for ( s in c$service )
        print s;
    }

event file_state_remove(f: fa_file)
    {
    print "file_state_remove";
    print f$id;
    for ( cid in f$conns )
        {
        print f$conns[cid]$uid;
        print cid;
        }
    print f$source;
    }

$ zeek -r http/get.trace file_analysis_01.zeek
[orig_h=, orig_p=59856/tcp, resp_h=, resp_p=80/tcp]
[orig_h=, orig_p=59856/tcp, resp_h=, resp_p=80/tcp]

This doesn’t perform any interesting analysis yet, but it does highlight the similarity between the analysis of connections and files. Connections are identified by the usual 5-tuple or a convenient UID string, while files are identified just by a string of the same format as the connection UID. So there are unique ways to identify both files and connections, and files hold references to the connection (or connections) that transported them.
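The other lifecycle events listed earlier can be handled in the same way. The following is a minimal sketch, not taken from the Zeek distribution, that prints a line from several of them. It assumes the standard event signatures and the fa_file fields seen_bytes and source; which events actually fire depends on the traffic being analyzed.

```zeek
# Sketch: print basic information from several file lifecycle events.
event file_over_new_connection(f: fa_file, c: connection, is_orig: bool)
    {
    # Raised when a file is first seen being transferred over a connection.
    print "file_over_new_connection", f$id, c$uid, is_orig;
    }

event file_sniff(f: fa_file, meta: fa_metadata)
    {
    # mime_type is optional: it is unset when no signature matched.
    if ( meta?$mime_type )
        print "file_sniff", f$id, meta$mime_type;
    }

event file_gap(f: fa_file, offset: count, len: count)
    {
    # A chunk of the file content was missed.
    print "file_gap", f$id, offset, len;
    }

event file_timeout(f: fa_file)
    {
    print "file_timeout", f$id, f$seen_bytes;
    }

event file_state_remove(f: fa_file)
    {
    print "file_state_remove", f$id, f$source, f$seen_bytes;
    }
```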

File Type Identification

Zeek ships with its own library of content signatures to determine the type of a file, conveyed as MIME types in the file_sniff event. You can find those signatures in the Zeek distribution’s scripts/base/frameworks/files/magic/ directory. (Despite the name, Zeek does not rely on libmagic for content analysis.)
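As an illustration of consuming the detected type, the sketch below tallies files per MIME type and prints the totals at shutdown. The table name mime_counts and the "(unknown)" label are hypothetical names chosen for this example; the only framework features assumed are the file_sniff event and the optional mime_type field of fa_metadata.

```zeek
# Hypothetical counter table for this example; not part of the framework.
global mime_counts: table[string] of count;

event file_sniff(f: fa_file, meta: fa_metadata)
    {
    # Fall back to a placeholder label when no signature matched.
    local t = meta?$mime_type ? meta$mime_type : "(unknown)";

    if ( t !in mime_counts )
        mime_counts[t] = 0;

    ++mime_counts[t];
    }

event zeek_done()
    {
    for ( t, n in mime_counts )
        print t, n;
    }
```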

Adding Analysis

Zeek supports customized file analysis via file analyzers that users can attach to observed files. Once attached, file analyzers start receiving the contents of files as Zeek extracts them from ongoing network connections. Zeek comes with a set of built-in file analyzers, including the MD5 hashing and file extraction analyzers used in the examples below.

Like protocol parsers, file analyzers are regular Zeek plugins, so users are free to contribute additional ones in separate Zeek packages.

In the future there may be file analyzers that automatically attach to files based on heuristics, similar to the Dynamic Protocol Detection (DPD) framework for connections, but many will always require an explicit attachment decision.

Here’s a simple example of how to use the MD5 file analyzer to calculate the MD5 of plain text files:

event file_sniff(f: fa_file, meta: fa_metadata)
    {
    if ( ! meta?$mime_type ) return;
    print "new file", f$id;
    if ( meta$mime_type == "text/plain" )
        Files::add_analyzer(f, Files::ANALYZER_MD5);
    }

event file_hash(f: fa_file, kind: string, hash: string)
    {
    print "file_hash", f$id, kind, hash;
    }

$ zeek -r http/get.trace file_analysis_02.zeek
new file, FakNcS1Jfe01uljb3
file_hash, FakNcS1Jfe01uljb3, md5, 397168fd09991a0e712254df7bc639ac

Some file analyzers have tunable parameters that need to be specified in the call to Files::add_analyzer:

event file_new(f: fa_file)
    {
    Files::add_analyzer(f, Files::ANALYZER_EXTRACT,
                        [$extract_filename="myfile"]);
    }

In this case, the file extraction analyzer doesn’t generate any further events, but it does write the file contents to the local file system, at the path formed by concatenating FileExtract::prefix and the string myfile. Of course, on a network where more than a single file is transferred, it’s preferable to specify a different extraction path for each file, unlike this example.
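One way to give each file its own extraction path is to derive the name from the file’s unique ID. The sketch below assumes the extract_filename argument of the extraction analyzer; the "extract-" naming scheme is a hypothetical choice for this example, not a framework convention.

```zeek
event file_new(f: fa_file)
    {
    # Hypothetical naming scheme: combine the file's source and unique ID
    # so that concurrent transfers do not overwrite each other.
    local fname = fmt("extract-%s-%s", f$source, f$id);

    Files::add_analyzer(f, Files::ANALYZER_EXTRACT,
                        [$extract_filename=fname]);
    }
```

As before, the resulting files land under the directory given by FileExtract::prefix.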

You may add the same analyzer type multiple times to a given file, assuming you use varying Files::AnalyzerArgs parameterization, and remove them selectively from files via calls to Files::remove_analyzer. You may also enable and disable file analyzers globally by calling Files::enable_analyzer and Files::disable_analyzer, respectively.
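A minimal sketch of selective removal and global disabling follows. The choice of the SHA1 and SHA256 analyzers and the text/plain condition are arbitrary and only serve the example; the functions used are the ones named above.

```zeek
event zeek_init()
    {
    # Globally disable one analyzer type for this Zeek instance.
    Files::disable_analyzer(Files::ANALYZER_SHA1);
    }

event file_new(f: fa_file)
    {
    # Attach two hash analyzers to every new file.
    Files::add_analyzer(f, Files::ANALYZER_MD5);
    Files::add_analyzer(f, Files::ANALYZER_SHA256);
    }

event file_sniff(f: fa_file, meta: fa_metadata)
    {
    # Once the type is known, drop the SHA256 analyzer for plain text.
    # The arguments passed here must match the ones used when attaching
    # (the defaults in this case).
    if ( meta?$mime_type && meta$mime_type == "text/plain" )
        Files::remove_analyzer(f, Files::ANALYZER_SHA256);
    }

event file_hash(f: fa_file, kind: string, hash: string)
    {
    print "file_hash", f$id, kind, hash;
    }
```

Attaching early and removing later keeps the analyzer from missing the start of the file, at the cost of some wasted work on files that turn out to be uninteresting.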

For additional customizations and APIs, please refer to base/frameworks/files/main.zeek.

Regardless of which file analyzers end up acting on a file, general information about the file (e.g. size, time of last data transferred, MIME type, etc.) is logged in files.log.

Input Framework Integration

The FAF comes with a simple way to integrate with the Input Framework, so that Zeek can analyze files from external sources in the same way it analyzes files that it sees coming over traffic from a network interface it’s monitoring. It only requires a call to Input::add_analysis:

redef exit_only_after_terminate = T;

event file_new(f: fa_file)
    {
    print "new file", f$id;
    Files::add_analyzer(f, Files::ANALYZER_MD5);
    }

event file_state_remove(f: fa_file)
    {
    print "file_state_remove";
    Input::remove(f$source);
    terminate();
    }

event file_hash(f: fa_file, kind: string, hash: string)
    {
    print "file_hash", f$id, kind, hash;
    }

event zeek_init()
    {
    local source: string = "./myfile";
    Input::add_analysis([$source=source, $name=source]);
    }

Note that the “source” field of fa_file corresponds to the “name” field of Input::AnalysisDescription since that is what the input framework uses to uniquely identify an input stream.

Example output of the above script may be:

$ echo "Hello world" > myfile
$ zeek file_analysis_03.zeek
new file, FZedLu4Ajcvge02jA8
file_hash, FZedLu4Ajcvge02jA8, md5, f0ef7081e1539ac00ef5b761b4fb01b3

Nothing that special, but it at least verifies the MD5 file analyzer saw all the bytes of the input file and calculated the checksum correctly!