5.1. Parsing

5.1.1. Basics

5.1.1.1. Type Declaration

Spicy expresses units of data to parse through a type called, appropriately, unit. At a high level, a unit is similar to structs or records in other languages: It defines an ordered set of fields, each with a name and a type, that during runtime will store corresponding values. Units can be instantiated, fields can be assigned values, and these values can then be retrieved. Here’s about the most basic Spicy unit one can define:

type Foo = unit {
    version: uint32;
};

We name the type Foo, and it has just one field called version, which stores a 32-bit unsigned integer.

Leaving parsing aside for a moment, we can indeed use this type similar to a typical struct/record type:

module Test;

type Foo = unit {
    version: uint32;
};

global f: Foo;
f.version = 42;
print f;

This will print:

[$version=42]

Fields are initially unset, and attempting to read an unset field will trigger a runtime error. You may, however, provide a default value by adding a &default attribute to the field, in which case that will be returned on access if no value has been explicitly assigned:

module Test;

type Foo = unit {
    version: uint32 &default=42;
};

global f: Foo;
print f;
print "version is %s" % f.version;

This will print:

[$version=(not set)]
version is 42

Note how the field remains unset even with the default now specified, while the access returns the expected value.

Note

During development, we recommend running spicy-driver with the --debug option (-d for short) to prevent the optimizer from being too aggressive and eliding fields it considers unused. For example, if we had run the first example above without -d, the output would have looked like this:

[$version=(optimized out)]

In the following, we will generally run with -d. Once a parser is ready for production, you would omit this option to get the best performance. See Optimization for more on this topic.

5.1.1.2. Parsing a Field

We can turn this minimal unit type into a starting point for parsing data—in this case a 32-bit integer from four bytes of raw input. First, we need to declare the unit as public to make it accessible from outside of the current module—a requirement if a host application wants to use the unit as a parsing entry point.

module Test;

public type Foo = unit {
    version: uint32;

    on %done {
        print "0x%x" % self.version;
    }
};

Let’s use spicy-driver to parse 4 bytes of input through this unit:

# printf '\01\02\03\04' | spicy-driver -d foo.spicy
0x1020304

The output comes of course from the print statement inside the %done hook, which executes once the unit has been fully parsed. (We will discuss unit hooks further below.)

By default, Spicy assumes integers that it parses to be represented in network byte order (i.e., big-endian), hence the output above. Alternatively, we can tell the parser through an attribute that our input is arriving in, say, little-endian instead. To do that, we import the spicy library module, which provides an enum type spicy::ByteOrder that we can give to a &byte-order field attribute for fields that support it:

module Test;

import spicy;

public type Foo = unit {
    version: uint32 &byte-order=spicy::ByteOrder::Little;

    on %done {
        print "0x%x" % self.version;
    }
};

# printf '\01\02\03\04' | spicy-driver -d foo.spicy
0x4030201

We see that unpacking the value has now flipped the bytes before storing it in the version field.
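To make the effect of the byte order concrete, here is a small Python model of the two runs above. It is not Spicy code and says nothing about how the generated parser is implemented; it only reproduces the arithmetic on the same four input bytes. The variable names are made up for illustration.

```python
# Python model of the two examples above; this is not Spicy code.
raw = b"\x01\x02\x03\x04"  # the same four bytes that printf emits

as_big = int.from_bytes(raw, byteorder="big")        # default: network byte order
as_little = int.from_bytes(raw, byteorder="little")  # spicy::ByteOrder::Little

print("0x%x" % as_big)     # 0x1020304
print("0x%x" % as_little)  # 0x4030201
```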

Similar to &byte-order, Spicy offers a variety of further attributes that control the specifics of how fields are parsed. We’ll discuss them in the relevant sections throughout the rest of this chapter.

5.1.1.3. Non-type Fields

Unit fields always have a type. However, in some cases a field’s type is not explicitly declared, but derived from what’s being parsed. The main example of this is parsing a constant value: Instead of a type, a field can specify a constant of a parseable type. The field’s type will then (usually) just correspond to the constant’s type, and parsing will expect to find the corresponding value in the input stream. If a different value gets unpacked instead, parsing will abort with an error. Example:

module Test;

public type Foo = unit {
    bar: b"bar";
    on %done { print self.bar; }
};

# printf 'bar' | spicy-driver -d foo.spicy
bar

# printf 'foo' | spicy-driver -d foo.spicy
[error] processing failed with exception of type spicy::rt::ParseError: expected bytes literal "bar" but input starts with "foo" (foo.spicy:5:10-5:15)

Regular expressions extend this scheme a bit further: If a field specifies a regular expression constant rather than a type, the field will have type Bytes and store the data that ends up matching the regular expression:

module Test;

public type Foo = unit {
    x: /Foo.*Bar/;
    on %done { print self; }
};

# printf 'Foo12345Bar' | spicy-driver -d foo.spicy
[$x=b"Foo12345Bar"]

There’s also a programmatic way to change a field’s type to something different from what’s being parsed; see the &convert attribute.

5.1.1.3.1. Parsing Fields With Known Size

You can limit the input that a field receives by attaching a &size=EXPR attribute that specifies the number of raw bytes to make available. This works on top of any other attributes that control the field’s parsing. From the field’s perspective, such a size limit acts just like reaching the end of the input stream at the specified position. Example:

module Test;

public type Foo = unit {
    x: int16[] &size=6;
    y: bytes &eod;
    on %done { print self; }
};

# printf '\000\001\000\002\000\003xyz' | spicy-driver -d foo.spicy
[$x=[1, 2, 3], $y=b"xyz"]

As you can see, x receives 6 bytes of input, which it then turns into three 16-bit integers.

Normally, the field must consume all the bytes specified by &size, otherwise a parse error will be triggered. Some types support an additional &eod attribute to lift this restriction; we discuss that in the corresponding type’s section where applicable.

After a field with a &size=EXPR attribute, parsing will always move ahead by the full number of bytes, even if the field did not consume them all.
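The windowing behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch of the semantics for the Foo example (a fixed window for x, the remainder for y); the function name parse_foo and its error messages are assumptions, not part of Spicy.

```python
import struct

def parse_foo(data: bytes, size: int = 6):
    """Model of `x: int16[] &size=6; y: bytes &eod;` (illustrative only)."""
    if len(data) < size:
        raise ValueError("not enough input for &size")
    window, rest = data[:size], data[size:]
    if len(window) % 2 != 0:
        raise ValueError("x cannot consume all bytes inside its window")
    # x sees only `window`; the end of the window looks like end of input.
    x = list(struct.unpack(">%dh" % (len(window) // 2), window))
    # Parsing always resumes right after the window, so y gets the rest.
    y = rest
    return x, y

print(parse_foo(b"\x00\x01\x00\x02\x00\x03xyz"))  # ([1, 2, 3], b'xyz')
```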

Todo

Parsing a regular expression would make a nice example for &size as well.

5.1.1.3.2. Defensively Limiting Input Size

On their own, parsers place no intrinsic upper limit on the size of variable-size fields or units. This can have negative effects like out-of-memory errors, e.g., when available memory is constrained, or for malformed input.

As a defensive mechanism, you can put an upper limit on the data a field or unit receives by attaching a &max-size=EXPR attribute, where EXPR is an unsigned integer specifying the maximum number of raw bytes the field or unit should receive. If more than &max-size bytes are consumed during parsing, an error will be triggered. This attribute works on top of any other attributes that control parsing. Example:

module Test;

public type Foo = unit {
    x: bytes &until=b"\x00" &max-size=1024;
    on %done { print self; }
};

# printf '\001\002\003\004\005\000' | spicy-driver -d foo.spicy
[$x=b"\x01\x02\x03\x04\x05"]

Here x will parse a NULL-terminated byte sequence (excluding the terminating NULL), but never more than 1024 bytes.

&max-size cannot be combined with &size.
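As a rough mental model, the combination of &until and &max-size in the example behaves like the following Python sketch. It works on fully buffered input, and it assumes the limit applies to the bytes in front of the delimiter; both simplifications, like the name parse_until, are assumptions made for illustration rather than a description of the generated parser.

```python
def parse_until(data: bytes, delim: bytes = b"\x00", max_size: int = 1024):
    """Model of `bytes &until=DELIM &max-size=N` on fully buffered input."""
    # Only look as far as the limit allows (plus room for the delimiter).
    window = data[: max_size + len(delim)]
    pos = window.find(delim)
    if pos < 0:
        if len(data) > max_size:
            raise ValueError("&max-size exceeded before delimiter was found")
        raise ValueError("delimiter not found in available input")
    value = data[:pos]                 # delimiter is not part of the value
    rest = data[pos + len(delim):]     # but it is consumed
    return value, rest

print(parse_until(b"\x01\x02\x03\x04\x05\x00"))
```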

5.1.1.4. Anonymous Fields

Field names are optional. If skipped, the field becomes an anonymous field. These still participate in parsing as any other field, but they won’t store any value, nor is there a way to get access to them from outside. You can, however, still get to the field’s final value inside a corresponding field hook (see Unit Hooks) using the reserved $$ identifier (see Reserved Identifiers).

module Test;

public type Foo = unit {
    x: int8;
     : int8 { print $$; } # anonymous field
    y: int8;
    on %done { print self; }
};

# printf '\01\02\03' | spicy-driver -d foo.spicy
2
[$x=1, $y=3]

Anonymous fields can often be more efficient to process because the parser doesn’t need to retain their values. In particular for larger bytes fields, making them anonymous is recommended where possible (unless, even better, they can be fully skipped over; see Skipping Input).

5.1.1.5. Skipping Input

For cases where your parser just needs to skip over some data without needing access to its content, Spicy provides a skip keyword to prefix corresponding fields with:

module Test;

public type Foo = unit {
    x: int8;
     : skip bytes &size=5;
    y: int8;
    on %done { print self; }
};

# printf '\01\02\03\04\05\06\07' | spicy-driver -d foo.spicy
[$x=1, $y=7]

skip works for all kinds of fields but is particularly efficient for fields of known size, for which optimized code will be generated, avoiding the overhead of storing any data.

skip fields may have conditions and hooks attached, like any other fields. However, they do not support $$ in expressions and hooks.

Since skip allows the compiler to optimize the field’s parsing code—including completely eliding most of it—it remains undefined if any side effects associated with the field will take effect. For example, &requires attributes might be ignored, &convert expressions might not be evaluated, and hooks could end up not being invoked.

For readability, a skip field may be named (e.g., padding: skip bytes &size=3;), but even with a name, its value cannot be accessed.

5.1.1.6. Reserved Identifiers

Inside units, two reserved identifiers provide access to values currently being parsed:

self: Inside a unit’s type definition, self refers to the unit instance that’s currently being processed. The instance is writable and may be modified by assigning to any fields of self.
$$: Inside field attributes, $$ refers to the value as it was parsed. Inside field hooks, $$ refers to the final value after any conversions are applied (see On-the-fly Type Conversion with &convert). This applies even if the value is not going to be directly stored in the field. The value of $$ is writable and may be modified.

Note

$$ has slightly different semantics in a field attribute and in a hook. In an attribute, $$ refers to the parsed value before any conversions. In a hook, $$ refers to the final value after any conversions.

5.1.1.7. On-the-fly Type Conversion with &convert

Fields may use an attribute &convert=EXPR to transform the value that was just parsed before storing it as the field’s final value. With the attribute being present, it’s the value of EXPR that’s stored in the field, not the parsed value. Accordingly, the field’s type also changes to the type of EXPR.

Typically, EXPR will use $$ to access the parsed value and then transform it into the desired representation. For example, the following stores an integer parsed in an ASCII representation as a uint64:

module Test;

import spicy;

public type Foo = unit {
    x: bytes &eod &convert=$$.to_uint();
    on %done { print self; }
};

# printf 12345 | spicy-driver -d foo.spicy
[$x=12345]

&convert also works at the unit level to transform a whole instance into a different value after it has been parsed:

module Test;

type Data = unit {
    data: bytes &size=2;
} &convert=self.data.to_int();

public type Foo = unit {
    numbers: Data[];

    on %done { print self.numbers; }
};

# printf 12345678 | spicy-driver -d foo.spicy
[12, 34, 56, 78]

Note how the Data instances have been turned into integers. Without the &convert attribute, the output would have looked like this:

[[$data=b"12"], [$data=b"34"], [$data=b"56"], [$data=b"78"]]
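The unit-level conversion in this example boils down to "cut the input into two-byte units, then replace each unit by the integer its bytes spell out". A minimal Python sketch of that idea follows; parse_numbers is a hypothetical helper and only mirrors the observable result, not Spicy's implementation.

```python
def parse_numbers(data: bytes, width: int = 2):
    """Model of `numbers: Data[]` where Data converts its bytes to an integer."""
    if len(data) % width != 0:
        raise ValueError("trailing input does not fill a complete Data unit")
    # Each Data unit receives exactly `width` bytes ...
    units = [data[i:i + width] for i in range(0, len(data), width)]
    # ... and is then replaced by the result of its conversion expression.
    return [int(chunk.decode("ascii")) for chunk in units]

print(parse_numbers(b"12345678"))  # [12, 34, 56, 78]
```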

5.1.1.8. Enforcing Parsing Constraints

Fields may use an attribute &requires=EXPR to enforce additional constraints on their values. EXPR must yield a boolean value and will be evaluated after the parsing for the field has finished, but before any hooks execute. If EXPR returns False, the parsing process will abort with an error, just as if the field had been unparsable in the first place (incl. executing any %error hooks). EXPR has access to the parsed value through $$. It may also retrieve the field’s final value through self.<field>, which can be helpful when &convert is present.

Example:

module Test;

import spicy;

public type Foo = unit {
    x: int8 &requires=($$ < 5);
    on %done { print self; }
};

# printf '\001' | spicy-driver -d foo.spicy
[$x=1]

# printf '\010' | spicy-driver -d foo.spicy
[error] processing failed with exception of type spicy::rt::ParseError: &requires failed: ($$ < 5) (foo.spicy:7:13-7:30)

New in version 1.12: Custom error messages

Instead of computing a boolean value directly, EXPR can also leverage the condition test operator to provide a custom error message when the condition fails. Example:

module Test;

import spicy;

public type Foo = unit {
    x: int8 &requires=($$ < 5 : "x is too large"); # custom error message
    on %done { print self; }
};

# printf '\010' | spicy-driver -d foo.spicy
[error] processing failed with exception of type spicy::rt::ParseError: x is too large (foo.spicy:7:13-7:49)

One can also enforce conditions globally at the unit level through an attribute &requires = EXPR. EXPR will be evaluated once the unit has been fully parsed, but before any %done hook executes. If EXPR returns False, the unit’s parsing process will abort with an error. As usual, EXPR has access to the parsed instance through self. More than one &requires attribute may be specified.

Example:

module Test;

import spicy;

public type Foo = unit {
    x: int8;
    on %done { print self; }
} &requires = self.x < 5;

# printf '\001' | spicy-driver -d foo.spicy
[$x=1]

# printf '\010' | spicy-driver -d foo.spicy
[error] processing failed with exception of type spicy::rt::ParseError: &requires failed: self.x < 5 (foo.spicy:9:15-9:24)

5.1.2. Unit Hooks

Unit hooks provide one of the most powerful Spicy tools to control parsing, track state, and retrieve results. Generally, hooks are blocks of code triggered to execute at certain points during parsing, with access to the current unit instance.

Conceptually, unit hooks are somewhat similar to methods: They have bodies that execute when triggered, and these bodies may receive a set of parameters as input. Different from functions, however, a hook can have more than one body. If multiple implementations are provided for the same hook, all of them will execute successively. A hook may also not have any body implemented at all, in which case there’s nothing to do when it executes.

The most commonly used hooks are:

on %init { ... }: Executes just before unit parsing will start.
on %done { ... }: Executes just after unit parsing has completed successfully.

on %error { ... } or on %error(msg: string) { ... }: Executes when a parse error has been encountered, just before the parser aborts processing. If the second form is used, a description of the error will be provided through the string argument.
on %finally { ... }: Executes once unit parsing has completed in any way. This hook is most useful to modify global state that needs to be updated no matter the success of the parsing process. Once %init triggers, this hook is guaranteed to eventually execute as well. It will run after either %done or %error, respectively. (If a new error occurs during execution of %finally, that will not trigger the unit’s %error hook.)
on %print { ... }: Executes when a unit is about to be printed (and more generally: when rendered into a string representation). By default, printing a unit will produce a list of its fields with their current values. Through this hook, a unit can customize its appearance by returning the desired string.
on <field name> { ... } (field hook): Executes just after the given unit field has been parsed. The final value is accessible through $$, with any relevant type conversion already applied (see On-the-fly Type Conversion with &convert). The same value will also have been assigned to the field already.

on <field name> foreach { ... } (container hook): Assuming the specified field is a container (e.g., a vector), this executes each time a new container element has been parsed, and just before it’s been added to the container. The element’s final value is accessible through the $$ identifier, although it can be further modified before it’s stored. The hook implementation may also use the stop statement to abort container parsing, without the current element being added anymore.

In addition, Spicy provides a set of hooks specific to the sink type which are discussed in the section on sinks, and hooks which are executed during error recovery.

There are three locations where hooks can be implemented:

Inside a unit, on <hook name> { ... } implements the hook of the given name:

type Foo = unit {
    x: uint32;
    v: uint8[];

    on %init { ... }
    on x { ... }
    on v foreach { ... }
    on %done { ... }
}

Field and container hooks may be directly attached to their field, skipping the on ... part:

type Foo = unit {
    x: uint32 { ... }
    v: uint8[] foreach { ... }
}

At the global module level, one can add hooks to any available unit type through on <unit type>::<hook name> { ... }. With the definition of Foo above, this implements hooks externally:

on Foo::%init { ... }
on Foo::x { ... }
on Foo::v foreach { ... }
on Foo::%done { ... }

External hooks work across module boundaries by qualifying the unit type accordingly. They provide a powerful mechanism to extend a predefined unit without changing any of its code.

If multiple implementations are provided for the same hook, by default it remains undefined in which order they will execute. If a particular order is desired, you can specify priorities for your hook implementations:

on Foo::v priority=5 { ... }
on Foo::v priority=-5 { ... }

Implementations then execute in order of their priorities: The higher a priority value, the earlier it will execute. If not specified, a hook’s priority is implicitly taken as zero.
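The ordering rule can be pictured with a tiny Python dispatcher. This is a conceptual sketch only: the Hooks class is hypothetical, and where Spicy leaves the order of equal-priority implementations undefined, this sketch happens to keep registration order.

```python
from collections import defaultdict

class Hooks:
    """Toy hook registry: several bodies per hook, run by descending priority."""

    def __init__(self):
        self._bodies = defaultdict(list)

    def on(self, name, body, priority=0):
        # A missing priority counts as zero.
        self._bodies[name].append((priority, body))

    def trigger(self, name, *args):
        ordered = sorted(self._bodies[name], key=lambda entry: entry[0], reverse=True)
        for _, body in ordered:
            body(*args)

calls = []
hooks = Hooks()
hooks.on("Foo::v", lambda: calls.append("priority -5"), priority=-5)
hooks.on("Foo::v", lambda: calls.append("priority 5"), priority=5)
hooks.on("Foo::v", lambda: calls.append("default"))
hooks.trigger("Foo::v")
print(calls)  # ['priority 5', 'default', 'priority -5']
```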

Note

When a hook executes, it has access to the current unit instance through the self identifier. The state of that instance will reflect where parsing is at that time. In particular, any field that hasn’t been parsed yet will remain unset. You can use the ?. unit operator to test if a field has received a value yet.

5.1.3. Unit Variables

In addition to unit fields for parsing, you can also add further instance variables to a unit type to store arbitrary state:

module Test;

public type Foo = unit {
    on %init { print self; }
    x: int8 { self.a = "Our integer is %d" % $$; }
    on %done { print self; }

    var a: string;
};

# printf \05 | spicy-driver -d foo.spicy
[$x=(not set), $a=""]
[$x=48, $a="Our integer is 48"]

Here, we assign a string value to a once we have parsed x. The final print shows the expected value. As you can also see, before we assign anything, the variable’s value is just empty: Spicy initializes unit variables with well-defined defaults. If you would rather leave a variable unset by default, you can add &optional:

module Test;

public type Foo = unit {
    on %init { print self; }
    x: int8 { self.a = "Our integer is %d" % $$; }
    on %done { print self; }

    var a: string &optional;
};

# printf \05 | spicy-driver -d foo.spicy
[$x=(not set), $a=(not set)]
[$x=48, $a="Our integer is 48"]

You can use the ?. unit operator to test whether an optional unit variable has been set, e.g., self?.a would return True if variable a is set and False otherwise.

Unit variables can also be initialized with custom expressions when being defined. The initialization is performed just before the containing unit starts parsing (implying that the expressions cannot access parse results of the unit itself yet):

module Test;

public type Foo = unit {
    x: int8;
    var a: int8 = 123;
    on %done { print self; }
};

# printf \05 | spicy-driver -d foo.spicy
[$x=48, $a=123]

5.1.4. Unit Parameters

Unit types can receive parameters upon instantiation, which will then be available to any code inside the type’s declaration:

module Test;

type Bar = unit(msg: string, mult: int8) {
    x: int8 &convert=($$ * mult);
    on %done { print "%s: %d" % (msg, self.x); }
};

public type Foo = unit {
    y: Bar("My multiplied integer", 5);
};

# printf '\05' | spicy-driver -d foo.spicy
My multiplied integer: 25

This example shows a typical idiom: We’re handing values down to a subunit through parameters it receives. Inside the subunit, we then have access to the values passed in.

Note

It’s usually not very useful to define a top-level parsing unit with parameters because we don’t have a way to pass anything in through spicy-driver. A custom host application could make use of them, though.

This works with subunits inside containers as well:

module Test;

type Bar = unit(mult: int8) {
    x: int8 &convert=($$ * mult);
    on %done { print self.x; }
};

public type Foo = unit {
    x: int8;
    y: Bar(self.x)[];
};

# printf '\05\01\02\03' | spicy-driver -d foo.spicy
5
10
15
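To trace where the three output lines come from, here is a Python model of this example. It is a sketch under simplifying assumptions (no int8 overflow handling, input fully buffered), and parse_foo is a hypothetical name; the point is only that each container element receives the already parsed x as its multiplier.

```python
def parse_foo(data: bytes):
    """Model of `x: int8; y: Bar(self.x)[];` with Bar multiplying by its parameter."""
    def int8(byte: int) -> int:
        return int.from_bytes(bytes([byte]), "big", signed=True)

    x = int8(data[0])                    # parsed first, so it can be passed down
    y = [int8(b) * x for b in data[1:]]  # each Bar instance gets mult = x
    return x, y

x, y = parse_foo(b"\x05\x01\x02\x03")
for value in y:
    print(value)  # 5, then 10, then 15
```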

A common use-case for unit parameters is passing the self of a higher-level unit down into a subunit:

type Foo = unit {
    ...
    b: Bar(self);
    ...
}

type Bar = unit(foo: Foo) {
    # We now have access to any state in "foo".
}

That way, the subunit can for example store state directly in the parent. If you declare the foo parameter as inout, the subunit can also modify its members.

Unit parameters generally follow the same passing conventions as function parameters, yet with some restrictions. Just like with functions, parameters are read-only by default. If you want the receiving unit to be able to modify the value, there are two options:

If the parameter itself is a unit, you can declare it as inout as described above.

For all other types, you instead need to pass the parameter as a reference. Here’s an example passing a string so that it can be modified by the subunit:

module Test;

type X = unit(s: string&) {
    n: uint8 {
        *s = "Hello, world!";
    }
};

public type Y = unit {
    x: X(self.s);

    on %done { print self.s; }

    var s: string& = new string;
};

# printf '\x2a' | spicy-driver -d foo.spicy
Hello, world!

Note

While this lack of support for inout may seem like a surprising restriction at first, it follows from Spicy’s safety guarantees: since a subunit may access its parameters during its entire lifetime, generally Spicy couldn’t guarantee that a parameter passed as inout at initialization time would actually remain around for modification the whole time. References do not have that problem: their wrapped values are guaranteed to remain valid as long as necessary. (Units happen to share that behaviour, too, which is why Spicy can support inout for them.)

5.1.5. Unit Attributes

Unit types support the following type attributes:

&byte-order=ORDER

Specifies a byte order to use for parsing the unit where ORDER is of type spicy::ByteOrder. This overrides the byte order specified for the module. Individual fields can override this value by specifying their own &byte-order. Example:

type Foo = unit {
    version: uint32;
} &byte-order=spicy::ByteOrder::Little;

&convert=EXPR

Replaces a unit instance with the result of the expression EXPR after parsing it from inside a parent unit. See On-the-fly Type Conversion with &convert for an example. EXPR has access to self to retrieve state from the unit.

&requires=EXPR

Enforces post-conditions on the parsed unit. EXPR must be a boolean expression that will be evaluated after the parsing for the unit has finished, but before any hooks execute. More than one &requires attribute may be specified. Example:

type Foo = unit {
    a: int8;
    b: int8;
} &requires=self.a==self.b;

See the section on parsing constraints for more details.

&size=N

Limits the unit’s input to N bytes, which it must fully consume. Example:

type Foo = unit {
    a: int8;
    b: bytes &eod;
} &size=5;

This expects 5 bytes of input when parsing an instance of Foo. The unit will store the first byte into a, and then fill b with the remaining 4 bytes.

The expression N has access to self as well as to the unit’s parameters.

5.1.6. Meta data

Units can provide meta data about their semantics through properties that both Spicy itself and host applications can access. One defines properties inside the unit’s type through either a %<property> = <value>; tuple, or just as %<property>; if the property does not take an argument. Currently, units support the following meta data properties:

%mime-type = STRING

A string of the form "<type>/<subtype>" that defines the MIME type for content the unit knows how to parse. This may include a * wildcard for either the type or subtype. We use a generalized notion of MIME types here that can include custom meanings. See Sinks for more on how these MIME types are used to select parsers dynamically during runtime.

You can specify this property more than once to associate a unit with multiple types.

%description = STRING

A short textual description of the unit type (i.e., the parser that it defines). Host applications have access to this property, and spicy-driver includes the information in the list of available parsers that it prints with the --list-parsers option.

%port = PORT_VALUE [&originator|&responder]

A Port to associate this unit with, optionally including a direction to limit its use to the corresponding side. This property has no built-in effect, but host applications may make use of the information to decide which unit type to use for parsing a connection’s payload.

%skip = ( REGEXP | Null );

Specifies a pattern which should be skipped when encountered in the input stream in between parsing of unit fields. This overwrites a value set at the module level; use Null to reset the property, i.e., not skip anything.

%skip-pre = ( REGEXP | Null );

Specifies a pattern which should be skipped when encountered in the input stream before parsing of a unit begins. This overwrites a value set at the module level; use Null to reset the property, i.e., not skip anything.

%skip-post = ( REGEXP | Null );

Specifies a pattern which should be skipped when encountered in the input stream after parsing of a unit has finished. This overwrites a value set at the module level; use Null to reset the property, i.e., not skip anything.

%synchronize-at = EXPR;

Specifies a literal to synchronize on if the unit is used as a synchronization point during error recovery. The literal is left in the input stream.

%synchronize-after = EXPR;

Specifies a literal to synchronize on if the unit is used as a synchronization point during error recovery. The literal is consumed and will not be present in the input stream after successful synchronization.

Units support some further properties for other purposes, which we introduce in the corresponding sections.

5.1.7. Parsing Types

Several, but not all, of Spicy’s data types can be parsed from binary data. In the following we summarize the types that can, along with any options they support to control specifics of how they unpack binary representations.

5.1.7.1. Address

Spicy parses addresses from either 4 bytes of input for IPv4 addresses, or 16 bytes for IPv6 addresses. To select the type, a unit field of type addr must come with either an &ipv4 or &ipv6 attribute.

By default, addresses are assumed to be represented in network byte order. Alternatively, a different byte order can be specified through a &byte-order attribute specifying the desired spicy::ByteOrder.

Example:

module Test;

import spicy;

public type Foo = unit {
    ip: addr &ipv6 &byte-order=spicy::ByteOrder::Little;
    on %done { print self; }
};

# printf '1234567890123456' | spicy-driver -d foo.spicy
[$ip=3635:3433:3231:3039:3837:3635:3433:3231]
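The printed address can be checked by hand: with little-endian byte order the 16 input bytes are read in reverse before being interpreted as an IPv6 address. The following Python lines reproduce that using the standard ipaddress module; they model the result shown above and are not Spicy code.

```python
import ipaddress

raw = b"1234567890123456"  # the 16 ASCII bytes fed in by printf

# Little-endian: the last input byte is the most significant one.
addr = ipaddress.IPv6Address(raw[::-1])

print(addr)  # 3635:3433:3231:3039:3837:3635:3433:3231
```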

5.1.7.2. Bitfield

Bitfields parse an integer value of a given size, and then make selected smaller bit ranges within that value available individually through dedicated identifiers. For example, the following unit parses 4 bytes as a uint32 and then makes the value of bit 0 available as f.x1, bits 1 to 2 as f.x2, and bits 3 to 4 as f.x3, respectively:

module Test;

public type Foo = unit {
    f: bitfield(32) {
        x1: 0;
        x2: 1..2;
        x3: 3..4;
    };

    on %done {
        print self.f.x1, self.f.x2, self.f.x3;
        print self;
    }
};

# printf '\01\02\03\04' | spicy-driver -d foo.spicy
0, 2, 0
[$f=(x1: 0, x2: 2, x3: 0)]

Generally, a bitfield(N) field is parsed like a uint<N>. The field then supports dereferencing individual bit ranges through their labels. The corresponding expressions (self.x.<id>) have the same uint<N> type as the parsed value itself, with the value shifted to the right so that the least significant extracted bit becomes the least significant bit of the returned value. As you can see in the example, the type of the field itself becomes a tuple composed of the values of the individual bit ranges.
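The shift-and-mask arithmetic behind the example output can be written out in Python. This is a sketch of the computation for the default settings (network byte order, bit 0 as the least significant bit); the helper name bits is made up for illustration.

```python
def bits(value: int, lower: int, upper: int) -> int:
    """Extract bits lower..upper (inclusive), counting from the least significant bit."""
    width = upper - lower + 1
    return (value >> lower) & ((1 << width) - 1)

value = int.from_bytes(b"\x01\x02\x03\x04", "big")  # 0x01020304

x1 = bits(value, 0, 0)
x2 = bits(value, 1, 2)
x3 = bits(value, 3, 4)
print(x1, x2, x3)  # 0 2 0
```

Only the lowest byte (0x04, binary 00000100) matters here: bit 0 is 0, bits 1 to 2 read as binary 10, and bits 3 to 4 are both 0.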

By default, a bitfield assumes the underlying integer comes in network byte order. You can specify a &byte-order attribute to change that (e.g., bitfield(32) { ... } &byte-order=spicy::ByteOrder::Little).

When parsing a bitfield(16) in network byte order and with bit order spicy::BitOrder::LSB0 (default value of &bit-order), bits are numbered 0 to 15 from right to left.

MSB                           LSB
     <--   1        <--        0
 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+---------------+---------------+
|               |               |
+-------------------------------+

This default bit numbering may be surprising given that some RFCs use the inverse as documented in RFC 1700. Here, the most significant bit is numbered 0 on the left with higher bit numbers representing less significant bits to the right. Concrete examples would be the WebSocket framing or IPv4 header notations.

MSB                           LSB
 0      -->         1    -->
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-------+-+-------------+
|F|R|R|R| opcode|M| Payload len |
|I|S|S|S|  (4)  |A|     (7)     |
|N|V|V|V|       |S|             |
| |1|2|3|       |K|             |
+-+-+-+-+-------+-+-------------+

To express such bitfields more naturally in Spicy, use &bit-order=spicy::BitOrder::MSB0 on the whole bitfield:

module WebSocket;

import spicy;

public type Header = unit {
    : bitfield(16) {
        fin: 0;
        rsv: 1..3;
        opcode: 4..7;
        mask: 8;
        payload_len: 9..15;
    } &bit-order=spicy::BitOrder::MSB0;
};

The way to think about this is that the most significant bit of an integer in network byte order is always the leftmost bit and the least significant bit the rightmost one. Specifying the bit order as LSB0 or MSB0 essentially sets the bit numbering direction by specifying the location of bit 0.
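Put differently, MSB0 bit i of an N-bit value is the same bit as LSB0 bit N-1-i. The Python sketch below applies that rule to a 16-bit header laid out like the WebSocket diagram above. The helper bits_msb0 and the sample bytes 0x81 0x85 are chosen for illustration only.

```python
def bits_msb0(value: int, width: int, first: int, last: int) -> int:
    """Extract bits first..last (inclusive) where bit 0 is the most significant bit."""
    count = last - first + 1
    shift = width - 1 - last  # distance of the range's last bit from the LSB
    return (value >> shift) & ((1 << count) - 1)

header = int.from_bytes(b"\x81\x85", "big")  # binary 1000 0001 1000 0101

fin = bits_msb0(header, 16, 0, 0)
rsv = bits_msb0(header, 16, 1, 3)
opcode = bits_msb0(header, 16, 4, 7)
mask = bits_msb0(header, 16, 8, 8)
payload_len = bits_msb0(header, 16, 9, 15)
print(fin, rsv, opcode, mask, payload_len)  # 1 0 1 1 5
```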

With little endian byte order, the bits are numbered zigzag-wise and MSB0 and LSB0 can again be used to change the direction of the bit numbering. The following example uses spicy::ByteOrder::Little and the default LSB0 bit order for bitfield(16). Notice how the most significant and least significant bit for a 2 byte little endian integer are next to each other.

f: bitfield(16) {

  ...

} &byte-order=spicy::ByteOrder::Little;

              LSB MSB
     <--        0     <--   1
  7 6 5 4 3 2 1 0 5 4 3 2 1 0 9 8
+---------------+---------------+
|               |               |
+-------------------------------+

With MSB0 as bit order, the bit numbering direction is from left to right, instead:

f: bitfield(16) {

  ...

} &byte-order=spicy::ByteOrder::Little &bit-order=spicy::BitOrder::MSB0;

            LSB MSB
    1  -->      0    -->
8 9 0 1 2 3 4 5 0 1 2 3 4 5 6 7
+---------------+---------------+
|               |               |
+-------------------------------+

Bit numbering for larger bitfields in little endian only gets more confusing. Prefer bitfields in network byte order unless the specification you’re working with calls for little endian.

The individual bit ranges support the &convert attribute and will adjust their types accordingly, just like a regular unit field (see On-the-fly Type Conversion with &convert). For example, that allows for mapping a bit range to an enum, using $$ to access the parsed value:

module Test;

import spicy;

type X = enum { A = 1, B = 2 };

public type Foo = unit {
    f: bitfield(8) {
        x1: 0..3 &convert=X($$);
        x2: 4..7 &convert=X($$);
    } { print self.f.x1, self.f.x2; }
};

# printf '\x21' | spicy-driver -d foo.spicy
X::A, X::B

When parsing a bitfield, you can enforce expected values for some or all of the bit ranges through an assignment-style syntax:

type Foo = unit {
    f: bitfield(8) {
        x1: 0..3 = 2;
        x2: 4..5;
        x3: 6..7 = 3;
    }
};

Now parsing will fail if values of x1 and x3 aren’t 2 and 3, respectively. Internally, Spicy treats bitfields with such expected values similar to constants of other types, meaning they operate as valid look-ahead symbols as well (see Look-Ahead).

5.1.7.3. Bytes

When parsing a field of type bytes, Spicy will consume raw input bytes according to a specified attribute that determines when to stop. The following attributes are supported:

&eod

Consumes all subsequent data until the end of the input is reached.

&size=N

Consumes exactly N bytes. The attribute may be combined with &eod to consume up to N bytes instead (i.e., permit running out of input before the size limit is reached).

(This attribute works for fields of all types. We list it here because it’s particularly common to use it with bytes.)

&until=DELIM

Consumes bytes until the specified delimiter is found. DELIM must be of type bytes itself. The delimiter will be consumed, but not included in the resulting value.

&until-including=DELIM

Similar to &until, but this does include the delimiter DELIM in the resulting value.

At least one of these attributes must be provided.

On top of that, bytes fields support the attribute &chunked to change how the parsed data is processed and stored. Normally, a bytes field will first accumulate all desired data and then store the final, complete value in the field. With &chunked, if the data arrives incrementally in pieces, the field instead processes just whatever is available at a time, storing each piece directly, and individually, in the field. Each time a piece gets stored, any associated field hooks execute with the new part as their $$. Parsing with &chunked will eventually still consume the same number of bytes overall, but it avoids buffering everything in cases where that’s either infeasible or simply not needed.

Bytes fields support parsing constants: If a bytes constant is specified instead of a field type, parsing will expect to find the corresponding value in the input stream.
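As a minimal sketch of how these attributes combine (the field names and the input are illustrative only, not taken from a real protocol), the following unit reads one newline-terminated line and then stores whatever remains:

```spicy
module Test;

public type Foo = unit {
    # Stop at the first newline; the delimiter is consumed but not stored.
    line: bytes &until=b"\n";

    # Take everything that is left in the input.
    rest: bytes &eod;

    on %done { print self; }
};
```

Fed the input produced by printf 'abc\ndef', this should leave line holding b"abc" and rest holding b"def".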

5.1.7.4. Integer

Fields of integer type can be either signed (intN) or unsigned (uintN). In either case, the bit length N determines the number of bytes being parsed. By default, integers are expected to come in network byte order. You can specify a different order through the &byte-order=ORDER attribute, where ORDER is of type spicy::ByteOrder.

Integer fields support parsing constants: If an integer constant is specified instead of a field type, parsing will expect to find the corresponding value in the input stream. Since the exact type of the integer constant is important, you should use the constructor syntax to make that explicit (e.g., uint32(42) or int8(-1), rather than just 42 or -1).
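The following sketch puts both points together; the field names and the choice of constant are assumptions made for illustration. It expects a leading constant byte and then parses the same two-byte pattern once in network byte order and once in little endian:

```spicy
module Test;

import spicy;

public type Foo = unit {
    # Constant: parsing fails unless the input begins with the byte 42 (0x2a).
    : uint8(42);

    # Default network byte order: bytes 0x01 0x02 yield 258.
    a: uint16;

    # Little endian: bytes 0x01 0x02 yield 513.
    b: uint16 &byte-order=spicy::ByteOrder::Little;

    on %done { print self; }
};
```

A suitable test input would be printf '\x2a\x01\x02\x01\x02'.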

5.1.7.5. Real

Real values are parsed as either single or double precision values in IEEE 754 format, depending on the value of their &type=T attribute, where T is one of the values of spicy::RealType.
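As a brief sketch (field names are illustrative), a unit parsing one single-precision and one double-precision value could look like this:

```spicy
module Test;

import spicy;

public type Foo = unit {
    # Consumes 4 bytes of input.
    f: real &type=spicy::RealType::IEEE754_Single;

    # Consumes 8 bytes of input.
    d: real &type=spicy::RealType::IEEE754_Double;

    on %done { print self; }
};
```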

5.1.7.6. Regular Expression

When parsing a field through a Regular Expression, the expression is expected to match at the current position of the input stream. The field’s type becomes bytes, and it will store the matching data.

Inside hooks for fields with regular expressions, you can access capture groups through $1, $2, $3, etc. For example:

x : /(a.c)(de*f)(h.j)/ {
    print $1, $2, $3;
}

This will print out the relevant pieces of the data matching the corresponding set of parentheses. (There’s no $0, just use $$ as normal to get the full match.)

Matching a regular expression is more expensive if you need it to capture groups. If you are using groups inside your expression but don’t need the actual captures, add &nosub to the field to remove that overhead.
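For instance, in the following sketch (the expression itself is made up for illustration) the parentheses serve only to apply the repetition operators, so the captures can be switched off:

```spicy
module Test;

public type Foo = unit {
    # The groups only structure the expression; no captures are recorded.
    x: /(ab)+(cd)*/ &nosub { print $$; }
};
```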

5.1.7.7. Unit

Fields can have the type of another unit, in which case parsing will descend into that subunit’s grammar until that instance has been fully parsed. Field initialization and hooks work as usual.

If the subunit receives parameters, they must be given right after the type.

module Test;

type Bar = unit(a: string) {
    x: uint8 { print "%s: %u" % (a, self.x); }
};

public type Foo = unit {
    y: Bar("Spicy");
    on %done { print self; }
};

# printf '\01\02' | spicy-driver -d foo.spicy
Spicy: 1
[$y=[$x=1]]

See Unit Parameters for more.

5.1.7.8. Vector

Parsing a vector creates a loop that repeatedly parses elements of the specified type from the input stream until an end condition is reached. The field’s value accumulates all the elements into the final vector.

Spicy uses a specific syntax to define fields of type vector:

NAME : ELEM_TYPE[SIZE]

NAME is the field name as usual. ELEM_TYPE is the type of the vector’s elements, i.e., the type that will be repeatedly parsed. SIZE is the number of elements to parse into the vector; this is an arbitrary Spicy expression yielding an integer value. The resulting field type then will be vector<ELEM_TYPE>. Here’s a simple example parsing five uint8:

module Test;

public type Foo = unit {
    x: uint8[5];
    on %done { print self; }
};

# printf '\01\02\03\04\05' | spicy-driver -d foo.spicy
[$x=[1, 2, 3, 4, 5]]

It is possible to skip the SIZE (e.g., x: uint8[]) and instead use another kind of end condition to terminate a vector’s parsing loop. To that end, vectors support the following attributes:

&eod: Parses elements until the end of the input stream is reached.
&size=N: Parses the vector from the subsequent N bytes of input data. This effectively limits the available input to the corresponding window, letting the vector parse elements until it runs out of data. (This attribute works for fields of all types. We list it here because it’s particularly common to use it with vectors.)
&until=EXPR: Vector elements are parsed in a loop with EXPR being evaluated as a boolean expression after each parsed element, and before adding the element to the vector. Once EXPR evaluates to true, parsing stops without adding the element that was just parsed. Inside EXPR, $$ refers to the element most recently parsed.
&until-including=EXPR: Similar to &until, but does add the final element to the field’s vector when stopping parsing. Inside EXPR, $$ refers to the element most recently parsed.
&while=EXPR: Continues parsing as long as the boolean expression EXPR evaluates to true. Inside EXPR, $$ refers to the element most recently parsed.

If neither a size nor an attribute is given, Spicy will attempt to use look-ahead parsing to determine the end of the vector based on the next expected token. Depending on the unit’s field, this may not be possible, in which case Spicy will decline to compile the unit.
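To illustrate two of these end conditions side by side, here is a small sketch (field names and input are illustrative) that reads a zero-terminated list of bytes followed by a second list that runs to the end of the input:

```spicy
module Test;

public type Foo = unit {
    # Stop at the first zero byte; the zero is parsed but not added to x.
    x: uint8[] &until=($$ == 0);

    # Collect whatever remains.
    y: uint8[] &eod;

    on %done { print self; }
};
```

With the input from printf '\01\02\00\03\04', x should end up as [1, 2] and y as [3, 4].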

The syntax shown above generally works for all element types, including subunits (e.g., x: MyUnit[]).

Note

The x: (<T>)[] syntax is quite flexible. In fact, <T> is not limited to subunits, but allows for any standard field specification defining how to parse the vector elements. For example, x: (bytes &size=5)[]; parses a vector of 5-byte bytes instances.

When parsing a vector, Spicy supports using a special kind of field hook, foreach, that executes for each parsed element individually. Inside that hook, $$ refers to the element’s final value:

module Test;

public type Foo = unit {
    x: uint8[5] foreach { print $$, self.x; }
};

# printf '\01\02\03\04\05' | spicy-driver -d foo.spicy
1, []
2, [1]
3, [1, 2]
4, [1, 2, 3]
5, [1, 2, 3, 4]

As you can see, when a foreach hook executes the element has not yet been added to the vector. You may indeed use a stop statement inside a foreach hook to abort the vector’s parsing without adding the current element anymore. See Unit Hooks for more on hooks.

5.1.7.9. Void

The void type can be used as a placeholder in fields not meant to consume any data. This can be useful in some situations, such as providing a branch in switch constructs that foregoes any parsing, or attaching a &requires attribute to enforce a condition.

Fields of type void do not have any accessible value.
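As a sketch of the &requires use case (the field names and the limit of 4 are assumptions made for illustration), an anonymous void field can act as a checkpoint between two regular fields:

```spicy
module Test;

public type Foo = unit {
    len: uint8;

    # Consumes no input; parsing fails here if the condition does not hold.
    : void &requires=(self.len <= 4);

    data: bytes &size=self.len;

    on %done { print self; }
};
```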

5.1.8. Controlling Parsing

Spicy offers a few additional constructs inside a unit’s declaration for steering the parsing process. We discuss them in the following.

5.1.8.1. Conditional Parsing

A unit field may be conditionally skipped for parsing by adding an if ( COND ) clause, where COND is a boolean expression. The field will be only parsed if the expression evaluates to true at the time the field is next in line.

module Test;

public type Foo = unit {
    a: int8;
    b: int8 if ( self.a == 1 );
    c: int8 if ( self.a % 2 == 0 );
    d: int8;

    on %done { print self; }
};

# printf '\01\02\03\04' | spicy-driver -d foo.spicy
[$a=1, $b=2, $c=(not set), $d=3]

# printf '\02\02\03\04' | spicy-driver -d foo.spicy
[$a=2, $b=(not set), $c=2, $d=3]

New in version 1.12: Conditional blocks

If the same condition applies to multiple subsequent fields, they can be grouped together into a single conditional block:

module Test;

public type Foo = unit {
    a: int8;

    if ( self.a == 1 ) {
        b: int8;
        c: int8;
    }; # note the trailing semicolon

    on %done { print self; }
};

The syntax supports an optional else-block as well:

module Test;

public type Foo = unit {
    a: int8;

    if ( self.a == 1 ) {
        b: int8;
    }
    else {
        c: int8;
    }; # note the trailing semicolon

    on %done { print self; }
};

For repeated cases of conditional parsing where a single expression evaluates to one of several values, unit switch statements might allow for more compact and easier to maintain code.

5.1.8.2. Look-Ahead

Internally, Spicy builds an LR(1) grammar for each unit that it parses, meaning that it can actually look ahead in the parsing stream to determine how to process the current input location. Roughly speaking, if (1) the current construct does not have a clear end condition defined (such as a specific length), and (2) a specific value is expected to be found next; then the parser will keep looking for that value and end the current construct once it finds it.

“Construct” deliberately remains a bit of a fuzzy term here, but think of vector parsing as the most common instance of this: If you don’t give a vector an explicit termination condition (as discussed in Vector), Spicy will look at what’s expected to come after the container. As long as that’s something clearly recognizable (e.g., a specific value of an atomic type, or a match for a regular expression), it’ll terminate the vector accordingly.

Here’s an example:

module Test;

public type Foo = unit {
    data: uint8[];
        : /EOD/;
    x   : int8;

    on %done { print self; }
};

# printf '\01\02\03EOD\04' | spicy-driver -d foo.spicy
[$data=[1, 2, 3], $x=4]

For vectors, Spicy attempts look-ahead parsing automatically as a last resort when it doesn’t find more explicit instructions. However, it will reject a unit if it can’t find a suitable look-ahead symbol to work with. If we had written int32 in the example above, that would not have worked as the parser can’t recognize when there’s an int32 coming; it would need to be a concrete value, such as int32(42).

See the switch construct for another instance of look-ahead parsing.

5.1.8.3. switch

Spicy supports a switch construct as a way to branch into one of several parsing alternatives. There are two variants of this, an explicit branch and one driven by look-ahead:

Branch by expression

The most basic form of switching by expression looks like this:

switch ( EXPR ) {
    VALUE_1 -> FIELD_1;
    VALUE_2 -> FIELD_2;
    ...
    VALUE_N -> FIELD_N;
};

This evaluates EXPR at the time parsing reaches the switch. If there’s a VALUE matching the result, parsing continues with the corresponding field, and then proceeds with whatever comes after the switch. Example:

module Test;

public type Foo = unit {
    x: bytes &size=1;
    switch ( self.x ) {
        b"A" -> a8: int8;
        b"B" -> a16: int16;
        b"C" -> a32: int32;
    };

    on %done { print self; }
};

# printf 'A\01' | spicy-driver -d foo.spicy
[$x=b"A", $a8=1, $a16=(not set), $a32=(not set)]

# printf 'B\01\02' | spicy-driver -d foo.spicy
[$x=b"B", $a8=(not set), $a16=258, $a32=(not set)]

We see in the output that all of the alternatives turn into normal unit members, with all but the one for the branch that was taken left unset.

If none of the values match the expression, that’s considered a parsing error and processing will abort. Alternatively, one can add a default alternative by using * as the value. The branch will then be taken whenever no other value matches.

A few additional notes about the fields inside an alternative:

In our example, the fields of all alternatives all have different names, and they all show up in the output. One can also reuse names across alternatives as long as the types exactly match. In that case, the unit will end up with only a single instance of that member.

An alternative can match against more than one value by separating them with commas (e.g., b"A", b"B" -> x: int8;).

Alternatives can have more than one field attached by enclosing them in braces, i.e., VALUE -> { FIELD_1a; FIELD_1b; ...; FIELD_1n; }.

Sometimes one really just needs the branching capability, but doesn’t have any field values to store. In that case an anonymous void field may be helpful (e.g., b"A" -> : void { DoSomethingHere(); }).
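The following sketch combines the variations described in these notes: several values per alternative, a braced group of fields, and a default alternative that parses nothing. The field names are illustrative only:

```spicy
module Test;

public type Foo = unit {
    x: bytes &size=1;

    switch ( self.x ) {
        # Two values sharing one alternative.
        b"A", b"B" -> small: int8;

        # One alternative carrying two fields.
        b"C" -> {
            hi: int8;
            lo: int8;
        }

        # Default: taken when nothing else matches; consumes no input.
        * -> : void;
    };

    on %done { print self; }
};
```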

Branch by look-ahead

switch also works without any expression as long as the presence of all the alternatives can be reliably recognized by looking ahead in the input stream:

module Test;

public type Foo = unit {
    switch {
        -> a: b"A";
        -> b: b"B";
        -> c: b"C";
    };

    on %done { print self; }
};

# printf 'A' | spicy-driver -d foo.spicy
[$a=b"A", $b=(not set), $c=(not set)]

While this example is a bit contrived, the mechanism becomes powerful once you have subunits that are recognizable by how they start:

module Test;

type A = unit {
    a: b"A";
};

type B = unit {
    b: uint16(0xffff);
};

public type Foo = unit {
    switch {
        -> a: A;
        -> b: B;
    };

    on %done { print self; }
};

# printf 'A ' | spicy-driver -d foo.spicy
[$a=[$a=b"A"], $b=(not set)]

# printf '\377\377' | spicy-driver -d foo.spicy
[$a=(not set), $b=[$b=65535]]

Switching Over Fields With Common Size

You can limit the input any field in a unit switch receives by attaching an optional &size=EXPR attribute that specifies the number of raw bytes to make available. This is analogous to the field size attribute and especially useful to remove duplication when each case is subject to the same constraint.

module Test;

public type Foo = unit {
    tag: uint8;
    switch ( self.tag ) {
       1 -> b1: bytes &eod;
       2 -> b2: bytes &eod &convert=$$.lower();
    } &size=3;

    on %done { print self; }
};

# printf '\01ABC' | spicy-driver -d foo.spicy
[$tag=1, $b1=b"ABC", $b2=(not set)]

# printf '\02ABC' | spicy-driver -d foo.spicy
[$tag=2, $b1=(not set), $b2=b"abc"]

5.1.8.4. Backtracking

Spicy supports a simple form of manual backtracking. If a field is marked with &try, a later call to the unit’s backtrack() method anywhere down in the parse tree originating at that field will immediately transfer control over to the field following the &try. When doing so, the data position inside the input stream will be reset to where it was when the &try field started its processing. Units along the original path will be left in whatever state they were at the time backtrack() executed (i.e., they will probably remain just partially initialized). When backtrack() is called on a path that involves multiple &try fields, control continues after the most recent one.

Example:

module Test;

public type test = unit {
    foo: Foo &try;
    bar: Bar;

    on %done { print self; }
};

type Foo = unit {
    a: int8 {
        if ( $$ != 1 )
            self.backtrack();
    }
    b: int8;
};

type Bar = unit {
    a: int8;
    b: int8;
};

# printf '\001\002\003\004' | spicy-driver -d backtrack.spicy
[$foo=[$a=1, $b=2], $bar=[$a=3, $b=4]]

# printf '\003\004' | spicy-driver -d backtrack.spicy
[$foo=[$a=3, $b=(not set)], $bar=[$a=3, $b=4]]

backtrack() can be called from inside %error hooks, so this provides a simple form of error recovery as well.
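A minimal sketch of that pattern follows; the unit layout and the b"FOO" marker are assumptions made for illustration. If the input does not start with the expected marker, the resulting parse error triggers the %error hook, which backtracks so that parsing resumes at bar from the position where foo started:

```spicy
module Test;

public type test = unit {
    foo: Foo &try;
    bar: Bar;

    on %done { print self; }
};

type Foo = unit {
    magic: b"FOO";
    a: int8;

    # On any parse error inside Foo, give up and continue after the &try field.
    on %error { self.backtrack(); }
};

type Bar = unit {
    data: bytes &size=2;
};
```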

Note

This mechanism is preliminary and will probably see refinement over time, both in terms of more automated backtracking and by providing better control where to continue after backtracking.

5.1.9. Changing Input

By default, a Spicy parser proceeds linearly through its inputs, parsing as much as it can and yielding back to the host application once it runs out of input. There are two ways to change this linear model: diverting parsing to a different input, and random access within the current unit’s data.

Parsing custom data

A unit field can have either &parse-from=EXPR or &parse-at=EXPR attached to it to change where it’s receiving its data to parse from. EXPR is evaluated at the time the field is reached. For &parse-from it must produce a value of type bytes, which will then constitute the input for the field. This can, e.g., be used to reparse previously received input:

module Test;

public type Foo = unit {
    x: bytes &size=2;
    y: uint16 &parse-from=self.x;
    z: bytes &size=2;

    on %done { print self; }
};

# printf '\x01\x02\x03\x04' | spicy-driver -d foo.spicy
[$x=b"\x01\x02", $y=258, $z=b"\x03\x04"]

For &parse-at, EXPR must yield an iterator pointing to (a still valid) position of the current unit’s input stream (such as retrieved through input()). The field will then be parsed from the data starting at that location.

Random access

While a unit is being parsed, you may revert the current input position backwards to any location between the first byte the unit has seen and the current position. You can use a set of built-in unit methods to control the current position:

input(): Returns a stream iterator pointing to the current input position.
set_input(): Sets the current input position to the location of the specified stream iterator. Per above, the new position needs to reside between the beginning of the current unit’s data and the current position; otherwise an exception will be generated at runtime.
offset(): Returns the numerical offset of the current input position relative to the position of the first byte fed into this unit.
position(): Returns an iterator to the current input position in the stream fed into this unit.

You can achieve random access by saving an iterator from input() in a unit variable, then later returning to that position (or one derived from it) by calling set_input() with that variable. Here’s an example that parses input data twice with different subunits:

module Test;

public type Foo = unit {
    on %init() { self.start = self.input(); }

    a: A { self.set_input(self.start); }
    b: B;

    on %done() { print self; }

    var start: iterator<stream>;
};

type A = unit {
    x: uint32;
};

type B = unit {
    y: bytes &size=4;
};

# printf '\00\00\00\01' | spicy-driver -d foo.spicy
[$a=[$x=1], $b=[$y=b"\x00\x00\x00\x01"], $start=<offset=0 data=b"\x00\x00\x00\x01">]

If you look at the output, you see that the start iterator remembers its offset, relative to the global input stream. It would also show the data at that offset if the parser had not already discarded it by the time we print it out.

Note

Spicy parsers discard input data as quickly as possible as parsing moves through the input stream. Indeed, that’s why using random access may come with a performance penalty as the parser now needs to buffer all of the unit’s data until it has been fully processed.

5.1.10. Filters

Spicy supports attaching filters to units that get to preprocess and transform a unit’s input before its parser gets to see it. A typical use case for this is stripping off a data encoding, such as compression or Base64.

A filter is itself just a unit that comes with an additional property %filter marking it as such. The filter unit’s input represents the original input to be transformed. The filter calls an internally provided unit method forward() to pass any transformed data on to the main unit that it’s attached to. The filter can call forward() arbitrarily many times, each time forwarding a subsequent chunk of input. To attach a filter to a unit, one calls the method connect_filter() with an instance of the filter’s type. Putting that all together, this is an example of a simple filter that upper-cases all input before the main parsing unit gets to see it:

module Test;

type Filter = unit {
    %filter;

    : bytes &eod &chunked {
        self.forward($$.upper());
    }
};

public type Foo = unit {
    on %init { self.connect_filter(new Filter); }
    x: bytes &size=5 { print self.x; }
};

# printf 'aBcDe' | spicy-driver -d foo.spicy
ABCDE

There are a couple of predefined filters coming with Spicy that become available by importing the filter library module:

filter::Zlib: Provides zlib decompression.
filter::Base64Decode: Provides base64 decoding.
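Attaching one of the predefined filters works the same way as attaching a custom one. As a sketch (the unit and field names are illustrative), this unit decodes Base64 input before parsing it:

```spicy
module Test;

import filter;

public type Foo = unit {
    # The unit's fields see the decoded data, not the raw Base64 text.
    on %init { self.connect_filter(new filter::Base64Decode); }

    x: bytes &eod { print self.x; }
};
```

A possible test input is printf 'SGVsbG8=', which is the Base64 encoding of "Hello".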

5.1.11. Sinks

Sinks provide a powerful mechanism to chain multiple units together into a layered stack, each processing the output of its predecessor. A sink is the connector here that links two unit instances: one side writing and one side reading, like a Unix pipe. As additional functionality, the sink can internally reassemble data chunks that are arriving out of order before passing anything on.

Here’s a basic example of two unit types chained through a sink:

module Test;

public type A = unit {
    on %init { self.b.connect(new B); }

    length: uint8;
    data: bytes &size=self.length { self.b.write($$); }

    on %done { print "A", self; }

    sink b;
};

public type B = unit {
        : /GET /;
    path: /[^\n]+/;

    on %done { print "B", self; }
};

# printf '\13GET /a/b/c\n' | spicy-driver -d -p Test::A foo.spicy
B, [$path=b"/a/b/c"]
A, [$length=11, $data=b"GET /a/b/c\x0a", $b=<sink>]

Let’s see what’s going on here. First, there’s sink b inside the declaration of A. That’s the connector, kept as state inside A. When parsing for A is about to begin, the %init hook connects the sink to a new instance of B; that’ll be the receiver for data that A is going to write into the sink. That writing happens inside the field hook for data: once we have parsed that field, we write what will go to the sink using its built-in write() method. With that write operation, the data will emerge as input for the instance of B that we created earlier, and that will just proceed parsing it normally. As the output shows, in the end both unit instances end up having their fields set.

As an alternative to calling write() as in the example, there’s some syntactic sugar for fields of type bytes (like data here): We can just replace the hook with a -> operator to have the parsed data automatically be forwarded to the sink: data: bytes &size=self.length -> self.b.
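For reference, this is how the complete example might look with the -> operator in place of the explicit hook; only the data field differs from the version using write():

```spicy
module Test;

public type A = unit {
    on %init { self.b.connect(new B); }

    length: uint8;

    # Forward the parsed bytes to the sink without an explicit hook.
    data: bytes &size=self.length -> self.b;

    on %done { print "A", self; }

    sink b;
};

public type B = unit {
        : /GET /;
    path: /[^\n]+/;

    on %done { print "B", self; }
};
```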

Sinks have a number of further methods, see Sink for the complete reference. Most of them we will also encounter in the following when discussing additional functionality that sinks provide.

Note

Because sinks are meant to decouple processing between two units, a unit connected to a sink will not pass any parse errors back up to the sink’s parent. If you want to catch them, install an %error hook inside the connected unit.

5.1.11.1. Using Filters

Sinks also support filters to preprocess any data they receive before forwarding it on. This works just like for units by calling the built-in sink method connect_filter(). For example, if in the example above data had been gzip compressed, we could have instructed the sink to automatically decompress it by calling self.b.connect_filter(new filter::Zlib) (leveraging the Spicy-provided Zlib filter).
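A sketch of that setup, assuming the payload written into the sink is compressed, could look as follows. The receiving unit B is simplified here to store whatever it receives; its layout is an assumption made for illustration:

```spicy
module Test;

import filter;

public type A = unit {
    on %init {
        self.b.connect(new B);

        # Decompress everything written into the sink before B sees it.
        self.b.connect_filter(new filter::Zlib);
    }

    length: uint8;
    data: bytes &size=self.length { self.b.write($$); }

    sink b;
};

public type B = unit {
    payload: bytes &eod;

    on %done { print "B", self; }
};
```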

5.1.11.2. Leveraging MIME Types

In our example above we knew which type of unit we wanted to connect. In practice, that may or may not be the case. Often, it only becomes clear at runtime what the choice for the next layer should be, such as when using well-known ports to determine the appropriate application-layer analyzer for a TCP stream. Spicy supports dynamic selection through a generalized notion of MIME types: Units can declare which MIME types they know how to parse (see Meta data), and sinks have a connect_mime_type() method that will instantiate and connect any that match their argument (if that’s multiple, all will be connected and all will receive the same data).

“MIME type” can mean actual MIME types, such as text/html. Applications can, however, also define their own notion of <type>/<subtype> to model other semantics. For example, one could use x-port/443 as a convention to trigger parsers by well-known port. An SSL unit would then declare %mime-type = "x-port/443", and the connection would be established through the equivalent of connect_mime_type("x-port/%d" % resp_port_of_connection).
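As a rough sketch of the mechanism, one unit below declares the MIME type it handles and another connects its sink by type rather than by naming the unit. The unit names HTML and Dispatcher, and their fields, are hypothetical:

```spicy
module Test;

public type HTML = unit {
    # Declare which MIME type this unit knows how to parse.
    %mime-type = "text/html";

    body: bytes &eod;

    on %done { print "HTML", self; }
};

public type Dispatcher = unit {
    # Connect all units that have declared themselves for this type.
    on %init { self.out.connect_mime_type("text/html"); }

    data: bytes &eod { self.out.write($$); }

    sink out;
};
```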

Todo

For this specific example, there’s a better solution: We also have the %port property and should just build up a table index on that.

5.1.11.3. Reassembly

Reassembly (or defragmentation) of out-of-order data chunks is a common requirement for many protocols. Sinks have that functionality built-in by allowing you to associate a position inside a virtual sequence space with each chunk of data. Sinks will then pass their data on to connected units only once they have collected a continuous, in-order range of bytes.

The easiest way to leverage this is to simply associate sequence numbers with each write() operation:

module Test;

public type Foo = unit {

    sink data;

    on %init {
        self.data.connect(new Bar);
        self.data.write(b"567", 5);
        self.data.write(b"89", 8);
        self.data.write(b"012", 0);
        self.data.write(b"34", 3);
    }
};

public type Bar = unit {
    s: bytes &eod;
    on %done { print self.s; }
};

# spicy-driver -p Test::Foo foo.spicy </dev/null
0123456789

By default, Spicy expects the sequence space to start at zero, so the first byte of the input stream needs to be passed in with sequence number zero. You can change that base number by calling the sink method set_initial_sequence_number(). You can control Spicy’s gap handling, including when to stop buffering data because you know nothing further will arrive anymore. Spicy can also notify you about unsuccessful reassembly through a series of built-in unit hooks. See Sink for a reference of the available functionality.
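As a sketch of changing the base number, the following variation of the example above (the value 100 is arbitrary) declares that the stream begins at sequence number 100 and then writes two chunks out of order:

```spicy
module Test;

public type Foo = unit {

    sink data;

    on %init {
        self.data.connect(new Bar);

        # The first byte of the stream carries sequence number 100.
        self.data.set_initial_sequence_number(100);

        self.data.write(b"345", 103);
        self.data.write(b"012", 100);
    }
};

public type Bar = unit {
    s: bytes &eod;
    on %done { print self.s; }
};
```

Run like the previous example, Bar should receive the bytes in order, i.e., 012345.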

5.1.12. Contexts

Parsing may need to retain state beyond any specific unit’s lifetime. For example, a UDP protocol may want to remember information across individual packets (and hence units), or a bi-directional protocol may need to correlate the request side with the response side. One option for implementing this in Spicy is managing such state manually in global variables, for example by maintaining a global map that ties a unique connection ID to the information that needs to be retained. However, doing so is clearly cumbersome and error prone. As an alternative, a unit can make use of a dedicated context value, which is an instance of a custom type that has its lifetime determined by the host application running the parser. For example, Zeek will tie the context to the underlying connection.

Any public unit can declare a context through a unit-level property called %context, which takes an arbitrary type as its argument. For example:

public type Foo = unit {
    %context = bytes;
    [...]
};

When used as a top-level entry point to parsing, the unit will then, by default, receive a unique context value of that type. That context value can be accessed through the context() method, which will return a reference to it:

module Test;

public type Foo = unit {
    %context = int64;

    on %init { print self.context(); }
};

# spicy-driver -d foo.spicy </dev/null
0

By itself, this is not very useful. However, host applications can control how contexts are maintained, and they may assign the same context value to multiple units. For example, when parsing a protocol, the Zeek integration always creates a single context value shared by all top-level units belonging to the same connection, enabling parsers to maintain bi-directional, per-connection state. The batch mode of spicy-driver does the same.

Note

A unit’s context value gets set only when a host application uses it as the top-level starting point for parsing. If in the above example Foo wasn’t the entry point, but used inside another unit further down during the parsing process, its context would remain unset.

As an example, the following grammar—mimicking a request/reply-style protocol—maintains a queue of outstanding textual commands to then associate numerical result codes with them as the responses come in:

module Test;

# We wrap the state into a tuple to make it easy to add more attributes if needed later.
type Pending = tuple<pending: vector<bytes>>;

public type Requests = unit {
    %context = Pending;

    : Request[] foreach { self.context().pending.push_back($$.cmd); }
};

public type Replies = unit {
    %context = Pending;

    : Reply[] foreach {
        if ( |self.context().pending| ) {
            print "%s -> %s" % (self.context().pending.back(), $$.response);
            self.context().pending.pop_back();
        }
        else
            print "<missing request> -> %s" % $$.response;
    }
};

type Request = unit {
    cmd: /[A-Za-z]+/;
    : b"\n";
};

type Reply = unit {
    response: /[0-9]+/;
    : b"\n";
};

# spicy-driver -F input.dat context.spicy
msg -> 100
put -> 200
CAT -> 555
end -> 300
get -> 400
LST -> 666

The output is produced from this input batch file. This would work the same when used with Zeek on a corresponding packet trace.

Note that the units for the two sides of the connection need to declare the same %context type. Processing will abort at runtime with a type mismatch error if that’s not the case.

5.1.13. Error Handling

Whenever a parser encounters an unexpected situation during processing, it triggers a runtime error. This includes parsing errors due to input that does not match the current unit, failing &requires conditions, and also any logic errors in hooks, such as attempting to read an unset unit field or accessing an invalid vector index.

By default, any runtime error will cause the parsing to terminate immediately, with a corresponding error message reported back to the host application. The Spicy parser will not be able to continue processing afterwards. However, there are a couple of ways to catch parsing errors (but not other runtime errors) and potentially recover from them, which we discuss in the following.

A unit can provide special %error hooks that will execute when a parsing error is encountered. A unit-wide %error hook will catch all parsing errors occurring anywhere inside the unit, including any sub-units (if not otherwise handled by the sub-unit itself already). Example:

module MyModule;

type MyType = unit {
    magic: b"MAGIC";

    on %error(msg: string) {
        print "Error when parsing MyType: ", msg;
    }
};

The msg parameter is optional. If it’s specified, it will contain an error message describing the issue.

By default, even with an %error hook in place, the parser will still terminate after executing the hook. To change that, the hook may use Backtracking to specify where to continue parsing after the error. Alternatively, if automatic error recovery is in place, the parser will attempt recovery after the error hooks have executed.

New in version 1.12: Per-field %error handler

Rather than defining a unit-wide %error hook, it is also possible to just have an individual field catch its own parsing errors. The easiest way to do this is to attach an %error attribute to an inline hook:

module My;

type MyType = unit {
    magic: b"MAGIC" %error { # will run if magic cannot be parsed
        print "magic not found";
    }
};

To get access to the error message as well, define it out of line like this:

module MyUnit;

type MyType = unit {
    magic: b"MAGIC";

    on magic(msg: string) %error {
        print "Error when parsing magic: ", msg;
    }
};

5.1.14. Error Recovery

Real-world input does not always look like what parsers expect: endpoints may not conform to the protocol’s specification, a parser’s grammar might not fully cover all of the protocol, or some input may be missing due to packet loss or because the parser starts in the middle of a conversation. By default, if a Spicy parser encounters such situations, it will abort parsing altogether and issue an error message. Alternatively, however, Spicy allows grammar writers to specify heuristics to recover from errors. The main challenge here is finding a spot in the subsequent input where parsing can reliably resume.

Spicy employs a two-phase approach to such recovery: it first searches for a possible point in the input stream where it seems promising to attempt to resume parsing; and then it confirms that choice by trying to parse a few fields at that location according to the grammar to see if that’s successful. We say that during the first part of this process, the Spicy parser is in synchronization mode; during the second, it is in trial mode.

Phase 1: Synchronization

To identify locations where parsing can attempt to pick up again after an error, a grammar can add &synchronize attributes to selected unit fields, marking them as synchronization points. Whenever an error occurs during parsing, Spicy will determine the closest synchronization point in the grammar following the error’s location, and then attempt to continue processing there by skipping ahead in the input data until it aligns with what that field is looking for.

A synchronization point may be any of the following:

A field for which parsing begins with a constant literal (e.g., a specific sequence of bytes). To realign the input stream, the parser will search the input for the next occurrence of this literal, discarding any data in between. Example:
```
type X = unit { ... };

type Y = unit {
    a: b"begin-of-Y";
    b: bytes &size=10;
};

type Foo = unit {
    x: X;
    y: Y &synchronize;
};
```
If a parse error occurs while parsing Foo::x, Spicy will move ahead to Foo::y, switch into synchronization mode, and start searching the input for the bytes begin-of-Y. If found, it’ll continue with parsing Foo::y at that location in trial mode (see below).

Note

Behind the scenes, synchronization through literals uses the same machinery as look-ahead parsing, meaning that it works across sub-units, vector content, switch statements, etc. No matter how complex the field, as long as there are one or more literals that must always come first when parsing it, the field may be used as a synchronization point.

A field with a type that specifies %synchronize-at or %synchronize-after. The parser will search the input for the next occurrence of the given literal, discarding any data in between. If the search is successful, %synchronize-at will leave the input at the position of the search literal for later extraction, while %synchronize-after will discard the search literal.

If either of these unit properties is specified, it will always overrule any other potential synchronization points in the unit. Example:
```
type X = unit {
    ...
    : /END/;
};

type Y = unit {
    %synchronize-after = /END/;
    a: bytes &size=10;
};

type Foo = unit {
    x: X;
    y: Y &synchronize;
};
```
A field that’s located inside the input stream at a fixed offset relative to the field triggering the error. The parser will then be able to skip ahead to that offset. Example:
```
type X = unit { ... };
type Y = unit { ... };

type Foo = unit {
    x: X &size=512;
    y: Y &synchronize;
};
```
Here, when parsing Foo::x triggers an error, Spicy will know that it can continue with Foo::y at offset <beginning of Foo::x> + 512.

Todo

This synchronization strategy is not yet implemented.

When parsing a vector, the inner elements may provide synchronization points as well. Example:
```
type X = unit {
    a: b"begin-of-X";
    b: bytes &size=10;
};

type Foo = unit {
    xs: (X &synchronize)[];
};
```
If one element of the vector Foo::xs fails to parse, Spicy will attempt to find the beginning of the next X in the input stream and continue there. For this to work, the vector’s elements must themselves represent valid synchronization points (e.g., start with a literal). If the vector is of fixed size, after successful synchronization, it will contain the expected number of entries, but some of them may remain (fully or partially) uninitialized if they encountered errors.
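
The %synchronize-after example above discards the search literal. For comparison, here is a minimal sketch of the %synchronize-at variant, where the literal stays in the input and the unit parses it as a regular field. The BEGIN marker, the contents of X, and the unconditional confirm are assumptions chosen for illustration:

```spicy
module Test;

# Hypothetical first unit; any parse error in here triggers recovery.
type X = unit {
    tag: b"X";
    len: uint8;
};

type Y = unit {
    # Resynchronize at the next "BEGIN", leaving the literal in the input.
    %synchronize-at = /BEGIN/;

    # Since the literal was not consumed, the grammar extracts it here.
    start: /BEGIN/;
    a: bytes &size=10;
};

public type Foo = unit {
    x: X;
    y: Y &synchronize;

    # Simplification: accept the first synchronization point found.
    on %synced { confirm; }
};
```

Choosing between the two properties comes down to whether the grammar wants to see the marker as a field value (%synchronize-at) or treat it purely as a delimiter (%synchronize-after).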

Phase 2: Trial parsing

Once input has been realigned with a synchronization point, parsing switches from synchronization mode into trial mode, in which the parser will attempt to confirm that it has indeed found a viable place to continue. It does so by proceeding to parse subsequent input from the synchronization point onwards, until one of the following occurs:

A unit hook explicitly acknowledges that synchronization has been successful by executing Spicy’s confirm statement. Typically, a grammar will do so once it has been able to correctly parse a few fields following the synchronization point, that is, as many as it needs to be sufficiently certain that it’s indeed seeing the expected structure.
A unit hook explicitly declines the synchronization by executing Spicy’s reject statement. This will abandon the current synchronization attempt, and switch back into the original synchronization mode again to find another location to try.
Parsing reaches the end of the grammar without either confirm or reject having been called. In this case, the parser will abort with a fatal parse error.

Note that during trial mode, any fields between the synchronization point and the eventual confirm/reject location will already be processed as usual, including the execution of any hooks except %error. This may leave the unit partially initialized if trial parsing eventually fails. Trial mode will also consume any input along the way, with any further synchronization attempts proceeding only on subsequent, not yet seen, data.
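
To illustrate how a grammar might decide between confirm and reject, the following sketch checks one field after the synchronization point before committing. The len field and the plausibility threshold of 16 are hypothetical. Note also that the field hook runs on every parse of len, not only in trial mode, so consult the reference for confirm and reject regarding their behavior outside trial mode:

```spicy
module Test;

public type Example = unit {
    start_a: /SEC_A/;
    a: uint8;

    # Synchronization point: after an error, search for this literal.
    start_b: /SEC_B/ &synchronize;

    # Hypothetical length byte following the section marker.
    len: uint8 {
        if ( $$ <= 16 ) {
            # Plausible value: accept this location and leave trial mode.
            confirm;
        } else {
            # Implausible value: give up on this location and search again.
            reject;
        }
    }

    b: bytes &size=self.len;

    on %done { print self; }
};
```

Deferring the decision until after len has been parsed trades a little extra consumed input for more confidence that the parser has actually found the start of a section.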

Synchronization Hooks

For customization, Spicy provides a set of hooks executing at different points during the synchronization process:

on %synced { ...}

Executes when a synchronization point has been found and parsing resumes there, just before the parser begins processing the corresponding field in trial mode.

on %confirmed { ...}

Executes when trial mode ends successfully with confirm.

on %rejected { ...}

Executes when trial mode fails with reject.

on %sync_advance(offset: uint64)

Executes regularly (see below) while the parser is searching for a synchronization point. The offset parameter gives the current position inside the input stream.

This hook can be used to check if the parser is skipping too much data for the analysis to remain useful. For example, a protocol analyzer could decide to bail out if the input stream consists mainly of gaps, as reported by self.stream().statistics().

By default, the hook executes every 4KB of input data skipped while searching for a synchronization point. It may not necessarily trigger immediately at the 4KB mark, but soon after when parsing gets a chance to check the input stream’s position.

You may change the trigger volume by defining a unit property %sync-advance-block-size = <VALUE> where <VALUE> is an alternative size value in bytes. As usual, this property can also be set at the module level to apply to all units.
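
A minimal sketch of how the hook and the property might be combined is shown below. The block size of 1024 and the log message are arbitrary choices for illustration:

```spicy
module Test;

public type Example = unit {
    # Run %sync_advance roughly every 1024 skipped bytes instead of
    # the default of 4KB.
    %sync-advance-block-size = 1024;

    start_a: /SEC_A/;

    # Synchronization point to search for after a parse error.
    start_b: /SEC_B/ &synchronize;
    b: bytes &eod;

    # Report progress while the parser is still searching.
    on %sync_advance(offset: uint64) {
        print "still searching for a synchronization point at offset %d" % offset;
    }

    # Simplification: accept the first synchronization point found.
    on %synced { confirm; }
};
```

A smaller block size gives the grammar earlier and more frequent chances to react, at the cost of running the hook more often.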

Example Synchronization Process

As an example, let’s consider a grammar consisting of two sections where each section is started with a section header literal (SEC_A and SEC_B here).

We want to allow for inputs which miss parts or all of the first section. For such inputs, we can still synchronize the input stream by looking for the start of the second section. (For simplicity, we just use a single unit, even though typically one would probably have separate units for the two sections.)

module Test;

public type Example = unit {
    start_a: /SEC_A/;
    a: uint8;

    # If we fail to find e.g., 'SEC_A' in the input, try to synchronize on this literal.
    start_b: /SEC_B/ &synchronize;
    b: bytes &eod;

    # In this example confirm unconditionally.
    on %synced {
        print "Synced: %s" % self;
        confirm;
    }

    # Log when the %confirmed and %rejected hooks execute.
    on %confirmed { print "Confirmed: %s" % self; }
    on %rejected { print "Rejected: %s" % self; }

    on %done { print "Done %s" % self; }
};

Suppose this parser encounters the input \xFFSEC_Babc, which lacks the SEC_A section marker:

start_a missing,
a=255
start_b=SEC_B as expected, and
b=abc.

For such an input, parsing will encounter an initial error when it sees \xFF where SEC_A would have been expected.

Since start_b is marked as a synchronization point, the parser enters synchronization mode and jumps over the field a to start_b, to now search for SEC_B.
At this point the input still contains the unexpected \xFF and remains \xFFSEC_Babc. While searching for SEC_B, \xFF is skipped over, and then the expected token is found. The input is now SEC_Babc.
The parser has successfully synchronized and enters trial mode. All %synced hooks are invoked.
The unit’s %synced hook executes confirm and the parser leaves trial mode. All %confirmed hooks are invoked.
Regular parsing continues at start_b. The input was SEC_Babc so start_b is set to SEC_B and b to abc.

Since parsing for start_a was unsuccessful and a was jumped over, both fields remain unset.

# printf '\xFFSEC_Babc' | spicy-driver -d foo.spicy
Synced: [$start_a=(not set), $a=(not set), $start_b=(not set), $b=(not set)]
Confirmed: [$start_a=(not set), $a=(not set), $start_b=(not set), $b=(not set)]
Done [$start_a=(not set), $a=(not set), $start_b=b"SEC_B", $b=b"abc"]
