5.1. Parsing
5.1.1. Basics
5.1.1.1. Type Declaration
Spicy expresses units of data to parse through a type called,
appropriately, unit. At a high level, a unit is similar to structs
or records in other languages: It defines an ordered set of fields,
each with a name and a type, that during runtime will store
corresponding values. Units can be instantiated, fields can be
assigned values, and these values can then be retrieved. Here’s about
the most basic Spicy unit one can define:
type Foo = unit {
version: uint32;
};
We name the type Foo, and it has just one field called version, which stores a 32-bit unsigned integer.
Leaving parsing aside for a moment, we can indeed use this type much like a typical struct/record type:
module Test;
type Foo = unit {
version: uint32;
};
global f: Foo;
f.version = 42;
print f;
This will print:
[$version=42]
Fields are initially unset, and attempting to read an unset field will
trigger a runtime error. You may, however,
provide a default value by adding a &default attribute to the
field, in which case that default will be returned on access if no value has
been explicitly assigned:
module Test;
type Foo = unit {
version: uint32 &default=42;
};
global f: Foo;
print f;
print "version is %s" % f.version;
This will print:
[$version=(not set)]
version is 42
Note how the field remains unset even with the default now specified, while the access returns the expected value.
5.1.1.2. Parsing a Field
We can turn this minimal unit type into a starting point for parsing
data—in this case a 32-bit integer from four bytes of raw input.
First, we need to declare the unit as public to make it accessible
from outside of the current module, which is a requirement if a host
application wants to use the unit as a parsing entry point.
module Test;
public type Foo = unit {
version: uint32;
on %done {
print "0x%x" % self.version;
}
};
Let’s use spicy-driver to parse 4 bytes of input through this unit:
# printf '\01\02\03\04' | spicy-driver foo.spicy
0x1020304
The output comes, of course, from the print statement inside the
%done hook, which executes once the unit has been fully parsed.
(We will discuss unit hooks further below.)
By default, Spicy assumes integers that it parses to be represented in
network byte order (i.e., big-endian), hence the output above.
Alternatively, we can tell the parser through an attribute that our
input is arriving in, say, little-endian order instead. To do that, we
import the spicy library module, which provides an enum type
spicy::ByteOrder that we can give to a &byte-order field
attribute for fields that support it:
module Test;
import spicy;
public type Foo = unit {
version: uint32 &byte-order=spicy::ByteOrder::Little;
on %done {
print "0x%x" % self.version;
}
};
# printf '\01\02\03\04' | spicy-driver foo.spicy
0x4030201
We see that unpacking the value has now flipped the bytes before
storing it in the version field.
Similar to &byte-order, Spicy offers a variety of further
attributes that control the specifics of how fields are parsed. We’ll
discuss them in the relevant sections throughout the rest of this
chapter.
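To make the effect of the byte order concrete, here is a minimal sketch in Python (not Spicy) that models how the same four input bytes turn into the two values shown above. The helper name parse_uint32 is purely illustrative and not part of any Spicy API.

```python
import struct

def parse_uint32(data: bytes, little_endian: bool = False) -> int:
    """Unpack the first four bytes as an unsigned 32-bit integer."""
    # ">" selects network (big-endian) order, "<" selects little-endian order.
    fmt = "<I" if little_endian else ">I"
    (value,) = struct.unpack(fmt, data[:4])
    return value

raw = b"\x01\x02\x03\x04"
print("0x%x" % parse_uint32(raw))                      # 0x1020304
print("0x%x" % parse_uint32(raw, little_endian=True))  # 0x4030201
```

The input bytes are identical in both calls; only their interpretation changes.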
5.1.1.3. Non-type Fields
Unit fields always have a type. However, in some cases a field’s type is not explicitly declared, but derived from what’s being parsed. The main example of this is parsing a constant value: Instead of a type, a field can specify a constant of a parseable type. The field’s type will then (usually) just correspond to the constant’s type, and parsing will expect to find the corresponding value in the input stream. If a different value gets unpacked instead, parsing will abort with an error. Example:
module Test;
public type Foo = unit {
bar: b"bar";
on %done { print self.bar; }
};
# printf 'bar' | spicy-driver foo.spicy
bar
# printf 'foo' | spicy-driver foo.spicy
[fatal error] terminating with uncaught exception of type spicy::rt::ParseError: parse error: expecting 'bar' (foo.spicy:5)
Regular expressions extend this scheme a bit further: If a field specifies a regular expression constant rather than a type, the field will have type Bytes and store the data that ends up matching the regular expression:
module Test;
public type Foo = unit {
x: /Foo.*Bar/;
on %done { print self; }
};
# printf 'Foo12345Bar' | spicy-driver foo.spicy
[$x=b"Foo12345Bar"]
There’s also a programmatic way to change a field’s type to something that’s different than what’s being parsed, see the &convert attribute.
5.1.1.3.1. Parsing Fields With Known Size
You can limit the input that a field receives by attaching a
&size=EXPR attribute that specifies the number of raw bytes to
make available. This works on top of any other attributes that control
the field’s parsing. From the field’s perspective, such a size limit
acts just like reaching the end of the input stream at the specified
position. Example:
module Test;
public type Foo = unit {
x: int16[] &size=6;
y: bytes &eod;
on %done { print self; }
};
# printf '\000\001\000\002\000\003xyz' | spicy-driver foo.spicy
[$x=[1, 2, 3], $y=b"xyz"]
As you can see, x receives 6 bytes of input, which it then turns
into three 16-bit integers.
Normally, the field must consume all the bytes specified by &size;
otherwise a parse error will be triggered. Some types support an
additional &eod attribute to lift this restriction; we discuss
that in the corresponding type’s section where applicable.
After a field with a &size=EXPR attribute, parsing will always
move ahead the full number of bytes, even if the field did not consume
them all.
Todo
Parsing a regular expression would make a nice example for &size as well.
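The following Python sketch (not Spicy) models the example above: the first field only ever sees a 6-byte window of the input, and the next field continues right after that window. The function name parse_foo and the error messages are illustrative assumptions.

```python
import struct

def parse_foo(data: bytes, size: int = 6):
    """Model of `x: int16[] &size=6; y: bytes &eod;`."""
    # The field for x only sees `size` bytes; the rest is left for y.
    window, rest = data[:size], data[size:]
    if len(window) < size:
        raise ValueError("input ended before &size bytes were available")
    if size % 2 != 0:
        raise ValueError("int16 elements cannot consume an odd number of bytes")
    # Inside the window, parse big-endian signed 16-bit integers until it is used up.
    x = list(struct.unpack(">%dh" % (size // 2), window))
    return x, rest

print(parse_foo(b"\x00\x01\x00\x02\x00\x03xyz"))  # ([1, 2, 3], b'xyz')
```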
5.1.1.3.2. Defensively Limiting Input Size
On their own, parsers place no intrinsic upper limit on the size of variable-size fields or units. This can have negative effects like out-of-memory errors, e.g., when available memory is constrained, or for malformed input.
As a defensive mechanism you can put an upper limit on the data a field or unit
receives by attaching a &max-size=EXPR attribute where EXPR is an
unsigned integer specifying the maximum number of raw bytes a field or
unit should receive. If more than &max-size bytes are consumed during
parsing, an error will be triggered. This attribute works on top of any other
attributes that control parsing. Example:
module Test;
public type Foo = unit {
x: bytes &until=b"\x00" &max-size=1024;
on %done { print self; }
};
# printf '\001\002\003\004\005\000' | spicy-driver foo.spicy
[$x=b"\x01\x02\x03\x04\x05"]
Here x will parse a NULL-terminated byte sequence (excluding the
terminating NULL), but never more than 1024 bytes.
&max-size cannot be combined with &size.
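As a rough illustration, the Python sketch below (not Spicy) models a delimiter search that gives up once a byte budget is exhausted. It assumes that the delimiter itself counts toward the limit; the text above does not settle that detail, and the helper name parse_until is hypothetical.

```python
def parse_until(data: bytes, delim: bytes, max_size: int):
    """Return (value, remaining input); fail once max_size bytes are exceeded."""
    # Only look for the delimiter inside the first max_size bytes.
    end = data.find(delim, 0, max_size)
    if end >= 0:
        return data[:end], data[end + len(delim):]
    if len(data) > max_size:
        raise ValueError("&max-size exceeded before the delimiter was found")
    raise ValueError("input ended before the delimiter was found")

print(parse_until(b"\x01\x02\x03\x04\x05\x00", b"\x00", 1024))
# (b'\x01\x02\x03\x04\x05', b'')
```

With a limit of, say, 3 bytes the same input raises an error instead of scanning further, which is the defensive behavior the attribute is meant to provide.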
5.1.1.4. Anonymous Fields
Field names are optional. If skipped, the field becomes an anonymous
field. These still participate in parsing as any other field, but they
won’t store any value, nor is there a way to get access to them from
outside. You can however still get to the parsed value inside a
corresponding field hook (see Unit Hooks) using the reserved
$$ identifier (see Reserved Identifiers).
module Test;
public type Foo = unit {
x: int8;
: int8 { print $$; } # anonymous field
y: int8;
on %done { print self; }
};
# printf '\01\02\03' | spicy-driver foo.spicy
2
[$x=1, $y=3]
Anonymous fields can often be more efficient to process because the
parser doesn’t need to retain their values. In particular for larger
bytes fields, making them anonymous is recommended where possible
(unless, even better, they can be fully skipped over; see
Skipping Input).
5.1.1.5. Skipping Input
For cases where your parser just needs to skip over some data without
needing access to its content, Spicy provides a skip keyword to
prefix corresponding fields with:
module Test;
public type Foo = unit {
x: int8;
: skip bytes &size=5;
y: int8;
on %done { print self; }
};
# printf '\01\02\03\04\05\06\07' | spicy-driver foo.spicy
[$x=1, $y=7]
skip works for all kinds of fields but is particularly efficient
for fields of known size, for which optimized code will be generated,
avoiding the overhead of storing any data.
skip fields may have conditions and hooks attached, like any other fields.
However, they do not support $$ in expressions and hooks.
Since skip allows the compiler to optimize the field’s parsing
code, including completely eliding most of it, it remains undefined whether any
side effects associated with the field will take effect. For example,
&requires attributes might be ignored, &convert expressions might not
be evaluated, and hooks could end up not being invoked.
For readability, a skip field may be named (e.g., padding: skip
bytes &size=3;), but even with a name, its value cannot be accessed.
5.1.1.6. Reserved Identifiers
Inside units, two reserved identifiers provide access to values currently being parsed:
self
Inside a unit’s type definition, self refers to the unit instance that’s currently being processed. The instance is writable and may be modified by assigning to any fields of self.
$$
Inside field attributes and hooks, $$ refers to the just parsed value, even if it’s not going to be directly stored in the field. The value of $$ is writable and may be modified.
5.1.1.7. On-the-fly Type Conversion with &convert
Fields may use an attribute &convert=EXPR to transform the value
that was just parsed before storing it as the field’s final
value. With the attribute being present, it’s the value of EXPR
that’s stored in the field, not the parsed value. Accordingly, the
field’s type also changes to the type of EXPR.
Typically, EXPR will use $$ to access the value actually being
parsed and then transform it into the desired representation. For
example, the following stores an integer parsed in an ASCII
representation as a uint64:
module Test;
import spicy;
public type Foo = unit {
x: bytes &eod &convert=$$.to_uint();
on %done { print self; }
};
# printf 12345 | spicy-driver foo.spicy
[$x=12345]
&convert also works at the unit level to transform a whole
instance into a different value after it has been parsed:
module Test;
type Data = unit {
data: bytes &size=2;
} &convert=self.data.to_int();
public type Foo = unit {
numbers: Data[];
on %done { print self.numbers; }
};
# printf 12345678 | spicy-driver foo.spicy
[12, 34, 56, 78]
Note how the Data instances have been turned into integers.
Without the &convert attribute, the output would have looked like
this:
[[$data=b"12"], [$data=b"34"], [$data=b"56"], [$data=b"78"]]
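The unit-level conversion can be pictured with a small Python model (not Spicy): each two-byte chunk plays the role of one Data instance, and the conversion replaces the chunk with an integer before it is stored in the container. The function name parse_numbers is an illustrative assumption.

```python
def parse_numbers(data: bytes, size: int = 2) -> list:
    """Model of `numbers: Data[]` where each Data is converted via to_int()."""
    if len(data) % size != 0:
        raise ValueError("input does not split evenly into Data instances")
    numbers = []
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]          # what `data: bytes &size=2` parses
        numbers.append(int(chunk.decode("ascii")))  # what the conversion stores instead
    return numbers

print(parse_numbers(b"12345678"))  # [12, 34, 56, 78]
```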
5.1.1.8. Enforcing Parsing Constraints
Fields may use an attribute &requires=EXPR to enforce additional
constraints on their values. EXPR must be a boolean expression
that will be evaluated after the parsing for the field has finished,
but before any hooks execute. If EXPR returns False, the
parsing process will abort with an error, just as if the field had
been unparsable in the first place (including executing any %error hooks). EXPR
has access to the parsed value through
$$. It may also retrieve the field’s final
value through self.<field>, which can be helpful when
&convert is present.
Example:
module Test;
import spicy;
public type Foo = unit {
x: int8 &requires=($$ < 5);
on %done { print self; }
};
# printf '\001' | spicy-driver foo.spicy
[$x=1]
# printf '\010' | spicy-driver foo.spicy
[fatal error] terminating with uncaught exception of type spicy::rt::ParseError: parse error: &required failed ($$ == 8) (foo.spicy:7:13)
One can also enforce conditions globally at the unit level through an attribute
&requires = EXPR. EXPR will be evaluated once the unit has been fully
parsed, but before any %done hook executes. If EXPR returns False,
the unit’s parsing process will abort with an error. As usual, EXPR has
access to the parsed instance through self. More than one &requires
attribute may be specified.
Example:
module Test;
import spicy;
public type Foo = unit {
x: int8;
on %done { print self; }
} &requires = self.x < 5;
# printf '\001' | spicy-driver foo.spicy
[$x=1]
# printf '\010' | spicy-driver foo.spicy
[error] terminating with uncaught exception of type spicy::rt::ParseError: parse error: &requires failed (foo.spicy:9:15)
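Conceptually, a constraint is just a predicate that runs between parsing a value and accepting it. The Python sketch below (not Spicy) models the field-level case; the ParseError class and the parse_int8 helper are stand-ins defined here for illustration, not Spicy's runtime types.

```python
import struct

class ParseError(Exception):
    """Stand-in for a parse failure; not Spicy's actual exception type."""

def parse_int8(data: bytes, requires=None) -> int:
    """Parse one signed byte, then check an optional constraint on it."""
    (value,) = struct.unpack(">b", data[:1])
    # The constraint sees the parsed value before it is accepted.
    if requires is not None and not requires(value):
        raise ParseError("constraint failed for value %d" % value)
    return value

print(parse_int8(b"\x01", requires=lambda v: v < 5))  # 1
try:
    parse_int8(b"\x08", requires=lambda v: v < 5)
except ParseError as err:
    print("rejected:", err)  # rejected: constraint failed for value 8
```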
5.1.2. Unit Hooks
Unit hooks provide one of the most powerful Spicy tools to control parsing, track state, and retrieve results. Generally, hooks are blocks of code triggered to execute at certain points during parsing, with access to the current unit instance.
Conceptually, unit hooks are somewhat similar to methods: They have bodies that execute when triggered, and these bodies may receive a set of parameters as input. Different from functions, however, a hook can have more than one body. If multiple implementations are provided for the same hook, all of them will execute successively. A hook may also not have any body implemented at all, in which case there’s nothing to do when it executes.
The most commonly used hooks are:
on %init() { ... }
Executes just before unit parsing will start.
on %done { ... }
Executes just after unit parsing has completed successfully.
on %error { ... } or on %error(msg: string) { ... }
Executes when a parse error has been encountered, just before the parser aborts processing. If the second form is used, a description of the error will be provided through the string argument.
on %finally { ... }
Executes once unit parsing has completed in any way. This hook is most useful to modify global state that needs to be updated no matter the success of the parsing process. Once %init triggers, this hook is guaranteed to eventually execute as well. It will run after either %done or %error, respectively. (If a new error occurs during execution of %finally, that will not trigger the unit’s %error hook.)
on %print { ... }
Executes when a unit is about to be printed (and more generally: when rendered into a string representation). By default, printing a unit will produce a list of its fields with their current values. Through this hook, a unit can customize its appearance by returning the desired string.
on <field name> { ... } (field hook)
Executes just after the given unit field has been parsed. The parsed value is accessible through $$, potentially with any relevant type conversion applied (see On-the-fly Type Conversion with &convert). The same value will also have been assigned to the field already.
on <field name> foreach { ... } (container hook)
Assuming the specified field is a container (e.g., a vector), this executes each time a new container element has been parsed, and just before it’s been added to the container. The parsed element is accessible through the $$ identifier, and can be modified before it’s stored. The hook implementation may also use the stop statement to abort container parsing, without the current element being added anymore.
In addition, Spicy provides a set of hooks specific to the sink type, which
are discussed in the section on sinks, and hooks which are
executed during error recovery.
There are three locations where hooks can be implemented:
Inside a unit, on <hook name> { ... } implements the hook of the given name:
type Foo = unit {
    x: uint32;
    v: uint8[];
    on %init { ... }
    on x { ... }
    on v foreach { ... }
    on %done { ... }
}
Field and container hooks may be directly attached to their field, skipping the on ... part:
type Foo = unit {
    x: uint32 { ... }
    v: uint8[] foreach { ... }
}
At the global module level, one can add hooks to any available unit type through on <unit type>::<hook name> { ... }. With the definition of Foo above, this implements hooks externally:
on Foo::%init { ... }
on Foo::x { ... }
on Foo::v foreach { ... }
on Foo::%done { ... }
External hooks work across module boundaries by qualifying the unit type accordingly. They provide a powerful mechanism to extend a predefined unit without changing any of its code.
If multiple implementations are provided for the same hook, by default it remains undefined in which order they will execute. If a particular order is desired, you can specify priorities for your hook implementations:
on Foo::v priority=5 { ... }
on Foo::v priority=-5 { ... }
Implementations then execute in order of their priorities: The higher a priority value, the earlier it will execute. If not specified, a hook’s priority is implicitly taken as zero.
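The dispatch rule can be sketched with a small Python model (not Spicy): several bodies are registered under one hook name, and triggering the hook runs them from highest to lowest priority. The HookRegistry class is a hypothetical illustration; in this sketch, bodies with equal priority run in registration order, whereas Spicy leaves that order undefined.

```python
from collections import defaultdict

class HookRegistry:
    """Toy model of hooks with multiple bodies and priorities."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def on(self, name, body, priority=0):
        # A hook may collect any number of bodies; the default priority is zero.
        self._hooks[name].append((priority, body))

    def trigger(self, name, *args):
        # Higher priority values execute earlier.
        for _, body in sorted(self._hooks[name], key=lambda entry: -entry[0]):
            body(*args)

calls = []
hooks = HookRegistry()
hooks.on("Foo::v", lambda v: calls.append(("low", v)), priority=-5)
hooks.on("Foo::v", lambda v: calls.append(("high", v)), priority=5)
hooks.on("Foo::v", lambda v: calls.append(("default", v)))
hooks.trigger("Foo::v", 42)
print(calls)  # [('high', 42), ('default', 42), ('low', 42)]
```

Triggering a hook name that has no bodies simply does nothing, mirroring the case of a hook without any implementation.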
Note
When a hook executes, it has access to the current unit instance
through the self identifier. The state of that instance will
reflect where parsing is at that time. In particular, any field
that hasn’t been parsed yet will remain unset. You can use the
?. unit operator to test if a field has received a value yet.
5.1.3. Unit Variables
In addition to unit fields for parsing, you can also add further instance variables to a unit type to store arbitrary state:
module Test;
public type Foo = unit {
on %init { print self; }
x: int8 { self.a = "Our integer is %d" % $$; }
on %done { print self; }
var a: string;
};
# printf \05 | spicy-driver foo.spicy
[$x=(not set), $a=""]
[$x=48, $a="Our integer is 48"]
Here, we assign a string value to a once we have parsed x. The
final print shows the expected value. As you can also see, before
we assign anything, the variable’s value is just empty: Spicy
initializes unit variables with well-defined defaults. If you
would rather leave a variable unset by default, you can add
&optional:
module Test;
public type Foo = unit {
on %init { print self; }
x: int8 { self.a = "Our integer is %d" % $$; }
on %done { print self; }
var a: string &optional;
};
# printf \05 | spicy-driver foo.spicy
[$x=(not set), $a=(not set)]
[$x=48, $a="Our integer is 48"]
You can use the ?. unit operator to test whether an optional unit variable
has been set. The same operator works for fields: self?.x
would return True if field x is set and False otherwise.
Unit variables can also be initialized with custom expressions when being defined. The initialization is performed just before the containing unit starts parsing (implying that the expressions cannot access parse results of the unit itself yet):
module Test;
public type Foo = unit {
x: int8;
var a: int8 = 123;
on %done { print self; }
};
# printf \05 | spicy-driver foo.spicy
[$x=48, $a=123]
5.1.4. Unit Parameters
Unit types can receive parameters upon instantiation, which will then be available to any code inside the type’s declaration:
module Test;
type Bar = unit(msg: string, mult: int8) {
x: int8 &convert=($$ * mult);
on %done { print "%s: %d" % (msg, self.x); }
};
public type Foo = unit {
y: Bar("My multiplied integer", 5);
};
# printf '\05' | spicy-driver foo.spicy
My multiplied integer: 25
This example shows a typical idiom: We’re handing values down to a subunit through parameters it receives. Inside the subunit, we then have access to the values passed in.
Note
It’s usually not very useful to define a top-level parsing
unit with parameters because we don’t have a way to pass anything
in through spicy-driver. A custom host application could make
use of them, though.
This works with subunits inside containers as well:
module Test;
type Bar = unit(mult: int8) {
x: int8 &convert=($$ * mult);
on %done { print self.x; }
};
public type Foo = unit {
x: int8;
y: Bar(self.x)[];
};
# printf '\05\01\02\03' | spicy-driver foo.spicy
5
10
15
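The data flow in this example can be modeled in a few lines of Python (not Spicy): the first byte becomes the parameter, and every following byte is parsed by a subunit that receives it. The function names parse_bar and parse_foo are illustrative stand-ins for the two units.

```python
def parse_bar(byte: bytes, mult: int) -> int:
    """Model of unit Bar(mult): parse one int8 and multiply it by the parameter."""
    return int.from_bytes(byte, "big", signed=True) * mult

def parse_foo(data: bytes) -> list:
    """Model of unit Foo: x is parsed first, then handed to each Bar in the container."""
    x = int.from_bytes(data[:1], "big", signed=True)
    return [parse_bar(data[i:i + 1], x) for i in range(1, len(data))]

print(parse_foo(b"\x05\x01\x02\x03"))  # [5, 10, 15]
```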
A common use-case for unit parameters is passing the self of a
higher-level unit down into a subunit:
type Foo = unit {
...
b: Bar(self);
...
}
type Bar = unit(foo: Foo) {
# We now have access to any state in "foo".
}
That way, the subunit can for example store state directly in the
parent. If you declare the foo parameter as inout, the subunit
can also modify its members.
Unit parameters generally follow the same passing conventions as function parameters, yet with some restrictions. Just like with functions, parameters are read-only by default. If you want the receiving unit to be able to modify the value, there are two options:
If the parameter itself is a unit, you can declare it as inout as described above.
For all other types, you instead need to pass the parameter as a reference. Here’s an example passing a string so that it can be modified by the subunit:
module Test;
type X = unit(s: string&) {
    n: uint8 { *s = "Hello, world!"; }
};
public type Y = unit {
    x: X(self.s);
    on %done { print self.s; }
    var s: string& = new string;
};
# printf '\x2a' | spicy-driver foo.spicy
Hello, world!
Note
While this lack of support for inout may seem like a
surprising restriction at first, it follows from Spicy’s safety
guarantees: since a subunit may access its parameters during its
entire lifetime, generally Spicy couldn’t guarantee that a
parameter passed as inout at initialization time would
actually remain around for modification the whole time. References
do not have that problem: their wrapped values are guaranteed to
remain valid as long as necessary. (Units happen to share that
behaviour, too, which is why Spicy can support inout for
them.)
5.1.5. Unit Attributes
Unit types support the following type attributes:
&byte-order=ORDER
Specifies a byte order to use for parsing the unit where ORDER is of type spicy::ByteOrder. This overrides the byte order specified for the module. Individual fields can override this value by specifying their own byte-order. Example:
type Foo = unit {
    version: uint32;
} &byte-order=spicy::ByteOrder::Little;
&convert=EXPR
Replaces a unit instance with the result of the expression EXPR after parsing it from inside a parent unit. See On-the-fly Type Conversion with &convert for an example. EXPR has access to self to retrieve state from the unit.
&requires=EXPR
Enforces post-conditions on the parsed unit. EXPR must be a boolean expression that will be evaluated after the parsing for the unit has finished, but before any hooks execute. More than one &requires attribute may be specified. Example:
type Foo = unit {
    a: int8;
    b: int8;
} &requires=self.a==self.b;
See the section on parsing constraints for more details.
&size=N
Limits the unit’s input to N bytes, which it must fully consume. Example:
type Foo = unit {
    a: int8;
    b: bytes &eod;
} &size=5;
This expects 5 bytes of input when parsing an instance of Foo. The unit will store the first byte into a, and then fill b with the remaining 4 bytes.
The expression N has access to self as well as to the unit’s parameters.
5.1.6. Meta data
Units can provide meta data about their semantics through properties
that both Spicy itself and host applications can access. One defines
properties inside the unit’s type through either a %<property> = <value>;
tuple, or just as %<property>; if the property does not
take an argument. Currently, units support the following meta data
properties:
%mime-type = STRING
A string of the form "<type>/<subtype>" that defines the MIME type for content the unit knows how to parse. This may include a * wildcard for either the type or subtype. We use a generalized notion of MIME types here that can include custom meanings. See Sinks for more on how these MIME types are used to select parsers dynamically during runtime.
You can specify this property more than once to associate a unit with multiple types.
%description = STRING
A short textual description of the unit type (i.e., the parser that it defines). Host applications have access to this property, and spicy-driver includes the information into the list of available parsers that it prints with the --list-parsers option.
%port = PORT_VALUE [&originator|&responder]
A Port to associate this unit with, optionally including a direction to limit its use to the corresponding side. This property has no built-in effect, but host applications may make use of the information to decide which unit type to use for parsing a connection’s payload.
%skip = ( REGEXP | Null );
Specifies a pattern which should be skipped when encountered in the input stream in between parsing of unit fields. This overwrites a value set at the module level; use Null to reset the property, i.e., not skip anything.
%skip-pre = ( REGEXP | Null );
Specifies a pattern which should be skipped when encountered in the input stream before parsing of a unit begins. This overwrites a value set at the module level; use Null to reset the property, i.e., not skip anything.
%skip-post = ( REGEXP | Null );
Specifies a pattern which should be skipped when encountered in the input stream after parsing of a unit has finished. This overwrites a value set at the module level; use Null to reset the property, i.e., not skip anything.
%synchronize-at = EXPR;
Specifies a literal to synchronize on if the unit is used as a synchronization point during error recovery. The literal is left in the input stream.
%synchronize-after = EXPR;
Specifies a literal to synchronize on if the unit is used as a synchronization point during error recovery. The literal is consumed and will not be present in the input stream after successful synchronization.
Units support some further properties for other purposes, which we introduce in the corresponding sections.
5.1.7. Parsing Types
Several, but not all, of Spicy’s data types can be parsed from binary data. In the following we summarize the types that can, along with any options they support to control specifics of how they unpack binary representations.
5.1.7.1. Address
Spicy parses addresses from either 4 bytes of
input for IPv4 addresses, or 16 bytes for IPv6 addresses. To select
the type, a unit field of type addr must come with either an
&ipv4 or &ipv6 attribute.
By default, addresses are assumed to be represented in network byte
order. Alternatively, a different byte order can be specified through
a &byte-order attribute specifying the desired
spicy::ByteOrder.
Example:
module Test;
import spicy;
public type Foo = unit {
ip: addr &ipv6 &byte-order=spicy::ByteOrder::Little;
on %done { print self; }
};
# printf '1234567890123456' | spicy-driver foo.spicy
[$ip=3635:3433:3231:3039:3837:3635:3433:3231]
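To see where that output comes from, the Python sketch below (not Spicy) models the address parsing: it takes 4 or 16 raw bytes, reverses them if little-endian order is requested, and interprets the result as an address. The helper name parse_addr is an illustrative assumption.

```python
import ipaddress

def parse_addr(data: bytes, ipv6: bool, little_endian: bool = False):
    """Interpret 4 (IPv4) or 16 (IPv6) raw bytes as an address."""
    size = 16 if ipv6 else 4
    raw = data[:size]
    if len(raw) < size:
        raise ValueError("not enough input for an address")
    if little_endian:
        raw = raw[::-1]  # little-endian input stores the bytes in reverse order
    return ipaddress.IPv6Address(raw) if ipv6 else ipaddress.IPv4Address(raw)

print(parse_addr(b"1234567890123456", ipv6=True, little_endian=True))
# 3635:3433:3231:3039:3837:3635:3433:3231
```

The ASCII digits of the input are the bytes 0x31 through 0x39 and 0x30; reversing them yields the groups printed above.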
5.1.7.2. Bitfield
Bitfields parse an integer value of a given
size, and then make selected smaller bit ranges within that value
available individually through dedicated identifiers. For example, the
following unit parses 4 bytes as a uint32 and then makes the
value of bit 0 available as f.x1, bits 1 to 2 as f.x2, and
bits 3 to 4 as f.x3, respectively:
module Test;
public type Foo = unit {
f: bitfield(32) {
x1: 0;
x2: 1..2;
x3: 3..4;
};
on %done {
print self.f.x1, self.f.x2, self.f.x3;
print self;
}
};
# printf '\01\02\03\04' | spicy-driver foo.spicy
0, 2, 0
[$f=(0, 2, 0)]
Generally, a bitfield(N) field is parsed like a
uint<N>. The field then supports dereferencing individual bit
ranges through their labels. The corresponding expressions
(self.x.<id>) have the same uint<N> type as the parsed value
itself, with the value shifted to the right so that the least significant
extracted bit becomes the least significant bit of the returned value. As you can see in
the example, the type of the field itself becomes a tuple composed of
the values of the individual bit ranges.
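The shift-and-mask step can be written out in Python (not Spicy) as a sketch of what happens for each bit range under the default bit numbering, where bit 0 is the least significant bit. The helper name extract_bits is illustrative.

```python
def extract_bits(value: int, lower: int, upper: int) -> int:
    """Return bits lower..upper (inclusive) of value, shifted down to bit 0."""
    width = upper - lower + 1
    return (value >> lower) & ((1 << width) - 1)

# The four input bytes from the example, read as a big-endian uint32.
value = int.from_bytes(b"\x01\x02\x03\x04", "big")
fields = (extract_bits(value, 0, 0), extract_bits(value, 1, 2), extract_bits(value, 3, 4))
print(fields)  # (0, 2, 0)
```

The lowest byte is 0x04, i.e., binary 00000100, so only bit 2 is set and the range 1..2 yields binary 10.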
By default, a bitfield assumes the underlying integer comes in network
byte order. You can specify a &byte-order attribute to change that
(e.g., bitfield(32) { ... } &byte-order=spicy::ByteOrder::Little).
When parsing a bitfield(16) in network byte order and with bit order
spicy::BitOrder::LSB0 (default value of &bit-order), bits are
numbered 0 to 15 from right to left.
 MSB                         LSB
       <-- 1               <-- 0
 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
+---------------+---------------+
|               |               |
+---------------+---------------+
This default bit numbering may be surprising given that some RFCs use the inverse as documented in RFC 1700. Here, the most significant bit is numbered 0 on the left with higher bit numbers representing less significant bits to the right. Concrete examples would be the WebSocket framing or IPv4 header notations.
 MSB                         LSB
 0 -->               1 -->
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-------+-+-------------+
|F|R|R|R| opcode|M| Payload len |
|I|S|S|S|  (4)  |A|     (7)     |
|N|V|V|V|       |S|             |
| |1|2|3|       |K|             |
+-+-+-+-+-------+-+-------------+
To express such bitfields more naturally in Spicy, use &bit-order=spicy::BitOrder::MSB0 on the whole bitfield:
module WebSocket;
import spicy;
public type Header = unit {
    : bitfield(32) {
        fin: 0;
        rsv: 1..3;
        opcode: 4..7;
        mask: 8;
        payload_len: 9..15;
    } &bit-order=spicy::BitOrder::MSB0;
};
The way to think about this is that the most significant bit of an integer in
network byte order is always the leftmost bit and the least significant bit
the rightmost one. Specifying the bit order as LSB0 or MSB0 essentially
sets the bit numbering direction by specifying the location of bit 0.
With little endian byte order, the bits are numbered zigzag-wise and
MSB0 and LSB0 can again be used to change the direction of the bit
numbering. The following example uses spicy::ByteOrder::Little and
the default LSB0 bit order for bitfield(16). Notice how the most
significant and least significant bit for a 2 byte little endian integer
are next to each other.
f: bitfield(16) {
...
} &byte-order=spicy::ByteOrder::Little;
             LSB MSB
           <-- 0       <-- 1
 7 6 5 4 3 2 1 0 5 4 3 2 1 0 9 8
+---------------+---------------+
|               |               |
+---------------+---------------+
With MSB0 as bit order, the bit numbering direction is from left to right, instead:
f: bitfield(16) {
...
} &byte-order=spicy::ByteOrder::Little &bit-order=spicy::BitOrder::MSB0;
             LSB MSB
     1 -->       0 -->
 8 9 0 1 2 3 4 5 0 1 2 3 4 5 6 7
+---------------+---------------+
|               |               |
+---------------+---------------+
Bit numbering with larger bitfields in little endian only gets more confusing. Prefer network byte ordered bitfields unless the spec you’re working with calls for something else.
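For network byte order, the relationship between the two numbering schemes boils down to mirroring the bit positions. The Python sketch below (not Spicy) extracts MSB0-numbered ranges from a 16-bit value laid out like the first two bytes of the WebSocket header. The helper name extract_bits_msb0 and the sample bytes 0x81 0x85 are assumptions made for illustration.

```python
def extract_bits_msb0(value: int, width: int, first: int, last: int) -> int:
    """Return bits first..last of a width-bit value, where bit 0 is the MOST significant bit."""
    # Mirror the MSB0 positions into LSB0 positions, then shift and mask as usual.
    lower = width - 1 - last
    upper = width - 1 - first
    return (value >> lower) & ((1 << (upper - lower + 1)) - 1)

# Hypothetical header bytes: FIN set, opcode 1, MASK set, payload length 5.
header = int.from_bytes(b"\x81\x85", "big")
parsed = {
    "fin": extract_bits_msb0(header, 16, 0, 0),
    "rsv": extract_bits_msb0(header, 16, 1, 3),
    "opcode": extract_bits_msb0(header, 16, 4, 7),
    "mask": extract_bits_msb0(header, 16, 8, 8),
    "payload_len": extract_bits_msb0(header, 16, 9, 15),
}
print(parsed)  # {'fin': 1, 'rsv': 0, 'opcode': 1, 'mask': 1, 'payload_len': 5}
```

The sketch deliberately covers only network byte order; the zigzag numbering for little endian input is not modeled here.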
The individual bit ranges support the &convert attribute and will
adjust their types accordingly, just like a regular unit field (see
On-the-fly Type Conversion with &convert). For example, that allows for mapping a bit
range to an enum, using $$ to access the parsed value:
module Test;
import spicy;
type X = enum { A = 1, B = 2 };
public type Foo = unit {
f: bitfield(8) {
x1: 0..3 &convert=X($$);
x2: 4..7 &convert=X($$);
} { print self.f.x1, self.f.x2; }
};
# printf '\x21' | spicy-driver foo.spicy
X::A, X::B
When parsing a bitfield, you can enforce expected values for some or all of the bit ranges through an assignment-style syntax:
type Foo = unit {
    f: bitfield(8) {
        x1: 0..3 = 2;
        x2: 4..5;
        x3: 6..7 = 3;
    };
};
Now parsing will fail if values of x1 and x3 aren’t 2 and
3, respectively. Internally, Spicy treats bitfields with such
expected values similar to constants of other types, meaning they
operate as valid look-ahead symbols as well (see
Look-Ahead).
5.1.7.3. Bytes
When parsing a field of type Bytes, Spicy will consume raw input bytes according to a specified attribute that determines when to stop. The following attributes are supported:
&eod
Consumes all subsequent data until the end of the input is reached.
&size=N
Consumes exactly N bytes. The attribute may be combined with &eod to consume up to N bytes instead (i.e., permit running out of input before the size limit is reached).
(This attribute works for fields of all types. We list it here because it’s particularly common to use it with bytes.)
&until=DELIM
Consumes bytes until the specified delimiter is found. DELIM must be of type bytes itself. The delimiter will not be included into the resulting value, but consumed.
&until-including=DELIM
Similar to &until, but this does include the delimiter DELIM into the resulting value.
At least one of these attributes must be provided.
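The difference between the two delimiter attributes is easiest to see side by side. The Python sketch below (not Spicy) returns both the stored value and the input that remains afterwards; the helper name parse_bytes_until is an illustrative assumption.

```python
def parse_bytes_until(data: bytes, delim: bytes, including: bool = False):
    """Return (value, remaining input) for a delimiter-terminated bytes field."""
    index = data.find(delim)
    if index < 0:
        raise ValueError("delimiter not found")
    end = index + len(delim)
    # Either way the delimiter is consumed; only the stored value differs.
    value = data[:end] if including else data[:index]
    return value, data[end:]

print(parse_bytes_until(b"abc\x00rest", b"\x00"))                  # (b'abc', b'rest')
print(parse_bytes_until(b"abc\x00rest", b"\x00", including=True))  # (b'abc\x00', b'rest')
```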
On top of that, bytes fields support the attribute &chunked to
change how the parsed data is processed and stored. Normally, a bytes
field will first accumulate all desired data and then store the final,
complete value in the field. With &chunked, if the data arrives
incrementally in pieces, the field instead processes just whatever is
available at a time, storing each piece directly, and individually, in
the field. Each time a piece gets stored, any associated field hooks
execute with the new part as their $$. Parsing with &chunked
will eventually still consume the same number of bytes overall, but it
avoids buffering everything in cases where that’s either infeasible or
simply not needed.
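A rough Python model (not Spicy) of this behavior: instead of joining all pieces into one value, the parser hands each piece to the hook as it arrives and only keeps track of how many bytes are still outstanding. The function name parse_chunked and the way the input is split into pieces are assumptions; in practice the piece boundaries depend on how the host application feeds data.

```python
def parse_chunked(pieces, size: int, hook) -> None:
    """Pass up to `size` bytes to `hook` piece by piece, without buffering them all."""
    remaining = size
    for piece in pieces:
        if remaining == 0:
            break
        part = piece[:remaining]  # never hand out more than the field may consume
        if part:
            hook(part)            # the hook sees just this part
            remaining -= len(part)
    if remaining:
        raise ValueError("input ended before the field was complete")

seen = []
parse_chunked([b"ab", b"cde", b"fgh"], 6, seen.append)
print(seen)  # [b'ab', b'cde', b'f']
```

The total number of bytes handed to the hook is still 6, matching what a non-chunked field of the same size would consume.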
Bytes fields support parsing constants: If a bytes
constant is
specified instead of a field type, parsing will expect to find the
corresponding value in the input stream.
5.1.7.4. Integer
Fields of integer type can be either signed
(intN
) or unsigned (uintN
). In either case, the bit length
N
determines the number of bytes being parsed. By default,
integers are expected to come in network byte order. You can specify a
different order through the &byte-order=ORDER
attribute, where
ORDER
is of type spicy::ByteOrder.
Integer fields support parsing constants: If an integer constant is
specified instead of a field type, parsing will expect to
find the corresponding value in the input stream. Since the exact type
of the integer constant is important, you should use their constructor
syntax to make that explicit (e.g., uint32(42)
, int8(-1)
; vs.
using just 42
or -1
).
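To illustrate both points, here is a small sketch of a hypothetical header that starts with a fixed 16-bit magic number and continues with a little-endian length (the unit and field names are assumptions for the example):

```spicy
module Test;

import spicy;

public type Header = unit {
    magic:  uint16(0xCAFE);                               # input must contain exactly this value
    length: uint32 &byte-order=spicy::ByteOrder::Little;  # little-endian instead of network order

    on %done { print self.length; }
};
```

Writing the constant as uint16(0xCAFE) rather than a plain 0xCAFE makes it explicit that exactly two bytes are to be matched.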
5.1.7.5. Real
Real values are parsed as either single or double precision values in
IEEE754 format, depending on the value of their &type=T
attribute,
where T
is one of spicy::RealType.
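As a short sketch (field names chosen for illustration), a unit parsing one single-precision and one double-precision value could look like this:

```spicy
module Test;

import spicy;

public type Foo = unit {
    f: real &type=spicy::RealType::IEEE754_Single;  # consumes 4 bytes
    d: real &type=spicy::RealType::IEEE754_Double;  # consumes 8 bytes

    on %done { print self; }
};
```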
5.1.7.6. Regular Expression
When parsing a field through a Regular Expression, the expression is
expected to match at the current position of the input stream. The
field’s type becomes bytes
, and it will store the matching data.
Inside hooks for fields with regular expressions, you can access
capture groups through $1
, $2
, $3
, etc. For example:
x : /(a.c)(de*f)(h.j)/ {
print $1, $2, $3;
}
This will print out the relevant pieces of the data matching the
corresponding set of parentheses. (There’s no $0
, just use $$
as normal to get the full match.)
Matching a regular expression is more expensive if you need it to capture groups. If you are using groups inside your expression but don’t need the actual captures, add &nosub to the field to remove that overhead.
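For instance, in the following sketch the parentheses serve only to group an alternation, so the captures can be switched off (the pattern itself is just an illustration):

```spicy
module Test;

public type Foo = unit {
    # Grouping is needed for the repetition, but not the captured text.
    x: /(ab|cd)+/ &nosub;

    on %done { print self.x; }
};
```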
5.1.7.7. Unit
Fields can have the type of another unit, in which case parsing will descend into that subunit’s grammar until that instance has been fully parsed. Field initialization and hooks work as usual.
If the subunit receives parameters, they must be given right after the type.
module Test;
type Bar = unit(a: string) {
x: uint8 { print "%s: %u" % (a, self.x); }
};
public type Foo = unit {
y: Bar("Spicy");
on %done { print self; }
};
# printf '\01\02' | spicy-driver foo.spicy
Spicy: 1
[$y=[$x=1]]
See Unit Parameters for more.
5.1.7.8. Vector
Parsing a vector creates a loop that repeatedly parses elements of the specified type from the input stream until an end condition is reached. The field’s value accumulates all the elements into the final vector.
Spicy uses a specific syntax to define fields of type vector:
NAME : ELEM_TYPE[SIZE]
NAME
is the field name as usual. ELEM_TYPE
is the type of the
vector’s elements, i.e., the type that will be repeatedly parsed.
SIZE
is the number of elements to parse into the vector; this is
an arbitrary Spicy expression yielding an integer value. The resulting
field type then will be vector<ELEM_TYPE>
. Here’s a simple example
parsing five uint8
:
module Test;
public type Foo = unit {
x: uint8[5];
on %done { print self; }
};
# printf '\01\02\03\04\05' | spicy-driver foo.spicy
[$x=[1, 2, 3, 4, 5]]
It is possible to skip the SIZE (e.g., x: uint8[]) and instead use another kind of end condition to terminate a vector’s parsing loop. To that end, vectors support the following attributes:
&eod
Parses elements until the end of the input stream is reached.
&size=N
Parses the vector from the subsequent N bytes of input data. This effectively limits the available input to the corresponding window, letting the vector parse elements until it runs out of data. (This attribute works for fields of all types. We list it here because it’s particularly common to use it with vectors.)
&until=EXPR
Vector elements are parsed in a loop with EXPR being evaluated as a boolean expression after each parsed element, and before adding the element to the vector. Once EXPR evaluates to true, parsing stops without adding the element that was just parsed.
&until-including=EXPR
Similar to &until, but does include the final element (the one for which EXPR evaluated to true) in the field’s vector when stopping parsing.
&while=EXPR
Continues parsing as long as the boolean expression EXPR evaluates to true.
If neither a size nor an attribute is given, Spicy will attempt to use look-ahead parsing to determine the end of the vector based on the next expected token. Depending on the unit’s field, this may not be possible, in which case Spicy will decline to compile the unit.
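As a sketch of an attribute-based end condition, the following unit collects bytes until it encounters a zero; we assume here that $$ inside the &until expression refers to the element that was just parsed:

```spicy
module Test;

public type Foo = unit {
    # Stop at the first zero byte; the zero is consumed but not stored.
    values: uint8[] &until=($$ == 0);

    on %done { print self; }
};
```

With &until-including in place of &until, the terminating zero would end up in the vector as well.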
The syntax shown above generally works for all element types,
including subunits (e.g., x: MyUnit[]
).
Note
The x: (<T>)[]
syntax is quite flexible. In fact, <T>
is
not limited to subunits, but allows for any standard field
specification defining how to parse the vector elements. For
example, x: (bytes &size=5)[];
parses a vector of 5-character
bytes
instances.
When parsing a vector, Spicy supports using a special kind of field
hook, foreach
, that executes for each parsed element individually.
Inside that hook, $$
refers to the just parsed element:
module Test;
public type Foo = unit {
x: uint8[5] foreach { print $$, self.x; }
};
# printf '\01\02\03\04\05' | spicy-driver foo.spicy
1, []
2, [1]
3, [1, 2]
4, [1, 2, 3]
5, [1, 2, 3, 4]
As you can see, when a foreach
hook executes the element has not yet
been added to the vector. You may indeed use a stop
statement
inside a foreach
hook to abort the vector’s parsing without adding
the current element anymore. See Unit Hooks for more on hooks.
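A small sketch of that use of stop, reusing the unit from above: the vector ends early as soon as a zero byte shows up, and that zero does not become part of x.

```spicy
module Test;

public type Foo = unit {
    # Give up on the vector as soon as a zero shows up.
    x: uint8[5] foreach {
        if ( $$ == 0 )
            stop;
    }

    on %done { print self; }
};
```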
5.1.7.9. Void
The Void type can be used as a placeholder in fields not meant to consume any data. This can be useful in some situations, such as providing a branch in a switch construct that forgoes any parsing, or attaching a &requires attribute to enforce a condition.
Fields of type void
do not have any accessible value.
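As a sketch of the second use case, an anonymous void field can carry a sanity check between two regular fields; the limit of 16 is an arbitrary value picked for the example:

```spicy
module Test;

public type Foo = unit {
    len: uint8;

    # Consumes no input; parsing fails here if the condition does not hold.
    : void &requires=(self.len <= 16);

    data: bytes &size=self.len;

    on %done { print self; }
};
```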
5.1.8. Controlling Parsing
Spicy offers a few additional constructs inside a unit’s declaration for steering the parsing process. We discuss them in the following.
5.1.8.1. Conditional Parsing
A unit field may be conditionally skipped for parsing by adding an
if ( COND )
clause, where COND
is a boolean expression. The
field will be only parsed if the expression evaluates to true at the
time the field is next in line.
module Test;
public type Foo = unit {
a: int8;
b: int8 if ( self.a == 1 );
c: int8 if ( self.a % 2 == 0 );
d: int8;
on %done { print self; }
};
# printf '\01\02\03\04' | spicy-driver foo.spicy
[$a=1, $b=2, $c=(not set), $d=3]
# printf '\02\02\03\04' | spicy-driver foo.spicy
[$a=2, $b=(not set), $c=2, $d=3]
New in version 1.12: Conditional blocks
If the same condition applies to multiple subsequent fields, they can be grouped together into a single conditional block:
module Test;
public type Foo = unit {
a: int8;
if ( self.a == 1 ) {
b: int8;
c: int8;
}; # note the trailing semicolon
on %done { print self; }
};
The syntax supports an optional else
-block as well:
module Test;
public type Foo = unit {
a: int8;
if ( self.a == 1 ) {
b: int8;
}
else {
c: int8;
}; # note the trailing semicolon
on %done { print self; }
};
For repeated cases of conditional parsing where a single expression evaluates to one of several values, unit switch statements might allow for more compact and easier to maintain code.
5.1.8.2. Look-Ahead
Internally, Spicy builds an LR(1) grammar for each unit that it parses, meaning that it can actually look ahead in the parsing stream to determine how to process the current input location. Roughly speaking, if (1) the current construct does not have a clear end condition defined (such as a specific length), and (2) a specific value is expected to be found next; then the parser will keep looking for that value and end the current construct once it finds it.
“Construct” deliberately remains a bit of a fuzzy term here, but think of vector parsing as the most common instance of this: If you don’t give a vector an explicit termination condition (as discussed in Vector), Spicy will look at what’s expected to come after the container. As long as that’s something clearly recognizable (e.g., a specific value of an atomic type, or a match for a regular expression), it’ll terminate the vector accordingly.
Here’s an example:
module Test;
public type Foo = unit {
data: uint8[];
: /EOD/;
x : int8;
on %done { print self; }
};
# printf '\01\02\03EOD\04' | spicy-driver foo.spicy
[$data=[1, 2, 3], $x=4]
For vectors, Spicy attempts look-ahead parsing automatically as a last
resort when it doesn’t find more explicit instructions. However, it
will reject a unit if it can’t find a suitable look-ahead symbol to
work with. If we had written int32
in the example above, that
would not have worked as the parser can’t recognize when there’s an
int32
coming; it would need to be a concrete value, such as
int32(42)
.
See the switch construct for another instance of look-ahead parsing.
5.1.8.3. switch
Spicy supports a switch
construct as a way to branch into one
of several parsing alternatives. There are two variants of this, an
explicit branch and one driven by look-ahead:
Branch by expression
The most basic form of switching by expression looks like this:
switch ( EXPR ) {
VALUE_1 -> FIELD_1;
VALUE_2 -> FIELD_2;
...
VALUE_N -> FIELD_N;
};
This evaluates EXPR
at the time parsing reaches the switch
. If
there’s a VALUE
matching the result, parsing continues with the
corresponding field, and then proceeds with whatever comes after the
switch. Example:
module Test;
public type Foo = unit {
x: bytes &size=1;
switch ( self.x ) {
b"A" -> a8: int8;
b"B" -> a16: int16;
b"C" -> a32: int32;
};
on %done { print self; }
};
# printf 'A\01' | spicy-driver foo.spicy
[$x=b"A", $a8=1, $a16=(not set), $a32=(not set)]
# printf 'B\01\02' | spicy-driver foo.spicy
[$x=b"B", $a8=(not set), $a16=258, $a32=(not set)]
We see in the output that all of the alternatives turn into normal unit members, with all but the one for the branch that was taken left unset.
If none of the values match the expression, that’s considered a
parsing error and processing will abort. Alternatively, one can add a
default alternative by using *
as the value. The branch will then
be taken whenever no other value matches.
A couple of additional notes about the fields inside an alternative:
In our example, the fields of all alternatives all have different names, and they all show up in the output. One can also reuse names across alternatives as long as the types exactly match. In that case, the unit will end up with only a single instance of that member.
An alternative can match against more than one value by separating them with commas (e.g., b"A", b"B" -> x: int8;).
Alternatives can have more than one field attached by enclosing them in braces, i.e.: VALUE -> { FIELD_1a; FIELD_1b; ...; FIELD_1n; }.
Sometimes one really just needs the branching capability, but doesn’t have any field values to store. In that case an anonymous void field may be helpful (e.g., b"A" -> : void { DoSomethingHere(); }).
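The following sketch puts these variations together in one switch; the field names are invented for the example, and the default branch deliberately parses nothing:

```spicy
module Test;

public type Foo = unit {
    x: bytes &size=1;

    switch ( self.x ) {
        b"A", b"B" -> small: int8;             # one alternative, two values
        b"C"       -> { hi: int8; lo: int8; }  # several fields in one alternative
        *          -> : void;                  # default: nothing to parse
    };

    on %done { print self; }
};
```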
Branch by look-ahead
switch
also works without any expression as long as the presence
of all the alternatives can be reliably recognized by looking ahead in
the input stream:
module Test;
public type Foo = unit {
switch {
-> a: b"A";
-> b: b"B";
-> c: b"C";
};
on %done { print self; }
};
# printf 'A' | spicy-driver foo.spicy
[$a=b"A", $b=(not set), $c=(not set)]
While this example is a bit contrived, the mechanism becomes powerful once you have subunits that are recognizable by how they start:
module Test;
type A = unit {
a: b"A";
};
type B = unit {
b: uint16(0xffff);
};
public type Foo = unit {
switch {
-> a: A;
-> b: B;
};
on %done { print self; }
};
# printf 'A ' | spicy-driver foo.spicy
[$a=[$a=b"A"], $b=(not set)]
# printf '\377\377' | spicy-driver foo.spicy
[$a=(not set), $b=[$b=65535]]
Switching Over Fields With Common Size
You can limit the input any field in a unit switch receives by attaching an
optional &size=EXPR
attribute that specifies the number of raw bytes to
make available. This is analogous to the field size attribute
and especially useful to remove duplication when each case is subject to the
same constraint.
module Test;
public type Foo = unit {
tag: uint8;
switch ( self.tag ) {
1 -> b1: bytes &eod;
2 -> b2: bytes &eod &convert=$$.lower();
} &size=3;
on %done { print self; }
};
# printf '\01ABC' | spicy-driver foo.spicy
[$tag=1, $b1=b"ABC", $b2=(not set)]
# printf '\02ABC' | spicy-driver foo.spicy
[$tag=2, $b1=(not set), $b2=b"abc"]
5.1.8.4. Backtracking
Spicy supports a simple form of manual backtracking. If a field is
marked with &try
, a later call to the unit’s backtrack()
method anywhere down in the parse tree originating at that field will
immediately transfer control over to the field following the &try
.
When doing so, the data position inside the input stream will be reset
to where it was when the &try
field started its processing. Units
along the original path will be left in whatever state they were at
the time backtrack()
executed (i.e., they will probably remain
just partially initialized). When backtrack()
is called on a path
that involves multiple &try
fields, control continues after the
most recent.
Example:
module Test;
public type test = unit {
foo: Foo &try;
bar: Bar;
on %done { print self; }
};
type Foo = unit {
a: int8 {
if ( $$ != 1 )
self.backtrack();
}
b: int8;
};
type Bar = unit {
a: int8;
b: int8;
};
# printf '\001\002\003\004' | spicy-driver backtrack.spicy
[$foo=[$a=1, $b=2], $bar=[$a=3, $b=4]]
# printf '\003\004' | spicy-driver backtrack.spicy
[$foo=[$a=3, $b=(not set)], $bar=[$a=3, $b=4]]
backtrack()
can be called from inside %error hooks, so this provides a simple form of error recovery
as well.
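A sketch of what such recovery could look like, assuming a hypothetical header that may or may not be present: if parsing the header fails, the %error hook backtracks, and the outer unit continues with its next field from the position where the header started.

```spicy
module Test;

public type Msg = unit {
    hdr: Header &try;
    raw: bytes &eod;   # parsing resumes here after backtracking

    on %done { print self; }
};

type Header = unit {
    magic: b"MAGIC";
    version: uint8;

    # On a parse error, give up on the header instead of aborting.
    on %error { self.backtrack(); }
};
```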
Note
This mechanism is preliminary and will probably see refinement over time, both in terms of more automated backtracking and by providing better control where to continue after backtracking.
5.1.9. Changing Input
By default, a Spicy parser proceeds linearly through its inputs, parsing as much as it can and yielding back to the host application once it runs out of input. There are two ways to change this linear model: diverting parsing to a different input, and random access within the current unit’s data.
Parsing custom data
A unit field can have either &parse-from=EXPR
or
&parse-at=EXPR
attached to it to change where it’s receiving its
data to parse from. EXPR
is evaluated at the time the field is
reached. For &parse-from
it must produce a value of type
bytes
, which will then constitute the input for the field. This
can, e.g., be used to reparse previously received input:
module Test;
public type Foo = unit {
x: bytes &size=2;
y: uint16 &parse-from=self.x;
z: bytes &size=2;
on %done { print self; }
};
# printf '\x01\x02\x03\04' | spicy-driver foo.spicy
[$x=b"\x01\x02", $y=258, $z=b"\x03\x04"]
For &parse-at
, EXPR
must yield an iterator pointing to a
(still valid) position of the current unit’s input stream (such as
retrieved through input()
). The field will then be
parsed from the data starting at that location.
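As a sketch of &parse-at, the following variant of the example above remembers the unit’s starting position in a variable and later parses a field from there; the variable name start is our own choice:

```spicy
module Test;

public type Foo = unit {
    on %init { self.start = self.input(); }

    x: bytes &size=2;
    y: uint16 &parse-at=self.start;   # parses the first two bytes again, now as an integer
    z: bytes &size=2;

    on %done { print self; }

    var start: iterator<stream>;
};
```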
Random access
While a unit is being parsed, you may revert the current input position backwards to any location between the first byte the unit has seen and the current position. You can use a set of built-in unit methods to control the current position:
input()
Returns a stream iterator pointing to the current input position.
set_input()
Sets the current input position to the location of the specified stream iterator. Per above, the new position needs to reside between the beginning of the current unit’s data and the current position; otherwise an exception will be generated at runtime.
offset()
Returns the numerical offset of the current input position relative to the position of the first byte fed into this unit.
position()
Returns an iterator to the current input position in the stream fed into this unit.
For random access, you’d typically get the current position through
input()
, subtract from it the desired number of bytes you want to go
back, and then use set_input
to establish that new position. By
further storing iterators as unit variables you can decouple these
steps and, e.g., remember a position to later come back to.
Here’s an example that parses input data twice with different sub units:
module Test;
public type Foo = unit {
on %init() { self.start = self.input(); }
a: A { self.set_input(self.start); }
b: B;
on %done() { print self; }
var start: iterator<stream>;
};
type A = unit {
x: uint32;
};
type B = unit {
y: bytes &size=4;
};
# printf '\00\00\00\01' | spicy-driver foo.spicy
[$a=[$x=1], $b=[$y=b"\x00\x00\x00\x01"], $start=<offset=0 data=b"\x00\x00\x00\x01">]
If you look at the output, you see that the start iterator remembers its offset, relative to the global input stream. It would also show the data at that offset if the parser had not already discarded that at the time we print it out.
Note
Spicy parsers discard input data as quickly as possible as parsing moves through the input stream. Indeed, that’s why using random access may come with a performance penalty as the parser now needs to buffer all of the unit’s data until it has been fully processed.
5.1.10. Filters
Spicy supports attaching filters to units that get to preprocess and transform a unit’s input before its parser gets to see it. A typical use case for this is stripping off a data encoding, such as compression or Base64.
A filter is itself just a unit
that comes with an additional property
%filter
marking it as such. The filter unit’s input represents the
original input to be transformed. The filter calls an internally
provided unit method forward()
to pass any
transformed data on to the main unit that it’s attached to. The filter
can call forward
arbitrarily many times, each time forwarding a
subsequent chunk of input. To attach a filter to a unit, one calls the
method connect_filter()
with an instance of the
filter’s type. Putting that all together, this is an example of a simple
filter that upper-cases all input before the main parsing unit gets
to see it:
module Test;
type Filter = unit {
%filter;
: bytes &eod &chunked {
self.forward($$.upper());
}
};
public type Foo = unit {
on %init { self.connect_filter(new Filter); }
x: bytes &size=5 { print self.x; }
};
# printf 'aBcDe' | spicy-driver foo.spicy
ABCDE
There are a couple of predefined filters coming with Spicy that become
available by importing the filter
library module:
filter::Zlib
Provides zlib decompression.
filter::Base64Decode
Provides base64 decoding.
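Using one of the predefined filters works the same way as with a custom one. A minimal sketch, assuming the whole input is Base64-encoded:

```spicy
module Test;

import filter;

public type Foo = unit {
    on %init { self.connect_filter(new filter::Base64Decode); }

    # Sees the decoded data, not the raw Base64 text.
    x: bytes &eod { print self.x; }
};
```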
5.1.11. Sinks
Sinks provide a powerful mechanism to chain multiple units together into a layered stack, each processing the output of its predecessor. A sink is the connector here that links two unit instances: one side writing and one side reading, like a Unix pipe. As additional functionality, the sink can internally reassemble data chunks that are arriving out of order before passing anything on.
Here’s a basic example of two unit types chained through a sink:
module Test;
public type A = unit {
on %init { self.b.connect(new B); }
length: uint8;
data: bytes &size=self.length { self.b.write($$); }
on %done { print "A", self; }
sink b;
};
public type B = unit {
: /GET /;
path: /[^\n]+/;
on %done { print "B", self; }
};
# printf '\13GET /a/b/c\n' | spicy-driver -p Test::A foo.spicy
B, [$path=b"/a/b/c"]
A, [$length=11, $data=b"GET /a/b/c\x0a", $b=<sink>]
Let’s see what’s going on here. First, there’s sink b
inside the
declaration of A
. That’s the connector, kept as state inside
A
. When parsing for A
is about to begin, the %init
hook
connects the sink to a new instance of B
; that’ll be the receiver
for data that A
is going to write into the sink. That writing
happens inside the field hook for data
: once we have parsed that
field, we write what will go to the sink using its built-in
write()
method. With that write operation, the
data will emerge as input for the instance of B
that we created
earlier, and that will just proceed parsing it normally. As the output
shows, in the end both unit instances end up having their fields set.
As an alternative for using the write()
in the
example, there’s some syntactic sugar for fields of type bytes
(like data
here): We can just replace the hook with a ->
operator to have the parsed data automatically be forwarded to the
sink: data: bytes &size=self.length -> self.b
.
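For completeness, here is unit A from the example rewritten as a sketch with that operator; unit B stays the same as above:

```spicy
module Test;

public type A = unit {
    on %init { self.b.connect(new B); }

    length: uint8;
    data: bytes &size=self.length -> self.b;   # forwarded to the sink automatically

    on %done { print "A", self; }

    sink b;
};

public type B = unit {
    : /GET /;
    path: /[^\n]+/;

    on %done { print "B", self; }
};
```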
Sinks have a number of further methods, see Sink for the complete reference. Most of them we will also encounter in the following when discussing additional functionality that sinks provide.
Note
Because sinks are meant to decouple processing between two units, a unit connected to a sink will not pass any parse errors back up to the sink’s parent. If you want to catch them, install an %error hook inside the connected unit.
5.1.11.1. Using Filters
Sinks also support filters to preprocess any data
they receive before forwarding it on. This works just like for units
by calling the built-in sink method
connect_filter()
. For example, if in the example
above, data
had been gzip-compressed, we could have
instructed the sink to automatically decompress it by calling
self.b.connect_filter(new filter::Zlib)
(leveraging the
Spicy-provided Zlib
filter).
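In context, that could look like the following sketch, which assumes that data carries compressed content and that B is the unit from the example above:

```spicy
module Test;

import filter;

public type A = unit {
    on %init {
        self.b.connect(new B);
        # Decompress everything written into the sink before B sees it.
        self.b.connect_filter(new filter::Zlib);
    }

    length: uint8;
    data: bytes &size=self.length -> self.b;

    sink b;
};

public type B = unit {
    : /GET /;
    path: /[^\n]+/;

    on %done { print "B", self; }
};
```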
5.1.11.2. Leveraging MIME Types
In our example above we knew which type of unit we wanted to connect.
In practice, that may or may not be the case. Often, it only becomes
clear at runtime what the choice for the next layer should be, such as
when using well-known ports to determine the appropriate
application-layer analyzer for a TCP stream. Spicy supports dynamic
selection through a generalized notion of MIME types: Units can
declare which MIME types they know how to parse (see
Meta data), and sinks have a
connect_mime_type()
method that will instantiate and
connect any that match their argument (if that’s multiple, all will be
connected and all will receive the same data).
“MIME type” can mean actual MIME types, such as text/html
.
Applications can, however, also define their own notion of
<type>/<subtype>
to model other semantics. For example, one could
use x-port/443
as convention to trigger parsers by well-known
port. An SSL unit would then declare %mime-type = "x-port/443"
, and
the connection would be established through the equivalent of
connect_mime_type("x-port/%d" % resp_port_of_connection)
.
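A sketch of the mechanism with an actual MIME type; the unit names Html and Container are hypothetical, and we assume the container simply forwards all of its input:

```spicy
module Test;

public type Html = unit {
    %mime-type = "text/html";

    body: bytes &eod;

    on %done { print "HTML payload of %d bytes" % |self.body|; }
};

public type Container = unit {
    # Connects an instance of every unit that declared this MIME type.
    on %init { self.payload.connect_mime_type("text/html"); }

    : bytes &eod -> self.payload;

    sink payload;
};
```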
Todo
For this specific example, there’s a better solution: We also have
the %port
property and should just build up a table index on
that.
5.1.11.3. Reassembly
Reassembly (or defragmentation) of out-of-order data chunks is a common requirement for many protocols. Sinks have that functionality built-in by allowing you to associate a position inside a virtual sequence space with each chunk of data. Sinks will then pass their data on to connected units only once they have collected a continuous, in-order range of bytes.
The easiest way to leverage this
is to simply associate sequence numbers with each
write()
operation:
module Test;
public type Foo = unit {
sink data;
on %init {
self.data.connect(new Bar);
self.data.write(b"567", 5);
self.data.write(b"89", 8);
self.data.write(b"012", 0);
self.data.write(b"34", 3);
}
};
public type Bar = unit {
s: bytes &eod;
on %done { print self.s; }
};
# spicy-driver -p Test::Foo foo.spicy </dev/null
0123456789
By default, Spicy expects the sequence space to start at zero, so the
first byte of the input stream needs to be passed in with sequence
number zero. You can change that base number by calling the
sink method set_initial_sequence_number()
. You can
control Spicy’s gap handling, including when to stop buffering data
because you know nothing further will arrive anymore. Spicy can also
notify you about unsuccessful reassembly through a series of built-in unit hooks.
See Sink for a reference of the available functionality.
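As a sketch of a non-zero base, the following variant of the example starts its sequence space at 100, a number chosen arbitrarily here; all write() calls then use positions relative to that base:

```spicy
module Test;

public type Foo = unit {
    sink data;

    on %init {
        self.data.connect(new Bar);
        self.data.set_initial_sequence_number(100);
        self.data.write(b"34", 103);    # arrives first, but is held back
        self.data.write(b"012", 100);   # fills the start of the sequence space
    }
};

public type Bar = unit {
    s: bytes &eod;
    on %done { print self.s; }
};
```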
5.1.12. Contexts
Parsing may need to retain state beyond any specific unit’s lifetime. For example, a UDP protocol may want to remember information across individual packets (and hence units), or a bi-directional protocol may need to correlate the request side with the response side. One option for implementing this in Spicy is managing such state manually in global variables, for example by maintaining a global map that ties a unique connection ID to the information that needs to be retained. However, doing so is clearly cumbersome and error prone. As an alternative, a unit can make use of a dedicated context value, which is an instance of a custom type that has its lifetime determined by the host application running the parser. For example, Zeek will tie the context to the underlying connection.
Any public unit can declare a context through a unit-level property
called %context
, which takes an arbitrary type as its argument.
For example:
public type Foo = unit {
%context = bytes;
[...]
};
When used as a top-level entry point to parsing, the unit will then,
by default, receive a unique context value of that type. That context
value can be accessed through the context()
method, which will return a reference to it:
module Test;
public type Foo = unit {
%context = int64;
on %init { print self.context(); }
};
# spicy-driver foo.spicy </dev/null
0
By itself, this is not very useful. However, host applications can control how contexts are maintained, and they may assign the same context value to multiple units. For example, when parsing a protocol, the Zeek integration always creates a single context value shared by all top-level units belonging to the same connection, enabling parsers to maintain bi-directional, per-connection state. The batch mode of spicy-driver does the same.
Note
A unit’s context value gets set only when a host application uses it as the top-level starting point for parsing. If in the above example Foo wasn’t the entry point, but used inside another unit further down during the parsing process, its context would remain unset.
As an example, the following grammar—mimicking a request/reply-style protocol—maintains a queue of outstanding textual commands to then associate numerical result codes with them as the responses come in:
module Test;
# We wrap the state into a tuple to make it easy to add more attributes if needed later.
type Pending = tuple<pending: vector<bytes>>;
public type Requests = unit {
%context = Pending;
: Request[] foreach { self.context().pending.push_back($$.cmd); }
};
public type Replies = unit {
%context = Pending;
: Reply[] foreach {
if ( |self.context().pending| ) {
print "%s -> %s" % (self.context().pending.back(), $$.response);
self.context().pending.pop_back();
}
else
print "<missing request> -> %s" % $$.response;
}
};
type Request = unit {
cmd: /[A-Za-z]+/;
: b"\n";
};
type Reply = unit {
response: /[0-9]+/;
: b"\n";
};
# spicy-driver -F input.dat context.spicy
msg -> 100
put -> 200
CAT -> 555
end -> 300
get -> 400
LST -> 666
The output is produced from this input batch file. This would work the same when used with Zeek on a corresponding packet trace.
Note that the units for the two sides of the connection need to
declare the same %context
type. Processing will abort at
runtime with a type mismatch error if that’s not the case.
5.1.13. Error Handling
Whenever a parser encounters an unexpected situation during processing, it triggers a runtime error. This includes parsing errors due to input that does not match the current unit, failing &requires conditions, and also any logic errors in hooks, such as attempting to read an unset unit field or accessing an invalid vector index.
By default, any runtime error will cause the parsing to terminate immediately, with a corresponding error message reported back to the host application. The Spicy parser will not be able to continue processing afterwards. However, there are a couple of ways to catch parsing errors (but not other runtime errors) and potentially recover from them, which we discuss in the following.
A unit can provide special %error hooks that will
execute when a parsing error is encountered. A unit-wide %error
hook will catch all parsing errors occurring anywhere inside the unit,
including any sub-units (if not otherwise handled by the sub-unit
itself already). Example:
module MyModule;
type MyType = unit {
magic: b"MAGIC";
on %error(msg: string) {
print "Error when parsing MyUnit: ", msg;
}
};
The msg
parameter is optional. If it’s specified, it will contain
an error message describing the issue.
By default, even with an %error
hook in place, the parser will
still terminate after executing the hook. To change that, the hook may
use Backtracking to specify where to continue parsing after the
error. Alternatively, if automatic error recovery is in place, the parser will attempt recovery after
the error hooks have executed.
New in version 1.12: Per-field %error
handler
Rather than defining a unit-wide %error
hook, it is also possible
to just have an individual field catch its own parsing errors. The
easiest way to do this is to attach an %error
attribute to an
inline hook:
module My;
type MyType = unit {
magic: b"MAGIC" %error { # will run if magic cannot be parsed
print "magic not found";
}
};
To get access to the error message as well, define it out of line like this:
module MyUnit;
type MyType = unit {
magic: b"MAGIC"
on magic(msg: string) %error {
print "Error when parsing magic: ", msg;
}
};
5.1.14. Error Recovery
Real world input does not always look like what parsers expect: endpoints may not conform to the protocol’s specification, a parser’s grammar might not fully cover all of the protocol, or some input may be missing due to packet loss or stepping into the middle of a conversation. By default, if a Spicy parser encounters such situations, it will abort parsing altogether and issue an error message. Alternatively, however, Spicy allows grammar writers to specify heuristics to recover from errors. The main challenge here is finding a spot in the subsequent input where parsing can reliably resume.
Spicy employs a two-phase approach to such recovery: it first searches for a possible point in the input stream where it seems promising to attempt to resume parsing; and then it confirms that choice by trying to parse a few fields at that location according to the grammar to see if that’s successful. We say that during the first part of this process, the Spicy parser is in synchronization mode; during the second, it is in trial mode.
Phase 1: Synchronization
To identify locations where parsing can attempt to pick up again after
an error, a grammar can add &synchronize
attributes to selected unit
fields, marking them as synchronization points. Whenever an error
occurs during parsing, Spicy will determine the closest
synchronization point in the grammar following the error’s location,
and then attempt to continue processing there by skipping ahead in the
input data until it aligns with what that field is looking for.
A synchronization point may be any of the following:
A field for which parsing begins with a constant literal (e.g., a specific sequence of bytes). To realign the input stream, the parser will search the input for the next occurrence of this literal, discarding any data in between. Example:
type X = unit {
...
};
type Y = unit {
a: b"begin-of-Y";
b: bytes &size=10;
};
type Foo = unit {
x: X;
y: Y &synchronize;
};
If a parse error occurs during Foo::x, Spicy will move ahead to Foo::y, switch into synchronization mode, and start searching the input for the bytes begin-of-Y. If found, it’ll continue with parsing Foo::y at that location in trial mode (see below).
Note
Behind the scenes, synchronization through literals uses the same machinery as look-ahead parsing, meaning that it works across sub-units, vector content,
switch
statements, etc.. No matter how complex the field, as long as there’s one or more literals that always must be coming first when parsing it, the field may be used as a synchronization point.A field with a type which specifies %synchronize-at or %synchronize-after. The parser will search the input for the next occurrence of the given literal, discarding any data in between. If the search was successful,
%synchronize-at
will leave the input at the position of the search literal for later extraction while%synchronize-after
will discard the search literal.If either of these unit properties is specified, it will always overrule any other potential synchronization points in the unit. Example:
type X = unit {
    ...
    : /END/;
};
type Y = unit {
    %synchronize-after = /END/;
    a: bytes &size=10;
};
type Foo = unit {
    x: X;
    y: Y &synchronize;
};
A field that’s located inside the input stream at a fixed offset relative to the field triggering the error. The parser will then be able to skip ahead to that offset. Example:
type X = unit { ... };
type Y = unit { ... };
type Foo = unit {
    x: X &size=512;
    y: Y &synchronize;
};
Here, when parsing Foo::x triggers an error, Spicy will know that it can continue with Foo::y at offset <beginning of Foo::x> + 512.
Todo
This synchronization strategy is not yet implemented.
When parsing a vector, the inner elements may provide synchronization points as well. Example:
type X = unit {
    a: b"begin-of-X";
    b: bytes &size=10;
};
type Foo = unit {
    xs: (X &synchronize)[];
};
If one element of the vector Foo::xs fails to parse, Spicy will attempt to find the beginning of the next X in the input stream and continue there. For this to work, the vector’s elements must themselves represent valid synchronization points (e.g., start with a literal). If the vector is of fixed size, after successful synchronization, it will contain the expected number of entries, but some of them may remain (fully or partially) uninitialized if they encountered errors.
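To contrast the two unit properties from the list above, here is a minimal sketch. All type and field names, and the BEGIN and END literals, are hypothetical and chosen only for illustration. With %synchronize-at, the literal stays in the input, so the unit’s first field can still extract it; with %synchronize-after, the literal is discarded and the unit starts with the data that follows it.

```spicy
module Test;

# Hypothetical preceding unit that ends with an END marker.
type X = unit {
    data: bytes &size=4;
    : /END/;
};

# Variant 1: search for BEGIN and leave it in the input, so that the
# first field of this unit can still parse it.
type YAt = unit {
    %synchronize-at = /BEGIN/;
    marker: /BEGIN/;
    a: bytes &size=10;
};

# Variant 2: search for END and discard it; parsing resumes right after.
type YAfter = unit {
    %synchronize-after = /END/;
    a: bytes &size=10;
};

public type Foo = unit {
    x: X;
    y: YAt &synchronize; # swap in YAfter to try the other variant

    # Confirm unconditionally to keep the sketch short.
    on %synced { confirm; }
};
```

Which variant fits depends on whether the literal belongs to the unit being resumed (use %synchronize-at) or merely terminates whatever came before it (use %synchronize-after).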
Phase 2: Trial parsing
Once input has been realigned with a synchronization point, parsing switches from synchronization mode into trial mode, in which the parser will attempt to confirm that it has indeed found a viable place to continue. It does so by proceeding to parse subsequent input from the synchronization point onwards, until one of the following occurs:
A unit hook explicitly acknowledges that synchronization has been successful by executing Spicy’s confirm statement. Typically, a grammar will do so once it has been able to correctly parse a few fields following the synchronization point, that is, whatever it needs to be sufficiently certain that it’s indeed seeing the expected structure.
A unit hook explicitly declines the synchronization by executing Spicy’s reject statement. This will abandon the current synchronization attempt, and switch back into the original synchronization mode again to find another location to try.
Parsing reaches the end of the grammar without either confirm or reject having been called. In this case, the parser will abort with a fatal parse error.
Note that during trial mode, any fields between the synchronization point and the eventual confirm/reject location will already be processed as usual, including any hooks executing except %error. This may leave the unit in a partially initialized state if trial parsing eventually fails. Trial mode will also consume any input along the way, with any further synchronization attempts proceeding only on subsequent, not yet seen, data.
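As a minimal sketch of choosing between confirm and reject, the following hypothetical unit resynchronizes on a REC literal and confirms only once the byte after it looks plausible. The unit name, its fields, and the version check are illustrative and not taken from the text above. The sketch also assumes that confirm has no effect outside of trial mode and that reject outside of trial mode raises an ordinary parse error, so the hook behaves sensibly during regular parsing as well.

```spicy
module Test;

# Hypothetical record format: HDR, a length byte, REC, a version byte, payload.
public type Record = unit {
    header: /HDR/;
    len: uint8;

    # If anything above fails to parse, search the input for this literal.
    marker: /REC/ &synchronize;

    version: uint8 {
        # $$ is the value just parsed for this field.
        if ( $$ == 1 ) {
            # Looks like a real record: leave trial mode.
            confirm;
        }
        else {
            # Probably a false positive: abandon this attempt.
            reject;
        }
    }

    payload: bytes &eod;

    on %done { print self; }
};
```

Because fields parsed in trial mode are processed as usual, marker and version will already be set by the time the hook runs, even if the attempt is then rejected.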
Synchronization Hooks
For customization, Spicy provides a set of hooks executing at different points during the synchronization process:
on %synced { ...}
Executes when a synchronization point has been found and parsing resumes there, just before the parser begins processing the corresponding field in trial mode.
on %confirmed { ...}
Executes when trial mode ends successfully with confirm.
on %rejected { ...}
Executes when trial mode fails with reject.
on %sync_advance(offset: uint64)
Executes regularly (see below) while the parser is searching for a synchronization point. The offset parameter gives the current position inside the input stream.
This hook can be used to check whether the parser is skipping too much data for the analysis to remain useful. For example, a protocol analyzer could decide to bail out if the input stream consists mainly of gaps, as reported by self.stream().statistics().
By default, the hook executes every 4KB of input data skipped while searching for a synchronization point. It may not necessarily trigger immediately at the 4KB mark, but soon after, when parsing gets a chance to check the input stream’s position.
You may change the trigger volume by defining a unit property %sync-advance-block-size = <VALUE>, where <VALUE> is an alternative size value in bytes. As usual, this property can also be set at the module level to apply to all units.
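The following sketch shows how the %sync_advance hook and the %sync-advance-block-size property might be combined. The unit and its fields are hypothetical, and the block size of 1024 bytes is an arbitrary value picked for illustration. The hook here only logs its progress; a real analyzer might instead decide at this point that too much input has been skipped.

```spicy
module Test;

public type Example = unit {
    # Run %sync_advance roughly every 1024 skipped bytes instead of the
    # default of 4KB (illustrative value).
    %sync-advance-block-size = 1024;

    start_a: /SEC_A/;
    start_b: /SEC_B/ &synchronize;
    b: bytes &eod;

    # Confirm unconditionally to keep the sketch short.
    on %synced { confirm; }

    # Called periodically while the parser searches for SEC_B.
    on %sync_advance(offset: uint64) {
        print "still searching for a synchronization point at offset %d" % offset;
    }
};
```

Since the hook may fire somewhat after the configured mark, the offsets it reports should be treated as approximate progress indicators rather than exact multiples of the block size.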
Example Synchronization Process
As an example, let’s consider a grammar consisting of two sections where each section is started with a section header literal (SEC_A and SEC_B here).
We want to allow for inputs which miss parts or all of the first section. For such inputs, we can still synchronize the input stream by looking for the start of the second section. (For simplicity, we just use a single unit, even though typically one would probably have separate units for the two sections.)
module Test;
public type Example = unit {
    start_a: /SEC_A/;
    a: uint8;
    # If we fail to find e.g., 'SEC_A' in the input, try to synchronize on this literal.
    start_b: /SEC_B/ &synchronize;
    b: bytes &eod;
    # In this example confirm unconditionally.
    on %synced {
        print "Synced: %s" % self;
        confirm;
    }
    # Perform logging for %confirmed and %rejected.
    on %confirmed { print "Confirmed: %s" % self; }
    on %rejected { print "Rejected: %s" % self; }
    on %done { print "Done %s" % self; }
};
Let us consider that this parser encounters the input \xFFSEC_Babc, which is missing the SEC_A section marker: start_a is missing, a=255, start_b=SEC_B as expected, and b=abc.
For such an input, parsing will encounter an initial error when it sees \xFF where SEC_A would have been expected.
Since start_b is marked as a synchronization point, the parser enters synchronization mode and jumps over the field a to start_b, to now search for SEC_B.
At this point the input still contains the unexpected \xFF and remains \xFFSEC_Babc. While searching for SEC_B, \xFF is skipped over, and then the expected token is found. The input is now SEC_Babc.
The parser has successfully synchronized and enters trial mode. All %synced hooks are invoked.
The unit’s %synced hook executes confirm and the parser leaves trial mode. All %confirmed hooks are invoked.
Regular parsing continues at start_b. The input was SEC_Babc, so start_b is set to SEC_B and b to abc.
Since parsing for start_a was unsuccessful and a was jumped over, both fields remain unset.
# printf '\xFFSEC_Babc' | spicy-driver foo.spicy
Synced: [$start_a=(not set), $a=(not set), $start_b=(not set), $b=(not set)]
Confirmed: [$start_a=(not set), $a=(not set), $start_b=(not set), $b=(not set)]
Done [$start_a=(not set), $a=(not set), $start_b=b"SEC_B", $b=b"abc"]