owl

using the generated parser

When you run owl with the -c option, it generates parser code as a C header file.

$ owl -c grammar.owl
// This file was generated by the Owl parsing tool.
...

Once you’re done admiring the beautiful parser code which has filled your terminal buffer, you may decide you want to write it to a file, either by using -o or by redirecting stdout:

$ owl -c grammar.owl -o parser.h
$ owl -c grammar.owl > parser.h

integrating the parser

The header file has two parts (in single-file library style): a header-like part and an implementation-like part. By default, including the header includes only the header-like part. To include the implementation as well, define OWL_PARSER_IMPLEMENTATION before using #include:

// Include parser implementation.
#define OWL_PARSER_IMPLEMENTATION
#include "parser.h"

The implementation should be included by a single .c file somewhere in your project.

step-by-step

Here are some steps you can follow to create a new program that uses a generated parser:

Write a grammar for your language in your-grammar.owl (you can test it out before you compile it by running owl your-grammar.owl).
Run owl -c your-grammar.owl -o parser.h to generate the parser.h header file.

Create a file called main.c that looks like:

#define OWL_PARSER_IMPLEMENTATION
#include "parser.h"
   
int main()
{
    struct owl_tree *tree;
    tree = owl_tree_create_from_file(stdin);
    owl_tree_print(tree);
    owl_tree_destroy(tree);
    return 0;
}

Build and run main.c using your favorite build system (or by running cc -o test main.c && ./test).
Type some input matching your grammar into standard input to test that things are working. You should see a parse tree matching your input. Make sure to mark the end of the input using Control+D (or however you specify “end of input”).

creating a tree

Owl uses the struct owl_tree type to represent a parse tree. There are three ways to create a tree:

from a string

struct owl_tree *tree = owl_tree_create_from_string(string);

The string parameter must be a null-terminated C string.

The tree returned by owl_create_tree_from_string may reference pieces of its string argument. It’s important to keep the string around until you’re done with the tree.

from a file

struct owl_tree *tree = owl_tree_create_from_file(file);

The file parameter must be an open FILE *.

Owl will copy the contents of the file into an internal buffer, so feel free to close the file after calling this function.

using options

struct owl_tree *tree = owl_tree_create_with_options(options);

Either options.file or options.string must be set (but not both). Use this function if you want to provide a custom tokenize function (see user defined tokens). More options may be available in the future.

reporting errors

There are a few kinds of errors that can happen while creating a tree (see the table below). If one of these errors happens, the owl_create_tree_... functions return an error tree. Calling any function other than owl_tree_destroy on an error tree will print the error and exit.

error type	what it means	error range
`ERROR_INVALID_FILE`	The argument to `owl_tree_create_from_file` was null, or there was an error while reading it.	None.
`ERROR_INVALID_OPTIONS`	The `options` argument to `owl_tree_create_with_options` either had both `options.file` and `options.string` set, or it had neither.	None.
`ERROR_INVALID_TOKEN`	Part of the text didn’t match any valid token.	A range that begins with the first unrecognized character.
`ERROR_UNEXPECTED_TOKEN`	The parser encountered an out-of-place token that didn’t fit the grammar.	The range of the unexpected token.
`ERROR_MORE_INPUT_NEEDED`	The input is valid so far, but incomplete; more tokens are necessary to complete it.	A range positioned at the end of the input.
`ERROR_ALLOCATION_FAILURE`	A call to a system allocator returned NULL. Note that not all allocation failures are currently detected.	None.

To handle the error yourself, you can use owl_tree_get_error. The range parameter is optional; if you pass a non-null pointer, it will be filled with a byte range as described in the table above.

struct source_range range;
switch (owl_tree_get_error(tree, &range)) {
case ERROR_NONE:
    // No error; continue with the rest of the program.
    break;
case ERROR_INVALID_FILE:
    fprintf(stderr, "error: invalid file\n");
    exit(1);
default:
    fprintf(stderr, "error: at range %zu %zu\n", range.start, range.end);
    exit(1);
}

cleaning up

When you’re done with a tree, use owl_tree_destroy(tree) to reclaim its memory. Calling owl_tree_destroy on a null value is okay (it does nothing).

inside the tree

Each time a rule matches part of the input, Owl records details of the match in a hierarchical structure—that’s the parse tree. Let’s see what this tree looks like for a list-matching grammar that begins like:

list = item (',' item)*

If the list rule matches the input, Owl will record

the range of text that list matches, and
match details for each item in the list.

This recorded data is represented by a parsed_list struct:

struct parsed_list {
    struct source_range range;
    struct owl_ref item;
};

Note that item is a struct owl_ref, not a struct parsed_item. To save memory and improve locality, Owl stores parse tree data in a compressed format. A struct owl_ref is like a pointer into this compressed data. The parsed_item_get function unpacks a ref into a struct parsed_item and returns it.

struct parsed_item first_item = parsed_item_get(list.item);

There’s a function like this for every rule:

rule name	match data type	unpacking function
`list`	`struct parsed_list`	`parsed_list_get`
`item`	`struct parsed_item`	`parsed_item_get`

iterating

The list.item field is a reference to the first item in the list. Calling owl_next on an item returns a reference to the next item. You can keep calling owl_next in a loop to iterate over the list:

while (!list.item.empty) {
    struct parsed_item item = parsed_item_get(list.item);
    // ...do something with item...
    list.item = owl_next(list.item);
}

Mutating list.item doesn’t change the parse tree—it just changes the local, unpacked copy. You can also use a for loop if you don’t want to reassign anything:

for (struct owl_ref r = list.item; !r.empty; r = owl_next(r)) {
    struct parsed_item item = parsed_item_get(r);
    // ...do something with item...
}

Each struct owl_ref is valid for as long as the tree is. You can store them and reuse them as much as you want.

named options

If a rule has named options, the chosen option is indicated by the type field in the match struct. For example, say our item rule looks like this:

item =
  'nil' : nothing
  number : numeric
  [ '[' list ']' ] : list

with three named options (nothing, numeric, and list). Its struct has a type field to indicate which option was chosen.

struct parsed_item {
    struct source_range range;
    enum parsed_type type; // <----
    struct owl_ref number;
    struct owl_ref list;
};

and a parsed_type enum is generated with three values.

enum parsed_type {
    PARSED_LIST = 1,
    PARSED_NOTHING,
    PARSED_NUMERIC,
};

All option names go into the same parsed_type enum to avoid unneccessary prefixing.

Fields which can’t appear in patterns of the given type always have their empty flag set. For example, if item.type is PARSED_NUMERIC, item.list.empty will always be set.

renames

Most of the time, the rule name works fine as a field name. But occasionally you want to distinguish between multiple references to the same rule:

set-index = expr '[' expr ']' '=' expr

The @ operator can be used to specify a field name for each reference.

set-index = expr@array '[' expr@index ']' '=' expr@value

The new names appear in the fields of the parse tree struct:

struct parsed_set_index {
    struct source_range range;
    struct owl_ref array;
    struct owl_ref index;
    struct owl_ref value;
};

To keep the types of the references consistent, Owl makes sure each field name can only refer to one kind of rule.

# This isn't allowed:
# bad-plus = identifier@bad-name '+' expr@bad-name

getting the root match

The first rule (or root rule) in the grammar matches the entire input. This root match is our starting point in the parse tree. If list is the first rule in the grammar, Owl will generate an owl_tree_get_parsed_list function which returns the root match:

struct parsed_list list = owl_tree_get_parsed_list(tree);

token data

The identifier, integer, number, and string tokens record data about their match like rules do. This data can also be unpacked using parsed_identifier_get, parsed_integer_get, parsed_number_get, or parsed_string_get.

struct parsed_identifier {
    struct source_range range;
    const char *identifier;
    size_t length;
};

struct parsed_integer {
    struct source_range range;
    uint64_t integer;
};

struct parsed_number {
    struct source_range range;
    double number;
};

struct parsed_string {
    struct source_range range;
    const char *string;
    size_t length;
    bool has_escapes;
};

struct parsed_identifier parsed_identifier_get(struct owl_ref);
struct parsed_integer parsed_integer_get(struct owl_ref);
struct parsed_number parsed_number_get(struct owl_ref);
struct parsed_string parsed_string_get(struct owl_ref);

If has_escapes is true, the string data is owned by the owl_tree—otherwise, it’s a direct reference to the parsed text.

user-defined tokens

If you need more than the built-in identifier, integer, number, and string token types, you can define your own token types using .token.

If a grammar has any of these user-defined tokens, owl_tree_create_with_options accepts two extra parameters in its owl_tree_options struct:

struct owl_tree_options {
    // ...
    owl_token_func_t tokenize;
    void *tokenize_info;
};

The tokenize function matches a user-defined token. For example, if your grammar needs to match individual digits:

.token digit '7' '3'

…then you might write this tokenize function:

struct owl_token match_digit(const char *string, void *info)
{
    if ('0' <= string[0] && string[0] <= '9') {
        return (struct owl_token){
            .type = OWL_TOKEN_DIGIT,
            .length = 1,
            .data.integer = string[0] - '0'
        };
    } else
        return owl_token_no_match;
}

…and pass it into owl_tree_create_with_options like this:

struct owl_tree_options options = {
    .string = "2",
    .tokenize = match_digit,
};
struct owl_tree *tree = owl_tree_create_with_options(options);
// ...

Each time Owl’s tokenizer steps forward, it calls the tokenize function (if it’s not NULL), passing the remaining zero-terminated input as string and the tokenize_info field provided in owl_tree_options as info. The tokenize function then returns an owl_token struct representing details about the match:

struct owl_token {
    enum owl_token_type type;
    size_t length;
    union {
        uint64_t integer;
        double real;
        void *pointer;
    } data;
};

A return value of owl_token_no_match (or any value with the length field set to zero) indicates no match. Otherwise, the length field indicates how long the match is, and type indicates which type of token it is. Each user-defined token type (like .token digit) corresponds to a value in the owl_token_type enum (like OWL_TOKEN_DIGIT). Tokens with type OWL_WHITESPACE will be treated as whitespace instead of as a custom token.

If there’s a conflict between a user-defined token match and another token match, the longest match will be chosen, with ties going to keywords first, then user-defined token types, then other token types (like identifier and so on).

User-defined tokens can be unpacked into parsed_... structs just like rules and built-in tokens.

struct parsed_digit {
    struct source_range range;
    union {
        uint64_t integer;
        double real;
        void *pointer;
    } data;
};

Any data returned via the data field in owl_token will appear in the parsed_... struct automatically.

custom prefix

The -p option specifies a custom prefix for names in the generated parser code.

$ owl -c grammar.owl -p asdf -o parser.h

Instead of owl_ref and parsed_integer_get, parser.h will use names like asdf_ref and asdf_integer_get.

function index

ROOT is the root rule name. RULE ranges over all rules.

name	arguments	return value
`owl_next`	An `owl_ref`.	The next ref matching the corresponding field in the rule, or an empty ref.
`owl_refs_equal`	Two `owl_ref` values.	`true` if the refs refer to the same match; `false` otherwise.
`owl_tree_create_from_file`	A `FILE *` to read from. The file is read into an intermediate string and may be closed immediately.	A new tree.
`owl_tree_create_from_string`	A null-terminated string to parse. You retain ownership and must keep the string around until the tree is destroyed.	A new tree.
`owl_tree_create_with_options`	An `owl_tree_options` struct—use this to specify a custom tokenize function.	A new tree.
`owl_tree_destroy`	An `owl_tree *` to destroy, freeing its resources back to the system. May be `NULL`.	None.
`owl_tree_get_error`	An `owl_tree *` and an `error_range` out-parameter. The error range may be `NULL`.	An error which interrupted parsing, or `ERROR_NONE` if there was no error.
`owl_tree_get_parsed_ROOT`	An `owl_tree *`.	A `parsed_ROOT` struct corresponding to the root match.
`owl_tree_print`	An `owl_tree *` to print to stdout (typically for debugging purposes). Must not be `NULL`.	None.
`owl_tree_root_ref`	An `owl_tree *`.	The ref corresponding to the root match.
`parsed_identifier_get`	An `owl_ref` corresponding to an identifier match.	A `parsed_identifier` struct corresponding to the identifier match.
`parsed_integer_get`	An `owl_ref` corresponding to an integer match.	A `parsed_integer` struct corresponding to the integer match.
`parsed_number_get`	An `owl_ref` corresponding to a number match.	A `parsed_number` struct corresponding to the number match.
`parsed_string_get`	An `owl_ref` corresponding to a string match.	A `parsed_string` struct corresponding to the string match.
`parsed_RULE_get`	An `owl_ref` corresponding to a match for `RULE`.	A `parsed_RULE` struct corresponding to the ref’s match.
`parsed_TOKEN_get`	An `owl_ref` corresponding to a match for the user-defined token `TOKEN`.	A `parsed_TOKEN` struct corresponding to the ref’s match.

This site is open source. Improve this page.