When you run owl
with the -c
option, it generates parser code as a C header file.
$ owl -c grammar.owl
// This file was generated by the Owl parsing tool.
...
Once you’re done admiring the beautiful parser code which has filled your terminal buffer, you may decide you want to write it to a file, either by using -o
or by redirecting stdout:
$ owl -c grammar.owl -o parser.h
$ owl -c grammar.owl > parser.h
The header file has two parts (in single-file library style): a header-like part and an implementation-like part. By default, including the header includes only the header-like part. To include the implementation as well, define OWL_PARSER_IMPLEMENTATION
before using #include
:
// Include parser implementation.
#define OWL_PARSER_IMPLEMENTATION
#include "parser.h"
The implementation should be included by a single .c
file somewhere in your project.
Here are some steps you can follow to create a new program that uses a generated parser:
your-grammar.owl
(you can test it out before you compile it by running owl your-grammar.owl
).owl -c your-grammar.owl -o parser.h
to generate the parser.h
header file.Create a file called main.c
that looks like:
#define OWL_PARSER_IMPLEMENTATION
#include "parser.h"
int main()
{
struct owl_tree *tree;
tree = owl_tree_create_from_file(stdin);
owl_tree_print(tree);
owl_tree_destroy(tree);
return 0;
}
main.c
using your favorite build system (or by running cc -o test main.c && ./test
).Owl uses the struct owl_tree
type to represent a parse tree. There are three ways to create a tree:
struct owl_tree *tree = owl_tree_create_from_string(string);
The string
parameter must be a null-terminated C string.
The tree returned by owl_create_tree_from_string
may reference pieces of its string argument. It’s important to keep the string around until you’re done with the tree.
struct owl_tree *tree = owl_tree_create_from_file(file);
The file
parameter must be an open FILE *
.
Owl will copy the contents of the file into an internal buffer, so feel free to close the file after calling this function.
struct owl_tree *tree = owl_tree_create_with_options(options);
Either options.file
or options.string
must be set (but not both). Use this function if you want to provide a custom tokenize function (see user defined tokens). More options may be available in the future.
There are a few kinds of errors that can happen while creating a tree (see the table below). If one of these errors happens, the owl_create_tree_...
functions return an error tree. Calling any function other than owl_tree_destroy
on an error tree will print the error and exit.
error type | what it means | error range |
---|---|---|
ERROR_INVALID_FILE |
The argument to owl_tree_create_from_file was null, or there was an error while reading it. |
None. |
ERROR_INVALID_OPTIONS |
The options argument to owl_tree_create_with_options either had both options.file and options.string set, or it had neither. |
None. |
ERROR_INVALID_TOKEN |
Part of the text didn’t match any valid token. | A range that begins with the first unrecognized character. |
ERROR_UNEXPECTED_TOKEN |
The parser encountered an out-of-place token that didn’t fit the grammar. | The range of the unexpected token. |
ERROR_MORE_INPUT_NEEDED |
The input is valid so far, but incomplete; more tokens are necessary to complete it. | A range positioned at the end of the input. |
ERROR_ALLOCATION_FAILURE |
A call to a system allocator returned NULL. Note that not all allocation failures are currently detected. | None. |
To handle the error yourself, you can use owl_tree_get_error
. The range
parameter is optional; if you pass a non-null pointer, it will be filled with a byte range as described in the table above.
struct source_range range;
switch (owl_tree_get_error(tree, &range)) {
case ERROR_NONE:
// No error; continue with the rest of the program.
break;
case ERROR_INVALID_FILE:
fprintf(stderr, "error: invalid file\n");
exit(1);
default:
fprintf(stderr, "error: at range %zu %zu\n", range.start, range.end);
exit(1);
}
When you’re done with a tree, use owl_tree_destroy(tree)
to reclaim its memory. Calling owl_tree_destroy
on a null value is okay (it does nothing).
Each time a rule matches part of the input, Owl records details of the match in a hierarchical structure—that’s the parse tree. Let’s see what this tree looks like for a list-matching grammar that begins like:
list = item (',' item)*
If the list
rule matches the input, Owl will record
list
matches, anditem
in the list.This recorded data is represented by a parsed_list
struct:
struct parsed_list {
struct source_range range;
struct owl_ref item;
};
Note that item
is a struct owl_ref
, not a struct parsed_item
. To save memory and improve locality, Owl stores parse tree data in a compressed format. A struct owl_ref
is like a pointer into this compressed data. The parsed_item_get
function unpacks a ref into a struct parsed_item
and returns it.
struct parsed_item first_item = parsed_item_get(list.item);
There’s a function like this for every rule:
rule name | match data type | unpacking function |
---|---|---|
list |
struct parsed_list |
parsed_list_get |
item |
struct parsed_item |
parsed_item_get |
The list.item
field is a reference to the first item in the list. Calling owl_next
on an item returns a reference to the next item. You can keep calling owl_next
in a loop to iterate over the list:
while (!list.item.empty) {
struct parsed_item item = parsed_item_get(list.item);
// ...do something with item...
list.item = owl_next(list.item);
}
Mutating list.item
doesn’t change the parse tree—it just changes the local, unpacked copy. You can also use a for
loop if you don’t want to reassign anything:
for (struct owl_ref r = list.item; !r.empty; r = owl_next(r)) {
struct parsed_item item = parsed_item_get(r);
// ...do something with item...
}
Each struct owl_ref
is valid for as long as the tree is. You can store them and reuse them as much as you want.
If a rule has named options, the chosen option is indicated by the type
field in the match struct. For example, say our item
rule looks like this:
item =
'nil' : nothing
number : numeric
[ '[' list ']' ] : list
with three named options (nothing
, numeric
, and list
). Its struct has a type
field to indicate which option was chosen.
struct parsed_item {
struct source_range range;
enum parsed_type type; // <----
struct owl_ref number;
struct owl_ref list;
};
and a parsed_type
enum is generated with three values.
enum parsed_type {
PARSED_LIST = 1,
PARSED_NOTHING,
PARSED_NUMERIC,
};
All option names go into the same parsed_type
enum to avoid unneccessary prefixing.
Fields which can’t appear in patterns of the given type always have their empty
flag set. For example, if item.type
is PARSED_NUMERIC
, item.list.empty
will always be set.
Most of the time, the rule name works fine as a field name. But occasionally you want to distinguish between multiple references to the same rule:
set-index = expr '[' expr ']' '=' expr
The @
operator can be used to specify a field name for each reference.
set-index = expr@array '[' expr@index ']' '=' expr@value
The new names appear in the fields of the parse tree struct:
struct parsed_set_index {
struct source_range range;
struct owl_ref array;
struct owl_ref index;
struct owl_ref value;
};
To keep the types of the references consistent, Owl makes sure each field name can only refer to one kind of rule.
# This isn't allowed:
# bad-plus = identifier@bad-name '+' expr@bad-name
The first rule (or root rule) in the grammar matches the entire input. This root match is our starting point in the parse tree. If list
is the first rule in the grammar, Owl will generate an owl_tree_get_parsed_list
function which returns the root match:
struct parsed_list list = owl_tree_get_parsed_list(tree);
The identifier
, integer
, number
, and string
tokens record data about their match like rules do. This data can also be unpacked using parsed_identifier_get
, parsed_integer_get
, parsed_number_get
, or parsed_string_get
.
struct parsed_identifier {
struct source_range range;
const char *identifier;
size_t length;
};
struct parsed_integer {
struct source_range range;
uint64_t integer;
};
struct parsed_number {
struct source_range range;
double number;
};
struct parsed_string {
struct source_range range;
const char *string;
size_t length;
bool has_escapes;
};
struct parsed_identifier parsed_identifier_get(struct owl_ref);
struct parsed_integer parsed_integer_get(struct owl_ref);
struct parsed_number parsed_number_get(struct owl_ref);
struct parsed_string parsed_string_get(struct owl_ref);
If has_escapes
is true, the string data is owned by the owl_tree
—otherwise, it’s a direct reference to the parsed text.
If you need more than the built-in identifier
, integer
, number
, and string
token types, you can define your own token types using .token
.
If a grammar has any of these user-defined tokens, owl_tree_create_with_options
accepts two extra parameters in its owl_tree_options
struct:
struct owl_tree_options {
// ...
owl_token_func_t tokenize;
void *tokenize_info;
};
The tokenize
function matches a user-defined token. For example, if your grammar needs to match individual digits:
.token digit '7' '3'
…then you might write this tokenize function:
struct owl_token match_digit(const char *string, void *info)
{
if ('0' <= string[0] && string[0] <= '9') {
return (struct owl_token){
.type = OWL_TOKEN_DIGIT,
.length = 1,
.data.integer = string[0] - '0'
};
} else
return owl_token_no_match;
}
…and pass it into owl_tree_create_with_options
like this:
struct owl_tree_options options = {
.string = "2",
.tokenize = match_digit,
};
struct owl_tree *tree = owl_tree_create_with_options(options);
// ...
Each time Owl’s tokenizer steps forward, it calls the tokenize function (if it’s not NULL
), passing the remaining zero-terminated input as string
and the tokenize_info
field provided in owl_tree_options
as info
. The tokenize function then returns an owl_token
struct representing details about the match:
struct owl_token {
enum owl_token_type type;
size_t length;
union {
uint64_t integer;
double real;
void *pointer;
} data;
};
A return value of owl_token_no_match
(or any value with the length
field set to zero) indicates no match. Otherwise, the length
field indicates how long the match is, and type
indicates which type of token it is. Each user-defined token type (like .token digit
) corresponds to a value in the owl_token_type
enum (like OWL_TOKEN_DIGIT
). Tokens with type OWL_WHITESPACE
will be treated as whitespace instead of as a custom token.
If there’s a conflict between a user-defined token match and another token match, the longest match will be chosen, with ties going to keywords first, then user-defined token types, then other token types (like identifier
and so on).
User-defined tokens can be unpacked into parsed_...
structs just like rules and built-in tokens.
struct parsed_digit {
struct source_range range;
union {
uint64_t integer;
double real;
void *pointer;
} data;
};
Any data returned via the data
field in owl_token
will appear in the parsed_...
struct automatically.
The -p
option specifies a custom prefix for names in the generated parser code.
$ owl -c grammar.owl -p asdf -o parser.h
Instead of owl_ref
and parsed_integer_get
, parser.h
will use names like asdf_ref
and asdf_integer_get
.
ROOT
is the root rule name. RULE
ranges over all rules.
name | arguments | return value |
---|---|---|
owl_next |
An owl_ref . |
The next ref matching the corresponding field in the rule, or an empty ref. |
owl_refs_equal |
Two owl_ref values. |
true if the refs refer to the same match; false otherwise. |
owl_tree_create_from_file |
A FILE * to read from. The file is read into an intermediate string and may be closed immediately. |
A new tree. |
owl_tree_create_from_string |
A null-terminated string to parse. You retain ownership and must keep the string around until the tree is destroyed. | A new tree. |
owl_tree_create_with_options |
An owl_tree_options struct—use this to specify a custom tokenize function. |
A new tree. |
owl_tree_destroy |
An owl_tree * to destroy, freeing its resources back to the system. May be NULL . |
None. |
owl_tree_get_error |
An owl_tree * and an error_range out-parameter. The error range may be NULL . |
An error which interrupted parsing, or ERROR_NONE if there was no error. |
owl_tree_get_parsed_ROOT |
An owl_tree * . |
A parsed_ROOT struct corresponding to the root match. |
owl_tree_print |
An owl_tree * to print to stdout (typically for debugging purposes). Must not be NULL . |
None. |
owl_tree_root_ref |
An owl_tree * . |
The ref corresponding to the root match. |
parsed_identifier_get |
An owl_ref corresponding to an identifier match. |
A parsed_identifier struct corresponding to the identifier match. |
parsed_integer_get |
An owl_ref corresponding to an integer match. |
A parsed_integer struct corresponding to the integer match. |
parsed_number_get |
An owl_ref corresponding to a number match. |
A parsed_number struct corresponding to the number match. |
parsed_string_get |
An owl_ref corresponding to a string match. |
A parsed_string struct corresponding to the string match. |
parsed_RULE_get |
An owl_ref corresponding to a match for RULE . |
A parsed_RULE struct corresponding to the ref’s match. |
parsed_TOKEN_get |
An owl_ref corresponding to a match for the user-defined token TOKEN . |
A parsed_TOKEN struct corresponding to the ref’s match. |