Parser Generators
In the terminology of formal language theory, the general parsing problem is easy to state.1
Definition: The (context-free) parsing problem is, given a (context-free) grammar \(\mathcal G\) and a sentence \(S\) over the terminal symbols of \(\mathcal G\), to determine if \(\mathcal G\) recognizes \(S\).
There's some very interesting algorithmic theory regarding this problem, including the Cocke-Younger-Kasami (CYK) algorithm, a beautiful example of dynamic programming.
We will not be covering this. We won't be implementing parsers at all. Instead, we'll be specifying our PLs and then generating parsers for them, no coding necessary (at least, no OCaml programming).
Roughly, a parser generator is a program which, given an EBNF grammar (or a similar specification language), generates the code which implements a parser for that grammar.2 We'll be using a parser generator for OCaml called Menhir. This page is essentially a tutorial on Menhir.
But parser generators are like streaming services: more exist than we'll ever need, and people are still making new ones.3 ANTLR, for example, is a popular parser generator that can target Python (among other languages).
Remark. Before continuing I'd also like to point out that there are other ways of building parsers.
- Obviously you can hand-roll your own parser using any of the algorithms out there. Lua, for example, has what's called a recursive descent parser, implemented in-house; its authors argue this is more efficient and more portable than using a generator. (A tiny sketch of the hand-rolled style appears after this remark.)
- Elm (a very fun functional language for building web-apps which has interoperability with JavaScript) uses what are called parser combinators to build its parser in a modular fashion. This implementation is very slick, and has interesting connections to monads.4
This is just to say, if you're interested, there are a lot of avenues to explore. It's also to say that, like any interesting engineering problem, building a parser takes a bit of artistry. There's no best choice, and you'll have to make some choices based on your aesthetic inclinations as well as domain-specific considerations.
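To give a flavor of the hand-rolled approach mentioned above, here is a minimal sketch of a recursive descent parser for a tiny parenthesized-addition language. The token and expression types here are hypothetical and much smaller than anything we'll actually use; the point is only to show the style.

(* A minimal sketch of recursive descent, assuming the tiny hypothetical
   grammar:  <expr> ::= <num> | ( <expr> + <expr> )                      *)
type token = NUM of int | PLUS | LPAREN | RPAREN
type expr = Num of int | Add of expr * expr

(* Parse an expression from the front of the token list, returning the
   expression together with the tokens that remain. *)
let rec parse_expr (ts : token list) : (expr * token list) option =
  match ts with
  | NUM n :: rest -> Some (Num n, rest)
  | LPAREN :: rest ->
    (match parse_expr rest with
     | Some (e1, PLUS :: rest') ->
       (match parse_expr rest' with
        | Some (e2, RPAREN :: rest'') -> Some (Add (e1, e2), rest'')
        | _ -> None)
     | _ -> None)
  | _ -> None

(* "(1 + 2)" as tokens parses to Add (Num 1, Num 2) *)
let _ = parse_expr [LPAREN; NUM 1; PLUS; NUM 2; RPAREN]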
Lexing vs. Parsing
The last bit of preamble we need before we can start using Menhir: we need to distinguish between lexing and parsing. Recall from the chapter on the interpretation pipeline that there were two boxes which were necessary for parsing. The first was for lexical analysis and the second was for syntactic analysis. Lexing is, as you might expect, the first of these two boxes.
At the beginning, we're given a string. Lexing is the process of taking that string and grouping together the characters that correspond to a single unit in our PL. When we parse a program, we'd like to know if the 'l' we're looking at is part of a string, part of a variable name, or part of the keyword let. Grouping these beforehand will make parsing much simpler.
Aside: On a more technical note, strings tend to be stored contiguously in memory, which is inconvenient for doing things like recursion. I am of the opinion that if you ever need to process a string in a way that is more complex than reading each character one-by-one, you might as well lex it first. Structure your data for your problem, not the other way around.
We call the abstract units of a program tokens and we call their associated strings lexemes.
Example: Here is a short list of lexemes and tokens that will appear in the languages we consider. It is standard (at least in Menhir) to represent tokens with ALLCAPS.
| Lexeme  | Token      |
|---------|------------|
| "let"   | LET        |
| "if"    | IF         |
| "123"   | NUM 123    |
| "-123"  | NUM (-123) |
| "("     | LPAREN     |
| "->"    | ARROW      |
| "xy"    | VAR "xy"   |

We have to define tokens all the way down to the symbols we use, like the function arrow and the left parenthesis symbol. Also note that some tokens may not be defined as a single collection of characters, but as a collection of characters "with a particular shape." In Menhir (and many other parser generators) we use regular expressions to describe more complex tokens like numbers and variables.
In short, lexing is the process of converting a string to a token list, if possible.
val lex : string -> token list option
The type token is an ADT that we define which contains abstract representations of the units of our language. Lexing may fail (e.g., if there are unrecognized symbols), which is why it returns a token list option.
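For concreteness, a token type covering the lexemes in the table above might look like the following. This exact definition is only illustrative; when we use Menhir, the token type will be generated for us from our token declarations.

(* An illustrative token type for the lexemes listed in the table above *)
type token =
  | LET
  | IF
  | NUM of int
  | VAR of string
  | LPAREN
  | ARROW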
Parsing is the process of converting a list of tokens to an expr, if possible.
val parse : token list -> expr option
The type expr is an ADT that we define to represent expressions in our language. Parsing may fail (e.g., if the program is not well-formed), which is why it returns an expr option. When parsing fails, good interpreters will display a syntax error message, which might give more information about why parsing failed (in which case we'd have to return something more sophisticated than an expr option).
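Putting the two phases together, a front end built from these two functions might look like the following sketch (assuming the lex and parse signatures above).

(* A sketch: run the lexer, then the parser, failing if either phase fails *)
let parse_string (s : string) : expr option =
  match lex s with
  | None -> None
  | Some tokens -> parse tokens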
Terminology: We call the output of parsing a parse tree or a concrete syntax tree. Occasionally we will also use the term abstract syntax tree (AST) because the output of parsing represents the syntax of the program we want to evaluate abstractly as data. For more complex interpreters, we draw a stronger distinction between these two things. It is not necessarily the case that the structure of our syntactic program exactly matches the object we will ultimately evaluate. For now, we will use the term AST for the output of parsing.
Remark: There are a couple reasons for separating lexing and parsing. We won't dwell on it in this course, but in rough terms:
- It makes the parsing task easier. There are fewer low-level concerns we need to deal with for parsing.
- It makes this part of the interpretation pipeline more portable. Different operating systems have different ways of dealing with text (e.g., the CR vs. LF vs. CRLF issue that still plagues us today). If we separate the concerns, we can insulate the parser from these details.
Recall that, when we covered grammars, we said nothing about whitespace, or about complex kinds of objects like integers or variable names. This is because these are not parsing issues.5
Using Menhir
With that business out of the way, we can get to generating a parser for a simple expression language using Menhir. We always start with a grammar and a table which tells us the precedence of the operators appearing in the grammar.
This will be our working example, a language for arithmetic expressions with local variables.
The Grammar:
<prog>  ::= <expr>
<expr>  ::= let <var> = <expr> in <expr>
          | <expr1>
<bop>   ::= + | - | * | /
<expr1> ::= <expr1> <bop> <expr1>
          | <num>
          | <var>
          | ( <expr> )
<num>   ::= 0   ; DUMMY VALUE
<var>   ::= x   ; DUMMY VALUE

; In lexer.mll:
;
; let num = '-'? ['0'-'9']+
; let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*

Operators in order of increasing precedence:
| Operator | Associativity |
|----------|---------------|
| +, -     | left          |
| *, /     | left          |

Output type of parsing (the AST):
type bop = Add | Sub | Mul | Div

type expr =
  | Var of string
  | Num of int
  | Let of string * expr * expr
  | Bop of bop * expr * expr

type prog = expr
A couple of notes on everything above.
- In the above grammar, we use dummy terminal symbols for variables and numbers because, again, these are handled by the lexer, not the parser. We've also included the regular expressions used for NUM and VAR tokens in our lexer.
- Note that there are no parentheses in the expression type itself, even though they are in the language. By the time we're representing the syntax abstractly, the parentheses are represented implicitly (see the example below).
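For instance, assuming the expr type above is in scope, the source expression (1 + 2) * 3 would be represented with no explicit parentheses; the grouping lives in the shape of the tree.

(* "(1 + 2) * 3" as an AST: the parentheses are implicit in the nesting *)
let _ = Bop (Mul, Bop (Add, Num 1, Num 2), Num 3)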
In what follows, we describe the steps to generate an OCaml function which, given the contents of a file with:
let x = 2 in
let y = -3 in
let x_squared = x * x in
let y_squared = y * y in
x_squared + y_squared
returns Some e where e is the expression
Let ("x" , Num 2 , Let ( "y" , Num (-3) , Let ( "x_squared" , Mul (Var "x", Var "x") , Let ( "y_squared" , Mul (Var "y", Var "y") , Add (Var "x_squared", Var "y_squared") ) ) ) )
We will assume that you have a dune project set up (if not, you can run dune init project parser_example to create one).
Step 0: The Project File
If we want to use Menhir, we have to tell dune that. First, add the stanza
(using menhir 3.0)
to the end of the file dune-project.
Step 1: Setting Up
You need to create the following file structure in the lib directory of your project.
lib
├── dune
├── utils.ml
├── parser_example.ml
├── lexer.mll
└── parser.mly
dune:
(library (name parser_example))

(menhir (modules parser))

(ocamllex lexer)
utils.ml:
type bop = Add | Sub | Mul | Div

type expr =
  | Var of string
  | Num of int
  | Let of string * expr * expr
  | Bop of bop * expr * expr

type prog = expr
parser_example.ml:
let parse s =
  try Some (Parser.prog Lexer.read (Lexing.from_string s))
  with _ -> None
Remark: One of the benefits of using a parser generator is better error messages. We will not be using this feature in this course. If you were to use Menhir in a personal project, you would need to update the parse function to make use of the exceptions produced by Parser.prog.
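For the curious, a rough sketch of what that might look like is below (not needed in this course). Parser.prog raises Parser.Error on a syntax error, and an unrecognized character makes the generated lexer raise Failure; the exact message format here is made up.

(* A sketch only: report where parsing failed instead of returning None *)
let parse_with_error (s : string) =
  let lexbuf = Lexing.from_string s in
  try Ok (Parser.prog Lexer.read lexbuf) with
  | Parser.Error ->
    Error (Printf.sprintf "syntax error at character %d" (Lexing.lexeme_start lexbuf))
  | Failure msg -> Error ("lexing error: " ^ msg)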
The last two files are where the work happens. We can start with the following dummy files.
lexer.mll:
{ open Parser }

let whitespace = [' ' '\t' '\n' '\r']+

rule read = parse
  | whitespace { read lexbuf }
  | eof { EOF }
parser.mly:
%{ open Utils %}

%token EOF
%start <Utils.prog> prog

%%

prog:
  | EOF { Num 0 }
At this point, you should be able to dune build your project.
Step 2: Tokens
Now we need to populate our parser with the tokens of our PL. We also need to discuss the anatomy of the parser.mly file.

Our file begins with a header that can include arbitrary OCaml code. For now we just open the module Utils so that we can refer to the prog and expr types in this file:
%{ open Utils %}
Next follows a collection of declarations. This is where we will specify tokens, precedence, and associativity. We also declare our start symbol here (this is already done):
%token EOF
%start <Utils.prog> prog

%%
Our declarations end with the symbol %%. Everything that follows is a rule, i.e., a production rule of our grammar. We will cover the rules in the next section.
For now, we'll populate the top of our file with token declarations, one for each terminal symbol in our grammar. The syntax is
%token NAME
and it is typical to choose an ALLCAPS name. There is already a token that we are using to represent the end of a file, which will appear in every parser we build. In the case of complex tokens, like numbers and variables, we also specify the type of value the lexer will give us with the syntax
%token <type> NAME
With this, we can update parser.mly:
%{ open Utils %}

%token LET
%token EQUALS
%token IN
%token PLUS
%token MINUS
%token TIMES
%token DIVIDE
%token <int> NUM
%token <string> VAR
%token LPAREN
%token RPAREN
%token EOF

%start <Utils.prog> prog

%%

prog:
  | EOF { Num 0 }
Determining the tokens you need to create is a matter of looking through the list of rules in the grammar, picking out all the terminal symbols, and giving them names.
At this point, if you try to dune build, you will see many warnings about unused tokens. This is because we haven't used any of the tokens in the rules of the grammar. We'll handle this in the next section.
Step 3: Rules
Onto the most important part: describing the rules of the grammar. Menhir is designed so that its rules look a whole lot like EBNF production rules. Let's look at the first two rules.
In EBNF:
<prog> ::= <expr>
<expr> ::= let <var> = <expr> in <expr>
         | <expr1>
in Menhir syntax:
prog:
  | e = expr; EOF { e }

expr:
  | LET; x = var; EQUALS; e1 = expr; IN; e2 = expr { Let(x, e1, e2) }
  | e = expr1 { e }
If we squint, we can see the similarities. The main differences in the Menhir syntax:
- The starting symbol should be followed with the special EOF (end-of-file) token. This ensures that Menhir knows that there is nothing left to parse after parsing an expression.
- Symbols are separated by semicolons.6
- Terminal symbols are replaced with their associated tokens.
- Alternatives are followed by the values which should be returned if they are matched (the parts in the curly braces).
- Nonterminal symbols are given names which can be used in these returned values (like e1 = expr).
For example, we read the first alternative of the expr rule as:

- if I see the keyword let
- and then I see a variable x
- and then I see the symbol =
- and then I see a whole expression which parses to e1
- and then I see the keyword in
- and then I see a whole expression which parses to e2
- then the result of parsing should be Let(x, e1, e2).
Implementing the parser is a matter of translating these rules into Menhir syntax.
We also need to provide precedence and associativity declarations for our operators. The syntax is:
%left TOKEN
%right TOKEN
depending on whether TOKEN should be left associative or right associative. Precedence is handled by the order in which the declarations are presented: precedence increases down the list, just like in our precedence table.
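Concretely, once we add %left PLUS, MINUS above %left TIMES, DIVIDE (as in the file below), the generated parser groups expressions as follows. The values here are of the expr type from utils.ml.

(* "1 - 2 - 3": MINUS is left associative, so it groups as (1 - 2) - 3 *)
let _ = Bop (Sub, Bop (Sub, Num 1, Num 2), Num 3)

(* "1 + 2 * 3": TIMES has higher precedence than PLUS, so it groups as 1 + (2 * 3) *)
let _ = Bop (Add, Num 1, Bop (Mul, Num 2, Num 3))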
Further updating parser.mly:
%{ open Utils %}

%token LET
%token EQUALS
%token IN
%token PLUS
%token MINUS
%token TIMES
%token DIVIDE
%token <int> NUM
%token <string> VAR
%token LPAREN
%token RPAREN
%token EOF

%left PLUS, MINUS
%left TIMES, DIVIDE

%start <Utils.prog> prog

%%

prog:
  | e = expr; EOF { e }

expr:
  | LET; x = var; EQUALS; e1 = expr; IN; e2 = expr { Let(x, e1, e2) }
  | e = expr1 { e }

%inline bop:
  | PLUS { Add }
  | MINUS { Sub }
  | TIMES { Mul }
  | DIVIDE { Div }

expr1:
  | e1 = expr1; op = bop; e2 = expr1 { Bop(op, e1, e2) }
  | n = num { Num n }
  | v = var { Var v }
  | LPAREN; e = expr; RPAREN { e }

num:
  | n = NUM { n }

var:
  | x = VAR { x }
Remark: If we want to follow the grammar exactly, we have to inline the binary operators. Menhir is not smart enough to recognize that bop expands to operators with declared precedence. We use the keyword %inline to tell Menhir essentially to immediately substitute bop in the grammar itself. It's equivalent to having the rule:
expr1:
  | e1 = expr1; PLUS; e2 = expr1 { Bop(Add, e1, e2) }
  | e1 = expr1; MINUS; e2 = expr1 { Bop(Sub, e1, e2) }
  | e1 = expr1; TIMES; e2 = expr1 { Bop(Mul, e1, e2) }
  | e1 = expr1; DIVIDE; e2 = expr1 { Bop(Div, e1, e2) }
  | n = num { Num n }
  | v = var { Var v }
  | LPAREN; e = expr; RPAREN { e }

but with far less typing.
At this point, you should be able to dune build your project. That said, if you try to use Parser_example.parse, it won't do anything because we haven't specified the lexer. That is, we haven't said which strings correspond to which tokens.
Step 4: Lexing
The last thing we need to do: populate the lexer. We won't discuss the anatomy of the lexer.mll file in too much detail because it will stay fairly fixed for us in this course.
There is a header (it will always be the same):
{ open Parser }
There are identifiers, which are regular expressions for defining the structure of more complex tokens.7 These were included in the definition of the grammar above:
let whitespace = [' ' '\t' '\n' '\r']+
let num = '-'? ['0'-'9']+
let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*
There is a rule which describes which strings correspond to which tokens:
rule read = parse
  | "let" { LET }
  | "in" { IN }
  | "=" { EQUALS }
  | "+" { PLUS }
  | "-" { MINUS }
  | "*" { TIMES }
  | "/" { DIVIDE }
  | num { NUM (int_of_string (Lexing.lexeme lexbuf)) }
  | var { VAR (Lexing.lexeme lexbuf) }
  | "(" { LPAREN }
  | ")" { RPAREN }
  | whitespace { read lexbuf }
  | eof { EOF }
Without going into too much detail, here's what's going on with num and var: the expression Lexing.lexeme lexbuf is the lexeme (i.e., the string) which corresponds to the complex token being read. For example, if "123 abc" is given, then Lexing.lexeme lexbuf is "123" when "123" is grouped by the lexer. We want to give the parser a number, not a string, so we convert it with int_of_string.
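As a quick sanity check of the conversion itself (just ordinary OCaml, nothing Menhir-specific):

(* int_of_string turns the matched lexeme into the int carried by NUM *)
let _ = int_of_string "123"    (* 123 *)
let _ = int_of_string "-123"   (* -123, so negative literals work too *)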
Altogether, the updated lexer.mll:
{ open Parser }

let whitespace = [' ' '\t' '\n' '\r']+
let num = '-'? ['0'-'9']+
let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*

rule read = parse
  | "let" { LET }
  | "in" { IN }
  | "=" { EQUALS }
  | "+" { PLUS }
  | "-" { MINUS }
  | "*" { TIMES }
  | "/" { DIVIDE }
  | "(" { LPAREN }
  | ")" { RPAREN }
  | num { NUM (int_of_string (Lexing.lexeme lexbuf)) }
  | var { VAR (Lexing.lexeme lexbuf) }
  | whitespace { read lexbuf }
  | eof { EOF }
Important: The order in which you list lexemes matters. The ones that are listed first are matched first. This implies, for example, that "let" will never be read as a variable name, because it will always first be read as the keyword let. In practice, this basically means put your keywords and symbols first and everything else after.
At this point, you should be able to dune build your project.
Step 5: Running
If all goes well, you should be able to open dune utop and evaluate
parse "let x = 2 in let y = 3 in x + y"
and see that it has the value:
Some
 (Parser_example__.Utils.Let ("x", Parser_example__.Utils.Num 2,
   Parser_example__.Utils.Let ("y", Parser_example__.Utils.Num 3,
    Parser_example__.Utils.Bop (Parser_example__.Utils.Add,
     Parser_example__.Utils.Var "x", Parser_example__.Utils.Var "y"))))
Don't worry too much about the module names in the output. You'll use parse as your parser for the mini-projects.
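It's also worth checking the failure case. Since parse catches exceptions and returns an option, malformed input simply gives None. The exact inputs here are just examples.

(* In dune utop, with the library loaded: *)
let _ = Parser_example.parse "let x = in x"   (* None: syntax error *)
let _ = Parser_example.parse "2 + @ 3"        (* None: '@' is not a recognized symbol *)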
Closing Remarks
- Menhir is its own beast; there's a lot we're not covering. You may occasionally have to look at the documentation to see how things work (gasp). The hope is that this is a starting point from which you can troubleshoot.
- In general, you're going to want to work more incrementally than we've done here. Since you'll be given the specification, it's maybe fine for this course to sprint to the end and build the parser in one shot, but if you're worried about making mistakes, you should add one new token and one new rule at a time, and test frequently.
- You will have the luxury of being given unambiguous grammars. In the case that you try to generate a parser for an ambiguous grammar, Menhir will warn you. So, if you're getting strange warnings about ambiguity or shifting, make sure to double check all your rules (it shouldn't be happening).
Footnotes:
1. To be clear, this is a definition of parsing used for theoretical purposes. It ignores many concerns about parsing that are incredibly important when it comes to practical parsing, e.g., error messages and portability.
2. It's worth taking a moment to appreciate how cool this is. And also to note that this is really an example of compilation, i.e., of taking a high-level language (EBNF specifications) and translating it into a low-level language (OCaml).
3. See the Wikipedia page on comparing parser generators for more details.
4. We used to teach parser combinators in this course, but have since decided to spare you.
5. As usual, this is not strictly true, but I hope y'all can accept this fairly innocuous pedagogically motivated lie.
6. Actually this isn't necessary, but I'd recommend doing it for a while to get used to the syntax.
7. Since we likely won't have time to cover regular expressions, we'll give you these identifiers in a specification.