Parser Generators
In the terminology of formal language theory, the general parsing problem is easy to state.1
Definition. The (context-free) parsing problem is: given a (context-free) grammar \(\mathcal G\) and a sentence \(S\) over the terminal symbols of \(\mathcal G\), determine if \(\mathcal G\) recognizes \(S\).
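For example, consider the toy grammar below: the parsing problem asks, for instance, whether this grammar recognizes the sentence `aabb` (it does, as `a (ab) b`) or the sentence `aab` (it does not).

```
<s> ::= a <s> b | ab
```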
There's some very interesting algorithmic theory regarding this problem, including the Cocke-Younger-Kasami (CYK) algorithm, a beautiful example of dynamic programming.
We will not be covering this. We won't be implementing parsers at all. Instead, we'll be specifying our programming languages and then generating parsers for them, no coding necessary (at least, no OCaml programming).
In broad strokes, a parser generator is a program which, given an EBNF grammar (or a similar specification language), generates the code which implements a parser for that grammar.2 We'll be using a parser generator for OCaml called Menhir. This page is essentially a tutorial on using Menhir.
But parser generators are like streaming services: there are more of them than we'll ever need, and people are still making new ones.3 ANTLR, for example, is a popular parser generator which can target several languages, including Python.
Remark. Before continuing I'd also like to point out that there are other ways of building parsers.
- Obviously, you can hand-roll your own parser using any of the algorithms out there. Lua has what's called a recursive descent parser which was implemented in-house. Its authors argue it's more efficient and more portable than using a generator.
- Elm (a very fun functional language for building web-apps which has interoperability with JavaScript) uses what are called parser combinators to build its parser in a modular fashion. This implementation is very slick, and has interesting connections to monads.4
This is just to say, if you're interested, there are a lot of avenues to explore. It's also to say that, like any interesting engineering problem, building a parser takes a bit of artistry. There's no best choice, and you'll have to make some choices based on your aesthetic inclinations as well as domain-specific considerations (also your gut).
Lexing vs. Parsing
The last bit of preamble we need before we can start using Menhir: we need to distinguish between lexing and parsing. Recall from the chapter on the interpretation pipeline that there were two boxes which were necessary for parsing. The first was for lexical analysis and the second was for syntactic analysis. Lexing is, as you might expect, the first of these two boxes.
At the beginning, we're given a string. Lexing is the process of taking that string and grouping together the characters that correspond to a single unit in our programming language. When we parse a program, we'd like to know if the `l` we're looking at is part of a string, part of a variable name, or part of the keyword `let`. Grouping these beforehand will make parsing much simpler.
We call the abstract units of a program tokens and we call their associated strings lexemes.
Example. Here is a short list of lexemes and tokens that will appear in the languages we consider. It is standard (at least in Menhir) to represent tokens with ALLCAPS.
| Lexeme | Token |
|---|---|
| `"let"` | `LET` |
| `"if"` | `IF` |
| `"123"` | `NUM 123` |
| `"-123"` | `NUM (-123)` |
| `"("` | `LPAREN` |
| `"->"` | `ARROW` |
| `"xy"` | `VAR "xy"` |

We have to define tokens all the way down to the symbols we use, like the function arrow and the left parenthesis symbol. Also note that some tokens may not be defined as a single collection of characters, but as a collection of characters "with a particular shape." In Menhir (and many other parser generators) we use regular expressions to describe more complex tokens.
In short, lexing is the process of converting a `string` to a `token list`, if possible:

```ocaml
val lex : string -> token list option
```
The type `token` is an ADT that we define which contains abstract representations of the units of our language. This may fail (e.g., if there are unrecognized symbols), which is why it returns a `token list option`.
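As a concrete (hypothetical) sketch, a token type covering the lexemes in the table above might look like the following; when we use Menhir, this type will actually be generated for us from our token declarations.

```ocaml
type token =
  | LET              (* "let" *)
  | IF               (* "if" *)
  | NUM of int       (* e.g., "123", "-123" *)
  | VAR of string    (* e.g., "xy" *)
  | LPAREN           (* "(" *)
  | ARROW            (* "->" *)
  | EOF              (* end of input *)
```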
Parsing is the process of converting a list of tokens to an `expr`, if possible:

```ocaml
val parse : token list -> expr option
```
The type `expr` is an ADT that we define which we use to represent expressions in our language. This may fail (e.g., if the program is not well-formed), which is why it returns an `expr option`. When parsing fails, good interpreters will display a syntax error message, which might give you more information about why parsing fails (in which case we'd have to return something more sophisticated than an `expr option`).
Terminology. We call the output of parsing a parse tree or a concrete syntax tree. Occasionally we will also use the term abstract syntax tree (AST) because the output of parsing represents the syntax of the program we want to evaluate abstractly as data. For more complex interpreters, we draw a stronger distinction between these two things. It is not necessarily the case that the structure of our syntactic program exactly matches the object we will ultimately evaluate. For now, we will use the term AST for the output of parsing.
Remark. There are a couple of reasons for separating these two processes; we won't dwell on them in this course. But, in rough terms:
- It makes the parsing task easier. There are fewer low-level concerns we need to deal with for parsing.
- It makes this part of the interpretation pipeline more portable. Different operating systems have different ways of dealing with text (e.g., the CR vs. LF vs. CRLF issue that still plagues us today). If we separate the concerns, we can insulate the parser from these details.
Recall that, when we covered grammars, we said nothing about whitespace, or about complex kinds of objects like integers or variable names. This is because these are not parsing issues.5
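For example, the two strings below differ only in whitespace, yet (in a sketch using the hypothetical token type from earlier) they lex to the same token list, so the parser never sees the difference:

```ocaml
(* Both "f->x" and "f  ->  x" lex to: *)
let toks = [VAR "f"; ARROW; VAR "x"; EOF]
```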
Using Menhir
With that business out of the way, we can get to generating a parser for a simple expression language using Menhir. We always start with a grammar, and a table which tells us the precedence of the operators represented in the grammar.
This will be our working example, a language for arithmetic expressions with local variables.
The grammar:
```
<prog>  ::= <expr>
<expr>  ::= let <var> = <expr> in <expr>
          | <expr1>
<bop>   ::= + | - | * | /
<expr1> ::= <expr1> <bop> <expr1>
          | <num>
          | <var>
          | ( <expr> )
<num>   ::= 0   ; DUMMY VALUE
<var>   ::= x   ; DUMMY VALUE

; In lex.mll:
;
; let num = '-'? ['0'-'9']+
; let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*
```

Operators in order of increasing precedence:
| Operator | Associativity |
|---|---|
| `+`, `-` | left |
| `*`, `/` | left |

Output type of parsing (the AST):
```ocaml
type bop = Add | Sub | Mul | Div

type expr =
  | Var of string
  | Num of int
  | Let of string * expr * expr
  | Bop of bop * expr * expr

type prog = expr
```
A couple of notes on everything above.
- In the above grammar, we use dummy terminal symbols for variables and numbers because, again, these are handled by the lexer, not the parser. We've also included the regular expressions used for `NUM` and `VAR` tokens in our lexer.
- Note that there are no parentheses in the expression type itself, even though they are in the language. By the time we're representing the syntax abstractly, the parentheses are represented implicitly (illustrated below).
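To illustrate the second point, here is a sketch of how two strings that differ only in their grouping would be represented using the types above; there is no constructor for parentheses, only structure:

```ocaml
(* "1 + 2 * 3" parses to: *)
let e1 = Bop (Add, Num 1, Bop (Mul, Num 2, Num 3))

(* "(1 + 2) * 3" parses to: *)
let e2 = Bop (Mul, Bop (Add, Num 1, Num 2), Num 3)
```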
In what follows, we describe the steps to generate an OCaml function which, given the contents of a file with:
```
let x = 2 in
let y = -3 in
let x_squared = x * x in
let y_squared = y * y in
x_squared + y_squared
```
returns `Some e` where `e` is the expression
```ocaml
Let ( "x"
    , Num 2
    , Let ( "y"
          , Num (-3)
          , Let ( "x_squared"
                , Mul (Var "x", Var "x")
                , Let ( "y_squared"
                      , Mul (Var "y", Var "y")
                      , Add (Var "x_squared", Var "y_squared") ) ) ) )
```
We will assume that you have a dune project set up (if not, you can run `dune init project parser_example` to create one). We will also assume (as we do in the projects) that you have access to a library called `Utils`, which contains the AST for the language (among other things).
Step 0: The Project File
If we want to use Menhir, we have to tell dune that. First, add the stanza `(using menhir 2.1)` to the end of the file `dune-project`.
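For reference, a minimal `dune-project` might then look like the following (the `lang dune` version is just a plausible example; keep whatever version your project already declares):

```
(lang dune 3.0)
(using menhir 2.1)
```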
Step 1: Setting Up
You need to create four files with the following file structure in the `lib` directory of your project.
```
lib
└── parser
    ├── dune
    ├── lex.mll
    ├── my_parser.ml
    └── par.mly
```
We use the name `My_parser` because the module `Parser` already exists in the standard library. For us, the `dune` file and the `my_parser.ml` file are fixed.

`dune`:

```
(library
 (name my_parser)
 (libraries utils))

(menhir
 (modules par))

(ocamllex lex)
```
`my_parser.ml`:

```ocaml
(* Par.prog is the entry point generated by Menhir from par.mly, and
   Lex.read is the lexing rule generated from lex.mll. *)
let parse s =
  try Some (Par.prog Lex.read (Lexing.from_string s))
  with _ -> None
```
Remark. One of the benefits of using a parser generator is better error messages. We will not be using this feature in this course. If you were to use Menhir in a personal project, you would need to update the `parse` function to make use of the exceptions produced by `Par.prog`.
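As a rough sketch of what that could look like (assuming the module names `Par` and `Lex` from this setup; `Error` is the exception a Menhir-generated parser raises on a syntax error):

```ocaml
(* A sketch only: report where parsing failed using the lexer's position.
   For accurate line numbers, the lexer would also need to call
   Lexing.new_line whenever it consumes a newline. *)
let parse_with_error_message s =
  let lexbuf = Lexing.from_string s in
  try Ok (Par.prog Lex.read lexbuf)
  with Par.Error ->
    let p = lexbuf.Lexing.lex_curr_p in
    Error
      (Printf.sprintf "Syntax error at line %d, column %d"
         p.Lexing.pos_lnum
         (p.Lexing.pos_cnum - p.Lexing.pos_bol))
```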
The last two files are where the work happens. We can start with the following dummy files.
`lex.mll`:

```
{ open Par }

let whitespace = [' ' '\t' '\n' '\r']+

rule read =
  parse
  | whitespace { read lexbuf }
  | eof { EOF }
```
`par.mly`:

```
%{
open Utils
%}

%token EOF
%start <Ast.prog> prog

%%

prog:
  | EOF { Num 0 }
```
At this point, you should be able to `dune build` your project.
Note. We will presume that the abstract representation (in this case, `prog` as above) we want to target is in a file `lib/utils/utils.ml`. The type `prog` is something you are either given, or which you should have determined before you started building a parser.6
Step 2: Tokens
Now we need to populate our parser with the tokens of our language. We also need to discuss the anatomy of the `par.mly` file.
Our file begins with a header that can include arbitrary OCaml code. For now, we just open the module `Ast` so that we can refer to the `expr` type in this file:

```
%{
open Ast
%}
```
Next follows a collection of declarations. This is where we will specify tokens and precedence. We also declare our start production rule here (this is already done):
```
%token EOF
%start <Ast.prog> prog

%%
```
Our declarations end with the symbol `%%`. Everything after that is the list of rules, i.e., the production rules of our grammar. We will cover these in the next section.
For now, we'll populate the top of our file with token declarations, one for each terminal symbol in our grammar. The syntax is `%token NAME`, and it is typical to choose an ALLCAPS name. There is already a token that we use to represent the end of a file, which will appear in every parser we build. In the case of complex tokens, like numbers and variables, we also specify the type of thing the lexer will give us with the syntax `%token <type> NAME`.
Updated `par.mly`:

```
%{
open Ast
%}

%token LET
%token EQUALS
%token IN
%token PLUS
%token MINUS
%token TIMES
%token DIVIDE
%token <int> NUM
%token <string> VAR
%token LPAREN
%token RPAREN
%token EOF

%start <Ast.prog> prog

%%

prog:
  | EOF { Num 0 }
```
Determining the tokens you need to create is a matter of looking through the list of rules in the grammar, picking out all the terminal symbols and giving them names.
At this point, if you try to `dune build`, you will see many warnings about unused tokens. This is because we haven't used any of the tokens in the rules of the grammar. We'll handle this in the next section.
Step 3: Rules
On to the most important part: describing the rules of the grammar. Menhir is designed so that its rules look a whole lot like EBNF production rules. Let's look at the first two rules.
In EBNF:
```
<prog> ::= <expr>
<expr> ::= let <var> = <expr> in <expr>
         | <expr1>
```
In Menhir syntax:

```
prog:
  | e = expr; EOF { e }

expr:
  | LET; x = VAR; EQUALS; e1 = expr; IN; e2 = expr { Let(x, e1, e2) }
  | e = expr1 { e }
```
If we squint, we can see the similarities. The main differences in the Menhir syntax:
- The starting symbol should be followed by the special `EOF` (end-of-file) token. This ensures that Menhir knows there is nothing left to parse after parsing an expression.
- Symbols are separated by semicolons.7
- Terminal symbols are replaced with their associated tokens.
- Alternatives are followed by the values which should be returned if they are matched (the parts in the curly braces).
- Nonterminal symbols are given names which can be used in these returned values (like `e1 = expr`).
For example, we read the first alternative of the `expr` rule as:
- if I see the keyword `let`
- and then I see a variable `x`
- and then I see the symbol `=`
- and then I see a whole expression which parses to `e1`
- and then I see the keyword `in`
- and then I see a whole expression which parses to `e2`
- then the result of parsing should be `Let(x, e1, e2)`.
Implementing the parser is a matter of translating these rules into Menhir syntax.
We also need to provide precedence and associativity declarations for our operators. The syntax is `%left TOKEN` or `%right TOKEN`, depending on whether `TOKEN` should be left associative or right associative. Precedence is determined by the order in which the declarations appear: later declarations bind tighter (just like in our precedence table, which was given in order of increasing precedence).
Updated `par.mly`:

```
%{
open Ast
%}

%token LET
%token EQUALS
%token IN
%token PLUS
%token MINUS
%token TIMES
%token DIVIDE
%token <int> NUM
%token <string> VAR
%token LPAREN
%token RPAREN
%token EOF

%left PLUS MINUS
%left TIMES DIVIDE

%start <Ast.prog> prog

%%

prog:
  | e = expr; EOF { e }

expr:
  | LET; x = VAR; EQUALS; e1 = expr; IN; e2 = expr { Let(x, e1, e2) }
  | e = expr1 { e }

%inline bop:
  | PLUS { Add }
  | MINUS { Sub }
  | TIMES { Mul }
  | DIVIDE { Div }

expr1:
  | e1 = expr1; op = bop; e2 = expr1 { Bop(op, e1, e2) }
  | e = num { e }
  | e = var { e }
  | LPAREN; e = expr; RPAREN { e }

num:
  | n = NUM { Num n }

var:
  | x = VAR { Var x }
```
Remark. If we want to follow the grammar exactly, we have to use what is called inlining for the binary operators. Menhir is not smart enough to recognize that `bop` expands to operators with declared precedence. We use the keyword `%inline` to tell Menhir, essentially, to substitute `bop` immediately in the grammar itself. It's equivalent to having the rule:
```
expr1:
  | e1 = expr1; PLUS; e2 = expr1 { Bop(Add, e1, e2) }
  | e1 = expr1; MINUS; e2 = expr1 { Bop(Sub, e1, e2) }
  | e1 = expr1; TIMES; e2 = expr1 { Bop(Mul, e1, e2) }
  | e1 = expr1; DIVIDE; e2 = expr1 { Bop(Div, e1, e2) }
  | e = num { e }
  | e = var { e }
  | LPAREN; e = expr; RPAREN { e }
```

but with far less typing.
At this point, you should be able to `dune build` your project. That said, if you try to use it, it won't do anything, because we haven't specified the lexer: we haven't said which strings correspond to which tokens. We will deal with this in the next section.
Step 4: Lexing
The last thing we need to do: populate the lexer. We won't discuss the anatomy of the `lex.mll` file in too much detail because it will stay fairly fixed for us in this course.
There is a header (it will always be the same):
```
{ open Par }
```
There are identifiers, which are regular expressions for defining the structure of more complex tokens.8 These were included in the definition of the grammar above:
```
let whitespace = [' ' '\t' '\n' '\r']+
let num = '-'? ['0'-'9']+
let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*
```
There is a rule which describes which strings correspond to which tokens:
```
rule read =
  parse
  | "let" { LET }
  | "in" { IN }
  | "=" { EQUALS }
  | "+" { PLUS }
  | "-" { MINUS }
  | "*" { TIMES }
  | "/" { DIVIDE }
  | num { NUM (int_of_string (Lexing.lexeme lexbuf)) }
  | var { VAR (Lexing.lexeme lexbuf) }
  | "(" { LPAREN }
  | ")" { RPAREN }
  | whitespace { read lexbuf }
  | eof { EOF }
```
Without going into too much detail, here's what's going on with `num` and `var`: the expression `Lexing.lexeme lexbuf` is the lexeme (i.e., the string) corresponding to the complex token being read. For example, if `"123 abc"` is given, then `Lexing.lexeme lexbuf` is `"123"` at the point the lexer groups `"123"`. We want to give the parser a number, not a string, so we convert it with `int_of_string`.
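If you want to see this in action once everything builds, here is a small hypothetical test driver you could try in utop; it repeatedly calls the generated `read` rule until it reaches `EOF`, collecting the tokens:

```ocaml
open Par  (* the token type (LET, NUM, VAR, ...) lives in the generated Par module *)

(* Run the lexer to completion on a string, collecting the tokens it produces. *)
let tokens_of_string s =
  let lexbuf = Lexing.from_string s in
  let rec go acc =
    match Lex.read lexbuf with
    | EOF -> List.rev (EOF :: acc)
    | t -> go (t :: acc)
  in
  go []

(* tokens_of_string "123 abc" = [NUM 123; VAR "abc"; EOF] *)
```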
All together, the updated `lex.mll`:

```
{ open Par }

let whitespace = [' ' '\t' '\n' '\r']+
let num = '-'? ['0'-'9']+
let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*

rule read =
  parse
  | "let" { LET }
  | "in" { IN }
  | "=" { EQUALS }
  | "+" { PLUS }
  | "-" { MINUS }
  | "*" { TIMES }
  | "/" { DIVIDE }
  | num { NUM (int_of_string (Lexing.lexeme lexbuf)) }
  | var { VAR (Lexing.lexeme lexbuf) }
  | "(" { LPAREN }
  | ")" { RPAREN }
  | whitespace { read lexbuf }
  | eof { EOF }
```
At this point, you should be able to `dune build` your project.
Step 5: Running
If all goes well, you should be able to run `dune utop` and evaluate

```ocaml
My_parser.parse "let x = 2 in let y = 3 in x + y"
```

and see that it has the value

```ocaml
Some (Let ("x", Num 2, Let ("y", Num 3, Add (Var "x", Var "y"))))
```
You'll use `My_parser.parse` as your parser for the mini-projects.
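For example, a sketch of typical usage:

```ocaml
let check s =
  match My_parser.parse s with
  | Some _ -> "parsed successfully"
  | None -> "syntax error"

(* check "let x = 2 in x" = "parsed successfully" *)
(* check "let x = in x"   = "syntax error"        *)
```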
Closing Remarks
- Menhir is its own beast; there's a lot we're not covering. You may occasionally have to look at the documentation to see how things work (gasp). The hope is that this is a starting point from which you can troubleshoot.
- In general, you're going to want to work more incrementally than we've done here. Since you'll be given the specification, it's maybe fine for this course to try to sprint to the end and build a parser in one shot, but if you're worried about making mistakes, you should include one new token at a time and one new rule at a time, and test frequently.
- You will have the luxury of being given unambiguous grammars. If you try to generate a parser for an ambiguous grammar, Menhir will warn you. So, if you're getting strange warnings about ambiguity or shift/reduce conflicts, make sure to double-check all your rules (it shouldn't be happening).
Footnotes:
1. To be clear, this is a definition of parsing used for theoretical purposes. It ignores many concerns about parsing that are incredibly important when it comes to practical parsing, e.g., error messages and portability.
2. It's worth taking a moment to appreciate how cool this is. And also to note that this is really an example of compilation, i.e., of taking a high-level language (EBNF specifications) and translating it into a low-level language (OCaml).
3. See the Wikipedia page on comparing parser generators for more details.
4. We used to teach parser combinators in this course, but have since decided to spare you all.
5. As usual, this is not strictly true, but I hope y'all can accept this fairly innocuous pedagogically motivated lie.
6. It's generally not a good idea to try to build a parser before you know what abstract representation you're targeting.
7. Actually this isn't necessary, but I'd recommend doing it for a while to get used to the syntax.
8. Since we likely won't have time to cover regular expressions, we'll give you these identifiers in a specification.