Parser Generators
In the terminology of formal language theory, the general parsing problem is easy to state.1
Definition. The (context-free) parsing problem is: given a (context-free) grammar \(\mathcal G\) and a sentence \(S\) over the terminal symbols of \(\mathcal G\), determine if \(\mathcal G\) recognizes \(S\).
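For example, consider the toy grammar below: the parsing problem asks, for instance, whether this grammar recognizes the sentence `aabb` (it does, as `a (ab) b`) or the sentence `aab` (it does not).

```
<s> ::= a <s> b | ab
```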
There's some very interesting algorithmic theory regarding this problem, including the Cocke-Younger-Kasami (CYK) algorithm, a beautiful example of dynamic programming.
We will not be covering this. We won't be implementing parsers at all. Instead, we'll be specifying our programming languages and then generating parsers for them, no coding necessary (at least, no OCaml programming).
In broad strokes, a parser generator is a program which, given an EBNF grammar (or a similar specification language), generates the code which implements a parser for that grammar.2 We'll be using a parser generator for OCaml called Menhir. This page is essentially a tutorial on using Menhir.
But parser generators are like streaming services: there are more of them than we'll ever need, and people are still making new ones.3 ANTLR, for example, is a popular parser generator which can target several languages, including Python.
Remark. Before continuing I'd also like to point out that there are other ways of building parsers.
- Obviously, you can hand-roll your own parser using any of the algorithms out there. Lua has what's called a recursive descent parser which was implemented in-house. Its authors argue it's more efficient and more portable than using a generator.
- Elm (a very fun functional language for building web-apps which has interoperability with JavaScript) uses what are called parser combinators to build its parser in a modular fashion. This implementation is very slick, and has interesting connections to monads.4
This is just to say, if you're interested, there are a lot of avenues to explore. It's also to say that, like any interesting engineering problem, building a parser takes a bit of artistry. There's no best choice, and you'll have to make some choices based on your aesthetic inclinations as well as domain-specific considerations (also your gut).
Lexing vs. Parsing
The last bit of preamble we need before we can start using Menhir: we need to distinguish between lexing and parsing. Recall from the chapter on the interpretation pipeline that there were two boxes which were necessary for parsing. The first was for lexical analysis and the second was for syntactic analysis. Lexing is, as you might expect, the first of these two boxes.
At the beginning, we're given a string. Lexing is the process of taking that string and grouping together the characters that correspond to a single unit in our programming language. When we parse a program, we'd like to know if the `l` we're looking at is part of a string, part of a variable name, or part of the keyword `let`. Grouping these beforehand will make parsing much simpler.
We call the abstract units of a program tokens and we call their associated strings lexemes.
Example. Here is a short list of lexemes and tokens that will appear in the languages we consider. It is standard (at least in Menhir) to represent tokens with ALLCAPS.
| Lexeme | Token |
|---|---|
| `"let"` | `LET` |
| `"if"` | `IF` |
| `"123"` | `NUM 123` |
| `"-123"` | `NUM (-123)` |
| `"("` | `LPAREN` |
| `"->"` | `ARROW` |
| `"xy"` | `VAR "xy"` |

We have to define tokens all the way down to the symbols we use, like the function arrow and the left parenthesis symbol. Also note that some tokens may not be defined as a single collection of characters, but as a collection of characters "with a particular shape." In Menhir (and many other parser generators) we use regular expressions to describe more complex tokens.
In short, lexing is the process of converting a `string` to a `token list`, if possible:

```ocaml
val lex : string -> token list option
```
The type `token` is an ADT that we define which contains abstract representations of the units of our language. This may fail (e.g., if there are unrecognized symbols), which is why it returns a `token list option`.
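As a concrete (hypothetical) sketch, a token type covering the lexemes in the table above might look like the following; when we use Menhir, this type will actually be generated for us from our token declarations.

```ocaml
type token =
  | LET              (* "let" *)
  | IF               (* "if" *)
  | NUM of int       (* e.g., "123", "-123" *)
  | VAR of string    (* e.g., "xy" *)
  | LPAREN           (* "(" *)
  | ARROW            (* "->" *)
  | EOF              (* end of input *)
```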
Parsing is the process of converting a list of tokens to an `expr`, if possible:

```ocaml
val parse : token list -> expr option
```
The type `expr` is an ADT that we define which we use to represent expressions in our language. This may fail (e.g., if the program is not well-formed), which is why it returns an `expr option`. When parsing fails, good interpreters will display a syntax error message, which might give you more information about why parsing fails (in which case we'd have to return something more sophisticated than an `expr option`).
Terminology. We call the output of parsing a parse tree or a concrete syntax tree. Occasionally we will also use the term abstract syntax tree (AST) because the output of parsing represents the syntax of the program we want to evaluate abstractly as data. For more complex interpreters, we draw a stronger distinction between these two things. It is not necessarily the case that the structure of our syntactic program exactly matches the object we will ultimately evaluate. For now, we will use the term AST for the output of parsing.
Remark. There are a couple of reasons for separating these two processes; we won't dwell on them in this course. But, in rough terms:
- It makes the parsing task easier. There are fewer low-level concerns we need to deal with for parsing.
- It makes this part of the interpretation pipeline more portable. Different operating systems have different ways of dealing with text (e.g., the CR vs. LF vs. CRLF issue that still plagues us today). If we separate the concerns, we can insulate the parser from these details.
Recall that, when we covered grammars, we said nothing about whitespace, or about complex kinds of objects like integers or variable names. This is because these are not parsing issues.5
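For example, the two strings below differ only in whitespace, yet (in a sketch using the hypothetical token type from earlier) they lex to the same token list, so the parser never sees the difference:

```ocaml
(* Both "f->x" and "f  ->  x" lex to: *)
let toks = [VAR "f"; ARROW; VAR "x"; EOF]
```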
Using Menhir
With that business out of the way, we can get to generating a parser for a simple expression language using Menhir. We always start with a grammar, and a table which tells us the precedence of the operators represented in the grammar.
This will be our working example, a language for arithmetic expressions with local variables.
The grammar:
```
<prog>  ::= <expr>
<expr>  ::= let <var> = <expr> in <expr>
          | <expr1>
<bop>   ::= + | - | * | /
<expr1> ::= <expr1> <bop> <expr1>
          | <num>
          | <var>
          | ( <expr> )
<num>   ::= 0   ; DUMMY VALUE
<var>   ::= x   ; DUMMY VALUE

; In lex.mll:
;
; let num = '-'? ['0'-'9']+
; let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*
```

Operators in order of increasing precedence:
| Operator | Associativity |
|---|---|
| `+`, `-` | left |
| `*`, `/` | left |

Output type of parsing (the AST):
```ocaml
type bop = Add | Sub | Mul | Div

type expr =
  | Var of string
  | Num of int
  | Let of string * expr * expr
  | Bop of bop * expr * expr

type prog = expr
```
A couple of notes on everything above.
- In the above grammar, we use dummy terminal symbols for variables and numbers because, again, these are handled by the lexer, not the parser. We've also included the regular expressions used for `NUM` and `VAR` tokens in our lexer.
- Note that there are no parentheses in the expression type itself, even though they are in the language. By the time we're representing the syntax abstractly, the parentheses are represented implicitly (illustrated below).
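To illustrate the second point, here is a sketch of how two strings that differ only in their grouping would be represented using the types above; there is no constructor for parentheses, only structure:

```ocaml
(* "1 + 2 * 3" parses to: *)
let e1 = Bop (Add, Num 1, Bop (Mul, Num 2, Num 3))

(* "(1 + 2) * 3" parses to: *)
let e2 = Bop (Mul, Bop (Add, Num 1, Num 2), Num 3)
```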
In what follows, we describe the steps to generate an OCaml function which, given the contents of a file with:
```
let x = 2 in
let y = -3 in
let x_squared = x * x in
let y_squared = y * y in
x_squared + y_squared
```
returns `Some e` where `e` is the expression
```ocaml
Let ( "x"
    , Num 2
    , Let ( "y"
          , Num (-3)
          , Let ( "x_squared"
                , Mul (Var "x", Var "x")
                , Let ( "y_squared"
                      , Mul (Var "y", Var "y")
                      , Add (Var "x_squared", Var "y_squared") ) ) ) )
```
We will assume that you have a dune project set up (if not, you can run `dune init project parser_example` to create one). We will also assume (as we do in the projects) that you have access to a library called `Utils`, which contains the AST for the language (among other things).
Step 0: The Project File
If we want to use Menhir, we have to tell dune that. First, add the stanza `(using menhir 2.1)` to the end of the file `dune-project`.
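For reference, a minimal `dune-project` might then look like the following (the `lang dune` version is just a plausible example; keep whatever version your project already declares):

```
(lang dune 3.0)
(using menhir 2.1)
```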
Step 1: Setting Up
You need to create four files with the following file structure in the `lib` directory of your project.
```
lib
└── parser
    ├── dune
    ├── lex.mll
    ├── my_parser.ml
    └── par.mly
```
We use the name `My_parser` because the module `Parser` already exists in the standard library. For us, the `dune` file and the `my_parser.ml` file are fixed.

`dune`:

```
(library
 (name my_parser)
 (libraries utils))

(menhir
 (modules par))

(ocamllex lex)
```
`my_parser.ml`:

```ocaml
(* Par.prog is the entry point generated by Menhir from par.mly, and
   Lex.read is the lexing rule generated from lex.mll. *)
let parse s =
  try Some (Par.prog Lex.read (Lexing.from_string s))
  with _ -> None
```
Remark. One of the benefits of using a parser generator is better error messages. We will not be using this feature in this course. If you were to use Menhir in a personal project, you would need to update the `parse` function to make use of the exceptions produced by `Par.prog`.
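As a rough sketch of what that could look like (assuming the module names `Par` and `Lex` from this setup; `Error` is the exception a Menhir-generated parser raises on a syntax error):

```ocaml
(* A sketch only: report where parsing failed using the lexer's position.
   For accurate line numbers, the lexer would also need to call
   Lexing.new_line whenever it consumes a newline. *)
let parse_with_error_message s =
  let lexbuf = Lexing.from_string s in
  try Ok (Par.prog Lex.read lexbuf)
  with Par.Error ->
    let p = lexbuf.Lexing.lex_curr_p in
    Error
      (Printf.sprintf "Syntax error at line %d, column %d"
         p.Lexing.pos_lnum
         (p.Lexing.pos_cnum - p.Lexing.pos_bol))
```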
The last two files are where the work happens. We can start with the following dummy files.
`lex.mll`:

```
{ open Par }

let whitespace = [' ' '\t' '\n' '\r']+

rule read =
  parse
  | whitespace { read lexbuf }
  | eof { EOF }
```
`par.mly`:

```
%{
open Utils
%}

%token EOF
%start <Ast.prog> prog

%%

prog:
  | EOF { Num 0 }
```
At this point, you should be able to `dune build` your project.
Note. We will presume that the abstract representation (in this case, `prog` as above) we want to target is in a file `lib/utils/utils.ml`. The type `prog` is something you are either given, or which you should have determined before you started building a parser.6
Step 2: Tokens
Now we need to populate our parser with the tokens of our language. We also need to discuss the anatomy of the `par.mly` file.
Our file begins with a header that can include arbitrary OCaml code. For now, we just open the module `Ast` so that we can refer to the `expr` type in this file:

```
%{
open Ast
%}
```
Next follows a collection of declarations. This is where we will specify tokens and precedence. We also declare our start production rule here (this is already done):
```
%token EOF
%start <Ast.prog> prog

%%
```
Our declarations end with the symbol `%%`. Everything after that is the list of rules, i.e., the production rules of our grammar. We will cover these in the next section.
For now, we'll populate the top of our file with token declarations, one for each terminal symbol in our grammar. The syntax is `%token NAME`, and it is typical to choose an ALLCAPS name. There is already a token that we use to represent the end of a file, which will appear in every parser we build. In the case of complex tokens, like numbers and variables, we also specify the type of thing the lexer will give us with the syntax `%token <type> NAME`.
Updated `par.mly`:

```
%{
open Ast
%}

%token LET
%token EQUALS
%token IN
%token PLUS
%token MINUS
%token TIMES
%token DIVIDE
%token <int> NUM
%token <string> VAR
%token LPAREN
%token RPAREN
%token EOF

%start <Ast.prog> prog

%%

prog:
  | EOF { Num 0 }
```
Determining the tokens you need to create is a matter of looking through the list of rules in the grammar, picking out all the terminal symbols and giving them names.
At this point, if you try to `dune build`, you will see many warnings about unused tokens. This is because we haven't used any of the tokens in the rules of the grammar. We'll handle this in the next section.
Step 3: Rules
On to the most important part: describing the rules of the grammar. Menhir is designed so that its rules look a whole lot like EBNF production rules. Let's look at the first two rules.
In EBNF:
```
<prog> ::= <expr>
<expr> ::= let <var> = <expr> in <expr>
         | <expr1>
```
In Menhir syntax:

```
prog:
  | e = expr; EOF { e }

expr:
  | LET; x = VAR; EQUALS; e1 = expr; IN; e2 = expr { Let(x, e1, e2) }
  | e = expr1 { e }
```
If we squint, we can see the similarities. The main differences in the Menhir syntax:
- The starting symbol should be followed by the special `EOF` (end-of-file) token. This ensures that Menhir knows there is nothing left to parse after parsing an expression.
- Symbols are separated by semicolons.7
- Terminal symbols are replaced with their associated tokens.
- Alternatives are followed by the values which should be returned if they are matched (the parts in the curly braces).
- Nonterminal symbols are given names which can be used in these returned values (like `e1 = expr`).
For example, we read the first alternative of the `expr` rule as:
- if I see the keyword `let`
- and then I see a variable `x`
- and then I see the symbol `=`
- and then I see a whole expression which parses to `e1`
- and then I see the keyword `in`
- and then I see a whole expression which parses to `e2`
- then the result of parsing should be `Let(x, e1, e2)`.
Implementing the parser is a matter of translating these rules into Menhir syntax.
We also need to provide precedence and associativity declarations for our operators. The syntax is `%left TOKEN` or `%right TOKEN`, depending on whether `TOKEN` should be left associative or right associative. Precedence is determined by the order in which the declarations appear: later declarations bind tighter (just like in our precedence table, which was given in order of increasing precedence).
Updated `par.mly`:

```
%{
open Ast
%}

%token LET
%token EQUALS
%token IN
%token PLUS
%token MINUS
%token TIMES
%token DIVIDE
%token <int> NUM
%token <string> VAR
%token LPAREN
%token RPAREN
%token EOF

%left PLUS MINUS
%left TIMES DIVIDE

%start <Ast.prog> prog

%%

prog:
  | e = expr; EOF { e }

expr:
  | LET; x = VAR; EQUALS; e1 = expr; IN; e2 = expr { Let(x, e1, e2) }
  | e = expr1 { e }

%inline bop:
  | PLUS { Add }
  | MINUS { Sub }
  | TIMES { Mul }
  | DIVIDE { Div }

expr1:
  | e1 = expr1; op = bop; e2 = expr1 { Bop(op, e1, e2) }
  | e = num { e }
  | e = var { e }
  | LPAREN; e = expr; RPAREN { e }

num:
  | n = NUM { Num n }

var:
  | x = VAR { Var x }
```
Remark. If we want to follow the grammar exactly, we have to use what is called inlining for the binary operators. Menhir is not smart enough to recognize that `bop` expands to operators with declared precedence. We use the keyword `%inline` to tell Menhir, essentially, to substitute `bop` immediately in the grammar itself. It's equivalent to having the rule:
```
expr1:
  | e1 = expr1; PLUS; e2 = expr1 { Bop(Add, e1, e2) }
  | e1 = expr1; MINUS; e2 = expr1 { Bop(Sub, e1, e2) }
  | e1 = expr1; TIMES; e2 = expr1 { Bop(Mul, e1, e2) }
  | e1 = expr1; DIVIDE; e2 = expr1 { Bop(Div, e1, e2) }
  | e = num { e }
  | e = var { e }
  | LPAREN; e = expr; RPAREN { e }
```

but with far less typing.
At this point, you should be able to `dune build` your project. That said, if you try to use it, it won't do anything, because we haven't specified the lexer: we haven't said which strings correspond to which tokens. We will deal with this in the next section.
Step 4: Lexing
The last thing we need to do: populate the lexer. We won't discuss the anatomy of the `lex.mll` file in too much detail because it will stay fairly fixed for us in this course.
There is a header (it will always be the same):
```
{ open Par }
```
There are identifiers, which are regular expressions for defining the structure of more complex tokens.8 These were included in the definition of the grammar above:
```
let whitespace = [' ' '\t' '\n' '\r']+
let num = '-'? ['0'-'9']+
let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*
```
There is a rule which describes which strings correspond to which tokens:
```
rule read =
  parse
  | "let" { LET }
  | "in" { IN }
  | "=" { EQUALS }
  | "+" { PLUS }
  | "-" { MINUS }
  | "*" { TIMES }
  | "/" { DIVIDE }
  | num { NUM (int_of_string (Lexing.lexeme lexbuf)) }
  | var { VAR (Lexing.lexeme lexbuf) }
  | "(" { LPAREN }
  | ")" { RPAREN }
  | whitespace { read lexbuf }
  | eof { EOF }
```
Without going into too much detail, here's what's going on with `num` and `var`: the expression `Lexing.lexeme lexbuf` is the lexeme (i.e., the string) corresponding to the complex token being read. For example, if `"123 abc"` is given, then `Lexing.lexeme lexbuf` is `"123"` at the point the lexer groups `"123"`. We want to give the parser a number, not a string, so we convert it with `int_of_string`.
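If you want to see this in action once everything builds, here is a small hypothetical test driver you could try in utop; it repeatedly calls the generated `read` rule until it reaches `EOF`, collecting the tokens:

```ocaml
open Par  (* the token type (LET, NUM, VAR, ...) lives in the generated Par module *)

(* Run the lexer to completion on a string, collecting the tokens it produces. *)
let tokens_of_string s =
  let lexbuf = Lexing.from_string s in
  let rec go acc =
    match Lex.read lexbuf with
    | EOF -> List.rev (EOF :: acc)
    | t -> go (t :: acc)
  in
  go []

(* tokens_of_string "123 abc" = [NUM 123; VAR "abc"; EOF] *)
```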
All together, the updated `lex.mll`:

```
{ open Par }

let whitespace = [' ' '\t' '\n' '\r']+
let num = '-'? ['0'-'9']+
let var = ['a'-'z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_' '\'']*

rule read =
  parse
  | "let" { LET }
  | "in" { IN }
  | "=" { EQUALS }
  | "+" { PLUS }
  | "-" { MINUS }
  | "*" { TIMES }
  | "/" { DIVIDE }
  | num { NUM (int_of_string (Lexing.lexeme lexbuf)) }
  | var { VAR (Lexing.lexeme lexbuf) }
  | "(" { LPAREN }
  | ")" { RPAREN }
  | whitespace { read lexbuf }
  | eof { EOF }
```
At this point, you should be able to `dune build` your project.
Step 5: Running
If all goes well, you should be able to run `dune utop` and evaluate

```ocaml
My_parser.parse "let x = 2 in let y = 3 in x + y"
```

and see that it has the value

```ocaml
Some (Let ("x", Num 2, Let ("y", Num 3, Add (Var "x", Var "y"))))
```
You'll use `My_parser.parse` as your parser for the mini-projects.
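For example, a sketch of typical usage:

```ocaml
let check s =
  match My_parser.parse s with
  | Some _ -> "parsed successfully"
  | None -> "syntax error"

(* check "let x = 2 in x" = "parsed successfully" *)
(* check "let x = in x"   = "syntax error"        *)
```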
Closing Remarks
- Menhir is its own beast; there's a lot we're not covering. You may occasionally have to look at the documentation to see how things work (gasp). The hope is that this is a starting point from which you can troubleshoot.
- In general, you're going to want to work more incrementally than we've done here. Since you'll be given the specification, it's maybe fine for this course to try to sprint to the end and build a parser in one shot, but if you're worried about making mistakes, you should include one new token at a time and one new rule at a time, and test frequently.
- You will have the luxury of being given unambiguous grammars. If you try to generate a parser for an ambiguous grammar, Menhir will warn you. So, if you're getting strange warnings about ambiguity or shift/reduce conflicts, make sure to double-check all your rules (it shouldn't be happening).
Footnotes:
1. To be clear, this is a definition of parsing used for theoretical purposes. It ignores many concerns about parsing that are incredibly important when it comes to practical parsing, e.g., error messages and portability.
2. It's worth taking a moment to appreciate how cool this is. And also to note that this is really an example of compilation, i.e., of taking a high-level language (EBNF specifications) and translating it into a low-level language (OCaml).
3. See the Wikipedia page on comparing parser generators for more details.
4. We used to teach parser combinators in this course, but have since decided to spare you all.
5. As usual, this is not strictly true, but I hope y'all can accept this fairly innocuous pedagogically motivated lie.
6. It's generally not a good idea to try to build a parser before you know what abstract representation you're targeting.
7. Actually this isn't necessary, but I'd recommend doing it for a while to get used to the syntax.
8. Since we likely won't have time to cover regular expressions, we'll give you these identifiers in a specification.