Formal Grammar
Before discussing parsing, we have to look at its mathematical counterpart: formal language theory. We won't put heavy emphasis on formal language theory in this course (as interesting as it is); rather, the most important things to get from these notes are:
- the terminology: you need to know what a grammar is, what precedence is, etc.;
- the ability to read a grammar and (in time) construct a parser from it using a parser generator.
In particular, the problem of designing grammars is a subtle and difficult one, and certainly one worth studying if you're interested. But, logistically speaking, in this course you will be given a grammar, from which you will generate a parser.
Introduction
Most of us are familiar with grammar in the context of natural language. In English class, we learn that we should use the article "an" instead of "a" if its corresponding noun starts with a vowel sound, and that what follows a semicolon should be an independent clause (i.e., should be able to stand on its own as a complete sentence). These are examples of English grammar rules.
Grammar, in broad strokes, refers to the rules which govern what constitutes a well-formed sentence in a given language, barring low-level syntactic concerns like spelling or whitespace. It's the concern of grammar to determine that
I taught the car in the refrigerator
makes grammatical sense and that
I car teach refrigerator in there
does not. It's not the concern of grammar to determine that the first sentence, though grammatical, has no reasonable interpretation in English (except, perhaps, in surrealist fiction).
Programming languages, being languages in their own right (albeit more stringent ones than natural languages), have their own grammars, i.e., rules for determining what counts as a well-formed program. Due to the grammatical precision expected of programming languages (we don't want too much variation in what we are allowed to write as a valid program), these tend to be called formal grammars.
Example. In OCaml, the program
let f x = x + 1
is well-formed, but the program
let f x = x 1 +
is not, because the rule for using the + operator is that its arguments appear to its left and to its right, i.e., it's an infix operator.1
As in the case of natural language, grammars for programming languages are not concerned with the meaning of programs, just their well-formedness. The program
let omega x = x x
is well-formed in OCaml, but it's not well-typed, since the argument x is expected to be a function of type 'a -> 'b as well as an argument of type 'a, an impossibility in the type system of OCaml.
If our goal is to interpret computer programs, then we have to understand formally, both theoretically and practically, the grammars describing well-formed programs. This means being able to read and interpret representations of formal grammars. The grammar of OCaml, for example, is given in its entirety in The OCaml Manual. After going through this chapter, you should be able to interpret the specification given there.
Placing this in the pipeline of interpretation, a grammar represents the target of parsing. As a reminder, a stream of tokens is parsed into a parse tree, a hierarchical structure which describes the way the program is formally composed. It's easier to determine the meaning of a program (i.e., to interpret it) given its hierarchical structure as opposed to its linear form as a stream of tokens.
Remark. This is another way of conceptualizing the role of grammar: it determines the hierarchical structure of a sentence. A sentence may be considered well-formed if it can be given a well-formed parse tree, e.g.
S
├──┐
NP VP
│  ├──────┐
N  V      NP
│  │      ├───────┐
│  │      NP      PP
│  │      ├───┐   ├──┐
│  │      A   N   P  NP
│  │      │   │   │  ├───┐
│  │      │   │   │  A   N
│  │      │   │   │  │   │
I  taught the car in the refrigerator.
It's not important that you know exactly what each of the abbreviations in the above tree stands for (this isn't a linguistics course, but we can make educated guesses); hopefully the structure aligns with your intuition about how the words in the sentence are grouped.
In this collection of notes on formal grammars, we will:
- define Backus-Naur Form specifications, a way of describing so-called context-free grammars, which we will use to present the grammars of programming languages;
- discuss ambiguity in grammar along with how to avoid it (and why);
- look at Extended BNF as a slightly more useful framework for specifying programming languages.
Backus-Naur Form
Backus-Naur Form (BNF) specifications are used to describe what are called context-free grammars. Context-free grammars form a class of formal grammars which are sufficiently expressive to capture the grammars of most programming languages. We will be using BNF specifications to describe the rules which determine well-formed programs in programming languages we aim to interpret.
First, a toy example/thought experiment. Consider the following English statement.
the cow jumped over the moon
Suppose we tried to break down the cognitive process of determining that this sentence is grammatical. We might first recognize that each word falls into a particular part of speech. We can represent this step of the process by replacing each word in the sentence with a symbol standing for its part of speech (the choice of symbols being influenced by what is to come).
<article> <noun> <verb> <prep> <article> <noun>
We then might recognize some familiar patterns: <article> <noun> captures the determination or quantification of an object, so we might mentally group these symbols (into what grammarians call nominal phrases or noun phrases) and represent them by a new symbol:
<noun-phrase> <verb> <prep> <noun-phrase>
Then we might recognize that a preposition followed by a noun phrase is also a single unit ("over the moon", "through the woods", and "behind the wall" are examples of prepositional phrases), so that the structure of the entire sentence may be represented as
<noun-phrase> <verb> <prep-phrase>
Then we might recognize that prepositional phrases can modify verbs, again creating a single unit (e.g., "ran to the car", "arose from bed"), leaving us with something like
<noun-phrase> <verb-phrase>
in which we should finally recognize the canonical structure of a well-formed sentence: a thing does a thing. A bit hand-wavy, but this accounts roughly for what we do when we judge that the above sentence is grammatical.
Putting these steps in reverse order (and starting with a single symbol <sentence>) we get something that looks like a proof that
the cow jumped over the moon
is a grammatical sentence.
<sentence>
<noun phrase> <verb phrase>
<noun phrase> <verb> <prep phrase>
<noun phrase> <verb> <prep> <noun phrase>
<article> <noun> <verb> <prep> <article> <noun>
the cow jumped over the moon
That is, we get a representation of our cognitive process. And if we squint, we can see something hierarchical, something that looks a bit like the parse tree in the introduction to this chapter.
S
├───────┐
NP      VP
├───┐   ├──────┐
A   N   V      PP
│   │   │      ├────┐
│   │   │      P    NP
│   │   │      │    ├───┐
│   │   │      │    A   N
│   │   │      │    │   │
the cow jumped over the moon
A formal grammar is meant to model this cognitive process of classifying a sentence as grammatical by verifying that it has the "right" hierarchical structure.
Definitions
When defining a formal grammar, we fix ourselves to a collection of symbols. These symbols are divided into two disjoint groups: the terminal symbols and the non-terminal symbols. We will always notate a non-terminal symbol by something of the form <non-term> (where we replace non-term with something more descriptive) and terminal symbols by sequences of (typically) alphanumeric symbols.
Remark. We almost never state outright what the underlying symbols of a grammar are. It should always be possible to determine what terminal and non-terminal symbols we are considering by looking at the BNF specification itself.
In the proof that the cow jumped over the moon was grammatical, we built a sequence of not-quite sentences, until the very last one, which was an actual sentence. We call these not-quite sentences sentential forms.
Definition. A sentential form is a sequence of symbols (terminal or non-terminal). A sentence is a sequence of terminal symbols.
Remark. We notate a sequence of symbols by white space separation. For example,
the dog jumped
is a sentence and
the <noun> jumped
is a sentential form. But it is important to note that this is just notation. If it helps, it may be useful to imagine a list
[the; <noun>; jumped]
when thinking about what a sentential form is.
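To make the notation concrete, here is one way we might model symbols and sentential forms in OCaml (a sketch of my own, not something the notes require; the type and constructor names are made up):

type symbol =
  | T of string   (* a terminal symbol, e.g., T "the" *)
  | N of string   (* a non-terminal symbol, e.g., N "noun" *)

type sentential_form = symbol list

(* the <noun> jumped, i.e., [the; <noun>; jumped] *)
let example : sentential_form = [ T "the"; N "noun"; T "jumped" ]

(* A sentence is a sentential form containing no non-terminal symbols. *)
let is_sentence (sf : sentential_form) : bool =
  List.for_all (function T _ -> true | N _ -> false) sf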
In the process of building sentential forms, we replaced non-terminal symbols with sentential forms, e.g., we replaced <noun phrase> with <article> <noun>. A grammar is determined by what replacements we are allowed to make.
Definition. A production rule is an equation of the form
<non-term> ::= SENTENTIAL-FORM
where the left-hand side of the ::= is a non-terminal symbol, and the right-hand side is a sentential form.
We read a production rule as: "the non-terminal symbol on the left-hand side can be replaced with the sentential form on the right-hand side." In a sense, production rules define the non-terminal symbols, e.g., a sentence is a noun phrase followed by a verb phrase.
Definition. A Backus-Naur Form (BNF) specification is a collection of production rules, together with a designated starting symbol.
We will always take the start symbol to be the left-hand side of the first rule appearing in a specification.
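Continuing the OCaml sketch from above (again, the names are my own), production rules and BNF specifications can be modeled directly:

(* A production rule <non-term> ::= SENTENTIAL-FORM. *)
type rule = { lhs : string; rhs : sentential_form }

(* A BNF specification: a designated start symbol plus a collection of rules. *)
type grammar = { start : string; rules : rule list }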
Example. The following is an example of a grammar which recognizes the sentence above.
<sentence> ::= <noun-phrase> <verb-phrase>
<verb-phrase> ::= <verb> <prep-phrase>
<verb-phrase> ::= <verb>
<prep-phrase> ::= <prep> <noun-phrase>
<noun-phrase> ::= <article> <noun>
<article> ::= the
<noun> ::= cow
<noun> ::= moon
<verb> ::= jumped
<prep> ::= over
The nonterminal symbol <sentence> is our starting symbol because it appears as the left-hand side of the first rule.
Note that a non-terminal symbol can have multiple associated production rules. This is common enough that we have special syntax for this.
Notation. We write
<non-term> ::= SENT-FORM-1 | SENT-FORM-2 | ... | SENT-FORM-n
as shorthand for
<non-term> ::= SENT-FORM-1
<non-term> ::= SENT-FORM-2
...
<non-term> ::= SENT-FORM-n
With this shorthand, we can write the above grammar as:
<sentence> ::= <noun-phrase> <verb-phrase>
<verb-phrase> ::= <verb> | <verb> <prep-phrase>
<prep-phrase> ::= <prep> <noun-phrase>
<noun-phrase> ::= <article> <noun>
<article> ::= the
<noun> ::= cow | moon
<verb> ::= jumped
<prep> ::= over
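In the OCaml sketch, the | shorthand simply corresponds to several rule values sharing the same lhs; here is the grammar above encoded as a value (hypothetical, for illustration only):

let cow_grammar : grammar =
  { start = "sentence"
  ; rules =
      [ { lhs = "sentence";    rhs = [ N "noun-phrase"; N "verb-phrase" ] }
      ; { lhs = "verb-phrase"; rhs = [ N "verb" ] }
      ; { lhs = "verb-phrase"; rhs = [ N "verb"; N "prep-phrase" ] }
      ; { lhs = "prep-phrase"; rhs = [ N "prep"; N "noun-phrase" ] }
      ; { lhs = "noun-phrase"; rhs = [ N "article"; N "noun" ] }
      ; { lhs = "article";     rhs = [ T "the" ] }
      ; { lhs = "noun";        rhs = [ T "cow" ] }
      ; { lhs = "noun";        rhs = [ T "moon" ] }
      ; { lhs = "verb";        rhs = [ T "jumped" ] }
      ; { lhs = "prep";        rhs = [ T "over" ] } ] }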
The last piece of the thought experiment above that we need to formalize is the proof that the given sentence was grammatical. We formalize this in the notion of a derivation.
Definition. A derivation of a sentence S in a BNF grammar is a sequence of sentential forms with the following properties:
- it begins with the designated start symbol;
- it ends with the sentence S;
- each sentential form is the result of replacing one of the non-terminal symbols in the preceding sentential form with a sentential form according to a production rule of the grammar.
We say that a grammar recognizes a sentence S if there is a derivation of S in the grammar.
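In terms of the OCaml sketch, a single derivation step replaces exactly one non-terminal using one rule of the grammar. The helper below (my own, hypothetical) expands the non-terminal at position i of a sentential form:

(* Replace the non-terminal at position i of sf using rule r of grammar g. *)
let apply_rule (g : grammar) (sf : sentential_form) (i : int) (r : rule) : sentential_form =
  if not (List.mem r g.rules) then failwith "rule not in grammar"
  else
    match List.nth_opt sf i with
    | Some (N name) when name = r.lhs ->
        List.filteri (fun j _ -> j < i) sf   (* everything before position i *)
        @ r.rhs                              (* the replacement *)
        @ List.filteri (fun j _ -> j > i) sf (* everything after position i *)
    | _ -> failwith "no matching non-terminal at that position"

A derivation is then a sequence of sentential forms, each obtained from the previous one by a single apply_rule step.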
This definition restates the process from the thought experiment mathematically. That said, it deviates in one way which makes the definition easier to state: in the thought experiment, we allowed ourselves to replace multiple non-terminal symbols simultaneously. This is not allowed in the above notion of a derivation. A correct derivation would look like:
<sentence>
<noun-phrase> <verb-phrase>
<noun-phrase> <verb> <prep-phrase>
<noun-phrase> <verb> <prep> <noun-phrase>
<article> <noun> <verb> <prep> <noun-phrase>
<article> <noun> <verb> <prep> <article> <noun>
the <noun> <verb> <prep> <article> <noun>
the cow <verb> <prep> <article> <noun>
the cow jumped <prep> <article> <noun>
the cow jumped over <article> <noun>
the cow jumped over the <noun>
the cow jumped over the moon
Exercise. In the above derivation, mark the nonterminal in each sentential form which was replaced in the following line.
A sentence is not guaranteed to have a unique derivation. There are two forms of derivations that will be of particular importance for us.
Definition. A leftmost derivation of a sentence is one in which the leftmost nonterminal symbol is expanded in each step. A rightmost derivation is one in which the rightmost nonterminal symbol is expanded in each step.
Note that the above derivation is neither the leftmost derivation nor the rightmost derivation.
Exercise. Write leftmost and rightmost derivations for the sentence
the cow jumped over the moon
in the above grammar.
A More Interesting Example
The following is a BNF specification for a fragment of a simple imperative programming language.
<program> ::= <stmts>
<stmts> ::= <stmt> | <stmt> ; <stmts>
<stmt> ::= <var> = <expr>
<var> ::= a | b | c | d
<expr> ::= <term> | <term> + <term> | <term> - <term>
<term> ::= <var> | const
In English, we would read this specification as:
- a PROGRAM is a SEQUENCE OF STATEMENTS;
- a SEQUENCE OF STATEMENTS is either a single STATEMENT, or a single STATEMENT followed by a semicolon, followed by a SEQUENCE OF STATEMENTS;
- a STATEMENT is a VARIABLE followed by an equals sign, followed by an EXPRESSION.
and so on.
This second rule highlights something interesting which we can do in BNF specifications: rules are allowed to be recursive. The production rule for <stmts> allows us to replace it with a sentential form which contains the non-terminal symbol <stmts>. This is quite powerful, particularly because it means it is possible to derive an infinite number of sentences in a given grammar.
Exercise. Determine the number of sentences that can be derived in the grammar from the previous section (i.e., the number of sentences which can be derived from the starting symbol <sentence>).
Consider the following program.
a = const ;
a = a + const ;
b = a
We can verify that this program is recognized by the above grammar by finding a (leftmost) derivation.
<program>
<stmts>
<stmt> ; <stmts>
<var> = <expr> ; <stmts>
a = <expr> ; <stmts>
a = <term> ; <stmts>
a = const ; <stmts>
a = const ; <stmt> ; <stmts>
a = const ; <var> = <expr> ; <stmts>
a = const ; a = <expr> ; <stmts>
a = const ; a = <term> + <term> ; <stmts>
a = const ; a = <var> + <term> ; <stmts>
a = const ; a = a + <term> ; <stmts>
a = const ; a = a + const ; <stmts>
a = const ; a = a + const ; <stmt>
a = const ; a = a + const ; <var> = <expr>
a = const ; a = a + const ; b = <expr>
a = const ; a = a + const ; b = <term>
a = const ; a = a + const ; b = <var>
a = const ; a = a + const ; b = a
Remark. As a reminder, we're not interested in white space when we consider whether or not a sentence is recognized by a grammar. The choice to present the sentences in three lines was for readability, and the choice to present it in a single line in the derivation was for convenience.
One notable feature of the last four lines of the above derivation: even if a nonterminal symbol is replaced by a single nonterminal symbol in succession, we have to include each step. We're only allowed to apply one production rule at a time, e.g., we cannot immediately replace <expr> with <var> because that is not one of our production rules.
Exercise. Write a rightmost derivation for the above program.
Exercise. Verify that
a = a + a ; b = b
is recognized by the above grammar.
Parse Trees
Grammars imbue sentences with hierarchical structure. This structure is represented graphically as a parse tree. We've seen a couple examples of English grammar parse trees so far, but we can also build parse trees for sentences recognized by any grammar with a BNF specification.
Definition. A parse tree for a sentence S in a grammar is a tree T with the following properties:
- every leaf of T has a terminal symbol;
- every non-leaf node n has a nonterminal symbol (we write val(n) for the value at n);
- if a node n has children [t1, t2, ..., tk], then
val(n) ::= root(t1) root(t2) ... root(tk)
is a production rule in the grammar (where root(t) denotes the value at the root of the tree t);
- the leaves in order (i.e., the frontier of T) form the sentence S.
The details of the above definition are not so important, as long as you have the right picture in your head. For example, the sentence
a = b + const
has the following derivation.
<program>
<stmts>
<stmt>
<var> = <expr>
a = <expr>
a = <term> + <term>
a = <var> + <term>
a = b + <term>
a = b + const
And it has the following parse tree.
<program>
│
<stmts>
│
<stmt>
├─────┬───┐
<var> │   <expr>
│     │   ├──────┬─┐
│     │   <term> │ <term>
│     │   │      │ │
│     │   <var>  │ │
│     │   │      │ │
a     =   b      + const
Remark. You don't have to draw your trees in this rectilinear styling. If you're drawing one on paper, I'd recommend drawing it like any other tree you've drawn for another computer science course. I just have a predilection for ASCII art…
Exercise. Given the ADT
type 'a tree = Leaf of 'a | Node of 'a * 'a tree list
write the OCaml function frontier which, given t of type 'a tree, returns the list of leaves of t in order from left to right.
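To connect this ADT to the definition above: the parse tree for a = b + const can be encoded directly as a value of this type (a sketch with string labels; this is not part of the exercise):

let example_tree : string tree =
  Node ("<program>",
    [ Node ("<stmts>",
        [ Node ("<stmt>",
            [ Node ("<var>", [ Leaf "a" ]);
              Leaf "=";
              Node ("<expr>",
                [ Node ("<term>", [ Node ("<var>", [ Leaf "b" ]) ]);
                  Leaf "+";
                  Node ("<term>", [ Leaf "const" ]) ]) ]) ]) ])

Applying frontier to this value should produce ["a"; "="; "b"; "+"; "const"], the sentence at the leaves.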
Every derivation can be converted into a parse tree, and vice versa, but multiple derivations may correspond to the same parse tree. This will be important when we cover ambiguity in the next section.
Exercise. Write a derivation corresponding to the above parse tree which is neither leftmost nor rightmost.
Ambiguity
As participants of language, we are no strangers to grammatical ambiguity. Take, for instance, the following sentence:2
John saw the man on the mountain with the telescope
Was John using the telescope? Was the man carrying the telescope? Are there multiple mountains, one of which has a telescope on it?
The ambiguity exists because it is not clear which hierarchical structure should scaffold the sentence. Here's one possible parse tree for the sentence.
S
├────┐
NP   VP
│    ├───┬───────────────────────┐
N    V   NP                      PP
│    │   ├───────┐               ├────┐
│    │   NP      PP              P    NP
│    │   ├───┐   ├──┐            │    ├───┐
│    │   A   N   P  NP           │    A   N
│    │   │   │   │  ├───┐        │    │   │
│    │   │   │   │  A   N        │    │   │
│    │   │   │   │  │   │        │    │   │
John saw the man on the mountain with the telescope
The prepositional phrase with the telescope is grouped alongside the verb phrase starting with saw, indicating that John was using the telescope. Another option:
S
├────┐
NP   VP
│    ├───┐
N    V   NP
│    │   ├───────┐
│    │   NP      PP
│    │   ├───┐   ├──┐
│    │   A   N   P  NP
│    │   │   │   │  ├────────────┐
│    │   │   │   │  NP           PP
│    │   │   │   │  ├───┐        ├────┐
│    │   │   │   │  A   N        P    NP
│    │   │   │   │  │   │        │    ├───┐
│    │   │   │   │  │   │        │    A   N
│    │   │   │   │  │   │        │    │   │
John saw the man on the mountain with the telescope
Here, the prepositional phrase with the telescope is part of the noun phrase containing the mountain, indicating that there is a telescope on the mountain and that the man is climbing that mountain.
The ambiguity comes from not being completely sure which parse tree to give to the sentence; we experience language in a linear fashion, either by reading it or by hearing it. If our interlocutor could display the parse tree of their statement (floating eerily in space before our eyes), there would be nothing to say of (grammatical) ambiguity.
But this is not how we experience language, or how we write programs for that matter. To drive the point home, there is a natural-enough looking grammar which recognizes the above sentence.
<s> ::= <np> <vp>
<vp> ::= <v> | <v> <np> | <v> <np> <pp>
<pp> ::= <p> <np>
<np> ::= <n> | <d> <n> | <np> <pp>
<n> ::= John | man | mountain | telescope
<v> ::= saw
<d> ::= the
<p> ::= on | with
The above sentence has multiple parse trees in the above grammar. Equivalently, since each parse tree corresponds to exactly one leftmost derivation (you should try to convince yourself of this), it also has multiple leftmost derivations.
Exercise. Give two leftmost derivations of
John saw the man on the mountain with the telescope
in the above grammar.
Ambiguity in natural language is a complex topic, but restricted to formal grammars, ambiguity is a well-defined notion.
Definition. A grammar is ambiguous if there is a sentence it recognizes which has two distinct parse trees. Equivalently, it is ambiguous if there is a sentence it recognizes which has two distinct leftmost derivations.
Thus, the above grammar is ambiguous in the mathematical sense.
More to the point, consider the following grammar which may be seen as a prototype of a grammar for arithmetic expressions (something we will probably want if we're to give a grammar for a programming language).
<expr> ::= <var> | <expr> <op> <expr>
<op> ::= + | - | * | /
<var> ::= x
This seems, ignoring obvious issues, a reasonable enough definition; an expression is either a variable or a pair of expressions with an operator between them. Note that the recursive nature of the first production rule means that this grammar recognizes an infinite number of sentences.
Exercise. Give a leftmost derivation of
x * x + x * x
in the above grammar.
But, with regard to ambiguity, we should already be suspicious of the first production rule. As soon as we've applied (the second alternative of) the first production rule twice, we have the partial derivation:
<expr>
<expr> <op> <expr>
<expr> <op> <expr> <op> <expr>
For the third line, which of the two <expr> symbols did we expand? To make this concrete, here are two parse trees for the sentence x + x + x.
<expr>                            <expr>
├──────┬────┐                     ├──────────────────┬────┐
<expr> <op> <expr>                <expr>             <op> <expr>
│      │    ├──────┬────┐         ├──────┬────┐      │    │
<var>  │    <expr> <op> <expr>    <expr> <op> <expr> │    <var>
│      │    │      │    │         │      │    │      │    │
│      │    <var>  │    <var>     <var>  │    <var>  │    │
│      │    │      │    │         │      │    │      │    │
x      +    x      +    x         x      +    x      +    x
These two parse trees correspond to the following two leftmost derivations (the first for the tree on the left, the second for the tree on the right).
<expr>
<expr> <op> <expr>
<var> <op> <expr>
x <op> <expr>
x + <expr>
x + <expr> <op> <expr>
x + <var> <op> <expr>
x + x <op> <expr>
x + x + <expr>
x + x + <var>
x + x + x

<expr>
<expr> <op> <expr>
<expr> <op> <expr> <op> <expr>
<var> <op> <expr> <op> <expr>
x <op> <expr> <op> <expr>
x + <expr> <op> <expr>
x + <var> <op> <expr>
x + x <op> <expr>
x + x + <expr>
x + x + <var>
x + x + x
This demonstrates that the above grammar is ambiguous.
Aside. In this example, and many of the examples we will see, it will be fairly clear that the grammar is ambiguous. As students of computer science, we might think that we could write a program that checks for us if a grammar is ambiguous. Unfortunately, this is impossible (not just very difficult, but impossible). This is to say that determining if a context-free grammar is ambiguous is undecidable (a term worth looking up if this piques your interest).
Avoiding Ambiguity
Our next task is to determine how to avoid this ambiguity. But first, why should we care? Natural language is ambiguous and we get along perfectly fine. Why should we go through this trouble to make sure the grammars we design are unambiguous?
It's a fair question; the way I see it, it's something of a promise that we make to the user of a programming language that we never make unspoken assumptions about what a user meant when we read one of their programs. To be fair, we try to do this with natural language too, but in communication, if a statement is ambiguous, we can usually just ask our interlocutor what they meant. We can't do this for a program, so instead we make it impossible for a sentence to have multiple meanings.
Aside. We see a similar phenomenon in legal language, which tends to be grammatically sterile, and usually no fun to read.
(Reverse) Polish Notation
If our only concern is avoiding ambiguity, we can use Polish notation or reverse Polish notation. In Polish notation, operators appear before all their arguments, e.g.
<expr> ::= <var> | <op> <expr> <expr>
<op> ::= + | - | * | /
<var> ::= x
We won't dwell on this, but it turns out this gives us an unambiguous grammar; we don't even need parentheses.
It's not difficult, then, to guess what reverse Polish notation is: operators always appear after all their arguments. This is how early calculators like the HP 9100A Desktop Calculator were designed. If you wanted to calculate something like (2 + 3) * (4 - 5), you would push the values you want to apply operations to onto a stack, and then apply operations to the top elements of the stack, like so:
STACK    RPN EXPRESSION
2        2
2 3      2 3
5        2 3 +
5 4      2 3 + 4
5 4 5    2 3 + 4 5
5 (-1)   2 3 + 4 5 -
(-5)     2 3 + 4 5 - *
So the sequence of tokens you end up typing into the calculator is an expression in reverse Polish notation.
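The stack discipline in the table is easy to implement. Here is a minimal OCaml sketch (the function name and token representation are my own): push numbers, and apply each operator to the top two elements of the stack.

let eval_rpn (tokens : string list) : int =
  let step stack tok =
    match tok, stack with
    | "+", y :: x :: rest -> (x + y) :: rest
    | "-", y :: x :: rest -> (x - y) :: rest
    | "*", y :: x :: rest -> (x * y) :: rest
    | "/", y :: x :: rest -> (x / y) :: rest
    | n, _ -> int_of_string n :: stack   (* anything else is a number *)
  in
  match List.fold_left step [] tokens with
  | [ v ] -> v
  | _ -> failwith "malformed RPN expression"

(* (2 + 3) * (4 - 5), as in the table above; evaluates to -5. *)
let _ = eval_rpn [ "2"; "3"; "+"; "4"; "5"; "-"; "*" ]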
Exercise. Derive the sentence
+ * x * x - x x x x
in the Polish notation grammar above.
The obvious issue with (reverse) Polish notation is that it's difficult to read. Imagine working with a language in which if-then-else logic had to be done like this:
IFTHENELSE cond IFTHENELSE IFTHENELSE cond ifcase elsecase ifcase elsecase elsecase
This is in part to say that what truly causes ambiguity in expressions is complex operator fixity.
Definition. The fixity of an operator refers to where the syntactic components of an operator are placed relative to its arguments. There are four kinds of operator fixity.
- A prefix operator appears before its arguments. This is like function application in OCaml, e.g., f x or not b.
- A postfix operator appears after its arguments. This is like type constructor application in OCaml, e.g., int list or bool option list.
- An infix operator appears in between its arguments. This is like the arithmetic operations we learn in primary school, e.g., 2 + 3 or 4 / 5.
- A mixfix operator is an operator with multiple syntactic components which may appear as some combination of prefix, infix, and postfix. This is like if-then-else expressions or anonymous functions in OCaml, e.g., if b then x else y or fun x -> e.
If we want to contend with operator fixity (i.e., we don't want just prefix or just postfix operators, as in Polish notation or reverse Polish notation), then we have to work to avoid ambiguity.
Parentheses
Another simple solution is to surround applications of operations with parentheses:
<expr> ::= <var> | ( <expr> <op> <expr> )
<op> ::= + | - | * | /
<var> ::= x
It then becomes very clear in a derivation like
<expr>
( <expr> <op> <expr> )
( ( <expr> <op> <expr> ) <op> <expr> )
which <expr> in the second line we expanded. But we run into a similar issue: lots of parentheses are no fun to read or type.
Exercise. Give a derivation of
( ( ( x * x ) * x ) + ( x / x ) )
in the above grammar.
Aside. The cult of parentheses is real. Programming languages based on LISP use full parenthesization and prefix notation. Many people consider this to be elegant, and many others believe it to be ugly as hell.
So, our real basic question is: how do we avoid ambiguity while being able to mix operator fixities and not use so many parentheses? And this question has a simple answer in theory: make explicit assumptions about how operator arguments are grouped. This will mean contending with two things: associativity and precedence.
Associativity
Associativity refers to how arguments are grouped when we are given a sequence of applications of an infix operator in the absence of parentheses, e.g.,
x + x + x + x
may be understood as any one of the following (among others):
(((x + x) + x) + x)
(x + (x + (x + x)))
((x + x) + (x + x))
(x + ((x + x) + x))
Exercise. Determine the number of ways the expression
x + x + x + x
can be parenthesized.
In the case of addition this point is somewhat moot. The order in which we group arguments does not affect the value of a sequence of additions. That is, addition is associative.
Definition. An operator \(\circ : X \to X \to X\) is associative if
\begin{align*}
(a \circ b) \circ c = a \circ (b \circ c)
\end{align*}
for any \(a\), \(b\), and \(c\) in \(X\).
Not all operators are associative. Take division, for example. We need to decide how to implicitly parenthesize an expression like
\begin{align*}
a / b / c / d
\end{align*}
Again, we could try to bar the ability to write an expression like this, but we might rather avoid using parentheses if possible.
Exercise. How does OCaml evaluate the expression 100 / 5 / 4? How are the arguments grouped?
For binary operators, we typically choose one of the first two choices in the list of parenthesizations above.
Definition. An operation \(\circ : X \to X \to X\) is said to be left-associative if sequences of applications of the operator are understood as grouping arguments from the left, i.e.,
\begin{align*}
a \circ b \circ c \circ d = (((a \circ b) \circ c) \circ d)
\end{align*}
for any \(a\), \(b\), \(c\), and \(d\) in \(X\). It's said to be right-associative if arguments are grouped from the right, i.e.,
\begin{align*}
a \circ b \circ c \circ d = (a \circ (b \circ (c \circ d)))
\end{align*}
Remark. Another option is that a binary operator can have no associativity. It does not, for instance, make sense to apply a sequence of (=) operators in OCaml.
Bringing this back to grammatical ambiguity, giving a parenthesization of a sequence of operators means specifying a "shape" for the corresponding parse tree. For example, taking addition to be left-associative, i.e., taking x + x + x to mean ((x + x) + x), implies that, of the two parse trees for this sentence above, the one on the right should be the "correct" one, not the one on the left.
Example. Another common ambiguous grammar in programming languages is the one for function types.
<fun-type> ::= <int-type> | <fun-type> -> <fun-type>
<int-type> ::= int
Exercise. Give two leftmost derivations of
int -> int -> int
in the above grammar.
Aside. Looking ahead a bit, we will end up building parsers using a parser generator. In this setting, specifying the associativity of an operator will be simple: there will be a command for doing so, a line we'll add to the code. But we can also deal with associativity in the grammar itself.
Function types are right-associative, so we understand
int -> int -> int
to be parenthesized implicitly as
int -> (int -> int)
The problem, as it stands, is that because of the production rule
<fun-type> ::= <fun-type> -> <fun-type>
the argument type can be an arbitrarily complex function type. But we might recognize that, no matter how deep the nesting, the argument types have a special property if we assume right-associativity: they're always just int. Therefore, we might consider the following updated grammar.
<fun-type> ::= <int-type> | <int-type> -> <fun-type>
<int-type> ::= int
In this grammar, no matter how many times you apply the second alternative of the first production rule, the argument type is always just the int type. So we were, in fact, able to restrict the shape of the parse trees for sentences by breaking the symmetry of the production rule. (This, of course, is more complicated once we start considering higher-order functions.)
Exercise. Write the leftmost derivation of
int -> int -> int
in the above grammar, along with its parse tree.
Precedence
In the examples from the previous section, there was only one operator. In the presence of multiple operators we have new issues to deal with. How should something like \(x * y + z\) be implicitly parenthesized?
This is an issue of precedence, or order of operations, which is probably already somewhat familiar to you.
If you went through the American public school system then you probably learned the abbreviation PEMDAS ((P)arentheses, (E)xponentiation, (M)ultiplication, (D)ivision, (A)ddition, (S)ubtraction, along with an accompanying mnemonic, something like "please excuse my dear aunt sally"). Focusing on just the last four letters, this rule tells us that we should group multiplications and divisions first, and then group additions and subtractions. That is to say, multiplication and division have greater precedence than addition and subtraction.
Definition. The precedence of an operator, relative to another operator, determines which operator binds more tightly in the presence of ambiguity.
Example. The expression \(2 * 3 + 4 * 5\) should be implicitly parenthesized as \((2 * 3) + (4 * 5)\); because multiplication binds tighter, it is considered first.
The relative precedence of a collection of operators determines the "shape" of the parse tree we get when we generate the sentence with a grammar. To say that multiplication has higher precedence is to say that when we build the parse tree for x * x + x, we want + to be the top-level operation, at the root of the tree.
Remark. One thing that was probably glossed over if/when you learned PEMDAS: what do you do with something like \(1 + 1 - 1 + 1\)? Do you group additions and then subtractions? Or vice versa? The issue here is that addition and subtraction have the same precedence. In this case, we will use the associativity of the operators to determine how to parenthesize: given a sequence of operators of the same precedence, we use their associativity to group them.
Since addition and subtraction are both left-associative, this would be parenthesized as \(((1 + 1) - 1) + 1\). Things can get truly ugly if you have two operators with the same precedence, but different associativity. We will ignore this possibility in this course, but this really matters in languages like Haskell, in which users can define their own operators, with specified precedence and associativity.
All of this is to say: in order to define an unambiguous grammar without parentheses, we need to know three things of each operator appearing in the grammar: fixity, associativity, and precedence. Fixity is determined directly by the production rules, but we will have to specify associativity and precedence explicitly.
We can, for example, present the four basic arithmetic operators along with this information. The operators are presented in order of decreasing precedence.3
| Operator | Fixity | Associativity |
|---|---|---|
| *, / | infix | left |
| +, - | infix | left |
In fact, just like the grammar for OCaml is given in The OCaml Manual, so are the associativity and precedence of all its operators.
Aside. As with associativity, we will be able to specify precedence in the parser generator we use. And, as with associativity, we can also deal with precedence in the grammar itself.
In the case of arithmetic expressions, we use similar observations:
- because of associativity, the right argument of a multiplication or division must be a variable x;
- because of precedence, the left argument of a multiplication or division must contain only multiplications and divisions;
- because of associativity, the right argument of an addition or subtraction must be an expression with only multiplications and divisions.
These observations yield the following grammar. I've tried to use suggestive names to indicate how the three points above manifest.
<expr> ::= <only-mul-div> | <expr> <add-sub> <only-mul-div>
<only-mul-div> ::= <var> | <only-mul-div> <mul-div> <var>
<add-sub> ::= + | -
<mul-div> ::= * | /
<var> ::= x
Exercise. Give the leftmost derivation of
x * x + x * x
in the above grammar. Also draw its parse tree.
Proving that this grammar is unambiguous is a bit tricky (we won't do this). It suffices to say that, given that we know how to parenthesize such expressions, and that this grammar captures the rules we use, it would seem to be unambiguous.
Parentheses (Again)
We're not quite done. Without parentheses, we can't derive all expressions we might want to write down. This is important: parentheses in a programming language aren't just "there" implicitly. We have to explicitly include them in our grammar.
Given the rules we've described above, we cannot write an expression (without parentheses) which is equivalent to
( x + x ) * x
To give a complete specification of the arithmetic expressions we know and love, we have to add back parentheses (this is the "P" part of PEMDAS).
But where should we put parentheses into the grammar? As we saw above, if we put them everywhere, we get unnecessarily verbose sentences. Rather, it should be possible to put any expression as an argument to an operator, provided it's wrapped in parentheses. For this, we will replace <var> with something that can also be any expression wrapped in parentheses.
<expr> ::= <only-mul-div> | <expr> <add-sub> <only-mul-div>
<only-mul-div> ::= <var-or-parens> | <only-mul-div> <mul-div> <var-or-parens>
<add-sub> ::= + | -
<mul-div> ::= * | /
<var-or-parens> ::= x | ( <expr> )
We might worry that it is now again possible to have an arbitrary expression as an argument to an operator, but the point is that, if we do this, it must be wrapped in parentheses, which ensures unambiguity.
Exercise. Give a leftmost derivation of
( x + x ) * x
in the above grammar. Draw its parse tree.
Exercise. (Challenge) According to PEMDAS, we also need to handle exponentiation. Give a grammar for arithmetic expressions including parentheses and exponentiation, using the following operator information.
| Operator | Fixity | Associativity |
|---|---|---|
| ^ | infix | right |
| *, / | infix | left |
| +, - | infix | left |
Exercise. (Challenge) Write a grammar for Boolean expressions in Python. You can check what sorts of expressions are allowed by using the Python interpreter.
Extended BNF
There are a number of extensions to the BNF (meta-)syntax which make it more usable. If you go digging around the Internet for Extended BNF (EBNF), you'll find a couple definitions, most of which are more complex than what we will choose to call EBNF here. That said, the extensions we consider in this short section are the ones that will be available in the parser generator we use in the next chapter.
Optional
We use [ SENT-FORM ] to notate a part of a production rule which is optional. We may, for instance, want to write a language which has both if-then and if-then-else expressions. Rather than using the BNF production rules
<if-expr> ::= if <expr> then <expr> | if <expr> then <expr> else <expr>
we can use the EBNF production rule
<if-expr> ::= if <expr> then <expr> [ else <expr> ]
These two rules express the same thing. In fact, all BNF production rules are also EBNF production rules, and all EBNF production rules can be rewritten as a collection of BNF production rules.
Exercise. Rewrite the following EBNF production rule as a collection of BNF production rules.
<a> ::= a [ <b> ] [ a ]
Remark. One issue with extending BNF is that we've now made it harder to express grammars which include the symbols used for the EBNF extensions (e.g., [ and ]). In practice this is not a problem; we will try to be very explicit if something like [ appears as part of a symbol of the grammar, and not as part of the specification of the grammar.
Alternative
We use ( SENT-FORM₁ | SENT-FORM₂ | ... | SENT-FORMₖ ) to notate a choice among multiple sentential forms as part of a production rule. In the previous section we defined a grammar for arithmetic expressions with multiple rules for the choice of operator.
<expr> ::= <only-mul-div> | <expr> <add-sub> <only-mul-div>
<only-mul-div> ::= <var-or-parens> | <only-mul-div> <mul-div> <var-or-parens>
<add-sub> ::= + | -
<mul-div> ::= * | /
<var-or-parens> ::= x | ( <expr> )
We can simplify this in EBNF by removing the production rules for operators.
<expr> ::= <only-mul-div> | <expr> (+ | -) <only-mul-div>
<only-mul-div> ::= <var-or-parens> | <only-mul-div> (* | /) <var-or-parens>
<var-or-parens> ::= x | ( <expr> )
Repetition
We use { SENT-FORM } to notate a part of a production rule which can be repeated as many times as we want. We can also combine this with the alternative notation, as
{ SENT-FORM₁ | SENT-FORM₂ | ... | SENT-FORMₖ }
to represent a collection of choices we may repeat as many times as we want.
In the same grammar as above, we can rewrite the recursive production rules in terms of repetition.
<expr> ::= <only-mul-div> { (+ | -) <only-mul-div> }
<only-mul-div> ::= <var-or-parens> { (* | /) <var-or-parens> }
<var-or-parens> ::= x | ( <expr> )
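Looking ahead: in the next chapter we will get parsers from a generator, but the EBNF repetition above also maps directly onto a hand-written recursive-descent parser, where each { ... } becomes a loop. The following is a sketch of my own (the token and AST types are made up), which also builds the left-associated trees the original grammar prescribes:

type token = TVar | TPlus | TMinus | TTimes | TDiv | TLParen | TRParen

type ast = Var | Op of ast * token * ast

(* <var-or-parens> ::= x | ( <expr> ) *)
let rec parse_atom tokens =
  match tokens with
  | TVar :: rest -> (Var, rest)
  | TLParen :: rest ->
      (match parse_expr rest with
       | e, TRParen :: rest' -> (e, rest')
       | _ -> failwith "expected )")
  | _ -> failwith "expected x or ("

(* <only-mul-div> ::= <var-or-parens> { ( * | / ) <var-or-parens> } *)
and parse_mul_div tokens =
  let lhs, rest = parse_atom tokens in
  let rec loop acc toks =
    match toks with
    | ((TTimes | TDiv) as op) :: rest' ->
        let rhs, rest'' = parse_atom rest' in
        loop (Op (acc, op, rhs)) rest''   (* group from the left *)
    | _ -> (acc, toks)
  in
  loop lhs rest

(* <expr> ::= <only-mul-div> { ( + | - ) <only-mul-div> } *)
and parse_expr tokens =
  let lhs, rest = parse_mul_div tokens in
  let rec loop acc toks =
    match toks with
    | ((TPlus | TMinus) as op) :: rest' ->
        let rhs, rest'' = parse_mul_div rest' in
        loop (Op (acc, op, rhs)) rest''
    | _ -> (acc, toks)
  in
  loop lhs rest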
Exercise. (Challenge) Write an EBNF specification for Boolean expressions in Python.
Footnotes:
1. It can be used as a prefix operator if put in parentheses, e.g., (+) x 1, but it cannot in any circumstances be used as a postfix operator. We will discuss fixity in more detail later.
2. This sentence is taken from the Wikipedia article on syntactic ambiguity.
3. We can only present the table like this because we assume that all operators of the same precedence also have the same fixity.