Specifications.Lab4
In the previous lab, we built a tool to build structured data (a table) from unstructured data (text in TSV format). In this lab, we'll expand on this idea by building a tool to build hierarchical data (a tree) from text (you may be familiar with file formats like JSON for storing hierarchical data in text) We'll be working with S-expressions, which have also historically been used for the syntax of programming languages, e.g., dialects of LISP.
Make sure to pull down the course repository to get access to starter code for this lab in the directory labs/lab4
.
An S-expression is defined to be one of two things:
\texttt( e_1 \ e_2 \ \dots \ e_k \texttt)
, where e_1
through e_k
are S-expressions. In this lab, we will make the simplifying assumption that k
must be at least 1
and that e_1
is an atom, so that \texttt{()}
is not a valid S-expression, and neither is \texttt{((foo) bar)}
.Note that this is a recursive definition, and the data an S-expression denotes can be represented as a string ntree
.
Example. the S-expression (+ (+ 3 4) (+ 5 5))
is a valid S-expression, which denotes the tree
let example =
Node "+" [
Node "+" [ Node "3" [], Node "4" []];
Node "+" [ Node "5" [], Node "5" []];
]
which could be further processed into a value of a recursive ADT:
type expr =
| Num of int
| Add of expr * expr
let example = Add (Add (Num 3, Num 4), Add (Num 5, Num 5))
Another Example. We've actually already come across S-expression in this course in passing: dune files are written as S-expressions. Take a look at (part of) the dune file in this directory:
(library
(public_name lab4)
(libraries stdlib320))
This S-expression could be represented by the following nonempty tree:
Node "library" [
Node "public_name" [Node "lab4" []];
Node "libraries" [Node "stdlib320" []];
]
The goal of this lab is: construct nonempty trees of strings from S-expressions.
When we're converting text to some kind of structured data, it's useful to first break up the text into its "relevant" parts. As we will see in the second half of the course, this process is called lexing or tokenizing, and is a preprocessing step of parsing.
For example, if we're looking at the string:
"(defun foo (args x y) (+ (/ x y) (/ y x)))"
we don't care so much that the third character is 'e'
, as much as we care that the 'e'
is part of an atom "defun"
which is the second "unit" of the given expression. Likewise, we don't care how many spaces or what kind of spaces there are between atoms, just that there is whitespace.
val tokenize : string -> token list
The function tokenize
separates out the parentheses and groups characters that correspond to a single atom in its input string. It also drops all whitespace. For example, the above example would be tokenized as
[Lparen; Atom "defun"; Atom "foo"; Lparen; Atom "args"; Atom "x";...]
Note that, at this point, we don't care if the input is a well-formed S-expression, we just care about the valid "pieces" of the expression, e.g.,
let _ = assert (tokenize "))asdf((" = [Rparen; Rparen; "asdf"; LParen; LParen]
You don't need to do anything for this part of the lab, we've provided the function tokenize
for you. But, if you have time, take a look at it's implementation and make sure you know what's going on in it.
The bulk of the today's lab work will be the following function.
val ntree_of_toks : token list -> string Stdlib320.ntree option
Implement the function ntree_of_toks
so that ntree_of_toks toks
is None
if toks
is not a valid S-expression, and is Some t
otherwise, where t
is a nonempty tree representation of the data in the S-expression toks
. Some examples:
let _ = assert (ntree_of_toks [Lparen; Atom "+"; Atom "2"; Atom "3"; Rparen]
= Some (Node "+" [Node "2" []; Node "3" []]))
let _ = assert (ntree_of_toks [Lparen; Atom "+"; Atom "2"; Atom "3"; Rparen; Rparen]
= None)
let _ = assert (ntree_of_token [] = None)
let _ = assert (
ntree_of_toks
[Lparen; Atom "+"; Atom "2"; Lparen; Atom "foo"; Atom "3"; Rparen; Rparen]
= Some (Node "+" [Node "2" []; Node "foo" [Node "3" []]])
)
Implementing this function will require you to have a strong grasp on recursion. Please spend some real time working on this, knowing how to implement a function like this is crucial for this course. A hint (in addition to asking the lab TF/TA for help): it's useful to first implement a function of type:
token list -> (string ntree * token list) option
As you recurse, you'll want to keep track of the "progress" you've made, so you can return both the tree and the part of the input list that hasn't been processed yet, e.g.
let _ = assert (helper [Lparen; Atom "+"; Atom "2"; Atom "3"; Rparen; Rparen]
= Some (Node "+" [Node "2" []; Node "3" []], [Rparen]))
That way, you can "continue" on the part that hasn't been processed.
val parse : string -> string Stdlib320.ntree option
Once you've finished this ntree_of_toks
, you should be able to parse and print S-expressions:
utop>
match Lab4.parse
"
(library
(public_name lab4)
(libraries stdlib320))
"
with
| None -> ()
| Some t -> Stdlib320.Ntree.print (fun x -> x) t;;
library
├──public_name
│ └──lab4
└──libraries
└──stdlib320
- : unit = ()
If you finish ntree_of_toks
, then there are a couple extensions you could work on:
bin/main.ml
to take input at stdin as in Lab 3.sexpr_of_ntree
described below.libraries
part of the S-expression, and then prints out the updated dune file.json_of_ntree
so that you can convert S-expressions to JSON.The point is: once you can read in and represent heirarchical data, you can start processing and reformatting this data (a very common task).
val sexpr_of_ntree : string Stdlib320.ntree -> string
sexpr_of_ntree t
is t
given as an S-expression. Up to formatting and the Some
constructor, it is the inverse of parse
on well-formed S-expressions.