This post is under construction.

I’m currently pretty stuck on my compiler project, for technical problems related to the modular type classes (Dreyer et al., 2007) paper, and how poorly it interacts with the module interpretation of datatype and polymophism in particular. It is a bit of a long story but I had two different elaboration stragies in mind, both of which are deeply problematic for complicated, and different, reasons (perhaps a different blog post/brain dump on a different day, but the crux here is that specifically I’m not compiling exactly Standard ML in the Definition, and dictionary construction actually forces me to be extremely careful about how elaboration works).

Nevertheless, I find it useful to just take a break from thinking about it for a while. Maybe come back to the drawing board in a few days and hope that I get a different perspective on this (I doubt it, but we’ll see). So it’s time to look at something completely different (and certainly much more straightforward).

For quite a long time, I was somewhat facinated by how Agda does mixfix notation parsing. For those not familiar with Agda, basically you can just define a function like this:

if_then_else_ : {A : Set} → Bool → A → A → A
if true then x else y = x
if false then x else y = y

and then the parser will pick it up as a mixfix expression and apply it to subsequent parsing (ex. if true then 1 else 1), where _ represents a hole an expression should go to. (Note that it sticks to the Haskell convention where the arguments are always curried.)

Nowadays many modern languages have notational shorthands like this (on top of my head, Coq, Lean and Isabelle comes to mind). The most sophisticated notation system is probably the one implemented in Coq, but to be honest that implementation is exceptionally complicated and requires the language to have specific conventions (in Gallina’s case, all statements terminated with a .) to work. Languages like Standard ML (what I’m working on) and Haskell have a much more restricted notation parsing system (“fixity”), which only handles infix operators. That can be accomplished in a very straightforward way through a hand written shift reduce parser, so there’s not much to see there. Since I’m implementing my own compiler as a single person project, I have always wondered if a similar mixfix parsing feature can be added to the language, but it wasn’t too obvious how to do so on top of my head.

So time to steal how everyone else does it. The way Agda does it is more akin to what a normal person would do: it does a single round of lexing and parsing first, where these notation expressions are first interpreted as a list of application expression. Then it performs a second round of parsing to handle these mixfix parses. The internal comments in the code very nicely pointed me to (Johnson, 1995), which I thought was a really cool paper, and is what I’ll explain today.

Before this, a huge disclaimer:

I am not computability theorist. Like most people working on languages, I only really care about issues on language design. Issues on grammar design are something I prefer not to think about if possible. As such, my knowledge on computability theory is dated, so read this at your own risk. The results I’m writing about are almost entirely from the Johnson paper, but for reasons which I will sometimes explain, the presentation is different. I will present all this in Standard ML (the original paper presents this in Scheme).

Top Down Parsing

Like (probably) most computer science students, I am very significantly more well versed in bottom up parsing than top down parsing. This is because bottom up parsers are much more widely used, and there is a reason for this. To see the problems with top down parsing, let us look at an example, of parsing an expression with only natural numbers and addition:

E ::= n (where n is a natural number) | E + E

We stick to the convention that uppercase expressions are metavariables/nonterminals and anything else are just terminals/atom/axioms. This syntax is ambiguous, in that an expression 1 + 2 + 3 can have two different parses: (1 + 2) + 3 and 1 + (2 + 3). We shall presume that we want all possible parses.

Top down parsing consists of checking whether the string matches some rule for the grammar, and if so, apply it. Do this for all rules (since we want all possible parses). This makes writing the code very simple, since each parsing rule can be very directly modelled as a function. The function takes in a token stream (to simplify, suppose it is a list of strings), and then returns a list of all possible parses, and the rest of the stream that is remaining. Hence, it is not too far-fetched to quickly hack up this piece of code:

structure Fail = struct
  type stream = string list
  datatype exp =
    ExpNat of int
  | ExpPlus of exp * exp

  type t = stream -> (exp * stream) list
  
  fun parse_single s (f : string -> exp option) =
    case s of
      nil => nil
    | h :: t => 
        let
        in
          case f h of
            NONE => nil
          | SOME v => [ ( v , t ) ]
        end

  fun bind (a : t) (f : exp -> t) : t =
    fn s =>
      let 
        val parses = List.map (fn (e , s) => f e s) (a s)
      in 
        List.concat parses
      end

  fun either (l : t list) : t =
    fn s =>
    let
      val parses = List.map (fn f => f s) l
    in
      List.concat parses
    end

  fun parse_num s = parse_single s 
    (fn v =>
      case Int.fromString v of 
         NONE => NONE
       | SOME v => SOME (ExpNat v))

  and parse_plus s =
    bind parse_exp
    (fn e1 => fn s =>
      case s of
        nil => nil
      | h :: t =>
          if h = "+"
          then nil
          else 
            bind parse_exp
            (fn e2 => fn s => [ (ExpPlus (e1 , e2) , s) ]) 
            t
    ) s

  and parse_exp s =
    either [ parse_num , parse_plus ] s
end

In this code, parse_single, bind, and either are just simple parser combinators: parse_single just parses a single token and converts it to an expression, bind is basically just sequencing (parse a first, then parse f), and either is basically just consider all these possible parsing rules. The implementation of parse_num, parse_plus, and parse_exp are self explanatory then based on exactly what the grammar says).

In the original paper, Johnson calls the functions parse_single, bind, either different names: terminal, seq, alt. The types differ because here we are not just concerned about whether the given token stream has valid parses (perhaps that is more useful in a NLP context?). We care a lot about exactly what syntax tree the parsing ended up producing. Hence we are not at all concerned about “recognizers” in the paper, and the syntax we deal with are all CFGs anyway.

The nice part is that this code doesn’t take very long to write from scratch. The bad part is that it doesn’t work (and why many people stick to bottom up parsing). There are two big problems here:

Left Factoring

In order to parse E, I need to try and parse E + E. But in order to parse E + E, I need to parse E first, then +, then E again. This procedure does not terminate because the leftmost token refers to a parsing procedure back to itself, which makes no progress here (still at the beginning of the stream). You can simply see in the code that is doesn’t terminate.

This is a very significant problem, and generally a strong enough reason to not use top down parsing (at least certainly not so liberally). On top of my head, these are the ways to make it work that I know of:

Just stick with bottom up parsing, which does not have this problem.
Use a lookahead parser to determine how many tokens to consume for parsing the subexpression (aka, LL(k))
Refactor the syntax, for example, if my syntax has delimiters (say, wrapped in [ ] always), then there is in fact neither parsing ambiguity nor left factoring issues:
```
E ::= n | [ E + E ]
```
This is because just by looking at [ I can uniquely make progress in top down parsing since I know exactly where the subexpression starts and ends.
Alternatively, I can refactor the grammar a bit to make this work (aka. “left factoring”), by breaking down which rules should be handled when.

(It just occurred to me when writing that I actually don’t know of a general left factoring algorithm. I just know how to do it with concrete examples. One day remind me to look into it.)

In any case, 2, 3, 4, requires you to know the whole syntax ahead of time, which you don’t for mixfix notation parsing, because it is user defined. More importantly, 2, 3, 4, are simply not always possible to do (2 might not have a finite lookahead, 3, 4, can be constrained by what the grammar looks like). Because these are global syntax transformations, modelling them as functions no longer work, because functions are negative types (aka, you cannot break up a function to look at it inside), so you basically end with something more similar to a parser generator with a top down strategy rather than a parser combinator anyway if you pick any of the approaches here.

A note on practicality: the reality is that most parser generators today actually generate code rather than generate internal representations on the fly that can be easily linked to your compiler, so unless your parser generator library supports constructing these rules on the fly, for practical purposes, it’s a no go, at least for mixfix parsing.

Time complexity

The other issue is time complexity for a naive approach like this. Suppose this is the grammar that I am parsing:

T ::= T list
    | T tree
    | int

(aka the grammar is basically int postfixed with list or tree multiple times.)

Top down parsing in fact requires you to guess which rule to use for parsing. If I want to parse something like int list tree tree tree, what will happen is: I will guess first that my rule is T list, this requires me to parse T = int list tree tree first, and then check the next token is list (which it is not, it is tree). So my guess here missed. Then I have to guess that my rule is T tree, which requires me to parse T = int list tree tree a second time, and then check that the next token is tree, which worked this time. Given n tokens, T(n) = c1T(n - 1) + c2T(n - 1) + c3, which solves to T(n) = O(2 ^ n), aka exponential time. (Note that because I am actually getting all parses, the actual time complexity is even worse than this.)

This issue has a much more straightforward solution: we can always memoize at a location completed parses of T. When we need it a second time we can just pull it out. That is, assuming that parsing for T terminates in the first place (aka left factoring).

Johnson

So are parser combinators doomed? When I was much younger in college, I thought that due to left factoring issues, parser combinators are not practical to use in general parsing (not to mention that dealing with left factoring is simply painful, I’m sure anyone who had the displeasure of writing parsers in general can agree).

I am glad to see that I am proven wrong by (Johnson, 1995): there is a surprisingly straightforward way of working around both issues at the same time. It involves two very simple tricks that everyone learned in freshman programming class, but used them in a surprising way: CPS and memoization.

In short, the idea is:

The parser combinator can be written in continuation passing style, which transforms it into something very similar to a parser table. We send the results to the continuation one at a time. Aka, previously if there are no valid parse results, the continuation will never be called, and if there are multiple parse results, we send them to the continuation one at a time, so the continuation gets called multiple times, etc.
Memoization takes care of both left factoring (through a simple termination proof), and time complexity, at the same time.

Trick 1: CPS conversion (by hand)

The first step is straightforward: we change the type of the combinator. Now, instead of returning the parse results, we call the provided continuation each time.

type t = stream -> ((exp * stream) -> unit) -> unit

You might be wondering why I’m using the unit type rather than the void type or 'a here. The reason here is that the function can return if there are no valid parses in the rule. It does not return only if parsing succeeds. See parse_single below.

Then to rewrite the combinators:

fun parse_single (f : string -> exp option) s k =
  case s of
    nil => ()
  | h :: t =>
      let
      in
        case f h of
          NONE => ()
        | SOME v => k ( v , t )
      end

fun bind (a : t) (f : exp -> t) : t =
  fn s => fn k =>
    a s (fn (exp , s) => f exp s k)

fun either (l : t list) : t =
  fn s => fn k =>
  List.app (fn f => f s k) l

Which is actually mildly easier to write now, since the type really just tells you everything. However, since we are writing code in CPS form by hand, and the rest of your program probably isn’t in CPS form, we will unfortunately need to use a reference to escape out of CPS form, and also collate all parses sent over.

fun run (p : t) : stream -> (exp * stream) list =
  fn s =>
  let
    val results = ref []
    val () = p s (fn result => results := result :: (!results) )
  in
    !results
  end

So far no problems, but this does not solve the inherent issue of left factoring. This brings us to the next trick.

Trick 2: Memoization

So this is the really tricky part of the paper: how do you memoize for this problem? Usually for programs not in CPS form, we think of something like:

functor Memoize (structure A :
  sig 
    type t
    val eq : t -> t -> bool
  end) = struct
    
    local open A in

    fun memo (f : t -> 'a) : t -> 'a =
      let 
        val table = ref []
      in
        fn v =>
          case List.find (fn (x , _) => A.eq x v) (!table) of
            SOME (_ , res) => res
          | NONE =>
              let
                val result = f v
              in
                table := (v , result) :: (!table) ;
                result
              end
      end

    end
        
end

(exactly what it says, if a value exists in the table, look it up and use it directly, otherwise compute the function value and write it to the table, then return. There are obviously different ways of writing this where you can do lookup faster with a better data structure, if say A.t can also be hashed instead of using a list, etc.)

But what about for the CPS converted combinators here? Clearly we shouldn’t be caching the “return value”, since that value gets passed to the continuation, and possibly multiple times.

First, we handle the arguments. We determine whether two token streams are the same by comparing “where we are” in the stream, aka the current stream position. We assume that all tokens have unique positions, which can be easily done by just tacking some integer position on the token stream. Hence we can memoize results of subparses by only referring to their positions in the stream (which is nice, since it does not require an equality function on the input token type at all). This is also nice since we are no longer restricted to lists for memoization, as the same trick works for coinductive streams as well.

type token = string * int
type stream = token list

Then what about the results? This is the tricky part. Per (Johnson, 1995), we will also need to maintain a list of continuations that are passed in as well, and call them at the appropriate time. Specifically: