I’ve been tinkering with monadic parser combinators combined with continuations (see Monads, Mother of All), but my latest toy simplifies all that by using PEG grammars. PEGs and CFGs are nearly interchangeable for most parsing of computer languages, and PEGs can be parsed in linear time and space by packrat parsing: parsing with backtracking, where memoization bounds the cost of backtracking to a constant factor because no parsing work is ever repeated.
The end result, succinctly:
@grammar
def number():
    x = digit()
    while test(digit):
        x = x*10 + digit()
    return x

print parse(number, "15")
Not recursive descent
When folks talk of ‘writing parsers by hand’, they generally mean recursive descent parsing. You write functions that consume particular pieces of input, and you have to be really careful with lookahead.
In recursive descent, you first write:
def number(inputstream, acc=0):
    # accumulate left to right, so "123" comes out as 123
    acc = acc*10 + digit(inputstream)
    if peek(inputstream) in _digits:
        return number(inputstream, acc)
    else:
        return acc
And then you do some manual work for your literal parsing:
_digits = "0123456789"

def digit(inputstream):
    if peek(inputstream) in _digits:
        return ord(readchar(inputstream)) - ord('0')
    else:
        raise SyntaxError("expected a digit")
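The peek and readchar helpers aren’t spelled out above; one minimal way to write them, assuming the input stream is just a string plus a cursor (my assumption, not code from the post), is:

class InputStream(object):
    """The simplest stream that supports peek/readchar: a string and a position."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

def peek(inputstream):
    # Look at the next character without consuming it; return '\0' at end of
    # input so that membership tests like `peek(s) in _digits` come out False.
    if inputstream.pos < len(inputstream.text):
        return inputstream.text[inputstream.pos]
    return '\0'

def readchar(inputstream):
    # Consume and return the next character.
    c = inputstream.text[inputstream.pos]
    inputstream.pos += 1
    return c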
So your leaf functions get a little messy, and you write a lot of specialized lookahead logic everywhere.
Not combinators
Parser combinators can express a wide variety of parsers, including Packrat parsing, so no more worries about lookahead, consuming too much input, etc. I’m leaving combinators behind, but it’s where I started and I want to paint a picture of the evolution.
In a parser combinator approach, you build parsers for little bits of the language and piece them together, creating bigger parsers:
digit = literal('1') | literal('2') | literal('3') ...
number = digit + repeat(digit)
The literal operator is a parser that recognizes a single string. The or (|) operator takes two parsers and produces a new parser that recognizes either the left-hand or the right-hand side. Etc.
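To make that concrete, here’s a rough sketch of how such combinators might be implemented in Python, assuming each parser is a function from (text, pos) to (value, new pos) that raises on failure; this is just an illustration of the style, not the actual library:

class ParseFailure(Exception):
    pass

def literal(s):
    # A parser that recognizes exactly the string s.
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        raise ParseFailure()
    return parse

def either(p1, p2):        # the '|' operator above
    # Try p1; if it fails, try p2 from the same position.
    def parse(text, pos):
        try:
            return p1(text, pos)
        except ParseFailure:
            return p2(text, pos)
    return parse

def sequence(p1, p2):      # the '+' operator above
    def parse(text, pos):
        v1, pos = p1(text, pos)
        v2, pos = p2(text, pos)
        return (v1, v2), pos
    return parse

def repeat(p):
    # Zero or more repetitions, collecting whatever p produced.
    def parse(text, pos):
        values = []
        while True:
            try:
                value, pos = p(text, pos)
            except ParseFailure:
                return values, pos
            values.append(value)
    return parse

digit = reduce(either, [literal(c) for c in "0123456789"])   # '0' | '1' | ... | '9'
number = sequence(digit, repeat(digit))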
The problem with this approach shows up when you want to perform semantic actions, that is, when you “extract information from what was parsed.” Take the code above, and suppose you want to compute the numeric value of the recognized number: if you attach the semantic action to the ‘number’ parser, you’ll have to re-examine the matched string all over again to compute that value. You might instead attach multiple smaller semantic actions, but then you have to define multiple functions and flow state between them… it can get messy.
Not monads
Monadic parsers can clean up semantic actions by allowing you to ‘weave’ them between parsing. I think Parsec pioneered* this technique. Parsing number as before, I might write (in a dreadfully imperative style):
number = (do
    x <- digit
    let more x = (do
            y <- digit
            more (x*10 + y)
          ) | Return x
    in more x)
The number parser (on success) returns a parse result containing an integer. The ‘do’ syntax weaves semantic actions between parsing steps using a combinator called bind (or >>= as an infix operator), which lets you make parsing choices in your semantic actions. Do syntax unwinds the whole thing into lambda expressions. In Haskell, this desugars to:
number =
    bind digit (\x ->
        let more x = (
                bind digit (\y ->
                    more (x*10 + y))
              ) | Return x
        in more x)
Where Return(x) is just empty with the semantic action of returning x.
However, this still generally requires a compiler that can do the ‘do’ syntax transformation for you (you didn’t think that Python code looked remotely readable, did you?). Otherwise, you’re left with a mess of lambdas and nesting.
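For a flavour of that mess, here’s roughly how the same chain of binds reads in Python without do-notation, reusing the either and ParseFailure helpers from the combinator sketch earlier; bind, Return, and digit_value here are made-up names, purely for illustration:

def bind(parser, callback):
    # Run parser, then feed its value to callback, which must return the next parser.
    def parse(text, pos):
        value, pos = parser(text, pos)
        return callback(value)(text, pos)
    return parse

def Return(x):
    # Succeed without consuming input, producing the value x.
    def parse(text, pos):
        return x, pos
    return parse

def digit_value():
    # A digit parser that yields the digit's numeric value rather than the string.
    def parse(text, pos):
        if pos < len(text) and text[pos].isdigit():
            return int(text[pos]), pos + 1
        raise ParseFailure()
    return parse

def more(x):
    return either(bind(digit_value(), lambda y: more(x*10 + y)), Return(x))

number = bind(digit_value(), lambda x: more(x))
# number("15", 0) evaluates to (15, 2) under this convention.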
Not continuations either
In a previous post, I messed around with using generators to simulate the continuation monad, but it had some nasty perf consequences, and still wasn’t quite as clean as I wanted.
However, PEG parsers are more constrained — they don’t require suspension, so yield isn’t needed.
Imperative Packrat Parsing
I’m using a Python decorator to squirrel away state, such as the input stream and the memo table. Also, there’s a predicate test that uses a parser function for lookahead, so no more custom lookahead logic:
@grammar
def number():
    x = digit()
    while test(digit):
        x = x*10 + digit()
    return x
There’s no backtracking wasting effort here. Any call to a decorated function will reuse previous results if available, fetching from a hash table.
Also, the loop looks like a proper loop now.
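The decorator itself isn’t spelled out above, but here’s a rough sketch of how grammar, test, and parse could fit together, assuming module-level parser state and a memo table keyed by (rule, position); treat this as an illustration of the idea, not the actual implementation:

class ParseFailure(Exception):
    pass

class _State(object):
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.memo = {}   # (rule, start position) -> (value, end position)

_state = None

def grammar(rule):
    # Packrat memoization: each (rule, position) pair is evaluated at most once.
    # (A fuller version would memoize failures too.)
    def wrapper():
        key = (rule, _state.pos)
        if key in _state.memo:
            value, _state.pos = _state.memo[key]
            return value
        start = _state.pos
        try:
            value = rule()
        except ParseFailure:
            _state.pos = start
            raise
        _state.memo[key] = (value, _state.pos)
        return value
    return wrapper

def test(parser):
    # Lookahead predicate: does parser succeed here? Never consumes input.
    saved = _state.pos
    try:
        parser()
        return True
    except ParseFailure:
        return False
    finally:
        _state.pos = saved

def parse(start_rule, text):
    global _state
    _state = _State(text)
    return start_rule()

@grammar
def digit():
    # Consume one decimal digit and return its numeric value.
    if _state.pos < len(_state.text) and _state.text[_state.pos].isdigit():
        value = ord(_state.text[_state.pos]) - ord('0')
        _state.pos += 1
        return value
    raise ParseFailure()

With digit written that way, the number rule above runs unchanged, and parse(number, "15") returns 15; the digit matched during each test lookahead is fetched from the memo table rather than re-parsed.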
Best of all, it feels as easy as a recursive descent parser, but now I can start writing combinators. For instance, the PEG preferential choice combinator Or can be written succinctly:
def Or(parser1, parser2):
    return grammar(lambda: parser1() if test(parser1) else parser2())
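Other PEG operators fall out of test the same way. For instance, a zero-or-more combinator could be sketched as (again, an illustration on top of the same assumed grammar/test helpers, not code from the repo):

def Many(parser):
    # Zero or more repetitions, PEG-style: greedy, never backs out of a match.
    def rule():
        results = []
        while test(parser):
            results.append(parser())
        return results
    return grammar(rule)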
Working code
I’ve got a working demo up at https://bitbucket.org/xtoast/blog-demos/ under the ‘peg2’ folder. Give it a try:
[peg2] $ python ./alg.py
?? 1+1
--> 2
?? 2+3*4+5
--> 19
?? 100*9
--> 900
??
Future work
As I mentioned, I’d like to have combinators too, and special case them to avoid unnecessary exception paths. I’d also like to write rule alternates as ‘overloads’, such as:
@grammar              # mapping to EBNF grammar:
def number():         #
    digit()           #   number = digit number
    number()          #

@grammar              #
def number():         #
    digit()           #          | digit
Or maybe the decorator could provide an amb() operator to explore non-deterministic choice:
@grammar
def number():
    if amb(): digit(); number()
    else: digit()
(* at the very least, Parsec demonstrates Monadic parsing well)