I’ve been tinkering with monadic parser combinators combined with continuations (see Monads, Mother of All), but my latest toy simplifies all that by using PEG grammars. PEGs and CFGs are nearly interchangeable for most parsing of computer languages, and PEGs can be parsed in linear time and space by packrat parsing: parsing with backtracking, where memoization bounds the cost of backtracking to a constant factor because no parsing work is ever repeated.
The end result, succinctly:
@grammar
def number():
    x = digit()
    while test(digit):
        x = x*10 + digit()
    return x

print parse(number, "15")
Not recursive descent
When folks talk of ‘writing parsers by hand’, they generally mean recursive descent parsing. You write functions that consume particular pieces of input, and you have to be really careful with lookahead.
In recursive descent, you first write:
def number(inputstream, acc=0):
    # accumulate left to right, so "123" comes out as 123
    acc = acc*10 + digit(inputstream)
    if peek(inputstream) in _digits:
        return number(inputstream, acc)
    else:
        return acc
And then you do some manual work for your literal parsing:
_digits = "0123456789"

def digit(inputstream):
    if peek(inputstream) in _digits:
        return ord(readchar(inputstream)) - ord('0')
    else:
        raise SyntaxError("expected a digit")
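The peek and readchar helpers aren’t spelled out above; one minimal way to write them, assuming the input stream is just a string plus a cursor (my assumption, not code from the post), is:

class InputStream(object):
    """The simplest stream that supports peek/readchar: a string and a position."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

def peek(inputstream):
    # Look at the next character without consuming it; return '\0' at end of
    # input so that membership tests like `peek(s) in _digits` come out False.
    if inputstream.pos < len(inputstream.text):
        return inputstream.text[inputstream.pos]
    return '\0'

def readchar(inputstream):
    # Consume and return the next character.
    c = inputstream.text[inputstream.pos]
    inputstream.pos += 1
    return c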
So your leaf functions get a little messy, and you write a lot of specialized lookahead logic everywhere.
Not combinators
Parser combinators can express a wide variety of parsers, including Packrat parsing, so no more worries about lookahead, consuming too much input, etc. I’m leaving combinators behind, but it’s where I started and I want to paint a picture of the evolution.
In a parser combinator approach, you build parsers for little bits of the language and piece them together, creating bigger parsers:
digit = literal('1') | literal('2') | literal('3') ...
number = digit + repeat(digit)
The literal operator is a parser that recognizes a single string. The or (|) operator takes two parsers and produces a new parser that recognizes either the left-hand or the right-hand side. Etc.
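To make that concrete, here’s a rough sketch of how such combinators might be implemented in Python, assuming each parser is a function from (text, pos) to (value, new pos) that raises on failure; this is just an illustration of the style, not the actual library:

class ParseFailure(Exception):
    pass

def literal(s):
    # A parser that recognizes exactly the string s.
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        raise ParseFailure()
    return parse

def either(p1, p2):        # the '|' operator above
    # Try p1; if it fails, try p2 from the same position.
    def parse(text, pos):
        try:
            return p1(text, pos)
        except ParseFailure:
            return p2(text, pos)
    return parse

def sequence(p1, p2):      # the '+' operator above
    def parse(text, pos):
        v1, pos = p1(text, pos)
        v2, pos = p2(text, pos)
        return (v1, v2), pos
    return parse

def repeat(p):
    # Zero or more repetitions, collecting whatever p produced.
    def parse(text, pos):
        values = []
        while True:
            try:
                value, pos = p(text, pos)
            except ParseFailure:
                return values, pos
            values.append(value)
    return parse

digit = reduce(either, [literal(c) for c in "0123456789"])   # '0' | '1' | ... | '9'
number = sequence(digit, repeat(digit))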
The problem with this approach shows up when you want to perform semantic actions, that is, when you “extract information from what was parsed.” Take the code above, and suppose you want to compute the numeric value of the recognized number: if you attach the semantic action to the ‘number’ parser, you’ll have to re-examine the matched string all over again to compute that value. You might instead attach multiple smaller semantic actions, but then you have to define multiple functions and flow state between them… it can get messy.
Not monads
Monadic parsers can clean up semantic actions by allowing you to ‘weave’ them between parsing. I think Parsec pioneered* this technique. Parsing number as before, I might write (in a dreadfully imperative style):
number = (do
    x <- digit
    let more x = (do
            y <- digit
            more (x*10 + y)
          ) | Return x
    in more x)
The number parser (on success) returns a parse result containing an integer. The ‘do’ syntax weaves semantic actions between parsing steps using a combinator called bind (or >>= as an infix operator), which lets you make parsing choices in your semantic actions. Do syntax unwinds the whole thing into lambda expressions. In Haskell, this desugars to:
number =
    bind digit (\x ->
        let more x = (
                bind digit (\y ->
                    more (x*10 + y))
              ) | Return x
        in more x)
Where Return(x) is just empty with the semantic action of returning x.
However, this still generally requires a compiler that can do the ‘do’ syntax transformation for you (you didn’t think that Python code looked remotely readable, did you?). Otherwise, you’re left with a mess of lambdas and nesting.
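For a flavour of that mess, here’s roughly how the same chain of binds reads in Python without do-notation, reusing the either and ParseFailure helpers from the combinator sketch earlier; bind, Return, and digit_value here are made-up names, purely for illustration:

def bind(parser, callback):
    # Run parser, then feed its value to callback, which must return the next parser.
    def parse(text, pos):
        value, pos = parser(text, pos)
        return callback(value)(text, pos)
    return parse

def Return(x):
    # Succeed without consuming input, producing the value x.
    def parse(text, pos):
        return x, pos
    return parse

def digit_value():
    # A digit parser that yields the digit's numeric value rather than the string.
    def parse(text, pos):
        if pos < len(text) and text[pos].isdigit():
            return int(text[pos]), pos + 1
        raise ParseFailure()
    return parse

def more(x):
    return either(bind(digit_value(), lambda y: more(x*10 + y)), Return(x))

number = bind(digit_value(), lambda x: more(x))
# number("15", 0) evaluates to (15, 2) under this convention.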
Not continuations either
In a previous post, I messed around with using generators to simulate the continuation monad, but it had some nasty perf consequences, and still wasn’t quite as clean as I wanted.
However, PEG parsers are more constrained — they don’t require suspension, so yield isn’t needed.
Imperative Packrat Parsing
I’m using a Python decorator to squirrel away state, such as the input stream and the memo table. Also, there’s a predicate test that uses a parser function for lookahead, so no more custom lookahead logic:
@grammar
def number():
    x = digit()
    while test(digit):
        x = x*10 + digit()
    return x
There’s no backtracking wasting effort here. Any call to a decorated function will reuse previous results if available, fetching from a hash table.
Also, the loop looks like a proper loop now.
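The decorator itself isn’t spelled out above, but here’s a rough sketch of how grammar, test, and parse could fit together, assuming module-level parser state and a memo table keyed by (rule, position); treat this as an illustration of the idea, not the actual implementation:

class ParseFailure(Exception):
    pass

class _State(object):
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.memo = {}   # (rule, start position) -> (value, end position)

_state = None

def grammar(rule):
    # Packrat memoization: each (rule, position) pair is evaluated at most once.
    # (A fuller version would memoize failures too.)
    def wrapper():
        key = (rule, _state.pos)
        if key in _state.memo:
            value, _state.pos = _state.memo[key]
            return value
        start = _state.pos
        try:
            value = rule()
        except ParseFailure:
            _state.pos = start
            raise
        _state.memo[key] = (value, _state.pos)
        return value
    return wrapper

def test(parser):
    # Lookahead predicate: does parser succeed here? Never consumes input.
    saved = _state.pos
    try:
        parser()
        return True
    except ParseFailure:
        return False
    finally:
        _state.pos = saved

def parse(start_rule, text):
    global _state
    _state = _State(text)
    return start_rule()

@grammar
def digit():
    # Consume one decimal digit and return its numeric value.
    if _state.pos < len(_state.text) and _state.text[_state.pos].isdigit():
        value = ord(_state.text[_state.pos]) - ord('0')
        _state.pos += 1
        return value
    raise ParseFailure()

With digit written that way, the number rule above runs unchanged, and parse(number, "15") returns 15; the digit matched during each test lookahead is fetched from the memo table rather than re-parsed.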
Best of all, it feels as easy as a recursive descent parser, but now I can start writing combinators. For instance, the PEG preferential choice combinator Or can be written succinctly:
def Or(parser1, parser2):
    return grammar(lambda: parser1() if test(parser1) else parser2())
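Other PEG operators fall out of test the same way. For instance, a zero-or-more combinator could be sketched as (again, an illustration on top of the same assumed grammar/test helpers, not code from the repo):

def Many(parser):
    # Zero or more repetitions, PEG-style: greedy, never backs out of a match.
    def rule():
        results = []
        while test(parser):
            results.append(parser())
        return results
    return grammar(rule)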
Working code
I’ve got a working demo up at https://bitbucket.org/xtoast/blog-demos/ under the ‘peg2’ folder. Give it a try:
[peg2] $ python ./alg.py
?? 1+1
--> 2
?? 2+3*4+5
--> 19
?? 100*9
--> 900
??
Future work
As I mentioned, I’d like to have combinators too, and special case them to avoid unnecessary exception paths. I’d also like to write rule alternates as ‘overloads’, such as:
@grammar              # mapping to EBNF grammar:
def number():         #
    digit()           #   number = digit number
    number()          #

@grammar              #
def number():         #
    digit()           #          | digit
Or maybe the decorator could provide an amb() operator to explore non-deterministic choice:
@grammar
def number():
    if amb(): digit(); number()
    else: digit()
(* at the very least, Parsec demonstrates Monadic parsing well)