I’ve got two parser combinators today for you to play with, both whipped up this evening from pieces of earlier experiments.
Parser 5: PEG grammar without memoization
This is loosely based on Daan Leijen and Erik Meijers’ 2001 paper [1]. I say loosely as it lacks all the important elements demonstrated in the paper — efficiency, useful error messages, etc. — but it is a monadic parser.
(Why do you want a monadic parser? This author explains better than I)
Here, the type of a parser is ‘string -> maybe (a, string)’, where ‘a’ is your parse tree, and the result string is the remaining input. If the parse fails, None is returned instead of a tuple.
# simplified monadic (PEG) parser. no memoization, some backtracking.
# parser :: str -> maybe (value, str)
def ret (value):
return lambda s: (value, s)
def bind (p, *fs):
def parse (s):
res = p(s)
for f in fs:
res = res and f(res[0])(res[1])
return res
return parse
These are the monad operators return and bind. ‘Ret(v)’ produces a parser that consumes and empty string, producing the parse tree ‘v’, but the juice is in ‘bind’.
‘Bind(p, f)’ glues together a parser and a function. A parser ‘p’ consumes some input and produces a parse tree ‘v’. This value ‘v’ is then passed to a function ‘f’, returning a new parser to consume the remaining input. That is, ‘f’ chooses, based on the parse tree so far, which language to use to interpret the rest of the input.
This is extremely powerful, and permits monadic parsers the ability to recognize classes of context sensitive grammars, such as parsing XML, or loading new languages on the fly as you parse (Perl6, anyone?) or perhaps backtracking to recover if an evaluation of the parse tree fail (Perl5, anyone?).
It can also inhibit various parser optimizations if you’re not careful.
never = lambda s: False
def alt (*ps):
def parse (s):
for p in ps:
res = p(s)
if res: return res
return False
return parse
‘never’ is the MonadZero value, a sort of additive identity for the ‘alt’ MonadPlus operator. ‘alt’ produces a parser that will recognize any of the languages passed to it. ‘never’ recognizes nothing, so e.g. ‘alt(p, never)’ is equivalent to ‘p’ in the same way as ‘bind(p, ret)’ is equivalent to ‘p’.
You can ignore these equivalences for now — they’re useful though when it comes time to optimize grammars, but that’s not on today’s agenda.
From here on out, we have a handful of various other operators:
empty = ret(None)
def then (p, *ps):
return bind(p, lambda x: bind(then(*ps), lambda rest: ret([x]+rest)) if ps else ret([x]))
def repeat (p, res=[]):
return alt( bind(p, lambda v: repeat(p, res+[v])),
ret(res) )
def repeat1 (p):
return bind(p, lambda v: repeat(p, [v]))
def pred (f):
def parse (s):
if s and f(s[0]): return (s[0], s[1:])
else: return False
return parse
def option (p):
return alt(p, empty)
def repn (p, n):
return then(*([p]*n))
def char (ch):
return pred(lambda c: c==ch)
def literal (s):
return then(*(char(c) for c in s))
def delay (fn):
return bind(empty, lambda _:fn())
end = lambda s: False if s else (None, s)
‘Empty’ recognizes the empty string, ‘then’ chains parsers in sequence, ‘repeat’ is Klein star, ‘repeat1’ likewise plus. ‘option’ recognizes one or zero of a language, while ‘repn(p,n)’ recognizes exactly n occurrences of p.
‘Pred(fn)’ consumes one character ‘c’ for which ‘fn(c)’ is true (e.g. using str.isalpha). ‘char(c)’ recognizes the character ‘c’, and ‘literal(s)’ the sequence of characters in the string ‘s’. ‘delay(lambda:p)’ recognizes ‘p’ — and is merely an artifact of strict evaluation, as ‘p’ might not yet be defined. Finally, ‘end’ recognizes only the end of the input string; not very useful in today’s example, but it can be handy in parsers that perform lookahead.
You may not have noticed as it whizzed by, but this parser doesn’t perform full backtracking. The ‘alt’ operator attempts to parse using the first language, eagerly, and if it succeeds, returns immediately. If a subsequent parse operations should fail, no backtracking is performed to consider the alternate path. As such, this parser combinator library might be classified as a parsing expression grammar. Contrast this with context free grammars, where both sides of the alternation are considered equal for backtracking / lookahead purposes.
Size: ~60 lines
Parser 7: CSG with partial memoization
This next parser adds a couple tweaks. First, the type of the parser has changed to ‘string -> lcons of (a, string)’, where lcons is a kind of lazy list.
class lcons (object):
def __init__ (self, iter): self.iter = iter; self.value = None
def force (self):
if self.iter:
try:
self.value = (self.iter.next(), lcons(self.iter))
except StopIteration:
self.value = None
self.iter = None
return self.value
def empty(self): return self.force() == None
def head (self): return self.force()[0]
def tail (self): return self.force()[1]
def __iter__ (self):
while not self.empty():
yield self.head()
self = self.tail()
def memoize(fn, _memo=dict()):
def replc (s):
k = (fn, s)
if k not in _memo:
_memo[k] = lcons(iter(fn(s)))
return _memo[k]
return replc
‘lcons’ only role is ensure that an iteration is computed only once; otherwise, we’ll be using Python iterators. Speaking of memoization, we’ll also memoize of our monad parser so that we don’t recompute them on backtrack. This … will consume quite a bit of memory, but will reduce the backtracking costs in some key places. You can disable this with otherwise no change in behavior.
def ret (value):
return lambda s: [(value, s)]
def bind (p, f, *fs):
@memoize
def parse (s):
for res1 in p(s):
for res2 in f(res1[0])(res1[1]):
yield res2
if fs: return bind(parse, *fs)
return parse
def alt (*ps):
@memoize
def parse (s):
for p in ps:
for res in p(s):
yield res
return parse
def then (p, *ps):
more = then(*ps) if ps else None
return bind(p, lambda x: bind(more, lambda rest: ret([x]+rest)) if ps else ret([x]))
The monad parser combinators have received a facelift. ‘bind(p,f)’ now evaluates each possible parse of ‘p’, passing the result to ‘f’ and then enumerating the results from that parser, lazily yielding them to be memoized by ‘lcons’.
‘Alt’ has similarly been updated — each possible parse result from each parser is yielded in turn.
‘Then’ gets a minor performance improvement. As we’re memoizing parse results, we get a boost by reusing the same tail parser each time.
‘Ret’, ‘never’, ‘end’, ‘char’, and ‘pred’ are each updated to return arrays of results, but are otherwise unchanged. All the other parser combinators remain as before.
As a result of these changes, parser7 now supports nearly the class of context free grammars (CFG), save its inability to handle left recursion (perhaps the techniques in [2] could fix that). It can even handle ambiguous grammars, returning a parse forest instead of a parse tree. Of course, nothing is free — parser7 is slower than parser5, and most useful CFGs can be rewritten as PEGs with some effort.
Above and beyond CFGs, this parser continues to provide monadic bind, so can continue to parse a number of useful languages from the class of context sensitive languages. For example, the mini-xml parser in the samples bellow runs great under both libraries.
It’s a memory hog, but it’s also ~90 lines of vanilla Python.
Full source, samples, references
I’ve put the parser combinators and some samples up on codepad.org so you can get a feel for what the output looks like. It’s not pretty, but it’s pretty well structured:
- parser5: http://codepad.org/P5l2l6dm
- parser7: http://codepad.org/pmpqp1wI
[1] Parsec: Direct Style Monadic Parser Combinators For The Real World. Daan Leijen, Erik Meijer (2001).
[2] Packrat Parsers Can Support Left Recursion. Alessandro Warth, James R. Douglass, Todd Millstein (2007).