[pro] Is cl-yacc going to cut it?

Scott L. Burson Scott at sympoiesis.com
Sat Feb 5 21:29:39 UTC 2011


On Fri, Feb 4, 2011 at 8:06 AM, Pascal J. Bourguignon
<pjb at informatimago.com> wrote:
>
> I would also note that given that context free languages include regular
> languages, there's also little theorical point in distinguishing a lexer
> from a parser: you can describe the tokens using normal grammar rules.
>
> space-or-comment := { space | comment } .
> comment          := '#' '|' { comment-chars } '|' '#' .
> comment-chars    := '|' not-sharp | not-pipe .
> not-sharp        := space | letter | digit | '|' | '+' | '-' | ... .
> not-pipe         := space | letter | digit | '#' | '+' | '-' | ... .
> identifier       := space-or-comment letter { digit | letter } .
>
> etc, so basically the only lexer you need is READ-CHAR.

Except that as a matter of notational convenience, we normally allow
lexical grammars to be ambiguous, and apply the additional
disambiguating rule that the longest possible match at each point in
the input is to be preferred over shorter ones (thus the token "ifx"
is not the token "if" followed by the token "x").  This rule is not
directly expressible in the context-free formalism.  Also, as shown by
your example, expressing negation is tedious -- you have to enumerate
the complement of the set of characters you wish to exclude.

Boolean grammars solve both these problems.  Let's look at how they work.

The key insight is that we consider nonterminal definitions to be
operations on predicates over strings.  The definition

  A ::= B | C

is fundamentally a disjunction: a string is an A if it's a B, or if
it's a C.  Why not also allow the other boolean operations of
conjunction and negation?  This is what boolean grammars do.  So you
can have rules like

  A ::= B & ~C

which says that a string is an A if it's a B _and_ it's not a C.

If you consider the input to consist not of single characters but of
pairs of (a character and its following character), you can express
the longest-match  rule.  In practice, rather than making you do this
explicitly, implementations like SBP provide a "followed by" notation
that allows you to constrain the success of a production on the
immediately following character.  Thus you can write a production for
"`if' followed by a token break character"  fairly succinctly.

See Megacz' paper, which you can download from here:
http://www.cs.berkeley.edu/~megacz/research/

-- Scott




More information about the pro mailing list