Using ANTLR to parse boolean queries

December 23rd, 2009 by Andy Leave a reply »

ANTLR is a well-known parser generator.  You supply it with a grammar and it builds a lexer and parser in the programming language of your choice.   I’ve just spent a few hours getting to grips with ANTLR basics, so I thought I’d document it for future reference.

My basic requirement is to transform an end-user boolean query to an XML document.   A simple example of a query I need to parse is:

john AND (joe OR sue)

I want the generated parser to produce the following syntax tree for this query:

AND
      john
      OR
            joe
            sue

It should then be trivial to walk the tree and write the XML document I need.

An ANTLR grammar is specified in a text file with a .g extension.   Here’s how my SimpleBoolean.g starts:

grammar SimpleBoolean;

options
{
  language = CSharp2;
  output = AST;
}

I’m specifying that the generated code should be C#, and that the parser should build an abstract syntax tree (AST).

Next I specify the lexer rules:

LPAREN : '(' ;
RPAREN : ')' ;
AND : 'AND';
OR : 'OR';
WS :  ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}  ;
WORD :  (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' ))*;

These rules are used to turn the source text into a stream of tokens.  So parentheses and operators are special tokens, and anything else is either whitespace (which is skipped) or a word.

Finally I specify the parser rules:

expr : andexpr;
andexpr : orexpr (AND^ orexpr)*;
orexpr : atom (OR^ atom)*;
atom : WORD | LPAREN! expr RPAREN!;

This is the confusing bit.  The chain of expressions (expr –> andexpr –> orexpr) is used to signify precedence.  In this case I’ve made OR higher precedence than AND.   Also notice that andexpr (for example) does not require an AND – it is optional.  This is why the grammar supports an expr containing only an OR, even though expr is defined in terms of andexpr.

The symbol suffixes (^ on the operators and ! on the parentheses) are there to direct the AST generation.  ^ signifies a tree branch, whereas ! indicates something that should be omitted from the tree.   Anything else is a leaf.   (The parentheses are omitted because they are redundant – the structure of the tree gives the order of evaluation implicitly.)

So SimpleBoolean.g is now complete, and all that remains is to ask ANTLR to generate a C# parser from it.  I used the ANTLRWorks IDE for this, but you could use the command-line.   Once the C# files are generated, and the ANTLR .NET runtimes files added to a C# project, we’re ready to write some C#.   Here’s some code that parses a query and then walks the syntax tree and writes the nodes to the console:

class Program
{
    static void Main(string[] args)
    {
        ANTLRStringStream expression = new ANTLRStringStream("john AND (joe OR sue)");
        var tokens = new CommonTokenStream(new SimpleBooleanLexer(expression));
        var parser = new SimpleBooleanParser(tokens);

        SimpleBooleanParser.expr_return ret = parser.expr();
        CommonTree ast = (CommonTree)ret.Tree;
        Print(ast,0);
    }

    static void Print(CommonTree tree, int level)
    {
        Console.WriteLine(new string('\t', level) + tree.Text);
        if (tree.Children != null)
            foreach (CommonTree child in tree.Children)
                Print(child,level+1);
    }
}

which prints the following:

AND
        john
        OR
                joe
                sue

For a more comprehensive ANTLR tutorial, see this series of blog posts.

Advertisement

1 comment

  1. Umer says:

    wow, such a handy tutorial…
    thanks for sharing