Using ANTLR to parse boolean queries
ANTLR is a well-known parser generator. You supply it with a grammar and it builds a lexer and parser in the programming language of your choice. I’ve just spent a few hours getting to grips with ANTLR basics, so I thought I’d document it for future reference.
My basic requirement is to transform an end-user boolean query to an XML document. A simple example of a query I need to parse is:
john AND (joe OR sue)
I want the generated parser to produce the following syntax tree for this query:
AND
john
OR
joe
sue
It should then be trivial to walk the tree and write the XML document I need.
An ANTLR grammar is specified in a text file with a .g extension. Here’s how my SimpleBoolean.g starts:
grammar SimpleBoolean;
options
{
language = CSharp2;
output = AST;
}
I’m specifying that the generated code should be C#, and that the parser should build an abstract syntax tree (AST).
Next I specify the lexer rules:
LPAREN : '(' ; RPAREN : ')' ; AND : 'AND'; OR : 'OR'; WS : ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ; WORD : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' ))*;
These rules are used to turn the source text into a stream of tokens. So parentheses and operators are special tokens, and anything else is either whitespace (which is skipped) or a word.
Finally I specify the parser rules:
expr : andexpr; andexpr : orexpr (AND^ orexpr)*; orexpr : atom (OR^ atom)*; atom : WORD | LPAREN! expr RPAREN!;
This is the confusing bit. The chain of expressions (expr –> andexpr –> orexpr) is used to signify precedence. In this case I’ve made OR higher precedence than AND. Also notice that andexpr (for example) does not require an AND – it is optional. This is why the grammar supports an expr containing only an OR, even though expr is defined in terms of andexpr.
The symbol suffixes (^ on the operators and ! on the parentheses) are there to direct the AST generation. ^ signifies a tree branch, whereas ! indicates something that should be omitted from the tree. Anything else is a leaf. (The parentheses are omitted because they are redundant – the structure of the tree gives the order of evaluation implicitly.)
So SimpleBoolean.g is now complete, and all that remains is to ask ANTLR to generate a C# parser from it. I used the ANTLRWorks IDE for this, but you could use the command-line. Once the C# files are generated, and the ANTLR .NET runtimes files added to a C# project, we’re ready to write some C#. Here’s some code that parses a query and then walks the syntax tree and writes the nodes to the console:
class Program { static void Main(string[] args) { ANTLRStringStream expression = new ANTLRStringStream("john AND (joe OR sue)"); var tokens = new CommonTokenStream(new SimpleBooleanLexer(expression)); var parser = new SimpleBooleanParser(tokens); SimpleBooleanParser.expr_return ret = parser.expr(); CommonTree ast = (CommonTree)ret.Tree; Print(ast,0); } static void Print(CommonTree tree, int level) { Console.WriteLine(new string('\t', level) + tree.Text); if (tree.Children != null) foreach (CommonTree child in tree.Children) Print(child,level+1); } }
which prints the following:
AND
john
OR
joe
sue
For a more comprehensive ANTLR tutorial, see this series of blog posts.