Using ANTLR to parse boolean queries

December 23rd, 2009 by Andy 1 comment »

ANTLR is a well-known parser generator.  You supply it with a grammar and it builds a lexer and parser in the programming language of your choice.   I’ve just spent a few hours getting to grips with ANTLR basics, so I thought I’d document it for future reference.

My basic requirement is to transform an end-user boolean query to an XML document.   A simple example of a query I need to parse is:

john AND (joe OR sue)

I want the generated parser to produce the following syntax tree for this query:

AND
      john
      OR
            joe
            sue

It should then be trivial to walk the tree and write the XML document I need.

An ANTLR grammar is specified in a text file with a .g extension.   Here’s how my SimpleBoolean.g starts:

grammar SimpleBoolean;

options
{
  language = CSharp2;
  output = AST;
}

I’m specifying that the generated code should be C#, and that the parser should build an abstract syntax tree (AST).

Next I specify the lexer rules:

LPAREN : '(' ;
RPAREN : ')' ;
AND : 'AND';
OR : 'OR';
WS :  ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}  ;
WORD :  (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' ))*;

These rules are used to turn the source text into a stream of tokens.  So parentheses and operators are special tokens, and anything else is either whitespace (which is skipped) or a word.

Finally I specify the parser rules:

expr : andexpr;
andexpr : orexpr (AND^ orexpr)*;
orexpr : atom (OR^ atom)*;
atom : WORD | LPAREN! expr RPAREN!;

This is the confusing bit.  The chain of expressions (expr –> andexpr –> orexpr) is used to signify precedence.  In this case I’ve made OR higher precedence than AND.   Also notice that andexpr (for example) does not require an AND – it is optional.  This is why the grammar supports an expr containing only an OR, even though expr is defined in terms of andexpr.

The symbol suffixes (^ on the operators and ! on the parentheses) are there to direct the AST generation.  ^ signifies a tree branch, whereas ! indicates something that should be omitted from the tree.   Anything else is a leaf.   (The parentheses are omitted because they are redundant – the structure of the tree gives the order of evaluation implicitly.)

So SimpleBoolean.g is now complete, and all that remains is to ask ANTLR to generate a C# parser from it.  I used the ANTLRWorks IDE for this, but you could use the command-line.   Once the C# files are generated, and the ANTLR .NET runtimes files added to a C# project, we’re ready to write some C#.   Here’s some code that parses a query and then walks the syntax tree and writes the nodes to the console:

class Program
{
    static void Main(string[] args)
    {
        ANTLRStringStream expression = new ANTLRStringStream("john AND (joe OR sue)");
        var tokens = new CommonTokenStream(new SimpleBooleanLexer(expression));
        var parser = new SimpleBooleanParser(tokens);

        SimpleBooleanParser.expr_return ret = parser.expr();
        CommonTree ast = (CommonTree)ret.Tree;
        Print(ast,0);
    }

    static void Print(CommonTree tree, int level)
    {
        Console.WriteLine(new string('\t', level) + tree.Text);
        if (tree.Children != null)
            foreach (CommonTree child in tree.Children)
                Print(child,level+1);
    }
}

which prints the following:

AND
        john
        OR
                joe
                sue

For a more comprehensive ANTLR tutorial, see this series of blog posts.

Unity’s VirtualMethodInterceptor with internal classes

December 13th, 2009 by Andy 2 comments »

Unity includes some basic AOP functionality.  One of the interception mechanisms it offers is VirtualMethodInterceptor, which works by building a derived class at runtime and overriding the virtual methods of the target class.  This approach has some obvious limitations, but it seemed like a good way to handle some simple logging & profiling requirements that I have with my current project.

The first problem I hit is that the current release of Unity (1.2) has a huge bug – VirtualMethodInterceptor doesn’t work with classes that have parameters in their constructors.  Fortunately downloading and building the latest source was enough to get around that problem.

The next problem is that VirtualMethodInterceptor requires the target class to be public.  This is a common problem with code that builds derived classes at runtime.  Moq has the same issue, for example.  The standard workaround is to use the InternalsVisibleTo assembly attribute to give the dynamic assembly access to internal classes in the target assembly.   So after a little digging in the source I found the name of the dynamic assembly, and added this to my assembly:

[assembly: InternalsVisibleTo("Unity_ILEmit_DynamicClasses")]

Unfortunately this is not sufficient.  The Unity code has some validation that insists on the target class being public.  The code is in VirtualMethodInterceptor.cs:

public bool CanIntercept(Type t)
{
    Guard.ArgumentNotNull(t, "t");
    return t.IsClass &&
        (t.IsPublic || t.IsNestedPublic) &&
        t.IsVisible &&
        !t.IsSealed;
}

Removing the checks for IsPublic and IsVisible was enough to finally get everything to work correctly:

public bool CanIntercept(Type t)
{
    Guard.ArgumentNotNull(t, "t");
    return t.IsClass && !t.IsSealed;
}

Hopefully this will be fixed in time for the release of Unity 2.0.

Update: For internal interfaces with InterfaceInterceptor, you need this:

[assembly: InternalsVisibleTo("Unity_ILEmit_InterfaceProxies")]

Managed threads in “whole stack committed” shocker

November 30th, 2009 by Andy No comments »

I was experimenting with thread creation in a managed process, to test the limits, and I noticed something very odd.  Creating a large number of threads was using a surprisingly large amount of memory.  I guessed it must be the thread stacks, but the numbers didn’t fit with my understanding of how thread stacks are allocated.  

In an unmanaged process, by default each thread has 1 Mbyte of address space reserved for its stack, but initially only a small amount of this address space is committed.   You can see this with the excellent VMMAP utility – here’s what it shows for an unmanaged x64 application:

unmanaged_threads_x64

The ‘size’ is the reserved address space – 1024k as expected.  The committed memory is 16k, which is the amount of the address space that actually has physical storage associated with it.   This can grow as required, up to the size of the reserved address space, but usually it won’t get close to that. 

The size of the reserved address space is important in the sense that we need to make sure we don’t run out – 2000 threads would be enough to max out a 32-bit process, for example.   But generally speaking it’s the committed size that has more impact – that number has to be backed by real storage, in RAM or at least the page file.

So, back to the managed process.  This is what VMMAP shows:

managed_threads_x64

Reserved address space is the same as before at 1024k, but the whole thing is committed!  So each thread is using the full 1 Mbyte of physical storage, regardless of how much memory the stack actually needs.   That can’t be right, can it?   Well it seems it is, as Joe Duffy explains.

So, the bottom line is that you should think carefully about creating managed threads that are going to sit idle in your process, for example in a thread pool.   In an unmanaged process those idle threads wouldn’t be using many resources, but in a managed process they’re using a significant chunk of valuable memory.   This is even more important if you have any form of multi-process architecture, perhaps with thread pools in each process. 

If you really need a large number of long-lived threads in your managed process, consider reducing the size of the stack using the appropriate overload of the System.Threading.Thread constructor.