Unity’s VirtualMethodInterceptor with internal classes

December 13th, 2009 by Andy 2 comments »

Unity includes some basic AOP functionality.  One of the interception mechanisms it offers is VirtualMethodInterceptor, which works by building a derived class at runtime and overriding the virtual methods of the target class.  This approach has some obvious limitations, but it seemed like a good way to handle some simple logging & profiling requirements that I have with my current project.

The first problem I hit is that the current release of Unity (1.2) has a huge bug – VirtualMethodInterceptor doesn’t work with classes that have parameters in their constructors.  Fortunately downloading and building the latest source was enough to get around that problem.

The next problem is that VirtualMethodInterceptor requires the target class to be public.  This is a common problem with code that builds derived classes at runtime.  Moq has the same issue, for example.  The standard workaround is to use the InternalsVisibleTo assembly attribute to give the dynamic assembly access to internal classes in the target assembly.   So after a little digging in the source I found the name of the dynamic assembly, and added this to my assembly:

[assembly: InternalsVisibleTo("Unity_ILEmit_DynamicClasses")]

Unfortunately this is not sufficient.  The Unity code has some validation that insists on the target class being public.  The code is in VirtualMethodInterceptor.cs:

public bool CanIntercept(Type t)
    Guard.ArgumentNotNull(t, "t");
    return t.IsClass &&
        (t.IsPublic || t.IsNestedPublic) &&
        t.IsVisible &&

Removing the checks for IsPublic and IsVisible was enough to finally get everything to work correctly:

public bool CanIntercept(Type t)
    Guard.ArgumentNotNull(t, "t");
    return t.IsClass && !t.IsSealed;

Hopefully this will be fixed in time for the release of Unity 2.0.

Update: For internal interfaces with InterfaceInterceptor, you need this:

[assembly: InternalsVisibleTo("Unity_ILEmit_InterfaceProxies")]

Managed threads in “whole stack committed” shocker

November 30th, 2009 by Andy No comments »

I was experimenting with thread creation in a managed process, to test the limits, and I noticed something very odd.  Creating a large number of threads was using a surprisingly large amount of memory.  I guessed it must be the thread stacks, but the numbers didn’t fit with my understanding of how thread stacks are allocated.  

In an unmanaged process, by default each thread has 1 Mbyte of address space reserved for its stack, but initially only a small amount of this address space is committed.   You can see this with the excellent VMMAP utility – here’s what it shows for an unmanaged x64 application:


The ‘size’ is the reserved address space – 1024k as expected.  The committed memory is 16k, which is the amount of the address space that actually has physical storage associated with it.   This can grow as required, up to the size of the reserved address space, but usually it won’t get close to that. 

The size of the reserved address space is important in the sense that we need to make sure we don’t run out – 2000 threads would be enough to max out a 32-bit process, for example.   But generally speaking it’s the committed size that has more impact – that number has to be backed by real storage, in RAM or at least the page file.

So, back to the managed process.  This is what VMMAP shows:


Reserved address space is the same as before at 1024k, but the whole thing is committed!  So each thread is using the full 1 Mbyte of physical storage, regardless of how much memory the stack actually needs.   That can’t be right, can it?   Well it seems it is, as Joe Duffy explains.

So, the bottom line is that you should think carefully about creating managed threads that are going to sit idle in your process, for example in a thread pool.   In an unmanaged process those idle threads wouldn’t be using many resources, but in a managed process they’re using a significant chunk of valuable memory.   This is even more important if you have any form of multi-process architecture, perhaps with thread pools in each process. 

If you really need a large number of long-lived threads in your managed process, consider reducing the size of the stack using the appropriate overload of the System.Threading.Thread constructor.

Improving performance of Parallel30.For

November 18th, 2009 by Andy No comments »

In my last post I presented some code for a .NET 3.x version of the .NET 4.0 Parallel.For method.   For many scenarios the performance of that Parallel30.For method is fine, but it performs very poorly when there are a large number of iterations and a small amount of cpu-bound work to do in each one.   For example:

int size = 20000000;
int[] vector = new int[size];
Parallel.For(0, size, i => vector[i] = i % 100);

Here is the performance comparison for three scenarios:  1) running the code  in a standard (serial) for loop,  2) the .NET 4.0 Parallel.For,  and 3) my Parallel30.For.    The code is running on my dual-proc laptop.

Serial:  400ms
.NET 4.0 Parallel.For:  280ms
My Parallel30.For:  8000ms

Ouch.  My Parallel30.For is 20 times slower than a standard for loop, and nearly 30 times slower than .NET 4.0 Parallel.For.   I suppose it’s really not very surprising when you consider that my code is calling ThreadPool.QueueUserWorkItem for every iteration – that’s 20 million times in this scenario.   In some respects it’s impressive that it only takes eight seconds 🙂

So I wrote another version of the For method, which I’ve called CpuFor, targeted at this sort of scenario.   Instead of submitting a separate task to the thread pool for every iteration, it divides the iterations into ranges based on the processor count, then submits each range as a task to the thread pool.   So instead of 20 million tasks, each executing one iteration, we’ll get 2 tasks, each executing 10 million iterations.  Or on a quad-proc machine, 4 tasks each executing 5 million iterations.   Of course this assumes that on average each iteration takes a roughly equal amount of time, but that’s probably mostly true for this type of scenario.

The results are pretty good:

My Parallel30.CpuFor: 250ms

I’m now (just about) beating .NET 4.0 Parallel.For.   Though I had to go pretty specialised to do it, which makes the .NET 4.0 Parallel.For all the more impressive.   I must crack it open with Reflector some time 🙂

Anyway, here’s the code:

public static void CpuFor(int fromInclusive, int toExclusive, Action<int> action)
    using (var doneEvent = new ManualResetEvent(false))
        int iterations = (toExclusive - fromInclusive);
        int iterationsCompleted = 0;
        Exception actionEx = null;
        int processors = Environment.ProcessorCount;

        WaitCallback iterationCode =
            (arg) =>
                var range = (KeyValuePair<int, int>)arg;
                    int from = range.Key;
                    int to = range.Value;
                    for (int i = from; i < to; i++)
                catch (Exception ex)
                    actionEx = ex;

                int completed = Interlocked.Add(ref iterationsCompleted, range.Value - range.Key);
                if (completed == iterations)

        int rangeSize = Math.Max(1, iterations / processors);
        for (int i = fromInclusive; i < toExclusive; i += rangeSize)
            int from = i;
            int to = Math.Min(i + rangeSize, toExclusive);
            var range = new KeyValuePair<int, int>(from, to);
            ThreadPool.QueueUserWorkItem(iterationCode, range);


        if (actionEx != null)
            throw new Exception("Action failed", actionEx);

BTW If you think that the delegate invocation – action(i) – on every iteration must be a performance killer, it doesn’t seem to be.   I wrote another version that eliminates the action() call (at the expense of usability) but the performance improvement was marginal.

Oh, and you can see here My Frugal Business, which is fun and easy to play!