Archive for the ‘Threads’ category

Counting objects with Unity

April 30th, 2010

I’m using Unity for the first time on my current project (a WCF service), and the project is now fairly large with a significant amount of container configuration.  I’m taking advantage of lifetime managers to optimise object reuse, and PerThreadLifetimeManager is particularly attractive in a service – objects are reused, so there’s less garbage, but there’s no locking to worry about.
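
As a sketch of what that configuration looks like (the IRepository/Repository names are hypothetical stand-ins for any service dependency), a per-thread registration is just a lifetime manager passed to RegisterType:

```csharp
using System;
using Microsoft.Practices.Unity;

// Hypothetical service types, standing in for real dependencies.
public interface IRepository { }
public class Repository : IRepository { }

class Program
{
    static IUnityContainer container = new UnityContainer();

    static void Main()
    {
        // One Repository instance per thread: repeated resolves on the same
        // thread reuse the instance, and no cross-thread locking is needed.
        container.RegisterType<IRepository, Repository>(new PerThreadLifetimeManager());

        var a = container.Resolve<IRepository>();
        var b = container.Resolve<IRepository>();
        Console.WriteLine(object.ReferenceEquals(a, b));  // same thread, same instance
    }
}
```

Resolves from different threads would each get their own instance, which is what makes the lock-free reuse safe.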

I decided that it might be nice to see counts of the number of objects that Unity is creating for each type, to confirm that my lifetime configuration is optimal.  It turned out to be straightforward to do this with a Unity extension.

Here’s how the extension is used.  First, the extension is added to the container:

    var objectCounterExtension = new ObjectCounterExtension();
    container.AddExtension(objectCounterExtension);


At application shutdown a list of types and their associated instance counts can be output like this:

    IDictionary<Type,int> objectCounts = objectCounterExtension.ObjectCounts;

    foreach (var objectCount in objectCounts.OrderBy(oc => oc.Value))
    {
        Type type = objectCount.Key;
        int count = objectCount.Value;
        Console.WriteLine("{0}: {1}", type, count);
    }


The extension is quite simple, once you know how:

    class ObjectCounterExtension : UnityContainerExtension
    {
        ObjectCounterStrategy objectCounterStrategy;

        protected override void Initialize()
        {
            objectCounterStrategy = new ObjectCounterStrategy();
            Context.Strategies.Add(objectCounterStrategy, UnityBuildStage.Creation);
        }

        public IDictionary<Type, int> ObjectCounts
        {
            get { return objectCounterStrategy.ObjectCounts; }
        }
    }

    class ObjectCounterStrategy : IBuilderStrategy
    {
        Dictionary<Type, int> objectCounts = new Dictionary<Type, int>();

        public void PostBuildUp(IBuilderContext context)
        {
            IBuildKey buildKey = context.BuildKey as IBuildKey;
            if (buildKey == null)
                return;

            Type type = buildKey.Type;

            lock (objectCounts)
            {
                int count;
                if (objectCounts.TryGetValue(type, out count))
                    objectCounts[type] = count + 1;
                else
                    objectCounts[type] = 1;
            }
        }

        public Dictionary<Type, int> ObjectCounts
        {
            get
            {
                lock (objectCounts)
                    return new Dictionary<Type, int>(objectCounts);
            }
        }

        public void PostTearDown(IBuilderContext context) {}
        public void PreBuildUp(IBuilderContext context) {}
        public void PreTearDown(IBuilderContext context) {}
    }

Managed threads in “whole stack committed” shocker

November 30th, 2009

I was experimenting with thread creation in a managed process, to test the limits, and I noticed something very odd.  Creating a large number of threads was using a surprisingly large amount of memory.  I guessed it must be the thread stacks, but the numbers didn’t fit with my understanding of how thread stacks are allocated.  

In an unmanaged process, by default each thread has 1 Mbyte of address space reserved for its stack, but initially only a small amount of this address space is committed.   You can see this with the excellent VMMap utility – here’s what it shows for an unmanaged x64 application:

[VMMap screenshot: unmanaged thread stack – Size 1,024k, Committed 16k]

The ‘size’ is the reserved address space – 1024k as expected.  The committed memory is 16k, which is the amount of the address space that actually has physical storage associated with it.   This can grow as required, up to the size of the reserved address space, but usually it won’t get close to that. 

The size of the reserved address space is important in the sense that we need to make sure we don’t run out – 2000 threads at 1 Mbyte each would be enough to max out the 2 Gbyte user address space of a 32-bit process, for example.   But generally speaking it’s the committed size that has more impact – that number has to be backed by real storage, in RAM or at least the page file.

So, back to the managed process.  This is what VMMap shows:

[VMMap screenshot: managed thread stack – Size 1,024k, Committed 1,024k]

Reserved address space is the same as before at 1024k, but the whole thing is committed!  So each thread is using the full 1 Mbyte of physical storage, regardless of how much memory the stack actually needs.   That can’t be right, can it?   Well it seems it is, as Joe Duffy explains.

So, the bottom line is that you should think carefully about creating managed threads that are going to sit idle in your process, for example in a thread pool.   In an unmanaged process those idle threads wouldn’t be using many resources, but in a managed process they’re using a significant chunk of valuable memory.   This is even more important if you have any form of multi-process architecture, perhaps with thread pools in each process. 

If you really need a large number of long-lived threads in your managed process, consider reducing the size of the stack using the appropriate overload of the System.Threading.Thread constructor.
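
As a minimal sketch, the stack size is the second constructor argument, in bytes – here requesting 256 Kbytes instead of the 1 Mbyte default (the runtime will round up requests below the platform minimum):

```csharp
using System;
using System.Threading;

class StackSizeDemo
{
    static bool ran;

    static void Main()
    {
        // The second constructor argument is maxStackSize in bytes.
        // Here we ask for 256 KB of stack instead of the default 1 MB.
        var thread = new Thread(() => ran = true, 256 * 1024);
        thread.Start();
        thread.Join();
        Console.WriteLine("worker ran: {0}", ran);
    }
}
```

With hundreds of long-lived threads, that difference in committed stack adds up quickly.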

Improving performance of Parallel30.For

November 18th, 2009

In my last post I presented some code for a .NET 3.x version of the .NET 4.0 Parallel.For method.   For many scenarios the performance of that Parallel30.For method is fine, but it performs very poorly when there are a large number of iterations and a small amount of CPU-bound work to do in each one.   For example:

int size = 20000000;
int[] vector = new int[size];
Parallel.For(0, size, i => vector[i] = i % 100);

Here is the performance comparison for three scenarios:  1) running the code  in a standard (serial) for loop,  2) the .NET 4.0 Parallel.For,  and 3) my Parallel30.For.    The code is running on my dual-proc laptop.

Serial:  400ms
.NET 4.0 Parallel.For:  280ms
My Parallel30.For:  8000ms

Ouch.  My Parallel30.For is 20 times slower than a standard for loop, and nearly 30 times slower than .NET 4.0 Parallel.For.   I suppose it’s really not very surprising when you consider that my code is calling ThreadPool.QueueUserWorkItem for every iteration – that’s 20 million times in this scenario.   In some respects it’s impressive that it only takes eight seconds :-)

So I wrote another version of the For method, which I’ve called CpuFor, targeted at this sort of scenario.   Instead of submitting a separate task to the thread pool for every iteration, it divides the iterations into ranges based on the processor count, then submits each range as a task to the thread pool.   So instead of 20 million tasks, each executing one iteration, we’ll get 2 tasks, each executing 10 million iterations.  Or on a quad-proc machine, 4 tasks each executing 5 million iterations.   Of course this assumes that on average each iteration takes a roughly equal amount of time, but that’s probably mostly true for this type of scenario.

The results are pretty good:

My Parallel30.CpuFor: 250ms

I’m now (just about) beating .NET 4.0 Parallel.For.   Though I had to go pretty specialised to do it, which makes the .NET 4.0 Parallel.For all the more impressive.   I must crack it open with Reflector some time :-)

Anyway, here’s the code:

public static void CpuFor(int fromInclusive, int toExclusive, Action<int> action)
{
    using (var doneEvent = new ManualResetEvent(false))
    {
        int iterations = (toExclusive - fromInclusive);
        int iterationsCompleted = 0;
        Exception actionEx = null;
        int processors = Environment.ProcessorCount;

        WaitCallback iterationCode =
            (arg) =>
            {
                var range = (KeyValuePair<int, int>)arg;
                try
                {
                    int from = range.Key;
                    int to = range.Value;
                    for (int i = from; i < to; i++)
                        action(i);
                }
                catch (Exception ex)
                {
                    actionEx = ex;
                }

                int completed = Interlocked.Add(ref iterationsCompleted, range.Value - range.Key);
                if (completed == iterations)
                    doneEvent.Set();
            };

        int rangeSize = Math.Max(1, iterations / processors);
        for (int i = fromInclusive; i < toExclusive; i += rangeSize)
        {
            int from = i;
            int to = Math.Min(i + rangeSize, toExclusive);
            var range = new KeyValuePair<int, int>(from, to);
            ThreadPool.QueueUserWorkItem(iterationCode, range);
        }

        doneEvent.WaitOne();

        if (actionEx != null)
            throw new Exception("Action failed", actionEx);
    }
}

BTW, if you think that the delegate invocation – action(i) – on every iteration must be a performance killer, it doesn’t seem to be.   I wrote another version that eliminates the action() call (at the expense of usability) but the performance improvement was marginal.
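
If you want to check that for yourself, here’s a minimal benchmark sketch (timings will of course vary by machine) comparing the direct statement against the same work routed through an Action<int> delegate:

```csharp
using System;
using System.Diagnostics;

class DelegateOverheadDemo
{
    static int[] vector = new int[20000000];

    static void Main()
    {
        int size = vector.Length;
        Action<int> action = i => vector[i] = i % 100;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < size; i++)
            vector[i] = i % 100;                  // direct statement
        Console.WriteLine("direct:   {0}ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < size; i++)
            action(i);                            // same work via the delegate
        Console.WriteLine("delegate: {0}ms", sw.ElapsedMilliseconds);
    }
}
```

The per-call cost of the delegate is small compared with the array write and the modulo, which is consistent with the marginal improvement described above.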