Archive for the ‘Threads’ category

Managed threads in “whole stack committed” shocker

November 30th, 2009

I was experimenting with thread creation in a managed process, to test the limits, and I noticed something very odd.  Creating a large number of threads was using a surprisingly large amount of memory.  I guessed it must be the thread stacks, but the numbers didn’t fit with my understanding of how thread stacks are allocated.  

In an unmanaged process, by default each thread has 1 Mbyte of address space reserved for its stack, but initially only a small amount of this address space is committed.   You can see this with the excellent VMMAP utility – here’s what it shows for an unmanaged x64 application:

unmanaged_threads_x64

The ‘size’ is the reserved address space – 1024k as expected.  The committed memory is 16k, which is the amount of the address space that actually has physical storage associated with it.   This can grow as required, up to the size of the reserved address space, but usually it won’t get close to that. 

The size of the reserved address space is important in the sense that we need to make sure we don’t run out – 2000 threads would be enough to max out a 32-bit process, for example.   But generally speaking it’s the committed size that has more impact – that number has to be backed by real storage, in RAM or at least the page file.

So, back to the managed process.  This is what VMMAP shows:

managed_threads_x64

Reserved address space is the same as before at 1024k, but the whole thing is committed!  So each thread is using the full 1 Mbyte of physical storage, regardless of how much memory the stack actually needs.   That can’t be right, can it?   Well it seems it is, as Joe Duffy explains.

So, the bottom line is that you should think carefully about creating managed threads that are going to sit idle in your process, for example in a thread pool.   In an unmanaged process those idle threads wouldn’t be using many resources, but in a managed process they’re using a significant chunk of valuable memory.   This is even more important if you have any form of multi-process architecture, perhaps with thread pools in each process. 

If you really need a large number of long-lived threads in your managed process, consider reducing the size of the stack using the appropriate overload of the System.Threading.Thread constructor.

Improving performance of Parallel30.For

November 18th, 2009

In my last post I presented some code for a .NET 3.x version of the .NET 4.0 Parallel.For method.   For many scenarios the performance of that Parallel30.For method is fine, but it performs very poorly when there are a large number of iterations and a small amount of cpu-bound work to do in each one.   For example:

int size = 20000000;
int[] vector = new int[size];
Parallel.For(0, size, i => vector[i] = i % 100);

Here is the performance comparison for three scenarios:  1) running the code  in a standard (serial) for loop,  2) the .NET 4.0 Parallel.For,  and 3) my Parallel30.For.    The code is running on my dual-proc laptop.

Serial:  400ms
.NET 4.0 Parallel.For:  280ms
My Parallel30.For:  8000ms

Ouch.  My Parallel30.For is 20 times slower than a standard for loop, and nearly 30 times slower than .NET 4.0 Parallel.For.   I suppose it’s really not very surprising when you consider that my code is calling ThreadPool.QueueUserWorkItem for every iteration – that’s 20 million times in this scenario.   In some respects it’s impressive that it only takes eight seconds 🙂

So I wrote another version of the For method, which I’ve called CpuFor, targeted at this sort of scenario.   Instead of submitting a separate task to the thread pool for every iteration, it divides the iterations into ranges based on the processor count, then submits each range as a task to the thread pool.   So instead of 20 million tasks, each executing one iteration, we’ll get 2 tasks, each executing 10 million iterations.  Or on a quad-proc machine, 4 tasks each executing 5 million iterations.   Of course this assumes that on average each iteration takes a roughly equal amount of time, but that’s probably mostly true for this type of scenario.

The results are pretty good:

My Parallel30.CpuFor: 250ms

I’m now (just about) beating .NET 4.0 Parallel.For.   Though I had to go pretty specialised to do it, which makes the .NET 4.0 Parallel.For all the more impressive.   I must crack it open with Reflector some time 🙂

Anyway, here’s the code:

public static void CpuFor(int fromInclusive, int toExclusive, Action<int> action)
{
    using (var doneEvent = new ManualResetEvent(false))
    {
        int iterations = (toExclusive - fromInclusive);
        int iterationsCompleted = 0;
        Exception actionEx = null;
        int processors = Environment.ProcessorCount;

        WaitCallback iterationCode =
            (arg) =>
            {
                var range = (KeyValuePair<int, int>)arg;
                try
                {
                    int from = range.Key;
                    int to = range.Value;
                    for (int i = from; i < to; i++)
                        action(i);
                }
                catch (Exception ex)
                {
                    actionEx = ex;
                }

                int completed = Interlocked.Add(ref iterationsCompleted, range.Value - range.Key);
                if (completed == iterations)
                    doneEvent.Set();
            };

        int rangeSize = Math.Max(1, iterations / processors);
        for (int i = fromInclusive; i < toExclusive; i += rangeSize)
        {
            int from = i;
            int to = Math.Min(i + rangeSize, toExclusive);
            var range = new KeyValuePair<int, int>(from, to);
            ThreadPool.QueueUserWorkItem(iterationCode, range);
        }

        doneEvent.WaitOne();

        if (actionEx != null)
            throw new Exception("Action failed", actionEx);
    }
}

BTW If you think that the delegate invocation – action(i) – on every iteration must be a performance killer, it doesn’t seem to be.   I wrote another version that eliminates the action() call (at the expense of usability) but the performance improvement was marginal.

Parallel.For for .NET 3.x

November 12th, 2009

.NET 4.0 includes an interesting class called Parallel, with a For method that allows the iterations of a for loop to be executed concurrently, like this:

Parallel.For(0, 10, i => Console.WriteLine(i));

I’ve done a lot of multithreading over the years, but I must confess it’s never occurred to me to apply concurrency in this way, at the level of a simple for loop.   I’m not sure how often I’ll use it in practice, but my current work project is .NET 3.5 so I wanted to have a .NET 3.x version of Parallel.For to play with.   The code is shown below. 

Some basic tests reveal that at the extremes the performance is not as good as the .NET 4.0 version, but for more common scenarios it seems to be comparable.  

static class Parallel30
{
    public static void For(int fromInclusive, int toExclusive, Action<int> action)
    {
        using (var doneEvent = new ManualResetEvent(false))
        {
            int iterations = (toExclusive - fromInclusive);
            int iterationsCompleted = 0;
            Exception actionEx = null;

            WaitCallback iterationCode =
                (arg) =>
                {
                    try
                    {
                        action((int)arg);
                    }
                    catch (Exception ex)
                    {
                        actionEx = ex;
                    }

                    int completed = Interlocked.Increment(ref iterationsCompleted);
                    if (completed == iterations)
                        doneEvent.Set();
                };

            for (int i = fromInclusive; i < toExclusive; i++)
            {
                ThreadPool.QueueUserWorkItem(iterationCode, i);
            }

            doneEvent.WaitOne();

            if (actionEx != null)
                throw new Exception("Action failed", actionEx);
        }
    }
}