Improving performance of Parallel30.For

November 18th, 2009 by Andy Leave a reply »

In my last post I presented some code for a .NET 3.x version of the .NET 4.0 Parallel.For method.   For many scenarios the performance of that Parallel30.For method is fine, but it performs very poorly when there are a large number of iterations and a small amount of cpu-bound work to do in each one.   For example:

int size = 20000000;
int[] vector = new int[size];
Parallel.For(0, size, i => vector[i] = i % 100);

Here is the performance comparison for three scenarios:  1) running the code  in a standard (serial) for loop,  2) the .NET 4.0 Parallel.For,  and 3) my Parallel30.For.    The code is running on my dual-proc laptop.

Serial:  400ms
.NET 4.0 Parallel.For:  280ms
My Parallel30.For:  8000ms

Ouch.  My Parallel30.For is 20 times slower than a standard for loop, and nearly 30 times slower than .NET 4.0 Parallel.For.   I suppose it’s really not very surprising when you consider that my code is calling ThreadPool.QueueUserWorkItem for every iteration – that’s 20 million times in this scenario.   In some respects it’s impressive that it only takes eight seconds 🙂

So I wrote another version of the For method, which I’ve called CpuFor, targeted at this sort of scenario.   Instead of submitting a separate task to the thread pool for every iteration, it divides the iterations into ranges based on the processor count, then submits each range as a task to the thread pool.   So instead of 20 million tasks, each executing one iteration, we’ll get 2 tasks, each executing 10 million iterations.  Or on a quad-proc machine, 4 tasks each executing 5 million iterations.   Of course this assumes that on average each iteration takes a roughly equal amount of time, but that’s probably mostly true for this type of scenario.

The results are pretty good:

My Parallel30.CpuFor: 250ms

I’m now (just about) beating .NET 4.0 Parallel.For.   Though I had to go pretty specialised to do it, which makes the .NET 4.0 Parallel.For all the more impressive.   I must crack it open with Reflector some time 🙂

Anyway, here’s the code:

public static void CpuFor(int fromInclusive, int toExclusive, Action<int> action)
    using (var doneEvent = new ManualResetEvent(false))
        int iterations = (toExclusive - fromInclusive);
        int iterationsCompleted = 0;
        Exception actionEx = null;
        int processors = Environment.ProcessorCount;

        WaitCallback iterationCode =
            (arg) =>
                var range = (KeyValuePair<int, int>)arg;
                    int from = range.Key;
                    int to = range.Value;
                    for (int i = from; i < to; i++)
                catch (Exception ex)
                    actionEx = ex;

                int completed = Interlocked.Add(ref iterationsCompleted, range.Value - range.Key);
                if (completed == iterations)

        int rangeSize = Math.Max(1, iterations / processors);
        for (int i = fromInclusive; i < toExclusive; i += rangeSize)
            int from = i;
            int to = Math.Min(i + rangeSize, toExclusive);
            var range = new KeyValuePair<int, int>(from, to);
            ThreadPool.QueueUserWorkItem(iterationCode, range);


        if (actionEx != null)
            throw new Exception("Action failed", actionEx);

BTW If you think that the delegate invocation – action(i) – on every iteration must be a performance killer, it doesn’t seem to be.   I wrote another version that eliminates the action() call (at the expense of usability) but the performance improvement was marginal.

Oh, and you can see here My Frugal Business, which is fun and easy to play!


Comments are closed.