Improving performance of Parallel30.For

November 18th, 2009 by Andy Leave a reply »

In my last post I presented some code for a .NET 3.x version of the .NET 4.0 Parallel.For method.   For many scenarios the performance of that Parallel30.For method is fine, but it performs very poorly when there are a large number of iterations and a small amount of cpu-bound work to do in each one.   For example:

int size = 20000000;
int[] vector = new int[size];
Parallel.For(0, size, i => vector[i] = i % 100);

Here is the performance comparison for three scenarios:  1) running the code  in a standard (serial) for loop,  2) the .NET 4.0 Parallel.For,  and 3) my Parallel30.For.    The code is running on my dual-proc laptop.

Serial:  400ms
.NET 4.0 Parallel.For:  280ms
My Parallel30.For:  8000ms

Ouch.  My Parallel30.For is 20 times slower than a standard for loop, and nearly 30 times slower than .NET 4.0 Parallel.For.   I suppose it’s really not very surprising when you consider that my code is calling ThreadPool.QueueUserWorkItem for every iteration – that’s 20 million times in this scenario.   In some respects it’s impressive that it only takes eight seconds 🙂

So I wrote another version of the For method, which I’ve called CpuFor, targeted at this sort of scenario.   Instead of submitting a separate task to the thread pool for every iteration, it divides the iterations into ranges based on the processor count, then submits each range as a task to the thread pool.   So instead of 20 million tasks, each executing one iteration, we’ll get 2 tasks, each executing 10 million iterations.  Or on a quad-proc machine, 4 tasks each executing 5 million iterations.   Of course this assumes that on average each iteration takes a roughly equal amount of time, but that’s probably mostly true for this type of scenario.

The results are pretty good:

My Parallel30.CpuFor: 250ms

I’m now (just about) beating .NET 4.0 Parallel.For.   Though I had to go pretty specialised to do it, which makes the .NET 4.0 Parallel.For all the more impressive.   I must crack it open with Reflector some time 🙂

Anyway, here’s the code:

public static void CpuFor(int fromInclusive, int toExclusive, Action<int> action)
    using (var doneEvent = new ManualResetEvent(false))
        int iterations = (toExclusive - fromInclusive);
        int iterationsCompleted = 0;
        Exception actionEx = null;
        int processors = Environment.ProcessorCount;

        WaitCallback iterationCode =
            (arg) =>
                var range = (KeyValuePair<int, int>)arg;
                    int from = range.Key;
                    int to = range.Value;
                    for (int i = from; i < to; i++)
                catch (Exception ex)
                    actionEx = ex;

                int completed = Interlocked.Add(ref iterationsCompleted, range.Value - range.Key);
                if (completed == iterations)

        int rangeSize = Math.Max(1, iterations / processors);
        for (int i = fromInclusive; i < toExclusive; i += rangeSize)
            int from = i;
            int to = Math.Min(i + rangeSize, toExclusive);
            var range = new KeyValuePair<int, int>(from, to);
            ThreadPool.QueueUserWorkItem(iterationCode, range);


        if (actionEx != null)
            throw new Exception("Action failed", actionEx);

BTW If you think that the delegate invocation – action(i) – on every iteration must be a performance killer, it doesn’t seem to be.   I wrote another version that eliminates the action() call (at the expense of usability) but the performance improvement was marginal.


Comments are closed.