Improving performance of Parallel30.For
In my last post I presented some code for a .NET 3.x version of the .NET 4.0 Parallel.For method. For many scenarios the performance of that Parallel30.For method is fine, but it performs very poorly when there are a large number of iterations and a small amount of cpu-bound work to do in each one. For example:
int size = 20000000; int[] vector = new int[size]; Parallel.For(0, size, i => vector[i] = i % 100);
Here is the performance comparison for three scenarios: 1) running the code in a standard (serial) for loop, 2) the .NET 4.0 Parallel.For, and 3) my Parallel30.For. The code is running on my dual-proc laptop.
Serial: 400ms
.NET 4.0 Parallel.For: 280ms
My Parallel30.For: 8000ms
Ouch. My Parallel30.For is 20 times slower than a standard for loop, and nearly 30 times slower than .NET 4.0 Parallel.For. I suppose it’s really not very surprising when you consider that my code is calling ThreadPool.QueueUserWorkItem for every iteration – that’s 20 million times in this scenario. In some respects it’s impressive that it only takes eight seconds :-)
So I wrote another version of the For method, which I’ve called CpuFor, targeted at this sort of scenario. Instead of submitting a separate task to the thread pool for every iteration, it divides the iterations into ranges based on the processor count, then submits each range as a task to the thread pool. So instead of 20 million tasks, each executing one iteration, we’ll get 2 tasks, each executing 10 million iterations. Or on a quad-proc machine, 4 tasks each executing 5 million iterations. Of course this assumes that on average each iteration takes a roughly equal amount of time, but that’s probably mostly true for this type of scenario.
The results are pretty good:
My Parallel30.CpuFor: 250ms
I’m now (just about) beating .NET 4.0 Parallel.For. Though I had to go pretty specialised to do it, which makes the .NET 4.0 Parallel.For all the more impressive. I must crack it open with Reflector some time :-)
Anyway, here’s the code:
public static void CpuFor(int fromInclusive, int toExclusive, Action<int> action) { using (var doneEvent = new ManualResetEvent(false)) { int iterations = (toExclusive - fromInclusive); int iterationsCompleted = 0; Exception actionEx = null; int processors = Environment.ProcessorCount; WaitCallback iterationCode = (arg) => { var range = (KeyValuePair<int, int>)arg; try { int from = range.Key; int to = range.Value; for (int i = from; i < to; i++) action(i); } catch (Exception ex) { actionEx = ex; } int completed = Interlocked.Add(ref iterationsCompleted, range.Value - range.Key); if (completed == iterations) doneEvent.Set(); }; int rangeSize = Math.Max(1, iterations / processors); for (int i = fromInclusive; i < toExclusive; i += rangeSize) { int from = i; int to = Math.Min(i + rangeSize, toExclusive); var range = new KeyValuePair<int, int>(from, to); ThreadPool.QueueUserWorkItem(iterationCode, range); } doneEvent.WaitOne(); if (actionEx != null) throw new Exception("Action failed", actionEx); } }
BTW If you think that the delegate invocation – action(i) – on every iteration must be a performance killer, it doesn’t seem to be. I wrote another version that eliminates the action() call (at the expense of usability) but the performance improvement was marginal.
0 Comments:
Post a Comment
<< Home