Date: 2009-09-30 (15:22)
From: Jon Harrop <jon@f...>
Subject: Re: [Caml-list] JIT & HLVM, LLVM
On Wednesday 30 September 2009 02:08:06 Mikkel Fahnøe Jørgensen wrote:
> 2009/9/27 Jon Harrop <jon@ffconsultancy.com>:
> > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
> >> If Grand Central Dispatch makes its way
> >> into *nix...
> >
> > Incidentally, what exactly (technically) is Grand Central and how does
> > it relate to existing alternatives and the state-of-the-art? I Googled
> > it but failed to find anything useful...
>
> Grand Central Dispatch...

Thanks for the explanation!

This seems to be attacking the same problem as Cilk and the TPL, but my immediate impression (based entirely upon your description) is that Cilk and the TPL are probably better foundations. Using one work-stealing deque per core is much more efficient than the work-sharing queue you described, for two reasons:

1. Less global synchronization.

2. Subtasks are likely to be executed on the core that spawned them, which improves cache efficiency.

The parallel "for" loop you described is similar to the TPL's and leaves a lot to be desired. Specifically, you want to execute clusters of consecutive tasks on the same core, for two reasons:

1. Using the same core improves cache efficiency, because the end of one task is often spatially related to the start of the next, e.g. in parallel quicksort, linear algebra or image processing.

2. Executing clusters of tasks amortizes the cost of queueing them.

The former is easily solved in the Cilk/TPL design by using recursive subdivision instead of the index-sharing scheme you described, because subtasks are likely to be executed on the same core. However, that will not work with a work-sharing queue, because subtasks are executed on random cores. Moreover, a trivial extension to this higher-order function allows you to pass in another function that describes the complexity of each task.
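The recursive-subdivision scheme can be sketched as follows. This is only an illustration (names and the grain size are mine, not from any library), assuming OCaml 5's Domain module; spawning a fresh domain per split stands in for what a real scheduler would do with a fixed pool of workers and one work-stealing deque per core:

```ocaml
(* Sketch of a Cilk/TPL-style parallel "for" by recursive subdivision:
   the range [lo, hi) is halved until it falls below the grain size,
   so clusters of consecutive indices tend to run on the same core.
   [parallel_for] and [grain] are illustrative names. *)
let rec parallel_for ?(grain = 1024) lo hi f =
  if hi - lo <= grain then
    (* Small enough: run this cluster sequentially on the current core. *)
    for i = lo to hi - 1 do f i done
  else begin
    let mid = lo + (hi - lo) / 2 in
    (* Run the right half concurrently, recurse into the left half here. *)
    let right = Domain.spawn (fun () -> parallel_for ~grain mid hi f) in
    parallel_for ~grain lo mid f;
    Domain.join right
  end
```

For example, `parallel_for 0 n (fun i -> a.(i) <- g i)` fills an array in parallel; with a work-stealing scheduler, a stolen subtree keeps its consecutive indices together, which is the cache-locality point made above.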
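One way such a per-task complexity function might be used is to cut the index range into clusters of roughly equal total cost rather than equal length. A minimal sketch (the names `clusters` and `budget` are illustrative assumptions, not from any actual library):

```ocaml
(* Sketch: split [lo, hi) into clusters whose total cost, as estimated
   by a user-supplied per-index [cost] function, is roughly [budget].
   Returns a list of half-open (start, stop) index ranges. *)
let clusters ~cost ~budget lo hi =
  let rec go start i acc sum =
    if i >= hi then
      (* Emit the final, possibly under-budget, cluster. *)
      if i > start then List.rev ((start, hi) :: acc) else List.rev acc
    else
      let sum = sum +. cost i in
      if sum >= budget then
        (* Budget reached: close this cluster after index i. *)
        go (i + 1) (i + 1) ((start, i + 1) :: acc) 0.
      else go start (i + 1) acc sum
  in
  go lo lo [] 0.
```

With a uniform cost this degenerates to fixed-size clusters; with a non-uniform cost (e.g. rows of a triangular matrix), later clusters shrink or grow so that each cluster represents a similar amount of work.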
Such a complexity function allows the recursive function to subdivide the given range more intelligently, into clusters of variable-complexity tasks with similar total running times. This is the technique I use in our commercial F# libraries, but I have not seen it described elsewhere.

Cilk and the TPL also just rely upon the programmer to make tasks sufficiently complex that the time spent queueing and dequeueing them is insignificant. I solved this problem by injecting run-time profiling code (with global synchronizations per cluster of tasks) to measure the proportion of the time spent spawning rather than executing tasks, and then using exponential backoff to increase the cluster size until a sufficiently small proportion of the time is spent spawning. Again, I have not seen this technique described elsewhere.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
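The exponential-backoff tuning described in the message above might be sketched like this. The measurement is abstracted as a function of the cluster size; in a real system it would come from the injected profiling code, and the names and default threshold here are illustrative assumptions:

```ocaml
(* Sketch: starting from a small cluster size, keep doubling until the
   measured fraction of time spent spawning (rather than executing)
   tasks drops below the threshold. *)
let tune_cluster_size ?(threshold = 0.1) ?(init = 1) spawn_fraction =
  let rec go n =
    if spawn_fraction n <= threshold then n else go (2 * n)
  in
  go init
```

For example, modelling a fixed spawn overhead of 1.0 per cluster and 0.05 of useful work per task, the spawn fraction is `1 / (1 + 0.05 n)`, and doubling from 1 settles on a cluster size of 256, the first power of two at which spawning accounts for at most 10% of the time.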