From: Jon Harrop <jon@f...>
Subject: Re: [Caml-list] JIT & HLVM, LLVM
On Wednesday 30 September 2009 02:08:06 Mikkel Fahnøe Jørgensen wrote:
> 2009/9/27 Jon Harrop <jon@ffconsultancy.com>:
> > On Sunday 27 September 2009 20:23:13 kcheung@math.carleton.ca wrote:
> >> If Grand Central Dispatch makes its way
> >> into *nix...
> >
> > Incidentally, what exactly (technically) is Grand Central and how does it
> > relate to existing alternatives and the state-of-the-art? I Googled it
> > but failed to find anything useful...
>
> Grand Central Dispatch...

Thanks for the explanation! This seems to be attacking the same problem as 
Cilk and the TPL but my immediate impression (based entirely upon your 
description) is that Cilk and the TPL are probably better foundations.

Using one work stealing deque per core is much more efficient than the work 
sharing queue you described for two reasons:

1. Less global synchronization.

2. Subtasks are likely to be executed on the core that spawned them, which 
improves cache efficiency.
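To make the contrast concrete, here is a minimal Python sketch of such a per-core deque (the names and structure are mine, purely illustrative; a real implementation needs lock-free atomic operations rather than a plain container):

```python
from collections import deque

class WorkStealingDeque:
    """Per-core deque: the owning core pushes and pops at the "young"
    end (LIFO, so freshly spawned subtasks stay on the same core, warm
    in its cache), while idle cores steal from the "old" end (FIFO),
    taking the tasks least likely to be hot in the owner's cache.
    Single-threaded sketch; no synchronization is shown."""

    def __init__(self):
        self._d = deque()

    def push(self, task):   # owner end
        self._d.append(task)

    def pop(self):          # owner end (LIFO)
        return self._d.pop() if self._d else None

    def steal(self):        # thief end (FIFO)
        return self._d.popleft() if self._d else None

owner = WorkStealingDeque()
for t in ["t1", "t2", "t3"]:
    owner.push(t)
# The owner works on the most recently spawned subtask ...
assert owner.pop() == "t3"
# ... while a thief takes the oldest task from the far end.
assert owner.steal() == "t1"
```

Because each core mostly touches only its own deque, global synchronization is needed only on the (rare) steal path, whereas a single shared queue is contended on every enqueue and dequeue.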

The parallel "for" loop you described is similar to the TPL's and leaves a lot 
to be desired. Specifically, you want to execute clusters of consecutive 
tasks on the same core for two reasons:

1. Using the same core improves cache efficiency because the end of one task 
is often spatially related to the start of the next, e.g. parallel quicksort, 
linear algebra or image processing.

2. Executing clusters of tasks amortizes the cost of queueing tasks.
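A minimal sketch of such a clustered parallel "for" (my own illustrative code, not the TPL's or GCD's API): each worker receives a run of consecutive indices, so one core touches adjacent data and the queueing cost is paid once per cluster rather than once per iteration.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(lo, hi, body, chunk=4, workers=2):
    """Apply `body` to each index in [lo, hi), handing out clusters of
    `chunk` consecutive indices: consecutive tasks run on the same
    worker (cache-friendly) and the per-task queueing cost is
    amortized over the whole cluster."""
    def run_chunk(start):
        for i in range(start, min(start + chunk, hi)):
            body(i)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One queued work item per cluster, not per index.
        list(pool.map(run_chunk, range(lo, hi, chunk)))

out = [0] * 10
parallel_for(0, 10, lambda i: out.__setitem__(i, i * i))
# out now holds the squares 0, 1, 4, ..., 81
```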

The former is easily solved with the Cilk/TPL design by using recursive 
subdivision instead of the index sharing scheme you described because 
subtasks are likely to be executed on the same core. However, that will not 
work with a work sharing queue because subtasks are executed on random cores. 
Moreover, a trivial extension to this higher-order function allows you to 
pass in another function that describes the complexity of each task. This 
allows the recursive function to more intelligently subdivide the given range 
into clusters of variable-complexity tasks with similar total running times. 
This is the technique I use in our commercial F# libraries but I have not 
seen it described elsewhere.
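The recursive-subdivision idea with a user-supplied complexity function might be sketched like this (an illustrative reconstruction from the description above, not the actual F# library code): a range is split in half only while its estimated total cost exceeds a budget, so cheap tasks end up in wide clusters and expensive tasks in narrow ones, giving clusters of similar total running time.

```python
def subdivide(lo, hi, cost, budget):
    """Recursively split [lo, hi) into clusters whose estimated total
    cost (sum of cost(i) over the cluster) stays within `budget`.
    `cost` is the caller-supplied complexity function for one task."""
    total = sum(cost(i) for i in range(lo, hi))
    if total <= budget or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (subdivide(lo, mid, cost, budget) +
            subdivide(mid, hi, cost, budget))

# Uniform cost: an even split into equal clusters.
assert subdivide(0, 8, lambda i: 1, budget=2) == \
    [(0, 2), (2, 4), (4, 6), (6, 8)]
```

With a skewed cost function such as `cost(i) = i`, the cheap low indices stay together while the expensive high indices get clusters of their own, which is exactly the balancing behavior described above.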

Cilk and the TPL also just rely upon the programmer to make tasks sufficiently 
complex that the time spent queueing and dequeueing them is insignificant. I 
solved this problem by injecting run-time profiling code (with global 
synchronizations per cluster of tasks) to measure the proportion of the time 
being spent spawning rather than executing tasks, and then using exponential 
backoff to increase the cluster size until a sufficiently small proportion of 
the time is spent spawning. Again, I have not seen this technique described 
elsewhere.
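The core of that feedback loop can be sketched in a few lines (my own illustrative model: the two time parameters stand in for the run-time profiling measurements; the real scheme measures them while the program runs):

```python
def tune_cluster_size(spawn_time, task_time, target=0.05):
    """Exponential backoff on cluster size: keep doubling the number
    of tasks per cluster until one spawn, amortized over the cluster,
    accounts for less than `target` of the total time.
    `spawn_time` and `task_time` model the profiled cost of queueing
    one cluster and of executing one task, respectively."""
    cluster = 1
    while spawn_time / (spawn_time + cluster * task_time) > target:
        cluster *= 2
    return cluster

# If spawning costs 10x as much as one task, doubling stops at 256
# tasks per cluster, the first power of two where spawning falls
# below 5% of the total time.
size = tune_cluster_size(spawn_time=10.0, task_time=1.0)
```

Because the cluster size grows geometrically, only a logarithmic number of profiling rounds (each with its global synchronization) is needed before the spawn overhead becomes negligible.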

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e