Date: 2009-09-30 (19:56)
From: Mikkel_Fahnøe_Jørgensen <mikkel@d...>
Subject: Re: [Caml-list] JIT & HLVM, LLVM
> Using one work stealing deque per core is much more efficient than the work
> sharing queue you described for two reasons:
> 1. Less global synchronization.
> 2. Subtasks are likely to be executed on the core that spawned them, which
> improves cache efficiency.

You could also create larger task chunks, or schedule multiple tasks on
a serial queue, to obtain some of the benefits of cache coherency. I'm
not saying this is a better way than the one you suggest.

Another issue is latency vs. throughput.
If you have a lot of short transactions, as in a trading system, you
want things done now, not more efficiently in a while; in that case
you'd rather add more cores to the system and waste some of them.

> The parallel "for" loop you described is similar to the TPL's and leaves a lot
> to be desired. Specifically, you want to execute clusters of consecutive
> tasks on the same core for two reasons:

Umm, yes - it is just a convenience for getting things done, I suppose.
But again, you could split the loop into nested loops to increase
granularity. And in relation to my other recent post, there is also
map-reduce for addressing these issues, which works well across networks.

> The former is easily solved with the Cilk/TPL design by using recursive
> subdivision instead of the index sharing scheme you described because
> subtasks are likely to be executed on the same core.

I also wondered why GCD insisted on starting indices in order, and even
wastes time syncing on the index counter, since there is no guarantee of
execution order other than start time. I guess it would be easy
to modify GCD with a different scheduler - it is about 10 lines of
code in the library, and you could set a property on a queue to
suggest a preferred scheduler. However, I think the benefit of the
current approach is exactly to ensure that as many early indices as
possible run in parallel - which you might want for low latency.

Another issue that concerns me more is how serial queues behave when more
work is added to the queue. If tasks keep feeding the serial
queue, it might starve other, more important work, unless the serial
queue takes a break - of course the thread is preempted, but it will
not give priority to older concurrent tasks still waiting to be pulled
from the global concurrent queue. Again, I think this is easily fixed,
and I might not have understood this fully.

For high-performance computing of many similar computations, like pixel
processing, I think one should also have a look at OpenCL, and
possibly some map-reduce top layer that can distribute work to
multiple nodes - such a layer could be built with GCD without relying
heavily on GCD for anything other than coordination.

Again, here latency is important - the sooner you can ship off work to
OpenCL (which feeds work to graphics coprocessors), the more work gets
done, rather than having work lining up on the same core so that GCD
maximizes its own CPU cache.

This also applies to network communication: if you can send data
sooner, the other end can start computing earlier, especially if there
is a risk of delays.

So there is an argument both ways - I think GCD is intended more for
systems programming than for number crunching (or raytracing :-)

I think one should also remember that GCD is a programming
abstraction, that there are many ways to implement it, and that some
changes could be made as time progresses.