Date: 2009-09-30 (19:56)
From: Mikkel_Fahnøe_Jørgensen <mikkel@d...>
Subject: Re: [Caml-list] JIT & HLVM, LLVM
> Using one work stealing deque per core is much more efficient than the work
> sharing queue you described for two reasons:
> 1. Less global synchronization.
> 2. Subtasks are likely to be executed on the core that spawned them, which
> improves cache efficiency.

You could also create larger task chunks, or schedule multiple tasks on
a serial queue, to obtain some of the benefits of cache coherency. I'm
not saying this is a better way than the one you suggest.

Another issue is latency vs. throughput.
If you have a lot of short transactions, as in a trading system, you
want things done now, not more efficiently in a while; in that case
you'd rather add more cores to the system and waste some of them.

> The parallel "for" loop you described is similar to the TPL's and leaves a lot
> to be desired. Specifically, you want to execute clusters of consecutive
> tasks on the same core for two reasons:

Umm, yes - it is just a convenience for getting things done, I suppose.
But again, you could split the loop into nested loops to increase
granularity. And in relation to my other recent post, there is also
map-reduce for addressing these issues, which works well across networks.

> The former is easily solved with the Cilk/TPL design by using recursive
> subdivision instead of the index sharing scheme you described because
> subtasks are likely to be executed on the same core.

I also wondered why GCD insisted on starting indices in order, and even
wastes time syncing on the index counter, since there is no guarantee of
execution order other than start time. I guess it would be easy
to modify GCD with a different scheduler - it is about 10 lines of
code in the library, and you could set a property on a queue to
suggest a preferred scheduler. However, I think the benefit of the
current approach is exactly to ensure that as many early indices as
possible run in parallel - which you might want for low latency.

Another issue that concerns me more is how serial queues behave when more
work is added to the queue. If tasks keep feeding the serial
queue, it might starve other, more important work, unless the serial
queue takes a break - of course the thread is preempted, but it will
not give priority to older concurrent tasks still waiting to be pulled
from the global concurrent queue. Again, I think this is easily fixed,
and I might not have understood this fully.

For high-performance computing of many similar computations, like pixel
processing, I think one should also have a look at OpenCL, and
possibly some map-reduce top layer that can distribute work to
multiple nodes - such a layer could be built with GCD without relying
heavily on GCD for anything other than coordination.

Again, here latency is important - the sooner you can ship off work to
OpenCL (which feeds work to graphics coprocessors), the more work gets
done, rather than having work lining up on the same core so that GCD
maximizes its own CPU cache.

This also applies to network communication: if you can send data
sooner, the other end can start computing earlier, especially if there
is a risk of delays.

So there is an argument both ways - I think GCD is intended more for
systems programming than for number crunching (or raytracing :-)

I think one should also remember that GCD is a programming
abstraction, that there are many ways to implement it, and that some
changes could be made as time progresses.