Version française
Home     About     Download     Resources     Contact us    
Browse thread
Re: [Caml-list] How to write a CUDA kernel in ocaml?
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Eray Ozkural <examachine@g...>
Subject: Re: [Caml-list] How to write a CUDA kernel in ocaml?
On Thu, Dec 17, 2009 at 2:34 AM, Philippe Wang
<philippe.wang.lists@gmail.com> wrote:
> On Wed, Dec 16, 2009 at 2:47 PM, Eray Ozkural <examachine@gmail.com> wrote:
>
>> One trivial and low-performance solution that comes to mind is: make
>> an ocaml bytecode interpreter into a CUDA kernel and then pass the
>> bytecode to it, and then voila, at least we have some 512-way
>> parallelism on the GT300. How does that sound? We'd be losing some
>> performance but massive parallelism will cover up for some of that.
>
>
> With parallel processors, you move very quickly the performance
> bottleneck from processor(s) to memory bandwidth, such that
> - it's hell to program because you have to manage concurrency and it
> has a real cost
> - it's useful for very specific programs that have very few memory
> access compared to processor computations (such as some compression
> algorithms, a more specific and very easy to write example is matrix
> multiplications).
>
> Imagine you have 3000MHz for memory bandwidth, which is extremely good
> today (I think). And imagine you have 100 processors that share this
> memory bandwidth. If they all want to access memory at the same time,
> even if you forget the concurrency management cost, you have
> 3000/100MHz/processor=30MHz/processor, which is very very very low. So
> think about 10 processors instead of 100 to be more realistic, it's
> still 300MHz/processor, which looks like what we had about a decade
> ago...
>
> (IMHO) A not-too-too-bad-but-still-realistic way to take benefit of
> GPUs today, with OCaml (or any high-level language), is to write
> computation functions in C (possibly with some assembly), and to write
> composition functions in OCaml. Or (less realistic in a short amount
> of time) maybe to write a compiler that may do the job for you, but
> it's not quite easy...
>
> Good luck,

First, the GT300 will have great memory bandwidth, probably 256 GB/s.
Half a gig/sec per core, not bad I think. With a smart ocaml bytecode
interpreter, we could derive some performance from this (hypothetical
yet!) baby. GT300 is great, it makes the reconfigurable computing
project I worked on mostly obsolete =)

Of course, you are right that the "memory wall" is a serious
obstacle to *any* parallel architecture, not just this architecture or
that architecture. I didn't read very thoroughly but in the Fermi
architecture the caches and local memory in NVIDIA naturally have
severe limitations. You have 512 cores. You can't give each huge
caches.

In the context of the following comments assume that a PRAM algorithm is given.

Obviously, we may expect the performance of a memory bound algorithm
to suffer in *both* multicore architectures and GPU's (that's where
reconfigurable computing might take over...).

However, if the algorithm is compute-bound, and can do with a moderate
memory bandwidth per processor, then I think this becomes an ideal
architecture. Not necessarily "embarrassingly parallel" algorithms,
but as seen in the CUDA pages of NVIDIA, those will work great!

My application is a perfect match for NVIDIA. It needs just 1-2 mb storage per
processor. And it spends more time computing than accessing memory, so
I think it will do well.

The ocaml bytecode interpreter is written in C. For a baseline
implementation we could try to port this to OpenCL.
http://caml.inria.fr/cgi-bin/viewcvs.cgi/ocaml/trunk/byterun/

Would be a cool experiment at least =)

What I want to do is to run the ocaml bytecode interpreter on each core, and
then feed the relevant bytecode to those. It can be done, I suppose? Or am I
missing something crucial? :) The runtime library would have to be ported to
OpenCL/CUDA, as well, isn't that possible?

Best,

PS: Sorry for having mailed this to you personally, I intended to post
it to the
mailing list.

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct