Date: --
From: Eray Ozkural <examachine@g...>
Subject: Re: [Caml-list] How to write a CUDA kernel in ocaml?
On Wed, Dec 16, 2009 at 3:41 PM, Mattias Engdegård <> wrote:
>>And trampolines to eliminate tail calls that cannot be eliminated using goto.
>>However, trampolines are ~10x slower than TCO in the code gen.
> With some care, gcc's sibcall mechanism can be exploited. For example,
> by having one standard signature for all generated C functions, and
> taking care not to pass pointers to variables in the caller's stack
> frame. This should give fairly good performance (better than
> trampolines anyway), at the cost of portability (but gcc is good at
> that). It would give full TCO, even across compilation units. It
> should work well with a Cheney-on-the-MTA-style GC, too.
> How suitable it is depends on the reason why compilation to C is done in
> the first place. It might be one of:
> 1) portability to odd platforms with semi-decent performance (ie,
>   better than interpreted bytecode)
> 2) a simple target for maintaining bootstrapping capability for the
>   compiler (but bytecode works well for this too)
> 3) simpler (?) interfacing to libraries in C etc
> 4) flat-out maximum performance by exploiting the optimisations that
>   modern C compilers are capable of
> Of course, these days we have llvm which has a lot going for it.
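
For concreteness, the trampoline technique discussed above looks
roughly like the following in plain C. This is only a minimal
hand-written sketch with made-up names, not actual compiler output:
each step returns a description of the next call instead of
performing it, so the C stack stays flat at the cost of one indirect
call per bounce.

/* Minimal trampoline sketch in plain C (hypothetical names, not real
 * compiler output).  Each step returns the next call to run instead
 * of making it, so the C stack never grows across tail calls; the
 * loop in trampoline() does the "bouncing". */
#include <stdio.h>

struct thunk;
typedef struct thunk (*step_fn)(long long acc, long long n);

struct thunk {
    step_fn next;     /* next step to run, or NULL when finished */
    long long acc;    /* accumulator argument for the next step  */
    long long n;      /* counter argument for the next step      */
};

/* sum 1..n written in trampolined (tail-call-free) style */
static struct thunk sum_step(long long acc, long long n)
{
    struct thunk t;
    if (n == 0) { t.next = NULL;     t.acc = acc;     t.n = 0;     }
    else        { t.next = sum_step; t.acc = acc + n; t.n = n - 1; }
    return t;
}

static long long trampoline(step_fn f, long long acc, long long n)
{
    struct thunk t = f(acc, n);
    while (t.next != NULL)             /* bounce until done */
        t = t.next(t.acc, t.n);
    return t.acc;
}

int main(void)
{
    /* one million "tail calls" with constant C stack usage */
    printf("%lld\n", trampoline(sum_step, 0, 1000000));  /* 500000500000 */
    return 0;
}

The sibcall approach avoids that per-call bounce entirely, which is
where its speed advantage over trampolines comes from.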

Well, the original question was whether the CUDA or OpenCL compiler
could be run on that generated C code.

Possible or impossible? :)

One trivial, low-performance solution that comes to mind: turn an
OCaml bytecode interpreter into a CUDA kernel and pass the bytecode
to it; voila, at least we get some 512-way parallelism on the GT300.
How does that sound? We'd lose some per-thread performance, but the
massive parallelism would make up for some of that.
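
To make that concrete, a per-thread interpreter kernel could look
something like the sketch below. The instruction set, memory layout,
and host code are all invented for illustration; the real OCaml
bytecode and runtime (GC, closures, exceptions) would be far more
involved.

/* Toy sketch: each CUDA thread interprets the same tiny bytecode
 * program over its own input element.  The instruction set (PUSH_IN,
 * PUSH_CONST, ADD, MUL, HALT) is invented for illustration and has
 * nothing to do with the real OCaml bytecode. */
#include <cstdio>

enum Op { PUSH_IN, PUSH_CONST, ADD, MUL, HALT };

__global__ void interp(const int *code, int code_len,
                       const int *input, int *output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int stack[16];
    int sp = 0;
    for (int pc = 0; pc < code_len; ) {
        switch (code[pc]) {
        case PUSH_IN:    stack[sp++] = input[tid];   pc += 1; break;
        case PUSH_CONST: stack[sp++] = code[pc + 1]; pc += 2; break;
        case ADD: sp--; stack[sp - 1] += stack[sp];  pc += 1; break;
        case MUL: sp--; stack[sp - 1] *= stack[sp];  pc += 1; break;
        case HALT: output[tid] = stack[sp - 1]; return;
        }
    }
}

int main()
{
    /* program: out = input * 3 + 1 */
    int code[] = { PUSH_IN, PUSH_CONST, 3, MUL, PUSH_CONST, 1, ADD, HALT };
    const int n = 512;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_code, *d_in, *d_out;
    cudaMalloc(&d_code, sizeof(code));
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_code, code, sizeof(code), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    interp<<<(n + 127) / 128, 128>>>(d_code,
                                     (int)(sizeof(code) / sizeof(code[0])),
                                     d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[10] = %d\n", h_out[10]);   /* 31 */

    cudaFree(d_code); cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Every thread runs the same program over its own element, so warp
divergence stays low as long as the bytecode itself doesn't branch on
the data.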


Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara