[Caml-list] DFT in OCaml vs. C
From: Issac Trotts <ijtrotts@u...>
Subject: Re: OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
David Monniaux wrote:

>>The "Pentium 4 SSE2" column is an experimental code generator for the
>>Pentium 4 that uses SSE2 instructions and registers for floating-point
>>computations.  (Before you ask: no, it's not publicly available,
>In this case, to get meaningful comparison results, you should use
>gcc -march=pentium4 -msse2 or icc -march=pentium4
>>and it delivers about 2/3 of the performances of C, even on the Pentium.
>Let me tell you about our experience here. We are developing a large
>program consisting of
>- a large part of Caml code handling complex data structures
>- a smaller C library handling certain numerical matrix computations that
>  are triggered by the Caml code
>- some C (+ assembler) libraries dealing with system-dependent issues.
>I profiled the code using OProfile, for
>expenses in clock cycles and cache faults. Earlier attempts were made with
>It turned out that we spent a significant amount of time in:
>- The Caml polymorphic compare function (15% time + some cache faults)
>  Part of the problem seems to lie with the fact that the same function is
>  called when comparing strings, int64's and other types, thus the
>  processor has to do lots of tests and jumps just to get at the correct
>  comparison function.
>  Wouldn't it be reasonable to define and to
>  call monomorphic functions?
>- The garbage collector (15% time + lots of cache faults)
>  There's little we can do about it. Changing the size of the minor heap,
>  adjusting it to optimize the use of L2 cache seems to gain 2.30% of the
>  total running time.
>  Curiously, using the compactor seems to slow things slightly.
>  Would it be possible to optimize the GC cache-wise? For instance, have
>  it ask the processor to "prefetch" data.
>- 17% in a particular matrix function written in C. There's little we can
>  do except trying to optimize it carefully and compiling it with the best
>  C compiler around.
>- The rest of the time is spent within the Caml code.
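
On the polymorphic compare item above, here is a minimal sketch of what
calling monomorphic functions could look like. It assumes String.compare
and Int64.compare as found in current OCaml standard libraries; the names
sort_generic, sort_strings and sort_int64s are made up for illustration
and do not come from the program being profiled.

```ocaml
(* Polymorphic comparison: [compare] works on any type, but at run time it
   has to inspect its arguments to find out how to compare them. *)
let sort_generic (xs : string list) : string list =
  List.sort compare xs

(* Monomorphic alternatives: the comparison function is fixed to one type,
   so no run-time dispatch on the kind of value is needed. *)
let sort_strings (xs : string list) : string list =
  List.sort String.compare xs

let sort_int64s (xs : int64 list) : int64 list =
  List.sort Int64.compare xs
```

Both versions return the same ordering; whether the monomorphic one is
measurably faster depends on the data and would need profiling.
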
>Now this was a bit surprising to us, because we thought we spent far more
>time in the numerical computations.
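
For the minor heap tuning mentioned in the GC item, a sketch using the
standard Gc module. The 64k-word figure is an arbitrary assumption, not
the value used in the profiled program; a size that suits a given L2
cache would have to be found by measurement. The helper names
show_minor_heap and set_minor_heap are hypothetical.

```ocaml
(* Print the current minor heap size. The unit is words, not bytes. *)
let show_minor_heap () =
  let p = Gc.get () in
  Printf.printf "minor heap: %d words\n" p.Gc.minor_heap_size

(* Resize the minor heap. [words] is a tuning knob chosen by the caller;
   the runtime may round it to suit its own allocation granularity. *)
let set_minor_heap (words : int) : unit =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = words }

let () =
  show_minor_heap ();
  set_minor_heap (64 * 1024);  (* assumed value, for illustration only *)
  show_minor_heap ()
```

The same parameter can be set without recompiling through the s option
of the OCAMLRUNPARAM environment variable.
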
>Now back to the original question about DFTs. In your real-life
>application, will DFT computations make a major part of the clock cycles
>spent by the program?
There's a small image processing experiment I want to do that will compute
lots of DFTs on small sub-images and will probably spend most of its clock
cycles doing the transforms.
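
To make the kind of transform concrete, here is a minimal sketch of a
one-dimensional DFT in OCaml. It is the naive O(n^2) definition written
with the standard Complex module, not the code benchmarked earlier in
this thread, and the function name dft is chosen only for illustration.
A 2-D transform of a sub-image would apply it to each row and then to
each column.

```ocaml
(* Naive DFT: y.(k) = sum over j of x.(j) * exp (-2 pi i j k / n). *)
let dft (x : Complex.t array) : Complex.t array =
  let n = Array.length x in
  let pi = 4.0 *. atan 1.0 in
  Array.init n (fun k ->
    let acc = ref Complex.zero in
    for j = 0 to n - 1 do
      let angle = -2.0 *. pi *. float_of_int (j * k) /. float_of_int n in
      (* [Complex.polar 1.0 angle] is the unit complex number exp (i angle). *)
      acc := Complex.add !acc (Complex.mul x.(j) (Complex.polar 1.0 angle))
    done;
    !acc)
```

For sub-images beyond a handful of pixels per side, an FFT would normally
replace this quadratic loop.
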

- Issac
