[Caml-list] DFT in OCaml vs. C
Date: 2003-03-27 (16:06)
From: David Monniaux <David.Monniaux@e...>
Subject: OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C)
> The "Pentium 4 SSE2" column is an experimental code generator for the
> Pentium 4 that uses SSE2 instructions and registers for floating-point
> computations.  (Before you ask: no, it's not publically available,

In that case, to get a meaningful comparison, you should compile the C code
with gcc -march=pentium4 -msse2 or icc -march=pentium4

> and it delivers about 2/3 of the performances of C, even on the Pentium.

Let me tell you about our experience here. We are developing a large
program consisting of
- a large part of Caml code handling complex data structures
- a smaller C library handling certain numerical matrix computations that
  are triggered by the Caml code
- some C (+ assembler) libraries dealing with system-dependent issues.

I profiled the code using OProfile (http://oprofile.sourceforge.net),
measuring cost in clock cycles and cache faults. Earlier attempts were made with

It turned out that we spent a significant amount of time in:

- The Caml polymorphic compare function (15% time + some cache faults)

  Part of the problem seems to lie with the fact that the same function is
  called when comparing strings, int64's and other types, thus the
  processor has to do lots of tests and jumps just to get at the correct
  comparison function.

  Wouldn't it be reasonable to define String.compare and Int64.compare to
  call monomorphic functions?

- The garbage collector (15% time + lots of cache faults)

  There's little we can do about it. Tuning the size of the minor heap so
  that it makes better use of the L2 cache seems to save 2.30% of the
  total running time.

  Curiously, using the compactor seems to slow things down slightly.

  Would it be possible to optimize the GC cache-wise? For instance, have
  it ask the processor to "prefetch" data.

- 17% in a particular matrix function written in C. There's little we can
  do except try to optimize it carefully and compile it with the best
  C compiler around.

- The rest of the time is spent within the Caml code.
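
To make the first two items concrete, here is a minimal sketch, not code from
our program. All names in it (compare_int64, compare_string, Int64Map,
set_minor_heap_words) are hypothetical, and 32768 words is an arbitrary
placeholder rather than a recommended value. The type annotations only give
the compiler a chance to avoid the generic comparison; whether it actually
does so depends on the compiler version, so the effect should be checked by
profiling.

```ocaml
(* Sketch only: every name below is hypothetical. *)

(* 1. Comparisons constrained to a single type. The annotations let the
   compiler select a specialized comparison when it supports one for that
   type; otherwise this still falls back to the polymorphic compare. *)
let compare_int64 (a : int64) (b : int64) : int = compare a b
let compare_string (a : string) (b : string) : int = compare a b

(* A map keyed by int64 that goes through the type-constrained comparison. *)
module Int64Map = Map.Make (struct
  type t = int64
  let compare = compare_int64
end)

(* 2. Changing the minor heap size at run time. Gc.minor_heap_size is
   expressed in words, not bytes; the value to use must be found by
   measurement on the target machine. *)
let set_minor_heap_words (words : int) : unit =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = words }

let () =
  set_minor_heap_words 32768;  (* placeholder value, an assumption *)
  let m = Int64Map.add 2L "two" (Int64Map.add 1L "one" Int64Map.empty) in
  print_endline (Int64Map.find 1L m)
```

The same minor heap setting can also be passed without recompiling through
the OCAMLRUNPARAM environment variable (parameter s), which is convenient
when trying several sizes under the profiler.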

Now this was a bit surprising to us, because we thought we spent far more
time in the numerical computations.

Now back to the original question about DFTs. In your real-life
application, will DFT computations make up a major part of the clock cycles
spent by the program?

David Monniaux            http://www.di.ens.fr/~monniaux
Laboratoire d'informatique de l'École Normale Supérieure,
Paris, France
