[Caml-list] DFT in OCaml vs. C
| Date: | -- (:) |
| From: | Issac Trotts <ijtrotts@u...> |
| Subject: | Re: OCaml performance (was: Re: [Caml-list] DFT in OCaml vs. C) |
David Monniaux wrote:

>>The "Pentium 4 SSE2" column is an experimental code generator for the
>>Pentium 4 that uses SSE2 instructions and registers for floating-point
>>computations. (Before you ask: no, it's not publically available,
>>
>
>In this case, to get meaningful comparison results, you should use
>gcc -march=pentium4 -msse2 or icc -march=pentium4
>
>>and it delivers about 2/3 of the performances of C, even on the Pentium.
>>
>
>Let me tell you about our experience here. We are developing a large
>program consisting of
>- a large part of Caml code handling complex data structures
>- a smaller C library handling certain numerical matrix computations that
>  are triggered by the Caml code
>- some C (+ assembler) libraries dealing with system-dependent issues.
>
>I profiled the code using OProfile (http://oprofile.sourceforge.net), for
>expenses in clock cycles and cache faults. Earlier attempts were made with
>gprof.
>
>It turned out that we spent a significant amount of time in:
>
>- The Caml polymorphic compare function (15% time + some cache faults)
>
>  Part of the problem seems to lie with the fact that the same function is
>  called when comparing strings, int64's and other types, thus the
>  processor has to do lots of tests and jumps just to get at the correct
>  comparison function.
>
>  Wouldn't it be reasonable to define String.compare and Int64.compare to
>  call monomorphic functions?
>
>- The garbage collector (15% time + lots of cache faults)
>
>  There's little we can do about it. Changing the size of the minor heap,
>  adjusting it to optimize the use of L2 cache seems to gain 2.30% of the
>  total running time.
>
>  Curiously, using the compactor seems to slow things slightly.
>
>  Would it be possible to optimize the GC cache-wise? For instance, have
>  it ask the processor to "prefetch" data.
>
>- 17% in a particular matrix function written in C. There's little we can
>  do except trying to optimize it carefully and compiling it with the best
>  C compiler around.
>
>- The rest of the time is spent within the Caml code.
>
>Now this was a bit surprising to us, because we thought we spent far more
>time in the numerical computations.
>
>
>Now back to the original question about DFTs. In your real-life
>application, will DFT computations make a major part of the clock cycles
>spent by the program?
>

There's a small image processing experiment I want to do that will
compute lots of DFTs on small sub-images and will probably spend most
of its clock cycles doing the transforms.

- Issac
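To illustrate the point quoted above about the polymorphic compare function, here is a minimal sketch of calling the generic `compare` versus a type-specific comparison. It assumes a current OCaml standard library in which `Int64.compare` and `String.compare` are available. The names `sort_generic`, `sort_int64` and `sort_by_name` are made up for this example and do not come from the thread. Whether the generic call really goes through the runtime's structural comparison depends on the compiler version and on what it can specialize, so take this as a sketch, not a measurement.

```ocaml
(* Sketch: generic vs. type-specific comparison.
   All function names here are illustrative, not from the thread. *)

(* Generic compare: may dispatch on the runtime representation
   of its arguments before it reaches the actual comparison. *)
let sort_generic (xs : int64 list) : int64 list =
  List.sort compare xs

(* Type-specific comparison: the argument type is fixed to int64. *)
let sort_int64 (xs : int64 list) : int64 list =
  List.sort Int64.compare xs

(* Same idea for records keyed by a string field. *)
let sort_by_name (xs : (string * int) list) : (string * int) list =
  List.sort (fun (a, _) (b, _) -> String.compare a b) xs

let () =
  let keys = [3L; 1L; 2L] in
  (* Both versions must produce the same ordering. *)
  assert (sort_generic keys = sort_int64 keys);
  List.iter (fun k -> Printf.printf "%Ld " k) (sort_int64 keys);
  print_newline ();
  List.iter (fun (n, v) -> Printf.printf "%s=%d " n v)
    (sort_by_name [("gc", 15); ("compare", 15); ("matrix", 17)]);
  print_newline ()
```

Both sorting functions return the same result. The only difference is which comparison function is handed to `List.sort`, so a profile taken before and after such a change would show whether the generic path matters in a given program.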