Slow allocations with 64bit code?
Date: 2007-04-22 (10:23)
From: Xavier Leroy <Xavier.Leroy@i...>
Subject: Re: [Caml-list] Slow allocations with 64bit code?
> I wonder whether others have already noticed that allocations may
> surprisingly be slower on 64bit platforms than on 32bit ones.

As already mentioned, on 64-bit platforms almost all Caml data
representations are twice as large as on 32-bit platforms (exceptions:
strings, float arrays), so the processor has twice as much data to
move through its memory subsystem.
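
To make the size difference concrete, here is a minimal sketch that computes the heap footprint of a value's top-level block as (header + fields) * word size. The helper name block_bytes is made up for this illustration, and it relies on the unsafe Obj module, so treat it as a diagnostic toy rather than something to ship.

```ocaml
(* Sketch only: block_bytes is a hypothetical helper, not a library function. *)
let bytes_per_word = Sys.word_size / 8

(* Bytes used by the top-level heap block of [v]: one header word plus
   its fields. Immediate values (ints, constant constructors) use no block. *)
let block_bytes (v : 'a) : int =
  let r = Obj.repr v in
  if Obj.is_int r then 0
  else (1 + Obj.size r) * bytes_per_word

let () =
  let pair = (1, 2) in
  let floats = Array.make 4 0.0 in
  Printf.printf "word size: %d bits\n" Sys.word_size;
  (* A pair is a header plus 2 word-sized fields, so it doubles with the word size. *)
  Printf.printf "pair: %d bytes\n" (block_bytes pair);
  (* A float array stores 8-byte doubles on both platforms; only the header changes. *)
  Printf.printf "float array of 4: %d bytes\n" (block_bytes floats)
```

Running the same program on a 32-bit and a 64-bit build should show the pair doubling in size while the float array stays almost the same, which is the exception mentioned above.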

However, you certainly don't get a slowdown by a factor of 2, for two
reasons: 1- the processor doesn't spend all its time doing memory
accesses, there are some computations here and there; 2- cache lines
are much bigger than 32 bits, meaning that accessing 64 bits at a
given address is much cheaper than accessing two 32-bit
quantities at two random addresses (spatial locality).

Moreover, x86 in 64-bit mode is much more compiler-friendly than in
32-bit mode: twice as many registers, a sensible floating-point model
at last.  So, OCaml in 64-bit mode generates better code than in
32-bit mode.

All in all, your 10% slowdown seems reasonable and in line with what
others reported using C benchmarks.

> This is only a difference of about 10%, but I have seen more complex
> cases where there are timing differences in excess of 50%, which is
> already pretty substantial.

Be careful with timings: I've seen simple changes in code placement
(e.g. introducing or removing dead code) cause performance differences
in excess of 20%.  It's an unfortunate fact of today's processors that
their performance is very hard to predict.
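
One way to guard against run-to-run noise is to repeat a measurement and keep the best time. The sketch below assumes Sys.time (processor time) is precise enough for the workload; time_once and best_of are names invented for this example. Note that repetition does not remove code-placement effects, since those are fixed once the binary is linked; it only helps separate them from scheduling noise.

```ocaml
(* Sketch only: time_once and best_of are hypothetical helpers. *)

(* CPU time, in seconds, taken by one call to [f]. Sys.opaque_identity
   keeps the optimizer from discarding the result of [f ()]. *)
let time_once f =
  let t0 = Sys.time () in
  ignore (Sys.opaque_identity (f ()));
  Sys.time () -. t0

(* Smallest time observed over [n] runs of [f]; infinity if [n] <= 0. *)
let best_of n f =
  let rec go i best =
    if i >= n then best
    else go (i + 1) (min best (time_once f))
  in
  go 0 infinity

(* Example workload (an assumption for illustration): build a list of boxed pairs. *)
let workload () = List.init 100_000 (fun i -> (i, i + 1))

let () =
  Printf.printf "best of 5: %.6f s\n" (best_of 5 workload)
```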

> Looking at the assembly, there is really no difference in the loop
> other than the use of the quad word instructions, which should not
> take longer on the exact same platform (i.e. same CPU-frequency).  But
> there is a suspicious call to "caml_alloc2", which might cause these
> differences.  Can it be that there are alignment problems or similar
> in the run time?

ocamlopt compiles module initialization code in the so-called
"compact" model, where code size is reduced by not open-coding some
operations such as heap allocation, but instead going through
auxiliary functions like "caml_alloc2".  This makes sense since
initialization code is usually large but not performance-critical.
I recommend you put performance-critical code in functions, not in the
initialization code.
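
As an illustration of that recommendation, here is a minimal sketch with a made-up workload (sum_pairs is hypothetical). The allocating loop sits inside a function, and the module's initialization code does nothing but call it.

```ocaml
(* Sketch only: sum_pairs is a hypothetical benchmark kernel.
   Allocates a small tuple per iteration and sums its fields. *)
let sum_pairs n =
  let acc = ref 0 in
  for i = 1 to n do
    (* Sys.opaque_identity discourages the compiler from removing the allocation. *)
    let p = Sys.opaque_identity (i, i + 1) in
    acc := !acc + fst p + snd p
  done;
  !acc

(* The initialization code only calls the function. Writing the "for" loop
   directly at this level would place it in the initialization code instead. *)
let () =
  Printf.printf "%d\n" (sum_pairs 1_000_000)
```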

- Xavier Leroy