Comparison of OCaml and MLton for numerics
Date: 2007-06-01 (14:57)
From: Brian Hurt <bhurt@j...>
Subject: Re: [Caml-list] Comparison of OCaml and MLton for numerics
Stephen Weeks wrote:

>> However, in some cases, defunctorization may produce a good speedup,
>> especially if you use massive inlining (e.g. ocamlopt -inline 1000).
>> On the contrary, defunctorization may cause cache problems, because
>> the defunctorized code may be much bigger than the initial code.
> I've never observed this problem in practice using MLton, and don't
> know anyone in the MLton world who has.  Has this actually been
> observed using the OCaml defunctorizer?
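For context, here is a minimal sketch (my illustration, not code from the thread) of the kind of code defunctorization targets: a functor application such as Set.Make, where comparisons normally go through the functor argument, and where per-instantiation specialization is exactly what trades speed against code size.

```ocaml
(* Illustrative sketch: Set.Make is a functor, so every comparison
   inside IntSet normally goes through an indirect call to
   IntOrd.compare.  A defunctorizer, or heavy inlining such as
   ocamlopt -inline 1000, can specialize the functor body per
   application; that removes the indirection, but the body is
   duplicated at each instantiation, which is where code-size
   blowup can come from. *)

module IntOrd = struct
  type t = int
  let compare (a : int) (b : int) = Stdlib.compare a b
end

module IntSet = Set.Make (IntOrd)

let () =
  let s = IntSet.of_list [3; 1; 2; 1] in
  assert (IntSet.cardinal s = 3);
  assert (IntSet.min_elt s = 1)
```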

Not with the OCaml defunctorizer, but in other contexts I have indeed 
seen cases where inlining functions significantly decreased 
performance, due to cache thrashing.

And I know people (my dad) who've seen program sizes shrink by a factor 
of 3 with a one-*word* change in the source code.  Short story: a base 
class in a large C++ program had an inline virtual destructor, which 
then had to be inlined everywhere in the code where an object that 
inherited from that class was freed.  Removing the inline keyword 
significantly increased performance and radically decreased code size.  
The code change was opposed because "inlining functions makes code faster."

Another example I've seen, although it's smaller, is in branch 
prediction.  CPUs keep track, per branch, of whether branches tend to be 
taken or not.  Branch prediction is then used to speculatively execute 
code, but the problem is that when a branch is mispredicted, the cost is 
large (10-20+ clock cycles; smaller than the 100-350+ clock cycles of a 
cache miss, but still significant compared to the cost of a function 
call).  CPUs only track a limited number of branches, however.  By 
inlining, and thus duplicating, the code, you put more pressure on the 
branch prediction logic and have more branches mispredicted, with the 
associated cost.

My experience has been that inlining is only a win in three cases: 1) 
where the function being inlined is so trivial and so small that the 
size and cost of the function call is comparable to the rest of the 
function.  Given that a function call to a known location is about 5 
bytes on the x86, and the cost of a call the last time I measured it was 
about 1.5 cycles, plus 1-2 cycles per argument, I mean really effin' 
small and simple functions here.  Or, 2) where the function is only 
called from one place, or 3) where inlining opens up significant other 
optimization opportunities.  The classic example for OCaml here is 
replacing a call to the polymorphic compare with an 
integer compare.  Most of the rest of the time, inlining is either a 
break-even proposition, or often a loss.
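A hedged sketch of that classic case (my example, not Brian's code): a call to the generic `compare` goes through the runtime's structural comparison, while an int-annotated comparison can compile down to a plain integer-compare instruction, which is exactly the kind of win that inlining plus specialization can expose.

```ocaml
(* Generic version: `compare` here is polymorphic, so ocamlopt emits a
   call into the runtime's structural comparison. *)
let generic_max x y = if compare x y >= 0 then x else y

(* Specialized version: the int annotations let the compiler use a
   direct integer comparison instead of the runtime call. *)
let int_max (x : int) (y : int) = if x >= y then x else y

let () =
  assert (generic_max 2 5 = 5);
  assert (int_max 2 5 = 5);
  assert (generic_max 7 3 = int_max 7 3)
```

Both functions compute the same result; only the generated code differs.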

Which is why I consider Linus Torvalds' "real programmer" attitude dumb.  
In the first two cases, the compiler can easily determine that the 
inlining is a good idea: counting the cost or size of a function is easy 
enough, and counting the number of places a function is called 
from is really easy.  And the third case, where inlining opens up new 
possibilities for optimization, almost has to be handled by the 
compiler, as it depends upon which optimizations the compiler can, and 
will, apply to the newly inlined function.  This is something I trust 
the compiler to do more than I trust even myself to do correctly.