[Caml-list] Object-oriented access bottleneck
Date: 2003-12-07 (23:50)
From: Abdulaziz Ghuloum <aghuloum@c...>
Subject: Re: [Caml-list] Object-oriented access bottleneck
Brian Hurt wrote:

>I actually question the value of inlining as a performance improvement, 
>unless it leads to other significant optimizations.  Function calls simply 
>aren't that expensive anymore, on today's OOO super-scalar 
>speculative-execution CPUs.  A direct call, i.e. one not through a 
>function pointer, I benchmarked out at about 1.5 clocks on an AMD K6-3.  
>Probably less on a more advanced CPU.  Indirect calls, i.e. through a 
>function pointer, are slower only due to the load-to-use penalty.  If the 
>pointer is in L1 cache, an indirect call is probably only 3-8 clocks.
>Cache misses are the big cost.  Hitting L1 cache, the cheapest memory 
>access, is generally 2-4 clocks.  L2 cache is generally 6-30 clocks.  
>Missing cache entirely and having to go to main memory is 100-300+ clocks.  
>Inlining expands the code size, and thus means you're likely having more 
>expensive cache misses.  At 300 clocks/cache miss, it doesn't take all 
>that many cache misses to totally overwhelm the small advantages gained 
>by inlining functions.
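
As a rough back-of-the-envelope check on the figures quoted above, the sketch below computes how many avoided direct calls it takes to pay for one extra miss to main memory. It assumes, for illustration only, that an inlined call saves exactly the quoted call cost and nothing else; the name break_even is hypothetical, and the numbers are the quoted estimates, not measurements.

```ocaml
(* Back-of-the-envelope: how many avoided direct calls pay for one
   extra miss to main memory?  The figures are the ones quoted above;
   they are rough estimates, not measurements made here. *)
let direct_call_clocks = 1.5   (* quoted cost of a direct call *)
let miss_clocks_low = 100.0    (* quoted low end of a main-memory miss *)
let miss_clocks_high = 300.0   (* quoted high end of a main-memory miss *)

(* Number of saved calls needed to offset one miss (hypothetical helper). *)
let break_even ~miss ~call = miss /. call

let () =
  Printf.printf "break-even: %.0f to %.0f saved calls per extra miss\n"
    (break_even ~miss:miss_clocks_low ~call:direct_call_clocks)
    (break_even ~miss:miss_clocks_high ~call:direct_call_clocks)
```

Under that assumption, one extra miss cancels somewhere between a few dozen and a couple of hundred avoided calls. The estimate ignores any further optimizations that inlining enables, so it is a lower bound on the benefit side.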


Do you happen to have a pointer to a document listing the (approximate) 
timing of the various instructions on today's hardware?  You have listed 
a few and I was wondering if you have a more comprehensive study.

You say "Inlining expands the code size and thus you're likely having 
more expensive cache misses".  I wonder how true that is.  For example, 
consider a simple expression such as {if p() e0 e1}.  If the compiler 
decides to inline p (in case p is small enough, a leaf node, etc ...), 
then in addition to the usual benefits of inlining (no need to save your 
live variables or restore them later, constant folding, copy propagation, 
...), you're also avoiding the jump to p.  Since p can be anywhere in 
memory, a cache miss is very probable.  If p were inlined, its code 
would sit next to the caller's and be less likely to cause a cache miss.  
Not inlining causes the PC to be all over the place, causing more cache 
misses.  Am I missing something?
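
To make the {if p() e0 e1} example concrete, here is a minimal OCaml sketch of the two forms. The predicate p and the branch bodies are hypothetical stand-ins chosen only for illustration, and the "inlined" version is written by hand to show what a compiler might produce; whether a given compiler actually inlines p depends on its heuristics and flags.

```ocaml
(* Hypothetical small leaf predicate, standing in for p. *)
let p x = x land 1 = 0

(* Call form: {if p() e0 e1}, with e0 = x / 2 and e1 = 3 * x + 1. *)
let step_call x = if p x then x / 2 else 3 * x + 1

(* Hand-written version of what inlining p could produce: the test
   sits in the caller's instruction stream, with no jump to p's code. *)
let step_inlined x = if x land 1 = 0 then x / 2 else 3 * x + 1

let () =
  (* Both forms compute the same result; only the code layout differs. *)
  for x = 0 to 1000 do
    assert (step_call x = step_inlined x)
  done
```

In the inlined form the test is a couple of instructions adjacent to the branch, so for a predicate this small the caller may well end up no larger than the call sequence it replaces.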

