Date: 2003-06-13 (18:38)
From: Kip Macy <kmacy@f...>
Subject: Re: [Caml-list] FP's and HyperThreading Processors

> along with a multithreaded vendor supplied FFT routine (presumably optimized
> for their processor).
If it was optimized for the P2 it will, by definition, not be optimized for
the P4, being potentially penalized by a much deeper pipeline and the use
of a trace cache instead of a standard I-cache. For example, loop unrolling
is *bad* when you have a limited number of pre-decoded ops. Writes to the
D-cache write 64 bytes; reads bring in a "sector", or 2 cache lines, to try
to mask the increased latency of the memory bus. The hardware prefetcher
kicks in after you access 256 bytes sequentially. What this all translates
to is that perfectly healthy data access patterns on the P2 may be
pathological on the P4, and it may in part be due to the FFT. Little if
any of this applies if you already have an appropriate version of the
FFT for the P4.

It is also worth noting that with the small L1 cache sizes on the P4,
hyperthreading running data intensive programs could easily end up being a
net loss with competing processes kicking out each others cache entries.

As a side note, you could end up partly TLB-limited if your access
patterns jump around. If you are running a more recent version of Linux
you might want to try putting your data on 4MB pages.

> net result is that this program runs only twice as fast on the new 3 GHz P4
> as it runs on the old 350 MHz P2.

I suspect your analysis is correct, but I'd really have to try out the
performance counters before I came to any conclusions. This doesn't
necessarily mean that ML is intrinsically on the wrong track with
allocating new memory. It does mean that more work needs to be
done to make the memory allocator and garbage collector more locality 
aware. There is some discussion of this in "Compiling with 
Continuations" by Appel.

