Need for a built in round_to_int function
From: Erik de Castro Lopo <ocaml-erikd@m...>
Subject: Re: [Caml-list] Need for a built in round_to_int function
On Mon, 21 Feb 2005 17:00:23 +0100
Xavier Leroy <Xavier.Leroy@inria.fr> wrote:

> On the other hand, according to the P4 optimization manuals, the P4 is
> supposed to special-case this particular use of fnstcw / fldcw, so
> perhaps the situation is no worse than on the P3.  

OK, I've just tested this. On the P4 the performance hit of fnstcw / fldcw
is not as bad as it is on the P3, but it's still significant:

Using this test program (compiled with my hacked version of ocamlopt):

     http://www.mega-nerd.com/tmp/round_to_int.ml

On a 450MHz P3:

    Time int_of_float : 5.970000
    Time round_to_int : 2.360000

On a 2.8GHz P4:

    Time int_of_float : 0.420000
    Time round_to_int : 0.260000
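
The linked test program is not reproduced here. As a rough sketch of this
kind of comparison, a timing loop could look like the one below. The names
round_to_int and time_conv, the input values, and the iteration count are
all assumptions made for illustration. In particular, round_to_int here is
a portable stand-in built on floor, not the primitive from the hacked
ocamlopt, so it will not show the same speed-up:

```ocaml
(* Hypothetical stand-in: stock OCaml has no round_to_int built-in, so
   rounding is emulated with floor. This is NOT the primitive from the
   hacked compiler discussed in this message. *)
let round_to_int (x : float) : int = int_of_float (floor (x +. 0.5))

(* Apply [f] to [iters] float inputs and return the elapsed CPU time in
   seconds together with a checksum, so the loop cannot be optimised away. *)
let time_conv (f : float -> int) (iters : int) : float * int =
  let start = Sys.time () in
  let acc = ref 0 in
  for i = 0 to iters - 1 do
    acc := !acc + f (float_of_int i *. 0.37)
  done;
  (Sys.time () -. start, !acc)

let () =
  let iters = 10_000_000 in
  let (t_trunc, _) = time_conv int_of_float iters in
  let (t_round, _) = time_conv round_to_int iters in
  Printf.printf "Time int_of_float : %f\n" t_trunc;
  Printf.printf "Time round_to_int : %f\n" t_round
```

The absolute numbers depend entirely on the machine and compiler, so only
the ratio between the two lines is meaningful.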

> Essentially zero :-(  Basically, this is a case where additional stuff is
> introduced in the machine-independent parts of ocamlopt and in every
> code generator just to work around the brain-dead x87 floating-point
> instruction set.

Obviously it is your decision, but I think round_to_int is a common 
enough operation to warrant its own function. The ISO C Standards
committee thought so.
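
To illustrate why the operation is distinct from int_of_float, here is a
small sketch of the difference in results. It assumes OCaml 4.07 or later
for Float.round, and the name round_half_away is made up for this example.
Note that Float.round breaks ties away from zero, whereas C99's lrint
follows the current rounding mode, so this is only an approximation of
what a built-in would do:

```ocaml
(* int_of_float truncates toward zero. *)
let truncated_pos = int_of_float 2.7      (* 2 *)
let truncated_neg = int_of_float (-2.7)   (* -2 *)

(* Hypothetical helper: round to the nearest integer, ties away from
   zero, using Float.round from the standard library (OCaml >= 4.07). *)
let round_half_away (x : float) : int = int_of_float (Float.round x)

let () =
  Printf.printf "int_of_float 2.7      = %d\n" truncated_pos;
  Printf.printf "int_of_float (-2.7)   = %d\n" truncated_neg;
  Printf.printf "round_half_away 2.7   = %d\n" (round_half_away 2.7);
  Printf.printf "round_half_away (-2.7) = %d\n" (round_half_away (-2.7))
```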

>  Every other processor (as well as the SSE2 instr.set

Quite honestly, I think the value of SSE and SSE2 is oversold.
There are certain algorithms which simply can't be made to run
as fast on SSE/SSE2 as they run on the x87 FPU.

For instance, my audio sample rate converter:

    http://www.mega-nerd.com/SRC/

If I compile this on a P3 with gcc-3.4 using -mfpmath=sse -msse,
the highest quality (and hence slowest) converter runs 50% slower
than the x87 FPU version. I have also tried rewriting the algorithm
in hand-optimised SSE code. The best I could get (I'm not an assembler
expert) was still 10% slower than the x87 FPU version.

I have just now repeated my experiment by compiling SRC on a P4
with -msse2 (-mfpmath=sse2 doesn't work); the converter runs 75%
slower than the x87 FPU version.

> I spent a lot of time in the past trying to extract decent float
> performance out of the x87 instruction set,

<snip>

> Nowadays, I no longer care about
> performance for x87: users who want good float performance should
> simply use the x86-64 architecture (with SSE2 floats), 

I'd love to get my hands on one of these, but I really do doubt
that its performance will be much better than that of the P4. 
The main problem is that generating good SSE/SSE2 code from
a high level language is an order of magnitude more difficult
than generating code for the x87 FPU.

Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo  nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"Projects promoting programming in natural language are intrinsically
doomed to fail." -- Edsger Dijkstra