Need for a built in round_to_int function
| Date: | -- (:) |
| From: | Erik de Castro Lopo <ocaml-erikd@m...> |
| Subject: | Re: [Caml-list] Need for a built in round_to_int function |
On Mon, 21 Feb 2005 17:00:23 +0100
Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
> On the other hand, according to the P4 optimization manuals, the P4 is
> supposed to special-case this particular use of fnstcw / fldcw, so
> perhaps the situation is no worse than on the P3.
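OK, I've just tested this. On the P4 the performance hit of fnstcw / fldcw
is not as bad as it is on the P3, but it's still significant (int_of_float
is roughly 1.6 times slower than round_to_int on the P4, versus roughly
2.5 times slower on the P3).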
Using this test program (compiled with my hacked version of ocamlopt):
http://www.mega-nerd.com/tmp/round_to_int.ml
On a 450MHz P3:
Time int_of_float : 5.970000
Time round_to_int : 2.360000
On a 2.8GHz P4:
Time int_of_float : 0.420000
Time round_to_int : 0.260000
> Essentially zero :-( Basically, this is a case where additional stuff is
> introduced in the machine-independent parts of ocamlopt and in every
> code generator just to work around the brain-dead x87 floating-point
> instruction set.
Obviously it is your decision, but I think round_to_int is a common
enough operation to warrant its own function. The ISO C Standards
committee thought so.
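
To make the distinction concrete, here is a minimal sketch in plain OCaml
of the difference between truncation and rounding to nearest. The body of
round_to_int below is a hypothetical stand-in, not the primitive from my
patched ocamlopt; it assumes the intended behaviour is round-to-nearest
with ties going to the even neighbour (what C99's lrint does in the
default rounding mode), and it needs the Float module from OCaml 4.07 or
later.

```ocaml
(* Hypothetical pure-OCaml stand-in for the proposed primitive.
   Assumption: round to nearest, ties to even, matching the default
   floating-point rounding mode. *)
let round_to_int (x : float) : int =
  if Float.abs (x -. Float.trunc x) = 0.5 then
    (* Exact tie: halve, round, and double to land on the even neighbour. *)
    int_of_float (2.0 *. Float.round (x /. 2.0))
  else
    (* Not a tie: Float.round already gives the nearest integer. *)
    int_of_float (Float.round x)

(* Print both conversions side by side for a few sample values. *)
let () =
  List.iter
    (fun x ->
      Printf.printf "%5.2f  int_of_float = %2d  round_to_int = %2d\n"
        x (int_of_float x) (round_to_int x))
    [ 2.3; 2.5; 2.7; 3.5; -2.5; -2.7 ]
```

int_of_float always truncates towards zero, so the two functions disagree
whenever the fractional part is larger than one half. A version written
in OCaml like this one only illustrates the semantics; it says nothing
about the speed of a built-in primitive.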
> Every other processor (as well as the SSE2 instr.set
Quite honestly I think the value of SSE and SSE2 is oversold.
There are certain algorithms which simply can't be made to run
as fast on SSE/SSE2 as they run on the x87 FPU.
For instance, my audio sample rate converter:
http://www.mega-nerd.com/SRC/
If I compile this on a P3 with gcc-3.4 using -mfpmath=sse -msse,
the highest quality (and hence slowest) converter runs 50% slower
than the x87 FPU version. I have also tried rewriting the algorithm
in hand-optimised SSE code. The best I could get (I'm not an assembler
expert) was still 10% slower than the x87 FPU version.
I have just now repeated my experiment by compiling SRC on a P4
with -msse2 (-mfpmath=sse2 doesn't work); the converter runs 75%
slower than the x87 FPU version.
> I spent a lot of time in the past trying to extract decent float
> performance out of the x87 instruction set,
<snip>
> Nowadays, I no longer care about
> performance for x87: users who want good float performance should
> simply use the x86-64 architecture (with SSE2 floats),
I'd love to get my hands on one of these, but I really do doubt
that its performance will be much better than that of the P4.
The main problem is that generating good SSE/SSE2 code from
a high level language is an order of magnitude more difficult
than generating code for the x87 FPU.
Erik
--
+-----------------------------------------------------------+
Erik de Castro Lopo nospam@mega-nerd.com (Yes it's valid)
+-----------------------------------------------------------+
"Projects promoting programming in natural language are intrinsically
doomed to fail." -- Edsger Dijkstra