New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Weird performance on numerical code on x86 #6979
Comments
Comment author: @mlasson Hello, I was able to reproduce this issue on linux 32bits (full disclosure : on ubuntu 14.04 with ocaml 4.01 inside a virtual machine executed on a 64 bits host). Also, I tried to reproduce the performance issue outside of caml with the following piece of assembly code : .extern cos slowcos: fastcos: These two functions compute "cos(3.14)" one is copying "3.14" on the stack using two pushes (the slow one) and the other using one float load and one float pop on the float stack (the fast one). I have a no idea why this matters that much during the computation of cos. I join to this note a file "cos.asm" and a file "test.c". You can compile it on 64bits system with (drop --32 and -m32 to compile it on a 32bits system): For reference, I've also replaced the calls to cos with a call to an identity function. Here is the output on my computer (for 40 millions iterations): 1.62459s |
Comment author: @alainfrisch A discussion on this topic: |
Comment author: @mlasson I'm quite convinced by the explanation provided by EOF, one should not push double on the stack by cutting them in half. The following patch replaces in "i386/emit.mlp" the lines that push double on the stack by x87 instructions. It relies on the assumption that there is always one free slot on the float stack (is it an invariant maintained by the compiler ?). It solves the problem, but I hope it does not create others. Index: asmcomp/i386/emit.mlp --- asmcomp/i386/emit.mlp
|
Comment author: @mlasson I have two more remarks to add.
For instance here is the assembly produces by "gcc -S" to comple the function "double f() { double x = id(3.14), y = id(x); return x + y; }".
0.365847s It computes 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14. .section .data fastpi: movl pi, %eax fstpl 0(%esp) fldl 0(%esp) faddl 0(%esp) addl $8, %esp slowpi: fstpl 0(%esp) movl pi, %eax fldl 0(%esp) faddl 0(%esp) addl $8, %esp |
Comment author: @xavierleroy Thanks for the detective work. The two-pushl idiom was what GCC used to generate a while ago. It made sense at that time because fldl/fstpl was reputed to be slow, owing perhaps to the 64 bits -> 80 bits -> 64 bits conversions involved. Also, store forwarding was probably less aggressive on the x86 processors of that time. I see that today GCC uses the fldl/fstpl sequence, so I'm fine with using it as well in the x86-32 back-end of OCaml. |
Comment author: @alainfrisch Thanks Xavier! Do you think Marc's patch (in a note below) is valid, i.e. that one can safely assume the existence of a free slot on the x87 pseudo stack when the move happens? |
Comment author: @xavierleroy No need to worry about FP stack overflow. The x87 stack is used only during the evaluation of pure subexpressions, and is empty otherwise, in particular when we push the arguments to a C function call. Besides, owing to the use of Ershov's algorithm in ocamlopt, you'd need a really huge FP expression to make use of all 8 slots of the FP stack. So, you can go ahead and merge Marc's patch. |
Comment author: @alainfrisch Thanks Marc and Xavier! Suggested fix committed to trunk, rev 16414. |
Original bug ID: 6979
Reporter: @alainfrisch
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2017-02-16T14:14:55Z)
Resolution: fixed
Priority: normal
Severity: minor
Fixed in version: 4.03.0+dev / +beta1
Category: back end (clambda to assembly)
Monitored by: @diml @jmeber @hcarty
Bug description
Consider the code below:
Using the 32-bit msvc port, it prints on my machine:
The first version, with a cosmetic let-binding (carried over to cmm) is so much slower!
Here are the two relevant .asm fragments which show the difference in the generated code:
The first version blits the x87 tos to the stack, and copies it again (with two 32-bit stack->stack moves). This certainly avoids useless moves, but I'm suprised by the impact on performance (moves should be cheap compared to "cos"). Perhaps some alignment issues?
Note: we observed such differences on real-world code.
File attachments
The text was updated successfully, but these errors were encountered: