Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird performance on numerical code on x86 #6979

Closed
vicuna opened this issue Sep 4, 2015 · 8 comments
Closed

Weird performance on numerical code on x86 #6979

vicuna opened this issue Sep 4, 2015 · 8 comments
Assignees

Comments

@vicuna
Copy link

vicuna commented Sep 4, 2015

Original bug ID: 6979
Reporter: @alainfrisch
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2017-02-16T14:14:55Z)
Resolution: fixed
Priority: normal
Severity: minor
Fixed in version: 4.03.0+dev / +beta1
Category: back end (clambda to assembly)
Monitored by: @diml @jmeber @hcarty

Bug description

Consider the code below:

let f0 () =
  let r = ref 42. in
  for i = 1 to 100000000 do
    let x = 42. +. !r in
    r := cos x;
  done

let f1 () =
  let r = ref 42. in
  for i = 1 to 100000000 do
    r := cos (42. +. !r);
  done

let () =
  let t0 = Unix.gettimeofday () in
  f0 ();
  let t1 = Unix.gettimeofday () in
  f1 ();
  let t2 = Unix.gettimeofday () in
  Printf.printf "%.02f   %.02f\n" (t1 -. t0) (t2 -. t1)

Using the 32-bit msvc port, it prints on my machine:

3.13   2.59

The first version, with a cosmetic let-binding (carried over to cmm) is so much slower!

Here are the two relevant .asm fragments which show the difference in the generated code:

	fadd	REAL8 PTR [esp]
	fstp	REAL8 PTR [esp]
	push	DWORD PTR [esp+4]
	push	DWORD PTR [esp+4]
	call	_cos
	add	esp, 8
...
	fadd	REAL8 PTR [esp]
	sub	esp, 8
	fstp	REAL8 PTR [esp]
	call	_cos
	add	esp, 8

The first version blits the x87 tos to the stack, and copies it again (with two 32-bit stack->stack moves). This certainly avoids useless moves, but I'm suprised by the impact on performance (moves should be cheap compared to "cos"). Perhaps some alignment issues?

Note: we observed such differences on real-world code.

File attachments

@vicuna
Copy link
Author

vicuna commented Sep 8, 2015

Comment author: @mlasson

Hello,

I was able to reproduce this issue on linux 32bits (full disclosure : on ubuntu 14.04 with ocaml 4.01 inside a virtual machine executed on a 64 bits host).

Also, I tried to reproduce the performance issue outside of caml with the following piece of assembly code :

.extern cos
.section .data
pi: .double 3.14
.section .text
.global slowcos
.global fastcos

slowcos:
pushl pi+4
pushl pi
call cos
addl $8, %esp
ret

fastcos:
fldl pi
subl $8, %esp
fstpl 0(%esp)
call cos
addl $8, %esp
ret

These two functions compute "cos(3.14)" one is copying "3.14" on the stack using two pushes (the slow one) and the other using one float load and one float pop on the float stack (the fast one).

I have a no idea why this matters that much during the computation of cos.

I join to this note a file "cos.asm" and a file "test.c". You can compile it on 64bits system with (drop --32 and -m32 to compile it on a 32bits system):
as --32 cos.asm -o cos.o
gcc -m32 -O0 cos.o test.c -lm -o test

For reference, I've also replaced the calls to cos with a call to an identity function. Here is the output on my computer (for 40 millions iterations):

1.62459s
slowcos = -0.999999
1.06661s
fastcos = -0.999999
0.254804s
fastpi = 3.14
0.265336s
slowpi = 3.14

@vicuna
Copy link
Author

vicuna commented Sep 8, 2015

@vicuna
Copy link
Author

vicuna commented Sep 8, 2015

Comment author: @mlasson

I'm quite convinced by the explanation provided by EOF, one should not push double on the stack by cutting them in half.

The following patch replaces in "i386/emit.mlp" the lines that push double on the stack by x87 instructions.

It relies on the assumption that there is always one free slot on the float stack (is it an invariant maintained by the compiler ?).

It solves the problem, but I hope it does not create others.

Index: asmcomp/i386/emit.mlp

--- asmcomp/i386/emit.mlp
+++ asmcomp/i386/emit.mlp
@@ -744,9 +744,10 @@
stack_offset := !stack_offset + 8
| {loc = Stack sl; typ = Float} ->
let ofs = slot_offset sl 1 in

  •        I.push (mem32 DWORD (ofs + 4) RSP);
    
  •        I.push (mem32 DWORD (ofs + 4) RSP);
    
  •        I.fld (mem32 REAL8 ofs RSP);
    
  •        I.sub (int 8) esp;
           cfi_adjust_cfa_offset 8;
    
  •        I.fstp (mem32 REAL8 0 RSP);
           stack_offset := !stack_offset + 8
       | _ ->
           I.push (reg r);
    

@vicuna
Copy link
Author

vicuna commented Sep 9, 2015

Comment author: @mlasson

I have two more remarks to add.

  1. it seems that gcc is almost always using the pair fldl/fstpl to push doubles on the stack.

For instance here is the assembly produces by "gcc -S" to comple the function "double f() { double x = id(3.14), y = id(x); return x + y; }".
f:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5

    subl    $24, %esp
    fldl    .LC1
    fstpl   (%esp)
    call    id
    fstpl   -16(%ebp)
    fldl    -16(%ebp)
    fstpl   (%esp)
    call    id
    fstpl   -8(%ebp)
    fldl    -16(%ebp)
    faddl   -8(%ebp)
    leave
  1. To confirm that this phenomenon is not specific to special math-functions (like cos and exp), the assembly code below (warp in test.c) ouputs :

0.365847s
fastpi = 28.26
1.24235s
slowpi = 28.26

It computes 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14 + 3.14.
Both version uses the same instructions, but one first copies using "movs" then copies with "fstpl" while the other do the "fstpl" first and the mov after. Thus the first one is more "cache-friendly".

.section .data
pi: .double 3.14
.section .text
.global slowpi
.global fastpi

fastpi:
fldl pi
subl $8, %esp

movl pi, %eax
movl %eax, 0(%esp)
movl pi+4, %eax
movl %eax, 4(%esp)

fstpl 0(%esp)

fldl 0(%esp)

faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)

addl $8, %esp
ret

slowpi:
fldl pi
subl $8, %esp

fstpl 0(%esp)

movl pi, %eax
movl %eax, 0(%esp)
movl pi+4, %eax
movl %eax, 4(%esp)

fldl 0(%esp)

faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)
faddl 0(%esp)

addl $8, %esp
ret

@vicuna
Copy link
Author

vicuna commented Sep 9, 2015

Comment author: @xavierleroy

Thanks for the detective work.

The two-pushl idiom was what GCC used to generate a while ago. It made sense at that time because fldl/fstpl was reputed to be slow, owing perhaps to the 64 bits -> 80 bits -> 64 bits conversions involved. Also, store forwarding was probably less aggressive on the x86 processors of that time.

I see that today GCC uses the fldl/fstpl sequence, so I'm fine with using it as well in the x86-32 back-end of OCaml.

@vicuna
Copy link
Author

vicuna commented Sep 9, 2015

Comment author: @alainfrisch

Thanks Xavier! Do you think Marc's patch (in a note below) is valid, i.e. that one can safely assume the existence of a free slot on the x87 pseudo stack when the move happens?

@vicuna
Copy link
Author

vicuna commented Sep 9, 2015

Comment author: @xavierleroy

No need to worry about FP stack overflow. The x87 stack is used only during the evaluation of pure subexpressions, and is empty otherwise, in particular when we push the arguments to a C function call. Besides, owing to the use of Ershov's algorithm in ocamlopt, you'd need a really huge FP expression to make use of all 8 slots of the FP stack. So, you can go ahead and merge Marc's patch.

@vicuna
Copy link
Author

vicuna commented Sep 10, 2015

Comment author: @alainfrisch

Thanks Marc and Xavier! Suggested fix committed to trunk, rev 16414.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants