Stack shadow space allocation causes Windows performance problems #7664

Closed
vicuna opened this issue Oct 29, 2017 · 2 comments

vicuna commented Oct 29, 2017

Original bug ID: 7664
Reporter: russ
Status: resolved (set by @xavierleroy on 2017-10-30T08:31:27Z)
Resolution: won't fix
Priority: normal
Severity: minor
Platform: x64
OS: Windows
Version: 4.05.0
Category: back end (clambda to assembly)
Monitored by: @gasche

Bug description

Around every C call in 64-bit Windows, the compiler allocates shadow space on the stack for the callee. It looks like this:

    sub rsp, 32
    call QWORD PTR __caml_imp_get_array_float
    add rsp, 32

Since this happens on every C call, it can cause a 50%+ slowdown in a tight loop. I think these RSP adjustments could be moved into the function prologue and epilogue and merged with the existing RSP alignment (sub 8, add 8), but only for functions that actually make C calls.

This may also be present in the generated assembly for 32-bit Windows, although I haven't checked.

Steps to reproduce

Here is a simple example where it slows down a call to caml_modify:

    type foo =
      | Empty
      | Int of int

    let myarray = Array.make 10000 Empty

    let () =
      let xx = Int 100 in
      for z = 0 to 10000 do
        for x = 0 to 9999 do
          Array.set myarray x xx
        done
      done
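
To put a number on the slowdown, the loop above could be wrapped in a simple timer and built on both platforms. The following is only a sketch: `fill_repeatedly` and `time_it` are hypothetical helper names that do not appear in the original report, and `Sys.time` measures processor time, which is assumed to be precise enough for a loop of this size.

```ocaml
(* Hypothetical timing harness; not part of the original report. *)
type foo =
  | Empty
  | Int of int

(* Store [v] into every slot of [arr], [outer] times over.
   Storing a boxed value into a [foo array] is expected to go
   through the caml_modify write barrier. *)
let fill_repeatedly (arr : foo array) (v : foo) ~outer =
  for _pass = 1 to outer do
    for x = 0 to Array.length arr - 1 do
      Array.set arr x v
    done
  done

(* Return the processor time, in seconds, consumed by [f ()]. *)
let time_it f =
  let start = Sys.time () in
  f ();
  Sys.time () -. start

let () =
  let myarray = Array.make 10_000 Empty in
  let xx = Int 100 in
  let elapsed =
    time_it (fun () -> fill_repeatedly myarray xx ~outer:10_000)
  in
  Printf.printf "elapsed: %.3f s\n" elapsed
```

Running the same binary source under the Windows and Linux native-code compilers would give two figures to compare. It would not by itself separate the cost of the RSP adjustments from the cost of the called C function.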

vicuna commented Oct 30, 2017

Comment author: @xavierleroy

Background info:

The SP adjustments around C calls that you observed are performed any time the C calling conventions dictate that some arguments are passed on the stack or that stack space must be reserved by the caller. For x86-64 under Win64, 32 bytes of stack space must be reserved. Other processor/OS combinations have less demanding requirements.

Concerning the proposed change in code generation:

First, I think you're greatly exaggerating the costs of these stack adjustments. The call is not cheap, and the function that is called does nontrivial work. Two register arithmetic operations are cheap compared to this.

Second, it is a conscious decision to SP-adjust around C calls rather than integrate the extra space in the calculation of the stack frame size as you suggest. The reason is that doing what you suggest would increase stack usage in the presence of recursive calls. Consider:

    let rec f x =
      if then

      else

With your suggestion, every stack frame for f is 32 bytes bigger, leading to stack overflow earlier. With the current OCaml approach, the 32 bytes of extra stack space exist only during the C call and don't show up in the recursion.
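
The stack-growth argument can be made concrete with a small sketch. Everything named here is hypothetical: `depth_then_log` is just one example of a non-tail-recursive function with a C call in one branch and a recursive call in the other, and `Stdlib.log` is assumed to compile to an external C call in native code. The 32-byte figure is the Win64 shadow space mentioned above.

```ocaml
(* Hypothetical illustration; names are assumptions, not compiler code. *)

(* Win64 shadow space the caller must reserve around a C call, in bytes. *)
let shadow_space_bytes = 32

(* Extra stack that would be live at recursion depth [depth] if the
   shadow space were folded into every frame of the recursive function. *)
let extra_stack_bytes ~depth = depth * shadow_space_bytes

(* Non-tail-recursive: the addition happens after the recursive call
   returns, so each level keeps its frame alive. The C call (log, assumed
   to be an external) happens only once, at the bottom of the recursion. *)
let rec depth_then_log n x =
  if n = 0 then
    log x
  else
    1.0 +. depth_then_log (n - 1) x

let () =
  Printf.printf "result: %g\n" (depth_then_log 3 1.0);
  Printf.printf "extra bytes at depth 100000: %d\n"
    (extra_stack_bytes ~depth:100_000)
```

With the adjustment done around the call, the 32 bytes exist once, during `log`. Folded into the frame, they would be paid at every level: about 3.2 MB of additional stack at a depth of 100,000.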

Conclusion:

Thanks, but no thanks.

@vicuna vicuna closed this as completed Oct 30, 2017

vicuna commented Oct 30, 2017

Comment author: russ

Thank you for taking the time to look at this so quickly. I had not thought about the recursion aspect. You're absolutely right: it would be a bad idea to take extra stack space like that.

Still, I am noticing that C calls are slower on Windows than on Linux, particularly calls to the caml_modify write barrier function (as I said, it's 50% slower in a tight loop). The only major difference in the generated ASM was the SP modification, so I figured that was the cause. Maybe it's just MSVC doing a worse job than GCC of optimizing the called C functions.
