|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0007664||OCaml||back end (clambda to assembly)||public||2017-10-29 13:14||2017-10-30 11:34|
|Target Version||Fixed in Version|
|Summary||0007664: Stack shadow space allocation causes Windows performance problems|
|Description||Around every C call in 64-bit Windows, the compiler allocates shadow space on the stack for the callee. It looks like this:|
sub rsp, 32
call QWORD PTR __caml_imp_get_array_float
add rsp, 32
Since this happens on every C call, it can cause a slowdown of 50% or more in a tight loop. I think these RSP adjustments could be moved into the function prologue and epilogue and merged with the existing RSP alignment (sub 8, add 8), presumably only when the function actually makes C calls.
This may also be present in the generated assembly for 32-bit Windows, although I haven't checked.
|Steps To Reproduce||Here is a simple example where it slows down a call to caml_modify:|
type foo =
  | Empty
  | Int of int
let myarray = Array.make 10000 Empty
let () =
  let xx = Int 100 in
  for z = 0 to 10000 do
    for x = 0 to 9999 do
      Array.set myarray x xx
    done
  done
|Tags||No tags attached.|
The SP adjustments around C calls that you observed are performed any time the C calling conventions dictate that some arguments are passed on the stack or that stack space must be reserved by the caller. For x86-64 under Win64, 32 bytes of stack space must be reserved. Other processor/OS combinations have less demanding requirements.
Concerning the proposed change in code generation:
First, I think you're greatly exaggerating the costs of these stack adjustments. The call is not cheap, and the function that is called does nontrivial work. Two register arithmetic operations are cheap compared to this.
Second, it is a conscious decision to SP-adjust around C calls rather than integrate the extra space in the calculation of the stack frame size as you suggest. The reason is that doing what you suggest would increase stack usage in the presence of recursive calls. Consider:
let rec f x =
  if <cond> then
    <some C call>
    <non-tail recursive call to f>
With your suggestion, every stack frame for f is 32 bytes bigger, leading to stack overflow earlier. With the current OCaml approach, the 32 bytes of extra stack space exist only during the C call and don't show up in the recursion.
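To put a rough number on the stack-usage argument, here is a minimal sketch that only does the arithmetic. The recursion depth is a hypothetical figure chosen for illustration, and the names `shadow_space_bytes` and `extra_stack_bytes` are not from the compiler sources.

```ocaml
(* Sketch only: estimates the extra stack consumed if the 32-byte Win64
   shadow space were folded into every frame of a non-tail-recursive
   function, instead of being reserved only around each C call. *)

(* Win64 shadow space: room for the 4 register arguments, 8 bytes each. *)
let shadow_space_bytes = 4 * 8

(* Extra bytes of stack in use when [depth] frames are live at once. *)
let extra_stack_bytes ~depth = depth * shadow_space_bytes

let () =
  let depth = 1_000_000 in (* hypothetical recursion depth *)
  Printf.printf "extra stack at depth %d: %d bytes\n"
    depth (extra_stack_bytes ~depth)
```

With the adjustment kept around the call, the same 32 bytes are reused by each C call and released before the recursive call, so the cost does not grow with the depth.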
Thanks, but no thanks.
Thank you for taking the time to look at this so quickly. I had not thought about the recursion aspect. You're absolutely right: it would be a bad idea to take extra stack space like that.
Still, I am seeing slower C calls on Windows than on Linux, particularly for the caml_modify write barrier function (as I said, it's 50% slower in a tight loop). The only major difference in the generated assembly was the SP modification, so I figured that was the cause. Maybe MSVC is just doing a worse job than GCC of optimizing the called C functions.
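One way to compare the two platforms is to time the loop from the report with the same source on each. The following is a minimal sketch, assuming CPU time from `Sys.time` is precise enough for this purpose; the helper name `fill_repeatedly` is made up for illustration, and the `Empty` constructor is taken from its use in the original example.

```ocaml
(* Sketch: time the store loop from the report. Build the same file with
   ocamlopt on each platform and compare the printed figures. *)
type foo =
  | Empty
  | Int of int

let myarray = Array.make 10_000 Empty

(* Run [passes] full sweeps over the array, storing [v] in every slot. *)
let fill_repeatedly passes v =
  for _pass = 1 to passes do
    for x = 0 to Array.length myarray - 1 do
      Array.set myarray x v
    done
  done

let () =
  let xx = Int 100 in
  let t0 = Sys.time () in
  (* 10_001 passes matches "for z = 0 to 10000" in the report. *)
  fill_repeatedly 10_001 xx;
  let t1 = Sys.time () in
  Printf.printf "elapsed CPU time: %.3f s\n" (t1 -. t0)
```

A timing like this measures the whole loop, so it cannot by itself separate the cost of the RSP adjustments from the cost of the called C function.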
|2017-10-29 13:14||russ||New Issue|
|2017-10-30 09:31||xleroy||Note Added: 0018629|
|2017-10-30 09:31||xleroy||Status||new => resolved|
|2017-10-30 09:31||xleroy||Resolution||open => won't fix|
|2017-10-30 11:34||russ||Note Added: 0018631|