Mantis Bug Tracker

ID: 0007664
Project: OCaml
Category: back end (clambda to assembly)
View Status: public
Date Submitted: 2017-10-29 13:14
Last Update: 2017-10-30 11:34
Assigned To:
Status: resolved
Resolution: won't fix
Platform: x64
OS: Windows
OS Version:
Product Version: 4.05.0
Target Version:
Fixed in Version:
Summary: 0007664: Stack shadow space allocation causes Windows performance problems
Description:

Around every C call in 64-bit Windows, the compiler allocates shadow space on the stack for the callee. It looks like this:

sub rsp, 32
call QWORD PTR __caml_imp_get_array_float
add rsp, 32

Since this happens on every C call, it can cause a 50%+ slowdown in a tight loop. I think these RSP adjustments could be moved into the function prologue and epilogue and merged with the existing RSP alignment (sub 8, add 8). Presumably this should only be done if the function actually makes any C calls.

This may also be present in the generated assembly for 32-bit Windows, although I haven't checked.
Steps To Reproduce:

Here is a simple example where it slows down a call to caml_modify:

type foo =
    | Empty
    | Int of int

let myarray = Array.make 10000 Empty
let () =
    let xx = Int 100 in
    for z = 0 to 10000 do
        for x = 0 to 9999 do
            Array.set myarray x xx
        done
    done
Tags: No tags attached.

Notes
xleroy (administrator)
2017-10-30 09:31

Background info:

The SP adjustments around C calls that you observed are performed any time the C calling conventions dictate that some arguments are passed on the stack or that stack space must be reserved by the caller. For x86-64 under Win64, 32 bytes of stack space must be reserved. Other processor/OS combinations have less demanding requirements.

Concerning the proposed change in code generation:

First, I think you're greatly exaggerating the costs of these stack adjustments. The call is not cheap, and the function that is called does nontrivial work. Two register arithmetic operations are cheap compared to this.

Second, it is a conscious decision to SP-adjust around C calls rather than integrate the extra space in the calculation of the stack frame size as you suggest. The reason is that doing what you suggest would increase stack usage in the presence of recursive calls. Consider:

let rec f x =
  if <cond> then
    <some C call>
    <non-tail recursive call to f>

With your suggestion, every stack frame for f is 32 bytes bigger, leading to stack overflow earlier. With the current OCaml approach, the 32 bytes of extra stack space exist only during the C call and don't show up in the recursion.
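
The stack-usage argument above can be put into rough numbers. The following is only a back-of-the-envelope sketch: the helper names and the recursion depths are hypothetical, and it simply assumes 32 extra bytes per frame in the merged scheme versus a single 32-byte reservation that exists only while a C call is in progress.

```ocaml
(* Shadow space the caller must reserve under Win64, per the note above. *)
let shadow_space_bytes = 32

(* Hypothetical helper: extra stack used if the shadow space were folded
   into every frame of a non-tail-recursive function of the given depth. *)
let extra_stack_if_merged ~depth = depth * shadow_space_bytes

(* Hypothetical helper: with SP adjustment around the call, the reservation
   exists only during the C call, so the extra cost does not grow with depth. *)
let extra_stack_if_adjusted ~depth:_ = shadow_space_bytes

let () =
  List.iter
    (fun depth ->
      Printf.printf "depth %8d: merged = %9d bytes, adjusted = %d bytes\n"
        depth
        (extra_stack_if_merged ~depth)
        (extra_stack_if_adjusted ~depth))
    [ 1_000; 100_000; 1_000_000 ]
```

Under these assumptions the merged scheme costs an extra amount of stack proportional to the recursion depth, while the current approach stays constant.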


Thanks, but no thanks.
russ (reporter)
2017-10-30 11:34

Thank you for taking the time to look at this so quickly. I had not thought about the recursion aspect. You're absolutely right: it would be a bad idea to take extra stack space like that.

Still, I am noticing that C calls are slower on Windows than on Linux, particularly calls to the caml_modify write barrier function (as I said, it's 50% slower in a tight loop). The only major difference in the generated ASM was the SP modification, so I figured that was the cause. Maybe it's just MSVC doing a worse job than GCC of optimizing the called C functions.
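
For comparing the two platforms, a minimal timing sketch of the reproduction loop might look like the following. The helpers `time_it` and `fill_loop` are hypothetical names introduced here, not part of the original report, and `Sys.time` reports processor time, so the figures are only a rough guide.

```ocaml
type foo =
  | Empty
  | Int of int

(* Hypothetical helper: run [f] once and return the elapsed processor
   time in seconds, as reported by Sys.time. *)
let time_it f =
  let t0 = Sys.time () in
  f ();
  Sys.time () -. t0

(* The loop from the report. The type annotations keep the array
   monomorphic, matching the original top-level definition. *)
let fill_loop ~outer (myarray : foo array) (xx : foo) =
  for _z = 0 to outer do
    for x = 0 to Array.length myarray - 1 do
      Array.set myarray x xx
    done
  done

let () =
  let myarray = Array.make 10000 Empty in
  let xx = Int 100 in
  let secs = time_it (fun () -> fill_loop ~outer:10000 myarray xx) in
  Printf.printf "fill loop: %.3f s\n" secs
```

Building the same file with the native compiler on each platform and comparing the printed times would show whether the gap persists, although it would not by itself say where the time goes.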

Issue History

Date Modified      Username   Field                 Change
2017-10-29 13:14   russ       New Issue
2017-10-30 09:31   xleroy     Note Added: 0018629
2017-10-30 09:31   xleroy     Status                new => resolved
2017-10-30 09:31   xleroy     Resolution            open => won't fix
2017-10-30 11:34   russ       Note Added: 0018631
