FLambda particularly slow with large file consisting of top-level trivial aliases. #7630
Comments
Comment author: @gasche Good news: the upcoming 4.06 release will come with a linear-scan allocator that can be enabled optionally (option -linscan) and may help avoid register-allocation bottlenecks (we had a couple of such bottlenecks even with the non-flambda backend in the past, for example with code within Camlp4, not machine-generated).
Comment author: @ejgallego Thanks Gabriel, note though that we are not sure the register allocator is causing the problem; given the characteristics of the file, I found Pierre-Marie's finding surprising. Has linscan already been merged into 4.06?
Comment author: @ejgallego Never mind, I found #375. Also, Paul Steckler has reported the following in Coq's gitter: OK, the profile does have Coloring.walk pretty high, but also …
Comment author: @gasche That indeed looks like the graph-coloring register allocator having trouble; I have seen similar memory profiles in the past. You should definitely try the just-branched 4.06 release branch with -linscan if you can (there should be an alpha/beta release soon-ish).
Comment author: @gasche Another tool that can help, by the way, is the -dtimings option, which displays pass-by-pass time (and, in recent versions, memory) usage of the compiler. It keeps improving with each OCaml version.
Comment author: @psteckler I had created a smaller example to demonstrate the slowdown introduced by flambda. I wanted to see if the linear-scan code would help, so I installed OPAM 4.06.0+trunk+fp+flambda. I used the -linscan flag and got: Fatal error: exception Invalid_argument("index out of bounds")
Comment author: @ejgallego Maybe the combination of flambda + linscan is not well tested. Anyway, Paul used …
Comment author: @gasche Well, there goes my hope of -linscan fixing your issue :p At least we get a lot of testing by convincing Coq people to try our new options.
Comment author: @ejgallego Also, a -dtimings run for OrdersEx.v would be very useful; unfortunately I am away from my desktop, so it'll take a while.
Comment author: @psteckler I assume it's not necessary to try a non-fp switch, since the bug with linscan has been fixed.
Comment author: @ejgallego OK, so I am back at my desktop, and for the example in #7631 I see: 4.05.0+trunk+flambda -Oclassic: … vs 4.05.0: … thus this is IMHO a bug.
Comment author: @gasche Again, you cannot conclude that there is a bug from these numbers, because how much work the register allocator has to do depends a lot on the size and shape of your basic blocks, which in turn depends a lot on how aggressively you inline in the previous passes. The compiler backend has changed its compilation strategies for structures (long series of "let" at the toplevel or between "struct .. end") several times in the past to avoid similar performance issues. However, I believe that these changes were performed at the bytecomp/translmod level, that is, before the split between flambda and non-flambda backends (or bytecode), so they should apply to all backends. It may be the case that some flambda optimizations undo this work, and it might be the case that a not-too-invasive change could un-undo this. See related issues: MPR #4074, MPR #5546, MPR #6001, MPR #7067
Comment author: @alainfrisch flambda goes through Translmod.transl_implementation_flambda, which uses the "normal" compilation strategy for structures, not the "allocate then update slots" one. My understanding is that later passes of flambda extract back a structured representation of toplevel definitions and do specific things with them, but I don't know whether it is supposed to avoid sending "monsters" to the register allocator.
Comment author: @ejgallego My reasoning was based on the (likely wrong) assumption that flambda -Oclassic inlining should be the same as non-flambda. Quoting the manual: "-Oclassic: Make inlining decisions at the point of definition of a function rather than at the call site(s). This mirrors the behaviour of OCaml compilers not using Flambda". So I thought some reason other than inlining could be causing the slow behavior, but of course I am likely wrong :)
Comment author: @ejgallego More data: I tried with …
Comment author: @xavierleroy
The compile-time issues mentioned in this PR are the reasons why the "allocate then update slots" translation was introduced circa 1995.
Comment author: @chambart Yes, flambda is supposed to try to do this. I tracked it a bit (on the example from #7631), and it seems that the nicely crafted "allocate then update slots" pattern is broken by CSE, which shares symbol loads between all the updates, adding lots of conflicts for the register allocator to die on. This is shown by the fact that making Iconst_symbol a 'cheap operation' in CSEgen avoids the problem. I don't know the right solution. One potential hack could be to split the module initialization function into a lot of small functions. Or make symbols cheap for this particular function (I would feel quite bad writing something like that without a better criterion for triggering it).
Comment author: @chambart Also, for some context as to why there is any difference between with and without flambda: in flambda, every top-level value gets its own symbol. Something like: let bar0 = [| 0 |] ends up as something like symbol bar0.(0) <- [| 0 |] and loading the address of the symbol bar0 is shared between the assignments to bar0.(0) and toplevel_module.(0). Hence the giant conflict. The Closure-generated version does not have this specific problem because toplevel values are all fields of the same symbol, reducing the number of conflicts. The specific initialisation pattern of clambda (without going through an intermediate symbol) is not representable in flambda in most situations. (And I would be quite reluctant to make it representable.) Here, it could be represented since allocating an array is effect-free, and the values are not used anywhere other than in the toplevel module block. In the Coq-generated example it doesn't seem like it could be.
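A minimal sketch of the source shape being discussed (the identifiers bar0/bar1/bar2 are taken from the example above; everything else is invented for illustration). Each of these top-level values gets its own symbol under flambda, so the module initialiser loads several symbol addresses; CSE then shares those loads, which is what creates the register-allocation conflicts described in the comments.

```ocaml
(* Three trivial top-level array definitions.  Under flambda each of
   bar0/bar1/bar2 receives its own symbol, and the initialiser stores
   each value both through its own symbol and into the toplevel module
   block.  Semantics are unaffected; only compile time and memory
   suffer when there are thousands of such definitions. *)
let bar0 = [| 0 |]
let bar1 = [| 1 |]
let bar2 = [| 2 |]

(* prints 3 *)
let () = Printf.printf "%d\n" (bar0.(0) + bar1.(0) + bar2.(0))
```

The program itself is deliberately trivial; the point is the shape of the top-level structure, not the computation.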
Comment author: @alainfrisch It would still be useful to know the timings with -linscan (now that it has been fixed to work with "fp").
Comment author: @ejgallego On the problematic Coq file, … That's on top of PR #1401.
Comment author: @alainfrisch Thanks, that's good to know. Do you have any way to evaluate the impact on the performance of the resulting program? (I assume that if you use flambda, runtime performance somehow matters!)
Comment author: @gasche For Coq, you want to compile Coq itself (the type-checker, kernel, etc.) with flambda for performance reasons. Then native-compute (which reduces Coq terms by compiling them into OCaml) requires that you compile all the Coq libraries on which you depend, and natdynlink (portions of) them with the coqc process, so you need to use flambda there as well; but what you want there is rather for compilation not to blow up (on term/program shapes possibly different from OCaml's).
Comment author: @alainfrisch Ok, thanks for the explanation. So I take it that compiling those dynlinked add-ins with -Oclassic + -linscan is not a crazy idea. (I'd also be interested to see whether applying -linscan to the Coq core itself would give any noticeable slowdown at runtime and/or speedup at compilation time.)
Comment author: @ejgallego The problematic file is a very degenerate case due to several thousand trivial top-level definitions; no similar case appears in the rest of the contributions tested by Coq's CI. Thus, it seems to me that enabling -linscan by default is a bit too much.
Comment author: @ejgallego Alain, soon (TM) we will have a way to systematically test Coq's performance with different combinations of compiler flags; we'll make an announcement once the setup is ready.
Comment author: @alainfrisch
Looking forward to it!
It depends on the numbers. If it makes compilation significantly faster without degrading runtime performance in any observable way, it's just "better", not "too much".
Comment author: @gasche Alain, I've already told Emilio that -linscan should be fine (it should not actually degrade performance in observable ways), but he is strong-willed about getting the absolute best baseline performance. He is also, among the active Coq developers, the only one currently testing and benchmarking cutting-edge OCaml-side changes, so be nice to him :-) (Remark: there is no problem compiling some files with -linscan but not others, so Coq could special-case some of its standard library to use it.)
Comment author: @alainfrisch Gabriel, I'm always nice to everyone :-) I'm just honestly interested in more real-world results about -linscan.
Comment author: @ejgallego I think my view here is more of "things should work fine without any special options" than "let's get the best baseline performance". I am also very interested in the results of different flags!
Comment author: @xclerc Would it make sense to have a heuristic choosing …? (And then probably a -graphcoloring switch to …)
Comment author: @chambart I made another PR, #1455, that disables CSE …
Comment author: @ejgallego Thank you very much Pierre, we will test the new PR ASAP.
Comment author: @ejgallego So far Coq has been working pretty well with flambda and 4.07.0 for a little more than a month; for our part, this issue can be closed.
Comment author: @gasche Great! Marking this as resolved. (Can you give a representative number, by which I mean "any you like among those you have seen in practice", on the runtime speedup and compilation-time slowdown related to your particular preferred flambda settings?)
Comment author: @ejgallego
We are still in the process of gathering more data (*); I think that …
The speedup comes for free as there is no OCaml compiling involved. Compilation time in GitLab CI is typically double, but a large part of …
Incidentally, a "bench" should appear here soon: …
(*) I'd like to have coq_dune ready before updating my little own …
Original bug ID: 7630
Reporter: @ejgallego
Assigned to: @gasche
Status: resolved (set by @gasche on 2018-09-10T14:51:33Z)
Resolution: fixed
Priority: normal
Severity: minor
Platform: AMD64
OS: Linux
OS Version: Ubuntu 17.04
Version: 4.05.0
Fixed in version: 4.07.0
Category: back end (clambda to assembly)
Tags: flambda
Related to: #4074
Parent of: #5546 #6001 #7067 #7631
Bug description
Dear OCaml developers,
we have observed very slow compilation times with -flambda in the context of the Coq project.
In particular, ocaml+flambda seems to struggle with a file that has a few thousand top-level trivial definitions of the form let a = b, where b is defined in a different module. The file takes around 7 minutes and 10GiB of RAM, which is a problem.
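A hedged reproduction sketch of a file with this shape (the names repro.ml, a0…, and b are invented; the real file is machine-generated by Coq and aliases values from another module):

```ocaml
(* Generate a module consisting of thousands of trivial top-level
   aliases, mimicking the degenerate shape described above.  Compiling
   the resulting repro.ml with a flambda-enabled ocamlopt should show
   the compile-time and memory blow-up discussed in this issue. *)
let () =
  let n = 5000 in
  let buf = Buffer.create (n * 16) in
  (* The aliased value; in the real case it lives in another module. *)
  Buffer.add_string buf "let b = 42\n";
  for i = 0 to n - 1 do
    Buffer.add_string buf (Printf.sprintf "let a%d = b\n" i)
  done;
  let oc = open_out "repro.ml" in
  output_string oc (Buffer.contents buf);
  close_out oc
```

The generated file can then be compiled with, e.g., ocamlopt -c repro.ml on flambda and non-flambda switches to compare compilation times.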
Compilation time and memory use seem significantly increased relative to the non-flambda OCaml compiler. The problem is present in both 4.03.0 and 4.05.0+trunk.
No command-line option we have tried is able to improve this, including -Oclassic and -rounds=0.
Steps to reproduce
I think that the best way to reproduce the bug is to compile Coq master with a flambda-enabled compiler. You will reach the problematic file, OrdersEx.v, in a few minutes. The concrete bad .ml file that Coq generates is available here: https://ufile.io/kfcqi
Additional information
Pierre-Marie Pédrot profiled the compiler and he thinks that ocamlopt is stuck in register allocation, specifically in Coloring.walk, but we are lacking more concrete data.
Thank you!