Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0007630OCamlback end (clambda to assembly)public2017-09-18 20:472017-10-27 19:02
Reporterejgallego 
Assigned To 
PrioritynormalSeverityminorReproducibilityalways
StatusacknowledgedResolutionopen 
PlatformAMD64OSLinuxOS VersionUbuntu 17.04
Product Version4.05.0 
Target VersionFixed in Version 
Summary0007630: FLambda particularly slow with large file consisting in top-level trivial aliases.
DescriptionDear OCaml developers,

we have observed very slow compilation times with `-flambda` in the
context of the Coq project.

In particular, ocaml+flambda seems to struggle in a file that has a
few thousand top level trivial definitions of the form:

```
let a = b
```

where `b` is defined in a different module.

The file takes around 7 minutes and 10GiB of RAM, which is a problem.
Compilation time and memory seems significantly increased with regards
to the non-flambda OCaml compiler. The problem seems present in 4.03.0
and 4.05.0+trunk.

No command line option we have tried is able to improve this,
including `-Oclassic` and `-rounds=0`.
Steps To ReproduceI think that the best way to reproduce the bug is to compile Coq
master with a flambda-enabled compiler. You will reach the problematic
file, `OrdersEx.v` in a few minutes. The concrete bad '.ml' file that
Coq generates is available here https://ufile.io/kfcqi [^]
Additional InformationPierre-Marie P├ędrot profiled the compiler and he thinks that
`ocamlopt` is stuck at register allocation, in concrete at
`Coloring.walk`, but are lacking some more concrete data.

Thank you!
Tagsflambda
Attached Files

- Relationships
related to 0004074acknowledged Patch for in-place compilation of structures 
parent of 0007631resolvedgasche -linscan option crashes ocamlopt 
parent of 0005546closedfrisch moving a function into an internal module slows down its use 
parent of 0006001closedhongboz Excessive memory consumption while compiling the OCaml distribution 
parent of 0007067closedfrisch Performance regression in the compiler between 4.02.3 and 4.01.0 

-  Notes
(0018266)
gasche (developer)
2017-09-18 21:19

Good news: the upcoming 4.06 release will come with a linear-scan allocator to be used optionally (option -linscan) that may be helpful to avoid register-allocation-level bottlenecks (we had a couple such bottlenecks even with the non-flambda backend in the past, for example with code within Camlp4 (not machine-generated)).
(0018267)
ejgallego (reporter)
2017-09-18 21:42

Thanks Gabriel, note thou that we are not sure about the reg allocator causing the problem, given the characteristics of the file I found Pierre-Marie's finding surprising.

Has linscan been already merged in 4.06 ?
(0018268)
ejgallego (reporter)
2017-09-18 21:49

Never mind, I found https://github.com/ocaml/ocaml/pull/375 [^] , also Paul Steckler has reported the following in Coq's gitter:

OK, the profile does have Coloring.walk pretty high, but also ....
Set.mem and Set.bal are even higher
and Interf.compare
the highest item is the garbage-collector, not much to do about that
and the caller of Set.mem is Flambda_invariants__declare_variable
(0018269)
gasche (developer)
2017-09-18 22:00

That indeed looks like the graph-based register coloration algorithm having trouble -- I have seen similar memory profiles in the past. You should definitely try with -linscan the just-branched 4.06 release branch if you can (there should be an alpha/beta release soon-ish).
(0018270)
gasche (developer)
2017-09-18 22:02

Another tool that can help, by the way, is the -dtimings option that displays pass-by-pass time (and, in recent versions, memory) usage for the compiler. It keeps improving over the OCaml version.
(0018271)
psteckler (reporter)
2017-09-18 22:12

I had created a smaller example to demonstrate the slowdown introduced by flambda. I wanted to see if the linear scan code would help, and installed OPAM 4.06.0+trunk+fp+flambda. I used the -linscan flag and got:
--
Fatal error: exception Invalid_argument("index out of bounds")
Raised by primitive operation at file "asmcomp/linscan.ml", line 111, characters 19-35
Called from file "list.ml", line 100, characters 12-15
Called from file "asmcomp/linscan.ml", line 115, characters 10-56
Called from file "asmcomp/linscan.ml", line 169, characters 4-28
Called from file "list.ml", line 100, characters 12-15
Called from file "asmcomp/asmgen.ml", line 87, characters 4-32
Called from file "utils/misc.ml", line 28, characters 20-27
Re-raised at file "utils/misc.ml", line 28, characters 50-57
--

I can file a separate bug report, if desired.
(0018272)
ejgallego (reporter)
2017-09-18 22:41

Maybe the combination of flambda + linscan is not well tested. Anyways, Paul used `-dtimings` to get more data and indeed register allocation is taking a lot more time in the flambda case.
(0018273)
gasche (developer)
2017-09-19 06:24

Well, there goes my hope of -linscan fixing your issue :p At least we get a lot of testing by convincing Coq people to try our new options.
(0018275)
xclerc (reporter)
2017-09-19 11:37

As noted in MPR#7631, the linscan bug seems to be related to fp;
could you try to build Coq with `-linscan` on a non-fp switch?
(0018276)
ejgallego (reporter)
2017-09-19 11:42

Also a -dtimings for OrdersEx.v would be very useful, unfortunately I am away from my desktop so I'll take a while.
(0018285)
psteckler (reporter)
2017-09-19 20:42

I assume it's not necessary to try a non-fp switch, since the bug with linscan has been fixed.
(0018327)
ejgallego (reporter)
2017-09-25 15:21

Ok, so I am back to my desktop and for the example in https://caml.inria.fr/mantis/view.php?id=7631, [^] I see:

4.05.0+trunk+flambda -Oclassic:
  regalloc(sourcefile(b.ml)): 5.408s

4.05.0:
  regalloc(sourcefile(b.ml)): 0.084s

thus this is IMHO a bug.
(0018329)
gasche (developer)
2017-09-25 15:41

Again, you cannot conclude that there is a bug from these numbers, because how much work the register allocator has to do depends a lot on the size and shape of your basic blocks, which in turns depends a lot on how aggressively you inline in the previous passes.

The compiler backend has changed its compilation strategies for structures (long series of "let" at the toplevel or between "struct .. end") several times in the past to avoid similar performance issues. However, I believe that these changes were performed at the bytecomp/translmod level, that is before the split between flambda and non-flambda backends (or bytecode), so it should apply to all backends. It may be the case that some flambda optimizations undo this work, and it might be the case that a not-too-invasive change could un-undo this.

See related issues: MPR 0004074, MPR 0005546, MPR 0006001, MPR 0007067
(0018331)
frisch (developer)
2017-09-25 15:52

flambda goes through Translmod.transl_implementation_flambda, which uses the "normal" compilation strategy for structures, not the "allocate then update slots" one. My understanding is that later passes of flambda extract back a structured representation of toplevel definitions and do specific specific with them, but I don't know whether it is supposed to avoid sending "monsters" to the register allocator.
(0018335)
ejgallego (reporter)
2017-09-25 17:20

My reasoning was based on the (likely wrong) assumption that flambda -Oclassic inlining should be the same that non-flambda. Quoting the manual:

"-Oclassic: Make inlining decisions at the point of definition of a function rather than at the call site(s). This mirrors the behaviour of OCaml compilers not using Flambda"

so I thought indeed some other reason different from inlining could be creating the slow behavior, but of course I am likely wrong :)
(0018337)
ejgallego (reporter)
2017-09-25 17:28

More data, I tried with `-Oclassic -inline 0`; no change; `-inlining-report` produces a zero-byte file; let me know if I can grab more information of help.
(0018412)
xleroy (administrator)
2017-09-30 10:31

> flambda goes through Translmod.transl_implementation_flambda, which uses the "normal" compilation strategy for structures, not the "allocate then update slots" one.

The compile-time issues mentioned in this PR are the reasons why the "allocate then update slots" translation was introduced circa 1995.
(0018458)
chambart (developer)
2017-10-03 01:24

Yes, flambda is supposed to try to do this. I tracked it a bit (on the example from 7631), and it seems that the nicely crafted "allocate then update slots" pattern is broken by CSE which shares symbol loads between all the updates, adding lots of conflicts for the register allocator to die on.

This is shown by the fact that making Iconst_symbol a 'cheap operation' in CSEgen avoids the problem.

I don't know the right solution. Some potential hack could be to split the module initialization function in a lot of small functions. Or make symbols cheap for this particular function (I would feel quite bad writing something like that without some better criteria for triggering it).
(0018459)
chambart (developer)
2017-10-03 01:46

Also for some context as to why there is any difference between with and without flambda. In flambda, every top-level value gets its own symbol.

Something like:

let bar0 = [| 0 |]
let bar1 = [| 1 |]
let bar2 = [| 2 |]
let bar3 = [| 3 |]

ends up to something like

symbol bar0.(0) <- [| 0 |]
symbol bar1.(0) <- [| 1 |]
symbol bar2.(0) <- [| 2 |]
symbol bar3.(0) <- [| 3 |]
symbol toplevel_module.(0) <- bar0.(0)
symbol toplevel_module.(1) <- bar1.(0)
symbol toplevel_module.(2) <- bar2.(0)
symbol toplevel_module.(3) <- bar3.(0)

and loading the address of the symbol bar0 is shared between the assignment of bar0.(0) and toplevel_module.(0). Hence the giant conflict.

Closure generated version does not have this specific problem because toplevel values are all fields of the same symbol, reducing the number of conflicts. The specific initialisation pattern of clambda (without going through an intermediate symbol) is not representable in flambda in most situations. (And I would be quite reluctant on making that representable)

Here, it could be represented since allocating and array is effect free, and the values are not used anywhere else than in the toplevel module bloc. In the coq generated example it doesn't seem like it could be.
(0018484)
gasche (developer)
2017-10-05 11:12

Pierre proposes a PR that partially alleviate some of these problems in

  https://github.com/ocaml/ocaml/pull/1401 [^]
(0018542)
frisch (developer)
2017-10-11 23:51

It would still be useful to know the timings with -linscan (now that it has been fixed to work with "fp").
(0018543)
ejgallego (reporter)
2017-10-12 10:36
edited on: 2017-10-12 10:37

On the problematic Coq file, `-linscan` reduces compilation time/memory from 10GiB and 10 minutes to 400MiB/20secs.

That's on top of PR1401

(0018544)
frisch (developer)
2017-10-12 11:33

Thanks, that's good to know. Have you any way to evaluate the impact on the performance for the resulting program? (I assume that if you use flambda, runtime performance somehow matters!)
(0018547)
gasche (developer)
2017-10-12 15:28

For Coq, you want to compile Coq itself (the type-checker, kernel etc.) with flambda for performance reasons. Then native-compute (which reduces Coq terms by compiling them into OCaml) requires that you compile all the Coq libraries on which you depend, and natdynlink (portions of) them with the coqc process -- so you need to use flambda there as well, but what you want is rather for them not to blow up (on the possibly different term/program shapes than OCaml's).
(0018548)
frisch (developer)
2017-10-12 15:44

Ok, thanks for the explanation. So I take it that compiling those dynlinked addins with -Oclassic + -linscan is not a crazy idea. (I'd also be interested to see if applying -linscan to the Coq core itself would give any noticeable slowdown at runtime and/or speedup at compilation time.)
(0018549)
ejgallego (reporter)
2017-10-12 17:18

The problematic file is a very degenerated case due several thousands of trivial top-level definitions; no similar case appears in the rest of contributions tested by Coq's CI.

Thus, it seems to me that enabling -linscan by default is a bit too much.
(0018550)
ejgallego (reporter)
2017-10-12 17:20

Alain, soon (TM) we will have a way to systematically test performance of Coq with different combinations of compiler flags, we'll make an announcement once the setup is ready.
(0018551)
frisch (developer)
2017-10-12 17:34

> we will have a way to systematically test performance of Coq with different combinations of compiler flags

Looking forward to it!

> it seems to me that enabling -linscan by default is a bit too much.

It depends on the numbers. If it makes compilation significantly faster without degrading runtime performance in any observable way, it's just "better", not "too much".
(0018553)
gasche (developer)
2017-10-12 17:55

Alain, I've already told Emilio that -linscan should be fine (not actually degrade performances in observable ways), but he's strong-willed to get the absolute best baseline performance. He is also, among the active Coq developers, the only one currently doing testing and benchmarking of cutting-edge OCaml-side changes, so be nice to him :-)

(Remark: there is no problem compiling some files with -linscan but not others, so Coq could special-case some of its standard library to use it.)
(0018555)
frisch (developer)
2017-10-12 18:16

Gabriel, I'm always nice to everyone :-)

I'm just honestly interested in more real-world results about -linscan.
(0018556)
ejgallego (reporter)
2017-10-12 19:02

I think my view here is more of "things should work fine without any special options" than "let's get the best baseline performance".

I am also very interested about the results of different flags!
(0018559)
xclerc (reporter)
2017-10-13 17:37

Would it make sense to have a heuristics choosing
the algorithm to be used for register allocation?

(And then probably a -graphcoloring switch to
enforce the use of graph coloring.)
(0018626)
chambart (developer)
2017-10-27 18:55

I made another PR https://github.com/ocaml/ocaml/pull/1455 [^] that disables CSE
for the toplevel function. This allows to compile that file in 0001497:0000013 seconds on
my laptop. -Oclassic isn't really faster on that file by the way. You can
further reduce the compilation time by 0000263:0000004 seconds using -dflambda-no-invariants
If you do, please keep the invariants checks enabled in you CI builds as this
helps uncover compiler bugs. Also there is no guarantee that this option will
stay.
(0018627)
ejgallego (reporter)
2017-10-27 19:02

Thank you very much Pierre, we will test the new PR ASAP.

- Issue History
Date Modified Username Field Change
2017-09-18 20:47 ejgallego New Issue
2017-09-18 20:47 ejgallego Tag Attached: flambda
2017-09-18 21:19 gasche Note Added: 0018266
2017-09-18 21:42 ejgallego Note Added: 0018267
2017-09-18 21:49 ejgallego Note Added: 0018268
2017-09-18 22:00 gasche Note Added: 0018269
2017-09-18 22:02 gasche Note Added: 0018270
2017-09-18 22:12 psteckler Note Added: 0018271
2017-09-18 22:41 ejgallego Note Added: 0018272
2017-09-19 06:24 gasche Note Added: 0018273
2017-09-19 11:37 xclerc Note Added: 0018275
2017-09-19 11:42 ejgallego Note Added: 0018276
2017-09-19 16:20 gasche Relationship added parent of 0007631
2017-09-19 20:42 psteckler Note Added: 0018285
2017-09-25 15:21 ejgallego Note Added: 0018327
2017-09-25 15:41 gasche Note Added: 0018329
2017-09-25 15:41 gasche Relationship added related to 0004074
2017-09-25 15:41 gasche Relationship added parent of 0005546
2017-09-25 15:41 gasche Relationship added parent of 0006001
2017-09-25 15:42 gasche Relationship added parent of 0007067
2017-09-25 15:52 frisch Note Added: 0018331
2017-09-25 17:20 ejgallego Note Added: 0018335
2017-09-25 17:28 ejgallego Note Added: 0018337
2017-09-30 10:31 xleroy Note Added: 0018412
2017-09-30 10:31 xleroy Status new => acknowledged
2017-10-03 01:24 chambart Note Added: 0018458
2017-10-03 01:46 chambart Note Added: 0018459
2017-10-05 11:12 gasche Note Added: 0018484
2017-10-11 23:51 frisch Note Added: 0018542
2017-10-12 10:36 ejgallego Note Added: 0018543
2017-10-12 10:37 ejgallego Note Edited: 0018543 View Revisions
2017-10-12 11:33 frisch Note Added: 0018544
2017-10-12 15:28 gasche Note Added: 0018547
2017-10-12 15:44 frisch Note Added: 0018548
2017-10-12 17:18 ejgallego Note Added: 0018549
2017-10-12 17:20 ejgallego Note Added: 0018550
2017-10-12 17:34 frisch Note Added: 0018551
2017-10-12 17:55 gasche Note Added: 0018553
2017-10-12 18:16 frisch Note Added: 0018555
2017-10-12 19:02 ejgallego Note Added: 0018556
2017-10-13 17:37 xclerc Note Added: 0018559
2017-10-27 18:55 chambart Note Added: 0018626
2017-10-27 19:02 ejgallego Note Added: 0018627


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker