Performance regression in the compiler between 4.02.3 and 4.01.0 #7067
Comments
Comment author: @gasche I just uploaded a self-contained archive to avoid having to install external packages to test the regression. This seems to be ocamlopt-related: ocamlc is fast on all versions. On my machine, ocamlopt -c tsdl.cmx terminates in under three seconds on 4.01.0, but it is very slow (I didn't wait for termination) on either 4.00.1 or 4.02.3. |
Comment author: @dbuenzli I forgot to add that I once monitored the process via top and you could see repeated spikes of memory going from ~300MB to 700MB and back. Also, there is the file https://github.com/dbuenzli/tsdl/blob/master/src/tsdl.ml#L16 This file consists solely of (mostly integer) constant definitions but is quite large:
A gut feeling is that there is some kind of non-linear behaviour w.r.t. the number of toplevel definitions since those are also quite plentiful in the tsdl module implementation:
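For illustration, here is a sketch of the shape of such a module: one big wrapper containing many toplevel, mostly integer, constant definitions. The names and values below are hypothetical, not taken from the actual tsdl source:

```ocaml
(* Hypothetical sketch of the shape of tsdl.ml: a single wrapper module
   holding a very large number of toplevel constant bindings. *)
module Sdl = struct
  module Init = struct
    let timer = 0x0001
    let audio = 0x0010
    let video = 0x0020
  end
  let window_fullscreen = 0x0001
  let window_opengl = 0x0002
  (* ... hundreds more bindings like these ... *)
end

let () = assert (Sdl.Init.timer lor Sdl.Init.audio = 0x0011)
```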
|
Comment author: @mshinwell Daniel, we should check this with flambda, since the module compilation strategy has changed somewhat (and we've already worked to fix various problems that have occurred as a result of things like huge modules full of constants). It's compiler hacking tonight at Cambridge, so maybe someone could look then. |
Comment author: @alainfrisch Removing the outermost wrapper "module Sdl = struct ... end" makes it compile in a few seconds again. |
Comment author: @alainfrisch
Can you say a few words about the new strategy? The special compilation mode transl_store_structure was specifically designed to reduce pressure on register allocation (and something apparently went wrong with it) compared to normal compilation of structures. Does flambda do something special to avoid big nested "let-in" with long-lived identifiers? |
Comment author: @alainfrisch
What happens is that the content of Sdl in the typedtree is a "Tmod_constraint(_, _, Tmodtype_implicit, _)", which breaks the "transl_store_structure" compilation scheme. The coercion is caused by duplicated value names. My guess is that this comes from: |
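A minimal illustration (hypothetical, not the actual tsdl source) of duplicated value names inside a wrapped module, which is the situation that forces an implicit coercion of this kind:

```ocaml
(* Duplicated value names: the second [x] shadows the first, so the
   inferred signature of [Sdl] keeps only one [x]. The typechecker
   inserts an implicit coercion (a Tmod_constraint with Tmodtype_implicit)
   to drop the shadowed binding, which defeats transl_store_structure. *)
module Sdl = struct
  let x = 1
  let x = x + 1  (* shadows the previous [x] *)
end

let () = assert (Sdl.x = 2)
```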
Comment author: @mshinwell The bytecode compilation strategy is used, which yields a big set of [let]s and a record creation at the end, together with arbitrary interspersed side effects. The constant subexpressions (which always have variable names bound to them in the Flambda terms, by construction) are identified and lifted out to special "Let_symbol" constructions. (See the "program" type in <https://github.com/chambart/ocaml-1/blob/flambda_trunk/middle_end/flambda.mli>) The remaining, possibly effectful parts of the module expression are then split into "Initialize_symbol" expressions; these bind symbols assigned to preallocated blocks of size 1. When the initialization expression has been evaluated, the result is put inside the first field of the preallocated block. The result of this is: |
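This is not actual flambda IR, but a sketch of how such a module body might be partitioned under the scheme described above; the constant/effectful classification in the comments is an assumption for illustration:

```ocaml
module M = struct
  (* Constant: bound to a name in the Flambda terms and liftable out
     to a static Let_symbol definition. *)
  let a = 42

  (* Effectful: would become an Initialize_symbol, i.e. a preallocated
     block of size 1 whose field is filled in once this expression has
     been evaluated. *)
  let b = print_endline "initializing"; Sys.opaque_identity 1

  (* Constant again (constant-foldable), also liftable. *)
  let c = a + 1
end

let () = assert (M.a + M.b + M.c = 86)
```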
Comment author: @alainfrisch
Thanks. So indeed this might very well fix this problem. Gabriel's self-contained archive (thanks!) makes it straightforward to check. |
Comment author: @alainfrisch I'm concerned that this might somehow complicate the analysis of the impact on compile-time for flambda. Shouldn't we try to fix this one pre-flambda to get a good comparison point? |
Comment author: @mshinwell Possibly yes, if it affects more than just this package. |
Comment author: @alainfrisch Well, any unit defining multiple values with the same name (and this is quite common) suffers from the regression, since it disables the compilation scheme that worked well with the register allocator. This might not always be so dramatic because compilation units are usually smaller, but I guess the problem is rather widespread. |
Comment author: runhang The slowdown occurred between 4.01.0 and 4.02.0+rc1. While bisecting in this range, the compiler fails to build at some commits, for various reasons... |
Comment author: @gasche Runhang: Thanks for looking at this! You can "git bisect skip" commits that don't build, to get a more precise range. Note that at that time "make -j5" was unreliable, so you have to use non-parallel builds. |
Comment author: @alainfrisch Checking the commit I cited as the probable source for the regression ( da96fcb ) might be faster than bisecting. |
Comment author: @alainfrisch I'm attaching a quick, certainly incomplete and perhaps buggy fix. At least the testsuite passes, and it makes Daniel's code compile in a few seconds again. |
Comment author: @alainfrisch
Definitely! Monitoring total memory allocation is more stable than runtime across runs and platforms, so it might serve as a good proxy for performance. Compiling the compiler itself with OCAMLRUNPARAM=v=0x400, extracting the allocation trace, and comparing (with some tolerance) to reference results could be a good first step. |
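As a sketch of what such monitoring could look like from inside a program (the allocation and output format here are illustrative), the counters that OCAMLRUNPARAM=v=0x400 prints at exit are also available programmatically through the Gc module:

```ocaml
(* Read the allocation counters programmatically; OCAMLRUNPARAM=v=0x400
   prints the same kind of statistics when the program exits. *)
let () =
  ignore (Sys.opaque_identity (Array.make 100_000 0));  (* allocate something measurable *)
  let s = Gc.stat () in
  Printf.printf "minor words: %.0f\n" s.Gc.minor_words;
  Printf.printf "major + promoted words: %.0f\n"
    (s.Gc.major_words +. s.Gc.promoted_words);
  assert (s.Gc.minor_words >= 0.)
```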
Comment author: runhang Confirmed: the latest flambda solves this slowdown. real 0m0.217s |
Comment author: @alainfrisch This is impressive, but isn't a 17x speedup compared to the version before the regression too good to be true, especially since flambda is usually slower at compile time? |
Comment author: @garrigue Well, I tried to understand that code at some point, but it is tricky. |
Comment author: @mshinwell Even if what I wrote above doesn't apply because those lets aren't toplevel (I wasn't sure if they were in your example), they would still be lifted out (using this [Let_symbol] construction) if their defining expressions were found to be constant. The worst case is probably when they are not toplevel and this does not happen. In this case, you end up with a big sequence of normal lets going through the compiler. The Flambda passes should all be tail-recursive and behave reasonably in the case of a long sequence of lets, but subsequent passes may not... |
Comment author: @alainfrisch Of course, the real culprit is the register allocator. At some point, we should perhaps reconsider the decision to reject the addition of a linear scan allocator... |
Comment author: @alainfrisch Is the flambda branch to check indeed https://github.com/chambart/ocaml-1/tree/flambda_trunk ? |
Comment author: @gasche Relevant link:
|
Comment author: @mshinwell frisch: Yep |
Comment author: @alainfrisch I've tested the patch with success on LexiFi's code base. There is no noticeable change when compiling e.g. ocamldoc/*.ml with ocamlopt compared to trunk. (In a previous note, I reported some different results, but this was caused by a problem between the screen and the chair, namely comparing between two working copies configured resp. for mingw and msvc.) Moreover, here is a program to generate simple reproduction cases:

let () =
  print_endline "module X = struct";
  print_endline "  let x = 1";
  print_endline "  let x = 2";
  for i = 1 to int_of_string Sys.argv.(1) do
    Printf.printf "  let f%i x = x + %d\n" i i
  done;
  print_endline "end"

Compiling with ocamlopt from the current trunk, trunk+patch and current flambda, I get:

          trunk    trunk+patch    flambda
N = 200:  0.7s     0.12s          0.45s
N = 400:  2.9s     0.25s          0.9s
N = 800:  11.5s    0.45s          1.84s

which clearly shows that the quadratic behavior is avoided, either with the patch or with flambda (which is about 4 times slower than trunk+patch). Daniel's example shows that the problem does not occur only in theory. Here are the timings for "make native" on Gabriel's self-contained repro case, using ocamlopt.opt:

trunk    trunk+patch    flambda
90.9s    1.6s           3.2s

I'd like to push the change to trunk so that we get at least a good comparison point for flambda. |
Comment author: @gasche Are you sure that the patch is correct? I don't know this part of the compiler, so I cannot relevantly review it. |
Comment author: @alainfrisch I also somehow lost track of coercions (in particular the second argument of Tcoerce_structure) after the addition of module aliases. So I definitely appreciate review from someone else (Jacques?). But I'm reasonably confident that it doesn't break code, or that this would be quickly found. |
Comment author: @garrigue I applied the patch and looked at the code, and it seems ok (it mimics transl_structure, which is of course the reference). |
Comment author: @alainfrisch Ok, I've pushed that to trunk (commit 9d4b3a4). This is of course only a partial solution to the general problem of "the register allocator does not behave well with long pieces of code", but at least we avoid bad regressions in common cases with previous versions. |
Original bug ID: 7067
Reporter: @dbuenzli
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2017-02-16T14:18:24Z)
Resolution: fixed
Priority: normal
Severity: major
Version: 4.02.3
Target version: 4.03.0+dev / +beta1
Fixed in version: 4.03.0+dev / +beta1
Category: ~DO NOT USE (was: OCaml general)
Tags: compiler-time-or-space-regression
Related to: #6435
Child of: #7630
Monitored by: @gasche @diml @jmeber @hcarty
Bug description
As I said on the mailing list, the compilation time of tsdl using 4.02.3 is ~30x the time of 4.01.0. See the steps below to reproduce.
The bottleneck is in the compilation of the module https://github.com/dbuenzli/tsdl/blob/master/src/tsdl.ml
Steps to reproduce
opam install ocamlfind ctypes ctypes-foreign result
cd /tmp
git clone https://github.com/dbuenzli/tsdl.git
cd /tmp/tsdl
opam switch 4.01.0; eval `opam config env`
time ./build tests.otarget
[...]
real 0m4.089s
user 0m3.227s
sys 0m0.694s
./build clean
opam switch 4.02.3; eval `opam config env`
time ./build tests.otarget
real 1m31.203s
user 1m29.394s
sys 0m1.596s