native compilation optimization of tuple assignment #4800
Comments
Comment author: @alainfrisch The experimental branch inplace_let, which you can find in the OCaml Subversion repository, implements the suggested optimization. Feel free to try it and report any performance difference you observe.
Comment author: @alainfrisch I've done some micro-benchmarks. Consider the code below: let () = and the alternative version: ... The timings are: Another test: let f x y z = let () = The timings are: I've also found a case where the patch degrades performance: let () = The timings are: If we replace floats with ints, the patch improves performance. The following variant: let () = is again faster with the patch: 8.32s without the patch, and so is this one: let () = 4.89s without the patch. It seems that most of the time we get significant speedups. Concerning the slowdown: I could not tell from the generated assembly code where it comes from; probably some cache or alignment issue.
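For illustration, the kind of micro-benchmark discussed here might look like the following sketch. This is a hypothetical reconstruction, not the original test sources (which were lost in the issue migration): it compares a tuple-returning conditional against manually split bindings, the pattern this optimization targets.

```ocaml
(* Hypothetical micro-benchmark in the spirit of the elided tests.
   [time] is a small helper introduced here for illustration. *)
let time label f =
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%s: %.2fs\n" label (Sys.time () -. t0);
  r

(* Binds two variables through a tuple-returning conditional:
   without the optimization, each iteration allocates a pair. *)
let tuple_version n =
  let acc = ref 0 in
  for i = 1 to n do
    let (a, b) = if i land 1 = 0 then (i, i * 2) else (i * 3, i) in
    acc := !acc + a - b
  done;
  !acc

(* Equivalent code with the bindings split by hand: no allocation. *)
let split_version n =
  let acc = ref 0 in
  for i = 1 to n do
    let a = if i land 1 = 0 then i else i * 3 in
    let b = if i land 1 = 0 then i * 2 else i in
    acc := !acc + a - b
  done;
  !acc

let () =
  let a = time "tuple" (fun () -> tuple_version 10_000_000) in
  let b = time "split" (fun () -> split_version 10_000_000) in
  assert (a = b)
```

Comparing the two versions with and without the patch is the shape of experiment the timings above refer to.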
Comment author: @gasche I did a quick review. High-level points:
Nitpicking:
PS: it's easier if you give a URL to the diff(s).
Comment author: @alainfrisch Hi Gabriel, Thanks for the detailed feedback.
I don't understand your last proposal about assign_pat; the wildcard case in assign_pat does not, by itself, disallow the optimization (opt can be set to true afterwards).
Comment author: @gasche Here is the sort of thing I had in mind. I didn't even try to compile the code, it's just a draft. I may also have misunderstood your code (in which case I'd welcome clarifying in-code comments), and then this wouldn't be equivalent. let assign_pat loc pat lam =
Comment author: @alainfrisch The behavior is different. The original code "pushes" the bindings down the bound expression, duplicating them at branching nodes. If one of the bindings "cancels out" a literal tuple, the optimization is triggered. This is achieved by sharing the reference across all the calls to assign_pat. For instance, consider: let x, y = Here, transform_return will push the binding (x, y) to the two branches; on the second branch, the corresponding call to assign_pat detects that the optimization applies. But for the first call to assign_pat (for "z"), we need the fallback case.
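The shape of the elided example can plausibly be reconstructed from the surrounding description (this is my reconstruction, not the original snippet): one branch returns an existing variable `z`, the other a literal tuple, so only the literal branch can cancel the allocation.

```ocaml
(* Hypothetical reconstruction of the example described above:
   the [else] branch is a literal tuple, so its allocation can be
   cancelled against the binding, but the [then] branch returns an
   already-built tuple [z] and still needs the fallback projection. *)
let mixed_branches cond z =
  let x, y = if cond then z else (1, 2) in
  x + y
```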
Comment author: @gasche Indeed, that's... wicked. I hope you'll publish a testsuite with all those interesting corner cases covered and explained. Maybe an in-code comment would also be good. I'm wondering, however: could this example result in a performance degradation? Imagine that "z" is the usual case and (1,2) is an exceptional case (that's how I expect this pattern to be used). The result of your optimization, if I'm not mistaken, will allocate two dummy-initialized variables x and y, get the z couple, set x to z.fst and y to z.snd, and go on. The current compiler would just get z, bind x to z.fst and y to z.snd, and move on. The difference is tiny (in particular, of course you don't allocate more) but it would be interesting to know if the difference is noticeable. Maybe it would be safer to optimize only when all branches are tuples or do not return (assert false, raise...). A good thing with either choice is that the performance-maniac compiler-guru user can tune their code to get the semantics they want: by turning z into
PS: I understand that you don't want to invest more time in this until the usefulness of the optimization is demonstrated and accepted. My technical comments on the code should not be taken as an opinion.
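The two compilation schemes being compared in this comment can be sketched at the source level. This is an illustrative paraphrase of the discussion, not the compiler's actual Lambda output; all names are invented for the sketch.

```ocaml
(* Sketch of the two schemes for
   [let x, y = if cond then z else (1, 2) in x + y]. *)

(* Alain's scheme: dummy-initialized mutable slots, assigned in each
   branch; the literal-tuple branch allocates nothing. *)
let assign_scheme cond z =
  let x = ref 0 and y = ref 0 in          (* dummy initialization *)
  (if cond then (let (a, b) = z in x := a; y := b)
   else (x := 1; y := 2));
  !x + !y

(* Current compiler: evaluate the whole expression to a tuple, then
   project its fields (the [else] branch allocates a pair). *)
let project_scheme cond z =
  let p = if cond then z else (1, 2) in
  fst p + snd p
```

When the `z` branch dominates, the first scheme does the same work plus two dummy initializations, which is the tiny difference questioned above.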
Comment author: @gasche While looking at Alain's implementation again, I had an idea for a different implementation of the optimization. I implemented it as the attached patch. The idea is to use Lstaticcatch/Lstaticraise instead of Lassign. Note: I have added comments liberally, but barring that, the patches are of exactly the same size. Alain's patch transforms
into (approximately; in this special case there is no name freshening)
While my implementation produces
I liked the use of control flow instead of mutation here -- and the removal of those unsatisfying "dummy initializations". The performance profile is different from Alain's, but related. This patch tends to do slightly better in some cases and slightly worse in others. See at the end of this message the compared timings on the micro-benchmarks posted by Alain above. I believe the implementation is more efficient in general, but maybe it does not benefit from the float or reference unboxing optimization applied to Lassign, and is therefore slower in that case -- I'm not familiar with the unboxing optimizations, but that's the most plausible explanation I have; I'd be glad to be confirmed or contradicted. Performance numbers on Alain's tests, in the order corresponding to his message: test1.ml (integers) test2.ml (integers) test3.ml (float references) test4.ml (float references) test5.ml (float references)
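At the source level, the Lstaticcatch/Lstaticraise idea corresponds roughly to jumping to a shared continuation with the tuple components as arguments, instead of allocating a pair and destructuring it. The sketch below is illustrative only; the names are invented and the real transformation happens on the compiler's Lambda intermediate representation, not on source code.

```ocaml
(* Source-level sketch of the catch/exit scheme for
   [let x, y = if cond then z else (1, 2) in x + y]:
   each literal-tuple branch jumps directly to the continuation [k]
   with the components, so no tuple is allocated there; a branch
   returning an existing tuple falls back to projection. *)
let catch_exit_scheme cond z =
  let k x y = x + y in          (* the body of the original [let] *)
  if cond then
    let (x, y) = z in k x y     (* fallback: project an existing tuple *)
  else
    k 1 2                       (* literal branch: direct jump, no allocation *)
```

Compiled with static catch/exit, the call to `k` is a jump, not a closure invocation, which is why this avoids both the allocation and the mutable dummy slots.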
Comment author: @alainfrisch Gabriel, I really like your proposal. Indeed, it's much cleaner to use control flow rather than mutation.
Comment author: @gasche I have explored the boxing issue a bit. I just uploaded an experimental patch (very lightly tested) that fixes the issue I could identify by reading the cmm code. Unfortunately, in the process of testing this patch on my examples, I discovered that my measurements are much more variable than I thought: on the test machine, I get up to 1 second of variation with the same optimizer, which makes 4-second measurements unusable... I will try to find more robust measurements, but in the meantime, here are the relevant regions of the cmm code for test5.ml. No optimization at all ('match/1037' is the identifier of the tuple): (data global "camlTest5__3" int 1277 "camlTest5__3": double 1.) (let With Alain's assign patch: (data global "camlTest5__3" int 1277 "camlTest5__3": double 1.) (let (x/1034 1a y/1035 1a z/1036 1a) With the catch/exit patch: (catch Notice the degradation with 'z', which is explicitly allocated instead. With my 'avoid unneeded unboxing' patch: (data global "camlTest5__3" int 1277 "camlTest5__3": double 1.) (catch
Comment author: @alainfrisch I believe the second patch (0001-avoid-float-unboxing-when-uneeded.patch) is wrong. If a mutable variable is needed in both boxed and unboxed form, one should not use the unboxed compilation scheme (otherwise assignments to the variable are only visible to contexts which use the unboxed form).
Comment author: @alainfrisch FWIW, I could not observe any significant performance impact from applying Gabriel's first patch (but not the second one) across our entire test suite (maybe because we avoid the optimized construction in performance-critical code). There is a small reduction in the total number of allocated bytes (e.g. around 3% fewer allocated bytes for some tests).
Comment author: @alainfrisch With static exceptions (#5879), it is possible to code Gabriel's solution by hand. For instance:

```ocaml
let () =
  let r = ref 0 in
  for k = 1 to 10000 do
    for i = 1 to 100000 do
      try
        if i mod 2 = 0 then raise (`Cont (1, i * 2))
        else raise (`Cont (2, i * 3))
      with `Cont (x, y) -> r := !r * x + y
    done;
  done;
  Printf.printf "-> %i\n%!" !r
```

and this achieves identical performance. Of course, it is not very nice to write code like that, but it is good to know that even if no optimization for tuple assignment is ever included, static exceptions give a way to improve the speed of performance-critical code by hand. Another observation is that the changes required to support escaping try...with blocks with a static exception would benefit Gabriel's approach as well, making it possible to optimize the case where the right-hand side builds tuples in terminal position of a try...with body.
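The polymorphic-variant `raise` above relies on the static-exception proposal from #5879 and is not standard OCaml. For comparison, a standard-OCaml analogue (my adaptation: a declared exception, reduced iteration counts) is shown below; whether the compiler turns it into a static jump depends on the backend.

```ocaml
(* Standard-OCaml analogue of the hand-coded version above.
   [Cont] and [run] are names introduced for this sketch. *)
exception Cont of int * int

let run () =
  let r = ref 0 in
  for _k = 1 to 100 do
    for i = 1 to 1000 do
      try
        if i mod 2 = 0 then raise (Cont (1, i * 2))
        else raise (Cont (2, i * 3))
      with Cont (x, y) -> r := !r * x + y
    done
  done;
  !r

let () = Printf.printf "-> %i\n%!" (run ())
```

A raise caught immediately by the enclosing try...with is exactly the control shape that a static-exception compilation scheme can lower to a jump with no exception-handling machinery.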
Comment author: @alainfrisch I've committed this on trunk, rev 16501. This is a version close to Gabriel's, with some improvements in the map_return function in order to traverse try...with/staticcatch expressions (because the staticraise we use is allowed to cross try...with blocks!) and to ignore staticraise/raise expressions (because they don't return). This improves the effectiveness of the patch by avoiding tuple allocations in cases such as:
A test has been added to check that no allocation happens in this example. The optimization is currently done quite early in the compilation process, thus benefiting both bytecode and native compilation. This means that (i) js_of_ocaml also benefits from it; (ii) the optimization is not reapplied after inlining.
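For illustration, here is a hypothetical example of the shape the committed optimization targets: every branch of the conditional returns a literal tuple, so under the patch the bound tuple need never be materialized. The function name and body are invented for this sketch, not taken from the committed test.

```ocaml
(* Both branches return literal tuples, so the pair bound to
   (dir, mag) can be compiled away entirely: each branch jumps to
   the body with the two components as arguments. *)
let clamp_dir x =
  let dir, mag = if x >= 0 then (1, x) else (-1, -x) in
  if mag > 100 then dir * 100 else dir * mag
```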
Original bug ID: 4800
Reporter: jerhoud
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2017-02-16T14:14:50Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 3.10.2
Fixed in version: 4.03.0+dev / +beta1
Category: ~DO NOT USE (was: OCaml general)
Tags: patch
Has duplicate: #6670
Related to: #5879 #7017
Monitored by: meyer @gasche abdallah @protz dario @ygrek till @jberdine "Julien Signoles" @hcarty @Chris00 @mjambon @yakobowski @mmottl
Bug description
Very often I need to define several variables according to a condition (or a pattern matching), so I write something like:
```ocaml
let (a, b) = if cond then (a1, b1) else (a2, b2) in ...
```

or

```ocaml
let (a, b) =
  match value with
  | case1 -> (a1, b1)
  | case2 -> (a2, b2)
  | ...
in ...
```
In these situations, the compiler allocates memory and creates a tuple. This is quite inefficient. I would like the compiler to optimize this allocation away.