| Anonymous | Login | Signup for a new account | 2013-05-21 16:02 CEST | ![]() |
| Main | My View | View Issues | Change Log | Roadmap |
| View Issue Details [ Jump to Notes ] | [ Issue History ] [ Print ] | ||||||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | ||||||
| 0005384 | OCaml | OCaml general | public | 2011-10-25 14:40 | 2011-12-16 12:44 | ||||||
| Reporter | lefessan | ||||||||||
| Assigned To | xleroy | ||||||||||
| Priority | normal | Severity | feature | Reproducibility | always | ||||||
| Status | feedback | Resolution | open | ||||||||
| Platform | all | OS | all | OS Version | all | ||||||
| Product Version | 3.12.1 | ||||||||||
| Target Version | 3.12.1 | Fixed in Version | |||||||||
| Summary | 0005384: alloc of let-rec values is expensive (patch) | ||||||||||
| Description | Allocation of let rec values is not optimal: each value is allocated separately using "caml_alloc_dummy", and then they are allocated a second time, and then the initial allocated blocks are updated using "caml_update_dummy", making an heavy use of "caml_modify". The following patch modifies the allocation of let-rec values for native code. It tries a better allocation strategy, and fall-backs to the old strategy if the expression is not simple enough. The idea of the better allocation strategy is to break recursive expressions into simple non-recursive values (that can be evaluated before), and recursive blocks (that can all be allocated as one big block and then modified to add recursive pointers). The strategy fails if the block is too big and cannot be allocated in the minor heap. | ||||||||||
| Additional Information | Some performance measurements: type t = { f : float; x : t option; } let _ = for i = 1 to int_of_string Sys.argv.(1) do let rec v = { f = 0.0; x = Some v } in () done 76% speed-up: time ./test2.opt 100000000 0.55 s (optimized) 2.32 s (old) ----------------------------------------- let _ = let q = Queue.create () in for i = 1 to int_of_string Sys.argv.(1) do Queue.add 1 q; assert (Queue.take q = 1); done 50% speed-up: time ./test1.opt 100000000 2.16 s (optimized) 4.26 s (old) ----------------------------------------- let _ = for i = 1 to int_of_string Sys.argv.(1) do let q = Queue.create () in Queue.add 1 q; Queue.add 2 q; Queue.add 3 q; assert (Queue.take q = 1); assert (Queue.take q = 2); assert (Queue.take q = 3); done 30% speed-up: time ./test1.opt 100000000 5.24 s (optimized) 7.28 s (old) | ||||||||||
| Tags | No tags attached. | ||||||||||
| Attached Files | |||||||||||
Notes |
|
|
(0006265) xleroy (administrator) 2011-12-13 16:29 |
I was a bit surprised by this patch: "let rec" over values is a very rarely used Caml feature (I don't think I've used it in the last 5 years, and some people consider it a design mistake of the language, since it makes it possible that structurally recursive functions over immutable datatypes fail to terminate), so I didn't understand the need to optimize its compilation. Is there a real-world program where this optimization has a significant impact on the overall execution time? |
|
(0006269) frisch (developer) 2011-12-13 16:55 |
Fabrice gives the example of the Queue module, and some timings. I admit that the performance gain would probably be less impressive in a real-world algorithm, where Queue operations do not dominate the overall execution time. But it is the sign that some clever data structures can benefit from the patch. I remember using let-rec bindings several times when writing interpreters (to evaluate recursive closures, of course). I can also mention that we (=LexiFi) rely a lot on "let rec" values these days, in the form or lazy bindings (we have special syntactic sugar for that). I don't know if performance would be impacted by Fabrice's patch, though. |
|
(0006304) gasche (developer) 2011-12-15 09:53 |
Just a few comments on the patch; this is an outsider point of view, with no guarantee of technical accurateness: I'm not familiar with the OCaml codebase. High-level remarks: 1. Your patch adds a new recursive definition compilation method, alongside the old one, sometimes falling back to it. If I were to maintain that code in the future, I wouldn't like to have twice as much program logic to take care of. I believe it is possible to have something more factorized, so that more code and debugging/maintainance is shared between the two approaches. See below. 2. There is some redundancy between your code and two different parts of the existing code. First, your combined allocation involves size computation, which is already partly done by the `expr_size` function. Would it be possible to reuse the same size-computation mechanism in both? Second, you compute occurences of recursion variables in the recursive definitions; your non_rec_exp function is essentially a computation of free variables. This is already done in the bytecomp/translcore.ml:check_recursive_lambda function, which checks that recursive definitions are semantically valid. 3. More generally it seems that your optimization logic rather closely follows the computations done by the recursive definition checking in bytecomp/. I'm wondering whether some of it could be done in the same place. I think there are two independent aspects of your proposal: you single out non-recursive parts of the definition block to declare them beforehand (similarly you could pick the definitions that are not depended upon and declare them afterwards), and you have a clever allocation strategy for blocks. I think the first part of the optimization could be performed much sooner, as a source-to-source rewriting, in bytecomp/translcore.ml for example. 4. Your optimization reorders definitions in the native backend, but does not affect bytecode compilation. That could result in side-effects being performed in a different order on both compilers. Evaluation order of "let ... and .. and" is not specified, but my understanding is that the current code preserves top-to-bottom evaluation order. I think it is correct to change order for optimization purposes, but it would be much better if ocamlc and ocamlopt had the same behavior. Splitting the let-reordering part of your optimization to put it in a common pipeline would achieve that effect. 5. Your optimization trades the "caml_update_dummy" backpatching of the current semantics for `set_field` of block fields. I believe (but I have not tested my assumption) that this may degrade performances in some cases. For example consider the following code: type test = { x : test; y : test; z : test } let rec test = { x = test; y = test; z = test } My understanding is that your pass will do three field stores, while the current scheme would only update one dummy reference. It would be interesting to compare the tradeoff and see if there is a way to choose the best method locally. Nitpicking: - your combine_alloc fails if the combined block is larger than `max_young_wosize`. Would it be possible to rather split it into smaller blocks? - the "if non_rec_exp exp then .... else raise Exit" would be profitably rewritten into "if not (non_rec_exp exp) then raise Exit; ...". It could even be interesting to avoid double negation by implementing `rec_exp` rather than `non_rec_exp`. - I'm not fond of your use of reference variables for `set` and `cont`; the global style of the code is to pass along accumulators in the recursive function. I'm under the impression that the control flow of your code would allow to do that just as easily. |
|
(0006314) lefessan (developer) 2011-12-15 16:22 |
> type test = { x : test; y : test; z : test } > let rec test = { x = test; y = test; z = test } Your method: subq $8, %rsp .L100: movq $7, %rdi movq caml_alloc_dummy@GOTPCREL(%rip), %rax call caml_c_call@PLT .L101: movq %rax, %rbx call caml_alloc3@PLT .L102: leaq 8(%r15), %rsi movq $3072, -8(%rsi) movq %rbx, (%rsi) movq %rbx, 8(%rsi) movq %rbx, 16(%rsi) movq %rbx, %rdi call caml_update_dummy@PLT movq camlTest5@GOTPCREL(%rip), %rax movq %rbx, (%rax) movq $1, %rax addq $8, %rsp ret Structure is allocated twice, 3 calls to C functions, caml_update_dummy performs 3 caml_modify-s. My method: subq $8, %rsp .L100: call caml_alloc3@PLT .L101: leaq 8(%r15), %rbx movq $3072, -8(%rbx) movq $0, (%rbx) movq $0, 8(%rbx) movq $0, 16(%rbx) movq %rbx, (%rbx) movq %rbx, 8(%rbx) movq %rbx, 16(%rbx) movq camlTest5@GOTPCREL(%rip), %rax movq %rbx, (%rax) movq $1, %rax addq $8, %rsp ret Structure is allocated once, 6 stores instead of 3, 1 external call (actually, would be inlined if not toplevel) |
|
(0006315) xleroy (administrator) 2011-12-15 17:22 |
Your asm code excerpts don't answer my question: can you demonstrate non-negligible speedups on a real-world application to justify adding 150 lines of rather complex code to the compiler? |
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2011-10-25 14:40 | lefessan | New Issue | |
| 2011-10-25 14:40 | lefessan | Status | new => assigned |
| 2011-10-25 14:40 | lefessan | Assigned To | => xleroy |
| 2011-10-25 14:40 | lefessan | File Added: ocaml-3.12.1-letrec.patch | |
| 2011-12-13 16:29 | xleroy | Note Added: 0006265 | |
| 2011-12-13 16:29 | xleroy | Status | assigned => feedback |
| 2011-12-13 16:55 | frisch | Note Added: 0006269 | |
| 2011-12-15 09:53 | gasche | Note Added: 0006304 | |
| 2011-12-15 16:22 | lefessan | Note Added: 0006314 | |
| 2011-12-15 16:22 | lefessan | Status | feedback => assigned |
| 2011-12-15 17:22 | xleroy | Note Added: 0006315 | |
| 2011-12-15 17:22 | xleroy | Status | assigned => feedback |
| Copyright © 2000 - 2011 MantisBT Group |