Mantis Bug Tracker

View Issue Details

ID: 0005384
Project: OCaml
Category: OCaml general
View Status: public
Date Submitted: 2011-10-25 14:40
Last Update: 2013-10-09 14:39
Reporter: lefessan
Assigned To: xleroy
Priority: normal
Severity: feature
Reproducibility: always
Status: feedback
Resolution: open
Platform: all
OS: all
OS Version: all
Product Version: 3.12.1
Target Version: 3.12.1
Fixed in Version:
Summary: 0005384: alloc of let-rec values is expensive (patch)
Description: Allocation of "let rec" values is not optimal: each value is first allocated separately using "caml_alloc_dummy", then the values are allocated a second time, and finally the initial dummy blocks are updated using "caml_update_dummy", which makes heavy use of "caml_modify".

The following patch modifies the allocation of let-rec values for native code.
It tries a better allocation strategy, and falls back to the old strategy if
the expression is not simple enough. The idea of the better allocation strategy is to break recursive definitions into simple non-recursive values (which can be evaluated beforehand) and recursive blocks (which can all be allocated as one big block and then patched to add the recursive pointers). The strategy fails if the combined block is too big to be allocated in the minor heap.
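To make the split concrete, here is a small sketch (with illustrative names, not the patch's actual code): `a` never mentions the recursive names, so it can be evaluated beforehand, while `b` and `c` are genuinely cyclic and are candidates for one combined allocation.

```ocaml
(* Hypothetical illustration of the split the patch performs. *)
type cell = { v : int; next : cell option }

let rec a = { v = 0; next = None }   (* non-recursive: evaluated beforehand  *)
and b = { v = 1; next = Some c }     (* cyclic pair: one big block, then     *)
and c = { v = 2; next = Some b }     (* patched with the recursive pointers  *)

let () =
  assert (a.next = None);
  match b.next with
  | Some x -> assert (x.v = 2)
  | None -> assert false
```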
Additional Information: Some performance measurements:

type t = { f : float; x : t option; }
let _ =
  for i = 1 to int_of_string Sys.argv.(1) do
    let rec v = { f = 0.0; x = Some v } in
    ()
  done

76% speed-up:
time ./test2.opt 100000000
0.55 s (optimized)
2.32 s (old)
-----------------------------------------
let _ =
  let q = Queue.create () in
  for i = 1 to int_of_string Sys.argv.(1) do
    Queue.add 1 q;
    assert (Queue.take q = 1);
  done

50% speed-up:
time ./test1.opt 100000000
2.16 s (optimized)
4.26 s (old)
-----------------------------------------
let _ =
  for i = 1 to int_of_string Sys.argv.(1) do
    let q = Queue.create () in
    Queue.add 1 q;
    Queue.add 2 q;
    Queue.add 3 q;
    assert (Queue.take q = 1);
    assert (Queue.take q = 2);
    assert (Queue.take q = 3);
  done

30% speed-up:
time ./test1.opt 100000000
5.24 s (optimized)
7.28 s (old)
Tags: patch
Attached Files: ocaml-3.12.1-letrec.patch [^] (7,102 bytes) 2011-10-25 14:40

-  Notes
(0006265)
xleroy (administrator)
2011-12-13 16:29

I was a bit surprised by this patch: "let rec" over values is a very rarely used Caml feature (I don't think I've used it in the last 5 years, and some people consider it a design mistake of the language, since it makes it possible that structurally recursive functions over immutable datatypes fail to terminate), so I didn't understand the need to optimize its compilation. Is there a real-world program where this optimization has a significant impact on the overall execution time?
(0006269)
frisch (developer)
2011-12-13 16:55

Fabrice gives the example of the Queue module, and some timings. I admit that the performance gain would probably be less impressive in a real-world algorithm, where Queue operations do not dominate the overall execution time. But it is a sign that some clever data structures can benefit from the patch. I remember using let-rec bindings several times when writing interpreters (to evaluate recursive closures, of course).
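For readers unfamiliar with why Queue is relevant here: the 3.12-era Queue represents a queue as a circular singly-linked list, and the freshly added cell is tied into a cycle with a "let rec" value. The sketch below is in the spirit of that implementation (names are illustrative, not the stdlib's):

```ocaml
(* A cell whose [next] field closes the cycle; this is exactly the kind
   of "let rec" value the patch speeds up. *)
type 'a cell = { content : 'a; mutable next : 'a cell }

let singleton x =
  (* the cell points to itself until a second element arrives *)
  let rec cell = { content = x; next = cell } in
  cell

let () =
  let c = singleton 1 in
  assert (c.content = 1);
  assert (c.next == c)
```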

I can also mention that we (= LexiFi) rely a lot on "let rec" values these days, in the form of lazy bindings (we have special syntactic sugar for that). I don't know whether performance would be affected by Fabrice's patch, though.
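As a generic illustration of "let rec" over lazy values (not LexiFi's actual syntactic sugar), a cyclic lazy stream:

```ocaml
(* An infinite stream built from a single cyclic "let rec" binding. *)
type 'a stream = Cons of 'a * 'a stream Lazy.t

let rec ones = Cons (1, lazy ones)

let head (Cons (x, _)) = x
let tail (Cons (_, t)) = Lazy.force t

let () = assert (head (tail ones) = 1)
```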
(0006304)
gasche (developer)
2011-12-15 09:53

Just a few comments on the patch; this is an outsider point of view,
with no guarantee of technical accuracy: I'm not familiar with the
OCaml codebase.

High-level remarks:

1. Your patch adds a new recursive definition compilation method,
   alongside the old one, sometimes falling back to it. If I were to
   maintain that code in the future, I wouldn't like to have twice as
   much program logic to take care of. I believe it is possible to
   have something more factorized, so that more code and
   debugging/maintenance is shared between the two approaches. See
   below.

2. There is some redundancy between your code and two different parts
   of the existing code.

   First, your combined allocation involves size computation, which is
   already partly done by the `expr_size` function. Would it be
   possible to reuse the same size-computation mechanism in both?

   Second, you compute occurrences of recursion variables in the
   recursive definitions; your non_rec_exp function is essentially
   a computation of free variables. This is already done in the
   bytecomp/translcore.ml:check_recursive_lambda function, which
   checks that recursive definitions are semantically valid.

3. More generally it seems that your optimization logic rather closely
   follows the computations done by the recursive definition checking
   in bytecomp/. I'm wondering whether some of it could be done in the
   same place. I think there are two independent aspects of your
   proposal: you single out non-recursive parts of the definition
   block to declare them beforehand (similarly you could pick the
   definitions that are not depended upon and declare them
   afterwards), and you have a clever allocation strategy for
   blocks. I think the first part of the optimization could be
   performed much sooner, as a source-to-source rewriting, in
   bytecomp/translcore.ml for example.
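The suggested source-to-source rewriting can be sketched as follows (hypothetical illustration, not actual compiler code; the primed names only exist so both forms fit in one file):

```ocaml
(* Before the rewriting: [a] does not mention the recursive names,
   so it need not be part of the recursive block. *)
let rec a = Some 1                                    (* hoistable        *)
and b = fun n -> if n = 0 then a else b (n - 1)       (* genuinely recursive *)

(* After hoisting the non-recursive binding out of the recursive block: *)
let a' = Some 1
let rec b' = fun n -> if n = 0 then a' else b' (n - 1)

let () =
  assert (b 3 = Some 1);
  assert (b' 3 = Some 1)
```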

4. Your optimization reorders definitions in the native backend, but
   does not affect bytecode compilation. That could result in
   side-effects being performed in a different order on both
   compilers. Evaluation order of "let ... and .. and" is not
   specified, but my understanding is that the current code preserves
   top-to-bottom evaluation order. I think it is correct to change
   order for optimization purposes, but it would be much better if
   ocamlc and ocamlopt had the same behavior. Splitting the
   let-reordering part of your optimization to put it in a common
   pipeline would achieve that effect.

5. Your optimization trades the "caml_update_dummy" backpatching of
   the current semantics for `set_field` of block fields. I believe
   (but I have not tested my assumption) that this may degrade
   performance in some cases. For example, consider the following
   code:
   
     type test = { x : test; y : test; z : test }
     let rec test = { x = test; y = test; z = test }

   My understanding is that your pass will do three field stores,
   while the current scheme would only update one dummy reference. It
   would be interesting to compare the tradeoff and see if there is
   a way to choose the best method locally.


Nitpicking:

- your combine_alloc fails if the combined block is larger than
  `max_young_wosize`. Would it be possible to rather split it into
  smaller blocks?

- the "if non_rec_exp exp then .... else raise Exit" would be
  profitably rewritten into
  "if not (non_rec_exp exp) then raise Exit; ...". It could even be
  interesting to avoid double negation by implementing `rec_exp`
  rather than `non_rec_exp`.

- I'm not fond of your use of reference variables for `set` and
  `cont`; the global style of the code is to pass along accumulators
  in the recursive function. I'm under the impression that the control
   flow of your code would allow doing that just as easily.
(0006314)
lefessan (developer)
2011-12-15 16:22

> type test = { x : test; y : test; z : test }
> let rec test = { x = test; y = test; z = test }

Your method:
        subq $8, %rsp
.L100:
        movq $7, %rdi
        movq caml_alloc_dummy@GOTPCREL(%rip), %rax
        call caml_c_call@PLT
.L101:
        movq %rax, %rbx
        call caml_alloc3@PLT
.L102:
        leaq 8(%r15), %rsi
        movq $3072, -8(%rsi)
        movq %rbx, (%rsi)
        movq %rbx, 8(%rsi)
        movq %rbx, 16(%rsi)
        movq %rbx, %rdi
        call caml_update_dummy@PLT
        movq camlTest5@GOTPCREL(%rip), %rax
        movq %rbx, (%rax)
        movq $1, %rax
        addq $8, %rsp
        ret
The structure is allocated twice, with 3 calls to C functions; caml_update_dummy performs 3 calls to caml_modify.

My method:
        subq $8, %rsp
.L100:
        call caml_alloc3@PLT
.L101:
        leaq 8(%r15), %rbx
        movq $3072, -8(%rbx)
        movq $0, (%rbx)
        movq $0, 8(%rbx)
        movq $0, 16(%rbx)
        movq %rbx, (%rbx)
        movq %rbx, 8(%rbx)
        movq %rbx, 16(%rbx)
        movq camlTest5@GOTPCREL(%rip), %rax
        movq %rbx, (%rax)
        movq $1, %rax
        addq $8, %rsp
        ret
The structure is allocated once, with 6 stores instead of 3 and 1 external call (which would actually be inlined if not at toplevel).
(0006315)
xleroy (administrator)
2011-12-15 17:22

Your asm code excerpts don't answer my question: can you demonstrate non-negligible speedups on a real-world application to justify adding 150 lines of rather complex code to the compiler?

- Issue History
Date Modified Username Field Change
2011-10-25 14:40 lefessan New Issue
2011-10-25 14:40 lefessan Status new => assigned
2011-10-25 14:40 lefessan Assigned To => xleroy
2011-10-25 14:40 lefessan File Added: ocaml-3.12.1-letrec.patch
2011-12-13 16:29 xleroy Note Added: 0006265
2011-12-13 16:29 xleroy Status assigned => feedback
2011-12-13 16:55 frisch Note Added: 0006269
2011-12-15 09:53 gasche Note Added: 0006304
2011-12-15 16:22 lefessan Note Added: 0006314
2011-12-15 16:22 lefessan Status feedback => assigned
2011-12-15 17:22 xleroy Note Added: 0006315
2011-12-15 17:22 xleroy Status assigned => feedback
2013-10-09 14:39 doligez Tag Attached: patch

