
alloc of let-rec values is expensive (patch) #5384

Closed
vicuna opened this issue Oct 25, 2011 · 6 comments
Comments


vicuna commented Oct 25, 2011

Original bug ID: 5384
Reporter: @lefessan
Assigned to: @xavierleroy
Status: resolved (set by @xavierleroy on 2016-12-07T11:01:50Z)
Resolution: won't fix
Priority: normal
Severity: feature
Platform: all
OS: all
OS Version: all
Version: 3.12.1
Target version: later
Category: ~DO NOT USE (was: OCaml general)
Tags: patch
Monitored by: @protz mehdi dario @ygrek

Bug description

Allocation of let-rec values is not optimal: each value is first allocated as a dummy block using "caml_alloc_dummy", then the real values are allocated a second time, and finally the initial dummy blocks are patched using "caml_update_dummy", which makes heavy use of "caml_modify".

The following patch modifies the allocation of let-rec values for native code. It tries a better allocation strategy and falls back to the old one if the expression is not simple enough. The idea of the better strategy is to break recursive expressions into simple non-recursive values (which can be evaluated beforehand) and recursive blocks (which can all be allocated as one big block and then patched to add the recursive pointers). The strategy fails if the block is too big to be allocated in the minor heap.
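To make the splitting idea concrete, here is an illustrative sketch (this code is mine, not taken from the patch): a let-rec group mixing a binding that is not actually recursive with two mutually cyclic ones. Under the proposed strategy, k could be evaluated beforehand, while xs and ys would be allocated together as one block and then patched with direct stores.

```ocaml
(* Illustrative only: a let-rec group with a non-recursive member. *)
let rec k = 1            (* not recursive: can be evaluated up front *)
and xs = k :: ys         (* cyclic with ys *)
and ys = 2 :: xs         (* cyclic with xs *)

let () =
  assert (List.hd xs = 1);
  assert (List.hd ys = 2);
  (* the two lists really form a cycle *)
  assert (List.tl xs == ys && List.tl ys == xs)
```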

Additional information

Some performance measurements:

type t = { f : float; x : t option; }
let _ =
  for i = 1 to int_of_string Sys.argv.(1) do
    let rec v = { f = 0.0; x = Some v } in
    ()
  done

76% speed-up:
time ./test2.opt 100000000
0.55 s (optimized)
2.32 s (old)

let _ =
  let q = Queue.create () in
  for i = 1 to int_of_string Sys.argv.(1) do
    Queue.add 1 q;
    assert (Queue.take q = 1);
  done

50% speed-up:
time ./test1.opt 100000000
2.16 s (optimized)
4.26 s (old)

let _ =
  for i = 1 to int_of_string Sys.argv.(1) do
    let q = Queue.create () in
    Queue.add 1 q;
    Queue.add 2 q;
    Queue.add 3 q;
    assert (Queue.take q = 1);
    assert (Queue.take q = 2);
    assert (Queue.take q = 3);
  done

30% speed-up:
time ./test1.opt 100000000
5.24 s (optimized)
7.28 s (old)

File attachments


vicuna commented Dec 13, 2011

Comment author: @xavierleroy

I was a bit surprised by this patch: "let rec" over values is a very rarely used Caml feature (I don't think I've used it in the last 5 years, and some people consider it a design mistake of the language, since it makes it possible that structurally recursive functions over immutable datatypes fail to terminate), so I didn't understand the need to optimize its compilation. Is there a real-world program where this optimization has a significant impact on the overall execution time?
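The non-termination concern mentioned above can be illustrated with a classic example (my illustration, not code from the thread): a cyclic immutable list on which structural recursion never reaches the base case.

```ocaml
(* A cyclic "infinite" list built with let rec over a value. *)
let rec ones = 1 :: ones

(* List.length ones would loop forever, because the structural
   recursion never reaches []. Bounded access is still fine: *)
let third = List.nth ones 2

let () = assert (third = 1)
```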


vicuna commented Dec 13, 2011

Comment author: @alainfrisch

Fabrice gives the example of the Queue module, and some timings. I admit that the performance gain would probably be less impressive in a real-world algorithm, where Queue operations do not dominate the overall execution time. But it is a sign that some clever data structures can benefit from the patch. I remember using let-rec bindings several times when writing interpreters (to evaluate recursive closures, of course).

I can also mention that we (= LexiFi) rely a lot on "let rec" values these days, in the form of lazy bindings (we have special syntactic sugar for that). I don't know whether performance would be impacted by Fabrice's patch, though.
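As a hedged sketch of what such recursive lazy bindings can look like in plain OCaml (the names and the encoding are my own illustration, not LexiFi's actual sugar):

```ocaml
(* Two mutually recursive lazy streams; the lazy constructor makes the
   forward references legal in a let-rec over values. *)
type 'a stream = Cons of 'a * 'a stream Lazy.t

let rec odds : int stream = Cons (1, lazy evens)
and evens : int stream = Cons (2, lazy odds)

let head (Cons (x, _)) = x
let tail (Cons (_, t)) = Lazy.force t

let () =
  assert (head odds = 1);
  assert (head (tail odds) = 2);
  assert (head (tail (tail odds)) = 1)
```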


vicuna commented Dec 15, 2011

Comment author: @gasche

Just a few comments on the patch; this is an outsider's point of view,
with no guarantee of technical accuracy: I'm not familiar with the
OCaml codebase.

High-level remarks:

  1. Your patch adds a new recursive definition compilation method,
    alongside the old one, sometimes falling back to it. If I were to
    maintain that code in the future, I wouldn't like to have twice as
    much program logic to take care of. I believe it is possible to
    factor things better, so that more code and debugging/maintenance
    work is shared between the two approaches. See below.

  2. There is some redundancy between your code and two different parts
    of the existing code.

    First, your combined allocation involves size computation, which is
    already partly done by the expr_size function. Would it be
    possible to reuse the same size-computation mechanism in both?

    Second, you compute occurrences of recursion variables in the
    recursive definitions; your non_rec_exp function is essentially
    a computation of free variables. This is already done in the
    bytecomp/translcore.ml:check_recursive_lambda function, which
    checks that recursive definitions are semantically valid.

  3. More generally it seems that your optimization logic rather closely
    follows the computations done by the recursive definition checking
    in bytecomp/. I'm wondering whether some of it could be done in the
    same place. I think there are two independent aspects of your
    proposal: you single out non-recursive parts of the definition
    block to declare them beforehand (similarly you could pick the
    definitions that are not depended upon and declare them
    afterwards), and you have a clever allocation strategy for
    blocks. I think the first part of the optimization could be
    performed much sooner, as a source-to-source rewriting, in
    bytecomp/translcore.ml for example.

  4. Your optimization reorders definitions in the native backend, but
    does not affect bytecode compilation. That could result in
    side-effects being performed in a different order on both
    compilers. Evaluation order of "let ... and ... and" is not
    specified, but my understanding is that the current code preserves
    top-to-bottom evaluation order. I think it is correct to change
    order for optimization purposes, but it would be much better if
    ocamlc and ocamlopt had the same behavior. Splitting the
    let-reordering part of your optimization to put it in a common
    pipeline would achieve that effect.

  5. Your optimization trades the "caml_update_dummy" backpatching of
    the current semantics for set_field of block fields. I believe
    (but I have not tested my assumption) that this may degrade
    performance in some cases. For example, consider the following
    code:

    type test = { x : test; y : test; z : test }
    let rec test = { x = test; y = test; z = test }

    My understanding is that your pass will do three field stores,
    while the current scheme would only update one dummy reference. It
    would be interesting to compare the tradeoff and see if there is
    a way to choose the best method locally.
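The hoisting described in point 3 can be sketched as a source-to-source rewriting (a hypothetical illustration of the suggestion; the patch itself works at a lower level):

```ocaml
(* Before: one recursive group, although [a] is not actually recursive. *)
let rec a = [1; 2; 3]
and b = 0 :: c
and c = 1 :: b

(* After the hypothetical rewriting: the non-recursive binding is
   hoisted into an ordinary let, shrinking the recursive block. *)
let a' = [1; 2; 3]
let rec b' = 0 :: c'
and c' = 1 :: b'

let () =
  assert (a = a');
  assert (List.hd b = List.hd b' && List.hd c = List.hd c')
```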

Nitpicking:

  • your combine_alloc fails if the combined block is larger than
    max_young_wosize. Would it be possible to rather split it into
    smaller blocks?

  • the "if non_rec_exp exp then .... else raise Exit" would be
    profitably rewritten into
    "if not (non_rec_exp exp) then raise Exit; ...". It could even be
    interesting to avoid double negation by implementing rec_exp
    rather than non_rec_exp.

  • I'm not fond of your use of reference variables for set and
    cont; the global style of the code is to pass accumulators along
    in the recursive function. I'm under the impression that the
    control flow of your code would allow doing that just as easily.
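To illustrate the last style point in neutral terms (these helper functions are my own examples, not code from the patch), here is the same fold written with reference variables and with an explicit accumulator:

```ocaml
(* Reference-based style, similar in spirit to what the comment
   criticizes: *)
let sum_with_refs l =
  let acc = ref 0 in
  List.iter (fun x -> acc := !acc + x) l;
  !acc

(* Accumulator-passing style, matching the rest of the codebase: *)
let sum_with_acc l =
  let rec go acc = function
    | [] -> acc
    | x :: rest -> go (acc + x) rest
  in
  go 0 l

let () =
  assert (sum_with_refs [1; 2; 3] = 6);
  assert (sum_with_acc [1; 2; 3] = 6);
  assert (sum_with_acc [] = 0)
```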


vicuna commented Dec 15, 2011

Comment author: @lefessan

type test = { x : test; y : test; z : test }
let rec test = { x = test; y = test; z = test }

Your method:
subq $8, %rsp
.L100:
movq $7, %rdi
movq caml_alloc_dummy@GOTPCREL(%rip), %rax
call caml_c_call@PLT
.L101:
movq %rax, %rbx
call caml_alloc3@PLT
.L102:
leaq 8(%r15), %rsi
movq $3072, -8(%rsi)
movq %rbx, (%rsi)
movq %rbx, 8(%rsi)
movq %rbx, 16(%rsi)
movq %rbx, %rdi
call caml_update_dummy@PLT
movq camlTest5@GOTPCREL(%rip), %rax
movq %rbx, (%rax)
movq $1, %rax
addq $8, %rsp
ret
The structure is allocated twice, with 3 calls to C functions, and caml_update_dummy performs 3 calls to caml_modify.

My method:
subq $8, %rsp
.L100:
call caml_alloc3@PLT
.L101:
leaq 8(%r15), %rbx
movq $3072, -8(%rbx)
movq $0, (%rbx)
movq $0, 8(%rbx)
movq $0, 16(%rbx)
movq %rbx, (%rbx)
movq %rbx, 8(%rbx)
movq %rbx, 16(%rbx)
movq camlTest5@GOTPCREL(%rip), %rax
movq %rbx, (%rax)
movq $1, %rax
addq $8, %rsp
ret
The structure is allocated once, with 6 stores instead of 3 and 1 external call (which would actually be inlined if this were not at toplevel).


vicuna commented Dec 15, 2011

Comment author: @xavierleroy

Your asm code excerpts don't answer my question: can you demonstrate non-negligible speedups on a real-world application to justify adding 150 lines of rather complex code to the compiler?


vicuna commented Dec 7, 2016

Comment author: @xavierleroy

This PR has been sleeping for too long. If this performance improvement is still of interest, I suggest submitting it as a GitHub pull request so that it can be better discussed and reviewed.
