alloc of let-rec values is expensive (patch) #5384

vicuna · 2011-10-25T12:40:15Z

Original bug ID: 5384
Reporter: @lefessan
Assigned to: @xavierleroy
Status: resolved (set by @xavierleroy on 2016-12-07T11:01:50Z)
Resolution: won't fix
Priority: normal
Severity: feature
Platform: all
OS: all
OS Version: all
Version: 3.12.1
Target version: later
Category: ~DO NOT USE (was: OCaml general)
Tags: patch
Monitored by: @protz mehdi dario @ygrek

Bug description

Allocation of let rec values is not optimal: each value is allocated separately using "caml_alloc_dummy", and then they are allocated a second time, and then the initial allocated blocks are updated using "caml_update_dummy", making an heavy use of "caml_modify".

The following patch modifies the allocation of let-rec values for native code.
It tries a better allocation strategy, and fall-backs to the old strategy if
the expression is not simple enough. The idea of the better allocation strategy is to break recursive expressions into simple non-recursive values (that can be evaluated before), and recursive blocks (that can all be allocated as one big block and then modified to add recursive pointers). The strategy fails if the block is too big and cannot be allocated in the minor heap.

Additional information

Some performance measurements:

type t = { f : float; x : t option; }
let _ =
for i = 1 to int_of_string Sys.argv.(1) do
let rec v = { f = 0.0; x = Some v } in
()
done

76% speed-up:
time ./test2.opt 100000000
0.55 s (optimized)
2.32 s (old)

let _ =
let q = Queue.create () in
for i = 1 to int_of_string Sys.argv.(1) do
Queue.add 1 q;
assert (Queue.take q = 1);
done

50% speed-up:
time ./test1.opt 100000000
2.16 s (optimized)
4.26 s (old)

let _ =
for i = 1 to int_of_string Sys.argv.(1) do
let q = Queue.create () in
Queue.add 1 q;
Queue.add 2 q;
Queue.add 3 q;
assert (Queue.take q = 1);
assert (Queue.take q = 2);
assert (Queue.take q = 3);
done

30% speed-up:
time ./test1.opt 100000000
5.24 s (optimized)
7.28 s (old)

File attachments

ocaml-3.12.1-letrec.patch

vicuna · 2011-12-13T15:29:18Z

Comment author: @xavierleroy

I was a bit surprised by this patch: "let rec" over values is a very rarely used Caml feature (I don't think I've used it in the last 5 years, and some people consider it a design mistake of the language, since it makes it possible that structurally recursive functions over immutable datatypes fail to terminate), so I didn't understand the need to optimize its compilation. Is there a real-world program where this optimization has a significant impact on the overall execution time?

vicuna · 2011-12-13T15:55:27Z

Comment author: @alainfrisch

Fabrice gives the example of the Queue module, and some timings. I admit that the performance gain would probably be less impressive in a real-world algorithm, where Queue operations do not dominate the overall execution time. But it is the sign that some clever data structures can benefit from the patch. I remember using let-rec bindings several times when writing interpreters (to evaluate recursive closures, of course).

I can also mention that we (=LexiFi) rely a lot on "let rec" values these days, in the form or lazy bindings (we have special syntactic sugar for that). I don't know if performance would be impacted by Fabrice's patch, though.

vicuna · 2011-12-15T08:53:54Z

Comment author: @gasche

Just a few comments on the patch; this is an outsider point of view,
with no guarantee of technical accurateness: I'm not familiar with the
OCaml codebase.

High-level remarks:

Your patch adds a new recursive definition compilation method,
alongside the old one, sometimes falling back to it. If I were to
maintain that code in the future, I wouldn't like to have twice as
much program logic to take care of. I believe it is possible to
have something more factorized, so that more code and
debugging/maintainance is shared between the two approaches. See
below.
There is some redundancy between your code and two different parts
of the existing code.

First, your combined allocation involves size computation, which is
already partly done by the expr_size function. Would it be
possible to reuse the same size-computation mechanism in both?

Second, you compute occurences of recursion variables in the
recursive definitions; your non_rec_exp function is essentially
a computation of free variables. This is already done in the
bytecomp/translcore.ml:check_recursive_lambda function, which
checks that recursive definitions are semantically valid.
More generally it seems that your optimization logic rather closely
follows the computations done by the recursive definition checking
in bytecomp/. I'm wondering whether some of it could be done in the
same place. I think there are two independent aspects of your
proposal: you single out non-recursive parts of the definition
block to declare them beforehand (similarly you could pick the
definitions that are not depended upon and declare them
afterwards), and you have a clever allocation strategy for
blocks. I think the first part of the optimization could be
performed much sooner, as a source-to-source rewriting, in
bytecomp/translcore.ml for example.
Your optimization reorders definitions in the native backend, but
does not affect bytecode compilation. That could result in
side-effects being performed in a different order on both
compilers. Evaluation order of "let ... and .. and" is not
specified, but my understanding is that the current code preserves
top-to-bottom evaluation order. I think it is correct to change
order for optimization purposes, but it would be much better if
ocamlc and ocamlopt had the same behavior. Splitting the
let-reordering part of your optimization to put it in a common
pipeline would achieve that effect.
Your optimization trades the "caml_update_dummy" backpatching of
the current semantics for set_field of block fields. I believe
(but I have not tested my assumption) that this may degrade
performances in some cases. For example consider the following
code:

type test = { x : test; y : test; z : test }
let rec test = { x = test; y = test; z = test }

My understanding is that your pass will do three field stores,
while the current scheme would only update one dummy reference. It
would be interesting to compare the tradeoff and see if there is
a way to choose the best method locally.

Nitpicking:

your combine_alloc fails if the combined block is larger than
max_young_wosize. Would it be possible to rather split it into
smaller blocks?
the "if non_rec_exp exp then .... else raise Exit" would be
profitably rewritten into
"if not (non_rec_exp exp) then raise Exit; ...". It could even be
interesting to avoid double negation by implementing rec_exp
rather than non_rec_exp.
I'm not fond of your use of reference variables for set and
cont; the global style of the code is to pass along accumulators
in the recursive function. I'm under the impression that the control
flow of your code would allow to do that just as easily.

vicuna · 2011-12-15T15:22:05Z

Comment author: @lefessan

type test = { x : test; y : test; z : test }
let rec test = { x = test; y = test; z = test }

Your method:
subq $8, %rsp
.L100:
movq $7, %rdi
movq caml_alloc_dummy@GOTPCREL(%rip), %rax
call caml_c_call@PLT
.L101:
movq %rax, %rbx
call caml_alloc3@PLT
.L102:
leaq 8(%r15), %rsi
movq $3072, -8(%rsi)
movq %rbx, (%rsi)
movq %rbx, 8(%rsi)
movq %rbx, 16(%rsi)
movq %rbx, %rdi
call caml_update_dummy@PLT
movq camlTest5@GOTPCREL(%rip), %rax
movq %rbx, (%rax)
movq $1, %rax
addq $8, %rsp
ret
Structure is allocated twice, 3 calls to C functions, caml_update_dummy performs 3 caml_modify-s.

My method:
subq $8, %rsp
.L100:
call caml_alloc3@PLT
.L101:
leaq 8(%r15), %rbx
movq $3072, -8(%rbx)
movq $0, (%rbx)
movq $0, 8(%rbx)
movq $0, 16(%rbx)
movq %rbx, (%rbx)
movq %rbx, 8(%rbx)
movq %rbx, 16(%rbx)
movq camlTest5@GOTPCREL(%rip), %rax
movq %rbx, (%rax)
movq $1, %rax
addq $8, %rsp
ret
Structure is allocated once, 6 stores instead of 3, 1 external call (actually, would be inlined if not toplevel)

vicuna · 2011-12-15T16:22:21Z

Comment author: @xavierleroy

Your asm code excerpts don't answer my question: can you demonstrate non-negligible speedups on a real-world application to justify adding 150 lines of rather complex code to the compiler?

vicuna · 2016-12-07T11:01:50Z

Comment author: @xavierleroy

This PR has been sleeping for too long. If this performance improvement is still of interest, I suggest to submit it as a Github pull request so that it gets discussed and reviewed better.

vicuna closed this as completed Dec 7, 2016

vicuna added the wontfix label Mar 14, 2019

vicuna assigned xavierleroy Mar 14, 2019

vicuna added the feature-wish label Mar 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

alloc of let-rec values is expensive (patch) #5384

alloc of let-rec values is expensive (patch) #5384

vicuna commented Oct 25, 2011

vicuna commented Dec 13, 2011

vicuna commented Dec 13, 2011

vicuna commented Dec 15, 2011

vicuna commented Dec 15, 2011

vicuna commented Dec 15, 2011

vicuna commented Dec 7, 2016

alloc of let-rec values is expensive (patch) #5384

alloc of let-rec values is expensive (patch) #5384

Comments

vicuna commented Oct 25, 2011

Bug description

Additional information

76% speed-up: time ./test2.opt 100000000 0.55 s (optimized) 2.32 s (old)

50% speed-up: time ./test1.opt 100000000 2.16 s (optimized) 4.26 s (old)

File attachments

vicuna commented Dec 13, 2011

vicuna commented Dec 13, 2011

vicuna commented Dec 15, 2011

vicuna commented Dec 15, 2011

vicuna commented Dec 15, 2011

vicuna commented Dec 7, 2016

76% speed-up:
time ./test2.opt 100000000
0.55 s (optimized)
2.32 s (old)

50% speed-up:
time ./test1.opt 100000000
2.16 s (optimized)
4.26 s (old)