
Better compilation of functors #6710

Closed
vicuna opened this issue Dec 12, 2014 · 10 comments

Comments

@vicuna

vicuna commented Dec 12, 2014

Original bug ID: 6710
Reporter: @alainfrisch
Status: acknowledged (set by @damiendoligez on 2015-01-08T23:25:02Z)
Resolution: open
Priority: low
Severity: feature
Category: middle end (typedtree to clambda)
Monitored by: @diml @ygrek @jmeber @hcarty @yakobowski

Bug description

A functor typically returns a bunch of functions that refer to the functor's argument and to previous functions defined in the same functor body.

Consider for example:

module F (X : sig val i : int end) = struct
  let f1 x = x + X.i
  let f2 x = f1 x + X.i
  let f3 x = f1 x + f2 x + X.i
end

and compile it with -inline 0 -dcmm. The code for F looks like:

(function camlFoo__F_1223 (X/1216: addr)
 (let
   (f1/1217 (alloc 3319 "camlFoo__f1_1217" 3 X/1216)
    f2/1219 (alloc 4343 "camlFoo__f2_1219" 3 X/1216 f1/1217)
    f3/1221 (alloc 5367 "camlFoo__f3_1221" 3 X/1216 f1/1217 f2/1219))
   (alloc 3072 f1/1217 f2/1219 f3/1221)))

This shows that the memory allocated by the functor evaluation can easily become quadratic in the number of functions (and thus so can the evaluation time). It also creates a number of potentially long-lived local variables (f1/f2/f3), which can put some pressure on the register allocator (compile time). All that is probably not a big deal, but it is unfortunate in certain scenarios, such as using functors to simulate type classes.
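To make the quadratic claim concrete, here is a back-of-the-envelope count (a sketch, not compiler output; `total_slots` is an illustrative helper, not part of the compiler):

```ocaml
(* Under the current scheme, the k-th function's closure captures X plus
   the k-1 previous functions, i.e. k environment slots, so n functions
   capture 1 + 2 + ... + n = n * (n + 1) / 2 slots in total. *)
let total_slots n =
  let rec sum k acc = if k > n then acc else sum (k + 1) (acc + k) in
  sum 1 0

let () = assert (total_slots 3 = 6)       (* f1:1, f2:2, f3:3 *)
let () = assert (total_slots 100 = 5050)  (* quadratic growth *)
```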

Another (perhaps more problematic) performance issue is that referring to a previous function requires a memory load:

(function camlFoo__f3_1221 (x/1222: addr env/1238: addr)
 (+
   (+
     (+ (app "camlFoo__f1_1217" x/1222 (load (+a env/1238 24)) addr)
       (app "camlFoo__f2_1219" x/1222 (load (+a env/1238 32)) addr))
     (load (load (+a env/1238 16))))
   -2))

I did not try to build a micro-benchmark, but I can imagine that this would induce a noticeable penalty, e.g. in a functor that implements an efficient data structure with some shared helper functions (each call to such a helper would require one extra memory load).

Interestingly, turning the set of function definitions into a single "let rec" actually produces nicer-looking and probably more efficient code. The functor body itself is compiled to a single common closure:

(function camlFoo__F_1223 (X/1216: addr)
 (let
   clos/1237
     (alloc 9463 "camlFoo__f1_1217" 3 3321 "camlFoo__f2_1218" 3 6393
       "camlFoo__f3_1219" 3 X/1216)
   (alloc 3072 clos/1237 (+a clos/1237 24) (+a clos/1237 48))))

and calls between functions in the body don't require a memory load (just an offset):

(function camlFoo__f3_1219 (x/1222: addr env/1236: addr)
 (+
   (+
     (+ (app "camlFoo__f1_1217" x/1222 (+a env/1236 -48) addr)
       (app "camlFoo__f2_1218" x/1222 (+a env/1236 -24) addr))
     (load (load (+a env/1236 16))))
   -2))
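At the source level, the rewrite amounts to the following (a sketch of the transformation under discussion, not what the compiler would emit; the module `M` and the assertion are illustrative additions):

```ocaml
module F (X : sig val i : int end) = struct
  (* One recursive binding group: the three functions share a single
     closure block, so calls between them become pointer offsets. *)
  let rec f1 x = x + X.i
  and f2 x = f1 x + X.i
  and f3 x = f1 x + f2 x + X.i
end

(* Observable behaviour is unchanged: *)
module M = F (struct let i = 1 end)
let () = assert (M.f3 2 = 8)  (* (2+1) + ((2+1)+1) + 1 = 8 *)
```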

Could this transformation be applied automatically, and what could be the potential drawbacks?

An even better compilation scheme could be to create a single block (Y) representing the resulting structure, with extra fields for the "big closure", including the reference to X. The environment of the closure would be Y itself, and functions would access X and previously-defined functions through Y (this removes the need to adjust the environment pointer when calling another function). Non-functional fields defined by the functor (and used in functions) would also be accessible through Y, without requiring extra fields in the closure environment. Y would be created first, with dummy values in some fields (the closure part is mostly static, except for the reference to X), and then filled in during the evaluation of the functor body (as for toplevel structures).

@vicuna
Author

vicuna commented Dec 12, 2014

Comment author: @gasche

Sharing environments between independent closures would risk widening the lifetime of some of the captured values. Consider:

let g x =
  let module F () = struct
    let f0 () = I_use_a_lot_of_memory
    let g () = x
  end in
  let module M = F () in
  M.g ()

If f0 and g shared the same environment, then I_use_a_lot_of_memory would still be reachable through M.g's env at the end of this evaluation.

In your case that cannot happen because the functions already depend on each other, so each function's environment is reachable from the last function's. This scenario is also precisely the one that is currently costly, as the quadratic blowup comes from the growing size of the closure environments.

I think we could optimize consecutive function declarations that can be proved to have the same environment-reachable set by compiling them just as mutually recursive functions. (We could even consider reordering effect-free function declarations by a topological sort, hoping to gain new opportunities for this.) This would improve the scenario you give without any increase in value lifetimes.

@vicuna
Author

vicuna commented Dec 12, 2014

Comment author: @gasche

(It should be reasonably easy to test the real-life performance impact of this kind of optimization by looking for performance-sensitive, functor-heavy code in the wild. OCamlgraph would qualify, Mirage as well, maybe the functorized static-analysis programs. I expect we wouldn't see much change, but who knows.)

@vicuna
Author

vicuna commented Dec 12, 2014

Comment author: @alainfrisch

If f0 and g shared the same environment, then I_use_a_lot_of_memory would still be reachable through M.g's env at the end of this evaluation.

In your very example, f0 is an abstraction: the only identifiers it could depend upon are the functor's argument and other variables in scope, such as "x" in your case (but it is already kept alive by g).

The same example, but with "let f0 = ..." instead of "let f0 () = ...", could more easily exhibit the bad property you mention, but I doubt this occurs in practice: how often does a functor keep big data structures that could be released earlier than its functions?
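A hedged sketch of the "let f0 = ..." variant (the local module and the array standing in for the big data are hypothetical; the functor wrapper is dropped for brevity):

```ocaml
(* Here f0 is a value, not an abstraction: the module body itself binds
   a big structure. If f0 and g shared one closure environment, the
   array bound to f0 would stay reachable for as long as M.g is. *)
let g x =
  let module M = struct
    let f0 = Array.make 1_000_000 0  (* big structure, not a function *)
    let g () = x
  end in
  ignore (Array.length M.f0);
  M.g ()

let () = assert (g 42 = 42)
```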

@vicuna
Author

vicuna commented Dec 12, 2014

Comment author: @gasche

I actually meant I_use_a_lot_of_memory as a reference to a variable in scope that takes a lot of memory -- sorry for the confusing use of a capital letter here.

I can craft reasonably-credible examples out of a large temporary variable created for caching a computation that happens locally, while a simple teardown function that doesn't depend on this memory is returned:

module Compute (Param : sig val len : int end) = struct
  let cache = Hashtbl.create ...
  let compute () = ... (* grows cache a lot *)
  let return_cps result k = k result
end

let f param =
  let module C = Compute (struct let len = param end) in
  let result = C.compute () in
  C.return_cps result

Do you think restricting your optimization to the safe case of consecutive bindings that have the same live set would limit its effectiveness in practice? Wherever it does not apply, an expert user can still turn the definitions into a "let rec .. and .." to enforce the sharing semantics -- while I'm not sure how users could opt out of the liveness-unsafe version when it bites them.

@github-actions

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label Mar 24, 2021
@gasche
Member

gasche commented Mar 24, 2021

I don't remember the details of this discussion, but Stale-bot is asking for @alainfrisch to confirm whether they still care about this idea. More precisely:

  • it looks like I came up with an example where the proposed optimization is not memory-safe, and a proposed restriction to a situation where it is memory-safe
  • we are waiting for Alain's feedback on whether the restricted optimization is still desirable in his eyes
  • no one has volunteered to implement this change yet

@lthls
Contributor

lthls commented Mar 24, 2021

Doesn't this discussion predate the inclusion of Flambda? While the particular patch that Alain had in mind hasn't been implemented, Flambda has significantly improved the compilation of functors, so I wouldn't be surprised if the issue were much less relevant now.

@gasche
Member

gasche commented Mar 24, 2021

Another point in the balance is the recent discussion about maybe getting rid of Infix_tag. Plans to improve compilation by using it more often are perhaps not as enticing as they once were.

@github-actions github-actions bot removed the Stale label Mar 26, 2021
@github-actions

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label Mar 28, 2022
@alainfrisch
Contributor

As far as I remember, this was more a theoretical concern than a practical one, so I'm closing the issue. (And the pressure on the register allocator is no longer an issue for us, thanks to the linear scan allocator.)
