Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

static data structures allocated on the GC heap #5779

Closed
vicuna opened this issue Oct 9, 2012 · 35 comments
Closed

static data structures allocated on the GC heap #5779

vicuna opened this issue Oct 9, 2012 · 35 comments
Assignees
Milestone

Comments

@vicuna
Copy link

vicuna commented Oct 9, 2012

Original bug ID: 5779
Reporter: markghayden
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2015-12-11T18:25:53Z)
Resolution: fixed
Priority: normal
Severity: minor
Version: 4.00.0
Target version: 4.01.1+dev
Fixed in version: 4.02.0+dev
Category: back end (clambda to assembly)
Tags: patch
Related to: #6337
Monitored by: @gasche meurer @ygrek @jmeber @hcarty @avsm @yakobowski @alainfrisch

Bug description

I have fairly large program (120k LOC) with significant amounts of static data structures. For example, large lists of records containing information for rules tables. These are all read-only structures (no mutable records, no arrays). I'd like these in static sections so they don't consume as much memory, have to be allocated and initialized, and incur ongoing overhead from GC scanning them. My software runs on mobile devices (iPads), which can often be quite memory constrained; whatever I can do to reduce memory consumption is helpful.

The Ocaml compiler seem to be fairly brittle about determining what should be allocated in static .data sections and what will be allocated and placed on the heap at run time. For example, a simple non-mutable record does is static. A non-mutable record included in a list or another non-mutable record does cause allocation on the heap. It seems references to staticly allocated objects within a second object causes the latter to be allocated on the heap.

Can the predicate for what is allocated staticly (not on the heap) be improved to capture a more complete set of cases? Eg ints, strings, floats, lists, non-mutable records, disjoint unions of the preceding. And have that documented?

Also, would be nice if these didn't require heap allocation:

  • Use of modules within files
  • Declaring an alias of a function at the top level. Eg: let f a = a ;; let g = f ;;

Steps to reproduce

I've attached a short program that declares various structures and monitors which one caused allocation by using Gc.stat. Most of the cases where allocation occurs are avoidable. A few (records with mutable fields) probably require allocation in some cases. I've only compiled the program using native code & for MacOS x64 but I imagine most of this is machine independent.

File attachments

@vicuna
Copy link
Author

vicuna commented Nov 14, 2012

Comment author: @damiendoligez

This looks more like a problem with constant propagation: if I inline the names in your examples, almost everything becomes optimal.

For example, I replace:
let int_alpha_record_record = { alpha = int_alpha_record } ;;
with
let int_alpha_record_record = { alpha = {alpha = 1} } ;;

When I do that for all expressions, the only ones that still allocate are:
int_mut_record alloced 2 --- this has to be in the heap because it's mutable
int_pairf alloced 3 --- I don't know why inlining doesn't make this one equivalent to int_pair.

Of course, my transformation destroys sharing, so you should use it with caution.

If you have large immutable data structures that you want to store out of the heap, you should probably have a look at the "ancient" library by Richard Jones:
http://git.annexia.org/?p=ocaml-ancient.git;a=summary
I hope it still works with the latest versions of OCaml.

@vicuna
Copy link
Author

vicuna commented Dec 5, 2012

Comment author: markghayden

Thanks for the comments. The problem is that I would like to have sharing in these structures. I guess Ancient is a possible approach... but I'd really like to just have the structures allocated as part of the executable. It makes like much simpler. Is there a reason shared structures cannot be allocated statically (not on the heap)?

@vicuna
Copy link
Author

vicuna commented Feb 12, 2014

Comment author: @alainfrisch

Most of the machinery is in place in the compiler to track "approximations" of values. This notion cannot currently capture references to statically allocated structured constants, but it doesn't look very difficult to extend it to do so.
I've attached a patch in this direction. It is not very nice, because it extends the notion of structured_constant shared with the bytecode compiler (Lambda.structured_constant) with a case only meaningful in native code. A bigger refactoring (probably creating a dedicated type of "structured constants" to be used in the native compiler) would make sense, I guess.

Anyway, comments are welcome on the patch (esp. whether it fully achieves the desired behavior or still miss cases).

@vicuna
Copy link
Author

vicuna commented Feb 12, 2014

Comment author: @alainfrisch

Note: inlined function calls are currently not turned into statically allocated constants. It's probably possible to do so by extending Closure.substitute to treat this case.

@vicuna
Copy link
Author

vicuna commented Feb 13, 2014

Comment author: @alainfrisch

A second version of the patch is attached. It addresses the case of inlined functions. With this patch, the following code does not allocate:

let a = (0, 1)
let b = (a, a)
let pair x y = (x, y)
let u = pair a a
let t = pair u u

@vicuna
Copy link
Author

vicuna commented Feb 13, 2014

Comment author: @alainfrisch

Ah ah, just spent 3 hours tracking a weird behavior!

The optimization under discussion can change the sharing of values between native and bytecode, because of inlining. For instance, in btype.ml, we have:

let newgenvar ?name () = newgenty (Tvar name)

A call to "newgenvar ()" will be inlined into "newgenty (Tvar None)", with the structured constant allocated statically, meaning that it will be shared across all invocations of this piece of code.

An unfortunate consequence is that ocamlc and ocamlc.opt could now produce different .cmi files (i.e. with different digest). This can be observed by compiling:

let a = function A -> () let b = function B -> ()

The resulting .cmi file has two occurrences of Tvar None, they are shared when produced by ocamlc.opt (unless if we compile ocamlc.opt with -inline 0), but different when produced by ocamlc.

This currently breaks the build system:
.....
../ocamlcompopt.sh -pp './remove_DEBUG' -I ../parsing -I ../utils -I ../typing -I ../driver -I ../bytecomp -I ../tools -I ../toplevel/ -I ../stdlib -I ../otherlibs/str -I ../otherlibs/dynlink -I ../otherlibs/unix -I ../otherlibs/num -I ../otherlibs/graph -warn-error A -c odoc_gen.ml
File "/tmp/ocamlppd01b7d", line 1:
Error: The files odoc_texi.cmi and odoc_gen.cmi make inconsistent assumptions
over interface Odoc_texi
.....

It's not clear whether having exactly the same sharing semantics between the bytecode and the native code compiler is critical, but the example above shows that one needs to be careful about it.

@vicuna
Copy link
Author

vicuna commented Feb 13, 2014

Comment author: @alainfrisch

Attached a third version of the patch. It now defines proper datatypes to represent the immediate and structured constants in the native code compiler.

The code is much cleaner (also compared to the current trunk, where the invariant on which kinds of constant can appear in clambda is burried in the code).

All structured constants are bound to global symbols, including sub-terms. This is different from the previous behavior (only the root of structured constants where bound). The advantage is that it will enable more optimizations such as simplifying a field projection from a constant block (let a = ("foo", "bar");; let b = fst a).

As a work-around for the problem reported in my previous note, there is a hack in subst.ml to normalize all occurrence of Tvar None (I did not check if other such constants could appear).

Note: I haven't reproduced for now the sharing of string literals generated by translclass (and there might be a different in the sharing of floats as well). It could make sense to allow even more sharing of identical structured constants, but this would have the effect of changing even more the sharing semantics between native and bytecode.

@vicuna
Copy link
Author

vicuna commented Feb 18, 2014

Comment author: @alainfrisch

A key point to be discussed is whether we want to maintain as much as possible the same behavior w.r.t. physical equality / sharing between code generated by ocamlopt and by ocamlc, or if we are free to improve sharing in native code (and then, it might make sense to share all structurally equal literals as well in a given unit).

@vicuna
Copy link
Author

vicuna commented Mar 3, 2014

Comment author: @alainfrisch

Just uploaded *_v4.diff. This patch adds the sharing of structurally equal structured constant (in a single unit). Some special care is taken not to share string literals (except the empty one!), though. Indeed, there is no consensus amongst developers that it is right to do so (even though the manual explicitly says that the compiler can share string literals as if they are constants).

The patch also adds constant propagation on field projection, so that code such as:

let z = ("abc", "def")
let swap z = (snd z, fst z)
let a = swap (swap z)

does not allocate. There is also an extra rule to compute statically the length of a known string so that the following does not allocate either (with a proper -inline directive):

let foo s = (s, String.length s)
let l = foo "abcd"

I think the patch is ready for inclusion, and I'd like to get some feedback on it.

Some points to be noted:

  • The patch increases the difference of behavior between ocamlc and ocamlopt w.r.t. physical equality of structured constants. This creates another problem while compiling ocamldoc (ocamlc and ocamlopt.opt produce different .cmi files when compiling odoc_parameter.ml, which leads to "The files odoc_class.cmi and odoc_str.cmi make inconsistent assumptions over interface Odoc_parameter"). This is because of the sharing of the empty string literal. We can: (i) disable this special case, (ii) arrange so that the compiler creates a single empty string literal (which will be shared both in bytecode and in native code); (iii) tweak the generic marshaling to have it share all empty strings; (iv) tweak ocamldoc's build system to avoid relying on the fact that ocamlc and ocamlopt.opt produce the same .cmi files. I've checked that not sharing the empty string fixes the build.

  • The compiler can generate useless constants (e.g. if a field is extracted from a constant block, which is not otherwise used). I don't think we care, but in case we do, it would not be difficult to filter out those unused constants while generating the cmm code.

  • With the patch, all structured constants (including sub-terms) are given a globally accessible symbol. This can add many exported symbols to a given object. I don't think that it is an issue.

  • There is still some logic in translclass to share method names and table of method names. It is redundant with the generic sharing of constants in ocamlopt, but it is still useful for bytecode, so I'll leave it there.

  • There is also some logic to share float literals in various versions of emit.mlp. This is for the instruction used to load a double constant into a register (Iconst_float). It could make sense to use the generic sharing of structured constants (which include floats) to achieve the same effect.

@vicuna
Copy link
Author

vicuna commented Mar 3, 2014

Comment author: @alainfrisch

With the patch, all structured constants (including sub-terms) are given a globally accessible symbol. This can add many exported symbols to a given object. I don't think that it is an issue.

Well, actually, this could be problematic with flexdll (when compiled as a 32-bit program), which puts all exported symbols in one section, materialized internally with an OCaml string.

I've tested with LexiFi current code base, and we have an instance where we create a .cmxs file whose .exptbl section has 12Mb bytes (with the patch), the limit being 16Mb. I'm confident that flexlink can be adapted to support larger cases, but this will still materialize as runtime structures.

@vicuna
Copy link
Author

vicuna commented Mar 3, 2014

Comment author: @alainfrisch

The *_v5.diff patch adds a pass to determine which constant symbols need to be made global (i.e. they appear in the "approx" of the structure stored in the cmx). This reduces the size of executables. For instance, under Linux 64-bit, here is the size in bytes of ocamlc.opt:

trunk:  5 399 977
   v4:  5 444 491
   v5:  4 846 475

On LexiFi's example of a huge .cmxs file, we go from 177148 exported symbols (with v4) to just 29467 (with v5), and .exptbl section is now tiny (about 1.2 Mb).

This also shows that it might be worth doing the same with the global symbols created for OCaml functions (many of them might not be required to be global).

@vicuna
Copy link
Author

vicuna commented Mar 4, 2014

Comment author: @alainfrisch

There seems to be a bad interaction with the compilation of objects and classes. The following code:

module F(X : sig end) = struct
  class op = object end

  let () = print_endline "A1"
  let o1 = object inherit op initializer print_endline "A2" end
  let () = print_endline "A3"
  let o2 = object inherit op initializer print_endline "A4" end
  let () = print_endline "A5"
end

module A = F(struct end)

produces:

A1
A2
A3
A2
A5

Note that "A2" is printed instead of "A4".

@vicuna
Copy link
Author

vicuna commented Mar 5, 2014

Comment author: @alainfrisch

Regarding the problem mentioned in the previous note: in stdlib/camlinternalOO.ml, the function build_path returns a block "Cons(Obj.magic 0, Empty, Empty)", but this block is mutated. I'm uploading a new patch which prevents static allocation of this block.

The new patch (*_v6) also addresses #6337 (duplicated emission of constants).

I've tried this patch on LexiFi's code base, and I can confirm that no test is broken (which is how I detected the previous bug in camlinternalOO). I also observe a significant reduction in the memory allocated by our main application's processes (I'll try to come up with more precise measures, but this does not seem completely negligible).

@vicuna
Copy link
Author

vicuna commented Mar 5, 2014

Comment author: @gasche

I don't know much about camlinternalOO, but I suspect that the whole approach of having
external mut : tables -> mut_tables = "%identity"
(so the existing code, rather than what you added) may be a bad idea in the long run.

In Batteries we have code which I think was suggested by Jacques, that lets us build (immutable) lists by using mutable lists (with a mutable tail field storing an immutable list) as a temporary data structure:
https://github.com/ocaml-batteries-team/batteries-included/blob/master/src/batList.ml#L50

The way it works is that mutable lists are turned into immutable lists (to insert them at the end of a previous mutable list), but never the other way around. I always justified the safeness of this choice by saying that it would be robust to compiler improvement reasoning on mutability.

Would it be possible to rewrite the camlinternalOO code to get rid of "mut" altogether? Does it automatically enforce a programming style that is safe with respect to sharing of immutable constants?

I think this is an important example to get straightened out as it may be representative of code existing in the wild. Having clear guidelines on what kind of hacks are ok and which are broken or risk becoming broken tomorrow is important.

@vicuna
Copy link
Author

vicuna commented Mar 6, 2014

Comment author: @gasche

I had a try at a version of camlinternalOO that does not convert an immutable into a mutable one. I could not use a Batteries-like solution, because the code performs lookup on the immutable structure and wants to mutate the tail at the point of lookup.

The solution I experimented is to make the structure always mutable; the code compiles, passes the testsuite and Alain's example above, but I know little about camlinternalOO (in particular, why an immutable structure was used in the first place), so it's probably not correct -- and even if it was, it's unclear that it should be preferred to Alain's solution.

I checked the use of lookup_tables in translclass.ml, and it seems to remain correct with that implementation with no change (my understanding is that it only manipulates the first field of the record/constructor, not the second and third fields affected by the change).

@vicuna
Copy link
Author

vicuna commented Mar 6, 2014

Comment author: @alainfrisch

Hi Gabriel,

Thanks for looking at that. Disclaimer: I'm not familiar at all with the logic and data structures involved in camlinternalOo (if someone could take the time to understand and document the compilation/runtime support for classes, this would be great, btw!).

Clearly, the original code was wrong: turning an immutable value into a mutable one through the use of magic is not ok. The compiler should be allowed to share physically two immutable blocks with identical [initial] content. The current patch is about static allocation of such constants blocks outside the heap, and another possible optimization could be to lift such allocations (even of non-constant blocks) outside functions (e.g. at the toplevel of a functor's body).

Your solution is thus quite appealing, since it completely remove this cast from immutable to mutable. But are we certain that the data structure will never be marshaled/unmarshaled (which would break the physical equality check on the sentinel)? (I'm thinking about marshaling a first-class module obtained from a compilation unit.) If not, what about implementing the equality with empty by "let is_empty x = x == x.next"?

@vicuna
Copy link
Author

vicuna commented Mar 6, 2014

Comment author: @gasche

I hadn't thought about marshalling indeed; the cyclicity test would be fine; another solution I thought about is to use the invariant that only the last cell has (Obj.magic 0) as key, but in fact it looks like the cyclicity test results in simpler code.

Re. correctness, I guess we should ask/wait for Jacques' opinion as he's the author of this code.

@vicuna
Copy link
Author

vicuna commented Mar 6, 2014

Comment author: @alainfrisch

We could probably integrate the main patch first. The changes to camlInternalOO are certainly worth investigating, but this could probably be done directly on trunk.

(One way to gain some confidence that there is no other such occurrences would be to emit structured constants (except string literals) in a read-only section.)

@vicuna
Copy link
Author

vicuna commented Mar 6, 2014

Comment author: @alainfrisch

With the latest version, ocamlc.opt now has 4 669 190 bytes (against 5 399 977 on trunk). That's a non-negligible 15% gain.

@vicuna
Copy link
Author

vicuna commented Mar 6, 2014

Comment author: @alainfrisch

I've monitored the impact of the patch on the memory usage of the main processes of LexiFi's application (under Windows, looking at "Memory (Private Working Set)", as reported by Windows task manager). The patch reduces this amount by about 9% for the simpler processes (which are spawned and do nothing until requested; e.g. a "scheduler processes uses 32Mb with the patch, 35Mb without it).

@vicuna
Copy link
Author

vicuna commented Mar 6, 2014

Comment author: @alainfrisch

Last patch (v6) committed to trunk, rev 14444.

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @garrigue

I was just ignoring this would thread because I thought I was not involved :-)

Concerning camlinternalOO.ml, it was rather hackish from the beginning, but maybe I brought that to a new level. Fortunately, since this is part of OCaml itself, bugs can be discovered early. I'm a bit concerned by the possibility that others may be using the same trick.
What is implemented is basically a simple algorithm building a tree where each level dispatches on a different part of the key. Since we need to be able to add instances on the fly, this structure is mutable.
The fix seems fine. It would be possible to replace all uses of Cons by (demut {key=...}), without incurring any extra cost. Going further, one could just remove the definition of tables, and just use mut_tables, but this would require a little more care (we still want to use a null pointer for empty to avoid indirections).
Thinking of it, all this is related to Jean-Christophe Filliatre's talk at JFLA 2014.

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @alainfrisch

I've committed Gabriel's proposal with a sentinel (modified to use a cyclicity test) in camlinternalOO.

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @gasche

Heh. In the morning shuttle I wrote a different version of the patch with a null value as sentinel, as suggested by Jacques. I'm uploading it for the reference, but I think I prefer the current solution, as it does not use any magic.

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @gasche

Re. potential code breakage for other users of unsafe features: I think a nice thing to do would be to rebase the change against 4.01, so that users can easily check their codebase against it, without interference of other trunk features (eg. Coq doesn't compile against trunk currently).

Alain, do you think you could do that? Otherwise I'll perhaps give it a try later next week.

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @alainfrisch

Alain, do you think you could do that? Otherwise I'll perhaps give it a try later next week.

Honestly, if you could give it a try, that would be nice.

Coq doesn't compile against trunk currently

What's the problem?

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @garrigue

Was there a misunderstanding?
The fix I said was fine was the one originally committed, not Gabriel's (which I hadn't seen).
This code has to be efficient, and modifying it for pedantism is not a good idea.
If you're afraid about safety, please first read transclass.ml and understand everything in it.

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @alainfrisch

Jacques: do you think the current version is indeed observationally slower than the previous one? If yes, I'll happily revert, but it seems to me that avoiding unsafe hacks (such as converting an immutable datastructure to a mutable one) is actually quite useful (even more since I spent about one day tracking that problem).

If performance of this piece of code is actually critical, it might be worth considering as well other micro-optimizations, such as removing array bound checks, avoiding the closure allocation in lookup_keys (e.g. by calling lookup_keys instead of lookup_key, and checking the i < 0 condition before the call where i changes).

Do you have examples of OO code exercising intensively this part of the code?

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @alainfrisch

I've reverted the change in camlinternalOO, to be on the safe side.

@vicuna
Copy link
Author

vicuna commented Mar 7, 2014

Comment author: @alainfrisch

Just for the fun of it, a slightly more efficient version of camlinternalOO.lookup_keys:

let lookup_keys i keys tables =
  let tables = mut tables in
  let key = ref (Array.unsafe_get keys i) in
  let i = ref i in
  let tables = ref tables in
  while (
    if (!tables).key == !key then begin
      if !i = 0 then (tables := mut (!tables).data; false)
      else (decr i; key := Array.unsafe_get keys !i; tables := mut (!tables).data; true)
    end else
      let next = (!tables).next in
      if next <> Empty then (tables := mut next; true)
      else begin
        let next = Cons (!key, Empty, Empty) in
        (!tables).next <- next;
        tables := mut (build_path (!i-1) keys (mut next));
        false
      end
  )
  do () done;
  demut !tables

I get a 8% speedup on the following code:

module F(X : sig end) = struct
  class op = object end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
  let o = object inherit op end
end

let () =
  for i = 1 to 10000000 do
    let module A = F(struct end) in
    ()
  done

(Compiling camlinternalOO with a high inline factor instead of the current -inline 0 leads to a small extra improvement, but I assume there is a good reason to have -inline 0.)

@vicuna
Copy link
Author

vicuna commented Mar 8, 2014

Comment author: @garrigue

Interesting. This shows that tail recursive functions are not yet as fast as loops...
I'm not sure this 8% is worth the loss in maintainability, but you have a point.
I just tried to make it more readable, and this doesn't seem to introduce any
slowdown. It gets even a bit faster after inlining lookup_keys inside lookup_tables.
(To keep the logic simpler, it is a bit slower when adding new keys, but this shouldn't matter.)
So, if your note was not intended as a joke, this may be worth it.

By the way, the "-inline 0" is essential, so that code in this module does not get inlined in
other modules, which would cause a code blowup. Object-oriented code is already rather
space consuming. But this makes harder to write efficient code inside the module.

let lookup_tables root keys =
let root = mut root in
if root.data <> Empty then
let tables = ref (mut root.data) in
for i = Array.length keys - 1 downto 0 do
let key = Array.unsafe_get keys i in
while (!tables).key != key do
let next = (!tables).next in
if next != Empty then tables := mut next else
let next = Cons (key, Empty, Empty) in
(!tables).next <- next;
tables := mut next;
ignore (build_path (i-1) keys (mut next))
done;
tables := mut (!tables).data
done;
demut !tables
else
build_path (Array.length keys - 1) keys root

@vicuna
Copy link
Author

vicuna commented Mar 10, 2014

Comment author: @alainfrisch

So, if your note was not intended as a joke, this may be worth it.

I'll let you decide! I assume that OO-heavy code is usually not critical in terms of performance, but it's always good to optimize such common "infrastructure" code.

@vicuna
Copy link
Author

vicuna commented Mar 19, 2014

Comment author: @alainfrisch

Marking this as resolved. The discussion on optimizing camlinternalOO is independent of this ticket.

@vicuna
Copy link
Author

vicuna commented Mar 19, 2014

Comment author: markghayden

Thanks for your efforts on this! I'm looking forward to trying this. Did anyone have a chance to run the original test program submitted with the problem report against the patched version of Ocaml?

@vicuna
Copy link
Author

vicuna commented Mar 19, 2014

Comment author: @alainfrisch

The original test programs reports:

init alloced 83
int_pair alloced 0
string alloced 0
string_pair alloced 0
int_pair_record alloced 0
int_alpha_record alloced 0
string_record alloced 0
int_record_record alloced 0
int_record_pair alloced 0
int_record alloced 0
int_record_record alloced 0
int_record_list alloced 0
int_mut_record alloced 2
int_list alloced 0
int_pair_list alloced 0
int_pairf alloced 0
int_pair_identify alloced 0
big_tuple alloced 0
big_tuple_syms alloced 0
function_alias alloced 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants