New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
static data structures allocated on the GC heap #5779
Comments
Comment author: @damiendoligez This looks more like a problem with constant propagation: if I inline the names in your examples, almost everything becomes optimal. For example, I replace: When I do that for all expressions, the only ones that still allocate are: Of course, my transformation destroys sharing, so you should use it with caution. If you have large immutable data structures that you want to store out of the heap, you should probably have a look at the "ancient" library by Richard Jones: |
Comment author: markghayden Thanks for the comments. The problem is that I would like to have sharing in these structures. I guess Ancient is a possible approach... but I'd really like to just have the structures allocated as part of the executable. It makes like much simpler. Is there a reason shared structures cannot be allocated statically (not on the heap)? |
Comment author: @alainfrisch Most of the machinery is in place in the compiler to track "approximations" of values. This notion cannot currently capture references to statically allocated structured constants, but it doesn't look very difficult to extend it to do so. Anyway, comments are welcome on the patch (esp. whether it fully achieves the desired behavior or still miss cases). |
Comment author: @alainfrisch Note: inlined function calls are currently not turned into statically allocated constants. It's probably possible to do so by extending Closure.substitute to treat this case. |
Comment author: @alainfrisch A second version of the patch is attached. It addresses the case of inlined functions. With this patch, the following code does not allocate: let a = (0, 1) |
Comment author: @alainfrisch Ah ah, just spent 3 hours tracking a weird behavior! The optimization under discussion can change the sharing of values between native and bytecode, because of inlining. For instance, in btype.ml, we have: let newgenvar ?name () = newgenty (Tvar name) A call to "newgenvar ()" will be inlined into "newgenty (Tvar None)", with the structured constant allocated statically, meaning that it will be shared across all invocations of this piece of code. An unfortunate consequence is that ocamlc and ocamlc.opt could now produce different .cmi files (i.e. with different digest). This can be observed by compiling: let a = function The resulting .cmi file has two occurrences of Tvar None, they are shared when produced by ocamlc.opt (unless if we compile ocamlc.opt with -inline 0), but different when produced by ocamlc. This currently breaks the build system: It's not clear whether having exactly the same sharing semantics between the bytecode and the native code compiler is critical, but the example above shows that one needs to be careful about it. |
Comment author: @alainfrisch Attached a third version of the patch. It now defines proper datatypes to represent the immediate and structured constants in the native code compiler. The code is much cleaner (also compared to the current trunk, where the invariant on which kinds of constant can appear in clambda is burried in the code). All structured constants are bound to global symbols, including sub-terms. This is different from the previous behavior (only the root of structured constants where bound). The advantage is that it will enable more optimizations such as simplifying a field projection from a constant block (let a = ("foo", "bar");; let b = fst a). As a work-around for the problem reported in my previous note, there is a hack in subst.ml to normalize all occurrence of Tvar None (I did not check if other such constants could appear). Note: I haven't reproduced for now the sharing of string literals generated by translclass (and there might be a different in the sharing of floats as well). It could make sense to allow even more sharing of identical structured constants, but this would have the effect of changing even more the sharing semantics between native and bytecode. |
Comment author: @alainfrisch A key point to be discussed is whether we want to maintain as much as possible the same behavior w.r.t. physical equality / sharing between code generated by ocamlopt and by ocamlc, or if we are free to improve sharing in native code (and then, it might make sense to share all structurally equal literals as well in a given unit). |
Comment author: @alainfrisch Just uploaded *_v4.diff. This patch adds the sharing of structurally equal structured constant (in a single unit). Some special care is taken not to share string literals (except the empty one!), though. Indeed, there is no consensus amongst developers that it is right to do so (even though the manual explicitly says that the compiler can share string literals as if they are constants). The patch also adds constant propagation on field projection, so that code such as: let z = ("abc", "def") let swap z = (snd z, fst z) let a = swap (swap z) does not allocate. There is also an extra rule to compute statically the length of a known string so that the following does not allocate either (with a proper -inline directive): let foo s = (s, String.length s) let l = foo "abcd" I think the patch is ready for inclusion, and I'd like to get some feedback on it. Some points to be noted:
|
Comment author: @alainfrisch
Well, actually, this could be problematic with flexdll (when compiled as a 32-bit program), which puts all exported symbols in one section, materialized internally with an OCaml string. I've tested with LexiFi current code base, and we have an instance where we create a .cmxs file whose .exptbl section has 12Mb bytes (with the patch), the limit being 16Mb. I'm confident that flexlink can be adapted to support larger cases, but this will still materialize as runtime structures. |
Comment author: @alainfrisch The *_v5.diff patch adds a pass to determine which constant symbols need to be made global (i.e. they appear in the "approx" of the structure stored in the cmx). This reduces the size of executables. For instance, under Linux 64-bit, here is the size in bytes of ocamlc.opt: trunk: 5 399 977 v4: 5 444 491 v5: 4 846 475 On LexiFi's example of a huge .cmxs file, we go from 177148 exported symbols (with v4) to just 29467 (with v5), and .exptbl section is now tiny (about 1.2 Mb). This also shows that it might be worth doing the same with the global symbols created for OCaml functions (many of them might not be required to be global). |
Comment author: @alainfrisch There seems to be a bad interaction with the compilation of objects and classes. The following code: module F(X : sig end) = struct class op = object end let () = print_endline "A1" let o1 = object inherit op initializer print_endline "A2" end let () = print_endline "A3" let o2 = object inherit op initializer print_endline "A4" end let () = print_endline "A5" end module A = F(struct end) produces: A1 A2 A3 A2 A5 Note that "A2" is printed instead of "A4". |
Comment author: @alainfrisch Regarding the problem mentioned in the previous note: in stdlib/camlinternalOO.ml, the function build_path returns a block "Cons(Obj.magic 0, Empty, Empty)", but this block is mutated. I'm uploading a new patch which prevents static allocation of this block. The new patch (*_v6) also addresses #6337 (duplicated emission of constants). I've tried this patch on LexiFi's code base, and I can confirm that no test is broken (which is how I detected the previous bug in camlinternalOO). I also observe a significant reduction in the memory allocated by our main application's processes (I'll try to come up with more precise measures, but this does not seem completely negligible). |
Comment author: @gasche I don't know much about camlinternalOO, but I suspect that the whole approach of having In Batteries we have code which I think was suggested by Jacques, that lets us build (immutable) lists by using mutable lists (with a mutable tail field storing an immutable list) as a temporary data structure: The way it works is that mutable lists are turned into immutable lists (to insert them at the end of a previous mutable list), but never the other way around. I always justified the safeness of this choice by saying that it would be robust to compiler improvement reasoning on mutability. Would it be possible to rewrite the camlinternalOO code to get rid of "mut" altogether? Does it automatically enforce a programming style that is safe with respect to sharing of immutable constants? I think this is an important example to get straightened out as it may be representative of code existing in the wild. Having clear guidelines on what kind of hacks are ok and which are broken or risk becoming broken tomorrow is important. |
Comment author: @gasche I had a try at a version of camlinternalOO that does not convert an immutable into a mutable one. I could not use a Batteries-like solution, because the code performs lookup on the immutable structure and wants to mutate the tail at the point of lookup. The solution I experimented is to make the structure always mutable; the code compiles, passes the testsuite and Alain's example above, but I know little about camlinternalOO (in particular, why an immutable structure was used in the first place), so it's probably not correct -- and even if it was, it's unclear that it should be preferred to Alain's solution. I checked the use of lookup_tables in translclass.ml, and it seems to remain correct with that implementation with no change (my understanding is that it only manipulates the first field of the record/constructor, not the second and third fields affected by the change). |
Comment author: @alainfrisch Hi Gabriel, Thanks for looking at that. Disclaimer: I'm not familiar at all with the logic and data structures involved in camlinternalOo (if someone could take the time to understand and document the compilation/runtime support for classes, this would be great, btw!). Clearly, the original code was wrong: turning an immutable value into a mutable one through the use of magic is not ok. The compiler should be allowed to share physically two immutable blocks with identical [initial] content. The current patch is about static allocation of such constants blocks outside the heap, and another possible optimization could be to lift such allocations (even of non-constant blocks) outside functions (e.g. at the toplevel of a functor's body). Your solution is thus quite appealing, since it completely remove this cast from immutable to mutable. But are we certain that the data structure will never be marshaled/unmarshaled (which would break the physical equality check on the sentinel)? (I'm thinking about marshaling a first-class module obtained from a compilation unit.) If not, what about implementing the equality with empty by "let is_empty x = x == x.next"? |
Comment author: @gasche I hadn't thought about marshalling indeed; the cyclicity test would be fine; another solution I thought about is to use the invariant that only the last cell has (Obj.magic 0) as key, but in fact it looks like the cyclicity test results in simpler code. Re. correctness, I guess we should ask/wait for Jacques' opinion as he's the author of this code. |
Comment author: @alainfrisch We could probably integrate the main patch first. The changes to camlInternalOO are certainly worth investigating, but this could probably be done directly on trunk. (One way to gain some confidence that there is no other such occurrences would be to emit structured constants (except string literals) in a read-only section.) |
Comment author: @alainfrisch With the latest version, ocamlc.opt now has 4 669 190 bytes (against 5 399 977 on trunk). That's a non-negligible 15% gain. |
Comment author: @alainfrisch I've monitored the impact of the patch on the memory usage of the main processes of LexiFi's application (under Windows, looking at "Memory (Private Working Set)", as reported by Windows task manager). The patch reduces this amount by about 9% for the simpler processes (which are spawned and do nothing until requested; e.g. a "scheduler processes uses 32Mb with the patch, 35Mb without it). |
Comment author: @alainfrisch Last patch (v6) committed to trunk, rev 14444. |
Comment author: @garrigue I was just ignoring this would thread because I thought I was not involved :-) Concerning camlinternalOO.ml, it was rather hackish from the beginning, but maybe I brought that to a new level. Fortunately, since this is part of OCaml itself, bugs can be discovered early. I'm a bit concerned by the possibility that others may be using the same trick. |
Comment author: @alainfrisch I've committed Gabriel's proposal with a sentinel (modified to use a cyclicity test) in camlinternalOO. |
Comment author: @gasche Heh. In the morning shuttle I wrote a different version of the patch with a null value as sentinel, as suggested by Jacques. I'm uploading it for the reference, but I think I prefer the current solution, as it does not use any magic. |
Comment author: @gasche Re. potential code breakage for other users of unsafe features: I think a nice thing to do would be to rebase the change against 4.01, so that users can easily check their codebase against it, without interference of other trunk features (eg. Coq doesn't compile against trunk currently). Alain, do you think you could do that? Otherwise I'll perhaps give it a try later next week. |
Comment author: @alainfrisch
Honestly, if you could give it a try, that would be nice.
What's the problem? |
Comment author: @garrigue Was there a misunderstanding? |
Comment author: @alainfrisch Jacques: do you think the current version is indeed observationally slower than the previous one? If yes, I'll happily revert, but it seems to me that avoiding unsafe hacks (such as converting an immutable datastructure to a mutable one) is actually quite useful (even more since I spent about one day tracking that problem). If performance of this piece of code is actually critical, it might be worth considering as well other micro-optimizations, such as removing array bound checks, avoiding the closure allocation in lookup_keys (e.g. by calling lookup_keys instead of lookup_key, and checking the i < 0 condition before the call where i changes). Do you have examples of OO code exercising intensively this part of the code? |
Comment author: @alainfrisch I've reverted the change in camlinternalOO, to be on the safe side. |
Comment author: @alainfrisch Just for the fun of it, a slightly more efficient version of camlinternalOO.lookup_keys: let lookup_keys i keys tables = let tables = mut tables in let key = ref (Array.unsafe_get keys i) in let i = ref i in let tables = ref tables in while ( if (!tables).key == !key then begin if !i = 0 then (tables := mut (!tables).data; false) else (decr i; key := Array.unsafe_get keys !i; tables := mut (!tables).data; true) end else let next = (!tables).next in if next <> Empty then (tables := mut next; true) else begin let next = Cons (!key, Empty, Empty) in (!tables).next <- next; tables := mut (build_path (!i-1) keys (mut next)); false end ) do () done; demut !tables I get a 8% speedup on the following code: module F(X : sig end) = struct class op = object end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end let o = object inherit op end end let () = for i = 1 to 10000000 do let module A = F(struct end) in () done (Compiling camlinternalOO with a high inline factor instead of the current -inline 0 leads to a small extra improvement, but I assume there is a good reason to have -inline 0.) |
Comment author: @garrigue Interesting. This shows that tail recursive functions are not yet as fast as loops... By the way, the "-inline 0" is essential, so that code in this module does not get inlined in let lookup_tables root keys = |
Comment author: @alainfrisch
I'll let you decide! I assume that OO-heavy code is usually not critical in terms of performance, but it's always good to optimize such common "infrastructure" code. |
Comment author: @alainfrisch Marking this as resolved. The discussion on optimizing camlinternalOO is independent of this ticket. |
Comment author: markghayden Thanks for your efforts on this! I'm looking forward to trying this. Did anyone have a chance to run the original test program submitted with the problem report against the patched version of Ocaml? |
Comment author: @alainfrisch The original test programs reports:init alloced 83
|
Original bug ID: 5779
Reporter: markghayden
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2015-12-11T18:25:53Z)
Resolution: fixed
Priority: normal
Severity: minor
Version: 4.00.0
Target version: 4.01.1+dev
Fixed in version: 4.02.0+dev
Category: back end (clambda to assembly)
Tags: patch
Related to: #6337
Monitored by: @gasche meurer @ygrek @jmeber @hcarty @avsm @yakobowski @alainfrisch
Bug description
I have fairly large program (120k LOC) with significant amounts of static data structures. For example, large lists of records containing information for rules tables. These are all read-only structures (no mutable records, no arrays). I'd like these in static sections so they don't consume as much memory, have to be allocated and initialized, and incur ongoing overhead from GC scanning them. My software runs on mobile devices (iPads), which can often be quite memory constrained; whatever I can do to reduce memory consumption is helpful.
The Ocaml compiler seem to be fairly brittle about determining what should be allocated in static .data sections and what will be allocated and placed on the heap at run time. For example, a simple non-mutable record does is static. A non-mutable record included in a list or another non-mutable record does cause allocation on the heap. It seems references to staticly allocated objects within a second object causes the latter to be allocated on the heap.
Can the predicate for what is allocated staticly (not on the heap) be improved to capture a more complete set of cases? Eg ints, strings, floats, lists, non-mutable records, disjoint unions of the preceding. And have that documented?
Also, would be nice if these didn't require heap allocation:
Steps to reproduce
I've attached a short program that declares various structures and monitors which one caused allocation by using Gc.stat. Most of the cases where allocation occurs are avoidable. A few (records with mutable fields) probably require allocation in some cases. I've only compiled the program using native code & for MacOS x64 but I imagine most of this is machine independent.
File attachments
The text was updated successfully, but these errors were encountered: