Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0005389OCamlOCaml generalpublic2011-10-28 03:462014-11-18 17:06
ReporterMartin Jambon 
Assigned Todoligez 
PrioritynormalSeveritymajorReproducibilityalways
StatusresolvedResolutionfixed 
PlatformOSOS Version
Product Version3.12.1 
Target Version4.00.0+devFixed in Version4.00.0+dev 
Summary0005389: GC fails to collect large array
DescriptionA large array initialized with zeros fails to be collected after a call to Gc.compact. However a string or a bigarray of the same physical size are collected. This behavior is observed on amd64 Linux with OCaml 3.11.2 and 3.12.1.

See attached code.

In practice we are experiencing this problem with a large read-only hash table loaded from a file using Marshal and copied using Ancient in order to increase performance. The old copy produced by Marshal is not reclaimed, which doubles the expected memory usage.
TagsNo tags attached.
Attached Files? file icon test_gc.ml [^] (988 bytes) 2011-10-28 03:46 [Show Content]
? file icon test_gc_workaround.ml [^] (1,484 bytes) 2011-10-28 22:35 [Show Content]
? file icon Gcfail.ml [^] (445 bytes) 2012-03-09 16:48 [Show Content]
? file icon compact_problem.ml [^] (601 bytes) 2012-04-16 16:17 [Show Content]
patch file icon compact-5389.patch [^] (3,278 bytes) 2012-04-16 16:24 [Show Content]

- Relationships
related to 0005757closeddoligez GC compaction bug 

-  Notes
(0006188)
lefessan (developer)
2011-10-28 09:41

Actually, the GC collects the block, but fails to free the chunk during compaction. The reason is simple: it is the first chunk in the list of chunks (because Linux gives it the smallest address in memory), and the compaction algorithm never frees the first chunk. Clearly, the compaction algorithm should use a different strategy to free chunks: it should first sort chunks by the increasing probability that a chunk be freed during compaction, so that it would not use a chunk that is free at 99% as the target for copying the other chunks...
(0006189)
lefessan (developer)
2011-10-28 10:35

Another (simpler) solution would be to truncate the chunk when it is clearly much bigger than necessary.
(0006190)
ygrek (reporter)
2011-10-28 11:21

I am seeing similar behaviour quite often, i.e. after some heap growing and shrinking Gc reports small overall heap size while OS reports large chunks allocated for process. but never got to investigate it closely, thanks a lot for your explanation. Looking forward for a patch.
(0006191)
Martin Jambon (reporter)
2011-10-28 22:36

I found a workaround which may or may not work in other situations. It consists in allocating a string of about the same length in bytes (multiply array length by 8 or 4) after having allocated the array and before the garbage collector performs any compaction.

See alloc_array_fix and without_gc_compaction in the newly uploaded program test_gc_workaround.ml.
(0006193)
lefessan (developer)
2011-10-29 20:05

Your "workaround" will probably lead to a memory leak sooner or later. The difference between arrays and strings is not that the string is freed, and the array is not, the real difference is that strings are not initialized, and thus, they are not completely allocated by the kernel (pages in the middle will be mapped on demand). What you see in your workaround is that the chunk containing the array is reclaimed by the compaction, but the chunk containing the string (almost the same size) is not. Since the string is not initialized, it looks like everything has been collected, but it is not the case, and as the chunk will be progressively used for other data, pages will be mapped by the kernel, and the memory footprint of your program is going to grow until the size of chunk (1GB). However, this might happen slowly enough that it won't bother your application.
(0006194)
Martin Jambon (reporter)
2011-10-30 08:08

Thanks for the explanation.

Is there a document that describes how OCaml's GC works and that would help me understand the problem better and find a workaround?

Alternatively, are chunks created as large as needed, rounded up to the nearest power of two (here 1GB)? What is the smallest chunk size? Can I artificially create a small first chunk?

I am thinking that creating a list that uses at least as many bytes as my array would do the trick (i.e. allocating a large array, allocating a long list then compacting) because it would have to create a new chunk but not a huge one. It seems to work in my naive test. Is it supposed to?
(0006195)
lefessan (developer)
2011-10-30 18:51

Unfortunately, there is no public description of OCaml garbage collection algorithm, probably because it has been tuned so much that a complete description would not be easier to read than the sources... ;-)

If your goal is to use ancient to move the array outside the heap, you might consider allocating outside the heap in the first place.
(0006448)
protz (manager)
2011-12-21 14:27

Fabrice, can I close this as "not a problem", or do you still think something ought to be done about this? http://caml.inria.fr/mantis/view.php?id=5389#c6188 [^] implies that the GC could be smarter, but http://caml.inria.fr/mantis/view.php?id=5389#c6195 [^] implies that Martin shouldn't do this in the first place :)
(0006452)
ygrek (reporter)
2011-12-21 14:43

I for one would suggest that the first chunk should be truncated. Otherwise for the user it looks like the program eats much more memory than it really uses.
(0006464)
lefessan (developer)
2011-12-21 16:22

Yes, this PR should be kept as at least a feature wish, with "need to truncate the first chunk after compaction".
(0007033)
varobert (reporter)
2012-03-09 16:56
edited on: 2012-03-09 16:56

I stumbled upon that bug today as I was fuzz-testing my application. Basically I would get some length from the fuzzed input and allocate a string of that size. This could go up to 2^31-ish in my runs.

This string is just used to do some comparison up to a few characters, then completely discarded (yes, what a waste to allocate it in the first place!).

http://caml.inria.fr/mantis/file_download.php?file_id=615&type=bug [^] < This shows exactly what I'm doing. The output shows pretty clearly what is happening:

268435456
Before: 60525056, 33555710
 After: 60398080, 1299

536870912
Before: 181194240, 67110142
 After: 120796160, 1299

1073741824
Before: 362388480, 134219006
 After: 241592320, 1299

[...]

You can see that the difference in the major heap's size between "before" and "after" compaction is equal to its previous size:
181 194 240 - 120 796 160 = 60 398 080 (exactly the size after compaction of the previous iteration)

Since I run several of these fuzz-tests in parallel, my OS is painfully DOS-ed by those OCaml programs claiming to need ~4GB of the RAM-cake each!

(0007134)
gerd (reporter)
2012-03-23 14:54
edited on: 2012-03-23 14:55

varobert: a possible workaround exists (not really nice, but should solve your problem). The idea is to allocate a large buffer as bigarray (which is malloced memory, and can even be reused in each iteration), and then initialize the bigarray so it looks like a string.

Ocamlnet contains the required C functions for this (feel free to copy them):

let buffer = Netsys_mem.alloc_memory_pages (n+64)
let () = Netsys_mem.value_area buffer
let (voffset,bytelen) = Netsys_mem.init_string buffer 0 n
let s = (Netsys_mem.as_value buffer voffset : string)

Now s is a string outside Ocaml's heap, and you have detailed control about memory management. The only "danger" here is that the bigarray is deallocated when buffer gets unreachable, but s is still used. You can avoid that by always passing (buffer,s) to any function.

For questions about using that, please contact me directly (gerd (at) gerd-stolpmann.de)

(0007138)
frisch (developer)
2012-03-23 17:51

> You can avoid that by always passing (buffer,s) to any function.

Are you sure this is enough? In native code, the garbage collector knows, for each point in code where it can be triggered, which are the live values (used in the rest of the function). So you need to make sure that buffer is explicitly used "at the end" of the function, to be sure it won't be collected.

It seems very easy to forget about it and get some very rare bug (difficult to track).
(0007161)
doligez (administrator)
2012-03-26 12:32

lefessan: Truncating the chunk is not an option, because it is malloced memory, and realloc may move the block when resizing it.

ygrek: If the overall heap size is small, then the OCaml GC has already freed the memory and it's your libc that fails to give it back to the OS. There's nothing much we can do in that case, short of reimplementing our own malloc.

Martin Jambon: Fabrice is right, your workaround is likely to fail in long-running programs. You cannot create a small first chunk because we don't control the placement of the chunks in memory.

protz and lefessan: don't close this PR. I consider this a performance bug, not a mere feature wish.

I think I have a solution, but it involves changes to the compaction algorithm that will break one undocumented invariant: as it is now, the compaction doesn't change the relative order of blocks in memory; after the change, it will. I know at least one library out there relies on the relative order of pointers. However foolish that is, I want to give the authors advance warning about this change.

The only problem is, I don't remember who is doing that...
(0007358)
doligez (administrator)
2012-04-16 16:21

I couldn't reproduce the problem with either test_gc.ml or Gcfail.ml, but I managed to trigger it with a small synthetic example (compact_problem.ml).

I have a patch that seems to solve the problem. This is how it works:
When the heap is still too large after a compaction, we allocate one chunk of the desired heap size and re-do a compaction that will move all data into this chunk and free the over-large chunk.

This patch is only lightly tested for the moment. If you guys could try it out, that would be great.
(0007364)
varobert (reporter)
2012-04-17 09:28

268435456
Before: 60525056, 33554532
 After: 126976, 98

536870912
Before: 120923136, 67108964
 After: 126976, 98

1073741824
Before: 241719296, 134217828
 After: 126976, 98

It indeed solves my issue with Gcfail.ml (in both 3.12.1 and 4.01.0+dev1_2012-03-31).
(0007365)
doligez (administrator)
2012-04-17 10:23

Patch applied to 4.00 (commit 12364) and trunk (12365).
Still waiting for more testing before closing this PR.
(0007367)
ygrek (reporter)
2012-04-17 12:48
edited on: 2012-04-17 14:38

Is it expected to work on 3.11? I've applied this patch on 3.11.2 and observe crash in do_compaction (second invocation) in large application (with some potentially buggy C bindings, so I am not absolutely sure that this patch is a reason, maybe just a trigger)..

PS patch doesn't update caml_stat_heap_chunks - is it ok?

(0007369)
hcarty (reporter)
2012-04-17 18:34

I'm able to build trunk and the problem is fixed for me there, but version/4.00 does not build for me.

make[2]: Entering directory `/home/hcarty/ocamlbrew/ocaml-4.00.0/build/build/otherlibs/graph'
make[2]: *** No rule to make target `/opt/local/include/X11/Xlib.h', needed by `open.o'. Stop.
(0007482)
doligez (administrator)
2012-05-31 14:18

About version/4.00: this is an independent problem that was fixed (PR#5582).

About caml_stat_heap_chunks: well spotted. I've fixed this in 4.00 (commit 12524) and trunk (commit 12525).

About 3.11: I have no idea whether it might work with this patch.

- Issue History
Date Modified Username Field Change
2011-10-28 03:46 Martin Jambon New Issue
2011-10-28 03:46 Martin Jambon File Added: test_gc.ml
2011-10-28 09:41 lefessan Note Added: 0006188
2011-10-28 10:35 lefessan Note Added: 0006189
2011-10-28 11:21 ygrek Note Added: 0006190
2011-10-28 22:35 Martin Jambon File Added: test_gc_workaround.ml
2011-10-28 22:36 Martin Jambon Note Added: 0006191
2011-10-29 20:05 lefessan Note Added: 0006193
2011-10-30 08:08 Martin Jambon Note Added: 0006194
2011-10-30 18:51 lefessan Note Added: 0006195
2011-12-21 14:27 protz Note Added: 0006448
2011-12-21 14:43 ygrek Note Added: 0006452
2011-12-21 16:22 lefessan Note Added: 0006464
2011-12-21 16:47 doligez Assigned To => doligez
2011-12-21 16:47 doligez Status new => assigned
2012-03-09 16:48 varobert File Added: Gcfail.ml
2012-03-09 16:56 varobert Note Added: 0007033
2012-03-09 16:56 varobert Note Edited: 0007033 View Revisions
2012-03-23 14:54 gerd Note Added: 0007134
2012-03-23 14:55 gerd Note Edited: 0007134 View Revisions
2012-03-23 17:51 frisch Note Added: 0007138
2012-03-26 12:32 doligez Note Added: 0007161
2012-04-16 16:17 doligez File Added: compact_problem.ml
2012-04-16 16:21 doligez Note Added: 0007358
2012-04-16 16:24 doligez File Added: compact-5389.patch
2012-04-16 16:26 doligez Status assigned => feedback
2012-04-16 16:26 doligez Target Version => 4.00.0+dev
2012-04-17 09:28 varobert Note Added: 0007364
2012-04-17 10:23 doligez Note Added: 0007365
2012-04-17 10:23 doligez Fixed in Version => 4.00.0+dev
2012-04-17 10:23 doligez Resolution open => fixed
2012-04-17 12:48 ygrek Note Added: 0007367
2012-04-17 14:38 ygrek Note Edited: 0007367 View Revisions
2012-04-17 18:34 hcarty Note Added: 0007369
2012-05-31 14:18 doligez Note Added: 0007482
2012-05-31 14:19 doligez Status feedback => resolved
2014-11-18 17:06 doligez Relationship added related to 0005757


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker