Minor GC can take several minutes to complete #7433

Closed
vicuna opened this issue Dec 16, 2016 · 5 comments


vicuna commented Dec 16, 2016

Original bug ID: 7433
Reporter: joris
Status: resolved (set by @xavierleroy on 2017-02-19T17:08:16Z)
Resolution: suspended
Priority: normal
Severity: minor
Platform: linux
Category: runtime system and C interface
Monitored by: @ygrek @yakobowski

Bug description

When the heap becomes quite large, compaction can be very slow, taking several minutes. One workaround is to increase max_overhead (in Gc.control), or to trigger compaction manually at times when stalling the program for several minutes is acceptable.
With that workaround in place, however, another issue emerges: in some cases, a minor GC can take several minutes to complete.

For instance, here is an example of a heap where the issue arises:

[2016-12-16T01:38:41.4598] 39191:0 [memory:info] GC: Heap: 66.4GB (max 66.4GB, chunks 8414) Counters(mi,pr,ma): 54.0TB 2.0TB 3.2TB Collections(mv,ma,mi): 7 1607 14680433
[2016-12-16T01:38:42.9726] 39191:0 [memory:info] VM: rss 68.7GB, vsz 72.5GB, swap 0B, maps 332. MALLOC: size 72.1GB, used 68.9GB, free 1.8GB

In this case, the process became stuck for 6 minutes performing a minor collection:

  7 caml_fl_allocate caml_alloc_shr caml_oldify_one caml_oldify_mopup caml_empty_minor_heap caml_minor_collection caml_make_vect Array.init Tableq.check Tableq.setup_writer List.iter Tableq.start Supertable.on_new_txn Supertable.master_handle Messaging.callback Messaging.#1645
  3 caml_fl_add_blocks caml_alloc_shr caml_oldify_one caml_oldify_mopup caml_empty_minor_heap caml_minor_collection caml_make_vect Array.init Tableq.check Tableq.setup_writer List.iter Tableq.start Supertable.on_new_txn Supertable.master_handle Messaging.callback Messaging.#1645

With the following parameters:
Minor heap size: 64MB
Compaction: automatic compaction disabled, performed manually once an hour
Space overhead: 50
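
For readers who want to try a comparable setup, the parameters above could be expressed through the standard Gc module roughly as follows. This is only a sketch: the helper names are hypothetical, the word-size conversion assumes the usual relation between Sys.word_size and bytes, and the reporter's actual configuration code is not shown in this report.

```ocaml
(* Hypothetical helper: Gc.control sizes are expressed in words, not bytes. *)
let words_of_megabytes mb = mb * 1024 * 1024 / (Sys.word_size / 8)

(* Sketch of a startup configuration mirroring the reported parameters. *)
let configure_gc () =
  Gc.set
    { (Gc.get ()) with
      Gc.minor_heap_size = words_of_megabytes 64;
      Gc.space_overhead = 50;
      (* Per the Gc documentation, max_overhead >= 1000000 means that
         compaction is never triggered automatically. *)
      Gc.max_overhead = 1_000_000 }

(* To be called from some hourly timer (not shown), at a moment when a
   long pause is acceptable. *)
let manual_compaction () = Gc.compact ()

let () = configure_gc ()
```

How manual_compaction gets scheduled (timer thread, external signal, and so on) is left open here, since the report does not say.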

Steps to reproduce

The information provided here is probably not enough to reproduce the issue. If you have suggestions on how to gather more detailed information, I can run some tests.


vicuna commented Dec 16, 2016

Comment author: @gasche

If you use 4.03 or above, you can enable GC instrumentation by compiling your program with the "-runtime-variant i" parameter and running it with the environment variable OCAML_INSTR_FILE being set to the path of a log file in which to write instrumentation logs.


vicuna commented Dec 16, 2016

Comment author: @lefessan

If you disable compaction, the major heap is going to be fragmented, with a lot of small unusable blocks in the free list. As a consequence, when a block is promoted from the minor heap to the major heap, it takes more time to find a suitable block in the free list. That might explain this problem.

I don't remember whether Damien implemented the idea of having several free lists sorted by size in 4.04. Did he?
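
One way to check the fragmentation hypothesis above is to look at the free-list counters exposed by Gc.stat. The following is a minimal sketch, with free_list_summary being a hypothetical helper name. Note that Gc.stat traverses the whole major heap, so on a heap of tens of gigabytes the call itself is expensive and should be made sparingly.

```ocaml
(* Summarise the state of the major-heap free list.
   A large free_blocks count combined with a small average block size and
   a small largest_free would be consistent with heavy fragmentation. *)
let free_list_summary () =
  let s = Gc.stat () in
  let avg_free_block =
    if s.Gc.free_blocks = 0 then 0.
    else float_of_int s.Gc.free_words /. float_of_int s.Gc.free_blocks
  in
  Printf.sprintf
    "free_words=%d free_blocks=%d largest_free=%d fragments=%d avg_free_block=%.1f words"
    s.Gc.free_words s.Gc.free_blocks s.Gc.largest_free s.Gc.fragments
    avg_free_block

let () = print_endline (free_list_summary ())
```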


vicuna commented Jan 10, 2017

Comment author: joris

Thank you for your suggestion. Sorry, I haven't had time to look at this issue yet, and I need to find a proper way to test this without killing production machines.


vicuna commented Feb 19, 2017

Comment author: @xavierleroy

Also, if compaction is disabled, switching the allocation policy to first-fit could help. (See module Gc, record type "control", field "allocation_policy".)

Until you've gathered more data, I'm setting this issue to "suspended".
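
For reference, the allocation-policy switch mentioned above could be sketched as follows. The helper name set_allocation_policy is hypothetical. According to the Gc module documentation, policy 0 is next-fit and policy 1 is first-fit. Whether the switch helps depends on the workload, so it is best tried on a non-production instance first.

```ocaml
(* Change only the allocation policy, leaving all other GC settings as is.
   0 = next-fit, 1 = first-fit (per the Gc module documentation). *)
let set_allocation_policy policy =
  Gc.set { (Gc.get ()) with Gc.allocation_policy = policy }

let () = set_allocation_policy 1
```

The same setting can also be given at startup without recompiling, through the a=1 entry of the OCAMLRUNPARAM environment variable.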

@vicuna vicuna closed this as completed Feb 19, 2017

vicuna commented Apr 11, 2017

Comment author: @ygrek

For the record, switching to allocation_policy=1 made things noticeably worse - workers get stuck in caml_empty_minor_heap -> caml_fl_allocate for an hour and longer (probably forever). We didn't investigate further and switched back to allocation_policy=0
