Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in ocaml runtime with backtraces+threads #7220

Closed
vicuna opened this issue Apr 11, 2016 · 17 comments
Closed

Memory leak in ocaml runtime with backtraces+threads #7220

vicuna opened this issue Apr 11, 2016 · 17 comments
Assignees
Milestone

Comments

@vicuna
Copy link

vicuna commented Apr 11, 2016

Original bug ID: 7220
Reporter: robhoes
Assigned to: @gasche
Status: closed (set by @xavierleroy on 2017-09-24T15:32:22Z)
Resolution: fixed
Priority: normal
Severity: minor
Platform: amd64
OS: Linux, Mac OS X
Version: 4.02.3
Target version: 4.03.0+dev / +beta1
Fixed in version: 4.03.0+dev / +beta1
Category: runtime system and C interface
Related to: #4702 #5188
Monitored by: lindig braibant @hcarty @dbuenzli

Bug description

An OCaml program with backtrace recording enabled leaks memory each time a new thread is created and completes.

Steps to reproduce

The following example program, compiled with the -g option, will very quickly eat up all your memory:

let _ =
Printexc.record_backtrace true;
let rec loop () =
let t = Thread.create (fun () ->
try
raise Not_found
with Not_found ->
()
) () in
Thread.join t;
loop ()
in
loop ()

Additional information

I’ve tried this on various compilers (4.02.3, 4.01.0 and 3.12.1), on Linux and Mac, native and bytecode, and see the problem on all of them.

Valgrind reports that the allocation of caml_backtrace_buffer in the function caml_stash_backtrace in asmrun/backtrace_prim.c is the culprit (https://github.com/ocaml/ocaml/blob/trunk/asmrun/backtrace_prim.c#L98):

if (caml_backtrace_buffer == NULL) {
Assert(caml_backtrace_pos == 0);
caml_backtrace_buffer = malloc(BACKTRACE_BUFFER_SIZE
* sizeof(backtrace_slot));
if (caml_backtrace_buffer == NULL) return;
}

The caml_backtrace_buffer variable is a global variable initialised to NULL in byterun/backtrace.c. However, it seems to be local to a thread (how does it become thread local?), is set to NULL whenever a thread is created (where?), and a new malloc happens when an exception is raised. It is not freed when the thread completes.

File attachments

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: @gasche

I'm tentatively setting the target for 4.03; we're very late in the release cycle, so if the bugfix turns out to be invasive we may have to delay it for the next release, but I hope there is a simple and safe solution.

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: bobot

how does it become thread local?
You can take a look at otherlibs/thread/scheduler.c function schedule_thread.

It is not free when the thread completes
It should be in the function thread_kill:

  if (th->backtrace_buffer != NULL) {
    free(th->backtrace_buffer);
    th->backtrace_buffer = NULL;
  }

The problem seems more complicated.

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: robhoes

Thanks, I was grepping for caml_backtrace_buffer and didn't notice the #define backtrace_buffer.

I suppose that the thread_kill function corresponds to the OCaml function Thread.kill, and is not called when the thread exits otherwise.

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: @gasche

I looked at the issue a bit as well. I don't know anything about threads, but my understanding is that the code in
otherlibs/threads/scheduler.c corresponds to vmthreads.cm(x)a
otherlibs/systhreads/st_stubs.c corresponds to threads.cm(x)a
in particular, the problem that is being observed here is in the interaction between backtrace_prim.c (in either byterun or asmrun) and systhreads/st_stubs.c, rather than scheduler.c -- but we should also fix scheduler.c.

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: @gasche

Here is my understanding of what is happening right now.

caml_thread_{enter,leave}_blocking_section follow a bracketing discipline where

  • thread_enter_blocking_section will save global resources into
    a thread-specific storage space (the caml_thread_struct),
    including caml_backtrace_buffer and related backtrace-related globals
  • thread_leave_blocking_section will restore the global resources
    from the thread-specific storage

st_stubs.c also has a "master runtime lock" that protects the global
data of the scheduler (list of current threads, etc.). This lock must
be held at thread creation, deletion, scheduling time. It is also held
as long as a caml thread is still running, that is outside the
blocking section: it is acquired by
caml_thread_leave_blocking_section, and released by
caml_thread_enter_blocking_section.

I will call "thread-specific backtrace buffer" the value of the
thread-specific stored backtrace buffer, and "global backtrace buffer"
the global caml_backtrace_buffer value.

Upon creating a new thread, the thread-specific backtrace buffer is
set to NULL. Upon freeing a thread, it is freed if not NULL
anymore.

The memory leak happens as the following events happen in order
in your reproduction case, when a thread is created and then joined
from the "main thread":

  1. thread creation: the new thread is created with its thread-specific
    buffer set to NULL

  2. thread join (into the scheduler): the main thread releases control, enters the
    blocking section, storing the current global backtrace buffer as
    its thread-specific buffer.

  3. thread join (out of the scheduler): the new thread gains control,
    leaves the blocking section, restoring its current thread-specific
    buffer (NULL) into the global backtrace buffer.

  4. An exception is raised with backtrace enabled; the runtime tries to
    collect the backtrace into the global backtrace buffer, notices that it is NULL,
    and malloc() a new backtrace buffer in it. The thread-specific buffer value is
    not changed, it is still NULL.

  5. When a thread terminate, caml_thread_stop() is called (see
    caml_thread_start declaration for how this is
    implemented). caml_thread_stop calls caml_thread_remove_info which
    cleanups the thread-local structure; but the thread-local backtrace
    buffer of the new thread is still NULL, so nothing happens. Control
    is given back to the scheduler by releasing the master runtime
    lock, without calling caml_thread_enter_blocking_section (this
    would be invalid as the thread-specific storage does not exist, and
    also we must leave in the same state we were when we started the
    thread, and that is just after calling
    caml_thread_leave_blocking_section from the main thread).

The memory leak happens between steps (3) and (4): a new backtrace
buffer has been allocated by the backtrace code, but the thread-local
structure does not know about it so it is not properly cleaned-up when
stopping the thread.

(Possible ideas on how to fix the leak in a separate message.)

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: @gasche

Possible fixes:

  1. At the time of stopping a thread, store the current global
    backtrace buffer in the thread-local buffer, so that it gets
    cleaned up correctly by caml_thread_remove_info.

    I'm not fond of this approach because this is a hardcoded fix for
    a specific problem that doesn't feel right in the code, we may be
    forgetting about other similar issues or other instances of the
    problem.

  2. Add a hook to the runtime so that when a new backtrace is
    allocated, the thread-local storage is updated correspondingly.

    This is an invasive change and it risks degrading the performance
    of the single-threaded code, which we don't want.

  3. Clarify the invariant between the thread-local storage and the
    global runtime data, and restore it more often during thread
    scheduling operations.

    This seems more general than (1) but also possibly a source of
    inefficiency for threaded code.

  4. Always assume that threaded code may enable backtrace collection
    and allocate an array of the right size instead of NULL at thread
    creation time.

    This makes thread creation more costly. It is also a rather
    special-case change, similar to (1), so we may be missing other
    issues.

  5. Split caml_thread_{enter,leave}_blocking_section in the
    local-store/global-restore logic and the runtime lock control
    logic. At thread stopping time, call the local-store logic.

    I think this is the most principled solution, and also a good one
    in terms of performance (thread stopping becomes marginally slower,
    but I don't think this is an issue).

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: @gasche

I just uploaded a patch implementing approach (5), that fixes the leak on the provided reproduction code on my machine. Interestingly, this allows to remove some lines that would save curr_thread->stack_low at the very same point of code for the exact same reason -- save the full thread-local state would have avoided the present bug:

https://github.com/ocaml/ocaml/blob/5401ce8/otherlibs/systhreads/st_stubs.c#L462-L466

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: @gasche

I think that the proposed patch fixes the issue as reported, but I wonder whether there aren't similar bug in other parts of the codebase:

  1. there is a way to register and unregister threads from C code instead of the Thread module, does it suffer from the same leak?

  2. does the -vmthread-compiled code (running thread/scheduler.c) suffer from the same leak?

For (1) I think that no change is required: when you unregister a thread from the C code, you must already be inside the blocking section, which means that enter_blocking_section should have been called since the last time OCaml code was running, so the thread-local data should be up-to-date with respect to the global runtime state.

For (2) I also think that no change is required. The thread_kill is the one that does cleanup the thread-local backtrace buffer; if the thread being killed is the current thread, it first activates another thread
https://github.com/ocaml/ocaml/blob/5401ce8/otherlibs/threads/scheduler.c#L741-L747
which means that the save-the-global-state logic is being run before the cleanup logic. (If the thread being killed is not the current thread, it has been interrupted first and so the global state was saved then.)

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: bobot

@gasche as usual well investigated and well explained!

Remarks on the patch:

  • nice refactoring
  • /* Restore the stack-related global variables */ this copy pasted comment seems out of date, or at least limiting.
  • + curr_thread->backtrace_buffer = caml_backtrace_buffer; seems to be an investigation left over, it is already done in caml_thread_save_thread_state();., no?

FWIW I think it is small enough for 4.03.

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: braibant

Looking at

https://github.com/ocaml/ocaml/blob/5401ce8/otherlibs/systhreads/st_stubs.c#L362

don't you need to perform the cleaning of the other threads there too? That is, why isn't line 372 a call to caml_thread_remove_info?

@vicuna
Copy link
Author

vicuna commented Apr 11, 2016

Comment author: @gasche

bobot: oops, indeed this is a leftover, I'll remove this line, thanks!

braibant: my understanding is that this function is only called after fork(), only in the newly forked (child) process. This means that there is no need (and it would be wrong) to reclaim and cleanup here: the threads may still be used in the parent process, and will be cleaned up there.

@vicuna
Copy link
Author

vicuna commented Apr 12, 2016

Comment author: robhoes

Fantastic, thanks for the quick investigation and patch, gasche! This seems to work for me on my example program.

This issue turned out to be quite a big deal for XenServer. We are currently using OCaml 4.02.2, and your patch applies there as well. I am currently testing it to see if our memory leaks go away.

@vicuna
Copy link
Author

vicuna commented Apr 12, 2016

Comment author: @damiendoligez

OK for 4.03.

@vicuna
Copy link
Author

vicuna commented Apr 12, 2016

Comment author: robhoes

My testing of today suggests that this patch fixes the leak in the xenopsd daemon of XenServer as well. Thanks again!

@vicuna
Copy link
Author

vicuna commented Apr 12, 2016

Comment author: @gasche

I committed the fix in 4.03:
bf775ed

(bobot: in the comment I renamed "stack-local variables" into "runtime state", and accordingly caml_thread_{save,restore}thread_state are now caml_thread{save,restore}_runtime_state)

Thinking more about braibant's remark, I'm not sure I understand myself why freeing the thread structure at the beginning of the forked child's life suffices: if the backtrace buffers are copied at fork time, then we should free the thread structure in depth (also free the backtrace buffers reachable from them) instead of shallowly as is done currently. However, I have tried to produce a reproduction case for a leak and I was not able to: the code at the end of this comment will leak without the uploaded patch, but does not leak anymore with the proposed patch.

I suspect that the reason why just a shallow free of the thread structure suffices for the child may be the copy-on-write implementation of fork on my system (and, I would assume, most fork-supporting systems): the thread-specific backtrace buffers would only actually be copied if they were modified in either the parent or the child's copy of the memory. This is impossible in the child, as they are discarded at child-birth time; in theory we could have a race where the parent process mutates its own copies of the backtrace buffer during the fork, turning the child's buffers into unreclaimed copy, but I don't know how to reproduce that.

Here is the code I tried (assuming that the child's copy of the parent's thread-local backtrace buffers would not be freed) and which does not leak on my system. If someone knows how to find a system where fork() performs a deep copy instead of copy-on-write, it would be interesting to try there.

let create_backtrace () = try raise Not_found with Not_found -> ()

let _ =
Printexc.record_backtrace true;
create_backtrace ();
let rec loop n =
if n = 0 then ()
else begin
Unix.sleepf 0.001;
match Unix.fork () with
| 0 ->
let threads =
Array.init 10_000
(fun _ -> Thread.create create_backtrace ()) in
Array.iter Thread.join threads;
loop (n - 1)
| pid -> exit 0
| exception _ -> exit 0
end
in
loop (try int_of_string (Sys.argv.(1)) with _ -> max_int)

@vicuna
Copy link
Author

vicuna commented Apr 20, 2016

Comment author: junsli

Just curious: What does the "st" mean in the interface file? It mentioned "st interface." poSix Threads? Standard Threads?

@vicuna
Copy link
Author

vicuna commented Apr 20, 2016

Comment author: @gasche

I would guess "system threads", by opposition with the green threads implementation where the scheduling is implemented by the OCaml runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants