|Anonymous | Login | Signup for a new account||2016-08-28 18:39 CEST|
|Main | My View | View Issues | Change Log | Roadmap|
|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0004577||OCaml||OCaml general||public||2008-07-10 17:59||2009-04-19 11:07|
|Target Version||Fixed in Version||3.11.0|
|Summary||0004577: Forks and multi-threading not well-supported|
|Description||It is currently impossible to safely fork processes that have started threads. The reason is that the runtime doesn't clean up its thread-related datastructures after a fork and therefore segfaults once it has to perform a garbage collection, which apparently will chase roots (stack) belonging to the now-dead threads.|
It seems that this problem should be solvable with reasonably little work. We only need to register cleanup functions that will be called whenever the process needs to fork. This can be achieved with the pthread_atfork function. Everytime a new thread is created we register atfork-functions that will make sure that the thread state gets removed from the runtime datastructures in the child process as if the thread had terminated normally. Furthermore, the tick thread needs to be restarted with a clean state, which should be easy to do, since it performs a very simple task.
As an additional feature it seems worthwhile to provide an atfork function (it's POSIX) in the thread library. It only needs to record the callback functions required by pthread_atfork by storing them in the associated thread info structure. Once the actually registered pthread_atfork callbacks for this thread get called (one before the fork, one in the parent and one in the child after the fork), they invoke the user-provided atfork callbacks in the appropriate order.
Many people have already run into these issues when implementing code for exploiting multiple cores/CPUs in OCaml. It would therefore be great if this issue could be resolved, since multi-core CPUs are pretty much the standard nowadays.
|Tags||No tags attached.|
Even in the POSIX-C world, there is a bit of controversy on mixing fork() with threads -- many consider that the only sensible thing to do in the child process after a fork() is to call exec().
At any rate, I agree that a bit of cleanup in the child process could be good, if only to restart the tick thread. However, I'd like to see an example that produces the segfault you mention -- I couldn't reproduce it. It is true that the other threads are not running in the child process, but fork() completely duplicates the memory state, including the stacks of those threads. So, stack scanning durin GC should not crash, even though it is performing useless work.
For atfork, remember that fork handlers in POSIX are global, not per-thread. The same functionality could be achieved directly from Caml, not relying on pthread_at_fork.
We have observed segfaults related to forking and threads in several large applications. Not forking always made the problem go away. Unfortunately, we have not yet managed to come up with a small test case that causes segfaults, but I hope we will soon.
However, we have found another problem related to fork and threads: since threads acquire I/O-mutexes on channels while performing I/O on them, forking will leave these mutexes in a locked state. Therefore, the thread performing the fork may encounter such locked mutexes when trying to perform I/O on the same channel, thus blocking forever. I think this problem could be quickly fixed by simply unlocking all locked channel mutexes right after a fork in the child process.
Btw., you are right, of course, there is no need to use pthread_at_fork, since the forking thread should, it seems, always be able to perform any sort of cleanups by itself depending on whether it is in the parent or child.
We now have a fairly small program which segfaults after having created some threads, done a fork (which only duplicates the calling thread), and created some more threads. I'm sorry the executable can't be posted here (although we're trying to construct one that can be), but the top portion of the backtrace upon SIGSEGV is:
0x0848b502 in caml_do_local_roots ()
0x08485d69 in caml_thread_scan_roots ()
0x0848ba3c in caml_oldify_local_roots ()
0x0848d6a4 in caml_empty_minor_heap ()
0x0848d774 in caml_minor_collection ()
0x0848e2e1 in caml_alloc ()
0x0848e3b7 in caml_alloc_dummy ()
Examination reveals that caml_do_local_roots is attempting to deference a null pointer.
I've done some more investigation with a colleague today. This seems to reveal that Xavier's assertion that "duplicates ... the stacks of those threads" isn't correct for a Linux/glibc based system when using NPTL. Take a look at nptl/allocatestack.c in the glibc source distribution, function __reclaim_stacks. This function is registered (in sysdeps/) using the atfork mechanism such that it is called in the child after a fork. To my knowledge this is the only time that __reclaim_stacks is ever called. This function goes through all thread stacks except the one from the current process and clears them up -- which actually involves memset to zero on certain portions of the stacks (but not any parts which the user program would actually use, IIUC).
Stepping back a bit, a thread stack and associated control block is allocated using what is effectively a copy-on-write mmap of /dev/zero: a clever means of obtaining memory that is lazily initialized to zero. Such blocks are kept in a cache and can be reused (by nptl/allocatestack.c:allocate_stack) when a new thread is created. __reclaim_stacks is one function that can deem a thread stack finished with and leave it in the cache ready for reuse.
With this in mind, consider the following sequence of events:
- a thread, call it X, is created, which causes such an mmap to happen;
- pointers into X's thread stack end up in the Caml heap (I haven't actually investigated how this would happen, but I assume it can given Xavier's reply);
- the program forks (from the main thread);
- then in the child:
- thread X doesn't exist any more;
- __reclaim_stacks does its thing, leaving the mmaped block in the cache;
- another thread, Y, is created;
- allocate_stack for thread Y recycles the previously-mmaped block;
- the GC crashes upon the next minor collection, since there are pointers
into the region that is now used for Y's stack.
I've got some experimental evidence that this scenario isn't entirely unlikely. I took a copy of the glibc shared library and, with the help of xxd, nop-ed out the call that registers __reclaim_stacks to be called upon a fork. The consequence of this is that the creation of thread Y would not touch the mmaped area previously used for thread X, since it would not be marked as unused in the cache; instead, a fresh area would be allocated. Using this glibc library at runtime causes the offending executable to not segfault any more, whereas with the vanilla library, it segfaults upon every execution.
I'd be pleased to hear what others have to think of this. If correct, I think it basically means that you can't safely create any Caml-related threads after a fork if any Caml-related threads have been created before the fork. Perhaps it might be possible to fix this by using the atfork mechanism to drop the Caml references to any thread stacks that the C library is going to reclaim.
Tricky stuff :-)
|Nice detective work! Caml never puts a pointer from the heap into the stack, but if the threading library erases some parts of the stacks after a fork(), this could indeed cause the GC to crash while trying to follow zero-ed pointers found in the stack. I'm preparing a fix that will remove the "dead" stacks from consideration by the GC, as well as reset the tick thread and clean up mutex locks. Stay tuned.|
I don't quite follow when you say "Caml never puts a pointer from the heap into the stack". Don't we end up with pointers in the C stack referencing the Caml heap when we use the CAMLparam/CAMLlocal macros? Thinking about it further, it seems that perhaps these are part of the necessary requirements to cause this failure. For suppose the threads created before the fork call C functions which use CAMLparam/CAMLlocal and then, say, block on a system call. When the fork happens, we have Caml values on the C heap. The sequence of events I originally described will be triggered, and the GC will no doubt end up scanning through the caml_local_roots variable (whose value I imagine is ok since that's all allocated in malloced blocks) to reach the stack locations that are now corrupted.
Alternatively, perhaps there is some other way that I don't know about that causes the GC to traverse the C stack.
Thanks for working on a fix; looking forward to it!
It seems to me that the problem is actually not so much one of "only" seeing segfaults (which at least tells you the hard way that something is wrong), but also of leaking memory. Imagine the following scenario:
Thread A calls a C-function, which protects one of its parameters from being reclaimed by using the CAMLparam-macros, e.g. a bigarray whose contents should be written to a file descriptor and which should then be discarded. Then the thread releases the OCaml-lock to e.g. perform the potentially blocking write system call.
Now thread B comes along and forks. Thread A dies a peaceful death, leaving its stack as a legacy to the OCaml runtime. Since the latter doesn't let go of it, the bigarray (which may indeed be big) is also still reachable, thus potentially wasting precious resources. If we keep forking often in such scenarios, the resulting new processes may be inheriting a bigger and bigger legacy...
In practice it's not likely, of course, that the above will happen so most people probably just never notice that they may be leaking memory on the OCaml heap with forks (and CPU-time due to the useless scanning of old stacks). But anyway, since NPTL obviously (and quite reasonably) assumes that old stacks from dead threads may be recycled, we may actually end up with corrupted root blocks in old thread stacks once the new thread e.g. recurses deeply enough. If the GC scans the old root stacks of dead threads then, it's bound to encounter such a corrupted block and die a not so peaceful death.
Luckily, Xavier's suggestion of getting rid of old root stacks of dead threads after a fork should fix all of the above problems so I'm looking forward to it in the next release :-)
Yes, there are lots of pointers stack -> heap, but I thought you were thinking about pointers heap -> stack, which don't exist as far as I can remember. Sorry if I misunderstood or used ambiguous wording.
The memory leak Markus mentions is there in 3.10, of course, but is a minor concern compared with a crash.
|Sure, we are certainly more concerned about a crash than leaking a little bit of memory. But even if POSIX-threads didn't behave as they do with NPTL, i.e. cause crashes, the leaking of memory might be a sufficient reason to remove old stacks.|
Tentative fix in CVS trunk.
|2008-07-10 17:59||mottl||New Issue|
|2008-08-04 16:49||doligez||Status||new => acknowledged|
|2008-08-05 16:00||xleroy||Note Added: 0004572|
|2008-08-05 16:00||xleroy||Status||acknowledged => feedback|
|2008-08-14 17:17||mottl||Note Added: 0004582|
|2008-09-26 14:12||shinwell||Note Added: 0004635|
|2008-09-26 16:43||xleroy||Note Added: 0004636|
|2008-09-26 17:27||shinwell||Note Added: 0004637|
|2008-09-26 18:02||mottl||Note Added: 0004638|
|2008-09-26 18:14||xleroy||Note Added: 0004639|
|2008-09-26 18:14||xleroy||Assigned To||=> xleroy|
|2008-09-26 18:14||xleroy||Status||feedback => assigned|
|2008-09-26 18:51||mottl||Note Added: 0004640|
|2008-09-27 13:37||xleroy||Note Added: 0004641|
|2008-09-27 13:37||xleroy||Status||assigned => resolved|
|2009-04-19 11:07||xleroy||Fixed in Version||=> 3.11.0|
|2009-04-19 11:07||xleroy||Status||resolved => closed|
|Copyright © 2000 - 2011 MantisBT Group|