Threads + signals: runtime hangs #3659

Closed
vicuna opened this issue May 23, 2005 · 4 comments

vicuna commented May 23, 2005

Original bug ID: 3659
Reporter: administrator
Status: closed
Resolution: fixed
Priority: normal
Severity: minor
Category: ~DO NOT USE (was: OCaml general)

Bug description

Full_Name: Gerd Stolpmann
Version: 3.08.3
OS: Linux, kernel 2.6
Submission from: p54a79e6c.dip0.t-ipconnect.de (84.167.158.108)

Hello,

I recently got a bug report for one of my libraries (equeue) that did not work
in a multi-threaded program. Actually, the program wasn't multi-threaded, but
compiled with -thread and threads.cma (because of another library), so the
multi-threading machinery was initialized. My library is an enhanced version of
"system", i.e. does fork + exec, and sometimes SIGCHLD signals are emitted. I
never thought it worked in an mt program because of other reasons (problems with
fork, no access to the thread-specific signal mask). However, it turns out that
the problems are much more fundamental, and can hang the O'Caml runtime at any
time (although this is very unlikely if the program doesn't use signals for
application purposes).

Now here is a short program that almost always hangs the O'Caml runtime. It
sends lots of signals to a process that blocks from time to time:

let rec microsleep t =
  let t0 = Unix.gettimeofday() in
  try
    ignore(Unix.select [] [] [] t)
  with
    Unix.Unix_error(Unix.EINTR,_,_) ->
      microsleep (t -. (Unix.gettimeofday() -. t0))
;;

let pid = Unix.getpid() ;;

let generate_signals() =
  match Unix.fork() with
  | 0 ->
      while true do
        Unix.kill pid Sys.sigusr1;
      done;
      exit 0
  | _ ->
      ()
;;

let _ = Thread.create in   (* Ensure mt machinery is enabled *)
let n = ref 0 in
Sys.set_signal Sys.sigusr1 (Sys.Signal_handle(fun _ -> incr n));
generate_signals();
let s = ref 0 in
for k = 1 to 1000 do
  s := !s + k;
  microsleep 0.0001   (* block for a short moment *)
done;

prerr_endline "Done!";
prerr_endline ("Number of signals: " ^ string_of_int !n)
;;

Compiled with:

ocamlopt -o signals unix.cmxa threads.cmxa -thread signals.ml

After a short time, the parent process freezes. strace shows the process hangs
in a futex system call. gdb shows more:

#0 0xffffe410 in __kernel_vsyscall ()
#1 0x4002dfae in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/i686/cmov/libpthread.so.0
#2 0x08057b7b in caml_thread_leave_blocking_section ()
#3 0x0805ebaa in caml_leave_blocking_section ()
#4 0x0805ebd2 in handle_signal ()
#5 <signal handler called>
#6 0x0804a96c in ?? ()
#7 0x08057bb8 in caml_thread_leave_blocking_section ()
#8 0x0805ebaa in caml_leave_blocking_section ()
#9 0x0805afa5 in unix_select ()
(rest stripped)

Obviously, the signal appears in the middle of caml_leave_blocking_section, just
after the thread acquired the master lock. The signal handler, because still in
asynchronous mode, tries to acquire the master lock again - deadlock.
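
The re-entrancy in this trace can be modelled in a few lines of plain OCaml. This is only a toy model, not runtime code: `master_lock`, `acquire`, `release`, `leave_blocking_section` and `handler` are hypothetical stand-ins for the C functions in the backtrace, and the would-be deadlock is reported as an exception instead of actually hanging.

```ocaml
(* Toy model of a non-recursive master lock. All names are hypothetical
   stand-ins for the runtime's C code; nothing here is the real runtime. *)
exception Would_deadlock

let master_lock = ref false   (* true = held by the (single) thread *)

(* A real pthread mutex would block forever here; the model raises instead. *)
let acquire () =
  if !master_lock then raise Would_deadlock;
  master_lock := true

let release () = master_lock := false

(* [signal_arrives] simulates a signal delivered right after the lock is
   taken, while the runtime still believes it is in asynchronous mode. *)
let leave_blocking_section ~signal_arrives handler =
  acquire ();
  if signal_arrives then handler ()

(* The handler itself wants to leave the blocking section, i.e. take the lock. *)
let handler () = leave_blocking_section ~signal_arrives:false (fun () -> ())

let outcome =
  match leave_blocking_section ~signal_arrives:true handler with
  | () -> "ok"
  | exception Would_deadlock -> "deadlock"

let () =
  release ();               (* reset the model *)
  print_endline outcome
```

The second `acquire` comes from the same thread that already holds the lock, so there is nobody left to release it.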

This dump is typical of most of the freezes, but not for all. There seem to be
other problems as well.

Although not tested, I have the impression that masking all signals during
caml_leave_blocking_section could help (until the asynchronous signal mode is
finished).
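
At the OCaml level, the same idea (keep signals out of a critical region) can be sketched with `Unix.sigprocmask`. This is a user-level analogue under stated assumptions, not the proposed runtime change: `with_signals_blocked` is a hypothetical helper, and the sketch needs the unix library linked in, as in the compile command above.

```ocaml
(* User-level analogue of "mask signals during a critical region".
   [with_signals_blocked] is a hypothetical helper, not part of Unix. *)
let with_signals_blocked sigs f =
  (* SIG_BLOCK adds [sigs] to the mask and returns the previous mask. *)
  let old_mask = Unix.sigprocmask Unix.SIG_BLOCK sigs in
  Fun.protect
    ~finally:(fun () -> ignore (Unix.sigprocmask Unix.SIG_SETMASK old_mask))
    f

let result =
  with_signals_blocked [Sys.sigusr1] (fun () ->
    (* A SIGUSR1 sent now stays pending until the old mask is restored. *)
    21 * 2)
```

In a program that really runs several threads, `Thread.sigmask` (same argument and result types) would be the call to use, since the mask is per thread.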

Another observation: In a signal handler, POSIX forbids calling any pthread
function. The O'Caml runtime does, however, call pthread functions (in
thread_enter/leave_blocking_section). Obviously, this works most of the time,
but maybe there are race conditions in libpthread that can be triggered under
these circumstances. A very well-known author writes in pthread_cond_signal(3): "In
particular, calling pthread_cond_signal or pthread_cond_broadcast from a
signal handler may deadlock the calling thread." (Linux man page, signed by XL).
Well, the above test was for a 2.6 kernel with the new threading library, so the
code base has changed.

Gerd

vicuna commented Jul 29, 2005

Comment author: administrator

Dear Gerd,

> I recently got a bug report for one of my libraries (equeue) that
> did not work in a multi-threaded program. Actually, the program
> wasn't multi-threaded, but compiled with -thread and threads.cma
> (because of another library), so the multi-threading machinery was
> initialized. My library is an enhanced version of "system",
> i.e. does fork + exec, and sometimes SIGCHLD signals are emitted. I
> never thought it worked in an mt program because of other reasons
> (problems with fork, no access to the thread-specific signal
> mask). However, it turns out that the problems are much more
> fundamental, and can hang the O'Caml runtime at any time (although
> this is very unlikely if the program doesn't use signals for
> application purposes).

Thank you for this interesting bug report and for the repro case,
and thanks Samuel for the proposed fix (which however exhibits another
symmetrical race condition).

Damien Doligez and I partially fixed the bug in the CVS trunk (should be in
the 3.09 release).

The fix is partial in that it assumes that the POSIX thread function
pthread_mutex_trylock() is async-signal safe, which is not guaranteed
by the POSIX thread spec. For example, the fix works under Linux
(both the old LinuxThreads library and the new NPTL library implement
trylock() in a way that is async-signal safe), but the deadlock is
still there under MacOSX.
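
The trylock idea can be illustrated with a small pure-OCaml model. It is a sketch only: the real fix lives in the C runtime, all names below are hypothetical, the signal numbers 1 and 2 are arbitrary, and what happens when the trylock fails (record the signal and let the lock holder run it later) is an assumption of this sketch rather than a description of the actual code.

```ocaml
(* Toy model of the trylock idea. The names are hypothetical; the real fix
   lives in the C runtime and relies on pthread_mutex_trylock(). *)
let master_lock = ref false          (* true = lock currently held *)
let pending : int list ref = ref []  (* signals recorded for later *)
let handled : int list ref = ref []  (* signals whose handler has run *)

(* Non-blocking acquire: returns false instead of waiting. *)
let try_acquire () =
  if !master_lock then false else (master_lock := true; true)

let release () = master_lock := false

(* Signal entry point: run the handler only if the lock is free,
   otherwise just remember the signal and return immediately. *)
let on_signal signo =
  if try_acquire () then begin
    handled := signo :: !handled;
    release ()
  end else
    pending := signo :: !pending

(* Called by the lock holder at a safe point: run what was postponed. *)
let run_pending () =
  List.iter (fun s -> handled := s :: !handled) (List.rev !pending);
  pending := []

let () =
  on_signal 1;                 (* lock free: handled at once *)
  master_lock := true;         (* main code now holds the lock *)
  on_signal 2;                 (* lock busy: postponed, no deadlock *)
  run_pending ();
  release ()
```

The handler never waits for the lock, which removes the self-deadlock but only if the trylock primitive itself is safe to call from a signal handler.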

Given the way OCaml processes signals and the limitations of POSIX
threads, it is fundamentally impossible to implement things so that
threads and signals work well together. So, the current fix is a best
effort, and improves over the previous implementation, but there are
still no guarantees that threads and signals combine well.

> Another observation: In a signal handler, POSIX forbids calling any pthread
> function. The O'Caml runtime does, however

Yes, and it calls many other functions that should not be called from
signal handlers according to POSIX. Basically, the OCaml runtime
system makes the assumption that if the main program is blocked on a
syscall, it is safe to do arbitrary computations in a signal handler.
That assumption has proved effective in practice for "true" system
calls, but as the MacOSX example demonstrates, is not always true for
POSIX thread primitives. Again, if we were to do things by the specs,
I'm afraid there would be no signal handling at all in OCaml...

Best wishes,

- Xavier

vicuna commented Jul 29, 2005

Comment author: administrator

On Friday, 29.07.2005, at 15:48 +0200, Xavier Leroy wrote:

> The fix is partial in that it assumes that the POSIX thread function
> pthread_mutex_trylock() is async-signal safe, which is not guaranteed
> by the POSIX thread spec. For example, the fix works under Linux
> (both the old LinuxThreads library and the new NPTL library implement
> trylock() in a way that is async-signal safe), but the deadlock is
> still there under MacOSX.

> Given the way OCaml processes signals and the limitations of POSIX
> threads, it is fundamentally impossible to implement things so that
> threads and signals work well together. So, the current fix is a best
> effort, and improves over the previous implementation, but there are
> still no guarantees that threads and signals combine well.

> > Another observation: In a signal handler, POSIX forbids calling any pthread
> > function. The O'Caml runtime does, however

> Yes, and it calls many other functions that should not be called from
> signal handlers according to POSIX. Basically, the OCaml runtime
> system makes the assumption that if the main program is blocked on a
> syscall, it is safe to do arbitrary computations in a signal handler.
> That assumption has proved effective in practice for "true" system
> calls, but as the MacOSX example demonstrates, is not always true for
> POSIX thread primitives. Again, if we were to do things by the specs,
> I'm afraid there would be no signal handling at all in OCaml...

I think one of the problems is that OCaml supports asynchronous signals,
i.e. in a blocking section the signal handler is executed immediately.
Imagine what would happen if such signals were simply deferred
until the blocking section is left. The implementation would be much
simpler, and I think it could be done in conformance with the specs. The
visible difference from the current implementation depends on the type of
the blocking section. If it is a direct syscall, nothing will change,
because the syscall immediately returns with EINTR (most blocking
sections are of this type; I don't know whether Windows behaves the
same). If it is a C routine like a name service lookup, there will be a
small but bounded delay until the signal handler is executed. The
problematic case is an external event loop like labltk. However, I think
this case can be handled if the event loop allows custom events to be
injected, i.e. signals are mapped to events. Of course, all these
libraries need to be changed (but there aren't that many libraries of
this type).
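
To make the deferred strategy concrete, here is a minimal application-level sketch. It is not the runtime implementation: `pending`, `note_signal` and `run_deferred` are hypothetical names, and the two direct calls to `note_signal` merely stand in for signals arriving while the program is blocked.

```ocaml
(* Sketch of deferred signal handling at application level. The helper
   names (pending, note_signal, run_deferred) are hypothetical. *)
let pending : int list ref = ref []   (* signals seen but not yet processed *)

(* What the installed handler does: record the signal and nothing else. *)
let note_signal signo = pending := signo :: !pending

(* Called by the main loop once a blocking call has returned. *)
let run_deferred action =
  let todo = List.rev !pending in
  pending := [];
  List.iter action todo

let count = ref 0

let () =
  Sys.set_signal Sys.sigusr1 (Sys.Signal_handle note_signal);
  (* Stand-in for signals arriving while the program is blocked. *)
  note_signal Sys.sigusr1;
  note_signal Sys.sigusr1;
  run_deferred (fun _ -> incr count);
  Printf.printf "processed %d deferred signals\n" !count
```

The real work happens in `run_deferred`, at a point the program chooses. A production version would also have to guard against a signal arriving between reading and clearing `pending`.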

Interestingly, the multi-threading tick signal isn't affected at all,
because it does not need to wake up blocking sections. (Actually, I am
wondering why a real OS-level signal is used. In principle, this signal
only interrupts the execution of OCaml code from time to time to emulate
time slices, and this can be done without OS-level signals. One never
needs to interrupt blocked threads - or am I missing something?)

I think we need a reliable coexistence of threads and signals for many
real-world applications. This is more important than mimicking the C
behaviour of signals as closely as possible (OCaml does not support AIO
anyway, and for almost all other applications small delays in signal
handling are acceptable). Even now, if there was a switch to turn off
asynchronous signals in favour of absolute reliability, I would do this
for some applications (e.g. networking applications).

So my suggestion is to provide such a switch in 3.09, and allow people
to experiment whether they can accept the new semantics of signals.

Gerd


Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de
Telefon: 06151/153855 Telefax: 06151/997714

vicuna commented Jul 30, 2005

Comment author: administrator

Dear Gerd,

> Imagine what would happen if such signals were simply deferred
> until the blocking section is left. The implementation would be much
> simpler, and I think it could be done in conformance with the specs. The
> visible difference from the current implementation depends on the type of
> the blocking section. If it is a direct syscall, nothing will change,
> because the syscall immediately returns with EINTR (most blocking
> sections are of this type; I don't know whether Windows behaves the
> same).

Interesting suggestion. Actually, my first attempt at signal
handling in Caml Light circa 1990 was along these lines, but did not
work because we were working on BSD systems (SunOS 3 and 4) at this
time, and in BSD many system calls restart on a signal rather than
returning EINTR. The situation is likely to be different now. I'll
have to check whether POSIX and the Single Unix Specification
guarantee the EINTR behavior by default.

Note that we could get the best of both worlds (?) as follows: use
synchronous signal processing as in 3.08 for single-threaded code, and
use your deferred strategy for multi-threaded code, therefore avoiding
the problematic pthread_mutex_trylock call in the signal handler. The
new signal handling code Damien and I just put in 3.09 makes it easy
to switch to the deferred strategy when the threading library
initializes.

There are no problems with Windows: the only signal-like mechanism is
ctrl-C keyboard interrupts, and it is already implemented along the
lines of your deferred strategy, i.e. just record the interrupt but
don't act on it immediately. The reason is that Windows runs the
ctrl-C handler function in a new thread, so immediate callback to Caml
code just doesn't work.

> Interestingly, the multi-threading tick signal isn't affected at all,
> because it does not need to wake up blocking sections. (Actually, I am
> wondering why a real OS-level signal is used. In principle, this signal
> only interrupts the execution of OCaml code from time to time to emulate
> time slices, and this can be done without OS-level signals. One never
> needs to interrupt blocked threads - or am I missing something?)

No, you're correct. Actually, the systhreads implementation (over
POSIX/Win32 threads) does not use an OS periodic signal a la setitimer():
there is a separate thread that loops over a "sleep; post a pending signal"
sequence.

> I think we need a reliable coexistence of threads and signals for many
> real-world applications.

I still believe you're taking serious risks by mixing signals and
threads: this is the darkest corner of the POSIX thread spec. But,
yes, it would be best to make Caml itself handle the mixture reliably,
so that all the bugs you're going to run into are those of your code
and the external libraries you call, but not ours :-)

Best wishes,

- Xavier

vicuna commented Jul 31, 2005

Comment author: administrator

see also #3680. Fixed in 3.09 by XL and DD, 2005-07-29
