
Bad locking in io.c #7503

Closed
vicuna opened this issue Mar 14, 2017 · 4 comments · Fixed by #9722
vicuna commented Mar 14, 2017

Original bug ID: 7503
Reporter: @stedolan
Status: confirmed (set by @mshinwell on 2017-03-15T16:25:51Z)
Resolution: open
Priority: normal
Severity: minor
Category: runtime system and C interface
Related to: #5141
Monitored by: @stedolan @gasche

Bug description

With ocamlopt, the locking in io.c is not sufficient in the presence of asynchronous operations like signal handlers and finalisers. Here's an example program badlock.ml:

let () =
  Sys.set_signal Sys.sigint (Sys.Signal_handle (fun _ ->
    print_endline "signalled!"));
  while true; do
    print_endline "hello"
  done

The exact behaviour varies depending on whether systhreads is enabled. With trunk and 4.05.0+beta2,

ocamlopt badlock.ml -o badlock && ./badlock
^C

gets stuck in an infinite loop in caml_flush

ocamlopt -runtime-variant d badlock.ml -o badlock && ./badlock
^C

crashes with an assertion failure (the same issue)

ocamlopt -thread unix.cmxa threads.cmxa badlock.ml -o badlock && ./badlock
^C

deadlocks in pthread_mutex_lock

Version 4.04 and earlier do not initialise systhreads when a program is merely linked against threads.cmxa without using any thread features, so those versions behave as in the first case even when compiled with -thread.

I think I understand what's going on in the deadlock case, since I came across this issue when trying to fix a similar problem in the multicore branch (until KC pointed out that the same problem likely exists on trunk). Here's a backtrace from the deadlock in 4.05.0+beta2:

#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f2cd0d32bc5 in __GI___pthread_mutex_lock (mutex=0x563e7210ea60) at ../nptl/pthread_mutex_lock.c:80
#2 0x0000563e70e55a6e in caml_io_mutex_lock ()
#3 0x0000563e70e6920d in caml_ml_output_bytes (vchannel=, buff=, start=, length=) at io.c:660
#4 0x0000563e70e2e0da in camlPervasives__output_string_1213 () at pervasives.ml:349
#5 0x0000563e70e2e97a in camlPervasives__print_endline_1310 () at pervasives.ml:472
#6 0x0000563e70e786f4 in caml_start_program ()
#7 0x0000563e70e5db09 in caml_execute_signal (signal_number=signal_number@entry=2, in_signal_handler=in_signal_handler@entry=0) at signals.c:176
#8 0x0000563e70e5dbda in caml_process_pending_signals () at signals.c:58
#9 0x0000563e70e5dc87 in caml_process_pending_signals () at signals.c:53
#10 caml_leave_blocking_section () at signals.c:131
#11 0x0000563e70e71d2b in caml_write_fd (fd=1, flags=, buf=buf@entry=0x563e720ee7a0, n=n@entry=6) at unix.c:94
#12 0x0000563e70e68480 in caml_flush_partial (channel=0x563e720ee750) at io.c:168
#13 0x0000563e70e68f48 in caml_flush (channel=) at io.c:182
#14 caml_ml_flush (vchannel=) at io.c:612
#15 0x0000563e70e2e9bf in camlPervasives__print_endline_1310 () at pervasives.ml:472
#16 0x0000563e70e24c47 in camlBadlock__entry ()
#17 0x0000563e70e22369 in caml_program ()
#18 0x0000563e70e786f4 in caml_start_program ()
#19 0x0000563e70e5c9b5 in caml_main (argv=0x7ffff6bb5a18) at startup.c:145
#20 0x0000563e70e2203c in main (argc=, argv=) at main.c:37

After performing the write, caml_write_fd calls caml_leave_blocking_section, which runs any pending signal handlers. However, the channel's lock is still held because it was taken in caml_flush, so when the signal handler tries to re-acquire the lock it deadlocks. (Without -thread, I think execution continues and corrupts the channel state, although I'm not sure of the details.)

I think that it is unsafe in general to hold mutexes while interacting with the OCaml heap. Because of finalisers and signal handlers, several heap operations can run arbitrary user code (e.g. caml_alloc, caml_enter_blocking_section, and more in multicore), and arbitrary user code must never run inside a critical section.

vicuna commented Mar 14, 2017

Comment author: @stedolan

I think that most of the issue can be remedied by avoiding holding channel mutexes outside of blocking sections, which means moving the entire I/O operation into a single blocking section. (At the moment there are two blocking sections: one to acquire the channel mutex, and a separate one to do the I/O; the channel mutex is released outside any blocking section.)

The remaining issue is that signals can be handled inside blocking sections. If a signal arrives during a blocking section, I think the code in signals_asm.c will run arbitrary OCaml code from a signal handler context. This seems very unsafe, not just because of this sort of locking issue.

vicuna commented Mar 30, 2017

Comment author: @stedolan

See #1128.

github-actions bot commented May 8, 2020

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

github-actions bot added the Stale label May 8, 2020
lpw25 commented May 8, 2020

Still being worked on in #1128
