You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Original bug ID: 4351 Reporter:@mmottl Assigned to:@xavierleroy Status: closed (set by @xavierleroy on 2009-03-31T11:09:20Z) Resolution: fixed Priority: normal Severity: tweak Version: 3.10+dev Fixed in version: 3.11.0 Category: ~DO NOT USE (was: OCaml general) Monitored by:@dbuenzli@oandrieu@alainfrisch@mmottl
Bug description
It seems that several parts of the runtime system do not make optimal
use of locking primitives. Here are some recommendations, which are
easy to implement and which would give quite noticable performance
improvements in heavily threaded code:
Mutex.unlock releases the runtime lock during the call to
pthread_mutex_unlock. This seems completely unnecessary, since this
function never blocks and should always return very quickly. Note that
leaving the blocking section itself unlocks a mutex and may even cause a
context switch so why more than double the effort? Fix: drop the calls
to enter/leave_blocking_section and the then unnecessary GC-protection
of the mutex value.
Note that caml_condition_signal and caml_condition_broadcast also release
the runtime lock, which may not be a good idea for similar reasons.
There is also a problem with Mutex.lock: this function essentially makes
an unrealistically pessimistic assumption, namely that the mutex to be
acquired is already locked. Only then would it be justified to always
release the runtime lock.
But it is good practice to avoid lock contention so in the very vast
majority of cases sane user code will attempt to acquire an uncontented
mutex. In this most frequent case there is no reason to release the
runtime lock. Fix: attempt to lock the mutex with "pthread_mutex_trylock"
first. Only if this fails, release the runtime lock and do a blocking
"pthread_mutex_lock". The almost certain, resulting context-switch will
generally dwarf the cost of trying to lock twice in this infrequent case
anyway, and the frequent case will be considerably more efficient.
The same problem (pessimistic locking assumptions) also seems to be
present in the code for locking I/O-channels.
The proposed fixes would not only reduce the computational effort required
to lock/unlock mutexes under realistic conditions, but would also reduce
(very expensive!) context-switches under these conditions if there are
many threads competing for the OCaml runtime lock.
I have attached two files in a tarball that demonstrate the drastic difference in an admittedly artificial benchmark. "foo.ml" contains the OCaml-code, and
"foo_stubs.c" contains improved bindings for thread synchronization, which you may want to incorporate into your code in otherlibs/systhreads/posix.c. I haven't written any fixes for the I/O-mutexes, but this should be straightforward.
Running this code takes around 56ms on my machine. Commenting out the
code block in foo.ml that binds the "fast" locking primitives (i.e. to
use the standard calls) will result in a program that takes around 1500ms,
which is a pretty substantial slowdown (> 25x).
Very interesting ideas, thanks. I especially like the suggestion to use mutex_trylock before releasing the runtime lock.
The reason for releasing the runtime lock around non-blocking mutex and condition variable operations is to offer opportunities for rescheduling around these (communication) operations, and therefore make it less likely that a thread can starve other threads. I agree we're probably rescheduling too much with the current implementation. As usual, we need to find a compromise between throughput and fairness. I put that on my "to do" list.
Original bug ID: 4351
Reporter: @mmottl
Assigned to: @xavierleroy
Status: closed (set by @xavierleroy on 2009-03-31T11:09:20Z)
Resolution: fixed
Priority: normal
Severity: tweak
Version: 3.10+dev
Fixed in version: 3.11.0
Category: ~DO NOT USE (was: OCaml general)
Monitored by: @dbuenzli @oandrieu @alainfrisch @mmottl
Bug description
It seems that several parts of the runtime system do not make optimal
use of locking primitives. Here are some recommendations, which are
easy to implement and which would give quite noticable performance
improvements in heavily threaded code:
Mutex.unlock releases the runtime lock during the call to
pthread_mutex_unlock. This seems completely unnecessary, since this
function never blocks and should always return very quickly. Note that
leaving the blocking section itself unlocks a mutex and may even cause a
context switch so why more than double the effort? Fix: drop the calls
to enter/leave_blocking_section and the then unnecessary GC-protection
of the mutex value.
Note that caml_condition_signal and caml_condition_broadcast also release
the runtime lock, which may not be a good idea for similar reasons.
There is also a problem with Mutex.lock: this function essentially makes
an unrealistically pessimistic assumption, namely that the mutex to be
acquired is already locked. Only then would it be justified to always
release the runtime lock.
But it is good practice to avoid lock contention so in the very vast
majority of cases sane user code will attempt to acquire an uncontented
mutex. In this most frequent case there is no reason to release the
runtime lock. Fix: attempt to lock the mutex with "pthread_mutex_trylock"
first. Only if this fails, release the runtime lock and do a blocking
"pthread_mutex_lock". The almost certain, resulting context-switch will
generally dwarf the cost of trying to lock twice in this infrequent case
anyway, and the frequent case will be considerably more efficient.
The same problem (pessimistic locking assumptions) also seems to be
present in the code for locking I/O-channels.
The proposed fixes would not only reduce the computational effort required
to lock/unlock mutexes under realistic conditions, but would also reduce
(very expensive!) context-switches under these conditions if there are
many threads competing for the OCaml runtime lock.
I have attached two files in a tarball that demonstrate the drastic difference in an admittedly artificial benchmark. "foo.ml" contains the OCaml-code, and
"foo_stubs.c" contains improved bindings for thread synchronization, which you may want to incorporate into your code in otherlibs/systhreads/posix.c. I haven't written any fixes for the I/O-mutexes, but this should be straightforward.
Just compile the code using:
ocamlopt -cc "gcc" -thread unix.cmxa threads.cmxa foo_stubs.c foo.ml -o foo
Running this code takes around 56ms on my machine. Commenting out the
code block in foo.ml that binds the "fast" locking primitives (i.e. to
use the standard calls) will result in a program that takes around 1500ms,
which is a pretty substantial slowdown (> 25x).
File attachments
The text was updated successfully, but these errors were encountered: