Mantis Bug Tracker

View Issue Details
ID: 0007669
Project: OCaml
Category: threads
View Status: public
Date Submitted: 2017-11-09 15:10
Last Update: 2018-01-12 13:53
Assigned To:
Platform: x86-64
OS: Linux
OS Version:
Product Version:
Target Version: 4.06.1+dev/rc1/rc2
Fixed in Version: 4.07.0+dev
Summary: 0007669: Thread preemption
Description: When using the systhread library with at least one computation-intensive thread, the scheduler behaves very badly: preemption happens only rarely. This is demonstrated, for example, by the following program when run in native mode:

  let rec generate_list n =
    let rec aux acc = function
      | 0 -> acc
      | n -> aux (float n :: acc) (n-1)
    in
    aux [] n

  let long_list = generate_list 1000000
  let short_list = generate_list 100

  let rec slow_loop () =
    Printf.printf "SLOW %d\n%!" (List.length (List.rev_map sin long_list));
    slow_loop ()

  let rec fast_loop () =
    Printf.printf "FAST %d\n%!" (List.length (List.rev_map sin short_list));
    fast_loop ()

  let () =
    ignore (Thread.create slow_loop ());
    fast_loop ()

In the output, there are roughly as many "FAST" lines as "SLOW" lines, meaning that preemption only happens when printing to standard output. I would expect to see many more "FAST" lines than "SLOW" lines (i.e., the slow computation should be preempted by the fast one even when the slow one does not interact).

Note that both computations allocate memory through the GC, so the signals used for preemption can indeed be handled. This can be seen by adding the following lines at the beginning of the previous code, which restores the expected behavior:

  let preempt signal = Thread.delay 1e-6
  let _ =
    Sys.set_signal Sys.sigvtalrm (Sys.Signal_handle preempt);
    ignore (Unix.setitimer Unix.ITIMER_VIRTUAL
               { Unix.it_interval = 1e-2; Unix.it_value = 1e-2 })

I.e., we periodically force preemption by suspending the currently executing thread for a very short time.

My analysis of the problem is that there are two flaws in the implementation of the systhread library:

1- The "tick thread" regularly sends a signal so that yield() is called regularly. But on Linux, the only thing yield() does is release and re-acquire the master lock. Since sched_yield() apparently does not do what we want here, one solution could be to use usleep(1) instead. My experiments seem to indicate that the resulting loss in performance is not measurable.
2- Actually, 1- would be fine if the implementation of the master lock guaranteed fairness, but it does not. Another solution would therefore be to use a fair implementation for the master lock (such as a ticket lock). This should not make the implementation much more complicated, given that the master lock is already implemented "by hand".
Tags: No tags attached.
Attached Files: [^] (1,917 bytes) 2017-12-08 17:37

- Notes
shinwell (developer)
2017-11-09 15:43

Jane Street did some experiments with using [nanosleep] (similar to your point 1) and as far as I remember, the conclusion was that we would be keen for the runtime to provide an option to use that, rather than the current [yield].
xleroy (administrator)
2017-11-12 17:37

The fairness issue you mention is especially visible under Linux, because the Linux scheduler favors throughput over fairness, and because sched_yield() was made unusable in Linux 2.6. Windows is more fair as far as I can remember.

I'm open to using nanosleep(1 ns) instead of sched_yield() under Linux, provided someone conducts serious experiments to measure the overhead, if any.

But it's unclear to me why we would want fairness in thread scheduling and be willing to invest time and possibly degrade performance to get it. What is wrong with the behavior reported in this MPR?
jacques-henri.jourdan (manager)
2017-11-16 14:03

> But it's unclear to me why we would want fairness in thread scheduling and be willing to invest time and possibly degrade performance to get it. What is wrong with the behavior reported in this MPR?

To me, one of the typical uses of threads, apart from performance, is to have a worker thread and a UI thread. This bug makes that impossible.
seliopou (reporter)
2017-11-22 23:24
edited on: 2017-11-22 23:29

There are a few different schemes that we considered and experimented with. The two that I have numbers for are:

  1. dropping the `#ifndef` in `yield` to re-instate the `sched_yield` call; and
  2. using `nanosleep(1ns)` instead of `sched_yield`.

In addition to these modifications, we modified the Async scheduler to call `Thread.yield` in its main loop. You can see the location of this change here, though that location is currently occupied by a nanosleep that can be enabled and disabled through a configuration flag (disabled by default): [^]

You can find the experimental setup for the tests we ran in this gist: [^]

Briefly, we have an Async-based client and server that support two commands: `get-updates` and `ping`. A `get-updates` request will cause the server to continuously stream bytes to the client until the client closes the connection. A `ping` request on the other hand will cause the server to send a short response and close the connection.

What we're testing, then, is the latency of pings to a server that is servicing a `get-updates` request. We'd expect a fair system to result in low ping times, on the order of tens of milliseconds on average.

You can see the results in the gist linked above (units are in seconds and include process startup time for the client, so subtract 50ms to approximate the real latency). Unfortunately, the tabular data is not rendered properly in Mantis.

As you can see, the baseline had a good median but a bad average and tail, with concordantly high variance. `sched_yield` had worse medians, but that bought you a better average and tail.

And as is clear from the table, `nanosleep` is just an across-the-board improvement. It's a clear winner in terms of thread fairness, though it doesn't really speak to the performance impact on the long-running operation. In addition, it certainly has the downside that it makes the current thread unrunnable. How severe a penalty you take by making the current thread unrunnable really depends on how much work the other threads in your system have to do before releasing the lock again. Ideally, we would stay runnable but just go to the back of the run queue, which is exactly the behavior that was removed from `sched_yield` so long ago.

It might make sense to somehow expose the number of threads waiting on the lock. That way, decisions on how to effect a yield can be left up to the application. In Async we approximate this using the number of threads that are in use from our thread pool. This quantity conflates threads that may be making progress, threads that are blocked in a system call, and threads that are blocked waiting to acquire the OCaml lock. In the absence of a change to `Thread.yield`, it would be nice to tease out just that last component so that use cases like ours do not have to over-approximate the condition under which yielding is a productive action.

jacques-henri.jourdan (manager)
2017-11-24 00:34

Thanks for these experiments, which indeed confirm what I observed informally.

I think Xavier was rather interested in measuring the overhead in terms of computation time. If a worker thread (i.e., a thread that uses CPU time) gets interrupted 20 times per second and (at least) one context switch occurs each time, this has a cost. If this cost is indeed negligible (which I guess it should be, but I have never done serious experiments), then I see no reason not to make this change.

I can try to do these experiments soon.
xleroy (administrator)
2017-11-26 16:05

Yes, I'm interested in both kinds of numbers: latency and throughput.

I guess a simple throughput test would be to have N computational threads (that do enough allocation to properly yield when preempted) and just measure the overall execution time.
jacques-henri.jourdan (manager)
2017-12-08 17:36

I made some throughput tests.

I tried to do what Xavier suggested, by executing N computational threads in parallel. However, the fact that several threads are actually executing at the same time (i.e., preempting each other) has an important effect on the allocation pattern, and hence on the performance of the GC. It is quite difficult to get relevant numbers: depending on the computation, the version with preemption performs better or worse by more than 5%.

Instead, I used a different setup: I measure the throughput of one computation thread running at the same time as an I/O thread. The computation thread is designed to both allocate and compute things. It does 1000 runs of a computation that lasts about 0.1s and then prints statistics about the timings.

The I/O thread essentially echoes on stdout numbers that it reads from stdin. I did experiments in which I gave the program either no input or a small amount of input data. The presence of a small amount of input does not seem to have any impact on performance. Giving it a large amount of input would obviously be unfair, since the non-preemptive version would not handle I/O while the preemptive version would.

The difference between the preemptive and the non-preemptive version is the presence of the following lines of code in the preemptive version of the code:
  let preempt signal = Thread.delay 1e-6
  let () =
    Sys.set_signal Sys.sigvtalrm (Sys.Signal_handle preempt)
This effectively overrides Thread's sigvtalrm handler with mine. My handler does exactly what the new handler in Thread would do.

I attach my source file if you want more details about the actual code. All the experiments were done with 4.06.0; I can't see any reason why trunk would make any difference here.

Results without preemption (i.e., current setup in OCaml 4.06.0):

USER : mean=0.127288 stddev=0.010402 conf.inter.95%=[0.126630;0.127945]
SYSTEM : mean=0.000076 stddev=0.000362 conf.inter.95%=[0.000053;0.000099]
REAL : mean=0.127776 stddev=0.010435 conf.inter.95%=[0.127116;0.128436]

Results with preemption:

USER : mean=0.127881 stddev=0.011377 conf.inter.95%=[0.127162;0.128600]
SYSTEM : mean=0.000071 stddev=0.000438 conf.inter.95%=[0.000043;0.000099]
REAL : mean=0.128547 stddev=0.011431 conf.inter.95%=[0.127825;0.129270]

Moreover, running the executable several times seems to indicate that the results are fairly reproducible: we cannot see any significant performance change. Even though the preemptive version seems a bit slower, it is only by about 0.5%.
doligez (administrator)
2017-12-13 14:40

I think 0.5% is a small price to pay to get usable UI threads. Jacques-Henri, could you submit a GitHub PR and move the discussion there?
xleroy (administrator)
2018-01-12 13:52

See [^]

- Issue History
Date Modified Username Field Change
2017-11-09 15:10 jacques-henri.jourdan New Issue
2017-11-09 15:43 shinwell Note Added: 0018644
2017-11-12 17:37 xleroy Note Added: 0018648
2017-11-12 17:37 xleroy Status new => acknowledged
2017-11-16 14:03 jacques-henri.jourdan Note Added: 0018665
2017-11-22 23:24 seliopou Note Added: 0018685
2017-11-22 23:29 seliopou Note Edited: 0018685 View Revisions
2017-11-24 00:34 jacques-henri.jourdan Note Added: 0018686
2017-11-26 16:05 xleroy Note Added: 0018688
2017-12-08 17:36 jacques-henri.jourdan Note Added: 0018737
2017-12-08 17:37 jacques-henri.jourdan File Added:
2017-12-13 14:40 doligez Note Added: 0018742
2018-01-11 15:22 doligez Target Version => 4.06.1+dev/rc1/rc2
2018-01-12 13:52 xleroy Note Added: 0018827
2018-01-12 13:53 xleroy Status acknowledged => resolved
2018-01-12 13:53 xleroy Resolution open => fixed
2018-01-12 13:53 xleroy Fixed in Version => 4.07.0+dev
