Thread preemption #7669

vicuna · 2017-11-09T14:10:55Z

Original bug ID: 7669
Reporter: @jhjourdan
Status: resolved (set by @xavierleroy on 2018-01-12T12:53:01Z)
Resolution: fixed
Priority: normal
Severity: minor
Platform: x86-64
OS: Linux
Target version: 4.06.1+dev/rc1/rc2
Fixed in version: 4.07.0+dev/beta2/rc1/rc2
Category: threads
Monitored by: @gasche @ygrek

Bug description

When using the systhread library with at least one computation intensive thread, the scheduler has very bad behavior. Preemption happens only rarely. This is, for example, demonstrated by the following program when run in native mode :

let rec generate_list n =
let rec aux acc = function
| 0 -> acc
| n -> aux (float n :: acc) (n-1)
in
aux [] n

let long_list = generate_list 1000000
let short_list = generate_list 100

let rec slow_loop () =
Printf.printf "SLOW %d\n%!" (List.length (List.rev_map sin long_list));
slow_loop ()

let rec fast_loop () =
Printf.printf "FAST %d\n%!" (List.length (List.rev_map sin short_list));
fast_loop ()

let () =
ignore (Thread.create slow_loop ());
fast_loop ()

In the output, there are roughly as many "FAST" lines as "SLOW" lines, meaning that preemption only happens when printing to standard output. I would expect to see much more "FAST" lines than "SLOW" lines (i.e., the slow computation is preempted by the fast one even if the slow one does not interact).

Note that both computation do allocate memory through the GC, so the signals used for preemption can indeed be handled. This can be seen by adding the following lines at the beginning of the previous code, which gets the expected behavior:

let preempt signal = Thread.delay 1e-6
let _ =
Sys.set_signal Sys.sigvtalrm (Sys.Signal_handle preempt);
ignore (Unix.setitimer Unix.ITIMER_VIRTUAL
{ Unix.it_interval = 1e-2; Unix.it_value = 1e-2 })

I.e., we periodically force preemption by suspending the currently executing thread for a very short time.

My analysis of the problem is that there are two flaws in the implementation of the systhread library:

1- The "tick thread" regularly sends a signal so that yield() is called regularly. But on linux, the only thing yield() does is releasing and acquiring back the master lock. Since scheld_yield() apparently doesn't do want we want here, one solution could be to use usleep(1) instead. My experiments seem to indicate that the loss in performance is not measurable.
2- Actually, 1- would be fine if the implementation of the masterlock would garantee fairness. But it doesn't. Another soution would therfore to use a fair implementation for the master lock (such as a ticket lock). This should not make the implementation much more complicated, given that the master lock is already implemented "by hand".

File attachments

test.ml

vicuna · 2017-11-09T14:43:07Z

Comment author: @mshinwell

Jane Street did some experiments with using [nanosleep] (similar to your point 1) and as far as I remember, the conclusion was that we would be keen for the runtime to provide an option to use that, rather than the current [yield].

vicuna · 2017-11-12T16:37:20Z

Comment author: @xavierleroy

The fairness issue you mention is especially visible under Linux, because the Linux scheduler favors throughput over fairness, and because sched_yield() was made unusable in Linux 2.6. Windows is more fair as far as I can remember.

I'm open to using nanosleep(1 ns) instead of sched_yield() under Linux, provided someone conducts serious experiments to measure the overhead, if any.

But it's unclear to me why we would want fairness in thread scheduling and be willing to invest time and possibly degrade performance to get it. What is wrong with the behavior reported in this MPR?

vicuna · 2017-11-16T13:03:20Z

Comment author: @jhjourdan

But it's unclear to me why we would want fairness in thread scheduling and be willing to invest time and possibly degrade performance to get it. What is wrong with the behavior reported in this MPR?

To me, one of the typical uses of threads apart from performance is to be able to have a worker thread and an UI thread. This is made impossible with this bug.

vicuna · 2017-11-22T22:24:50Z

Comment author: seliopou

There are a few different schemes that we considered and experimented with. The two that I have numbers for are:

dropping the #ifndef in yield to re-instate the sched_yield call; and
using nanosleep(1ns) instead of sched_yield.

In addition to these modifications, we modified the Async scheduler to call Thread.yield in its main loop. You can see the location of this change here, though that location is currently occupied by a nanosleep that can be enabled and disabled through a configuration flag (disabled by default):

https://github.com/janestreet/async_unix/blob/19e9656db3797c6df73c951208174bd9b6fa5c9c/src/raw_scheduler.ml#L721-L723

You can find the experimental setup for the tests we ran in this gist:

https://gist.github.com/seliopou/2fb86817ec7dae2c61ff02b49b274031

Briefly, we have an Async-based client and server that support two commands: get-updates and ping. A get-updates request will cause the server to continuously stream bytes to the client until the client closes the connection. A ping request on the other hand will cause the server to send a short response and close the connection.

What we're testing, then, is the latency of pings to a server that is servicing a get-updates request. We'd expect a fair system to result in low ping times, on the order tens of milliseconds on average.

You can see the results in the gist linked above (units in seconds, and include process startup time for the client, so remove 50ms to approximate the real latency). Unfortunately, the tabular data is not rendered properly in Mantis.

As you can see, the baseline had a good median, bad average and tail, and a concordant high variance. sched_yield had worse medians but that won you a better average tail.

And as is clear from the table, nanosleep is just an across-the-board improvement. It's a clear winner in terms of thread fairness, though it doesn't really speak to the performance impact on the long-running operation. In addition it certainly has the downside that it makes the current thread unrunnable. How severe a penalty you take by making the current thread unrunnable really dependes on how much work the other threads in your system have to do before releasing the lock again. Ideally, we would just stay runnable but just go to the back of the run queue, which is exactly the behavior that was changed in sched_yield so long ago.

It might make sense to somehow expose the number of threads waiting on the lock. That way, decisions on how affect a yield can be left up to the application. In Async we approximate this using the number of that are in use from our thread pool. This quantity conflates threads that may be making progress, threads that are blocked in a system call, and threads that are blocked waiting to acquire the OCaml lock. In the absence of change to Thread.yield, it'd be nice to tease out just that latter component so that use cases like ours don't have to over-approximate the condition where yielding is a productive action.

vicuna · 2017-11-23T23:34:40Z

Comment author: @jhjourdan

Thanks for these experiments, that indeed confirm what I observed informally.

I think Xavier was rather interested in measurement of overhead in terms of computation time. If a worker thread (i.e., a thread that uses CPU time) gets interrupted 20 times per second and (at least) a context switch occurs each time, this has a cost. If this cost is indeed negligible (which I guess it should be, but I never did serious experiments), then I do not see a reason not doing this change.

I can try to do these experiments soon.

vicuna · 2017-11-26T15:05:42Z

Comment author: @xavierleroy

Yes, I'm interested in both kinds of numbers: latency and throughput.

I guess a simple throughput test would be to have N computational threads (that do enough allocation to properly yield when preempted) and just measure the overall execution time.

vicuna · 2017-12-08T16:36:12Z

Comment author: @jhjourdan

I made some throughput tests.

I tried to do what Xavier suggested, by executing in parallel N computational threads. However, the fact that several threads are actually executing at the same time (i.e., preempt each other) has an important effect on the allocation pattern, and hence on the performance of the GC. It gets quite difficult to get relevant numbers: depending on the computation, the version with preemption gets better or worse performances by as much as >5%.

Instead, I used a different setup: I measure the throughput of one computation thread running at the same time as an I/O thread. The computation thread is designed to both allocate and compute things. It does 1000 runs of a computation that lasts about 0.1s and then prints statistics about the timings.

The I/O thread essentially echoes on stdout numbers that it reads from stdin. I did experiments where I either give zero input to the program or a small amount of input data. The presence of a small amount of input does not seem to have any impact on performance. Giving it a large amount input would be obviously unfair, since the non-preemptive version would not handle I/Os while the preemptive version would.

The difference between the preemptive and the non-preemptive version is the presence of the following lines of code in the preemptive version of the code:
let preempt signal = Thread.delay 1e-6
let () =
Sys.set_signal Sys.sigvtalrm (Sys.Signal_handle preempt)
This effectively overrides Thread's sigvtalrm handler with mine. My handler does exactly what the new handler of Thread would do.

I attach my source file if you want more details about the actual code. all the experiments have been done with 4.06.0. I can't see any reason why trunk would make any difference here.

Results without preemption (i.e., current setup in OCaml 4.06.0):

USER : mean=0.127288 stddev=0.010402 conf.inter.95%=[0.126630;0.127945]
SYSTEM : mean=0.000076 stddev=0.000362 conf.inter.95%=[0.000053;0.000099]
REAL : mean=0.127776 stddev=0.010435 conf.inter.95%=[0.127116;0.128436]

Results with preemption:

USER : mean=0.127881 stddev=0.011377 conf.inter.95%=[0.127162;0.128600]
SYSTEM : mean=0.000071 stddev=0.000438 conf.inter.95%=[0.000043;0.000099]
REAL : mean=0.128547 stddev=0.011431 conf.inter.95%=[0.127825;0.129270]

Moreover, running the executable several times seems to indicate that the results are fairly reproducible: We cannot see any significant performance change. Even though the preemptive version seems a bit slower, it is by about 0.5%.

vicuna · 2017-12-13T13:40:52Z

Comment author: @damiendoligez

I think 0.5% is a small price to pay to get usable UI threads. Jacques-Henri, could you submit a GitHub PR and move the discussion there?

vicuna · 2018-01-12T12:52:34Z

Comment author: @xavierleroy

See #1533

vicuna closed this as completed Jan 12, 2018

vicuna added the threads label Mar 14, 2019

vicuna added this to the 4.06.1 milestone Mar 14, 2019

vicuna added the bug label Mar 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thread preemption #7669

Thread preemption #7669

vicuna commented Nov 9, 2017

vicuna commented Nov 9, 2017

vicuna commented Nov 12, 2017

vicuna commented Nov 16, 2017

vicuna commented Nov 22, 2017

vicuna commented Nov 23, 2017

vicuna commented Nov 26, 2017

vicuna commented Dec 8, 2017

vicuna commented Dec 13, 2017

vicuna commented Jan 12, 2018

Thread preemption #7669

Thread preemption #7669

Comments

vicuna commented Nov 9, 2017

Bug description

File attachments

vicuna commented Nov 9, 2017

vicuna commented Nov 12, 2017

vicuna commented Nov 16, 2017

vicuna commented Nov 22, 2017

vicuna commented Nov 23, 2017

vicuna commented Nov 26, 2017

vicuna commented Dec 8, 2017

vicuna commented Dec 13, 2017

vicuna commented Jan 12, 2018