Thread preemption #7669
Comments
Comment author: @mshinwell Jane Street did some experiments with using [nanosleep] (similar to your point 1) and, as far as I remember, the conclusion was that we would be keen for the runtime to provide an option to use that rather than the current [yield].
Comment author: @xavierleroy The fairness issue you mention is especially visible under Linux, because the Linux scheduler favors throughput over fairness, and because sched_yield() was made unusable in Linux 2.6. Windows is more fair as far as I can remember. I'm open to using nanosleep(1 ns) instead of sched_yield() under Linux, provided someone conducts serious experiments to measure the overhead, if any. But it's unclear to me why we would want fairness in thread scheduling and be willing to invest time and possibly degrade performance to get it. What is wrong with the behavior reported in this MPR?
Comment author: @jhjourdan
To me, one of the typical uses of threads, apart from performance, is to have a worker thread and a UI thread. This bug makes that impossible.
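The pattern described here might be sketched as follows. This is a minimal illustration, not code from the issue; `worker`, `ui`, and all the parameters are hypothetical:

```ocaml
(* A compute-bound worker plus a thread that must stay responsive.
   Under the unfair scheduling described in this issue, the responsive
   thread can be starved for long stretches. *)
let worker rounds =
  let rec loop acc n =
    (* compute-bound, but allocates, so preemption signals can be seen *)
    if n = 0 then acc else loop (sin (float n) :: acc) (n - 1)
  in
  for _i = 1 to rounds do ignore (loop [] 100_000) done

let ui stop =
  (* stands in for an event loop that must react promptly *)
  while not !stop do
    print_string "tick\n"; flush stdout;
    Thread.delay 0.05
  done

let () =
  let stop = ref false in
  let w = Thread.create worker 50 in
  let u = Thread.create ui stop in
  Thread.join w;       (* worker done: tell the "UI" thread to exit *)
  stop := true;
  Thread.join u
```

Build with `ocamlfind ocamlopt -package unix,threads.posix -linkpkg`; on an affected system one would expect the tick output to stall for long stretches while the worker runs.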
Comment author: seliopou There are a few different schemes that we considered and experimented with. The two that I have numbers for are:
[…]

In addition to these modifications, we modified the Async scheduler to call […].

You can find the experimental setup for the tests we ran in this gist: https://gist.github.com/seliopou/2fb86817ec7dae2c61ff02b49b274031

Briefly, we have an Async-based client and server that support two commands: […]. What we're testing, then, is the latency of pings to a server that is servicing a […]. You can see the results in the gist linked above (units are in seconds, and include process startup time for the client, so subtract 50ms to approximate the real latency). Unfortunately, the tabular data is not rendered properly in Mantis.

As you can see, the baseline had a good median, a bad average and tail, and a correspondingly high variance. And as is clear from the table, […].

It might make sense to somehow expose the number of threads waiting on the lock. That way, decisions on how to affect a yield can be left up to the application. In Async we approximate this using the number of […] that are in use from our thread pool. This quantity conflates threads that may be making progress, threads that are blocked in a system call, and threads that are blocked waiting to acquire the OCaml lock. In the absence of a change to […]
Comment author: @jhjourdan Thanks for these experiments, which indeed confirm what I observed informally. I think Xavier was rather interested in measurements of the overhead in terms of computation time. If a worker thread (i.e., a thread that uses CPU time) gets interrupted 20 times per second and (at least) a context switch occurs each time, this has a cost. If this cost is indeed negligible (which I guess it should be, but I never did serious experiments), then I do not see a reason not to make this change. I can try to do these experiments soon.
Comment author: @xavierleroy Yes, I'm interested in both kinds of numbers: latency and throughput. I guess a simple throughput test would be to have N computational threads (that do enough allocation to properly yield when preempted) and just measure the overall execution time.
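Such a test might be sketched as follows. This is an illustrative sketch, not the benchmark actually used later in the thread; `work`, `bench`, and the iteration counts are assumptions:

```ocaml
(* n compute-bound threads that allocate on every step (so preemption
   signals can be handled at GC checkpoints), timed to completion. *)
let work iterations =
  let acc = ref [] in
  for i = 1 to iterations do
    acc := float i :: !acc;                (* allocate on every step *)
    if i mod 10_000 = 0 then acc := []     (* keep memory bounded *)
  done

let bench n_threads iterations =
  let t0 = Unix.gettimeofday () in
  let threads =
    Array.init n_threads (fun _ -> Thread.create work iterations) in
  Array.iter Thread.join threads;
  Unix.gettimeofday () -. t0

let () =
  Printf.printf "4 threads: %.3f s\n%!" (bench 4 1_000_000)
```

Build with `ocamlfind ocamlopt -package unix,threads.posix -linkpkg` and compare wall-clock times with and without the tick-thread change.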
Comment author: @jhjourdan I made some throughput tests. I tried to do what Xavier suggested, executing N computational threads in parallel. However, the fact that several threads are actually executing at the same time (i.e., preempting each other) has an important effect on the allocation pattern, and hence on the performance of the GC. It gets quite difficult to get relevant numbers: depending on the computation, the version with preemption performs better or worse by more than 5%.

Instead, I used a different setup: I measure the throughput of one computation thread running at the same time as an I/O thread. The computation thread is designed to both allocate and compute. It does 1000 runs of a computation that lasts about 0.1s and then prints statistics about the timings. The I/O thread essentially echoes on stdout numbers that it reads from stdin. I did experiments where I either give zero input to the program or a small amount of input data. The presence of a small amount of input does not seem to have any impact on performance. Giving it a large amount of input would obviously be unfair, since the non-preemptive version would not handle I/O while the preemptive version would.

The difference between the preemptive and the non-preemptive version is the presence of the following lines of code in the preemptive version: […]. I attach my source file if you want more details about the actual code. All the experiments have been done with 4.06.0; I can't see any reason why trunk would make any difference here.

Results without preemption (i.e., the current setup in OCaml 4.06.0):
USER : mean=0.127288 stddev=0.010402 conf.inter.95%=[0.126630;0.127945]
Results with preemption:
USER : mean=0.127881 stddev=0.011377 conf.inter.95%=[0.127162;0.128600]

Moreover, running the executable several times seems to indicate that the results are fairly reproducible. We cannot see any significant performance change: even though the preemptive version seems a bit slower, it is only by about 0.5%.
Comment author: @damiendoligez I think 0.5% is a small price to pay to get usable UI threads. Jacques-Henri, could you submit a GitHub PR and move the discussion there?
Comment author: @xavierleroy See #1533
Original bug ID: 7669
Reporter: @jhjourdan
Status: resolved (set by @xavierleroy on 2018-01-12T12:53:01Z)
Resolution: fixed
Priority: normal
Severity: minor
Platform: x86-64
OS: Linux
Target version: 4.06.1+dev/rc1/rc2
Fixed in version: 4.07.0+dev/beta2/rc1/rc2
Category: threads
Monitored by: @gasche @ygrek
Bug description
When using the systhread library with at least one computation-intensive thread, the scheduler has very bad behavior. Preemption happens only rarely. This is, for example, demonstrated by the following program when run in native mode:
let rec generate_list n =
  let rec aux acc = function
    | 0 -> acc
    | n -> aux (float n :: acc) (n-1)
  in
  aux [] n

let long_list = generate_list 1000000
let short_list = generate_list 100

let rec slow_loop () =
  Printf.printf "SLOW %d\n%!" (List.length (List.rev_map sin long_list));
  slow_loop ()

let rec fast_loop () =
  Printf.printf "FAST %d\n%!" (List.length (List.rev_map sin short_list));
  fast_loop ()

let () =
  ignore (Thread.create slow_loop ());
  fast_loop ()
In the output, there are roughly as many "FAST" lines as "SLOW" lines, meaning that preemption only happens when printing to standard output. I would expect to see many more "FAST" lines than "SLOW" lines (i.e., the slow computation should be preempted by the fast one even if the slow one does not do any I/O).
Note that both computations allocate memory through the GC, so the signals used for preemption can indeed be handled. This can be seen by adding the following lines at the beginning of the previous code, which restores the expected behavior:
let preempt _ = Thread.delay 1e-6

let _ =
  Sys.set_signal Sys.sigvtalrm (Sys.Signal_handle preempt);
  ignore (Unix.setitimer Unix.ITIMER_VIRTUAL
            { Unix.it_interval = 1e-2; Unix.it_value = 1e-2 })
I.e., we periodically force preemption by suspending the currently executing thread for a very short time.
My analysis of the problem is that there are two flaws in the implementation of the systhread library:
1- The "tick thread" regularly sends a signal so that yield() is called regularly. But on Linux, the only thing yield() does is release and re-acquire the master lock. Since sched_yield() apparently doesn't do what we want here, one solution could be to use usleep(1) instead. My experiments seem to indicate that the loss in performance is not measurable.
2- Actually, 1- would be fine if the implementation of the master lock guaranteed fairness. But it doesn't. Another solution would therefore be to use a fair implementation for the master lock (such as a ticket lock). This should not make the implementation much more complicated, given that the master lock is already implemented "by hand".
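For illustration only, the fairness idea behind a ticket lock can be modeled in OCaml. The real master lock lives in the C runtime, so this sketch is just a model of the scheme, not the proposed implementation: each thread takes a ticket and is served strictly in arrival order.

```ocaml
(* A FIFO-fair "ticket lock" modeled with a mutex and a condition
   variable: [next] is the next ticket to hand out, [serving] is the
   ticket currently allowed to proceed. *)
type ticket_lock = {
  m : Mutex.t;
  c : Condition.t;
  mutable next : int;
  mutable serving : int;
}

let create () =
  { m = Mutex.create (); c = Condition.create (); next = 0; serving = 0 }

let lock l =
  Mutex.lock l.m;
  let t = l.next in          (* take a ticket *)
  l.next <- l.next + 1;
  while l.serving <> t do    (* wait until it is our turn *)
    Condition.wait l.c l.m
  done;
  Mutex.unlock l.m

let unlock l =
  Mutex.lock l.m;
  l.serving <- l.serving + 1;     (* admit the next ticket holder *)
  Condition.broadcast l.c;
  Mutex.unlock l.m
```

Because tickets are granted in the order they were taken, a thread that releases the lock and immediately asks for it again goes to the back of the queue, which is exactly the fairness property the current hand-rolled master lock lacks.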
File attachments