> IIRC, Linux native pthreads, for one, aren't at the moment meant to
> be used in huge numbers---the flood-test performance of those Linux
> Java VMs which map Java threads onto them supposedly suffers compared
> with those that don't. But Xavier would be able to tell us more about
> that :).
My pleasure :-) It is true that threads in LinuxThreads are not as
lightweight as some would like, mostly because they map 1-to-1 to
kernel processes, and the Linux kernel limits the number of processes
to 512 (for the super-user) or 256 (for normal users). Also, kernel
scheduling takes time linear in the number of processes (just like the
OCaml bytecode-thread scheduler, which was strongly inspired by that
of the Linux kernel), so context switching times increase as more
threads are run.
More generally, there are two very different viewpoints on threads
that call for radically different implementations.
The first viewpoint is that threads are a convenient abstraction for
fully exploiting some hardware features, such as multiple processors
and the overlapping of I/O and computation. You can exploit multiple
processors and overlap I/O without threads using e.g. communicating
Unix processes and asynchronous I/O. But it's a pain to program;
threads make the programming of such applications much easier.
In this viewpoint, you never need huge numbers of threads. For heavy
computation on a multiprocessor, N + 1 or N + 2 threads, where N is
the number of processors, delivers optimal throughput; you're not
going to run any faster with more threads. For overlapped I/O on
PC-class hardware, 20 threads or so are plenty: if you overlap more
than 20 I/O requests at the same time, the disks and network cards
won't keep up, and throughput will not increase either.
Both LinuxThreads and, to a lesser extent, Caml threads were designed
with that kind of application in mind. Caml threads can't exploit
multiprocessors because the runtime system isn't safe
w.r.t. concurrent execution, but they have proven quite effective for
overlapped I/O, e.g. in the V6 intelligent Web proxy.
The other viewpoint is that threads are a code structuring facility,
making it easier to write programs that conceptually perform a lot of
things concurrently, e.g. in response to external stimuli. In this
viewpoint, threads should be as lightweight as possible (starting a
thread shouldn't be much more expensive than calling a function), and
limited in number only by available memory.
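
A toy sketch of that code-structuring style, for illustration only: a
TCP echo server that starts one Caml thread per accepted connection,
so each connection is written as straight-line sequential code. The
port number and buffer size are arbitrary choices for the example.

  (* Sketch: one thread per external stimulus (here, per connection).
     Each connection is handled by ordinary sequential code; the
     thread library multiplexes them. *)

  let handle_client fd =
    let buf = Bytes.create 4096 in
    let rec loop () =
      let n = Unix.read fd buf 0 (Bytes.length buf) in
      if n > 0 then begin
        ignore (Unix.write fd buf 0 n);   (* echo back what we read *)
        loop ()
      end
    in
    (try loop () with _ -> ());
    Unix.close fd

  let () =
    let sock = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
    Unix.setsockopt sock Unix.SO_REUSEADDR true;
    Unix.bind sock (Unix.ADDR_INET (Unix.inet_addr_any, 9000));
    Unix.listen sock 32;
    while true do
      let (fd, _) = Unix.accept sock in
      (* One thread per connection: convenient, but each thread has a
         real cost, which is why this style wants very light threads. *)
      ignore (Thread.create handle_client fd)
    done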
The sad fact is that there doesn't exist any implementation technique
for threads that satisfies both viewpoints at once. Very lightweight
threads do exist, see e.g. the call/cc-based threads of SML/NJ, but
entail significant performance penalties not only on I/O, but also on
the actual running speed of the sequential code. "Heavyweight"
threads such as LinuxThreads or Win32 threads are very efficient
w.r.t. I/O, but are expensive to start and to context-switch.
Two-level thread libraries are a compromise that is appealing on
paper, but not that much "lighter" than pure kernel-based threads in
reality.
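
For the curious, here is a rough sketch of the kind of very
lightweight threads mentioned above, written in continuation-passing
style in OCaml rather than with SML/NJ's call/cc. It is only meant to
show why such threads are cheap to create (a thread is just a closure
on a queue) and why they penalize sequential code and blocking I/O
(everything must be written to yield explicitly, and one blocking
system call stalls the whole program).

  (* Sketch of ultra-light cooperative threads: a thread is just a
     closure sitting in a run queue. *)

  let ready : (unit -> unit) Queue.t = Queue.create ()

  let spawn f = Queue.add f ready

  (* Give up the processor: enqueue the continuation [k] and fall
     back into the scheduler loop. *)
  let yield k = Queue.add k ready

  let run () =
    while not (Queue.is_empty ready) do
      (Queue.pop ready) ()
    done

  (* A toy thread: counts to 3, yielding between steps. *)
  let rec counter name i () =
    if i <= 3 then begin
      Printf.printf "%s: %d\n" name i;
      yield (counter name (i + 1))
    end

  let () =
    spawn (counter "A" 1);
    spawn (counter "B" 1);
    run ()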
Add to this the fact that the primitives provided by the underlying OS
affect thread libraries quite a lot. For instance, some Unixes
provide asynchronous I/O notification via signals, which could be used
to speed up the Caml bytecode-thread scheduler, but not all Unixes
do. Also, if we were certain that the underlying OS
provides native threads, we could build a two-level scheduler for
bytecode threads that would probably be more efficient. But of course
we have to shoot for the lowest common denominator.
So, don't expect miracles from Caml threads, either bytecode or
system. As Michael Hicks said, the current bytecode thread scheduler
could be improved to run in time proportional to the number of threads
waiting on I/O, rather than to the total number of threads. That would help
if most of your threads are blocked on mutexes, conditions or event
channels, but not if most of your threads perform I/O.
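
The data-structure idea behind that improvement can be sketched as
follows. This is not the actual scheduler code, just an illustration:
threads blocked on mutexes or conditions sit on per-object wait lists
that the scheduler never scans, so the select step only touches the
threads that are actually waiting on I/O.

  (* Illustration only, not the real bytecode-thread scheduler. *)

  type thread = { mutable state : state }
  and state =
    | Runnable
    | Blocked_on_io of Unix.file_descr  (* examined at each select *)
    | Blocked_on_sync                   (* parked on some mutex's or
                                           condition's own wait list;
                                           never scanned here *)

  (* One scheduling step: cost proportional to the number of threads
     waiting on I/O, not to the total number of threads. *)
  let io_step (io_waiters : thread list) =
    let fds =
      List.filter_map
        (fun t ->
           match t.state with Blocked_on_io fd -> Some fd | _ -> None)
        io_waiters
    in
    let ready_fds, _, _ = Unix.select fds [] [] 0.0 in
    List.iter
      (fun t ->
         match t.state with
         | Blocked_on_io fd when List.mem fd ready_fds ->
             t.state <- Runnable
         | _ -> ())
      io_waiters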
> (Of course, one could always ask the ocaml team that won the ICFP
> competition in such spectacular fashion to knock off an
> ocaml-to-event-driven-FSM compiler ...)
I'm not sure that OCaml is the best input language for generating
FSMs, but, yes, generating FSMs from a high-level concurrent language
is an excellent approach. You get both ultimate execution speed and a
readable, maintainable description of the program. Have a look for
instance at Esterel (another glorious INRIA product :-))
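
To give a flavour of what the target of such a compiler looks like
when written by hand, here is a sketch of an event-driven FSM in
OCaml; the protocol, states and events are invented for the example.
The concurrency is flattened into explicit states and a single
dispatch function, which is fast but much less readable than the
threaded or Esterel-style source it would be generated from.

  (* Hand-written event-driven FSM, the kind of code one would like
     a compiler to generate from a higher-level description. *)

  type state =
    | Await_header
    | Await_body of int          (* bytes still expected *)
    | Done

  type event =
    | Bytes_arrived of int
    | Connection_closed

  let step state event =
    match state, event with
    | Await_header, Bytes_arrived n when n >= 4 ->
        (* pretend the 4-byte header announces a 128-byte body *)
        Await_body 128
    | Await_body remaining, Bytes_arrived n when n >= remaining ->
        Done
    | Await_body remaining, Bytes_arrived n ->
        Await_body (remaining - n)
    | _, Connection_closed ->
        Done
    | s, _ ->
        s

  let () =
    let final =
      List.fold_left step Await_header
        [ Bytes_arrived 4; Bytes_arrived 100; Bytes_arrived 28 ]
    in
    match final with
    | Done -> print_endline "transfer complete"
    | _ -> print_endline "still waiting"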
- Xavier Leroy