Re: Interpreter vs hardware threads

From: Xavier Leroy (Xavier.Leroy@inria.fr)
Date: Mon Mar 06 2000 - 18:35:46 MET

  • Next message: Julian Assange: "tueareg sym-lock and xemacs21"

    > IIRC, Linux native pthreads, for one, aren't at the moment meant to
    > be used in huge numbers---the flood-test performance of those Linux
    > Java VMs which map Java threads onto them supposedly suffers compared
    > with those that don't. But Xavier would be able to tell us more about
    > that :).

    My pleasure :-) It is true that threads in LinuxThreads are not as
    lightweight as some would like, mostly because they map 1-to-1 to
    kernel processes, and the Linux kernel limits the number of processes
    to 512 (for the super-user) or 256 (for normal users). Also, kernel
    scheduling takes time linear in the number of processes (just like the
    OCaml bytecode-thread scheduler, which was strongly inspired by that
    of the Linux kernel), so context switching times increase as more
    threads are run.

    More generally, there are two very different viewpoints on threads
    that call for radically different implementations.

    The first viewpoint is that threads are a convenient abstraction to
    exploit fully some hardware features, such as multiple processors and
    overlapping of I/O and computation. You can exploit multiple
    processors and overlap I/O without threads using e.g. communicating
    Unix processes and asynchronous I/O. But it's a pain to program;
    threads make the programming of such applications much easier.

    In this viewpoint, you never need huge numbers of threads. For heavy
    computation on a multiprocessor, N + 1 or N + 2 threads, where N is
    the number of processors, delivers optimal throughput; you're not
    going to run any faster with more threads. For overlapped I/O on
    PC-class hardware, 20 threads or so are plenty: if you overlap more
    than 20 I/O requests at the same time, the disks and network cards
    won't follow, and throughput will not be increased either.

    Both LinuxThreads and to a lesser extent Caml threads were designed
    with that kind of applications in mind. Caml threads can't exploit
    multiprocessors because the runtime system isn't safe
    w.r.t. concurrent execution, but they have proven quite effective for
    overlapped I/O, e.g. in the V6 intelligent Web proxy.

    The other viewpoint is that threads are a code structuring facility,
    making it easier to write programs that conceptually perform a lot of
    things concurrently, e.g. in response to external stimuli. In this
    viewpoint, threads should be as lightweight as possible (starting a
    thread shouldn't be much more expensive than calling a function), and
    limited in number only by available memory.

    The sad fact is that there doesn't exist any implementation technique
    for threads that satisfy both viewpoints at once. Very lightweight
    threads do exist, see e.g. the call/cc-based threads of SML/NJ, but
    entail significant performance penalties not only on I/O, but also on
    the actual running speed of the sequential code. "Heavyweight"
    threads such as LinuxThreads or Win32 threads are very efficient
    w.r.t. I/O, but are expensive to start and to context-switch.
    Two-level thread libraries are a compromise that is appealing on
    paper, but not that much "lighter" than pure kernel-based threads in
    reality.

    Add to this the fact that the primitives provided by the underlying OS
    affect quite a lot thread libraries. For instance, some Unixes
    provide async I/O notification via signals, and that could be used to
    speed up the Caml bytecode-thread scheduler, but not all Unixes
    provide them. Also, if we were certain that the underlying OS
    provides native threads, we could build a two-level scheduler for
    bytecode threads that would probably be more efficient. But of course
    we have to shoot for the lowest common denominator.

    So, don't expect miracles from Caml threads, either bytecode or
    system. As Michael Hicks said, the current bytecode thread scheduler
    could be improved to run in time proportional to the number of threads
    waiting on I/O, rather than on the number of threads. That would help
    if most of your threads are blocked on mutexes, conditions or event
    channels, but not if most of your threads perform I/O.

    > (Of course, one could always ask the ocaml team that won the ICFP
    > competition in such spectacular fashion to knock off an
    > ocaml-to-event-driven-FSM compiler ...)

    I'm not sure that OCaml is the best input language for generating
    FSMs, but, yes, generating FSMs from a high-level concurrent language
    is an excellent approach. You get both ultimate execution speed and a
    readable, maintainable description of the program. Have a look for
    instance at Esterel (another glorious INRIA product :-))

    - Xavier Leroy



    This archive was generated by hypermail 2b29 : Mon Mar 06 2000 - 19:04:31 MET