Version française
Home     About     Download     Resources     Contact us    
Browse thread
OCaml Summer Project decisions are in
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Berke Durak <berke.durak@e...>
Subject: Re: [Caml-list] OCaml Summer Project decisions are in
Xavier Leroy wrote:

> By Amdahl's law, a stop-the-world GC cannot scale well, but it might
> be good enough for 2 to 4 cores.  For instance, assuming a sequential
> program that spends 25% of its time in garbage collection and
> parallelizes with low overhead, you would get a speedup of 2 on
> a 4-core machine.

No a very bright prospect.  Many people already have dual cores under their desks,
so quad cores should be the norm in a year or two.  I'd hate to see Ocaml
become one of the slower languages.

> A related question is how long it takes to stop a
> large number of threads.  But only experimentation can tell.
 > ..
 > Another crucial point is the ability to stop all threads and obtain
 > their roots, which is not obvious at all and a prerequisite for any
 > form of concurrent garbage collection.

Stopping the world in our heavily threaded ExaVM for GC is quite painful
since the threads often run in native C/C++ code which naturally
has no business checking the stop condition as often as Exascript code,
giving very poor performance on multi-core machines (we got something like
10-15% CPU usage on a 32-core machine)

XL:
> There are two points that I'd like to emphasize. The first step towards
> any form of true parallelism in Caml (including message-passing
> between threads having their own heaps and sequential GCs) is to clean
> up the numerous global C variables that the run-time system uses.
> But maybe this was implicit in your point 1.

Besides moving the interpreter/runtime global state into some datastructure,
what about emitted asm code that references things like caml_young_limit?
Those would have to be made thread-specific.  Why not allocate one or two
of these extra x86-64 registers for storing those?

That could also be an occasion to "librarify" the interpreter in order to
allow multiple Ocaml interpreters/run-times to execute in the same process (with
disjoint heaps).

That way one could embed multiple Ocaml libraries in programs written in other
languages (which could allow easier monetization of Ocaml software ;).

Also, we could explore alternate paths to paralellism, such as running multiple
Ocaml interpreters/runtimes in different threads with disjoint heaps.
Those threads would run truly in parallel on different CPUs.


Then, we could try different custom solutions for cheap inter-thread communication.

For example, we could write a Marshal derivative that emits a simpler inter-thread
representation.  It would work under the hypothesis that the deserializing process
runs the same code as the serializing one, on the same machine, at the same time.
This would allow to remove some serialization overhead:
   - No need to compact integers for a byte-based serialization format
   - No need to worry about endianness or float representations
   - No need to outputting or verify program MD5s for functions
   - Native data could be passed as-is.

Thus we would have one process running n threads, but those threads having disjoint
heap spaces, allowing for fully parallel GC and execution (distinct heap spaces mean reduced
cache/bus contention), exchanging data in shared memory thru an inter-thread
Marshal module.

(Note that using n processes with shared memory would not be
as convenient:
   - You'd need to set up shared memory segment using IPC, /dev/shm or other
unportable, OS-specific mechanisms.
   - You won't be able to use native multi-threaded libraries
   - You won't be able to share file desriptors or common C code/data.
   - You won't be able to use inter-thread signaling/synchronization mechanisms,
for your implementation)

People would then write libraries for shared immutable data, or a shared store,
probably using a STM-like approach.  Those could probably be written in Ocaml.
Also some POSIX thread synchronization mechanisms (mutexes, etc.) could be exposed
as Ocaml types.

As for the interpreter threads, they could use lightweight threads or classic Ocaml
serializing threads for their internal non-parallel concurrency.

I personally don't believe much in fine-grained parallelism using threads and locks
because it makes the GC difficult, relies on the notoriously difficult to use and
non-composable synchronization mechanisms such as locks and mutexes, doesn't
scale to distributed computing, and doesn't run as fast as most people think it would
because of bus/cache contention.

> At any rate, it's probably better not to ask too many questions at
> this point and let you concentrate on the task at hand.  Feel free to
> contact me and Damien Doligez for specific questions or general
> advice.

I certainly don't want to discourage anyone from writing a parallel GC!

-- 
Berke DURAK