thousands of CPU cores
Date: 2008-07-15 (14:39)
From: Kuba Ober <ober.14@o...>
Subject: Re: [Caml-list] thousands of CPU cores
> > It is a stop-gap solution...
> That is not true. Many-core machines will always be decomposed into
> shared-memory clusters of as many cores as possible because shared memory
> parallelism will always be orders of magnitude more efficient than
> distributed parallelism.

On today's systems, "shared memory" is already implemented in hardware as
what is essentially message passing. It's just that hardwired logic does it
all and presents an impression of shared memory, rather than having software
deal with it.

The fact that the software sees it as shared memory doesn't change the fact
that at current system bandwidths we've already run into physical 
implementation limits that make the smooth, fully-random-access memory
a mere illusion. When you read a single uncached byte out of RAM,
there's a big bunch of housekeeping and what-amounts-to-transactional
processing done at the hardware level.
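
To make that housekeeping a bit more concrete, here is a toy model of a
"shared memory" built out of explicit messages. Everything in it is a
made-up illustration (the ToySystem class, the message names, the
write-through broadcast-invalidate scheme); real coherence protocols are far
more elaborate. It only shows that a plain-looking load or store can hide
several messages on the interconnect.

```python
class ToySystem:
    """Toy model: a 'shared memory' built from explicit messages.

    Hypothetical and heavily simplified (write-through, broadcast
    invalidate); not a description of any real coherence protocol.
    """

    def __init__(self, n_cores, mem_size):
        self.memory = bytearray(mem_size)
        self.caches = [dict() for _ in range(n_cores)]  # per-core addr -> value
        self.messages = []  # every (src, dst, kind, addr) message sent

    def _send(self, src, dst, kind, addr):
        self.messages.append((src, dst, kind, addr))

    def load(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            # Cache miss: a request/response round trip to memory.
            self._send(f"core{core}", "mem", "read_req", addr)
            self._send("mem", f"core{core}", "data", addr)
            cache[addr] = self.memory[addr]
        return cache[addr]

    def store(self, core, addr, value):
        # Tell every other core to drop its copy, and collect the acks.
        for other, cache in enumerate(self.caches):
            if other != core:
                self._send(f"core{core}", f"core{other}", "invalidate", addr)
                cache.pop(addr, None)
                self._send(f"core{other}", f"core{core}", "ack", addr)
        # Write through to memory and keep a local copy.
        self._send(f"core{core}", "mem", "write", addr)
        self.memory[addr] = value
        self.caches[core][addr] = value


system = ToySystem(n_cores=2, mem_size=16)
system.store(0, 3, 42)       # core 0 writes one byte
value = system.load(1, 3)    # core 1 reads the "shared" byte: a miss
again = system.load(1, 3)    # second read hits core 1's cache, no messages
print(value, again, len(system.messages))
```

The software-visible interface is just load and store; the message log is
where the cost shows up, and only the cached second read is free.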

If you count the "efficiency" of such an out-of-the-blue, uncached, truly
random access in clock cycles, current hardware may be 1-2 orders of
magnitude less efficient than almost any 8-bit microcontroller out there...
On most MCUs you can read a random byte out of SRAM in, say, 1-4 clock
cycles. On your commonplace modern multicore CPU, the same read may take a
hundred clock cycles, and essentially the same amount of wall-clock time
(a 2 GHz CPU's clock is only 100 times faster than that of a run-of-the-mill
20 MHz MCU).
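
The arithmetic behind that comparison can be checked in a few lines. This is
only a back-of-the-envelope sketch using the rough cycle counts and clock
rates quoted above, not measurements of any particular chip; the helper name
access_time_ns is made up for the illustration.

```python
# Rough figures from the text above, not measurements.
MCU_CLOCK_HZ = 20e6   # run-of-the-mill 20 MHz microcontroller
CPU_CLOCK_HZ = 2e9    # 2 GHz multicore CPU


def access_time_ns(cycles, clock_hz):
    """Wall-clock time in nanoseconds for `cycles` clock cycles at `clock_hz`."""
    return cycles * 1e9 / clock_hz


mcu_best = access_time_ns(1, MCU_CLOCK_HZ)     # 1-cycle SRAM read
mcu_worst = access_time_ns(4, MCU_CLOCK_HZ)    # 4-cycle SRAM read
cpu_miss = access_time_ns(100, CPU_CLOCK_HZ)   # ~100-cycle uncached read
clock_ratio = CPU_CLOCK_HZ / MCU_CLOCK_HZ

print(f"MCU: {mcu_best:.0f}-{mcu_worst:.0f} ns per random byte")
print(f"CPU: {cpu_miss:.0f} ns per uncached random byte")
print(f"clock ratio: {clock_ratio:.0f}x")
```

With these numbers the cold read on the fast CPU lands in the same tens of
nanoseconds as the MCU's SRAM read, despite the 100x faster clock.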

What I'm trying to say is that such random, small memory accesses highlight
the inherent message-passing / transactional overhead of the hardware
implementation. Those overheads are amortized, of course, when you run real
number-crunching tasks rather than a made-up, cold, single-byte access. But
they are there.

It's akin to a mmapped file: you can use the CPU's MMU to implement it in the
usual OS/stock-hardware framework, or you can have an FPGA handle the memory
transactions and talk directly to the hard drive. Either way, it's still
a mmapped file :)
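
For anyone who hasn't used one, this is roughly what a mmapped file looks
like from the software side, sketched with Python's standard mmap module.
The scratch file and its contents are made up for the illustration. The
point is that the access is written as plain indexing, whatever machinery
ends up servicing it underneath.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (hypothetical contents, illustration only).
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello, caml-list")
    # Map the file read-only; a length of 0 means "the whole file".
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
        # These look like ordinary memory reads; the OS may still have to
        # fetch the page from storage to satisfy them.
        first_byte = m[0]
        word = m[7:11]
    print(first_byte, word)
finally:
    os.close(fd)
    os.remove(path)
```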

Cheers, Kuba