| From: | Xavier Leroy <Xavier.Leroy@i...> |
| Subject: | Re: VLIW & caml: how? |
> I've been reading that VLIW as implemented on the IA-64/Merced will pose
> problems for conventional compilers such as gcc which don't have a very
> expansive view of the code they're compiling. How well will o'caml deal
> with optimizing for this sort of architecture? Any thoughts?

It's hard to say anything precise until Intel releases detailed documentation on the IA64 instruction set.

If your question is about instruction-level parallelism (ILP) in general, it must be noted that today's superscalar architectures (such as the Alpha 21264 and the PowerPC 604) already offer more parallelism (i.e. 4 instructions issued per cycle) than can be exploited by most compiled programs. This is due in part to insufficient optimizations in compilers (extracting ILP from sequential code might require significant program transformations) and in part to the fact that many programs simply do not contain enough parallelism by nature of the algorithms used. Often, the only way to fully exploit the resources of those superscalar processors is to write carefully tuned assembly code by hand...

Code generated by ocamlopt has characteristics similar to the so-called "commercial workload" subset of Spec95, i.e. high number of memory accesses, low to medium ILP, and relatively high CPI (cycles per instruction). This is not surprising, as hardware manufacturers generally increase ILP by throwing in more integer and floating-point ALUs, which are not useful for most Caml applications, but don't increase the number of load-store units, which would be good for Caml but is very hard to implement in hardware.

However, there is some hope that the clean semantics of Caml might allow more aggressive scheduling of memory accesses than is possible with e.g. C programs. In particular, the type system gives a lot of non-aliasing properties "for free" (e.g. a load from an immutable data structure cannot interfere with a non-initializing store). See my PLDI'98 tutorial for more details (http://pauillac.inria.fr/~xleroy/).
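To make the non-aliasing remark concrete, here is a minimal sketch. The type `point` and the functions `sum_reload` and `sum_hoisted` are hypothetical names invented for illustration, and nothing here claims that ocamlopt actually performs this transformation; it only shows why the transformation would be legal.

```ocaml
(* Hypothetical record with immutable fields (illustration only). *)
type point = { x : int; y : int }

(* Straightforward version: the fields of [p] are loaded on every
   iteration, interleaved with stores to the mutable cell [acc]. *)
let sum_reload (p : point) (acc : int ref) (n : int) : unit =
  for _i = 1 to n do
    acc := !acc + p.x + p.y   (* non-initializing store to [acc] *)
  done

(* Hand-hoisted version: the loads of [p.x] and [p.y] are moved out of
   the loop. This is safe because the fields are immutable, so no store
   through [acc] can change them. *)
let sum_hoisted (p : point) (acc : int ref) (n : int) : unit =
  let d = p.x + p.y in
  for _i = 1 to n do
    acc := !acc + d
  done
```

In the analogous C code, with `p` a pointer to a struct and `acc` an `int *`, a compiler would first have to prove that `acc` does not point into `*p` before moving the loads; here the immutability declared in the type already rules that out.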
But again, such aggressive scheduling of memory accesses is useful only if the hardware supports many pending memory accesses simultaneously.

All in all, I'm not expecting much of a speedup from ILP. The important speedups we've observed on Caml programs when moving from older architectures (e.g. the Alpha 21064) to newer ones (e.g. the Alpha 21164 or PowerPC G3) are due to better caches and faster memory subsystems much more than to increased on-chip parallelism.

- Xavier Leroy