Re: VLIW & caml: how?

From: Xavier Leroy (Xavier.Leroy@inria.fr)
Date: Wed Sep 02 1998 - 19:23:56 MET DST


Date: Wed, 2 Sep 1998 19:23:56 +0200
From: Xavier Leroy <Xavier.Leroy@inria.fr>
To: Todd Graham Lewis <tlewis@mindspring.net>, caml-list@inria.fr
Subject: Re: VLIW & caml: how?
In-Reply-To: <Pine.LNX.3.96.980828011145.1695L-100000@reflections.eng.mindspring.net>; from Todd Graham Lewis on Fri, Aug 28, 1998 at 01:18:34AM -0400

> I've been reading that VLIW as implemented on the IA-64/Merced will pose
> problems for conventional compilers such as gcc which don't have a very
> expansive view of the code they're compiling. How well will o'caml deal
> with optimizing for this sort of architecture? Any thoughts?

It's hard to say anything precise until Intel releases detailed
documentation on the IA64 instruction set.

If your question is about instruction-level parallelism (ILP) in
general, it must be noted that today's superscalar architectures (such
as the Alpha 21264 and the PowerPC 604) already offer more parallelism
(i.e. 4 instructions issued per cycle) than can be exploited by most
compiled programs. This is due in part to insufficient optimizations in
compilers (extracting ILP from sequential code might require
significant program transformations) and in part to the fact that many
programs simply do not contain enough parallelism by nature of the
algorithms used. Often, the only way to exploit fully the resources
of those superscalar processors is to write carefully tuned assembly
code by hand...
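
A minimal OCaml sketch of the two cases above, under stated assumptions:
the function names sum_list and sum_array are hypothetical, and the
comments describe what the data dependencies permit, not what any
particular compiler or processor actually does. Walking a list is
serial by nature of the algorithm; summing an array exposes
independent work only after the loop is rewritten by hand.

```ocaml
(* Walking a list: the address of the next cell is known only after the
   current cell has been loaded, so the loads form one serial chain. *)
let rec sum_list acc = function
  | [] -> acc
  | x :: rest -> sum_list (acc + x) rest

(* Summing an array with two accumulators: the two additions in one
   iteration do not depend on each other, so they could in principle
   issue in the same cycle. *)
let sum_array (a : int array) : int =
  let n = Array.length a in
  let s0 = ref 0 and s1 = ref 0 in
  let i = ref 0 in
  while !i + 1 < n do
    s0 := !s0 + a.(!i);
    s1 := !s1 + a.(!i + 1);
    i := !i + 2
  done;
  (* Odd length: one element is left over. *)
  if !i < n then s0 := !s0 + a.(!i);
  !s0 + !s1
```

Both functions compute the same sum; only the shape of the dependency
graph differs.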

Code generated by ocamlopt has characteristics similar to the
so-called "commercial workload" subset of Spec95, i.e. high number
of memory accesses, low to medium ILP, and relatively high CPI. This
is not surprising, as hardware manufacturers generally increase ILP by
throwing in more integer and floating-point ALUs, which are not useful for
most Caml applications, but don't increase the number of load-store
units, which would be good for Caml but is very hard to implement in
hardware.

However, there is some hope that the clean semantics of Caml might
allow more aggressive scheduling of memory accesses than is possible
with e.g. C programs. In particular, the type system gives a lot of
non-aliasing properties "for free" (e.g. a load from an immutable data
structure cannot interfere with a non-initializing store). See my
PLDI'98 tutorial for more details (http://pauillac.inria.fr/~xleroy/).
But again, this can be useful only if the hardware supports many
pending memory accesses simultaneously.
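
A minimal sketch of the non-aliasing property, assuming two
hypothetical record types (point, immutable, and counter, with a
mutable field) and a hypothetical function bump_and_read. It shows
what the type system permits; whether a given version of ocamlopt
exploits it is a separate question.

```ocaml
(* Hypothetical types for illustration only. *)
type point = { x : int; y : int }        (* all fields immutable *)
type counter = { mutable hits : int }    (* one mutable field *)

let bump_and_read (p : point) (c : counter) : int =
  let before = p.x in
  (* Non-initializing store to a mutable field. *)
  c.hits <- c.hits + 1;
  (* p.x is immutable, so the store above cannot have changed it: this
     load may be moved above the store or merged with [before]. *)
  let after = p.x in
  before + after

let () =
  let p = { x = 20; y = 1 } in
  let c = { hits = 0 } in
  assert (bump_and_read p c = 40);
  assert (c.hits = 1)
```

In the C analogue, with two int pointers in place of p and c, the
pointers might refer to the same location, so the second load could
not be moved or merged without further analysis.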

All in all, I'm not expecting much speedup from ILP. The important
speedups we've observed on Caml programs when moving from older
architectures (e.g. the Alpha 21064) to newer ones (e.g. the Alpha
21164 or PowerPC G3) are due to better caches and faster memory
subsystems much more than to increased on-chip parallelism.

- Xavier Leroy


