From: Xavier Leroy <xavier.leroy@i...>
Subject: Re: [Caml-list] native code optimization priorities
> > I have
> > vague ideas about things that could be done, e.g. a Pentium-4 back-end
> > that would use SSE2 registers for floating-point, but this is all
> > low priority.
> 
> May I ask: if you ever did implement this, would you limit it to some
> P4-specific technique? I've idly toyed with the idea of implementing
> something for AltiVec on the G4.

I'm afraid I wasn't clear enough: the first step would be to use SSE2
registers as normal floating-point registers, storing only one float
per register and performing scalar floating-point operations.  This
alone would already improve float performance quite a lot compared with
the current x86 float stack.  Other processors do not need this hack,
because they already have a sensible register-based float architecture.
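To make the point concrete, here is an ordinary float-heavy OCaml kernel of the kind such a back-end would speed up.  This is illustrative only, not compiler code; nothing in it is back-end specific.

```ocaml
(* A float-intensive loop: each iteration is a load, a multiply and an
   add.  On the x87 float stack these operations force fxch register
   shuffles; with SSE2 scalar registers they map directly onto
   movsd/mulsd/addsd with no shuffling. *)
let dot (a : float array) (b : float array) : float =
  let n = min (Array.length a) (Array.length b) in
  let acc = ref 0.0 in
  for i = 0 to n - 1 do
    acc := !acc +. a.(i) *. b.(i)
  done;
  !acc
```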

The next step, of course, would be to actually use SIMD instructions
to operate on pairs or quadruples of floats.  The standard approach
would be to have special abstract types for these packed floats, with
operations corresponding to what the hardware SIMD unit provides.  The
problem here is that of portability: SSE2 and AltiVec, for instance,
do not provide the same SIMD instructions...
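A rough sketch of what such an abstract type could look like (the module and operation names here are invented for illustration).  A native back-end could map the operations onto SSE2 addpd/mulpd or their AltiVec counterparts; this portable fallback just computes with two scalars:

```ocaml
(* Hypothetical interface for a packed pair of floats, with the
   operations a SIMD unit typically provides.  The implementation
   below is a portable scalar fallback, not actual SIMD code. *)
module F64x2 : sig
  type t
  val pack : float -> float -> t
  val add : t -> t -> t          (* would compile to addpd on SSE2 *)
  val mul : t -> t -> t          (* would compile to mulpd on SSE2 *)
  val low : t -> float
  val high : t -> float
end = struct
  type t = { lo : float; hi : float }
  let pack lo hi = { lo; hi }
  let add x y = { lo = x.lo +. y.lo; hi = x.hi +. y.hi }
  let mul x y = { lo = x.lo *. y.lo; hi = x.hi *. y.hi }
  let low x = x.lo
  let high x = x.hi
end
```

Keeping the type abstract is what lets each back-end choose its own representation; the portability problem is deciding which set of operations to expose when the hardware units disagree.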

> I wondered if it would be possible
> to integrate this into the type inference; if the compiler can infer
> that certain values will never require more than a certain number of
> bits they become candidates for use in a SIMD unit. This is along the
> lines of Bitwidth Analysis (PLDI'00 Stephenson et al, and Larsen and
> Amarasinghe's Exploiting Superword Level Parallelism with Multimedia
> Instruction Sets, same conference). Scott Ananian's SM thesis at MIT
> also included a predicated (forward and reverse) SSA variant that used
> a similar optimization to find narrow operations that could be executed in
> parallel. 
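For readers unfamiliar with the idea: here is a toy illustration of the range-analysis step, not taken from any of the cited papers' implementations.  It infers an upper bound on the values of a tiny expression language and reports how many bits each result needs; values that fit in 8 or 16 bits would be the candidates for packing several to a SIMD word.

```ocaml
(* Toy bitwidth analysis over non-negative constants only.  Masking
   with And narrows the range, which is how narrow operations are
   discovered. *)
type expr =
  | Const of int
  | Add of expr * expr
  | And of expr * expr

(* Upper bound on the value an expression can take. *)
let rec max_val = function
  | Const n -> n
  | Add (a, b) -> max_val a + max_val b
  | And (a, b) -> min (max_val a) (max_val b)

(* Number of bits needed to represent values in [0, max_val e]. *)
let bits_needed e =
  let rec bits n = if n = 0 then 0 else 1 + bits (n lsr 1) in
  bits (max_val e)
```

For example, `Add (Const 200, Const 100)` needs 9 bits, but masking the same sum with `And (..., Const 255)` brings it back down to 8, so it could live in one byte lane of a SIMD register.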

We're getting into really advanced stuff here!  It's a research topic
on its own, and I somewhat doubt that we can extract much parallelism
this way, but we'll see.

- Xavier Leroy
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr