Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] How hard would more inlining, more unboxed floats be?
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: William Chesters <williamc@p...>
Subject: [Caml-list] How hard would more inlining, more unboxed floats be?
ocaml is nearly a marvellous tool for abstract numerical programming.
By that I mean the ability to write Matlab or Fortran 90-style
expressions, and express the natural abstractness of algorithms using
functors.

   In C++ these highly desirable goals _can_ be achieved with
intricate template techniques (Blitz++, PETE/POOMA, MTL, ...), but
they are only marginally feasible, and in practice it's questionable
whether they save more in expressiveness than they cause in trouble.

   Now, ocaml is very good---within a compilation unit---at
inlining, partial evaluation, and elimination of temporary objects,
which are the essential optimisations required.

   And the backend code generator can do amazing things with a bit of
help.  For example, I tried the following code for a dot product:

    type floatref = { mutable it: float }

    let dot x x0 x1 y y0 =
      let j = ref 0 and acc = { it = 0. } in
      for i = x0 to x1 - 1 do
        acc.it <- acc.it +. Array.unsafe_get x i *. Array.unsafe_get y !j;
        incr j
      done;
      acc.it

Note use of specialised all-"float" record (record_representation =
Record_float) --- otherwise it's impossible to avoid allocation of a
boxed float in the inner loop, whether expressed imperatively or
tail-recursively ...  On a Pentium, it compiles to

    .L101:
            fldl    -4(%ebp, %ebx, 4)
            fmull   -4(%edx, %esi, 4)
            faddl   (%edi)
            fstpl   (%edi)
            addl    $2, %esi
            addl    $2, %ebx
            cmpl    %ecx, %ebx
            jle     .L101

By contrast, the best gcc can do is this:

    .L6:
            fldl (%esi,%eax,8)
            fmull (%ebx,%edx,8)
            incl %eax
            incl %edx
            faddp %st,%st(1)
            cmpl %ecx,%eax
            jl .L6

which tests maybe 10% faster.  Actually I think on a Pentium one can
get a few more percent out by using direct pointer increments, but
most people don't do that, not least because it's actually slower on
most RISCs.

   (Glasgow Haskell is impressive in many ways, but completely misses
this level of performance---its own code generator is not very
ambitious, and the C code it can alternatively feed to gcc is too
low-level, tries to micro-manage the stack and ends up obscuring
what is really going on, so that gcc ends up using no register
variables at all ...)

   Arguably, all that's standing between ocaml-3.01 and a killer
language for scientific computing is:

     -- inlining and partial evaluation _across_ compilation
        boundaries, and in particular through modules/functors

     -- explicit user control of inlining

     -- facilities for handling unboxed floats directly, and/or
        elision of box/unbox in tail calls, so that recursive
        "loops" don't incur allocation penalty

How hard would it be to get these things happening?
-------------------
To unsubscribe, mail caml-list-request@inria.fr.  Archives: http://caml.inria.fr