Writing efficient numerical code in Objective Caml

Here are what I think are the main points to remember for writing efficient numerical code in Objective Caml. I don't have much experience with numerical code, in O'Caml or otherwise, so these are only hints.

``Optimization''

The ocamlopt compiler is poorly named because it basically does not optimize the user's code. That is, everything will be executed pretty much as written in the source. The only exception is function applications, where the compiler inlines small function bodies. But function definitions fun x -> ... will always heap-allocate a closure (no lambda-lifting, no closure hoisting). Arithmetic expressions will be computed at the point where they occur in the source (no loop-invariant code motion). So, the programmer is very much in control.

Matrices

Matrices in Caml are really arrays of (pointers to) arrays. So, a matrix access a.(i).(j) reads the i-th element of a, which is a pointer to an array, then read the j-th element of that array. It's about as efficient as the standard layout for matrices when the dimensions are not known at compile-time (we're trading a dereference for a multiplication), but can be a big loss if the dimensions are statically known and one happens to be, say, a power of 2.

A first consequence is that rows of a matrix are really individual arrays: they need not be the same length (great for triangular matrices) and a whole row of a matrix can be replaced in constant time: just do a.(i) <- some_array. There might also be some creative things you can do by sharing a row between several matrices.

Another consequence is that matrices are row-major (hope that's the right term): the first coordinate should vary slower than the second. E.g. to sum two N*N matrices, don't write

        for j = 0 to N-1 do
          for i = 0 to N-1 do
            c.(i).(j) <- a.(i).(j) + b.(i).(j)
          done
        done

but write:

        for i = 0 to N-1 do
          let row_a = a.(i) and row_b = b.(i) and row_c = c.(i) in
          for j = 0 to N-1 do
            row_c.(j) <- row_a.(j) + row_b.(j)
          done
        done

(Notice the manual optimization: three loop-invariant expressions have been lifted outside of the inner loop.)

Floating-point boxing

Because of the combination of garbage collection, polymorphism/type abstraction, and separate compilation, floating-point numbers are often boxed, that is, stored in a heap-allocated block and handled through a pointer. Most Lisp and ML implementations box (or tag) floats, and that's the main reason why they deliver poor floating-point performance: any arithmetic operation on boxed floats has to load from memory its arguments, compute the result, heap-allocate a block, and store the result in it.

One of the ancestors of Objective Caml, the Gallium experimental compiler, implemented a fairly aggressive type-based unboxing strategy that would completely eliminate boxing from Fortran-style code. i.e. code without polymorphic functions. Unfortunately, it turned out to complicate greatly the runtime system and make the garbage collector less efficient, hence hampering the performance of symbolic code. Since symbolic computation is still the main application of ML, I dropped this unboxing strategy.

However, some of the unboxing tricks developed for Gallium are still used in Objective Caml, at least on a local scale (where they don't interact with the garbage collector and everything). Here are some hints on which floats are boxed and which are not:

Function arguments are always boxed. (Unless the function is inlined, of course.)
Free variables of functions are always boxed.
Intermediate results of arithmetic expressions are not boxed. By "intermediate result", I mean the result of an arithmetic operation (+. -. *. /. ** sin cos exp log ...) that is immediately used as argument of an arithmetic operation.
Example: fun x -> 3.14 *. x *. x
The parameter x and the result are boxed, but not x *. x.
let-bound results of arithmetic operations are not boxed if they are only used later as arguments to arithmetic operations. Example:
```
     let x = y +. z in x *. x               x not boxed
     let x = y +. z in x *. x *. f x        x boxed (passed as argument)
```
Floats in data structures are generally boxed. E.g. a list of floats is really a list of pointers to floats. But there are two exceptions:
- arrays of floats (the floats are laid out contiguously, as in C; this is not an array of pointers to floats)
- record types whose fields are all floats (same as arrays)
An access in a float array or in such a record avoids boxing in the same way as for arithmetic operations.
Example: a.(i) <- 2.0 *. a.(i) performs only one float read and one float store, instead of two reads, a boxing and a store as in most ML implementations.

A few consequences of these hints:

2 and 4 =>

iterate with for and while loops instead of recursive functions.
Example:

        let x = y +. z in
        for i = 0 to N-1 do a.(i) <- x *. a.(i) done

does not box x, but

        let x = y +. z in
        let rec iter i =
          if i >= N then () else (a.(i) <- x *. a.(i); iter(i+1))
        in iter 0

is not only unreadable, but causes x to be boxed because it is free in iter.

5 =>

use records in preference to tuples for representing points, complex numbers, etc. Example:

        type complex = {re: float; im: float}

saves 4 loads and 2 allocations for each complex addition compared with

        type complex = float * float

all =>

Fortran/C style (big functions, loops all over the place) is compiled better than ``classic'' functional style (functions, functionals, iterators, combinators till you scream).

Another consequence of the optimization of float arrays is that there are really two formats of arrays in the system: arrays of floats and arrays of pointers/integers.

The illusion of a parametric array type 'a array is maintained at some cost: when accessing an array whose static type is 'a array (e.g. in a polymorphic function), a run-time test is generated to distinguish the two array formats, and some additional boxing/unboxing may take place if it's a float array. When more is known on the static type of the array, no test is generated and no extra boxing takes place.

So, for maximal performance, don't operate over float arrays with polymorphic array functions. For instance, transposing a float matrix with

        let transpose m =
          let m' = Array.new_matrix (Array.length m.(0)) (Array.length m) in
          for i = 0 to Array.length m do
            let row_i = m.(i) in
            for j = 0 to Array.length row_i do
              m'.(j).(i) <- row_i.(j)
            done
           done;
           m'

is much more efficient if you add a type constraint:

        let transpose (m: float array array) = ...

IEEE conformance

Objective Caml tries to conform as much as possible to the IEEE 754 standard for floating-point arithmetic. The known deviations at the time of this writing are:

The polymorphic comparisons =, <, compare, etc, return incorrect results when one of the arguments is a float NaN (``Not a Number''). Namely, NaNs are considered equal to any other float, while they should be different from any float including themselves. However, when =, <, etc, are used with the static type float -> float -> bool, then the compiler bypasses the polymorphic comparisons and calls the floating-point comparisons directly. The floating-point comparisons do conform to the IEEE standard.
So, if you want to define a ``is Not a Number'' predicate that builds on the fact that NaN is the only number different from itself, don't write
```
let isnan x = x <> x
```
because you'll get a polymorphic comparison that doesn't work over NaN. Instead, force the floating-point comparison by writing
```
let isnan (x : float) = x <> x
```
On the Alpha under Digital Unix/Tru 64, NaNs are not handled properly and may cause traps. This is a conscious design decision for the Alpha chip. Since OCaml 3.04, the bytecode interpreter actually enforces IEEE conformance, but we still don't know how to get IEEE conformance for the native-code compiler ocamlopt.

Caml - Objective Caml - Projet Cristal - Homepages - Contact the Caml maintainers
Author: Xavier Leroy -- Last modified: 2002/07/31