Making code run fast

I want my code to run fast. How should I write functions with several arguments?

With the bytecode compiler, the ``curried'' form
        fun x y z -> ...
is more efficient than the ``tupled'' (or ``uncurried'') form
        fun (x,y,z) -> ...
The bytecode compiler ensures that a complete application of a curried function (that is, arguments are provided for all parameters -- no partial application) executes without heap allocation. In contrast, applying an uncurried function to a tuple of arguments always heap-allocate the tuple, which is costly both because of increased memory traffic (writes to fill the tuple, loads to access its components) and increased load on the garbage collector.

The native-code compiler optimizes both curried and tupled functions, avoiding heap allocation and using registers to pass the parameters.

Should I use local functions to factor out invariant parameters of a recursive function?

You mean: if you have a recursive function such as
        let rec f x y z =
          ... f x y e1 ...
          ... f x y e2 ...
where all recursive calls have the same x and y parameters as on entrance, is it worth rewriting it as below?
        let f x y z =
          let rec local_f x =
            ... local_f e1 ...
            ... local_f e2 ...
          in local_f x
The short answer is: it depends, but in doubt don't do it.

It depends on the number of invariant parameters: parameter passing is cheap, so don't bother making a local function if you have only one or two invariant parameters.

It also depends on the expected number of recursive calls of the function: building the closure for local_f adds a constant overhead to the initial call of f, which can be overcome by the fact that recursive calls to local_f are more efficient than recursive calls to f if there are many recursive calls; but if f does not recurse very deeply, the cost of building the closure for local_f will make the ``optimized'' form less efficient than the naive form.

I hope the compiler inlines small functions, because I have zillions of them.

The bytecode compiler does not do any function inlining. The native-code compiler can inline small functions, but in a fairly conservative way. In particular, recursive functions are never inlined. Generally speaking, function calls are pretty efficient. But if you want maximal speed, you may consider inlining time-critical functions by hand.

What is most efficient: recursive functions (functional style) or loops with references (imperative style)?

Hard to say. Objective Caml supports both styles with comparable efficiency. A for loop is slightly faster than the corresponding recursive function (and a lot more readable too!). For numerical iterations (ints and floats), loops are probably slightly more efficient. On the other hand, assignment to a reference holding a heap-allocated data structure can be relatively expensive (due to the generational garbage collector), so it is better to stick with recursive functions when working over data structures.

Any garbage collector settings I should be aware of?

The o (memory overhead) parameter. The major collector is incremental: it does a ``slice'' of major collection each time a minor collection occurs. To decide how much work to do at each minor collection, it estimates the allocation rate of the program and bases its ``speed'' on that. Actually, the collector tries to ensure that no more than o percents of the major heap are wasted, that is, contains garbage (dead objects) that haven't been reclaimed yet. Lower values of o mean that the major collector works harder (and consumes more CPU time) to reclaim fresh garbage, but the memory requirements of the program are reduced. Higher values of o result in less time spent in garbage collection, but higher memory requirements.

The default value for o is 30%, meaning that the GC shoots for memory occupancy 30% higher than the theoretical minimum. This value has proven adequate for long-running programs with a relatively steady allocation rate.

However, for programs that run for a short time but allocate like crazy (read: benchmarks), that value of o is a little too low and causes the garbage collector to work unnecessarily hard. For these kind of programs, higher values of o, e.g. o=100, lead to important speedups without increasing much the memory requirements.

At any rate, o=100 means that on an average 50% of the memory space is wasted, which is exactly what a conventional copying major collector does; so, setting o=100 basically makes O'Caml's incremental marking collector even with conventional copying collectors.

The o parameter can be set either from the environment variable CAMLRUNPARAM, as in

        CAMLRUNPARAM='o=100' mycamlprogram
or through the Gc.set library function.

I'm interested in numerical programming. Any efficiency hints?

Here are a few. See also the discussion on the caml-list mailing list.

Do you blow Standard ML of New Jersey's socks off?

The performance of the native-code compiler are definitely on a par with that of SMLNJ. On early tests, we were actually slightly faster, by a factor of 1.3 on the SparcStation 10 and 1.9 on the Dec Alpha 3000/300.

These figures are geometrical means on a relatively small set of benchmarks. Individual figures are much more contrasted. On the Alpha, the speedups for each benchmark range from as low as 0.23 (i.e. SMLNJ is five times faster) on Eratosthene's sieve to as high as 11.5 (O'Caml is eleven times faster) on FFT. On the Knuth-Bendix benchmark, which I consider representative of the typical ML program, the speedup is 1.85.

The usual caveats about these figures apply: the versions of O'Caml and SMLNJ used were working versions, not fully tuned yet; the benchmark programs may or may not be representative of real-world applications; finally, the benchmark programs were written in the curried style, which is well supported by O'Caml but possibly not that well suited to SMLNJ. In short, these figures don't mean much.

What speedup should I expect from the native-code compiler, compared with the bytecode compiler? From the bytecode compiler, compared with Caml Light 0.7?

O'Caml bytecode programs run 1.5 to 2 times faster than Caml Light 0.7 programs. This is due to a number of improvements in the bytecode (simpler abstract machine, 32-bit encoding of instructions).

O'Caml programs compiled with the native-code compiler run 6 times faster (on an average) than the same programs compiled with the O'Caml bytecode compiler. The speedup depends a lot on the type of program, and ranges from 2 (test program: a lexer generator) to 15 (test program: FFT).

You mean there are programs that run only twice as fast when going to native code? Isn't this a ridiculous speedup?

There are some class of operations that are much faster when going to native code instead of bytecode: function calls, loops, tests and pattern-matching, integer arithmetic, float arithmetic. However, many other operations take exactly the same time because they are implemented mostly in C: heap allocation, garbage collection, input/output, structured comparisons (=), hashing, parsing with ocamlyacc-generated parsers. If your program spends a significant amount of time in the latter operations, then the speedups obtained by going to native code are limited.

For instance, assume that your bytecode program spends 1/3 of its time in garbage collection, structural equality, and input/output. Further assume that the remaining 2/3 of operations speed up by a factor of 8 when compiled to native-code. Then, the whole native-code program will run ``only'' 2.4 times faster than the original bytecode program.

Caml - Objective Caml - Projet Cristal - Homepages - Contact the Caml maintainers
Author: Xavier Leroy -- Last modified: 2002/07/31