The ``curried'' style of function definition

fun x y z -> ...

is more efficient than the ``tupled'' (or ``uncurried'') form

fun (x,y,z) -> ...

The bytecode compiler ensures that a complete application of a curried function (that is, arguments are provided for all parameters; no partial application) executes without heap allocation. In contrast, applying an uncurried function to a tuple of arguments always heap-allocates the tuple, which is costly both because of increased memory traffic (writes to fill the tuple, loads to access its components) and because of increased load on the garbage collector.
The native-code compiler optimizes both curried and tupled functions, avoiding heap allocation and using registers to pass the parameters.
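To make the two styles concrete, here is a minimal sketch (the function names add3_curried and add3_tupled are made up for illustration); both definitions compute the same result, but under the bytecode compiler only the tupled one forces a heap allocation at each call site:

```ocaml
(* The same three-argument function in both styles.  With the bytecode
   compiler, a complete application of the curried version allocates
   nothing, while every call to the tupled version must first build a
   tuple on the heap. *)
let add3_curried x y z = x + y + z        (* curried form *)
let add3_tupled (x, y, z) = x + y + z     (* tupled form: callers allocate (x,y,z) *)

let () =
  assert (add3_curried 1 2 3 = add3_tupled (1, 2, 3))
```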
Given a curried recursive function

let rec f x y z = ... f x y e1 ... ... f x y e2 ...

where all recursive calls have the same x and y parameters as on entrance, is it worth rewriting it as below?

let f x y z = let rec local_f z = ... local_f e1 ... ... local_f e2 ... in local_f z

The short answer is: it depends, but when in doubt, don't do it.
It depends on the number of invariant parameters: parameter passing is cheap, so don't bother making a local function if you have only one or two invariant parameters.
It also depends on the expected number of recursive calls of the function: building the closure for local_f adds a constant overhead to the initial call of f, which can be offset by the fact that recursive calls to local_f are more efficient than recursive calls to f if there are many recursive calls; but if f does not recurse very deeply, the cost of building the closure for local_f can make the ``optimized'' form less efficient than the naive form.
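A concrete instance of the rewrite may help (the function mem below is a hypothetical example, not from the original text): its eq and x parameters are invariant across recursive calls, and only the list argument changes.

```ocaml
(* Naive form: the invariant parameters [eq] and [x] are passed again
   at every recursive call. *)
let rec mem eq x l =
  match l with
  | [] -> false
  | h :: t -> eq x h || mem eq x t

(* Rewritten form: the invariant parameters are captured once in the
   closure of [local_mem]; only the varying list argument is passed. *)
let mem' eq x l =
  let rec local_mem = function
    | [] -> false
    | h :: t -> eq x h || local_mem t
  in
  local_mem l

let () =
  assert (mem (=) 3 [1; 2; 3] = mem' (=) 3 [1; 2; 3])
```

Whether mem' wins depends, as explained above, on how many recursive calls amortize the cost of building the closure for local_mem.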
A for loop is slightly faster than the corresponding recursive function (and a lot more readable, too). For numerical iterations (ints and floats), loops are probably slightly more efficient. On the other hand, assignment to a reference holding a heap-allocated data structure can be relatively expensive (because of the generational garbage collector), so it is better to stick with recursive functions when working over data structures.
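As a small sketch of the two styles for a purely numerical iteration (the functions sum_loop and sum_rec are invented for illustration):

```ocaml
(* Numerical iteration both ways: a [for] loop over an int reference,
   and the equivalent tail-recursive function.  For int/float work
   like this, the loop is at least as fast and arguably clearer. *)
let sum_loop n =
  let s = ref 0 in
  for i = 1 to n do
    s := !s + i
  done;
  !s

let rec sum_rec acc i n =
  if i > n then acc else sum_rec (acc + i) (i + 1) n

let () =
  assert (sum_loop 100 = 5050);
  assert (sum_rec 0 1 100 = 5050)
```

Note that the reference here holds an unboxed int; it is references to heap-allocated structures whose assignment incurs the generational-GC cost mentioned above.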
The o (memory overhead) parameter. The major collector is incremental: it does a ``slice'' of major collection each time a minor collection occurs. To decide how much work to do at each minor collection, it estimates the allocation rate of the program and bases its ``speed'' on that. More precisely, the collector tries to ensure that no more than o percent of the major heap is wasted, that is, contains garbage (dead objects) that has not been reclaimed yet. Lower values of o mean that the major collector works harder (and consumes more CPU time) to reclaim fresh garbage, but the memory requirements of the program are reduced. Higher values of o result in less time spent in garbage collection, but higher memory requirements.
The default value for
o is 30%, meaning that the GC
shoots for memory occupancy 30% higher than the theoretical
minimum. This value has proven adequate for long-running programs with
a relatively steady allocation rate.
However, for programs that run for a short time but allocate like
crazy (read: benchmarks), that value of
o is a little too
low and causes the garbage collector to work unnecessarily hard. For
this kind of program, higher values of o, such as o=100, lead to significant speedups without increasing the memory requirements much.
At any rate, o=100 means that on average 50% of the memory space is wasted, which is exactly what a conventional copying major collector does; so, setting o=100 basically puts O'Caml's incremental marking collector on a par with conventional copying collectors as far as memory requirements are concerned.
The o parameter can be set either from the environment variable CAMLRUNPARAM, as in

CAMLRUNPARAM='o=100' mycamlprogram

or through the Gc.set function of the standard Gc module.
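A minimal sketch of the second option, using the standard library's Gc module (the Gc.space_overhead field corresponds to the o runtime parameter):

```ocaml
(* Raise the GC space overhead to 100% from within the program,
   equivalent to running with CAMLRUNPARAM='o=100'. *)
let () =
  let params = Gc.get () in
  Gc.set { params with Gc.space_overhead = 100 };
  (* Read the setting back to confirm it took effect. *)
  assert ((Gc.get ()).Gc.space_overhead = 100)
```

Doing it in the program rather than the environment is convenient when the program, not its user, knows it allocates heavily.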
These figures are geometric means over a relatively small set of benchmarks. Individual figures vary much more. On the Alpha, the speedups for each benchmark range from as low as 0.23 (i.e. SMLNJ is roughly four times faster) on Eratosthenes' sieve to as high as 11.5 (O'Caml is eleven times faster) on FFT. On the Knuth-Bendix benchmark, which I consider representative of the typical ML program, the speedup is 1.85.
The usual caveats about these figures apply: the versions of O'Caml and SMLNJ used were working versions, not fully tuned yet; the benchmark programs may or may not be representative of real-world applications; finally, the benchmark programs were written in the curried style, which is well supported by O'Caml but possibly not that well suited to SMLNJ. In short, these figures don't mean much.
O'Caml programs compiled with the native-code compiler run 6 times faster on average than the same programs compiled with the O'Caml bytecode compiler. The speedup depends a lot on the type of program, and ranges from 2 (test program: a lexer generator) to 15 (test program: FFT).
Some operations run at essentially the same speed under both compilers: garbage collection, input/output, structural equality (=), hashing, and parsing with ocamlyacc-generated parsers. If your program spends a significant amount of time in these operations, then the speedups obtained by going to native code are limited.
For instance, assume that your bytecode program spends 1/3 of its time in garbage collection, structural equality, and input/output. Further assume that the remaining 2/3 of operations speed up by a factor of 8 when compiled to native-code. Then, the whole native-code program will run ``only'' 2.4 times faster than the original bytecode program.
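The 2.4 figure is just Amdahl's law; a small sketch of the arithmetic (the function overall_speedup is invented for illustration):

```ocaml
(* Amdahl's law: a fraction [serial] of the running time is not sped
   up at all, and the remaining (1 - serial) is divided by [factor].
   The overall speedup is the reciprocal of the new total time. *)
let overall_speedup ~serial ~factor =
  1.0 /. (serial +. (1.0 -. serial) /. factor)

let () =
  (* 1/3 of the time unchanged, the other 2/3 sped up 8x:
     new time = 1/3 + (2/3)/8 = 5/12, so the speedup is 12/5. *)
  Printf.printf "%.2f\n" (overall_speedup ~serial:(1.0 /. 3.0) ~factor:8.0)
```

which evaluates to 2.4, matching the figure above.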
Author: Xavier Leroy -- Last modified: 2002/07/31