This site is updated infrequently. For up-to-date information, please visit the new OCaml website at ocaml.org.

Optimising: div and mod on AMD64
• Jon Harrop
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
 Date: 2005-11-30 (04:35) From: Jon Harrop Subject: Optimising: div and mod on AMD64
```
I've spent a little time playing around with my 19-line Sudoku program:

http://www.ffconsultancy.com/free/sudoku

Algorithmic optimisations are clearly the way forward from my brute force
approach. However, I am curious about low-level aspects of the performance of
the existing implementation in various ways.

Note that 85-90% of the time is spent in the "invalid" function:

let rec invalid ?(i=0) x y n =
i<9 && (m.(y).[i] = n || m.(i).[x] = n ||
m.(y/3*3 + i/3).[x/3*3 + i mod 3] = n || invalid ~i:(i+1) x y n)

This is a simple recursive function. You can improve its performance a bit by
getting rid of the optional argument:

let rec invalid i x y n =
i<9 && (m.(y).[i] = n || m.(i).[x] = n ||
m.(y/3*3 + i/3).[x/3*3 + i mod 3] = n || invalid (i+1) x y n)

The C++ equivalent of this function (including bounds checking) is:

bool invalid(int i, int x, int y, int n) {
return i<9 && (m.at(y).at(i) == n || m.at(i).at(x) == n ||
m.at(y/3*3 + i/3).at(x/3*3 + i%3) == n ||
invalid(i+1, x, y, n));
}

When switching from 32-bit x86 to 64-bit AMD64, the C++ becomes slightly
faster but the OCaml becomes much slower, to the extent that the OCaml is
4.8x slower than the C++ on AMD64:

Platform  x86 (900MHz) AMD64 (800MHz)
OCaml     3.716        6.209
C++       1.351        1.284

After quite a bit of playing I found out some interesting things. Allocation
is very slow, but the "invalid" function does no allocation and incurs no GC.
Recursion can be slow. Unrolling the invalid function makes little difference
to performance. Removing bounds checking only improves performance by 6% on
both platforms.

To my surprise, it seems that the vast majority of the time in the OCaml
implementation is spent computing x/3*3, i/3 and i mod 3. This is surprising
because these operations are "normally" very fast, e.g. in the C++.

Getting rid of integer div and mod in the "invalid" function by hoisting the
computation of x/3*3 out of "invalid" and then out of the fold, and recursing
over ix=[0,3) and iy=[0,3) rather than i=[0,9) gives:

let rec invalid ix iy fx fy x y n =
if ix=3 then invalid 0 (iy+1) fx fy x y n else
iy<3 && (m.(y).[ix + 3*iy] = n || m.(ix + 3*iy).[x] = n ||
m.(fy + iy).[fx + ix] = n || invalid (ix+1) iy fx fy x y n)

Greatly improves performance, particularly on AMD64:

OCaml     1.946        1.308

Hoisting x/3*3 out of the fold (which isn't even the inner loop!) in the
"search" function almost doubles the performance of the whole program:

let fx = x/3*3 and fy = y/3*3 in
fold (fun accu n ->
let n = Char.chr (n + 48) in
if invalid 0 0 fx fy x y n then accu else
(m.(y).[x] <- n;
let accu = search (x+1) y f accu in
m.(y).[x] <- '0';
accu)) accu 1 10

From my mediocre knowledge of x86 and AMD64 assembler, it looks as though
ocamlopt is generating naive integer divisions and modulos, even when the
divisors are small, constant integers. In contrast, g++ is converting these
operations into multiplications by constants, shifts and