Version française
Home     About     Download     Resources     Contact us    
Browse thread
Optimising: div and mod on AMD64
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Jon Harrop <jon@f...>
Subject: Optimising: div and mod on AMD64

I've spent a little time playing around with my 19-line Sudoku program:

  http://www.ffconsultancy.com/free/sudoku

Algorithmic optimisations are clearly the way forward from my brute force 
approach. However, I am curious about low-level aspects of the performance of 
the existing implementation in various ways.

Note that 85-90% of the time is spent in the "invalid" function:

let rec invalid ?(i=0) x y n =
  i<9 && (m.(y).[i] = n || m.(i).[x] = n ||
      m.(y/3*3 + i/3).[x/3*3 + i mod 3] = n || invalid ~i:(i+1) x y n)

This is a simple recursive function. You can improve its performance a bit by 
getting rid of the optional argument:

let rec invalid i x y n =
  i<9 && (m.(y).[i] = n || m.(i).[x] = n ||
      m.(y/3*3 + i/3).[x/3*3 + i mod 3] = n || invalid (i+1) x y n)

The C++ equivalent of this function (including bounds checking) is:

bool invalid(int i, int x, int y, int n) {
  return i<9 && (m.at(y).at(i) == n || m.at(i).at(x) == n ||
                 m.at(y/3*3 + i/3).at(x/3*3 + i%3) == n ||
                 invalid(i+1, x, y, n));
}

When switching from 32-bit x86 to 64-bit AMD64, the C++ becomes slightly 
faster but the OCaml becomes much slower, to the extent that the OCaml is 
4.8x slower than the C++ on AMD64:

Platform  x86 (900MHz) AMD64 (800MHz)
OCaml     3.716        6.209
C++       1.351        1.284

After quite a bit of playing I found out some interesting things. Allocation 
is very slow, but the "invalid" function does no allocation and incurs no GC. 
Recursion can be slow. Unrolling the invalid function makes little difference 
to performance. Removing bounds checking only improves performance by 6% on 
both platforms.

To my surprise, it seems that the vast majority of the time in the OCaml 
implementation is spent computing x/3*3, i/3 and i mod 3. This is surprising 
because these operations are "normally" very fast, e.g. in the C++.

Getting rid of integer div and mod in the "invalid" function by hoisting the 
computation of x/3*3 out of "invalid" and then out of the fold, and recursing 
over ix=[0,3) and iy=[0,3) rather than i=[0,9) gives:

let rec invalid ix iy fx fy x y n =
  if ix=3 then invalid 0 (iy+1) fx fy x y n else
    iy<3 && (m.(y).[ix + 3*iy] = n || m.(ix + 3*iy).[x] = n ||
	m.(fy + iy).[fx + ix] = n || invalid (ix+1) iy fx fy x y n)

Greatly improves performance, particularly on AMD64:

OCaml     1.946        1.308

Hoisting x/3*3 out of the fold (which isn't even the inner loop!) in the 
"search" function almost doubles the performance of the whole program:

	let fx = x/3*3 and fy = y/3*3 in
        fold (fun accu n ->
                let n = Char.chr (n + 48) in
                if invalid 0 0 fx fy x y n then accu else
                  (m.(y).[x] <- n;
                   let accu = search (x+1) y f accu in
                   m.(y).[x] <- '0';
                   accu)) accu 1 10

From my mediocre knowledge of x86 and AMD64 assembler, it looks as though 
ocamlopt is generating naive integer divisions and modulos, even when the 
divisors are small, constant integers. In contrast, g++ is converting these 
operations into multiplications by constants, shifts and 
addition/subtraction.

I am surprised that this makes such a big difference but, assuming I'm right, 
may we have optimised integer arithmetic for constant divisors please?

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists