Optimising: div and mod on AMD64
 Jon Harrop
[
Home
]
[ Index:
by date

by threads
]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
Date:  20051130 (04:35) 
From:  Jon Harrop <jon@f...> 
Subject:  Optimising: div and mod on AMD64 
I've spent a little time playing around with my 19line Sudoku program: http://www.ffconsultancy.com/free/sudoku Algorithmic optimisations are clearly the way forward from my brute force approach. However, I am curious about lowlevel aspects of the performance of the existing implementation in various ways. Note that 8590% of the time is spent in the "invalid" function: let rec invalid ?(i=0) x y n = i<9 && (m.(y).[i] = n  m.(i).[x] = n  m.(y/3*3 + i/3).[x/3*3 + i mod 3] = n  invalid ~i:(i+1) x y n) This is a simple recursive function. You can improve its performance a bit by getting rid of the optional argument: let rec invalid i x y n = i<9 && (m.(y).[i] = n  m.(i).[x] = n  m.(y/3*3 + i/3).[x/3*3 + i mod 3] = n  invalid (i+1) x y n) The C++ equivalent of this function (including bounds checking) is: bool invalid(int i, int x, int y, int n) { return i<9 && (m.at(y).at(i) == n  m.at(i).at(x) == n  m.at(y/3*3 + i/3).at(x/3*3 + i%3) == n  invalid(i+1, x, y, n)); } When switching from 32bit x86 to 64bit AMD64, the C++ becomes slightly faster but the OCaml becomes much slower, to the extent that the OCaml is 4.8x slower than the C++ on AMD64: Platform x86 (900MHz) AMD64 (800MHz) OCaml 3.716 6.209 C++ 1.351 1.284 After quite a bit of playing I found out some interesting things. Allocation is very slow, but the "invalid" function does no allocation and incurs no GC. Recursion can be slow. Unrolling the invalid function makes little difference to performance. Removing bounds checking only improves performance by 6% on both platforms. To my surprise, it seems that the vast majority of the time in the OCaml implementation is spent computing x/3*3, i/3 and i mod 3. This is surprising because these operations are "normally" very fast, e.g. in the C++. Getting rid of integer div and mod in the "invalid" function by hoisting the computation of x/3*3 out of "invalid" and then out of the fold, and recursing over ix=[0,3) and iy=[0,3) rather than i=[0,9) gives: let rec invalid ix iy fx fy x y n = if ix=3 then invalid 0 (iy+1) fx fy x y n else iy<3 && (m.(y).[ix + 3*iy] = n  m.(ix + 3*iy).[x] = n  m.(fy + iy).[fx + ix] = n  invalid (ix+1) iy fx fy x y n) Greatly improves performance, particularly on AMD64: OCaml 1.946 1.308 Hoisting x/3*3 out of the fold (which isn't even the inner loop!) in the "search" function almost doubles the performance of the whole program: let fx = x/3*3 and fy = y/3*3 in fold (fun accu n > let n = Char.chr (n + 48) in if invalid 0 0 fx fy x y n then accu else (m.(y).[x] < n; let accu = search (x+1) y f accu in m.(y).[x] < '0'; accu)) accu 1 10 From my mediocre knowledge of x86 and AMD64 assembler, it looks as though ocamlopt is generating naive integer divisions and modulos, even when the divisors are small, constant integers. In contrast, g++ is converting these operations into multiplications by constants, shifts and addition/subtraction. I am surprised that this makes such a big difference but, assuming I'm right, may we have optimised integer arithmetic for constant divisors please?  Dr Jon D Harrop, Flying Frog Consultancy Ltd. Objective CAML for Scientists http://www.ffconsultancy.com/products/ocaml_for_scientists