Version française
Home     About     Download     Resources     Contact us    

This site is updated infrequently. For up-to-date information, please visit the new OCaml website at

Browse thread
[Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: 2002-10-20 (09:42)
From: Xavier Leroy <xavier.leroy@i...>
Subject: Re: [Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C
> Looking at the assembly produced by O'Caml and GCC, it appears that GCC
> is performance loop unrolling (as requested with -funroll-loops) and
> strength reduction in the inner loops. I can easily see why these two
> optimizations can result in such a tremendous performance difference.
> My question is this: I can obviously performance loop unrolling myself
> by hand - does ocamlopt perform strength reduction?

No, it doesn't.  It doesn't hoist loop-invariant computations either.

Multiplications are pretty expensive on modern processors, so your
code would run much faster if you hoist the computation i*120 out of
the j and k loops, and strength-reduce k*120 in the innermost

Another option worth considering is to swap the j and k loops, thus
making k*120 and a.(i*120+k) two invariants of the inner loop.  In
this case, you might not even have to strength-reduce k*120.

Other loop optimizations such as loop unrolling would (I guess) make
much less of a difference -- modern processors are very good at branch
prediction, making loop overhead low.

Some more thoughts:

Bagley's code works on a "proper" matrix (an array of
arrays), while yours flatten the matrix into one single array.  There
are pros and cons to either approach, but notice that your approach
(without strength reduction) entails fewer loads but more
multiplications, which can be more expensive...

Tiling loops is good for really large matrices, but yours (in this
example) occupies "only" 115 Kb, thus fits comfortably in L2 cache.
Perhaps tiling isn't beneficial in this case.

- Xavier Leroy
To unsubscribe, mail Archives:
Bug reports: FAQ:
Beginner's list: