[Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C
From: Paul Stodghill <stodghil@C...>
Subject: Re: [Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C
>>>>> "Xavier" == Xavier Leroy <xavier.leroy@inria.fr> writes:
    Xavier> No, it doesn't. It doesn't hoist loop-invariant computations
    Xavier> either. (See http://caml.inria.fr/ocaml/numerical.html)

I'll take a look. Thanks.

    Xavier> Multiplications are pretty expensive on modern processors,
    Xavier> so your code would run much faster if you hoist the
    Xavier> computation i*120 out of the j and k loops, and
    Xavier> strength-reduce k*120 in the innermost loop.

I don't see how to do strength-reduction without introducing references,
which appears to reduce performance more than is gained.
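For concreteness, a minimal sketch of the hoisting in question, assuming n x n matrices stored row-major in flat float arrays (the function and variable names here are illustrative, not from the original code):

```ocaml
(* Hoist the loop-invariant i*n out of the j and k loops.
   Note that k*n still sits in the innermost loop; strength-reducing
   it would need a mutable counter, i.e. a reference. *)
let mmm_hoisted a b c n =
  for i = 0 to n - 1 do
    let in_ = i * n in                  (* invariant in j and k *)
    for j = 0 to n - 1 do
      let s = ref 0.0 in
      for k = 0 to n - 1 do
        s := !s +. a.(in_ + k) *. b.(k * n + j)
      done;
      c.(in_ + j) <- !s
    done
  done
```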

    Xavier> Another option worth considering is to swap the j and k
    Xavier> loops, thus making k*120 and a.(i*120+k) two invariants of
    Xavier> the inner loop. In this case, you might not even have to
    Xavier> strength-reduce k*120.

I tried this, and indeed the performance of the ML code gets up to 95
Mflops. However, the performance of the C code goes down, as the ikj
loop ordering requires an additional store of C[i,j] in every iteration.
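A sketch of that i-k-j ordering, under the same assumptions (flat row-major arrays, illustrative names; c must be zero-initialized since it accumulates):

```ocaml
(* With j innermost, both k*n and a.(i*n+k) become inner-loop
   invariants, at the cost of a load and store of c.(i*n+j) on
   every iteration. *)
let mmm_ikj a b c n =
  for i = 0 to n - 1 do
    let in_ = i * n in
    for k = 0 to n - 1 do
      let kn = k * n in
      let aik = a.(in_ + k) in          (* invariant in j *)
      for j = 0 to n - 1 do
        c.(in_ + j) <- c.(in_ + j) +. aik *. b.(kn + j)
      done
    done
  done
```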

    Xavier> Other loop optimizations such as loop unrolling would (I
    Xavier> guess) make much less of a difference -- modern processors
    Xavier> are very good at branch prediction, making loop overhead
    Xavier> low.

Loop unrolling can also increase ILP. It can make a big difference in
performance.

Loop unrolling can be thought of as a way of getting some of the effect
of software pipelining without the complicated scheduling.
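As an illustration of the ILP point, here is a 4-way unrolled dot product (a hypothetical helper, assuming the length is a multiple of 4); the four independent partial sums give the processor independent multiply-add chains to overlap:

```ocaml
(* 4-way unrolled dot product with four partial sums, so the
   multiply-adds in one iteration do not depend on each other. *)
let dot4 a b n =
  let s0 = ref 0.0 and s1 = ref 0.0
  and s2 = ref 0.0 and s3 = ref 0.0 in
  let k = ref 0 in
  while !k < n do
    s0 := !s0 +. a.(!k)     *. b.(!k);
    s1 := !s1 +. a.(!k + 1) *. b.(!k + 1);
    s2 := !s2 +. a.(!k + 2) *. b.(!k + 2);
    s3 := !s3 +. a.(!k + 3) *. b.(!k + 3);
    k := !k + 4
  done;
  !s0 +. !s1 +. !s2 +. !s3
```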


    Xavier> Some more thoughts:

    Xavier> Bagley's code works on a "proper" matrix (an array of
    Xavier> arrays), while yours flattens the matrix into one single
    Xavier> array. There are pros and cons to either approach, but
    Xavier> notice that your approach (without strength reduction)
    Xavier> entails fewer loads but more multiplications, which can be
    Xavier> more expensive...

Right. So it is a win when it is done in combination with strength reduction.
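For comparison, a sketch of the array-of-arrays layout (illustrative names): the row lookups are extra loads, but they replace the i*n and i*n+j index multiplications of the flat layout, and the rows can be hoisted out of the inner loops.

```ocaml
(* Array-of-arrays version: a, b, c are float array arrays.
   Row lookups (loads) stand in for index multiplications. *)
let mmm_nested a b c n =
  for i = 0 to n - 1 do
    let ai = a.(i) and ci = c.(i) in    (* rows hoisted out of j, k *)
    for j = 0 to n - 1 do
      let s = ref 0.0 in
      for k = 0 to n - 1 do
        s := !s +. ai.(k) *. b.(k).(j)
      done;
      ci.(j) <- !s
    done
  done
```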


    Xavier> Tiling loops is good for really large matrices, but yours
    Xavier> (in this example) occupies "only" 115 Kb, thus fits
    Xavier> comfortably in L2 cache. Perhaps tiling isn't beneficial in
    Xavier> this case.

True. This was a miscalculation on my part. However, the blocksize that
I chose makes the blocks fit in the L1 cache, so I was still getting a
measurable benefit.
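A sketch of the tiling in question, assuming a block size bs that divides n (flat row-major arrays, zero-initialized c, illustrative names): each bs x bs tile of a, b, and c is revisited while it is still resident in L1.

```ocaml
(* Loop tiling: the three outer loops walk over blocks, the three
   inner loops do an ikj multiply within one block. *)
let mmm_tiled a b c n bs =
  let nb = n / bs in
  for ii = 0 to nb - 1 do
    for kk = 0 to nb - 1 do
      for jj = 0 to nb - 1 do
        for i = ii * bs to ii * bs + bs - 1 do
          for k = kk * bs to kk * bs + bs - 1 do
            let aik = a.(i * n + k) in
            for j = jj * bs to jj * bs + bs - 1 do
              c.(i * n + j) <- c.(i * n + j) +. aik *. b.(k * n + j)
            done
          done
        done
      done
    done
  done
```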

    Xavier> - Xavier Leroy

Here is what I was trying to accomplish - I am involved with a project
that is trying to automate/generalize some of the tricks that ATLAS
(http://math-atlas.sourceforge.net/) uses for getting maximal
performance from matrix-matrix multiply (MMM). I knew about Bagley's
conclusion that O'Caml was competitive with C for MMM, so I was curious
how close O'Caml came to GCC on Blocked MMM. I would like to use O'Caml
as a target language over C or Fortran.

My conclusion, at this point, is that the current native O'Caml compiler
is not going to be competitive with a native C or Fortran compiler,
because it does not optimize the innermost loop as aggressively.

Thanks for your help.

-------------------
To unsubscribe, mail caml-list-request@inria.fr
Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners