speeding up matrix multiplication (newbie question)
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
 Date: -- (:) From: Will M. Farr Subject: Re: [Caml-list] speeding up matrix multiplication (newbie question)
Erick,

Sorry about the long email, but here is an explanation of what
"boxing" means, how it slows you down in this case, and how you can
(eventually) figure out whether it will slow you down in general.  I'm
not an expert, so I've probably made mistakes in the following, but I
think the broad outlines are correct.  True experts can weigh in to
correct me.

Due to polymorphism (i.e. the fact that one must be able to stuff
*any* ocaml value into an 'a list), every ocaml value must have a
uniform format and size if it is possible that it can be used
polymorphically.  The current compiler can prove in certain cases that
a value is not used polymorphically; when this is possible, the value
can be stored more efficiently.  For example, in the expression:

let f a b =
let x = a *. b in
3.0 *. x

the compiler can tell that x is always going to be used as a float and
not polymorphically.  However, a, b, and (3.0 *. x) can escape to the
larger world, where they could potentially be used polymorphically,
and therefore need to have the uniform representation.

It turns out that the representation chosen for the OCaml runtime is a
32-bit integer (i.e. a pointer).  So, to represent a float that can be
used polymorphically, the runtime allocates 64-bits of memory, stuffs
the float into it, and uses the pointer to the memory to represent the
float in the program.  This procedure is called "boxing".  In the case
above, the contents of x can be represented "unboxed" as a real 64-bit
floating point number, while the contents of a and b, and the result
(3.0 *. x) must be boxed.

The boxing rules for the ocaml compiler are described at
http://caml.inria.fr/pub/old_caml_site/ocaml/numerical.html (I don't
know if this is current or not).  The relevant rules for analyizing
your code are probably a) no cross-module inlining of functor argument
functions (i.e. N.add, N.mul, etc), b) boxing of floats at all
(non-inlined) function boundaries, and c) floating point arrays store
their elements unboxed.  There are other rules, given at the above
website.

So, the code you wrote

result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))

allocates a box for the result of (get m i j) (unless this procedure
is inlined, which is possible if it's short, and not coming in via
functor argument---it would be better to use m.(i).(j) if possible),
another box for v.(j), passes them to N.mul, which allocates a box for
the result of the multiplication.  Then a box is allocated for
result.(i), and the previous result and result.(i) are passed to
N.add, which allocates a box for its result.  Then this box is opened
and the value in it is stored in result.(i).  Ugh.  If, instead, you
write

result.(i) <- result.(i) +. (get m i j)*.v.(j)

there is *at most* one box for (get m i j).  If we assume that (get m
i j) is inlined and therefore the boxing overhead is eliminated, the
entire process takes only a few clock cycles.  Compare to possibly
several hundred or more clock cycles above, plus the extra stress on
the GC from collecting all the boxes when they are no longer

In short, you'll get a lot faster if you specialize the operations to
floats.  Like order-of-magnitude or greater speedups.

Good luck!

Will

On Fri, Feb 20, 2009 at 2:53 PM, Erick Matsen <matsen@berkeley.edu> wrote:
> Wow, once again I am amazed by the vitality of this list. Thank you
>
> Here is the context: we are interested in calculating the likelihood
> of taxonomic placement of short "metagenomics" sequence fragments from
> unknown organisms in the ocean. We start by assuming a model of
> sequence evolution, which is a reversible Markov chain. The taxonomy
> is represented as a tree, and the sequence information is a collection
> of likelihoods of sequence identities. As we move up the tree, these
> sequences "evolve" by getting multiplied by the exponentiated
> instantaneous Markov matrix.
>
> The matrices are of the size of the sequence model: 4x4 when looking
> at nucleotides, and 20x20 when looking at proteins.
>
> The bottleneck is (I mis-spoke before) that we are multiplying many
> length-4 or length-20 vectors by a collection of matrices which
> represent the time evolution of those sequences as follows.
>
> Outer loop:
>  modify the amount of time each markov process runs
>  exponentiate the rate matrices to get transition matrices
>
>  Recur over the tree, starting at the leaves:
>    at a node, multiply all of the daughter likelihood vectors together
>    return the multiplication of that product by the trasition matrix
> (bottleneck!)
>
> The trees are on the order of 50 leaves, and there are about 500
> likelihood vectors done at once.
>
> All of this gets run on a big cluster of Xeons. It's not worth
> parallelizing because we are running many instances of this process
> already, which fills up the cluster nodes.
>
> So far I have been doing the simplest thing possible, which is just to
> multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit
> embarassing.
>
> let mul_vec m v =
>    if Array.length v <> n_cols m then
>      failwith "mul_vec: matrix size and vector size don't match!";
>    let result = Array.create (n_rows m) N.zero in
>    for i=0 to (n_rows m)-1 do
>      for j=0 to (n_cols m)-1 do
>        result.(i) <- N.add result.(i) (N.mul (get m i j) v.(j))
>      done;
>    done;
>    result
>
> I have implemented it in a functorial way for flexibility. N is the
> number class. How much improvement might I hope for if I make a
> dedicated float vector multiplication function? I'm sorry, I know
>
>
> Thank you again,
>
> Erick
>
>
>
> On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
>>> I'm working on speeding up some code, and I wanted to check with
>>> someone before implementation.
>>>
>>> As you can see below, the code primarily spends its time multiplying
>>> relatively small matrices. Precision is of course important but not
>>> an incredibly crucial issue, as the most important thing is relative
>>> comparison between things which *should* be pretty different.
>>
>> You need to post your matrix multiplication code so that the regulars
>> on this list can tear it to pieces :-)
>>
>> From the profile you gave, it looks like you parameterized your matrix
>> multiplication code over the + and * operations over matrix elements.
>> This is good for genericity but not so good for performance, as it
>> will result in more boxing (heap allocation) of floating-point values.
>> The first thing you should try is write a version of matrix
>> multiplication that is specialized for type "float".
>>
>> Then, there are several ways to write the textbook matrix
>> multiplication algorithm, some of which perform less boxing than
>> others.  Again, post your code and we'll let you know.
>>
>>> Currently I'm just using native (double-precision) ocaml floats and
>>> the native ocaml arrays for a first pass on the problem.  Now I'm
>>> thinking about moving to using float32 bigarrays, and I'm hoping
>>> that the code will double in speed. I'd like to know: is that
>>> realistic? Any other suggestions?
>>
>> It won't double in speed: arithmetic operations will take exactly the
>> same time in single or double precision.  What single-precision
>> bigarrays buy you is halving the memory footprint of your matrices.
>> That could result in better cache behavior and therefore slightly
>> better speed, but it depends very much on the sizes and number of your
>> matrices.
>>
>> - Xavier Leroy
>>
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>