speeding up matrix multiplication (newbie question)
[
Home
]
[ Index:
by date

by threads
]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
Date:  20090220 (19:53) 
From:  Erick Matsen <matsen@b...> 
Subject:  Re: [Camllist] speeding up matrix multiplication (newbie question) 
Wow, once again I am amazed by the vitality of this list. Thank you for your suggestions. Here is the context: we are interested in calculating the likelihood of taxonomic placement of short "metagenomics" sequence fragments from unknown organisms in the ocean. We start by assuming a model of sequence evolution, which is a reversible Markov chain. The taxonomy is represented as a tree, and the sequence information is a collection of likelihoods of sequence identities. As we move up the tree, these sequences "evolve" by getting multiplied by the exponentiated instantaneous Markov matrix. The matrices are of the size of the sequence model: 4x4 when looking at nucleotides, and 20x20 when looking at proteins. The bottleneck is (I misspoke before) that we are multiplying many length4 or length20 vectors by a collection of matrices which represent the time evolution of those sequences as follows. Outer loop: modify the amount of time each markov process runs exponentiate the rate matrices to get transition matrices Recur over the tree, starting at the leaves: at a node, multiply all of the daughter likelihood vectors together return the multiplication of that product by the trasition matrix (bottleneck!) The trees are on the order of 50 leaves, and there are about 500 likelihood vectors done at once. All of this gets run on a big cluster of Xeons. It's not worth parallelizing because we are running many instances of this process already, which fills up the cluster nodes. So far I have been doing the simplest thing possible, which is just to multiply the matrices out like \sum_j a_ij v_j. Um, this is a bit embarassing. let mul_vec m v = if Array.length v <> n_cols m then failwith "mul_vec: matrix size and vector size don't match!"; let result = Array.create (n_rows m) N.zero in for i=0 to (n_rows m)1 do for j=0 to (n_cols m)1 do result.(i) < N.add result.(i) (N.mul (get m i j) v.(j)) done; done; result I have implemented it in a functorial way for flexibility. N is the number class. How much improvement might I hope for if I make a dedicated float vector multiplication function? I'm sorry, I know nothing about "boxing." Where can I read about that? Thank you again, Erick On Fri, Feb 20, 2009 at 10:46 AM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote: >> I'm working on speeding up some code, and I wanted to check with >> someone before implementation. >> >> As you can see below, the code primarily spends its time multiplying >> relatively small matrices. Precision is of course important but not >> an incredibly crucial issue, as the most important thing is relative >> comparison between things which *should* be pretty different. > > You need to post your matrix multiplication code so that the regulars > on this list can tear it to pieces :) > > From the profile you gave, it looks like you parameterized your matrix > multiplication code over the + and * operations over matrix elements. > This is good for genericity but not so good for performance, as it > will result in more boxing (heap allocation) of floatingpoint values. > The first thing you should try is write a version of matrix > multiplication that is specialized for type "float". > > Then, there are several ways to write the textbook matrix > multiplication algorithm, some of which perform less boxing than > others. Again, post your code and we'll let you know. > >> Currently I'm just using native (doubleprecision) ocaml floats and >> the native ocaml arrays for a first pass on the problem. Now I'm >> thinking about moving to using float32 bigarrays, and I'm hoping >> that the code will double in speed. I'd like to know: is that >> realistic? Any other suggestions? > > It won't double in speed: arithmetic operations will take exactly the > same time in single or double precision. What singleprecision > bigarrays buy you is halving the memory footprint of your matrices. > That could result in better cache behavior and therefore slightly > better speed, but it depends very much on the sizes and number of your > matrices. > >  Xavier Leroy >