Package with multidimensional AND sparse matrices?

Andreas Biegert
 Markus Mottl

Jon Harrop
 Andreas Biegert
[
Home
]
[ Index:
by date

by threads
]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
Date:  20060704 (07:46) 
From:  Andreas Biegert <andreas.biegert@g...> 
Subject:  Re: [Camllist] Package with multidimensional AND sparse matrices? 
Well, in short: the algorithm computes the optimal multiple alignment of up to n protein sequences with maximum length L. The straightforward and theoretically optimal approach to solve this problem is by dynamic programming, therefore the time and space complexity is O(L^n). I managed to come up with a pretty good heuristic that lets me boil down the number of cells that need to be computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact I can even manually adjust the number of nonzero entries in the matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will be nonzero. Furthermore for all nonzero cells M(i,j,k,l) holds i<j<k<l. The bad news is that performance is important since the program has to run under 5min. An estimation of the time requirements given the number of sequences n and masimum lengths L is: 24 * L^2 * T_x * 2^(n+1)  1 with T_x being about 100ns For n=4 and L=2000 I get just about 5min. Would it be a bad idea to use Hashtables all the way instead of matrices? Accessing elements in a hashtable has O(1), right? Andreas 2006/7/4, Andreas Biegert <andreas.biegert@googlemail.com>: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be nonzero. Furthermore for all nonzero cells M(i,j,k,l) holds > i<j<k<l. The bad news is that performance is important since the > program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1)  1 > with T_x being about 100ns > > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? Accessing elements in > a hashtable has O(1), right? > > Andreas > > > > 2000/1/2, Jon Harrop <jon@ffconsultancy.com>: > > On Monday 03 July 2006 23:55, Andreas Biegert wrote: > > > The ocamlfloat package provides a sparse matrix implmentation for > > > 2dimensional matrices, but I would need at least 4 dimensions. > > > > You probably mean "tensor" rather than "matrix". > > > > > In case there is no such package, would it be very hard to write one by > > > myself? > > > > Provided performance and memory usage are not too important, it is quite > > simple to write your own implementation. First, you must decide what > > requirements you have, i.e. what do you want to do with these tensors? Then, > > you must choose a data structure to represent your tensors, e.g. a hash table > > that maps int * int * int * int to float. Finally, you must implement the > > functions that you need. > > > > You could start with something like this: > > > > module SparseArray : sig > > type t > > > > val make : int list > t > > val get : t > int list > float > > val set : t > int list > float > unit > > end = struct > > type t = int list * (int list, float) Hashtbl.t > > > > let make n = n, Hashtbl.create 0 > > > > let check s n i = > > List.iter2 (fun n i > if i<0  i>=n then invalid_arg s) n i > > > > let get (n, m) i = > > check "SparseArray.get" n i; > > try Hashtbl.find m i with Not_found > 0. > > > > let set (n, m) i x = > > check "SparseArray.set" n i; > > if x<>0. then Hashtbl.replace m i x else > > try Hashtbl.remove m i with Not_found > () > > end;; > > > > That should be fine for getting and setting but you probably want to loop over > > dimensions and perform arithmetic operations, in which case you need a way to > > jump to the nonzero elements. > > > > How many nonzero elements do you have and how long do you expect your > > calculations to take? > > > >  > > Dr Jon D Harrop, Flying Frog Consultancy Ltd. > > Objective CAML for Scientists > > http://www.ffconsultancy.com/products/ocaml_for_scientists > > > > _______________________________________________ > > Camllist mailing list. Subscription management: > > http://yquem.inria.fr/cgibin/mailman/listinfo/camllist > > Archives: http://caml.inria.fr > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/camlbugs > > >