Browse thread
Package with multidimensional AND sparse matrices?
-
Andreas Biegert
- Markus Mottl
-
Jon Harrop
- Andreas Biegert
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: | 2006-07-04 (07:46) |
From: | Andreas Biegert <andreas.biegert@g...> |
Subject: | Re: [Caml-list] Package with multidimensional AND sparse matrices? |
Well, in short: the algorithm computes the optimal multiple alignment of up to n protein sequences with maximum length L. The straightforward and theoretically optimal approach to solve this problem is by dynamic programming, therefore the time and space complexity is O(L^n). I managed to come up with a pretty good heuristic that lets me boil down the number of cells that need to be computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact I can even manually adjust the number of nonzero entries in the matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds i<j<k<l. The bad news is that performance is important since the program has to run under 5min. An estimation of the time requirements given the number of sequences n and masimum lengths L is: 24 * L^2 * T_x * 2^(n+1) - 1 with T_x being about 100ns For n=4 and L=2000 I get just about 5min. Would it be a bad idea to use Hashtables all the way instead of matrices? Accessing elements in a hashtable has O(1), right? Andreas 2006/7/4, Andreas Biegert <andreas.biegert@googlemail.com>: > Well, in short: the algorithm computes the optimal multiple alignment > of up to n protein sequences with maximum length L. The > straightforward and theoretically optimal approach to solve this > problem is by dynamic programming, therefore the time and space > complexity is O(L^n). I managed to come up with a pretty good > heuristic that lets me boil down the number of cells that need to be > computed giving me a complexitiy of O(L^2) instead of O(L^n), in fact > I can even manually adjust the number of nonzero entries in the > matrix. In the 2000*2000*2000*2000 matrix only at most 2000 cells will > be non-zero. Furthermore for all non-zero cells M(i,j,k,l) holds > i<j<k<l. The bad news is that performance is important since the > program has to run under 5min. An estimation of the time requirements > given the number of sequences n and masimum lengths L is: > > 24 * L^2 * T_x * 2^(n+1) - 1 > with T_x being about 100ns > > For n=4 and L=2000 I get just about 5min. Would it be a bad idea to > use Hashtables all the way instead of matrices? Accessing elements in > a hashtable has O(1), right? > > Andreas > > > > 2000/1/2, Jon Harrop <jon@ffconsultancy.com>: > > On Monday 03 July 2006 23:55, Andreas Biegert wrote: > > > The ocamlfloat package provides a sparse matrix implmentation for > > > 2-dimensional matrices, but I would need at least 4 dimensions. > > > > You probably mean "tensor" rather than "matrix". > > > > > In case there is no such package, would it be very hard to write one by > > > myself? > > > > Provided performance and memory usage are not too important, it is quite > > simple to write your own implementation. First, you must decide what > > requirements you have, i.e. what do you want to do with these tensors? Then, > > you must choose a data structure to represent your tensors, e.g. a hash table > > that maps int * int * int * int to float. Finally, you must implement the > > functions that you need. > > > > You could start with something like this: > > > > module SparseArray : sig > > type t > > > > val make : int list -> t > > val get : t -> int list -> float > > val set : t -> int list -> float -> unit > > end = struct > > type t = int list * (int list, float) Hashtbl.t > > > > let make n = n, Hashtbl.create 0 > > > > let check s n i = > > List.iter2 (fun n i -> if i<0 || i>=n then invalid_arg s) n i > > > > let get (n, m) i = > > check "SparseArray.get" n i; > > try Hashtbl.find m i with Not_found -> 0. > > > > let set (n, m) i x = > > check "SparseArray.set" n i; > > if x<>0. then Hashtbl.replace m i x else > > try Hashtbl.remove m i with Not_found -> () > > end;; > > > > That should be fine for getting and setting but you probably want to loop over > > dimensions and perform arithmetic operations, in which case you need a way to > > jump to the non-zero elements. > > > > How many non-zero elements do you have and how long do you expect your > > calculations to take? > > > > -- > > Dr Jon D Harrop, Flying Frog Consultancy Ltd. > > Objective CAML for Scientists > > http://www.ffconsultancy.com/products/ocaml_for_scientists > > > > _______________________________________________ > > Caml-list mailing list. Subscription management: > > http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list > > Archives: http://caml.inria.fr > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > Bug reports: http://caml.inria.fr/bin/caml-bugs > > >