Re: Re: Re: [Caml-list] newbie questions
• Dr.Dr.Ruediger M.Flaig
From: Dr. Dr. Ruediger M. Flaig
Subject: Re: Re: Re: [Caml-list] newbie questions

> Please define "fast processing of large amounts of data". This can mean
> widely different things.

I am working with DNA, so my idea of large amounts of data is a huge annotated sequence file. The annotations are not a problem: they are small by comparison and may be kept separate. What remains is a plain string of up to several million base pairs -- indexed 2-bit elements, strictly speaking, but usually handled as bytes.
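
To illustrate the 2-bit view of a sequence, here is a minimal OCaml sketch that packs four bases into each byte. The encoding (a=0, c=1, g=2, t=3) and the names `code_of_base`, `base_of_code`, `pack` and `get` are assumptions made for this example, not an established convention.

```ocaml
(* Hypothetical 2-bit encoding, assumed for this sketch: a=0, c=1, g=2, t=3. *)
let code_of_base = function
  | 'a' | 'A' -> 0
  | 'c' | 'C' -> 1
  | 'g' | 'G' -> 2
  | 't' | 'T' -> 3
  | ch -> invalid_arg (Printf.sprintf "code_of_base: %c" ch)

let base_of_code code = "acgt".[code]

(* Pack a sequence into a byte buffer, four bases per byte. *)
let pack (s : string) : Bytes.t =
  let n = String.length s in
  let buf = Bytes.make ((n + 3) / 4) '\000' in
  for i = 0 to n - 1 do
    let byte = Char.code (Bytes.get buf (i / 4)) in
    let shift = 2 * (i mod 4) in
    let packed = byte lor ((code_of_base s.[i]) lsl shift) in
    Bytes.set buf (i / 4) (Char.chr packed)
  done;
  buf

(* Read base number i back out of the packed buffer. *)
let get (buf : Bytes.t) (i : int) : char =
  let byte = Char.code (Bytes.get buf (i / 4)) in
  base_of_code ((byte lsr (2 * (i mod 4))) land 3)
```

Packing cuts memory use to a quarter, but every access then costs a shift and a mask, which is probably why one byte per base remains the usual working format.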

Imagine I want to do the following: in order to plan an experiment, I have to find the positions on a long DNA sequence at which a shorter one may bind under certain conditions. There are approximations for that, but lab experience shows that they just don't work properly in real life. So the most reliable thing to do is to calculate the affinity for every pair of contiguous subsequences (substrings) of the two sequences:

Seq 1 = ggatcggctaag -> Substrings: ggatcggctaa, ggatcggcta, ggatcggct, ggatcggc, ..., gg, gatcggctaag, gatcggctaa, gatcggcta, gatcggc, ..., ga, ..., ctaa, cta, ct, taa, ta.
Seq 2 = aacgtaa -> Substrings: aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa.
Match ggatcggctaa with aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa; match ggatcggcta with aacgta, aacgt, aacg, ..., aa, acgta, acgt, acg, ac, ..., taa, ta, aa; .........; match ta with aa.
where "match" means: calculate the maximal temperature at which these sequences may bind to each other.

Okay, getting this done either recursively or iteratively is freshman-level programming, and you can add lots of "cutoffs" to reduce the workload, but getting it done FAST is quite a different matter when one of the sequences is in the megabyte range...

> And figure out how you can minimize the rate of new object
> creation.

No new object creation is needed at all, if all this is done by indexing... (I have followed the thread about GC efficiency)
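
As an illustration of the indexing idea, the OCaml sketch below visits every pair of substrings by start position and length only, so no substring is ever copied. The function `score` is a hypothetical placeholder that merely counts complementary positions and ignores strand orientation; it is not a melting-temperature model and only stands in for the real "match" calculation. The names `complementary`, `score`, `best_match` and the `min_len` parameter are likewise assumptions for this example.

```ocaml
(* True when two lower-case bases form a Watson-Crick pair. *)
let complementary a b =
  match a, b with
  | 'a', 't' | 't', 'a' | 'c', 'g' | 'g', 'c' -> true
  | _ -> false

(* Placeholder affinity (NOT a real temperature model): number of
   complementary positions over the shorter of the two windows. *)
let score s1 i1 len1 s2 i2 len2 =
  let n = min len1 len2 in
  let count = ref 0 in
  for k = 0 to n - 1 do
    if complementary s1.[i1 + k] s2.[i2 + k] then incr count
  done;
  !count

(* Brute force over all substring pairs of length >= min_len.
   Each window is described by (start, length); nothing is copied.
   Returns (best score, start1, len1, start2, len2). *)
let best_match ?(min_len = 2) s1 s2 =
  let n1 = String.length s1 and n2 = String.length s2 in
  let best = ref min_int in
  let b_i1 = ref 0 and b_len1 = ref 0 in
  let b_i2 = ref 0 and b_len2 = ref 0 in
  for i1 = 0 to n1 - min_len do
    for len1 = min_len to n1 - i1 do
      for i2 = 0 to n2 - min_len do
        for len2 = min_len to n2 - i2 do
          let sc = score s1 i1 len1 s2 i2 len2 in
          if sc > !best then begin
            best := sc;
            b_i1 := i1; b_len1 := len1;
            b_i2 := i2; b_len2 := len2
          end
        done
      done
    done
  done;
  (!best, !b_i1, !b_len1, !b_i2, !b_len2)
```

A call such as `best_match "ggatcggctaag" "aacgtaa"` runs the full enumeration on the two example sequences. The four nested loops make the cost grow with the square of each sequence length, so for a megabyte-sized sequence this brute-force form is only a starting point and needs the cutoffs mentioned above.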

> If you are dealing with matrices (numerical analysis...), yes, probably
> you want Array's or Bigarray's.
>
> Otherwise, even for structures mapping an integer range to values, arrays
> may not be the best choice. I have in mind a particular example where we
> used a balanced binary map from integers to values, because this allowed
> implementing certain optimizations (see section 6.2 of
> http://www.di.ens.fr/~monniaux/biblio/Static_analyzer_LNCS2566.pdf ).

Yup, that sounds very interesting. I'll have a look.

Yours,
Ruediger

Dr. Dr. Ruediger Marcus Flaig
Institute for Immunology
University of Heidelberg
Im Neuenheimer Feld 305
D-69120 Heidelberg
<flaig@cirith-ungol.sanctacaris.net>
Tel. +49-172-7652946
Fax  +49-4075110-17171
