English version
Accueil     À propos     Téléchargement     Ressources     Contactez-nous    

Ce site est rarement mis à jour. Pour les informations les plus récentes, rendez-vous sur le nouveau site OCaml à l'adresse ocaml.org.

Browse thread
Ropes: progressively crazy ideas
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: 2007-08-13 (21:32)
From: Jon Harrop <jon@f...>
Subject: Ropes: progressively crazy ideas

I decided to have another go at curing cancer over the weekend. Downloaded the 
human genome sequence from UCSC and started reading papers on suffix trees.

The data is primarily a sequence of A, C, G and T, of course. However, there 
are also sparse entries of other characters, primarily N. You really want to 
store the bulk of the data as a dense array of bit pairs for the four most 
common characters and then store the exceptional cases in a sparse data 

Perhaps it would be useful if your Vect implementation could be abstracted 
over different concrete leaf implementations, such as bit vectors? Presumably 
leaf size should be measured in bytes rather than elements?

Also, I've noticed a correlation between the purely functional implementations 
of the linear-time suffix tree generation algorithms, Huet's zippers and my 
work that I mentioned before on lazy metadata in nodes.

I'm toying with the idea of a sequence type based on Vect that lazily computes 
its own suffix tree. You would need some way to compose suffix trees when 
sequences are combined (e.g. concatenation of sequences) but the result could 
be very useful.

For example, Mathematica's pattern matcher allows you to search for 

  f[{front___, x, y, z, back___}] := {front, back}

This is currently implemented completely naively by recursive linear searches 
of an array. If the sequence type could generate and reuse suffix trees then 
subsequence matches should be vastly more efficient. The ability to search 
for subsequences is the main reason why Mathematica's pattern matches are not 
optimized as ML's are.

Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists