Browse thread
Thread safe Str
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
| Date: | -- (:) |
| From: | Gerd Stolpmann <info@g...> |
| Subject: | Re: [Caml-list] Thread safe Str |
On Die, 2005-01-11 at 18:55, skaller wrote: > On Tue, 2005-01-11 at 23:08, Gerd Stolpmann wrote: > > > > Capturing just for the purpose of string extraction is not problematic. > > Unfortunately you're wrong. I had in fact hoped this was not > the case, but having read a couple of paper on it, > I now know that, unfortunately, they are. (See below). It's a matter of taste. Granted, you cannot define capturing directly for an automaton, only by a possibly poor backtracking algorithm. (This means the approach is "definition by algorithm", not really the best way to do it.) Which leads to a more serious problem: One must know the algorithm of the implementation to avoid running into a hard backtracking case (which can be exponential). There are a lot of tutorials explaining good and bad regexps. The lex-type scanners do not have this problem (but you may run into space problems instead because the automaton becomes large). > Alain Frisch is one of the leaders in this area, perhaps > he can explain it better. You can get Cardelli and Frisch > paper here: > > http://felix.sf.net/papers/greedy.pdf When I understand the direction correctly, this paper explains how to avoid backtracking by using a two-pass algorithm (automaton plus expression-based postprocessing to get the captured strings). > another by Ville Laurikari here: > > http://felix.sf.net/papers/spire2000-tnfa.ps > > and his Master thesis on the topic here: > > http://felix.sf.net/papers/regex-submatch.ps > > [and if you can figure out how to actually build > a tagged DFA after reading any of that please let > me know .. ] > > Frisch's algorithm used in CDuce works with a forward > pass that ignores captures, but records the automaton states, > and then a backwards pass the extracts the information > (CDuce actually builds trees). > > Unfortunately this has a problem: it requires a > bidirectional iterator, whereas a DFA only requires > an input iterator. (NFA's require forward iterators) For string regexps this is not a big problem, the string is in memory. For XML, this is more a problem, especially when the XML document is only available event by event (processing large documents). Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de ------------------------------------------------------------