Date: 2005-01-27 (16:56)
From: skaller <skaller@u...>
Subject: Re: [Caml-list] index of substring
On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote:

> What's wrong with this way of handling nested comments with ocamllex:
>
> ======================================================================
>   | "(*" { comment lexbuf; ... }
>   ...
>
> and comment = parse
>   | "(*" { comment lexbuf; comment lexbuf }
>   | "*)" { () }
>   | _    { comment lexbuf }

Well, it doesn't handle (* or *) in strings .. (a string-aware sketch is
appended at the end of this message).

However, whilst this code (given a fix to handle strings) would probably
work, it isn't a plain lexer, but a mix of lexing and client code. In
particular the recursion only works 'by luck of the implementation',
which allows sharing of a lexbuf between two lexer functions -- who
knows how the lexer handles fallbacks?

I have actually been toying with an implementation that does a bit of
'Boyer/Moore'-style processing to reduce or eliminate the need to
rescan characters.

The idea is that whenever you reach an accepting state, you create a
new automaton and begin scanning from the start state. When you finally
fail you can fall back to the last success, but to lex the next token
you do not need to rescan from the start state with the next character,
since you have already done that (a rough sketch of this scheme is also
appended below).

Although this idea may actually lead to faster processing sometimes, a
big advantage is that it works with input iterators, whereas fallback
and rescan require a forward iterator. An input iterator can be
converted to a forward one, but only with a potentially unbounded
buffer.

In theory this means that whilst a deterministic finite state automaton
can *recognize* a string with finite memory in linear time,
tokenisation is in fact NOT finite state at all, but requires linear
state -- which is a major disadvantage.

By actually merging the automata described above, it may be possible
for the 'regexp compiler' to bound the extra state required for some
regexps, obtaining finite state tokenisers, whereas present lexers are
not finite state.

Anyhow, the point is that with such an augmented lexer it is not clear
'client code' would be allowed to call another (or the same) lexer with
the same lexbuf. [In other words, your code is not transparent.]

--
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850,
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net
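
For the record, here is one rough sketch of what a string-aware version
of that comment rule could look like. It is only an illustration, not
the compiler's real lexer: the token rule is a stub, escape handling
inside strings is crude, and all the rule names are made up.

rule token = parse
  | "(*"              { comment lexbuf; token lexbuf }
  | [' ' '\t' '\n']+  { token lexbuf }
  | ['a'-'z']+ as id  { Printf.printf "ident %s\n" id; token lexbuf }
  | eof               { () }

and comment = parse
  | "(*"              { comment lexbuf; comment lexbuf }
  | "*)"              { () }
  | '"'               { string_in_comment lexbuf; comment lexbuf }
  | eof               { failwith "unterminated comment" }
  | _                 { comment lexbuf }

and string_in_comment = parse
  | '"'               { () }
  | '\\' _            { string_in_comment lexbuf }
  | eof               { failwith "unterminated string in comment" }
  | _                 { string_in_comment lexbuf }

{
  let () = token (Lexing.from_string "x (* a \"*)\" b (* nested *) *) y")
}

With this, the *) inside the quoted string no longer terminates the
comment and nested (* ... *) still balance, so the example input lexes
to just the identifiers x and y. It is, of course, still the same mix
of lexing and client recursion complained about above.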
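
And a rough sketch of the forked-automaton idea, in plain OCaml rather
than ocamllex. The DFA representation, the names and the driver are all
invented for illustration; it is only meant to show that maximal-munch
tokenisation can be driven off a once-only input iterator by forking a
fresh scan at every accepting state and merging on fallback, at the
cost of a chain of speculative scans (the 'linear state' above).

(* Each time a scan reaches an accepting state it forks a fresh automaton
   for the text that follows, so falling back to the last accept never
   requires rescanning any characters. *)

type dfa = {
  start  : int;
  step   : int -> char -> int option;   (* None = no transition *)
  accept : int -> bool;
}

(* One in-flight scan: current state (None once dead), characters consumed
   so far (newest first), and the longest accepted lexeme found so far. *)
type run = { state : int option; seen : char list; matched : string option }

(* One level of speculation: tokens completed at this level (newest first,
   definitive only once every level above has fallen back) plus its scan. *)
type level = { done_ : string list; run : run }

let fresh dfa = { state = Some dfa.start; seen = []; matched = None }
let lexeme r = String.concat "" (List.rev_map (String.make 1) r.seen)

(* Feed one character to the chain of speculative scans, head first. *)
let rec feed dfa chain c =
  match chain with
  | [] -> feed dfa [ { done_ = []; run = fresh dfa } ] c
  | lvl :: rest ->
    let r = lvl.run in
    (match r.state with
     | None -> chain                      (* dead speculation: latent error *)
     | Some st ->
       (match dfa.step st c with
        | Some s ->
          let r' = { state = Some s; seen = c :: r.seen; matched = r.matched } in
          if dfa.accept s then
            (* New longest match: scans forked after the previous accept are
               obsolete, so drop them and fork one fresh scan. *)
            { lvl with run = { r' with matched = Some (lexeme r') } }
            :: [ { done_ = []; run = fresh dfa } ]
          else
            (* c lies beyond this scan's last accept, so it also belongs to
               the speculative scan(s) forked below. *)
            { lvl with run = r' }
            :: (match rest with [] -> [] | _ -> feed dfa rest c)
        | None ->
          (match r.matched with
           | Some tok ->
             (* Fall back to the last accept.  The characters after it were
                already fed to the next level, so merge with that level
                instead of rescanning, then retry c against the merged scan. *)
             let next, rest' =
               match rest with
               | next :: rest' -> (next, rest')
               | [] -> ({ done_ = []; run = fresh dfa }, [])
             in
             feed dfa
               ({ done_ = next.done_ @ (tok :: lvl.done_); run = next.run }
                :: rest') c
           | None ->
             (* Died without ever accepting: only a real error if the levels
                above actually fall back to this point. *)
             { lvl with run = { r with state = None; seen = c :: r.seen } }
             :: rest)))

(* Drive the chain over an input iterator: a char Seq.t is read strictly
   once, so the driver itself needs no rewindable buffer. *)
let tokenize dfa (input : char Seq.t) : string list =
  let consume (out, chain) c =
    match feed dfa chain c with
    | ({ done_ = []; _ } :: _ | []) as chain -> (out, chain)
    | { done_; run } :: rest -> (done_ @ out, { done_ = []; run } :: rest)
  in
  let start = [ { done_ = []; run = fresh dfa } ] in
  let out, chain = Seq.fold_left consume ([], start) input in
  (* End of input: every remaining level falls back in turn. *)
  let finish out lvl =
    let out = lvl.done_ @ out in
    match lvl.run.matched, lvl.run.seen with
    | Some tok, _ -> tok :: out
    | None, [] -> out
    | None, _ -> failwith "lexical error"
  in
  List.rev (List.fold_left finish out chain)

(* Tiny demo DFA: identifiers [a-z]+, integers [0-9]+ and the operator "**";
   "x**2" tokenises as ["x"; "**"; "2"] without any rescanning. *)
let demo = {
  start = 0;
  accept = (fun s -> s = 1 || s = 2 || s = 4);
  step = (fun s c ->
    match s, c with
    | (0 | 1), 'a' .. 'z' -> Some 1
    | (0 | 2), '0' .. '9' -> Some 2
    | 0, '*' -> Some 3
    | 3, '*' -> Some 4
    | _ -> None);
}

let () = List.iter print_endline (tokenize demo (String.to_seq "x**2"))

The chain of levels is exactly the extra state discussed above: it
stays short when fallbacks are shallow, but nothing bounds it in
general, which is where merging the forked automata inside the regexp
compiler would have to come in.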