Date: 2005-01-27 (16:56)
From: skaller <skaller@u...>
Subject: Re: [Caml-list] index of substring
On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote:
> What's wrong with this way of handling nested comments with ocamllex:
> ======================================================================
> | "(*" { comment lexbuf; ... }
> ...
> and comment = parse
> | "(*" { comment lexbuf; comment lexbuf }
> | "*)" { () }
> | _    { comment lexbuf }

Well, it doesn't handle (* or *) occurring inside string literals ...
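To make the point concrete, here is a hypothetical sketch in plain OCaml (not ocamllex, and not anyone's actual fix): an invented helper skip_comment that, given the position just past an opening (*, returns the position just past the matching *), counting nesting and skipping string literals so that a *) inside a string is not mistaken for the comment terminator.

```ocaml
(* Hypothetical hand-rolled scanner, for illustration only.
   [start] points just past an opening "(*"; the result points
   just past the matching "*)". *)
let skip_comment (s : string) (start : int) : int =
  let n = String.length s in
  let rec go i depth =
    if depth = 0 then i
    else if i >= n then failwith "unterminated comment"
    else if i + 1 < n && s.[i] = '(' && s.[i+1] = '*' then go (i + 2) (depth + 1)
    else if i + 1 < n && s.[i] = '*' && s.[i+1] = ')' then go (i + 2) (depth - 1)
    else if s.[i] = '"' then
      (* skip a string literal so "*)" inside it is ignored *)
      let rec str j =
        if j >= n then failwith "unterminated string"
        else if s.[j] = '\\' then str (j + 2)   (* skip escaped char *)
        else if s.[j] = '"' then j + 1
        else str (j + 1)
      in
      go (str (i + 1)) depth
    else go (i + 1) depth
  in
  go start 1
```

In ocamllex the same effect is obtained by adding a rule for '"' in the comment entry point that calls a string-skipping entry point.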

However, whilst this code (given a fix
to handle strings) would probably work,
it isn't a plain lexer, but a mix of lexing
and client code. 

In particular, the recursion only works 'by luck of
the implementation': it depends on the implementation
allowing a lexbuf to be shared between two lexer
functions -- who knows how the lexer handles fallbacks?

I have actually been toying with an implementation
that does a bit of 'Boyer/Moore' processing
to reduce or eliminate the need to rescan characters.

The idea is that whenever you reach an accepting
state, you create a new automaton, and begin
scanning from the start state. When you finally
fail you can fallback to the last success,
but to lex the next token you do not need
to rescan from the start state with the next
character, since you have already done that.
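For contrast, here is a toy maximal-munch scanner in plain OCaml (a hypothetical token set {ab, abcd, c}, not a real lexer) showing the fallback-and-rescan behaviour the scheme above is meant to avoid: the DFA records the last accepting position and, on failure, falls back to it, so every character read past that point must be rescanned from the start state on the next call.

```ocaml
(* Toy DFA for the token set {"ab", "abcd", "c"} -- hypothetical.
   States: 0 start, 1 "a", 2 "ab" (accept), 3 "abc",
   4 "abcd" (accept), 5 "c" (accept). *)
let step state c =
  match state, c with
  | 0, 'a' -> Some 1
  | 0, 'c' -> Some 5
  | 1, 'b' -> Some 2
  | 2, 'c' -> Some 3
  | 3, 'd' -> Some 4
  | _ -> None

let accepting = function 2 | 4 | 5 -> true | _ -> false

let tokenize (s : string) : string list =
  let n = String.length s in
  let rec next i acc =
    if i >= n then List.rev acc
    else
      (* run the DFA, recording the last accepting position *)
      let rec run state j last =
        if j >= n then last
        else match step state s.[j] with
          | None -> last
          | Some st ->
              let last' = if accepting st then Some (j + 1) else last in
              run st (j + 1) last'
      in
      match run 0 i None with
      | None -> failwith "lex error"
      | Some stop ->
          (* fallback: anything read past [stop] is rescanned
             from the start state on the next call *)
          next stop (String.sub s i (stop - i) :: acc)
  in
  next 0 []
```

On input "abcab" the scanner accepts "ab", reads 'c' hoping for "abcd", fails at the second 'a', falls back, and then rescans 'c' from the start state -- the rescan this scheme eliminates.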

Although this idea may actually lead to faster
processing sometimes, a big advantage is that
it works with input iterators, whereas fallback
and rescan require a forward iterator.
An input iterator can be converted to a forward
one, but only with a potentially infinite buffer.
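That conversion can be sketched as follows (all names invented for illustration): wrapping an input iterator so that reads can be replayed means buffering every speculatively read item, and nothing in general bounds the buffer.

```ocaml
(* Hypothetical sketch: turning an input iterator into a forward one.
   Items pulled from [next] are kept in [buf] so the lexer can rewind;
   the buffer grows with each speculative read and is unbounded in
   the worst case -- the "potentially infinite buffer". *)
type 'a replayable = {
  next : unit -> 'a option;   (* the underlying input iterator *)
  mutable buf : 'a list;      (* items read back but not yet re-consumed *)
}

let make next = { next; buf = [] }

(* read from the replay buffer first, then the underlying iterator *)
let read r =
  match r.buf with
  | x :: rest -> r.buf <- rest; Some x
  | [] -> r.next ()

(* return items to the front of the stream, in order *)
let push_back r xs = r.buf <- xs @ r.buf

(* a string-backed input iterator, for testing *)
let of_string s =
  let i = ref 0 in
  make (fun () ->
    if !i < String.length s then (let c = s.[!i] in incr i; Some c)
    else None)
```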

In theory this means that whilst a deterministic
finite state automaton can *recognize* a string
with finite memory in linear time, tokenisation
is in fact NOT finite state at all, but requires
linear state -- which is a major disadvantage.

By actually merging the automata described above,
it may be possible for the 'regexp compiler' to
bound the extra state required for some regexps,
obtaining finite state tokenisers where present
lexers are not finite state.

Anyhow, the point is that with such an augmented
lexer it is not clear 'client code' would be allowed
to call another (or the same) lexer with the same
lexbuf.

[In other words, your code is not transparent.]

John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net