Browse thread
Matching start of input in lexer created with ocamllex
-
Janne Hellsten
-
skaller
-
Janne Hellsten
- skaller
-
Janne Hellsten
-
skaller
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
| Date: | -- (:) |
| From: | skaller <skaller@u...> |
| Subject: | Re: [Caml-list] Matching start of input in lexer created with ocamllex |
On Thu, 2007-04-05 at 23:58 +0300, Janne Hellsten wrote:
> I do have the latter construct in my lexer. How does the above help
> me match the beginning of line in the rule
Ok, sorry .. lets try this:
rule headlex =
| "!for" nonalpha { Some FOR }
| { None }
rule taillex = ...
and now use
let make_lexer () =
let atstart = ref true in
fun lexer ->
if !atstart then begin
atstart := false
match headlex lexbuf with
| None ->
lexbuf.curr_pos <- lexbuf.curr_pos -1;
taillex lexbuf
| Some FOR -> FOR
end else taillex lexbuf
let lexer = make_lexer () in
This is an animal, because you have to 'pushback' an
unwanted character following the "for" keyword.
This will cease working if the implementation of lexbuf
changes, in particular if the record is made private.
I may also have got the details on this wrong .. it's
an undocumented feature, you'll need to read lexing.ml
to figure out if it works.
The problem I'm trying to solve here is that you want
rulename lexbuf
to return a single correct token. This is a requirement
of the parser.
In Felix I don't bother with that. Instead, I just return
a pre-token, I gather the whole input as a list, and then
I write filters that translate the list of pre-tokens into
tokens. Then I make a 'lexer' function that just returns
the next token in the list already prepared, and uses
a dummy lexbuf to satisfy the broken ocamlyacc interface.
In other words, I control invert by returning a list and then
actively rewriting it. For example I would do this,
assuming you have multiple lines that can start with "for":
let strip_white toks = filter (fun tok -> tok = WHITE) toks
let append_eol toks = NEWLINE :: toks
let fixup_for toks =
let aux out inp = match inp with
| NEWLINE :: BANG :: IDENT "for" :: tail -> aux (FOR :: out) tail
| NEWLINE :: tail -> aux out tail
| head :: tail -> aux (head::out) tail
| [] -> rev out
in aux [] toks
let tokens = fixup_for (append_eol (strip_white toks))
The point is this process has arbitrary lookahead.
The downside is you have to hold the whole input in memory.
This is not going to be a problem unless it's the encoding
of an exported database.
The tagged automata underlying the lexer can probably be modified
to allow forward context checks, eg to match stuff like:
"!for" / nonalpha
where / denotes lookahead (but I'm not sure). I know TRE can
do this, however ocamllex is using a prebuilt tagged
deterministic automaton, whereas TRE uses a lazily
constructed one.
The technique basically matches
"!for" (TAG) nonalpha
first, but sets the end location to the TAG position. I believe this
might be possible because the automata ALREADY looks ahead and
backtracks to the last success .. so we need to keep two 'success'
points to do this, the real one and the lookahead one. Normally
they're the same, but in this case they're not. The lexer engine
uses the lookahead one until it is finished, but then fudges
its result to the real one.
The two pointers are needed because in general in
regexp1 / regexp2
we have regexp2 'success' point to track. If you allow
regexp1 / (regexp2 / regexp3)
you might need more pointers (not sure), so this would probably
have to be banned (not sure).
--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net