Version française
Home     About     Download     Resources     Contact us    

This site is updated infrequently. For up-to-date information, please visit the new OCaml website at

Browse thread
Matching start of input in lexer created with ocamllex
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: 2007-04-06 (05:52)
From: skaller <skaller@u...>
Subject: Re: [Caml-list] Matching start of input in lexer created with ocamllex
On Thu, 2007-04-05 at 23:58 +0300, Janne Hellsten wrote:

> I do have the latter construct in my lexer.  How does the above help
> me match the beginning of line in the rule

Ok, sorry .. lets try this:

rule headlex = 
| "!for" nonalpha { Some FOR }
| { None }

rule taillex = ...

and now use 

let make_lexer () =
  let atstart  = ref true in
  fun lexer ->
    if !atstart then begin
      atstart := false
      match headlex lexbuf with
      | None -> 
         lexbuf.curr_pos <- lexbuf.curr_pos -1;
         taillex lexbuf
      | Some FOR -> FOR
    end else taillex lexbuf

let lexer = make_lexer () in

This is an animal, because you have to 'pushback' an
unwanted character following the "for" keyword.
This will cease working if the implementation of lexbuf
changes, in particular if the record is made private.

I may also have got the details on this wrong .. it's
an undocumented feature, you'll need to read
to figure out if it works.

The problem I'm trying to solve here is that you want 

	rulename lexbuf

to return a single correct token. This is a requirement
of the parser.

In Felix I don't bother with that. Instead, I just return
a pre-token, I gather the whole input as a list, and then
I write filters that translate the list of pre-tokens into
tokens. Then I make a 'lexer' function that just returns
the next token in the list already prepared, and uses
a dummy lexbuf to satisfy the broken ocamlyacc interface.

In other words, I control invert by returning a list and then
actively rewriting it. For example I would do this,
assuming you have multiple lines that can start with "for":

let strip_white toks = filter (fun tok -> tok = WHITE) toks

let append_eol toks = NEWLINE :: toks 

let fixup_for toks =
  let aux out inp = match inp with
  | NEWLINE :: BANG :: IDENT "for" :: tail -> aux (FOR :: out) tail
  | NEWLINE :: tail -> aux out tail
  | head :: tail -> aux (head::out) tail
  | [] -> rev out
  in aux [] toks

let tokens = fixup_for (append_eol (strip_white toks)) 

The point is this process has arbitrary lookahead.

The downside is you have to hold the whole input in memory.
This is not going to be a problem unless it's the encoding
of an exported database.

The tagged automata underlying the lexer can probably be modified
to allow forward context checks, eg to match stuff like:

	"!for" / nonalpha

where / denotes lookahead (but I'm not sure). I know TRE can
do this, however ocamllex is using a prebuilt tagged 
deterministic automaton, whereas TRE uses a lazily 
constructed one.

The technique basically matches

 	"!for" (TAG) nonalpha

first, but sets the end location to the TAG position. I believe this
might be possible because the automata ALREADY looks ahead and
backtracks to the last success .. so we need to keep two 'success'
points to do this, the real one and the lookahead one. Normally
they're the same, but in this case they're not. The lexer engine
uses the lookahead one until it is finished, but then fudges
its result to the real one.

The two pointers are needed because in general in

	regexp1 / regexp2

we have regexp2 'success' point to track. If you allow

	regexp1 / (regexp2 / regexp3)

you might need more pointers (not sure), so this would probably 
have to be banned (not sure).

John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: