Proposal for new Lexing Module (long)

From: Christian Lindig (
Date: Mon Jan 18 1999 - 09:24:39 MET

Date: Mon, 18 Jan 1999 09:24:39 +0100
From: Christian Lindig <>
To: Caml Mailing List <>
Subject: Proposal for new Lexing Module (long)

Lexers generated by OCamlLex often need state information that must
survive between calls to the lexer from a parser. Examples of this
kind of state are the current source line and column, or some context
information. Lexing HTML, for example, requires knowing whether the
scanner is reading tokens inside a <tag attribute="value"> or outside
of it. The meaning of quotes is completely different inside and outside
tags, and thus the lexer must store some information about its current
context.
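
As a rough illustration of the HTML case, the sketch below tracks
whether a scanner is inside or outside a tag and classifies each
double quote accordingly. The type context and the function
count_quotes are hypothetical names chosen for this example; they are
not part of OCamlLex or of the proposal, and the scanner is written by
hand rather than generated.

```ocaml
(* Hypothetical context type; the proposal itself does not define one. *)
type context = Inside_tag | Outside_tag

(* Walk over [s] and count double quotes by the context they occur in.
   Returns (quotes seen inside tags, quotes seen outside tags).
   Simplification: a '>' inside a quoted attribute value is not handled. *)
let count_quotes s =
  let ctx = ref Outside_tag in
  let inside = ref 0 and outside = ref 0 in
  String.iter
    (fun c ->
      match c, !ctx with
      | '<', Outside_tag -> ctx := Inside_tag
      | '>', Inside_tag -> ctx := Outside_tag
      | '"', Inside_tag -> incr inside    (* attribute value delimiter *)
      | '"', Outside_tag -> incr outside  (* ordinary text character *)
      | _ -> ())
    s;
  (!inside, !outside)
```

The variable ctx is exactly the kind of state that has to survive from
one lexer call to the next.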

This information is typically stored in global variables inside the
lexer. The generated lexer already uses and passes around a value of
type `lexbuf' for its internal purposes. This value is accessible
inside the semantic actions of lexer rules. I would like to propose an
extended data type for lexbuf that also allows user data to be stored
inside it.

The current lexbuf declaration in OCaml 2.01:

     type lexbuf =
      { refill_buff : lexbuf -> unit;
        mutable lex_buffer : string;
        mutable lex_buffer_len : int;
        mutable lex_abs_pos : int;
        mutable lex_start_pos : int;
        mutable lex_curr_pos : int;
        mutable lex_last_pos : int;
        mutable lex_last_action : int;
        mutable lex_eof_reached : bool }

The proposed new type lexstate, with a type alias for backward
compatibility:

    type 'a lexstate =
      { refill_buff : 'a lexstate -> unit;
        mutable lex_buffer : string;
        mutable lex_buffer_len : int;
        mutable lex_abs_pos : int;
        mutable lex_start_pos : int;
        mutable lex_curr_pos : int;
        mutable lex_last_pos : int;
        mutable lex_last_action : int;
        mutable lex_eof_reached : bool ;
        mutable lex_state : 'a } (* read/write accessible for user *)
    type lexbuf = unit lexstate

In a lexstate value a user can store mutable information of type 'a.
A classical lexbuf is simply a lexstate which stores unit. For
backward compatibility, all old access functions working on lexbuf must
also be present, together with new access functions that work on
lexstate. They can be easily implemented using the following scheme:

    let lex_from_function initial_state f =
      { refill_buff = lex_refill f (String.create 512);
        lex_buffer = String.create 1024;
        lex_buffer_len = 1024;
        lex_abs_pos = - 1024;
        lex_start_pos = 1024;
        lex_curr_pos = 1024;
        lex_last_pos = 1024;
        lex_last_action = 0;
        lex_eof_reached = false ;
        lex_state = initial_state }
    let from_function = lex_from_function ()

The functions doing the real work have their names prefixed with
`lex_' and work on the new type lexstate. They take an additional
parameter initial_state, which is used to initialize the new field
lex_state. Each function kept for backward compatibility calls its
lex_ counterpart, passing a unit value.
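
The same scheme can be tried out in isolation with a cut-down record.
The sketch below is a minimal illustration, not the patched Lexing
module: the record keeps only three fields, and lex_from_string /
from_string are assumed here by analogy with the naming scheme
described above.

```ocaml
(* Cut-down stand-in for the proposed record: most buffer fields omitted. *)
type 'a lexstate =
  { lex_buffer : string;
    mutable lex_curr_pos : int;
    mutable lex_state : 'a }

(* The old type is an alias for a lexstate that carries no user data. *)
type lexbuf = unit lexstate

(* New-style constructor: takes the initial user state explicitly. *)
let lex_from_string initial_state s =
  { lex_buffer = s; lex_curr_pos = 0; lex_state = initial_state }

(* Old-style constructor: same work, with unit as the user state. *)
let from_string : string -> lexbuf = lex_from_string ()
```

Because from_string is only a partial application, the two entry
points cannot drift apart.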

All old sources work correctly because they use the appropriate
functions. New sources can use the lexstate type and two functions
which provide access to the new user state information:

    let lex_get_state lexstate = lexstate.lex_state
    let lex_set_state lexstate x = lexstate.lex_state <- x

The code to create a lexstate for a scanner looks like this:

    let lexstate = Lexing.lex_from_channel 1 stdin in

Here 1 is passed as the initial state and could be thought of as the
current line number. Inside OCamlLex semantic actions, lexstate and
lexbuf values are always accessed under the (old) name lexbuf.
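
To make the line-number reading concrete, here is a small sketch of
how the two access functions might be used for line counting. It
assumes a cut-down record and a hand-written scanning loop in place of
a generated lexer; newline_action, scan and final_line are
hypothetical names, with newline_action standing in for the semantic
action of a newline rule.

```ocaml
(* Cut-down stand-in for the proposed record: most buffer fields omitted. *)
type 'a lexstate =
  { lex_buffer : string;
    mutable lex_curr_pos : int;
    mutable lex_state : 'a }

(* The two access functions from the proposal. *)
let lex_get_state lexstate = lexstate.lex_state
let lex_set_state lexstate x = lexstate.lex_state <- x

(* Stand-in for the semantic action of a newline rule: bump the line
   number stored in the buffer. The argument is named lexbuf, as it
   would be inside a semantic action. *)
let newline_action lexbuf =
  lex_set_state lexbuf (lex_get_state lexbuf + 1)

(* Hand-written loop standing in for the generated lexer. *)
let rec scan lexbuf =
  if lexbuf.lex_curr_pos < String.length lexbuf.lex_buffer then begin
    if lexbuf.lex_buffer.[lexbuf.lex_curr_pos] = '\n' then
      newline_action lexbuf;
    lexbuf.lex_curr_pos <- lexbuf.lex_curr_pos + 1;
    scan lexbuf
  end

(* Scan [input] starting at line 1 and return the final line number. *)
let final_line input =
  let lexbuf = { lex_buffer = input; lex_curr_pos = 0; lex_state = 1 } in
  scan lexbuf;
  lex_get_state lexbuf
```

The line counter lives in the buffer value itself, so no global
variable is needed.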

The code generator of OCamlLex does not need to be changed. The code
it generates is polymorphic enough to work with (old) lexbuf-based
scanners and with lexstate-based scanners as well. The lexer engine is
written in C and implemented in the runtime system
(byterun/lexing.c). It accesses the lexbuf/lexstate from C, and the C
data type declaration should be changed accordingly. This is not
strictly necessary, however, because the buffer is never passed by
value, and thus the current C implementation is polymorphic enough as
well.
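
The polymorphism argument can be checked on a small scale. In the
sketch below, which again assumes a cut-down record rather than the
real one, lexeme is a simplified stand-in modeled on Lexing.lexeme. It
touches only the buffer fields, so the type checker accepts it for a
lexstate carrying any user state.

```ocaml
(* Cut-down stand-in for the proposed record: most buffer fields omitted. *)
type 'a lexstate =
  { lex_buffer : string;
    mutable lex_start_pos : int;
    mutable lex_curr_pos : int;
    mutable lex_state : 'a }

type lexbuf = unit lexstate

(* Text of the current token. Never looks at lex_state, so its type is
   'a lexstate -> string. *)
let lexeme lb =
  String.sub lb.lex_buffer lb.lex_start_pos
    (lb.lex_curr_pos - lb.lex_start_pos)

(* An old-style buffer with no user state... *)
let old_style : lexbuf =
  { lex_buffer = "let x"; lex_start_pos = 0; lex_curr_pos = 3;
    lex_state = () }

(* ...and a new-style buffer carrying a list of strings as user state. *)
let new_style =
  { lex_buffer = "let x"; lex_start_pos = 4; lex_curr_pos = 5;
    lex_state = ["ctx"] }

(* The same function serves both. *)
let old_token = lexeme old_style
let new_token = lexeme new_style
```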

However, simply replacing the Lexing module's implementation and
lexing.mli in an installed OCaml 2.01 system does not work. The Lexing
module is heavily used throughout the whole system. It must be replaced
in the source tree and a new OCaml system must be built. I have such a
patched system running and have not encountered any problems yet.

Since the new Lexing module adds flexibility for future scanners and
is backward compatible with old sources I would like to see it
integrated in a future release of OCaml.

The new Lexing module implementation is available from the following
web page:

-- Christian

 Christian Lindig   Technische Universitaet Braunschweig, Germany
                   "be declarative. be functional. just be."
