Date: 2005-10-27 (07:36)
From: skaller <skaller@u...>
Subject: Parser state variables
Are there any plans to extend Ocamlyacc to allow state variables,
in a manner similar to the recent extensions to Ocamllex to
allow them?

At present, there are two workarounds:

(a) global variables

I hope I do not need to say how bad this is for a language
that is supposed to provide good support for functional programming.

(b) Token hack trick

This trick avoids the problems of global variables by using
a proper state object. The idea is that EVERY token should
contain the state object, which is put into it by the
lexer: thus the state variables passed to the lexer can
be transmitted to the parser.

The cost of course is an extra pointer in every token,
plus the messiness of actually constructing the tokens,
which has to be repeated in every single lexing rule,
and of course, you have to declare the tokens:

%token<MyLexerState.lexer_state> ATOKEN
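Stripped of the ocamlyacc machinery, the trick amounts to the following plain-OCaml sketch (all names here, including `lexer_state`, are illustrative, not a real API):

```ocaml
(* Sketch of the token-hack trick: every token constructor carries
   the shared state object placed there by the lexer, so the
   parser's user actions can reach the state through any token. *)
type lexer_state = { mutable typedefs : string list }

type token =
  | IDENT of string * lexer_state
  | SEMI of lexer_state
  | EOF of lexer_state

(* Every lexing rule has to repeat this packaging step. *)
let make_ident st name = IDENT (name, st)

let () =
  let st = { typedefs = [] } in
  (* A user action unpacks the state from a token and mutates it. *)
  (match make_ident st "T" with
   | IDENT (name, st') -> st'.typedefs <- name :: st'.typedefs
   | _ -> ());
  assert (st.typedefs = ["T"])
```

The extra `lexer_state` field in every constructor is the "extra pointer in every token" mentioned above.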

An obvious use of such a facility is a C parser,
which needs to modify a list of typedef names,
so that the lexer can generate the right token
for a given identifier.

Most C parsers use a global variable for communication,
which of course is very bad.

In other cases, such as the Felix parser, tokenisation
is entirely divorced from parsing. Nevertheless, whilst
the parse itself is not influenced by any data, the *user
actions* could be. One such use is a 'fresh variable' counter.
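Such a counter is trivial once state can be passed in; the sketch below (hypothetical names) shows the value one would like to thread through the parser rather than keep in a global:

```ocaml
(* A 'fresh variable' counter as per-parse state.  With no way to
   hand it to the parser, it typically degenerates into a global ref. *)
type state = { mutable counter : int }

let fresh st prefix =
  st.counter <- st.counter + 1;
  Printf.sprintf "%s%d" prefix st.counter

let () =
  let st = { counter = 0 } in
  assert (fresh st "_v" = "_v1");
  assert (fresh st "_v" = "_v2")
```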

Whilst LALR(1) parsers are not directly
amenable to extension, one can certainly use a pushdown
list of LALR(1) automata to integrate 'recursive descent'
parsing techniques with LALR(1).

For example, a grammar for statements might use a single
token for 'expressions', and we have a separate expression
grammar. We then create a pair of lexers: one lexes statements,
recognizing an expression as a single token, attaching the whole
expression as an attribute. The statement parser, on seeing
such a token, extracts the string from it, and lexes and
parses it using an expression lexer/parser, takes the expression
AST which results, and glues it into the statement AST.
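A toy rendering of that scheme (every name invented; a real version would use two generated lexer/parser pairs): the statement level sees an expression only as a string attribute, and re-parses it with a separate expression parser.

```ocaml
(* Toy expression AST and "expression parser": here just sums of ints. *)
type expr = Num of int | Add of expr * expr

let parse_expr (s : string) : expr =
  match String.split_on_char '+' s |> List.map String.trim with
  | [] -> failwith "empty expression"
  | x :: rest ->
    List.fold_left
      (fun acc t -> Add (acc, Num (int_of_string t)))
      (Num (int_of_string x)) rest

(* Statement-level tokens: expressions arrive as opaque strings. *)
type stmt_token = RETURN | EXPR of string
type stmt = Return of expr

(* The statement parser hands the attached string to the expression
   parser and glues the resulting AST into the statement AST. *)
let parse_stmt = function
  | [RETURN; EXPR s] -> Return (parse_expr s)
  | _ -> failwith "parse error"

let () =
  assert (parse_stmt [RETURN; EXPR "1 + 2"]
          = Return (Add (Num 1, Num 2)))
```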

Without arguing about the feasibility or quality of such a 
design, the point is that, for example, the expression lexer
parser might be made extensible and depend on tables.
And the question is: where do the tables come from?

The only way to do this properly at the moment is the
token hack trick -- the tables have to be stored IN
the tokens, simply because there is no way to pass them
to the parser, so that the parser can pass them to user
actions.
One possible solution is to put the state data in the
Lexing.lexbuf. Since the parser has access to the lexbuf,
it could get the user data out and pass it to the user
actions. Unfortunately, this would require the lexbuf to be
polymorphic, and thus the lexer and parser would also have
to be polymorphic.

John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: