Date: 2003-09-06 (03:36)
From: skaller <skaller@o...>
Subject: Re: [Caml-list] Calling ocamlyacc from ocamlyacc
On Wed, 2003-09-03 at 01:23, katre wrote:
> Michal Moskal wrote:
> > 
> > You can define several start symbols in your grammar. Parsing functions
> > are defined for each. You can also define several rule ... in your lexer
> > (lexing functions are defined for each). Hope that helps, I can't help
> > more, since I don't quite understand the nature of your problem.
> > 
> Right, but is there a way, in an ocamlyacc action, to switch which lexer
> rule you're using?  That seems to be the main part I am missing.  Or if
> I could access the lexbuf directly, I could also use that.

I think the answer is no, you can't do that.
The reason is that yacc et al are LR(1), meaning
one token of lookahead may be needed before a reduction:
in general, when a reduction occurs there is no
guarantee how many tokens have already been fetched.

Now yacc/lex are normally driven by the parser
fetching tokens. So what you _can_ do is pretend
that the 'deviant sublanguage' you need to use
a different grammar for is a single *huge* token.

The lexer, unlike the parser, can be invoked
recursively, and in particular when a regex
is matched, the code which returns a value can
do anything.

In particular, you can switch to another lexing rule.
I do this all the time for handling comments
and strings: the lexeme for an open quote is
recognised and it calls a string-gathering
rule which uses the same lexbuf.
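This rule-switching pattern can be sketched in plain OCaml (a hand-rolled lexer rather than ocamllex; all names here are invented for the sketch): on seeing an open quote, the main tokenizer hands the same mutable input state to a string-gathering function, just as an ocamllex action can call another rule on the same lexbuf.

```ocaml
(* Illustrative hand-rolled lexer: the quote case hands control to a
   string-gathering function that advances the same shared position,
   mirroring an ocamllex action calling another rule on one lexbuf. *)
type token = IDENT of string | STRING of string | EOF

type state = { src : string; mutable pos : int }

(* "Sub-rule": gather characters up to the closing quote. *)
let rec gather_string st buf =
  if st.pos >= String.length st.src then failwith "unterminated string"
  else begin
    let c = st.src.[st.pos] in
    st.pos <- st.pos + 1;
    if c = '"' then STRING (Buffer.contents buf)
    else (Buffer.add_char buf c; gather_string st buf)
  end

(* "Mainline rule": on '"' it switches to the string-gathering rule. *)
let rec next_token st =
  if st.pos >= String.length st.src then EOF
  else begin
    let c = st.src.[st.pos] in
    st.pos <- st.pos + 1;
    match c with
    | ' ' -> next_token st
    | '"' -> gather_string st (Buffer.create 16)  (* switch rules *)
    | c ->
      let b = Buffer.create 8 in
      Buffer.add_char b c;
      let rec ident () =
        if st.pos < String.length st.src
           && st.src.[st.pos] <> ' ' && st.src.[st.pos] <> '"'
        then (Buffer.add_char b st.src.[st.pos];
              st.pos <- st.pos + 1;
              ident ())
      in
      ident ();
      IDENT (Buffer.contents b)
  end
```

Because both functions share `st.pos`, the main tokenizer resumes exactly after the closing quote, which is the whole point of reusing one lexbuf.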

Lexbufs have a current position, which is the
end of the lexeme .. even if the finite state
automaton actually looked ahead further.

So with lexbuf, you know *exactly* where you are,
whenever the code associated with a regexp is matched.

Here is the mainline lexer:
(* Python strings *)
| quote  { fun state -> state#inbody; parse_qstring lexbuf state }
| qqq    { fun state -> state#inbody; parse_qqqstring lexbuf state }

which invokes the sublexer:

rule parse_qstring = parse
| qstring {
    fun state ->
      [STRING (
        state#get_srcref lexbuf,
        state#decode decode_qstring (lexeme lexbuf))]
  }
| _ {
    fun state ->
      (* error case: the original body is truncated in the archive;
         it used state#get_srcref lexbuf and a "' string" message *)
      failwith "unterminated ' string"
  }

and parse_qqqstring = parse
| qqqstring {
    fun state ->
      [STRING (
        state#get_srcref lexbuf,
        state#decode decode_qqqstring (lexeme lexbuf))]
  }
| _ {
    fun state ->
      (* error case: likewise truncated in the archive *)
      failwith "unterminated ''' string"
  }

Now, I said you can do anything, and in the example I just
call another lexer rule, but .. there is no reason you
can't call a parser function, passing the same lexbuf.

Note that you have to do this from the LEXER code,
to ensure that the sub-parser is invoked on exactly
the correct starting character.
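A minimal sketch of the "huge token" idea, again in plain OCaml with invented names and a toy embedded sublanguage (digit sums in brackets): when the lexer meets the start of the sublanguage, it calls a sub-parser on the same input state and wraps the entire result in a single token.

```ocaml
(* The "huge token" trick: the lexer invokes a sub-parser on the same
   shared input state and returns its result as one token.  The token
   type, names, and the bracketed digit-sum sublanguage are invented. *)
type token = WORD of string | EMBEDDED of int | EOF

type state = { src : string; mutable pos : int }

let peek st =
  if st.pos < String.length st.src then Some st.src.[st.pos] else None

(* Recursive-descent sub-parser for input like "1+2+3]", consuming
   up to and including the closing ']'. *)
let rec parse_sum st acc =
  match peek st with
  | Some c when c >= '0' && c <= '9' ->
    st.pos <- st.pos + 1;
    parse_sum st (acc + (Char.code c - Char.code '0'))
  | Some '+' -> st.pos <- st.pos + 1; parse_sum st acc
  | Some ']' -> st.pos <- st.pos + 1; acc
  | _ -> failwith "bad embedded expression"

let rec next_token st =
  match peek st with
  | None -> EOF
  | Some ' ' -> st.pos <- st.pos + 1; next_token st
  | Some '[' ->
    st.pos <- st.pos + 1;
    (* the whole sub-parse becomes a single token *)
    EMBEDDED (parse_sum st 0)
  | Some _ ->
    let start = st.pos in
    while peek st <> None && peek st <> Some ' ' && peek st <> Some '[' do
      st.pos <- st.pos + 1
    done;
    WORD (String.sub st.src start (st.pos - start))
```

Invoking the sub-parser from the lexer, not from a parser action, is what guarantees it starts on exactly the right character: no lookahead token has been fetched yet.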

You may also note in the example my lexer codes
have a 

	fun state -> ...

form (which is tedious to write for every lexeme).
This state is a mutable object which is passed
to the lexer as an extra argument (just add it
after the call to the rule, as in the nested example):

| quote  { fun state -> state#inbody; parse_qstring lexbuf state }
-                                     rule                extra arg

Note: I am returning lists of tokens not tokens. My lexer code
is NOT called by the parser. I call it myself and build a list
of tokens, pre-process them, and pass the output of that to
the parser via a dummy lexbuf.
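The dummy-lexbuf trick can be sketched like this (the token type and the "parser" below are stand-ins, not the actual code): an ocamlyacc entry point only needs a function of type `Lexing.lexbuf -> token` plus a lexbuf, so a closure that pops tokens from a pre-built list works fine, with the lexbuf argument ignored.

```ocaml
(* Feeding a parser from a pre-built token list via a dummy lexbuf.
   The token type and the stand-in "parser" are invented; a real
   ocamlyacc entry point has the same shape:
   (Lexing.lexbuf -> token) -> Lexing.lexbuf -> result. *)
type token = INT of int | PLUS | EOF

(* Stand-in for an ocamlyacc entry point: sums INTs until EOF. *)
let parse (next : Lexing.lexbuf -> token) (lexbuf : Lexing.lexbuf) =
  let rec go acc =
    match next lexbuf with
    | INT n -> go (acc + n)
    | PLUS -> go acc
    | EOF -> acc
  in
  go 0

let parse_tokens tokens =
  let remaining = ref tokens in
  (* the "lexer" just pops the list and ignores its lexbuf argument *)
  let next _lexbuf =
    match !remaining with
    | [] -> EOF
    | t :: ts -> remaining := ts; t
  in
  parse next (Lexing.from_string "")   (* dummy lexbuf *)
```

This is what makes pre-processing possible: the token list can be filtered arbitrarily between lexing and parsing.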

I have in fact constructed a PYTHON parser using Ocamlyacc,
even though the Python grammar is 'strongly not LR(1)' :-))
I do this with something like 13 filtering passes over
the token stream (to find the indentation etc.) before it
is in an LR(1) form.
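One such filtering pass can be sketched as follows (token names are invented, and real Python layout rules such as tabs and blank lines are ignored): a stack of indentation widths turns newline-plus-indent markers into explicit INDENT/DEDENT tokens, so the LR(1) parser never has to see layout.

```ocaml
(* One indentation-filtering pass over a token list: NL w carries the
   indentation width after a newline; the pass replaces it with
   explicit INDENT/DEDENT tokens using a stack of open widths.
   Token names invented; real Python rules are more involved. *)
type tok = TOK of string | NL of int | INDENT | DEDENT

let filter_indents toks =
  (* pop stack entries wider than [width], emitting one DEDENT each *)
  let rec emit_dedents stack width acc =
    match stack with
    | top :: rest when width < top -> emit_dedents rest width (DEDENT :: acc)
    | _ -> (stack, acc)
  in
  let rec go stack acc = function
    | [] ->
      (* close any still-open blocks at end of input *)
      let _, acc = emit_dedents stack 0 acc in
      List.rev acc
    | NL w :: rest ->
      let top = match stack with t :: _ -> t | [] -> 0 in
      if w > top then go (w :: stack) (INDENT :: acc) rest
      else
        let stack, acc = emit_dedents stack w acc in
        go stack acc rest
    | t :: rest -> go stack (t :: acc) rest
  in
  go [0] [] toks
```

Chain a dozen passes like this one and the stream a grammar like Python's produces becomes something ocamlyacc can digest.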
