Previous Up Next
Chapter 3 Grammars in Camlp4
A grammar in Camlp4 is a value of type Grammar.g. It is created by the function Grammar.make which takes a lexer as parameter (see also the functorial interface, another way to create a grammar). Let's ignore for the moment how to create lexers and let's just take a lexer provided in Camlp4, which is the default OCaml lexer. It is in the module Plexer and you can create an instantiation by using Plexer.gmake ().

3.1 Grammar, entries: an example

You can create your grammar like this:
     # let gram = Grammar.gcreate (Plexer.gmake ());;
A grammar is composed of entries. Entries are values of type 'a Grammar.Entry.e. The 'a type parameter represents the type of values which the entry returns. To create an entry, use Grammar.Entry.create, which has two parameters: 1/ the associated grammar 2/ a string, the entry name, used for error messages.

An entry is a mutable value. When you create it, it is empty, its type is '_a Grammar.Entry.e (not generalized, you cannot create entries returning polymorphic values), the type parameter being set when the entry is extended, showing then the type of its results. Let's take an entry expr:
     # let expr = Grammar.Entry.create gram "expr";;
Entries apply to char streams via the function Grammar.Entry.parse. If you apply an empty entry (an entry just created, for example), the exception Stream.Failure is raised.

To define rules to an entry, you must use the statement EXTEND. Note that EXTEND is not a syntactic construction in OCaml: it is a syntax extension provided by Camlp4. The syntax of EXTEND is:
     extend-statement ::=
        EXTEND
           list-of-entries-extensions
        END
Notice that EXTEND is an expression (i.e. not a declaration): it can be evaluated at toplevel, but also inside a function: in this case, the syntax extension takes place when the function is called.

An entry extension has the syntax:
     entry-extension ::=
        identifier : [ list-of-levels-separated-by-bars ] ;
The identifier is an entry name. An entry can have one or several levels, representing precedences and associativity.

A level has the syntax:
     level ::=
        [ list-of-rules-separated-by-bars ]

     rule ::=
        list-of-symbols-separated-by-semicolons -> action
A rule is like a pattern matching case: it can introduce patterns variables which can be used in the action part. When the rule is accepted, the action is executed. The type of the action is 'a for an 'a entry ('a Grammar.Entry.e, more precisely).

Let's define our expr to parse a simple computation of arithmetic expression with addition, subtraction, multiplication and division, integer constants and parentheses. This can be written:
       EXTEND
         expr:
           [ [ x = expr; "+"; y = expr -> x + y
             | x = expr; "-"; y = expr -> x - y ]
           | [ x = expr; "*"; y = expr -> x * y
             | x = expr; "/"; y = expr -> x / y ]
           | [ x = INT -> int_of_string x
             | "("; e = expr; ")" -> e ] ]
         ;
       END;;
The expr entry has now three levels. Grammar entries are extensible at any time: you can extend expr again to add more constructions. You can add more rules in already existing levels and you can insert new levels. We see that further.

Each level has its own associativity. You can specify left, right or non-associative. By default, the associativity is left. See further.

If you are in the OCaml toplevel and have tested this example, you can see now that expr is of type int entry, since it returns values of type int.

You can try it out:
     # Grammar.Entry.parse expr (Stream.of_string "2 + 3");;
     - : int = 5
     # Grammar.Entry.parse expr (Stream.of_string "8 * (5 - 2)");;
     - : int = 24
And so on.

What happens in case of syntax error? In the general case, the exception Stream.Error is raised, enclosed in another exception named Exc_located which indicates the location of the error in the stream. Here, the right parenthesis is missing:
     # Grammar.Entry.parse expr (Stream.of_string "9 / (7 + 1 ");;
     Uncaught exception:
     Stdpp.Exc_located
      ((11, 12), Stream.Error "')' expected after [expr] (in [expr])").
If there are unexpected symbols after a correct expression, it is not a parsing error, the parsing of the stream just stops and the trailing symbols are ignored:
     # Grammar.Entry.parse expr (Stream.of_string "8 * (5 - 2) 7 foo");;
     - : int = 24
To ensure that there are no trailing tokens in the input stream, we can create another entry expr_eoi expecting an expr followed by the end of the input EOI:
     # let expr_eoi = Grammar.Entry.create gram "expr_eoi";;
     # EXTEND expr_eoi: [ [ e = expr; EOI -> e ] ]; END;;
Now:
     # Grammar.Entry.parse expr_eoi (Stream.of_string "8 * (5 - 2) 7 foo");;
     Uncaught exception:
     Stdpp.Exc_located
      ((12, 13),
       Stream.Error "end of input expected after [expr] (in [expr_eoi])").
3.2 Remark about the lexer

Notice that "+", "-", "*", "/", EOI are terminals in this grammar, but they are not specifically predefined in Camlp4 grammar system: it depends on how the associated lexer works. Remember: we used Plexer.gmake () as associated lexer in the present grammar. In another grammar, with another lexer, these terminals might have no meaning.

Before doing the extensions, the statement EXTEND first scan all rules and, for each terminal, asks the lexer whether this terminal is correct or not. It uses for that a function named using defined in the lexer record type (see interface of module Token). This function can be used also to update a list (or hashtable) of keywords (in the case when there is a notion of keywords, what is not mandatory).

Ok. But it is lexer stuff... We don't need to know about it for the moment. However, it is interesting to know that in the predefined lexer Plexer.gmake (), this function using prints an error message and raises an exception if a bad terminal is used in a EXTEND statement:
        # EXTEND expr_eoi: [ [ AAA -> 3 ] ]; END;;
        Lexer initialization error.
        The constructor "AAA" is not recognized by Plexer
        Uncaught exception: Failure "Grammar.extend".
        # EXTEND expr: [ [ x = expr; "a+b" -> x + 1 ] ]; END;;
        Lexer initialization error.
        The token "a+b" does not respect Plexer rules
        Uncaught exception: Failure "Grammar.extend".
All this details are described in a chapter about lexers, in the reference manual. See also the predefined modules Token and Plexer.

3.3 Labelled levels and associativity

Now, we are going to extend expr. But as we defined it, it is not possible to point the entries. To point them, we need labels.

The syntax of a level is actually:
     level ::=
        optional-label optional-associativity
        [ list-of-rules-separated-by-bars ]
A label is a string. Any string you choose. The associativity is either LEFTA, RIGHTA or NONA.

Let's write expr again (we take a fresh entry, in order to start with an empty entry) with labels and explicit associativity:
     # let expr = Grammar.Entry.create gram "expr";;
     # EXTEND
         expr:
           [ "add" LEFTA
             [ x = expr; "+"; y = expr -> x + y
             | x = expr; "-"; y = expr -> x - y ]
           | "mult" RIGHTA
             [ x = expr; "*"; y = expr -> x * y
             | x = expr; "/"; y = expr -> x / y ]
           | "simple" NONA
             [ x = INT -> int_of_string x
             | "("; e = expr; ")" -> e ] ]
         ;
       END;;
By the way, an useful function, especially in the toplevel, is Grammar.Entry.print, which displays the contents of an entry (just the rules, in fact):
     # Grammar.Entry.print expr;;
     [ "add" LEFTA
       [ SELF; "+"; SELF
       | SELF; "-"; SELF ]
     | "mult" RIGHTA
       [ SELF; "*"; SELF
       | SELF; "/"; SELF ]
     | "simple" NONA
       [ "("; SELF; ")"
       | INT ] ]
Notice that all expr have been replaced by SELF: this is the same thing: when an entry calls itself, you can use either its name or the keyword SELF. It represents either the current level, the next level or the first level, depending on the associativity and the position of the SELF in the rule (current or next level if the SELF starts or ends the rule, first level otherwise).

When you extend an entry, by default the first level of the extension extends the first level of the entry:
     # EXTEND expr: [ [ x = expr; "plus1plus"; y = expr -> x + 1 + y ] ]; END;;
This extended the first level, i.e. the one labelled "add". Type Grammar.Entry.print expr to see the result.

You can extend any existing level and insert new levels. Actually, the syntax of an entry extension is:
     entry-extension ::=
        optional-position
        identifier : [ list-of-levels-separated-by-bars ] ;

     position ::=
        FIRST | LAST | BEFORE label | AFTER label | LEVEL label
3.4 Extending a level

To extend some specified level, we can use LEVEL followed by the label of the level to be extended:
     # let env = ref [];;
     # EXTEND
         expr: LEVEL "simple" [ [ x = LIDENT -> List.assoc x !env ] ];
       END;;
You Grammar.Entry.print expr again to see the result.

The symbol LIDENT is a constructor defined by our lexer Plexer. It represents an identifier starting by a lowercase character. For details, see the interface of the module Plexer (plexer.mli, given also in the reference manual, chapter libraries).

Just small tests:
     # Grammar.Entry.parse expr (Stream.of_string "foo + 1");;
     Uncaught exception: Stdpp.Exc_located ((3, 4), Not_found)
     # env := ("foo", 27) :: !env;;
     # Grammar.Entry.parse expr (Stream.of_string "foo + 1");;
     - : int = 28
3.5 Insert a level

To insert a level, you can use BEFORE or AFTER, relatively to an existing level:
     # EXTEND
         expr: AFTER "mult"
           [ "power" RIGHTA
             [ x = expr; "**"; y = expr -> int_of_float (float x ** float y) ] ]
         ;
       END;;
There is no limit to the number of levels: it is just a list. It is also possible to use FIRST or LAST: they create levels in the beginning or at the end.

3.6 Infix operator

Inside an EXTEND statement you can use antiquotations in places where strings are expected. The antiquotation is some expression between two ``dollar'' signs. A typical example is a function adding an infix operator ``op'' at the level ``lev'':
     # let add_infix lev op =
         EXTEND
           expr: LEVEL $lev$
             [ [ x = expr; $op$; y = expr -> <:expr< $lid:op$ $x$ $y$ >> ] ]
           ;
         END;;
This function can be called when you want to add your infix. The infix becomes automatically a keyword (actually, it depends on the lexer behaviour). It can be used also to define an infix macro in the OCaml grammar (chapter 7).

3.7 Attributed grammars

It is not possible to create attributed grammars, i.e. grammars with parameters (in our terminology, entries with parameters). But entries can return functions. Let us take another version where the entry expr returns a function which takes an environment env:
     # let expr = Grammar.Entry.create gram "expr";;
     # EXTEND
         expr:
           [ "add" LEFTA
             [ x = expr; "+"; y = expr -> fun env -> x env + y env
             | x = expr; "-"; y = expr -> fun env -> x env - y env ]
           | "mult" RIGHTA
             [ x = expr; "*"; y = expr -> fun env -> x env * y env
             | x = expr; "/"; y = expr -> fun env -> x env / y env ]
           | "simple" NONA
             [ x = INT -> fun env -> int_of_string x
             | x = LIDENT -> fun env -> List.assoc x env
             | "("; e = expr; ")" -> e ] ]
         ;
       END;;

To call the entry, we need now to add the environment (a list) as parameter:
     # Grammar.Entry.parse expr (Stream.of_string "foo + 1") [];;
     Uncaught exception: Not_found.

(since foo is not in the environment)
     # Grammar.Entry.parse expr (Stream.of_string "foo + 1") [("foo", 48)];;
     - : int = 49

We can improve the error message to display the unbound variable name:
     # EXTEND
         expr: LEVEL "simple"
           [ [ x = LIDENT ->
                 fun env ->
                   try List.assoc x env with
                     Not_found -> failwith ("unbound variable " ^ x) ] ]
         ;
       END;;
Notice that there is already a rule LIDENT in the level "simple" of expr. In this case, the EXTEND statement replaces the old version by the new one and displays a warning. To mask this message, one can set the variable Grammar.warning_verbose to false.

Now, it is more informative:
     # Grammar.Entry.parse expr (Stream.of_string "foo + 1") [];;
     Uncaught exception: Failure "unbound variable foo".
3.8 Source location

We can improve our above error system again, by telling the location of the error. In our examples, we tested short texts, it is easy to see the error, but if your grammar becomes very big and treats very big input files, it is very important to know with precision where the error happened in the source.

An useful function for that is Stdpp.raise_with_loc taking an input location and an exception as parameters. It raises the exception Stdpp.Exc_located already seen some sections above.

We could directly raise this exception Exc_located but raise_with_loc has the advantage to just re-raise the exception parameter when it is already Exc_located, which is useful to propagate exceptions without stacking Exc_located.

The input location of a rule is in the variable loc always available in the action part:
     # EXTEND
         expr: LEVEL "simple"
           [ [ x = LIDENT ->
                 fun env ->
                   try List.assoc x env with
                     Not_found ->
                       Stdpp.raise_with_loc loc
                         (Failure ("unbound variable " ^ x)) ] ]
         ;
       END;;
     # Grammar.Entry.parse expr (Stream.of_string "3 + foo + 1") [];;
     Uncaught exception:
     Stdpp.Exc_located ((4, 7), Failure "unbound variable foo").
We are going to extend now our entry expr with a "let" construction, which can extend the environment.

This is an occasion to introduce the notion of meta symbols in entry rules.

3.9 Meta symbols: lists, options, levels

It is possible to use some meta symbols in rules: Now, we can write a let statement, which calls an non empty list of bindings separated by the keyword "and":
     # let binding = Grammar.Entry.create gram "let_binding";;
     # EXTEND
         expr: FIRST
           [ [ "let"; r = LIST1 binding SEP "and"; "in"; e = expr ->
                 fun env ->
                   let new_env =
                     List.fold_right (fun b new_env -> b env :: new_env)
                       r env
                   in
                   e new_env ] ]
          ;
          binding:
            [ [ p = LIDENT; "="; e = expr -> fun env -> (p, e env) ] ]
          ;
       END;;

Let us define an useful function to test our entries:
     # let apply e s = Grammar.Entry.parse e (Stream.of_string s) [];;

Here are some examples:
     # apply expr "let a = 25 and b = 12 in a + b";;
     - : int = 37

     # apply expr "let a = 25 and b = a + 5 in a + b";;
     Uncaught exception:
     Stdpp.Exc_located ((19, 20), Failure "unbound variable a").

     # apply expr "let a = 25 in let b = a + 5 in a + b";;
     - : int = 55
3.10 Local and global entries

We have now three entries: expr_eoi, expr and binding.

In big grammars, we often have to create a lot of small entries, forcing us to define them with Grammar.Entry.create. It is not practical for several reasons: 1/ it is tedious 2/ we generally don't need to access all of them directly 3/ the ones which are not eventually extended (which may happen when you perfect your grammar) remain of type _'a entry which cause ocamlc to fail at end of the module.

To avoid that, it is possible to ask EXTEND to automatically define all these small entries. It actually works the wrong way round: you have to define the list of the entries which are globally defined. The other ones (the ones which are ``extended'' in the statement) are locally defined. Actually, the definition of EXTEND is:
     extend-statement ::=
        EXTEND
           optional-global
           list-of-entries-extensions
        END

     global ::=
        GLOBAL : list-of-entries ;
Warning: this statement is a little bit complicated to read: GLOBAL introduces the list of entries which have been defined previously. It means that all other entries in the EXTEND statement are automatically locally defined. Therefore, by default, if there is no GLOBAL, it must be read: ``all entries are global''.

In our example, if we want that only expr_eoi be defined and be visible, we can add the GLOBAL entry with only expr_eoi: in this case, expr and binding are locally defined and therefore not extensible.

To be sure that we don't use the previously defined expr and binding by quitting the toplevel and entering it again. Don't forget the:
     #load "camlp4o.cma";;
     #load "pa_extend.cmo";;

And now:
     # let gram = Grammar.gcreate (Plexer.gmake ());;
     # let expr_eoi = Grammar.Entry.create gram "expr_eoi";;
     # EXTEND
         GLOBAL: expr_eoi;
         expr_eoi:
           [ [ e = expr; EOI -> e ] ]
         ;
         expr:
           [ [ "let"; r = LIST1 binding SEP "and"; "in"; e = expr ->
                 fun env ->
                   let new_env =
                     List.fold_right (fun b new_env -> b env :: new_env)
                       r env
                   in
                   e new_env ]
           | "add" LEFTA
             [ x = expr; "+"; y = expr -> fun env -> x env + y env
             | x = expr; "-"; y = expr -> fun env -> x env - y env ]
           | "mult" RIGHTA
             [ x = expr; "*"; y = expr -> fun env -> x env * y env
             | x = expr; "/"; y = expr -> fun env -> x env / y env ]
           | "simple" NONA
             [ x = INT -> fun env -> int_of_string x
             | x = LIDENT ->
                 (fun env -> try List.assoc x env with
                    Not_found ->
                      Stdpp.raise_with_loc loc
                        (Failure ("unbound variable " ^ x)))
             | "("; e = expr; ")" -> e ] ]
         ;
         binding:
           [ [ p = LIDENT; "="; e = expr -> fun env -> (p, e env) ] ]
         ;
       END;;
     # let apply e s = Grammar.Entry.parse e (Stream.of_string s) [];;

     # apply expr_eoi "let a = 25 and b = 12 in a + b";;
     - : int = 37

     # apply expr_eoi "let a = 25 and b = 12 in a + b foo bar";;
     Uncaught exception:
     Stdpp.Exc_located
       ((31, 34),
          Stream.Error "end of input expected after [expr] (in [expr_eoi])")
3.11 Deleting a rule

To delete a rule, use the statement DELETE_RULE. Its syntax is:
     delete-rule ::=
        DELETE_RULE entry : list-of-symbols-separated-by-semicolons END
It deletes the first rule found in the levels of entry matching the list of symbols.

For example, in the above example, you can delete the ``addition'' rule of the entry expr, by typing:
     # DELETE_RULE expr: SELF; "+"; SELF END;;
3.12 Parsed language

The grammar entries levels are just improved streams parsers. Streams parsers use recursive descendant parsing (something close to LL(1) but actually a little less powerful). The improvements are: There is no left factorization between different entries, or in different levels in the same entry.

This is often a problem. This example does not work:

x ::= y | z
y ::= A B | ...
z ::= A C | ...

The input "A C" raises Stream.Error. There is no simple solution to this problem, but there is a solution (not very clean, actually): create a entry from a parser (it is possible via the function Grammar.Entry.of_parser). This parser must scan the stream using Stream.npeek until the number of necessary tokens allows to make the difference; return unit or raise Stream.Failure according to the cases.

3.13 Functorial interface

There is another way to create a grammar: using the functional interface. See the module Grammar. In this case, grammars are no more values, but modules. The extension of a grammar is done by the keyword GEXTEND followed by the grammar module name.

The differences are the following: Use the normal interface or the functorial interface is a question of personal taste.

3.14 Writing a lexer

If you want to write your own lexer, a simple solution is to take the sources of the provided lexer (plexer.ml) and make your changes in it.

Otherwise you can read the section ``writing a lexer'' in the reference manual.

3.15 Grammar for OCaml syntax

The command camlp4 uses to parse OCaml files: for that, it uses extensible grammars described in the present chapter (a grammar with extensible entries). The main entries of the grammar of OCaml are accessible and therefore extensible: for expressions, for patterns, for structures, signatures, and so on. All are defined in the provided module Pcaml. See chapter 7, the reference manual, chapter of library modules or look at the interface pcaml.mli.

But we are not yet ready to write syntax extensions for OCaml: we first need to create syntax trees which are returned by these entries.

The way to create OCaml syntax trees is explained in the following chapters, the first one being about another feature of Camlp4: the quotations.



Previous Up Next