Lexer and parser generators (camllex, camlyacc)

This chapter describes two program generators: camllex, that produces a lexical analyzer from a set of regular expressions with associated semantic actions, and camlyacc, that produces a parser from a grammar with associated semantic actions.

These program generators are very close to the well-known lex and yacc commands that can be found in most C programming environments. This chapter assumes a working knowledge of lex and yacc: while it describes the input syntax for camllex and camlyacc and the main differences with lex and yacc, it does not explain the basics of writing a lexer or parser description in lex and yacc. Readers unfamiliar with lex and yacc are referred to ``Compilers: principles, techniques, and tools'' by Aho, Sethi and Ullman (Addison-Wesley, 1986), ``Compiler design in C'' by Holub (Prentice-Hall, 1990), or ``Lex & Yacc'', by Mason and Brown (O'Reilly, 1990).

Streams and stream matching, as described in section 4.1, provide an alternative way to write lexers and parsers. The stream matching technique is more powerful than the combination of camllex and camlyacc in some cases (higher-order parsers), but less powerful in other cases (precedences). Choose whichever approach is more adapted to your parsing problem.

Mac:: These commands are MPW tool, not standalone Macintosh applications.

Overview of `camllex`

The camllex command produces a lexical analyzer from a set of regular expressions with attached semantic actions, in the style of lex. Assuming the input file is lexer.mll, executing

        camllex lexer.mll

produces Caml Light code for a lexical analyzer in file lexer.ml. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.

Lexer buffers are an abstract data type implemented in the standard library module lexing. The functions create_lexer_channel, create_lexer_string and create_lexer from module lexing create lexer buffers that read from an input channel, a character string, or any reading function, respectively. (See the description of module lexing in chapter 14.)

When used in conjunction with a parser generated by camlyacc, the semantic actions compute a value belonging to the type token defined by the generated parsing module. (See the description of camlyacc below.)

Syntax of lexer definitions

The format of lexer definitions is as follows:

{ header }
rule entrypoint =
  parse regexp { action }
      | ...
      | regexp { action }
and entrypoint =
  parse ...
and ...
;;

Comments are delimited by (* and *), as in Caml Light.

Header

The header section is arbitrary Caml Light text enclosed in curly braces. It can be omitted. If it is present, the enclosed text is copied as is at the beginning of the output file. Typically, the header section contains the #open directives required by the actions, and possibly some auxiliary functions used in the actions.

Entry points

The names of the entry points must be valid Caml Light identifiers.

Regular expressions

The regular expressions are in the style of lex, with a more Caml-like syntax.

` char `: A character constant, with the same syntax as Caml Light character constants. Match the denoted character.
_: Match any character.
eof: Match the end of the lexer input.
" string ": A string constant, with the same syntax as Caml Light string constants. Match the corresponding sequence of characters.
[ character-set ]: Match any single character belonging to the given character set. Valid character sets are: single character constants ` c `; ranges of characters ` c₁ ` - ` c₂ ` (all characters between c_1 and c_2, inclusive); and the union of two or more character sets, denoted by concatenation.
[ ^ character-set ]: Match any single character not belonging to the given character set.
regexp *: (Repetition.) Match the concatenation of zero or more strings that match regexp.
regexp +: (Strict repetition.) Match the concatenation of one or more strings that match regexp.
regexp ?: (Option.) Match either the empty string, or a string matching regexp.
regexp₁ | regexp₂: (Alternative.) Match any string that matches either regexp₁ or regexp₂
regexp₁ regexp₂: (Concatenation.) Match the concatenation of two strings, the first matching regexp₁, the second matching regexp₂.
( regexp ): Match the same strings as regexp.

Concerning the precedences of operators, * and + have highest precedence, followed by ?, then concatenation, then | (alternation).

Actions

The actions are arbitrary Caml Light expressions. They are evaluated in a context where the identifier lexbuf is bound to the current lexer buffer. Some typical uses for lexbuf, in conjunction with the operations on lexer buffers provided by the lexing standard library module, are listed below.

lexing__get_lexeme lexbuf: Return the matched string.
lexing__get_lexeme_char lexbuf n: Return the n-th character in the matched string. The first character corresponds to n = 0.
lexing__get_lexeme_start lexbuf: Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.
lexing__get_lexeme_end lexbuf: Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.
entrypoint lexbuf: (Where entrypoint is the name of another entry point in the same lexer definition.) Recursively call the lexer on the given entry point. Useful for lexing nested comments, for example.

Overview of `camlyacc`

The camlyacc command produces a parser from a context-free grammar specification with attached semantic actions, in the style of yacc. Assuming the input file is grammar.mly, executing

        camlyacc options grammar.mly

produces Caml Light code for a parser in the file grammar.ml, and its interface in file grammar.mli.

The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the camllex program. Lexer buffers are an abstract data type implemented in the standard library module lexing. Tokens are values from the concrete type token, defined in the interface file grammar.mli produced by camlyacc.

Syntax of grammar definitions

Grammar definitions have the following format:

%{
  header
%}
  declarations
%%
  rules
%%
  trailer

Comments are enclosed between /* and */ (as in C) in the ``declarations'' and ``rules'' sections, and between (* and *) (as in Caml) in the ``header'' and ``trailer'' sections.

Header and trailer

The header and the trailer sections are Caml Light code that is copied as is into file grammar.ml. Both sections are optional. The header goes at the beginning of the output file; it usually contains #open directives required by the semantic actions of the rules. The trailer goes at the end of the output file.

Declarations

Declarations are given one per line. They all start with a % sign.

%token symbol...symbol

Declare the given symbols as tokens (terminal symbols). These symbols are added as constant constructors for the token concrete type.

%token < type > symbol...symbol

Declare the given symbols as tokens with an attached attribute of the given type. These symbols are added as constructors with arguments of the given type for the token concrete type. The type part is an arbitrary Caml Light type expression, except that all type constructor names must be fully qualified (e.g. modname__typename) for all types except standard built-in types, even if the proper #open directives (e.g. #open "modname") were given in the header section. That's because the header is copied only to the .ml output file, but not to the .mli output file, while the type part of a %token declaration is copied to both.

%start symbol...symbol

Declare the given symbols as entry points for the grammar. For each entry point, a parsing function with the same name is defined in the output module. Non-terminals that are not declared as entry points have no such parsing function. Start symbols must be given a type with the %type directive below.

%type < type > symbol...symbol

Specify the type of the semantic attributes for the given symbols. This is mandatory for start symbols only. Other nonterminal symbols need not be given types by hand: these types will be inferred when running the output files through the Caml Light compiler (unless the -s option is in effect). The type part is an arbitrary Caml Light type expression, except that all type constructor names must be fully qualified (e.g. modname__typename) for all types except standard built-in types, even if the proper #open directives (e.g. #open "modname") were given in the header section. That's because the header is copied only to the .ml output file, but not to the .mli output file, while the type part of a %token declaration is copied to both.

%left symbol...symbol

%right symbol...symbol

%nonassoc symbol...symbol

Associate precedences and associativities to the given symbols. All symbols on the same line are given the same precedence. They have higher precedence than symbols declared before in a %left, %right or %nonassoc line. They have lower precedence than symbols declared after in a %left, %right or %nonassoc line. The symbols are declared to associate to the left (%left), to the right (%right), or to be non-associative (%nonassoc). The symbols are usually tokens. They can also be dummy nonterminals, for use with the %prec directive inside the rules.

Rules

The syntax for rules is as usual:

nonterminal :
    symbol ... symbol { semantic-action }
  | ...
  | symbol ... symbol { semantic-action }
;

Rules can also contain the %prec symbol directive in the right-hand side part, to override the default precedence and associativity of the rule with the precedence and associativity of the given symbol.

Semantic actions are arbitrary Caml Light expressions, that are evaluated to produce the semantic attribute attached to the defined nonterminal. The semantic actions can access the semantic attributes of the symbols in the right-hand side of the rule with the $ notation: $1 is the attribute for the first (leftmost) symbol, $2 is the attribute for the second symbol, etc.

Actions occurring in the middle of rules are not supported. Error recovery is not implemented.

Options

The camlyacc command recognizes the following options:

-v: Generate a description of the parsing tables and a report on conflicts resulting from ambiguities in the grammar. The description is put in file grammar.output.
-s: Generate a grammar.ml file with smaller phrases. Semantic actions are presented in the grammar.ml output file as one large vector of functions. By default, this vector is built by a single phrase. When the grammar is large, or contains complicated semantic actions, the resulting phrase may require large amounts of memory to be compiled by Caml Light. With the -s option, the vector of actions is constructed incrementally, one phrase per action. This lowers the memory requirements for the compiler, but it is no longer possible to infer the types of nonterminal symbols: typechecking is turned off on symbols that do not have a type specified by a %type directive.
-bprefix: Name the output files prefix.ml, prefix.mli, prefix.output, instead of the default naming convention.

A complete example

The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition:

        /* File parser.mly */
        %token <int> INT
        %token PLUS MINUS TIMES DIV
        %token LPAREN RPAREN
        %token EOL
        %left PLUS MINUS        /* lowest precedence */
        %left TIMES DIV         /* medium precedence */
        %nonassoc UMINUS        /* highest precedence */
        %start Main             /* the entry point */
        %type <int> Main
        %%
        Main:
            Expr EOL                { $1 }
        ;
        Expr:
            INT                     { $1 }
          | LPAREN Expr RPAREN      { $2 }
          | Expr PLUS Expr          { $1 + $3 }
          | Expr MINUS Expr         { $1 - $3 }
          | Expr TIMES Expr         { $1 * $3 }
          | Expr DIV Expr           { $1 / $3 }
          | MINUS Expr %prec UMINUS { - $2 }
        ;

Here is the definition for the corresponding lexer:

        (* File lexer.mll *)
        {
        #open "parser";;        (* The type token is defined in parser.mli *)
        exception Eof;;
        }
        rule Token = parse
            [` ` `\t`]     { Token lexbuf }     (* skip blanks *)
          | [`\n` ]        { EOL }
          | [`0`-`9`]+     { INT(int_of_string (get_lexeme lexbuf)) }
          | `+`            { PLUS }
          | `-`            { MINUS }
          | `*`            { TIMES }
          | `/`            { DIV }
          | `(`            { LPAREN }
          | `)`            { RPAREN }
          | eof            { raise Eof }
        ;;

Here is the main program, that combines the parser with the lexer:

        (* File calc.ml *)
        try
          let lexbuf = lexing__create_lexer_channel std_in in
          while true do
            let result = parser__Main lexer__Token lexbuf in
              print_int result; print_newline(); flush std_out
          done
        with Eof ->
          ()
        ;;

To compile everything, execute:

        camllex lexer.mll       # generates lexer.ml
        camlyacc parser.mly     # generates parser.ml and parser.mli
        camlc -c parser.mli
        camlc -c lexer.ml
        camlc -c parser.ml
        camlc -c calc.ml
        camlc -o calc lexer.zo parser.zo calc.zo