Re: [Caml-list] ocamllex, regular expression syntax
Date: -- (:)
From: Luc Maranget <luc.maranget@i...>
Subject: Re: [Caml-list] ocamllex, regular expression syntax
> I am new to ocaml and today I played around a little bit with
> ocamllex. Now I'm wondering why ocamllex has this strange regular
> expression syntax. One has to quote every character, an arbitrary
> character is matched by the underscore...
> 
> The manual for ocamllex says: "The regular expressions are in the
> style of lex, with a more Caml-like syntax."
> 
> But the regular expression syntax in the Str module looks "normal" to
> me.
> 
> Regular expressions like this
> 
> "[^"\\]*(\\.[^"\\]*)*"
> 
> are not easy to read, but with the ocamllex syntax it is even more
> difficult:
> 
> '"'[^'"''\\']*('\\'_[^'"''\\']*)*'"'
> 
> (and harder to write).
> 
> Is this just for historical reason or is there a practical reason for
> this syntax? I'm just curious...
> 

Ah, regexp syntax ! I think I can explain a few principles, as I see
them.


Lex-like tools are part of, let us say, a compiler culture.
In lex-style regexps, you clearly have a two-stage definition.

1. The tokens:
     Characters (caml-style) 'c', with some escape mechanism
     (such as '\\')
     Various operators such as *, +, etc. or delimiters such as (, )
     Spacing between tokens is irrelevant.

2. From the tokens, regexps are defined as trees.

This allows a clean, regular, definition of regexp syntax. Moreover,
lexing conventions are the ones of Caml.
<http://caml.inria.fr/ocaml/htmlman/manual026.html#htoc126>

But then, as you noticed, users have to type many quotes.
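
To make the "regexps are trees" point concrete, here is a minimal sketch in
plain OCaml. It is not how ocamllex is implemented (ocamllex compiles regexps
to automata); the type regexp, its constructors and the functions accepts and
matches are hypothetical names chosen for illustration. The sketch builds the
tree for the string-literal regexp quoted above and checks it with a naive
backtracking matcher.

```ocaml
(* A hypothetical regexp tree, mirroring a few ocamllex constructs. *)
type regexp =
  | Chr of char             (* 'c'            *)
  | Any                     (* _              *)
  | NoneOf of char list     (* [^ 'a' 'b']    *)
  | Seq of regexp * regexp  (* r1 r2          *)
  | Star of regexp          (* r*             *)

(* [accepts r s i k] tries to match [r] against [s] starting at index [i],
   then calls the continuation [k] on the index just after the match.
   This is a naive backtracking matcher, for illustration only. *)
let rec accepts r s i k =
  match r with
  | Chr c -> i < String.length s && s.[i] = c && k (i + 1)
  | Any -> i < String.length s && k (i + 1)
  | NoneOf cs -> i < String.length s && not (List.mem s.[i] cs) && k (i + 1)
  | Seq (a, b) -> accepts a s i (fun j -> accepts b s j k)
  | Star a ->
    (* Zero iterations, or one iteration that consumes input, then repeat. *)
    k i || accepts a s i (fun j -> j > i && accepts (Star a) s j k)

(* Whole-string match. *)
let matches r s = accepts r s 0 (fun j -> j = String.length s)

(* Tree for:  '"' [^ '"' '\\']* ('\\' _ [^ '"' '\\']*)* '"'  *)
let plain = Star (NoneOf ['"'; '\\'])
let string_lit =
  Seq (Chr '"',
       Seq (plain,
            Seq (Star (Seq (Seq (Chr '\\', Any), plain)),
                 Chr '"')))

let () =
  assert (matches string_lit "\"a\\\"b\"");   (* "a\"b" : escaped quote inside *)
  assert (not (matches string_lit "\"abc"))   (* missing closing quote *)
```

Once the tokens are fixed, quoting no longer matters: the tree only records
which character or operator was meant.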


Perl-like tools follow a different idea: they intend to minimize
keystrokes. I guess the first idea was to make unescaped/unquoted
characters correspond to their ``most frequent usage''.
The consequence is that users type many backslashes.

In my opinion, the meaning of quotes (ocamllex) is clear because they
express one simple construct: I want this character.
The meaning of backslashes (perl) is less clear: it means ``I want some
special meaning of this character'', which covers many situations.
In particular, the ordinary meaning of \ is not ``a backslash'', and this
implies that \\ means ``I want a backslash''. The same applies to *, whose
default meaning is the repetition operator, so that \* is needed to get a
star. This is a bit irregular in my opinion.

Some additional problems arise when one notation can have several meanings.
Consider, for instance, \1 (reference to \(..\) number one) and \001
(character whose code is one). It is no surprise that various regexp
tools disagree on such subtle points.

As a conclusion, the lex way of doing things is inspired by design
(first lex, then parse), whereas the perl way of doing things
is inspired by minimizing users' keystrokes, leading to, in my opinion, some
dark corners.


--Luc





