features of PCRE-OCaml
Date: -- (:)
From: John Max Skaller <skaller@o...>
Subject: Re: features of PCRE-OCaml
Markus Mottl wrote:
> 
> On Fri, 08 Dec 2000, John Max Skaller wrote:
> > Funny. Python 1.5.2 used the _same_ C library by Philip Hazel. :-)
> > Given the fact this library builds DFA's instead of NFA's, Python
> > ought to be faster than Perl. :-)
> 
> Well, the matching engine is not everything... ;)

	It is for code doing extensive matching of long strings
against a single pattern: everything else should be dwarfed
by the match time.

> > Note also, Python 2.0 uses a modified library which does something
> > PCRE-OCaml cannot: it works with Unicode characters (supposedly).
> 
> To my knowledge, Phil Hazel is working on support for this. Unless the
> PCRE-library supports Unicode (and unless OCaml does ;), there is not
> much one can do about it...

	What? You mean it isn't generic enough to just change
'char' to 'short' and recompile?  [:-)]
 
> I am not sure whether it is really necessary to have a Str compatible
> interface: the regular expressions are already different so exchanging
> the old against the new library would break code anyway.

	If the expressions were translated?

	BTW: I think some of the features of the regex are
parochial, and should be eliminated: support for case insensitive
matching, and matching 'words' etc should be dropped. Such things
might make sense in English, but are much too hard to build in
to a regexp facility correctly for internationalised text.
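
To make the case-insensitivity point concrete, here is a minimal sketch using Python 3's built-in string methods. It says nothing about how PCRE or PCRE-OCaml handle case; the two helper names are hypothetical and exist only for illustration. It shows that case mapping outside ASCII is not one character to one character, so a matcher that simply lowercases both sides can give the wrong answer.

```python
# Hypothetical helpers, for illustration only.

def same_ignoring_case_naive(a, b):
    # The "obvious" rule: lowercase both sides and compare.
    return a.lower() == b.lower()

def same_ignoring_case_folded(a, b):
    # Unicode full case folding; note it can change the string's length.
    return a.casefold() == b.casefold()

german = "stra\u00dfe"  # "strasse" spelled with a sharp s (U+00DF)

print(same_ignoring_case_naive("STRASSE", german))   # False
print(same_ignoring_case_folded("STRASSE", german))  # True
print(len("\u00df".upper()))                         # 2: the sharp s uppercases to "SS"
```

Even the folded comparison ignores locale-specific rules (Turkish dotted and dotless i, for instance), which is part of why this is awkward to bake into a pattern language.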

	By the way, how big can the DFA tables get?
Does it eliminate duplicate columns? 
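
For reference, this is roughly what "eliminating duplicate columns" means for a table-driven matcher in general. It is a sketch under stated assumptions, not a description of how PCRE or ocamllex actually lay out their tables: the transition table is assumed to be a non-empty list of rows indexed as `table[state][symbol]`, and the function name `compress_columns` is made up for this example. Symbols whose columns are identical are merged into one class, and a small side table maps each symbol to its class.

```python
def compress_columns(table):
    """Merge identical columns of table[state][symbol].

    Returns (class_of, compact) where class_of[symbol] is a column class
    and compact[state][class_of[symbol]] == table[state][symbol].
    Assumes a non-empty, rectangular table.
    """
    n_symbols = len(table[0])
    class_index = {}   # column contents -> class number
    columns = []       # one representative column per class
    class_of = []      # symbol -> class number
    for sym in range(n_symbols):
        column = tuple(row[sym] for row in table)
        if column not in class_index:
            class_index[column] = len(columns)
            columns.append(column)
        class_of.append(class_index[column])
    # Rebuild the table with one column per class.
    compact = [[col[state] for col in columns] for state in range(len(table))]
    return class_of, compact


# Two states, four input symbols; symbols 0 and 1 behave identically.
table = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
class_of, compact = compress_columns(table)
print(class_of)  # [0, 0, 1, 2]
print(compact)   # [[1, 0, 0], [1, 1, 0]]
```

With a large alphabet, most symbols tend to fall into a handful of classes, so the compact table stays small at the cost of one extra lookup per input symbol.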

	[Ocaml lex cannot support large enough tables for matching
ISO-10646 identifiers, when encoded using UTF-8. This is a real pain,
since all my languages specify UTF-8 encoded ISO-10646: I have to 
cheat, and assume 'almost everything' is a suitable character to
put in an identifier, and then check it afterwards. This makes it
hard to use special symbols as tokens. I'm not sure why
this is, but I guess it doesn't eliminate duplicate columns?]
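
The "cheat" described above can be sketched as follows. This is an illustration in Python 3, not the actual lexer: the pattern `LOOSE_IDENT` and the function `scan_identifier` are hypothetical names, and Python's own `str.isidentifier()` stands in for the real ISO-10646 identifier rules, which it only approximates. The scanner accepts any non-ASCII character as part of an identifier, then validates the whole word afterwards.

```python
import re

# Permissive rule: ASCII letters, digits, underscore, or ANY non-ASCII
# character. This keeps the scanner's character classes tiny.
LOOSE_IDENT = re.compile(
    "[A-Za-z_\u0080-\U0010ffff][A-Za-z0-9_\u0080-\U0010ffff]*"
)

def scan_identifier(text, pos=0):
    """Return the identifier starting at pos, or None if there is none.

    Raises ValueError if the loosely matched word fails the stricter
    check (here str.isidentifier(), used as a stand-in).
    """
    m = LOOSE_IDENT.match(text, pos)
    if m is None:
        return None
    word = m.group()
    if not word.isidentifier():
        raise ValueError("not a valid identifier: %r" % word)
    return word


print(scan_identifier("na\u00efve = 1"))  # naïve

# The drawback: a symbol such as U+2192 (rightwards arrow) is swallowed
# into the word, so it cannot serve as a separate token.
try:
    scan_identifier("x\u2192y")
except ValueError as err:
    print("rejected:", err)
```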

-- 
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net