features of PCRE-OCaml
Date: 2000-12-08 (09:06)
From: John Max Skaller <skaller@o...>
Subject: Re: features of PCRE-OCaml

Markus Mottl wrote:
>
> On Fri, 08 Dec 2000, John Max Skaller wrote:
> > Funny. Python 1.5.2 used the _same_ C library by Philip Hazel. :-)
> > Given the fact this library builds DFA's instead of NFA's, Python
> > ought to be faster than Perl. :-)
>
> Well, the matching engine is not everything... ;)

It is for code doing extensive matching of long strings against
a single pattern: everything else should be dwarfed by the match time.

> > Note also, Python 2.0 uses a modified library which does something
> > PCRE-OCaml cannot: it works with Unicode characters (supposedly).
>
> To my knowledge, Phil Hazel is working on support for this. Unless the
> PCRE-library supports Unicode (and unless OCaml does ;), there is not
> much one can do about it...

What? You mean it isn't generic enough to just change
'char' to 'short' and recompile? [:-)]

> I am not sure whether it is really necessary to have a Str compatible
> interface: the regular expressions are already different so exchanging
> the old against the new library would break code anyway.

If the expressions were translated?

BTW: I think some of the features of the regex are parochial,
and should be eliminated: support for case insensitive matching,
matching 'words', etc., should be dropped. Such things might make
sense in English, but are much too hard to build into a regexp
facility correctly for internationalised text.

By the way, how big can the DFA tables get?
Does it eliminate duplicate columns?

[OCaml lex cannot support large enough tables for matching ISO-10646
identifiers, when encoded using UTF-8. This is a real pain, since
all my languages specify UTF-8 encoded ISO-10646: I have to cheat,
and assume 'almost everything' is a suitable character to put in an
identifier, and then check it afterwards. This makes it hard to
use special symbols as tokens. I'm not sure why this is, but I guess
it doesn't eliminate duplicate columns?]
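
To make the "duplicate columns" question concrete, here is a minimal sketch of what that elimination could look like for a table-driven matcher. It assumes a dense table indexed as table.(state).(byte); the names compress_columns and next are made up for illustration, and nothing here is taken from PCRE or OCaml lex. Input bytes whose columns are identical are mapped to one shared class, so the compact table needs only one column per class.

```ocaml
(* Hypothetical sketch: merge identical columns of a DFA transition
   table into character classes. table.(state).(byte) is the next
   state; every row is assumed to have the same width. *)
let compress_columns (table : int array array) : int array * int array array =
  let n_states = Array.length table in
  let n_cols = if n_states = 0 then 0 else Array.length table.(0) in
  (* Maps the contents of a column to the class number given to it. *)
  let seen : (int array, int) Hashtbl.t = Hashtbl.create 64 in
  let class_of = Array.make n_cols 0 in
  let distinct = ref [] in (* distinct columns, newest first *)
  for c = 0 to n_cols - 1 do
    let column = Array.init n_states (fun s -> table.(s).(c)) in
    match Hashtbl.find_opt seen column with
    | Some k -> class_of.(c) <- k
    | None ->
      let k = Hashtbl.length seen in
      Hashtbl.add seen column k;
      distinct := column :: !distinct;
      class_of.(c) <- k
  done;
  let cols = Array.of_list (List.rev !distinct) in
  (* Rebuild a state-major table with one column per class. *)
  let compact =
    Array.init n_states (fun s -> Array.map (fun col -> col.(s)) cols)
  in
  (class_of, compact)

(* Transition lookup through the class map. *)
let next (compact : int array array) (class_of : int array) state byte =
  compact.(state).(class_of.(byte))
```

With UTF-8 input, many bytes (for example most continuation bytes) tend to behave identically in every state, so the number of classes can be far smaller than 256. Whether that would be enough to fit ISO-10646 identifier tables is not something this sketch answers.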

--
John (Max) Skaller, mailto:skaller@maxtal.com.au
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper http://Vyper.sourceforge.net
download Interscript http://Interscript.sourceforge.net