Browse thread
[Caml-list] Regarding regular expressions
- John Skaller
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: | 2002-08-07 (05:02) |
From: | John Skaller <skaller@o...> |
Subject: | [Caml-list] Regarding regular expressions |
Below is some analysis of use of backreferences in Perl, done by Eric Niebler <ericne#micorosft.com> for the C++ committee. This post is stolen from the library subgroup committee reflector. The C++ committee is looking at which kind of regular expresssions to provide as standard in the C++ Standard Library. Thought I'd post it here in case it's useful to the Ocaml team which is also looking at RE libraries. ------------------------------------------------------------------------ OK, I have some numbers. As a refresher, POSIX-extended doesn't do backreferences, but guarantees complexity O(SN) where S is the number of states in the NFA, and N is the number of characters in the input string. ECMAScript does backreferences, but worst-case complexity is O(2^N). For a language where backreferences are available, I thought it would be interesting to see how often people make use of this feature. So I downloaded the CPAN archive and analyzed the perl scripts there. Here is what I found: - I found 165228 perl scripts/modules in CPAN [1]. - Of those, 68501 use regular expressions [2]. - Of those, 32359 (or 47 percent) use backreferences [3]. So, nearly half of all perl scripts on CPAN that use regular expressions make use of the backreference feature. IMO this argues strongly in favor of supporting backreferences in C++. (Backreferences can only be handled by a backtracking NFA engine, IIRC.) There are other features besides backreferences that can only be provided by a backtracking NFA. These features include non-greedy quantification, positive and negative look-ahead and look-behind assertions, independent sub-expressions, conditional sub-expressions, and backreferences within the pattern itself. I did a separate analysis to see how often these features were used. Here's what I found: - Of the 68501 scripts that use regular expressions, 12199 (or 18 percent) use at least one of the features listed above [4]. If I count the files that use either backreferences or one of these other features, I find that: - 33881 files, or 49.5 percent, use some feature that requires a backtracking NFA engine. To get some idea of how many false-positives were turning up in my tests, I ran the same search against the files which do not use regular expressions at all. I found that of the 96727 scripts that do *not* use regular expressions [2], 2300 tested positive for a backtracking regex feature. That amounts to a 2% false positive rate. So it would be fair to knock 2% off the percentages quoted above. Eric [1] - I was only looking at files with .pm and .pl extensions, after extracting all archives with .tar.gz, .tgz, and .zip extensions. [2] - A perl file was determined to use regular expressions if it matched the pattern, "[=!]~ *[^ty ]". That is, if it used the operators =~ or !~ for matching or substituting (i.e. not character translation). [3] - A file was determined to use backreferences if it contained the pattern, "\$[1-9][^0-9]". [4] - A file was included in this total if it matched one of the following patterns: \(\?= // a positive look-ahead assertion \(\?! // a negative look-ahead assertion [0-9]\}\? // a non-greedy quantifier [+*]\? // also a non-greedy quantifier \(\?> // an independent sub-expression \(\?<= // a positive look-behind assertion \(\?<! // a negative look-behind assertion \(\?\( // a conditional sub-expression [=!]~.*\\[1-9][^0-9] // a backreference used within a pattern -- John Max Skaller, mailto:skaller@ozemail.com.au snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia. voice:61-2-9660-0850 ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners