Browse thread
Re: Why OCaml sucks
-
Jon Harrop
- Matthew William Cox
- Arthur Chan
- Tom_PrimožiÄ
- Elliott Oti
- Richard Jones
- David Teller
- Ulf Wiger (TN/EAB)
- Vincent Hanquez
- Ulf Wiger (TN/EAB)
-
Kuba Ober
-
Jon Harrop
-
Kuba Ober
-
Robert Fischer
-
Brian Hurt
- Robert Fischer
- Gerd Stolpmann
- Kuba Ober
-
Brian Hurt
-
Robert Fischer
-
Kuba Ober
-
Jon Harrop
- Kuba Ober
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
| Date: | -- (:) |
| From: | Gerd Stolpmann <info@g...> |
| Subject: | Re: [Caml-list] Re: Why OCaml sucks |
Am Dienstag, den 13.05.2008, 10:01 -0400 schrieb Brian Hurt: > Robert Fischer wrote: > > > > > > 5. Strings: pushing unicode throughout a general purpose language is a > > > > > > mistake, IMHO. This is why languages like Java and C# are so slow. > > > > > > > > > > > Unicode by itself, when wider-than-byte encodings are used, adds "zero" > > > > > runtime overhead; the only overhead is storage (2 or 4 bytes per > > > > > character). > > > > > > > > > You cannot degrade memory consumption without also degrading performance. > > > > Moreover, there are hidden costs such as the added complexity in a lexer > > > > which potentially has 256x larger dispatch tables or an extra indirection > > > > for every byte read. > > > > > > > > Okay, I was going to let this slide, but it kept resurfacing and annoying me. > > > > Is there any empirical support for the assertion that Java and C# are slow because of *unicode*? Of > > all things, *unicode*? The fact that they're bytecod languages isn't a bigger hit? At least with > > the JVM, the hypercomplicated GC should probably take some of the blame, too -- I've seen 2x speed > > increases by *reducing* the space available to the GC, and 10x speed increases by boosting the space > > available to ridiculous levels so that the full GC barely ever has to fire. The the nigh-universal > > optimization-ruining mutable data and virtual function (e.g. method) dispatch I'm sure doesn't help, > > too. And this is to say nothing of user-space problems like the explosion of nontrivial types > > associated with the object-driven style. With all that going on, you're blaming their *Unicode > > support* for why they're slow? "This is why languages like Java and C# are so slow." Really? Got > > evidence for that? > > > > ~~ Robert. > > > > _ > > The problem, as I understand it, is in writting parsers. Your > standard finite automata based regular expression library or lexical > analyzer is based, at it's heart, on a table lookup- you have a 2D > array, whose size is the number of input characters times the number > of states. For ASCII input, the number of possible input characters > is small- 256 at most. 256 input characters times hundreds of states > isn't that big of a table- we're looking at sizes in 10's of K- easily > handlable even in the bad old days of 64K segments. Even going to > UTF-16 ups the number of input characters from 256 to 65,536- and now > a moderately large state machine (hundreds of states) weighs in at > tens of megabytes of table space. And, of course, if you try to > handle the entire 31-bit full unicode point space, welcome to really > large tables :-). > > The solution, I think, is to change the implementation of your finite > automata to use some data structure smarter than a flat 2D array, but > that's me. Note that we have such a lexer already: ulex. It uses binary decision trees, AFAIK. The resulting code has moderate size. It can take some time, however, until ulex has converted the NFA to a DFA. Gerd -- ------------------------------------------------------------ Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------