Re: Why OCaml sucks
| Date: | -- (:) |
| From: | Berke Durak <berke.durak@e...> |
| Subject: | Re: [Caml-list] Re: Why OCaml sucks |
Robert Fischer wrote:
> Getting back to the original question, though -- is there any evidence that Java/C# are slow because
> of unicode support, and not because of other aspects of the languages? Because that assertion seems
> flat-out bogus to me.

I do not think the JVM is especially slow in practice. However, one potential source of slowness in some particular cases is the conversion between the internal short-array-based string representation and UTF-8 when calling native code.

Similarly, since Java strings are immutable, native code cannot modify them in place, so many bindings to C libraries end up copying strings a lot (see e.g. PCRE). This is why the NIO API, which exposes mutable and/or externally allocated buffers, was introduced in the JVM, but it remains hard to use.

However, it is true that regexes on UTF-8 can be quite slow. Compare (on Linux):

  /udir/durak> dd if=/dev/urandom bs=10M count=1 of=/dev/shm/z
  /udir/durak> time LANG=en_US.UTF-8 grep -c "^[a-z]*$" /dev/shm/z
  2.31s user 0.01s system 99% cpu 2.320 total
  /udir/durak> time LANG=C grep -c "^[a-z]*$" /dev/shm/z
  0.04s user 0.01s system 98% cpu 0.048 total

Lesson 1: when lexing, do not read Unicode chars one at a time. Instead, pre-process your regular expression according to your input encoding.

That being said, I think strings should be represented as they are today, and that the core OCaml libraries do not have much business dealing with UTF-8. We seldom need letter-indexed random access to strings.

However, the time is ripe for throwing out old 8-bit charsets such as ISO-8859-x (a.k.a. Latin-y) and whatnot. This considerably simplifies lesson 1: the input is either ASCII or UTF-8 Unicode.

I think the OCaml lexer should simply treat any byte with its high bit set as a lowercase letter.

-- 
Berke DURAK
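
To illustrate the closing suggestion (treat any byte with its high bit set as a lowercase letter), here is a minimal sketch of a byte-level identifier scanner. It is not the actual OCaml lexer: the names `is_ident_start`, `is_ident_char` and `scan_ident` are hypothetical, and the set of accepted ASCII characters is an assumption chosen to resemble OCaml lowercase identifiers. The input is never decoded; multi-byte UTF-8 sequences pass through as runs of "lowercase" bytes.

```ocaml
(* Sketch only: byte-level scanning of a lowercase identifier in an
   ASCII or UTF-8 string. Bytes >= 0x80 are treated as lowercase letters,
   so UTF-8 sequences are accepted without being decoded. *)

(* Hypothetical helper: may this byte start a lowercase identifier? *)
let is_ident_start (c : char) : bool =
  match c with
  | 'a' .. 'z' | '_' -> true
  | c -> Char.code c >= 0x80   (* high bit set: count as lowercase *)

(* Hypothetical helper: may this byte continue an identifier? *)
let is_ident_char (c : char) : bool =
  match c with
  | 'A' .. 'Z' | '0' .. '9' | '\'' -> true
  | c -> is_ident_start c

(* Return the identifier starting at byte offset [pos], if there is one. *)
let scan_ident (s : string) (pos : int) : string option =
  let n = String.length s in
  if pos < 0 || pos >= n || not (is_ident_start s.[pos]) then None
  else begin
    let stop = ref (pos + 1) in
    while !stop < n && is_ident_char s.[!stop] do incr stop done;
    Some (String.sub s pos (!stop - pos))
  end
```

For example, `scan_ident "x1 y" 0` yields `Some "x1"`, `scan_ident "Foo" 0` yields `None`, and a string whose first bytes are the UTF-8 encoding of an accented word is scanned as a single identifier. One consequence of this design is that the scanner cannot tell an accented uppercase letter from a lowercase one, and it accepts malformed UTF-8 as readily as valid input.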