Re: Why OCaml sucks
Date: -- (:)
From: Berke Durak <berke.durak@e...>
Subject: Re: [Caml-list] Re: Why OCaml sucks
Robert Fischer wrote:

> Getting back to the original question, though -- is there any evidence that Java/C# are slow because
> of unicode support, and not because of other aspects of the languages?  Because that assertion seems
> flat-out bogus to me.

I do not think the JVM is especially slow in practice.  However, one potential source of
slowness, in some particular cases, is the conversion between the internal string representation
(an array of 16-bit shorts) and UTF-8 when calling native code.  Similarly, because Java strings
are immutable, native code cannot modify them in place, so many bindings to C libraries
end up copying strings repeatedly (see e.g. PCRE).

This is why the NIO API, which exposes mutable and/or externally allocated buffers, was introduced
in the JVM, but it remains hard to use.

However, it is true that regexes on UTF-8 input can be quite slow.  Compare (on Linux):

/udir/durak> dd if=/dev/urandom bs=10M count=1 of=/dev/shm/z

/udir/durak> time LANG=en_US.UTF-8 grep -c "^[a-z]*$" /dev/shm/z
2.31s user 0.01s system 99% cpu 2.320 total
/udir/durak> time LANG=C grep -c "^[a-z]*$" /dev/shm/z
0.04s user 0.01s system 98% cpu 0.048 total

Lesson 1: when lexing, do not read Unicode characters one at a time.  Instead, pre-process your
regular expression according to your input encoding, so that matching runs directly on bytes.
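
As a rough illustration of what such pre-processing could look like, here is a minimal sketch that
rewrites a range of code points into byte-level alternatives.  It only handles ranges inside
U+0080..U+07FF (the two-byte UTF-8 range); the names two_byte_alternatives and to_pattern are made
up for this example and do not come from any library.

```ocaml
(* Hypothetical helper: split the code point range lo..hi, both within
   U+0080..U+07FF, into byte-level alternatives.  Each triple
   (lead, c_lo, c_hi) means: the byte [lead] followed by one byte in the
   range c_lo..c_hi. *)
let two_byte_alternatives lo hi =
  if lo < 0x80 || hi > 0x7FF || lo > hi then
    invalid_arg "two_byte_alternatives";
  (* First and second byte of the two-byte UTF-8 encoding of cp. *)
  let lead cp = 0xC0 lor (cp lsr 6) in
  let cont cp = 0x80 lor (cp land 0x3F) in
  let rec go l acc =
    if l > lead hi then List.rev acc
    else
      let c_lo = if l = lead lo then cont lo else 0x80 in
      let c_hi = if l = lead hi then cont hi else 0xBF in
      go (l + 1) ((l, c_lo, c_hi) :: acc)
  in
  go (lead lo) []

(* Render the alternatives as a byte-oriented pattern, for display only. *)
let to_pattern alts =
  alts
  |> List.map (fun (l, a, b) ->
         Printf.sprintf "\\x%02X[\\x%02X-\\x%02X]" l a b)
  |> String.concat "|"
```

For the range U+00E0..U+00FF (à to ÿ), this gives the single alternative \xC3[\xA0-\xBF], which a
byte-oriented matcher can then handle without decoding any character.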

That being said, I think strings should be represented as they are today, and that the core
OCaml libraries do not have much business dealing with UTF-8.  We seldom need character-indexed
random access to strings.
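
To illustrate that a plain byte string is usually enough, here is a small sketch that counts the
code points of a string by scanning it sequentially.  It assumes the input is well-formed UTF-8,
and the name utf8_length is invented for this example.

```ocaml
(* Count code points in a string assumed to be well-formed UTF-8:
   every byte that is not a continuation byte (10xxxxxx) starts a new
   code point. *)
let utf8_length s =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n
```

On "h\xC3\xA9llo" (the bytes of "héllo"), String.length reports 6 bytes while utf8_length
reports 5 code points.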

However, the time is ripe for throwing out old 8-bit charsets such as ISO-8859-x (a.k.a. Latin-y)
and whatnot.  This considerably simplifies lesson 1: the input is either ASCII or UTF-8.  I think
the OCaml lexer should simply treat any byte with its high bit set as a lowercase letter.
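
A minimal sketch of that rule, assuming a hand-written scanner (this is not the actual OCaml
lexer, and is_lower, is_ident_char and scan_lident are names invented here):

```ocaml
(* A byte counts as a lowercase letter if it is a-z, an underscore, or
   has its high bit set (i.e. it is part of a multi-byte UTF-8 sequence). *)
let is_lower c =
  (c >= 'a' && c <= 'z') || c = '_' || Char.code c >= 0x80

let is_ident_char c =
  is_lower c || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c = '\''

(* Scan a lowercase identifier starting at byte offset i.  Returns the
   identifier and the offset just past it, or None if no identifier
   starts at i. *)
let scan_lident s i =
  let n = String.length s in
  if i >= n || not (is_lower s.[i]) then None
  else begin
    let j = ref (i + 1) in
    while !j < n && is_ident_char s.[!j] do incr j done;
    Some (String.sub s i (!j - i), !j)
  end
```

With this rule, the UTF-8 bytes of "été" are accepted as one identifier without the scanner ever
decoding a character.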
-- 
Berke DURAK