Re: Unicode-Support in Ocaml?

From: Xavier Leroy (Xavier.Leroy@inria.fr)
Date: Fri Mar 12 1999 - 16:26:10 MET


Date: Fri, 12 Mar 1999 16:26:10 +0100
From: Xavier Leroy <Xavier.Leroy@inria.fr>
To: Jens Wagner <jwagner@handshake.de>, caml-list@inria.fr
Subject: Re: Unicode-Support in Ocaml?
In-Reply-To: <36E6C8E6.45008AEF@handshake.de>; from Jens Wagner on Wed, Mar 10, 1999 at 08:32:54PM +0100

> Are there any plans of supporting Unicode in objective caml?
> A support for Unicode characters is very important for writing
> Web-Applications and would be a great improvement for other applications
> too (IMHO)!

Several applications need Unicode support, indeed. For instance, my
CamlIDL COM interface will eventually need Unicode strings, because
many Windows API use them. Here are my thoughts about it.

Switching everything to use Unicode strings is too much work, so we'd
need a separate type "wstring" of wide-character strings in addition
to the standard "string", with conversion functions between the two.

Now, I can see two ways to go about it:

- Build on the ISO C "wide character" and "wide string" abstractions.
  Pros:

  * all recent C libraries support those, and provide powerful
    functions for conversion between wide strings and multi-byte strings
    (e.g. Unicode <-> UTF8), for comparing wide strings, etc;
  * the use of Unicode and UTF8 is not hardwired, most systems provide
    32-bit characters and support several multi-byte encodings.
    I've been told this is better for e.g. Japanese, which has
    several popular encodings that are not Unicode-based.

  Cons:

  * the use of Unicode and UTF8 is not hardwired, so different systems
    represent wide strings differently (e.g. as Unicode under Windows
    and 32-bit chars under Unix), precluding binary exchange of
    wide strings via output_value/input_value;
  * the various multi-byte encodings provided differ from Unix vendor
    to Unix vendor (different encodings, different names for the same
    encoding, ...);
  * while in theory Unicode and UTF8 are just one encoding among others,
    I don't know of any Unix system that actually provides "locale"
    files describing Unicode/UTF8: Linux with glibc 2.0 doesn't,
    Digital Unix 4 doesn't seem to either, and ditto for Solaris 2.5.

- Decide that Caml will use Unicode on every platform (like Java does).
  Pros:

  * binary compatibility between platforms;
  * proper Unicode support is guaranteed.

  Cons:

  * lots of code to write by hand (Unicode/UTF8 conversion,
    comparisons, upper/lowercasing, etc);
  * the Japanese might not be happy about it.

> Id could also be interesting having a standard signature for the DOM (
> http://www.w3.org/ ) in ocaml.
> Is there any organization defining signatures of existing API's? Of
> course one could use an IDL->stub converter, but there is much more
> possible using signatures than IDL-Files.

In the great tradition of free software, whomever writes the code (to
interface to an existing API) gets to choose the signature. However,
you can ask for feedback or suggestions on this list.

- Xavier Leroy



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:21 MET