Re: localization, internationalization and Caml

From: Xavier Leroy (Xavier.Leroy@inria.fr)
Date: Sun Oct 17 1999 - 16:29:17 MET DST


Date: Sun, 17 Oct 1999 16:29:17 +0200
From: Xavier Leroy <Xavier.Leroy@inria.fr>
To: caml-list@inria.fr
Subject: Re: localization, internationalization and Caml
In-Reply-To: <199910151406.QAA07501@yana.inria.fr>; from Gerard Huet on Fri, Oct 15, 1999 at 03:53:15PM +0200

Wow, there's nothing like internationalization to spark lively
discussions. Since even Gérard Huet (oops, sorry for that 8859-1
accent, couldn't resist) and Francis Dupont broke their
vows of silence, I guess I have to say something too.

The support for ISO-8859-1 in Caml Light and OCaml is essentially an
historical and geographical accident. The first books on Caml were
written in French, and it was nice to be able to use accented french
words as identifiers. Also, that was at a time (1991-1992) where
Unicode and consorts didn't even exist.

The choice of ISO-8859-1 is not that politically incorrect either: it
works not only for western Europe, but also for Latin America, many
Pacific countries, and large parts of Africa. If we were to choose an
8-bit character set based on the number of OCaml programmers that
actually need it, I guess ISO-8859-1 (or its newer incarnation with
the Euro sign whose name I can't remember) would still win. (At least
until we get OCaml in the Chinese curriculum...)

Notice also that Caml doesn't prevent the programmer from putting any
character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
Unicode) in character strings and in comments.

There are several ways to internationalize further. One is to support
other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There
are several problems with this:
- It's not enough for Asian languages.
- The POSIX localization stuff isn't supported under Windows.
- It's badly supported on all Unixes I know (e.g. to get French, I
  need to set LC_CTYPE to different values under Linux, Solaris, and
  Digital Unix; it gets worse for other languages such as Japanese).
- Handling of mixed-language texts is a nightmare.

Unicode / ISO10646 is probably a better approach. However, it has its
own problems:
- There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
  technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
  chose 32-bit Unicode. (That's the great things about standards:
  there are so many to choose from...)
- Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
  E.g. Java seems to have its own variant of UTF8. How are we going
  to interoperate?
- I/O is a nightmare. The API has to handle at least byte streams,
  wide character streams, and UTF8-encoded streams.
- Support for Unicode / UTF8 files in today's operating systems and GUIs
  is very low. When will I be able to do "more" on an UTF8 file and see my
  French accented letters?

My conclusion is that I18N is such a mess that I don't think we'll do
much about it in Caml anytime soon. Perhaps some basic support for
wide characters and wide character strings will be added at some
point, if only because COM interoperability requires it.

- Xavier Leroy



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:27 MET