Date: Wed, 20 Oct 1999 04:36:24 +1000
From: skaller <email@example.com>
To: Xavier Leroy <Xavier.Leroy@inria.fr>
Subject: Re: localization, internationalization and Caml
Xavier Leroy wrote:
> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident. The first books on Caml were
> written in French, and it was nice to be able to use accented French
> words as identifiers. Also, that was at a time (1991-1992) when
> Unicode and consorts didn't even exist.
And supporting ISO-8859-1 was a fine thing to do at the time!
> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa. If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win. (At least
> until we get OCaml in the Chinese curriculum...)
While this is true, there is a circularity here: people not
using 8 bit character sets face an extra battle using ocaml.
> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.
Yes. This is one of the key points of my argument that UTF-8
is the natural way to go: it provides ISO-10646 compliance without
requiring any new string kind.
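A rough sketch of what I mean, using nothing but the existing string and Buffer types. The names add_utf8 and utf8_of_code_points are made up for illustration, and I assume code points are limited to the range U+0000..U+10FFFF with surrogates excluded:

```ocaml
(* Hypothetical helper: append the UTF-8 encoding of code point [cp]
   to [buf]. Assumes the range U+0000..U+10FFFF, surrogates excluded. *)
let add_utf8 buf cp =
  let add b = Buffer.add_char buf (Char.chr b) in
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "add_utf8: not a Unicode scalar value"
  else if cp < 0x80 then
    add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end

(* The result is an ordinary OCaml string holding UTF-8 bytes. *)
let utf8_of_code_points cps =
  let buf = Buffer.create 16 in
  List.iter (add_utf8 buf) cps;
  Buffer.contents buf
```

For example, utf8_of_code_points [0x63; 0xE9] gives the three-byte string "c\xC3\xA9", which can be stored, concatenated and printed like any other string.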
> There are several ways to internationalize further. One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
> need to set LC_CTYPE to different values under Linux, Solaris, and
> Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.
If you are suggesting not using C locale stuff -- I agree entirely.
> Unicode / ISO10646 is probably a better approach. However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
> technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
> chose 32-bit Unicode. (That's the great thing about standards:
> there are so many to choose from...)
I cannot see the problem -- except for the 16 bit adopters,
who must eventually upgrade .. again.
> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
> E.g. Java seems to have its own variant of UTF8. How are we going
> to interoperate?
I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO Standard is followed. If Java doesn't,
that is Java's problem.
> - I/O is a nightmare. The API has to handle at least byte streams,
> wide character streams, and UTF8-encoded streams.
No, it doesn't. This is a possibility. But it is NOT necessary.
It is necessary only to read byte streams. Conversion can be done
later using strings. This is less efficient, but it is a sensible
starting point (to ignore internationalisation on I/O completely).
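To illustrate "read bytes, convert later", here is a minimal sketch. read_bytes and decode_utf8 are hypothetical names, not existing library functions, and the decoder is deliberately simple: it assumes sequences of at most four bytes and does not reject overlong forms or surrogates.

```ocaml
(* Hypothetical helper: read a whole file as raw bytes, in binary mode. *)
let read_bytes path =
  let ic = open_in_bin path in
  let len = in_channel_length ic in
  let s = really_input_string ic len in
  close_in ic;
  s

(* Hypothetical helper: decode a UTF-8 byte string into an array of
   code points. Raises Failure on truncated or malformed sequences.
   Simplification: overlong encodings and surrogates are not rejected. *)
let decode_utf8 (s : string) : int array =
  let n = String.length s in
  (* Return the 6 payload bits of the continuation byte at index [k]. *)
  let cont k =
    if k >= n then failwith "decode_utf8: truncated sequence";
    let b = Char.code s.[k] in
    if b land 0xC0 <> 0x80 then failwith "decode_utf8: bad continuation byte";
    b land 0x3F
  in
  let out = ref [] in
  let i = ref 0 in
  while !i < n do
    let b0 = Char.code s.[!i] in
    let cp, len =
      if b0 < 0x80 then (b0, 1)
      else if b0 land 0xE0 = 0xC0 then
        (((b0 land 0x1F) lsl 6) lor (cont (!i + 1)), 2)
      else if b0 land 0xF0 = 0xE0 then
        (((b0 land 0x0F) lsl 12)
         lor ((cont (!i + 1)) lsl 6)
         lor (cont (!i + 2)), 3)
      else if b0 land 0xF8 = 0xF0 then
        (((b0 land 0x07) lsl 18)
         lor ((cont (!i + 1)) lsl 12)
         lor ((cont (!i + 2)) lsl 6)
         lor (cont (!i + 3)), 4)
      else failwith "decode_utf8: bad lead byte"
    in
    out := cp :: !out;
    i := !i + len
  done;
  Array.of_list (List.rev !out)
```

A caller would write decode_utf8 (read_bytes "somefile.txt"), where the file name is only a placeholder. The I/O layer never needs to know about character encodings.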
> - Support for Unicode / UTF8 files in today's operating systems and GUIs
> is very low. When will I be able to do "more" on a UTF8 file and see my
> French accented letters?
Yes. I agree. This is a major problem. One of the answers is
"When programming languages provide the support that applications
programmers need" :-)
> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.
I agree. The way forward is, I believe:
a) do not change the I/O system, but deprecate TEXT mode
(all I/O should be done in binary)
b) do not change the String module, but deprecate the
upper/lower case functions (and anything else that
smacks of relating to natural language)
c) Provide functions to support internationalisation.
d) modify the ocaml compiler, to process \uXXXX and \UXXXX
e) provide a fast variable length array type
(d) could be done easily using ocamlp4 I think.
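As a sketch of the kind of preprocessing (d) involves, the function below rewrites \uXXXX escapes in a piece of text into UTF-8 bytes. Everything here is an assumption for illustration: the name expand_u_escapes is made up, I take \u to be followed by exactly four hex digits, I leave \U alone, and I do not treat an escaped backslash specially.

```ocaml
(* Append the UTF-8 encoding of [cp] (0 .. 0xFFFF, non-surrogate) to [buf]. *)
let add_utf8_bmp buf cp =
  let add b = Buffer.add_char buf (Char.chr b) in
  if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end

let is_hex c =
  (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')

(* Hypothetical preprocessor step: replace each \uXXXX (exactly four hex
   digits, an assumption) with the UTF-8 bytes of that code point.
   Escapes naming a surrogate are copied through unchanged. *)
let expand_u_escapes s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if i + 6 <= n && s.[i] = '\\' && s.[i + 1] = 'u'
            && is_hex s.[i + 2] && is_hex s.[i + 3]
            && is_hex s.[i + 4] && is_hex s.[i + 5] then begin
      let cp = int_of_string ("0x" ^ String.sub s (i + 2) 4) in
      if cp >= 0xD800 && cp <= 0xDFFF then
        Buffer.add_string buf (String.sub s i 6)
      else
        add_utf8_bmp buf cp;
      go (i + 6)
    end else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf
```

With this, expand_u_escapes "caf\\u00e9" yields "caf\xC3\xA9". A real implementation would live in the lexer or a preprocessor and would have to respect the existing escape rules.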
> Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.
I don't think it is necessary, a variable length
array of integers is good enough.
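Something along these lines would do. This is only a sketch of one possible design (a doubling buffer); the module name Int_vec and its functions are hypothetical, not part of any existing library:

```ocaml
(* Hypothetical growable array of integers, one int per code point. *)
module Int_vec = struct
  type t = { mutable data : int array; mutable len : int }

  let create () = { data = Array.make 8 0; len = 0 }

  let length v = v.len

  (* Append [x], doubling the backing store when it is full. *)
  let push v x =
    if v.len = Array.length v.data then begin
      let bigger = Array.make (2 * Array.length v.data) 0 in
      Array.blit v.data 0 bigger 0 v.len;
      v.data <- bigger
    end;
    v.data.(v.len) <- x;
    v.len <- v.len + 1

  let get v i =
    if i < 0 || i >= v.len then invalid_arg "Int_vec.get: index out of bounds";
    v.data.(i)

  (* Copy out just the used prefix. *)
  let to_array v = Array.sub v.data 0 v.len
end
```

Appending is amortised constant time, and indexing by character position is constant time, which is what a wide string would have offered anyway.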
--
John Skaller, mailto:firstname.lastname@example.org
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller