Re: localization, internationalization and Caml

From: skaller (skaller@maxtal.com.au)
Date: Tue Oct 19 1999 - 20:36:24 MET DST


Date: Wed, 20 Oct 1999 04:36:24 +1000
From: skaller <skaller@maxtal.com.au>
To: Xavier Leroy <Xavier.Leroy@inria.fr>
Subject: Re: localization, internationalization and Caml

Xavier Leroy wrote:

> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident. The first books on Caml were
> written in French, and it was nice to be able to use accented french
> words as identifiers. Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.

        And supporting ISO-8859-1 was a fine thing to do at the time!
 
> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa. If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win. (At least
> until we get OCaml in the Chinese curriculum...)

        While this is true, there is a circularity here: people not
using 8-bit character sets face an extra battle using OCaml.
 
> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.

        Yes. This is one of the key points of my argument that UTF-8
is the natural way to go: it provides ISO-10646 compliance without
requiring any new string kind.
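
        To illustrate the point about not needing a new string kind, here
is a minimal sketch. The names are made up for the example; the only
assumption is that a string is a sequence of bytes, and that the source
file writes the non-ASCII character as decimal byte escapes.

```ocaml
(* U+00E9 (e with acute accent) written as its two UTF-8 bytes. *)
let e_acute = "\195\169"

(* String.length counts bytes, not characters. *)
let byte_count = String.length e_acute

(* ASCII text is unchanged under UTF-8, so ordinary concatenation
   of ASCII and UTF-8 encoded text yields valid UTF-8. *)
let word = "caf" ^ e_acute

let () = Printf.printf "%d bytes in total\n" (String.length word)
```

The string functions keep working on such data, but they see bytes:
the four-character word above has length 5.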
 
> There are several ways to internationalize further. One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
> need to set LC_CTYPE to different values under Linux, Solaris, and
> Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.

        If you are suggesting not using C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach. However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
> technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
> chose 32-bit Unicode. (That's the great things about standards:
> there are so many to choose from...)

        I cannot see the problem -- except for the 16 bit adopters,
who must eventually upgrade ... again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
> E.g. Java seems to have its own variant of UTF8. How are we going
> to interoperate?

        I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO Standard is followed. If Java doesn't
do that, that is Java's problem.

> - I/O is a nightmare. The API has to handle at least byte streams,
> wide character streams, and UTF8-encoded streams.

        No, it doesn't. This is a possibility. But it is NOT necessary.
It is necessary only to read byte streams. Conversion can be done
later using strings. This is less efficient, but it is a sensible
starting point (to ignore internationalisation on I/O completely).
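
        A sketch of what "read bytes, convert later" could look like.
The function names are hypothetical, and it assumes a present-day
standard library (open_in_bin, really_input_string). The decoder is
deliberately simple: it handles sequences of up to four bytes and does
not reject overlong forms or surrogates.

```ocaml
(* Read a whole file as raw bytes; no text-mode translation. *)
let read_file_bytes (path : string) : string =
  let ic = open_in_bin path in
  let n = in_channel_length ic in
  let s = really_input_string ic n in
  close_in ic;
  s

(* Decode a UTF-8 byte string into an array of code points.
   Truncated or malformed sequences raise Invalid_argument. *)
let utf8_decode (s : string) : int array =
  let n = String.length s in
  let byte i =
    if i >= n then invalid_arg "utf8_decode: truncated sequence"
    else Char.code s.[i] in
  (* Continuation bytes look like 10xxxxxx; return the low 6 bits. *)
  let cont i =
    let b = byte i in
    if b land 0xC0 <> 0x80 then invalid_arg "utf8_decode: bad continuation"
    else b land 0x3F in
  let rec go i acc =
    if i >= n then Array.of_list (List.rev acc)
    else
      let b = byte i in
      if b < 0x80 then go (i + 1) (b :: acc)
      else if b land 0xE0 = 0xC0 then
        let c = ((b land 0x1F) lsl 6) lor cont (i + 1) in
        go (i + 2) (c :: acc)
      else if b land 0xF0 = 0xE0 then
        let c = ((b land 0x0F) lsl 12)
                lor (cont (i + 1) lsl 6) lor cont (i + 2) in
        go (i + 3) (c :: acc)
      else if b land 0xF8 = 0xF0 then
        let c = ((b land 0x07) lsl 18) lor (cont (i + 1) lsl 12)
                lor (cont (i + 2) lsl 6) lor cont (i + 3) in
        go (i + 4) (c :: acc)
      else invalid_arg "utf8_decode: bad lead byte"
  in
  go 0 []
```

The I/O layer stays byte-oriented; something like
utf8_decode (read_file_bytes "input.txt") does the conversion afterwards,
at the cost of an extra pass over the data.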

> - Support for Unicode / UTF8 files in today's operating systems and GUIs
> is very low. When will I be able to do "more" on an UTF8 file and see my
> French accented letters?

        Yes. I agree. This is a major problem. One of the answers is
"When programming languages provide the support that applications
programmers need" :-)
 
> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.

        I agree. The way forward is, I believe:

        a) do not change the I/O system, but deprecate TEXT mode
           (all I/O should be done in binary)

        b) do not change the String module, but deprecate the
           upper/lower case functions (and anything else that
           smacks of relating to natural language)

        c) Provide functions to support internationalisation.

        d) modify the ocaml compiler, to process \uXXXX and \UXXXX
           escapes [everywhere]

        e) provide a fast variable length array type

(d) could be done easily using camlp4 I think.
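
        As a rough idea of what (d) involves, here is a sketch of the
escape expansion as a plain string function rather than a compiler or
preprocessor change. The names are hypothetical. It assumes, following
the C convention, that \u takes exactly 4 hex digits and \U takes 8,
that the result should be UTF-8, and that code points stop at 0x10FFFF.
Validation of the hex digits is left to int_of_string and is loose.

```ocaml
(* Append the UTF-8 encoding of code point c to buffer b. *)
let add_utf8 (b : Buffer.t) (c : int) : unit =
  let add x = Buffer.add_char b (Char.chr x) in
  if c < 0 || c > 0x10FFFF then invalid_arg "add_utf8: out of range"
  else if c < 0x80 then add c
  else if c < 0x800 then begin
    add (0xC0 lor (c lsr 6));
    add (0x80 lor (c land 0x3F))
  end else if c < 0x10000 then begin
    add (0xE0 lor (c lsr 12));
    add (0x80 lor ((c lsr 6) land 0x3F));
    add (0x80 lor (c land 0x3F))
  end else begin
    add (0xF0 lor (c lsr 18));
    add (0x80 lor ((c lsr 12) land 0x3F));
    add (0x80 lor ((c lsr 6) land 0x3F));
    add (0x80 lor (c land 0x3F))
  end

(* Replace each \uXXXX or \UXXXXXXXX in s by the UTF-8 bytes of that
   code point; every other byte is copied through unchanged. *)
let expand_escapes (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let hex i len =
    if i + len > n then invalid_arg "expand_escapes: truncated escape"
    else int_of_string ("0x" ^ String.sub s i len) in
  let rec go i =
    if i < n then
      if s.[i] = '\\' && i + 1 < n && s.[i + 1] = 'u' then begin
        add_utf8 b (hex (i + 2) 4); go (i + 6)
      end else if s.[i] = '\\' && i + 1 < n && s.[i + 1] = 'U' then begin
        add_utf8 b (hex (i + 2) 8); go (i + 10)
      end else begin
        Buffer.add_char b s.[i]; go (i + 1)
      end
  in
  go 0;
  Buffer.contents b
```

A real implementation in the lexer would also have to decide what the
escapes mean inside identifiers and character literals; this sketch only
covers the string case.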

> Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.

        I don't think it is necessary, a variable length
array of integers is good enough.
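
        For concreteness, a minimal sketch of such a variable length
array of integers, holding one code point per element. The type and
function names are made up for the example, and doubling the capacity
on overflow is just one reasonable growth policy.

```ocaml
(* A growable array of ints: data may be longer than the used part. *)
type vec = { mutable data : int array; mutable len : int }

let create () : vec = { data = Array.make 16 0; len = 0 }

(* Append x, doubling the backing array when it is full. *)
let push (v : vec) (x : int) : unit =
  if v.len = Array.length v.data then begin
    let bigger = Array.make (2 * v.len) 0 in
    Array.blit v.data 0 bigger 0 v.len;
    v.data <- bigger
  end;
  v.data.(v.len) <- x;
  v.len <- v.len + 1

(* Bounds-checked read of the i-th element. *)
let get (v : vec) (i : int) : int =
  if i < 0 || i >= v.len then invalid_arg "get: index out of bounds"
  else v.data.(i)

(* Copy out just the used prefix. *)
let to_array (v : vec) : int array = Array.sub v.data 0 v.len
```

Appending is amortised constant time, and indexing by character
position is constant time, which a UTF-8 encoded string cannot offer.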

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller


