Re: localization, internationalization and Caml

From: Gerard Huet (Gerard.Huet@inria.fr)
Date: Fri Oct 15 1999 - 15:53:15 MET DST


Message-Id: <199910151406.QAA07501@yana.inria.fr>
Date: Fri, 15 Oct 1999 15:53:15 +0200
To: Francis Dupont <Francis.Dupont@inria.fr>, skaller <skaller@maxtal.com.au>
From: Gerard Huet <Gerard.Huet@inria.fr>
Subject: Re: localization, internationalization and Caml

Just to put my 2 cents on this issue...

At 10:26 15/10/99 +0200, Francis Dupont wrote:
> In your previous mail you wrote:
>
> The current 'support' for 8 bit characters in ocaml should be
> deprecated immediately. It is an extremely bad thing to have, since
> Latin-1 et al are archaic 8 bit standards incompatible with the
> international standard for ISO10646 communication, namely
> the UTF-8 encoding.

I do not agree. What we need is not ayatollah dictats, but careful thinking
about evolution of standards.

First of all, ISO-Latin is as international a standard as ISO10646, only a=
 bit
more mature. By essence, international standards are not immediately=
 obsolete,
they are here to stay because we need some stability in a world of sound
engineering, as opposed to the permanent hype which our discipline is
subjected to.

Secondly, the string data type of Ocaml is not about ASCII or ISO-Latin or
whatever. It is a low-level data type of implementation of lists of bytes
of data efficiently represented in machine memory. These bytes may be used
for encoding elements of various finite sets such as ASCII or ISO-Latin,
but the string library does not care about such intentions.

When such strings are used to represent natural language sentences, there
is a natural tendency to sophistication, from UPPER CASE letters of the
computer printers of old to ASCII to ISO-Latin 1, 2, etc to Unicode. At
some point (256) these sets of codes cannot be represented in a one to one
fashion into bytes, and so multi-bytes representations must be designed,
such as UTF-8.
Such multi-bytes representations are inconsistent with ISO-Latin convention
somewhere, and thus the ISO-Latin character set must be shifted out of its
usual representation since the 8th bit is needed for the multi-byte=
 encoding.

So for instance engineers designing natural language interfaces must make
the choice of sticking to the old convention in a purely local software, or
upgrading their software to the international standard, typically for Web
applications. At some point I am sure some brave soul from the Ocaml
implementation team will write a Unicode library for implementing the
non-trivial manipulations of lists of Unicode characters, so that the above
engineers will have a generic tool to use. Such libraries will typically
implement a NEW datatype of "unistring" or whatever, with proper conversion
to string representations of course, but the string data type is surely
here to stay, because bytes are not going to become obsolete overnight. :-)

>=> there is a rather strong opposition against UTF-8 in France
>because it is not a natural encoding (ie. if ASCII maps to ASCII
>it is not the case for ISO 8859-* characters, imagine a new UTF-X
>encoding maps ASCII to strange things and you'd be able to understand
>our concern).

I do not share Francis' pessimism. The ISO commitees are not entirely
stupid, and care has been taken to make the move as painless as possible.
ISO-Latin has just been shifted by a mere translation. Here is my Ocaml
code for translating strings of ISO-Latin 1 characters to UTF-8 HTML:

let print_unicode c =
    let ascii = int_of_char c in (* test for ISO-LATIN *)
        if ascii < 128 then print_char c (* 7 bit ascii *)
        else print_string ("&#" ^ (string_of_int ascii) ^ ";");

This is hardly mysterious or complicated or inefficient.

>=> my problem is the output of the filter will be no more readable when
>I've put too much French in the program (in comments for instance).

Come on, Francis, we do not read core dumps nowadays, we read through the
eyes of HTML or TeX or whatever !

>=> I believe internationalization should not be done by countries
>where English is the only used language: this is at least awkward...

I simply do not understand this remark in a WWW world.

Cheers
Gérard



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:27 MET