Re: localization, internationalization and Caml

From: skaller (skaller@maxtal.com.au)
Date: Thu Oct 21 1999 - 17:35:00 MET DST


To: matias@k-bell.com

Matías Giovannini wrote:

> OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll
> agree that my view is chauvinistic (and selfish, perhaps: I already have
> "¿¡áéíóúüñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?),
> I see no restriction in that (well, if I were Chinese, or Egyptian, I
> would see things differently).

        Exactly. There are quite a lot of Chinese, Indian,
Russian ... and other non-Latin people in the world: more than Latins.
And many of them face a barrier to participating in the computing world
because of language problems.

> What's more, the whole syntactic
> apparatus of a programming language *assumes* a Latin setting, where
> things make sense when read from left to right, from top to bottom; and
> where punctuation is what we're used to. Programming languages suited
> for a Han, or Arab, or even a Hebrew audience would have to be rethought
> from the ground up.

        Actually, no. Most of these peoples learn English and learn
computing, if they are to work with computers. But they still wish
to use comments, strings, and identifiers in their native script.

        Have you ever seen a Japanese program? I have.
Quite an interesting challenge: normal C/C++ code, with
Latin characters encoding Japanese character names in identifiers,
and actual Japanese characters in comments and strings.
 
        I had no idea what the code did. My point: for a non-native
speaker, being forced to use a foreign language for identifiers and
comments is a serious impediment. Not having native characters
in strings is not just an impediment but a complete disaster (how will
the users of the program understand it? They may not know any
Latin language).

> On the other hand, OCaml provides a String type that *can be* seen as a
> variable-length sequence of uninterpreted bytes.

        Yes. What ocaml does not provide is a way of encoding
extended characters -- \uXXXX, \UXXXXXXXX -- in strings or in identifiers.
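
        As a rough illustration of what such an escape would have to
expand to, here is a minimal sketch that writes code points into an
ordinary ocaml string as bytes. It assumes UTF-8 as the byte encoding
(the discussion above does not fix one), and the names add_utf8 and
utf8_of_code_points are hypothetical helpers, not library functions.

```ocaml
(* Hypothetical helper: append the UTF-8 encoding of code point [cp]
   to buffer [b]. Accepts 0 .. 0x10FFFF and rejects surrogates. *)
let add_utf8 b cp =
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "add_utf8: not a valid code point"
  else if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    (* two bytes: 110xxxxx 10xxxxxx *)
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    (* three bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    (* four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end

(* Build a plain string from a list of code points. *)
let utf8_of_code_points cps =
  let b = Buffer.create 16 in
  List.iter (add_utf8 b) cps;
  Buffer.contents b
```

For instance, utf8_of_code_points [0x65E5; 0x672C] yields a six-byte
string that the existing String type stores without complaint.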

> We have uninterpreted
> bytes! It's all we need to build whatever I18NString type we may need.
> What is missing is *library* facilities to abstract that view into a
> full-fledged i18n machinery.

        I agree.
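
        To make the "library over uninterpreted bytes" idea concrete,
here is a minimal sketch of one such facility: turning a byte string
into a list of code points. It assumes the bytes are UTF-8, which is
only one possible choice, and code_points_of_utf8 is a hypothetical
name. The sketch rejects bad lead bytes, bad continuation bytes and
truncated input, but does not reject overlong forms.

```ocaml
(* Hypothetical sketch: decode a UTF-8 byte string into code points. *)
let code_points_of_utf8 s =
  let n = String.length s in
  (* Read [k] continuation bytes starting at index [i], folding their
     low six bits into [acc]; return the value and the next index. *)
  let rec cont i k acc =
    if k = 0 then (acc, i)
    else if i >= n then invalid_arg "code_points_of_utf8: truncated"
    else
      let c = Char.code s.[i] in
      if c land 0xC0 <> 0x80 then
        invalid_arg "code_points_of_utf8: bad continuation byte"
      else cont (i + 1) (k - 1) ((acc lsl 6) lor (c land 0x3F))
  in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      let c = Char.code s.[i] in
      let cp, next =
        if c < 0x80 then (c, i + 1)
        else if c land 0xE0 = 0xC0 then cont (i + 1) 1 (c land 0x1F)
        else if c land 0xF0 = 0xE0 then cont (i + 1) 2 (c land 0x0F)
        else if c land 0xF8 = 0xF0 then cont (i + 1) 3 (c land 0x07)
        else invalid_arg "code_points_of_utf8: bad lead byte"
      in
      go next (cp :: acc)
  in
  go 0 []
```

With this, List.length (code_points_of_utf8 s) counts characters
rather than bytes, which is the kind of abstraction an I18NString
type would need to offer.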

> Of course, there's a problem with the
> manipulation of 32-bit integer values, but if used with care, the Nat
> datatype could serve perfectly well as the underlying, low-level datatype.
>
> Which makes me think, John, you already have variable-length int arrays.

        But they're not standard (yet). Actually, ocaml 'int' is 31 bits
(on 32-bit machines), which is enough bits for ISO10646 (with some
careful fiddling to avoid problems with the sign?).
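
        The arithmetic can be checked with a small sketch. It assumes
the modern Unicode ceiling of 0x10FFFF, which is narrower than the
31-bit UCS-4 space ISO 10646 described at the time; the names
bits_needed, unicode_max and value_bits are hypothetical. A 31-bit int
has 30 value bits plus a sign bit, so the full 31-bit space would need
one more value bit than a 32-bit machine offers, which is presumably
where the sign fiddling would come in.

```ocaml
(* Number of bits needed to write a non-negative integer [n]. *)
let rec bits_needed n = if n = 0 then 0 else 1 + bits_needed (n lsr 1)

(* Assumption: today's Unicode upper bound, not the 1999 UCS-4 limit. *)
let unicode_max = 0x10FFFF

(* Value bits available in a non-negative int on this platform. *)
let value_bits = bits_needed max_int

(* True when every code point up to [unicode_max] fits in a
   non-negative int without touching the sign bit. *)
let unicode_fits = bits_needed unicode_max <= value_bits

let () =
  Printf.printf "code points need %d bits, int offers %d value bits\n"
    (bits_needed unicode_max) value_bits
```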

        So there are TWO issues -- one is to make ocaml itself
ISO10646 aware (i.e., the compiler), and the other is to provide
users with libraries to manipulate extended characters.

        Please note: neither of these features would be optional,
were ocaml to be submitted for ISO standardisation. ISO directives
require all ISO languages to upgrade to provide international
support. I know ocaml isn't an ISO language, but I think the
basic intent is sound. [In some sense, ocaml is already a leader,
accepting Latin-1 characters when other languages only allowed ASCII]

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller


