Re: localization, internationalization and Caml

From: Gerd Stolpmann <Gerd.Stolpmann@darmstadt.netsurf.de>
To: caml-list@inria.fr
Date: Fri, 15 Oct 1999 22:28:44 +0200
Message-Id: <99101523515200.25289@ice>
I agree that Unicode or even full ISO-10646 support would be a nice thing. I
also agree that for many users (including myself) the Latin-1 character set
suffices. Luckily, the two character sets are closely related: the first 256
character positions of Unicode (which is the 16-bit subset of ISO-10646) are
exactly the same as in Latin-1.

UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit
character is represented by one to six bytes; the higher the character code, the
more bytes are needed. This encoding is mainly interesting for I/O, not for
internal processing, because you cannot access the n-th character of a string
in constant time. Internally, you must represent the characters as 16- or
32-bit numbers (called UCS-2 and UCS-4, respectively), even if memory is wasted
(this is the price for the enhanced possibilities). UTF-8 is intended for
compatibility, because the following holds (a small encoder sketch in OCaml
follows this list):

- Every ASCII character (i.e. with codes 0 to 127) is represented as before,
  and every non-ASCII character is represented by a byte sequence where the
  eighth bit is set. Old, non-UTF-aware programs can at least interpret the
  ASCII characters.
  (Note that there is a variant of UTF-8 which encodes the NUL character
  (code 0) differently, namely by two bytes.)

- If you sort UTF-8 strings byte-wise (i.e. using the byte values of the
  encoding as the criterion), you get the same order as if you sorted the
  strings by their character codes.
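
To illustrate, here is a minimal sketch of such an encoder in today's OCaml,
restricted to 16-bit characters (so at most three bytes per character); the
function name utf8_encode is mine, not part of any existing or proposed
library:

        (* Sketch: encode one 16-bit character code as a UTF-8 byte
           sequence. ASCII stays a single unchanged byte; all other
           bytes have the eighth bit set, as described above. *)
        let utf8_encode (code : int) : string =
          if code < 0x80 then
            (* ASCII: one byte, unchanged *)
            String.make 1 (Char.chr code)
          else if code < 0x800 then begin
            (* two bytes: 110xxxxx 10xxxxxx *)
            let b = Bytes.create 2 in
            Bytes.set b 0 (Char.chr (0xC0 lor (code lsr 6)));
            Bytes.set b 1 (Char.chr (0x80 lor (code land 0x3F)));
            Bytes.to_string b
          end else begin
            (* three bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
            let b = Bytes.create 3 in
            Bytes.set b 0 (Char.chr (0xE0 lor (code lsr 12)));
            Bytes.set b 1 (Char.chr (0x80 lor ((code lsr 6) land 0x3F)));
            Bytes.set b 2 (Char.chr (0x80 lor (code land 0x3F)));
            Bytes.to_string b
          end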

This means that we need at least three types of strings: Latin-1 strings for
compatibility, UCS-2 or UCS-4 strings for internal processing, and UTF-8
strings for I/O. For simplicity, I suggest representing both Latin-1 and UTF-8
strings by the same language type "string", and providing "wchar" and
"wstring" for the extended character set.
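
To make this concrete, one possible representation of "wstring" (purely
hypothetical, chosen here only for illustration) is an array of character
codes. Converting a Latin-1 string is then a plain widening, because of the
Latin-1/Unicode correspondence mentioned above:

        (* Hypothetical representation: a wide string as an array of
           UCS-4 character codes; access by position is O(1), unlike
           in UTF-8. *)
        type wchar = int
        type wstring = wchar array

        (* Latin-1 code = Unicode code for the first 256 positions,
           so the conversion just widens each byte. *)
        let wstring_of_latin1 (s : string) : wstring =
          Array.init (String.length s) (fun i -> Char.code s.[i])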

Of course, the current "string" data type is mainly an implementation of byte
sequences, which is independent of the underlying interpretation. Only the
following functions seem to introduce a character set:

- The String.uppercase and .lowercase functions
- The channels, if opened in text mode (because they translate line endings
  where the operating system requires it, and newline characters are part of
  the character set)

The best solution would be to have internationalized versions of these
functions (and of perhaps some more functions) which still operate on the
"string" type but allow the user to select the encoding and the locale.

This means we would have something like

        type encoding = UTF8 | Latin1 | ...
        type locale = ...
        val String.i18n_uppercase : encoding -> locale -> string -> string
        val String.i18n_lowercase : encoding -> locale -> string -> string
        val String.recode : encoding -> encoding -> string -> string
                (* changes the encoding if possible *)
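
As an illustration of what one direction of "recode" could look like, here is
a sketch of Latin1 -> UTF8, reusing the hypothetical utf8_encode function from
above (every Latin-1 code is below 0x100, so the one- and two-byte cases
suffice):

        (* Sketch of one direction of String.recode: Latin-1 to UTF-8;
           only for illustration, not a complete implementation. *)
        let recode_latin1_to_utf8 (s : string) : string =
          let buf = Buffer.create (String.length s) in
          String.iter
            (fun c -> Buffer.add_string buf (utf8_encode (Char.code c)))
            s;
          Buffer.contents buf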

For "wstring" it is simpler:

        val Wstring.uppercase : wstring -> wstring
        val Wstring.i18n_uppercase : locale -> wstring -> wstring

New opening mode for channels:
        Text of encoding
The encoding argument specifies the encoding of the file. It must be possible
to change it later (e.g. to process XML's "encoding" declaration).
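
One way to allow the encoding to be changed after opening is to keep it as
mutable state of the channel; the following is only a sketch, and all names in
it are hypothetical:

        (* Sketch: a text channel whose encoding can be switched later,
           e.g. after parsing an XML "encoding" declaration. *)
        type encoding = UTF8 | Latin1

        type text_in_channel = {
          ic : in_channel;
          mutable enc : encoding;
        }

        let open_text (name : string) (e : encoding) : text_in_channel =
          { ic = open_in name; enc = e }

        let set_encoding (tc : text_in_channel) (e : encoding) : unit =
          tc.enc <- e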

New input/output functions:

        val output_i18n_string : out_channel -> encoding -> string -> unit
        val input_i18n_line : in_channel -> encoding -> string

Here, the encoding argument specifies the encoding of the internal
representation. The other I/O functions need I18N versions as well, and of
course we need to operate on "wstring"s directly:

        val output_wstring : out_channel -> wstring -> unit
        val input_wstring_line : in_channel -> wstring
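
Assuming the int-array representation of "wstring" and the utf8_encode helper
sketched earlier, output_wstring for a UTF-8 channel might reduce to a
one-liner (again only a sketch):

        (* Sketch: write a wide string to a channel in UTF-8, reusing
           the utf8_encode and wstring sketches from above. *)
        let output_wstring (oc : out_channel) (ws : wstring) : unit =
          Array.iter (fun code -> output_string oc (utf8_encode code)) ws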

This all means that the number of string functions explodes: we need functions
for compatibility (Latin-1), functions for arbitrary 8-bit encodings, and
functions for wide strings. I think this is the main argument against the
whole approach, and it is very difficult to get around. (Any ideas?)

Francis Dupont:
>=> my problem is the output of the filter will be no more readable when
>I've put too much French in the program (in comments for instance).

The enlarged character sets are becoming more and more important, and it is
only a matter of time until every piece of software that wants to be taken
seriously can process them, even a dumb terminal or a simple text editor. So
you will be able to put accented characters into your comments, and you will
see them as such even if you 'cat' the program text to the terminal or
printer; this will work everywhere...

>=> I believe internationalization should not be done by countries
>where English is the only used language: this is at least awkward...

...but in the USA (a general prejudice, not worth discussing; there is only
some "personal experience" behind it).

Gerd

--
----------------------------------------------------------------------------
Gerd Stolpmann      Telephone: +49 6151 997705 (private)
Viktoriastr. 100
64293 Darmstadt     EMail:   Gerd.Stolpmann@darmstadt.netsurf.de (private)
Germany                     
----------------------------------------------------------------------------


