From: Gerd Stolpmann <Gerd.Stolpmann@darmstadt.netsurf.de>
Subject: Re: localization, internationalization and Caml
Date: Fri, 15 Oct 1999 22:28:44 +0200
I agree that Unicode or even ISO-10646 support would be a nice thing. I
also agree that for many users (including myself) the Latin-1 character set
suffices. Luckily, both character sets are strongly related: The first 256
character positions of Unicode (which is the 16-bit subset of ISO-10646) are
exactly the same as in Latin1.
UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit
character is represented by one to six bytes; the higher the character code the
more bytes are needed. This encoding is mainly interesting for I/O, and not for
internal processing, because it is impossible to access the characters of a
string by their position. Internally, you must represent the characters as 16-
or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is
wasted (this is the price for the enhanced possibilities). UTF-8 is thought for
compatibility, because the following holds:
- Every ASCII character (i.e. with codes 0 to 127) is represented as before,
and every non-ASCII character is represented by a byte sequence where the
eighth bit is set. Old, non-UTF-aware programs can at least interpret the
(Note that there is a variant of UTF-8 which encodes the 0 character
differently - by two characters.)
- If you sort UTF-8 strings alphabetically (more precisely, using
the byte values of the encoding as criterion) you get the same result as if
you sorted the strings alphabetically by their character codes.
This means that we need at least three types of strings: Latin1 strings for
compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings
for I/O. For simplicity, I suggest to represent both Latin1 and UTF-8 strings by
the same language type "string", and to provide "wchar" and "wstring" for the
extended character set.
Of course, the current "string" data type is mainly an implementation of byte
sequences, which is independent of the underlying interpretation. Only the
following functions seem to introduce a character set:
- The String.uppercase and .lowercase functions
- The channels if opened in text mode (because they specially recognize the
line endings if needed for the operating system, and newline characters are
part of the character set)
The best solution would be to have internationalized versions of these
functions (and of perhaps some more functions) which still operate on the
"string" type but allow the user to select the encoding and the locale.
This means we would have something like
type encoding = UTF8 | Latin1 | ...
type locale = ...
val String.i18n_uppercase : encoding -> locale -> string -> string
val String.i18n_lowercase : encoding -> locale -> string -> string
val String.recode : encoding -> encoding -> string -> string
(* changes the encoding if possible *)
For "wstring" it is simpler:
val Wstring.uppercase : string -> string
val Wstring.i18n_uppercase : locale -> string -> string
New opening mode for channels:
Text of encoding
This encoding specifies the encoding of the file. It must be possible to change
this later (e.g. to process XML's "encoding" declaration).
New input/output functions:
val output_i18n_string : out_channel -> encoding -> string -> unit
val input_i18n_line : in_channel -> encoding -> string
Here, the encoding argument specifies the encoding of the internal
representation. The other I/O functions need I18N versions as well, and of
course we need to operate on "wstring"s directly:
val output_wstring : out_channel -> wstring -> unit
val input_wstring_line : in_channel -> wstring
This all means that the number of string functions explodes: We need functions
for compatibility (Latin1), functions for arbitrary 8 bit encodings, and
functions for wide strings. I think this is the main argument against it,
and it is very difficult to get around this. (Any ideas?)
>=> my problem is the output of the filter will be no more readable when
>I've put too much French in the program (in comments for instance).
The enlarged character sets become more and more important, and it is only a
matter of time until every piece of software which wants to be taken seriously
can process them, even a dumb terminal or simple text editor. So you will be
able to put accented characters into your comments, and you will see them as
such even if you 'cat' the program text to the terminal or printer; this will
>=> I believe internationalization should not be done by countries
>where English is the only used language: this is at least awkward...
but in the USA (a general prejudice not worth to discuss; there is only some
"personal experience" behind it).
-- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat) Germany ----------------------------------------------------------------------------
This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:27 MET