Date: Wed, 20 Oct 1999 04:06:06 +1000
From: skaller <email@example.com>
Subject: Re: localization, internationalization and Caml
Gerd Stolpmann wrote:
> I agree that Unicode or even ISO-10646 support would be a nice thing. I
> also agree that for many users (including myself) the Latin-1 character set
Generally, for me, 7 bit ASCII 'suffices'. But that is irrelevant,
the world is bigger than my country, or Europe.
>Luckily, both character sets are strongly related: The first 256
> character positions of Unicode (which is the 16-bit subset of ISO-10646) are
> exactly the same as in Latin1.
.. of course, this is not luck ..
> UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit
> character is represented by one to six bytes; the higher the character code the
> more bytes are needed. This encoding is mainly interesting for I/O, and not for
> internal processing, because it is impossible to access the characters of a
> string by their position. Internally, you must represent the characters as 16-
> or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is
> wasted (this is the price for the enhanced possibilities).
I don't agree. If you read ISO10646 carefully, you will find
that you must STILL parse sequences of code points to obtain the
of a 'character', if, indeed, such a concept exists,
and furthermore, the sequences are not unique. For example,
many diacritic marks such as accents may be appended to a code point,
and act in conjunction with the preceding code point to represent
a character. This is permitted EVEN if there is a single code point
for that character; and worse, if there are TWO such marks, the order is
And that's just simple European languages. Now try Arabic or Thai :-)
> This means that we need at least three types of strings: Latin1 strings for
> compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings
> for I/O. For simplicity, I suggest to represent both Latin1 and UTF-8 strings by
> the same language type "string", and to provide "wchar" and "wstring" for the
> extended character set.
I'd like to suggest we forget the 'wchar' string, at least initially.
I think you will find UTF-8 encoding requires very few changes. For
genuine regular expressions work out of the box. String searching
works out of the box.
What doesn't work efficiently is indexing.
And it is never necessary to do it for human script.
Why would you ever want to, say, replace the 10'th character
of a string?? [You could: if you were analysing, say, a stock code,
but in that case the n'th byte would do: it isn't natural language
The way to handle Latin-1, or Big-5, or KSC, or ShiftJis,
is to translate it with an input filter, or internally, if the client
is reading the codes directly.
> Of course, the current "string" data type is mainly an implementation of byte
> sequences, which is independent of the underlying interpretation. Only the
> following functions seem to introduce a character set:
> - The String.uppercase and .lowercase functions
It is best to get rid of these functions. The belong in a more
natural language processing package.
> - The channels if opened in text mode (because they specially recognize the
> line endings if needed for the operating system, and newline characters are
> part of the character set)
This is a serious problem. It is also partly ill formed:
what is a 'line' in Chinese, which writes characters top down?
What is a 'line' in a Web page :-)
> The best solution would be to have internationalized versions of these
> functions (and of perhaps some more functions) which still operate on the
> "string" type but allow the user to select the encoding and the locale.
There is more. The compiler must be modified to accept
identifiers in the extended character set. This should work out
of the box (since 8'th bit set characters are already accepted);
in fact, it is too permissive. Secondly, literals need to be
processed, to expand \uXXXX and \UXXXXXXXX escapes.
> This means we would have something like
> type encoding = UTF8 | Latin1 | ...
Be careful to distinguish CODE SET from ENCODING.
See the Unicode home page for more details: Latin-1 is NOT
an encoding, but a character set. There is a MAPPING from
Latin-1 to ISO-10646. This is not the same thing as an encoding.
I think this is the wrong approach: we do not want built-in
cases for every possible encoding/character set.
Instead, we want an open ended set of conversions from (and perhaps to)
the internally used representation. There are a LOT of such
we need to add new ones without breaking into a module. We should do it
functionally; this should work well in ocaml :-)
> type locale = ...
> val String.i18n_uppercase : encoding -> locale -> string -> string
> val String.i18n_lowercase : encoding -> locale -> string -> string
Not in the String module. This belongs in a different package
which handles complex vagaries of human script. [This particular
function, is relatively simple. Another is 'get_digit', 'isdigit'.
Whitespace is much harder. Collation is a nightmare :-]
> val String.recode : encoding -> encoding -> string -> string
> (* changes the encoding if possible *)
This isn't quite right. The way to do this is to have a function:
which does the mapping (from a SINGLE code point in LATIN1 to ISO10646).
The code point is an int. Separately, we handle encodings:
converts an int to a UTF8 string, and
UTF8_to_UCS4 string position
parses the string from position, returning a code point and position.
There are other encodings, such as DBCS encodings, which generally
are tied to a single character set. [UCS4/UTF8 are less depenent]
> This all means that the number of string functions explodes:
Exactly. And we don't want that. So I suggest, we continue
to use the existing strings of 8 bit bytes ONLY, and represent
ALL foreign [non ISO10646] character sets using ISO-10646 code points,
encoded as UTF-8, and provide an input filter for the compiler.
In addition, some extra functions to convert other
character sets and encodings to ISO-10646/UTF-8 are provided,
and, if you like, they can be plugged into the I/O system.
This means a lot of conversion functions, but ONE
internal representation only: the one we already have.
We need functions
> for compatibility (Latin1), functions for arbitrary 8 bit encodings, and
> functions for wide strings. I think this is the main argument against it,
> and it is very difficult to get around this. (Any ideas?)
I've been trying to tell you how to do it. The solution is simple,
to adopt ISO-10646 as the SOLE character set, and UTF-8 as the SOLE
encoding of it; and provide conversions from other character sets and
All the code that needs to manipulate strings can then be provided NOW
as additional functions manipulating the existing string type.
The apparent loss of indexing is a mirage. The gain is huge:
ISO-10646 Level 1 compliance without any explosiion of data types.
Yes, some more _functions_ are needed to do extra processing,
such as normalisation, comparisons of various kinds,
capitalisation, etc. Regular expressions will need to be
enhanced, to fix the special features (like case insensitive searching),
but the basic regular expressions will work out of the box.
> The enlarged character sets become more and more important, and it is only a
> matter of time until every piece of software which wants to be taken seriously
> can process them, even a dumb terminal or simple text editor. So you will be
> able to put accented characters into your comments, and you will see them as
> such even if you 'cat' the program text to the terminal or printer; this will
> work everywhere...
Yes. This time is not here yet, but it will come soon that
international support is mandatory for all large software purchases
by governments and large corporations.
-- John Skaller, mailto:firstname.lastname@example.org 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller
This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:27 MET