Date: Tue, 26 Oct 1999 04:55:01 +1000
From: skaller <skaller@maxtal.com.au>
To: John Prevost <prevost@maya.com>
Subject: Re: Request for Ideas: i18n issues
John Prevost wrote:
> Back to the charset/encoding module type. Here's what I think might
> want to be in here. I appeal to you for suggestions about things to
> remove or add.
>
> Charsets:
>
> * a type for characters in the charset
I think you mean 'code points'. These are
logically distinct from characters (which are a somewhat
amorphous notion). For example, a single character -- if there
is such a thing in the language -- may consist of several
code points combined, and there are code points which are
not characters at all.
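To make the distinction concrete, here is a minimal sketch
(the types are illustrative, not a proposal): the character
e-acute can be written as one code point or as two combined:

    (* A code point is just an integer; a "character" may span several. *)
    type code_point = int

    (* Two renderings of the same character, 'e' with acute accent: *)
    let precomposed : code_point array = [| 0x00E9 |]          (* U+00E9 *)
    let decomposed  : code_point array = [| 0x0065; 0x0301 |]  (* e + combining acute *)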
> * a type for strings in the charset (maybe char array, maybe not)
I think you mean 'script'. Strings of code points
can be used to represent script.
> * functions to determine the "class" of a character. This would
> probably involve a standard "character class" type, possibly
> informed by character classes in Unicode.
Yes. For plane 0 of ISO 10646 (the Basic Multilingual
Plane, i.e. Unicode), the data is readily available for a few
key attributes: character class, case mappings, corresponding
decimal digit, and so on.
> * functions to work with strings in the character set in order to do
> standard manipulations. If we said a string is always a char
> array and that there are standard functions to work on strings
> given the above, this might be something that can be done away
> with.
Probably not. There is a distinction, for example, between
concatenating two arrays of code points and producing a new
array of code points corresponding to the concatenated script.
This is most visible in Arabic, but it is also true in
plain old English: concatenating the scripts
"It is a"
and
"nice day"
requires a space to be inserted between the code point arrays
to obtain the correctly concatenated script
"It is a nice day"
> * functions to convert characters and strings to a reference format,
> perhaps UCS-4. UCS-4 isn't perfect, but it does have a great deal
> of coverage, and without some common format, converting from one
> character set to another is problematic.
I agree. I think there are two choices here, UCS-4 and UTF-8
encodings of ISO-10646. UCS-4 is better for internal operations,
UTF-8 for IO and streaming operations (perhaps including
regular expression matching where tables of 256 cases are more
acceptable than 2^31 :-)
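As a concrete point of reference, here is a sketch of a UTF-8
encoder for a single code point. It covers sequences up to four
bytes; the original definition extends the same scheme to six
bytes for values below 2^31. No validity checking is done:

    (* Encode one ISO 10646 code point as UTF-8 bytes (a sketch). *)
    let utf8_of_code_point (cp : int) : string =
      let b = Buffer.create 4 in
      if cp < 0x80 then
        Buffer.add_char b (Char.chr cp)
      else if cp < 0x800 then begin
        Buffer.add_char b (Char.chr (0xC0 lor (cp lsr 6)));
        Buffer.add_char b (Char.chr (0x80 lor (cp land 0x3F)))
      end else if cp < 0x10000 then begin
        Buffer.add_char b (Char.chr (0xE0 lor (cp lsr 12)));
        Buffer.add_char b (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
        Buffer.add_char b (Char.chr (0x80 lor (cp land 0x3F)))
      end else begin
        Buffer.add_char b (Char.chr (0xF0 lor (cp lsr 18)));
        Buffer.add_char b (Char.chr (0x80 lor ((cp lsr 12) land 0x3F)));
        Buffer.add_char b (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
        Buffer.add_char b (Char.chr (0x80 lor (cp land 0x3F)))
      end;
      Buffer.contents b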
> Encodings:
>
> These are tied to charsets much of the time, but not always.
>
> * functions to encode and decode strings in streams and buffers.
Yes.
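Something like the following signature, perhaps, with ISO 10646
code points as the common currency (names are illustrative only;
this is the buffer-oriented form, and stream variants would work
on channels instead):

    (* A hypothetical interface for one encoding. *)
    module type ENCODING = sig
      val name   : string
      val decode : string -> int array   (* bytes -> code points *)
      val encode : int array -> string   (* code points -> bytes *)
    end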
> Locales:
>
> * functions to do case mapping, collation, etc.
No. It is generally accepted that 'locale' information
is limited to culturally sensitive variations like whether
full stop or comma is used for a decimal point, and whether
the date is written dd/mm/yy or yy/mm/dd or mm/dd/yy.
Collation, case mapping, and the like are not locale
data, but are specific to a particular script.
The tendency in i18n developments has been, I think,
to divorce character sets, encodings, collation, and script
issues from the locale: the locale may indicate the local
language, but that is independent (more or less) of
script processing.
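On that view, a locale reduces to a small record of presentation
conventions, something like this sketch (fields illustrative):

    type date_order = DMY | YMD | MDY

    type locale = {
      decimal_separator : char;        (* '.' or ',' *)
      date_order        : date_order;  (* dd/mm/yy, yy/mm/dd, mm/dd/yy *)
      language          : string;      (* a hint only; no script logic here *)
    }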
> (I think this is really why Java went to "the one true charset is
> Unicode". Not just because of politics, but because interacting with
> mutually incompatible character sets can be a type-safety nightmare.)
Yes. I am somewhat surprised to see an attempt to create a more
abstract interface to multiple character sets/encodings. This area
tends, I think, to be complex, full of ad hoc rules, and so quirky
as to defy genuine abstraction.
Fixing a single standard (ISO 10646) is a simpler
alternative; simpler still if there is a single reference encoding
such as UCS-4 or UTF-8. In that case, the functions that do the
work can be specialised to a well-researched International
Standard. It is still necessary to provide functions that
encode/decode the standard format to other formats
(encodings/character sets), but no functions need be provided
to do things like collation or case mapping for those other
formats.
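For instance, a sketch of such encode/decode functions for
Latin-1 against a UCS-4 reference format (int array); note that
nothing script-like -- no collation, no case mapping -- is ever
defined on the Latin-1 side:

    let ucs4_of_latin1 (s : string) : int array =
      Array.init (String.length s) (fun i -> Char.code s.[i])

    let latin1_of_ucs4 (a : int array) : string =
      String.init (Array.length a) (fun i ->
        if a.(i) < 0x100 then Char.chr a.(i)
        else '?')  (* unrepresentable: one possible policy *)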
> One or more of the above might be functors, so that you can compose a
> character set, encoding, and locale to get what you want. This, of
> course, gets into questions of whether certain character sets,
> locales, and encodings are interoperable, and how one might cause a
> type error when trying to combine an encoding and a character set that
> don't work together. Dunno if this is possible.
I think it is possible, but it isn't desirable. For example,
it is possible to use UCS-4 or UTF-8 encodings of ANY
character set, since all have integral code points to represent
them: UCS-4 is universal for all sets of fewer than 2^32 points,
UTF-8 for sets of fewer than 2^31 points.
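For what it is worth, the type error John asks about can be
arranged in OCaml by giving each character set an abstract point
type; a sketch (module names illustrative):

    module type CHARSET = sig
      type point                  (* abstract, distinct per charset *)
      val to_ucs4 : point -> int
    end

    module Latin1 : CHARSET with type point = char = struct
      type point = char
      let to_ucs4 = Char.code
    end

    module Jis0208 : CHARSET with type point = int = struct
      type point = int
      let to_ucs4 n = n           (* placeholder mapping *)
    end

    (* A conversion functor demands the matching charset... *)
    module Ref_of (C : CHARSET) = struct
      let to_reference (pts : C.point array) : int array =
        Array.map C.to_ucs4 pts
    end

    (* ...so Ref_of(Latin1).to_reference accepts only Latin1.point
       arrays; feeding it Jis0208 points is a compile-time error. *)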
> How to recover from
> failure is of course a good question to try to answer as well.
There are, in fact, multiple answers; this is one
of the complicating factors.
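At least three policies are common, and a library may need to
offer all of them; a sketch (names illustrative):

    (* What to do when decoding hits an invalid sequence. *)
    type on_error =
      | Fail                (* raise an exception at the bad offset *)
      | Replace of int      (* substitute a code point, e.g. 0xFFFD *)
      | Skip                (* silently drop the offending bytes *)

    exception Decode_error of int   (* byte offset of the failure *)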
--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd, Glebe NSW 2037, Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller