Re: Request for Ideas: i18n issues

To: skaller <skaller@maxtal.com.au>
Subject: Re: Request for Ideas: i18n issues
From: John Prevost <prevost@maya.com>
Date: 26 Oct 1999 20:37:55 -0400
In-Reply-To: skaller's message of "Wed, 27 Oct 1999 06:16:38 +1000"

skaller <skaller@maxtal.com.au> writes:

> Sorry: I do not understand it entirely myself. But consider:
> the code point for a space, has no character. There is no such
> thing as a 'space' character. There is no glyph. There _is_ such a
> thing as white space in (latin) script, however.

I think this is overly pedantic. Let's not go quite that far.

> > 1) Basic char and string types
> > 2) Locale type which exists solely for locale identity
> > 3) Collator type which allows various sorting of strings
> > 4) Formatter and parser types for different data values
> > 4a) A sort of formatter which allows reordering of arguments in the output
> > string is needed (not too hard).
> > 5) Reader and writer types for encoding characters into various encodings
>
> Seems reasonable. I personally would address (1) and (5)
> first. Also, I think 'regexp' et al fits in somewhere, and needs to
> be addressed.

True. And you're right, 1 and 5 are probably the most important
things. I'm just trying to clarify what sorts of things are needed
for i18n stuff.
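
Just to make the shape of (1) and (5) concrete, here is a rough
signature sketch. Every name in it is my own invention for
illustration, not a proposed interface:

    (* Item 1: an abstract character/string pair over ISO 10646
       code points. *)
    module type WSTRING = sig
      type wchar                            (* one code point *)
      type wstring                          (* a sequence of wchars *)
      val length : wstring -> int
      val get : wstring -> int -> wchar
    end

    (* Item 5: readers/writers between raw bytes and the abstract
       string type.  Error handling is deliberately left open. *)
    module type CODEC = sig
      type wstring
      type encoding = Utf8 | Utf16 | Latin1
      val decode : encoding -> string -> wstring
      val encode : encoding -> wstring -> string
    end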

> > My current thought is that the char and string types should be defined
> > in terms of Unicode.
>
> Please NO. ISO10646. 'Unicode' is no longer supported by
> Unicode Corp, which is fully behind ISO10646. "unicode" is for
> implementations that will be out of date before anyone gets the tools
> to actually use internationalised software (like text editors,
> fonts, etc).

Ahh. I'm not entirely sure which distinction you are making here. I
use the phrase "Unicode" as meaning "ISO 10646", because I find it
easier to pronounce. I understand the differences, but not why you
went off on a flame when you saw "Unicode". As I understand it, the
goal of Unicode has been to conform to ISO 10646 since at least
revision 2.0.

> > The char type should be abstract,
>
> Why? ISO 10646 code points are just integers 0 .. 2^31-1, and
> it is necessary to have some way of representing them as
> _literals_. In particular, you cannot 'match' easily on an abstract
> type, but matching is going to be a very common operation:
>
> match wchar with
> | WhatDoIPutHere -> ...

This is true. And having base-language support for wchar literals
would be ideal. However, the type of characters and the type of
integers should be disjoint, since the number 32 and the character ' '
are distinct. You do not want to apply a function on characters to an
integer, because they are semantically different. This is precisely
the sort of thing the type system in ML is supposed to be used for.

I won't stand for leaving such an important constraint out of the
abstraction. You may map an integer into a character of the
equivalent codepoint and vice-versa, but a character is not an
integer.

In the implementation, of course, it actually is.
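
To make that concrete, here is a minimal sketch of the kind of module
I have in mind. The name Wchar and its operations are all made up on
the spot:

    module Wchar : sig
      type t                     (* abstract: a character, not an int *)
      val of_int : int -> t      (* rejects invalid code points *)
      val to_int : t -> int
    end = struct
      type t = int               (* inside, it actually is an int *)
      let of_int n =
        (* ISO 10646 code points are the integers 0 .. 2^31-1 *)
        if n < 0 || n > 0x7FFFFFFF then invalid_arg "Wchar.of_int"
        else n
      let to_int c = c
    end

    (* Lacking wchar literals, matching goes through the conversion: *)
    let is_white c =
      match Wchar.to_int c with
      | 0x20 | 0x09 | 0x0A | 0x0D -> true
      | _ -> false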

From another part of your message:

> But let me give another clarifying example. In Unicode, there
> are some code points which are reserved, and do not represent
> characters, namely the High and Low surrogates. These are used to
> allow two byte 16 bit UTF-16 encoding of the first 16 planes of
> ISO-10646.

I'd like to see this issue not exist in O'Caml's character type. To
me, a surrogate is half a character. If I have an array of
characters, I should not be able to look up a cell and get half a
character. Therefore, the base character type in O'Caml should be
defined in terms of the codepoint itself. This still leaves the
surrogate area as problematic: what if I want to be able to deal with
input in which surrogates are mismatched? What if the library client
asks me to turn an integer which is the codepoint of a surrogate into
a character?

But I think that in O'Caml, the character type should never contain
"half a character".

Your other message:

> In that process, I have learned a bit; mainly that:
>
> (1) it is complicated
> (2) it is not wise to specify interfaces based on
> vague assumptions.
>
> That is, I'd like to suggest focussing on encodings/mappings, and
> character classes; and I believe that you will find that the issues
> demand different interfaces than you might think at first: I know
> that what I _thought_ made sense, at first, turned out to be not
> really correct.

I agree. And this is where I'll focus first. At the same time, one
might argue that an object that represents collation, at some level,
has a specific interface which allows one operation: comparing two
pieces of script (to employ an excessively long turn of phrase) for
order. Regardless of how the order is determined, that is the goal.
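
In O'Caml terms, the whole idea reduces to something like this, with
every name hypothetical:

    (* A collator: a value capturing one ordering, with a single
       core operation. *)
    module type COLLATION = sig
      type t                                  (* one collation order *)
      type text                               (* a piece of script *)
      val compare : t -> text -> text -> int  (* <0, 0, or >0 *)
    end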

I don't intend to come up with some things and have them set in
stone. I intend this as an exercise to come up with some tools which
people can use, try out, fiddle with, and make suggestions about.

> On things like collation, there is no way I can see to proceed
> without obtaining the relevant ISO documents. I seriously doubt
> that it is possible to design an 'abstract' interface, without
> actually first trying to implement the algorithms: the interfaces
> will be determined by concrete details and issues.

In this case, I don't propose to describe in detail all of the issues
of collating. But in order to provide a mechanism for ordering
strings other than by code point, and perhaps a way to look such a
mechanism up, it might be worthwhile to explore a few possibilities,
even if they come to nought in the end.

As for myself, I do not generally have the funds or wherewithal to
obtain ISO documents. I do have a copy of the Unicode v2.0 book
handy, but it doesn't concern itself with such issues.

Anyway, you're right, the encoding and character class stuff comes
first. I think format strings and message catalogs are a doable
second goal, which won't cause too much pain. Ways to format dates
and the like are harder, not least because you have to have a
standard way to express a date before you can have a standard
interface for outputting a date in some localized format.
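
For instance (everything here invented for illustration), the value
type has to exist before the formatter can:

    (* First a standard date value, then locale-specific formatting
       over it. *)
    type date = { year : int; month : int; day : int }

    module type DATE_FORMAT = sig
      type locale
      val format : locale -> date -> string
      (* e.g. an "en_US" locale might yield "10/27/1999",
         a "de_DE" locale "27.10.1999" *)
    end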

Anyway, encodings first. Then the world. I'll take a look at your
Python stuff.

John.


