English version
Accueil     À propos     Téléchargement     Ressources     Contactez-nous    

Ce site est rarement mis à jour. Pour les informations les plus récentes, rendez-vous sur le nouveau site OCaml à l'adresse ocaml.org.

Browse thread
Request for Ideas: i18n issues
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: 1999-10-28 (17:09)
From: John Prevost <prevost@m...>
Subject: Re: Request for Ideas: i18n issues
skaller <skaller@maxtal.com.au> writes:

> 	Sorry: I do not understand it entirely myself.  But consider:
> the code point for a space, has no character.  There is no such
> thing as a 'space' character. There is no glyph.  There _is_ such a
> thing as white space in (latin) script, however.

I think this is overly pedantic.  Let's try not to go out this far.

> > 1) Basic char and string types
> > 2) Locale type which exists solely for locale identity
> > 3) Collator type which allows various sorting of strings
> > 4) Formatter and parser types for different data values
> > 4a) A sort of formatter which allows reordering of arguments in the output
> >     string is needed (not too hard).
> > 5) Reader and writer types for encoding characters into various encodings
> 	Seems reasonable. I personally would address (1) and (5)
> first. Also, I think 'regexp' et al fits in somewhere, and needs to
> be addressed.

True.  And you're right, 1 and 5 are probably the most important
things.  I'm just trying to clarify what sorts of things are needed
for i18n stuff.

> > My current thought is that the char and string types should be defined
> > in terms of Unicode. 
> 	Please NO. ISO10646. 'Unicode' is no longer supported by
> Unicode Corp, which is fully behind ISO10646. "unicode" is for
> implementations that will be out of date before anyone get the tools
> to actually use internationalised software (like text editors,
> fonts, etc).

Ahh.  I'm not entirely sure which distinction you are making here.  I
use the phrase "Unicode" as meaning "ISO 10646", because I find it
easier to pronounce.  I understand the differences, but not why you
went off on a flame when you saw "Unicode".  As my understanding goes,
the goal in Unicode has been to conform to ISO 10646 since at least
revision 2.0.

> > The char type should be abstract, 
> 	Why? ISO 10646 code points are just integers 0 .. 2^31-1, and
> it is necessary to have some way of representing them as
> _literals_. In particular, you cannot 'match' easily on an abstract
> type, but matching is going to be a very common operation:
> 	match wchar with
> 	| WhatDoIPutHere -> ...

This is true.  And having base-language support for wchar literals
would be ideal.  However, the type of characters and the type of
integers should be disjoint, since the number 32 and the character ' '
are distinct.  You do not want to apply a function on characters to an
integer, because they are semantically different.  This is precisely
the sort of thing the type system in ML is supposed to be used for.

I won't stand for leaving such an important constraint out of the
abstraction.  You may map an integer into a character of the
equivalent codepoint and vice-versa, but a character is not an

In the implementation, of course, it actually is.

>From another part of your message:

> 	But let me give another clarifying example.  In Unicode, there
> are some code points which are reserved, and do not represent
> characters, namely the High and Low surrogates. These are used to
> allow two byte 16 bit UTF-16 encoding of the first 16 planes of
> ISO-10646.

I'd like to see this issue not exist in O'Caml's character type.  To
me, a surrogate is half a character.  If I have an array of
characters, I should not be able to look up a cell and get half a
character.  Therefore, the base character type in O'Caml should be
defined in terms of the codepoint itself.  This still leaves the
surrogate area as problemmatic: what if I want to be able to deal with
input in which surrogates are mismatched?  What if the library client
asks me to turn an integer which is the codepoint of a surrogate into
a character?

But I think that in O'Caml, the character type should never contain
"half a character".

Your other message:

> In that process, I have learned a bit; mainly that:
> 	(1) it is complicated
> 	(2) it is not wise to specify interfaces based on 
> 		vague assumptions.
> That is, I'd like to suggest focussing on encodings/mappings, and
> character classes; and I believe that you will find that the issues
> demand different interfaces than you might think at first: I know
> that what I _thought_ made sense, at first, turned out to be not
> really correct.

I agree.  And this is where I'll focus first.  At the same time, one
might argue that an object that represents collation, at some level,
has a specifc interface which allows an operation: comparing two
pieces of script (to employ an excessively long turn of phrase) for
order.  Regardless of how the order is determined, that's what the
goal is.

I don't intend to try to come up with some things and have them set in
stone.  I intend this as an excercise to come up with some tools which
people can use and try out and fiddle with and make suggestions about.

> On things like collation, there is no way I can see to proceed
> without obtaining the relevant ISO documents.  I seriously doubt
> that it is possible to design an 'abstract' interface, without
> actually first trying to implement the algorithms: the interfaces
> will be determined by concrete details and issues.

In this case, I don't propose to describe in detail all of the issues
of collating.  But, in order to provide a non-codepoint ordered
mechanism for ordering strings, and maybe provide a way to find such a
mechanism, it might be worthwhile to explore a few possibilities, even
if they come to nought in the end.

As for myself, I do not generally have the funds or werewithal to
obtain ISO documents.  I do have a copy of the Unicode v2.0 book
handy, but it doesn't concern itself with such issues.

Anyway, you're right, the encoding and character class stuff is first.
I think format strings and message catalogs is a doable second goal,
which actually won't cause too much pain.  Ways to format dates and
the like is harder, not least because you have to have a standard way
to express a date before you can have a standard interface for
outputting a date in some localized format.

Anyway, encodings first.  Then the world.  I'll take a look at your
Python stuff.