RE: Request for Ideas: i18n issues

From: Frank A. Christoph (
Date: Tue Oct 26 1999 - 15:36:39 MET DST

From: "Frank A. Christoph" <>
To: "John Prevost" <>, "skaller" <>
Subject: RE: Request for Ideas: i18n issues
Date: Tue, 26 Oct 1999 22:36:39 +0900
In-Reply-To: <>

John Prevost wrote in response to John Skaller:
> > Probably not. There is a distinction, for example, between
> > concatenating two arrays of code points, and producing a new array
> > of code points corresponding to the concatenated script. This is
> > 'most true' in Arabic, but it is also true in plain old English: the
> > script
> {...}
> I don't believe that this is a job for the basic string manipulation
> stuff to do. There do need to be methods for manipulating strings as
> sequences. As such, I'm not going to worry about it at this level.

Maybe he is arguing for a higher-level approach to text representation. Text
is, after all, an important enough application field that it deserves some
treatment in a standard library. Unfortunately (as we are all bearing
witness to in this discussion) there is no standard way to do it, even for
one (natural) language, much less all of them.

> The big difficulty here is that not everybody wants to eat Unicode. I
> think it's appropriate, but not everyone does. And there are still
> characters in iso-2022, for example, which have no Unicode code point.
> I think that as I look at the problem more, though, I'm inclined to
> say "one definitive set of characters" is a better idea. Especially
> since that set is needed for reasonable interoperation between the
> others.

This is getting a little afield from the topic of the list, but I have been
wondering if maybe Unicode (and other related standards) might not end up
being most valuable in the long run, not for being able to represent text
from existing languages but, for shaping the form of languages to come. I
don't mean completely new languages of course (I think I heard Klingon and
Tengwar were actually submitted for inclusion!), but rather that as
electronic information becomes more and more central to communication, the
need to encode it will change the way people speak and write in other media
as well. I think this is happening already, sort of: for example, I've seen
email emoticons used in published materials, and on billboards here in
Japan; and we all know a hundred words from computer jargon that have made
it into mainstream languages.

> Something I've noted looking at O'Caml these last few days: the
> "string" type is really more an efficient byte array type. And the
> char type is really a byte type. There's no real way to do "input
> bytes from a stream" except inputting them as characters and then
> interpreting those characters as bytes.

That's exactly the way I think of and use O'Caml characters and strings. It
is sort of unfortunate that O'Caml inherited this terminology from C... I
would have preferred "byte" or "octet" for characters, at least.


This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:27 MET