Version française
Home     About     Download     Resources     Contact us    

This site is updated infrequently. For up-to-date information, please visit the new OCaml website at

Browse thread
Request for Ideas: i18n issues
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Frank A. Christoph <christo@n...>
Subject: RE: Request for Ideas: i18n issues
John Prevost wrote in response to John Skaller:
> > 	Probably not. There is a distinction, for example, between
> > concatenating two arrays of code points, and producing a new array
> > of code points corresponding to the concatenated script.  This is
> > 'most true' in Arabic, but it is also true in plain old English: the
> > script
>   {...}
> I don't believe that this is a job for the basic string manipulation
> stuff to do.  There do need to be methods for manipulating strings as
> sequences.  As such, I'm not going to worry about it at this level.

Maybe he is arguing for a higher-level approach to text representation. Text
is, after all, an important enough application field that it deserves some
treatment in a standard library. Unfortunately (as we are all bearing
witness to in this discussion) there is no standard way to do it, even for
one (natural) language, much less all of them.

> The big difficulty here is that not everybody wants to eat Unicode.  I
> think it's appropriate, but not everyone does.  And there are still
> characters in iso-2022, for example, which have no Unicode code point.
> I think that as I look at the problem more, though, I'm inclined to
> say "one definitive set of characters" is a better idea.  Especially
> since that set is needed for reasonable interoperation between the
> others.

This is getting a little afield from the topic of the list, but I have been
wondering if maybe Unicode (and other related standards) might not end up
being most valuable in the long run, not for being able to represent text
from existing languages but, for shaping the form of languages to come. I
don't mean completely new languages of course (I think I heard Klingon and
Tengwar were actually submitted for inclusion!), but rather that as
electronic information becomes more and more central to communication, the
need to encode it will change the way people speak and write in other media
as well. I think this is happening already, sort of: for example, I've seen
email emoticons used in published materials, and on billboards here in
Japan; and we all know a hundred words from computer jargon that have made
it into mainstream languages.

> Something I've noted looking at O'Caml these last few days: the
> "string" type is really more an efficient byte array type.  And the
> char type is really a byte type.  There's no real way to do "input
> bytes from a stream" except inputting them as characters and then
> interpreting those characters as bytes.

That's exactly the way I think of and use O'Caml characters and strings. It
is sort of unfortunate that O'Caml inherited this terminology from C... I
would have preferred "byte" or "octet" for characters, at least.