Version française
Home     About     Download     Resources     Contact us    

This site is updated infrequently. For up-to-date information, please visit the new OCaml website at

Browse thread
[Caml-list] Stop at exception
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: 2002-01-17 (11:26)
From: YAMAGATA yoriyuki <yoriyuki@m...>
Subject: RE: [Caml-list] Non-mutable strings
From: "Mattias Waldau" <>
Subject: RE: [Caml-list] Non-mutable strings
Date: Wed, 16 Jan 2002 20:22:36 +0100

> Thus, introducing Unicode strings (or something similar, I heard that Asians
> don't like Unicode at all) and introducing non-mutable strings should
> preferrable be done simultaneously.

There is criticism to Unicode (Most of them goes to Han-unification,
which integrates all regional variants of ideographics to a single set
of character), but as far as I know, it is the only international
character set in which the standard ways of string matching,
comparison and sorting are defined.  Pattern matching is important to
caml, so I think using Unicode is preferable.

> P.s. Microsoft NT, 2000, XP handles double byte chars everywhere, it is
> called BSTR and in order to make string comparasion etc library-routines are
> called all the time. However, since Unicode can be 4 byte, I don't know how
> that is encoded into 2 bytes.

Unicode standard requires handling an unicode character as one or two
16bits integers.  If a characters is longer than 2 bytes, it is
represented as a pair of surrogate points (specially aligned 16 bits
integers for this purpose.)  Surrogate pairs can only represent 3
bytes character, so Unicode as its narrow sense can only be 3 bytes.
I don't know whether Windows supports surrogates, but since MS is one
of the founding members of Unicode consortium, they will be supported
in the future, any ways.

However, Unicode, as customary called, has another standard, ISO-UCS.
ISO-UCS allows that characters becomes 31-bits long, and ISO seems to
recommend that all characters are represented as 32-bits integers.

Clearly, ISO approaches are more simple and allows fast indexing.  On
the other hand, Unicode is more widely used and provide better
algorithm for case mapping, character classification etc.

For caml, in my really humble opinion, the language had better to hide
such difference (16-bits or 32-bits) and if it can not be hidden (like
case mapping), offer choice to users.

YAMAGATA, yoriyuki (doctoral student)
Department of Mathematical Science, University of Tokyo.
Bug reports:  FAQ:
To unsubscribe, mail  Archives: