| From: | YAMAGATA yoriyuki <yoriyuki@m...> |
| Subject: | RE: [Caml-list] Non-mutable strings |
From: "Mattias Waldau" <mattias.waldau@abc.se>
Subject: RE: [Caml-list] Non-mutable strings
Date: Wed, 16 Jan 2002 20:22:36 +0100

> Thus, introducing Unicode strings (or something similar, I heard that Asians
> don't like Unicode at all) and introducing non-mutable strings should
> preferrable be done simultaneously.

There is criticism of Unicode (most of it directed at Han unification, which merges the regional variants of ideographs into a single set of characters), but as far as I know, it is the only international character set for which standard ways of string matching, comparison and sorting are defined. Pattern matching is important to Caml, so I think using Unicode is preferable.

> P.s. Microsoft NT, 2000, XP handles double byte chars everywhere, it is
> called BSTR and in order to make string comparasion etc library-routines are
> called all the time. However, since Unicode can be 4 byte, I don't know how
> that is encoded into 2 bytes.

The Unicode standard requires handling a Unicode character as one or two 16-bit integers. If a character does not fit in 2 bytes, it is represented as a pair of surrogates (16-bit integers taken from ranges reserved for this purpose). Surrogate pairs can only reach code points up to U+10FFFF, which fit in 3 bytes, so Unicode in its narrow sense never needs more than 3 bytes per character. I don't know whether Windows supports surrogates, but since MS is one of the founding members of the Unicode Consortium, they will be supported in the future anyway.

However, what is customarily called Unicode has another standard, ISO-UCS. ISO-UCS allows characters to be up to 31 bits long, and ISO seems to recommend representing all characters as 32-bit integers. Clearly, the ISO approach is simpler and allows fast indexing. On the other hand, Unicode is more widely used and provides better algorithms for case mapping, character classification, etc.
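To make the surrogate mechanism mentioned above concrete, here is a minimal OCaml sketch of how a code point is split into one or two 16-bit units and how a pair is joined again. The function names `utf16_encode` and `utf16_decode_pair` are made up for this illustration and do not belong to any existing library; plain `int` values stand in for both code points and 16-bit units.

```ocaml
(* Hypothetical helpers for illustration only; not part of any OCaml library. *)

(* Encode a Unicode scalar value as a list of 16-bit code units.
   Code points below 0x10000 take one unit; the rest take a surrogate pair. *)
let utf16_encode (cp : int) : int list =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf16_encode: not a Unicode scalar value"
  else if cp < 0x10000 then
    [cp]
  else
    (* 20 bits remain after subtracting 0x10000: 10 go into each surrogate. *)
    let v = cp - 0x10000 in
    [0xD800 lor (v lsr 10); 0xDC00 lor (v land 0x3FF)]

(* Join a high and a low surrogate back into a code point. *)
let utf16_decode_pair (hi : int) (lo : int) : int =
  if hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF then
    invalid_arg "utf16_decode_pair: not a surrogate pair"
  else
    0x10000 + ((hi - 0xD800) lsl 10) + (lo - 0xDC00)
```

For example, `utf16_encode 0x10FFFF` gives `[0xDBFF; 0xDFFF]`, the largest value a surrogate pair can express, which is why the narrow Unicode range stays within 3 bytes.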
For Caml, in my really humble opinion, the language had better hide such differences (16-bit or 32-bit), and where they cannot be hidden (as with case mapping), offer the choice to users.

Regards
--
YAMAGATA, yoriyuki (doctoral student)
Department of Mathematical Science, University of Tokyo.
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr