Date: 1999-10-19 (21:39)
From: skaller <skaller@m...>
Subject: Re: localization, internationalization and Caml
Xavier Leroy wrote:

> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident. The first books on Caml were
> written in French, and it was nice to be able to use accented French
> words as identifiers. Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.

And supporting ISO-8859-1 was a fine thing to do at the time!

> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa. If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win. (At least
> until we get OCaml in the Chinese curriculum...)

While this is true, there is a circularity here: people not using 8-bit
character sets face an extra battle using ocaml.

> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.

Yes. This is one of the key points of my argument that UTF-8 is the
natural way to go: it provides ISO-10646 compliance without requiring
any new string kind.

> There are several ways to internationalize further. One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
>   need to set LC_CTYPE to different values under Linux, Solaris, and
>   Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.

If you are suggesting not using the C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach. However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
>   technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
>   chose 32-bit Unicode. (That's the great thing about standards:
>   there are so many to choose from...)

I cannot see the problem -- except for the 16-bit adopters, who must
eventually upgrade .. again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
>   E.g. Java seems to have its own variant of UTF8. How are we going
>   to interoperate?

I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO Standard is followed. If Java doesn't
do that, that is Java's problem.

> - I/O is a nightmare. The API has to handle at least byte streams,
>   wide character streams, and UTF8-encoded streams.

No, it doesn't. That is a possibility, but it is NOT necessary. It is
necessary only to read byte streams; conversion can be done later using
strings. This is less efficient, but it is a sensible starting point
(to ignore internationalisation on I/O completely).

> - Support for Unicode / UTF8 files in today's operating systems and GUIs
>   is very low. When will I be able to do "more" on an UTF8 file and see my
>   French accented letters?

Yes, I agree. This is a major problem. One of the answers is "when
programming languages provide the support that applications programmers
need" :-)

> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.

I agree.
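To make the "no new string kind" point concrete, here is a minimal sketch
(my own illustration, not Xavier's code and not the standard library's)
of pulling one code point out of a UTF-8 encoded ordinary string. The
function name is an assumption, and malformed sequences are not checked:

    (* Decode the UTF-8 sequence starting at byte i of an ordinary
       OCaml string; return (code point, number of bytes consumed).
       No validation of malformed input is attempted. *)
    let decode_utf8_char (s : string) (i : int) : int * int =
      let b n = Char.code s.[i + n] in
      let c0 = b 0 in
      if c0 < 0x80 then (c0, 1)                        (* 1-byte, ASCII *)
      else if c0 < 0xE0 then                           (* 2-byte sequence *)
        (((c0 land 0x1F) lsl 6) lor (b 1 land 0x3F), 2)
      else if c0 < 0xF0 then                           (* 3-byte sequence *)
        (((c0 land 0x0F) lsl 12) lor ((b 1 land 0x3F) lsl 6)
         lor (b 2 land 0x3F), 3)
      else                                             (* 4-byte sequence *)
        (((c0 land 0x07) lsl 18) lor ((b 1 land 0x3F) lsl 12)
         lor ((b 2 land 0x3F) lsl 6) lor (b 3 land 0x3F), 4)

The string s here is just the usual string type: the bytes can be read
from a plain binary channel and decoded later, only if and when the
application actually cares about code points.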
The way forward is, I believe:

a) do not change the I/O system, but deprecate TEXT mode (all I/O
   should be done in binary)
b) do not change the String module, but deprecate the upper/lower case
   functions (and anything else that smacks of relating to natural
   language)
c) provide functions to support internationalisation
d) modify the ocaml compiler, to process \uXXXX and \UXXXX escapes
   [everywhere] (see the sketch at the end of this message)
e) provide a fast variable length array type

(d) could be done easily using camlp4, I think.

> Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.

I don't think it is necessary: a variable length array of integers is
good enough.

--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller
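Regarding point (d): a minimal sketch (my own illustration, not part of
the proposal itself) of turning a code point parsed from a \uXXXX or
\UXXXX escape into UTF-8 bytes in an ordinary Buffer.t. The function
name is an assumption, and no range checking is done:

    (* Append the UTF-8 encoding of code point cp to buf. *)
    let add_utf8_char (buf : Buffer.t) (cp : int) : unit =
      let add n = Buffer.add_char buf (Char.chr n) in
      if cp < 0x80 then add cp
      else if cp < 0x800 then begin
        add (0xC0 lor (cp lsr 6));
        add (0x80 lor (cp land 0x3F))
      end
      else if cp < 0x10000 then begin
        add (0xE0 lor (cp lsr 12));
        add (0x80 lor ((cp lsr 6) land 0x3F));
        add (0x80 lor (cp land 0x3F))
      end
      else begin
        add (0xF0 lor (cp lsr 18));
        add (0x80 lor ((cp lsr 12) land 0x3F));
        add (0x80 lor ((cp lsr 6) land 0x3F));
        add (0x80 lor (cp land 0x3F))
      end

    (* e.g. the escape \u00E9 becomes the two bytes 0xC3 0xA9: *)
    let _ = let b = Buffer.create 4 in add_utf8_char b 0xE9; Buffer.contents b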