From: skaller <skaller@m...>
Subject: Re: localization, internationalization and Caml
Xavier Leroy wrote:

> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident.  The first books on Caml were
> written in French, and it was nice to be able to use accented french
> words as identifiers.  Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.

	And supporting ISO-8859-1 was a fine thing to do at the time!
 
> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa.  If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win.  (At least
> until we get OCaml in the Chinese curriculum...)

	While this is true, there is a circularity here: people not
using 8-bit character sets face an extra battle when using OCaml.
 
> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.

	Yes. This is one of the key points of my argument that UTF-8
is the natural way to go: it provides ISO-10646 compliance without
requiring any new string kind.
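
	For illustration only (a minimal sketch, not part of any
existing library or proposal): since an OCaml string is just a
sequence of bytes, UTF-8 text fits in the ordinary string type, and
something like counting code points needs no new string kind.

(* Count code points in a UTF-8 string by skipping continuation
   bytes (those of the form 10xxxxxx).  No validation is done. *)
let utf8_length s =
  let n = String.length s in
  let rec count i acc =
    if i >= n then acc
    else if Char.code s.[i] land 0xC0 = 0x80 then count (i + 1) acc
    else count (i + 1) (acc + 1)
  in
  count 0 0

let () =
  let s = "d\xC3\xA9j\xC3\xA0 vu" in      (* "déjà vu" in UTF-8 *)
  Printf.printf "%d bytes, %d code points\n"
    (String.length s) (utf8_length s)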
 
> There are several ways to internationalize further.  One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff).  There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
>   need to set LC_CTYPE to different values under Linux, Solaris, and
>   Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.

	If you are suggesting not using C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach.  However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode.  Early adopters of that
>   technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
>   chose 32-bit Unicode.   (That's the great things about standards:
>   there are so many to choose from...)

	I cannot see the problem -- except for the 16-bit adopters,
who must eventually upgrade ... again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
>   E.g. Java seems to have its own variant of UTF8.  How are we going
>   to interoperate?

	I do not understand: UTF-8 is a fixed, internationally
standardised encoding. If it is used, the ISO Standard is followed.
If Java doesn't do that, that is Java's problem.

> - I/O is a nightmare.  The API has to handle at least byte streams,
>   wide character streams, and UTF8-encoded streams.

	No, it doesn't. That is one possibility, but it is NOT necessary.
It is necessary only to read byte streams; conversion can be done
later on strings. This is less efficient, but it is a sensible
starting point (ignoring internationalisation in the I/O layer completely).
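
	As a sketch of that starting point (the file name is a
hypothetical example, and really_input_string needs a reasonably
recent OCaml): read the channel in binary as raw bytes, and leave any
character-set interpretation to a later pass over the string.

let () =
  let ic = open_in_bin "input.txt" in           (* hypothetical file *)
  let raw = really_input_string ic (in_channel_length ic) in
  close_in ic;
  (* Any conversion (UTF-8 validation, decoding, ...) happens here,
     on the string, not in the I/O layer. *)
  Printf.printf "read %d bytes\n" (String.length raw)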

> - Support for Unicode / UTF8 files in today's operating systems and GUIs
>   is very low.  When will I be able to do "more" on an UTF8 file and see my
>   French accented letters?

	Yes. I agree. This is a major problem. One of the answers is
"When programming languages provide the support that applications
programmers need" :-)
 
> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.  

	I agree. The way forward is, I believe:

	a) do not change the I/O system, but deprecate TEXT mode
	   (all I/O should be done in binary)

	b) do not change the String module, but deprecate the
	   upper/lower case functions (and anything else that
	   smacks of natural-language handling)

	c) provide functions to support internationalisation.

	d) modify the OCaml compiler to process \uXXXX and \UXXXX
	   escapes [everywhere]

	e) provide a fast variable-length array type

(d) could be done easily using camlp4, I think.
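
	A rough sketch of what the escape expansion in (d) might
produce (utf8_of_code_point is a hypothetical helper, not an existing
function): a \uXXXX escape becomes the UTF-8 byte sequence for that
code point, stored in an ordinary string.

(* Expand a code point (as written in a \uXXXX escape) into its
   UTF-8 bytes.  Only code points up to 0xFFFF are handled here. *)
let utf8_of_code_point cp =
  let buf = Buffer.create 4 in
  if cp < 0x80 then
    Buffer.add_char buf (Char.chr cp)
  else if cp < 0x800 then begin
    Buffer.add_char buf (Char.chr (0xC0 lor (cp lsr 6)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end else begin
    Buffer.add_char buf (Char.chr (0xE0 lor (cp lsr 12)));
    Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end;
  Buffer.contents buf

let () =
  (* \u00E9 ("é") expands to the two bytes 0xC3 0xA9 *)
  String.iter (fun c -> Printf.printf "%02X " (Char.code c))
    (utf8_of_code_point 0xE9);
  print_newline ()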

> Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.

	I don't think it is necessary; a variable-length
array of integers is good enough.
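
	To make that concrete (an illustrative sketch only, not a
proposed interface, and without error handling): a wide string can
simply be an int array of code points, obtained by decoding the UTF-8
bytes of an ordinary string.

(* Decode a UTF-8 string into an int array of code points.
   Only 1- to 3-byte sequences are handled; no validation. *)
let wide_of_utf8 s =
  let n = String.length s in
  let rec loop i acc =
    if i >= n then List.rev acc
    else
      let b0 = Char.code s.[i] in
      if b0 < 0x80 then loop (i + 1) (b0 :: acc)
      else if b0 < 0xE0 then
        let b1 = Char.code s.[i + 1] in
        loop (i + 2) ((((b0 land 0x1F) lsl 6) lor (b1 land 0x3F)) :: acc)
      else
        let b1 = Char.code s.[i + 1] and b2 = Char.code s.[i + 2] in
        loop (i + 3)
          ((((b0 land 0x0F) lsl 12)
              lor ((b1 land 0x3F) lsl 6)
              lor (b2 land 0x3F)) :: acc)
  in
  Array.of_list (loop 0 [])

let () =
  let wide = wide_of_utf8 "d\xC3\xA9j\xC3\xA0" in   (* "déjà" in UTF-8 *)
  Array.iter (fun cp -> Printf.printf "U+%04X " cp) wide;
  print_newline ()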

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller