Request for Ideas: i18n issues
From: skaller <skaller@m...>
Subject: Re: Request for Ideas: i18n issues
John Prevost wrote:

> Back to the charset/encoding module type.  Here's what I think might
> want to be in here.  I appeal to you for suggestions about things to
> remove or add.
> 
>   Charsets:
> 
>   * a type for characters in the charset

	I think you mean 'code points'. These are
logically distinct from characters (which are kind of
amorphous). For example, a single character -- if there
is such a thing in the language -- may consist of several
code points combined, and there are code points which are
not characters.
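
	A minimal sketch of the distinction, assuming code points are
represented as plain OCaml ints (the names `code_point`,
`e_acute_precomposed` and `e_acute_decomposed` are made up for
illustration):

```ocaml
(* A code point is just an integer; one "character" may need several. *)
type code_point = int

(* "e with acute accent" as a single precomposed code point: U+00E9 *)
let e_acute_precomposed : code_point list = [0x00E9]

(* The same character as base letter plus combining mark: U+0065 U+0301 *)
let e_acute_decomposed : code_point list = [0x0065; 0x0301]

(* U+0301 COMBINING ACUTE ACCENT on its own is a code point,
   but not something a reader would call a character. *)
let () =
  Printf.printf "%d code point(s) vs %d code point(s)\n"
    (List.length e_acute_precomposed)
    (List.length e_acute_decomposed)
```

Both lists denote the same character to a reader, yet they differ in
length and compare unequal as lists of code points.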
 
>   * a type for strings in the charset (maybe char array, maybe not)

	I think you mean 'script'. Strings of code points
can be used to represent script.
 
>   * functions to determine the "class" of a character.  This would
>     probably involve a standard "character class" type, possibly
>     informed by character classes in Unicode.

	Yes. For the Basic Multilingual Plane of ISO 10646 (Unicode),
the data is readily available for a few key attributes (such as
character class, case mappings, corresponding decimal digit, etc.).
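
	As a rough illustration of one such attribute, here is a sketch
of a decimal digit lookup. It covers only two digit ranges, hard-coded
here; a real implementation would presumably be generated from the
published Unicode data tables. The function name
`decimal_digit_value` is hypothetical.

```ocaml
(* Decimal digit value of a code point, for two digit ranges only. *)
let decimal_digit_value (cp : int) : int option =
  if cp >= 0x0030 && cp <= 0x0039 then
    Some (cp - 0x0030)            (* ASCII digits 0-9 *)
  else if cp >= 0x0660 && cp <= 0x0669 then
    Some (cp - 0x0660)            (* Arabic-Indic digits *)
  else
    None                          (* not a digit known to this sketch *)
```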
 
>   * functions to work with strings in the character set in order to do
>     standard manipulations.  If we said a string is always a char
>     array and that there are standard functions to work on strings
>     given the above, this might be something that can be done away
>     with.

	Probably not. There is a distinction, for example, between
concatenating two arrays of code points, and producing a new
array of code points corresponding to the concatenated script.
This is 'most true' in Arabic, but it is also true in
plain old English: the scripts

	"It is a"

and 

	"nice day"

require a space to be inserted between the code point arrays
to obtain the correctly concatenated script

	"It is a nice day"
 
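	The difference can be sketched as follows, assuming code points
are ints held in arrays. The function `concat_english` is a
hypothetical, deliberately naive rule (insert one U+0020 between two
non-empty phrases); `of_ascii` and `to_ascii` are helpers for the
example only.

```ocaml
(* Helpers for the example: ASCII strings <-> arrays of code points. *)
let of_ascii (s : string) : int array =
  Array.init (String.length s) (fun i -> Char.code s.[i])

let to_ascii (a : int array) : string =
  String.init (Array.length a) (fun i -> Char.chr a.(i))

(* Plain concatenation of two arrays of code points. *)
let concat_code_points (a : int array) (b : int array) : int array =
  Array.append a b

(* Concatenation of English script: a space is needed at the join. *)
let concat_english (a : int array) (b : int array) : int array =
  if Array.length a = 0 then b
  else if Array.length b = 0 then a
  else Array.concat [a; [| 0x20 |]; b]

let () =
  let x = of_ascii "It is a" and y = of_ascii "nice day" in
  print_endline (to_ascii (concat_code_points x y));
  print_endline (to_ascii (concat_english x y))
```

The first line printed is "It is anice day", the second
"It is a nice day".
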
>   * functions to convert characters and strings to a reference format,
>     perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
>     of coverage, and without some common format, converting from one
>     character set to another is problematic.

	I agree. I think there are two choices here, UCS-4 and UTF-8
encodings of ISO-10646. UCS-4 is better for internal operations,
UTF-8 for IO and streaming operations (perhaps including
regular expression matching where tables of 256 cases are more
acceptable than 2^31 :-)
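
	For illustration, a sketch of converting one UCS-4 code point to
UTF-8 bytes. It follows the original 1 to 6 byte UTF-8 scheme, which
covers 2^31 code points; later standards restrict UTF-8 to four bytes
and U+10FFFF. It assumes a 64-bit OCaml, where an int can hold
0x7FFFFFFF, and the name `utf8_encode` is made up here.

```ocaml
(* Append the UTF-8 encoding of one code point (0 .. 0x7FFFFFFF) to buf. *)
let utf8_encode (buf : Buffer.t) (cp : int) : unit =
  if cp < 0 || cp > 0x7FFFFFFF then
    invalid_arg "utf8_encode: code point out of range"
  else if cp < 0x80 then
    Buffer.add_char buf (Char.chr cp)       (* one byte, same as ASCII *)
  else begin
    (* Number of continuation bytes needed after the lead byte. *)
    let n =
      if cp < 0x800 then 1
      else if cp < 0x10000 then 2
      else if cp < 0x200000 then 3
      else if cp < 0x4000000 then 4
      else 5
    in
    (* Lead byte: n+1 high bits set, a zero bit, then the top bits of cp. *)
    let lead_mark = (0xFF lsl (7 - n)) land 0xFF in
    Buffer.add_char buf (Char.chr (lead_mark lor (cp lsr (6 * n))));
    (* Continuation bytes: 10xxxxxx, six payload bits each, high to low. *)
    for i = n - 1 downto 0 do
      Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr (6 * i)) land 0x3F)))
    done
  end
```

Every byte produced is in the range 0..255, which is what keeps
byte-indexed tables (for example in a regular expression matcher)
down to 256 cases.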
 
>   Encodings:
> 
>     These are tied to charsets much of the time, but not always.
> 
>   * functions to encode and decode strings in streams and buffers.

	Yes.
 
>   Locales:
> 
>   * functions to do case mapping, collation, etc.

	No. It is generally accepted that 'locale' information
is limited to culturally sensitive variations like whether
full stop or comma is used for a decimal point, and whether
the date is written dd/mm/yy or yy/mm/dd or mm/dd/yy.

	Collation, case mapping, etc. are not locale
data, but specific to a particular script.

	The tendency in i18n developments has been, I think,
to divorce character sets, encodings, collation, and script
issues from the locale: the locale may indicate the local
language, but that is independent (more or less) of
script processing.
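
	A sketch of what a locale limited in this way might carry,
assuming just the two conventions mentioned above. The type and
function names (`locale`, `format_decimal`, `format_date`) are
hypothetical, and `format_decimal` handles only a two-digit
fractional part to keep the example short.

```ocaml
type date_order = DMY | YMD | MDY

(* Only culturally sensitive conventions; no collation, no case mapping. *)
type locale = {
  decimal_point : char;       (* '.' or ',' *)
  date_order : date_order;
}

(* Format whole.frac with the locale's decimal point (frac in 0..99). *)
let format_decimal (loc : locale) (whole : int) (frac : int) : string =
  Printf.sprintf "%d%c%02d" whole loc.decimal_point frac

(* Format a date with two-digit fields in the locale's order. *)
let format_date (loc : locale) ~day ~month ~year : string =
  let yy = year mod 100 in
  match loc.date_order with
  | DMY -> Printf.sprintf "%02d/%02d/%02d" day month yy
  | YMD -> Printf.sprintf "%02d/%02d/%02d" yy month day
  | MDY -> Printf.sprintf "%02d/%02d/%02d" month day yy
```

Nothing in this record refers to a character set or an encoding,
which is the separation argued for above.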
 
> (I think this is really why Java went to "the one true charset is
> Unicode".  Not just because of politics, but because interacting with
> mutually incompatible character sets can be a type-safety nightmare.)

	Yes. I am somewhat surprised to see an attempt to create a more
abstract interface to multiple character sets/encodings. This area
tends, I think, to be complex, full of ad hoc rules, and so quirky
as to defy genuine abstraction.

	Fixing a single standard (ISO10646) is a simpler
alternative; even simpler if there is a single reference encoding
such as UCS-4 or UTF-8. In that case, the functions that do the
work can be specialised to a well-researched International Standard.
 
	It is still necessary to provide functions that
encode/decode the standard format to other formats
(encodings/character sets), but no functions need be provided
to do things like collation or case mapping for these other formats.
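
	As a sketch of such a pair of conversion functions, take
ISO 8859-1 (Latin-1), whose byte values coincide with the first 256
ISO 10646 code points. The names `latin1_to_code_points` and
`code_points_to_latin1` are made up, and returning `None` on failure
is only one of several possible recovery policies.

```ocaml
(* Decode Latin-1 bytes: each byte value is already the code point. *)
let latin1_to_code_points (s : string) : int array =
  Array.init (String.length s) (fun i -> Char.code s.[i])

(* Encode to Latin-1; fails if any code point lies outside 0..0xFF. *)
let code_points_to_latin1 (cps : int array) : string option =
  if Array.for_all (fun cp -> cp >= 0 && cp <= 0xFF) cps then
    Some (String.init (Array.length cps) (fun i -> Char.chr cps.(i)))
  else
    None
```

Collation or case mapping would then operate only on the `int array`
form, never on the Latin-1 bytes.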

> One or more of the above might be functors, so that you can compose a
> character set, encoding, and locale to get what you want.  This, of
> course, gets into questions of whether certain character sets,
> locales, and encodings are interoperable, and how one might cause a
> type error when trying to combine an encoding and a character set that
> don't work together.  Dunno if this is possible. 

	I think it is, but it isn't desirable. For example,
it is possible to use UCS-4 or UTF-8 encodings of ANY
character set, since all have integral code points to represent
them: UCS-4 is universal for all sets of fewer than 2^32 points,
UTF-8 for sets of fewer than 2^31 points.
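
	To illustrate why such a combination always type checks, here
is a sketch of a functor that gives a big-endian UCS-4 encoder for any
character set exposing integral code points. The signature `CHARSET`,
the functor `Ucs4_be` and the module `Ascii` are all hypothetical
names, not an existing library.

```ocaml
module type CHARSET = sig
  type t                     (* a code point of this character set *)
  val to_int : t -> int      (* its integral value, assumed 0 .. 2^32-1 *)
end

(* UCS-4, big-endian: four bytes per code point, whatever the set. *)
module Ucs4_be (C : CHARSET) = struct
  let encode (cps : C.t list) : string =
    let buf = Buffer.create 16 in
    List.iter
      (fun cp ->
        let n = C.to_int cp in
        for shift = 3 downto 0 do
          Buffer.add_char buf (Char.chr ((n lsr (8 * shift)) land 0xFF))
        done)
      cps;
    Buffer.contents buf
end

(* Example instance: ASCII, with OCaml's char as the code point type. *)
module Ascii = struct
  type t = char
  let to_int = Char.code
end

module Ascii_ucs4 = Ucs4_be (Ascii)
```

Because the functor asks only for `to_int`, no pairing of character
set and encoding is rejected here; catching a mismatch would need
extra structure in the signatures.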

> How to recover from
> failure is of course a good question to try to answer as well.

	There are, in fact, multiple answers; this is one
of the complicating factors.
 
-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller