Request for Ideas: i18n issues
| Date: | -- (:) |
| From: | skaller <skaller@m...> |
| Subject: | Re: Request for Ideas: i18n issues |
John Prevost wrote:

> Back to the charset/encoding module type. Here's what I think might
> want to be in here. I appeal to you for suggestions about things to
> remove or add.
>
> Charsets:
>
> * a type for characters in the charset

I think you mean 'code points'. These are logically distinct from characters (which are kind of amorphous). For example, a single character -- if there is such a thing in the language -- may consist of several code points combined, and there are code points which are not characters.

> * a type for strings in the charset (maybe char array, maybe not)

I think you mean 'script'. Strings of code points can be used to represent script.

> * functions to determine the "class" of a character. This would
> probably involve a standard "character class" type, possibly
> informed by character classes in Unicode.

Yes. For ISO10646 plane 1 (Unicode), the data is readily available for a few key attributes (such as character class, case mappings, corresponding decimal digit, etc.).

> * functions to work with strings in the character set in order to do
> standard manipulations. If we said a string is always a char
> array and that there are standard functions to work on strings
> given the above, this might be something that can be done away
> with.

Probably not. There is a distinction, for example, between concatenating two arrays of code points, and producing a new array of code points corresponding to the concatenated script. This is 'most true' in Arabic, but it is also true in plain old English: concatenating the scripts "It is a" and "nice day" requires a space to be inserted between the code point arrays to obtain the correctly concatenated script "It is a nice day".

> * functions to convert characters and strings to a reference format,
> perhaps UCS-4. UCS-4 isn't perfect, but it does have a great deal
> of coverage, and without some common format, converting from one
> character set to another is problematic.

I agree.
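To make the code point / character distinction above concrete, here is a minimal sketch. It uses Python and its standard `unicodedata` module purely for illustration (neither appears in this thread), and `describe` is a hypothetical helper name. It shows one perceived character built from two code points, and looks up the kinds of per-code-point attributes mentioned above (class, case mapping, decimal digit value).

```python
import unicodedata

# One user-perceived character made of two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# Normalisation form C folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)


def describe(code_point):
    """Return a few attributes of one code point (hypothetical helper)."""
    ch = chr(code_point)
    return {
        # General category, e.g. "Ll" letter, "Nd" digit, "Mn" combining mark,
        # "Cn" unassigned (a code point that is not a character).
        "category": unicodedata.category(ch),
        # Simple upper-case mapping as applied by Python's str.upper().
        "upper": ch.upper(),
        # Decimal digit value, or None when the code point is not a digit.
        "decimal": unicodedata.decimal(ch, None),
    }
```

For instance, `describe(0x0663)` (ARABIC-INDIC DIGIT THREE) reports a decimal value of 3, while `describe(0xFFFF)` reports category `Cn`, one of the code points that are not characters.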
I think there are two choices here, UCS-4 and UTF-8 encodings of ISO-10646. UCS-4 is better for internal operations, UTF-8 for IO and streaming operations (perhaps including regular expression matching, where tables of 256 cases are more acceptable than 2^31 :-)

> Encodings:
>
> These are tied to charsets much of the time, but not always.
>
> * functions to encode and decode strings in streams and buffers.

Yes.

> Locales:
>
> * functions to do case mapping, collation, etc.

No. It is generally accepted that 'locale' information is limited to culturally sensitive variations like whether full stop or comma is used for a decimal point, and whether the date is written dd/mm/yy or yy/mm/dd or mm/dd/yy. Collation, case mapping, etc. are not locale data, but specific to a particular script.

The tendency in i18n developments has been, I think, to divorce character sets, encodings, collation, and script issues from the locale: the locale may indicate the local language, but that is independent (more or less) of script processing.

> (I think this is really why Java went to "the one true charset is
> Unicode". Not just because of politics, but because interacting with
> mutually incompatible character sets can be a type-safety nightmare.)

Yes. I am somewhat surprised to see an attempt to create a more abstract interface to multiple character sets/encodings. This area tends, I think, to be complex, full of ad hoc rules, and so quirky as to defy genuine abstraction.

Fixing a single standard (ISO10646) is a simpler alternative; even simpler if there is a single reference encoding such as UCS-4 or UTF-8. In that case, the functions that do the work can be specialised to a well researched International Standard. It is still necessary to provide functions that encode/decode the standard format to other formats (encodings/character sets), but no functions need be provided to do things like collation or case mapping for these other formats.
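The "single reference format plus encode/decode functions" arrangement described above can be sketched as follows. This is an illustration only, written in Python rather than anything proposed in the thread. The function names are hypothetical, Python's `str` stands in for the reference sequence of code points, and the `utf-32-be` codec is assumed as a stand-in for UCS-4 (it matches UCS-4 only up to U+10FFFF).

```python
def to_ucs4(text):
    """Fixed-width form: 4 bytes per code point (big-endian UTF-32)."""
    return text.encode("utf-32-be")


def to_utf8(text):
    """Variable-width form: 1 to 4 bytes per code point, suited to IO."""
    return text.encode("utf-8")


def transcode(data, source, target):
    """Convert between two encodings by passing through the reference form.

    Only decoders and encoders to and from the reference form are needed;
    no pairwise converters between the other formats.
    """
    return data.decode(source).encode(target)


sample = "na\u00efve \u65e5\u672c"   # 8 code points
ucs4 = to_ucs4(sample)               # 8 * 4 = 32 bytes
utf8 = to_utf8(sample)               # 13 bytes: ASCII 1, U+00EF 2, CJK 3 each
latin1_to_utf8 = transcode(b"na\xefve", "latin-1", "utf-8")
```

The fixed-width form makes indexing by code point trivial, which is the usual argument for using it internally, while the byte-oriented form keeps ASCII data unchanged and compact on the wire.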
> One or more of the above might be functors, so that you can compose a
> character set, encoding, and locale to get what you want. This, of
> course, gets into questions of whether certain character sets,
> locales, and encodings are interoperable, and how one might cause a
> type error when trying to combine an encoding and a character set that
> don't work together. Dunno if this is possible.

I think it is, but it isn't desirable. For example, it is possible to use UCS-4 or UTF-8 encodings of ANY character set, since all have integral code points to represent them: UCS-4 is universal for all sets of less than 2^32 points, UTF-8 for sets of less than 2^31 points.

> How to recover from
> failure is of course a good question to try to answer as well.

There are, in fact, multiple answers; this is one of the complicating factors.

--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller