Re: Request for Ideas: i18n issues

From: skaller (skaller@maxtal.com.au)
Date: Tue Oct 26 1999 - 22:16:38 MET DST


Date: Wed, 27 Oct 1999 06:16:38 +1000
From: skaller <skaller@maxtal.com.au>
To: John Prevost <prevost@maya.com>
Subject: Re: Request for Ideas: i18n issues

John Prevost wrote:
>
> skaller <skaller@maxtal.com.au> writes:
>
> > > * a type for characters in the charset
> >
> > I think you mean 'code points'. These are logically distinct
> > from characters (which are kind of amorphous). For example, a single
> > character -- if there is such a thing in the language -- may
> > consists of several code points combined, and there are code points
> > which are not characters.
>
> I'm not sure the distinction matters much at this level. I am
> predisposed to prefer the name "char" to the name "codepoint" just
> because more people will understandit.

        I myself tend to use these (incorrect) words as well,
through familiarity. However, in the longer run, vague,
and even incorrect, terminology will make it even harder
to communicate.

        My experience with this is from the C++ Standardisation
committee (rather than I18n forums). In C++, there are a number
of important and distinct concepts without proper terminology,
and it makes it almost impossible to have a technical discussion
about the C++ language. Over 90% of all issues and debates
centred around terminology and interpretation, rather than
functionality.

        But let me give another clarifying example.
In Unicode, there are some code points which are
reserved, and do not represent characters, namely
the High and Low surrogates. These are used
to allow two byte 16 bit UTF-16 encoding of the first 16
planes of ISO-10646.

        So when you see two of these together,
they are not two 'characters', but in fact an
encoding of a SINGLE code point, which itself
may or may not be a character.
        
> In the back of my head, I also
> generally interpret "codepoint" as meaning the number associated with
> the character,

        Yes, this is more or less correct.
And, basic array /string manipulations must first work with
code points, rather than the abstract 'characters'.

        In fact, the 'meaning' of the code points
as characters is to be found in the functions which
manipulate the code points (sort of like an ocaml module).
More correctly, sequences of code points can be
understood as 'script', and manipulated to represent
script manipulations.

        Thus: it is possible to do a lexicographical
string sort by code points, which is not the same
as a native language sort on script. The former
is a convenience for representing strings in
_some_ total order, for example, for use in an
ocaml 'Set'.

        The latter may not even be well defined,
or have multiple definitions, depending on the
script or use -- for example, the usual code point
order sort of ASCII character is utterly wrong
for a book index.

>rather than the abstract "character" itself. I believe
> Unicode makes this distinction between a "glyph", a "character", and a
> "codepoint".

        Yes: a glyph is the shape, a font's character is particular
rendering of it, a character is an abstraction (is 'A' the same
character as 'a'?? The shape is different, the concept is the same).
A code point is just a number in a set of numbers used to encode
script.

        The difference is subtle, and I do not claim to fully
understand it, but the ISO and Unicode committees needing to
create standards emphasise the distinctions (to the point
of the liason on the C++ committee complaining about the
incorrect use of the word 'character' in the C++ language draft).
 
> > > * a type for strings in the charset (maybe char array, maybe not)
> >
> > I think you mean 'script'. Strings of code points can be used
> > to represent script.
>
> Uh. Okay, whatever. Again, I will tend to use the more traditional
> word "string" to mean a sequence of characters. (On reflection, I
> think you're trying to say something deeper, but your explanation is
> so vague I won't try to interpret what it is. I leave that to you.)

        Sorry: I do not understand it entirely myself.
But consider: the code point for a space, has no character.
There is no such thing as a 'space' character. There is no glyph.
There _is_ such a thing as white space in (latin) script, however.
 
> I don't believe that this is a job for the basic string manipulation
> stuff to do. There do need to be methods for manipulating strings as
> sequences. As such, I'm not going to worry about it at this level.

        I agree. But, then, you are only dealing with some kind
of array of code points. Ocaml already has an 'array' type
which can be used for this purpose, and if there were a
string type that were variable length, we would be somewhat
better off.

        But manipulating these data structures is entirely
structural, and only relates to 'script' in the sense
that the data structures are convenient for working with
script representations. That is, actual script related functions
such as

        (1) trimming whitespace
        (2) capitalisation
        (3) collation
        (4) charset mapping and encoding/decoding

are, IMHO, quite separate.

 
> The big difficulty here is that not everybody wants to eat Unicode.

        I know this is true. However, the many issues have been
addressed by representatives of National Governments and industry,
and a consensus has been reached, and is embodied in an
International Standard.

>I think it's appropriate, but not everyone does. And there are still
> characters in iso-2022, for example, which have no Unicode code point.

        These issues are being addressed by members of the
ISO technical committee. Any complaints should be addressed to them.
Programmers should probably implement what is standardised,
and perhaps complain, or provide input to the process, rather than
going off on their own private, probably doomed track.

        [This is not the same issue, IMHO, as programming languages:
there is just ONE commonly accepted universal standard for script,
namely ISO-10646]

> I think that as I look at the problem more, though, I'm inclined to
> say "one definitive set of characters" is a better idea. Especially
> since that set is needed for reasonable interoperation between the
> others.

        This is my feeling, partly for another reason: anything
else is just too complicated. ISO 10646 is hard enough as it is.
 
> Something I've noted looking at O'Caml these last few days: the
> "string" type is really more an efficient byte array type. And the
> char type is really a byte type.

        Yes. And it is needed as such, even though perhaps
poorly named. Something which represents byte strings
is essentially distinct from something representing
'script'/'text'. IMHO.

> Anyway, I think this partitions things like so:
>
> 1) Basic char and string types
> 2) Locale type which exists solely for locale identity
> 3) Collator type which allows various sorting of strings
> 4) Formatter and parser types for different data values
> 4a) A sort of formatter which allows reordering of arguments in the output
> string is needed (not too hard).
> 5) Reader and writer types for encoding characters into various encodings

        Seems reasonable. I personally would address (1) and (5)
first. Also, I think 'regexp' et al fits in somewhere, and needs
to be addressed.
 
> My current thought is that the char and string types should be defined
> in terms of Unicode.

        Please NO. ISO10646. 'Unicode' is no longer supported by Unicode Corp,
which is fully behind ISO10646. "unicode" is for implementations
that will be out of date before anyone get the tools to actually
use internationalised software (like text editors, fonts, etc).

        Java has gone this way, and will pay the price in the long
run. Python is going that way too. But Unicode is ALREADY out of date.
If work is to be done -- and there is a LOT of work in this area --
then it should conform to the long sighted International Standard,
which has plenty of space for extension, and is supported by
an international consensus of National bodies and technical
experts.

> The char type should be abstract,

        Why? ISO 10646 code points are just integers 0 .. 2^31-1,
and it is necessary to have some way of representing them
as _literals_. In particular, you cannot 'match' easily
on an abstract type, but matching is going to be a very
common operation:

        match wchar with
        | WhatDoIPutHere -> ...

One solution is to have a native wchar type in ocaml,
with literals recognized by the lexer .. but it seems to
me that integers will do the job, even if they
can be abused, by, for example, multiplication
(which, off hand, doesn't seem appropriate).

[That is, code points are a Z module: they're offsets]

 
> You should be able to get ahold of collators, formatters, message
> catalogs, default encodings, and the like by having a locale. You
> should of course be able to ignore them, too.

        I would leave most of this out, at least for the moment:
it is enough work to just handle encodings and mappings.
These things are important in the transition from other
character sets including Latin-1, ShiftJis -- etc.

        Without these mapping/encoding tools, it isn't
really possible to actually work with ISO10646, because
there are not many tools that can do so: most people
use either 8 bit editors, or specialised editors for
their DBCS encoding (like ShiftJis).
 
> I'll try to get some type signatures for possibilities up in the next
> day or so.

        Great!

        BTW: I have some Python code for doing a lot
of mappings/encodings (all the ones available on
the Unicode.org website). This may be useful,
I can generate tables of any kind as required.
(Please send me private email).

        Anyone interested can examine my
web page:

        http://www.triode.net.au/~skaller/unicode/index.html

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:27 MET