Date: Thu, 21 Oct 1999 14:42:48 +1000
From: skaller <skaller@maxtal.com.au>
To: Gerd.Stolpmann@darmstadt.netsurf.de
Subject: Re: localization, internationalization and Caml
Gerd Stolpmann wrote:
> On Tue, 19 Oct 1999, John Skaller wrote:
> >
> > I don't agree. If you read ISO10646 carefully, you will find
> >that you must STILL parse sequences of code points
> Let's begin with languages we know. As far as I know, ISO10646 allows an
> implementation not to support the combining characters.
From memory, you are correct: there are three specified levels of
compliance.
Level 1 compliance does not require processing combining characters.
> I think a programming language should only provide the basic means by
> which you can operate on characters, but should not solve the problem
> completely.
Yes, I agree, at least at this time.
> > What doesn't work efficiently is indexing.
> >And it is never necessary to do it for human script.
> >Why would you ever want to, say, replace the 10'th character
> >of a string??
>
> Because I have an algorithm operating on the characters of a string.
If the string represents human script, then the algorithm is wrong,
because it makes incorrect assumptions about the nature of human script.
You will need to rewrite it if you want it to work in an international
setting.
> Such
> algorithms use indexes as pointers to parts of a string, and in most cases the
> indexes are only incremented or decremented. On a UTF-8 string, you could
> define an index as
>
> type index = { index_position : int; byte_position : int }
>
> and define the operations "increment", "decrement" (only working together with
> the string), "add", "subtract", "compare" (to calculate string lengths). Such
> indexes have strange properties; they can only be interpreted together with the
> string to which they refer.
>
> You cannot really avoid such an index type; you can only avoid giving the
> thing a name, and end up programming the index operations anew every time.
I agree. But my point is: you should change your code _anyhow_,
to use the new and correct parsing method, because it is necessary
for Level 2 and Level 3 compliance. Your code will then work
correctly at those levels when the 'increment' function is upgraded.
What you will find is something which, perhaps by chance, is
natural in Python: there is no such thing as a character. A string
is NOT an array of characters. Strings can be composed from strings,
and decomposed into arrays of strings, but there is not really any
character type.
> Perhaps your suggestion works; but string manipulation will then be much
> slower. For example, an "increment" must be implemented by finding the next
> beginning of a character (instead of just incrementing a numeric index).
Yes, but this is a fact: it is actually required for correct
processing of human script. You cannot 'magic' away the facts.
What you can do, if you are programming with a known subset,
such as the characters of a stock code, is use indexing anyhow,
perhaps with the ASCII subset. That is, you can use the byte
strings as character strings.
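To make the cost concrete, here is a minimal sketch (the names are mine,
not taken from any existing library) of the index record you describe,
with an 'increment' that scans forward for the start of the next
character in a UTF-8 byte string:

  (* A sketch only: an index into a UTF-8 encoded ocaml string.
     At Level 2/3 compliance, 'increment' would also have to skip
     combining characters, not just continuation bytes. *)
  type index = { index_position : int; byte_position : int }

  (* Bytes of the form 0b10xxxxxx are UTF-8 continuation bytes;
     they never begin a character. *)
  let is_continuation c = Char.code c land 0xC0 = 0x80

  let increment s i =
    let n = String.length s in
    let rec next b =
      if b >= n || not (is_continuation s.[b]) then b
      else next (b + 1)
    in
    { index_position = i.index_position + 1;
      byte_position = next (i.byte_position + 1) }

Code which only increments, decrements and compares such indexes should
port to the higher levels with little more than an upgrade of 'increment'.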
> There will always be a difference between natural languages and sophisticated
> packages.
Yes. However, there is an important point here. Natural languages
are quirky and their behaviour is variable: each human uses language
differently in each sentence, varying with region, context, etc.
Obviously, computer systems only use some abstracted representation.
While there are many levels and ways of abstracting this, there is
one that is worthy of special interest here: the ISO10646 Standard.
So I guess my suggestion is that in the _standard_ language libraries
we will eventually need to implement the algorithms required for
compliance
with that Standard. In my opinion, that naturally breaks into two parts:
a) (byte/word) string management: this is an issue of storage
allocation and manipulation, not natural language processing
b) basic natural language processing
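As a rough sketch of that split (the signatures and names here are
invented for this post and do not correspond to any existing library):

  (* (a) storage: knows about bytes/words, not about scripts *)
  module type STRING_STORAGE = sig
    type t
    val length : t -> int            (* length in storage units *)
    val sub : t -> int -> int -> t
    val concat : t -> t -> t
  end

  (* (b) basic natural language processing, built on top of (a)
     and driven by the ISO10646 data tables *)
  module type SCRIPT = sig
    type text
    val next_char : text -> int -> int   (* parse one code point sequence *)
    val to_upper : text -> text          (* table-driven case mapping *)
  end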
> Even the current String.uppercase is wrong (in Latin1 there is a
> lower case character without a corresponding capital character (\223), but
> WORDS containing this character can be capitalized by applying a semantic
> rule).
>
> I would suppose that String.upper/lowercase are part of the library because
> the compiler itself needs them. Currently, ocaml depends on languages that
> make a distinction between character cases.
AH! you are right!
> In my opinion such case functions can only approximate the semantical meaning,
> and a simple approximation is better than no approximation.
No. That is, I agree entirely, but I would make a different point:
an arbitrary simple approximation is worthless; the one that is useful
is the ISO Standardised one.
> My idea is that the type "encoding" enumerates all supported combinations; I
> expect only a few.
Please no. Leave the type open to external augmentation.
Just consider: my Interscript literate programming tool ALREADY supports
something like 30 "encodings" -- all those present on the unicode.org
website. A closed 'type' is already a joke: I already support far more
encodings than the few you expect.
> What kind of problem do you want to solve with an open ended set of
> conversions? Isn't this the task of a specialized program?
No. It allows a generalised ISO10646 compliant program
to read, and perhaps write, any file encoded in any supported encoding,
but manipulate it internally in one format. If an encoding
is missing, it is easy to add a new pair of conversion functions
without breaking the standard library.
That is, it is the task of specialised _functions_.
It makes sense to provide some as standard, like the ones your
type suggests -- but not to represent the cases with a type.
Ocaml variants are not amenable to extension. Function parameters
are.
That is, I think there are exactly two cases:
a) no conversion required
b) user supplied conversion function
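As a sketch (the names are invented here, not from any library), the
conversion can simply be a function value, so a new encoding is a new
function rather than a new constructor in the standard library:

  (* The two cases.  Any decoder of type (string -> int array),
     producing UCS-4 code points, can be supplied by the user. *)
  type conversion =
    | No_conversion
    | Convert of (string -> int array)

  let decode conv s =
    match conv with
    | No_conversion ->
        Array.init (String.length s) (fun i -> Char.code s.[i])
    | Convert f -> f s

Supporting another encoding is then just a matter of passing Convert
a decoder written outside the library.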
> I think collation should be left out by a basic library.
Probably right. Level 1 compliance is a good start,
and does not require collation.
> The most correct interface is not always the best.
What do you mean 'most correct'?
Either the interface supports the (ISO10646) required behaviour or not.
> There will be a significant slow-down of all ocaml programs if the strings are
> encoded as UTF-8.
No. On the contrary, most existing programs will be unaffected.
Those which actually care about internationalisation can only be made
faster (by providing native support).
>I think the user of a language should be able to choose what
> is more important: time or space or reduced functionality. UTF-8 saves space,
> and costs time; UCS-4 wastes space, but saves time; UCS-2 is a compromise
> and bad because it is a compromise; Latin 1 (or another 8 bit cs) saves
> time and space but has less functionality.
Sure, but this leads to multiple interfaces. Was that not
the original problem?
Let me put the argument for UTF-8 differently.
Processing UTF-8 'as is' is non-trivial and should be done
in low-level system functions for speed.
Processing arrays of 31-bit integers is _already_ well
supported in ocaml, and will be better supported by adding
variable-length arrays with functions that are designed with
some view of use for string processing.
So we don't actually need a wide character string type
or supporting functions, precisely because in the simplest
cases a standard data type not really specialised to script
processing will do the job.
What is actually required (in both cases) is a set of
'data tables' to support things like case mapping. For example,
a function
convert_to_upper i
which takes an ocaml integer argument would be useful,
and it is easy enough to 'map' this over an array.
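As a sketch (the case rule below is a fake covering only the ASCII
range; the real one would be generated from the ISO10646/Unicode data
files):

  (* Toy case mapping over code points: ASCII letters only.
     A real convert_to_upper would consult the standard case tables. *)
  let convert_to_upper i =
    if i >= Char.code 'a' && i <= Char.code 'z' then i - 32 else i

  (* 'mapping' it over an array of code points is then one line *)
  let to_upper_codes (text : int array) = Array.map convert_to_upper text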
Sigh. See next post. I will post my code, so it can
be torn up by experts.
-- John Skaller, mailto:skaller@maxtal.com.au
   1/10 Toxteth Rd Glebe NSW 2037 Australia
   homepage: http://www.maxtal.com.au/~skaller
   downloads: http://www.triode.net.au/~skaller