Re: localization, internationalization and Caml

From: Gerd Stolpmann <Gerd.Stolpmann@darmstadt.netsurf.de>
To: skaller <skaller@maxtal.com.au>
Subject: Re: localization, internationalization and Caml
Date: Wed, 20 Oct 1999 23:05:41 +0200
Message-Id: <99102100543400.15513@ice>

On Tue, 19 Oct 1999, John Skaller wrote:
>Gerd Stolpmann wrote:
>> UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit
>> character is represented by one to six bytes; the higher the character code the
>> more bytes are needed. This encoding is mainly interesting for I/O, and not for
>> internal processing, because it is impossible to access the characters of a
>> string by their position. Internally, you must represent the characters as 16-
>> or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is
>> wasted (this is the price for the enhanced possibilities).
>
>	I don't agree. If you read ISO10646 carefully, you will find
>that you must STILL parse sequences of code points to obtain the
>equivalent of a 'character', if, indeed, such a concept exists, and
>furthermore, the sequences are not unique. For example, many diacritic
>marks such as accents may be appended to a code point, and act in
>conjunction with the preceding code point to represent a character.
>This is permitted EVEN if there is a single code point for that
>character; and worse, if there are TWO such marks, the order is not
>fixed.
>
> And that's just simple European languages. Now try Arabic or Thai :-)

Let's begin with languages we know. As far as I know, ISO10646 allows
implementations to omit combining characters (that is what Level 1
compliance means). I think a programming language should only provide
the basic means for operating on characters; it need not solve the
whole problem.

>> This means that we need at least three types of strings: Latin1 strings for
>> compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings
>> for I/O. For simplicity, I suggest representing both Latin1 and UTF-8
>> strings by the same language type "string", and providing "wchar" and
>> "wstring" for the extended character set.
>
> I'd like to suggest we forget the 'wchar' string, at least initially.
>I think you will find UTF-8 encoding requires very few changes. For
>example,
>genuine regular expressions work out of the box. String searching
>works out of the box.
>
> What doesn't work efficiently is indexing.
>And it is never necessary to do it for human script.
>Why would you ever want to, say, replace the 10'th character
>of a string?? [You could: if you were analysing, say, a stock code,
>but in that case the n'th byte would do: it isn't natural language
>script]

Because I have an algorithm operating on the characters of a string. Such
algorithms use indexes as pointers to parts of a string, and in most cases the
indexes are only incremented or decremented. On a UTF-8 string, you could
define an index as

type index = { index_position : int; byte_position : int }

and define the operations "increment", "decrement" (which only work
together with the string), "add", "subtract", "compare" (to calculate
string lengths). Such indexes have strange properties; they can only be
interpreted together with the string to which they refer.

You cannot really avoid such an index type; you can only avoid giving
the thing a name, and then you program the index operations anew every
time.

Perhaps your suggestion works, but string manipulation will then be much
slower. For example, an "increment" must find the beginning of the next
character instead of just incrementing a numeric index.
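
For concreteness, here is a minimal sketch of such an "increment" in
OCaml, assuming well-formed UTF-8 (the helper utf8_char_length is my
own name for it):

type index = { index_position : int; byte_position : int }

(* Number of bytes in the UTF-8 sequence that starts with this byte.
   The 5- and 6-byte forms are the ones covering 31-bit characters. *)
let utf8_char_length c =
  let b = Char.code c in
  if b < 0x80 then 1                                   (* 0xxxxxxx *)
  else if b < 0xC0 then invalid_arg "not a character boundary"
  else if b < 0xE0 then 2                              (* 110xxxxx *)
  else if b < 0xF0 then 3                              (* 1110xxxx *)
  else if b < 0xF8 then 4                              (* 11110xxx *)
  else if b < 0xFC then 5                              (* 111110xx *)
  else 6                                               (* 1111110x *)

(* "increment" must inspect the string, whereas a numeric index is
   simply incremented without any memory access. *)
let increment s i =
  let len = utf8_char_length s.[i.byte_position] in
  { index_position = i.index_position + 1;
    byte_position = i.byte_position + len }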

>> Of course, the current "string" data type is mainly an implementation of byte
>> sequences, which is independent of the underlying interpretation. Only the
>> following functions seem to introduce a character set:
>>
>> - The String.uppercase and .lowercase functions
>
>It is best to get rid of these functions. They belong in a more
>sophisticated natural language processing package.

There will always be a difference between natural languages and
sophisticated packages. Even the current String.uppercase is wrong: in
Latin1 there is a lower-case character without a corresponding capital
(\223, the German sharp s), but WORDS containing this character can
still be capitalized by applying a semantic rule (it becomes "SS").

I would suppose that String.uppercase/lowercase are part of the library
because the compiler itself needs them. Currently, ocaml depends on
writing systems that distinguish character cases.

In my opinion, such case functions can only approximate the semantic
meaning, and a simple approximation is better than no approximation.
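
To make the point concrete, a character-level Latin-1 case function can
only do something like the following sketch (the function name is mine):

(* Character-level Latin-1 uppercasing as an approximation: \223 is
   left unchanged because it has no capital in Latin-1; only a
   word-level rule could expand it to "SS". *)
let uppercase_latin1 c =
  match Char.code c with
  | b when b >= 0x61 && b <= 0x7A ->
      Char.chr (b - 32)            (* a..z -> A..Z *)
  | b when b >= 0xE0 && b <= 0xFE && b <> 0xF7 ->
      Char.chr (b - 32)            (* accented letters; \247 is the
                                      division sign and has no case *)
  | _ -> c                         (* \223 and \255 pass through:
                                      neither has a Latin-1 capital *)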

>> - The channels if opened in text mode (because they specially recognize the
>> line endings if needed for the operating system, and newline characters are
>> part of the character set)
>
> This is a serious problem. It is also partly ill formed:
>what is a 'line' in Chinese, which writes characters top down?

But lines exist. For example, your message is divided into lines. The
concept of lines is too important to be dropped, even though it is
simple; indeed, much of its success comes from that simplicity. Other
writing traditions also have a writing direction.

>What is a 'line' in a Web page :-)

What is a 'line' in the sky?

>> This means we would have something like
>>
>> type encoding = UTF8 | Latin1 | ...
>
> Be careful to distinguish CODE SET from ENCODING.
>See the Unicode home page for more details: Latin-1 is NOT
>an encoding, but a character set. There is a MAPPING from
>Latin-1 to ISO-10646. This is not the same thing as an encoding.
>I think this is the wrong approach: we do not want built-in
>cases for every possible encoding/character set.

Character sets and encodings are both artificial concepts. When I
program, I always deal with a combination of both. The distinction is
irrelevant for most applications; it matters when you want to convert
texts from (cs1,enc1) to (cs2,enc2), because conversion is not always
possible.

My idea is that the type "encoding" enumerates all supported combinations; I
expect only a few.
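
As a sketch of what I mean (the function name is only illustrative):
the conversion Latin1 -> UTF8 always succeeds, while the reverse can
fail, which is exactly why a "recode" over such an enumeration must be
partial:

type encoding = UTF8 | Latin1   (* ... the few other combinations ... *)

(* Latin1 -> UTF8 always succeeds: Latin-1 is the first 256 code
   points of ISO 10646, and each byte becomes one or two UTF-8 bytes. *)
let latin1_to_utf8 s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
       let b = Char.code c in
       if b < 0x80 then Buffer.add_char buf c
       else begin
         Buffer.add_char buf (Char.chr (0xC0 lor (b lsr 6)));
         Buffer.add_char buf (Char.chr (0x80 lor (b land 0x3F)))
       end)
    s;
  Buffer.contents buf

The reverse direction would have to raise an exception (or mark the
character somehow) whenever it meets a code point outside Latin-1.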

>	Instead, we want an open ended set of conversions from (and
>perhaps to) the internally used representation. There are a LOT of
>such combinations, and we need to add new ones without breaking into
>a module. We should do it functionally; this should work well in
>ocaml :-)

What kind of problem do you want to solve with an open-ended set of
conversions? Isn't this the task of a specialized program?

>> type locale = ...
>> val String.i18n_uppercase : encoding -> locale -> string -> string
>> val String.i18n_lowercase : encoding -> locale -> string -> string
>
>	Not in the String module. This belongs in a different package
>which handles the complex vagaries of human script. [This particular
>function is relatively simple.

See above.

>Another is 'get_digit', 'isdigit'.
>Whitespace is much harder. Collation is a nightmare :-]

I think collation should be left out of a basic library. Even for a
single language, there are often several traditions of how to sort, and
it also depends on the kind of strings you are sorting (think of
personal names, for example). Members of those traditions can
contribute special modules for collation.

>
>> val String.recode : encoding -> encoding -> string -> string
>> (* changes the encoding if possible *)
>
> This isn't quite right. The way to do this is to have a function:
>
> LATIN1_to_ISO10646 code
>
>which does the mapping (from a SINGLE code point in LATIN1 to ISO10646).
>The code point is an int. Separately, we handle encodings:
>
> UCS4_to_UTF8 code
>
>converts an int to a UTF8 string, and
>
> UTF8_to_UCS4 string position
>
>parses the string from position, returning a code point and position.
>There are other encodings, such as DBCS encodings, which generally
>are tied to a single character set. [UCS4/UTF8 are less dependent]
>

The most correct interface is not always the best.
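
Written down, the proposed decomposition looks like this as an OCaml
signature (the module type and the lower-case names are mine):

module type CONVERSIONS = sig
  (* single code point, Latin-1 -> ISO 10646; for Latin-1 this
     mapping is the identity, since Latin-1 forms the first 256 code
     points of ISO 10646 *)
  val latin1_to_iso10646 : int -> int

  (* encode one code point as its UTF-8 byte sequence *)
  val ucs4_to_utf8 : int -> string

  (* parse one character of a string at a position; return the code
     point and the position just after it *)
  val utf8_to_ucs4 : string -> int -> int * int
end

For N character sets and M encodings this multiplies quickly, which is
part of why I would rather enumerate the few combinations that actually
occur.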

>[....]
>> This all means that the number of string functions explodes:
>
>	Exactly. And we don't want that. So I suggest we continue
>to use the existing strings of 8 bit bytes ONLY, and represent
>ALL foreign [non ISO10646] character sets using ISO-10646 code points,
>encoded as UTF-8, and provide an input filter for the compiler.
>
> In addition, some extra functions to convert other
>character sets and encodings to ISO-10646/UTF-8 are provided,
>and, if you like, they can be plugged into the I/O system.
>
> This means a lot of conversion functions, but ONE
>internal representation only: the one we already have.

There will be a significant slow-down of all ocaml programs if strings
are encoded as UTF-8. I think the user of a language should be able to
choose what matters more: time, space, or reduced functionality. UTF-8
saves space and costs time; UCS-4 wastes space but saves time; UCS-2 is
a compromise, and bad precisely because it is a compromise; Latin-1 (or
another 8-bit character set) saves both time and space but offers less
functionality. (For example, a 1000-character Latin-1 text takes 1000
bytes as Latin1, up to 2000 bytes as UTF-8, and 4000 bytes as UCS-4;
but finding the n-th character is a constant-time operation in the
fixed-width forms and a linear scan in UTF-8.)

>> We need functions
>> for compatibility (Latin1), functions for arbitrary 8 bit encodings, and
>> functions for wide strings. I think this is the main argument against it,
>> and it is very difficult to get around this. (Any ideas?)
>
>	I've been trying to tell you how to do it. The solution is
>simple: adopt ISO-10646 as the SOLE character set and UTF-8 as the
>SOLE encoding of it, and provide conversions from other character
>sets and encodings.

It looks simple but I suppose it is not what the ocaml users want.

>All the code that needs to manipulate strings can then be provided NOW
>as additional functions manipulating the existing string type.

And because compatibility is lost, the whole existing code base would
have to be reworked.

>	The apparent loss of indexing is a mirage. The gain is huge:
>ISO-10646 Level 1 compliance without any explosion of data types.
>Yes, some more _functions_ are needed to do extra processing,
>such as normalisation, comparisons of various kinds,
>capitalisation, etc. Regular expressions will need to be
>enhanced to handle special features (like case-insensitive searching),
>but the basic regular expressions will work out of the box.
>
>> The enlarged character sets become more and more important, and it is only a
>> matter of time until every piece of software which wants to be taken seriously
>> can process them, even a dumb terminal or simple text editor. So you will be
>> able to put accented characters into your comments, and you will see them as
>> such even if you 'cat' the program text to the terminal or printer; this will
>> work everywhere...
>
>	Yes. That time is not here yet, but soon international
>support will be mandatory for all large software purchases by
>governments and large corporations.

I do not believe that this will be the driving force, because current
solutions exist and it is VERY expensive to replace them. It is even
cheaper to replace a programming language than a character set and
encoding. It looks like another Year 2000 problem, but without a
deadline.

The first field where some progress will be made is data exchange,
because ISO10646 can bridge several character sets. By then, tools will
be available to view, edit, and of course convert such data; ISO10646
will be used in parallel with the "traditional" character sets. These
tools will be low-level, and perhaps operating systems will then
support them with fonts, input methods, and conventions for indicating
the encoding. (The current environment-variable solution is a pain if
you try to use two encodings in parallel. I can imagine, for example,
that Unix terminal drivers could let you select the encoding directly,
in the same way as you set other terminal properties.)

In contrast, many applications need not be replaced and won't be.
Perhaps they will get an ISO10646 import/export filter.

--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   Gerd.Stolpmann@darmstadt.netsurf.de (privat)
Germany                     
----------------------------------------------------------------------------


