Date:    1999-10-21 (09:43)
From:    skaller <skaller@m...>
Subject: Re: localization, internationalization and Caml
Gerd Stolpmann wrote:
> On Tue, 19 Oct 1999, John Skaller wrote:
> >
> > I don't agree. If you read ISO10646 carefully, you will find
> > that you must STILL parse sequences of code points
>
> Let's begin with languages we know. As far as I know, ISO10646 allows
> one not to implement the combining characters.

From memory, you are correct: there are three specified levels of
compliance. Level 1 compliance does not require processing combining
characters.

> I think a programming language should only provide the basic means by
> which you can operate with characters, but should not solve the
> problem completely.

Yes, I agree, at least at this time.

> > What doesn't work efficiently is indexing.
> > And it is never necessary to do it for human script.
> > Why would you ever want to, say, replace the 10'th character
> > of a string??
>
> Because I have an algorithm operating on the characters of a string.

If the string represents human script, that algorithm is wrong,
because it makes incorrect assumptions about the nature of human
script. You will need to rewrite it if you want it to work in an
international setting.

> Such algorithms use indexes as pointers to parts of a string, and in
> most cases the indexes are only incremented or decremented. On a
> UTF-8 string, you could define an index as
>
>   type index = { index_position : int; byte_position : int }
>
> and define the operations "increment", "decrement" (only working
> together with the string), "add", "subtract", "compare" (to calculate
> string lengths). Such indexes have strange properties; they can only
> be interpreted together with the string to which they refer.
>
> You cannot avoid such an index type really; you can only avoid giving
> the thing a name, and program the index operations anew every time.

I agree. But my point is: you should change your code _anyhow_, to use
the new and correct parsing method, because it is necessary for Level 2
and Level 3 compliance. Your code will then work correctly at those
levels when the 'increment' function is upgraded.

What you will find is something which, perhaps by chance, is natural
in Python: there is no such thing as a character. A string is NOT an
array of characters. Strings can be composed from strings, and
decomposed into arrays of strings, but there is not really any
character type.

> Perhaps your suggestion works; but string manipulation will then be
> much slower. For example, an "increment" must be implemented by
> finding the next beginning of a character (instead of just
> incrementing a numeric index).

Yes, but this is a fact: it is actually required for correct
processing of human script. You cannot 'magic' away the facts. What
you can do, if you are programming with a known subset such as the
characters for a stock code, is use indexing anyhow, perhaps with the
ASCII subset. That is, you can use the byte strings as character
strings.
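To make that concrete, here is roughly what such an index type and its
'increment' might look like in ocaml. This is only a sketch: the names
are mine, and well-formed UTF-8 is assumed.

  (* An index into a UTF-8 string: character number plus byte offset.
     Meaningful only together with the string it refers to. *)
  type index = { char_pos : int; byte_pos : int }

  (* A byte is a UTF-8 continuation byte iff its top two bits are 10. *)
  let is_continuation c = Char.code c land 0xC0 = 0x80

  (* Advance one character: step past the leading byte, then past any
     continuation bytes. *)
  let increment s i =
    let n = String.length s in
    let j = ref (i.byte_pos + 1) in
    while !j < n && is_continuation s.[!j] do incr j done;
    { char_pos = i.char_pos + 1; byte_pos = !j }

  (* Step back one character: skip continuation bytes backwards. *)
  let decrement s i =
    let j = ref (i.byte_pos - 1) in
    while !j > 0 && is_continuation s.[!j] do decr j done;
    { char_pos = i.char_pos - 1; byte_pos = !j }

Note that 'increment' is still constant time per character (a UTF-8
character is only a few bytes long); what is lost is random access by
character number, which, as I said, human script does not need.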
> There will always be a difference between natural languages and
> sophisticated packages.

Yes. However, there is an important point here. Natural languages are
quirky and their behaviour is variable: each human uses language
differently in each sentence, varying with region, context, etc.
Obviously, computer systems only use some abstracted representation.
While there are many levels and ways of abstracting this, there is one
that is worthy of special interest here: the ISO10646 Standard.

So I guess my suggestion is that in the _standard_ language libraries
we will eventually need to implement the algorithms required for
compliance with that Standard. In my opinion, that naturally breaks
into two parts:

  a) (byte/word) string management: this is an issue of storage
     allocation and manipulation, not natural language processing

  b) basic natural language processing

> Even the current String.uppercase is wrong (in Latin1 there is a
> lower case character without a corresponding capital character
> (\223), but WORDS containing this character can be capitalized by
> applying a semantic rule).
>
> I would suppose that String.upper/lowercase are part of the library
> because the compiler itself needs them. Currently, ocaml depends on
> languages that know the distinction between character cases.

AH! You are right!

> In my opinion such case functions can only approximate the semantic
> meaning, and a simple approximation is better than no approximation.

No. That is, I agree entirely, but make a different point: an
arbitrary simple approximation is worthless; the one that is useful is
the ISO Standardised one.

> My idea is that the type "encoding" enumerates all supported
> combinations; I expect only a few.

Please no. Leave the type open to external augmentation. Just
consider: my Interscript literate programming tool ALREADY supports
something like 30 "encodings" -- all those present on the unicode.org
website. Your 'type' is already a joke: I already support a lot more
encodings than it names.

> What kind of problem do you want to solve with an open ended set of
> conversions? Isn't this the task of a specialized program?

No. It allows a generalised ISO10646 compliant program to read, and
perhaps write, a file in any supported encoding, while manipulating it
internally in one format. If an encoding is missing, it is easy to add
a new pair of conversion functions without breaking the standard
library.

That is, it is the task of specialised _functions_. It makes sense to
provide some as standard, like the ones your type suggests -- but not
to represent the cases with a type. Ocaml variants are not amenable to
extension. Function parameters are. That is, I think there are exactly
two cases:

  a) no conversion required

  b) user supplied conversion function
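To illustrate the design, here is a sketch in ocaml of an input
decoder parameterised by a conversion function. The names are invented
for illustration, and the Latin 1 converter handles only the simple
two byte case; the point is that the type below never needs to change,
because a new encoding is just a new function.

  (* Exactly the two cases: pass bytes through unchanged, or apply a
     user supplied conversion to the internal format (say UTF-8). *)
  type conversion =
    | No_conversion
    | Convert of (string -> string)

  let decode conv raw =
    match conv with
    | No_conversion -> raw
    | Convert f -> f raw

  (* A user supplied conversion: Latin 1 bytes to UTF-8. Code points
     below 0x80 pass through; the rest become two byte sequences. *)
  let latin1_to_utf8 s =
    let buf = Buffer.create (String.length s) in
    String.iter
      (fun c ->
        let n = Char.code c in
        if n < 0x80 then Buffer.add_char buf c
        else begin
          Buffer.add_char buf (Char.chr (0xC0 lor (n lsr 6)));
          Buffer.add_char buf (Char.chr (0x80 lor (n land 0x3F)))
        end)
      s;
    Buffer.contents buf

  let example = decode (Convert latin1_to_utf8) "caf\233"

Adding support for another encoding means writing one such function;
the standard library is untouched.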
> I think collation should be left out by a basic library.

Probably right. Level 1 compliance is a good start, and does not
require collation.

> The most correct interface is not always the best.

What do you mean 'most correct'? Either the interface supports the
(ISO10646) required behaviour or it does not.

> There will be a significant slow-down of all ocaml programs if the
> strings are encoded as UTF-8.

No. On the contrary, most existing programs will be unaffected. Those
which actually care about internationalisation can only be made faster
(by providing native support).

> I think the user of a language should be able to choose what is more
> important: time or space or reduced functionality. UTF-8 saves space,
> and costs time; UCS-4 wastes space, but saves time; UCS-2 is a
> compromise and bad because it is a compromise; Latin 1 (or another
> 8 bit cs) saves time and space but has less functionality.

Sure, but this leads to multiple interfaces. Was that not the original
problem?

Let me put the argument for UTF-8 differently. Processing UTF-8 'as
is' is non-trivial and should be done in low level system functions
for speed. Processing arrays of 31 bit integers is _already_ well
supported in ocaml, and will be better supported by adding variable
length arrays with functions designed with some view to use for string
processing. So we don't actually need a wide character string type or
supporting functions, precisely because in the simplest cases a
standard data type not really specialised to script processing will do
the job.

What is actually required (in both cases) are some 'data tables' to
support things like case mapping. For example, a function

  convert_to_upper i

which takes an ocaml integer argument would be useful, and it is easy
enough to 'map' this over an array.

Sigh. See next post. I will post my code, so it can be torn up by
experts.

--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller