[Caml-list] Announcement: PXP 1.1.92 (development version)
From: John Max Skaller <skaller@o...>
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Yamagata Yoriyuki wrote:


> Data at Unicode.org for East Asian encodings are buggy.  Don't use
> them.


Noted.

>>My functions are in Python, and take the form:
>>
>>	decode: string -> (int * string)
>>	encode: int -> string
>>
>>where string is an 8 bit byte stream,
>>and int is a unicode (or other) code point.
>>
> 
> This interface has a problem with stateful encodings, which are quite
> important here.  (ISO-2022-JP, or the JIS encoding, is stateful and is the
> standard encoding for email.)  In addition, it is inefficient.


Agreed on both counts, though none of the encodings I handle
are stateful (I handle Shift-JIS, which isn't stateful AFAIK).
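
To illustrate the stateful case, here is a minimal sketch using
Python's standard iso2022_jp codec (a stand-in, not the functions
discussed in this thread). It assumes the encoded text begins with the
3-byte escape sequence that switches to the two-byte character set.
The same two bytes mean different things depending on which escape
sequence came before them, so a decoder that only sees "the next few
bytes" cannot be correct without carrying state:

```python
import codecs

text = "日本語"
# Encoded form: escape sequence, three 2-byte characters, escape back to ASCII.
data = text.encode("iso2022_jp")

# Stateless view: the two bytes of the first kanji, taken on their own,
# are read as plain ASCII because no escape sequence precedes them.
kanji_bytes = data[3:5]
stateless = kanji_bytes.decode("iso2022_jp")

# Stateful view: an incremental decoder remembers the active character
# set between calls, so it can be fed one byte at a time.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
stateful = "".join(decoder.decode(bytes([b])) for b in data)
stateful += decoder.decode(b"", final=True)

print(repr(stateless), repr(stateful))
```

A decode of the form string -> (int * string) would have to thread
the current character set through as an extra argument and result to
handle this.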

The functions I give are canonical, and they're fast
enough in Python (if you want fast, you'd use C anyhow).
There is an issue for Ocaml: what is a Unicode string like?
My answer would be 'array of int'. But another answer
is 'string with UTF-8 encoding'.
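
As a sketch of the canonical interface, here is what decode and encode
could look like for UTF-8 in Python 3. This is an illustration, not my
original code: `bytes` stands in for the 8-bit string, `to_code_points`
is a hypothetical helper, and for brevity the decoder does not reject
overlong forms or surrogates.

```python
from typing import List, Tuple

def encode(cp: int) -> bytes:
    """encode: int -> string. Encode one code point as UTF-8."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def decode(data: bytes) -> Tuple[int, bytes]:
    """decode: string -> (int * string). Return the first code point
    and the bytes that follow it."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        length, cp = 1, lead
    elif lead >> 5 == 0b110:
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, data[length:]

def to_code_points(data: bytes) -> List[int]:
    """Hypothetical helper: the 'array of int' view of a UTF-8 string."""
    out = []
    while data:
        cp, data = decode(data)
        out.append(cp)
    return out
```

The two answers to "what is a Unicode string" are then the result of
`to_code_points(s)` and the byte string `s` itself.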


In theory, mappings and codecs are orthogonal.

UTF-8, as an encoding scheme, has nothing to do with Unicode
as such; it works just fine for any national character set.
In practice, many character sets are defined
by two byte encodings.

So you might want a function:

	Shift-JIS -> Unicode as UTF-8

modelled by

	string -> string (8 bit clean strings)

That can be made from the canonical functions,
but it isn't efficient to do the conversion
via an integer intermediate form.
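
A minimal sketch of the composed form, with Python's built-in
shift_jis and utf-8 codecs standing in for the canonical functions
(the name `sjis_to_utf8` is hypothetical). Every character passes
through an integer code point on the way:

```python
def sjis_to_utf8(data: bytes) -> bytes:
    """string -> string: Shift-JIS bytes in, UTF-8 bytes out."""
    # Step 1: decode Shift-JIS into a list of integer code points.
    code_points = [ord(ch) for ch in data.decode("shift_jis")]
    # Step 2: encode each code point as UTF-8 and join the pieces.
    return b"".join(chr(cp).encode("utf-8") for cp in code_points)

sjis = "日本語".encode("shift_jis")
utf8 = sjis_to_utf8(sjis)
print(len(sjis), len(utf8))
```

A dedicated converter could instead map each two-byte Shift-JIS
sequence straight to its UTF-8 byte sequence with one table lookup,
skipping the intermediate list.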


> I read somewhere that Perl6 delegates code conversion to add-on
> programs, since making standard mapping tables is really hard.
> (Even the naming of encodings is a problem.  There is no cross-platform
> way of doing this.)  Introducing a generic channel type (for char and
> unicode character) and letting 3rd party libraries do conversion is a
> better solution, IMO.


Well, you also want in-core conversions. And then a third
party library is an arbitrary function. The problem
is that people are rewriting these functions for each
application that needs some i18n support. Reuse would
be better, but that requires some form of
standardisation. It's both hard to get the conversions
right, and also to make them efficient. I spent ages
converting the unicode.org data (I also found a bug
in the UNICODE tables).

The problem is: 'third party libraries' might be a reasonable
answer for a C program. It's not so reasonable for Ocaml:
where are they? We're short of useful libraries, and indeed
of a mechanism to install and access them.

-- 
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850

