[Caml-list] Announcement: PXP 1.1.92 (development version)
From: Yamagata Yoriyuki <yoriyuki@m...>
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
From: John Max Skaller <skaller@ozemail.com.au>
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 18:52:20 +1000

> I have ALL the code sets specified at Unicode.org in
> programmatic form. Easy to generate Ocaml versions
> of the tables.

The data at Unicode.org for East Asian encodings are buggy.  Don't use
them.  (Moreover, the Unicode Consortium has declared that it does not
want to fix these bugs, and has made the East Asian mapping tables
obsolete.  See
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/README.TXT)
I use mapping tables from glibc for my camomile, which seem better
debugged.

> My functions are in Python, and take the form:
> 
> 	decode: string -> (int * string)
> 	encode: int -> string
> 
> where string is an 8 bit byte stream,
> and int is a unicode (or other) code point.

This interface has a problem with stateful encodings, which are quite
important here.  (ISO-2022-JP, or JIS encoding, is stateful, and is the
standard encoding for email.)  In addition, it is inefficient.
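
To illustrate the statefulness problem, here is a minimal sketch.  It
uses the "iso2022_jp" codec from Python's standard codecs module as a
stand-in; it is not the quoted decode/encode functions, and not
camomile.  The same two bytes are read as ASCII or as a kanji
depending on the escape sequence seen earlier, so a decoder has to
carry state from one call to the next.

```python
import codecs

# Encode one character; the result is ESC $ B + two 7-bit bytes + ESC ( B.
encoded = "日".encode("iso2022_jp")
payload = encoded[3:-3]  # the two bytes between the escape sequences

# Stateless view: the same two bytes, seen without the preceding escape
# sequence, are read as plain ASCII.
as_ascii = payload.decode("iso2022_jp")

# Stateful view: an incremental decoder remembers the escape sequence
# across calls, so the input can arrive one byte at a time.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
pieces = [decoder.decode(encoded[i:i + 1]) for i in range(len(encoded))]
pieces.append(decoder.decode(b"", final=True))
as_kanji = "".join(pieces)

print(repr(as_ascii), repr(as_kanji))
```

A function of type string -> (int * string) has nowhere to keep the
current shift state unless the caller threads it through by hand.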

> The actual python functions use dynamically loaded
> data tables, but each character set has a fixed
> format for the tables that knows about the raw
> structure of the character set (eg what ranges of
> hi and low bytes are allowed in two byte encodings
> of Shift-Jis, KSC, etc). For Ocaml, we'd probably
> want to bind the encodings at compile time
> (since there is no well defined way to find
> the data tables at run time :(
> 
> The tables are very compact, but there are quite
> a few encodings -- some overhead if they're all
> in the one module ..

I read somewhere that Perl6 delegates code conversion to add-on
programs, since making standard mapping tables is really hard.
(Even the naming of encodings is a problem.  There is no cross-platform
way of doing this.)  Introducing a generic channel type (for char and
Unicode characters) and letting 3rd party libraries do the conversion is
a better solution, IMO.
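
A rough sketch of what such a channel could look like, written in
Python only because the quoted functions are.  CodePointChannel and
its parameters are hypothetical names made up for this illustration;
the decoder from Python's standard codecs module merely plays the role
of a 3rd party converter.

```python
import codecs
import io


class CodePointChannel:
    """Input channel that yields Unicode code points from a byte stream.

    `decoder` is any object with a decode(bytes, final) -> str method;
    the channel itself knows nothing about particular encodings.
    """

    def __init__(self, raw, decoder, chunk_size=4096):
        self.raw = raw
        self.decoder = decoder
        self.chunk_size = chunk_size

    def __iter__(self):
        while True:
            chunk = self.raw.read(self.chunk_size)
            # An empty read means end of input: flush the decoder.
            text = self.decoder.decode(chunk, final=not chunk)
            for ch in text:
                yield ord(ch)
            if not chunk:
                break


# Stand-in for a third-party converter: the standard library's decoder.
raw = io.BytesIO("日本語 abc".encode("iso2022_jp"))
decoder = codecs.getincrementaldecoder("iso2022_jp")()
points = list(CodePointChannel(raw, decoder, chunk_size=1))
```

The channel holds no mapping tables of its own, so the standard
library would not need to ship or name any encodings.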
--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners