[Caml-list] Announcement: PXP 1.1.92 (development version)
| Date: | -- (:) |
| From: | Yamagata Yoriyuki <yoriyuki@m...> |
| Subject: | Re: [Caml-list] Announcement: PXP 1.1.92 (development version) |
From: John Max Skaller <skaller@ozemail.com.au>
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 18:52:20 +1000

> I have ALL the code sets specified at Unicode.org in
> programmatic form. Easy to generate Ocaml versions
> of the tables.

The data at Unicode.org for East Asian encodings is buggy. Don't use it.
(Moreover, the Unicode Consortium has declared that it does not want to
fix these bugs, and has made the East Asian mapping tables obsolete. See
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/README.TXT)
I use the mapping tables from glibc for my camomile; they seem to be
better debugged.

> My functions are in Python, and take the form:
>
> decode: string -> (int * string)
> encode: int -> string
>
> where string is an 8 bit byte stream,
> and int is a unicode (or other) code point.

This interface has a problem with stateful encodings, which are quite
important here. (ISO-2022-JP, also known as the JIS encoding, is stateful,
and it is the standard encoding for email.) In addition, it is
inefficient.

> The actual python functions use dynamically loaded
> data tables, but each character set has a fixed
> format for the tables that knows about the raw
> structure of the character set (eg what ranges of
> hi and low bytes are allowed in two byte encodings
> of Shift-Jis, KSC, etc). For Ocaml, we'd probably
> want to bind the encodings at compile time
> (since there is no well defined way to find
> the data tables at run time :(
>
> The tables are very compact, but there are quite
> a few encodings -- some overhead if they're all
> in the one module ..

I read somewhere that Perl6 delegates code conversion to add-on
programs, since making standard mapping tables is really hard. (Even the
naming of encodings is a problem. There is no cross-platform way to do
this.) Introducing a generic channel type (for char and Unicode
characters) and letting third-party libraries do the conversion is a
better solution, IMO.
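
The problem with stateful encodings can be illustrated with a short sketch. It is written in Python, the language of the quoted functions, and uses the standard library's `iso2022_jp` codec rather than the mapping tables discussed in this thread. The helper name `decode_chunks` and the split position are hypothetical, chosen only for illustration. In ISO-2022-JP an escape sequence selects the active character set, so the same two bytes decode differently depending on what came before them.

```python
import codecs

text = "日本"
# The encoder emits an escape sequence to select JIS X 0208 before the
# kanji and another one to return to ASCII afterwards.
data = text.encode("iso2022_jp")

# Split after the first kanji: escape sequence (3 bytes) + one kanji (2 bytes).
first, second = data[:5], data[5:]

# A decoder with no memory of the earlier escape sequence reads the
# second chunk in the initial (ASCII) state and does not recover the kanji.
stateless = second.decode("iso2022_jp")


def decode_chunks(chunks, encoding="iso2022_jp"):
    """Decode a sequence of byte chunks while carrying decoder state across them."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush any pending input
    return "".join(parts)


stateful = decode_chunks([first, second])
```

Here `stateful` equals the original text, while `stateless` does not contain the second kanji. A pure function from a byte string to a code point plus the remaining bytes has nowhere to keep the shift state unless that state is threaded through explicitly, which is the limitation described above.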
--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF