Browse thread
[Caml-list] Explaining bit sets
-
Diego Olivier Fernandez Pons
- john.chu@e...
- Yamagata Yoriyuki
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
| Date: | -- (:) |
| From: | Yamagata Yoriyuki <yoriyuki@m...> |
| Subject: | Re: [Caml-list] Explaining bit sets |
From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@cicrp.jussieu.fr> Subject: [Caml-list] Explaining bit sets Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST) > > Hmmm. I need a lexer that supports unicode, > > i.e. 2^24 characters. To make that work, > > I need to classify characters (array of 2^24 size is a bit expensive), > > but also the lexer needs an efficient representation > > of subsets of unicode characters, typically > > containining a few ranges (eg 'letter' is a code > > in one of about 30 different ranges). I wrote something for the same purpose (Unicode -> foo tables), which might be useful. The relevant modules are tbl.ml[i], bitsvect.ml[i], bytesvect.ml[i] in http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/camomile/camomile/internal There are 6 table types, 'a Tbl32.rw, 'a Tbl32.t, which are tables from int to 'a, Tbl32.bits_rw, Tbl32.bits which are from int to int (< 256), Tbl32.bytes_rw, Tbl32.bytes which are from int to int. Tbl32.*rw are internally tries whose nodes have 256 branches. Terminal nodes are arrays (Tbl32.rw) or bits vectors (Tbl32.bits_rw) or bytes sequences (Tbl32.bytes_rw). Inplace-update is possible for these types. 'a Tbl32.t (and Tbl32.bits, Tbl32.bytes) are read-only. They are internally 4-dimensional arrays, so access should be fast. Tbl32.rw_to_ro creates Tbl32.t from Tbl32.rw. (There are similar functions for Tbl32.bits and Tbl32.bytes.) It scans the whole trie and make identical nodes shared, so that table size is small. For example, the size of Unicode character category table (Classification of Unicode characters into about 30 categories) becomes 16K. (btw, Unicode is 21-bits, not 24-bits.) -- Yamagata Yoriyuki http://www.mars.sphere.ne.jp/yoriyuki/ PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners