Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] Explaining bit sets
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Yamagata Yoriyuki <yoriyuki@m...>
Subject: Re: [Caml-list] Explaining bit sets
From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@cicrp.jussieu.fr>
Subject: [Caml-list] Explaining bit sets
Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST)

> > Hmmm. I need a lexer that supports unicode,
> > i.e. 2^24 characters. To make that work,
> > I need to classify characters (array of 2^24 size is a bit expensive),
> > but also the lexer needs an efficient representation
> > of subsets of unicode characters, typically
> > containining a few ranges (eg 'letter' is a code
> > in one of about 30 different ranges).

I wrote something for the same purpose (Unicode -> foo tables), which
might be useful.  The relevant modules are tbl.ml[i],
bitsvect.ml[i], bytesvect.ml[i] in
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/camomile/camomile/internal

There are 6 table types, 'a Tbl32.rw, 'a Tbl32.t, which are tables
from int to 'a, Tbl32.bits_rw, Tbl32.bits which are from int to int (<
256), Tbl32.bytes_rw, Tbl32.bytes which are from int to int.
Tbl32.*rw are internally tries whose nodes have 256 branches.  Terminal
nodes are arrays (Tbl32.rw) or bits vectors (Tbl32.bits_rw) or bytes
sequences (Tbl32.bytes_rw).  Inplace-update is possible for these
types.

'a Tbl32.t (and Tbl32.bits, Tbl32.bytes) are read-only.  They are
internally 4-dimensional arrays, so access should be fast.
Tbl32.rw_to_ro creates Tbl32.t from Tbl32.rw.  (There are similar
functions for Tbl32.bits and Tbl32.bytes.)  It scans the whole trie
and make identical nodes shared, so that table size is small.  For
example, the size of Unicode character category table (Classification
of Unicode characters into about 30 categories) becomes 16K.

(btw, Unicode is 21-bits, not 24-bits.)
--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners