Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] Explaining bit sets
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@c...>
Subject: [Caml-list] Explaining bit sets
    Bonjour,

I will try to explain how does the bitSet module work.

A bit set is just à n-ary tree with n = 30 and an integer data in
each node : instead of just having

  type tree = Empty | Node of int * tree * tree

you have

  type tree = Empty | Node of int * tree array

(I really use an other representation which is mosty equivalent,
but saves some space and makes algorithms easier : 
tree = N of int * int array)

When you call (insert [1,15,2,3,17] my_set) the only thing you are
doing is set to 1 the 17h bit of the integer located in the first tree
first level, 15th tree second level, 2nd tree third level, 3rd tree
4th level. 

Now, all the game is to find a convenient decomposition for each value
you want to save, the actual module provide two classical function

- big-endian (base 30)
example : path 15123 -> [16,24,3] 
          path 15124 -> [16,24,4] (same mask)
          path 15153 -> [16,25,3] (next node)

that means that integer masks will represent numbers as follows
[ 0, 1, ..., 30 ] [ 31, 32, ... , 60] etc.

- endian-endian (base 30)

example : path 15123 -> [3,24,16]
          path 16023 -> [3,24,17] (same mask)
          path 15124 -> [4,24,16] (next tree from the root)

that means that integer masks will represent numbers as follows
[ 0, 30, 60, ..., 900] [ 1, 31, 61, ..., 901] etc.

I am working on a third function allowing other modulo representations
e.g. [ 0, 5, 10, 15, ... , 150] etc.


> Hmmm. I need a lexer that supports unicode,
> i.e. 2^24 characters. To make that work,
> I need to classify characters (array of 2^24 size is a bit expensive),
> but also the lexer needs an efficient representation
> of subsets of unicode characters, typically
> containining a few ranges (eg 'letter' is a code
> in one of about 30 different ranges).

You could use big-endian representation of course, but you could even
use your own 30-ary representation. Why ? 

let say you insert 10 ints from 221 456 to 221 466
that is [8,6,1,26] to [8,6,2,6],

You will have a degenerated tree with 2 useless nodes (level 1 and 2)
since the rest of the tree is empty 
     |
     |
    / \
  int  int

Of course, two nodes is not much but modulo 30 operations have a cost. 
The same data structure with modulo 2 operations for optimal speed
acces would look have 5 times more useless nodes (because 30 ~= 32 =
2^5)

To have an optimal space occupation you just need to write down your
own decomposition function which should have the following properties

- injectivity in the domain you are working with
- neighbourg conservation
- most accessed integers in the top (id est shortest path)

There are other solutions based on using common prefixes :
- patricia trees
- a trie of packed arrays

You can find patricia trees in Jean Christophe Fillâtre's home page
I have written a small packed arrays module which will soon be in my
data structures library (an old buggy beta version is in my home page
but will be removed)

If you give me some more informations (on typical ranges that you can
find in unicode subsets) I will be able to give more precise answers

        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners