[Camllist] Explaining bit sets
 Diego Olivier Fernandez Pons
[
Home
]
[ Index:
by date

by threads
]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
Date:  20020902 (09:42) 
From:  Diego Olivier Fernandez Pons <DiegoOlivier.FERNANDEZPONS@c...> 
Subject:  [Camllist] Explaining bit sets 
Bonjour, I will try to explain how does the bitSet module work. A bit set is just à nary tree with n = 30 and an integer data in each node : instead of just having type tree = Empty  Node of int * tree * tree you have type tree = Empty  Node of int * tree array (I really use an other representation which is mosty equivalent, but saves some space and makes algorithms easier : tree = N of int * int array) When you call (insert [1,15,2,3,17] my_set) the only thing you are doing is set to 1 the 17h bit of the integer located in the first tree first level, 15th tree second level, 2nd tree third level, 3rd tree 4th level. Now, all the game is to find a convenient decomposition for each value you want to save, the actual module provide two classical function  bigendian (base 30) example : path 15123 > [16,24,3] path 15124 > [16,24,4] (same mask) path 15153 > [16,25,3] (next node) that means that integer masks will represent numbers as follows [ 0, 1, ..., 30 ] [ 31, 32, ... , 60] etc.  endianendian (base 30) example : path 15123 > [3,24,16] path 16023 > [3,24,17] (same mask) path 15124 > [4,24,16] (next tree from the root) that means that integer masks will represent numbers as follows [ 0, 30, 60, ..., 900] [ 1, 31, 61, ..., 901] etc. I am working on a third function allowing other modulo representations e.g. [ 0, 5, 10, 15, ... , 150] etc. > Hmmm. I need a lexer that supports unicode, > i.e. 2^24 characters. To make that work, > I need to classify characters (array of 2^24 size is a bit expensive), > but also the lexer needs an efficient representation > of subsets of unicode characters, typically > containining a few ranges (eg 'letter' is a code > in one of about 30 different ranges). You could use bigendian representation of course, but you could even use your own 30ary representation. Why ? let say you insert 10 ints from 221 456 to 221 466 that is [8,6,1,26] to [8,6,2,6], You will have a degenerated tree with 2 useless nodes (level 1 and 2) since the rest of the tree is empty   / \ int int Of course, two nodes is not much but modulo 30 operations have a cost. The same data structure with modulo 2 operations for optimal speed acces would look have 5 times more useless nodes (because 30 ~= 32 = 2^5) To have an optimal space occupation you just need to write down your own decomposition function which should have the following properties  injectivity in the domain you are working with  neighbourg conservation  most accessed integers in the top (id est shortest path) There are other solutions based on using common prefixes :  patricia trees  a trie of packed arrays You can find patricia trees in Jean Christophe Fillâtre's home page I have written a small packed arrays module which will soon be in my data structures library (an old buggy beta version is in my home page but will be removed) If you give me some more informations (on typical ranges that you can find in unicode subsets) I will be able to give more precise answers Diego Olivier  To unsubscribe, mail camllistrequest@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/camlbugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners