Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] intersecting huge integer sets
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Tibor Simko <tibor.simko@c...>
Subject: [Caml-list] intersecting huge integer sets
Hello

I wonder what is the OCaml canonical way to get fast intersection of
huge sets of integers: say, to intersect two sets of 500000 items
containing distinct integer values from the (0,1000000) interval.  

I quickly tested the Hashtbl way, along the lines:

+---- 
|    let dsize d = 
|      (* return length of dictionary *)
|      let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;;  
| 
|    let dcreate nitems maxval =     
|      (* create dictionary of nitems with integer keys from (0,maxval) interval *)
|      let d = Hashtbl.create 0 and
|        size = ref 0 and
|        rnd = ref 0 in
|        while !size < nitems && !size < maxval do
|          rnd := Random.int maxval;
|          if Hashtbl.mem d !rnd 
|          then ()
|          else (incr size;
|                Hashtbl.add d !rnd 1)
|        done;
|      d;;
| 
|    let dintersect d1 d2 =
|      (* intersect two dictionaries: retain elements in d1 that are also in d2 *)
|      (* note: assuming that d1 is smaller in size than d2.  Otherwise not efficient. *)
|      Hashtbl.iter (fun key _ ->
|                      if Hashtbl.mem d2 key then ()
|                      else Hashtbl.remove d1 key) d1;;
| 
|    (* main *)
|    let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and
|        maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;;
|    Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval;
|    flush stdout;
|    let d1 = dcreate nitems maxval and
|        d2 = dcreate nitems maxval in
|      let t1 = Sys.time () in
|        dintersect d1 d2;
|        Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);;
+----

I found that the dintersect function appears to be about 3-4 times
faster than its Python equivalent, and about 10% faster than Java's
native HashSet retainAll() method.

I wonder how to improve OCaml performance in this special case.  Is
there a better suited OCaml data structure to simulate huge hashed
sets of integers rather than by using OCaml's native Hashtbl?

Tibor
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners