Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] intersecting huge integer sets
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Yaron M. Minsky <yminsky@C...>
Subject: Re: [Caml-list] intersecting huge integer sets
If the sets are really as dense as you say they are (on the order of 50%
of all possible elements), then the fastest thing is probably just to
use a bool array.  Such an approach would require 3 passes:   one to set
all the elements of the array to false.  One to mark the elements
corresponding to set 1 to true, and then one pass over set 2 to see
which of its elements are shared by 1.

y

On Tue, 2002-08-27 at 07:20, Tibor Simko wrote:
> Hello
> 
> I wonder what is the OCaml canonical way to get fast intersection of
> huge sets of integers: say, to intersect two sets of 500000 items
> containing distinct integer values from the (0,1000000) interval.  
> 
> I quickly tested the Hashtbl way, along the lines:
> 
> +---- 
> |    let dsize d = 
> |      (* return length of dictionary *)
> |      let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;;  
> | 
> |    let dcreate nitems maxval =     
> |      (* create dictionary of nitems with integer keys from (0,maxval) interval *)
> |      let d = Hashtbl.create 0 and
> |        size = ref 0 and
> |        rnd = ref 0 in
> |        while !size < nitems && !size < maxval do
> |          rnd := Random.int maxval;
> |          if Hashtbl.mem d !rnd 
> |          then ()
> |          else (incr size;
> |                Hashtbl.add d !rnd 1)
> |        done;
> |      d;;
> | 
> |    let dintersect d1 d2 =
> |      (* intersect two dictionaries: retain elements in d1 that are also in d2 *)
> |      (* note: assuming that d1 is smaller in size than d2.  Otherwise not efficient. *)
> |      Hashtbl.iter (fun key _ ->
> |                      if Hashtbl.mem d2 key then ()
> |                      else Hashtbl.remove d1 key) d1;;
> | 
> |    (* main *)
> |    let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and
> |        maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;;
> |    Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval;
> |    flush stdout;
> |    let d1 = dcreate nitems maxval and
> |        d2 = dcreate nitems maxval in
> |      let t1 = Sys.time () in
> |        dintersect d1 d2;
> |        Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);;
> +----
> 
> I found that the dintersect function appears to be about 3-4 times
> faster than its Python equivalent, and about 10% faster than Java's
> native HashSet retainAll() method.
> 
> I wonder how to improve OCaml performance in this special case.  Is
> there a better suited OCaml data structure to simulate huge hashed
> sets of integers rather than by using OCaml's native Hashtbl?
> 
> Tibor
> -------------------
> To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
> Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
-- 
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|

Open PGP --- KeyID B1FFD916 (new key as of Dec 4th)
Fingerprint: 5BF6 83E1 0CE3 1043 95D8 F8D5 9F12 B3A9 B1FF D916

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners