Browse thread
[Caml-list] intersecting huge integer sets
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: | 2002-08-27 (12:07) |
From: | Yaron M. Minsky <yminsky@C...> |
Subject: | Re: [Caml-list] intersecting huge integer sets |
If the sets are really as dense as you say they are (on the order of 50% of all possible elements), then the fastest thing is probably just to use a bool array. Such an approach would require 3 passes: one to set all the elements of the array to false. One to mark the elements corresponding to set 1 to true, and then one pass over set 2 to see which of its elements are shared by 1. y On Tue, 2002-08-27 at 07:20, Tibor Simko wrote: > Hello > > I wonder what is the OCaml canonical way to get fast intersection of > huge sets of integers: say, to intersect two sets of 500000 items > containing distinct integer values from the (0,1000000) interval. > > I quickly tested the Hashtbl way, along the lines: > > +---- > | let dsize d = > | (* return length of dictionary *) > | let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;; > | > | let dcreate nitems maxval = > | (* create dictionary of nitems with integer keys from (0,maxval) interval *) > | let d = Hashtbl.create 0 and > | size = ref 0 and > | rnd = ref 0 in > | while !size < nitems && !size < maxval do > | rnd := Random.int maxval; > | if Hashtbl.mem d !rnd > | then () > | else (incr size; > | Hashtbl.add d !rnd 1) > | done; > | d;; > | > | let dintersect d1 d2 = > | (* intersect two dictionaries: retain elements in d1 that are also in d2 *) > | (* note: assuming that d1 is smaller in size than d2. Otherwise not efficient. *) > | Hashtbl.iter (fun key _ -> > | if Hashtbl.mem d2 key then () > | else Hashtbl.remove d1 key) d1;; > | > | (* main *) > | let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and > | maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;; > | Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval; > | flush stdout; > | let d1 = dcreate nitems maxval and > | d2 = dcreate nitems maxval in > | let t1 = Sys.time () in > | dintersect d1 d2; > | Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);; > +---- > > I found that the dintersect function appears to be about 3-4 times > faster than its Python equivalent, and about 10% faster than Java's > native HashSet retainAll() method. > > I wonder how to improve OCaml performance in this special case. Is > there a better suited OCaml data structure to simulate huge hashed > sets of integers rather than by using OCaml's native Hashtbl? > > Tibor > ------------------- > To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr > Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners -- |--------/ Yaron M. Minsky \--------| |--------\ http://www.cs.cornell.edu/home/yminsky/ /--------| Open PGP --- KeyID B1FFD916 (new key as of Dec 4th) Fingerprint: 5BF6 83E1 0CE3 1043 95D8 F8D5 9F12 B3A9 B1FF D916 ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners