Browse thread
[Caml-list] intersecting huge integer sets
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
| Date: | -- (:) |
| From: | Tibor Simko <tibor.simko@c...> |
| Subject: | [Caml-list] intersecting huge integer sets |
Hello I wonder what is the OCaml canonical way to get fast intersection of huge sets of integers: say, to intersect two sets of 500000 items containing distinct integer values from the (0,1000000) interval. I quickly tested the Hashtbl way, along the lines: +---- | let dsize d = | (* return length of dictionary *) | let n = ref 0 in Hashtbl.iter (fun _ _ -> incr n) d; !n;; | | let dcreate nitems maxval = | (* create dictionary of nitems with integer keys from (0,maxval) interval *) | let d = Hashtbl.create 0 and | size = ref 0 and | rnd = ref 0 in | while !size < nitems && !size < maxval do | rnd := Random.int maxval; | if Hashtbl.mem d !rnd | then () | else (incr size; | Hashtbl.add d !rnd 1) | done; | d;; | | let dintersect d1 d2 = | (* intersect two dictionaries: retain elements in d1 that are also in d2 *) | (* note: assuming that d1 is smaller in size than d2. Otherwise not efficient. *) | Hashtbl.iter (fun key _ -> | if Hashtbl.mem d2 key then () | else Hashtbl.remove d1 key) d1;; | | (* main *) | let nitems = if Array.length Sys.argv > 1 then int_of_string Sys.argv.(1) else 300000 and | maxval = if Array.length Sys.argv > 2 then int_of_string Sys.argv.(2) else 500000;; | Printf.printf "intersecting sets of %d items from (0,%d) ... " nitems maxval; | flush stdout; | let d1 = dcreate nitems maxval and | d2 = dcreate nitems maxval in | let t1 = Sys.time () in | dintersect d1 d2; | Printf.printf "%d common items retained in %.2f seconds.\n" (dsize d1) (Sys.time () -. t1);; +---- I found that the dintersect function appears to be about 3-4 times faster than its Python equivalent, and about 10% faster than Java's native HashSet retainAll() method. I wonder how to improve OCaml performance in this special case. Is there a better suited OCaml data structure to simulate huge hashed sets of integers rather than by using OCaml's native Hashtbl? Tibor ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners