Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] Hashtbl and max_len
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Marcin 'Qrczak' Kowalczyk <qrczak@k...>
Subject: [Caml-list] Hashtbl and max_len
let create initial_size =
  let s = if initial_size < 1 then 1 else initial_size in
  let s = if s > Sys.max_array_length then Sys.max_array_length else s in
  { max_len = 3; data = Array.make s Empty }

let resize hashfun tbl =
  let odata = tbl.data in
  let osize = Array.length odata in
  let nsize = min (2 * osize + 1) Sys.max_array_length in
  [...]
  tbl.max_len <- 2 * tbl.max_len

The initial_size argument of Hashtbl.create has much bigger impact on
efficiency than I thought. The documentation is misleading in saying
that "The table grows as needed, so n is just an initial guess",
thus suggesting that it shouldn't matter much what was given after
the table is filled.

The reality is that the initial_size establishes an approximate ratio
of bucket count to bucket size which persists for the whole life of
this table.

A table created by Hashtbl.create 10 and filled with 1000 entries is
approximately 10 times slower in element access than a table created
by Hashtbl.create 1000 with the same entries!

This is unfortunate when Hashtbl is wrapped in an API which
doesn't always provide initial_size. The wrapper doesn't know which
initial_size to set. It can't say: "let's make this table small
initially, because there might be thousands of such tables with 5
entries in each, but if thousands of entries are put into this table,
it shouldn't become much slower than it would be if it knew its size
from the beginning".

In my case equality and hashing is quite slow by nature and thus
having too big max_len or frequent resizing gives horrible performance.
OTOH always providing large initial_size wastes memory.

IMHO max_len shouldn't grow that much. Why does it grow at all when
the current size is not near Sys.max_array_length?

Anyway, besides considering including a tweaked hashtbl.ml, I seek for
a better general-purpose imperative dictionary which uses arbitrary
hashing and equality and behaves well when these functions are slow
and the size of a table is not known initially.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^
QRCZAK

-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr