Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] Hashtbl and max_len
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Xavier Leroy <xavier.leroy@i...>
Subject: Re: [Caml-list] Hashtbl and max_len
> The initial_size argument of Hashtbl.create has much bigger impact on
> efficiency than I thought.
> [...]
> IMHO max_len shouldn't grow that much. Why does it grow at all when
> the current size is not near Sys.max_array_length?

The original strategy was indeed to bound the length of buckets by a
constant, and resize when a bucket exceeds this length.  However, this
results in excessive resizing when the hashing causes collisions, or
(worse!) the same key is repeatedly inserted in the hash table.
Hence, as a stopgap, we decided to increase the maximal bucket length
at each resizing.

There are two problems with this strategy:
- as you noticed, this causes the initial size to impact performance
  significantly;
- checking for long buckets becomes linear in the size of the table,
  resulting in quadratic behavior for long sequences of insertions.

The latter problem was the last nail in the coffin of this
implementation.  In the working sources, we now keep track of the size
(total number of elements inserted in the table) and base resizing
decisions on this.  In case the hashing is good (all buckets have
approximately the same length), this results in constant length for
buckets.  If the hashing is not good, the performances degrade more
gracefully than with the old scheme.

- Xavier Leroy

PS.  Thanks for pointing out the typos in the working implementation.
The dangers of cut-and-paste programming...
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr