Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0005225OCamlOCaml generalpublic2011-02-16 13:512012-09-25 20:06
Reportermgiovann 
Assigned Toxleroy 
PrioritynormalSeverityfeatureReproducibilityalways
StatusclosedResolutionfixed 
PlatformOSOS Version
Product Version3.12.0 
Target VersionFixed in Version3.13.0+dev 
Summary0005225: Strengthen the built-in hash function
DescriptionThe built-in hash function in hash.c has extremely poor statistical properties. It is of the form (a*x + b) mod M for M = 2^word size, but the modulus should be prime for best results (and for correct results in the case of Cuckoo Hash, for instance). Tested on a word list it fails Knuth's criteria for chi-square and Kolmogorov-Smirnov tests.
Additional InformationSince the hash value is less than 2^31, suitable prime moduli (for Combine and Combine_small) are 2^31 - 19 and 2^31 - 1. In effect, the code for hash.c would be:


#define Alpha 65599
#define Beta 19
#define M0 ((1U << 31) - 19)
#define M1 ((1U << 31) - 1)
#define Combine(new) (hash_accu = (hash_accu * Alpha + (new)) % M0)
#define Combine_small(new) (hash_accu = (hash_accu * Beta + (new)) % M1)
TagsNo tags attached.
Attached Filestxt file icon statistics.txt [^] (4,249 bytes) 2011-02-17 15:36 [Show Content]

- Relationships
related to 0005222closedxleroy Hashtbl.hash is not compatible with (=) 

-  Notes
(0005823)
mgiovann (reporter)
2011-02-17 15:53

I enclose the results of Knuth's standard battery of tests against random generators. Four hashes are tested against a large (> 200.000) English words. "Std hash" is a bit-correct reimplementation of Combine_small. "Mod hash" is the following hash:

let combine_small' h c = Int64.rem (Int64.add (Int64.mul h 9973L) (Int64.of_int (int_of_char c))) 4294967291L

where the modulus is (2^32 - 5), a prime. "MurmurHash2" is a C binding of the hash of the same name. "Murmur+Std" is Combine_small whitened by one round of 32-bit MurmurHash2.

Chi-square tests are performed by binning the hashes, taking the 2^k least-significant-bits of the hash and accumulating. Taking the 2^(31-k) most significant bits gives markedly worse results (not shown) for the linear-congruential generators, implying that the multipliers are far too small.

As can be seen from the test results:

* Combine_small's distribution is not uniform over 31 bits (K+ too large).
  Also, and more worrying, it has a very high number of collisions
* The proposed universal hash has smoother Chi-2 statistics, although
  it is a bit too high as shown in the K+ statistic. However, it has a very
  low number of collisions, an order of magnitude better than Combine_small.
* MurmurHash2 has a very smooth distribution of values
* Whitening Combine_small gives it better statistical properties but of
  course it has at least as many collisions as the underlying hash function has.

In an unrelated test I've found that Hashtbl under the same data set (i.e., hashing strings) using MurmurHash2 shows around 6% *better* performance than using Combine_small, even though the hash is computationally more intensive, which shows that a better-behaved hash function can be beneficial.
(0005837)
xleroy (administrator)
2011-03-04 16:29

Thanks for the suggestions. There are other infelicities in the current implementation of hash_univ, e.g. PR#5222 and also

http://caml.inria.fr/pub/ml-archives/caml-list/2002/08/70fc38b62a61330808f6d14a9e00dd94.en.html [^]

The next major release might be a good time to address them all. One thing we must be careful about is that changing the hash function can break programs that store hashtables to persistent storage, so it's better to batch several changes in one major release.
(0005933)
xleroy (administrator)
2011-05-27 19:00

I'm starting to look at better hash functions and it would save me time if could have your (mgiovanni's) test harness. At your leisure, could you please e-mail it to me? (xavier.leroy AT inria.fr) Thanks.
(0005934)
xleroy (administrator)
2011-05-29 12:09

SVN trunk now contains a complete reimplementation of OCaml's generic hash function, using the Murmur 3 algorithm. It seems to address the poor distribution issue, as well as other issues of the old generic hash. Planned to go in release 3.13.


- Issue History
Date Modified Username Field Change
2011-02-16 13:51 mgiovann New Issue
2011-02-17 15:36 mgiovann File Added: statistics.txt
2011-02-17 15:53 mgiovann Note Added: 0005823
2011-03-04 16:26 xleroy Relationship added related to 0005222
2011-03-04 16:29 xleroy Note Added: 0005837
2011-03-04 16:29 xleroy Assigned To => xleroy
2011-03-04 16:29 xleroy Status new => confirmed
2011-05-27 19:00 xleroy Note Added: 0005933
2011-05-29 12:09 xleroy Note Added: 0005934
2011-05-29 12:09 xleroy Status confirmed => resolved
2011-05-29 12:09 xleroy Resolution open => fixed
2011-05-29 12:09 xleroy Fixed in Version => 3.13.0+dev
2012-09-25 20:06 xleroy Status resolved => closed


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker