New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strengthen the built-in hash function #5225
Comments
Comment author: mgiovann I enclose the results of Knuth's standard battery of tests against random generators. Four hashes are tested against a large (> 200.000) English words. "Std hash" is a bit-correct reimplementation of Combine_small. "Mod hash" is the following hash: let combine_small' h c = Int64.rem (Int64.add (Int64.mul h 9973L) (Int64.of_int (int_of_char c))) 4294967291L where the modulus is (2^32 - 5), a prime. "MurmurHash2" is a C binding of the hash of the same name. "Murmur+Std" is Combine_small whitened by one round of 32-bit MurmurHash2. Chi-square tests are performed by binning the hashes, taking the 2^k least-significant-bits of the hash and accumulating. Taking the 2^(31-k) most significant bits gives markedly worse results (not shown) for the linear-congruential generators, implying that the multipliers are far too small. As can be seen from the test results:
In an unrelated test I've found that Hashtbl under the same data set (i.e., hashing strings) using MurmurHash2 shows around 6% better performance than using Combine_small, even though the hash is computationally more intensive, which shows that a better-behaved hash function can be beneficial. |
Comment author: @xavierleroy Thanks for the suggestions. There are other infelicities in the current implementation of hash_univ, e.g. #5222 and also http://caml.inria.fr/pub/ml-archives/caml-list/2002/08/70fc38b62a61330808f6d14a9e00dd94.en.html The next major release might be a good time to address them all. One thing we must be careful about is that changing the hash function can break programs that store hashtables to persistent storage, so it's better to batch several changes in one major release. |
Comment author: @xavierleroy I'm starting to look at better hash functions and it would save me time if could have your (mgiovanni's) test harness. At your leisure, could you please e-mail it to me? (xavier.leroy AT inria.fr) Thanks. |
Comment author: @xavierleroy SVN trunk now contains a complete reimplementation of OCaml's generic hash function, using the Murmur 3 algorithm. It seems to address the poor distribution issue, as well as other issues of the old generic hash. Planned to go in release 3.13. |
Original bug ID: 5225
Reporter: mgiovann
Assigned to: @xavierleroy
Status: closed (set by @xavierleroy on 2012-09-25T18:06:17Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 3.12.0
Fixed in version: 3.13.0+dev
Category: ~DO NOT USE (was: OCaml general)
Related to: #5222
Monitored by: @ygrek "Julien Signoles" "Christoph Bauer"
Bug description
The built-in hash function in hash.c has extremely poor statistical properties. It is of the form (a*x + b) mod M for M = 2^word size, but the modulus should be prime for best results (and for correct results in the case of Cuckoo Hash, for instance). Tested on a word list it fails Knuth's criteria for chi-square and Kolmogorov-Smirnov tests.
Additional information
Since the hash value is less than 2^31, suitable prime moduli (for Combine and Combine_small) are 2^31 - 19 and 2^31 - 1. In effect, the code for hash.c would be:
#define Alpha 65599
#define Beta 19
#define M0 ((1U << 31) - 19)
#define M1 ((1U << 31) - 1)
#define Combine(new) (hash_accu = (hash_accu * Alpha + (new)) % M0)
#define Combine_small(new) (hash_accu = (hash_accu * Beta + (new)) % M1)
File attachments
The text was updated successfully, but these errors were encountered: