Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strengthen the built-in hash function #5225

Closed
vicuna opened this issue Feb 16, 2011 · 4 comments
Closed

Strengthen the built-in hash function #5225

vicuna opened this issue Feb 16, 2011 · 4 comments
Assignees

Comments

@vicuna
Copy link

vicuna commented Feb 16, 2011

Original bug ID: 5225
Reporter: mgiovann
Assigned to: @xavierleroy
Status: closed (set by @xavierleroy on 2012-09-25T18:06:17Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 3.12.0
Fixed in version: 3.13.0+dev
Category: ~DO NOT USE (was: OCaml general)
Related to: #5222
Monitored by: @ygrek "Julien Signoles" "Christoph Bauer"

Bug description

The built-in hash function in hash.c has extremely poor statistical properties. It is of the form (a*x + b) mod M for M = 2^word size, but the modulus should be prime for best results (and for correct results in the case of Cuckoo Hash, for instance). Tested on a word list it fails Knuth's criteria for chi-square and Kolmogorov-Smirnov tests.

Additional information

Since the hash value is less than 2^31, suitable prime moduli (for Combine and Combine_small) are 2^31 - 19 and 2^31 - 1. In effect, the code for hash.c would be:

#define Alpha 65599
#define Beta 19
#define M0 ((1U << 31) - 19)
#define M1 ((1U << 31) - 1)
#define Combine(new) (hash_accu = (hash_accu * Alpha + (new)) % M0)
#define Combine_small(new) (hash_accu = (hash_accu * Beta + (new)) % M1)

File attachments

@vicuna
Copy link
Author

vicuna commented Feb 17, 2011

Comment author: mgiovann

I enclose the results of Knuth's standard battery of tests against random generators. Four hashes are tested against a large (> 200.000) English words. "Std hash" is a bit-correct reimplementation of Combine_small. "Mod hash" is the following hash:

let combine_small' h c = Int64.rem (Int64.add (Int64.mul h 9973L) (Int64.of_int (int_of_char c))) 4294967291L

where the modulus is (2^32 - 5), a prime. "MurmurHash2" is a C binding of the hash of the same name. "Murmur+Std" is Combine_small whitened by one round of 32-bit MurmurHash2.

Chi-square tests are performed by binning the hashes, taking the 2^k least-significant-bits of the hash and accumulating. Taking the 2^(31-k) most significant bits gives markedly worse results (not shown) for the linear-congruential generators, implying that the multipliers are far too small.

As can be seen from the test results:

  • Combine_small's distribution is not uniform over 31 bits (K+ too large).
    Also, and more worrying, it has a very high number of collisions
  • The proposed universal hash has smoother Chi-2 statistics, although
    it is a bit too high as shown in the K+ statistic. However, it has a very
    low number of collisions, an order of magnitude better than Combine_small.
  • MurmurHash2 has a very smooth distribution of values
  • Whitening Combine_small gives it better statistical properties but of
    course it has at least as many collisions as the underlying hash function has.

In an unrelated test I've found that Hashtbl under the same data set (i.e., hashing strings) using MurmurHash2 shows around 6% better performance than using Combine_small, even though the hash is computationally more intensive, which shows that a better-behaved hash function can be beneficial.

@vicuna
Copy link
Author

vicuna commented Mar 4, 2011

Comment author: @xavierleroy

Thanks for the suggestions. There are other infelicities in the current implementation of hash_univ, e.g. #5222 and also

http://caml.inria.fr/pub/ml-archives/caml-list/2002/08/70fc38b62a61330808f6d14a9e00dd94.en.html

The next major release might be a good time to address them all. One thing we must be careful about is that changing the hash function can break programs that store hashtables to persistent storage, so it's better to batch several changes in one major release.

@vicuna
Copy link
Author

vicuna commented May 27, 2011

Comment author: @xavierleroy

I'm starting to look at better hash functions and it would save me time if could have your (mgiovanni's) test harness. At your leisure, could you please e-mail it to me? (xavier.leroy AT inria.fr) Thanks.

@vicuna
Copy link
Author

vicuna commented May 29, 2011

Comment author: @xavierleroy

SVN trunk now contains a complete reimplementation of OCaml's generic hash function, using the Murmur 3 algorithm. It seems to address the poor distribution issue, as well as other issues of the old generic hash. Planned to go in release 3.13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants