| Anonymous | Login | Signup for a new account | 2013-06-20 06:51 CEST | ![]() |
| Main | My View | View Issues | Change Log | Roadmap |
| View Issue Details [ Jump to Notes ] | [ Issue History ] [ Print ] | |||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | |||
| 0005225 | OCaml | OCaml general | public | 2011-02-16 13:51 | 2012-09-25 20:06 | |||
| Reporter | mgiovann | |||||||
| Assigned To | xleroy | |||||||
| Priority | normal | Severity | feature | Reproducibility | always | |||
| Status | closed | Resolution | fixed | |||||
| Platform | OS | OS Version | ||||||
| Product Version | 3.12.0 | |||||||
| Target Version | Fixed in Version | 3.13.0+dev | ||||||
| Summary | 0005225: Strengthen the built-in hash function | |||||||
| Description | The built-in hash function in hash.c has extremely poor statistical properties. It is of the form (a*x + b) mod M for M = 2^word size, but the modulus should be prime for best results (and for correct results in the case of Cuckoo Hash, for instance). Tested on a word list it fails Knuth's criteria for chi-square and Kolmogorov-Smirnov tests. | |||||||
| Additional Information | Since the hash value is less than 2^31, suitable prime moduli (for Combine and Combine_small) are 2^31 - 19 and 2^31 - 1. In effect, the code for hash.c would be: #define Alpha 65599 #define Beta 19 #define M0 ((1U << 31) - 19) #define M1 ((1U << 31) - 1) #define Combine(new) (hash_accu = (hash_accu * Alpha + (new)) % M0) #define Combine_small(new) (hash_accu = (hash_accu * Beta + (new)) % M1) | |||||||
| Tags | No tags attached. | |||||||
| Attached Files | ||||||||
Notes |
|
|
(0005823) mgiovann (reporter) 2011-02-17 15:53 |
I enclose the results of Knuth's standard battery of tests against random generators. Four hashes are tested against a large (> 200.000) English words. "Std hash" is a bit-correct reimplementation of Combine_small. "Mod hash" is the following hash: let combine_small' h c = Int64.rem (Int64.add (Int64.mul h 9973L) (Int64.of_int (int_of_char c))) 4294967291L where the modulus is (2^32 - 5), a prime. "MurmurHash2" is a C binding of the hash of the same name. "Murmur+Std" is Combine_small whitened by one round of 32-bit MurmurHash2. Chi-square tests are performed by binning the hashes, taking the 2^k least-significant-bits of the hash and accumulating. Taking the 2^(31-k) most significant bits gives markedly worse results (not shown) for the linear-congruential generators, implying that the multipliers are far too small. As can be seen from the test results: * Combine_small's distribution is not uniform over 31 bits (K+ too large). Also, and more worrying, it has a very high number of collisions * The proposed universal hash has smoother Chi-2 statistics, although it is a bit too high as shown in the K+ statistic. However, it has a very low number of collisions, an order of magnitude better than Combine_small. * MurmurHash2 has a very smooth distribution of values * Whitening Combine_small gives it better statistical properties but of course it has at least as many collisions as the underlying hash function has. In an unrelated test I've found that Hashtbl under the same data set (i.e., hashing strings) using MurmurHash2 shows around 6% *better* performance than using Combine_small, even though the hash is computationally more intensive, which shows that a better-behaved hash function can be beneficial. |
|
(0005837) xleroy (administrator) 2011-03-04 16:29 |
Thanks for the suggestions. There are other infelicities in the current implementation of hash_univ, e.g. PR#5222 and also http://caml.inria.fr/pub/ml-archives/caml-list/2002/08/70fc38b62a61330808f6d14a9e00dd94.en.html [^] The next major release might be a good time to address them all. One thing we must be careful about is that changing the hash function can break programs that store hashtables to persistent storage, so it's better to batch several changes in one major release. |
|
(0005933) xleroy (administrator) 2011-05-27 19:00 |
I'm starting to look at better hash functions and it would save me time if could have your (mgiovanni's) test harness. At your leisure, could you please e-mail it to me? (xavier.leroy AT inria.fr) Thanks. |
|
(0005934) xleroy (administrator) 2011-05-29 12:09 |
SVN trunk now contains a complete reimplementation of OCaml's generic hash function, using the Murmur 3 algorithm. It seems to address the poor distribution issue, as well as other issues of the old generic hash. Planned to go in release 3.13. |
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2011-02-16 13:51 | mgiovann | New Issue | |
| 2011-02-17 15:36 | mgiovann | File Added: statistics.txt | |
| 2011-02-17 15:53 | mgiovann | Note Added: 0005823 | |
| 2011-03-04 16:26 | xleroy | Relationship added | related to 0005222 |
| 2011-03-04 16:29 | xleroy | Note Added: 0005837 | |
| 2011-03-04 16:29 | xleroy | Assigned To | => xleroy |
| 2011-03-04 16:29 | xleroy | Status | new => confirmed |
| 2011-05-27 19:00 | xleroy | Note Added: 0005933 | |
| 2011-05-29 12:09 | xleroy | Note Added: 0005934 | |
| 2011-05-29 12:09 | xleroy | Status | confirmed => resolved |
| 2011-05-29 12:09 | xleroy | Resolution | open => fixed |
| 2011-05-29 12:09 | xleroy | Fixed in Version | => 3.13.0+dev |
| 2012-09-25 20:06 | xleroy | Status | resolved => closed |
| Copyright © 2000 - 2011 MantisBT Group |