[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: | 2010-02-20 (13:54) |
From: | Jacques Garrigue <garrigue@m...> |
Subject: | Re: [Caml-list] range of hash function |
From: Grégoire Seux <kamaradclimber@gmail.com> > i would like to use the polymorhpic hash function on strings. But i would > like to know what is the probability of a collision between two hashes. > > my first question is about the range of the Hashtbl.hash function: what is > its range ? ( string -> [1..N] ?) Just to get things straight: this is 0..2^30-1 (0..0x3fffffff). The result of the hash function is the same on 32-bit and 64-bit architectures. > the second question is : can i assume that the result is a uniform > distribution over [1..N] ? (for 10ⶠwords which is an estimation of the > english vocabulary size) The algorithm for strings is as follows: i = caml_string_length(obj); for (p = &Byte_u(obj, 0); i > 0; i--, p++) hash_accu = hash_accu * 19 + *p; So you can assume an uniform distribution for sufficiently long strings. > the third one is : is it possible to predict which will be the collision ? I > mean collisions are between words which are very 'similar' (for ex: "boy" > and "boys") or are completely unpredictable. Since you have the algorithm, you can predict collisions. Basically shifting character n by 1 is equivalent to shifting character n+1 by 19, so you have lots of collisions. But this hash function being intended for hashtables, collisions are not a problem, uniform distribution matters more. By the way, for polymorphic variants collisions matter, and the hash function is different. The range is 31-bits rather than 30-bits, and the factor is 243, so that names of no more than 4 characters are guaranteed to be different. You still have collisions, but they are going to be less similar. Both hash functions are defined in byterun/hash.c. Hope this helps, Jacques Garrigue