As explained in my blog post on OCaml, I have toy examples that I write for every language I attempt to learn. I realized the other day that I don't have a good data structures example. My spellcheck example was meant to do that, but I realized I kept cheating using a language's hash table if it was available; the delegate pattern is not a data structure.
So last night I decided I would take out the heapsort and spellcheck examples I do and replace them with a hash set. I would read in a file with newline-delimited words to be put into a hash set to make sure that file I/O was not lost. The arguments on the command-line would be checked against the hash set and any that were in the set would be printed to stdout. That way I get to combine both a data structures and file I/O example into one so that it is easy to verify the hash set works properly.
Designing a hash set means choosing both a collision strategy and a hash function. I already knew I wanted to use chaining for the collision strategy since that forces the creation of a linked list instead of using just an array for open addressing and thus just creating another delegate situation. But choosing a hash function was another matter.
I looked up stuff on hash functions today and discovered a lot of info that was never taught to me. It seems the three popular hash functions these days is the Jenkins, FNV, and Hsieh hash functions. They all have very nice randomization values thanks to a focus on "avalanching", which says that most of the time you want a single bit change in the intput to change about half of the bits in the output. All three have varying pros and cons, but in general all seem very strong when you need a general hash function.
Being me, I also checked out Python's implementation. Tim Peters pointed out that Python's hash function for strings was much like the FNV hash function by accident. But it has different characteristics for strings that are similar by not spacing them out but putting them near each other. This has the perk of lowering collision possibilities which is good when your namespaces are hash tables and people tend to use very similar variable names (e.g., var_a, var_b, etc.).
I have pretty much decided that in my example I am going to use Python's implementation. As Time has said, "we read Knuth so you don't have to" (or at least Tim has). And if I need an explanation about some magic number in the algorithm I can go straight to the source. Plus it has the perk of being a very simple algorithm that is easy to implement.
Resizing will double the number of buckets.As for the rest of the hash set, I am going to resize when the usage hits or surpasses 2/3. I am going to start off with a bucket count of 8 since my planned inputs are 13 and 50 (Canadian provinces/territories and US states, respectively) and that will guarantee a couple resizings. Should be a fun challenge to get this all implemented in various languages.
2008-02-19
Subscribe to:
Post Comments (Atom)
4 comments:
Avalanche,
(1) as you write: "a single bit change in the intput [changes] about half of the bits in the output"
(2) as it really is: due to a single bit change in the input, each bit in the output has a probability of 1/2 to change.
(2) implies (1). (Toss a coin N times, heads means no change, tails means change. How many tails are you likely to have? Mean of sum of independent indicator variables.)
(1) does *not* imply (2).
For example, create a function that assigns, to any input, a two-bit sequence, X0. The first bit is X, the second bit is always 0. For any input, compute the parity bit over the whole input sequence, and store it into X. Now, if you flip one bit in the input, it changes half of the bits in the output, namely the first bit, X (satisfying (1)). However, this "construct" doesn't satisfy (2), because when flipping a bit in the input, the first output bit flips with a propbability of 1, and the second output bit flips with a probability of 0.
If you have a crypto hash that satisfies (2) and produces N bits of output, you can take any fixed {i1, i2, in}, n <= N set of bit-positions, and (hash_i1, ... hash_in) will be a good general purpose hash (looking at the scattering, not at the performance). For example, you could pick the first 16 bits of 128 bit MD5.
Sorry, now I see I was talking about "strict avalanche".
http://en.wikipedia.org/wiki/Avalanche_effect#Strict_avalanche_criterion
I'm no hash function expert, but have you come across Bob Jenkins's lookup2 and lookup3? They claim to be the best hash functions out there today (fastest and most thorough).
@Josh:
Double check the link I have in the blog post; it's the same as yours. =) I am not going to use it for my toy examples as I want a simple hashing algorithm and the one from Jenkin's are not as simple as the other ones.
Post a Comment