2008-02-19

There is a lot more to hash functions than what they teach you in school

As explained in my blog post on OCaml, I have toy examples that I write for every language I attempt to learn. I realized the other day that I don't have a good data structures example. My spellcheck example was meant to fill that role, but I kept cheating by using a language's built-in hash table whenever one was available; the delegate pattern is not a data structure.

So last night I decided I would drop the heapsort and spellcheck examples and replace them with a hash set. The program would read in a file of newline-delimited words and put them into the hash set, so that the file I/O aspect was not lost. The command-line arguments would then be checked against the hash set, and any found in the set would be printed to stdout. That way I combine a data structures example and a file I/O example into one, which makes it easy to verify the hash set works properly.
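Concretely, the harness I have in mind is something like this sketch (with Python's built-in set standing in for the hand-rolled hash set, purely so the snippet runs on its own):

    import sys

    # Stand-in so the sketch runs by itself; the whole point of the exercise
    # is to replace this with a hand-rolled hash set.
    HashSet = set

    def main():
        # File I/O half: read newline-delimited words into the set.
        with open(sys.argv[1]) as f:
            words = HashSet(line.strip() for line in f if line.strip())
        # Data structure half: print each argument that is in the set.
        for arg in sys.argv[2:]:
            if arg in words:
                print(arg)

    if __name__ == "__main__":
        main()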

Designing a hash set means choosing both a collision strategy and a hash function. I already knew I wanted chaining for the collision strategy: it forces me to build a linked list, whereas open addressing needs nothing but an array and would just be another delegate situation. Choosing a hash function was another matter.
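A sketch of the chaining approach (hypothetical names, and Python's built-in hash() as a placeholder until I pick a hash function):

    class _Node:
        """One link in a bucket's chain."""
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    class HashSet:
        def __init__(self, n_buckets=8):
            self.buckets = [None] * n_buckets
            self.used = 0  # number of values stored

        def add(self, value):
            index = hash(value) % len(self.buckets)
            node = self.buckets[index]
            while node is not None:  # walk the chain looking for a duplicate
                if node.value == value:
                    return
                node = node.next
            # Prepend a new node to this bucket's chain.
            self.buckets[index] = _Node(value, self.buckets[index])
            self.used += 1

        def __contains__(self, value):
            node = self.buckets[hash(value) % len(self.buckets)]
            while node is not None:
                if node.value == value:
                    return True
                node = node.next
            return False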

I looked up stuff on hash functions today and discovered a lot that was never taught to me in school. It seems the three popular hash functions these days are the Jenkins, FNV, and Hsieh hash functions. They all distribute their output very well thanks to a focus on "avalanching": the property that flipping a single bit of the input should, on average, change about half of the bits of the output. All three have varying pros and cons, but in general all seem very strong when you need a general hash function.
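FNV in particular is tiny. A sketch of the 32-bit FNV-1a variant (the offset basis and prime are the published FNV constants):

    def fnv1a_32(data: bytes) -> int:
        """32-bit FNV-1a: fold each byte in with XOR, then multiply."""
        h = 2166136261  # FNV 32-bit offset basis
        for byte in data:
            h ^= byte
            h = (h * 16777619) & 0xFFFFFFFF  # FNV 32-bit prime, kept to 32 bits
        return h

Thanks to the avalanching, fnv1a_32(b"var_a") and fnv1a_32(b"var_b") end up far apart even though the inputs differ only in the last byte.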

Being me, I also checked out Python's implementation. Tim Peters has pointed out that Python's hash function for strings turned out much like the FNV hash function by accident. But it has different characteristics for similar strings: instead of spacing them out, it hashes them to nearby values. This has the perk of lowering collision possibilities, which is good when your namespaces are hash tables and people tend to use very similar variable names (e.g., var_a, var_b, etc.).
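From my read of the CPython source, the string hash is roughly the following (a Python re-creation of the C code, with the arithmetic kept to 32 bits for readability):

    def pystring_hash(s: str) -> int:
        """Rough re-creation of CPython 2.x's string hash."""
        if not s:
            return 0
        x = ord(s[0]) << 7
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & 0xFFFFFFFF
        x ^= len(s)
        return x

Since only the final character differs, pystring_hash("var_a") and pystring_hash("var_b") differ only in their lowest two bits, so similar names land near each other instead of being scattered.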

I have pretty much decided that in my example I am going to use Python's algorithm. As Tim has said, "we read Knuth so you don't have to" (or at least Tim has). And if I need an explanation of some magic number in the algorithm I can go straight to the source. Plus it has the perk of being a very simple algorithm that is easy to implement.

As for the rest of the hash set, I am going to resize when the usage hits or surpasses 2/3, with each resize doubling the number of buckets. I am going to start off with a bucket count of 8, since my planned inputs are 13 and 50 words (Canadian provinces/territories and US states, respectively) and that will guarantee a couple of resizings. It should be a fun challenge to get this all implemented in various languages.
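Continuing the earlier sketch, add() would finish with a call to self._maybe_resize(), which might look like:

    def _maybe_resize(self):
        # Grow when used/buckets >= 2/3; cross-multiply to avoid float math.
        if 3 * self.used < 2 * len(self.buckets):
            return
        old_buckets = self.buckets
        self.buckets = [None] * (2 * len(old_buckets))  # double the buckets
        for head in old_buckets:
            node = head
            while node is not None:
                nxt = node.next
                # Re-chain the existing node into its new bucket.
                index = hash(node.value) % len(self.buckets)
                node.next = self.buckets[index]
                self.buckets[index] = node
                node = nxt
        # self.used is unchanged: same values, just spread over more buckets.

With 8 buckets and the 2/3 threshold, the 13-word input triggers two resizes (at the 6th and 11th words) and the 50-word input four, so both exercise the growth path.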