What would you do? You’ve got a C# List<T> full of thousands of objects of the same class. You think there are a few duplicates hiding in there somewhere and you want to root them out. You think you would really like to put your objects into a dictionary, because if you did you could get that shiny O(1)-ness that the cool sleek lines of a dictionary are going to give you.
The only thing is, your keys aren’t exact. They’re doubles or DateTimes or some such, and you can’t use them directly because you really need to apply a tolerance to those values. What you need is an imperfect hash function.
The solution to this problem is obvious to me now, but it wasn’t so obvious at the time, when I was trying to speed up my painfully O(N^2) application. To put your elements into a dictionary and get faster lookups you must include all the fields of the comparison in the Equals() method but exclude the non-exact ones from the GetHashCode() method.
Before I start with this analysis I’d like to apologise to all vegetarians on the planet for eating your friends. Secondly, and more importantly, I’d like to apologise to all decent-folk for producing an example so contrived that it will make you want to vomit. Please pay particular attention to avoid your shoes.
Let’s begin:
public class Beefy
{
    public string name;
    public double percentage;

    public override bool Equals(object o)
    {
        Beefy b = o as Beefy;
        if (b == null) return false;
        // Exact match on the name, tolerant match on the percentage:
        // two cuts count as equal if their percentages are within 1.0 of each other.
        return (name.Equals(b.name) && Math.Abs(percentage - b.percentage) < 1);
    }

    public override int GetHashCode()
    {
        // Only the exact field contributes to the hash; the inexact
        // percentage is deliberately left out.
        return name.GetHashCode();
    }
}
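To see what that buys us, here’s a quick sketch (my own illustrative names and values, run inside whatever test method you like) of how the tolerant Equals() and the name-only GetHashCode() behave together:
// Illustrative values: a and b are "equal" under the tolerance, c is not,
// yet all three share the same hash code because it comes from the name alone.
Beefy a = new Beefy(); a.name = "Topside"; a.percentage = 10.2;
Beefy b = new Beefy(); b.name = "Topside"; b.percentage = 10.9;
Beefy c = new Beefy(); c.name = "Topside"; c.percentage = 25.0;

Console.WriteLine(a.Equals(b));                        // True  - same name, within tolerance
Console.WriteLine(a.Equals(c));                        // False - same name, too far apart
Console.WriteLine(a.GetHashCode() == c.GetHashCode()); // True  - the hash only looks at the name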
This works by relying only on the string "name" to derive the hash code. What this means is kind-of subtle and depends on what the Dictionary<T> does to resolve collisions. A collision occurs when two items hash to the same location. Since for Beefy we're not using a unique identifier to place the items into the Dictionary<T>, that dictionary must have a way of coping when two hash codes resolve to the same place. Now, you would need to check the language you love to find out exactly what it does in this instance, but the .NET 2.0 Dictionary<T> uses a technique called chaining. That is, it creates a secondary data structure, commonly referred to as a bucket, which is really just another list.
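If you want to picture chaining without digging through the framework source, here’s a stripped-down sketch of the idea (mine, not the real Dictionary<T> internals): hash the key, pick a bucket, then walk the little list in that bucket calling Equals() until something matches.
using System.Collections.Generic;

// Conceptual sketch of chained hashing - not the actual Dictionary<T> implementation.
public class ChainedBuckets
{
    private readonly List<Beefy>[] buckets = new List<Beefy>[8];

    public void Add(Beefy item)
    {
        int index = (item.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[index] == null) buckets[index] = new List<Beefy>();
        buckets[index].Add(item);                        // a collision just extends the chain
    }

    public Beefy Find(Beefy key)
    {
        int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[index] == null) return null;
        foreach (Beefy candidate in buckets[index])
            if (candidate.Equals(key)) return candidate; // the tolerant Equals() decides the match
        return null;
    }
}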
Looking at the following diagram (and again, my sincere apologies), you'll see that we have three Topsides in the same bucket. This happened because the hash code for the string "Topside" led us to the fourth bucket. Every time we see "Topside" as the name we will end up inserting the Beefy object into that bucket, and every time we want to retrieve a "Topside" we will start at the first element of the bucket and iterate over the chain until we find a match.
In this way we get the trade-off we need. Hopefully our inexact fields will be few and our exact ones many, so that we get close to O(1) performance; however, if it all goes horribly wrong and every piece of beef we have is "Topside", then we tend towards an O(N) lookup time.
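To bring it back to the original problem of weeding duplicates out of a big List<T>, the de-duplication pass might look something like this sketch (Dictionary<T> standing in for a set, since .NET 2.0 has no HashSet<T>; the method name is mine):
using System.Collections.Generic;

// Sketch: boil a list of Beefy objects down to the tolerant-distinct ones.
// One dictionary probe per element replaces the all-pairs O(N^2) comparison.
public static class BeefyDedup
{
    public static List<Beefy> RemoveDuplicates(List<Beefy> input)
    {
        Dictionary<Beefy, bool> seen = new Dictionary<Beefy, bool>();
        List<Beefy> distinct = new List<Beefy>();
        foreach (Beefy item in input)
        {
            if (!seen.ContainsKey(item))   // hash on the name, tolerant Equals() within the bucket
            {
                seen.Add(item, true);
                distinct.Add(item);
            }
        }
        return distinct;
    }
}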
This started me thinking: exactly what does this continuum look like? In other words, for a trivial class like the one above, what sort of read-access times would I get searching for each of 1000 objects in a variety of data structures? And what happens if I vary the number of distinct hash codes between 1 (everything hashes to the same bucket) and 1000 (everything hashes to a different bucket) and access the same 1000 objects for each hash-code distribution? I ran the process against a Dictionary<T>, my List<T> and a Hashtable, which in .NET has a different collision-resolution mechanism to the dictionary.
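The measuring loop was roughly of this shape (my reconstruction rather than the original harness; the "Beef" + i naming and the handful of sample values of k are just a way of controlling how many distinct hash codes exist): build 1000 objects whose names are drawn from k distinct values, load them into each structure, then time a lookup of every one of them.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

// Rough reconstruction of the experiment: vary the number of distinct
// hash codes (k) and time 1000 lookups against each data structure.
static class HashQualityBenchmark
{
    static void Main()
    {
        const int count = 1000;
        foreach (int k in new int[] { 1, 10, 100, 500, 1000 })
        {
            List<Beefy> items = new List<Beefy>();
            for (int i = 0; i < count; i++)
            {
                Beefy b = new Beefy();
                b.name = "Beef" + (i % k);   // only k distinct names => only k distinct hash codes
                b.percentage = i;            // keeps every object distinct under the tolerant Equals
                items.Add(b);
            }

            Dictionary<Beefy, Beefy> dict = new Dictionary<Beefy, Beefy>();
            Hashtable table = new Hashtable();
            foreach (Beefy b in items) { dict.Add(b, b); table.Add(b, b); }

            Stopwatch sw = Stopwatch.StartNew();
            foreach (Beefy b in items) dict.ContainsKey(b);
            sw.Stop();
            Console.WriteLine("k={0} Dictionary<T>: {1} ticks", k, sw.ElapsedTicks);

            sw = Stopwatch.StartNew();
            foreach (Beefy b in items) table.ContainsKey(b);
            sw.Stop();
            Console.WriteLine("k={0} Hashtable:     {1} ticks", k, sw.ElapsedTicks);

            sw = Stopwatch.StartNew();
            foreach (Beefy b in items) items.Contains(b);   // List<T> is a linear scan per lookup
            sw.Stop();
            Console.WriteLine("k={0} List<T>:       {1} ticks", k, sw.ElapsedTicks);
        }
    }
}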
The results were almost as I expected. The Dictionary<T> is faster than the Hashtable because when a Hashtable has an imperfect hash the collisions 'spill' over one another, which means that future values can potentially be mis-placed as well, not just the current one; that gap closes as the quality of the hash improves. Secondly, the way I set up my List<T>, the items being searched for should on average be found in the middle (requiring iteration over half of the List<T> to find a single element). You can see that for a very imperfect hash, where there are fewer than 100 distinct hashes in 1000 items, it's better to use a List<T> than either a Hashtable or a Dictionary<T>. At a 90% perfect hash, though, the List<T> is comfortably beaten by both.
All good news for me, because my real hash code was much more discriminating than that, typically sitting above the 90% perfect range. So the List<T> has gone and my hash is meaty, sister. Very meaty indeed.