
Deduplication. Or: How I learned to stop worrying and love the HashSet<T>

Let’s say you have a relatively large set of data.
For our example, it’s about half a million records. They’re all in memory, and you used CsvHelper to load them, because you don’t like reinventing the wheel – especially a very well thought-out, tried, and proven wheel.
(Sidenote: Trying to do a .Take(BatchSize) on the CsvHelper.Read() method caused worlds of pain for me on rusty disks. YMMV)
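
For context, the load itself was nothing fancy – something like this (a sketch, not the exact code; it assumes a FancyData POCO whose properties line up with the CSV headers, and the CsvReader constructor arguments vary between CsvHelper versions):

FancyData[] records;
using (var reader = new StreamReader("fancy-data.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
	// GetRecords<T>() streams lazily; ToArray() pulls everything into memory up front.
	records = csv.GetRecords<FancyData>().ToArray();
}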

So, tangent aside: You have 500,000 in-memory objects. These objects were read from an unreliable source, so there may be duplicates. Duplicates are bad, so you want to remove them.

Option 1:
Create a new list, and manually add records from the original list, as long as they aren’t already in the new list (using your IEqualityComparer<T> or whatever).
This is great, up until you hit about 50k records and suddenly you’re scanning a huge list for duplicates for every additional record. It was pretty fast up to about 10k records. Speed: O(n^2). I gave up after about 180 seconds on a 3.7GHz i7 with an SSD and 16GB of memory – and it was less than 200k records in.
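
Roughly what that looked like (a sketch – FancyData and the comparer are the same ones shown at the bottom of the post, and Contains here is the LINQ extension that takes a comparer):

var uniques = new List<FancyData>();
var comparer = new FancyDataEqualityComparer();
foreach (var record in records)
{
	// Every new record gets linearly compared against everything already kept: O(n^2) overall.
	if (!uniques.Contains(record, comparer))
		uniques.Add(record);
}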

Option 2:
IEnumerable<T>.Distinct()
This sounds great at first – until you realize that, at least the way I used it, it performed about the same as Option 1. Again, I killed the process after a few minutes of waiting. It’s entirely possible that I implemented this wrong – I had only overridden .Equals and .GetHashCode on the record type, rather than implementing IEquatable<T> or passing an explicit IEqualityComparer<T>.
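
For what it’s worth, there’s an overload that takes the comparer explicitly, which is what I’d reach for if I revisited this (a sketch, using the comparer defined at the bottom of the post):

// Same comparer as Option 3, passed explicitly instead of relying on overrides.
var uniques = records.Distinct(new FancyDataEqualityComparer()).ToArray();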

Option 3:
new HashSet<T>(IEnumerable<T>, IEqualityComparer<T>)
No lie, this bad boy did the job in a fraction of a second.
Removed 297143 duplicates from array size 454002. Final size: 156859. Elapsed: 00:00:00.2096235

Now, my EqualityComparer was only looking at some fields (a few properties combine to yield a “unique” data record). The way the HashSet<T>(IEnumerable<T>, IEqualityComparer<T>) constructor works – from a layman’s perspective – is that it builds up an adequately sized hash table and inserts items into it, top to bottom from the original list (pretty sure it’s a hash table under the covers – it works like a Dictionary where the comparer’s hash code picks the bucket, only faster). If an incoming item’s hash code matches an existing entry and Equals() confirms the match, it simply skips adding that record.
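
In other words, the constructor is conceptually just this loop (a sketch of the behavior, not the actual BCL source – instantArray and the comparer are the ones from the code below):

var uniques = new HashSet<FancyData>(new FancyDataEqualityComparer());
foreach (var record in instantArray)
{
	// Add() hashes the item with the comparer, only calls Equals() when hash codes
	// collide, and returns false (i.e. skips it) if an equal item is already in the set.
	uniques.Add(record);
}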

So, have some code:

using System.Collections.Generic;
using System.Linq;

public class Deduper
{
	public FancyData[] Dedupe(IEnumerable<FancyData> fancyDataArray)
	{
		var instantArray = fancyDataArray.ToArray(); //Multiple enumeration of IEnumerable?
		var uniques = new HashSet<FancyData>(instantArray, new FancyDataEqualityComparer());
		var array = uniques.Distinct().ToArray(); //.Distinct() does nothing. I just like it there.
		return array;
	}

	public class FancyDataEqualityComparer : IEqualityComparer<FancyData>
	{
		public bool Equals(FancyData x, FancyData y)
		{
			if (x == null && y == null) return true;
			if (x == null || y == null) return false;
			return
				x.KeyProperty1 == y.KeyProperty1
				&& x.KeyProperty2 == y.KeyProperty2
				&& x.KeyProperty3 == y.KeyProperty3
				&& x.KeyProperty4 == y.KeyProperty4;
		}

		public int GetHashCode(FancyData obj)
		{
			// Hash on the same key properties the Equals check uses.
			var name = typeof(FancyData).Name;
			if (obj == null) return name.GetHashCode();
			return
				string.Format("{0}:{1}:{2}:{3}:{4}", name, obj.KeyProperty1, obj.KeyProperty2,
							  obj.KeyProperty3, obj.KeyProperty4).GetHashCode();
		}
	}
}
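
And the call site is about as simple as it gets (hypothetical names, using the same records array from earlier):

var deduper = new Deduper();
FancyData[] deduped = deduper.Dedupe(records);
Console.WriteLine("Removed {0} duplicates from array size {1}. Final size: {2}",
				  records.Length - deduped.Length, records.Length, deduped.Length);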

Seriously… 210 milliseconds!
