
Deduplication. Or: How I learned to stop worrying and love the HashSet<T>

Let’s say you have a relatively large set of data.
For our example, it’s about half a million records. They’re all in memory, and you used CsvHelper to load them, because you don’t like reinventing the wheel – especially a very well thought out, tried, and proven wheel.
(Sidenote: Trying to do a .Take(BatchSize) on the CsvHelper.Read() method caused worlds of pain for me on rusty disks. YMMV)

So, tangent aside: You have 500,000 in-memory objects. These objects were read from an unreliable source, so there may be duplicates. Duplicates are bad, so you want to remove them.

Option 1:
Create a new list, and manually add records from the original list, as long as they aren’t already in the new list (using your IEqualityComparer<T> or whatever).
Create a new list, and manually add records from the original list, as long as they aren’t already in the new list (using your IEqualityComparer<T> or whatever).
This is great up until you hit about 50k records, at which point you’re scanning a huge list for duplicates on every additional record. It was pretty fast up to about 10k records. Speed: O(n²). I gave up after about 180 seconds on a 3.7 GHz i7 with an SSD and 16 GB of memory – it was less than 200k records in.
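
For reference, the naive version looks something like this (records, FancyData, and the comparer are the same hypothetical types used in the full code further down):

// The O(n^2) approach: every insert is preceded by a linear scan of the output list.
var uniques = new List<FancyData>();
var comparer = new FancyDataEqualityComparer();
foreach (var record in records)
{
	// Enumerable.Contains does a linear scan of everything added so far.
	if (!uniques.Contains(record, comparer))
		uniques.Add(record);
}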

Option 2:
IEnumerable<T>.Distinct()
This sounds great at first – until you realize it appeared to be doing the same thing as Option 1. Again, I killed the process after a few minutes of waiting. It’s entirely possible that I somehow implemented this wrong (apparently you need IEquatable<T> implemented for the default comparer to pick it up – I had only overridden .Equals and .GetHashCode).
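
For what it’s worth, Distinct() also has an overload that accepts an IEqualityComparer<T> directly, which avoids relying on the default comparer at all:

// Distinct() with an explicit comparer – no IEquatable<T> or overrides required.
var uniques = records
	.Distinct(new FancyDataEqualityComparer())
	.ToArray();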

Option 3:
new HashSet<T>(IEnumerable<T>, IEqualityComparer<T>)
No lie, this bad boy did the job in a fraction of a second.
Removed 297143 duplicates from array size 454002. Final size: 156859. Elapsed: 00:00:00.2096235

Now, my EqualityComparer was only looking at some fields (a few properties combine to yield a “unique” data record). The way the HashSet<T>(IEnumerable<T>, IEqualityComparer<T>) constructor works, from a layman’s perspective, is that it builds an adequately sized hash table and inserts items into it, top to bottom from the original list (pretty sure it’s a hash table under the covers – it works like a Dictionary where the object’s hash code picks the bucket, only faster). If it encounters an item the comparer considers equal to one already inserted – matching hash codes narrow the candidates, then Equals confirms – it simply skips it.

So, have some code:

public class Deduper
{
	public FancyData[] Dedupe(IEnumerable<FancyData> fancyDataArray)
	{
		var instantArray = fancyDataArray.ToArray(); // Avoid multiple enumeration of the IEnumerable
		var uniques = new HashSet<FancyData>(instantArray, new FancyDataEqualityComparer());
		var array = uniques.Distinct().ToArray(); // .Distinct() does nothing here. I just like it there.
		return array;
	}

	public class FancyDataEqualityComparer : IEqualityComparer<FancyData>
	{
		public bool Equals(FancyData x, FancyData y)
		{
			if (x == null && y == null) return true;
			if (x == null || y == null) return false;
			return
				x.KeyProperty1 == y.KeyProperty1
				&& x.KeyProperty2 == y.KeyProperty2
				&& x.KeyProperty3 == y.KeyProperty3
				&& x.KeyProperty4 == y.KeyProperty4;
		}

		public int GetHashCode(FancyData obj)
		{
			// Hash only the key properties, so the hash agrees with Equals above.
			var name = typeof(FancyData).Name;
			if (obj == null) return name.GetHashCode();
			return
				string.Format("{0}:{1}:{2}:{3}:{4}", name, obj.KeyProperty1, obj.KeyProperty2,
							  obj.KeyProperty3, obj.KeyProperty4).GetHashCode();
		}
	}
}
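
If you want to reproduce the timing, a Stopwatch around the call is enough. A minimal harness – LoadRecords() here is a hypothetical stand-in for the CsvHelper load:

// LoadRecords() is a stand-in for however you load the data (e.g. CsvHelper).
FancyData[] records = LoadRecords();
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var deduped = new Deduper().Dedupe(records);
stopwatch.Stop();
Console.WriteLine("Removed {0} duplicates from array size {1}. Final size: {2}. Elapsed: {3}",
	records.Length - deduped.Length, records.Length, deduped.Length, stopwatch.Elapsed);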

Seriously… 210 milliseconds!


Circuit Breaker pattern

So I was talking with a coworker, and we were discussing an occasional bug we get wherein a system that processes a massive number of requests very quickly will sometimes lose its connection to certain databases (luckily, not our logging database).

However, the fact that it can still talk to logging means that within a few seconds we may have hundreds of thousands of error records being written. Combine this with the fact that our logging system is global to all our applications, and suddenly the entire system is brought down (since nothing can connect to the logging DB, which is expected to “always be up”) simply because one small, high-activity component hit an error.

Enter the Circuit Breaker pattern. Most developers have probably already imagined something of this nature: you watch for X errors in Y seconds, and when the threshold is hit you trip to an “offline” status so that any further requests are queued until the service is back up.

Part 2 is that you let a small trickle of requests through (one every 5 seconds, or every 30 seconds, say) – or, alternately, you route the code into a separate track where a specific type of request is made just to check the status of the downed service. Once the service comes back up, you flip the status back to good and the flow of requests resumes.

… I might consider slowly increasing the flow of requests so that you don’t hammer it and put it back out of service again as soon as it comes up. But that’s just me.
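
Here’s a minimal sketch of the idea in C#. The class, thresholds, and probe interval are all hypothetical – this isn’t from any particular library:

using System;
using System.Collections.Generic;

// A bare-bones circuit breaker: trips after too many errors in a rolling
// window, then lets a trickle of probe requests through until one succeeds.
// Not thread-safe – add locking for concurrent use.
public class CircuitBreaker
{
	private readonly int _maxErrors;          // X errors...
	private readonly TimeSpan _window;        // ...within Y seconds trips the breaker
	private readonly TimeSpan _probeInterval; // trickle rate while offline
	private readonly Queue<DateTime> _recentErrors = new Queue<DateTime>();
	private DateTime _lastProbe = DateTime.MinValue;
	private bool _offline;

	public CircuitBreaker(int maxErrors, TimeSpan window, TimeSpan probeInterval)
	{
		_maxErrors = maxErrors;
		_window = window;
		_probeInterval = probeInterval;
	}

	public bool AllowRequest()
	{
		if (!_offline) return true;
		// While offline, let one probe through per interval; queue the rest.
		if (DateTime.UtcNow - _lastProbe >= _probeInterval)
		{
			_lastProbe = DateTime.UtcNow;
			return true;
		}
		return false;
	}

	public void RecordSuccess()
	{
		// A success while offline means the service is back: resume normal flow.
		_offline = false;
		_recentErrors.Clear();
	}

	public void RecordFailure()
	{
		var now = DateTime.UtcNow;
		_recentErrors.Enqueue(now);
		// Discard errors that have aged out of the rolling window.
		while (_recentErrors.Count > 0 && now - _recentErrors.Peek() > _window)
			_recentErrors.Dequeue();
		if (_recentErrors.Count >= _maxErrors)
			_offline = true; // too many errors in the window: go offline
	}
}

Callers check AllowRequest() before hitting the service, report outcomes via RecordSuccess()/RecordFailure(), and queue anything rejected while the breaker is offline.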

Anyhow, that’s the circuit breaker in a nutshell.

The end result is that you make far fewer doomed calls. The only real drawbacks I see are a potentially longer delay before you know the service is back up, and a bit more code to write.

Cheers
Josh


Upcoming MVC4 talk

So, I’m giving a short (1h or less) MVC4 talk to my coworkers in a couple weeks.

- MVC: Defining the pattern
  - How MVC != ASP.NET MVC
- MVC4 Goodies
  - Device-specific views
    - jQuery Mobile, view switcher (may be deferred to the MVC4 on Mobile talk)
  - Async controller classes
  - Web API
  - Single-Page Applications
  - Recipes (is this MVC4, or does it just happen to be bundled with it?)
  - Azure?
- Not MVC4, but you should know anyway:
  - SignalR
  - Backbone.js, Knockout.js
