Deduplication. Or: How I learned to stop worrying and love the HashSet<T>

Let’s say, for example, you have a relatively large set of data.
For our example – it’s about a half million records. They’re all in memory, and you used CsvHelper to load them, because you don’t like reinventing the wheel. Especially a very well thought out, tried, and proven wheel.
(Sidenote: Trying to do a .Take(BatchSize) on the CsvHelper.Read() method caused worlds of pain for me on rusty disks. YMMV)

So, tangent aside: You have 500,000 in-memory objects. These objects were read from an unreliable source, so there may be duplicates. Duplicates are bad, so you want to remove them.

Option 1:
Create a new list, and manually add records from the original list, as long as they aren’t already in the new list (using your IEqualityComparer<T> or whatever).
This is great until you hit about 50k records and suddenly you’re scanning a huge list for duplicates for every additional record. It was pretty fast up to about 10k records. Speed: O(n^2). I gave up after about 180 seconds on a 3.7 GHz i7 with an SSD and 16 GB of memory. It was fewer than 200k records into it.

Option 2:
IEnumerable.Distinct()
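This sounds great at first, until it ran just as slowly for me as Option 1. Again, I killed the process after a few minutes of waiting. It’s entirely possible that I somehow implemented this wrong: Distinct() keeps the items it has already seen in a hash-based set, so it should run in roughly linear time as long as GetHashCode and Equals agree with each other. The default comparer uses IEquatable<T> if the type implements it and otherwise falls back to the overridden .Equals and .GetHashCode (which is all I had done), and there is also an overload that takes an IEqualityComparer<T>.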
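
For reference, Enumerable.Distinct has an overload that accepts an IEqualityComparer<T>, so a key-based comparer can be handed straight to it. A minimal sketch, assuming a made-up Record type with two key fields (Record, RecordKeyComparer and DistinctExample are illustration-only names, not from the code in this post):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up record type, for illustration only.
public class Record
{
	public string Key1 { get; set; }
	public int Key2 { get; set; }
	public string Payload { get; set; }
}

// Made-up comparer: two records are "the same" when both key fields match.
public class RecordKeyComparer : IEqualityComparer<Record>
{
	public bool Equals(Record x, Record y)
	{
		if (ReferenceEquals(x, y)) return true;
		if (x == null || y == null) return false;
		return x.Key1 == y.Key1 && x.Key2 == y.Key2;
	}

	public int GetHashCode(Record obj)
	{
		if (obj == null) return 0;
		unchecked
		{
			// Hash exactly the fields that Equals compares.
			return ((obj.Key1 == null ? 0 : obj.Key1.GetHashCode()) * 397) ^ obj.Key2;
		}
	}
}

public static class DistinctExample
{
	public static Record[] Dedupe(IEnumerable<Record> records)
	{
		return records.Distinct(new RecordKeyComparer()).ToArray();
	}
}
```

The important part is that GetHashCode is built from the same fields Equals looks at; if the two disagree, duplicates slip through.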

Option 3:
new HashSet<T>(IEnumerable<T>, IEqualityComparer<T>)
No lie, this bad boy did the job in a fraction of a second.
Removed 297143 duplicates from array size 454002. Final size: 156859. Elapsed: 00:00:00.2096235

Now, my EqualityComparer was only looking at some fields (a few properties combine to yield a “unique” data record). From a layman’s perspective, the HashSet<T>(IEnumerable<T>, IEqualityComparer<T>) constructor builds up an adequately sized hash table and starts inserting items into it, top to bottom from the original list. Pretty sure it works like a Dictionary where the object’s hash code picks the bucket, only without the values. If it encounters an item whose hash code matches one that’s already been inserted, and the comparer’s Equals agrees that the two are equal, it simply skips that item. A matching hash code on its own is not enough to drop a record, and the first occurrence of each record is the one that survives.
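
To make the duplicate check concrete, here is a small sketch. It assumes a made-up Item type and a deliberately terrible comparer whose GetHashCode always returns the same value (Item, ConstantHashComparer and HashSetDemo are illustration-only names). Every item collides, yet only items for which Equals returns true are treated as duplicates.

```csharp
using System;
using System.Collections.Generic;

// Made-up type, for illustration only.
public class Item
{
	public string Key { get; set; }
	public string Note { get; set; }
}

// Deliberately bad comparer: every item gets the same hash code.
public class ConstantHashComparer : IEqualityComparer<Item>
{
	public bool Equals(Item x, Item y)
	{
		if (ReferenceEquals(x, y)) return true;
		if (x == null || y == null) return false;
		return x.Key == y.Key;
	}

	public int GetHashCode(Item obj)
	{
		return 42; // all items collide
	}
}

public static class HashSetDemo
{
	// Returns the items that survived deduplication.
	public static List<Item> Dedupe(IEnumerable<Item> items)
	{
		var set = new HashSet<Item>(new ConstantHashComparer());
		foreach (var item in items)
		{
			// Add returns false, and leaves the set unchanged, when an equal item is already present.
			set.Add(item);
		}
		return new List<Item>(set);
	}
}
```

Feeding it items keyed a, a, b gives back two items: the first a and the b. With a constant hash code every insert has to compare against everything already in the set, which is Option 1 all over again, so a real comparer should spread its hash codes out.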

So, have some code:

using System.Collections.Generic;
using System.Linq;

public class Deduper
{
	public FancyData[] Dedupe(IEnumerable<FancyData> fancyDataArray)
	{
		var instantArray = fancyDataArray.ToArray(); //Avoids multiple enumeration of the IEnumerable
		var uniques = new HashSet<FancyData>(instantArray, new FancyDataEqualityComparer());
		var array = uniques.Distinct().ToArray(); //.Distinct() does nothing. I just like it there.
		return array;
	}
	public class FancyDataEqualityComparer : IEqualityComparer<FancyData>
	{
		public bool Equals(FancyData x, FancyData y)
		{
			if (x == null && y == null) return true;
			if (x == null || y == null) return false;
			return
				x.KeyProperty1 == y.KeyProperty1
				&& x.KeyProperty2 == y.KeyProperty2
				&& x.KeyProperty3 == y.KeyProperty3
				&& x.KeyProperty4 == y.KeyProperty4;
		}

		//Built from the same four key properties that Equals compares
		public int GetHashCode(FancyData obj)
		{
			var name = typeof (FancyData).Name;
			if (obj == null) return name.GetHashCode();
			return
				string.Format("{0}:{1}:{2}:{3}:{4}", name, obj.KeyProperty1, obj.KeyProperty2,
							  obj.KeyProperty3, obj.KeyProperty4).GetHashCode();
		}
	}
}

Seriously… 210 milliseconds!


Self-maintaining documentation for HTTP REST APIs

So, I’ve got an API. I probably want to expose the basics on usage to the public, but I sure as hell don’t want to write documentation. My solution? Self-documenting code. ❤ reflection.

There are a few base classes referenced that I'm using to determine items which are allowed to be accessible publicly, and some that aren't. Maybe my code sucks, whatever, but it works. It certainly makes a few assumptions about what & how you're doing things (like Get/Post method names for the accepted http verbs).

Anyhow, it works for me and spits out a fancy JSON object that can be inspected via fiddler or chrome dev tools, which is enough to at least get started with consuming a new API. Sure beats randomly trying URLs and input/outputs.

 [Unauthenticated]
     public class ConfigController : SiteApiController
     {
          public ConfigData Get()
          {
               var types = Assembly.GetAssembly(typeof(ContractBase))
                    .GetTypes().Where(x => x.IsClass 
                         && typeof(PublicItem).IsAssignableFrom(x) 
                         && x != typeof(PublicItem));
               var parser = new ObjectParser();
               var d = new Dictionary<string, dynamic>();
               foreach (var type in types)
               {
                  d.Add(type.Name,parser.ParseObject(type));
               }
               var etypes = Assembly.GetAssembly(typeof(ContractBase)).GetTypes().Where(x => x.IsEnum);
               var e = new Dictionary<string, dynamic>();
               foreach (var etype in etypes)
               {
                    e.Add(etype.Name, parser.ParseEnum(etype));
               }
               var controllers = Assembly.GetAssembly(typeof(SiteApiController))
                    .GetTypes().Where(x => typeof(SiteApiController).IsAssignableFrom(x) && x != typeof(SiteApiController));
               var m = new Dictionary<string, dynamic>();
               foreach (var c in controllers)
               {
                    m.Add("/api/" + c.Name.Replace("Controller", ""), parser.ParseController(c));
               }

               return new ConfigData() {Classes = d, Enums = e, Methods = m};
          }
          public class ConfigData
          {
               public dynamic Classes { get; set; }
               public dynamic Enums { get; set; }
               public dynamic Methods { get; set; }
          }
     }
internal class ObjectParser
     {
          public dynamic ParseObject(Type type)
          {
               var props =
                    type
                       .GetProperties(BindingFlags.Instance | BindingFlags.Public | BindingFlags.SetProperty |
                                          BindingFlags.GetProperty).Where(prop =>
                         !Attribute.IsDefined(prop, typeof(NotPublicAttribute)))
                         .ToArray();

               var dict = new Dictionary<string, string>();
               foreach (var prop in props)
               {
                    dict[prop.Name] = GetReadableType(prop.PropertyType);
               }
               return dict;
          }
          private string GetReadableType(Type type)
          {
               string ptype = type.Name;
               if (type.IsGenericType)
                    {
                         ptype = ptype.Replace("`1", "");

                    }
               return ptype;
          }
          public dynamic ParseEnum(Type type)
          {
               var list = new List<string>();
               foreach (var val in Enum.GetValues(type))
               {
                    list.Add(string.Format("{0} ({1})",Enum.Parse(type,val.ToString()).ToString(),((int)val).ToString() ));
               }
               return list;
          }

          public dynamic ParseController(Type type)
          {
               var methods = new List<string>();
               string input = "none";
               string output = "none";
               bool typesSet = false;
               bool requiresSSL =
                    type.CustomAttributes.Any(x => x.AttributeType == typeof(RequireSSLAttribute));
               bool unAuthenticated =
                    type.CustomAttributes.Any(
                         x => x.AttributeType == typeof (UnauthenticatedAttribute));
               foreach (
                    var method in
                         type.GetMethods(BindingFlags.Instance | BindingFlags.Public | BindingFlags.DeclaredOnly)
                    )
               {
                    methods.Add(method.Name);
                    if (!typesSet)
                    {
                         var parms = method.GetParameters();
                         if (parms.Any())
                              input = parms.First().ParameterType.Name;
                         var returnType = method.ReturnType;
                         output = returnType.Name;
                         typesSet = true;
                    }
               }
               return new
                    {
                         Methods = methods,
                         Input = input,
                         Output = output,
                         RequiresSSL = requiresSSL,
                         RequiresAuthentication = !unAuthenticated,
                    };

          }
     }


Posting complex object to MVC3 controller using jquery

Not as easy as it should be. I fought with this for way too long.
Basically, you should be able to construct the object in the $.post function call. But, for whatever reason, you cannot. Well, I couldn’t – your luck might be different.

this fails:

$.post("/SomeController/ActionThatAcceptsPost", {
    Property1: someValue,
    Property2: 3.0
}, function(response) {
    //Make happy noises
},
"json");

but this works:

var thisIsTheParameterName = {
    Property1: someValue,
    Property2: 3.0
};
$.post("/SomeController/ActionThatAcceptsPost", thisIsTheParameterName,
function(response) {
    //moar happy noises
},
"json");

So, in short – make a variable to pass in a complex object to your MVC3 controller and it starts working.

Oh, and not sure if it matters, but the variable was the same name as the parameter on the Action, and the class was all [DataContract] decorated. Those two may be overkill, but I didn’t feel like taking them out to check.

~jb

Circuit Breaker pattern

So I was talking with a coworker and we were discussing this occasional bug we get wherein a system that processes a massive number of requests (very quickly) will sometimes lose connection to certain databases (luckily, not our logging database).

However, the fact that it can still talk to logging means that within a few seconds we may have hundreds of thousands of error records being written. Combine this with the fact that our logging system is global for all our applications, and suddenly you have the entire system being brought down (since it cannot connect to the logging db, which is expected to “always be up”) simply because a small component with high activity has an error.

Enter the Circuit breaker pattern. Most developers have probably already imagined something of this nature – but it’s basically coding to watch for X number of errors in Y seconds, and triggering an “offline” status so that any further requests are queued until the service is back up. 

Part 2 is that you have a small trickle of requests (1 per 5 seconds, 1 per 30 seconds) that are allowed through [Or alternately, you put the code into an alternate track where a specific type of request is made to check status of the ‘downed’ service]. Once the service comes back up, you change the status back to good and the flow of requests resumes.

… I might consider slowly increasing the flow of requests so that you don’t hammer it and put it back out of service again as soon as it comes up. But that’s just me.

Anyhow, that’s the circuit breaker in a nutshell.

The end result is that you have fewer erroneous calls. The only real drawbacks I see are potentially longer delay before you know the service is back up, and it’s a bit more to code.
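
For the curious, here is roughly what that could look like in code. This is a sketch under my own assumptions, not a hardened implementation: CircuitBreaker and all of its members are made-up names, the clock is injectable purely so the timing can be tested, and both the queuing of rejected requests and the gradual ramp-up are left out.

```csharp
using System;
using System.Collections.Generic;

// Trips after maxErrors failures within errorWindow, then lets one
// probe request through per probeInterval until a probe succeeds.
public class CircuitBreaker
{
	private readonly int maxErrors;
	private readonly TimeSpan errorWindow;
	private readonly TimeSpan probeInterval;
	private readonly Func<DateTime> clock;
	private readonly Queue<DateTime> recentErrors = new Queue<DateTime>();
	private readonly object sync = new object();
	private bool open;
	private DateTime lastProbe;

	public CircuitBreaker(int maxErrors, TimeSpan errorWindow, TimeSpan probeInterval, Func<DateTime> clock = null)
	{
		this.maxErrors = maxErrors;
		this.errorWindow = errorWindow;
		this.probeInterval = probeInterval;
		this.clock = clock ?? (() => DateTime.UtcNow);
	}

	public bool IsOpen { get { lock (sync) return open; } }

	// Ask before calling the service. False means "treat it as down".
	public bool AllowRequest()
	{
		lock (sync)
		{
			if (!open) return true;
			var now = clock();
			if (now - lastProbe < probeInterval) return false;
			lastProbe = now; // this caller is the trickle request
			return true;
		}
	}

	public void RecordSuccess()
	{
		lock (sync)
		{
			if (!open) return;
			open = false; // a probe got through, resume normal flow
			recentErrors.Clear();
		}
	}

	public void RecordFailure()
	{
		lock (sync)
		{
			var now = clock();
			recentErrors.Enqueue(now);
			// Forget errors that have fallen out of the window.
			while (recentErrors.Count > 0 && now - recentErrors.Peek() > errorWindow)
				recentErrors.Dequeue();
			if (!open && recentErrors.Count >= maxErrors)
			{
				open = true;
				lastProbe = now;
			}
		}
	}
}
```

The caller wraps each request: check AllowRequest(), make the call, then report RecordSuccess() or RecordFailure(). Something like new CircuitBreaker(100, TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(30)) would trip on 100 errors in 5 seconds and probe every 30 seconds.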

 

Cheers
Josh 


Migrating Shelvesets between branches in TFS

So, if you work on a branch-per-release cycle, sometimes things take longer than expected.
Here’s a quick way to move a shelveset from one branch to another:
tfpt unshelve - Unshelve into workspace with pending changes

Allows a shelveset to be unshelved into a workspace with pending changes.
Merges content between local and shelved changes. Allows migration of shelved
changes from one branch into another by rewriting server paths.

Usage: tfpt unshelve [shelvesetname[;username]] [/nobackup]
                     [/migrate /source:serverpath /target:serverpath]

 shelvesetname        The name of the shelveset to unshelve
 /nobackup            Skip the creation of a backup shelveset
 /migrate             Rewrite the server paths of the shelved items
                      (for example to unshelve into another branch)
 /source:serverpath   Source location for path rewrite (supply with /migrate)
 /target:serverpath   Target location for path rewrite (supply with /migrate)


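Serverpath is the server path of the branch folder, in the form $/Project/DirectoryStructure. With made-up names, a migration looks like: tfpt unshelve MyShelveset /migrate /source:$/Project/Release1 /target:$/Project/Release2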

Don’t use nobackup. Bad idea. Backups never hurt anybody. Also, you should run this from within a directory that you’ve got mapped (source directory works great), otherwise you might have an issue with TFPT not being able to figure out which workspace, even if there’s only one.


Upcoming MVC4 talk

So, I’m giving a short (1h or less) MVC4 talk to my coworkers in a couple weeks.

- MVC: Defining the pattern
  - How MVC != ASP.NET MVC
- MVC4 Goodies
  - Device-specific views
    - jQuery Mobile, view switcher (may be deferred to the MVC4 on Mobile talk)
  - Async controller classes
  - Web API
  - Single-Page Applications
  - Recipes (is this MVC4 or just happens to be bundled with it?)
  - Azure?
- Not MVC4 but you should know anyway:
  - SignalR
  - Backbone.js, Knockout.js


HttpHandler won’t register

Or at least, the HttpHandler code won’t run..

If you have an HttpHandler in some semi-legacy code that should be firing, but instead you’re getting a 404, well, you might be on IIS7 or Windows 7. Try registering it under <system.webServer><handlers> rather than <system.web><httpHandlers>. Thanks go to http://stackoverflow.com/questions/1465859/httphandler-not-working-in-iis-7


AutoMapper is awesome, but ArgumentNullException on IEnumerable.Select pisses me off.

Let me start this off by saying:
I am way undereducated with AutoMapper. I’m fairly certain there is a more elegant way to accomplish this, but I did it my way for the following reasons:

1) I’m stubborn
2) I don’t want to create ValueResolver classes. I don’t know why exactly, but I prefer to just have maps.
3) I can’t rename any of my classes, nor can I change the structure. I’m stuck with what I have.
4) In my instance, I can’t create a map directly from the source type to the destination type – I need to keep the actual references.

The situation:
I have a large number of classes with one to many relationships. For example, let’s take three simple classes and a destination type:

public class Person
{
     public IList<PersonAddress> Addresses { get; set; }
}

public class PersonAddress
{
     public Person Person { get; set; }
     public PhysicalAddress PhysicalAddress { get; set; }
}

public class PhysicalAddress
{
     public string City { get; set; }
}

public class DestinationList
{
     public IList<PhysicalAddress> Cities { get; set; }
}

So, I tried a few different methods, with varying results.

Attempt 1:

public static void CreatePersonAddressMap()
{
     AutoMapper.Mapper.CreateMap<Person, DestinationList>()
          .ForMember(dest => dest.Cities,
               config => config.MapFrom(person => person.Addresses.Select(add => add.PhysicalAddress)));
          //Throws ArgumentNullException if person.Addresses is null! OH NOES!
}

Result: ArgumentNullException when person.Addresses is null.

Attempt 2:

public static void CreatePersonAddressMap()
{
     AutoMapper.Mapper.CreateMap<Person, DestinationList>()
          .AfterMap((person, dest) =>
          {
               if (person.Addresses != null)
                    dest.Cities = person.Addresses.Select(add => add.PhysicalAddress).ToList();
          });
          //This method works, but it's A) Ugly and B) fails the AssertConfigurationIsValid check.
          //I could add .ForMember(dest => dest.Cities, s=> s.Ignore()), but again that sucks
}

Result: Works for the actual mapping, even when person.Addresses is null. However, fails the AutoMapper.Mapper.AssertConfigurationIsValid check

Attempt 3: I thought for sure this would work; what else could the .Condition method be for?

public static void CreatePersonAddressMap()
{
     AutoMapper.Mapper.CreateMap<Person, DestinationList>()
          .ForMember(dest => dest.Cities,
               config =>
               {
                    config.Condition(person => person.Addresses != null);
                    config.MapFrom(
                    person => person.Addresses.Select(add => add.PhysicalAddress));
               });
               //This is the method that I thought would work, but the Condition feature is underdocumented, at
               //least in my undereducated opinion.
}

Result: ArgumentNullException when person.Addresses is null. Grr.

Attempt 4:

public static void CreatePersonAddressMap()
{
     AutoMapper.Mapper.CreateMap<Person, DestinationList>()
          .ForMember(dest => dest.Cities,
          config => config.MapFrom(
               person =>
                    person.Addresses == null ? null : person.Addresses.Select(add => add.PhysicalAddress)));
               //This is the method I went with. It's still ugly, but it's (i think) a bit easier to follow, and won't crash
               //the AssertConfigurationIsValid check.
}

Result: It works, doesn’t throw an exception, and correctly projects the desired information. However, I will readily admit that it is quite ugly and would much prefer a better option for the configuration of AutoMapper…. Or some advice on what I’m doing wrong. That works too.
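
One possible way to tuck the null check away is a small extension method that swaps a null sequence for an empty one. This is only a sketch: OrEmpty is a made-up helper, not part of AutoMapper or LINQ, and it is untested inside a MapFrom expression, so the example sticks to plain LINQ. Note that it produces an empty list where Attempt 4 produces null.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
	// Made-up helper: returns the sequence itself, or an empty one when it is null.
	public static IEnumerable<T> OrEmpty<T>(this IEnumerable<T> source)
	{
		return source ?? Enumerable.Empty<T>();
	}
}

public static class OrEmptyExample
{
	// Same shape of projection as the maps above, written against plain
	// strings so the example stays self-contained.
	public static List<int> Lengths(IList<string> cities)
	{
		return cities.OrEmpty().Select(city => city.Length).ToList();
	}
}
```

With that in place, the projection would read person.Addresses.OrEmpty().Select(add => add.PhysicalAddress), assuming the mapping expression accepts the extension call.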

UITextField.MaxLength (or something like it)

So, I need to enforce: Numbers only, and a maximum length (Zip code, whee!) of a UITextField on my iPhone app.

Here’s what I came up with (later, I shall implement some sort of RegEx style of doing this)

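		bool CheckText(UITextField fld, NSRange rng, string newChar, int maxLength, bool numbersOnly)
		{
			const string numbers = "0123456789";
			// Reject any edit that would leave more than maxLength characters.
			// rng.Length > 0 means existing text is being replaced or deleted.
			if (fld.Text.Length - rng.Length + newChar.Length > maxLength)
				return false;
			if (!numbersOnly)
				return true;
			// Check every incoming character, so pasted text is validated too.
			foreach (var c in newChar)
			{
				if (numbers.IndexOf(c) < 0)
					return false;
			}
			return true;
		}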

and then to attach it to a text field:

               ZipInput.ShouldChangeCharacters = (fld, rng, str) => CheckText(fld,rng,str,5,true);

Easy as pie.
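
As for the RegEx style mentioned above, one way it might look is a sketch like this: build the text the field would contain after the edit, then test the whole thing against a pattern. TextRules and WouldMatch are made-up names, and the method takes plain strings and ints so it can be tried outside of an iOS project.

```csharp
using System.Text.RegularExpressions;

public static class TextRules
{
	// location/length describe the range being replaced, as handed to ShouldChangeCharacters.
	public static bool WouldMatch(string current, int location, int length, string replacement, string pattern)
	{
		current = current ?? "";
		// Apply the pending edit to get the text as it would look afterwards.
		string proposed = current.Remove(location, length).Insert(location, replacement ?? "");
		return Regex.IsMatch(proposed, pattern);
	}
}
```

For the zip code case the pattern would be @"\A[0-9]{0,5}\z" (zero to five digits, so partial input is still allowed while typing). Wiring it up would presumably be something like ZipInput.ShouldChangeCharacters = (fld, rng, str) => TextRules.WouldMatch(fld.Text, (int)rng.Location, (int)rng.Length, str, @"\A[0-9]{0,5}\z");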

Why doesn’t apple just show the network activity indicator?

So, for real: Why doesn’t Apple just turn on the network activity indicator when there’s traffic? I find it silly that I have to manage that. I mean, beyond just single-connection scenarios, if I have some process (let’s say, a UITableView) that is going to download multiple items, now I have to manually track the number of connections in use, and shut the indicator off when that hits zero.

And if I don’t, Apple is quite likely to reject my app, as it does not adhere to the development guidelines/standards they set up.

It seems like something this trivial should be handled by the FRAMEWORK, since they won’t allow me to directly access the API that, I’m rather sure, KNOWS when the network is in use.

That said, here’s some code to do it for you. Just call AddNetworkConnection when a request starts and RemoveNetworkConnection when it finishes (you can do it in a simpler format, but this is the more awesome way, with a ReaderWriterLock):


public static class Utility
{
	private static readonly ReaderWriterLock rwl = new ReaderWriterLock();
	private static int Connections = 0;

	public static void AddNetworkConnection()
	{
		// Both methods change Connections, so take the writer lock straight away.
		rwl.AcquireWriterLock(TimeSpan.FromSeconds(1));
		try
		{
			//You don't actually have to check, you can just always set this to true
			if (Connections == 0)  // but I like to be fancy.
				UIApplication.SharedApplication.NetworkActivityIndicatorVisible = true;
			Connections++;
		}
		finally
		{
			rwl.ReleaseWriterLock(); // released even if something above throws
		}
	}

	public static void RemoveNetworkConnection()
	{
		rwl.AcquireWriterLock(TimeSpan.FromSeconds(1));
		try
		{
			Connections--;
			if (Connections < 0)
				Connections = 0; //just in case, y'know?
			if (Connections == 0)
				UIApplication.SharedApplication.NetworkActivityIndicatorVisible = false;
		}
		finally
		{
			rwl.ReleaseWriterLock();
		}
	}
}

And that’s it, you’ve got a simple way to keep the Network Activity indicator updated.
If I were more awesome, I’d find a way to attach it to the thread accessing the network, and automatically decrement the count. But, this way works with both blocking and non-blocking calls. So meh.
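
The simpler format mentioned above could look something like this sketch. It uses a plain lock and takes the show/hide action as a delegate, so it compiles without UIKit. ActivityCounter and its members are made-up names; on iOS the delegate would set UIApplication.SharedApplication.NetworkActivityIndicatorVisible, and since UIKit is meant to be touched from the main thread, it would probably need to hop over there first.

```csharp
using System;

public class ActivityCounter
{
	private readonly object sync = new object();
	private readonly Action<bool> setVisible;
	private int connections;

	public ActivityCounter(Action<bool> setVisible)
	{
		this.setVisible = setVisible;
	}

	public void Add()
	{
		lock (sync)
		{
			connections++;
			// Going from 0 to 1 means the first connection just started.
			if (connections == 1)
				setVisible(true);
		}
	}

	public void Remove()
	{
		lock (sync)
		{
			if (connections == 0)
				return; // unmatched Remove, nothing to do
			connections--;
			if (connections == 0)
				setVisible(false);
		}
	}
}
```

The callback runs inside the lock on purpose, so a "show" and a "hide" from two threads cannot arrive in the wrong order.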