So I was talking with a coworker, and we were discussing an occasional bug we get where a system that processes a massive number of requests very quickly will sometimes lose its connection to certain databases (luckily, not our logging database).
However, the fact that it can still talk to logging means that within a few seconds we may have hundreds of thousands of error records being written. Combine this with the fact that our logging system is shared globally across all our applications, and suddenly the entire system is brought down (since nothing can connect to the logging db, which is expected to "always be up") simply because one small, high-activity component hit an error.
Enter the circuit breaker pattern. Most developers have probably already imagined something of this nature: basically, you code the client to watch for X errors in Y seconds, and when that threshold trips, you flag the service as "offline" so that any further requests are queued (or simply rejected) until the service is back up.
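That trip condition is easy to sketch. Here's a minimal version in Python (the class name, thresholds, and injectable clock are my own choices for illustration, not any particular library's API):

```python
import time
from collections import deque

class CircuitBreaker:
    """Trips to 'open' after max_errors failures within `window` seconds."""

    def __init__(self, max_errors=5, window=10.0, clock=time.monotonic):
        self.max_errors = max_errors
        self.window = window
        self.clock = clock      # injectable so tests can use a fake clock
        self.errors = deque()   # timestamps of recent failures
        self.open = False

    def record_failure(self):
        now = self.clock()
        self.errors.append(now)
        # Discard failures that have aged out of the window.
        while self.errors and now - self.errors[0] > self.window:
            self.errors.popleft()
        if len(self.errors) >= self.max_errors:
            self.open = True

    def allow_request(self):
        # While open, callers should queue or reject instead of calling out.
        return not self.open
```

The sliding window matters: five errors spread over an hour is normal background noise, while five errors in ten seconds is a downed dependency.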
Part 2 is that you let a small trickle of requests through (say, one every 5 or 30 seconds) [or alternatively, you route things down a separate track where a specific type of request is made just to check the status of the "downed" service]. Once the service comes back up, you flip the status back to good and the normal flow of requests resumes.
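That trickle-probe step (often called the "half-open" state) might look like this. Again a sketch under my own assumptions: one trial request per `probe_interval` seconds is let through, and a success closes the circuit:

```python
import time

class ProbingBreaker:
    """Once tripped, allows one trial request per probe_interval seconds;
    a successful probe closes the circuit again."""

    def __init__(self, probe_interval=5.0, clock=time.monotonic):
        self.probe_interval = probe_interval
        self.clock = clock
        self.state = "closed"   # "closed" = healthy, "open" = tripped
        self.last_probe = None

    def trip(self):
        self.state = "open"
        self.last_probe = None

    def allow_request(self):
        if self.state == "closed":
            return True
        # Open: let through a single probe per interval.
        now = self.clock()
        if self.last_probe is None or now - self.last_probe >= self.probe_interval:
            self.last_probe = now
            return True
        return False

    def record_success(self):
        # A call that succeeded while open means the service is back.
        if self.state == "open":
            self.state = "closed"
```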
… I might consider slowly ramping the flow of requests back up, so that you don't hammer the service and knock it back out of commission the moment it comes up. But that's just me.
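One way to do that ramp-up (a sketch, with made-up names and a linear ramp I picked arbitrarily) is to admit a growing fraction of requests over a fixed recovery period, using a deterministic counter rather than random sampling:

```python
import time

class RampUp:
    """After recovery, admit a fraction of requests that grows linearly
    from 0% to 100% over ramp_seconds, then get out of the way."""

    def __init__(self, ramp_seconds=30.0, clock=time.monotonic):
        self.ramp_seconds = ramp_seconds
        self.clock = clock
        self.recovered_at = None
        self.counter = 0

    def mark_recovered(self):
        self.recovered_at = self.clock()
        self.counter = 0

    def allow_request(self):
        if self.recovered_at is None:
            return True  # not currently ramping
        elapsed = self.clock() - self.recovered_at
        fraction = elapsed / self.ramp_seconds
        if fraction >= 1.0:
            self.recovered_at = None  # ramp finished; full flow
            return True
        # Admit roughly `fraction` of requests, evenly spaced.
        self.counter += 1
        return int(self.counter * fraction) > int((self.counter - 1) * fraction)
```

Halfway through the ramp, every other request gets through; by the end, all of them do.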
Anyhow, that’s the circuit breaker in a nutshell.
The end result is that you make far fewer doomed calls. The only real drawbacks I see are a potentially longer delay before you know the service is back up, and a bit more code to write.