The art of avoiding cascading failures
Last week, I published the first part of my series on patterns for resilient architecture, with a focus on the infrastructure layer: embracing redundancy, immutability, and the concept of infrastructure as code.
In this post, I will focus on cascading failures, or what I like to call The Punisher: that superhero living in the darkness of your architecture, taking revenge on those responsible for not thinking enough about the small details. I have been a victim several times, and The Punisher is bad, real bad!
Avoiding cascading failures
One of the most common triggers for outages is cascading failure, where one part of a system experiences a local failure and takes down the entire system through inter-connections and failure propagation. Often, that propagation follows the butterfly effect, where a seemingly small failure ripples out to produce a much larger one.
A classic example of a cascading failure is overload. This occurs when the traffic load distributed between two clusters shifts abruptly because one of the clusters fails. The sudden peak of traffic overloads the remaining cluster, which, in turn, fails from resource exhaustion, taking the entire service down.
To avoid such scenarios, here are a few tricks that have served me well:
Backoff algorithms
Typical components in a distributed software system include servers, load-balancers, databases, and DNS servers. In operation, and subject to the failures discussed in part 1, any of these can start generating errors.
The default technique for dealing with errors is to implement retries on the requester side. This technique increases the reliability of the application and reduces operational costs for the developer.
However, at scale, and if requesters attempt to retry the failed operation as soon as an error occurs, the network can quickly become saturated with new and retried requests, each competing for network bandwidth. This pattern will continue until a full system meltdown occurs.
To avoid such scenarios, backoff algorithms such as the common exponential backoff must be used. Exponential backoff algorithms gradually decrease the rate at which retries are performed, thus avoiding network congestion.
In its simplest form, an exponential backoff algorithm looks something like the sketch below (in Python; the request callable and TransientError are hypothetical placeholders for your actual call and its retryable error type):
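import time

class TransientError(Exception):
    # Stand-in for whatever retryable error your client actually raises.
    pass

def call_with_backoff(request, max_attempts=5, base=0.1, cap=5.0):
    # Retry the request, waiting base * 2^attempt seconds (capped) between attempts.
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the error propagate
            time.sleep(min(cap, base * 2 ** attempt))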
Luckily, many SDKs and software libraries, including those from AWS, implement a version (often more sophisticated) of these algorithms. However, never assume — always verify and test this is the case.
It is important to realize that this backoff alone is not enough. As explained by Marc Brooker in the post Exponential Backoff and Jitter, the wait() function in the exponential backoff should always add jitter to prevent clusters of retry calls. One approach is to modify the wait() as follows:
wait = random_between(0, min(cap, base * 2 ** attempt))
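For example, in Python the wait could be computed like this (the base and cap values are illustrative):

import random

def wait_with_jitter(attempt, base=0.1, cap=5.0):
    # "Full jitter": sleep for a random duration between 0 and the capped
    # exponential backoff, so retrying clients don't all wake up at the same time.
    return random.uniform(0, min(cap, base * 2 ** attempt))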
Timeouts
To illustrate the importance of timeouts, imagine that you have steady baseline traffic, and that suddenly, your database decides to ‘slow down’ and your INSERT queries take a few seconds to complete. The baseline traffic doesn’t change, so this means that more request threads are holding on to database connections and the pool of connections quickly runs out. Then, other APIs start to fail because there are no connections left in the pool to access the database. This is a classic example of cascading failure, and if your API had timed out instead of holding on to the database, you could have degraded the service instead of failing.
The importance of thinking about, planning, and implementing timeouts is frequently underestimated. And today, many frameworks don’t expose timeouts in request methods to developers, or even worse, have infinite default values.
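As an illustration, here is a minimal sketch using Python’s requests library; the URL and timeout values are placeholders, and in practice you would pick them based on your own latency baselines:

import requests

def fetch_items():
    try:
        # (connect timeout, read timeout) in seconds; never leave it unset,
        # since many clients will otherwise wait forever.
        response = requests.get("https://api.example.com/items", timeout=(0.5, 2.0))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Degrade instead of holding on to the connection, e.g. return cached data.
        return None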
Idempotent operations
Now that you have implemented retries and timeouts correctly, imagine for a moment that your client sends a message over HTTP to your application, and that due to a transient error, you receive a timeout. What should you do? Was the request processed by the application before it timed out?
Often with timeouts, the default behavior for the user is to retry, e.g., clicking on the submit button until it succeeds. But what if the initial message had already been received and processed by the application? Now, suppose the request was an INSERT to the database. What happens when the retries are processed?
It’s easy to understand why retry mechanisms aren’t safe for all types of requests: if the retry leads to a second INSERT of the same data, you’ll end up with multiple records of the exact same data, which you probably don’t want. Often, this will lead to a uniqueness constraint violation, resulting in a DuplicateKey exception thrown by the database.
Now what? Should you READ before or after putting new data in the database to verify or prevent duplication exceptions? Probably not, as this would be expensive in terms of the latency of the application. Which part of the application needs to handle this complexity?
Most of these scenarios can be avoided if the service implements idempotent operations. An idempotent operation is an operation that can be repeated over and over again without side effects or failure of the application. In other words, idempotent operations always produce the same result.
One option is to use unique traceable identifiers in requests to your application and reject those that have been processed successfully. Keeping track of this information in the cache will help you verify the status of requests and determine whether you should return a canned response, the original response, or simply ignore it.
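A minimal sketch of that idea, assuming the client sends a unique request identifier with every call; the in-memory dict stands in for a real shared cache (with a TTL), and insert_into_database() is a placeholder for the actual write:

processed = {}  # request_id -> original response

def insert_into_database(payload):
    # Placeholder for the real INSERT.
    return {"status": "created", "data": payload}

def handle_insert(request_id, payload):
    # If this request was already processed successfully, return the original
    # response instead of inserting the same data twice.
    if request_id in processed:
        return processed[request_id]
    result = insert_into_database(payload)
    processed[request_id] = result
    return result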
Awareness is key, even more so than having a bulletproof system, because it forces developers to think about potential issues at development time, preventing unhandled exceptions before they happen.
Service degradation & fallbacks
Degradation simply means that instead of failing, your application degrades to a lower-quality service. There are two strategies here: 1) Offering a variant of the service which is easier to compute and deliver to the user; or 2) Dropping unimportant traffic.
Continuing with the above example, if your INSERT queries get too slow, you can time out and then fall back to a read-only mode on the database until the issue with the INSERT is fixed. If your application does not support read-only mode, you can return cached content.
Imagine you want to return a personalized list of recommendations to your users. If your recommendation engine fails, maybe it’s OK to return generic lists like trending now, popular, or often bought together. These lists can be generated on a daily basis and stored in the cache, and are therefore easy to access and deliver. Netflix does this remarkably well, and at scale.
“What we weren’t counting on was getting throttled by our database for sending it too much traffic. Fortunately, we implemented custom fallbacks for database selects and so our service dependency layer started automatically tripping the corresponding circuit and invoking our fallback method, which checked a local query cache on database failure.” Source here.
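A sketch of that kind of fallback, where the recommendation engine client and the cached list are hypothetical placeholders:

def cached_trending():
    # Placeholder: in practice, a daily-generated list read from the cache.
    return ["trending now", "popular", "often bought together"]

def get_recommendations(user_id, engine):
    try:
        return engine.personalized(user_id)   # hypothetical recommendation client
    except (TimeoutError, ConnectionError):
        # Degraded but still useful: fall back to a generic, precomputed list.
        return cached_trending()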
Rejection
Rejection is the final act of ‘self-defense’: you start dropping requests deliberately when the service begins to overload. Initially, you’ll want to drop unimportant requests, but eventually you might have to drop more important ones, too. Dropping requests can happen server side, on load-balancers or reverse-proxies, or even on the client side.
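As a rough illustration, server-side load shedding can be as simple as an in-flight counter and a priority check; the limit and the priority field are assumptions made to keep the sketch short:

MAX_INFLIGHT = 100   # illustrative capacity limit
inflight = 0

def process(request):
    # Placeholder for the real request handling.
    return {"status": 200}

def handle_request(request):
    global inflight
    # Shed low-priority work first as the service approaches saturation;
    # under extreme load you may have to drop important requests too.
    if inflight >= MAX_INFLIGHT and request.get("priority", "low") == "low":
        return {"status": 503, "body": "overloaded, please retry later"}
    inflight += 1
    try:
        return process(request)
    finally:
        inflight -= 1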
Resilience against intermittent and transient errors
An important thing to realize when dealing with large-scale distributed systems in the cloud is that, given the size and complexity of the systems and architectures used, intermittent errors are the norm rather than the exception, which is why we want to build resilient systems.
Intermittent errors can be caused by transient network connectivity issues, request timeouts, or when dependency or external services are overloaded.
The question with intermittent errors is, how should you react? Should you tolerate errors or should you react immediately? This is a big issue since reacting too fast might create risk, add significant operational overhead and incur additional cost.
Take the case of running across multi-regions for example. When do you decide that it’s a good time to initiate a complete failover to the other region?
This is a hard problem to solve. Often, it’s a good idea to use thresholds. Collect statistics about intermittent errors in your baseline, and based on that data, define a threshold that will trigger the reaction to errors. This takes practice to get right.
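For instance, a very small sketch of such a threshold check, where the baseline rate and the multiplier are assumptions you would derive from your own metrics:

BASELINE_ERROR_RATE = 0.02   # measured from historical data (illustrative)
THRESHOLD_MULTIPLIER = 5     # how far above baseline before reacting (illustrative)

def should_react(errors, total_requests):
    # Only trigger a reaction (alarm, failover, ...) when the observed error
    # rate clearly exceeds the baseline, not on every intermittent blip.
    if total_requests == 0:
        return False
    return errors / total_requests > BASELINE_ERROR_RATE * THRESHOLD_MULTIPLIER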
This doesn’t mean you shouldn’t observe and monitor intermittent errors by the way — quite the opposite. It simply means that you shouldn’t react to every one of them.
An important thing to remember when handling transient errors is that, often, transient exceptions are hidden by outer exceptions of a different type. Therefore, make sure you inspect all inner exceptions of any given exception object recursively.
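In Python, for example, exceptions chain their causes via __cause__ and __context__, so a check can walk that chain recursively; the set of transient types below is just an example:

TRANSIENT_TYPES = (ConnectionError, TimeoutError)   # illustrative

def is_transient(exc):
    # A transient error may be wrapped inside an outer exception of a
    # completely different type, so walk the whole chain.
    if isinstance(exc, TRANSIENT_TYPES):
        return True
    inner = exc.__cause__ or exc.__context__
    return inner is not None and is_transient(inner)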
Circuit breaking
Circuit breaking is the process of applying circuit breakers to potentially failing method calls, to avoid cascading failure. The best implementations of circuit breaking combine the techniques used to prevent cascading failure (backoff, degradation, fallback, timeout, rejection, and intermittent error handling) and wrap them in an easy-to-use object for developers implementing method calls. The circuit breaker pattern works as follows.
A circuit breaker object monitors the number of consecutive failures between a producer and a consumer. If that number exceeds a failure threshold, the circuit breaker trips, and all attempts by the producer to invoke the consumer fail immediately or return a defined fallback. After a waiting period, the circuit breaker allows a few requests to pass through. If those requests meet a success threshold, the circuit breaker resumes its normal state. Otherwise, it stays tripped and continues monitoring the consumer.
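To make that state machine concrete, here is a stripped-down sketch in Python; the thresholds and waiting period are illustrative, and a production implementation also handles concurrency, metrics, and fallback wiring far more carefully:

import time

class CircuitBreaker:
    # closed -> open after `failure_threshold` consecutive failures,
    # half-open after `recovery_time` seconds, closed again after
    # `success_threshold` successful trial calls.
    def __init__(self, failure_threshold=5, recovery_time=30.0, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_time:
                # Open: fail fast, or return the defined fallback.
                if fallback is not None:
                    return fallback()
                raise RuntimeError("circuit open")
            # Waiting period over: half-open, let this call through as a trial.
        try:
            result = func()
        except Exception:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip (or re-trip) the breaker
            raise
        if self.opened_at is not None:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.failures = 0
                self.successes = 0
                self.opened_at = None          # back to normal (closed)
        else:
            self.failures = 0
        return result

Wrapping a call then looks like breaker.call(fetch_profile, fallback=cached_profile), where the two functions stand for whatever consumer call and fallback you need to protect.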
Circuit breakers are great since they force developers to apply resilient architecture principles at implementation time. The most famous circuit breaker implementation is, naturally, Hystrix from Netflix.
Wrapping up
That’s all for now, folks. I hope you have enjoyed part 2.
Please do not hesitate to give feedback, share your own opinion or simply clap your hands. In the next part, I will go on and discuss the importance of health checking.
Part 1 — Embracing Failure at Scale
Part 2 — Avoiding Cascading Failures
Part 3 — Preventing Service Failures with Health Check
Part 4 — Caching for Resiliency
Adrian
EDITED after suggestions from Markus Krüger (exponential backoff + jitter)