Designing for Downtime: Understanding the Cranky Path

Peter_Lavelle

Mar 7, 2016

CPOL

For more information please visit developer.intuit.com.

As a self-taught, long-time application developer, I’ve learned many hard lessons. Most are common to us all: back up the database before dropping tables, ground yourself before touching your server’s motherboard, review infinite loops that might spam the CEO, and so on. Recently, I was reminded of one of the toughest lessons to learn: network calls will fail at the worst possible times. Whether it's a Skype call with a big client, a database call to write a new customer record, or an API call to a third-party service, it is critical to prepare for transient errors and downtime. I’ve learned that the best way to keep customers happy is to design for downtime from a user-experience perspective. In this article, I’ll describe some design strategies that provide a better user experience while accommodating potential network connectivity problems.

Understanding the Cranky Path

When designing for downtime, the first step is to fully understand the user experience. I like to think of each step of the user experience in terms of the "happy path," where everything is beautiful: the user clicks the right buttons (in the right order) and ends up on a congratulatory page that celebrates their delightful experience. Regrettably, the happy path is the easy part. Designing for the "cranky path," where something goes wrong along the way, protects our users from a frustrating experience. Frustrated users lose sight of the greater benefit of the application and ultimately dismiss our countless hours of work and neat lines of code.

Understanding the cranky path provides a number of significant benefits to the technical design of applications:

I can prioritize application fixes based on how they impact the cranky path.
I can set retry policies that improve the user experience.
I can set user expectations and develop thoughtful messaging to keep my application’s users happy.

Here are a few of the methods I use. They can help you design your application for the cranky path and keep your customers delighted.

Idempotent Requests

Idempotency sounds daunting, but it is very important, especially when managing accounting data. A quick definition: an operation is idempotent if performing it several times with the same inputs has the same effect, and returns the same result, as performing it once. "Request idempotency" ensures that repeating a request to the QuickBooks Online service will not create a duplicate transaction and inaccurate data. When calling a service, you must consider that a failure may occur after the underlying service has already executed successfully, for example when the response is lost on its way back. In these scenarios you will need to retry the service call, and an idempotent service should handle the repeated request and deliver the same output every time.

Without request idempotency, the application cannot guarantee data accuracy, and risks duplicate or abandoned transactions. Consider the potential problem if a bank or accountant retried processing a $100 deposit 10 times:

With an idempotent service, our account balance is increased by $100 once for the single deposit, and the same new balance is reported all 10 times.
Without an idempotent service, our account balance is increased by $100 each of the 10 times, $1,000 in total, and a different new balance is reported each time.

For a more detailed description of idempotent requests, Sridhar Kalaga recently wrote an excellent post on idempotent APIs. Consider idempotency a prerequisite to any and all retry policies.
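One common way to make a request idempotent is to attach a unique request ID that is generated once and reused on every retry. The following is a minimal sketch of that idea using the $100 deposit example above. The Ledger class, its deposit method, and the request_id parameter are hypothetical names for illustration and are not the QuickBooks Online API.

```python
import uuid


class Ledger:
    """In-memory stand-in for a service that accepts deposits (hypothetical)."""

    def __init__(self, balance=0):
        self.balance = balance
        self._seen = {}  # request_id -> balance reported for that request

    def deposit(self, request_id, amount):
        # A repeated request_id is answered from the stored result
        # instead of being applied a second time.
        if request_id in self._seen:
            return self._seen[request_id]
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance


ledger = Ledger()
request_id = str(uuid.uuid4())  # generated once, reused on every retry
results = [ledger.deposit(request_id, 100) for _ in range(10)]
print(ledger.balance, set(results))  # 100 {100}
```

The key design choice is that the caller, not the service, creates the ID. That way a retry after a lost response carries the same ID as the original request.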

Retry Policies

A retry policy is a mechanism that allows your application to re-execute failed operations, either on a schedule or based on other conditions in and around your application. Three retry algorithms are commonly used in a retry policy:

Fixed – Retries happen after a constant interval. If the user is performing an action and expects an immediate response, a fixed one-time retry (or no retry) may be the best way to deliver the expected behavior.
Incremental – The interval grows by a constant amount after each attempt. If the user is running a long or unattended operation, longer intervals between retries may be acceptable and increase the likelihood that a retry succeeds.
Exponential – If the unattended operation can tolerate it, the duration between retries, called the back-off interval, can be increased exponentially so that more time elapses between attempts the longer the operation continues to fail.

In all three algorithms, the goal is the successful completion of the targeted operation from the user’s perspective. If the retry policy is exhausted without success, presenting the user with a thoughtful error message can make a significant difference in the user experience.
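To make the difference between the three algorithms concrete, here is a minimal sketch of how each could compute its wait time, plus a small retry loop that uses them. The function names, parameters, and the choice of ConnectionError as the transient error are assumptions for illustration; they are not taken from the QuickBooks SDKs.

```python
import time


def fixed(delay):
    """Same wait before every retry."""
    return lambda attempt: delay


def incremental(initial, step):
    """Wait grows by a constant step after each failed attempt."""
    return lambda attempt: initial + step * attempt


def exponential(base, cap):
    """Wait doubles after each failed attempt, up to a cap."""
    return lambda attempt: min(cap, base * 2 ** attempt)


def retry(operation, delay_for, max_attempts, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller show an error message
            sleep(delay_for(attempt))


calls = {"count": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = retry(flaky, exponential(base=1, cap=30), max_attempts=5, sleep=waits.append)
print(result, waits)  # ok [1, 2]
```

Passing the sleep function as a parameter keeps the example fast to run and makes the back-off schedule easy to inspect in tests.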

If your application uses a mature library, such as the .NET and Java QuickBooks SDKs, which have fixed, incremental, and exponential policies built in, you can tune these policies to meet your application’s needs.

When implementing a retry policy, it is also imperative to consider when to "short-circuit," or prevent, retry operations. If a service has a complete outage, even a fancy exponential back-off algorithm will be useless. Once the application has retried a defined maximum number of attempts, it should short-circuit and wait an extended period of time before retrying. I recommend calling a lightweight API as a health check before resuming processing; this conserves resources and time and keeps your logs from filling with errors.
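The short-circuit behavior described above is often called a circuit breaker. The following is a minimal sketch under stated assumptions: the class name, the max_failures and cooldown parameters, and the health_check callable are all hypothetical, and ConnectionError stands in for whatever error your service client raises.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures, cooldown, health_check, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.cooldown = cooldown          # seconds to wait once open
        self.health_check = health_check  # cheap call, returns True if the service is up
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("cooling down; not calling the service")
            if not self.health_check():
                self.opened_at = self.clock()  # still down: restart the cooldown
                raise CircuitOpenError("health check failed")
            self.opened_at = None              # service is back: close the circuit
            self.failures = 0
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # too many failures: open the circuit
            raise
        self.failures = 0
        return result
```

While the circuit is open, the expensive operation is never attempted, so an outage produces one kind of cheap, predictable error instead of a log full of timeouts.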

Atomic Operations

An atomic operation is one that either completes entirely or has no effect, and that no other process or operation can interrupt to change the underlying state being operated on. Relational database transactions help make interactions with our data atomic. When reviewing the service calls made by your application, consider which groups of operations must be atomic. This is very important for sales or other transactions that directly affect regulated information like accounting data.

For example, if the user expects an invoice with a linked payment, but the application is only able to write the invoice, it is essential to alert the user and retry the payment operation until the invoice is properly marked as paid. In this scenario, I recommend making the invoice and payment operations atomic to avoid the need for a retry policy and notification mechanism.
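As an illustration of the invoice-and-payment example, here is a minimal sketch that uses a local SQLite database transaction as a stand-in for the real service. The table layout, the record_paid_invoice function, and the fail_payment flag are hypothetical and exist only to simulate a failure between the two writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, amount REAL, paid INTEGER)")
conn.execute("CREATE TABLE payment (id INTEGER PRIMARY KEY, invoice_id INTEGER, amount REAL)")
conn.commit()


def record_paid_invoice(conn, amount, fail_payment=False):
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the invoice and its payment are written together or not at all.
    with conn:
        cur = conn.execute("INSERT INTO invoice (amount, paid) VALUES (?, 1)", (amount,))
        if fail_payment:
            raise ConnectionError("simulated failure before the payment is written")
        conn.execute(
            "INSERT INTO payment (invoice_id, amount) VALUES (?, ?)",
            (cur.lastrowid, amount),
        )


def counts(conn):
    invoices = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
    payments = conn.execute("SELECT COUNT(*) FROM payment").fetchone()[0]
    return invoices, payments


try:
    record_paid_invoice(conn, 100.0, fail_payment=True)
except ConnectionError:
    pass
after_failure = counts(conn)  # the half-written invoice was rolled back

record_paid_invoice(conn, 100.0)
after_success = counts(conn)
print(after_failure, after_success)  # (0, 0) (1, 1)
```

Because the failed attempt leaves no invoice behind, there is no orphaned record to repair and no partial state to explain to the user.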

Introducing Chaos

Introducing chaos into your application is another great way to understand the cranky path, test the application’s retry policies, and fully understand how your users would feel when services fail. Netflix introduced a tool into their systems called the "Chaos Monkey" to intentionally create different failures and validate that their applications can survive those outages. During testing, your application can mock web service exceptions using tools like WireMock, MockServer, or even your own stubbed service.
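If you go the route of your own stubbed service, a small wrapper that fails a configurable fraction of calls is often enough to exercise retry logic. The sketch below is one hypothetical way to do that; the ChaosWrapper class and its failure_rate parameter are illustrative names and do not reflect how Chaos Monkey, WireMock, MockServer, or any SDK implements fault injection.

```python
import random


class ChaosWrapper:
    """Wraps a callable and raises an injected error on a fraction of calls."""

    def __init__(self, operation, failure_rate, rng=None):
        self.operation = operation
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("chaos: injected failure")
        return self.operation(*args, **kwargs)


def fetch_customer(customer_id):
    """Stand-in for a real service call."""
    return {"id": customer_id, "name": "Example Customer"}


# A seeded generator makes a test run repeatable.
chaotic_fetch = ChaosWrapper(fetch_customer, failure_rate=0.3, rng=random.Random(42))

failures = 0
for _ in range(1000):
    try:
        chaotic_fetch(1)
    except ConnectionError:
        failures += 1
print(f"{failures} of 1000 calls failed")
```

Wrapping the call site rather than the network layer keeps the chaos easy to switch off: in production code you would pass the unwrapped function instead.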

Limited, Experimental Program to Test Chaos Mode

To help you start testing your processes, we are offering a limited-availability, experimental program for developers. It enables you to test with a Chaos Mode built directly into one of our QuickBooks SDKs. With Chaos Mode enabled, the SDK will introduce mock service errors at a configurable tolerance level. We hope this feature helps developers continue to deliver a delightful customer experience, even when services are not cooperating. Apply for access to our experimental program and get started testing your services with Chaos Mode enabled.