Resilient Imperfect Systems

dietmar paul schoder

May 30, 2023

CPOL

How to write code that is not perfect but still resilient in production? A practical use of the "chain of responsibility" design pattern.

Introduction

I wanted to write code - using a generic standard design pattern provided as a NuGet package - which:

immediately sends me a (Slack) alarm whenever it runs into a problem,
automatically reverts everything that it changed until it ran into the problem,
tells me how it got into the problem with which problematic data,
does not disturb me at all whenever it works perfectly fine,
reduces complexity, and
does not slow my system down in any way.

I created this generic working solution in the form of the NuGet package "Schoder.Chain" using a chain-of-responsibility pattern - as suggested in this article, but going further than that.

Background

Let us imagine that we have an endpoint in the backend which is supposed to book a flight for a user, to update user address information or to implement any kind of a relatively complex business process.

Let us say that this process can be broken down into steps similar to these (as an example):

validate the user input,
fetch related data from database 1,
get extra data from another service,
perform some processing based on these three inputs,
insert data into database 2, and
update existing data in database 1.

Chain of Actions

A generic flow for that looks like this:

This is a chain of actions. Each of these actions can either work or fail. Whenever an action fails, none of the succeeding actions may be performed.

In our practical example in Step 1 (see above), the user input can be invalid. In this case, it makes no sense to perform any of the following steps. We need to skip Steps 2 to 6 and we need to tell the user that their input is invalid.

If Step 1 works, Step 2 could also go wrong. The user gave us formally valid input data, but maybe we do not have the expected data in our database. In this case, we need to skip Steps 3 to 6 and tell the user that we did not find any data matching their input.

User vs. System Errors

But I want to make a crucial distinction here. These examples of failures are user errors. Step 2 could also fail because our database is "down". Similarly Step 3 could fail because the external service is "down". These failures would be system errors. In these cases, I must tell the user that something went wrong, that it is not their fault and that there is nothing they can do about it. But I also want to tell the user that I received an alert and that I'm working on fixing the problem now. And in these system error cases, I want the process to limit the damage and send me a notification including all details about the error.

Damage Limitation

In case of a system error, all data changes preceding the failure must be undone. The generic flow for that looks like this:

My specific example consisting of the six steps above looks like this:

In our practical case, we can see that any system error in Steps 1 to 4 can go through empty undo-actions, because none of these steps changed data. Also Step 5 did not change data when it failed. But if Step 6 fails, then Step 5 must be undone to maintain the integrity of the total business transaction.
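The undo order described above can be illustrated with a small, library-independent sketch. It assumes that each step is a pair of delegates (one to do the work, one to revert it). The names Step and UndoDemo are hypothetical and are not part of Schoder.Chain.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical step: an action plus the action that reverts it.
public record Step(string Name, Action Do, Action Undo);

public static class UndoDemo
{
    // Runs the steps in order. If one throws, the steps that already
    // completed are undone in reverse order. Returns a log of what happened.
    public static List<string> Run(IEnumerable<Step> steps)
    {
        var log = new List<string>();
        var completed = new Stack<Step>();
        try
        {
            foreach (var step in steps)
            {
                step.Do();
                log.Add($"done {step.Name}");
                completed.Push(step);
            }
        }
        catch (Exception ex)
        {
            log.Add($"failed: {ex.Message}");
            // A stack pops the most recently completed step first.
            while (completed.Count > 0)
            {
                var step = completed.Pop();
                step.Undo();
                log.Add($"undone {step.Name}");
            }
        }
        return log;
    }
}
```

This sketch only undoes steps that completed. The Processor base class shown in "Behind the Scenes" also calls UndoAsync on the processor that failed, so an undo method there should tolerate the case that nothing was changed.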

Short Term Memory

As long as my system is processing user inputs without problems, I'm not so interested in what happened. I can see the successful results in my databases anyway, and I have happy users.

But whenever a system error occurs, I want to know exactly what happened from the moment the user submitted their data until the moment the error occurred. Therefore, my system needs what I call a "short term memory". That cannot be achieved with conventional logging: either the logs contain endless amounts of noise, or they contain very little, and by the time my system detects that it should have logged all details of a single failed transaction, it is too late.

My solution presented in this article sends me a Slack message only in case of a system error, but this then includes all necessary details of the genesis of the error.
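The idea of a short term memory can be sketched without any library: details are collected in memory for one transaction and are handed to a sink (Slack, a log, an error table) only if that transaction fails. The class ShortTermMemory and its members are hypothetical names used for illustration, not the API of my NuGet package.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-transaction buffer: collects details in memory and
// passes them to a sink only if the transaction fails.
public sealed class ShortTermMemory
{
    private readonly List<string> _entries = new();
    private readonly Action<string> _sink;

    public ShortTermMemory(Action<string> sink) => _sink = sink;

    public void Remember(string detail) => _entries.Add(detail);

    // Runs the work. On success nothing is sent and the entries are
    // discarded together with this object. On failure the whole story
    // is sent to the sink in one message.
    public bool Run(Action<ShortTermMemory> work)
    {
        try
        {
            work(this);
            return true;
        }
        catch (Exception ex)
        {
            _entries.Add($"ERROR: {ex.Message}");
            _sink(string.Join(Environment.NewLine, _entries));
            return false;
        }
    }
}
```

A successful run produces no output at all, which is the point: the details only exist for as long as the transaction runs, unless something goes wrong.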

Using the Code

If I coded the practical example described above in a "classical" way, it would look something like this:

public async Task<IResult> ConfirmBookingAsync(object inputData)
{
    var errorMessage = ValidateUserInput();
    if (!string.IsNullOrEmpty(errorMessage))
    {
        return Results.Json(errorMessage);
    }
    var data1 = await _data1Accessor.GetAsync(inputData);
    if (data1 is null)
    {
        return Results.Json("Could not find data in DB1.");
    }
    var data2 = await _data2Client.GetAsync(inputData);
    if (data2 is null)
    {
        return Results.Json("Could not get data from external service.");
    }
    var result = _calculator.Calculate(inputData, data1, data2);
    var newId = await _data2Accessor.InsertAsync(result);
    if (newId is null)
    {
        return Results.Json("Could not insert result into DB2.");
    }
    var ok = await _data1Accessor.UpdateAsync(inputData, newId);
    if (!ok)
    {
        await _data2Accessor.DeleteAsync(newId);
        return Results.Json("Could not update data in DB1.");
    }

    return Results.Redirect("/nextpage");
}

This "spaghetti" code has so many disadvantages that I do not even know where to start, but the most important one is obvious: it puts the burden of dealing with all system errors onto the user. This code does not meet any of my requirements of a resilient system (see introduction).

Instead of that, I can use a chain of actions like this (see the BookingManager class in the code example):

var result = await _chain.ProcessAsync(calledBy,
    typeof(ValidateUserInput),
    typeof(FetchFromDB1),
    typeof(GetDataFromExternalService),
    typeof(CalculateResult),
    typeof(InsertDataIntoDB2),
    typeof(UpdateDataInDB1));

When I compare this chain implementation with the spaghetti code above, I only need to understand once and for all that

the actions in the chain are performed one after the other,
whenever any action detects a user error the processing stops, skips all succeeding actions and returns the error message to the user, and
whenever any action runs into a system error the processing stops, it reverts all previous steps (which need reverting), it alerts me (e.g., on Slack) immediately, it tells me the whole story of what had happened, and it sends the user to a generic error page.
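These three rules can be reduced to a few lines. The following is a simplified sketch, not the implementation in Schoder.Chain: it assumes that every step is a Func<bool> which returns true to continue, returns false for a user error, and throws for a system error. Outcome and ChainRules are hypothetical names.

```csharp
using System;

public enum Outcome { Completed, UserError, SystemError }

public static class ChainRules
{
    // Each step returns true to continue or false to stop (user error).
    // A thrown exception is treated as a system error.
    public static (Outcome Result, int StepsStarted) Run(params Func<bool>[] steps)
    {
        var started = 0;
        try
        {
            foreach (var step in steps)
            {
                started++;
                if (!step())
                {
                    return (Outcome.UserError, started);
                }
            }
            return (Outcome.Completed, started);
        }
        catch (Exception)
        {
            // In the real chain, the undo steps and the alert happen here.
            return (Outcome.SystemError, started);
        }
    }
}
```

In both error cases the succeeding steps are never started. The difference lies only in what happens afterwards: a user error goes back to the user, a system error goes to me.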

Each action itself is broken down into a tiny processor (=action) class, e.g., ValidateUserInput, which contains one step/method (ProcessOk/ProcessOkAsync) of the pure business logic:

using SchoderChain;

namespace ChainExample.BLL.BookingActions
{
    // Inherit the SchoderChain.Processor class
    public class ValidateUserInput : Processor
    {
        private readonly BookingData _bookingData;

        // Inject your data object for the user data
        // Inject your slack manager for the system error notifications
        public ValidateUserInput(BookingData bookingData, ISlackManager slackManager)
            : base(slackManager) => _bookingData = bookingData;

        // Insert your business logic here
        protected override bool ProcessOk()
        {
            if (_bookingData.UserInput is null)
            {
                // this is a user error
                _bookingData.Result = Results.Json("Please enter data.");
            }
            // return "true" when everything is o.k. and the chain shall continue
            // return "false" when the chain shall stop (i.e., user error)
            return _bookingData.Result is null;
        }
    }
}

In the same way, I implement FetchFromDB1, GetDataFromExternalService, CalculateResult, InsertDataIntoDB2 and UpdateDataInDB1.

Step 5 of my chain of actions is a little bit special because I said that it needs to undo what it did in case of a "rollback" of the whole business transaction - which would be caused by any system error in a succeeding step. Therefore, in InsertDataIntoDB2, I also override the UndoAsync() method of the Processor class it inherits. In addition, I can use the _chainResult.StackTrace to add to my Slack notification what exactly was undone.

public InsertDataIntoDB2(BookingData bookingData, 
                         IDataAccessor dataAccessor, ISlackManager slackManager)
    : base(slackManager)
{
    _bookingData = bookingData;
    _dataAccessor = dataAccessor;
}

protected override async Task<bool> ProcessOkAsync()
{
    _bookingData.AnyDataObject.Id = Guid.NewGuid();
    var ok = await _dataAccessor.InsertAsync(_bookingData.AnyDataObject);
    return ok; // The chain only continues if the insert was successful
}

protected override async Task UndoAsync()
{
    if (_bookingData.AnyDataObject is not null)
    {
        // Remove the record inserted by ProcessOkAsync and note it in the short term memory
        await _dataAccessor.DeleteAsync(_bookingData.AnyDataObject.Id);
        _chainResult.StackTrace.Add(
            $"Undo {GetType().Name}: deleted record with id {_bookingData.AnyDataObject.Id}");
    }
}

Each of these classes performs one specific action/step of my chain (i.e., my business logic) and follows the same pattern:

The class inherits the SchoderChain.Processor class.
I inject an object for the data I'm operating on in these chain actions (e.g., BookingData).
I inject my SlackManager for the real-time system error notifications.
I also inject a data accessor or service client when needed.
I override the ProcessOk() or the ProcessOkAsync() method to perform my business logic.
I override the UndoAsync() method when needed.
I return "false" from ProcessOk() or ProcessOkAsync() when my code detects a user error (which stops the chain and returns the result to the user).

As a result, all my BLL managers stay very light, e.g., my BookingManager:

using ChainExample.BLL.BookingActions;
using SchoderChain;

namespace ChainExample.BLL
{
    public class BookingManager : IBookingManager
    {
        private readonly IChain _chain;
        private readonly BookingData _bookingData;

        public BookingManager(IChain chain, BookingData bookingData)
        {
            _chain = chain;
            _bookingData = bookingData;
        }

        public async Task<IResult> ConfirmBookingAsync
        (string calledBy, object? userInput, IResult redirectToErrorPage)
        {
            _bookingData.UserInput = userInput;
            var result = await _chain.ProcessAsync(calledBy,
                typeof(ValidateUserInput),
                typeof(FetchFromDB1),
                typeof(GetDataFromExternalService),
                typeof(CalculateResult),
                typeof(InsertDataIntoDB2),
                typeof(UpdateDataInDB1));

            return result.Exception is null
                ? _bookingData.Result
                : redirectToErrorPage;
        }

        // public async Task PerformMoreBusinessLogicAsync( ...
        // using a different chain of actions
    }
}

This is the essence of my Program.cs file (using minimal APIs in my case). I only need the dependency injection for my individual SlackManager implementing the SchoderChain.ISlackManager interface, and I need the dependency injection for the SchoderChain.Chain class which does all the work (see below "Behind the Scenes"). Everything else is straightforward:

using ChainExample.BLL;
using ChainExample.BLL.BookingActions;
using ChainExample.Helpers;
using SchoderChain;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddScoped<ISlackManager, SlackManager>();
builder.Services.AddScoped<IChain, Chain>();

builder.Services.AddScoped<IBookingManager, BookingManager>();
builder.Services.AddScoped<BookingData, BookingData>();

builder.Services.AddScoped<IProcessor, CalculateResult>();
builder.Services.AddScoped<IProcessor, FetchFromDB1>();
builder.Services.AddScoped<IProcessor, GetDataFromExternalService>();
builder.Services.AddScoped<IProcessor, InsertDataIntoDB2>();
builder.Services.AddScoped<IProcessor, UpdateDataInDB1>();
builder.Services.AddScoped<IProcessor, ValidateUserInput>();

var app = builder.Build();

var userErrorPage = "/error";
var userErrorRedirect = Results.Redirect(userErrorPage);

var bookingsEndpoint = "/bookings";
app.MapGet(bookingsEndpoint, async (IBookingManager bookingManager) =>
{
    return await bookingManager.ConfirmBookingAsync
    (bookingsEndpoint, userInput: null, userErrorRedirect);
});

app.MapGet(userErrorPage, () =>
{
    return Results.Empty;
});

app.Run();

This is my SlackManager for the real-time system error notifications. What it does can be replaced by anything you want to do with your system errors (e.g., storing them in an error table, writing them to the logs or sending them to an error-handling microservice).

using SchoderChain;
using SlackBotMessages;
using SlackBotMessages.Models;

namespace ChainExample.Helpers
{
    public class SlackManager : ISlackManager
    {
        public SlackManager() { }

        public async Task SlackErrorAsync(string messageBody)
            => await new SbmClient(SlackSecrets.SLACK_WEBHOOKURL_ERROR).Send(
                new Message
                {
                    Username = SlackSecrets.SLACK_USER,
                    Text = messageBody,
                    IconEmoji = Emoji.Bomb
                });
    }
}
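As an illustration of such a replacement, here is a minimal sketch of an implementation that keeps the error reports in memory, which can be handy in unit tests. The interface is redeclared as a stand-in so that the sketch compiles on its own; it only assumes the single method SlackErrorAsync used above. InMemoryErrorReporter is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Stand-in for SchoderChain.ISlackManager so that this sketch is self-contained.
public interface ISlackManager
{
    Task SlackErrorAsync(string messageBody);
}

// Hypothetical replacement for the Slack notification:
// collects the error reports instead of sending them anywhere.
public class InMemoryErrorReporter : ISlackManager
{
    private readonly List<string> _reports = new();

    public IReadOnlyList<string> Reports => _reports;

    public Task SlackErrorAsync(string messageBody)
    {
        _reports.Add(messageBody);
        return Task.CompletedTask;
    }
}
```

Registering such a class instead of SlackManager in Program.cs would be the only change needed, because the processors only depend on the interface.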

This is what my folders and files look like in Visual Studio (see code here). The point is that any data access layers or clients calling other services are not injected into the BLL manager classes, but only into the single actions where they are needed.

In real life, this is what happens in case of a system error, i.e., anything that goes wrong that I didn't think of in the first place, and anything that is not under my control (while user errors are directly shown to the user). The ChainResult.StackTrace keeps track of all steps performed and undone, and it gives me the description of the error it ran into. This is what is sent to Slack in case of an error in my example:

I can see that the endpoint /bookings was called with a user input error, which actions were performed and undone, and what specific error occurred. Of course, it would also show the real error message coming from my database access if this were not a simulation. Because I have access to _chainResult.StackTrace in all my processors, I can write into this error memory anything that helps me to investigate an error. These details never add noise to my logs, because they are only sent when an error occurs.

All of the above works using the NuGet package Schoder.Chain, and you can use any library for the Slack messages (or send the notifications in any way you want to anyone you want). This is my NuGet package, and this is the Slack bot I used:

You can also create your own implementation of this chain, if you do not want to use my NuGet package. This is what happens behind the scenes, and it should be easy to replicate.

Behind the Scenes

The NuGet package "Schoder.Chain" consists mainly of three small classes.

The Chain class has a ProcessAsync() method which can be called in any BLL manager (like my BookingManager above) with the list of processors (processorChainTypes) that shall be processed step by step in the given order. Because this class has all Processor classes injected as a collection (_allProcessors), it can find the ones it needs this time. Then it sets the predecessor and successor for each processor in the current chain - which links them together logically - and finally, it starts the execution of the first processor.

To be clear: _allProcessors are injected by the .NET Core dependency injection (see my Program.cs file above). But I have several different chains defined in several different methods in several different BLL managers performing several different business processes. The ones which shall be performed "now" are handed over in the processorChainTypes parameter when _chain.ProcessAsync() is called in a BLL manager (see BookingManager above).

namespace SchoderChain
{
    public class Chain : IChain
    {
        private readonly IEnumerable<IProcessor> _allProcessors;

        public ChainResult ChainResult { get; set; }

        public Chain(IEnumerable<IProcessor> allProcessors) => 
                     _allProcessors = allProcessors;

        public async Task<ChainResult> ProcessAsync
               (string calledBy, params Type[] processorChainTypes)
        {
            ChainResult = ChainResult.Create(calledBy);
            await (FirstLinkedProcessor()?.ProcessChainAsync(ChainResult) ?? 
                   Task.FromResult<ChainResult>(null));
            return ChainResult;

            IProcessor FirstLinkedProcessor()
            {
                IProcessor firstProcessor = null, previousProcessor = null;

                foreach (var processorType in processorChainTypes)
                {
                    var processor = _allProcessors.Single
                                    (p => p.GetType() == processorType);
                    processor.Successor = null;
                    processor.Predecessor = previousProcessor;
                    if (processor.Predecessor is not null)
                    {
                        processor.Predecessor.Successor = processor;
                    }
                    firstProcessor = firstProcessor ?? processor;
                    previousProcessor = processor;
                }

                return firstProcessor;
            }
        }
    }
}

The Processor class is the base class for all processors (i.e., each step in our business processes like my InsertDataIntoDB2 class above). When the Chain (see above) executes the first processor of a given chain (ProcessChainAsync), that processor tries to perform its own action (business logic). Whenever the business logic returns true, the method executes the next processor in the chain, until the end of the chain is reached.

In case of an error, it goes through all undo steps in the opposite direction of this chain (UndoChainAsync) and finally sends the whole stack trace and the error message to the injected SlackManager.

public class Processor : IProcessor
{
    public IProcessor Predecessor { get; set; }

    public IProcessor Successor { get; set; }

    protected readonly ISlackManager _slackManager;
    protected ChainResult _chainResult;

    public Processor(ISlackManager slackManager) => _slackManager = slackManager;

    public async Task ProcessChainAsync(ChainResult chainResult)
    {
        _chainResult = chainResult;
        try
        {
            _chainResult.StackTrace.Add(GetType().Name);
            if (!await ProcessOkAsync()) { return; }
            await (Successor?.ProcessChainAsync(_chainResult) ?? 
                              Task.FromResult<object>(null));
        }
        catch (Exception ex)
        {
            _chainResult.Exception = ex;
            await UndoChainAsync(_chainResult);
            await _slackManager.SlackErrorAsync($"{_chainResult.CalledBy}" +
                $"{Environment.NewLine}{new string('-', 20)}" +
                $"{Environment.NewLine}{string.Join(Environment.NewLine, _chainResult.StackTrace)}" +
                $"{Environment.NewLine}{Environment.NewLine}{_chainResult.Exception.Message}" +
                $"{Environment.NewLine}{_chainResult.Exception.InnerException?.Message}");
        }
    }

    public async Task UndoChainAsync(ChainResult chainResult)
    {
        await UndoAsync();
        await (Predecessor?.UndoChainAsync
              (_chainResult) ?? Task.FromResult<object>(null));
    }

This class also offers the necessary methods which you override in your business logic actions to perform and undo your business logic steps/actions (see example, InsertDataIntoDB2 above):

protected async virtual Task<bool> ProcessOkAsync()
{
    await ProcessAsync();
    Process();
    return ProcessOk();
}

protected async virtual Task ProcessAsync() => await Task.CompletedTask;

protected virtual void Process() { }

protected virtual bool ProcessOk() => true;

protected virtual Task UndoAsync() => Task.CompletedTask;

The ChainResult class contains the "short term memory" of everything that happened during the processing of a given chain:

public class ChainResult
{
    public string CalledBy { get; set; }

    public List<string> StackTrace { get; set; } = new List<string>();

    public Exception Exception { get; set; }

    public static ChainResult Create(string calledBy) => 
                  new ChainResult { CalledBy = calledBy };
}
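If you want to replicate the pattern yourself, you also need the three interfaces used above. The following sketch is inferred only from the members that the code in this article calls. It is an assumption about their shape, and the declarations in the actual package may differ or contain more members.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Copy of the ChainResult class from above, so that this sketch compiles on its own.
public class ChainResult
{
    public string CalledBy { get; set; }
    public List<string> StackTrace { get; set; } = new List<string>();
    public Exception Exception { get; set; }
    public static ChainResult Create(string calledBy) =>
        new ChainResult { CalledBy = calledBy };
}

// Assumed shape: the one method that Processor calls in its catch block.
public interface ISlackManager
{
    Task SlackErrorAsync(string messageBody);
}

// Assumed shape: the members that Chain and Processor use to link
// the processors and to walk the chain forwards and backwards.
public interface IProcessor
{
    IProcessor Predecessor { get; set; }
    IProcessor Successor { get; set; }
    Task ProcessChainAsync(ChainResult chainResult);
    Task UndoChainAsync(ChainResult chainResult);
}

// Assumed shape: the method that the BLL managers call.
public interface IChain
{
    Task<ChainResult> ProcessAsync(string calledBy, params Type[] processorChainTypes);
}
```

Together with the Chain and Processor classes shown above, these declarations should be enough for a first compile of your own implementation.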

You can find the source code here on GitHub, and you can use the unit tests in there to start your own implementation of such a design pattern, if you do not want to use my NuGet package.

Points of Interest

Why does this approach make anything better?

In a production environment with real users, we can look at the system in a narrow sense consisting of software and hardware. When I compare the "spaghetti code" example above with the chain implementation in my BookingManager, I think it is obvious that my suggestion presented here decouples the control flow as a generic pattern from the specific actions. That alone reduces complexity and therefore makes this system already more resilient.

But I also want to look at the system in a broader sense and to include users, customer support, incident management, testers, developers, product managers and so on. In a "classic" implementation (especially in a microservices architecture), it is impossible to avoid all system errors. I think it is obvious that many users simply shy away from a platform when they run into an error of any kind. Only very few of them get in touch with customer support. If they report a problem, they spend a lot of time explaining (vaguely) what they did, when they did it and what happened. Eventually, a huge organizational apparatus is busy documenting, categorizing and managing the problem reported by the user.

All of that results in frustrating investigations by developers based on little to no facts, and usually these investigations start far too late. I think every developer having gone through tons of logs and trying to extract as much information as possible from (understandably) helpless users involving (understandably) similarly helpless customer services knows what I'm talking about.

My solution presented here makes this larger system much more resilient. As a developer, I'm informed about any system error by my own code immediately, even if most users don't tell us about the error. My code tells me exactly what I need to know (it even limits the damage). I can also tell my users on an error page that they do not need to get in touch with us because we already know that they ran into a system error and that we are already working on it.

As a result, customer services can focus on their core job of supporting users with real user problems, because they are no longer bothered with system errors at all.

Limitations

What are the limitations of my suggested approach?

I think it is clear that a collapse of my two databases (see example above) at the wrong moment could defeat the damage limitation: the insert succeeds, then both databases go offline, the update fails, and the delete in the undo part fails as well. That extreme case would still result in inconsistent data.
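One way to at least make such a failed undo visible is to let every undo step run even if an earlier one throws, and to record the failures in the same trace that is sent with the alert. This is only a sketch of that idea under the assumption that undo steps are given as named delegates. BestEffortUndo is a hypothetical helper and not part of Schoder.Chain, and it does not repair the data.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BestEffortUndo
{
    // Runs every undo action even if an earlier one throws,
    // and records the outcome of each one in the returned trace.
    public static async Task<List<string>> RunAsync(
        IEnumerable<(string Name, Func<Task> Undo)> undoActions)
    {
        var trace = new List<string>();
        foreach (var (name, undo) in undoActions)
        {
            try
            {
                await undo();
                trace.Add($"Undo {name}: ok");
            }
            catch (Exception ex)
            {
                // The data may now be inconsistent, so say so explicitly.
                trace.Add($"Undo {name}: FAILED ({ex.Message}), manual cleanup needed");
            }
        }
        return trace;
    }
}
```

The inconsistency remains, but the notification then tells me which record needs manual cleanup.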

Slack could fail, and at the same time, my error service logging these errors could also fail. In the end, I would not get my notifications.

When my whole system is "down", nothing will work anymore, and then I won't get any notifications, of course.

My business processes might, in fact, consist of such awful nested ifs that I cannot break them down into my generic design pattern of chains of actions.

History

30th May, 2023 - Document creation
1st June, 2023 - Typos
16th June, 2023 - Typo