Distributed Locking with Redlock.net

Bohdan Stupak

4.71/5 (9 votes)

Mar 17, 2020

CPOL

10 min read

22471

How to perform distributed lock between multiple instances of a microservice with Redlock.Net

Why Locking Things

Microservice architecture becomes widely adopted these days. One of the benefits it offers is the possibility of horizontal scaling which allows us to increase the performance of our application dramatically. However, there are situations when multiple instances of service face contention for some shared resource. In such a case, one of the instances would acquire a lock over this resource claiming exclusive access to it. Let’s have a look at some possible use cases when this is helpful.

Sending Notifications

Let’s imagine a service that takes a batch of notifications from the database and sends them to customers. Service is designed as a background worker which operates once in a given period of time. In case we want to increase the performance of the service via horizontal scaling, we need to somehow signal that the batch is consumed by a given instance of the service. We could create a flag inside the database and set it if the messages are occupied by worker instance. However, such a solution would pollute our data model. Instead, we can place a lock for a given batch inside a distributed storage.

Poor Man’s Failover

Another reason might be poor-man’s failover scenario, where one instance of a service executes work, while the other is idle and waits just in case the first instance fails for some reason.

Ensuring Exactly Once Message Processing

Imagine multiple service instances consuming messages from the stream. Depending on stream implementation, this may lead to every instance consuming the same message. One of the possible solutions is to shard stream creating a sub-stream for every dedicated instance. The disadvantage of this solution is that we have to reconfigure the stream each time we upscale/downscale service. An alternative solution would be each service actually consume the message but try to acquire a distributed lock on this message before processing it. This way, each service will be processed only once.

Using Redis

As storage for a distributed lock, we’ll use Redis. Redis has a number of advantages:

Redis is in-memory key-value storage that provides additional speed.
It has the ability to set TTL value for a key which allows us to expire lock. We will see later that this is the important part of the algorithm.

There is already a library that implements distributed lock over Redis. So we just have to make use of it.

Redlock Implementation

Let’s imagine we’ll use a single Redis as a storage lock. Such a solution has an obvious downside: it is the single point of failure. However, you can’t mitigate this issue by simply adding a replica because it will lead to a race condition. Imagine the following scenario:

Client A acquires the lock in the master.
The master crashes before the write to the key is transmitted to the replica.
The replica gets promoted to master.
Client B acquires the lock to the same resource A already holds a lock for.

So how do you implement the algorithm correctly?

In the distributed version of the algorithm, we assume we have N Redis masters. Those nodes are totally independent, so we don’t use replication or any other implicit coordination system. We already described how to acquire and release the lock safely in a single instance. We take for granted that the algorithm will use this method to acquire and release the lock in a single instance. In our examples, we set N=5, which is a reasonable value, so we need to run 5 Redis masters on different computers or virtual machines in order to ensure that they’ll fail in a mostly independent way.

In order to acquire the lock, the client performs the following operations:

It gets the current time in milliseconds.
It tries to acquire the lock in all the N instances sequentially, using the same key name and random value in all the instances. During step 2, when setting the lock in each instance, the client uses a timeout which is small compared to the total lock auto-release time in order to acquire it. For example, if the auto-release time is 10 seconds, the timeout could be in the ~ 5-50 milliseconds range. This prevents the client from remaining blocked for a long time trying to talk with a Redis node which is down: if an instance is not available, we should try to talk with the next instance ASAP.
The client computes how much time elapsed in order to acquire the lock, by subtracting from the current time the timestamp obtained in step 1. If and only if the client was able to acquire the lock in the majority of the instances (at least 3), and the total time elapsed to acquire the lock is less than lock validity time, the lock is considered to be acquired.
If the lock was acquired, its validity time is considered to be the initial validity time minus the time elapsed, as computed in step 3.
If the client failed to acquire the lock for some reason (either it was not able to lock N/2+1 instances or the validity time is negative), it will try to unlock all the instances (even the instances it believed it was not able to lock).

Is Redlock Safe?

In order to be considered safe, Redlock algorithm must satisfy the following properties:

Safety property: Mutual exclusion. At any given moment, only one client can hold a lock.
Liveness property A: Deadlock free. Eventually, it is always possible to acquire a lock, even if the client that locked a resource crashes or gets partitioned.
Liveness property B: Fault tolerance. As long as the majority of Redis nodes are up, clients are able to acquire and release locks.

Let’s examine various scenarios to see if Redlock conforms to these properties.

To start, let’s assume that a client is able to acquire the lock in the majority of instances. All the instances will contain a key with the same time to live. However, the key was set at different times, so the keys will also expire at different times. But if the first key was set at worst at time T1 (the time we sample before contacting the first server) and the last key was set at worst at time T2 (the time we obtained the reply from the last server), we are sure that the first key to expire in the set will exist for at least MIN_VALIDITY=TTL-(T2-T1)-CLOCK_DRIFT. All the other keys will expire later, so we are sure that the keys will be simultaneously set for at least this time.

During the time that the majority of keys are set, another client will not be able to acquire the lock, since N/2+1 SET NX operations can’t succeed if N/2+1 keys already exist. So if a lock was acquired, it is not possible to re-acquire it at the same time (violating the mutual exclusion property).

However, we want to also make sure that multiple clients trying to acquire the lock at the same time can’t simultaneously succeed.

If a client locked the majority of instances using a time near, or greater, than the lock maximum validity time (the TTL we use for SET basically), it will consider the lock invalid and will unlock the instances, so we only need to consider the case where a client was able to lock the majority of instances in a time which is less than the validity time. In this case for the argument already expressed above, for MIN_VALIDITY no client should be able to re-acquire the lock. So multiple clients will be able to lock N/2+1 instances at the same time (with “time” being the end of Step 2) only when the time to lock the majority was greater than the TTL time, making the lock invalid.

The system's liveness is based on three main features:

The auto release of the lock (since keys expire): eventually keys are available again to be locked.
The fact that clients, usually, will cooperate removing the locks when the lock was not acquired, or when the lock was acquired and the work terminated, making it likely that we don’t have to wait for keys to expire to re-acquire the lock.
The fact that when a client needs to retry a lock, it waits a time which is comparably greater than the time needed to acquire the majority of locks, in order to probabilistically make split brain conditions during resource contention unlikely.

However, we pay an availability penalty equal to TTL time on network partitions, so if there are continuous partitions, we can pay this penalty indefinitely. This happens every time a client acquires a lock and gets partitioned away before being able to remove the lock.

Basically, if there are infinite continuous network partitions, the system may become not available for an infinite amount of time.

The Code

The sample code can be accessed on Github. Let’s break down what actually happens here.

The idea behind leader election via distributed lock is whoever acquires lock over the shared resource becomes a leader. So naturally, we have a lock key quite similarly to built-in C# lock construct.

 private const string _resource = "the-thing-we-are-locking-on";

Obviously, the storage is a single point of failure so we have to make sure that it is reliable. RedLock.net which we use for our case allows us to use multiple instances of Redis instead of single in order to improve reliability.

Here’s how we create a connection to Redis during start-up.

 var endPoints = new List<RedLockEndPoint>
{
    new DnsEndPoint("redis1", 6379)
    new DnsEndPoint("redis2", 6379)
    new DnsEndPoint("redis3", 6379)
};
_distributedLockFactory = RedLockFactory.Create(endPoints);

In case you need to provide password, you may take advantage of RedLockEndPoint.

var endpoint = new RedLockEndPoint(new DnsEndPoint("localhost", 49153));
endpoint.Password = "redispw";

Each instance tries to acquire a lock once in a given period of time. If it succeeds, it becomes a leader. If not, it will try once again later. CreateLockAsync has an overload that accepts the retry interval as well as the interval by which the lock is acquired. The method will block until the lock is acquired or until wait timeout provided as the parameter will expire.

private async Task TryAcquireLock(CancellationToken token)
{
    if (token.IsCancellationRequested)
        return;

    var distributedLock = await _distributedLockFactory.CreateLockAsync(
        _resource,
        _expiry,
        _wait,
        _retry,
        token);
    if (distributedLock.IsAcquired)
    {
        DoLeaderJob();
    }
}

As you can see, the lock expires after the provided amount of time which is the part of auto release mechanism of RedLock algorithm. As an author of the algorithm notes:

A distributed lock without an auto release mechanism, where the lock owner will hold it indefinitely, is basically useless. If the client holding the lock crashes and does not recover with full state in a short amount of time, a deadlock is created where the shared resource that the distributed lock tried to protect remains forever unaccessible. This creates a liveness issue that is unacceptable in most situations, so a sane distributed lock must be able to auto release itself.

However, you do not need to re-acquire the lock explicitly in your code since it has auto-extend feature. At first encounter with RedLock.net, this might be unintuitive so it should be noted. In simple words, you might think of it as a sort of leader health-check built into RedLock algorithm.

As mentioned above, we get rid of re-acquire timer as soon as an instance becomes a leader taking advantage of auto-extend feature. Once the instance fails, the lock is released and is up to other instances for grabs.

Summary

As we can see, the implementation of leader election via distributed lock is pretty straightforward. Still, it should be used with care since every locking increases contention between instances of a microservice and thus reduces the benefits of horizontal scaling.

History

16^th March, 2020: Initial version
11^th July, 2022: Replaced hacky timer with dedicated overload. Fixed paragraph spacing,.
19^th March, 2023: Added distributed lock use cases
25^th July, 2023: Added Redlock implementation details