Introduction

The Internet API workload has increased multi-fold in the past few years. Microservices are created to break down large systems into small manageable and communicable services. Owing to this distributed nature, High Availability (HA) is as much a contributing success factor as correctness and fast execution times. In this post, I am going to describe how HA is managed now by reverse proxies and suggest few improvements in certain special cases.

What High Availability Entails

HA comes in many forms — load distribution, concurrent requests/s, support from hardware (RAM, CPU cores), software (language constructs, caching strategy, connection handling), network bandwidth and so on. Then there are factors like downtimes and deployments which may cause long service breaks and hinder HA. In general, while most of these are handle-able and depend on the capability of developers and infra team, few go beyond the scope of a single API service, things like — load distribution, downtimes and deployment breaks.

Problem - Effects Of Service Outages

Any kind of stoppage in the service (caused due to downtimes or deployment breaks), however small it is, brings down HA metrics and breaks trust with consumers of that service. In such scenarios, the services will usually respond with errors that can either have a cascading effect on clients latency, or break down client workflow or induce tedious rework on client's part. The scope of these errors remain limited to a single request-response cycle and thus any subsequent client request result in further failures or reduce overall throughput of the cluster.

Proposed Solution - Listening To Errors

One solution to handle service errors is to build a cluster of parallel services and hide it behind a protective layer. Let's refer to these services as nodes now. This layer is responsible for choosing an available node among a set of faulty and working nodes. This is not uncommon though. Lot of present day reverse proxies like nginx, haproxy, caddy use this approach which in broad terms is called load balancing. They internally utilize various routing algorithms to cater to varying degree of use cases. While round robin is usually the algorithm of choice and works in most cases, there might be a requirement of (say) tying certain clients to a set of services, which is then solved by using IP hashing techniques. This is one of the examples. There are multiple such algorithms (or a combination of algorithms) which can be employed to serve special use cases.

The current set of proxies use the service errors only to select the next available service and throw error to the client if they are unable to find any. The issue with using plain round robin is that every request may be tried on every node irrespective of its state. Also, if all nodes are down, the proxy may simply relay the errors back to the client, which ends the communication. I am proposing two ways to better handle such scenarios.

a) Consider Error as a Metric in Routing Algorithm

Consider a scenario where one out of three nodes in a cluster is down. If the proxy uses plain round robin algorithm, there is a 33% chance that a request will hit this errored service and then retried on one of the other nodes. This increases the response times of these 33% requests and effects overall throughput of the cluster. If we start storing errors as weights against the load balanced nodes, we can use that information to reduce the percentage requests from 33% down to less than 1% depending on the length of downtime. The weights can be reset when all nodes are back up again. This can completely take care of partial cluster outages (1 - n-1 nodes down).

b) Defer Request Forwarding on Total Outage

In case of total cluster outage (n nodes down), the existing proxies simply relay the errors back to the client, which then handles it as it sees fit - stopping the flow of dependant requests, or run alternate logic, or rerun failed requests in batches every day, and so on. This work can be offloaded to the proxy itself and free the client from running any auxillary logic. Consider requests that send updates to a cluster and further behaviour is not dependant on the response. Examples could include most asynchronous requests like initiating refunds on failed transaction, sending notification, updating order details with alternate delivery timings, and so on. Rather than failing these requests due to temporary cluster unavailability, a proxy can store and look to re-process them when at least one of the nodes is re-available.

Implementation

To showcase the difference, I have implemented a small HTTP load balancer 'ServiceQ', which caters to both the points mentioned above (HTTP - because it is the protocol of internet but same principles can be applied to any other protocols as well such as FTP, XMPP and so on).

For point (a), ServiceQ maintains a map of node to number of errors seen since last availability and uses it to dynamically assign weights to node which the routing algorithm makes use of. The probability of node selection decreases as the errors on that node increases. This algorithm is run on every new request and first try. Subsequent retries select nodes by round robin method.

For point (b), ServiceQ maintains a FIFO queue of failed requests (when all nodes in the cluster were down) and retries them after a set interval using the same algorithm mentioned above. This goes on until at least one of the nodes is re-available.

Impact

I have used the excellent Apache Bench (ab) to test the new system. I'll be sending down 2000 HTTPS requests with concurrency of 100 to a cluster with 4 nodes (deployed on an EC2 t2 medium linux machines).

Test 1

Let's call our new algorithm in point (a) as 'error_probabilistic'. The base algorithms are 'randomize' and 'round_robin'.

Cluster: 1
Nodes: 4 (:8000, :8001, :8002, :8003)
Unavailable Nodes: :8002, :8003
No of Requests: 2000

Algorithm: randomize
No of hits and errors: 

:8000 => 1460
:8001 => 542
:8002 => 513 (failed, another node chosen)
:8003 => 998 (failed, another node chosen)

Algorithm: round_robin
No of hits and errors:

:8000 => 943
:8001 => 943
:8002 => 935 (failed, another node chosen)
:8003 => 936 (failed, another node chosen)

Algorithm: error_probabilistic
No of hits and errors:

:8000 => 1024
:8001 => 978
:8002 => 57 (failed, another node chosen)
:8003 => 60 (failed, another node chosen)

As we can observe, our new algorithm has attempted requests very few times on failed nodes - 57 and 60. Also, observe that the distribution is almost equal here (as :8002 and :8003 were down for the same amount of time).

Another point to notice here is that a total of 2119 tries were needed to complete all 2000 requests with error_probabilistic algorithm, whereas the same number was 3513 and 4757 for other two algorithms. In turn, this effects the total response times and throughput of the cluster.

Test 2

Cluster: 1
Nodes: 4 (:8000, :8001, :8002, :8003)
Unavailable Nodes: :8000, :8001, :8002, :8003
No of Requests: 2000

:8000 => 2000 (failed, another node chosen)
:8001 => 2000 (failed, another node chosen)
:8002 => 2000 (failed, another node chosen)
:8003 => 2000 (failed, another node chosen)

When all requests are failed after retries on each node, they are pushed to a deferred FIFO queue. ServiceQ sends appropriate response message to client that request is buffered. When one of the nodes (:8002) is re-available:

:8000 => 99 (failed, another node chosen)
:8001 => 131 (failed, another node chosen)
:8002 => 2001
:8003 => 51 (failed, another node chosen)

This way, we don't lose requests and clients are freed from writing any auxillary error handling logic.

Conclusion

There are many techniques to high availability and load balancing in use today that are based on very sound principles. However, they employ a set of rules which are static in nature and does not respond well to changing cluster behaviour. Listening to cluster errors in the way proposed not only helps overcome these issues but makes deployment easier, downtimes manageable and service more trustworthy.