Mitigating downtime when selecting a new active node

We are evaluating Vault to see if it can help with our secret management. For a POC, we created an EKS cluster, used the official Helm chart to deploy a HA Vault cluster, and enabled ingress via an ALB.
However, testing under load we found that if the active node is killed, no traffic is served.

I understand this is by design - there can only be one active node, so requests cannot be served while the active node is being changed. However, while this process is quick (~1s), the concern is a cluster receiving 100s of requests a second, moving to potentially 1000s: that's a lot of failing requests in that window. On top of that, there are a few comments out there from people seeing the switchover take 30s, and even one case of someone mentioning it taking 5 minutes.

As the company grows, we are going to be handling a large number of requests a second that cannot fail. The concern is that during cluster upgrades (which need to happen regularly for compliance) the active node is going to switch, and we need to try to ensure that the impact is ideally nothing.

So how are people handling this on production clusters? Is the expectation just to have clients retry, in the hope that a new active node is selected quickly enough that after a retry or two it's fine, or are there other options?
The majority of our traffic is going to be read requests. Does Enterprise with performance replicas solve this issue? I assume the downtime during active node selection would still happen, but would the fact that standby nodes can respond to read requests help mitigate the problem?


"Requests that cannot fail" is something that is extremely difficult to achieve in reality.

As you mention, the Enterprise performance replicas feature can be useful for read traffic, but it is no use for write requests. Also, even for reads you would likely still get failed requests, as a load balancer takes some time to detect a failure before removing a node from the pool.
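One thing that can reduce the load-balancer side of that window: Vault's `/v1/sys/health` endpoint accepts query parameters such as `standbyok=true` (and `perfstandbyok=true` on Enterprise) so that standby nodes also report 200, keeping them in the target group rather than draining everything except the active node. The sketch below assumes the AWS Load Balancer Controller; the annotation names come from that controller, and the exact values are illustrative, not a recommendation:

```yaml
# Illustrative ALB ingress annotations (AWS Load Balancer Controller).
# With standbyok=true, standby nodes answer the health check with 200,
# so the ALB keeps them in the pool instead of routing only to the active node.
alb.ingress.kubernetes.io/healthcheck-path: /v1/sys/health?standbyok=true&perfstandbyok=true
alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "2"
```

This does not remove the failover gap for writes, but it shortens the time reads spend pinned to a dead node.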

In general, for any remote system you connect to, you need to assume failures and design accordingly - transient network or server issues are a fact of life, so requests will fail and need to be retried. Techniques like exponential backoff with random jitter are advised to prevent cascading problems.
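As a minimal sketch of that pattern in plain Python (the function name and defaults are my own, and `fn` stands in for any Vault call you want to wrap - this is not from any particular client library):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call fn(), retrying on exception with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: sleep a random amount up to the exponential cap,
            # so many clients retrying at once don't all hit Vault together.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

A failover of ~1s would typically be absorbed within the first retry or two here, which is exactly the behaviour you described hoping for.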

At that many requests per second it may also be worth looking at functional sharding - multiple clusters split by usage type, to reduce the impact of a single issue or blip.