We are reviewing Vault to see if it can help with our secret management. For a POC, we created an EKS cluster and used the official helm chart to deploy a HA Vault cluster & enabled ingress via an ALB.
However, testing under load we have found that should the active node be killed no traffic is served.
I understand this is be design - there can only be 1 active node, so requests cannot be served while the active node is being changed. However, while this process is quick (~1s) the concern is a cluster receiving 100’s of requests a second, moving to potentially 1000’s, thats a lot of failing requests in that period. Combined with the fact that theres a few comments out there from people seeing it take 30s, or even 1 case of someone mentioning it taking 5 mins.
As the company grows, we are going to be handling a large number of requests a second that cannot fail. The concern is, during cluster upgrades (which need to happen regularly for compliance) that the active node switching is going to be something that happens & we need to try ensure that the impact is ideally nothing.
So how are people handling this on production clusters? Is the expectation to just have clients re-try in the hope that a new active node is selected quickly enough that after a re-try or 2 its fine, or are there other options?
The majority of our traffic is going to be read requests - Does enterprise with performance replicas solve this issue? I assume the downtime during the active node selection would still happen, but the fact the standby nodes can response to read requests would that help mitigate issues?