So we have Vault running in AWS, backed by Consul. All works well day to day, and 90% of the upgrades we have run have had no issues at all. Occasionally, though, I get a long timeout from Consul while the leader node is being terminated. My understanding is that Vault only runs a single node as active and everything forwards traffic to that leader. Each Vault node has a Consul agent running that communicates with the Consul server instances (behind their own ELB in our case). Any of these server nodes can serve traffic, from my understanding.
The process we have written works off the Consul autopilot API endpoint to:
- find nodes that are not “leader: true” and terminate one
- wait until a new node is spun up by the ASG
- wait for the node to become ready in AWS
- monitor the failure tolerance entry (FailureTolerance) in autopilot before terminating the next node that is not the leader
- finally, once all non-leader nodes have been replaced, terminate the leader node
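For anyone wanting to reproduce this, here is a rough Python sketch of the decision logic in the steps above, not our actual script. It assumes the Consul autopilot health endpoint (`GET /v1/operator/autopilot/health`) and its `Servers`, `Leader`, `Healthy` and `FailureTolerance` fields. The agent address, the function names, and the `expected_servers` and `min_tolerance` parameters are made up for illustration, and ACL tokens and the AWS/ASG calls are left out.

```python
import json
import urllib.error
import urllib.request

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumption: a local Consul agent


def fetch_health(addr=CONSUL_ADDR):
    """Fetch autopilot health as a dict from the Consul HTTP API."""
    url = f"{addr}/v1/operator/autopilot/health"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # As far as I know Consul answers 429 when the cluster is unhealthy
        # but still sends the JSON body, so try to parse it in that case.
        if err.code == 429:
            return json.load(err)
        raise


def termination_order(health):
    """Return server names with every non-leader first and the leader last."""
    servers = health["Servers"]
    followers = [s["Name"] for s in servers if not s["Leader"]]
    leaders = [s["Name"] for s in servers if s["Leader"]]
    return followers + leaders


def safe_to_terminate_next(health, expected_servers, min_tolerance=1):
    """True when the cluster is back to full size and can lose another node."""
    healthy = [s for s in health["Servers"] if s["Healthy"]]
    return (
        health["Healthy"]
        and len(healthy) >= expected_servers
        and health["FailureTolerance"] >= min_tolerance
    )
```

The idea would be to call `termination_order(fetch_health())` once at the start, then poll `safe_to_terminate_next(fetch_health(), expected_servers=N)` after each replacement before touching the next node.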
Initially this did cause issues, as we did not have “leave_on_terminate” configured, so the cluster would panic for a bit and only elect a new leader once it realised the node was gone for good. Setting “leave_on_terminate” resolved this, and everything has mostly been fine since. However, it still seems inconsistent: I can get 10-50 seconds of no response from Consul, which manifests as Vault errors for clients trying to pull secrets.
Is there any setting that someone thinks I might have missed and should look into to speed up leader failover when the leader is shut down cleanly? Or is there a better method to force a cluster election to happen?