Patching Consul for Vault behind ASG

timalexander-inv · February 18, 2021, 2:23pm

So we have Vault running in AWS backed by Consul. All works well day to day and 90% of the upgrades we have run have no issue at all. Occasionally though I seem to get a large timeout from Consul whilst the leader node is terminated. My understanding is that vault is only running a single node as active and everything forwards traffic to the leader. Each Vault node has a Consul agent running that communicates with consul server instances (behind their own ELB in our case). Any of these nodes can server traffic from my understanding.

The process we have written works off of the autopilot API endpoint to:

find nodes that are not “leader: true” and terminate one
wait until new node is spun out by ASG
wait for node to become ready in AWS
monitor Failure to Tolerate entry in autopilot before terminating next node that is not leader
Finally once that is complete terminate the leader node.

Initially this did cause issues as we did not have the “leave_on_terminate” configured so the cluster would have a bit of a panic then elect a new leader once it realised the node was gone for good. Setting “leave_on_terminate” resolved this and everything has been fine. However it seems inconsistent still and I can get 10-50 seconds of no response from consul which manifests as vault errors to clients trying to pull secrets.

Is there any setting that someone thinks I might have missed and should look in to to try and speed up the leader failover when the leader is shutdown cleanly? Or is there a better method to force a cluster election to happen?

timalexander-inv · May 18, 2021, 11:15am

Ah so figured out the cause. Our fix to add in “leave_on_terminate” was great bar the fact that we added it as config to a separate file that was created after consul had been started. A consul reload resolved the issue but the fix was just a case of ordering the code better so it creates the config file before consul is started. Silly me.

Topic		Replies	Views
HA active/standby mode switching frequency Vault	2	1368	September 16, 2019
[SOLVED:] Vault (Linux i386) briefly joining Consul-backed HA cluster before timeout Vault consul-vault	3	1730	November 21, 2020
Containerize Vault and Consul Consul	6	515	January 25, 2023
Trouble getting Consul servers to re-join when AWS marks a node as impared Consul	0	620	January 2, 2020
Consul reporting two Vault active nodes Consul vault	7	523	November 21, 2022

Patching Consul for Vault behind ASG

Related topics