Upgrading Consul servers using Helm / GitOps

ngc4579 · September 16, 2022, 2:37pm

We are running a Consul / Vault installation with 3 replicas (RKE cluster); the entire stack is deployed by a GitOps controller using Helm Charts.

The recommended procedure for upgrading Consul servers is to set the server.updatePartition value and then lower it step by step until it reaches 0.
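
To make the step-down concrete, here is a minimal sketch of which server pods get replaced at each partition value. It relies on the Kubernetes StatefulSet rule that a partitioned rolling update only touches pods whose ordinal is greater than or equal to the partition. The pod name prefix consul-server- is an assumption (it depends on the Helm release name), and the helper names are hypothetical.

```python
from typing import List, Tuple


def pods_updated(replicas: int, partition: int) -> List[str]:
    """Server pods running the new version at a given partition value.

    A partitioned StatefulSet rolling update only replaces pods whose
    ordinal is >= partition. The "consul-server-" prefix is an assumed
    naming scheme; adjust it to your release.
    """
    return [f"consul-server-{i}" for i in range(replicas) if i >= partition]


def rollout_steps(replicas: int) -> List[Tuple[int, List[str]]]:
    """Partition values from `replicas` down to 0, with the pods updated so far."""
    return [(p, pods_updated(replicas, p)) for p in range(replicas, -1, -1)]


if __name__ == "__main__":
    for partition, pods in rollout_steps(3):
        print(f"partition={partition}: updated pods = {pods}")
```

With 3 replicas, the first step (partition 3) replaces no servers, and each later step adds exactly one server, highest ordinal first. In a GitOps setup each step would be a separate commit, ideally made only after the cluster has a stable leader again.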

How long should the Consul cluster take to ‘recover’ once one of its instances has been replaced by a newer version? We have applied the above-mentioned recommendations several times, yet the cluster seems to remain in an unstable state and never recovers.

consul members reports all instances as alive, but Raft leader election seems to be stuck in an endless loop. We have seen cases where the new instance cannot join properly, or joins only as a ‘non-voter’. Is this simply a matter of time, i.e. should we allow more time for cluster reconciliation after each instance upgrade?
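
One way to tell "still electing" apart from "joined as non-voter" is to look at the Raft configuration rather than at consul members. The sketch below summarises a response shaped like the one from Consul's /v1/operator/raft/configuration HTTP endpoint (a Servers list with Node, Leader and Voter fields, as I understand the documented format). The sample data and the function name are made up for illustration; fetching the JSON from a live cluster is left out.

```python
from typing import Any, Dict

# Illustrative sample only, not real cluster output.
SAMPLE = {
    "Servers": [
        {"Node": "consul-server-0", "Leader": True, "Voter": True},
        {"Node": "consul-server-1", "Leader": False, "Voter": True},
        {"Node": "consul-server-2", "Leader": False, "Voter": False},
    ]
}


def raft_health(config: Dict[str, Any]) -> Dict[str, Any]:
    """Summarise leader and voter state from a Raft configuration dict."""
    servers = config.get("Servers", [])
    voters = [s["Node"] for s in servers if s.get("Voter")]
    non_voters = [s["Node"] for s in servers if not s.get("Voter")]
    leaders = [s["Node"] for s in servers if s.get("Leader")]
    return {
        "has_leader": len(leaders) == 1,
        "voters": voters,
        "non_voters": non_voters,
        # Raft needs a strict majority of the voting servers.
        "quorum_size": len(voters) // 2 + 1,
    }


if __name__ == "__main__":
    print(raft_health(SAMPLE))
```

If a freshly upgraded server stays in non_voters for a long time, or has_leader stays false, that would suggest holding off on lowering the partition any further.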

david.lopes · September 16, 2022, 3:31pm

I’m also curious about this, because I’m facing exactly the same issue. For me it was just a matter of time until consensus was reached, but it took over 5 minutes.

What happens to the availability of Consul during this time? I suppose that while consensus has not been reached, Consul is effectively down. Is this correct?
