Upgrading Consul servers using Helm / GitOps

We are running a Consul / Vault installation with 3 replicas on an RKE cluster; the entire stack is deployed by a GitOps controller using Helm charts.

The recommended procedure for upgrading Consul servers is to set the upgradePartition value and then lower it step by step until it reaches 0.
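For reference, a partitioned rolling upgrade might look roughly like this in the chart values. This is a sketch assuming the official hashicorp/consul Helm chart, where the value is named `server.updatePartition`; the image tag is a hypothetical target version, so adjust both to your actual chart and release:

```yaml
# values.yaml (sketch; key names follow the hashicorp/consul chart)
server:
  replicas: 3
  # Start with updatePartition = replicas - 1 so no pod is replaced
  # immediately, then lower it one step at a time (2 -> 1 -> 0),
  # committing each change so the GitOps controller rolls one pod per step.
  updatePartition: 2
global:
  image: "hashicorp/consul:1.15.2"  # hypothetical target version
```

Each decrement should be a separate Git commit, with a pause in between to let the replaced server rejoin Raft before the next pod is rolled.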

How long should the Consul cluster take to recover once one of its instances is replaced by a newer version? We have applied the above-mentioned recommendation several times, yet the cluster seems to remain in an unstable state and never recovers.

consul members reports all instances as alive, but Raft leader election seems to be stuck in an endless loop. We have also seen cases where the new instance cannot join properly, or joins only as a non-voter. Is this simply a matter of time, i.e. should we allow more time for the cluster to reconcile after each instance upgrade?
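When diagnosing this, it can help to inspect the Raft peer set rather than only the gossip layer. These are standard Consul CLI commands, run from inside one of the server pods (e.g. via kubectl exec):

```shell
# Serf membership (gossip layer) - shows alive/failed, but not Raft state
consul members

# Raft peer set - shows voters vs. non-voters and the current leader
consul operator raft list-peers

# Leadership and follower lag on this server
consul info | grep -E 'leader|last_contact'
```

If the newly rolled server stays a non-voter for a long time, that is usually autopilot waiting for the server to be declared healthy before promoting it; `consul operator raft list-peers` will show whether promotion eventually happens.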

I'm also curious about this, because I'm facing exactly the same issue. For me it was just a matter of time until consensus was reached, but it took over 5 minutes.

What happens to Consul's availability during this time? I assume that while no consensus is reached, the Consul system is effectively down for consistent reads and writes; is that correct?