We are running a Consul / Vault installation with 3 replicas (RKE cluster); the entire stack is deployed by a GitOps controller using Helm Charts.
The recommended way to upgrade Consul servers is to set the `server.updatePartition` Helm value (to the number of server replicas) and then lower it step by step until it reaches 0.
How long is the Consul cluster supposed to take to ‘recover’ once one of its instances has been replaced by a newer version? We have applied the recommendation above several times, yet the cluster seems to remain in an unstable state and never recovers. `consul members` reports all instances as alive, but Raft leader election seems to be stuck in an endless loop. We have also seen cases where the new instance cannot join properly, or joins only as a ‘non-voter’. Is this simply a matter of time, i.e. should we allow more time for cluster reconciliation after each instance upgrade?
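To make "recovered" concrete, one option is to look at the Raft peer set rather than at `consul members`, since gossip membership can look fine while Raft has no stable leader. The following is a minimal sketch, not an official tool: it assumes the text output of `consul operator raft list-peers` starts with the columns Node, ID, Address, State and Voter, and the node names, IDs and addresses in `SAMPLE` are made up for illustration. The "settled" rule (exactly one leader, no non-voters) is our own heuristic.

```python
def parse_raft_peers(output: str) -> list[dict[str, str]]:
    """Parse the table printed by `consul operator raft list-peers`.

    Assumption: the first five whitespace-separated columns are
    Node, ID, Address, State, Voter. Any further columns are ignored.
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    peers = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # not a peer row
        node, peer_id, address, state, voter = fields[:5]
        peers.append({"node": node, "id": peer_id, "address": address,
                      "state": state, "voter": voter})
    return peers


def raft_summary(peers: list[dict[str, str]]) -> dict:
    """Summarise leader and voter status (heuristic, see lead-in)."""
    leaders = [p["node"] for p in peers if p["state"] == "leader"]
    non_voters = [p["node"] for p in peers if p["voter"] != "true"]
    return {
        "leaders": leaders,
        "non_voters": non_voters,
        # Hypothetical rule: one leader and every server is a voter.
        "settled": len(leaders) == 1 and not non_voters,
    }


# Made-up sample output; real node names and addresses will differ.
SAMPLE = """\
Node             ID        Address          State     Voter  RaftProtocol
consul-server-0  aaaa1111  10.42.0.10:8300  leader    true   3
consul-server-1  bbbb2222  10.42.1.10:8300  follower  true   3
consul-server-2  cccc3333  10.42.2.10:8300  follower  false  3
"""

if __name__ == "__main__":
    print(raft_summary(parse_raft_peers(SAMPLE)))
```

Running a check like this against the real command output before lowering the partition value again would show whether the replaced server has been promoted to voter and whether a single leader is holding.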