I would like to discuss the proper way to upgrade a running Consul cluster.
In more detail: in our product, Consul is deployed as a DaemonSet. There are 3 pods in total, and a Consul server runs in each of them.
From time to time we need to upgrade to a newer version of this Helm chart. The Consul version itself doesn’t necessarily change in such a Helm upgrade; most of the time only the Consul Docker image changes, because the base image is updated for security fixes and similar reasons.
During the “helm upgrade”, even though the 3 Consul pods are upgraded sequentially (the DaemonSet’s updateStrategy is set to RollingUpdate), the quorum is lost and it takes around 4-5 minutes until a new leader is elected. The new leader is elected once the new/upgraded pods have started.
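For reference, the quorum arithmetic behind this can be sketched as follows. This is only a minimal illustration in Python, assuming the usual Raft majority rule (quorum = floor(n/2) + 1); the function names are hypothetical and not part of Consul or Helm.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of `servers` voting members (Raft majority rule)."""
    if servers < 1:
        raise ValueError("need at least one server")
    return servers // 2 + 1


def can_stop_one(total_servers: int, healthy_servers: int) -> bool:
    """True if stopping one healthy server still leaves a majority."""
    return healthy_servers - 1 >= quorum_size(total_servers)


if __name__ == "__main__":
    print(quorum_size(3))      # 2
    print(can_stop_one(3, 3))  # True: 2 of 3 servers remain
    print(can_stop_one(3, 2))  # False: only 1 of 3 would remain
```

With 3 servers the cluster tolerates exactly one server being down, so if a second pod is replaced before the first one has rejoined as a healthy peer, the majority is gone.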
This causes several issues for the applications that need to access Consul’s KV store during this timeframe.
So now I’m trying to figure out the proper way to upgrade the Consul pods without losing the quorum and the cluster leader.
I’ve read the instructions about upgrading the Consul version in a running Consul cluster.
I think the reason we lose the quorum and the cluster leader for some minutes is that we simply run “helm upgrade” on our chart and do not follow any of the steps described in the aforementioned link (e.g. leaving the upgrade/re-creation of the Consul leader’s pod for last).
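The “leader last” ordering mentioned above could be sketched roughly like this. It is an illustration under assumptions, not a tested procedure: it assumes a Consul agent reachable at http://127.0.0.1:8500 (a placeholder address) and uses Consul’s /v1/status/leader and /v1/status/peers HTTP endpoints, which return the leader’s address and the list of Raft peer addresses. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: address of a reachable Consul agent; adjust for your cluster.
CONSUL_ADDR = "http://127.0.0.1:8500"


def get_json(path, base=CONSUL_ADDR):
    """GET a Consul HTTP API path and decode the JSON response."""
    with urllib.request.urlopen(base + path, timeout=5) as resp:
        return json.load(resp)


def upgrade_order(peers, leader):
    """Return the peers with followers first (sorted) and the leader last.

    If `leader` is empty or not among `peers`, all peers are treated
    as followers.
    """
    followers = sorted(p for p in peers if p != leader)
    return followers + ([leader] if leader in peers else [])


def current_upgrade_order(base=CONSUL_ADDR):
    """Query the cluster and compute the order (hypothetical helper)."""
    leader = get_json("/v1/status/leader", base)  # e.g. "10.0.0.2:8300"
    peers = get_json("/v1/status/peers", base)    # list of "ip:port" strings
    return upgrade_order(peers, leader)
```

Mapping the returned addresses back to pods and replacing them one at a time, waiting for each server to rejoin before moving on, is left out of this sketch.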
But how could we overcome this?
Has anyone else faced a similar issue?