We have a Consul deployment in Kubernetes (3 worker nodes) with 3 Consul agents and 3 Consul servers, deployed via Helm (chart version 0.27.0). Apart from setting resources and enabling the UI, these are the Helm values used:
```yaml
global:
  name: consul
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key
  acls:
    manageSystemACLs: 'true'
server:
  storage: 50Gi
```
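I haven't set any Raft/performance tuning at all. If that could matter here, my understanding is that extra server config would go through `server.extraConfig` in this chart, something like this (illustrative values only, not what I'm running):

```yaml
server:
  extraConfig: |
    {
      "performance": {
        "raft_multiplier": 3
      }
    }
```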
So it uses Consul version 1.9.0. I can see a lot of errors in the server logs, such as:

- `connection reset by peer`
- `no ack received`
- `election timeout reached`

You can see logs here
These errors immediately suggest something is wrong with the network, but ping between the pods works, and I can even see that the ports are open on the remote pods using netcat. The Consul UI shows all services as green. The issue is that approximately once a day some of the Consul server probes fail because the Consul leader is lost, which then disrupts downstream apps that use Consul's locking mechanism…
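Since the failures seem to happen roughly once a day, I've also tried to see whether the errors cluster at a particular time, by bucketing the log lines by hour. This is only a rough sketch; `consul.log` is just a dump of `kubectl logs` from one server pod, and the sample lines below are stand-ins for my real logs:

```shell
# (sample lines so the pipeline below is runnable as-is; my real file
# comes from: kubectl logs consul-server-0 > consul.log)
cat > consul.log <<'EOF'
2021-01-05T10:00:00.000Z [ERROR] agent.server.memberlist.lan: memberlist: failed to receive: read tcp 10.0.0.1:8301: connection reset by peer
2021-01-05T10:00:05.000Z [WARN]  agent.server.memberlist.lan: memberlist: no ack received from consul-server-1 (timeout reached)
2021-01-05T22:13:00.000Z [WARN]  agent.server.raft: Election timeout reached, restarting election
EOF

# count the leadership-related errors per hour bucket
# (first 13 chars of the timestamp, e.g. "2021-01-05T10")
grep -iE 'connection reset by peer|no ack received|election timeout reached' consul.log \
  | awk '{print substr($1, 1, 13)}' | sort | uniq -c
```

In my case the counts do seem to pile up around the same window each day, which is part of why I suspect something environmental rather than Consul itself.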
So far I lean towards thinking I have misconfigured Consul, but I don't know where. Can someone please suggest what I should check and how to debug this? I am running out of ideas.
Happy to provide more info
Thanks a lot