We have a Consul deployment in k8s (3 worker nodes) with 3 Consul agents and 3 Consul servers. It is deployed with Helm (chart version 0.27.0). Apart from setting resources and enabling the UI, these are the Helm values used:
The log messages immediately suggest something is wrong with the network. Ping between these pods works, and I can even see with netcat that the ports are open on the remote pods. The Consul UI shows all services green. The issue is that approximately once a day some of the Consul server probes fail because the Consul leader is lost. This then disrupts downstream apps using Consul's locking mechanism…
So far I lean towards thinking I have misconfigured Consul, but I do not know where. Can someone please suggest what I should check? How can I debug this? I am running out of ideas.
To be honest, it doesn’t look like your clients are ever really very happy, according to those logs; the member list seems very unstable. Can you try running it with proper DNS names for the workers? That is, to address this:
2020-12-01T14:43:27.807Z [WARN] agent.auto_config: Node name "worker3.k8s" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
I wonder whether that blind spot is enough to send the membership spiralling downwards. (Although, tbh, if the UI is happy, I can’t think what else, internally to the cluster, would be using DNS; as I understand it, it’s more about providing a hook to external services.)
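If the chart lets you control the agent command line, one way to clear that warning would be to override the node name with something DNS-safe. A minimal sketch only: -node is the standard agent flag, but the data dir and join address below are made up and depend on your setup.

# Illustrative: a node name with only alphanumerics and dashes passes
# Consul's DNS character check; "worker3.k8s" fails because of the dots.
consul agent \
  -node=worker3-k8s \
  -data-dir=/consul/data \
  -retry-join=consul-server-0.consul-server.default.svc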
That line seems fishy to me as well, but I ignored it for now since:
A) kube DNS names are used to connect the servers to each other (see below)
B) the IP addresses in the logs are the correct addresses of all the pods
Ok so I have confirmed that there is no firewall blocking any ports.
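For reference, this is roughly the kind of check I ran from inside one pod against another (a sketch assuming the default Consul ports; the peer address is a placeholder from my setup). One caveat: serf gossip on 8301/8302 uses UDP as well as TCP, and UDP probes with netcat are best-effort at most.

PEER=consul-server-1.consul-server.default.svc.cluster.local  # placeholder
for port in 8300 8301 8302; do
  nc -vz "$PEER" "$port"   # TCP: 8300 server RPC, 8301 serf LAN, 8302 serf WAN
done
nc -vzu "$PEER" 8301       # UDP gossip port; a "success" here proves little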
Another thing I have noticed: if I filter the logs down to just the lines containing the word "error" and put those on a timeline, each lump of instability starts with messages like
2020-12-02T09:54:44.326Z [WARN] agent: error getting server health from server: server=consul-server-1 error="context deadline exceeded"
2020-12-02T09:54:44.326Z [WARN] agent: error getting server health from server: server=consul-server-0 error="context deadline exceeded"
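For completeness, this is roughly how I produced that timeline (the pod names are from my setup and may differ in yours):

for pod in consul-server-0 consul-server-1 consul-server-2; do
  kubectl logs "$pod" | grep -i error   # keep only lines mentioning errors
done | sort   # Consul's leading timestamps make a plain sort chronological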
That "error getting server health" warning is actually mentioned in the troubleshooting section of the Consul docs. We will try implementing monitoring as advised there.
It is a new installation with barely any load, so monitoring was not set up yet. Let's see.
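In the meantime, even without a monitoring stack, the agent's telemetry can be pulled directly over the standard HTTP API. A quick sketch, assuming curl exists in the container and jq on the machine running kubectl:

# The lastContact summary is reported by the current leader.
kubectl exec consul-server-0 -- curl -s http://127.0.0.1:8500/v1/agent/metrics |
  jq '.Samples[] | select(.Name | contains("raft.leader.lastContact"))'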
Monitoring is set up and some data has been collected. I can clearly see all the bursts of context deadline exceeded from the logs in the timeline of the consul.raft.leader.lastContact metric:
This metric measures the time since the leader was last able to contact the follower nodes when checking its leader lease. Spikes here correspond exactly with what I see in the logs. Naturally, I can also see it in these metrics:
number of raft transactions
autopilot failure tolerance
But this is an effect, not a cause. Looking at other metrics, I do not see anything that could be causing it. Remember, this install is barely used yet; it is just an idling Consul cluster. All of these metrics show no anomaly around the spikes:
raft log commit time
autopilot healthy
container CPU is idle
container network is stable and below 10 kb/s
container disk I/O is barely moving (all below 80 kb/s)
memory usage is stable and below 5%
time in memory GC is always below 3 ms per second (usually well below)
Looking at the metrics of the underlying cluster nodes:
there is plenty of free CPU and memory available
the only slight thing is that I see some spikes in the load of the cluster nodes during the Consul issues, but there are lots of other load spikes without any issues on the Consul side.
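The node-side checks I am doing when a spike hits are roughly these (a sketch; the node name is a placeholder, and kubectl top needs metrics-server installed):

kubectl top nodes                                    # point-in-time CPU/memory per node
kubectl describe node worker3 | grep -A8 Conditions  # Memory/Disk/PID pressure flags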
If there is anything else I can check, please let me know.