Classic networking issues


We have a Consul deployment in k8s (3 worker nodes) with 3 Consul agents and 3 Consul servers. It's deployed via Helm (chart version 0.27.0). Apart from setting resources and enabling the UI, these are the Helm values used:

  name: consul
    secretName: consul-gossip-encryption-key
    secretKey: key
  storage: 50Gi
  manageSystemACLs: 'true'

So it runs Consul version 1.9.0. I can see a lot of issues like connection reset by peer, no ack received, election timeout reached, etc. You can see the logs here

These immediately suggest something is wrong with the network. Ping between these pods works. I can even see via netcat that ports are open on the remote pod. The Consul UI shows all services green. The issue is that approximately once a day some of the Consul server probes fail because the Consul leader is lost. This then disrupts downstream apps using Consul's locking mechanism…

So far I lean towards thinking I have misconfigured Consul, but I do not know where. Can someone please suggest what I should check? How do I debug this? I am running out of ideas.

Happy to provide more info

Thanks a lot

To be honest, it doesn’t look like your clients are ever really very happy, according to those logs; the member list seems very unstable. Can you try running it with proper DNS names for the workers? That is, to address this:

2020-12-01T14:43:27.807Z [WARN] agent.auto_config: Node name "worker3.k8s" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.

I wonder whether that blind spot is enough to send the membership spiralling downwards. (Although, tbh, if the UI is happy, I can’t think what else, internally to the cluster, would be using DNS; as I understand it, it’s more about providing a hook to external services.)
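For reference, the rule in that warning (alphanumerics and dashes only) is easy to reproduce locally; `is_valid_node_name` below is a hypothetical helper sketching the check, not a Consul command:

```shell
# Hypothetical helper mirroring the rule from the warning above:
# a node name is DNS-discoverable only if it contains nothing but
# alphanumerics and dashes, so "worker3.k8s" fails on the dot.
is_valid_node_name() {
  case "$1" in
    *[!A-Za-z0-9-]*) return 1 ;;  # any other character is invalid
    *) return 0 ;;
  esac
}

is_valid_node_name "worker3.k8s" && echo valid || echo invalid   # → invalid
is_valid_node_name "worker3-k8s" && echo valid || echo invalid   # → valid
```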

Hey jlj7,

that line seems fishy to me as well. But I ignored it for now, since:
A) kube DNS names are used by the servers to join each other (see below)
B) the IP addresses in the logs are the correct addresses of all the pods

Command used by the chart to run the server:

       exec /bin/consul agent \
         -advertise="${POD_IP}" \
         -bind= \
         -bootstrap-expect=3 \
         -client= \
         -config-dir=/consul/config \
         -datacenter=dc1 \
         -data-dir=/consul/data \
         -domain=consul \
         -encrypt="${GOSSIP_KEY}" \
         -hcl="connect { enabled = true }" \
         -ui \
         -retry-join=${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
         -retry-join=${CONSUL_FULLNAME}-server-1.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
         -retry-join=${CONSUL_FULLNAME}-server-2.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \

Also, the k8s cluster is running a lot of other things, so I would rather not play with node naming unless it really is the problem.

The discrepancy between logs full of errors and the UI showing all Consul instances as healthy is puzzling to me as well.

In the meantime I am double-checking with our infra guys that there really is no firewall or anything like that.

I am going to raise the log level and see if there is any more useful info. In the meantime I am open to things to check/try.


OK, so I have confirmed that there is no firewall blocking any ports.

Another thing I have noticed: if I consider only logs with the word error in them and put those on a timeline, each lump of instability starts with messages like

2020-12-02T09:54:44.326Z [WARN]  agent: error getting server health from server: server=consul-server-1 error="context deadline exceeded"
2020-12-02T09:54:44.326Z [WARN]  agent: error getting server health from server: server=consul-server-0 error="context deadline exceeded"

This is actually mentioned in the troubleshooting section of the Consul documentation. We will try implementing monitoring as advised there.
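For anyone following along: the monitoring hook is just a small telemetry stanza in the server config. A minimal sketch (the retention window here is an arbitrary example; on this chart it would go in via the `server.extraConfig` value):

```hcl
telemetry {
  # Expose metrics for Prometheus scraping via
  # /v1/agent/metrics?format=prometheus; keep them for 1 minute.
  prometheus_retention_time = "1m"
  disable_hostname          = true
}
```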

It is a new installation with barely any load, so monitoring was not set up yet. Let's see.


Monitoring is set up and some data has been collected. I can clearly see all the bursts of context deadline exceeded from the logs in the timeline of the consul.raft.leader.lastContact metric:

This metric measures the time since the leader was last able to contact the follower nodes when checking its leader lease. The spikes here correspond exactly with what I see in the logs. Naturally, I can also see it in these metrics:

  • number of raft transactions
  • autopilot failure tolerance
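To catch these earlier, one option is alerting on this metric directly. A hypothetical Prometheus rule sketch; the exported metric name `consul_raft_leader_lastContact` and the 200 ms threshold are assumptions based on Consul's telemetry docs, not something verified on this cluster:

```yaml
groups:
  - name: consul-raft
    rules:
      - alert: ConsulLeaderContactSlow
        # lastContact is reported in milliseconds; Consul's docs suggest
        # it should generally stay below 200ms.
        expr: consul_raft_leader_lastContact{quantile="0.9"} > 200
        for: 1m
        annotations:
          summary: Raft leader is slow to contact followers; leader loss may follow
```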

But this is an effect, not a cause. Looking at other metrics, I do not see anything causing this. Remember, this install is barely used yet; it is just an idling Consul cluster. None of these metrics show any anomaly around the spikes:

  • raft log commit time
  • autopilot healthy
  • container CPU is idle
  • container network is stable and below 10 kB/s
  • container disk IO is barely moving (all below 80 kB/s)
  • memory usage is stable and below 5%
  • time in memory GC is always below 3 ms per second (usually well below)

Looking at the metrics of the underlying cluster nodes:

  • there is plenty of free CPU and memory available
  • the only slight thing is that I see some spikes in the load of the cluster nodes during the Consul issues, but there are lots of other load spikes without any issues on the Consul side.

If there is anything else I can check, please let me know.