Classic networking issues


We have a Consul deployment in k8s (3 worker nodes) with 3 Consul agents and 3 Consul servers. It's deployed via Helm (chart version 0.27.0). Apart from setting resources and enabling the UI, these are the Helm values used:

  name: consul
    secretName: consul-gossip-encryption-key
    secretKey: key
  storage: 50Gi
  manageSystemACLs: 'true'

So it runs Consul version 1.9.0. I can see a lot of issues like connection reset by peer, no ack received, election timeout reached, etc. You can see the logs here

These immediately suggest something is wrong with the network. Ping between these pods works. I can even see via netcat that ports are open on the remote pod. The Consul UI shows all services green. The issue is that approximately once a day some of the Consul server probes fail because the Consul leader is lost. This then disrupts downstream apps using Consul's locking mechanism…

So far I lean towards thinking I have misconfigured Consul, but I do not know where. Can someone please suggest what I should check? How do I debug this? I am running out of ideas.

Happy to provide more info

Thanks a lot

To be honest, it doesn’t look like your clients are ever really very happy, according to those logs; the member list seems very unstable. Can you try running it with proper DNS names for the workers? That is, to address this:

2020-12-01T14:43:27.807Z [WARN] agent.auto_config: Node name "worker3.k8s" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.

I wonder whether that blind spot is enough to send the membership spiralling downwards. (Although, tbh, if the UI is happy, I can’t think what else, internally to the cluster, would be using DNS; as I understand it, it’s more about providing a hook to external services.)
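For reference, the rule in that warning (alphanumerics and dashes only) is easy to reproduce locally; `is_valid_node_name` below is a hypothetical helper sketching the check, not a Consul command:

```shell
# Hypothetical helper mirroring the rule from the warning above:
# a node name is DNS-discoverable only if it contains nothing but
# alphanumerics and dashes, so "worker3.k8s" fails on the dot.
is_valid_node_name() {
  case "$1" in
    *[!A-Za-z0-9-]*) return 1 ;;  # any other character is invalid
    *) return 0 ;;
  esac
}

is_valid_node_name "worker3.k8s" && echo valid || echo invalid   # → invalid
is_valid_node_name "worker3-k8s" && echo valid || echo invalid   # → valid
```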

Hey jlj7,

that line seems fishy to me as well. But I ignored it for now, since:
A) kube DNS names are used by the servers to join each other (see below)
B) the IP addresses in the logs are the correct addresses of all the pods

Command used by the chart to run the server:

       exec /bin/consul agent \
         -advertise="${POD_IP}" \
         -bind= \
         -bootstrap-expect=3 \
         -client= \
         -config-dir=/consul/config \
         -datacenter=dc1 \
         -data-dir=/consul/data \
         -domain=consul \
         -encrypt="${GOSSIP_KEY}" \
         -hcl="connect { enabled = true }" \
         -ui \
         -retry-join=${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
         -retry-join=${CONSUL_FULLNAME}-server-1.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
         -retry-join=${CONSUL_FULLNAME}-server-2.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \

Also, the k8s cluster is running a lot of other things, so I would rather not play with node naming unless it really is the problem.

The discrepancy between logs full of errors and the UI showing all Consul instances as healthy is puzzling to me as well.

In the meantime I am double-checking with our infra guys that there really is no firewall or anything like that.

I am going to raise the log level and see if there is any more useful info. In the meantime I am open to things to check/try.


OK, so I have confirmed that there is no firewall blocking any ports.

Another thing I have noticed: if I consider only logs with the word error in them and put those on a timeline, each lump of instability starts with messages like

2020-12-02T09:54:44.326Z [WARN]  agent: error getting server health from server: server=consul-server-1 error="context deadline exceeded"
2020-12-02T09:54:44.326Z [WARN]  agent: error getting server health from server: server=consul-server-0 error="context deadline exceeded"

This is actually mentioned in the troubleshooting section of the Consul documentation. We will try implementing monitoring as advised there.
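For anyone following along: the monitoring hook is just a small telemetry stanza in the server config. A minimal sketch (the retention window here is an arbitrary example; on this chart it would go in via the `server.extraConfig` value):

```hcl
telemetry {
  # Expose metrics for Prometheus scraping via
  # /v1/agent/metrics?format=prometheus; keep them for 1 minute.
  prometheus_retention_time = "1m"
  disable_hostname          = true
}
```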

It is a new installation with barely any load, so monitoring was not set up yet. Let's see.


Monitoring is set up and some data has been collected. I can clearly see all the bursts of context deadline exceeded from the logs in the timeline of the consul.raft.leader.lastContact metric:

This metric measures the time since the leader was last able to contact the follower nodes when checking its leader lease. The spikes here correspond exactly with what I see in the logs. Naturally, I can also see it in these metrics:

  • number of raft transactions
  • autopilot failure tolerance
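To catch these earlier, one option is alerting on this metric directly. A hypothetical Prometheus rule sketch; the exported metric name `consul_raft_leader_lastContact` and the 200 ms threshold are assumptions based on Consul's telemetry docs, not something verified on this cluster:

```yaml
groups:
  - name: consul-raft
    rules:
      - alert: ConsulLeaderContactSlow
        # lastContact is reported in milliseconds; Consul's docs suggest
        # it should generally stay below 200ms.
        expr: consul_raft_leader_lastContact{quantile="0.9"} > 200
        for: 1m
        annotations:
          summary: Raft leader is slow to contact followers; leader loss may follow
```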

But this is an effect, not a cause. Looking at other metrics, I do not see anything causing this. Remember, this install is barely used yet; it is just an idling Consul cluster. None of these metrics show any anomaly around the spikes:

  • raft log commit time
  • autopilot healthy
  • container CPU is idle
  • container network is stable and below 10 kB/s
  • container disk IO is barely moving (all below 80 kB/s)
  • memory usage is stable and below 5%
  • time in memory GC is always below 3 ms per second (usually well below)

Looking at the metrics of the underlying cluster nodes:

  • there is plenty of free CPU and memory available
  • the only slight thing is that I see some spikes in the load of the cluster nodes during the Consul issues, but there are lots of other load spikes without any issues on the Consul side.

If there is anything else I can check, please let me know.