Consul Server unable to join Consul Cluster

Consul.txt (5.8 KB) logs-from-consul-in-consul-consul-server-1 (5).txt (35.4 KB)

Hello,

Before I start explaining what my problem is, I would like to point out that I do not have lots of experience with Consul, so please be patient with me :smiley:
I need your assistance in figuring out what is wrong with the Consul cluster I have deployed in my Azure AKS.
The infrastructure I have looks like this:

  • 3 servers running in AKS (Consul version 1.8.4)
  • 6 clients running on VMs (Consul version 1.8.0)
  • The AKS cluster is Private

Everything was working well, but suddenly, the pods started to die one after the other.
I redeployed Consul in AKS, and now only two of the three Consul servers are running. The third server stays in a Running state for roughly 30 seconds, then gets OOMKilled and enters CrashLoopBackOff.
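For reference, commands along these lines should show the pod status and the last termination reason (the pod name and namespace here are only examples):

# list the pods and check their status and restart count
$ kubectl get pods -n consul
# confirm the reason for the last termination (I expect "OOMKilled" for the broken server)
$ kubectl describe pod consul-consul-server-2 -n consul | grep -A 4 'Last State'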
When I run consul members, all the servers & clients are listed; the problematic pod is shown as “left”, while the others are shown as “alive”.
I have also tried running the command consul join {ip address}, but that gives me the following error message:

/ # consul join 10.0.0.153
Error joining address '10.0.0.153': Unexpected response code: 500 (1 error occurred:
        * Failed to join 10.0.0.153: dial tcp 10.0.0.153:8301: connect: connection refused

)
Failed to join any nodes.

I attached the yaml file from my Consul StatefulSet and the error log from the problematic pod.

I should point out that I have had this infrastructure for about 2 months and everything was looking fine; all the pods were healthy and running. For the last 3 days I have been dealing with this issue, researching on the internet to figure out how I can fix it, but with no result.

Could you please help me figure out why this suddenly started happening and, ideally, help me solve it?

Thanks in advance for your time,

Mike

Hi @mike-miller-ct,

It looks like there might be an issue with networking on that host.

As I mentioned over in Meaning of "Failed to resolve consul-consul-server-0.consul-consul-server.consul.svc" - #3 by mike-miller-ct, it looks like one of the problems is that the node cannot resolve the DNS hostnames of the other Consul servers. Have you tried exec'ing into the pod and running a few queries to check whether you can manually resolve and reach those hostnames?
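For example, something along these lines should get you a shell in the pod (the pod name and namespace here are assumptions, so adjust them to match your deployment):

# open a shell inside one of the Consul server pods
$ kubectl exec -it consul-consul-server-0 -n consul -- sh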

Here are a few commands you can run to test basic connectivity.

$ cat /etc/resolv.conf
search consul.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.43.0.10
options ndots:5

$ nslookup hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local
Server:		10.43.0.10
Address:	10.43.0.10:53

Name:	hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local
Address: 10.42.4.75

$ ping -c1 hashicorp-consul-server-0.hashicorp-consul-server.consul.svc
PING hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local (10.42.4.75) 56(84) bytes of data.
64 bytes from hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local (10.42.4.75): icmp_seq=1 ttl=64 time=0.075 ms

--- hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.075/0.075/0.075/0.000 ms
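If name resolution works, it's also worth verifying that the Serf LAN port on the other servers is reachable. Something like the following, assuming the default port 8301 and that the image ships a netcat with -z support:

# test a TCP connection to the Serf LAN port of another server
$ nc -zv hashicorp-consul-server-0.hashicorp-consul-server.consul.svc 8301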

Also, is the IP address you’re specifying in consul join the address of another server or client in the environment?

Hi @blake, thanks for getting back to me.
Here is what I get when I run the commands that you ran:

/ # cat /etc/resolv.conf
nameserver 168.63.129.16
search eaxyh9uou81y3mrokgl1ap4l6h.ax.internal.cloudapp.net

/ # nslookup hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local
Server:         168.63.129.16
Address:        168.63.129.16:53

** server can't find hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local: NXDOMAIN

** server can't find hashicorp-consul-server-0.hashicorp-consul-server.consul.svc.cluster.local: NXDOMAIN

/ # nslookup consul-consul-server-0.consul-consul-server.consul.svc
Server:         168.63.129.16
Address:        168.63.129.16:53

** server can't find consul-consul-server-0.consul-consul-server.consul.svc: NXDOMAIN

** server can't find consul-consul-server-0.consul-consul-server.consul.svc: NXDOMAIN

/ # nslookup consul-consul-server-1.consul-consul-server.consul.svc
Server:         168.63.129.16
Address:        168.63.129.16:53

** server can't find consul-consul-server-1.consul-consul-server.consul.svc: NXDOMAIN

** server can't find consul-consul-server-1.consul-consul-server.consul.svc: NXDOMAIN

/ # nslookup consul-consul-server-2.consul-consul-server.consul.svc
Server:         168.63.129.16
Address:        168.63.129.16:53

** server can't find consul-consul-server-2.consul-consul-server.consul.svc: NXDOMAIN

** server can't find consul-consul-server-2.consul-consul-server.consul.svc: NXDOMAIN

/ # ping -c1 consul-consul-server-0.consul-consul-server.consul.svc
ping: consul-consul-server-0.consul-consul-server.consul.svc: Name does not resolve

Regarding your last question: the IP I used in my consul join command is the IP of the server that is having the issue.

Please let me know if you need any additional information,
BR
Mike

Hi @mike-miller-ct,

Thanks for sharing that info. The issue looks to be that resolv.conf does not contain the DNS search domains needed to resolve the local cluster hostnames.

As in my example, the search [domains] line should contain consul.svc.cluster.local svc.cluster.local cluster.local. The resolver appends one of these domains to non-fully-qualified DNS queries (i.e., those which do not end in .) before attempting the lookup. Queries for the fully qualified hostname (with cluster.local appended) should then succeed.

$ nslookup consul-consul-server-0.consul-consul-server.consul.svc.cluster.local

It's not clear from your config what is removing these search domains, so you'll want to investigate that a bit. DNS for Services and Pods | Kubernetes might be helpful in this regard.
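One thing worth checking is whether the server pods run with hostNetwork and which dnsPolicy they use, since a pod on the host network only receives the cluster search domains when dnsPolicy is set to ClusterFirstWithHostNet. A quick way to inspect both fields (pod name and namespace assumed):

# print hostNetwork and dnsPolicy for one of the server pods
$ kubectl get pod consul-consul-server-0 -n consul \
    -o jsonpath='{.spec.hostNetwork}{" "}{.spec.dnsPolicy}{"\n"}'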

You’ll want to run consul join from the server that is having issues and provide it the IP of one of the other agents (client or server) that is functioning correctly in the environment. It should then successfully join the cluster, assuming no other configuration issues exist.
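For example, something along these lines, where the pod name is an assumption and the address is a placeholder for one of your healthy agents:

# run the join from inside the failing server pod against a healthy agent
$ kubectl exec -it consul-consul-server-2 -n consul -- consul join <ip-of-healthy-agent>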

Hi @blake,

Thank you for your Help :slight_smile:
As I mentioned earlier, everything was working well, but then all of a sudden the problems started occurring. A day earlier the coredns pod had restarted, and from that moment on I started having problems with Consul. At first I didn't pay much attention to it because the coredns pod was up and running again and I thought it was nothing, but now it all makes sense. I will definitely have a look at the link you sent regarding DNS in Kubernetes.
I would also like to let you know that I fixed my problem by making some adjustments to the StatefulSet. I removed this:

-retry-join=${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
-retry-join=${CONSUL_FULLNAME}-server-1.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
-retry-join=${CONSUL_FULLNAME}-server-2.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \

and replaced it with this:

-retry-join="10.0.0.173:8301" \
-retry-join="10.0.0.143:8301" \
-retry-join="10.0.0.153:8301" \

Now leader election is a lot faster, and the Consul servers are running without restarts or crashes.

Again, thanks a lot for your help and assistance.
Have a great start in the new week,
BR
Mike