Before I start explaining my problem, I would like to point out that I do not have much experience with Consul, so please be patient with me.
I need your assistance in figuring out what is wrong with the Consul cluster I have deployed in Azure AKS.
The infrastructure I have looks like this:
3 servers running in AKS (Consul version 1.8.4)
6 clients running on VMs (Consul version 1.8.0)
The AKS cluster is Private
Everything was working well, but suddenly the pods started to die one after the other.
I redeployed Consul in AKS, and now only two of the three Consul servers are running. The third server stays in a Running state for about 30 s, then gets OOMKilled and enters CrashLoopBackOff.
When I run the command consul members, all the servers and clients are listed, but the problematic pod is shown as "left" while the others are shown as "alive".
I have also tried to execute the command consul join {ip address} but this gives me the following error message:
/ # consul join 10.0.0.153
Error joining address '10.0.0.153': Unexpected response code: 500 (1 error occurred:
* Failed to join 10.0.0.153: dial tcp 10.0.0.153:8301: connect: connection refused
)
Failed to join any nodes.
I have attached the YAML file for my Consul StatefulSet and the error log from the problematic pod.
I must point out that I have had this infrastructure for about two months, and everything looked fine: all the pods were healthy and running. For the last three days I have been dealing with this issue, researching on the internet and trying to figure out how to fix it, but with no result.
Could you please help me figure out why this suddenly started happening, and help me solve it?
Thanks for sharing that info. The issue appears to be that resolv.conf does not contain the DNS search domains for the local cluster hostnames.
As in my example, the search [domains] line should contain consul.svc.cluster.local svc.cluster.local cluster.local. This causes the resolver to append each of these domains in turn to non-fully-qualified DNS queries (i.e., those which do not end in .) before attempting the lookup. One of the resulting lookups should then succeed.
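To illustrate the mechanism described above (this sketch is not from the original reply), here is a rough Python simulation of how a glibc-style resolver expands a non-fully-qualified name using the search list. The function name and the example hostname are made up for the demo; ndots=5 matches the Kubernetes default.

```python
# Sketch (assumptions noted above) of glibc-style search-domain expansion.
def candidate_names(name, search_domains, ndots=5):
    # A fully qualified name (trailing dot) is looked up as-is.
    if name.endswith("."):
        return [name]
    with_search = [f"{name}.{d}" for d in search_domains]
    if name.count(".") < ndots:
        # Fewer dots than ndots: try the search domains first,
        # then the name as given.
        return with_search + [name]
    # Otherwise try the name as given first, then the search list.
    return [name] + with_search

search = ["consul.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("consul-server-2.consul-server", search):
    print(candidate)
```

With the cluster search domains present, a short name like consul-server-2.consul-server is tried with consul.svc.cluster.local appended first, which is the lookup that would succeed; with the search line missing, only the bare name is tried and resolution fails.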
It's not clear from your config what is removing these search domains. You'll want to investigate that a bit. DNS for Services and Pods | Kubernetes might be helpful in this regard.
You'll want to run consul join from the server that is having issues, providing it the IP of one of the other agents (client or server) that is functioning correctly in the environment. It should then successfully join the cluster, assuming no other configuration issues exist.
Thank you for your help.
As I mentioned earlier, everything was working well, but then all of a sudden the problems started. A day earlier the coredns pod had restarted, and from that moment on I started having problems with Consul. At the beginning I didn't pay much attention to this, because the coredns pod was up and running again.
At first I thought it was nothing, but now it all makes sense. I will definitely have a look at the link you sent regarding DNS in Kubernetes.
I would also like to let you know that I fixed my problem by making some adjustments to the StatefulSet. I removed this: