Consul servers in non-ready state due to error="No cluster leader"

Hi team,

I’m using Consul (community Helm chart version 0.49.2) for service discovery and KV configuration on EKS v1.27 (3 nodes) with 1 server replica.
The requirement is to increase the replica count to 3 to ensure high availability (HA) of the Consul server.
But with the following configuration, the Consul servers and clients are in a non-ready state:

values.yaml
values.txt (1.1 KB)

kubectl get pods -n consul
NAME                                           READY   STATUS    RESTARTS   AGE
consul-client-xxxxx                            0/1     Running   0          13m
consul-client-yyyyy                            0/1     Running   0          13m
consul-client-zzzzz                            0/1     Running   0          13m
consul-connect-injector-5b86xxxx-7mk9x         1/1     Running   0          13m
consul-connect-injector-5b86xxxx-wft8z         1/1     Running   0          13m
consul-controller-d77bf9xxx-l9zsg              1/1     Running   0          13m
consul-server-0                                0/1     Running   0          13m
consul-server-1                                0/1     Running   0          13m
consul-server-2                                0/1     Running   0          13m
consul-webhook-cert-manager-6cb69bbbbb-4pj4r   1/1     Running   0          13m

The logs of the Consul servers and clients:
consul-client logs
consul-client.txt (7.5 KB)

consul-server logs
note: consul-server-0 and 1 have similar logs
consul-server-1.txt (1.0 KB)
consul-server-2.txt (7.7 KB)

I have tried adding various other configs to the values.yaml (listed below, with a rough sketch after the list):

enabled dns
bootstrapExpect: 3
exposeGossipAndRPCPorts: true (for server and client)
hostNetwork: true (client)
dnsPolicy: ClusterFirstWithHostNet (client)

I updated the existing single-replica Consul setup in place and also tested a fresh Consul deployment on a new target, but the issue still persists.
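
For completeness, both the in-place upgrade and the fresh install were applied along these lines (the release and namespace names here are just what I’m assuming from the pod names):

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
# in-place upgrade of the existing release, or a fresh install via --install
helm upgrade --install consul hashicorp/consul --namespace consul --version 0.49.2 -f values.yaml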

One thing I observed is that only one of the 3 servers (consul-server-2) has the required ports open in the container; the other two only listen on 8300. This is probably why we see the connection refused errors in the logs.

Inside the consul-server-0 and consul-server-1 containers:
kubectl exec -it consul-server-1 -n consul -- sh
/ $ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::8300                 :::*                    LISTEN      11/consul

/ $ consul members
Node                                           Address             Status  Type    Build   Protocol  DC   Partition  Segment
consul-server-2                                xx.yy.zzz.225:8301  alive   server  1.13.4  2         dc1  default    <all>
ip-xx-yy-zzz-32.eu-central-1.compute.internal  xx.yy.zzz.203:8301  alive   client  1.13.4  2         dc1  default    <default>
ip-xx-yy-zzz-74.eu-central-1.compute.internal  xx.yy.zzz.212:8301  alive   client  1.13.4  2         dc1  default    <default>
ip-xx-yy-zzx-4.eu-central-1.compute.internal   xx.yy.zzx.240:8301  alive   client  1.13.4  2         dc1  default    <default>

num_peers = 0

Inside the consul-server-2 container:
kubectl exec -it consul-server-2 -n consul -- sh
/ $ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::8500                 :::*                    LISTEN      11/consul
tcp        0      0 :::8503                 :::*                    LISTEN      11/consul
tcp        0      0 :::8600                 :::*                    LISTEN      11/consul
tcp        0      0 :::8300                 :::*                    LISTEN      11/consul
tcp        0      0 :::8301                 :::*                    LISTEN      11/consul
tcp        0      0 :::8302                 :::*                    LISTEN      11/consul
udp        0      0 :::8301                 :::*                                11/consul
udp        0      0 :::8302                 :::*                                11/consul
udp        0      0 :::8600                 :::*                                11/consul

/ $ consul members
Node                                           Address             Status  Type    Build   Protocol  DC   Partition  Segment
consul-server-2                                xx.yy.zzz.225:8301  alive   server  1.13.4  2         dc1  default    <all>
ip-xx-yy-zzz-32.eu-central-1.compute.internal  xx.yy.zzz.203:8301  alive   client  1.13.4  2         dc1  default    <default>
ip-xx-yy-zzz-74.eu-central-1.compute.internal  xx.yy.zzz.212:8301  alive   client  1.13.4  2         dc1  default    <default>
ip-xx-yy-zzx-4.eu-central-1.compute.internal   xx.yy.zzx.240:8301  alive   client  1.13.4  2         dc1  default    <default>

This setup works perfectly fine with a single Consul server replica.
Can someone please look into this and guide me?
Any help would be greatly appreciated. Thanks in advance.

Hi @aishwarya.poojary1,

Welcome to the HashiCorp Forums!

You shouldn’t be setting server.exposeGossipAndRPCPorts and client.exposeGossipPorts at the same time unless you also change the gossip port for one of them (the client’s, per the docs below). This is because both the server and client pods would try to bind to the same gossip port on the host.

Please refer to the documentation linked below.

so if you are running clients and servers on the same node the ports will conflict if they are both 8301. When you enable server.exposeGossipAndRPCPorts and client.exposeGossipPorts, you must change this from the default to an unused port on the host, e.g. 9301.
ref: Helm Chart Reference | Consul | HashiCorp Developer
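
If you do need both exposed, a minimal sketch of the values would look something like this (please double-check the serflan port key against the chart version you are on):

server:
  exposeGossipAndRPCPorts: true
client:
  exposeGossipPorts: true
  ports:
    serflan:
      port: 9301   # move the client LAN gossip port off 8301 so it doesn't clash with the server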

I hope this helps.

Thanks for your response @Ranjandas
Yes, you are right, and I had also tried those settings one at a time to check whether that makes any difference, but the issue was still the same. Moreover, enabling this is not required in our environment, as we don’t connect to any clients outside the EKS cluster. This is the config file that was last applied (the commented settings were also tested at some point):
values.txt (1.1 KB)

Please let me know if you see anything else blocking the servers from electing a leader

Hi @aishwarya.poojary1,

I tried installing Consul on K8s 1.27 using the values.txt file you attached, and everything is working fine for me. I would recommend you look at (and share) the logs from all three server agents to see why they are unable to elect a leader.
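
For example, something along these lines would capture the logs from each server pod (plain kubectl; pod and namespace names taken from your earlier output):

kubectl logs consul-server-0 -n consul > consul-server-0.txt
kubectl logs consul-server-1 -n consul > consul-server-1.txt
kubectl logs consul-server-2 -n consul > consul-server-2.txt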

By the way, when I looked at the logs you shared, they show that consul-server-1 and consul-server-2 have the same IP (at least the last two octets).

$ egrep "Cluster Addr|Node name" consul-server-{1,2}.txt
consul-server-1.txt:            Node name: 'consul-server-1'
consul-server-1.txt:         Cluster Addr: xx.yy.141.22 (LAN: 8301, WAN: 8302)
consul-server-2.txt:            Node name: 'consul-server-2'
consul-server-2.txt:         Cluster Addr: xx.yy.141.22 (LAN: 8301, WAN: 8302)

Can you verify this?
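
A quick way to verify this is to compare the pod IPs that Kubernetes has assigned, for example:

kubectl get pods -n consul -o wide
# each consul-server-N pod should show a distinct IP; two servers advertising the
# same Cluster Addr would prevent a quorum from forming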

Hello @Ranjandas

Thanks for your time.
My EKS cluster is within a private network (endpoint access: private), so can you please confirm yours, and also check the output of "netstat -tulpn" from each of your Consul servers? I am not sure whether this makes any difference, but it may be purely a networking issue, given the ports that are open within the Consul server containers (output attached in the first post) and the error in the consul-server-2 logs indicating “dial tcp xx.yy.141.212:8301: connect: connection refused”.
The IPs in the logs are randomized; I masked them intentionally for security reasons.
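
For reference, this is roughly the loop I used to capture the listener output from each server pod (the equivalent of the interactive kubectl exec / netstat commands above):

for i in 0 1 2; do
  echo "--- consul-server-$i ---"
  kubectl exec consul-server-$i -n consul -- netstat -tulpn
done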