I’m having an issue where I cannot join additional servers/nodes to the cluster. There might be an issue with some existing nodes as well, but I suspect it’s all related. The environment runs mostly inside Docker/Rancher using consul:1.2.0 as the image (config below), with a few VMs outside of Docker/Rancher.
I’m getting the same two errors in the logs over and over:
6/27/2022 1:35:01 PM 2022/06/27 13:35:01 [ERR] agent: failed to sync remote state: No cluster leader
6/27/2022 1:35:24 PM 2022/06/27 13:35:24 [ERR] agent: Coordinate update error: No cluster leader
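For what it’s worth, the leader status can also be checked directly against the HTTP API (these are the standard /v1/status endpoints, nothing specific to my setup; I’m using busybox wget since I’m not certain curl is in the image). Given the errors above, I’d expect the leader endpoint to come back empty on the affected nodes:
/ # wget -qO- http://127.0.0.1:8500/v1/status/leader
/ # wget -qO- http://127.0.0.1:8500/v1/status/peers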
Environment variables passed via Docker:
CONSUL_BIND_INTERFACE=eth0
CONSUL_HTTP_ADDR=0.0.0.0:8500
CONSUL_LOCAL_CONFIG={"acl_datacenter":"dc1","acl_default_policy":"allow","acl_down_policy":"allow","acl_master_token":"XXXXXXXX","disable_remote_exec":true,"encrypt":"XXXXXXXX","log_level": "INFO","reconnect_timeout":"8h","skip_leave_on_interrupt": true}
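Pretty-printed for readability, that CONSUL_LOCAL_CONFIG is:
{
  "acl_datacenter": "dc1",
  "acl_default_policy": "allow",
  "acl_down_policy": "allow",
  "acl_master_token": "XXXXXXXX",
  "disable_remote_exec": true,
  "encrypt": "XXXXXXXX",
  "log_level": "INFO",
  "reconnect_timeout": "8h",
  "skip_leave_on_interrupt": true
}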
Command (via Docker): agent -server -ui -bootstrap-expect=3 -client=0.0.0.0 -datacenter=dc1 -domain=XXXXX.consul -retry-join=consul.discovery.rancher.internal -recursor=169.254.169.250
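I can also confirm the retry-join name resolves from inside a container (using busybox nslookup, assuming it behaves the same in this image; 169.254.169.250 is Rancher’s internal DNS, which is why it’s also set as the recursor):
/ # nslookup consul.discovery.rancher.internal 169.254.169.250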
In case it helps, this is the ps output from inside one of the Docker containers (note that the image’s entrypoint derives the -bind address from CONSUL_BIND_INTERFACE, which is why -bind=10.xx.yy.zz appears on the consul process but not in the command I pass):
PID USER TIME COMMAND
1 root 0:00 {docker-entrypoi} /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh agent -server -ui -bootstrap-expect=3 -client=0.0.0.0 -datacenter=dc1 -domain=XXXXX.consul -retry-join=consul.discovery.rancher.internal -recursor=169.254.169.250
6 consul 0:42 consul agent -data-dir=/consul/data -config-dir=/consul/config -bind=10.xx.yy.zz -server -ui -bootstrap-expect=3 -client=0.0.0.0 -datacenter=dc1 -domain=XXXXX.consul -retry-join=consul.discovery.rancher.internal -recursor=169.254.169.250
Telnet to the cluster leader on port 8301 (the Serf LAN port) works.
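For reference, that check is just plain telnet against the leader’s Serf LAN port (IP redacted the same way as elsewhere in this post):
telnet 10.xx.yy.zz 8301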
On the new server being added, I can see 16 members (2 of them are the new, non-working nodes):
/ # consul members
Node Address Status Type Build Protocol DC Segment
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
linux-serverXX 10.xx.yy.zz:8301 alive server 0.7.4 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
docker-containerXX 10.xx.yy.zz:8301 alive server 1.2.0 2 dc1 <all>
win-clientXX 10.xx.yy.zz:8301 alive client 0.7.1 2 dc1 <default>
win-clientXX 10.xx.yy.zz:8301 alive client 0.7.1 2 dc1 <default>
On one of the working Docker containers, I can see the full raft peer list, which excludes my new containers and the VMs:
/ # consul operator raft list-peers
Node ID Address State Voter RaftProtocol
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 leader true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
docker-containerXX XXXXXXXX 10.xx.yy.zz:8300 follower true 3
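The same peer set can also be pulled from the operator HTTP endpoint on that working container, in case the raw JSON is more useful (standard /v1/operator endpoint; I’m omitting a token here since the default ACL policy is allow):
/ # wget -qO- http://127.0.0.1:8500/v1/operator/raft/configuration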
On linux-serverXX, consul members returns the same list, but querying the raft peers fails with the same error I get on the new servers:
[root@linux-serverXX ~]# consul operator raft -list-peers
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)