Can't Join New Servers to Existing Cluster

Hossy · June 27, 2022, 7:49pm

I’m having an issue in my environment where I cannot join additional servers/nodes to the cluster. There might be an issue with some existing nodes as well, but this might all be related. The environment runs mostly inside Docker/Rancher using consul:1.2.0 as the image (config below) with a few VMs outside of Docker/Rancher.

I’m getting the same two errors in the logs over and over:

6/27/2022 1:35:01 PM    2022/06/27 13:35:01 [ERR] agent: failed to sync remote state: No cluster leader
6/27/2022 1:35:24 PM    2022/06/27 13:35:24 [ERR] agent: Coordinate update error: No cluster leader

Environment variables passed via Docker:

CONSUL_BIND_INTERFACE=eth0
CONSUL_HTTP_ADDR=0.0.0.0:8500
CONSUL_LOCAL_CONFIG={"acl_datacenter":"dc1","acl_default_policy":"allow","acl_down_policy":"allow","acl_master_token":"XXXXXXXX","disable_remote_exec":true,"encrypt":"XXXXXXXX","log_level": "INFO","reconnect_timeout":"8h","skip_leave_on_interrupt": true}

Command (via Docker): agent -server -ui -bootstrap-expect=3 -client=0.0.0.0 -datacenter=dc1 -domain=XXXXX.consul -retry-join=consul.discovery.rancher.internal -recursor=169.254.169.250

In case it helps, this is the ps output from inside a Docker container:

PID   USER     TIME   COMMAND
    1 root       0:00 {docker-entrypoi} /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh agent -server -ui -bootstrap-expect=3 -client=0.0.0.0 -datacenter=dc1 -domain=XXXXX.consul -retry-join=consul.discovery.rancher.internal -recursor=169.254.169.250
    6 consul     0:42 consul agent -data-dir=/consul/data -config-dir=/consul/config -bind=10.xx.yy.zz -server -ui -bootstrap-expect=3 -client=0.0.0.0 -datacenter=dc1 -domain=XXXXX.consul -retry-join=consul.discovery.rancher.internal -recursor=169.254.169.250

Telnet works to the cluster leader on 8301.

On the new server being added, I can see 16 members (2 of them are new and not working).

/ # consul members
Node                                Address            Status  Type    Build  Protocol  DC   Segment
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
linux-serverXX      10.xx.yy.zz:8301  alive   server  0.7.4  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
docker-containerXX  10.xx.yy.zz:8301  alive   server  1.2.0  2         dc1  <all>
win-clientXX        10.xx.yy.zz:8301  alive   client  0.7.1  2         dc1  <default>
win-clientXX        10.xx.yy.zz:8301  alive   client  0.7.1  2         dc1  <default>

On one of the working docker containers, I see all the raft peers that excludes my new containers and the VMs:

/ # consul operator raft list-peers
Node                ID        Address           State     Voter  RaftProtocol
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  leader    true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3
docker-containerXX  XXXXXXXX  10.xx.yy.zz:8300  follower  true   3

On linux-serverXX, I get the same consul members list but I get an error when trying to get raft peers. This is the same error I’m getting on the new servers when trying to query raft peers.

[root@linux-serverXX ~]# consul operator raft -list-peers
Operator "raft" subcommand failed: Unexpected response code: 500 (No cluster leader)

maxb · June 28, 2022, 8:18am

Crikey, there’s a great deal to respond to here!

Let’s take the easiest point first… Consul 1.2 is extremely old. Not only do you not benefit from literally years of bug fixes, but most people will have long since forgotten what things were like back then, so you restrict your ability to get support. You need to upgrade.

Next up, you have a Consul 0.7 server in your cluster. I’m pretty sure this is not OK. A Consul server cluster is supposed to be running all the same version of the software, except for a short time whilst an upgrade is in progress - the fact that it’s a 0.x version makes it even more concerning.

Next, you have way too many servers. The majority of Consul clusters should have either 3 or 5 servers. Beyond that, you’re adding quite a lot of extra overhead in coordination traffic between servers for questionable benefit.

Lastly, and critically, you’re operating with an unsafe bootstrap-expect configuration. This setting means that whenever that many uninitialised servers find each other, they set themselves up as a new Raft cluster.

My guess is that last issue may mean that different sets of servers in your cluster disagree about which servers are actually members.

One last tip: I think, IIRC

consul operator raft list-peers -stale

works to see the Raft peer set according to a server, even if it doesn’t know if a leader.

Hossy · June 28, 2022, 12:44pm

Thank you for the reply! Unfortunately, I do not disagree with what you are saying. I inherited this environment and am in the process of replacing it, but I still need to keep life support operational. This is not the only environment configured like this, but the other one is accepting of new servers.

I ran the following to check the raft peers on all servers and they all agree (the servers that are working, that is).

/ # for i in `consul members | tr -s ' ' | grep 'server' | cut -d' ' -f2 | cut -d: -f1 | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n`; do echo $i; curl http://$i:8500/v1/status/peers; echo; done
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz - linux VM
curl: (7) Failed to connect to 10.xx.yy.zz port 8500: Connection refused

10.xx.yy.zz - docker container
curl: (7) Failed to connect to 10.xx.yy.zz port 8500: Connection refused

10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz - docker container
[]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
10.xx.yy.zz
["10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300","10.xx.yy.zz:8300"]
/ #

My understanding of bootstrap-expect is that a new cluster will not be formed unless 3 uninitialized servers meet using the DNS result of consul.discovery.rancher.internal from the Rancher internal DNS resolver 169.254.169.250, which would not be possible in this environment unless I was truly starting from scratch.

Are there any other logs I can inspect that might suggest why the new servers are unable to find a cluster leader? I ran a similar command to above to query the cluster leader from all servers and they all agree with the exception of the 3 problem servers again.

for i in `consul members | tr -s ' ' | grep 'server' | cut -d' ' -f2 | cut -d: -f1 | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n`; do echo $i; curl http://$i:8500/v1/status/leader; echo; done

maxb · June 28, 2022, 3:28pm

I would not be in the least bit surprised if a Consul 0.7 server cannot join a Consul 1.2 cluster.

As for why some of your Docker containers running 1.2 can’t join, you’d have to show some logs. Preferably stop obfuscating the IP addresses, as host identity may be important.

Hossy · June 28, 2022, 4:37pm

Do you have specific logs in mind? I was just looking at the console output (same as consul monitor) and it didn’t really show anything short of enabling DEBUG logs.

maxb · June 28, 2022, 5:29pm

Debug logs, including a restart of the Consul process, to see how it attempts to join the cluster

Hossy · June 28, 2022, 6:56pm

So, because Consul is running in Rancher, I had to make the log level change from INFO to DEBUG in the Rancher service, which applies it to all the Consul containers (through the deployment of new containers reusing the Consul data directory). Here is the (slightly edited) DEBUG log output from the problem Consul Docker container instance. I changed the network and server names for privacy, but the data is still meaningful bidirectionally. I had to use pastebin because of the character limit here.

maxb · June 28, 2022, 7:19pm

It appears you have multiple nodes trying to identify themselves to the cluster as localhost.localdomain - each node needs to have a unique name:

[WARN] serf: Name conflict for 'localhost.localdomain' both 192.168.11.116:8301 and 192.168.11.120:8301 are claiming
[ERR] memberlist: Conflicting address for localhost.localdomain. Mine: 192.168.11.116:8301 Theirs: 192.168.11.120:8301

Furthermore, you also have a conflict of node IDs:

[DEBUG] memberlist: Failed to join 192.168.11.28: Member 'dc1-116.dc1.company.corp' has conflicting node ID '9123397d-1058-471b-4977-70daa1afac2e' with member 'localhost.localdomain'

Since the node ID is a UUID, randomly generated at first node initialisation and saved into the data directory, I wonder if someone has done something weird like clone an initialised data directory to multiple nodes.

These issues should be rectified first. Once they’re out of the way and no longer causing trouble, we can see if there are additional issues or not.

maxb · June 28, 2022, 7:22pm

Also please post the output of consul operator raft list-peers without redacting the node IDs.

Hossy · July 1, 2022, 5:50pm

Thanks for pointing out the conflict with localhost.localdomain. I have resolved that by rebooting the hosts which are supposed to get their names from PTRs in DNS. All are named correctly now. As far as the node ID conflict goes, that must’ve been temporary or old because it is no longer in conflict, at least that I’m seeing in the raft output.

DEBUG logs of one of the problematic hosts, dc1-144.dc1.company.corp:

2022-07-01T17:40:42.439564714Z ==> Found address '192.168.11.144' for interface 'eth0', setting bind option...
2022-07-01T17:40:42.501497364Z WARNING: LAN keyring exists but -encrypt given, using keyring
2022-07-01T17:40:42.501539712Z WARNING: WAN keyring exists but -encrypt given, using keyring
2022-07-01T17:40:42.501544592Z bootstrap_expect > 0: expecting 3 servers
2022-07-01T17:40:42.502489953Z ==> Starting Consul agent...
2022-07-01T17:40:42.557092866Z ==> Consul agent running!
2022-07-01T17:40:42.557114738Z            Version: 'v1.2.0'
2022-07-01T17:40:42.557125832Z            Node ID: '9e519d4e-9707-7449-a13b-bf17718a839e'
2022-07-01T17:40:42.557128867Z          Node name: 'dc1-144.dc1.company.corp'
2022-07-01T17:40:42.557131847Z         Datacenter: 'dc1' (Segment: '<all>')
2022-07-01T17:40:42.557135023Z             Server: true (Bootstrap: false)
2022-07-01T17:40:42.557137863Z        Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
2022-07-01T17:40:42.557140834Z       Cluster Addr: 192.168.11.144 (LAN: 8301, WAN: 8302)
2022-07-01T17:40:42.557143739Z            Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false
2022-07-01T17:40:42.557146575Z 
2022-07-01T17:40:42.557149137Z ==> Log data will now stream in as it occurs:
2022-07-01T17:40:42.557152066Z 
2022-07-01T17:40:42.557155317Z     2022/07/01 12:40:42 [WARN] agent: Node name "dc1-144.dc1.company.corp" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2022-07-01T17:40:42.557161407Z     2022/07/01 12:40:42 [INFO] raft: Initial configuration (index=0): []
2022-07-01T17:40:42.557164201Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-144.dc1.company.corp.dc1 192.168.11.144
2022-07-01T17:40:42.557167243Z     2022/07/01 12:40:42 [INFO] serf: Attempting re-join to previously known node: dc1-142.dc1.company.corp.dc1: 192.168.11.142:8302
2022-07-01T17:40:42.557170231Z     2022/07/01 12:40:42 [INFO] raft: Node at 192.168.11.144:8300 [Follower] entering Follower state (Leader: "")
2022-07-01T17:40:42.557173255Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-144.dc1.company.corp 192.168.11.144
2022-07-01T17:40:42.557176147Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.142:8302
2022-07-01T17:40:42.557179164Z     2022/07/01 12:40:42 [INFO] serf: Attempting re-join to previously known node: dc1-136.dc1.company.corp: 192.168.11.136:8301
2022-07-01T17:40:42.557182510Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-144.dc1.company.corp (Addr: tcp/192.168.11.144:8300) (DC: dc1)
2022-07-01T17:40:42.557185462Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-144.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.557188583Z     2022/07/01 12:40:42 [DEBUG] agent/proxy: managed Connect proxy manager started
2022-07-01T17:40:42.557191626Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-145.dc1.company.corp.dc1 192.168.11.145
2022-07-01T17:40:42.557203384Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-121.dc1.company.corp.dc1 192.168.11.121
2022-07-01T17:40:42.557207552Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-127.dc1.company.corp.dc1 192.168.11.127
2022-07-01T17:40:42.557210574Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-27.dc1.company.corp.dc1 192.168.11.27
2022-07-01T17:40:42.557213444Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-28.dc1.company.corp.dc1 192.168.11.28
2022-07-01T17:40:42.557216239Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-143.dc1.company.corp.dc1 192.168.11.143
2022-07-01T17:40:42.557219068Z     2022/07/01 12:40:42 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
2022-07-01T17:40:42.557221845Z     2022/07/01 12:40:42 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
2022-07-01T17:40:42.557224710Z     2022/07/01 12:40:42 [INFO] agent: Started HTTP server on 0.0.0.0:8500 (tcp)
2022-07-01T17:40:42.557227643Z     2022/07/01 12:40:42 [INFO] agent: started state syncer
2022-07-01T17:40:42.557230468Z     2022/07/01 12:40:42 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer triton
2022-07-01T17:40:42.557233361Z     2022/07/01 12:40:42 [INFO] agent: Joining LAN cluster...
2022-07-01T17:40:42.557236205Z     2022/07/01 12:40:42 [INFO] agent: (LAN) joining: [consul.discovery.rancher.internal]
2022-07-01T17:40:42.557239081Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.136:8301
2022-07-01T17:40:42.558450939Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-142.dc1.company.corp.dc1 192.168.11.142
2022-07-01T17:40:42.558467183Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-116.dc1.company.corp.dc1 192.168.11.116
2022-07-01T17:40:42.558471230Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-146.dc1.company.corp.dc1 192.168.11.146
2022-07-01T17:40:42.558474232Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-120.dc1.company.corp.dc1 192.168.11.120
2022-07-01T17:40:42.558477110Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-consul01.dc1 192.168.11.37
2022-07-01T17:40:42.558480055Z     2022/07/01 12:40:42 [WARN] memberlist: Refuting an alive message
2022-07-01T17:40:42.558482868Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-136.dc1.company.corp.dc1 192.168.11.136
2022-07-01T17:40:42.564460407Z     2022/07/01 12:40:42 [INFO] serf: Re-joined to previously known node: dc1-142.dc1.company.corp.dc1: 192.168.11.142:8302
2022-07-01T17:40:42.564484890Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-145.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564497506Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-121.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564500887Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-127.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564513782Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-27.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564517302Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-28.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564520545Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-143.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564523530Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-142.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564526519Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-116.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564529946Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-146.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564533153Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-120.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564536208Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-consul01.dc1" in area "wan"
2022-07-01T17:40:42.564539440Z     2022/07/01 12:40:42 [INFO] consul: Handled member-join event for server "dc1-136.dc1.company.corp.dc1" in area "wan"
2022-07-01T17:40:42.564542477Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-142.dc1.company.corp 192.168.11.142
2022-07-01T17:40:42.564545327Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-116.dc1.company.corp 192.168.11.116
2022-07-01T17:40:42.564548127Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-28.dc1.company.corp 192.168.11.28
2022-07-01T17:40:42.564551080Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-consul01 192.168.11.37
2022-07-01T17:40:42.564553932Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-121.dc1.company.corp 192.168.11.121
2022-07-01T17:40:42.564557750Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-143.dc1.company.corp 192.168.11.143
2022-07-01T17:40:42.564560679Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-145.dc1.company.corp 192.168.11.145
2022-07-01T17:40:42.564563531Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-svcapi02 192.168.11.54
2022-07-01T17:40:42.564566327Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-136.dc1.company.corp 192.168.11.136
2022-07-01T17:40:42.564569292Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-120.dc1.company.corp 192.168.11.120
2022-07-01T17:40:42.564572120Z     2022/07/01 12:40:42 [WARN] memberlist: Refuting a suspect message (from: dc1-144.dc1.company.corp)
2022-07-01T17:40:42.564574953Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-27.dc1.company.corp 192.168.11.27
2022-07-01T17:40:42.567431549Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.146:8301
2022-07-01T17:40:42.567457487Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-146.dc1.company.corp 192.168.11.146
2022-07-01T17:40:42.567462375Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-127.dc1.company.corp 192.168.11.127
2022-07-01T17:40:42.567465296Z     2022/07/01 12:40:42 [INFO] serf: EventMemberJoin: dc1-svcapi01 192.168.11.53
2022-07-01T17:40:42.567468272Z     2022/07/01 12:40:42 [INFO] serf: Re-joined to previously known node: dc1-136.dc1.company.corp: 192.168.11.136:8301
2022-07-01T17:40:42.567478446Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-142.dc1.company.corp (Addr: tcp/192.168.11.142:8300) (DC: dc1)
2022-07-01T17:40:42.570434803Z     2022/07/01 12:40:42 [INFO] consul: Existing Raft peers reported by dc1-136.dc1.company.corp, disabling bootstrap mode
2022-07-01T17:40:42.570453116Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-116.dc1.company.corp (Addr: tcp/192.168.11.116:8300) (DC: dc1)
2022-07-01T17:40:42.570457218Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-28.dc1.company.corp (Addr: tcp/192.168.11.28:8300) (DC: dc1)
2022-07-01T17:40:42.570460447Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-consul01 (Addr: tcp/192.168.11.37:8300) (DC: dc1)
2022-07-01T17:40:42.570463474Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-121.dc1.company.corp (Addr: tcp/192.168.11.121:8300) (DC: dc1)
2022-07-01T17:40:42.570466454Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-143.dc1.company.corp (Addr: tcp/192.168.11.143:8300) (DC: dc1)
2022-07-01T17:40:42.570469539Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-145.dc1.company.corp (Addr: tcp/192.168.11.145:8300) (DC: dc1)
2022-07-01T17:40:42.570472910Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-136.dc1.company.corp (Addr: tcp/192.168.11.136:8300) (DC: dc1)
2022-07-01T17:40:42.570483935Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-120.dc1.company.corp (Addr: tcp/192.168.11.120:8300) (DC: dc1)
2022-07-01T17:40:42.570486898Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-27.dc1.company.corp (Addr: tcp/192.168.11.27:8300) (DC: dc1)
2022-07-01T17:40:42.570490121Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-146.dc1.company.corp (Addr: tcp/192.168.11.146:8300) (DC: dc1)
2022-07-01T17:40:42.570493055Z     2022/07/01 12:40:42 [INFO] consul: Adding LAN server dc1-127.dc1.company.corp (Addr: tcp/192.168.11.127:8300) (DC: dc1)
2022-07-01T17:40:42.574465099Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.28:8301
2022-07-01T17:40:42.578431198Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.116:8301
2022-07-01T17:40:42.584443605Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.143:8301
2022-07-01T17:40:42.590445177Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.127:8301
2022-07-01T17:40:42.596447778Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.136:8301
2022-07-01T17:40:42.603465506Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.121:8301
2022-07-01T17:40:42.608404801Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.27:8301
2022-07-01T17:40:42.614402155Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.142:8301
2022-07-01T17:40:42.620400955Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.145:8301
2022-07-01T17:40:42.629404454Z     2022/07/01 12:40:42 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.120:8301
2022-07-01T17:40:42.629431711Z     2022/07/01 12:40:42 [INFO] agent: (LAN) joined: 11 Err: <nil>
2022-07-01T17:40:42.629436442Z     2022/07/01 12:40:42 [DEBUG] agent: systemd notify failed: No socket
2022-07-01T17:40:42.629439874Z     2022/07/01 12:40:42 [INFO] agent: Join LAN completed. Synced with 11 initial agents
2022-07-01T17:40:42.854161881Z     2022/07/01 12:40:42 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:42.949520343Z     2022/07/01 12:40:42 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:42.965567413Z     2022/07/01 12:40:42 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.142855141Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.144017549Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.176223289Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.183260878Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.225413823Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.386470675Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.456438930Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:43.541289126Z     2022/07/01 12:40:43 [DEBUG] serf: messageJoinType: dc1-144.dc1.company.corp
2022-07-01T17:40:44.362436130Z     2022/07/01 12:40:44 [DEBUG] memberlist: Stream connection from=192.168.11.28:46286
2022-07-01T17:40:49.612776245Z     2022/07/01 12:40:49 [ERR] agent: failed to sync remote state: No cluster leader
2022-07-01T17:40:49.744516885Z     2022/07/01 12:40:49 [DEBUG] memberlist: Stream connection from=192.168.11.53:57548
2022-07-01T17:40:50.373887816Z     2022/07/01 12:40:50 [WARN] raft: no known peers, aborting election
2022-07-01T17:40:51.971733277Z     2022/07/01 12:40:51 [DEBUG] memberlist: Stream connection from=192.168.11.145:42482
2022-07-01T17:40:58.330897843Z ==> Newer Consul version available: 1.12.2 (currently running: 1.2.0)
2022-07-01T17:41:07.920823431Z     2022/07/01 12:41:07 [ERR] agent: Coordinate update error: No cluster leader
2022-07-01T17:41:14.476675842Z     2022/07/01 12:41:14 [ERR] agent: failed to sync remote state: No cluster leader
2022-07-01T17:41:21.990202362Z     2022/07/01 12:41:21 [DEBUG] memberlist: Stream connection from=192.168.11.121:37514
2022-07-01T17:41:30.286488724Z     2022/07/01 12:41:30 [DEBUG] memberlist: Stream connection from=192.168.11.28:47024
2022-07-01T17:41:33.719233557Z     2022/07/01 12:41:33 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.143:8301
2022-07-01T17:41:44.898813238Z     2022/07/01 12:41:44 [ERR] agent: Coordinate update error: No cluster leader
2022-07-01T17:41:47.608325495Z     2022/07/01 12:41:47 [ERR] agent: failed to sync remote state: No cluster leader
2022-07-01T17:41:51.159686633Z     2022/07/01 12:41:51 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.11.120:8302

Also, here is the output from consul operator raft list-peers and consul members:

/ # consul operator raft list-peers
Node                       ID                                    Address              State     Voter  RaftProtocol
dc1-116.dc1.company.corp   9123397d-1058-471b-4977-70daa1afac2e  192.168.11.116:8300  follower  true   3
dc1-120.dc1.company.corp   97bfe9d2-d75c-8ec1-3c6b-cd3c3a3686ec  192.168.11.120:8300  follower  true   3
dc1-121.dc1.company.corp   d12b9f9f-4827-e889-f4f2-5d680926735c  192.168.11.121:8300  follower  true   3
dc1-127.dc1.company.corp   1f49d8ff-a2e6-5028-6ae5-49610e2df551  192.168.11.127:8300  follower  true   3
dc1-136.dc1.company.corp   a7dadf66-4c59-64ca-d5e5-c6729914686c  192.168.11.136:8300  follower  true   3
dc1-142.dc1.company.corp   7aa0063e-35d3-329e-c18c-ff608eca5a18  192.168.11.142:8300  follower  true   3
dc1-143.dc1.company.corp   72fafea5-63d7-b93a-6d2f-98605ceb77e7  192.168.11.143:8300  leader    true   3
dc1-145.dc1.company.corp   b5ad39dd-e335-3d51-7a59-c2d41bfc9c42  192.168.11.145:8300  follower  true   3
dc1-146.dc1.company.corp   a01119f3-fa9f-c129-de22-683bcab32421  192.168.11.146:8300  follower  true   3
dc1-27.dc1.company.corp    6ad0b6ef-1f21-9e3c-1acb-02f260e161f6  192.168.11.27:8300   follower  true   3
dc1-28.dc1.company.corp    bf43ea61-5b16-5ad8-3723-e3e3d3565c50  192.168.11.28:8300   follower  true   3
/ # consul members
Node                       Address              Status  Type    Build  Protocol  DC   Segment
dc1-116.dc1.company.corp   192.168.11.116:8301  alive   server  1.2.0  2         dc1  <all>
dc1-120.dc1.company.corp   192.168.11.120:8301  alive   server  1.2.0  2         dc1  <all>
dc1-121.dc1.company.corp   192.168.11.121:8301  alive   server  1.2.0  2         dc1  <all>
dc1-127.dc1.company.corp   192.168.11.127:8301  alive   server  1.2.0  2         dc1  <all>
dc1-136.dc1.company.corp   192.168.11.136:8301  alive   server  1.2.0  2         dc1  <all>
dc1-142.dc1.company.corp   192.168.11.142:8301  alive   server  1.2.0  2         dc1  <all>
dc1-143.dc1.company.corp   192.168.11.143:8301  alive   server  1.2.0  2         dc1  <all>
dc1-144.dc1.company.corp   192.168.11.144:8301  alive   server  1.2.0  2         dc1  <all>
dc1-145.dc1.company.corp   192.168.11.145:8301  alive   server  1.2.0  2         dc1  <all>
dc1-146.dc1.company.corp   192.168.11.146:8301  alive   server  1.2.0  2         dc1  <all>
dc1-consul01               192.168.11.37:8301   alive   server  0.7.4  2         dc1  <all>
dc1-27.dc1.company.corp    192.168.11.27:8301   alive   server  1.2.0  2         dc1  <all>
dc1-28.dc1.company.corp    192.168.11.28:8301   alive   server  1.2.0  2         dc1  <all>
dc1-svcapi01               192.168.11.53:8301   alive   client  0.7.1  2         dc1  <default>
dc1-svcapi02               192.168.11.54:8301   alive   client  0.7.1  2         dc1  <default>
/ #

Hossy · July 1, 2022, 6:03pm

I did find these two lines to be weird.

2022-07-01T17:40:42.557155317Z 2022/07/01 12:40:42 [WARN] agent: Node name "dc1-144.dc1.company.corp" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.

I presume this is complaining because the node name is the FQDN instead of just the hostname portion?

2022-07-01T17:40:42.564572120Z 2022/07/01 12:40:42 [WARN] memberlist: Refuting a suspect message (from: dc1-144.dc1.company.corp)

I thought it was weird that it would be refuting a message from itself.

maxb · July 2, 2022, 6:05am

OK, first let’s deal with dc1-consul01. Either upgrade the Consul version, or reconfigure it to not even try being a server. Consul 0.7 is not capable of joining a Raft Protocol 3 cluster.

dc1-144 is a bit more interesting. I’m really not sure what’s going on here, and the next thing I’d be inclined to try is:

Shut down Consul on dc1-144
Wipe its data directory so it’s initialising from a clean start
Watch and save the Consul logs from the current Consul leader as well as dc1-144 whilst starting it up and seeing if it manages to join

Hossy · July 21, 2022, 7:02pm

Hi @maxb , sorry I didn’t see your reply until just now. I got pulled off onto another project. Hopefully when I can circle back to this you are available. I really appreciate your help and responsiveness!

Topic		Replies	Views
New consul server cannot join consul cluster Consul	9	5693	September 22, 2020
Failed to sync remote state: error="No cluster leader" Consul	3	8893	June 30, 2022
Consul[407]: agent: Coordinate update error: error="No cluster leader" Consul	8	1603	May 27, 2023
Consul not able to start in server side Consul	1	300	November 14, 2019
Errors in new Consul cluster Consul	3	2490	February 26, 2023

Can't Join New Servers to Existing Cluster

Related topics