Failed to sync remote state: error="No cluster leader"

Hello,
I can’t initialize a cluster. I have tried 5, 3, 2, and 1 nodes. Consul only works when running a single node with bootstrap_expect = 1. Config file (the same on all nodes):

datacenter = "dc1"
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
log_level = "INFO"
enable_syslog = true
node_name = "%{node_name}%"
server = true
bootstrap_expect = 3
bind_addr = "0.0.0.0"
advertise_addr = "10.xxx.xxx.xxx"
start_join = [
  "10.xxx.xxx.xxx",
  "10.xxx.xxx.xxx",
  "10.xxx.xxx.xxx"
]
retry_join = [
  "10.xxx.xxx.xxx:8301",
  "10.xxx.xxx.xxx:8301",
  "10.xxx.xxx.xxx:8301"
]

rejoin_after_leave = true

Logs:
node0:

Jun 29 20:21:17 srv1-prod consul[44883]: agent: Started DNS server: address=0.0.0.0:8600 network=udp
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Started DNS server: address=0.0.0.0:8600 network=tcp
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Starting server: address=[::]:8500 network=tcp protocol=http
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Joining cluster
Jun 29 20:21:17 srv1-prod consul[44883]: agent: (LAN) joining: lan_addresses=[10.0.1.4, 10.0.1.5, 10.0.1.9]
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Initiating push/pull sync with:  10.0.1.4:8301
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Stream connection from=10.0.1.4:48710
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server: Existing Raft peers reported by server, disabling bootstrap mode: server=srv2-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server: Adding LAN server: server="srv2-prod (Addr: tcp/10.0.1.5:8300) (DC: dc1)"
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Initiating push/pull sync with:  10.0.1.5:8301
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Initiating push/pull sync with:  10.0.1.9:8301
Jun 29 20:21:17 srv1-prod consul[44883]: agent: (LAN) joined: number_of_nodes=3
Jun 29 20:21:17 srv1-prod consul[44883]: agent: systemd notify failed: error="No socket"
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Join completed. Initial agents synced with: agent_count=3
Jun 29 20:21:17 srv1-prod consul[44883]: agent: started state syncer
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Consul agent running!
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Stream connection from=10.0.1.9:53510
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:23 srv1-prod consul[44883]: agent: Check missed TTL, is now critical: check=redis@10.0.1.4:6379:replication-status-check
Jun 29 20:21:24 srv1-prod consul[44883]: agent.anti_entropy: failed to sync remote state: error="No cluster leader"
Jun 29 20:21:24 srv1-prod consul[44883]: agent.server.memberlist.wan: memberlist: Stream connection from=10.0.1.5:47670

node1:

Jun 29 20:18:13 srv2-prod consul[17328]: agent: (LAN) joined: number_of_nodes=3
Jun 29 20:18:13 srv2-prod consul[17328]: agent: Join completed. Initial agents synced with: agent_count=3
Jun 29 20:18:13 srv2-prod consul[17328]: agent: started state syncer
Jun 29 20:18:13 srv2-prod consul[17328]: agent: Consul agent running!
Jun 29 20:18:19 srv2-prod consul[17328]: agent: Check missed TTL, is now critical: check=redis@10.0.1.5:6379:replication-status-check
Jun 29 20:18:19 srv2-prod consul[17328]: agent: Check missed TTL, is now critical: check="Resec: slave replication status"
Jun 29 20:18:20 srv2-prod consul[17328]: agent.rpcclient.health: subscribe call failed: err="rpc error: code = Unknown desc = No cluster leader" topic=ServiceHeal>
Jun 29 20:18:20 srv2-prod consul[17328]: agent.http: Request error: method=GET url=/v1/health/service/redis?index=1&passing=1&tag=master&wait=30000ms from=172.17.>
Jun 29 20:18:20 srv2-prod consul[17328]: agent.anti_entropy: failed to sync remote state: error="No cluster leader"

node2:

Jun 29 17:20:33 clusterNode3 consul[2238]: agent: (LAN) joined: number_of_nodes=3
Jun 29 17:20:33 clusterNode3 consul[2238]: agent: Join completed. Initial agents synced with: agent_count=3
Jun 29 17:20:33 clusterNode3 consul[2238]: agent: started state syncer
Jun 29 17:20:33 clusterNode3 consul[2238]: agent: Consul agent running!
Jun 29 17:20:36 clusterNode3 consul[2238]: agent: Check socket connection failed: check=_nomad-check-bb4d3ebd68c5282a64fbbe7e52b9b9ca4cc8dd8c error="dial tcp 0.0.0.0:4648: connect: connection refused"
Jun 29 17:20:36 clusterNode3 consul[2238]: agent: Check is now critical: check=_nomad-check-bb4d3ebd68c5282a64fbbe7e52b9b9ca4cc8dd8c
Jun 29 17:20:38 clusterNode3 consul[2238]: agent: Check socket connection failed: check=_nomad-check-eed2ace5fdfef736b944c575e1d06c3b09aba34e error="dial tcp 0.0.0.0:4647: connect: connection refused"
Jun 29 17:20:38 clusterNode3 consul[2238]: agent: Check is now critical: check=_nomad-check-eed2ace5fdfef736b944c575e1d06c3b09aba34e
Jun 29 17:20:40 clusterNode3 consul[2238]: agent.anti_entropy: failed to sync remote state: error="No cluster leader"

I have tried:

  • Increasing/decreasing the number of nodes
  • Enabling/disabling bootstrap
  • Creating peers.json on each node
  • force-leave
  • Manually joining via the CLI (sketched below)
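
For reference, the manual join and force-leave attempts were roughly the standard CLI commands (IPs and the node name here are placeholders):

consul join 10.xxx.xxx.xxx 10.xxx.xxx.xxx 10.xxx.xxx.xxx
consul force-leave <node_name>
consul operator raft list-peers   # to inspect the Raft peer set afterwards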

What else can I try?

There’s a lot going on in these logs, and I wonder whether all of your attempts have left stale state in the data directories that is now causing follow-on complications.

Try wiping all of your Consul data directories and starting from scratch with a 3-node cluster. Then post those logs, including the header Consul prints out when it first starts, before the main log.
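
Roughly, on each of the three servers (assuming Consul runs as a systemd unit named consul and uses the data_dir from your config):

sudo systemctl stop consul     # stop the agent before touching its state
sudo rm -rf /opt/consul/*      # wipe the old Raft/Serf state in data_dir
sudo systemctl start consul    # start again on all three nodes

# once all three are back up, verify membership and leader election
consul members
consul operator raft list-peers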

You can also simplify your config a bit:

  • It’s not really needed or useful to specify both retry_join and start_join. Pick just one. Probably retry_join.

  • You don’t need to specify the port number on each entry there either; 8301 is the default Serf LAN port.

  • You probably don’t need to set bind_addr or advertise_addr either - the defaults are usually fine.
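
Put together, a trimmed-down server config along those lines might look something like this (a sketch only; addresses are placeholders and node_name must be unique per node):

datacenter       = "dc1"
data_dir         = "/opt/consul"
client_addr      = "0.0.0.0"
log_level        = "INFO"
enable_syslog    = true
node_name        = "srv1-prod"   # unique per node
server           = true
bootstrap_expect = 3
retry_join       = ["10.xxx.xxx.xxx", "10.xxx.xxx.xxx", "10.xxx.xxx.xxx"]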

Hello,
Thank you for your reply. I stopped all of the services that had been moved over to Consul, and Consul itself, wiped all data on each node, and then restarted Consul. It worked. But this error shows up very often after any change. Is there any magic approach to fixing the issue besides cleaning up? Or can you at least explain why this error appears?

There is no “magic approach”, just good old traditional debugging.

It’s definitely possible, but it sounded like your cluster had already been through quite a few configuration changes during your investigation, which I feared may have compounded the problems. That seemed likely to be painfully hard to unravel through messages in a discussion forum, requiring a lot of back-and-forth questions.

Given that you introduced the problem as an initial cluster setup, it just wasn’t worth it, so I suggested a clean start in this case.

If you have problems with an existing cluster, start a new topic here, and the community is likely to share different advice depending on the specifics you provide.
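
As a starting point for that kind of debugging, the usual first things to check when you see “No cluster leader” are (standard Consul CLI commands):

consul members                     # do all servers see each other, and are they all alive?
consul operator raft list-peers    # does the Raft configuration contain exactly the expected servers?
consul info                        # the raft section shows the current state, term and last log index
consul monitor -log-level=debug    # stream debug-level logs from the running agent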
