Failed to sync remote state: error="No cluster leader"

Hello,
I can’t initialize a cluster. I have tried 5, 3, 2, and 1 nodes. Consul only works when running a single node with bootstrap_expect = 1. Config file (the same on all nodes):

datacenter = "dc1"
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
log_level = "INFO"
enable_syslog = true
node_name = "%{node_name}%"
server = true
bootstrap_expect = 3
bind_addr = "0.0.0.0"
advertise_addr = "10.xxx.xxx.xxx"
start_join = [
  "10.xxx.xxx.xxx",
  "10.xxx.xxx.xxx",
  "10.xxx.xxx.xxx"
]
retry_join = [
  "10.xxx.xxx.xxx:8301",
  "10.xxx.xxx.xxx:8301",
  "10.xxx.xxx.xxx:8301"
]

rejoin_after_leave = true

Logs:
node0:

Jun 29 20:21:17 srv1-prod consul[44883]: agent: Started DNS server: address=0.0.0.0:8600 network=udp
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Started DNS server: address=0.0.0.0:8600 network=tcp
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Starting server: address=[::]:8500 network=tcp protocol=http
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Joining cluster
Jun 29 20:21:17 srv1-prod consul[44883]: agent: (LAN) joining: lan_addresses=[10.0.1.4, 10.0.1.5, 10.0.1.9]
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Initiating push/pull sync with:  10.0.1.4:8301
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Stream connection from=10.0.1.4:48710
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server: Existing Raft peers reported by server, disabling bootstrap mode: server=srv2-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server: Adding LAN server: server="srv2-prod (Addr: tcp/10.0.1.5:8300) (DC: dc1)"
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Initiating push/pull sync with:  10.0.1.5:8301
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Initiating push/pull sync with:  10.0.1.9:8301
Jun 29 20:21:17 srv1-prod consul[44883]: agent: (LAN) joined: number_of_nodes=3
Jun 29 20:21:17 srv1-prod consul[44883]: agent: systemd notify failed: error="No socket"
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Join completed. Initial agents synced with: agent_count=3
Jun 29 20:21:17 srv1-prod consul[44883]: agent: started state syncer
Jun 29 20:21:17 srv1-prod consul[44883]: agent: Consul agent running!
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.memberlist.lan: memberlist: Stream connection from=10.0.1.9:53510
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:17 srv1-prod consul[44883]: agent.server.serf.lan: serf: messageJoinType: srv1-prod
Jun 29 20:21:23 srv1-prod consul[44883]: agent: Check missed TTL, is now critical: check=redis@10.0.1.4:6379:replication-status-check
Jun 29 20:21:24 srv1-prod consul[44883]: agent.anti_entropy: failed to sync remote state: error="No cluster leader"
Jun 29 20:21:24 srv1-prod consul[44883]: agent.server.memberlist.wan: memberlist: Stream connection from=10.0.1.5:47670

node1:

Jun 29 20:18:13 srv2-prod consul[17328]: agent: (LAN) joined: number_of_nodes=3
Jun 29 20:18:13 srv2-prod consul[17328]: agent: Join completed. Initial agents synced with: agent_count=3
Jun 29 20:18:13 srv2-prod consul[17328]: agent: started state syncer
Jun 29 20:18:13 srv2-prod consul[17328]: agent: Consul agent running!
Jun 29 20:18:19 srv2-prod consul[17328]: agent: Check missed TTL, is now critical: check=redis@10.0.1.5:6379:replication-status-check
Jun 29 20:18:19 srv2-prod consul[17328]: agent: Check missed TTL, is now critical: check="Resec: slave replication status"
Jun 29 20:18:20 srv2-prod consul[17328]: agent.rpcclient.health: subscribe call failed: err="rpc error: code = Unknown desc = No cluster leader" topic=ServiceHeal>
Jun 29 20:18:20 srv2-prod consul[17328]: agent.http: Request error: method=GET url=/v1/health/service/redis?index=1&passing=1&tag=master&wait=30000ms from=172.17.>
Jun 29 20:18:20 srv2-prod consul[17328]: agent.anti_entropy: failed to sync remote state: error="No cluster leader"

node2:

Jun 29 17:20:33 clusterNode3 consul[2238]: agent: (LAN) joined: number_of_nodes=3
Jun 29 17:20:33 clusterNode3 consul[2238]: agent: Join completed. Initial agents synced with: agent_count=3
Jun 29 17:20:33 clusterNode3 consul[2238]: agent: started state syncer
Jun 29 17:20:33 clusterNode3 consul[2238]: agent: Consul agent running!
Jun 29 17:20:36 clusterNode3 consul[2238]: agent: Check socket connection failed: check=_nomad-check-bb4d3ebd68c5282a64fbbe7e52b9b9ca4cc8dd8c error="dial tcp 0.0.0.0:4648: connect: connection refused"
Jun 29 17:20:36 clusterNode3 consul[2238]: agent: Check is now critical: check=_nomad-check-bb4d3ebd68c5282a64fbbe7e52b9b9ca4cc8dd8c
Jun 29 17:20:38 clusterNode3 consul[2238]: agent: Check socket connection failed: check=_nomad-check-eed2ace5fdfef736b944c575e1d06c3b09aba34e error="dial tcp 0.0.0.0:4647: connect: connection refused"
Jun 29 17:20:38 clusterNode3 consul[2238]: agent: Check is now critical: check=_nomad-check-eed2ace5fdfef736b944c575e1d06c3b09aba34e
Jun 29 17:20:40 clusterNode3 consul[2238]: agent.anti_entropy: failed to sync remote state: error="No cluster leader"

I have tried:

  • Increasing/decreasing the number of nodes
  • Enabling/disabling bootstrap
  • Creating peers.json on each node
  • force-leave
  • Manually joining via the CLI (sketched below)
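
For reference, the manual join and force-leave attempts were roughly the standard CLI commands (IPs and the node name here are placeholders):

consul join 10.xxx.xxx.xxx 10.xxx.xxx.xxx 10.xxx.xxx.xxx
consul force-leave <node_name>
consul operator raft list-peers   # to inspect the Raft peer set afterwards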

What else can I try?

There’s a lot going on in these logs, and I wonder whether all of your attempts have left stale state in the data directories that is now causing follow-on complications.

Try wiping all of your Consul data directories and starting from scratch with a 3-node cluster. Then post those logs, including the header Consul prints out when it first starts, before the main log.
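
Roughly, on each of the three servers (assuming Consul runs as a systemd unit named consul and uses the data_dir from your config):

sudo systemctl stop consul     # stop the agent before touching its state
sudo rm -rf /opt/consul/*      # wipe the old Raft/Serf state in data_dir
sudo systemctl start consul    # start again on all three nodes

# once all three are back up, verify membership and leader election
consul members
consul operator raft list-peers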

You can also simplify your config a bit:

  • It’s not really needed or useful to specify both retry_join and start_join. Pick just one. Probably retry_join.

  • You don’t need to specify the port number on each entry there either; 8301 is the default Serf LAN port.

  • You probably don’t need to set bind_addr or advertise_addr either - the defaults are usually fine.
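
Put together, a trimmed-down server config along those lines might look something like this (a sketch only; addresses are placeholders and node_name must be unique per node):

datacenter       = "dc1"
data_dir         = "/opt/consul"
client_addr      = "0.0.0.0"
log_level        = "INFO"
enable_syslog    = true
node_name        = "srv1-prod"   # unique per node
server           = true
bootstrap_expect = 3
retry_join       = ["10.xxx.xxx.xxx", "10.xxx.xxx.xxx", "10.xxx.xxx.xxx"]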

Hello,
Thank you for your reply. I stopped all of the services that had been moved over to Consul, and Consul itself, wiped all data on each node, and then restarted Consul. It worked. But this error shows up very often after any change. Is there any magic approach to fixing the issue besides cleaning up? Or can you at least explain why this error appears?

There is no “magic approach”, just good old traditional debugging.

It’s definitely possible, but it sounded like your cluster had already been through quite a few configuration changes during your investigation, which I feared may have compounded the problems. That seemed likely to be painfully hard to unravel through messages in a discussion forum, requiring a lot of back-and-forth questions.

Given that you introduced the problem as an initial cluster setup, it just wasn’t worth it, so I suggested a clean start in this case.

If you have problems with an existing cluster, start a new topic here, and the community is likely to share different advice depending on the specifics you provide.
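
As a starting point for that kind of debugging, the usual first things to check when you see “No cluster leader” are (standard Consul CLI commands):

consul members                     # do all servers see each other, and are they all alive?
consul operator raft list-peers    # does the Raft configuration contain exactly the expected servers?
consul info                        # the raft section shows the current state, term and last log index
consul monitor -log-level=debug    # stream debug-level logs from the running agent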
