Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

Building a consul cluster via ansible.

  • Host OS - Ubuntu 22.04 LTS
  • FirewallD - off
  • UFW has the ports required for Consul opened (I also disabled UFW entirely; the same issue remains) - see the quick port check below
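
For reference, this is roughly how I sanity-check that the standard Consul ports are reachable between the nodes (8300 server RPC, 8301 Serf LAN, 8500 HTTP API); adjust the target IP per node:

# on each server, confirm consul is actually listening
sudo ss -lntup | grep consul

# from one of the other servers, confirm the raft/serf ports answer
nc -zv 192.168.90.137 8300
nc -zv 192.168.90.137 8301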

Consul Members
Node        Address              Status  Type    Build   Protocol  DC     Partition  Segment
PG1-UBUNTU  192.168.90.137:8301  alive   server  1.15.3  2         james  default
PG2-UBUNTU  192.168.90.123:8301  alive   server  1.15.3  2         james  default
PG3-UBUNTU  192.168.90.125:8301  alive   server  1.15.3  2         james  default

Server.json file
{
  "bootstrap_expect": 3,
  "datacenter": "james",
  "server": true,
  "data_dir": "/db/consul/data",
  "encrypt": "kv5HaiAGTcZsVARv20NY9+ughWyJTF2p9jlaeak8iT4=",
  "log_level": "ERR",
  "log_file": "/var/log/consul/consul.log",
  "ui": true,
  "bind_addr": "192.168.90.137",
  "retry_join": ["192.168.90.137", "192.168.90.123", "192.168.90.125"]
}
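
(A quick syntax check of this file can be done with Consul's built-in validator; the path below is the one from my unit file:)

consul validate /db/consul/server/server.json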

Consul service status and unit file
● consul.service - Consul Service
Loaded: loaded (/lib/systemd/system/consul.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2023-07-14 17:49:28 UTC; 14min ago
Docs: https://www.consul.io/
Main PID: 885384 (consul)
Tasks: 7 (limit: 1177)
Memory: 30.1M
CPU: 5.244s
CGroup: /system.slice/consul.service
└─885384 /usr/bin/consul agent --config-file /db/consul/server/server.json --bind 127.0.0.1 --bind 192.168.90.137


[Unit]
Description=Consul Service
Documentation=https://www.consul.io/
DefaultDependencies=no
After=network.target

[Service]
User=consul
Type=simple
RemainAfterExit=yes
ExecStart=/usr/bin/consul agent --config-file /db/consul/server/server.json --bind 127.0.0.1 --bind 192.168.90.137
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target
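
(Side note: the ExecStart line passes -bind twice - 127.0.0.1 and 192.168.90.137 - on top of the bind_addr already set in server.json. I have not verified which value Consul ends up using when the flag is repeated, so a cleaner sketch is to drop the command-line flags and let the JSON file own the setting:)

ExecStart=/usr/bin/consul agent --config-file /db/consul/server/server.json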

Consul Service Status -
Jul 14 18:02:47 PG1-UBUNTU consul[885384]: 2023-07-14T18:02:47.375Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
Jul 14 18:02:48 PG1-UBUNTU consul[885384]: 2023-07-14T18:02:48.530Z [ERROR] agent: Coordinate update error: error="No cluster leader"

consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

consul operator raft list-peers -stale
no output…

Found the issue - but I am very confused. One of the three nodes is causing the problem: it is looking at another Consul server entirely and treating it as the leader.

consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 192.168.200.40:8300
        server = true
raft:
        applied_index = 5460264
        commit_index = 5460264
        fsm_pending = 0
        last_contact = 38.753194ms
        last_log_index = 5460264
        last_log_term = 6353234
        last_snapshot_index = 5456541
        last_snapshot_term = 6349339
        latest_configuration = [{Suffrage:Voter ID:8a0e9f33-f9b0-687e-13fb-1907aefef646 Address:192.168.90.125:8300} {Suffrage:Voter ID:62e603eb-9ed1-8969-2e0e-210d759f9387 Address:192.168.200.40:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 6353234

192.168.200.40 is not part of the cluster, yet this node seems to think it could be.
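
(For completeness: the documented escape hatch for a server stuck with a bad Raft configuration is the peers.json recovery procedure - stop Consul on every server, write raft/peers.json under the data dir listing only the real members, then start them again. A minimal sketch for this cluster, assuming raft protocol 3, with the IDs as placeholders to be read from each server's own startup logs or the node-id file under its data_dir:)

[
  { "id": "<PG1 node-id>", "address": "192.168.90.137:8300", "non_voter": false },
  { "id": "<PG2 node-id>", "address": "192.168.90.123:8300", "non_voter": false },
  { "id": "<PG3 node-id>", "address": "192.168.90.125:8300", "non_voter": false }
]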

Jul 15 16:00:19 PG3-UBUNTU consul[31171]: 2023-07-15T16:00:19.639Z [ERROR] agent.anti_entropy: failed to sync remote state: error="Raft leader not found in server lookup mapping"

How is this possible, and how can I stop it from happening?

I am unfamiliar with the format of the information you have shown above.

In order to understand the current status of your cluster, it would be helpful if you could share the current Raft configuration from each of your nodes.

Normally this would be obtainable via consul operator raft list-peers -stale -http-addr 192.168.x.y:8500, directing the query to each of your nodes in turn.
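
A quick loop over the three addresses, purely as a sketch and assuming the HTTP API on each node is reachable on port 8500:

for ip in 192.168.90.137 192.168.90.123 192.168.90.125; do
  echo "== $ip =="
  consul operator raft list-peers -stale -http-addr="http://$ip:8500"
done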

If you are unable to get it that way, it is also logged during server startup, in a log line mentioning [INFO] agent.server.raft: initial configuration:

Understanding how your cluster got into its current state may not be possible unless you are willing to share historical logs, that show the time the problem node began to interact with an unexpected host.

@maxb thanks for the reply. All nodes report

Error getting peers: Failed to retrieve raft configuration: Get "http://192.168.90.123:8500/v1/operator/raft/configuration?stale=": dial tcp 192.168.90.123:8500: connect: connection refused
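
(That connection refused on 8500 would be expected if the HTTP API is still bound to its default of 127.0.0.1; reaching it from the other nodes would need something like the line below in server.json, with the caveat that the API is then unauthenticated on that interface:)

"client_addr": "0.0.0.0"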

Looking in the Consul logs under /var/log/consul turned up:

Node PG3 (problem node)
2023-07-15T18:03:47.037Z [INFO] agent.server.raft: initial configuration: index=0 servers=
2023-07-15T18:03:47.042Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.90.125:8300 [Follower]" leader-address= leader-id=
2023-07-15T18:03:48.249Z [DEBUG] agent.server.raft: accepted connection: local-address=192.168.90.125:8300 remote-address=192.168.200.40:53793
2023-07-15T18:03:48.250Z [DEBUG] agent.server.raft: lost leadership because received a requestVote with a newer term
2023-07-15T18:03:48.254Z [WARN] agent.server.raft: failed to get previous log: previous-index=5460719 last-index=1 error="log not found"
2023-07-15T18:03:48.255Z [INFO] agent.server.raft.snapshot: creating new snapshot: path=/db/consul/data/raft/snapshots/6349339-5456541-1689444228255.tmp
2023-07-15T18:03:48.255Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=8a0e9f33-f9b0-687e-13fb-1907aefef646 fallback=192.168.90.125:8300 error="Could not find address for server id 8a0e9f33-f9b0-687e-13fb-1907aefef646"
2023-07-15T18:03:48.255Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=62e603eb-9ed1-8969-2e0e-210d759f9387 fallback=192.168.200.40:8300 error="Could not find address for server id 62e603eb-9ed1-8969-2e0e-210d759f9387"
2023-07-15T18:03:48.259Z [INFO] agent.server.raft: snapshot network transfer progress: read-bytes=18757 percent-complete="100.00%"
2023-07-15T18:03:48.265Z [INFO] agent.server.raft: copied to local snapshot: bytes=18757
2023-07-15T18:03:48.271Z [INFO] agent.server.raft: snapshot restore progress: id=6349339-5456541-1689444228255 last-index=5456541 last-term=6349339 size-in-bytes=18757 read-bytes=18757 percent-complete="100.00%"
2023-07-15T18:03:48.272Z [INFO] agent.server.raft: Installed remote snapshot
2023-07-15T18:03:48.273Z [DEBUG] agent.server.raft: accepted connection: local-address=192.168.90.125:8300 remote-address=192.168.200.40:58637
2023-07-15T18:03:48.429Z [DEBUG] agent.server.raft: accepted connection: local-address=192.168.90.125:8300 remote-address=192.168.200.40:60283

Logs from PG1 Node
PG1-UBUNTU:~$ sudo cat /var/log/consul/*.log | grep agent.server.raft
2023-07-15T18:05:36.792Z [INFO] agent.server.raft: initial configuration: index=0 servers=
2023-07-15T18:05:36.793Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.90.137:8300 [Follower]" leader-address= leader-id=
2023-07-15T18:05:43.309Z [WARN] agent.server.raft: no known peers, aborting election

2023-07-15T18:22:14.480Z [DEBUG] agent.server.raft: accepted connection: local-address=192.168.90.125:8300 remote-address=192.168.200.40:50945
2023-07-15T18:22:14.480Z [DEBUG] agent.server.raft: lost leadership because received a requestVote with a newer term

Created a new encrypt key and ran the build again, and the error seems to be cleared. I was not using an encrypt key from the other Consul, but it appears that if you build the cluster multiple times with the same key, this happens.
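
(For anyone hitting the same thing: the fix on my side was simply generating a fresh gossip key per cluster instead of reusing one across builds - presumably a shared key is what allowed the unrelated 192.168.200.40 server to gossip its way into this cluster. Generating the key is just:)

# generate a fresh 32-byte gossip encryption key for this cluster
consul keygen

# paste the output into the "encrypt" field of server.json on all three
# nodes before re-running the Ansible build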