I have created a new 3-node cluster and completed the Consul install. I am constantly getting the following error on all 3 servers:
consul operator raft list-peers -stale
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
However, I am able to see the Consul members:
consul members
Node         Address         Status  Type    Build   Protocol  DC    Partition  Segment
Server1host  x.x.x.x.:8301   alive   server  1.15.1  2         prod  default    <all>
Server2host  x.x.x.x.:8301   alive   server  1.15.1  2         prod  default    <all>
Server3host  x.x.x.x.:8301   alive   server  1.15.1  2         prod  default    <all>
This surprises me, a lot. I’m pretty sure I’ve successfully used the -stale flag to retrieve the Raft peers without a leader in the past. It’s possible this has regressed, which would be really bad, since it’s a critical diagnostic tool in understanding a broken Raft configuration.
Without this information, it’s really difficult to make any useful suggestions. However, it’s possible the current Raft configuration may be logged during startup - I think I remember seeing it there.
Can you restart a Consul server process, collect a few minutes of logs, starting with the startup, and post them here?
Please do not fully obfuscate IP addresses or other node identifiers, as they may be relevant to understanding the problem.
I am not sure I can paste full logs without obfuscating, but can I just knock off the data_dir and start fresh? I can see them join the cluster, but they have trouble electing a leader:
2023-05-26T13:24:54.010-0400 [INFO] agent: Joining cluster...: cluster=LAN
2023-05-26T13:24:54.010-0400 [INFO] agent: (LAN) joining: lan_addresses=["host1", "host2"]
2023-05-26T13:24:54.010-0400 [INFO] agent: started state syncer
2023-05-26T13:24:54.010-0400 [INFO] agent: Consul agent running!
2023-05-26T13:24:54.024-0400 [INFO] agent: (LAN) joined: number_of_nodes=1
2023-05-26T13:24:54.024-0400 [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=1
2023-05-26T13:24:59.412-0400 [WARN] agent.server.raft: no known peers, aborting election
2023-05-26T13:25:01.468-0400 [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="No cluster leader" index=0
2023-05-26T13:25:01.468-0400 [ERROR] agent.server.cert-manager: failed to handle cache update event: error="leaf cert watch returned an error: No cluster leader"
I completely wiped out the DATA_DIR and restarted all 3 servers manually, but I still have the same problem. This is a brand new cluster with settings similar to my other cluster. I am a bit stumped, but could it be a blocked port? I have opened the firewall for all ports.
Thanks @maxb. There are company-specific hostnames and IPs that I cannot disclose, but I am looking for any advice. I completely wiped out the data directory, and “consul members” shows active members; I also had a “client” join without a problem. I can even access the GUI, but in the end it is useless because there is no cluster leader and nothing works.
Is there a certain port or a certain config parameter I should focus on? Any pointers appreciated.
Even this command fails:
consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
The only Raft-relevant messages I found in the log were:
2023-05-26T14:04:03.549-0400 [INFO] agent.server.raft: initial configuration: index=0 servers=[]
2023-05-26T14:04:03.549-0400 [INFO] agent.server.raft: entering follower state: follower="Node at x.x.x.x.:8303 [Follower]" leader-address= leader-id=
2023-05-26T14:04:09.730-0400 [WARN] agent.server.raft: no known peers, aborting election
If you do need to obfuscate, because it’s too hard to talk sense into people imposing requirements, then the way to do it is to replace hostnames and IPs with generic replacements that:
Still look like hostnames/IPs, so they communicate what was replaced
Are consistent: the same hostname/IP always gets the same unique replacement, so that someone reading the obfuscated logs can still tell when the same node is being referenced across multiple lines of logs.
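To make that concrete, here is a minimal Python sketch of consistent obfuscation. It is only an illustration, not a vetted tool: the `Obfuscator` class, the `node<N>.example.com` naming scheme, and the sample hostname are all made up for this example, and it only handles IPv4 addresses plus hostnames you list explicitly. Replacement IPs come from 192.0.2.0/24, which is reserved for documentation.

```python
import re

# Dotted-quad IPv4 addresses; version strings like 1.15.1 do not match.
IPV4 = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"


class Obfuscator:
    """Maps each real IP/hostname to one stable, generic replacement."""

    def __init__(self, hostnames=()):
        self._ips = {}
        self._hosts = {}
        # Hypothetical input: the caller lists the hostnames to hide.
        self._names = set(hostnames)
        # Longest names first, so a full name wins over any shorter prefix.
        ordered = sorted(self._names, key=len, reverse=True)
        alternatives = [re.escape(name) for name in ordered] + [IPV4]
        # One combined pattern: each token is replaced in a single pass.
        self._pattern = re.compile("|".join(alternatives))

    def _replace(self, match):
        real = match.group(0)
        if real in self._names:
            if real not in self._hosts:
                self._hosts[real] = f"node{len(self._hosts) + 1}.example.com"
            return self._hosts[real]
        if real not in self._ips:
            n = len(self._ips) + 1
            if n > 254:
                raise ValueError("more distinct IPs than this sketch supports")
            self._ips[real] = f"192.0.2.{n}"
        return self._ips[real]

    def scrub(self, line):
        """Return the line with every known host/IP consistently replaced."""
        return self._pattern.sub(self._replace, line)


if __name__ == "__main__":
    ob = Obfuscator(hostnames=["consul-a.corp.example"])  # made-up hostname
    sample = [
        'agent: (LAN) joining: lan_addresses=["consul-a.corp.example", "10.1.2.3"]',
        'entering follower state: follower="Node at 10.1.2.3:8300 [Follower]"',
    ]
    for line in sample:
        print(ob.scrub(line))
```

Because the mapping lives in the object, the same real address gets the same placeholder on every line, and ports and the rest of each log line are left untouched. Use a single `Obfuscator` instance for all the logs you post, across all servers.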
Well, yes, it would. Without the -stale option, it by definition tries to reach a cluster leader.
I’m beginning to wonder… has this cluster ever worked?
Could you paste your entire Consul server configuration file, not just the “snippet” you showed earlier?
Have you perhaps not done anything to bootstrap the cluster, either via the configuration file or CLI command?
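For context on that last question: Consul servers are typically bootstrapped either with the `bootstrap_expect` (or `bootstrap`) setting in the configuration file, or with the equivalent `-bootstrap-expect` / `-bootstrap` flag on the `consul agent` command line. As a rough aid, the Python sketch below checks a config file's text for those two settings. The function name `bootstrap_settings` and the sample config are hypothetical (the real config has not been posted), and the regex matching is a crude stand-in for proper HCL/JSON parsing. It also cannot see CLI flags.

```python
import re

# `bootstrap_expect = 3` (HCL) or `"bootstrap_expect": 3` (JSON).
BOOTSTRAP_EXPECT_RE = re.compile(r'"?\bbootstrap_expect"?\s*[=:]\s*(\d+)')
# `bootstrap = true` or `"bootstrap": true`; does not match bootstrap_expect.
BOOTSTRAP_RE = re.compile(r'"?\bbootstrap"?\s*[=:]\s*(true|false)')


def bootstrap_settings(config_text):
    """Report the bootstrap-related settings found in a config file's text.

    Returns a dict with keys "bootstrap_expect" (int or None) and
    "bootstrap" (bool or None). None means the setting was not found.
    """
    # Ignore whole-line comments so commented-out settings are not counted.
    lines = [
        line
        for line in config_text.splitlines()
        if not line.strip().startswith(("#", "//"))
    ]
    text = "\n".join(lines)

    expect = BOOTSTRAP_EXPECT_RE.search(text)
    flag = BOOTSTRAP_RE.search(text)
    return {
        "bootstrap_expect": int(expect.group(1)) if expect else None,
        "bootstrap": (flag.group(1) == "true") if flag else None,
    }


if __name__ == "__main__":
    # Hypothetical server config, loosely based on values seen in this thread.
    sample = '''
    server     = true
    datacenter = "prod"
    retry_join = ["host1", "host2"]
    '''
    print(bootstrap_settings(sample))
```

If both values come back as `None` for every server and no bootstrap flag is passed on the command line, that would be consistent with the `initial configuration: index=0 servers=[]` and `no known peers, aborting election` log lines above, though the full configuration is still needed to be sure.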