I have a cluster of 5 nodes. It had been running for days with “No cluster leader”, but I didn’t notice until I restarted an agent and got errors like:
rpcinsecure error making call: No cluster leader
I stopped all nodes, and started only one with:
consul agent -bootstrap-expect=1 -config-dir=/etc/consul.d/
But I still couldn’t get the node to become leader, so I ended up bootstrapping the whole cluster from scratch (losing all data).
How could this be fixed or what could be the best way to mark a node as a leader?
I am using the latest stable version, 1.13.2. While trying to follow the recovery guide, I also noticed that the raft directory doesn’t contain a raft.json file; instead I have a raft.db.
Just in case this is the configuration I have in all the nodes:
The bootstrap options are solely for creating new clusters. They are not for use on existing clusters, even in recovery situations, and will either do nothing or break things further.
Without logs and the output of consul operator raft list-peers -stale run against the API of each server node, there’s no data to show what happened in this case, so all I can offer is generalities.
There is no way to “mark” a node as leader; leadership always comes from an election.
To understand why an election is not completing when you think it should, the first thing to check is each node’s view of the peer set (the electorate), which you get from consul operator raft list-peers -stale. The -stale flag allows a response from non-leader nodes.
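A minimal sketch of collecting each server's view of the peer set, assuming every server's HTTP API is reachable and, if ACLs are enabled, that a token with operator read permission is exported as CONSUL_HTTP_TOKEN. The function name and the example addresses are placeholders, not anything taken from this thread.

```shell
# Print each server's own view of the Raft peer set.
# Arguments: HTTP API addresses of the server nodes (placeholders in the example).
# Assumes CONSUL_HTTP_TOKEN is set if ACLs are enabled.
list_peers_everywhere() {
  for addr in "$@"; do
    echo "=== view from ${addr} ==="
    # -stale lets a non-leader server answer; -http-addr selects which server to ask.
    consul operator raft list-peers -stale -http-addr="${addr}" \
      || echo "(no answer from ${addr})"
  done
}

# Example (placeholder addresses):
# list_peers_everywhere http://10.0.0.1:8500 http://10.0.0.2:8500 http://10.0.0.3:8500
```

If the outputs disagree about which nodes are in the peer set or which of them are voters, that disagreement is the first thing to investigate.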
Once you have confirmed that all nodes do actually agree which nodes are part of the cluster, you can evaluate whether there’s a way to get enough of these nodes talking to each other to generate a quorum.
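For reference, Raft needs a strict majority of the voting servers to elect a leader. The arithmetic for a given peer set can be sketched as follows (the function names are made up for illustration):

```shell
# Quorum is a strict majority of the voting servers: floor(n / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many voters may be unreachable before elections can no longer succeed.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_size 5        # prints 3
failure_tolerance 5  # prints 2
```

So a lone server that still has 5 voters in its peer set cannot collect the 3 votes it needs.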
If there isn’t, that’s when the recovery method involving using a peers.json file to manually replace the peer set with a user-specified configuration comes in.
I think you must have misread, there’s no mention of raft.json.
But I was getting ACL errors (my token was missing operator = read).
I misread, sorry. I have peers.info, not peers.json.
Any idea how to test this scenario? I created a 3-node cluster, and just by restarting the nodes they automatically find a leader. In the 5-node cluster, all nodes were still responding, including to DNS queries, but all the logs showed “No cluster leader”, along with multiple lines like:
[WARN] agent.server.raft: rejecting vote request since node is not a voter: from=X.X.X.X:8300
You may be experiencing problems similar to this other active topic:
But what we really need to see to say for certain is the consul operator raft list-peers -stale output from each node, to understand what state the cluster is in. That will require finding a suitably permissioned ACL token that already exists on the cluster.
Did you find out what caused the No cluster leader? Regarding recovery in these scenarios, you should follow the peers.json recovery method (linked below).
Why peers.json?
Of the 5 nodes, 4 are now stopped, and they were stopped while the cluster had no leader.
Because of that, the 1 node you are trying to recover still has the other 4 nodes in its Raft peer set, so it will keep requesting votes and failing (as those nodes are down).
The peers.json method lets you define exactly which agents should be part of the Raft peer set. In your case, that is only the one remaining node.
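As a rough sketch of what that looks like for a single surviving server, assuming Raft protocol version 3 (the default on recent Consul versions): the server's ID is read from the node-id file in its data directory, and peers.json is written into the raft directory while the agent is stopped. The function name, the /opt/consul path and the address are placeholders.

```shell
# Write a single-entry peers.json for the one surviving server.
# $1 = Consul data directory, $2 = that server's RPC address (ip:8300).
# Run only while the Consul agent on this server is stopped.
write_single_peer() {
  data_dir=$1
  rpc_addr=$2
  node_id=$(cat "${data_dir}/node-id") || return 1
  cat > "${data_dir}/raft/peers.json" <<EOF
[
  {
    "id": "${node_id}",
    "address": "${rpc_addr}",
    "non_voter": false
  }
]
EOF
}

# Example (placeholder path and address):
# write_single_peer /opt/consul 10.0.0.1:8300
```

The file has to be readable by the user Consul runs as. On the next start the agent should pick it up, replace its peer set with the single entry, and be able to elect itself.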
In case you plan to bring the rest of the agents back into the cluster, ensure you clean the data directory on them and start fresh. They will join the cluster as followers, replicate the leader’s data, and continue to function.
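A hedged sketch of the clean-up step on each of the other servers. The function name is hypothetical, and the data directory path and the systemd unit name in the usage comments are assumptions about your setup. This deletes that server's local state, so run it only on nodes you intend to rejoin as empty followers.

```shell
# Empty a stopped server's data directory so it rejoins with no old Raft state.
# $1 = Consul data directory (placeholder in the example below).
reset_data_dir() {
  data_dir=${1:?data directory required}
  [ -d "${data_dir}" ] || return 1
  # Delete everything inside the directory, including dotfiles, but keep the directory itself.
  find "${data_dir}" -mindepth 1 -delete
}

# Example (assumes a systemd unit named "consul" and a data dir of /opt/consul):
# systemctl stop consul
# reset_data_dir /opt/consul
# systemctl start consul   # then rejoin via retry_join in the config, or: consul join <recovered server address>
```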