Debugging loss of leader

alexiri · May 23, 2024, 3:34pm

We have a 5-node cluster running 1.7.3 and we’ve just had a loss of leader issue when a single node rebooted (due to a kernel bug). This isn’t the first time we’ve lost quorum when a single node had an issue, but we’ve also performed plenty of in-place upgrades by rebooting a single node at a time and haven’t had any problems.

Does anybody have any tips on how to try to debug why we lost quorum this time even though there was only a single-node failure, and the cluster should be able to tolerate two? The raft logs are fairly verbose but kind of hard to read, at least for me.
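
For reference, the "should be able to tolerate two" expectation comes from the usual Raft majority rule. A minimal sketch of that arithmetic (the function names are just illustrative, not anything from Nomad):

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of a cluster with `servers` voting members."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many voting members can be lost while a majority still remains."""
    return servers - quorum_size(servers)


for n in (3, 5, 7):
    print(f"{n} servers: quorum={quorum_size(n)}, tolerates={failure_tolerance(n)}")
```

So with 5 servers the quorum is 3, and losing a single node should leave 4 voters, well above that.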

jrasell · May 29, 2024, 9:26am

Hi @alexiri,

The logs and their timings are probably the most useful things to look at first. They would allow you to build a timeline of events and see the state of the cluster from each server’s point of view. If you have logs available from the servers, I’d be happy to take a look, as this is not desired behaviour.
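
One way to build such a timeline is to merge the per-server logs into a single stream ordered by timestamp. The sketch below is only a rough illustration: it assumes each log line starts with a timestamp like `2024-05-23T15:30:02.100Z`, that the server clocks are reasonably in sync, and that filtering on the substring `raft` is good enough to pick out the relevant lines. The server names and function names are hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout at the start of each log line; adjust to your logs.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_line(server, line):
    """Return (timestamp, server, message), or None if the line has no leading timestamp."""
    parts = line.split(maxsplit=1)
    if not parts:
        return None
    try:
        ts = datetime.strptime(parts[0], TS_FORMAT)
    except ValueError:
        return None
    message = parts[1].strip() if len(parts) > 1 else ""
    return ts, server, message


def build_timeline(logs_by_server, keyword="raft"):
    """Merge the lines from all servers that contain `keyword`, ordered by timestamp."""
    events = []
    for server, lines in logs_by_server.items():
        for line in lines:
            parsed = parse_line(server, line)
            if parsed is not None and keyword in parsed[2]:
                events.append(parsed)
    return sorted(events, key=lambda event: event[0])


def print_timeline(logs_by_server):
    """Print one merged line per event: timestamp, server name, message."""
    for ts, server, message in build_timeline(logs_by_server):
        print(ts.isoformat(), server, message)
```

Feeding it something like `{"server-1": open("server-1.log").read().splitlines(), ...}` gives one interleaved view, which makes it easier to see which server noticed the failure first and when elections started.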

Thanks,
jrasell and the Nomad team

shantanugadgil · May 31, 2024, 6:38am

I have faced loss-of-leader multiple times due to OOM on the servers, and a couple of times due to PEBKAC.

If you have memory graphs of the servers during the outage, it can help.

Memory consumption has been surprisingly better in 1.7.7 (I would urge you to update the servers as soon as possible).

When the servers are recovering from an outage, the potential leader needs a really large amount of memory before its usage settles down to a reasonable level.

In your case, a non-leader reboot should NOT cause leader loss.

If the leader rebooted abruptly, this could occur, but ideally should not. A new leader should be chosen within a few seconds.

As for the OOM issue on the server leader, rebooting the leader while the cluster is in a stable state brings its memory consumption back down.

I would suggest setting up memory alarms on the servers.

We even hacked up a job that runs only on the servers and alerts directly to Slack for human intervention!
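
A rough sketch of what such a memory check could look like on a Linux server is below. This is not the actual job mentioned above. It assumes `/proc/meminfo` is available and exposes `MemTotal` and `MemAvailable`, and the 85% threshold and function names are made up for illustration. Sending the alert (to Slack or anywhere else) is left out.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_fraction(text):
    """Fraction of memory in use, computed from MemTotal and MemAvailable."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return (total - available) / total


def should_alert(text, threshold=0.85):
    """True when memory usage is at or above the (assumed) alert threshold."""
    return memory_used_fraction(text) >= threshold


def read_meminfo(path="/proc/meminfo"):
    """Read the raw meminfo text from the local machine (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Running `should_alert(read_meminfo())` on a schedule on each server, and posting a message whenever it returns `True`, would be one simple way to get a heads-up before the OOM killer does.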

Depending on how you manage servers (autoscaling group or standalone VMs), I would also recommend putting a regular update policy in place so that the OS and the binary versions are updated periodically.
