Debugging loss of leader

We have a 5-node cluster running 1.7.3 and we’ve just had a loss of leader issue when a single node rebooted (due to a kernel bug). This isn’t the first time we’ve lost quorum when a single node had an issue, but we’ve also performed plenty of in-place upgrades by rebooting a single node at a time and haven’t had any problems.

Does anybody have any tips on how to try to debug why we lost quorum this time even though there was only a single-node failure, and the cluster should be able to tolerate two? The raft logs are fairly verbose but kind of hard to read, at least for me.

Hi @alexiri,

The logs and their timings are probably the most useful thing to look at first. They would allow you to build a timeline of events and see the state of the cluster from each server’s point of view. If you have logs available from the servers, I’d be happy to take a look, as this is not desired behaviour.
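To make the timeline idea concrete, here is a minimal Python sketch that merges log lines from several servers into one ordered list. It assumes each log line starts with an ISO 8601 timestamp followed by a space, and that all servers log in the same timezone form; the server names and function names are made up for illustration, so adjust the parsing to whatever your logs actually look like.

```python
from datetime import datetime


def parse_line(server, line):
    """Return (timestamp, server, message), or None if the line has no leading timestamp."""
    stamp, _, message = line.strip().partition(" ")
    try:
        # Assumption: the first token is an ISO 8601 timestamp.
        # A trailing "Z" is rewritten so older Python versions accept it.
        ts = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    except ValueError:
        return None
    return ts, server, message


def build_timeline(logs_by_server):
    """Merge {server_name: [log lines]} into one list sorted by timestamp."""
    events = []
    for server, lines in logs_by_server.items():
        for line in lines:
            parsed = parse_line(server, line)
            if parsed is not None:
                events.append(parsed)
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(events, key=lambda event: event[0])
```

Printing the result as `timestamp  server  message` gives one interleaved view, which makes it easier to see which server noticed a problem first. Clock skew between servers will distort the ordering, so it is worth checking that their clocks are in sync.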

Thanks,
jrasell and the Nomad team

I have faced loss-of-leader multiple times due to OOM on the servers, and a couple of times due to PEBKAC :man_facepalming:

If you have memory graphs of the servers during the outage, it can help.

Memory consumption has been surprisingly better in 1.7.7 (I would urge you to update the servers as soon as possible).

When the servers are recovering from an outage, the potential leader needs a really large amount of memory before its usage settles back down to a reasonable level.

In your case, a non-leader reboot should NOT cause leader loss.

If the leader rebooted abruptly, this could occur, but ideally should not. A new leader should be chosen within a few seconds.
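For reference, the arithmetic behind that expectation is the usual Raft majority rule: a cluster of N servers needs floor(N/2) + 1 of them to agree. A small Python sketch of it (the function names are just for illustration):

```python
def quorum_size(servers: int) -> int:
    """Smallest number of servers that forms a majority."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many servers can be lost while a majority still remains."""
    return servers - quorum_size(servers)


def has_quorum(servers: int, reachable: int) -> bool:
    """True if the reachable servers still form a majority."""
    return reachable >= quorum_size(servers)
```

For a 5-server cluster this gives a quorum of 3 and a tolerance of 2, so a single reboot leaves 4 reachable servers and quorum should hold. If the leader was lost anyway, it suggests that more than one server was affected in some way at the same time.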

As for the OOM issue on the server leader: rebooting the leader while the cluster is in a stable state brings its memory consumption back down.

I would suggest setting up memory alarms on the servers.

We even hacked up a job that runs only on the servers and alerts directly to Slack for human intervention!
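This is not the job mentioned above, just a rough Python sketch of what such a check could look like on a Linux server. It assumes memory usage is read from `/proc/meminfo`, that alerts go to a Slack incoming webhook (the URL is something you would supply yourself), and that 80% is a sensible threshold; all three are assumptions to adapt.

```python
import json
import urllib.request


def used_percent(meminfo_text):
    """Percentage of memory in use, computed from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the value (kB for memory fields)
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def should_alert(meminfo_text, threshold=80.0):
    """True when usage is at or above the threshold (80% is an arbitrary default)."""
    return used_percent(meminfo_text) >= threshold


def notify_slack(webhook_url, text):
    """POST a plain-text message to a Slack incoming webhook URL."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)
```

Run periodically, the script would read `/proc/meminfo`, call `should_alert` on its contents, and call `notify_slack` with your own webhook URL when it returns true.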

Depending on how you manage the servers (autoscaling group or standalone VMs), I would also recommend putting a regular update policy in place so that the OS and the binary versions are updated periodically.