Nomad Cluster question and Job retry

Say I have a Nomad cluster with 3 servers and 5 clients. We have a leader, but what if the leader dies, or its host crashes or gets rebooted …
a) What happens to the cluster? If there is no quorum, will it come up? I think I have bootstrap_expect=3.
b) Also, what happens to a scheduled job if the cluster is not in shape to serve requests? Will it not be invoked at all? Is there a way to keep retrying the job until all Nomad servers are up?

Hi @sammy676776, if you want detailed answers, I suggest reading through all of the Nomad Concepts documentation, particularly the part on consensus.

But to answer your questions,

a) If the leader server crashes, the other two servers vote on a new leader between the two of them. If there is no quorum (e.g. 2 of the 3 servers crashed), then no leader can be elected and any operations that require a leader (e.g. submitting a job) will fail. Many production-critical clusters run 5 servers, so that with a quorum of 3, any 2 servers can safely be down at a given time.

b) If you submit a job while there is no leader, the HTTP request will return an error response code and the job will not be accepted. You can keep retrying until a leader is elected and is able to accept the job.

@seth.hoenig Thank you. I will surely go and read those. So in our case we only have 3 servers, and I am assuming that if one goes down the other 2 will keep serving, so that's good. A follow-up on your last comment: how can we make sure that for a given job until a leader is elected it will keep trying? Is there a stanza in the job that we can specify for this?

how can we make sure that for a given job until a leader is elected it will keep trying ?

Check out the -check-index flag of nomad job run. Used together with nomad job plan, it ensures the job is actually (re)submitted only once, even if you retry during a leader election.

As for actually doing the retries, you'll have to manage that yourself.
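A minimal sketch of such a retry loop, in Python, under a few assumptions: the nomad binary is on the PATH, the modify index was taken from the "Job Modify Index" line that nomad job plan prints, and a non-zero exit code is treated as "not accepted yet". The helper names retry and make_submitter, and the attempt and delay values, are hypothetical and only for illustration.

```python
import subprocess
import time


def retry(submit, attempts=10, delay=5.0, sleep=time.sleep):
    """Call submit() until it returns True or the attempts run out.

    Returns the attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if submit():
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before trying again, e.g. for a leader election
    return None


def make_submitter(job_file, modify_index):
    """Build a callable that runs `nomad job run -check-index` once.

    Assumption: exit code 0 means the servers accepted the job.
    """
    def submit():
        result = subprocess.run(
            ["nomad", "job", "run", "-detach",
             "-check-index", str(modify_index), job_file],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0
    return submit


# Example usage (hypothetical file name and index):
# retry(make_submitter("example.nomad", 0), attempts=20, delay=3.0)
```

Note that a non-zero exit code can also mean the index no longer matches, for example because an earlier attempt did go through. A fuller script would inspect the error output and stop retrying in that case instead of looping until the attempts are exhausted.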

Thanks for the explanation @seth.hoenig

@biddtimate84 I would say you need at least 2 servers for quorum in a 3-server cluster. As per the docs:

Servers  Quorum Size  Failure Tolerance
1        1            0
2        2            0
3        2            1
4        3            1
5        3            2
6        4            2
7        4            3
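
The numbers in this table follow the usual majority rule: quorum is more than half of the servers, and failure tolerance is whatever is left over. A small Python sketch of that arithmetic (the function names are made up for illustration and are not part of Nomad):

```python
def quorum_size(servers):
    """Smallest majority of the given number of servers."""
    return servers // 2 + 1


def failure_tolerance(servers):
    """How many servers can be lost while a majority remains."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, quorum_size(n), failure_tolerance(n))
```

This also shows why even server counts add little: going from 3 to 4 servers raises the quorum to 3 but leaves the failure tolerance at 1.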