Nomad Cluster question and Job retry

Say I have a Nomad cluster with 3 servers and 5 clients. We have a leader, but suppose the leader dies, or its host crashes or gets rebooted…
a) What happens to the cluster? If there is no quorum, will it come up? I think I have bootstrap_expect=3.
b) Also, what happens to a scheduled job if the cluster is not in shape to serve requests? Won’t it be invoked at all? Is there a way to keep retrying the job until all Nomad servers are up?

Hi @sammy676776 if you want detailed answers I suggest reading through all of the Nomad Concepts documentation, particularly around consensus.

But to answer your questions,

a) If the leader server crashes, the remaining two servers still form a quorum and elect a new leader between them. If there is no quorum (e.g. 2 of the 3 servers have crashed), then no leader can be elected and any operation that requires a leader (e.g. submitting a job) will fail. Many production-critical clusters run 5 servers so that, with a quorum of 3, any 2 servers can safely be down at a given time.

b) If you submit a job while there is no leader, the HTTP request will return an error response code and the job will not be accepted. You can keep retrying until a leader is elected and is able to accept the job.
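
As a rough illustration of the "keep retrying" part, here is a minimal Python sketch of a retry loop with exponential backoff. The function name retry_until_accepted and the submit callable are hypothetical: submit stands for whatever you use to send the job (an HTTP request or a CLI call) and is assumed to return True once the cluster accepts it. None of this is built into Nomad.

```python
import time


def retry_until_accepted(submit, max_attempts=10, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Call submit() until it returns True or the attempts run out.

    submit is a hypothetical zero-argument callable that returns True
    once the cluster has accepted the job and False otherwise (for
    example while no leader is elected).
    Returns the number of the attempt that succeeded.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if submit():
            return attempt
        if attempt < max_attempts:
            # Wait before the next try, doubling the delay up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(f"job not accepted after {max_attempts} attempts")
```

The sleep parameter is only there so the loop can be exercised without real waiting; in normal use the default time.sleep applies.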

@seth.hoenig Thank you. I will surely go and read those. So in our case we only have 3 servers, and I am assuming that if one goes down the other 2 will keep serving, so that's good. A follow-up on your last comment: how can we make sure that, for a given job, it will keep trying until a leader is elected? Is there a stanza in the job spec that we can specify for this?

how can we make sure that, for a given job, it will keep trying until a leader is elected?

Check out the -check-index argument to nomad job run. It can be used in conjunction with nomad job plan to ensure you only actually (re)submit the job once, regardless of retrying during a leader election.

As for actually doing the retries, you’ll have to manage that yourself.
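
One possible way to wire plan and run together from a script is sketched below in Python. It assumes the text output of nomad job plan contains a line of the form "Job Modify Index: N"; verify that against your Nomad version. The helper names parse_modify_index and build_run_command are hypothetical, and the sketch only builds the command rather than executing it.

```python
import re


def parse_modify_index(plan_output):
    """Extract the modify index from `nomad job plan` text output.

    Assumes the output contains a line like "Job Modify Index: 42".
    """
    match = re.search(r"Job Modify Index:\s*(\d+)", plan_output)
    if match is None:
        raise ValueError("no Job Modify Index found in plan output")
    return int(match.group(1))


def build_run_command(job_file, modify_index):
    """Build the argument list for a version-checked job submission."""
    return ["nomad", "job", "run", "-check-index", str(modify_index), job_file]
```

The resulting list can be passed to subprocess.run inside your own retry loop. If an earlier attempt did get registered, the job's modify index will have changed and the checked submission is rejected instead of being applied again.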

Thanks for the explanation, @seth.hoenig.

@biddtimate84 I would say you will need at least 2 servers for quorum in a 3-server cluster. As per the docs:

Servers   Quorum Size   Failure Tolerance
1         1             0
2         2             0
3         2             1
4         3             1
5         3             2
6         4             2
7         4             3
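
The numbers in this table follow from simple majority arithmetic: quorum is floor(N/2) + 1, and failure tolerance is whatever is left over. A small Python sketch (the function names are just for illustration) reproduces the rows:

```python
def quorum_size(servers):
    """Smallest majority of a cluster with the given number of servers."""
    return servers // 2 + 1


def failure_tolerance(servers):
    """How many servers can be lost while a majority is still reachable."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, quorum_size(n), failure_tolerance(n))
```

This also shows why even server counts add little: going from 3 to 4 servers raises the quorum to 3 but leaves the failure tolerance at 1.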

In a Nomad cluster, job retry refers to the mechanism by which Nomad automatically retries failed tasks.
When a task within a job allocation fails, for reasons such as network issues,
Nomad applies the retry policies defined in the job specification: the restart stanza controls restarts on the same client, and the reschedule stanza controls moving the allocation to another client. These include parameters such as the number of attempts, the interval, the delay between attempts, and (for reschedule) the backoff strategy.