Say I have a Nomad cluster with 3 servers and 5 clients. We have a leader, but suppose the leader dies, or its host crashes or gets rebooted…
a) What happens to the cluster? If there is no quorum, will it come up? I think I have bootstrap_expect=3.
b) Also, what happens to a scheduled job if the cluster is not in shape to serve requests? Will it not be invoked at all? Is there a way to keep retrying the job until all Nomad servers are up?
Hi @sammy676776, if you want detailed answers, I suggest reading through all of the Nomad Concepts documentation, particularly the part on consensus.
But to answer your questions:
a) If the leader server crashes, the other two servers vote on a new leader between the two of them. If there is no quorum (e.g. 2 of the 3 servers crashed), then no leader can be elected and any operations that require a leader (e.g. submitting a job) will fail. Many production-critical clusters run 5 servers so that, with a quorum of 3, any 2 servers can safely be down at a given time.
b) If you submit a job while there is no leader, the HTTP request will return an error response code and the job will not be accepted. You can keep retrying until a leader is elected and is able to accept the job.
@seth.hoenig Thank you. I will surely go and read those. So in our case we only have 3 servers, and I am assuming that if one goes down the other 2 will keep serving… so that's good. Follow-up on your last comment: how can we make sure that a given job keeps trying until a leader is elected? Is there a stanza in the job that we can specify for this?
how can we make sure that for a given job until a leader is elected it will keep trying ?
Check out the -check-index argument to job run; it can be used in conjunction with job plan to ensure you only actually (re)submit the job once, regardless of retrying during a leader election.
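To make the plan-then-run idea concrete, here is a rough Python sketch. It assumes the text output of nomad job plan contains a line of the form "Job Modify Index: N" (worth confirming against your Nomad version), and the helper names and the example.nomad file name are made up for illustration.

```python
import re
import subprocess


def parse_job_modify_index(plan_output):
    """Pull the modify index out of `nomad job plan` text output.

    Assumption: the output contains a line like "Job Modify Index: 42".
    """
    match = re.search(r"Job Modify Index:\s*(\d+)", plan_output)
    if match is None:
        raise ValueError("no Job Modify Index found in plan output")
    return int(match.group(1))


def build_run_command(job_file, modify_index):
    """Build the `nomad job run` argv guarded by -check-index."""
    return ["nomad", "job", "run", "-check-index", str(modify_index), job_file]


def plan_then_build(job_file):
    """Run `nomad job plan` once and return the guarded run command.

    No check=True here: plan may exit non-zero even on success
    (for example when it would create allocations), so the output
    is parsed instead of trusting the exit code.
    """
    result = subprocess.run(
        ["nomad", "job", "plan", job_file],
        capture_output=True,
        text=True,
    )
    return build_run_command(job_file, parse_job_modify_index(result.stdout))
```

Because the index is captured once, re-running the same command after a failed attempt should be rejected by the server if an earlier attempt actually went through and changed the job.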
As for actually doing the retries, you'll have to manage that yourself.
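A do-it-yourself retry can be as small as the sketch below. It is only an illustration: retry_until_accepted and its parameters are hypothetical names, and submit stands for any callable of yours that returns True once the cluster has accepted the job.

```python
import time


def retry_until_accepted(submit, max_attempts=10, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Call submit() until it returns True, with exponential backoff.

    Returns the number of attempts used. Raises RuntimeError if the
    job is still not accepted after max_attempts tries.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if submit():
            return attempt
        if attempt < max_attempts:
            sleep(delay)                       # wait before the next try
            delay = min(delay * 2, max_delay)  # double the wait, capped
    raise RuntimeError(f"job not accepted after {max_attempts} attempts")
```

In practice submit could wrap a subprocess call to nomad job run and return True on exit code 0. Keep in mind that a non-zero exit can also mean the job file itself is invalid, so an upper bound on attempts is safer than retrying forever.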
Thanks for the explanation, @seth.hoenig
@biddtimate84 I would say you will need at least 2 for quorum. As per the docs:
Servers | Quorum Size | Failure Tolerance |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 2 | 1 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 4 | 2 |
7 | 4 | 3 |
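The numbers in the table follow from the usual majority rule: quorum is floor(N/2) + 1, and failure tolerance is whatever is left over. A small Python sketch (function names are just for illustration) that reproduces the rows:

```python
def quorum_size(servers):
    """Smallest strict majority of the server count."""
    return servers // 2 + 1


def failure_tolerance(servers):
    """How many servers can be lost while a majority remains."""
    return servers - quorum_size(servers)


# Print the same three columns as the table, for 1 to 7 servers.
for n in range(1, 8):
    print(n, quorum_size(n), failure_tolerance(n))
```

This also shows why even server counts buy nothing extra: going from 3 to 4 servers raises the quorum to 3 but leaves the failure tolerance at 1.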