I’m trying to understand how Nomad deals with jobs being lost when nodes go down.
My setup is one Nomad server and two Nomad clients. I’m running the clients in Docker containers so that I can easily stop and restart them.
I’m particularly interested in “batch” jobs.
I’m running a job that just sleeps for 30 seconds as a test.
If I run the job and then stop the node that it’s running on, Nomad reschedules the job onto the other node. I can see that it marks the first allocation as “lost” and then creates another one, which then succeeds. What I’m trying to do is control that behaviour: I’d like to be able to stop it from happening.
I’ve experimented with the “RestartPolicy” and “ReschedulePolicy”, but so far neither has made any difference.
My scenario is that I’m scheduling what might be quite a long-running task (hours). I don’t want Nomad to automatically restart that job if the node that it’s running on goes down. I just want the job to fail.
Is this possible?
Thanks.
Hi @tomqwpl. Thanks for using Nomad. I’m sorry you are having a hard time getting the result you want. I’ve got a few troubleshooting questions for you:
- Can you post your jobspec?
- Is it safe to assume you tried disabling rescheduling to prevent the allocation being placed elsewhere?
- Is it also safe to assume you’ve set the restart attempts to 0 and mode to fail?
Here’s an example of what I’m talking about.
job "batch-job" {
  datacenters = ["dc1"]
  type        = "batch"

  group "batch-group" {
    count = 1

    restart {
      attempts = 0
      mode     = "fail"
    }

    reschedule {
      attempts  = 0
      unlimited = false
    }

    task "batch-task" {
      driver = "docker"
      ...
    }
  }
}
Hi Derek,
I’m doing it in code rather than constructing a job spec in text, but yes, I’m specifying both of those on the task group:
    RestartPolicy: &nomadapi.RestartPolicy{
        Attempts: intptr(0),
    },
    ReschedulePolicy: &nomadapi.ReschedulePolicy{
        Attempts:  intptr(0),
        Unlimited: boolptr(false),
    },
If there’s a failure of another kind, the job doesn’t restart. It’s only in the case of “lost” allocations that these settings appear to be ignored and the job is restarted anyway.
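For anyone following along, here is a minimal, self-contained sketch of the two policies being discussed. It is an illustration only: the struct types are local stand-ins that mirror just the fields mentioned in this thread, not the real nomadapi types, and the names `TaskGroupPolicies`, `noRetryPolicies` and `ptr` are hypothetical. The generic `ptr` helper stands in for the `intptr` and `boolptr` helpers used in the snippet above (it assumes Go 1.18 or later), and `Mode` is set to "fail" to match the jobspec example, even though the Go snippet above does not set it.

```go
package main

import "fmt"

// RestartPolicy is a local stand-in that mirrors only the restart
// fields discussed in this thread (assumption: not the real API type).
type RestartPolicy struct {
	Attempts *int
	Mode     *string
}

// ReschedulePolicy is a local stand-in that mirrors only the reschedule
// fields discussed in this thread (assumption: not the real API type).
type ReschedulePolicy struct {
	Attempts  *int
	Unlimited *bool
}

// TaskGroupPolicies is a hypothetical container for the two policies.
type TaskGroupPolicies struct {
	Restart    *RestartPolicy
	Reschedule *ReschedulePolicy
}

// ptr returns a pointer to a copy of v. It replaces separate
// intptr/boolptr style helpers with one generic function.
func ptr[T any](v T) *T { return &v }

// noRetryPolicies builds policies equivalent to the jobspec example:
// no in-place restarts and no rescheduling attempts.
func noRetryPolicies() TaskGroupPolicies {
	return TaskGroupPolicies{
		Restart: &RestartPolicy{
			Attempts: ptr(0),
			Mode:     ptr("fail"),
		},
		Reschedule: &ReschedulePolicy{
			Attempts:  ptr(0),
			Unlimited: ptr(false),
		},
	}
}

func main() {
	p := noRetryPolicies()
	fmt.Printf("restart: attempts=%d mode=%s\n", *p.Restart.Attempts, *p.Restart.Mode)
	fmt.Printf("reschedule: attempts=%d unlimited=%t\n", *p.Reschedule.Attempts, *p.Reschedule.Unlimited)
}
```

This only shows how the values line up with the jobspec. It does not by itself change how Nomad treats “lost” allocations, which is the open question here.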