Understanding job restart behaviour on lost jobs

tomqwpl · May 11, 2022, 10:44am

I’m trying to understand how Nomad deals with jobs being lost when nodes go down.
My setup is one Nomad server and two Nomad clients. I’m running the clients in Docker containers so that I can easily stop and restart them.
I’m particularly interested in “batch” jobs.

I’m running a job that just sleeps for 30 seconds as a test.
If I run the job, then stop the node that it’s running on, Nomad will reschedule that job onto the other node. I can see that it marks the first allocation as “lost” and then creates another one, which then succeeds. What I’m trying to do is control that behaviour; I’d like to be able to stop it happening.
I’ve experimented with the “RestartPolicy” and “ReschedulePolicy”, but so far neither has made any difference.
My scenario is that I’m scheduling what might be quite a long-running task (hours). I don’t really want Nomad to automatically restart that job if the node that it’s running on goes down. I just want the job to fail.

Is this possible?
Thanks.

DerekStrickland · May 12, 2022, 2:33pm

Hi @tomqwpl. Thanks for using Nomad. I’m sorry you are having a hard time getting the result you want. I’ve got a couple of troubleshooting questions for you.

Can you post your jobspec?
Is it safe to assume you tried disabling rescheduling to prevent the allocation being placed elsewhere?
Is it also safe to assume you’ve set the restart attempts to 0 and mode to fail?

Here’s an example of what I’m talking about.

job "batch-job" {
  datacenters = ["dc1"]
  type = "batch"

  group "batch-group" {
    count = 1

    restart {
      attempts = 0
      mode    = "fail"
    }

    reschedule {
      attempts  = 0
      unlimited = false
    }

    task "batch-task" {
      driver = "docker"
      ...
    }
  }
}

tomqwpl · May 12, 2022, 3:30pm

Hi Derek,
I’m doing it in code rather than constructing a job spec in text, but yes, I’m specifying both of those on the task group:

				RestartPolicy: &nomadapi.RestartPolicy{
					Attempts: intptr(0),
				},
				ReschedulePolicy: &nomadapi.ReschedulePolicy{
					Attempts:  intptr(0),
					Unlimited: boolptr(false),
				},

With any other kind of failure, the job doesn’t restart. It’s only in the case of “lost” jobs that these settings appear to be ignored and the job is restarted anyway.
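
For anyone wanting to reproduce the configuration, the fragment above can be written out as a minimal, self-contained sketch. Everything beyond the field names in the fragment is an assumption: the struct types below are local stand-ins that mirror only the fields discussed in this thread (real code would use the types from the package imported as nomadapi), `intptr` and `boolptr` are assumed to be small pointer helpers, `stringptr` and `noRetryGroup` are hypothetical names, and the restart mode is set to "fail" explicitly as in the jobspec example. This only makes the settings explicit; it does not change the behaviour for “lost” jobs described above.

```go
package main

import "fmt"

// RestartPolicy is a local stand-in mirroring only the fields used in
// this thread; real code would use the type from the nomadapi package.
type RestartPolicy struct {
	Attempts *int
	Mode     *string
}

// ReschedulePolicy is a local stand-in, trimmed to the two fields
// that appear in the fragment above.
type ReschedulePolicy struct {
	Attempts  *int
	Unlimited *bool
}

// TaskGroup is a hypothetical, trimmed-down stand-in for the task group.
type TaskGroup struct {
	Name             *string
	RestartPolicy    *RestartPolicy
	ReschedulePolicy *ReschedulePolicy
}

// Pointer helpers; intptr and boolptr are assumed to look like this.
func intptr(i int) *int          { return &i }
func boolptr(b bool) *bool       { return &b }
func stringptr(s string) *string { return &s }

// noRetryGroup builds a group with local restarts and rescheduling
// both disabled, matching the jobspec example earlier in the thread.
func noRetryGroup(name string) *TaskGroup {
	return &TaskGroup{
		Name: stringptr(name),
		RestartPolicy: &RestartPolicy{
			Attempts: intptr(0),
			Mode:     stringptr("fail"), // no in-place restarts
		},
		ReschedulePolicy: &ReschedulePolicy{
			Attempts:  intptr(0),
			Unlimited: boolptr(false), // no placement on another node
		},
	}
}

func main() {
	g := noRetryGroup("batch-group")
	fmt.Printf("restart: attempts=%d mode=%s\n",
		*g.RestartPolicy.Attempts, *g.RestartPolicy.Mode)
	fmt.Printf("reschedule: attempts=%d unlimited=%t\n",
		*g.ReschedulePolicy.Attempts, *g.ReschedulePolicy.Unlimited)
}
```

Printing the policies before submitting the job is a quick way to confirm that the values sent are the ones intended, which helps rule out a construction mistake before looking at scheduler behaviour.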
