Why do multiple dead jobs and system jobs restart when restarting a Nomad client?

We have multiple batch jobs. These jobs have run and are currently dead and finished. Their allocations may have been garbage collected, but the jobs themselves are still registered.
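
For context, this is roughly how we verify that state. It is a minimal sketch using the official Go API client (github.com/hashicorp/nomad/api); "example-batch" is a placeholder job ID, not a real job from our cluster:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Connect using NOMAD_ADDR / NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "example-batch" is a placeholder job ID.
	job, _, err := client.Jobs().Info("example-batch", nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s stop=%v\n", *job.Status, *job.Stop)

	// An empty list here means the allocations were garbage collected,
	// even though the job itself is still registered.
	allocs, _, err := client.Jobs().Allocations("example-batch", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("allocations still known to the cluster: %d\n", len(allocs))
}
```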

Then we add a new Nomad client, or restart an existing one to refresh its configuration.

Now we observe something odd: multiple jobs restart when we do that, causing Nomad to stall. These jobs do not run on the new node; they have constraints that place them on other nodes and are unrelated to it. Multiple other evaluations and system jobs also seem to be re-triggered.

It seems that the issue happens mostly for dead, non-stopped jobs: a job that has finished but still has "Stop": false in its JSON configuration. I can't confirm this yet; I still have to test it.
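
To test that hypothesis, a sketch like the following (again using the Go API client; nothing here is specific to our setup) can list every job that is dead but was never explicitly stopped:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List every job, then fetch the full spec to read the Stop flag.
	stubs, _, err := client.Jobs().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, stub := range stubs {
		job, _, err := client.Jobs().Info(stub.ID, nil)
		if err != nil {
			log.Fatal(err)
		}
		// Dead but never explicitly stopped: the suspected trigger.
		if *job.Status == "dead" && !*job.Stop {
			fmt.Printf("dead but not stopped: %s (type=%s)\n", *job.ID, *job.Type)
		}
	}
}
```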

While I work on an MCVE to make it easily reproducible: is this expected behavior? How can I investigate it further? Should dead batch jobs be affected by restarting a Nomad node at all?

After some investigation, it looks like the following happens:

  • a job has run and is dead: no reschedules or retry attempts are pending
  • the job has "Stop": false in the JSON spec; the job was never explicitly stopped, it just finished
  • then the client is restarted

This causes the job to restart by itself. What has really helped is making sure that every job is explicitly stopped once it has finished.
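
For reference, this is a minimal sketch of the cleanup we run, under the same assumptions as the snippets above: it deregisters every dead, non-stopped job without purging it, which flips "Stop" to true but keeps the job visible in Nomad. The equivalent can be done by hand with nomad job stop <job>.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	stubs, _, err := client.Jobs().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, stub := range stubs {
		job, _, err := client.Jobs().Info(stub.ID, nil)
		if err != nil {
			log.Fatal(err)
		}
		if *job.Status != "dead" || *job.Stop {
			continue
		}
		// Deregister with purge=false: this marks the job as stopped
		// ("Stop": true) but keeps it in Nomad's state for inspection.
		evalID, _, err := client.Jobs().Deregister(*job.ID, false, nil)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("stopped %s (eval %s)\n", *job.ID, evalID)
	}
}
```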