Why do multiple dead jobs and system jobs restart when restarting a Nomad client?

We have multiple batch jobs. These jobs have run and are currently dead and finished. Their allocations may have been garbage collected, but the jobs themselves are still registered.
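
For context, this is roughly how we verify that state. It is a minimal sketch using the official Go API client (github.com/hashicorp/nomad/api); "example-batch" is a placeholder job ID, not a real job from our cluster:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Connect using NOMAD_ADDR / NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "example-batch" is a placeholder job ID.
	job, _, err := client.Jobs().Info("example-batch", nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s stop=%v\n", *job.Status, *job.Stop)

	// An empty list here means the allocations were garbage collected,
	// even though the job itself is still registered.
	allocs, _, err := client.Jobs().Allocations("example-batch", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("allocations still known to the cluster: %d\n", len(allocs))
}
```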

Then we add a new Nomad client, or restart an existing one to refresh its configuration.

Now we observe something odd: multiple jobs restart when we do that, causing Nomad to stall. These jobs do not run on the new node; they have constraints that place them on other nodes and are unrelated to it. Multiple other evaluations and system jobs also seem to be re-triggered.

It seems that the issue happens mostly for dead, non-stopped jobs: a job that has finished but still has "Stop": false in its JSON configuration. I can't confirm this yet; I still have to test it.
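
To test that hypothesis, a sketch like the following (again using the Go API client; nothing here is specific to our setup) can list every job that is dead but was never explicitly stopped:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List every job, then fetch the full spec to read the Stop flag.
	stubs, _, err := client.Jobs().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, stub := range stubs {
		job, _, err := client.Jobs().Info(stub.ID, nil)
		if err != nil {
			log.Fatal(err)
		}
		// Dead but never explicitly stopped: the suspected trigger.
		if *job.Status == "dead" && !*job.Stop {
			fmt.Printf("dead but not stopped: %s (type=%s)\n", *job.ID, *job.Type)
		}
	}
}
```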

While I work on an MCVE to make it easily reproducible: is this expected behavior? How can I investigate it further? Should dead batch jobs be affected by restarting a Nomad node at all?

After some investigation, it looks like the following happens:

  • a job has run and is dead: no reschedules or retry attempts are pending
  • the job has "Stop": false in the JSON spec; the job was never explicitly stopped, it just finished
  • then the client is restarted

This causes the job to restart by itself. What has really helped is making sure that every job is explicitly stopped once it has finished.
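
For reference, this is a minimal sketch of the cleanup we run, under the same assumptions as the snippets above: it deregisters every dead, non-stopped job without purging it, which flips "Stop" to true but keeps the job visible in Nomad. The equivalent can be done by hand with nomad job stop <job>.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	stubs, _, err := client.Jobs().List(nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, stub := range stubs {
		job, _, err := client.Jobs().Info(stub.ID, nil)
		if err != nil {
			log.Fatal(err)
		}
		if *job.Status != "dead" || *job.Stop {
			continue
		}
		// Deregister with purge=false: this marks the job as stopped
		// ("Stop": true) but keeps it in Nomad's state for inspection.
		evalID, _, err := client.Jobs().Deregister(*job.ID, false, nil)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("stopped %s (eval %s)\n", *job.ID, evalID)
	}
}
```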