We have multiple batch jobs. These jobs have run to completion and are now dead. Their allocations may have been garbage collected, but the jobs themselves are still there.
Then we add a new Nomad client, or restart an existing Nomad client to refresh its configuration.
Now we observe an odd thing: multiple jobs restart when we do that, causing Nomad to stall. These jobs do not run on the new node; they have constraints that place them on other nodes, so they are unrelated to it. Multiple other evaluations and system jobs also seem to restart.
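To back this observation up with data, one thing I could do is dump the evaluations right after the client change and group them by what triggered them. This is only a minimal sketch: it assumes the agent's HTTP API is reachable at `http://127.0.0.1:4646` without an ACL token, and that each entry returned by `/v1/evaluations` carries `TriggeredBy` and `JobID` fields. The function names are my own, not part of Nomad.

```python
import json
import urllib.request
from collections import Counter


def count_evals_by_trigger(evals):
    """Count evaluations per (TriggeredBy, JobID) pair."""
    return Counter((e.get("TriggeredBy"), e.get("JobID")) for e in evals)


def fetch_evaluations(addr="http://127.0.0.1:4646"):
    """Fetch the evaluation list from a Nomad agent.

    The address is an assumption; no ACL token or namespace is sent.
    """
    with urllib.request.urlopen(f"{addr}/v1/evaluations") as resp:
        return json.load(resp)
```

Calling `count_evals_by_trigger(fetch_evaluations()).most_common()` against a live agent should show which jobs received the most evaluations and why. I would expect, but have not verified, that evaluations caused by the client change appear with a node-related trigger such as `node-update`.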
It seems that the issue mostly affects dead, non-stopped jobs: jobs that have finished but still have “stop”: “false” in the JSON configuration. I can’t confirm this yet; I still have to test it.
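To test that suspicion, the affected set could be listed before and after restarting a client. The sketch below is hedged: it assumes the agent's HTTP API is reachable at `http://127.0.0.1:4646` without an ACL token, and that entries returned by `/v1/jobs` expose `ID`, `Type`, `Status`, and `Stop` fields. The function names are hypothetical helpers, not Nomad APIs.

```python
import json
import urllib.request


def dead_unstopped_batch_jobs(jobs):
    """Return the IDs of batch jobs that are dead but were never stopped."""
    return [
        job["ID"]
        for job in jobs
        if job.get("Type") == "batch"
        and job.get("Status") == "dead"
        and not job.get("Stop", False)
    ]


def fetch_jobs(addr="http://127.0.0.1:4646"):
    """Fetch the job list from a Nomad agent.

    The address is an assumption; no ACL token or namespace is sent.
    """
    with urllib.request.urlopen(f"{addr}/v1/jobs") as resp:
        return json.load(resp)
```

Comparing `dead_unstopped_batch_jobs(fetch_jobs())` with the list of jobs that actually restarted would show whether the two sets line up.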
While I work on an MCVE to make this easily reproducible: is this expected behavior? How can I investigate it further? Should dead batch jobs be affected by restarting a Nomad node?