Nomad not rescheduling system jobs on nodes that previously ran out of disk space


We have 6 nodes, all running Nomad 1.3.1 in client mode. All of these nodes are eligible to run a particular system job, and yesterday all 6 were running the job as expected. Overnight, two of these nodes ran out of disk space and went down. I've since fixed the problem (both nodes now have ~90% free disk space), but Nomad isn't recreating the failed system job allocations. If I go into the UI and look at the topology, Nomad sees these two clients as empty, and the system job shows only 4 running allocations. So all of this state is correct - but why is Nomad not reallocating the system job?

Hi @ocharles, any chance you can post the job file for one of the system jobs not being rescheduled? It helps to get an idea of what the restart configuration, etc. are doing.
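For context, the `restart` and `reschedule` stanzas are the parts of the job spec that govern this behaviour. A minimal sketch of where they sit in a system job (the job and task names here are hypothetical, and these attribute values are just Nomad's documented options, not the poster's actual configuration):

```hcl
job "example-agent" {
  # "system" jobs run one allocation on every eligible client node
  type = "system"

  group "agent" {
    # restart controls how failed tasks are retried on the SAME node
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"   # "fail" stops retrying after `attempts`; "delay" keeps retrying
    }

    task "agent" {
      driver = "docker"
      config {
        image = "example/agent:latest"  # placeholder image
      }
    }
  }
}
```

Note that system jobs do not support the `reschedule` stanza, so once the restart attempts are exhausted with `mode = "fail"`, the allocation stays failed on that node until the job is re-evaluated.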

Hi @seth.hoenig. Unfortunately I just had to get things running again (resubmitting the job, restarting the Nomad clients, etc.). I also no longer have the job file I was using when this got stuck. Next time it happens, I'll see if I can provide more information.