Nomad not rescheduling system jobs on nodes that previously ran out of disk space

ocharles · July 2, 2022, 9:33am

Hi.

We have 6 nodes, all running Nomad 1.3.1 in client mode. All of these nodes are eligible to run a particular system job, and yesterday all 6 were running the job as expected. Overnight, two of these nodes ran out of disk space and went down. I’ve since fixed this problem (both nodes now have ~90% disk space free), but Nomad isn’t recreating the failed system job allocations. If I go into the UI and look at the topology, Nomad sees these two clients as empty, and the system job only has 4 running. So all of this state is correct, but why is Nomad not reallocating the system job?

seth.hoenig · July 7, 2022, 2:13pm

Hi @ocharles, any chance you can post the job file for one of the system jobs not being rescheduled? It helps to get an idea of what the restart configuration, etc. are doing.

ocharles · July 7, 2022, 9:10pm

Hi @seth.hoenig. Unfortunately I had to just get things up again (resubmitting the job/restarting nomad clients, etc). I also don’t have the job file I was using when this got stuck. Next time it happens, I’ll see if I can provide more information.
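For the next occurrence, the state described above (clients that are up and eligible but have no running allocation of the system job) could be captured with something like the following. This is only a rough sketch: it assumes the agent's HTTP API is reachable at the default local address without an ACL token, it reads the `/v1/nodes` and `/v1/job/<job_id>/allocations` endpoints, and the helper names and the `job_id` argument are made up for illustration. It also ignores any constraints in the job file, so it may list nodes the job was never meant to run on.

```python
import json
import urllib.request

# Assumption: default local agent address, no ACL token required.
NOMAD_ADDR = "http://127.0.0.1:4646"


def fetch(path):
    """GET a Nomad HTTP API path and decode the JSON response body."""
    with urllib.request.urlopen(NOMAD_ADDR + path) as resp:
        return json.load(resp)


def nodes_missing_alloc(nodes, allocs):
    """Return names of ready, eligible nodes with no running allocation.

    `nodes` is the list returned by /v1/nodes and `allocs` the list
    returned by /v1/job/<job_id>/allocations.
    """
    # Node IDs that currently hold a running allocation of the job.
    covered = {a["NodeID"] for a in allocs if a["ClientStatus"] == "running"}
    return sorted(
        n["Name"]
        for n in nodes
        if n["Status"] == "ready"
        and n["SchedulingEligibility"] == "eligible"
        and n["ID"] not in covered
    )


def report(job_id):
    """Hypothetical helper: query the live agent for one job."""
    nodes = fetch("/v1/nodes")
    allocs = fetch("/v1/job/" + job_id + "/allocations")
    return nodes_missing_alloc(nodes, allocs)
```

Calling `report("<job_id>")` against a live cluster and attaching the result, together with the job file, should show whether the two recovered clients are ready and eligible yet still missing the allocation.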

