Dead batch tasks piling up

shantanugadgil · December 20, 2022, 8:28am

Apologies for the open-ended question, but has anyone seen batch tasks piling up recently?

A lot of jobs are stuck in a stop failed state. Running nomad system gc and nomad system reconcile summaries, however many times, has not helped.

A few things in my setup have changed recently, hence asking here:

upgraded agents from 1.4.2 to 1.4.3 (maybe 1.4.3 is the issue)
we switched our “machines which run periodic jobs” from reserved to spot (I doubt that has anything to do with the observed issue.)
we added max_instance_lifetime to the ASG of the “periodic job machines”.

Any pointers and/or fixes would be appreciated!
