Apologies for the open-ended question, but has anyone observed batch tasks piling up in the recent past?
A lot of jobs are stuck in the `stop failed` state. No amount of `system gc` or `system reconcile summaries` is helping.
There are a couple of variables in my setup which have changed in the recent past, hence asking here:
- upgraded agents from `1.4.2` to `1.4.3` (maybe 1.4.3 is the issue)
- we switched our “machines which run periodic jobs” from `reserved` to `spot` (I doubt that has anything to do with the observed issue.)
- we added `max_instance_lifetime` to the ASG of the “periodic job machines”.
Any pointers and/or fixes would be appreciated!