Apologies for the open-ended question, but has anyone observed batch tasks piling up recently?
A lot of jobs are stuck in a stop failed state. No amount of `system gc` or `system reconcile summaries` is helping.
A few variables in my setup have changed recently, which is why I'm asking here:
- upgraded agents from `1.4.2` to `1.4.3` (maybe 1.4.3 is the issue)
- we switched our “machines which run periodic jobs” from `reserved` to `spot` (I doubt that has anything to do with the observed issue.)
- we added `max_instance_lifetime` to the ASG of the “periodic job machines”.
Any pointers and/or fixes would be appreciated!