Dead batch tasks piling up

Apologies for the open-ended question, but has anyone seen dead batch tasks piling up recently?

A lot of jobs are stuck in the "stop failed" state. Running system gc or system reconcile summaries, however many times, does not help.

A few things in my setup have changed recently, which is why I'm asking here:

  • We upgraded agents from 1.4.2 to 1.4.3 (maybe 1.4.3 is the issue).
  • We switched our “machines which run periodic jobs” from reserved to spot instances (I doubt that has anything to do with the observed issue).
  • We added max_instance_lifetime to the ASG of the “periodic job machines”.

Any pointers or fixes would be much appreciated!