AWS Batch and AWS Step Functions broken by terraform apply

A short simplified description of my setup:

Let’s say I have an SFN state machine with 2 states, A and B. Each of them uses an AWS Batch job definition pinned to a specific revision, like so:

A -> jobA:3
B -> jobB:4

This State machine is being run on a cron schedule and might take a few hours to complete.

When a deployment via CI/CD happens, the following things may occur:

  1. the docker image for jobB changes, thus the container_properties for the job change
  2. terraform will force a new resource, causing jobB:4 to be marked as INACTIVE and a new revision jobB:5 to be submitted
  3. The sfn state machine now is:
   A -> jobA:3
   B -> jobB:5

If an SFN execution is running while this deployment is being made:

  • the SFN definition includes the old revision of jobB
  • when state A finishes, SFN tries to queue jobB:4, which causes an error because that revision is now INACTIVE
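
The sequence above can be reproduced with a toy in-memory model. This is only a sketch: `JobRegistry`, `register`, and `submit` are hypothetical stand-ins and not the AWS Batch API. The only behaviour it assumes is the one described above, namely that a new revision marks the old one INACTIVE and that queuing an INACTIVE revision fails.

```python
class JobRegistry:
    """Toy stand-in for AWS Batch job definitions (hypothetical, not the real API)."""

    def __init__(self):
        # Maps (job name, revision) to "ACTIVE" or "INACTIVE".
        self.status = {}

    def register(self, name, revision):
        # Mimic the forced replacement: every earlier revision goes INACTIVE.
        for key in self.status:
            if key[0] == name:
                self.status[key] = "INACTIVE"
        self.status[(name, revision)] = "ACTIVE"

    def submit(self, name, revision):
        if self.status.get((name, revision)) != "ACTIVE":
            raise RuntimeError(f"{name}:{revision} is not ACTIVE")
        return f"submitted {name}:{revision}"


registry = JobRegistry()
registry.register("jobA", 3)
registry.register("jobB", 4)

# The running execution still holds the definition it started with.
running_definition = [("jobA", 3), ("jobB", 4)]

results = [registry.submit(*running_definition[0])]  # state A runs
registry.register("jobB", 5)                         # deployment lands mid-run
try:
    results.append(registry.submit(*running_definition[1]))  # state B
except RuntimeError as err:
    results.append(f"error: {err}")

print(results)
```

State A succeeds, and state B fails only because the deployment happened between the two states.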

Is there a way of preventing this? I know that terraform’s default behaviour for forcing a new resource is “delete old resource, create a new one”, but in the case of AWS Batch, where you have revisions, it would be nice if there were a way to preserve the old revisions of a job without marking them as inactive.

Run terraform apply to change resources when you want them changed. This sounds like a glib answer, but how can terraform choose behavior at a logical level above resources, that is, at the level of orchestration logic?

I wasn’t saying it should be aware of logical-level resources. I was suggesting that perhaps for resources that have revisions (like AWS Batch) it could have a flag to not automatically de-register a revision when “forcing” a new resource.

All resources may still be in use according to a user’s assessment. For example, if I update the value of ami for an aws_instance, then run terraform apply, this change destroys any existing instance and creates a new one from the new image. Had important processes been running on the instance that I wished to keep running, then I should have waited to apply the change until those processes had finished. A process running in an aws_sfn_state_machine is the same idea.

Terraform plan shows the changes to be applied to the resources to converge the state of those resources to the semantic descriptions expressed in the template. The change to the docker image, in this case, requires a new resource to be created in order to achieve convergence. Terraform reports this constraint in the output of terraform plan. The action of terraform apply converges the state of the resources as planned.
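
On the original question, one possible way around the revision pin, offered as a sketch rather than a recommendation: the AWS Batch SubmitJob API accepts a job definition name or ARN without a revision and then uses the latest ACTIVE revision. If the state machine references jobB that way, a running execution would pick up jobB:5 instead of failing on jobB:4. The helper `without_revision`, the state layout, and the queue name `my-queue` below are hypothetical illustrations. The tradeoff is that state B may then run a newer image than the one that was current when the execution started.

```python
def without_revision(job_definition: str) -> str:
    """Drop a trailing ':<revision>' from a job definition name or ARN."""
    head, sep, tail = job_definition.rpartition(":")
    # Only strip when the last segment is a numeric revision.
    if sep and tail.isdigit():
        return head
    return job_definition


def batch_task_state(job_name: str, job_definition: str, job_queue: str) -> dict:
    """Build a Step Functions Task state that submits a Batch job (sketch)."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": job_name,
            # No revision here, so Batch resolves the latest ACTIVE one.
            "JobDefinition": without_revision(job_definition),
            "JobQueue": job_queue,  # hypothetical queue name
        },
        "End": True,
    }


state_b = batch_task_state("jobB", "jobB:4", "my-queue")
print(state_b["Parameters"]["JobDefinition"])
```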