AWS Batch and AWS Step Functions broken by terraform apply

Link512 · October 8, 2019, 10:33am

A short simplified description of my setup:

Let’s say I have an SFN state machine with 2 states, A and B. Each of them uses an AWS Batch job definition at a certain revision, like so:

A -> jobA:3
B -> jobB:4

This state machine runs on a cron schedule and might take a few hours to complete.

When a deployment via CI/CD happens, the following things may occur:

The Docker image for jobB changes, so the container_properties for the job definition change.
Terraform forces a new resource, causing jobB:4 to be marked as INACTIVE and a new revision, jobB:5, to be registered.
The SFN state machine is now:

   A -> jobA:3
   B -> jobB:5

If an SFN execution is running while this deployment is being made:

The running execution's definition still includes the old revision of jobB.
When state A finishes, SFN tries to queue jobB:4, causing an error.
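
To make the sequence above concrete, here is a minimal sketch in Python. It is a toy model, not the real AWS Batch or Step Functions API: `JobDefinitionRegistry`, `InactiveJobDefinitionError`, and `run_scenario` are hypothetical names, and it assumes (as described above) that an in-flight execution keeps the definition it started with.

```python
from dataclasses import dataclass, field


class InactiveJobDefinitionError(Exception):
    """Raised when a job is submitted against a deregistered revision."""


@dataclass
class JobDefinitionRegistry:
    """Toy stand-in for AWS Batch job definitions (hypothetical, not the real API)."""

    # Maps job name -> {revision number: "ACTIVE" or "INACTIVE"}
    revisions: dict = field(default_factory=dict)

    def register(self, name):
        """Mark existing revisions INACTIVE and create the next one,
        mimicking the replace behaviour described in the post."""
        known = self.revisions.setdefault(name, {})
        for rev in known:
            known[rev] = "INACTIVE"
        new_rev = max(known, default=0) + 1
        known[new_rev] = "ACTIVE"
        return f"{name}:{new_rev}"

    def submit(self, job_ref):
        """Accept a 'name:revision' reference only if that revision is ACTIVE."""
        name, rev = job_ref.split(":")
        status = self.revisions.get(name, {}).get(int(rev))
        if status != "ACTIVE":
            raise InactiveJobDefinitionError(f"{job_ref} is {status}")
        return f"submitted {job_ref}"


def run_scenario():
    # Starting point from the post: A -> jobA:3, B -> jobB:4
    registry = JobDefinitionRegistry(
        {"jobA": {3: "ACTIVE"}, "jobB": {4: "ACTIVE"}}
    )
    # The in-flight execution holds on to the definition it started with.
    execution_definition = {"A": "jobA:3", "B": "jobB:4"}

    results = [registry.submit(execution_definition["A"])]

    # A deployment lands while state A is still running.
    new_ref = registry.register("jobB")

    # State A finishes; the execution now tries the revision it was pinned to.
    try:
        results.append(registry.submit(execution_definition["B"]))
    except InactiveJobDefinitionError as exc:
        results.append(f"failed: {exc}")
    return results, new_ref
```

In this model the failure comes purely from the execution holding a reference to a revision that the deployment deactivated; new executions started after the deployment would pick up jobB:5 and work.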

Is there a way of preventing this? I know that Terraform’s default behaviour for forcing a new resource is “delete old resource, create a new one”, but in the case of AWS Batch, where job definitions have revisions, it would be nice if there were a way to preserve the old revisions of a job without marking them as inactive.

outthought · October 8, 2019, 10:58am

Run terraform apply to change resources when you want them changed. This sounds like a glib answer, but how could Terraform choose behavior at a logical level above resources, the so-called orchestration logic?

Link512 · October 8, 2019, 12:08pm

I wasn’t saying it should be aware of logical-level resources. I was suggesting that, for resources that have revisions (like AWS Batch), it could have a flag to not automatically deregister a revision when “forcing” a new resource.

outthought · October 8, 2019, 12:59pm

All resources may still be in use according to a user’s assessment. For example, if I update the value of ami for an aws_instance, then run terraform apply, this change destroys any existing instance and creates a new one from the new image. Had important processes been running on the instance that I wished to keep running, then I should have waited to apply the change until I had finished with the process. A process running in an aws_sfn_state_machine is the same idea.

Terraform plan shows the changes to be applied to the resources to converge the state of those resources to the semantic descriptions expressed in the template. The change to the docker image, in this case, requires a new resource to be created in order to achieve convergence. Terraform reports this constraint in the output of terraform plan. The action of terraform apply converges the state of the resources as planned.
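
The "wait until the process has finished" approach could be automated as a pre-deploy check in the CI/CD pipeline. Below is a minimal sketch, assuming a client object that behaves like `boto3.client("stepfunctions")`; only its `list_executions` call is used. The function names `running_executions` and `safe_to_apply` are made up for illustration.

```python
def running_executions(sfn_client, state_machine_arn):
    """Return the ARNs of executions still RUNNING for one state machine.

    `sfn_client` is assumed to behave like boto3.client("stepfunctions");
    it is passed in so the check can be exercised without AWS access.
    """
    arns = []
    kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": "RUNNING"}
    while True:
        page = sfn_client.list_executions(**kwargs)
        arns.extend(e["executionArn"] for e in page.get("executions", []))
        token = page.get("nextToken")
        if not token:
            return arns
        # Follow pagination until no further token is returned.
        kwargs["nextToken"] = token


def safe_to_apply(sfn_client, state_machine_arn):
    """True when no execution is in flight, i.e. nothing is pinned to old revisions."""
    return not running_executions(sfn_client, state_machine_arn)
```

A pipeline step could call `safe_to_apply` and skip or delay terraform apply when it returns False. This only narrows the window: an execution could still start between the check and the apply, so the cron trigger would probably need to be paused as well.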
