How to prevent Batch downtime when updating the tags/image of a compute environment

Hi everyone,

I have an AWS infrastructure defined by Terraform. However, when the tags or image of its Batch compute environment get updated, Terraform replaces (i.e. destroys and recreates) the compute environment, which causes downtime in Batch. Details:

Context

  • An AWS infrastructure defined by Terraform.

  • The infrastructure includes a Batch compute environment and a job queue that is mapped to that compute environment.

  • The default tags are applied to both the compute environment and the job queue.

  • The image of the compute environment is set to a specific AMI defined in a tfvars.json file.
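
For reference, the setup described above might look roughly like the sketch below. All resource names, variable names (var.batch_ami_id, var.default_tags, and the IAM/networking variables), and sizing values are placeholders rather than the real configuration. The argument names assume an AWS provider version that still accepts compute_resources.image_id and the job queue's compute_environments list; newer provider versions may use different arguments.

```hcl
# Hypothetical variables; in the real setup the AMI ID comes from a tfvars.json file.
variable "batch_ami_id" {
  type = string
}

variable "default_tags" {
  type    = map(string)
  default = {}
}

variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }
variable "instance_profile_arn" { type = string }
variable "batch_service_role_arn" { type = string }

resource "aws_batch_compute_environment" "main" {
  compute_environment_name = "example-ce" # placeholder name
  type                     = "MANAGED"
  service_role             = var.batch_service_role_arn

  compute_resources {
    type               = "EC2"
    image_id           = var.batch_ami_id # a change here triggers the replacement described in this post
    instance_role      = var.instance_profile_arn
    instance_type      = ["optimal"]
    min_vcpus          = 0
    max_vcpus          = 16
    subnets            = var.subnet_ids
    security_group_ids = var.security_group_ids
  }

  tags = var.default_tags
}

resource "aws_batch_job_queue" "main" {
  name     = "example-queue" # placeholder name
  state    = "ENABLED"
  priority = 1

  # The queue references the compute environment, so the environment
  # cannot be destroyed while this mapping exists.
  compute_environments = [aws_batch_compute_environment.main.arn]

  tags = var.default_tags
}
```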

Problem

  • When the tags or AMI of the compute environment are updated, Terraform destroys and recreates the compute environment.

  • Before Terraform can destroy the compute environment, it has to be detached from the job queue. So today, whenever the compute environment needs to be replaced, we destroy the entire job queue before running terraform apply.

  • This causes Batch downtime, which is critical since there are scheduled jobs to be run in Batch.

Idea: Have two pairs, each consisting of a Batch compute environment and a job queue (blue-green deployment)

  • Description: During an upgrade of the infrastructure, have Terraform update only one pair while the other pair handles the scheduled Batch jobs.
  • Limitation: This requires the lifecycle.ignore_changes attribute of aws_batch_compute_environment to be set dynamically (i.e. when updating the blue pair, set lifecycle.ignore_changes of the green compute environment to all). However, lifecycle.ignore_changes only accepts static expressions; with anything else, Terraform fails with the error "A static list expression is required."
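
To make the limitation concrete, here is a trimmed-down sketch of the green compute environment. The variable var.active_pair and the resource name are hypothetical, and the other arguments are left out for brevity. The commented-out line is the kind of dynamic expression the blue-green idea would need; the hard-coded line is the form Terraform accepts.

```hcl
resource "aws_batch_compute_environment" "green" {
  # ... same arguments as the blue environment, omitted here ...

  lifecycle {
    # What the blue-green idea would need. Terraform rejects this because
    # ignore_changes only accepts a static list or the keyword all:
    #   ignore_changes = var.active_pair == "blue" ? all : []

    # What Terraform accepts: a hard-coded value, which has to be edited
    # by hand each time the pair being upgraded switches.
    ignore_changes = all
  }
}
```

So switching which pair is frozen would mean changing the configuration itself rather than just passing a different variable value.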

How can we prevent Batch downtime when the tags of the compute environment and job queue, or the AMI of the compute environment, need to be updated? Any suggestions or ideas would be appreciated. Thanks!

Stumbled upon this from Google. I'm facing the same issue, but I'm using CDK rather than Terraform. I'd be interested to hear how you solved this; otherwise, +1.